y0news

#llm-alignment News & Analysis

22 articles tagged with #llm-alignment. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Mar 16 · 7/10
🧠

Superficial Safety Alignment Hypothesis

Researchers propose the Superficial Safety Alignment Hypothesis (SSAH), suggesting that AI safety alignment in large language models can be understood as a binary classification task of fulfilling or refusing user requests. The study identifies four types of critical components at the neuron level that establish safety guardrails, enabling models to retain safety attributes while adapting to new tasks.
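The hypothesis frames safety alignment as a binary fulfil-or-refuse decision readable from internal activations. A toy illustration of that framing, assuming (purely for demonstration) that benign and harmful prompts separate along a single "safety direction" in activation space; all names and data here are synthetic, not the paper's components:

```python
import random

random.seed(0)
d = 8

# Assumed setup: activations for benign vs. harmful prompts differ along
# one "safety direction" (a toy stand-in for neuron-level safety components).
safety_dir = [1.0] + [0.0] * (d - 1)

def sample(shift):
    """Draw a synthetic activation vector shifted along the safety direction."""
    return [random.gauss(0.0, 1.0) + shift * s for s in safety_dir]

benign = [sample(+2.0) for _ in range(50)]   # should be fulfilled
harmful = [sample(-2.0) for _ in range(50)]  # should be refused

def refuses(activation, direction, threshold=0.0):
    """Binary safety decision: refuse when the projection onto the
    safety direction falls below a threshold."""
    proj = sum(a * s for a, s in zip(activation, direction))
    return proj < threshold

accuracy = (
    sum(not refuses(x, safety_dir) for x in benign)
    + sum(refuses(x, safety_dir) for x in harmful)
) / 100
```

Under this toy setup, a single linear read-out already separates the two classes well, which is the intuition behind calling the alignment "superficial".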

AI · Bullish · arXiv – CS AI · Mar 9 · 7/10
🧠

COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics

Researchers introduce COLD-Steer, a training-free framework that enables efficient control of large language model behavior at inference time using just a few examples. The method approximates gradient descent effects without parameter updates, achieving 95% steering effectiveness while using 50 times fewer samples than existing approaches.
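Inference-time steering from a handful of examples can be sketched with the related difference-of-means activation-steering baseline; this is a generic illustration, not COLD-Steer's actual one-step learning-dynamics approximation, and all vectors here are synthetic:

```python
import random

random.seed(1)
d = 4

# Hypothetical few-shot demonstrations: hidden states showing the desired
# behavior are shifted along the first dimension.
desired = [[random.gauss(0, 1) + (3.0 if i == 0 else 0.0) for i in range(d)]
           for _ in range(4)]
undesired = [[random.gauss(0, 1) for _ in range(d)] for _ in range(4)]

def mean(vectors):
    """Column-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Difference-of-means steering vector built from just a few examples,
# applied at inference time with no parameter updates.
steer = [a - b for a, b in zip(mean(desired), mean(undesired))]

def steered(hidden, alpha=1.0):
    """Add the scaled steering vector to a hidden state."""
    return [h + alpha * s for h, s in zip(hidden, steer)]

h = [0.0] * d
shift = steered(h)[0]
```

The appeal of this family of methods is exactly what the summary highlights: behavior control from a few samples, with zero gradient updates to the model's weights.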

AI · Bullish · arXiv – CS AI · Mar 6 · 6/10
🧠

VISA: Value Injection via Shielded Adaptation for Personalized LLM Alignment

Researchers propose VISA (Value Injection via Shielded Adaptation), a new framework for aligning Large Language Models with human values while avoiding the 'alignment tax' that causes knowledge drift and hallucinations. The system uses a closed-loop architecture with value detection, translation, and rewriting components, demonstrating superior performance over standard fine-tuning methods and GPT-4o in maintaining factual consistency.

🧠 GPT-4
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠

Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment

Researchers introduce Elo-Evolve, a new framework for training AI language models using dynamic multi-agent competition instead of static reward functions. The method achieves 4.5x noise reduction and demonstrates superior performance compared to traditional alignment approaches when tested on Qwen2.5-7B models.
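The framework ranks competing model agents with dynamic ratings rather than a static reward function. The standard Elo update it presumably builds on looks like this (a generic sketch of Elo, not the paper's exact scheme):

```python
def elo_update(r_a, r_b, score_a, k=32.0):
    """Standard Elo update. score_a is 1.0 for a win by A,
    0.5 for a draw, 0.0 for a loss; k controls update size."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# A lower-rated model beating a higher-rated one gains more points.
new_a, new_b = elo_update(1400.0, 1600.0, 1.0)
```

Because ratings are relative and zero-sum per match, the competitive signal adapts as the population of models evolves, which is what makes it an alternative to a fixed reward model.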

AI · Bullish · arXiv – CS AI · 2d ago · 6/10
🧠

CoSToM: Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models

Researchers introduce CoSToM, a framework that uses causal tracing and activation steering to improve Theory of Mind alignment in large language models. The work addresses a critical gap between LLMs' internal knowledge and external behavior, demonstrating that targeted interventions in specific neural layers can enhance social reasoning capabilities and dialogue quality.

AI · Bullish · arXiv – CS AI · 3d ago · 6/10
🧠

SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

Researchers introduce Sequence-Level PPO (SPPO), a new algorithm that improves how large language models are trained for reasoning tasks by addressing stability and computational efficiency issues in standard reinforcement learning approaches. SPPO matches the performance of resource-heavy methods while significantly reducing memory and computational costs, potentially accelerating LLM alignment for complex problem-solving.
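The core idea of moving PPO from token level to sequence level is to compute one importance ratio per response rather than one per token. A minimal sketch of that idea under standard PPO clipping; the function name and numbers are illustrative, not the paper's implementation:

```python
import math

def sequence_ppo_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped PPO surrogate computed once per sequence: the importance
    ratio uses the summed log-probabilities of the whole response."""
    ratio = math.exp(sum(logp_new) - sum(logp_old))
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    return -min(unclipped, clipped)  # minimize the negative surrogate

# Token log-probs for one sampled response under new and old policies.
logp_old = [-1.0, -2.0, -0.5]
logp_new = [-0.9, -1.9, -0.4]
loss = sequence_ppo_loss(logp_new, logp_old, advantage=1.0)
```

Summing log-probs before exponentiating means only one ratio and one clip per sequence, which is where the memory and compute savings for long-horizon reasoning traces would come from.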

AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠

Cards Against LLMs: Benchmarking Humor Alignment in Large Language Models

Researchers benchmarked five frontier LLMs against human players in Cards Against Humanity games, finding that while models exceed random baseline performance, their humor preferences align poorly with humans but strongly with each other. The findings suggest LLM humor judgment may reflect systematic biases and structural artifacts rather than genuine preference understanding.

AI · Bullish · arXiv – CS AI · Apr 7 · 6/10
🧠

APPA: Adaptive Preference Pluralistic Alignment for Fair Federated RLHF of LLMs

Researchers propose APPA, a new framework for aligning large language models with diverse human preferences in federated learning environments. The method dynamically reweights group-level rewards to improve fairness, achieving up to 28% better alignment for underperforming groups while maintaining overall model performance.
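Dynamic reweighting of group-level rewards can be sketched with a softmax over negated group scores, so lower-scoring groups get larger weights in the update. This is a generic fairness-reweighting pattern, not APPA's actual rule; the group names and values are made up:

```python
import math

def fairness_weights(group_rewards, temperature=1.0):
    """Give higher weight to groups with lower average reward so
    underperforming groups contribute more to the next update."""
    scores = [math.exp(-r / temperature) for r in group_rewards]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical average alignment rewards per preference group.
avg_rewards = {"group_a": 0.9, "group_b": 0.4, "group_c": 0.6}
weights = fairness_weights(list(avg_rewards.values()))
```

The temperature controls how aggressively the scheme favors lagging groups: high temperatures approach uniform weights, low temperatures concentrate weight on the worst-served group.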

๐Ÿข Meta๐Ÿง  Llama
AIBullisharXiv โ€“ CS AI ยท Apr 66/10
๐Ÿง 

Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks

Researchers propose Rubrics to Tokens (RTT), a novel reinforcement learning framework that improves Large Language Model alignment by bridging response-level and token-level rewards. The method addresses reward sparsity and ambiguity issues in instruction-following tasks through fine-grained credit assignment and demonstrates superior performance across different models.

AI · Bearish · arXiv – CS AI · Mar 17 · 6/10
🧠

Are Dilemmas and Conflicts in LLM Alignment Solvable? A View from Priority Graph

Researchers propose a priority graph model to understand conflicts in LLM alignment, revealing that unified stable alignment is challenging due to context-dependent inconsistencies. The study identifies 'priority hacking' as a vulnerability where adversaries can manipulate safety alignments, and suggests runtime verification mechanisms as a potential solution.
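Unstable alignment in a priority-graph view corresponds to the priority relation containing a cycle: no consistent ordering of rules can satisfy it. A generic cycle check over a hypothetical rule graph (the rule names and DFS implementation are illustrative, not the paper's model):

```python
def has_priority_cycle(edges):
    """Detect a cycle in a directed priority graph via DFS coloring.
    edges: dict mapping each rule to the rules it outranks."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in edges}

    def visit(node):
        color[node] = GRAY
        for nxt in edges.get(node, []):
            if color.get(nxt, WHITE) == GRAY:
                return True  # back edge: priorities form a cycle
            if color.get(nxt, WHITE) == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(edges))

# "helpfulness > honesty > safety > helpfulness" is unsatisfiable.
conflicted = has_priority_cycle({
    "helpfulness": ["honesty"],
    "honesty": ["safety"],
    "safety": ["helpfulness"],
})
consistent = has_priority_cycle({
    "safety": ["honesty"],
    "honesty": ["helpfulness"],
    "helpfulness": [],
})
```

Runtime verification of the kind the paper suggests would flag the conflicted case before an adversary can exploit the inconsistency.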

AI · Bullish · arXiv – CS AI · Mar 16 · 6/10
🧠

MetaKE: Meta-learning Aligned Knowledge Editing via Bi-level Optimization

Researchers propose MetaKE, a new framework for knowledge editing in Large Language Models that addresses the 'Semantic-Execution Disconnect' through bi-level optimization. The method treats edit targets as learnable parameters and uses a Structural Gradient Proxy to align edits with the model's feasible manifold, showing significant improvements over existing approaches.

AI · Bullish · arXiv – CS AI · Mar 12 · 6/10
🧠

Learning to Negotiate: Multi-Agent Deliberation for Collective Value Alignment in LLMs

Researchers propose a multi-agent negotiation framework for aligning large language models in scenarios involving conflicting stakeholder values. The approach uses two LLM instances with opposing personas engaging in structured dialogue to develop conflict resolution capabilities while maintaining collective agency alignment.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠

Co-Evolutionary Multi-Modal Alignment via Structured Adversarial Evolution

Researchers introduce CEMMA, a co-evolutionary framework for improving AI safety alignment in multimodal large language models. The system uses evolving adversarial attacks and adaptive defenses to create more robust AI systems that better resist jailbreak attempts while maintaining functionality.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠

Token-Importance Guided Direct Preference Optimization

Researchers propose Token-Importance Guided Direct Preference Optimization (TI-DPO), a new framework for aligning Large Language Models with human preferences. The method uses hybrid weighting mechanisms and triplet loss to achieve more accurate and robust AI alignment compared to existing Direct Preference Optimization approaches.
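Weighting tokens inside the DPO objective can be sketched as follows. The log-ratio margin of each token against a reference model is scaled by a per-token importance weight before entering the standard DPO sigmoid loss; the weighting values here are hypothetical and this is not TI-DPO's exact hybrid scheme or triplet term:

```python
import math

def weighted_dpo_loss(logp_chosen, logp_rejected,
                      ref_chosen, ref_rejected,
                      w_chosen, w_rejected, beta=0.1):
    """DPO loss with per-token importance weights.
    Standard DPO is the special case where every weight is 1.0."""
    margin_c = sum(w * (lp - rp)
                   for w, lp, rp in zip(w_chosen, logp_chosen, ref_chosen))
    margin_r = sum(w * (lp - rp)
                   for w, lp, rp in zip(w_rejected, logp_rejected, ref_rejected))
    logits = beta * (margin_c - margin_r)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log(sigmoid)

# Illustrative per-token log-probs; the second chosen token is
# weighted 2x to mark it as preference-critical.
loss = weighted_dpo_loss(
    logp_chosen=[-1.0, -0.5], ref_chosen=[-1.2, -0.7], w_chosen=[1.0, 2.0],
    logp_rejected=[-2.0, -1.5], ref_rejected=[-1.8, -1.3], w_rejected=[1.0, 1.0],
)
```

Upweighting the tokens that actually carry the preference signal is the intuition for why such a scheme could be more robust than treating all tokens uniformly.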

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠

Entropy-Guided Dynamic Tokens for Graph-LLM Alignment in Molecular Understanding

Researchers have developed EDT-Former, an Entropy-guided Dynamic Token Transformer that improves how Large Language Models understand molecular graphs. The system achieves state-of-the-art results on molecular understanding benchmarks while being computationally efficient by avoiding costly LLM backbone fine-tuning.

AI · Neutral · arXiv – CS AI · Mar 2 · 6/10
🧠

RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models

Researchers introduce RewardUQ, a unified framework for evaluating uncertainty quantification in reward models used to align large language models with human preferences. The study finds that model size and initialization have the most significant impact on performance, while providing an open-source Python package to advance the field.
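A common uncertainty-quantification baseline for reward models, of the kind such a framework would evaluate, is an ensemble: score a response with several reward heads and treat the spread as the uncertainty estimate. An illustrative sketch with made-up linear heads, not RewardUQ's actual API:

```python
import statistics

def ensemble_reward(reward_heads, response_features):
    """Score a response with an ensemble of reward heads; the spread
    across heads serves as an uncertainty estimate."""
    scores = [head(response_features) for head in reward_heads]
    return statistics.fmean(scores), statistics.pstdev(scores)

# Hypothetical linear reward heads with slightly different weights,
# e.g. from different random initializations.
heads = [
    lambda x: 0.8 * x[0] + 0.2 * x[1],
    lambda x: 0.7 * x[0] + 0.3 * x[1],
    lambda x: 0.9 * x[0] + 0.1 * x[1],
]
mean, std = ensemble_reward(heads, (1.0, 0.0))
```

The study's finding that initialization matters fits this picture: heads that disagree more produce larger uncertainty estimates on exactly the inputs where the reward signal is least trustworthy.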

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10
🧠

RLHFless: Serverless Computing for Efficient RLHF

Researchers introduce RLHFless, a serverless computing framework for Reinforcement Learning from Human Feedback (RLHF) that addresses resource inefficiencies in training large language models. The system achieves up to 1.35x speedup and 44.8% cost reduction compared to existing solutions by dynamically adapting to resource demands and optimizing workload distribution.

AI · Neutral · arXiv – CS AI · Mar 9 · 5/10
🧠

Evaluating LLM Alignment With Human Trust Models

Researchers analyzed how the GPT-J-6B language model internally represents and reasons about trust by comparing its embeddings to established human trust models. The study found that the AI's trust representation most closely aligns with the Castelfranchi socio-cognitive model, suggesting LLMs encode social concepts in meaningful ways.

AI · Neutral · arXiv – CS AI · Mar 3 · 5/10
🧠

Personalities at Play: Probing Alignment in AI Teammates

Researchers evaluated how AI language models can be aligned to express distinct personalities when functioning as teammates, testing GPT-4o, Claude, Gemini, and Grok across personality traits. The study found that AI personalities are measurable but context-dependent, with personality signals more detectable in long-term memory representations than in conversation alone.