22 articles tagged with #llm-alignment. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Neutral · arXiv › CS AI · Mar 16 · 7/10
🧠 Researchers propose the Superficial Safety Alignment Hypothesis (SSAH), suggesting that AI safety alignment in large language models can be understood as a binary classification task of fulfilling or refusing user requests. The study identifies four types of critical components at the neuron level that establish safety guardrails, enabling models to retain safety attributes while adapting to new tasks.
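If the hypothesis holds, a safety guardrail reduces to a separating direction in activation space. A minimal sketch of that framing, assuming access to last-token hidden states; the toy features and probe below are illustrative, not the paper's setup:

```python
# Hypothetical sketch of the SSAH framing: safety alignment as a binary
# fulfil-vs-refuse classifier over a model's hidden states. Names and
# shapes are invented stand-ins, not the paper's code.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-ins for last-token hidden states of benign vs. harmful prompts.
benign_feats = rng.normal(loc=0.5, size=(200, 64))
harmful_feats = rng.normal(loc=-0.5, size=(200, 64))

X = np.vstack([benign_feats, harmful_feats])
y = np.array([1] * 200 + [0] * 200)  # 1 = fulfil, 0 = refuse

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("fulfil/refuse probe accuracy:", probe.score(X, y))
```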
AI · Bullish · arXiv › CS AI · Mar 9 · 7/10
🧠 Researchers introduce COLD-Steer, a training-free framework that enables efficient control of large language model behavior at inference time using just a few examples. The method approximates gradient descent effects without parameter updates, achieving 95% steering effectiveness while using 50 times fewer samples than existing approaches.
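A common training-free way to realize this kind of few-shot behavior control is a difference-of-means steering vector added to hidden states at inference; the sketch below shows that general recipe on toy activations, not COLD-Steer's actual gradient-approximation method:

```python
# Minimal activation-steering sketch: derive a steering direction from a
# handful of paired examples and add it to hidden states at inference
# time, with no parameter updates. All names and data are illustrative.
import numpy as np

def steering_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference of mean activations between desired and undesired examples."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Shift a hidden state along the steering direction at inference time."""
    return hidden + alpha * direction

rng = np.random.default_rng(0)
pos = rng.normal(0.3, 1.0, size=(4, 128))   # e.g. 4 "desired" example activations
neg = rng.normal(-0.3, 1.0, size=(4, 128))  # e.g. 4 "undesired" example activations
v = steering_vector(pos, neg)
h = rng.normal(size=128)
print(np.dot(steer(h, v, alpha=2.0), v) > np.dot(h, v))  # shifted toward "desired"
```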
AI · Bullish · arXiv › CS AI · Mar 6 · 6/10
🧠 Researchers propose VISA (Value Injection via Shielded Adaptation), a new framework for aligning Large Language Models with human values while avoiding the 'alignment tax' that causes knowledge drift and hallucinations. The system uses a closed-loop architecture with value detection, translation, and rewriting components, demonstrating superior performance over standard fine-tuning methods and GPT-4o in maintaining factual consistency.
🧠 GPT-4
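The closed loop reads as a detect → translate → rewrite cycle. A structural sketch with stub stages; every function below is an invented placeholder, not one of VISA's components:

```python
# Pipeline-shaped sketch of a closed value-alignment loop: detect a value
# issue, translate it into an edit instruction, rewrite the draft, and
# repeat until the detector passes. All stages are toy stubs.
def detect(text: str) -> str | None:
    """Stub value detector; a real system would score the text with a model."""
    return "dismissive tone" if "whatever" in text else None

def translate(issue: str) -> str:
    """Stub: turn a detected issue into a rewriting instruction."""
    return f"rewrite to remove: {issue}"

def rewrite(text: str, instruction: str) -> str:
    """Stub rewriter; ignores the instruction and applies a fixed edit."""
    return text.replace("whatever", "understood")

draft = "whatever, here is the answer"
while (issue := detect(draft)) is not None:
    draft = rewrite(draft, translate(issue))
print(draft)
```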
AI · Bullish · arXiv › CS AI · Mar 4 · 7/10 · 4
🧠 Researchers present a new mathematical framework for training AI reward models using Likert scale preferences instead of simple binary comparisons. The approach uses ordinal regression to better capture nuanced human feedback, outperforming existing methods across chat, reasoning, and safety benchmarks.
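The textbook way to train on Likert labels is a cumulative-logit (proportional-odds) ordinal model, which is plausibly the shape the summary's "ordinal regression" takes. A toy negative log-likelihood with invented cutpoints:

```python
# Cumulative-logit ordinal regression over Likert labels. Thresholds and
# the scalar reward score are toy stand-ins, not the paper's values.
import numpy as np

def cumulative_logit_nll(score: float, label: int, thresholds: np.ndarray) -> float:
    """NLL of a Likert label under a cumulative-logit model.

    P(y <= k) = sigmoid(threshold_k - score); P(y = k) is the difference
    of adjacent cumulative probabilities.
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    cdf = np.concatenate(([0.0], sigmoid(thresholds - score), [1.0]))
    return float(-np.log(cdf[label + 1] - cdf[label] + 1e-12))

thresholds = np.array([-2.0, -0.5, 0.5, 2.0])  # 5 Likert levels -> 4 cutpoints
print(cumulative_logit_nll(score=1.2, label=3, thresholds=thresholds))
```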
AI · Bullish · arXiv › CS AI · Mar 3 · 7/10 · 5
🧠 Researchers introduce Elo-Evolve, a new framework for training AI language models using dynamic multi-agent competition instead of static reward functions. The method achieves 4.5x noise reduction and demonstrates superior performance compared to traditional alignment approaches when tested on Qwen2.5-7B models.
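Elo itself is standard; a minimal update rule of the kind a dynamic multi-agent competition could score matches with (Elo-Evolve's pairing and evolution logic is not shown here):

```python
# Standard Elo rating update after one head-to-head comparison between
# two agents; K-factor and ratings are illustrative.
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after a match; score_a is 1 (win), 0.5, or 0."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

print(elo_update(1500.0, 1600.0, score_a=1.0))  # upset win -> large gain
```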
AI · Bullish · arXiv › CS AI · 2d ago · 6/10
🧠 Researchers introduce CoSToM, a framework that uses causal tracing and activation steering to improve Theory of Mind alignment in large language models. The work addresses a critical gap between LLMs' internal knowledge and external behavior, demonstrating that targeted interventions in specific neural layers can enhance social reasoning capabilities and dialogue quality.
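Causal tracing typically patches one layer's clean activation into a corrupted forward pass and measures how much of the clean output returns. A toy two-layer version with a skip connection; the model and shapes are stand-ins, not CoSToM's procedure:

```python
# Illustrative single-step causal trace: patch a clean intermediate
# activation into a corrupted run and compare outputs.
import numpy as np

rng = np.random.default_rng(0)
W1, W2, W3 = (rng.normal(size=(8, 8)) for _ in range(3))

def forward(x, patch_h1=None):
    h1 = np.tanh(W1 @ x) if patch_h1 is None else patch_h1
    return W2 @ h1 + W3 @ x  # skip connection keeps the patch partial

clean_x, corrupt_x = rng.normal(size=8), rng.normal(size=8)
clean_out = forward(clean_x)
baseline_gap = np.linalg.norm(forward(corrupt_x) - clean_out)
patched_gap = np.linalg.norm(
    forward(corrupt_x, patch_h1=np.tanh(W1 @ clean_x)) - clean_out)
print(f"gap without patch: {baseline_gap:.2f}, with clean-layer patch: {patched_gap:.2f}")
```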
AI · Bullish · arXiv › CS AI · 3d ago · 6/10
🧠 Researchers introduce Sequence-Level PPO (SPPO), a new algorithm that improves how large language models are trained for reasoning tasks by addressing stability and computational efficiency issues in standard reinforcement learning approaches. SPPO matches the performance of resource-heavy methods while significantly reducing memory and computational costs, potentially accelerating LLM alignment for complex problem-solving.
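The distinguishing move in a sequence-level variant is one importance ratio per whole response rather than per token. A sketch of a sequence-level clipped surrogate, with illustrative numbers and hyperparameters, not SPPO's published objective:

```python
# PPO-style clipped surrogate with sequence-level (summed) log-probs:
# a single importance ratio and advantage per response.
import numpy as np

def seq_ppo_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate loss computed at the sequence level."""
    ratio = np.exp(np.sum(logp_new) - np.sum(logp_old))  # one ratio per sequence
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return -min(unclipped, clipped)

logp_old = np.array([-1.2, -0.8, -2.0])  # toy per-token log-probs
logp_new = np.array([-1.0, -0.7, -1.9])
print(seq_ppo_loss(logp_new, logp_old, advantage=0.5))
```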
AI · Neutral · arXiv › CS AI · 3d ago · 6/10
🧠 Researchers benchmarked five frontier LLMs against human players in Cards Against Humanity games, finding that while models exceed random baseline performance, their humor preferences align poorly with humans but strongly with each other. The findings suggest LLM humor judgment may reflect systematic biases and structural artifacts rather than genuine preference understanding.
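Agreement claims like this are usually quantified with a chance-corrected statistic such as Cohen's kappa. A toy illustration; the card choices are invented, and kappa is an assumption about the paper's exact metric:

```python
# Chance-corrected agreement between two "judges" picking the funniest
# card each round; toy data chosen so models agree with each other but
# not with the human.
from sklearn.metrics import cohen_kappa_score

model_a = [2, 0, 1, 3, 2, 1, 0, 2]  # card index chosen per round
model_b = [2, 0, 1, 3, 1, 1, 0, 2]
human   = [0, 3, 1, 2, 2, 0, 3, 1]

print("model-model kappa:", cohen_kappa_score(model_a, model_b))
print("model-human kappa:", cohen_kappa_score(model_a, human))
```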
AI · Bullish · arXiv › CS AI · Apr 7 · 6/10
🧠 Researchers propose APPA, a new framework for aligning large language models with diverse human preferences in federated learning environments. The method dynamically reweights group-level rewards to improve fairness, achieving up to 28% better alignment for underperforming groups while maintaining overall model performance.
🏢 Meta · 🧠 Llama
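One simple way to "dynamically reweight group-level rewards" is to upweight lagging groups via a softmax over negative rewards; the rule and temperature below are assumptions, not APPA's published update:

```python
# Fairness-motivated reweighting: groups whose mean alignment reward
# lags behind receive more weight in the next training round.
import numpy as np

def reweight(group_rewards: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Softmax over negative rewards: worse-served groups get more weight."""
    logits = -group_rewards / temperature
    w = np.exp(logits - logits.max())
    return w / w.sum()

rewards = np.array([0.9, 0.7, 0.4])  # per-group mean alignment reward
print(reweight(rewards))             # lowest-reward group gets the largest weight
```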
AI · Bullish · arXiv › CS AI · Apr 6 · 6/10
🧠 Researchers propose Rubrics to Tokens (RTT), a novel reinforcement learning framework that improves Large Language Model alignment by bridging response-level and token-level rewards. The method addresses reward sparsity and ambiguity issues in instruction-following tasks through fine-grained credit assignment and demonstrates superior performance across different models.
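Bridging response-level and token-level rewards amounts to allocating one scalar across tokens. A toy allocation by normalized per-token weight; the weights themselves, which RTT derives from rubrics, are invented here:

```python
# Spread a single response-level reward over tokens to get dense,
# fine-grained credit assignment.
import numpy as np

def token_rewards(response_reward: float, token_weights: np.ndarray) -> np.ndarray:
    """Allocate a scalar response reward across tokens by normalized weight."""
    w = token_weights / token_weights.sum()
    return response_reward * w

weights = np.array([0.1, 0.1, 0.6, 0.2])  # e.g. per-token salience
print(token_rewards(1.0, weights))         # dense per-token credit
```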
AI · Bearish · arXiv › CS AI · Mar 17 · 6/10
🧠 Researchers propose a priority graph model to understand conflicts in LLM alignment, revealing that unified stable alignment is challenging due to context-dependent inconsistencies. The study identifies 'priority hacking' as a vulnerability where adversaries can manipulate safety alignments, and suggests runtime verification mechanisms as a potential solution.
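The core observation is graph-theoretic: if context-dependent overrides form a cycle, no globally stable priority ordering exists. A minimal check with Python's standard graphlib; the rule names are invented:

```python
# Priority graph over alignment rules: an edge means one rule overrides
# another in some context. A cycle means no consistent global ordering.
from graphlib import TopologicalSorter, CycleError

priorities = {
    "obey_user": {"be_helpful"},  # obey_user sits above be_helpful
    "be_helpful": {"be_safe"},    # context A: helpfulness over safety
    "be_safe": {"obey_user"},     # context B: safety over obedience
}

try:
    order = list(TopologicalSorter(priorities).static_order())
    print("stable priority order:", order)
except CycleError as e:
    print("no stable alignment: priority cycle", e.args[1])
```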
AI · Bullish · arXiv › CS AI · Mar 16 · 6/10
🧠 Researchers propose MetaKE, a new framework for knowledge editing in Large Language Models that addresses the 'Semantic-Execution Disconnect' through bi-level optimization. The method treats edit targets as learnable parameters and uses a Structural Gradient Proxy to align edits with the model's feasible manifold, showing significant improvements over existing approaches.
AI · Bullish · arXiv › CS AI · Mar 12 · 6/10
🧠 Researchers propose a multi-agent negotiation framework for aligning large language models in scenarios involving conflicting stakeholder values. The approach uses two LLM instances with opposing personas engaging in structured dialogue to develop conflict resolution capabilities while maintaining collective agency alignment.
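Structurally this is a turn-taking loop between two persona-conditioned models. A skeleton with a stubbed model call; nothing below is the paper's implementation:

```python
# Two-persona negotiation loop. `respond` is a stand-in for an LLM call;
# a real system would prompt a model with the persona and transcript.
def respond(persona: str, transcript: list[str]) -> str:
    """Stub for one persona's turn in the structured dialogue."""
    return f"[{persona}] proposal after {len(transcript)} turns"

def negotiate(persona_a: str, persona_b: str, rounds: int = 3) -> list[str]:
    transcript: list[str] = []
    for _ in range(rounds):
        transcript.append(respond(persona_a, transcript))
        transcript.append(respond(persona_b, transcript))
    return transcript

for turn in negotiate("privacy-advocate", "growth-advocate"):
    print(turn)
```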
AI · Bullish · arXiv › CS AI · Mar 3 · 6/10 · 5
🧠 Researchers introduce CEMMA, a co-evolutionary framework for improving AI safety alignment in multimodal large language models. The system uses evolving adversarial attacks and adaptive defenses to create more robust AI systems that better resist jailbreak attempts while maintaining functionality.
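The co-evolution is an alternating loop: attacks mutate past the current defense, and the defense adapts to the survivors. A toy string-level skeleton; the real system evolves multimodal jailbreaks, not strings:

```python
# Skeletal attack/defense co-evolution loop with toy populations.
import random

random.seed(0)
attacks = ["ignore rules", "roleplay as x", "base64 payload"]
blocked = {"ignore rules"}

for generation in range(3):
    # Attacker: keep and mutate the attacks that still get through.
    survivors = [a for a in attacks if a not in blocked]
    attacks = survivors + [a + " (v2)" for a in survivors]
    # Defender: learn to block one observed surviving attack.
    if survivors:
        blocked.add(random.choice(survivors))
    print(f"gen {generation}: {len(survivors)} attacks evade, {len(blocked)} rules learned")
```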
AI · Bullish · arXiv › CS AI · Mar 3 · 6/10 · 3
🧠 Researchers propose Token-Importance Guided Direct Preference Optimization (TI-DPO), a new framework for aligning Large Language Models with human preferences. The method uses hybrid weighting mechanisms and triplet loss to achieve more accurate and robust AI alignment compared to existing Direct Preference Optimization approaches.
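The named ingredient is per-token importance weighting inside a DPO-style preference loss. A sketch with the reference-model log-ratios folded into the log-probs for brevity and the paper's triplet term omitted; weights and beta are illustrative:

```python
# Token-importance-weighted DPO-style loss:
# -log sigmoid(beta * (weighted chosen logp - weighted rejected logp)).
import numpy as np

def weighted_dpo_loss(logp_chosen, logp_rejected, weights, beta=0.1):
    """Preference loss where each token's log-prob is importance-weighted."""
    margin = beta * (np.dot(weights, logp_chosen) - np.dot(weights, logp_rejected))
    return float(np.log1p(np.exp(-margin)))  # numerically stable -log sigmoid

w = np.array([0.5, 1.5, 1.0])  # per-token importance
print(weighted_dpo_loss(np.array([-1.0, -0.5, -0.9]),
                        np.array([-1.2, -1.5, -1.0]), w))
```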
AI · Bullish · arXiv › CS AI · Mar 3 · 6/10 · 3
🧠 Researchers have developed EDT-Former, an Entropy-guided Dynamic Token Transformer that improves how Large Language Models understand molecular graphs. The system achieves state-of-the-art results on molecular understanding benchmarks while being computationally efficient by avoiding costly LLM backbone fine-tuning.
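"Entropy-guided" suggests ranking tokens by the entropy of some distribution and keeping the most informative ones; the criterion below (lowest attention entropy) is an assumption for illustration, not EDT-Former's rule:

```python
# Entropy-based dynamic token selection over toy attention rows.
import numpy as np

def entropy(p: np.ndarray) -> float:
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def select_tokens(attn: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k tokens with the lowest attention entropy."""
    scores = np.array([entropy(row) for row in attn])
    return np.argsort(scores)[:k]

rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(16), size=10)  # 10 tokens, 16 attention targets
print(select_tokens(attn, k=4))
```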
AI · Neutral · arXiv › CS AI · Mar 2 · 6/10 · 10
🧠 Researchers introduce RewardUQ, a unified framework for evaluating uncertainty quantification in reward models used to align large language models with human preferences. The study finds that model size and initialization have the most significant impact on performance, while providing an open-source Python package to advance the field.
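A standard uncertainty-quantification baseline such a benchmark would cover is ensemble disagreement. A toy version with simulated reward heads; RewardUQ's actual estimators and package API are not shown:

```python
# Ensemble disagreement as reward-model uncertainty: score one response
# with several heads and report mean and spread.
import numpy as np

def ensemble_uncertainty(scores: np.ndarray) -> tuple[float, float]:
    """Mean reward and ensemble standard deviation for one response."""
    return float(scores.mean()), float(scores.std())

rng = np.random.default_rng(0)
scores = rng.normal(loc=0.8, scale=0.15, size=8)  # 8 simulated reward heads
mean, std = ensemble_uncertainty(scores)
print(f"reward = {mean:.3f} ± {std:.3f}")
```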
AI · Bullish · arXiv › CS AI · Mar 2 · 7/10 · 14
🧠 Researchers propose MetaAPO, a new framework for aligning large language models with human preferences that dynamically balances online and offline training data. The method uses a meta-learner to evaluate when on-policy sampling is beneficial, resulting in better performance while reducing online annotation costs by 42%.
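The gating idea reduces to: estimate, per prompt, the benefit of fresh on-policy samples and only pay for annotation above a threshold. A deliberately minimal sketch; the "meta-learner" here is just a vector of invented gain estimates:

```python
# Route prompts to costly online sampling only when the estimated
# benefit clears a threshold; otherwise reuse offline data.
import numpy as np

def should_sample_online(gain_estimate: float, threshold: float = 0.5) -> bool:
    """Gate a prompt into online annotation when estimated gain is high."""
    return gain_estimate > threshold

gains = np.array([0.1, 0.7, 0.4, 0.9])  # stand-in meta-learner outputs per prompt
online = [i for i, g in enumerate(gains) if should_sample_online(g)]
print("prompts sent for online annotation:", online)  # cheaper than sampling all
```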
AI · Bullish · arXiv › CS AI · Feb 27 · 6/10 · 6
🧠 Researchers introduce RLHFless, a serverless computing framework for Reinforcement Learning from Human Feedback (RLHF) that addresses resource inefficiencies in training large language models. The system achieves up to 1.35x speedup and 44.8% cost reduction compared to existing solutions by dynamically adapting to resource demands and optimizing workload distribution.
AI · Neutral · arXiv › CS AI · Mar 9 · 5/10
🧠 Researchers analyzed how the GPT-J-6B language model internally represents and reasons about trust by comparing its embeddings to established human trust models. The study found that the AI's trust representation most closely aligns with the Castelfranchi socio-cognitive model, suggesting LLMs encode social concepts in meaningful ways.
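Comparisons like this are often done by correlating the pairwise similarities of a model's concept embeddings with the similarity structure a human theory predicts (representational similarity analysis). A toy version; the vectors, concept list, and target similarities are all invented:

```python
# Correlate model-embedding similarity structure against a (toy) target
# structure from a human trust theory.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
concepts = ["trust", "competence", "willingness", "reliance"]
emb = {c: rng.normal(size=32) for c in concepts}

# Upper-triangle similarities from the model's embeddings...
pairs = [(a, b) for i, a in enumerate(concepts) for b in concepts[i + 1:]]
model_sims = np.array([cosine(emb[a], emb[b]) for a, b in pairs])
# ...against stand-in theory-predicted similarities for the same pairs.
human_sims = rng.uniform(size=len(pairs))
print("alignment (Pearson r):", np.corrcoef(model_sims, human_sims)[0, 1])
```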
AI · Bearish · arXiv › CS AI · Mar 4 · 4/10 · 2
🧠 This is a satirical academic paper that critiques AI pluralistic alignment research by using the absurd metaphor of 'mulching' humans into nutrient slurry. The authors parody current AI ethics frameworks to highlight how technical approaches to value alignment can potentially enable harmful systems.
AI · Neutral · arXiv › CS AI · Mar 3 · 5/10 · 5
🧠 Researchers evaluated how AI language models can be aligned to express distinct personalities when functioning as teammates, testing GPT-4o, Claude, Gemini, and Grok models across personality traits. The study found that AI personalities are measurable but context-dependent, with personality signals more detectable in long-term memory representations than in conversation alone.