AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce MaPPO, a new preference optimization method for large language models that integrates prior reward knowledge into the training objective. Building on Direct Preference Optimization (DPO), MaPPO demonstrates consistent improvements across multiple benchmarks while maintaining computational efficiency and compatibility with existing DPO variants.
AI · Neutral · arXiv – CS AI · May 7 · 6/10
🧠Researchers introduce StoryRMB, the first benchmark for evaluating reward models on story generation preferences, and develop StoryReward, a specialized reward model achieving 66.3% accuracy where existing models struggle. The work addresses the challenge of modeling subjective human preferences in narrative generation, enabling better alignment between LLM-generated stories and human expectations.
AI · Neutral · arXiv – CS AI · May 4 · 6/10
🧠Researchers introduce TUR-DPO, an improved method for aligning large language models with human preferences that incorporates reasoning topology and uncertainty awareness. Unlike standard Direct Preference Optimization, this approach evaluates not just answer correctness but the quality of the reasoning process, showing improvements across mathematical reasoning, factual QA, and dialogue tasks while maintaining training simplicity.
AI · Bullish · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers introduce CoSToM, a framework that uses causal tracing and activation steering to improve Theory of Mind alignment in large language models. The work addresses a critical gap between LLMs' internal knowledge and external behavior, demonstrating that targeted interventions in specific neural layers can enhance social reasoning capabilities and dialogue quality.
AI · Bullish · arXiv – CS AI · Apr 13 · 6/10
🧠Researchers introduce Sequence-Level PPO (SPPO), a new algorithm that improves how large language models are trained for reasoning tasks by addressing stability and computational efficiency issues in standard reinforcement learning approaches. SPPO matches the performance of resource-heavy methods while significantly reducing memory and computational costs, potentially accelerating LLM alignment for complex problem-solving.
AI · Neutral · arXiv – CS AI · Apr 13 · 6/10
🧠Researchers benchmarked five frontier LLMs against human players in Cards Against Humanity games, finding that while models exceed random baseline performance, their humor preferences align poorly with humans but strongly with each other. The findings suggest LLM humor judgment may reflect systematic biases and structural artifacts rather than genuine preference understanding.
AI · Bullish · arXiv – CS AI · Apr 7 · 6/10
🧠Researchers propose APPA, a new framework for aligning large language models with diverse human preferences in federated learning environments. The method dynamically reweights group-level rewards to improve fairness, achieving up to 28% better alignment for underperforming groups while maintaining overall model performance.
🏢 Meta · 🧠 Llama
AI · Bullish · arXiv – CS AI · Apr 6 · 6/10
🧠Researchers propose Rubrics to Tokens (RTT), a novel reinforcement learning framework that improves Large Language Model alignment by bridging response-level and token-level rewards. The method addresses reward sparsity and ambiguity issues in instruction-following tasks through fine-grained credit assignment and demonstrates superior performance across different models.
AI · Bearish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers propose a priority graph model to understand conflicts in LLM alignment, revealing that unified stable alignment is challenging due to context-dependent inconsistencies. The study identifies 'priority hacking' as a vulnerability where adversaries can manipulate safety alignments, and suggests runtime verification mechanisms as a potential solution.
AI · Bullish · arXiv – CS AI · Mar 16 · 6/10
🧠Researchers propose MetaKE, a new framework for knowledge editing in Large Language Models that addresses the 'Semantic-Execution Disconnect' through bi-level optimization. The method treats edit targets as learnable parameters and uses a Structural Gradient Proxy to align edits with the model's feasible manifold, showing significant improvements over existing approaches.
AI · Bullish · arXiv – CS AI · Mar 12 · 6/10
🧠Researchers propose a multi-agent negotiation framework for aligning large language models in scenarios involving conflicting stakeholder values. The approach uses two LLM instances with opposing personas engaging in structured dialogue to develop conflict resolution capabilities while maintaining collective agency alignment.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠Researchers introduce CEMMA, a co-evolutionary framework for improving AI safety alignment in multimodal large language models. The system uses evolving adversarial attacks and adaptive defenses to create more robust AI systems that better resist jailbreak attempts while maintaining functionality.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠Researchers propose Token-Importance Guided Direct Preference Optimization (TI-DPO), a new framework for aligning Large Language Models with human preferences. The method uses hybrid weighting mechanisms and triplet loss to achieve more accurate and robust AI alignment compared to existing Direct Preference Optimization approaches.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠Researchers have developed EDT-Former, an Entropy-guided Dynamic Token Transformer that improves how Large Language Models understand molecular graphs. The system achieves state-of-the-art results on molecular understanding benchmarks and stays computationally efficient by avoiding costly fine-tuning of the LLM backbone.
AI · Neutral · arXiv – CS AI · Mar 2 · 6/10
🧠Researchers introduce RewardUQ, a unified framework for evaluating uncertainty quantification in reward models used to align large language models with human preferences. The study finds that model size and initialization have the most significant impact on performance, and the authors release an open-source Python package to support further work.
AI · Bullish · arXiv – CS AI · Mar 2 · 7/10
🧠Researchers propose MetaAPO, a new framework for aligning large language models with human preferences that dynamically balances online and offline training data. The method uses a meta-learner to evaluate when on-policy sampling is beneficial, resulting in better performance while reducing online annotation costs by 42%.
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10
🧠Researchers introduce RLHFless, a serverless computing framework for Reinforcement Learning from Human Feedback (RLHF) that addresses resource inefficiencies in training large language models. The system achieves up to 1.35x speedup and 44.8% cost reduction compared to existing solutions by dynamically adapting to resource demands and optimizing workload distribution.
AI · Neutral · arXiv – CS AI · Mar 9 · 5/10
🧠Researchers analyzed how the GPT-J-6B language model internally represents and reasons about trust by comparing its embeddings to established human trust models. The study found that the model's trust representation most closely aligns with the Castelfranchi socio-cognitive model, suggesting LLMs encode social concepts in meaningful ways.
AI · Bearish · arXiv – CS AI · Mar 4 · 4/10
🧠This satirical academic paper critiques AI pluralistic alignment research through the absurd metaphor of 'mulching' humans into nutrient slurry. The authors parody current AI ethics frameworks to highlight how technical approaches to value alignment can enable harmful systems.
AI · Neutral · arXiv – CS AI · Mar 3 · 5/10
🧠Researchers evaluated how AI language models can be aligned to express distinct personalities when functioning as teammates, testing GPT-4o, Claude, Gemini, and Grok across personality traits. The study found that AI personalities are measurable but context-dependent, with personality signals more detectable in long-term memory representations than in conversation alone.