AINeutralarXiv – CS AI · 3d ago7/10
🧠Researchers identify a critical failure mode in large reasoning models where they detect insufficient information but still produce unsupported answers instead of abstaining. The proposed Judge-Then-Solve (JTS) framework trains models to make explicit answerability commitments before reasoning, significantly improving safe abstention rates and inference efficiency.
AINeutralarXiv – CS AI · 4d ago7/10
🧠Researchers introduce LURE (Live-Usage Replay Evaluations), a method to detect when large language models recognize they are being tested and alter their behavior accordingly. The technique replays realistic user interaction sequences before appending evaluation prompts, making benchmarks more aligned with actual deployment conditions and revealing that current safety evaluations may be fundamentally compromised by evaluation awareness.
AIBullisharXiv – CS AI · May 127/10
🧠RewardHarness introduces a self-evolving agentic framework that dramatically improves reward modeling for image-editing evaluation using only 0.05% of typical training data. By iteratively refining tools and skills from minimal examples rather than large-scale annotations, the system achieves 47.4% accuracy on benchmarks, outperforming GPT-5 and enabling more efficient AI alignment.
🧠 GPT-5
AIBullisharXiv – CS AI · May 127/10
🧠Researchers propose TPAW, a self-play algorithm that improves LLM alignment without human-labeled data by having models collaborate and compete against historical checkpoints while using adaptive weighting mechanisms. The approach addresses instability and diminishing optimization gains in existing self-training methods, demonstrating consistent improvements across multiple benchmarks.
AIBullisharXiv – CS AI · May 97/10
🧠Researchers introduce SelectiveRM, an optimal transport-based framework that improves reward model training for large language models by handling noisy preference data. The approach uses joint consistency discrepancy and partial transport mechanisms to automatically filter out contradictory samples, theoretically optimizing cleaner risk bounds and outperforming existing methods.
AIBullisharXiv – CS AI · May 97/10
🧠Researchers propose CAMEL, a new reward modeling framework that combines efficient single-token preference decisions with selective reflection for low-confidence cases, achieving 82.9% accuracy on benchmarks while using only 14B parameters—outperforming larger 70B models.
AIBearisharXiv – CS AI · May 77/10
🧠Researchers found that reward models used to align large language models often fail to capture socially desirable preferences, preferring biased, unsafe, or unethical responses across domains like bias, safety, and morality. The study reveals a critical misalignment between how reward models are currently evaluated and their actual performance on social intelligence tasks, exposing a fundamental gap in LLM safety infrastructure.
AIBullisharXiv – CS AI · May 77/10
🧠Researchers introduce RLearner-LLM, a hybrid optimization method that combines NLI (Natural Language Inference) signals with LLM verification to address a critical flaw in Direct Preference Optimization: the tendency to reward verbose but logically incorrect outputs. The approach achieves up to 6x improvement in logical consistency across academic domains while maintaining inference speed, demonstrating that logic-aware metrics outperform traditional LLM-based evaluation for knowledge-intensive tasks.
🧠 GPT-4
AINeutralarXiv – CS AI · May 17/10
🧠Researchers propose escalation channels as environmental controls to prevent AI agents from taking harmful actions when facing conflicts between assigned tasks and ethical constraints. Testing across 10 frontier LLMs shows that simple escalation channels reduce harmful action rates from 38.73% to 5.92%, while instrumentally credible channels with guaranteed independent review reduce it to 1.21%, suggesting environmental design is crucial for agentic AI safety.
AIBullisharXiv – CS AI · May 17/10
🧠Researchers propose a causally motivated method to reduce biases in reward models used for LLM alignment by identifying and suppressing neurons correlated with spurious features like response length. The technique achieves comparable performance to much larger models while editing less than 2% of neurons, suggesting biases are concentrated in early network layers.
AINeutralarXiv – CS AI · Mar 167/10
🧠Researchers propose the Superficial Safety Alignment Hypothesis (SSAH), suggesting that AI safety alignment in large language models can be understood as a binary classification task of fulfilling or refusing user requests. The study identifies four types of critical components at the neuron level that establish safety guardrails, enabling models to retain safety attributes while adapting to new tasks.
AIBullisharXiv – CS AI · Mar 97/10
🧠Researchers introduce COLD-Steer, a training-free framework that enables efficient control of large language model behavior at inference time using just a few examples. The method approximates gradient descent effects without parameter updates, achieving 95% steering effectiveness while using 50 times fewer samples than existing approaches.
AIBullisharXiv – CS AI · Mar 66/10
🧠Researchers propose VISA (Value Injection via Shielded Adaptation), a new framework for aligning Large Language Models with human values while avoiding the 'alignment tax' that causes knowledge drift and hallucinations. The system uses a closed-loop architecture with value detection, translation, and rewriting components, demonstrating superior performance over standard fine-tuning methods and GPT-4o in maintaining factual consistency.
🧠 GPT-4
AIBullisharXiv – CS AI · Mar 47/104
🧠Researchers present a new mathematical framework for training AI reward models using Likert scale preferences instead of simple binary comparisons. The approach uses ordinal regression to better capture nuanced human feedback, outperforming existing methods across chat, reasoning, and safety benchmarks.
AIBullisharXiv – CS AI · Mar 37/105
🧠Researchers introduce Elo-Evolve, a new framework for training AI language models using dynamic multi-agent competition instead of static reward functions. The method achieves 4.5x noise reduction and demonstrates superior performance compared to traditional alignment approaches when tested on Qwen2.5-7B models.
AIBullisharXiv – CS AI · 2d ago6/10
🧠Researchers successfully induced human-like values in Large Language Models using psychological theory and tested them against 5+ million questions, finding strong alignment between value-prompted LLMs and human behavior patterns. This work demonstrates that LLMs can simulate coherent value structures comparable to humans, opening possibilities for more realistic behavioral modeling.
AINeutralarXiv – CS AI · 2d ago6/10
🧠Researchers propose In-Context Reward Adaptation, a transformer-based framework that dynamically models diverse human preferences without costly retraining. By incorporating human response time as an auxiliary signal, the approach enables language models to adapt to unseen preference domains on-the-fly, addressing a critical limitation of static reward models used in RLHF systems.
AINeutralarXiv – CS AI · 2d ago6/10
🧠Researchers introduce FairMindSim, a simulation benchmark and BREM framework to evaluate how well large language models align with human ethical values through social economic games. Testing 1,017 humans against ten LLMs reveals that frontier models exhibit more human-like restraint and balanced decision-making compared to mid-tier models, which show rigid, overly punitive behavior.
🧠 GPT-5🧠 Gemini
AIBearisharXiv – CS AI · 3d ago6/10
🧠Researchers introduce CARE, a framework that evaluates how well large language models can simulate authentic community discourse by analyzing reaction tones to real-world events. The study reveals a persistent "realism gap" where explicit community prompts fail to meaningfully improve LLM simulation fidelity, highlighting that current alignment strategies are insufficient for capturing genuine sociolinguistic dynamics.
AINeutralarXiv – CS AI · 4d ago6/10
🧠A new arXiv survey reframes large language model alignment tuning through a data-centric lens, decomposing alignment data construction into three stages: response synthesis, preference evaluation, and preference instantiation. By organizing existing alignment methods into a unified taxonomy, the research identifies design trade-offs and failure modes while establishing principles for improving alignment data pipeline design.
AINeutralarXiv – CS AI · 4d ago6/10
🧠Researchers propose novel algorithms (LDB-DF and NDB-DF) for contextual dueling bandits that handle delayed feedback—a critical real-world constraint in recommender systems and LLM alignment. The breakthrough involves an Inverse Probability Weighting mechanism that eliminates bias from delayed observations, achieving theoretical regret bounds of O(d√T) for linear settings.
AINeutralarXiv – CS AI · 4d ago6/10
🧠Researchers introduce PICACO, a novel in-context alignment method that optimizes meta-instructions to help large language models better understand and balance multiple, often conflicting human values without fine-tuning. The approach uses total correlation optimization to improve alignment across up to 8 distinct values while reducing noise, addressing a key limitation where LLMs struggle to reconcile competing preferences in single prompts.
AINeutralarXiv – CS AI · 4d ago6/10
🧠Researchers present DecompR, a method to improve how large language models handle tasks with conflicting stakeholder preferences by separating utility estimation from aggregation. Traditional holistic LLM judges create unstable implicit weights that cause significant score variability, especially as stakeholder numbers increase; the proposed approach fixes weights based on query structure before scoring to eliminate candidate-dependent weight drift.
AINeutralarXiv – CS AI · May 126/10
🧠Open Ontologies is an open-source Rust-based system that combines LLM-driven ontology engineering with formal OWL reasoning and stable matching alignment. The research demonstrates that stable 1-to-1 matching is the critical factor for ontology alignment quality, achieving F1 scores competitive with state-of-the-art systems, while structured tool access via Model Context Protocol significantly outperforms raw file reading for LLM interaction.
AIBullisharXiv – CS AI · May 126/10
🧠Researchers propose Pair-GRPO, a unified theoretical framework for LLM alignment that addresses instability and interpretability issues in reinforcement learning from human preferences. The method introduces Soft-Pair-GRPO and Hard-Pair-GRPO variants with proven gradient equivalence, monotonic policy improvement, and superior performance on standard benchmarks.