AI · Bullish · arXiv – CS AI · 3d ago · 6/10
🧠SkillGrad introduces a gradient-descent-inspired framework for automatically optimizing LLM agent skills, treating skill packages as parameters to be refined through task execution feedback and systematic diagnosis. The method outperforms existing training-based approaches by 6.7 percentage points on benchmark tasks, demonstrating measurable improvements in agent reliability and capability.
AI · Bearish · arXiv – CS AI · 3d ago · 6/10
🧠A comprehensive study reveals that multimodal large language models exhibit significant hallucination problems in agricultural imaging tasks, with image interpretation achieving only 63-75% zero-shot accuracy and text-to-image generation producing up to 91% biologically inconsistent scenes. These findings highlight critical reliability gaps that could undermine the trustworthiness of AI-driven agricultural platforms.
🧠 GPT-5 · 🧠 Gemini
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers propose Calibrated Entropy Score (CES), a novel method for detecting hallucinations in large language models using entropy distribution patterns from a single forward pass. The technique achieves performance comparable to computationally expensive multi-sample methods while requiring only black-box access to token logits, with formal mathematical guarantees for detection accuracy.
🏢 Perplexity
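As a rough illustration of the raw signal an entropy-based detector starts from, the sketch below computes per-token Shannon entropy from logits and averages it over a sequence. It is not the CES method itself: the calibration step and the formal guarantees described in the summary are not reproduced, and the function names `token_entropy` and `mean_sequence_entropy` are hypothetical.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one token's logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_sequence_entropy(per_token_logits):
    """Average token entropy over a generated sequence (hypothetical uncertainty score)."""
    entropies = [token_entropy(logits) for logits in per_token_logits]
    return sum(entropies) / len(entropies)
```

A flat distribution over the vocabulary yields high entropy and a sharply peaked one yields entropy near zero. A detector of this kind would still need a threshold or calibration step on top of the score.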
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers introduce CiteCheck, a hybrid framework that detects when large language models fabricate or corrupt scientific citations by combining scholarly database retrieval with structured LLM verification. The system achieves 88.7% macro-F1 on a new 982-citation physics benchmark, outperforming GPT, Claude, and Gemini, addressing a critical reliability problem as LLMs become integrated into scientific research workflows.
🧠 Claude · 🧠 Gemini
AI · Bearish · TechCrunch – AI · 4d ago · 6/10
🧠Google's AI systems have proven unable to reliably spell basic words, including "Google" itself, exposing fundamental limitations in current large language models despite their apparent sophistication. The incident highlights ongoing challenges in AI reliability and raises questions about the robustness of AI systems being deployed at scale.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers introduce Anchor, a task-generation pipeline that addresses 'artifact drift' in AI agent benchmarking by automatically creating consistent instructions, environments, solutions, and verifiers from formal specifications. The team releases ERP-Bench, a 300-task benchmark for enterprise workflows, finding frontier AI models solve only 17.4% of tasks optimally despite meeting explicit constraints 26.1% of the time.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠A new study comparing three LLM approaches to mathematical reasoning found that pure chain-of-thought prompting outperforms code execution methods in robustness across problem variations. When math problems were modified with simple changes like different names or numbers, code-based approaches showed greater accuracy drops, challenging the assumption that code execution improves reasoning reliability.
🧠 Claude · 🧠 Haiku
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers introduce MemFail, a diagnostic benchmark for testing failure modes in LLM memory systems by isolating three core operations: summarization, storage, and retrieval. The benchmark evaluates state-of-the-art memory systems across five adversarially-designed datasets to empirically understand architectural tradeoffs, moving beyond aggregate accuracy metrics.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers propose an algorithm that uses large language models to generate portfolios of optimization models rather than single outputs, addressing the reliability gap in LLM-generated solutions. The method leverages LLMs in dual roles—as generative and evaluative components—with theoretical guarantees that high-quality candidates appear in the portfolio as long as either role aligns with human preferences.
$MKR
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers introduce MiRD, a two-stage framework that improves reliable prediction for open-ended question answering by separately addressing sampling failures and selection errors. The approach maintains calibration-set integrity while controlling hallucinations in AI models, outperforming existing conformal prediction methods across multiple datasets and models.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers introduce Context-Driven Decomposition (CDD), a diagnostic tool that reveals how retrieval-augmented generation (RAG) systems blindly follow retrieved context even when it contradicts their underlying knowledge. Testing across multiple AI models shows CDD can improve accuracy to 64% on adversarial scenarios, though improvements don't consistently transfer across different model families, suggesting RAG systems resolve conflicts through fundamentally different mechanisms.
🧠 Claude · 🧠 Gemini
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers present a rigorous statistical framework for measuring AI agent reliability through U-statistics and kernel-based metrics, moving beyond traditional pass@1 evaluation methods. The study reveals that agents can possess requisite knowledge yet fail catastrophically under minor task variations, with trajectory-level consistency metrics providing significantly better diagnostic sensitivity for identifying failure modes in high-stakes deployments.
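To make the contrast with pass@1 concrete, here is a minimal sketch of an order-2 U-statistic: the mean of a kernel over all unordered pairs of repeated runs. The equality kernel and the names `pairwise_agreement` and `pass_at_1` are assumptions chosen for illustration. The study's actual kernels and trajectory-level metrics are not given in the summary.

```python
from itertools import combinations

def pairwise_agreement(outcomes, kernel=lambda a, b: float(a == b)):
    """Order-2 U-statistic: mean of `kernel` over all unordered pairs of runs."""
    pairs = list(combinations(outcomes, 2))
    if not pairs:
        raise ValueError("need at least two runs")
    return sum(kernel(a, b) for a, b in pairs) / len(pairs)

def pass_at_1(successes):
    """Fraction of runs that succeeded (the conventional aggregate)."""
    return sum(successes) / len(successes)
```

With outcomes `[True, True, False, False]`, `pass_at_1` is 0.5 while `pairwise_agreement` is 1/3, since only two of the six run pairs agree. Swapping in a kernel that compares whole trajectories, rather than final outcomes, would move this toward a trajectory-level consistency measure.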
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers propose DPA-GRPO, a novel training method for large language models that improves structured decision-making by using a generator-verifier framework where one model produces outputs and another validates them through safety assurance cases. The method demonstrates improved accuracy on tax calculation benchmarks and addresses the challenge of ensuring LLM outputs are locally correct, globally consistent, and auditable.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers demonstrate that execution-based voting methods for LLM code generation significantly outperform text-based majority voting by 18-52 percentage points. The study reveals that input quality—particularly sketch-based generation—matters far more than the aggregation algorithm itself, challenging assumptions about how to select optimal code outputs.
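The difference between the two voting styles can be sketched as follows, assuming each candidate is a Python source string defining a function named `solve` (a hypothetical convention, not taken from the study). Text voting counts identical strings, while execution voting groups candidates by what they return on shared test inputs.

```python
from collections import Counter, defaultdict

def text_majority_vote(candidates):
    """Pick the most frequent source string (exact-text voting)."""
    return Counter(candidates).most_common(1)[0][0]

def execution_vote(candidates, test_inputs, func_name="solve"):
    """Group candidates by their outputs on test_inputs; return one from the largest group."""
    groups = defaultdict(list)
    for src in candidates:
        namespace = {}
        try:
            # NOTE: exec is unsafe for untrusted code; a real system would sandbox this.
            exec(src, namespace)
            signature = tuple(repr(namespace[func_name](x)) for x in test_inputs)
        except Exception:
            continue  # candidates that fail to run cast no vote
        groups[signature].append(src)
    if not groups:
        return None
    return max(groups.values(), key=len)[0]
```

Three differently worded but behaviorally identical candidates form one group under execution voting, whereas text voting treats them as three separate single votes.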
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers investigate why visual grounding models fail when image captions are semantically mismatched, hypothesizing that embedding anisotropy may be responsible. Testing two transformer-based models with different embedding geometries reveals no meaningful correlation between cosine similarity and approximation errors, suggesting the problem requires investigation of deeper geometric properties.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce HOME-KGQA, a new benchmark dataset for evaluating knowledge graph question answering systems on household activities using multimodal data. The dataset reveals significant performance gaps in current LLM-based KGQA methods, highlighting critical challenges for real-world deployment of AI systems that combine language models with structured knowledge.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce MemoRepair, a system that addresses cascade failures in agentic memory by preventing stale or invalidated information from corrupting downstream AI agent decisions. Using a barrier-first approach and graph-based optimization, the system reduces invalid memory exposure from 69-94% to 0% while maintaining 91-94% of valid successor states with significantly lower repair costs.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠TraceFix is a verification-first framework that uses TLA+ model checking to automatically repair and validate multi-agent LLM coordination protocols, achieving 100% verification success on 48 test tasks with 62.5% passing on first attempt. The approach reduces deadlock/livelock failures from 31.1% to 14.1% and improves task completion rates to 89.4% compared to unverified baselines.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers evaluated metacognitive monitoring across 33 frontier LLMs using 47,151 MMLU benchmark items, finding significant domain-level variation masked by aggregate performance scores. Applied/Professional knowledge domains showed consistently strong self-monitoring (AUROC .742), while Formal Reasoning and Natural Science proved most challenging, with implications for targeted model deployment.
🏢 OpenAI · 🏢 Anthropic · 🧠 Gemini
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce a proxy-analyzer framework that detects hallucinations in large language models by analyzing internal activations of a small open-weight reader model rather than the generator itself. The system achieves competitive or superior performance compared to existing methods across multiple model architectures, with notably consistent results showing that model size has minimal impact on detection accuracy.
🧠 GPT-4
AI · Bearish · arXiv – CS AI · May 4 · 6/10
🧠Researchers studied how task phrasing influences the decision-making of large language models, using the iterated prisoner's dilemma as a test case. The findings reveal that LLMs tend to make presumptions based on how a task is worded, which can impair their adaptability and reasoning, a safety concern for real-world deployment. Neutral task phrasing significantly reduced these presumptions, suggesting that prompt design is critical for reliable LLM performance.
AI · Neutral · arXiv – CS AI · May 1 · 6/10
🧠Researchers introduce RSCB-MC, a risk-sensitive contextual bandit system that improves how LLM-based coding agents decide whether to use external memory for debugging tasks. Rather than treating memory retrieval as a simple similarity-matching problem, the system treats it as a safety-critical control problem, achieving 62.5% success rate with zero false positives in testing.
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers propose DALM, a Domain-Algebraic Language Model that constrains token generation through structured denoising across domain lattices rather than unconstrained decoding. The framework uses algebraic constraints across three phases—domain, relation, and concept resolution—to prevent cross-domain knowledge interference and improve factual accuracy in specialized domains.
AI · Bearish · arXiv – CS AI · Apr 20 · 6/10
🧠A new study reveals that using large language models to generate synthetic datasets ("silicon samples") produces highly variable results depending on configuration choices, with correlation outcomes ranging from r=.23 to r=.84 on the same task. This demonstrates that analytic flexibility in LLM-based data generation poses a significant threat to research validity and reproducibility in social science applications.
AI · Bullish · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers demonstrate that LLMs can be used as lossless encoders and decoders for invertible problems in hardware design, significantly reducing hallucinations and omissions. By generating HDL code from Logic Condition Tables and reconstructing the original tables to verify accuracy, the approach improves developer productivity and catches both AI-generated errors and design specification flaws.