956 articles tagged with #llm. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Neutral · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers propose AEX, a new attestation protocol for LLM APIs that provides cryptographic proof that API responses actually correspond to client requests. The system addresses trust issues with hosted AI models by adding signed attestation objects to existing JSON-based APIs without disrupting current functionality.
🏢 OpenAI
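The paper's actual AEX protocol isn't spelled out in the summary; a minimal sketch of the general idea, assuming a server that binds a signed attestation object to each JSON response (here with an HMAC and a shared demo key for simplicity — real attestation would use asymmetric signatures, and all field names are illustrative):

```python
import hashlib
import hmac
import json

SERVER_KEY = b"demo-secret"  # stand-in for the provider's signing key

def canonical(obj) -> bytes:
    """Deterministic JSON serialization so client and server hash identically."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()

def attest(request: dict, response: dict) -> dict:
    """Server side: attach an attestation binding the response to its request."""
    req_hash = hashlib.sha256(canonical(request)).hexdigest()
    mac = hmac.new(SERVER_KEY, req_hash.encode() + canonical(response),
                   hashlib.sha256).hexdigest()
    # The attestation object rides alongside the normal JSON payload,
    # leaving existing response fields untouched.
    return {**response, "attestation": {"request_sha256": req_hash, "mac": mac}}

def verify(request: dict, attested: dict) -> bool:
    """Client side: check the response actually corresponds to the request."""
    att = attested["attestation"]
    body = {k: v for k, v in attested.items() if k != "attestation"}
    req_hash = hashlib.sha256(canonical(request)).hexdigest()
    if att["request_sha256"] != req_hash:
        return False
    expected = hmac.new(SERVER_KEY, req_hash.encode() + canonical(body),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, att["mac"])
```

Because the attestation is an extra key in the existing JSON object, clients that ignore it keep working unchanged — matching the "without disrupting current functionality" claim.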
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers developed PA³, a new method to improve AI assistant alignment with business policies by teaching models to recall and apply relevant rules during reasoning without including full policies in prompts. The approach reduces computational overhead by 40% while achieving 16-point performance improvements over baselines.
AI · Neutral · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers studied computational resource allocation in AI retrieval systems for long-horizon agents, finding that re-ranking stages benefit more from powerful models and deeper candidate pools than query expansion stages. The study suggests concentrating compute power on re-ranking rather than distributing it uniformly across pipeline stages for better performance.
🧠 Gemini
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers introduce AdaAnchor, a new AI reasoning framework that performs silent computation in latent space rather than generating verbose step-by-step reasoning. The system adaptively determines when to stop refining its internal reasoning process, achieving up to 5% better accuracy while reducing token generation by 92-93% and cutting refinement steps by 48-60%.
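AdaAnchor's exact stopping rule isn't given in the summary; one common way to realize "adaptively determine when to stop refining" is to iterate a latent update until successive states change by less than a tolerance. A toy sketch (the update function, tolerance, and scalar state are illustrative, not from the paper):

```python
def refine_until_stable(state, step, tol=1e-6, max_steps=100):
    """Iterate a latent update, stopping adaptively when the state stops moving.

    Returns the refined state and the number of refinement steps actually
    used -- the adaptive halt is what saves compute versus a fixed budget.
    """
    for i in range(max_steps):
        new_state = step(state)
        if abs(new_state - state) < tol:  # convergence check, no tokens emitted
            return new_state, i + 1
        state = new_state
    return state, max_steps

# Toy update: Newton-style fixed-point iteration converging to sqrt(2).
fixed_point = lambda x: 0.5 * (x + 2.0 / x)
value, steps = refine_until_stable(1.0, fixed_point)
```

The point of the sketch: refinement happens entirely in the (here scalar) latent state, and easy inputs terminate in few steps, which is the mechanism behind the reported cuts in refinement steps and token generation.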
AI · Neutral · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers conducted an empirical study on 16 Large Language Models to understand how they process tabular data, revealing a three-phase attention pattern and finding that tabular tasks require deeper neural network layers than math reasoning. The study analyzed attention dynamics, layer depth requirements, expert activation in MoE models, and the impact of different input designs on table understanding performance.
AI · Neutral · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers introduced InterveneBench, a new benchmark comprising 744 peer-reviewed studies to evaluate large language models' ability to reason about policy interventions and causal inference in social science contexts. Current state-of-the-art LLMs struggle with this type of reasoning, prompting the development of STRIDES, a multi-agent framework that significantly improves performance on these tasks.
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers introduced AssetOpsBench, a unified framework for benchmarking AI agents in industrial asset operations and maintenance automation. The platform has gained significant adoption with 250+ users and 500+ submitted agents, providing a standardized way to evaluate AI solutions for Industry 4.0 applications.
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers introduce AutoEP, a framework that uses Large Language Models (LLMs) as zero-shot reasoning engines to automatically configure algorithm hyperparameters without requiring training. The system combines real-time landscape analysis with multi-LLM reasoning to outperform existing methods and enables open-source models like Qwen3-30B to match GPT-4's performance in optimization tasks.
🧠 GPT-4
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers introduce Reason2Decide, a two-stage training framework that improves clinical decision support systems by aligning AI explanations with predictions. The system achieves better performance than larger foundation models while using 40x smaller models, making clinical AI more accessible for resource-constrained deployments.
AI · Bearish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers discovered that skip connections in deep neural networks make adversarial attacks more transferable across different AI models. They developed the Skip Gradient Method (SGM) which exploits this vulnerability in ResNets, Vision Transformers, and even Large Language Models to create more effective adversarial examples.
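The SGM intuition: for a residual block y = x + f(x), the gradient dy/dx = 1 + f'(x) splits into a skip-connection term (the 1) and a residual-branch term (f'(x)); SGM down-weights the residual term by a decay factor γ < 1 during backpropagation, favoring the more transferable skip path. A scalar toy illustration (γ and the branch function are illustrative):

```python
def sgm_backward(x, f_grad, gamma=0.5):
    """Gradient of y = x + f(x) w.r.t. x, with the residual-branch gradient
    scaled by gamma -- the Skip Gradient Method's one-line modification."""
    return 1.0 + gamma * f_grad(x)

# Toy residual branch f(x) = x**2, so f'(x) = 2x.
f_grad = lambda x: 2.0 * x

g_plain = 1.0 + f_grad(2.0)        # ordinary backprop gradient: 1 + 4 = 5.0
g_sgm = sgm_backward(2.0, f_grad)  # SGM gradient: 1 + 0.5 * 4 = 3.0
```

In a real network the same scaling is applied at every residual block's backward pass, so the crafted perturbation relies more on the shared skip structure than on model-specific branch features.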
AI · Neutral · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers propose CausalDANN, a novel method using large language models to estimate causal effects of textual interventions in social systems. The approach addresses limitations of traditional causal inference methods when dealing with complex, high-dimensional textual data and can handle arbitrary text interventions even with observational data only.
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers developed E2H Reasoner, a curriculum reinforcement learning method that improves LLM reasoning by training on tasks from easy to hard. The approach shows significant improvements for small LLMs (1.5B-3B parameters) that struggle with vanilla RL training alone.
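The general curriculum-RL recipe (a sketch of the idea, not E2H Reasoner's exact schedule) is to sort training tasks by a difficulty estimate and feed them to the learner in stages of increasing hardness:

```python
def easy_to_hard_schedule(tasks, difficulty, n_stages=3):
    """Split tasks into stages of increasing difficulty for curriculum training."""
    ordered = sorted(tasks, key=difficulty)
    size = -(-len(ordered) // n_stages)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

# Toy tasks scored by a stand-in difficulty proxy (here, problem length).
tasks = ["2+2", "12*7-3", "solve x^2-5x+6=0", "3+4", "integrate x*e^x dx"]
stages = easy_to_hard_schedule(tasks, difficulty=len)
# An RL loop would then train on stages[0] first, then stages[1], and so on,
# so a small model sees solvable problems before the ones vanilla RL fails on.
```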
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers have developed EvolvR, a self-evolving framework that improves AI's ability to evaluate and generate stories through pairwise reasoning and multi-agent data filtering. The system achieves state-of-the-art performance on three evaluation benchmarks and significantly enhances story generation quality when used as a reward model.
AI · Neutral · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers conducted the first systematic study on post-training quantization for diffusion large language models (dLLMs), identifying activation outliers as a key challenge for compression. The study evaluated state-of-the-art quantization methods across multiple dimensions to provide insights for efficient dLLM deployment on edge devices.
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers introduce Slow-Fast Policy Optimization (SFPO), a new reinforcement learning framework that improves training stability and efficiency for large language model reasoning. SFPO outperforms existing methods like GRPO by up to 2.80 points on math benchmarks while requiring up to 4.93x fewer rollouts and 4.19x less training time.
AI · Neutral · arXiv – CS AI · Mar 17 · 6/10
🧠Research reveals that while increasing the number of LLM agents improves mathematical problem-solving accuracy, these multi-agent systems remain vulnerable to adversarial attacks. The study found that human-like typos pose the greatest threat to robustness, and the adversarial vulnerability gap persists regardless of agent count.
🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers developed LabelFusion, a hybrid AI architecture combining Large Language Models with transformer encoders for financial news classification. The hybrid system achieves a 96% F1 score on full datasets, while LLMs alone perform better in low-data scenarios, suggesting the right strategy depends on how much training data is available.
AI · Bearish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers introduced MDial, the first large-scale framework for generating multi-dialectal conversational data across nine English dialects; the authors note that over 80% of English speakers don't use Standard American English. Evaluation of 17 LLMs showed even frontier models achieve under 70% accuracy in dialect identification, with particularly poor performance on non-American dialects.
AI · Bearish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers introduce HEARTS, a comprehensive benchmark for evaluating large language models' ability to reason over health time series data across 16 datasets and 12 health domains. The study reveals that current LLMs significantly underperform compared to specialized models and struggle with multi-step temporal reasoning in healthcare applications.
AI · Bullish · Import AI (Jack Clark) · Mar 16 · 6/10
🧠ImportAI 449 explores recent developments in AI research including LLMs training other LLMs, a 72B parameter distributed training run, and findings that computer vision tasks remain more challenging than generative text tasks. The newsletter highlights autonomous LLM refinement capabilities and post-training benchmark results showing significant AI capability growth.
AI · Bullish · arXiv – CS AI · Mar 16 · 6/10
🧠Researchers introduce a new knowledge distillation framework that improves training of smaller AI models by using intermediate representations from large language models rather than their final outputs. The method shows consistent improvements across reasoning benchmarks, particularly when training data is limited, by providing cleaner supervision signals.
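The summary doesn't give the paper's exact loss; a common form of intermediate-representation distillation matches (linearly projected) student hidden states to teacher hidden states with an MSE term. A pure-Python sketch with made-up dimensions:

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def project(student_h, weights):
    """Linear map lifting the student hidden size to the teacher hidden size."""
    return [sum(w * h for w, h in zip(row, student_h)) for row in weights]

def intermediate_distill_loss(student_h, teacher_h, weights):
    """Match the projected student representation to the teacher's, instead of
    matching final output distributions."""
    return mse(project(student_h, weights), teacher_h)

# Toy: student dim 2 -> teacher dim 3 via a fixed projection matrix.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
loss = intermediate_distill_loss([0.5, -0.5], [0.5, -0.5, 0.0], W)
```

Supervising on internal representations rather than final outputs is what the summary means by "cleaner supervision signals" — the student is pulled toward how the teacher represents the input, not just what it emits.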
AI · Neutral · arXiv – CS AI · Mar 16 · 6/10
🧠Researchers propose Global Evolutionary Refined Steering (GER-steer), a new training-free framework for controlling Large Language Models without fine-tuning costs. The method addresses issues with existing activation engineering approaches by using geometric stability to improve steering vector accuracy and reduce noise.
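GER-steer's specifics aren't in the summary, but activation steering in general builds a direction from contrasting activation sets and adds it, scaled, to the model's hidden state at inference — no fine-tuning involved. A minimal sketch (dimensions and values illustrative):

```python
def mean_vec(vecs):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def steering_vector(pos_acts, neg_acts):
    """Difference-of-means direction between activations from contrasting
    prompt sets (e.g. desired vs undesired behavior)."""
    p, q = mean_vec(pos_acts), mean_vec(neg_acts)
    return [a - b for a, b in zip(p, q)]

def apply_steering(hidden, direction, alpha=1.0):
    """Add the scaled steering direction to a hidden state (training-free)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy 2-d activations from "positive" vs "negative" prompts.
v = steering_vector([[1.0, 0.0], [3.0, 0.0]], [[0.0, 1.0], [0.0, 3.0]])
steered = apply_steering([0.5, 0.5], v, alpha=0.1)
```

The difference-of-means estimate is noisy, which is the failure mode the summary says GER-steer targets by refining vectors for geometric stability.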
AI · Neutral · arXiv – CS AI · Mar 16 · 6/10
🧠Researchers have launched LLM BiasScope, an open-source web application that enables real-time bias analysis and side-by-side comparison of outputs from major language models including Google Gemini, DeepSeek, and Meta Llama. The platform uses a two-stage bias detection pipeline and provides interactive visualizations to help researchers and practitioners evaluate bias patterns across different AI models.
🏢 Hugging Face · 🧠 Gemini · 🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 16 · 6/10
🧠Researchers introduce Delta1, a framework that integrates automated theorem generation with large language models to create explainable AI reasoning. The system combines formal logic rigor with natural language explanations, demonstrating applications across healthcare, compliance, and regulatory domains.
AI · Bullish · arXiv – CS AI · Mar 16 · 6/10
🧠Researchers developed a human-in-the-loop LLM system for grading handwritten mathematics assessments that reduces grading time by 23% while maintaining accuracy comparable to manual grading. The system combines automated scanning, multi-pass LLM scoring, consistency checks, and mandatory human verification to handle pen-and-paper tests at scale.
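The paper's exact consistency rule isn't given in the summary; one plausible sketch of multi-pass scoring with a consistency gate: run the LLM scorer several times per answer, take the median, and flag wide disagreement for priority attention during the human verification step (the spread threshold is illustrative, and the paper describes human verification as mandatory for all items, not only flagged ones):

```python
from statistics import median

def grade_with_review(scores, spread_threshold=1.0):
    """Combine multiple LLM scoring passes; flag inconsistent ones for a human.

    scores: numeric scores from independent LLM scoring passes on one answer.
    """
    agreed = max(scores) - min(scores) <= spread_threshold
    return {"score": median(scores),
            "needs_human_review": not agreed}

consistent = grade_with_review([7.0, 7.5, 7.0])  # passes agree: low-risk item
disputed = grade_with_review([4.0, 8.0, 6.0])    # spread 4.0: flag for a human
```

The time saving comes from the median aggregation giving humans a vetted starting score, with attention concentrated on the inconsistent cases.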