956 articles tagged with #llm. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI Bullish · arXiv · CS AI · Mar 3 · 7/10 · 6
🧠 MOSAIC is a new open-source platform that enables cross-paradigm comparison and evaluation of different AI agents, including reinforcement learning, large language models, vision-language models, and human decision-makers, within the same environment. The platform introduces three key technical contributions: an IPC-based worker protocol, operator abstraction for unified interfaces, and a deterministic evaluation framework for reproducible research.
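The operator abstraction above can be sketched roughly as follows. This is a minimal in-process illustration only; `Operator`, `RandomPolicyOperator`, and `ScriptedLLMOperator` are invented names, and the actual platform communicates with workers over its IPC protocol rather than via direct method calls:

```python
from abc import ABC, abstractmethod
import random

class Operator(ABC):
    """Unified interface so RL policies, LLM agents, and human players
    can all be driven by the same environment loop (hypothetical sketch)."""
    @abstractmethod
    def act(self, observation): ...

class RandomPolicyOperator(Operator):
    def __init__(self, actions, seed=0):
        self.actions = actions
        self.rng = random.Random(seed)  # seeded -> deterministic evaluation

    def act(self, observation):
        return self.rng.choice(self.actions)

class ScriptedLLMOperator(Operator):
    """Stands in for an LLM worker reached over IPC; here just a stub."""
    def act(self, observation):
        return "left" if observation < 0 else "right"

def evaluate(operator, observations):
    # Same loop regardless of which paradigm is behind the operator.
    return [operator.act(o) for o in observations]

print(evaluate(ScriptedLLMOperator(), [-2, -1, 0, 3]))
# ['left', 'left', 'right', 'right']
```

Because every agent type satisfies the same `act` interface, a seeded run can be replayed bit-for-bit, which is the point of the deterministic evaluation framework.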
AI Bullish · arXiv · CS AI · Mar 3 · 6/10 · 6
🧠 Researchers introduce GlassMol, a new interpretable AI model for molecular property prediction that addresses the black-box problem in drug discovery. The model uses Concept Bottleneck Models with automated concept curation and LLM-guided selection, achieving performance that matches or exceeds traditional black-box models across thirteen benchmarks.
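The concept-bottleneck idea can be illustrated with a toy sketch: the model first predicts human-readable concepts, and the final property prediction sees only those concepts, so every decision is auditable. The concept names and weights below are invented for illustration, not GlassMol's actual concepts:

```python
# Hypothetical concept-bottleneck sketch (not GlassMol's real concepts).
def predict_concepts(features):
    # The kind of chemist-readable concepts an LLM-guided curation
    # step might select from a larger candidate pool.
    return {
        "has_aromatic_ring": features["ring_count"] > 0,
        "is_lipophilic": features["logp"] > 3.0,
    }

def predict_property(concepts):
    # The final predictor sees ONLY the concept layer, never raw features,
    # so a wrong prediction can be traced to a wrong concept.
    score = 0.6 * concepts["is_lipophilic"] + 0.4 * concepts["has_aromatic_ring"]
    return score > 0.5

feats = {"ring_count": 2, "logp": 4.1}
concepts = predict_concepts(feats)
print(concepts, predict_property(concepts))
```

The interpretability gain is structural: the bottleneck forces all predictive signal through named concepts, which is what lets the approach compete with black-box models while staying inspectable.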
AI Bullish · arXiv · CS AI · Mar 3 · 7/10 · 7
🧠 Researchers developed a method for creating synthetic instruction datasets to improve domain-specific LLMs, demonstrated on a 9.5-billion-token Japanese financial dataset. The approach enhances both domain expertise and reasoning capabilities, with models and datasets being open-sourced for broader use.
AI Bullish · arXiv · CS AI · Mar 3 · 7/10 · 9
🧠 Researchers have developed MM-Mem, a new pyramidal multimodal memory architecture that enables AI systems to better understand long-horizon videos by mimicking human cognitive memory processes. The system addresses current limitations in multimodal large language models by creating a hierarchical memory structure that progressively distills detailed visual information into high-level semantic understanding.
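The pyramidal structure can be sketched in miniature: fine-grained per-clip notes are progressively merged into coarser summaries, so a query can start at the abstract top level and drill down only where needed. The merge step here is plain string joining; in the paper it would be model-driven distillation:

```python
# Hypothetical sketch of a pyramidal memory over video-clip notes.
def build_pyramid(clips, fanout=2):
    levels = [clips]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Merge each group of `fanout` entries into one coarser summary.
        merged = [" + ".join(prev[i:i + fanout])
                  for i in range(0, len(prev), fanout)]
        levels.append(merged)
    return levels  # levels[0] = most detailed, levels[-1] = most abstract

pyramid = build_pyramid(["door opens", "man enters", "lights off", "alarm"])
print(len(pyramid), pyramid[-1][0])
```

Each level trades detail for coverage, which is how a fixed memory budget can span a long-horizon video.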
AI Bullish · arXiv · CS AI · Mar 3 · 7/10 · 10
🧠 Researchers developed a new inference-time safety mechanism for code-generating AI models that uses retrieval-augmented generation to identify and fix security vulnerabilities in real-time. The approach leverages Stack Overflow discussions to guide AI code revision without requiring model retraining, improving security while maintaining interpretability.
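The retrieve-then-revise loop can be sketched as follows. The knowledge base, pattern matching, and direct string fix are all stand-ins: in the paper's setting the retrieved discussion would guide an LLM's rewrite rather than a hard-coded replacement:

```python
# Hypothetical sketch of inference-time retrieve-and-revise.
# Maps risky API patterns to retrieved community guidance (invented entry).
KNOWLEDGE_BASE = {
    "yaml.load": "yaml.load without a Loader is unsafe; prefer yaml.safe_load",
}

def find_risky_calls(code):
    # Flag generated code that matches a known-risky pattern.
    return [pattern for pattern in KNOWLEDGE_BASE if pattern in code]

def revise(code):
    for pattern in find_risky_calls(code):
        # An LLM would rewrite the snippet guided by the retrieved
        # discussion; here we apply the known fix directly.
        if pattern == "yaml.load":
            code = code.replace("yaml.load(", "yaml.safe_load(")
    return code

generated = "cfg = yaml.load(open('c.yml'))"
print(revise(generated))  # cfg = yaml.safe_load(open('c.yml'))
```

Because the fix is tied to a retrieved, human-readable discussion, the revision stays interpretable and no retraining is involved.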
AI Bearish · arXiv · CS AI · Mar 3 · 7/10 · 8
🧠 Researchers have identified significant privacy risks in Large Language Model-based Task-Oriented Dialogue Systems, demonstrating that these AI systems can memorize and leak sensitive training data including phone numbers and complete dialogue exchanges. The study proposes new attack methods that can extract thousands of training dialogue states with over 70% precision in best-case scenarios.
AI Bullish · arXiv · CS AI · Mar 3 · 6/10 · 9
🧠 Researchers introduce Surgical Post-Training (SPoT), a new method to improve Large Language Model reasoning while preventing catastrophic forgetting. SPoT achieved a 6.2% accuracy improvement on Qwen3-8B using only 4k data pairs and 28 minutes of training, offering a more efficient alternative to traditional post-training approaches.
AI Bullish · arXiv · CS AI · Mar 3 · 7/10 · 4
🧠 Researchers propose FreeAct, a new quantization framework for Large Language Models that improves efficiency by using dynamic transformation matrices for different token types. The method achieves up to 5.3% performance improvement over existing approaches by addressing the memory and computational overhead challenges in LLMs.
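Why per-token-type treatment matters can be shown with a scalar sketch. This is generic symmetric int8 quantization with a dynamically chosen scale, not FreeAct's actual transformation matrices, which operate on full activation tensors:

```python
# Hypothetical sketch: one static scale cannot serve both ordinary tokens
# and outlier-heavy tokens well, so the scaling is chosen per token.
def quantize(x, scale):
    return [max(-127, min(127, round(v / scale))) for v in x]

def dequantize(q, scale):
    return [v * scale for v in q]

def scale_for(x):
    # Dynamic, per-token symmetric scale for int8.
    return max(abs(v) for v in x) / 127

normal_tok = [0.1, -0.2, 0.05]
outlier_tok = [8.0, -0.2, 0.05]   # outlier channel needs a much wider scale
for tok in (normal_tok, outlier_tok):
    s = scale_for(tok)
    roundtrip = dequantize(quantize(tok, s), s)
    err = max(abs(a - b) for a, b in zip(tok, roundtrip))
    # Per-token scale keeps the worst error below one quantization step.
    print(round(s, 4), err < s)
```

A single static scale sized for the outlier token would waste almost all of the int8 range on ordinary tokens; adapting the transform per token type avoids that, at the cost of the bookkeeping overhead the paper targets.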
AI Bullish · arXiv · CS AI · Mar 3 · 7/10 · 5
🧠 Researchers introduce ALTER, a new framework for efficiently "unlearning" specific knowledge from large language models while preserving their overall utility. The system uses asymmetric LoRA architecture to selectively forget targeted information with 95% effectiveness while maintaining over 90% model utility, significantly outperforming existing methods.
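The low-rank mechanics behind a LoRA-style update can be sketched in miniature. The asymmetry here, a frozen down-projection `A` with a trainable up-projection `B`, is one common reading of "asymmetric LoRA" and is an assumption, not ALTER's documented design:

```python
# Hypothetical sketch of an asymmetric low-rank update: the frozen base
# weight W is never touched; forgetting is steered through the small
# trainable factor B while the frozen A anchors the update subspace.
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (2x2)
A = [[0.5], [0.5]]             # frozen down-projection, rank 1 (2x1)
B = [[-1.0, 0.0]]              # trainable up-projection (1x2)

def forward(x):
    base = matvec(W, x)
    low_rank = matvec(A, matvec(B, x))   # A @ (B @ x)
    return [b + l for b, l in zip(base, low_rank)]

print(forward([1.0, 0.0]))  # [0.5, -0.5]
```

Because only the rank-1 factor changes, unlearning edits a tiny fraction of parameters, which is what makes the utility/forgetting trade-off (95% forgetting, >90% utility) plausible to control.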
AI Bullish · arXiv · CS AI · Mar 3 · 6/10 · 5
🧠 Researchers introduce 'semi-formal reasoning' for LLM agents to analyze code semantics without execution, showing significant accuracy improvements across multiple tasks. The methodology achieves 88-93% accuracy on patch verification and 87% on code question answering, potentially enabling practical applications in automated code review and static analysis.
AI Bearish · arXiv · CS AI · Mar 3 · 7/10 · 5
🧠 A systematic audit of 17 shadow APIs used in 187 academic papers reveals widespread deception, with performance divergence up to 47.21% and identity verification failures in 45.83% of tests. These third-party services claim to provide access to frontier LLMs like GPT-5 and Gemini-2.5 but deliver inconsistent outputs, undermining research validity and reproducibility.
AI Neutral · arXiv · CS AI · Mar 3 · 6/10 · 4
🧠 Researchers propose 'jailbreaking' as a user-driven method to counter LLM-powered social media manipulation by exposing automated bot behavior. The study suggests users can deliberately trigger AI safeguards to reveal misleading political narratives and reduce online conflict escalation.
AI Neutral · arXiv · CS AI · Mar 3 · 6/10 · 4
🧠 Researchers introduce AMemGym, an interactive benchmarking environment for evaluating and optimizing memory management in long-horizon conversations with AI assistants. The framework addresses limitations in current memory evaluation methods by enabling on-policy testing with LLM-simulated users and revealing performance gaps in existing memory systems like RAG and long-context LLMs.
AI Bullish · arXiv · CS AI · Mar 3 · 5/10 · 4
🧠 Researchers developed EstLLM, enhancing Estonian language capabilities in multilingual LLMs through continued pretraining of Llama 3.1 8B with balanced data mixtures. The approach improved Estonian linguistic performance while maintaining English capabilities, demonstrating that targeted continued pretraining can substantially improve single-language performance in multilingual models.
AI Neutral · arXiv · CS AI · Mar 3 · 6/10 · 3
🧠 A research study evaluated six state-of-the-art large language models in geopolitical crisis simulations, comparing their decision-making to human behavior. The study found that LLMs initially mirror human decisions but diverge over time, consistently exhibiting cooperative, stability-focused strategies with limited adversarial reasoning.
AI Bullish · arXiv · CS AI · Mar 3 · 5/10 · 4
🧠 Researchers developed a novel framework using large language models (LLMs) to analyze electric vehicle taxi driver charging behavior by integrating psychological traits and environmental factors. The study demonstrates that LLMs can reliably simulate real-world charging decisions across multiple urban environments, providing insights for optimizing charging infrastructure and energy policy.
AI Neutral · arXiv · CS AI · Mar 3 · 6/10 · 3
🧠 Researchers have developed a new preference learning framework that addresses bias in AI alignment by ensuring policies reflect true population distributions rather than just majority opinions. The approach uses social choice theory principles and has been validated on both recommendation tasks and large language model alignment.
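The majority-vs-distribution distinction at the heart of that framework can be shown in a few lines. This is a generic illustration of the failure mode, not the paper's actual aggregation rule:

```python
# Hypothetical sketch: majority aggregation erases minority preferences,
# while a distributional target preserves them in proportion.
from collections import Counter

votes = ["A", "A", "A", "B", "B"]  # 60% of annotators prefer A, 40% prefer B

# Majority-label training target: the 40% minority vanishes entirely.
majority_target = Counter(votes).most_common(1)[0][0]

# Distributional training target: the population's split is preserved.
n = len(votes)
proportional_target = {k: v / n for k, v in Counter(votes).items()}

print(majority_target, proportional_target)
```

A policy trained against `proportional_target` still allocates 40% of its probability mass to the minority preference, which is the bias correction the social-choice machinery is meant to guarantee.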
AI Neutral · arXiv · CS AI · Mar 3 · 5/10 · 3
🧠 Researchers developed behavioral generative agents powered by large language models to simulate consumer decision-making in energy operations. The study found these AI agents can model heterogeneous customer behavior and provide insights into rare events like blackouts, offering a scalable tool for energy policy analysis.
AI Bearish · arXiv · CS AI · Mar 3 · 6/10 · 4
🧠 Researchers introduced SciTrek, a new benchmark for testing large language models' ability to perform numerical reasoning across long scientific documents. The benchmark reveals significant challenges for current LLMs, with the best model achieving only 46.5% accuracy at 128K tokens, and performance declining as context length increases.
AI Neutral · arXiv · CS AI · Mar 3 · 6/10 · 3
🧠 Researchers propose rubric-based reward modeling to address reward over-optimization in large language model fine-tuning. The approach focuses on the high-reward tail where models struggle to distinguish excellent responses from merely great ones, using off-policy examples to improve training effectiveness.
AI Neutral · arXiv · CS AI · Mar 3 · 6/10 · 4
🧠 Researchers investigated whether large language models can introspect by detecting perturbations to their internal states, using Meta-Llama-3.1-8B-Instruct. They found that the binary detection methods from prior work were flawed due to methodological artifacts, but that models do show partial introspection capabilities: the model localized sentence injections with 88% accuracy and discriminated injection strengths with 83% accuracy, though only for early-layer perturbations.
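The perturb-then-localize setup can be sketched abstractly: add a steering vector at one position in an early layer, then compare against a clean run to see which position was perturbed. The toy "layer" below is just a scalar transform; in the actual experiments the detector is the model's own report, not an external diff:

```python
# Hypothetical sketch of activation-injection localization.
def run_layer(hidden, inject_at=None, strength=0.0):
    out = list(hidden)
    if inject_at is not None:
        out[inject_at] += strength   # the injected perturbation
    return [h * 1.1 for h in out]    # stand-in for the layer transform

seq = [0.2, -0.1, 0.4, 0.0]          # one activation per sentence position
clean = run_layer(seq)
perturbed = run_layer(seq, inject_at=2, strength=5.0)

# Localizer: the position with the largest activation difference.
diffs = [abs(a - b) for a, b in zip(clean, perturbed)]
located = max(range(len(diffs)), key=diffs.__getitem__)
print(located)  # 2
```

The early-layer restriction in the findings makes sense under this picture: a perturbation injected late has fewer subsequent layers through which its signature can propagate into anything the model can report on.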
AI Neutral · arXiv · CS AI · Mar 3 · 5/10 · 3
🧠 Researchers developed AWARE-US, a system to improve AI agents' ability to handle failed database queries by intelligently relaxing the least important user constraints rather than simply returning 'no results'. The system uses three LLM-based methods to infer constraint importance from dialogue, achieving up to 56% accuracy in correct constraint relaxation.
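The relaxation loop itself is simple once importance scores exist; the hard part the paper addresses is inferring those scores from dialogue with an LLM. In this sketch the scores are simply given, and all table contents and names are invented:

```python
# Hypothetical sketch: drop the lowest-importance constraint and retry
# until the query returns results (importance would come from an LLM
# reading the dialogue; here it is hard-coded).
RESTAURANTS = [
    {"cuisine": "thai", "price": "cheap", "area": "north"},
    {"cuisine": "thai", "price": "expensive", "area": "centre"},
]

def search(constraints):
    return [r for r in RESTAURANTS
            if all(r[k] == v for k, v in constraints.items())]

def search_with_relaxation(constraints, importance):
    cons = dict(constraints)
    while cons and not search(cons):
        least = min(cons, key=lambda k: importance[k])
        cons.pop(least)   # relax the least important constraint first
    return cons, search(cons)

kept, hits = search_with_relaxation(
    {"cuisine": "thai", "price": "cheap", "area": "centre"},
    importance={"cuisine": 0.9, "price": 0.5, "area": 0.2},
)
print(kept, len(hits))
```

Here the over-constrained query (thai + cheap + centre) matches nothing, so the least important constraint (`area`) is dropped and a cheap Thai restaurant is found, instead of the user getting 'no results'.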
AI Bullish · arXiv · CS AI · Mar 3 · 6/10 · 4
🧠 OrbitFlow is a new KV cache management system for long-context LLM serving that uses adaptive memory allocation and fine-grained optimization to improve performance. The system achieves up to 66% better SLO attainment and 3.3x higher throughput by dynamically managing GPU memory usage during token generation.
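The fine-grained allocation idea can be sketched with a toy block pool: memory is handed out in small blocks per request as tokens are generated, rather than reserving the maximum context length up front. This is the general paged-KV-cache pattern, not OrbitFlow's specific policy, and all names are invented:

```python
# Hypothetical sketch of block-granular KV-cache bookkeeping.
BLOCK_TOKENS = 4   # KV entries per block (invented size)

class KVCachePool:
    def __init__(self, total_blocks):
        self.free = total_blocks
        self.per_request = {}   # request id -> tokens cached so far

    def append_token(self, req):
        used = self.per_request.get(req, 0)
        if used % BLOCK_TOKENS == 0:   # current block full -> need a new one
            if self.free == 0:
                return False           # would trigger eviction/preemption
            self.free -= 1
        self.per_request[req] = used + 1
        return True

pool = KVCachePool(total_blocks=2)
ok = [pool.append_token("a") for _ in range(9)]
print(sum(ok), pool.free)  # 8 tokens fit in 2 blocks; the 9th is refused
```

Because blocks are claimed only as generation proceeds, memory that a static reservation would have wasted stays available to other requests, which is where the throughput and SLO headroom comes from.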
AI Bullish · arXiv · CS AI · Mar 3 · 6/10 · 2
🧠 Researchers developed a training-free method to detect AI hallucinations by reinterpreting LLM output as Energy-Based Models and tracking 'energy spills' during text generation. The approach successfully identifies factual errors and biases across multiple state-of-the-art models including LLaMA, Mistral, and Gemma without requiring additional training or probe classifiers.
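One standard way to read logits as an energy-based model is E(x) = -logsumexp(logits); the sketch below flags a generation step whose energy jumps far above the preceding steps. The threshold and the "running mean" detector are assumptions for illustration, not the paper's exact statistic:

```python
# Hypothetical sketch of energy tracking during generation:
# confident (peaked) logits give low energy, diffuse logits give high
# energy, and a sudden upward jump is treated as an "energy spill".
import math

def energy(logits):
    # E = -logsumexp(logits), computed stably.
    m = max(logits)
    return -(m + math.log(sum(math.exp(l - m) for l in logits)))

steps = [
    [5.0, 1.0, 0.5],    # confident step: one token dominates
    [5.2, 0.8, 0.3],
    [0.1, 0.0, -0.1],   # diffuse step: energy jumps sharply upward
]
energies = [energy(l) for l in steps]
mean_so_far = sum(energies[:-1]) / (len(energies) - 1)
flagged = energies[-1] - mean_so_far > 2.0   # threshold is an assumption
print(flagged)  # True
```

Nothing here requires training or a probe classifier; the signal is computed from quantities the model already produces at every step, which is what makes the approach training-free.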
AI Bearish · arXiv · CS AI · Mar 3 · 6/10 · 4
🧠 Researchers introduced SimpleToM, a benchmark revealing that state-of-the-art language models can infer mental states but struggle to apply that knowledge for behavior prediction and judgment. The study exposes a critical gap between explicit Theory of Mind inference and implicit application in real-world scenarios.