954 articles tagged with #llm. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠 Researchers introduce SCAN, a new framework for editing Large Language Models that prevents catastrophic forgetting during sequential knowledge updates. The method uses sparse circuit manipulation instead of dense parameter changes, maintaining model performance even after 3,000 sequential edits across major models like Gemma2, Qwen3, and Llama3.1.
🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠 Researchers introduce Memory-as-Asset, a new paradigm for human-centric artificial general intelligence that treats personal memory as a digital asset. The framework features three key components: human-centric memory ownership, collaborative knowledge formation, and collective memory evolution, supported by a three-layer infrastructure including decentralized memory exchange networks.
AI · Neutral · arXiv – CS AI · Mar 17 · 7/10
🧠 Researchers introduced WebCoderBench, the first comprehensive benchmark for evaluating web application generation by large language models, featuring 1,572 real-world user requirements and 24 evaluation metrics. The benchmark tests 12 representative LLMs and shows no single model dominates across all metrics, providing opportunities for targeted improvements.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠 Researchers have introduced OpenSeeker, the first fully open-source search agent that achieves frontier-level performance using only 11,700 training samples. The model outperforms existing open-source competitors and even some industrial solutions, with complete training data and model weights being released publicly.
AI · Bearish · TechCrunch – AI · Mar 16 · 7/10
🧠 Encyclopedia Britannica and Merriam-Webster have filed a lawsuit against OpenAI, alleging copyright infringement of nearly 100,000 articles used in training their large language models. This legal action adds to growing concerns about AI companies' use of copyrighted content for model development.
🏢 OpenAI
AI · Neutral · arXiv – CS AI · Mar 16 · 7/10
🧠 Researchers developed a testing framework to evaluate how reliably AI agents maintain consistent reasoning when inputs are semantically equivalent but differently phrased. Their study of seven foundation models across 19 reasoning problems found that larger models aren't necessarily more robust, with the smaller Qwen3-30B-A3B achieving the highest stability at 79.6% invariant responses.
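The stability metric above can be illustrated by counting the fraction of problems whose answer stays identical across paraphrases (a minimal sketch; the paper's exact scoring rule is an assumption):

```python
def invariance_rate(answers_per_problem):
    """Fraction of problems where the model gives the same answer to
    every semantically equivalent rephrasing (a rough proxy for the
    'invariant responses' metric; details are assumptions)."""
    invariant = sum(1 for answers in answers_per_problem
                    if len(set(answers)) == 1)
    return invariant / len(answers_per_problem)

# Hypothetical data: 3 problems, each asked with 3 paraphrases.
runs = [
    ["42", "42", "42"],    # stable under rephrasing
    ["yes", "no", "yes"],  # flips under rephrasing
    ["7", "7", "7"],       # stable
]
print(round(invariance_rate(runs), 3))  # 0.667
```

A real harness would query the model once per paraphrase and normalize answers (case, whitespace) before comparing.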
AI · Bullish · arXiv – CS AI · Mar 16 · 7/10
🧠 Researchers introduce the AI Search Paradigm, a comprehensive framework for next-generation search systems using four LLM-powered agents (Master, Planner, Executor, Writer) that collaborate to handle everything from simple queries to complex reasoning tasks. The system employs modular architecture with dynamic workflows for task planning, tool integration, and content synthesis to create more adaptive and scalable AI search capabilities.
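The four-agent flow described above can be sketched as a simple pipeline (the agent names follow the summary; every interface and the toy logic here are illustrative assumptions, not the paper's design):

```python
def ai_search(query, agents):
    """Toy Master -> Planner -> Executor -> Writer pipeline: classify
    the query, break it into sub-tasks, run each, then synthesize."""
    task = agents["master"](query)                    # classify / dispatch
    plan = agents["planner"](task)                    # break into sub-tasks
    results = [agents["executor"](s) for s in plan]   # run each step
    return agents["writer"](query, results)           # synthesize answer

# Stand-in agents; a real system would back each with an LLM call.
agents = {
    "master":   lambda q: {"query": q, "complex": len(q.split()) > 3},
    "planner":  lambda t: t["query"].split(" and ") if t["complex"] else [t["query"]],
    "executor": lambda step: f"result({step})",
    "writer":   lambda q, rs: "; ".join(rs),
}
print(ai_search("population of France and capital of Japan", agents))
# result(population of France); result(capital of Japan)
```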
AI · Bullish · arXiv – CS AI · Mar 16 · 7/10
🧠 Researchers have developed a new methodology that leverages Large Language Models to automate the creation of Ontological Knowledge Bases, addressing traditional challenges of manual development. The approach demonstrates significant improvements in scalability, consistency, and efficiency through automated knowledge acquisition and continuous refinement cycles.
AI · Neutral · arXiv – CS AI · Mar 16 · 7/10
🧠 Researchers developed a supervised fine-tuning approach to align large language model agents with specific economic preferences, addressing systematic deviations from rational behavior in strategic environments. The study demonstrates how LLM agents can be trained to follow either self-interested or morally-guided strategies, producing distinct outcomes in economic games and pricing scenarios.
AI · Bullish · arXiv – CS AI · Mar 16 · 7/10
🧠 Researchers introduce LightMoE, a new framework that compresses Mixture-of-Experts language models by replacing redundant expert modules with parameter-efficient alternatives. The method achieves 30-50% compression rates while maintaining or improving performance, addressing the substantial memory demands that limit MoE model deployment.
AI · Bullish · arXiv – CS AI · Mar 16 · 7/10
🧠 Researchers have developed Pyramid MoA, a new framework that optimizes large language model inference costs by using a hierarchical router system that escalates queries to more expensive models only when necessary. The system achieves up to 62.7% cost savings while maintaining Oracle-level accuracy on various benchmarks including coding and mathematical reasoning tasks.
🧠 Llama
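The escalation idea behind such hierarchical routing can be sketched as a confidence-gated cascade (the threshold, the tier interface, and the self-reported confidence signal are all assumptions, not the paper's API):

```python
def cascade(query, tiers, threshold=0.8):
    """Route a query through model tiers ordered cheap -> expensive,
    escalating only when the current tier's confidence is below the
    threshold. The last tier always answers."""
    for answer_fn in tiers[:-1]:
        answer, confidence = answer_fn(query)
        if confidence >= threshold:
            return answer          # cheap tier is confident enough
    answer, _ = tiers[-1](query)   # fall through to the expensive tier
    return answer

# Hypothetical tiers: the cheap model is unsure on "hard" queries.
cheap  = lambda q: ("cheap:" + q, 0.6 if "hard" in q else 0.9)
costly = lambda q: ("costly:" + q, 0.99)

print(cascade("easy question", [cheap, costly]))  # cheap:easy question
print(cascade("hard question", [cheap, costly]))  # costly:hard question
```

Real routers typically learn the escalation decision rather than reading a raw confidence score, but the control flow is the same.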
AI · Bullish · arXiv – CS AI · Mar 16 · 7/10
🧠 Researchers introduced QMatSuite, an open-source platform that enables AI agents to accumulate and apply knowledge across computational materials science experiments. The system demonstrated significant improvements, reducing reasoning overhead by 67% and cutting deviation from literature benchmarks from 47% to 3%.
AI · Bearish · arXiv – CS AI · Mar 16 · 7/10
🧠 Researchers identify a significant bias in Large Language Models when processing multiple updates to the same factual information within context. The study reveals that LLMs struggle to accurately retrieve the most recent version of updated facts, with performance degrading as the number of updates increases, similar to memory interference patterns observed in cognitive psychology.
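A minimal probe for this failure mode builds a context containing several sequential updates to one fact and checks whether the model's answer reflects the latest one (the prompt wording and scoring below are assumptions, not the paper's protocol):

```python
def build_prompt(entity, updates):
    """Context with n sequential updates to one fact, then a question
    asking for the current value."""
    lines = [f"Update {i + 1}: {entity} is now {v}."
             for i, v in enumerate(updates)]
    lines.append(f"Question: What is the current value of {entity}?")
    return "\n".join(lines)

def score_recency(model, entity, updates):
    """1.0 if the model's answer contains the most recent value."""
    answer = model(build_prompt(entity, updates))
    return float(updates[-1] in answer)

# A toy 'model' that latches onto the first update -- the interference
# failure mode the study reports.
first_update_model = lambda prompt: prompt.splitlines()[0]
print(score_recency(first_update_model, "the door code",
                    ["1111", "2222", "3333"]))  # 0.0
```

Sweeping the length of `updates` reproduces the "performance degrades as updates increase" measurement against a real model.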
AI · Bullish · arXiv – CS AI · Mar 16 · 7/10
🧠 Researchers developed a new method for training AI language models using multi-turn user conversations through self-distillation, leveraging follow-up messages to improve model alignment. Testing on real-world WildChat conversations showed improvements in alignment and instruction-following benchmarks while enabling personalization without explicit feedback.
AI · Neutral · arXiv – CS AI · Mar 12 · 7/10
🧠 Research examining five major LLMs found they exhibit human-like cognitive biases when evaluating judicial scenarios, showing stronger virtuous victim effects but reduced credential-based halo effects compared to humans. The study suggests LLMs may offer modest improvements over human decision-making in judicial contexts, though variability across models limits current practical application.
🧠 ChatGPT · 🧠 Claude · 🧠 Sonnet
AI · Bearish · arXiv – CS AI · Mar 12 · 7/10
🧠 A new study reveals that large language models exhibit patterns similar to the Dunning-Kruger effect, where poorly performing AI models show severe overconfidence in their abilities. The research tested four major models across 24,000 trials, finding that Kimi K2 displayed the worst calibration with 72.6% overconfidence despite only 23.3% accuracy, while Claude Haiku 4.5 achieved the best performance with proper confidence calibration.
🧠 Claude · 🧠 Haiku · 🧠 Gemini
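The overconfidence figure such studies report can be approximated as mean stated confidence minus accuracy, a common calibration proxy (the paper's exact metric may differ):

```python
def overconfidence(confidences, correct):
    """Mean stated confidence minus accuracy over a set of trials.
    Positive values mean the model is overconfident -- the
    Dunning-Kruger-style gap described in the summary."""
    assert len(confidences) == len(correct)
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_conf - accuracy

# Hypothetical trial data: high confidence, low accuracy.
confs = [0.95, 0.9, 0.9, 0.85]
hits  = [1, 0, 0, 0]
print(round(overconfidence(confs, hits), 2))  # 0.65
```

On real data, each confidence would be elicited from the model per trial (e.g. "How sure are you, 0-100%?") and each hit graded against a gold answer.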
AI · Neutral · arXiv – CS AI · Mar 12 · 7/10
🧠 Researchers discover that the 'Lost in the Middle' phenomenon in transformer models, where performance is poor on middle context but strong on beginning and end content, is an inherent architectural property present even before training begins. The U-shaped performance bias stems from the mathematical structure of causal decoders with residual connections, creating a 'factorial dead zone' in middle positions.
AI · Neutral · arXiv – CS AI · Mar 12 · 7/10
🧠 A comprehensive study comparing reinforcement learning approaches for AI alignment finds that diversity-seeking algorithms don't outperform reward-maximizing methods in moral reasoning tasks. The research demonstrates that moral reasoning has more concentrated high-reward distributions than mathematical reasoning, making standard optimization methods equally effective without explicit diversity mechanisms.
AI · Bullish · arXiv – CS AI · Mar 12 · 7/10
🧠 Researchers propose Mashup Learning, a method that leverages historical model checkpoints to improve AI training efficiency. The technique identifies relevant past training runs, merges them, and uses the result as initialization, achieving 0.5-5% accuracy improvements while reducing training time by up to 37%.
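The merging step can be sketched as a weighted average of checkpoint parameter dicts (uniform weights, and the choice of which past runs to merge, are assumptions; the paper's selection heuristic is not described here):

```python
def merge_checkpoints(checkpoints, weights=None):
    """Average parameter dicts from past training runs to form an
    initialization. Assumes all checkpoints share the same keys."""
    if weights is None:
        weights = [1.0 / len(checkpoints)] * len(checkpoints)
    merged = {}
    for name in checkpoints[0]:
        merged[name] = sum(w * ckpt[name]
                           for w, ckpt in zip(weights, checkpoints))
    return merged

# Toy scalar 'checkpoints'; real ones would hold tensors per layer.
ckpt_a = {"layer.w": 1.0, "layer.b": 0.0}
ckpt_b = {"layer.w": 3.0, "layer.b": 2.0}
print(merge_checkpoints([ckpt_a, ckpt_b]))
# {'layer.w': 2.0, 'layer.b': 1.0}
```

With tensor frameworks the same loop works per parameter tensor, since `w * tensor` and `sum` broadcast elementwise.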
AI · Bullish · arXiv – CS AI · Mar 12 · 7/10
🧠 Researchers introduce Targeted Reasoning Unlearning (TRU), a new method for removing specific knowledge from large language models while preserving general capabilities. The approach uses reasoning-based targets to guide the unlearning process, addressing issues with previous gradient ascent methods that caused unintended capability degradation.
AI · Bullish · arXiv – CS AI · Mar 12 · 7/10
🧠 Researchers developed ES-dLLM, a training-free inference acceleration framework that speeds up diffusion large language models by selectively skipping tokens in early layers based on importance scoring. The method achieves 5.6x to 16.8x speedup over vanilla implementations while maintaining generation quality, offering a promising alternative to autoregressive models.
🏢 Nvidia
AI · Bearish · arXiv – CS AI · Mar 12 · 7/10
🧠 Researchers have developed a risk assessment framework for open-source Model Context Protocol (MCP) servers, revealing significant security vulnerabilities through static code analysis. The study found many MCP servers contain exploitable weaknesses that compromise confidentiality, integrity, and availability, highlighting the need for secure-by-design development as these tools become widely adopted for LLM agents.
AI · Neutral · arXiv – CS AI · Mar 12 · 7/10
🧠 Researchers introduce TRACED, a framework that evaluates AI reasoning quality through geometric analysis rather than traditional scalar probabilities. The system identifies correct reasoning as high-progress stable trajectories, while AI hallucinations show low-progress unstable patterns with high curvature fluctuations.
AI · Bullish · arXiv – CS AI · Mar 12 · 7/10
🧠 Researchers developed KernelSkill, a multi-agent framework that optimizes GPU kernel performance using expert knowledge rather than trial-and-error approaches. The system achieved 100% success rates and significant speedups (1.92x to 5.44x) over existing methods, addressing a critical bottleneck in AI system efficiency.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠 Researchers have developed two software techniques (OAS and MBS) that dramatically improve MXFP4 quantization accuracy for Large Language Models, reducing the performance gap with NVIDIA's NVFP4 from 10% to below 1%. This breakthrough makes MXFP4 a viable alternative while preserving its 12% hardware-efficiency advantage in tensor cores.
🏢 Nvidia
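For context, MXFP4-style quantization groups values into small blocks that share a power-of-two scale, snapping each scaled value onto the FP4 (E2M1) grid {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}. A simplified sketch (block size, rounding mode, and scale selection are assumptions; OAS/MBS themselves are not implemented here):

```python
import math

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes

def quantize_block(block):
    """Quantize one block with a shared power-of-two scale chosen so
    the largest value lands near the top of the FP4 grid, then snap
    each scaled value to the nearest grid magnitude."""
    amax = max(abs(x) for x in block) or 1.0
    scale = 2.0 ** math.floor(math.log2(6.0 / amax))  # amax*scale <= 6

    def snap(x):
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) * scale - g))
        return math.copysign(mag, x) / scale  # dequantized value

    return [snap(x) for x in block]

print(quantize_block([0.1, -0.7, 2.6, 5.9]))
# [0.0, -0.5, 3.0, 6.0]
```

Real MX blocks are 32 elements with an 8-bit power-of-two (E8M0) scale; the snap-to-grid and shared-scale structure is the part this sketch shows.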