954 articles tagged with #llm. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bearish · arXiv – CS AI · 1d ago · 7/10
🧠Researchers have identified a critical privacy vulnerability in LLM-based multi-agent systems, demonstrating that communication topologies can be reverse-engineered through black-box attacks. The Communication Inference Attack (CIA) achieves up to 99% accuracy in inferring how agents communicate, exposing significant intellectual property and security risks in AI systems.
AI · Bullish · arXiv – CS AI · 1d ago · 7/10
🧠Researchers propose Schema-Adaptive Tabular Representation Learning, which uses LLMs to convert structured clinical data into semantic embeddings that transfer across different electronic health record schemas without retraining. When combined with imaging data for dementia diagnosis, the method achieves state-of-the-art results and outperforms board-certified neurologists on retrospective diagnostic tasks.
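The core idea of schema-adaptive tabular representation, serializing a row into natural-language text so an LLM encoder can embed it regardless of column naming, can be sketched as below. The serialization format and the `toy_embed` stand-in (a hash-based bag-of-words in place of a real LLM encoder) are illustrative assumptions, not the paper's implementation.

```python
import hashlib

def serialize_record(record):
    """Flatten one EHR row into schema-agnostic text ("column is value")."""
    return "; ".join(f"{col.replace('_', ' ')} is {val}" for col, val in record.items())

def toy_embed(text, dim=16):
    """Stand-in for an LLM text encoder: hash each token into a fixed-size vector."""
    vec = [0.0] * dim
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    return vec

# Two schemas describing the same patient serialize to similar text,
# so a semantic encoder maps them to nearby embeddings without retraining.
row_a = {"age": 72, "mmse_score": 21, "hippocampal_volume_ml": 2.9}
row_b = {"patient_age": 72, "mmse": 21}
print(serialize_record(row_a))
print(serialize_record(row_b))
```

The point of the text detour is that "mmse_score" and "mmse" are close in an LLM's semantic space even though they never match as column names.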
AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers propose Generative Actor-Critic (GenAC), a new approach to value modeling in large language model reinforcement learning that uses chain-of-thought reasoning instead of one-shot scalar predictions. The method addresses a longstanding challenge in credit assignment by improving value approximation and downstream RL performance compared to existing value-based and value-free baselines.
AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers introduce ReflectiChain, an AI framework combining large language models with generative world models to improve semiconductor supply chain resilience against geopolitical disruptions. The system demonstrates 250% performance improvements over standard LLM approaches by integrating physical environmental constraints and autonomous policy learning, restoring operational capacity from 13.3% to 88.5% under extreme scenarios.
AI · Bullish · OpenAI News · 6d ago · 7/10
🧠OpenAI's suite of products—including ChatGPT, Codex, and developer APIs—demonstrates practical applications of artificial intelligence across work, software development, and consumer tasks. These tools represent a significant shift toward mainstream AI adoption, enabling organizations and individuals to integrate machine learning capabilities into everyday workflows.
🏢 OpenAI · 🧠 ChatGPT
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers propose a new constrained maximum likelihood estimation (MLE) method to accurately estimate failure rates of large language models by combining human-labeled data, automated judge annotations, and domain-specific constraints. The approach outperforms existing methods like Prediction-Powered Inference across various experimental conditions, providing a more reliable framework for LLM safety certification.
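The paper's exact estimator isn't given in the summary, but the Prediction-Powered Inference baseline it compares against has a simple form: use the cheap judge labels everywhere, then debias with the judge-vs-human gap measured on a small audited subset. A minimal sketch, with a clamp standing in for the paper's domain-specific constraints:

```python
def estimate_failure_rate(judge_all, judge_labeled, human_labeled, bounds=(0.0, 1.0)):
    """PPI-style estimate: judge mean on all data, debiased by the human-vs-judge
    gap on the small labeled subset, then clamped to known domain bounds."""
    mean = lambda xs: sum(xs) / len(xs)
    rectifier = mean(human_labeled) - mean(judge_labeled)  # judge bias on audited data
    est = mean(judge_all) + rectifier
    lo, hi = bounds
    return min(max(est, lo), hi)

# Judge flags 12% failures overall, but over-flags by 2 points on the audited subset.
judge_all = [1] * 12 + [0] * 88
judge_labeled = [1] * 6 + [0] * 44   # judge: 12% on the audited subset
human_labeled = [1] * 5 + [0] * 45   # humans: 10% on the same subset
print(estimate_failure_rate(judge_all, judge_labeled, human_labeled))  # ≈ 0.10
```

The constraint step matters when labeled data is scarce: the raw debiased estimate can leave the feasible range, and clamping (or richer constrained optimization, as in the paper) pulls it back.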
AI · Neutral · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers introduce 'error verifiability' as a new metric to measure whether AI-generated justifications help users distinguish correct from incorrect answers. The study found that common AI improvement methods don't enhance verifiability, but two new domain-specific approaches successfully improved users' ability to assess answer correctness.
AI · Neutral · arXiv – CS AI · Apr 7 · 7/10
🧠A new arXiv study identifies two key mechanisms behind reasoning hallucinations in large language models: Path Reuse and Path Compression. The study models next-token prediction as graph search, showing how memorized knowledge can override contextual constraints and how frequently used reasoning paths become shortcuts that lead to unsupported conclusions.
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers introduce a geometric framework for understanding LLM hallucinations, showing they arise from basin structures in latent space that vary by task complexity. The study demonstrates that factual tasks have clearer separation while summarization tasks show unstable, overlapping patterns, and proposes geometry-aware steering to reduce hallucinations without retraining.
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers introduce SkillX, an automated framework for building reusable skill knowledge bases for AI agents that addresses inefficiencies in current self-evolving paradigms. The system uses multi-level skill design, iterative refinement, and exploratory expansion to create plug-and-play skill libraries that improve task success and execution efficiency across different agents and environments.
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers propose SLaB, a novel framework for compressing large language models by decomposing weight matrices into sparse, low-rank, and binary components. The method achieves significant improvements over existing compression techniques, reducing perplexity by up to 36% at 50% compression rates without requiring model retraining.
🏢 Perplexity · 🧠 Llama
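The three-way split in SLaB can be illustrated with a toy decomposition: truncated SVD for the low-rank term, top-magnitude residuals for the sparse term, and a sign-plus-scale quantization for the binary term. This greedy sequence is an assumption for illustration, not the paper's actual solver, and the hyperparameters are arbitrary.

```python
import numpy as np

def slab_decompose(W, rank=4, sparse_frac=0.05):
    """Toy sparse + low-rank + binary split of a weight matrix (not the paper's method)."""
    # Low-rank part: truncated SVD.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    R = W - L
    # Sparse part: keep only the largest-magnitude residual entries.
    k = int(sparse_frac * R.size)
    thresh = np.partition(np.abs(R).ravel(), -k)[-k]
    S = np.where(np.abs(R) >= thresh, R, 0.0)
    # Binary part: sign of the remaining residual, with one shared scale.
    R2 = R - S
    B = np.sign(R2) * np.abs(R2).mean()
    return S, L, B

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
S, L, B = slab_decompose(W)
err = np.linalg.norm(W - (S + L + B)) / np.linalg.norm(W)
print(f"relative reconstruction error: {err:.3f}")
```

Each component compresses well on its own (a few singular vectors, a sparse index list, one bit per weight plus a scalar), which is why the sum can undercut single-format compression at the same budget.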
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers developed an LLM-powered evolutionary search method to automatically design uncertainty quantification systems for large language models, achieving up to 6.7% improvement in performance over manual designs. The study found that different AI models employ distinct evolutionary strategies, with some favoring complex linear estimators while others prefer simpler positional weighting approaches.
🧠 Claude · 🧠 Sonnet · 🧠 Opus
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers introduce LLMA-Mem, a memory framework for LLM multi-agent systems that balances team size with lifelong learning capabilities. The study reveals that larger agent teams don't always perform better long-term, and smaller teams with better memory design can outperform larger ones while reducing costs.
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers developed PALM (Portfolio of Aligned LLMs), a method to create a small collection of language models that can serve diverse user preferences without requiring individual models per user. The approach provides theoretical guarantees on portfolio size and quality while balancing system costs with personalization needs.
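One plausible way to build such a portfolio is greedy max-min coverage: repeatedly add the candidate model that most improves the worst-served user's best match. This greedy rule, the dot-product preference score, and all the numbers are illustrative assumptions; the paper's actual construction and guarantees are not specified in the summary.

```python
def build_portfolio(candidates, users, k, score):
    """Greedy sketch: pick k models so every user has at least one good match."""
    portfolio = []
    for _ in range(k):
        def worst_user(p):
            # The portfolio's value is how well it serves its worst-matched user.
            return min(max(score(m, u) for m in p) for u in users)
        best = max((c for c in candidates if c not in portfolio),
                   key=lambda c: worst_user(portfolio + [c]))
        portfolio.append(best)
    return portfolio

# Users weight (helpfulness, brevity); each candidate has an alignment profile.
dot = lambda m, u: sum(a * b for a, b in zip(m, u))
users = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
candidates = [(0.9, 0.1), (0.1, 0.9), (0.5, 0.5), (0.2, 0.2)]
print(build_portfolio(candidates, users, k=2, score=dot))
```

The tension the paper formalizes is visible even here: serving every user well pushes k up, while system cost pushes it down.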
AI · Neutral · arXiv – CS AI · Apr 7 · 7/10
🧠Research reveals a 'Persuasion Paradox' where LLM explanations increase user confidence but don't reliably improve human-AI team performance, and can actually undermine task accuracy. The study found that explanation effectiveness varies significantly by task type, with visual reasoning tasks seeing decreased error recovery while logical reasoning tasks benefited from explanations.
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Research published on arXiv demonstrates that large language models playing poker can develop sophisticated Theory of Mind capabilities when equipped with persistent memory, progressing to advanced levels of opponent modeling and strategic deception. The study found memory is necessary and sufficient for this emergent behavior, while domain expertise enhances but doesn't gate ToM development.
🧠 GPT-4
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers have developed a new low-bit mixed-precision attention kernel called Diagonal-Tiled Mixed-Precision Attention (DMA) that significantly speeds up large language model inference on NVIDIA B200 GPUs while maintaining generation quality. The technique uses microscaling floating-point (MXFP) data format and kernel fusion to address the high computational costs of transformer-based models.
🏢 Nvidia
AI · Bearish · arXiv – CS AI · Apr 7 · 7/10
🧠Research reveals that large language models like DeepSeek-V3.2, Gemini-3, and GPT-5.2 show rigid adaptation patterns when learning from changing environments, particularly struggling with loss-based learning compared to humans. The study found LLMs demonstrate asymmetric responses to positive versus negative feedback, with some models showing extreme perseveration after environmental changes.
🧠 GPT-5 · 🧠 Gemini
AI × Crypto · Neutral · arXiv – CS AI · Apr 7 · 7/10
🤖PolySwarm is a new multi-agent AI framework that uses 50 diverse large language models to trade on prediction markets like Polymarket, combining swarm intelligence with arbitrage strategies. The system outperformed single-model baselines in probability calibration and includes latency arbitrage capabilities to exploit pricing inefficiencies across markets.
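The latency-arbitrage leg mentioned above has a simple structure regardless of the agent framework around it: when the same event trades at different implied probabilities on two venues, buy YES on the cheaper one and NO on the dearer one if the gap beats fees. A minimal sketch under that assumption (the fee model and venue names are hypothetical):

```python
def arbitrage_signal(prices, fee=0.01):
    """Flag a cross-venue arbitrage: buy YES where the event is priced low and
    NO where it is priced high, if the gap exceeds round-trip fees."""
    venue_lo = min(prices, key=prices.get)
    venue_hi = max(prices, key=prices.get)
    gap = prices[venue_hi] - prices[venue_lo]
    if gap > 2 * fee:
        return {"buy_yes": venue_lo, "buy_no": venue_hi, "edge": gap - 2 * fee}
    return None

# The same event's YES contract trades at 0.55 on one venue and 0.61 on another.
print(arbitrage_signal({"polymarket": 0.55, "other_venue": 0.61}))
```

In the paper's setting, the LLM swarm supplies the calibrated probabilities; the arbitrage logic only needs the cross-venue price feed.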
AI · Bearish · arXiv – CS AI · Apr 7 · 7/10
🧠A research study reveals that AI-powered conversational interfaces can nearly triple the rate of sponsored product selection compared to traditional search engines (61.2% vs. 22.4%). Users largely fail to detect this commercial steering, even with explicit sponsor labels, indicating that current transparency measures are insufficient.
AI · Neutral · arXiv – CS AI · Apr 7 · 7/10
🧠A new research study reveals that truth directions in large language models are less universal than previously believed, with significant variations across different model layers, task types, and prompt instructions. The findings show truth directions emerge earlier for factual tasks but later for reasoning tasks, and are heavily influenced by model instructions and task complexity.
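A common way to find such a truth direction is a mass-mean probe: the normalized difference between the mean activation on true statements and on false ones. The sketch below demonstrates this on synthetic "hidden states"; the construction of the data is entirely illustrative, and real work would use activations from a specific model layer, which is exactly where the paper finds the variation.

```python
import numpy as np

def truth_direction(acts_true, acts_false):
    """Mass-mean probe: direction from the false-statement mean to the true-statement mean."""
    d = acts_true.mean(axis=0) - acts_false.mean(axis=0)
    return d / np.linalg.norm(d)

def truth_score(activation, direction):
    """Project an activation onto the direction; higher means 'more true-like'."""
    return float(activation @ direction)

rng = np.random.default_rng(1)
base = rng.standard_normal(32)
hidden_axis = rng.standard_normal(32)
# Synthetic hidden states: true statements shifted along a latent axis, plus noise.
acts_true = base + 0.8 * hidden_axis + 0.1 * rng.standard_normal((100, 32))
acts_false = base - 0.8 * hidden_axis + 0.1 * rng.standard_normal((100, 32))
d = truth_direction(acts_true, acts_false)
print(truth_score(acts_true[0], d) > truth_score(acts_false[0], d))  # True
```

The paper's point is that `d` fitted at one layer, task type, or prompt does not reliably transfer to another, so a single probe like this overstates how universal the direction is.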
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers introduce Multi-Objective Control (MOC), a new approach that trains a single large language model to generate personalized responses based on individual user preferences across multiple objectives. The method uses multi-objective optimization principles in reinforcement learning from human feedback to create more controllable and adaptable AI systems.
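The standard multi-objective building block behind such methods is scalarization: each user supplies a weight vector over objectives, and the policy is trained against the weighted sum of per-objective rewards. A minimal sketch (objective names and numbers are invented for illustration):

```python
def scalarized_reward(objective_rewards, user_weights):
    """Weighted sum of per-objective rewards; the weight vector conditions the policy."""
    assert abs(sum(user_weights.values()) - 1.0) < 1e-9  # weights form a preference simplex
    return sum(user_weights[k] * objective_rewards[k] for k in user_weights)

rewards = {"helpfulness": 0.9, "brevity": 0.3, "formality": 0.6}
terse_user = {"helpfulness": 0.3, "brevity": 0.6, "formality": 0.1}
verbose_user = {"helpfulness": 0.7, "brevity": 0.0, "formality": 0.3}
print(scalarized_reward(rewards, terse_user))    # 0.3*0.9 + 0.6*0.3 + 0.1*0.6 = 0.51
print(scalarized_reward(rewards, verbose_user))  # 0.7*0.9 + 0.0*0.3 + 0.3*0.6 = 0.81
```

The same response thus earns different rewards under different users, which is how one model learns to condition its behavior on the preference vector instead of requiring one fine-tuned model per user.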
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers propose PassiveQA, a new AI framework that teaches language models to recognize when they don't have enough information to answer questions, choosing to ask for clarification or abstain rather than hallucinate responses. The three-action system (Answer, Ask, Abstain) uses supervised fine-tuning to align model behavior with information sufficiency, showing significant improvements in reducing hallucinations.
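The three-action decision structure described above can be sketched as a simple policy over an information-sufficiency signal. In the paper that alignment is learned via supervised fine-tuning; the explicit thresholds and the `clarifiable` flag here are illustrative assumptions that make the structure concrete.

```python
def decide_action(sufficiency, clarifiable, answer_tau=0.8, ask_tau=0.4):
    """Three-action policy: Answer when context suffices, Ask when a clarifying
    question could close the gap, otherwise Abstain instead of guessing."""
    if sufficiency >= answer_tau:
        return "Answer"
    if sufficiency >= ask_tau and clarifiable:
        return "Ask"
    return "Abstain"

print(decide_action(0.9, clarifiable=False))  # Answer
print(decide_action(0.6, clarifiable=True))   # Ask
print(decide_action(0.2, clarifiable=True))   # Abstain
```

The anti-hallucination effect comes from the middle branch: questions that a vanilla model would answer anyway get routed to a clarifying question or a refusal when the context does not support an answer.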
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers introduce ROSClaw, a new AI framework that integrates large language models with robotic systems to improve multi-agent collaboration and long-horizon task execution. The framework addresses critical gaps between semantic understanding and physical execution by using unified vision-language models and enabling real-time coordination between simulated and real-world robots.
AI × Crypto · Neutral · arXiv – CS AI · Apr 7 · 7/10
🤖Researchers introduced CREBench, a benchmark for evaluating large language models on cryptographic binary reverse engineering. The best-performing model (GPT-5.4) achieved a 64.03% success rate versus 92.19% for human experts, showing AI still lags behind human expertise in cryptographic analysis tasks.
🧠 GPT-5