AI × Crypto News Feed

Real-time AI-curated news from 92,322+ articles across 50+ sources. Sentiment analysis, importance scoring, and key takeaways — updated every 15 minutes.

92322 articles

AIBullisharXiv – CS AI · Jun 237/10

🧠

Training the Orchestrator: A Supervised Approach to End-to-End PDDL Planning with LLM Agents

Researchers introduce HALO, a trained orchestrator system that reduces LLM API costs by 45x compared to GPT-4-mini while matching performance on PDDL planning tasks. By leveraging verifier-certified trajectories as direct supervision rather than prompting frontier models at every step, HALO achieves significant cost efficiency improvements across multiple planning benchmarks.

🧠 GPT-5🧠 Gemini

AIBullisharXiv – CS AI · Jun 237/10

🧠

VISTA Architect: A graph database-oriented health AI system demonstrated in multidisciplinary tumor boards

Stanford Medicine researchers unveiled VISTA Architect, a graph database-powered AI system that integrates large language models with electronic health records to achieve 96.4% accuracy in clinical data extraction for tumor board preparation. The architecture precomputes patient histories into organized knowledge graphs, reducing processing time and latency compared to traditional RAG approaches while maintaining full data provenance.

AIBullisharXiv – CS AI · Jun 237/10

🧠

CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents

Researchers introduce CLI-Universe, a systematic framework for generating high-quality training data for terminal agents by sampling task combinations across multiple capability dimensions and subjecting candidates to rigorous executable verification. Fine-tuning Qwen3-32B on the resulting CLI-Universe-6K dataset achieves state-of-the-art performance on Terminal-Bench 2.0 at 33.4%, outperforming much larger models and demonstrating that structured, high-fidelity data synthesis significantly improves AI agent efficiency.

AIBearisharXiv – CS AI · Jun 237/10

🧠

Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?

Researchers introduced AgentCIBench, a safety testing framework that reveals critical privacy vulnerabilities in computer-use agents (CUAs) that access multiple personal applications. Testing 15 frontier agents found that 11 leak sensitive information on over 50% of scenarios, exposing risks from UI co-location, task ambiguity, and recipient misalignment.

AIBearisharXiv – CS AI · Jun 237/10

🧠

Governance Decay: How Context Compaction Silently Erases Safety Constraints in Long-Horizon LLM Agents

Researchers discover that LLM agents lose safety compliance when governance constraints are compressed or summarized during long sessions, with violations rising from 0% to 59% after context compaction. The study introduces a benchmark demonstrating this 'Governance Decay' failure mode and proposes Constraint Pinning as a training-free mitigation.

AINeutralarXiv – CS AI · Jun 237/10

🧠

PaperClaw: Harnessing Agents for Autonomous Research and Human-in-the-Loop Refinement

PaperClaw is a multi-agent AI system that automates academic research from conception to publication, combining autonomous operation with human-in-the-loop refinement. The system curates literature, generates hypotheses, tests them iteratively, and produces venue-compliant papers while maintaining verifiable citations and reproducible results.

AIBullisharXiv – CS AI · Jun 237/10

🧠

SpotAttention: Plug-In Block-Sparse Routing for Pretrained Long-Context Transformers

SpotAttention is a lightweight machine learning technique that reduces computational costs for large language models processing long text sequences. By learning to identify only the most relevant tokens to attend to, it achieves 3.9x faster decoding speeds while maintaining accuracy at context lengths eight times longer than training, addressing a critical efficiency bottleneck in modern LLMs.

AIBullisharXiv – CS AI · Jun 237/10

🧠

Active Inference as the Test-Time Scaling Law for Physical AI Agents

Researchers introduce a novel test-time scaling law for physical AI agents based on active inference principles, enabling agents to generalize to unforeseen scenarios by dynamically updating policies through reasoning about prediction errors. The approach outperforms existing reinforcement learning methods by 36% in inference efficiency on autonomous driving tasks and scales with real-world experience rather than just training data or model size.

AINeutralarXiv – CS AI · Jun 237/10

🧠

When Is Emergent Consensus Real? A Measured Coupling Gain and a Validity Diagnostic for LLM Agent Societies

Researchers introduce a measurement framework called 'coupling gain' to quantify whether consensus or polarization in LLM agent societies reflects genuine social dynamics or model artifacts. The study reveals that frontier LLMs do not spontaneously polarize, and that emergent consensus claims must be validated against initial conditions and context-specific coupling metrics rather than assumed theoretical models.

AIBearisharXiv – CS AI · Jun 237/10

🧠

Plans Don't Persist: Why Context Management Is Load Bearing for LLM Agents

Researchers demonstrate that large language model agents fail to maintain plans as persistent internal state, instead relying on plans remaining in the context window. Using diagnostic techniques on Llama-3.1-70B and DeepSeek-R1, the study shows plan signal decays rapidly when compressed out of context, with practical implications for agent reliability in long-horizon tasks.

🧠 Llama

AINeutralarXiv – CS AI · Jun 237/10

🧠

A Differentiable Atari VCS:A Complex, Fully Known Ground Truth for Explainable AI

Researchers have created fully differentiable emulators of the Atari 2600 computer system in Julia and JAX, solving a fundamental problem in explainable AI by providing a complex system with complete ground truth. The emulators are bit-for-bit identical to the original hardware while remaining mathematically differentiable, enabling gradient-based analysis to understand how AI systems make decisions.

AIBullisharXiv – CS AI · Jun 237/10

🧠

Finding the Evidence: Discovering Decision-Supporting Tokens for On-Policy Reasoning Distillation

Researchers introduce DEAR, a novel on-policy distillation method that improves AI model training by distinguishing between decision tokens (where models branch) and evidence tokens (supporting intermediate steps). The technique achieves significant performance gains of up to 5.7% on code generation and 2.5% on math benchmarks compared to standard distillation approaches.

AIBullisharXiv – CS AI · Jun 237/10

🧠

ENVS: Environment-Native Verified Search for Long-Horizon GUI Agents

Researchers introduce ENVS (Environment-Native Verified Search), a novel training approach for GUI agents that discovers verified action trajectories in live desktop environments before policy optimization. The method achieves 30.3 pass@8 on OSWorld benchmarks while reducing computational requirements by 25-28% compared to existing reinforcement learning approaches, and demonstrates robust performance even under simulated desktop interruptions.

AIBullisharXiv – CS AI · Jun 237/10

🧠

Keyless Attention: Value-Space Routing and Value-Only Caching for Efficient Transformers

Researchers propose Keyless Attention, a transformer mechanism that eliminates key projections to reduce KV cache memory by 50% while maintaining or improving performance across multiple model architectures. The approach introduces a value-space routing matrix that replaces the traditional key projection, demonstrating competitive results on perplexity and downstream benchmarks.

🏢 Perplexity🧠 Llama

AIBullisharXiv – CS AI · Jun 237/10

🧠

FOCA: Future-Oriented Conditioning for Data-Efficient Vision-Language-Action Adaptation

Researchers introduce FOCA, a new framework for improving Vision-Language-Action (VLA) models in robotic control with limited training data. The method achieves significant performance gains in few-shot learning scenarios, reaching 95.7% success on benchmark tasks with just 20 demonstrations and up to 26% improvements on real robots.

AIBullisharXiv – CS AI · Jun 237/10

🧠

AOHP: An Open-Source OS-Level Agent Harness for Personalized, Efficient and Secure Interaction

Researchers introduce AOHP, an open-source OS-level agent harness built on Android that treats AI agents as first-class operating system actors. The framework addresses architectural gaps in current systems by enabling personalized service composition, efficient agent interfaces, and secure information flow, demonstrating significant improvements in task completion rates, execution costs, and security compliance.

AIBearisharXiv – CS AI · Jun 237/10

🧠

HOLMES: Evaluating Higher-Order Logical Reasoning in LLMs

Researchers introduce HOLMES, a new benchmark for evaluating higher-order logical reasoning in large language models, revealing that current LLMs struggle significantly with complex symbolic reasoning tasks that go beyond simple first-order logic. The benchmark demonstrates critical gaps in AI reliability, with the best-performing models achieving only 59.54% accuracy on tasks involving reasoning over rules, predicates, and constraints across legal and financial domains.

AIBullisharXiv – CS AI · Jun 237/10

🧠

Power Systems Agent Benchmark: Executable Evaluation of AI Agents in Electric Power Engineering

Researchers introduce the Power Systems Agent Benchmark, an executable evaluation framework for AI agents in electric power engineering with 41 task families across eight engineering domains. The benchmark uses deterministic evaluation to assess whether AI agents can perform real power-system engineering tasks correctly, marking the first major standardized assessment tool for this emerging application area.

AIBearisharXiv – CS AI · Jun 237/10

🧠

Simulated Customers Never Walk Away: Decision Fidelity of LLM User Simulators Measured Against Real Purchase Outcomes

Researchers demonstrate a critical flaw in using large language models as user simulators for training conversational AI: LLM simulators systematically misrepresent how real customers disengage from purchases, showing excessive deliberation and muted resistance compared to actual users. This bias could lead developers to overestimate the effectiveness of sales agents trained on synthetic user interactions.

AIBearisharXiv – CS AI · Jun 237/10

🧠

The Language-Energy Divide: Measuring Energy Costs of Multilingual LLM Inference

A comprehensive study reveals that multilingual LLM inference consumes vastly different amounts of energy across languages, with Pashto requiring 179 times more energy than English for identical requests. The disparity stems from complex script processing and token generation inefficiency in low-resource languages, compounded by a double penalty where high-energy languages also deliver lower accuracy.

AIBullisharXiv – CS AI · Jun 237/10

🧠

Reinforcement learning to improve large language model-based automated code compliance systems

Researchers introduce P4IR, a two-stage framework combining supervised fine-tuning and Group Relative Policy Optimization to improve LLM accuracy in automated building code compliance systems. The approach reduces errors by up to 38.6% compared to baseline models and outperforms leading LLMs like Claude and GPT in zero-shot settings.

🧠 GPT-5🧠 Claude🧠 Sonnet

AIBearisharXiv – CS AI · Jun 237/10

🧠

Benchmarking Robot Memory Under Interference

Researchers introduce RoboMME-Interference, a benchmark testing how robot memory systems perform across multiple sessions with irrelevant distractions. Testing current memory-augmented AI models reveals significant performance degradation as unrelated sessions accumulate, highlighting a critical gap in long-context robustness for real-world robot deployment.

AINeutralarXiv – CS AI · Jun 237/10

🧠

The AI Evaluability Gap: The Missing Layer for Managing Risk and Sustaining Value

Researchers introduce the concept of 'Evaluability' to address the AI Evaluability Gap—the insufficient evidence organizations have to make confident governance decisions about AI risk and value. The framework proposes six properties of evaluable evidence and distinguishes between operational and investment certification to strengthen AI governance practices.

AIBullisharXiv – CS AI · Jun 237/10

🧠

Curriculum Reinforcement Learning Can Incentivize Reasoning Capacity in LLMs Beyond the Base Model

Researchers present a boundary-aware Curriculum Reinforcement Learning approach that improves large language model reasoning capacity beyond what standard RLVR methods achieve. Testing across Qwen, Llama, and DeepSeek models shows 9.8 percentage point improvements in pass@256 scores over base models, suggesting a more scalable path for continuous LLM advancement.

🧠 Llama

AINeutralarXiv – CS AI · Jun 237/10

🧠

Closed-loop Auto Research for Molecular Property Prediction: Discovering and Certifying Generalizable Improvements

Researchers demonstrate that closed-loop automated machine learning systems can discover generalizable improvements in molecular property prediction by having language-model agents modify features, models, and acquire external evidence. Testing across 36 molecular endpoints reveals that while some improvements validate strongly, they don't consistently transfer to held-out test sets, highlighting critical challenges in ensuring reproducibility of AI-driven research discoveries.

← PrevPage 111 of 3693Next →