y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#ai-agents News & Analysis

449 articles tagged with #ai-agents. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

449 articles
AIBullisharXiv โ€“ CS AI ยท Mar 37/104
๐Ÿง 

AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent

Researchers introduced AgentMath, a new AI framework that combines language models with code interpreters to solve complex mathematical problems more efficiently than current Large Reasoning Models. The system achieves state-of-the-art performance on mathematical competition benchmarks, with AgentMath-30B-A3B reaching 90.6% accuracy on AIME24 while remaining competitive with much larger models like OpenAI-o3.

AIBullisharXiv โ€“ CS AI ยท Mar 37/104
๐Ÿง 

AgentOCR: Reimagining Agent History via Optical Self-Compression

Researchers introduce AgentOCR, a framework that converts AI agent interaction histories from text to compressed visual format, reducing token usage by over 50% while maintaining 95% performance. The system uses visual caching and adaptive compression to address memory bottlenecks in large language model deployments.

AIBullisharXiv โ€“ CS AI ยท Mar 37/102
๐Ÿง 

The FM Agent

Researchers have developed FM Agent, a multi-agent AI framework that combines large language models with evolutionary search to autonomously solve complex research problems. The system achieved state-of-the-art results across multiple domains including operations research, machine learning, and GPU optimization without human intervention.

AINeutralarXiv โ€“ CS AI ยท Mar 37/103
๐Ÿง 

InnoGym: Benchmarking the Innovation Potential of AI Agents

Researchers introduce InnoGym, the first benchmark designed to evaluate AI agents' innovation potential rather than just correctness. The framework measures both performance gains and methodological novelty across 18 real-world engineering and scientific tasks, revealing that while AI agents can generate novel approaches, they lack robustness for significant performance improvements.

AIBullisharXiv โ€“ CS AI ยท Mar 37/103
๐Ÿง 

GenDB: The Next Generation of Query Processing -- Synthesized, Not Engineered

Researchers propose GenDB, a revolutionary database system that uses Large Language Models to synthesize query execution code instead of relying on traditional engineered query processors. Early prototype testing shows GenDB outperforms established systems like DuckDB, Umbra, and PostgreSQL on OLAP workloads.

AIBearisharXiv โ€“ CS AI ยท Mar 37/104
๐Ÿง 

AudAgent: Automated Auditing of Privacy Policy Compliance in AI Agents

Researchers have developed AudAgent, an automated tool that monitors AI agents in real-time to ensure they comply with their stated privacy policies. The tool revealed that many AI agents powered by major providers like Claude, Gemini, and DeepSeek fail to protect highly sensitive data like SSNs and violate their own privacy policies.

$LINK
AINeutralarXiv โ€“ CS AI ยท Mar 37/104
๐Ÿง 

PsyAgent: Constructing Human-like Agents Based on Psychological Modeling and Contextual Interaction

Researchers introduce PsyAgent, a new AI framework that creates human-like agents by combining personality modeling based on Big Five traits with contextual social awareness. The system uses structured prompts and fine-tuning to produce AI agents that maintain stable personality traits while adapting appropriately to different social situations and roles.

AIBullisharXiv โ€“ CS AI ยท Mar 37/103
๐Ÿง 

MagicAgent: Towards Generalized Agent Planning

Researchers have developed MagicAgent, a series of foundation models designed for generalized AI agent planning that outperforms existing sub-100B models and even surpasses leading ultra-scale models like GPT-5.2. The models achieve superior performance through a novel synthetic data framework and two-stage training paradigm that addresses gradient interference in multi-task learning.

AINeutralarXiv โ€“ CS AI ยท Mar 37/104
๐Ÿง 

GLEE: A Unified Framework and Benchmark for Language-based Economic Environments

Researchers introduce GLEE, a new framework for studying how Large Language Models behave in economic games and strategic interactions. The study reveals that LLM performance in economic scenarios depends heavily on market parameters and model selection, with complex interdependent effects on outcomes.

AIBullisharXiv โ€“ CS AI ยท Mar 37/104
๐Ÿง 

EnterpriseBench Corecraft: Training Generalizable Agents on High-Fidelity RL Environments

Surge AI introduces CoreCraft, the first environment in EnterpriseBench for training AI agents on realistic enterprise workflows. Training GLM 4.6 on this high-fidelity customer support simulation improved task performance from 25% to 37% and showed positive transfer to other benchmarks, demonstrating that quality training environments enable generalizable AI capabilities.

AIBullisharXiv โ€“ CS AI ยท Mar 37/103
๐Ÿง 

PolySkill: Learning Generalizable Skills Through Polymorphic Abstraction

Researchers introduce PolySkill, a framework that enables AI agents to learn generalizable skills by separating abstract goals from concrete implementations, inspired by software engineering polymorphism. The method improves skill reuse by 1.7x and boosts success rates by up to 13.9% on web navigation tasks while reducing execution steps by over 20%.

AI ร— CryptoBullishCoinTelegraph โ€“ AI ยท Feb 277/108
๐Ÿค–

Alchemy introduces autonomous payment rails for AI agents on Base

Alchemy has launched autonomous payment rails for AI agents on the Base blockchain, enabling automated payments for blockchain data and compute credits using USDC. This development supports the growing trend of autonomous crypto applications by providing seamless payment infrastructure for AI-driven systems.

Alchemy introduces autonomous payment rails for AI agents on Base
AI ร— CryptoBullishBitcoinist ยท Feb 277/104
๐Ÿค–

Ethereum Network Takes The Crown As The Home Of On-Chain AI Agents

Ethereum is emerging as the dominant blockchain for AI agent development, expanding beyond its traditional DeFi leadership role. The network is now positioning itself as the primary platform for on-chain AI innovation, demonstrating constructive rather than speculative growth.

$ETH
AI ร— CryptoBullishThe Block ยท Feb 277/107
๐Ÿค–

The surge of RWAs, AI and tokenized equities, with Galaxy and Ondo

DeFi leaders from Ondo and Galaxy Digital discuss the emerging trends of Real World Assets (RWAs), AI integration, and tokenized equities in decentralized finance. The conversation explores how AI agents are expected to transform DeFi trading practices and provides a bullish outlook despite current market conditions.

AIBullishOpenAI News ยท Feb 277/105
๐Ÿง 

Introducing the Stateful Runtime Environment for Agents in Amazon Bedrock

Amazon Bedrock introduces a new Stateful Runtime Environment for AI agents that provides persistent orchestration, memory capabilities, and secure execution for complex multi-step AI workflows. The service leverages OpenAI technology to enable more sophisticated AI agent operations with maintained state across interactions.

AIBullishOpenAI News ยท Feb 277/106
๐Ÿง 

OpenAI and Amazon announce strategic partnership

OpenAI and Amazon have announced a strategic partnership that will integrate OpenAI's Frontier platform with AWS infrastructure. The collaboration aims to expand AI capabilities through enhanced infrastructure, custom model development, and enterprise AI agent solutions.

AINeutralarXiv โ€“ CS AI ยท Feb 277/106
๐Ÿง 

ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices

Researchers introduce ProactiveMobile, a new benchmark for developing AI agents that can proactively anticipate user needs on mobile devices rather than just responding to commands. The benchmark includes over 3,600 test instances across 14 scenarios, with current models achieving low success rates, indicating significant room for improvement in proactive AI capabilities.

AINeutralarXiv โ€“ CS AI ยท Feb 277/103
๐Ÿง 

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Researchers introduce Tool Decathlon (Toolathlon), a comprehensive benchmark for evaluating AI language agents across 32 software applications and 604 tools in realistic, multi-step scenarios. The benchmark reveals significant limitations in current AI models, with the best performer (Claude-4.5-Sonnet) achieving only 38.6% success rate on complex, real-world tasks.

AIBearisharXiv โ€“ CS AI ยท Feb 277/105
๐Ÿง 

Silent Egress: When Implicit Prompt Injection Makes LLM Agents Leak Without a Trace

Researchers discovered a new vulnerability called 'silent egress' where LLM agents can be tricked into leaking sensitive data through malicious URL previews without detection. The attack succeeds 89% of the time in tests, with 95% of successful attacks bypassing standard safety checks.

AIBullisharXiv โ€“ CS AI ยท Feb 277/104
๐Ÿง 

MiroFlow: Towards High-Performance and Robust Open-Source Agent Framework for General Deep Research Tasks

Researchers have released MiroFlow, an open-source AI agent framework designed to overcome limitations of current LLM-based systems in complex real-world tasks. The framework features agent graph orchestration, deep reasoning capabilities, and robust workflow execution, achieving state-of-the-art performance across multiple benchmarks including GAIA and FutureX.

AINeutralarXiv โ€“ CS AI ยท Feb 277/106
๐Ÿง 

VeRO: An Evaluation Harness for Agents to Optimize Agents

Researchers introduced VeRO (Versioning, Rewards, and Observations), a new evaluation framework for testing AI coding agents that can optimize other AI agents through iterative improvement cycles. The system provides reproducible benchmarks and structured execution traces to systematically measure how well coding agents can improve target agents' performance.

AINeutralarXiv โ€“ CS AI ยท Feb 277/107
๐Ÿง 

Vibe Researching as Wolf Coming: Can AI Agents with Skills Replace or Augment Social Scientists?

A research paper introduces the concept of 'vibe researching' where AI agents can autonomously execute entire research pipelines from idea to submission using specialized skills. The study analyzes how AI agents excel at speed and methodological tasks but struggle with theoretical originality and tacit knowledge, creating a cognitive rather than sequential delegation boundary in research workflows.

AIBullisharXiv โ€“ CS AI ยท Feb 277/105
๐Ÿง 

Agent Behavioral Contracts: Formal Specification and Runtime Enforcement for Reliable Autonomous AI Agents

Researchers introduce Agent Behavioral Contracts (ABC), a formal framework for specifying and enforcing reliable behavior in autonomous AI agents. The system addresses critical issues of drift and governance failures in AI deployments by implementing runtime-enforceable contracts that achieve 88-100% compliance rates and significantly improve violation detection.

AIBullisharXiv โ€“ CS AI ยท Feb 277/107
๐Ÿง 

General Agent Evaluation

Researchers have developed Exgentic, a new framework for evaluating general-purpose AI agents that can perform tasks across different environments without domain-specific tuning. The study benchmarked five prominent agent implementations and found that general agents can achieve performance comparable to specialized agents, establishing the first Open General Agent Leaderboard.

AINeutralarXiv โ€“ CS AI ยท Feb 277/107
๐Ÿง 

LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?

LiveMCPBench introduces the first large-scale benchmark evaluating AI agents' ability to navigate real-world tasks using Model Context Protocol (MCP) tools across multiple servers. The benchmark reveals significant performance gaps, with top model Claude-Sonnet-4 achieving 78.95% success while most models only reach 30-50%, identifying tool retrieval as the primary bottleneck.

$OCEAN