449 articles tagged with #ai-agents. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠 Researchers introduced AgentMath, a new AI framework that combines language models with code interpreters to solve complex mathematical problems more efficiently than current Large Reasoning Models. The system achieves state-of-the-art performance on mathematical competition benchmarks, with AgentMath-30B-A3B reaching 90.6% accuracy on AIME24 while remaining competitive with much larger models like OpenAI-o3.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠 Researchers introduce AgentOCR, a framework that converts AI agent interaction histories from text to compressed visual format, reducing token usage by over 50% while maintaining 95% performance. The system uses visual caching and adaptive compression to address memory bottlenecks in large language model deployments.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 2
🧠 Researchers have developed FM Agent, a multi-agent AI framework that combines large language models with evolutionary search to autonomously solve complex research problems. The system achieved state-of-the-art results across multiple domains including operations research, machine learning, and GPU optimization without human intervention.
AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠 Researchers introduce InnoGym, the first benchmark designed to evaluate AI agents' innovation potential rather than just correctness. The framework measures both performance gains and methodological novelty across 18 real-world engineering and scientific tasks, revealing that while AI agents can generate novel approaches, those approaches are not robust enough to yield significant performance improvements.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠 Researchers propose GenDB, a database system that uses Large Language Models to synthesize query execution code instead of relying on traditional engineered query processors. Early prototype testing shows GenDB outperforms established systems like DuckDB, Umbra, and PostgreSQL on OLAP workloads.
AI · Bearish · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠 Researchers have developed AudAgent, an automated tool that monitors AI agents in real-time to ensure they comply with their stated privacy policies. The tool revealed that many AI agents powered by major providers like Claude, Gemini, and DeepSeek fail to protect highly sensitive data like SSNs and violate their own privacy policies.
$LINK
AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠 Researchers introduce PsyAgent, a new AI framework that creates human-like agents by combining personality modeling based on Big Five traits with contextual social awareness. The system uses structured prompts and fine-tuning to produce AI agents that maintain stable personality traits while adapting appropriately to different social situations and roles.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠 Researchers have developed MagicAgent, a series of foundation models designed for generalized AI agent planning that outperform existing sub-100B models and even surpass leading ultra-scale models like GPT-5.2. The models achieve superior performance through a novel synthetic data framework and a two-stage training paradigm that addresses gradient interference in multi-task learning.
AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠 Researchers introduce GLEE, a new framework for studying how Large Language Models behave in economic games and strategic interactions. The study reveals that LLM performance in economic scenarios depends heavily on market parameters and model selection, with complex interdependent effects on outcomes.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠 Surge AI introduces CoreCraft, the first environment in EnterpriseBench for training AI agents on realistic enterprise workflows. Training GLM 4.6 on this high-fidelity customer support simulation improved task performance from 25% to 37% and showed positive transfer to other benchmarks, demonstrating that quality training environments enable generalizable AI capabilities.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠 Researchers introduce PolySkill, a framework that enables AI agents to learn generalizable skills by separating abstract goals from concrete implementations, inspired by software engineering polymorphism. The method improves skill reuse by 1.7x and boosts success rates by up to 13.9% on web navigation tasks while reducing execution steps by over 20%.
AI × Crypto · Bullish · CoinTelegraph – AI · Feb 27 · 7/10 · 8
🤖 Alchemy has launched autonomous payment rails for AI agents on the Base blockchain, enabling automated payments for blockchain data and compute credits using USDC. This development supports the growing trend of autonomous crypto applications by providing seamless payment infrastructure for AI-driven systems.
AI × Crypto · Bullish · Bitcoinist · Feb 27 · 7/10 · 4
🤖 Ethereum is emerging as the dominant blockchain for AI agent development, expanding beyond its traditional DeFi leadership role. The network is now positioning itself as the primary platform for on-chain AI innovation, demonstrating constructive rather than speculative growth.
$ETH
AI × Crypto · Bullish · The Block · Feb 27 · 7/10 · 7
🤖 DeFi leaders from Ondo and Galaxy Digital discuss the emerging trends of Real World Assets (RWAs), AI integration, and tokenized equities in decentralized finance. The conversation explores how AI agents are expected to transform DeFi trading practices and provides a bullish outlook despite current market conditions.
AI · Bullish · OpenAI News · Feb 27 · 7/10 · 5
🧠 Amazon Bedrock introduces a new Stateful Runtime Environment for AI agents that provides persistent orchestration, memory capabilities, and secure execution for complex multi-step AI workflows. The service leverages OpenAI technology to enable more sophisticated AI agent operations with maintained state across interactions.
AI · Bullish · OpenAI News · Feb 27 · 7/10 · 6
🧠 OpenAI and Amazon have announced a strategic partnership that will integrate OpenAI's Frontier platform with AWS infrastructure. The collaboration aims to expand AI capabilities through enhanced infrastructure, custom model development, and enterprise AI agent solutions.
AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 6
🧠 Researchers introduce ProactiveMobile, a new benchmark for developing AI agents that can proactively anticipate user needs on mobile devices rather than just responding to commands. The benchmark includes over 3,600 test instances across 14 scenarios, with current models achieving low success rates, indicating significant room for improvement in proactive AI capabilities.
AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 3
🧠 Researchers introduce Tool Decathlon (Toolathlon), a comprehensive benchmark for evaluating AI language agents across 32 software applications and 604 tools in realistic, multi-step scenarios. The benchmark reveals significant limitations in current AI models, with the best performer (Claude-4.5-Sonnet) achieving a success rate of only 38.6% on complex, real-world tasks.
AI · Bearish · arXiv – CS AI · Feb 27 · 7/10 · 5
🧠 Researchers discovered a new vulnerability called 'silent egress' where LLM agents can be tricked into leaking sensitive data through malicious URL previews without detection. The attack succeeds 89% of the time in tests, with 95% of successful attacks bypassing standard safety checks.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 4
🧠 Researchers have released MiroFlow, an open-source AI agent framework designed to overcome limitations of current LLM-based systems in complex real-world tasks. The framework features agent graph orchestration, deep reasoning capabilities, and robust workflow execution, achieving state-of-the-art performance across multiple benchmarks including GAIA and FutureX.
AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 6
🧠 Researchers introduced VeRO (Versioning, Rewards, and Observations), a new evaluation framework for testing AI coding agents that optimize other AI agents through iterative improvement cycles. The system provides reproducible benchmarks and structured execution traces to systematically measure how well coding agents can improve target agents' performance.
AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 7
🧠 A research paper introduces the concept of 'vibe researching', where AI agents autonomously execute entire research pipelines from idea to submission using specialized skills. The study finds that AI agents excel at speed and methodological tasks but struggle with theoretical originality and tacit knowledge, so the boundary for delegating work to them is cognitive rather than sequential in research workflows.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 5
🧠 Researchers introduce Agent Behavioral Contracts (ABC), a formal framework for specifying and enforcing reliable behavior in autonomous AI agents. The system addresses critical issues of drift and governance failures in AI deployments by implementing runtime-enforceable contracts that achieve 88-100% compliance rates and significantly improve violation detection.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 7
🧠 Researchers have developed Exgentic, a new framework for evaluating general-purpose AI agents that can perform tasks across different environments without domain-specific tuning. The study benchmarked five prominent agent implementations and found that general agents can achieve performance comparable to specialized agents, establishing the first Open General Agent Leaderboard.
AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 7
🧠 LiveMCPBench introduces the first large-scale benchmark evaluating AI agents' ability to navigate real-world tasks using Model Context Protocol (MCP) tools across multiple servers. The benchmark reveals significant performance gaps, with top model Claude-Sonnet-4 achieving 78.95% success while most models reach only 30-50%, and identifies tool retrieval as the primary bottleneck.
$OCEAN