y0news
#performance-evaluation
2 articles
AI × Crypto · Bearish · arXiv – CS AI · 4h ago
🤖

TraderBench: How Robust Are AI Agents in Adversarial Capital Markets?

TraderBench introduces a new benchmark for evaluating AI agents in financial markets, combining expert-verified static tasks with adversarial trading simulations. The study found that 8 of 13 tested AI models showed minimal variation across market conditions, indicating they rely on fixed strategies rather than adaptive market behavior.

AI · Neutral · arXiv – CS AI · 4h ago
🧠

According to Me: Long-Term Personalized Referential Memory QA

Researchers introduce ATM-Bench, the first benchmark for evaluating AI assistants' ability to recall and reason over long-term personalized memory across multiple modalities. The benchmark reveals poor performance (under 20% accuracy) for current state-of-the-art memory systems, highlighting significant limitations in personalized AI capabilities.