#ai-capability News & Analysis

8 articles tagged with #ai-capability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Human vs Machine Mathematical Difficulty on Project Euler: An Experimental Analysis

A new study analyzing 3,840 AI attempts across 50 mathematical problems from Project Euler finds that frontier AI systems scale more efficiently with problem difficulty than previously predicted, with machine effort following a power-law relationship where the exponent is less than 1 for most models tested. This suggests AI systems may actually improve relative to humans as problems become harder, contrary to earlier theoretical predictions.
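As a rough illustration of what a power-law exponent below 1 means, the sketch below estimates the exponent as the slope of a straight-line fit in log-log space. The function name `fit_power_law_exponent`, the variable names, and the synthetic numbers are assumptions made for this example; they are not the study's method or data.

```python
import math

def fit_power_law_exponent(difficulty, effort):
    """Least-squares slope of log(effort) against log(difficulty).

    If effort = c * difficulty ** alpha, this slope is alpha.
    """
    xs = [math.log(d) for d in difficulty]
    ys = [math.log(e) for e in effort]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic illustration only (not data from the study):
# effort grows as 3 * difficulty ** 0.5, so the true exponent is 0.5.
difficulty = [1.0, 4.0, 9.0, 16.0, 25.0]
effort = [3.0 * d ** 0.5 for d in difficulty]

alpha = fit_power_law_exponent(difficulty, effort)
print(f"fitted exponent: {alpha:.3f}")
```

An exponent below 1 means effort grows sublinearly: in this synthetic case, quadrupling the difficulty only doubles the effort.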

AI · Neutral · arXiv – CS AI · Jun 9 · 7/10

AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation

Researchers demonstrate that AI agents' performance in drug-asset valuation is fundamentally limited by access to proprietary data rather than reasoning quality alone. A three-arm experiment shows that adding reasoning scaffolds and structured tools improves calibration but cannot overcome gaps in underlying evidence, with proprietary datasets enabling 96% recovery of expert valuations versus 38% for public-data-only systems.

AI · Neutral · arXiv – CS AI · May 11 · 7/10

Evaluating Large Language Models in Scientific Discovery

Researchers introduce a scenario-grounded benchmark for evaluating large language models in scientific discovery, revealing significant performance gaps compared to general science benchmarks. The framework tests LLMs across biology, chemistry, materials, and physics through project-level tasks involving hypothesis generation and experimental design, showing that current models remain distant from achieving general scientific superintelligence despite demonstrating promise in specific applications.

AI · Bullish · arXiv – CS AI · May 7 · 7/10

Design Conductor 2.0: An agent builds a TurboQuant inference accelerator in 80 hours

Researchers have demonstrated Design Conductor 2.0, an updated AI agent system that autonomously designed VerTQ, an LLM inference accelerator optimized for TurboQuant, in 80 hours. The system handles design tasks 80x larger than its predecessor could while maintaining autonomous operation and high-quality output.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

Towards grounded autonomous research: an end-to-end LLM mini research loop on published computational physics

Researchers demonstrate an autonomous LLM agent capable of executing a complete research loop—reading, reproducing, critiquing, and extending computational physics papers. Testing across 111 papers reveals the agent identifies substantive flaws in 42% of cases, with 97.7% of issues requiring actual computation to detect, and produces a publishable peer-review comment on a Nature Communications paper without human direction.

AI · Neutral · Ars Technica – AI · Apr 14 · 7/10

UK gov's Mythos AI tests help separate cybersecurity threat from hype

The UK government's Mythos AI has become the first AI system to successfully complete a complex multi-step cybersecurity infiltration challenge, demonstrating tangible progress in AI capability assessment. This breakthrough helps distinguish genuine AI security threats from speculative hype, providing clearer benchmarks for evaluating AI systems' real-world vulnerabilities.

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

One Interaction Is Worth a Thousand Guesses: Benchmarking the Interactive Capabilities of Deep Research Agents

Researchers introduce IDRBench, the first benchmark for evaluating the interactive capabilities of deep research agents powered by large language models. The benchmark measures how well agents solicit user clarification during research tasks and quantifies the tradeoff between alignment improvements and interaction costs across seven LLMs.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

The Rise and Fall of $G$ in AGI

Researchers apply psychometric analysis to large language model benchmarks, discovering that AI's general intelligence factor (G-factor) peaked around 2023-2024 before fragmenting as models specialized in reasoning tasks. The finding suggests AI development is shifting from unified capability improvement toward specialized tool-using systems, challenging assumptions about monolithic AGI progress.