#coding-benchmarks News & Analysis

2 articles tagged with #coding-benchmarks. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 16 · 7/10

Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents

Researchers introduce the Darwin Gödel Machine (DGM), a self-improving AI system that iteratively modifies its own code and validates each change against coding benchmarks. The system's scores rose from 20.0% to 50.0% on SWE-bench and from 14.2% to 30.7% on the Polyglot benchmark.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

Beyond Output Correctness: Benchmarking and Evaluating Large Language Model Reasoning in Coding Tasks

Researchers introduce CodeRQ-Bench, the first benchmark for evaluating LLM reasoning quality across coding tasks including generation, summarization, and classification. They also propose VERA, a two-stage evaluator that combines evidence-grounded verification with ambiguity-aware score correction and that they report significantly outperforms existing evaluation methods.