y0news

#swe-bench News & Analysis

5 articles tagged with #swe-bench. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 5d ago · 7/10
🧠

From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python

Researchers demonstrate a methodology for translating a large production Rust codebase (648K LOC) into Python using LLM assistance, guided by benchmark performance as an objective function. The Python port of Codex CLI, an AI coding agent, achieves near-parity performance on real-world tasks while reducing code size by 15.9x and enabling 30 new features absent from the original Rust implementation.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10
🧠

A Rubric-Supervised Critic from Sparse Real-World Outcomes

Researchers propose a framework called Critic Rubrics to bridge the gap between academic coding-agent benchmarks and real-world applications. The system learns from sparse, noisy human interaction data using 24 behavioral features and shows significant improvements in code generation tasks, including 15.9% better reranking performance on SWE-bench.

AI · Neutral · arXiv – CS AI · 5d ago · 6/10
🧠

Do Agent Rules Shape or Distort? Guardrails Beat Guidance in Coding Agents

A large-scale empirical study of 679 GitHub instruction files shows that AI coding-agent performance improves by 7–14 percentage points when rules are applied, but, surprisingly, random rules work as well as expert-curated ones. The research finds that negative constraints outperform positive directives, suggesting developers should focus on guardrails rather than prescriptive guidance.

AI · Neutral · OpenAI News · Feb 23 · 6/10
🧠

Why we no longer evaluate SWE-bench Verified

SWE-bench Verified, a popular coding evaluation benchmark, is being retired from evaluations due to increasing contamination and flawed testing methodology. The analysis reveals training-data leakage and unreliable test cases that fail to accurately measure AI coding capabilities, with SWE-bench Pro recommended as the replacement.

AI · Bullish · OpenAI News · Aug 13 · 5/10
🧠

Introducing SWE-bench Verified

SWE-bench Verified is being released as a human-validated subset of the original SWE-bench benchmark. This new version aims to provide more reliable evaluation of AI models' capabilities in solving real-world software engineering problems.