🧠 AI | ⚪ Neutral | Importance 6/10

Why we no longer evaluate SWE-bench Verified

OpenAI News | February 23, 2026 at 11:00 AM

🤖 AI Summary

OpenAI is no longer evaluating on SWE-bench Verified, a popular coding benchmark, citing growing contamination and flawed tests. The analysis finds training data leakage and unreliable test cases, so scores on the benchmark no longer accurately measure AI coding capability. SWE-bench Pro is recommended as the replacement.

Key Takeaways

→ SWE-bench Verified is compromised by contamination: benchmark data has leaked into training data.
→ The benchmark contains flawed tests that give an inaccurate picture of progress in AI coding.
→ Problems in the evaluation methodology make scores on the benchmark unreliable.
→ SWE-bench Pro is recommended as the replacement for evaluating coding capabilities.
→ The decision to stop evaluating on the benchmark highlights how hard it is to build reliable AI performance benchmarks.