y0news
#benchmark-contamination (2 articles)
AI · Bearish · arXiv – CS AI · 5d ago

On The Fragility of Benchmark Contamination Detection in Reasoning Models

New research reveals that benchmark contamination in large reasoning models (LRMs) is extremely difficult to detect, allowing developers to inflate performance scores on public leaderboards. The study shows that reinforcement learning methods such as GRPO and PPO can effectively conceal contamination signals, undermining the integrity of AI model evaluations.
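For context, one common contamination-detection baseline (not the method studied in this paper, just an illustrative sketch) is checking for verbatim n-gram overlap between a training corpus and benchmark items. The function names and the n-gram size below are assumptions for illustration:

```python
def ngram_set(text, n=5):
    """Return the set of word-level n-grams in a text."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(train_docs, bench_items, n=5):
    """Fraction of benchmark items sharing at least one n-gram
    with any training document (a crude leakage signal)."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngram_set(doc, n)
    flagged = [b for b in bench_items if ngram_set(b, n) & train_grams]
    return len(flagged) / max(len(bench_items), 1), flagged
```

The paper's point is that this kind of surface-level signal can be washed out by post-training: RL fine-tuning reshapes the model's outputs and token likelihoods, so overlap- and likelihood-based detectors lose their footing even when the benchmark was seen during training.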

AI · Neutral · OpenAI News · Feb 23

Why we no longer evaluate SWE-bench Verified

SWE-bench Verified, a popular coding evaluation benchmark, is no longer being used for evaluation due to increasing contamination and flawed test design. The analysis cites training-data leakage and unreliable test cases that fail to accurately measure AI coding capability, and recommends SWE-bench Pro as the replacement.