AI · Neutral · arXiv – CS AI · 6h ago · 6/10
Too long; didn't solve
A new study of the mathematical benchmarks used to evaluate large language models finds that both prompt length and solution length correlate with higher model failure rates. The research, conducted on an adversarial dataset of expert-authored math problems, indicates that structural complexity contributes significantly to how difficult a problem is for a model.