AI · Neutral · arXiv CS AI · 7h ago · 7/10
MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition
Researchers introduced MEDLEY-BENCH, a new AI benchmark that evaluates metacognition: an AI model's ability to monitor and revise its own reasoning. The study found that while larger models evaluate their reasoning better, they don't actually control their outputs more effectively, and smaller models often match larger ones in metacognitive tasks, suggesting scale alone doesn't determine reasoning quality.