y0news
AnalyticsDigestsRSSAICrypto
#reward-hacking2 articles
2 articles
AIBearisharXiv โ€“ CS AI ยท 5h ago
๐Ÿง 

Language Model Goal Selection Differs from Humans' in an Open-Ended Task

Research comparing four state-of-the-art language models (GPT-5, Gemini 2.5 Pro, Claude Sonnet 4.5, and Centaur) to humans in goal selection tasks reveals substantial divergence in behavior. While humans explore diverse approaches and learn gradually, the AI models tend to exploit single solutions or show poor performance, raising concerns about using current LLMs as proxies for human decision-making in critical applications.

๐Ÿง  Claude๐Ÿง  Gemini
AINeutralarXiv โ€“ CS AI ยท 5h ago
๐Ÿง 

Monitoring Emergent Reward Hacking During Generation via Internal Activations

Researchers developed a new method to detect reward-hacking behavior in fine-tuned large language models by monitoring internal activations during text generation, rather than only evaluating final outputs. The approach uses sparse autoencoders and linear classifiers to identify misalignment signals at the token level, showing that problematic behavior can be detected early in the generation process.