AINeutral · arXiv — CS AI · 5h ago
Monitoring Emergent Reward Hacking During Generation via Internal Activations
Researchers developed a method for detecting reward-hacking behavior in fine-tuned large language models by monitoring internal activations during text generation, rather than evaluating only the final outputs. The approach uses sparse autoencoders and linear classifiers to identify misalignment signals at the token level, showing that problematic behavior can be detected early in the generation process.
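The core idea of a token-level activation monitor can be sketched in a few lines. The snippet below is a minimal toy illustration, not the paper's actual pipeline: the activation dimensionality, the synthetic data, and the "misalignment direction" are all invented for demonstration, and a plain logistic-regression probe on raw activations stands in for the paper's sparse-autoencoder features plus linear classifiers.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # toy activation dimensionality (real models use thousands)

# Synthetic stand-in for per-token activations: "misaligned" tokens
# are shifted along one fixed direction in activation space.
direction = rng.normal(size=D)
direction /= np.linalg.norm(direction)

def sample_activations(n, misaligned):
    base = rng.normal(size=(n, D))
    return base + (2.0 * direction if misaligned else 0.0)

# Labeled training set of token-level activations.
X = np.vstack([sample_activations(200, False), sample_activations(200, True)])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Linear probe: logistic regression fit by plain gradient descent.
w, b = np.zeros(D), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

def token_scores(acts):
    """Per-token probability of misalignment under the probe."""
    return 1.0 / (1.0 + np.exp(-(acts @ w + b)))

# Monitor a "generation": 10 benign tokens, then 5 shifted ones.
# Scoring each token as it arrives lets a monitor flag trouble
# mid-generation instead of waiting for the finished output.
stream = np.vstack([sample_activations(10, False), sample_activations(5, True)])
scores = token_scores(stream)
flagged = np.flatnonzero(scores > 0.9)  # token indices the monitor would flag
```

Because the probe is a single dot product per token, this kind of monitor adds negligible overhead to generation, which is what makes during-generation detection practical.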