AI · Neutral · arXiv – CS AI · 14h ago · 6/10
LLMSurgeon: Diagnosing Data Mixture of Large Language Models
Researchers introduce LLMSurgeon, a framework that reverse-engineers the pretraining data composition of Large Language Models by analyzing their generated text, addressing the opacity surrounding how foundation models are trained. The method estimates domain-level distributions across a predefined taxonomy without requiring access to actual training datasets, offering a practical auditing tool for understanding model behavior and capabilities.
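To make the idea of a domain-level estimate concrete, here is a minimal sketch of the simplest possible baseline: label each generated sample with a domain, then normalize the label counts into a distribution. This is not LLMSurgeon's actual estimator, which the summary does not describe. The taxonomy, the keyword lists, and the function names `classify` and `estimate_mixture` are all hypothetical, and the keyword matcher merely stands in for a real domain classifier.

```python
from collections import Counter

# Hypothetical taxonomy; the real framework's domains are not given in the summary.
DOMAINS = ["code", "math", "news", "other"]

# Hypothetical keyword lists used only to keep this sketch self-contained.
KEYWORDS = {
    "code": ["def ", "import "],
    "math": ["equation", "theorem"],
    "news": ["senator", "reported"],
}


def classify(text: str) -> str:
    """Toy stand-in for a domain classifier: first keyword match wins."""
    lowered = text.lower()
    for domain, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return domain
    return "other"


def estimate_mixture(samples: list[str]) -> dict[str, float]:
    """Return the fraction of samples assigned to each domain."""
    if not samples:
        raise ValueError("need at least one generated sample")
    counts = Counter(classify(sample) for sample in samples)
    return {domain: counts[domain] / len(samples) for domain in DOMAINS}


# Stand-ins for text sampled from the model under audit.
samples = [
    "def foo(): return 1",
    "Solve the equation x + 2 = 5",
    "The senator said on Tuesday",
    "def bar(): pass",
]
mixture = estimate_mixture(samples)
print(mixture)
```

A count-based estimate like this reflects what the model tends to generate, which need not match the proportions of its training data; closing that gap is presumably where a dedicated framework does its real work.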