y0news

#dataset-composition News & Analysis

1 article tagged with #dataset-composition. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

1 article
AI · Neutral · arXiv – CS AI · 14h ago · 6/10

LLMSurgeon: Diagnosing Data Mixture of Large Language Models

Researchers introduce LLMSurgeon, a framework that reverse-engineers the pretraining data composition of Large Language Models by analyzing their generated text, addressing the opacity surrounding how foundation models are trained. The method estimates domain-level distributions across a predefined taxonomy without requiring access to actual training datasets, offering a practical auditing tool for understanding model behavior and capabilities.
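To make the idea of a domain-level estimate concrete, here is a minimal sketch of the simplest possible baseline: label each generated sample with a domain from a fixed taxonomy, then normalize the counts. This is not LLMSurgeon's method, which the summary does not describe in detail. The taxonomy, the keyword lists, and the names `TAXONOMY`, `classify`, and `estimate_mixture` are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical taxonomy and keyword lists, for illustration only.
# A real audit would use a trained domain classifier, not keywords.
TAXONOMY = {
    "code": ("def ", "import ", "return "),
    "math": ("theorem", "equation", "proof"),
    "news": ("reported", "announced", "officials"),
}


def classify(text: str) -> str:
    """Assign a text to the domain with the most keyword hits (toy classifier)."""
    lowered = text.lower()
    scores = {
        domain: sum(lowered.count(keyword) for keyword in keywords)
        for domain, keywords in TAXONOMY.items()
    }
    best = max(scores, key=scores.get)
    # Texts that match no keyword fall into a catch-all bucket.
    return best if scores[best] > 0 else "other"


def estimate_mixture(samples):
    """Return the fraction of samples assigned to each domain."""
    counts = Counter(classify(sample) for sample in samples)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {domain: n / total for domain, n in counts.items()}


if __name__ == "__main__":
    generated = [
        "def f(): return 1",
        "The theorem has a short proof",
        "Officials announced the plan",
        "import os",
    ]
    print(estimate_mixture(generated))
```

Counting classifier labels like this measures what a model generates, which need not match what it was trained on; closing that gap is the problem the framework is described as addressing.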