
How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions

arXiv – CS AI | Jeff A. Bilmes, Gantavya Bhatt, Arnav M. Das | May 29, 2026 at 04:00 AM

🤖AI Summary

Researchers show that neural scaling laws and the Vendi Score, two ways of appraising a dataset's value, are both submodular functions and belong to a broader class of matrix spectral functions that can be optimized directly. Using efficient secular-equation-based updates, they report a 35,000x speedup in computation, which makes direct optimization feasible on large-scale datasets. Among the objectives compared, facility location is the best predictor of how valuable a training subset is.

Analysis

This research addresses a fundamental challenge in machine learning: quantifying dataset value beyond simple size metrics. The work bridges theoretical understanding with practical computation, showing that popular data valuation approaches share mathematical structure through submodularity. This discovery matters because it unifies previously disparate methods under a common framework, enabling researchers to reason about their strengths and limitations more systematically.
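For reference, the standard definitions behind these terms can be written out as follows. The notation (ground set V, similarity matrix K, eigenvalues λ_i, scalar function g) is chosen here for illustration and may differ from the paper's. The trace form of a spectral function shown last is one common case and is an assumption about the family the authors study.

```latex
% Submodularity (diminishing returns):
% for all A \subseteq B \subseteq V and every v \in V \setminus B,
f(A \cup \{v\}) - f(A) \;\ge\; f(B \cup \{v\}) - f(B)

% Vendi Score of n items with positive semidefinite similarity matrix K, K_{ii} = 1,
% where \lambda_1, \dots, \lambda_n are the eigenvalues of K/n and 0 \log 0 := 0:
\mathrm{VS}(K) = \exp\Bigl(-\sum_{i=1}^{n} \lambda_i \log \lambda_i\Bigr)

% A spectral function depends on K only through its eigenvalues; in trace form,
F(K) = \sum_{i=1}^{n} g\bigl(\lambda_i(K)\bigr)
% e.g. g(\lambda) = -\lambda \log \lambda applied to K/n gives \log \mathrm{VS}(K).
```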

The efficiency gain deserves particular attention. By avoiding repeated eigendecompositions during optimization, which reduces the cost of each marginal-gain evaluation by an O(m) factor, the authors enable direct optimization on ImageNet-1K-scale datasets where it was previously infeasible. This speedup puts sophisticated data valuation techniques within reach of groups outside well-resourced research labs.
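To make the idea of a secular-equation update concrete, here is a minimal sketch of the classical rank-one case: given the eigenvalues `d` of a diagonal matrix, the eigenvalues of `diag(d) + rho * z z^T` are the roots of a scalar equation and can be found without a new eigendecomposition. This illustrates the general technique only and is not the authors' algorithm. The function names, the bisection solver, and the toy numbers are all assumptions made for illustration.

```python
import math


def secular_eigenvalues(d, z, rho, iters=200):
    """Eigenvalues of diag(d) + rho * z z^T from the secular equation
    1 + rho * sum_i z_i^2 / (d_i - lam) = 0.

    Assumes rho > 0, d strictly increasing, and every z[i] nonzero.
    """
    def f(lam):
        return 1.0 + rho * sum(zi * zi / (di - lam) for di, zi in zip(d, z))

    norm_sq = sum(zi * zi for zi in z)
    # Interlacing: root i lies in (d[i], d[i+1]);
    # the largest root lies in (d[-1], d[-1] + rho * ||z||^2].
    uppers = list(d[1:]) + [d[-1] + rho * norm_sq]
    roots = []
    for lo, hi in zip(d, uppers):
        # f is increasing on each interval, so plain bisection works.
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if mid <= lo or mid >= hi:  # interval shrunk to float precision
                break
            if f(mid) < 0.0:
                lo = mid
            else:
                hi = mid
        roots.append(0.5 * (lo + hi))
    return roots


def vendi_from_eigenvalues(eigs):
    """Vendi Score computed from a spectrum, normalized to sum to one."""
    total = sum(eigs)
    probs = [e / total for e in eigs if e > 0.0]
    return math.exp(-sum(p * math.log(p) for p in probs))


# Hypothetical spectrum before the update, and a rank-one perturbation.
d = [1.0, 2.0, 4.0]
z = [1.0, 1.0, 1.0]
rho = 0.5

new_eigs = secular_eigenvalues(d, z, rho)
# Change in the score, computed from the updated spectrum alone.
gain = vendi_from_eigenvalues(new_eigs) - vendi_from_eigenvalues(d)
```

Two quick sanity checks on the result: the new eigenvalues should sum to `sum(d) + rho * ||z||^2`, and their product should equal `prod(d) * (1 + rho * sum(z_i^2 / d_i))`, the matrix determinant lemma. A production solver would use a faster root finder than bisection and handle repeated `d` values or zero entries in `z`.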

The empirical findings challenge conventional assumptions about data quality. The research reveals that the Vendi Score, despite its theoretical elegance, becomes a poor proxy for downstream performance when pushed to extreme values. More surprisingly, uniformly random subsets concentrate tightly in both appraisal scores and held-out performance, suggesting that random selection is a stronger baseline than previously recognized. Facility location's consistent advantage over the alternatives indicates that coverage-based selection is more practically valuable than entropy-based objectives such as the Vendi Score.
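For readers unfamiliar with facility location, the objective scores a subset by how well it covers every item: each item contributes its similarity to the closest selected item. The sketch below uses the textbook definition with a plain greedy selector. The similarity matrix is a made-up toy example, and the function names are illustrative rather than taken from the paper.

```python
def facility_location(sim, subset):
    """f(S) = sum over all items i of max_{j in S} sim[i][j]; f(empty) = 0."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)


def greedy_select(sim, k):
    """Pick k items, each time adding the one with the largest marginal gain."""
    chosen = []
    for _ in range(k):
        current = facility_location(sim, chosen)
        candidates = [j for j in range(len(sim)) if j not in chosen]
        best = max(
            candidates,
            key=lambda j: facility_location(sim, chosen + [j]) - current,
        )
        chosen.append(best)
    return chosen


# Toy symmetric similarities (hypothetical): items 0-1 form one cluster,
# items 2-4 form another, with item 3 at its center.
sim = [
    [1.0, 0.9, 0.1, 0.1, 0.0],
    [0.9, 1.0, 0.1, 0.2, 0.1],
    [0.1, 0.1, 1.0, 0.8, 0.5],
    [0.1, 0.2, 0.8, 1.0, 0.8],
    [0.0, 0.1, 0.5, 0.8, 1.0],
]

selected = greedy_select(sim, 2)

# Diminishing returns: adding item 0 helps less once item 3 is already chosen.
gain_alone = facility_location(sim, [0]) - facility_location(sim, [])
gain_after = facility_location(sim, [3, 0]) - facility_location(sim, [3])
```

On this toy data the greedy selector first takes the central item of the larger cluster and then an item from the other cluster, which is the coverage behavior the objective is designed to reward. This naive version recomputes the objective for every candidate; practical implementations cache each item's current best similarity.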

These insights reshape how practitioners should approach dataset curation. Size, class balance, and training budget do not by themselves determine data value: differences between subsets remain even when all three are controlled for. This indicates that subset selection requires domain-specific optimization rather than rule-of-thumb heuristics. The work suggests future research should focus on understanding why facility location succeeds and how to incorporate domain knowledge into the choice of spectral function.

Key Takeaways

→Neural scaling laws and the Vendi Score are both submodular functions and belong to a broader family of matrix spectral functions, which allows them to be optimized within one framework.
→Secular-equation-based updates achieve a 35,000x speedup, making large-scale dataset optimization feasible where repeated eigendecomposition was previously prohibitive.
→Facility location consistently outperforms the entropy-based Vendi Score and DPP methods across multiple datasets and experimental conditions.
→The Vendi Score becomes unreliable as a performance proxy when optimized to extreme values, despite its theoretical elegance.
→Uniformly random subsets concentrate tightly in both appraisal scores and held-out performance, making random selection a stronger baseline than previously recognized.

#machine-learning #dataset-valuation #scaling-laws #optimization #submodular-functions #vendi-score #facility-location

