An Empirical Audit of k-NAF Budget Accounting for Anchored Decoding
Researchers empirically tested the k-NAF budget accounting mechanism in Anchored Decoding across 8,500 executions and found that cumulative KL divergence spending remained consistently below sequence-level budgets, with no clear evidence of budget exhaustion even under adaptive stress testing. Results suggest the budget mechanism functions reliably, though some proxy artifacts appeared in small-sample evaluations on copyright-domain workloads.
This empirical audit addresses a practical question in language model decoding: whether budget-constrained systems actually stay within their theoretical spending limits. The k-NAF mechanism in Anchored Decoding constrains generation by tracking cumulative KL divergence spending against a predefined sequence-level budget. The study combines fixed, class-stratified workloads with adaptive prompt searches that stress-test the system by targeting high proxy spend ratios.
On the fixed workloads, cumulative spending stays well below the allocated budgets across multiple prompt categories, and surface-overlap diagnostics remain small, indicating little deviation from base model outputs. The adaptive searches likewise produce no clear evidence of budget exhaustion. This consistency across thousands of executions suggests the accounting mechanism behaves reliably under the tested conditions. One anomaly warrants attention: proxy ratios above 1.0 appeared in copyright-domain workloads during small-sample evaluations. These artifacts disappear when sample sizes increase, which indicates evaluation noise rather than a fundamental budget failure. The audit therefore draws a distinction between true budget exhaustion and statistical artifacts in proxy measurement.
For practitioners deploying budget-constrained generation systems, the results offer empirical support that the theoretical limits hold in the settings tested. They also show why adequate sample sizes matter when evaluating constrained systems, since small-sample evaluations can produce misleading proxy metrics. Open questions are whether the budget constraints remain effective as model scale increases, and how conservative budgeting trades off against generation quality across diverse domains.
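The budget accounting described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each decoding step contributes the KL divergence between two next-token distributions and that these per-step values are summed and compared with one sequence-level budget. The function names, the direction of the KL term, and the toy distributions are hypothetical and are not taken from the audited implementation.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same vocabulary.

    Assumption: q has nonzero mass wherever p does; otherwise return infinity.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi <= 0.0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def audit_budget(step_pairs, budget):
    """Sum per-step KL spend and compare it with a sequence-level budget.

    step_pairs: iterable of (p, q) distribution pairs, one per decoding step
    (hypothetical input format chosen for this sketch).
    """
    spent = sum(kl_divergence(p, q) for p, q in step_pairs)
    return {
        "spent": spent,
        "ratio": spent / budget,          # spend ratio; > 1.0 would mean overspend
        "within_budget": spent <= budget,
    }

# Toy two-token vocabulary, two decoding steps (made-up numbers).
steps = [([0.5, 0.5], [0.25, 0.75]), ([0.9, 0.1], [0.9, 0.1])]
report = audit_budget(steps, budget=1.0)
print(report)
```

In this framing, an audit of the kind summarized here would check that `within_budget` holds on every execution and would treat any spend ratio above 1.0 as a candidate violation to investigate.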
- k-NAF budget accounting maintains spending well below theoretical limits across 8,500+ diverse prompt executions with no clear budget exhaustion
- Adaptive search procedures designed to stress-test the system fail to produce meaningful budget violations, indicating robustness in the mechanism
- Proxy ratios above 1.0 observed in copyright-domain workloads resolve when sample sizes increase, suggesting measurement artifacts rather than real failures
- Surface-overlap diagnostics remain small across all test conditions, confirming generated outputs stay close to base model distributions
- Adequate sample sizes are critical for evaluating budget-constrained systems, as small samples can produce misleading proxy metrics
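
The sample-size point can be made concrete with a small simulation. This is a synthetic sketch, not a reproduction of the audit: it assumes a true spend ratio of 0.8 and Gaussian measurement noise on each proxy reading, both of which are invented values, and asks how often the averaged proxy ratio lands above 1.0 for small versus larger evaluations. The helper `fraction_over_one` is hypothetical.

```python
import random

def fraction_over_one(true_ratio, noise_sd, n_samples, n_trials, seed=0):
    """Fraction of simulated evaluations whose mean proxy ratio exceeds 1.0.

    Each evaluation averages n_samples noisy proxy readings. The Gaussian
    noise model is an assumption made only for this illustration.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    over = 0
    for _ in range(n_trials):
        readings = [rng.gauss(true_ratio, noise_sd) for _ in range(n_samples)]
        if sum(readings) / n_samples > 1.0:
            over += 1
    return over / n_trials

# Same underlying ratio (below 1.0), different evaluation sizes.
small = fraction_over_one(0.8, 0.5, n_samples=5, n_trials=500)
large = fraction_over_one(0.8, 0.5, n_samples=200, n_trials=500)
print(f"mean ratio > 1.0 with n=5:   {small:.3f}")
print(f"mean ratio > 1.0 with n=200: {large:.3f}")
```

Under these assumptions the small evaluations cross 1.0 noticeably more often than the large ones even though the underlying ratio never changes, which is the pattern the audit attributes to evaluation noise.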