Disposition Distillation at Small Scale: A Three-Arc Negative Result
Researchers attempted to train behavioral dispositions into small language models through distillation, but found that their initial positive results were artifacts of measurement error. After rigorous validation across five small models, they found no reliable method to instill self-verification and uncertainty acknowledgment without degrading model performance or producing only superficial stylistic mimicry.
This paper documents a methodologically rigorous investigation into training desirable behavioral traits in small language models, ultimately yielding null results that challenge assumptions about capability transfer through distillation. The researchers initially reported substantial gains (+33.9 points on MCAS, +15.3 on HumanEval) but discovered these stemmed from truncation artifacts and misaligned evaluation protocols rather than genuine capability improvements. This self-correction demonstrates mature scientific practice: the authors transparently report how internal validation exposed their own false positives, converting them into a publishable negative result.
The work addresses a critical gap in AI safety and reliability. As language models are deployed into production environments, behavioral dispositions like appropriate uncertainty acknowledgment directly affect user trust and downstream application safety. Small models (0.6B-2.3B parameters) represent the deployment frontier for edge computing and resource-constrained environments, making their reliability particularly important. The researchers tested three distinct intervention vectors—SFT/DPO fine-tuning, attention-head interventions, and frozen-base sidecars—across multiple model families and domains, providing broad empirical coverage.
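To make the first intervention arm concrete, the core of DPO can be stated as a single loss over (chosen, rejected) response pairs. The sketch below is illustrative only — scalar log-probabilities stand in for real model outputs, and the paper's actual training code is not shown here.

```python
# Illustrative sketch of the standard DPO objective for one preference pair.
# The scalar log-prob inputs are hypothetical stand-ins for model outputs.
import math

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """Direct Preference Optimization loss for one (chosen, rejected) pair.

    Pushes the policy to prefer the chosen response more strongly than a
    frozen reference model does; beta scales how hard it is pushed.
    """
    margin = beta * ((logp_w_policy - logp_w_ref)
                     - (logp_l_policy - logp_l_ref))
    # -log(sigmoid(margin)); equals log(2) when policy and reference agree
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy already prefers the chosen response more than the reference does,
# so the margin is positive and the loss drops below log(2) ≈ 0.6931.
print(round(dpo_loss(-5.0, -9.0, -6.0, -8.0, beta=0.1), 4))  # prints 0.5981
```

The key design point for disposition training is that the frozen reference model anchors the policy, which is exactly why stylistic mimicry can dominate: the loss rewards relative preference shifts, not genuinely new capability.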
The consistent failure across architectures suggests fundamental constraints in how behavioral traits map to neural representations at small scales. A particularly striking finding is Gemma 4 E2B's near-complete decoupling of confidence from correctness on specific domains: the model asserts answers at 91% confidence regardless of actual accuracy. This indicates that confidence calibration may require deeper architectural changes than post-hoc intervention methods can achieve. For practitioners developing small models for production, the paper signals that relying on behavioral distillation alone carries hidden risks. Future work must investigate whether capability gaps between student and teacher models fundamentally limit disposition transfer.
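The decoupling claim can be quantified with a standard discrimination metric: the ROC AUC between a model's stated confidence and whether its answer was correct. The sketch below is not the paper's code; the function name and data are hypothetical, but the metric itself is standard.

```python
# Illustrative sketch: quantifying confidence-correctness decoupling with
# ROC AUC. A fixed-confidence model is indistinguishable from chance.

def roc_auc(confidences, correct):
    """AUC = P(confidence on a correct item > confidence on an incorrect one),
    with ties counted as 0.5. An AUC near 0.5 means stated confidence carries
    no information about correctness."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("need both correct and incorrect items")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A fully decoupled model: fixed 0.91 confidence regardless of accuracy.
conf = [0.91] * 10
acc = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
print(round(roc_auc(conf, acc), 3))  # prints 0.5 — chance-level discrimination
```

Constant confidence yields AUC exactly 0.5 no matter the accuracy pattern, which is why a model pinned at 91% confidence is uncalibratable by any monotone post-hoc rescaling.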
- Initial reported gains in behavioral disposition training were measurement artifacts, not genuine improvements, highlighting the importance of rigorous cross-validation
- None of three intervention approaches—fine-tuning, attention-head tempering, or training-free sidecars—successfully trained behavioral dispositions without performance degradation
- Gemma 4 E2B exhibits complete confidence-correctness decoupling on specific domains, asserting at fixed confidence levels regardless of accuracy
- Behavioral dispositions may require deeper architectural changes than post-hoc intervention methods can achieve in small language models
- Within-distribution validation (AUC 0.683) collapsed to chance performance on fresh prompts (AUC 0.516), indicating poor generalization of trained dispositions
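The generalization collapse in the last point suggests a simple evaluation discipline: score any learned confidence signal on prompts disjoint from its training distribution before trusting a within-distribution AUC. The toy below is hypothetical throughout (data, keywords, function names); it only illustrates how a signal that memorized surface features of training prompts can look discriminative in-distribution and collapse to chance on fresh prompts.

```python
# Illustrative protocol sketch (hypothetical data): evaluate a learned
# confidence signal on both within-distribution and fresh prompts.

def auc(scores, labels):
    """P(score on a positive item > score on a negative item), ties = 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# An "overfit" signal that memorized surface features of training prompts
# rather than anything about correctness.
seen_keywords = {"capital", "boiling"}

def memorized_score(prompt):
    return 1.0 if seen_keywords & set(prompt.split()) else 0.0

# (prompt, was_the_model_correct) pairs — entirely made up for illustration.
in_dist = [("capital of France", 1), ("boiling point", 1),
           ("square root", 0), ("prime factor", 0)]
fresh = [("tallest mountain", 1), ("fastest bird", 1),
         ("heaviest metal", 0), ("deepest lake", 0)]

for name, data in [("within-distribution", in_dist), ("fresh prompts", fresh)]:
    scores = [memorized_score(p) for p, _ in data]
    labels = [y for _, y in data]
    print(name, auc(scores, labels))  # 1.0 in-distribution, 0.5 on fresh
```

The same two-number report (in-distribution AUC vs fresh-prompt AUC) is what exposed the 0.683 → 0.516 collapse described above.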