The Silent Vote: Improving Zero-Shot LLM Reliability by Aggregating Semantic Neighborhoods
Researchers propose Semantic Softmax, a novel inference-time method that improves zero-shot LLM classification by recovering probability mass lost during constrained decoding. The approach aggregates scores from semantic synonyms, reducing calibration errors and boosting accuracy on emotion and toxicity detection tasks.
Large language models face a fundamental challenge when adapted for zero-shot classification: standard constrained decoding discards probability assigned to semantic synonyms outside the target label set, creating what the researchers term 'Renormalization Bias.' This phenomenon produces artificially inflated confidence scores and poor probability calibration, undermining model reliability in high-stakes applications. The 'Silent Vote' names the evidence that goes uncounted: when the softmax is restricted to a narrow label space, probability the model placed on synonymous tokens is silently filtered away, leaving the model overconfident in its predictions.
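A toy example makes the bias concrete. The sketch below (logit values are invented for illustration) computes a full-vocabulary softmax, then renormalizes over only two target labels, as constrained decoding does. Because the positive class's evidence is spread across synonyms, the constrained head-to-head comparison flips toward the wrong label:

```python
import math

def softmax(logits):
    """Standard softmax over a dict of token -> logit."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

# Hypothetical logits: the model spreads positive-class evidence
# across several synonyms, not just the canonical label token.
logits = {"happy": 2.0, "joyful": 1.8, "glad": 1.5,
          "sad": 2.2, "unhappy": 0.5}
full = softmax(logits)

# Constrained decoding: keep only the target labels and renormalize.
labels = ["happy", "sad"]
z = sum(full[t] for t in labels)
constrained = {t: full[t] / z for t in labels}

# The synonym mass on "joyful" and "glad" is silently discarded, so
# "sad" wins the renormalized comparison even though most of the
# positive-sentiment evidence sits on synonyms.
pos_mass = full["happy"] + full["joyful"] + full["glad"]
neg_mass = full["sad"] + full["unhappy"]
print(f"constrained P(happy) = {constrained['happy']:.3f}")
print(f"positive mass = {pos_mass:.3f}, negative mass = {neg_mass:.3f}")
```

Here the constrained decoder prefers "sad", while summing each class's semantic neighborhood shows the positive class actually holds more total probability.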
This research builds on growing recognition that LLM reliability extends beyond raw accuracy. As models become embedded in production systems—from content moderation to sentiment analysis—calibration becomes critical for risk assessment and threshold setting. Prior work has established that zero-shot performance degrades significantly under distribution shift, yet few solutions have addressed the fundamental probability-redistribution problem at inference time.
Semantic Softmax directly tackles this by leveraging the semantic structure already embedded in model representations. By aggregating neighboring semantic concepts, the method preserves information discarded during standard decoding. Evaluation on GoEmotions and Civil Comments datasets shows consistent improvements across Expected Calibration Error, Brier Score, AUROC, and Macro-F1—indicating gains in both calibration and discrimination without architecture changes.
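The aggregation idea can be sketched as follows. This is an illustrative approximation, not the paper's exact algorithm: the neighborhood lists and logit values are assumptions, and a real system would derive neighborhoods from the model's semantic space rather than hand-coding them. Each target label's probability is the summed full-vocabulary probability of its semantic neighborhood, renormalized over the label set:

```python
import math

def softmax(logits):
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

# Hypothetical semantic neighborhoods: each target label plus surface
# synonyms the model may place probability on instead.
NEIGHBORHOODS = {
    "joy":   ["joy", "happiness", "delight"],
    "anger": ["anger", "rage", "fury"],
}

def semantic_softmax(logits, neighborhoods):
    """Aggregate full-vocabulary probabilities over each label's
    semantic neighborhood, then renormalize over the label set."""
    probs = softmax(logits)
    agg = {label: sum(probs.get(tok, 0.0) for tok in toks)
           for label, toks in neighborhoods.items()}
    z = sum(agg.values())
    return {label: p / z for label, p in agg.items()}

logits = {"joy": 1.0, "happiness": 1.4, "delight": 0.6,
          "anger": 1.5, "rage": 0.2, "fury": 0.1}
print(semantic_softmax(logits, NEIGHBORHOODS))
```

In this toy run "anger" holds the single highest logit, but aggregating over neighborhoods correctly awards more total probability to "joy", whose evidence is spread across synonyms.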
The approach has immediate practical implications for practitioners deploying LLMs in classification pipelines. Since it operates at inference time with no model retraining required, adoption barriers remain low. Future work should examine scalability across larger label sets and domain-specific semantic spaces, particularly for applications where miscalibration risks compound—financial risk assessment, medical diagnosis support, or legal document classification.
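For practitioners setting deployment thresholds, the Expected Calibration Error used in the paper's evaluation is straightforward to compute. The sketch below uses a standard equal-width binning formulation (the paper's exact binning scheme is not specified here); predictions are grouped by confidence, and each bin contributes the gap between its accuracy and its mean confidence, weighted by bin size:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average
    |accuracy - mean confidence| per bin, weighted by bin size."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - conf)
    return ece

# An overconfident classifier: ~90% confidence but only 60% accuracy,
# the pattern Renormalization Bias produces. ECE is approximately 0.3.
confs = [0.9] * 10
hits = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(expected_calibration_error(confs, hits))
```

A well-calibrated classifier drives this gap toward zero, which is what makes confidence scores usable for threshold setting in the high-stakes settings the authors highlight.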
- Renormalization Bias causes LLMs to discard probability mass from semantic synonyms during constrained decoding, inflating false confidence
- Semantic Softmax recovers lost information by aggregating scores from semantic neighborhoods, improving both calibration and accuracy
- The method requires no model retraining and operates as an inference-time layer, enabling easy integration into existing pipelines
- Evaluation on emotion and toxicity datasets shows consistent improvements in Expected Calibration Error, Brier Score, AUROC, and Macro-F1
- Better calibrated zero-shot classifiers reduce deployment risks in high-stakes applications requiring reliable confidence estimates