Can I Take Another Dose? Evaluating LLM Decision-Making Under Temporal Uncertainty in OTC Dosing QA
Researchers introduced DOSEBENCH, a benchmark of 81 OTC medication dosing scenarios, to evaluate how well large language models handle safety-critical medical decisions involving temporal reasoning and constraint adherence. Testing four LLMs revealed significant weaknesses in rolling-window calculations, ambiguity handling, and consistency—critical gaps for a use case where incorrect answers pose real health risks.
The study addresses a practical but overlooked vulnerability in LLM deployment: their application to time-sensitive, safety-relevant health questions. Over-the-counter medication dosing requires models to track multiple constraints simultaneously—24-hour rolling intake windows, product-label maximums, and incomplete user histories—while maintaining logical consistency across repeated queries. This represents a narrower, more measurable problem than general medical QA, making it an effective diagnostic tool for temporal reasoning flaws.
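To make the rolling-window constraint concrete, here is a minimal Python sketch of the kind of check such a question involves. It is an illustration only, not DOSEBENCH's scoring code: the `Dose` type, the `can_take_dose` function, the verdict labels, and every numeric limit are hypothetical placeholders and are not taken from any product label.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Dose:
    taken_at: datetime
    amount_mg: float


def can_take_dose(history: List[Dose], now: datetime, next_dose_mg: float,
                  max_window_mg: float, min_interval_hours: float,
                  window_hours: float = 24.0,
                  history_complete: bool = True) -> str:
    """Return "yes", "no", or "unknown" for taking another dose at `now`."""
    # An incomplete history cannot be checked safely, so do not guess.
    if not history_complete:
        return "unknown"
    past = [d for d in history if d.taken_at <= now]
    # Rolling window: count doses in the last `window_hours`, not doses
    # "since midnight". A dose exactly on the boundary is treated as
    # outside the window (an assumption of this sketch).
    window_start = now - timedelta(hours=window_hours)
    in_window_mg = sum(d.amount_mg for d in past if d.taken_at > window_start)
    if in_window_mg + next_dose_mg > max_window_mg:
        return "no"
    # Minimum spacing since the most recent dose.
    if past:
        last = max(d.taken_at for d in past)
        if now - last < timedelta(hours=min_interval_hours):
            return "no"
    return "yes"


# Hypothetical scenario: three 400 mg doses yesterday, 1200 mg window limit.
history = [
    Dose(datetime(2024, 1, 1, 10, 0), 400.0),
    Dose(datetime(2024, 1, 1, 16, 0), 400.0),
    Dose(datetime(2024, 1, 1, 22, 0), 400.0),
]
early = can_take_dose(history, datetime(2024, 1, 2, 6, 0), 400.0, 1200.0, 4.0)
later = can_take_dose(history, datetime(2024, 1, 2, 10, 30), 400.0, 1200.0, 4.0)
print(early, later)  # no yes
```

At 06:00 the next day a calendar-day count would see zero doses "today", but all three earlier doses still fall inside the trailing 24 hours, so the check returns "no". By 10:30 the first dose has left the window and the answer becomes "yes".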
The findings expose a gap between how confident a response sounds and whether it is correct. Responses that read as authoritative and logically structured can still exceed dosing safety limits, which suggests that the outputs are not grounded in any explicit constraint check. Across 1,620 responses from four models (with 81 scenarios, that works out to five responses per scenario per model), the models consistently struggled with multi-step temporal calculations and with edge cases involving ambiguous timing information.
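The difference between consistency and correctness can be shown with a small Python sketch that scores repeated responses to one scenario. The `score_scenario` helper and the "yes"/"no"/"unknown" verdict labels are assumptions made for illustration; the source does not describe the benchmark's actual metrics.

```python
from collections import Counter
from typing import Dict, List


def score_scenario(verdicts: List[str], gold: str) -> Dict[str, object]:
    """Summarize repeated verdicts for a single scenario.

    agreement: share of responses matching the most common verdict
    accuracy:  share of responses matching the gold verdict
    """
    counts = Counter(verdicts)
    majority, majority_count = counts.most_common(1)[0]
    return {
        "majority": majority,
        "agreement": majority_count / len(verdicts),
        "accuracy": counts[gold] / len(verdicts),
    }


# A model that always says "yes" is perfectly consistent and always wrong here.
stable_but_wrong = score_scenario(["yes"] * 5, gold="no")
# A model that wavers can still be right more often than the stable one.
unstable = score_scenario(["no", "no", "yes", "no", "unknown"], gold="no")
print(stable_but_wrong)
print(unstable)
```

Reporting the two numbers separately keeps a perfectly repeatable wrong answer from being mistaken for a reliable one.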
For healthcare applications, this research demonstrates that LLMs require external validation layers and explicit constraint-checking mechanisms before deployment in safety-critical contexts. Developers cannot rely on model explanations alone to verify medical safety. The benchmark itself provides a standardized evaluation framework that could help catch similar oversights in future medical AI systems.
Investors and developers should recognize this as evidence supporting hybrid AI architectures—pairing LLM reasoning with rule-based constraint systems—rather than pure end-to-end learning for safety-sensitive tasks. This work may influence how healthcare companies approach LLM guardrails and liability considerations.
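One way to picture the hybrid arrangement is a thin guard that lets a rule-based verdict override the model's. The Python sketch below is a hypothetical design, not something described in the study: `guarded_answer`, its field names, and the choice to treat the rule check as authoritative are all assumptions.

```python
from typing import Dict


def guarded_answer(llm_verdict: str, llm_explanation: str,
                   rule_verdict: str) -> Dict[str, object]:
    """Combine an LLM's answer with an independent rule-based verdict.

    The rule-based verdict always decides the final answer. The LLM's
    explanation is passed through only when the two verdicts agree;
    otherwise it is dropped and the disagreement is flagged for review.
    """
    agrees = llm_verdict == rule_verdict
    return {
        "final_verdict": rule_verdict,
        "explanation": llm_explanation if agrees else None,
        "flagged": not agrees,
    }


# The model sounds confident, but the rule check says the limit is exceeded.
result = guarded_answer(
    llm_verdict="yes",
    llm_explanation="It has been long enough since your last dose.",
    rule_verdict="no",
)
print(result["final_verdict"], result["flagged"])  # no True
```

In this design the language model contributes wording and context, while the decision itself comes from a check that can be audited line by line.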
- LLMs frequently fail at rolling-window dose calculations despite confident-sounding responses, creating serious safety risks
- DOSEBENCH's 81 curated scenarios reveal systematic weaknesses in temporal reasoning and constraint adherence across four tested models
- Model confidence and consistency are poor indicators of actual correctness in dosing decisions
- Safety-critical medical applications require external constraint-checking layers beyond pure LLM outputs
- Focused benchmarks like this expose gaps that broader medical QA evaluations typically miss