Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Using a Large Language Model
A study evaluating the consistency of exercise prescriptions generated by Gemini 2.5 Flash found high semantic consistency but significant variability in quantitative components like exercise intensity. The research highlights that while LLMs produce semantically similar outputs, structural constraints and expert validation are necessary before clinical deployment.
This research addresses a critical gap in understanding LLM reliability for healthcare applications. The study's repeated-generation design—producing 120 outputs across six clinical scenarios—provides empirical evidence that LLMs behave inconsistently even under identical input conditions, a phenomenon often overlooked in enthusiastic discussions of AI adoption. The findings reveal a nuanced reliability profile: semantic similarity scores of 0.879-0.939 suggest the model maintains thematic coherence, yet 10-25% of resistance training outputs contained unclassifiable intensity expressions, directly undermining clinical usability.
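The repeated-generation consistency metric can be illustrated with a minimal sketch. The study likely used embedding-based similarity; the bag-of-words cosine similarity below is a simplified stand-in, and the sample prescriptions are hypothetical:

```python
import math
from collections import Counter
from itertools import combinations

def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two texts (simplified proxy
    for the embedding-based semantic similarity a real study would use)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def mean_pairwise_similarity(outputs: list[str]) -> float:
    """Average similarity over all pairs of repeated generations
    for one clinical scenario."""
    pairs = list(combinations(outputs, 2))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)

# Hypothetical repeated generations for a single scenario
outputs = [
    "Walk 30 minutes at moderate intensity, 5 days per week.",
    "Walk briskly for 30 minutes at moderate intensity, 5 days weekly.",
    "Moderate-intensity walking, 30 minutes, five days per week.",
]
print(round(mean_pairwise_similarity(outputs), 3))
```

A score near 1.0 indicates the repeated outputs say roughly the same thing; the study's point is that such surface agreement can coexist with divergent quantitative details.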
These findings speak to broader concerns about deploying LLMs in regulated healthcare environments. While the technology shows promise for generating personalized content at scale, the variability in quantitative prescriptions—the precise intensity, duration, or frequency specifications that clinicians require—exposes fundamental limitations in current models. The finding that safety expressions varied significantly despite appearing in 100% of outputs demonstrates that consistent inclusion and consistent quality are not the same thing.
For healthcare providers and developers building AI-assisted clinical tools, this study provides actionable validation requirements. The emphasis on prompt structure's influence on consistency suggests that careful engineering can improve reliability but cannot eliminate variability entirely. The research effectively demonstrates that LLM outputs require systematic validation against clinical standards before patient-facing deployment. This positions expert validation not as an optional enhancement but as mandatory infrastructure for healthcare AI systems. Organizations developing clinical decision-support tools should view this as evidence that governance frameworks and human oversight remain essential components of any LLM-based healthcare application.
- LLM-generated exercise prescriptions show high semantic consistency (0.879-0.939 cosine similarity) but significant variability in quantitative components like exercise intensity
- 10-25% of resistance training outputs contained unclassifiable intensity expressions, indicating critical gaps for clinical deployment
- Safety expressions appeared in all outputs but varied significantly in frequency, revealing inconsistency between content inclusion and content quality
- Prompt structure substantially influences LLM consistency, suggesting that careful engineering can improve but not eliminate variability
- Expert validation and additional structural constraints are mandatory before deploying LLM-generated clinical prescriptions in healthcare settings
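The "unclassifiable intensity" finding suggests one concrete structural constraint: automatically flagging outputs whose intensity expression does not match a recognized clinical format. The sketch below is a hypothetical illustration; the regex patterns and sample prescriptions are assumptions, not the study's actual classification scheme, and real validation would follow published exercise guidelines plus expert review:

```python
import re

# Hypothetical patterns for recognized resistance-training intensity formats
INTENSITY_PATTERNS = [
    re.compile(r"\d{1,3}\s*%\s*(of\s*)?1\s*-?\s*RM", re.IGNORECASE),  # e.g. "70% 1RM"
    re.compile(r"RPE\s*\d{1,2}(\s*-\s*\d{1,2})?", re.IGNORECASE),     # e.g. "RPE 7-8"
    re.compile(r"\d{1,2}\s*-?\s*\d{0,2}\s*reps?\b", re.IGNORECASE),   # e.g. "8-12 reps"
]

def has_classifiable_intensity(prescription: str) -> bool:
    """True if the text contains at least one recognized intensity expression."""
    return any(p.search(prescription) for p in INTENSITY_PATTERNS)

# Hypothetical generated prescriptions
outputs = [
    "3 sets of 8-12 reps at 70% 1RM, twice weekly.",
    "Perform resistance training at a challenging but comfortable effort.",
]
flagged = [o for o in outputs if not has_classifiable_intensity(o)]
print(f"{len(flagged)}/{len(outputs)} outputs flagged for expert review")
```

A gate like this routes vague outputs to a human reviewer rather than rejecting them outright, which matches the study's framing of expert validation as mandatory infrastructure rather than an afterthought.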