Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves
Researchers propose Comet-H, an AI system that orchestrates language models to generate research software by keeping mathematical theory, code, benchmarks, and documentation synchronized. The framework addresses hallucination and desynchronization failures in LLM-driven development, and its effectiveness is demonstrated on a portfolio of 46 research repositories, including a static-analysis tool that reaches an F1 score of 0.768.
This research addresses a fundamental problem in AI-assisted software development: language models can generate individual artifacts (code, papers, documentation) but struggle to maintain coherence across these coupled systems. The proposed Comet-H system treats software development as a dynamic workspace in which ideation, implementation, evaluation, and documentation form interdependent coordinates rather than isolated tasks. This approach directly tackles two critical failure modes: hallucination accumulation, where unsupported claims propagate across sessions, and desynchronization, where code diverges from theory or stated capabilities.
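To make the workspace idea concrete, here is a minimal sketch of how such coupled coordinates might be tracked. The coordinate names follow the paper's framing, but the fingerprinting scheme, the staleness rule, and all identifiers below are illustrative assumptions, not the authors' data model.

```python
# Hypothetical sketch of a Comet-H-style workspace: theory, code,
# benchmarks, and documentation as coupled coordinates. A coordinate
# that lags the newest artifact is flagged as desynchronized.
from dataclasses import dataclass, field

@dataclass
class Coordinate:
    content_hash: str  # fingerprint of the artifact's current content
    last_synced: int   # development step at which it was last reconciled

@dataclass
class Workspace:
    coords: dict[str, Coordinate] = field(default_factory=dict)

    def desynchronized(self, tolerance: int = 1) -> list[str]:
        """Return coordinates trailing the newest artifact by more than
        `tolerance` steps (the tolerance rule is an assumption)."""
        if not self.coords:
            return []
        newest = max(c.last_synced for c in self.coords.values())
        return [name for name, c in self.coords.items()
                if newest - c.last_synced > tolerance]

ws = Workspace({
    "theory":        Coordinate("a1f3", last_synced=7),
    "code":          Coordinate("9c2e", last_synced=7),
    "benchmarks":    Coordinate("44b0", last_synced=4),
    "documentation": Coordinate("d87a", last_synced=3),
})
print(ws.desynchronized())  # ['benchmarks', 'documentation']
```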
The technical contribution frames prompt selection as a contextual bandit problem, using transparent linear scoring over workspace deficits rather than an opaque learned policy. This design choice keeps the system legible (developers can see why each prompt was selected) while avoiding the overhead of training a reinforcement learning policy. The emphasis on audit-and-contraction passes during later development phases suggests that refinement and validation become increasingly important as projects mature, a pattern that could inform how teams organize LLM-assisted workflows.
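As a rough illustration of how transparent linear scoring over deficits might look, the sketch below scores each candidate prompt as a dot product between a measured deficit vector and a fixed weight vector, so per-deficit contributions stay inspectable. The deficit features, weights, prompt names, and epsilon value are all assumptions for illustration, not the authors' actual policy.

```python
# Illustrative linear scoring of candidate prompts over workspace
# deficits; higher score = prompt better matched to current gaps.
import numpy as np

# Hypothetical deficit features measured from the workspace:
# [theory gaps, failing tests, unverified claims, stale docs]
deficits = np.array([0.2, 0.7, 0.9, 0.4])

# Each candidate prompt carries a fixed weight vector describing
# which deficits it is designed to reduce.
prompt_weights = {
    "extend_theory":      np.array([1.0, 0.0, 0.1, 0.0]),
    "fix_implementation": np.array([0.0, 1.0, 0.2, 0.0]),
    "audit_claims":       np.array([0.1, 0.2, 1.0, 0.3]),
    "refresh_docs":       np.array([0.0, 0.0, 0.2, 1.0]),
}

# Linear scoring keeps the choice legible: each score is a dot product,
# so the contribution of every deficit can be printed directly.
scores = {name: float(w @ deficits) for name, w in prompt_weights.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} score={s:.2f}")

# Greedy exploitation; the small epsilon-greedy step (an assumption,
# not stated in the summary) recovers standard bandit exploration.
rng = np.random.default_rng(0)
epsilon = 0.1
if rng.random() < epsilon:
    chosen = rng.choice(list(prompt_weights))
else:
    chosen = max(scores, key=scores.get)
print("selected:", chosen)  # 'audit_claims' when unverified claims dominate
```

The legibility argument falls out of the linearity: unlike a learned neural policy, every selection decomposes into named deficit terms a developer can audit.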
For the broader AI development community, this work demonstrates that systematic orchestration of language models can produce competitive research-quality outputs, with the A3 static-analysis tool substantially outperforming baselines (F1=0.768 versus 0.364). The portfolio of 46 repositories across diverse domains provides evidence of generalizability beyond toy problems. However, the framework's reliance on careful prompt engineering and manual workspace state management suggests significant practical barriers to adoption outside research settings.
- Comet-H orchestrates LLMs to maintain synchronization between theory, code, benchmarks, and documentation throughout development cycles
- The system identifies and mitigates hallucination accumulation and desynchronization as LLM-specific failure modes in research software
- Prompt selection using transparent contextual bandit scoring over workspace deficits improves legibility compared to learned policies
- An A3 static-analysis tool built entirely within Comet-H achieved F1=0.768, more than double the next-best baseline
- Audit-and-contraction validation passes dominate successful project trajectories, suggesting refinement becomes critical in later development phases (see the sketch after this list)
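One plausible reading of an audit-and-contraction pass is shown below: documented claims carry evidence identifiers that are checked against verified artifacts, and unsupported claims are contracted out. The claim/evidence schema and the example claims are hypothetical, assumed here purely to illustrate the idea.

```python
# Hypothetical audit-and-contraction pass: keep only claims whose
# evidence identifier resolves to a verified artifact (e.g. a passing
# benchmark); everything else is contracted out of the documentation.
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    evidence: str | None  # identifier of a benchmark or proof, if any

def audit_and_contract(claims: list[Claim], verified: set[str]) -> list[Claim]:
    """Retain claims backed by verified evidence; report the rest."""
    kept = []
    for claim in claims:
        if claim.evidence in verified:
            kept.append(claim)
        else:
            print(f"contracted: {claim.text!r} (evidence: {claim.evidence})")
    return kept

claims = [
    Claim("Detector achieves F1=0.768 on the benchmark", "bench:a3-eval"),
    Claim("Scales linearly to very large repositories", None),  # unsupported
]
surviving = audit_and_contract(claims, verified={"bench:a3-eval"})
print([c.text for c in surviving])
```

Under this reading, repeated contraction passes would explain why unsupported claims stop accumulating across sessions: each pass shrinks the documented surface back to what the current artifacts actually support.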