Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
Dooly is a new profiling framework that optimizes LLM inference simulation by reducing redundant profiling across different hardware and software configurations. By leveraging structural insights about operation dependencies, the system cuts profiling costs by 56.4% while keeping simulation error within 5% for time-to-first-token and 8% for time-per-output-token, addressing a critical bottleneck in LLM deployment optimization.
Dooly addresses a fundamental inefficiency in large language model deployment optimization. Current profiling-based simulators require complete re-profiling whenever hardware, serving engines, or model configurations change, so profiling cost grows with the product of every configuration option an organization wants to evaluate. The research team observed that the input dimensions of LLM operations are largely predetermined by either the model architecture or the request characteristics, meaning identical operations recur across many configurations. By taint-propagating input origins through a single inference pass and selectively profiling only novel operations, Dooly eliminates these redundant measurements.
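To make the taint-propagation idea concrete, here is a minimal Python sketch. The `Tensor`, `Origin`, and `matmul` abstractions are illustrative assumptions, not Dooly's actual API: each tensor dimension carries a label recording whether its size comes from the model architecture, the request, or the hardware/engine configuration, and operations propagate those labels during a single traced pass.

```python
# A minimal sketch of taint propagation over input dimensions.
# Tensor, Origin, and matmul are hypothetical stand-ins, not Dooly's API.
from dataclasses import dataclass
from enum import Enum, auto

class Origin(Enum):
    MODEL_ARCH = auto()   # fixed by the model (e.g., hidden size, head count)
    REQUEST = auto()      # varies per request (e.g., sequence length)
    CONFIG = auto()       # varies per hardware/engine config (e.g., TP degree)

@dataclass
class Tensor:
    shape: tuple[int, ...]
    # one taint label per dimension, tracking where that size came from
    taints: tuple[Origin, ...]

def matmul(a: Tensor, b: Tensor) -> Tensor:
    """Propagate dimension taints: the output inherits the row taint of
    `a` and the column taint of `b`."""
    assert a.shape[-1] == b.shape[0]
    return Tensor(shape=(a.shape[0], b.shape[1]),
                  taints=(a.taints[0], b.taints[1]))

# Single traced pass: a (seq_len x hidden) activation times a weight matrix.
acts = Tensor((512, 4096), (Origin.REQUEST, Origin.MODEL_ARCH))
weight = Tensor((4096, 4096), (Origin.MODEL_ARCH, Origin.MODEL_ARCH))
out = matmul(acts, weight)
print(out.taints)  # (Origin.REQUEST, Origin.MODEL_ARCH)
```

Dimensions tainted only by the model architecture yield identical operations in every configuration, which is what makes their profiles reusable; request- and config-tainted dimensions are the ones that still need sweeping.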
The broader context reveals increasing complexity in LLM infrastructure decisions. As model serving becomes more competitive, operators must evaluate trade-offs across GPUs, attention mechanisms, and architectures to optimize latency and throughput. This multi-dimensional optimization space previously demanded prohibitive computational resources, creating friction in deployment pipelines. Dooly's 56.4% reduction in profiling GPU-hours directly translates to faster iteration cycles and lower operational costs for enterprises building LLM applications.
The market implications are substantial for AI infrastructure providers and enterprises deploying large models. Reduced profiling overhead accelerates the evaluation and optimization of inference configurations, lowering barriers to exploring cost-effective deployment strategies. This becomes particularly valuable as model sizes grow and inference costs dominate operational budgets. The framework's demonstrated accuracy across diverse platforms and architectures suggests broad applicability across the AI stack, potentially influencing infrastructure decisions at scale.
- Dooly reduces LLM profiling costs by 56.4% through configuration-agnostic, redundancy-aware analysis
- The system keeps simulation error within 5% for time-to-first-token and 8% for time-per-output-token
- Structural understanding of operation dependencies eliminates manual re-profiling across hardware and software configurations
- A single inference pass with taint propagation identifies which input dimensions are reusable across different configurations
- The framework integrates as a drop-in backend for existing simulators, enabling immediate adoption in production pipelines (see the caching sketch after this list)
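The sketch below shows how a redundancy-aware backend might plug into an existing simulator. It is written under assumptions: `profile_op`, `measure_kernel`, and the signature fields are hypothetical, and the real system's cache keys and measurement hooks may differ. The core move is keying profiles by a configuration-agnostic operation signature so that only genuinely novel operations ever touch the hardware.

```python
# A minimal sketch of redundancy-aware profiling as a simulator backend.
# profile_op and measure_kernel are illustrative, not Dooly's actual hooks.
from typing import Callable

# Cache keyed by a configuration-agnostic signature: op type + input shapes
# + dtype + the device executing the kernel. Entries are reused across
# serving-engine and model configurations that produce the same operation.
_profile_cache: dict[tuple, float] = {}

def profile_op(op_type: str, shapes: tuple, dtype: str, device: str,
               measure_kernel: Callable[[], float]) -> float:
    """Return a latency estimate, measuring on hardware only for
    signatures that have never been profiled before."""
    key = (op_type, shapes, dtype, device)
    if key not in _profile_cache:
        _profile_cache[key] = measure_kernel()  # the only GPU work done
    return _profile_cache[key]

# Two configurations that share a GEMM hit the same cache entry, so the
# second evaluation pays no profiling cost.
lat1 = profile_op("gemm", ((512, 4096), (4096, 4096)), "fp16", "A100",
                  measure_kernel=lambda: 0.42)  # stand-in measurement (ms)
lat2 = profile_op("gemm", ((512, 4096), (4096, 4096)), "fp16", "A100",
                  measure_kernel=lambda: 0.99)  # never called: cache hit
assert lat1 == lat2 == 0.42
```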