Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon
Researchers introduce Metal-Sci, a benchmark suite for optimizing machine learning kernels on Apple Silicon using evolutionary LLM-driven search. The system demonstrates speedups ranging from 1.0x (no improvement) to 10.7x across scientific computing tasks, while a held-out validation mechanism catches silent generalization regressions that in-distribution metrics alone cannot detect.
Metal-Sci addresses a significant gap in AI infrastructure optimization: systematically improving compute kernels for Apple's proprietary hardware through LLM-guided search. The benchmark spans six optimization domains—from stencil computations to FFT operations—providing a standardized evaluation framework that extends beyond theoretical performance to practical speedups. This work matters because Apple Silicon adoption is accelerating among AI developers, yet optimization tooling remains fragmented.
The methodological innovation centers on the held-out gate scoring function, which evaluates kernels on unseen problem sizes. This catches genuine regressions: an OpenAI GPT-optimized FFT kernel achieved a 2.95x speedup on training sizes but collapsed to 0.23x on 256³ cubes. Such silent failures represent a critical risk in automated optimization loops, where in-distribution metrics create false confidence. The framework tests three major LLMs—Claude Opus, Gemini Pro, and GPT—demonstrating that even state-of-the-art reasoning models produce kernels with generalization failures.
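The held-out gate described above can be sketched as follows. This is an illustrative reconstruction, not Metal-Sci's actual implementation: the function names, the `{size: (baseline_ms, candidate_ms)}` data shape, and the acceptance rule (worst held-out speedup must not fall below 1.0x) are all assumptions.

```python
def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Speedup of a candidate kernel over the reference kernel."""
    return baseline_ms / candidate_ms

def gate_score(heldout_runs: dict, min_heldout: float = 1.0):
    """Accept a kernel only if it does not regress on unseen problem sizes.

    heldout_runs maps problem size -> (baseline_ms, candidate_ms).
    Returns (accepted, worst_heldout_speedup).
    """
    worst = min(speedup(b, c) for b, c in heldout_runs.values())
    return worst >= min_heldout, worst

# The FFT failure mode from the text: ~2.95x in-distribution,
# collapsing to ~0.23x on a held-out 256^3 size.
heldout = {256: (400.0, 1739.0)}  # illustrative timings, ~0.23x
accepted, worst = gate_score(heldout)
print(accepted, round(worst, 2))  # → False 0.23
```

The point of gating on the *worst* held-out speedup, rather than the mean, is that a single collapsed size is enough to reject a candidate; averaging would let strong in-distribution results mask exactly the silent regression the gate exists to catch.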
For the developer ecosystem, Metal-Sci establishes reproducible benchmarks for Apple Silicon optimization, reducing fragmentation and enabling comparative analysis across LLM agents. The lightweight harness enables runtime compilation and structured diagnostics, making it accessible for researchers without deep Metal programming expertise. The open-source release accelerates community contributions to Apple Silicon optimization, directly impacting machine learning frameworks targeting Apple's hardware.
Looking forward, this model of LLM-guided kernel search with mechanical oversight primitives could extend to other hardware platforms and optimization domains. The emphasis on held-out validation suggests broader applications where automated system optimization risks silent failures—a pattern emerging across AI-driven infrastructure tools.
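The overall pattern—an LLM proposing kernel variants inside a search loop, with a mechanical held-out gate vetoing overfit candidates—can be sketched in miniature. Everything here is a stand-in under stated assumptions: the LLM proposer is stubbed as a random perturbation of a tunable tile size, and the synthetic cost model is contrived to reward different tile sizes in- and out-of-distribution, mimicking the overfitting failure described above.

```python
import random

BASELINE_MS = 2.0  # reference-kernel runtime (held constant for simplicity)

def propose(parent: dict) -> dict:
    """Stand-in for an LLM proposing a kernel variant: a random
    +/-8 perturbation of a tunable tile size."""
    return {"tile": max(1, parent["tile"] + random.choice([-8, 8]))}

def measure(kernel: dict, size: int) -> float:
    """Stand-in for benchmarking. The synthetic cost model rewards
    tile=32 on small (training) sizes but tile=8 on large held-out
    sizes, so pure in-distribution selection would overfit."""
    target = 32 if size <= 128 else 8
    return 1.0 + 0.05 * abs(kernel["tile"] - target)

def search(generations=50, train=(64, 128), heldout=(256,)):
    random.seed(0)  # deterministic for reproducibility
    best = {"tile": 8}
    for _ in range(generations):
        cand = propose(best)
        better_in_dist = (sum(measure(cand, s) for s in train)
                          < sum(measure(best, s) for s in train))
        # Mechanical oversight: veto any candidate that regresses past
        # the baseline on held-out sizes, however good it looks in-distribution.
        passes_gate = all(measure(cand, s) <= BASELINE_MS for s in heldout)
        if better_in_dist and passes_gate:
            best = cand
    return best
```

In this toy setup the gate halts the drift toward the in-distribution optimum (tile=32) before the held-out runtime exceeds the baseline, illustrating why the oversight primitive must sit inside the selection step rather than run as a post-hoc check.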
- Metal-Sci benchmark demonstrates 1.0x to 10.7x speedups through evolutionary LLM kernel optimization on Apple Silicon
- Held-out validation mechanism reveals silent regressions: a GPT FFT kernel achieved 2.95x in-distribution but dropped to 0.23x on unseen sizes
- Framework tests Claude Opus, Gemini Pro, and GPT, showing all produce kernels with generalization failures despite in-distribution wins
- Open-source release provides standardized benchmarks across six scientific computing domains for Apple Silicon optimization
- Mechanical oversight primitives in automated search loops address critical gap in catching failures that isolated metrics cannot detect