MPC-Patch-Bench: Security-Aware LLM Code Patch for Multi-Party Computation
Researchers introduce MPC-Patch-Bench, the first repository-level benchmark for evaluating LLM code repair in Secure Multi-Party Computation (MPC) systems. On this benchmark, the strongest evaluated LLM functionally resolves only 22.9% of MPC tasks, and that figure drops to 17.1% once security and numerical-fidelity constraints are applied, highlighting significant gaps in AI's ability to handle cryptographically sensitive code.
MPC-Patch-Bench addresses a critical gap in AI-assisted code repair by introducing the first standardized evaluation framework specifically designed for Secure Multi-Party Computation software. Unlike general-purpose benchmarks such as SWE-bench, this framework accounts for MPC's unique structural challenges: codebases heavy with generic Python infrastructure, lack of standardized test coverage, and the need for cryptographic correctness alongside functional correctness. The benchmark combines domain-specific data curation with an MPC-aware verifier that checks for unsafe information reveals, insecure arithmetic operations, and illegal type conversions.
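To make the verifier idea concrete, here is a minimal sketch of a static check over a Python patch. It is not the benchmark's actual verifier: the call names in `REVEAL_CALLS` and `UNSAFE_CASTS` are hypothetical placeholders, and the rule is deliberately crude (it flags every matching call rather than tracking which values are secret). It covers only two of the three categories named above, reveals and type conversions.

```python
from __future__ import annotations

import ast

# Hypothetical rule set for illustration only; a real MPC framework
# would need its own list of reveal-style calls and forbidden casts.
REVEAL_CALLS = {"reveal", "declassify"}
UNSAFE_CASTS = {"float", "int", "bool"}


def call_name(node: ast.Call) -> str:
    """Return the bare function or method name of a call node."""
    func = node.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return ""


def find_issues(source: str) -> list[tuple[int, str]]:
    """Flag reveal-style calls and built-in plaintext casts, by line."""
    issues = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        name = call_name(node)
        if name in REVEAL_CALLS:
            issues.append((node.lineno, f"reveal call: {name}"))
        elif name in UNSAFE_CASTS and isinstance(node.func, ast.Name):
            issues.append((node.lineno, f"plaintext cast: {name}"))
    return sorted(issues)


if __name__ == "__main__":
    patch = "x = secret_a + secret_b\ny = x.reveal()\nz = float(y)\n"
    for lineno, message in find_issues(patch):
        print(f"line {lineno}: {message}")
```

A patch that passes its functional tests could still be flagged by a check of this kind, which is the gap the benchmark's two success rates are meant to capture.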
The research exposes a substantial capability gap between general-purpose LLM performance and security-critical cryptographic code repair. The strongest evaluated LLM resolves 22.9% of tasks functionally, but only 17.1% also pass security and numerical-fidelity verification, so functional correctness alone is insufficient. For that model, roughly a quarter of apparently working patches are rejected by the verifier, and the rejected share is reported to reach up to 40%. This distinction matters because MPC systems increasingly underpin privacy-preserving machine learning, biomedical data collaboration, and secure analytics applications where cryptographic failure creates real privacy and security risks.
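The relationship between the two headline rates can be checked with a short calculation. The sketch below uses only the 22.9% and 17.1% figures from the article; the variable names are illustrative.

```python
functional_rate = 22.9  # % of tasks resolved functionally (strongest model)
verified_rate = 17.1    # % of tasks that also pass verification

# Absolute drop, in percentage points of all tasks.
drop_points = functional_rate - verified_rate

# Share of functionally passing patches that the verifier rejects.
rejected_share = drop_points / functional_rate * 100

print(f"drop: {drop_points:.1f} points")       # drop: 5.8 points
print(f"rejected: {rejected_share:.1f}%")      # rejected: 25.3%
```

For the strongest model this gives about 25%, so the "up to 40%" figure is an upper bound and cannot be derived from this pair of numbers alone.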
For the broader AI and cryptography communities, MPC-Patch-Bench provides both a measurement tool and a reality check. Its results indicate that current LLMs cannot reliably maintain production-grade cryptographic code without human oversight, which has direct implications for organizations considering AI-assisted development in privacy-critical systems. The benchmark's design, which combines automated curation with human-in-the-loop verification, suggests a model for evaluating LLMs in other security-sensitive domains. The gap between functional and verified success rates will likely drive future research into LLM training and verification methods tailored to cryptographic systems.
- The strongest evaluated LLM achieves only 17.1% verified success on MPC code repair tasks when security constraints are applied, down from 22.9% functional correctness.
- MPC-Patch-Bench is the first repository-level benchmark specifically designed for evaluating LLM code repair in cryptographic systems.
- Up to 40% of functionally correct patches fail security and numerical-fidelity verification, highlighting the gap between functional and cryptographic correctness.
- The benchmark combines domain-specific curation with static and dynamic verification methods tailored to MPC security requirements.
- Results indicate that LLMs need significant improvement before they can reliably maintain code autonomously in privacy-preserving and cryptographic applications.