Are Large Language Models Robust in Understanding Code Against Semantics-Preserving Mutations?
Researchers found that large language models frequently arrive at correct code predictions through flawed reasoning, with performance dropping by up to 70% when code undergoes semantics-preserving mutations. The study reveals a substantial gap between apparent accuracy and genuine semantic understanding, calling into question the reliability of LLMs for critical programming tasks.
A comprehensive empirical study demonstrates fundamental weaknesses in how state-of-the-art LLMs understand code semantics. Researchers tested nine models by applying five mutation techniques that preserve program meaning while altering syntax: variable renaming, comparison mirroring, branch swapping, loop conversion, and unrolling (a sketch of these transformations follows below).

The findings expose a critical disconnect between reported accuracy metrics and actual reasoning capabilities. Between 10% and 50% of correct predictions stem from flawed logic rather than genuine comprehension, suggesting that models rely on pattern matching and surface-level features rather than true semantic analysis. Performance degradation of up to 70% under minimal syntactic changes indicates that LLMs lack stable, semantically grounded understanding even when initial accuracy appears strong. While proprietary models such as GPT-4 outperform open-source alternatives in both accuracy and expert-evaluated reasoning quality, all models prove fragile across mutation scenarios.

This research challenges the assumption that high accuracy equates to reliable code understanding. For the developer community, the findings counsel caution when deploying LLMs for code analysis, generation, or review in high-stakes environments: instability under semantics-preserving transformations means LLMs may fail unpredictably when they encounter legitimately equivalent code variations. The implications extend beyond academic concern. Production systems that rely on LLM-based code assistance could inherit this fragility, potentially introducing subtle bugs or security vulnerabilities that surface only under specific code formulations.
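To make the mutation techniques concrete, here is a minimal sketch in Python. The function and the specific rewrites are illustrative assumptions, not programs from the study's benchmark; both versions compute the same result for every input.

```python
# Original program.
def count_positive(values):
    total = 0
    for v in values:
        if v > 0:
            total += 1
    return total

# Semantically equivalent mutant combining four of the five techniques.
def count_positive_mutated(xs):   # variable renaming: values -> xs, total -> acc
    acc = 0
    i = 0
    while i < len(xs):            # loop conversion: for-loop rewritten as while-loop
        if not (0 < xs[i]):       # comparison mirroring: v > 0 becomes 0 < v
            pass                  # branch swapping: condition negated, arms exchanged
        else:
            acc += 1
        i += 1
    return acc

# Both versions agree on every input.
assert count_positive([3, -1, 4, 0]) == count_positive_mutated([3, -1, 4, 0])  # == 2
```

An unrolling mutation would similarly expand the loop body across iterations without changing behavior; in every case, a model that genuinely tracks semantics should answer identically for both versions.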
- LLMs produce correct code predictions through flawed reasoning in 10% to 50% of cases, despite high accuracy metrics
- Performance drops by up to 70% when code undergoes semantics-preserving mutations such as variable renaming or loop conversion
- Proprietary models show stronger accuracy and reasoning quality than open-source alternatives, but all exhibit fragility under transformations
- Current LLMs lack stable, semantically grounded understanding even when they appear to understand code at the surface level
- Critical implications for production systems deploying LLMs for code analysis, generation, and review (a consistency-check sketch follows this list)
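A simple way to probe this fragility in practice is differential consistency checking: ask the model the same question about two semantically equivalent programs and flag any disagreement. The sketch below assumes a hypothetical `query_model` wrapper around whatever LLM API is in use; it illustrates the idea and is not the study's evaluation code.

```python
ORIGINAL = """
def count_positive(values):
    total = 0
    for v in values:
        if v > 0:
            total += 1
    return total
"""

# Semantics-preserving mutant of ORIGINAL (renamed variables, while-loop,
# mirrored comparison, swapped branches).
MUTATED = """
def count_positive(xs):
    acc = 0
    i = 0
    while i < len(xs):
        if not (0 < xs[i]):
            pass
        else:
            acc += 1
        i += 1
    return acc
"""

PROMPT = "What does this function return for the input [3, -1, 4]?\n\n{src}"

def query_model(prompt: str) -> str:
    # Placeholder for a real LLM call; this interface is an assumption.
    raise NotImplementedError

def prediction_is_stable() -> bool:
    """Return True if the model gives the same answer for both programs.

    An answer that flips between equivalent programs is evidence of
    surface-level pattern matching rather than semantic reasoning.
    """
    answer_original = query_model(PROMPT.format(src=ORIGINAL))
    answer_mutated = query_model(PROMPT.format(src=MUTATED))
    return answer_original.strip() == answer_mutated.strip()
```

Running such a check over a suite of mutants gives a rough robustness signal before trusting an LLM with analysis or review of code whose exact formulation may vary.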