AINeutralarXiv – CS AI · 8h ago6/10
🧠
DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules
Researchers introduce DiagnosticIQ, a benchmark dataset of 6,690 expert-validated questions testing whether large language models can recommend maintenance actions based on industrial sensor rules. Evaluation of 29 LLMs reveals that while frontier models perform well on standard tasks, they exhibit significant brittleness—losing 13-60% accuracy under minor perturbations and pattern-matching rather than reasoning when conditions are inverted.