AI · Bearish · arXiv – CS AI · 10h ago · 7/10
IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs
Researchers introduce IndustryBench, a 2,049-item benchmark that tests large language models on industrial procurement tasks grounded in Chinese national standards. The study finds that current LLMs perform poorly on safety-critical industrial applications: the best models score only 2.08 out of 3.0, and extended reasoning paradoxically increases safety violations by introducing unsupported details into answers.
🧠 GPT-5