AI · Neutral · arXiv – CS AI · 11h ago · 6/10
MultiZebraLogic: A Multilingual Logical Reasoning Benchmark
Researchers have developed MultiZebraLogic, a multilingual logical reasoning benchmark that uses zebra puzzles to evaluate LLM reasoning, with high-quality datasets in nine languages. The study introduces red herring clues as a way to increase difficulty and finds that puzzle complexity significantly affects model performance. GPT-4o mini and o3-mini are each appropriately challenged at a different puzzle size.
🏢 OpenAI · 🧠 GPT-4