AINeutralarXiv โ CS AI ยท 14h ago7/10
๐ง
METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models
Researchers introduce METER, a benchmark that evaluates Large Language Models' ability to perform contextual causal reasoning across three hierarchical levels within unified settings. The study identifies critical failure modes in LLMs: susceptibility to causally irrelevant information and degraded context faithfulness at higher causal levels.