AI · Neutral · arXiv - CS AI · 14h ago · 7/10
Can Large Language Models Infer Causal Relationships from Real-World Text?
Researchers developed the first real-world benchmark for evaluating whether large language models can infer causal relationships from complex academic texts. The study finds that LLMs struggle significantly with this task: even the best models reach an F1 score of only 0.535, highlighting a critical gap in the reasoning capabilities needed for progress toward AGI.