AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠Researchers introduce VitalAgent, an AI framework that combines language models with tool-augmented reasoning to enable both reactive question answering and proactive monitoring of physiological data from wearable sensors such as ECG and PPG. The framework achieves a 30% improvement over baseline approaches and is validated against a new benchmark dataset (VitalBench) containing 1,862 QA pairs and 90+ hours of continuous biometric recordings.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce MAGIC-Video, a training-free framework that enables multimodal AI systems to process and reason about ultra-long videos spanning days or weeks by combining a structured memory graph with narrative chains. The system outperforms existing baselines on multiple benchmarks, addressing a critical limitation where current LLMs can only handle tens of minutes of video despite having million-token context windows.
AI · Bullish · arXiv – CS AI · Apr 15 · 7/10
🧠Researchers introduce dual-trace memory encoding for LLM agents, pairing factual records with narrative scene reconstructions to improve cross-session recall by 20+ percentage points. The method significantly enhances temporal reasoning and multi-session knowledge aggregation without increasing computational costs, advancing the capability of persistent AI agent systems.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers introduce Audio Flamingo Next (AF-Next), an advanced open-source audio-language model that processes speech, sound, and music with support for inputs up to 30 minutes. The model incorporates a new temporal reasoning approach and demonstrates competitive or superior performance compared to larger proprietary alternatives across 20 benchmarks.
AI · Bullish · arXiv – CS AI · Mar 5 · 7/10
🧠Researchers developed a new training method combining Chain-of-Thought supervision with reinforcement learning to teach large language models when to abstain from answering temporal questions they're uncertain about. Their approach enabled a smaller Qwen2.5-1.5B model to outperform GPT-4o on temporal question answering tasks while improving reliability by 20% on unanswerable questions.
🧠 GPT-4
AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 2
🧠Researchers introduce CoWVLA (Chain-of-World VLA), a new Vision-Language-Action model paradigm that combines world-model temporal reasoning with latent motion representation for embodied AI. The approach outperforms existing methods in robotic simulation benchmarks while maintaining computational efficiency through a unified autoregressive decoder that models both keyframes and action sequences.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠TANDEM introduces a unified framework for detecting hate speech in multimodal content by combining audio, visual, and textual analysis with temporal grounding. The system achieves a 30% improvement over existing methods in target identification while providing interpretable, actionable evidence for human moderators rather than functioning as a black box.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers introduce AsyncTool, a benchmark for evaluating how well LLM-based agents handle multiple concurrent tasks with realistic tool response delays. The study reveals that current AI agents struggle significantly with asynchronous multitasking, experiencing substantial performance degradation when tool feedback is delayed, highlighting a critical gap in real-world applicability.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers introduce SONIC-O1, a comprehensive benchmark for evaluating multimodal large language models on audio-video understanding tasks. The study reveals significant performance gaps between closed-source and open-source models, particularly in temporal localization, and identifies demographic disparities in model behavior across 60 hours of real-world conversational data.
🏢 Hugging Face
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce EgoMemReason, a comprehensive benchmark for evaluating AI systems on week-long egocentric video understanding through memory-driven reasoning. The benchmark reveals that even state-of-the-art multimodal models achieve only 39.6% accuracy, indicating that long-horizon memory and temporal reasoning remain unsolved challenges for next-generation visual assistants.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce AdaTKG, a machine learning approach for temporal knowledge graph reasoning that maintains an adaptive per-entity memory updated with each interaction. Compared with existing static representation methods, it makes better predictions on evolving relational data and handles unseen entities more effectively.
AI · Neutral · arXiv – CS AI · May 7 · 6/10
🧠Researchers present a neuro-symbolic framework that challenges the conventional belief that temporal reasoning failures in LLMs stem from inherent logical deduction deficits. By decoupling text-to-event representation from symbolic reasoning using a Probabilistic Inconsistency Signal, the framework achieves perfect accuracy on structured temporal tasks and identifies that representation quality—not reasoning capability—is the true bottleneck.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers introduce TimeSeriesExamAgent, a scalable framework for automatically generating time series reasoning benchmarks using LLM agents and templates. The study reveals that while large language models show promise in time series tasks, they significantly underperform in abstract reasoning and domain-specific applications across healthcare, finance, and weather domains.
AI · Neutral · arXiv – CS AI · Apr 7 · 6/10
🧠Researchers have developed LiveFact, a new dynamic benchmark for evaluating the ability of Large Language Models to detect fake news and misinformation under real-time conditions. The benchmark addresses limitations of static testing by using temporal evidence sets and finds that open-source models like Qwen3-235B-A22B now match proprietary systems in performance.
AI · Neutral · arXiv – CS AI · Mar 17 · 6/10
🧠Research reveals that Large Language Models struggle with dynamic Theory of Mind tasks, particularly tracking how others' beliefs change over time. While LLMs can infer current beliefs effectively, they fail to maintain and retrieve prior belief states after updates occur, showing patterns consistent with human cognitive biases.
AI · Bearish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers introduce HEARTS, a comprehensive benchmark for evaluating large language models' ability to reason over health time series data across 16 datasets and 12 health domains. The study reveals that current LLMs significantly underperform compared to specialized models and struggle with multi-step temporal reasoning in healthcare applications.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 8
🧠Researchers have developed Egocentric Co-Pilot, a web-native AI framework that runs on smart glasses and uses Large Language Models to provide assistive AI without requiring screens or free hands. The system combines perception, reasoning, and web tools to support accessibility for people with vision impairments or cognitive overload, showing superior performance compared to commercial baselines.