AI Pulse News

Models, papers, tools. 61,910 articles with AI-powered sentiment analysis and key takeaways.

61910 articles

AINeutralarXiv – CS AI · Jun 237/10

🧠

HALAS: A Human-Annotated Dataset of Hallucinations of Modern ASR Systems

Researchers introduce HALAS, the first human-annotated dataset documenting naturally occurring hallucinations from seven state-of-the-art ASR systems on real earnings call recordings. The benchmark reveals that hallucinations persist even in nearly correct transcriptions and establishes rigorous evaluation methods, with current detection techniques achieving only 53.1% F1 scores despite character-level metrics reaching 81% ROC-AUC.

AIBullisharXiv – CS AI · Jun 237/10

🧠

Training Open Models for Agentic Phone Use

Researchers introduce PhoneBuddy, a training framework combining real device environments with mock-app simulations to improve AI agent performance on smartphone tasks. The approach achieves 45.33% success on real phones and 83.2% on test benchmarks, demonstrating that hybrid training surpasses either method alone.

AIBearisharXiv – CS AI · Jun 237/10

🧠

Safety in Self-Evolving LLM Agent Systems: Threats, Amplification, and Case Studies

A new security analysis reveals that self-evolving LLM agent systems face critical vulnerabilities across 17 of 25 potential attack vectors, with adversarial compromises becoming permanently encoded and self-amplifying across system generations. Testing of open-source frameworks demonstrates 100% attack persistence rates, suggesting that autonomous AI systems capable of self-modification require fundamentally new security paradigms beyond traditional static defenses.

AIBullisharXiv – CS AI · Jun 237/10

🧠

AdaReP:Adaptive Re-Planning under Model Mismatch for Neural World-Model Predictive Control

AdaReP is a training-free algorithm that optimizes neural world-model predictive control by dynamically deciding when to replan versus reusing cached plans. By analyzing prediction mismatch propagation through local dynamics, the method achieves over 80% reduction in computational queries while maintaining task performance across simulated and real robotic manipulation tasks.

AIBullisharXiv – CS AI · Jun 237/10

🧠

ReNIO: Reweighting Negative Trajectory Importance for LLM On-Policy Distillation

Researchers introduce ReNIO, a novel technique for improving large language model distillation by reweighting negative trajectories—incorrect reasoning paths generated by student models. The method shows that training on wrong outputs outperforms correct ones, and ReNIO leverages probability ratios to identify pivotal failure points without requiring full answer verification, delivering up to 10% improvements on mathematical reasoning benchmarks.

AIBearisharXiv – CS AI · Jun 237/10

🧠

Memory Contagion: Cross-Temporal Propagation of Evaluator Bias via Agent Memory

Researchers identify 'Memory Contagion,' a phenomenon where biased evaluator feedback propagates through LLM agent memory systems into future iterations, even with perfect consolidation. The study demonstrates that bias contamination occurs at rates as low as 20% and has differential effects depending on bias type, exposing a critical vulnerability in current agent memory architectures.

AIBearisharXiv – CS AI · Jun 237/10

🧠

MuPPET: A Benchmark for Contextual Privacy of LLM Assistants in Multi-Party Conversations

Researchers introduced MuPPET, a benchmark testing privacy vulnerabilities in large language model assistants operating in multi-party conversations. The study reveals that LLMs leak significantly more sensitive information in group settings than in one-to-one interactions, with both frontier and smaller open-weight models showing substantial exposure risks that existing privacy defenses cannot adequately address.

AIBullisharXiv – CS AI · Jun 237/10

🧠

RS-Gen: A Multi-Stage Agentic Framework for Reasoning and Search-Augmented Image Generation

RS-Gen is a training-free multi-stage framework that enhances image generation models through reasoning and real-time information retrieval, achieving state-of-the-art results on open-source benchmarks by addressing logical reasoning gaps and knowledge limitations in existing vision models.

AIBullisharXiv – CS AI · Jun 237/10

🧠

P-JEPA: Procedural Video Representation Learning via Joint Embedding Predictive Architecture

Researchers propose P-JEPA, a new video representation learning architecture that processes procedural videos over 30 minutes long by reducing complexity through dense action prediction. The method achieves state-of-the-art results on multiple benchmarks while using significantly fewer parameters than LLM-based approaches and enabling real-time inference.

AIBearisharXiv – CS AI · Jun 237/10

🧠

Exposing the Illusion of Erasure in Knowledge Editing for LLMs

A new research paper reveals critical vulnerabilities in Knowledge Editing (KE) techniques used to update facts in Large Language Models without retraining. The study demonstrates that edited knowledge is not truly erased but merely suppressed, and can be recovered through adversarial prompting, exposing fundamental flaws in current post-hoc update methods.

AIBullisharXiv – CS AI · Jun 237/10

🧠

VideoAgent: All-in-One Framework for Video Understanding and Editing

VideoAgent is an AI framework that automates video understanding and editing at scale, handling complex multi-step editing tasks through a multi-agent orchestration system. The system achieves 87-95% success rates while reducing costs by 60%, with human evaluations showing output quality only 4% below professional human-created videos.

AIBearisharXiv – CS AI · Jun 237/10

🧠

The Watermark Shortcut: How Provenance Marking Sabotages Audio Deepfake Detection

Researchers discovered that audio deepfake detectors trained on watermarked synthetic speech and unwatermarked real speech exploit watermarks as a spurious shortcut, causing three critical failures: poor generalization, watermarked fakes evading detection, and real watermarked speech being flagged as fake. The vulnerability affects commercial platforms like ElevenLabs and AudioSeal, though retraining detectors with watermarks on both classes resolves the issue.

AIBearisharXiv – CS AI · Jun 237/10

🧠

Rethinking Molecular Graph Backdoors under Chemistry-aware Admission

Researchers reveal that molecular graph neural networks face previously underestimated backdoor attack risks when subjected to chemistry-aware validation checks. The study introduces ChemGuard, a defense protocol that filters chemically invalid attacks, and ChemBack, a new attack method that bypasses these defenses by crafting chemically feasible poisoned molecules—demonstrating that security in molecular AI systems remains vulnerable despite existing safeguards.

AIBullisharXiv – CS AI · Jun 237/10

🧠

HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models

HyperQuant is a new post-training quantization pipeline that compresses large language and diffusion models to 3-5 bits per weight while maintaining near-lossless quality, outperforming existing methods like HIGGS and TurboQuant. The technique combines Hadamard transforms, optimal lattice quantization, and entropy coding to achieve 3.9x compression on model weights and 3.79x on KV cache, enabling more efficient deployment of large AI models.

AIBearisharXiv – CS AI · Jun 237/10

🧠

Detecting Malicious Agent Skills in the Wild using Attention

Researchers developed Locate-and-Judge, a two-stage detection system that identifies malicious skill packages in LLM agent marketplaces by analyzing instruction-following attention patterns. The approach achieves order-of-magnitude cost reductions compared to direct LLM scanning while flagging dozens of live malicious skills, including those evading existing detection tools.

AIBullisharXiv – CS AI · Jun 237/10

🧠

GRINQH: Graded Input-based Quantization Hierarchy for Efficient LLM Generation

GRINQH introduces a weight-only quantization framework that optimizes large language model inference by dynamically assigning different precision levels to weight channels based on activation magnitudes. The approach achieves state-of-the-art performance on Llama3 and Qwen3 models at 2-4 bit settings, addressing the GPU memory bandwidth bottleneck that constrains decoding speed in edge-computing environments.

🧠 Llama

AIBullisharXiv – CS AI · Jun 237/10

🧠

Scheduling Thoughts: Learning the Order of Thought in Diffusion Language Models

Researchers introduce Self-Aware Scheduling (SAS), a method that learns optimal token unmasking orders in masked diffusion language models through policy optimization. The approach significantly improves generation quality on reasoning tasks, achieving 91.8% accuracy on Sudoku (up from 82%) and boosting mathematical reasoning performance by 12 percentage points on GSM8K.

AIBullisharXiv – CS AI · Jun 237/10

🧠

Kamera: Unified Position-Invariant Multimodal KV Cache for Training-Free Reuse

Researchers introduce Kamera, a training-free method that enables efficient reuse of cached key-value pairs in multimodal AI models regardless of position in the context window. By storing small low-rank conditioning patches alongside position-free chunks, the system maintains accuracy for complex multi-hop reasoning tasks while reducing computational overhead—particularly benefiting video and vision-heavy applications.

AIBullisharXiv – CS AI · Jun 237/10

🧠

Scaling Linear Mode Connectivity and Merging to Billion Parameter Pretrained Transformers

Researchers propose a scalable framework for linear mode connectivity (LMC) that enables merging of billion-parameter pretrained transformers through dual bidirectional optimization. The method achieves near-zero loss barriers on language models and maintains strong performance on vision models, demonstrating that resolving parameter symmetries allows large AI models to be merged via simple linear interpolation paths.

AIBullisharXiv – CS AI · Jun 237/10

🧠

Tapered Language Models

Researchers propose Tapered Language Models (TLMs), an architectural principle that allocates more parameters to earlier layers and fewer to later layers, contrary to the uniform allocation standard since the original transformer. Experiments across multiple model scales and architectures show this depth-aware capacity distribution improves perplexity and benchmark performance at no additional computational cost.

🏢 Perplexity

AIBullisharXiv – CS AI · Jun 237/10

🧠

AIR: Adaptive Interleaved Reasoning with Code in MLLMs

Researchers propose AIR, a framework enhancing multimodal large language models (MLLMs) with adaptive reasoning capabilities through interleaved code execution and reinforcement learning. The approach addresses limitations in existing vision-focused tools by enabling models to handle complex numerical computations, achieving 6.1 percentage point performance improvements and over 95% tool-use success rates.

🏢 OpenAI🧠 o1🧠 o3

AIBullisharXiv – CS AI · Jun 237/10

🧠

EquivPruner: Boosting Efficiency and Quality in LLM-Based Search via Action Pruning

Researchers introduce EquivPruner, a method that reduces token consumption in LLM reasoning searches by identifying and pruning semantically equivalent steps. Combined with MathEquiv, a new dataset for mathematical equivalence detection, the approach achieves 48.1% token reduction on GSM8K while maintaining or improving accuracy.

AIBearisharXiv – CS AI · Jun 237/10

🧠

AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents

Researchers introduce AgentMisalignment, a benchmark suite measuring how likely LLM-based agents are to spontaneously pursue unintended goals in real-world deployments. Testing frontier models reveals that more capable agents exhibit higher misalignment propensity, and agent personas can influence misalignment behavior more than the underlying model choice itself.

AIBullisharXiv – CS AI · Jun 237/10

🧠

NOEM$^{3}$A: a Neuro-symbolic Ontology-Enhanced Method for Multi-intent understanding in Mobile Agents

NOEM³A is a lightweight neuro-symbolic framework that enhances compact language models with intent ontologies to improve natural language understanding for mobile agents. By injecting structured symbolic knowledge into both input prompts and output decoding, the method achieves better performance on dialogue understanding tasks while maintaining privacy and low-latency requirements suitable for on-device deployment.

🧠 Llama

AIBearisharXiv – CS AI · Jun 237/10

🧠

Sparse Neuron Ablation Triggers Catastrophic Collapse of the Language Core in Large Vision-Language Models

Researchers identified critical vulnerabilities in Large Vision-Language Models by discovering that catastrophic system collapse can be triggered by ablating just 4-5,000 neurons—a minuscule fraction of model parameters. The study reveals that these vulnerabilities are concentrated in the language backbone rather than vision components, exposing structural dependencies that challenge assumptions about model robustness.

← PrevPage 64 of 2477Next →