11,659 AI articles curated from 50+ sources with AI-powered sentiment analysis, importance scoring, and key takeaways.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers propose AgentOS, a new operating system paradigm that replaces traditional GUI/CLI interfaces with natural language-driven interactions powered by AI agents. The system would feature an Agent Kernel for intent interpretation and task coordination, transforming conventional applications into modular skills that users can compose through natural language commands.
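The "Agent Kernel routing natural language to modular skills" idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: keyword matching stands in for LLM-based intent interpretation, and all class and skill names are invented.

```python
# Minimal sketch of an Agent Kernel: applications register as modular
# "skills", and the kernel maps a natural-language command to a pipeline
# of skill calls. Keyword matching stands in for the LLM intent
# interpreter described in the paper; all names are illustrative.

class AgentKernel:
    def __init__(self):
        self.skills = {}  # intent keyword -> callable skill

    def register(self, keyword, skill):
        self.skills[keyword] = skill

    def interpret(self, command):
        # Stand-in for LLM intent parsing: select every skill whose
        # keyword appears in the command, in registration order.
        return [s for kw, s in self.skills.items() if kw in command.lower()]

    def run(self, command, payload):
        # Compose the selected skills into a pipeline.
        for skill in self.interpret(command):
            payload = skill(payload)
        return payload

kernel = AgentKernel()
kernel.register("summarize", lambda text: text.split(".")[0] + ".")
kernel.register("uppercase", lambda text: text.upper())

result = kernel.run("summarize and uppercase this note",
                    "AgentOS replaces GUIs. It uses skills.")
print(result)  # AGENTOS REPLACES GUIS.
```

The point of the sketch is the inversion of control: users compose behavior by phrasing intent, and the kernel, not the user, decides which "applications" run.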
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠MASEval introduces a new framework-agnostic evaluation library for multi-agent AI systems that treats entire systems, rather than just models, as the unit of analysis. Experiments spanning three benchmarks and multiple models and frameworks reveal that framework choice impacts performance as much as model selection, challenging current model-centric evaluation approaches.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers have developed a framework that uses large language models (LLMs) to automate superconducting qubit experiments, potentially streamlining quantum computing research. The system successfully demonstrated autonomous resonator characterization and quantum non-demolition measurements, offering a more user-friendly approach to controlling complex quantum hardware.
AI · Neutral · arXiv – CS AI · Mar 11 · 7/10
🧠This research paper proposes rethinking safety cases for frontier AI systems by drawing on methodologies from traditional safety-critical industries like aerospace and nuclear. The authors critique current alignment community approaches and present a case study focusing on Deceptive Alignment and CBRN capabilities to establish more robust safety frameworks.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers introduce BiCLIP, a new framework that improves vision-language models' ability to adapt to specialized domains through geometric transformations. The approach achieves state-of-the-art results across 11 benchmarks while maintaining simplicity and low computational requirements.
AI · Bearish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers developed NetDiffuser, a framework that uses diffusion models to generate natural adversarial examples capable of deceiving AI-based network intrusion detection systems. The system achieved up to 29.93% higher attack success rates compared to baseline attacks, highlighting significant vulnerabilities in current deep learning-based security systems.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers introduce SATURN, a new reinforcement learning framework that uses Boolean Satisfiability (SAT) problems to improve large language models' reasoning capabilities. The framework addresses key limitations in existing RL approaches by enabling scalable task construction, automated verification, and precise difficulty control through curriculum learning.
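Two properties make SAT attractive as an RL task source: instances can be generated at any scale with a difficulty knob (the clause-to-variable ratio), and any proposed solution is verifiable in linear time. A minimal sketch, with function names that are illustrative rather than from the paper:

```python
import random

def make_3sat(num_vars, num_clauses, seed=0):
    # Random 3-SAT instance; difficulty (for solvers) peaks near a
    # clause/variable ratio of ~4.26, giving a natural curriculum knob.
    rng = random.Random(seed)
    return [[v if rng.random() < 0.5 else -v
             for v in rng.sample(range(1, num_vars + 1), 3)]
            for _ in range(num_clauses)]

def verify(clauses, assignment):
    # Automated reward check: a clause (list of signed literals) holds
    # if any of its literals is true under the assignment.
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses)

# Verification on a hand-made instance:
instance = [[1, -2, 3], [-1, 2, 3]]
print(verify(instance, {1: True, 2: True, 3: False}))   # True
print(verify(instance, {1: True, 2: False, 3: False}))  # False
```

Cheap, exact verification is what enables the automated reward signal; the ratio knob gives the precise difficulty control the curriculum relies on.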
AI · Neutral · arXiv – CS AI · Mar 11 · 7/10
🧠A research study reveals that AI-powered search engines like Perplexity, SearchGPT, and Google Gemini produce highly variable citation results for identical queries, making single-run visibility metrics unreliable. The study demonstrates that citation distributions follow power-law patterns with substantial variability, and argues that uncertainty estimates are essential for accurate measurement of domain visibility in generative search.
🏢 OpenAI · 🏢 Perplexity · 🧠 Gemini
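The practical consequence of that finding is that one query run is an unreliable visibility measurement. A minimal remedy, sketched below (this is not the paper's exact estimator, and the run data is invented), is to repeat the query and report a bootstrap confidence interval for a domain's citation share:

```python
import random

def citation_share(runs, domain):
    # Fraction of query runs in which the domain was cited at all.
    return sum(domain in cited for cited in runs) / len(runs)

def bootstrap_ci(runs, domain, n_boot=1000, alpha=0.05, seed=0):
    # Resample the runs with replacement to get an interval estimate.
    rng = random.Random(seed)
    shares = sorted(citation_share([rng.choice(runs) for _ in runs], domain)
                    for _ in range(n_boot))
    return (shares[int(alpha / 2 * n_boot)],
            shares[int((1 - alpha / 2) * n_boot) - 1])

# 20 hypothetical runs of one query; each run is the set of domains cited.
runs = [{"example.com", "docs.org"} if i % 3 else {"docs.org"}
        for i in range(20)]
share = citation_share(runs, "example.com")   # the single-number metric
lo, hi = bootstrap_ci(runs, "example.com")    # the uncertainty around it
print(share, (lo, hi))
```

Reporting the interval rather than the point estimate is exactly the kind of uncertainty quantification the study argues is essential for generative-search visibility metrics.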
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers introduced HCAPO, a new framework that uses hindsight credit assignment to improve Large Language Model agents' performance in long-horizon tasks. The system leverages LLMs as post-hoc critics to refine decision-making, achieving 7.7% and 13.8% improvements over existing methods on WebShop and ALFWorld benchmarks respectively.
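The hindsight credit-assignment idea can be sketched simply: after an episode finishes, a post-hoc critic scores each step's contribution, and the scalar episode reward is redistributed as per-step credits. HCAPO uses an LLM as the critic; a lookup table stands in for it here, and the step names are invented.

```python
def redistribute(steps, final_reward, critic):
    # Score each step in hindsight, then split the episode reward
    # proportionally so decisive steps get more learning signal.
    scores = [max(critic(step), 0.0) for step in steps]
    total = sum(scores) or 1.0
    return [final_reward * s / total for s in scores]

# Toy critic (stand-in for an LLM judge): the purchase click
# was the decisive step in this shopping episode.
ratings = {"search": 1.0, "browse": 1.0, "click buy": 2.0}
credits = redistribute(["search", "browse", "click buy"], 1.0,
                       lambda step: ratings.get(step, 0.0))
print(credits)  # [0.25, 0.25, 0.5]
```

In long-horizon tasks this matters because a single end-of-episode reward otherwise spreads uniformly over dozens of steps, drowning out which decisions actually mattered.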
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers propose SEER (Self-Enhancing Efficient Reasoning), a framework that compresses Chain-of-Thought reasoning in Large Language Models while maintaining accuracy. The study found that longer reasoning chains don't always improve performance and can increase latency by up to 5x; SEER cuts CoT length by 42.1% while improving accuracy.
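The compression idea can be illustrated with a greedy pruner: drop a reasoning step, keep the shorter chain whenever the final answer is unchanged. This is a sketch of the general principle, not SEER's actual algorithm; `answer_of` stands in for re-running the model on the pruned chain.

```python
def compress_cot(steps, answer_of):
    # Greedily remove steps that don't change the final answer.
    target = answer_of(steps)
    kept = list(steps)
    i = 0
    while i < len(kept):
        trial = kept[:i] + kept[i + 1:]
        if trial and answer_of(trial) == target:
            kept = trial          # step i was redundant
        else:
            i += 1                # step i is load-bearing; keep it
    return kept

# Toy "model": the answer is whatever the last step states.
answer_of = lambda steps: steps[-1]
chain = ["restate the question", "note 2+2", "note 2*2", "4"]
compressed = compress_cot(chain, answer_of)
print(compressed)  # ['4']
```

The latency point follows directly: every pruned step is a run of decoded tokens the model no longer has to generate at inference time.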
AI · Neutral · arXiv – CS AI · Mar 11 · 7/10
🧠Research analyzes FP4 quantization sensitivity across different layers in large language models using NVFP4 and MXFP4 formats on Qwen2.5 models. The study finds MLP projection layers are most sensitive to quantization, while attention layers show substantial robustness to FP4 precision reduction.
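A layer-wise sensitivity sweep of this kind can be sketched by snapping each weight tensor to the 4-bit E2M1 grid and ranking layers by reconstruction error. The sketch below uses a single per-tensor scale (a simplification of NVFP4/MXFP4's fine-grained block scaling), and the layer names and weights are synthetic.

```python
FP4_POS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]   # E2M1 magnitudes
FP4_GRID = sorted(FP4_POS + [-v for v in FP4_POS if v])

def quantize_fp4(weights):
    # Scale so the largest magnitude maps to the grid maximum (6.0),
    # then round every weight to the nearest representable FP4 value.
    scale = (max(abs(w) for w in weights) / 6.0) or 1.0
    return [scale * min(FP4_GRID, key=lambda g: abs(w / scale - g))
            for w in weights]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

layers = {
    "mlp.down_proj": [0.9, -2.7, 0.05, 1.4, -0.3, 4.8],
    "attn.q_proj":   [0.5, -0.5, 1.0, -1.0, 2.0, -3.0],
}
errors = {name: mse(w, quantize_fp4(w)) for name, w in layers.items()}
for name in sorted(errors, key=errors.get, reverse=True):
    print(f"{name}: MSE {errors[name]:.4f}")
```

Running this kind of sweep per layer is what lets a study conclude that some layer types (here, MLP projections) lose more information under FP4 than others (attention projections).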
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers developed a hybrid quantum-classical framework combining LSTM neural networks with Quantum Circuit Born Machines for financial volatility forecasting. Testing on Shanghai Stock Exchange data showed significant improvements over classical methods in key metrics like MSE and RMSE, demonstrating quantum computing's potential in financial modeling.
AI · Neutral · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers propose 'Curveball steering', a nonlinear method for controlling large language model behavior that outperforms traditional linear approaches. The study challenges the Linear Representation Hypothesis by showing that LLM activation spaces have substantial geometric distortions that require geometry-aware interventions.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers have developed Zipage, a new high-concurrency inference engine for large language models that uses Compressed PagedAttention to solve memory bottlenecks. The system achieves 95% of the performance of full-KV inference engines while delivering over 2.1x speedup on mathematical reasoning tasks.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers introduce Efficient Draft Adaptation (EDA), a framework that significantly reduces the cost of adapting draft models for speculative decoding when target LLMs are fine-tuned. EDA achieves superior performance through decoupled architecture, data regeneration, and smart sample selection while requiring substantially less training resources than full retraining.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers propose a new asynchronous framework for LLM reinforcement learning that separates inference and training deployment, achieving 3-5x improvement in training throughput. The approach maintains on-policy correctness while enabling concurrent inference and training through a producer-consumer pipeline architecture.
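The producer-consumer split can be sketched with a bounded queue: an inference worker streams rollouts in while the trainer consumes them, so generation and gradient steps overlap instead of alternating. Rollouts are stubbed as strings here; real systems also handle weight synchronization and staleness corrections, which this sketch omits.

```python
import queue
import threading

rollouts = queue.Queue(maxsize=4)   # backpressure bounds rollout staleness
DONE = object()                     # sentinel to stop the consumer
trained = []

def producer(n):
    # Inference side: generate rollouts concurrently with training.
    for i in range(n):
        rollouts.put(f"rollout-{i}")   # stand-in for model.generate()
    rollouts.put(DONE)

def consumer():
    # Training side: take each rollout and apply a gradient step.
    while (item := rollouts.get()) is not DONE:
        trained.append(item)           # stand-in for a training step

t1 = threading.Thread(target=producer, args=(8,))
t2 = threading.Thread(target=consumer)
t1.start(); t2.start(); t1.join(); t2.join()
print(len(trained))  # 8
```

The bounded `maxsize` is the key design choice: it lets the producer run ahead just far enough to keep the trainer busy without letting rollouts drift too far off-policy.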
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers propose PRPO (Permutation Relative Policy Optimization), a reinforcement learning framework that enhances large language models' numerical reasoning capabilities for tabular data prediction. The method achieves performance comparable to supervised baselines while excelling in zero-shot scenarios, with an 8B parameter model outperforming much larger models by up to 53.17%.
AI · Bearish · arXiv – CS AI · Mar 11 · 7/10
🧠A comprehensive study reveals that multi-agent AI systems (MAS) face distinct security vulnerabilities that existing frameworks inadequately address. The research evaluated 16 AI security frameworks against 193 identified threats across 9 categories, finding that no framework achieves majority coverage in any single category, with non-determinism and data leakage being the most under-addressed areas.
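The coverage analysis behind that headline can be sketched as a matrix computation: map each threat to a category, record which threats each framework addresses, and check whether any framework reaches majority coverage in any category. All threat and framework names below are invented for illustration.

```python
threats = {  # threat id -> category
    "T1": "non-determinism", "T2": "non-determinism",
    "T3": "data-leakage", "T4": "data-leakage", "T5": "data-leakage",
}
addressed = {  # framework -> threat ids it covers
    "FrameworkA": {"T1"},
    "FrameworkB": {"T3"},
}

def category_coverage(framework):
    # Fraction of each category's threats that the framework addresses.
    cov = {}
    for cat in set(threats.values()):
        in_cat = {t for t, c in threats.items() if c == cat}
        cov[cat] = len(in_cat & addressed[framework]) / len(in_cat)
    return cov

for fw in addressed:
    cov = category_coverage(fw)
    print(fw, cov, "majority:", any(v > 0.5 for v in cov.values()))
```

In this toy data, as in the study's 16-framework, 193-threat evaluation, no framework clears 50% in any category.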
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers introduced TrustBench, a real-time verification framework that prevents harmful actions by AI agents before execution, achieving an 87% reduction in harmful actions across multiple tasks. The system uses domain-specific plugins for healthcare, finance, and technical domains with sub-200ms latency, marking a shift from post-execution evaluation to preventive action verification.
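Preventive action verification with domain plugins can be sketched as a gate in front of the agent's executor: each domain registers a check, and an action runs only if every applicable plugin approves. The plugin rule below is a toy example, not one of TrustBench's actual checks.

```python
class ActionVerifier:
    def __init__(self):
        self.plugins = []   # each: action dict -> (ok, reason)

    def register(self, plugin):
        self.plugins.append(plugin)

    def execute(self, action, do):
        # Verify BEFORE execution: a rejected action never reaches `do`.
        for plugin in self.plugins:
            ok, reason = plugin(action)
            if not ok:
                return f"blocked: {reason}"
        return do(action)

def finance_plugin(action):
    # Toy domain rule: large transfers require review.
    if action.get("type") == "transfer" and action.get("amount", 0) > 1000:
        return False, "transfer exceeds limit"
    return True, ""

verifier = ActionVerifier()
verifier.register(finance_plugin)
blocked = verifier.execute({"type": "transfer", "amount": 5000},
                           do=lambda a: "executed")
allowed = verifier.execute({"type": "transfer", "amount": 50},
                           do=lambda a: "executed")
print(blocked, "|", allowed)
```

The contrast with post-execution evaluation is structural: here the harmful call is never made, rather than being detected after the fact.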
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers introduce World2Mind, a training-free spatial intelligence toolkit that enhances foundation models' 3D spatial reasoning capabilities by up to 18%. The system uses 3D reconstruction and cognitive mapping to create structured spatial representations, enabling text-only models to perform complex spatial reasoning tasks.
🧠 GPT-5
AI · Neutral · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers introduce 'opaque serial depth' as a metric to measure how much reasoning large language models can perform without externalizing it through chain of thought processes. The study provides computational bounds for Gemma 3 models and releases open-source tools to calculate these bounds for any neural network architecture.
AI · Neutral · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers introduce OOD-MMSafe, a new benchmark revealing that current Multimodal Large Language Models fail to identify hidden safety risks up to 67.5% of the time. They also developed the CASPO framework, which reduces risk-identification failure rates to under 8% in consequence-driven safety scenarios.
AI · Neutral · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers have developed ALADIN, a framework for analyzing accuracy-latency trade-offs in AI accelerators for embedded systems. The tool enables evaluation of quantized neural networks without requiring deployment on target hardware, potentially reducing development time and costs for AI chip designers.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers have developed two software techniques (OAS and MBS) that dramatically improve MXFP4 quantization accuracy for Large Language Models, reducing the performance gap with NVIDIA's NVFP4 from 10% to below 1%. This makes MXFP4 a viable alternative while preserving its 12% hardware efficiency advantage in tensor cores.
🏢 Nvidia
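Background on why MXFP4 trails NVFP4 without software help: both formats store FP4 (E2M1) values with a shared per-block scale, but MXFP4 restricts that scale to a power of two (E8M0, over 32-element blocks) while NVFP4 uses a finer FP8 (E4M3) scale over 16-element blocks. The sketch below, on one small synthetic block, isolates just the scale-rounding difference; it is an illustration of the error source such software techniques compensate for, not the paper's method.

```python
import math

FP4_POS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]   # E2M1 magnitudes
FP4_GRID = sorted(FP4_POS + [-v for v in FP4_POS if v])

def quantize_block(block, pow2_scale):
    ideal = max(abs(x) for x in block) / 6.0
    # MXFP4 must round the shared scale up to a power of two;
    # the finer NVFP4-style scale is approximated by `ideal` here.
    scale = 2.0 ** math.ceil(math.log2(ideal)) if pow2_scale else ideal
    return [scale * min(FP4_GRID, key=lambda g: abs(x / scale - g))
            for x in block]

block = [0.07, -0.21, 0.33, 0.9, -0.45, 0.6]
sse = lambda q: sum((a - b) ** 2 for a, b in zip(block, q))
pow2_err = sse(quantize_block(block, True))    # MXFP4-style scale
fine_err = sse(quantize_block(block, False))   # NVFP4-style scale
print(pow2_err > fine_err)  # True: the coarser scale costs accuracy
```

The power-of-two scale wastes part of the FP4 grid's dynamic range whenever the block maximum falls between powers of two, which is the gap software-side corrections aim to close.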
AI · Neutral · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers introduce MiniAppBench, a new benchmark for evaluating Large Language Models' ability to generate interactive HTML applications rather than static text responses. The benchmark includes 500 real-world tasks and an agentic evaluation framework called MiniAppEval that uses browser automation for testing.