12,522 AI articles curated from 50+ sources with AI-powered sentiment analysis, importance scoring, and key takeaways.
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers have developed a precision-aware training time predictor for distributed deep learning that accounts for floating-point precision settings, achieving 9.8% prediction error versus 147.85% error for existing models that ignore precision variations. The work addresses a critical gap in resource allocation and cost estimation for AI training workloads, where precision choices alone can create 2.4x variations in training time.
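The predictor itself isn't reproduced in the summary; as a minimal sketch of the idea, a precision-aware estimate can scale a baseline per-step cost by a precision-dependent throughput factor (the multipliers below are illustrative placeholders, not values from the paper):

```python
# Hypothetical sketch of a precision-aware training-time predictor.
# The speedup multipliers are illustrative, not the paper's measurements.
PRECISION_SPEEDUP = {"fp32": 1.0, "bf16": 2.0, "fp16": 2.0, "fp8": 2.4}

def predict_training_time(steps, base_step_seconds, precision="fp32"):
    """Scale a baseline per-step cost by a precision-dependent throughput."""
    return steps * base_step_seconds / PRECISION_SPEEDUP[precision]
```

A predictor that drops the precision term collapses all of these estimates to the fp32 case, which is exactly where multi-fold errors like the 2.4x variation above come from.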
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers introduce AtManRL, a method that combines differentiable attention manipulation with reinforcement learning to improve the faithfulness of chain-of-thought reasoning in large language models. By training attention masks to identify which tokens genuinely influence model predictions, the approach demonstrates that LLM reasoning traces can be made more interpretable and transparent.
🧠 Llama
AI · Bullish · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers introduce JumpLoRA, a novel framework that uses sparse adapters with JumpReLU gating to enable continual learning in large language models while mitigating catastrophic forgetting. The method dynamically isolates parameters across tasks, outperforming existing state-of-the-art approaches like ELLA and significantly improving IncLoRA performance.
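JumpReLU is a published activation (zero at or below a threshold, identity above); a minimal NumPy sketch of gating a LoRA-style low-rank update with it, with all shapes and names illustrative rather than taken from the paper:

```python
import numpy as np

def jumprelu(x, theta=0.1):
    """JumpReLU: zero out entries at or below the threshold theta."""
    return np.where(x > theta, x, 0.0)

def gated_lora_forward(x, W, A, B, theta=0.1):
    """Frozen base layer plus a sparsely gated low-rank adapter update
    (illustrative sketch, not the JumpLoRA implementation)."""
    update = x @ A @ B              # low-rank LoRA path
    return x @ W + jumprelu(update, theta)
```

The gate is what gives the sparsity: adapter activations that stay below theta contribute nothing, which is one way parameters can be isolated per task.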
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers propose a conformal prediction framework for large language models that uses internal neural representations rather than surface-level outputs to assess reliability and uncertainty. The Layer-Wise Information scoring method improves prediction validity under distribution shift while maintaining competitive performance, addressing a critical challenge in deploying LLMs where traditional uncertainty signals become unreliable.
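The Layer-Wise Information score itself isn't given in the summary; for background, split conformal prediction turns any nonconformity score (representation-based or otherwise) into calibrated prediction sets roughly like this generic sketch, which is not the authors' method:

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split conformal: the (1 - alpha)-adjusted quantile of calibration
    nonconformity scores; lower scores mean 'more conforming'."""
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(cal_scores, q))

def prediction_set(candidate_scores, threshold):
    """Keep every candidate answer whose score falls within the threshold."""
    return [i for i, s in enumerate(candidate_scores) if s <= threshold]
```

The coverage guarantee holds for any score function, so the paper's contribution can be read as choosing a score (from internal layers) that stays informative under distribution shift.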
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠The ACM FAccT conference employed large-scale participatory design to democratize governance decisions around AI fairness, accountability, and transparency. The process combined in-person sessions, asynchronous polling, and community-authored statements to shape the conference agenda and organizational direction.
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers evaluated four major LLMs (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, Grok-1) on Vietnamese legal text simplification using a dual-aspect framework combining benchmarking metrics with expert-validated error analysis. The study reveals a critical trade-off: while some models excel at readability, they sacrifice legal accuracy, and high accuracy scores often mask subtle but serious reasoning errors that matter in legal contexts.
🧠 GPT-4 · 🧠 Claude · 🧠 Gemini
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers introduce SAI-DPO, a dynamic data sampling framework that adapts training data selection based on a model's evolving capabilities during training, rather than using static metrics. Tested on mathematical reasoning benchmarks including AIME24 and AMC23, SAI-DPO achieves state-of-the-art performance with significantly less training data, outperforming baselines by nearly 6 points.
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers introduce TabularMath, a benchmark and neuro-symbolic framework for evaluating large language models' mathematical reasoning over tabular data. The study reveals that LLMs struggle with table complexity, low-quality data, and inconsistent information—critical limitations for real-world business intelligence applications that demand reliable numerical reasoning.
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers present Deliberative Searcher, a framework that enhances large language model reliability by combining certainty calibration with retrieval-based search for question answering. The system uses reinforcement learning with soft reliability constraints to improve alignment between model confidence and actual correctness, producing more trustworthy outputs.
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers challenge the Uniform Information Density hypothesis in LLM reasoning, finding that high-quality reasoning exhibits locally smooth but globally non-uniform information flow. This counter-intuitive pattern suggests LLMs optimize differently than human communication, with entropy-based metrics effectively predicting reasoning quality across seven benchmarks.
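The paper's specific metrics aren't spelled out in the summary; the raw ingredient behind such entropy-based measures is just the per-step Shannon entropy of the model's next-token distribution, e.g.:

```python
import math

def token_entropy(probs):
    """Shannon entropy (bits) of one next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def entropy_profile(stepwise_probs):
    """Entropy at each reasoning step; local smoothness vs. global
    non-uniformity can then be read off the adjacent differences."""
    return [token_entropy(p) for p in stepwise_probs]
```

A locally smooth but globally non-uniform profile would show small step-to-step differences while drifting substantially over the whole trace.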
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers introduced Distribution Shift Alignment (DSA), a novel fine-tuning method that enables large language models to more accurately simulate human survey responses by learning distribution patterns rather than memorizing training data. DSA outperforms existing methods across five public datasets and reduces required real-world data by 53-69%, offering significant cost savings for large-scale survey research.
AI · Bullish · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers introduce MM-Telco, a comprehensive multimodal benchmark and model suite designed to adapt large language models for telecommunications applications. The framework addresses domain-specific challenges in network optimization, troubleshooting, and customer support, with fine-tuned models demonstrating significant performance improvements over baseline LLMs.
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers propose trace rewriting techniques to protect language models from unauthorized knowledge distillation, a process where smaller models learn from larger ones without permission. The methods preserve model accuracy while degrading distillation usefulness and embedding detectable watermarks in student models.
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers introduce DASB, a comprehensive benchmark framework for evaluating discrete audio tokens across speech, audio, and music domains. The study reveals that discrete representations lag behind continuous features and require significant tuning, with semantic tokens outperforming acoustic ones, establishing standardized evaluation protocols for multimodal AI systems.
AI · Bullish · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers introduce Transformer Neural Process - Kernel Regression (TNP-KR), a scalable machine learning architecture that dramatically reduces computational complexity for neural processes from O(n²) to O(n_c) while maintaining or exceeding accuracy. The breakthrough enables processing of 100K context points with 1M+ test points on a single GPU, advancing the feasibility of neural processes for large-scale applications.
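TNP-KR's kernel-regression machinery isn't detailed in the summary; the basic reason cost can scale with the number of context points n_c rather than with all pairs is that each test query attends only to the context set, as in this illustrative cross-attention sketch (not the TNP-KR architecture itself):

```python
import numpy as np

def cross_attention(queries, context_k, context_v):
    """Each test point attends only to the n_c context points, so per-query
    cost is O(n_c) instead of quadratic in the full sequence (illustrative)."""
    scores = queries @ context_k.T / np.sqrt(queries.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context_v
```

Because test points never attend to each other, 1M+ test points can be processed in chunks against a fixed 100K-point context.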
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers introduce the first benchmark for multicultural text-to-image generation, revealing that state-of-the-art AI models struggle with culturally diverse scenes. The study of 9,000 images across five countries and multiple demographics shows significant performance disparities, with a multi-agent framework using cultural personas demonstrating potential improvements in image quality and cultural accuracy.
AI · Bullish · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers propose FSPO (Few-Shot Preference Optimization), a meta-learning algorithm that personalizes large language models using minimal user preference data. The approach uses synthetically generated preferences to train models that can quickly adapt to individual user preferences, achieving 87% performance on synthetic users and 70% on real human users in evaluation tasks.
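FSPO's meta-learning loop isn't shown in the summary; at its core, preference optimization on a single (chosen, rejected) pair is a sigmoid loss on the log-probability margin. A simplified DPO-style sketch, omitting the reference model that DPO normally uses:

```python
import math

def preference_loss(logp_chosen, logp_rejected, beta=0.1):
    """-log sigmoid(beta * margin): small when the policy clearly prefers
    the chosen response, large when it prefers the rejected one."""
    margin = beta * (logp_chosen - logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Meta-learning over many synthetic users then amounts to optimizing this loss across per-user preference sets so that a few real pairs suffice to adapt.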
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers propose FedTSP, a federated learning method that uses pre-trained language models to generate semantically-enriched prototypes for improving model performance across heterogeneous data. The approach leverages textual descriptions of classes to preserve semantic relationships while mitigating data heterogeneity challenges in federated settings.
AI · Bearish · arXiv – CS AI · Apr 20 · 6/10
🧠A new study reveals that using large language models to generate synthetic datasets ("silicon samples") produces highly variable results depending on configuration choices, with correlation outcomes ranging from r=.23 to r=.84 on the same task. This demonstrates that analytic flexibility in LLM-based data generation poses a significant threat to research validity and reproducibility in social science applications.
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers have developed an intelligent healthcare imaging platform using Vision-Language Models (VLMs), specifically Google Gemini 2.5 Flash, to automate medical image analysis and clinical report generation across CT, MRI, X-ray, and ultrasound modalities. The system achieves an average deviation of 80 pixels in location measurement and demonstrates zero-shot learning capabilities, though the authors acknowledge clinical validation is necessary before widespread adoption.
🧠 Gemini
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers introduced RoleConflictBench, a benchmark dataset containing over 13,000 scenarios across 65 social roles designed to test whether large language models prioritize contextual cues or learned preferences when facing conflicting role expectations. Analysis of 10 leading LLMs revealed that models predominantly rely on ingrained role preferences rather than responding dynamically to situational urgency, indicating a significant gap in contextual sensitivity.
AI · Bullish · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers propose Adaptive Entropy Regularization (AER), a dynamic framework that addresses policy entropy collapse in LLM reinforcement learning by adjusting exploration intensity based on task difficulty. The method improves upon fixed entropy regularization approaches, demonstrating consistent gains in mathematical reasoning benchmarks while maintaining balanced exploration-exploitation tradeoffs.
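The summary doesn't give AER's exact schedule; one hypothetical way to make an entropy bonus adapt to task difficulty is to scale the coefficient by the observed failure rate (all names and constants here are illustrative, not the paper's):

```python
def adaptive_entropy_coef(success_rate, base_coef=0.01, floor=1e-4):
    """Larger entropy bonus on hard tasks (low success rate) to sustain
    exploration; smaller on easy ones so the policy can sharpen."""
    return max(floor, base_coef * (1.0 - success_rate))

def regularized_objective(reward, policy_entropy, success_rate):
    """RL objective with the difficulty-adaptive entropy bonus added."""
    return reward + adaptive_entropy_coef(success_rate) * policy_entropy
```

A fixed coefficient, by contrast, applies the same exploration pressure to trivial and hard problems alike, which is the failure mode such dynamic schemes target.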
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers have created the first comprehensive Arabic Cultural QA benchmark that translates questions across Modern Standard Arabic and regional dialects, converting multiple-choice questions into open-ended formats. Testing reveals that large language models significantly underperform on dialectal content and struggle with open-ended Arabic questions, highlighting critical gaps in culturally grounded language understanding.
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers formalize the one-sided conversation problem (1SC), where only one participant's dialogue can be recorded—common in telemedicine, call centers, and smart glasses. The study evaluates methods to reconstruct missing speaker turns and generate summaries from incomplete transcripts, finding that smaller models require finetuning while larger models show promise with prompting techniques.
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers introduce MTR-DuplexBench, a new evaluation framework for Full-Duplex Speech Language Models, which enable real-time overlapping conversation. The benchmark addresses critical gaps by assessing multi-round interactions across conversational quality, instruction-following, and safety dimensions, revealing that current FD-SLMs struggle with consistency across multiple communication rounds.