AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce Amortized-Precision Quantization (APQ) and MAQEE, a framework that optimizes Vision Transformers for low-precision deployment with early-exit mechanisms. By jointly optimizing exit thresholds and bit-widths while accounting for quantization noise across layers, the approach achieves up to 95% reduction in computational operations while maintaining accuracy across vision tasks.
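The early-exit idea mentioned in this summary can be illustrated with a minimal sketch: each layer has a classifier head, and inference stops at the first layer whose top-class probability clears that layer's threshold. This is only an illustration under assumptions. The function names, the confidence measure (max softmax probability), and the threshold values are hypothetical, and the sketch does not model quantization noise or bit-width selection, which the summary says the framework optimizes jointly.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit(per_layer_logits: Sequence[Sequence[float]],
               thresholds: Sequence[float]) -> Tuple[int, int]:
    """Return (predicted class, index of the layer that answered).

    per_layer_logits[i] holds the logits of a hypothetical exit head at
    layer i; thresholds[i] is the confidence required to stop there.
    The last layer always answers if no earlier layer was confident.
    """
    if not per_layer_logits or len(per_layer_logits) != len(thresholds):
        raise ValueError("need one threshold per layer and at least one layer")
    best, layer = 0, 0
    for layer, (logits, tau) in enumerate(zip(per_layer_logits, thresholds)):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= tau:
            break  # confident enough: skip the remaining layers
    return best, layer


# Toy logits for three exit heads (made-up numbers, not from the paper).
TOY_LOGITS = [[0.1, 0.2, 0.0], [0.0, 4.0, 0.0], [0.0, 9.0, 0.0]]
```

With these toy logits, a threshold of 0.9 at every layer stops at the second layer, while a stricter 0.99 runs through to the third; lowering thresholds trades accuracy risk for fewer executed layers.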
AI · Bullish · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce the Byte Latent Transformer (BLT), a new approach to byte-level language models that dramatically accelerates generation speed through diffusion-based and speculative decoding techniques. The methods reduce memory-bandwidth costs by over 50% compared to standard byte-level models, potentially making byte-level LMs practical for real-world deployment.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers propose a theoretical framework for identifying when layer skipping in vision-language models reduces computational costs without sacrificing performance. The work establishes experimentally verifiable redundancy conditions that unify and improve upon existing pruning heuristics, confirming that early and late vision tokens contain significant redundancies across models.
AI · Bullish · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce AdaCorrection, a framework that improves the efficiency of Diffusion Transformers (DiTs) used in image and video generation by adaptively correcting cached features during inference. The method maintains generation quality while reducing computational costs through intelligent cache reuse without requiring retraining or additional supervision.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers demonstrate that KV-cache offloading techniques, designed to reduce memory usage in large language models, significantly degrade performance on context-intensive tasks requiring extensive information extraction. The study introduces the Text2JSON benchmark and identifies low-rank projection and unreliable landmarks as key failure points, proposing improved alternatives.
🧠 Llama
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers propose a two-stage inference-time budget control system for LLM search agents that optimizes how language models allocate computational resources between tool calls and token generation during multi-hop question answering. The method uses Value-of-Information scoring to decide when to retrieve information, decompose questions, or commit to final answers, demonstrating consistent performance gains across multiple benchmarks and model sizes.
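A Value-of-Information style decision rule of the kind this summary describes can be sketched as follows: score each non-final action by its estimated gain minus a weighted cost, take the best affordable one, and commit to an answer when nothing is worth its cost. Everything here is an assumption for illustration. The `Action` fields, the trade-off weight `lam`, and the numbers are hypothetical, and the sketch does not reproduce the paper's two-stage design or its actual scoring function.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Action:
    name: str             # e.g. "retrieve", "decompose"
    expected_gain: float  # hypothetical estimate of answer-quality improvement
    cost: float           # hypothetical cost in budget units (tool calls, tokens)


def choose_action(actions: List[Action], remaining_budget: float,
                  lam: float = 0.1, commit: str = "answer") -> str:
    """Pick the affordable action with the highest net value, else commit.

    Net value is expected_gain - lam * cost, where lam converts budget
    units into quality units (an assumed trade-off weight).
    """
    affordable = [a for a in actions if a.cost <= remaining_budget]
    best = max(affordable,
               key=lambda a: a.expected_gain - lam * a.cost,
               default=None)
    if best is None or best.expected_gain - lam * best.cost <= 0:
        return commit  # nothing is worth its cost: give the final answer
    return best.name


# Toy candidates (made-up numbers).
CANDIDATES = [Action("retrieve", 0.3, 1.0), Action("decompose", 0.1, 2.0)]
```

With the toy candidates, a budget of 5 selects `retrieve`, while a budget of 0.5 leaves nothing affordable and the rule commits to `answer`.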
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers introduce Open-SAT, a training-free algorithm that uses Large Language Models to refine query embeddings for satellite image retrieval tasks. The method improves upon existing vision-language models by leveraging LLM-guided contextual refinement at inference time, achieving up to 16% F1 score improvement on open-vocabulary satellite imagery tasks without requiring additional training.
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers develop a decision-theoretic framework for optimizing LLM cascades, where cheaper models defer to expensive ones on low-confidence queries. Testing across five benchmarks reveals that cascade performance is fundamentally limited by structural costs rather than routing sophistication, with simpler router-based approaches often outperforming optimized cascade policies.
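The basic cascade pattern described in this summary, where a cheaper model defers to a more expensive one on low-confidence queries, can be sketched in a few lines. This is a minimal illustration under assumptions, not the paper's decision-theoretic policy: the `Model` type, the toy models, and the threshold values are all hypothetical stand-ins.

```python
from typing import Callable, List, Tuple

# A model is any callable returning (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def cascade(query: str, stages: List[Tuple[Model, float]]) -> Tuple[str, int]:
    """Try models from cheapest to most expensive.

    Each stage is (model, threshold). The first model whose confidence
    meets its threshold answers; otherwise the query is deferred to the
    next stage. The final stage always answers. Returns (answer, stage index).
    """
    if not stages:
        raise ValueError("cascade needs at least one stage")
    answer, index = "", 0
    for index, (model, threshold) in enumerate(stages):
        answer, confidence = model(query)
        if confidence >= threshold:
            break  # confident enough: do not pay for later stages
    return answer, index


def cheap_model(query: str) -> Tuple[str, float]:
    """Toy stand-in: confident only on short queries."""
    return ("cheap answer", 0.9) if len(query) < 20 else ("cheap guess", 0.3)


def expensive_model(query: str) -> Tuple[str, float]:
    """Toy stand-in: always confident."""
    return ("expensive answer", 0.99)


STAGES = [(cheap_model, 0.8), (expensive_model, 0.0)]
```

In this toy setup a short query is answered by the cheap model at stage 0, and a longer one is deferred to stage 1. Note that a deferred query pays for both model calls, which is one example of the structural cost the summary refers to.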
AI · Bullish · arXiv – CS AI · May 9 · 6/10
🧠Researchers conducted the first large-scale mechanistic study of tabular foundation models, revealing significant redundancy across inference layers. They demonstrated that a single-layer looped model can match the performance of state-of-the-art models while using only 20% of the parameters, challenging assumptions about depth requirements in transformer architectures.
AI · Bullish · arXiv – CS AI · May 7 · 6/10
🧠Researchers propose Predict-then-Diffuse, a framework that optimizes diffusion-based large language models by predicting required response length before generation, reducing computational waste from padding tokens and re-computation overhead while maintaining output quality across multiple datasets.
AI · Neutral · arXiv – CS AI · May 7 · 6/10
🧠Researchers introduce Budgeted LoRA, a distillation framework that compresses large language models by treating model compression as a structured compute allocation problem. The method achieves up to 4.05x speedup in inference through selective dense component removal and adaptive low-rank allocation, controlled by a single compute budget parameter.
🏢 Perplexity
AI · Neutral · arXiv – CS AI · May 7 · 6/10
🧠Coral is a new multi-LLM serving system that optimizes resource allocation across heterogeneous cloud GPUs to reduce inference costs by up to 2.79x. The system uses a two-stage decomposition algorithm that maintains optimal performance while reducing optimization time from hours to seconds, enabling dynamic adaptation to changing demand and resource availability.
AI · Neutral · arXiv – CS AI · May 1 · 6/10
🧠Researchers introduce PAD-Rec, a lightweight module that optimizes speculative decoding for LLM-based recommendation systems by incorporating position-aware embeddings. The approach achieves up to 3.1x speedup in inference while preserving recommendation quality, addressing the latency bottleneck in generative list-wise recommendations.
AI · Neutral · arXiv – CS AI · May 1 · 6/10
🧠Researchers introduce VISE, the first benchmark for evaluating sycophancy in video large language models (Video-LLMs), where models incorrectly agree with user inputs that contradict visual evidence. The study proposes two training-free mitigation strategies: enhanced visual grounding through keyframe selection and inference-time neural representation steering, addressing a critical reliability gap in multimodal AI systems.
AI · Bullish · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers introduce LACE, a framework enabling large language models to reason through multiple parallel paths that interact and correct each other during inference, rather than operating independently. Using synthetic training data to teach cross-thread communication, LACE achieves over 7 percentage points improvement in reasoning accuracy compared to standard parallel search methods.
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers propose AdaRankLLM, an adaptive retrieval-augmented generation framework that dynamically filters irrelevant passages to reduce computational overhead while maintaining output quality. The study challenges whether adaptive retrieval remains necessary as language models grow more robust, finding that its value differs significantly between weaker and stronger models.
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers introduce DepCap, a training-free framework that optimizes diffusion language model (DLM) inference through adaptive block-wise parallel decoding. The method achieves up to 5.63× speedup by using cross-step signals to determine block boundaries and identifying conflict-free token subsets for safe parallel execution, maintaining quality while significantly accelerating inference.
AI · Bullish · arXiv – CS AI · Apr 15 · 6/10
🧠Researchers propose RPRA (Reason-Predict-Reason-Answer/Act), a framework enabling smaller language models to predict how a larger LLM judge would evaluate their outputs before responding. By routing simple queries to smaller models and complex ones to larger models, the approach reduces computational costs while maintaining output quality, with fine-tuned smaller models achieving up to 55% accuracy improvements.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers identify a critical failure mode in non-autoregressive diffusion language models caused by proximity bias, where the denoising process concentrates on adjacent tokens, creating spatial error propagation. They propose a minimal-intervention approach using a lightweight planner and temperature annealing to guide early token selection, achieving substantial improvements on reasoning and planning tasks.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠StyleBench is a new benchmark that evaluates how different reasoning structures (Chain-of-Thought, Tree-of-Thought, etc.) affect LLM performance across various tasks and model sizes. The research reveals that structural complexity only improves accuracy in specific scenarios, with simpler approaches often proving more efficient, and that learning adaptive reasoning strategies is itself a complex problem requiring advanced training methods.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers introduce GroupRank, a novel LLM-based passage reranking paradigm that balances efficiency and accuracy by combining pointwise and listwise ranking approaches. The method achieves state-of-the-art performance with 65.2 NDCG@10 on the BRIGHT benchmark while delivering 6.4x faster inference than existing approaches.
AI · Neutral · arXiv – CS AI · Apr 13 · 6/10
🧠Researchers introduce Dictionary-Aligned Concept Control (DACO), a framework that uses a curated dictionary of 15,000 multimodal concepts and Sparse Autoencoders to improve safety in multimodal large language models by steering their activations at inference time. Testing across multiple models shows DACO significantly enhances safety performance while preserving general-purpose capabilities without requiring model retraining.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers present CGD-PD, a test-time decoding method that improves large language models' performance on three-way logical question answering (True/False/Unknown) by enforcing negation consistency and resolving epistemic uncertainty through targeted entailment probes. The approach achieves up to 16% relative accuracy improvements on the FOLIO benchmark while reducing spurious Unknown predictions.
AI · Bullish · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers introduce S³ (Stratified Scaling Search), a test-time scaling method for diffusion language models that improves output quality by reallocating compute during the denoising process rather than simple best-of-K sampling. The technique uses a lightweight verifier to evaluate and selectively resample candidate trajectories at each step, demonstrating consistent performance gains across mathematical reasoning and knowledge tasks without requiring model retraining.
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers demonstrate that large language models exhibit critical control failures in causal reasoning, where they produce sound logical arguments but abandon them under social pressure or authority hints. The study introduces CAUSALT3, a benchmark revealing three reproducible pathologies, and proposes Regulated Causal Anchoring (RCA), an inference-time mitigation technique that validates reasoning consistency without retraining.