AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠CRANE is a training-free parameter-editing method that merges paired Instruct and Thinking model checkpoints to create stronger code agents. By selectively combining the reasoning capabilities of Thinking models with the tool discipline of Instruct models, CRANE achieves significant performance gains, including a 66.2% pass rate on Roo-Eval (+19.5%) and 14 additional resolved instances on SWE-bench, while maintaining computational efficiency.
AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers formalize the theoretical foundations of LLM scaling laws by modeling transformer learning dynamics as differential equations, establishing matching upper and lower bounds that characterize a two-phase convergence pattern: exponential decay during optimization followed by power-law decay during the statistical phase. This work bridges the gap between empirical observations and rigorous mathematical theory, providing independent scaling relationships for model size, training time, and dataset size.
AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers introduce projectmem, an open-source memory layer for AI coding agents that records development events in an append-only log and prevents agents from repeating failed debugging attempts. The system runs locally with no telemetry, potentially saving 5,000-20,000 tokens per session and improving AI assistant efficiency in software development workflows.
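The append-only idea can be sketched in a few lines. This is a minimal illustration only, not projectmem's actual interface: the `EventLog` class, its method names, the event fields, and the JSON Lines file layout are all assumptions made for the example.

```python
import json
import os
import tempfile


class EventLog:
    """Hypothetical append-only event log (not projectmem's real API)."""

    def __init__(self, path):
        self.path = path

    def record(self, kind, detail, outcome):
        # Append mode means earlier events are never rewritten or removed.
        event = {"kind": kind, "detail": detail, "outcome": outcome}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")

    def events(self):
        # Read the whole log back, one JSON object per line.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def already_failed(self, detail):
        # True if the same attempt was logged earlier with a failing outcome.
        return any(
            e["detail"] == detail and e["outcome"] == "failed"
            for e in self.events()
        )


log_path = os.path.join(tempfile.mkdtemp(), "events.jsonl")
log = EventLog(log_path)
log.record("debug_attempt", "pin numpy<2 to fix import error", "failed")

# An agent could consult the log before retrying a fix.
retry_blocked = log.already_failed("pin numpy<2 to fix import error")
new_idea_blocked = log.already_failed("rebuild the virtualenv")
```

In this sketch an agent would call `already_failed` before proposing a fix, so a previously failed attempt is skipped instead of being regenerated, which is where the token savings would plausibly come from.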
AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠Researchers introduce SHAPE, a novel expert pruning framework for Sparse Mixture-of-Experts (MoE) language models that reduces memory requirements by up to 40% without retraining. Unlike traditional pruning methods that evaluate experts independently, SHAPE models expert cooperation using game theory, identifying which expert combinations matter most for model performance.
AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers present RTPurbo, a method that transforms standard full-attention language models into efficient sparse models within just hundreds of training steps. By leveraging the observation that LLMs are intrinsically sparse, the approach achieves up to 9.36× speedup during prefill and 2.01× during decode at 1M context length while maintaining near-lossless accuracy.
AI · Bullish · arXiv – CS AI · Jun 5 · 7/10
🧠SAGE-PTQ introduces a novel ultra-low-bit quantization framework for large language models that dramatically reduces scaling overhead while maintaining accuracy. The method achieves 1.03 weight bits per parameter with minimal scaling costs, outperforming existing approaches like BiLLM by orders of magnitude in perplexity metrics while requiring significantly less GPU memory.
🏢 Nvidia · 🏢 Perplexity
AI · Neutral · arXiv – CS AI · Jun 4 · 7/10
🧠Researchers introduce OckBench, the first benchmark measuring both accuracy and token efficiency in large language models, revealing that models solving identical problems can differ by up to 5.0x in token usage. The findings highlight significant inefficiencies in current LLMs that inflate serving costs and latency, prompting a shift in evaluation paradigms toward optimizing token efficiency alongside performance.
🧠 GPT-5 · 🧠 Gemini
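To make the "up to 5.0x" figure concrete, here is one way such a gap could be computed. The ratio used below (most tokens divided by fewest tokens on the same problem) is an assumed definition for illustration, not necessarily the metric OckBench reports, and the model names and token counts are made up.

```python
def token_efficiency_gap(token_counts):
    """Ratio between the most and least token-hungry model on one problem."""
    if not token_counts:
        raise ValueError("need at least one model")
    return max(token_counts.values()) / min(token_counts.values())


# Hypothetical token counts for models that all solved the same problem.
tokens_used = {"model_a": 400, "model_b": 1000, "model_c": 2000}
gap = token_efficiency_gap(tokens_used)  # 2000 / 400
```

Under this assumed definition, a gap of 5.0 means the least efficient model spent five times as many tokens as the most efficient one to reach the same answer.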
AI · Bullish · arXiv – CS AI · Jun 2 · 7/10
🧠Researchers propose Sparse Memory-Efficient Training (SMET), a method that stabilizes Dynamic Sparse Training for large language models by addressing optimization instability through optimizer warm-up and density-aware learning-rate scaling. The approach reduces memory consumption while maintaining training stability, offering a practical alternative to dense model training.
AI · Bullish · arXiv – CS AI · Jun 2 · 7/10
🧠Researchers introduce LatentMAS, a framework that lets LLM agents collaborate directly in latent space rather than through text. It achieves up to 14.6% higher accuracy, reduces token usage by 70.8% to 83.7%, and runs inference 4× faster than text-based multi-agent systems.
AI · Bullish · arXiv – CS AI · Jun 1 · 7/10
🧠Researchers propose MedCoG, a meta-cognitive agent that improves Large Language Model efficiency in medical reasoning by dynamically regulating knowledge utilization based on self-assessed task complexity and familiarity. The approach achieves a 6.2x improvement in inference density while reducing computational costs and improving accuracy on medical benchmarks.
AI · Bullish · arXiv – CS AI · May 29 · 7/10
🧠Researchers have developed a method to improve how large language models verify factual claims by framing fact-checking as a true/false reading comprehension task with explicit test-taking strategies. The approach reduces token usage by over 80% while maintaining competitive performance, and enables smaller language models to perform similarly to larger ones through fine-tuning and self-revision mechanisms.
AI · Bullish · arXiv – CS AI · May 28 · 7/10
🧠PromptEmbedder introduces a dual-LLM framework that decouples text embedding from specific model architectures, achieving comparable performance to LoRA while reducing GPU memory by 40% and accelerating training 3.7x. The innovation enables efficient transfer across different LLM backbones by retraining only a lightweight alignment matrix rather than entire models.
AI · Bullish · arXiv – CS AI · May 28 · 7/10
🧠Researchers introduce TSVD, a framework for training Large Language Models more efficiently by maintaining low-rank representations and strict weight orthonormality throughout pretraining. The method uses adaptive rank selection and caching mechanisms to reduce computational overhead while matching or exceeding the performance of standard full-parameter models.
AI · Bullish · arXiv – CS AI · May 28 · 7/10
🧠Researchers introduce Prompt Codebooks (PCO), a new framework for automatic prompt optimization that breaks down instructions into reusable, atomic components rather than treating prompts as fixed strings. The method achieves up to 30% performance gains over baseline approaches while reducing prompt lengths by 14x, enabling more efficient and adaptive language model instruction refinement.
AI · Bullish · arXiv – CS AI · May 28 · 7/10
🧠Researchers propose a basis rotation framework to address gradient staleness in asynchronous pipeline parallelism, a technique used for distributed AI training. By aligning the optimizer's coordinate system with the Hessian eigenbasis, the method reduces training iterations by 81.7% compared to existing asynchronous baselines, enabling more efficient large-scale model training.
AI · Bullish · arXiv – CS AI · May 27 · 7/10
🧠Researchers provide the first theoretical analysis of Chain-of-Thought (CoT) compression in Large Language Models, proving that skipping intermediate reasoning steps creates exponential learning signal decay for high-order logical dependencies. They propose ALiCoT, a framework that achieves 54.4x computational speedup while maintaining reasoning performance by aligning latent token distributions with intermediate states.
AI · Bullish · arXiv – CS AI · May 27 · 7/10
🧠Researchers introduce Self-Signals Driven Multi-LLM Debate (SID), a method that leverages internal model signals like token logits and attention mechanisms to improve multi-agent LLM reasoning while reducing computational overhead. The approach enables high-confidence models to exit early and compresses redundant debate content, achieving better accuracy with lower token consumption than existing multi-LLM debate techniques.
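The early-exit idea can be illustrated with a confidence check on answer logits. This is a generic sketch under stated assumptions, not SID's actual criterion: the `should_exit_early` helper, the use of maximum softmax probability as the confidence signal, and the 0.9 threshold are all hypothetical choices for the example.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def should_exit_early(answer_logits, threshold=0.9):
    """Skip further debate if the top answer's probability clears the threshold."""
    return max(softmax(answer_logits)) >= threshold


# One sharply peaked distribution and one nearly flat one (made-up logits).
confident = should_exit_early([8.0, 1.0, 0.5])
uncertain = should_exit_early([1.2, 1.0, 0.9])
```

A model whose logits are sharply peaked would leave the debate after the first round, while one with a flat distribution would keep exchanging arguments, so tokens are spent only where the answer is still contested.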
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce Mixed-Policy Distillation (MPD), a technique that compresses reasoning in smaller language models by having larger teacher models rewrite student-generated reasoning traces into more concise versions. The method reduces token usage by up to 27.1% while maintaining or improving performance, addressing critical deployment constraints around memory, latency, and serving costs.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce Priming, a method that converts pre-trained Transformers into efficient Hybrid State-Space models through knowledge transfer rather than training from scratch. The technique recovers downstream performance using less than 0.5% of original pre-training tokens and enables the first large-scale comparison of SSM architectures, with Hybrid GKA 32B achieving 3.8-point reasoning improvements while delivering 2.3x faster decoding.
🧠 Llama
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers apply game-theoretic free energy principles to analyze attention head interactions in large language models, discovering that heads exhibit higher-order redundancy. Their framework enables principled pruning of low-contribution heads, achieving an 18% FLOP reduction and a 22% throughput improvement in GPT-2 with minimal performance degradation.
🏢 Perplexity · 🧠 Llama
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers propose a framework for optimizing data selection in large language model instruction tuning by learning task-specific and model-specific weights for multiple quality indicators. Using efficient in-context learning signals on small validation sets, the method achieves comparable performance to full-dataset training with only 30% of samples, revealing important trade-offs between semantic diversity and logical complexity.
🧠 Llama
AI · Neutral · arXiv – CS AI · May 11 · 7/10
🧠Researchers have identified why layer pruning causes sudden performance collapse in large language models by analyzing decision representation dynamics. The study reveals that pruning disrupts a critical 'Silent Phase' where the model internally processes information before making predictions, while the subsequent 'Decisive Phase' remains robust to pruning.
AI · Bullish · arXiv – CS AI · May 9 · 7/10
🧠ReaComp introduces a method to compile reasoning traces from large language models into reusable symbolic program synthesizers that eliminate runtime LLM calls. The approach achieves 91.3% accuracy on benchmark tasks while reducing token usage by 78%, demonstrating that neuro-symbolic hybrid systems can outperform pure LLM inference on complex program synthesis problems.
AI · Bullish · arXiv – CS AI · May 9 · 7/10
🧠Researchers introduce Post-Reasoning, a technique that improves LLM performance by having models justify answers after generating final responses, without increasing latency or token costs. The method demonstrates 17.37% mean performance improvements across 117 model-benchmark settings and establishes a new efficiency frontier for direct-answer AI capabilities.
AI · Bullish · arXiv – CS AI · Apr 15 · 7/10
🧠Researchers introduce reasoning graphs, a persistent knowledge structure that improves language model reasoning accuracy by storing and reusing chains of thought tied to evidence items. The system achieves 47% error reduction on multi-hop questions and maintains deterministic outputs without model retraining, using only context engineering.