AIBullisharXiv – CS AI · 3d ago7/10
🧠Researchers introduce BitTP, a quantization technique that compresses LLM-based trajectory prediction models to 1.58-bit weights while maintaining full-precision activations, enabling deployment on resource-constrained edge devices. The approach not only reduces memory and latency but actually improves prediction accuracy by 14-21% compared to full-precision baselines, demonstrating that strategic quantization can serve as an effective regularizer.
AIBullisharXiv – CS AI · 3d ago7/10
🧠Researchers introduce Logit-aware Final-block Quantization (LFQ), a technique that improves low-bit quantization of large language models by optimizing the final transformer block to preserve token probability distributions. This advancement addresses quality degradation in generative tasks while maintaining efficiency gains critical for deploying scaled LLMs.
AIBullisharXiv – CS AI · 3d ago7/10
🧠LoopFM introduces a novel knowledge distillation framework that transfers rich intermediate representations from large foundation models to compact vertical models, achieving significant conversion improvements (0.5-1.22%) in industrial-scale systems by structuring FM embeddings as input features rather than relying on single scalar predictions.
AIBullisharXiv – CS AI · 3d ago7/10
🧠Researchers introduce OccamToken, a training-free method for compressing vision-language models by pruning unnecessary visual tokens while maintaining accuracy. The approach reduces visual token sequences by 98.6% (from 2,880 to 40 tokens) on LLaVA-NeXT while preserving over 93% accuracy, addressing computational bottlenecks in VLM inference.
AIBullisharXiv – CS AI · 3d ago7/10
🧠Researchers present PeRQ, a post-training quantization method that uses permutations to optimize block rotations for neural network compression. The approach recovers up to 90% of full-vector rotation performance when quantizing large language models to INT4, significantly outperforming existing block rotation methods.
🏢 Perplexity🧠 Llama
AIBullisharXiv – CS AI · 3d ago7/10
🧠Researchers have developed a method to improve how large language models verify factual claims by framing fact-checking as a true/false reading comprehension task with explicit test-taking strategies. The approach reduces token usage by over 80% while maintaining competitive performance, and enables smaller language models to perform similarly to larger ones through fine-tuning and self-revision mechanisms.
AIBullisharXiv – CS AI · 3d ago7/10
🧠Researchers introduce HARP, a learnable adaptive rotation processor that improves extreme low-bit quantization for large language models by replacing fixed Hadamard transforms with optimizable structured orthogonal processors. The technique maintains full-precision equivalence while achieving better perplexity and accuracy across 2-4 bit quantization settings on models up to 70B parameters, with deployment speeds competitive with standard approaches.
🏢 Perplexity
AIBullisharXiv – CS AI · 4d ago7/10
🧠Researchers propose Hurwitz Quaternion Multiplicative Quantization (HQMQ), a calibration-free method for compressing KV caches in large language models using quaternion mathematics. The technique achieves 5x compression with minimal perplexity loss, matching full-precision performance at ~5 bits while outperforming existing quantization methods across five major model architectures.
🧠 Llama
AIBullisharXiv – CS AI · 4d ago7/10
🧠Researchers present a framework for converting Mixture-of-Experts (MoE) language models into standard dense architectures through expert selection, grouping, and knowledge distillation. The method achieves superior performance compared to traditional dense-to-dense pruning while enabling deployment on memory-constrained systems.
AIBullisharXiv – CS AI · 4d ago7/10
🧠Researchers introduce TSVD, a framework for training Large Language Models more efficiently by maintaining low-rank representations and strict weight orthonormality throughout pretraining. The method uses adaptive rank selection and caching mechanisms to reduce computational overhead while matching or exceeding the performance of standard full-parameter models.
AIBullisharXiv – CS AI · 4d ago7/10
🧠Researchers propose LIFT and PLACE, a knowledge distillation framework that enables stable training of extremely lightweight diffusion models by decomposing the teacher's complex denoising process into coarse and fine stages with spatially adaptive guidance. The method achieves stable convergence even at extreme compression ratios (1.6% of teacher size) where conventional distillation fails, with potential applications across image generation, latent diffusion, and flow-based models.
AIBullisharXiv – CS AI · 4d ago7/10
🧠Researchers propose Group-Query Latent Attention (GQLA), an advancement of DeepSeek's Multi-head Latent Attention that enables hardware-adaptive decoding through two algebraically equivalent inference paths without requiring model retraining. The innovation allows a single trained model to optimize performance across different hardware platforms—H100 GPUs and export-restricted H20 chips—while maintaining computational efficiency and supporting distributed tensor parallelism.
AIBullisharXiv – CS AI · 4d ago7/10
🧠Google researchers unveiled BlazeEdit, a 195M-parameter image-to-image diffusion model optimized for on-device mobile deployment, eliminating text-conditioning to handle object removal, outpainting, tone correction, relighting, and sticker generation. The model completes inference in 290ms on Pixel 10 while maintaining competitive quality, advancing the trend toward privacy-preserving edge AI.
AIBullisharXiv – CS AI · 5d ago7/10
🧠Researchers develop a systematic approach to quantization-aware training for large language models using 8-bit floating-point formats, identifying and solving two critical failure modes—amax saturation and catastrophic forgetting—that don't surface in standard training metrics. Their solution achieves near-lossless performance with only 0.43% degradation on benchmark tasks, advancing practical LLM deployment efficiency.
AIBullisharXiv – CS AI · 5d ago7/10
🧠StreamSplit introduces a novel framework enabling continuous contrastive learning on edge devices by dynamically partitioning computation between local and cloud resources. Using reinforcement learning and uncertainty guidance, the system reduces latency by up to 4.7x and bandwidth by 77.1% while maintaining near-server accuracy, making distributed AI inference practical for resource-constrained hardware.
AIBullisharXiv – CS AI · 5d ago7/10
🧠Researchers address a critical failure mode in quantized Vision-Language Models by proposing LRA-EE, a technique that uses early exit strategies to bypass noise-saturated layers in INT8 CLIP. The method improves zero-shot classification accuracy by 2.44 percentage points while reducing computational load by 13.4%, demonstrating that selective layer utilization can recover performance lost to quantization-induced representation collapse.
AIBullisharXiv – CS AI · May 127/10
🧠Researchers present SlimQwen, a systematic study of compression techniques for mixture-of-experts (MoE) language models during pretraining. The work demonstrates that pruning pretrained MoE models outperforms training smaller architectures from scratch, and proposes progressive pruning combined with knowledge distillation as the most effective compression strategy, successfully compressing Qwen3-Next-80A3B to 23A2B while maintaining competitive performance.
AIBullisharXiv – CS AI · May 127/10
🧠Researchers introduce RuPLaR, a novel compression framework that enables Large Language Models to generate latent reasoning tokens in a single training stage, eliminating inefficiencies of traditional multi-step Chain-of-Thought approaches. The method achieves 11.1% accuracy improvement over existing latent CoT systems while using minimal tokens, demonstrating significant progress in efficient LLM reasoning.
AIBullisharXiv – CS AI · May 127/10
🧠Researchers introduce LiteMedCoT-VL, a technique that transfers chain-of-thought reasoning from large language models to compact 2B parameter models for medical visual question answering, achieving 64.9% accuracy on the PMC-VQA benchmark without relying on image captions. The breakthrough demonstrates that smaller models enhanced with reasoning distillation can match or exceed the performance of larger models, enabling deployment of sophisticated medical AI on resource-constrained clinical devices.
AIBullisharXiv – CS AI · May 127/10
🧠Researchers identify weight gradient (Wgrad) quantization as the primary cause of instability in FP4 training of large language models, while forward and activation gradient quantization prove relatively benign. Using deterministic Hadamard rotations on AMD MI355X GPUs, they demonstrate that structured micro-scaling errors—not insufficient randomness—drive training divergence, offering insights for efficient LLM pretraining.
🧠 Llama
AIBearisharXiv – CS AI · May 117/10
🧠Researchers discovered that multimodal large language models (MLLMs) become vulnerable to jailbreaking when visual content is degraded through lower resolution or distortion, even when text remains readable. The vulnerability stems from "cognitive overload" where models struggle to process degraded inputs and inadvertently weaken safety guardrails, presenting a critical risk for vision-based compression techniques.
AINeutralarXiv – CS AI · May 117/10
🧠Researchers have identified why layer pruning causes sudden performance collapse in large language models by analyzing decision representation dynamics. The study reveals that pruning disrupts a critical 'Silent Phase' where the model internally processes information before making predictions, while the subsequent 'Decisive Phase' remains robust to pruning.
AIBullisharXiv – CS AI · May 117/10
🧠Researchers introduce MatryoshkaLoRA, a novel training framework that improves upon Low-Rank Adaptation (LoRA) for efficient large language model fine-tuning by learning hierarchical low-rank representations through a strategically placed diagonal scaling matrix. The method enables dynamic rank selection with minimal accuracy loss and introduces AURAC, a new evaluation metric for hierarchical adapters, addressing a key limitation in current parameter-efficient fine-tuning approaches.
AIBullisharXiv – CS AI · May 97/10
🧠Researchers introduce DomLoRA, a parameter-efficient fine-tuning method that identifies a single 'dominant adaptation module' where most gradient energy concentrates, achieving superior performance with only 0.7% of standard LoRA's trainable parameters. The discovery reveals that optimal adapter placement is architecture-dependent but task-stable across instruction following, reasoning, and code generation applications.
AIBullisharXiv – CS AI · May 77/10
🧠EdgeRazor introduces a lightweight quantization framework that compresses large language models to 1.88-bit precision while maintaining performance superior to existing 3-bit methods. The approach combines mixed-precision quantization with knowledge distillation and achieves up to 15.1× faster decoding with 80% storage reduction, requiring significantly lower computational training budgets than comparable techniques.