AINeutralarXiv – CS AI · Mar 117/10
🧠Researchers have developed ALADIN, a framework for analyzing accuracy-latency trade-offs in AI accelerators for embedded systems. The tool enables evaluation of quantized neural networks without requiring deployment on target hardware, potentially reducing development time and costs for AI chip designers.
AIBullisharXiv – CS AI · Mar 117/10
🧠Researchers have developed a new framework for training neural networks at ultra-low precision and high sparsity by modeling quantization as additive noise rather than using traditional Straight-Through Estimators. The method enables stable training of A1W1 and sub-1-bit networks, achieving state-of-the-art results for highly efficient neural networks including modern LLMs.
AIBullisharXiv – CS AI · Mar 67/10
🧠Researchers developed a memory management system for multi-agent AI systems on edge devices that reduces memory requirements by 4x through 4-bit quantization and eliminates redundant computation by persisting KV caches to disk. The solution reduces time-to-first-token by up to 136x while maintaining minimal impact on model quality across three major language model architectures.
🏢 Perplexity🧠 Llama
AIBullisharXiv – CS AI · Mar 56/10
🧠Researchers developed LiteVLA-Edge, a deployment-oriented Vision-Language-Action model pipeline that enables fully on-device inference on embedded robotics hardware like Jetson Orin. The system achieves 150.5ms latency (6.6Hz) through FP32 fine-tuning combined with 4-bit quantization and GPU-accelerated inference, operating entirely offline within a ROS 2 framework.
AIBullisharXiv – CS AI · Mar 56/10
🧠Researchers introduce Concentration-Alignment Transforms (CAT), a new method to reduce quantization error in large language and vision models by improving both weight/activation concentration and alignment. The technique consistently matches or outperforms existing quantization methods at 4-bit precision across several LLMs.
AINeutralarXiv – CS AI · Mar 56/10
🧠Researchers reproduced and analyzed severe accuracy degradation in BERT transformer models when applying post-training quantization, showing validation accuracy drops from 89.66% to 54.33%. The study found that structured activation outliers intensify with model depth, with mixed precision quantization being the most effective mitigation strategy.
AIBullisharXiv – CS AI · Mar 56/10
🧠Researchers successfully developed Bielik-Q2-Sharp, the first systematic evaluation of extreme 2-bit quantization for Polish language models, achieving near-baseline performance while significantly reducing model size. The study compared six quantization methods on an 11B parameter model, with the best variant maintaining 71.92% benchmark performance versus 72.07% baseline at just 3.26 GB.
AIBullisharXiv – CS AI · Mar 47/102
🧠Researchers propose SUN (Shared Use of Next-token Prediction), a novel approach for multi-LLM serving that enables cross-model sharing of decode execution by decomposing transformers into separate prefill and decode modules. The system achieves up to 2.0x throughput improvement per GPU while maintaining accuracy comparable to full fine-tuning, with a quantized version (QSUN) providing additional 45% speedup.
AIBullisharXiv – CS AI · Mar 47/103
🧠Researchers developed a training method for large-scale Mixture-of-Experts (MoE) models using FP4 precision on Hopper GPUs without native 4-bit support. The technique achieves 14.8% memory reduction and 12.5% throughput improvement for 671B parameter models by using FP4 for activations while keeping core computations in FP8.
AINeutralarXiv – CS AI · Mar 47/102
🧠Researchers prove that the GPTQ neural network quantization algorithm is mathematically equivalent to Babai's nearest-plane algorithm for solving lattice problems. The work establishes a connection between neural network quantization and lattice geometry, suggesting potential improvements through lattice basis reduction techniques.
AIBullisharXiv – CS AI · Mar 37/102
🧠ButterflyMoE introduces a breakthrough approach to reduce memory requirements for AI expert models by 150× through geometric parameterization instead of storing independent weight matrices. The method uses shared ternary prototypes with learned rotations to achieve sub-linear memory scaling, enabling deployment of multiple experts on edge devices.
AIBullisharXiv – CS AI · Mar 37/104
🧠Researchers introduce the first theoretical framework analyzing convergence of adaptive optimizers like Adam and Muon under floating-point quantization in low-precision training. The study shows these algorithms maintain near full-precision performance when mantissa length scales logarithmically with iterations, with Muon proving more robust than Adam to quantization errors.
AIBullisharXiv – CS AI · Mar 37/105
🧠Researchers developed HierarchicalPrune, a compression framework that reduces large-scale text-to-image diffusion models' memory footprint by 77.5-80.4% and latency by 27.9-38.0% while maintaining image quality. The technique enables billion-parameter AI models to run efficiently on resource-constrained devices through hierarchical pruning and knowledge distillation.
AIBullisharXiv – CS AI · Mar 37/103
🧠Researchers have developed SageBwd, a trainable INT8 attention mechanism that can match full-precision attention performance during pre-training while quantizing six of seven attention matrix multiplications. The study identifies key factors for stable training including QK-norm requirements and the impact of tokens per step on quantization errors.
AIBullisharXiv – CS AI · Feb 277/106
🧠Researchers developed a runtime-reconfigurable bitwise systolic array architecture for multi-precision quantized neural networks on FPGA hardware accelerators. The system achieves 1.3-3.6x speedup on mixed-precision models while supporting higher clock frequencies up to 250MHz, addressing the trade-off between hardware efficiency and inference accuracy.
AIBullisharXiv – CS AI · Feb 277/108
🧠FlashOptim introduces memory optimization techniques that reduce AI training memory requirements by over 50% per parameter while maintaining model quality. The suite reduces AdamW memory usage from 16 bytes to 7 bytes per parameter through improved master weight splitting and 8-bit optimizer state quantization.
AIBullisharXiv – CS AI · Feb 277/108
🧠Researchers introduce UniQL, a unified framework for quantizing and compressing large language models to run efficiently on mobile devices. The system achieves 4x-5.7x memory reduction and 2.7x-3.4x speed improvements while maintaining accuracy within 5% of original models.
AIBullisharXiv – CS AI · Feb 277/105
🧠Tencent Hunyuan team introduces AngelSlim, a comprehensive toolkit for large model compression featuring quantization, speculative decoding, and pruning techniques. The toolkit includes the first industrially viable 2-bit large model (HY-1.8B-int2) and achieves 1.8x to 2.0x throughput gains while maintaining output quality.
AIBullisharXiv – CS AI · Feb 277/105
🧠Researchers developed a new approach to quantization-aware training (QAT) that optimizes compute allocation between full-precision and quantized training phases. They discovered that contrary to previous findings, the optimal ratio of QAT to full-precision training increases with total compute budget, and derived scaling laws to predict optimal configurations across different model sizes and bit widths.
AIBullishHugging Face Blog · Sep 187/105
🧠The article discusses techniques for fine-tuning large language models (LLMs) to achieve extreme quantization down to 1.58 bits, making the process more accessible and efficient. This represents a significant advancement in model compression technology that could reduce computational requirements and costs for AI deployment.
AIBullishHugging Face Blog · May 247/108
🧠The article discusses advances in making Large Language Models (LLMs) more accessible through bitsandbytes library, 4-bit quantization techniques, and QLoRA (Quantized Low-Rank Adaptation). These technologies enable running and fine-tuning large AI models on consumer hardware with significantly reduced memory requirements.
AI × CryptoBullishBlockonomi · 3d ago6/10
🤖Ethereum co-founder Vitalik Buterin has highlighted connections between DeepSeek V4's efficiency improvements and privacy-focused infrastructure on Ethereum. DeepSeek V4's 2-bit quantized version runs on 90 GB of VRAM, enabling local AI deployment on consumer hardware, with Apple silicon achieving 35 tokens per second versus AMD's 7 tokens per second. Buterin suggests zero-knowledge proof infrastructure can support both private LLM interactions and confidential blockchain operations.
$ETH
AIBullisharXiv – CS AI · 3d ago6/10
🧠ASTRA is a new framework that enables efficient multi-device Transformer inference by combining sequence parallelism with mixed-precision attention, allowing non-local token embeddings to be transmitted as compressed codes while maintaining full precision for local attention. The system achieves significant speedups (up to 2.64x) over single-device inference while operating at extremely low bandwidth requirements (as low as 10 Mbps), making it practical for bandwidth-constrained environments.
🧠 Llama
AINeutralarXiv – CS AI · 3d ago6/10
🧠Clark Hash is a new compression codec that reduces neural embedding storage from 1,536 bytes to 48 bytes (32x compression) using deterministic sparse Johnson-Lindenstrauss projection and scalar quantization. The method requires no training, learned codebooks, or corpus statistics, achieving 0.91+ correlation with dense cosine similarity scores on multilingual sentence-embedding benchmarks.
AINeutralarXiv – CS AI · 3d ago6/10
🧠Researchers introduce Soro, a family of Tajik-language large language models built on Gemma 3 that outperforms baseline models while maintaining English capabilities. The project addresses computational constraints in Tajikistan through efficient quantization methods and includes newly open-sourced Tajik benchmarks for rigorous evaluation.
🏢 Hugging Face