AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers introduce Reasoning in Memory (RiM), a novel method that enables large language models to perform internal reasoning using fixed memory blocks instead of generating intermediate tokens. The approach matches or exceeds existing reasoning methods while being more compute-efficient, because the memory blocks are processed in a single forward pass rather than through autoregressive generation.
AI · Bullish · AI News · May 20 · 7/10
🧠Alibaba has unveiled the Zhenwu M890 AI processor specifically designed for AI agents, coupled with a multi-year silicon roadmap and a new large language model. This integrated approach signals that Alibaba is building a comprehensive AI stack rather than simply compensating for US export restrictions, fundamentally reshaping the competitive landscape in AI chip development.
AI · Neutral · Stratechery · May 11 · 7/10
🧠The article argues that agentic inference—AI systems operating autonomously without human involvement—will fundamentally differ from current inference workloads, eliminating the speed-critical requirements that dominate today's compute infrastructure design. This shift will reshape hardware and infrastructure priorities as latency becomes less critical than efficiency and throughput for agent-based systems.
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers have developed a new low-bit mixed-precision attention kernel called Diagonal-Tiled Mixed-Precision Attention (DMA) that significantly speeds up large language model inference on NVIDIA B200 GPUs while maintaining generation quality. The technique uses microscaling floating-point (MXFP) data format and kernel fusion to address the high computational costs of transformer-based models.
🏢 Nvidia
AI · Bullish · arXiv – CS AI · Mar 26 · 7/10
🧠Researchers developed SyTTA, a test-time adaptation framework that improves large language models' performance in specialized domains without requiring additional labeled data. The method achieved over 120% improvement on agricultural question answering tasks using just 4 extra tokens per query, addressing the challenge of deploying LLMs in domains with limited training data.
🏢 Perplexity
AI · Bullish · arXiv – CS AI · Mar 26 · 7/10
🧠Researchers have developed QUARK, a quantization-enabled FPGA acceleration framework that significantly improves Transformer model performance by optimizing nonlinear operations through circuit sharing. The system achieves up to 1.96x speedup over GPU implementations while reducing hardware overhead by more than 50% compared to existing approaches.
AI · Bullish · arXiv – CS AI · Mar 26 · 7/10
🧠Researchers demonstrate that PLDR-LLMs trained at self-organized criticality exhibit enhanced reasoning capabilities at inference time. The study shows that reasoning ability can be quantified using an order parameter derived from global model statistics, with models performing better when this parameter approaches zero at criticality.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers propose Simple Energy Adaptation (SEA), a new algorithm for aligning large language models with human feedback at inference time. SEA uses gradient-based sampling in continuous latent space rather than searching discrete response spaces, achieving up to 77.51% improvement on AdvBench and 16.36% on MATH benchmarks.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers have developed rationale-enhanced decoding (RED), a new inference-time strategy that improves chain-of-thought reasoning in large vision-language models. The method addresses the problem where LVLMs ignore generated rationales by harmonizing visual and rationale information during decoding, showing consistent improvements across multiple benchmarks.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers introduce AgentDiet, a trajectory reduction technique that cuts computational costs for LLM-based agents by 39.9%-59.7% in input tokens and 21.1%-35.9% in total costs while maintaining performance. The approach removes redundant and expired information from agent execution trajectories at inference time.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Justitia is a new scheduling system for task-parallel LLM agents that optimizes GPU server performance through selective resource allocation based on completion order prediction. The system uses memory-centric cost quantification and virtual-time fair queuing to achieve both efficiency and fairness in LLM serving environments.
🏢 Meta
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers have developed two software techniques (OAS and MBS) that substantially improve MXFP4 quantization accuracy for Large Language Models, reducing the performance gap with NVIDIA's NVFP4 from 10% to below 1%. This makes MXFP4 a viable alternative while retaining a 12% hardware efficiency advantage in tensor cores.
🏢 Nvidia
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers propose ARKV, a new framework for managing memory in large language models that reduces KV cache memory usage by 4x while preserving 97% of baseline accuracy. The adaptive system dynamically allocates precision levels to cached tokens based on attention patterns, enabling more efficient long-context inference without requiring model retraining.
AI · Bullish · arXiv – CS AI · Mar 6 · 7/10
🧠Researchers propose asymmetric transformer attention where keys use fewer dimensions than queries and values, achieving 75% key cache reduction with minimal quality loss. The technique enables 60% more concurrent users for large language models by saving 25GB of KV cache per user for 7B parameter models.
🏢 Perplexity
AI · Bullish · arXiv – CS AI · Mar 5 · 7/10
🧠Researchers introduce Draft-Conditioned Constrained Decoding (DCCD), a training-free method that improves structured output generation in large language models by up to 24 percentage points. The technique uses a two-step process that first generates an unconstrained draft, then applies constraints to ensure valid outputs like JSON and API calls.
AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 3
🧠Researchers introduce CHaRS (Concept Heterogeneity-aware Representation Steering), a new method for controlling large language model behavior that uses optimal transport theory to create context-dependent steering rather than global directions. The approach models representations as Gaussian mixture models and derives input-dependent steering maps, showing improved behavioral control over existing methods.
AI · Neutral · arXiv – CS AI · Mar 4 · 6/10 · 3
🧠Researchers have developed SEAL, a reference framework for measuring carbon emissions from Large Language Model inference at the prompt level. The framework addresses the growing sustainability concerns as LLM inference emissions are rapidly surpassing training emissions due to massive usage volumes.
AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 4
🧠xLLM is a new open-source Large Language Model inference framework that delivers significantly improved performance for enterprise AI deployments. The framework achieves 1.7-2.2x higher throughput compared to existing solutions like MindIE and vLLM-Ascend through novel architectural optimizations including decoupled service-engine design and intelligent scheduling.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠Researchers propose TRIM-KV, a novel approach that learns token importance for memory-bounded LLM inference through lightweight retention gates, addressing the quadratic cost of self-attention and the growth of the key-value cache. The method outperforms existing eviction baselines across multiple benchmarks and provides insights into LLM interpretability through learned retention scores.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠Researchers introduce LightMem, a new memory system for Large Language Models that mimics human memory structure with three stages: sensory, short-term, and long-term memory. The system achieves up to 7.7% better QA accuracy while reducing token usage by up to 106x and API calls by up to 159x compared to existing methods.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 8
🧠Researchers introduce RAGdb, an architecture that consolidates Retrieval-Augmented Generation into a single SQLite container, eliminating the need for cloud infrastructure and GPUs. The system achieves 100% entity retrieval accuracy while reducing disk footprint by 99.5% compared to traditional Docker-based RAG stacks, enabling portable AI applications for edge computing and privacy-sensitive environments.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 8
🧠Researchers introduce UniQL, a unified framework for quantizing and compressing large language models to run efficiently on mobile devices. The system achieves 4x-5.7x memory reduction and 2.7x-3.4x speed improvements while maintaining accuracy within 5% of original models.
AI · Bullish · Hugging Face Blog · Mar 7 · 7/10 · 8
🧠The article provides a guide for running Large Language Models (LLMs) directly on mobile devices using React Native, enabling edge inference capabilities. This development represents a significant step toward decentralized AI processing, reducing reliance on cloud-based services and improving privacy and latency for mobile AI applications.
AI · Bullish · Hugging Face Blog · Jan 18 · 7/10 · 7
🧠Hugging Face announced that it achieved a 100x speed improvement for transformer inference in its API services. The optimization significantly improves performance for AI model deployment and reduces latency for customers using the platform.
AI · Neutral · TechCrunch – AI · 1d ago · 6/10
🧠Groq, an AI chip startup, is raising $650 million in funding while shifting its strategic focus from hardware development toward AI inference optimization. This funding round follows Nvidia's recent decision to acquire a chip design team rather than purchase an existing company, signaling evolving dynamics in the competitive AI silicon landscape.
🏢 Nvidia