AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠PrunePath is a new structured sparsification framework that optimizes feed-forward networks in language models by replacing traditional pruning methods with a softmax-normalized routing system. The approach converts model sparsity into practical hardware efficiency gains, demonstrated through memory savings and faster decoding speeds via custom Triton kernels.
AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠Xe-Forge is an LLM-powered system that automates kernel optimization for Intel GPUs, eliminating repetitive manual porting work that typically gates algorithm deployment on new accelerators. Testing on 97 kernels achieved 1.17x geometric mean speedup with 67% of kernels improving and some exceeding 5x gains, demonstrating that structured domain knowledge combined with hardware-in-the-loop verification can systematically accelerate hardware adoption.
AI · Bullish · arXiv – CS AI · May 1 · 7/10
🧠Researchers present a unified system for optimizing KV cache memory management in large-scale GPU inference, addressing three critical inefficiencies through architecture-aware sizing, multi-tier memory hierarchy spanning CPU to NVMe storage, and predictive eviction policies. The approach achieves 70-84% cache hit rates and projects 1.4-2.1x improvements in latency and 1.7-2.9x throughput gains while reducing costs by 47% compared to existing solutions.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠EdgeCIM presents a specialized hardware-software framework designed to accelerate Small Language Model inference on edge devices by addressing memory-bandwidth bottlenecks inherent in autoregressive decoding. The system achieves significant performance and energy improvements over existing mobile accelerators, reaching 7.3x higher throughput than NVIDIA Orin Nano on 1B-parameter models.
🏢 Nvidia
AI · Bullish · arXiv – CS AI · Mar 16 · 7/10
🧠Researchers developed an SRAM-based compute-in-memory accelerator for spiking neural networks that uses linear decay approximation instead of exponential decay, achieving 1.1x to 16.7x reduction in energy consumption. The innovation addresses the bottleneck of neuron state updates in neuromorphic computing by performing in-place decay directly within memory arrays.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠Researchers propose ROMA, a new hardware accelerator for running large language models on edge devices using QLoRA. The system uses ROM storage for quantized base models and SRAM for LoRA weights, achieving over 20,000 tokens/s generation speed without external memory.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 6
🧠Researchers developed a runtime-reconfigurable bitwise systolic array architecture for multi-precision quantized neural networks on FPGA hardware accelerators. The system achieves 1.3-3.6x speedup on mixed-precision models while supporting higher clock frequencies up to 250MHz, addressing the trade-off between hardware efficiency and inference accuracy.
AI · Bullish · Hugging Face Blog · May 10 · 6/10
🧠MachinaCheck deploys a multi-agent system on AMD's MI300X GPU to assess CNC manufacturability. The project illustrates how specialized AI infrastructure can support complex industrial problem-solving, and reflects the growing overlap between high-performance computing hardware and practical enterprise applications.
AI · Bullish · arXiv – CS AI · Mar 26 · 6/10
🧠Researchers introduce AscendOptimizer, an AI agent that optimizes operators for Huawei's Ascend NPUs through evolutionary search and experience-based learning. The system achieved 1.19x geometric-mean speedup over baselines on 127 real operators, with nearly 50% outperforming reference implementations.
AI · Bullish · arXiv – CS AI · Mar 12 · 6/10
🧠Researchers introduce EvoKernel, a self-evolving AI framework that addresses the 'Data Wall' problem in deploying Large Language Models for kernel synthesis on data-scarce hardware platforms like NPUs. The system uses memory-based reinforcement learning to improve correctness from 11% to 83% and achieves 3.60x speedup through iterative refinement.
AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 14
🧠Researchers propose BiKA, a new ultra-lightweight neural network accelerator inspired by Kolmogorov-Arnold Networks that uses binary thresholds instead of complex computations. The FPGA prototype demonstrates 27-51% reduction in hardware resource usage compared to existing binarized and quantized neural network accelerators while maintaining competitive accuracy.
AI · Bullish · Hugging Face Blog · Mar 28 · 6/10 · 7
🧠The article discusses accelerating Large Language Model (LLM) inference using Text Generation Inference (TGI) on Intel Gaudi hardware, with the aim of improving performance and efficiency in LLM deployment.
AI · Bullish · Hugging Face Blog · May 25 · 6/10 · 6
🧠Intel has released optimization techniques for running Stable Diffusion AI models on CPUs using NNCF (Neural Network Compression Framework) and Hugging Face Optimum. These optimizations aim to improve performance and reduce computational requirements for AI image generation on Intel hardware without requiring expensive GPUs.
AI · Bullish · Hugging Face Blog · Jun 15 · 6/10 · 4
🧠Intel has partnered with Hugging Face to democratize machine learning hardware acceleration, making AI model deployment more accessible across different hardware platforms. This collaboration aims to optimize AI workloads on Intel hardware while leveraging Hugging Face's extensive model ecosystem.
AI · Neutral · arXiv – CS AI · May 4 · 5/10
🧠Researchers demonstrate successful adaptation of AI-accelerated computational fluid dynamics (CFD) simulations to Graphcore's IPU platform, achieving up to 34% speedup through optimized data pipeline management. The study shows strong scalability from 2 to 16 IPUs, increasing throughput from 560.8 to 2805.8 samples per second, validating IPUs as viable accelerators for AI-enhanced scientific computing workloads.
AI · Bullish · Hugging Face Blog · Nov 19 · 4/10 · 5
🧠The article discusses methods for accelerating PyTorch distributed fine-tuning using Intel's hardware and software technologies. It focuses on optimizations for training deep learning models more efficiently on Intel infrastructure.
AI · Neutral · Hugging Face Blog · Feb 6 · 3/10 · 3
🧠The article appears to cover optimizing PyTorch Transformers performance on Intel Sapphire Rapids processors; the article body was not available, so no further detail is given.