AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers have discovered that large AI models develop decomposable internal structures during training, with many parameter dependencies remaining statistically unchanged from initialization. They propose a post-training method to identify and remove unsupported dependencies, enabling parallel inference without modifying model functionality.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 6
🧠Researchers introduce veScale-FSDP, a redesigned Fully Sharded Data Parallel system that overcomes limitations of current FSDP implementations used for training large-scale AI models. The new system features a flexible RaggedShard format and structure-aware planning, achieving 5-66% higher throughput and 16-30% lower memory usage while supporting advanced training methods and scaling to tens of thousands of GPUs.
AI · Neutral · arXiv – CS AI · 4d ago · 5/10
🧠Researchers propose MC-PSO and MC-APSO, novel parallel neural network architectures that combine multi-column radial basis function networks with particle swarm optimization algorithms. These methods outperform existing approaches in accuracy, recall, and computational efficiency on benchmark datasets by distributing training across spatial subsets.
AI · Neutral · arXiv – CS AI · May 12 · 5/10
🧠Researchers have developed parHSOM, a parallel implementation of Hierarchical Self-Organizing Maps designed to accelerate training for cybersecurity intrusion detection systems. Testing across multiple datasets and configurations demonstrates faster training times without performance degradation compared to sequential HSOM approaches.
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers have developed NCCL EP, a new communication library for Mixture-of-Experts (MoE) AI model architectures that improves GPU-initiated communication performance. The library provides unified APIs supporting both low-latency inference and high-throughput training modes, built entirely on NVIDIA's NCCL Device API.
🏢 Nvidia
AI · Neutral · arXiv – CS AI · May 4 · 5/10
🧠Researchers demonstrate successful adaptation of AI-accelerated computational fluid dynamics (CFD) simulations to Graphcore's IPU platform, achieving up to 34% speedup through optimized data pipeline management. The study shows strong scalability from 2 to 16 IPUs, increasing throughput from 560.8 to 2805.8 samples per second, validating IPUs as viable accelerators for AI-enhanced scientific computing workloads.
AI · Neutral · arXiv – CS AI · Feb 27 · 4/10 · 6
🧠Researchers evaluated the ability of Large Language Models to generate parallel code with three programming approaches (OpenMP, C++, HPX) using different input prompts. The study found that LLM performance varies with problem complexity and framework, revealing both capabilities and limitations in high-performance computing applications.
AI · Neutral · arXiv – CS AI · Mar 3 · 4/10 · 4
🧠Researchers propose Coupled Policy Optimization (CPO), a new reinforcement learning method that regulates policy diversity through KL constraints to improve exploration efficiency in large-scale parallel environments. The method outperforms existing baselines like PPO and SAPG across multiple tasks, demonstrating that controlled diverse exploration is key to stable and sample-efficient learning.