y0news

#distributed-systems News & Analysis

40 articles tagged with #distributed-systems. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI × Crypto · Neutral · arXiv – CS AI · 3d ago · 7/10
🤖

Agora: Toward Autonomous Bug Detection in Production-Level Consensus Protocols with LLM Agents

Researchers introduced Agora, a multi-agent LLM framework designed to detect deep logic bugs in consensus protocols used by blockchains and distributed systems. The system discovered 15 previously unknown protocol-level bugs in major implementations (Raft, EPaxos, HotStuff, BullShark) that existing LLM approaches failed to identify, demonstrating the effectiveness of domain-aware collaborative AI for protocol verification.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠

FD-RAG: Federated Dual-System Retrieval-Augmented Generation

FD-RAG introduces a federated framework for retrieval-augmented generation that enables decentralized LLM deployment across edge devices without centralizing sensitive data. The system achieves 7.8% accuracy improvements and 8.4x latency reductions by splitting lightweight memory access from expensive LLM reasoning, while aggregating anonymized knowledge across fragmented device networks.

AI × Crypto · Bullish · arXiv – CS AI · 4d ago · 7/10
🤖

SwarmHarness: Skill-Based Task Routing via Decentralized Incentive-Aligned AI Agent Networks

SwarmHarness proposes a decentralized protocol enabling unused computing resources across personal devices and servers to be shared through a self-organizing network of AI agents without central authority. The system combines peer discovery via DHT, intelligent task routing based on capability and trust metrics, and a Shapley-value-based credit mechanism to align incentives and create a self-regulating participation economy.
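To make the Shapley-value credit idea concrete, here is a minimal sketch of exact Shapley attribution on a toy coalition game. It is not SwarmHarness's mechanism: the player names, the `toy_value` payoff function, and the brute-force enumeration over orderings are all assumptions chosen for illustration, and enumeration is only practical for a handful of participants.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution
    over every possible join order (factorial cost, toy sizes only)."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(coalition):
    """Hypothetical payoff: the task cannot run without the GPU node;
    each additional CPU node adds 2 units on top of a base of 10."""
    if "gpu" not in coalition:
        return 0.0
    return 10.0 + 2.0 * len(coalition - {"gpu"})


credits = shapley_values(["gpu", "cpu_a", "cpu_b"], toy_value)
print(credits)
```

In this toy game the credits sum to the value of the full coalition, which is the efficiency property that makes Shapley values attractive for splitting a shared reward.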

AI · Neutral · arXiv – CS AI · May 12 · 7/10
🧠

From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs

A production analysis of a 504-GPU NVIDIA B200 cluster reveals that large-scale AI training requires multi-signal failure detection strategies, with a 100% detection rate achieved through statistical analysis of 751 metrics. The study identifies storage I/O bottlenecks invisible at smaller scales and shows auto-retry mechanisms succeed 2.7x more often than manual recovery, providing critical operational insights for distributed AI infrastructure.

🏢 Nvidia
AI × Crypto · Bullish · arXiv – CS AI · May 12 · 7/10
🤖

Robust Multi-Agent LLMs under Byzantine Faults

Researchers propose Self-Anchored Consensus (SAC), a decentralized protocol enabling LLM agents to collaborate reliably over peer-to-peer networks while resisting Byzantine attacks. The method allows agents to iteratively filter unreliable messages and refine outputs without centralized coordination, addressing a critical vulnerability in distributed AI systems.
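The general idea of filtering unreliable messages before refining an estimate can be sketched with a simple distance-based filter. This is not the SAC protocol from the paper: the scalar estimates, the `filter_and_average` helper, and the fixed `tolerance` are all assumptions for illustration, and real LLM outputs would need a task-specific notion of distance.

```python
def filter_and_average(own, received, tolerance):
    """Drop messages farther than `tolerance` from the agent's own estimate,
    then average the survivors together with that estimate."""
    kept = [v for v in received if abs(v - own) <= tolerance]
    return (own + sum(kept)) / (1 + len(kept))


# Hypothetical setup: three honest agents plus one faulty peer that
# always broadcasts an extreme value.
estimates = {"a": 1.0, "b": 1.2, "c": 0.9}
byzantine_value = 100.0

for _ in range(5):
    estimates = {
        name: filter_and_average(
            own,
            [v for other, v in estimates.items() if other != name]
            + [byzantine_value],
            tolerance=1.0,
        )
        for name, own in estimates.items()
    }

print(estimates)
```

Because the faulty value always lies outside the tolerance band, it never enters any average, and the honest agents settle on the mean of their initial estimates without a coordinator.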

AI × Crypto · Bullish · Crypto Briefing · May 3 · 7/10
🤖

Ben Fielding: Neural architecture search automates deep learning, the shift to horizontal scaling is essential, and blockchain security enhances consensus algorithms | Unchained

Ben Fielding discusses how neural architecture search (NAS) automates deep learning model design, emphasizes the necessity of horizontal scaling in distributed systems, and explores blockchain security's role in strengthening consensus algorithms. The convergence of machine learning and blockchain represents a transformative shift comparable to MapReduce's impact on distributed computing.

AI × Crypto · Neutral · arXiv – CS AI · Apr 14 · 7/10
🤖

Emergent Social Structures in Autonomous AI Agent Networks: A Metadata Analysis of 626 Agents on the Pilot Protocol

Researchers analyzed 626 autonomous AI agents that independently joined the Pilot Protocol, discovering that these machines formed complex social structures mirroring human networks without explicit instruction. The emergent topology exhibits small-world properties, preferential attachment, and specialized clustering, representing the first empirical evidence of spontaneous social organization among autonomous AI systems.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10
🧠

Distributed Interpretability and Control for Large Language Models

Researchers have developed a scalable system for interpreting and controlling large language models distributed across multiple GPUs, achieving up to 7x memory reduction and 41x throughput improvements. The method enables real-time behavioral steering of frontier LLMs like LLaMA and Qwen without fine-tuning, with results released as open-source tooling.

AI · Bullish · arXiv – CS AI · Apr 6 · 7/10
🧠

Glia: A Human-Inspired AI for Automated Systems Design and Optimization

Researchers have developed Glia, an AI architecture using large language models in a multi-agent workflow to autonomously design computer systems mechanisms. The system generates interpretable designs for distributed GPU clusters that match human expert performance while providing novel insights into workload behavior.

AI · Neutral · arXiv – CS AI · Mar 17 · 7/10
🧠

Efficient Federated Conformal Prediction with Group-Conditional Guarantee

Researchers propose group-conditional federated conformal prediction (GC-FCP), a new protocol that enables trustworthy AI uncertainty quantification across distributed clients while providing coverage guarantees for specific groups. The framework addresses challenges in federated learning for applications in healthcare, finance, and mobile sensing by creating compact weighted summaries that support efficient calibration.
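As a rough illustration of what a group-conditional coverage guarantee involves, the sketch below calibrates a separate split-conformal threshold per group. It is a centralized simplification, not the GC-FCP protocol: the group names, the scores, and the helper functions are hypothetical, and the federated weighted summaries described in the paper are not modeled.

```python
import math
from collections import defaultdict


def conformal_threshold(scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest
    calibration score, or infinity when n is too small for this alpha."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def group_thresholds(calibration, alpha):
    """Calibrate one threshold per group from (group, score) pairs."""
    by_group = defaultdict(list)
    for group, score in calibration:
        by_group[group].append(score)
    return {g: conformal_threshold(s, alpha) for g, s in by_group.items()}


# Hypothetical nonconformity scores from two client groups.
calibration = (
    [("clinic_a", s) for s in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]]
    + [("clinic_b", s) for s in [0.05, 0.15, 0.25, 0.35]]
)
thresholds = group_thresholds(calibration, alpha=0.25)
print(thresholds)
```

A prediction set for a new input from a given group would then include every candidate label whose nonconformity score is at or below that group's threshold, so a small or atypical group is not averaged away by a larger one.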

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10
🧠

xLLM Technical Report

xLLM is a new open-source Large Language Model inference framework that delivers significantly improved performance for enterprise AI deployments. The framework achieves 1.7-2.2x higher throughput compared to existing solutions like MindIE and vLLM-Ascend through novel architectural optimizations including decoupled service-engine design and intelligent scheduling.

AI · Neutral · arXiv – CS AI · 18h ago · 6/10
🧠

Interaction-Breaking Adversarial Learning Framework for Robust Multi-Agent Reinforcement Learning

Researchers propose IBAL, an adversarial learning framework that makes multi-agent reinforcement learning systems robust against attacks that disrupt agent coordination through observation and action perturbations. The method addresses a gap in existing defenses by focusing on interaction-breaking attacks rather than value-oriented ones, demonstrating improved resilience across multiple scenarios.

AI · Neutral · arXiv – CS AI · 18h ago · 6/10
🧠

Regret-Based Federated Causal Discovery with Unknown Interventions

Researchers introduce I-PERI, a federated causal discovery algorithm that handles unknown client-level interventions across decentralized systems. The method advances privacy-preserving causal inference by recovering tighter equivalence classes when clients operate under heterogeneous, undisclosed policies, addressing a critical gap between theoretical causal discovery methods and real-world deployment constraints.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠

Grimlock: Guarding High-Agency Systems with eBPF and Attested Channels

Grimlock is a security framework that uses eBPF and TLS 1.3 channel binding to enforce authorization and delegation controls in agentic AI systems without modifying application code. The system intercepts sandbox communications, validates identity through post-handshake attestation, and issues short-lived scope tokens to enable secure multi-cloud orchestration with transparent auditability.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠

HEART: Achieving Timely Multi-Model Training for Vehicle-Edge-Cloud-Integrated Hierarchical Federated Learning

Researchers introduce HEART, a novel framework for efficient multi-model federated learning across vehicle-edge-cloud architectures that addresses training latency and resource allocation challenges in IoV systems. The solution combines hybrid synchronous-asynchronous aggregation with optimized task scheduling using particle swarm optimization and genetic algorithms.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠

HEAL: Resilient and Self-* Hub-based Learning

Researchers introduce HEAL, a decentralized machine learning framework that combines federated learning's efficiency with gossip learning's fault tolerance through a self-healing peer-to-peer overlay network. The system dynamically promotes nodes as aggregators, achieving federated learning performance while remaining fully decentralized and resilient to node failures.

AI · Bullish · arXiv – CS AI · 4d ago · 6/10
🧠

AgensFlow: A Coordination-Policy Substrate for Multi-Agent Systems

AgensFlow is an open-source framework that treats multi-agent LLM coordination as a learnable policy problem rather than a fixed pipeline, enabling dynamic routing decisions across skill protocols, agent roles, and model bindings. Evaluated on distributed systems and security tasks, the framework demonstrates that learned coordination outperforms static designs while reducing exploration costs through warm-started policy graphs.

AI · Bullish · arXiv – CS AI · May 12 · 6/10
🧠

Intelligent Autonomous Orchestration for Distributed Cloud Resources using Complex-Stability Analysis

Researchers propose C-SAS, an AI-driven orchestration framework using complex stability analysis to optimize distributed cloud resource allocation. The system reduces VM flapping by 94% and achieves 96% resource efficiency, outperforming traditional PID and machine learning approaches by embedding formal stability constraints into autonomous cloud infrastructure.

AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠

Reinforcement Learning for Scalable and Trustworthy Intelligent Systems

A dissertation presents research on scaling reinforcement learning across distributed systems while ensuring trustworthy behavior in AI applications. The work addresses communication efficiency in federated settings and alignment with human preferences in large language models, proposing that next-generation intelligent systems require both optimization efficiency and safety mechanisms.

AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠

TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples

TraceFix is a verification-first framework that uses TLA+ model checking to automatically repair and validate multi-agent LLM coordination protocols, achieving 100% verification success on 48 test tasks, with 62.5% passing on the first attempt. The approach reduces deadlock/livelock failures from 31.1% to 14.1% and improves task completion rates to 89.4% compared to unverified baselines.

AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠

Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM Serving

Researchers present an analytical framework for optimizing Attention/FFN provisioning ratios in disaggregated LLM serving architectures. The work provides closed-form rules and practical guidance for balancing memory-intensive attention computation with compute-intensive FFN operations, achieving predictions within 10% of simulation-optimal configurations.

AI · Neutral · arXiv – CS AI · May 7 · 6/10
🧠

Coward: Collision-based OOD Watermarking for Practical Proactive Federated Backdoor Detection

Researchers introduce Coward, a novel proactive backdoor detection method for federated learning that uses collision-based watermarking to identify poisoned model updates from malicious clients. The approach addresses critical limitations in existing detection methods by leveraging multi-backdoor collision effects and regulated OOD data injection, achieving state-of-the-art performance with fewer false positives.

AI · Neutral · arXiv – CS AI · May 7 · 6/10
🧠

Resilient AI Supercomputer Networking using MRC and SRv6

OpenAI and Microsoft have deployed MRC, a new RDMA-based transport protocol combined with SRv6 static routing, to eliminate tail latency issues in massive AI training clusters exceeding 100K GPUs. The system uses multi-plane Clos topologies and intelligent load-balancing to bypass network failures without interrupting synchronous training jobs, addressing a critical bottleneck in frontier model development.

🏢 OpenAI