y0news
#llm · 56 articles
AI · Neutral · arXiv – CS AI · 4h ago · 3
🧠

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking

Researchers introduce Jailbreak Foundry (JBF), a system that automatically converts AI jailbreak research papers into executable code modules for standardized testing. The system reproduced 30 attacks with high accuracy and cuts implementation code by nearly half, while enabling consistent evaluation across multiple AI models.

AI · Bullish · arXiv – CS AI · 4h ago · 5
🧠

Higress-RAG: A Holistic Optimization Framework for Enterprise Retrieval-Augmented Generation via Dual Hybrid Retrieval, Adaptive Routing, and CRAG

Researchers have developed Higress-RAG, a new enterprise-grade framework that addresses key challenges in Retrieval-Augmented Generation systems, including low retrieval precision, hallucination, and high latency. The system introduces innovations such as 50ms semantic caching, hybrid retrieval methods, and corrective evaluation to optimize the entire RAG pipeline for production use.

AI · Neutral · arXiv – CS AI · 4h ago · 4
🧠

An Agentic LLM Framework for Adverse Media Screening in AML Compliance

Researchers have developed an agentic LLM framework using Retrieval-Augmented Generation to automate adverse media screening for anti-money laundering compliance in financial institutions. The system addresses high false-positive rates in traditional keyword-based approaches by implementing multi-step web searches and computing Adverse Media Index scores to distinguish between high-risk and low-risk individuals.

AI · Bullish · arXiv – CS AI · 4h ago · 3
🧠

FinBloom: Knowledge Grounding Large Language Model with Real-time Financial Data

Researchers have developed FinBloom 7B, a specialized large language model trained on 14 million financial news articles and SEC filings, designed to handle real-time financial queries. The model introduces a Financial Agent system that can access up-to-date market data and financial information to support decision-making and algorithmic trading applications.

AI · Neutral · arXiv – CS AI · 4h ago · 6
🧠

Do LLMs Benefit From Their Own Words?

Research reveals that large language models don't significantly benefit from conditioning on their own previous responses in multi-turn conversations. The study found that omitting assistant history can reduce context lengths by up to 10x while maintaining response quality, and in some cases even improves performance by avoiding context pollution where models over-condition on previous responses.
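The intervention the study describes can be sketched in a few lines: before each new API call, drop the assistant's earlier turns from the transcript and keep only the system prompt and user turns. This is a minimal illustration of the idea, not the paper's exact protocol; the message format follows the common chat-API convention of role-tagged dictionaries.

```python
def strip_assistant_history(messages):
    """Drop prior assistant turns from a chat transcript, keeping
    the system prompt and all user turns (shrinking the context)."""
    return [m for m in messages if m["role"] != "assistant"]

history = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Summarize X."},
    {"role": "assistant", "content": "X is ..."},  # omitted on the next call
    {"role": "user", "content": "Now compare X and Y."},
]
trimmed = strip_assistant_history(history)
# trimmed contains only the system prompt and the two user turns
```

With long assistant responses, this pruning is where the reported up-to-10x context reduction comes from.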

AI · Neutral · arXiv – CS AI · 4h ago · 6
🧠

Task Complexity Matters: An Empirical Study of Reasoning in LLMs for Sentiment Analysis

A comprehensive study of 504 AI model configurations reveals that the value of reasoning in large language models is highly task-dependent: reasoning degrades performance on simple tasks like binary classification by up to 19.9 percentage points, while improving complex 27-class emotion recognition by up to 16.0 points. The research challenges the assumption that reasoning universally improves AI performance across all language tasks.

AI · Bullish · arXiv – CS AI · 4h ago · 3
🧠

PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning

Researchers introduce PointCoT, a new AI framework that enables multimodal large language models to perform explicit geometric reasoning on 3D point cloud data using Chain-of-Thought methodology. The framework addresses current limitations where AI models suffer from geometric hallucinations by implementing a 'Look, Think, then Answer' paradigm with 86k instruction-tuning samples.

AI · Bullish · arXiv – CS AI · 4h ago · 2
🧠

Learning to Generate Secure Code via Token-Level Rewards

Researchers have developed Vul2Safe, a new framework for generating secure code using large language models, which addresses security vulnerabilities through self-reflection and token-level reinforcement learning. The approach introduces the PrimeVul+ dataset and SRCode training framework to provide more precise optimization of security patterns in code generation.

AI · Bullish · arXiv – CS AI · 4h ago · 2
🧠

KEEP: A KV-Cache-Centric Memory Management System for Efficient Embodied Planning

Researchers from PKU-SEC-Lab have developed KEEP, a new memory management system that significantly improves the efficiency of AI-powered embodied planning by optimizing KV cache usage. The system achieves 2.68x speedup compared to text-based memory methods while maintaining accuracy, addressing a key bottleneck in memory-augmented Large Language Models for complex planning tasks.

AI · Bullish · arXiv – CS AI · 4h ago · 5
🧠

Real-Time Aligned Reward Model beyond Semantics

Researchers introduce R2M (Real-Time Aligned Reward Model), a new framework for Reinforcement Learning from Human Feedback (RLHF) that addresses reward overoptimization in large language models. The system uses real-time policy feedback to better align reward models with evolving policy distributions during training.

AI · Bullish · arXiv – CS AI · 4h ago · 4
🧠

LLM-Driven Multi-Turn Task-Oriented Dialogue Synthesis for Realistic Reasoning

Researchers propose an LLM-driven framework for generating multi-turn task-oriented dialogues to create more realistic reasoning benchmarks. The framework addresses limitations in current AI evaluation methods by producing synthetic datasets that better reflect real-world complexity and contextual coherence.

AI · Bullish · arXiv – CS AI · 4h ago · 4
🧠

Does Your Reasoning Model Implicitly Know When to Stop Thinking?

Researchers introduce SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that improves AI reasoning efficiency by helping large reasoning models know when to stop thinking. The approach targets redundant, lengthy reasoning chains that add no accuracy, cutting computational cost and response time.

AI · Neutral · arXiv – CS AI · 4h ago · 5
🧠

LumiMAS: A Comprehensive Framework for Real-Time Monitoring and Enhanced Observability in Multi-Agent Systems

Researchers have developed LumiMAS, a comprehensive framework for monitoring and detecting failures in multi-agent systems that incorporate large language models. The framework features three layers: monitoring and logging, anomaly detection, and anomaly explanation with root cause analysis, addressing the unique challenges of observing entire multi-agent systems rather than individual agents.

AI · Bullish · arXiv – CS AI · 4h ago · 6
🧠

Data Driven Optimization of GPU efficiency for Distributed LLM Adapter Serving

Researchers developed a data-driven pipeline to optimize GPU efficiency for distributed LLM adapter serving, achieving sub-5% throughput estimation error while running 90x faster than full benchmarking. The system uses a Digital Twin, machine learning models, and greedy placement algorithms to minimize GPU requirements while serving hundreds of adapters concurrently.
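The "greedy placement" step can be pictured as a bin-packing pass: sort adapters by memory footprint and drop each onto the first GPU with room, opening a new GPU only when none fits. The names and capacities below are hypothetical, and this first-fit-decreasing sketch stands in for whatever objective the paper's actual algorithm optimizes.

```python
def greedy_place(adapters, gpu_capacity):
    """First-fit-decreasing placement: sort adapters by memory and
    put each on the first GPU with enough free memory, opening a
    new GPU only when no existing one has room."""
    gpus = []  # each entry: [free_memory, [adapter names]]
    for name, mem in sorted(adapters, key=lambda a: -a[1]):
        for slot in gpus:
            if slot[0] >= mem:
                slot[0] -= mem
                slot[1].append(name)
                break
        else:
            gpus.append([gpu_capacity - mem, [name]])
    return gpus

# four hypothetical adapters packed onto 10 GB GPUs
placement = greedy_place([("a", 6), ("b", 5), ("c", 4), ("d", 3)], gpu_capacity=10)
# two GPUs suffice: a+c on one, b+d on the other
```

In the paper's pipeline, the per-adapter cost fed into this step comes from the Digital Twin's throughput estimates rather than raw memory sizes.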

AI · Bullish · arXiv – CS AI · 4h ago · 1
🧠

Preference Packing: Efficient Preference Optimization for Large Language Models

Researchers propose 'preference packing,' a new optimization technique for training large language models that reduces training time by at least 37% through more efficient handling of duplicate input prompts. The method optimizes attention operations and KV cache memory usage in preference-based training methods like Direct Preference Optimization.
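The core observation is that preference datasets repeat the same prompt across many chosen/rejected pairs, so its encoding can be shared. A toy grouping step shows the deduplication idea; the paper's actual packing happens at the attention/KV-cache level, and the data below is invented for illustration.

```python
from collections import defaultdict

def pack_preference_pairs(pairs):
    """Group (prompt, chosen, rejected) triples by prompt so that a
    duplicated prompt is encoded once per batch instead of once per pair."""
    packed = defaultdict(list)
    for prompt, chosen, rejected in pairs:
        packed[prompt].append((chosen, rejected))
    return dict(packed)

pairs = [
    ("Explain KV cache.", "good A", "bad A"),
    ("Explain KV cache.", "good B", "bad B"),
    ("What is DPO?", "good C", "bad C"),
]
packed = pack_preference_pairs(pairs)
# three pairs collapse onto two unique prompts
```

The fewer unique prompts a batch contains, the larger the share of attention and KV-cache work that packing avoids.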

AI · Bullish · arXiv – CS AI · 4h ago · 4
🧠

Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning

Researchers introduce Latent Self-Consistency (LSC), a new method for improving Large Language Model output reliability across both short and long-form reasoning tasks. LSC uses learnable token embeddings to select semantically consistent responses with only 0.9% computational overhead, outperforming existing consistency methods like Self-Consistency and Universal Self-Consistency.
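For context, the Self-Consistency baseline that LSC improves on is just a majority vote over sampled answers; LSC replaces exact-match counting with comparisons of learnable latent embeddings, which this short-answer sketch does not attempt.

```python
from collections import Counter

def majority_answer(samples):
    """Classical self-consistency: sample several answers from the
    model and return the most frequent one."""
    counts = Counter(samples)
    answer, _ = counts.most_common(1)[0]
    return answer

best = majority_answer(["42", "41", "42", "42", "7"])  # -> "42"
```

Exact-match voting breaks down on long-form answers, which is the gap LSC's semantic selection is designed to close.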

AI · Bullish · arXiv – CS AI · 4h ago · 2
🧠

The Auton Agentic AI Framework

Researchers have introduced the Auton Agentic AI Framework, a new architecture designed to bridge the gap between stochastic LLM outputs and deterministic backend systems required for autonomous AI agents. The framework separates cognitive blueprints from runtime engines, enabling cross-platform portability and formal auditability while incorporating advanced safety mechanisms and memory systems.

AI · Bullish · arXiv – CS AI · 4h ago · 4
🧠

ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference

Researchers propose ODAR-Expert, an adaptive routing framework for large language models that optimizes accuracy-efficiency trade-offs by dynamically routing queries between fast and slow processing agents. The system achieved 98.2% accuracy on MATH benchmarks while reducing computational costs by 82%, suggesting that optimal AI scaling requires adaptive resource allocation rather than simply increasing test-time compute.
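The fast/slow routing idea can be sketched as a confidence-gated dispatch: try the cheap agent first and escalate only when its confidence is low. The agents and threshold below are toy stand-ins; ODAR-Expert's routing decision is derived from active inference, not a fixed cutoff.

```python
def route_query(query, fast_agent, slow_agent, threshold=0.8):
    """Send the query to the cheap agent first; escalate to the
    expensive reasoning agent only when confidence is low."""
    answer, confidence = fast_agent(query)
    if confidence >= threshold:
        return answer, "fast"
    return slow_agent(query), "slow"

# toy stand-ins for the two agents
fast_agent = lambda q: ("4", 0.95) if q == "2+2" else ("unsure", 0.30)
slow_agent = lambda q: "144"  # pretend heavyweight reasoner

easy = route_query("2+2", fast_agent, slow_agent)      # ('4', 'fast')
hard = route_query("12*12", fast_agent, slow_agent)    # ('144', 'slow')
```

Most of the reported 82% cost reduction comes from the easy-query path, where the slow agent is never invoked.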

AI · Bullish · arXiv – CS AI · 4h ago · 2
🧠

SafeGen-LLM: Enhancing Safety Generalization in Task Planning for Robotic Systems

Researchers propose SafeGen-LLM, a new approach to enhance safety in robotic task planning by combining supervised fine-tuning with policy optimization guided by formal verification. The system demonstrates superior safety generalization across multiple domains compared to existing classical planners, reinforcement learning methods, and base large language models.

AI · Bullish · arXiv – CS AI · 4h ago · 3
🧠

FineScope: SAE-guided Data Selection Enables Domain Specific LLM Pruning and Finetuning

Researchers introduce FineScope, a framework that uses Sparse Autoencoder (SAE) techniques to create smaller, domain-specific language models from larger pretrained LLMs through structured pruning and self-data distillation. The method achieves competitive performance while significantly reducing computational requirements compared to training from scratch.

AI · Bullish · arXiv – CS AI · 4h ago · 5
🧠

CoMind: Towards Community-Driven Agents for Machine Learning Engineering

Researchers introduce CoMind, a multi-agent AI system that leverages community knowledge to automate machine learning engineering tasks. The system achieved a 36% medal rate on 75 past Kaggle competitions and outperformed 92.6% of human competitors in eight live competitions, establishing new state-of-the-art performance.

AI · Neutral · arXiv – CS AI · 4h ago · 4
🧠

LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics

Researchers have developed LemmaBench, a new benchmark for evaluating Large Language Models on research-level mathematics by automatically extracting and rewriting lemmas from arXiv papers. Current state-of-the-art LLMs achieve only 10-15% accuracy on these mathematical theorem proving tasks, revealing a significant gap between AI capabilities and human-level mathematical research.

Page 1 of 3