956 articles tagged with #llm. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bearish · arXiv › CS AI · Mar 9 · 6/10
🧠 Researchers tested the stability of moral judgments in large language models using nearly 3,000 ethical dilemmas, finding that narrative framing and evaluation methods significantly influence AI decisions. The study reveals that LLM moral reasoning depends heavily on how questions are presented rather than on underlying moral substance, with only 35.7% consistency across different evaluation protocols.
🧠 GPT-4 🧠 Claude
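One way to compute a cross-protocol consistency figure of the kind the study reports (35.7%) is the fraction of dilemmas on which every evaluation protocol yields the same verdict. A minimal sketch, with invented protocol names and verdicts rather than the paper's data:

```python
# Hedged sketch: measure how often different evaluation protocols agree on
# the same moral dilemma. Dilemma IDs and verdicts below are placeholders.

def protocol_consistency(verdicts_by_protocol):
    """verdicts_by_protocol: dict protocol -> dict dilemma_id -> verdict.
    Returns the fraction of dilemmas on which every protocol agrees."""
    protocols = list(verdicts_by_protocol.values())
    dilemmas = list(protocols[0].keys())
    agree = sum(
        1 for d in dilemmas
        if len({p[d] for p in protocols}) == 1   # one unique verdict
    )
    return agree / len(dilemmas)

verdicts = {
    "likert":        {"d1": "permissible", "d2": "wrong", "d3": "wrong"},
    "forced_choice": {"d1": "permissible", "d2": "permissible", "d3": "wrong"},
    "free_text":     {"d1": "permissible", "d2": "wrong", "d3": "wrong"},
}
print(protocol_consistency(verdicts))  # 2 of 3 dilemmas agree -> 0.666...
```

A low value here would indicate exactly the framing sensitivity the paper describes: the verdict shifts with the question format, not the dilemma.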
AI · Bullish · arXiv › CS AI · Mar 9 · 6/10
🧠 Researchers developed SecureRAG-RTL, a new AI framework that uses Retrieval-Augmented Generation to detect security vulnerabilities in hardware designs. The system improves detection accuracy by 30% on average across different LLM architectures and addresses the challenge of limited hardware security datasets for AI training.
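The retrieval-augmented step such a framework implies can be sketched as: fetch the known vulnerability notes most similar to the RTL under analysis and prepend them to the detection prompt. Token-overlap scoring and the corpus entries below are stand-ins for a real embedding index and security knowledge base:

```python
# Hedged sketch of retrieval-augmented vulnerability analysis: rank a small
# corpus of known hardware-security issues by token overlap with the query,
# then build an augmented prompt. All note texts are invented examples.

def retrieve(query, corpus, k=2):
    q = set(query.lower().split())
    scored = sorted(corpus, key=lambda doc: -len(q & set(doc.lower().split())))
    return scored[:k]

corpus = [
    "unvalidated debug port grants register access",
    "fsm deadlock from unreachable reset state",
    "counter overflow bypasses access control check",
]
hits = retrieve("debug port register access bypass", corpus)
prompt = "Known issues:\n" + "\n".join(hits) + "\nAnalyze the RTL for similar flaws."
print(hits[0])  # the debug-port note ranks first on overlap
```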
AI · Bullish · arXiv › CS AI · Mar 9 · 6/10
🧠 Researchers introduce StreamWise, a system for real-time multi-modal content generation that can produce 10-minute podcast videos with sub-second startup delays. The system dynamically manages quality and resources across LLMs, text-to-speech, and video generation, costing under $25 for basic generation or $45 for high-quality real-time streaming.
AI · Bearish · arXiv › CS AI · Mar 9 · 6/10
🧠 Researchers have identified 'ambiguity collapse' as a significant epistemic risk when large language models encounter ambiguous terms and produce singular interpretations without human deliberation. The phenomenon threatens decision-making processes in content moderation, hiring, and AI self-regulation by bypassing normal human practices of meaning negotiation and potentially distorting shared vocabularies over time.
AI · Neutral · arXiv › CS AI · Mar 9 · 6/10
🧠 Researchers have developed ConStory-Bench, a new benchmark to evaluate consistency errors in long-form story generation by Large Language Models. The study reveals that LLMs frequently contradict their own established facts and character traits when generating lengthy narratives, with errors most commonly occurring in factual and temporal dimensions around the middle of stories.
AI · Bullish · arXiv › CS AI · Mar 9 · 6/10
🧠 Researchers developed a method called HuLM (Human-aware Language Modeling) that improves large language model performance by considering the context of text written by the same author over time. Testing on an 8B Llama model showed that incorporating author context during fine-tuning significantly improves performance across eight downstream tasks.
🧠 Llama
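The core data-preparation idea, as the summary describes it, is to condition on text the same author wrote earlier. A minimal sketch of building such a training example, where the field layout and separator are assumptions rather than the paper's exact format:

```python
# Hedged sketch of human-aware context construction: prepend a window of the
# author's earlier texts (oldest first) to the current document before
# fine-tuning. The separator string is an invented placeholder.

def build_author_context(history, current_text, max_prior=3, sep="\n---\n"):
    """history: list of the author's earlier texts in chronological order."""
    prior = history[-max_prior:]          # the most recent prior writings
    return sep.join(prior + [current_text])

history = ["post from Jan", "post from Feb", "post from Mar", "post from Apr"]
example = build_author_context(history, "post from May", max_prior=2)
print(example)  # two prior posts, then the current one
```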
AI · Bullish · arXiv › CS AI · Mar 9 · 6/10
🧠 Researchers developed an explainable AI (XAI) system that transforms raw execution traces from LLM-based coding agents into structured, human-interpretable explanations. The system enables users to identify failure root causes 2.8 times faster and propose fixes with 73% higher accuracy through domain-specific failure taxonomy, automatic annotation, and hybrid explanation generation.
AI · Bullish · arXiv › CS AI · Mar 9 · 6/10
🧠 Researchers have developed MASFactory, a new graph-centric framework for orchestrating Large Language Model-based Multi-Agent Systems (MAS). The framework introduces 'Vibe Graphing,' which allows users to compile natural language instructions into executable workflow graphs, making complex AI agent coordination more accessible and reusable.
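The target of such compilation is an executable workflow graph: agent nodes plus dependency edges, run in dependency order. A minimal sketch under that assumption, with an invented two-agent "summarize then critique" workflow (the framework's actual node semantics may differ):

```python
# Hedged sketch of an executable workflow graph: each node is a callable
# agent; edges declare which outputs feed which nodes; nodes run once all
# of their dependencies have produced output.

def run_workflow(nodes, edges, seed_input):
    """nodes: name -> fn(inputs dict) -> output; edges: (src, dst) pairs."""
    deps = {n: [s for s, d in edges if d == n] for n in nodes}
    done, outputs = set(), {"input": seed_input}
    while len(done) < len(nodes):
        for name in nodes:
            if name not in done and all(p in outputs for p in deps[name]):
                parents = deps[name] or ["input"]   # roots read the seed input
                outputs[name] = nodes[name]({p: outputs[p] for p in parents})
                done.add(name)
    return outputs

out = run_workflow(
    nodes={
        "summarizer": lambda i: "summary of " + list(i.values())[0],
        "critic": lambda i: "critique of " + i["summarizer"],
    },
    edges=[("summarizer", "critic")],
    seed_input="a long report",
)
print(out["critic"])  # critique of summary of a long report
```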
AI · Bullish · arXiv › CS AI · Mar 9 · 6/10
🧠 Researchers introduce MoEless, a serverless framework for serving Mixture-of-Experts Large Language Models that addresses expert load imbalance issues. The system reduces inference latency by 43% and costs by 84% compared to existing solutions by using predictive load balancing and optimized expert scaling strategies.
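The expert-scaling side of such a design can be sketched as: given predicted per-expert token loads for the next serving window, allocate replica counts so no replica exceeds its capacity. The load predictor itself is out of scope here, and all numbers are illustrative:

```python
# Hedged sketch of predictive expert scaling: size each expert's replica
# pool to its forecast demand. A real system would also handle placement,
# cold starts, and scale-down hysteresis.

import math

def plan_replicas(predicted_load, capacity_per_replica):
    """predicted_load: dict expert -> expected tokens in the next window."""
    return {
        expert: max(1, math.ceil(load / capacity_per_replica))
        for expert, load in predicted_load.items()
    }

plan = plan_replicas({"e0": 1200, "e1": 300, "e2": 50}, capacity_per_replica=400)
print(plan)  # {'e0': 3, 'e1': 1, 'e2': 1}
```

The hot expert gets three replicas while cold experts keep a single warm one, which is the basic lever for the load-imbalance problem the summary mentions.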
AI · Neutral · arXiv › CS AI · Mar 9 · 6/10
🧠 Researchers introduce AgoraBench, a new framework for improving Large Language Models' bargaining and negotiation capabilities through utility-based feedback mechanisms. The study reveals that current LLMs struggle with strategic depth in negotiations and proposes human-aligned metrics and training methods to enhance their performance.
AI · Bearish · arXiv › CS AI · Mar 9 · 6/10
🧠 Researchers conducted a controlled study examining the effectiveness of large language models (LLMs) for time series forecasting, finding that existing approaches often overfit to small datasets. Despite some promise, LLMs did not consistently outperform models specifically trained on large-scale time series data.
AI · Bullish · arXiv › CS AI · Mar 9 · 6/10
🧠 Researchers introduce Answer-Then-Check, a novel safety alignment approach for large language models that enables them to evaluate response safety before outputting to users. The method uses a new 80K-sample dataset called Reasoned Safety Alignment (ReSA) and demonstrates improved jailbreak defense while maintaining general reasoning capabilities.
🏢 Hugging Face
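The control flow the name describes is draft first, judge the draft's safety, and only then emit. A minimal sketch with placeholder callables (`generate` and `is_safe` stand in for the paper's models; the refusal string is invented):

```python
# Hedged sketch of answer-then-check: reason to a full draft answer, then
# inspect that draft for safety before anything reaches the user.

def answer_then_check(prompt, generate, is_safe, refusal="I can't help with that."):
    draft = generate(prompt)          # stage 1: produce a candidate answer
    if is_safe(prompt, draft):        # stage 2: check the answer itself
        return draft
    return refusal                    # withhold unsafe drafts

reply = answer_then_check(
    "how do I pick a lock?",
    generate=lambda p: "Step-by-step lockpicking instructions...",
    is_safe=lambda p, d: "lockpicking" not in d.lower(),
)
print(reply)  # the unsafe draft is withheld, the refusal is returned
```

Checking the concrete draft, rather than only the prompt, is what distinguishes this from prompt-level filtering and motivates the jailbreak-defense claim.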
AI · Bullish · Hugging Face Blog · Mar 6 · 6/10
🧠 NVIDIA has released NeMo Evaluator Agent Skills, a tool that enables rapid evaluation of conversational large language models in minutes. This development streamlines the testing and validation process for LLM applications, potentially accelerating AI development workflows.
🏢 Nvidia
AI · Bullish · arXiv › CS AI · Mar 6 · 6/10
🧠 Researchers introduce RLSTA (Reinforcement Learning with Single-Turn Anchors), a new training method that addresses 'contextual inertia' - a problem where AI models fail to integrate new information in multi-turn conversations. The approach uses single-turn reasoning capabilities as anchors to improve multi-turn interaction performance across domains.
AI · Bullish · arXiv › CS AI · Mar 6 · 6/10
🧠 Researchers introduced GCAgent, an LLM-driven system that enhances group chat communication through AI dialogue agents. The system achieved significant improvements in real-world deployments, increasing message volume by 28.80% over 350 days and earning an average score of 4.68 across evaluation criteria.
AI · Neutral · arXiv › CS AI · Mar 6 · 6/10
🧠 Researchers introduce X-RAY, a new system for analyzing large language model reasoning capabilities through formally verified probes that isolate structural components of reasoning. The study reveals LLMs handle constraint refinement well but struggle with solution-space restructuring, providing contamination-free evaluation methods.
AI · Bullish · arXiv › CS AI · Mar 6 · 6/10
🧠 Researchers propose STRUCTUREDAGENT, a new AI framework that uses hierarchical planning with AND/OR trees to improve web agent performance on complex, long-horizon tasks. The system addresses limitations in current LLM-based agents through better memory tracking and structured planning approaches.
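AND/OR trees encode plans where an AND node needs every child subtask done and an OR node needs any one alternative. A minimal sketch of evaluating such a tree, with invented web-task names (the framework's actual node semantics may be richer):

```python
# Hedged sketch of AND/OR plan evaluation: a plan node is satisfied when its
# logical structure over completed subtasks is satisfied.

def plan_succeeds(node, done):
    """node: ('task', name) | ('and', [children]) | ('or', [children])."""
    kind, payload = node
    if kind == "task":
        return payload in done
    if kind == "and":
        return all(plan_succeeds(c, done) for c in payload)
    return any(plan_succeeds(c, done) for c in payload)  # 'or' node

plan = ("and", [
    ("task", "open_search_page"),
    ("or", [("task", "use_filter_ui"), ("task", "edit_query_url")]),
])
print(plan_succeeds(plan, done={"open_search_page", "edit_query_url"}))  # True
```

The OR branch is what gives a web agent fallback routes: if the filter UI fails, editing the query URL still completes the plan.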
AI · Bullish · arXiv › CS AI · Mar 6 · 6/10
🧠 Researchers propose CTRL-RAG, a new reinforcement learning framework that improves large language models' ability to generate accurate, context-faithful responses in Retrieval-Augmented Generation systems. The method uses a Contrastive Likelihood Reward mechanism that optimizes the difference between responses with and without supporting evidence, addressing issues of hallucination and model collapse in existing RAG systems.
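A contrastive likelihood reward of the kind described can be read as: how much does the retrieved evidence raise the log-likelihood of the response versus the query alone? A sketch with a stand-in `logprob` function and illustrative values (the paper's exact reward shaping may differ):

```python
# Hedged sketch of a contrastive likelihood reward: positive when the
# response becomes more likely once the evidence is in context, i.e. the
# response actually leans on the retrieved passage.

def contrastive_reward(logprob, query, evidence, response):
    with_ev = logprob(response, context=query + "\n" + evidence)
    without_ev = logprob(response, context=query)
    return with_ev - without_ev

# Toy model: the response is far more likely when "Paris" is in context.
fake_lp = lambda response, context: -5.0 if "Paris" in context else -12.0
r = contrastive_reward(fake_lp, "Capital of France?",
                       "Paris is the capital.", "The capital is Paris.")
print(r)  # 7.0: the evidence substantially raises the response's likelihood
```

A hallucinated response that ignores the evidence scores near zero under this signal, which is how such a reward discourages unfaithful generations.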
AI · Bullish · arXiv › CS AI · Mar 6 · 6/10
🧠 Researchers introduce the What Is Missing (WIM) rating system for Large Language Models that uses natural-language feedback instead of numerical ratings to improve preference learning. WIM computes ratings by analyzing cosine similarity between model outputs and judge feedback embeddings, producing more interpretable and effective training signals with fewer ties than traditional rating methods.
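The similarity computation at the core of this is standard cosine similarity between two embedding vectors. A self-contained sketch with tiny hand-made vectors standing in for a real encoder's embeddings of the model output and the judge's feedback:

```python
# Hedged sketch of a similarity-derived rating: embed the model output and
# the judge's natural-language feedback, then use cosine similarity between
# the vectors as the scalar training signal.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

output_emb   = [0.9, 0.1, 0.0]   # toy embedding of the model's answer
feedback_emb = [0.8, 0.2, 0.1]   # toy embedding of the judge's feedback
print(round(cosine(output_emb, feedback_emb), 3))
```

Because cosine similarity is continuous rather than a small set of integer grades, two responses almost never receive exactly the same score, which is one plausible reading of the "fewer ties" claim.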
AI · Bullish · arXiv › CS AI · Mar 6 · 6/10
🧠 Researchers propose ZorBA, a new federated learning framework for fine-tuning large language models that reduces memory usage by up to 62.41% through zeroth-order optimization and heterogeneous block activation. The system eliminates gradient storage requirements and reduces communication overhead by using shared random seeds and finite difference methods.
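The seed-plus-finite-difference idea is concrete enough to sketch: estimate a directional derivative from two forward passes along a random direction that any peer can regenerate from a shared seed, so participants exchange only a seed and a scalar instead of full gradients. This is a generic sketch in the MeZO/SPSA family the summary suggests, on a toy quadratic, not the paper's algorithm:

```python
# Hedged sketch of seed-based zeroth-order optimization: the perturbation
# direction z is rebuilt from the seed, so no gradient tensors are stored
# or communicated; only (seed, scalar estimate) would cross the network.

import random

def zo_step(loss, params, seed, eps=1e-3, lr=0.05):
    rng = random.Random(seed)                     # shared seed => same z everywhere
    z = [rng.gauss(0, 1) for _ in params]
    plus  = loss([p + eps * g for p, g in zip(params, z)])
    minus = loss([p - eps * g for p, g in zip(params, z)])
    ghat = (plus - minus) / (2 * eps)             # projected (directional) gradient
    return [p - lr * ghat * g for p, g in zip(params, z)]

loss = lambda ps: sum(p * p for p in ps)          # toy objective, minimum at 0
params = [1.0, -2.0]                              # initial loss = 5.0
for step in range(200):
    params = zo_step(loss, params, seed=step)
print(loss(params))  # should be far below the initial 5.0
```

The memory saving follows from the structure: only forward passes are needed, so no activation or gradient storage for backpropagation.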
AI · Neutral · arXiv › CS AI · Mar 5 · 5/10
🧠 Researchers have introduced RealPref, a new benchmark for evaluating how well Large Language Models follow user preferences in long-term personalized interactions. The study reveals that LLM performance significantly degrades with longer contexts and more implicit preference expressions, highlighting challenges in developing user-aware AI assistants.
AI · Bullish · arXiv › CS AI · Mar 5 · 5/10
🧠 Researchers developed a hybrid AI architecture for agricultural advisory that separates factual retrieval from conversational delivery, using supervised fine-tuning on expert-curated agricultural knowledge. The system showed improved accuracy and safety for smallholder farmers while achieving comparable results to frontier models at lower cost.
AI · Neutral · arXiv › CS AI · Mar 5 · 5/10
🧠 Researchers developed a neurosymbolic approach using social science theory and abductive reasoning to help Large Language Models transform text narratives while preserving core messages. The method achieved 55.88% improvement over baseline performance with GPT-4o when shifting between collectivistic and individualistic narrative frameworks.
🧠 GPT-4 🧠 Llama 🧠 Grok
AI · Bullish · arXiv › CS AI · Mar 5 · 5/10
🧠 Researchers have released Tucano 2, an open-source suite of Portuguese language models ranging from 0.5 to 3.7 billion parameters, featuring enhanced datasets and training recipes. The models achieve state-of-the-art performance on Portuguese benchmarks and include capabilities for coding, tool use, and chain-of-thought reasoning.
AI · Neutral · arXiv › CS AI · Mar 5 · 5/10
🧠 Researchers introduce CodeTaste, a benchmark testing whether AI coding agents can perform code refactoring at human-level quality. The study reveals frontier AI models struggle to identify appropriate refactorings when given general improvement areas, but perform better with detailed specifications.