AI · Bullish · arXiv – CS AI · 8h ago · 7/10
🧠COLLEAGUE.SKILL is an open-source system that automates the conversion of expert knowledge traces into portable, inspectable AI agent skills through a structured distillation workflow. The framework enables person-grounded agents to encode human expertise, decision-making patterns, and communication styles as versioned, correctable skill packages that can be deployed across multiple agent hosts.
AI · Bearish · arXiv – CS AI · 8h ago · 7/10
🧠Researchers demonstrate a novel poisoning attack on retrieval-augmented text-to-music systems where attackers inject malicious captions into music databases to manipulate generation outputs toward attacker-chosen targets while maintaining alignment with original user prompts. The attack reveals a critical integrity vulnerability in AI systems that depend on external knowledge bases for prompt augmentation.
AI · Neutral · arXiv – CS AI · 3d ago · 7/10
🧠Researchers investigated how prompt tone affects Large Language Model accuracy across multiple models and datasets, finding that tonal variations produce systematic yet model-dependent performance shifts. Testing ChatGPT-4o, ChatGPT-5-nano, Gemini 2.5 Flash, and Gemini 2.5 Flash Lite on 50-620 multiple-choice questions, they discovered some models show statistically significant accuracy changes while others experience large swings, with sensitivity varying by subject domain. The findings highlight that LLM reliability cannot be assumed tone-robust in production deployments.
🧠 ChatGPT 🧠 Gemini
AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠Researchers introduce SCOPE, a framework that enables Large Language Model agents to automatically evolve their prompts by learning from execution traces in dynamic environments. The system improves task success rates from 14.23% to 38.64% on benchmark tests, addressing a critical limitation in how LLM agents manage complex, changing contexts without human intervention.
AI · Bearish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers introduce WIRE, a diagnostic pipeline for detecting conflicting rules within LLM agent prompt policies. Testing six public policies, the system identified 170 rule-pair conflicts and found that 64.6% of witnessed conflict scenarios resulted in at least one source-rule violation, revealing significant gaps in how language models handle competing policy directives.
AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠PromptEmbedder introduces a dual-LLM framework that decouples text embedding from specific model architectures, achieving comparable performance to LoRA while reducing GPU memory by 40% and accelerating training 3.7x. The innovation enables efficient transfer across different LLM backbones by retraining only a lightweight alignment matrix rather than entire models.
AI · Bearish · arXiv – CS AI · May 12 · 7/10
🧠Researchers demonstrate that large language models suffer from 'in-context fixation,' where homogeneous demonstration labels—even semantically valid ones—cause classification accuracy to collapse below 12%. The models treat label-slot tokens as an exhaustive vocabulary set rather than learning from semantic meaning, revealing that in-context learning operates as constrained vocabulary retrieval rather than genuine concept learning.
🧠 Llama
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Skill-R1 introduces a reinforcement learning framework that optimizes reusable natural language procedures (skills) for large language model agents without modifying the underlying model itself. By training a lightweight skill generator that works with frozen LLMs, the approach reduces adaptation costs while maintaining compatibility with both open and closed-source models, demonstrating consistent improvements on complex multi-step tasks.
AI · Bullish · arXiv – CS AI · May 9 · 7/10
🧠Researchers propose Lorem Perturbation for Exploration (LoPE), a training technique that addresses the zero-advantage problem in reinforcement learning for large language models by prepending random Latin-based text to prompts, enabling broader reasoning exploration across 1.7B to 7B parameter models.
🏢 Perplexity
AI · Bullish · arXiv – CS AI · May 9 · 7/10
🧠Researchers introduce Post-Reasoning, a technique that improves LLM performance by having models justify answers after generating final responses, without increasing latency or token costs. The method demonstrates 17.37% mean performance improvements across 117 model-benchmark settings and establishes a new efficiency frontier for direct-answer AI capabilities.
AI · Bearish · arXiv – CS AI · May 7 · 7/10
🧠Research shows that Large Language Models exhibit measurable bias when their downstream purpose is revealed, even when generating supposedly task-independent metrics. This bias stems from human research design choices rather than algorithmic flaws, raising critical questions about how AI systems are deployed in financial and other sensitive domains.
AI · Bearish · arXiv – CS AI · May 1 · 7/10
🧠Research shows that in-context examples in large language models suppress recall of scientific knowledge, causing models to shift from knowledge-driven reasoning to empirical pattern fitting even when examples are generated from the same formulas they should reinforce. This finding across 60 tasks and four models suggests practitioners deploying LLMs for scientific work should be cautious about using examples, as they may undermine rather than support domain expertise.
AI · Bearish · arXiv – CS AI · Apr 20 · 7/10
🧠Researchers document a case study where a user's custom LLM system designed for self-regulation inadvertently caused loss of agency within 48 hours due to architectural flaws in prompt isolation. The study identifies context contamination and metacognitive co-option as failure mechanisms and proposes physical rather than logical isolation as a solution, raising critical ethical questions about protective versus restrictive AI system design.
AI · Bullish · arXiv – CS AI · Apr 15 · 7/10
🧠Researchers introduce RePAIR, a framework enabling users to instruct large language models to forget harmful knowledge, misinformation, and personal data through natural language prompts at inference time. The system uses a training-free method called STAMP that manipulates model activations to achieve selective unlearning with minimal computational overhead, outperforming existing approaches while preserving model utility.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers demonstrate that inserting sentence boundary delimiters in LLM inputs significantly enhances model performance across reasoning tasks, with improvements up to 12.5% on specific benchmarks. This technique leverages the natural sentence-level structure of human language to enable better processing during inference, tested across model scales from 7B to 600B parameters.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers demonstrate that modern large language models can significantly improve code generation accuracy through iterative self-repair—feeding execution errors back to the model for correction—achieving 4.9-30.0 percentage point gains across benchmarks. The study reveals that instruction-tuned models succeed with prompting alone even at 8B scale, with Gemini 2.5 Flash reaching 96.3% pass rates on HumanEval, though logical errors remain substantially harder to fix than syntax errors.
🧠 Gemini 🧠 Llama
AI · Bearish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers tested whether large language models develop spatial world models through maze-solving tasks, finding that leading models like Gemini, GPT-4, and Claude struggle with spatial reasoning. Performance varies dramatically (16-86% accuracy) depending on input format, suggesting LLMs lack robust, format-invariant spatial understanding rather than building true internal world models.
🧠 GPT-5 🧠 Claude 🧠 Gemini
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠FACT-E is a new evaluation framework that uses controlled perturbations to assess the faithfulness of Chain-of-Thought reasoning in large language models, addressing the problem of models generating seemingly coherent explanations with invalid intermediate steps. By measuring both internal chain consistency and answer alignment, FACT-E enables more reliable detection of flawed reasoning and selection of trustworthy reasoning trajectories for in-context learning.
AI · Bullish · arXiv – CS AI · Apr 13 · 7/10
🧠Researchers propose a cost-effective proxy model framework that uses smaller, efficient models to approximate the interpretability explanations of expensive Large Language Models (LLMs), achieving over 90% fidelity at just 11% of computational cost. The framework includes verification mechanisms and demonstrates practical applications in prompt compression and data cleaning, making interpretability tools viable for real-world LLM development.
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers have developed a method to unlock prompt infilling capabilities in masked diffusion language models by extending full-sequence masking during supervised fine-tuning, rather than the conventional response-only masking. This breakthrough enables models to automatically generate effective prompts that match or exceed manually designed templates, suggesting training practices rather than architectural limitations were the primary constraint.
AI · Neutral · arXiv – CS AI · Apr 6 · 7/10
🧠Researchers published a comprehensive technical survey on Large Language Model augmentation strategies, examining methods from in-context learning to advanced Retrieval-Augmented Generation techniques. The study provides a unified framework for understanding how structured context at inference time can overcome LLMs' limitations of static knowledge and finite context windows.
AI · Bearish · arXiv – CS AI · Mar 27 · 7/10
🧠Research reveals that LLM system prompt configuration creates massive security vulnerabilities, with the same model's phishing detection rates ranging from 1% to 97% based solely on prompt design. The study, PhishNChips, demonstrates that more specific prompts can paradoxically weaken AI security by replacing robust multi-signal reasoning with exploitable single-signal dependencies.
AI · Bearish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers introduce Brittlebench, a new evaluation framework that reveals frontier AI models experience up to 12% performance degradation when faced with minor prompt variations like typos or rephrasing. The study shows that semantics-preserving input perturbations can account for up to half of a model's performance variance, highlighting significant robustness issues in current language models.
AI · Bearish · arXiv – CS AI · Mar 16 · 7/10
🧠Researchers introduced OffTopicEval, a benchmark revealing that all major LLMs suffer from poor operational safety, with even top performers like Qwen-3 and Mistral achieving only 77-80% accuracy in staying on-topic for specific use cases. The study proposes prompt-based steering methods that can improve performance by up to 41%, highlighting critical safety gaps in current AI deployment.
🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 5 · 6/10
🧠Researchers developed PhyPrompt, a reinforcement learning framework that automatically refines text prompts to generate physically realistic videos from AI models. The system uses a two-stage approach with curriculum learning to improve both physical accuracy and semantic fidelity; with only 7B parameters, it outperforms larger models like GPT-4o.
🧠 GPT-4