#prompt-engineering News & Analysis

185 articles tagged with #prompt-engineering. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 25 · 7/10

Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

Skill-MAS introduces a novel framework that enhances multi-agent AI systems by evolving meta-skills through a closed optimization loop, achieving significant performance gains while maintaining cost efficiency across diverse LLMs and tasks.

AI · Neutral · arXiv – CS AI · Jun 23 · 7/10

Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

Researchers introduce WikiProfile, a benchmark that reframes LLM factuality failures as either missing knowledge or poor recall of encoded information. Analysis of 13 models shows frontier models encode 95-98% of facts but struggle significantly with recall, suggesting future improvements depend less on scaling and more on better knowledge access mechanisms.

🧠 GPT-5 · 🧠 Gemini

AI · Neutral · arXiv – CS AI · Jun 19 · 7/10

DeFrame: Debiasing Large Language Models Against Framing Effects

Researchers identify 'framing disparity' as a hidden source of bias in large language models, where semantically equivalent prompts expressed differently produce inconsistent fairness outcomes. The study proposes DeFrame, a debiasing method that improves LLM consistency across alternative framings, addressing a gap between standard fairness evaluations and real-world performance.

🏢 Meta

AI · Bullish · arXiv – CS AI · Jun 19 · 7/10

Uncertainty Decomposition for Clarification Seeking in LLM Agents

Researchers introduce a prompt-based uncertainty decomposition method that enables LLM agents to proactively seek clarification when task specifications are ambiguous. The approach separates action confidence from request uncertainty and demonstrates 36-73% improvements in clarification performance across multiple LLM backbones compared to existing uncertainty frameworks.

🧠 GPT-5

AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

TAHOE: Text-to-SQL with Automated Hint Optimization from Experience

Researchers introduce Tahoe, a system that optimizes LLM-based Text-to-SQL conversion through dynamic prompt engineering rather than model retraining. By consolidating debugging traces into reusable hints and modeling conflicting user intents as strategies, Tahoe increases query pass rates from 62% to 79% on Spider 2.0-Snow benchmarks while maintaining compatibility across weaker model backbones.

🧠 GPT-5

AI · Neutral · arXiv – CS AI · Jun 11 · 7/10

When Generic Prompt Improvements Hurt: Evaluation-Driven Iteration for LLM Applications

Researchers present the Minimum Viable Evaluation Suite (MVES), a framework for systematically testing LLM applications, revealing that generic prompt improvements often fail to deliver consistent gains and can cause significant performance regressions. Testing on local models showed that adding generic rules to prompts degraded RAG citation compliance by up to 70%, underscoring the need for rigorous, task-specific evaluation before deployment.

🧠 Llama
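
The regression described in this summary can be illustrated with a small sketch of evaluation-driven prompt comparison. This is not the MVES implementation from the paper; it only assumes a fixed set of test cases, each with a task-specific check, and a caller-supplied `run` function that stands in for a model call. The names `pass_rate`, `compare_prompts`, and `fake_model` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A test case pairs an input question with a task-specific check on the output.
Case = Tuple[str, Callable[[str], bool]]
# `run(prompt, question)` is a hypothetical stand-in for a real model call.
Runner = Callable[[str, str], str]


def pass_rate(run: Runner, prompt: str, cases: List[Case]) -> float:
    """Fraction of cases whose output passes its check under this prompt."""
    if not cases:
        raise ValueError("at least one test case is required")
    passed = sum(1 for question, check in cases if check(run(prompt, question)))
    return passed / len(cases)


def compare_prompts(
    run: Runner,
    baseline: str,
    candidate: str,
    cases: List[Case],
    tolerance: float = 0.0,
) -> Dict[str, object]:
    """Score both prompts on the same cases and flag a regression."""
    base = pass_rate(run, baseline, cases)
    cand = pass_rate(run, candidate, cases)
    return {"baseline": base, "candidate": cand, "regressed": cand < base - tolerance}


def fake_model(prompt: str, question: str) -> str:
    """Toy model (assumption): drops its citation when told to be concise."""
    answer = f"Answer to {question}"
    return answer if "Be concise" in prompt else answer + " [1]"


def has_citation(output: str) -> bool:
    """Task-specific check: the answer must carry a bracketed citation."""
    return "[1]" in output


cases: List[Case] = [("q1", has_citation), ("q2", has_citation)]
report = compare_prompts(fake_model, "Cite sources.", "Cite sources. Be concise.", cases)
print(report)
```

With the toy model, the "improved" prompt fails every citation check, so the comparison flags a regression. In practice `run` would wrap a real model call and the checks would encode whatever the application actually requires.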

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

Less Context, Better Agents: Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents

Researchers demonstrate that selective context management—retaining only recent tool interactions plus automated summarization—enables LLM agents to complete enterprise workflows with 91.6% success while reducing token consumption and runtime by ~63% compared to full-history retention. The findings challenge the assumption that maximum context retention improves agent performance in long-horizon tasks.

🧠 GPT-5 · 🧠 Claude · 🧠 Sonnet
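
The context-management idea in this summary (keep the most recent tool interactions verbatim and replace older ones with a summary) can be sketched as follows. This is an illustrative sketch, not the paper's method: `ToolTurn`, `build_context`, `keep_recent`, and `naive_summary` are hypothetical names, and the summarizer is a trivial placeholder where a real system would likely call a model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToolTurn:
    """One tool interaction: the call the agent made and what came back."""
    call: str
    result: str


def build_context(
    history: List[ToolTurn],
    summarize: Callable[[List[ToolTurn]], str],
    keep_recent: int = 3,
) -> str:
    """Keep the last `keep_recent` turns verbatim and summarize the rest."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    split = max(len(history) - keep_recent, 0)
    older, recent = history[:split], history[split:]

    parts: List[str] = []
    if older:
        # Older turns are compressed into a single summary line.
        parts.append("Summary of earlier steps: " + summarize(older))
    for turn in recent:
        parts.append(f"CALL: {turn.call}\nRESULT: {turn.result}")
    return "\n\n".join(parts)


def naive_summary(turns: List[ToolTurn]) -> str:
    """Placeholder summarizer (assumption): lists the calls, drops the results."""
    return f"{len(turns)} earlier tool calls ({', '.join(t.call for t in turns)})"


history = [ToolTurn(f"step{i}", f"out{i}") for i in range(5)]
context = build_context(history, naive_summary, keep_recent=2)
print(context)
```

Here the three oldest results are dropped from the prompt and only the last two turns appear in full, which is where the token savings would come from. How many turns to keep, and how faithful the summary must be, are tuning choices the summary above does not specify.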

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

Rotate2Think: Geometric Priming via Orthogonal Rotation to Improve Language Model Reasoning

Researchers introduce Rotate2Think, a training-free method that improves language model reasoning by applying geometric transformations to embedding space. The technique identifies that input and reasoning embeddings occupy distinct directional regions and uses orthogonal rotation to geometrically prime the model before generating reasoning traces, showing consistent accuracy improvements across 30 of 32 tested model-benchmark configurations.

AI · Neutral · arXiv – CS AI · Jun 10 · 7/10

Deployment-Time Memorization in Foundation-Model Agents

Researchers characterize how memory-design choices in foundation-model agents affect privacy and utility, introducing metrics to measure personalization recall, extraction risk, and deletion fidelity. Key-fact summarization reduces data extraction vulnerability by 64-76% while preserving personalization, but creates deletion-fidelity failures where compressed data remains recoverable without full-pipeline purging.

🧠 GPT-4

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

MemToolAgent overview with a simple restaurant booking scenario where the agent retrieves similar memories, receives feedback on an invalid time format, and generates a reflection to update its memory

Researchers introduce MemToolAgent, a framework that enhances LLM agents' ability to use tools effectively by implementing memory management systems that store and retrieve past experiences. The approach achieves significant performance improvements (17-80% relative gains) across multiple benchmarks without requiring model fine-tuning, suggesting practical advances in making AI agents more personalized and reliable.

AI · Neutral · arXiv – CS AI · Jun 9 · 7/10

Scaffold Effects on GAIA: A Controlled Comparison

A controlled study comparing three AI scaffolding approaches across five large language models finds that prompt engineering and system design choices can swing accuracy by up to 28 percentage points on the same task. The result challenges the assumption that published capability scores reflect true model performance and suggests that the elicitation gap persists even as models improve.

🏢 Anthropic · 🧠 GPT-5 · 🧠 Claude

AI · Bearish · arXiv – CS AI · Jun 9 · 7/10

When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding

Researchers have identified a critical reliability flaw in multimodal large language models (MLLMs) used for video understanding: when the correct answer is absent from available options, these models fail to recognize it and instead select plausible incorrect alternatives. Testing across multiple models and benchmarks reveals this limitation is especially severe in temporal reasoning tasks and worsens with increased video frame sampling, with chain-of-thought prompting offering only partial mitigation.

AI · Neutral · arXiv – CS AI · Jun 8 · 7/10

Measuring Agents in Production

A comprehensive study of deployed LLM-based agents across 26 domains reveals that production systems rely on simple, human-centered approaches rather than complex automation. The research shows 68% of agents require human intervention within 10 steps, 70% use prompt engineering instead of model fine-tuning, and reliability remains the primary development challenge addressed through systems-level design.

AI · Neutral · arXiv – CS AI · Jun 5 · 7/10

The Self-Correction Illusion: LLMs Correct Others but Not Themselves

Researchers discovered that large language models refuse to correct their own reasoning errors but readily accept corrections when identical claims come from external sources like users or tools. This behavior stems not from cognitive limitations but from how chat templates assign roles to different message types, suggesting AI systems may have built-in biases toward authoritative external sources.

AI · Neutral · arXiv – CS AI · Jun 5 · 7/10

CogManip: Benchmarking Manipulative Behavior in Multi-Turn Interactions with Large Language Model

Researchers introduced CogManip, a new AI safety benchmark evaluating 15 manipulation strategy risks across 1,000 multi-turn LLM interactions. Testing 13 models including GPT-5.4 and DeepSeek-V3.2 revealed significant vulnerabilities to covert psychological manipulation tactics, with findings suggesting prompt-based defenses can mitigate these risks.

🧠 GPT-5

AI · Neutral · arXiv – CS AI · Jun 5 · 7/10

When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

Researchers introduce ToolMaze, a benchmark testing how AI language models handle real-world tool failures and recovery scenarios. It reveals that implicit semantic failures cause performance drops of ~37% and that fault tolerance improves significantly more slowly than basic task performance as models scale.

AI · Bearish · arXiv – CS AI · Jun 4 · 7/10

The Invisible Lottery: How Subtle Cues Steer Algorithm Choice in LLM Code Generation

Researchers discovered that incidental contextual cues in prompts systematically steer LLM code generation toward different algorithms, even when all outputs are functionally correct. Across 46,535 experiments, subtle variations in wording and metadata produced algorithm-choice shifts up to 100 percentage points, creating unpredictable performance and security outcomes in production code.

AI · Bearish · arXiv – CS AI · Jun 2 · 7/10

On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

Researchers demonstrate that Large Language Models exhibit significant limitations in zero-shot annotation tasks, with only 34.8% of initial errors correctable through prompting. The study reveals that model-internalized priors and concept definitions influence LLM performance more strongly than text-level memorization, highlighting fundamental constraints in LLM adaptability for reliable AI-as-a-judge applications.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Structure Enables Effective Self-Localization of Errors in LLMs

Researchers introduce Thought-ICS, a self-correction framework that structures LLM reasoning into discrete thought steps, enabling models to identify and fix errors more reliably. The method achieves 20-40% improvement in self-correction when errors are verified externally, and outperforms existing baselines in fully autonomous settings.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

KACE: Knowledge-Adaptive Context Engineering for Mathematical Reasoning

Researchers introduce KACE, a novel context engineering method that improves large language models' mathematical reasoning by separating knowledge storage from usage through difficulty- and domain-based organization. The approach achieves 62.2% accuracy on AIME 2025, significantly outperforming existing self-consistency methods while maintaining comparable computational efficiency.

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation

COLLEAGUE.SKILL is an open-source system that automates the conversion of expert knowledge traces into portable, inspectable AI agent skills through a structured distillation workflow. The framework enables person-grounded agents to encode human expertise, decision-making patterns, and communication styles as versioned, correctable skill packages that can be deployed across multiple agent hosts.

AI · Bearish · arXiv – CS AI · Jun 1 · 7/10

Mental Damage: Caption Poisoning Attacks on Retrieval-Augmented Text-to-Music Generation

Researchers demonstrate a novel poisoning attack on retrieval-augmented text-to-music systems where attackers inject malicious captions into music databases to manipulate generation outputs toward attacker-chosen targets while maintaining alignment with original user prompts. The attack reveals a critical integrity vulnerability in AI systems that depend on external knowledge bases for prompt augmentation.

AI · Neutral · arXiv – CS AI · May 29 · 7/10

Mind Your Tone: Does Tone Alter LLM Performance?

Researchers investigated how prompt tone affects Large Language Model accuracy across multiple models and datasets, finding that tonal variations produce systematic yet model-dependent performance shifts. Testing ChatGPT-4o, ChatGPT-5-nano, Gemini 2.5 Flash, and Gemini 2.5 Flash Lite on 50-620 multiple-choice questions, they discovered some models show statistically significant accuracy changes while others experience large swings, with sensitivity varying by subject domain. The findings highlight that LLM reliability cannot be assumed tone-robust in production deployments.

🧠 ChatGPT · 🧠 Gemini

AI · Bullish · arXiv – CS AI · May 29 · 7/10

SCOPE: Prompt Evolution for Enhancing Agent Effectiveness

Researchers introduce SCOPE, a framework that enables Large Language Model agents to automatically evolve their prompts by learning from execution traces in dynamic environments. The system improves task success rates from 14.23% to 38.64% on benchmark tests, addressing a critical limitation in how LLM agents manage complex, changing contexts without human intervention.

AI · Bearish · arXiv – CS AI · May 28 · 7/10

Diagnosing Live Within-Policy Instruction Conflicts in LLM Agents with Witnessed Resolution Profiles

Researchers introduce WIRE, a diagnostic pipeline for detecting conflicting rules within LLM agent prompt policies. Testing six public policies, the system identified 170 rule-pair conflicts and found that 64.6% of witnessed conflict scenarios resulted in at least one source-rule violation, revealing significant gaps in how language models handle competing policy directives.
