#prompt-engineering News & Analysis

185 articles tagged with #prompt-engineering. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

Harnessing non-adversarial robustness in large language models

Researchers propose a debiasing fine-tuning method to improve large language model robustness against semantically neutral prompt variations without expensive full retraining. The approach identifies perturbation-induced bias in neural network outputs and presents theoretical and experimental evidence that targeted debiasing can enhance model resilience to prompt alterations.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Anchorless Diversification for Parallel LLM Ideation

Researchers present methods for improving how large language models generate diverse pools of creative ideas during parallel inference without relying on seed examples. Their findings show that semantic direction stratification—organizing generations across different semantic directions with a single planning call—outperforms anchor-dependent baselines while maintaining quality and computational efficiency.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Temporal Stability and Few-Shot Prompting in Math Task Assessment

A longitudinal study examined how AI models (Gemini and Coteach) perform on mathematics task classification using the Task Analysis Guide, testing stability across model versions and responsiveness to few-shot prompting. Results showed newer model versions produced mixed effects, but few-shot prompting consistently improved both models' accuracy, suggesting prompt engineering is more reliable than passive model updates for specialized educational tasks.

🧠 Gemini
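
For readers unfamiliar with few-shot prompting, here is a minimal sketch of how labeled examples might be placed ahead of the task to classify. The function name, the prompt layout, and the example labels are illustrative assumptions, not the prompts used in the study.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, labeled examples, and the unlabeled query.

    `examples` is a list of (task_text, label) pairs. The "Task:" and
    "Level:" field names are hypothetical, chosen only for illustration.
    """
    parts = [instruction.strip(), ""]
    for task_text, label in examples:
        parts.append(f"Task: {task_text}")
        parts.append(f"Level: {label}")
        parts.append("")  # blank line between examples
    # The query comes last with an empty label for the model to complete.
    parts.append(f"Task: {query}")
    parts.append("Level:")
    return "\n".join(parts)


if __name__ == "__main__":
    # Made-up examples and labels, for demonstration only.
    demo = build_few_shot_prompt(
        "Classify the cognitive demand of each mathematics task.",
        [
            ("Recall the formula for the area of a rectangle.", "lower-level demand"),
            ("Explain why the formula for the area of a triangle works.", "higher-level demand"),
        ],
        "Compute 12 x 7 using the standard algorithm.",
    )
    print(demo)
```

A zero-shot prompt is the same call with an empty `examples` list, which makes the two conditions easy to compare on identical queries.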

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization

Researchers propose a novel method for optimizing multi-agent LLM systems by decomposing credit assignment into temporal and structural components, enabling more efficient prompt optimization through targeted refinement rather than global updates. The approach uses state-space bottleneck analysis and role-based policy isolation to identify and fix weak components in collaborative AI systems, reducing computational queries while improving reasoning performance across benchmarks.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

Researchers propose Canonical-Context On-Policy Distillation (CCOPD), a training method that improves large language models' ability to solve problems when information is revealed incrementally across multiple conversation turns rather than all at once. By using a frozen teacher model with complete context to guide a student model receiving fragmented information, CCOPD achieves a 32% relative performance improvement on multi-turn tasks while maintaining single-prompt performance.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Steering Language Models Before They Speak: Logit-Level Interventions

Researchers introduce SWAI, a training-free method for controlling language model outputs by manipulating logit scores using corpus-derived statistics. The technique enables real-time steering of model behavior—such as adjusting readability, politeness, and toxicity—without modifying model weights or accessing internal layers, outperforming existing prompt-based and logit-level baselines.
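
As a rough illustration of what a logit-level intervention means in general, the sketch below adds a per-token bias to the next-token logits before the softmax. This is a generic logit-bias example, not the SWAI method; the bias values and the `strength` parameter are assumptions made up for the demonstration.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer_logits(logits, bias, strength=1.0):
    """Add a scaled per-token bias to the logits; model weights are untouched."""
    if len(logits) != len(bias):
        raise ValueError("logits and bias must have the same length")
    return [l + strength * b for l, b in zip(logits, bias)]


if __name__ == "__main__":
    # Toy three-token vocabulary; in practice the bias might come from
    # corpus statistics, which is assumed rather than implemented here.
    logits = [2.0, 1.0, 0.0]
    bias = [0.0, 0.0, 3.0]
    print(softmax(logits))
    print(softmax(steer_logits(logits, bias)))
```

Because the adjustment happens only on the output scores, it can be applied at each decoding step without retraining or access to internal layers.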

AI · Bullish · arXiv – CS AI · May 28 · 6/10

Hierarchical Prompt-Domain Control and Learning for Resource-Constrained Agentic Language Models

Researchers propose a hierarchical framework for deploying compact language models in resource-constrained agentic systems, combining knowledge distillation with oracle-supervised fine-tuning to maintain protocol compliance and semantic performance. The approach addresses core deployment challenges including context length limitations, memory constraints, and cost efficiency by separating schema learning from semantic adaptation.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

When prompt perturbations break your A/B test: A valid statistical test for generative surveying

Researchers demonstrate that standard statistical hypothesis tests fail when applied to generative surveying, where LLM-based personas provide market research feedback. The study proposes a valid permutation test that accounts for prompt sensitivity and provides guidance on optimal resource allocation for this emerging research methodology.
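
For context, a textbook two-sample permutation test can be sketched as follows. This is the generic procedure, not the paper's test: it assumes each input value is one exchangeable unit (for example, one aggregated score per prompt variant), and the function name and defaults are hypothetical.

```python
import random
from statistics import mean


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns a p-value estimated from random relabelings of the pooled data.
    Assumes the values in `a` and `b` are exchangeable under the null.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    variant_a = [0.0] * 8  # made-up scores for demonstration
    variant_b = [1.0] * 8
    print(permutation_test(variant_a, variant_b, n_permutations=2000))
```

The choice of what counts as one exchangeable unit matters: if responses generated from the same prompt are correlated, shuffling individual responses would overstate the evidence, which is the kind of failure the summary describes.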

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Whose Name Comes Up? III: Persona Prompting Effects in LLM-Based Scholar Recommendation

Researchers benchmarked 43 large language models used for academic scholar recommendations, revealing that prompt design significantly affects recommendation quality and diversity. The study found that model choice, persona prompting (language, location, role), and context variables independently shape which scholars are recommended, with geographic location prompts producing the most variation in factuality and representativeness across disciplines.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Token Optimization Strategies for LLM-Based Oracle-to-PostgreSQL Migration

Researchers present twelve token optimization strategies for using LLMs to migrate Oracle databases to PostgreSQL, addressing cost and quality degradation challenges. Adaptive routing emerges as the optimal approach, reducing token consumption by 8.72% while maintaining 88.40% semantic match accuracy, demonstrating that token optimization requires balancing multiple objectives rather than simple prompt shortening.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation

Researchers investigated why chain-of-thought prompting improves language model accuracy by analyzing what happens at inference time rather than generation time. They discovered that the improvement comes primarily from lexical activation and short-range token co-occurrence (2-3 adjacent tokens) rather than from logical sentence-level reasoning, challenging assumptions about how rationales actually drive model performance.

AI · Bullish · arXiv – CS AI · May 27 · 6/10

Augment Engineering: A Methodology for Multi-Tool AI Orchestration Across Professional Domains

Researchers introduce Augment Engineering, a methodology for orchestrating multiple AI tools across professional domains by applying portable meta-skills like prompt and context engineering. A five-month case study demonstrates that a single practitioner can produce work traditionally requiring domain specialists across seven domains, with statistical evidence supporting increased efficiency and production acceleration.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

When Correct Demonstrations Hurt: Rethinking the Role of Exemplars in In-Context Learning

Researchers reveal that correct demonstrations in in-context learning don't guarantee improved model performance—some accurate examples actually degrade accuracy. The study introduces task-preserving perturbations to show that exemplar utility depends on how demonstrations influence contextual inference, not merely on correctness, challenging conventional assumptions about how AI models learn from examples.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Strategies for Guiding LLMs to Use Software Design Patterns: A Case of Singleton

Researchers evaluated 13 large language models' ability to generate code following the Singleton design pattern across four prompting strategies, finding that iterative binary feedback and instruction-based guidance most effectively guide LLMs to incorporate architectural best practices while maintaining code functionality.

🧠 Llama

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation

Researchers adapted Microsoft's QuantumKatas quantum computing curriculum from Q# to Qiskit and created a 350-task benchmark with LLM evaluation infrastructure. Testing 16 language models revealed significant capability gaps, with frontier models achieving 83.1% pass rates versus 32.3% for weaker models, while highlighting that LLMs excel at implementing known algorithms but struggle with problem encoding.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

How Chain-of-Thought Works? Tracing Information Flow from Decoding, Projection, and Activation

Researchers have developed a mechanistic interpretability framework that reverses information flow through Chain-of-Thought prompting to understand how AI models reason. The study reveals CoT functions as a decoding space pruner that uses answer templates to guide outputs, with task-dependent neuron modulation that reduces activation in open-domain tasks but increases it in closed-domain scenarios.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Adaptive Multi-prompt Contrastive Network for Few-shot Out-of-distribution Detection

Researchers propose Adaptive Multi-prompt Contrastive Network (AMCN), a novel approach for few-shot out-of-distribution detection that requires only minimal labeled samples. The method leverages CLIP's vision-language capabilities with learnable textual prompts to distinguish between in-distribution and outlier samples, advancing practical AI safety applications.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

Researchers introduced PhyWorldBench, a comprehensive benchmark that evaluates text-to-video generation models on their ability to simulate real-world physics accurately. Testing 12 state-of-the-art models across 1,050 prompts, the study reveals significant gaps in how current AI video generators handle physical phenomena, from basic object motion to complex interactions, while also introducing novel evaluation methods using multimodal language models.