#prompt-engineering News & Analysis

113 articles tagged with #prompt-engineering. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 3

Beyond Prompt Degradation: Prototype-guided Dual-pool Prompting for Incremental Object Detection

Researchers propose PDP, a new framework for Incremental Object Detection that addresses prompt degradation issues in AI models. The method achieves significant improvements of 9.2% AP on MS-COCO and 3.3% AP on PASCAL VOC benchmarks through dual-pool prompt decoupling and prototype-guided pseudo-label generation.

AI · Neutral · arXiv – CS AI · Mar 4 · 7/10 · 4

Toward a Dynamic Stackelberg Game-Theoretic Framework for Agentic AI Defense Against LLM Jailbreaking

Researchers propose a game-theoretic framework that uses Stackelberg equilibrium and Rapidly-exploring Random Trees to model interactions between attackers trying to jailbreak LLMs and defensive AI systems. The framework provides a mathematical foundation for understanding and improving AI safety guardrails against prompt-based attacks.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4

Emergent Coordination in Multi-Agent Language Models

Researchers developed an information-theoretic framework to measure when multi-agent AI systems exhibit coordinated behavior beyond individual agents. The study found that specific prompt designs can transform collections of AI agents into coordinated collectives that mirror human group intelligence principles.

AI · Bullish · arXiv – CS AI · 3d ago · 6/10

Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

Researchers propose Canonical-Context On-Policy Distillation (CCOPD), a training method that improves large language models' ability to solve problems when information is revealed incrementally across multiple conversation turns rather than all at once. By using a frozen teacher model with complete context to guide a student model receiving fragmented information, CCOPD achieves a 32% relative performance improvement on multi-turn tasks while maintaining single-prompt performance.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Steering Language Models Before They Speak: Logit-Level Interventions

Researchers introduce SWAI, a training-free method for controlling language model outputs by manipulating logit scores using corpus-derived statistics. The technique enables real-time steering of model behavior—such as adjusting readability, politeness, and toxicity—without modifying model weights or accessing internal layers, outperforming existing prompt-based and logit-level baselines.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Anchorless Diversification for Parallel LLM Ideation

Researchers present methods for improving how large language models generate diverse pools of creative ideas during parallel inference without relying on seed examples. Their findings show that semantic direction stratification—organizing generations across different semantic directions with a single planning call—outperforms anchor-dependent baselines while maintaining quality and computational efficiency.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs

Researchers conducted a controlled study of persona prompting in large language models across 1,140 questions and 38 expert roles, finding that while aggregate metrics show minimal improvement, persona prompting consistently trades clarity for expertise depth. The technique's effectiveness varies significantly by domain and question type, with benefits appearing mainly in advisory contexts like medicine and psychology, while baseline prompting outperforms in domains requiring concise explanations.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Temporal Stability and Few-Shot Prompting in Math Task Assessment

A longitudinal study examined how AI models (Gemini and Coteach) perform on mathematics task classification using the Task Analysis Guide, testing stability across model versions and responsiveness to few-shot prompting. Results showed newer model versions produced mixed effects, but few-shot prompting consistently improved both models' accuracy, suggesting prompt engineering is more reliable than passive model updates for specialized educational tasks.

🧠 Gemini

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization

Researchers propose a novel method for optimizing multi-agent LLM systems by decomposing credit assignment into temporal and structural components, enabling more efficient prompt optimization through targeted refinement rather than global updates. The approach uses state-space bottleneck analysis and role-based policy isolation to identify and fix weak components in collaborative AI systems, reducing computational queries while improving reasoning performance across benchmarks.

AI · Bullish · arXiv – CS AI · 3d ago · 6/10

Harnessing non-adversarial robustness in large language models

Researchers propose a debiasing fine-tuning method to improve Large Language Model robustness against semantically-neutral prompt variations without expensive full retraining. The approach identifies perturbation-induced bias in neural network outputs and demonstrates theoretical and experimental evidence that targeted debiasing can enhance model resilience to prompt alterations.

AI · Bullish · arXiv – CS AI · 4d ago · 6/10

Hierarchical Prompt-Domain Control and Learning for Resource-Constrained Agentic Language Models

Researchers propose a hierarchical framework for deploying compact language models in resource-constrained agentic systems, combining knowledge distillation with oracle-supervised fine-tuning to maintain protocol compliance and semantic performance. The approach addresses core deployment challenges including context length limitations, memory constraints, and cost efficiency by separating schema learning from semantic adaptation.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

When prompt perturbations break your A/B test: A valid statistical test for generative surveying

Researchers demonstrate that standard statistical hypothesis tests fail when applied to generative surveying, where LLM-based personas provide market research feedback. The study proposes a valid permutation test that accounts for prompt sensitivity and provides guidance on optimal resource allocation for this emerging research methodology.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Whose Name Comes Up? III: Persona Prompting Effects in LLM-Based Scholar Recommendation

Researchers benchmarked 43 large language models used for academic scholar recommendations, revealing that prompt design significantly affects recommendation quality and diversity. The study found that model choice, persona prompting (language, location, role), and context variables independently shape which scholars are recommended, with geographic location prompts producing the most variation in factuality and representativeness across disciplines.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Token Optimization Strategies for LLM-Based Oracle-to-PostgreSQL Migration

Researchers present twelve token optimization strategies for using LLMs to migrate Oracle databases to PostgreSQL, addressing cost and quality degradation challenges. Adaptive routing emerges as the optimal approach, reducing token consumption by 8.72% while maintaining 88.40% semantic match accuracy, demonstrating that token optimization requires balancing multiple objectives rather than simple prompt shortening.

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation

Researchers investigated why chain-of-thought prompting improves language model accuracy by analyzing what happens at inference time rather than generation time. They discovered that the improvement comes primarily from lexical activation and short-range token co-occurrence (2-3 adjacent tokens) rather than from logical sentence-level reasoning, challenging assumptions about how rationales actually drive model performance.

AI · Bullish · arXiv – CS AI · 5d ago · 6/10

Augment Engineering: A Methodology for Multi-Tool AI Orchestration Across Professional Domains

Researchers introduce Augment Engineering, a methodology for orchestrating multiple AI tools across professional domains by applying portable meta-skills like prompt and context engineering. A five-month case study demonstrates that a single practitioner can produce work traditionally requiring domain specialists across seven domains, with statistical evidence supporting increased efficiency and production acceleration.

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

When Correct Demonstrations Hurt: Rethinking the Role of Exemplars in In-Context Learning

Researchers reveal that correct demonstrations in in-context learning don't guarantee improved model performance—some accurate examples actually degrade accuracy. The study introduces task-preserving perturbations to show that exemplar utility depends on how demonstrations influence contextual inference, not merely on correctness, challenging conventional assumptions about how AI models learn from examples.

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

Strategies for Guiding LLMs to Use Software Design Patterns: A Case of Singleton

Researchers evaluated 13 large language models' ability to generate code following the Singleton design pattern across four prompting strategies, finding that iterative binary feedback and instruction-based guidance most effectively guide LLMs to incorporate architectural best practices while maintaining code functionality.

🧠 Llama

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation

Researchers adapted Microsoft's QuantumKatas quantum computing curriculum from Q# to Qiskit and created a 350-task benchmark with LLM evaluation infrastructure. Testing 16 language models revealed significant capability gaps, with frontier models achieving 83.1% pass rates versus 32.3% for weaker models, while highlighting that LLMs excel at implementing known algorithms but struggle with problem encoding.

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

How Chain-of-Thought Works? Tracing Information Flow from Decoding, Projection, and Activation

Researchers have developed a mechanistic interpretability framework that reverses information flow through Chain-of-Thought prompting to understand how AI models reason. The study reveals CoT functions as a decoding space pruner that uses answer templates to guide outputs, with task-dependent neuron modulation that reduces activation in open-domain tasks but increases it in closed-domain scenarios.

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

Adaptive Multi-prompt Contrastive Network for Few-shot Out-of-distribution Detection

Researchers propose Adaptive Multi-prompt Contrastive Network (AMCN), a novel approach for few-shot out-of-distribution detection that requires only minimal labeled samples. The method leverages CLIP's vision-language capabilities with learnable textual prompts to distinguish between in-distribution and outlier samples, advancing practical AI safety applications.

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

Researchers introduced PhyWorldBench, a comprehensive benchmark that evaluates text-to-video generation models on their ability to simulate real-world physics accurately. Testing 12 state-of-the-art models across 1,050 prompts, the study reveals significant gaps in how current AI video generators handle physical phenomena, from basic object motion to complex interactions, while also introducing novel evaluation methods using multimodal language models.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Semantic Voting: Execution-Grounded Consensus for LLM Code Generation

Researchers demonstrate that execution-based voting methods for LLM code generation significantly outperform text-based majority voting by 18-52 percentage points. The study reveals that input quality—particularly sketch-based generation—matters far more than the aggregation algorithm itself, challenging assumptions about how to select optimal code outputs.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

Parameter-Efficient Neuroevolution for Diverse LLM Generation: Quality-Diversity Optimization via Prompt Embedding Evolution

Researchers introduce QD-LLM, a framework that evolves lightweight prompt embeddings (~32K parameters) to steer frozen large language models toward diverse outputs without fine-tuning. The approach outperforms existing quality-diversity optimization methods by 46.4% in coverage and demonstrates practical applications in test generation and training data improvement.

🧠 Llama

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Spatial Priming Outperforms Semantic Prompting: A Grid-Based Approach to Improving LLM Accuracy on Chart Data Extraction

Researchers demonstrate that overlaying coordinate grids on chart images significantly improves multimodal LLM accuracy for data extraction tasks, reducing error rates from 25.5% to 19.5%. This spatial priming approach outperforms semantic methods like Chain-of-Thought prompting, suggesting that explicit spatial context is more effective than high-level semantic guidance for current-generation vision-language models.