#prompt-engineering News & Analysis

185 articles tagged with #prompt-engineering. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Jun 25 · 6/10

Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?

A research study challenges the widespread practice of using context files (like AGENTS.md) to enhance coding agent performance, finding that these files provide no measurable improvement in task completion rates while increasing inference costs by over 20%. The findings suggest that while context files help agents follow instructions, repository overviews—commonly recommended by model providers—offer minimal practical value.

AI · Neutral · arXiv – CS AI · Jun 25 · 5/10

SFL-MTSC: Leveraging Semantic Frame-Level Multi-Task Self-Consistency for Robust Multi-Intent Spoken Language Understanding

Researchers propose SFL-MTSC, a framework that improves spoken language understanding in large language models by addressing inconsistent intent-slot structures in multi-intent scenarios. Using semantic frame-level aggregation instead of simple majority voting, the method shows improved slot F1 and accuracy on the MAC-SLU benchmark while maintaining stable intent recognition.

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

How Should Agents Read Demonstrations? Hierarchical Structure Beats Flat Action Logs

A research paper demonstrates that organizing demonstration data hierarchically into labeled subgoals significantly improves LLM agent performance on ambiguous tasks, achieving 90.7% pass rates versus 76.7% for flat action logs. This finding provides concrete design guidance for Programming by Demonstration systems and broader procedural knowledge transfer to AI agents.
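To make the contrast concrete, the sketch below renders one demonstration both as a flat action log and as labeled subgoals. It is an illustration only: the `Subgoal` class, both render functions, and the sample actions are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Subgoal:
    """One labeled step of a demonstration (hypothetical structure)."""
    label: str
    actions: list[str] = field(default_factory=list)


def render_flat(subgoals):
    """Flat action log: every action on its own numbered line, no grouping."""
    actions = [a for sg in subgoals for a in sg.actions]
    return "\n".join(f"{i}. {a}" for i, a in enumerate(actions, start=1))


def render_hierarchical(subgoals):
    """Hierarchical view: actions are nested under their subgoal label."""
    lines = []
    for n, sg in enumerate(subgoals, start=1):
        lines.append(f"Subgoal {n}: {sg.label}")
        lines.extend(f"  - {a}" for a in sg.actions)
    return "\n".join(lines)


# Made-up demonstration, used only to show the two layouts.
demo = [
    Subgoal("Open the export dialog", ["click File", "click Export"]),
    Subgoal("Choose the format", ["select CSV", "click Save"]),
]

flat_text = render_flat(demo)
hierarchical_text = render_hierarchical(demo)
```

Either string could be placed in an agent's prompt. Only the hierarchical one records why each group of actions was taken.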

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Beyond Templates: Revisiting Zero-Shot Remote Sensing through Meta-Prompting

Researchers analyze how vision-language models perform zero-shot remote sensing tasks across multiple datasets and find that textual design choices critically impact performance. The study reveals that semantically rich LLM-generated descriptions don't consistently outperform simpler template-based descriptions due to noise in text embeddings, but lightweight query embedding calibration effectively improves results.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

PeerCheck: Enhancing LLM-Generated Academic Reviews Towards Human-Level Quality

Researchers introduce PeerCheck, a framework that analyzes differences between LLM-generated and human-written academic reviews, finding that LLMs prioritize theoretical aspects while humans emphasize methodology. Techniques such as Chain-of-Thought prompting improve LLM review quality, though retrieval-augmented generation surprisingly produces inconsistent and sometimes degraded results.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

PRIME: Evaluating Prompt Resolution Under Incompatible Instructions in LLMs

Researchers introduce PRIME, a framework for evaluating how large language models handle conflicting instructions, revealing that conflict type significantly impacts model behavior regardless of scale. The study of five instruction-tuned LLMs exposes critical gaps in current benchmarking methods that assess instructions in isolation, demonstrating that real-world instruction-following capabilities cannot be accurately measured without testing competing directives.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Evaluating LLMs for Real-World Web Vulnerability Detection

Researchers benchmarked six large language models on their ability to detect real-world web vulnerabilities in WordPress plugins, finding that while all models can identify security issues, detection rates vary significantly (35-63%) and no model maintains consistent results across repeated tests. The findings reveal both the promise and critical limitations of LLM-based vulnerability detection for security practitioners.

🧠 GPT-5 · 🧠 Claude · 🧠 Opus

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

Denoising Iterative Self-Correction: Structured Verification Loops for Reliable LLM Reasoning

Researchers introduce Denoising Iterative Self-Correction (DISC), a test-time procedure that improves large language model reasoning by treating verification outputs as noisy signals to progressively correct errors across multiple passes. The method demonstrates superior performance over existing correction approaches, achieving 81.6% accuracy on BIG-Bench Mistake with 13x better improvement-to-degradation ratios than Chain-of-Verification.

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

Enabling Cloud-Level Accuracy in Edge AI through IoT Data Preprocessing

Researchers demonstrate that preprocessing raw IoT sensor data into structured textual formats significantly improves the accuracy of edge-deployed language models for environmental monitoring, narrowing the performance gap with cloud-based systems while maintaining low latency. Testing on indoor and outdoor air-quality datasets shows local model accuracy improving from 50.9% to 81.7% indoors and 63.7% to 89.3% outdoors through progressive prompt enrichment, achieving inference speeds near 0.22 seconds.
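As a rough illustration of turning raw sensor values into structured text before they reach a model, the sketch below formats one reading as a labeled line. The field names, the limit values, and the `describe_reading` helper are all hypothetical. They are not the paper's preprocessing pipeline or any real air-quality standard.

```python
def describe_reading(reading, limits):
    """Turn one raw sensor reading into a short, labeled text line.

    `limits` maps a field name to an illustrative upper bound. Fields
    without a bound are reported as plain name=value pairs.
    """
    parts = []
    for name, value in reading.items():
        limit = limits.get(name)
        if limit is None:
            parts.append(f"{name}={value}")
        else:
            status = "above" if value > limit else "within"
            parts.append(f"{name}={value} ({status} limit {limit})")
    return "; ".join(parts)


# Made-up reading and limits, for illustration only.
raw = {"co2_ppm": 1450, "temp_c": 22.5, "humidity_pct": 41}
limits = {"co2_ppm": 1000, "temp_c": 26}

line = describe_reading(raw, limits)
```

The resulting line can be prepended to a prompt, so the model sees units and threshold context instead of bare numbers.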

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Text2DSL: LLM-Based Code Generation for Domain-Specific Languages

Researchers introduce Text2DSL, a framework for automatically generating domain-specific language (DSL) code from natural language using large language models, validated on 4,204 Polkit security policy rules. The study demonstrates that providing structured context such as a BNF grammar and API specifications raises syntactic validity to 98.6-99.4% across different model scales without requiring fine-tuning.
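One way to picture "structured context" is a prompt that places the grammar and API notes ahead of the user's request. The sketch below is a minimal example under that assumption. The `build_dsl_prompt` function, the section headings, and the toy grammar are hypothetical and are not taken from Text2DSL or from Polkit.

```python
def build_dsl_prompt(request, grammar_bnf, api_spec):
    """Assemble a prompt that puts grammar and API context before the request."""
    sections = [
        ("Grammar (BNF)", grammar_bnf),
        ("API specification", api_spec),
        ("Request", request),
    ]
    body = "\n\n".join(f"### {title}\n{text.strip()}" for title, text in sections)
    return body + "\n\n### Output\nReturn only code that conforms to the grammar above."


# Toy grammar and API notes, invented for this example.
TOY_GRAMMAR = """
<rule>    ::= "allow" <subject> <action>
<subject> ::= "group" <name>
<action>  ::= "restart" | "stop"
"""
TOY_API = "name: an identifier made of lowercase letters"

prompt = build_dsl_prompt(
    "Allow the operators group to restart the service.", TOY_GRAMMAR, TOY_API
)
```

The string would then be sent to whichever model is in use. No fine-tuning is involved, since all task knowledge travels in the prompt.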

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

When Does Intrinsic Self-Correction Help? A Task-Sensitive Analysis

Researchers find that intrinsic self-correction in large language models works inconsistently across tasks, succeeding only when task structure supports specific revision mechanisms like constraint verification or complex reasoning review. The study challenges the assumption that self-correction is universally reliable and instead positions it as a task-dependent inference strategy.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Semantic Browsing: Controllable Diversity for Image Generation

Researchers introduce Semantic Browsing, a method that improves diversity in AI-generated images by controlling variation at the text level rather than through random pixel-level changes. Using Vision Language Models and structured prompting, the technique enables users to explore meaningful, interpretable variations of generated images organized along semantic axes.

AI · Bearish · arXiv – CS AI · Jun 23 · 6/10

Investigating Linguistic Steering: An Analysis of Adjectival Effects Across Large Language Model Architectures

Researchers developed a Shapley-value-based framework to quantify how adjectives steer Large Language Model outputs across architectures (GPT-4o-mini, Llama-3-70b, DeepSeek-R1, Phi-3, o3). The study reveals that steering effects are model-dependent, non-universal, and exhibit complex interaction patterns—larger models show unpredictable compositional behavior while smaller models respond more literally, challenging the viability of one-size-fits-all prompting strategies.

🧠 GPT-4

AI × Crypto · Neutral · Crypto Briefing · Jun 21 · 6/10

OpenAI shares 28 tips to enhance ChatGPT prompt engineering

OpenAI has released 28 prompt engineering tips designed to improve ChatGPT's performance and decision-making quality. While better prompting techniques can enhance AI utility, the guidance implicitly acknowledges risks of over-relying on AI outputs for critical financial and business decisions.

🏢 OpenAI · 🧠 ChatGPT

AI · Bullish · arXiv – CS AI · Jun 19 · 6/10

Learning to Prompt: Improving Student Engagement with Adaptive LLM-based High-School Tutoring

Researchers developed an adaptive large language model tutoring system that uses subject-aware prompting and machine learning to personalize education for high-school students. Testing on 656 conversations showed that the system improved instructional efficiency by reducing interactions by about 3 turns and increased exercise completion rates to 28.1% using stochastic strategy sampling, demonstrating effective sim-to-real transfer to live student interactions.

AI · Bullish · arXiv – CS AI · Jun 19 · 6/10

SoftSkill: Behavioral Compression for Contextual Adaptation

SoftSkill introduces a method to compress natural-language AI agent skills into compact continuous context objects that improve task performance without retraining frozen language models. By replacing lengthy Markdown skill files with 32-token soft prefixes, the approach demonstrates significant accuracy gains across multiple benchmarks while reducing computational overhead.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

Where to Place the Query? Unveiling and Mitigating Positional Bias in In-Context Learning for Diffusion LLMs via Decoding Dynamics

Researchers demonstrate that query placement significantly impacts performance in Diffusion Large Language Models (dLLMs) during in-context learning, contrary to conventional practices inherited from autoregressive models. The study reveals a spatial recency effect in attention mechanisms and proposes Auto-ICL, a training-free strategy that dynamically optimizes query positioning to approach oracle performance across diverse tasks.

AI · Bullish · arXiv – CS AI · Jun 19 · 6/10

FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines

FAPO (Fully Autonomous Prompt Optimization) is a new framework that automatically optimizes multi-step LLM pipelines by iteratively refining prompts and, when necessary, restructuring the pipeline architecture itself. The system demonstrates significant performance improvements across multiple benchmarks, achieving up to 33.8 percentage point gains over existing optimization methods.

🧠 GPT-5 · 🧠 Claude

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

Too long; didn't solve

A new study examining mathematical benchmarks used to evaluate large language models reveals that both prompt length and solution length correlate with higher model failure rates. The research, conducted on an adversarial dataset of expert-authored math problems, demonstrates that structural complexity is a significant factor in how difficult a problem is for models.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting

Researchers introduce Visual Attentive Prompting (VAP), a training-free method that enables Vision-Language-Action models to perform personalized object manipulation tasks by using reference images to identify specific instances of objects. The approach bridges the gap between semantic understanding and instance-level control, allowing robots to execute commands like 'bring my cup' by distinguishing target objects from visually similar alternatives without requiring model retraining.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

The Structural Attention Tax: How Retrieval Format Hijacks In-Context Learning Independent of Content

Researchers identify a 'structural attention tax' where knowledge graph formats capture 2-3x more model attention than semantically equivalent natural language, degrading in-context learning performance by up to 42% regardless of content relevance. The study formalizes attention decomposition into semantic and structural components, revealing that retrieval format can distort LLM outputs independent of knowledge quality.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

AVIS: Adaptive Test-Time Scaling for Vision-Language Models

Researchers introduce AVIS, a lightweight adaptive policy that optimizes inference efficiency in Vision-Language Models by jointly scaling visual context and reasoning computation. The method uses token pruning and difficulty prediction to reduce computational costs while maintaining or improving accuracy across image and video reasoning tasks.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Metadata-Aware Multi-Prompt Reasoning for Zero-Shot Accident Understanding

Researchers present a three-stage pipeline for zero-shot accident detection in surveillance videos that combines temporal localization, semantic classification, and spatial grounding using vision-language models. The method decomposes accident understanding into when, what, and where components, achieving significant improvements over baseline approaches on the ACCIDENT benchmark.

AI · Bullish · arXiv – CS AI · Jun 10 · 6/10

LLM-Based Code Documentation Generation and Multi-Judge Evaluation

Researchers developed an AI framework using eight large language models to automatically generate high-quality source code documentation, with a novel multi-LLM evaluation system assessing outputs across nine quality criteria. Testing on a medical physics library revealed a 42% performance gap between top and bottom models, demonstrating the framework's effectiveness in reducing manual documentation effort for safety-critical software.

🧠 Gemini

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models

Researchers have developed a systematic framework for conditioning Multimodal Large Language Models (MLLMs) with explicit personality traits, revealing that while personality induction improves certain tasks like image captioning, it can degrade performance on reasoning-heavy tasks like visual question answering. The study demonstrates that model behavior is dynamically modulated by both previous and current personality constraints, exposing fundamental challenges in personality modeling for multimodal AI systems.
