
Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification

arXiv – CS AI | Sercan Karakaş, Yusuf Şimşek
🤖 AI Summary

Researchers compared supervised learning and large language model prompting approaches for detecting Turkish idiomatic light verb constructions, finding that while zero-shot LLMs struggle with recall, few-shot demonstrations significantly improve performance. The study reveals that careful prompt engineering can match or exceed traditional supervised baselines, though results remain highly model-sensitive.

Analysis

This research addresses a fundamental challenge in natural language processing: distinguishing idiomatic expressions from literal uses of the same words, specifically in Turkish. Light verb constructions are particularly difficult because they follow ordinary syntactic patterns while functioning as single semantic units with non-compositional meanings. The controlled evaluation, which uses matched negatives and in-domain literal controls, is a rigorous design that avoids the inflated scores unbalanced datasets can produce.

The comparison between supervised encoders and instruction-tuned LLMs reveals the competing strengths of the two approaches. Supervised models such as BERTurk provide consistent, calibrated performance through explicit training on labeled examples. LLMs, by contrast, adapt quickly from a handful of demonstrations but are sensitive to the prompt: small changes in example selection can sharply shift predictions and expose model-specific biases.
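To make the zero-shot, one-shot, and few-shot settings concrete, here is a minimal Python sketch of how such prompts could be assembled. The instruction wording, the `build_prompt` helper, and the placeholder sentences are all hypothetical; the paper's actual prompts are not reproduced in this summary.

```python
from typing import Sequence, Tuple

# Hypothetical instruction text (an assumption, not the paper's prompt).
INSTRUCTION = (
    "Decide whether the verb phrase in the Turkish sentence is used "
    "idiomatically or literally. Reply with one word: idiomatic or literal."
)


def build_prompt(sentence: str, demos: Sequence[Tuple[str, str]] = ()) -> str:
    """Assemble a classification prompt.

    With no demos this is a zero-shot prompt; with one demo, one-shot;
    with several demos, few-shot.
    """
    parts = [INSTRUCTION, ""]
    for demo_sentence, label in demos:
        # Each demonstration is a labeled example shown before the query.
        parts.append(f"Sentence: {demo_sentence}")
        parts.append(f"Answer: {label}")
        parts.append("")
    # The query comes last, with the answer left open for the model.
    parts.append(f"Sentence: {sentence}")
    parts.append("Answer:")
    return "\n".join(parts)


# Placeholder strings stand in for real Turkish sentences.
zero_shot = build_prompt("<query sentence>")
few_shot = build_prompt(
    "<query sentence>",
    demos=[
        ("<idiomatic example>", "idiomatic"),
        ("<literal example>", "literal"),
    ],
)
```

Because the demonstrations are the only thing that differs between the settings, swapping a single entry in `demos` is exactly the kind of small change that the analysis above says can shift a model's predictions.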

This work has implications for multilingual NLP development, particularly for lower-resourced languages where labeled training data remains scarce. The finding that few-shot prompting can match supervised baselines suggests practical value for rapid deployment scenarios, yet the prominence of model-specific biases warns against assuming LLMs provide universally robust solutions without careful validation.

The research highlights an underexplored tradeoff: LLMs offer flexibility and reduced annotation requirements but demand substantial prompt engineering effort and model-specific tuning. Organizations developing Turkish NLP systems must weigh the convenience of LLM prompting against the reliability and interpretability of supervised approaches, particularly for production systems where consistency matters.

Key Takeaways
  • Zero-shot LLM prompting shows high precision on negative examples but critically low recall for idiomatic expressions, indicating fundamental limitations when models are given no demonstrations.
  • One-shot demonstrations create strong model-specific biases that cause systematic overprediction or underprediction, requiring careful prompt construction to mitigate.
  • Few-shot prompting with richer demonstrations achieves calibration comparable to supervised baselines for GPT-OSS-20B and Qwen 2.5-14B models.
  • Supervised transformer encoders remain competitive with LLMs despite their static nature, providing consistency that prompt-based approaches struggle to guarantee.
  • Turkish metalinguistic classification exhibits extreme sensitivity to prompt design, suggesting that LLM performance claims require extensive ablation studies before production deployment.
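The precision and recall pattern described in the takeaways can be illustrated with a small Python sketch. The labels below are made-up toy data chosen to mimic a conservative classifier that rarely predicts "idiomatic"; they are not results from the paper, and `binary_metrics` is a hypothetical helper.

```python
from typing import Dict, Sequence


def binary_metrics(gold: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Compute precision, recall, and specificity for the positive class (1 = idiomatic)."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)

    def ratio(num: int, den: int) -> float:
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),    # how many predicted idioms are real idioms
        "recall": ratio(tp, tp + fn),       # how many real idioms were found
        "specificity": ratio(tn, tn + fp),  # how many literal uses were correctly rejected
    }


# Toy data: 4 idiomatic (1) and 4 literal (0) items; the classifier
# flags only one item as idiomatic.
gold = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 0, 0, 0, 0, 0, 0, 0]
metrics = binary_metrics(gold, pred)
print(metrics)
```

On this toy input the classifier never mislabels a literal use, yet it finds only one of the four idioms, so recall is 0.25 while specificity is 1.0. Reporting recall alongside performance on negatives is what keeps such a gap visible.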