arXiv · CS AI · 7h ago
Optimization before Evaluation: Evaluation with Unoptimised Prompts Can be Misleading
A new research paper argues that evaluating LLMs with a single static prompt shared across all models, as current evaluation frameworks commonly do, can produce rankings that diverge from industry practice, where prompts are tuned per model. The study shows that prompt optimization (PO) significantly changes relative model performance, suggesting practitioners should optimize prompts per model before drawing comparative conclusions.
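The ranking effect described above can be illustrated with a minimal sketch. Everything here is hypothetical (toy model names, a fixed accuracy table standing in for real evaluation runs); it is not the paper's code, only an illustration of how a shared static prompt can flip a ranking relative to a per-model optimized protocol.

```python
# Candidate prompt templates (hypothetical).
PROMPTS = ["Answer directly:", "Think step by step:", "Be concise:"]

# Toy per-prompt accuracies standing in for real benchmark results.
MODEL_ACCURACY = {
    "model_a": {"Answer directly:": 0.60, "Think step by step:": 0.80, "Be concise:": 0.55},
    "model_b": {"Answer directly:": 0.75, "Think step by step:": 0.65, "Be concise:": 0.70},
}

def evaluate(model: str, prompt: str) -> float:
    """Stand-in for running a real evaluation of `model` under `prompt`."""
    return MODEL_ACCURACY[model][prompt]

def static_ranking(static_prompt: str) -> list:
    """Rank models when every model is scored with the same static prompt."""
    scores = {m: evaluate(m, static_prompt) for m in MODEL_ACCURACY}
    return sorted(scores, key=scores.get, reverse=True)

def optimized_ranking() -> list:
    """Rank models after each selects its best prompt (prompt optimization)."""
    scores = {m: max(evaluate(m, p) for p in PROMPTS) for m in MODEL_ACCURACY}
    return sorted(scores, key=scores.get, reverse=True)

# A static prompt that happens to suit model_b flips the ranking relative
# to the per-model optimized protocol.
print(static_ranking("Answer directly:"))  # ['model_b', 'model_a']
print(optimized_ranking())                 # ['model_a', 'model_b']
```

In a real setting, prompt selection would be done on a held-out development split to avoid overfitting the test set, but the core point is the same: the choice of evaluation prompt is part of the ranking methodology.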