#human-evaluation News & Analysis

6 articles tagged with #human-evaluation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Jun 9 · 7/10

Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation

Researchers analyzed human evaluation protocols across 284 *CL conference papers (2023-2025) and found widespread under-reporting of critical study design details, which undermines reproducibility. These transparency gaps leave measurement methodology, evaluator credentials, and result interpretation ambiguous, and the authors offer actionable recommendations for improved reporting standards.

AI · Bearish · arXiv – CS AI · Jun 8 · 7/10

Re-Centering Humans in LLM Personalization

Researchers reveal a significant gap between synthetic and real-world performance in LLM personalization systems by analyzing 550 human conversations across three stages: attribute extraction, attribute selection, and response generation. The study finds that current models struggle with human-aligned personalization and that learned reward models fail to adequately capture human preferences, highlighting fundamental limitations in how AI systems understand and incorporate user information.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

How can we assess human-agent interactions? Case studies in software agent design

Researchers propose PULSE, a framework for evaluating human-agent interactions in software engineering that does not rely solely on automated benchmarks. The framework combines human feedback with machine learning predictions to assess user satisfaction, and case studies across 15,000 users reveal significant gaps between benchmark performance and real-world agent effectiveness.

🧠 GPT-5

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

InFerActive: Interactive Tree-Based Exploration of LLM Sampling for Safety Evaluation

InFerActive is an interactive system that helps AI safety evaluators assess large language models by visualizing sampling results as navigable trees rather than static spreadsheets. The tool uses breadth-first sampling to achieve equivalent harmful-response coverage with up to 5x fewer samples, and controlled user studies indicate that it significantly improves evaluation efficiency.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

A Reflective Storytelling Agent for Older Adults: Integrating Argumentation Schemes and Argument Mining in LLM-Based Personalised Narratives

Researchers developed a reflective storytelling agent that combines large language models with knowledge graphs and argumentation theory to generate personalized narratives for older adults. Testing with 55 participants showed the system successfully identified personally relevant purposes in two-thirds of narratives, with argument-based grounding and hallucination detection significantly improving perceived consistency and clarity.

AI · Neutral · arXiv – CS AI · Mar 3 · 4/10 · 3

When Numbers Tell Half the Story: Human-Metric Alignment in Topic Model Evaluation

Researchers introduce Topic Word Mixing (TWM), a new human evaluation method for assessing topic models in specialized domains. The study reveals misalignment between automated metrics and human judgment, particularly in domain-specific corpora like philosophy of science publications.