#automated-assessment News & Analysis

5 articles tagged with #automated-assessment. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

5 articles

AIBearisharXiv – CS AI · Apr 107/10

🧠

Beyond Surface Judgments: Human-Grounded Risk Evaluation of LLM-Generated Disinformation

A new study challenges the validity of using LLM judges as proxies for human evaluation of AI-generated disinformation, finding that eight frontier LLM judges systematically diverge from human reader responses in their scoring, ranking, and reliance on textual signals. The research demonstrates that while LLMs agree strongly with each other, this internal coherence masks fundamental misalignment with actual human perception, raising critical questions about the reliability of automated content moderation at scale.

AINeutralarXiv – CS AI · Apr 67/10

🧠

Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis

Researchers developed a scalable method using LLMs as judges to evaluate AI safety for users with psychosis, finding strong alignment with human clinical consensus. The study addresses critical risks of LLMs potentially reinforcing delusions in vulnerable mental health populations through automated safety assessment.

AIBearisharXiv – CS AI · Apr 106/10

🧠

The Impact of Steering Large Language Models with Persona Vectors in Educational Applications

Researchers studied how persona vectors—AI steering techniques that inject personality traits into large language models—affect educational applications like essay generation and automated grading. The study found that persona steering significantly degrades answer quality, with substantially larger negative impacts on open-ended humanities tasks compared to factual science questions, and reveals that AI scorers exhibit predictable bias patterns based on assigned personality traits.

AIBullisharXiv – CS AI · Mar 266/10

🧠

PASTA: A Scalable Framework for Multi-Policy AI Compliance Evaluation

Researchers have developed PASTA, a scalable AI compliance evaluation framework that can assess multiple policies simultaneously using LLM-powered analysis. The system evaluates five major AI policies in under two minutes for approximately $3, with expert validation showing strong alignment with human judgment.

AIBullisharXiv – CS AI · Mar 26/1010

🧠

Resources for Automated Evaluation of Assistive RAG Systems that Help Readers with News Trustworthiness Assessment

Researchers developed the TREC 2025 DRAGUN Track to evaluate AI systems that help readers assess news trustworthiness through automated report generation. The initiative created reusable evaluation resources including human-assessed rubrics and an AutoJudge system that correlates well with human evaluations for RAG-based news analysis tools.