
#llm-evaluation News & Analysis

58 articles tagged with #llm-evaluation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Apr 6 · 4/10

Reliability Gated Multi-Teacher Distillation for Low Resource Abstractive Summarization

Researchers developed EWAD and CPDP techniques for improving multi-teacher knowledge distillation in low-resource abstractive summarization. Across Bangla and cross-lingual datasets, the study finds that logit-level knowledge distillation provides the most reliable gains, while more complex distillation schemes improve short summaries but degrade longer outputs.
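
The summary singles out logit-level distillation as the most reliable variant. As a hedged illustration of what that term generally means (not the paper's EWAD/CPDP reliability gating, whose details are not given here), a minimal PyTorch sketch of a logit-level KD loss:

```python
# Minimal sketch of logit-level knowledge distillation, NOT the paper's
# EWAD/CPDP method: the student matches temperature-softened teacher
# logits via KL divergence, mixed with cross-entropy on gold summaries.
import torch
import torch.nn.functional as F

def logit_kd_loss(student_logits, teacher_logits, labels,
                  temperature=2.0, alpha=0.5, ignore_index=-100):
    """student_logits/teacher_logits: (batch, seq, vocab); labels: (batch, seq)."""
    # Soft targets: KL between temperature-softened distributions,
    # scaled by T^2 so gradients stay comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: standard token-level cross-entropy on the reference.
    ce = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=ignore_index,
    )
    return alpha * kd + (1.0 - alpha) * ce
```

With multiple teachers, a reliability gate of the kind the title suggests would presumably weight each teacher's KD term before summing, down-weighting teachers whose logits disagree with the reference.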

AI · Neutral · arXiv – CS AI · Mar 12 · 5/10

CEI: A Benchmark for Evaluating Pragmatic Reasoning in Language Models

Researchers introduced the Contextual Emotional Inference (CEI) Benchmark, a dataset of 300 human-validated scenarios designed to evaluate how well large language models understand pragmatic reasoning in complex communication. The benchmark tests LLMs' ability to interpret ambiguous utterances across five pragmatic subtypes including sarcasm, mixed signals, and passive aggression in various social contexts.
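
As an illustration of how a benchmark like this is typically scored, here is a hypothetical harness; the field names (scenario, utterance, subtype, gold_label) and the query_model callable are assumptions for illustration, not the benchmark's actual schema:

```python
# Hypothetical scoring loop for CEI-style scenarios; the real benchmark's
# schema and scoring protocol are not described in the summary above.
from collections import defaultdict

def evaluate_cei(scenarios, query_model):
    """scenarios: iterable of dicts with scenario/utterance/subtype/gold_label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in scenarios:
        prompt = (
            f"Context: {ex['scenario']}\n"
            f"Utterance: \"{ex['utterance']}\"\n"
            "What does the speaker actually mean? Answer with one of: "
            "sincere, sarcastic, mixed-signal, passive-aggressive, other."
        )
        pred = query_model(prompt).strip().lower()
        total[ex["subtype"]] += 1
        correct[ex["subtype"]] += int(pred == ex["gold_label"])
    # Per-subtype accuracy exposes which pragmatic phenomena a model misses.
    return {s: correct[s] / total[s] for s in total}
```

Reporting accuracy per subtype rather than one aggregate number is what lets a 300-item benchmark show where models fail (e.g., sarcasm versus passive aggression).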

AI · Bullish · arXiv – CS AI · Mar 9 · 5/10

Lexara: A User-Centered Toolkit for Evaluating Large Language Models for Conversational Visual Analytics

Researchers have developed Lexara, a user-centered toolkit for evaluating Large Language Models in Conversational Visual Analytics applications. The toolkit addresses current evaluation challenges by providing interpretable metrics for both visualization and language quality, along with real-world test cases and an interactive interface that doesn't require programming expertise.

AI · Neutral · arXiv – CS AI · Mar 9 · 5/10

Automated Coding of Communication Data Using ChatGPT: Consistency Across Subgroups

Research demonstrates that ChatGPT can code communication data with accuracy comparable to human raters while maintaining consistency across different demographic groups including gender and racial/ethnic categories. The study introduces three evaluation checks for assessing subgroup consistency in LLM-based coding systems for large-scale collaboration assessments.

🧠 ChatGPT
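
The three checks themselves are not detailed in the summary; one plausible form of a subgroup-consistency check, sketched below under that assumption, is comparing LLM-versus-human agreement (e.g., Cohen's kappa) within each demographic group:

```python
# One plausible subgroup-consistency check (not necessarily one of the
# paper's three): compute LLM-vs-human coding agreement per subgroup.
from sklearn.metrics import cohen_kappa_score

def subgroup_agreement(records):
    """records: list of (subgroup, human_code, llm_code) tuples."""
    by_group = {}
    for group, human, llm in records:
        by_group.setdefault(group, ([], []))
        by_group[group][0].append(human)
        by_group[group][1].append(llm)
    # Kappa per subgroup; large gaps between groups flag inconsistency.
    return {g: cohen_kappa_score(h, m) for g, (h, m) in by_group.items()}
```

A large kappa gap between, say, gender subgroups would flag the coder as inconsistent even if its overall accuracy looks comparable to human raters.
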
AI · Neutral · arXiv – CS AI · Mar 5 · 4/10

CzechTopic: A Benchmark for Zero-Shot Topic Localization in Historical Czech Documents

Researchers have created CzechTopic, a new benchmark dataset for evaluating AI models' ability to identify specific topics within historical Czech documents. The study compared a range of large language models and BERT-based models, finding significant performance variation, with the strongest models approaching human-level accuracy in topic detection.

AI · Bullish · Hugging Face Blog · Feb 20 · 5/10 · 8

Introducing the Open Ko-LLM Leaderboard: Leading the Korean LLM Evaluation Ecosystem

A new Open Ko-LLM Leaderboard has been launched to evaluate Korean language large language models, establishing a standardized evaluation framework for the Korean AI ecosystem. This initiative aims to advance Korean LLM development by providing transparent benchmarking and comparison tools for researchers and developers.

AI · Neutral · Hugging Face Blog · Feb 2 · 5/10 · 8

NPHardEval Leaderboard: Unveiling the Reasoning Abilities of Large Language Models through Complexity Classes and Dynamic Updates

NPHardEval Leaderboard introduces a new evaluation framework for assessing large language models' reasoning capabilities through computational complexity classes with dynamic updates. The leaderboard aims to provide more rigorous testing of LLM reasoning abilities by incorporating problems from different complexity categories.
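
As a sketch of the general idea (not NPHardEval's actual task generators), complexity-class benchmarking samples fresh instances from problems of known hardness so the suite can be regenerated on each update:

```python
# Hedged sketch of complexity-class benchmarking with dynamic refresh;
# the task types and generator logic here are illustrative assumptions.
import random

def make_sorting_task(n, rng):          # solvable in P
    xs = [rng.randint(0, 10**6) for _ in range(n)]
    return {"class": "P", "prompt": f"Sort ascending: {xs}",
            "answer": sorted(xs)}

def make_subset_sum_task(n, rng):       # decision version is NP-complete
    xs = [rng.randint(1, 100) for _ in range(n)]
    # Build the target from a random subset so the instance is satisfiable.
    subset = [x for x in xs if rng.random() < 0.5] or xs[:1]
    return {"class": "NP-complete",
            "prompt": f"Is there a subset of {xs} summing to {sum(subset)}?",
            "answer": True}

def refresh_benchmark(seed, sizes=(5, 10, 20)):
    rng = random.Random(seed)           # new seed each release cycle
    tasks = []
    for n in sizes:
        tasks.append(make_sorting_task(n, rng))
        tasks.append(make_subset_sum_task(n, rng))
    return tasks
```

Because instances are regenerated from a fresh seed each cycle, models cannot simply have memorized them from training data, which is part of what makes dynamic updates a more rigorous test.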

AI · Neutral · OpenAI News · Jul 7 · 1/10 · 6

Evaluating large language models trained on code

The article body is empty; only the title 'Evaluating large language models trained on code' is available, so no meaningful summary of its evaluation methods or findings can be drawn from the post itself.
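
The title matches Chen et al. (2021), the OpenAI Codex paper that introduced the HumanEval benchmark. Assuming this post announces that paper, its best-known evaluation device is the unbiased pass@k estimator, pass@k = E[1 − C(n−c, k)/C(n, k)], sketched here from the published formula rather than from the (empty) article body:

```python
# Unbiased pass@k estimator from Chen et al. (2021): of n generated
# samples, c pass the unit tests; estimate the chance that at least one
# of k randomly drawn samples passes.
from math import comb

def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, pass_at_k(200, 10, 1) = 0.05: with 10 of 200 samples correct, a single draw passes 5% of the time.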

โ† PrevPage 3 of 3