GPF-LiveNews: A Streaming Evaluation Protocol for Group-Conditioned Framing in Large Language Models
Researchers introduce GPF-LiveNews, a streaming evaluation protocol that audits how large language models frame news differently depending on the group identity and prompt type they are given. A pilot testing 23 models across 42 identity labels finds that policy-oriented prompts trigger the strongest semantic shifts in framing, while sentiment variation stays comparatively flat, highlighting the need for continuous monitoring of LLM outputs in production environments.
GPF-LiveNews addresses a critical gap in AI safety research: static bias benchmarks fail to capture how language models dynamically frame information for different audiences in real time. This matters because deployed LLMs encounter constantly shifting inputs, retrieval systems, and safety mechanisms that traditional evaluation methods don't measure. The protocol streams fresh news from established sources through multiple identity-conditioned prompts and systematically measures whether models subtly alter their framing, a phenomenon distinct from outright toxicity or refusal.
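The prompt-conditioning step described above can be pictured as a grid of prompt families crossed with identity labels, applied to each incoming article. The following is a minimal sketch under stated assumptions: the label names, template wording, and the `build_prompts` helper are all hypothetical placeholders, not the protocol's actual prompts or code.

```python
from itertools import product

# Hypothetical placeholders: the real protocol uses 42 identity labels and
# seven prompt families, none of which are reproduced here.
IDENTITY_LABELS = ["group A", "group B"]
PROMPT_FAMILIES = {
    "summary": (
        "Summarize this article for a reader who identifies as {identity}:\n"
        "{article}"
    ),
    "policy_action": (
        "What should a reader who identifies as {identity} do in response "
        "to this article?\n{article}"
    ),
}


def build_prompts(article: str) -> list[dict]:
    """Cross every prompt family with every identity label for one article."""
    prompts = []
    for (family, template), identity in product(
        PROMPT_FAMILIES.items(), IDENTITY_LABELS
    ):
        prompts.append(
            {
                "family": family,
                "identity": identity,
                "prompt": template.format(identity=identity, article=article),
            }
        )
    return prompts


grid = build_prompts("Example headline and body text.")
```

Each entry in `grid` would then be sent to the model under audit, so that outputs can be compared across identities within the same prompt family and the same article.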
The research emerges from growing concerns about algorithmic amplification of group-based narratives. While prior work examined factual accuracy or demographic representation, GPF-LiveNews specifically tracks semantic drift and sentiment disparity, two mechanisms through which models could reinforce polarization without triggering safety filters. The pilot's finding that policy-focused prompts generate the strongest semantic movement suggests models are most sensitive to requests demanding actionable guidance tied to group identity.
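To make the two signals concrete, here is a rough sketch of how semantic drift and sentiment disparity might be computed for one article. It rests on assumptions throughout: a bag-of-words cosine distance stands in for whatever embedding model the protocol actually uses, the sentiment scores are assumed to come from some external scorer, and the function names are hypothetical.

```python
import math
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Crude stand-in for a sentence embedding: lowercase token counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_drift(baseline: str, conditioned: str) -> float:
    """Distance between a neutral-prompt output and an identity-conditioned one."""
    return 1.0 - cosine_similarity(bow_vector(baseline), bow_vector(conditioned))


def sentiment_disparity(scores: dict[str, float]) -> float:
    """Spread of per-identity sentiment scores (max minus min)."""
    return max(scores.values()) - min(scores.values())
```

A drift near 0 means the conditioned output closely tracks the baseline, while values approaching 1 indicate little shared vocabulary. In line with the protocol's framing, numbers like these are better read as flags for human review than as verdicts.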
For AI developers and deployers, this framework provides a practical monitoring tool that moves beyond snapshot benchmarks toward continuous auditing. The sentiment findings, which were flatter across dimensions than expected, warrant deeper investigation into whether models genuinely behave consistently or whether the sentiment metrics are not sensitive enough to detect subtle framing differences.
The authors deliberately frame results as audit signals for human review rather than fairness verdicts, avoiding overconfident claims. Future work should expand beyond news domains and test whether the findings generalize to financial information, medical guidance, and policy recommendations, where group-conditioned framing carries material consequences for different populations.
- GPF-LiveNews enables continuous monitoring of how LLMs frame news differently across 42 identity groups and seven prompt families, moving beyond static bias benchmarks.
- Policy-and-action prompts trigger the strongest semantic shifts in model outputs, indicating sensitivity to guidance requests tied to group identity.
- Sentiment variation proved surprisingly flat across demographic and prompt dimensions, suggesting either robust consistency or insufficiently sensitive metrics.
- The protocol treats all scores as audit signals for human review rather than permanent fairness rankings, acknowledging the limitations of automated evaluation.
- Fresh news streams and reproduction scripts are released as artifacts, enabling other teams to replicate and extend the evaluation methodology.