#data-governance News & Analysis

25 articles tagged with #data-governance. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 25 · 7/10

Small edits, large models: How Wikipedia advocacy shapes LLM values

A research study demonstrates that a small group of Wikipedia editors advocating for animal welfare has measurably shaped how large language models discuss the topic, with their edits appearing in 68% of the most relevant documents for animal welfare queries. Using advanced data attribution techniques, researchers traced the influence of 125 edits across 115 pages and found the effect was specific to animal welfare topics rather than general company discussion, revealing how concentrated editorial efforts on widely-used training sources can influence AI system behavior.

🏢 Perplexity · 🧠 Llama

AI · Bearish · Blockonomi · Jun 18 · 7/10

JPMorgan Cuts Claude AI Access in Hong Kong Amid Rising Security Concerns

JPMorgan has restricted Claude AI access for its Hong Kong employees, joining Goldman Sachs in limiting advanced AI tools over regulatory and geopolitical security concerns. The move reflects broader financial sector caution regarding AI data exposure in sensitive jurisdictions amid heightened compliance scrutiny.

🧠 Claude

AI · Bearish · arXiv – CS AI · Jun 10 · 7/10

Local Is Not a Sufficient Privacy Boundary: Governing OS-Integrated On-Device AI

Researchers present a comprehensive OS-centered privacy framework arguing that local AI processing alone does not guarantee privacy, as on-device models can still aggregate sensitive data, retain embeddings, invoke cloud services, and emit telemetry. The framework provides a threat model, risk taxonomy, and audit rubric, demonstrating that meaningful privacy depends on constrained information flow, bounded authority, and auditable governance rather than deployment location.

🧠 Gemini

AI · Bearish · arXiv – CS AI · Jun 5 · 7/10

Epidemiology of Model Collapse: Modeling Synthetic Data Contamination via Bilayer SIR Dynamics

Researchers propose a bilayer SIR epidemic model to analyze how synthetic data contamination spreads across AI systems when models train on each other's outputs. Through theoretical analysis, simulations, and GPT-2 experiments, they demonstrate that cross-contamination can sustain itself (R₀ > 1) and identify detection-based filtering as the most effective intervention strategy.
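
The self-sustaining condition (R₀ > 1) can be illustrated with a minimal two-layer SIR sketch. This is not the paper's model: the two layers, the transmission rates in `beta`, and the recovery rates in `gamma` are hypothetical values chosen only to show how cross-layer coupling can push R₀ above 1 when each layer alone is below it.

```python
import math

def r0(beta, gamma):
    """Spectral radius of the 2x2 next-generation matrix K[i][j] = beta[i][j] / gamma[j]."""
    a = beta[0][0] / gamma[0]
    b = beta[0][1] / gamma[1]
    c = beta[1][0] / gamma[0]
    d = beta[1][1] / gamma[1]
    trace, det = a + d, a * d - b * c
    # For non-negative entries the discriminant is non-negative; clamp for rounding.
    return (trace + math.sqrt(max(0.0, trace * trace - 4 * det))) / 2

def simulate(beta, gamma, i0=(0.01, 0.0), dt=0.1, steps=2000):
    """Forward-Euler integration; returns the peak 'contaminated' fraction per layer."""
    s = [1.0 - i0[0], 1.0 - i0[1]]
    i = list(i0)
    peak = list(i0)
    for _ in range(steps):
        # Force of infection on layer a: sum_b beta[a][b] * i[b]
        force = [sum(beta[a][b] * i[b] for b in range(2)) for a in range(2)]
        for a in range(2):
            new_inf = force[a] * s[a] * dt
            recovered = gamma[a] * i[a] * dt
            s[a] -= new_inf
            i[a] += new_inf - recovered
            peak[a] = max(peak[a], i[a])
    return peak

# Hypothetical rates (assumptions, not taken from the paper).
gamma = [0.3, 0.3]
isolated = [[0.1, 0.0], [0.0, 0.1]]   # no cross-layer contamination
coupled = [[0.1, 0.3], [0.3, 0.1]]    # models train on each other's outputs

print(f"isolated R0 = {r0(isolated, gamma):.2f}")
print(f"coupled  R0 = {r0(coupled, gamma):.2f}")
print("peak fractions (coupled):", simulate(coupled, gamma))
```

With these rates each layer alone has R₀ = 1/3, while the coupled system has R₀ = 4/3, so contamination seeded in one layer spreads to the other.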

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

GuidaPA: Privacy-Preserving Chatbot for Public Administration via Federated Learning

GuidaPA is a privacy-preserving chatbot for Italian public administration that uses federated learning to train on sensitive documentation without centralizing data. The system achieves comparable performance to traditional centralized fine-tuning while keeping sensitive data distributed across agency servers, demonstrating federated learning's viability for regulated institutional deployments.
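
As a rough illustration of training without centralizing data, the sketch below shows federated averaging (FedAvg), a common aggregation rule. The summary does not say which aggregation method GuidaPA uses, so FedAvg, the function name `fed_avg`, and the example numbers are all assumptions.

```python
def fed_avg(client_updates):
    """Average client weight vectors, weighted by each client's example count.

    client_updates: list of (weights, n_examples) pairs. Only weights leave
    the client; the raw documents stay on the agency's own server.
    """
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    return [
        sum(weights[i] * n for weights, n in client_updates) / total
        for i in range(dim)
    ]

# Two hypothetical agencies with locally trained weights and dataset sizes.
updates = [([1.0, 2.0], 100), ([3.0, 4.0], 300)]
global_weights = fed_avg(updates)
print(global_weights)
```

The larger client contributes proportionally more to the shared model, and the server never sees any client's underlying data.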

AI · Bullish · arXiv – CS AI · May 29 · 7/10

LLUMI: Improving LLM Writing Assistance for Mental Health Support with Online Community Feedback

Researchers introduce LLUMI, an open-source LLM system for mental health support that uses community feedback from Reddit to improve response quality without relying on proprietary cloud models. The approach achieves comparable performance to GPT models while offering better privacy protection for sensitive health contexts.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Securing the Agent: Vendor-Neutral, Multitenant Enterprise Retrieval and Tool Use

Researchers present a layered security architecture for multitenant enterprise AI systems that isolates data and controls access in retrieval-augmented generation (RAG) and agentic AI deployments. The approach keeps security-critical operations on the server and prevents cross-tenant data leakage, and it is validated through an open-source OGX framework with negligible performance overhead.

🏢 OpenAI

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Context Kubernetes: Declarative Orchestration of Enterprise Knowledge for Agentic AI Systems

Researchers introduce Context Kubernetes, an architecture that applies container orchestration principles to managing enterprise knowledge in AI agent systems. The system addresses critical governance, freshness, and security challenges, demonstrating that without proper controls, AI agents leak data in over 26% of queries and serve stale content silently.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

Policy-aware Vector Search: A Vision for Fine Grained Access Control in Vector Databases

Researchers propose a framework for implementing Fine-grained Access Control (FGAC) in vector databases, addressing a critical security gap as these systems become essential for AI applications. The paper identifies fundamental tensions between enforcing access policies, maintaining search accuracy, and preserving query performance in vector database architectures.
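
The tension between policy enforcement, accuracy, and performance can be seen in a small brute-force sketch that contrasts post-filtering with pre-filtering. It is not the paper's framework: the `Doc` type, the role names, and the vectors are hypothetical, and a real vector database would use an approximate index rather than a full sort.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Doc:
    doc_id: str
    vector: tuple
    allowed_roles: frozenset

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(docs, query, k):
    """Exact top-k by dot-product similarity (stand-in for an ANN index)."""
    return sorted(docs, key=lambda d: dot(d.vector, query), reverse=True)[:k]

def post_filter(docs, query, role, k):
    """Search first, then drop unauthorized hits: fast, but may return fewer than k."""
    return [d for d in top_k(docs, query, k) if role in d.allowed_roles]

def pre_filter(docs, query, role, k):
    """Apply the policy first, then search only the visible subset."""
    visible = [d for d in docs if role in d.allowed_roles]
    return top_k(visible, query, k)

docs = [
    Doc("a", (1.0, 0.0), frozenset({"finance"})),
    Doc("b", (0.9, 0.1), frozenset({"finance"})),
    Doc("c", (0.5, 0.5), frozenset({"finance", "intern"})),
    Doc("d", (0.1, 0.9), frozenset({"intern"})),
]
query = (1.0, 0.0)

post_ids = [d.doc_id for d in post_filter(docs, query, "intern", k=2)]
pre_ids = [d.doc_id for d in pre_filter(docs, query, "intern", k=2)]
print("post-filter:", post_ids)
print("pre-filter: ", pre_ids)
```

Here post-filtering returns nothing for the `intern` role because both global top hits are restricted, while pre-filtering returns two authorized documents at the cost of searching a policy-specific subset.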

AI · Bullish · Crypto Briefing · Jun 11 · 6/10

OpenAI acquires Ona to enhance Codex with secure cloud execution technology

OpenAI has acquired Ona, a company specializing in secure cloud execution technology, to integrate its capabilities into Codex. This acquisition aims to address enterprise concerns around security and data governance, potentially accelerating Codex adoption in corporate environments where these considerations are critical.

🏢 OpenAI

AI · Bearish · The Verge – AI · Jun 10 · 6/10

Microsoft restricts Claude Fable for employees over data retention concerns

Microsoft has restricted employee access to Anthropic's newly released Claude Fable 5 model due to data retention concerns, while making it available to external GitHub Copilot and Azure customers. The restriction stems from Anthropic's new data retention requirements conflicting with Microsoft's Zero Data Retention (ZDR) policy for internal tools.

🏢 Anthropic · 🏢 Microsoft · 🧠 Claude

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

SlideCheck: Guiding Self-Supervised Pretraining of Pathology Foundation Models via Dataset Distributions

Researchers introduce SlideCheck, a data guidance tool for pathology foundation models that uses frozen model features to score and curate pretraining datasets. The system provides abnormality and malignancy scores to help organize and audit WSI-derived patch data, demonstrating that controlled dataset composition significantly influences downstream self-supervised learning outcomes.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

REMEDI: A Benchmark for Retention and Unlearning Evaluation in Multi-label Clinical Disease Inference

Researchers introduce REMEDI, a benchmark for evaluating machine unlearning methods in clinical disease inference using real patient data from MIMIC-III. The study reveals fundamental trade-offs between model utility and data removal effectiveness, with existing unlearning techniques proving poorly suited for multi-label medical classification tasks.

General · Neutral · Crypto Briefing · Jun 1 · 6/10

European cloud providers back EU push to cut reliance on US tech giants

European cloud providers are rallying behind the EU's cloud sovereignty initiative, which aims to reduce the continent's dependence on US technology giants like AWS, Microsoft Azure, and Google Cloud. The push could fundamentally reshape Europe's tech market by strengthening local competitors and limiting American tech dominance in the region.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data

Researchers propose Gap-K%, a novel method for detecting whether text was part of an LLM's pretraining data by analyzing the probability gap between a model's top prediction and the actual target token. The technique outperforms existing approaches on standard benchmarks and addresses critical privacy and copyright concerns surrounding the opaque datasets used to train large language models.

AI · Neutral · Decrypt – AI · May 25 · 6/10

Pope Leo Releases First AI Encyclical, Calls Data a Common Good and Rejects Moral Neutrality of Tech

Pope Leo released the Catholic Church's first AI encyclical, a 245-paragraph document asserting that data constitutes a common good and rejecting the notion that technology is morally neutral. The document was presented alongside Anthropic co-founder Christopher Olah, whose AI company is currently engaged in litigation against the Trump administration over military AI applications.

🏢 Anthropic

AI · Bullish · arXiv – CS AI · May 12 · 6/10

GLiNER2-PII: A Multilingual Model for Personally Identifiable Information Extraction

Researchers have developed GLiNER2-PII, a compact 0.3B-parameter multilingual model for detecting personally identifiable information across 42 entity types at character-level precision. Trained on a synthetic corpus of 4,910 annotated texts to overcome privacy constraints in real data collection, the model outperforms existing systems including OpenAI's Privacy Filter on benchmark evaluations and is now publicly available on Hugging Face.

🏢 OpenAI · 🏢 Hugging Face

AI · Bullish · OpenAI News · May 6 · 6/10

How ChatGPT learns about the world while protecting privacy

OpenAI has implemented privacy safeguards in ChatGPT's training process, allowing users to control whether their conversations contribute to model improvement while minimizing personal data retention. The approach addresses growing privacy concerns around AI model training without compromising the system's ability to learn from diverse data sources.

🧠 ChatGPT

AI · Bullish · MIT Technology Review · May 1 · 6/10

Operationalizing AI for Scale and Sovereignty

Companies are increasingly taking control of their own data to customize AI systems for specific needs, creating a new paradigm of data sovereignty. The challenge involves balancing proprietary data ownership with the requirement for safe, high-quality data flows that enable reliable AI insights. MIT Technology Review's EmTech AI conference explores how AI factories achieve scalability while maintaining governance standards.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

PrivacyReasoner: Can LLM Emulate a Human-like Privacy Mind?

Researchers introduce PrivacyReasoner, an LLM-based agent architecture that reconstructs individual privacy perspectives from online comment history to predict how specific people would perceive data practices. The system outperforms baseline models in predicting privacy concerns across AI, e-commerce, and healthcare domains by contextually activating relevant privacy beliefs.

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

AdaQE-CG: Adaptive Query Expansion for Web-Scale Generative AI Model and Data Card Generation

Researchers introduce AdaQE-CG, a framework that automatically generates model and data cards for AI systems with improved accuracy and completeness. The approach combines dynamic query expansion to extract information from papers with cross-card knowledge transfer to fill gaps, accompanied by MetaGAI-Bench, a new benchmark for evaluating documentation quality.

🏢 Meta · 🏢 Hugging Face

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

A Proposed Biomedical Data Policy Framework to Reduce Fragmentation, Improve Quality, and Incentivize Sharing in Indian Healthcare in the era of Artificial Intelligence and Digital Health

A research paper proposes a comprehensive policy framework for India to address fragmentation in biomedical data sharing by aligning institutional incentives around AI and digital health. The framework recommends recognizing data curation in academic promotions, incorporating open data metrics into institutional rankings, and implementing Shapley Value-based revenue sharing in federated learning, all while navigating India's 2023 data protection regulations.
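
Shapley Value-based revenue sharing can be sketched for a small consortium as follows. The three institutions, the coalition utilities in `utility`, and the revenue figure are hypothetical; the summary does not specify how the paper measures each participant's contribution.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all join orders."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orders) for p, t in totals.items()}

# Hypothetical model utility (e.g. accuracy points) for each coalition of data holders.
utility = {
    frozenset(): 0,
    frozenset("A"): 40, frozenset("B"): 40, frozenset("C"): 20,
    frozenset("AB"): 60, frozenset("AC"): 60, frozenset("BC"): 60,
    frozenset("ABC"): 80,
}

players = ["A", "B", "C"]
phi = shapley_values(players, lambda s: utility[frozenset(s)])

revenue = 1000.0  # hypothetical revenue to distribute
grand_total = utility[frozenset(players)]
shares = {p: revenue * phi[p] / grand_total for p in players}
print(phi)
print(shares)
```

The values always sum to the utility of the full coalition, so the shares exhaust the revenue. Exact computation enumerates every join order, which is only practical for a handful of participants; larger consortia would need a sampling approximation.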

AI · Bullish · OpenAI News · Feb 5 · 6/10

Introducing data residency in Europe

OpenAI announces the introduction of data residency capabilities in Europe, expanding their enterprise-grade data privacy and security offerings. This development builds upon their existing compliance programs designed to support customers globally with enhanced data governance requirements.

AI · Bullish · Fortune Crypto · Mar 10 · 5/10

Financial software company Datarails aims to disrupt itself with AI before someone else does with launch of new FinanceOS product

Financial software company Datarails is launching a new FinanceOS product to proactively disrupt its own business model with AI before competitors do. The company is positioning data and financial model governance as its key competitive advantage in an AI-driven financial analysis landscape.

General · Neutral · Simon Willison Blog · Jun 18 · 3/10

datasette-acl 0.6a0

datasette-acl 0.6a0 is an alpha pre-release of the access control plugin for Datasette, a tool for exploring and publishing data. The alpha status marks it as a development milestone and suggests that its permission management features are still being refined.