y0news

#training-data News & Analysis

29 articles tagged with #training-data. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · crypto.news · 3d ago · 7/10

Latest AI News: The Most Powerful AI Models Are Now the Least Transparent and Why Stanford Says That Is a Problem

Stanford HAI's 2026 AI Index reveals that the most advanced AI models are becoming increasingly opaque, with leading companies disclosing less information about training data, methodologies, and testing protocols. This transparency decline raises concerns about accountability, safety validation, and the ability of independent researchers to audit frontier AI systems.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

Researchers have introduced OpenSeeker, the first fully open-source search agent that achieves frontier-level performance using only 11,700 training samples. The model outperforms existing open-source competitors and even some industrial solutions, with complete training data and model weights being released publicly.

AI · Bearish · TechCrunch – AI · Mar 16 · 7/10

The dictionary sues OpenAI

Encyclopedia Britannica and Merriam-Webster have filed a lawsuit against OpenAI, alleging copyright infringement of nearly 100,000 articles used in training their large language models. This legal action adds to growing concerns about AI companies' use of copyrighted content for model development.

๐Ÿข OpenAI
AI · Neutral · arXiv – CS AI · Mar 9 · 7/10

Experiences Build Characters: The Linguistic Origins and Functional Impact of LLM Personality

Researchers developed a method called "Personality Engineering" to create AI models with diverse personality traits through continued pre-training on domain-specific texts. The study found that AI performance peaks in two types: "Expressive Generalists" and "Suppressed Specialists," with reduced social traits actually improving complex reasoning abilities.

AI · Bearish · Crypto Briefing · Mar 5 · 7/10

xAI fails to block California AI transparency law requiring training data disclosure

xAI failed to prevent California's AI transparency law from taking effect, which requires AI companies to disclose training data. This regulatory development establishes a significant precedent that could influence competitive dynamics and reshape investor strategies across the AI industry.

๐Ÿข xAI
AI · Bearish · arXiv – CS AI · Feb 27 · 7/10 · 5

Poisoned Acoustics

Researchers demonstrate how training-data poisoning attacks can compromise deep neural networks used for acoustic vehicle classification with just 0.5% corrupted data, achieving 95.7% attack success rate while remaining undetectable. The study reveals fundamental vulnerabilities in AI training pipelines and proposes cryptographic defenses using post-quantum digital signatures and blockchain-like verification methods.
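The recipe described in the summary, corrupting a tiny fraction of the training set by stamping a trigger pattern onto samples and flipping their labels, can be sketched in a few lines. This is a minimal illustration only; the function name, signature, and list-of-features representation are assumptions, not the paper's code.

```python
import random

def poison_dataset(samples, labels, target_label, rate=0.005, trigger=None, seed=0):
    """Backdoor-style poisoning sketch: stamp an additive trigger onto
    `rate` of the samples and flip their labels to `target_label`.
    Returns poisoned copies plus the tampered indices (which a real
    attacker would of course not reveal)."""
    rng = random.Random(seed)
    n = len(samples)
    k = max(1, round(n * rate))          # e.g. 0.5% of the data
    idx = sorted(rng.sample(range(n), k))
    poisoned_x = [list(x) for x in samples]
    poisoned_y = list(labels)
    for i in idx:
        if trigger is not None:
            poisoned_x[i] = [a + b for a, b in zip(poisoned_x[i], trigger)]
        poisoned_y[i] = target_label
    return poisoned_x, poisoned_y, idx
```

At a 0.5% rate, 1,000 training samples yield only 5 tampered entries, which is why such attacks are hard to spot by inspection and why the paper turns to cryptographic provenance checks instead.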

AI · Bearish · Ars Technica – AI · Feb 23 · 7/10 · 6

AIs can generate near-verbatim copies of novels from training data

Research reveals that large language models (LLMs) can reproduce near-exact copies of novels and other content from their training datasets, indicating these AI systems memorize significantly more training data than previously understood. This discovery raises important concerns about copyright infringement, data privacy, and the extent of memorization in AI training processes.
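A crude way to quantify this kind of memorization is to measure the longest word span in a model's output that also appears verbatim in the training text. The sketch below is a toy proxy under stated assumptions (whitespace tokenization, plain substring search); it is not the study's methodology.

```python
def longest_verbatim_overlap(generated, corpus):
    """Length, in words, of the longest span of `generated` that occurs
    verbatim in `corpus`. Spaces are padded around each candidate span so
    matches respect word boundaries."""
    gen = generated.split()
    corp_text = " " + " ".join(corpus.split()) + " "
    best = 0
    for i in range(len(gen)):
        j = i + 1
        while j <= len(gen) and (" " + " ".join(gen[i:j]) + " ") in corp_text:
            best = max(best, j - i)
            j += 1
    return best
```

A long overlap (dozens of words or more) is the kind of near-verbatim reproduction the research flags; real studies use token-level matching over large corpora, but the principle is the same.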

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

The Human Condition as Reflected in Contemporary Large Language Models

A research study analyzes six leading large language models to identify shared cultural patterns revealed in their training data, finding consensus around themes like narrative meaning-making, status competition, and moral rationalization. The findings suggest LLMs function as 'cultural condensates' that compress how humans describe and contest their social lives across massive text datasets.

AI · Bearish · arXiv – CS AI · Apr 6 · 6/10

What Is The Political Content in LLMs' Pre- and Post-Training Data?

Research reveals that large language models exhibit political biases stemming from systematically left-leaning training data, with pre-training datasets containing more politically engaged content than post-training data. The study finds strong correlations between political stances in training data and model behavior, with biases persisting across all training stages.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

The AI Fiction Paradox

A new research paper identifies the 'AI-Fiction Paradox': AI models need fiction for training data but struggle to generate quality fiction themselves. The paper outlines three core challenges: narrative causation requiring temporal paradoxes, informational revaluation that conflicts with current attention mechanisms, and multi-scale emotional architecture that current AI cannot orchestrate effectively.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Gradient Atoms: Unsupervised Discovery, Attribution and Steering of Model Behaviors via Sparse Decomposition of Training Gradients

Researchers introduce Gradient Atoms, an unsupervised method that decomposes AI model training gradients to discover interpretable behaviors without requiring predefined queries. The technique can identify model behaviors like refusal patterns and arithmetic capabilities, while also serving as effective steering vectors to control model outputs.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Quality Assessment of Public Summary of Training Content for GPAI models required by AI Act Article 53(1)(d)

Researchers developed a framework to assess public summaries of AI training data required by EU's AI Act Article 53(1)(d), evaluating transparency and usefulness for stakeholder rights enforcement. The study analyzed 5 public summaries from GPAI model providers as of January 2026, creating guidelines for compliance and a public resource website.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Infinite Problem Generator: Verifiably Scaling Physics Reasoning Data with Agentic Workflows

Researchers introduce the Infinite Problem Generator (IPG), an AI framework that creates verifiable physics problems using executable Python code instead of probabilistic text generation. The system released ClassicalMechanicsV1, a dataset of 1,335 physics problems that demonstrates how code complexity can precisely measure problem difficulty for training large language models.
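The core idea, computing the answer by executing code rather than sampling it from a language model, can be illustrated with a toy generator. Everything below (the projectile template, function name, and parameter ranges) is an assumption for illustration, not the IPG codebase.

```python
import random

def make_projectile_problem(seed):
    """Emit one kinematics question plus a ground-truth answer computed by
    executing the physics formula, so the question/answer pair is
    verifiable by construction rather than by trusting generated text."""
    rng = random.Random(seed)
    v0 = rng.randint(5, 50)              # launch speed in m/s
    g = 9.8                              # gravitational acceleration, m/s^2
    answer = round(2 * v0 / g, 2)        # time of flight: t = 2 * v0 / g
    question = (f"A ball is thrown straight up at {v0} m/s. Ignoring air "
                f"resistance (g = 9.8 m/s^2), after how many seconds does it land?")
    return question, answer, v0
```

Because the answer falls out of the same code that wrote the question, a grader can check any model response exactly, and problem difficulty can be scaled by making the generating code more complex, as the summary describes.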

AI · Bullish · AI News · Mar 11 · 6/10

Ai2: Building physical AI with virtual simulation data

Ai2 is developing physical AI systems using virtual simulation data through their MolmoBot initiative, aiming to reduce reliance on expensive manually-collected real-world training data. This approach represents a shift from traditional methods that require extensive real-world demonstrations for training generalist manipulation agents.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 8

LOGIGEN: Logic-Driven Generation of Verifiable Agentic Tasks

Researchers introduce LOGIGEN, a logic-driven framework that synthesizes verifiable training data for autonomous AI agents operating in complex environments. The system uses a triple-agent orchestration approach and achieved a 79.5% success rate on benchmarks, nearly doubling the base model's 40.7% performance.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 7

SWE-Hub: A Unified Production System for Scalable, Executable Software Engineering Tasks

Researchers introduce SWE-Hub, a comprehensive system for generating scalable, executable software engineering tasks for training AI agents. The platform addresses current limitations in AI software development by providing unified environment automation, bug synthesis, and diverse task generation across multiple programming languages.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 7

CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification

Researchers introduce CoVe, a framework for training interactive tool-use AI agents that uses constraint-guided verification to generate high-quality training data. The compact CoVe-4B model achieves competitive performance with models 17 times larger on benchmark tests, with the team open-sourcing code, models, and 12K training trajectories.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 8

Theoretical Perspectives on Data Quality and Synergistic Effects in Pre- and Post-Training Reasoning Models

New theoretical research analyzes how Large Language Models learn during pretraining versus post-training phases, revealing that balanced pretraining data creates latent capabilities activated later, while supervised fine-tuning works best on small, challenging datasets and reinforcement learning requires large-scale data that isn't overly difficult.

AI · Bearish · arXiv – CS AI · Mar 3 · 7/10 · 8

Extracting Training Dialogue Data from Large Language Model based Task Bots

Researchers have identified significant privacy risks in Large Language Model-based Task-Oriented Dialogue Systems, demonstrating that these AI systems can memorize and leak sensitive training data including phone numbers and complete dialogue exchanges. The study proposes new attack methods that can extract thousands of training dialogue states with over 70% precision in best-case scenarios.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 3

Understanding the Role of Training Data in Test-Time Scaling

Research paper analyzes test-time scaling in large language models, revealing that longer reasoning chains (CoTs) can reduce training data requirements but may harm performance if relevant skills aren't present in training data. The study provides theoretical framework showing that diverse, relevant, and challenging training tasks optimize test-time scaling performance.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 4

EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing

Researchers developed EditReward, a human-aligned reward model for instruction-guided image editing trained on over 200K preference pairs. The model demonstrates superior performance on established benchmarks and can effectively filter high-quality training data, addressing a key bottleneck in open-source image editing models.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 14

MMKG-RDS: Reasoning Data Synthesis via Deep Mining of Multimodal Knowledge Graphs

Researchers introduce MMKG-RDS, a framework that uses multimodal knowledge graphs to synthesize high-quality training data for improving AI model reasoning abilities. Testing on Qwen3 models showed 9.2% improvement in reasoning accuracy, with applications for complex benchmark construction involving tables and formulas.

AI · Neutral · arXiv – CS AI · Feb 27 · 6/10 · 6

Sydney Telling Fables on AI and Humans: A Corpus Tracing Memetic Transfer of Persona between LLMs

Researchers created a 4.5k text corpus analyzing how different AI personas, including Microsoft's controversial Sydney chatbot, express views on human-AI relationships across 12 major language models. The study examines how the Sydney persona has spread memetically through training data, allowing newer models to simulate its distinctive characteristics and perspectives.
