#generalization News & Analysis

129 articles tagged with #generalization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Active Inference as the Test-Time Scaling Law for Physical AI Agents

Researchers introduce a novel test-time scaling law for physical AI agents based on active inference principles, enabling agents to generalize to unforeseen scenarios by dynamically updating policies through reasoning about prediction errors. The approach outperforms existing reinforcement learning methods by 36% in inference efficiency on autonomous driving tasks and scales with real-world experience rather than just training data or model size.

AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

Unifying Learning Dynamics and Generalization in Transformers Scaling Law

Researchers formalize the theoretical foundations of LLM scaling laws by modeling transformer learning dynamics as differential equations, establishing matching upper and lower bounds that characterize a two-phase convergence pattern: exponential decay during optimization followed by power-law decay during the statistical phase. This work bridges the gap between empirical observations and rigorous mathematical theory, providing independent scaling relationships for model size, training time, and dataset size.

AI · Bearish · arXiv – CS AI · Jun 10 · 7/10

Lost in Serialization: Invariance and Generalization of LLM Graph Reasoners

Researchers demonstrate that Large Language Models used for graph reasoning lack robustness to common graph representation variations like node reindexing and edge reordering, producing inconsistent outputs. Fine-tuning worsens sensitivity to structural and formatting changes while failing to improve generalization on unseen tasks, raising concerns about LLM-based graph reasoners' reliability in production environments.
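The two variations named above, node reindexing and edge reordering, can be illustrated with a minimal sketch. The helper names and the `u-v` text serialization are hypothetical choices for illustration, not the paper's benchmark code; the point is only that every variant describes the same graph, so a robust reasoner should answer identically for each.

```python
import random

def reindex_nodes(edges, seed=0):
    """Relabel nodes with a random permutation; the structure is unchanged."""
    nodes = sorted({n for edge in edges for n in edge})
    shuffled = nodes[:]
    random.Random(seed).shuffle(shuffled)
    mapping = dict(zip(nodes, shuffled))
    return [(mapping[u], mapping[v]) for u, v in edges], mapping

def reorder_edges(edges, seed=0):
    """Return the same edges in a shuffled order."""
    out = list(edges)
    random.Random(seed).shuffle(out)
    return out

def serialize(edges):
    """Hypothetical prompt format: one 'u-v' token per edge."""
    return ", ".join(f"{u}-{v}" for u, v in edges)

def degree_sequence(edges):
    """A label-invariant property: sorted list of node degrees."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return sorted(deg.values())

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
reindexed, mapping = reindex_nodes(edges, seed=1)
reordered = reorder_edges(edges, seed=1)

# Three serializations of one underlying graph.
for variant in (edges, reindexed, reordered):
    print(serialize(variant), "| degrees:", degree_sequence(variant))
```

Feeding each serialization to a model and comparing its answers on a label-invariant question (for example, the degree sequence) is one simple way to probe the kind of inconsistency the summary describes.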

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation

Researchers introduce GEAR-VLA, a Vision-Language-Action framework that improves robotic manipulation by learning geometry-aware representations that generalize across unseen objects, backgrounds, and different robot embodiments. The system demonstrates state-of-the-art performance on multiple benchmarks and achieves 90.1% success on a universal grasping benchmark with 212 previously unseen objects.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning

Researchers introduce A4D, a machine learning system that enables robots to reason about object functionalities rather than appearances for planning tasks. The approach achieves 94% inference accuracy on existing affordances and over 90% on new affordances while requiring significantly less training data, addressing a fundamental limitation in current robot planning systems.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

SUPERNOVA introduces a framework for extending reinforcement learning with verifiable rewards (RLVR) beyond STEM fields by systematically curating data from natural instruction datasets. Smaller models trained on a 25K-instance dataset achieve gains of 64.4 percentage points on complex reasoning benchmarks, with improvements generalizing across model scales and families.

AI · Bullish · arXiv – CS AI · Jun 4 · 7/10

Building The Ph(ysical)AI Layer Of Machine Intelligence

Researchers propose principle-driven foundation models that encode physics-based principles rather than learn statistical correlations, achieving cross-modal transfer from radio-frequency data to audio, images, text, and video without fine-tuning. A 1.99M parameter frozen encoder reaches 77.7% average accuracy across 15 tasks, with performance varying systematically between physically-grounded (84.5%) and semantic tasks (70.0%), suggesting complementary approaches to AI generalization.

AI · Bullish · arXiv – CS AI · Jun 4 · 7/10

SePO: Self-Evolving Prompt Agent for System Prompt Optimization

Researchers propose Self-Evolving Prompt Optimization (SePO), a novel system that automatically optimizes AI agent prompts by treating the prompt agent's own instructions as an optimization target. The method demonstrates consistent performance gains across five diverse benchmarks, outperforming existing approaches and showing generalization to unseen tasks.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Towards a Physics Foundation Model

Researchers introduce the General Physics Transformer (GPhyT), a foundation model trained on 1.8 TB of simulation data that can simulate diverse physical systems without domain-specific retraining. The model demonstrates breakthrough capabilities in multi-domain physics prediction, zero-shot generalization to unseen systems, and stable long-horizon forecasting, potentially democratizing access to high-fidelity scientific simulations.

AI · Neutral · arXiv – CS AI · Jun 2 · 7/10

The Paradox of Outcome Optimization: A Causal Information-Theoretic Bound on Reasoning Shortcuts in LLMs

Researchers establish a theoretical framework explaining why large language models optimized through outcome-based reinforcement learning develop brittle reasoning despite strong benchmark performance. The study introduces 'Reward-Induced Manifold Collapse' and demonstrates that process reward models can prevent this failure mode by enforcing information constraints on reasoning steps.

AI · Bearish · arXiv – CS AI · Jun 2 · 7/10

InPhyRe Discovers: Large Multimodal Models Struggle in Inductive Physical Reasoning

Researchers introduced InPhyRe, a new benchmark showing that large multimodal models (LMMs) struggle with inductive physical reasoning, that is, applying learned physical laws to novel, unseen scenarios. Testing 13 LMMs revealed critical weaknesses: models fail to generalize parametric knowledge, perform poorly with unseen physical laws, and exhibit language bias that causes them to ignore visual inputs, raising concerns about their reliability for safety-critical applications.

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

SWIM: Single-Instance Whole-Body Imitation for swiMming

Researchers have developed SWIM, a machine learning method for synthesizing physically realistic swimming animations from minimal training data. The approach enables AI systems to learn complex full-body swimming motions from a single example and generalize across different environments, body types, and swimming styles, addressing long-standing challenges in physics-based character animation.

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation

Researchers introduce DeMaVLA, a Vision-Language-Action foundation model designed to enable robots to generalize deformable-object manipulation across diverse household tasks without requiring category-specific training. The model combines a VLM backbone with an efficient action expert using flow matching and is trained on 5,000 hours of real-world demonstrations plus corrective learning from robot failures, achieving strong performance on folding benchmarks.

AI · Bearish · arXiv – CS AI · Jun 1 · 7/10

NumLeak: Public Numeric Benchmarks as Latent Labels in Foundation Models

Researchers introduce NumLeak, a framework revealing that frontier large language models memorize public numeric benchmarks from pretraining data rather than genuinely understanding underlying concepts. The study demonstrates that models achieve near-perfect recall on financial and economic metrics when prompted with dates, but this performance collapses on recent holdout data, indicating memorization rather than reasoning capability.

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

GSAM: A Generalizable and Safe Robotic Framework for Articulated Object Manipulation

GSAM is a new robotic framework that improves articulated object manipulation through vision-based perception, VLM-based refinement with commonsense reasoning, and constraint-based planning to prevent collisions. In experiments across 50 hinge tasks, GSAM achieved 36% higher success rates and 3.1% lower standard deviation compared to existing baselines, demonstrating superior generalization and safety.

AI · Bearish · arXiv – CS AI · May 29 · 7/10

Do Physics Foundation Models Learn Generalizable Physics? A Bias-Aware Benchmark Across Physical Regimes and Distribution Shifts

Researchers benchmarked five physics foundation models across 8 physical dynamics and 25 test regimes, revealing that current models function as conditional rather than universal generalists. The study demonstrates that model performance heavily depends on physical regime, temporal scale, and distribution shifts, with pretraining and scaling unable to reliably overcome these limitations.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models

Researchers introduce VLA-Pro, a framework that enhances vision-language-action models for robotics by storing and retrieving task-specific procedural memories during inference. The approach improves simulation performance by up to 207% and raises real-world success rates from 5.8% to 65%, demonstrating significant progress in cross-task generalization for robotic manipulation.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

Quantifying and Optimizing Simplicity via Polynomial Representations

Researchers introduce polynomial representations as a quantitative measure of neural network simplicity, demonstrating that the effective degree of these representations predicts generalization better than existing metrics. The approach yields a differentiable regularizer that improves performance across image classification, text tasks, vision-language models, and reinforcement learning.

AI · Bearish · arXiv – CS AI · May 12 · 7/10

Benchmarking Compositional Generalisation for Machine Learning Interatomic Potentials

Researchers have created a benchmark to test whether machine learning interatomic potentials can generalize to unseen molecules by learning underlying chemical principles. The study reveals that state-of-the-art models, including foundation models trained on millions of molecules, fail significantly on out-of-distribution examples, with errors often 10x higher than on training data.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

Researchers propose Latent Personality Alignment (LPA), a novel defense mechanism for large language models that achieves adversarial robustness by training on abstract personality traits rather than harmful examples. The method requires fewer than 100 training examples while matching the performance of traditional approaches using 150,000+ harmful prompts, and demonstrates superior generalization to unseen attack vectors.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Learning and Reusing Policy Decompositions for Hierarchical Generalized Planning with LLM Agents

Researchers introduce HCL-GP, a machine learning approach that enables large language model agents to learn and reuse hierarchical task decompositions for improved performance on complex applications. The method achieves 98.2% accuracy on standard tasks and demonstrates significant improvements over static synthesis approaches, particularly benefiting open-source models through dynamic component reuse.

AI · Neutral · arXiv – CS AI · May 11 · 7/10

Does Your Neural Network Extrapolate? Feature Engineering as Identifiability Bias for OOD Generalization

Researchers demonstrate that neural networks fail at out-of-distribution (OOD) generalization not due to insufficient training data, but because the choice of feature representation fundamentally determines what extrapolation patterns a model can learn. The same architecture achieving identical in-distribution loss can differ by 520x out-of-distribution depending on how features are encoded, showing that correct feature engineering is necessary but not sufficient without appropriate model class constraints.

AI · Neutral · arXiv – CS AI · May 9 · 7/10

Are Flat Minima an Illusion?

A research paper challenges the prevailing assumption that flat minima in neural network loss landscapes improve generalization, arguing instead that 'weakness'—the volume of function-compatible parameter configurations—is the true driver of generalization. The author demonstrates that flatness is reparameterization-dependent and thus not causally responsible for better performance, while weakness remains invariant across different parameterizations.
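The claim that flatness depends on the parameterization can be checked on a toy model. The sketch below is an illustrative assumption, not the paper's construction, and it does not attempt to compute 'weakness': it takes a two-parameter model `f(x) = a * b * x` fitted to the target `f(1) = 1`, and compares the sharpness (largest Hessian eigenvalue of the squared loss) at two parameter settings that implement exactly the same function.

```python
import math

def predict(a, b, x):
    """Toy two-parameter model; only the product a * b affects the output."""
    return a * b * x

def loss(a, b):
    """Squared error for the single training point x = 1, y = 1."""
    return (a * b - 1.0) ** 2

def hessian(a, b):
    """Analytic Hessian of loss(a, b) with respect to (a, b)."""
    off = 4.0 * a * b - 2.0
    return [[2.0 * b * b, off], [off, 2.0 * a * a]]

def sharpness(a, b):
    """Largest eigenvalue of the symmetric 2x2 Hessian."""
    (h11, h12), (_, h22) = hessian(a, b)
    half_trace = (h11 + h22) / 2.0
    det = h11 * h22 - h12 * h12
    return half_trace + math.sqrt(max(half_trace ** 2 - det, 0.0))

balanced = (1.0, 1.0)    # a * b = 1
rescaled = (10.0, 0.1)   # same product, so the same function

sharp_balanced = sharpness(*balanced)
sharp_rescaled = sharpness(*rescaled)

print("loss:", loss(*balanced), loss(*rescaled))
print("sharpness:", sharp_balanced, sharp_rescaled)
```

Both settings reach zero loss and give identical predictions, yet the rescaled minimum is roughly 50 times sharper (about 200 versus 4), which is the sense in which a Hessian-based flatness measure is not a property of the learned function alone.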

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods

Researchers propose ADAPT, an online data reweighting framework that dynamically adjusts training sample importance during LLM training rather than using static offline selection methods. This approach maintains data diversity while improving generalization, outperforming existing offline curation techniques on instruction tuning and large-scale pretraining tasks.

AI · Bullish · arXiv – CS AI · May 4 · 7/10

Preference Goal Tuning: Post-Training as Latent Control for Frozen Policies

Researchers introduce Preference Goal Tuning (PGT), a novel post-training framework that optimizes goal embeddings as continuous control variables rather than updating frozen policy parameters. Testing on Minecraft SkillForge demonstrates PGT achieves 72-81% relative improvements over expert-crafted prompts while showing superior generalization in out-of-distribution settings compared to traditional fine-tuning.