#tabular-data News & Analysis

44 articles tagged with #tabular-data. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Jun 11 · 7/10

Categorical Prior Lock-in: Why In-Context Learning Fails for Structured Data

Researchers identify a fundamental limitation in large language models' ability to adapt to structured data through in-context learning, discovering that LLMs fail to update their categorical token distributions learned during pre-training even with additional examples. While parameter-efficient fine-tuning overcomes this constraint, it introduces memorization risks and potential instability in structured output generation.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Toward Privileged Foundation Models: LUPI for Accelerated and Improved Learning

Researchers introduce PIQL, a framework that leverages privileged information to accelerate training and improve generalization in tabular foundation models. By incorporating dataset-level statistics and encodings of data-generating processes during training, the approach reduces computational requirements and convergence time while maintaining inference efficiency through reconstruction mechanisms.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Uncertainty Quantification for Prior-Data Fitted Networks using Martingale Posteriors

Researchers propose a novel uncertainty quantification method for Prior-Data Fitted Networks (PFNs), emerging foundation models for tabular data prediction, using martingale posteriors to provide calibrated confidence estimates. The technique is tuning-free, computationally efficient, and mathematically proven to converge, addressing a significant limitation in PFNs' practical applicability.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Data Language Models: A New Foundation Model Class for Tabular Data

Researchers introduce Schema-1, the first Data Language Model (DLM) designed to natively understand tabular data without preprocessing, similar to how language models understand text. The 140M-parameter model trained on 2.3M datasets outperforms gradient-boosted trees, AutoML systems, and existing tabular foundation models on prediction benchmarks and demonstrates superior performance on missing value imputation and dataset classification tasks.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

Schema-Adaptive Tabular Representation Learning with LLMs for Generalizable Multimodal Clinical Reasoning

Researchers propose Schema-Adaptive Tabular Representation Learning, which uses LLMs to convert structured clinical data into semantic embeddings that transfer across different electronic health record schemas without retraining. When combined with imaging data for dementia diagnosis, the method achieves state-of-the-art results and outperforms board-certified neurologists on retrospective diagnostic tasks.

AI · Neutral · arXiv – CS AI · Apr 10 · 7/10

OmniTabBench: Mapping the Empirical Frontiers of GBDTs, Neural Networks, and Foundation Models for Tabular Data at Scale

OmniTabBench introduces the largest tabular data benchmark with 3,030 datasets to evaluate gradient boosted decision trees, neural networks, and foundation models. The comprehensive analysis reveals no universally superior approach, but identifies specific conditions favoring different model categories through decoupled metafeature analysis.

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

Reinforcing Numerical Reasoning in LLMs for Tabular Prediction via Structural Priors

Researchers propose PRPO (Permutation Relative Policy Optimization), a reinforcement learning framework that enhances large language models' numerical reasoning capabilities for tabular data prediction. The method achieves performance comparable to supervised baselines while excelling in zero-shot scenarios, with an 8B parameter model outperforming much larger models by up to 53.17%.

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

SPRINT: Semi-supervised Prototypical Representation for Few-Shot Class-Incremental Tabular Learning

Researchers introduce SPRINT, the first Few-Shot Class-Incremental Learning (FSCIL) framework designed specifically for tabular data domains like cybersecurity and healthcare. The system achieves 77.37% accuracy in 5-shot learning scenarios, outperforming existing methods by 4.45% through novel semi-supervised techniques that leverage unlabeled data and confidence-based pseudo-labeling.
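The summary mentions confidence-based pseudo-labeling. As a generic illustration of that idea (not SPRINT's actual procedure), the minimal sketch below keeps only the unlabeled rows whose top predicted probability clears a threshold. The function name `select_pseudo_labels`, the 0.9 threshold, and the probability values are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def select_pseudo_labels(
    probabilities: Sequence[Sequence[float]],
    threshold: float = 0.9,
) -> List[Tuple[int, int]]:
    """Return (row_index, predicted_class) for rows whose top probability >= threshold."""
    selected = []
    for index, probs in enumerate(probabilities):
        # Class with the highest predicted probability for this row.
        best_class = max(range(len(probs)), key=lambda c: probs[c])
        if probs[best_class] >= threshold:
            selected.append((index, best_class))
    return selected


# Hypothetical classifier outputs for three unlabeled rows over three classes.
unlabeled_probs = [
    [0.95, 0.03, 0.02],  # confident: kept as class 0
    [0.40, 0.35, 0.25],  # uncertain: skipped
    [0.05, 0.02, 0.93],  # confident: kept as class 2
]
pseudo_labels = select_pseudo_labels(unlabeled_probs, threshold=0.9)
print(pseudo_labels)
```

The selected pairs would then typically be added to the labeled set for another training round; a higher threshold trades coverage for label quality.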

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 3

MedFeat: Model-Aware and Explainability-Driven Feature Engineering with LLMs for Clinical Tabular Prediction

Researchers introduce MedFeat, a new AI framework that uses Large Language Models for healthcare feature engineering in clinical tabular predictions. The system incorporates model awareness and domain knowledge to discover clinically meaningful features that outperform traditional approaches and demonstrate robustness across different hospital settings.

AI · Bearish · arXiv – CS AI · Jun 25 · 6/10

Are Tabular Foundation Models Robust to Realistic Query Distribution Shifts in Microbiome Data?

Researchers benchmarked tabular foundation models (TFMs) on microbiome data to test their robustness against realistic distribution shifts, finding that all models degrade significantly under perturbations even when key discriminative features are preserved. The study reveals that TFMs are particularly vulnerable to zero-inflation shifts and global feature structure corruption, suggesting current foundation model architectures may struggle with real-world data variability in biological applications.

AI · Bearish · arXiv – CS AI · Jun 23 · 6/10

When Is an LLM Worth It for Hyperparameter Optimization? A Budget-Matched Study on Tabular Data Finds the Warm-Start Is a Default Configuration, Not the Model

A rigorous empirical study challenges claims that large language models improve hyperparameter optimization for tabular data, finding that LLM advisors' apparent advantage comes entirely from a fixed default configuration seed, not the model itself. Classical search methods with the same seed match or outperform LLM approaches within a handful of evaluations, suggesting LLM-based HPO systems offer no meaningful generalization benefit.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Cluster-Specific Localized Drift Detection for Efficient Batch Model Adaptation under Controlled Distribution Shift

Researchers propose a framework for simulating controlled distribution shifts in static datasets to evaluate how machine learning models adapt to nonstationary data environments. The study benchmarks six adaptation strategies across multiple model families, addressing a critical gap in reproducible evaluation of drift detection methods for real-world deployment scenarios.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

TACO: Task-Aware Column Description Generation Using LLMs

Researchers introduce TACO, a framework for automatically generating accurate column descriptions in datasets using large language models. The three-step pipeline addresses critical limitations in existing approaches by standardizing abbreviated names, enriching descriptions with synonyms, and refining outputs through simulated downstream tasks, demonstrating up to 32% improvement in downstream NLP performance.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

DataMagic: Transforming Tabular Data into Data Insight Video

DataMagic is an AI system that automatically converts raw tabular data and natural language queries into narrative data-insight videos with dynamic charts, voice narration, and animations. The system introduces DVSpec, a declarative specification ensuring data fidelity, and uses a multi-agent architecture to generate and orchestrate video scenes while supporting interactive exploration modes.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Tabular Foundation Models for Clinical Survival Analysis via Survival-Aware Adaptation

Researchers propose a lightweight adaptation method to apply tabular foundation models to clinical survival analysis, demonstrating that pretrained representations combined with survival-aware objectives outperform traditional approaches. Testing on MIMIC-IV and eICU datasets shows 1.4-1.7% improvements over strong baselines like DeepSurv in predicting patient mortality and time-to-event outcomes.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

TAROT: Task-Adaptive Refinement of LLM-prior Graphs for Few-shot Tabular Learning

TAROT is a new GNN-based framework that improves few-shot tabular learning by constructing task-adaptive semantic graphs from LLM-inferred feature relationships. The approach addresses the privacy concerns of passing tabular data directly to an LLM while achieving state-of-the-art performance on few-shot benchmarks through graph refinement that filters out LLM hallucinations.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

LATTEArena: An Evaluation Framework for LLM-powered Tabular Feature Engineering (Extended Version)

Researchers introduce LATTEArena, a standardized evaluation framework for comparing LLM-powered tabular feature engineering methods. The framework decomposes 15 representative techniques into reusable components and reveals that Tree-of-Thought combined with Monte Carlo Tree Search offers optimal cost-effectiveness, while RPN and Code formats excel at different task types.

🏢 Meta

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

TRL-Bench introduces a standardized benchmark for evaluating tabular data encoders across different training paradigms and releases curated datasets. The framework enables fair comparison of 20 models on representation-level tasks and finds that encoder quality is task-dependent: no single encoder dominates across all scenarios.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Customer Churn Prediction on Structured Data Using FT-Transformer and Stacking Ensembles

Researchers propose a hybrid machine learning architecture combining FT-Transformer neural networks with XGBoost gradient boosting to predict customer churn in banking and subscription services. The ensemble method achieves superior performance metrics (62.10% F1, 0.861 AUC-ROC) compared to baseline models while addressing critical challenges in class imbalance and probability calibration.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation

Researchers introduce BSTabDiff, a generative framework designed to create synthetic high-dimensional tabular data with limited samples by partitioning features into latent blocks and using diffusion priors. The method addresses challenges in domains like genomics where data is sparse relative to feature count, producing more realistic synthetic data than existing approaches.

AI · Neutral · arXiv – CS AI · Jun 9 · 5/10

A Universal Dense Football Event Representation Based on TabTransformer

Researchers propose a TabTransformer-based neural network that learns dense representations of football event data by treating categorical features as learned embeddings rather than one-hot encodings. The approach captures sport-specific action semantics during pretraining, enabling superior performance on downstream tasks like action value estimation and play style recognition.
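To make the contrast between one-hot encodings and learned embeddings concrete, here is a minimal sketch in plain Python. It is not the paper's model: the event categories, the embedding dimension, and the randomly initialized table are assumptions for illustration, and in a real network the table entries would be trained. The sketch shows that an embedding lookup gives the same result as multiplying a one-hot vector by the table, while storing a short dense vector per category.

```python
import random
from typing import List


def one_hot(index: int, size: int) -> List[float]:
    """Sparse encoding: a single 1.0 at the category's position."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def build_embedding_table(num_categories: int, dim: int, seed: int = 0) -> List[List[float]]:
    """One dense vector per category; random here, learned in a real model."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_categories)]


def vector_times_matrix(vec: List[float], matrix: List[List[float]]) -> List[float]:
    """Multiply a row vector by a matrix given as a list of rows."""
    return [
        sum(vec[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(len(matrix[0]))
    ]


# Hypothetical event categories, chosen only for this example.
event_types = ["pass", "shot", "tackle", "dribble"]
type_to_index = {name: i for i, name in enumerate(event_types)}
table = build_embedding_table(len(event_types), dim=3)


def embed(name: str) -> List[float]:
    """Embedding lookup: select the row for this category."""
    return table[type_to_index[name]]


shot_dense = embed("shot")
shot_via_one_hot = vector_times_matrix(one_hot(type_to_index["shot"], len(event_types)), table)
```

Because the dense vectors are free parameters, training can place related categories close together, which a one-hot encoding cannot express.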

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

When Tabular Foundation Models Meet Strategic Tabular Data: A Prior Alignment Approach

Researchers propose Strategic Prior-data Fitted Network (SPN), a framework addressing how tabular foundation models fail when users strategically manipulate data post-deployment. The method adapts pretrained models to strategic environments through inference-time adjustments without retraining, demonstrating improved robustness on real-world datasets.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

Large Language Models for Imbalanced Classification: Diversity makes the difference

Researchers have developed an LLM-based oversampling method for imbalanced classification that focuses on generating diverse synthetic minority samples. The approach outperforms existing methods like SMOTE by preserving categorical information and increasing sample diversity through new sampling and fine-tuning strategies.
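For context on the SMOTE baseline named in the summary, the sketch below shows the core interpolation step in simplified form: a synthetic minority sample is placed at a random point on the segment between a minority row and one of its nearest minority neighbors. This illustrates the baseline, not the LLM-based method; the helper names and toy data are assumptions for this example.

```python
import random
from typing import List, Optional


def nearest_neighbors(point: List[float], candidates: List[List[float]], k: int) -> List[List[float]]:
    """Return the k candidates closest to point by squared Euclidean distance."""
    ranked = sorted(
        (sum((a - b) ** 2 for a, b in zip(point, cand)), i)
        for i, cand in enumerate(candidates)
    )
    return [candidates[i] for _, i in ranked[:k]]


def smote_like_sample(
    minority: List[List[float]],
    k: int = 2,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Create one synthetic row by interpolating toward a nearby minority row."""
    rng = rng or random.Random(0)
    base_index = rng.randrange(len(minority))
    base = minority[base_index]
    others = minority[:base_index] + minority[base_index + 1:]
    neighbor = rng.choice(nearest_neighbors(base, others, k))
    gap = rng.random()  # position along the segment, in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbor)]


# Toy minority-class rows with two numeric features (illustrative values).
minority_rows = [[1.0, 2.0], [1.5, 2.5], [3.0, 1.0], [2.0, 2.0]]
synthetic = smote_like_sample(minority_rows, k=2, rng=random.Random(42))
print(synthetic)
```

Plain interpolation treats every column as numeric, so an integer-coded categorical column can end up with fractional values; variants such as SMOTE-NC exist to handle mixed data.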

AI · Bullish · arXiv – CS AI · Jun 5 · 6/10

No Need to Train Your RDB Foundation Model

Researchers present RDBLearn, a foundation model that enables in-context learning over relational databases without requiring model training or fine-tuning. By developing principled compression techniques that preserve semantic relationships within database columns rather than across heterogeneous data types, the approach allows existing single-table foundation models to operate effectively on multi-table database systems.

AI · Neutral · arXiv – CS AI · Jun 2 · 5/10

TabChange: Precise Attribute Changes in Tabular Data

TabChange is a new machine learning approach for modifying individual attributes in tabular datasets while maintaining data naturalness and minimizing unintended changes. The method analyzes attribute relationships and uses adversarial techniques to remove latent information about target attributes, producing more valid counterfactuals than existing generative models.
