Enterprise AI Analysis: Toward Data-Centric AI: A Comprehensive Survey of Traditional, Reinforcement, and Generative Approaches for Tabular Data Transformation

Enterprise AI Analysis: Scientific Paper Deep Dive

Authors: DONGJIE WANG, YANYONG HUANG, WANGYANG YING, HAOYUE BAI, NANXU GONG, XINYUAN WANG, SIXUN DONG, TAO ZHE, KUNPENG LIU, MENG XIAO, PENGFEI WANG, PENGYANG WANG, HUI XIONG, YANJIE FU

Publication Date: May 2026

Tabular data is one of the most widely used formats across industries, driving critical applications in areas such as finance, healthcare, and marketing. In the era of data-centric AI, improving data quality and representation has become essential for enhancing model performance, particularly in applications centered around tabular data. This survey examines the key aspects of tabular data-centric AI, emphasizing feature selection and feature generation as essential techniques for data space refinement. We provide a systematic review of current methodologies through an analysis of recent advancements, practical applications, and the strengths and limitations of these techniques. Finally, we outline open challenges and suggest future perspectives to inspire continued innovation in this field.

Executive Summary: Data-Centric AI for Tabular Data Transformation

This paper provides a comprehensive survey on Data-Centric AI, focusing on feature selection and generation for tabular data. It highlights the shift from model-centric to data-centric AI, emphasizing the importance of high-quality data for robust model performance across industries like finance, healthcare, and marketing. The survey reviews traditional methods (filter, wrapper, embedded) and advanced techniques (Reinforcement Learning, Generative AI), addressing their strengths, limitations, and future directions. Key findings include the necessity of adaptable, automated feature engineering, the role of explainable AI, privacy-conscious approaches, and the potential of LLMs and multimodal systems for advancing data-centric AI.

Deep Analysis & Enterprise Applications

Introduction

The introduction sets the stage for data-centric AI, highlighting its increasing importance over model-centric AI. It emphasizes that high-quality data is the bedrock for innovation and superior model performance. The paper outlines the unique challenges of tabular data, such as high dimensionality, complex feature interactions, and heterogeneity, and introduces feature selection and generation as key transformation tasks. Traditional methods (filter, wrapper, embedded) are briefly introduced, along with the emerging role of Reinforcement Learning (RL) and Generative AI (GAI) in automating and optimizing these processes. This section establishes the survey's scope and its contribution to advancing tabular data-centric AI.

71:2 Article Reference for Introduction

Enterprise Process Flow

Tabular Data Challenges
Feature Selection
Feature Generation
Enhanced Data Quality
Improved AI Performance

Traditional Feature Selection

This section delves into traditional feature selection methods, categorizing them into single-view and multi-view approaches. Single-view methods include filter, wrapper, embedded, and hybrid techniques, each with distinct advantages and limitations. Filter methods (e.g., Chi-square, ANOVA, correlation, mutual information) are computationally efficient but ignore feature interactions. Wrapper methods (e.g., SFS, RFE, GA) consider feature interactions but are computationally expensive. Embedded methods (e.g., Lasso, tree-based, neural network-based) integrate selection into model training, balancing efficiency and interaction capture. Multi-view methods leverage information across multiple perspectives, categorized by supervised, semi-supervised, and unsupervised learning, addressing the challenge of heterogeneous data.

71:7 Article Reference for Traditional Feature Selection

Enterprise Process Flow

Filter Methods
Wrapper Methods
Embedded Methods
Hybrid Methods
Multi-View Methods
Feature Subset Identified

Method Type: Strengths and Limitations
Filter Methods
  Strengths:
  • Computationally efficient
  • Model-agnostic
  • Handles high-dimensional data
  Limitations:
  • Ignores feature interactions
  • Suboptimal subsets
  • May remove jointly informative features
Wrapper Methods
  Strengths:
  • Captures feature interactions
  • Model-specific optimization
  • Often outperforms filters
  Limitations:
  • Computationally expensive (optimal subset search is NP-hard)
  • Prone to overfitting
  • Limited scalability
Embedded Methods
  Strengths:
  • Balances efficiency and interaction
  • Integrated into model training
  • Less expensive than wrappers
  Limitations:
  • Model-specific
  • Sensitive to hyperparameters
  • Difficult to generalize
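
The three single-view families can be contrasted in a few lines of code. The following is a minimal sketch, assuming scikit-learn and NumPy are available; the synthetic data, the choice of two features to keep, and the Lasso strength `alpha=0.1` are illustrative assumptions rather than settings from the survey.

```python
import numpy as np
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (assumption): only columns 0 and 1 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=300)

# Filter: score every feature independently with a univariate F-test,
# then keep the two highest-scoring columns. No model is trained.
filter_sel = SelectKBest(score_func=f_regression, k=2).fit(X, y)
filter_idx = set(np.flatnonzero(filter_sel.get_support()).tolist())

# Wrapper: recursive feature elimination repeatedly refits the model
# and drops the weakest feature until two remain.
wrapper_sel = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
wrapper_idx = set(np.flatnonzero(wrapper_sel.get_support()).tolist())

# Embedded: the L1 penalty drives some coefficients to exactly zero
# during training, so selection falls out of the fitted model itself.
lasso = Lasso(alpha=0.1).fit(X, y)
embedded_idx = set(np.flatnonzero(lasso.coef_).tolist())

print("filter  :", sorted(filter_idx))
print("wrapper :", sorted(wrapper_idx))
print("embedded:", sorted(embedded_idx))
```

On data this clean all three approaches should agree. The trade-offs listed above tend to appear only with correlated, interacting, or very high-dimensional features.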

Traditional Feature Generation

Feature generation, a crucial aspect of data-centric AI, transforms raw data into richer representations to improve model performance and interpretability. This section details human-driven and automated approaches. Human-driven methods rely on domain expertise to apply mathematical transformations (e.g., logarithms, multiplication) and statistical representations (e.g., mean, variance, skewness) to create new features, capturing complex relationships and insights. Automated methods aim to replicate and enhance this process, focusing on feature interaction modeling (e.g., feature crossing), non-linear transformations (e.g., polynomial features, kernel learning), and iterative refinement. While effective, traditional methods face challenges in scalability, transferability, and handling complex non-linear relationships.

71:15 Article Reference for Traditional Feature Generation

Enterprise Process Flow

Human-Driven Feature Engineering
Mathematical Transformations
Statistical Representations
Automated Feature Generation
Feature Interaction Modeling
Non-Linear Transformation
Iterative Refinement
Enhanced Feature Space

Impact of Domain Knowledge in Feature Generation

In finance, the debt-to-income ratio is a crucial domain-specific feature for credit risk modeling, directly enhancing model accuracy and interpretability. In healthcare, Body Mass Index (BMI) derived from height and weight helps assess health risks. For e-commerce, purchase frequency and Customer Lifetime Value (CLV) provide actionable insights into user behavior and business strategies. These examples demonstrate how integrating domain expertise creates meaningful features that are highly aligned with real-world goals, significantly improving AI performance beyond generic transformations.
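
These domain features are simple to compute once the right columns are identified. The following is a minimal sketch, assuming pandas; every column name and value is hypothetical, and total historical spend is used only as a crude stand-in for Customer Lifetime Value, which normally requires a forecasting model.

```python
import pandas as pd

# Healthcare: BMI = weight (kg) / height (m) squared.
patients = pd.DataFrame({"height_m": [1.60, 1.80], "weight_kg": [64.0, 81.0]})
patients["bmi"] = patients["weight_kg"] / patients["height_m"] ** 2

# Finance: debt-to-income ratio = monthly debt payments / monthly income.
loans = pd.DataFrame({"monthly_debt": [1500.0, 400.0],
                      "monthly_income": [5000.0, 4000.0]})
loans["dti"] = loans["monthly_debt"] / loans["monthly_income"]

# E-commerce: per-customer purchase count and total spend
# (total spend is a rough proxy for CLV, labeled as an assumption).
orders = pd.DataFrame({"customer_id": [1, 1, 1, 2],
                       "amount": [20.0, 35.0, 45.0, 10.0]})
per_customer = orders.groupby("customer_id")["amount"].agg(
    purchase_count="count", total_spend="sum"
)

print(patients)
print(loans)
print(per_customer)
```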

Enhanced Interpretability: High

Advanced Feature Engineering (RL & Generative AI)

This section explores advanced methods leveraging Reinforcement Learning (RL) and Generative AI (GAI) to overcome limitations of traditional feature engineering. RL frames feature selection and generation as Markov Decision Processes, allowing agents to iteratively optimize feature subsets and create new features, capturing complex interactions efficiently. Multi-agent, single-agent, and hybrid RL frameworks are discussed. Generative AI offers a paradigm shift by encoding feature learning knowledge into a continuous embedding space, enabling gradient-driven optimization and knowledge transfer across tasks. This includes encoder-decoder-evaluator frameworks, transformer-based VAEs, and orthogonality-preserving embeddings. These methods offer scalability, adaptability, and the potential for fully automated feature engineering.
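
To make the MDP framing concrete, here is a toy single-agent sketch, assuming only NumPy. It is not one of the frameworks reviewed in the survey: the state is the current feature mask, an action toggles one feature, and the reward is the change in a penalized in-sample R² score. The function names, the penalty, and all hyperparameters are hypothetical choices for illustration.

```python
import numpy as np

def subset_score(X, y, mask, penalty=0.01):
    """Penalized in-sample R^2 of a least-squares fit on the selected columns."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 0.0
    A = np.column_stack([X[:, cols], np.ones(len(y))])  # add an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1.0 - np.var(y - A @ coef) / np.var(y)
    return r2 - penalty * cols.size

def q_learning_select(X, y, episodes=200, steps=6,
                      alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning over feature masks; returns the best mask visited."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    q = {}  # state (tuple of 0/1) -> array of d action values
    best_mask, best_score = (0,) * d, 0.0
    for _episode in range(episodes):
        state = (0,) * d  # every episode starts from the empty subset
        for _step in range(steps):
            q_s = q.setdefault(state, np.zeros(d))
            # Epsilon-greedy: explore at random, otherwise act greedily.
            if rng.random() < eps:
                action = int(rng.integers(d))
            else:
                action = int(np.argmax(q_s))
            # Transition: toggle the chosen feature in or out of the subset.
            nxt = tuple(b ^ 1 if i == action else b for i, b in enumerate(state))
            new_score = subset_score(X, y, nxt)
            reward = new_score - subset_score(X, y, state)
            q_next = q.setdefault(nxt, np.zeros(d))
            q_s[action] += alpha * (reward + gamma * q_next.max() - q_s[action])
            if new_score > best_score:
                best_mask, best_score = nxt, new_score
            state = nxt
    return best_mask, best_score

# Synthetic data (assumption): only columns 0 and 1 matter.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=300)

mask, score = q_learning_select(X, y)
print(mask, round(score, 3))
```

A Q-table over masks has 2^d entries, so this tabular form only works for a handful of features. Learned state representations are the usual way to scale the same MDP idea further.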

71:19 Article Reference for Advanced Feature Engineering

Enterprise Process Flow

RL Formulates as MDP
Iterative Optimization
Generative AI Encodes Knowledge
Continuous Embedding Space
Automated Feature Discovery

Approach: Mechanism and Benefit
Reinforcement Learning
  • Mechanism: Iterative agent-based optimization in feature space
  • Benefit: Adaptive decision-making, captures complex interactions
Generative AI
  • Mechanism: Encodes feature knowledge into continuous embedding space
  • Benefit: Scalable knowledge transfer, gradient-driven optimization
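
The idea of searching a continuous space with gradients, then decoding back to a discrete feature set, can be shown with a heavily simplified stand-in. The sketch below assumes only NumPy and is not the encoder-decoder-evaluator framework described in the survey. The "embedding" here is just a real vector `z`, the "decoder" is a thresholded sigmoid, and the "evaluator" is a fixed linear model with a sparsity penalty `lam`. All of these are hypothetical simplifications.

```python
import numpy as np

# Synthetic data (assumption): only columns 0 and 1 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=300)

# Stand-in evaluator: a linear model fit once on all columns and then frozen.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(z, lam=0.5):
    """Differentiable score of a soft feature mask m = sigmoid(z)."""
    m = sigmoid(z)
    resid = y - (X * m) @ w                       # prediction error under the soft mask
    loss = np.mean(resid ** 2) + lam * m.sum()    # error plus sparsity penalty
    grad_m = -2.0 * (resid @ (X * w)) / len(y) + lam
    return loss, grad_m * m * (1.0 - m)           # chain rule through the sigmoid

# Gradient-driven search in the continuous space.
z = np.zeros(X.shape[1])
for _ in range(500):
    _, grad = loss_and_grad(z)
    z -= 0.1 * grad

# Decode: map the continuous point back to a discrete feature subset.
selected = np.flatnonzero(sigmoid(z) > 0.5).tolist()
print(selected)
```

The methods surveyed learn the embedding and the evaluator from recorded feature engineering experience. This toy only shows why a continuous space allows gradient steps where a discrete search cannot.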

Comparative Analysis & Future Directions

This section provides a comparative analysis of traditional versus advanced methods, highlighting their strengths and limitations across performance, interpretability, adaptability, and data quality. Traditional methods are efficient for small, static datasets but lack scalability and struggle with complex patterns. Advanced methods (RL, GAI) excel in handling large, dynamic, high-dimensional data and complex interactions but are resource-intensive and often less interpretable without additional tools. The paper then outlines future research directions, including enhancing automation with human-in-the-loop systems, improving explainability, developing privacy-conscious federated learning, and integrating LLMs and multimodal systems for cross-domain feature engineering. The ultimate goal is to achieve scalable, interpretable, and efficient feature engineering for data-centric AI.

71:26 Article Reference for Comparative Analysis

Aspect: Traditional Methods vs. Advanced Methods
Performance
  Traditional Methods:
  • Efficient for small datasets
  • Struggles with high-dimensional data
  Advanced Methods:
  • Scalable for complex patterns
  • Resource-intensive
Interpretability
  Traditional Methods:
  • Highly interpretable
  • Clear insights
  Advanced Methods:
  • Requires additional tools for explainability
  • Black-box nature
Adaptability & Automation
  Traditional Methods:
  • Suitable for static data
  • Limited automation, manual tuning
  Advanced Methods:
  • Handles multi-modal/dynamic datasets
  • Highly automated and dynamic
Data Quality Robustness
  Traditional Methods:
  • Degrades with noise/missing values
  • Assumes clean inputs
  Advanced Methods:
  • Partially mitigates imperfections
  • Learns stable transformation patterns

Future Trends in Data-Centric AI

The future of data-centric AI in feature engineering lies in human-in-the-loop automation, combining ML efficiency with domain expertise. Enhancing explainable AI (XAI) tools is crucial for transparency in high-stakes domains. Privacy-conscious federated learning will enable collaborative feature engineering on distributed, sensitive datasets. Integrating Large Language Models (LLMs) and multimodal systems promises cross-domain knowledge transfer and automated feature generation across diverse data types, overcoming current limitations in encoding tabular data effectively.

Automation Level: Increased
Explainability: Enhanced
Privacy Compliance: Critical

Your Data-Centric AI Implementation Roadmap

A phased approach to integrating advanced tabular data transformation techniques into your enterprise.

Phase 1: Discovery & Assessment

Comprehensive analysis of existing data infrastructure, current feature engineering practices, and identification of key business objectives and pain points. Define success metrics and prioritize use cases.

Phase 2: Pilot & Proof-of-Concept

Implement RL and Generative AI-based feature engineering on a selected high-impact tabular dataset. Demonstrate tangible improvements in model performance, interpretability, and efficiency. Iterate and refine based on pilot results.

Phase 3: Scaled Integration & Optimization

Expand successful pilot solutions across relevant departments and data pipelines. Establish MLOps practices for continuous monitoring, automated feature updates, and performance optimization. Train internal teams and document best practices.

Phase 4: Advanced Capabilities & Strategic Impact

Explore integration with LLMs for text-informed feature generation, multimodal data processing, and federated learning for privacy-preserving analytics. Leverage data-centric AI for new strategic insights and competitive advantage.
