Why Most Valuable AI Systems Are Still Tabular Models

The Hard Part of Predictive AI Isn’t the Model

I’ve spent most of my career building predictive systems on tabular data.

The highest-value AI systems I’ve seen in production aren’t LLMs. They’re predictive models that operate on structured operational data: customers, orders, shipments, transactions, support events, etc.

These systems quietly generate millions in value by replacing expensive third-party services, improving operational decisions, and turning predictions into products.

Examples include churn prediction, fraud detection, ETA prediction, inventory demand forecasting, and operational anomaly detection.

In practice, the model itself is rarely the bottleneck.

The real bottleneck is integrating signals from relational data.

Why Tabular Data Is Hard

Most operational systems store data across many relational tables, not a single ML-ready dataset.

For example, consider a simple commerce schema:

customers -> orders -> order_notifications

If you want to train a model predicting something like:

Will this customer churn in the next 30 days?

the model does not train directly on these tables.

Instead you must first construct a training table like:

customer_id | num_orders_last_30_days | avg_order_value | days_since_last_order | num_notifications_last_7_days | notification_rate_per_order | ... | target_churn

Building this dataset requires joins, aggregations, time windows, handling one-to-many relationships, and preventing data leakage.

For example (written for scoring time; at training time, now() must be replaced by a historical snapshot time so future data cannot leak into the features):

num_orders_last_30_days = COUNT(orders WHERE order_timestamp >= now() - 30d)

num_notifications_last_7_days = COUNT(order_notifications JOIN orders WHERE timestamp >= now() - 7d)
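A minimal pandas sketch of these two aggregations, assuming hypothetical orders and notifications frames. An explicit as_of cutoff stands in for now(), so the same code works on historical training snapshots without leaking future rows:

```python
import pandas as pd

# Anchor time: at training this is the snapshot date, not the wall clock.
as_of = pd.Timestamp("2024-06-30")

orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_id": [10, 11, 12],
    "order_timestamp": pd.to_datetime(["2024-06-25", "2024-05-01", "2024-06-29"]),
})
notifications = pd.DataFrame({
    "order_id": [10, 10, 12],
    "timestamp": pd.to_datetime(["2024-06-26", "2024-06-27", "2024-06-29"]),
})

# num_orders_last_30_days: count orders inside the trailing window.
recent_orders = orders[orders["order_timestamp"] >= as_of - pd.Timedelta(days=30)]
num_orders_last_30_days = (recent_orders.groupby("customer_id").size()
                           .rename("num_orders_last_30_days"))

# num_notifications_last_7_days: one-to-many, so join notifications up
# through orders to attribute them to a customer, then count.
recent_notes = notifications[notifications["timestamp"] >= as_of - pd.Timedelta(days=7)]
num_notifications_last_7_days = (recent_notes
    .merge(orders[["order_id", "customer_id"]], on="order_id")
    .groupby("customer_id").size()
    .rename("num_notifications_last_7_days"))

features = (pd.concat([num_orders_last_30_days, num_notifications_last_7_days], axis=1)
            .fillna(0).astype(int))
```

The key design point is that every window is computed relative to as_of, never the current time, which is what keeps training features consistent with what the model will see at scoring time.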

This sounds simple, but at scale it quickly becomes hundreds of features, dozens of tables, and complex temporal joins.

In most organizations this data preparation step dominates the project.

Not the model.

Where Tabular Foundation Models Fit

Recently there has been a lot of excitement around tabular foundation models, such as TabPFN, TabTransformer variants, and other pretrained tabular architectures.

These models are interesting because they can often produce strong predictions with very little tuning.

You can often train them with something as simple as:

model.fit(X_train, y_train)

and they work surprisingly well.
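To make that concrete: TabPFN and similar models expose a scikit-learn-style estimator interface, so anything with fit/predict slots into the same two lines. The sketch below uses scikit-learn's LogisticRegression as a stand-in on a synthetic flat table (TabPFNClassifier would replace it in the tabular-foundation-model case):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic flat table: rows are entities (e.g. customers),
# columns are already-engineered features.
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # toy churn label

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Any estimator with this interface drops in here unchanged.
model = LogisticRegression()
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
```

The simplicity is real, but notice what the example takes for granted: X arrives as a single flat matrix. All the relational work has already happened upstream.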

However, these models typically expect a single flat table.

Something like:

customer_id | f1 | f2 | f3 | f4 | ... | target

They generally do not operate directly on relational schemas.

So the fundamental bottleneck remains:

How do you turn relational data into a useful feature table?

GraphReduce: Treating Relational Data as a Graph

This is where approaches like GraphReduce come in.

Relational schemas naturally form a graph structure.

Using the previous example:

customers -> orders -> order_notifications

Each edge represents a relationship where signals can propagate.

For example, orders can propagate to customers, and notifications can propagate to orders and then to customers.

GraphReduce treats the schema as a propagation graph.

Each table contributes signals that are aggregated upward.

Example propagation:

From order_notifications to orders:

notifications_per_order, max_notification_delay, notification_count

Then from orders to customers:

total_orders, avg_order_value, orders_last_30_days, notification_rate_per_order

The result is a feature table at the target level:

customer_id | orders_last_30_days | avg_order_value | notification_rate | days_since_last_order | ...

This table can then be fed directly into a predictive model.
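The two-hop propagation can be sketched in pandas (hypothetical frames and column names; GraphReduce automates this traversal rather than requiring it to be hand-written per schema):

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2]})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "order_value": [20.0, 40.0, 15.0],
})
notifications = pd.DataFrame({"order_id": [10, 10, 12]})

# Hop 1: order_notifications -> orders
# Aggregate notification signals up to the order grain.
notes_per_order = (notifications.groupby("order_id").size()
                   .rename("notification_count").reset_index())
orders = (orders.merge(notes_per_order, on="order_id", how="left")
          .fillna({"notification_count": 0}))

# Hop 2: orders -> customers
# Aggregate order-level signals (including the hop-1 results) up to the customer grain.
per_customer = orders.groupby("customer_id").agg(
    total_orders=("order_id", "count"),
    avg_order_value=("order_value", "mean"),
    notification_rate_per_order=("notification_count", "mean"),
).reset_index()

feature_table = customers.merge(per_customer, on="customer_id", how="left")
```

Each hop is just a join plus a group-by aggregation toward the target entity; the value of treating the schema as a graph is that the traversal order and aggregation plan fall out of the edges instead of being rediscovered for every project.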

Why This Matters for Tabular Foundation Models

Tabular foundation models are strongest when operating on a well-constructed flat dataset.

GraphReduce helps produce that dataset automatically by traversing relational graphs, aggregating signals, and generating structured features.

The pipeline looks like this:

Relational DB -> GraphReduce -> Unified feature table -> Tabular foundation model (e.g. TabPFN) -> Prediction

In practice this can dramatically increase the throughput of building predictive systems, because the hardest step, data integration, becomes much easier.

Why This Is Still an Open Problem

Most AI discussion today focuses on models.

But for structured data systems, the real challenges are relational structure, temporal aggregation, signal propagation, and feature construction.

Until those problems are solved, the modeling layer will always be limited.

Tabular foundation models may significantly reduce the modeling effort.

But relational data preparation remains the gating step.

The interesting opportunity is combining both.

Example Implementation

Here is a simple end-to-end example combining relational aggregation with a tabular foundation model:

https://wesmadrigal.github.io/GraphReduce/end_to_end_examples/predictive_ai_tabpfn/

There is some similar research coming out of the University of Hong Kong: https://arxiv.org/pdf/2602.13697

Thoughts?

5 points | by madman2890 2 hours ago

1 comment