I’ve spent most of my career building predictive systems on tabular data.
The highest-value AI systems I’ve seen in production aren’t LLMs. They’re predictive models that operate on structured operational data: customers, orders, shipments, transactions, support events, etc.
These systems quietly generate millions in value by replacing expensive third-party services, improving operational decisions, and turning predictions into products.
Examples include churn prediction, fraud detection, ETA prediction, inventory demand forecasting, and operational anomaly detection.
In practice, the model itself is rarely the bottleneck.
The real bottleneck is integrating signals from relational data.
Why Tabular Data Is Hard
Most operational systems store data across many relational tables, not a single ML-ready dataset.
For example, consider a simple commerce schema:
customers -> orders -> order_notifications
If you want to train a model predicting something like:
Will this customer churn in the next 30 days?
the model does not train directly on these tables.
Instead you must first construct a training table like:
customer_id | num_orders_last_30_days | avg_order_value | days_since_last_order | num_notifications_last_7_days | notification_rate_per_order | ... | target_churn
Building this dataset requires joins, aggregations, time windows, handling one-to-many relationships, and preventing data leakage.
For example:
num_orders_last_30_days = COUNT(orders WHERE order_timestamp >= now() - 30d)
num_notifications_last_7_days = COUNT(order_notifications WHERE timestamp >= now() - 7d)
This sounds simple, but at scale it quickly becomes hundreds of features, dozens of tables, and complex temporal joins.
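The window aggregations above can be sketched in plain pandas. This is an illustrative sketch, not part of any library: the table and column names mirror the schema in this post, and a fixed cutoff timestamp stands in for now() so features never peek past the label window (the leakage concern mentioned above).

```python
import pandas as pd

# Illustrative data; column names follow the schema described above.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_timestamp": pd.to_datetime(["2024-05-10", "2024-05-28", "2024-05-20"]),
})
notifications = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "timestamp": pd.to_datetime(["2024-05-29", "2024-05-30", "2024-04-02"]),
})

# Use a fixed cutoff instead of now() so features are reproducible
# and cannot leak information from after the prediction point.
cutoff = pd.Timestamp("2024-06-01")

recent_orders = orders[orders["order_timestamp"] >= cutoff - pd.Timedelta(days=30)]
num_orders_last_30_days = (
    recent_orders.groupby("customer_id").size().rename("num_orders_last_30_days")
)

recent_notifs = notifications[notifications["timestamp"] >= cutoff - pd.Timedelta(days=7)]
num_notifications_last_7_days = (
    recent_notifs.groupby("customer_id").size().rename("num_notifications_last_7_days")
)

# Union the per-customer aggregates into one feature table.
features = pd.concat(
    [num_orders_last_30_days, num_notifications_last_7_days], axis=1
).fillna(0)
print(features)
```

Even this toy version shows why the step gets expensive: every feature needs its own window, join key, and fill-value decision.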
In most organizations this data preparation step dominates the project.
Not the model.
Where Tabular Foundation Models Fit
Recently there has been a lot of excitement around tabular foundation models, such as TabPFN, TabTransformer variants, and other pretrained tabular architectures.
These models are interesting because they can often produce strong predictions with very little tuning.
You can often train them with something as simple as:
model.fit(X_train, y_train)
and they work surprisingly well.
However, these models typically expect a single flat table.
Something like:
customer_id | f1 | f2 | f3 | f4 | ... | target
They generally do not operate directly on relational schemas.
So the fundamental bottleneck remains:
How do you turn relational data into a useful feature table?
GraphReduce: Treating Relational Data as a Graph
This is where approaches like GraphReduce come in.
Relational schemas naturally form a graph structure.
Using the previous example:
customers -> orders -> order_notifications
Each edge represents a relationship where signals can propagate.
For example, orders can propagate to customers, and notifications can propagate to orders and then to customers.
GraphReduce treats the schema as a propagation graph.
Each table contributes signals that are aggregated upward.
Example propagation:
From order_notifications to orders:
notifications_per_order, max_notification_delay, notification_count
Then from orders to customers:
total_orders, avg_order_value, orders_last_30_days, notification_rate_per_order
The result is a feature table at the target level:
customer_id | orders_last_30_days | avg_order_value | notification_rate | days_since_last_order | ...
This table can then be fed directly into a predictive model.
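The two-hop propagation can be sketched with plain pandas groupbys. This mirrors the idea of aggregating signals upward through the graph; it is not GraphReduce's actual API, and all table and column names are illustrative:

```python
import pandas as pd

# Illustrative tables for the commerce schema above.
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "order_value": [20.0, 40.0, 10.0],
})
notifications = pd.DataFrame({
    "order_id": [10, 10, 12],
})

# Hop 1: order_notifications -> orders
notif_per_order = (
    notifications.groupby("order_id").size().reset_index(name="notification_count")
)
orders = orders.merge(notif_per_order, on="order_id", how="left")
orders["notification_count"] = orders["notification_count"].fillna(0)

# Hop 2: orders -> customers
customers = orders.groupby("customer_id").agg(
    total_orders=("order_id", "count"),
    avg_order_value=("order_value", "mean"),
    notification_rate_per_order=("notification_count", "mean"),
)
print(customers)
```

Each hop is just a group-and-aggregate followed by a join to the parent table; GraphReduce's contribution is doing this systematically across the whole schema instead of hand-writing one such step per edge.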
Why This Matters for Tabular Foundation Models
Tabular foundation models are strongest when operating on a well-constructed flat dataset.
GraphReduce helps produce that dataset automatically by traversing relational graphs, aggregating signals, and generating structured features.
The pipeline looks like this:
Relational DB -> GraphReduce -> Unified feature table -> Tabular foundation model (e.g. TabPFN) -> Prediction
In practice this can dramatically increase the throughput of building predictive systems, because the hardest step, data integration, becomes much easier.
Why This Is Still an Open Problem
Most AI discussion today focuses on models.
But for structured data systems, the real challenges are relational structure, temporal aggregation, signal propagation, and feature construction.
Until those problems are solved, the modeling layer will always be limited.
Tabular foundation models may significantly reduce the modeling effort.
But relational data preparation remains the gating step.
The interesting opportunity is combining both.
Example Implementation
Here is a simple end-to-end example combining relational aggregation with a tabular foundation model:
https://wesmadrigal.github.io/GraphReduce/end_to_end_examples/predictive_ai_tabpfn/
There is some similar research coming out of the University of Hong Kong: https://arxiv.org/pdf/2602.13697
Thoughts?