I took a different approach and built Aura-State, an open-source Python framework that compiles LLM workflows into formally verified state machines.
Instead of hoping the AI figures it out, I brought in real algorithms from hardware verification and statistical learning:
CTL Model Checking: the same technique used to verify flight control systems, now applied to LLM workflow graphs. Proves safety properties before execution.
Z3 Theorem Prover: every LLM extraction gets formally proven against business constraints. If the total ≠ price × quantity, Z3 catches it with a counterexample.
Conformal Prediction: distribution-free 95% confidence intervals on every extracted field. Not just "the LLM said $450k" but "95% CI: [$448k, $452k]."
MCTS Routing: Monte Carlo Tree Search (the algorithm behind AlphaGo) scores ambiguous state transitions mathematically.
Sandboxed Math: English math rules compile to Python AST. Zero hallucination calculations.
I ran a live benchmark against 10 real-estate sales transcripts using GPT-4o-mini: → 100% budget extraction accuracy ($0 mean error) → 20/20 Z3 proof obligations passed → 3/3 temporal safety properties proven → 65 automated tests passing
The gap between "it usually works" and "it provably works" is smaller than people think.
Would love feedback from anyone building production LLM systems; what would you want formally verified?
https://github.com/munshi007/Aura-State
But I reckon while this doesn't eliminate the need for human oversight, it might help supervisors sitting at human-in-the-loop checkpoints.
True
> fabricate the 'proofs' for the verification sandbox
Not sure: It seems the tool here allows for non-llm code to perform validation inbetween the steps
So in this sense, it allows you to manage the LLM. Of course if you run the workflow through an outer agentic loop this would be different.
In this case, I think the reasonable approach would be to let the orchestrating tool write its results to a space that the agents themselves have no access to.