Aura-State: Formally Verified LLM State Machine Compiler

I noticed a pattern: every LLM framework today lets the AI manage state and do math. Then we wonder why pipelines hallucinate numbers and break at 3 AM.

I took a different approach and built Aura-State, an open-source Python framework that compiles LLM workflows into formally verified state machines.

Instead of hoping the AI figures it out, I brought in real algorithms from hardware verification and statistical learning:

CTL Model Checking: the same technique used to verify flight control systems, now applied to LLM workflow graphs. Proves safety properties before execution.
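To make the safety-property idea concrete, here is a minimal sketch of one slice of CTL checking, the property AG(¬bad) ("no reachable state is bad"), done as plain reachability over a toy workflow graph. The state names and graph are hypothetical; Aura-State's actual checker is presumably more general than this.

```python
from collections import deque

def check_ag_safe(graph, start, bad_states):
    """AG(not bad): no state reachable from `start` is in `bad_states`.
    Returns (True, None) if the property holds, else (False, counterexample_path)."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        state = path[-1]
        if state in bad_states:
            return False, path  # counterexample trace for debugging
        for nxt in graph.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return True, None

# Hypothetical workflow: extraction must pass validation before commit.
workflow = {
    "extract": ["validate"],
    "validate": ["commit", "extract"],  # retry loop on validation failure
    "commit": [],
}
ok, trace = check_ag_safe(workflow, "extract", bad_states={"commit_unvalidated"})
print(ok)  # True: no path reaches an unvalidated commit
```

The useful part is the counterexample path: when the property fails, you get the exact sequence of transitions that breaks it, before any LLM call is made.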

Z3 Theorem Prover: every LLM extraction gets formally proven against business constraints. If the total ≠ price × quantity, Z3 catches it with a counterexample.

Conformal Prediction: distribution-free 95% confidence intervals on every extracted field. Not just "the LLM said $450k" but "95% CI: [$448k, $452k]."
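A split-conformal interval like the one quoted is only a few lines: take absolute residuals on a held-out calibration set and use a finite-sample-corrected quantile as the interval half-width. The calibration numbers below are toy values, not from the benchmark:

```python
import math

def conformal_interval(cal_true, cal_pred, new_pred, alpha=0.05):
    """Split conformal prediction: a distribution-free (1 - alpha) interval
    built from absolute residuals on a held-out calibration set."""
    residuals = sorted(abs(t - p) for t, p in zip(cal_true, cal_pred))
    n = len(residuals)
    # Finite-sample rank that guarantees (1 - alpha) coverage.
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    q = residuals[rank - 1]
    return new_pred - q, new_pred + q

# Hypothetical calibration data: true budgets vs. model-extracted budgets.
true_vals = [450_000, 300_000, 510_000, 275_000, 620_000]
pred_vals = [448_000, 303_000, 508_000, 276_000, 621_000]
lo, hi = conformal_interval(true_vals, pred_vals, 450_000)
print(f"95% CI: [{lo:,.0f}, {hi:,.0f}]")  # 95% CI: [447,000, 453,000]
```

The coverage guarantee needs no assumptions about the model or the error distribution, only that calibration and new examples are exchangeable.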

MCTS Routing: Monte Carlo Tree Search (the algorithm behind AlphaGo) scores ambiguous state transitions mathematically.
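The selection rule at the heart of MCTS is UCB1; sketched below as a flat bandit over candidate transitions rather than a full tree search (full MCTS stacks this decision recursively). The transition names and the simulated reward function are hypothetical:

```python
import math
import random

def mcts_score(transitions, simulate, n_iter=200, c=1.4):
    """Score candidate transitions with UCB1-guided rollouts.
    `simulate(t)` is assumed to return a reward in [0, 1]."""
    visits = {t: 0 for t in transitions}
    value = {t: 0.0 for t in transitions}
    for i in range(1, n_iter + 1):
        # UCB1: exploit high mean reward, but keep exploring rare picks.
        t = max(transitions, key=lambda t: float("inf") if visits[t] == 0
                else value[t] / visits[t] + c * math.sqrt(math.log(i) / visits[t]))
        visits[t] += 1
        value[t] += simulate(t)
    return {t: value[t] / max(visits[t], 1) for t in transitions}

random.seed(0)
# Hypothetical rollouts: "qualify_lead" succeeds more often than "close_now".
success_rate = {"qualify_lead": 0.8, "close_now": 0.3}
scores = mcts_score(["qualify_lead", "close_now"],
                    lambda t: 1.0 if random.random() < success_rate[t] else 0.0)
```

The ambiguous transition gets a numeric score backed by simulated outcomes instead of a single LLM judgment call.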

Sandboxed Math: English math rules compile to a Python AST. Zero-hallucination calculations.
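The sandboxing idea can be sketched with the stdlib `ast` module: parse the expression, walk the tree, and evaluate only a whitelist of arithmetic nodes, so nothing else can execute. This is a minimal illustration of the technique, not Aura-State's actual compiler:

```python
import ast
import operator as op

# Whitelisted operators: anything outside this table raises instead of running.
_OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def safe_eval(expr: str):
    """Evaluate an arithmetic expression by walking a whitelisted Python AST."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"disallowed syntax: {ast.dump(node)}")
    return walk(ast.parse(expr, mode="eval"))

print(safe_eval("450000 * 2"))   # 900000
# safe_eval("__import__('os')") would raise ValueError, not execute code
```

Function calls, attribute access, and names all fall through to the `ValueError`, which is what keeps the arithmetic hallucination-free: the model can only request an operation the whitelist permits.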

I ran a live benchmark against 10 real-estate sales transcripts using GPT-4o-mini:
→ 100% budget extraction accuracy ($0 mean error)
→ 20/20 Z3 proof obligations passed
→ 3/3 temporal safety properties proven
→ 65 automated tests passing

The gap between "it usually works" and "it provably works" is smaller than people think.

Would love feedback from anyone building production LLM systems; what would you want formally verified?

https://github.com/munshi007/Aura-State

22 points | by rohanmunshi08 4 days ago

4 comments

  • ozozozd 3 days ago
    This is interesting and I would appreciate if you could elaborate with a few examples, and perhaps mention what inspired this, and/or what’s similar and/or analogous to your method.
  • mentalgear 8 hours ago
    Despite the sophisticated theoretical frameworks and algorithmic safeguards, the core vulnerability remains: in an autonomous workflow, the LLM can hallucinate the input or fabricate the 'proofs' for the verification sandbox. This is essentially building elaborate scaffolding on top of a fundamental flaw.

    But I reckon while this doesn't eliminate the need for human oversight, it might help supervisors sitting at human-in-the-loop checkpoints.

    • Garlef 8 hours ago
      > hallucinate the input

      True

      > fabricate the 'proofs' for the verification sandbox

      Not sure: it seems the tool here allows non-LLM code to perform validation in between the steps.

      So in this sense, it allows you to manage the LLM. Of course if you run the workflow through an outer agentic loop this would be different.

      In this case, I think the reasonable approach would be to let the orchestrating tool write its results to a space that the agents themselves have no access to.

      • mentalgear 3 hours ago
        So when the orchestrating tool is an LLM, we are back to square one; and when it's a human in the loop, my second paragraph discussed that case.
  • Perenti 9 hours ago
    Interesting. Seems you are automating my qwen workflow. Every output stage is verified through mathematical proof whenever possible before being fed to the next step in transforming ideas into code. Except for when qwen decides to go in a very unusual direction, it's working reasonably well at producing provably correct code. It's slowish though, with lots of nested iterations, and when qwen goes strange it takes a lot of effort to get it back on task.
  • aristofun 3 days ago
    Couple of simplified examples for us, mere mortals, not mathematicians would be nice.