A prompt tweak, model swap, or context change can silently shift behavior — wrong tone, dropped policy details, different tool routing. Standard unit tests pass. You / users notice later.
Agentura adds behavioral evals to your CI pipeline. On every PR, it runs your agent against expected outputs, compares scores to your main branch baseline, and shows you exactly which cases regressed before you merge. 100% Free, Open Source.
Try it locally:
npx agentura@latest init
npx agentura@latest run --local
Agentura adds behavioral evals to your CI pipeline. On every PR, it runs your agent against expected outputs, compares scores to your main branch baseline, and shows you exactly which cases regressed before you merge. 100% Free, Open Source.
Try it locally:
npx agentura@latest init npx agentura@latest run --local
Live playground (online): https://playground.agentura.run
Website: https://agentura.run
Repo: https://github.com/SyntheticSynaptic/agentura
Welcome all comments + feedback