A quality gate is the rule that decides whether an evalflow run should block a pull request. When you run evalflow eval, the process exits with a numeric code your CI system can act on.
## Run outcomes
Every run ends in one of three states:
| Outcome | Meaning |
|---|---|
| PASS | Quality is at or above the configured threshold |
| FAIL | Quality dropped below the threshold |
| ERROR | Setup, config, or provider execution failed |
A passing run prints a summary like:

```
Quality Gate: PASS
Baseline: saved
Failures: 0
Duration: 2.8s
```
## Exit codes
evalflow maps each outcome to a standard exit code:
| Exit code | Outcome | Cause |
|---|---|---|
| 0 | PASS | All test cases met the threshold |
| 1 | FAIL | One or more test cases fell below the threshold |
| 2 | ERROR | Config, provider, or execution error |
CI systems (GitHub Actions, GitLab CI, CircleCI) treat any non-zero exit code as a failure. Exit code 1 blocks the PR because quality regressed. Exit code 2 blocks it because something is broken. Both deserve attention, but for different reasons.
Separating quality failures (1) from setup errors (2) lets you write different triage steps in your pipeline — for example, alerting on errors while treating quality failures as expected review signals.
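This branching can be sketched as a small CI step. The `triage` function and its action strings below are hypothetical, for illustration only; the exit-code meanings (0/1/2) come from the table above.

```shell
#!/bin/sh
# Sketch: branch a CI step on evalflow's exit code.
# The triage helper and its messages are illustrative, not part of evalflow.
triage() {
  case "$1" in
    0) echo "pass: allow merge" ;;
    1) echo "quality regression: block PR, request prompt review" ;;
    2) echo "setup error: block PR, alert the pipeline owner" ;;
    *) echo "unexpected exit code $1" ;;
  esac
}

# In a real pipeline you would run: evalflow eval; triage "$?"
triage 1
```

A pipeline could route exit code 2 to an alerting channel while exit code 1 simply blocks the merge and waits for a human review.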
## Baselines and regression detection
A baseline is the saved result of the last accepted run for the same dataset. evalflow compares every new run against the baseline so it catches regressions rather than just low absolute scores.
The run summary reports the baseline state and the overall delta against it:

```
Baseline: saved
Δ overall: +0.03
```
When you run evalflow eval, the result is compared per test case against the stored baseline. If any case drops below its previous score by more than the configured threshold, the run fails.
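The per-case comparison can be sketched as a short script. Everything specific here is an assumption for illustration: the `"name score"` line format, the inline sample scores, and the 0.05 drop threshold are not evalflow's real storage format or defaults.

```shell
#!/bin/sh
# Sketch of per-case regression detection against a stored baseline.
# The "name score" format and the 0.05 threshold are assumptions.
baseline="summarize-short-article 0.88
answer-with-context 1.00"
current="summarize-short-article 0.91
answer-with-context 0.61"

failures=0
while read -r name new; do
  # Look up this case's score in the baseline.
  old=$(printf '%s\n' "$baseline" | awk -v n="$name" '$1 == n { print $2 }')
  # Fail the case if it dropped below its baseline by more than the threshold.
  if awk -v o="$old" -v c="$new" 'BEGIN { exit !(o - c > 0.05) }'; then
    echo "regression: $name $new (was $old)"
    failures=$((failures + 1))
  fi
done <<EOF
$current
EOF
echo "Failures: $failures"
```

Note that an improved case (summarize-short-article here) does not offset a regressed one: any single case dropping past the threshold fails the run.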
Save a new baseline explicitly with evalflow eval --save-baseline after you accept a result as the new standard. The baseline advances only when you tell it to.
The baseline is stored locally in .evalflow/. Commit that directory to version control so every developer and your CI runner share the same reference point.
## What a prompt regression looks like
A prompt regression occurs when a change improves one case but silently harms another. This is the failure mode evalflow is designed to catch before a change ships.
```
✓ summarize-short-article 0.91 (was 0.88)
✗ answer-with-context    0.61 (was 1.00)
Quality Gate: FAIL
Failures: 1
```
In this example, the summarization test improved, but the answer quality test dropped from a perfect score to 0.61. Without a quality gate that tracks both, the regression in answer-with-context would have merged undetected.
Improving one prompt can degrade another, especially when prompts share a system instruction or when a model update affects different task types differently. Always run the full dataset, not just the cases you changed.