A quality gate is the rule that decides whether an evalflow run should block a pull request. When you run evalflow eval, the process exits with a numeric code your CI system can act on.
## Run outcomes
Every run ends in one of three states:
| Outcome | Meaning |
|---|---|
| PASS | Quality is at or above the configured threshold |
| FAIL | Quality dropped below the threshold |
| ERROR | Setup, config, or provider execution failed |
A passing run prints a summary like:

```
Quality Gate: PASS
Baseline: saved
Failures: 0
Duration: 2.8s
```
## Exit codes
evalflow maps each outcome to a standard exit code:
| Exit code | Outcome | Cause |
|---|---|---|
| 0 | PASS | All test cases met the threshold |
| 1 | FAIL | One or more test cases fell below the threshold |
| 2 | ERROR | Config, provider, or execution error |
CI systems (GitHub Actions, GitLab CI, CircleCI) treat any non-zero exit code as a failure. Exit code 1 blocks the PR because quality regressed. Exit code 2 blocks it because something is broken. Both deserve attention, but for different reasons.
Separating quality failures (1) from setup errors (2) lets you write different triage steps in your pipeline — for example, alerting on errors while treating quality failures as expected review signals.
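This branching can be sketched as a small CI step. The `triage` function and its action strings below are hypothetical, for illustration only; the exit-code meanings (0/1/2) come from the table above.

```shell
#!/bin/sh
# Sketch: branch a CI step on evalflow's exit code.
# The triage helper and its messages are illustrative, not part of evalflow.
triage() {
  case "$1" in
    0) echo "pass: allow merge" ;;
    1) echo "quality regression: block PR, request prompt review" ;;
    2) echo "setup error: block PR, alert the pipeline owner" ;;
    *) echo "unexpected exit code $1" ;;
  esac
}

# In a real pipeline you would run: evalflow eval; triage "$?"
triage 1
```

A pipeline could route exit code 2 to an alerting channel while exit code 1 simply blocks the merge and waits for a human review.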
## Baselines and regression detection
A baseline is the saved result of the last accepted run for the same dataset. evalflow compares every new run against the baseline so it catches regressions rather than just low absolute scores.
The run summary reports the baseline state and the overall delta against it:

```
Baseline: saved
Δ overall: +0.03
```
When you run evalflow eval, the result is compared per test case against the stored baseline. If any case drops below its previous score by more than the configured threshold, the run fails.
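The per-case comparison can be sketched as a short script. Everything specific here is an assumption for illustration: the `"name score"` line format, the inline sample scores, and the 0.05 drop threshold are not evalflow's real storage format or defaults.

```shell
#!/bin/sh
# Sketch of per-case regression detection against a stored baseline.
# The "name score" format and the 0.05 threshold are assumptions.
baseline="summarize-short-article 0.88
answer-with-context 1.00"
current="summarize-short-article 0.91
answer-with-context 0.61"

failures=0
while read -r name new; do
  # Look up this case's score in the baseline.
  old=$(printf '%s\n' "$baseline" | awk -v n="$name" '$1 == n { print $2 }')
  # Fail the case if it dropped below its baseline by more than the threshold.
  if awk -v o="$old" -v c="$new" 'BEGIN { exit !(o - c > 0.05) }'; then
    echo "regression: $name $new (was $old)"
    failures=$((failures + 1))
  fi
done <<EOF
$current
EOF
echo "Failures: $failures"
```

Note that an improved case (summarize-short-article here) does not offset a regressed one: any single case dropping past the threshold fails the run.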
Save a new baseline explicitly with evalflow eval --save-baseline after you accept a result as the new standard. The baseline advances only when you tell it to.
The baseline is stored locally in .evalflow/. Commit that directory to version control so every developer and your CI runner share the same reference point.
## What a prompt regression looks like
A prompt regression occurs when a change improves one case but silently harms another. This is the failure mode evalflow is designed to catch before a change ships.
```
✓ summarize-short-article 0.91 (was 0.88)
✗ answer-with-context    0.61 (was 1.00)
Quality Gate: FAIL
Failures: 1
```
In this example, the summarization test improved, but the answer quality test dropped from a perfect score to 0.61. Without a quality gate that tracks both, the regression in answer-with-context would have merged undetected.
Improving one prompt can degrade another, especially when prompts share a system instruction or when a model update affects different task types differently. Always run the full dataset, not just the cases you changed.