Initialize your project
Run `evalflow init` from your project root. It prompts you to choose a provider, a model, and an API key environment variable name.

`evalflow init` creates the following files:

| File | Purpose |
|---|---|
| `evalflow.yaml` | Project config: provider, model, thresholds |
| `prompts/` | Directory for your prompt files |
| `evals/dataset.json` | Starter test dataset |
| `.env.example` | Environment variable template |
| `.evalflow/` | Local run storage and baseline cache |

Non-interactive mode

For CI, Docker, or scripted environments, skip the prompts with `--non-interactive`. There is also a command to list all supported providers and their default models.
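Put together, the two ways to initialize look like this (both commands come straight from the steps above):

```shell
# Interactive setup: prompts for provider, model, and API key env var name
evalflow init

# Skip the prompts in CI, Docker, or scripted environments
evalflow init --non-interactive
```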
`evalflow init` stores the environment variable name, not the key value itself. Your API key never touches the config file.

Write your first test case
Open `evals/dataset.json`. The starter file already contains one example test case. Each test case has the following fields:

| Field | Description |
|---|---|
| `id` | Unique identifier shown in eval output |
| `task_type` | The kind of task: summarization, classification, qa, etc. |
| `input` | The prompt or query sent to the model |
| `expected_output` | The reference output used for scoring |
| `tags` | Labels for filtering runs with `--tag` |
| `eval_config.methods` | Scoring methods to apply; `embedding_similarity` is a good default |
| `eval_config.judge` | Set to `true` to add an LLM-as-judge score |
| `eval_config.weight` | Relative weight of this case when computing the overall score |
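Combining these fields, a single entry in `evals/dataset.json` might look like the following sketch (the field names come from the table above; the values are illustrative, not the actual starter content):

```json
{
  "test_cases": [
    {
      "id": "summarize-release-notes",
      "task_type": "summarization",
      "input": "Summarize the following release notes in two sentences: ...",
      "expected_output": "The release adds a quality gate and fixes two provider bugs.",
      "tags": ["smoke"],
      "eval_config": {
        "methods": ["embedding_similarity"],
        "judge": true,
        "weight": 1.0
      }
    }
  ]
}
```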
Add as many test cases to the `test_cases` array as you need. Each one is independent.

Run the quality gate
Export your API key, then run the eval. evalflow runs your test cases and prints a result for each one. The first run saves a baseline; subsequent runs compare against it, and if scores drop below your configured thresholds, the quality gate fails.

Useful flags
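The flags listed below can be combined on a single run. A sketch of common combinations (the `run` subcommand name, the tag, and the model value are illustrative assumptions):

```shell
# Run only smoke-tagged cases with an overridden model, eight at a time
evalflow run --tag smoke --model gpt-4o-mini --concurrency 8

# After an intentional prompt change, accept the new scores
evalflow run --save-baseline

# Re-score cached responses without making API calls
evalflow run --offline
```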
| Flag | Description |
|---|---|
| `--provider`, `-p` | Override the provider from config |
| `--model`, `-m` | Override the model from config |
| `--tag`, `-t` | Run only test cases with a specific tag |
| `--save-baseline` | Save this run as the new baseline |
| `--offline` | Use cached responses without making API calls |
| `--concurrency` | Number of test cases to run in parallel (default: 5) |

Exit codes

| Code | Meaning |
|---|---|
| `0` | Quality gate passed |
| `1` | Quality regression detected |
| `2` | Setup or provider error |
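Because regressions and setup problems exit with different codes, scripts can branch on the result (a sketch; the `run` subcommand name is an assumption):

```shell
evalflow run
status=$?
if [ "$status" -eq 1 ]; then
  echo "quality regression detected" >&2
elif [ "$status" -eq 2 ]; then
  echo "setup or provider error; try evalflow doctor" >&2
fi
exit "$status"
```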
Add to CI
Create `.github/workflows/evalflow.yml` to run the quality gate on every pull request that touches prompts or eval files. Then add your API key as a repository secret. If you use a different provider, set the matching variable (for example `GROQ_API_KEY` or `ANTHROPIC_API_KEY`) and update the `env` block in the workflow accordingly.

Pull requests that cause a quality regression now fail CI automatically before they can be merged.

Run `evalflow doctor` locally before pushing to catch configuration issues early. The `--no-provider-check` flag in the workflow skips the live API connectivity check, which is handled by the eval step itself.
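The pieces above might combine into a workflow like the following sketch. The file path, the pull-request path filters, and the `--no-provider-check` flag come from this guide; the job layout, install step, `evalflow run` invocation, and `OPENAI_API_KEY` name are assumptions to adapt to your setup:

```yaml
name: evalflow
on:
  pull_request:
    paths:
      - "prompts/**"
      - "evals/**"

jobs:
  quality-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install evalflow
        run: pip install evalflow        # install method is an assumption
      - name: Check configuration
        run: evalflow doctor --no-provider-check
      - name: Run quality gate
        run: evalflow run                # eval command name is an assumption
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}  # match your provider
```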