1. Install evalflow

Install evalflow with pip:
pip install evalflow
Verify the installation:
evalflow --version
evalflow v0.1.2
2. Initialize your project

Run evalflow init from your project root. It prompts you to choose a provider, model, and API key environment variable name:
evalflow init
Choose provider [openai, anthropic, groq, gemini, ollama]: openai
Choose model [gpt-4o-mini]:
API key env var name [OPENAI_API_KEY]:
  evalflow initialized

  Next steps:
  1. Add your API key to your environment:
     export OPENAI_API_KEY="your-key-here"

  2. Run your first eval:
     evalflow eval
evalflow init creates the following files:
File                Purpose
evalflow.yaml       Project config — provider, model, thresholds
prompts/            Directory for your prompt files
evals/dataset.json  Starter test dataset
.env.example        Environment variable template
.evalflow/          Local run storage and baseline cache
Non-interactive mode

For CI, Docker, or scripted environments, skip the prompts with --non-interactive:
evalflow init --provider groq --model llama-3.1-8b-instant --non-interactive
To see all supported providers and their default models:
evalflow init --list-providers
evalflow init stores the environment variable name, not the key value itself. Your API key never touches the config file.
3. Write your first test case

Open evals/dataset.json. The starter file already contains one example test case:
{
  "version": "1.0",
  "test_cases": [
    {
      "id": "example-summarization",
      "description": "Basic summarization test",
      "task_type": "summarization",
      "input": "Summarize in one sentence: The quick brown fox jumps over the lazy dog.",
      "expected_output": "A fox jumps over a dog.",
      "context": "",
      "tags": ["example"],
      "eval_config": {
        "methods": ["embedding_similarity"],
        "judge": false,
        "weight": 1.0
      }
    }
  ]
}
Each test case has the following fields:
Field                Description
id                   Unique identifier shown in eval output
task_type            The kind of task: summarization, classification, qa, etc.
input                The prompt or query sent to the model
expected_output      The reference output used for scoring
tags                 Labels for filtering runs with --tag
eval_config.methods  Scoring methods to apply — embedding_similarity is a good default
eval_config.judge    Set to true to add an LLM-as-judge score
eval_config.weight   Relative weight of this case when computing the overall score
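evalflow's exact scoring formula isn't documented here, but a weighted mean is the natural reading of eval_config.weight. A sketch of how per-case scores and weights could combine into an overall score (the function name and numbers are illustrative):

```python
def overall_score(results: list[tuple[float, float]]) -> float:
    """Weighted mean of (score, weight) pairs — a sketch of how
    eval_config.weight could factor into the overall score."""
    total_weight = sum(weight for _, weight in results)
    return sum(score * weight for score, weight in results) / total_weight

# A case scoring 0.9 with weight 2.0 counts twice as much as a case
# scoring 0.6 with weight 1.0: (0.9*2.0 + 0.6*1.0) / 3.0 = 0.8
score = overall_score([(0.9, 2.0), (0.6, 1.0)])
```

Under this reading, raising a case's weight makes a regression on that case more likely to drag the overall score below your threshold.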
Add as many test cases to the test_cases array as you need. Each one is independent.
Start with 5–10 representative test cases that cover your most important tasks. You can always add more later.
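Since the dataset is plain JSON, you can also grow it programmatically. A sketch that appends a new case following the schema shown above (`add_test_case` is a helper written for this example, not an evalflow API; the commented call assumes you run from the project root):

```python
import json
from pathlib import Path

def add_test_case(dataset_path: Path, case: dict) -> int:
    """Append a case to the test_cases array and return the new count."""
    dataset = json.loads(dataset_path.read_text())
    dataset["test_cases"].append(case)
    dataset_path.write_text(json.dumps(dataset, indent=2) + "\n")
    return len(dataset["test_cases"])

new_case = {
    "id": "qa-capital-france",
    "description": "Basic factual QA test",
    "task_type": "qa",
    "input": "What is the capital of France?",
    "expected_output": "Paris",
    "context": "",
    "tags": ["qa"],
    "eval_config": {
        "methods": ["embedding_similarity"],
        "judge": False,
        "weight": 1.0,
    },
}

# add_test_case(Path("evals/dataset.json"), new_case)
```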
4. Run the quality gate

Export your API key, then run the eval:
export OPENAI_API_KEY="your-key-here"
evalflow eval
evalflow runs your test cases and prints a result for each one:
Running 1 test cases against gpt-4o-mini...

✓ example-summarization    0.91

Quality Gate: PASS
Baseline: saved
Failures: 0
Duration: 2.8s
Run ID: 20260325-a3f9c2d81b4e
The first run saves a baseline. Subsequent runs compare against it — if scores drop below your configured thresholds, the quality gate fails.

Useful flags

Flag             Description
--provider, -p   Override the provider from config
--model, -m      Override the model from config
--tag, -t        Run only test cases with a specific tag
--save-baseline  Save this run as the new baseline
--offline        Use cached responses without making API calls
--concurrency    Number of test cases to run in parallel (default: 5)
Exit codes

Code  Meaning
0     Quality gate passed
1     Quality regression detected
2     Setup or provider error
Exit code 1 is intentional — CI pipelines treat any non-zero exit as a failure, so regressions block the pull request automatically.
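In your own scripts, the exit code is what you branch on. A sketch using Python's subprocess module (the message mapping mirrors the table above; the commented-out lines show where the real invocation would go):

```python
import subprocess

# Documented evalflow exit codes, per the table above.
EXIT_MESSAGES = {
    0: "quality gate passed",
    1: "quality regression detected",
    2: "setup or provider error",
}

def describe_exit(code: int) -> str:
    """Map an exit code to a human-readable status message."""
    return EXIT_MESSAGES.get(code, f"unexpected exit code: {code}")

# In a real script:
# result = subprocess.run(["evalflow", "eval"])
# print(describe_exit(result.returncode))
# raise SystemExit(result.returncode)
```

Propagating the original code (rather than always exiting 0) is what keeps the gate meaningful in CI.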
5. Add to CI

Create .github/workflows/evalflow.yml to run the quality gate on every pull request that touches prompts or eval files:
# .github/workflows/evalflow.yml
name: LLM Quality Gate

on:
  pull_request:
    paths:
      - "prompts/**"
      - "evals/**"
      - "**.py"
      - "evalflow.yaml"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install evalflow
      - run: evalflow doctor --no-provider-check
      - run: evalflow eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
Then add your API key as a repository secret:
GitHub → Settings → Secrets and variables → Actions → New repository secret

Name:  OPENAI_API_KEY
Value: your real provider key
If you use a different provider, set the matching variable — for example GROQ_API_KEY or ANTHROPIC_API_KEY — and update the env block in the workflow accordingly.

Pull requests that cause a quality regression now fail CI automatically before they can be merged.
Run evalflow doctor locally before pushing to catch configuration issues early. The --no-provider-check flag in the workflow skips the live API connectivity check, which is handled by the eval step itself.