1. Install evalflow

Install evalflow with pip:
pip install evalflow
Verify the installation:
evalflow --version
evalflow v0.1.2
2. Initialize your project

Run evalflow init from your project root. It prompts you to choose a provider, model, and API key environment variable name:
evalflow init
Choose provider [openai, anthropic, groq, gemini, ollama]: openai
Choose model [gpt-4o-mini]:
API key env var name [OPENAI_API_KEY]:
  evalflow initialized

  Next steps:
  1. Add your API key to your environment:
     export OPENAI_API_KEY="your-key-here"

  2. Run your first eval:
     evalflow eval
evalflow init creates the following files:
File                Purpose
evalflow.yaml       Project config — provider, model, thresholds
prompts/            Directory for your prompt files
evals/dataset.json  Starter test dataset
.env.example        Environment variable template
.evalflow/          Local run storage and baseline cache
Non-interactive mode

For CI, Docker, or scripted environments, skip the prompts with --non-interactive:
evalflow init --provider groq --model llama-3.1-8b-instant --non-interactive
To see all supported providers and their default models:
evalflow init --list-providers
evalflow init stores the environment variable name, not the key value itself. Your API key never touches the config file.
3. Write your first test case

Open evals/dataset.json. The starter file already contains one example test case:
{
  "version": "1.0",
  "test_cases": [
    {
      "id": "example-summarization",
      "description": "Basic summarization test",
      "task_type": "summarization",
      "input": "Summarize in one sentence: The quick brown fox jumps over the lazy dog.",
      "expected_output": "A fox jumps over a dog.",
      "context": "",
      "tags": ["example"],
      "eval_config": {
        "methods": ["embedding_similarity"],
        "judge": false,
        "weight": 1.0
      }
    }
  ]
}
Each test case has the following fields:
Field                Description
id                   Unique identifier shown in eval output
task_type            The kind of task: summarization, classification, qa, etc.
input                The prompt or query sent to the model
expected_output      The reference output used for scoring
tags                 Labels for filtering runs with --tag
eval_config.methods  Scoring methods to apply — embedding_similarity is a good default
eval_config.judge    Set to true to add an LLM-as-judge score
eval_config.weight   Relative weight of this case when computing the overall score
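evalflow's exact scoring formula isn't documented here, but a weighted mean is the natural reading of eval_config.weight. A sketch of how per-case scores and weights could combine into an overall score (the function name and numbers are illustrative):

```python
def overall_score(results: list[tuple[float, float]]) -> float:
    """Weighted mean of (score, weight) pairs — a sketch of how
    eval_config.weight could factor into the overall score."""
    total_weight = sum(weight for _, weight in results)
    return sum(score * weight for score, weight in results) / total_weight

# A case scoring 0.9 with weight 2.0 counts twice as much as a case
# scoring 0.6 with weight 1.0: (0.9*2.0 + 0.6*1.0) / 3.0 = 0.8
score = overall_score([(0.9, 2.0), (0.6, 1.0)])
```

Under this reading, raising a case's weight makes a regression on that case more likely to drag the overall score below your threshold.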
Add as many test cases to the test_cases array as you need. Each one is independent.
Start with 5–10 representative test cases that cover your most important tasks. You can always add more later.
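Since the dataset is plain JSON, you can also grow it programmatically. A sketch that appends a new case following the schema shown above (`add_test_case` is a helper written for this example, not an evalflow API; the commented call assumes you run from the project root):

```python
import json
from pathlib import Path

def add_test_case(dataset_path: Path, case: dict) -> int:
    """Append a case to the test_cases array and return the new count."""
    dataset = json.loads(dataset_path.read_text())
    dataset["test_cases"].append(case)
    dataset_path.write_text(json.dumps(dataset, indent=2) + "\n")
    return len(dataset["test_cases"])

new_case = {
    "id": "qa-capital-france",
    "description": "Basic factual QA test",
    "task_type": "qa",
    "input": "What is the capital of France?",
    "expected_output": "Paris",
    "context": "",
    "tags": ["qa"],
    "eval_config": {
        "methods": ["embedding_similarity"],
        "judge": False,
        "weight": 1.0,
    },
}

# add_test_case(Path("evals/dataset.json"), new_case)
```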
4. Run the quality gate

Export your API key, then run the eval:
export OPENAI_API_KEY="your-key-here"
evalflow eval
evalflow runs your test cases and prints a result for each one:
Running 1 test cases against gpt-4o-mini...

✓ example-summarization    0.91

Quality Gate: PASS
Baseline: saved
Failures: 0
Duration: 2.8s
Run ID: 20260325-a3f9c2d81b4e
The first run saves a baseline. Subsequent runs compare against it — if scores drop below your configured thresholds, the quality gate fails.

Useful flags

Flag             Description
--provider, -p   Override the provider from config
--model, -m      Override the model from config
--tag, -t        Run only test cases with a specific tag
--save-baseline  Save this run as the new baseline
--offline        Use cached responses without making API calls
--concurrency    Number of test cases to run in parallel (default: 5)
Exit codes

Code  Meaning
0     Quality gate passed
1     Quality regression detected
2     Setup or provider error
Exit code 1 is intentional — CI pipelines treat any non-zero exit as a failure, so regressions block the pull request automatically.
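In your own scripts, the exit code is what you branch on. A sketch using Python's subprocess module (the message mapping mirrors the table above; the commented-out lines show where the real invocation would go):

```python
import subprocess

# Documented evalflow exit codes, per the table above.
EXIT_MESSAGES = {
    0: "quality gate passed",
    1: "quality regression detected",
    2: "setup or provider error",
}

def describe_exit(code: int) -> str:
    """Map an exit code to a human-readable status message."""
    return EXIT_MESSAGES.get(code, f"unexpected exit code: {code}")

# In a real script:
# result = subprocess.run(["evalflow", "eval"])
# print(describe_exit(result.returncode))
# raise SystemExit(result.returncode)
```

Propagating the original code (rather than always exiting 0) is what keeps the gate meaningful in CI.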
5. Add to CI

Create .github/workflows/evalflow.yml to run the quality gate on every pull request that touches prompts or eval files:
# .github/workflows/evalflow.yml
name: LLM Quality Gate

on:
  pull_request:
    paths:
      - "prompts/**"
      - "evals/**"
      - "**.py"
      - "evalflow.yaml"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install evalflow
      - run: evalflow doctor --no-provider-check
      - run: evalflow eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
Then add your API key as a repository secret:
GitHub → Settings → Secrets and variables → Actions → New repository secret

Name:  OPENAI_API_KEY
Value: your real provider key
If you use a different provider, set the matching variable — for example GROQ_API_KEY or ANTHROPIC_API_KEY — and update the env block in the workflow accordingly.

Pull requests that cause a quality regression now fail CI automatically before they can be merged.
Run evalflow doctor locally before pushing to catch configuration issues early. The --no-provider-check flag in the workflow skips the live API connectivity check, which is handled by the eval step itself.