
evalflow init

Set up evalflow in your project. By default, it runs an interactive wizard that prompts for a provider, model, and API key environment variable name. Use --non-interactive for CI, Docker, or scripted environments.
Syntax
evalflow init [--provider PROVIDER] [--model MODEL] [--non-interactive] [--force] [--list-providers]
Flags
--provider
string
LLM provider to configure. Accepted values: openai, anthropic, groq, gemini, ollama. Defaults to openai in non-interactive mode.
--model
string
Model name to use. Defaults to the provider’s recommended model when omitted.
--non-interactive
boolean
Skip all prompts. Also accepted as --yes or -y. Required when stdin is not a TTY (CI, Docker).
--force
boolean
Overwrite an existing evalflow.yaml without asking.
--list-providers
boolean
Print supported providers and their default models, then exit. Does not write any files.
Files created
evalflow init writes the following files on first run:
  • evalflow.yaml — project configuration
  • evals/dataset.json — starter dataset with one example test case
  • prompts/ — empty directory for prompt YAML files
  • .env.example — template for required environment variables
  • .evalflow/ — local storage directory (runs database, cache)
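The starter dataset is a JSON list of test cases. Based on the fields referenced elsewhere in this reference (id, input, expected_output, tags), a minimal test case might look like the sketch below; the exact schema may differ in your evalflow version:

```json
[
  {
    "id": "summarize-short-article",
    "input": "Summarize the following article in one sentence: ...",
    "expected_output": "A one-sentence summary of the article.",
    "tags": ["smoke", "critical"]
  }
]
```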
Examples
Interactive setup:
evalflow init
Choose provider [openai, anthropic, groq, gemini, ollama]: openai
Choose model [gpt-4o-mini]:
API key env var name [OPENAI_API_KEY]:
  evalflow initialized

  Next steps:
  1. Add your API key to your environment:
     export OPENAI_API_KEY="your-key-here"

  2. Run your first eval:
     evalflow eval
Non-interactive setup for CI:
evalflow init --provider groq --model llama-3.1-8b-instant --non-interactive
List supported providers:
evalflow init --list-providers
Provider    Default model
openai      gpt-4o-mini
anthropic   claude-3-5-haiku-20241022
groq        llama-3.1-8b-instant
gemini      gemini-1.5-flash
ollama      llama3.2

evalflow eval

Run the LLM quality gate against your dataset. Each test case is scored and compared to the saved baseline. The command exits with code 0 on pass, 1 on quality failure (which blocks CI), and 2 on an unexpected error.
Syntax
evalflow eval [--provider PROVIDER] [--model MODEL] [--dataset PATH] [--tag TAG]
              [--offline] [--debug] [--save-baseline] [--concurrency N]
Flags
--provider, -p
string
Override the provider set in evalflow.yaml. Accepted values: openai, anthropic, groq, gemini, ollama.
--model, -m
string
Override the model set in evalflow.yaml for this run.
--dataset, -d
string
Path to a dataset JSON file. Defaults to evals/dataset.json (or the value in evalflow.yaml).
--tag, -t
string
Run only test cases whose tags array contains this value. Useful for running a smoke or critical subset.
--offline
boolean
Use cached LLM responses instead of making live API calls. Useful for re-scoring without consuming API quota.
--debug
boolean
Print full exception tracebacks on unexpected errors. Not recommended for production use.
--save-baseline
boolean
Save this run as the new baseline. Future runs are compared against it. The first run always saves a baseline automatically.
--concurrency
number
Default: 5
Maximum number of test cases to run in parallel. Minimum value is 1.
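The --tag filter selects only test cases whose tags array contains the given value. A minimal sketch of that selection logic in Python (the helper and sample data here are illustrative, not evalflow internals):

```python
import json

def filter_by_tag(cases, tag):
    """Keep only test cases whose "tags" list contains the given tag."""
    return [c for c in cases if tag in c.get("tags", [])]

cases = json.loads("""
[
  {"id": "summarize-short-article", "tags": ["smoke"]},
  {"id": "answer-with-context", "tags": ["critical", "smoke"]},
  {"id": "extract-key-dates", "tags": []}
]
""")

critical = filter_by_tag(cases, "critical")
print([c["id"] for c in critical])  # → ['answer-with-context']
```

A case with no tags field (or an empty tags list) is simply skipped, so untagged cases never match any --tag run.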
Example
evalflow eval --provider openai --model gpt-4o-mini --dataset evals/dataset.json --tag critical --save-baseline
Running 5 test cases against gpt-4o-mini...

✓ summarize-short-article      0.91
✓ classify-support-priority    1.00
✗ answer-with-context          0.61
✓ extract-key-dates            0.87
✓ rewrite-formal-tone          0.83

Quality Gate: FAIL
Baseline: saved
Failures: 1
Duration: 4.2s
Run ID: 20260325-a3f9c2d81b4e
evalflow eval exits with code 1 when the quality gate fails. CI systems interpret a non-zero exit code as a pipeline failure, blocking the PR automatically.
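Because any non-zero exit code fails a CI job, wiring evalflow into a pipeline needs no extra logic. A hypothetical GitHub Actions step (the step name and secret wiring are assumptions about your setup, not part of evalflow):

```yaml
# Sketch of a CI quality-gate step; evalflow must already be installed in the job.
- name: Run evalflow quality gate
  run: evalflow eval --tag critical
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```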

evalflow doctor

Check your local evalflow setup. Verifies the installation, configuration file, dataset, environment variables, storage, and optional provider health. Run this first when something isn’t working.
Syntax
evalflow doctor [--fix] [--check-providers | --no-provider-check] [--validate-config]
Flags
--fix
boolean
Automatically apply supported fixes, such as adding required entries to .gitignore.
--check-providers / --no-provider-check
boolean
Run a live health check against each configured provider. Off by default to avoid unnecessary API calls.
--validate-config
boolean
Validate evalflow.yaml syntax, field values, and configured API key environment variables, then exit.
Example
evalflow doctor --check-providers
✓ evalflow 0.1.2 installed
✓ Python 3.11.9
✓ evalflow.yaml found
✓ evalflow.yaml valid
✓ dataset.json found (3 test cases)
✓ dataset.json valid
✓ .evalflow directory exists
✓ SQLite database accessible
✓ Response cache: 0 entries
✓ Git repository detected
✓ OPENAI_API_KEY set
✓ openai health check
! sentence-transformers not installed (optional - needed for embedding_similarity)
✗ .gitignore missing .env entry

1 issue found. Run evalflow doctor --fix to resolve.

evalflow runs

List recent eval runs stored in the local SQLite history. Each row shows the run ID, date, provider, model, overall score, and pass/fail status.
Syntax
evalflow runs [--limit N] [--since WINDOW] [--failed-only]
Flags
--limit, -n
number
Default: 20
Maximum number of runs to display.
--since
string
Show only runs newer than a time window. Accepted formats: 7d (days) or 24h (hours).
--failed-only
boolean
Show only runs that failed the quality gate.
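The --since window accepts values like 7d or 24h. A sketch of how such a window can be parsed into a cutoff interval (illustrative Python, not evalflow's implementation):

```python
import re
from datetime import timedelta

def parse_window(window: str) -> timedelta:
    """Parse a time window like "7d" (days) or "24h" (hours)."""
    m = re.fullmatch(r"(\d+)([dh])", window)
    if not m:
        raise ValueError(f"invalid window: {window!r} (expected e.g. 7d or 24h)")
    value, unit = int(m.group(1)), m.group(2)
    return timedelta(days=value) if unit == "d" else timedelta(hours=value)

print(parse_window("7d"))   # → 7 days, 0:00:00
print(parse_window("24h"))  # → 1 day, 0:00:00
```

Runs older than now minus this interval would then be filtered out of the listing.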
Example
evalflow runs --limit 20 --since 7d --failed-only
Run ID                  Date         Provider  Model         Score  Status
20260325-a3f9c2d81b4e   2026-03-25   openai    gpt-4o-mini   0.89   pass
20260324-b4c2e1f39d7a   2026-03-24   openai    gpt-4o-mini   0.71   fail

evalflow compare RUN_A RUN_B

Compare two runs side by side. Shows the overall score change and highlights which test cases improved or regressed. Partial run IDs are accepted as long as they resolve to a unique run.
Syntax
evalflow compare RUN_A RUN_B
Arguments
RUN_A
string
required
The first run ID (or a unique prefix of at least 8 characters).
RUN_B
string
required
The second run ID (or a unique prefix of at least 8 characters).
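A partial run ID resolves only when the prefix matches exactly one stored run. A sketch of that resolution rule (illustrative, not evalflow internals):

```python
def resolve_run(prefix: str, run_ids: list[str]) -> str:
    """Resolve a run ID prefix; fail if it is unknown or ambiguous."""
    matches = [r for r in run_ids if r.startswith(prefix)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise LookupError(f"no run matches {prefix!r}")
    raise LookupError(f"ambiguous prefix {prefix!r}: {len(matches)} matches")

runs = ["20260325-a3f9c2d81b4e", "20260324-b4c2e1f39d7a"]
print(resolve_run("20260325-a3f9", runs))  # → 20260325-a3f9c2d81b4e
```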
Example
evalflow compare 20260325-a3f9 20260324-b4c2
Overall score: +0.04

summarize-short-article     improved    0.87 → 0.91
answer-with-context         regressed   0.88 → 0.61
classify-support-priority   unchanged   1.00 → 1.00
Use evalflow runs first to find the run IDs you want to compare.

evalflow prompt create NAME

Create a new prompt YAML file in the prompts/ directory. The name must be in lowercase kebab-case.
Syntax
evalflow prompt create NAME
Example
evalflow prompt create summarization
✓ Created prompts/summarization.yaml
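Prompt names must be lowercase kebab-case: lowercase words joined by single hyphens. One way to express that rule (the exact regex evalflow uses is an assumption):

```python
import re

KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def is_kebab_case(name: str) -> bool:
    """True for names like "summarization" or "answer-with-context"."""
    return bool(KEBAB_CASE.fullmatch(name))

print(is_kebab_case("summarization"))        # → True
print(is_kebab_case("answer-with-context"))  # → True
print(is_kebab_case("My_Prompt"))            # → False
```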

evalflow prompt list

List all prompts in the registry with their current version, status, author, and creation date.
Syntax
evalflow prompt list
Example
evalflow prompt list
ID              Version  Status      Author   Created
summarization   1        draft       unknown  2026-03-25
classification  2        staging     unknown  2026-03-20

evalflow prompt diff NAME V1 V2

Show a diff between two stored versions of a prompt. Lines removed in the newer version are prefixed with -, and lines added are prefixed with +.
Syntax
evalflow prompt diff NAME V1 V2
Arguments
NAME
string
required
The prompt name, matching the file in prompts/.
V1
number
required
The older version number.
V2
number
required
The newer version number.
Example
evalflow prompt diff summarization 1 2
- You are a helpful assistant.
+ You are a concise summarization assistant. Always respond in one sentence.
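The -/+ output is a standard line diff, which you could reproduce yourself with Python's difflib (a sketch; evalflow's own diff formatting may differ):

```python
import difflib

v1 = ["You are a helpful assistant."]
v2 = ["You are a concise summarization assistant. Always respond in one sentence."]

# unified_diff yields header lines (---, +++, @@) followed by -/+ hunk lines.
for line in difflib.unified_diff(v1, v2, "v1", "v2", lineterm=""):
    print(line)
```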

evalflow prompt promote NAME --to STATUS

Promote a prompt version to staging or production. Run evalflow eval before promoting to production to confirm quality has not regressed.
Syntax
evalflow prompt promote NAME --to STATUS
Flags
--to
string
required
Target status. Accepted values: staging, production.
Example
evalflow prompt promote summarization --to production
Consider running evalflow eval before promoting to production.
summarization promoted to production
Promoting to production does not automatically run an eval. Always verify quality with evalflow eval first.

evalflow dataset lint

Validate your dataset file without running a full eval. Checks structure, field types, and per-test-case quality rules.
Syntax
evalflow dataset lint [PATH]
Arguments
PATH
string
Default: evals/dataset.json
Path to the dataset JSON file to validate. Defaults to evals/dataset.json.
Example
evalflow dataset lint
✓ dataset.json valid (3 test cases)
✓ summarize-release-notes: id kebab-case
✓ summarize-release-notes: input non-empty
✓ summarize-release-notes: expected_output reasonable length
...

Dataset lint passed.
If the dataset contains issues, evalflow dataset lint exits with code 2 and prints the failing checks.
Run evalflow dataset lint after editing your dataset to catch schema errors before they surface during a full evalflow eval run.
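The per-test-case checks shown in the example output (kebab-case id, non-empty input, reasonable expected_output) can be sketched as a small validator; the exact rules evalflow enforces may differ:

```python
import re

KEBAB = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def lint_case(case: dict) -> list[str]:
    """Return human-readable problems for one test case (illustrative checks)."""
    problems = []
    if not KEBAB.fullmatch(case.get("id", "")):
        problems.append("id is not kebab-case")
    if not str(case.get("input", "")).strip():
        problems.append("input is empty")
    if not str(case.get("expected_output", "")).strip():
        problems.append("expected_output is empty")
    return problems

case = {"id": "Summarize Notes", "input": "", "expected_output": "ok"}
print(lint_case(case))  # → ['id is not kebab-case', 'input is empty']
```

A clean dataset is one where lint_case returns an empty list for every case; any non-empty result corresponds to a failing check in the lint output.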