
evalflow init

Set up evalflow in your project. By default, it runs an interactive wizard that prompts for a provider, model, and API key environment variable name. Use --non-interactive for CI, Docker, or scripted environments.
Syntax
evalflow init [--provider PROVIDER] [--model MODEL] [--non-interactive] [--force] [--list-providers]
Flags
--provider
string
LLM provider to configure. Accepted values: openai, anthropic, groq, gemini, ollama. Defaults to openai in non-interactive mode.
--model
string
Model name to use. Defaults to the provider’s recommended model when omitted.
--non-interactive
boolean
Skip all prompts. Also accepted as --yes or -y. Required when stdin is not a TTY (CI, Docker).
--force
boolean
Overwrite an existing evalflow.yaml without asking.
--list-providers
boolean
Print supported providers and their default models, then exit. Does not write any files.
Files created
evalflow init writes the following files on first run:
  • evalflow.yaml — project configuration
  • evals/dataset.json — starter dataset with one example test case
  • prompts/ — empty directory for prompt YAML files
  • .env.example — template for required environment variables
  • .evalflow/ — local storage directory (runs database, cache)
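The starter dataset is a JSON list of test cases. Based on the fields referenced elsewhere in this reference (id, input, expected_output, tags), a minimal test case might look like the sketch below; the exact schema may differ in your evalflow version:

```json
[
  {
    "id": "summarize-short-article",
    "input": "Summarize the following article in one sentence: ...",
    "expected_output": "A one-sentence summary of the article.",
    "tags": ["smoke", "critical"]
  }
]
```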
Examples
Interactive setup:
evalflow init
Choose provider [openai, anthropic, groq, gemini, ollama]: openai
Choose model [gpt-4o-mini]:
API key env var name [OPENAI_API_KEY]:
  evalflow initialized

  Next steps:
  1. Add your API key to your environment:
     export OPENAI_API_KEY="your-key-here"

  2. Run your first eval:
     evalflow eval
Non-interactive setup for CI:
evalflow init --provider groq --model llama-3.1-8b-instant --non-interactive
List supported providers:
evalflow init --list-providers
Provider    Default model
openai      gpt-4o-mini
anthropic   claude-3-5-haiku-20241022
groq        llama-3.1-8b-instant
gemini      gemini-1.5-flash
ollama      llama3.2

evalflow eval

Run the LLM quality gate against your dataset. Each test case is scored and compared to the saved baseline. The command exits with code 0 on pass, 1 on quality failure (which blocks CI), and 2 on an unexpected error.
Syntax
evalflow eval [--provider PROVIDER] [--model MODEL] [--dataset PATH] [--tag TAG]
              [--offline] [--debug] [--save-baseline] [--concurrency N]
Flags
--provider, -p
string
Override the provider set in evalflow.yaml. Accepted values: openai, anthropic, groq, gemini, ollama.
--model, -m
string
Override the model set in evalflow.yaml for this run.
--dataset, -d
string
Path to a dataset JSON file. Defaults to evals/dataset.json (or the value in evalflow.yaml).
--tag, -t
string
Run only test cases whose tags array contains this value. Useful for running a smoke or critical subset.
--offline
boolean
Use cached LLM responses instead of making live API calls. Useful for re-scoring without consuming API quota.
--debug
boolean
Print full exception tracebacks on unexpected errors. Not recommended for production use.
--save-baseline
boolean
Save this run as the new baseline. Future runs are compared against it. The first run always saves a baseline automatically.
--concurrency
number
Default: 5
Maximum number of test cases to run in parallel. Minimum value is 1.
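The --tag filter selects only test cases whose tags array contains the given value. A minimal sketch of that selection logic in Python (the helper and sample data here are illustrative, not evalflow internals):

```python
import json

def filter_by_tag(cases, tag):
    """Keep only test cases whose "tags" list contains the given tag."""
    return [c for c in cases if tag in c.get("tags", [])]

cases = json.loads("""
[
  {"id": "summarize-short-article", "tags": ["smoke"]},
  {"id": "answer-with-context", "tags": ["critical", "smoke"]},
  {"id": "extract-key-dates", "tags": []}
]
""")

critical = filter_by_tag(cases, "critical")
print([c["id"] for c in critical])  # → ['answer-with-context']
```

A case with no tags field (or an empty tags list) is simply skipped, so untagged cases never match any --tag run.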
Example
evalflow eval --provider openai --model gpt-4o-mini --dataset evals/dataset.json --tag critical --save-baseline
Running 5 test cases against gpt-4o-mini...

✓ summarize-short-article      0.91
✓ classify-support-priority    1.00
✗ answer-with-context          0.61
✓ extract-key-dates            0.87
✓ rewrite-formal-tone          0.83

Quality Gate: FAIL
Baseline: saved
Failures: 1
Duration: 4.2s
Run ID: 20260325-a3f9c2d81b4e
evalflow eval exits with code 1 when the quality gate fails. CI systems interpret a non-zero exit code as a pipeline failure, blocking the PR automatically.
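Because any non-zero exit code fails a CI job, wiring evalflow into a pipeline needs no extra logic. A hypothetical GitHub Actions step (the step name and secret wiring are assumptions about your setup, not part of evalflow):

```yaml
# Sketch of a CI quality-gate step; evalflow must already be installed in the job.
- name: Run evalflow quality gate
  run: evalflow eval --tag critical
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```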

evalflow doctor

Check your local evalflow setup. Verifies the installation, configuration file, dataset, environment variables, storage, and optional provider health. Run this first when something isn’t working.
Syntax
evalflow doctor [--fix] [--check-providers | --no-provider-check] [--validate-config]
Flags
--fix
boolean
Automatically apply supported fixes, such as adding required entries to .gitignore.
--check-providers / --no-provider-check
boolean
Run a live health check against each configured provider. Off by default to avoid unnecessary API calls.
--validate-config
boolean
Validate evalflow.yaml syntax, field values, and configured API key environment variables, then exit.
Example
evalflow doctor --check-providers
✓ evalflow 0.1.2 installed
✓ Python 3.11.9
✓ evalflow.yaml found
✓ evalflow.yaml valid
✓ dataset.json found (3 test cases)
✓ dataset.json valid
✓ .evalflow directory exists
✓ SQLite database accessible
✓ Response cache: 0 entries
✓ Git repository detected
✓ OPENAI_API_KEY set
✓ openai health check
! sentence-transformers not installed (optional - needed for embedding_similarity)
✗ .gitignore missing .env entry

1 issue found. Run evalflow doctor --fix to resolve.

evalflow runs

List recent eval runs stored in the local SQLite history. Each row shows the run ID, date, provider, model, overall score, and pass/fail status.
Syntax
evalflow runs [--limit N] [--since WINDOW] [--failed-only]
Flags
--limit, -n
number
Default: 20
Maximum number of runs to display.
--since
string
Show only runs newer than a time window. Accepted formats: 7d (days) or 24h (hours).
--failed-only
boolean
Show only runs that failed the quality gate.
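The --since window accepts values like 7d or 24h. A sketch of how such a window can be parsed into a cutoff interval (illustrative Python, not evalflow's implementation):

```python
import re
from datetime import timedelta

def parse_window(window: str) -> timedelta:
    """Parse a time window like "7d" (days) or "24h" (hours)."""
    m = re.fullmatch(r"(\d+)([dh])", window)
    if not m:
        raise ValueError(f"invalid window: {window!r} (expected e.g. 7d or 24h)")
    value, unit = int(m.group(1)), m.group(2)
    return timedelta(days=value) if unit == "d" else timedelta(hours=value)

print(parse_window("7d"))   # → 7 days, 0:00:00
print(parse_window("24h"))  # → 1 day, 0:00:00
```

Runs older than now minus this interval would then be filtered out of the listing.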
Example
evalflow runs --limit 20 --since 7d --failed-only
Run ID                  Date         Provider  Model         Score  Status
20260325-a3f9c2d81b4e   2026-03-25   openai    gpt-4o-mini   0.89   pass
20260324-b4c2e1f39d7a   2026-03-24   openai    gpt-4o-mini   0.71   fail

evalflow compare RUN_A RUN_B

Compare two runs side by side. Shows the overall score change and highlights which test cases improved or regressed. Partial run IDs are accepted as long as they resolve to a unique run.
Syntax
evalflow compare RUN_A RUN_B
Arguments
RUN_A
string
required
The first run ID (or a unique prefix of at least 8 characters).
RUN_B
string
required
The second run ID (or a unique prefix of at least 8 characters).
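A partial run ID resolves only when the prefix matches exactly one stored run. A sketch of that resolution rule (illustrative, not evalflow internals):

```python
def resolve_run(prefix: str, run_ids: list[str]) -> str:
    """Resolve a run ID prefix; fail if it is unknown or ambiguous."""
    matches = [r for r in run_ids if r.startswith(prefix)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise LookupError(f"no run matches {prefix!r}")
    raise LookupError(f"ambiguous prefix {prefix!r}: {len(matches)} matches")

runs = ["20260325-a3f9c2d81b4e", "20260324-b4c2e1f39d7a"]
print(resolve_run("20260325-a3f9", runs))  # → 20260325-a3f9c2d81b4e
```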
Example
evalflow compare 20260325-a3f9 20260324-b4c2
Overall score: +0.04

summarize-short-article     improved    0.87 → 0.91
answer-with-context         regressed   0.88 → 0.61
classify-support-priority   unchanged   1.00 → 1.00
Use evalflow runs first to find the run IDs you want to compare.

evalflow prompt create NAME

Create a new prompt YAML file in the prompts/ directory. The name must be in lowercase kebab-case.
Syntax
evalflow prompt create NAME
Example
evalflow prompt create summarization
✓ Created prompts/summarization.yaml
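Prompt names must be lowercase kebab-case: lowercase words joined by single hyphens. One way to express that rule (the exact regex evalflow uses is an assumption):

```python
import re

KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def is_kebab_case(name: str) -> bool:
    """True for names like "summarization" or "answer-with-context"."""
    return bool(KEBAB_CASE.fullmatch(name))

print(is_kebab_case("summarization"))        # → True
print(is_kebab_case("answer-with-context"))  # → True
print(is_kebab_case("My_Prompt"))            # → False
```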

evalflow prompt list

List all prompts in the registry with their current version, status, author, and creation date.
Syntax
evalflow prompt list
Example
evalflow prompt list
ID              Version  Status      Author   Created
summarization   1        draft       unknown  2026-03-25
classification  2        staging     unknown  2026-03-20

evalflow prompt diff NAME V1 V2

Show a diff between two stored versions of a prompt. Lines removed in the newer version are prefixed with -, and lines added are prefixed with +.
Syntax
evalflow prompt diff NAME V1 V2
Arguments
NAME
string
required
The prompt name, matching the file in prompts/.
V1
number
required
The older version number.
V2
number
required
The newer version number.
Example
evalflow prompt diff summarization 1 2
- You are a helpful assistant.
+ You are a concise summarization assistant. Always respond in one sentence.
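The -/+ output is a standard line diff, which you could reproduce yourself with Python's difflib (a sketch; evalflow's own diff formatting may differ):

```python
import difflib

v1 = ["You are a helpful assistant."]
v2 = ["You are a concise summarization assistant. Always respond in one sentence."]

# unified_diff yields header lines (---, +++, @@) followed by -/+ hunk lines.
for line in difflib.unified_diff(v1, v2, "v1", "v2", lineterm=""):
    print(line)
```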

evalflow prompt promote NAME --to STATUS

Promote a prompt version to staging or production. Run evalflow eval before promoting to production to confirm quality has not regressed.
Syntax
evalflow prompt promote NAME --to STATUS
Flags
--to
string
required
Target status. Accepted values: staging, production.
Example
evalflow prompt promote summarization --to production
Consider running evalflow eval before promoting to production.
summarization promoted to production
Promoting to production does not automatically run an eval. Always verify quality with evalflow eval first.

evalflow dataset lint

Validate your dataset file without running a full eval. Checks structure, field types, and per-test-case quality rules.
Syntax
evalflow dataset lint [PATH]
Arguments
PATH
string
Default: evals/dataset.json
Path to the dataset JSON file to validate. Defaults to evals/dataset.json.
Example
evalflow dataset lint
✓ dataset.json valid (3 test cases)
✓ summarize-release-notes: id kebab-case
✓ summarize-release-notes: input non-empty
✓ summarize-release-notes: expected_output reasonable length
...

Dataset lint passed.
If the dataset contains issues, evalflow dataset lint exits with code 2 and prints the failing checks.
Run evalflow dataset lint after editing your dataset to catch schema errors before they surface during a full evalflow eval run.
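The per-test-case checks shown in the example output (kebab-case id, non-empty input, reasonable expected_output) can be sketched as a small validator; the exact rules evalflow enforces may differ:

```python
import re

KEBAB = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def lint_case(case: dict) -> list[str]:
    """Return human-readable problems for one test case (illustrative checks)."""
    problems = []
    if not KEBAB.fullmatch(case.get("id", "")):
        problems.append("id is not kebab-case")
    if not str(case.get("input", "")).strip():
        problems.append("input is empty")
    if not str(case.get("expected_output", "")).strip():
        problems.append("expected_output is empty")
    return problems

case = {"id": "Summarize Notes", "input": "", "expected_output": "ok"}
print(lint_case(case))  # → ['id is not kebab-case', 'input is empty']
```

A clean dataset is one where lint_case returns an empty list for every case; any non-empty result corresponds to a failing check in the lint output.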