evalflow init
Set up evalflow in your project. By default, runs an interactive wizard that prompts for a provider, model, and API key environment variable name. Use --non-interactive for CI, Docker, or scripted environments.
Syntax
- LLM provider to configure. Accepted values: openai, anthropic, groq, gemini, ollama. Defaults to openai in non-interactive mode.
- Model name to use. Defaults to the provider’s recommended model when omitted.
- Skip all prompts. Also accepted as --yes or -y. Required when stdin is not a TTY (CI, Docker).
- Overwrite an existing evalflow.yaml without asking.
- Print supported providers and their default models, then exit. Does not write any files.
evalflow init writes the following files on first run:
- evalflow.yaml — project configuration
- evals/dataset.json — starter dataset with one example test case
- prompts/ — empty directory for prompt YAML files
- .env.example — template for required environment variables
- .evalflow/ — local storage directory (runs database, cache)
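As a rough illustration, the starter evals/dataset.json might resemble the sketch below. The exact schema evalflow generates is not reproduced in this reference, so every field name here (cases, id, input, expected) is an assumption; only the tags array is mentioned elsewhere in this document.

```json
{
  "cases": [
    {
      "id": "example-1",
      "input": "Summarize this support ticket in one sentence.",
      "expected": "A single-sentence summary of the ticket.",
      "tags": ["smoke"]
    }
  ]
}
```

Under that assumed shape, a tag filter on evalflow eval would select the cases whose tags array contains the given value.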
evalflow eval
Run the LLM quality gate against your dataset. Each test case is scored and compared to the saved baseline. The command exits with code 0 on pass, 1 on quality failure (which blocks CI), and 2 on an unexpected error.
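A minimal sketch of branching on these exit codes in a CI script; the evalflow shell function below is a stub standing in for the real binary (it pretends the gate failed) so the snippet is self-contained:

```shell
# Stub standing in for the real evalflow binary: pretend the quality gate failed.
evalflow() { return 1; }

evalflow eval
code=$?
case $code in
  0) msg="quality gate passed" ;;
  1) msg="quality gate failed: blocking the pipeline" ;;
  2) msg="unexpected error during eval" ;;
esac
echo "$msg"
```

With the stub returning 1, this prints "quality gate failed: blocking the pipeline"; with the real binary, the same case statement maps each documented exit code to an action.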
Syntax
- Override the provider set in evalflow.yaml. Accepted values: openai, anthropic, groq, gemini, ollama.
- Override the model set in evalflow.yaml for this run.
- Path to a dataset JSON file. Defaults to evals/dataset.json (or the value in evalflow.yaml).
- Run only test cases whose tags array contains this value. Useful for running a smoke or critical subset.
- Use cached LLM responses instead of making live API calls. Useful for re-scoring without consuming API quota.
- Print full exception tracebacks on unexpected errors. Not recommended for production use.
- Save this run as the new baseline. Future runs are compared against it. The first run always saves a baseline automatically.
- Maximum number of test cases to run in parallel. Minimum value is 1.

evalflow eval exits with code 1 when the quality gate fails. CI systems interpret a non-zero exit code as a pipeline failure, blocking the PR automatically.

evalflow doctor
Check your local evalflow setup. Verifies the installation, configuration file, dataset, environment variables, storage, and optional provider health. Run this first when something isn’t working.
Syntax
- Automatically apply supported fixes, such as adding required entries to .gitignore.
- Run a live health check against each configured provider. Off by default to avoid unnecessary API calls.
- Validate evalflow.yaml syntax, field values, and configured API key environment variables, then exit.

evalflow runs
List recent eval runs stored in the local SQLite history. Each row shows the run ID, date, provider, model, overall score, and pass/fail status.
Syntax
- Maximum number of runs to display.
- Show only runs newer than a time window. Accepted formats: 7d (days) or 24h (hours).
- Show only runs that failed the quality gate.
evalflow compare RUN_A RUN_B
Compare two runs side by side. Shows the overall score change and highlights which test cases improved or regressed. Partial run IDs are accepted as long as they resolve to a unique run.
Syntax
- The first run ID (or a unique prefix of at least 8 characters).
- The second run ID (or a unique prefix of at least 8 characters).
evalflow prompt create NAME
Create a new prompt YAML file in the prompts/ directory. The name must be in lowercase kebab-case.
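For illustration only, a freshly created prompt file might resemble the sketch below. The schema evalflow writes is not documented in this section, so every field name is an assumption (the fields mirror what evalflow prompt list displays), and summarize-ticket is a hypothetical prompt name:

```yaml
# Hypothetical prompts/summarize-ticket.yaml; all field names are assumptions.
name: summarize-ticket
version: 1
status: draft        # assumed initial status; promote moves it to staging/production
author: jane@example.com
created: 2025-01-15
template: |
  Summarize the following support ticket in two sentences:
  {{ticket_body}}
```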
Syntax
evalflow prompt list
List all prompts in the registry with their current version, status, author, and creation date.
Syntax
evalflow prompt diff NAME V1 V2
Show a diff between two stored versions of a prompt. Lines removed in the newer version are prefixed with -, and lines added are prefixed with +.
Syntax
- The prompt name, matching the file in prompts/.
- The older version number.
- The newer version number.
evalflow prompt promote NAME --to STATUS
Promote a prompt version to staging or production. Run evalflow eval before promoting to production to confirm quality has not regressed.
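The recommended order (eval first, promote only on success) can be sketched in shell; summarize-ticket is a hypothetical prompt name, and the evalflow function is a stub that echoes its arguments in place of the real binary:

```shell
# Stub in place of the real evalflow binary; it echoes instead of calling APIs.
evalflow() { echo "ran: evalflow $*"; return 0; }

# Promote to production only if the quality gate passes (exit code 0).
if evalflow eval; then
  result=$(evalflow prompt promote summarize-ticket --to production)
fi
echo "$result"
```

With the real binary, the if condition would fail on exit codes 1 and 2, so a regressed prompt never reaches production.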
Syntax
- Target status. Accepted values: staging, production.

evalflow dataset lint
Validate your dataset file without running a full eval. Checks structure, field types, and per-test-case quality rules.
Syntax
- Path to the dataset JSON file to validate. Defaults to evals/dataset.json.

If any check fails, evalflow dataset lint exits with code 2 and prints the failing checks.
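As a sketch, a pre-commit hook could branch on that exit code; the evalflow function below is a stub that simulates a failing lint (exit code 2) so the snippet runs on its own:

```shell
# Stub in place of the real evalflow binary: simulate failing checks (exit 2).
evalflow() { return 2; }

evalflow dataset lint
lint_code=$?
if [ "$lint_code" -ne 0 ]; then
  echo "dataset validation failed (exit $lint_code); fix it before committing"
fi
```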