evalflow.yaml lives at the root of your project and controls which providers to use, where to find your dataset, scoring thresholds, and local storage settings. Run evalflow init to generate it automatically, or copy the example below and edit it by hand.
Complete example
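The sketch below illustrates the overall shape of the file. It is only a hedged reconstruction: every key name and value that is not quoted elsewhere on this page (for example the `model`, `dataset`, and `storage` keys, and all model names and paths) is an assumption, so prefer the file that evalflow init generates.

```yaml
# evalflow.yaml — illustrative sketch; key names not mentioned on this page
# are assumptions, as are all example values.
version: "1.0"
name: my-project

providers:
  openai:
    model: gpt-4o-mini              # assumed key and example value
    api_key_env: OPENAI_API_KEY
  ollama:
    model: llama3                   # assumed key and example value
    api_key_env: OLLAMA_API_KEY     # any non-empty value works for Ollama

eval:
  dataset: eval/dataset.json            # assumed key name
  baseline_path: .evalflow/baseline.json  # assumed key name
  default_provider: openai              # assumed key name
  consistency_runs: 3                   # assumed key name

thresholds:
  success: 0.8        # assumed key names throughout this section
  relevance: 0.7
  hallucination: 0.2
  consistency: 0.75

judge:
  provider: openai
  model: gpt-4o-mini  # example of a fast, low-cost model

prompts:
  dir: prompts/       # assumed key name

storage:
  save_outputs: true       # assumed key name
  max_output_chars: 20000  # assumed key name
```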
Top-level fields
Schema version. Use "1.0".
A human-readable name for the project. Used in run history and output headers. Defaults to the directory name when generated by evalflow init.
providers
Configure one or more LLM providers. You can define multiple providers and switch between them using --provider on the command line. At least one provider must be configured.
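A single provider entry might be sketched as follows; only api_key_env is named on this page, so the model key and both placeholder values are assumptions:

```yaml
providers:
  PROVIDER_NAME:                 # placeholder — see the list below
    model: MODEL_NAME            # assumed key: model identifier for that provider
    api_key_env: MY_API_KEY_VAR  # name of the environment variable holding the key
```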
Replace PROVIDER_NAME with one of: openai, anthropic, groq, gemini, ollama.
For Ollama (local models), api_key_env is still required by the schema but the variable does not need to hold a real value. Set it to any non-empty string.
eval
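As a hedged sketch, the fields described in this section might be written like this (all key names and values here are assumptions inferred from the descriptions that follow):

```yaml
eval:
  dataset: eval/dataset.json              # path to the dataset JSON file
  baseline_path: .evalflow/baseline.json  # where the baseline run is stored
  default_provider: openai                # used when --provider is not passed
  consistency_runs: 3                     # calls per case for the consistency method
```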
Controls how evalflow eval loads and runs your dataset.
Path to the dataset JSON file, relative to the project root.
Path where the baseline run is stored. The baseline is used to detect regressions between runs.
The provider to use when --provider is not passed on the command line. Must match a key under providers.
Number of times to call the model when using the consistency eval method. Higher values improve accuracy at the cost of more API calls.
thresholds
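For example, the four thresholds described below might be written as follows (the key names are assumptions inferred from the descriptions; the values are only illustrative):

```yaml
thresholds:
  success: 0.8        # minimum overall task score
  relevance: 0.7      # minimum relevance score
  hallucination: 0.2  # maximum allowed hallucination score
  consistency: 0.75   # minimum consistency score (consistency method only)
```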
Minimum acceptable scores for each quality dimension. A test case fails when its score falls below the relevant threshold.
Minimum score for the overall task to be considered successful. Range: 0.0–1.0.
Minimum relevance score. Range: 0.0–1.0.
Maximum allowed hallucination score. A run fails if hallucination exceeds this value. Range: 0.0–1.0.
Minimum consistency score when using the consistency eval method. Range: 0.0–1.0.
judge
Settings for the LLM-as-judge used when a test case sets eval_config.judge: true.
Provider for the judge model. Must be one of the supported providers.
Model to use as judge. A fast, low-cost model is recommended here since the judge runs once per test case.
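A minimal sketch of this section, assuming the keys are named provider and model (the model name is only an example of a fast, low-cost choice):

```yaml
judge:
  provider: openai      # assumed key name; example value
  model: gpt-4o-mini    # assumed key name; example value
```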
prompts
Settings for the prompt registry.
Directory where prompt YAML files are stored. Created automatically by evalflow init.
storage
Controls what evalflow stores locally after each run.
When true, the raw model output for each test case is saved to the local database. Set to false to reduce disk usage.
Maximum number of characters to store per model output. Outputs longer than this are truncated in storage (not in scoring).
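A hedged sketch of this section, with assumed key names and illustrative values:

```yaml
storage:
  save_outputs: true       # assumed key name: keep raw model outputs locally
  max_output_chars: 20000  # assumed key name: truncate stored outputs beyond this
```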
Security
evalflow reads API keys from environment variables at runtime; the config file stores only the variable names. Keep .env out of version control. evalflow init adds the following entries to .gitignore automatically: