evalflow.yaml lives at the root of your project and controls which providers to use, where to find your dataset, scoring thresholds, and local storage settings. Run evalflow init to generate it automatically, or copy the example below and edit it by hand.
Store environment variable names in evalflow.yaml, not the API key values themselves. Never commit real API keys to version control.

Complete example

version: "1.0"
project: my-llm-app

providers:
  openai:
    api_key_env: OPENAI_API_KEY
    default_model: gpt-4o-mini
  groq:
    api_key_env: GROQ_API_KEY
    default_model: llama-3.1-8b-instant

eval:
  dataset: evals/dataset.json
  baseline_file: .evalflow/baseline.json
  default_provider: openai
  consistency_runs: 3

thresholds:
  task_success: 0.80
  relevance: 0.75
  hallucination_max: 0.10
  consistency_min: 0.85

judge:
  provider: groq
  model: llama-3.1-8b-instant

prompts:
  directory: prompts/

storage:
  store_raw_outputs: true
  max_output_chars: 2000

Top-level fields

version
string
required
Schema version. Use "1.0".
project
string
A human-readable name for the project. Used in run history and output headers. Defaults to the directory name when generated by evalflow init.

providers

Configure one or more LLM providers. You can define multiple providers and switch between them using --provider on the command line. At least one provider must be configured.
providers:
  openai:
    api_key_env: OPENAI_API_KEY
    default_model: gpt-4o-mini
providers.PROVIDER_NAME
object
Replace PROVIDER_NAME with one of: openai, anthropic, groq, gemini, ollama.
For Ollama (local models), api_key_env is still required by the schema but the variable does not need to hold a real value. Set it to any non-empty string.
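For example, a local Ollama setup might look like this (the OLLAMA_API_KEY variable name is arbitrary, and llama3.1 is only an example model tag; substitute whatever model you have pulled locally):

```yaml
providers:
  ollama:
    api_key_env: OLLAMA_API_KEY   # set the variable to any non-empty string, e.g. "none"
    default_model: llama3.1       # example tag; use a model you have pulled
```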

eval

Controls how evalflow eval loads and runs your dataset.
eval.dataset
string
default:"evals/dataset.json"
Path to the dataset JSON file, relative to the project root.
eval.baseline_file
string
default:".evalflow/baseline.json"
Path where the baseline run is stored. The baseline is used to detect regressions between runs.
eval.default_provider
string
default:"openai"
The provider to use when --provider is not passed on the command line. Must match a key under providers.
eval.consistency_runs
number
default:"3"
Number of times to call the model when using the consistency eval method. Higher values give a more reliable consistency estimate at the cost of more API calls.
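evalflow's actual consistency metric is not specified here, but as an illustration of why more runs help, here is one plausible scheme: sample the model N times and average pairwise similarity across the outputs (mean pairwise Jaccard similarity is an assumption for this sketch, not the tool's algorithm):

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Similarity of two outputs as overlap of their word sets."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def consistency_score(outputs: list[str]) -> float:
    """Mean pairwise similarity across the N sampled outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:  # fewer than 2 runs: trivially consistent
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# With consistency_runs: 3, three outputs are compared pairwise.
score = consistency_score([
    "paris is the capital",
    "paris is the capital",
    "the capital is paris",
])
```

More runs mean more pairs, so a single lucky (or unlucky) sample has less influence on the final score.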

thresholds

Minimum acceptable scores for each quality dimension. A test case fails when its score falls below the relevant threshold.
thresholds.task_success
number
default:"0.80"
Minimum score for the overall task to be considered successful. Range: 0.0–1.0.
thresholds.relevance
number
default:"0.75"
Minimum relevance score. Range: 0.0–1.0.
thresholds.hallucination_max
number
default:"0.10"
Maximum allowed hallucination score. A run fails if hallucination exceeds this value. Range: 0.0–1.0.
thresholds.consistency_min
number
default:"0.85"
Minimum consistency score when using the consistency eval method. Range: 0.0–1.0.
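The comparison semantics above can be sketched in Python (illustrative only; this is not evalflow's implementation — note that hallucination_max is an upper bound while the others are lower bounds):

```python
# Default thresholds from the complete example above.
thresholds = {
    "task_success": 0.80,       # minimum
    "relevance": 0.75,          # minimum
    "hallucination_max": 0.10,  # maximum
    "consistency_min": 0.85,    # minimum
}

def passes(scores: dict) -> bool:
    """A test case passes only if every dimension clears its threshold."""
    return (
        scores["task_success"] >= thresholds["task_success"]
        and scores["relevance"] >= thresholds["relevance"]
        and scores["hallucination"] <= thresholds["hallucination_max"]
        and scores["consistency"] >= thresholds["consistency_min"]
    )

ok = passes({"task_success": 0.9, "relevance": 0.8,
             "hallucination": 0.05, "consistency": 0.9})
```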

judge

Settings for the LLM-as-judge used when a test case sets eval_config.judge: true.
judge.provider
string
default:"groq"
Provider for the judge model. Must be one of the supported providers.
judge.model
string
default:"llama-3.1-8b-instant"
Model to use as judge. A fast, low-cost model is recommended here since the judge runs once per test case.
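For example, to judge with OpenAI instead of the default (this assumes openai is configured under providers; any configured provider works):

```yaml
judge:
  provider: openai
  model: gpt-4o-mini
```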

prompts

Settings for the prompt registry.
prompts.directory
string
default:"prompts/"
Directory where prompt YAML files are stored. Created automatically by evalflow init.

storage

Controls what evalflow stores locally after each run.
storage.store_raw_outputs
boolean
default:"true"
When true, the raw model output for each test case is saved to the local database. Set to false to reduce disk usage.
storage.max_output_chars
number
default:"2000"
Maximum number of characters to store per model output. Outputs longer than this are truncated in storage (not in scoring).
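For example, to cap disk usage while still keeping outputs for inspection, you might lower the truncation limit (500 here is an arbitrary illustrative value):

```yaml
storage:
  store_raw_outputs: true
  max_output_chars: 500   # store at most 500 characters per output
```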

Security

evalflow reads API keys from environment variables at runtime. The config file stores only the variable name.
# Correct — stores the variable name
providers:
  openai:
    api_key_env: OPENAI_API_KEY

# Wrong — never do this
providers:
  openai:
    api_key_env: sk-abc123...
Keep .env out of version control. evalflow init adds the following entries to .gitignore automatically:
.env
.env.local
.env.*
!.env.example
.evalflow/
*.evalflow.db