evalflow.yaml lives at the root of your project and controls which providers to use, where to find your dataset, scoring thresholds, and local storage settings. Run evalflow init to generate it automatically, or copy the example below and edit it by hand.
Complete example
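The sketch below illustrates the overall shape of the file. It is only a hedged reconstruction: every key name and value that is not quoted elsewhere on this page (for example the `model`, `dataset`, and `storage` keys, and all model names and paths) is an assumption, so prefer the file that evalflow init generates.

```yaml
# evalflow.yaml — illustrative sketch; key names not mentioned on this page
# are assumptions, as are all example values.
version: "1.0"
name: my-project

providers:
  openai:
    model: gpt-4o-mini              # assumed key and example value
    api_key_env: OPENAI_API_KEY
  ollama:
    model: llama3                   # assumed key and example value
    api_key_env: OLLAMA_API_KEY     # any non-empty value works for Ollama

eval:
  dataset: eval/dataset.json            # assumed key name
  baseline_path: .evalflow/baseline.json  # assumed key name
  default_provider: openai              # assumed key name
  consistency_runs: 3                   # assumed key name

thresholds:
  success: 0.8        # assumed key names throughout this section
  relevance: 0.7
  hallucination: 0.2
  consistency: 0.75

judge:
  provider: openai
  model: gpt-4o-mini  # example of a fast, low-cost model

prompts:
  dir: prompts/       # assumed key name

storage:
  save_outputs: true       # assumed key name
  max_output_chars: 20000  # assumed key name
```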
Top-level fields
Schema version. Use "1.0".
A human-readable name for the project. Used in run history and output headers. Defaults to the directory name when generated by evalflow init.
providers
Configure one or more LLM providers. You can define multiple providers and switch between them using --provider on the command line. At least one provider must be configured.
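A single provider entry might be sketched as follows; only api_key_env is named on this page, so the model key and both placeholder values are assumptions:

```yaml
providers:
  PROVIDER_NAME:                 # placeholder — see the list below
    model: MODEL_NAME            # assumed key: model identifier for that provider
    api_key_env: MY_API_KEY_VAR  # name of the environment variable holding the key
```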
Replace PROVIDER_NAME with one of: openai, anthropic, groq, gemini, ollama.
For Ollama (local models), api_key_env is still required by the schema but the variable does not need to hold a real value. Set it to any non-empty string.
eval
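As a hedged sketch, the fields described in this section might be written like this (all key names and values here are assumptions inferred from the descriptions that follow):

```yaml
eval:
  dataset: eval/dataset.json              # path to the dataset JSON file
  baseline_path: .evalflow/baseline.json  # where the baseline run is stored
  default_provider: openai                # used when --provider is not passed
  consistency_runs: 3                     # calls per case for the consistency method
```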
Controls how evalflow eval loads and runs your dataset.
Path to the dataset JSON file, relative to the project root.
Path where the baseline run is stored. The baseline is used to detect regressions between runs.
The provider to use when --provider is not passed on the command line. Must match a key under providers.
Number of times to call the model when using the consistency eval method. Higher values improve accuracy at the cost of more API calls.
thresholds
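For example, the four thresholds described below might be written as follows (the key names are assumptions inferred from the descriptions; the values are only illustrative):

```yaml
thresholds:
  success: 0.8        # minimum overall task score
  relevance: 0.7      # minimum relevance score
  hallucination: 0.2  # maximum allowed hallucination score
  consistency: 0.75   # minimum consistency score (consistency method only)
```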
Minimum acceptable scores for each quality dimension. A test case fails when its score falls below the relevant threshold.
Minimum score for the overall task to be considered successful. Range: 0.0–1.0.
Minimum relevance score. Range: 0.0–1.0.
Maximum allowed hallucination score. A run fails if hallucination exceeds this value. Range: 0.0–1.0.
Minimum consistency score when using the consistency eval method. Range: 0.0–1.0.
judge
Settings for the LLM-as-judge used when a test case sets eval_config.judge: true.
Provider for the judge model. Must be one of the supported providers.
Model to use as judge. A fast, low-cost model is recommended here since the judge runs once per test case.
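A minimal sketch of this section, assuming the keys are named provider and model (the model name is only an example of a fast, low-cost choice):

```yaml
judge:
  provider: openai      # assumed key name; example value
  model: gpt-4o-mini    # assumed key name; example value
```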
prompts
Settings for the prompt registry.
Directory where prompt YAML files are stored. Created automatically by evalflow init.
storage
Controls what evalflow stores locally after each run.
When true, the raw model output for each test case is saved to the local database. Set to false to reduce disk usage.
Maximum number of characters to store per model output. Outputs longer than this are truncated in storage (not in scoring).
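A hedged sketch of this section, with assumed key names and illustrative values:

```yaml
storage:
  save_outputs: true       # assumed key name: keep raw model outputs locally
  max_output_chars: 20000  # assumed key name: truncate stored outputs beyond this
```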
Security
evalflow reads API keys from environment variables at runtime; the config file stores only the variable names. Keep .env out of version control. evalflow init adds the following entries to .gitignore automatically: