evals/dataset.json by default (configurable in evalflow.yaml). It is a JSON object that contains a version string and an array of test cases. evalflow validates the file on every run and reports clear errors when something is wrong.
Complete example
The following dataset uses three different eval methods across three test cases:
Top-level fields
Dataset schema version. Use `"1.0"`.
Array of test case objects. Must contain at least one entry. All `id` values must be unique.
Test case fields
Unique identifier for the test case. Must be lowercase kebab-case (e.g. `summarize-release-notes`). Used in run output and comparisons.
Short human-readable description of what this test case checks. Displayed in run history and diffs.
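
The lowercase kebab-case requirement on test case identifiers can be checked with a regular expression. The sketch below is an illustration only: the pattern and the helper name `is_kebab_case` are assumptions, and evalflow's own rule may differ in details such as whether digits are allowed.

```python
import re

# Assumed pattern: groups of lowercase letters or digits joined by single hyphens.
# evalflow's exact rule is not documented here and may be stricter or looser.
KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_kebab_case(test_id: str) -> bool:
    """Return True if test_id looks like lowercase kebab-case."""
    return KEBAB_CASE.fullmatch(test_id) is not None
```

Under this assumed pattern, `summarize-release-notes` passes, while identifiers with uppercase letters, underscores, or leading, trailing, or doubled hyphens are rejected.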
The type of task the model is performing. Accepted values:
| Value | When to use |
|---|---|
| `summarization` | The model condenses a longer input |
| `classification` | The model assigns a category label |
| `extraction` | The model pulls specific information from a text |
| `qa` | The model answers a question, optionally using context |
| `generation` | The model produces free-form content |
| `rewrite` | The model rephrases or transforms an existing text |
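
If you generate datasets from a script, it can help to check the task type before writing the file. The following is a minimal sketch that mirrors the accepted values from the table above; `TASK_TYPES` and `check_task_type` are hypothetical names, not part of evalflow.

```python
# Accepted values copied from the table above.
TASK_TYPES = frozenset(
    {"summarization", "classification", "extraction", "qa", "generation", "rewrite"}
)


def check_task_type(value: str) -> str:
    """Return value unchanged if it is an accepted task type, else raise ValueError."""
    if value not in TASK_TYPES:
        raise ValueError(
            f"task type must be one of {sorted(TASK_TYPES)}, got {value!r}"
        )
    return value
```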
The prompt or instruction sent to the model. Write this exactly as you would pass it to the LLM.
The ideal model response. Scoring methods compare the actual model output against this value.
Optional additional context passed alongside `input`. Use this for retrieval-augmented generation (RAG) test cases where the model must answer from a specific document or data source. Set to `""` when not needed.
Optional list of labels. Use tags to group related test cases and run a subset with `evalflow eval --tag TAG`. Common patterns: `smoke`, `critical`, `regression`, `slow`.
Per-test evaluation settings. If omitted, evalflow uses default scoring behaviour.
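
Running a subset by tag can be pictured as a simple filter over the test cases. This is a sketch of the idea, not evalflow's implementation: the field name `tags` and the helper `select_by_tag` are assumptions made for illustration.

```python
def select_by_tag(test_cases: list[dict], tag: str) -> list[dict]:
    """Return the test cases whose label list contains tag, in original order."""
    # "tags" is an assumed field name; a case without it is treated as untagged.
    return [case for case in test_cases if tag in case.get("tags", [])]
```

A case that carries several labels, such as both `smoke` and `critical`, would be selected by either tag.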
Validation rules
evalflow validates your dataset before running and exits with a clear error if any rule is violated:
- `version` must be present.
- `test_cases` must contain at least one entry.
- Every `id` must be unique and in lowercase kebab-case.
- Every test case must have `input` and `expected_output`.
- `task_type` must be one of the accepted values.
- `eval_config.methods` entries must be unique within a test case.
- `eval_config.weight` must be greater than `0.0`.
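
To make the rules concrete, here is a rough sketch of how they could be checked in a pre-commit script. It is not evalflow's validator. The function name `validate_dataset`, the error wording, and the kebab-case pattern are assumptions, and it further assumes that `task_type` is only checked when present, that `eval_config.methods` is a list of strings, and that `eval_config.weight` is numeric.

```python
import re

# Assumed kebab-case pattern; evalflow's exact rule may differ.
KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
TASK_TYPES = {"summarization", "classification", "extraction", "qa", "generation", "rewrite"}


def validate_dataset(dataset: dict) -> list[str]:
    """Return a list of rule violations; an empty list means no rule was broken."""
    errors = []
    if "version" not in dataset:
        errors.append("version must be present")
    cases = dataset.get("test_cases") or []
    if not cases:
        errors.append("test_cases must contain at least one entry")

    seen_ids = set()
    for index, case in enumerate(cases):
        case_id = case.get("id")
        # Fall back to the position when the id is missing or not a string.
        label = case_id if isinstance(case_id, str) else f"test_cases[{index}]"
        if not isinstance(case_id, str) or not KEBAB_CASE.fullmatch(case_id):
            errors.append(f"{label}: id must be lowercase kebab-case")
        elif case_id in seen_ids:
            errors.append(f"{label}: id must be unique")
        else:
            seen_ids.add(case_id)

        for field in ("input", "expected_output"):
            if field not in case:
                errors.append(f"{label}: {field} is required")

        # Assumption: a missing task_type is not treated as an error here.
        if "task_type" in case and case["task_type"] not in TASK_TYPES:
            errors.append(f"{label}: task_type must be one of the accepted values")

        config = case.get("eval_config") or {}
        methods = config.get("methods", [])
        if len(methods) != len(set(methods)):
            errors.append(f"{label}: eval_config.methods entries must be unique")
        if "weight" in config and config["weight"] <= 0.0:
            errors.append(f"{label}: eval_config.weight must be greater than 0.0")
    return errors
```

Load the file with `json.load` and pass the resulting object to `validate_dataset`. Collecting all violations in one pass, instead of stopping at the first, lets you fix a dataset in a single round.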