Your dataset lives at evals/dataset.json by default (configurable in evalflow.yaml). It is a JSON object that contains a version string and an array of test cases. evalflow validates the file on every run and reports clear errors when something is wrong.
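Pointing evalflow at a different file might look like the following in evalflow.yaml. The key name shown here is an assumption for illustration; check the configuration reference for your evalflow version.

```yaml
# evalflow.yaml
# NOTE: "dataset" is an assumed key name used for illustration;
# the actual option name may differ in your evalflow version.
dataset: evals/regression/dataset.json
```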

Complete example

The following dataset defines three test cases that exercise different task types and eval methods:
{
  "version": "1.0",
  "test_cases": [
    {
      "id": "summarize-release-notes",
      "description": "Summarize short product notes in one sentence",
      "task_type": "summarization",
      "input": "Summarize in one sentence: Evalflow now stores baselines locally and prints run IDs in the summary.",
      "expected_output": "Evalflow stores baselines locally and shows run IDs in summaries.",
      "context": "",
      "tags": ["smoke", "summary"],
      "eval_config": {
        "methods": ["embedding_similarity"],
        "judge": false,
        "weight": 1.0
      }
    },
    {
      "id": "classify-support-priority",
      "description": "Classify a support issue as urgent",
      "task_type": "classification",
      "input": "Classify as urgent or normal: Customer cannot access production after migration.",
      "expected_output": "urgent",
      "context": "",
      "tags": ["smoke", "classification"],
      "eval_config": {
        "methods": ["exact_match"],
        "judge": false,
        "weight": 1.0
      }
    },
    {
      "id": "answer-billing-question",
      "description": "Answer a billing question using provided context",
      "task_type": "qa",
      "input": "What plan includes audit logs?",
      "expected_output": "The business plan includes audit logs.",
      "context": "Pricing: Starter has email support. Business includes audit logs and SSO.",
      "tags": ["smoke", "qa"],
      "eval_config": {
        "methods": ["embedding_similarity"],
        "judge": true,
        "weight": 1.5
      }
    }
  ]
}
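Because the dataset is plain JSON, you can inspect it with any JSON tooling before handing it to evalflow. A minimal sketch in Python using the standard library, with the example above trimmed to the fields relevant here:

```python
import json

# The example dataset above, embedded as a string for illustration;
# in practice you would read evals/dataset.json from disk.
raw = """
{
  "version": "1.0",
  "test_cases": [
    {"id": "summarize-release-notes", "task_type": "summarization"},
    {"id": "classify-support-priority", "task_type": "classification"},
    {"id": "answer-billing-question", "task_type": "qa"}
  ]
}
"""

dataset = json.loads(raw)
ids = [case["id"] for case in dataset["test_cases"]]

print(dataset["version"])         # 1.0
print(len(ids))                   # 3
print(len(ids) == len(set(ids)))  # True: all ids are unique
```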

Top-level fields

version
string
required
Dataset schema version. Use "1.0".
test_cases
object[]
required
Array of test case objects. Must contain at least one entry. All id values must be unique.

Test case fields

id
string
required
Unique identifier for the test case. Must be lowercase kebab-case (e.g. summarize-release-notes). Used in run output and comparisons.
description
string
required
Short human-readable description of what this test case checks. Displayed in run history and diffs.
task_type
string
required
The type of task the model is performing. Accepted values:
  • summarization: the model condenses a longer input
  • classification: the model assigns a category label
  • extraction: the model pulls specific information from a text
  • qa: the model answers a question, optionally using context
  • generation: the model produces free-form content
  • rewrite: the model rephrases or transforms an existing text
input
string
required
The prompt or instruction sent to the model. Write this exactly as you would pass it to the LLM.
expected_output
string
required
The ideal model response. Scoring methods compare the actual model output against this value.
context
string
Optional additional context passed alongside input. Use this for retrieval-augmented generation (RAG) test cases where the model must answer from a specific document or data source. Set to "" when not needed.
tags
string[]
Optional list of labels. Use tags to group related test cases and run a subset with evalflow eval --tag TAG. Common patterns: smoke, critical, regression, slow.
eval_config
object
Per-test evaluation settings. If omitted, evalflow uses default scoring behaviour.
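The --tag filter selects every test case whose tags list contains the given label. The equivalent selection logic can be sketched in Python (this mirrors the dataset schema; evalflow's internal implementation may differ):

```python
def filter_by_tag(test_cases, tag):
    """Return the test cases carrying the given tag, mirroring
    what `evalflow eval --tag TAG` selects."""
    return [case for case in test_cases if tag in case.get("tags", [])]

# Test cases from the example dataset, trimmed to id and tags.
cases = [
    {"id": "summarize-release-notes", "tags": ["smoke", "summary"]},
    {"id": "classify-support-priority", "tags": ["smoke", "classification"]},
    {"id": "answer-billing-question", "tags": ["smoke", "qa"]},
]

print([c["id"] for c in filter_by_tag(cases, "qa")])
# ['answer-billing-question']
print(len(filter_by_tag(cases, "smoke")))  # 3
```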

Validation rules

evalflow validates your dataset before running and exits with a clear error if any rule is violated:
  • version must be present.
  • test_cases must contain at least one entry.
  • Every id must be unique and in lowercase kebab-case.
  • Every test case must have input and expected_output.
  • task_type must be one of the accepted values.
  • eval_config.methods entries must be unique within a test case.
  • eval_config.weight must be greater than 0.0.
Run evalflow doctor to validate your dataset without running a full eval.
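If you generate datasets programmatically, the rules above are easy to replicate as a pre-flight check before committing a file. A minimal sketch in Python (an illustrative re-implementation, not evalflow's actual validator):

```python
import re

ACCEPTED_TASK_TYPES = {
    "summarization", "classification", "extraction",
    "qa", "generation", "rewrite",
}
KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def validate_dataset(dataset):
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    if "version" not in dataset:
        errors.append("version must be present")
    cases = dataset.get("test_cases", [])
    if not cases:
        errors.append("test_cases must contain at least one entry")
    seen = set()
    for case in cases:
        cid = case.get("id", "")
        if not KEBAB_CASE.match(cid):
            errors.append(f"id {cid!r} is not lowercase kebab-case")
        if cid in seen:
            errors.append(f"duplicate id {cid!r}")
        seen.add(cid)
        for field in ("input", "expected_output"):
            if field not in case:
                errors.append(f"{cid}: missing {field}")
        if case.get("task_type") not in ACCEPTED_TASK_TYPES:
            errors.append(f"{cid}: task_type must be an accepted value")
        cfg = case.get("eval_config", {})
        methods = cfg.get("methods", [])
        if len(methods) != len(set(methods)):
            errors.append(f"{cid}: eval_config.methods entries must be unique")
        if cfg.get("weight", 1.0) <= 0.0:
            errors.append(f"{cid}: eval_config.weight must be greater than 0.0")
    return errors

good = {"version": "1.0", "test_cases": [{
    "id": "classify-support-priority",
    "task_type": "classification",
    "input": "Classify as urgent or normal: ...",
    "expected_output": "urgent",
    "eval_config": {"methods": ["exact_match"], "weight": 1.0},
}]}
print(validate_dataset(good))  # []
```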