Your dataset lives at evals/dataset.json by default (configurable in evalflow.yaml). It is a JSON object that contains a version string and an array of test cases. evalflow validates the file on every run and reports clear errors when something is wrong.
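Pointing evalflow at a different file might look like the following in evalflow.yaml. The key name shown here is an assumption for illustration; check the configuration reference for your evalflow version.

```yaml
# evalflow.yaml
# NOTE: "dataset" is an assumed key name used for illustration;
# the actual option name may differ in your evalflow version.
dataset: evals/regression/dataset.json
```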

Complete example

The following dataset defines three test cases that exercise different task types and eval methods:
{
  "version": "1.0",
  "test_cases": [
    {
      "id": "summarize-release-notes",
      "description": "Summarize short product notes in one sentence",
      "task_type": "summarization",
      "input": "Summarize in one sentence: Evalflow now stores baselines locally and prints run IDs in the summary.",
      "expected_output": "Evalflow stores baselines locally and shows run IDs in summaries.",
      "context": "",
      "tags": ["smoke", "summary"],
      "eval_config": {
        "methods": ["embedding_similarity"],
        "judge": false,
        "weight": 1.0
      }
    },
    {
      "id": "classify-support-priority",
      "description": "Classify a support issue as urgent",
      "task_type": "classification",
      "input": "Classify as urgent or normal: Customer cannot access production after migration.",
      "expected_output": "urgent",
      "context": "",
      "tags": ["smoke", "classification"],
      "eval_config": {
        "methods": ["exact_match"],
        "judge": false,
        "weight": 1.0
      }
    },
    {
      "id": "answer-billing-question",
      "description": "Answer a billing question using provided context",
      "task_type": "qa",
      "input": "What plan includes audit logs?",
      "expected_output": "The business plan includes audit logs.",
      "context": "Pricing: Starter has email support. Business includes audit logs and SSO.",
      "tags": ["smoke", "qa"],
      "eval_config": {
        "methods": ["embedding_similarity"],
        "judge": true,
        "weight": 1.5
      }
    }
  ]
}
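Because the dataset is plain JSON, you can inspect it with any JSON tooling before handing it to evalflow. A minimal sketch in Python using the standard library, with the example above trimmed to the fields relevant here:

```python
import json

# The example dataset above, embedded as a string for illustration;
# in practice you would read evals/dataset.json from disk.
raw = """
{
  "version": "1.0",
  "test_cases": [
    {"id": "summarize-release-notes", "task_type": "summarization"},
    {"id": "classify-support-priority", "task_type": "classification"},
    {"id": "answer-billing-question", "task_type": "qa"}
  ]
}
"""

dataset = json.loads(raw)
ids = [case["id"] for case in dataset["test_cases"]]

print(dataset["version"])         # 1.0
print(len(ids))                   # 3
print(len(ids) == len(set(ids)))  # True: all ids are unique
```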

Top-level fields

version
string
required
Dataset schema version. Use "1.0".
test_cases
object[]
required
Array of test case objects. Must contain at least one entry. All id values must be unique.

Test case fields

id
string
required
Unique identifier for the test case. Must be lowercase kebab-case (e.g. summarize-release-notes). Used in run output and comparisons.
description
string
required
Short human-readable description of what this test case checks. Displayed in run history and diffs.
task_type
string
required
The type of task the model is performing. Accepted values:
  • summarization: the model condenses a longer input
  • classification: the model assigns a category label
  • extraction: the model pulls specific information from a text
  • qa: the model answers a question, optionally using context
  • generation: the model produces free-form content
  • rewrite: the model rephrases or transforms an existing text
input
string
required
The prompt or instruction sent to the model. Write this exactly as you would pass it to the LLM.
expected_output
string
required
The ideal model response. Scoring methods compare the actual model output against this value.
context
string
Optional additional context passed alongside input. Use this for retrieval-augmented generation (RAG) test cases where the model must answer from a specific document or data source. Set to "" when not needed.
tags
string[]
Optional list of labels. Use tags to group related test cases and run a subset with evalflow eval --tag TAG. Common patterns: smoke, critical, regression, slow.
eval_config
object
Per-test evaluation settings. If omitted, evalflow uses default scoring behaviour.
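The --tag filter selects every test case whose tags list contains the given label. The equivalent selection logic can be sketched in Python (this mirrors the dataset schema; evalflow's internal implementation may differ):

```python
def filter_by_tag(test_cases, tag):
    """Return the test cases carrying the given tag, mirroring
    what `evalflow eval --tag TAG` selects."""
    return [case for case in test_cases if tag in case.get("tags", [])]

# Test cases from the example dataset, trimmed to id and tags.
cases = [
    {"id": "summarize-release-notes", "tags": ["smoke", "summary"]},
    {"id": "classify-support-priority", "tags": ["smoke", "classification"]},
    {"id": "answer-billing-question", "tags": ["smoke", "qa"]},
]

print([c["id"] for c in filter_by_tag(cases, "qa")])
# ['answer-billing-question']
print(len(filter_by_tag(cases, "smoke")))  # 3
```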

Validation rules

evalflow validates your dataset before running and exits with a clear error if any rule is violated:
  • version must be present.
  • test_cases must contain at least one entry.
  • Every id must be unique and in lowercase kebab-case.
  • Every test case must have input and expected_output.
  • task_type must be one of the accepted values.
  • eval_config.methods entries must be unique within a test case.
  • eval_config.weight must be greater than 0.0.
Run evalflow doctor to validate your dataset without running a full eval.
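If you generate datasets programmatically, the rules above are easy to replicate as a pre-flight check before committing a file. A minimal sketch in Python (an illustrative re-implementation, not evalflow's actual validator):

```python
import re

ACCEPTED_TASK_TYPES = {
    "summarization", "classification", "extraction",
    "qa", "generation", "rewrite",
}
KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def validate_dataset(dataset):
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    if "version" not in dataset:
        errors.append("version must be present")
    cases = dataset.get("test_cases", [])
    if not cases:
        errors.append("test_cases must contain at least one entry")
    seen = set()
    for case in cases:
        cid = case.get("id", "")
        if not KEBAB_CASE.match(cid):
            errors.append(f"id {cid!r} is not lowercase kebab-case")
        if cid in seen:
            errors.append(f"duplicate id {cid!r}")
        seen.add(cid)
        for field in ("input", "expected_output"):
            if field not in case:
                errors.append(f"{cid}: missing {field}")
        if case.get("task_type") not in ACCEPTED_TASK_TYPES:
            errors.append(f"{cid}: task_type must be an accepted value")
        cfg = case.get("eval_config", {})
        methods = cfg.get("methods", [])
        if len(methods) != len(set(methods)):
            errors.append(f"{cid}: eval_config.methods entries must be unique")
        if cfg.get("weight", 1.0) <= 0.0:
            errors.append(f"{cid}: eval_config.weight must be greater than 0.0")
    return errors

good = {"version": "1.0", "test_cases": [{
    "id": "classify-support-priority",
    "task_type": "classification",
    "input": "Classify as urgent or normal: ...",
    "expected_output": "urgent",
    "eval_config": {"methods": ["exact_match"], "weight": 1.0},
}]}
print(validate_dataset(good))  # []
```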