Each test case in your dataset declares which method evalflow uses to score the model output. The right method depends on what kind of output you are testing and how much it can vary in wording while still being correct.

Overview

| Method | Best for | Cost |
| --- | --- | --- |
| exact_match | Labels, booleans, structured output | Zero |
| embedding_similarity | Summaries, semantic answers, rewrites | Zero after local model download |
| consistency | Flaky or unstable generations | Multiple provider calls |
| llm_judge | Groundedness, nuanced quality checks | One extra judge-model call |

exact_match

Use exact_match when the correct answer has one right form after normalization. evalflow lowercases and strips whitespace from both the expected and actual output before comparing. Best for: classification labels, boolean answers (yes/no), short structured output where wording must be exact. Cost: zero — no model calls, no local inference.
"eval_config": {
  "methods": ["exact_match"],
  "judge": false,
  "weight": 1.0
}
Because of this normalization, minor capitalization or whitespace differences are handled automatically. However, exact_match will still fail if the model adds punctuation, extra words, or any text around the answer. Use it only when your prompt tightly controls the output format.
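The comparison can be sketched in a few lines. This is an illustration of the described behavior, not evalflow's actual source; the function name is hypothetical:

```python
def exact_match_score(expected: str, actual: str) -> float:
    """Compare after lowercasing and stripping surrounding whitespace."""
    def normalize(s: str) -> str:
        return s.strip().lower()
    return 1.0 if normalize(expected) == normalize(actual) else 0.0

# Capitalization and surrounding whitespace are forgiven...
print(exact_match_score("Paris", "  paris\n"))  # 1.0
# ...but extra words are not.
print(exact_match_score("Paris", "The capital is Paris."))  # 0.0
```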

embedding_similarity

Use embedding_similarity when the wording of the answer can vary but the meaning should stay close to the expected output. evalflow embeds both strings using a local model and computes cosine similarity. Best for: summaries, paraphrases, semantic Q&A, rewrites where exact phrasing does not matter. Cost: zero after the local embedding model is downloaded on first use. No provider calls are made.
"eval_config": {
  "methods": ["embedding_similarity"],
  "judge": false,
  "weight": 1.0
}
You can combine embedding_similarity with exact_match in the same test case by listing both in the methods array. evalflow scores each method and averages them according to the configured weight.
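Cosine similarity over two embedding vectors is a standard computation. A dependency-free sketch, with made-up three-dimensional vectors standing in for real embeddings (which typically have hundreds of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings" of the expected and actual outputs.
expected_vec = [0.2, 0.8, 0.1]
actual_vec = [0.25, 0.75, 0.05]
print(round(cosine_similarity(expected_vec, actual_vec), 3))
```

Vectors pointing in nearly the same direction score close to 1.0, which is how semantically similar outputs pass even when their wording differs.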

consistency

Use consistency when a prompt should produce stable, similar output across repeated calls. evalflow runs the same prompt multiple times and measures how similar the outputs are to each other. Best for: detecting flaky generations — cases where the model sometimes gives a good answer and sometimes does not. Cost: multiple provider calls per test case. Each consistency check sends the prompt to your provider several times, which increases latency and API cost.
"eval_config": {
  "methods": ["consistency"],
  "judge": false,
  "weight": 1.0
}
consistency is the most expensive method. Reserve it for prompts where instability is a known risk — for example, open-ended generation tasks or prompts that produce different formats on different runs.
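The pairwise comparison can be sketched as follows. Note the stand-in similarity function: evalflow compares runs with embedding similarity, but a word-overlap (Jaccard) measure keeps this sketch dependency-free, and the function names are illustrative:

```python
import re
from itertools import combinations

def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a stand-in for embedding similarity."""
    wa = set(re.findall(r"[a-z]+", a.lower()))
    wb = set(re.findall(r"[a-z]+", b.lower()))
    return len(wa & wb) / len(wa | wb)

def consistency_score(outputs: list[str]) -> float:
    """Average similarity over every pair of repeated generations."""
    pairs = list(combinations(outputs, 2))
    return sum(word_overlap(a, b) for a, b in pairs) / len(pairs)

runs = [
    "The capital of France is Paris.",
    "Paris is the capital of France.",
    "France's capital city is Paris.",
]
print(round(consistency_score(runs), 3))  # 0.667
```

A model that answers the same way every run scores 1.0; divergent runs pull the average down.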

llm_judge

Use llm_judge when string matching and embeddings are not strong enough — for example, when you need to check whether an answer is grounded in a provided context, or when quality requires genuine semantic reasoning to evaluate. Best for: groundedness checks, factual accuracy, nuanced quality signals that a second model can assess more reliably than pattern matching. Cost: one extra provider call per test case. The judge model scores the output; you can configure which model acts as the judge.
"eval_config": {
  "methods": ["llm_judge"],
  "judge": true,
  "weight": 1.0
}
Set "judge": true to activate the judge model. evalflow sends the input, expected output, and actual output to the judge and uses its score as the result for that test case.

Combining methods

You can list multiple methods in a single test case. evalflow scores each one and combines the results using the weight field.
```json
{
  "id": "answer-with-context",
  "input": "What is the capital of France based on the provided text?",
  "expected_output": "Paris",
  "context": "France is a country in Western Europe. Its capital is Paris.",
  "eval_config": {
    "methods": ["embedding_similarity", "llm_judge"],
    "judge": true,
    "weight": 1.0
  }
}
```
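One plausible reading of that combination step, sketched below. The exact formula is an assumption: the docs say per-method scores are averaged and combined using weight, so this sketch takes the mean of the method scores and scales it by the test case's weight:

```python
def combine_scores(method_scores: dict[str, float], weight: float = 1.0) -> float:
    """Average the per-method scores, then scale by the test case's weight.
    Assumed semantics; evalflow's actual formula may differ."""
    mean = sum(method_scores.values()) / len(method_scores)
    return mean * weight

scores = {"embedding_similarity": 0.92, "llm_judge": 1.0}
print(combine_scores(scores))  # 0.96
```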
FAQ

When should I use exact_match instead of embedding_similarity?
Use exact_match when the output must be a specific token or short phrase with no acceptable variation, such as a category label like positive or a boolean like yes. Use embedding_similarity when the output is a sentence or paragraph where correct answers can be worded differently.

How many times does a consistency check run the prompt?
By default, evalflow runs the prompt three times. The outputs are compared pairwise using embedding similarity, and the final score reflects how similar those three outputs are to each other.

Can I use a different model as the judge?
Yes. The judge model is configured separately from the primary model in evalflow.yaml. This lets you use a cheaper or more capable model specifically for evaluation.