Overview
| Method | Best for | Cost |
|---|---|---|
| exact_match | Labels, booleans, structured output | Zero |
| embedding_similarity | Summaries, semantic answers, rewrites | Zero after local model download |
| consistency | Flaky or unstable generations | Multiple provider calls |
| llm_judge | Groundedness, nuanced quality checks | One extra judge-model call |
exact_match
Use exact_match when the correct answer has exactly one acceptable form after normalization.
Best for: classification labels, boolean answers (yes/no), short structured output where wording must be exact.
Cost: zero — no model calls, no local inference.
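The comparison can be sketched as below. This is a minimal illustration of the lowercase-and-strip matching described here, not evalflow's actual implementation, and the function name is hypothetical.

```python
def exact_match(expected: str, actual: str) -> bool:
    # Hypothetical sketch: normalize by lowercasing and stripping
    # leading/trailing whitespace, then compare exactly.
    return expected.strip().lower() == actual.strip().lower()

# Capitalization and surrounding whitespace are normalized away:
print(exact_match("Yes", " yes\n"))  # → True
# Extra punctuation or words still cause a miss:
print(exact_match("yes", "Yes."))    # → False
```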
evalflow normalizes both the expected and actual output before comparing (lowercased, whitespace stripped), so minor capitalization differences are handled automatically. However, exact_match will still fail if the model adds punctuation, extra words, or wraps the answer in surrounding text. Use it only when your prompt tightly controls the output format.

embedding_similarity
Use embedding_similarity when the wording of the answer can vary but the meaning should stay close to the expected output. evalflow embeds both strings using a local model and computes cosine similarity.
Best for: summaries, paraphrases, semantic Q&A, rewrites where exact phrasing does not matter.
Cost: zero after the local embedding model is downloaded on first use. No provider calls are made.
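Cosine similarity itself is easy to sketch. The toy vectors below stand in for real embeddings, which typically have hundreds of dimensions; this illustrates the metric, not evalflow's embedding model.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product of the vectors divided by the product of their norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" for illustration only.
close = cosine_similarity([0.9, 0.1, 0.0], [0.8, 0.2, 0.1])
far = cosine_similarity([0.9, 0.1, 0.0], [0.0, 0.1, 0.9])
print(round(close, 2), round(far, 2))  # → 0.98 0.01
```

Scores near 1.0 mean the two strings embed to nearly the same direction; how high a score counts as a pass is a separate thresholding decision.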
consistency
Use consistency when a prompt should produce stable, similar output across repeated calls. evalflow runs the same prompt multiple times and measures how similar the outputs are to each other.
Best for: detecting flaky generations — cases where the model sometimes gives a good answer and sometimes does not.
Cost: multiple provider calls per test case. Each consistency check sends the prompt to your provider several times, which increases latency and API cost.
llm_judge
Use llm_judge when string matching and embeddings are not strong enough — for example, when you need to check whether an answer is grounded in a provided context, or when quality requires genuine semantic reasoning to evaluate.
Best for: groundedness checks, factual accuracy, nuanced quality signals that a second model can assess more reliably than pattern matching.
Cost: one extra provider call per test case. The judge model scores the output; you can configure which model acts as the judge.
Set "judge": true to activate the judge model. evalflow sends the input, expected output, and actual output to the judge and uses its score as the result for that test case.

Combining methods
You can list multiple methods in a single test case. evalflow scores each one and combines the results using the weight field.
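As an illustration, a test case combining two methods might look like the sketch below. Only the weight field is named in this document; the surrounding file shape and every other key name are assumptions.

```yaml
# Hypothetical test-case entry: only "weight" is documented here;
# the other field names are illustrative assumptions.
- input: "Summarize the refund policy in one sentence."
  expected: "Customers can request a refund within 30 days."
  methods:
    - name: embedding_similarity
      weight: 0.7
    - name: exact_match
      weight: 0.3
```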
When should I use exact_match vs embedding_similarity?
Use exact_match when the output must be a specific token or short phrase with no acceptable variation, such as a category label like positive or a boolean like yes. Use embedding_similarity when the output is a sentence or paragraph where correct answers can be worded differently.
How many provider calls does consistency use?
By default, evalflow runs the prompt three times for a consistency check. The outputs are compared pairwise using embedding similarity. The final score reflects how similar those three outputs are to each other.
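The pairwise scheme can be sketched as follows. The word-overlap helper is a stand-in similarity used only to keep the example self-contained; evalflow compares the outputs with embedding similarity.

```python
from itertools import combinations

def consistency_score(outputs, similarity):
    # Mean similarity over every unordered pair of outputs.
    pairs = list(combinations(outputs, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

def word_overlap(a, b):
    # Stand-in similarity: Jaccard overlap of the word sets.
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

# Three runs of the same prompt: two agree, one diverges.
runs = ["the cat sat", "the cat sat", "a dog ran"]
print(round(consistency_score(runs, word_overlap), 2))  # → 0.33
```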
Can I use a different model as the judge?
Yes. The judge model is configured separately from the primary model in evalflow.yaml. This lets you use a cheaper or more capable model specifically for evaluation.
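For instance, the split between primary and judge model might look like this in evalflow.yaml; the key names and model identifiers here are assumptions, not confirmed configuration.

```yaml
# Hypothetical evalflow.yaml fragment: key names and model
# identifiers are illustrative assumptions.
model: big-model-v2          # primary model under evaluation
judge_model: small-model-v1  # cheaper model used only for llm_judge scoring
```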