evalflow is a CLI quality gate for LLM prompts. When you change a prompt, evalflow runs your test suite and blocks CI if quality regresses — the same way pytest blocks CI when tests fail.

Quick Start

Install evalflow and run your first quality gate in 10 minutes.

Core Concepts

Understand quality gates, baselines, and eval methods.

CLI Reference

Complete reference for every evalflow command and flag.

CI/CD Integration

Add evalflow to GitHub Actions, GitLab CI, or CircleCI.

How it works

1. Install: pip install evalflow
2. Initialize: run evalflow init to create your config, prompt files, and dataset.
3. Write test cases: add test cases to evals/dataset.json with expected outputs and eval methods.
4. Run the gate: run evalflow eval. Exit code 0 means pass, 1 means regression, 2 means error, so it slots directly into CI.
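The dataset from step 3 might look like the sketch below. The field names (cases, input, expected, eval) and the eval method names are assumptions for illustration; check the file that evalflow init generates for the actual schema.

```json
{
  "cases": [
    {
      "input": "Summarize: The meeting covered Q3 budget cuts and hiring plans.",
      "expected": "A summary that mentions Q3 budget cuts",
      "eval": "llm_judge"
    },
    {
      "input": "Classify sentiment: 'The product arrived broken.'",
      "expected": "negative",
      "eval": "exact_match"
    }
  ]
}
```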
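The exit codes from step 4 can be pictured with a toy sketch of the gate logic: compare per-task scores against a stored baseline and map the outcome to an exit code. This is illustrative only, not evalflow's implementation.

```python
# Toy sketch of the gate semantics -- illustrative only, not
# evalflow's implementation. Scores are assumed to be 0-1 floats.
PASS, REGRESSION, ERROR = 0, 1, 2

def gate(baseline: dict, current: dict, tolerance: float = 0.0) -> int:
    """Map an eval run to evalflow-style exit codes."""
    if baseline.keys() != current.keys():
        return ERROR  # task sets don't line up: treat as a config error
    regressed = [t for t in baseline if current[t] < baseline[t] - tolerance]
    return REGRESSION if regressed else PASS

baseline = {"summarization": 0.82, "classification": 0.91}
current = {"summarization": 0.85, "classification": 0.74}
code = gate(baseline, current)  # classification dropped, so code == 1
```

Because the result is an exit code rather than a report you have to read, any CI system that fails a job on non-zero exit can enforce the gate with no extra glue.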

Why evalflow

Traditional unit tests can't tell you when a prompt tweak quietly degrades output quality on another task. evalflow gives you a local quality gate that behaves the same on your laptop and in CI.
You changed one prompt.
Summarization improved.
Classification silently broke.
Nobody noticed for 4 days.

evalflow catches this in CI before it ships.
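A minimal GitHub Actions job might look like the sketch below. The step details are assumptions (your provider, secrets, and Python version will differ); see the CI/CD Integration guide for supported setups.

```yaml
name: prompt-quality
on: [pull_request]

jobs:
  evalflow:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install evalflow
      # evalflow eval exits non-zero on regression (1) or error (2),
      # which fails the job and blocks the merge.
      - run: evalflow eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```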

Supported providers

evalflow works with OpenAI, Anthropic, Groq, Gemini, and Ollama. Switch providers with a single flag — no code changes required.
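Switching providers might look like the following; the flag name --provider is a hypothetical stand-in, so check the CLI Reference for the actual spelling.

```shell
# Hypothetical flag name -- see the CLI Reference for the real one.
evalflow eval --provider openai
evalflow eval --provider ollama   # same suite against a local model
```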