The examples/ directory in the evalflow repository contains three ready-to-run projects. Each one demonstrates a different setup. Pick the example closest to your stack, copy its evalflow.yaml and dataset, then adapt them to your own prompts.
## Starting from an example

```bash
# Clone the repository
git clone https://github.com/emartai/evalflow
cd evalflow/examples/<example-name>

# Install evalflow
pip install evalflow

# Copy the environment template
cp .env.example .env

# Check your setup
evalflow doctor

# Run the quality gate
evalflow eval
```
## examples/openai-basic
A minimal local setup using OpenAI. This is the best starting point if you are new to evalflow or want the simplest possible configuration.
### What it demonstrates

- A single-provider `evalflow.yaml` configured for OpenAI
- Three test cases covering summarization, classification, and QA
- Both `exact_match` and `embedding_similarity` eval methods
- A prompt registry entry in `prompts/`
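To make the two eval methods concrete, here is a sketch of what a pair of entries in `evals/dataset.json` could look like. The field names (`id`, `prompt`, `expected`, `method`) are illustrative assumptions, not the documented schema; check the example's actual `dataset.json` for the real shape.

```json
[
  {
    "id": "classify-support-priority",
    "prompt": "Classify the priority of this support ticket: 'Checkout page is down for all users.'",
    "expected": "high",
    "method": "exact_match"
  },
  {
    "id": "summarize-release-notes",
    "prompt": "Summarize these release notes in one sentence: ...",
    "expected": "A one-sentence summary covering the main changes in the release.",
    "method": "embedding_similarity"
  }
]
```

The intuition: `exact_match` suits closed-set outputs like classification labels, while `embedding_similarity` tolerates paraphrasing in free-form outputs like summaries.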
### Files

```text
examples/openai-basic/
├── evalflow.yaml
├── evals/
│   └── dataset.json
├── prompts/
│   └── assistant.yaml
├── .env.example
└── README.md
```
### evalflow.yaml

```yaml
version: "1.0"
project: openai-basic

providers:
  openai:
    api_key_env: OPENAI_API_KEY
    default_model: gpt-4o-mini

eval:
  dataset: evals/dataset.json
  baseline_file: .evalflow/baseline.json
  default_provider: openai
  consistency_runs: 3
  thresholds:
    task_success: 0.80
    relevance: 0.75
    hallucination_max: 0.10
    consistency_min: 0.85
  judge:
    provider: groq
    model: llama-3.1-8b-instant

prompts:
  directory: prompts/
```
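The `prompts/` directory referenced above holds registry entries such as `prompts/assistant.yaml`. A hypothetical sketch of one entry follows; the keys shown are assumptions for illustration, not the documented registry format, so consult the example's actual file.

```yaml
# prompts/assistant.yaml (illustrative; key names are assumed)
name: assistant
version: 1
template: |
  You are a concise support assistant.
  Answer the user's question using only the provided context.
```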
### Running it

```bash
cd examples/openai-basic
cp .env.example .env
# Edit .env and set OPENAI_API_KEY
evalflow eval
```

```text
Running 3 test cases against gpt-4o-mini...

✓ summarize-release-notes      0.92
✓ classify-support-priority    1.00
✓ answer-billing-question      0.88

Quality Gate: PASS
Failures: 0
Run ID: 20240315-a3f9c2d81b4e
Duration: 3.8s
```
## examples/groq-ci
A Groq-based setup with a GitHub Actions workflow included. Groq’s free tier makes this example well suited for CI pipelines where you want zero-cost quality gates on every pull request.
### What it demonstrates

- A single-provider `evalflow.yaml` configured for Groq
- A working GitHub Actions workflow (`.github/workflows/evalflow.yml`)
- How to pass the API key as a GitHub Actions secret
- Test cases for summarization, classification, and QA
### Files

```text
examples/groq-ci/
├── evalflow.yaml
├── .github/
│   └── workflows/
│       └── evalflow.yml
├── evals/
│   └── dataset.json
├── .env.example
└── README.md
```
### evalflow.yaml

```yaml
version: "1.0"
project: groq-ci

providers:
  groq:
    api_key_env: GROQ_API_KEY
    default_model: llama-3.1-8b-instant

eval:
  dataset: evals/dataset.json
  baseline_file: .evalflow/baseline.json
  default_provider: groq
  consistency_runs: 3
  thresholds:
    task_success: 0.80
    relevance: 0.75
    hallucination_max: 0.10
    consistency_min: 0.85
  judge:
    provider: groq
    model: llama-3.1-8b-instant

prompts:
  directory: prompts/
```
### GitHub Actions workflow

```yaml
# .github/workflows/evalflow.yml
name: LLM Quality Gate

on:
  pull_request:
    paths:
      - "prompts/**"
      - "evals/**"
      - "**.py"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install evalflow
      - run: evalflow eval --provider groq
        env:
          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
```
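On repeated CI runs you can shave install time with `actions/setup-python`'s built-in pip cache, a documented option of that action (evalflow itself does not require it). Note that the `cache` input expects a dependency file such as `requirements.txt` in the repository by default:

```yaml
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: "pip"  # caches pip downloads between workflow runs
```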
### Running it locally

```bash
cd examples/groq-ci
cp .env.example .env
# Edit .env and set GROQ_API_KEY
evalflow eval --provider groq
```

```text
Running 3 test cases against llama-3.1-8b-instant...

✓ summarize-pr-description   0.91
✓ classify-ci-failure        1.00
✓ answer-readme-question     0.86

Quality Gate: PASS
Failures: 0
Run ID: 20240315-c7e8b1d23f9a
Duration: 3.1s
```
Add `GROQ_API_KEY` under your repository's **Settings → Secrets and variables → Actions** to enable the GitHub Actions workflow.
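If you prefer the command line, the GitHub CLI can also store the secret (this assumes `gh` is installed and authenticated for the repository):

```bash
# Store the key as a repository-level Actions secret
gh secret set GROQ_API_KEY --body "your-groq-api-key"
```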
## examples/langchain-app
An example that runs evalflow alongside a LangChain application. It shows how to add a quality gate to a project that already uses LangChain for its LLM calls, without changing any application code.
### What it demonstrates

- Running `evalflow eval` independently of your application code
- How evalflow's prompt registry integrates with a LangChain app (`app.py` loads the production prompt from the registry)
- Test cases that mirror the prompts used in the application
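The shape of such an `app.py` might look like the sketch below. Everything here is an assumption for illustration: the registry schema (a `template` key in `prompts/assistant.yaml`), the `render_prompt` helper, and the specific `langchain_openai` calls are not taken from the example itself.

```python
"""Hypothetical sketch of examples/langchain-app/app.py (assumed structure)."""


def render_prompt(template: str, **variables: str) -> str:
    """Fill {placeholder}-style slots in a registry template with runtime values."""
    return template.format(**variables)


def main() -> None:
    # Imports live inside main() so render_prompt stays importable
    # without PyYAML or LangChain installed.
    import yaml  # PyYAML, used here to read the registry entry
    from langchain_openai import ChatOpenAI

    # Load the production prompt from the evalflow prompt registry.
    # The "template" key is an assumed field name.
    with open("prompts/assistant.yaml") as f:
        entry = yaml.safe_load(f)

    prompt = render_prompt(entry["template"], question="How do refunds work?")
    llm = ChatOpenAI(model="gpt-4o-mini")
    print(llm.invoke(prompt).content)
```

Because the prompt text lives in the registry, `evalflow eval` and the application exercise the same prompt body without the app ever importing evalflow.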
### Files

```text
examples/langchain-app/
├── evalflow.yaml
├── app.py
├── evals/
│   └── dataset.json
├── prompts/
│   └── assistant.yaml
├── .env.example
└── README.md
```
### evalflow.yaml

```yaml
version: "1.0"
project: langchain-app

providers:
  openai:
    api_key_env: OPENAI_API_KEY
    default_model: gpt-4o-mini

eval:
  dataset: evals/dataset.json
  baseline_file: .evalflow/baseline.json
  default_provider: openai
  consistency_runs: 3
  thresholds:
    task_success: 0.80
    relevance: 0.75
    hallucination_max: 0.10
    consistency_min: 0.85
  judge:
    provider: groq
    model: llama-3.1-8b-instant

prompts:
  directory: prompts/
```
### Running it

```bash
cd examples/langchain-app
pip install evalflow langchain-openai langchain-core
cp .env.example .env
# Edit .env and set OPENAI_API_KEY

# Run the LangChain app
python app.py

# Run the quality gate
evalflow eval
```

```text
> python app.py
Reply: The assistant answers using the production prompt body.

> evalflow eval
Running 3 test cases against gpt-4o-mini...

✓ summarize-meeting-notes   0.90
✓ classify-user-intent      1.00
✓ answer-faq-context        0.87

Quality Gate: PASS
Failures: 0
Run ID: 20240315-b2d9f4e67c1b
Duration: 4.0s
```