The examples/ directory in the evalflow repository contains three ready-to-run projects. Each one demonstrates a different setup. Pick the example closest to your stack, copy its evalflow.yaml and dataset, then adapt them to your own prompts.

Starting from an example

# Clone the repository
git clone https://github.com/emartai/evalflow
cd evalflow/examples/<example-name>

# Install evalflow
pip install evalflow

# Copy the environment template
cp .env.example .env

# Check your setup
evalflow doctor

# Run the quality gate
evalflow eval

examples/openai-basic

A minimal local setup using OpenAI. This is the best starting point if you are new to evalflow or want the simplest possible configuration.
What it demonstrates
  • A single-provider evalflow.yaml configured for OpenAI
  • Three test cases covering summarization, classification, and QA
  • Both exact_match and embedding_similarity eval methods
  • A prompt registry entry in prompts/
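The contents of evals/dataset.json are not shown on this page. The sketch below illustrates how a case file might select between the two eval methods listed above; the field names (id, input, expected, eval_method) are assumptions for illustration, not evalflow's documented schema, so check the example's own dataset.json for the real format. The case ids match the ones reported in the run output further down.

```json
[
  {
    "id": "classify-support-priority",
    "input": "My invoice is wrong and I need it fixed before Friday.",
    "expected": "high",
    "eval_method": "exact_match"
  },
  {
    "id": "summarize-release-notes",
    "input": "Release 2.4 adds retry logic, fixes a memory leak, and drops Python 3.8 support.",
    "expected": "A short summary noting the retry logic, the leak fix, and the dropped Python 3.8 support.",
    "eval_method": "embedding_similarity"
  }
]
```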
Files
examples/openai-basic/
├── evalflow.yaml
├── evals/
│   └── dataset.json
├── prompts/
│   └── assistant.yaml
├── .env.example
└── README.md
evalflow.yaml
version: "1.0"
project: openai-basic

providers:
  openai:
    api_key_env: OPENAI_API_KEY
    default_model: gpt-4o-mini

eval:
  dataset: evals/dataset.json
  baseline_file: .evalflow/baseline.json
  default_provider: openai
  consistency_runs: 3

thresholds:
  task_success: 0.80
  relevance: 0.75
  hallucination_max: 0.10
  consistency_min: 0.85

judge:
  provider: openai
  model: gpt-4o-mini

prompts:
  directory: prompts/
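The registry entry in prompts/assistant.yaml is not shown on this page. A sketch of what such an entry might contain is below; the fields (name, version, body) are assumptions for illustration rather than evalflow's documented registry schema, so refer to the file in the example for the real layout.

```yaml
# prompts/assistant.yaml (illustrative sketch)
name: assistant
version: 1
body: |
  You are a concise support assistant.
  Answer using only the provided context.
```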
Running it
cd examples/openai-basic
cp .env.example .env
# Edit .env and set OPENAI_API_KEY
evalflow eval
Running 3 test cases against gpt-4o-mini...

✓ summarize-release-notes      0.92
✓ classify-support-priority    1.00
✓ answer-billing-question      0.88

Quality Gate: PASS
Failures: 0
Run ID: 20240315-a3f9c2d81b4e
Duration: 3.8s

examples/groq-ci

A Groq-based setup with a GitHub Actions workflow included. Groq’s free tier makes this example well suited for CI pipelines where you want zero-cost quality gates on every pull request.
What it demonstrates
  • A single-provider evalflow.yaml configured for Groq
  • A working GitHub Actions workflow (.github/workflows/evalflow.yml)
  • How to pass the API key as a GitHub Actions secret
  • Test cases for summarization, classification, and QA
Files
examples/groq-ci/
├── evalflow.yaml
├── .github/
│   └── workflows/
│       └── evalflow.yml
├── evals/
│   └── dataset.json
├── .env.example
└── README.md
evalflow.yaml
version: "1.0"
project: groq-ci

providers:
  groq:
    api_key_env: GROQ_API_KEY
    default_model: llama-3.1-8b-instant

eval:
  dataset: evals/dataset.json
  baseline_file: .evalflow/baseline.json
  default_provider: groq
  consistency_runs: 3

thresholds:
  task_success: 0.80
  relevance: 0.75
  hallucination_max: 0.10
  consistency_min: 0.85

judge:
  provider: groq
  model: llama-3.1-8b-instant

prompts:
  directory: prompts/
GitHub Actions workflow
# .github/workflows/evalflow.yml
name: LLM Quality Gate

on:
  pull_request:
    paths:
      - "prompts/**"
      - "evals/**"
      - "**.py"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install evalflow
      - run: evalflow eval --provider groq
        env:
          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
Running it locally
cd examples/groq-ci
cp .env.example .env
# Edit .env and set GROQ_API_KEY
evalflow eval --provider groq
Running 3 test cases against llama-3.1-8b-instant...

✓ summarize-pr-description    0.91
✓ classify-ci-failure         1.00
✓ answer-readme-question      0.86

Quality Gate: PASS
Failures: 0
Run ID: 20240315-c7e8b1d23f9a
Duration: 3.1s
Add GROQ_API_KEY to your repository’s Settings → Secrets and variables → Actions to enable the GitHub Actions workflow.
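If you prefer the terminal to the web UI, the same secret can be added with the GitHub CLI (this assumes gh is installed and authenticated against the repository):

```shell
# Prompts for the secret value interactively, keeping the key out of shell history
gh secret set GROQ_API_KEY
```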

examples/langchain-app

An example that runs evalflow alongside a LangChain application. It shows how to add a quality gate to a project that already uses LangChain for its LLM calls, without changing any application code.
What it demonstrates
  • Running evalflow eval independently of your application code
  • How evalflow’s prompt registry integrates with a LangChain app (app.py loads the production prompt from the registry)
  • Test cases that mirror the prompts used in the application
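The registry integration in app.py could be sketched roughly as follows. This is an illustration, not the example's actual code: the load_prompt helper and the YAML body field are assumptions about the registry format, while the LangChain calls use the langchain-openai and langchain-core packages this example installs.

```python
"""Illustrative sketch of app.py: load the production prompt from the
registry directory and wire it into a LangChain chain.

Assumptions (not evalflow's documented API): the registry entry is a
YAML file with a `body` field, and `load_prompt` is a hypothetical
helper written for this sketch.
"""
from pathlib import Path

import yaml  # PyYAML


def load_prompt(name: str, directory: str = "prompts") -> str:
    """Read a registry entry and return its prompt body."""
    entry = yaml.safe_load(Path(directory, f"{name}.yaml").read_text())
    return entry["body"]


def build_chain(model: str = "gpt-4o-mini"):
    """Build a chain whose system message is the production prompt."""
    # Imported here so load_prompt stays usable without langchain installed.
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_openai import ChatOpenAI

    prompt = ChatPromptTemplate.from_messages(
        [("system", load_prompt("assistant")), ("human", "{question}")]
    )
    return prompt | ChatOpenAI(model=model)


# Usage (requires OPENAI_API_KEY in the environment):
#   reply = build_chain().invoke({"question": "What plans do you offer?"})
#   print("Reply:", reply.content)
```

Because the application reads the prompt from the registry at runtime, the same prompts/assistant.yaml file serves both the app and the quality gate, which is what keeps the two in sync without application-code changes.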
Files
examples/langchain-app/
├── evalflow.yaml
├── app.py
├── evals/
│   └── dataset.json
├── prompts/
│   └── assistant.yaml
├── .env.example
└── README.md
evalflow.yaml
version: "1.0"
project: langchain-app

providers:
  openai:
    api_key_env: OPENAI_API_KEY
    default_model: gpt-4o-mini

eval:
  dataset: evals/dataset.json
  baseline_file: .evalflow/baseline.json
  default_provider: openai
  consistency_runs: 3

thresholds:
  task_success: 0.80
  relevance: 0.75
  hallucination_max: 0.10
  consistency_min: 0.85

judge:
  provider: openai
  model: gpt-4o-mini

prompts:
  directory: prompts/
Running it
cd examples/langchain-app
pip install evalflow langchain-openai langchain-core
cp .env.example .env
# Edit .env and set OPENAI_API_KEY

# Run the LangChain app
python app.py

# Run the quality gate
evalflow eval
> python app.py
Reply: The assistant answers using the production prompt body.

> evalflow eval

Running 3 test cases against gpt-4o-mini...

✓ summarize-meeting-notes     0.90
✓ classify-user-intent        1.00
✓ answer-faq-context          0.87

Quality Gate: PASS
Failures: 0
Run ID: 20240315-b2d9f4e67c1b
Duration: 4.0s