The examples/ directory in the evalflow repository contains three ready-to-run projects. Each one demonstrates a different setup. Pick the example closest to your stack, copy its evalflow.yaml and dataset, then adapt them to your own prompts.
## Starting from an example

```bash
# Clone the repository
git clone https://github.com/emartai/evalflow
cd evalflow/examples/<example-name>

# Install evalflow
pip install evalflow

# Copy the environment template
cp .env.example .env

# Check your setup
evalflow doctor

# Run the quality gate
evalflow eval
```
## examples/openai-basic
A minimal local setup using OpenAI. This is the best starting point if you are new to evalflow or want the simplest possible configuration.
### What it demonstrates

- A single-provider `evalflow.yaml` configured for OpenAI
- Three test cases covering summarization, classification, and QA
- Both `exact_match` and `embedding_similarity` eval methods
- A prompt registry entry in `prompts/`
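To make the two eval methods concrete, here is a sketch of what a pair of entries in `evals/dataset.json` could look like. The field names (`id`, `prompt`, `expected`, `method`) are illustrative assumptions, not the documented schema; check the example's actual `dataset.json` for the real shape.

```json
[
  {
    "id": "classify-support-priority",
    "prompt": "Classify the priority of this support ticket: 'Checkout page is down for all users.'",
    "expected": "high",
    "method": "exact_match"
  },
  {
    "id": "summarize-release-notes",
    "prompt": "Summarize these release notes in one sentence: ...",
    "expected": "A one-sentence summary covering the main changes in the release.",
    "method": "embedding_similarity"
  }
]
```

The intuition: `exact_match` suits closed-set outputs like classification labels, while `embedding_similarity` tolerates paraphrasing in free-form outputs like summaries.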
### Files

```text
examples/openai-basic/
├── evalflow.yaml
├── evals/
│   └── dataset.json
├── prompts/
│   └── assistant.yaml
├── .env.example
└── README.md
```
### evalflow.yaml

```yaml
version: "1.0"
project: openai-basic

providers:
  openai:
    api_key_env: OPENAI_API_KEY
    default_model: gpt-4o-mini

eval:
  dataset: evals/dataset.json
  baseline_file: .evalflow/baseline.json
  default_provider: openai
  consistency_runs: 3
  thresholds:
    task_success: 0.80
    relevance: 0.75
    hallucination_max: 0.10
    consistency_min: 0.85
  judge:
    provider: groq
    model: llama-3.1-8b-instant

prompts:
  directory: prompts/
```
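The `prompts/` directory referenced above holds registry entries such as `prompts/assistant.yaml`. A hypothetical sketch of one entry follows; the keys shown are assumptions for illustration, not the documented registry format, so consult the example's actual file.

```yaml
# prompts/assistant.yaml (illustrative; key names are assumed)
name: assistant
version: 1
template: |
  You are a concise support assistant.
  Answer the user's question using only the provided context.
```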
### Running it

```bash
cd examples/openai-basic
cp .env.example .env
# Edit .env and set OPENAI_API_KEY
evalflow eval
```

```text
Running 3 test cases against gpt-4o-mini...

✓ summarize-release-notes      0.92
✓ classify-support-priority    1.00
✓ answer-billing-question      0.88

Quality Gate: PASS
Failures: 0
Run ID: 20240315-a3f9c2d81b4e
Duration: 3.8s
```
## examples/groq-ci
A Groq-based setup with a GitHub Actions workflow included. Groq’s free tier makes this example well suited for CI pipelines where you want zero-cost quality gates on every pull request.
### What it demonstrates

- A single-provider `evalflow.yaml` configured for Groq
- A working GitHub Actions workflow (`.github/workflows/evalflow.yml`)
- How to pass the API key as a GitHub Actions secret
- Test cases for summarization, classification, and QA
### Files

```text
examples/groq-ci/
├── evalflow.yaml
├── .github/
│   └── workflows/
│       └── evalflow.yml
├── evals/
│   └── dataset.json
├── .env.example
└── README.md
```
### evalflow.yaml

```yaml
version: "1.0"
project: groq-ci

providers:
  groq:
    api_key_env: GROQ_API_KEY
    default_model: llama-3.1-8b-instant

eval:
  dataset: evals/dataset.json
  baseline_file: .evalflow/baseline.json
  default_provider: groq
  consistency_runs: 3
  thresholds:
    task_success: 0.80
    relevance: 0.75
    hallucination_max: 0.10
    consistency_min: 0.85
  judge:
    provider: groq
    model: llama-3.1-8b-instant

prompts:
  directory: prompts/
```
### GitHub Actions workflow

```yaml
# .github/workflows/evalflow.yml
name: LLM Quality Gate

on:
  pull_request:
    paths:
      - "prompts/**"
      - "evals/**"
      - "**.py"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install evalflow
      - run: evalflow eval --provider groq
        env:
          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
```
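On repeated CI runs you can shave install time with `actions/setup-python`'s built-in pip cache, a documented option of that action (evalflow itself does not require it). Note that the `cache` input expects a dependency file such as `requirements.txt` in the repository by default:

```yaml
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: "pip"  # caches pip downloads between workflow runs
```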
### Running it locally

```bash
cd examples/groq-ci
cp .env.example .env
# Edit .env and set GROQ_API_KEY
evalflow eval --provider groq
```

```text
Running 3 test cases against llama-3.1-8b-instant...

✓ summarize-pr-description   0.91
✓ classify-ci-failure        1.00
✓ answer-readme-question     0.86

Quality Gate: PASS
Failures: 0
Run ID: 20240315-c7e8b1d23f9a
Duration: 3.1s
```
Add `GROQ_API_KEY` under your repository's **Settings → Secrets and variables → Actions** to enable the GitHub Actions workflow.
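If you prefer the command line, the GitHub CLI can also store the secret (this assumes `gh` is installed and authenticated for the repository):

```bash
# Store the key as a repository-level Actions secret
gh secret set GROQ_API_KEY --body "your-groq-api-key"
```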
## examples/langchain-app
An example that runs evalflow alongside a LangChain application. It shows how to add a quality gate to a project that already uses LangChain for its LLM calls, without changing any application code.
### What it demonstrates

- Running `evalflow eval` independently of your application code
- How evalflow's prompt registry integrates with a LangChain app (`app.py` loads the production prompt from the registry)
- Test cases that mirror the prompts used in the application
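The shape of such an `app.py` might look like the sketch below. Everything here is an assumption for illustration: the registry schema (a `template` key in `prompts/assistant.yaml`), the `render_prompt` helper, and the specific `langchain_openai` calls are not taken from the example itself.

```python
"""Hypothetical sketch of examples/langchain-app/app.py (assumed structure)."""


def render_prompt(template: str, **variables: str) -> str:
    """Fill {placeholder}-style slots in a registry template with runtime values."""
    return template.format(**variables)


def main() -> None:
    # Imports live inside main() so render_prompt stays importable
    # without PyYAML or LangChain installed.
    import yaml  # PyYAML, used here to read the registry entry
    from langchain_openai import ChatOpenAI

    # Load the production prompt from the evalflow prompt registry.
    # The "template" key is an assumed field name.
    with open("prompts/assistant.yaml") as f:
        entry = yaml.safe_load(f)

    prompt = render_prompt(entry["template"], question="How do refunds work?")
    llm = ChatOpenAI(model="gpt-4o-mini")
    print(llm.invoke(prompt).content)
```

Because the prompt text lives in the registry, `evalflow eval` and the application exercise the same prompt body without the app ever importing evalflow.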
### Files

```text
examples/langchain-app/
├── evalflow.yaml
├── app.py
├── evals/
│   └── dataset.json
├── prompts/
│   └── assistant.yaml
├── .env.example
└── README.md
```
### evalflow.yaml

```yaml
version: "1.0"
project: langchain-app

providers:
  openai:
    api_key_env: OPENAI_API_KEY
    default_model: gpt-4o-mini

eval:
  dataset: evals/dataset.json
  baseline_file: .evalflow/baseline.json
  default_provider: openai
  consistency_runs: 3
  thresholds:
    task_success: 0.80
    relevance: 0.75
    hallucination_max: 0.10
    consistency_min: 0.85
  judge:
    provider: groq
    model: llama-3.1-8b-instant

prompts:
  directory: prompts/
```
### Running it

```bash
cd examples/langchain-app
pip install evalflow langchain-openai langchain-core
cp .env.example .env
# Edit .env and set OPENAI_API_KEY

# Run the LangChain app
python app.py

# Run the quality gate
evalflow eval
```

```text
> python app.py
Reply: The assistant answers using the production prompt body.

> evalflow eval
Running 3 test cases against gpt-4o-mini...

✓ summarize-meeting-notes   0.90
✓ classify-user-intent      1.00
✓ answer-faq-context        0.87

Quality Gate: PASS
Failures: 0
Run ID: 20240315-b2d9f4e67c1b
Duration: 4.0s
```