Evals

Evals are to AI tools what tests are to traditional code. This framework provides a structured approach to evaluating the quality and reliability of AI-powered skills.

Why Evals Matter

Traditional software tests verify deterministic behavior. AI tools are probabilistic: the same prompt can produce different, equally valid outputs. Evals bridge this gap by:

  • Defining expected behaviors rather than exact outputs (see the sketch after this list)
  • Measuring quality across dimensions (accuracy, completeness, safety)
  • Detecting regressions when prompts or models change
  • Comparing performance across different LLM backends
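
To make the first point concrete, an expected behavior can be checked as a predicate over the model's output rather than as an exact string comparison. The sketch below is illustrative only; the behavior checks shown are assumptions, not the framework's actual expectation format.

```typescript
// Illustration only: expected behaviors expressed as predicates over the output,
// rather than exact string matches. The real framework's check format may differ.
type Behavior = (output: string) => boolean;

const expectedBehaviors: Behavior[] = [
  // The output should mention parameterized queries (assumed behavior for a security skill).
  (out) => out.toLowerCase().includes('parameterized'),
  // The output should not contain hard-coded credentials.
  (out) => !/password\s*=\s*["'][^"']+["']/.test(out),
];

export function meetsExpectations(output: string): boolean {
  return expectedBehaviors.every((check) => check(output));
}
```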

Quick Start

Running Evals

```bash
# Run all evals
npx nx run evals:run

# Run specific suite
npx nx run evals:run --suite=v4-security-foundations

# Dry run (show what would be evaluated)
npx nx run evals:run --dry-run
```

Writing Evals

  1. Create a test case in evals/suites/<skill>/cases/
  2. Define expected behaviors in evals/suites/<skill>/expected/
  3. Configure the suite in eval.config.ts (a sketch follows below)
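
The configuration schema is not documented on this page, so the following eval.config.ts is only a minimal sketch: the field names, the per-case threshold, and the default export are assumptions about what such a config might contain.

```typescript
// eval.config.ts -- hypothetical shape; field names here are assumptions, not the framework's actual API.
export interface EvalCase {
  prompt: string;    // Markdown prompt under cases/
  expected: string;  // Matching expected-behavior file under expected/
  threshold: number; // Minimum aggregate score (0-1) required to pass
}

export interface EvalSuiteConfig {
  name: string;
  cases: EvalCase[];
}

const config: EvalSuiteConfig = {
  name: 'v4-security-foundations',
  cases: [
    {
      prompt: 'cases/basic-case.md',
      expected: 'expected/basic-case.md',
      threshold: 0.8,
    },
  ],
};

export default config;
```

Listing cases explicitly makes it easy to see what a suite covers; the real schema may instead discover cases from the cases/ directory.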

Evaluation Dimensions

| Dimension    | Description                       | Score |
|--------------|-----------------------------------|-------|
| Accuracy     | Correctly implements requirements | 0-1   |
| Completeness | Includes all required elements    | 0-1   |
| Safety       | No security vulnerabilities       | 0-1   |
| Helpfulness  | Well-documented and clear         | 0-1   |
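
Each dimension is scored on a 0-1 scale. How the framework combines them is not specified here; the snippet below shows one plausible representation with an unweighted mean, purely as an assumption.

```typescript
// Hypothetical result shape and aggregation -- the framework's actual scorer may differ.
export interface DimensionScores {
  accuracy: number;     // 0-1
  completeness: number; // 0-1
  safety: number;       // 0-1
  helpfulness: number;  // 0-1
}

// Unweighted mean across the four dimensions (an assumed aggregation).
export function aggregateScore(scores: DimensionScores): number {
  const values = [scores.accuracy, scores.completeness, scores.safety, scores.helpfulness];
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}
```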

Suite Structure

```text
evals/suites/<skill-name>/
├── eval.config.ts      # Configuration
├── cases/              # Test prompts
│   └── basic-case.md
└── expected/           # Expected behaviors
    └── basic-case.md
```
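
Because a case and its expected behaviors share a filename, a suite can be loaded by pairing files across the two directories. The loader below is an illustrative sketch (the loadSuite name and return shape are assumptions), not the framework's actual implementation.

```typescript
// Pair each prompt in cases/ with the expected-behavior file of the same name.
// Illustrative sketch only; not the framework's actual loader.
import { readdirSync, readFileSync } from 'node:fs';
import { join } from 'node:path';

export interface LoadedCase {
  name: string;     // e.g. "basic-case"
  prompt: string;   // contents of cases/<name>.md
  expected: string; // contents of expected/<name>.md
}

export function loadSuite(suiteDir: string): LoadedCase[] {
  const casesDir = join(suiteDir, 'cases');
  const expectedDir = join(suiteDir, 'expected');

  return readdirSync(casesDir)
    .filter((file) => file.endsWith('.md'))
    .map((file) => ({
      name: file.replace(/\.md$/, ''),
      prompt: readFileSync(join(casesDir, file), 'utf8'),
      expected: readFileSync(join(expectedDir, file), 'utf8'),
    }));
}
```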

