
Overview

This example demonstrates how to gate a change on multiple eval suites at once, comparing results across accuracy, performance, cost, and user experience.

Contract

version: 1
name: comprehensive-gate
description: Comprehensive evaluation across multiple test suites

required_evals:
  - name: accuracy-suite
    rules:
      - metric: accuracy
        operator: ">="
        baseline: fixed
        threshold: 0.90
  
  - name: performance-suite
    rules:
      - metric: latency_p95
        operator: "<="
        baseline: fixed
        threshold: 200
  
  - name: cost-suite
    rules:
      - metric: cost_per_request
        operator: "<="
        baseline: previous
        max_delta: 0.1
        description: Cost should not increase more than 10%
  
  - name: user-experience-suite
    rules:
      - metric: satisfaction_score
        operator: ">="
        baseline: fixed
        threshold: 4.0

on_violation:
  action: require_approval
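
The cost-suite rule uses baseline: previous with max_delta, so its bound is computed from the previous run rather than fixed in the contract. A minimal sketch of that arithmetic (an illustration of the rule semantics, not the library's actual internals): for a "<=" rule, the allowed ceiling is the previous value scaled by (1 + max_delta).

```typescript
// Sketch of how a baseline: previous rule with max_delta resolves to a bound.
// This is an assumed reconstruction of the rule semantics, not library code.
function allowedCeiling(previous: number, maxDelta: number): number {
  return previous * (1 + maxDelta);
}

// With a previous cost_per_request of 0.10 and max_delta: 0.1,
// the gate tolerates values up to 0.11.
const ceiling = allowedCeiling(0.10, 0.1);
console.log(ceiling.toFixed(2)); // "0.11"
```

This matches the violation message shown later: a value of 0.12 exceeds the 0.11 ceiling derived from a previous value of 0.10.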

Sample Data

accuracy-suite.json:
{
  "evalName": "accuracy-suite",
  "runId": "run-123",
  "metrics": {
    "accuracy": 0.92
  }
}
performance-suite.json:
{
  "evalName": "performance-suite",
  "runId": "run-123",
  "metrics": {
    "latency_p95": 180
  }
}
cost-suite.json:
{
  "evalName": "cost-suite",
  "runId": "run-123",
  "metrics": {
    "cost_per_request": 0.05
  }
}
user-experience-suite.json:
{
  "evalName": "user-experience-suite",
  "runId": "run-123",
  "metrics": {
    "satisfaction_score": 4.2
  }
}
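
Each result file carries an evalName that corresponds to an entry in the contract's required_evals. A hedged sketch of that matching, assuming it is simple name equality (the EvalResult shape below mirrors the sample JSON; the helper name is hypothetical):

```typescript
// Shape mirroring the sample JSON files above.
interface EvalResult {
  evalName: string;
  runId: string;
  metrics: Record<string, number>;
}

// Hypothetical helper: which required suites have no matching result yet.
function unmatchedSuites(required: string[], results: EvalResult[]): string[] {
  const seen = new Set(results.map((r) => r.evalName));
  return required.filter((name) => !seen.has(name));
}

const required = ["accuracy-suite", "performance-suite", "cost-suite", "user-experience-suite"];
const results: EvalResult[] = [
  { evalName: "accuracy-suite", runId: "run-123", metrics: { accuracy: 0.92 } },
];
console.log(unmatchedSuites(required, results)); // the three suites not yet supplied
```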

Running the Check

geval check \
  --contract comprehensive-gate.yaml \
  --eval accuracy-suite.json \
  --eval performance-suite.json \
  --eval cost-suite.json \
  --eval user-experience-suite.json \
  --baseline baseline-cost-suite.json

Only cost-suite declares baseline: previous, so the single --baseline file is matched to that suite; the other suites are checked against their fixed thresholds.

Expected Output

PASS (with the sample data above):
✓ PASS

Contract:    comprehensive-gate
Version:     1

All 4 eval(s) passed contract requirements
REQUIRES_APPROVAL (if, for example, cost_per_request rises to 0.12 against a previous value of 0.10):
⚠ REQUIRES_APPROVAL

Contract:    comprehensive-gate
Version:     1

Requires approval: 1 violation(s) in 1 eval

Violations
  1. cost-suite → cost_per_request
     cost_per_request = 0.12, expected <= 0.11 (previous: 0.10)
     Cost increased by 20%, exceeds 10% threshold

Programmatic Usage

import {
  parseContractFromYaml,
  parseEvalFile,
  evaluate,
} from "@geval-labs/core";
import { readFileSync } from "fs";

const contract = parseContractFromYaml(
  readFileSync("comprehensive-gate.yaml", "utf-8")
);

const evalFiles = [
  "accuracy-suite.json",
  "performance-suite.json",
  "cost-suite.json",
  "user-experience-suite.json",
];

const evalResults = evalFiles.map((file) =>
  parseEvalFile(readFileSync(file, "utf-8"), file, contract)
);

const baselineCost = JSON.parse(readFileSync("baseline-cost-suite.json", "utf-8"));

const decision = evaluate({
  contract,
  evalResults,
  baselines: {
    "cost-suite": {
      type: "previous",
      metrics: baselineCost.metrics,
      source: { runId: baselineCost.runId },
    },
  },
});

console.log(`Status: ${decision.status}`);
decision.violations.forEach((v) => {
  console.log(`Violation: ${v.evalName} → ${v.metric}`);
});
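
In CI, the decision status typically drives the process exit code. A minimal sketch of that mapping; the status strings below are assumptions inferred from the CLI output (PASS, REQUIRES_APPROVAL), not confirmed values from @geval-labs/core.

```typescript
// Assumed status values, inferred from the CLI output shown above;
// the actual enum exposed by the library may differ.
type DecisionStatus = "pass" | "requires_approval" | "fail";

function exitCodeFor(status: DecisionStatus): number {
  switch (status) {
    case "pass":
      return 0; // let the pipeline continue
    case "requires_approval":
      return 2; // distinct code so CI can route to a manual approval step
    case "fail":
      return 1; // hard failure
  }
}

console.log(exitCodeFor("requires_approval")); // 2
```

Using a distinct exit code for requires_approval lets a workflow pause for sign-off instead of treating every violation as a hard failure.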

CI/CD Integration

# .github/workflows/comprehensive-check.yml
name: Comprehensive Eval Check

on: [pull_request]

jobs:
  comprehensive-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Run all eval suites
        run: |
          npm run test:accuracy -- --output accuracy-suite.json
          npm run test:performance -- --output performance-suite.json
          npm run test:cost -- --output cost-suite.json
          npm run test:ux -- --output user-experience-suite.json
      
      - name: Download baseline
        uses: actions/download-artifact@v4
        with:
          name: baseline-cost
          path: ./baseline
      
      - name: Install Geval
        run: npm install -g @geval-labs/cli
      
      - name: Comprehensive check
        run: |
          geval check \
            --contract comprehensive-gate.yaml \
            --eval accuracy-suite.json \
            --eval performance-suite.json \
            --eval cost-suite.json \
            --eval user-experience-suite.json \
            --baseline baseline/cost-suite.json

Best Practices

  1. Separate concerns - Use a dedicated eval suite for each quality dimension
  2. Mix baseline types - Use fixed baselines for hard requirements and previous baselines for trend limits
  3. Appropriate actions - Use require_approval for multi-dimensional checks so a human can weigh trade-offs instead of hard-blocking
  4. Clear naming - Name eval suites descriptively so violations are easy to trace
  5. Document thresholds - Use rule descriptions to explain why each threshold matters