Basic Structure

A contract defines quality requirements for your AI application. Here’s the basic structure:
version: 1                    # Contract schema version (required)
name: string                  # Contract name (required)
description: string           # Optional description
environment: production       # Optional: production | staging | development

sources:                     # Optional: inline CSV/JSON parsing config
  csv:
    metrics:
      - column: accuracy
        aggregate: avg
    evalName:
      fixed: my-eval

required_evals:              # Required: list of eval suites to check
  - name: eval-name
    description: string      # Optional
    rules:                   # Required: list of rules
      - metric: accuracy
        operator: ">="
        baseline: fixed
        threshold: 0.85

on_violation:                # Required: violation handler
  action: block              # block | require_approval | warn
  message: string            # Optional custom message

metadata:                    # Optional: contract metadata
  key: value                 # All values must be strings
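
For orientation, here is a minimal concrete contract that fills in the skeleton above (the names and numbers are illustrative, not defaults):
version: 1
name: my-first-contract
description: Minimum quality bar before release
environment: staging

required_evals:
  - name: my-eval                # Must match the evalName in your results
    rules:
      - metric: accuracy
        operator: ">="
        baseline: fixed
        threshold: 0.85

on_violation:
  action: block
  message: "Accuracy fell below 0.85"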

Contract Schema Reference

# Required: Schema version (always 1)
version: 1

# Required: Contract name
name: my-contract

# Optional: Description
description: Quality requirements for my AI system

# Optional: Environment (development, staging, production)
environment: production

# Optional: Source configurations for parsing CSV/JSON/JSONL files
sources:
  # Config for CSV files
  csv:
    metrics:
      - column: accuracy          # Simple: just column name
        aggregate: avg            # How to aggregate (see Aggregation Methods)
      - column: latency
        aggregate: p95
        as: latency_p95           # Optional: rename metric
      - column: status
        aggregate: pass_rate
    evalName:
      fixed: my-eval              # Fixed name, or use column name
    runId:
      fixed: run-123              # Optional: run ID
    timestamp: created_at         # Optional: column for timestamp
  
  # Config for JSON files (optional)
  json:
    metrics:
      - column: score
        aggregate: avg
    json:
      resultsPath: data.results   # Path to array in JSON

# Required: At least one eval suite
required_evals:
  - name: eval-suite-name  # Must match evalName in results
    description: Optional description
    rules:
      - metric: metric_name        # Metric to check
        operator: ">="             # ==, !=, <, <=, >, >=
        baseline: fixed            # fixed, previous, or main
        threshold: 0.85            # For fixed baseline
        max_delta: 0.05            # Max allowed change (for previous/main)
        description: Rule explanation

# Required: What to do on violation
on_violation:
  action: block           # block, require_approval, or warn
  message: "Custom message"
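
To see what the CSV source config above produces, consider a hypothetical results file (assuming avg, p95, and pass_rate mean the arithmetic mean, the 95th percentile, and the fraction of rows that count as passing):
accuracy,latency,status,created_at
0.91,120,pass,2024-06-01T10:00:00Z
0.88,340,pass,2024-06-01T10:01:00Z
0.79,95,fail,2024-06-01T10:02:00Z
Under those assumptions, this yields accuracy = 0.86 (mean of the three rows), latency_p95 ≈ 340 (with only three rows the 95th percentile is effectively the maximum), and pass_rate ≈ 0.67, all reported under the eval name my-eval.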

Violation Actions

Define what happens when a contract is violated:

block

Block the release. Exit code 1. Use for production-critical requirements.
on_violation:
  action: block
  message: "Quality metrics did not meet requirements"

require_approval

Require manual approval. Exit code 2. Use when human review is needed.
on_violation:
  action: require_approval
  message: "Manual review required"

warn

Log a warning but allow the release. Exit code 0. Use for non-critical metrics.
on_violation:
  action: warn
  message: "Warning: Some metrics below optimal"

Best Practices

1. Start Simple

Begin with fixed thresholds. Add relative baselines later, as sketched after this example.
rules:
  - metric: accuracy
    operator: ">="
    baseline: fixed
    threshold: 0.85
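
Once a fixed threshold is stable, a relative rule can guard against regressions between runs. A sketch using the previous baseline and max_delta fields from the schema reference (the 0.05 tolerance is illustrative):
rules:
  - metric: accuracy
    operator: ">="
    baseline: previous          # Compare against the previous run
    max_delta: 0.05             # Max allowed change versus the previous run (illustrative)
    description: Accuracy must not regress by more than 0.05 between runs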

2. Version Your Contracts

Store contracts in git alongside your code. Review contract changes like code changes.

3. Use Descriptive Names

name: customer-support-bot-quality-gate
required_evals:
  - name: response-quality-metrics
    rules:
      - metric: helpfulness_score
        description: Responses must be helpful to customers

4. Set Appropriate Thresholds

Don’t set thresholds too tightly at first; adjust them based on real data.

5. Block on Critical Metrics Only

Use warn for non-critical metrics, block for critical ones.
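
Since on_violation applies to the whole contract, one way to do this (a sketch, assuming your pipeline checks more than one contract) is to split critical and non-critical metrics into separate contracts:
# Critical metrics: block the release on violation
version: 1
name: critical-quality
required_evals:
  - name: my-eval
    rules:
      - metric: accuracy
        operator: ">="
        baseline: fixed
        threshold: 0.85
on_violation:
  action: block

# Non-critical metrics: warn only (threshold and units are illustrative)
version: 1
name: non-critical-quality
required_evals:
  - name: my-eval
    rules:
      - metric: latency_p95
        operator: "<="
        baseline: fixed
        threshold: 2000
on_violation:
  action: warn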