
CI/CD Quality Gates

Automatically test your agent on every PR and block deploys that don’t meet your reliability standards.

⚠️

The Quality Gates dashboard feature is coming soon. The CI/CD workflow pattern described below works today using your own BSS threshold checks, but the dashboard-based gate configuration is not yet available.

This guide shows you how to integrate Invarium into your CI/CD pipeline using GitHub Actions. The workflow generates tests, runs them against your agent, syncs results, and fails the build if the BSS score drops below your threshold.


How it works

PR opened → Generate tests → Run agent → Sync results → Check BSS → Pass / Fail
  1. A developer opens a pull request
  2. The CI workflow triggers and generates behavioral tests for the agent
  3. Tests are run against the agent in the CI environment
  4. Results are synced to Invarium
  5. The workflow checks the BSS score against your threshold
  6. The build passes or fails based on the score

GitHub Actions workflow

Create a file at .github/workflows/invarium-test.yml in your repository:

name: Invarium Agent Testing
 
on:
  pull_request:
    branches: [main, develop]
  push:
    branches: [main]
 
env:
  INVARIUM_API_KEY: ${{ secrets.INVARIUM_API_KEY }}
  AGENT_NAME: customer-support-agent
  BSS_THRESHOLD: 75
  TEST_COUNT: 10
 
jobs:
  test-agent:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
 
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
 
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install invarium-mcp requests
 
      - name: Generate test cases
        id: generate
        run: |
          python3 -c "
          import requests, os, time
 
          api_key = os.environ['INVARIUM_API_KEY']
          agent = os.environ['AGENT_NAME']
          count = int(os.environ['TEST_COUNT'])
          base_url = 'https://api.invarium.dev/v1'
          headers = {'Authorization': f'Bearer {api_key}'}
 
          # Generate tests
          resp = requests.post(f'{base_url}/agents/{agent}/tests/generate',
              headers=headers,
              json={'count': count, 'complexity': 'mixed'})
          resp.raise_for_status()
          gen_id = resp.json()['generation_id']
          print(f'Generation started: {gen_id}')
 
          # Poll for completion (up to ~150 seconds)
          for _ in range(30):
              time.sleep(5)
              resp = requests.get(f'{base_url}/agents/{agent}/tests?generation_id={gen_id}',
                  headers=headers)
              resp.raise_for_status()
              data = resp.json()
              if data.get('status') == 'completed':
                  break
          else:
              raise SystemExit('Test generation did not complete in time')
 
          # Save test cases
          import json
          with open('test_cases.json', 'w') as f:
              json.dump(data['tests'], f)
          print(f'Generated {len(data[\"tests\"])} test cases')
          "
 
      - name: Run tests against agent
        run: |
          python3 -c "
          import json
 
          # Load test cases
          with open('test_cases.json') as f:
              tests = json.load(f)
 
          # Import your agent (adjust the import to match your project)
          from agent import executor
 
          results = []
          for test in tests:
              try:
                  response = executor.invoke({'input': test['user_message']})
                  results.append({
                      'scenario_id': test['scenario_id'],
                      'user_message': test['user_message'],
                      'agent_response': response['output'],
                      'tools_called': [
                          # LangChain intermediate_steps are (action, observation) tuples
                          {'name': action.tool, 'parameters': action.tool_input}
                          for action, _observation in response.get('intermediate_steps', [])
                      ]
                  })
              except Exception as e:
                  results.append({
                      'scenario_id': test['scenario_id'],
                      'user_message': test['user_message'],
                      'agent_response': f'ERROR: {str(e)}',
                      'tools_called': []
                  })
 
          with open('results.json', 'w') as f:
              json.dump(results, f)
          print(f'Ran {len(results)} tests')
          "
 
      - name: Sync results and check BSS
        run: |
          python3 -c "
          import requests, os, json, sys
 
          api_key = os.environ['INVARIUM_API_KEY']
          agent = os.environ['AGENT_NAME']
          threshold = int(os.environ['BSS_THRESHOLD'])
          base_url = 'https://api.invarium.dev/v1'
          headers = {'Authorization': f'Bearer {api_key}'}
 
          # Load results
          with open('results.json') as f:
              results = json.load(f)
 
          # Sync results
          resp = requests.post(f'{base_url}/agents/{agent}/results/sync',
              headers=headers,
              json={'results': results, 'source': 'ci'})
          resp.raise_for_status()
          data = resp.json()
 
          bss = data.get('bss_score', 0)
          passed = data.get('passed', 0)
          failed = data.get('failed', 0)
          total = passed + failed
 
          print(f'BSS Score: {bss}')
          print(f'Tests: {passed}/{total} passed')
          print(f'Threshold: {threshold}')
          print(f'Dashboard: {data.get(\"dashboard_url\", \"\")}')
 
          if bss < threshold:
              print(f'FAILED: BSS score {bss} is below threshold {threshold}')
              sys.exit(1)
          else:
              print(f'PASSED: BSS score {bss} meets threshold {threshold}')
          "

Setting up secrets

Add your Invarium API key as a GitHub Actions secret:

  1. Go to your repository on GitHub
  2. Navigate to Settings > Secrets and variables > Actions
  3. Click New repository secret
  4. Name: INVARIUM_API_KEY
  5. Value: your API key (e.g., inv_abc123...)
⚠️

Never commit API keys to your repository. Always use GitHub Secrets or your CI provider’s secret management.


Configuration options

Customize the workflow by changing the environment variables at the top:

Variable        Description                           Default
AGENT_NAME      The name of the agent to test         (required)
BSS_THRESHOLD   Minimum BSS score to pass the gate    75
TEST_COUNT      Number of test cases to generate      10

Choosing a threshold

Use case                  Recommended threshold
Early development         50-60
Staging / pre-production  65-75
Production deploy gate    75-85
Safety-critical agents    85-95

Start with a threshold slightly below your current BSS score and raise it over time as your agent improves.
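That ratcheting rule can be made concrete with a tiny helper. This is illustration only — `starting_threshold` and its defaults are not part of the Invarium API:

```python
def starting_threshold(current_bss, margin=5, floor=50, ceiling=95):
    """Suggest an initial gate threshold slightly below the current BSS score.

    Starting `margin` points under today's score lets the gate pass immediately;
    the result is clamped to a sensible range. Raise it as the agent improves.
    """
    return max(floor, min(ceiling, int(current_bss) - margin))


print(starting_threshold(82))   # 77 -- gate passes today, with headroom
print(starting_threshold(52))   # 50 -- clamped to the floor
```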


Other CI providers

The same approach works with GitLab CI. Add INVARIUM_API_KEY as a CI/CD variable in your GitLab project settings:

# .gitlab-ci.yml
invarium-test:
  stage: test
  image: python:3.11
  script:
    - pip install -r requirements.txt
    - pip install invarium-mcp requests
    - python3 scripts/run_invarium_tests.py
  variables:
    AGENT_NAME: customer-support-agent
    BSS_THRESHOLD: "75"
    TEST_COUNT: "10"

Extract the test logic from the GitHub Actions workflow into a standalone Python script (scripts/run_invarium_tests.py) that handles generation, running, syncing, and threshold checking.
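A minimal sketch of that script is below. It uses only the standard library (urllib in place of requests, so the sketch has no extra dependencies) and the sync endpoint shown in this guide; the generation and agent-run steps are elided, and `check_threshold` is a helper name introduced here for illustration:

```python
# scripts/run_invarium_tests.py -- sketch of a standalone CI gate script.
import json
import os
import sys
import urllib.request

BASE_URL = "https://api.invarium.dev/v1"


def api_post(path, payload, api_key):
    """POST a JSON payload to the Invarium API and return the parsed response."""
    req = urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def check_threshold(bss, threshold):
    """Return True when the BSS score meets the gate threshold."""
    return bss >= threshold


def main():
    api_key = os.environ["INVARIUM_API_KEY"]
    agent = os.environ["AGENT_NAME"]
    threshold = int(os.environ.get("BSS_THRESHOLD", "75"))

    # Generation and agent execution go here (see the workflow above);
    # this sketch assumes results.json was already written by those steps.
    with open("results.json") as f:
        results = json.load(f)

    data = api_post(f"/agents/{agent}/results/sync",
                    {"results": results, "source": "ci"}, api_key)
    bss = data.get("bss_score", 0)
    print(f"BSS Score: {bss} (threshold {threshold})")
    if not check_threshold(bss, threshold):
        sys.exit(f"FAILED: BSS score {bss} is below threshold {threshold}")
    print("PASSED")


# Guard on the API key so the module can be imported without network access.
if __name__ == "__main__" and os.environ.get("INVARIUM_API_KEY"):
    main()
```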


Tips

  • Run on PRs, not just pushes. Testing on pull requests catches regressions before they merge. Testing on pushes to main gives you a baseline.
  • Keep test count manageable. 10-20 tests is usually enough for a CI gate. Save larger test suites (50-100 tests) for nightly or weekly runs.
  • Use complexity: 'mixed' in CI to cover a range of difficulty levels. Use complexity: 'complex' for more thorough nightly runs.
  • Monitor CI test duration. If the workflow takes too long, reduce test count or run tests in parallel.
  • Set up dashboard alerts. Configure quality gate alerts to notify your team when the BSS score changes significantly.
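If CI duration becomes a bottleneck, the sequential loop in the workflow above can be parallelized with a thread pool. A minimal sketch — `run_one` is a stand-in for your agent invocation, and in practice it should keep the per-test try/except from the workflow so one failing case doesn't abort the run:

```python
from concurrent.futures import ThreadPoolExecutor


def run_tests_parallel(tests, run_one, max_workers=4):
    """Run each test case through run_one concurrently; map preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, tests))


# Example with a stub agent that just echoes the message in upper case:
tests = [{"scenario_id": i, "user_message": f"question {i}"} for i in range(3)]
results = run_tests_parallel(
    tests,
    lambda t: {"scenario_id": t["scenario_id"],
               "agent_response": t["user_message"].upper()},
)
print([r["agent_response"] for r in results])  # ['QUESTION 0', 'QUESTION 1', 'QUESTION 2']
```

Threads are a good fit here because each test spends most of its time waiting on LLM and API calls rather than on the CPU.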