CI/CD Quality Gates
Automatically test your agent on every PR and block deploys that don’t meet your reliability standards.
The Quality Gates dashboard feature is coming soon. The CI/CD workflow pattern described below works today using your own BSS threshold checks, but the dashboard-based gate configuration is not yet available.
This guide shows you how to integrate Invarium into your CI/CD pipeline using GitHub Actions. The workflow generates tests, runs them against your agent, syncs results, and fails the build if the BSS score drops below your threshold.
How it works
- A developer opens a pull request
- The CI workflow triggers and generates behavioral tests for the agent
- Tests are run against the agent in the CI environment
- Results are synced to Invarium
- The workflow checks the BSS score against your threshold
- The build passes or fails based on the score
GitHub Actions workflow
Create a file at `.github/workflows/invarium-test.yml` in your repository:

```yaml
name: Invarium Agent Testing

on:
  pull_request:
    branches: [main, develop]
  push:
    branches: [main]

env:
  INVARIUM_API_KEY: ${{ secrets.INVARIUM_API_KEY }}
  AGENT_NAME: customer-support-agent
  BSS_THRESHOLD: 75
  TEST_COUNT: 10

jobs:
  test-agent:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install invarium-mcp requests

      - name: Generate test cases
        id: generate
        run: |
          python3 - <<'PY'
          import json, os, sys, time
          import requests

          api_key = os.environ['INVARIUM_API_KEY']
          agent = os.environ['AGENT_NAME']
          count = int(os.environ['TEST_COUNT'])
          base_url = 'https://api.invarium.dev/v1'
          headers = {'Authorization': f'Bearer {api_key}'}

          # Kick off test generation
          resp = requests.post(
              f'{base_url}/agents/{agent}/tests/generate',
              headers=headers,
              json={'count': count, 'complexity': 'mixed'},
          )
          resp.raise_for_status()
          gen_id = resp.json()['generation_id']
          print(f'Generation started: {gen_id}')

          # Poll for completion (up to ~2.5 minutes)
          data = {}
          for _ in range(30):
              time.sleep(5)
              resp = requests.get(
                  f'{base_url}/agents/{agent}/tests',
                  params={'generation_id': gen_id},
                  headers=headers,
              )
              resp.raise_for_status()
              data = resp.json()
              if data.get('status') == 'completed':
                  break
          else:
              sys.exit(f"Test generation did not complete (status: {data.get('status')})")

          # Save test cases for the next step
          with open('test_cases.json', 'w') as f:
              json.dump(data['tests'], f)
          print(f'Generated {len(data["tests"])} test cases')
          PY

      - name: Run tests against agent
        run: |
          python3 - <<'PY'
          import json

          # Import your agent (adjust the import to match your project)
          from agent import executor

          with open('test_cases.json') as f:
              tests = json.load(f)

          results = []
          for test in tests:
              try:
                  response = executor.invoke({'input': test['user_message']})
                  # LangChain returns intermediate steps as
                  # (action, observation) pairs
                  results.append({
                      'scenario_id': test['scenario_id'],
                      'user_message': test['user_message'],
                      'agent_response': response['output'],
                      'tools_called': [
                          {'name': action.tool, 'parameters': action.tool_input}
                          for action, _obs in response.get('intermediate_steps', [])
                      ],
                  })
              except Exception as e:
                  results.append({
                      'scenario_id': test['scenario_id'],
                      'user_message': test['user_message'],
                      'agent_response': f'ERROR: {e}',
                      'tools_called': [],
                  })

          with open('results.json', 'w') as f:
              json.dump(results, f)
          print(f'Ran {len(results)} tests')
          PY

      - name: Sync results and check BSS
        run: |
          python3 - <<'PY'
          import json, os, sys
          import requests

          api_key = os.environ['INVARIUM_API_KEY']
          agent = os.environ['AGENT_NAME']
          threshold = int(os.environ['BSS_THRESHOLD'])
          base_url = 'https://api.invarium.dev/v1'
          headers = {'Authorization': f'Bearer {api_key}'}

          with open('results.json') as f:
              results = json.load(f)

          # Sync results to Invarium
          resp = requests.post(
              f'{base_url}/agents/{agent}/results/sync',
              headers=headers,
              json={'results': results, 'source': 'ci'},
          )
          resp.raise_for_status()
          data = resp.json()

          bss = data.get('bss_score', 0)
          passed = data.get('passed', 0)
          failed = data.get('failed', 0)
          print(f'BSS Score: {bss}')
          print(f'Tests: {passed}/{passed + failed} passed')
          print(f'Threshold: {threshold}')
          print(f'Dashboard: {data.get("dashboard_url", "")}')

          if bss < threshold:
              print(f'FAILED: BSS score {bss} is below threshold {threshold}')
              sys.exit(1)
          print(f'PASSED: BSS score {bss} meets threshold {threshold}')
          PY
```

Setting up secrets
Add your Invarium API key as a GitHub Actions secret:
- Go to your repository on GitHub
- Navigate to Settings > Secrets and variables > Actions
- Click New repository secret
- Name: `INVARIUM_API_KEY`
- Value: your API key (e.g., `inv_abc123...`)
Never commit API keys to your repository. Always use GitHub Secrets or your CI provider’s secret management.
Configuration options
Customize the workflow by changing the environment variables at the top:
| Variable | Description | Default |
|---|---|---|
| `AGENT_NAME` | The name of the agent to test | (required) |
| `BSS_THRESHOLD` | Minimum BSS score to pass the gate | 75 |
| `TEST_COUNT` | Number of test cases to generate | 10 |
Choosing a threshold
| Use case | Recommended threshold |
|---|---|
| Early development | 50-60 |
| Staging / pre-production | 65-75 |
| Production deploy gate | 75-85 |
| Safety-critical agents | 85-95 |
Start with a threshold slightly below your current BSS score and raise it over time as your agent improves.
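That ratcheting advice can be automated from your own CI history. A minimal sketch, assuming you record recent BSS scores yourself; `suggest_threshold` and its `margin` are illustrative helpers, not part of the Invarium API:

```python
def suggest_threshold(recent_bss_scores, margin=5, floor=50, ceiling=95):
    """Suggest a gate threshold slightly below the worst recent BSS score.

    Starting below the current score avoids blocking every PR on day one;
    re-running this periodically raises the gate as the agent improves.
    """
    if not recent_bss_scores:
        return floor
    candidate = min(recent_bss_scores) - margin
    # Clamp to a sane range for a CI gate
    return max(floor, min(ceiling, candidate))

# Example: the last five CI runs scored in the low 80s,
# so gate at 5 points below the worst of them.
print(suggest_threshold([82, 85, 81, 84, 83]))  # -> 76
```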
Other CI providers
The same approach works with GitLab CI. Add INVARIUM_API_KEY as a CI/CD variable in your GitLab project settings:
```yaml
# .gitlab-ci.yml
invarium-test:
  stage: test
  image: python:3.11
  script:
    - pip install -r requirements.txt
    - pip install invarium-mcp requests
    - python3 scripts/run_invarium_tests.py
  variables:
    AGENT_NAME: customer-support-agent
    BSS_THRESHOLD: "75"
    TEST_COUNT: "10"
```

Extract the test logic from the GitHub Actions workflow into a standalone Python script (`scripts/run_invarium_tests.py`) that handles generation, running, syncing, and threshold checking.
Tips
- Run on PRs, not just pushes. Testing on pull requests catches regressions before they merge. Testing on pushes to main gives you a baseline.
- Keep test count manageable. 10-20 tests is usually enough for a CI gate. Save larger test suites (50-100 tests) for nightly or weekly runs.
- Use `complexity: 'mixed'` in CI to cover a range of difficulty levels. Use `complexity: 'complex'` for more thorough nightly runs.
- Monitor CI test duration. If the workflow takes too long, reduce the test count or run tests in parallel.
- Set up dashboard alerts. Configure quality gate alerts to notify your team when the BSS score changes significantly.
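If CI duration becomes the bottleneck, the run step can fan independent test cases out across threads. A minimal sketch, where `run_one` is a hypothetical stand-in for the `executor.invoke` call in the run step above; this assumes each test case is independent, your agent client is thread-safe, and your LLM rate limits can absorb concurrent calls:

```python
from concurrent.futures import ThreadPoolExecutor


def run_one(test):
    """Stand-in for invoking your agent on a single test case.

    In the real workflow this would call executor.invoke(...) and
    collect the response and tool calls, as in the run step above.
    """
    return {'scenario_id': test['scenario_id'], 'agent_response': 'ok'}


def run_parallel(tests, max_workers=4):
    """Run independent test cases concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input
        return list(pool.map(run_one, tests))


tests = [{'scenario_id': f's{i}'} for i in range(8)]
results = run_parallel(tests)
print(len(results))                    # -> 8
print(results[0]['scenario_id'])       # -> s0
```

Keep `max_workers` modest at first: the usual failure mode is tripping provider rate limits, which shows up as spurious test failures rather than a faster pipeline.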