A complete CI/CD lifecycle for an Azure AI Foundry prompt agent, demonstrating versioned prompts, tool changes, model upgrades, evaluation gates, and rollback.
- Phase 1: Web search only (web_search tool)
- Phase 2: Web search + Code Interpreter (web_search + code_interpreter tools) for data analysis
agents/
  tech-trends-agent.json            Active agent config (tools, model ref, eval pointers)
  tech-trends-agent.default.json    Default baseline (rollback target)
prompts/
  tech-trends-agent.md              Active system prompt (Markdown)
  tech-trends-agent.default.md      Default baseline prompt (rollback target)
evals/                              Golden dataset + evaluator config
scripts/
  deploy_agent.py                   Deploy agent to TEST or PROD
  rollback_agent.py                 Rollback to default or a saved artifact
  run_evaluation.py                 Run Foundry evaluation locally
  compare_models.py                 Side-by-side model comparison
  bootstrap.sh                      One-time Azure + GitHub setup
  teardown.sh                       Reverse everything bootstrap created
  lifecycle/
    01-phase1-web-search.sh         PR: agent with web search only
    02-phase2-code-interpreter.sh   PR: add code interpreter
    03-model-upgrade.sh             PR: upgrade model to gpt-4.1
infra/                              Bicep IaC for Foundry project
artifacts/                          Generated deployment snapshots (post-deploy)
.github/workflows/                  CI/CD pipelines
tests/                              Unit tests
- Python 3.12+
- Azure CLI (az login)
- GitHub CLI (gh auth login)
- An Azure subscription with permissions to create resources and App Registrations
- An Azure OpenAI deployment (e.g. gpt-4o-2024-11-20)
The bootstrap script provisions all Azure infrastructure and GitHub configuration in one shot.
# 1. Login to Azure and GitHub
az login
gh auth login
# 2. Run bootstrap
./scripts/bootstrap.sh \
--resource-group rg-agent-devops \
--account-name agentdevops \
--location eastus \
--github-repo san360/agent-devops

This creates:
- A resource group with TEST and PROD AI Foundry projects (via Bicep)
- An App Registration with a Service Principal
- 3 federated credentials for GitHub OIDC (main branch, pull requests, tags)
- RBAC role assignments (Azure AI User, Cognitive Services OpenAI User)
- Model availability validation (checks current + upgrade target model)
- 6 GitHub repository variables (AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID, FOUNDRY_TEST_ENDPOINT, FOUNDRY_PROD_ENDPOINT, GPT_DEPLOYMENT)
State is saved to .bootstrap-state.json for use by the teardown script.
| Flag | Required | Default | Description |
|---|---|---|---|
| --resource-group | Yes | - | Azure resource group name |
| --account-name | Yes | - | Base name for Foundry accounts (suffixed with test/prod) |
| --location | No | eastus | Azure region |
| --github-repo | No | san360/agent-devops | GitHub owner/repo for variables and federation |
| --gpt-deployment | No | gpt-4o-2024-11-20 | GPT model deployment name |
| --gpt-capacity | No | 30 | GPT deployment capacity (thousands of tokens per minute) |
Three scripts simulate the full agent lifecycle by creating PRs that trigger the CI/CD pipeline. Run them sequentially — each builds on the previous phase.
./scripts/lifecycle/01-phase1-web-search.sh

- Creates branch feature/phase1-web-search
- Configures the agent with the web_search tool
- Evaluation runs 5 Phase 1 test cases
- Opens a PR, which triggers evaluate.yml (deploy to TEST, then run the eval)
After the eval passes, merge the PR. deploy-prod.yml deploys to PROD.
./scripts/lifecycle/02-phase2-code-interpreter.sh

- Creates branch feature/phase2-code-interpreter from updated main
- Adds the code_interpreter tool alongside the existing web_search tool
- Extends the system prompt with a ## Data Analysis section
- Evaluation now runs all 8 test cases (Phase 1 + Phase 2) to check for regressions
- Opens a PR
After the eval passes, merge the PR.
./scripts/lifecycle/03-model-upgrade.sh

- Creates branch chore/model-upgrade-gpt41
- Upgrades model from gpt-4o-2024-11-20 (default) to gpt-4.1
- Updates the GPT_DEPLOYMENT GitHub variable to gpt-4.1
- Adds a model history entry in the agent config
- Opens a PR, where the eval gate verifies the new model scores at or above thresholds
The bootstrap script validates that both the current model and the upgrade target
(gpt-4.1) are available in your chosen Azure region. If gpt-4.1 is not available,
the script will list alternatives you can use instead.
After the eval passes, merge the PR. The full lifecycle demo is complete.
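The cumulative test selection across phases (5 cases in Phase 1, 8 once Phase 2 is added) can be pictured with a small filter. This is only a sketch under assumptions: the golden dataset's real schema is not shown here, so the phase and id fields, the select_cases helper, and the placeholder records are all hypothetical.

```python
# Hypothetical sketch: the "phase" and "id" fields are assumed, not the repo's schema.
def select_cases(dataset: list[dict], up_to_phase: int) -> list[dict]:
    """Keep every test case introduced at or before the given phase."""
    return [case for case in dataset if case["phase"] <= up_to_phase]


if __name__ == "__main__":
    # Placeholder records: 5 Phase 1 cases and 3 Phase 2 cases, matching the counts above.
    dataset = [{"id": f"p1-{i}", "phase": 1} for i in range(1, 6)]
    dataset += [{"id": f"p2-{i}", "phase": 2} for i in range(1, 4)]
    print(len(select_cases(dataset, 1)), len(select_cases(dataset, 2)))  # 5 8
```

Re-running the earlier cases in later phases is what lets the gate catch regressions introduced by a new tool or model.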
flowchart LR
subgraph PR Pipeline
A[Developer Push] --> B[Create PR]
B --> C[Deploy to TEST]
C --> D[Smoke Test]
D --> E[Evaluation Run]
E --> F{Scores Pass?}
F -->|Yes| G[Post Results to PR]
F -->|No| H[Block Merge]
end
subgraph Merge Pipeline
G --> I[Merge to main]
I --> J[Deploy to PROD]
J --> K[Commit Artifact]
end
flowchart TD
A[Pipeline Triggered] --> B{Evaluation exists?}
B -->|No| C[Create Evaluation\ntech-trends-agent-eval]
B -->|Yes| D[Reuse Existing Evaluation]
C --> E[Create Run]
D --> E
E --> F[Upload Golden Dataset]
F --> G[Invoke Agent with Queries]
G --> H[Run Evaluators\ntask_adherence, relevance,\ngroundedness, coherence]
H --> I[Wait for Completion]
I --> J[Output Scores & Report URL]
J --> K[Post Summary to PR]
style C fill:#4ade80,stroke:#16a34a
style D fill:#60a5fa,stroke:#2563eb
flowchart LR
P1["Phase 1\nweb_search"] -->|merge| P2["Phase 2\ncode_interpreter"]
P2 -->|merge| P3["Phase 3\nModel Upgrade\ngpt-4o → gpt-4.1"]
P1 -.->|eval gate| P1E[✅ 5 queries]
P2 -.->|eval gate| P2E[✅ 8 queries]
P3 -.->|eval gate| P3E[✅ 8 queries]
Remove all Azure resources and GitHub configuration created by bootstrap:
./scripts/teardown.sh # interactive confirmation prompt
./scripts/teardown.sh --yes # skip confirmation

This deletes:
- The resource group (and all resources within — TEST project, PROD project, model deployments)
- Federated credentials and the App Registration
- All GitHub repository variables created by bootstrap
- The .bootstrap-state.json state file
If you prefer to set up infrastructure manually instead of using bootstrap:
# 1. Install dependencies
pip install -r requirements.txt
# 2. Configure environment
cp .env.example .env
# Edit .env with your Foundry endpoints and deployment names
# Ensure each variable is prefixed with 'export' so they are
# visible to Python (os.environ) when sourced.
# 3. Login to Azure
az login
# 4. Deploy to test
source .env
python scripts/deploy_agent.py --env test --semver 1.0.0 --tools web_search

| Workflow | Trigger | Purpose |
|---|---|---|
| evaluate.yml | PR touching agents/, prompts/, evals/ | Deploy to test, run eval, post results to PR |
| deploy-prod.yml | Push to main touching agents/, prompts/ | Deploy to prod, commit artifact |
| monitor.yml | Daily cron (06:00 UTC) | Eval prod agent, open issue on drift |
The eval gate uses a create-once, run-many pattern with four evaluators:
- Task Adherence (threshold: 0.80)
- Relevance (threshold: 0.75)
- Groundedness (threshold: 0.75)
- Coherence (threshold: 0.80)
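The pass/fail decision behind these thresholds can be sketched as a plain comparison. The snippet is illustrative and is not taken from run_evaluation.py: the THRESHOLDS key names, the failing_evaluators helper, and the shape of the scores dict are assumptions, and the sample scores are invented. Only the four threshold values come from the list above.

```python
# Hypothetical sketch of the eval gate; names are illustrative, not from the repo.
THRESHOLDS = {
    "task_adherence": 0.80,
    "relevance": 0.75,
    "groundedness": 0.75,
    "coherence": 0.80,
}


def failing_evaluators(scores: dict[str, float]) -> list[str]:
    """Return the evaluators whose score is missing or below its threshold."""
    return [
        name
        for name, minimum in THRESHOLDS.items()
        if scores.get(name, 0.0) < minimum
    ]


if __name__ == "__main__":
    # Invented sample scores, used only to exercise the check.
    sample = {"task_adherence": 0.86, "relevance": 0.74, "groundedness": 0.81, "coherence": 0.90}
    failed = failing_evaluators(sample)
    # A CI step would exit non-zero on failure so the merge stays blocked.
    print("PASS" if not failed else f"FAIL: {', '.join(failed)}")
```

A score exactly at its threshold passes, matching the "at or above thresholds" wording used for the model-upgrade PR.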
A smoke test step runs before evaluation — it invokes the agent with a test query and verifies a valid response is returned.
Evaluation naming: A single evaluation named tech-trends-agent-eval is created
on the first pipeline run. Subsequent runs reuse the same evaluation and add new runs.
Each run is named {branch}/{commit-sha} for full traceability back to the source change.
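The create-once, run-many pattern and the run naming scheme can be illustrated with an in-memory stand-in for the evaluation service. This is a sketch only: the store dict, the get_or_create_evaluation and run_name helpers, and the sample commit SHAs are hypothetical, and no Foundry API is called.

```python
# Hypothetical sketch: a dict stands in for the evaluation service.
EVAL_NAME = "tech-trends-agent-eval"


def run_name(branch: str, commit_sha: str) -> str:
    """Build the run name in the {branch}/{commit-sha} form."""
    return f"{branch}/{commit_sha}"


def get_or_create_evaluation(store: dict[str, list[str]], name: str = EVAL_NAME) -> list[str]:
    """Return the run list for an evaluation, creating it only on first use."""
    return store.setdefault(name, [])


if __name__ == "__main__":
    store: dict[str, list[str]] = {}
    # First pipeline run creates the evaluation; the second reuses it.
    get_or_create_evaluation(store).append(run_name("feature/phase1-web-search", "abc1234"))
    get_or_create_evaluation(store).append(run_name("feature/phase2-code-interpreter", "def5678"))
    print(store)
```

Keeping every run under one evaluation makes scores from different branches and commits directly comparable in a single view.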
The rollback script supports two modes:
python scripts/rollback_agent.py --default prod

Copies agents/tech-trends-agent.default.json and prompts/tech-trends-agent.default.md
over the active config and prompt, then re-deploys the clean baseline to Foundry.
python scripts/rollback_agent.py artifacts/tech-trends-agent-v1.0.1.json prod

Reconstructs the agent config from the artifact's definition (model, tools, prompt) and
re-deploys that exact state to Foundry. Also writes the artifact's definition back to
agents/tech-trends-agent.json so local state matches production.
- Creates a new Foundry version with the restored prompt, tools, and model
- Updates the local agents/tech-trends-agent.json to match the rolled-back state
- Description field notes it is a rollback and the source version
- Does not delete the bad version from Foundry (history is immutable)
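The file-copy half of the default-mode rollback can be sketched as follows. The four paths come from the repository layout, but the restore_defaults helper is a hypothetical illustration, not the code in rollback_agent.py, and the re-deploy to Foundry is deliberately left out.

```python
import shutil
from pathlib import Path

# Paths follow the repository layout; the helper itself is a hypothetical sketch.
RESTORE_PAIRS = [
    ("agents/tech-trends-agent.default.json", "agents/tech-trends-agent.json"),
    ("prompts/tech-trends-agent.default.md", "prompts/tech-trends-agent.md"),
]


def restore_defaults(repo_root: Path) -> list[Path]:
    """Copy each default baseline over its active counterpart; return restored paths."""
    restored = []
    for source, target in RESTORE_PAIRS:
        destination = repo_root / target
        shutil.copyfile(repo_root / source, destination)
        restored.append(destination)
    return restored
```

Because the baseline files are only read, the same rollback can be repeated safely after any later change to the active config.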
The artifacts/ folder is the deployment ledger. Every production deploy via
deploy-prod.yml commits a versioned JSON snapshot here, recording:
- What was deployed (model, tools, prompt hash)
- When and where (timestamp, endpoint, environment)
- Which code produced it (git commit SHA, branch, tag)
Artifacts enable rollback, auditability, and drift detection. They are committed
with [skip ci] to avoid triggering re-deployment.
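A snapshot carrying the three groups of information listed above might be assembled like this. The field names, the nesting, and the build_artifact helper are assumptions made for illustration; the actual artifact schema written by deploy-prod.yml may differ.

```python
import hashlib
from datetime import datetime, timezone


def build_artifact(model, tools, prompt_text, endpoint, environment,
                   commit_sha, branch, tag=None):
    """Assemble a deployment snapshot; all field names are illustrative assumptions."""
    return {
        # What was deployed
        "definition": {
            "model": model,
            "tools": list(tools),
            "prompt_sha256": hashlib.sha256(prompt_text.encode("utf-8")).hexdigest(),
        },
        # When and where
        "deployment": {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "endpoint": endpoint,
            "environment": environment,
        },
        # Which code produced it
        "source": {"commit_sha": commit_sha, "branch": branch, "tag": tag},
    }
```

Hashing the prompt means a later drift check can compare the deployed prompt against the committed one by comparing two short strings.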
python scripts/compare_models.py --current gpt-4o-2024-11-20 --candidate gpt-4.1 --tools web_search

Deploys both model versions to test for side-by-side evaluation.
Uses GitHub OIDC federation — no secrets stored in the repository. The bootstrap.sh script
configures this automatically by creating an App Registration with federated credentials for
three GitHub Actions contexts:
| Credential | Subject | Used by |
|---|---|---|
| github-main | repo:owner/repo:ref:refs/heads/main | deploy-prod.yml |
| github-pr | repo:owner/repo:pull_request | evaluate.yml |
| github-release | repo:owner/repo:ref:refs/tags/* | Future release workflows |
For manual setup, create these federated credentials on an App Registration and assign
the Azure AI User and Cognitive Services OpenAI User roles scoped to your resource group.
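For manual setup, the three subject strings can be derived from the repository name. The helper below is a hypothetical convenience, not part of bootstrap.sh; it only formats the subjects shown in the table and does not call Azure or GitHub.

```python
def federated_subjects(owner: str, repo: str) -> dict[str, str]:
    """Map each credential name from the table to its OIDC subject string."""
    prefix = f"repo:{owner}/{repo}"
    return {
        "github-main": f"{prefix}:ref:refs/heads/main",
        "github-pr": f"{prefix}:pull_request",
        "github-release": f"{prefix}:ref:refs/tags/*",
    }


if __name__ == "__main__":
    for name, subject in federated_subjects("san360", "agent-devops").items():
        print(f"{name}: {subject}")
```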
pytest tests/ -v