60 changes: 60 additions & 0 deletions docs/evals.md
@@ -0,0 +1,60 @@
# Skill Evaluation Status

Continuous evaluation status for Tula skills. This page is regenerated
automatically by `scripts/generate-eval-status.sh` on every CI run that
touches `skills/` or `evals/`. Static analysis (compliance, spec
checks, token budgets) is recomputed on every run; live eval results
come from manually published runs in `results/`.

Powered by [Microsoft Waza](https://github.com/microsoft/waza).

| Skill | Compliance | Spec | Tokens | Last live run |
|---|---|---|---|---|
| `epic-note` | Medium-High | 9/9 ✓ | 705 / 500 ⚠ | - |
| `health-records` | Medium-High | 9/9 ✓ | 1318 / 500 ⚠ | - |
| `lookout` | Medium-High | 9/9 ✓ | 1577 / 500 ⚠ | - |
| `med-pdf` | Medium-High | 9/9 ✓ | 842 / 500 ⚠ | - |
| `memory-diff` | Medium-High | 9/9 ✓ | 1183 / 500 ⚠ | - |
| `myhealth-pulse` | Medium-High | 9/9 ✓ | 1176 / 500 ⚠ | - |
| `prep-my-visit` | Medium-High | 9/9 ✓ | 457 / 500 ✓ | - |
| `request-amendment` | Medium-High | 9/9 ✓ | 990 / 500 ⚠ | - |

---

## What this measures

- **Compliance** - Waza's agentskills.io readiness score
(`High` / `Medium-High` / `Medium` / `Low`). `Medium-High` or better
is the house target.
- **Spec** - count of agentskills.io spec checks the skill passes
(`spec-frontmatter`, `spec-name`, `spec-allowed-fields`, and so on).
9/9 is full pass.
- **Tokens** - total tokens in `SKILL.md` against Waza's 500-token soft
limit. Tula's house style accepts a higher count when cutting a skill
down to the limit would hurt openclaw fidelity (per
`skills/AGENTS.md`'s "Token Discipline" section). `⚠` marks "exceeds
the soft cap but intentional"; `✓` marks "within budget."
- **Last live run** - most recent `waza run` output published in
`results/`. Cells show pass rate, run date, and model used (e.g.,
`5/5 ✓ (2026-05-17, sonnet-4.6)`). Live eval execution requires
`executor: copilot-sdk` plus model auth, so results are currently
published by hand rather than produced by a per-PR CI run. Raw run
outputs stay private; only the pass-rate summary surfaces here.
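
As an illustration of the **Tokens** column, the sketch below renders one cell from a token count. The helper name `token_cell` is hypothetical, and treating a count exactly equal to the limit as "within budget" is an assumption; the real logic lives in `scripts/generate-eval-status.sh` and may differ.

```bash
# Hypothetical helper (not taken from generate-eval-status.sh):
# render the "Tokens" cell for one skill.
# Assumption: a count at or below the limit counts as within budget.
token_cell() {
  count="$1"
  limit="${2:-500}"   # Waza's soft limit, per the text above
  if [ "$count" -le "$limit" ]; then
    printf '%s / %s ✓\n' "$count" "$limit"
  else
    printf '%s / %s ⚠\n' "$count" "$limit"
  fi
}

token_cell 457   # prints: 457 / 500 ✓
token_cell 705   # prints: 705 / 500 ⚠
```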

## What this does NOT measure

- The model's actual answer quality. Evals check task-completion
signals (output shape, presence/absence of keywords, routing
behavior, schema validity), not clinical correctness.
- Production behavior under PHI. All evals run against synthetic
personas. See `evals/*/fixtures/` for the test data.
- Anything inside Aria's closed governance layer - multi-tenant
isolation, audit emission, cross-actor coordination - which is
evaluated separately under hospital-scale fixtures.

## See also

- [Eval suites](../evals/) - task definitions and fixtures
- [Skill authoring conventions](../skills/AGENTS.md)
- [Tula deployment guide](deployment-guide.md)
- [Microsoft Waza](https://github.com/microsoft/waza) - the eval framework
73 changes: 72 additions & 1 deletion scripts/agent-backup.sh
@@ -50,8 +50,9 @@
# ## Exit codes
# 0 Success (whether or not there were changes)
# 1 Generic error
# 2 Secret-pattern scan failed - see stderr for offending file(s)
# 2 Secret-pattern scan or large-file guard failed - see stderr
# 3 Push failed (commit was made; resolve auth and retry `git push`)
# 4 Privacy guard failed (remote repo is not PRIVATE - refused to push)
#
# ## Exclusions (mirrors the repo's `.gitignore` - keep both in sync)
# credentials/ telegram pairing secrets
@@ -153,6 +154,11 @@ PURGE=(
  'logs'
  'update-check.json'
  'plugin-runtime-deps'
  # ~700MB of plugin npm projects; contains coding-agent binaries of
  # 200MB+ each (> GitHub's 100MB file cap). Regenerable via
  # `openclaw plugins install ...`.
  'npm'
)

# Nested-.git protection. Any `.git` directory under the source - at any
@@ -172,6 +178,12 @@ PROTECT=(
'docs'
)

# Hard cap on individual file size in the backup tree. GitHub rejects any
# file >100MB without LFS. We set a tighter 50MB cap to catch problems
# before they hit the remote, and to keep the repo cloneable on slow links.
# Anything over this should be added to PURGE.
MAX_FILE_BYTES=$((50 * 1024 * 1024))

# Regex patterns that look like real credentials. Tuned to be high-signal;
# if a pattern fires, the run aborts unless the file is in ALLOWLIST_GLOBS.
SECRET_PATTERNS=(
@@ -340,6 +352,63 @@ else
log "secret scan: clean"
fi

# ---------- step 3b: large-file guard --------------------------------------
#
# Refuse to stage anything over MAX_FILE_BYTES. GitHub rejects >100MB hard,
# but we want to catch the problem early (cheaper than a failed push) and
# under a tighter budget so clone-from-backup stays fast.

if [[ $DRY_RUN -eq 0 ]]; then
  big_files=$(find "$AGENT_REPO_DIR" -type f -size +"${MAX_FILE_BYTES}c" \
    -not -path "$AGENT_REPO_DIR/.git/*" 2>/dev/null || true)
  if [[ -n "$big_files" ]]; then
    echo "" >&2
    echo "Large-file guard FAILED. Files over $((MAX_FILE_BYTES/1024/1024))MB:" >&2
    echo "------------------------------------------------------------" >&2
    while IFS= read -r f; do
      sz=$(du -h "$f" | cut -f1)
      # Quote the prefix so glob characters in the path are matched literally.
      printf ' %s\t%s\n' "$sz" "${f#"$AGENT_REPO_DIR"/}" >&2
    done <<< "$big_files"
    echo "------------------------------------------------------------" >&2
    echo "Add the offending path (or its parent dir) to the PURGE array." >&2
    exit 2
  fi
  log "large-file guard: clean (no files > $((MAX_FILE_BYTES/1024/1024))MB)"
fi

# ---------- step 3c: remote-private guard ----------------------------------
#
# Defense in depth: refuse to push if the GitHub repo is somehow public.
# Catches a hand-toggle in the GitHub UI that would otherwise expose every
# subsequent backup commit. Only runs for github.com remotes when `gh` is
# available and authenticated; otherwise it is a soft warning.

verify_repo_private() {
  local remote_url="$1"
  if ! command -v gh >/dev/null 2>&1; then
    log "privacy guard: gh CLI not installed - SKIPPED (soft warning)"
    return 0
  fi
  # Capture everything after the owner as the repo name; the optional .git
  # suffix is stripped below so that names containing dots still match.
  if ! [[ "$remote_url" =~ github\.com[:/]([^/]+)/([^/]+)$ ]]; then
    log "privacy guard: non-github remote - SKIPPED"
    return 0
  fi
  local owner="${BASH_REMATCH[1]}"
  local name="${BASH_REMATCH[2]%.git}"
  local visibility
  visibility=$(gh repo view "$owner/$name" --json visibility -q .visibility 2>/dev/null || echo "")
  if [[ -z "$visibility" ]]; then
    log "privacy guard: could not query gh - SKIPPED (soft warning)"
    return 0
  fi
  if [[ "$visibility" != "PRIVATE" ]]; then
    log "privacy guard: REFUSING to push - $owner/$name is $visibility (expected PRIVATE)"
    return 1
  fi
  log "privacy guard: $owner/$name confirmed PRIVATE"
  return 0
}

# ---------- step 4 & 5: commit ---------------------------------------------

cd "$AGENT_REPO_DIR"
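
The large-file guard in step 3b can be tried in isolation. The sketch below is not part of `agent-backup.sh`: the helper name `list_big_files`, the scratch directory, and the 1024-byte demo cap are assumptions made for illustration. It uses the same `find -size +Nc` test as the guard, which matches files strictly larger than N bytes, so a file of exactly the cap passes.

```bash
# Hypothetical helper: list files under DIR ($1) larger than MAX bytes ($2),
# skipping DIR/.git. Prints nothing when the tree is clean.
list_big_files() {
  find "$1" -type f -size +"$2"c -not -path "$1/.git/*"
}

# Build a scratch tree with one small file, one oversized file, and one
# oversized file inside .git (which the guard ignores).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/.git"
head -c 100  /dev/zero > "$demo_dir/small.bin"
head -c 4096 /dev/zero > "$demo_dir/big.bin"
head -c 4096 /dev/zero > "$demo_dir/.git/packed.bin"

big=$(list_big_files "$demo_dir" 1024)
printf 'over cap: %s\n' "${big#"$demo_dir"/}"   # prints: over cap: big.bin

rm -rf "$demo_dir"
```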
@@ -380,6 +449,8 @@ fi
REMOTE_URL=$(git remote get-url "$AGENT_REMOTE" 2>/dev/null || true)
[[ -z "$REMOTE_URL" ]] && { log "remote '$AGENT_REMOTE' not configured"; exit 3; }

verify_repo_private "$REMOTE_URL" || exit 4

log "push: $AGENT_REMOTE $AGENT_BRANCH ($REMOTE_URL)"

if [[ -n "${GITHUB_TOKEN:-}" && "$REMOTE_URL" =~ ^https://github\.com/ ]]; then
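
The privacy guard depends on pulling `owner/name` out of the remote URL before it can ask `gh` about visibility. As a rough sketch of that parsing step alone, the hypothetical helper `github_slug` below uses parameter expansion rather than a regex and accepts both SSH and HTTPS remotes, including repository names that contain dots. The example URLs are made up, and the helper does not call `gh`.

```bash
# Hypothetical helper: print "owner/name" for a github.com remote URL,
# or return 1 for anything else.
github_slug() {
  url="${1%/}"        # tolerate a trailing slash
  url="${url%.git}"   # drop an optional .git suffix
  case "$url" in
    *github.com[:/]*/*) ;;   # needs host, separator, owner, slash, name
    *) return 1 ;;
  esac
  slug="${url##*github.com[:/]}"
  case "$slug" in
    */*/*|/*|*/) return 1 ;; # must be exactly owner/name
  esac
  printf '%s\n' "$slug"
}

github_slug 'git@github.com:acme/backup.git'        # prints: acme/backup
github_slug 'https://github.com/acme/my.repo.git'   # prints: acme/my.repo
github_slug 'https://gitlab.com/acme/backup.git' || echo 'not a github.com remote'
```

The resulting slug is the form that `gh repo view "$owner/$name" --json visibility` expects.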
80 changes: 80 additions & 0 deletions scripts/tenant-template/README.md
@@ -0,0 +1,80 @@
# Tula tenant-template build pipeline

This directory holds the three artifacts that turn a Tula development VM
into a per-tenant golden image and provision new tenants from it.

| File | Purpose | Runs on |
|---|---|---|
| `deprovision.sh` | Scrubs a source VM for image capture | The source VM (the one being baked) |
| `tula-provision.sh` | Spawns a new tenant from a captured image | The operator's laptop / control-plane VM |
| `cloud-init-template.yaml` | First-boot configuration for each new tenant | Auto-injected; never run manually |

Full specification: [`~/.openclaw/workspace/docs/TENANT_TEMPLATE_BUILD.md`](../../../.openclaw/workspace/docs/TENANT_TEMPLATE_BUILD.md)

## Quick start (operator)

```bash
# One-time: prepare ops home
mkdir -p ~/tula-ops/{tenants,secrets}
chmod 700 ~/tula-ops ~/tula-ops/secrets
echo -n 'sk-ant-xxxx' > ~/tula-ops/secrets/anthropic-api-key && chmod 600 ~/tula-ops/secrets/anthropic-api-key
echo -n 'ghp_xxxx' > ~/tula-ops/secrets/github-pat-tenant-write && chmod 600 ~/tula-ops/secrets/github-pat-tenant-write

# Add a few Telegram bot tokens to the pool (one per row)
cat <<EOF >> ~/tula-ops/bot-token-pool.txt
# pool_name bot_token bot_username status
tula_aux_001 1234567890:AAH... TulaAux001Bot available
tula_aux_002 0987654321:AAH... TulaAux002Bot available
EOF
chmod 600 ~/tula-ops/bot-token-pool.txt

# Bake the image (one-time, ~30 min)
ssh azureuser@ra-bake-vm 'sudo ~/tula/scripts/tenant-template/deprovision.sh --version 0.1.0 --confirm'
ssh azureuser@ra-bake-vm 'sudo waagent -deprovision+user -force'
az vm deallocate -g ra-healthcareagents-rg -n ra-bake-vm
az vm generalize -g ra-healthcareagents-rg -n ra-bake-vm
az image create -g ra-healthcareagents-rg -n tula-tenant-template-0-1-0 --source ra-bake-vm

# Provision a tenant (per tenant, ~5 min)
~/tula/scripts/tenant-template/tula-provision.sh new-tenant "Jane Doe" "jane@example.com"
```
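
The pool file above is whitespace-separated with four columns. How `tula-provision.sh` actually claims a token is not shown here; as a minimal sketch, assuming it picks the first row whose status is `available`, the selection could look like the following. The helper name `first_available_bot`, the `assigned` status value, and the token strings are all hypothetical.

```bash
# Hypothetical helper: print the pool_name of the first row whose status
# column is "available". Comment and short lines are skipped.
# Exits non-zero when no row qualifies.
first_available_bot() {
  awk '!/^[[:space:]]*#/ && NF >= 4 && $4 == "available" { print $1; found = 1; exit }
       END { exit !found }' "$1"
}

# Demo pool in a temp file; "assigned" is an assumed status value.
pool=$(mktemp)
cat > "$pool" <<'EOF'
# pool_name bot_token bot_username status
tula_aux_001 1111111111:EXAMPLE TulaAux001Bot assigned
tula_aux_002 2222222222:EXAMPLE TulaAux002Bot available
EOF

picked=$(first_available_bot "$pool")
echo "next bot: $picked"   # prints: next bot: tula_aux_002
rm -f "$pool"
```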

## Subcommands

- `tula-provision.sh new-tenant <name> <email>` - full provision
- `tula-provision.sh list` - list tenants
- `tula-provision.sh show <tenant-id>` - show one tenant's record
- `tula-provision.sh health <tenant-id>` - health check
- `tula-provision.sh rollback <tenant-id>` - clean teardown (idempotent)
- `tula-provision.sh decommission <tenant-id>` - 30-day-grace offboarding

## Safety

- `deprovision.sh` refuses to run on hosts named `tula-tenant-*` (prevents
nuking a live tenant)
- `deprovision.sh` requires `--confirm`; supports `--dry-run`
- `tula-provision.sh` rolls back automatically on any failure during
provisioning (deletes Azure RG, deletes GitHub repo, returns bot
token to pool)
- Operator secrets live under `~/tula-ops/` (directory mode 0700): API
keys in `~/tula-ops/secrets/` and the bot-token pool in
`~/tula-ops/bot-token-pool.txt`, each file with 0600 perms
- Tenant secrets live in `/etc/tula-tenant-secrets.env` on the tenant
VM with 0600 perms, owned by `azureuser`
- No tenant content crosses the operator boundary in normal operation
(the operator can break-glass via SSH, but the operation is logged)
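
The hostname refusal in the first bullet amounts to a glob match on the host name. A minimal sketch, assuming the check is made against the short hostname (the function name `safe_to_deprovision` and the sample host `tula-tenant-0042` are hypothetical, not taken from `deprovision.sh`):

```bash
# Hypothetical guard: succeed only when HOST ($1) is safe to deprovision.
safe_to_deprovision() {
  case "$1" in
    tula-tenant-*) return 1 ;;   # looks like a live tenant: refuse
    *)             return 0 ;;
  esac
}

for host in ra-bake-vm tula-tenant-0042; do
  if safe_to_deprovision "$host"; then
    echo "$host: ok to deprovision"
  else
    echo "$host: REFUSED (live tenant name)"
  fi
done
```

In a real script the argument would presumably come from `hostname`, and a refusal would abort before any scrubbing starts.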

## v0.1 known gaps (to harden in v0.2)

- The GitHub PAT used for tenant repos is currently a single token
shared across all tenants via
`~/tula-ops/secrets/github-pat-tenant-write`. It should become a
per-tenant fine-grained PAT or a GitHub App installation. Tracked in
[`TENANT_TEMPLATE_BUILD.md`](../../../.openclaw/workspace/docs/TENANT_TEMPLATE_BUILD.md) § 6.5.
- Data disk is currently combined with OS disk. v0.2 separates them so
image updates don't require workspace data migration.
- No control plane yet; tenant heartbeat to a central observability
endpoint is wired but disabled. Enable when control plane lands.
- No automated image-update workflow for existing tenants; updates are
manual per tenant in v0.1.

## License

Apache-2.0 (inherited from the Tula repository).