Rollback and Fallback Runbook

Operator guide for reverting failed installer, database, Rust binary, and hook changes. Run commands from the repository root unless a section says otherwise.

1. Installer rollback

Use this when install.py, auto-update-tools.py, or a local tool refresh leaves the launcher or copied tools in a bad state.

1.1 Snapshot before risky installer work

python install.py --doctor --manifest > /tmp/sk-install-before.txt
python install.py --install-sk --quiet

1.2 Remove only the managed sk launcher

python install.py --uninstall-launcher
python install.py --doctor --manifest

1.3 Remove managed copied tools while preserving session data

python install.py --uninstall

--uninstall preserves ~/.copilot/session-state/ and the knowledge database. If the checkout was installed as an editable pip package, remove that wrapper first:

python -m pip uninstall copilot-session-knowledge
python install.py --uninstall

1.4 Reinstall from a known-good checkout

git fetch origin main
git switch main
git pull --ff-only
python install.py --install-sk --quiet
python install.py --doctor --manifest

2. Database schema rollback

Use this before migrations, manual repair, or any change that writes ~/.copilot/session-state/knowledge.db.

2.1 Create a verified rollback backup

POSIX:

python migrate.py ~/.copilot/session-state/knowledge.db --backup-only --backup-path /tmp/knowledge.db.backup
python migrate.py ~/.copilot/session-state/knowledge.db

Windows PowerShell:

New-Item -ItemType Directory -Force -Path "C:\Temp" | Out-Null
python migrate.py "$env:USERPROFILE\.copilot\session-state\knowledge.db" --backup-only --backup-path "C:\Temp\knowledge.db.backup"
python migrate.py "$env:USERPROFILE\.copilot\session-state\knowledge.db"

2.2 Restore a known-good backup on POSIX

Stop active writers first (sk watch, sync daemons, launchd services, CI jobs), then replace the database and remove WAL sidecars:

rm -f ~/.copilot/session-state/knowledge.db-wal ~/.copilot/session-state/knowledge.db-shm
cp /tmp/knowledge.db.backup ~/.copilot/session-state/knowledge.db
python migrate.py ~/.copilot/session-state/knowledge.db

The migration run should report Schema up to date.

2.3 Restore a known-good backup on Windows PowerShell

Stop active writers first, as in 2.2, then remove the WAL sidecars and replace the database:

Remove-Item "$env:USERPROFILE\.copilot\session-state\knowledge.db-wal" -ErrorAction SilentlyContinue
Remove-Item "$env:USERPROFILE\.copilot\session-state\knowledge.db-shm" -ErrorAction SilentlyContinue
Copy-Item "C:\Temp\knowledge.db.backup" "$env:USERPROFILE\.copilot\session-state\knowledge.db" -Force
python migrate.py "$env:USERPROFILE\.copilot\session-state\knowledge.db"
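Sections 2.2 and 2.3 perform the same three steps on both platforms: clear the WAL sidecars, copy the backup over the database, and re-run the migration. A minimal cross-platform sketch of the first two steps follows. The function name restore_backup is hypothetical and not part of the repository, and the sketch assumes all writers are already stopped.

```python
import shutil
from pathlib import Path


def restore_backup(backup_path: Path, db_path: Path) -> None:
    """Replace db_path with backup_path after clearing stale WAL sidecars.

    Hypothetical helper: it mirrors the rm/cp and Remove-Item/Copy-Item
    commands above and assumes no process is writing to db_path.
    """
    if not backup_path.is_file():
        # Fail before touching the live database or its sidecars.
        raise FileNotFoundError(f"backup not found: {backup_path}")
    # Sidecars from the old database must not be applied to the restored file.
    for suffix in ("-wal", "-shm"):
        sidecar = db_path.with_name(db_path.name + suffix)
        sidecar.unlink(missing_ok=True)  # missing_ok requires Python 3.8+
    shutil.copyfile(backup_path, db_path)
```

After calling it, run python migrate.py against the restored database as in 2.2 and confirm it reports Schema up to date.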

2.4 Preserve a bad database for investigation

mv ~/.copilot/session-state/knowledge.db ~/.copilot/session-state/knowledge.db.corrupt
mv ~/.copilot/session-state/knowledge.db-wal ~/.copilot/session-state/knowledge.db-wal.corrupt 2>/dev/null || true
mv ~/.copilot/session-state/knowledge.db-shm ~/.copilot/session-state/knowledge.db-shm.corrupt 2>/dev/null || true
python migrate.py ~/.copilot/session-state/knowledge.db
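The mv commands above are POSIX-only. The sketch below does the same rename with pathlib so it also works on Windows. quarantine_database is a hypothetical name, not a repository function, and the .corrupt suffix is taken from the commands above.

```python
from pathlib import Path
from typing import List


def quarantine_database(db_path: Path, suffix: str = ".corrupt") -> List[Path]:
    """Rename the database and any WAL sidecars, returning the new paths.

    Hypothetical helper. Like mv, Path.replace overwrites an existing
    target, so an earlier quarantined copy with the same name is lost.
    """
    moved: List[Path] = []
    for extra in ("", "-wal", "-shm"):
        source = db_path.with_name(db_path.name + extra)
        if source.exists():
            target = source.with_name(source.name + suffix)
            source.replace(target)
            moved.append(target)
    return moved
```

Missing sidecars are skipped silently, matching the 2>/dev/null || true guards in the shell version.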

3. Rust binary rollback

Use this when a released sk binary fails at startup, routes commands incorrectly, or regresses native hooks/watch behavior.

3.1 Replace the binary with the latest release

POSIX:

bash sk-rust/install.sh
sk --help

Windows PowerShell:

powershell -ExecutionPolicy Bypass -File sk-rust\install.ps1
sk --help

3.2 Roll back to the Python shim temporarily

If the Rust binary is broken but Python tools are intact, move the binary aside and call the shim directly:

mv ~/.copilot/bin/sk ~/.copilot/bin/sk.broken
python ~/.copilot/tools/sk.py --help
python ~/.copilot/tools/sk.py hooks run sessionStart

Windows PowerShell:

Rename-Item "$env:USERPROFILE\.copilot\bin\sk.exe" "sk.exe.broken"
python "$env:USERPROFILE\.copilot\tools\sk.py" --help
python "$env:USERPROFILE\.copilot\tools\sk.py" hooks run sessionStart

3.3 Validate before re-enabling the binary

cd sk-rust
cargo test --quiet
cargo clippy -- -D warnings
cd ..
python tests/test_py_rust_boundary.py --rust-bin sk-rust/target/debug/sk

Use sk-rust\target\debug\sk.exe for --rust-bin on Windows.

4. Hook provisioning rollback

Use this when hook installation, tamper protection, or generated hook config blocks legitimate work.

4.1 Unlock hooks before repair

python install.py --unlock-hooks

4.2 Restore a backed-up hook

cp .git/hooks/pre-commit.backup .git/hooks/pre-commit
cp .git/hooks/pre-push.backup .git/hooks/pre-push
chmod +x .git/hooks/pre-commit .git/hooks/pre-push

Windows PowerShell:

Copy-Item .git\hooks\pre-commit.backup .git\hooks\pre-commit -Force
Copy-Item .git\hooks\pre-push.backup .git\hooks\pre-push -Force

4.3 Re-provision known-good hooks

python install.py --deploy-hooks
python install.py --install-git-hooks
python install.py --lock-hooks
python tests/test_hook_compat.py
python tests/test_quality_gates.py

4.4 Emergency bypass policy

Prefer restoring hooks over bypassing them. If a human authorizes a one-off bypass, document the reason in the PR and run the same tests the hook would have enforced before merge.

5. Evidence checklist for PR closeout

Every PR that changes installer, migration, hook, Rust binary, or rollback surfaces must include either command output or an explicit N/A reason for:

  • python test_security.py
  • python test_fixes.py
  • python run_all_tests.py
  • python tests/test_rollback_runbook.py
  • python tests/test_migration_rehearsal.py
  • python tests/test_install_sandbox.py
  • python tests/test_hook_compat.py
  • Cross-platform CI status: Linux, macOS, and Windows
  • Migration rehearsal evidence when migrate.py or DB schema changes
  • Install sandbox evidence when installer/update scripts change
  • Hook security/compat evidence when hook provisioning or hook rules change
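The script-based items in the checklist can be collected in one pass. The following is a rough sketch, assuming each script signals failure through a non-zero exit code and is run from the repository root. run_checklist and format_summary are hypothetical helpers and do not exist in the repository.

```python
import subprocess
import sys
from typing import Dict, Iterable

# Script paths copied from the checklist above, relative to the repository root.
CHECKLIST_SCRIPTS = [
    "test_security.py",
    "test_fixes.py",
    "run_all_tests.py",
    "tests/test_rollback_runbook.py",
    "tests/test_migration_rehearsal.py",
    "tests/test_install_sandbox.py",
    "tests/test_hook_compat.py",
]


def run_checklist(scripts: Iterable[str]) -> Dict[str, int]:
    """Run each script with the current interpreter and record its exit code."""
    results: Dict[str, int] = {}
    for script in scripts:
        # check=False: a failing script is recorded, not raised, so the
        # remaining scripts still run.
        completed = subprocess.run([sys.executable, script], check=False)
        results[script] = completed.returncode
    return results


def format_summary(results: Dict[str, int]) -> str:
    """Render one PASS/FAIL line per script for pasting into a PR."""
    return "\n".join(
        f"{'PASS' if code == 0 else 'FAIL'}  python {script} (exit {code})"
        for script, code in results.items()
    )
```

Calling print(format_summary(run_checklist(CHECKLIST_SCRIPTS))) produces a summary to attach alongside the raw output. The CI status and N/A items still need to be recorded by hand.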

6. Known failure modes and recovery

This section documents failure modes that have been observed in practice, along with the recommended recovery path for each.

6.1 Startup regression: sk binary exits non-zero after upgrade

Symptom: sk --help or sk --version returns a non-zero exit code after a binary upgrade. The startup-benchmark CI job detects this as a failed warmup run and prints "benchmark startup: warmup #1 failed".

Recovery:

# Roll back to the Python shim while investigating
mv ~/.copilot/bin/sk ~/.copilot/bin/sk.broken
python ~/.copilot/tools/sk.py --help

Windows PowerShell:

Rename-Item "$env:USERPROFILE\.copilot\bin\sk.exe" "sk.exe.broken"
python "$env:USERPROFILE\.copilot\tools\sk.py" --help

Re-run binary regression tests to confirm the Python shim is healthy:

python tests/test_py_rust_boundary.py --rust-bin ~/.copilot/bin/sk.broken
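The symptom in this section comes down to an exit-code check, which can be scripted before and after swapping binaries. This is a minimal sketch, not the startup-benchmark job itself. starts_cleanly is a hypothetical helper and the 10-second timeout is an arbitrary assumption.

```python
import subprocess
from typing import Sequence


def starts_cleanly(command: Sequence[str], timeout_seconds: float = 10.0) -> bool:
    """Return True only if the command launches and exits with status 0.

    Hypothetical helper. A missing executable or a hang past the timeout
    counts as a failed start.
    """
    try:
        result = subprocess.run(
            list(command),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

For example, starts_cleanly(["sk", "--help"]) checks the installed binary, and passing the Python interpreter plus the path to sk.py and --help checks the shim.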

6.2 Migration fails with "database is locked"

Symptom: migrate.py exits with sqlite3.OperationalError: database is locked or similar. Usually caused by an active sk watch, sync daemon, or CI job.

Recovery:

  1. Stop all active writers:
python watch-sessions.py --stop 2>/dev/null || true
python sync-daemon.py --stop 2>/dev/null || true
  2. Remove WAL sidecars (Windows PowerShell equivalent: Remove-Item ... -ErrorAction SilentlyContinue):
rm -f ~/.copilot/session-state/knowledge.db-wal \
      ~/.copilot/session-state/knowledge.db-shm
  3. Retry migration:
python migrate.py ~/.copilot/session-state/knowledge.db
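Before removing sidecars, it helps to confirm that every writer has actually stopped. One way to probe this, sketched below, is to try to take SQLite's write lock with a zero busy timeout. database_is_locked is a hypothetical helper and is not part of migrate.py.

```python
import sqlite3
from pathlib import Path


def database_is_locked(db_path: Path) -> bool:
    """Return True if another connection currently blocks a write transaction.

    Hypothetical probe. It opens and immediately rolls back a write
    transaction, so it changes no data.
    """
    if not db_path.is_file():
        # sqlite3.connect would otherwise create an empty database here.
        raise FileNotFoundError(f"database not found: {db_path}")
    # timeout=0 makes SQLite fail at once instead of waiting for the lock;
    # isolation_level=None lets us issue BEGIN/ROLLBACK explicitly.
    conn = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    try:
        conn.execute("BEGIN IMMEDIATE")
        conn.execute("ROLLBACK")
        return False
    except sqlite3.OperationalError as exc:
        if "locked" in str(exc).lower():
            return True
        raise
    finally:
        conn.close()
```

A True result means a writer is still active, so repeat step 1 before touching the sidecars. The probe only reflects the moment it runs, so a writer that starts later can still lock the database.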

6.3 Hook provisioning fails: tamper protection blocks re-install

Symptom: install.py --deploy-hooks prints "hooks are locked; run --unlock-hooks first", or a hook hash verification error blocks the pre-commit hook.

Recovery:

python install.py --unlock-hooks
python install.py --deploy-hooks
python install.py --lock-hooks
python tests/test_hook_compat.py

6.4 Installer creates a broken sk.cmd on Windows (missing CRLF)

Symptom: sk.cmd runs, but cmd.exe raises a parse error because line endings are LF-only instead of CRLF. Detected by test_sk_cmd_launcher_content_uses_crlf in tests/test_platform_compat.py.

Recovery:

Re-install the launcher from a clean checkout to regenerate the CRLF-correct file:

python install.py --uninstall-launcher
python install.py --install-sk --quiet
python tests/test_platform_compat.py
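To spot-check a generated launcher by hand, the line-ending condition can be tested directly on the file bytes. This sketch is an illustration of the check, not the repository's test. has_only_crlf_line_endings is a hypothetical name.

```python
from pathlib import Path


def has_only_crlf_line_endings(path: Path) -> bool:
    """Return True if every line break in the file is a CRLF pair.

    Hypothetical helper. A file with no line breaks at all also
    returns True.
    """
    data = path.read_bytes()
    # Strip every CRLF pair; any LF or CR left over is a bare line ending.
    remainder = data.replace(b"\r\n", b"")
    return b"\n" not in remainder and b"\r" not in remainder
```

Reading bytes rather than text matters here: text mode on Windows would translate CRLF to LF and hide the problem.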

6.5 DB backup fails silently (OSError swallowed)

Symptom: migrate.py --backup-only exits 0 but the backup file is missing or zero bytes. Usually caused by insufficient disk space or a locked source file.

Recovery:

Verify the backup manually before running the forward migration:

python migrate.py ~/.copilot/session-state/knowledge.db --backup-only \
  --backup-path ~/knowledge.db.backup
ls -lh ~/knowledge.db.backup          # must be non-zero
sqlite3 ~/knowledge.db.backup "PRAGMA quick_check"  # must return 'ok'
python migrate.py ~/.copilot/session-state/knowledge.db

Windows PowerShell:

python migrate.py "$env:USERPROFILE\.copilot\session-state\knowledge.db" `
  --backup-only --backup-path "$env:USERPROFILE\knowledge.db.backup"
(Get-Item "$env:USERPROFILE\knowledge.db.backup").Length  # must be >0
python migrate.py "$env:USERPROFILE\.copilot\session-state\knowledge.db"
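The size and integrity checks above can be combined into one cross-platform step, which also covers Windows hosts without the sqlite3 command-line tool. The sketch below uses Python's standard sqlite3 module. backup_is_usable is a hypothetical helper, not something migrate.py provides.

```python
import sqlite3
from pathlib import Path


def backup_is_usable(backup_path: Path) -> bool:
    """Return True if the backup exists, is non-empty, and passes quick_check.

    Hypothetical helper. Any failure to open or read the file is
    treated as an unusable backup.
    """
    if not backup_path.is_file() or backup_path.stat().st_size == 0:
        return False
    # Open read-only through a file: URI so the check cannot modify the backup.
    uri = backup_path.resolve().as_uri() + "?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
        try:
            row = conn.execute("PRAGMA quick_check").fetchone()
        finally:
            conn.close()
    except sqlite3.DatabaseError:
        return False
    return row is not None and row[0] == "ok"
```

Run the forward migration only when this returns True. If it returns False, fix the disk space or locking problem and create the backup again.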