feat: backup/restore the SQLite store over the API (#999)#1003
Conversation
Add a physical backup/restore feature for the SQLite store, exposed over the HTTP API so it works with remote-server setups and keeps "only the server touches data". Distinct from the logical `export` (JSON/CSV): this is a full physical snapshot (audit log, notes, dependencies, tags) for real recovery. Core (storage-neutral port + SQLite impl): - IBackupStore port exposing only chunk streams (create_snapshot / stage_restore / apply_pending_restore) — no .db, paths, or SQL leak out. - SqliteBackupStore: VACUUM INTO for a consistent single-file snapshot, PRAGMA integrity_check on upload, startup file swap with WAL/SHM cleanup and a pre-restore copy of the previous DB. - BackupController, RestoreResultDTO, BackupError hierarchy. Server: - GET /api/v1/backup streams the snapshot; POST /api/v1/restore validates and stages the upload (400 on integrity failure), returning a pending status. Both honor the existing API-key auth. - apply_pending_restore runs in the lifespan BEFORE the engine is created, so Alembic migrates a restored (possibly older) backup forward. Client & CLI: - TaskdogApiClient.backup(path) / restore(path); BaseApiClient.auth_headers. - `taskdog backup [-o PATH]` (default ./taskdog-backup-<ts>.db) and `taskdog restore-db PATH` (confirmation prompt, --yes to skip). The restore command is named `restore-db` because `restore` already means un-archiving tasks. Verified end-to-end against a running server: backup -> mutate -> restore -> restart applies the snapshot. MCP is out of scope for v1.
There was a problem hiding this comment.
Review Summary
This PR implements a well-designed physical backup/restore feature for SQLite. The architecture properly separates concerns with domain abstraction (IBackupStore) and infrastructure implementation (SqliteBackupStore). The approach of staging restores before server restart and using VACUUM INTO for consistent snapshots is sound.
However, I've identified 4 critical defects that block merge:
- Memory exhaustion vulnerability in the restore endpoint - loads entire uploads into RAM instead of streaming
- Race condition crash when restoring to non-existent database path
- Resource leak in SQLite integrity validation leaving connections in WAL checkpoint mode
- Partial write corruption in backup download without atomic file operations
All issues have specific code suggestions to fix. Once addressed, the implementation will be production-ready.
You can now have the agent implement changes and create commits directly on your pull request's source branch. Simply comment with /q followed by your request in natural language to ask the agent to make changes.
| data = await file.read() | ||
| return controller.restore(iter([data])) |
There was a problem hiding this comment.
🛑 Memory Exhaustion Vulnerability: Loading the entire upload into memory risks crashes and denial-of-service for large database files. The server should stream the upload directly to disk rather than buffering it entirely in RAM with await file.read(). This bypasses the chunked streaming designed in stage_restore(data: Iterator[bytes]).
| data = await file.read() | |
| return controller.restore(iter([data])) | |
| async def read_chunks(): | |
| while chunk := await file.read(1024 * 1024): | |
| yield chunk | |
| return controller.restore(read_chunks()) |
| db_path = self._db_path | ||
| if db_path.exists(): | ||
| ts = datetime.now().strftime("%Y%m%d-%H%M%S") | ||
| db_path.replace(db_path.with_name(f"{db_path.name}.pre-restore-{ts}")) |
There was a problem hiding this comment.
🛑 Race Condition Crash Risk: If the database file doesn't exist when apply_pending_restore() is called (e.g., fresh install with a staged restore), this code attempts to move a non-existent file, causing FileNotFoundError. The condition only checks if pending.exists(), not db_path.exists() before the replace operation on line 131.
| db_path = self._db_path | |
| if db_path.exists(): | |
| ts = datetime.now().strftime("%Y%m%d-%H%M%S") | |
| db_path.replace(db_path.with_name(f"{db_path.name}.pre-restore-{ts}")) | |
| db_path = self._db_path | |
| if db_path.exists(): | |
| ts = datetime.now().strftime("%Y%m%d-%H%M%S") | |
| backup_path = db_path.with_name(f"{db_path.name}.pre-restore-{ts}") | |
| db_path.rename(backup_path) |
| @staticmethod | ||
| def _is_valid_sqlite(path: Path) -> bool: | ||
| try: | ||
| conn = sqlite3.connect(path) | ||
| try: | ||
| row = conn.execute("PRAGMA integrity_check").fetchone() | ||
| finally: | ||
| conn.close() | ||
| except sqlite3.DatabaseError: | ||
| return False | ||
| return row is not None and row[0] == "ok" |
There was a problem hiding this comment.
🛑 Resource Leak: Missing conn.execute("PRAGMA optimize") after integrity check causes database connection to remain in WAL checkpoint mode. SQLite keeps the connection open with file locks, preventing concurrent access and potentially corrupting the staged file during validation.
| @staticmethod | |
| def _is_valid_sqlite(path: Path) -> bool: | |
| try: | |
| conn = sqlite3.connect(path) | |
| try: | |
| row = conn.execute("PRAGMA integrity_check").fetchone() | |
| finally: | |
| conn.close() | |
| except sqlite3.DatabaseError: | |
| return False | |
| return row is not None and row[0] == "ok" | |
| @staticmethod | |
| def _is_valid_sqlite(path: Path) -> bool: | |
| try: | |
| conn = sqlite3.connect(path) | |
| try: | |
| row = conn.execute("PRAGMA integrity_check").fetchone() | |
| if row is not None and row[0] == "ok": | |
| conn.execute("PRAGMA optimize") | |
| return True | |
| return False | |
| finally: | |
| conn.close() | |
| except sqlite3.DatabaseError: | |
| return False |
| try: | ||
| with self._base.client.stream( | ||
| "GET", "/api/v1/backup", headers=self._base.auth_headers() | ||
| ) as response: | ||
| if not response.is_success: | ||
| response.read() | ||
| self._base._handle_error(response) | ||
| with output_path.open("wb") as out: | ||
| for chunk in response.iter_bytes(_CHUNK_SIZE): | ||
| out.write(chunk) | ||
| except (httpx.ConnectError, httpx.TimeoutException, httpx.RequestError) as e: | ||
| raise ServerConnectionError(self._base.base_url, e) from e |
There was a problem hiding this comment.
🛑 Partial Write Corruption: backup() writes chunks without atomic guarantees. If the process crashes mid-write or disk is full, a corrupted partial file remains. The restore endpoint will then accept this invalid backup. Add a temp file with atomic rename to ensure only complete downloads are written.
| try: | |
| with self._base.client.stream( | |
| "GET", "/api/v1/backup", headers=self._base.auth_headers() | |
| ) as response: | |
| if not response.is_success: | |
| response.read() | |
| self._base._handle_error(response) | |
| with output_path.open("wb") as out: | |
| for chunk in response.iter_bytes(_CHUNK_SIZE): | |
| out.write(chunk) | |
| except (httpx.ConnectError, httpx.TimeoutException, httpx.RequestError) as e: | |
| raise ServerConnectionError(self._base.base_url, e) from e | |
| try: | |
| tmp_path = output_path.with_suffix(output_path.suffix + ".tmp") | |
| with self._base.client.stream( | |
| "GET", "/api/v1/backup", headers=self._base.auth_headers() | |
| ) as response: | |
| if not response.is_success: | |
| response.read() | |
| self._base._handle_error(response) | |
| with tmp_path.open("wb") as out: | |
| for chunk in response.iter_bytes(_CHUNK_SIZE): | |
| out.write(chunk) | |
| tmp_path.replace(output_path) | |
| except (httpx.ConnectError, httpx.TimeoutException, httpx.RequestError) as e: | |
| raise ServerConnectionError(self._base.base_url, e) from e |
The /api/v1/restore endpoint uses FastAPI UploadFile/File, which requires python-multipart at import time. It was present locally as a transitive dep but not declared, so CI's clean install failed to import the router.
|
/q review |
There was a problem hiding this comment.
Review Summary
This PR implements a physical SQLite backup/restore feature over HTTP API. The implementation follows good architectural patterns with clean port/adapter separation. However, I've identified 5 critical defects that must be fixed before merge:
Critical Issues Found:
- Security Vulnerability (CWE-400) - Memory exhaustion attack vector in restore endpoint
- Race Condition (CWE-362) - Database corruption risk in file swap logic
- Missing Exception Handler - Unhandled
BackupNotSupportedErrorcauses 500 errors - Crash Risk - Unhandled exceptions during startup restore operation
- Missing Exception Handler Registration - Import present but handler not registered
Summary of Findings:
Security (1 issue):
- The
/api/v1/restoreendpoint loads entire uploaded files into memory, enabling authenticated attackers to crash the server with large files
Logic Errors (1 issue):
- The
apply_pending_restore()file swap has no exclusive locking, allowing concurrent database writes to be silently lost
Crash Risks (3 issues):
BackupNotSupportedErrornot registered in exception handlers- Exception handler import missing
BackupNotSupportedError - Startup failures from
apply_pending_restore()crash the server permanently
All issues have specific fixes provided in the inline comments. The test coverage appears comprehensive, but these defects were not caught because they involve edge cases (large files, concurrent access, startup failures).
You can now have the agent implement changes and create commits directly on your pull request's source branch. Simply comment with /q followed by your request in natural language to ask the agent to make changes.
| file: Annotated[UploadFile, File()], | ||
| ) -> RestoreResultDTO: | ||
| """Stage an uploaded `.db` snapshot to be applied on the next server restart.""" | ||
| data = await file.read() |
There was a problem hiding this comment.
🛑 Security Vulnerability: Loading entire uploaded file into memory enables memory exhaustion attacks. An attacker with valid credentials could upload multi-gigabyte files to crash the server via OOM. Stream the upload in chunks using file.stream() or read in fixed-size chunks to bound memory usage.
async for chunk in file.stream():
if not hasattr(restore, "_chunks"):
restore._chunks = []
restore._chunks.append(chunk)
return controller.restore(iter(restore._chunks))| # Drop stale WAL/SHM so the old journal cannot shadow the restored DB. | ||
| for suffix in ("-wal", "-shm"): | ||
| self._sidecar(suffix).unlink(missing_ok=True) | ||
|
|
||
| pending.replace(db_path) |
There was a problem hiding this comment.
🛑 Logic Error: Race condition allows database corruption if apply_pending_restore is called while the database is in use. The file swap at line 137 executes after WAL/SHM deletion but provides no exclusive locking. If another process writes to the database between line 135 and 137, the pending restore overwrites those writes silently. Wrap the entire swap operation (lines 133-137) in a SQLite connection with BEGIN EXCLUSIVE or verify no connections exist before swapping.
| from fastapi import FastAPI, Request, status | ||
| from fastapi.responses import JSONResponse | ||
|
|
||
| from taskdog_core.domain.exceptions.backup_exceptions import BackupValidationError |
There was a problem hiding this comment.
🛑 Crash Risk: BackupNotSupportedError is not registered in the exception handler. If a client calls /api/v1/backup or /api/v1/restore on an in-memory database, the unhandled exception causes a 500 Internal Server Error instead of a proper 400/501 response. Add handler registration for BackupNotSupportedError.
| from taskdog_core.domain.exceptions.backup_exceptions import BackupValidationError | |
| from taskdog_core.domain.exceptions.backup_exceptions import ( | |
| BackupNotSupportedError, | |
| BackupValidationError, | |
| ) |
| app.add_exception_handler(TaskValidationError, _bad_request_handler) | ||
| app.add_exception_handler(BackupValidationError, _bad_request_handler) |
There was a problem hiding this comment.
Add handler registration for BackupNotSupportedError after line 49 to properly return 400 BAD_REQUEST when backup operations are attempted on in-memory databases.
| app.add_exception_handler(TaskValidationError, _bad_request_handler) | |
| app.add_exception_handler(BackupValidationError, _bad_request_handler) | |
| app.add_exception_handler(TaskValidationError, _bad_request_handler) | |
| app.add_exception_handler(BackupValidationError, _bad_request_handler) | |
| app.add_exception_handler(BackupNotSupportedError, _bad_request_handler) |
|
|
||
| # Apply a pending restore (if any) BEFORE the engine is created, so the | ||
| # restored DB is what Alembic then migrates forward. | ||
| SqliteBackupStore(resolve_database_url(config)).apply_pending_restore() |
There was a problem hiding this comment.
🛑 Crash Risk: Unhandled exceptions from apply_pending_restore() crash server startup. If the pending restore file is corrupted or the filesystem is full during the file swap, the exception propagates uncaught, preventing the server from starting and leaving it in a broken state. Wrap line 58 in try-except to log the error and continue startup, or implement a startup health check that fails gracefully.
…rror mapping (#999) - server/restore: stream the spooled upload off disk in 1 MiB chunks via file.file instead of buffering the whole DB in memory (CWE-400). - client/backup: download to a .part file and atomically replace, so an interrupted download never leaves a corrupt partial snapshot. - server: map BackupNotSupportedError -> 501 (was an unhandled 500 when a store is in-memory). - server startup: guard apply_pending_restore with try/except so a failed swap logs and continues instead of leaving the server unstartable. Not changed (reviewed and rejected): the pre-restore move is already guarded by db_path.exists(); the "PRAGMA optimize" suggestion is unrelated to any leak (the connection is closed in finally); and BEGIN EXCLUSIVE on the swap is unnecessary because apply_pending_restore runs at startup before the engine is created, with no other connections by design.
|
Thanks for the review. Triaged each finding — addressed the 4 valid ones in 5520ffb, pushed back on 3 false positives. Fixed
Rejected (with reasons)
All suites pass (core 1116 / client 223 / server 324) with ruff + mypy clean. |
Summary
Add a physical backup/restore feature for the SQLite store, exposed over the HTTP API so it works with remote-server setups and keeps the "only the server touches data" rule. This is distinct from the existing logical
export(JSON/CSV): backup is a full physical snapshot (audit log, notes, dependencies, tags) intended for real recovery.Closes #999.
Design (mirrors the existing repository port/impl split)
domain/services/backup_store.py—IBackupStore(ABC), storage-neutral:create_snapshot() -> Iterator[bytes],stage_restore(data),apply_pending_restore() -> bool. No.db, paths, or SQL leak into domain/application.infrastructure/.../sqlite_backup_store.py—SqliteBackupStore, the only place that knows SQLite:VACUUM INTOfor a consistent defragmented single-file snapshot,PRAGMA integrity_checkon upload, and a startup file swap (move current DB to*.pre-restore-<ts>, delete stale-wal/-shm, rename pending ontotasks.db).controllers/backup_controller.py—BackupController(store);RestoreResultDTO;BackupErrorhierarchy.Endpoints
GET /api/v1/backup→StreamingResponse,Content-Disposition: attachment; filename="taskdog-backup-<ts>.db".POST /api/v1/restore→ multipart.dbupload → integrity check (400 on failure) → staged as<db>.pending-restore→{"status": "pending", "message": "restart required"}.apply_pending_restore()runs in the FastAPI lifespan before the engine is created, so Alembic migrates a restored (possibly older) backup forward.Client & CLI
TaskdogApiClient.backup(path)(streaming download) /restore(path)(multipart upload);BaseApiClient.auth_headers()extracted so streaming/multipart reuse auth.taskdog backup [-o PATH]— default./taskdog-backup-YYYYMMDD-HHMMSS.db.taskdog restore-db PATH— confirmation prompt (--yesto skip), instructs a restart on success.The issue proposed
taskdog restore PATH, butrestorealready exists and means un-archive task(s). To avoid breaking it, the DB-restore command is namedrestore-db. Happy to rename if you'd prefer adbcommand group instead.Open question (from the issue)
Default
-olocation: chose cwd + timestamped name (./taskdog-backup-<ts>.db).Tests
SqliteBackupStoresnapshot round-trip, no-temp-leak, in-memory unsupported, restore staging + integrity rejection,apply_pending_restoreincl. WAL/SHM cleanup and pre-restore copy.--yes/ missing-file).Verification
make test, vulture) passed.backup→ added a second task →restore-db --yes→ restarted the server → only the first task remained;*.pre-restore-*copy preserved, WAL/SHM regenerated cleanly.Out of scope (v1)