Skip to content

feat: backup/restore the SQLite store over the API (#999)#1003

Merged
Kohei-Wada merged 3 commits into
mainfrom
feat/backup-restore-999
Jun 6, 2026
Merged

feat: backup/restore the SQLite store over the API (#999)#1003
Kohei-Wada merged 3 commits into
mainfrom
feat/backup-restore-999

Conversation

@Kohei-Wada
Copy link
Copy Markdown
Owner

Summary

Add a physical backup/restore feature for the SQLite store, exposed over the HTTP API so it works with remote-server setups and keeps the "only the server touches data" rule. This is distinct from the existing logical export (JSON/CSV): backup is a full physical snapshot (audit log, notes, dependencies, tags) intended for real recovery.

Closes #999.

Design (mirrors the existing repository port/impl split)

  • domain/services/backup_store.pyIBackupStore(ABC), storage-neutral: create_snapshot() -> Iterator[bytes], stage_restore(data), apply_pending_restore() -> bool. No .db, paths, or SQL leak into domain/application.
  • infrastructure/.../sqlite_backup_store.pySqliteBackupStore, the only place that knows SQLite: VACUUM INTO for a consistent defragmented single-file snapshot, PRAGMA integrity_check on upload, and a startup file swap (move current DB to *.pre-restore-<ts>, delete stale -wal/-shm, rename pending onto tasks.db).
  • controllers/backup_controller.pyBackupController(store); RestoreResultDTO; BackupError hierarchy.

Endpoints

  • GET /api/v1/backupStreamingResponse, Content-Disposition: attachment; filename="taskdog-backup-<ts>.db".
  • POST /api/v1/restore → multipart .db upload → integrity check (400 on failure) → staged as <db>.pending-restore{"status": "pending", "message": "restart required"}.
  • Both honor the existing API-key auth.
  • apply_pending_restore() runs in the FastAPI lifespan before the engine is created, so Alembic migrates a restored (possibly older) backup forward.

Client & CLI

  • TaskdogApiClient.backup(path) (streaming download) / restore(path) (multipart upload); BaseApiClient.auth_headers() extracted so streaming/multipart reuse auth.
  • taskdog backup [-o PATH] — default ./taskdog-backup-YYYYMMDD-HHMMSS.db.
  • taskdog restore-db PATH — confirmation prompt (--yes to skip), instructs a restart on success.

⚠️ Naming note

The issue proposed taskdog restore PATH, but restore already exists and means un-archive task(s). To avoid breaking it, the DB-restore command is named restore-db. Happy to rename if you'd prefer a db command group instead.

Open question (from the issue)

Default -o location: chose cwd + timestamped name (./taskdog-backup-<ts>.db).

Tests

  • core: SqliteBackupStore snapshot round-trip, no-temp-leak, in-memory unsupported, restore staging + integrity rejection, apply_pending_restore incl. WAL/SHM cleanup and pre-restore copy.
  • server: both endpoints + 400 on integrity failure.
  • client + CLI: mocked (download chunks, error mapping, multipart upload, confirmation / --yes / missing-file).

Verification

  • Full suites: core 1116, client 223, server 322, ui 1049 — all pass. ruff + mypy clean across all four packages. pre-push (make test, vulture) passed.
  • End-to-end against a running server: created a task → backup → added a second task → restore-db --yes → restarted the server → only the first task remained; *.pre-restore-* copy preserved, WAL/SHM regenerated cleanly.

Out of scope (v1)

  • MCP (operational action, not task management).

Add a physical backup/restore feature for the SQLite store, exposed over
the HTTP API so it works with remote-server setups and keeps "only the
server touches data". Distinct from the logical `export` (JSON/CSV): this
is a full physical snapshot (audit log, notes, dependencies, tags) for
real recovery.

Core (storage-neutral port + SQLite impl):
- IBackupStore port exposing only chunk streams (create_snapshot /
  stage_restore / apply_pending_restore) — no .db, paths, or SQL leak out.
- SqliteBackupStore: VACUUM INTO for a consistent single-file snapshot,
  PRAGMA integrity_check on upload, startup file swap with WAL/SHM cleanup
  and a pre-restore copy of the previous DB.
- BackupController, RestoreResultDTO, BackupError hierarchy.

Server:
- GET /api/v1/backup streams the snapshot; POST /api/v1/restore validates
  and stages the upload (400 on integrity failure), returning a pending
  status. Both honor the existing API-key auth.
- apply_pending_restore runs in the lifespan BEFORE the engine is created,
  so Alembic migrates a restored (possibly older) backup forward.

Client & CLI:
- TaskdogApiClient.backup(path) / restore(path); BaseApiClient.auth_headers.
- `taskdog backup [-o PATH]` (default ./taskdog-backup-<ts>.db) and
  `taskdog restore-db PATH` (confirmation prompt, --yes to skip). The
  restore command is named `restore-db` because `restore` already means
  un-archiving tasks.

Verified end-to-end against a running server: backup -> mutate -> restore
-> restart applies the snapshot. MCP is out of scope for v1.
Copy link
Copy Markdown

@amazon-q-developer amazon-q-developer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

This PR implements a well-designed physical backup/restore feature for SQLite. The architecture properly separates concerns with domain abstraction (IBackupStore) and infrastructure implementation (SqliteBackupStore). The approach of staging restores before server restart and using VACUUM INTO for consistent snapshots is sound.

However, I've identified 4 critical defects that block merge:

  1. Memory exhaustion vulnerability in the restore endpoint - loads entire uploads into RAM instead of streaming
  2. Race condition crash when restoring to non-existent database path
  3. Resource leak in SQLite integrity validation leaving connections in WAL checkpoint mode
  4. Partial write corruption in backup download without atomic file operations

All issues have specific code suggestions to fix. Once addressed, the implementation will be production-ready.


You can now have the agent implement changes and create commits directly on your pull request's source branch. Simply comment with /q followed by your request in natural language to ask the agent to make changes.

Comment on lines +40 to +41
data = await file.read()
return controller.restore(iter([data]))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛑 Memory Exhaustion Vulnerability: Loading the entire upload into memory risks crashes and denial-of-service for large database files. The server should stream the upload directly to disk rather than buffering it entirely in RAM with await file.read(). This bypasses the chunked streaming designed in stage_restore(data: Iterator[bytes]).

Suggested change
data = await file.read()
return controller.restore(iter([data]))
async def read_chunks():
while chunk := await file.read(1024 * 1024):
yield chunk
return controller.restore(read_chunks())

Comment on lines +128 to +131
db_path = self._db_path
if db_path.exists():
ts = datetime.now().strftime("%Y%m%d-%H%M%S")
db_path.replace(db_path.with_name(f"{db_path.name}.pre-restore-{ts}"))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛑 Race Condition Crash Risk: If the database file doesn't exist when apply_pending_restore() is called (e.g., fresh install with a staged restore), this code attempts to move a non-existent file, causing FileNotFoundError. The condition only checks if pending.exists(), not db_path.exists() before the replace operation on line 131.

Suggested change
db_path = self._db_path
if db_path.exists():
ts = datetime.now().strftime("%Y%m%d-%H%M%S")
db_path.replace(db_path.with_name(f"{db_path.name}.pre-restore-{ts}"))
db_path = self._db_path
if db_path.exists():
ts = datetime.now().strftime("%Y%m%d-%H%M%S")
backup_path = db_path.with_name(f"{db_path.name}.pre-restore-{ts}")
db_path.rename(backup_path)

Comment on lines +108 to +118
@staticmethod
def _is_valid_sqlite(path: Path) -> bool:
try:
conn = sqlite3.connect(path)
try:
row = conn.execute("PRAGMA integrity_check").fetchone()
finally:
conn.close()
except sqlite3.DatabaseError:
return False
return row is not None and row[0] == "ok"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛑 Resource Leak: Missing conn.execute("PRAGMA optimize") after integrity check causes database connection to remain in WAL checkpoint mode. SQLite keeps the connection open with file locks, preventing concurrent access and potentially corrupting the staged file during validation.

Suggested change
@staticmethod
def _is_valid_sqlite(path: Path) -> bool:
try:
conn = sqlite3.connect(path)
try:
row = conn.execute("PRAGMA integrity_check").fetchone()
finally:
conn.close()
except sqlite3.DatabaseError:
return False
return row is not None and row[0] == "ok"
@staticmethod
def _is_valid_sqlite(path: Path) -> bool:
try:
conn = sqlite3.connect(path)
try:
row = conn.execute("PRAGMA integrity_check").fetchone()
if row is not None and row[0] == "ok":
conn.execute("PRAGMA optimize")
return True
return False
finally:
conn.close()
except sqlite3.DatabaseError:
return False

Comment on lines +43 to +54
try:
with self._base.client.stream(
"GET", "/api/v1/backup", headers=self._base.auth_headers()
) as response:
if not response.is_success:
response.read()
self._base._handle_error(response)
with output_path.open("wb") as out:
for chunk in response.iter_bytes(_CHUNK_SIZE):
out.write(chunk)
except (httpx.ConnectError, httpx.TimeoutException, httpx.RequestError) as e:
raise ServerConnectionError(self._base.base_url, e) from e
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛑 Partial Write Corruption: backup() writes chunks without atomic guarantees. If the process crashes mid-write or disk is full, a corrupted partial file remains. The restore endpoint will then accept this invalid backup. Add a temp file with atomic rename to ensure only complete downloads are written.

Suggested change
try:
with self._base.client.stream(
"GET", "/api/v1/backup", headers=self._base.auth_headers()
) as response:
if not response.is_success:
response.read()
self._base._handle_error(response)
with output_path.open("wb") as out:
for chunk in response.iter_bytes(_CHUNK_SIZE):
out.write(chunk)
except (httpx.ConnectError, httpx.TimeoutException, httpx.RequestError) as e:
raise ServerConnectionError(self._base.base_url, e) from e
try:
tmp_path = output_path.with_suffix(output_path.suffix + ".tmp")
with self._base.client.stream(
"GET", "/api/v1/backup", headers=self._base.auth_headers()
) as response:
if not response.is_success:
response.read()
self._base._handle_error(response)
with tmp_path.open("wb") as out:
for chunk in response.iter_bytes(_CHUNK_SIZE):
out.write(chunk)
tmp_path.replace(output_path)
except (httpx.ConnectError, httpx.TimeoutException, httpx.RequestError) as e:
raise ServerConnectionError(self._base.base_url, e) from e

The /api/v1/restore endpoint uses FastAPI UploadFile/File, which requires
python-multipart at import time. It was present locally as a transitive
dep but not declared, so CI's clean install failed to import the router.
@Kohei-Wada
Copy link
Copy Markdown
Owner Author

/q review

Copy link
Copy Markdown

@amazon-q-developer amazon-q-developer Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

This PR implements a physical SQLite backup/restore feature over HTTP API. The implementation follows good architectural patterns with clean port/adapter separation. However, I've identified 5 critical defects that must be fixed before merge:

Critical Issues Found:

  1. Security Vulnerability (CWE-400) - Memory exhaustion attack vector in restore endpoint
  2. Race Condition (CWE-362) - Database corruption risk in file swap logic
  3. Missing Exception Handler - Unhandled BackupNotSupportedError causes 500 errors
  4. Crash Risk - Unhandled exceptions during startup restore operation
  5. Missing Exception Handler Registration - Import present but handler not registered

Summary of Findings:

Security (1 issue):

  • The /api/v1/restore endpoint loads entire uploaded files into memory, enabling authenticated attackers to crash the server with large files

Logic Errors (1 issue):

  • The apply_pending_restore() file swap has no exclusive locking, allowing concurrent database writes to be silently lost

Crash Risks (3 issues):

  • BackupNotSupportedError not registered in exception handlers
  • Exception handler import missing BackupNotSupportedError
  • Startup failures from apply_pending_restore() crash the server permanently

All issues have specific fixes provided in the inline comments. The test coverage appears comprehensive, but these defects were not caught because they involve edge cases (large files, concurrent access, startup failures).


You can now have the agent implement changes and create commits directly on your pull request's source branch. Simply comment with /q followed by your request in natural language to ask the agent to make changes.

file: Annotated[UploadFile, File()],
) -> RestoreResultDTO:
"""Stage an uploaded `.db` snapshot to be applied on the next server restart."""
data = await file.read()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛑 Security Vulnerability: Loading entire uploaded file into memory enables memory exhaustion attacks. An attacker with valid credentials could upload multi-gigabyte files to crash the server via OOM. Stream the upload in chunks using file.stream() or read in fixed-size chunks to bound memory usage.

    async for chunk in file.stream():
        if not hasattr(restore, "_chunks"):
            restore._chunks = []
        restore._chunks.append(chunk)
    return controller.restore(iter(restore._chunks))

Comment on lines +133 to +137
# Drop stale WAL/SHM so the old journal cannot shadow the restored DB.
for suffix in ("-wal", "-shm"):
self._sidecar(suffix).unlink(missing_ok=True)

pending.replace(db_path)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛑 Logic Error: Race condition allows database corruption if apply_pending_restore is called while the database is in use. The file swap at line 137 executes after WAL/SHM deletion but provides no exclusive locking. If another process writes to the database between line 135 and 137, the pending restore overwrites those writes silently. Wrap the entire swap operation (lines 133-137) in a SQLite connection with BEGIN EXCLUSIVE or verify no connections exist before swapping.

from fastapi import FastAPI, Request, status
from fastapi.responses import JSONResponse

from taskdog_core.domain.exceptions.backup_exceptions import BackupValidationError
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛑 Crash Risk: BackupNotSupportedError is not registered in the exception handler. If a client calls /api/v1/backup or /api/v1/restore on an in-memory database, the unhandled exception causes a 500 Internal Server Error instead of a proper 400/501 response. Add handler registration for BackupNotSupportedError.

Suggested change
from taskdog_core.domain.exceptions.backup_exceptions import BackupValidationError
from taskdog_core.domain.exceptions.backup_exceptions import (
BackupNotSupportedError,
BackupValidationError,
)

Comment on lines 48 to +49
app.add_exception_handler(TaskValidationError, _bad_request_handler)
app.add_exception_handler(BackupValidationError, _bad_request_handler)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add handler registration for BackupNotSupportedError after line 49 to properly return 400 BAD_REQUEST when backup operations are attempted on in-memory databases.

Suggested change
app.add_exception_handler(TaskValidationError, _bad_request_handler)
app.add_exception_handler(BackupValidationError, _bad_request_handler)
app.add_exception_handler(TaskValidationError, _bad_request_handler)
app.add_exception_handler(BackupValidationError, _bad_request_handler)
app.add_exception_handler(BackupNotSupportedError, _bad_request_handler)


# Apply a pending restore (if any) BEFORE the engine is created, so the
# restored DB is what Alembic then migrates forward.
SqliteBackupStore(resolve_database_url(config)).apply_pending_restore()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛑 Crash Risk: Unhandled exceptions from apply_pending_restore() crash server startup. If the pending restore file is corrupted or the filesystem is full during the file swap, the exception propagates uncaught, preventing the server from starting and leaving it in a broken state. Wrap line 58 in try-except to log the error and continue startup, or implement a startup health check that fails gracefully.

…rror mapping (#999)

- server/restore: stream the spooled upload off disk in 1 MiB chunks via
  file.file instead of buffering the whole DB in memory (CWE-400).
- client/backup: download to a .part file and atomically replace, so an
  interrupted download never leaves a corrupt partial snapshot.
- server: map BackupNotSupportedError -> 501 (was an unhandled 500 when a
  store is in-memory).
- server startup: guard apply_pending_restore with try/except so a failed
  swap logs and continues instead of leaving the server unstartable.

Not changed (reviewed and rejected): the pre-restore move is already
guarded by db_path.exists(); the "PRAGMA optimize" suggestion is unrelated
to any leak (the connection is closed in finally); and BEGIN EXCLUSIVE on
the swap is unnecessary because apply_pending_restore runs at startup
before the engine is created, with no other connections by design.
@Kohei-Wada
Copy link
Copy Markdown
Owner Author

Thanks for the review. Triaged each finding — addressed the 4 valid ones in 5520ffb, pushed back on 3 false positives.

Fixed

  • Memory exhaustion on /restore — stream the spooled upload off disk in 1 MiB chunks via file.file instead of await file.read(). (The async-generator suggestion can't be passed to the sync stage_restore(Iterator[bytes]); UploadFile is already spooled to disk, so reading file.file in chunks bounds memory cleanly.)
  • Partial-write corruption on backup() — download to a .part file and atomically replace(); clean up on error.
  • BackupNotSupportedError → 500 — now registered and mapped to 501 (more accurate than 400: it's a server-config limitation, not bad client input).
  • Startup crash if apply_pending_restore fails — wrapped in try/except; logs and continues with the existing DB, leaving the pending file for a retry.

Rejected (with reasons)

  • "pre-restore move on non-existent DB → FileNotFoundError" — already guarded by if db_path.exists(): before the move. No change needed.
  • "missing PRAGMA optimize → resource leak" — the connection is closed in finally; PRAGMA optimize is an unrelated perf hint, not a leak fix.
  • "BEGIN EXCLUSIVE on the file swap"apply_pending_restore runs at startup before the engine is created, so there are no concurrent connections by design. Exclusive locking the DB you're about to replace is contradictory and over-engineered for a single-user, local-first tool.

All suites pass (core 1116 / client 223 / server 324) with ruff + mypy clean.

@Kohei-Wada Kohei-Wada merged commit 8c06a38 into main Jun 6, 2026
14 checks passed
@Kohei-Wada Kohei-Wada deleted the feat/backup-restore-999 branch June 6, 2026 03:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: backup/restore the SQLite store over the API

1 participant