feat: backup/restore the SQLite store over the API (#999) by Kohei-Wada · Pull Request #1003 · Kohei-Wada/taskdog

Kohei-Wada · 2026-06-06T02:36:30Z

Summary

Add a physical backup/restore feature for the SQLite store, exposed over the HTTP API so it works with remote-server setups and keeps the "only the server touches data" rule. This is distinct from the existing logical export (JSON/CSV): backup is a full physical snapshot (audit log, notes, dependencies, tags) intended for real recovery.

Closes #999.

Design (mirrors the existing repository port/impl split)

domain/services/backup_store.py — IBackupStore(ABC), storage-neutral: create_snapshot() -> Iterator[bytes], stage_restore(data), apply_pending_restore() -> bool. No .db, paths, or SQL leak into domain/application.
infrastructure/.../sqlite_backup_store.py — SqliteBackupStore, the only place that knows SQLite: VACUUM INTO for a consistent defragmented single-file snapshot, PRAGMA integrity_check on upload, and a startup file swap (move current DB to *.pre-restore-<ts>, delete stale -wal/-shm, rename pending onto tasks.db).
controllers/backup_controller.py — BackupController(store); RestoreResultDTO; BackupError hierarchy.

Endpoints

GET /api/v1/backup → StreamingResponse, Content-Disposition: attachment; filename="taskdog-backup-<ts>.db".
POST /api/v1/restore → multipart .db upload → integrity check (400 on failure) → staged as <db>.pending-restore → {"status": "pending", "message": "restart required"}.
Both honor the existing API-key auth.
apply_pending_restore() runs in the FastAPI lifespan before the engine is created, so Alembic migrates a restored (possibly older) backup forward.

Client & CLI

TaskdogApiClient.backup(path) (streaming download) / restore(path) (multipart upload); BaseApiClient.auth_headers() extracted so streaming/multipart reuse auth.
taskdog backup [-o PATH] — default ./taskdog-backup-YYYYMMDD-HHMMSS.db.
taskdog restore-db PATH — confirmation prompt (--yes to skip), instructs a restart on success.

⚠️ Naming note

The issue proposed taskdog restore PATH, but restore already exists and means un-archive task(s). To avoid breaking it, the DB-restore command is named restore-db. Happy to rename if you'd prefer a db command group instead.

Open question (from the issue)

Default -o location: chose cwd + timestamped name (./taskdog-backup-<ts>.db).

Tests

core: SqliteBackupStore snapshot round-trip, no-temp-leak, in-memory unsupported, restore staging + integrity rejection, apply_pending_restore incl. WAL/SHM cleanup and pre-restore copy.
server: both endpoints + 400 on integrity failure.
client + CLI: mocked (download chunks, error mapping, multipart upload, confirmation / --yes / missing-file).

Verification

Full suites: core 1116, client 223, server 322, ui 1049 — all pass. ruff + mypy clean across all four packages. pre-push (make test, vulture) passed.
End-to-end against a running server: created a task → backup → added a second task → restore-db --yes → restarted the server → only the first task remained; *.pre-restore-* copy preserved, WAL/SHM regenerated cleanly.

Out of scope (v1)

MCP (operational action, not task management).

Add a physical backup/restore feature for the SQLite store, exposed over the HTTP API so it works with remote-server setups and keeps "only the server touches data". Distinct from the logical `export` (JSON/CSV): this is a full physical snapshot (audit log, notes, dependencies, tags) for real recovery. Core (storage-neutral port + SQLite impl): - IBackupStore port exposing only chunk streams (create_snapshot / stage_restore / apply_pending_restore) — no .db, paths, or SQL leak out. - SqliteBackupStore: VACUUM INTO for a consistent single-file snapshot, PRAGMA integrity_check on upload, startup file swap with WAL/SHM cleanup and a pre-restore copy of the previous DB. - BackupController, RestoreResultDTO, BackupError hierarchy. Server: - GET /api/v1/backup streams the snapshot; POST /api/v1/restore validates and stages the upload (400 on integrity failure), returning a pending status. Both honor the existing API-key auth. - apply_pending_restore runs in the lifespan BEFORE the engine is created, so Alembic migrates a restored (possibly older) backup forward. Client & CLI: - TaskdogApiClient.backup(path) / restore(path); BaseApiClient.auth_headers. - `taskdog backup [-o PATH]` (default ./taskdog-backup-<ts>.db) and `taskdog restore-db PATH` (confirmation prompt, --yes to skip). The restore command is named `restore-db` because `restore` already means un-archiving tasks. Verified end-to-end against a running server: backup -> mutate -> restore -> restart applies the snapshot. MCP is out of scope for v1.

amazon-q-developer

Review Summary

This PR implements a well-designed physical backup/restore feature for SQLite. The architecture properly separates concerns with domain abstraction (IBackupStore) and infrastructure implementation (SqliteBackupStore). The approach of staging restores before server restart and using VACUUM INTO for consistent snapshots is sound.

However, I've identified 4 critical defects that block merge:

Memory exhaustion vulnerability in the restore endpoint - loads entire uploads into RAM instead of streaming
Race condition crash when restoring to non-existent database path
Resource leak in SQLite integrity validation leaving connections in WAL checkpoint mode
Partial write corruption in backup download without atomic file operations

All issues have specific code suggestions to fix. Once addressed, the implementation will be production-ready.

You can now have the agent implement changes and create commits directly on your pull request's source branch. Simply comment with /q followed by your request in natural language to ask the agent to make changes.

amazon-q-developer · 2026-06-06T02:37:37Z

+    data = await file.read()
+    return controller.restore(iter([data]))


🛑 Memory Exhaustion Vulnerability: Loading the entire upload into memory risks crashes and denial-of-service for large database files. The server should stream the upload directly to disk rather than buffering it entirely in RAM with await file.read(). This bypasses the chunked streaming designed in stage_restore(data: Iterator[bytes]).

Suggested change

data = await file.read()

return controller.restore(iter([data]))

async def read_chunks():

while chunk := await file.read(1024 * 1024):

yield chunk

return controller.restore(read_chunks())

amazon-q-developer · 2026-06-06T02:37:37Z

+        db_path = self._db_path
+        if db_path.exists():
+            ts = datetime.now().strftime("%Y%m%d-%H%M%S")
+            db_path.replace(db_path.with_name(f"{db_path.name}.pre-restore-{ts}"))


🛑 Race Condition Crash Risk: If the database file doesn't exist when apply_pending_restore() is called (e.g., fresh install with a staged restore), this code attempts to move a non-existent file, causing FileNotFoundError. The condition only checks if pending.exists(), not db_path.exists() before the replace operation on line 131.

Suggested change

db_path = self._db_path

if db_path.exists():

ts = datetime.now().strftime("%Y%m%d-%H%M%S")

db_path.replace(db_path.with_name(f"{db_path.name}.pre-restore-{ts}"))

db_path = self._db_path

if db_path.exists():

ts = datetime.now().strftime("%Y%m%d-%H%M%S")

backup_path = db_path.with_name(f"{db_path.name}.pre-restore-{ts}")

db_path.rename(backup_path)

amazon-q-developer · 2026-06-06T02:37:37Z

+    @staticmethod
+    def _is_valid_sqlite(path: Path) -> bool:
+        try:
+            conn = sqlite3.connect(path)
+            try:
+                row = conn.execute("PRAGMA integrity_check").fetchone()
+            finally:
+                conn.close()
+        except sqlite3.DatabaseError:
+            return False
+        return row is not None and row[0] == "ok"


🛑 Resource Leak: Missing conn.execute("PRAGMA optimize") after integrity check causes database connection to remain in WAL checkpoint mode. SQLite keeps the connection open with file locks, preventing concurrent access and potentially corrupting the staged file during validation.

Suggested change

@staticmethod

def _is_valid_sqlite(path: Path) -> bool:

try:

conn = sqlite3.connect(path)

try:

row = conn.execute("PRAGMA integrity_check").fetchone()

finally:

conn.close()

except sqlite3.DatabaseError:

return False

return row is not None and row[0] == "ok"

@staticmethod

def _is_valid_sqlite(path: Path) -> bool:

try:

conn = sqlite3.connect(path)

try:

row = conn.execute("PRAGMA integrity_check").fetchone()

if row is not None and row[0] == "ok":

conn.execute("PRAGMA optimize")

return True

return False

finally:

conn.close()

except sqlite3.DatabaseError:

return False

amazon-q-developer · 2026-06-06T02:37:37Z

+        try:
+            with self._base.client.stream(
+                "GET", "/api/v1/backup", headers=self._base.auth_headers()
+            ) as response:
+                if not response.is_success:
+                    response.read()
+                    self._base._handle_error(response)
+                with output_path.open("wb") as out:
+                    for chunk in response.iter_bytes(_CHUNK_SIZE):
+                        out.write(chunk)
+        except (httpx.ConnectError, httpx.TimeoutException, httpx.RequestError) as e:
+            raise ServerConnectionError(self._base.base_url, e) from e


🛑 Partial Write Corruption: backup() writes chunks without atomic guarantees. If the process crashes mid-write or disk is full, a corrupted partial file remains. The restore endpoint will then accept this invalid backup. Add a temp file with atomic rename to ensure only complete downloads are written.

Suggested change

try:

with self._base.client.stream(

"GET", "/api/v1/backup", headers=self._base.auth_headers()

) as response:

if not response.is_success:

response.read()

self._base._handle_error(response)

with output_path.open("wb") as out:

for chunk in response.iter_bytes(_CHUNK_SIZE):

out.write(chunk)

except (httpx.ConnectError, httpx.TimeoutException, httpx.RequestError) as e:

raise ServerConnectionError(self._base.base_url, e) from e

try:

tmp_path = output_path.with_suffix(output_path.suffix + ".tmp")

with self._base.client.stream(

"GET", "/api/v1/backup", headers=self._base.auth_headers()

) as response:

if not response.is_success:

response.read()

self._base._handle_error(response)

with tmp_path.open("wb") as out:

for chunk in response.iter_bytes(_CHUNK_SIZE):

out.write(chunk)

tmp_path.replace(output_path)

except (httpx.ConnectError, httpx.TimeoutException, httpx.RequestError) as e:

raise ServerConnectionError(self._base.base_url, e) from e

The /api/v1/restore endpoint uses FastAPI UploadFile/File, which requires python-multipart at import time. It was present locally as a transitive dep but not declared, so CI's clean install failed to import the router.

Kohei-Wada · 2026-06-06T02:43:49Z

/q review

amazon-q-developer

Review Summary

This PR implements a physical SQLite backup/restore feature over HTTP API. The implementation follows good architectural patterns with clean port/adapter separation. However, I've identified 5 critical defects that must be fixed before merge:

Critical Issues Found:

Security Vulnerability (CWE-400) - Memory exhaustion attack vector in restore endpoint
Race Condition (CWE-362) - Database corruption risk in file swap logic
Missing Exception Handler - Unhandled BackupNotSupportedError causes 500 errors
Crash Risk - Unhandled exceptions during startup restore operation
Missing Exception Handler Registration - Import present but handler not registered

Summary of Findings:

Security (1 issue):

The /api/v1/restore endpoint loads entire uploaded files into memory, enabling authenticated attackers to crash the server with large files

Logic Errors (1 issue):

The apply_pending_restore() file swap has no exclusive locking, allowing concurrent database writes to be silently lost

Crash Risks (3 issues):

BackupNotSupportedError not registered in exception handlers
Exception handler import missing BackupNotSupportedError
Startup failures from apply_pending_restore() crash the server permanently

All issues have specific fixes provided in the inline comments. The test coverage appears comprehensive, but these defects were not caught because they involve edge cases (large files, concurrent access, startup failures).

You can now have the agent implement changes and create commits directly on your pull request's source branch. Simply comment with /q followed by your request in natural language to ask the agent to make changes.

amazon-q-developer · 2026-06-06T02:45:34Z

+    file: Annotated[UploadFile, File()],
+) -> RestoreResultDTO:
+    """Stage an uploaded `.db` snapshot to be applied on the next server restart."""
+    data = await file.read()


🛑 Security Vulnerability: Loading entire uploaded file into memory enables memory exhaustion attacks. An attacker with valid credentials could upload multi-gigabyte files to crash the server via OOM. Stream the upload in chunks using file.stream() or read in fixed-size chunks to bound memory usage.

async for chunk in file.stream(): if not hasattr(restore, "_chunks"): restore._chunks = [] restore._chunks.append(chunk) return controller.restore(iter(restore._chunks))

amazon-q-developer · 2026-06-06T02:45:34Z

+        # Drop stale WAL/SHM so the old journal cannot shadow the restored DB.
+        for suffix in ("-wal", "-shm"):
+            self._sidecar(suffix).unlink(missing_ok=True)
+
+        pending.replace(db_path)


🛑 Logic Error: Race condition allows database corruption if apply_pending_restore is called while the database is in use. The file swap at line 137 executes after WAL/SHM deletion but provides no exclusive locking. If another process writes to the database between line 135 and 137, the pending restore overwrites those writes silently. Wrap the entire swap operation (lines 133-137) in a SQLite connection with BEGIN EXCLUSIVE or verify no connections exist before swapping.

amazon-q-developer · 2026-06-06T02:45:34Z

 from fastapi import FastAPI, Request, status
 from fastapi.responses import JSONResponse

+from taskdog_core.domain.exceptions.backup_exceptions import BackupValidationError


🛑 Crash Risk: BackupNotSupportedError is not registered in the exception handler. If a client calls /api/v1/backup or /api/v1/restore on an in-memory database, the unhandled exception causes a 500 Internal Server Error instead of a proper 400/501 response. Add handler registration for BackupNotSupportedError.

Suggested change

from taskdog_core.domain.exceptions.backup_exceptions import BackupValidationError

from taskdog_core.domain.exceptions.backup_exceptions import (

BackupNotSupportedError,

BackupValidationError,

)

amazon-q-developer · 2026-06-06T02:45:34Z

    app.add_exception_handler(TaskValidationError, _bad_request_handler)
+    app.add_exception_handler(BackupValidationError, _bad_request_handler)


Add handler registration for BackupNotSupportedError after line 49 to properly return 400 BAD_REQUEST when backup operations are attempted on in-memory databases.

Suggested change

app.add_exception_handler(TaskValidationError, _bad_request_handler)

app.add_exception_handler(BackupValidationError, _bad_request_handler)

app.add_exception_handler(TaskValidationError, _bad_request_handler)

app.add_exception_handler(BackupValidationError, _bad_request_handler)

app.add_exception_handler(BackupNotSupportedError, _bad_request_handler)

amazon-q-developer · 2026-06-06T02:45:34Z


+        # Apply a pending restore (if any) BEFORE the engine is created, so the
+        # restored DB is what Alembic then migrates forward.
+        SqliteBackupStore(resolve_database_url(config)).apply_pending_restore()


🛑 Crash Risk: Unhandled exceptions from apply_pending_restore() crash server startup. If the pending restore file is corrupted or the filesystem is full during the file swap, the exception propagates uncaught, preventing the server from starting and leaving it in a broken state. Wrap line 58 in try-except to log the error and continue startup, or implement a startup health check that fails gracefully.

…rror mapping (#999) - server/restore: stream the spooled upload off disk in 1 MiB chunks via file.file instead of buffering the whole DB in memory (CWE-400). - client/backup: download to a .part file and atomically replace, so an interrupted download never leaves a corrupt partial snapshot. - server: map BackupNotSupportedError -> 501 (was an unhandled 500 when a store is in-memory). - server startup: guard apply_pending_restore with try/except so a failed swap logs and continues instead of leaving the server unstartable. Not changed (reviewed and rejected): the pre-restore move is already guarded by db_path.exists(); the "PRAGMA optimize" suggestion is unrelated to any leak (the connection is closed in finally); and BEGIN EXCLUSIVE on the swap is unnecessary because apply_pending_restore runs at startup before the engine is created, with no other connections by design.

Kohei-Wada · 2026-06-06T02:51:10Z

Thanks for the review. Triaged each finding — addressed the 4 valid ones in 5520ffb, pushed back on 3 false positives.

Fixed

Memory exhaustion on /restore — stream the spooled upload off disk in 1 MiB chunks via file.file instead of await file.read(). (The async-generator suggestion can't be passed to the sync stage_restore(Iterator[bytes]); UploadFile is already spooled to disk, so reading file.file in chunks bounds memory cleanly.)
Partial-write corruption on backup() — download to a .part file and atomically replace(); clean up on error.
BackupNotSupportedError → 500 — now registered and mapped to 501 (more accurate than 400: it's a server-config limitation, not bad client input).
Startup crash if apply_pending_restore fails — wrapped in try/except; logs and continues with the existing DB, leaving the pending file for a retry.

Rejected (with reasons)

"pre-restore move on non-existent DB → FileNotFoundError" — already guarded by if db_path.exists(): before the move. No change needed.
"missing PRAGMA optimize → resource leak" — the connection is closed in finally; PRAGMA optimize is an unrelated perf hint, not a leak fix.
"BEGIN EXCLUSIVE on the file swap" — apply_pending_restore runs at startup before the engine is created, so there are no concurrent connections by design. Exclusive locking the DB you're about to replace is contradictory and over-engineered for a single-user, local-first tool.

All suites pass (core 1116 / client 223 / server 324) with ruff + mypy clean.

amazon-q-developer Bot reviewed Jun 6, 2026

View reviewed changes

Kohei-Wada merged commit 8c06a38 into main Jun 6, 2026
14 checks passed

Kohei-Wada deleted the feat/backup-restore-999 branch June 6, 2026 03:02

		data = await file.read()
		return controller.restore(iter([data]))

-    data = await file.read()
-    return controller.restore(iter([data]))
+    async def read_chunks():
+        while chunk := await file.read(1024 * 1024):
+            yield chunk
+    return controller.restore(read_chunks())

		app.add_exception_handler(TaskValidationError, _bad_request_handler)
		app.add_exception_handler(BackupValidationError, _bad_request_handler)

Conversation

Kohei-Wada commented Jun 6, 2026

Summary

Design (mirrors the existing repository port/impl split)

Endpoints

Client & CLI

⚠️ Naming note

Open question (from the issue)

Tests

Verification

Out of scope (v1)

Uh oh!

amazon-q-developer Bot left a comment

Choose a reason for hiding this comment

Review Summary

Uh oh!

amazon-q-developer Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

amazon-q-developer Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

amazon-q-developer Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

amazon-q-developer Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

Kohei-Wada commented Jun 6, 2026

Uh oh!

amazon-q-developer Bot left a comment

Choose a reason for hiding this comment

Review Summary

Critical Issues Found:

Summary of Findings:

Uh oh!

amazon-q-developer Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

amazon-q-developer Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

amazon-q-developer Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

amazon-q-developer Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

amazon-q-developer Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

Kohei-Wada commented Jun 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant