fix(import): make Limitless imports idempotent#8075
Conversation
Re-uploading the same Limitless export created a duplicate set of conversations every time, because each imported conversation was given a random UUID. Derive a deterministic conversation ID from the lifelog's start time and persist it with an atomic create-if-absent, so re-importing skips lifelogs already stored instead of duplicating or overwriting them. - conversation_id_for_lifelog: deterministic ID via document_id_from_seed, keyed on (uid, parsed start time) so re-titled re-exports still dedupe - create_conversation_if_absent: Firestore document.create() (atomic), preserving edits made to a previously-imported conversation - track and surface conversations_skipped on the import job - tests for idempotency, edit preservation, and per-file error isolation Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 19ef0fd975
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| continue | ||
|
|
||
| # Deterministic, idempotent conversation ID (keyed on lifelog identity). | ||
| conversation_id = conversation_id_for_lifelog(uid, filename) |
There was a problem hiding this comment.
Preserve nested paths for unparseable lifelog IDs
When a ZIP contains nested */lifelogs/*.md entries whose basenames are the same but whose names do not match the timestamp pattern, filename is only Path(lifelog_path).name, so conversation_id_for_lifelog() falls back to that same basename for both files. With the new create-if-absent write, the second distinct lifelog is treated as already imported and skipped; this regresses from the prior random-ID behavior by dropping data for unparseable duplicate basenames. Pass the full lifelog_path (or the parsed content timestamp) into the ID helper for the fallback case while keeping basename parsing for titles.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in 7354743: the ID helper now takes the full in-zip path and falls back to it (not just the basename) when the filename has no parseable timestamp, so distinct files sharing a basename in different folders get distinct IDs instead of one being skipped. Added test_nested_unparseable_basenames_do_not_collide as a regression test.
When a lifelog filename has no parseable timestamp, the conversation ID fell back to the basename, so two distinct lifelogs sharing a basename in different nested folders (the importer accepts */lifelogs/*.md) collapsed to one ID and the second was skipped as already-imported — dropping data. Fall back to the full in-zip path instead, keeping basename parsing for the timestamp/title. Addresses the Codex automated review on the PR. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
@codex review |
kodjima33
left a comment
There was a problem hiding this comment.
Solid idempotent-import fix with regression test (deterministic conversation IDs + atomic create-if-absent). Approve only: 389 lines exceeds the scoped-diff bar and it changes core conversation storage/ID scheme — leaving final merge to Nik.
Summary
Re-uploading the same Limitless export currently creates a full duplicate set of
conversations every time, because each imported conversation gets a random
uuid.uuid4()ID. This change gives each conversation a deterministic ID derivedfrom the lifelog's start time and persists it with an atomic create-if-absent,
so re-importing skips lifelogs already stored instead of duplicating or overwriting
them ("first import wins").
Problem
process_limitless_import(the background worker forPOST /v1/import/limitless)builds a
Conversationper lifelog file and callsconversations_db.upsert_conversation(uid, conversation.dict()), which writes bydocument ID (
.document(id).set(data)). Because the ID was random, a second importof the same export never collided with the first — it always inserted. Users who
re-upload (or upload an overlapping export) got N extra copies of everything, and
there is no working cleanup path (the "delete Limitless conversations" endpoint is
stubbed out because it can't yet distinguish imported from pendant conversations).
Fix
Deterministic conversation ID from
(uid, lifelog start-time)using theexisting
document_id_from_seedprimitive — the same idempotency mechanismalready used for memory summaries, chat messages, and trends.
Atomic create-if-absent. New
conversations_db.create_conversation_if_absentuses Firestore
document.create()(which raisesConflict— the base class ofAlreadyExists— if the doc exists) instead ofset(). The importer creates newconversations and skips existing ones; it never overwrites. Encryption /
data-protection prep is applied by the same decorators
upsert_conversationuses.Why this design
is a normal conversation afterwards: the user can rename it, star it, move it to a
folder, or reprocess it (creating memories/vectors/trends). Overwriting on re-import
would silently destroy those.
document.create()also closes the race where twoconcurrent imports both see "not exists" and one clobbers the other. The trade-off
is that a re-import won't pick up Limitless's own later edits to a lifelog — an
acceptable, clearly-bounded "first import wins" semantic.
2025-10-08_07h00m25s_Title-slug.md. The title slug is AI-generated and can changebetween exports; the start timestamp is intrinsic. Keying on the timestamp means a
re-titled re-export still maps to the same conversation. A single pendant can't start
two lifelogs in the same second, so the timestamp identifies a lifelog. Unparseable
filenames fall back to the full name.
like a brand-new conversation (a near-duplicate), defeating the dedup.
uidin the seed keeps IDs distinct across users (defensive). The"limitless"namespace is a frozen constant so a refactor can't silently change IDs for already-
imported data.
Observability
The job now tracks
conversations_skippedalongsideconversations_created,persisted on the
ImportJob(authoritatively in the final update so tailskips/empties don't leave a stale count) and exposed on
ImportJobResponse. Thecompletion notification reports new vs already-imported counts and any failures.
Edge cases
matcher accepts nested
*/lifelogs/*.md): the secondcreate()returns "alreadyexists" → skipped and logged, first occurrence wins, never a silent overwrite.
the basename), so distinct files sharing a basename in different folders don't
collide and get one silently skipped.
import continues.
Scope and follow-ups (intentionally out of this PR)
cleaned up here. A follow-up can backfill/dedupe by
source=limitless.edits would need conflict handling; create-if-absent is the simpler safe choice now.
conversation_id_for_lifelogpattern later.document_id_from_seedlives in a module thatinitialises Firestore at import, so the test mirrors it; a follow-up could move the
primitive to a Firestore-free module so tests import it directly.
Tests
New
backend/tests/unit/test_limitless_import_idempotency.py(registered intest.sh).Stubs Firestore/FCM, imports the real Pydantic models, and runs the real
process_limitless_importagainst an in-memory store modelling atomic create-if-absent:sensitive; unparseable filename falls back to the full name.
True concurrency atomicity is a property of Firestore
create()and is notunit-tested here.
Co-evolution / review trail (Codex adversarial passes)
basename collapse → in-archive handling + test; re-title duplicates → key on
start-time; tests only proved ID generation → fake store proves persistence +
edit preservation; ID hygiene → frozen namespace constant.
document.create()create-if-absent (also removed the per-file existence read);counters could go stale on tail skips → authoritative counts in the final update +
conversations_skippedsurfaced on the API; added per-file-error isolation test.Contested and deferred with rationale: ms-precision/content identity (negligible
same-second risk for single-pendant data), pre-existing-duplicate cleanup
(forward-only), field-level merge.
codex exec review) — broadened the create() catch fromAlreadyExiststo its baseConflictso all-duplicate re-imports skip rather thanerror across gRPC/REST transports; added
conversations_skippedto the job detail(polling) endpoint, not just the list endpoint.