Skip to content

fix(import): make Limitless imports idempotent#8075

Open
alanshurafa wants to merge 2 commits into
BasedHardware:mainfrom
alanshurafa:claude/kind-kapitsa-e81c75
Open

fix(import): make Limitless imports idempotent#8075
alanshurafa wants to merge 2 commits into
BasedHardware:mainfrom
alanshurafa:claude/kind-kapitsa-e81c75

Conversation

@alanshurafa

@alanshurafa alanshurafa commented Jun 20, 2026

Copy link
Copy Markdown

Summary

Re-uploading the same Limitless export currently creates a full duplicate set of
conversations every time, because each imported conversation gets a random
uuid.uuid4() ID. This change gives each conversation a deterministic ID derived
from the lifelog's start time
and persists it with an atomic create-if-absent,
so re-importing skips lifelogs already stored instead of duplicating or overwriting
them ("first import wins").

Problem

process_limitless_import (the background worker for POST /v1/import/limitless)
builds a Conversation per lifelog file and calls
conversations_db.upsert_conversation(uid, conversation.dict()), which writes by
document ID (.document(id).set(data)). Because the ID was random, a second import
of the same export never collided with the first — it always inserted. Users who
re-upload (or upload an overlapping export) got N extra copies of everything, and
there is no working cleanup path (the "delete Limitless conversations" endpoint is
stubbed out because it can't yet distinguish imported from pendant conversations).

Fix

  1. Deterministic conversation ID from (uid, lifelog start-time) using the
    existing document_id_from_seed primitive — the same idempotency mechanism
    already used for memory summaries, chat messages, and trends.

    LIMITLESS_IMPORT_ID_NAMESPACE = "limitless"  # never change — baked into existing IDs
    
    def conversation_id_for_lifelog(uid, filename):
        started_at, _slug = parse_lifelog_filename(filename)
        identity = started_at.isoformat() if started_at else filename
        return document_id_from_seed(f"{LIMITLESS_IMPORT_ID_NAMESPACE}:{uid}:{identity}")
  2. Atomic create-if-absent. New conversations_db.create_conversation_if_absent
    uses Firestore document.create() (which raises Conflict — the base class of
    AlreadyExists — if the doc exists) instead of set(). The importer creates new
    conversations and skips existing ones; it never overwrites. Encryption /
    data-protection prep is applied by the same decorators upsert_conversation uses.

Why this design

  • Atomic create, not read-then-set, and not overwrite. An imported conversation
    is a normal conversation afterwards: the user can rename it, star it, move it to a
    folder, or reprocess it (creating memories/vectors/trends). Overwriting on re-import
    would silently destroy those. document.create() also closes the race where two
    concurrent imports both see "not exists" and one clobbers the other. The trade-off
    is that a re-import won't pick up Limitless's own later edits to a lifelog — an
    acceptable, clearly-bounded "first import wins" semantic.
  • Key on the start-time, not the whole filename. Limitless filenames look like
    2025-10-08_07h00m25s_Title-slug.md. The title slug is AI-generated and can change
    between exports; the start timestamp is intrinsic. Keying on the timestamp means a
    re-titled re-export still maps to the same conversation. A single pendant can't start
    two lifelogs in the same second, so the timestamp identifies a lifelog. Unparseable
    filenames fall back to the full name.
  • Key on identity, not content. Content-hashing would make any lifelog edit look
    like a brand-new conversation (a near-duplicate), defeating the dedup.
  • uid in the seed keeps IDs distinct across users (defensive). The "limitless"
    namespace is a frozen constant so a refactor can't silently change IDs for already-
    imported data.

Observability

The job now tracks conversations_skipped alongside conversations_created,
persisted on the ImportJob (authoritatively in the final update so tail
skips/empties don't leave a stale count) and exposed on ImportJobResponse. The
completion notification reports new vs already-imported counts and any failures.

Edge cases

  • Duplicate identity within one archive (same lifelog under two folders, since the
    matcher accepts nested */lifelogs/*.md): the second create() returns "already
    exists" → skipped and logged, first occurrence wins, never a silent overwrite.
  • Re-titled lifelog: deduped (keyed on timestamp, not slug).
  • Unparseable filename: falls back to the full in-zip path as identity (not just
    the basename), so distinct files sharing a basename in different folders don't
    collide and get one silently skipped.
  • Per-file failure: a create error is caught per file and counted; the rest of the
    import continues.

Scope and follow-ups (intentionally out of this PR)

  • Forward-only. Conversations already duplicated by past random-ID imports are not
    cleaned up here. A follow-up can backfill/dedupe by source=limitless.
  • No field-level merge. Picking up Limitless's later edits while preserving user
    edits would need conflict handling; create-if-absent is the simpler safe choice now.
  • Other importers can adopt the same conversation_id_for_lifelog pattern later.
  • Test import of the ID primitive. document_id_from_seed lives in a module that
    initialises Firestore at import, so the test mirrors it; a follow-up could move the
    primitive to a Firestore-free module so tests import it directly.

Tests

New backend/tests/unit/test_limitless_import_idempotency.py (registered in test.sh).
Stubs Firestore/FCM, imports the real Pydantic models, and runs the real
process_limitless_import against an in-memory store modelling atomic create-if-absent:

  • ID is deterministic, timestamp-keyed (same time, different slug → same ID), uid/time
    sensitive; unparseable filename falls back to the full name.
  • Re-importing the same export adds no new documents.
  • A user edit to an imported conversation survives re-import (skip, not overwrite).
  • Two different lifelogs get two IDs; a re-titled re-export deduplicates.
  • Two same-identity entries in one archive collapse to one, first occurrence wins.
  • The persisted ID equals the deterministic seed (never a random UUID).
  • The same export imported by two users does not collide.
  • A create error on one lifelog is isolated; the rest of the import still succeeds.

True concurrency atomicity is a property of Firestore create() and is not
unit-tested here.

Co-evolution / review trail (Codex adversarial passes)

  • Round 1blind overwrite destroys later edits → adopted skip-on-existing;
    basename collapse → in-archive handling + test; re-title duplicates → key on
    start-time; tests only proved ID generation → fake store proves persistence +
    edit preservation; ID hygiene → frozen namespace constant.
  • Round 2skip wasn't atomic (read-then-set race) → switched to Firestore
    document.create() create-if-absent (also removed the per-file existence read);
    counters could go stale on tail skips → authoritative counts in the final update +
    conversations_skipped surfaced on the API; added per-file-error isolation test.
    Contested and deferred with rationale: ms-precision/content identity (negligible
    same-second risk for single-pendant data), pre-existing-duplicate cleanup
    (forward-only), field-level merge.
  • Dev review (codex exec review) — broadened the create() catch from
    AlreadyExists to its base Conflict so all-duplicate re-imports skip rather than
    error across gRPC/REST transports; added conversations_skipped to the job detail
    (polling) endpoint, not just the list endpoint.

Re-uploading the same Limitless export created a duplicate set of
conversations every time, because each imported conversation was given a
random UUID. Derive a deterministic conversation ID from the lifelog's
start time and persist it with an atomic create-if-absent, so re-importing
skips lifelogs already stored instead of duplicating or overwriting them.

- conversation_id_for_lifelog: deterministic ID via document_id_from_seed,
  keyed on (uid, parsed start time) so re-titled re-exports still dedupe
- create_conversation_if_absent: Firestore document.create() (atomic),
  preserving edits made to a previously-imported conversation
- track and surface conversations_skipped on the import job
- tests for idempotency, edit preservation, and per-file error isolation

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 6 files

Re-trigger cubic

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 19ef0fd975

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread backend/utils/imports/limitless.py Outdated
continue

# Deterministic, idempotent conversation ID (keyed on lifelog identity).
conversation_id = conversation_id_for_lifelog(uid, filename)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve nested paths for unparseable lifelog IDs

When a ZIP contains nested */lifelogs/*.md entries whose basenames are the same but whose names do not match the timestamp pattern, filename is only Path(lifelog_path).name, so conversation_id_for_lifelog() falls back to that same basename for both files. With the new create-if-absent write, the second distinct lifelog is treated as already imported and skipped; this regresses from the prior random-ID behavior by dropping data for unparseable duplicate basenames. Pass the full lifelog_path (or the parsed content timestamp) into the ID helper for the fallback case while keeping basename parsing for titles.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 7354743: the ID helper now takes the full in-zip path and falls back to it (not just the basename) when the filename has no parseable timestamp, so distinct files sharing a basename in different folders get distinct IDs instead of one being skipped. Added test_nested_unparseable_basenames_do_not_collide as a regression test.

When a lifelog filename has no parseable timestamp, the conversation ID fell
back to the basename, so two distinct lifelogs sharing a basename in different
nested folders (the importer accepts */lifelogs/*.md) collapsed to one ID and
the second was skipped as already-imported — dropping data. Fall back to the
full in-zip path instead, keeping basename parsing for the timestamp/title.

Addresses the Codex automated review on the PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@alanshurafa

Copy link
Copy Markdown
Author

@codex review

@kodjima33 kodjima33 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid idempotent-import fix with regression test (deterministic conversation IDs + atomic create-if-absent). Approve only: 389 lines exceeds the scoped-diff bar and it changes core conversation storage/ID scheme — leaving final merge to Nik.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants