fix(import): make Limitless imports idempotent by alanshurafa · Pull Request #8075 · BasedHardware/omi

alanshurafa · 2026-06-20T23:38:55Z

Summary

Re-uploading the same Limitless export currently creates a full duplicate set of
conversations every time, because each imported conversation gets a random
uuid.uuid4() ID. This change gives each conversation a deterministic ID derived
from the lifelog's start time and persists it with an atomic create-if-absent,
so re-importing skips lifelogs already stored instead of duplicating or overwriting
them ("first import wins").

Problem

process_limitless_import (the background worker for POST /v1/import/limitless)
builds a Conversation per lifelog file and calls
conversations_db.upsert_conversation(uid, conversation.dict()), which writes by
document ID (.document(id).set(data)). Because the ID was random, a second import
of the same export never collided with the first — it always inserted. Users who
re-upload (or upload an overlapping export) got N extra copies of everything, and
there is no working cleanup path (the "delete Limitless conversations" endpoint is
stubbed out because it can't yet distinguish imported from pendant conversations).

Fix

Deterministic conversation ID from (uid, lifelog start-time) using the
existing document_id_from_seed primitive — the same idempotency mechanism
already used for memory summaries, chat messages, and trends.

LIMITLESS_IMPORT_ID_NAMESPACE = "limitless"  # never change — baked into existing IDs

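def conversation_id_for_lifelog(uid, filename):
    # Key on the parsed start time; the AI-generated title slug is ignored.
    started_at, _slug = parse_lifelog_filename(filename)
    # Fall back to the raw name when no timestamp can be parsed.
    identity = started_at.isoformat() if started_at else filename
    return document_id_from_seed(f"{LIMITLESS_IMPORT_ID_NAMESPACE}:{uid}:{identity}")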

Atomic create-if-absent. New conversations_db.create_conversation_if_absent
uses Firestore document.create() (which raises Conflict — the base class of
AlreadyExists — if the doc exists) instead of set(). The importer creates new
conversations and skips existing ones; it never overwrites. Encryption /
data-protection prep is applied by the same decorators upsert_conversation uses.
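
The create-if-absent behaviour described above can be sketched with an in-memory stand-in instead of Firestore. This is an illustration under stated assumptions, not the PR's code: InMemoryStore, its create() method, and the local Conflict / AlreadyExists classes are hypothetical and only mirror the relationship described here (AlreadyExists deriving from Conflict). The real helper calls Firestore's document.create().

```python
import threading


class Conflict(Exception):
    """Stand-in for the conflict error raised when a document already exists."""


class AlreadyExists(Conflict):
    """Stand-in subclass, mirroring AlreadyExists deriving from Conflict."""


class InMemoryStore:
    """Hypothetical store whose create() fails if the document exists."""

    def __init__(self):
        self._docs = {}
        self._lock = threading.Lock()

    def create(self, doc_id, data):
        # The existence check and the insert share one lock, so no second
        # writer can slip in between them.
        with self._lock:
            if doc_id in self._docs:
                raise AlreadyExists(doc_id)
            self._docs[doc_id] = dict(data)

    def get(self, doc_id):
        return self._docs.get(doc_id)


def create_conversation_if_absent(store, conversation):
    """Return True if created, False if that ID was already stored."""
    try:
        store.create(conversation["id"], conversation)
        return True
    except Conflict:  # catching the base class also covers AlreadyExists
        return False


store = InMemoryStore()
first = create_conversation_if_absent(store, {"id": "abc", "title": "Original"})
second = create_conversation_if_absent(store, {"id": "abc", "title": "Re-import"})
```

Here the second call returns False and the stored title stays "Original", which is the "first import wins" behaviour: a re-import skips rather than overwrites.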

Why this design

Atomic create, not read-then-set, and not overwrite. An imported conversation
is a normal conversation afterwards: the user can rename it, star it, move it to a
folder, or reprocess it (creating memories/vectors/trends). Overwriting on re-import
would silently destroy those. document.create() also closes the race where two
concurrent imports both see "not exists" and one clobbers the other. The trade-off
is that a re-import won't pick up Limitless's own later edits to a lifelog — an
acceptable, clearly-bounded "first import wins" semantic.
Key on the start-time, not the whole filename. Limitless filenames look like
2025-10-08_07h00m25s_Title-slug.md. The title slug is AI-generated and can change
between exports; the start timestamp is intrinsic. Keying on the timestamp means a
re-titled re-export still maps to the same conversation. A single pendant can't start
two lifelogs in the same second, so the timestamp identifies a lifelog. Unparseable
filenames fall back to the full in-zip path (see Edge cases).
Key on identity, not content. Content-hashing would make any lifelog edit look
like a brand-new conversation (a near-duplicate), defeating the dedup.
uid in the seed keeps IDs distinct across users (defensive). The "limitless"
namespace is a frozen constant so a refactor can't silently change IDs for already-
imported data.
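
The keying choices above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the regex, parse_start_time, and seed_to_id are assumptions made for the example, and seed_to_id is only a stand-in for document_id_from_seed, whose actual hashing scheme is not shown in this PR description.

```python
import hashlib
import re
from datetime import datetime

NAMESPACE = "limitless"  # illustrative; the PR freezes its own constant

# Matches names such as 2025-10-08_07h00m25s_Title-slug.md
_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2})_(\d{2})h(\d{2})m(\d{2})s_(.*)\.md$")


def parse_start_time(filename):
    """Return the start time encoded in the filename, or None if unparseable."""
    match = _PATTERN.match(filename)
    if not match:
        return None
    date, hh, mm, ss, _slug = match.groups()
    try:
        return datetime.strptime(f"{date} {hh}:{mm}:{ss}", "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None


def seed_to_id(seed):
    """Hypothetical stand-in for document_id_from_seed: a stable hash of the seed."""
    return hashlib.sha256(seed.encode("utf-8")).hexdigest()[:32]


def conversation_id(uid, filename):
    started_at = parse_start_time(filename)
    # The slug never enters the seed, so a re-titled export maps to the same ID.
    identity = started_at.isoformat() if started_at else filename
    return seed_to_id(f"{NAMESPACE}:{uid}:{identity}")


original = conversation_id("user-1", "2025-10-08_07h00m25s_Morning-standup.md")
retitled = conversation_id("user-1", "2025-10-08_07h00m25s_Team-sync-notes.md")
other_user = conversation_id("user-2", "2025-10-08_07h00m25s_Morning-standup.md")
```

With this sketch, original and retitled are equal because only the timestamp feeds the seed, while other_user differs because uid is part of the seed.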

Observability

The job now tracks conversations_skipped alongside conversations_created,
persisted on the ImportJob (authoritatively in the final update so tail
skips/empties don't leave a stale count) and exposed on ImportJobResponse. The
completion notification reports new vs already-imported counts and any failures.

Edge cases

Duplicate identity within one archive (same lifelog under two folders, since the
matcher accepts nested */lifelogs/*.md): the second create() returns "already
exists" → skipped and logged, first occurrence wins, never a silent overwrite.
Re-titled lifelog: deduped (keyed on timestamp, not slug).
Unparseable filename: falls back to the full in-zip path as identity (not just
the basename), so distinct files sharing a basename in different folders don't
collide and get one silently skipped.
Per-file failure: a create error is caught per file and counted; the rest of the
import continues.
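
The counting and per-file isolation described in the two sections above can be sketched as a small loop. ImportCounts, import_lifelogs, and fake_create are hypothetical names for illustration; the PR's worker and its ImportJob fields are more involved, and fake_create is a plain dict-based stand-in that is not atomic.

```python
from dataclasses import dataclass


@dataclass
class ImportCounts:
    created: int = 0
    skipped: int = 0
    failed: int = 0


def import_lifelogs(entries, create_if_absent):
    """Run create_if_absent per entry and return the final, authoritative counts.

    entries: iterable of (conversation_id, payload) pairs.
    create_if_absent: callable returning True if created, False if it existed.
    """
    counts = ImportCounts()
    for conversation_id, payload in entries:
        try:
            if create_if_absent(conversation_id, payload):
                counts.created += 1
            else:
                counts.skipped += 1
        except Exception:
            # One bad file is counted and the loop moves on to the next one.
            counts.failed += 1
    return counts


stored = {}


def fake_create(conversation_id, payload):
    """Test stand-in for the store write (not atomic, illustration only)."""
    if payload.get("boom"):
        raise RuntimeError("simulated write failure")
    if conversation_id in stored:
        return False
    stored[conversation_id] = payload
    return True


entries = [
    ("a", {"title": "First"}),
    ("a", {"title": "Duplicate"}),
    ("b", {"boom": True}),
    ("c", {"title": "Third"}),
]
counts = import_lifelogs(entries, fake_create)
```

Returning the counts once, after the loop, reflects the idea of writing them in the final job update so that trailing skips or failures cannot leave a stale number behind.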

Scope and follow-ups (intentionally out of this PR)

Forward-only. Conversations already duplicated by past random-ID imports are not
cleaned up here. A follow-up can backfill/dedupe by source=limitless.
No field-level merge. Picking up Limitless's later edits while preserving user
edits would need conflict handling; create-if-absent is the simpler safe choice now.
Other importers can adopt the same conversation_id_for_lifelog pattern later.
Test import of the ID primitive. document_id_from_seed lives in a module that
initialises Firestore at import, so the test mirrors it; a follow-up could move the
primitive to a Firestore-free module so tests import it directly.

Tests

New backend/tests/unit/test_limitless_import_idempotency.py (registered in test.sh).
Stubs Firestore/FCM, imports the real Pydantic models, and runs the real
process_limitless_import against an in-memory store modelling atomic create-if-absent:

ID is deterministic, timestamp-keyed (same time, different slug → same ID), and
sensitive to both uid and time; an unparseable filename falls back to the full name.
Re-importing the same export adds no new documents.
A user edit to an imported conversation survives re-import (skip, not overwrite).
Two different lifelogs get two IDs; a re-titled re-export deduplicates.
Two same-identity entries in one archive collapse to one, first occurrence wins.
The persisted ID equals the deterministic seed (never a random UUID).
The same export imported by two users does not collide.
A create error on one lifelog is isolated; the rest of the import still succeeds.

True concurrency atomicity is a property of Firestore create() and is not
unit-tested here.

Co-evolution / review trail (Codex adversarial passes)

Round 1 — blind overwrite destroys later edits → adopted skip-on-existing;
basename collapse → in-archive handling + test; re-title duplicates → key on
start-time; tests only proved ID generation → fake store proves persistence +
edit preservation; ID hygiene → frozen namespace constant.
Round 2 — skip wasn't atomic (read-then-set race) → switched to Firestore
document.create() create-if-absent (also removed the per-file existence read);
counters could go stale on tail skips → authoritative counts in the final update +
conversations_skipped surfaced on the API; added per-file-error isolation test.
Contested and deferred with rationale: ms-precision/content identity (negligible
same-second risk for single-pendant data), pre-existing-duplicate cleanup
(forward-only), field-level merge.
Dev review (codex exec review) — broadened the create() catch from
AlreadyExists to its base Conflict so all-duplicate re-imports skip rather than
error across gRPC/REST transports; added conversations_skipped to the job detail
(polling) endpoint, not just the list endpoint.

Re-uploading the same Limitless export created a duplicate set of conversations every time, because each imported conversation was given a random UUID. Derive a deterministic conversation ID from the lifelog's start time and persist it with an atomic create-if-absent, so re-importing skips lifelogs already stored instead of duplicating or overwriting them.

- conversation_id_for_lifelog: deterministic ID via document_id_from_seed, keyed on (uid, parsed start time) so re-titled re-exports still dedupe
- create_conversation_if_absent: Firestore document.create() (atomic), preserving edits made to a previously-imported conversation
- track and surface conversations_skipped on the import job
- tests for idempotency, edit preservation, and per-file error isolation

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

cubic-dev-ai

No issues found across 6 files

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 19ef0fd975

chatgpt-codex-connector · 2026-06-20T23:45:36Z

                        continue

+                    # Deterministic, idempotent conversation ID (keyed on lifelog identity).
+                    conversation_id = conversation_id_for_lifelog(uid, filename)


Preserve nested paths for unparseable lifelog IDs

When a ZIP contains nested */lifelogs/*.md entries whose basenames are the same but whose names do not match the timestamp pattern, filename is only Path(lifelog_path).name, so conversation_id_for_lifelog() falls back to that same basename for both files. With the new create-if-absent write, the second distinct lifelog is treated as already imported and skipped; this regresses from the prior random-ID behavior by dropping data for unparseable duplicate basenames. Pass the full lifelog_path (or the parsed content timestamp) into the ID helper for the fallback case while keeping basename parsing for titles.

Addressed in 7354743: the ID helper now takes the full in-zip path and falls back to it (not just the basename) when the filename has no parseable timestamp, so distinct files sharing a basename in different folders get distinct IDs instead of one being skipped. Added test_nested_unparseable_basenames_do_not_collide as a regression test.

When a lifelog filename has no parseable timestamp, the conversation ID fell back to the basename, so two distinct lifelogs sharing a basename in different nested folders (the importer accepts */lifelogs/*.md) collapsed to one ID and the second was skipped as already-imported, dropping data. Fall back to the full in-zip path instead, keeping basename parsing for the timestamp/title. Addresses the Codex automated review on the PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

alanshurafa · 2026-06-21T01:13:12Z

@codex review

kodjima33

Solid idempotent-import fix with regression test (deterministic conversation IDs + atomic create-if-absent). Approve only: 389 lines exceeds the scoped-diff bar and it changes core conversation storage/ID scheme — leaving final merge to Nik.

cubic-dev-ai Bot reviewed Jun 20, 2026

chatgpt-codex-connector Bot reviewed Jun 20, 2026

kodjima33 approved these changes Jun 21, 2026

alanshurafa wants to merge 2 commits into BasedHardware:main from alanshurafa:claude/kind-kapitsa-e81c75
