Add category fingerprint caching for efficient similarity scoring
Goal
Improve performance and scalability of similarity scoring by introducing cached category fingerprints, instead of comparing chats against every individual chat in a category.
Problem
Naive approach:
chat → every chat in every category
The cost of scoring a single chat grows linearly with the total number of stored chats, so this becomes expensive as the dataset grows and will slow down the UI.
Solution
Introduce a category fingerprint:
A single vector representation of all chats in a category
Then compute:
chat → category fingerprint
This reduces the work per category from N comparisons (one per chat) to a single comparison against one vector.
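A minimal sketch of the chat → category fingerprint comparison, assuming fingerprints are sparse term-weight mappings. The names `term_vector`, `cosine`, and `score_against_categories` are hypothetical, and plain term counts stand in for whatever vectorizer the project ends up using:

```python
import math
from collections import Counter


def term_vector(text: str) -> Counter:
    """Bag-of-words term counts; a stand-in for the real vectorizer (assumption)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_against_categories(chat_text: str, fingerprints: dict) -> dict:
    """One comparison per category, regardless of how many chats it holds."""
    chat_vec = term_vector(chat_text)
    return {name: cosine(chat_vec, fp) for name, fp in fingerprints.items()}
```

With this shape, the number of comparisons per scored chat equals the number of categories rather than the number of stored chats.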
How It Works
- Each category maintains a cached vector (fingerprint)
- The fingerprint is built from:
  - Chat titles
  - User prompts (default)
- When a category changes, its fingerprint is recomputed
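The steps above could be sketched roughly as follows. The `Chat` record and its field names, `tokenize`, and `build_fingerprint` are assumptions made for illustration; the optional `idf` mapping is a placeholder for the IDF weights the real TF-IDF scorer would supply:

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Chat:
    # Hypothetical chat record; the field names are assumptions.
    title: str
    user_prompts: list = field(default_factory=list)


def tokenize(text: str) -> list:
    """Naive whitespace tokenizer; the real one may differ."""
    return text.lower().split()


def build_fingerprint(chats, idf=None) -> dict:
    """Combine titles and user prompts of all chats into one weighted vector.

    Term frequency is computed over the whole category; each term is then
    scaled by its IDF weight (1.0 when no weight is supplied).
    """
    counts = Counter()
    for chat in chats:
        counts.update(tokenize(chat.title))
        for prompt in chat.user_prompts:
            counts.update(tokenize(prompt))
    total = sum(counts.values())
    if total == 0:
        return {}
    idf = idf or {}
    return {term: (n / total) * idf.get(term, 1.0) for term, n in counts.items()}
```

For example, `build_fingerprint([Chat("Fix python bug", ["python traceback"])])` yields a mapping in which `python` carries the largest weight.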
When to Recompute
Rebuild the fingerprint when:
- A chat is added to a category
- A chat is removed from a category
- A chat inside the category is edited (optional, can defer)
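One possible way to wire up these triggers is a dirty flag per category, with the rebuild deferred until the fingerprint is next read. This is a sketch under that assumption, not the required design; `CategoryFingerprintCache` and its methods are hypothetical, and term counts stand in for the real vectorizer:

```python
from collections import Counter


class CategoryFingerprintCache:
    """Rebuilds a category fingerprint only after that category has changed."""

    def __init__(self):
        self._chats = {}         # category -> {chat_id: text}
        self._fingerprints = {}  # category -> Counter
        self._dirty = set()      # categories whose fingerprint is stale
        self.rebuild_count = 0   # exposed for debugging and tests

    def add_chat(self, category, chat_id, text):
        self._chats.setdefault(category, {})[chat_id] = text
        self._dirty.add(category)

    def remove_chat(self, category, chat_id):
        removed = self._chats.get(category, {}).pop(chat_id, None)
        if removed is not None:
            self._dirty.add(category)

    def edit_chat(self, category, chat_id, text):
        # Treated like an add here; deferring edits would skip the dirty mark.
        self.add_chat(category, chat_id, text)

    def get_fingerprint(self, category):
        if category in self._dirty or category not in self._fingerprints:
            fingerprint = Counter()
            for text in self._chats.get(category, {}).values():
                fingerprint.update(text.lower().split())
            self._fingerprints[category] = fingerprint
            self._dirty.discard(category)
            self.rebuild_count += 1
        return self._fingerprints[category]
```

Repeated reads of an unchanged category return the cached vector, so several edits in a row cost only one rebuild.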
Storage
- Store fingerprints locally (SQLite or in-memory cache with persistence)
- Store metadata:
  - Last updated timestamp
  - Number of chats included
  - Vector size / method used (for future-proofing)
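If SQLite is chosen, the storage layout might look like the sketch below. The table name, column names, and the JSON encoding of the vector are all assumptions for illustration; only the standard-library `sqlite3` and `json` modules are used:

```python
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per category, vector stored as JSON text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS category_fingerprints (
    category    TEXT PRIMARY KEY,
    vector_json TEXT NOT NULL,
    updated_at  TEXT NOT NULL,
    chat_count  INTEGER NOT NULL,
    method      TEXT NOT NULL,
    vector_size INTEGER NOT NULL
)
"""


def open_store(path=":memory:"):
    """Open (or create) the fingerprint store; defaults to an in-memory DB."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_fingerprint(conn, category, vector, chat_count, method="tfidf"):
    """Insert or overwrite the fingerprint row for one category."""
    conn.execute(
        "INSERT OR REPLACE INTO category_fingerprints VALUES (?, ?, ?, ?, ?, ?)",
        (
            category,
            json.dumps(vector),
            datetime.now(timezone.utc).isoformat(),
            chat_count,
            method,
            len(vector),
        ),
    )
    conn.commit()


def load_fingerprint(conn, category):
    """Return the stored fingerprint and metadata, or None if absent."""
    row = conn.execute(
        "SELECT vector_json, updated_at, chat_count, method, vector_size "
        "FROM category_fingerprints WHERE category = ?",
        (category,),
    ).fetchone()
    if row is None:
        return None
    vector_json, updated_at, chat_count, method, vector_size = row
    return {
        "vector": json.loads(vector_json),
        "updated_at": updated_at,
        "chat_count": chat_count,
        "method": method,
        "vector_size": vector_size,
    }
```

Storing `method` next to the vector lets a later version detect fingerprints built with an older strategy and rebuild them.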
Implementation Notes
- Use the same vectorization method as similarity scoring (TF-IDF for MVP)
- Consider storing either:
  - The raw vector, or
  - An L2-normalized vector (cosine similarity then reduces to a dot product, which is faster)
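To illustrate why a normalized vector helps: once both vectors have unit length, cosine similarity is just their dot product, so the norms never need to be recomputed at scoring time. A small sketch with hypothetical helper names, assuming sparse term-weight mappings:

```python
import math


def l2_normalize(vector: dict) -> dict:
    """Scale a sparse vector to unit length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    if norm == 0:
        return dict(vector)
    return {term: w / norm for term, w in vector.items()}


def dot(a: dict, b: dict) -> float:
    """Sparse dot product; iterates over the smaller vector."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(term, 0.0) for term, w in a.items())


# For unit-length inputs, dot(a, b) equals the cosine similarity of a and b.
```

The trade-off is that a normalized vector cannot be updated incrementally without also keeping the raw weights, which matters if incremental updates are added later.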
Acceptance Criteria
- Each category has a cached fingerprint
- Fingerprints are only recomputed when necessary
- Similarity scoring uses fingerprints instead of per-chat comparisons
- No noticeable UI lag when rendering similarity scores
- Debug/logging option to inspect fingerprint rebuilds
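For the debug/logging criterion, one lightweight option is to wrap the rebuild in a helper that emits a debug record through the standard `logging` module. The logger name `fingerprints`, the helper `rebuild_with_logging`, and the `build` callback are hypothetical:

```python
import logging
import time

# Hypothetical logger name; enable with logger.setLevel(logging.DEBUG).
logger = logging.getLogger("fingerprints")


def rebuild_with_logging(category, chat_texts, build):
    """Run `build(chat_texts)` and log what was rebuilt and how long it took."""
    start = time.perf_counter()
    fingerprint = build(chat_texts)
    elapsed_ms = (time.perf_counter() - start) * 1000
    logger.debug(
        "rebuilt fingerprint category=%s chats=%d terms=%d elapsed_ms=%.2f",
        category,
        len(chat_texts),
        len(fingerprint),
        elapsed_ms,
    )
    return fingerprint
```

Because the record is emitted at debug level, it stays silent in normal use and can be switched on when inspecting how often rebuilds happen.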
Future Enhancements (NOT in this issue)
- Incremental updates instead of full recompute
- Support multiple fingerprint strategies (TF-IDF vs embeddings)
- Weighted fingerprints (recent chats matter more)
Why This Matters
- Keeps the app fast as data grows
- Enables real-time similarity scoring
- Makes the feature scalable without needing external APIs