sukanth · Apr 24, 2026
diff --git a/.gitignore b/.gitignore
@@ -6,3 +6,5 @@ dist/
 .venv/
 .pytest_cache/
 .bookmarks-search/
+uv.lock
+.vscode/
diff --git a/README.md b/README.md
@@ -41,6 +41,7 @@ Ask in natural language — mindmark remembers what you saved.
 | `mindmark sync` | **Auto-detect** installed browsers and sync bookmarks directly — no export needed |
 | `mindmark find "query"` | Semantic search over titles, folders, domains, and URL slugs — returns top-K with similarity scores |
 | `mindmark open "query"` | Search and open the best match in your default browser |
+| `mindmark enrich` | Fetch page content, extract text, embed summaries, and improve search relevance with page context |
 | `mindmark stats` | Show index size, model info, top domains, and top folders |
 | `mindmark index <file>` | Import bookmarks from an exported HTML file (legacy workflow) |
 | `mindmark validate` | Check indexed bookmark URLs for stale links (HTTP 4xx/5xx or unreachable) and report them |
@@ -266,16 +267,19 @@ When you add new bookmarks in your browser, just run `mindmark sync` again — i
 
 > 💡 **Note:** If you change the embedding model with `--model`, all bookmarks will be re-embedded on the next sync. Browser names are case-insensitive (e.g., `--browser Chrome` and `--browser chrome` both work).
 
-### Filters
+### Filters and options
 
 Narrow down results without changing your query:
 
 ```bash
 mindmark find "useful tools" --domain github.com     # only github.com results
 mindmark find "useful tools" --folder work/kusto      # only bookmarks in matching folders
 mindmark find "useful tools" -k 20                    # return top 20 instead of 10
+mindmark find "useful tools" --excerpt               # include excerpts from enriched pages
 ```
 
+> 💡 **Note:** The `--excerpt` flag only shows excerpts for bookmarks that have been enriched, so run `mindmark enrich` first to fetch and embed page content. See [Augmented Index](#-augmented-index-with-page-summaries) for details.
+
 ### Re-indexing
 
 For the `sync` workflow, just rerun `mindmark sync`. It's incremental — only changed bookmarks are re-embedded.
@@ -344,7 +348,90 @@ Browser data files                              "python async tutorial"
 
 ---
 
-## 🗂️ Storage Layout
+## 🎯 Augmented Index with Page Summaries
+
+By default, mindmark indexes only bookmark metadata: titles, folders, domains, and URL slugs. If you want **deeper page context** in search results, use the enrichment pipeline to fetch page content and embed summaries.
+
+> 💡 **Note:** To stay 100% local and lightweight, enrichment uses **extractive summarization** (the first 500 characters of page text), with no LLM and no text generation. This means:
+> - Only the opening content is embedded (useful when key information appears early; content further down the page may be missed)
+> - Excerpts are only as good as the page's own writing, because they are taken verbatim and rely on natural sentence structure
+> - Privacy and speed are preserved (no cloud calls; all processing runs locally)
+
+### Why enrich?
+
+Without enrichment, searching for **"authentication strategies"** on a bookmark titled **"AWS Services"** may miss it, even though the page discusses authentication. With enrichment, the page content is fetched and summarized, improving relevance.
+
+### Quick start
+
+1. **Enrich bookmarks** (fetch page content and embed summaries):
+
+```bash
+mindmark enrich --limit 100 --workers 4
+```
+
+Options:
+- `--limit N` — Process top N pending URLs (default: all)
+- `--workers N` — Parallel fetch workers (default: 8)
+- `--timeout S` — Per-request timeout in seconds (default: 10.0)
+- `--refresh-failed` — Retry previously failed enrichments
+
+2. **Search with page context**:
+
+```bash
+mindmark find "authentication strategies" --excerpt
+```
+
+With `--excerpt`, results display the most relevant excerpt from the enriched page:
+
+```
+ 1. AWS Services
+    aws.amazon.com
+    ⤵ To control user access to AWS resources, you must have an authentication strategy. AWS IAM provides fine-grained access control...
+
+ 2. Auth0 Documentation
+    auth0.com
+    ⤵ Authentication is the process of verifying the identity of a user or service. Authorization is the process of granting permissions...
+```
+
+The `⤵` symbol indicates content from the enriched page. Without enrichment, the symbol won't appear.
+
+### How it works
+
+1. **Fetch** — Send a GET request with a user-agent header to each bookmark URL, rejecting HTTP 4xx/5xx responses and applying content-type guards.
+2. **Extract** — Strip boilerplate (nav, footer, scripts, styles) and extract plain text.
+3. **Summarize** — Use the first 500 characters of extracted text as the summary (extractive, no LLM).
+4. **Embed** — Embed the summary using the same ONNX model as bookmark metadata.
+5. **Blend** — At search time, combine base (bookmark metadata) and summary similarity scores:
+   - **Blended score = 0.65 × base_score + 0.35 × summary_score**
+   - Falls back to base-only if no summary exists.
+6. **Excerpt** — For readability, find and display the sentence from the summary most similar to the query.
+
+### Status and monitoring
+
+Check enrichment status:
+
+```bash
+python -c "
+from mindmark.index import Index
+idx = Index()
+print(idx.enrichment_stats())
+idx.close()
+"
+```
+
+Example output:
+```python
+{'pending': 1234, 'complete': 450, 'failed': 23}
+```
+
+### Notes
+
+- **100% local** — Page fetching happens on your machine; no cloud service is used.
+- **Smart caching** — Pages are re-fetched only if the page content changes (detected via content hash).
+- **Failure resilience** — HTTP errors, timeouts, and JavaScript-only pages are logged as failed; sync and search continue without interruption.
+- **Privacy** — Fetched page content never leaves your machine; extraction, summarization, and embedding all run locally.
+
+---
 
 | What | macOS / Linux | Windows | Override |
 |---|---|---|---|
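The blend rule from step 5 of the README's "How it works" list can be sketched as below. This is a minimal illustration that assumes both similarity scores are plain floats on the same scale; the names `blended_score`, `BASE_WEIGHT`, and `SUMMARY_WEIGHT` are hypothetical and are not taken from mindmark's code.

```python
from typing import Optional

# Weights as stated in the README: 0.65 for metadata, 0.35 for the page summary.
BASE_WEIGHT = 0.65
SUMMARY_WEIGHT = 0.35


def blended_score(base_score: float, summary_score: Optional[float]) -> float:
    """Combine metadata and summary similarity (hypothetical helper).

    Falls back to the base score alone when the bookmark has no summary,
    which is the README's stated behaviour for bookmarks not yet enriched.
    """
    if summary_score is None:
        return base_score
    return BASE_WEIGHT * base_score + SUMMARY_WEIGHT * summary_score
```

Because the weights sum to 1, a bookmark whose summary scores the same as its metadata keeps its original score, and an unenriched bookmark is neither boosted nor penalized by the formula itself.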

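Steps 3 and 6 of the README's "How it works" list (truncate to 500 characters, then pick the sentence closest to the query) might look roughly like this. The README says sentence similarity is computed with the embedding model; to keep the sketch self-contained it substitutes a simple token-overlap (Jaccard) score, which is an assumption and not mindmark's method. All function names here are hypothetical.

```python
import re

SUMMARY_CHARS = 500  # summary length stated in the README


def extractive_summary(text: str, max_chars: int = SUMMARY_CHARS) -> str:
    """Collapse whitespace and keep only the first max_chars characters."""
    collapsed = " ".join(text.split())
    return collapsed[:max_chars]


def _tokens(s: str) -> set:
    """Lowercased alphanumeric tokens; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def best_excerpt(summary: str, query: str) -> str:
    """Return the summary sentence with the highest token overlap with the query."""
    parts = re.split(r"(?<=[.!?])\s+", summary)
    sentences = [s.strip() for s in parts if s.strip()]
    if not sentences:
        return ""
    q = _tokens(query)

    def overlap(sentence: str) -> float:
        t = _tokens(sentence)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return max(sentences, key=overlap)
```

Swapping `overlap` for cosine similarity between query and sentence embeddings would bring the sketch closer to what the README describes, at the cost of needing the model at hand.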
diff --git a/pyproject.toml b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "mindmark"
-version = "0.1.5"
+version = "0.1.6"
 description = "Local semantic search over your browser bookmarks — on-device embeddings, no cloud."
 readme = "README.md"
 requires-python = ">=3.9"

diff --git a/src/mindmark/cli.py b/src/mindmark/cli.py
@@ -153,6 +153,7 @@ def _clear_index_contents(db_path: Path) -> bool:
     try:
         con = sqlite3.connect(str(db_path), timeout=1.0)
         cur = con.cursor()
+        cur.execute("DELETE FROM bookmark_enrichment")
         cur.execute("DELETE FROM bookmark_sources")
         cur.execute("DELETE FROM bookmarks")
         cur.execute("DELETE FROM meta")
@@ -192,9 +193,11 @@ def _cmd_find(args):
     idx = Index(db_path=args.db)
     if not getattr(args, 'json', False):
         _auto_sync_hint(idx)
+    include_excerpt = getattr(args, 'excerpt', False)
     results = idx.search(
         query=args.query, k=args.top,
         domain=args.domain, folder=args.folder,
+        include_excerpt=include_excerpt,
     )
     if not results:
         print("no results (is the index empty? run: mindmark sync)")
@@ -219,6 +222,9 @@ def _cmd_find(args):
             path = f"{folder}/" if folder else ""
             print(f"{i:2d}. {r['title']}")
             print(f"    {path}{domain}")
+            if include_excerpt and r.get("relevant_excerpt"):
+                excerpt = r["relevant_excerpt"]
+                print(f"    ⤵ {excerpt}")
 
     return 0
 
@@ -243,6 +249,50 @@ def _cmd_stats(args):
         idx.close()
 
 
+def _cmd_enrich(args):
+    from .enricher import enrich_pending
+
+    idx = Index(db_path=args.db)
+    try:
+        pending = idx.pending_enrichment_urls(
+            limit=None if args.refresh_failed else args.limit
+        )
+        if args.refresh_failed:
+            reset = idx.reset_failed_enrichment()
+            if reset:
+                print(f"reset {reset} failed enrichment rows to pending")
+            # re-query after reset, respecting --limit
+            pending = idx.pending_enrichment_urls(limit=args.limit)
+
+        estats = idx.enrichment_stats()
+        total_pending = estats.get("pending", 0)
+
+        if not pending:
+            print("nothing to enrich — run 'mindmark sync' first, or use --refresh-failed")
+            return 0
+
+        to_process = len(pending)
+        print(
+            f"enriching {to_process} bookmarks "
+            f"(pending={total_pending} workers={args.workers} timeout={args.timeout}s)"
+        )
+
+        result = enrich_pending(
+            idx,
+            limit=args.limit,
+            workers=args.workers,
+            timeout=args.timeout,
+            refresh_failed=False,  # already handled above
+        )
+        print(f"done. {result}")
+        return 0
+    except KeyboardInterrupt:
+        print("\n\nCancelled by user.")
+        return 1
+    finally:
+        idx.close()
+
+
 def _cmd_sync(args):
     from .browsers import parse_browser_bookmarks, detect_browsers
 
@@ -292,6 +342,10 @@ def build_parser():
     pf.add_argument("--folder")
     pf.add_argument("--json", action="store_true")
     pf.add_argument("--open", type=int, metavar="N")
+    pf.add_argument(
+        "--excerpt", action="store_true",
+        help="include excerpt from enriched page content (requires mindmark enrich)",
+    )
     pf.set_defaults(func=_cmd_find)
 
     ps = sub.add_parser("stats", help="show index stats")
@@ -324,6 +378,28 @@ def build_parser():
     )
     pd.set_defaults(func=_cmd_drop_index)
 
+    pe = sub.add_parser(
+        "enrich",
+        help="fetch page content for bookmarks and build summary embeddings (local, no cloud)",
+    )
+    pe.add_argument(
+        "--limit", type=int, default=None,
+        help="max bookmarks to process per run (default: all pending)",
+    )
+    pe.add_argument(
+        "--workers", type=int, default=8,
+        help="parallel fetch workers (default: 8)",
+    )
+    pe.add_argument(
+        "--timeout", type=float, default=10.0,
+        help="per-request fetch timeout in seconds (default: 10.0)",
+    )
+    pe.add_argument(
+        "--refresh-failed", action="store_true",
+        help="retry previously failed enrichments",
+    )
+    pe.set_defaults(func=_cmd_enrich)
+
     return p
 
 
@@ -336,6 +412,12 @@ def main(argv=None):
         if args.workers <= 0:
             parser.error("--workers must be > 0")
         return args.func(args)
+    if args.cmd == "enrich":
+        if args.workers <= 0:
+            parser.error("--workers must be > 0")
+        if args.timeout <= 0:
+            parser.error("--timeout must be > 0")
+        return args.func(args)
     if args.cmd is None:
         parser.print_help()
         return 2
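The `--workers` and `--timeout` checks added to `main()` could alternatively live in argparse `type=` callables, so that invalid values are rejected during parsing. The sketch below is one possible shape, not what this diff does; `positive_int`, `positive_float`, and `build_enrich_parser` are hypothetical names, while the flag names and defaults are copied from the `enrich` subparser above.

```python
import argparse


def positive_int(value: str) -> int:
    """argparse type: parse an int and require it to be > 0."""
    n = int(value)  # a ValueError here is reported by argparse as an invalid value
    if n <= 0:
        raise argparse.ArgumentTypeError(f"must be > 0, got {value}")
    return n


def positive_float(value: str) -> float:
    """argparse type: parse a float and require it to be > 0."""
    x = float(value)
    if x <= 0:
        raise argparse.ArgumentTypeError(f"must be > 0, got {value}")
    return x


def build_enrich_parser() -> argparse.ArgumentParser:
    """Standalone parser mirroring the numeric enrich flags (illustration only)."""
    p = argparse.ArgumentParser(prog="mindmark enrich")
    p.add_argument("--limit", type=int, default=None)
    p.add_argument("--workers", type=positive_int, default=8)
    p.add_argument("--timeout", type=positive_float, default=10.0)
    return p
```

With this approach argparse prints the usage message and exits with status 2 on a bad value, which matches the behaviour of `parser.error(...)` in `main()`.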