2 changes: 2 additions & 0 deletions .gitignore
@@ -6,3 +6,5 @@ dist/
.venv/
.pytest_cache/
.bookmarks-search/
uv.lock
.vscode/
91 changes: 89 additions & 2 deletions README.md
@@ -41,6 +41,7 @@ Ask in natural language — mindmark remembers what you saved.
| `mindmark sync` | **Auto-detect** installed browsers and sync bookmarks directly — no export needed |
| `mindmark find "query"` | Semantic search over titles, folders, domains, and URL slugs — returns top-K with similarity scores |
| `mindmark open "query"` | Search and open the best match in your default browser |
| `mindmark enrich` | Fetch page content, extract text, embed summaries, and improve search relevance with page context |
| `mindmark stats` | Show index size, model info, top domains, and top folders |
| `mindmark index <file>` | Import bookmarks from an exported HTML file (legacy workflow) |
| `mindmark validate` | Check indexed bookmark URLs for stale links (HTTP 4xx/5xx or unreachable) and report them |
@@ -266,16 +267,19 @@ When you add new bookmarks in your browser, just run `mindmark sync` again — i

> 💡 **Note:** If you change the embedding model with `--model`, all bookmarks will be re-embedded on the next sync. Browser names are case-insensitive (e.g., `--browser Chrome` and `--browser chrome` both work).

### Filters
### Filters and options

Narrow down results without changing your query:

```bash
mindmark find "useful tools" --domain github.com # only github.com results
mindmark find "useful tools" --folder work/kusto # only bookmarks in matching folders
mindmark find "useful tools" -k 20 # return top 20 instead of 10
mindmark find "useful tools" --excerpt # include excerpts from enriched pages
```

> 💡 **Note:** The `--excerpt` flag only shows excerpts for bookmarks that have been enriched, so run `mindmark enrich` first to fetch and embed page content. See [Augmented Index with Page Summaries](#-augmented-index-with-page-summaries) for details.

### Re-indexing

For the `sync` workflow, just rerun `mindmark sync`. It's incremental — only changed bookmarks are re-embedded.
@@ -344,7 +348,90 @@ Browser data files "python async tutorial"

---

## 🗂️ Storage Layout
## 🎯 Augmented Index with Page Summaries

By default, mindmark indexes only bookmark metadata: titles, folders, domains, and URL slugs. If you want **deeper page context** in search results, use the enrichment pipeline to fetch page content and embed summaries.

> 💡 **Note:** To stay 100% local and lightweight, enrichment uses **extractive summarization** (the first 500 characters of page text), with no LLM and no text generation. This means:
> - Only the opening content is embedded, so key information near the top is captured while content further down may be missed
> - Excerpts are only as readable as the page itself, because they are taken verbatim and rely on its natural sentence structure
> - Privacy and speed are preserved (no cloud calls, runs entirely locally)
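The trade-off in the note above can be illustrated with a minimal sketch of extractive summarization. The function name `extractive_summary` and the whitespace normalization step are assumptions for illustration; only the 500-character limit comes from the README.

```python
import re

SUMMARY_CHARS = 500  # limit stated in the README


def extractive_summary(page_text: str, limit: int = SUMMARY_CHARS) -> str:
    """Return the opening `limit` characters of the page text (hypothetical helper)."""
    # Collapsing whitespace runs is an assumption, not confirmed by the source.
    normalized = re.sub(r"\s+", " ", page_text).strip()
    return normalized[:limit]


# A page whose key detail appears well past the first 500 characters.
page = "Intro paragraph. " * 40 + "Key detail that appears late in the page."
summary = extractive_summary(page)

print(len(summary))
print("Key detail" in summary)
```

In this sketch the late "Key detail" sentence falls outside the summary, which is the limitation the first bullet describes.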

### Why enrich?

Without enrichment, a search for **"authentication strategies"** may miss a bookmark titled **"AWS Services"**, even though the page discusses authentication, because only the title, folder, domain, and URL slug are indexed. With enrichment, the page content is fetched and summarized, so the query can also match the page text.

### Quick start

1. **Enrich bookmarks** (fetch page content and embed summaries):

```bash
mindmark enrich --limit 100 --workers 4
```

Options:
- `--limit N` — Process at most N pending URLs per run (default: all pending)
- `--workers N` — Parallel fetch workers (default: 8)
- `--timeout S` — Per-request timeout in seconds (default: 10.0)
- `--refresh-failed` — Retry previously failed enrichments

2. **Search with page context**:

```bash
mindmark find "authentication strategies" --excerpt
```

With `--excerpt`, results display the most relevant excerpt from the enriched page:

```
1. AWS Services
aws.amazon.com
⤵ To control user access to AWS resources, you must have an authentication strategy. AWS IAM provides fine-grained access control...

2. Auth0 Documentation
auth0.com
⤵ Authentication is the process of verifying the identity of a user or service. Authorization is the process of granting permissions...
```

The `⤵` symbol indicates content from the enriched page. Without enrichment, the symbol won't appear.
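The `--limit`, `--workers`, and `--timeout` options suggest a bounded worker pool with a per-request timeout. The following is a rough sketch of that pattern using the standard library, not mindmark's actual implementation. The names `fetch_all` and `fake_fetch` are hypothetical, and the fetcher is a stand-in that makes no network calls.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List, Optional


def fetch_all(
    urls: List[str],
    fetch: Callable[[str, float], str],
    workers: int = 8,
    timeout: float = 10.0,
    limit: Optional[int] = None,
) -> Dict[str, str]:
    """Fetch up to `limit` URLs in parallel; record failures instead of raising."""
    selected = urls if limit is None else urls[:limit]
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to its URL so results can be keyed by URL.
        futures = {pool.submit(fetch, url, timeout): url for url in selected}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One bad URL should not stop the rest of the batch.
                results[url] = f"failed: {exc}"
    return results


def fake_fetch(url: str, timeout: float) -> str:
    """Stand-in fetcher (assumption): simulates a timeout for 'broken' URLs."""
    if "broken" in url:
        raise TimeoutError(f"no response within {timeout}s")
    return f"content of {url}"


outcome = fetch_all(
    ["https://example.com/a", "https://example.com/broken", "https://example.com/c"],
    fake_fetch,
    workers=4,
    limit=2,
)
print(outcome)
```

Recording a failure per URL rather than raising keeps the batch running, which matches the failure-resilience behavior described in the Notes section.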

### How it works

1. **Fetch** — Send a GET request with a User-Agent header to each bookmark URL, treating HTTP 4xx/5xx responses as failures and applying content-type guards.
2. **Extract** — Strip boilerplate (nav, footer, scripts, styles) and extract plain text.
3. **Summarize** — Use the first 500 characters of extracted text as the summary (extractive, no LLM).
4. **Embed** — Embed the summary using the same ONNX model as bookmark metadata.
5. **Blend** — At search time, combine base (bookmark metadata) and summary similarity scores:
- **Blended score = 0.65 × base_score + 0.35 × summary_score**
- Falls back to base-only if no summary exists.
6. **Excerpt** — For readability, find and display the sentence from the summary most similar to the query.
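The blend and excerpt steps above can be sketched as follows. The 0.65 and 0.35 weights and the base-only fallback come from the README. Everything else is an assumption for illustration: the function names are hypothetical, and a toy bag-of-words cosine similarity stands in for the ONNX embedding model that mindmark actually uses.

```python
import math
import re
from collections import Counter
from typing import Optional

BASE_WEIGHT = 0.65     # weight for bookmark metadata similarity (from the README)
SUMMARY_WEIGHT = 0.35  # weight for page summary similarity (from the README)


def blended_score(base_score: float, summary_score: Optional[float]) -> float:
    """Combine both scores, falling back to base-only when no summary exists."""
    if summary_score is None:
        return base_score
    return BASE_WEIGHT * base_score + SUMMARY_WEIGHT * summary_score


def _bag_of_words(text: str) -> Counter:
    # Toy stand-in for an embedding: lowercase token counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_excerpt(summary: str, query: str) -> Optional[str]:
    """Return the summary sentence most similar to the query, if any."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]
    if not sentences:
        return None
    query_vec = _bag_of_words(query)
    return max(sentences, key=lambda s: _cosine(_bag_of_words(s), query_vec))


summary = (
    "AWS offers many services. "
    "To control user access, you must have an authentication strategy. "
    "Pricing varies by region."
)
print(blended_score(0.8, 0.4))
print(blended_score(0.8, None))
print(best_excerpt(summary, "authentication strategies"))
```

With a real embedding model, "strategy" and "strategies" would also score as similar; the toy similarity here matches only on the exact token "authentication".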

### Status and monitoring

Check enrichment status:

```bash
python -c "
from mindmark.index import Index
idx = Index()
print(idx.enrichment_stats())
idx.close()
"
```

Example output:
```python
{'pending': 1234, 'complete': 450, 'failed': 23}
```
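To turn that dictionary into a progress figure, a small helper such as the one below could be used. It assumes the three keys shown in the example output are the complete set of states; `enrichment_progress` is a hypothetical name, not part of mindmark.

```python
from typing import Dict


def enrichment_progress(stats: Dict[str, int]) -> float:
    """Percentage of bookmarks with a completed enrichment (hypothetical helper)."""
    total = sum(stats.values())
    if total == 0:
        return 0.0
    return 100.0 * stats.get("complete", 0) / total


# Values taken from the example output above.
stats = {"pending": 1234, "complete": 450, "failed": 23}
print(f"{enrichment_progress(stats):.1f}% enriched")
```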

### Notes

- **100% local** — Page fetching happens on your machine; no cloud service is used.
- **Smart caching** — Pages are re-fetched only if the page content changes (detected via content hash).
- **Failure resilience** — HTTP errors, timeouts, and JavaScript-only pages are logged as failed; sync and search continue without interruption.
- **Privacy** — Fetching sends requests to the bookmarked sites themselves, but extracted content never leaves your machine; extraction, summarization, and embedding all run locally.
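One plausible reading of the "smart caching" note is that a hash of the extracted text is stored, and the summary is re-embedded only when a later fetch yields a different hash. The sketch below shows that idea. The hash algorithm is not stated in the README, so SHA-256 is an assumption, and both function names are hypothetical.

```python
import hashlib
from typing import Optional


def content_hash(text: str) -> str:
    """Stable fingerprint of extracted page text (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reembedding(stored_hash: Optional[str], new_text: str) -> bool:
    """True when no hash is stored yet or the page text has changed."""
    return stored_hash != content_hash(new_text)


first = content_hash("To control user access, you need an authentication strategy.")
print(needs_reembedding(first, "To control user access, you need an authentication strategy."))
print(needs_reembedding(first, "This page has been rewritten."))
print(needs_reembedding(None, "Never seen before."))
```

Comparing hashes instead of full text keeps the stored state small and makes the unchanged case a cheap string comparison.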

---

| What | macOS / Linux | Windows | Override |
|---|---|---|---|
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "mindmark"
version = "0.1.5"
version = "0.1.6"
description = "Local semantic search over your browser bookmarks — on-device embeddings, no cloud."
readme = "README.md"
requires-python = ">=3.9"
82 changes: 82 additions & 0 deletions src/mindmark/cli.py
@@ -153,6 +153,7 @@ def _clear_index_contents(db_path: Path) -> bool:
try:
con = sqlite3.connect(str(db_path), timeout=1.0)
cur = con.cursor()
cur.execute("DELETE FROM bookmark_enrichment")
cur.execute("DELETE FROM bookmark_sources")
cur.execute("DELETE FROM bookmarks")
cur.execute("DELETE FROM meta")
@@ -192,9 +193,11 @@ def _cmd_find(args):
idx = Index(db_path=args.db)
if not getattr(args, 'json', False):
_auto_sync_hint(idx)
include_excerpt = getattr(args, 'excerpt', False)
results = idx.search(
query=args.query, k=args.top,
domain=args.domain, folder=args.folder,
include_excerpt=include_excerpt,
)
if not results:
print("no results (is the index empty? run: mindmark sync)")
@@ -219,6 +222,9 @@ def _cmd_find(args):
path = f"{folder}/" if folder else ""
print(f"{i:2d}. {r['title']}")
print(f" {path}{domain}")
if include_excerpt and r.get("relevant_excerpt"):
excerpt = r["relevant_excerpt"]
print(f" ⤵ {excerpt}")

return 0

@@ -243,6 +249,50 @@ def _cmd_stats(args):
idx.close()


def _cmd_enrich(args):
    from .enricher import enrich_pending

    idx = Index(db_path=args.db)
    try:
        # Reset failed rows first so a single query sees them as pending.
        if args.refresh_failed:
            reset = idx.reset_failed_enrichment()
            if reset:
                print(f"reset {reset} failed enrichment rows to pending")

        # Query once, respecting --limit.
        pending = idx.pending_enrichment_urls(limit=args.limit)

        estats = idx.enrichment_stats()
        total_pending = estats.get("pending", 0)

        if not pending:
            print("nothing to enrich — run 'mindmark sync' first, or use --refresh-failed")
            return 0

        to_process = len(pending)
        print(
            f"enriching {to_process} bookmarks "
            f"(pending={total_pending} workers={args.workers} timeout={args.timeout}s)"
        )

        result = enrich_pending(
            idx,
            limit=args.limit,
            workers=args.workers,
            timeout=args.timeout,
            refresh_failed=False,  # already handled above
        )
        print(f"done. {result}")
        return 0
    except KeyboardInterrupt:
        print("\n\nCancelled by user.")
        return 1
    finally:
        idx.close()


def _cmd_sync(args):
from .browsers import parse_browser_bookmarks, detect_browsers

@@ -292,6 +342,10 @@ def build_parser():
    pf.add_argument("--folder")
    pf.add_argument("--json", action="store_true")
    pf.add_argument("--open", type=int, metavar="N")
    pf.add_argument(
        "--excerpt", action="store_true",
        help="include excerpt from enriched page content (requires mindmark enrich)",
    )
    pf.set_defaults(func=_cmd_find)

    ps = sub.add_parser("stats", help="show index stats")
@@ -324,6 +378,28 @@ def build_parser():
    )
    pd.set_defaults(func=_cmd_drop_index)

    pe = sub.add_parser(
        "enrich",
        help="fetch page content for bookmarks and build summary embeddings (local, no cloud)",
    )
    pe.add_argument(
        "--limit", type=int, default=None,
        help="max bookmarks to process per run (default: all pending)",
    )
    pe.add_argument(
        "--workers", type=int, default=8,
        help="parallel fetch workers (default: 8)",
    )
    pe.add_argument(
        "--timeout", type=float, default=10.0,
        help="per-request fetch timeout in seconds (default: 10.0)",
    )
    pe.add_argument(
        "--refresh-failed", action="store_true",
        help="retry previously failed enrichments",
    )
    pe.set_defaults(func=_cmd_enrich)

    return p


@@ -336,6 +412,12 @@ def main(argv=None):
if args.workers <= 0:
parser.error("--workers must be > 0")
return args.func(args)
if args.cmd == "enrich":
if args.workers <= 0:
parser.error("--workers must be > 0")
if args.timeout <= 0:
parser.error("--timeout must be > 0")
return args.func(args)
if args.cmd is None:
parser.print_help()
return 2