A local file search engine that indexes files on your machine and enables fast full-text and metadata search, with a CLI, an HTTP API, and a React-based web UI.
cd application
mvn package
The built jar will be at application/target/application-1.0-SNAPSHOT.jar.
java -jar application/target/application-1.0-SNAPSHOT.jar index <directory>
Options:

| Option | Default | Description |
|---|---|---|
| `--db <path>` | `.searchengine/index.db` | Custom database path |
| `-i, --ignore <pattern>` | (none) | Glob pattern to ignore (repeatable) |
| `--max-file-size <MB>` | `10` | Skip files larger than this |
| `--preview-lines <n>` | `3` | Number of preview lines to store |
| `--batch-size <n>` | `250` | Number of files per DB batch write |

Examples:
# Basic index
java -jar ... index C:\Users\user\Documents
# With ignore rules
java -jar ... index C:\Users\user\Documents -i "*.log" -i "backup"
# Tune DB batch writes
java -jar ... index C:\Users\user\Documents --batch-size 500
# Custom database path
java -jar ... --db C:\myindex\index.db index C:\Users\user\Documents

java -jar application/target/application-1.0-SNAPSHOT.jar search "<query>"
Options:

| Option | Default | Description |
|---|---|---|
| `--db <path>` | `.searchengine/index.db` | Custom database path |
| `--limit <n>` | `50` | Maximum number of results |

Query syntax:

| Query | Meaning |
|---|---|
| `getting started` | Full-text search |
| `README.md` | Search by filename |
| `content:hello` | Restrict full-text match to file contents |
| `path:src/main` | Filter by path substring (cross-platform) |
| `ext:java` | Filter by extension |
| `modified:2025-01-01` | Files modified after date |
| `size:1048576` | Files larger than size in bytes |
| `size:10kb`, `size:5mb`, `size:1gb` | Size filter with units (case-insensitive) |
| `sort:date`, `sort:alpha`, `sort:balanced`, `sort:behavior` | Choose ranking strategy |
| `config ext:json` | Combined full-text and metadata |

Qualifiers can appear in any order and combine with AND semantics. Duplicate qualifiers (e.g., two content: filters) compose with AND.
For CLI usage, sorting is query-based (sort:<mode>), not a separate --sort flag.
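The qualifier rules above can be illustrated with a minimal Java sketch. It is not the project's parser: the class name `QuerySketch`, its records, and the choice of 1 kb = 1024 bytes are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: splits a query into free text and key:value qualifiers. */
final class QuerySketch {
    /** One qualifier such as ext:java; repeated keys are kept so they can combine with AND. */
    record Qualifier(String key, String value) {}

    record ParsedQuery(String text, List<Qualifier> qualifiers) {}

    // Qualifier keys taken from the query syntax table.
    private static final List<String> KEYS =
            List.of("content", "path", "ext", "modified", "size", "sort");

    static ParsedQuery parse(String input) {
        List<String> words = new ArrayList<>();
        List<Qualifier> qualifiers = new ArrayList<>();
        for (String token : input.trim().split("\\s+")) {
            int colon = token.indexOf(':');
            String key = colon > 0 ? token.substring(0, colon).toLowerCase(Locale.ROOT) : "";
            if (colon > 0 && colon < token.length() - 1 && KEYS.contains(key)) {
                qualifiers.add(new Qualifier(key, token.substring(colon + 1)));
            } else {
                words.add(token); // anything else belongs to the full-text part
            }
        }
        return new ParsedQuery(String.join(" ", words), qualifiers);
    }

    /** Converts "1048576", "10kb", "5MB", or "1gb" to bytes (assumes 1 kb = 1024 bytes). */
    static long parseSize(String value) {
        String v = value.toLowerCase(Locale.ROOT);
        long factor = 1;
        if (v.endsWith("kb")) factor = 1024L;
        else if (v.endsWith("mb")) factor = 1024L * 1024;
        else if (v.endsWith("gb")) factor = 1024L * 1024 * 1024;
        String digits = factor == 1 ? v : v.substring(0, v.length() - 2);
        return Long.parseLong(digits) * factor;
    }
}
```

Because qualifiers are collected in a list and not a map, two `content:` filters in one query both survive parsing, which is what AND composition of duplicates requires.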
Examples:
java -jar ... search "getting started"
java -jar ... search "ext:java"
java -jar ... search "README.md"
java -jar ... search "size:10mb"
java -jar ... search "config ext:json" --limit 10
java -jar ... search "auth path:src/main sort:date"

The project includes a React frontend that talks to an HTTP API server.
Start the API server:
java -jar application/target/application-1.0-SNAPSHOT.jar server
The server listens on http://localhost:7070 by default. To use a different port, pass it as the next argument: ... server 8080.
Run the frontend:
cd frontend
npm install
npm run dev
The dev server prints the URL it's listening on. The frontend sends requests to http://localhost:7070/api/*.
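The API can also be called without the frontend. The sketch below uses the JDK's built-in HTTP client against the `/api/search` endpoint on the default port. The class and method names are hypothetical, and no response schema is assumed, so the body is returned as raw text.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

/** Hypothetical client sketch for the search endpoint. */
final class ApiClientSketch {
    static final String BASE_URL = "http://localhost:7070"; // default port from this README

    /** Builds the search URL; the whole query string, qualifiers included, goes into q. */
    static URI searchUri(String query) {
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return URI.create(BASE_URL + "/api/search?q=" + encoded);
    }

    /** Sends the request and returns the raw response body (the server must be running). */
    static String search(String query) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(searchUri(query)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

For example, `ApiClientSketch.search("auth sort:behavior")` would request personalized results for `auth`, provided the server was started as described above.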
The UI exposes both indexing and search workflows: configure a root directory and ignore rules, run indexing with live progress, then search with sort-mode selection (default, balanced, date, alphabetical, personalized). Results show file metadata, content previews with query-term highlighting, and a "Mark as opened" action that feeds personalized ranking. When the personalized sort is active, results display ranking insights describing why each result scored where it did.
The crawler always ignores common system/build directories (for example node_modules, target, build, dist, .git, .idea, AppData, Program Files, Windows) and also ignores hidden files/directories and non-text files. You can add more rules with -i/--ignore.
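As a rough illustration of how built-in and user-supplied ignore rules could combine, the sketch below checks each path segment against a fixed set of directory names and against glob patterns via the JDK's `PathMatcher`. The class name is hypothetical, only a subset of the built-in names is listed, and treating dot-prefixed names as hidden is an assumption (Windows uses a file attribute instead).

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of ignore-rule matching; not the project's crawler. */
final class IgnoreRulesSketch {
    // Subset of the built-in directory names listed above.
    private static final Set<String> BUILT_IN =
            Set.of("node_modules", "target", "build", "dist", ".git", ".idea");

    private final List<PathMatcher> userMatchers;

    IgnoreRulesSketch(List<String> globs) {
        this.userMatchers = globs.stream()
                .map(g -> FileSystems.getDefault().getPathMatcher("glob:" + g))
                .toList();
    }

    /** Expects a normalized path relative to the index root. */
    boolean shouldIgnore(Path path) {
        for (Path segment : path) { // iterates over the name elements of the path
            String name = segment.toString();
            // Assumption: a leading dot marks a hidden file or directory.
            if (BUILT_IN.contains(name) || name.startsWith(".")) return true;
            for (PathMatcher matcher : userMatchers) {
                if (matcher.matches(segment)) return true;
            }
        }
        return false;
    }
}
```

With the rules from the CLI example (`-i "*.log" -i "backup"`), this sketch would skip both `src/app.log` and anything under a `backup` directory.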
To validate that only changed files are re-indexed, follow these steps:
- Run an initial index on a test directory.
- Modify one indexed text file, add one new file, and delete one existing indexed file.
- Run indexing again on the same directory.
- Check the report:
  - `Skipped` should include unchanged files.
  - `Indexed` should reflect only changed/new files.
  - `Deleted` should include files removed from disk.
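The expected report can be reasoned about with a small sketch that compares the previous index state to the current crawl. It assumes change detection by last-modified timestamp, which this README does not confirm; the class and record names are hypothetical.

```java
import java.util.Map;

/** Hypothetical sketch of incremental classification; not the project's indexer. */
final class IncrementalSketch {
    record Report(int indexed, int skipped, int deleted) {}

    /**
     * @param stored path to last-modified millis recorded by the previous run
     * @param onDisk path to last-modified millis seen by the current crawl
     */
    static Report classify(Map<String, Long> stored, Map<String, Long> onDisk) {
        int indexed = 0;
        int skipped = 0;
        for (Map.Entry<String, Long> entry : onDisk.entrySet()) {
            Long previous = stored.get(entry.getKey());
            if (previous != null && previous.equals(entry.getValue())) {
                skipped++; // unchanged since the last run
            } else {
                indexed++; // new or modified file
            }
        }
        int deleted = 0;
        for (String path : stored.keySet()) {
            if (!onDisk.containsKey(path)) deleted++; // indexed earlier, gone from disk
        }
        return new Report(indexed, skipped, deleted);
    }
}
```

For the scenario in the steps above (one file modified, one added, one deleted, the rest untouched), the sketch counts two indexed files, one deleted file, and every remaining file as skipped.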
Run all automated tests:
cd application
mvn test
Current suite covers:
- Query parsing: full-text, filename, metadata, mixed input, and `size` unit parsing (`bytes`, `kb`, `mb`, `gb`)
- Search behavior: recursive traversal, single-word and multi-word full-text search
- Metadata filters: `ext`, `modified`, `size` (including unit forms), `path`, `content`
- Runtime indexing options: `ignoreRules`, `maxFileSizeMb`, `previewLines`, `batchSize`
- Indexing lifecycle: background progress snapshot and final report
- Resilience: database failure propagation, unreadable files, and symlink-loop environments (platform-dependent skip)
- Incremental indexing: unchanged-file skip, modified-file update, and deleted-file cleanup
- Ranking strategies: resolver mapping, swappable strategy selection, behavior score formula (frequency, recency, position lift), ranking insight formatting (relative time, lift threshold)
- Search activity: history recording, suggestion prefix matching, recent-query ordering
Typical output should report all tests passing, with one optional skipped test on platforms that cannot create symlinks.
The default ranking favors content relevance and path features. The personalized ranking strategy (sort:behavior, or "Personalized" in the UI) reorders results based on the user's interaction history with similar queries. It uses three signals:
- Frequency — how often the file has been opened for the same normalized query
- Recency — how recently it was opened (exponential decay with a 7-day half-life)
- Position lift/boost — whether the user typically had to "dig" past higher-ranked results to reach this file (an opened file consistently found at position 8 ranks higher than one always found at position 1, holding other factors equal)
Results are split into two buckets: files with any open history sort first, ordered by behavior score; files without open history follow, ordered by full-text relevance. When the personalized sort is active, the UI shows insights under each result explaining the ranking ("you've opened this 5 times for similar searches", "last opened 2 hours ago", "you often find this past higher-ranked results").
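To make the three signals concrete, here is a minimal sketch of a weighted score. Only the 7-day half-life comes from the description above; the weights, the frequency cap, the linear position lift, and all names are hypothetical and will differ from the actual `BehaviorScoreFormula`.

```java
import java.time.Duration;

/** Hypothetical sketch of a behavior score; constants are illustrative assumptions. */
final class BehaviorScoreSketch {
    static final double HALF_LIFE_DAYS = 7.0; // from the description above
    static final int FREQUENCY_CAP = 10;      // assumed cap on counted opens
    static final int MAX_POSITION = 10;       // assumed deepest position that still adds lift
    static final double W_FREQUENCY = 0.5;    // assumed weights
    static final double W_RECENCY = 0.3;
    static final double W_POSITION = 0.2;

    /** Open count scaled to [0, 1], saturating at the cap. */
    static double frequency(int openCount) {
        return Math.min(openCount, FREQUENCY_CAP) / (double) FREQUENCY_CAP;
    }

    /** Exponential decay: the value halves every HALF_LIFE_DAYS. */
    static double recency(Duration sinceLastOpen) {
        double days = sinceLastOpen.toMillis() / 86_400_000.0;
        return Math.pow(0.5, days / HALF_LIFE_DAYS);
    }

    /** 0 for files usually found at position 1, rising to 1 at MAX_POSITION. */
    static double positionLift(double averagePosition) {
        double clamped = Math.max(1.0, Math.min(averagePosition, MAX_POSITION));
        return (clamped - 1.0) / (MAX_POSITION - 1.0);
    }

    static double score(int openCount, Duration sinceLastOpen, double averagePosition) {
        return W_FREQUENCY * frequency(openCount)
                + W_RECENCY * recency(sinceLastOpen)
                + W_POSITION * positionLift(averagePosition);
    }
}
```

Under these assumptions, a file last opened exactly seven days ago contributes a recency of 0.5, and a file typically found at position 8 outranks one typically found at position 1 when open count and recency are equal.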
The UI also surfaces query suggestions based on prefix matches against search history, and recent unique queries — both fed by the same activity tracking that drives personalized ranking.
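Prefix suggestions and recent unique queries can both be derived from one ordered history, as the sketch below shows. It assumes the history is a plain list ordered oldest to newest and that matching is case-insensitive; neither detail is stated in this README, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of history-based suggestions. */
final class SuggestionSketch {
    /**
     * Returns up to limit unique queries that start with prefix, newest first.
     * An empty prefix yields the most recent unique queries.
     */
    static List<String> suggest(List<String> history, String prefix, int limit) {
        String normalizedPrefix = prefix.trim().toLowerCase(Locale.ROOT);
        Set<String> unique = new LinkedHashSet<>(); // keeps first-seen (newest) order
        for (int i = history.size() - 1; i >= 0 && unique.size() < limit; i--) {
            String query = history.get(i).trim().toLowerCase(Locale.ROOT);
            if (!query.isEmpty() && query.startsWith(normalizedPrefix)) {
                unique.add(query);
            }
        }
        return new ArrayList<>(unique);
    }
}
```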
The following sections document the architectural choices of the ranking system, including the trade-offs that were considered and deliberately accepted.
The sequence diagram below traces a personalized-search request from the UI to SQLite and back, showing where each layer narrows its dependencies and where the behavior-score UDF runs. The component diagram shows the current (post-decomposition) shape of the persistence layer.
sequenceDiagram
autonumber
participant UI as Frontend (SearchPanel)
participant API as ApiServer
participant SS as SearchService
participant SE as SearchEngine
participant Acc as DatabaseAccessor
participant Sess as DatabaseSession<br/>(SqliteDatabaseSession)
participant FC as FileContext
participant Repo as SqliteFileRepository
participant DB as SQLite + FTS5<br/>+ behavior_score UDF
participant Obs as SearchActivityObserver
UI->>API: GET /api/search?q=auth+sort:behavior
API->>SS: search(dbPath, query, limit)
SS->>Acc: openFileSearch(dbPath)
Acc->>Sess: open(dbPath)
Sess-->>Acc: DatabaseSession
Note right of Acc: returns CloseableFileSearch<br/>(narrow view)
Acc-->>SS: CloseableFileSearch
SS->>SE: new SearchEngine(searchRepo, parser, limit)
SS->>SE: search(input)
SE->>SE: parse query, resolve strategy
SE->>FC: search(query, limit, BehaviorRankingStrategy, normalizedQuery)
FC->>Repo: search(...)
Repo->>DB: SELECT ... ORDER BY CASE ... behavior_score(...) DESC
Note right of DB: UDF computes behavior score<br/>per row using Java formula
DB-->>Repo: rows
Repo-->>FC: List<RankedSearchResult> (with insights)
FC-->>SE: results
SE-->>SS: results
SS->>Obs: onSearchExecuted(...)
Obs->>Acc: openSearchActivity(dbPath)
Acc-->>Obs: CloseableSearchActivity
Obs->>Sess: recordSearch(...)
Note right of Obs: Records query + duration<br/>for future personalization
SS-->>API: List<RankedSearchResult>
API-->>UI: JSON (results + insights)
UI->>UI: render results, show insights<br/>under top files
The narrowing happens at openFileSearch: services receive a CloseableFileSearch (a narrow view), not a session or a Database. The compiler enforces that only file-search methods can be called from SearchService at this point in the flow. The same pattern repeats for activity recording via CloseableSearchActivity.
graph TB
subgraph svc["Service layer (consumers)"]
SS[SearchService]
IS[IndexService]
HS[HistoryService]
Obs[SearchActivityObserver]
end
Acc[DatabaseAccessor]
subgraph ifaces["Closeable view interfaces<br/>(app.repository)"]
CFS[CloseableFileSearch]
CSA[CloseableSearchActivity]
CIR[CloseableIndexRuns]
CIS[CloseableIndexSession]
end
DS[DatabaseSession<br/>umbrella interface]
subgraph impl["SQLite implementation (app.db.sqlite)"]
SDS[SqliteDatabaseSession]
FC[FileContext]
IRC[IndexRunContext]
AC[ActivityContext]
SCP[SqliteConnectionProvider]
end
SS -->|via| CFS
SS -->|via| CSA
IS -->|via| CIS
HS -->|via| CIR
Obs -->|via| CSA
CFS -.implemented by.-> SDS
CSA -.implemented by.-> SDS
CIR -.implemented by.-> SDS
CIS -.implemented by.-> SDS
DS -.aggregates.-> CFS
DS -.aggregates.-> CSA
DS -.aggregates.-> CIR
DS -.aggregates.-> CIS
SDS -->|delegates to| FC
SDS -->|delegates to| IRC
SDS -->|delegates to| AC
FC -->|uses| SCP
IRC -->|uses| SCP
AC -->|uses| SCP
Acc -->|opens| DS
SS -.uses.-> Acc
IS -.uses.-> Acc
HS -.uses.-> Acc
Obs -.uses.-> Acc
The personalized ranking formula combines three signals into a single weighted score. Two implementation paths were considered:
Option A — formula inline in the strategy's ORDER BY clause. Fits the existing RankingStrategy contract (each strategy returns a SQL fragment). Simple to wire in, but the formula lives as a string concatenation in Java code and cannot be unit-tested without a live SQLite connection.
Option B — formula in a Java class, exposed to SQL through a SQLite user-defined function. Requires registering the UDF on every JDBC connection (introducing a connection-level coupling), but keeps the formula in a unit-testable Java class while preserving the RankingStrategy contract.
The current implementation uses Option B. The formula lives in BehaviorScoreFormula, with BehaviorScoreFormulaTest covering each component (frequency cap, recency half-life, position lift threshold) as pure-function tests. The thin SQLite adapter (SqliteBehaviorScoreFunction) extends org.sqlite.Function and delegates to the formula. The strategy class (BehaviorRankingStrategy) returns an ORDER BY clause referencing the UDF by its registered name.
When the user picks personalized sort, results show insights describing why they ranked. The insight text is generated from the same raw signals that feed the formula (open count, last-open timestamp, average position) and lives in BehaviorRankingInsights. Backend production of insights is gated by RankingStrategy.producesInsights(), a default method that returns false and is overridden to true only on BehaviorRankingStrategy. Other strategies' result rows carry an empty insight list.
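The contract described above can be sketched as follows. The type names carry a `Sketch` suffix to mark them as simplified stand-ins; the method name `orderByClause`, the column names, and the UDF argument list are assumptions, while `producesInsights()` and the `behavior_score` function name follow the text and diagram.

```java
import java.util.List;

/** Simplified stand-in for the RankingStrategy contract. */
interface RankingStrategySketch {
    /** SQL fragment placed after ORDER BY (method name is an assumption). */
    String orderByClause();

    /** Insights are opt-in: strategies produce none unless they override this. */
    default boolean producesInsights() {
        return false;
    }
}

final class DateStrategySketch implements RankingStrategySketch {
    @Override public String orderByClause() {
        return "modified_at DESC"; // hypothetical column name
    }
}

final class BehaviorStrategySketch implements RankingStrategySketch {
    @Override public String orderByClause() {
        // References the UDF by its registered name; the argument columns are hypothetical.
        return "behavior_score(open_count, last_opened_at, avg_position) DESC";
    }

    @Override public boolean producesInsights() {
        return true;
    }
}

final class InsightGateSketch {
    /** Returns an empty list unless the active strategy opts in to insights. */
    static List<String> insightsFor(RankingStrategySketch strategy, int openCount) {
        if (!strategy.producesInsights()) return List.of();
        return List.of("you've opened this " + openCount + " times for similar searches");
    }
}
```

Keeping the gate as a default method means adding a new strategy requires no insight-related code unless that strategy wants to explain its ranking.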
The persistence layer underwent two refactors during Iteration 2:
1. Narrow consumer dependencies. Services that previously held references to a god Database class were updated to depend on narrow repository interfaces (FileSearchRepository, SearchActivityRepository, IndexRunRepository, FileWriteRepository, FileMetadataRepository). Each interface has a Closeable* companion (e.g., CloseableFileSearch) extending AutoCloseable, so services can use try-with-resources while still depending on a narrow type. The DatabaseAccessor is the only place that constructs persistence handles; services never see the umbrella type.
2. Decompose the implementation. With consumers decoupled, the monolithic Database class was split into three per-domain context classes:
- `FileContext`: file records (search, write, metadata)
- `IndexRunContext`: indexing run lifecycle and history
- `ActivityContext`: search execution and result-open activity
SqliteDatabaseSession composes the three contexts and implements the umbrella DatabaseSession interface (which aggregates the closeable views). SqliteDatabaseProvider returns DatabaseSession instances and now owns schema initialization (extracted from the previous Database constructor). The Database class no longer exists.
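The narrow-view pattern can be sketched in a few lines. The interface names follow this section, but the method signatures are simplified assumptions, only two of the views are shown, and `InMemorySession` is a hypothetical stand-in for `SqliteDatabaseSession` so the example runs without SQLite.

```java
import java.util.List;

// Narrow repository interfaces; the method signatures are simplified assumptions.
interface FileSearchRepository {
    List<String> search(String query, int limit);
}

interface SearchActivityRepository {
    void recordSearch(String query, long durationMs);
}

// Closeable companions narrow close() so try-with-resources needs no catch block.
interface CloseableFileSearch extends FileSearchRepository, AutoCloseable {
    @Override void close();
}

interface CloseableSearchActivity extends SearchActivityRepository, AutoCloseable {
    @Override void close();
}

// Umbrella type that aggregates the views; only the accessor handles it.
interface DatabaseSession extends CloseableFileSearch, CloseableSearchActivity {}

// Hypothetical in-memory stand-in for SqliteDatabaseSession.
final class InMemorySession implements DatabaseSession {
    private final List<String> paths;
    InMemorySession(List<String> paths) { this.paths = paths; }

    @Override public List<String> search(String query, int limit) {
        return paths.stream().filter(p -> p.contains(query)).limit(limit).toList();
    }
    @Override public void recordSearch(String query, long durationMs) { /* no-op here */ }
    @Override public void close() { /* a real session would release its connection */ }
}

final class DatabaseAccessor {
    private final List<String> paths;
    DatabaseAccessor(List<String> paths) { this.paths = paths; }

    // Each method returns the session typed as one narrow view.
    CloseableFileSearch openFileSearch(String dbPath) { return new InMemorySession(paths); }
    CloseableSearchActivity openSearchActivity(String dbPath) { return new InMemorySession(paths); }
}

final class SearchServiceSketch {
    private final DatabaseAccessor accessor;
    SearchServiceSketch(DatabaseAccessor accessor) { this.accessor = accessor; }

    List<String> search(String dbPath, String query, int limit) {
        try (CloseableFileSearch files = accessor.openFileSearch(dbPath)) {
            return files.search(query, limit); // files.recordSearch(...) would not compile
        }
    }
}
```

The service compiles only against `CloseableFileSearch`, so the session implementation can be split or replaced without touching service code.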
App.tsx is the top-level component owning section state and layout. Indexing concerns (config form, status badge, history table) live in IndexPanel; search concerns (search bar, sort selector, results, insights) live in SearchPanel. Leaf components (SuggestionBox, ResultCard, StatusBadge) are presentational. All HTTP calls are centralized in api/client.ts. Shared types live in types.ts. Pure utilities (formatFileSize, getFolderPath, highlightText) are extracted to the utils/* modules.
- The indexer targets text-like files and skips non-text/binary files.
- CLI sorting is expressed inside the query (`sort:...`), not via a separate `--sort` option.
- Symlink-related behavior may vary by OS permissions (the test suite already marks this as optional on unsupported environments).
See ARCHITECTURE.md for the C4 model of the system design (delivered as part of an earlier iteration). The "Design notes" section above documents iteration-2 additions and refactors not covered by the original architecture document.