ralu2004/Local-File-Search-System

Local File Search System

A local file search engine that indexes files on your machine and enables fast full-text and metadata search, with a CLI, an HTTP API, and a React-based web UI.


Build

cd application
mvn package

The built jar will be at application/target/application-1.0-SNAPSHOT.jar.


Usage

Index a directory

java -jar application/target/application-1.0-SNAPSHOT.jar index <directory>

Options:

| Option | Default | Description |
| --- | --- | --- |
| `--db <path>` | `.searchengine/index.db` | Custom database path |
| `-i, --ignore <pattern>` | | Glob pattern to ignore (repeatable) |
| `--max-file-size <MB>` | `10` | Skip files larger than this |
| `--preview-lines <n>` | `3` | Number of preview lines to store |
| `--batch-size <n>` | `250` | Number of files per DB batch write |

Examples:

# Basic index
java -jar ... index C:\Users\user\Documents

# With ignore rules
java -jar ... index C:\Users\user\Documents -i "*.log" -i "backup"

# Tune DB batch writes
java -jar ... index C:\Users\user\Documents --batch-size 500

# Custom database path
java -jar ... --db C:\myindex\index.db index C:\Users\user\Documents

Search

java -jar application/target/application-1.0-SNAPSHOT.jar search "<query>"

Options:

| Option | Default | Description |
| --- | --- | --- |
| `--db <path>` | `.searchengine/index.db` | Custom database path |
| `--limit <n>` | `50` | Maximum number of results |

Query syntax:

| Query | Meaning |
| --- | --- |
| `getting started` | Full-text search |
| `README.md` | Search by filename |
| `content:hello` | Restrict full-text match to file contents |
| `path:src/main` | Filter by path substring (cross-platform) |
| `ext:java` | Filter by extension |
| `modified:2025-01-01` | Files modified after date |
| `size:1048576` | Files larger than size in bytes |
| `size:10kb`, `size:5mb`, `size:1gb` | Size filter with units (case-insensitive) |
| `sort:date`, `sort:alpha`, `sort:balanced`, `sort:behavior` | Choose ranking strategy |
| `config ext:json` | Combined full-text and metadata |

Qualifiers can appear in any order and combine with AND semantics. Duplicate qualifiers (e.g., two content: filters) compose with AND. For CLI usage, sorting is query-based (sort:<mode>), not a separate --sort flag.
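As a rough illustration of the syntax above, the sketch below splits a query into free-text terms and `key:value` qualifiers and converts size values to bytes. It is not the project's parser: the class name `QuerySketch` and its methods are hypothetical, and treating `kb`/`mb`/`gb` as binary multiples (1024-based) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of qualifier parsing; not the project's actual parser.
final class QuerySketch {
    // Qualifier keys taken from the query-syntax table.
    static final List<String> KEYS =
            List.of("content", "path", "ext", "modified", "size", "sort");

    final List<String> terms = new ArrayList<>();
    // {key, value} pairs; duplicates are kept so they can be combined with AND.
    final List<String[]> qualifiers = new ArrayList<>();

    static QuerySketch parse(String input) {
        QuerySketch q = new QuerySketch();
        for (String token : input.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            int colon = token.indexOf(':');
            String key = colon > 0
                    ? token.substring(0, colon).toLowerCase(Locale.ROOT)
                    : "";
            boolean hasValue = colon > 0 && colon < token.length() - 1;
            if (hasValue && KEYS.contains(key)) {
                q.qualifiers.add(new String[] {key, token.substring(colon + 1)});
            } else {
                q.terms.add(token); // plain word or unknown prefix: full-text term
            }
        }
        return q;
    }

    // Converts "1048576", "10kb", "5MB" or "1gb" to bytes.
    // Assumption: units are binary multiples (1 kb = 1024 bytes).
    static long parseSizeBytes(String value) {
        String v = value.trim().toLowerCase(Locale.ROOT);
        long multiplier = 1L;
        if (v.endsWith("kb")) multiplier = 1024L;
        else if (v.endsWith("mb")) multiplier = 1024L * 1024;
        else if (v.endsWith("gb")) multiplier = 1024L * 1024 * 1024;
        String digits = multiplier == 1L ? v : v.substring(0, v.length() - 2);
        return Long.parseLong(digits) * multiplier;
    }
}
```

Under these assumptions, `QuerySketch.parse("auth path:src/main sort:date")` yields one free-text term and two qualifiers, regardless of the order in which the qualifiers appear.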

Examples:

java -jar ... search "getting started"
java -jar ... search "ext:java"
java -jar ... search "README.md"
java -jar ... search "size:10mb"
java -jar ... search "config ext:json" --limit 10
java -jar ... search "auth path:src/main sort:date"

Web UI

The project includes a React frontend that talks to an HTTP API server.

Start the API server:

java -jar application/target/application-1.0-SNAPSHOT.jar server

The server listens on http://localhost:7070 by default. To use a different port, pass it as the next argument: ... server 8080.

Run the frontend:

cd frontend
npm install
npm run dev

The dev server prints the URL it's listening on. The frontend sends requests to http://localhost:7070/api/*.
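The API can also be called without the frontend. The sketch below builds a request for the `GET /api/search?q=...` endpoint that appears in the request sequence under "Design notes" and returns the raw response body. The class `ApiClientSketch` is hypothetical, and the shape of the JSON response is not assumed here.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical client sketch; only the /api/search path and the q parameter
// come from this README.
final class ApiClientSketch {
    // Builds the search URI, form-encoding the query (spaces become '+').
    static URI searchUri(String baseUrl, String query) {
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "/api/search?q=" + encoded);
    }

    // Sends the request and returns the response body as text.
    // Requires a running API server at baseUrl.
    static String search(String baseUrl, String query) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(searchUri(baseUrl, query))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

For example, `ApiClientSketch.search("http://localhost:7070", "auth sort:behavior")` would issue a personalized search against a server started with the default port.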

The UI exposes both indexing and search workflows: configure a root directory and ignore rules, run indexing with live progress, then search with sort-mode selection (default, balanced, date, alphabetical, personalized). Results show file metadata, content previews with query-term highlighting, and a "Mark as opened" action that feeds personalized ranking. When the personalized sort is active, results display ranking insights describing why each result scored where it did.


Default ignore rules

The crawler always ignores common system/build directories (for example node_modules, target, build, dist, .git, .idea, AppData, Program Files, Windows) and also ignores hidden files/directories and non-text files. You can add more rules with -i/--ignore.
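To show how glob-style ignore patterns such as `-i "*.log" -i "backup"` can behave, the sketch below uses the JDK's `PathMatcher`. Matching each path segment separately is an assumption made for illustration; the crawler's actual matching rules may differ, and `IgnoreRulesSketch` is a hypothetical name.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: glob patterns applied to every segment of a path.
final class IgnoreRulesSketch {
    private final List<PathMatcher> matchers;

    IgnoreRulesSketch(List<String> globs) {
        this.matchers = globs.stream()
                .map(g -> FileSystems.getDefault().getPathMatcher("glob:" + g))
                .collect(Collectors.toList());
    }

    // True if any directory or file name in the path matches any pattern.
    boolean isIgnored(Path path) {
        for (Path segment : path) {          // Path iterates over its name elements
            for (PathMatcher matcher : matchers) {
                if (matcher.matches(segment)) return true;
            }
        }
        return false;
    }
}
```

With the patterns `*.log` and `backup`, this sketch would ignore `logs/app.log` and anything under a `backup` directory, while leaving `src/Main.java` alone.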


Incremental indexing verification

To verify that only changed files are re-indexed, follow these steps:

  1. Run an initial index on a test directory.
  2. Modify one indexed text file, add one new file, and delete one existing indexed file.
  3. Run indexing again on the same directory.
  4. Check the report:
    • Skipped should include unchanged files.
    • Indexed should reflect only changed/new files.
    • Deleted should include files removed from disk.
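The three report categories can be pictured as a comparison between what the index stored and what is currently on disk. The sketch below is an assumption about how such a comparison could work, using the last-modified timestamp as the only change signal; the real indexer may compare other attributes, and `IncrementalPlanSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: classify files by comparing stored and current
// last-modified timestamps (path -> epoch millis).
final class IncrementalPlanSketch {
    final List<String> skipped = new ArrayList<>();
    final List<String> indexed = new ArrayList<>();
    final List<String> deleted = new ArrayList<>();

    static IncrementalPlanSketch plan(Map<String, Long> stored, Map<String, Long> onDisk) {
        IncrementalPlanSketch p = new IncrementalPlanSketch();
        for (Map.Entry<String, Long> entry : onDisk.entrySet()) {
            Long previous = stored.get(entry.getKey());
            if (previous != null && previous.equals(entry.getValue())) {
                p.skipped.add(entry.getKey());   // unchanged since last run
            } else {
                p.indexed.add(entry.getKey());   // new or modified
            }
        }
        for (String path : stored.keySet()) {
            if (!onDisk.containsKey(path)) {
                p.deleted.add(path);             // in the index but gone from disk
            }
        }
        return p;
    }
}
```

Applied to the verification steps above, the modified file and the new file would land in `indexed`, the removed file in `deleted`, and everything else in `skipped`.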

Testing

Run all automated tests:

cd application
mvn test

Current suite covers:

  • Query parsing: full-text, filename, metadata, mixed input, and size unit parsing (bytes, kb, mb, gb)
  • Search behavior: recursive traversal, single-word and multi-word full-text search
  • Metadata filters: ext, modified, size (including unit forms), path, content
  • Runtime indexing options: ignoreRules, maxFileSizeMb, previewLines, batchSize
  • Indexing lifecycle: background progress snapshot and final report
  • Resilience: database failure propagation, unreadable files, and symlink-loop environments (platform-dependent skip)
  • Incremental indexing: unchanged-file skip, modified-file update, and deleted-file cleanup
  • Ranking strategies: resolver mapping, swappable strategy selection, behavior score formula (frequency, recency, position lift), ranking insight formatting (relative time, lift threshold)
  • Search activity: history recording, suggestion prefix matching, recent-query ordering

Typical output should report all tests passing, with one optional skipped test on platforms that cannot create symlinks.


Personalized ranking

The default ranking favors content relevance and path features. The personalized ranking strategy (sort:behavior, or "Personalized" in the UI) reorders results based on the user's interaction history with similar queries. It uses three signals:

  • Frequency — how often the file has been opened for the same normalized query
  • Recency — how recently it was opened (exponential decay with a 7-day half-life)
  • Position lift/boost — whether the user typically had to "dig" past higher-ranked results to reach this file (an opened file consistently found at position 8 ranks higher than one always found at position 1, holding other factors equal)

Results fall into two buckets: files with any open history sort first, ordered by behavior score; files without open history follow, ordered by full-text relevance. When the personalized sort is active, the UI shows insights under each result explaining the ranking ("you've opened this 5 times for similar searches", "last opened 2 hours ago", "you often find this past higher-ranked results").
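The three signals can be sketched as pure functions combined into a weighted score. Only the 7-day half-life comes from the description above; the frequency cap, the lift threshold, the weights, and the class name `BehaviorScoreSketch` are placeholders chosen for illustration and do not reflect the values in `BehaviorScoreFormula`.

```java
// Hypothetical sketch of a behavior score; constants other than the
// half-life are illustrative placeholders.
final class BehaviorScoreSketch {
    static final double HALF_LIFE_DAYS = 7.0;   // from the README
    static final int FREQUENCY_CAP = 10;        // assumed cap on counted opens
    static final double LIFT_THRESHOLD = 3.0;   // assumed average-position threshold
    static final double W_FREQUENCY = 0.5;      // assumed weights
    static final double W_RECENCY = 0.3;
    static final double W_LIFT = 0.2;

    // Opens for the same normalized query, capped and scaled to [0, 1].
    static double frequency(int openCount) {
        int capped = Math.min(Math.max(openCount, 0), FREQUENCY_CAP);
        return capped / (double) FREQUENCY_CAP;
    }

    // Exponential decay: the signal halves every HALF_LIFE_DAYS.
    static double recency(double daysSinceLastOpen) {
        return Math.pow(0.5, Math.max(daysSinceLastOpen, 0.0) / HALF_LIFE_DAYS);
    }

    // 1 when the file was typically opened from below the threshold position.
    static double positionLift(double averagePosition) {
        return averagePosition > LIFT_THRESHOLD ? 1.0 : 0.0;
    }

    static double score(int openCount, double daysSinceLastOpen, double averagePosition) {
        return W_FREQUENCY * frequency(openCount)
                + W_RECENCY * recency(daysSinceLastOpen)
                + W_LIFT * positionLift(averagePosition);
    }
}
```

In this sketch a file last opened 7 days ago keeps half of its recency signal, and a file usually opened from position 8 outscores an otherwise identical file usually opened from position 1.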

The UI also surfaces query suggestions based on prefix matches against search history, and recent unique queries — both fed by the same activity tracking that drives personalized ranking.
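A minimal sketch of prefix-based suggestions over search history follows. It assumes the history is available as a list ordered from most recent to oldest and that queries are normalized by trimming and lower-casing; both are assumptions, and `SuggestionSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: unique history entries that start with the typed prefix.
final class SuggestionSketch {
    // history is assumed to be ordered most recent first.
    static List<String> suggest(List<String> history, String prefix, int limit) {
        String normalizedPrefix = prefix.trim().toLowerCase(Locale.ROOT);
        Set<String> unique = new LinkedHashSet<>(); // keeps recency order, drops repeats
        for (String query : history) {
            if (unique.size() >= limit) break;
            String normalized = query.trim().toLowerCase(Locale.ROOT);
            if (!normalized.isEmpty() && normalized.startsWith(normalizedPrefix)) {
                unique.add(normalized);
            }
        }
        return new ArrayList<>(unique);
    }
}
```

Calling `suggest` with an empty prefix returns the most recent unique queries, which mirrors the "recent unique queries" list described above.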


Design notes

The following sections document the architectural choices of the ranking system, including the trade-offs that were considered and deliberately accepted.

Request flow

The sequence diagram below traces a personalized-search request from the UI to SQLite and back, showing where each layer narrows its dependencies and where the behavior-score UDF runs. The component diagram shows the current (post-decomposition) shape of the persistence layer.

Personalized search - request sequence

sequenceDiagram
    autonumber
    participant UI as Frontend (SearchPanel)
    participant API as ApiServer
    participant SS as SearchService
    participant SE as SearchEngine
    participant Acc as DatabaseAccessor
    participant Sess as DatabaseSession<br/>(SqliteDatabaseSession)
    participant FC as FileContext
    participant Repo as SqliteFileRepository
    participant DB as SQLite + FTS5<br/>+ behavior_score UDF
    participant Obs as SearchActivityObserver

    UI->>API: GET /api/search?q=auth+sort:behavior
    API->>SS: search(dbPath, query, limit)

    SS->>Acc: openFileSearch(dbPath)
    Acc->>Sess: open(dbPath)
    Sess-->>Acc: DatabaseSession
    Note right of Acc: returns CloseableFileSearch<br/>(narrow view)
    Acc-->>SS: CloseableFileSearch

    SS->>SE: new SearchEngine(searchRepo, parser, limit)
    SS->>SE: search(input)
    SE->>SE: parse query, resolve strategy
    SE->>FC: search(query, limit, BehaviorRankingStrategy, normalizedQuery)

    FC->>Repo: search(...)
    Repo->>DB: SELECT ... ORDER BY CASE ... behavior_score(...) DESC
    Note right of DB: UDF computes behavior score<br/>per row using Java formula
    DB-->>Repo: rows
    Repo-->>FC: List<RankedSearchResult> (with insights)
    FC-->>SE: results
    SE-->>SS: results

    SS->>Obs: onSearchExecuted(...)
    Obs->>Acc: openSearchActivity(dbPath)
    Acc-->>Obs: CloseableSearchActivity
    Obs->>Sess: recordSearch(...)
    Note right of Obs: Records query + duration<br/>for future personalization

    SS-->>API: List<RankedSearchResult>
    API-->>UI: JSON (results + insights)
    UI->>UI: render results, show insights<br/>under top files

The narrowing happens at openFileSearch: services receive a CloseableFileSearch (a narrow view), not a session or a Database. The compiler enforces that only file-search methods can be called from SearchService at this point in the flow. The same pattern repeats for activity recording via CloseableSearchActivity.

Persistence layer - component structure

graph TB
    subgraph svc["Service layer (consumers)"]
        SS[SearchService]
        IS[IndexService]
        HS[HistoryService]
        Obs[SearchActivityObserver]
    end

    Acc[DatabaseAccessor]

    subgraph ifaces["Closeable view interfaces<br/>(app.repository)"]
        CFS[CloseableFileSearch]
        CSA[CloseableSearchActivity]
        CIR[CloseableIndexRuns]
        CIS[CloseableIndexSession]
    end

    DS[DatabaseSession<br/>umbrella interface]

    subgraph impl["SQLite implementation (app.db.sqlite)"]
        SDS[SqliteDatabaseSession]
        FC[FileContext]
        IRC[IndexRunContext]
        AC[ActivityContext]
        SCP[SqliteConnectionProvider]
    end

    SS -->|via| CFS
    SS -->|via| CSA
    IS -->|via| CIS
    HS -->|via| CIR
    Obs -->|via| CSA

    CFS -.implemented by.-> SDS
    CSA -.implemented by.-> SDS
    CIR -.implemented by.-> SDS
    CIS -.implemented by.-> SDS

    DS -.aggregates.-> CFS
    DS -.aggregates.-> CSA
    DS -.aggregates.-> CIR
    DS -.aggregates.-> CIS

    SDS -->|delegates to| FC
    SDS -->|delegates to| IRC
    SDS -->|delegates to| AC

    FC -->|uses| SCP
    IRC -->|uses| SCP
    AC -->|uses| SCP

    Acc -->|opens| DS
    SS -.uses.-> Acc
    IS -.uses.-> Acc
    HS -.uses.-> Acc
    Obs -.uses.-> Acc

Behavior score: SQLite UDF instead of inline SQL

The personalized ranking formula combines three signals into a single weighted score. Two implementation paths were considered:

Option A — formula inline in the strategy's ORDER BY clause. Fits the existing RankingStrategy contract (each strategy returns a SQL fragment). Simple to wire in, but the formula lives as a string concatenation in Java code and cannot be unit-tested without a live SQLite connection.

Option B — formula in a Java class, exposed to SQL through a SQLite user-defined function. Requires registering the UDF on every JDBC connection (introducing a connection-level coupling), but keeps the formula in a unit-testable Java class while preserving the RankingStrategy contract.

The current implementation uses Option B. The formula lives in BehaviorScoreFormula, with BehaviorScoreFormulaTest covering each component (frequency cap, recency half-life, position lift threshold) as pure-function tests. The thin SQLite adapter (SqliteBehaviorScoreFunction) extends org.sqlite.Function and delegates to the formula. The strategy class (BehaviorRankingStrategy) returns an ORDER BY clause referencing the UDF by its registered name.

Explainability: per-strategy opt-in, not per-result tagging

When the user picks personalized sort, results show insights describing why they ranked. The insight text is generated from the same raw signals that feed the formula (open count, last-open timestamp, average position) and lives in BehaviorRankingInsights. Backend production of insights is gated by RankingStrategy.producesInsights(), a default method that returns false and is overridden to true only on BehaviorRankingStrategy. Other strategies' result rows carry an empty insight list.
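The opt-in can be sketched with a default interface method. The type names mirror those in the text, but everything else is an assumption: the method `orderByClause()`, the `DateRankingStrategy` class, the SQL column names, the UDF argument list, and the `InsightGate` helper are hypothetical stand-ins.

```java
import java.util.List;

// Sketch of the strategy contract; orderByClause() is a hypothetical method name.
interface RankingStrategy {
    String orderByClause();

    // Strategies do not produce insights unless they opt in.
    default boolean producesInsights() {
        return false;
    }
}

// Hypothetical non-personalized strategy: inherits the default (no insights).
final class DateRankingStrategy implements RankingStrategy {
    public String orderByClause() {
        return "modified_at DESC"; // assumed column name
    }
}

// Personalized strategy: references the UDF by name and opts in to insights.
final class BehaviorRankingStrategy implements RankingStrategy {
    public String orderByClause() {
        // Assumed argument columns; only the UDF name appears in the README.
        return "behavior_score(open_count, last_opened_at, avg_position) DESC";
    }

    @Override
    public boolean producesInsights() {
        return true;
    }
}

// Hypothetical gate: insight text is built only when the strategy opts in.
final class InsightGate {
    static List<String> insightsFor(RankingStrategy strategy, int openCount) {
        if (!strategy.producesInsights() || openCount <= 0) {
            return List.of();
        }
        return List.of("you've opened this " + openCount + " times for similar searches");
    }
}
```

One consequence of gating on the strategy rather than on each result is that adding a new strategy needs no insight-related code unless it overrides `producesInsights()`.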

Persistence layer: dependency inversion before decomposition

The persistence layer underwent two refactors during Iteration 2:

1. Narrow consumer dependencies. Services that previously held references to a god Database class were updated to depend on narrow repository interfaces (FileSearchRepository, SearchActivityRepository, IndexRunRepository, FileWriteRepository, FileMetadataRepository). Each interface has a Closeable* companion (e.g., CloseableFileSearch) extending AutoCloseable, so services can use try-with-resources while still depending on a narrow type. The DatabaseAccessor is the only place that constructs persistence handles; services never see the umbrella type.

2. Decompose the implementation. With consumers decoupled, the monolithic Database class was split into three per-domain context classes:

  • FileContext — file records (search, write, metadata)
  • IndexRunContext — indexing run lifecycle and history
  • ActivityContext — search execution and result-open activity

SqliteDatabaseSession composes the three contexts and implements the umbrella DatabaseSession interface (which aggregates the closeable views). SqliteDatabaseProvider returns DatabaseSession instances and now owns schema initialization (extracted from the previous Database constructor). The Database class no longer exists.
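The two refactors can be condensed into one sketch: narrow interfaces with closeable companions on the consumer side, and a session that delegates to per-domain contexts on the implementation side. Type names follow the text, but method signatures and bodies are assumptions, `IndexRunContext` and the remaining interfaces are omitted for brevity, and the in-memory contexts stand in for real SQL.

```java
import java.util.List;

// Narrow repository interfaces: one capability each (signatures assumed).
interface FileSearchRepository {
    List<String> search(String query, int limit);
}

interface SearchActivityRepository {
    void recordSearch(String query, long durationMillis);
}

// Closeable companions let consumers use try-with-resources on a narrow type.
interface CloseableFileSearch extends FileSearchRepository, AutoCloseable {
    @Override
    void close(); // narrowed so callers need no catch block
}

interface CloseableSearchActivity extends SearchActivityRepository, AutoCloseable {
    @Override
    void close();
}

// Umbrella interface aggregating the closeable views.
interface DatabaseSession extends CloseableFileSearch, CloseableSearchActivity {}

// Per-domain contexts (in-memory stand-ins for the SQLite-backed classes).
final class FileContext {
    List<String> search(String query, int limit) {
        return List.of("match for " + query);
    }
}

final class ActivityContext {
    int recorded = 0;

    void recordSearch(String query, long durationMillis) {
        recorded++;
    }
}

// The session composes the contexts and delegates each call.
final class SqliteDatabaseSession implements DatabaseSession {
    private final FileContext files = new FileContext();
    private final ActivityContext activity = new ActivityContext();
    boolean closed = false;

    public List<String> search(String query, int limit) {
        return files.search(query, limit);
    }

    public void recordSearch(String query, long durationMillis) {
        activity.recordSearch(query, durationMillis);
    }

    public void close() {
        closed = true;
    }
}

// The accessor is the only place that constructs sessions; it hands out narrow views.
final class DatabaseAccessor {
    CloseableFileSearch openFileSearch(String dbPath) {
        return new SqliteDatabaseSession();
    }

    CloseableSearchActivity openSearchActivity(String dbPath) {
        return new SqliteDatabaseSession();
    }
}

final class SearchService {
    private final DatabaseAccessor accessor = new DatabaseAccessor();

    List<String> search(String dbPath, String query, int limit) {
        try (CloseableFileSearch repo = accessor.openFileSearch(dbPath)) {
            // repo.recordSearch(...) would not compile: the view is search-only.
            return repo.search(query, limit);
        }
    }
}
```

Because `SearchService` only ever sees `CloseableFileSearch`, replacing or further splitting `SqliteDatabaseSession` would not require changes to the service in this sketch.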

Frontend structure

App.tsx is the top-level component owning section state and layout. Indexing concerns (config form, status badge, history table) live in IndexPanel; search concerns (search bar, sort selector, results, insights) live in SearchPanel. Leaf components (SuggestionBox, ResultCard, StatusBadge) are presentational. All HTTP calls are centralized in api/client.ts. Shared types live in types.ts. Pure utilities (formatFileSize, getFolderPath, highlightText) are extracted to the utils/* modules.

Known limitations

  • The indexer targets text-like files and skips non-text/binary files.
  • CLI sorting is expressed inside the query (sort:...), not via a separate --sort option.
  • Symlink-related behavior may vary by OS permissions (the test suite already marks this as optional on unsupported environments).

Architecture

See ARCHITECTURE.md for the C4 model of the system design (delivered as part of an earlier iteration). The "Design notes" section above documents iteration-2 additions and refactors not covered by the original architecture document.

About

A local search engine that indexes documents, media, and binaries across your device. By leveraging filenames, content inspection, and metadata, it provides a "search-as-you-type" experience for retrieving local data.
