LoomIndex – High-Performance Concurrent Web Crawler

LoomIndex is a lightweight, high-performance, and concurrent web crawler developed in modern C++ (C++20). Engineered for speed and scalability, it serves as a robust foundation for high-throughput web scraping and data indexing projects.

✨ Key Features

  • Asynchronous I/O: Leverages libcurl (curl_multi) for scalable, non-blocking HTTP requests, capable of handling dozens of concurrent connections efficiently.
  • Custom Thread Pool: Native C++20 thread-pool implementation that safely dispatches parser and processor workloads.
  • Memory-efficient Bloom Filter: Integrates a built-in Bloom filter for rapid URL deduplication, cutting memory use dramatically compared to a traditional hash set (at the cost of a small, tunable false-positive rate).
  • Docker Support: Fully containerized environment for instant, reproducible builds and zero-config execution.
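
The idea behind the Bloom-filter deduplication can be sketched in a few lines (an illustrative example only — the class name, bit-array size, and hash count here are assumptions, not LoomIndex's actual `BloomFilter` API): a fixed bit array plus k hash functions answers "definitely new" or "probably seen" without storing any URL strings.

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical minimal Bloom filter: 2^20 bits (~128 KiB) and 4 hash
// probes derived from std::hash via double hashing.
class BloomFilter {
public:
    void add(const std::string& url) {
        for (std::size_t i = 0; i < kHashes; ++i)
            bits_.set(index(url, i));
    }
    // May return true for a URL never added (false positive),
    // but never returns false for a URL that was added.
    bool possiblyContains(const std::string& url) const {
        for (std::size_t i = 0; i < kHashes; ++i)
            if (!bits_.test(index(url, i))) return false;
        return true;
    }
private:
    static constexpr std::size_t kBits = 1 << 20;
    static constexpr std::size_t kHashes = 4;
    std::bitset<kBits> bits_;
    static std::size_t index(const std::string& s, std::size_t i) {
        std::size_t h1 = std::hash<std::string>{}(s);
        std::size_t h2 = h1 * 0x9e3779b97f4a7c15ULL + 1;  // derived second hash
        return (h1 + i * h2) % kBits;  // double hashing: h1 + i*h2 mod m
    }
};
```

A crawler only needs the "definitely new" guarantee: a false positive merely skips one page, while a hash set of millions of seen URLs would hold every string in memory.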

🏗 Architecture

The system is built on a reliable multi-threaded Producer-Consumer model, ensuring a clear separation of concerns between network I/O and data processing:

```mermaid
graph TD;
    subgraph Core Engine
        CE[CrawlerEngine] -->|Spawns| TP[ThreadPool / Workers]
        CE -->|Pumps I/O| AF[AsyncFetcher]
    end

    subgraph Memory & Queue
        TP -->|Pops URLs| UF[URLFrontier]
        UF -->|Filters duplicates| BF[BloomFilter]
        AF -->|Callback on parse| TP
        TP -->|Pushes new links| UF
    end

    style CE fill:#f9f,stroke:#333,stroke-width:2px;
    style BF fill:#bbf,stroke:#333,stroke-width:2px;
```
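
The hand-off between fetcher and workers in the diagram can be sketched as a minimal thread-safe URL frontier (a hedged illustration: the class name matches the diagram, but the real `URLFrontier` interface may differ). Producers push discovered links, workers block on `pop`, and `close()` lets workers drain the queue and exit cleanly.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <string>

// Illustrative bounded-by-convention producer-consumer queue.
class URLFrontier {
public:
    void push(std::string url) {
        { std::lock_guard lock(m_); q_.push(std::move(url)); }
        cv_.notify_one();
    }
    // Blocks until a URL is available or the frontier is closed;
    // returns std::nullopt once closed and fully drained.
    std::optional<std::string> pop() {
        std::unique_lock lock(m_);
        cv_.wait(lock, [&] { return closed_ || !q_.empty(); });
        if (q_.empty()) return std::nullopt;
        std::string url = std::move(q_.front());
        q_.pop();
        return url;
    }
    void close() {
        { std::lock_guard lock(m_); closed_ = true; }
        cv_.notify_all();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
    bool closed_ = false;
};
```

The predicate passed to `cv_.wait` guards against spurious wakeups, and `close()` plus `notify_all()` is what allows a graceful shutdown: every blocked worker wakes, sees the closed flag, and returns `std::nullopt` instead of spinning.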

🚀 How to Run

Using Docker (Highly Recommended)

The easiest way to build and run the crawler demo without configuring your local C++ environment is via Docker. Provide URLs as arguments, or let it default to https://example.com.

```sh
# Build the Docker image
docker build -t loomindex .

# Run the containerized demo application (fallback seed)
docker run --rm loomindex

# Run with custom URLs
docker run --rm loomindex https://github.com https://wikipedia.org
```

Building via CMake (Linux/macOS/WSL)

If you have a C++20 compiler and the libcurl development headers (libcurl4-openssl-dev on Debian/Ubuntu) installed, you can build natively:

```sh
git clone https://github.com/yourusername/LoomIndex.git
cd LoomIndex

# Generate Makefiles and build
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Run unit tests
cd build
ctest --output-on-failure
cd ..

# Run the crawler
./build/LoomIndex https://example.com
```

📂 Project Structure

  • include/LoomIndex/: Public header files defining the core components (CrawlerEngine, ThreadPool, BloomFilter, etc.).
  • src/: Implementation files for the C++ components, including the main.cpp entrypoint.
  • tests/: GoogleTest framework unit tests validating concurrent behavior and data structures.
  • CMakeLists.txt: Top-level build configuration.
  • run_project.sh: Helper script to compile, unit-test, and run the binary sequentially.
  • Dockerfile: Container definition for encapsulated builds and execution.

