A high-performance distributed system designed to ingest, parse, and analyze massive RSS feed datasets using Apache Spark. It implements a MapReduce architecture to perform Named Entity Recognition (NER) across a cluster of worker nodes, ensuring horizontal scalability.
The system utilizes Spark's RDD (Resilient Distributed Datasets) abstraction to create a fault-tolerant processing pipeline.
```mermaid
graph LR
    A[Input: Feed URLs] -->|parallelize| B(Dist. Feed Fetcher)
    B -->|map| C{RSS Parser}
    C -->|flatMap| D[Articles RDD]
    D -->|distribute| E(NER Workers)
    E -->|mapToPair| F[Entity Tuples]
    F -->|reduceByKey| G[Aggregated Counts]
    G -->|collect| H[Final Output]
```
- Distributed Ingestion: Parallelized HTTP requests using `JavaRDD<String> urlsRDD` to distribute network I/O load across workers.
- Hierarchical Processing: Uses `flatMap` to transform the feed-level RDD into an article-level RDD, allowing granular load balancing (one worker per article).
- MapReduce Aggregation: Implements `reduceByKey()` for efficient entity counting, using local combiners to minimize network shuffle.
- Functional Paradigm: Leverages pure functions and immutability to ensure thread safety without manual lock management.
Comparison between the sequential implementation and this distributed Spark implementation.
| Metric | Sequential Engine | Distributed Spark Engine | Analysis |
|---|---|---|---|
| Execution Time | ~9.7s | ~16.9s | Spark has initialization overhead (driver/executors startup). |
| Scalability | O(N) - Linear/Limited | Horizontal Scaling | Spark excels as dataset size grows beyond single-node capacity. |
| Concurrency | Single Thread | Multi-Node Cluster | Automatic task distribution across available cores. |
| Fault Tolerance | None (Crash fails) | Resilient (RDD lineage) | Automatic partition reconstruction on failure. |
Engineering Insight: While the sequential version is faster for small datasets due to JVM warm-up and Spark context overhead, the Spark architecture is designed to handle TB-scale datasets where the sequential approach would hit memory (OOM) and CPU bottlenecks.
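The same tradeoff can be seen in miniature on one machine (a local sketch, not a benchmark of the actual engines): parallelism only pays off when per-element work dominates coordination overhead, but correctness is unaffected either way.

```java
import java.util.stream.IntStream;

public class OverheadSketch {
    // Hypothetical stand-in for an expensive per-article NER step
    static int work(int x) {
        int acc = x;
        for (int i = 0; i < 1000; i++) acc = (acc * 31 + i) % 1_000_003;
        return acc;
    }

    public static void main(String[] args) {
        int n = 10_000;
        long seq = IntStream.range(0, n).map(OverheadSketch::work).asLongStream().sum();
        long par = IntStream.range(0, n).parallel().map(OverheadSketch::work).asLongStream().sum();
        // Sequential and parallel runs produce identical results (sum is associative
        // and commutative); only wall-clock time differs.
        System.out.println(seq == par); // true
    }
}
```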
```java
// 1. Ingestion: Partitioning URLs across the cluster
JavaRDD<String> urlsRDD = jsc.parallelize(urls, Math.min(urls.size(), 10));

// 2. Processing: Lazy evaluation pipeline
JavaRDD<NamedEntity> entitiesRDD = urlsRDD
    .map(this::downloadAndParseFeed)                              // Distributes I/O
    .filter(Objects::nonNull)                                     // Filters invalid feeds
    .flatMap(feed -> feed.getArticles())                          // Flattens hierarchy
    .flatMap(article -> article.computeNamedEntities(heuristic)); // Distributes CPU-heavy task

// 3. Aggregation: MapReduce execution
List<Tuple2<String, Integer>> results = entitiesRDD
    .mapToPair(entity -> new Tuple2<>(entity.getName(), 1))
    .reduceByKey(Integer::sum) // Shuffling & reducing
    .collect();                // Action triggering the DAG
```

This project applies Functional Programming concepts to solve distributed system challenges:
- Immutability: RDDs are immutable, eliminating race conditions common in shared-state concurrency models (like Actors).
- Pure Functions: Transformation logic is side-effect free, making the system deterministic and easier to debug.
- Lazy Evaluation: Transformations are optimized into a DAG (Directed Acyclic Graph) and only executed when an Action (`collect`) is called.
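The same laziness can be observed locally with Java Streams (a single-JVM analogy to Spark's transformation/action split; the numbers are arbitrary):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class LazyDemo {
    public static void main(String[] args) {
        List<Integer> touched = new ArrayList<>();

        // Intermediate ops (peek/map/filter) build a pipeline but run nothing yet,
        // just as Spark transformations only extend the DAG.
        Stream<Integer> pipeline = List.of(1, 2, 3, 4).stream()
            .peek(touched::add)     // records which elements actually get processed
            .map(x -> x * 10)
            .filter(x -> x >= 20);

        System.out.println(touched.size()); // 0 -- nothing has executed yet

        // The terminal op (the "action") triggers execution of the whole pipeline
        long count = pipeline.count();
        System.out.println(count);          // 3
        System.out.println(touched.size()); // 4 -- every element was visited once
    }
}
```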
- Java 17 (requires `--add-opens` for Spark compatibility)
- Apache Spark 3.5.0
- Maven
```bash
make compile
```

Run the distributed version locally (simulating a cluster):

```bash
make run
```

Run with specific heuristics (NER algorithms):

```bash
make run HEURISTIC=-qh  # Quick Heuristic
make run HEURISTIC=-rh  # Random Heuristic
```

```
├── src/
│   ├── SparkFeedFetcher.java   # Main Distributed Application (Driver)
│   ├── namedEntity/            # NER Logic (Serializable domain objects)
│   └── parser/                 # RSS Parsing Logic
├── config/
│   └── subscriptions.json      # Feed Sources
├── pom.xml                     # Maven Dependencies (Spark Core/SQL)
└── Makefile                    # Build automation
```

Originally developed as a Capstone Project for CS Paradigms at FaMAF - UNC.
- Team: Nehuen Guevara, Ignacio Hernandez, Andres Villagra, Clemente Ivetta.