This project implements a parallel news aggregator in Java. The application processes a large collection of news articles stored in JSON files, organizes them by categories and languages, removes duplicates, and generates multiple aggregated reports and statistics. The solution is designed to efficiently exploit multithreading using a fixed number of Java threads.
- Uses a fixed pool of threads created at program start.
- Work is distributed among threads to process input files concurrently.
- Thread-safe data structures and synchronization mechanisms ensure correctness.
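The threading model described above can be illustrated with a minimal sketch. It is not the project's actual code: the class name `FixedWorkers`, the method `forEachParallel`, and the sample file names are hypothetical, and handing out work through a shared atomic index is only one of several possible distribution strategies.

```java
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical helper: runs a task over all items using a fixed number of threads.
final class FixedWorkers {

    static <T> void forEachParallel(List<T> items, int threadCount, Consumer<T> task) {
        AtomicInteger next = new AtomicInteger(0); // shared index of the next unclaimed item
        Thread[] workers = new Thread[threadCount];

        // All threads are created once, up front; none are spawned later.
        for (int t = 0; t < threadCount; t++) {
            workers[t] = new Thread(() -> {
                int i;
                // Each worker atomically claims the next index until the list is exhausted.
                while ((i = next.getAndIncrement()) < items.size()) {
                    task.accept(items.get(i));
                }
            });
            workers[t].start();
        }

        // Wait for every worker to finish before returning.
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for workers", e);
            }
        }
    }

    public static void main(String[] args) {
        // Placeholder file names, for illustration only.
        List<String> files = List.of("a.json", "b.json", "c.json", "d.json", "e.json");
        ConcurrentLinkedQueue<String> processed = new ConcurrentLinkedQueue<>(); // thread-safe collector
        forEachParallel(files, 3, processed::add);
        System.out.println(processed.size() + " files processed");
    }
}
```

A shared index balances load when files differ in size; a static split (thread `t` takes every `threadCount`-th file) avoids the shared counter at the cost of possible imbalance.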
- Extracts relevant fields from each article: uuid, title, author, url, text, published, language, and categories.
- Efficiently handles large input datasets.
- Articles are considered duplicates if they share the same uuid or title.
- All duplicate articles are removed from further processing.
- The number of removed duplicates is reported.
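One way to read the duplicate rule is sketched below. It assumes that every copy of a duplicated article is dropped (including the first one seen) and that the reported number is the count of dropped articles; both are interpretations rather than confirmed behaviour. The names `Dedup`, `Article`, and `uniqueOnly` are hypothetical, and the record syntax needs Java 16 or later.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class Dedup {
    // Hypothetical minimal article type; the real one carries more fields.
    record Article(String uuid, String title) {}

    // Keeps only articles whose uuid AND title each occur exactly once in the input.
    static List<Article> uniqueOnly(List<Article> articles) {
        Map<String, Integer> uuidCount = new HashMap<>();
        Map<String, Integer> titleCount = new HashMap<>();

        // Pass 1: count occurrences of every uuid and every title.
        for (Article a : articles) {
            uuidCount.merge(a.uuid(), 1, Integer::sum);
            titleCount.merge(a.title(), 1, Integer::sum);
        }

        // Pass 2: keep an article only if neither key is shared with another article.
        List<Article> unique = new ArrayList<>();
        for (Article a : articles) {
            if (uuidCount.get(a.uuid()) == 1 && titleCount.get(a.title()) == 1) {
                unique.add(a);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<Article> all = List.of(
                new Article("u1", "A"), new Article("u1", "B"),
                new Article("u2", "C"), new Article("u3", "C"),
                new Article("u4", "D"));
        List<Article> unique = uniqueOnly(all);
        System.out.println("unique=" + unique.size() + " removed=" + (all.size() - unique.size()));
    }
}
```

In a multithreaded run, the two counting maps could be `ConcurrentHashMap` instances, whose `merge` is atomic, with the filtering pass starting only after all threads have finished counting.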
- Articles are grouped according to a predefined list of valid categories.
- One output file is generated per category, containing sorted article UUIDs.
- Category names are normalized to generate valid file names.
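The grouping and file-name steps could look roughly like the sketch below. The exact normalization rule is not stated here, so replacing each run of non-alphanumeric characters with one underscore is an assumption, as are the names `CategoryGrouping`, `toFileName`, and `group` and the `.txt` extension for category files.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class CategoryGrouping {

    // Assumed rule: every run of non-alphanumeric characters becomes a single underscore.
    static String toFileName(String category) {
        return category.trim().replaceAll("[^A-Za-z0-9]+", "_") + ".txt";
    }

    // Maps each valid category to the sorted set of UUIDs of the articles tagged with it.
    static Map<String, TreeSet<String>> group(Map<String, List<String>> categoriesByUuid,
                                              Set<String> validCategories) {
        Map<String, TreeSet<String>> byCategory = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : categoriesByUuid.entrySet()) {
            for (String category : e.getValue()) {
                if (validCategories.contains(category)) { // categories outside the list are ignored
                    byCategory.computeIfAbsent(category, k -> new TreeSet<>()).add(e.getKey());
                }
            }
        }
        return byCategory;
    }

    public static void main(String[] args) {
        Map<String, List<String>> input = Map.of(
                "b", List.of("Sports", "Unknown"),
                "a", List.of("Sports"));
        System.out.println(group(input, Set.of("Sports")));
        System.out.println(toFileName("Science, Technology"));
    }
}
```

A `TreeSet` keeps the UUIDs sorted as they are added, so each category file can be written directly from its set. The same shape applies to grouping by language.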
- Articles are grouped by language using a predefined list of valid languages.
- One output file is generated per language, containing sorted article UUIDs.
- Generates all_articles.txt containing all unique articles.
- Articles are sorted by publication date (descending), with UUID as a tie-breaker.
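The ordering for all_articles.txt can be expressed as a single comparator, sketched below. It assumes that `published` is an ISO 8601 timestamp string with a uniform offset (so string order matches chronological order) and that the UUID tie-breaker is ascending; neither detail is confirmed by the description. `PublishedOrder`, `Article`, and `sortNewestFirst` are hypothetical names, and the record syntax needs Java 16 or later.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class PublishedOrder {
    // Hypothetical minimal article type with only the fields the ordering needs.
    record Article(String uuid, String published) {}

    // Newest first; articles with equal timestamps are ordered by ascending UUID (assumed).
    static final Comparator<Article> NEWEST_FIRST =
            Comparator.comparing(Article::published).reversed()
                      .thenComparing(Article::uuid);

    // Returns a sorted copy and leaves the input list untouched.
    static List<Article> sortNewestFirst(List<Article> articles) {
        List<Article> sorted = new ArrayList<>(articles);
        sorted.sort(NEWEST_FIRST);
        return sorted;
    }

    public static void main(String[] args) {
        List<Article> sorted = sortNewestFirst(List.of(
                new Article("b", "2024-01-02T10:00:00Z"),
                new Article("a", "2024-01-02T10:00:00Z"),
                new Article("c", "2024-03-01T08:30:00Z")));
        sorted.forEach(a -> System.out.println(a.uuid() + " " + a.published()));
    }
}
```

If timestamps carry mixed offsets, parsing them with `java.time.OffsetDateTime.parse` and comparing instants would be the safer choice.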
- Processes only English-language articles.
- Removes linking words defined in an external file.
- Counts how many distinct articles contain each keyword.
- Outputs results sorted by frequency and lexicographically.
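The keyword statistics can be sketched as follows. The tokenization (lowercasing, splitting on whitespace, stripping everything except `a-z`) and the sort direction (frequency descending, then word ascending) are assumptions, since the description does not fix them. `KeywordCount` and `count` are hypothetical names, and the input is assumed to be already filtered to English articles.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class KeywordCount {

    // For each keyword, counts the number of articles in which it appears at least once.
    static List<Map.Entry<String, Integer>> count(List<String> englishTexts,
                                                  Set<String> linkingWords) {
        Map<String, Integer> articleCount = new HashMap<>();
        for (String text : englishTexts) {
            Set<String> seenInArticle = new HashSet<>(); // each word counts once per article
            for (String raw : text.toLowerCase(Locale.ROOT).split("\\s+")) {
                String word = raw.replaceAll("[^a-z]", ""); // assumed cleanup rule
                if (!word.isEmpty() && !linkingWords.contains(word) && seenInArticle.add(word)) {
                    articleCount.merge(word, 1, Integer::sum);
                }
            }
        }

        // Sort by count (descending), then alphabetically (ascending) to break ties.
        List<Map.Entry<String, Integer>> result = new ArrayList<>(articleCount.entrySet());
        result.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        return result;
    }

    public static void main(String[] args) {
        List<String> texts = List.of("The cat and the dog", "A cat, a cat!", "Dog days");
        for (Map.Entry<String, Integer> e : count(texts, Set.of("the", "and", "a"))) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

The per-article `HashSet` is what turns a raw occurrence count into a count of distinct articles: a word repeated within one article is still counted once.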
- Generates a reports.txt file containing:
  - Number of duplicates found
  - Number of unique articles
  - Most prolific author
  - Most common language
  - Most frequent category
  - Most recent article
  - Most frequent keyword in English articles
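Several of the report entries (most prolific author, most common language, most frequent category) reduce to finding the most frequent value in a list. A minimal sketch follows, assuming ties are broken by the lexicographically smallest value, which the description does not specify. `ReportStats` and `mostFrequent` are hypothetical names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class ReportStats {

    // Returns the most frequent value and its count, or an empty Optional for empty input.
    static Optional<Map.Entry<String, Integer>> mostFrequent(List<String> values) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }

        Map.Entry<String, Integer> best = null;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            boolean higher = best == null || e.getValue() > best.getValue();
            // Assumed tie-break: prefer the lexicographically smaller value.
            boolean tieWins = best != null
                    && e.getValue().equals(best.getValue())
                    && e.getKey().compareTo(best.getKey()) < 0;
            if (higher || tieWins) {
                best = e;
            }
        }
        // Copy into an immutable entry so the result does not depend on the internal map.
        return best == null
                ? Optional.empty()
                : Optional.of(Map.entry(best.getKey(), best.getValue()));
    }

    public static void main(String[] args) {
        // Placeholder author names, for illustration only.
        List<String> authors = List.of("ann", "bob", "bob", "ann", "cy");
        mostFrequent(authors).ifPresent(e -> System.out.println(e.getKey() + " " + e.getValue()));
    }
}
```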