This project develops and evaluates a system to perform a quantitative analysis of the evolution of Wikipedia articles over time. The technique used is based on W-Shingling, where documents are represented as sets of word sequences (shingles) of a given size,
The project successfully shows the decay in content similarity over revision history, identifies the optimal
The results provide a structured process for measuring how Wikipedia pages change over time. They also demonstrate the practical trade-off between computational speed and the accuracy of the similarity scores, a trade-off governed by the signature size.
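To make the role of the signature size concrete, here is a minimal MinHash sketch. It is an illustration under stated assumptions, not the code from this repository: the function names, the affine hash family, and the use of `zlib.crc32` as a stable base hash are all hypothetical choices. A longer signature costs more hashing work per document but typically yields an estimate closer to the exact Jaccard similarity.

```python
import random
import zlib

# Large Mersenne prime used as the modulus of the affine hash family (assumption).
_PRIME = (1 << 61) - 1


def make_hash_params(k, seed=0):
    """Draw k (a, b) pairs, one per position in the signature."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME)) for _ in range(k)]


def minhash_signature(items, params):
    """Return the MinHash signature of a non-empty collection of strings."""
    # crc32 is used instead of hash() because str hashing is salted per process.
    base = [zlib.crc32(s.encode("utf-8")) for s in set(items)]
    if not base:
        raise ValueError("cannot build a signature for an empty set")
    return [min((a * x + b) % _PRIME for x in base) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Two hypothetical shingle sets that share half of their elements.
set_a = {f"shingle-{i}" for i in range(0, 100)}
set_b = {f"shingle-{i}" for i in range(50, 150)}

# Compare estimates for a small and a large signature size.
for k in (16, 256):
    params = make_hash_params(k, seed=42)
    estimate = estimate_jaccard(
        minhash_signature(set_a, params), minhash_signature(set_b, params)
    )
    print(f"signature size {k}: estimated similarity {estimate:.3f}")
```

The exact Jaccard similarity of the two example sets is 50 / 150, so the printed estimates can be compared against roughly 0.333. Both documents must be hashed with the same `params` for their signatures to be comparable.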
The Python implementation of this study, including the shingling, MinHash, and Jaccard similarity calculations, is available in the Wikipedia_Evolution_Study directory of this repository.
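As a rough orientation for readers who have not opened that directory, the shingling and exact Jaccard steps can be sketched as follows. This is a minimal sketch, not the repository's code: the function names, the lowercasing, the whitespace tokenization, and the default shingle size of 3 are assumptions made for illustration.

```python
def shingles(text, w=3):
    """Return the set of w-word shingles in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < w:
        # A text shorter than w words contributes a single short shingle.
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty documents are identical
    return len(a & b) / len(a | b)


# Two hypothetical revisions of the same sentence.
old_revision = "the quick brown fox jumps over the lazy dog"
new_revision = "the quick brown fox leaps over the lazy dog"

similarity = jaccard(shingles(old_revision), shingles(new_revision))
print(f"similarity between revisions: {similarity:.3f}")
```

Comparing each revision against the first one in this way gives one similarity value per revision, which is the kind of series a decay curve over the revision history would be built from.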