SrijaVuppula/Wikipedia_Evolution_Study

Wikipedia Evolution Study

Abstract

This project develops and evaluates a system for quantitatively analyzing how Wikipedia articles evolve over time. The technique is based on w-shingling, in which a document is represented as the set of its contiguous word sequences (shingles) of a given size $w$. Each shingle is hashed with a hash function (MD5 in this project) to produce a compact fingerprint. To approximate document similarity efficiently, a MinHash signature of size $\lambda$ is generated by selecting the $\lambda$ smallest hash values. The Jaccard similarity is then used to compare the current version of an article with its past versions. Experiments are conducted on a corpus of Wikipedia pages for various U.S. cities, with shingle sizes $w \in \{25, 50\}$ and signature sizes $\lambda \in \{8, 16, 32, 64\}$.
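
The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the function names (`shingles`, `fingerprint`, `signature`, `jaccard`) are hypothetical, whitespace tokenization is an assumption, and comparing the two bottom-$\lambda$ sets directly with Jaccard is one simple reading of the description.

```python
import hashlib


def shingles(text, w):
    """Return the set of contiguous w-word sequences in text."""
    words = text.split()  # assumption: plain whitespace tokenization
    if len(words) < w:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}


def fingerprint(shingle):
    """Map a shingle to a 128-bit integer using MD5."""
    return int(hashlib.md5(shingle.encode("utf-8")).hexdigest(), 16)


def signature(text, w, lam=None):
    """Return the lam smallest fingerprints; lam=None keeps all of them."""
    hashes = sorted({fingerprint(s) for s in shingles(text, w)})
    return set(hashes if lam is None else hashes[:lam])


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


current = "the quick brown fox jumps over the lazy dog"
past = "the quick brown fox jumps over the lazy cat"
exact = jaccard(signature(current, 2), signature(past, 2))
approx = jaccard(signature(current, 2, lam=4), signature(past, 2, lam=4))
```

Here `exact` uses every shingle and corresponds to $\lambda = \infty$, while `approx` compares only the four smallest fingerprints of each version. A toy shingle size of 2 is used so the example fits on one line of text; the study itself uses $w \in \{25, 50\}$.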

The experiments show how content similarity decays across an article's revision history, identify the signature size $\lambda$ that best approximates the exact similarity computed from all shingles (denoted $\lambda = \infty$), and analyze the performance of the shingling process.
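
One way to compare signature sizes against the exact similarity is to measure the mean absolute error over a revision history. The sketch below runs on synthetic revisions and does not reproduce any result from the study: the helper names, the toy corpus, and the small shingle size `w=5` are all assumptions made for illustration.

```python
import hashlib


def hashed_shingles(text, w):
    """Sorted MD5 fingerprints of the distinct w-word shingles in text."""
    words = text.split()
    grams = {" ".join(words[i:i + w]) for i in range(max(len(words) - w + 1, 0))}
    return sorted(int(hashlib.md5(g.encode("utf-8")).hexdigest(), 16) for g in grams)


def similarity(current, past, w, lam=None):
    """Jaccard similarity of the lam smallest fingerprints (lam=None: all)."""
    a = set(hashed_shingles(current, w)[:lam])
    b = set(hashed_shingles(past, w)[:lam])
    return len(a & b) / len(a | b) if (a or b) else 1.0


def mean_abs_error(current, history, w, lam):
    """Average |approximate - exact| similarity over all past versions."""
    errors = [
        abs(similarity(current, old, w, lam) - similarity(current, old, w))
        for old in history
    ]
    return sum(errors) / len(errors)


# Synthetic data (hypothetical): each older revision replaces a larger
# tail of the current text with unrelated words.
words = [f"word{i}" for i in range(400)]
current = " ".join(words)
history = [
    " ".join(words[:400 - 40 * k] + [f"old{k}_{i}" for i in range(40 * k)])
    for k in range(1, 6)
]

errors = {lam: mean_abs_error(current, history, w=5, lam=lam) for lam in (8, 16, 32, 64)}
best_lam = min(errors, key=errors.get)  # signature size with the smallest error
```

On real data the error would be averaged over many articles and revisions, so the preferred $\lambda$ depends on the corpus rather than on a single toy history like this one.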

The results provide a structured way to measure how Wikipedia pages change over time. They also illustrate the trade-off between computational speed and the accuracy of the similarity scores, which is governed by the signature size $\lambda$.


Code Availability

The Python implementation of this study, including the shingling, MinHash, and Jaccard similarity calculations, is available in the Wikipedia_Evolution_Study directory of this repository.
