The Plagiarism Detector Using String Matching Algorithms is a Data Structures and Algorithms (DSA) project that detects similarities between text documents using classical pattern matching techniques.
The system compares an Original Document and a Submitted Document, identifies copied content, calculates the plagiarism percentage, and generates a detailed report.
The project demonstrates how string matching algorithms can be applied to solve real-world problems such as plagiarism detection, content verification, and text similarity analysis.
Plagiarism is a major concern in educational institutions, publishing platforms, online learning systems, and content management platforms.
Manually checking documents for copied content is time-consuming and inefficient.
This project automates plagiarism detection using efficient string matching algorithms.
Features:
- Upload Original and Submitted Documents
- Manual Text Input
- Automatic Loading of Sample Documents
- Text Preprocessing
- Sentence Tokenization
- Naive String Matching Algorithm
- Knuth-Morris-Pratt (KMP) Algorithm
- Rabin-Karp Algorithm
- Similarity Calculation
- Plagiarism Percentage Detection
- Matched Sentence Identification
- Downloadable Report Generation
- Interactive Streamlit Dashboard
Original Document
│
▼
Text Preprocessing
│
▼
Sentence Splitting
│
▼
String Matching Algorithms
(Naive / KMP / Rabin-Karp)
│
▼
Similarity Calculation
│
▼
Matched Content Detection
│
▼
Plagiarism Percentage
│
▼
Report Generation
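The first two stages of the flow above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `split_sentences` and `normalize` are hypothetical, and the sketch assumes sentences are split on `.`, `!` or `?` before punctuation is stripped, since removing punctuation first would erase the sentence boundaries.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]  # drop empty pieces (e.g. for empty input)


def normalize(sentence: str) -> str:
    """Lowercase, remove punctuation, and collapse repeated whitespace."""
    lowered = sentence.lower()
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    return re.sub(r"\s+", " ", no_punct).strip()


text = "Machine  learning allows systems to learn from data. Some random content!"
sentences = [normalize(s) for s in split_sentences(text)]
print(sentences)
# ['machine learning allows systems to learn from data', 'some random content']
```

Normalizing both documents the same way means that differences in case, spacing, or punctuation do not hide an otherwise copied sentence.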
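- String matching: used to compare text patterns between documents.
- Pattern searching: detects copied content efficiently.
- Hashing: used in the Rabin-Karp algorithm.
- LPS (prefix) array: used in the KMP algorithm.
- Lists/arrays: used for storing prefix values and sentences.
- File handling: reading text documents.
- Text processing: cleaning and tokenizing documents.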
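Naive String Matching: compares the pattern against every possible position in the text. Here n is the length of the text and m is the length of the pattern.
- Time complexity: O(n × m)
- Space complexity: O(1)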
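A minimal sketch of the naive approach, for illustration only; the function name `naive_search` is hypothetical and not taken from the project. It tries every start position in the text and compares character by character, which is where the O(n × m) worst case comes from.

```python
def naive_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0  # convention: the empty pattern matches at index 0
    for start in range(n - m + 1):  # every position where the pattern could fit
        j = 0
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:  # all m characters matched
            return start
    return -1


print(naive_search("cloud computing provides scalable resources", "scalable"))  # 25
print(naive_search("cloud computing", "hashing"))  # -1
```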
Knuth-Morris-Pratt (KMP): uses an LPS (Longest Prefix Suffix) array to avoid unnecessary comparisons.
- Time complexity: O(n + m)
- Space complexity: O(m)
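A minimal sketch of KMP, assuming the standard textbook formulation; the names `build_lps` and `kmp_search` are hypothetical. The LPS array stores, for each prefix of the pattern, the length of its longest proper prefix that is also a suffix. On a mismatch the search falls back using this array instead of re-reading text characters, and the array itself accounts for the O(m) extra space.

```python
from typing import List


def build_lps(pattern: str) -> List[int]:
    """lps[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    lps = [0] * len(pattern)
    length = 0  # length of the currently matched prefix
    i = 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length > 0:
            length = lps[length - 1]  # fall back to a shorter candidate prefix
        else:
            lps[i] = 0
            i += 1
    return lps


def kmp_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    lps = build_lps(pattern)
    j = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while j > 0 and ch != pattern[j]:
            j = lps[j - 1]  # reuse the already matched prefix; the text index never moves back
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            return i - j + 1
    return -1


print(build_lps("ababaca"))                # [0, 0, 1, 2, 3, 0, 1]
print(kmp_search("abababaca", "ababaca"))  # 2
```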
Rabin-Karp: uses hashing and rolling hash techniques for pattern matching.
- Average-case time complexity: O(n + m)
- Worst-case time complexity: O(n × m)
- Space complexity: O(1)
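A minimal sketch of Rabin-Karp with a polynomial rolling hash. The function name `rabin_karp_search` and the `base` and `mod` values are illustrative assumptions, not the project's actual parameters. Each window's hash is derived from the previous one in constant time, and a hash match is confirmed with a direct comparison so that collisions cannot produce false positives. Many collisions are what push the running time toward the O(n × m) worst case.

```python
def rabin_karp_search(text: str, pattern: str,
                      base: int = 256, mod: int = 1_000_000_007) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1

    high = pow(base, m - 1, mod)  # weight of the leading character in a window
    p_hash = t_hash = 0
    for k in range(m):  # hash the pattern and the first window of the text
        p_hash = (p_hash * base + ord(pattern[k])) % mod
        t_hash = (t_hash * base + ord(text[k])) % mod

    for start in range(n - m + 1):
        # Compare the actual characters only when the hashes agree
        if p_hash == t_hash and text[start:start + m] == pattern:
            return start
        if start < n - m:
            # Roll the window: drop text[start], append text[start + m]
            t_hash = ((t_hash - ord(text[start]) * high) * base
                      + ord(text[start + m])) % mod
    return -1


print(rabin_karp_search("machine learning allows systems to learn", "learn"))  # 8
```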
Technologies used:
- Python
- Streamlit
- Regular Expressions (re)
- File Handling
- Git
- GitHub
Plagiarism-Detector-Using-String-Matching/
│
├── documents/
│   ├── original.txt
│   └── submitted.txt
│
├── reports/
│
├── outputs/
│
├── images/
│
├── src/
│   ├── app.py
│   └── main.py
├── requirements.txt
├── README.md
└── .gitignore
git clone https://github.com/your-username/Plagiarism-Detector-Using-String-Matching.git
cd Plagiarism-Detector-Using-String-Matching
pip install -r requirements.txt
streamlit run app.py

The Streamlit dashboard will automatically open in your browser.
Original document:
Artificial Intelligence is transforming many industries.
Machine learning allows systems to learn from data.
Feature engineering can improve prediction accuracy.
Cloud computing provides scalable computing resources.

Submitted document:
Artificial Intelligence is transforming many industries.
Some random content.
Machine learning allows systems to learn from data.
Cloud computing provides scalable computing resources.

Output:
Algorithm Used: KMP
Matched Sentences:
1. Artificial Intelligence is transforming many industries.
2. Machine learning allows systems to learn from data.
3. Cloud computing provides scalable computing resources.
Plagiarism Percentage: 75%
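The 75% figure can be reproduced with a short sketch. Two assumptions are made here that the README does not confirm: the percentage is taken as the share of submitted sentences that also occur in the original document (3 of 4), and Python's built-in `in` substring test stands in for the Naive, KMP, or Rabin-Karp matcher. The function names are hypothetical.

```python
import re
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def plagiarism_report(original: str, submitted: str) -> Tuple[List[str], float]:
    """Return the submitted sentences found in the original and their share in percent."""
    original_lower = original.lower()
    sentences = split_sentences(submitted)
    # `in` is Python's built-in substring search, used here as a stand-in matcher
    matched = [s for s in sentences if s.lower() in original_lower]
    percentage = 100.0 * len(matched) / len(sentences) if sentences else 0.0
    return matched, percentage


original = (
    "Artificial Intelligence is transforming many industries. "
    "Machine learning allows systems to learn from data. "
    "Feature engineering can improve prediction accuracy. "
    "Cloud computing provides scalable computing resources."
)
submitted = (
    "Artificial Intelligence is transforming many industries. "
    "Some random content. "
    "Machine learning allows systems to learn from data. "
    "Cloud computing provides scalable computing resources."
)

matched, percentage = plagiarism_report(original, submitted)
for number, sentence in enumerate(matched, start=1):
    print(f"{number}. {sentence}")
print(f"Plagiarism Percentage: {percentage:.0f}%")  # Plagiarism Percentage: 75%
```

Dividing by the number of submitted sentences measures how much of the submission is copied. Dividing by the number of original sentences would instead measure how much of the source was reused. The two happen to agree in this sample because both documents contain four sentences.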
Applications:
- Academic Integrity Systems
- Assignment Evaluation Platforms
- Research Paper Similarity Analysis
- Content Verification Systems
- EdTech Platforms
- Publishing Platforms
- Documentation Review Tools
Through this project I learned:
- String Matching Algorithms
- KMP Algorithm
- Rabin-Karp Algorithm
- Hashing Techniques
- Pattern Searching
- Text Processing
- Streamlit Dashboard Development
- File Handling in Python
- Git and GitHub Project Management
- Real-world Applications of DSA
Possible future enhancements:
- PDF Document Support
- DOCX File Support
- Highlight Copied Text
- Semantic Similarity Detection
- NLP-Based Paraphrase Detection
- Multi-Document Comparison
- Database Integration
- Cloud Deployment
Arshdeep Kaur
B.Tech Student | DSA Enthusiast
Please consider giving the repository a star.