arshkaur2405/Plagiarism-Detector-Using-String-Matching

📄 Plagiarism Detector Using String Matching Algorithms

🚀 Project Overview

The Plagiarism Detector Using String Matching Algorithms is a Data Structures and Algorithms (DSA) project that detects similarities between text documents using classical pattern matching techniques.

The system compares an Original Document and a Submitted Document, identifies copied content, calculates the plagiarism percentage, and generates a detailed report.

The project demonstrates how string matching algorithms can be applied to solve real-world problems such as plagiarism detection, content verification, and text similarity analysis.


🎯 Problem Statement

Plagiarism is a major concern in educational institutions, publishing platforms, online learning systems, and content management platforms.

Manually checking documents for copied content is time-consuming and inefficient.

This project automates plagiarism detection using efficient string matching algorithms.


✨ Features

  • Upload Original and Submitted Documents
  • Manual Text Input
  • Automatic Loading of Sample Documents
  • Text Preprocessing
  • Sentence Tokenization
  • Naive String Matching Algorithm
  • Knuth-Morris-Pratt (KMP) Algorithm
  • Rabin-Karp Algorithm
  • Similarity Calculation
  • Plagiarism Percentage Detection
  • Matched Sentence Identification
  • Downloadable Report Generation
  • Interactive Streamlit Dashboard

🏗️ Project Architecture

Original Document
        │
        ▼
Text Preprocessing
        │
        ▼
Sentence Splitting
        │
        ▼
String Matching Algorithms
(Naive / KMP / Rabin-Karp)
        │
        ▼
Similarity Calculation
        │
        ▼
Matched Content Detection
        │
        ▼
Plagiarism Percentage
        │
        ▼
Report Generation

🧠 DSA Concepts Used

String Matching

Used to compare text patterns between documents.

Pattern Searching

Detects copied content efficiently.

Hashing

Used in Rabin-Karp Algorithm.

Prefix Function (LPS Array)

Used in KMP Algorithm.

Arrays

Used for storing prefix values and sentences.

File Handling

Reading text documents.

Text Processing

Cleaning and tokenizing documents.
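
The cleaning and tokenizing step can be sketched as below. This is a minimal illustration, not the code in this repository: the function names `preprocess` and `split_sentences`, and the choice to lowercase and strip punctuation, are assumptions.

```python
import re


def preprocess(text: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)   # keep letters, digits, whitespace
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Split on sentence-ending punctuation or newlines, then clean each part."""
    parts = re.split(r"[.!?\n]+", text)
    cleaned = [preprocess(part) for part in parts]
    return [sentence for sentence in cleaned if sentence]  # drop empty pieces
```

Normalizing both documents the same way means that differences in case, punctuation, or spacing do not hide an otherwise copied sentence.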


⚙️ Algorithms Implemented

1. Naive String Matching

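Compares the pattern against every possible starting position in the text. In the complexities below, n is the length of the text and m is the length of the pattern.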

Time Complexity

O(n × m)

Space Complexity

O(1)
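
A minimal sketch of the naive approach, assuming a hypothetical helper named `naive_search` that returns every starting index where the pattern occurs (the repository's own function may look different):

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Return all start indices of pattern in text by checking every position."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    matches = []
    for i in range(n - m + 1):          # every possible starting position
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1                      # extend the match one character at a time
        if j == m:                      # all m characters matched
            matches.append(i)
    return matches
```

The nested loops are where the O(n × m) bound comes from: up to m comparisons at each of roughly n positions.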

2. KMP Algorithm

Uses an LPS array (for each prefix of the pattern, the length of the longest proper prefix that is also a suffix) to avoid re-comparing characters after a mismatch.

Time Complexity

O(n + m)

Space Complexity

O(m)
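
The following sketch shows one common way to write KMP. The names `build_lps` and `kmp_search` are illustrative assumptions and are not taken from this project's source files.

```python
def build_lps(pattern: str) -> list[int]:
    """lps[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    lps = [0] * len(pattern)
    length = 0                          # length of the current matched prefix
    i = 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length:
            length = lps[length - 1]    # fall back to a shorter prefix
        else:
            i += 1                      # lps[i] stays 0
    return lps


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return all start indices of pattern in text in O(n + m) time."""
    if not pattern:
        return []
    lps = build_lps(pattern)
    matches = []
    j = 0                               # number of pattern characters matched
    for i, ch in enumerate(text):
        while j and ch != pattern[j]:
            j = lps[j - 1]              # reuse what is already known to match
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            matches.append(i - j + 1)
            j = lps[j - 1]              # keep going to find overlapping matches
    return matches
```

The text index `i` never moves backwards, which is why the search is linear; the O(m) space is the `lps` array.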

3. Rabin-Karp Algorithm

Uses a rolling hash to compare the hash of the pattern with the hash of each window of the text, and checks the characters only when the two hashes agree.

Average Time Complexity

O(n + m)

Worst-Case Time Complexity

O(n × m) (when many hash collisions force a character-by-character check)

Space Complexity

O(1)
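
A minimal Rabin-Karp sketch using a polynomial rolling hash. The function name `rabin_karp_search` and the default `base` and `mod` values are assumptions chosen for illustration; the project may use different constants.

```python
def rabin_karp_search(text: str, pattern: str,
                      base: int = 256, mod: int = 1_000_000_007) -> list[int]:
    """Return all start indices of pattern in text using a rolling hash."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)        # weight of the window's leading character
    p_hash = t_hash = 0
    for k in range(m):                  # hash the pattern and the first window
        p_hash = (p_hash * base + ord(pattern[k])) % mod
        t_hash = (t_hash * base + ord(text[k])) % mod
    matches = []
    for i in range(n - m + 1):
        # Equal hashes can be a collision, so confirm with a direct comparison.
        if p_hash == t_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # Slide the window: remove text[i], append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return matches
```

Updating the hash takes constant time per shift, so the average cost is linear; the direct comparison on a hash match is what produces the worst case.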

💻 Tech Stack

  • Python
  • Streamlit
  • Regular Expressions (re)
  • File Handling
  • Git
  • GitHub

📂 Folder Structure

Plagiarism-Detector-Using-String-Matching/
│
├── documents/
│   ├── original.txt
│   └── submitted.txt
│
├── reports/
│
├── outputs/
│
├── images/
│
├── src/
│   ├── app.py
│   └── main.py
│
├── requirements.txt
├── README.md
└── .gitignore

📥 Installation

Clone Repository

git clone https://github.com/your-username/Plagiarism-Detector-Using-String-Matching.git

Move into Project Folder

cd Plagiarism-Detector-Using-String-Matching

Install Dependencies

pip install -r requirements.txt

▶️ Running the Project

streamlit run src/app.py

The Streamlit dashboard will automatically open in your browser.


📝 Sample Input

Original Document

Artificial Intelligence is transforming many industries.
Machine learning allows systems to learn from data.
Feature engineering can improve prediction accuracy.
Cloud computing provides scalable computing resources.

Submitted Document

Artificial Intelligence is transforming many industries.
Some random content.
Machine learning allows systems to learn from data.
Cloud computing provides scalable computing resources.

📊 Sample Output

Algorithm Used: KMP

Matched Sentences:
1. Artificial Intelligence is transforming many industries.
2. Machine learning allows systems to learn from data.
3. Cloud computing provides scalable computing resources.

Plagiarism Percentage: 75%
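
The 75% figure is consistent with dividing the number of matched sentences by the number of sentences in the submitted document (3 of 4). The sketch below reproduces it under that assumption. The formula, the helper names, and the use of Python's built-in `in` operator as a stand-in for the selected matcher are all illustrative and not confirmed by the project source.

```python
import re

ORIGINAL = """Artificial Intelligence is transforming many industries.
Machine learning allows systems to learn from data.
Feature engineering can improve prediction accuracy.
Cloud computing provides scalable computing resources."""

SUBMITTED = """Artificial Intelligence is transforming many industries.
Some random content.
Machine learning allows systems to learn from data.
Cloud computing provides scalable computing resources."""


def split_sentences(text: str) -> list[str]:
    """Lowercase, split on sentence punctuation or newlines, normalize spaces."""
    parts = re.split(r"[.!?\n]+", text.lower())
    return [" ".join(part.split()) for part in parts if part.strip()]


def plagiarism_report(original: str, submitted: str) -> tuple[list[str], float]:
    """Return the submitted sentences found in the original and their share."""
    original_text = " ".join(split_sentences(original))
    submitted_sentences = split_sentences(submitted)
    # `in` stands in for Naive / KMP / Rabin-Karp here (assumption).
    matched = [s for s in submitted_sentences if s in original_text]
    if not submitted_sentences:
        return matched, 0.0
    return matched, 100 * len(matched) / len(submitted_sentences)


matched, percentage = plagiarism_report(ORIGINAL, SUBMITTED)
print(f"Matched sentences: {len(matched)}")          # Matched sentences: 3
print(f"Plagiarism Percentage: {percentage:.0f}%")   # Plagiarism Percentage: 75%
```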

📈 Applications

  • Academic Integrity Systems
  • Assignment Evaluation Platforms
  • Research Paper Similarity Analysis
  • Content Verification Systems
  • EdTech Platforms
  • Publishing Platforms
  • Documentation Review Tools

🎓 Learning Outcomes

Through this project I learned:

  • String Matching Algorithms
  • KMP Algorithm
  • Rabin-Karp Algorithm
  • Hashing Techniques
  • Pattern Searching
  • Text Processing
  • Streamlit Dashboard Development
  • File Handling in Python
  • Git and GitHub Project Management
  • Real-world Applications of DSA

🔮 Future Enhancements

  • PDF Document Support
  • DOCX File Support
  • Highlight Copied Text
  • Semantic Similarity Detection
  • NLP-Based Paraphrase Detection
  • Multi-Document Comparison
  • Database Integration
  • Cloud Deployment

👨‍💻 Author

Arshdeep Kaur

B.Tech Student | DSA Enthusiast


⭐ If you found this project useful

Please consider giving the repository a star.

About

A DSA-based plagiarism detection system using Naive String Matching, KMP, and Rabin-Karp algorithms with a Streamlit dashboard for document comparison and similarity analysis.
