"In a world where code is copied, CodeSniff sees what eyes can't."
CodeSniff is a token-based code similarity analyzer designed to detect plagiarism in Java source files. It uses the K-gram fingerprinting technique combined with the Winnowing algorithm (the same approach used by Stanford's MOSS system) to identify copied code even when variable names are changed, comments are removed, or statements are reordered.
CodeSniff currently supports Java only. Support for Python, C++, JavaScript, and other languages is planned for future releases, together with AI-powered semantic analysis.
⚠️ Current Version (v0.5): Supports Java source files only. Multi-language support is planned for v2.5.
- Token-based similarity detection using the K-gram algorithm
- Winnowing algorithm for efficient fingerprint selection
- Java source file upload and pairwise comparison
- Direct code paste for quick Java code analysis
- Similarity percentage results table
- CSV report download
- Configurable options: K-gram size, window size, ignore comments
- Fully deployed on cloud infrastructure
| Technology | Purpose |
|---|---|
| HTML5 / CSS3 | UI structure and styling |
| Vanilla JavaScript | SPA routing and API calls |
| Vercel | Hosting and deployment |
| Technology | Purpose |
|---|---|
| Java 17 | Core language |
| Spring Boot 3.3.5 | REST API framework |
| Maven | Build and dependency management |
| Azure App Service F1 | Cloud hosting (24/7) |
| Technology | Purpose |
|---|---|
| PostgreSQL (Supabase) | Database |
| GitHub Actions | CI/CD pipeline |
| Nginx | Reverse proxy |
User Browser
     │
     ▼
┌─────────────────┐
│     Vercel      │  codesniff.tech
│   (Frontend)    │  HTML + CSS + JS
└────────┬────────┘
         │ API calls
         ▼
┌─────────────────────────┐
│  Azure App Service F1   │  codesniff-backend.azurewebsites.net
│  Spring Boot Backend    │
│  ┌─────────────────┐    │
│  │AnalyzeController│    │
│  └────────┬────────┘    │
│           │             │
│  ┌────────▼────────┐    │
│  │SimilarityEngine │    │
│  └────────┬────────┘    │
│           │             │
│  ┌────────▼────────┐    │
│  │Tokenizer        │    │
│  └────────┬────────┘    │
│           │             │
│  ┌────────▼────────┐    │
│  │CodeNormalizer   │    │
│  └─────────────────┘    │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────┐
│    Supabase     │  PostgreSQL Database
│   (Database)    │
└─────────────────┘
Input Code
    │
    ▼
1. Code Normalization    → Remove comments, whitespace, lowercase
    │
    ▼
2. Tokenization          → Convert code to token stream
    │
    ▼
3. K-gram Generation     → Create overlapping k-grams (default k=6)
    │
    ▼
4. Hashing               → Hash each k-gram
    │
    ▼
5. Winnowing             → Select minimum hashes per window
    │
    ▼
6. Fingerprint Compare   → Jaccard similarity between fingerprint sets
    │
    ▼
Similarity Score (0% - 100%)
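
Steps 3 to 6 of the pipeline can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the code in `SimilarityEngine.java`: the class name `WinnowingSketch` and its method names are hypothetical, k-grams are hashed with `String.hashCode()` rather than a rolling hash, and only hash values are kept as fingerprints (the original Winnowing paper also records positions).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of k-gram hashing, winnowing, and Jaccard comparison.
class WinnowingSketch {

    // Steps 3 and 4: hash every run of k consecutive tokens.
    static List<Integer> kgramHashes(List<String> tokens, int k) {
        List<Integer> hashes = new ArrayList<>();
        for (int i = 0; i + k <= tokens.size(); i++) {
            hashes.add(String.join(" ", tokens.subList(i, i + k)).hashCode());
        }
        return hashes;
    }

    // Step 5: keep the minimum hash of each window of w consecutive hashes.
    static Set<Integer> winnow(List<Integer> hashes, int w) {
        Set<Integer> fingerprints = new HashSet<>();
        if (hashes.isEmpty()) {
            return fingerprints;
        }
        if (hashes.size() < w) {
            // Fewer hashes than one full window: keep the overall minimum.
            fingerprints.add(Collections.min(hashes));
            return fingerprints;
        }
        for (int start = 0; start + w <= hashes.size(); start++) {
            int min = hashes.get(start);
            for (int j = start + 1; j < start + w; j++) {
                min = Math.min(min, hashes.get(j));
            }
            fingerprints.add(min);
        }
        return fingerprints;
    }

    // Step 6: Jaccard similarity = |A ∩ B| / |A ∪ B|.
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // nothing to compare (a choice made for this sketch)
        }
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    static double similarity(List<String> a, List<String> b, int k, int w) {
        return jaccard(winnow(kgramHashes(a, k), w), winnow(kgramHashes(b, k), w));
    }

    public static void main(String[] args) {
        List<String> tokens = List.of(
            "int", "ID", "=", "NUM", ";", "return", "ID", "+", "NUM", ";");
        // Comparing a token stream with itself yields 1.0.
        System.out.println(similarity(tokens, tokens, 6, 4));
    }
}
```

The values `k = 6` and `w = 4` match the defaults shown in this README. Larger values make the fingerprint set smaller and the comparison less sensitive to short matches.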
| Clone Type | Description | Detected |
|---|---|---|
| Type 1 | Exact copy | Yes |
| Type 2 | Renamed identifiers | Yes |
| Type 3 | Added/removed statements | Yes |
| Type 4 | Semantic similarity | No |
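
Type 2 clones become detectable when identifiers and literals are mapped to placeholder tokens before fingerprinting. The sketch below illustrates that idea only; it is not the project's `Tokenizer.java`. The class name `TokenNormalizerSketch`, the placeholder names `ID` and `NUM`, and the shortened keyword list are assumptions, and comment or string-literal handling is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: replace identifiers and numbers with placeholders.
class TokenNormalizerSketch {

    // Deliberately incomplete keyword list (assumption for this sketch).
    private static final Set<String> KEYWORDS = Set.of(
        "int", "return", "if", "else", "for", "while",
        "class", "public", "static", "void");

    // A word, an integer literal, or any single non-space symbol.
    private static final Pattern TOKEN =
        Pattern.compile("[A-Za-z_][A-Za-z0-9_]*|\\d+|\\S");

    static List<String> tokenize(String code) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(code);
        while (m.find()) {
            String t = m.group();
            char first = t.charAt(0);
            if (Character.isDigit(first)) {
                tokens.add("NUM");                           // numeric literal
            } else if (Character.isLetter(first) || first == '_') {
                tokens.add(KEYWORDS.contains(t) ? t : "ID"); // keyword or identifier
            } else {
                tokens.add(t);                               // operator or punctuation
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> original = tokenize("int total = price + 10;");
        List<String> renamed = tokenize("int sum = cost + 10;");
        System.out.println(original);                 // [int, ID, =, ID, +, NUM, ;]
        System.out.println(original.equals(renamed)); // true
    }
}
```

Because both statements produce the same token stream, their k-grams and fingerprints are identical, which is why renaming variables alone does not lower the similarity score.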
- Java 17+
- Maven 3.9+
- PostgreSQL (or Supabase account)
# Clone the repository
git clone https://github.com/Kunal-htr/codesniff.git
# Navigate to project
cd codesniff
# Install dependencies
mvn clean install

Create src/main/resources/application.properties:
server.port=9090
spring.datasource.url=jdbc:postgresql://your-db-host:5432/postgres
spring.datasource.username=your-username
spring.datasource.password=your-password
spring.datasource.driver-class-name=org.postgresql.Driver
spring.jpa.hibernate.ddl-auto=update

mvn spring-boot:run

Open http://localhost:9090 in your browser.
POST /api/analyze
Content-Type: application/json

Request Body:
{
  "submissions": [
    { "name": "A.java", "content": "public class A { ... }" },
    { "name": "B.java", "content": "public class B { ... }" }
  ],
  "options": {
    "omitComments": true,
    "k": 6,
    "window": 4
  }
}

Response:
{
  "summary": [
    {
      "a": "A.java",
      "b": "B.java",
      "score": 0.451
    }
  ]
}

GET /api/health

CodeSniff is alive!
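
The analyze endpoint can be called from Java with the standard `java.net.http` client (available since Java 11). The sketch below is an illustration, not part of the repository: the class name `AnalyzeClientSketch` is hypothetical, the submission contents are short placeholders, and the base URL assumes a local instance on port 9090 as configured above.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client sketch for POST /api/analyze.
class AnalyzeClientSketch {

    static HttpRequest buildRequest(String baseUrl) {
        // Placeholder submissions; real requests carry full Java sources.
        String body = """
            {
              "submissions": [
                { "name": "A.java", "content": "public class A { }" },
                { "name": "B.java", "content": "public class B { }" }
              ],
              "options": { "omitComments": true, "k": 6, "window": 4 }
            }
            """;
        return HttpRequest.newBuilder()
            .uri(URI.create(baseUrl + "/api/analyze"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    public static void main(String[] args) throws Exception {
        String baseUrl = args.length > 0 ? args[0] : "http://localhost:9090";
        HttpRequest request = buildRequest(baseUrl);
        System.out.println(request.method() + " " + request.uri());

        // Only contact a server when a base URL is passed explicitly.
        if (args.length > 0) {
            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
            System.out.println(response.body());
        }
    }
}
```

Run it with the base URL as the first argument, for example `http://localhost:9090`, to send the request and print the JSON summary.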
codesniff/
├── src/
│   └── main/
│       ├── java/
│       │   └── backend/
│       │       ├── App.java                # Spring Boot entry point
│       │       ├── AnalyzeController.java  # REST API endpoints
│       │       ├── SimilarityEngine.java   # Core detection logic
│       │       ├── Tokenizer.java          # Code tokenization
│       │       ├── CodeNormalizer.java     # Code preprocessing
│       │       └── CorsConfig.java         # CORS configuration
│       └── resources/
│           └── static/
│               ├── index.html              # Frontend UI
│               ├── app.js                  # Frontend logic
│               └── style.css               # Styling
├── frontend/                               # Vercel deployment
├── .github/workflows/                      # CI/CD pipeline
├── Dockerfile                              # Container config
└── pom.xml                                 # Maven config
git push to main
    │
    ▼
GitHub Actions triggers
    │
    ▼
Maven build + test
    │
    ▼
Deploy to Azure App Service
    │
    ▼
Live in ~50 seconds
| Metric | Value |
|---|---|
| Average response time | ~200ms |
| Max file size | 1MB |
| Supported languages | Java (v0.5) |
| Concurrent comparisons | Multiple pairs |
| Module | Name | Status | Version |
|---|---|---|---|
| Module 1 | Similarity Engine | Complete | v0.5 |
| Module 2 | UI & User Workflow | Planned | v1.0 |
| Module 3 | Report Visualization | Planned | v1.5 |
| Module 4 | Database & Storage | Planned | v2.0 |
| Module 5 | Future AI Enhancements | Planned | v2.5 |
This project is licensed under the MIT License.
- Spring Boot
- Supabase
- Vercel
- Azure
- Winnowing Algorithm: Schleimer, Wilkerson, Aiken (2003)
Made with ❤️ by Kunal Patel
⭐ Star this repo if you found it helpful!