Kunal-htr/codesniff

{CodeSniff}

AI-Based Code Plagiarism Detector

Detect code similarity instantly using K-gram tokenization and Winnowing algorithm

🌐 Live Demo · 📝 Report Bug · ✨ Request Feature


About

"In a world where code is copied, CodeSniff sees what eyes can't."

CodeSniff is a token-based code similarity analyzer designed to detect plagiarism in Java source files. It uses the K-gram fingerprinting technique combined with the Winnowing algorithm (the same approach used by Stanford's MOSS system) to identify copied code even when variable names are changed, comments are removed, or statements are reordered.

CodeSniff currently supports Java only. Future releases are planned to add Python, C++, JavaScript, and other languages through AI-powered semantic analysis.


🚀 Features

⚠️ Current Version (v0.5): Supports Java source files only. Multi-language support planned for v2.5.

  • 🔍 Token-based similarity detection using K-gram algorithm
  • 🪟 Winnowing algorithm for efficient fingerprint selection
  • ☕ Java source file upload and pairwise comparison
  • 💻 Direct code paste for quick Java code analysis
  • 📊 Similarity percentage results table
  • 📥 CSV report download
  • ⚙️ Configurable options: K-gram size, window size, ignore comments
  • 🌐 Fully deployed on cloud infrastructure
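
As a rough illustration of the pairwise comparison and CSV report features, the sketch below enumerates every unordered pair of submissions and formats one report row per pair. The class name, the column order (file A, file B, percentage), and the quoting rules are assumptions made for this example; they are not taken from CodeSniff's actual export code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of a pairwise CSV report; not CodeSniff's real exporter. */
class CsvReportSketch {

    /** All unordered index pairs (i < j) for n submissions: n * (n - 1) / 2 pairs. */
    static List<int[]> pairs(int n) {
        List<int[]> result = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                result.add(new int[] {i, j});
            }
        }
        return result;
    }

    /** Quotes a field only when it contains a comma, a quote, or a line break. */
    static String csvField(String s) {
        boolean needsQuotes = s.contains(",") || s.contains("\"")
                || s.contains("\n") || s.contains("\r");
        return needsQuotes ? "\"" + s.replace("\"", "\"\"") + "\"" : s;
    }

    /** One report row; the score is assumed to be a fraction between 0 and 1. */
    static String row(String fileA, String fileB, double score) {
        String percent = String.format(Locale.ROOT, "%.1f%%", score * 100);
        return csvField(fileA) + "," + csvField(fileB) + "," + percent;
    }

    public static void main(String[] args) {
        System.out.println("file_a,file_b,similarity");
        System.out.println(row("A.java", "B.java", 0.451));
    }
}
```

With four uploaded files, `pairs(4)` yields six comparisons, so the number of rows grows quadratically with the number of submissions.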

🛠️ Tech Stack

Frontend

| Technology | Purpose |
|---|---|
| HTML5 / CSS3 | UI structure and styling |
| Vanilla JavaScript | SPA routing and API calls |
| Vercel | Hosting and deployment |

Backend

| Technology | Purpose |
|---|---|
| Java 17 | Core language |
| Spring Boot 3.3.5 | REST API framework |
| Maven | Build and dependency management |
| Azure App Service F1 | Cloud hosting (24/7) |

Database & Infrastructure

| Technology | Purpose |
|---|---|
| PostgreSQL (Supabase) | Database |
| GitHub Actions | CI/CD pipeline |
| Nginx | Reverse proxy |

🏗️ System Architecture

User Browser
     │
     ▼
┌──────────────────┐
│  Vercel          │  codesniff.tech
│  (Frontend)      │  HTML + CSS + JS
└────────┬─────────┘
         │ API calls
         ▼
┌──────────────────────────┐
│  Azure App Service F1    │  codesniff-backend.azurewebsites.net
│  Spring Boot Backend     │
│  ┌─────────────────┐     │
│  │AnalyzeController│     │
│  └────────┬────────┘     │
│           │              │
│  ┌────────▼────────┐     │
│  │SimilarityEngine │     │
│  └────────┬────────┘     │
│           │              │
│  ┌────────▼────────┐     │
│  │Tokenizer        │     │
│  └────────┬────────┘     │
│           │              │
│  ┌────────▼────────┐     │
│  │CodeNormalizer   │     │
│  └─────────────────┘     │
└───────────┬──────────────┘
            │
            ▼
┌──────────────────┐
│  Supabase        │  PostgreSQL Database
│  (Database)      │
└──────────────────┘

🔬 Detection Pipeline

Input Code
    │
    ▼
1. Code Normalization    → Remove comments and whitespace, convert to lowercase
    │
    ▼
2. Tokenization         → Convert code to token stream
    │
    ▼
3. K-gram Generation    → Create overlapping k-grams (default k=6)
    │
    ▼
4. Hashing              → Hash each k-gram
    │
    ▼
5. Winnowing            → Select minimum hashes per window
    │
    ▼
6. Fingerprint Compare  → Jaccard similarity between fingerprint sets
    │
    ▼
Similarity Score (0% - 100%)
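
Steps 3 to 6 of the pipeline can be sketched in a few lines of Java. This is an illustrative sketch, not the project's SimilarityEngine: the class and method names are hypothetical, `List.hashCode()` stands in for whatever hash function the real engine uses, and only hash values are kept (classic winnowing also records the position of each selected hash, which is omitted here because a set comparison does not need it).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch of pipeline steps 3 to 6; all names are hypothetical. */
class WinnowingSketch {

    /** Steps 3 and 4: hash every run of k consecutive tokens. */
    static List<Integer> kgramHashes(List<String> tokens, int k) {
        List<Integer> hashes = new ArrayList<>();
        for (int i = 0; i + k <= tokens.size(); i++) {
            // List.hashCode() combines the element hashes in order,
            // so the same k tokens always produce the same value.
            hashes.add(tokens.subList(i, i + k).hashCode());
        }
        return hashes;
    }

    /** Step 5: keep the minimum hash of each window of w consecutive hashes. */
    static Set<Integer> winnow(List<Integer> hashes, int w) {
        Set<Integer> fingerprints = new HashSet<>();
        if (hashes.isEmpty()) {
            return fingerprints;
        }
        if (hashes.size() < w) {
            // Input shorter than one window: keep its single minimum.
            fingerprints.add(Collections.min(hashes));
            return fingerprints;
        }
        for (int start = 0; start + w <= hashes.size(); start++) {
            fingerprints.add(Collections.min(hashes.subList(start, start + w)));
        }
        return fingerprints;
    }

    /** Step 6: intersection size divided by union size (0 if both sets are empty). */
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return (double) intersection.size() / union.size();
    }

    /** Runs steps 3 to 6 on two token streams. */
    static double similarity(List<String> tokensA, List<String> tokensB, int k, int w) {
        return jaccard(winnow(kgramHashes(tokensA, k), w),
                       winnow(kgramHashes(tokensB, k), w));
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("int", "ID", "=", "ID", "+", "NUM", ";", "return", "ID");
        // Identical token streams share every fingerprint.
        System.out.println(similarity(tokens, tokens, 6, 4));
    }
}
```

Larger values of k make matches stricter (longer runs must agree), while larger windows keep fewer fingerprints per file, trading sensitivity for speed.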

📦 Clone Types Detected

| Clone Type | Description | Detected |
|---|---|---|
| Type 1 | Exact copy | ✅ |
| Type 2 | Renamed identifiers | ✅ |
| Type 3 | Added/removed statements | ✅ |
| Type 4 | Semantic similarity | ⚠️ Partial |
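
A common way for token-based detectors to catch Type 2 clones is to strip comments and replace every identifier and numeric literal with a placeholder token before fingerprinting. The sketch below shows that idea under stated assumptions: the class name, the placeholder names `ID` and `NUM`, and the small keyword subset are hypothetical, and whether CodeSniff's Tokenizer and CodeNormalizer work exactly this way is not confirmed here. The comment regex is deliberately simple and does not handle comment markers inside string literals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of identifier-abstracting tokenization. */
class TokenizerSketch {

    // Small illustrative subset of Java keywords, kept as-is in the token stream.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "int", "return", "if", "else", "for", "while",
            "class", "public", "static", "void", "new"));

    // Block comments first, then line comments.
    private static final Pattern COMMENTS =
            Pattern.compile("/\\*.*?\\*/|//[^\\n]*", Pattern.DOTALL);

    // A word (identifier or keyword), an integer literal, or any single symbol.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z_$][A-Za-z0-9_$]*|\\d+|\\S");

    static List<String> tokenize(String source) {
        String stripped = COMMENTS.matcher(source).replaceAll(" ");
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(stripped);
        while (m.find()) {
            String text = m.group();
            char first = text.charAt(0);
            if (Character.isDigit(first)) {
                tokens.add("NUM");                      // any integer literal
            } else if (Character.isJavaIdentifierStart(first)) {
                tokens.add(KEYWORDS.contains(text) ? text : "ID");
            } else {
                tokens.add(text);                       // operators and punctuation
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Both lines reduce to the same token stream despite renamed variables.
        System.out.println(tokenize("int total = a + 1; // sum"));
        System.out.println(tokenize("int x = /* c */ yy + 42;"));
    }
}
```

Because renamed variables collapse to the same `ID` token, two files that differ only in naming produce identical k-grams and therefore identical fingerprints.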

🚀 Getting Started

Prerequisites

  • Java 17+
  • Maven 3.9+
  • PostgreSQL (or Supabase account)

Installation

# Clone the repository
git clone https://github.com/Kunal-htr/codesniff.git

# Navigate to project
cd codesniff

# Install dependencies
mvn clean install

Configuration

Create src/main/resources/application.properties:

server.port=9090
spring.datasource.url=jdbc:postgresql://your-db-host:5432/postgres
spring.datasource.username=your-username
spring.datasource.password=your-password
spring.datasource.driver-class-name=org.postgresql.Driver
spring.jpa.hibernate.ddl-auto=update

Run Locally

mvn spring-boot:run

Open http://localhost:9090 in your browser.


🔌 API Reference

Analyze Code Similarity

POST /api/analyze
Content-Type: application/json

Request Body:

{
  "submissions": [
    { "name": "A.java", "content": "public class A { ... }" },
    { "name": "B.java", "content": "public class B { ... }" }
  ],
  "options": {
    "omitComments": true,
    "k": 6,
    "window": 4
  }
}

Response:

{
  "summary": [
    {
      "a": "A.java",
      "b": "B.java",
      "score": 0.451
    }
  ]
}
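
For reference, here is a minimal Java client for this endpoint built on the JDK's `java.net.http` package (Java 11+). It is a sketch under assumptions: the class and helper names are hypothetical, the JSON body is assembled by hand to avoid extra dependencies, and the base URL is whatever host the backend runs on (for a local run, `http://localhost:9090` as configured above). Judging by the example response, `score` looks like a fraction, so 0.451 would correspond to about 45.1%.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical client sketch for POST /api/analyze. */
class AnalyzeClientSketch {

    /** Wraps a string in quotes and escapes it for use as a JSON string value. */
    static String jsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    /** Builds the request body shown above for two submissions. */
    static String buildBody(String nameA, String codeA,
                            String nameB, String codeB, int k, int window) {
        return "{\"submissions\":["
                + "{\"name\":" + jsonString(nameA) + ",\"content\":" + jsonString(codeA) + "},"
                + "{\"name\":" + jsonString(nameB) + ",\"content\":" + jsonString(codeB) + "}],"
                + "\"options\":{\"omitComments\":true,\"k\":" + k + ",\"window\":" + window + "}}";
    }

    /** Creates the POST request without sending it. */
    static HttpRequest buildRequest(String baseUrl, String body) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/analyze"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) throws Exception {
        String body = buildBody("A.java", "public class A {}",
                                "B.java", "public class B {}", 6, 4);
        System.out.println(body);
        if (args.length > 0) {
            // Pass the base URL (e.g. http://localhost:9090) to actually send the request.
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(buildRequest(args[0], body), HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }
}
```

In a real project a JSON library such as Jackson (already bundled with Spring Boot) would be a safer choice than hand-built strings.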

Health Check

GET /api/health

Response:

CodeSniff is alive!

📁 Project Structure

codesniff/
├── src/
│   └── main/
│       ├── java/
│       │   └── backend/
│       │       ├── App.java                 # Spring Boot entry point
│       │       ├── AnalyzeController.java   # REST API endpoints
│       │       ├── SimilarityEngine.java    # Core detection logic
│       │       ├── Tokenizer.java           # Code tokenization
│       │       ├── CodeNormalizer.java      # Code preprocessing
│       │       └── CorsConfig.java          # CORS configuration
│       └── resources/
│           └── static/
│               ├── index.html               # Frontend UI
│               ├── app.js                   # Frontend logic
│               └── style.css                # Styling
├── frontend/                                # Vercel deployment
├── .github/workflows/                       # CI/CD pipeline
├── Dockerfile                               # Container config
└── pom.xml                                  # Maven config

🔄 CI/CD Pipeline

git push to main
      │
      ▼
GitHub Actions triggers
      │
      ▼
Maven build + test
      │
      ▼
Deploy to Azure App Service
      │
      ▼
Live in ~50 seconds ✅

📊 Performance

| Metric | Value |
|---|---|
| Average response time | ~200ms |
| Max file size | 1MB |
| Supported languages | Java only (v0.5) |
| Concurrent comparisons | Multiple pairs |

📦 Modules

| Module | Name | Status | Version |
|---|---|---|---|
| Module 1 | Similarity Engine | ✅ Complete | v0.5 |
| Module 2 | UI & User Workflow | 🔄 Planned | v1.0 |
| Module 3 | Report Visualization | 🔄 Planned | v1.5 |
| Module 4 | Database & Storage | 🔄 Planned | v2.0 |
| Module 5 | Future AI Enhancements | 🔄 Planned | v2.5 |

📄 License

This project is licensed under the MIT License.


🙏 Acknowledgements


Made with ❤️ by Kunal Patel

⭐ Star this repo if you found it helpful!

About

🔍 Code Similarity Analyzer using Token-Based Approach | Detects Java code plagiarism via K-gram & Winnowing algorithm | Spring Boot + Vercel + Azure
