BenjaminSRussell/Ai_code_detector

AI Code Detector

A probabilistic forensics tool for detecting AI-generated code in GitHub repositories. Analyzes stylometric, structural, and git history patterns to estimate the likelihood that code was generated by an AI assistant.

Features

Phase 1 - Heuristic Detection (Complete)

  • Multi-language support: Python, JavaScript, TypeScript, Go, Rust, C/C++, Java
  • Stylometric analysis: Comment patterns, naming conventions, formatting consistency
  • Structural analysis: Code complexity, error handling patterns, dead code detection
  • Git history analysis: Commit patterns, author diversity, file creation bursts
  • Detailed reports: JSON and Markdown outputs with file-level breakdowns
  • CLI interface

Phase 2 - ML Enhancement (Complete)

  • Code embeddings: MLX-based embeddings for semantic analysis (with hash fallback)
  • Learned classifier: Trainable ML classifier using embeddings + features
  • Ensemble detection: Combines heuristics with ML for robust predictions
  • Flexible backends: Supports MLX (Apple Silicon) and fallback modes
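The hash fallback is not specified further in this README. As a rough illustration only, a dependency-free embedding could use the hashing trick: each token is hashed to a bucket and the bucket counts are normalized. The function name `hash_embed`, the tokenizer regex, and the default dimension are all assumptions made for this sketch, not the project's actual API.

```python
import hashlib
import math
import re


def hash_embed(source: str, dim: int = 64) -> list:
    """Map source text to a fixed-length, L2-normalized count vector.

    Hypothetical sketch of a hashing-trick embedding; the real fallback
    in this repository may work differently.
    """
    vector = [0.0] * dim
    # Tokens are identifiers/keywords, or any single non-space character.
    for token in re.findall(r"[A-Za-z_]\w*|\S", source):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        vector[index] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    # An empty input has no tokens, so it stays the zero vector.
    return [v / norm for v in vector] if norm else vector


embedding = hash_embed("def add(a, b):\n    return a + b\n")
print(len(embedding))
```

Because the hash is deterministic, the same file always maps to the same vector, which is what lets a fallback like this run without any model download.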

Phase 3 - Natural Language Explanations (Complete)

  • AI-powered explanations: Qwen-based natural language explanations
  • Template fallback: Works without LLM using intelligent templates
  • Per-file reasoning: Explains why specific files were flagged
  • Enhanced reports: JSON and Markdown with integrated explanations

Installation

git clone https://github.com/BenjaminSRussell/Ai_code_detector.git
cd Ai_code_detector
pip install -r requirements.txt
pip install -e .

Quick Start

Basic Detection (Phase 1)

python -m src.cli https://github.com/user/repo
python -m src.cli /path/to/local/repo

Enhanced Detection (Phase 2+3)

# Enhanced with ML + explanations
python -m src.cli_enhanced ./repo --mode enhanced

# With hash embeddings (fast, no dependencies)
python -m src.cli_enhanced ./repo --embedder hash

# With MLX embeddings (Apple Silicon)
python -m src.cli_enhanced ./repo --embedder mlx

# Disable ML or explanations
python -m src.cli_enhanced ./repo --no-ml
python -m src.cli_enhanced ./repo --no-explanations

How It Works

The detector analyzes code across three dimensions:

1. Stylometric Features

AI-generated code often exhibits:

  • Boilerplate comments: Generic phrases like "This function does...", "Returns:", "Args:"
  • Tutorial-style language: "First, we...", "Then, we...", "Let's..."
  • Generic naming: Overuse of names like result, data, temp, helper
  • Perfect formatting: Unnaturally consistent indentation and spacing
  • High duplication: Similar code patterns repeated across files
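A minimal sketch of how two of these signals might be measured for a single Python file. The phrase patterns and the generic-name set are taken from the examples above; the function name `stylometric_signals` and the returned keys are hypothetical and may not match the repository's code.

```python
import re

# Hypothetical pattern lists built from the examples in this section.
BOILERPLATE_PATTERNS = [
    re.compile(r"this function (does|will|is)", re.IGNORECASE),
    re.compile(r"\b(first|then), we\b", re.IGNORECASE),
    re.compile(r"\blet's\b", re.IGNORECASE),
]
GENERIC_NAMES = {"result", "data", "temp", "helper"}

# Matches simple assignments such as `result = ...` at the start of a line.
ASSIGNMENT = re.compile(r"^[ \t]*([A-Za-z_]\w*)[ \t]*=(?!=)", re.MULTILINE)


def stylometric_signals(source: str) -> dict:
    """Return rough boilerplate-comment and generic-name ratios."""
    # Naive comment extraction: a '#' inside a string literal is miscounted.
    comments = [ln.split("#", 1)[1] for ln in source.splitlines() if "#" in ln]
    boilerplate = sum(
        1 for c in comments if any(p.search(c) for p in BOILERPLATE_PATTERNS)
    )
    names = ASSIGNMENT.findall(source)
    generic = sum(1 for n in names if n in GENERIC_NAMES)
    return {
        "boilerplate_comment_ratio": boilerplate / len(comments) if comments else 0.0,
        "generic_name_ratio": generic / len(names) if names else 0.0,
    }


sample = "# This function does the math\nresult = 1 + 1\n"
print(stylometric_signals(sample))
```

Ratios rather than raw counts keep long and short files comparable.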

2. Structural Features

AI code patterns include:

  • Over-explained simple functions: Low complexity with verbose docstrings
  • Generic error handling: Catch-all except Exception blocks
  • Print-based debugging: print(f"An error occurred: {e}")
  • Dead code: Unused functions, unreferenced imports
  • Missing cleanup: Resource allocation without proper cleanup
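Two of these patterns, catch-all handlers and print-based debugging, can be found with Python's standard `ast` module. The sketch below is one possible approach, not necessarily the one used in `src/analysis/`; the function name `structural_signals` and the result keys are assumptions.

```python
import ast

BROAD_EXCEPTIONS = {"Exception", "BaseException"}


def structural_signals(source: str) -> dict:
    """Count broad exception handlers and print() calls inside handlers."""
    tree = ast.parse(source)
    total = broad = prints = 0
    for node in ast.walk(tree):
        if not isinstance(node, ast.ExceptHandler):
            continue
        total += 1
        # A bare `except:` has type None; `except Exception` is a Name node.
        if node.type is None or (
            isinstance(node.type, ast.Name) and node.type.id in BROAD_EXCEPTIONS
        ):
            broad += 1
        # Look for print(...) calls anywhere inside the handler body.
        for child in ast.walk(node):
            if (
                isinstance(child, ast.Call)
                and isinstance(child.func, ast.Name)
                and child.func.id == "print"
            ):
                prints += 1
    return {
        "total_handlers": total,
        "broad_except_handlers": broad,
        "prints_in_handlers": prints,
    }


sample = (
    "try:\n"
    "    value = int('x')\n"
    "except Exception as e:\n"
    "    print(f'An error occurred: {e}')\n"
)
print(structural_signals(sample))
```

The source is only parsed, never executed, so this is safe to run on untrusted files as long as they are syntactically valid Python.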

3. Git History Features

AI-generated repositories often show:

  • Commit bursts: Large amounts of code in single commits
  • Low author diversity: Single author, all code at once
  • Generic commit messages: "initial commit", "update", "fix"
  • File creation bursts: Most files created simultaneously
  • Dense timeline: Complete repo created within hours or days
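As a sketch, the history signals above reduce to a few aggregates over the commit log. The `Commit` record and `history_signals` function are hypothetical stand-ins; how the real tool reads git history is not shown in this README.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass
class Commit:
    """Hypothetical minimal commit record used only for this sketch."""
    author: str
    when: datetime
    files_added: int


def history_signals(commits: list[Commit]) -> dict:
    """Summarize author diversity, file creation bursts, and timeline length."""
    if not commits:
        return {"author_count": 0, "largest_commit_share": 0.0, "timeline_days": 0.0}
    total_added = sum(c.files_added for c in commits)
    largest = max(c.files_added for c in commits)
    times = [c.when for c in commits]
    return {
        # Low values suggest a single author wrote everything.
        "author_count": len({c.author for c in commits}),
        # Share of all new files that arrived in the single largest commit.
        "largest_commit_share": largest / total_added if total_added else 0.0,
        # Time between the first and last commit, in days.
        "timeline_days": (max(times) - min(times)).total_seconds() / 86400,
    }


log = [
    Commit("alice", datetime(2024, 1, 1, 12, 0), 18),
    Commit("alice", datetime(2024, 1, 2, 12, 0), 2),
]
print(history_signals(log))
```

A high `largest_commit_share` combined with a short `timeline_days` corresponds to the "file creation burst" and "dense timeline" patterns listed above.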

Configuration

Customize detection with a YAML config file:

ingestion:
  supported_extensions:
    - .py
    - .js
  excluded_dirs:
    - node_modules
    - dist
  max_file_size_mb: 1

scoring:
  weights:
    stylometry: 0.4
    structural: 0.4
    history: 0.2
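The README does not state how the weights are applied. One plausible reading, assumed here, is a weighted average of per-dimension scores that each lie in the range 0 to 1. The function name `combine_scores` is hypothetical.

```python
# Weights copied from the scoring section of the example config above.
WEIGHTS = {"stylometry": 0.4, "structural": 0.4, "history": 0.2}


def combine_scores(scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of per-dimension scores (assumed to be in [0, 1])."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    # A dimension missing from `scores` contributes 0 in this sketch.
    weighted = sum(w * scores.get(name, 0.0) for name, w in weights.items())
    # Dividing by the total keeps the result in [0, 1] even if the
    # configured weights do not sum to exactly 1.
    return weighted / total_weight


overall = combine_scores({"stylometry": 0.5, "structural": 1.0, "history": 0.0})
print(round(overall, 3))
```

Under this assumption, raising `history` in the config makes repositories with bursty commit logs score higher relative to those flagged only by code style.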

Project Structure

ai-code-detector/
├── src/
│   ├── ingest/          # Repository loading and file filtering
│   ├── analysis/        # Feature extraction
│   ├── model/           # Scoring, embeddings, classifier, explainer
│   ├── report/          # Report generation
│   ├── detector.py      # Phase 1 detector
│   ├── detector_enhanced.py  # Phase 1+2+3 detector
│   ├── cli.py           # Basic CLI
│   └── cli_enhanced.py  # Enhanced CLI
├── configs/             # Configuration files
├── tests/               # Test suite
└── STATUS.txt           # Current project status

Development Status

See STATUS.txt for detailed information about what is implemented, tested, and ready to use.

Limitations

This tool provides probabilistic analysis, not definitive proof:

  • False positives: Well-structured, consistently formatted human code may score high
  • False negatives: AI-generated code that has been edited by hand may score low
  • Easy to fool: Simple refactoring lowers the score of AI-generated code
  • Language bias: Works best on Python; accuracy is limited for other languages

Do NOT use as sole evidence for policy/disciplinary decisions.

Best used as one signal among many:

  • Code review
  • Author interviews
  • Development timeline analysis
  • Project context

License

MIT License

Contact

Issues: https://github.com/BenjaminSRussell/Ai_code_detector/issues
