AI Code Detector

A probabilistic forensics tool for detecting AI-generated code in Git repositories, given either a GitHub URL or a local path. It analyzes stylometric, structural, and git history patterns to estimate the likelihood that code was generated by an AI assistant.

Features

Phase 1 - Heuristic Detection (Complete)

Multi-language support: Python, JavaScript, TypeScript, Go, Rust, C/C++, Java
Stylometric analysis: Comment patterns, naming conventions, formatting consistency
Structural analysis: Code complexity, error handling patterns, dead code detection
Git history analysis: Commit patterns, author diversity, file creation bursts
Detailed reports: JSON and Markdown outputs with file-level breakdowns
Command-line interface (CLI)

Phase 2 - ML Enhancement (Complete)

Code embeddings: MLX-based embeddings for semantic analysis (with hash fallback)
Learned classifier: Trainable ML classifier using embeddings + features
Ensemble detection: Combines heuristics with ML for robust predictions
Flexible backends: Supports MLX (Apple Silicon) and fallback modes
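
As a rough illustration of what a hash-based embedding fallback can look like, the sketch below uses feature hashing to map source text to a fixed-size vector with no ML dependencies. This is an assumption about the general technique, not the project's actual embedder: the function name `hash_embed`, the tokenizer regex, and the default dimension are all hypothetical.

```python
import hashlib
import math
import re


def hash_embed(source: str, dim: int = 64) -> list[float]:
    """Map source text to a unit-length vector via feature hashing.

    Hypothetical sketch: the real embedder may tokenize and hash differently.
    """
    vec = [0.0] * dim
    # Crude tokenizer: identifiers, or any single non-space character.
    for token in re.findall(r"[A-Za-z_]\w*|\S", source):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        # A hashed sign reduces the bias caused by index collisions.
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[index] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    # Empty input yields the zero vector rather than dividing by zero.
    return [v / norm for v in vec] if norm else vec
```

Because the hash is deterministic, the same file always produces the same vector, which keeps this mode fast and reproducible at the cost of semantic quality.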

Phase 3 - Natural Language Explanations (Complete)

AI-powered explanations: Qwen-based natural language explanations
Template fallback: Works without an LLM by falling back to template-based explanations
Per-file reasoning: Explains why specific files were flagged
Enhanced reports: JSON and Markdown with integrated explanations

Installation

git clone https://github.com/BenjaminSRussell/Ai_code_detector.git
cd Ai_code_detector
pip install -r requirements.txt
pip install -e .

Quick Start

Basic Detection (Phase 1)

python -m src.cli https://github.com/user/repo
python -m src.cli /path/to/local/repo

Enhanced Detection (Phase 2+3)

# Enhanced with ML + explanations
python -m src.cli_enhanced ./repo --mode enhanced

# With hash embeddings (fast, no dependencies)
python -m src.cli_enhanced ./repo --embedder hash

# With MLX embeddings (Apple Silicon)
python -m src.cli_enhanced ./repo --embedder mlx

# Disable ML or explanations
python -m src.cli_enhanced ./repo --no-ml
python -m src.cli_enhanced ./repo --no-explanations

How It Works

The detector analyzes code across three dimensions:

1. Stylometric Features

AI-generated code often exhibits:

Boilerplate comments: Generic phrases like "This function does...", "Returns:", "Args:"
Tutorial-style language: "First, we...", "Then, we...", "Let's..."
Generic naming: Overuse of names like result, data, temp, helper
Perfect formatting: Unnaturally consistent indentation and spacing
High duplication: Similar code patterns repeated across files
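
Two of the signals above (boilerplate comments and generic naming) can be sketched with the standard library alone. This is a minimal illustration, not the detector's actual implementation: the phrase patterns, the name set, and the function `stylometric_signals` are assumptions chosen to mirror the examples in the list.

```python
import io
import keyword
import re
import tokenize

# Hypothetical phrase and name lists; the tool's real lists may differ.
BOILERPLATE_PATTERNS = [
    re.compile(r"\bthis function\b", re.IGNORECASE),
    re.compile(r"^(first|then|next|finally), we\b", re.IGNORECASE),
    re.compile(r"\blet's\b", re.IGNORECASE),
]
GENERIC_NAMES = {"result", "data", "temp", "helper"}


def stylometric_signals(source: str) -> dict:
    """Return the share of boilerplate comments and generic identifiers."""
    comments, names = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comments.append(tok.string.lstrip("#").strip())
        elif tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            names.append(tok.string)
    boilerplate = sum(
        1 for c in comments if any(p.search(c) for p in BOILERPLATE_PATTERNS)
    )
    generic = sum(1 for n in names if n in GENERIC_NAMES)
    return {
        "boilerplate_comment_ratio": boilerplate / len(comments) if comments else 0.0,
        "generic_name_ratio": generic / len(names) if names else 0.0,
    }
```

Ratios rather than raw counts keep long and short files comparable. Note that `tokenize` only handles Python source, so other languages would need their own tokenizer.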

2. Structural Features

AI code patterns include:

Over-explained simple functions: Low complexity with verbose docstrings
Generic error handling: Catch-all except Exception blocks
Print-based debugging: print(f"An error occurred: {e}")
Dead code: Unused functions, unreferenced imports
Missing cleanup: Resource allocation without proper cleanup
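
For Python files, the error-handling patterns above can be found by walking the syntax tree. The sketch below counts catch-all handlers and handlers that call print; the function name `error_handling_signals` and the returned keys are hypothetical, and the detector itself may use a different approach.

```python
import ast

# Exception types treated as "catch-all" in this sketch (an assumption).
BROAD_TYPES = {"Exception", "BaseException"}


def error_handling_signals(source: str) -> dict:
    """Count broad except handlers and handlers that call print()."""
    catch_all = 0
    print_handlers = 0
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # A bare "except:" has type None; "except Exception" is a Name node.
        if node.type is None or (
            isinstance(node.type, ast.Name) and node.type.id in BROAD_TYPES
        ):
            catch_all += 1
        calls = (
            n
            for stmt in node.body
            for n in ast.walk(stmt)
            if isinstance(n, ast.Call)
        )
        if any(isinstance(c.func, ast.Name) and c.func.id == "print" for c in calls):
            print_handlers += 1
    return {"catch_all_handlers": catch_all, "print_in_handlers": print_handlers}
```

Working on the AST instead of raw text avoids false matches inside strings and comments.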

3. Git History Features

AI-generated repositories often show:

Commit bursts: Large amounts of code in single commits
Low author diversity: Single author, all code at once
Generic commit messages: "initial commit", "update", "fix"
File creation bursts: Most files created simultaneously
Dense timeline: Entire repository created within hours or days
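
Several of these history signals reduce to simple arithmetic over commit metadata. The sketch below assumes the commits have already been read from git into a small record type; the `Commit` dataclass, its fields, and `history_signals` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Commit:
    """Hypothetical commit record; not the project's actual data model."""
    author: str
    when: datetime
    files_added: int


def history_signals(commits: list[Commit]) -> dict:
    """Summarize author diversity, file creation bursts, and timeline length."""
    if not commits:
        return {
            "author_count": 0,
            "largest_commit_file_share": 0.0,
            "timeline_days": 0.0,
        }
    total_files = sum(c.files_added for c in commits)
    largest = max(c.files_added for c in commits)
    times = [c.when for c in commits]
    span_days = (max(times) - min(times)).total_seconds() / 86400
    return {
        "author_count": len({c.author for c in commits}),
        # Share of all created files that arrived in the single largest commit.
        "largest_commit_file_share": largest / total_files if total_files else 0.0,
        "timeline_days": span_days,
    }
```

A single author, a large file share in one commit, and a short timeline together would match the pattern described above, though many legitimate projects (for example, code imported from another repository) look the same.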

Configuration

Customize detection via YAML config:

ingestion:
  supported_extensions:
    - .py
    - .js
  excluded_dirs:
    - node_modules
    - dist
  max_file_size_mb: 1

scoring:
  weights:
    stylometry: 0.4
    structural: 0.4
    history: 0.2
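
One plausible reading of the scoring weights is a weighted average of per-dimension scores. The sketch below illustrates that interpretation only; whether the detector combines scores this way is an assumption, and `combine_scores` is a hypothetical helper.

```python
# Mirrors the scoring.weights section of the example config.
WEIGHTS = {"stylometry": 0.4, "structural": 0.4, "history": 0.2}


def combine_scores(scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of per-dimension scores, each assumed to be in [0, 1]."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Dimensions missing from `scores` contribute 0.0.
    weighted = sum(weights[name] * scores.get(name, 0.0) for name in weights)
    # Dividing by the total keeps the result in [0, 1] even if the
    # configured weights do not sum to exactly 1.
    return weighted / total
```

For example, scores of 1.0 (stylometry), 0.5 (structural), and 0.0 (history) would combine to roughly 0.6 under these weights.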

Project Structure

Ai_code_detector/
├── src/
│   ├── ingest/          # Repository loading and file filtering
│   ├── analysis/        # Feature extraction
│   ├── model/           # Scoring, embeddings, classifier, explainer
│   ├── report/          # Report generation
│   ├── detector.py      # Phase 1 detector
│   ├── detector_enhanced.py  # Phase 1+2+3 detector
│   ├── cli.py           # Basic CLI
│   └── cli_enhanced.py  # Enhanced CLI
├── configs/             # Configuration files
├── tests/               # Test suite
└── STATUS.txt           # Current project status

Development Status

See STATUS.txt for detailed information about what is implemented, tested, and ready to use.

Limitations

This tool provides probabilistic analysis, not definitive proof:

False positives: Well-structured human code may score high
False negatives: Edited AI code may score low
Easy to fool: Simple refactoring can be enough to evade detection
Language bias: Works best on Python; limited for other languages

Do NOT use this tool as the sole evidence for policy or disciplinary decisions.

Best used as one signal among many:

Code review
Author interviews
Development timeline analysis
Project context

License

MIT License

Contact

Issues: https://github.com/BenjaminSRussell/Ai_code_detector/issues
