A probabilistic forensics tool for detecting AI-generated code in GitHub repositories. Analyzes stylometric, structural, and git history patterns to estimate the likelihood that code was generated by an AI assistant.
- Multi-language support: Python, JavaScript, TypeScript, Go, Rust, C/C++, Java
- Stylometric analysis: Comment patterns, naming conventions, formatting consistency
- Structural analysis: Code complexity, error handling patterns, dead code detection
- Git history analysis: Commit patterns, author diversity, file creation bursts
- Detailed reports: JSON and Markdown outputs with file-level breakdowns
- Command-line interface
- Code embeddings: MLX-based embeddings for semantic analysis (with hash fallback)
- Learned classifier: Trainable ML classifier using embeddings + features
- Ensemble detection: Combines heuristics with ML for robust predictions
- Flexible backends: Supports MLX (Apple Silicon) and fallback modes
- AI-powered explanations: Qwen-based natural language explanations
- Template fallback: Works without LLM using intelligent templates
- Per-file reasoning: Explains why specific files were flagged
- Enhanced reports: JSON and Markdown with integrated explanations
git clone https://github.com/BenjaminSRussell/Ai_code_detector.git
cd Ai_code_detector
pip install -r requirements.txt
pip install -e .
python -m src.cli https://github.com/user/repo
python -m src.cli /path/to/local/repo
# Enhanced with ML + explanations
python -m src.cli_enhanced ./repo --mode enhanced
# With hash embeddings (fast, no dependencies)
python -m src.cli_enhanced ./repo --embedder hash
# With MLX embeddings (Apple Silicon)
python -m src.cli_enhanced ./repo --embedder mlx
# Disable ML or explanations
python -m src.cli_enhanced ./repo --no-ml
python -m src.cli_enhanced ./repo --no-explanations

The detector analyzes code across three dimensions:
In terms of style (stylometric analysis), AI-generated code often exhibits:
- Boilerplate comments: Generic phrases like "This function does...", "Returns:", "Args:"
- Tutorial-style language: "First, we...", "Then, we...", "Let's..."
- Generic naming: Overuse of names like result, data, temp, helper
- Perfect formatting: Unnaturally consistent indentation and spacing
- High duplication: Similar code patterns repeated across files
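A minimal sketch of how two of these stylometric signals might be measured, assuming plain regex matching over a Python source string. The phrase list, the name list, and the function name `stylometric_features` are hypothetical choices for illustration, not the tool's actual implementation.

```python
import re

# Hypothetical phrase and name lists, chosen for illustration only.
BOILERPLATE_PATTERNS = [
    re.compile(r"\bthis function\b", re.IGNORECASE),
    re.compile(r"\bfirst, we\b", re.IGNORECASE),
    re.compile(r"\bthen, we\b", re.IGNORECASE),
    re.compile(r"\blet's\b", re.IGNORECASE),
]
GENERIC_NAMES = {"result", "data", "temp", "helper"}
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def stylometric_features(source: str) -> dict:
    """Return two simple ratios for a Python source string."""
    lines = source.splitlines()
    comment_lines = [ln.strip() for ln in lines if ln.strip().startswith("#")]
    code_lines = [ln for ln in lines if not ln.strip().startswith("#")]

    # Share of comment lines that contain a tutorial-style or boilerplate phrase.
    boilerplate = sum(
        1 for ln in comment_lines
        if any(p.search(ln) for p in BOILERPLATE_PATTERNS)
    )

    # Share of identifier-like tokens (keywords included) that are generic names.
    identifiers = IDENTIFIER.findall("\n".join(code_lines))
    generic = sum(1 for name in identifiers if name in GENERIC_NAMES)

    return {
        "boilerplate_comment_ratio": (
            boilerplate / len(comment_lines) if comment_lines else 0.0
        ),
        "generic_name_ratio": generic / len(identifiers) if identifiers else 0.0,
    }
```

This version only sees whole-line `#` comments and counts keywords as identifiers, so the ratios are rough; a tokenizer-based pass would be more precise.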
Structurally, AI-generated code often includes:
- Over-explained simple functions: Low complexity with verbose docstrings
- Generic error handling: Catch-all except Exception blocks
- Print-based debugging: print(f"An error occurred: {e}")
- Dead code: Unused functions, unreferenced imports
- Missing cleanup: Resource allocation without proper cleanup
In git history, AI-generated repositories often show:
- Commit bursts: Large amounts of code in single commits
- Low author diversity: Single author, all code at once
- Generic commit messages: "initial commit", "update", "fix"
- File creation bursts: Most files created simultaneously
- Dense timeline: Complete repo created in days/hours
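As a rough illustration, the history signals above could be computed from a list of commit records. The `Commit` dataclass, the `history_features` function, and the feature names are hypothetical; the sketch assumes commit metadata has already been extracted from the repository by some other means.

```python
from dataclasses import dataclass
from datetime import datetime

# Messages treated as generic; taken from the examples in the list above.
GENERIC_MESSAGES = {"initial commit", "update", "fix"}


@dataclass
class Commit:
    """Hypothetical commit record; not the project's actual data model."""
    author: str
    timestamp: datetime
    message: str
    files_added: int


def history_features(commits: list[Commit]) -> dict:
    """Summarize author diversity, burstiness, messages, and timeline length."""
    if not commits:
        return {
            "author_count": 0,
            "largest_commit_share": 0.0,
            "generic_message_ratio": 0.0,
            "timeline_days": 0.0,
        }

    total_added = sum(c.files_added for c in commits)
    largest = max(c.files_added for c in commits)
    times = [c.timestamp for c in commits]
    span = max(times) - min(times)
    generic = sum(
        1 for c in commits if c.message.strip().lower() in GENERIC_MESSAGES
    )

    return {
        "author_count": len({c.author for c in commits}),
        # Share of all newly created files that arrived in the largest commit.
        "largest_commit_share": largest / total_added if total_added else 0.0,
        "generic_message_ratio": generic / len(commits),
        "timeline_days": span.total_seconds() / 86400,
    }
```

A single author, a `largest_commit_share` near 1.0, and a timeline of a few days would together match the pattern described above, though many small human projects look the same.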
Customize detection via YAML config:
ingestion:
  supported_extensions:
    - .py
    - .js
  excluded_dirs:
    - node_modules
    - dist
  max_file_size_mb: 1
scoring:
  weights:
    stylometry: 0.4
    structural: 0.4
    history: 0.2

ai-code-detector/
├── src/
│   ├── ingest/              # Repository loading and file filtering
│   ├── analysis/            # Feature extraction
│   ├── model/               # Scoring, embeddings, classifier, explainer
│   ├── report/              # Report generation
│   ├── detector.py          # Phase 1 detector
│   ├── detector_enhanced.py # Phase 1+2+3 detector
│   ├── cli.py               # Basic CLI
│   └── cli_enhanced.py      # Enhanced CLI
├── configs/                 # Configuration files
├── tests/                   # Test suite
└── STATUS.txt               # Current project status
See STATUS.txt for detailed information about what is implemented, tested, and ready to use.
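Returning to the scoring weights in the YAML config: one plausible reading is that the three per-dimension scores are combined as a weighted average. The sketch below assumes each score lies in [0, 1]; the function name `combine_scores` is hypothetical, and the code in src/model/ may combine scores differently.

```python
# Weights copied from the YAML example; score keys are assumed to match them.
WEIGHTS = {"stylometry": 0.4, "structural": 0.4, "history": 0.2}


def combine_scores(scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of per-dimension scores, each expected in [0, 1]."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")

    # Dividing by the total keeps the result in [0, 1] even if the
    # configured weights do not sum exactly to 1.
    return sum(weights[k] * scores[k] for k in weights) / total_weight
```

With these weights, a repository with no usable git history contributes at most 0.2 of the final score from that dimension, so style and structure dominate.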
This tool provides probabilistic analysis, not definitive proof:
- False positives: Well-structured human code may score high
- False negatives: Edited AI code may score low
- Easy to fool: Simple refactoring reduces detection
- Language bias: Works best on Python; limited for other languages
Do NOT use as sole evidence for policy/disciplinary decisions.
Best used as one signal among many:
- Code review
- Author interviews
- Development timeline analysis
- Project context
MIT License
Issues: https://github.com/BenjaminSRussell/Ai_code_detector/issues