Skip to content

Latest commit

 

History

History
493 lines (378 loc) · 12.8 KB

File metadata and controls

493 lines (378 loc) · 12.8 KB

Scanner Usage Guide

The codebase scanner automatically detects your project languages and structure, making it easy to populate the Neo4J graph with your code.

Automatic Language Detection

🎯 New Feature: The scanner now automatically detects project languages using:

  • Build file analysis (package.json, pom.xml, build.gradle, setup.py, pyproject.toml, etc.)
  • Project metadata extraction (name, version, description, dependencies, frameworks)
  • Multi-language project support (e.g., TypeScript frontend + Java backend)
  • Sub-project detection for mono-repositories
  • Fallback to file extension detection when build files aren't available

Scanning a Project

Command Line Scanner

# Auto-detect languages and scan (recommended)
npm run scan /path/to/your/project

# Manual language specification (optional override)
npm run scan /path/to/your/project -- --languages typescript,javascript

# Include test files
npm run scan /path/to/your/project -- --include-tests

# Clear existing project data before scanning
npm run scan /path/to/your/project -- --clear-graph

# Clear ALL database data before scanning (all projects)
npm run scan /path/to/your/project -- --clear-all

# Exclude specific directories
npm run scan /path/to/your/project -- --exclude node_modules,dist,build

Remote Repository Scanning

🌐 New Feature: Scan repositories directly from GitHub, GitLab, and Bitbucket without local cloning:

# Public repositories
npm run scan https://github.com/owner/repo.git
npm run scan https://gitlab.com/owner/repo.git
npm run scan https://bitbucket.org/owner/repo.git

# Private repositories (requires authentication - see installation guide)
GITHUB_TOKEN=ghp_xxx npm run scan https://github.com/private/repo.git

# Specific branches or tags
npm run scan https://github.com/owner/repo.git -- --branch develop
npm run scan https://github.com/owner/repo.git -- --branch v2.0.0

# SSH URLs (requires SSH key setup)
npm run scan git@github.com:owner/repo.git

# With additional options
npm run scan https://github.com/owner/repo.git -- --languages typescript,java --analyze

Understanding Clear Options:

  • --clear-graph: Clears only the current project's data (useful for re-scanning a single project in a multi-project setup)
  • --clear-all: Clears the entire database (all projects) - use when starting fresh or resolving data conflicts

Auto-Detection Examples

# TypeScript project (detects from package.json)
npm run scan /path/to/react-app
# Output: ✅ TypeScript project detected, Framework: React

# Java project (detects from pom.xml or build.gradle)
npm run scan /path/to/spring-app  
# Output: ✅ Java project detected, Build system: Maven, Framework: Spring Boot

# Python project (detects from setup.py or pyproject.toml)
npm run scan /path/to/django-app
# Output: ✅ Python project detected, Build system: Poetry, Framework: Django

# Multi-language project
npm run scan /path/to/fullstack-app
# Output: ✅ Multi-language project detected: TypeScript (frontend), Java (backend)

# Remote repository examples
npm run scan https://github.com/microsoft/vscode.git
# Output: ✅ TypeScript project detected, Build system: npm, Framework: Electron

npm run scan https://github.com/spring-projects/spring-boot.git  
# Output: ✅ Java project detected, Build system: Gradle, Framework: Spring Boot

Via MCP Tools

Once connected to an AI tool, you can use:

Use the scan_dir tool to scan my current project at /path/to/project

Parameters for scan_dir:

  • directory_path: Path to your project root
  • languages: Array of languages (auto-detected if not specified)
  • exclude_paths: Paths to exclude (auto-suggested based on detected languages)
  • include_tests: Include test files (true/false, default based on project type)
  • clear_existing: Clear existing graph data (true/false)
  • max_depth: Maximum directory depth to scan (10)

Remote Repository Tools:

Use the scan_remote_repo tool to analyze https://github.com/owner/repo.git

Parameters for scan_remote_repo:

  • repository_url: Git repository URL (GitHub, GitLab, Bitbucket)
  • branch: Branch name to scan (default: main/master)
  • project_id: Custom project identifier (optional)
  • languages: Array of languages (auto-detected if not specified)
  • include_tests: Include test files (true/false)
  • clear_existing: Clear existing graph data (true/false)
  • use_cache: Use cached repository if available (true/false)
  • shallow_clone: Use shallow clone for faster scanning (true/false)

Auto-Detection Features:

  • Languages are automatically detected from build files and file extensions
  • Project metadata (name, version, description) extracted from build files
  • Framework detection (React, Spring Boot, Django, etc.) from dependencies
  • Language-specific exclude paths automatically suggested
  • Mono-repository and sub-project detection

Supported Languages

  • TypeScript (.ts, .tsx)
  • JavaScript (.js, .jsx)
  • Java (.java)
  • Python (.py)
  • C# (.cs) - Coming soon

What Gets Scanned

The scanner extracts:

Node Types:

  • Classes and interfaces
  • Methods and functions
  • Fields and properties
  • Packages and modules
  • Enums and exceptions

Relationships:

  • Inheritance (extends)
  • Interface implementation (implements)
  • Method calls (calls)
  • Field references (references)
  • Containment (contains)
  • Package membership (belongs_to)

Scanner Configuration

File Inclusion/Exclusion

Default Exclusions:

  • node_modules/
  • dist/, build/, out/
  • .git/
  • Test files (unless --include-tests is specified)
  • Hidden files and directories (starting with .)

Custom Exclusions:

# Exclude additional directories
npm run scan /path/to/project -- --exclude vendor,tmp,cache

Include Test Files:

# Include test files in analysis
npm run scan /path/to/project -- --include-tests

Language Selection

Automatic Detection (Recommended): The scanner now uses intelligent language detection:

  1. Build File Analysis - Analyzes package.json, pom.xml, build.gradle, setup.py, pyproject.toml, etc.
  2. Metadata Extraction - Extracts project name, version, dependencies, and frameworks
  3. Multi-Language Support - Handles projects with multiple languages (e.g., frontend + backend)
  4. Sub-Project Detection - Identifies mono-repositories with multiple sub-projects
  5. File Extension Fallback - Uses file extensions when build files aren't available

Manual Override (When Needed):

# Override auto-detection for specific languages only
npm run scan /path/to/project -- --languages java,python

# Force TypeScript only (useful for mixed projects)
npm run scan /path/to/project -- --languages typescript

Depth Control

# Limit scan depth to prevent deep recursion
npm run scan /path/to/project -- --max-depth 5

Scanner Output

Successful Scan with Auto-Detection

=== CodeRAG Project Scanner ===
Scanning: /path/to/my-react-app
Project ID: my-react-app

🔍 Auto-detecting project structure...
✅ TypeScript project detected
📋 Project: my-react-app v1.2.0
🏗️ Framework: React
📦 Build system: npm
💡 Recommendations: Include test files, exclude node_modules, dist

📁 Scanning files...
  TypeScript: 45 files
  JavaScript: 12 files
  
📦 Processing entities...
  Classes: 23
  Interfaces: 8
  Methods: 156
  Fields: 89
  
🔗 Building relationships...
  Inheritance: 12
  Implementations: 15
  Calls: 234
  References: 178
  
✅ Scan completed successfully!

Project Summary:
- Project Name: my-react-app
- Version: 1.2.0
- Framework: React
- Total entities: 276
- Total relationships: 439
- Languages: TypeScript, JavaScript
- Scan duration: 2.3s

Scan Errors

Common Issues:

  1. File Permission Errors:
⚠️  Warning: Cannot read file /path/to/file.ts (permission denied)
  1. Syntax Errors:
⚠️  Warning: Parse error in /path/to/file.ts:line 45
  1. Unsupported Files:
📁 Skipping unsupported file: /path/to/file.xml

Best Practices

1. Project Preparation

Before Scanning:

  • Ensure the project compiles successfully
  • Fix any major syntax errors
  • Consider excluding generated code directories
  • Remove or exclude large vendor directories

2. Initial Scan Strategy

New Project (First Time):

# Start with a clean database
npm run scan /path/to/project -- --clear-all

Re-scanning Project:

# Update existing project data
npm run scan /path/to/project -- --clear-graph

Adding to Multi-Project Database:

# Preserve other projects, add new one
npm run scan /path/to/new-project

3. Performance Optimization

Large Projects:

# Limit depth and exclude unnecessary directories
npm run scan /path/to/large-project -- \
  --max-depth 8 \
  --exclude node_modules,dist,build,vendor,tmp

Incremental Scanning:

  • For large codebases, consider scanning modules separately
  • Use project isolation to manage different components
  • Monitor scan duration and adjust exclusions as needed

4. Multi-Language Projects

Automatic Multi-Language Detection:

# Auto-detects all languages in polyglot projects
npm run scan /path/to/fullstack-project
# Output: ✅ Multi-language project: TypeScript (frontend), Java (backend), Python (scripts)

Manual Multi-Language Specification:

# Override auto-detection for specific languages
npm run scan /path/to/polyglot-project -- \
  --languages typescript,java,python \
  --include-tests

Mono-Repository Handling:

# Auto-detects sub-projects in mono-repositories
npm run scan /path/to/monorepo
# Output: 🏗️ Mono-repository detected with 3 sub-projects
# Suggestion: Consider scanning sub-projects separately for better organization

Troubleshooting

Scanner Won't Start

Check Prerequisites:

# Verify Node.js version
node --version  # Should be 18+

# Verify build
npm run build

# Check Neo4J connection
neo4j status  # or check Docker container

Scanner Fails to Parse Files

Common Solutions:

  1. Update Dependencies:
npm install
npm run build
  1. Check File Encoding:

    • Ensure files are UTF-8 encoded
    • Check for byte order marks (BOM)
  2. Syntax Validation:

    • Fix syntax errors in source files
    • Consider excluding problematic files

Performance Issues

Slow Scanning:

  1. Reduce Scope:
# Exclude large directories
npm run scan /path/to/project -- --exclude node_modules,dist,vendor
  1. Limit Depth:
# Reduce maximum directory depth
npm run scan /path/to/project -- --max-depth 6
  1. Language Filtering:
# Scan only specific languages
npm run scan /path/to/project -- --languages typescript

Database Issues

Connection Failures:

  • Verify Neo4J is running
  • Check .env configuration
  • Test connection manually

Memory Issues:

  • Increase Neo4J heap size
  • Clear database before large scans
  • Use --clear-all for fresh start

Remote Repository Issues

Authentication Failures:

# Error: 401 Unauthorized
# Solution: Check your authentication tokens in .env
GITHUB_TOKEN=ghp_xxx npm run scan https://github.com/private/repo.git

Network Issues:

# Error: Failed to clone repository
# Solutions:
# 1. Check internet connection
# 2. Verify repository URL
# 3. Try SSH URL if HTTPS fails
npm run scan git@github.com:owner/repo.git

Branch Not Found:

# Error: Branch 'feature-branch' not found
# Solution: Verify branch exists
npm run scan https://github.com/owner/repo.git -- --branch main

Large Repository Timeouts:

# Solution: Use shallow clone for faster processing
npm run scan https://github.com/large/repo.git -- --shallow-clone

Advanced Usage

Custom Scanner Configuration

For advanced users, you can modify scanner behavior by:

  1. Environment Variables:
# Set custom scan timeout
export SCANNER_TIMEOUT=300000  # 5 minutes

# Enable debug logging
export LOG_LEVEL=debug
  1. Programmatic Usage:
// Use scanner in your own scripts
const { Scanner } = require('./build/scanner');

const scanner = new Scanner({
  neo4jConfig: {
    uri: 'bolt://localhost:7687',
    user: 'neo4j',
    password: 'password'
  },
  scanConfig: {
    languages: ['typescript', 'java'],
    excludePaths: ['node_modules', 'dist'],
    includeTests: false,
    maxDepth: 10
  }
});

await scanner.scanDirectory('/path/to/project');

Batch Scanning

# Scan multiple projects in sequence
for project in /path/to/projects/*; do
  echo "Scanning $project"
  npm run scan "$project"
done

Next Steps