GitHub Repository Scraper - Architecture & Technical Documentation

Purpose and Objective
Work Completed So Far
Architecture and Design Patterns
Core Logic and Strategies
Tech Stack
Integration Details
Development Approach
General Observations

1. Purpose and Objective

The GitHub Repository Scraper is a full-stack application designed to analyze GitHub repositories by extracting commit history and generating a leaderboard of contributors ranked by their commit counts.

Primary Use Cases

Contributor Analytics: Identify top contributors in any GitHub repository
Repository Metrics: Gather statistics about repository activity and contribution patterns
Team Analysis: Understand contribution distribution across team members
Public/Private Repository Support: Handle both public and authenticated private repositories through GitHub Personal Access Tokens

Core Value Proposition

The application enables users to submit repository URLs through a user-friendly interface and asynchronously process them to generate comprehensive contributor leaderboards. The system is designed for scalability, handling large repositories efficiently through background job processing.

2. Work Completed So Far

Backend Implementation

✅ Core API Server

Fastify HTTP Server: Multi-endpoint REST API
- GET /health - Health check endpoint
- POST /leaderboard - Submit repository for processing
- GET /leaderboard - Retrieve contributor leaderboard
- GET /repositories - List all repositories with their states
Repository State Management: Dynamic handling of repository states (pending, in_progress, completed, failed)
URL Validation & Normalization: Conversion of SSH and HTTPS URLs to standard format

✅ Asynchronous Processing

Task Queue System: Bull-based job queue with Redis backing
Worker Process: Separate container for background repository processing
Non-blocking API: Immediate HTTP responses while processing happens asynchronously

✅ Git Operations

Bare Repository Cloning: Space-efficient cloning using --bare flag
Incremental Updates: git fetch for existing repositories
simple-git Integration: Node.js Git client for all operations

✅ Database & Persistence

PostgreSQL Database: Relational database for persistent storage
Prisma ORM: Type-safe database access with migrations
Data Models: Repository, Contributor, and RepositoryContributor (join table)
Automated Migrations: Prisma migrations run on container startup

✅ GitHub Integration

API Integration: User profile resolution via GitHub Search API
Token Authentication: Support for GitHub Personal Access Tokens
No-Reply Email Handling: Smart extraction of usernames from GitHub no-reply emails
Profile Enrichment: Automatic fetching of GitHub usernames and profile URLs

✅ Performance Optimizations

In-Memory Caching: Contributor caching during leaderboard generation
Database Caching: Persistent storage of resolved contributors
24-Hour Refresh: GitHub profile data cached with smart refresh logic

✅ Error Handling

Network Error Handling: Graceful handling of connection issues
Permission Error Handling: Clear error messages for access denied scenarios
Repository Not Found: Proper error responses for invalid repositories
State Management: Failed repositories tracked for potential retry

Frontend Implementation

✅ Modern UI Framework

Next.js 15: React meta-framework with SSR capabilities
React 19: Latest React features and improvements
TypeScript: Full type safety across the frontend

✅ User Interface Components

Repository Form: Submit repositories with optional private repo authentication
Repositories Table: Display all processed repositories with status badges
Leaderboard Display: Interactive contributor ranking visualization
Search Functionality: Filter repositories by URL
Status Badges: Visual indicators for repository processing states

✅ State Management

React Query: Server state management with automatic caching
Context API: Local UI state management (selected repo, search term)
Automatic Refetching: Smart polling based on repository states

✅ User Experience

Real-time Updates: Automatic polling every 2 seconds when jobs are active
Loading States: Skeleton loaders and loading indicators
Error Handling: User-friendly error messages and retry options
Responsive Design: Works seamlessly on desktop and mobile devices

✅ Styling & Components

Tailwind CSS: Utility-first CSS framework
Radix UI: Accessible, headless component library
Lucide Icons: Modern icon library
React Hook Form: Efficient form state management

Infrastructure

✅ Containerization

Docker: Containerized application components
Docker Compose: Multi-container orchestration
Volume Management: Persistent storage for repositories and database
Hot Reload: Development mode with automatic code reloading

✅ Service Architecture

Backend Container: Fastify API server
Frontend Container: Next.js application
Worker Container: Background job processor
PostgreSQL Container: Database server
Redis Container: Queue and cache server

3. Architecture and Design Patterns

System Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                     Frontend (Next.js)                       │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ React Components                                     │   │
│  │ - RepositoryForm                                     │   │
│  │ - RepositoriesTable                                  │   │
│  │ - LeaderBoard                                        │   │
│  └─────────────────────────────────────────────────────┘   │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ State Management                                     │   │
│  │ - React Query (Server State)                        │   │
│  │ - Context API (Local State)                          │   │
│  └─────────────────────────────────────────────────────┘   │
└────────────────────┬────────────────────────────────────────┘
                     │ HTTP (Axios)
                     │ /api/* → Rewritten to backend
                     │
┌────────────────────▼────────────────────────────────────────┐
│                 Backend API (Fastify)                        │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ REST Endpoints                                       │   │
│  │ - POST /leaderboard (submit)                        │   │
│  │ - GET /leaderboard (retrieve)                       │   │
│  │ - GET /repositories (list)                          │   │
│  │ - GET /health (status)                              │   │
│  └─────────────────────────────────────────────────────┘   │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Services                                             │   │
│  │ - queueService.ts (Bull queue)                      │   │
│  │ - repoService.ts (Git operations)                   │   │
│  └─────────────────────────────────────────────────────┘   │
└────┬────────────────────┬──────────────────┬────────────────┘
     │                    │                  │
  Prisma         Redis Queue          Prisma
  (Query)        (Bull Job)           (Query)
     │                    │                  │
┌────▼────────────────────▼──────────────────▼────────────────┐
│              Worker Process                                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Job Processing                                      │   │
│  │ 1. Update state: in_progress                       │   │
│  │ 2. syncRepository()                                │   │
│  │ 3. generateLeaderboard()                           │   │
│  │ 4. Update state: completed/failed                  │   │
│  └─────────────────────────────────────────────────────┘   │
└────┬────────────────────┬──────────────────┬────────────────┘
     │                    │                  │
┌────▼────────┐  ┌────────▼────────┐  ┌─────▼────────┐
│  PostgreSQL │  │     Redis       │  │ File System  │
│             │  │  (Job Queue)    │  │  (Bare Repos)│
│  - Repos    │  │  - Job State    │  │  /data/repos │
│  - Users    │  │  - Metadata     │  │              │
│  - Relations│  └─────────────────┘  └──────────────┘
└─────────────┘

Design Patterns Used

1. Producer-Consumer Pattern

Producer: Fastify API server accepts repository submission requests and enqueues jobs
Consumer: Worker process dequeues jobs and processes repositories asynchronously
Benefits:
- Decouples request handling from processing
- Enables horizontal scalability
- Non-blocking API responses

2. Repository Pattern

Database access abstracted through Prisma ORM
All data queries go through service layer (repoService.ts)
Benefits:
- Easy testing (can mock database layer)
- Database agnosticism
- Centralized data access logic

3. Service Layer Pattern

queueService.ts: Encapsulates all queue operations
repoService.ts: Handles repository syncing, leaderboard generation, and contributor management
repositoryService.ts (frontend): Encapsulates API communication
Benefits: Separation of concerns, reusable business logic

4. Singleton Pattern

Prisma client (prisma.ts) instantiated once and reused
Redis queue (repoQueue) created as single shared instance
Benefits:
- Prevents connection pool exhaustion
- Consistent state across application
- Resource efficiency

5. Context Pattern (Frontend)

RepositoryContext: Provides global access to repository state
Manages repositories, selection, and search state
Implements automatic refetching logic based on processing status
Benefits: Avoids prop drilling, centralized state management

6. Observer Pattern (Frontend)

React Query observes server state and automatically refetches
Context monitors repository states and adjusts polling behavior
Benefits: Reactive updates, automatic cache invalidation

7. State Machine Pattern

Repositories follow defined state transitions:

pending → in_progress → completed
                      ↘ failed

Backend endpoints respond differently based on current state
Benefits:
- Predictable state flow
- Prevents inconsistent states
- Easier debugging and testing

8. Caching Strategy (Multi-Layer)

In-Memory Cache: User cache during leaderboard generation (Map)
Database Cache: Contributor records cached in PostgreSQL
Redis Cache: Bull queue maintains job state and metadata
Benefits:
- Reduced database queries
- Faster processing
- Better performance for large repositories

4. Core Logic and Strategies

A. Repository Processing Workflow

User submits repo URL
         │
         ▼
┌─────────────────────────────────┐
│ Validate & Normalize URL        │
│ - Check GitHub URL format       │
│   (isValidGitHubUrl)            │
│ - Convert SSH to HTTPS format   │
│   (normalizeRepoUrl)            │
└────────────┬────────────────────┘
             │
             ▼
┌─────────────────────────────────┐
│ Check Database                  │
│ - Query by normalized URL       │
│ - If exists, return current     │
│   state                         │
└────────────┬────────────────────┘
             │
             ▼
┌─────────────────────────────────┐
│ Create Repository Record        │
│ - Insert into database          │
│ - Set state: pending            │
│ - Set lastAttempt: now          │
└────────────┬────────────────────┘
             │
             ▼
┌─────────────────────────────────┐
│ Enqueue Job                     │
│ - Add to Bull queue             │
│ - Include token if provided     │
│ - Job payload: {dbRepository,  │
│   token}                        │
└────────────┬────────────────────┘
             │
             ▼
┌─────────────────────────────────┐
│ Return HTTP Response            │
│ - 202 if pending/in_progress    │
│ - 200 if completed              │
│ - 500 if failed                 │
└─────────────────────────────────┘

B. Repository Synchronization (syncRepository)

Receive repository to sync
        │
        ▼
  Update state: in_progress
        │
        ▼
┌──────────────────────────────────┐
│ Determine Repository Path        │
│ /data/repos/{pathName}           │
└────────────┬─────────────────────┘
             │
             ▼
┌──────────────────────────────────┐
│ Check if repo exists locally     │
├──────────────────────────────────┤
│ If EXISTS:                       │
│ - git.cwd(repoPath)              │
│ - git.fetch()                    │
│ - Update existing repository     │
│                                  │
│ If NOT EXISTS:                   │
│ - Build authenticated URL        │
│   (if token provided)            │
│ - git.clone(url, repoPath,       │
│   ['--bare'])                    │
│ - Store in persistent volume      │
└──────────────────────────────────┘
        │
        ▼
   Return success/error

Key Strategy:

Uses bare repository cloning (--bare flag) to save disk space
No working directory needed (only commit history)
Incremental updates via git fetch for existing repos
Token embedded in URL for private repository access

C. Leaderboard Generation (generateLeaderboard)

Open cloned repository
        │
        ▼
Initialize caches
- usersCache: Map<string, Contributor>
- repositoryContributorCache: Map<id, {id, commitCount}>
        │
        ▼
Get full commit log
- git.log() → returns all commits
        │
        ▼
┌───────────────────────────────────┐
│ For each commit:                  │
│ 1. Extract author_email           │
│ 2. Skip if email is null          │
│ 3. Resolve contributor            │
│    (getDbUser)                    │
│ 4. Update commit count in cache   │
│    - Increment if exists          │
│    - Set to 1 if new              │
└───────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────┐
│ Bulk Upsert to Database           │
│ - Prisma transaction              │
│ - For each cached contributor:    │
│   - Upsert RepositoryContributor │
│   - Update or create record       │
│   - Set commitCount              │
└───────────────────────────────────┘
        │
        ▼
   Return sorted leaderboard
   (ordered by commitCount DESC)

Performance Optimizations:

In-memory caching prevents repeated database queries
Bulk transaction reduces database round-trips from O(n) to O(1)
Single pass through commit log

D. Contributor Resolution Strategy (getDbUser)

The system uses a sophisticated multi-step resolution strategy:

Receive author email
        │
        ▼
┌─────────────────────────────────┐
│ Check In-Memory Cache            │
│ - By email                       │
│ - By extracted username         │
│   (if no-reply email)           │
└────────┬────────────────────────┘
         │ Found → Return cached
         │
         ▼ Not found
┌─────────────────────────────────┐
│ Query Database                  │
│ - Find by email                 │
│ - Find by username              │
│   (if extracted from email)     │
└────────┬────────────────────────┘
         │ Found existing
         ▼
┌─────────────────────────────────┐
│ Check Refresh Requirement       │
│ - Is no-reply email?            │
│   → Skip GitHub API             │
│ - Updated < 24h ago?            │
│   → Return cached               │
│ - Updated > 24h ago?            │
│   → Fetch from GitHub API       │
└────────┬────────────────────────┘
         │ Not found in DB
         ▼
┌─────────────────────────────────┐
│ Query GitHub API                │
│ - Search: q={email}+in:email    │
│ - Extract: login, html_url     │
│ - Handle rate limits            │
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│ Upsert to Database              │
│ - Create or update contributor  │
│ - Set username, email,          │
│   profileUrl                    │
└────────┬────────────────────────┘
         │
         ▼
    Cache locally
    Return user

Special Handling for GitHub No-Reply Emails:

Email format: {id}+{username}@users.noreply.github.com
Extracts username: email.split('@')[0].split('+')[1]
Constructs profile URL: https://github.com/{username}
Skips GitHub API call (saves rate limit)

Error Handling:

API rate limit errors logged but don't fail processing
Non-public users handled gracefully
Falls back to email-only storage if API fails

5. Tech Stack

Backend

Category	Technology	Version	Purpose
Runtime	Node.js	Latest LTS	Server-side JavaScript runtime
Language	TypeScript	5.6.3	Type-safe JavaScript
Web Framework	Fastify	5.1.0	Lightweight, high-performance HTTP server
ORM	Prisma	5.22.0	Type-safe database access with migrations
Database	PostgreSQL	15	Relational database for persistent storage
Queue	Bull	4.16.4	Job queue library with Redis backing
Cache/Message	Redis	6 (Alpine)	In-memory cache and message broker
Git Operations	simple-git	3.27.0	Node.js Git client for repository operations
HTTP Client	Axios	1.7.7	Promise-based HTTP client for GitHub API
Environment	dotenv	16.4.5	Environment variable management
Dev Tools	ESLint, Prettier	Latest	Code quality and formatting
Process Manager	nodemon	3.1.7	Development auto-reload

Frontend

Category	Technology	Version	Purpose
Framework	Next.js	15.0.3	React meta-framework with SSR
Runtime	React	19 (RC)	UI library with latest features
Language	TypeScript	5.x	Type-safe JavaScript
State Management	React Query	5.61.0	Server state management and caching
State	React Context API	Built-in	Local UI state management
Form Library	React Hook Form	7.53.2	Efficient form state management
UI Components	Radix UI	Latest	Headless, accessible component library
Styling	Tailwind CSS	3.4.1	Utility-first CSS framework
Icons	Lucide React	0.460.0	Modern icon library
HTTP Client	Axios	1.7.7	API communication
Validation	Zod	3.23.8	Schema validation (ready for use)
Date Utils	date-fns	3.6.0	Date manipulation utilities

Infrastructure

Component	Technology	Purpose
Containerization	Docker	Application containerization
Orchestration	Docker Compose	Multi-container orchestration
Volume Management	Docker Volumes	Persistent storage for repos and database
Networking	Docker Networks	Inter-container communication

Development Workflow

Local Development
    │
    ▼
Code Changes (TypeScript)
    │
    ├─→ Backend: Nodemon watches src/
    │   └─→ Auto-restarts on change
    │
    └─→ Frontend: Next.js HMR
        └─→ Hot Module Replacement
    │
    ▼
TypeScript Compilation
    │
    ▼
Running in Hot-Reload Mode
    │
    ▼
Live Testing in Browser / API

6. Integration Details

A. GitHub API Integration

Endpoint Used: GET https://api.github.com/search/users?q={email}+in:email

When It's Called:

During contributor resolution if email is not a no-reply address
When a cached user hasn't been refreshed in 24 hours
For enriching contributor profiles with GitHub usernames and profile URLs

Authentication:

Bearer token passed via Authorization header
Token from GITHUB_TOKEN environment variable
Optional: Token from request headers for private repos

Rate Limits:

Unauthenticated: 60 requests/hour
Authenticated: 5,000 requests/hour
Current implementation logs rate limit errors but doesn't retry

Error Handling:

Rate limit errors: Logged, processing continues
Non-public users: Handled gracefully, falls back to email-only
Network errors: Caught and logged, doesn't fail entire processing

Example Request:

const response = await axios.get(
  `https://api.github.com/search/users?q=${author_email}+in:email`,
  { headers: { Authorization: `Bearer ${process.env.GITHUB_TOKEN}` } }
);

B. Database Integration (PostgreSQL + Prisma)

Connection String: postgresql://user:password@db:5432/github_scraper

Data Models:

model Repository {
  id                      Int                      @id @default(autoincrement())
  url                     String                   @unique
  pathName                String
  state                   String                   @default("pending")
  lastAttempt             DateTime?
  lastProcessedAt         DateTime?
  createdAt               DateTime                 @default(now())
  updatedAt               DateTime                 @updatedAt
  contributors            RepositoryContributor[]
}

model Contributor {
  id                      Int                      @id @default(autoincrement())
  username                String?                  @unique
  email                   String?                  @unique
  profileUrl              String?
  createdAt               DateTime                 @default(now())
  updatedAt               DateTime                 @updatedAt
  repositories            RepositoryContributor[]
}

model RepositoryContributor {
  id                      Int                      @id @default(autoincrement())
  repository              Repository               @relation(fields: [repositoryId], references: [id])
  repositoryId            Int
  contributor             Contributor              @relation(fields: [contributorId], references: [id])
  contributorId           Int
  commitCount             Int                      @default(0)

  @@unique([repositoryId, contributorId])
}

Key Relationships:

Repository ↔ RepositoryContributor: One-to-Many
Contributor ↔ RepositoryContributor: One-to-Many
RepositoryContributor: Join table with composite unique constraint

Migrations:

Automated via prisma migrate deploy on container startup
Located in /backend/prisma/migrations/
Schema evolution tracked in schema.prisma
Migration history preserved for rollback capability

Query Patterns:

Find Repository: prisma.repository.findUnique({ where: { url } })
Create Repository: prisma.repository.create({ data: {...} })
Update State: prisma.repository.update({ where: { id }, data: { state } })
Bulk Upsert: prisma.$transaction([...upsert operations])

C. Redis Integration (Bull Queue)

Connection: redis://redis:6379 (Docker network)

Queue Name: repository-processing

Queue Configuration:

export const repoQueue = new Queue('repository-processing', {
  redis: {
    host: REDIS_HOST,
    port: REDIS_PORT,
  },
});

Job Payload:

{
  dbRepository: {
    id: number,
    url: string,
    pathName: string,
    state: string,
    lastAttempt: Date,
    lastProcessedAt: Date,
    createdAt: Date,
    updatedAt: Date
  },
  token: string | null // GitHub authentication token
}

Job Processing:

Consumer reads jobs from queue (repoQueue.process())
Executes synchronously in worker process
Updates repository state in database
Marks jobs as completed or failed
Errors are caught and logged

Queue Features:

Job persistence (survives Redis restart)
Job retry capability (infrastructure ready)
Job state tracking
Priority support (can be added)

D. File System Integration

Repository Storage: /data/repos/ (Docker volume: repo_volume)

Storage Strategy:

Bare repositories (no working directory)
Named after repository's path component
Example: https://github.com/aalexmrt/github-scraper → /data/repos/github-scraper
Persists across container restarts via Docker volume

Volume Configuration:

volumes:
  repo_volume:
    driver: local

Mount Points:

Backend container: /data/repos
Worker container: /data/repos
Shared between containers for consistency

E. API-to-API Communication

Frontend → Backend:

Next.js rewrite proxy: /api/* → http://backend:3000/*
Axios-based HTTP requests
Optional Bearer token in Authorization header
CORS handled by Next.js proxy

Backend → GitHub:

Axios GET requests
GitHub API token from environment or request
Error handling with try-catch blocks
Rate limit awareness

Inter-Container Communication:

Docker Compose network: github-scraper_default
Service names as hostnames: app, db, redis
Port mapping for external access

7. Development Approach

Key Decisions and Trade-offs

1. Bare Repository Cloning

Decision: Use git clone --bare instead of standard clone
Trade-off:
- ✅ Saves disk space (no working directory)
- ✅ Faster cloning
- ❌ Slightly more complex path operations
Rationale: Only need commit history, don't need working files; scales better with multiple repos

2. In-Memory Contributor Caching During Processing

Decision: Cache users in Map during single leaderboard generation
Trade-off:
- ✅ Fast processing (no repeated DB queries)
- ✅ Reduces database load
- ❌ Memory usage during large repos (acceptable)
Rationale: Processing happens once; reduces database load and improves speed significantly

3. Asynchronous Queue-Based Processing

Decision: Use Bull/Redis queue instead of immediate processing
Trade-off:
- ✅ Non-blocking API responses
- ✅ Horizontal scalability
- ✅ Better error handling
- ❌ Delayed processing (acceptable for async operations)
Rationale: Large repositories can take minutes to process; don't block HTTP request thread

4. URL Normalization Strategy

Decision: Normalize all URLs to lowercase HTTPS format
Trade-off:
- ✅ Prevents duplicate processing
- ✅ Single source of truth
- ❌ Requires URL transformation logic
Rationale: Prevents duplicate processing of same repo with different URL formats

5. 24-Hour GitHub Profile Refresh

Decision: Cache GitHub profile data locally and only refresh if stale
Trade-off:
- ✅ Reduced API calls
- ✅ Respects rate limits
- ❌ Potentially outdated profile information (acceptable)
Rationale: GitHub API has rate limits; contributors unlikely to change profiles frequently

6. React Query for Frontend State

Decision: Use React Query instead of Redux or other state managers
Trade-off:
- ✅ Smaller bundle size
- ✅ Automatic caching
- ✅ Built-in refetching
- ❌ Less flexibility for complex derived state (not needed)
Rationale: Primarily managing server state; React Query excels at this use case

7. Context API for Local State

Decision: Use Context API instead of Redux for UI-local state
Trade-off:
- ✅ Simpler setup
- ✅ No additional dependencies
- ❌ Can cause unnecessary re-renders (mitigated with proper usage)
Rationale: Limited local state (selected repo, search term); Context is sufficient

8. Automatic Polling Refresh

Decision: Poll every 2 seconds when jobs are queued/in-progress
Trade-off:
- ✅ Simpler implementation
- ✅ Works reliably
- ❌ Less efficient than WebSocket (acceptable for MVP)
Rationale: For MVP, polling is adequate; WebSocket noted as future enhancement

9. Repository State Machine

Decision: Implement strict state transitions
Trade-off:
- ✅ Prevents inconsistent states
- ✅ Easier debugging
- ❌ Requires careful state management
Rationale: Clear, predictable state flow easier to debug and test

10. Docker Multi-Container Setup

Decision: Separate backend, frontend, worker, database, and Redis containers
Trade-off:
- ✅ Independent scaling
- ✅ Easier deployment
- ✅ Microservices-friendly
- ❌ More complex orchestration
Rationale: Worker can be scaled independently; better resource utilization

8. General Observations

Notable Implementation Details

1. Graceful Error Handling

The backend implements thoughtful error handling with specific error messages:

if (error.message.includes('Could not resolve host')) {
  throw new Error(
    `Network error: Unable to resolve host for ${dbRepository.url}`
  );
} else if (error.message.includes('Repository not found')) {
  throw new Error(`Repository not found: ${dbRepository.url}`);
} else if (error.message.includes('Permission denied')) {
  throw new Error(
    `Permission denied: Ensure you have access to the repository ${dbRepository.url}`
  );
}

This helps frontend users understand what went wrong and how to fix it.

2. Efficient Transaction Handling

Leaderboard generation uses Prisma transactions to bulk-upsert contributor records:

await prisma.$transaction(
  Array.from(repositoryContributorCache.values()).map((contributor) =>
    prisma.repositoryContributor.upsert({...})
  )
);

This ensures atomicity and reduces database round-trips from O(n) to O(1).

3. Smart Contributor Identification

The system handles three types of email scenarios:

GitHub No-Reply: {id}+{username}@users.noreply.github.com - extracts username
Private Email: Regular email - queries GitHub API
Unknown: Stores raw email and marks as unresolved

4. Docker Volume Strategy

volumes:
  repo_volume:/data/repos      # Persistent storage for cloned repos
  pg_data:/var/lib/postgresql/data  # Database persistence

Ensures data survives container restarts, critical for production deployments.

5. Frontend Auto-Refresh Intelligence

const hasQueuedRepositories = repositories?.some(
  (repository: any) =>
    repository.state === 'pending' || repository.state === 'in_progress'
);

if (hasQueuedRepositories && !isRefetching) {
  setIsRefetching(true); // Start polling
} else if (!hasQueuedRepositories && isRefetching) {
  setIsRefetching(false); // Stop polling
}

Stops unnecessary polling once all jobs complete, saving bandwidth and server resources.

6. Status Code Semantics

202 Accepted: Repository is being processed (standard for async operations)
200 OK: Repository processing complete, leaderboard available
500 Internal Server Error: Processing failed
404 Not Found: Repository not found or not submitted yet
400 Bad Request: Invalid URL or missing parameters

7. Token Handling Security

Frontend doesn't store GitHub tokens:

// Only passed in request headers, not stored
isPrivate ? { headers: { Authorization: `Bearer ${apiToken}` } } : {};

User message confirms: "Your token will only be used for this request and won't be stored"

8. TypeScript Throughout

Both backend and frontend use TypeScript with strict configs:

Provides type safety across API boundaries
Better IDE support and refactoring capabilities
Catches errors at compile time
Self-documenting code

Strengths of the Architecture

Scalability: Worker process can be scaled independently via Docker Compose
Reliability: State machine pattern prevents inconsistent states
User Experience: Real-time polling provides immediate feedback on processing status
Data Integrity: Transactions ensure atomic operations
Clean Separation: Backend API is completely decoupled from frontend implementation
Error Recovery: Failed repositories tracked and can be retried (infrastructure ready)
Performance: Multiple caching layers optimize database queries
Maintainability: Clear separation of concerns, well-organized code structure

Potential Future Improvements

WebSocket Integration: Replace polling with real-time updates (noted in TODOs)
API Rate Limit Handling: Implement exponential backoff for GitHub API rate limits
Retry Mechanism: Automatic retries for failed repositories (framework ready)
Horizontal Scaling: Environment-based configuration for multi-worker deployments
Caching Strategy: Implement Redis caching for leaderboard results
Monitoring: Add structured logging and performance monitoring (e.g., Prometheus, Grafana)
Testing: Unit and integration tests for core logic
Pagination: For repositories and leaderboard results with large datasets
Search Filtering: Advanced search capabilities for repositories
Export Features: Export leaderboard data as CSV/JSON
Authentication: User authentication and authorization
Rate Limiting: API rate limiting to prevent abuse
Webhooks: GitHub webhook integration for automatic repository updates
Analytics: Repository analytics dashboard with charts and graphs

Code Quality Observations

Consistent Naming: Snake_case for database fields, camelCase for JavaScript
Clear Separation of Concerns: Services, utilities, and workers in separate modules
Minimal External Dependencies: Uses core libraries only (no bloat)
Development-Friendly: Hot-reload setup with nodemon and Next.js HMR
Environment Configuration: All secrets managed through .env files
Console Logging: Debug logs throughout for troubleshooting (could be enhanced with structured logging)
Error Messages: User-friendly error messages with actionable information
Type Safety: TypeScript interfaces and types used consistently

Summary

The GitHub Repository Scraper is a well-architected, production-oriented application that demonstrates:

Modern Architecture: Producer-consumer pattern with asynchronous job processing
Type Safety: TypeScript used comprehensively across the stack
Scalability: Containerized microservices design with independent scaling capabilities
User Experience: Real-time updates and intuitive UI with comprehensive error handling
Data Integrity: Transactions, state machines, and careful error recovery
Best Practices: REST API design, proper HTTP status codes, secure token handling
Performance: Multi-layer caching, efficient database queries, optimized Git operations

The codebase shows thoughtful design decisions balancing simplicity with scalability, making it suitable for both development and production deployments. The architecture supports future enhancements while maintaining clean separation of concerns and excellent developer experience.

FilesExpand file tree

ARCHITECTURE.md

Latest commit

History

ARCHITECTURE.md

File metadata and controls

GitHub Repository Scraper - Architecture & Technical Documentation

Table of Contents

1. Purpose and Objective

Primary Use Cases

Core Value Proposition

2. Work Completed So Far

Backend Implementation

✅ Core API Server

✅ Asynchronous Processing

✅ Git Operations

✅ Database & Persistence

✅ GitHub Integration

✅ Performance Optimizations

✅ Error Handling

Frontend Implementation

✅ Modern UI Framework

✅ User Interface Components

✅ State Management

✅ User Experience

✅ Styling & Components

Infrastructure

✅ Containerization

✅ Service Architecture

3. Architecture and Design Patterns

System Architecture Overview

Design Patterns Used

1. Producer-Consumer Pattern

2. Repository Pattern

3. Service Layer Pattern

4. Singleton Pattern

5. Context Pattern (Frontend)

6. Observer Pattern (Frontend)

7. State Machine Pattern

8. Caching Strategy (Multi-Layer)

4. Core Logic and Strategies

A. Repository Processing Workflow

B. Repository Synchronization (syncRepository)

C. Leaderboard Generation (generateLeaderboard)

D. Contributor Resolution Strategy (getDbUser)

5. Tech Stack

Backend

Frontend

Infrastructure

Development Workflow

6. Integration Details

A. GitHub API Integration

B. Database Integration (PostgreSQL + Prisma)

C. Redis Integration (Bull Queue)

D. File System Integration

E. API-to-API Communication

7. Development Approach

Key Decisions and Trade-offs

1. Bare Repository Cloning

2. In-Memory Contributor Caching During Processing

3. Asynchronous Queue-Based Processing

4. URL Normalization Strategy

5. 24-Hour GitHub Profile Refresh

6. React Query for Frontend State

7. Context API for Local State

8. Automatic Polling Refresh

9. Repository State Machine

10. Docker Multi-Container Setup

8. General Observations

Notable Implementation Details

1. Graceful Error Handling

2. Efficient Transaction Handling

3. Smart Contributor Identification

4. Docker Volume Strategy

5. Frontend Auto-Refresh Intelligence

6. Status Code Semantics

7. Token Handling Security

8. TypeScript Throughout

Strengths of the Architecture

Potential Future Improvements