Analyze GitHub repository trends using the GitHub API. Explore language popularity, domain classifications, rising repositories, and growth patterns through comprehensive data visualization.
This project provides a complete data analysis pipeline for exploring GitHub repository trends. By leveraging the GitHub REST API, it collects, processes, and visualizes data from thousands of repositories to uncover insights about:
- Language Popularity - Which programming languages dominate the open-source ecosystem
- Domain Classification - How repos are categorized (Web Dev, ML, DevOps, etc.)
- Rising Stars - Repositories gaining traction quickly
- Language Trends - How language popularity shifts over time
- Repository Metrics - Stars, forks, activity patterns, and growth rates
- Language Combinations - Which languages are commonly used together
The analysis generates publication-ready charts including:
- Language popularity distribution by repo count and total stars
- Heatmap of programming languages by domain
- Top repositories with their primary languages
- Rising repositories ranked by growth rate
- Language trend lines over recent years
- Language combination analysis
- Repository metrics correlations
- Domain distribution pie charts
- License usage patterns
- Fetch top repositories using GitHub Search API
- Configurable search queries (stars threshold, language filters, etc.)
- Rate limit management and API throttling
- Secure token handling with environment variables
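The collection steps above can be sketched as a small helper that reads the token from the environment and backs off when the rate limit is exhausted. This is a hypothetical illustration (`github_get` is not a function in the notebook), but the `X-RateLimit-*` headers and Bearer auth are standard GitHub REST API behavior:

```python
import os
import time

import requests

API = "https://api.github.com"

def github_get(url, params=None):
    """GET a GitHub API URL with token auth and simple rate-limit backoff."""
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    resp = requests.get(url, headers=headers, params=params)
    # If the rate limit is exhausted, sleep until it resets, then retry once.
    if resp.status_code == 403 and resp.headers.get("X-RateLimit-Remaining") == "0":
        reset = int(resp.headers.get("X-RateLimit-Reset", time.time() + 60))
        time.sleep(max(reset - time.time(), 0) + 1)
        resp = requests.get(url, headers=headers, params=params)
    resp.raise_for_status()
    return resp.json()
```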
- Clean and transform raw API responses
- Calculate derived metrics (stars per day, activity scores, rising scores)
- Intelligent domain classification based on topics and descriptions
- Language usage analysis and percentage calculations
- Repository age and growth rate computations
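The derived metrics above boil down to a few vectorized pandas operations. A minimal sketch with made-up sample data (the column names `age_days` and `stars_per_day` match the snippets used elsewhere in this README):

```python
import pandas as pd

# Hypothetical sample of cleaned repository data
df_clean = pd.DataFrame({
    "name": ["repo-a", "repo-b"],
    "stars": [12000, 3000],
    "created_at": pd.to_datetime(["2022-01-01", "2024-06-01"], utc=True),
})

now = pd.Timestamp.now(tz="UTC")
# Repository age in days since creation
df_clean["age_days"] = (now - df_clean["created_at"]).dt.days
# Growth rate: stars accumulated per day of the repo's life
df_clean["stars_per_day"] = df_clean["stars"] / df_clean["age_days"].clip(lower=1)
```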
- Most popular programming languages
- Language specialization by domain (Web, ML, Mobile, DevOps, etc.)
- Rising repositories identification
- Language trend analysis over time
- Common language combinations
- Repository engagement metrics
- License distribution analysis
- Correlation matrix for key metrics
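The domain classification can be pictured as keyword matching over a repo's topics and description. The notebook's actual logic (Cell 8) may differ; this is a simplified sketch with an illustrative `domains` dict:

```python
def classify_domains(topics, description, domains):
    """Return all domains whose keywords appear in the topics or description."""
    text = " ".join(topics).lower() + " " + (description or "").lower()
    matched = [domain for domain, keywords in domains.items()
               if any(keyword in text for keyword in keywords)]
    return matched or ["Other"]

# Illustrative keyword map; the notebook defines its own
domains = {
    "Machine Learning": ["machine-learning", "deep-learning", "neural"],
    "Web Dev": ["react", "frontend", "webapp"],
}

classify_domains(["deep-learning"], "A neural net library", domains)
# → ["Machine Learning"]
```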
- 10+ publication-ready charts using Seaborn and Matplotlib
- Customizable color palettes and styles
- High-resolution exports (300 DPI)
- Interactive Jupyter notebook environment
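The high-resolution export amounts to passing `dpi=300` to Matplotlib's `savefig`. A minimal sketch with made-up data (the filename here is illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up language counts for illustration
langs = ["Python", "JavaScript", "TypeScript", "Go"]
repos = [410, 380, 190, 120]

fig, ax = plt.subplots(figsize=(8, 5))
sns.barplot(x=langs, y=repos, ax=ax)
ax.set_title("Repositories per language (sample data)")
ax.set_ylabel("Repo count")
fig.savefig("popular_languages_sample.png", dpi=300, bbox_inches="tight")
```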
- Python 3.8 or higher
- Jupyter Notebook or JupyterLab
- GitHub account (for API token)
- Basic understanding of Python, Pandas, and data analysis
git clone https://github.com/joeltikoo/github-trends-analyzer.git
cd github-trends-analyzer

# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

pip install -r requirements.txt

Required packages:
- `requests` - API calls
- `pandas` - Data manipulation
- `numpy` - Numerical operations
- `matplotlib` - Plotting
- `seaborn` - Statistical visualizations
- `python-dotenv` - Environment variable management
- `jupyter` - Notebook environment
Create a .env file in the project root:
touch .env

Add your GitHub Personal Access Token:
GITHUB_TOKEN=your_github_token_here
To get a GitHub token:
- Go to GitHub Settings → Developer settings → Personal access tokens
- Click "Generate new token (classic)"
- Give it a name (e.g., "GitHub Trends Analyzer")
- Select scopes: `public_repo` (or `repo` for private repos)
- Click "Generate token"
- Copy the token and paste it into your `.env` file
Important: Never commit your `.env` file to Git! It's already included in `.gitignore`.
- Launch Jupyter Notebook:
jupyter notebook notebook.ipynb
- Run the cells sequentially from top to bottom
- Visualizations and CSV exports will be generated automatically
Adjust search parameters (Cell 4):
# Collect top repositories with custom filters
top_repos = search_repositories(
query='stars:>5000 language:python', # Custom query
sort='stars',
per_page=100,
max_pages=10 # Adjust for more/fewer repos
)

Common search queries:
- `stars:>1000` - Repos with 1000+ stars
- `language:python` - Python repos only
- `created:>2024-01-01` - Repos created after Jan 1, 2024
- `topics:machine-learning` - ML-related repos
- Combine filters:
stars:>5000 language:javascript created:>2023-01-01
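A combined query like this goes into the `q` parameter of GitHub's `/search/repositories` endpoint. A minimal sketch using `requests` directly (`top_search_results` is a hypothetical helper, not a function in the notebook):

```python
import requests

SEARCH_URL = "https://api.github.com/search/repositories"
params = {
    "q": "stars:>5000 language:javascript created:>2023-01-01",
    "sort": "stars",
    "order": "desc",
    "per_page": 5,
}

def top_search_results(n=5):
    """Fetch and return (full_name, stars) for the first n matching repos."""
    resp = requests.get(SEARCH_URL, params=params,
                        headers={"Accept": "application/vnd.github+json"})
    resp.raise_for_status()
    return [(item["full_name"], item["stargazers_count"])
            for item in resp.json()["items"][:n]]
```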
Modify domain classification (Cell 8):
# Add your own domain keywords
domains = {
'Your Domain': ['keyword1', 'keyword2', 'keyword3'],
# ... existing domains
}

Change visualization styles:
# Adjust color palettes
sns.barplot(..., palette='viridis')  # Try: viridis, plasma, rocket, mako

The notebook generates several output files:
- `github_repos_raw.csv` - Complete raw dataset
- `language_summary.csv` - Language statistics summary
- `domain_summary.csv` - Domain classification summary
- `rising_repositories.csv` - Top rising repos by growth rate
- `popular_languages.png` - Language popularity charts
- `language_by_domain.png` - Heatmap of languages vs domains
- `top_repositories.png` - Top starred repositories
- `rising_repositories.png` - Rising repos by growth rate
- `language_trends.png` - Language trends over time
- `language_combinations.png` - Common language pairs
- `repository_metrics.png` - Multi-panel metrics analysis
- `domain_distribution.png` - Domain pie chart
- `correlation_matrix.png` - Metric correlations
- `license_distribution.png` - License usage patterns
# Python repos by domain
python_repos = df_exploded[df_exploded['primary_language'] == 'Python']
print(python_repos['domains'].value_counts().head())

# Repos with >100 stars/day
fast_growing = df_clean[df_clean['stars_per_day'] > 100]
print(fast_growing[['name', 'stars', 'stars_per_day', 'primary_language']])

# Average engagement by language
engagement = df_clean.groupby('primary_language').agg({
'stars': 'mean',
'forks': 'mean',
'watchers': 'mean'
}).round(0)

Based on analyzing 1000+ top GitHub repositories:
- JavaScript and Python consistently rank as the most popular languages
- Machine Learning and Web Development dominate repository domains
- Repos combining Python + Jupyter are highly common in data science
- TypeScript shows strong growth in recent years
- MIT License is the most popular open-source license
- Rising repositories average 50+ stars/day in their growth phase
# Check your remaining rate limit
remaining, reset_time = check_rate_limit()
print(f"Remaining: {remaining}, Resets at: {datetime.fromtimestamp(reset_time)}")

Solutions:
- Use a GitHub token (raises the limit from 60 to 5,000 requests/hour)
- Reduce the `max_pages` parameter
- Add `time.sleep()` between requests
Ensure all datetime operations use UTC:
df_clean['age_days'] = (pd.Timestamp.now(tz='UTC') - df_clean['created_at']).dt.days

# Reduce dataset size
top_repos = search_repositories(query='stars:>10000', max_pages=5)

Contributions are welcome! Here are some ideas:
- Add new analysis types (contributor networks, commit patterns)
- Improve domain classification accuracy
- Create interactive dashboards (Plotly, Streamlit)
- Add time-series forecasting for language trends
- Implement sentiment analysis on repo descriptions
- Add support for organization-level analysis
Check out CONTRIBUTING.md for more info on contributing.
This project is licensed under the MIT License - see the LICENSE file for details.