Analyze GitHub repository trends using the GitHub API. Explore language popularity, domain classifications, rising repositories, and growth patterns through comprehensive data visualization.
This project provides a complete data analysis pipeline for exploring GitHub repository trends. By leveraging the GitHub REST API, it collects, processes, and visualizes data from thousands of repositories to uncover insights about:
- Language Popularity - Which programming languages dominate the open-source ecosystem
- Domain Classification - How repos are categorized (Web Dev, ML, DevOps, etc.)
- Rising Stars - Repositories gaining traction quickly
- Language Trends - How language popularity shifts over time
- Repository Metrics - Stars, forks, activity patterns, and growth rates
- Language Combinations - Which languages are commonly used together
The analysis generates publication-ready charts including:
- Language popularity distribution by repo count and total stars
- Heatmap of programming languages by domain
- Top repositories with their primary languages
- Rising repositories ranked by growth rate
- Language trend lines over recent years
- Language combination analysis
- Repository metrics correlations
- Domain distribution pie charts
- License usage patterns
- Fetch top repositories using GitHub Search API
- Configurable search queries (stars threshold, language filters, etc.)
- Rate limit management and API throttling
- Secure token handling with environment variables
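The collection steps above can be sketched as a small helper that reads the token from the environment and backs off when the rate limit is exhausted. This is a hypothetical illustration (`github_get` is not a function in the notebook), but the `X-RateLimit-*` headers and Bearer auth are standard GitHub REST API behavior:

```python
import os
import time

import requests

API = "https://api.github.com"

def github_get(url, params=None):
    """GET a GitHub API URL with token auth and simple rate-limit backoff."""
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    resp = requests.get(url, headers=headers, params=params)
    # If the rate limit is exhausted, sleep until it resets, then retry once.
    if resp.status_code == 403 and resp.headers.get("X-RateLimit-Remaining") == "0":
        reset = int(resp.headers.get("X-RateLimit-Reset", time.time() + 60))
        time.sleep(max(reset - time.time(), 0) + 1)
        resp = requests.get(url, headers=headers, params=params)
    resp.raise_for_status()
    return resp.json()
```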
- Clean and transform raw API responses
- Calculate derived metrics (stars per day, activity scores, rising scores)
- Intelligent domain classification based on topics and descriptions
- Language usage analysis and percentage calculations
- Repository age and growth rate computations
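The derived metrics above boil down to a few vectorized pandas operations. A minimal sketch with made-up sample data (the column names `age_days` and `stars_per_day` match the snippets used elsewhere in this README):

```python
import pandas as pd

# Hypothetical sample of cleaned repository data
df_clean = pd.DataFrame({
    "name": ["repo-a", "repo-b"],
    "stars": [12000, 3000],
    "created_at": pd.to_datetime(["2022-01-01", "2024-06-01"], utc=True),
})

now = pd.Timestamp.now(tz="UTC")
# Repository age in days since creation
df_clean["age_days"] = (now - df_clean["created_at"]).dt.days
# Growth rate: stars accumulated per day of the repo's life
df_clean["stars_per_day"] = df_clean["stars"] / df_clean["age_days"].clip(lower=1)
```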
- Most popular programming languages
- Language specialization by domain (Web, ML, Mobile, DevOps, etc.)
- Rising repositories identification
- Language trend analysis over time
- Common language combinations
- Repository engagement metrics
- License distribution analysis
- Correlation matrix for key metrics
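The domain classification can be pictured as keyword matching over a repo's topics and description. The notebook's actual logic (Cell 8) may differ; this is a simplified sketch with an illustrative `domains` dict:

```python
def classify_domains(topics, description, domains):
    """Return all domains whose keywords appear in the topics or description."""
    text = " ".join(topics).lower() + " " + (description or "").lower()
    matched = [domain for domain, keywords in domains.items()
               if any(keyword in text for keyword in keywords)]
    return matched or ["Other"]

# Illustrative keyword map; the notebook defines its own
domains = {
    "Machine Learning": ["machine-learning", "deep-learning", "neural"],
    "Web Dev": ["react", "frontend", "webapp"],
}

classify_domains(["deep-learning"], "A neural net library", domains)
# → ["Machine Learning"]
```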
- 10+ publication-ready charts using Seaborn and Matplotlib
- Customizable color palettes and styles
- High-resolution exports (300 DPI)
- Interactive Jupyter notebook environment
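The high-resolution export amounts to passing `dpi=300` to Matplotlib's `savefig`. A minimal sketch with made-up data (the filename here is illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up language counts for illustration
langs = ["Python", "JavaScript", "TypeScript", "Go"]
repos = [410, 380, 190, 120]

fig, ax = plt.subplots(figsize=(8, 5))
sns.barplot(x=langs, y=repos, ax=ax)
ax.set_title("Repositories per language (sample data)")
ax.set_ylabel("Repo count")
fig.savefig("popular_languages_sample.png", dpi=300, bbox_inches="tight")
```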
- Python 3.8 or higher
- Jupyter Notebook or JupyterLab
- GitHub account (for API token)
- Basic understanding of Python, Pandas, and data analysis
git clone https://github.com/joeltikoo/github-trends-analyzer.git
cd github-trends-analyzer

# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

pip install -r requirements.txt

Required packages:
- `requests` - API calls
- `pandas` - Data manipulation
- `numpy` - Numerical operations
- `matplotlib` - Plotting
- `seaborn` - Statistical visualizations
- `python-dotenv` - Environment variable management
- `jupyter` - Notebook environment
Create a .env file in the project root:
touch .env

Add your GitHub Personal Access Token:
GITHUB_TOKEN=your_github_token_here
To get a GitHub token:
- Go to GitHub Settings → Developer settings → Personal access tokens
- Click "Generate new token (classic)"
- Give it a name (e.g., "GitHub Trends Analyzer")
- Select scopes: `public_repo` (or `repo` for private repos)
- Click "Generate token"
- Copy the token and paste it into your `.env` file
Important: Never commit your `.env` file to Git! It's already included in `.gitignore`.
- Launch Jupyter Notebook:
jupyter notebook notebook.ipynb
- Run the cells sequentially from top to bottom
- Visualizations and CSV exports will be generated automatically
Adjust search parameters (Cell 4):
# Collect top repositories with custom filters
top_repos = search_repositories(
query='stars:>5000 language:python', # Custom query
sort='stars',
per_page=100,
max_pages=10 # Adjust for more/fewer repos
)

Common search queries:
- `stars:>1000` - Repos with 1000+ stars
- `language:python` - Python repos only
- `created:>2024-01-01` - Repos created after Jan 1, 2024
- `topics:machine-learning` - ML-related repos
- Combine filters:
stars:>5000 language:javascript created:>2023-01-01
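A combined query like this goes into the `q` parameter of GitHub's `/search/repositories` endpoint. A minimal sketch using `requests` directly (`top_search_results` is a hypothetical helper, not a function in the notebook):

```python
import requests

SEARCH_URL = "https://api.github.com/search/repositories"
params = {
    "q": "stars:>5000 language:javascript created:>2023-01-01",
    "sort": "stars",
    "order": "desc",
    "per_page": 5,
}

def top_search_results(n=5):
    """Fetch and return (full_name, stars) for the first n matching repos."""
    resp = requests.get(SEARCH_URL, params=params,
                        headers={"Accept": "application/vnd.github+json"})
    resp.raise_for_status()
    return [(item["full_name"], item["stargazers_count"])
            for item in resp.json()["items"][:n]]
```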
Modify domain classification (Cell 8):
# Add your own domain keywords
domains = {
'Your Domain': ['keyword1', 'keyword2', 'keyword3'],
# ... existing domains
}

Change visualization styles:
# Adjust color palettes
sns.barplot(..., palette='viridis')  # Try: viridis, plasma, rocket, mako

The notebook generates several output files:
- `github_repos_raw.csv` - Complete raw dataset
- `language_summary.csv` - Language statistics summary
- `domain_summary.csv` - Domain classification summary
- `rising_repositories.csv` - Top rising repos by growth rate
- `popular_languages.png` - Language popularity charts
- `language_by_domain.png` - Heatmap of languages vs domains
- `top_repositories.png` - Top starred repositories
- `rising_repositories.png` - Rising repos by growth rate
- `language_trends.png` - Language trends over time
- `language_combinations.png` - Common language pairs
- `repository_metrics.png` - Multi-panel metrics analysis
- `domain_distribution.png` - Domain pie chart
- `correlation_matrix.png` - Metric correlations
- `license_distribution.png` - License usage patterns
# Python repos by domain
python_repos = df_exploded[df_exploded['primary_language'] == 'Python']
print(python_repos['domains'].value_counts().head())

# Repos with >100 stars/day
fast_growing = df_clean[df_clean['stars_per_day'] > 100]
print(fast_growing[['name', 'stars', 'stars_per_day', 'primary_language']])

# Average engagement by language
engagement = df_clean.groupby('primary_language').agg({
'stars': 'mean',
'forks': 'mean',
'watchers': 'mean'
}).round(0)

Based on analyzing 1000+ top GitHub repositories:
- JavaScript and Python consistently rank as the most popular languages
- Machine Learning and Web Development dominate repository domains
- Repos combining Python + Jupyter are highly common in data science
- TypeScript shows strong growth in recent years
- MIT License is the most popular open-source license
- Rising repositories average 50+ stars/day in their growth phase
# Check your remaining rate limit
remaining, reset_time = check_rate_limit()
print(f"Remaining: {remaining}, Resets at: {datetime.fromtimestamp(reset_time)}")

Solutions:
- Use a GitHub token (raises the limit from 60 to 5,000 requests/hour)
- Reduce the `max_pages` parameter
- Add `time.sleep()` between requests
Ensure all datetime operations use UTC:
df_clean['age_days'] = (pd.Timestamp.now(tz='UTC') - df_clean['created_at']).dt.days

# Reduce dataset size
top_repos = search_repositories(query='stars:>10000', max_pages=5)

Contributions are welcome! Here are some ideas:
- Add new analysis types (contributor networks, commit patterns)
- Improve domain classification accuracy
- Create interactive dashboards (Plotly, Streamlit)
- Add time-series forecasting for language trends
- Implement sentiment analysis on repo descriptions
- Add support for organization-level analysis
Check out CONTRIBUTING.md for more info on contributing.
This project is licensed under the MIT License - see the LICENSE file for details.