GitHub Repository Analysis
This skill guides analysis of GitHub repositories to extract meaningful insights about development activity, contributor patterns, and release cycles.
Core Analysis Capabilities
Commit Frequency Analysis
- Extract commit history via GitHub API or git commands
- Calculate commit frequency by time period (daily, weekly, monthly)
- Identify patterns in development activity
- Detect periods of high/low activity
- Generate time-series visualizations of commit trends
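As a minimal sketch of the bucketing step, assuming commit timestamps arrive as ISO 8601 strings (the shape of the API's `commit.author.date` field). The helper name `commits_per_period` and the sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def commits_per_period(timestamps, period="week"):
    """Count commits per day, ISO week, or month, with all times in UTC."""
    formats = {"day": "%Y-%m-%d", "month": "%Y-%m"}
    counts = Counter()
    for ts in timestamps:
        # Older Python versions reject a trailing "Z", so rewrite it as an offset.
        dt = datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)
        if period == "week":
            iso = dt.isocalendar()  # (ISO year, ISO week, weekday)
            key = f"{iso[0]}-W{iso[1]:02d}"
        else:
            key = dt.strftime(formats[period])
        counts[key] += 1
    return dict(sorted(counts.items()))


# Made-up timestamps for illustration
sample = ["2024-01-01T10:00:00Z", "2024-01-03T12:30:00Z", "2024-01-08T09:15:00Z"]
print(commits_per_period(sample, "week"))  # {'2024-W01': 2, '2024-W02': 1}
```

The sorted dictionary can feed a plotting library directly, or be scanned for unusually high or low buckets.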
Top Contributors Analysis
- Identify top contributors by commit count
- Calculate contribution distribution (percentage per contributor)
- Analyze commit patterns by contributor
- Track first-time vs. recurring contributors
- Generate contributor leaderboards
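One way to compute a leaderboard with contribution percentages is sketched below. It assumes a flat list of author names, one entry per commit; `contribution_shares` is a hypothetical helper and the names are made up.

```python
from collections import Counter


def contribution_shares(authors, top_n=5):
    """Return (author, commits, percent_of_total) tuples for the top_n authors."""
    counts = Counter(authors)
    total = sum(counts.values())
    return [
        (name, n, round(100 * n / total, 1))
        for name, n in counts.most_common(top_n)
    ]


# One entry per commit (illustrative data)
authors = ["alice", "bob", "alice", "carol", "alice", "bob"]
for name, n, pct in contribution_shares(authors):
    print(f"{name}: {n} commits ({pct}%)")
```

Percentages are relative to all commits, not just those of the top N, so the listed shares need not sum to 100.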
Release Timeline Analysis
- Extract release/tag history
- Calculate time between releases
- Identify release patterns and cycles
- Track version numbering schemes
- Map releases to major commit periods
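The time between releases can be derived from sorted release dates. This is a sketch that assumes each release has already been reduced to a `(tag, 'YYYY-MM-DD')` pair; `release_intervals` and the sample releases are hypothetical.

```python
from datetime import date


def release_intervals(releases):
    """Return (tag, days_since_previous_release) for every release after the first."""
    ordered = sorted((date.fromisoformat(day), tag) for tag, day in releases)
    return [
        (later_tag, (later - earlier).days)
        for (earlier, _), (later, later_tag) in zip(ordered, ordered[1:])
    ]


# Illustrative data, deliberately out of order
releases = [("v1.1.0", "2024-03-01"), ("v1.0.0", "2024-01-15"), ("v1.2.0", "2024-03-31")]
gaps = release_intervals(releases)
print(gaps)  # [('v1.1.0', 46), ('v1.2.0', 30)]
print("average days between releases:", sum(d for _, d in gaps) / len(gaps))
```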
Implementation Approaches
Approach 1: GitHub API (Recommended for Public Repos)
Use GitHub's REST or GraphQL API for efficient data retrieval:
import requests

def analyze_commits(owner, repo, token=None):
    """Fetch every commit on the default branch, 100 per page."""
    headers = {'Authorization': f'token {token}'} if token else {}
    url = f'https://api.github.com/repos/{owner}/{repo}/commits'
    all_commits = []
    page = 1
    while True:
        response = requests.get(url, headers=headers,
                                params={'page': page, 'per_page': 100}, timeout=30)
        # Fail on 403/404/etc. instead of treating an error body as commit data
        response.raise_for_status()
        commits = response.json()
        if not commits:
            break
        all_commits.extend(commits)
        page += 1
    return all_commits

def analyze_contributors(commits):
    """Count commits per git author name, most active first."""
    contributor_stats = {}
    for commit in commits:
        # Guard against a missing author block rather than raising
        author = (commit['commit'].get('author') or {}).get('name', 'unknown')
        contributor_stats[author] = contributor_stats.get(author, 0) + 1
    return sorted(contributor_stats.items(), key=lambda x: x[1], reverse=True)

def analyze_releases(owner, repo, token=None):
    """Fetch the first page of releases (up to 100; the API default is 30)."""
    headers = {'Authorization': f'token {token}'} if token else {}
    url = f'https://api.github.com/repos/{owner}/{repo}/releases'
    response = requests.get(url, headers=headers, params={'per_page': 100}, timeout=30)
    response.raise_for_status()
    return response.json()
Benefits:
- No repository cloning needed
- Efficient pagination
- Access to additional metadata
- Higher rate limits with a token: 5,000 requests/hour, versus 60 requests/hour unauthenticated
Approach 2: Local Git Repository
Use git commands for detailed analysis when the repository is already cloned:
# Get commit history with timestamps
git log --pretty=format:"%h|%an|%ae|%ad|%s" --date=iso > commits.txt
# Count commits by author
git shortlog -sn --all
# Get all tags/releases
git tag -l --sort=-version:refname
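# Commits per day
git log --pretty=format:"%ad" --date=short | sort | uniq -c
# Commits per week (%W is the strftime week-of-year number)
git log --pretty=format:"%ad" --date=format:"%Y-W%W" | sort | uniq -c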
# Commits per month
git log --pretty=format:"%ad" --date=format:"%Y-%m" | sort | uniq -c
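The pipe-delimited `commits.txt` written by the first command can be loaded into records for further analysis. The sketch below assumes that exact `%h|%an|%ae|%ad|%s` layout; `parse_git_log` and the sample lines are hypothetical, and an author name containing `|` would break the split.

```python
FIELDS = ["hash", "author", "email", "date", "subject"]


def parse_git_log(text):
    """Parse lines of 'hash|author|email|date|subject' into dictionaries."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at most 4 times so any "|" inside the subject is preserved.
        parts = line.split("|", 4)
        if len(parts) == len(FIELDS):
            records.append(dict(zip(FIELDS, parts)))
    return records


# Made-up lines in the same format as commits.txt
sample = (
    "a1b2c3d|Alice|alice@example.com|2024-01-02 10:00:00 +0100|Fix parser | edge case\n"
    "d4e5f6a|Bob|bob@example.com|2024-01-03 09:30:00 +0000|Add tests\n"
)
records = parse_git_log(sample)
print(records[0]["subject"])
```

For a real file, pass `open("commits.txt", encoding="utf-8").read()` instead of the sample string.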
Approach 3: Hybrid Approach
Combine both methods for comprehensive analysis:
- Use API for releases and high-level stats
- Clone repository for detailed commit analysis
- Use git commands for advanced filtering
Analysis Workflow
1. Repository Identification
   - Parse GitHub URL or accept owner/repo parameters
   - Validate repository exists and is accessible
2. Data Collection
   - Fetch commit history (API or git log)
   - Retrieve release/tag information
   - Collect contributor metadata
3. Data Processing
   - Parse timestamps and author information
   - Group commits by time periods
   - Calculate statistics and metrics
4. Insight Generation
   - Identify top contributors with percentages
   - Calculate commit frequency trends
   - Map release timeline with intervals
   - Detect anomalies or interesting patterns
5. Visualization & Reporting
   - Create charts/graphs for trends
   - Generate summary statistics
   - Present findings in structured format
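The repository identification step, turning a URL or an owner/repo string into parameters, might look like the sketch below. `parse_repo` and its regular expression are hypothetical; the pattern accepts only bare repository URLs, so links to a branch or file are rejected rather than guessed at.

```python
import re

# Accepts "owner/repo", an https URL, or an SSH remote, with optional ".git".
_REPO_RE = re.compile(
    r"^(?:https?://github\.com/|git@github\.com:)?"
    r"(?P<owner>[A-Za-z0-9][A-Za-z0-9-]*)/(?P<repo>[A-Za-z0-9._-]+?)(?:\.git)?/?$"
)


def parse_repo(spec):
    """Return (owner, repo) or raise ValueError if spec is not recognized."""
    match = _REPO_RE.match(spec.strip())
    if match is None:
        raise ValueError(f"not a recognizable GitHub repository: {spec!r}")
    return match.group("owner"), match.group("repo")


print(parse_repo("https://github.com/octocat/Hello-World"))
print(parse_repo("git@github.com:octocat/Hello-World.git"))
```

Parsing only checks the shape of the input; confirming that the repository exists still requires an API request or a clone.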
Key Metrics to Calculate
Commit Metrics
- Total commits
- Commits per day/week/month
- Average commits per active period
- Longest streak of daily commits
- Periods of inactivity
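Streaks and inactivity periods both fall out of the sorted set of days that have at least one commit. The sketch below assumes `datetime.date` inputs; the function names and the 7-day gap threshold are illustrative choices, not fixed conventions.

```python
from datetime import date, timedelta


def longest_daily_streak(days):
    """Length of the longest run of consecutive calendar days with commits."""
    unique = sorted(set(days))
    best = current = 1 if unique else 0
    for prev, cur in zip(unique, unique[1:]):
        current = current + 1 if cur - prev == timedelta(days=1) else 1
        best = max(best, current)
    return best


def inactivity_gaps(days, min_gap_days=7):
    """Return (last_active, next_active, gap_in_days) for gaps of min_gap_days or more."""
    unique = sorted(set(days))
    return [
        (prev, cur, (cur - prev).days)
        for prev, cur in zip(unique, unique[1:])
        if (cur - prev).days >= min_gap_days
    ]


# Illustrative commit days
days = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3),
        date(2024, 1, 10), date(2024, 1, 11)]
print(longest_daily_streak(days))  # 3
print(inactivity_gaps(days))
```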
Contributor Metrics
- Total unique contributors
- Top N contributors (typically top 5-10)
- Contribution percentage per contributor
- One-time vs. recurring contributors
- New contributors over time
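One-time versus recurring contributors, and new contributors per month, can be computed in a single pass. This sketch assumes `(author, 'YYYY-MM-DD')` pairs as input; `contributor_retention` and the sample data are hypothetical.

```python
from collections import Counter


def contributor_retention(commits):
    """Split authors into one-time and recurring, and count first appearances per month."""
    counts = Counter()
    first_seen = {}
    # Process oldest first so setdefault records each author's first month.
    for author, day in sorted(commits, key=lambda c: c[1]):
        counts[author] += 1
        first_seen.setdefault(author, day[:7])
    one_time = sorted(a for a, n in counts.items() if n == 1)
    recurring = sorted(a for a, n in counts.items() if n > 1)
    new_per_month = dict(sorted(Counter(first_seen.values()).items()))
    return one_time, recurring, new_per_month


# Illustrative data
commits = [("alice", "2024-01-05"), ("bob", "2024-01-20"),
           ("alice", "2024-02-01"), ("carol", "2024-02-10")]
one_time, recurring, new_per_month = contributor_retention(commits)
print(one_time, recurring, new_per_month)
```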
Release Metrics
- Total releases
- Time between releases (min, max, average)
- Release frequency trend
- Semantic versioning patterns
- Pre-release vs. stable releases
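Tags can be checked against a semantic-versioning shape and split into pre-release and stable. The pattern below is a simplified assumption (optional `v` prefix, `MAJOR.MINOR.PATCH`, optional pre-release and build suffixes), not the full SemVer grammar; `classify_tag` is a hypothetical helper.

```python
import re

# Simplified pattern: optional "v", MAJOR.MINOR.PATCH, optional "-pre" and "+build".
_SEMVER = re.compile(
    r"^v?(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?(?:\+[0-9A-Za-z.-]+)?$"
)


def classify_tag(tag):
    """Return version info for a semver-like tag, or None if it does not match."""
    match = _SEMVER.match(tag)
    if match is None:
        return None
    major, minor, patch, pre = match.groups()
    return {
        "version": (int(major), int(minor), int(patch)),
        "prerelease": pre is not None,
    }


# Illustrative tags
tags = ["v1.2.3", "2.0.0-rc.1", "release-2024"]
for tag in tags:
    print(tag, classify_tag(tag))
```

Tags that return `None` indicate a different numbering scheme, such as date-based versions, and are worth reporting separately.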
Output Formats
Summary Report
Repository: owner/repo
Analysis Period: YYYY-MM-DD to YYYY-MM-DD
Commit Activity:
- Total Commits: N
- Active Contributors: N
- Average Commits/Week: N
- Most Active Period: YYYY-MM
Top Contributors:
1. Name (N commits, X%)
2. Name (N commits, X%)
...
Recent Releases:
- v1.2.3 (YYYY-MM-DD) - N days since previous
- v1.2.2 (YYYY-MM-DD) - N days since previous
...
Detailed Analytics
- Time-series data in CSV/JSON format
- Visualization-ready datasets
- Contributor breakdown by time period
- Release calendar with annotations
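Exporting a time series to CSV or JSON needs only the standard library. The sketch assumes a dictionary mapping period labels to commit counts; the column names `period` and `commits` are arbitrary choices.

```python
import csv
import io
import json


def to_csv(series):
    """Render {period: count} as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["period", "commits"])
    for period, count in sorted(series.items()):
        writer.writerow([period, count])
    return buf.getvalue()


def to_json(series):
    """Render {period: count} as a JSON array of records."""
    rows = [{"period": p, "commits": c} for p, c in sorted(series.items())]
    return json.dumps(rows, indent=2)


# Illustrative weekly counts
weekly = {"2024-W02": 1, "2024-W01": 2}
print(to_csv(weekly))
print(to_json(weekly))
```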
Common Patterns & Tips
Handle rate limiting: Check the X-RateLimit-Remaining and X-RateLimit-Reset response headers on every request, and retry with exponential backoff once the limit is hit
Large repositories: For repos with 10k+ commits, consider:
- Analyzing recent history only (e.g., last 12 months)
- Sampling commits rather than processing all
- Using shallow clones for git-based analysis
Privacy considerations: Unauthenticated API requests can read public data only; private repos require a token with access to the repository
Timezone handling: Normalize all timestamps to UTC for consistent analysis
Bot commits: Filter out automated commits (dependabot, renovate) for human contributor analysis
Email normalization: Same contributor may use different email addresses; consider consolidation
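The timezone, bot, and email tips above can be sketched as three small helpers. Everything here is an assumption for illustration: the bot markers are a heuristic that can misfire on human names, and the alias table is something you would maintain by hand.

```python
from datetime import datetime, timezone

# Heuristic substrings; extend for the bots a given project actually uses.
BOT_MARKERS = ("[bot]", "dependabot", "renovate")


def is_bot(author_name):
    """Guess whether an author name belongs to an automation account."""
    name = author_name.lower()
    return any(marker in name for marker in BOT_MARKERS)


def to_utc(timestamp):
    """Parse an ISO 8601 timestamp with an offset and convert it to UTC."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00")).astimezone(timezone.utc)


def canonical_author(name, email, aliases):
    """Map known alternate emails to one canonical name; fall back to the given name."""
    return aliases.get(email.lower(), name)


# Hypothetical alias table and inputs
aliases = {"alice@work.example": "Alice", "alice@home.example": "Alice"}
print(is_bot("dependabot[bot]"), is_bot("Alice"))
print(to_utc("2024-01-02T01:30:00+02:00").isoformat())
print(canonical_author("alice-dev", "Alice@Home.example", aliases))
```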
Error Handling
- Repository not found: Verify owner/repo spelling; a 404 can also mean a private repository that the current credentials cannot access
- Rate limit exceeded: Implement retry logic or use authentication
- Empty history: Check whether the repository has any commits yet; the commits endpoint returns 409 Conflict for an empty repository
- API changes: REST API behavior can differ between versions; pin one with the X-GitHub-Api-Version request header
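For the rate-limit case, retry timing can be derived from the response headers. The sketch below works on a plain dictionary of headers so it runs without network access; the function names are hypothetical, and the header names are matched with exact casing here (the `requests` library's header mapping is case-insensitive, a plain dict is not).

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how long to wait before retrying, based on rate-limit headers."""
    now = time.time() if now is None else now
    if "Retry-After" in headers:
        return max(0.0, float(headers["Retry-After"]))
    if headers.get("X-RateLimit-Remaining") == "0" and "X-RateLimit-Reset" in headers:
        # X-RateLimit-Reset is a Unix timestamp in seconds.
        return max(0.0, float(headers["X-RateLimit-Reset"]) - now)
    return 0.0


def backoff_delays(retries, base=1.0, cap=60.0):
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped at cap seconds."""
    return [min(cap, base * 2 ** i) for i in range(retries)]


# Illustrative header values
headers = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1700000060"}
print(seconds_until_reset(headers, now=1700000000))  # 60.0
print(backoff_delays(5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

A caller would sleep for the header-derived wait when one is given, and fall back to the backoff schedule for errors that carry no rate-limit information.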