Merged
4 changes: 2 additions & 2 deletions Dockerfile
@@ -1,5 +1,5 @@
# Use the latest stable Python image
-FROM python:3.11-slim
+FROM python:3.14-slim

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1 \
@@ -34,4 +34,4 @@ RUN chown -R app:app /app
USER app

# Set the default command
-CMD ["python", "main.py"]
+CMD ["python", "main.py"]
37 changes: 28 additions & 9 deletions README.md
@@ -22,7 +22,7 @@ runs in a Docker container for easy deployment and isolation.

### Prerequisites

-1. **GitHub Personal Access Token**: Create a [token](https://github.com/settings/tokens)
+1. **GitHub App** *(recommended, required for authenticated runs)*: Create a GitHub App with read access to the target repositories, then note the numeric **App ID** and download a **private key** (PEM format). Without these, the ETL runs unauthenticated (low rate-limit quota, suitable for testing only).
2. **Google Cloud Project**: Set up a GCP project with BigQuery enabled
3. **BigQuery Dataset**: Create a dataset in your GCP project
4. **Authentication**: Configure GCP credentials (see Authentication section below)
@@ -35,23 +35,40 @@ docker build -t github-etl .

### Running the Container

+Create an env file (do **not** commit it):
+
+```bash
+# github-etl.env
+GITHUB_REPOS=mozilla-firefox/firefox
+GITHUB_APP_ID=your_github_app_id
+GITHUB_PRIVATE_KEY=<paste PEM contents here, with real newline characters (do not use "\n" escape sequences)>
+BIGQUERY_PROJECT=your-gcp-project
+BIGQUERY_DATASET=your_dataset
+GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
+```
+
+Then run the container using `--env-file` to keep secrets out of shell history and off the
+`docker run` command line, where other processes can read them via `ps` or `/proc/<pid>/cmdline`:
+
```bash
 docker run --rm \
-  -e GITHUB_REPOS="mozilla/firefox" \
-  -e GITHUB_TOKEN="your_github_token" \
-  -e BIGQUERY_PROJECT="your-gcp-project" \
-  -e BIGQUERY_DATASET="your_dataset" \
-  -e GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json" \
+  --env-file github-etl.env \
   -v /local/path/to/credentials.json:/path/to/credentials.json \
   github-etl
```

+> **Note**: Never pass the private key inline with `-e GITHUB_PRIVATE_KEY="$(cat ...)"`;
+> the expanded key lands on the `docker run` command line, where other processes can read it
+> via `ps`/`/proc`. Use `--env-file`, Docker secrets, or a secret manager that injects
+> `GITHUB_PRIVATE_KEY` as an environment variable instead.
+
### Environment Variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `GITHUB_REPOS` | Yes | - | Comma separated repositories in format "owner/repo" (e.g., "mozilla/firefox") |
-| `GITHUB_TOKEN` | No | - | GitHub Personal Access Token (recommended to avoid rate limits) |
+| `GITHUB_APP_ID` | No* | - | GitHub App numeric ID (found on the App's settings page). Required for authenticated access. |
+| `GITHUB_PRIVATE_KEY` | No* | - | RSA private key in PEM format for the GitHub App. Required for authenticated access. |
| `BIGQUERY_PROJECT` | Yes | - | Google Cloud Project ID |
| `BIGQUERY_DATASET` | Yes | - | BigQuery dataset ID |
| `GOOGLE_APPLICATION_CREDENTIALS` | Yes* | - | Path to GCP service account JSON file (*or use Workload Identity) |
@@ -66,7 +83,7 @@ docker run --rm \

### Container Specifications

-- **Base Image**: `python:3.11-slim` (latest stable Python)
+- **Base Image**: `python:3.14-slim` (latest stable Python)
- **User**: `app` (uid: 1000, gid: 1000)
- **Working Directory**: `/app`
- **Ownership**: All files in `/app` are owned by the `app` user
@@ -128,7 +145,9 @@ Set up environment variables and run the script:

```bash
export GITHUB_REPOS="mozilla/firefox"
-export GITHUB_TOKEN="your_github_token"
+export GITHUB_APP_ID="your_github_app_id"
+# Load the PEM from a file to avoid the key appearing in shell history
+export GITHUB_PRIVATE_KEY="$(< your_private_key.pem)"
export BIGQUERY_PROJECT="your-gcp-project"
export BIGQUERY_DATASET="your_dataset"

99 changes: 79 additions & 20 deletions main.py
@@ -13,9 +13,10 @@
 import time
 from dataclasses import dataclass
 from datetime import datetime, timedelta, timezone
-from typing import Iterator, Optional
+from typing import Callable, Iterator, Optional
 from urllib.parse import parse_qs, urlparse

+import jwt
 import requests
 from google.api_core import exceptions as api_exceptions
 from google.api_core.client_options import ClientOptions
@@ -37,22 +38,46 @@ class AccessToken:
 repo_installation_cache: dict[str, int] = {}


+def generate_github_jwt(app_id: str, private_key_pem: str) -> str:
+    """
+    Generate a short-lived GitHub App JWT signed with the app's private key.
+
+    GitHub App JWTs are valid for a maximum of 10 minutes. We use a 9-minute
+    expiry and backdate iat by 60 seconds to absorb clock skew between the
+    local machine and GitHub's servers.
+
+    Args:
+        app_id: GitHub App ID (numeric, found on the App's settings page)
+        private_key_pem: RSA private key in PEM format
+
+    Returns:
+        Signed JWT string
+    """
+    now = int(time.time())
+    payload = {
+        "iat": now - 60,  # backdate 60s to absorb clock skew
+        "exp": now + 540,  # 9 minutes (GitHub maximum is 10)
+        "iss": app_id,
+    }
+    return jwt.encode(payload, private_key_pem, algorithm="RS256")
+
+
 def get_installation_access_token(
-    jwt: str,
+    app_jwt: str,
     repo: str,
     github_api_url: str,
 ) -> str:
     """
     Get a GitHub App installation access token, returning a cached one if still valid.

-    Uses the JWT to look up the installation for the given repo, then exchanges
-    it for an installation access token (valid for 1 hour). Tokens are cached
-    per installation ID so that repos sharing an installation reuse the same token,
-    while repos on different installations each get their own. The repo->installation
-    ID mapping is also cached since it never changes.
+    Uses the JWT (generated by ``generate_github_jwt()``) to look up the installation
+    for the given repo, then exchanges it for an installation access token (valid for
+    1 hour). Tokens are cached per installation ID so that repos sharing an installation
+    reuse the same token, while repos on different installations each get their own.
+    The repo->installation ID mapping is also cached since it never changes.

     Args:
-        jwt: GitHub App JWT (stored in GITHUB_TOKEN env var)
+        app_jwt: Short-lived GitHub App JWT produced by ``generate_github_jwt()``
         repo: Repository in "owner/repo" format, used to look up the installation
         github_api_url: GitHub API base URL
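
The timing claims used by `generate_github_jwt()` can be checked without a private key. The following is a minimal sketch and is not part of this PR: `build_app_jwt_claims` and the App ID `"12345"` are hypothetical, and the function only builds the unsigned claim set that the PR hands to `jwt.encode`.

```python
import time
from typing import Optional


def build_app_jwt_claims(app_id: str, now: Optional[int] = None) -> dict:
    """Build the unsigned claim set for a GitHub App JWT.

    Hypothetical helper for illustration; it mirrors the payload in the
    PR's generate_github_jwt() but performs no signing.
    """
    if now is None:
        now = int(time.time())
    return {
        "iat": now - 60,   # issued-at, backdated 60 s for clock skew
        "exp": now + 540,  # expires 9 minutes after generation
        "iss": app_id,     # the GitHub App ID
    }


# Fixed timestamp so the arithmetic is reproducible.
claims = build_app_jwt_claims("12345", now=1_700_000_000)
span_seconds = claims["exp"] - claims["iat"]
```

With a 60-second backdate and a 540-second expiry, the span between `iat` and `exp` is 600 seconds, while the token itself expires 9 minutes after it is generated.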

@@ -63,7 +88,7 @@ def get_installation_access_token(
     session = requests.Session()
     session.headers.update(
         {
-            "Authorization": f"Bearer {jwt}",
+            "Authorization": f"Bearer {app_jwt}",
             "Accept": "application/vnd.github+json",
             "X-GitHub-Api-Version": "2022-11-28",
         }
@@ -154,6 +179,7 @@ def extract_pull_requests(
     repo: str,
     chunk_size: int = 100,
     github_api_url: str = "https://api.github.com",
+    refresh_auth: Optional[Callable[[], None]] = None,
 ) -> Iterator[list[dict]]:
     """
     Extract data from GitHub repositories in chunks.
@@ -165,6 +191,9 @@ def extract_pull_requests(
         repo: GitHub repository name
         chunk_size: Number of PRs to yield per chunk (default: 100)
         github_api_url: GitHub API base URL
+        refresh_auth: Optional callable invoked before each page fetch to refresh
+            the session's Authorization header. Use this to prevent installation
+            tokens (1-hour TTL) from expiring mid-extraction on large repos.

     Yields:
         List of pull request dictionaries (up to chunk_size items)
@@ -183,6 +212,8 @@ def extract_pull_requests(
     pages = 0

     while True:
+        if refresh_auth:
+            refresh_auth()
         resp = github_get(session, base_url, params=params)

         batch = resp.json()
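
The hook placement in this hunk (call `refresh_auth` before every page request) can be illustrated in isolation. This is a simplified sketch, not the PR's generator: `paginate` and `fetch_page` are hypothetical stand-ins for `extract_pull_requests` and `github_get`, and an in-memory dictionary replaces the GitHub API.

```python
from typing import Callable, Iterator, List, Optional


def paginate(
    fetch_page: Callable[[int], List[dict]],
    refresh_auth: Optional[Callable[[], None]] = None,
) -> Iterator[List[dict]]:
    """Yield pages until an empty one is returned, refreshing auth each time."""
    page = 1
    while True:
        if refresh_auth:
            refresh_auth()  # runs before every request, including the first
        batch = fetch_page(page)
        if not batch:
            break
        yield batch
        page += 1


# Two pages of fake data; page 3 is empty and ends the loop.
fake_pages = {1: [{"number": 1}], 2: [{"number": 2}]}
refresh_calls: List[int] = []
result = list(
    paginate(
        lambda p: fake_pages.get(p, []),
        refresh_auth=lambda: refresh_calls.append(1),
    )
)
```

Because the hook runs once per request, it should be cheap in the common case; in the PR that cost is bounded by the token cache, which only contacts GitHub when the cached token is close to expiry.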
@@ -680,11 +711,12 @@ def main() -> int:
 def _main() -> int:
     logger.info("Starting GitHub ETL process with chunked processing")

-    github_jwt = os.environ.get("GITHUB_TOKEN") or None
-    if not github_jwt:
+    github_app_id = os.environ.get("GITHUB_APP_ID") or None
+    github_private_key = os.environ.get("GITHUB_PRIVATE_KEY") or None
+    if not github_app_id or not github_private_key:
         logger.warning(
-            "GITHUB_TOKEN (expected to be a GitHub App JWT, not a personal access token) "
-            "is not set; proceeding without authentication (suitable for testing only)"
+            "GITHUB_APP_ID and GITHUB_PRIVATE_KEY are not set; "
+            "proceeding without authentication (suitable for testing only)"
         )

     # Read BigQuery configuration
@@ -748,16 +780,43 @@ def _main() -> int:
             bigquery_client, bigquery_dataset, repo, snapshot_date
         )

-        # Get (or refresh) the installation access token before processing each repo
-        if github_jwt:
-            access_token = get_installation_access_token(
-                github_jwt, repo, github_api_url
-            )
-            session.headers["Authorization"] = f"Bearer {access_token}"
+        # Build a per-repo token refresh callable. It is called by the generator
+        # before each page fetch, so every API request (PRs + commits + reviewers +
+        # comments) uses a valid token. The access_token_cache means this only hits
+        # the GitHub API when the cached token has <60 seconds remaining.
+        refresh_auth: Optional[Callable[[], None]] = None
+        if github_app_id and github_private_key:
+
+            def _make_refresh(
+                _repo: str = repo,
+            ) -> Callable[[], None]:
+                def _refresh() -> None:
+                    try:
+                        app_jwt = generate_github_jwt(github_app_id, github_private_key)
+                        access_token = get_installation_access_token(
+                            app_jwt, _repo, github_api_url
+                        )
+                    except Exception as e:
+                        raise RuntimeError(
+                            f"Failed to obtain GitHub App access token for {_repo}: {e}. "
+                            "Check that GITHUB_APP_ID is correct and GITHUB_PRIVATE_KEY "
+                            "is a valid PEM-encoded RSA private key."
+                        ) from e
+                    session.headers["Authorization"] = f"Bearer {access_token}"
+
+                return _refresh
+
+            refresh_auth = _make_refresh()
+            # Set the token immediately so the first generator page is authenticated.
+            refresh_auth()

         for chunk_count, chunk in enumerate(
             extract_pull_requests(
-                session, repo, chunk_size=100, github_api_url=github_api_url
+                session,
+                repo,
+                chunk_size=100,
+                github_api_url=github_api_url,
+                refresh_auth=refresh_auth,
             ),
             start=1,
         ):
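
The `_make_refresh(_repo=repo)` factory binds the repository name when the callable is created, presumably to guard against Python's late-binding closures inside the per-repo loop. The sketch below contrasts the two behaviours; the function names and the repository strings are hypothetical and exist only for illustration.

```python
from typing import Callable, List


def make_late_bound(repos: List[str]) -> List[Callable[[], str]]:
    """Closures that look up `repo` at call time, after the loop has ended."""
    callbacks = []
    for repo in repos:
        callbacks.append(lambda: repo)
    return callbacks


def make_bound_by_default(repos: List[str]) -> List[Callable[[], str]]:
    """Closures that capture each repo via a default argument, as the PR does."""
    callbacks = []
    for repo in repos:

        def _make(_repo: str = repo) -> Callable[[], str]:
            def _callback() -> str:
                return _repo

            return _callback

        callbacks.append(_make())
    return callbacks


late = [cb() for cb in make_late_bound(["org/a", "org/b"])]
bound = [cb() for cb in make_bound_by_default(["org/a", "org/b"])]
```

In the PR each callable is used within the same loop iteration that creates it, so late binding would not necessarily cause a bug there; the default-argument form simply makes the callable safe to keep around after the loop variable changes.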
1 change: 1 addition & 0 deletions pyproject.toml
@@ -20,6 +20,7 @@ classifiers = [
 dependencies = [
     "requests>=2.25.0",
     "google-cloud-bigquery==3.25.0",
+    "PyJWT[crypto]>=2.0.0",
 ]

 [project.optional-dependencies]