Commit b95c3af

ci(pr-leakage): add reusable workflow that scans PRs for customer-data leaks

What: Add reusable workflow pr-leakage-check.yaml, stdlib Python scanner pr_leakage_scan.py, externalized banned-tokens YAML, customer-name denylist, skip-allowlist, and a self-test workflow with leaky and clean fixtures.

How: The workflow is invoked via workflow_call from a per-repo caller stub; the scanner pulls the PR title, body, and commit messages via gh pr view and runs twelve always-on rules (fourteen patterns, since R10 and R12 are each split in two) plus five context-sensitive rules.

Why: config-validation. Public connector repos must not name a specific customer or expose internal service topology in a permanent world-readable artifact.

Refs: ConductorOne/baton-sdk#781, ConductorOne/baton-sdk#863, ConductorOne/baton-sdk#865

1 parent d2a4189 commit b95c3af
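
The "How" paragraph of the commit message says the scanner collects the PR title, body, and commit messages through `gh pr view`. The scanner source is not shown on this page, so the following is only a minimal sketch of that collection step. The function names `fetch_pr_json` and `collect_pr_text` are hypothetical, and the JSON field names (`title`, `body`, `commits`, `messageHeadline`, `messageBody`, `oid`) are an assumption about the GitHub CLI's `--json` output that should be checked against the installed `gh` version.

```python
import json
import subprocess


def collect_pr_text(pr_json: str) -> list[tuple[str, str]]:
    """Flatten PR JSON into (source, text) pairs so findings can name their origin."""
    data = json.loads(pr_json)
    # A PR with an empty description may come back as null, hence the `or ""`.
    parts = [("title", data.get("title") or ""), ("body", data.get("body") or "")]
    for commit in data.get("commits") or []:
        headline = commit.get("messageHeadline") or ""
        body = commit.get("messageBody") or ""
        message = f"{headline}\n\n{body}".strip()
        short_oid = (commit.get("oid") or "")[:7]
        parts.append((f"commit {short_oid}", message))
    return parts


def fetch_pr_json(pr_number: int) -> str:
    """Call the GitHub CLI (hypothetical wrapper; needs `gh` on PATH and auth)."""
    result = subprocess.run(
        ["gh", "pr", "view", str(pr_number), "--json", "title,body,commits"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Keeping the parsing separate from the subprocess call lets the flattening logic be exercised against fixture JSON without network access, which matches the commit's mention of leaky and clean fixtures.
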

12 files changed

Lines changed: 756 additions & 0 deletions
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
+# pr-leakage banned tokens
+#
+# Parsed by .github/scripts/pr_leakage_scan.py. The parser supports a small
+# subset of YAML: top-level keys `always_on:` and `context_sensitive:`, each a
+# list of mappings whose keys are `id`, `pattern`, `description`, and
+# optionally `adjacent_any` (inline list), `window` (int), and
+# `require_quote_or_error` (bool). No anchors, no flow mappings, no block
+# scalars; keep entries to one regex per line.
+#
+# Rule IDs are stable. Removing a rule should be a deliberate PR with a
+# regression fixture explaining why.
+
+always_on:
+  - id: R1
+    pattern: '\b[a-z0-9-]+\.conductor\.one\b'
+    description: 'tenant subdomain on conductor.one'
+  - id: R2
+    pattern: '\b[a-z0-9-]+\.ductone\.com\b'
+    description: 'tenant subdomain on ductone.com (btipling.d2.ductone.com exception applied)'
+  - id: R3
+    pattern: '\b(prod|staging|preprod)-(usw2|use1|euw1|euc1|usw1|use2)\b'
+    description: 'internal region/env tag from profile labels'
+  - id: R4
+    pattern: '\bbe-(temporal-sync|temporal-worker|connector-runtime|api|sync-worker)\b'
+    description: 'internal service name'
+  - id: R5
+    pattern: 'https?://app\.datadoghq\.com/'
+    description: 'Datadog dashboard URL'
+  - id: R6
+    pattern: 'https?://[a-z0-9-]+\.datadoghq\.com/'
+    description: 'Datadog tenant URL'
+  - id: R7
+    pattern: 'https?://linear\.app/'
+    description: 'Linear ticket URL'
+  - id: R8
+    pattern: 'https?://[a-z0-9-]+\.slack\.com/'
+    description: 'Slack URL'
+  - id: R9
+    pattern: 'https?://[a-z0-9-]+\.notion\.so/'
+    description: 'Notion URL'
+  - id: R10a
+    pattern: 'https?://docs\.google\.com/'
+    description: 'Google Docs URL'
+  - id: R10b
+    pattern: 'https?://drive\.google\.com/'
+    description: 'Google Drive URL'
+  - id: R11
+    pattern: '\b[A-Za-z0-9._%+-]+@(conductorone\.com|ductone\.com)\b'
+    description: 'internal email address'
+  - id: R12a
+    pattern: '<@U[A-Z0-9]{8,}>'
+    description: 'Slack mention syntax'
+  - id: R12b
+    pattern: '<#C[A-Z0-9]{8,}>'
+    description: 'Slack channel reference syntax'
+
+context_sensitive:
+  - id: C1
+    pattern: '\b[0-9A-Za-z]{20,40}\b'
+    description: 'C1 tenant / app / org ID in tenant-adjacent prose'
+    adjacent_any: [tenant, Tenant, TENANT, tnt_, app_, org_]
+    window: 40
+  - id: C2
+    pattern: '\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b'
+    description: 'UUID inside quoted string or near the word error (verbatim production error shape)'
+    require_quote_or_error: true
+    window: 40
+  - id: C3
+    pattern: '\b\d{2,4}K?\s+(users|groups|members|grants|entitlements|principals|resources)\b'
+    description: 'customer-sized count measurement'
+  - id: C4
+    pattern: '\b\d{2,4}\s*(GB|TB)\s*(c1z|sync|tenant|database|sqlite)\b'
+    description: 'customer-sized storage measurement'
+  - id: C5
+    pattern: '\b\d{2,3}\s*(minute|minutes|min)\b'
+    description: 'customer wall-clock measurement near sync/tenant/temporal context'
+    adjacent_any: [sync, expansion, tenant, temporal, c1z]
+    window: 60
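
The rule file above distinguishes always-on patterns from context-sensitive ones that carry `adjacent_any`, `window`, and `require_quote_or_error`. The scanner itself is not part of this excerpt, so the sketch below is one plausible reading of those fields, not the actual implementation. It assumes that `window` counts characters on each side of the match, that `adjacent_any` is a case-sensitive substring test (suggested by the `tenant, Tenant, TENANT` entries), that "quoted" means any quote character falls within the window, and that context-sensitive rules without either field (C3, C4) fire on every match. The `scan` function and the rule dictionaries are hypothetical; only three rules are copied in to keep the example short.

```python
import re

# A few rules copied from the file above, expressed as plain dicts.
ALWAYS_ON = [
    {"id": "R7", "pattern": r"https?://linear\.app/"},
]
CONTEXT_SENSITIVE = [
    {
        "id": "C2",
        "pattern": r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
        "require_quote_or_error": True,
        "window": 40,
    },
    {
        "id": "C5",
        "pattern": r"\b\d{2,3}\s*(minute|minutes|min)\b",
        "adjacent_any": ["sync", "expansion", "tenant", "temporal", "c1z"],
        "window": 60,
    },
]


def scan(text, always_on=ALWAYS_ON, context_sensitive=CONTEXT_SENSITIVE):
    """Return (rule_id, matched_text) pairs for every rule that fires."""
    findings = []
    for rule in always_on:
        for m in re.finditer(rule["pattern"], text):
            findings.append((rule["id"], m.group(0)))
    for rule in context_sensitive:
        window = rule.get("window", 0)
        for m in re.finditer(rule["pattern"], text):
            # Assumption: the window is measured in characters around the match.
            nearby = text[max(0, m.start() - window): m.end() + window]
            adjacent = rule.get("adjacent_any")
            if adjacent and not any(token in nearby for token in adjacent):
                continue
            if rule.get("require_quote_or_error"):
                # Assumption: a nearby quote character stands in for "quoted string".
                quoted = re.search(r"[\"'`]", nearby) is not None
                if not (quoted or "error" in nearby.lower()):
                    continue
            findings.append((rule["id"], m.group(0)))
    return findings
```

Under these assumptions, a duration such as "45 minutes" is reported only when one of the adjacency tokens appears within 60 characters, which is what keeps a rule like C5 from firing on unrelated prose.
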
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# pr-leakage customer-name denylist.
+#
+# One customer name per line. Whole-word, case-insensitive, multi-word names
+# treated as `\s+` between tokens. Empty lines and lines beginning with `#`
+# ignored. Adding a name is a one-file PR to this repo; the next workflow run
+# in any consumer repo picks the change up because the caller stub pins @main.
+#
+# Seed entries are taken from the captured leak fixtures
+# (.github/pr-leakage-fixtures/leaky/). Add new entries only with a fixture
+# that proves the regression.
+
+Eli Lilly
+Synthetic-Acme
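
The denylist header describes the matching rules: one name per line, whole-word, case-insensitive, with the tokens of a multi-word name joined by `\s+`, and blank or `#` lines ignored. A minimal sketch of turning such a file into compiled patterns could look like the following. The helper names `load_denylist` and `name_to_regex` are hypothetical, "Example Corp" in the usage note is a made-up name, and the use of `\b` for the whole-word boundary is an assumption about how the scanner interprets "whole-word".

```python
import re


def load_denylist(text: str) -> list[str]:
    """Return the non-empty, non-comment lines of a denylist file."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.append(line)
    return names


def name_to_regex(name: str) -> re.Pattern:
    """Compile one name into a whole-word, case-insensitive pattern."""
    # Escape each token so punctuation such as '-' or '.' matches literally,
    # then allow any run of whitespace between the tokens.
    tokens = [re.escape(token) for token in name.split()]
    return re.compile(r"\b" + r"\s+".join(tokens) + r"\b", re.IGNORECASE)
```

With this reading, `name_to_regex("Example Corp")` would match "EXAMPLE   corp" across irregular spacing or a line wrap, but not "ExampleCorporation".
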
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+# pr-leakage skip-token allowlist.
+#
+# Authors listed here may bypass the scanner by including the literal token
+# [skip-leakage-check] anywhere in the PR body. The token without an
+# allowlisted actor is a hard fail: using the token without permission is
+# strictly worse than not using it.
+#
+# One GitHub login per line. Empty lines and lines beginning with `#` ignored.
+#
+# Adding a login is a PR to this repo; the PR description documents why and
+# names the Security reviewer who signed off.
+#
+# Initial allowlist is intentionally empty. The default path is to rewrite the
+# PR body so it does not leak; the escape hatch exists for incidents where
+# naming a customer in a public artifact has explicit Security sign-off.
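
The allowlist comments describe a three-way outcome: no token means the scan runs as usual, the token from an allowlisted author bypasses it, and the token from anyone else is a hard fail. As a sketch only (the real scanner logic is not shown here), that decision might be written as below. The names `load_allowlist` and `skip_decision` and the string return values are hypothetical, and comparing logins case-insensitively is an assumption.

```python
SKIP_TOKEN = "[skip-leakage-check]"


def load_allowlist(text: str) -> set[str]:
    """Return the logins listed in an allowlist file, lowercased."""
    logins = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            logins.add(line.lower())
    return logins


def skip_decision(pr_body: str, actor: str, allowlist: set[str]) -> str:
    """Return 'scan', 'skip', or 'fail' for a PR body and its author."""
    if SKIP_TOKEN not in pr_body:
        return "scan"  # default path: the scanner runs normally
    if actor.lower() in allowlist:
        return "skip"  # allowlisted author used the escape hatch
    return "fail"  # token used without permission is a hard fail
```

Because the initial allowlist is empty, every use of the token would currently resolve to `fail` under this sketch.
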
