Skip to content

Harden access-token log masking and migrate to lazy logging#53

Open
soustruh wants to merge 1 commit into
mainfrom
logging-masking-refactor
Open

Harden access-token log masking and migrate to lazy logging#53
soustruh wants to merge 1 commit into
mainfrom
logging-masking-refactor

Conversation

@soustruh
Copy link
Copy Markdown
Contributor

@soustruh soustruh commented May 21, 2026

Summary

  • Add a dict branch to AccessTokenFilter._mask so a dict passed as a %s arg (logger.debug("params: %s", params)) is masked at the data level — previously, dicts passed as args fell through to the catch-all unchanged. The dict-repr regex still scrubs the f-string case as a defence in depth.
  • Add a list branch (parity with the existing tuple branch).
  • Keep all three pre-existing scrubbers and extend them so each quote style is covered explicitly:
    • URL form (access_token=...) — unchanged.
    • Single-quoted Python dict repr ('access_token': '...') — pre-existing, kept.
    • Double-quoted JSON / Python dict repr when a value contained ' ("access_token": "...") — newly added, closes the residual leak path on raw FB error response bodies (JSON) and on dict reprs that flip quote style.
  • Drop the broken Exception branch from _mask (silently downgraded the exception type to str on failure of type(obj)(masked_str)).
  • Drop exc_info masking from filter()exc_info is a (type, value, traceback) tuple, not user-visible text. exc_text (the formatted string) is still masked.
  • Pre-compile regexes once at module load; factor out _TOKEN_KEY / _REDACTED constants.
  • Replace raw response.text dumps in page_loader.py with structured FB error field logging (code + message), falling back to body-length only on non-JSON bodies.
  • Migrate every logger.<level>(f"...") call in src/ to lazy %-style arg form.
  • Enable ruff G (flake8-logging-format) in pyproject.toml to prevent f-string logging from creeping back in.
  • Add tests/test_access_token_filter.py — first test coverage for the filter. Tests every branch including a note on the LogRecord single-Mapping-args unwrap quirk that affects how a dict arg is stored on the record.
  • Clean up duplicated bucket-id construction in component.py (extracted to a local variable).

No runtime behaviour changes. Masking is strictly additive (more shapes caught, none revealed). Logging output is equivalent under default formatters.

Test plan (verified locally)

  • uv run ruff check src/ tests/ — All checks passed
  • uv run pytest tests/test_access_token_filter.py -v — 13 passed
  • uv run pytest tests/ — 45 passed
  • grep -rnE 'logger\.(debug|info|warning|error|exception|critical)\(f["\x27]' src/ — no matches
  • grep -rn "response\.text" src/ — four usages, all benign (client.py:296 is for a conditional check on error text; page_loader.py:201 is in error pattern matching; page_loader.py:418 and :563 log only len(response.text) in the non-JSON fallback)

@soustruh soustruh force-pushed the logging-masking-refactor branch from 7e71a57 to 8165b4d Compare May 21, 2026 11:58
@matyas-jirat-keboola matyas-jirat-keboola marked this pull request as ready for review May 21, 2026 14:34
@matyas-jirat-keboola
Copy link
Copy Markdown
Contributor

@keboola-pr-reviewer review

2 similar comments
@matyas-jirat-keboola
Copy link
Copy Markdown
Contributor

@keboola-pr-reviewer review

@matyas-jirat-keboola
Copy link
Copy Markdown
Contributor

@keboola-pr-reviewer review

Rewrites AccessTokenFilter._mask to be data-shape-aware instead of
relying on Python repr coincidences:
- Add dict branch: masks 'access_token' key at the data level, so
  logger.debug("params: %s", params) is safe regardless of whether
  the caller uses f-strings or %s arg form, or whether any value in
  the dict contains a single quote.
- Add list branch (parity with existing tuple branch).
- Add JSON-form regex ("access_token":"...") to close the residual
  leak path in raw FB error response bodies.
- Drop the dict-repr regex (superseded by the dict branch) and the
  broken Exception branch (already handled via exc_text as a string).
- Drop meaningless exc_info masking (it's a (type, val, tb) tuple).
- Pre-compile regexes and factor out _TOKEN_KEY / _REDACTED constants.

Replace raw response.text dumps in page_loader with structured FB
error field logging (code + message), falling back to body-length only.

Migrate every f-string log call in src/ to lazy %s arg form and
enable ruff G (flake8-logging-format) to prevent regressions.

Add tests/test_access_token_filter.py — first coverage for the filter.
Includes a comment on the LogRecord single-Mapping-args unwrap quirk
that affects how dict args are stored on records.
@soustruh soustruh force-pushed the logging-masking-refactor branch from 8165b4d to e50168a Compare May 22, 2026 09:27
@matyas-jirat-keboola
Copy link
Copy Markdown
Contributor

Review — one blocking concern

Overall this is a solid hardening pass. The shift from coincidence-based regex on Python dict repr to a real dict branch is the right call, pre-compiling regexes is clean, and the lazy-logging migration + ruff G gate is a nice belt-and-suspenders setup. Tests look thorough for what they cover.

But I think dropping the Exception branch in _mask regresses on token safety, and I was able to reproduce a leak.

Reproduction

import sys, logging
sys.path.insert(0, 'src')
from client import AccessTokenFilter
import requests
from unittest.mock import MagicMock

r = MagicMock(); r.status_code = 400; r.reason = 'Bad Request'
r.url = 'https://graph.facebook.com/v23.0/me?access_token=EAAxyz123&fields=id'
exc = requests.HTTPError(f'{r.status_code} Client Error: {r.reason} for url: {r.url}', response=r)

filt = AccessTokenFilter()
rec = logging.LogRecord('t', logging.ERROR, '', 0, 'Failed: %s', (exc,), None)
filt.filter(rec)
print('LEAK?', 'EAAxyz123' in rec.getMessage())
print('msg:', rec.getMessage())

Output:

LEAK? True
msg: Failed: 400 Client Error: Bad Request for url: https://graph.facebook.com/v23.0/me?access_token=EAAxyz123&fields=id

Why it leaks

  1. keboola.http_client.HttpClient wraps requests, which on raise_for_status() raises requests.HTTPError whose __str__ is "<code> <reason> for url: <full url with query string>" — token in the clear.
  2. The PR has 8+ call sites that pass that exception as a %s arg, e.g.:
    • src/client.py:98, 208, 225, 349, 397
    • src/page_loader.py:338, 372, 384, 411, 428, 556, 573
  3. record.args = (exc,). The new _mask hits the tuple branch, recurses, lands on an Exception — none of the type branches (str / dict / tuple / list) match, so it returns unchanged.
  4. The filter exits. The handler then calls record.getMessage(), which does "…: %s" % (exc,) → invokes str(exc) after masking is done. URL with token reaches the handler's output stream.

So the old (admittedly ugly) Exception branch was actually doing load-bearing work, not just being broken.

Suggested fix

Add one branch to _mask:

if isinstance(obj, BaseException):
    return self._mask(str(obj))

Yes, this changes the type passed to %s, but the rendered output is identical (%s calls str() either way). And a corresponding test:

def test_masks_exception_in_args(filt):
    exc = Exception('400 for url: ...?access_token=EAAxyz')
    record = make_record('failed: %s', (exc,))
    filt.filter(record)
    assert 'EAAxyz' not in record.getMessage()

Smaller notes (non-blocking)

  • The old 'access_token': '[^']+' Python-repr regex is gone. With the dict branch + lazy logging + ruff G, this shouldn't recur in this codebase. But if any external string ever lands in a log message with the Python-repr form (e.g. a third-party exception embedding a dict repr), it wouldn't be masked. Adding the regex back as a third pattern in the string branch is essentially free.
  • src/page_loader.py:417, 561response.json() raises JSONDecodeError, which subclasses ValueError in Py3, so the except ValueError is fine. Just flagging in case you meant to be stricter.
  • _URL_TOKEN_PATTERN has no word boundary, so myaccess_token=… would partial-match. Not a real issue, just an observation.
  • Tests import _REDACTED / _TOKEN_KEY from client (underscored). Fine for internal testing; if you'd rather not couple, hardcode the literal in tests.

Happy to push the fix + test to the branch if useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants