Harden access-token log masking and migrate to lazy logging by soustruh · Pull Request #53 · keboola/component-meta

soustruh · 2026-05-21T11:38:54Z

Summary

Add a dict branch to AccessTokenFilter._mask so a dict passed as a %s arg (logger.debug("params: %s", params)) is masked at the data level — previously, dicts passed as args fell through to the catch-all unchanged. The dict-repr regex still scrubs the f-string case as a defence in depth.
Add a list branch (parity with the existing tuple branch).
Keep all three pre-existing scrubbers and extend them so each quote style is covered explicitly:
- URL form (access_token=...) — unchanged.
- Single-quoted Python dict repr ('access_token': '...') — pre-existing, kept.
- Double-quoted JSON / Python dict repr when a value contained ' ("access_token": "...") — newly added, closes the residual leak path on raw FB error response bodies (JSON) and on dict reprs that flip quote style.
Drop the broken Exception branch from _mask (silently downgraded the exception type to str on failure of type(obj)(masked_str)).
Drop exc_info masking from filter() — exc_info is a (type, value, traceback) tuple, not user-visible text. exc_text (the formatted string) is still masked.
Pre-compile regexes once at module load; factor out _TOKEN_KEY / _REDACTED constants.
Replace raw response.text dumps in page_loader.py with structured FB error field logging (code + message), falling back to body-length only on non-JSON bodies.
Migrate every logger.<level>(f"...") call in src/ to lazy %-style arg form.
Enable ruff G (flake8-logging-format) in pyproject.toml to prevent f-string logging from creeping back in.
Add tests/test_access_token_filter.py — first test coverage for the filter. Tests every branch including a note on the LogRecord single-Mapping-args unwrap quirk that affects how a dict arg is stored on the record.
Clean up duplicated bucket-id construction in component.py (extracted to a local variable).

No runtime behaviour changes. Masking is strictly additive (more shapes caught, none revealed). Logging output is equivalent under default formatters.

Test plan (verified locally)

uv run ruff check src/ tests/ — All checks passed
uv run pytest tests/test_access_token_filter.py -v — 13 passed
uv run pytest tests/ — 45 passed
grep -rnE 'logger\.(debug|info|warning|error|exception|critical)\(f["\x27]' src/ — no matches
grep -rn "response\.text" src/ — four usages, all benign (client.py:296 is for a conditional check on error text; page_loader.py:201 is in error pattern matching; page_loader.py:418 and :563 log only len(response.text) in the non-JSON fallback)

matyas-jirat-keboola · 2026-05-21T14:35:10Z

@keboola-pr-reviewer review

matyas-jirat-keboola · 2026-05-21T15:14:59Z

@keboola-pr-reviewer review

matyas-jirat-keboola · 2026-05-21T15:17:48Z

@keboola-pr-reviewer review

Rewrites AccessTokenFilter._mask to be data-shape-aware instead of relying on Python repr coincidences: - Add dict branch: masks 'access_token' key at the data level, so logger.debug("params: %s", params) is safe regardless of whether the caller uses f-strings or %s arg form, or whether any value in the dict contains a single quote. - Add list branch (parity with existing tuple branch). - Add JSON-form regex ("access_token":"...") to close the residual leak path in raw FB error response bodies. - Drop the dict-repr regex (superseded by the dict branch) and the broken Exception branch (already handled via exc_text as a string). - Drop meaningless exc_info masking (it's a (type, val, tb) tuple). - Pre-compile regexes and factor out _TOKEN_KEY / _REDACTED constants. Replace raw response.text dumps in page_loader with structured FB error field logging (code + message), falling back to body-length only. Migrate every f-string log call in src/ to lazy %s arg form and enable ruff G (flake8-logging-format) to prevent regressions. Add tests/test_access_token_filter.py — first coverage for the filter. Includes a comment on the LogRecord single-Mapping-args unwrap quirk that affects how dict args are stored on records.

matyas-jirat-keboola · 2026-05-25T12:46:41Z

Review — one blocking concern

Overall this is a solid hardening pass. The shift from coincidence-based regex on Python dict repr to a real dict branch is the right call, pre-compiling regexes is clean, and the lazy-logging migration + ruff G gate is a nice belt-and-suspenders setup. Tests look thorough for what they cover.

But I think dropping the Exception branch in _mask regresses on token safety, and I was able to reproduce a leak.

Reproduction

import sys, logging
sys.path.insert(0, 'src')
from client import AccessTokenFilter
import requests
from unittest.mock import MagicMock

r = MagicMock(); r.status_code = 400; r.reason = 'Bad Request'
r.url = 'https://graph.facebook.com/v23.0/me?access_token=EAAxyz123&fields=id'
exc = requests.HTTPError(f'{r.status_code} Client Error: {r.reason} for url: {r.url}', response=r)

filt = AccessTokenFilter()
rec = logging.LogRecord('t', logging.ERROR, '', 0, 'Failed: %s', (exc,), None)
filt.filter(rec)
print('LEAK?', 'EAAxyz123' in rec.getMessage())
print('msg:', rec.getMessage())

Output:

LEAK? True
msg: Failed: 400 Client Error: Bad Request for url: https://graph.facebook.com/v23.0/me?access_token=EAAxyz123&fields=id

Why it leaks

keboola.http_client.HttpClient wraps requests, which on raise_for_status() raises requests.HTTPError whose __str__ is "<code> <reason> for url: <full url with query string>" — token in the clear.
The PR has 8+ call sites that pass that exception as a %s arg, e.g.:
- src/client.py:98, 208, 225, 349, 397
- src/page_loader.py:338, 372, 384, 411, 428, 556, 573
record.args = (exc,). The new _mask hits the tuple branch, recurses, lands on an Exception — none of the type branches (str / dict / tuple / list) match, so it returns unchanged.
The filter exits. The handler then calls record.getMessage(), which does "…: %s" % (exc,) → invokes str(exc) after masking is done. URL with token reaches the handler's output stream.

So the old (admittedly ugly) Exception branch was actually doing load-bearing work, not just being broken.

Suggested fix

Add one branch to _mask:

if isinstance(obj, BaseException):
    return self._mask(str(obj))

Yes, this changes the type passed to %s, but the rendered output is identical (%s calls str() either way). And a corresponding test:

def test_masks_exception_in_args(filt):
    exc = Exception('400 for url: ...?access_token=EAAxyz')
    record = make_record('failed: %s', (exc,))
    filt.filter(record)
    assert 'EAAxyz' not in record.getMessage()

Smaller notes (non-blocking)

The old 'access_token': '[^']+' Python-repr regex is gone. With the dict branch + lazy logging + ruff G, this shouldn't recur in this codebase. But if any external string ever lands in a log message with the Python-repr form (e.g. a third-party exception embedding a dict repr), it wouldn't be masked. Adding the regex back as a third pattern in the string branch is essentially free.
src/page_loader.py:417, 561 — response.json() raises JSONDecodeError, which subclasses ValueError in Py3, so the except ValueError is fine. Just flagging in case you meant to be stricter.
_URL_TOKEN_PATTERN has no word boundary, so myaccess_token=… would partial-match. Not a real issue, just an observation.
Tests import _REDACTED / _TOKEN_KEY from client (underscored). Fine for internal testing; if you'd rather not couple, hardcode the literal in tests.

Happy to push the fix + test to the branch if useful.

soustruh force-pushed the logging-masking-refactor branch from 7e71a57 to 8165b4d Compare May 21, 2026 11:58

matyas-jirat-keboola marked this pull request as ready for review May 21, 2026 14:34

soustruh force-pushed the logging-masking-refactor branch from 8165b4d to e50168a Compare May 22, 2026 09:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Harden access-token log masking and migrate to lazy logging#53

Harden access-token log masking and migrate to lazy logging#53
soustruh wants to merge 1 commit into
mainfrom
logging-masking-refactor

soustruh commented May 21, 2026 •

edited

Loading

Uh oh!

matyas-jirat-keboola commented May 21, 2026

Uh oh!

matyas-jirat-keboola commented May 21, 2026

Uh oh!

matyas-jirat-keboola commented May 21, 2026

Uh oh!

matyas-jirat-keboola commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

soustruh commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan (verified locally)

Uh oh!

matyas-jirat-keboola commented May 21, 2026

Uh oh!

matyas-jirat-keboola commented May 21, 2026

Uh oh!

matyas-jirat-keboola commented May 21, 2026

Uh oh!

matyas-jirat-keboola commented May 25, 2026

Review — one blocking concern

Reproduction

Why it leaks

Suggested fix

Smaller notes (non-blocking)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

soustruh commented May 21, 2026 •

edited

Loading