Skip to content

DENG-10958: add dataset.yaml file for firefox_accounts_derived#9340

Merged
gkabbz merged 4 commits into
mainfrom
DENG-10958-fx-accounts-dataset-yaml
May 13, 2026
Merged

DENG-10958: add dataset.yaml file for firefox_accounts_derived#9340
gkabbz merged 4 commits into
mainfrom
DENG-10958-fx-accounts-dataset-yaml

Conversation

@gkabbz
Copy link
Copy Markdown
Contributor

@gkabbz gkabbz commented May 8, 2026

Description

Creates a dataset.yaml file for firefox_accounts_derived establishing canonical field definitions for GCP Cloud Logging fields shared across the firefox_accounts_derived dataset.

The firefox_accounts_derived dataset is sourced from GCP Cloud Logging log sinks. Profiling showed 18 of 49 tables contain Cloud Logging structural fields (logName, severity, httpRequest, operation, sourceLocation, split, errorGroups, etc.) with no descriptions. This PR adds descriptions for those fields adapted from the GCP Cloud Logging LogEntry reference (https://cloud.google.com/logging/docs/reference/v2/rest/v2/LogEntry).

Changes

  • Adds bigquery_etl/schema/firefox_accounts_derived.yaml with 16 top-level fields and 26 nested sub-fields covering the full Cloud Logging LogEntry structure
  • All descriptions sourced verbatim from Google's official reference documentation

Follow on work will be to cascade these definitions to the individual table schemas.

Related Tickets & Documents

Reviewer, please follow this checklist

@gkabbz gkabbz requested a review from a team as a code owner May 8, 2026 18:19
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

Claude finished @gkabbz's task in 4m 6s —— View job


Review submitted

  • Read the changed file
  • Read recommended_practices.md
  • Looked at existing ads_derived.yaml / global.yaml for conventions
  • Traced how dataset.yaml is consumed by _update_query_schema_with_base_schemas (bigquery_etl/cli/query.py:2282–2374)
  • Cross-checked fields/types against actual fxa schema (fxa_gcp_stderr_events_v1/schema.yaml)
  • Posted formal review with inline comments

Highlights

Most important finding: The current tooling only matches top-level field names and only sets field["description"] — it does not recurse into RECORD fields. So the 26 nested sub-field descriptions in this PR (e.g. httpRequest.requestMethod, operation.id, sourceLocation.file) will not be cascaded automatically by --use_dataset_schema. Worth aligning expectations on this before the follow-on work, since the follow-up either needs to extend the tooling or manually edit each of the 18 affected schema.yaml files.

Other comments: a small editorial note about "verbatim" in the PR body, a suggestion to restore the dropped second sentence on cacheValidatedWithOriginServer, and notes on the deferred labels.* / resource.labels.* coverage.

Otherwise the file is clean — types, naming, and organization match the existing fxa schemas, and there's no overlap with global.yaml.
· branch DENG-10958-fx-accounts-dataset-yaml

@gkabbz gkabbz changed the title add dataset.yaml file for firefox_accounts_derived DENG-10958: add dataset.yaml file for firefox_accounts_derived May 8, 2026
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review summary

Nice, focused PR — the file is well-organized (top-level scalars, then RECORDs grouped by GCP type with pointers back to the LogEntry reference), naming and types match what's in the actual fxa schemas (spot-checked fxa_gcp_stderr_events_v1), and there's no overlap with bigquery_etl/schema/global.yaml, so this won't conflict with --use_global_schema.

Cross-cutting observations

1. Nested descriptions won't auto-cascade with the current tooling (most important). _update_query_schema_with_base_schemas and its helper get_field_from_schema (bigquery_etl/cli/query.py:2298–2374) only iterate top-level fields and only assign field["description"]. The 26 nested sub-field descriptions in this file (e.g. httpRequest.requestMethod, operation.id, sourceLocation.file) will be ignored by --use_dataset_schema. The follow-on cascade work needs to either extend the tooling to recurse into RECORDs, or manually populate nested descriptions in each of the 18 affected schema.yaml files. Inline comment with details on httpRequest. Calling this out so reviewers/follow-up authors don't expect the cascade to "just work" end-to-end.

2. "Verbatim" is slightly overstated. Most descriptions are minimally edited (mainly dropping "Required."/"Optional." prefixes, occasionally a trailing sentence). I'd recommend (a) softening "verbatim" to "adapted from" in the PR body, and (b) restoring the dropped second sentence on cacheValidatedWithOriginServer since it captures a non-obvious invariant ("only meaningful if cacheHit is True"). See inline.

3. Coverage gaps to track in the follow-up ticket. labels.* and resource.labels.* are intentionally deferred. Of the two, resource.labels.* is a stronger candidate for a future iteration since GCP populates it with stable platform keys (project_id, location, cluster_name, ...). Top-level labels is more deployment-specific and probably best kept as a parent-only entry, but a one-liner in its description making that explicit would help future readers. Inline notes on both.

Nit

  • Style consistency: ads_derived.yaml mixes BOOL and BOOLEAN; this PR uses BOOLEAN throughout, which matches the actual fxa table schemas exactly. Fine to leave as-is.

Overall LGTM modulo the nested-cascade caveat — that's the only one that affects how usable this file actually is, and it's worth resolving before/with the follow-on work rather than discovering it mid-cascade.

Comment on lines +58 to +62
- name: httpRequest
type: RECORD
description: Information about the HTTP request associated with this log entry, if
applicable.
fields:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heads up that the existing --use_dataset_schema tooling will not actually cascade these nested descriptions. get_field_from_schema in bigquery_etl/cli/query.py:2314 and the surrounding loop in _update_query_schema_with_base_schemas only iterate top-level fields and only assign field["description"] — nested fields inside RECORDs are never visited. So when the cascade follow-up runs, you'll get descriptions for top-level RECORDs like httpRequest, operation, resource, sourceLocation, split, errorGroups, but the 26 sub-field descriptions defined here (e.g. httpRequest.requestMethod, operation.id, resource.type) won't be applied automatically.

Two options for the follow-up:

  1. Extend _update_query_schema_with_base_schemas to recurse into RECORD fields when both the base and query schema have matching nested structures.
  2. Manually populate the nested-field descriptions in each of the 18 affected schema.yaml files.

Option 1 is the bigger payoff since several other Cloud-Logging-sourced datasets (*_stderr_events_v1, *_stdout_events_v1, etc.) would benefit from the same recursion. Worth calling out in the follow-up ticket so expectations are aligned.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sean-rose I'm assuming here that I'll be able to use !include-field even if the --use_dataset_schema doesn't cascade the nested descriptions?

Comment thread bigquery_etl/schema/firefox_accounts_derived.yaml Outdated
Comment thread bigquery_etl/schema/firefox_accounts_derived.yaml
Comment on lines +137 to +144
- name: resource
type: RECORD
description: The monitored resource that produced this log entry.
fields:
- name: type
type: STRING
description: The monitored resource type. For example, the type of a Compute Engine
VM instance is `gce_instance`.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resource.labels.* is deferred per the comment on line 135, but in the actual fxa schemas resource.labels carries fairly canonical GCP-side keys like project_id, location, cluster_name, namespace_name, etc. — these are stable across deployments (unlike top-level labels) and are reasonable candidates to define here in a follow-up. Worth an ordered TODO note in the related ticket so they aren't forgotten.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically skipping these because some tables have project_id, zone, instance_id (older gce_instance resource type) while others have cluster_name, namespace_name, pod_name (k8s_container post-GKE migration). Will address in a follow-up. Might just define in the table schema.

Comment thread bigquery_etl/schema/firefox_accounts_derived.yaml
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@lucia-vargas-a lucia-vargas-a left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Will let Sean confirm about the nested fields.

@gkabbz gkabbz added this pull request to the merge queue May 13, 2026
Merged via the queue into main with commit 5f8edc2 May 13, 2026
60 checks passed
@gkabbz gkabbz deleted the DENG-10958-fx-accounts-dataset-yaml branch May 13, 2026 12:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants