DENG-10958: add dataset.yaml file for firefox_accounts_derived#9340
Conversation
Claude finished @gkabbz's task in 4m 6s
Review summary
Nice, focused PR — the file is well-organized (top-level scalars, then RECORDs grouped by GCP type with pointers back to the LogEntry reference), naming and types match what's in the actual fxa schemas (spot-checked fxa_gcp_stderr_events_v1), and there's no overlap with bigquery_etl/schema/global.yaml, so this won't conflict with --use_global_schema.
Cross-cutting observations
1. Nested descriptions won't auto-cascade with the current tooling (most important). _update_query_schema_with_base_schemas and its helper get_field_from_schema (bigquery_etl/cli/query.py:2298–2374) only iterate top-level fields and only assign field["description"]. The 26 nested sub-field descriptions in this file (e.g. httpRequest.requestMethod, operation.id, sourceLocation.file) will be ignored by --use_dataset_schema. The follow-on cascade work needs to either extend the tooling to recurse into RECORDs, or manually populate nested descriptions in each of the 18 affected schema.yaml files. Inline comment with details on httpRequest. Calling this out so reviewers/follow-up authors don't expect the cascade to "just work" end-to-end.
2. "Verbatim" is slightly overstated. Most descriptions are minimally edited (mainly dropping "Required."/"Optional." prefixes, occasionally a trailing sentence). I'd recommend (a) softening "verbatim" to "adapted from" in the PR body, and (b) restoring the dropped second sentence on cacheValidatedWithOriginServer since it captures a non-obvious invariant ("only meaningful if cacheHit is True"). See inline.
3. Coverage gaps to track in the follow-up ticket. labels.* and resource.labels.* are intentionally deferred. Of the two, resource.labels.* is a stronger candidate for a future iteration since GCP populates it with stable platform keys (project_id, location, cluster_name, ...). Top-level labels is more deployment-specific and probably best kept as a parent-only entry, but a one-liner in its description making that explicit would help future readers. Inline notes on both.
Nit
- Style consistency: ads_derived.yaml mixes BOOL and BOOLEAN; this PR uses BOOLEAN throughout, which matches the actual fxa table schemas exactly. Fine to leave as-is.
Overall LGTM modulo the nested-cascade caveat — that's the only one that affects how usable this file actually is, and it's worth resolving before/with the follow-on work rather than discovering it mid-cascade.
- name: httpRequest
  type: RECORD
  description: Information about the HTTP request associated with this log entry, if applicable.
  fields:
Heads up that the existing --use_dataset_schema tooling will not actually cascade these nested descriptions. get_field_from_schema in bigquery_etl/cli/query.py:2314 and the surrounding loop in _update_query_schema_with_base_schemas only iterate top-level fields and only assign field["description"] — nested fields inside RECORDs are never visited. So when the cascade follow-up runs, you'll get descriptions for top-level RECORDs like httpRequest, operation, resource, sourceLocation, split, errorGroups, but the 26 sub-field descriptions defined here (e.g. httpRequest.requestMethod, operation.id, resource.type) won't be applied automatically.
Two options for the follow-up:
- Extend _update_query_schema_with_base_schemas to recurse into RECORD fields when both the base and query schema have matching nested structures.
- Manually populate the nested-field descriptions in each of the 18 affected schema.yaml files.
Option 1 is the bigger payoff since several other Cloud-Logging-sourced datasets (*_stderr_events_v1, *_stdout_events_v1, etc.) would benefit from the same recursion. Worth calling out in the follow-up ticket so expectations are aligned.
@sean-rose I'm assuming here that I'll be able to use !include-field even if the --use_dataset_schema doesn't cascade the nested descriptions?
- name: resource
  type: RECORD
  description: The monitored resource that produced this log entry.
  fields:
    - name: type
      type: STRING
      description: The monitored resource type. For example, the type of a Compute Engine VM instance is `gce_instance`.
resource.labels.* is deferred per the comment on line 135, but in the actual fxa schemas resource.labels carries fairly canonical GCP-side keys like project_id, location, cluster_name, namespace_name, etc. — these are stable across deployments (unlike top-level labels) and are reasonable candidates to define here in a follow-up. Worth an ordered TODO note in the related ticket so they aren't forgotten.
Specifically skipping these because some tables have project_id, zone, instance_id (older gce_instance resource type) while others have cluster_name, namespace_name, pod_name (k8s_container post-GKE migration). Will address in a follow-up. Might just define in the table schema.
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
lucia-vargas-a left a comment
LGTM. Will let Sean confirm about the nested fields.
Description
Creates a dataset.yaml file for firefox_accounts_derived, establishing canonical field definitions for GCP Cloud Logging fields shared across the firefox_accounts_derived dataset.

The firefox_accounts_derived dataset is sourced from GCP Cloud Logging log sinks. Profiling showed 18 of 49 tables contain Cloud Logging structural fields (logName, severity, httpRequest, operation, sourceLocation, split, errorGroups, etc.) with no descriptions. This PR adds descriptions for those fields adapted from the GCP Cloud Logging LogEntry reference (https://cloud.google.com/logging/docs/reference/v2/rest/v2/LogEntry).
Changes
Follow on work will be to cascade these definitions to the individual table schemas.
Related Tickets & Documents
Reviewer, please follow this checklist