DENG-10958: add dataset.yaml file for firefox_accounts_derived#9340
Conversation
Claude finished @gkabbz's task in 4m 6s
Review summary
Nice, focused PR — the file is well-organized (top-level scalars, then RECORDs grouped by GCP type with pointers back to the LogEntry reference), naming and types match what's in the actual fxa schemas (spot-checked fxa_gcp_stderr_events_v1), and there's no overlap with bigquery_etl/schema/global.yaml, so this won't conflict with --use_global_schema.
Cross-cutting observations
1. Nested descriptions won't auto-cascade with the current tooling (most important). _update_query_schema_with_base_schemas and its helper get_field_from_schema (bigquery_etl/cli/query.py:2298–2374) only iterate top-level fields and only assign field["description"]. The 26 nested sub-field descriptions in this file (e.g. httpRequest.requestMethod, operation.id, sourceLocation.file) will be ignored by --use_dataset_schema. The follow-on cascade work needs to either extend the tooling to recurse into RECORDs, or manually populate nested descriptions in each of the 18 affected schema.yaml files. Inline comment with details on httpRequest. Calling this out so reviewers/follow-up authors don't expect the cascade to "just work" end-to-end.
2. "Verbatim" is slightly overstated. Most descriptions are minimally edited (mainly dropping "Required."/"Optional." prefixes, occasionally a trailing sentence). I'd recommend (a) softening "verbatim" to "adapted from" in the PR body, and (b) restoring the dropped second sentence on cacheValidatedWithOriginServer since it captures a non-obvious invariant ("only meaningful if cacheHit is True"). See inline.
3. Coverage gaps to track in the follow-up ticket. labels.* and resource.labels.* are intentionally deferred. Of the two, resource.labels.* is a stronger candidate for a future iteration since GCP populates it with stable platform keys (project_id, location, cluster_name, ...). Top-level labels is more deployment-specific and probably best kept as a parent-only entry, but a one-liner in its description making that explicit would help future readers. Inline notes on both.
Nit
- Style consistency: ads_derived.yaml mixes BOOL and BOOLEAN; this PR uses BOOLEAN throughout, which matches the actual fxa table schemas exactly. Fine to leave as-is.
Overall LGTM modulo the nested-cascade caveat — that's the only one that affects how usable this file actually is, and it's worth resolving before/with the follow-on work rather than discovering it mid-cascade.
- name: httpRequest
  type: RECORD
  description: Information about the HTTP request associated with this log entry, if applicable.
  fields:
Heads up that the existing --use_dataset_schema tooling will not actually cascade these nested descriptions. get_field_from_schema in bigquery_etl/cli/query.py:2314 and the surrounding loop in _update_query_schema_with_base_schemas only iterate top-level fields and only assign field["description"] — nested fields inside RECORDs are never visited. So when the cascade follow-up runs, you'll get descriptions for top-level RECORDs like httpRequest, operation, resource, sourceLocation, split, errorGroups, but the 26 sub-field descriptions defined here (e.g. httpRequest.requestMethod, operation.id, resource.type) won't be applied automatically.
Two options for the follow-up:
- Extend _update_query_schema_with_base_schemas to recurse into RECORD fields when both the base and query schema have matching nested structures.
- Manually populate the nested-field descriptions in each of the 18 affected schema.yaml files.
Option 1 is the bigger payoff since several other Cloud-Logging-sourced datasets (*_stderr_events_v1, *_stdout_events_v1, etc.) would benefit from the same recursion. Worth calling out in the follow-up ticket so expectations are aligned.
@sean-rose I'm assuming here that I'll be able to use !include-field even if the --use_dataset_schema doesn't cascade the nested descriptions?
- name: resource
  type: RECORD
  description: The monitored resource that produced this log entry.
  fields:
    - name: type
      type: STRING
      description: The monitored resource type. For example, the type of a Compute Engine VM instance is `gce_instance`.
resource.labels.* is deferred per the comment on line 135, but in the actual fxa schemas resource.labels carries fairly canonical GCP-side keys like project_id, location, cluster_name, namespace_name, etc. — these are stable across deployments (unlike top-level labels) and are reasonable candidates to define here in a follow-up. Worth an ordered TODO note in the related ticket so they aren't forgotten.
Specifically skipping these because some tables have project_id, zone, instance_id (older gce_instance resource type) while others have cluster_name, namespace_name, pod_name (k8s_container post-GKE migration). Will address in a follow-up. Might just define in the table schema.
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
lucia-vargas-a left a comment
LGTM. Will let Sean confirm about the nested fields.
Description
Creates a dataset.yaml file for firefox_accounts_derived, establishing canonical field definitions for GCP Cloud Logging fields shared across the firefox_accounts_derived dataset.

The firefox_accounts_derived dataset is sourced from GCP Cloud Logging log sinks. Profiling showed 18 of 49 tables contain Cloud Logging structural fields (logName, severity, httpRequest, operation, sourceLocation, split, errorGroups, etc.) with no descriptions. This PR adds descriptions for those fields adapted from the GCP Cloud Logging LogEntry reference (https://cloud.google.com/logging/docs/reference/v2/rest/v2/LogEntry).
Changes
Follow on work will be to cascade these definitions to the individual table schemas.
Related Tickets & Documents
Reviewer, please follow this checklist