feat(schema): promote fields from missing metadata files #9234

Draft: gkatre wants to merge 2 commits into main from missing-metadata-schema-curator/2026-04-21

@gkatre gkatre commented Apr 21, 2026

Summary

Consumed 8 _missing_metadata.yaml files produced by the schema-enricher skill and applied safe field-level curation across the base schema registry in bigquery_etl/schema/. Two new dataset-scoped base schema files are introduced (telemetry_derived.yaml, firefox_desktop_derived.yaml), and global.yaml and ads_derived.yaml are brought to a consistent quality bar (explicit mode on every field, non-empty descriptions, correct primitive types).
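
As a rough illustration of the "explicit mode on every field" bar, the normalization can be sketched in Python on plain dicts. The name/type/mode/description key layout, the `normalize_field` helper, and the sample field are assumptions made for this sketch; they are not taken from the curator code.

```python
# Assumed field layout: name/type/mode/description, plus optional extras
# such as aliases. This mirrors the key ordering mentioned in the PR text.
KEY_ORDER = ("name", "type", "mode", "description")


def normalize_field(field: dict) -> dict:
    """Return a copy with an explicit mode and keys in canonical order."""
    # A missing or null mode becomes NULLABLE; an existing mode is kept.
    with_mode = {**field, "mode": field.get("mode") or "NULLABLE"}
    ordered = {key: with_mode[key] for key in KEY_ORDER if key in with_mode}
    # Keep any extra keys (for example aliases) after the canonical ones.
    ordered.update({k: v for k, v in with_mode.items() if k not in ordered})
    return ordered


# Hypothetical field with keys out of order and no explicit mode.
raw = {"mode": None, "type": "STRING", "name": "example_field", "aliases": ["alt"]}
print(normalize_field(raw))
```

A field that already declares `mode: REPEATED` passes through with its mode unchanged; only the key order is normalized.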

Input files consumed (8)

  • baseline_active_users_v1_missing_metadata.yaml
  • cfs_ga4_attr_v1_missing_metadata.yaml
  • clients_daily_joined_v1_missing_metadata.yaml
  • clients_daily_v6_missing_metadata.yaml
  • clients_first_seen_v1_missing_metadata.yaml
  • clients_last_seen_v1_missing_metadata.yaml
  • onboarding_hourly_v2_missing_metadata.yaml
  • onboarding_v2_missing_metadata.yaml

Changes

  • bigquery_etl/schema/global.yaml: expanded from 50 to 109 fields. Promoted ~60 new cross-dataset canonical fields, each supported by 2+ missing-metadata files, including: CPU hardware (cpu_cores/count/family/l2_cache_kb/l3_cache_kb/model/speed_mhz/stepping/vendor); geo (city, geo_subdivision1, geo_subdivision2, isp_organization, geo_db_version); bits28 activity/URI patterns (days_seen_bits, days_active_bits, days_desktop_active_bits, days_visited_{1,5,10}uri_bits, days_had_8_active_ticks_bits, days_interacted_bits, days_opened_dev_tools_bits, days_created_profile_bits, etc.); days_since_* derivatives; profile age/creation/first-seen/second-seen; Firefox Account/Sync/telemetry flags (fxa_configured, sync_configured, telemetry_enabled); Windows OS dimensions (windows_build_number, windows_ubr, is_wow64, memory_mb); the attribution suite (attribution_dlsource, attribution_dltoken, attribution_ua, attribution_term); and app_build, app_version_patch_revision, app_version_is_major_release, apple_model_id, env_build_arch, partner_id, distributor, timezone_offset, durations, is_new_profile, and active_experiment_branch/id (with deprecation notes). Fixed INTEGER→STRING type mismatches on app_version and os_version (their descriptions indicate string-like values such as "1.0.3" and "100.9.11"). Added explicit mode: NULLABLE to every field that lacked a mode.
  • bigquery_etl/schema/ads_derived.yaml: added explicit mode to every field; wrote meaningful descriptions for payout (previously null) and price (previously just "Price."); added the missing type: STRING / mode: REPEATED to the sites and zones fields; removed dau and profile_group_id (cross-file duplicates already covered in global.yaml) and moved the total_active alias onto global.yaml:dau. Fixed interaction_count key ordering (mode/name/type → name/type/mode).
  • bigquery_etl/schema/telemetry_derived.yaml: new file with 91 Firefox Desktop main-ping ETL fields: aborts_sum, crashes_detected_*_sum, crash_submit_{attempt,success}_*_sum, main/content/gpu/rdd/socket/utility/vr_crash_count, gfx_features_*_status (advanced_layers, d2d, d3d11, gpu_process), sandbox_effective_content_process_level, env_build_{id,version,platform_version,xpcom_abi}, environment_settings_intl_* locales, default_search_engine_data_*, devtools_toolbox_opened_count_sum, sync_count_{desktop,mobile}_{sum,mean}, plugins_infobar_*_sum, plugin_hangs_sum, plugins_notification_shown_sum, session_restored_mean, sessions_started_on_this_day, shutdown_kill_sum, subsession_hours_sum, total_hours_sum, trackers_blocked_sum, update_auto_download/channel/enabled, flash_version, search_cohort, previous_build_id, pings_aggregated_by_this_row, places_bookmarks/pages_count_mean, submission_timestamp_min, install_year, os_service_pack_{major,minor}, web_notification_shown_sum, ad_clicks_count_all, search_with_ads_count_all, min/max_subsession_counter, n_logged_event, n_created_pictureinpicture, n_viewed_protection_report, first_paint_mean, push_api_notify_sum, ssl_handshake_result_{failure,success}_sum, e10s_enabled, blocklist_enabled, addon_compatibility_check_enabled, active_addons, active_addons_count_mean, active_hours_sum, distribution_version, distributor_channel, startup_profile_selection_first_ping_only.
  • bigquery_etl/schema/firefox_desktop_derived.yaml: new file with 67 Firefox Desktop-scoped derived fields: all ga4_* GA4 session/attribution/download dimensions (from cfs_ga4_attr_v1); stub_attr_logs_* stub attribution service fields; release_channel, event, message_id, and addon_version from the Glean messaging_system ping; and windows_version.
  • bigquery_etl/schema/SCHEMA_AUDIT_RECOMMENDATIONS.md: new document capturing items that require human judgment: recommended_target conflicts between missing-metadata files (11 cases); canonical/alias conflicts (flight_name, source_file); complex RECORD promotions awaiting sub-field confirmation (attribution, experiments, metadata, active_addons, days_seen_in_experiment, distribution, attribution_ext, distribution_ext); single-dataset candidates that were auto-promoted but warrant owner review; and the ~280 histogram/scalar/search fields omitted from the source missing-metadata files (recommendation: re-run schema-enricher without the "omitted for brevity" shortcut before a follow-up pass).
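
For reviewers who want to reason about the "supported by 2+ missing-metadata files" rule used for global.yaml, here is a minimal sketch of that kind of support count. The `promotion_candidates` helper, the `MIN_SUPPORTING_FILES` constant, and the per-file field sets are hypothetical; the actual file contents are not shown in this PR.

```python
from collections import defaultdict

MIN_SUPPORTING_FILES = 2  # threshold taken from the "2+ files" wording


def promotion_candidates(fields_by_file: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each field seen in enough files to the files that mention it."""
    support = defaultdict(list)
    # Iterate files in sorted order so each file list is deterministic.
    for filename, fields in sorted(fields_by_file.items()):
        for field in fields:
            support[field].append(filename)
    return {
        field: files
        for field, files in sorted(support.items())
        if len(files) >= MIN_SUPPORTING_FILES
    }


# Illustrative input only: which fields appear in which file is assumed.
example = {
    "clients_daily_v6_missing_metadata.yaml": {"cpu_cores", "city", "aborts_sum"},
    "clients_last_seen_v1_missing_metadata.yaml": {"cpu_cores", "city"},
    "onboarding_v2_missing_metadata.yaml": {"message_id"},
}
print(promotion_candidates(example))
```

Fields that fall below the threshold (here aborts_sum and message_id) would be the ones left for a dataset-scoped file or for owner review.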

Counts

| Change type | Count |
| --- | --- |
| Fields promoted to global.yaml | ~60 |
| Fields promoted to telemetry_derived.yaml (new) | 91 |
| Fields promoted to firefox_desktop_derived.yaml (new) | 67 |
| Type corrections (INTEGER → STRING) | 2 (app_version, os_version) |
| Descriptions filled (missing/minimal) | ~100 |
| Modes added (explicit NULLABLE/REPEATED) | ~150 |
| Cross-file duplicates removed | 2 (dau, profile_group_id from ads_derived.yaml) |
| Self-review passes | 3 |
| Items deferred for human review | 40+ (see recommendations doc) |

Total: 341 fields across 4 base schema files (all YAML parse-validated, all have explicit type/mode/non-trivial description, zero cross-file canonical duplicates).
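
The invariants claimed in the total above (explicit type and mode, non-trivial description, no cross-file duplicates) could be re-checked with something like the following sketch. It works on already-parsed field lists; the `audit_schemas` helper, the 10-character description threshold, and the sample types are assumptions for illustration, not the checks that were actually run.

```python
def audit_schemas(schemas: dict[str, list[dict]], min_description_len: int = 10) -> list[str]:
    """Return human-readable problems; an empty list means all checks pass."""
    problems = []
    owners: dict[str, str] = {}  # field name -> first file that defined it
    for filename, fields in schemas.items():
        for field in fields:
            name = field.get("name", "<unnamed>")
            for key in ("type", "mode"):
                if not field.get(key):
                    problems.append(f"{filename}:{name} is missing {key}")
            description = (field.get("description") or "").strip()
            if len(description) < min_description_len:
                problems.append(f"{filename}:{name} has a trivial description")
            if name in owners and owners[name] != filename:
                problems.append(f"{name} is defined in both {owners[name]} and {filename}")
            owners.setdefault(name, filename)
    return problems


# Illustrative pre-curation state; the field types here are assumed.
sample = {
    "ads_derived.yaml": [
        {"name": "price", "type": "FLOAT", "mode": "NULLABLE", "description": "Price."},
        {"name": "dau", "type": "INTEGER", "description": "Daily active users count."},
    ],
    "global.yaml": [
        {"name": "dau", "type": "INTEGER", "mode": "NULLABLE",
         "description": "Daily active users count."},
    ],
}
for problem in audit_schemas(sample):
    print(problem)
```

The length threshold is a crude proxy for "non-trivial"; a description such as "Price." fails it, which matches the kind of entry this PR rewrote.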

Test plan

  • Reviewer skims SCHEMA_AUDIT_RECOMMENDATIONS.md to decide on the 11 recommended_target conflicts and 4 canonical/alias conflicts
  • Spot-check a few tables that rely on --use-global-schema to confirm the INTEGER→STRING type change on app_version/os_version is safe
  • Confirm release_channel belongs in firefox_desktop_derived.yaml (where it is now) rather than global.yaml (where it was recommended for one table)
  • Confirm legacy_telemetry_client_id should stay in ads_derived.yaml or be promoted to global.yaml
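
One possible way to do the app_version/os_version spot-check is to compare a table's declared field types against the global schema and list any disagreements. This is only a sketch on parsed field lists; `type_mismatches` and the sample table are hypothetical, and it does not model how --use-global-schema itself resolves fields.

```python
def type_mismatches(table_fields: list[dict], global_fields: list[dict]) -> list[tuple[str, str, str]]:
    """Return (name, table_type, global_type) for fields whose types disagree."""
    global_types = {f["name"]: f["type"].upper() for f in global_fields}
    return [
        (f["name"], f["type"].upper(), global_types[f["name"]])
        for f in table_fields
        if f["name"] in global_types and f["type"].upper() != global_types[f["name"]]
    ]


# Global types after this PR, as described in the Changes section.
global_fields = [
    {"name": "app_version", "type": "STRING"},
    {"name": "os_version", "type": "STRING"},
]
# Hypothetical table schema that still declares app_version as INTEGER.
table_fields = [
    {"name": "app_version", "type": "INTEGER"},
    {"name": "os_version", "type": "string"},
    {"name": "client_id", "type": "STRING"},
]
print(type_mismatches(table_fields, global_fields))
```

Any table reported here would need a closer look before the type change is considered safe for it.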

🤖 Generated with Claude Code

…metadata-schema-curator

Consumed 8 _missing_metadata.yaml files covering baseline_active_users_v1,
cfs_ga4_attr_v1, clients_daily_joined_v1, clients_daily_v6, clients_first_seen_v1,
clients_last_seen_v1, onboarding_hourly_v2, onboarding_v2 and applied safe curation
to the base schema registry.

- global.yaml: promoted ~60 new cross-dataset canonical fields (CPU hardware,
  geo subdivisions, bits28 activity/URI fields, days_since_* derivatives,
  profile/account/sync/telemetry flags, attribution_ua/dltoken/dlsource,
  second_seen_date, Windows OS dimensions, etc.); fixed app_version and os_version
  INTEGER->STRING type mismatches; added explicit mode: NULLABLE to all fields
  lacking an explicit mode; added contextual descriptions to previously bare
  fields.
- ads_derived.yaml: added explicit mode to every field; wrote descriptions for
  payout and price (previously null / "Price."); added type/mode to sites/zones
  REPEATED STRING fields; removed dau and profile_group_id (duplicated in
  global.yaml); moved the total_active alias onto global.yaml:dau.
- telemetry_derived.yaml: new file with 91 Firefox Desktop main-ping ETL fields
  (crash/abort sums, gfx feature status, sandbox level, env_build_*, search
  engine metadata, intl settings, update_*, sync_count_*, plugin_* metrics).
- firefox_desktop_derived.yaml: new file with 67 GA4/stub attribution and Glean
  messaging-system fields (ga4_*, stub_attr_logs_*, release_channel, event,
  message_id, addon_version, windows_version).
- SCHEMA_AUDIT_RECOMMENDATIONS.md: 9 sections of findings that require human
  review, including recommended_target conflicts, canonical/alias conflicts
  (flight_name, source_file), complex RECORD types (attribution, experiments,
  metadata), and the ~280 histogram/scalar/search fields omitted from the
  source missing-metadata files.

3 self-review passes; 341 total fields across 4 base schema files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@gkatre requested a review from a team on April 21, 2026, 23:00