feat(schema): promote fields from missing metadata files #9234
Draft
gkatre wants to merge 2 commits into
…metadata-schema-curator

Consumed 8 _missing_metadata.yaml files covering baseline_active_users_v1, cfs_ga4_attr_v1, clients_daily_joined_v1, clients_daily_v6, clients_first_seen_v1, clients_last_seen_v1, onboarding_hourly_v2, onboarding_v2 and applied safe curation to the base schema registry.

- global.yaml: promoted ~60 new cross-dataset canonical fields (CPU hardware, geo subdivisions, bits28 activity/URI fields, days_since_* derivatives, profile/account/sync/telemetry flags, attribution_ua/dltoken/dlsource, second_seen_date, Windows OS dimensions, etc.); fixed app_version and os_version INTEGER->STRING type mismatches; added explicit mode: NULLABLE to all fields lacking an explicit mode; added contextual descriptions to previously bare fields.
- ads_derived.yaml: added explicit mode to every field; wrote descriptions for payout and price (previously null / "Price."); added type/mode to sites/zones REPEATED STRING fields; removed dau and profile_group_id (duplicated in global.yaml); moved the total_active alias onto global.yaml:dau.
- telemetry_derived.yaml: new file with 91 Firefox Desktop main-ping ETL fields (crash/abort sums, gfx feature status, sandbox level, env_build_*, search engine metadata, intl settings, update_*, sync_count_*, plugin_* metrics).
- firefox_desktop_derived.yaml: new file with 67 GA4/stub attribution and Glean messaging-system fields (ga4_*, stub_attr_logs_*, release_channel, event, message_id, addon_version, windows_version).
- SCHEMA_AUDIT_RECOMMENDATIONS.md: 9 sections of findings that require human review, including recommended_target conflicts, canonical/alias conflicts (flight_name, source_file), complex RECORD types (attribution, experiments, metadata), and the ~280 histogram/scalar/search fields omitted from the source missing-metadata files.

3 self-review passes; 341 total fields across 4 base schema files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Consumed 8 `_missing_metadata.yaml` files produced by the `schema-enricher` skill and applied safe field-level curation across the base schema registry in `bigquery_etl/schema/`. Two new dataset-scoped base schema files are introduced (`telemetry_derived.yaml`, `firefox_desktop_derived.yaml`), and `global.yaml` and `ads_derived.yaml` are brought to a consistent quality bar (explicit mode on every field, non-empty descriptions, correct primitive types).

Input files consumed (8)
- `baseline_active_users_v1_missing_metadata.yaml`
- `cfs_ga4_attr_v1_missing_metadata.yaml`
- `clients_daily_joined_v1_missing_metadata.yaml`
- `clients_daily_v6_missing_metadata.yaml`
- `clients_first_seen_v1_missing_metadata.yaml`
- `clients_last_seen_v1_missing_metadata.yaml`
- `onboarding_hourly_v2_missing_metadata.yaml`
- `onboarding_v2_missing_metadata.yaml`

Changes
- `bigquery_etl/schema/global.yaml`: expanded to 109 fields (from 50). Promoted ~60 new cross-dataset canonical fields supported by 2+ missing-metadata files, including CPU hardware (cpu_cores/count/family/l2_cache_kb/l3_cache_kb/model/speed_mhz/stepping/vendor), geo (city, geo_subdivision1, geo_subdivision2, isp_organization, geo_db_version), bits28 activity/URI patterns (days_seen_bits, days_active_bits, days_desktop_active_bits, days_visited_{1,5,10}_uri_bits, days_had_8_active_ticks_bits, days_interacted_bits, days_opened_dev_tools_bits, days_created_profile_bits, etc.), days_since_* derivatives, profile age/creation/first-seen/second-seen, Firefox Account/Sync/telemetry flags (fxa_configured, sync_configured, telemetry_enabled), Windows OS dimensions (windows_build_number, windows_ubr, is_wow64, memory_mb), attribution suite (attribution_dlsource, attribution_dltoken, attribution_ua, attribution_term), app_build, app_version_patch_revision, app_version_is_major_release, apple_model_id, env_build_arch, partner_id, distributor, timezone_offset, durations, is_new_profile, and active_experiment_branch/id with deprecation notes. Fixed `app_version` and `os_version` INTEGER→STRING type mismatches (descriptions indicate string-like values such as "1.0.3" and "100.9.11"). Added explicit `mode: NULLABLE` to every field lacking an explicit mode.
- `bigquery_etl/schema/ads_derived.yaml`: added explicit `mode` to every field; wrote meaningful descriptions for `payout` (previously `null`) and `price` (previously just "Price."); added missing `type: STRING` / `mode: REPEATED` to the `sites` and `zones` fields; removed `dau` and `profile_group_id` (cross-file duplicates already covered in `global.yaml`, with the `total_active` alias moved onto `global.yaml:dau`). Fixed `interaction_count` key ordering (mode/name/type → name/type/mode).
- `bigquery_etl/schema/telemetry_derived.yaml`: new file with 91 Firefox Desktop main-ping ETL fields: aborts_sum, crashes_detected_*_sum, crash_submit_{attempt,success}_*_sum, main/content/gpu/rdd/socket/utility/vr_crash_count, gfx_features_*_status (advanced_layers, d2d, d3d11, gpu_process), sandbox_effective_content_process_level, env_build_{id,version,platform_version,xpcom_abi}, environment_settings_intl_* locales, default_search_engine_data_*, devtools_toolbox_opened_count_sum, sync_count_{desktop,mobile}_{sum,mean}, plugins_infobar_*_sum, plugin_hangs_sum, plugins_notification_shown_sum, session_restored_mean, sessions_started_on_this_day, shutdown_kill_sum, subsession_hours_sum, total_hours_sum, trackers_blocked_sum, update_auto_download/channel/enabled, flash_version, search_cohort, previous_build_id, pings_aggregated_by_this_row, places_bookmarks/pages_count_mean, submission_timestamp_min, install_year, os_service_pack_{major,minor}, web_notification_shown_sum, ad_clicks_count_all, search_with_ads_count_all, min/max_subsession_counter, n_logged_event, n_created_pictureinpicture, n_viewed_protection_report, first_paint_mean, push_api_notify_sum, ssl_handshake_result_{failure,success}_sum, e10s_enabled, blocklist_enabled, addon_compatibility_check_enabled, active_addons, active_addons_count_mean, active_hours_sum, distribution_version, distributor_channel, startup_profile_selection_first_ping_only.
- `bigquery_etl/schema/firefox_desktop_derived.yaml`: new file with 67 Firefox Desktop-scoped derived fields: all `ga4_*` GA4 session/attribution/download dimensions (from `cfs_ga4_attr_v1`), `stub_attr_logs_*` stub attribution service fields, `release_channel`, `event`, `message_id`, `addon_version` from the Glean messaging_system ping, and `windows_version`.
- `bigquery_etl/schema/SCHEMA_AUDIT_RECOMMENDATIONS.md`: new document capturing items that require human judgment: `recommended_target` conflicts between missing-metadata files (11 cases), canonical/alias conflicts (`flight_name`, `source_file`), complex RECORD promotions awaiting sub-field confirmation (`attribution`, `experiments`, `metadata`, `active_addons`, `days_seen_in_experiment`, `distribution`, `attribution_ext`, `distribution_ext`), single-dataset candidates that were auto-promoted but warrant owner review, and the ~280 histogram/scalar/search fields omitted from the source missing-metadata files (recommend re-running `schema-enricher` without the "omitted for brevity" shortcut before a follow-up pass).

Counts
- `global.yaml`: 109 fields
- `telemetry_derived.yaml` (new): 91 fields
- `firefox_desktop_derived.yaml` (new): 67 fields
- Type fixes (`app_version`, `os_version`)
- Explicit modes added (`NULLABLE`/`REPEATED`)
- Cross-file duplicates removed (`dau`, `profile_group_id` from ads_derived.yaml)

Total: 341 fields across 4 base schema files (all YAML parse-validated, all have explicit `type`/`mode`/non-trivial `description`, zero cross-file canonical duplicates).

Test plan
- Review `SCHEMA_AUDIT_RECOMMENDATIONS.md` to decide on the 11 `recommended_target` conflicts and 4 canonical/alias conflicts
- Use `--use-global-schema` to confirm the INTEGER→STRING type change on `app_version`/`os_version` is safe
- Confirm `release_channel` belongs in `firefox_desktop_derived.yaml` (where it is now) rather than `global.yaml` (where it was recommended for one table)
- Decide whether `legacy_telemetry_client_id` should stay in `ads_derived.yaml` or be promoted to `global.yaml`

🤖 Generated with Claude Code
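
For reviewers who want to spot-check the invariants claimed in this description (explicit `type` and `mode` on every field, non-trivial descriptions, no cross-file canonical duplicates), a minimal sketch follows. It assumes each base schema file has already been parsed into a list of field dicts with `name`, `type`, `mode`, and `description` keys. The function names, the description-length threshold, and the sample data (including the field types and description text) are hypothetical and are not part of bigquery-etl or of this PR.

```python
from collections import defaultdict

REQUIRED_KEYS = ("name", "type", "mode")
# Assumed cut-off for a "non-trivial" description; purely illustrative.
MIN_DESCRIPTION_LENGTH = 10


def find_field_problems(fields):
    """Return (field_name, problem) pairs for one parsed schema file."""
    problems = []
    for field in fields:
        name = field.get("name", "<unnamed>")
        for key in REQUIRED_KEYS:
            if not field.get(key):
                problems.append((name, f"missing {key}"))
        description = (field.get("description") or "").strip()
        if len(description) < MIN_DESCRIPTION_LENGTH:
            problems.append((name, "trivial description"))
    return problems


def find_cross_file_duplicates(schemas):
    """Map each field name defined in more than one file to those files."""
    files_by_name = defaultdict(list)
    for filename, fields in schemas.items():
        for field in fields:
            files_by_name[field["name"]].append(filename)
    return {name: files for name, files in files_by_name.items() if len(files) > 1}


# Made-up sample mirroring the problems this PR describes fixing.
sample_schemas = {
    "global.yaml": [
        {"name": "dau", "type": "INTEGER", "mode": "NULLABLE",
         "description": "Count of daily active users."},
        {"name": "app_version", "type": "STRING",  # mode deliberately missing
         "description": "Application version string."},
    ],
    "ads_derived.yaml": [
        {"name": "dau", "type": "INTEGER", "mode": "NULLABLE",
         "description": "Count of daily active users."},
        {"name": "price", "type": "FLOAT", "mode": "NULLABLE",
         "description": "Price."},
    ],
}

for filename, fields in sample_schemas.items():
    for name, problem in find_field_problems(fields):
        print(f"{filename}: {name}: {problem}")
print(find_cross_file_duplicates(sample_schemas))
```

On the sample data, the per-file check flags `app_version` for a missing mode and `price` for its "Price." description, and the duplicate check reports `dau` as defined in both files. Loading the real YAML files into this shape is left out because it depends on the registry's actual layout.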