
feat: Implement data validator module for NLP model #108

Open
nimamafi9271 wants to merge 1 commit into InnovAIte-Deakin:contribution from nimamafi9271:feature/misinformation-data-validator

Conversation

@nimamafi9271
Contributor

@nimamafi9271 nimamafi9271 commented May 2, 2026

This PR updates the synthetic data validation script to match the finalised input data schema for the NLP misinformation model.

The script now validates the generated JSON dataset against the agreed schema before it is passed to downstream components. The update focuses on ensuring that required fields are present, values follow the expected format, and the dataset is suitable for later model training and evaluation.
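
As a rough illustration of the record-level checks described here, the following is a minimal Python sketch, not the code in this PR. The field names, allowed values, and numeric ranges come from the list below; the function and constant names are hypothetical, and the pattern for `post_id` assumes that the `X` placeholders in `fp_XXXXXXXX` stand for eight alphanumeric characters.

```python
import re
from datetime import datetime

# Field names and allowed values are taken from the PR description.
# All Python names here (constants, validate_record) are hypothetical.
REQUIRED_FIELDS = [
    "post_id", "text", "label", "platform", "narrative_theme",
    "timestamp_simulated", "language", "generation_template", "split",
]
ALLOWED_VALUES = {
    "label": {"misinformation", "credible", "unverified"},
    "platform": {"twitter", "facebook", "reddit", "news_article",
                 "official_agency"},
    "narrative_theme": {"arson_blame", "govt_inaction", "evacuation_false",
                        "fire_extent_exaggeration", "official_update",
                        "factual_report", "unrelated"},
    "split": {"train", "val", "test"},
    "language": {"en"},
}
# Assumption: the X's in fp_XXXXXXXX are eight alphanumeric characters.
POST_ID_PATTERN = re.compile(r"fp_[0-9A-Za-z]{8}")


def validate_record(record):
    """Return (errors, warnings) for a single post dictionary."""
    errors, warnings = [], []

    # Required fields must be present and non-empty.
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            errors.append(f"{field}: missing or empty")

    post_id = record.get("post_id")
    if isinstance(post_id, str) and post_id.strip():
        if not POST_ID_PATTERN.fullmatch(post_id):
            errors.append(f"post_id: '{post_id}' does not match fp_XXXXXXXX")

    # Controlled vocabulary fields (missing values were reported above).
    for field, allowed in ALLOWED_VALUES.items():
        value = record.get(field)
        if value in (None, ""):
            continue
        if not isinstance(value, str) or value not in allowed:
            errors.append(f"{field}: {value!r} is not an allowed value")

    text = record.get("text")
    if isinstance(text, str) and text.strip() and not 20 <= len(text) <= 300:
        errors.append(f"text: length {len(text)} is outside 20-300")

    ts = record.get("timestamp_simulated")
    if isinstance(ts, str) and ts.strip():
        try:
            # Before Python 3.11, fromisoformat accepts only a subset of
            # ISO 8601 (for example, no trailing "Z").
            parsed = datetime.fromisoformat(ts)
        except ValueError:
            errors.append(f"timestamp_simulated: '{ts}' is not ISO 8601")
        else:
            if not 2019 <= parsed.year <= 2023:
                warnings.append(f"timestamp_simulated: {ts} is outside 2019-2023")

    # source_credibility is optional; validate it only when present.
    cred = record.get("source_credibility")
    if cred is not None:
        if isinstance(cred, bool) or not isinstance(cred, (int, float)):
            errors.append("source_credibility: must be numeric")
        elif not 0.0 <= cred <= 1.0:
            errors.append("source_credibility: must be within 0.0-1.0")

    return errors, warnings
```

Keeping errors and warnings in separate lists mirrors the distinction the description draws between hard failures and flagged timestamps.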

Key changes and features:

  • Updates the expected schema to match the finalised input data schema
  • Validates required fields such as post_id, text, label, platform, narrative_theme, timestamp_simulated, language, generation_template, and split
  • Treats location_mentioned and source_credibility as optional fields
  • Checks that required fields are not missing or empty
  • Checks that post_id values are unique across the full dataset
  • Validates the post_id format against the expected fp_XXXXXXXX structure
  • Validates controlled vocabulary fields, including:
    • label: misinformation, credible, unverified
    • platform: twitter, facebook, reddit, news_article, official_agency
    • narrative_theme: arson_blame, govt_inaction, evacuation_false, fire_extent_exaggeration, official_update, factual_report, unrelated
    • split: train, val, test
    • language: en
  • Checks that text length is within the required 20–300 character range
  • Validates timestamp_simulated as a valid ISO 8601 datetime string
  • Flags timestamps outside the 2019–2023 window as warnings
  • Validates source_credibility when present, ensuring it is numeric and within the 0.0–1.0 range
  • Checks narrative theme coverage in the training split and warns if any theme is below the expected minimum coverage
  • Checks overall train/validation/test split distribution against the expected 70/15/15 ratio
  • Checks label distribution within each split against the expected target distribution
  • Removes unexpected extra columns from the cleaned output
  • Produces a cleaned JSON output only when validation passes
  • Generates a validation report containing errors, warnings, dataset summary, and expected columns
  • Prints validation results to the console for quick debugging
  • Uses a rule-based structure so future validation rules can be added or modified more easily
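
The dataset-level checks in the list above (unique `post_id` values, the 70/15/15 split ratio, and removal of unexpected columns) could be sketched in Python roughly as follows. This is an illustrative sketch under stated assumptions, not the PR's implementation: the function names are hypothetical, and the `tolerance` value is a placeholder because the description does not say how far a split may deviate before it is flagged.

```python
from collections import Counter

# Ratios and column names come from the PR description; the Python
# names and the default tolerance are hypothetical.
EXPECTED_SPLIT_RATIO = {"train": 0.70, "val": 0.15, "test": 0.15}
EXPECTED_COLUMNS = {
    "post_id", "text", "label", "platform", "narrative_theme",
    "timestamp_simulated", "language", "generation_template", "split",
    "location_mentioned", "source_credibility",  # optional fields
}


def find_duplicate_ids(records):
    """Return the sorted post_id values that occur more than once."""
    counts = Counter(
        r["post_id"] for r in records if isinstance(r.get("post_id"), str)
    )
    return sorted(pid for pid, n in counts.items() if n > 1)


def check_split_ratio(records, tolerance=0.05):
    """Return one warning per split whose share deviates beyond tolerance."""
    total = len(records)
    if total == 0:
        return ["dataset is empty"]
    counts = Counter(
        r["split"] for r in records if isinstance(r.get("split"), str)
    )
    warnings = []
    for split, expected in EXPECTED_SPLIT_RATIO.items():
        actual = counts[split] / total
        if abs(actual - expected) > tolerance:
            warnings.append(
                f"split '{split}': {actual:.1%} of records, "
                f"expected about {expected:.0%}"
            )
    return warnings


def strip_extra_columns(record):
    """Return a copy of the record with only the expected columns."""
    return {k: v for k, v in record.items() if k in EXPECTED_COLUMNS}
```

A caller would typically run these after the per-record checks and write the cleaned JSON only if no errors were collected, in line with the behaviour described above.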

nimamafi9271 changed the title from "Implement data validator module for NLP model" to "feat: Implement data validator module for NLP model" on May 11, 2026
nimamafi9271 force-pushed the feature/misinformation-data-validator branch from 2d9ee79 to 99d4bf3 on May 11, 2026 16:27
@Oscarswild
Collaborator

Looks good from my end. Do you have any output results by any chance that could be uploaded here?

Oscarswild self-assigned this on May 12, 2026