feat: Implement data validator module for NLP model by nimamafi9271 · Pull Request #108 · InnovAIte-Deakin/InnovAIte_FireFusion_Project

nimamafi9271 · 2026-05-02T06:36:04Z

This PR revises the synthetic data validation script to match the finalised input data schema for the NLP misinformation model.

The script now validates the generated JSON dataset against the agreed schema before it is passed to downstream components. The update focuses on ensuring that required fields are present, values follow the expected format, and the dataset is suitable for later model training and evaluation.

Key changes and features:

- Updated the expected schema based on the finalised input data schema
- Validates required fields such as post_id, text, label, platform, narrative_theme, timestamp_simulated, language, generation_template, and split
- Treats location_mentioned and source_credibility as optional fields
- Checks that required fields are not missing or empty
- Checks that post_id values are unique across the full dataset
- Validates the post_id format based on the expected fp_XXXXXXXX structure
- Validates controlled vocabulary fields, including:
  - label: misinformation, credible, unverified
  - platform: twitter, facebook, reddit, news_article, official_agency
  - narrative_theme: arson_blame, govt_inaction, evacuation_false, fire_extent_exaggeration, official_update, factual_report, unrelated
  - split: train, val, test
  - language: en
- Checks that text length is within the required 20–300 character range
- Validates timestamp_simulated as a valid ISO 8601 datetime string
- Flags timestamps outside the 2019–2023 window as warnings
- Validates source_credibility when present, ensuring it is numeric and within the 0.0–1.0 range
- Checks narrative theme coverage in the training split and warns if any theme is below the expected minimum coverage
- Checks overall train/validation/test split distribution against the expected 70/15/15 ratio
- Checks label distribution within each split against the expected target distribution
- Removes unexpected extra columns from the cleaned output
- Produces a cleaned JSON output only when validation passes
- Generates a validation report containing errors, warnings, dataset summary, and expected columns
- Prints validation results to the console for quick debugging
- Uses a rule-based structure so future validation rules can be added or modified more easily
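
For reviewers, the record-level and dataset-level rules listed above could be sketched roughly as follows. This is an illustrative sketch, not the script in this PR: the function names `validate_record` and `validate_dataset`, the 5% split tolerance, the message wording, and the reading of `fp_XXXXXXXX` as `fp_` followed by eight digits are all assumptions. Only the field names, vocabularies, and ranges come from the description.

```python
import re
from collections import Counter
from datetime import datetime

# Field names, vocabularies, and ranges come from the PR description.
# Function names, message wording, and the tolerance are hypothetical.
REQUIRED = ["post_id", "text", "label", "platform", "narrative_theme",
            "timestamp_simulated", "language", "generation_template", "split"]
VOCAB = {
    "label": {"misinformation", "credible", "unverified"},
    "platform": {"twitter", "facebook", "reddit", "news_article",
                 "official_agency"},
    "narrative_theme": {"arson_blame", "govt_inaction", "evacuation_false",
                        "fire_extent_exaggeration", "official_update",
                        "factual_report", "unrelated"},
    "split": {"train", "val", "test"},
    "language": {"en"},
}
# Assumption: the X's in fp_XXXXXXXX stand for eight digits.
POST_ID_RE = re.compile(r"fp_\d{8}")
TARGET_SPLITS = {"train": 0.70, "val": 0.15, "test": 0.15}


def validate_record(rec):
    """Return (errors, warnings) for a single post dictionary."""
    errors, warnings = [], []
    # Assumption: every required field is a non-empty string.
    for field in REQUIRED:
        value = rec.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"{field}: missing, empty, or not a string")
    if errors:
        return errors, warnings  # later checks need well-formed fields

    if not POST_ID_RE.fullmatch(rec["post_id"]):
        errors.append("post_id: expected fp_XXXXXXXX")
    for field, allowed in VOCAB.items():
        if rec[field] not in allowed:
            errors.append(f"{field}: {rec[field]!r} not in controlled vocabulary")
    if not 20 <= len(rec["text"]) <= 300:
        errors.append("text: length outside 20-300 characters")
    try:
        # fromisoformat covers common ISO 8601 forms; a trailing "Z" is
        # accepted only on Python 3.11 and later.
        year = datetime.fromisoformat(rec["timestamp_simulated"]).year
        if not 2019 <= year <= 2023:
            warnings.append("timestamp_simulated: outside 2019-2023 window")
    except ValueError:
        errors.append("timestamp_simulated: not a valid ISO 8601 datetime")

    cred = rec.get("source_credibility")  # optional field
    if cred is not None:
        is_number = isinstance(cred, (int, float)) and not isinstance(cred, bool)
        if not is_number or not 0.0 <= cred <= 1.0:
            errors.append("source_credibility: must be a number in 0.0-1.0")
    return errors, warnings


def validate_dataset(records, tolerance=0.05):
    """Run record checks, then post_id uniqueness and split-ratio checks."""
    errors, warnings = [], []
    for i, rec in enumerate(records):
        rec_errors, rec_warnings = validate_record(rec)
        errors += [f"record {i}: {e}" for e in rec_errors]
        warnings += [f"record {i}: {w}" for w in rec_warnings]

    ids = Counter(r["post_id"] for r in records
                  if isinstance(r.get("post_id"), str))
    errors += [f"duplicate post_id: {pid}" for pid, n in ids.items() if n > 1]

    splits = Counter(r["split"] for r in records
                     if isinstance(r.get("split"), str))
    total = len(records)
    for name, target in TARGET_SPLITS.items():
        share = splits[name] / total if total else 0.0
        if abs(share - target) > tolerance:
            warnings.append(f"split {name}: {share:.2f} vs target {target:.2f}")
    return errors, warnings
```

Keeping errors and warnings in separate lists mirrors the description, where out-of-window timestamps and distribution drift are reported without blocking the cleaned output, while schema violations do block it.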

Oscarswild · 2026-05-12T05:33:11Z

Looks good from my end. Do you have any output results by any chance that could be uploaded here?

nimamafi9271 changed the title ~~Implement data validator module for NLP model~~ feat: Implement data validator module for NLP model May 11, 2026

Update data validator script for misinformation model

99d4bf3

nimamafi9271 force-pushed the feature/misinformation-data-validator branch from 2d9ee79 to 99d4bf3 on May 11, 2026 16:27

Oscarswild self-assigned this May 12, 2026

nimamafi9271 wants to merge 1 commit into InnovAIte-Deakin:contribution from nimamafi9271:feature/misinformation-data-validator
