Data validation script#143

Open
yasandu0505 wants to merge 10 commits into LDFLK:main from yasandu0505:workflow/validating-datasets
Conversation

@yasandu0505
Member

@yasandu0505 yasandu0505 commented Mar 17, 2026

  • This is the initial data validation script with the pre-agreed constraints. Take a look and let's tune the core logic here; then we can integrate it with the CI pipeline.

Update (2026-03-20):

I have added the modularized script and the CI pipeline here as well. This PR closes #89.
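For discussion, here is a minimal sketch of an entry point a CI step could call. It is not the code in this PR: the names `validate_tree` and `main`, the recursive search for `data.json`, and the placeholder check are all assumptions made for illustration.

```python
import json
from pathlib import Path


def validate_tree(root, check):
    """Run check(data) on every data.json under root.

    Returns a dict mapping file path -> list of error messages,
    containing only the files that failed.
    """
    failures = {}
    for path in sorted(Path(root).rglob("data.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError) as exc:
            # ValueError covers both invalid JSON and bad encodings.
            failures[str(path)] = [f"unreadable: {exc}"]
            continue
        errors = check(data)
        if errors:
            failures[str(path)] = errors
    return failures


def main(argv):
    """Return 0 when every file passes and 1 otherwise."""
    root = argv[1] if len(argv) > 1 else "."
    # Placeholder check (assumption): the real script would plug in
    # its schema validation and custom checks here.
    failures = validate_tree(
        root, lambda data: [] if isinstance(data, dict) else ["not an object"]
    )
    for path, errors in failures.items():
        for message in errors:
            print(f"{path}: {message}")
    return 1 if failures else 0
```

A pipeline step would typically pass the return value of `main(sys.argv)` to `sys.exit`, so that a non-zero status fails the job.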

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces an initial data validation script designed to ensure the integrity and consistency of data files. By establishing a formal schema and implementing custom checks, it aims to catch common data quality issues early, laying the groundwork for future integration into a CI pipeline to automate data quality assurance.

Highlights

  • New Data Validation Script: Introduced an initial data validation script to enforce pre-agreed constraints on data files, aiming to improve data quality.
  • JSON Schema Definition: Defined a JSON schema (schema.json) to specify the expected structure for data, requiring 'columns' (array of strings) and 'rows' (array of arrays).
  • Custom Validation Logic: Implemented custom validation checks in validator.py for issues such as duplicate column names, row-column count mismatches, data type consistency within columns, and the presence of empty values.
  • Dependency Management: Added the jsonschema library as a dependency for programmatic schema validation.
  • Virtual Environment Exclusion: Updated the .gitignore file to exclude the virtual environment directory (venv/) from version control.
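To make the schema highlight concrete, the sketch below reconstructs the described shape as a Python dict and mirrors it with a dependency-free structural check. The dict is an assumption based only on the summary above; the actual schema.json may contain more constraints. The function name `structural_errors` is hypothetical, and the PR itself relies on the jsonschema library rather than a hand-written check.

```python
# Hypothetical reconstruction of the shape described in the summary;
# the real schema.json in the PR may differ.
SCHEMA = {
    "type": "object",
    "required": ["columns", "rows"],
    "properties": {
        "columns": {"type": "array", "items": {"type": "string"}},
        "rows": {"type": "array", "items": {"type": "array"}},
    },
}


def structural_errors(data):
    """Hand-rolled stand-in for the schema check; returns a list of messages."""
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    errors = []
    for key in SCHEMA["required"]:
        if key not in data:
            errors.append(f"missing required key: {key}")
    if "columns" in data:
        columns = data["columns"]
        if not isinstance(columns, list) or not all(
            isinstance(c, str) for c in columns
        ):
            errors.append("'columns' must be an array of strings")
    if "rows" in data:
        rows = data["rows"]
        if not isinstance(rows, list) or not all(isinstance(r, list) for r in rows):
            errors.append("'rows' must be an array of arrays")
    return errors
```

With jsonschema installed, `jsonschema.validate(instance=data, schema=SCHEMA)` would enforce the same dict and raise `jsonschema.ValidationError` on the first violation.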

Changelog

  • .gitignore
    • Added venv/ to the list of ignored directories.
  • scripts/validator/requirements.txt
    • Added jsonschema>=4.25,<5 as a required dependency.
  • scripts/validator/schema.json
    • Created a new JSON schema file defining the structure for data files, requiring columns (array of strings) and rows (array of arrays).
  • scripts/validator/validator.py
    • Created a new Python script for data validation.
    • Implemented JSON schema validation using jsonschema.
    • Added custom validation checks for duplicate columns, row length consistency, data type consistency per column, and detection of empty values.
    • Included a temporary warning for float values, suggesting conversion to strings.
    • Provided a main function to process data.json files within a specified directory.
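The custom checks listed for validator.py could look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: the function name `custom_checks`, the message wording, and the choice to treat empty values as errors and floats as warnings are all hypothetical.

```python
def custom_checks(columns, rows):
    """Return (errors, warnings) for one table. All names are illustrative."""
    errors, warnings = [], []

    # 1. Duplicate column names.
    seen = set()
    for name in columns:
        if name in seen:
            errors.append(f"duplicate column: {name}")
        seen.add(name)

    # 2. Every row must have exactly one value per column.
    for i, row in enumerate(rows):
        if len(row) != len(columns):
            errors.append(f"row {i}: expected {len(columns)} values, got {len(row)}")

    # 3. Per column: empty values, float warning, type consistency.
    for j, name in enumerate(columns):
        types = set()
        for i, row in enumerate(rows):
            if j >= len(row):
                continue  # short row already reported; avoid an IndexError
            value = row[j]
            if value is None or value == "":
                errors.append(f"row {i}, column {name}: empty value")
                continue
            if isinstance(value, float):
                warnings.append(
                    f"row {i}, column {name}: float value, consider a string"
                )
            types.add(type(value).__name__)
        if len(types) > 1:
            errors.append(f"column {name}: mixed types {sorted(types)}")

    return errors, warnings
```

Skipping short rows in step 3 keeps a row-length mismatch from crashing the later per-column checks, so one malformed row is reported once instead of aborting the run.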

Activity

  • The pull request was opened by yasandu0505 with an initial description outlining the purpose of the data validation script and its intended future integration with a CI pipeline.
  • No further human activity (comments, reviews, or updates) has been recorded since the PR's creation.
Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions

github-actions bot commented Mar 17, 2026

PR Preview Action v1.8.1

🚀 View preview at
https://LDFLK.github.io/datasets/pr-preview/pr-143/

Built to branch gh-pages at 2026-03-23 10:38 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a data validation script, which is a great initiative. The script performs schema validation and several custom checks. My review focuses on improving the robustness, correctness, and performance of the validation logic in validator.py. I've identified a critical bug that can cause the script to crash, along with opportunities to make the code more efficient and resilient. Please see my detailed comments below.

@github-actions

github-actions bot commented Mar 17, 2026

📦 Documentation Build

Status: ✅ Build successful
Artifact: Download docs-build-pr-143

To preview locally:

  1. Click the artifact link above
  2. Scroll to "Artifacts" section and download docs-build-pr-143
  3. Extract the zip file
  4. Run npx serve . in the extracted folder
  5. Open http://localhost:3000

Built from commit 5ef1cd2

Member

@ChanukaUOJ ChanukaUOJ left a comment

Initial Comments!

@yasandu0505 yasandu0505 requested a review from sehansi-9 March 20, 2026 09:15
Member

@ChanukaUOJ ChanukaUOJ left a comment

Some comments

@ChanukaUOJ
Member

Also, can you add some test cases to cover the helper functions?
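As a starting point, tests for a helper might look like the sketch below. The helper `find_duplicate_columns` is hypothetical and defined inline only so the example is self-contained; the real helper names and signatures in the PR may differ. The tests are plain assert-style functions that a runner such as pytest would collect.

```python
def find_duplicate_columns(columns):
    """Hypothetical helper: names that occur more than once, in first-seen order."""
    seen, duplicates = set(), []
    for name in columns:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates


def test_no_duplicates():
    assert find_duplicate_columns(["id", "name"]) == []


def test_reports_each_duplicate_once():
    # "id" appears three times but should be listed a single time.
    assert find_duplicate_columns(["id", "name", "id", "id"]) == ["id"]


def test_empty_header():
    assert find_duplicate_columns([]) == []
```

Keeping each helper a pure function that takes data and returns a list makes it testable without touching the file system.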

Development

Successfully merging this pull request may close these issues.

[Validation][Workflow] Automate data validations

3 participants