xSV validation / linting #166

Numerous import pipelines consume xSV files (CSV, TSV, and other delimiter-separated formats) with a documented file structure. It would be useful to validate these files against that structure before loading their data into the lakehouse.

Some of the xSV files to be imported into the lakehouse are very large, so the solution needs to work well on multi-GB files. Ideally, invalid lines would be reported and/or removed, and the rest of the file processed as normal.
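
To make the "report and/or remove invalid lines, process the rest" behaviour concrete, here is a minimal sketch using only the Python standard library. It is not one of the candidate tools below. It assumes "invalid" means "wrong number of fields"; the function name `split_valid_rows` and its parameters are hypothetical. Rows are streamed one at a time, so memory use should not grow with file size.

```python
import csv
import io
from typing import TextIO


def split_valid_rows(
    src: TextIO,
    good: TextIO,
    rejects: TextIO,
    expected_columns: int,
    delimiter: str = ",",
) -> tuple[int, int]:
    """Stream rows from src; write well-formed rows to good, the rest to rejects.

    Assumption: a row is valid when it has exactly expected_columns fields.
    Returns (valid_count, rejected_count).
    """
    reader = csv.reader(src, delimiter=delimiter)
    good_writer = csv.writer(good, delimiter=delimiter, lineterminator="\n")
    reject_writer = csv.writer(rejects, delimiter=delimiter, lineterminator="\n")
    valid = rejected = 0
    for row in reader:
        if len(row) == expected_columns:
            good_writer.writerow(row)
            valid += 1
        else:
            # Prefix the source line number so the reject report points back
            # at the original file.
            reject_writer.writerow([reader.line_num, *row])
            rejected += 1
    return valid, rejected


if __name__ == "__main__":
    # Tiny in-memory demo; real use would pass file handles opened with newline="".
    sample = io.StringIO("id,name,qty\n1,apple,3\n2,pear\n4,plum,9\n")
    good_out, reject_out = io.StringIO(), io.StringIO()
    print(split_valid_rows(sample, good_out, reject_out, expected_columns=3))
```

Note that blank lines are treated as rejects here (they parse as zero fields), and the header row passes through as an ordinary valid row. A dedicated tool would likely check much more than field counts.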

See the [awesomeCSV](https://github.com/secretGeek/awesomeCSV) tool list for a broader survey of options.

Candidates for use:
- [qsv toolkit](https://github.com/dathere/qsv): includes a [validation module](https://github.com/dathere/qsv/blob/master/docs/help/validate.md) that strips out invalid lines
- [Frictionless tabular data resource and validator](http://frictionlessdata.io/specs/tabular-data-resource/)
- [DuckDB CSV import](https://duckdb.org/docs/lts/data/csv/reading_faulty_csv_files) with invalid line detection
- [csvlinter Go package](https://github.com/csvlinter/csvlinter)


TODO:

- [x] choose csv linting tool, add to container
- [ ] create configurable csv clean / lint workflow
- [ ] generate configs for existing csv imports
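
As a rough illustration of what a "configurable" lint step and a per-import config might look like, here is a hedged sketch in plain Python. It does not reflect the chosen tool or its config format: `XsvLintConfig`, its fields, and `lint` are hypothetical names invented for this example, and the column parsers are assumed to raise `ValueError` on bad input.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Callable, Iterator, TextIO


@dataclass
class XsvLintConfig:
    """Hypothetical per-import lint settings; field names are illustrative only."""

    delimiter: str = ","
    has_header: bool = True
    # Column name -> parser that raises ValueError when the value is invalid.
    columns: dict[str, Callable[[str], object]] = field(default_factory=dict)


def lint(src: TextIO, config: XsvLintConfig) -> Iterator[tuple[int, str]]:
    """Yield (line_number, message) for each problem found, one row at a time."""
    reader = csv.reader(src, delimiter=config.delimiter)
    names = list(config.columns)
    if config.has_header:
        header = next(reader, None)
        if header != names:
            yield reader.line_num, f"header mismatch: expected {names}, got {header}"
            return
    for row in reader:
        if len(row) != len(names):
            yield reader.line_num, f"expected {len(names)} fields, got {len(row)}"
            continue
        for name, value in zip(names, row):
            try:
                config.columns[name](value)
            except ValueError:
                yield reader.line_num, f"column {name!r}: cannot parse {value!r}"


if __name__ == "__main__":
    # One config per existing import; this TSV layout is made up for the demo.
    cfg = XsvLintConfig(delimiter="\t", columns={"id": int, "name": str, "score": float})
    for line_no, message in lint(io.StringIO("id\tname\tscore\n1\ta\t0.5\n"), cfg):
        print(line_no, message)
```

Keeping the config as data separate from the checker means generating configs for existing imports is mostly a matter of writing down each file's delimiter and column types.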

