This project sets up the data modelling and day-to-day operations of theLook e-commerce DWH, leveraging:

- dbt-core
- BigQuery
- Cloud Composer
- Google Cloud Provider for Terraform

The DWH transformations of theLook e-commerce data were architected under the following principles and guidelines:

- Analysts shouldn't have to do multiple joins to retrieve meaningful data
- Tables are organized around major topics of interest, such as customers, products, orders
- Each subject is represented as One-Big-Table with nested arrays and structs
  - child objects should never be orphans
  - child objects will always be queried within the context of the parent object
- Data should reflect how the current underlying platform functions
- Data should reflect the topics of interest to the business
- Only process pieces of information that have changed
- Avoid scanning too much data per run
- Backfilling historical data should be possible via the scheduled run without the need for extra code adjustments
- Changes in data should be easy to trace and audit
- Processing by topic instead of monolithic schedules of all topics together
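
To make the One-Big-Table principle more concrete, here is a minimal Python sketch of the nesting idea. It is an illustration only and not the project's actual dbt models: the `orders` and `order_items` rows, their field names, and the `nest_items` helper are all hypothetical, and raising an error on an orphaned child is just one assumed way of enforcing the "no orphans" rule.

```python
from collections import defaultdict

# Hypothetical flat rows, as they might arrive from two separate source tables.
orders = [
    {"order_id": 1, "user_id": 10, "status": "Shipped"},
    {"order_id": 2, "user_id": 11, "status": "Processing"},
]
order_items = [
    {"order_item_id": 100, "order_id": 1, "product_id": 7, "sale_price": 19.99},
    {"order_item_id": 101, "order_id": 1, "product_id": 8, "sale_price": 5.00},
    {"order_item_id": 102, "order_id": 2, "product_id": 7, "sale_price": 19.99},
]


def nest_items(orders, order_items):
    """Return one record per order with its items embedded as a list of dicts."""
    known_ids = {order["order_id"] for order in orders}
    items_by_order = defaultdict(list)
    for item in order_items:
        # Assumption: a child row without a parent is treated as an error.
        if item["order_id"] not in known_ids:
            raise ValueError(f"orphan order item: {item['order_item_id']}")
        # The parent key is dropped because the child now lives inside the parent.
        child = {key: value for key, value in item.items() if key != "order_id"}
        items_by_order[item["order_id"]].append(child)
    return [
        {**order, "items": items_by_order.get(order["order_id"], [])}
        for order in orders
    ]


nested_orders = nest_items(orders, order_items)
```

Each element of `nested_orders` is a single parent record, so a consumer can read an order together with its items without joining a second table, which mirrors what nested arrays and structs offer in the warehouse.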

The following linters are in place:

- SQL linting with custom configuration in `.sqlfluff`
- YAML linting with custom configuration in `.yamllint`
- Python linting with default configuration via `pylint`
- Markdown linting with default configuration via `pymarkdownlint`

To check whether your SQL complies with the defined standard, you can run the following commands:

```bash
# lint a specific file
sqlfluff lint path/to/file.sql
# lint a directory of files
sqlfluff lint directory/of/sql/files
# let the linter fix your code
sqlfluff fix folder/model.sql
```

SQL linting (and fixing) is enforced via pre-commit hooks for sqlfluff.

YAML files can be checked with yamllint:

```bash
# check which files will be linted by default
yamllint --list-files .
# lint a specific file
yamllint my_file.yml
# OR lint everything under the current directory
yamllint .
```

Linting rules for Markdown have been defined in `.markdownlint.yaml` and are enforced via pymarkdownlint pre-commit hooks.

### [pre-commit hooks](https://github.com/pre-commit/pre-commit-hooks)
Pre-commit hooks have been set up in this repo to check for and fix:
- missing newlines at the end of files
- trailing whitespace
- violations of SQL standards
- errors in YAML syntax
### [dbt-checkpoint hooks](https://github.com/dbt-checkpoint/dbt-checkpoint)
dbt-checkpoint hooks have been set up to check that:
- there are no compilation errors
- [no dbt script is directly referring to a table](https://github.com/dbt-checkpoint/dbt-checkpoint/blob/main/HOOKS.md#check-script-has-no-table-name)
- [script contains only existing sources or macros](https://github.com/dbt-checkpoint/dbt-checkpoint/blob/main/HOOKS.md#check-script-ref-and-source)
- [no semicolons have been left at the end of SQL queries](https://github.com/dbt-checkpoint/dbt-checkpoint/blob/main/HOOKS.md#remove-script-semicolon)
- [sources have freshness defined](https://github.com/dbt-checkpoint/dbt-checkpoint/blob/main/HOOKS.md#check-source-has-freshness)
- [sources have tests](https://github.com/dbt-checkpoint/dbt-checkpoint/blob/main/HOOKS.md#check-source-has-tests)
- [sources have tests by group](https://github.com/dbt-checkpoint/dbt-checkpoint/blob/main/HOOKS.md#check-source-has-tests-by-group)
Hence, when working with the repo, make sure the pre-commit hooks are installed so that they run on every commit:
```bash
# install the githook scripts
pre-commit install
# run against all existing files
pre-commit run --all-files