27 changes: 11 additions & 16 deletions .github/workflows/onpush.yml
@@ -20,10 +20,15 @@ jobs:
env:
DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
+ BRANCH: ${{ github.head_ref || github.ref_name }}
+ DEVELOPER: ${{ github.actor }}

steps:
- uses: actions/checkout@v1

+ - name: Dump GitHub context
+   run: echo '${{ toJson(github) }}'

- name: Set up Python
uses: actions/setup-python@v5
with:
@@ -47,14 +52,9 @@ jobs:

- name: Deploy on staging
run: |
- BRANCH_NAME="${{ github.head_ref || github.ref_name }}"
- PR_NUMBER="${{ github.event.pull_request.number }}"
- DEVELOPER="${{ github.actor }}"
-
- uv run python ./scripts/generate_template_workflow.py staging --serverless \
-   --branch "$BRANCH_NAME" \
-   --developer "$DEVELOPER" \
-   $(if [ -n "$PR_NUMBER" ]; then echo "--pr-number $PR_NUMBER"; fi)
+ uv run python ./scripts/generate_template_workflow.py staging \
+   --branch "$BRANCH" \
+   --developer "$DEVELOPER"

uv run databricks bundle deploy --target staging

@@ -64,13 +64,8 @@ jobs:

- name: Deploy on prod
run: |
- BRANCH_NAME="${{ github.head_ref || github.ref_name }}"
- PR_NUMBER="${{ github.event.pull_request.number }}"
- DEVELOPER="${{ github.actor }}"
-
- uv run python ./scripts/generate_template_workflow.py prod --serverless \
-   --branch "$BRANCH_NAME" \
-   --developer "$DEVELOPER" \
-   $(if [ -n "$PR_NUMBER" ]; then echo "--pr-number $PR_NUMBER"; fi)
+ uv run python ./scripts/generate_template_workflow.py prod \
+   --branch "$BRANCH" \
+   --developer "$DEVELOPER"

uv run databricks bundle deploy --target prod
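This workflow change moves the branch lookup into a job-level `BRANCH` variable built from the `github.head_ref || github.ref_name` fallback. A minimal Python sketch of the same logic, using the `GITHUB_HEAD_REF`/`GITHUB_REF_NAME` variables that GitHub Actions exports to the runner (the function name is illustrative):

```python
import os

def resolve_branch() -> str:
    # Mirrors `${{ github.head_ref || github.ref_name }}`:
    # on pull_request events GITHUB_HEAD_REF holds the source branch;
    # on push events it is empty and GITHUB_REF_NAME holds the pushed ref.
    return os.environ.get("GITHUB_HEAD_REF") or os.environ.get("GITHUB_REF_NAME", "")
```

On a pull request the source branch wins; on a direct push the pushed branch name is used, so the deployed job is always tagged with the branch that triggered it.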
4 changes: 2 additions & 2 deletions Makefile
@@ -10,8 +10,8 @@ pre-commit:
pre-commit autoupdate
pre-commit run --all-files

- deploy-serverless:
- 	uv run python ./scripts/generate_template_workflow.py $(env) --serverless
+ deploy:
+ 	uv run python ./scripts/generate_template_workflow.py $(env)
uv run databricks bundle deploy --target $(env)

run:
77 changes: 39 additions & 38 deletions README.md
@@ -33,15 +33,17 @@ Interested in bringing these principles in your own project? Let’s [connect o

This project template demonstrates how to:

- - structure PySpark code inside classes/packages.
- - run unit tests on transformations with [pytest package](https://pypi.org/project/pytest/) - set up VSCode to run unit tests on your local machine.
- - structure integration tests to be executed on different environments / catalogs.
+ - structure PySpark code inside classes/packages, instead of notebooks.
+ - package and deploy code to different environments (dev, staging, prod).
+ - use a CI/CD pipeline with [Github Actions](https://docs.github.com/en/actions).
+ - run unit tests on transformations with [pytest package](https://pypi.org/project/pytest/). Set up VSCode to run unit tests on your local machine.
+ - run integration tests setting the input data and validating the output data.
+ - isolate "dev" environments / catalogs to avoid concurrency issues between developer tests.
+ - show developer name and branch as job tags to track issues.
  - utilize [coverage package](https://pypi.org/project/coverage/) to generate test coverage reports.
- - package and deploy code to different environments (dev, staging, prod) using a CI/CD pipeline with [Github Actions](https://docs.github.com/en/actions).
- - isolate "dev" environments / catalogs to avoid concurrency issues between developers testing jobs.
  - utilize [uv](https://docs.astral.sh/uv/) as a project/package manager.
- - configure the workflow to run in different environments with different parameters with [jinja package](https://pypi.org/project/jinja2/).
- - configure the workflow to run tasks selectively.
+ - configure job to run in different environments with different parameters with [jinja package](https://pypi.org/project/jinja2/).
+ - configure job to run tasks selectively.
- use [medallion architecture](https://www.databricks.com/glossary/medallion-architecture) pattern.
- lint and format code with [ruff](https://docs.astral.sh/ruff/) and [pre-commit](https://pre-commit.com/).
- use a Make file to automate repetitive tasks.
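The "run tasks selectively" point can be sketched with a minimal argparse entry point. The task names and registry below are illustrative, not the actual contents of `src/template/main.py`:

```python
import argparse

# Hypothetical task registry; the real main.py dispatches to task classes
# under src/template/job1/.
TASKS = {
    "extract_source1": lambda: print("extracting source1"),
    "generate_orders": lambda: print("generating orders"),
}

def main(argv=None):
    parser = argparse.ArgumentParser(description="Run a single pipeline task")
    parser.add_argument("--task", choices=sorted(TASKS), required=True)
    args = parser.parse_args(argv)
    TASKS[args.task]()  # each job task invokes exactly one entry

if __name__ == "__main__":
    main()
```

Each Databricks job task can then call the same wheel entry point with a different `--task` value.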
@@ -52,9 +54,9 @@ This project template demonstrates how to:
- utilize [Databricks Asset Bundles](https://docs.databricks.com/en/dev-tools/bundles/index.html) to package/deploy/run a Python wheel package on Databricks.
- utilize [Databricks DQX](https://databrickslabs.github.io/dqx/) to define and enforce data quality rules, such as null checks, uniqueness, thresholds, and schema validation, and filter bad data on quarantine tables.
- utilize [Databricks SDK for Python](https://docs.databricks.com/en/dev-tools/sdk-python.html) to manage workspaces and accounts and analyse costs. Refer to 'scripts' folder for some examples.
- - utilize [Databricks Unity Catalog](https://www.databricks.com/product/unity-catalog) and get data lineage for your tables and columns and a simplified permission model for your data.
+ - utilize [Databricks Unity Catalog](https://www.databricks.com/product/unity-catalog) and get data lineage for your tables and columns.
- utilize [Databricks Lakeflow Jobs](https://docs.databricks.com/en/workflows/index.html) to execute a DAG and [task parameters](https://docs.databricks.com/en/workflows/jobs/parameter-value-references.html) to share context information between tasks (see [Task Parameters section](#task-parameters)). Yes, you don't need Airflow to manage your DAGs here!!!
- - utilize serverless job clusters on [Databricks Free Edition](https://docs.databricks.com/aws/en/getting-started/free-edition ) to deploy your pipelines.
+ - utilize serverless job clusters on [Databricks Free Edition](https://docs.databricks.com/aws/en/getting-started/free-edition) to deploy your pipelines.
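As a sketch of the Jinja2 rendering step mentioned above: the template string here is illustrative (the real template lives in `resources/wf_template_serverless.yml`), but the variable names match those passed by `generate_template_workflow.py`:

```python
from jinja2 import Template

# Illustrative fragment of a workflow template; the repo's actual template
# is rendered the same way via Environment/FileSystemLoader.
template = Template(
    "name: template_{{ environment }}\n"
    "tags:\n"
    "  branch: {{ branch }}\n"
    "  developer: {{ developer }}\n"
)

rendered = template.render(environment="dev", branch="main", developer="jane")
print(rendered)
```

Rendering once per target environment produces a plain YAML workflow that the bundle can deploy, which is how one template serves dev, staging, and prod with different parameters.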

## 🧠 Resources

@@ -78,39 +80,39 @@ Other:
```
databricks-template/
├── .github/                              # CI/CD automation
│   └── workflows/
│       └── onpush.yml                    # GitHub Actions pipeline
├── src/                                  # Main source code
│   └── template/                         # Python package
│       ├── main.py                       # Entry point with CLI (argparse)
│       ├── config.py                     # Configuration management
│       ├── baseTask.py                   # Base class for all tasks
│       ├── commonSchemas.py              # Shared PySpark schemas
│       └── job1/                         # Job-specific tasks
│           ├── extract_source1.py
│           ├── extract_source2.py
│           ├── generate_orders.py
│           ├── generate_orders_agg.py
│           ├── integration_setup.py
│           └── integration_validate.py
├── tests/                                # Unit tests
│   └── job1/
│       └── unit_test.py                  # Pytest unit tests
├── resources/                            # Databricks workflow templates
│   ├── wf_template_serverless.yml        # Jinja2 template for serverless
│   ├── wf_template.yml                   # Jinja2 template for job clusters
│   └── workflow.yml                      # Generated workflow (auto-created)
├── scripts/                              # Helper scripts
│   ├── generate_template_workflow.py     # Workflow generator (Jinja2)
│   ├── sdk_analyze_job_costs.py          # Cost analysis script
│   └── sdk_workspace_and_account.py      # Workspace and account management
├── docs/                                 # Documentation assets
│   ├── dag.png
│   ├── task_output.png
│   ├── data_lineage.png
@@ -127,6 +129,14 @@ databricks-template/
└── README.md                             # This file
```

+ ## CI/CD pipeline
+
+ <br>
+
+ <img src="docs/ci_cd.png">
+
+ <br>

## Jobs

<br>
@@ -161,31 +171,22 @@ databricks-template/
<br>


- ## CI/CD pipeline
-
- <br>
-
- <img src="docs/ci_cd.png">
-
- <br>


## Instructions

1) Create a workspace. Use a [Databricks Free Edition](https://docs.databricks.com/aws/en/getting-started/free-edition) workspace.


- 2) Install and configure Databricks CLI on your local machine. Follow instructions [here](https://docs.databricks.com/en/dev-tools/cli/install.html). Check the current version on databricks.yaml.
+ 2) Install and configure Databricks CLI on your local machine. Check the current version on databricks.yaml. Follow instructions [here](https://docs.databricks.com/en/dev-tools/cli/install.html).


- 3) Build Python env and execute unit tests on your local machine
+ 3) Build Python env and execute unit tests on your local machine.

make sync && make test


4) Deploy and execute on the dev workspace.

- make deploy-serverless env=dev
+ make deploy env=dev


5) Configure CI/CD automation. Set up the [GitHub Actions repository secrets](https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions) (DATABRICKS_HOST and DATABRICKS_TOKEN).
18 changes: 5 additions & 13 deletions scripts/generate_template_workflow.py
@@ -34,41 +34,33 @@ def main():
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
- python generate_template_workflow.py dev --serverless
- python generate_template_workflow.py staging --serverless --branch main --developer john --pr-number 123
+ python generate_template_workflow.py dev
+ python generate_template_workflow.py staging --branch main --developer john
""",
)

parser.add_argument("environment", help="Target environment (dev, staging, prod)")
- parser.add_argument("--serverless", action="store_true", help="Use serverless workflow template")
parser.add_argument("--branch", help="Git branch name (auto-detected if not provided)")
parser.add_argument("--developer", help="Developer/deployer name (auto-detected if not provided)")
- parser.add_argument("--pr-number", help="Pull request number (optional)")

args = parser.parse_args()

- # Get or auto-detect git metadata
+ # Auto-detect git metadata in local environments, use provided values in CI
branch = args.branch if args.branch else get_git_branch()
developer = args.developer if args.developer else get_git_user()
- pr_number = args.pr_number if args.pr_number else ""

print(f"Environment: {args.environment}")
- print(f"Serverless mode: {args.serverless}")
print(f"Git branch: {branch}")
print(f"Developer: {developer}")
- print(f"PR number: {pr_number if pr_number else 'N/A'}")

# Load and render template
file_loader = FileSystemLoader(".")
env = Environment(loader=file_loader)

- if args.serverless:
-     template = env.get_template("/resources/wf_template_serverless.yml")
- else:
-     template = env.get_template("/resources/wf_template.yml")
+ template = env.get_template("/resources/wf_template_serverless.yml")

# Render the template with all variables
- output = template.render(environment=args.environment, branch=branch, developer=developer, pr_number=pr_number)
+ output = template.render(environment=args.environment, branch=branch, developer=developer)

# Save the rendered YAML to a file
output_file = "./resources/workflow.yml"
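`get_git_branch()` and `get_git_user()` are referenced above but defined outside this hunk. A plausible implementation — an assumption, not necessarily the repo's actual helpers — shells out to `git`:

```python
import subprocess

def _git(*args: str) -> str:
    # Hypothetical helper: run a git command and return its trimmed stdout.
    return subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout.strip()

def get_git_branch() -> str:
    # Current branch of the local checkout.
    return _git("rev-parse", "--abbrev-ref", "HEAD")

def get_git_user() -> str:
    # Configured git user name of the developer deploying locally.
    return _git("config", "user.name")
```

In CI these fall back paths are never hit, because the workflow now passes `--branch "$BRANCH"` and `--developer "$DEVELOPER"` explicitly.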