[DEV-12156] - Add Park Loader#3

Open
zachflanders-frb wants to merge 9 commits into qat from ftr/dev-12156-load-park

Conversation

@zachflanders-frb

Added a load_park.py script. I made changes to the Dockerfile in order to run spark commands locally; I'm not sure what the implications of uncommenting these lines are, or how others have their local setup.

Additional questions:

  • Do we want to continue to support the postgres loading pattern in this script?
  • Is DeltaModel the right type of model, or should I have used CSVModel? It seems like the structure is not really used anywhere.

@dpb-bah
Contributor

dpb-bah commented Apr 14, 2026

I made changes to the Dockerfile in order to run spark commands locally, I am not sure what the implications are of uncommenting these lines or how others have their local set up.

I think this is fine if you want to do that, but I would ask that you not commit it to the repo. I removed it because it added a bit of extra time to reinstalling, and running pyspark within the main container should set up the jars/installation at runtime. Also, there are other spark containers using Dockerfile.spark that include those steps, setting things up in the background.

Do we want to continue to support the postgres loading pattern in this script?

Unclear what you mean here. The resulting data in the model should be the same (so the same logic of transformations, cleaning the data, etc.), but it shouldn't be postgres-based, no.

Is DeltaModel the right type of model, or should I have used CSVModel? It seems like the structure is not really used anywhere.

That's a good catch ahead of time. I can go over this in a call/chat, but I updated the data model.xlsx, figma, and ticket so that the data models should be:

  • PARKBronze(CSVModel): no loader required; just based on the raw park file placed in the bucket
  • PARKGold(CSVModel): updated in the script you're making here (I'm trying to set a precedent of loaders/[name]_[bronze/silver/gold].py, so something like park_gold.py) that pulls from PARKBronze, cleans it up, and saves it to PARKGold(CSVModel).

Over time we're trying to figure out the requirements for separating CSVModel vs DeltaModel; so far I'm basing it on size (<500k), so this would apply, but that can change/evolve.
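A rough sketch of what the bronze-to-gold loader described above could look like, following the loaders/[name]_[bronze/silver/gold].py precedent. The file name park_gold.py comes from the comment; the clean() logic and the read/write callables are assumptions for illustration, not the repo's actual API:

```python
# loaders/park_gold.py (sketch): pull from PARKBronze, clean, save to PARKGold
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Example cleanup step: drop fully-empty rows and normalize column names."""
    df = df.dropna(how="all")
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df


def load_park_gold(read_bronze, write_gold) -> None:
    """Hypothetical entry point: read_bronze/write_gold stand in for the
    actual PARKBronze.read / PARKGold.write model methods."""
    df = read_bronze()
    write_gold(clean(df))
```

The helper signatures are placeholders; in the repo the models themselves would presumably own the read/write logic.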

Comment on lines -12 to +13
-    """Simply converts datetime's to datetime64[us] for dataframes"""
-    return np.datetime64(dt).astype("datetime64[us]")
+    """Simply converts datetime's to datetime64[ns] for dataframes"""
+    return np.datetime64(dt).astype("datetime64[ns]")
Author


Seems that nanoseconds is the default unit when pandas parses datetime columns. The microsecond values were resulting in a cryptic error when performing a pandas.concat: ValueError: Shape of passed values is (1, 7), indices imply (2, 7).
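The unit mismatch can be checked with a small sketch; to_datetime64 here is a stand-in for the helper in the diff, and the date is made up:

```python
from datetime import datetime

import numpy as np
import pandas as pd


def to_datetime64(dt):
    """Convert a datetime to datetime64[ns], matching pandas' default unit."""
    return np.datetime64(dt).astype("datetime64[ns]")


# pandas parses datetime strings to nanosecond precision by default
parsed = pd.to_datetime(pd.Series(["2026-04-14"]))

# np.datetime64 of a stdlib datetime defaults to microseconds, hence the mismatch
raw = np.datetime64(datetime(2026, 4, 14))
converted = to_datetime64(datetime(2026, 4, 14))
```

Converting to [ns] on the way in keeps every column at pandas' native resolution, so concat sees uniform dtypes.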


class DeltaModel(LakeHouseModel):
FORMAT = LakeHouseModelFormat.DELTA
STRUCTURE: StructType
Author


The STRUCTURE property, of type StructType, is really only used by the DeltaModel, so I moved it from LakeHouseModel to this model. I added a DTYPES property to the CSVModel that serves a similar purpose but uses the more pandas-like paradigm of column names and dtypes.
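A minimal sketch of the class hierarchy this describes. The names LakeHouseModel, LakeHouseModelFormat, CSVModel, and DeltaModel come from the diff; the class bodies, the ParkGold example, and its columns are assumptions (and STRUCTURE would really be a pyspark.sql.types.StructType, stubbed out here to stay dependency-free):

```python
from datetime import datetime
from enum import Enum


class LakeHouseModelFormat(Enum):
    CSV = "csv"
    DELTA = "delta"


class LakeHouseModel:
    """Base model: carries only the storage format, no schema."""
    FORMAT: LakeHouseModelFormat


class CSVModel(LakeHouseModel):
    FORMAT = LakeHouseModelFormat.CSV
    # pandas-style schema: column name -> dtype
    DTYPES: dict = {}


class DeltaModel(LakeHouseModel):
    FORMAT = LakeHouseModelFormat.DELTA
    # spark-style schema; pyspark's StructType in the real repo
    STRUCTURE = None


class ParkGold(CSVModel):
    # hypothetical columns for illustration only
    DTYPES = {"park_id": str, "visited_at": datetime}
```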

Comment on lines +369 to +375
params = {
"dtype": {k: v for k, v in self.DTYPES.items() if v != datetime},
"parse_dates": [k for k, v in self.DTYPES.items() if v == datetime],
"usecols": cols,
}
# Ensure that any passed in kwargs take precedence over the default params
params.update(kwargs)
Author


I added these params to ensure that the pandas dataframes are read from csv with consistent data types. This uses the new DTYPES property of the CSVModel.
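A usage sketch of that params pattern with pandas.read_csv. The DTYPES mapping and column names here are made up for illustration; the dict comprehensions mirror the ones in the diff:

```python
from datetime import datetime
from io import StringIO

import pandas as pd

# hypothetical schema in the CSVModel.DTYPES style
DTYPES = {"park_id": str, "acres": float, "visited_at": datetime}
cols = list(DTYPES)

csv = StringIO("park_id,acres,visited_at\n0042,12.5,2026-04-14\n")

# datetime columns go to parse_dates; everything else goes to dtype,
# since read_csv's dtype argument can't parse datetimes itself
params = {
    "dtype": {k: v for k, v in DTYPES.items() if v != datetime},
    "parse_dates": [k for k, v in DTYPES.items() if v == datetime],
    "usecols": cols,
}
df = pd.read_csv(csv, **params)
```

Reading park_id as str also preserves leading zeros that a numeric default dtype would drop.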
