Conversation
I think this is fine if you want to do that, but I'd ask that it not be committed to the repo. I removed it because it added extra time to rebuilds, and running PySpark within the main container should set up the jars/installation at runtime. Also, there are other Spark containers using Dockerfile.spark that include those setup steps running in the background.
It's unclear what you mean here. The resulting data in the model should be the same (i.e., the same logic for transformations, cleaning the data, etc.), but it shouldn't be Postgres-based, no.
That's a good catch ahead of time. I can go over this in a call/chat, but I updated the data model.xlsx, Figma, and ticket such that the data models should be
Over time we're trying to figure out the requirements for separating CSVModel vs. DeltaModel. So far I'm basing it on size (&lt;500k), so this would apply, but that can change/evolve.
| """Simply converts datetime's to datetime64[us] for dataframes""" | ||
| return np.datetime64(dt).astype("datetime64[us]") | ||
| """Simply converts datetime's to datetime64[ns] for dataframes""" | ||
| return np.datetime64(dt).astype("datetime64[ns]") |
It seems that nanoseconds is the default precision when pandas parses datetime columns. The microsecond precision was producing a cryptic error when performing a pandas.concat: `ValueError: Shape of passed values is (1, 7), indices imply (2, 7)`.
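A minimal sketch of the conversion described above (the helper name `to_datetime64` is hypothetical; the real function lives in this repo):

```python
from datetime import datetime

import numpy as np


def to_datetime64(dt: datetime) -> np.datetime64:
    """Convert a datetime to nanosecond precision, matching pandas' default."""
    return np.datetime64(dt).astype("datetime64[ns]")


# Nanosecond values line up with pandas' own datetime64[ns] columns,
# so frames built from these values concatenate cleanly.
value = to_datetime64(datetime(2024, 1, 2, 3, 4, 5))
print(value.dtype)  # datetime64[ns]
```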
```python
class DeltaModel(LakeHouseModel):
    FORMAT = LakeHouseModelFormat.DELTA
    STRUCTURE: StructType
```
The STRUCTURE property (type StructType) is really only used by the DeltaModel, so I moved it from LakeHouseModel to this model. I added a DTYPES property to the CSVModel that serves a similar purpose but uses the more pandas-like paradigm of column names and dtypes.
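A hedged sketch of what the CSVModel side of that split could look like; the class and column names here are illustrative only, not the repo's actual definitions:

```python
from datetime import datetime


class CSVModel:
    # Column names mapped to dtypes the way pandas expects them.
    # `datetime` is used as a sentinel for columns that should be
    # parsed as dates rather than passed to the dtype argument.
    DTYPES: dict[str, type] = {
        "id": int,
        "name": str,
        "created_at": datetime,
    }


# Example: split the mapping the same way the loader does.
date_cols = [k for k, v in CSVModel.DTYPES.items() if v == datetime]
print(date_cols)  # ['created_at']
```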
```python
params = {
    "dtype": {k: v for k, v in self.DTYPES.items() if v != datetime},
    "parse_dates": [k for k, v in self.DTYPES.items() if v == datetime],
    "usecols": cols,
}
# Ensure that any passed-in kwargs take precedence over the default params
params.update(kwargs)
```
I added these params to ensure that pandas dataframes are read from CSV with consistent data types. This uses the new DTYPES property of the CSVModel.
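To illustrate how those params behave, here is a self-contained sketch using an in-memory CSV (the DTYPES mapping and column names are made up for the example):

```python
import io
from datetime import datetime

import pandas as pd

# Hypothetical DTYPES mapping, standing in for CSVModel.DTYPES.
DTYPES = {"id": int, "name": str, "created_at": datetime}

csv_data = io.StringIO("id,name,created_at\n1,alpha,2024-01-02\n")

params = {
    # Non-datetime columns get explicit dtypes...
    "dtype": {k: v for k, v in DTYPES.items() if v != datetime},
    # ...while datetime columns go through parse_dates, which yields
    # datetime64[ns] by default.
    "parse_dates": [k for k, v in DTYPES.items() if v == datetime],
}

df = pd.read_csv(csv_data, **params)
print(df.dtypes["created_at"])  # datetime64[ns]
```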
Added a load_park.py script. I made changes to the Dockerfile in order to run Spark commands locally; I'm not sure what the implications of uncommenting these lines are, or how others have their local setup.
Additional questions: