nn.Embedding to avoid OneHotEncoding all categorical columns #425

ravinkohli wants to merge 12 commits into reg_cocktails-common_paper_modifications from
Conversation
| # allows us to pass embed_columns to the dataset properties.
| # TODO: test the trade off
| # Another solution is to combine `OneHotEncoding`, `Embedding` and `NoEncoding` in one custom transformer.
| # this will also allow users to use this transformer outside the pipeline
Suggested change:
| # this will also allow users to use this transformer outside the pipeline
| # this will also allow users to use this transformer outside the pipeline, see [this](https://github.com/manujosephv/pytorch_tabular/blob/main/pytorch_tabular/categorical_encoders.py#L132)
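To illustrate the idea floated in this comment, here is a minimal, hypothetical sketch of such a combined transformer. The class name, threshold parameter, and routing rule are assumptions for illustration, not the PR's code; the pytorch_tabular encoder linked above is a more complete reference.

```python
from typing import List, Optional

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class CombinedCategoricalEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: pick one-hot, embedding, or no encoding per
    column based on cardinality, so one transformer replaces three."""

    def __init__(self, min_values_for_embedding: int = 10):
        self.min_values_for_embedding = min_values_for_embedding

    def fit(self, X: np.ndarray, y: Optional[np.ndarray] = None) -> "CombinedCategoricalEncoder":
        cardinalities = [np.unique(X[:, i]).size for i in range(X.shape[1])]
        # high-cardinality columns are integer-coded for a downstream nn.Embedding
        self.embed_columns_: List[int] = [
            i for i, n in enumerate(cardinalities) if n >= self.min_values_for_embedding
        ]
        # low-cardinality (non-binary) columns keep the cheap one-hot route
        self.onehot_columns_: List[int] = [
            i for i, n in enumerate(cardinalities)
            if 2 < n < self.min_values_for_embedding
        ]
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        # A real implementation would expand self.onehot_columns_ and
        # integer-code self.embed_columns_; the per-column split is the point here.
        return X
```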
autoPyTorch/api/base_task.py
Outdated
| def get_search_updates(categorical_indicator: List[bool]):
| def get_search_updates(categorical_indicator: List[bool]) -> HyperparameterSearchSpaceUpdates:
The method argument is not used, I believe it could be removed.
autoPyTorch/api/base_task.py
Outdated
| self.input_validator: Optional[BaseInputValidator] = None
| self.search_space_updates = search_space_updates if search_space_updates is not None else get_search_updates(categorical_indicator)
| # if search_space_updates is not None else get_search_updates(categorical_indicator)
I think this could also be removed.
| self.logger.debug(f"run_summary_dict {json.dumps(run_summary_dict)}")
| with open(os.path.join(self.backend.temporary_directory, 'run_summary.txt'), 'a') as file:
| file.write(f"{json.dumps(run_summary_dict)}\n")
| # self._write_run_summary(pipeline)
Based on the functionality that was encapsulated in the function, I think this should be called here, right?
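For reference, a hedged sketch of what that encapsulated helper might look like, assuming `_write_run_summary` simply wraps the inline block shown in the diff; how `run_summary_dict` is built is not visible here, so a hypothetical accessor stands in for it:

```python
import json
import os

def _write_run_summary(self, pipeline) -> None:
    # Hypothetical wrapper around the inline block from the diff above.
    run_summary_dict = self._build_run_summary_dict(pipeline)  # hypothetical accessor
    self.logger.debug(f"run_summary_dict {json.dumps(run_summary_dict)}")
    with open(os.path.join(self.backend.temporary_directory, 'run_summary.txt'), 'a') as file:
        file.write(f"{json.dumps(run_summary_dict)}\n")
```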
| def _add_forbidden_conditions(self, cs: ConfigurationSpace) -> ConfigurationSpace:
| """
| Add forbidden conditions to ensure valid configurations.
| Currently, Learned Entity Embedding is only valid when encoder is one hot encoder
Based on the changes introduced in the PR, I think the first condition mentioned in the docstring regarding Learned Entity Embedding should be removed.
| return self
| def transform(self, X: Dict[str, Any]) -> Dict[str, Any]:
| if self.num_categories_per_col is not None:
self.num_categories_per_col is initialized as an empty list, which means that it will not be None for the encoded columns either. Maybe this condition should be changed to:
if self.num_categories_per_col:
...
It will be None when there are no categorical columns; see line 38.
Hm, but line 38 initializes self.num_categories_per_col to an empty list if there are categorical columns, and [] is not None returns True.
I'm mentioning this because I thought that in line 53 we check whether there are columns to be embedded; currently the if condition evaluates to true both for embedded and encoded columns.
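A standalone illustration of the point under discussion, with a plain list standing in for the attribute:

```python
num_categories_per_col = []  # initialized as an empty list, as in the code under review

print(num_categories_per_col is not None)  # True: the `is not None` check passes even when empty
print(bool(num_categories_per_col))        # False: the truthiness check correctly skips

if num_categories_per_col:
    print("there are columns to embed")  # not reached for an empty list
```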
autoPyTorch/api/base_task.py
Outdated
| # has_cat_features = any(categorical_indicator)
| # has_numerical_features = not all(categorical_indicator)
I think this should be removed.
| """ | ||
| Args: | ||
| config (Dict[str, Any]): The configuration sampled by the hyperparameter optimizer | ||
| num_input_features (np.ndarray): column wise information of number of output columns after transformation |
I think num_input_features should be replaced with:
num_categories_per_col (np.ndarray): number of categories for categorical columns that will be embedded
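Applied to the snippet above, the suggestion would make the docstring read roughly as follows; the surrounding signature is assumed for illustration:

```python
def __init__(self, config: Dict[str, Any], num_categories_per_col: np.ndarray) -> None:
    """
    Args:
        config (Dict[str, Any]): The configuration sampled by the hyperparameter optimizer
        num_categories_per_col (np.ndarray): number of categories for categorical columns
            that will be embedded
    """
```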
| ("imputer", SimpleImputer(random_state=self.random_state)), | ||
| # ("variance_threshold", VarianceThreshold(random_state=self.random_state)), | ||
| # ("coalescer", CoalescerChoice(default_dataset_properties, random_state=self.random_state)), | ||
| ("column_splitter", ColumnSplitter(random_state=self.random_state)), |
I think the docstring of the class should be updated to also include column_splitter as a step.
| ("coalescer", CoalescerChoice(default_dataset_properties, random_state=self.random_state)), | ||
| # ("variance_threshold", VarianceThreshold(random_state=self.random_state)), | ||
| # ("coalescer", CoalescerChoice(default_dataset_properties, random_state=self.random_state)), | ||
| ("column_splitter", ColumnSplitter(random_state=self.random_state)), |
Same as in tabular_classification.py: it would be nice to add this step to the docstring as well.
…edding) (#437)

* add updates for apt1.0+reg_cocktails
* debug loggers for checking data and network memory usage
* add support for pandas, test for data passing, remove debug loggers
* remove unwanted changes
* :
* Adjust formula to account for embedding columns
* Apply suggestions from code review
  Co-authored-by: nabenabe0928 <47781922+nabenabe0928@users.noreply.github.com>
* remove unwanted additions
* Update autoPyTorch/pipeline/components/preprocessing/tabular_preprocessing/TabularColumnTransformer.py
  Co-authored-by: nabenabe0928 <47781922+nabenabe0928@users.noreply.github.com>
* reduce number of hyperparameters for pytorch embedding
* remove todos for the preprocessing PR, and apply suggestion from code review
* remove unwanted exclude in test
This branch will be merged into reg_cocktails. Therefore, this PR has been shifted to #451.
This PR replaces the `LearnedEntityEmbedding` with PyTorch's `nn.Embedding`, which implicitly one-hot encodes categorical columns. This leads to a reduction in memory usage compared to the old version.

Types of changes
Motivation and Context
One-hot encoding can lead to an explosion in memory when the number of categories per column is high. Using `nn.Embedding` for such categorical columns will significantly reduce memory usage. Moreover, it is a more robust and simpler implementation of the embedding module. To do this, I have introduced a new pipeline step called `ColumnSplitter` (I am open to better name suggestions), which has `min_values_for_embedding` as a hyperparameter; a minimal sketch of the idea follows below.
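The following is a minimal sketch of the idea, not the PR's actual `ColumnSplitter` or embedding module; the sizes, the threshold, and the exact routing rule are made up for illustration. A column whose cardinality reaches the threshold is routed to an `nn.Embedding` lookup, so the per-batch activation shrinks from `(batch, num_categories)` to `(batch, embed_dim)` and the one-hot matrix is never materialised:

```python
import torch
import torch.nn as nn

# Made-up sizes: one categorical column with 10,000 distinct values,
# a batch of 256 rows, embedding dimension 8, split threshold 10.
num_categories, batch_size, embed_dim = 10_000, 256, 8
min_values_for_embedding = 10
codes = torch.randint(0, num_categories, (batch_size,))

if num_categories >= min_values_for_embedding:
    # Embedding route: a lookup into a (10000, 8) weight table.
    out = nn.Embedding(num_categories, embed_dim)(codes)   # shape (256, 8)
else:
    # Low-cardinality route: explicit one-hot stays cheap.
    out = torch.nn.functional.one_hot(codes, num_categories).float()

print(out.shape)  # torch.Size([256, 8])
```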
It also makes minor changes which optimise some parts of the library. These include:

* the `EarlyPreprocessing` node, making it more efficient.
* dropping `self.categories` from the tabular feature validator, which according to [memo] High memory consumption and the places of doubts #180 takes a lot of memory. We don't really need to store all the categories anyway; we only need `num_categories_per_col`.

How has this been tested?
I have successfully run `example_tabular_classification` on the `Australian` dataset, where the default configuration allows us to verify the features introduced in this PR.