Conversation
Pull request overview
This PR introduces the CTEImputer (Compress Then Explain imputer) as a new imputation strategy for the shapiq package, following the methodology from Baniecki et al. (2025). The imputer uses distribution compression (Compress++ with Kernel Thinning) to subsample background data before imputing missing features, providing accurate and stable explanations while being computationally efficient.
Key Changes:
- Adds a new `CTEImputer` class implementing the compress-then-explain methodology with background data compression
- Makes `CTEImputer` the new default imputer in `TabularExplainer`, replacing `MarginalImputer`
- Adds the `goodpoints` library as a new dependency for distribution compression functionality
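The compress-then-explain idea from the overview can be sketched in plain NumPy. This is an illustrative toy, not the shapiq API: a uniform random subsample stands in for Compress++ with Kernel Thinning, and all variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(64, 3))            # background data
x = data[0]                                # point to explain
coalition = np.array([True, False, True])  # features kept from x

# CTE step 1: compress the background data. A plain random subsample
# stands in here for Compress++ with Kernel Thinning.
compressed = data[rng.choice(len(data), size=8, replace=False)]

# CTE step 2: impute the missing (out-of-coalition) features of x
# with values drawn from the compressed background set.
imputed = np.tile(x, (len(compressed), 1))
imputed[:, ~coalition] = compressed[:, ~coalition]

print(imputed.shape)  # (8, 3)
```

Because the compression runs once up front, every coalition evaluation afterwards only touches the small compressed set instead of the full background data.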
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/shapiq/imputer/cte_imputer.py | New implementation of CTEImputer class with distribution compression for efficient feature imputation |
| src/shapiq/imputer/__init__.py | Exports CTEImputer to make it available in the imputer module |
| src/shapiq/explainer/tabular.py | Updates TabularExplainer to use "cte" as the default imputer instead of "marginal" |
| pyproject.toml | Adds goodpoints dependency and shapiq keyword to project metadata |
| CHANGELOG.md | Documents the addition of CTEImputer feature for the development version |
```python
id_compressed = compress.compresspp_kt(data, kernel_type=b"gaussian", k_params=np.array([sigma**2]), g=4, seed=self.random_state)
self._replacement_data = data[id_compressed]
self.calc_empty_prediction()  # reset the empty prediction to the new background data
return self
```
There's a trailing space at the end of this line that should be removed to maintain code cleanliness.
```suggestion
return self
```
```python
"""
d = data.shape[1]
sigma = np.sqrt(2 * d)
id_compressed = compress.compresspp_kt(data, kernel_type=b"gaussian", k_params=np.array([sigma**2]), g=4, seed=self.random_state)
```
The `seed` parameter passed to `compresspp_kt` may not properly handle the case when `self.random_state` is `None`. The function expects an integer seed, but `random_state` can be `None` according to the `__init__` signature, which could lead to unexpected behavior or errors when `random_state=None`.
```suggestion
if self.random_state is None:
    id_compressed = compress.compresspp_kt(
        data,
        kernel_type=b"gaussian",
        k_params=np.array([sigma**2]),
        g=4,
    )
else:
    id_compressed = compress.compresspp_kt(
        data,
        kernel_type=b"gaussian",
        k_params=np.array([sigma**2]),
        g=4,
        seed=self.random_state,
    )
```
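An equivalent, less repetitive fix is to build the keyword arguments conditionally and splat them with `**`, so there is a single call site. The sketch below uses a hypothetical stand-in function for `goodpoints.compress.compresspp_kt` purely to illustrate the pattern; its body is invented for the example.

```python
import numpy as np

def compresspp_kt_stub(data, *, kernel_type, k_params, g, seed=None):
    """Hypothetical stand-in for goodpoints.compress.compresspp_kt."""
    rng = np.random.default_rng(seed)  # default_rng accepts None or an int
    return rng.choice(len(data), size=len(data) // 4, replace=False)

random_state = None  # may be an int or None, as in the __init__ signature
seed_kwargs = {} if random_state is None else {"seed": random_state}
id_compressed = compresspp_kt_stub(
    np.zeros((16, 3)),
    kernel_type=b"gaussian",
    k_params=np.array([2.0]),
    g=4,
    **seed_kwargs,
)
print(len(id_compressed))  # 4
```

This keeps the kernel parameters in one place and avoids the duplicated argument list of the if/else version.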
```diff
 ``["cte", "marginal", "baseline", "conditional"]``. Defaults to ``"cte"``, which
 initializes the default
-:class:`~shapiq.games.imputer.marginal_imputer.MarginalImputer` with its default
+:class:`~shapiq.games.imputer.marginal_imputer.CTEImputer` with its default
```
The cross-reference uses the wrong module path: `CTEImputer` is defined in the `cte_imputer` module, not in `marginal_imputer`, so the link should target `cte_imputer.CTEImputer`.
```suggestion
:class:`~shapiq.games.imputer.cte_imputer.CTEImputer` with its default
```
```python
imputed_data = np.tile(self.x, (n_coalitions, 1))
for i in range(sample_size):
```
The `imputed_data` array is initialized outside the loop but modified inside it, so after the first iteration it still carries the previous iteration's imputed values. The `np.tile` call should be moved inside the loop so that each iteration starts from a fresh copy of `self.x` tiled for all coalitions; otherwise the results are incorrect for every iteration after the first.
```suggestion
for i in range(sample_size):
    imputed_data = np.tile(self.x, (n_coalitions, 1))
```
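The effect can be seen in a tiny NumPy sketch (toy values, not the actual imputer code): mutations from one pass persist into the next unless the array is re-tiled.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
n_coalitions = 2

# Buggy pattern: the array is tiled once and mutated in the loop,
# so iteration 1 inherits iteration 0's imputed values.
imputed = np.tile(x, (n_coalitions, 1))
for i in range(2):
    if i == 1:
        stale = imputed[0, 0]  # still -99.0 from the previous pass
    imputed[0, 0] = -99.0      # "imputation" of this iteration

# Fixed pattern: re-tile so each iteration starts from a clean x.
for i in range(2):
    imputed = np.tile(x, (n_coalitions, 1))
    fresh = imputed[0, 0]      # always x[0] == 1.0
    imputed[0, 0] = -99.0

print(stale, fresh)  # -99.0 1.0
```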
```diff
@@ -0,0 +1,164 @@
+"""Implementation of the marginal imputer."""
```
The module docstring incorrectly states "Implementation of the marginal imputer" when it should say "Implementation of the CTE imputer" or "Implementation of the compress then explain imputer".
```python
    normalize: bool = True,
    random_state: int | None = None,
) -> None:
    """Initializes the marginal imputer.
```
The docstring incorrectly states "Initializes the marginal imputer" when it should say "Initializes the CTE imputer" or "Initializes the compress then explain imputer".
```python
Examples:
    >>> model = lambda x: np.sum(x, axis=1)
    >>> data = np.random.rand(10, 3)
    >>> imputer = MarginalImputer(model=model, data=data, x=data[0])
```
The example in the docstring incorrectly uses "MarginalImputer" instead of "CTEImputer". This should be updated to match the class being documented.
Introducing CTEImputer
closes #225
Adds the `CTEImputer` following the compress-then-explain (CTE) methodology. It replaces missing features of the explanation point with values sampled from the background data, which is first subsampled using a distribution compression algorithm, specifically Compress++ with Kernel Thinning. CTE has been shown to provide accurate and stable estimates of explanations while being computationally efficient. It is the new default imputer in `TabularExplainer`, removing the necessity to set `sample_size`.

TODO

- C++ packages