improved autofix strategy #148
improved autofix strategy #148 (Open)
aditya1503 wants to merge 34 commits into main from improve_autofix (base: main)
Changes from all commits (34 commits):
4615637 make pull request (aditya1503)
2a7cf91 cleaned skeleton code (aditya1503)
e7a3d07 cleanup (aditya1503)
72fc919 add type hinting (aditya1503)
d67bbc3 address PR comments (aditya1503)
fc4bf7c Update cleanlab_studio/internal/util.py (aditya1503)
9f00909 linting + doc change (aditya1503)
d2a3432 set ambiguous to 0 (aditya1503)
6bcec4c things to port to backend (aditya1503)
cc52ce2 Updated code for different strategies (sanjanag)
62efa2d Fixed apply method (sanjanag)
e5c4872 Added test for computing rows for exclusion (sanjanag)
02294c8 Improved formatting (sanjanag)
1d644a0 Added tests for updating label issue rows based on threshold (sanjanag)
3ff2507 Fixed mypy issue (sanjanag)
7235b40 Added test for checking right rows are dropped for non near duplicate… (sanjanag)
1b99d60 Added test for checking right rows are dropped for near duplicate issues (sanjanag)
330aa44 Added get defaults method (sanjanag)
a19c88c Return cleanset with original indices (sanjanag)
69ccda6 Merge branch 'main' into improve_autofix (sanjanag)
19143a3 Removed unimplemented test (sanjanag)
e5b97f5 removed unncessary merge change (sanjanag)
20a532c Fixed tests (sanjanag)
3bbfc1c Fixed mypy error (sanjanag)
b892e87 Added newline (sanjanag)
b54a0a7 Fixed formatting (sanjanag)
f870e04 added tests for dropped indices (sanjanag)
eb106d1 Added docs for user facing method (sanjanag)
a7acfa6 Black formatting (sanjanag)
1f0344d Merge remote-tracking branch 'origin/main' into improve_autofix (aditya1503)
692efe4 merge main (aditya1503)
afbe4a9 add github change request (aditya1503)
7b96faa Update cleanlab_studio/studio/studio.py (aditya1503)
b31674c linting (aditya1503)
Changes to cleanlab_studio/studio/studio.py:

```diff
@@ -1,8 +1,8 @@
 """
 Python API for Cleanlab Studio.
 """
-from typing import Any, List, Literal, Optional, Union
 import warnings
+from typing import Any, List, Literal, Optional, Union, Dict

 import numpy as np
 import numpy.typing as npt
@@ -15,14 +15,17 @@
 from cleanlab_studio.internal.api import api
 from cleanlab_studio.internal.util import (
     init_dataset_source,
     check_none,
     apply_corrections_snowpark_df,
     apply_corrections_spark_df,
     apply_corrections_pd_df,
+    apply_autofixed_cleanset_to_new_dataframe,
+    _get_autofix_defaults_for_strategy,
+    _get_param_values,
 )
 from cleanlab_studio.internal.settings import CleanlabSettings
 from cleanlab_studio.internal.types import FieldSchemaDict

 _snowflake_exists = api.snowflake_exists
 if _snowflake_exists:
     import snowflake.snowpark as snowpark
@@ -150,10 +153,10 @@ def apply_corrections(self, cleanset_id: str, dataset: Any, keep_excluded: bool
             cl_cols = self.download_cleanlab_columns(
                 cleanset_id, to_spark=True, include_project_details=True
             )
-            corrected_ds: pyspark.sql.DataFrame = apply_corrections_spark_df(
+            pyspark_corrected_ds: pyspark.sql.DataFrame = apply_corrections_spark_df(
                 dataset, cl_cols, id_col, label_col, keep_excluded
             )
-            return corrected_ds
+            return pyspark_corrected_ds

         elif isinstance(dataset, pd.DataFrame):
             cl_cols = self.download_cleanlab_columns(cleanset_id, include_project_details=True)
@@ -358,3 +361,54 @@ def poll_cleanset_status(self, cleanset_id: str, timeout: Optional[int] = None)
         except (TimeoutError, CleansetError):
             return False

+    def autofix_dataset(
+        self,
+        original_df: pd.DataFrame,
+        cleanset_id: str,
+        params: Optional[Dict[str, Union[int, float]]] = None,
+        strategy="optimized_training_data",
+    ) -> pd.DataFrame:
+        """
+        Improves a dataset by applying automatically-suggested corrections based on issues detected by Cleanlab.
+
+        Args:
+            cleanset_id (str): ID of the cleanset from the Project for this Dataset.
+            original_df (pd.DataFrame): The original dataset (must be a DataFrame, so only text and tabular datasets are currently supported).
+            params (dict, optional): Optional parameters to control how many data points from each type of detected data issue are auto-corrected or filtered (prioritizing the more severe instances of each issue). If not provided, default `params` values will be used.
+                The `params` dictionary includes the following options:
+
+                * drop_ambiguous (float): Fraction of the data points detected as ambiguous to exclude from the dataset.
+                * drop_label_issue (float): Fraction of the data points with label issues to exclude from the dataset.
+                * drop_near_duplicate (float): Fraction of the data points detected as near duplicates to exclude from the dataset.
+                * drop_outlier (float): Fraction of the data points detected as outliers to exclude from the dataset.
+                * relabel_confidence_threshold (float): Confidence threshold for the suggested label; data points with label issues that also exceed this threshold are re-labeled as the suggested label.
+
+            strategy (str): Which strategy to use for auto-fixing the dataset, out of the following possibilities:
+                ['optimized_training_data', 'drop_all_issues', 'suggested_actions'].
+                Each of these possibilities corresponds to a default setting of the `params` dictionary, designed to be used in different scenarios.
+                If specified, the `params` argument will override this argument. Specify 'optimized_training_data' when your goal is to auto-fix training data to achieve the best ML performance on randomly split test data.
+                Specify 'drop_all_issues' to instead exclude all data points detected to have issues from the dataset.
+                Specify 'suggested_actions' to instead apply the suggested action to each data point that is displayed in the Cleanlab Studio Web Application (e.g. relabeling for label issues, dropping for outliers, etc.).
+
+        Returns:
+            pd.DataFrame: A new dataframe after applying auto-fixes to the cleanset.
+        """
+        cleanset_df = self.download_cleanlab_columns(cleanset_id)
+        if params is not None and strategy is not None:
+            raise ValueError("Please provide only one of params or strategy for autofix")
+        param_values = _get_param_values(cleanset_df, params, strategy)
+        return apply_autofixed_cleanset_to_new_dataframe(original_df, cleanset_df, param_values)
+
+    def get_autofix_defaults(self, strategy="optimized_training_data") -> Dict[str, float]:
+        """
+        Returns the default `params` used when auto-fixing a dataset with a given strategy.
+
+        Args:
+            strategy (str): Auto-fixing strategy.
+                Possible strategies: optimized_training_data, drop_all_issues, suggested_actions
+
+        Returns:
+            Dict[str, float]: parameter dictionary containing the confidence threshold for auto-relabelling, and
+                the fraction of rows to drop for each issue type.
+        """
+        return _get_autofix_defaults_for_strategy(strategy)
```

Review comments on `autofix_dataset`:

Member: allow string options to be passed straight through into

Author (aditya1503): Should be added now, clarified in the docs.
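To make the `params` semantics above concrete, here is a minimal, self-contained sketch of how fraction-based autofix parameters could be applied to a cleanset-style dataframe. This is not the library's actual implementation; the column names (`is_label_issue`, `label_issue_score`, `suggested_label`, `suggested_label_confidence`) are assumptions modeled on typical Cleanlab columns, and only the label-issue handling is shown.

```python
import pandas as pd

def sketch_autofix(df: pd.DataFrame, params: dict) -> pd.DataFrame:
    """Illustrative only: relabel high-confidence label issues, then drop
    the most severe fraction of the remaining ones. Assumed column names."""
    df = df.copy()

    # Relabel label issues whose suggested-label confidence exceeds the threshold.
    relabel = df["is_label_issue"] & (
        df["suggested_label_confidence"] >= params["relabel_confidence_threshold"]
    )
    df.loc[relabel, "label"] = df.loc[relabel, "suggested_label"]

    # Drop the most severe `drop_label_issue` fraction of remaining label issues,
    # prioritizing rows with the highest issue score.
    issues = df[df["is_label_issue"] & ~relabel]
    n_drop = int(params["drop_label_issue"] * len(issues))
    to_drop = issues.nlargest(n_drop, "label_issue_score").index
    return df.drop(index=to_drop)
```

For example, with `relabel_confidence_threshold=0.9` and `drop_label_issue=0.5`, a row whose suggested label is 95% confident gets relabeled, while half of the remaining label-issue rows (the highest-scoring ones) are dropped.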
Review comments on `get_autofix_defaults`:

Reviewer: In autofix, we can simply multiply the fraction of issues from the cleanset defaults by the number of datapoints to get this.

Author: Right, when we spoke originally we wanted this call to be similar to the Studio web interface call, hence I rewrote it this way; it was a floating-point percentage before. The function `_get_autofix_defaults` does the multiplication by the number of datapoints.
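The conversion discussed in this thread can be sketched as follows. Note this is a hypothetical illustration: the strategy names come from the PR, but the fraction values and the helper name `defaults_as_row_counts` are made up for the example and are not the library's actual defaults.

```python
# Assumed (not actual) per-strategy default fractions, keyed by issue type.
EXAMPLE_DEFAULTS = {
    "drop_all_issues": {
        "drop_label_issue": 1.0,
        "drop_outlier": 1.0,
        "drop_near_duplicate": 1.0,
    },
    "optimized_training_data": {
        "drop_label_issue": 0.5,
        "drop_outlier": 0.5,
        "drop_near_duplicate": 0.2,
    },
}

def defaults_as_row_counts(strategy: str, num_rows: int) -> dict:
    """Convert per-strategy default fractions into absolute row counts
    by multiplying each fraction by the number of datapoints."""
    fractions = EXAMPLE_DEFAULTS[strategy]
    return {name: int(frac * num_rows) for name, frac in fractions.items()}
```

For a 100-row dataset under the assumed `optimized_training_data` fractions above, this would budget 50 label-issue rows and 20 near-duplicate rows for dropping, which is the fraction-times-datapoints conversion the reviewer describes.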