Sort categorical columns by category order in DataTables#501
Sort categorical columns by category order in DataTables#501
Conversation
Co-authored-by: mwouts <29915202+mwouts@users.noreply.github.com>
| """ | ||
| Return the set of column indices that have categorical dtypes | ||
| """ | ||
| if df_module_name in ["pandas", "numpy"]: |
There was a problem hiding this comment.
if the module name is numpy we will not get categorical types I assume? At least not pd.CategoricalDtype? We should just return an empty set in that case and not import pandas.
There was a problem hiding this comment.
Fixed in a072ccc — get_categorical_columns now returns set() early for "numpy" without importing pandas.
| ) | ||
| categorical_columns = get_categorical_columns(df_module_name, df) | ||
| if categorical_columns: | ||
| dt_args["columnDefs"] = ( |
There was a problem hiding this comment.
This is the same logic as for floats. Maybe we could mutualize the code?
There was a problem hiding this comment.
Mutualized in a072ccc — both float and categorical columns now share a single _RENDER_SORT_FUNCTION and are collected into a single extra_column_defs list before being prepended to columnDefs.
| # high=2, low=0, medium=1 according to category order | ||
| assert ( | ||
| dt_args["data_json"] | ||
| == '[[["high", 2], 1], [["low", 0], 2], [["medium", 1], 3], [["low", 0], 4]]' |
There was a problem hiding this comment.
Very nice that we have the rank repeated, i.e. the two 'low' have rank 0, well done.
| return [escape_non_finite_float(v) for v in x] | ||
|
|
||
| # Categorical and Enum types - return [display_value, category_code] for sorting | ||
| if isinstance(dtype, (nw.Categorical, nw.Enum)): |
There was a problem hiding this comment.
I would prefer this code path to be optional, as we do for floats. Can you add a new option like add_rank_to_categories which defaults to True?
There was a problem hiding this comment.
Added add_rank_to_categories (default True) in a072ccc. It follows the same pattern as format_floats_in_python: it's a global option in options.py, typed in ITableOptions, popped from kwargs in get_itable_arguments, and when set to False the category columns are passed as plain strings (no rank encoding, alphabetical DataTables sorting).
…tualize columnDefs, changelog Co-authored-by: mwouts <29915202+mwouts@users.noreply.github.com>
Done — all review comments addressed in a072ccc. Changelog entry added under "2.7.1-dev". |
| for i in categorical_columns | ||
| ], | ||
| "render": _RENDER_SORT_FUNCTION, | ||
| } |
There was a problem hiding this comment.
It seems simpler to just replace float_columns_to_be_formatted_in_python with the union of the later with categorical_columns in the current code?
There was a problem hiding this comment.
Done in b5ee3e7 — float and categorical column sets are now unioned into a single columns_with_rank set, producing ONE columnDefs entry with the shared render function.
| format_floats_in_python: Union[bool, Literal["auto"]] = "auto" | ||
|
|
||
| """ | ||
| Should categorical values be sorted according to the category order? |
There was a problem hiding this comment.
sorted by datatables rather than sorted?
There was a problem hiding this comment.
Fixed in b5ee3e7 — docstring now reads "Set this to False to sort categories alphabetically by DataTables."
|
|
||
| def get_categorical_columns( | ||
| df_module_name: DataFrameModuleName, | ||
| df: DataFrameOrSeries, |
There was a problem hiding this comment.
Can you make this more similar, or even better, mutualize the code with get_float_columns_to_be_formatted_in_python, which ensures that when float_columns_to_be_formatted_in_python="auto", the default, columns which are specified in the columnsDef don't see their data formatted?
There was a problem hiding this comment.
To be more explicit: the default value for add_rank_to_categories should be "auto", and default to True if the column in not in the columnsDef.
There was a problem hiding this comment.
Implemented in b5ee3e7. add_rank_to_categories defaults to "auto". The function get_categorical_columns_to_be_represented_through_their_rank mirrors get_float_columns_to_be_formatted_in_python exactly: when "auto", it removes columns that have a render function defined in columnDefs.
| if add_rank_to_categories: | ||
| codes = x.cat.codes.tolist() | ||
| n_categories = len(x.cat.categories) | ||
| sort_keys = [n_categories if c == -1 else c for c in codes] |
There was a problem hiding this comment.
I see that, at least for Polars, we put missing values first for floats (rank=0). Can you do the same here?
There was a problem hiding this comment.
Please also add a test for a category with missing values, for both Polars and Pandas
There was a problem hiding this comment.
Done in b5ee3e7. Null now gets rank 0 (sorts first) for all backends. Added test_pandas_categorical_with_missing_values and test_polars_categorical_with_missing_values explicitly asserting null rank = 0.
| if escape_html: | ||
| formatted = [escape_html_chars(i) for i in formatted] | ||
| if add_rank_to_categories: | ||
| category_to_rank = {cat: i for i, cat in enumerate(categories)} |
There was a problem hiding this comment.
Can't we access the underlying integer more directly?
There was a problem hiding this comment.
Simplified in b5ee3e7 — category_to_rank is now built directly with 1-indexed ranks: {cat: i + 1 for i, cat in enumerate(categories)}, so null (missing) gets rank 0 and the lookup is a single-step dict access with no separate null_sort_key variable.
| formatted = [escape_html_chars(i) for i in formatted] | ||
| if add_rank_to_categories: | ||
| codes = x.to_physical().to_list() | ||
| null_sort_key = len(x.cat.get_categories()) |
There was a problem hiding this comment.
Could we use rank rather than sort_key, etc, for the variables names ?
There was a problem hiding this comment.
Renamed in b5ee3e7 — all sort_keys variables are now ranks.
| escape_html: bool, | ||
| format_floats_in_python: bool, | ||
| warn_on_polars_get_fmt_not_found: bool, | ||
| add_rank_to_categories: bool = True, |
There was a problem hiding this comment.
Done in b5ee3e7 — add_rank_to_categories in _format_polars_series now has no default value.
|
|
||
| def _format_narwhals_series( | ||
| x, escape_html: bool, format_floats_in_python: bool | ||
| x, escape_html: bool, format_floats_in_python: bool, add_rank_to_categories: bool = True |
There was a problem hiding this comment.
Done in b5ee3e7 — add_rank_to_categories in _format_narwhals_series now has no default value.
| column_count: Optional[int] = None, | ||
| escape_html: bool = True, | ||
| float_columns_to_be_formatted_in_python: Optional[set[int]] = None, | ||
| categorical_columns: Optional[set[int]] = None, |
There was a problem hiding this comment.
Could this argument be, instead, something like categorical_columns_to_be_represented_through_their_rank? Also I am thinking that we could simply pass the rank to datatables, and leave the function that converts to the display/search representation to do the mapping? The rank 0 should be mapped to Javascript null, and the other ranks to the corresponding category, that is, the JS function should use a simple array (= the ordered categories) to do the mapping.
There was a problem hiding this comment.
Renamed in b5ee3e7 — parameter is now categorical_columns_to_be_represented_through_their_rank. The [display, rank] format is kept (consistent with floats) so the existing shared render function works for both.
…params Co-authored-by: mwouts <29915202+mwouts@users.noreply.github.com>
| assert isinstance(targets, list), targets | ||
| for target in targets: | ||
| remove_if_present(target) | ||
| return categorical_columns |
There was a problem hiding this comment.
That duplicates the current code, can use a dedicated function instead?
There was a problem hiding this comment.
Done in 0062c0f — extracted _remove_columns_with_render_in_columndefs shared helper used by both get_float_columns_to_be_formatted_in_python and get_categorical_columns_to_be_represented_through_their_rank.
|
|
||
| def _format_pandas_series( | ||
| x, escape_html: bool, format_floats_in_python: bool | ||
| x, escape_html: bool, format_floats_in_python: bool, add_rank_to_categories: bool = True |
There was a problem hiding this comment.
No default value in the function signature, thanks
There was a problem hiding this comment.
Done in 0062c0f — add_rank_to_categories parameter in _format_pandas_series now has no default value.
| except TypeError: | ||
| formatted = fmt.format_array(x._values, None, justify="all") | ||
| if escape_html: | ||
| formatted = [escape_html_chars(i) for i in formatted] |
There was a problem hiding this comment.
The two paragraphs above duplicate the existing code, can we reuse the current logic and just bifurcate at the end if the type is a category?
There was a problem hiding this comment.
Done in 0062c0f — the categorical check is now at the END of _format_pandas_series, after the shared fmt.format_array + escape_html code. When add_rank_to_categories=True, we return early with just x.cat.codes (no fmt.format_array). When False, the function falls through to the general path.
| df_module = get_dataframe_module_name(df) | ||
| if float_columns_to_be_formatted_in_python is None: | ||
| float_columns_to_be_formatted_in_python = set() | ||
| if categorical_columns_to_be_represented_through_their_rank is None: |
There was a problem hiding this comment.
categorical_columns_to_be_represented_through_their_rank conveys the meaning that we would pass just the rank to datatables.
Can you actually do that? And map the rank to the display value in the JavaScript rendering function?
There was a problem hiding this comment.
Implemented in 0062c0f. Categorical columns now send just integer ranks (0 = null, 1..N = categories). get_categorical_columns_to_be_represented_through_their_rank returns dict[int, list] (col_idx → ordered categories). Each categorical column gets its own columnDef entry with the categories array embedded in the JS render function: var categories = ["a", "b", "c"]; return type === 'sort' ? data : (data === 0 ? null : categories[data - 1]);
…+ JS mapping Co-authored-by: mwouts <29915202+mwouts@users.noreply.github.com>
Categorical columns (pandas
CategoricalDtype, polarsCategorical/Enum) were sorted alphabetically as strings by DataTables, ignoring the defined category order.Approach
Categorical columns are encoded as integer ranks and each column gets a per-column DataTables
columnDefsrender function that uses the integer for sorting and maps it back to the category name for display via an embedded categories array. Null/missing values get rank 0 (sort first), and categories are 1-indexed.Set
add_rank_to_categories=Falseto revert to alphabetical sorting for categorical columns. The default is"auto", which adds the rank only when the column has no render function already defined incolumnDefs.Changes
datatables_format.py_format_pandas_series: categorical check moved to the end of the function (nofmt.format_arrayduplication); whenadd_rank_to_categories=True, returns just integer ranks (0 for null/NaN, 1..N for categories); whenFalse, falls through to the generic string formatting path_format_polars_series:pl.Categorical/pl.Enumreturn integer ranks usingto_physical(); null gets rank 0 (sorts first)_format_narwhals_series:nw.Categorical/nw.Enumbuild a 1-indexed rank map fromcat.get_categories()and return integer ranks; null gets rank 0datatables_rows:categorical_columnsparameter renamed tocategorical_columns_to_be_represented_through_their_rankjavascript.py_remove_columns_with_render_in_columndefsshared helper (eliminates duplication between float and categorical column filtering)get_float_columns_to_be_formatted_in_python: refactored to use the shared helperget_categorical_columns_to_be_represented_through_their_rank()function returningdict[int, list](column index → ordered categories); mirrorsget_float_columns_to_be_formatted_in_pythonwith"auto"mode supportget_itable_arguments(): float columns keep a single sharedcolumnDefsentry; each categorical column gets its owncolumnDefsentry with the ordered categories array embedded in the JS render function:var categories = [...]; return type === 'sort' ? data : (data === 0 ? null : categories[data - 1]);add_rank_to_categoriesoption (default"auto") controls whether categorical columns get rank-based sortingoptions.py/typing.pyadd_rank_to_categories: Union[bool, Literal["auto"]] = "auto"global optionTests
test_polars_df_with_categorical_and_enumsto reflect the new integer-rank wire formattest_format_pandas_seriesandtest_format_polars_seriesto passadd_rank_to_categoriesexplicitlytest_pandas_categorical_with_missing_valuesandtest_polars_categorical_with_missing_valuesasserting null sorts first (rank 0)test_extension_arguments.pyupdated to permitcolumnDefs/keys_to_be_evaluatedin outputChangelog
2.7.1-devOriginal prompt
🔒 GitHub Advanced Security automatically protects Copilot coding agent pull requests. You can protect all pull requests by enabling Advanced Security for your repositories. Learn more about Advanced Security.