Skip to content

Sort categorical columns by category order in DataTables#501

Draft
Copilot wants to merge 5 commits intomainfrom
copilot/sort-categorical-types-order
Draft

Sort categorical columns by category order in DataTables#501
Copilot wants to merge 5 commits intomainfrom
copilot/sort-categorical-types-order

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Feb 22, 2026

Categorical columns (pandas CategoricalDtype, polars Categorical/Enum) were sorted alphabetically as strings by DataTables, ignoring the defined category order.

Approach

Categorical columns are encoded as integer ranks and each column gets a per-column DataTables columnDefs render function that uses the integer for sorting and maps it back to the category name for display via an embedded categories array. Null/missing values get rank 0 (sort first), and categories are 1-indexed.

# Ordered categorical: low < medium < high
df = pd.DataFrame({
    "priority": pd.Categorical(
        ["high", "low", "medium"],
        categories=["low", "medium", "high"],
        ordered=True,
    )
})
# Now sorts as low → medium → high instead of alphabetically
show(df)

Set add_rank_to_categories=False to revert to alphabetical sorting for categorical columns. The default is "auto", which adds the rank only when the column has no render function already defined in columnDefs.

Changes

  • datatables_format.py

    • _format_pandas_series: categorical check moved to the end of the function (no fmt.format_array duplication); when add_rank_to_categories=True, returns just integer ranks (0 for null/NaN, 1..N for categories); when False, falls through to the generic string formatting path
    • _format_polars_series: pl.Categorical/pl.Enum return integer ranks using to_physical(); null gets rank 0 (sorts first)
    • _format_narwhals_series: nw.Categorical/nw.Enum build a 1-indexed rank map from cat.get_categories() and return integer ranks; null gets rank 0
    • datatables_rows: categorical_columns parameter renamed to categorical_columns_to_be_represented_through_their_rank
  • javascript.py

    • New _remove_columns_with_render_in_columndefs shared helper (eliminates duplication between float and categorical column filtering)
    • get_float_columns_to_be_formatted_in_python: refactored to use the shared helper
    • New get_categorical_columns_to_be_represented_through_their_rank() function returning dict[int, list] (column index → ordered categories); mirrors get_float_columns_to_be_formatted_in_python with "auto" mode support
    • get_itable_arguments(): float columns keep a single shared columnDefs entry; each categorical column gets its own columnDefs entry with the ordered categories array embedded in the JS render function: var categories = [...]; return type === 'sort' ? data : (data === 0 ? null : categories[data - 1]);
    • New add_rank_to_categories option (default "auto") controls whether categorical columns get rank-based sorting
  • options.py / typing.py

    • Added add_rank_to_categories: Union[bool, Literal["auto"]] = "auto" global option
  • Tests

    • Updated test_polars_df_with_categorical_and_enums to reflect the new integer-rank wire format
    • Updated test_format_pandas_series and test_format_polars_series to pass add_rank_to_categories explicitly
    • Added test_pandas_categorical_with_missing_values and test_polars_categorical_with_missing_values asserting null sorts first (rank 0)
    • test_extension_arguments.py updated to permit columnDefs/keys_to_be_evaluated in output
  • Changelog

    • Added entry under 2.7.1-dev
Original prompt

This section details on the original issue you should resolve

<issue_title>Categorical types should be sorted according to the category order</issue_title>
<issue_description></issue_description>

Comments on the Issue (you are @copilot in this section)


🔒 GitHub Advanced Security automatically protects Copilot coding agent pull requests. You can protect all pull requests by enabling Advanced Security for your repositories. Learn more about Advanced Security.

Co-authored-by: mwouts <29915202+mwouts@users.noreply.github.com>
Copilot AI changed the title [WIP] Sort categorical types according to the category order Sort categorical columns by category order in DataTables Feb 22, 2026
Copilot AI requested a review from mwouts February 22, 2026 15:58
Copy link
Copy Markdown
Owner

@mwouts mwouts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. Please iterate on my comments, and also add an entry in the changelog.

Comment thread src/itables/javascript.py Outdated
"""
Return the set of column indices that have categorical dtypes
"""
if df_module_name in ["pandas", "numpy"]:
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the module name is numpy we will not get categorical types I assume? At least not pd.CategoricalDtype? We should just return an empty set in that case and not import pandas.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a072cccget_categorical_columns now returns set() early for "numpy" without importing pandas.

Comment thread src/itables/javascript.py Outdated
)
categorical_columns = get_categorical_columns(df_module_name, df)
if categorical_columns:
dt_args["columnDefs"] = (
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same logic as for floats. Maybe we could mutualize the code?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mutualized in a072ccc — both float and categorical columns now share a single _RENDER_SORT_FUNCTION and are collected into a single extra_column_defs list before being prepended to columnDefs.

Comment thread tests/test_datatables_format.py Outdated
# high=2, low=0, medium=1 according to category order
assert (
dt_args["data_json"]
== '[[["high", 2], 1], [["low", 0], 2], [["medium", 1], 3], [["low", 0], 4]]'
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice that we have the rank repeated, i.e. the two 'low' have rank 0, well done.

return [escape_non_finite_float(v) for v in x]

# Categorical and Enum types - return [display_value, category_code] for sorting
if isinstance(dtype, (nw.Categorical, nw.Enum)):
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer this code path to be optional, as we do for floats. Can you add a new option like add_rank_to_categories which defaults to True?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added add_rank_to_categories (default True) in a072ccc. It follows the same pattern as format_floats_in_python: it's a global option in options.py, typed in ITableOptions, popped from kwargs in get_itable_arguments, and when set to False the category columns are passed as plain strings (no rank encoding, alphabetical DataTables sorting).

…tualize columnDefs, changelog

Co-authored-by: mwouts <29915202+mwouts@users.noreply.github.com>
Copy link
Copy Markdown
Contributor Author

Copilot AI commented Feb 22, 2026

Great work. Please iterate on my comments, and also add an entry in the changelog.

Done — all review comments addressed in a072ccc. Changelog entry added under "2.7.1-dev".

Copilot AI requested a review from mwouts February 22, 2026 20:32
Copy link
Copy Markdown
Owner

@mwouts mwouts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, more comments attached

Comment thread src/itables/javascript.py
for i in categorical_columns
],
"render": _RENDER_SORT_FUNCTION,
}
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems simpler to just replace float_columns_to_be_formatted_in_python with the union of the later with categorical_columns in the current code?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in b5ee3e7 — float and categorical column sets are now unioned into a single columns_with_rank set, producing ONE columnDefs entry with the shared render function.

Comment thread src/itables/options.py
format_floats_in_python: Union[bool, Literal["auto"]] = "auto"

"""
Should categorical values be sorted according to the category order?
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorted by datatables rather than sorted?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in b5ee3e7 — docstring now reads "Set this to False to sort categories alphabetically by DataTables."

Comment thread src/itables/javascript.py

def get_categorical_columns(
df_module_name: DataFrameModuleName,
df: DataFrameOrSeries,
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make this more similar, or even better, mutualize the code with get_float_columns_to_be_formatted_in_python, which ensures that when float_columns_to_be_formatted_in_python="auto", the default, columns which are specified in the columnsDef don't see their data formatted?

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be more explicit: the default value for add_rank_to_categories should be "auto", and default to True if the column in not in the columnsDef.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented in b5ee3e7. add_rank_to_categories defaults to "auto". The function get_categorical_columns_to_be_represented_through_their_rank mirrors get_float_columns_to_be_formatted_in_python exactly: when "auto", it removes columns that have a render function defined in columnDefs.

Comment thread src/itables/datatables_format.py Outdated
if add_rank_to_categories:
codes = x.cat.codes.tolist()
n_categories = len(x.cat.categories)
sort_keys = [n_categories if c == -1 else c for c in codes]
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that, at least for Polars, we put missing values first for floats (rank=0). Can you do the same here?

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also add a test for a category with missing values, for both Polars and Pandas

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in b5ee3e7. Null now gets rank 0 (sorts first) for all backends. Added test_pandas_categorical_with_missing_values and test_polars_categorical_with_missing_values explicitly asserting null rank = 0.

Comment thread src/itables/datatables_format.py Outdated
if escape_html:
formatted = [escape_html_chars(i) for i in formatted]
if add_rank_to_categories:
category_to_rank = {cat: i for i, cat in enumerate(categories)}
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we access the underlying integer more directly?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified in b5ee3e7category_to_rank is now built directly with 1-indexed ranks: {cat: i + 1 for i, cat in enumerate(categories)}, so null (missing) gets rank 0 and the lookup is a single-step dict access with no separate null_sort_key variable.

Comment thread src/itables/datatables_format.py Outdated
formatted = [escape_html_chars(i) for i in formatted]
if add_rank_to_categories:
codes = x.to_physical().to_list()
null_sort_key = len(x.cat.get_categories())
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use rank rather than sort_key, etc, for the variables names ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed in b5ee3e7 — all sort_keys variables are now ranks.

Comment thread src/itables/datatables_format.py Outdated
escape_html: bool,
format_floats_in_python: bool,
warn_on_polars_get_fmt_not_found: bool,
add_rank_to_categories: bool = True,
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No default value here, thanks

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in b5ee3e7add_rank_to_categories in _format_polars_series now has no default value.

Comment thread src/itables/datatables_format.py Outdated

def _format_narwhals_series(
x, escape_html: bool, format_floats_in_python: bool
x, escape_html: bool, format_floats_in_python: bool, add_rank_to_categories: bool = True
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No default value here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in b5ee3e7add_rank_to_categories in _format_narwhals_series now has no default value.

Comment thread src/itables/datatables_format.py Outdated
column_count: Optional[int] = None,
escape_html: bool = True,
float_columns_to_be_formatted_in_python: Optional[set[int]] = None,
categorical_columns: Optional[set[int]] = None,
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this argument be, instead, something like categorical_columns_to_be_represented_through_their_rank? Also I am thinking that we could simply pass the rank to datatables, and leave the function that converts to the display/search representation to do the mapping? The rank 0 should be mapped to Javascript null, and the other ranks to the corresponding category, that is, the JS function should use a simple array (= the ordered categories) to do the mapping.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed in b5ee3e7 — parameter is now categorical_columns_to_be_represented_through_their_rank. The [display, rank] format is kept (consistent with floats) so the existing shared render function works for both.

…params

Co-authored-by: mwouts <29915202+mwouts@users.noreply.github.com>
Copilot AI requested a review from mwouts February 22, 2026 21:53
Comment thread src/itables/javascript.py Outdated
assert isinstance(targets, list), targets
for target in targets:
remove_if_present(target)
return categorical_columns
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That duplicates the current code, can use a dedicated function instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 0062c0f — extracted _remove_columns_with_render_in_columndefs shared helper used by both get_float_columns_to_be_formatted_in_python and get_categorical_columns_to_be_represented_through_their_rank.

Comment thread src/itables/datatables_format.py Outdated

def _format_pandas_series(
x, escape_html: bool, format_floats_in_python: bool
x, escape_html: bool, format_floats_in_python: bool, add_rank_to_categories: bool = True
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No default value in the function signature, thanks

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 0062c0fadd_rank_to_categories parameter in _format_pandas_series now has no default value.

Comment thread src/itables/datatables_format.py Outdated
except TypeError:
formatted = fmt.format_array(x._values, None, justify="all")
if escape_html:
formatted = [escape_html_chars(i) for i in formatted]
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two paragraphs above duplicate the existing code, can we reuse the current logic and just bifurcate at the end if the type is a category?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 0062c0f — the categorical check is now at the END of _format_pandas_series, after the shared fmt.format_array + escape_html code. When add_rank_to_categories=True, we return early with just x.cat.codes (no fmt.format_array). When False, the function falls through to the general path.

df_module = get_dataframe_module_name(df)
if float_columns_to_be_formatted_in_python is None:
float_columns_to_be_formatted_in_python = set()
if categorical_columns_to_be_represented_through_their_rank is None:
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

categorical_columns_to_be_represented_through_their_rank conveys the meaning that we would pass just the rank to datatables.

Can you actually do that? And map the rank to the display value in the JavaScript rendering function?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented in 0062c0f. Categorical columns now send just integer ranks (0 = null, 1..N = categories). get_categorical_columns_to_be_represented_through_their_rank returns dict[int, list] (col_idx → ordered categories). Each categorical column gets its own columnDef entry with the categories array embedded in the JS render function: var categories = ["a", "b", "c"]; return type === 'sort' ? data : (data === 0 ? null : categories[data - 1]);

…+ JS mapping

Co-authored-by: mwouts <29915202+mwouts@users.noreply.github.com>
Copilot AI requested a review from mwouts February 22, 2026 22:17
@mwouts mwouts added this to the 2.7.1 milestone Mar 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Categorical types should be sorted according to the category order

2 participants