
Smartly Apply Constraints During Cartesian Product #773

Open
Scienfitz wants to merge 15 commits into main from feature/smart_cartesian_product_constraints

Conversation

Collaborator

@Scienfitz Scienfitz commented Mar 31, 2026

This PR implements an optimized Cartesian product creation in the presence of constraints, which can yield memory and time gains of many orders of magnitude (see mini benchmark below).

Rationale

  • Currently, constraint filtering is done only after the entire search space has been created. This means the memory needed for the intermediate df is potentially huge even if the final df is tiny. In practice this has led to many problems when working with slot-based mixtures, even if the optimized from_simplex constructor was used.
  • Instead, any given constraint can be applied early during the parameter-by-parameter cross join operations. There are three tiers of application:
    1. As-soon-as-possible filter: A constraint can be applied as soon as all of its affected parameters are in the current crossjoin-df. After this application the constraint is fully ensured and does not have to be applied again. If the order in which the cross join traverses the parameters is optimized, this alone already leads to an improvement, as subsequent operations "see" much smaller left dataframes.
    2. Partial/early filter:
    • Some constraints can be applied even if not all affected parameters are present yet.
    • Example: no label duplicates - even if there are just 2 out of 7 parameters present, we can remove the rows that have duplicates in the 2 parameters.
    • This filter has to be repeated in every loop iteration until it has run with all affected parameters present. However, the cost of the repeated filter applications is dwarfed by the savings from smaller cross join operations.
    • Whether a constraint supports early filtering might depend on its configuration, for example an exclude constraint with a combiner (early filtering is supported for OR, but not for AND or XOR).
    3. Look ahead: Some constraints can look ahead based on the possible parameter values that might be incoming and recognize that the constraint cannot be fulfilled even in future crossjoin iterations.
    • This is essentially what from_simplex implements for the very special case of one global sum constraint and one cardinality constraint. If we ever implement look-ahead filters for all constraints, the from_simplex constructor might become obsolete.
    • For this we would need access to the parameter values inside the constraint logic, which might be easier to implement once the constraints have been refactored (see Refactor General Constraint Interface #517).
  • This PR implements smart filtering for tiers 1 and 2; I left IMPROVE notes as reminders about tier 3. To achieve this:
    • Constraint.get_invalid was extended to handle situations where not all parameters are present in the df to be filtered. The constraint can then decide whether it can apply early filtering or raise the new UnsupportedEarlyFilteringError if it needs all parameters present.
    • The crossjoin is done in a custom loop inside parameter_cartesian_prod_pandas_constrained, which performs the process described above after deciding on a smart parameter order for the crossjoin.
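The incremental loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual BayBE implementation: the constraint interface is reduced to a `parameters` list plus a `get_invalid` method, and the `NoLabelDuplicates` stand-in is a toy version of the real constraint.

```python
import pandas as pd


class UnsupportedEarlyFilteringError(Exception):
    """Raised when a constraint needs all its parameters present to filter."""


class NoLabelDuplicates:
    """Toy stand-in: rows may not repeat a value across the given parameters.

    Supports early/partial filtering (tier 2): duplicates among any subset of
    its parameters already make a row invalid.
    """

    def __init__(self, parameters):
        self.parameters = parameters

    def get_invalid(self, df):
        present = [p for p in self.parameters if p in df.columns]
        if len(present) < 2:
            raise UnsupportedEarlyFilteringError  # nothing to check yet
        return df.index[df[present].nunique(axis=1) < len(present)]


def constrained_cartesian_product(param_values, constraints):
    """Cross-join parameters one at a time, filtering as early as possible."""
    df = None  # sentinel: no parameter joined yet
    fully_applied = set()
    for name, values in param_values.items():
        block = pd.DataFrame({name: values})
        df = block if df is None else df.merge(block, how="cross")
        for i, c in enumerate(constraints):
            if i in fully_applied:
                continue
            try:
                df = df.drop(index=c.get_invalid(df)).reset_index(drop=True)
            except UnsupportedEarlyFilteringError:
                continue  # try again once more columns have been joined
            if set(c.parameters) <= set(df.columns):
                fully_applied.add(i)  # ensured; never re-applied (tier 1)
    return df if df is not None else pd.DataFrame()
```

With three parameters over the values {1, 2, 3} and a no-duplicates constraint, the intermediate frames shrink at every step (3 → 6 → 6 rows) instead of first growing to the full 27-row product and filtering at the end.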

Good To Know

  • new Constraint method has_polars_implementation, discussion here
  • new Constraint property _filtering_parameters, discussion here
  • strange appearance of DiscreteNoLabelDuplicatesConstraint in DiscretePermutationInvarianceConstraint.get_invalid explained here

Mini Benchmark:

  • Scenario 1: 7 categoricals with 8 values each and a no label dupe constraint
  • Scenario 2: complex slot-based mixture with 6 slots, 3 subgroups, sum constraints and additional product parameters
  • tested on (old main vs this branch) x (polars on/off)
  • 30min as max runtime for a quick test
| Scenario | Polars | main | feature | Speedup | Memory reduction |
| --- | --- | --- | --- | --- | --- |
| 1: from_product, 7×8 cat, NoLabelDuplicates (2M→40K rows) | OFF | 61.4s / 636 MB | 6.6s / 48 MB | 9.3x | 13.2x |
| | ON | 1.8s / 73 MB | 1.6s / 47 MB | 1.1x | 1.6x |
| 2: from_simplex, 6-slot mixture + 3 extras (~12B→22K rows) | OFF | >30min | 0.5s / 17 MB | >3600x | |
| | ON | >30min | 0.5s / 17 MB | >3600x | |

@Scienfitz Scienfitz self-assigned this Mar 31, 2026
@Scienfitz Scienfitz added the enhancement Expand / change existing functionality label Mar 31, 2026
@Scienfitz Scienfitz added this to the 0.15.0 milestone Mar 31, 2026
@Scienfitz Scienfitz force-pushed the feature/smart_cartesian_product_constraints branch 2 times, most recently from 6a33c52 to 503fef9 Compare April 1, 2026 23:48
Comment thread baybe/constraints/base.py Outdated
Comment thread baybe/constraints/base.py Outdated
Comment thread baybe/constraints/discrete.py
@Scienfitz Scienfitz marked this pull request as ready for review April 2, 2026 00:27
Copilot AI review requested due to automatic review settings April 2, 2026 00:27
Contributor

Copilot AI left a comment

Pull request overview

This PR optimizes discrete search space construction by applying discrete constraints incrementally during Cartesian product generation (including improved Polars/Pandas interop), aiming to reduce intermediate memory use and runtime for highly constrained spaces.

Changes:

  • Added baybe.searchspace.utils with shared Cartesian product helpers and a new incremental constrained-product builder.
  • Extended discrete constraint interfaces to support (or explicitly refuse) early filtering via UnsupportedEarlyFilteringError, plus a has_polars_implementation capability flag.
  • Updated discrete search space constructors and tests to use the new incremental filtering path (and added parity tests vs the naive approach).

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

Summary per file:

| File | Description |
| --- | --- |
| baybe/searchspace/utils.py | New utilities: parameter ordering, pandas/polars cartesian product, and incremental constrained cartesian product builder. |
| baybe/searchspace/discrete.py | Switches discrete space construction to incremental filtering; Polars path builds partial product and merges remainder via pandas. Adds new from_simplex validation. |
| baybe/constraints/base.py | Adds _required_filtering_parameters and has_polars_implementation; updates docs for partial-dataframe filtering semantics. |
| baybe/constraints/discrete.py | Updates discrete constraints to support early/partial filtering and to raise UnsupportedEarlyFilteringError when unsupported. |
| baybe/exceptions.py | Adds UnsupportedEarlyFilteringError. |
| tests/constraints/test_constrained_cartesian_product.py | New test ensuring naive vs incremental constrained product results match across several scenarios. |
| tests/constraints/test_constraints_polars.py | Updates imports for moved cartesian product helpers. |
| tests/test_searchspace.py | Updates imports for moved cartesian product helpers. |
| tests/hypothesis_strategies/alternative_creation/test_searchspace.py | Adjusts simplex-related tests to reflect new from_simplex constraints. |
| CHANGELOG.md | Documents incremental filtering and new constraint capability/exception additions. |


Comment on lines +187 to +193
for param in ordered_params:
    param_df = pd.DataFrame({param.name: param.active_values})
    if df.empty:
        df = param_df
    else:
        df = pd.merge(df, param_df, how="cross")

Copilot AI Apr 2, 2026

parameter_cartesian_prod_pandas_constrained uses if df.empty: df = param_df to decide whether to start or to cross-merge. If initial_df is provided but has 0 rows (or 0 columns), df.empty is True and this path incorrectly replaces the initial dataframe instead of performing a cross join (which should stay empty). This can resurrect filtered-out rows or raise a KeyError later when reindexing to original_columns. Consider tracking initialization separately (e.g., start with df=None and check df is None, or check df.shape[1] == 0 and initial_df is None) and always pd.merge(..., how='cross') when initial_df is provided (even if empty).
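One way to address this, along the lines suggested (a hedged sketch with a simplified signature, not the PR's actual code):

```python
import pandas as pd


def cross_join_all(param_frames, initial_df=None):
    """Cross-join single-column frames, seeding from an optional initial_df.

    `None` (not `df.empty`) serves as the "not yet initialized" sentinel, so
    an explicitly provided 0-row initial_df still participates in the cross
    joins and correctly yields an empty result instead of being replaced.
    """
    df = initial_df
    for frame in param_frames:
        df = frame if df is None else df.merge(frame, how="cross")
    return pd.DataFrame() if df is None else df
```

A 0-row initial_df then propagates emptiness through every join, rather than resurrecting filtered-out rows.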


# Initialize the dataframe
if initial_df is not None:
    df = initial_df
Copilot AI Apr 2, 2026

This function assigns df = initial_df and then mutates df in-place via drop(..., inplace=True), which will also mutate the caller-provided initial_df. If callers reuse that dataframe (or if this helper becomes public), this can cause unexpected side effects. Safer option: copy initial_df at the start (df = initial_df.copy(deep=False) or deep copy if needed).

Suggested change
-    df = initial_df
+    # Work on a copy to avoid mutating the caller-provided initial_df
+    df = initial_df.copy(deep=False)

Comment on lines +388 to +393
# Validate minimum number of simplex parameters
if len(simplex_parameters) < 2:
    raise ValueError(
        f"'{cls.from_simplex.__name__}' requires at least 2 simplex "
        f"parameters but got {len(simplex_parameters)}."
    )
Copilot AI Apr 2, 2026

from_simplex now raises when len(simplex_parameters) < 2. This is a behavioral change versus allowing simplex_parameters=[] (product-only) or a single simplex parameter. If this is intentional, please document it as a breaking change (and consider updating validate_simplex_subspace_from_config to mirror this validation so config validation and construction behave consistently). If it’s not intentional, consider restoring support for the degenerate cases (e.g., delegating to from_product when simplex_parameters is empty).

Suggested change
-    # Validate minimum number of simplex parameters
-    if len(simplex_parameters) < 2:
-        raise ValueError(
-            f"'{cls.from_simplex.__name__}' requires at least 2 simplex "
-            f"parameters but got {len(simplex_parameters)}."
-        )
+    # Handle degenerate cases for backwards compatibility
+    # - No simplex parameters: fall back to a pure product space.
+    # - Single simplex parameter: treat it as a regular product parameter.
+    if len(simplex_parameters) == 0:
+        return cls.from_product(
+            product_parameters=product_parameters,
+            constraints=constraints,
+        )
+    if len(simplex_parameters) == 1:
+        merged_product_parameters = [*simplex_parameters, *product_parameters]
+        return cls.from_product(
+            product_parameters=merged_product_parameters,
+            constraints=constraints,
+        )

Comment thread baybe/constraints/discrete.py Outdated
Comment thread baybe/constraints/discrete.py
Comment thread CHANGELOG.md
Comment on lines +19 to +25
### Changed
- Discrete search space construction now applies constraints incrementally during
Cartesian product building, significantly reducing memory usage and construction
time for constrained spaces
- Polars path in discrete search space construction now builds the Cartesian product
only for parameters involved in Polars-capable constraints, merging the rest
incrementally via pandas
Copilot AI Apr 2, 2026

The changelog mentions incremental constraint application and the new UnsupportedEarlyFilteringError, but it doesn’t mention the new from_simplex restriction requiring at least 2 simplex parameters. If this restriction is intended/user-facing, it should be noted under “Changed” or “Breaking Changes” to avoid surprising downstream users.

Comment on lines +224 to +236
def test_constrained_cartesian_product(scenario):
    """Verify incremental and naive product construction produce identical results."""
    parameters, constraints = scenario()

    # Naive approach: full product then filter
    df_naive = parameter_cartesian_prod_pandas(parameters)
    _apply_constraint_filter_pandas(df_naive, constraints)

    # Incremental approach
    df_incremental = parameter_cartesian_prod_pandas_constrained(
        parameters, constraints
    )

Copilot AI Apr 2, 2026

Consider adding a regression test for the initial_df-empty case in parameter_cartesian_prod_pandas_constrained (e.g., initial_df with correct columns but 0 rows, then merging additional parameters should keep 0 rows). This would have caught the current df.empty initialization bug and protects future refactors.

Collaborator

@AdrianSosic AdrianSosic left a comment

Hi @Scienfitz, I'll need some more time for the review but wanted to already share some comments so that you can start to think about it / we can discuss. More will follow 🙃

Comment thread baybe/searchspace/utils.py
Comment thread baybe/constraints/base.py Outdated
Comment thread baybe/constraints/base.py Outdated
Comment thread baybe/constraints/base.py Outdated
Comment thread baybe/constraints/discrete.py Outdated

@override
def get_invalid(self, data: pd.DataFrame) -> pd.Index:
    if not set(self.parameters) <= set(data.columns):
Collaborator

Two things:

  1. Wasn't the _required_filtering_parameters property exactly intended for this purpose, i.e. don't have to hardcode the set(self.parameters) in each constraint but to dynamically read it from the property?
  2. I thought the reason to add the property in the first place was that we can automate this check, but now you've manually implemented it in each constraint!? I would have expected a template-design approach, where the check is implemented in the base class and the subclasses only implement a _get_invalid or similar.
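The template-design approach described in point 2 might look roughly like this (an illustrative sketch; the names mirror the discussion in this thread, but the actual interface may differ):

```python
import pandas as pd


class UnsupportedEarlyFilteringError(Exception):
    pass


class DiscreteConstraint:
    parameters: list  # parameter names the constraint is defined on

    @property
    def _required_filtering_parameters(self):
        # Default: all parameters must be present; subclasses that support
        # early/partial filtering can override this with a smaller set.
        return self.parameters

    def get_invalid(self, data: pd.DataFrame) -> pd.Index:
        # Template method: the column-presence check lives here, once.
        if not set(self._required_filtering_parameters) <= set(data.columns):
            raise UnsupportedEarlyFilteringError(
                f"Missing columns for {type(self).__name__}."
            )
        return self._get_invalid(data)

    def _get_invalid(self, data: pd.DataFrame) -> pd.Index:
        raise NotImplementedError


class ThresholdConstraint(DiscreteConstraint):
    """Toy subclass: rows where the single parameter exceeds a threshold."""

    def __init__(self, parameter, threshold):
        self.parameters = [parameter]
        self.threshold = threshold

    def _get_invalid(self, data):
        return data.index[data[self.parameters[0]] > self.threshold]
```

Subclasses then never repeat the `set(self.parameters) <= set(data.columns)` check; they only implement `_get_invalid` against the columns they are guaranteed to have.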

Collaborator Author

@Scienfitz Scienfitz Apr 10, 2026

It depends.

I actually only wanted to use the helper in the two constraints that are more complex, i.e. where self.parameters is NOT the full set of affected parameters.

But I can understand your request, and it is definitely strange to use it in the column checks for some constraints and not others. But then consider the many other applications of self.parameters that I would have to change for consistency as well. Here is an example:
[image]

Should I apply it here at the top and in evaluate_data as well? That would be way more invasive than I intended.

A variant to clean this up a bit could be: do not make _required_parameters a base class property but only give it to the two problematic constraints. That would enforce a bit better that I am only using this as a workaround in the two problematic constraints. Opinion? This would revert the utility the _required_parameters property has in things like the new utils and things outside of the constraint class itself.

Collaborator Author

@AdrianSosic with the changes to the control flow, the template workflow now made much more sense and I added it there.

Collaborator

Re "should I apply it everywhere": I think to be consistent, you need to apply it in all places where self.parameters carries the same semantic meaning. You certainly know: there are cases where a quantity just happens to carry the same value but is semantically different! In those cases, self.parameters needs to stay, but in all others, I suggest changing it to self._required_parameters. You would indeed have to check case by case 😬

Collaborator Author

So what do you want changed further?

Comment thread baybe/constraints/discrete.py Outdated
Comment thread baybe/constraints/discrete.py Outdated
Comment thread baybe/exceptions.py Outdated
Comment thread baybe/constraints/discrete.py Outdated
@Scienfitz Scienfitz force-pushed the feature/smart_cartesian_product_constraints branch 2 times, most recently from 78eb87b to 66c39c4 Compare April 9, 2026 18:16
Collaborator

@AVHopp AVHopp left a comment

Incomplete review. Only had a look at the changes made to the constraints so far. Tried to comprehend the constraints, and the logic for them seems to check out. Will give a more in-depth review after some of the general issues pointed out by the others have been addressed.

Comment thread CHANGELOG.md
Comment on lines +23 to +25
- Polars path in discrete search space construction now builds the Cartesian product
only for parameters involved in Polars-capable constraints, merging the rest
incrementally via pandas
Collaborator

Suggested change
- Polars path in discrete search space construction now builds the Cartesian product
only for parameters involved in Polars-capable constraints, merging the rest
incrementally via pandas
- `Polars` path in discrete search space construction now builds the Cartesian product
only for parameters involved in `Polars`-capable constraints, merging the rest
incrementally via `pandas`

Collaborator Author

That would be quite inconsistent with the existing CHANGELOG, FYI.

Comment thread baybe/exceptions.py Outdated
Comment thread baybe/constraints/discrete.py Outdated
cols = set(data.columns)
pairs = [(p, c) for p, c in zip(self.parameters, self.conditions) if p in cols]
if not pairs:
    raise UnsupportedEarlyFilteringError(
Collaborator

The docstring clearly states what this function should do: Get the indices of dataframe entries that are invalid under the constraint.. No indices are invalid, so it should return an empty index and not an error in my opinion.

@Scienfitz Scienfitz force-pushed the feature/smart_cartesian_product_constraints branch from 66c39c4 to 2c634b5 Compare April 10, 2026 20:29
if not simplex_parameters:
    return cls(parameters=product_parameters, exp_rep=product_space)
# Validate minimum number of simplex parameters
if len(simplex_parameters) < 2:
Collaborator

Can you elaborate why you changed this? Of course, we both agree that calling from_simplex with just one parameter is sort of meaningless but:

  • it's not "wrong", i.e. it still works. So should we really forbid it?
  • If we really do want to forbid (which I'm not sure, perhaps a warning is more appropriate), I think the message needs to change. You are just saying what is not allowed, but it's missing what you should do instead.

# Remove entries that violate parameter constraints:
_apply_constraint_filter_pandas(exp_rep, constraints)
# Merge product parameters and apply constraints incrementally
exp_rep = parameter_cartesian_prod_pandas_constrained(
Collaborator

Not caused by your change, but general question: why are we using hard-coded pandas here? Could it be that we simply forgot to add the polars path?

polars_param_names: set[str] = set()
for c in polars_constraints:
    polars_param_names.update(c._required_parameters)
polars_params = [p for p in parameters if p.name in polars_param_names]
Collaborator

Perhaps a stupid question, but what happens downstream from here when the set of parameters is only a subset of the polars parameters? For example, I could have a polars constraint on ["A", "B", "C"] while my parameters contain only ["A", "B"]. Then polars_params will be "incorrectly" filtered down to just ["A", "B"]

)
initial_df = lazy_df.collect().to_pandas()
# Apply Polars constraints that failed back via pandas
_apply_constraint_filter_pandas(
Collaborator

Hm, could it be that this is now also just dead code because we filter to polars-compatible constraints upfront? Or what am I not seeing here?

# (smallest expansion factor during cross-merging).
ordered: list[DiscreteParameter] = []
available: set[str] = set()
remaining = list(constrained)
Collaborator

the remaining variable is a bit unlucky, I think:

  • it has initially the same content as unconstrained and is mutated, but unconstrained is never used again --> no need for second variable
  • I think all of them should be sets, not lists, right?

    parameters: Sequence[DiscreteParameter],
    constraints: Sequence[DiscreteConstraint],
) -> list[DiscreteParameter]:
    """Compute an optimal parameter ordering for incremental space construction.
Collaborator

Just out of curiosity: have you considered if a true optimal solution is feasible?
If not, I think the docstring needs to be adjusted (i.e. it's not optimal but a greedy approximation)
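For reference, a greedy heuristic of the kind under discussion (hypothetical, not the PR's actual ordering logic) could look like this:

```python
def greedy_parameter_order(sizes, constraint_groups):
    """Greedy (not provably optimal) ordering for incremental cross joins.

    Repeatedly pick the parameter that brings some constraint closest to
    being fully covered, breaking ties by the smallest number of values
    (i.e. the smallest expansion factor during cross-merging).

    sizes: dict mapping parameter name -> number of values
    constraint_groups: list of sets of parameter names, one set per constraint
    """
    ordered, available = [], set()
    remaining = {p for group in constraint_groups for p in group}

    def priority(p):
        progress = max(
            len(g & (available | {p})) / len(g)
            for g in constraint_groups
            if p in g
        )
        return (-progress, sizes[p])  # more progress first, then fewer values

    while remaining:
        best = min(remaining, key=priority)
        ordered.append(best)
        available.add(best)
        remaining.discard(best)

    # Unconstrained parameters can simply be appended at the end.
    ordered += [p for p in sizes if p not in available]
    return ordered
```

Finding the truly optimal order is a combinatorial problem over all parameter permutations, which is why a greedy approximation like this is the practical choice; the docstring wording ("optimal") would then indeed need adjusting.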

import polars as pl


def compute_parameter_order(
Collaborator

really minor so just resolve if disagree: perhaps the name of the function could be a bit less generic, i.e. compute says very little (it's like get). How about something with optimize, to make clear what is happening?

Comment on lines +174 to +178
# Initialize the dataframe
if initial_df is not None:
    df = initial_df
else:
    df = pd.DataFrame()
Collaborator

How about

Suggested change
-    # Initialize the dataframe
-    if initial_df is not None:
-        df = initial_df
-    else:
-        df = pd.DataFrame()
+    # Initialize the dataframe (`initial_df or pd.DataFrame()` would not work
+    # here, since DataFrame truthiness raises a ValueError)
+    df = initial_df if initial_df is not None else pd.DataFrame()
