
Add DBSCAN clustering step to Sophronia city #951

Open

camacortespar wants to merge 23 commits into next-exp:master from camacortespar:develop

Conversation


@camacortespar camacortespar commented Feb 24, 2026

Summary

This PR integrates a hit clustering step (using DBSCAN) into the Sophronia city workflow.

Key Changes

1. New Logic (reco/hits_functions.py)

  • Implemented cluster_tagger: A function that applies DBSCAN on an event-by-event basis.
  • It handles spatial anisotropy via scale_xy and scale_z parameters.
  • It preserves the input DataFrame's structure, index, and row order.

2. Integration (cities/sophronia.py)

  • Added clustering_params to the city configuration.
  • Inserted the clustering step in the dataflow.
  • Backward Compatibility: If clustering_params is None (default), the city skips clustering and produces the exact same output structure as before (no cluster column).

3. Testing (cities/sophronia_test.py & reco/hits_functions_test.py)

  • Added unit tests for structure preservation, row alignment, and noise rejection in DBSCAN.
  • Added integration tests ensuring cluster column appears only when enabled.
  • Verified that enabling clustering does not corrupt other physical variables.

4. Reference Update

  • Updated database/test_data/228Th_10evt_hits.h5.
  • Re-generated this reference file using the current master logic to ensure test_sophronia_exact_result passes with the new code structure.

Configuration Example

To use this feature, add the following to the configuration file (these values are optimized for NEXT-100 detector geometry):

clustering_params = dict(
    eps = 3,
    min_samples = 5,
    scale_xy = 15.55,
    scale_z = 4.0
)
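As a rough usage sketch (toy data invented for illustration; cluster_tagger lives in reco/hits_functions.py as described above, and takes the four parameters as keyword arguments):

import pandas as pd
from invisible_cities.reco.hits_functions import cluster_tagger

clustering_params = dict(eps=3, min_samples=5, scale_xy=15.55, scale_z=4.0)

# three made-up hits in one event, just to show the call
hits = pd.DataFrame(dict(event = [0, 0, 0],
                         X     = [0., 15.55, 500.],
                         Y     = [0., 0.,    500.],
                         Z     = [0., 4.,    0.]))

tagged = cluster_tagger(hits, **clustering_params)
# 'tagged' keeps the same rows and columns, plus an integer 'cluster' column;
# hits not assigned to any cluster are labelled -1 by DBSCAN.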

Collaborator

@gonzaponte gonzaponte left a comment


Looks quite good. Here is the first round of comments.

Comment thread invisible_cities/cities/components.py Outdated

return correct

def hits_clusterizer(clustering_params: dict) -> Callable:
Collaborator

Instead of taking a dictionary, take the four parameters that you need. Then you can unpack the parameters from the dict in the calling function using the **kwargs syntax.

Also decorate this function with check_annotations
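Something along these lines (just a sketch of the intended signature, not the exact code; it assumes Callable and check_annotations are already imported in components.py):

@check_annotations
def hits_clusterizer(eps         : float,
                     min_samples : int,
                     scale_xy    : float,
                     scale_z     : float) -> Callable:
    def clusterize(hits):
        return cluster_tagger(hits, eps=eps, min_samples=min_samples,
                              scale_xy=scale_xy, scale_z=scale_z)
    return clusterize

# and in the city, unpack the config dict:
# clusterize_hits = hits_clusterizer(**clustering_params)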

Author

Done!

Comment thread invisible_cities/config/sophronia.conf Outdated
Comment on lines +70 to +71
scale_xy = 14.55,
scale_z = 3.7
Collaborator

should be 15.55 and 4, right?

Author

Done!

Comment thread invisible_cities/reco/hits_functions.py Outdated
Comment on lines +283 to +305
# Pre-allocate array for cluster labels
cluster_labels = np.full(len(df_hits), -9999, dtype=int)

# Get values once (faster than repeatedly accessing DataFrame columns)
coords = df_hits[['X', 'Y', 'Z']].to_numpy()
events = df_hits['event'].to_numpy()

# Use np.unique to get sorted event IDs
unique_events = np.unique(events)
for event_id in unique_events:
    mask = (events == event_id)
    X = coords[mask].copy()

    # Scale
    X[:, :2] /= scale_xy
    X[:, 2] /= scale_z

    # DBSCAN clustering
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    cluster_labels[mask] = labels

df_hits['cluster'] = cluster_labels
Collaborator

I think this should be rewritten using groupby("event").apply(tag_hits)
where tag_hits is a separate function that tags the hits in a single event.
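Something like this (schematic only; names taken from this thread, not the final code):

import pandas as pd
from sklearn.cluster import DBSCAN

def tag_hits(event_hits, eps, min_samples, scale_xy, scale_z):
    coords = event_hits[['X', 'Y', 'Z']].to_numpy(dtype=float)
    coords[:, :2] /= scale_xy   # normalise XY to the SiPM pitch
    coords[:,  2] /= scale_z    # normalise Z to the slice width
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
    return event_hits.assign(cluster=labels)

def cluster_tagger(df_hits, eps, min_samples, scale_xy, scale_z):
    if df_hits.empty:
        return df_hits.assign(cluster=pd.Series(dtype=int))
    return (df_hits
            .groupby("event", group_keys=False)
            .apply(tag_hits, eps=eps, min_samples=min_samples,
                   scale_xy=scale_xy, scale_z=scale_z))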

Author

Ok, now cluster_tagger() uses groupby("event").apply() with a worker function called tag_hits_in_event()

Comment on lines +187 to +199
@given(df=gen_cluster_df)
def test_dummy(df):
    """
    Hypothesis calls this function multiple times.
    'df' will be a different pandas DataFrame in every call.
    """
    # Just for demonstration purposes, we print the shape of the generated DFs
    print(f"Generated dataframe shape: {df.shape}")

    # Check some stuff here
    assert 'X' in df.columns
    assert df['event'].dtype == int
    assert not df.empty
Collaborator

Is this something you used for your own understanding?

Author

Oops, yes, I forgot to delete it.
Done!

- Preserves the original Index and order of rows.
- The 'cluster' column contains valid integers (no NaNs).
"""
# Shuffle the input DataFrame to ensure cluster_tagger does not rely on any specific order
Collaborator

Split this test into three:

  • check the shape of the output (first two items in your list)
  • does not modify other stuff (third and fourth item)
  • validity of the cluster column
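i.e. something along the lines of (names are only suggestions):

def test_cluster_tagger_output_shape():
    ...  # same length and same index/order as the input, plus a 'cluster' column

def test_cluster_tagger_does_not_modify_input():
    ...  # all original columns keep their values

def test_cluster_tagger_cluster_column_is_valid():
    ...  # integer labels only, no NaNs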

Author

Done!

Comment on lines +256 to +257
# Shuffle the input DataFrame
df_input = df.sample(frac=1.0).copy()
Collaborator

Again, no randomness. If you want to test that this doesn't matter, you can parametrize this test with two row orders.
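Something like this (a rough sketch; the data and row orders are arbitrary, and cluster_tagger is assumed to be imported in the test module):

import pandas as pd
import pytest

dummy_params = dict(eps=10.0, min_samples=1, scale_xy=1.0, scale_z=1.0)

@pytest.mark.parametrize("row_order", ([0, 1, 2, 3],
                                       [2, 0, 3, 1]))
def test_cluster_tagger_is_independent_of_row_order(row_order):
    df = pd.DataFrame(dict(event=[0, 0, 0, 0],
                           X=[0., 1., 2., 100.],
                           Y=[0., 1., 2., 100.],
                           Z=[0., 0., 0., 100.])).iloc[row_order]
    df_result = cluster_tagger(df.copy(), **dummy_params)
    assert df_result.index.equals(df.index)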

Author

Ok, I shuffled rows in a specific and reproducible order.

Comment on lines +264 to +271
hits_group_0 = df_result[df_result['expected_label'] == 0]
hits_group_1 = df_result[df_result['expected_label'] == 1]
assert hits_group_0['cluster'].nunique() == 1, "Hits near (0,0,0) were assigned multiple cluster labels."
assert hits_group_1['cluster'].nunique() == 1, "Hits near (100,100,0) were assigned multiple cluster labels."
label_0 = hits_group_0['cluster'].iloc[0]
label_1 = hits_group_1['cluster'].iloc[0]
assert label_0 != label_1, "Both clusters were assigned the same label."
assert label_0 != -1 and label_1 != -1, "One of the clusters was labeled as noise (-1)."
Collaborator

Once the input order is known, this should become simpler.

Author

Done!

Comment on lines +291 to +292
# Shuffle the input DataFrame
df_input = df.sample(frac=1.0).copy()
Collaborator

remove randomness

Author

Done!

Comment on lines +326 to +327
# Shuffle the input DataFrame
df_input = df.sample(frac=1.0).copy()
Collaborator

remove randomness

Author

Done!

Comment thread invisible_cities/conftest.py Outdated
Comment on lines +652 to +653
scale_xy = 14.55,
scale_z = 3.7)
Collaborator

15.55 | 4

Author

Done!

Comment thread invisible_cities/reco/hits_functions.py Outdated
Comment on lines +269 to +281
# If the event has no hits, there's nothing to do
if event_hits.empty:
    return event_hits.assign(cluster=pd.Series(dtype=int))

# Extract coordinates and apply scaling
coords = event_hits[['X', 'Y', 'Z']].to_numpy()
coords[:, :2] /= scale_xy
coords[:, 2] /= scale_z

# DBSCAN clustering
labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
# Add the cluster labels as a new column to the event's DataFrame.
event_hits['cluster'] = labels
Collaborator

The code is self-explanatory. The comments should not address what you are doing, but rather why you are doing something that might not be obvious. For example, here you may explain why you apply the scale factor.

Comment thread invisible_cities/reco/hits_functions.py Outdated
Comment on lines +270 to +271
if event_hits.empty:
    return event_hits.assign(cluster=pd.Series(dtype=int))
Collaborator

This is no longer necessary; you apply it in cluster_tagger.

Author

Yes, you're right.
Removed it!

Comment on lines +195 to +196
if df.empty:
    return
Collaborator

I think this is not possible given the min_size=1 in the df generator. Check all tests for this.
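(For reference, a data-frame strategy built with min_size=1 can never yield an empty frame. An illustrative generator, not the actual gen_cluster_df, would be:

from hypothesis import strategies as st
from hypothesis.extra.pandas import column, data_frames, range_indexes

gen_cluster_df = data_frames(columns = [column('event', elements=st.integers(0, 5)),
                                        column('X'    , elements=st.floats(-100, 100)),
                                        column('Y'    , elements=st.floats(-100, 100)),
                                        column('Z'    , elements=st.floats(   0, 500))],
                             index   = range_indexes(min_size=1))  # at least one row, so df.empty is always False
)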

Comment on lines +198 to +203
# Run the cluster tagger
params = dict(eps=10.0, min_samples=1, scale_xy=1.0, scale_z=1.0) # Dummy values
df_result = cluster_tagger(df.copy(), **params)

# --- Assertations
assert len(df_result) == len(df), "Output DataFrame has different length than input."
Collaborator

Remove unnecessary comments. Rename params to dummy_params. Check other tests for this.

"""
Verifies that cluster_tagger:
- Does not modify any of the original columns.
- Preserves the original index and row order.
Collaborator

Is the second item true?

Comment on lines +309 to +315
cluster_labels = df_result['cluster'].unique()
assert cluster_labels.size == 2, "Expected exactly 2 unique cluster labels (one cluster + one noise)."
cluster_hits = df_result[df_result['cluster'] != -1]
assert cluster_hits.shape[0] == 3, "Expected exactly 3 hits to be clustered together."
noise_hit = df_result[df_result['cluster'] == -1]
assert noise_hit.shape[0] == 1, "Expected exactly 1 noise hit."
assert noise_hit['X'].iloc[0] == 100 and noise_hit['Y'].iloc[0] == 100, "The noise hit identified is NOT the distant one."
Collaborator

For better readability:

  • Use .column notation.
  • Use len instead of .shape[0].

Check other tests for the same patterns.
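e.g.

# instead of
assert df_result[df_result['cluster'] != -1].shape[0] == 3
# prefer
assert len(df_result[df_result.cluster != -1]) == 3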

Collaborator

@gonzaponte gonzaponte left a comment

Besides these two cosmetic changes, I think this is good to go. I've asked @jwaiton to take a quick look to make sure I didn't miss anything important, but this is essentially approved.

Comment thread invisible_cities/reco/hits_functions.py Outdated
from typing import List
from sklearn.cluster import DBSCAN

from .. evm import event_model as evm
Collaborator

I think you are not using this import, right?

Author

No, I think it was there before. I'll remove it.

Comment thread invisible_cities/reco/hits_functions.py Outdated
scale_xy = scale_xy,
scale_z = scale_z)

return df_clustered.set_index(df_hits.index)
[No newline at end of file]
Collaborator

end this file with a newline

Member

jwaiton commented May 7, 2026

Very nice work! I only have one suggestion.

I'd suggest removing the eps parameter entirely from the process, and setting it to 1 (or 1.74) when you call DBSCAN.

The scaling applied with scale_xy & scale_z means that (if applied correctly) all neighbouring hits should be within ~1 unit distance from one another, right? If so, eps should always be 1 (or ~1.74 if you want to retain diagonal neighbours I believe).

This removes a parameter that is poorly explained even in DBSCAN's own documentation (in my opinion), when it just means 'distance between two nodes'. If our distance will always be 1 or slightly less, there is no need for it 😸

@camacortespar
Author

But do we want that? As I remember, the optimal parameters include eps = 3.
Might setting eps = 1 (or 1.74) be too restrictive?
I'm concerned about this parameter selection because, from my experience with low-background data, it significantly changes the performance of the fiducial volume cut, which implies changes in the fiducial rate.

Would it be better to keep flexibility for the eps value?

Member

jwaiton commented May 8, 2026

But do we want that? As I remember, the optimal parameters include eps = 3.

Not sure where the optimal parameter comes from, but with eps = 3, scale_xy = 16, scale_z = 4 you can end up with a cluster of non-neighbouring hits surviving. A simple example is shown below:

# generate XYZ events of
# (0,0,0), (31.1, 31.1, 0), (62.2, 62.2, 0), (100, 100, 100)
import pandas as pd

data = {
        'event': [ 0,    0,    0,    0],   #
        'X'    : [0., 31.1, 62.2, 100.],   #
        'Y'    : [0., 31.1, 62.2, 100.],   #  these are not neighbouring hits
        'Z'    : [0.,   0.,   0., 100.]    #
}
df = pd.DataFrame(data)

test_params = dict(eps=3, min_samples=2, scale_xy=16, scale_z=4)
df_result   = cluster_tagger(df.copy(), **test_params)

print(df_result)

which results in:

   event      X      Y      Z  cluster
0      0    0.0    0.0    0.0        0
1      0   31.1   31.1    0.0        0
2      0   62.2   62.2    0.0        0
3      0  100.0  100.0  100.0       -1

Visually, with a 15.55 mm pitch, it looks like so:

[figure: the four hits above plotted on the 15.55 mm pitch grid]
These should not be labelled as a cluster.

I'm concerned about this parameter selection because, from my experience with low-background data, it significantly changes the performance of the fiducial volume cut, which implies changes in the fiducial rate.

Surely if you alter the eps to 1.74, this would improve the fiducial volume cut rate, as more hits (like the ones shown above) would be labelled as noise, right? Apologies if I'm misunderstanding.

The value 1.74 is just $\text{eps} = \sqrt{1^2 + 1^2 + 1^2} = \sqrt{3} \simeq 1.73$, so 1.74 ensures that all hits within 1 maximal unit distance (diagonally in 3D) will be considered neighbours, contributing to a cluster.

The next distance that we want to avoid would be two hits that are exactly 2 pitch lengths apart (not neighbours). So for example: xyz = [(0,0,0), (0, 31.1, 0)]. Dividing by xy_scale=16 gives us 1.9, so an eps any higher than 1.9 will start allowing clusters to be made with non-neighbouring hits.
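As a quick numeric check of the distances above (plain arithmetic, not IC code):

import numpy as np

scale_xy, scale_z = 16.0, 4.0
pitch, z_bin      = 15.55, 4.0

# nearest diagonal neighbour (one pitch in X and Y, one slice in Z), scaled
diag      = np.linalg.norm([pitch / scale_xy, pitch / scale_xy, z_bin / scale_z])
# two pitch lengths apart along one axis, scaled
two_pitch = 2 * pitch / scale_xy

print(diag)       # ~1.70 -> kept as a neighbour with eps = 1.74
print(two_pitch)  # ~1.94 -> rejected as long as eps stays below ~1.9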

Would it be better to keep flexibility for the eps value?

We have flexibility in the eps value without changing it from 1, as it is directly related to your scale_xy and scale_z. We are currently setting these values such that they (more or less) normalise our grid to 1 in X, Y & Z, but this doesn't have to be the case.

Apologies if I'm explaining this poorly, I'd be happy to chat in a zoom call to try and clarify this :)

@camacortespar
Author

Yes, I understand your point, and if that is the case, I would fix eps = 1.74.

But again, I'm still concerned about the clusterization process killing physical hits. For example, if a neighbouring hit is not activated (for some threshold-related reason) but the one next to it is, the latter can be labelled as noise. I don't know how likely this case is.

Has the performance of this algorithm been studied using NEXT-100 data? I thought it had, and the result had been the set of parameters eps = 3, min_samples = 5. Maybe I misunderstood. If we are sure that one unit of distance is enough, we can fix the eps value.

In any case, I'd like to discuss it next week in a meeting.

Member

jwaiton commented May 8, 2026

But again, I'm still concerned about the clusterization process killing physical hits. For example, if a neighbouring hit is not activated (for some threshold-related reason) but the one next to it is, the latter can be labelled as noise. I don't know how likely this case is.

This is what my last comment is addressing. If this was the case, setting eps = 1.74 * 2 or scale_xy = 16*2, scale_z = 4*2 would do the same thing I believe, so there is no need for both. :)

I think it is unlikely in NEXT-100 that a group of physical hits on the order of 10 would have any hit two pitch lengths apart from the rest, but I can check some x-rays/bremsstrahlung events for this information (I believe @Ian0sborne would have this information at hand 😸).

Has the performance of this algorithm been studied using NEXT-100 data? I thought it had, and the result had been the set of parameters eps = 3, min_samples = 5. Maybe I misunderstood. If we are sure that one unit of distance is enough, we can fix the eps value.

I studied/implemented a similar algorithm here, which I didn't provide defaults for, but used min_samples = 3 or 5, and relied on the equivalent value of $\text{eps} = \sqrt{3} \simeq 1.74$ as seen here. I believe Samuele developed this DBSCAN method initially, and the default values can be seen here, with eps = 2.3, xy_scale = 14.55, z_scale = 3.7.

For the scale in NEXT-100 (pitch = 15.55, maximum z bin = 4), this is a scaling of 1.06 in XY and 1.08 in Z, which would then give $\text{maximal unit distance} = \sqrt{1.06^2 + 1.06^2 + 1.08^2} \simeq 1.85$. The next-to-next nearest neighbour (at its closest proximity) would be at $\sqrt{0^2 + (1.06\cdot2)^2 + 0^2} = 2.12$, and the nearest diagonal at $\sqrt{1.06^2 + (1.06\cdot2)^2 + 0^2} = 2.37$. This means that with the default parameters previously given (eps = 2.3), the behaviour would be inconsistent: it would retain next-to-next nearest neighbours along the coordinates (2.12 < 2.3) but not the diagonals (2.37 > 2.3), which I find odd.

Let's talk next week 👍
