Add DBSCAN clustering step to Sophronia city #951
Conversation
gonzaponte
left a comment
Looks quite good. Here is the first round of comments.
```python
        return correct

def hits_clusterizer(clustering_params: dict) -> Callable:
```
Instead of taking a dictionary, take the four parameters that you need. Then you can unpack the parameters from the dict in the calling function using the `**kwargs` syntax.
Also decorate this function with `check_annotations`.
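A minimal sketch of the suggested signature change (parameter names are taken from this thread; the body is elided, and the `check_annotations` decorator from the codebase is only mentioned in a comment):

```python
from typing import Callable

def hits_clusterizer(eps         : float,
                     min_samples : int,
                     scale_xy    : float,
                     scale_z     : float) -> Callable:
    # The real function would also carry the @check_annotations
    # decorator requested above; omitted in this sketch.
    def clusterize(df_hits):
        ...  # clustering logic elided
    return clusterize

# Calling site: unpack the dict with ** instead of passing it whole.
clustering_params = dict(eps=3.0, min_samples=5, scale_xy=15.55, scale_z=4.0)
clusterize = hits_clusterizer(**clustering_params)
```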
```python
scale_xy = 14.55,
scale_z  = 3.7
```
should be 15.55 and 4, right?
```python
# Pre-allocate array for cluster labels
cluster_labels = np.full(len(df_hits), -9999, dtype=int)

# Get values once (faster than repeatedly accessing DataFrame columns)
coords = df_hits[['X', 'Y', 'Z']].to_numpy()
events = df_hits['event'].to_numpy()

# Use np.unique to get sorted event IDs
unique_events = np.unique(events)
for event_id in unique_events:
    mask = (events == event_id)
    X = coords[mask].copy()

    # Scale
    X[:, :2] /= scale_xy
    X[:, 2]  /= scale_z

    # DBSCAN clustering
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    cluster_labels[mask] = labels

df_hits['cluster'] = cluster_labels
```
I think this should be rewritten using `groupby("event").apply(tag_hits)`, where `tag_hits` is a separate function that tags the hits in a single event.
Ok, now `cluster_tagger()` uses `groupby("event").apply()` with a worker function called `tag_hits_in_event()`.
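The resulting structure can be sketched like this. A toy stand-in replaces the DBSCAN call, so only the `groupby`/`apply` plumbing is illustrated; `tag_hits_in_event` here is a simplified version of the real worker:

```python
import pandas as pd

def tag_hits_in_event(event_hits: pd.DataFrame) -> pd.DataFrame:
    # Toy stand-in for the per-event DBSCAN step: every hit in the
    # event gets the same label, just to show the data flow.
    return event_hits.assign(cluster=0)

def cluster_tagger(df_hits: pd.DataFrame) -> pd.DataFrame:
    # group_keys=False keeps the original flat index instead of
    # prepending the event id as an extra index level.
    return (df_hits.groupby("event", group_keys=False)
                   .apply(tag_hits_in_event))

df  = pd.DataFrame({"event": [0, 0, 1], "X": [0.0, 1.0, 5.0]})
out = cluster_tagger(df)
```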
```python
@given(df=gen_cluster_df)
def test_dummy(df):
    """
    Hypothesis calls this function multiple times.
    'df' will be a different pandas DataFrame in every call.
    """
    # Just for demonstration purposes, we print the shape of the generated DFs
    print(f"Generated dataframe shape: {df.shape}")

    # Check some stuff here
    assert 'X' in df.columns
    assert df['event'].dtype == int
    assert not df.empty
```
Is this something you used for your own understanding?
Oops, yes, I forgot to delete it.
Done!
```python
    - Preserves the original Index and order of rows.
    - The 'cluster' column contains valid integers (no NaNs).
    """
    # Shuffle the input DataFrame to ensure cluster_tagger does not rely on any specific order
```
Split this test into three:
- check the shape of the output (first two items in your list)
- does not modify other stuff (third and fourth item)
- validity of the cluster column
```python
    # Shuffle the input DataFrame
    df_input = df.sample(frac=1.0).copy()
```
Again, no randomness. If you want to test that this doesn't matter, you can parametrize this test with two row orders.
Ok, I shuffled rows in a specific and reproducible order.
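One deterministic way to do that is with `pytest.mark.parametrize` supplying two fixed row orders instead of a random shuffle (a sketch; the test name and toy frame are hypothetical):

```python
import pandas as pd
import pytest

df = pd.DataFrame({"event": [0, 0, 1],
                   "X"    : [0.0, 1.0, 5.0]})

@pytest.mark.parametrize("order", ([0, 1, 2], [2, 0, 1]))
def test_cluster_tagger_is_order_independent(order):
    df_input = df.iloc[order]   # fixed, reproducible row order
    ...                         # run cluster_tagger and compare outputs
```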
```python
hits_group_0 = df_result[df_result['expected_label'] == 0]
hits_group_1 = df_result[df_result['expected_label'] == 1]
assert hits_group_0['cluster'].nunique() == 1, "Hits near (0,0,0) were assigned multiple cluster labels."
assert hits_group_1['cluster'].nunique() == 1, "Hits near (100,100,0) were assigned multiple cluster labels."
label_0 = hits_group_0['cluster'].iloc[0]
label_1 = hits_group_1['cluster'].iloc[0]
assert label_0 != label_1, "Both clusters were assigned the same label."
assert label_0 != -1 and label_1 != -1, "One of the clusters was labeled as noise (-1)."
```
Once the input order is known, this should become simpler.
```python
scale_xy = 14.55,
scale_z  = 3.7)
```
```python
    # If the event has no hits, there's nothing to do
    if event_hits.empty:
        return event_hits.assign(cluster=pd.Series(dtype=int))

    # Extract coordinates and apply scaling
    coords = event_hits[['X', 'Y', 'Z']].to_numpy()
    coords[:, :2] /= scale_xy
    coords[:, 2]  /= scale_z

    # DBSCAN clustering
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
    # Add the cluster labels as a new column to the event's DataFrame.
    event_hits['cluster'] = labels
```
The code is self-explanatory. The comments should not address what you are doing, but rather why you are doing something that might not be obvious. For example, here you may explain why you apply the scale factor.
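For instance, a "why" comment on the scaling step might read like this (a sketch; the rationale is taken from the scaling discussion later in this PR, so treat the exact wording as an assumption):

```python
import numpy as np

scale_xy, scale_z = 15.55, 4.0
coords = np.array([[15.55,  0.0 , 4.0],
                   [31.10, 15.55, 8.0]])

# Scale X/Y by the sensor pitch and Z by the slice width so that
# neighbouring hits end up ~1 unit apart; eps then keeps the same
# meaning regardless of detector geometry.
coords[:, :2] /= scale_xy
coords[:,  2] /= scale_z
```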
```python
    if event_hits.empty:
        return event_hits.assign(cluster=pd.Series(dtype=int))
```
This is no longer necessary, you apply it in `cluster_tagger`.
Yes, you're right.
Removed it!
```python
    if df.empty:
        return
```
I think this is not possible given the min_size=1 in the df generator. Check all tests for this.
```python
    # Run the cluster tagger
    params = dict(eps=10.0, min_samples=1, scale_xy=1.0, scale_z=1.0)  # Dummy values
    df_result = cluster_tagger(df.copy(), **params)

    # --- Assertations
    assert len(df_result) == len(df), "Output DataFrame has different length than input."
```
Remove unnecessary comments. Rename params to dummy_params. Check other tests for this.
```python
    """
    Verifies that cluster_tagger:
    - Does not modify any of the original columns.
    - Preserves the original index and row order.
```
Is the second item true?
```python
cluster_labels = df_result['cluster'].unique()
assert cluster_labels.size == 2, "Expected exactly 2 unique cluster labels (one cluster + one noise)."
cluster_hits = df_result[df_result['cluster'] != -1]
assert cluster_hits.shape[0] == 3, "Expected exactly 3 hits to be clustered together."
noise_hit = df_result[df_result['cluster'] == -1]
assert noise_hit.shape[0] == 1, "Expected exactly 1 noise hit."
assert noise_hit['X'].iloc[0] == 100 and noise_hit['Y'].iloc[0] == 100, "The noise hit identified is NOT the distant one."
```
For better readability:
- Use `.column` notation.
- Use `len` instead of `.shape[0]`.

Check other tests for the same patterns.
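Applied to the kind of checks quoted above, the two patterns look like this (toy DataFrame for illustration):

```python
import pandas as pd

df_result = pd.DataFrame({"cluster": [0, 0, 0, -1]})

# Attribute access instead of df_result['cluster'] ...
cluster_hits = df_result[df_result.cluster != -1]
# ... and len() instead of .shape[0]:
assert len(cluster_hits) == 3
```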
…esult test is updated to match the new implementation.
gonzaponte
left a comment
besides these two cosmetic changes, I think this is good to go. I've asked @jwaiton to take a quick look to make sure I didn't miss anything important, but this is essentially approved.
```python
from typing import List
from sklearn.cluster import DBSCAN

from .. evm import event_model as evm
```
I think you are not using this import, right?
No, I think it was there before. I'll remove it.
```python
                           scale_xy = scale_xy,
                           scale_z  = scale_z)

    return df_clustered.set_index(df_hits.index)
```
(no newline at end of file)
end this file with a newline
Very nice work! I only have one suggestion. I'd suggest removing the `eps` parameter. The scaling applied with `scale_xy` and `scale_z` already puts neighbouring hits about one unit apart, so `eps` can simply be fixed. This removes a parameter that is poorly explained even in DBSCAN's own documentation (in my opinion), when it just means 'distance between two nodes'. If our distance will always be 1 or slightly less, there is no need for it 😸
But do we want that? As I remember, the optimal parameters include eps = 3. Wouldn't it be better to keep the flexibility of the eps value?
Yes, I understand your point, and if that is the case I would fix eps = 1.74. But again, I'm still concerned about the clusterization process killing physical hits. For example, if a neighbouring hit is not activated (for some threshold-related reason), but the one next to it is, the latter can be labelled as noise. I don't know how likely this case is. Has the performance of this algorithm been studied using NEXT100 data? I thought it had, and that the result had been the set of parameters eps = 3, min_samples = 5. Maybe I misunderstood. If we are sure that one unit of distance is enough, we can fix the eps value. In any case, I'd like to discuss it next week in a meeting.
This is what my last comment is addressing. If this were the case, setting `eps` slightly larger would account for it. I think the likelihood of the number of physical hits being on the order of 10 with any hit two pitch lengths apart is low in NEXT-100, but I can check some x-rays/bremsstrahlungs for this information (I believe @Ian0sborne would have this information at hand 😸).
I studied/implemented a similar algorithm here, which I didn't provide defaults for, but used `scale_xy = 14.55` and `scale_z = 3.7`. For the scale in NEXT-100 (pitch = 15.55, maximum z bin = 4), this is a scaling of 1.06 in XY and 1.08 in Z, which would then be the `eps` to use. Let's talk next week 👍

Summary

This PR integrates a hit clustering step (using DBSCAN) into the Sophronia city workflow.

Key Changes

1. New Logic (`reco/hits_functions.py`)
- `cluster_tagger`: a function that applies DBSCAN on an event-by-event basis.
- Takes `scale_xy` and `scale_z` parameters.

2. Integration (`cities/sophronia.py`)
- Adds `clustering_params` to the city configuration.
- When `clustering_params` is `None` (default), the city skips clustering and produces the exact same output structure as before (no `cluster` column).

3. Testing (`cities/sophronia_test.py` & `reco/hits_functions_test.py`)
- Verifies that the `cluster` column appears only when enabled.

4. Reference Update
- Updated `database/test_data/228Th_10evt_hits.h5` to ensure `test_sophronia_exact_result` passes with the new code structure.

Configuration Example

To use this feature, add the following to the configuration file (these values are optimized for NEXT-100 detector geometry):
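A sketch of what such a configuration block might look like (the parameter names follow `cluster_tagger` above; the `eps = 3`, `min_samples = 5` values and the 15.55 / 4 scales come from the review discussion, so treat them as assumptions rather than the merged defaults):

```python
# Hypothetical configuration sketch -- values taken from the review
# thread, not necessarily the merged defaults.
clustering_params = dict(eps         = 3,
                         min_samples = 5,
                         scale_xy    = 15.55,  # NEXT-100 sensor pitch (mm)
                         scale_z     = 4)      # maximum z bin (mm)
```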