Refactor read_ldata() and update docs #172
Open
jonasschlegel wants to merge 19 commits into
Open
Conversation
Removes the ChannelOrDetectorIdLike helpers (_get_channelid, _get_detectorid, _is_valid_id_or_tier) and the LDMUtils channel conversion imports. All read_ldata methods now take DetectorIdLike directly; the partition-method's channel-consistency check was made obsolete by this and is removed. The jlevt+det post-filter switches from trig_e_ch to trig_e_det.
Introduces _evt_tiers and _perdet_tiers as the central dispatch constants and rewrites _lh5_data_open to use them (per-det tiers require a det, evt tiers don't). The shared PropSel/filterby/identity logic is extracted into _apply_read, and per-det reads now go through _resolve_perdet_path to support both the current 'tier/<det>' and the legacy '<det>/tier' layout. Adds a 'subgroup' kwarg for descents like jlhit/<det>/dataQC and an ignore_missing warning for missing detectors. Detector discovery in read_ldata((tier, fk)) now lists 'tier/<det>' primarily and falls back to top-level keys; evt tiers route to the no-det read directly. _convert_rsel2dsel disambiguates DataTier the same way it already did for DataCategory. The jlevt+det post-filter still uses the old shape (trig_e_det); the proper per-detector evt-tier reads come in the next commit.
read_ldata((:jlevt, fk, det)) now returns a flat-prefixed table with the events restricted to those where the detector is present. Per-det columns (e_cusp_ctc_cal, …) are unwrapped at the detector's per-event slot in the 'detector' VoV, per-trigger columns (trig_e_cusp_ctc_cal, …) at its slot in 'trig_e_det'. The two index schemes are distinguished per column via the materialized inner-length signature (_index_col), so the reader picks the right indexing automatically and works for both GED-like (mixed) and SPM/PMT-like (single-scheme) systems. The flat-prefix name-map is built dynamically by walking the HDF5 group (_build_evt_namemap / _walk_evt_h5!), so new sub-groups added on the file side need no change here. PropSel + filterby reads only the columns they actually reference; the eager fallback materializes the whole table and then applies f/filterby/n_evts via _apply_read. jlpmt is added to _evt_tiers, with its own (tier=jlpmt, sys=pmts) entry in _evt_persubdet_path so that the table itself is the per-pmt root. The no-det evt-tier read still returns the native nested LH5 structure (e.g. evt.aux.pulser.aux_trig); only the with-det path produces the flat-prefixed table.
A new filtertier kwarg lets a per-det raw/jldsp read use an event-tier filter: read filtertier with the given filterby for the same detector to obtain a Vector of per-det row indices (geds_dataidx / spms_dataidx / dataidx, depending on system), then load the target tier and slice. The dataidx columns are 1-based for both raw and jldsp, so they index directly without an offset. Default filtertier=nothing keeps the existing behavior. The kwarg is ignored when filtertier == tier; cross-tier requires a DetectorId and an event tier as filtertier (geds, spms, pmts have known index columns). Multi-filekey works automatically — the per-fk dispatch forwards filterby and filtertier through kwargs.
raw and jldsp share an identical row layout per (det, fk) — same row count and 1:1 trigger alignment. The cross-tier filter now also handles this case: filterby is evaluated on the filtertier's data, the resulting Bool mask is applied to the target tier directly. This works for both phy (where jlevt also exists) and cal (where it does not), so 'show me cal raw waveforms with dsp e_cusp > 1000' is now a one-liner — previously you would have had to read jldsp, compute a mask, read raw, and slice manually.
Drop the ChannelId-era examples (channels are no longer used by read_ldata) and document what the refactor enables: per-detector slicing on event tiers (jlevt/jlskm/jlpmt) with flat-prefixed columns, auto-detected detector system, sub-group descent via the subgroup kwarg, and the cross-tier filter (filtertier kwarg) covering both event tier → raw/jldsp and raw ↔ jldsp directions.
Add: per-tier with/without-detector behavior table, no-detector event-tier read example (nested LH5 + dot syntax), sub-group descent examples for jlhit/jlpls, multi-detector discovery (NamedTuple return), explicit @pf flat-vs-dot syntax explanation, listed system index columns for cross-tier filter, distributed-read example, and notes on n_evts and ignore_missing semantics.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #172 +/- ##
==========================================
- Coverage 47.85% 46.28% -1.57%
==========================================
Files 27 27
Lines 2466 2547 +81
==========================================
- Hits 1180 1179 -1
- Misses 1286 1368 +82 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
In some datasets (observed in p16-r000-phy) the *_dataidx column on the event tier is a run-wide cumulative offset, not a per-(fk, det) 1-based row index — using it as the slice index produces out-of-bounds errors. The per-detector event number *_detevtno is the right column: it is consistently 1-based per (fk, det) and matches the raw / jldsp row layout for the same detector. Updated the index dictionary, docs, and test references accordingly.
…bgroup whose inner-length signature matches — e.g. ged_spm per-fiber VoVs share the spms detector axis and pick up the SPM's slot. Columns that match neither are returned as-is.
Contributor
|
Hi, a review has been requested, but this PR is still a draft and nothing has changed since two weeks. @jonasschlegel is your work finished and is this PR ready to reviewed? If yes please convert this into a proper PR. |
Two additions to read_ldata: 1. _apply_read: when both PropSel and a PropertyFunction filterby are given, load only the union of source columns referenced by projection and filter, then filter+project. Avoids loading all columns when only a subset is used. 2. _read_evt_no_det_table: for event-tier reads without a detector, allow PropSel on flat-prefixed leaf columns (e.g. :geds_multiplicity, :aux_pulser_aux_trig) via the existing _build_evt_namemap. Falls back to the legacy nested-NamedTuple read when no PropSel is given or when a requested column is not a flat leaf (back-compat).
Three call-sites (_apply_read, _read_evt_table, _read_evt_no_det_table) shared the same load-needed/filter/n_evts/reproject pattern; extracted to _propsel_filter_apply, parameterised by a per-column load callback. Also extracted _sample_idx so the multi-method _load_all_keys reads cleanly. Trimmed multi-line comment blocks throughout the extension to 0-1 lines where the code is self-documenting. Kept the *_detevtno vs *_dataidx note (load-bearing) and the trig_e_det vs detector mask hint. Fixed the stale "uses *_dataidx" comment in the cross-tier filter dispatch. Removed an unactioned TODO in the parallel partition reader. No behaviour change for back-compat paths (no-PropSel evt-tier reads still return the native nested LH5).
Reorganise the LegendHDF5IO extension section of extensions.md: - Tier overview table (file layout × with/without-det × subgroup × per-det evt slicing) - Call forms section enumerating all supported rsel shapes - Per-detector evt-tier system support table (jlevt/jlskm/jlpmt × GED/SPM/PMT) - filtertier compatibility matrix (target × source) - No-detector evt-tier reads now document both flat-leaf PropSel and nested LH5 modes Also wrap all inline @pf kwarg usages in parens (@pf(...)) so the examples are copy-paste-safe in the REPL — naked @pf in kwarg position gets cut off at the next comma and PropertyFunctions then fails with "invalid assignment location".
In the no-detector (tier, fk) read for non-evt tiers, the fallback path
for the legacy "<det>/<tier>" layout iterated over every top-level key
in the file and called haskey(h, "<k>/<tier>") on each. If a top-level
key happened to be a Dataset (e.g. a runinfo / metadata marker), HDF5
descended into it and threw
MethodError: no method matching haskey(::HDF5.Dataset, ::String, ...)
since haskey is only defined for File/Group. Filter the top-level keys
to Groups before calling haskey, so non-Group entries are silently
skipped — they can't be detector folders anyway.
The (tier, fk) no-detector dispatch lists detectors by enumerating
top-level groups (current "tier/<det>") or by walking top-level keys
that contain a tier sub-group (legacy "<det>/tier"). Older datasets
have group names that don't parse as a valid DetectorId (e.g. early
SiPM/manufacturer-style names like "BF862-06"); those used to abort
the whole read with
ArgumentError: String "BF862-06" does not look like a valid LEGEND
detector id
once the per-detector recursion reached DetectorId(rsel[3]).
Filter both layouts via _is_valid_detid so non-parseable entries are
silently skipped — only entries that can become a DetectorId end up
in the per-det fan-out.
The no-detector (tier, fk) dispatch enumerates top-level groups to
list per-detector subtrees. Older datasets contain entries whose
names don't parse as DetectorId (e.g. early SiPM/manufacturer names
like "BF862-06"); without filtering, the per-detector fan-out aborts
with
ArgumentError: String "BF862-06" does not look like a valid LEGEND
detector id
Use the existing LegendDataManagement._can_convert_to(DetectorId, ...)
predicate (regex-based, dispatched for AbstractString/Symbol/Integer)
instead of a local try/catch wrapper around the DetectorId constructor.
Contributor
|
Can you maybe add a few test, to see that we can guarantee we can open the data in any combination of |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Refactor
read_ldatato dropChannelIdfrom its public API, addper-detector slicing on event tiers (
:jlevt,:jlskm,:jlpmt),and introduce a cross-tier filter via the
filtertierkwarg.LegendHDF5IOextension docs are updated to match.Changes
DetectorIdthroughout.read_ldataacceptsDetectorIdLikeonly.Channel helpers (
_get_channelid,_get_detectorid,_is_valid_id_or_tier) and the channel-consistency check in thepartition method are removed.
_evt_tiers = [:jlevt, :jlskm, :jlpmt]and_perdet_tiers = [:jlpeaks, :jlhit, :jlpls]drive dispatch. raw andjldsp use
_resolve_perdet_path, which triestier/<det>(current),<det>/tier(legacy), andtier(struct-view) in order.read_ldata(_, l200, (:jlevt, fk, det))returns a flat-prefixed Table(
geds_e_cusp_ctc_cal,geds_trig_e_cusp_ctc_cal,aux_pulser_aux_trig,ged_spm_is_valid_lar, …). The name-map isbuilt dynamically by walking the HDF5 group; per-trigger and
per-det-list VoVs are distinguished by their inner-length signature
and unwrapped at the detector's per-event slot.
filtertierkwarg lets a per-detectorraw/jldsp read be filtered by an event-tier predicate (slicing via
the per-det
*_dataidxcolumn, 1-based for raw and jldsp) or by asibling per-trigger-tier predicate (1:1 row alignment per detector).
Works for phy and cal.
subgroup=:dataQCfortier/<det>/group-style layouts (e.g. jlhit).filterbyreturnsmissingare dropped (same semantics as
false).What now works
Backward-compat note
read_ldata(l200, :jlevt, fk)(no detector) still returns the nativenested LH5 view (
evt.aux.pulser.aux_trig). Filters there use dot syntax(
@pf $geds.is_valid_qc == true); flat-prefix names exist only on theper-detector side.
Performance
PropSel +
filterby+filtertierfast-path vs. eager-load-then-filteron real L200 data (FK p18-r000, B00000C, min of 3 trials):
e_cusp > 0)>1500 keV)filtertier=:jlevt, phy)filtertier=:jldsp, cal)The big wins (C–E) come from never materializing the ~1.3 GB full
jlevt+detTable when only a handful of columns are needed. Per-detraw/dsp reads (A, B, F) are smaller files to begin with, so the
absolute numbers are modest but still favorable.
MCE