Refactor read_ldata() and update docs by jonasschlegel · Pull Request #172 · legend-exp/LegendDataManagement.jl

jonasschlegel · 2026-04-29T13:40:19Z

Refactor read_ldata to drop ChannelId from its public API, add
per-detector slicing on event tiers (:jlevt, :jlskm, :jlpmt),
and introduce a cross-tier filter via the filtertier kwarg.
LegendHDF5IO extension docs are updated to match.

Changes

DetectorId throughout. read_ldata accepts DetectorIdLike only.
Channel helpers (_get_channelid, _get_detectorid,
_is_valid_id_or_tier) and the channel-consistency check in the
partition method are removed.
Tier classification. _evt_tiers = [:jlevt, :jlskm, :jlpmt] and
_perdet_tiers = [:jlpeaks, :jlhit, :jlpls] drive dispatch. raw and
jldsp use _resolve_perdet_path, which tries tier/<det> (current),
<det>/tier (legacy), and tier (struct-view) in order.
Per-detector event-tier slicing. With a detector,
read_ldata(_, l200, (:jlevt, fk, det)) returns a flat-prefixed Table
(geds_e_cusp_ctc_cal, geds_trig_e_cusp_ctc_cal,
aux_pulser_aux_trig, ged_spm_is_valid_lar, …). The name-map is
built dynamically by walking the HDF5 group; per-trigger and
per-det-list VoVs are distinguished by their inner-length signature
and unwrapped at the detector's per-event slot.
Cross-tier filter. A new filtertier kwarg lets a per-detector
raw/jldsp read be filtered by an event-tier predicate (slicing via
the per-det *_dataidx column, 1-based for raw and jldsp) or by a
sibling per-trigger-tier predicate (1:1 row alignment per detector).
Works for phy and cal.
Sub-group descent. subgroup=:dataQC for
tier/<det>/group-style layouts (e.g. jlhit).
Missing-tolerant filter. Rows where filterby returns missing
are dropped (same semantics as false).

What now works

# Per-detector event-tier read with a trigger-energy cut
read_ldata(@pf((; $geds_e_cusp_ctc_cal, $geds_t50)), l200, (:jlevt, fk, det);
           filterby = @pf $geds_is_valid_qc && $geds_trig_e_cusp_ctc_cal > 1500u"keV")

# Cross-subgroup QC filter (geds + ged_spm + !aux_muonveto + !aux_pulser)
read_ldata((:geds_e_cusp_ctc_cal,), l200, (:jlevt, fk, det);
           filterby = @pf $geds_is_valid_qc && $ged_spm_is_valid_lar &&
                          !$aux_muonveto_aux_trig && !$aux_pulser_aux_trig)

# Raw waveforms of phy events that pass an event-level QC + energy cut
read_ldata(@pf((; $waveform_presummed)), l200, (:raw, fk_phy, det);
           filterby   = @pf $geds_is_valid_qc && $geds_trig_e_cusp_ctc_cal > 1500u"keV",
           filtertier = :jlevt)

# Cal raw waveforms with a dsp-energy cut (no jlevt for cal)
read_ldata((:waveform_presummed,), l200, (:raw, fk_cal, det);
           filterby   = @pf $e_cusp > 1000,
           filtertier = :jldsp)

Backward-compat note

read_ldata(l200, :jlevt, fk) (no detector) still returns the native
nested LH5 view (evt.aux.pulser.aux_trig). Filters there use dot syntax
(@pf $geds.is_valid_qc == true); flat-prefix names exist only on the
per-detector side.

Performance

PropSel + filterby + filtertier fast-path vs. eager-load-then-filter
on real L200 data (FK p18-r000, B00000C, min of 3 trials):

Case	speed-up	memory
A — raw column selection (waveforms)	1.3×	2.1×
B — jldsp column + filter (`e_cusp > 0`)	1.0×	1.0×
C — jlevt event-scalar filter (cross-subgroup QC)	40.0×	42.3×
D — jlevt trigger-energy filter (`>1500 keV`)	33.7×	31.7×
E — cross-tier (`filtertier=:jlevt`, phy)	32.4×	26.0×
F — cross-tier (`filtertier=:jldsp`, cal)	1.6×	2.0×

The big wins (C–E) come from never materializing the ~1.3 GB full
jlevt+det Table when only a handful of columns are needed. Per-det
raw/dsp reads (A, B, F) are smaller files to begin with, so the
absolute numbers are modest but still favorable.

MCE

ENV["LEGEND_DATA_CONFIG"] = "/path/to/your/config.json"
using LegendDataManagement, LegendHDF5IO, PropertyFunctions
using Unitful: @u_str

l200    = LegendData(:l200)
PER, RUN = DataPeriod(18), DataRun(0)
fks_phy = search_disk(FileKey, l200.tier[:raw, :phy, PER, RUN])
fks_cal = search_disk(FileKey, l200.tier[:raw, :cal, PER, RUN])
FK_PHY, FK_CAL = first(fks_phy), first(fks_cal)
chinfo  = channelinfo(l200, FK_PHY)
DET_GED = first(filter(c -> c.system == :geds && c.processable, chinfo)).detector
DET_SPM = first(filter(c -> c.system == :spms, chinfo)).detector

# --- Column selection: three equivalent forms ----------------------------
read_ldata(l200, :jldsp, FK_PHY, DET_GED)                                # full
read_ldata((:e_cusp, :timestamp), l200, (:jldsp, FK_PHY, DET_GED))       # tuple
read_ldata(@pf((; $e_cusp, $timestamp)), l200, (:jldsp, FK_PHY, DET_GED))# @pf

# --- Per-detector across tiers (system auto-detected) --------------------
read_ldata((:waveform_presummed,), l200, (:raw,   FK_PHY, DET_GED))
read_ldata((:waveform_bit_drop,),  l200, (:raw,   FK_PHY, DET_SPM))
read_ldata((:e_cusp, :t50),        l200, (:jldsp, FK_PHY, DET_GED))
read_ldata((:e_cusp,),             l200, (:jlhit, FK_CAL, DET_GED); subgroup=:dataQC)
read_ldata((:daqenergy,),          l200, (:jlpls, FK_CAL, "PULS01");  subgroup=:tags)

# --- Event tier with detector → flat-prefixed Table ----------------------
read_ldata((:geds_e_cusp_ctc_cal, :geds_trig_e_cusp_ctc_cal),
           l200, (:jlevt, FK_PHY, DET_GED))

# Trigger-energy cut + cross-subgroup QC, all on flat names
read_ldata(@pf((; $geds_e_cusp_ctc_cal, $geds_t50)),
           l200, (:jlevt, FK_PHY, DET_GED);
           filterby = @pf $geds_is_valid_qc && $ged_spm_is_valid_lar &&
                          !$aux_muonveto_aux_trig && $geds_trig_e_cusp_ctc_cal > 20u"keV")

# --- Event tier without detector → native nested LH5 (use dots) ----------
# Note: the flat-prefix names (geds_is_valid_qc, …) only exist with a det.
# Without a det, use dot-syntax — `@pf $geds_is_valid_qc` would error,
# `@pf $geds.is_valid_qc == true` works (generic getproperty path).
evt = read_ldata(l200, :jlevt, FK_PHY;
                 filterby = @pf $geds.is_valid_qc == true)
evt.geds.e_cusp_ctc_cal       # VoV — per event, list of energies
evt.aux.pulser.aux_trig       # Vector{Bool}

# Old call style still works:
read_ldata((:tags,), l200, DataTier(:jlpls), :phy, partinfo, det_puls)

# --- Cross-tier filter (filtertier kwarg) --------------------------------
# (1) raw waveforms of events passing a jlevt QC + energy cut (phy)
read_ldata(@pf((; $waveform_presummed, $timestamp)), l200, (:raw, FK_PHY, DET_GED);
           filterby   = @pf $geds_is_valid_qc && $geds_trig_e_cusp_ctc_cal > 1500u"keV",
           filtertier = :jlevt)

# (2) cal raw filtered via cal jldsp (no jlevt for cal)
read_ldata((:waveform_presummed,), l200, (:raw, FK_CAL, DET_GED);
           filterby   = (@pf $e_cusp > 1000),
           filtertier = :jldsp)

# (3) symmetric — jldsp filtered via raw
read_ldata((:e_cusp,), l200, (:jldsp, FK_PHY, DET_GED);
           filterby   = (@pf $daqenergy > 100),
           filtertier = :raw)

# --- Multi-filekey, period, partition ------------------------------------
read_ldata((:e_cusp,), l200, (:jldsp, fks_cal[1:3], DET_GED))
read_ldata((:e_cusp,), l200, (:jldsp, :phy, PER, RUN, DET_GED))
read_ldata((:e_cusp,), l200, (:jldsp, :phy, DataPartition(1), DET_GED))

# Distributed: parallel=true after addprocs + @everywhere using …
# read_ldata((:e_cusp,), l200, (:jldsp, fks_phy, DET_GED); parallel=true)

# --- Other kwargs --------------------------------------------------------
read_ldata((:e_cusp,), l200, (:jldsp, FK_PHY, DET_GED); n_evts=1000)
read_ldata((:e_cusp,), l200, (:jldsp, FK_PHY, "DOESNOTEXIST"); ignore_missing=true)

Removes the ChannelOrDetectorIdLike helpers (_get_channelid, _get_detectorid, _is_valid_id_or_tier) and the LDMUtils channel conversion imports. All read_ldata methods now take DetectorIdLike directly; the partition-method's channel-consistency check was made obsolete by this and is removed. The jlevt+det post-filter switches from trig_e_ch to trig_e_det.

Introduces _evt_tiers and _perdet_tiers as the central dispatch constants and rewrites _lh5_data_open to use them (per-det tiers require a det, evt tiers don't). The shared PropSel/filterby/identity logic is extracted into _apply_read, and per-det reads now go through _resolve_perdet_path to support both the current 'tier/<det>' and the legacy '<det>/tier' layout. Adds a 'subgroup' kwarg for descents like jlhit/<det>/dataQC and an ignore_missing warning for missing detectors. Detector discovery in read_ldata((tier, fk)) now lists 'tier/<det>' primarily and falls back to top-level keys; evt tiers route to the no-det read directly. _convert_rsel2dsel disambiguates DataTier the same way it already did for DataCategory. The jlevt+det post-filter still uses the old shape (trig_e_det); the proper per-detector evt-tier reads come in the next commit.

read_ldata((:jlevt, fk, det)) now returns a flat-prefixed table with the events restricted to those where the detector is present. Per-det columns (e_cusp_ctc_cal, …) are unwrapped at the detector's per-event slot in the 'detector' VoV, per-trigger columns (trig_e_cusp_ctc_cal, …) at its slot in 'trig_e_det'. The two index schemes are distinguished per column via the materialized inner-length signature (_index_col), so the reader picks the right indexing automatically and works for both GED-like (mixed) and SPM/PMT-like (single-scheme) systems. The flat-prefix name-map is built dynamically by walking the HDF5 group (_build_evt_namemap / _walk_evt_h5!), so new sub-groups added on the file side need no change here. PropSel + filterby reads only the columns they actually reference; the eager fallback materializes the whole table and then applies f/filterby/n_evts via _apply_read. jlpmt is added to _evt_tiers, with its own (tier=jlpmt, sys=pmts) entry in _evt_persubdet_path so that the table itself is the per-pmt root. The no-det evt-tier read still returns the native nested LH5 structure (e.g. evt.aux.pulser.aux_trig); only the with-det path produces the flat-prefixed table.

A new filtertier kwarg lets a per-det raw/jldsp read use an event-tier filter: read filtertier with the given filterby for the same detector to obtain a Vector of per-det row indices (geds_dataidx / spms_dataidx / dataidx, depending on system), then load the target tier and slice. The dataidx columns are 1-based for both raw and jldsp, so they index directly without an offset. Default filtertier=nothing keeps the existing behavior. The kwarg is ignored when filtertier == tier; cross-tier requires a DetectorId and an event tier as filtertier (geds, spms, pmts have known index columns). Multi-filekey works automatically — the per-fk dispatch forwards filterby and filtertier through kwargs.

raw and jldsp share an identical row layout per (det, fk) — same row count and 1:1 trigger alignment. The cross-tier filter now also handles this case: filterby is evaluated on the filtertier's data, the resulting Bool mask is applied to the target tier directly. This works for both phy (where jlevt also exists) and cal (where it does not), so 'show me cal raw waveforms with dsp e_cusp > 1000' is now a one-liner — previously you would have had to read jldsp, compute a mask, read raw, and slice manually.

Drop the ChannelId-era examples (channels are no longer used by read_ldata) and document what the refactor enables: per-detector slicing on event tiers (jlevt/jlskm/jlpmt) with flat-prefixed columns, auto-detected detector system, sub-group descent via the subgroup kwarg, and the cross-tier filter (filtertier kwarg) covering both event tier → raw/jldsp and raw ↔ jldsp directions.

@pf

Add: per-tier with/without-detector behavior table, no-detector event-tier read example (nested LH5 + dot syntax), sub-group descent examples for jlhit/jlpls, multi-detector discovery (NamedTuple return), explicit @pf flat-vs-dot syntax explanation, listed system index columns for cross-tier filter, distributed-read example, and notes on n_evts and ignore_missing semantics.

codecov · 2026-04-29T13:50:33Z

Codecov Report

❌ Patch coverage is 0% with 178 lines in your changes missing coverage. Please review.
✅ Project coverage is 46.28%. Comparing base (d2134f7) to head (2e15d35).

Files with missing lines	Patch %	Lines
ext/LegendDataManagementLegendHDF5IOExt.jl	0.00%	178 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #172      +/-   ##
==========================================
- Coverage   47.85%   46.28%   -1.57%     
==========================================
  Files          27       27              
  Lines        2466     2547      +81     
==========================================
- Hits         1180     1179       -1     
- Misses       1286     1368      +82

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

In some datasets (observed in p16-r000-phy) the *_dataidx column on the event tier is a run-wide cumulative offset, not a per-(fk, det) 1-based row index — using it as the slice index produces out-of-bounds errors. The per-detector event number *_detevtno is the right column: it is consistently 1-based per (fk, det) and matches the raw / jldsp row layout for the same detector. Updated the index dictionary, docs, and test references accordingly.

…bgroup whose inner-length signature matches — e.g. ged_spm per-fiber VoVs share the spms detector axis and pick up the SPM's slot. Columns that match neither are returned as-is.

DaGeibl · 2026-05-15T14:47:34Z

Hi, a review has been requested, but this PR is still a draft and nothing has changed since two weeks. @jonasschlegel is your work finished and is this PR ready to reviewed? If yes please convert this into a proper PR.

Two additions to read_ldata: 1. _apply_read: when both PropSel and a PropertyFunction filterby are given, load only the union of source columns referenced by projection and filter, then filter+project. Avoids loading all columns when only a subset is used. 2. _read_evt_no_det_table: for event-tier reads without a detector, allow PropSel on flat-prefixed leaf columns (e.g. :geds_multiplicity, :aux_pulser_aux_trig) via the existing _build_evt_namemap. Falls back to the legacy nested-NamedTuple read when no PropSel is given or when a requested column is not a flat leaf (back-compat).

Three call-sites (_apply_read, _read_evt_table, _read_evt_no_det_table) shared the same load-needed/filter/n_evts/reproject pattern; extracted to _propsel_filter_apply, parameterised by a per-column load callback. Also extracted _sample_idx so the multi-method _load_all_keys reads cleanly. Trimmed multi-line comment blocks throughout the extension to 0-1 lines where the code is self-documenting. Kept the *_detevtno vs *_dataidx note (load-bearing) and the trig_e_det vs detector mask hint. Fixed the stale "uses *_dataidx" comment in the cross-tier filter dispatch. Removed an unactioned TODO in the parallel partition reader. No behaviour change for back-compat paths (no-PropSel evt-tier reads still return the native nested LH5).

@pf

Reorganise the LegendHDF5IO extension section of extensions.md: - Tier overview table (file layout × with/without-det × subgroup × per-det evt slicing) - Call forms section enumerating all supported rsel shapes - Per-detector evt-tier system support table (jlevt/jlskm/jlpmt × GED/SPM/PMT) - filtertier compatibility matrix (target × source) - No-detector evt-tier reads now document both flat-leaf PropSel and nested LH5 modes Also wrap all inline @pf kwarg usages in parens (@pf(...)) so the examples are copy-paste-safe in the REPL — naked @pf in kwarg position gets cut off at the next comma and PropertyFunctions then fails with "invalid assignment location".

In the no-detector (tier, fk) read for non-evt tiers, the fallback path for the legacy "<det>/<tier>" layout iterated over every top-level key in the file and called haskey(h, "<k>/<tier>") on each. If a top-level key happened to be a Dataset (e.g. a runinfo / metadata marker), HDF5 descended into it and threw MethodError: no method matching haskey(::HDF5.Dataset, ::String, ...) since haskey is only defined for File/Group. Filter the top-level keys to Groups before calling haskey, so non-Group entries are silently skipped — they can't be detector folders anyway.

The (tier, fk) no-detector dispatch lists detectors by enumerating top-level groups (current "tier/<det>") or by walking top-level keys that contain a tier sub-group (legacy "<det>/tier"). Older datasets have group names that don't parse as a valid DetectorId (e.g. early SiPM/manufacturer-style names like "BF862-06"); those used to abort the whole read with ArgumentError: String "BF862-06" does not look like a valid LEGEND detector id once the per-detector recursion reached DetectorId(rsel[3]). Filter both layouts via _is_valid_detid so non-parseable entries are silently skipped — only entries that can become a DetectorId end up in the per-det fan-out.

…riction

Reverts d729e8a and 43239b5. The local try/catch wrapper is replaced by LegendDataManagement._can_convert_to(DetectorId, ...) in the following commit.

The no-detector (tier, fk) dispatch enumerates top-level groups to list per-detector subtrees. Older datasets contain entries whose names don't parse as DetectorId (e.g. early SiPM/manufacturer names like "BF862-06"); without filtering, the per-detector fan-out aborts with ArgumentError: String "BF862-06" does not look like a valid LEGEND detector id Use the existing LegendDataManagement._can_convert_to(DetectorId, ...) predicate (regex-based, dispatched for AbstractString/Symbol/Integer) instead of a local try/catch wrapper around the DetectorId constructor.

DaGeibl · 2026-05-20T12:47:38Z

Can you maybe add a few test, to see that we can guarantee we can open the data in any combination of DataSelectors?
In the testdata there is a cal file with the new data format, so you can use that.

jonasschlegel added 7 commits April 29, 2026 13:16

jonasschlegel requested review from DaGeibl and theHenks April 29, 2026 13:40

jonasschlegel self-assigned this Apr 29, 2026

jonasschlegel added bug Something isn't working enhancement New feature or request labels Apr 29, 2026

jonasschlegel marked this pull request as draft April 29, 2026 13:54

jonasschlegel added 2 commits April 29, 2026 16:09

Apply the same per-det / per-trigger indexing to columns from any su…

5eb5845

…bgroup whose inner-length signature matches — e.g. ged_spm per-fiber VoVs share the spms detector axis and pick up the SPM's slot. Columns that match neither are returned as-is.

DaGeibl mentioned this pull request May 15, 2026

change key filter of raw files to only allow DetectorID's #170

Closed

jonasschlegel added 10 commits May 18, 2026 11:08

Accept Symbol det names in _is_valid_detid filter

43239b5

Clarify call-form examples: mix :raw / :jldsp, note per-det tier rest…

df67402

…riction

Revert _is_valid_detid try/catch helper

d8837dc

Reverts d729e8a and 43239b5. The local try/catch wrapper is replaced by LegendDataManagement._can_convert_to(DetectorId, ...) in the following commit.

Show ENV["LEGEND_DATA_CONFIG"] setup in read_ldata example

2e15d35

jonasschlegel marked this pull request as ready for review May 19, 2026 13:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor read_ldata() and update docs #172

Refactor read_ldata() and update docs #172
jonasschlegel wants to merge 19 commits into
mainfrom
read-ldata-restructured

jonasschlegel commented Apr 29, 2026

Uh oh!

codecov Bot commented Apr 29, 2026 •

edited

Loading

Uh oh!

DaGeibl commented May 15, 2026

Uh oh!

DaGeibl commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jonasschlegel commented Apr 29, 2026

Changes

What now works

Backward-compat note

Performance

MCE

Uh oh!

codecov Bot commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

DaGeibl commented May 15, 2026

Uh oh!

DaGeibl commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Apr 29, 2026 •

edited

Loading