Skip to content

Refactor read_ldata() and update docs #172

Open
jonasschlegel wants to merge 19 commits into
mainfrom
read-ldata-restructured
Open

Refactor read_ldata() and update docs #172
jonasschlegel wants to merge 19 commits into
mainfrom
read-ldata-restructured

Conversation

@jonasschlegel
Copy link
Copy Markdown
Collaborator

Refactor read_ldata to drop ChannelId from its public API, add
per-detector slicing on event tiers (:jlevt, :jlskm, :jlpmt),
and introduce a cross-tier filter via the filtertier kwarg.
LegendHDF5IO extension docs are updated to match.

Changes

  1. DetectorId throughout. read_ldata accepts DetectorIdLike only.
    Channel helpers (_get_channelid, _get_detectorid,
    _is_valid_id_or_tier) and the channel-consistency check in the
    partition method are removed.
  2. Tier classification. _evt_tiers = [:jlevt, :jlskm, :jlpmt] and
    _perdet_tiers = [:jlpeaks, :jlhit, :jlpls] drive dispatch. raw and
    jldsp use _resolve_perdet_path, which tries tier/<det> (current),
    <det>/tier (legacy), and tier (struct-view) in order.
  3. Per-detector event-tier slicing. With a detector,
    read_ldata(_, l200, (:jlevt, fk, det)) returns a flat-prefixed Table
    (geds_e_cusp_ctc_cal, geds_trig_e_cusp_ctc_cal,
    aux_pulser_aux_trig, ged_spm_is_valid_lar, …). The name-map is
    built dynamically by walking the HDF5 group; per-trigger and
    per-det-list VoVs are distinguished by their inner-length signature
    and unwrapped at the detector's per-event slot.
  4. Cross-tier filter. A new filtertier kwarg lets a per-detector
    raw/jldsp read be filtered by an event-tier predicate (slicing via
    the per-det *_dataidx column, 1-based for raw and jldsp) or by a
    sibling per-trigger-tier predicate (1:1 row alignment per detector).
    Works for phy and cal.
  5. Sub-group descent. subgroup=:dataQC for
    tier/<det>/group-style layouts (e.g. jlhit).
  6. Missing-tolerant filter. Rows where filterby returns missing
    are dropped (same semantics as false).

What now works

# Per-detector event-tier read with a trigger-energy cut
read_ldata(@pf((; $geds_e_cusp_ctc_cal, $geds_t50)), l200, (:jlevt, fk, det);
           filterby = @pf $geds_is_valid_qc && $geds_trig_e_cusp_ctc_cal > 1500u"keV")

# Cross-subgroup QC filter (geds + ged_spm + !aux_muonveto + !aux_pulser)
read_ldata((:geds_e_cusp_ctc_cal,), l200, (:jlevt, fk, det);
           filterby = @pf $geds_is_valid_qc && $ged_spm_is_valid_lar &&
                          !$aux_muonveto_aux_trig && !$aux_pulser_aux_trig)

# Raw waveforms of phy events that pass an event-level QC + energy cut
read_ldata(@pf((; $waveform_presummed)), l200, (:raw, fk_phy, det);
           filterby   = @pf $geds_is_valid_qc && $geds_trig_e_cusp_ctc_cal > 1500u"keV",
           filtertier = :jlevt)

# Cal raw waveforms with a dsp-energy cut (no jlevt for cal)
read_ldata((:waveform_presummed,), l200, (:raw, fk_cal, det);
           filterby   = @pf $e_cusp > 1000,
           filtertier = :jldsp)

Backward-compat note

read_ldata(l200, :jlevt, fk) (no detector) still returns the native
nested LH5 view (evt.aux.pulser.aux_trig). Filters there use dot syntax
(@pf $geds.is_valid_qc == true); flat-prefix names exist only on the
per-detector side.

Performance

PropSel + filterby + filtertier fast-path vs. eager-load-then-filter
on real L200 data (FK p18-r000, B00000C, min of 3 trials):

Case speed-up memory
A — raw column selection (waveforms) 1.3× 2.1×
B — jldsp column + filter (e_cusp > 0) 1.0× 1.0×
C — jlevt event-scalar filter (cross-subgroup QC) 40.0× 42.3×
D — jlevt trigger-energy filter (>1500 keV) 33.7× 31.7×
E — cross-tier (filtertier=:jlevt, phy) 32.4× 26.0×
F — cross-tier (filtertier=:jldsp, cal) 1.6× 2.0×

The big wins (C–E) come from never materializing the ~1.3 GB full
jlevt+det Table when only a handful of columns are needed. Per-det
raw/dsp reads (A, B, F) are smaller files to begin with, so the
absolute numbers are modest but still favorable.

MCE

ENV["LEGEND_DATA_CONFIG"] = "/path/to/your/config.json"
using LegendDataManagement, LegendHDF5IO, PropertyFunctions
using Unitful: @u_str

l200    = LegendData(:l200)
PER, RUN = DataPeriod(18), DataRun(0)
fks_phy = search_disk(FileKey, l200.tier[:raw, :phy, PER, RUN])
fks_cal = search_disk(FileKey, l200.tier[:raw, :cal, PER, RUN])
FK_PHY, FK_CAL = first(fks_phy), first(fks_cal)
chinfo  = channelinfo(l200, FK_PHY)
DET_GED = first(filter(c -> c.system == :geds && c.processable, chinfo)).detector
DET_SPM = first(filter(c -> c.system == :spms, chinfo)).detector

# --- Column selection: three equivalent forms ----------------------------
read_ldata(l200, :jldsp, FK_PHY, DET_GED)                                # full
read_ldata((:e_cusp, :timestamp), l200, (:jldsp, FK_PHY, DET_GED))       # tuple
read_ldata(@pf((; $e_cusp, $timestamp)), l200, (:jldsp, FK_PHY, DET_GED))# @pf

# --- Per-detector across tiers (system auto-detected) --------------------
read_ldata((:waveform_presummed,), l200, (:raw,   FK_PHY, DET_GED))
read_ldata((:waveform_bit_drop,),  l200, (:raw,   FK_PHY, DET_SPM))
read_ldata((:e_cusp, :t50),        l200, (:jldsp, FK_PHY, DET_GED))
read_ldata((:e_cusp,),             l200, (:jlhit, FK_CAL, DET_GED); subgroup=:dataQC)
read_ldata((:daqenergy,),          l200, (:jlpls, FK_CAL, "PULS01");  subgroup=:tags)

# --- Event tier with detector → flat-prefixed Table ----------------------
read_ldata((:geds_e_cusp_ctc_cal, :geds_trig_e_cusp_ctc_cal),
           l200, (:jlevt, FK_PHY, DET_GED))

# Trigger-energy cut + cross-subgroup QC, all on flat names
read_ldata(@pf((; $geds_e_cusp_ctc_cal, $geds_t50)),
           l200, (:jlevt, FK_PHY, DET_GED);
           filterby = @pf $geds_is_valid_qc && $ged_spm_is_valid_lar &&
                          !$aux_muonveto_aux_trig && $geds_trig_e_cusp_ctc_cal > 20u"keV")

# --- Event tier without detector → native nested LH5 (use dots) ----------
# Note: the flat-prefix names (geds_is_valid_qc, …) only exist with a det.
# Without a det, use dot-syntax — `@pf $geds_is_valid_qc` would error,
# `@pf $geds.is_valid_qc == true` works (generic getproperty path).
evt = read_ldata(l200, :jlevt, FK_PHY;
                 filterby = @pf $geds.is_valid_qc == true)
evt.geds.e_cusp_ctc_cal       # VoV — per event, list of energies
evt.aux.pulser.aux_trig       # Vector{Bool}

# Old call style still works:
read_ldata((:tags,), l200, DataTier(:jlpls), :phy, partinfo, det_puls)

# --- Cross-tier filter (filtertier kwarg) --------------------------------
# (1) raw waveforms of events passing a jlevt QC + energy cut (phy)
read_ldata(@pf((; $waveform_presummed, $timestamp)), l200, (:raw, FK_PHY, DET_GED);
           filterby   = @pf $geds_is_valid_qc && $geds_trig_e_cusp_ctc_cal > 1500u"keV",
           filtertier = :jlevt)

# (2) cal raw filtered via cal jldsp (no jlevt for cal)
read_ldata((:waveform_presummed,), l200, (:raw, FK_CAL, DET_GED);
           filterby   = (@pf $e_cusp > 1000),
           filtertier = :jldsp)

# (3) symmetric — jldsp filtered via raw
read_ldata((:e_cusp,), l200, (:jldsp, FK_PHY, DET_GED);
           filterby   = (@pf $daqenergy > 100),
           filtertier = :raw)

# --- Multi-filekey, period, partition ------------------------------------
read_ldata((:e_cusp,), l200, (:jldsp, fks_cal[1:3], DET_GED))
read_ldata((:e_cusp,), l200, (:jldsp, :phy, PER, RUN, DET_GED))
read_ldata((:e_cusp,), l200, (:jldsp, :phy, DataPartition(1), DET_GED))

# Distributed: parallel=true after addprocs + @everywhere using …
# read_ldata((:e_cusp,), l200, (:jldsp, fks_phy, DET_GED); parallel=true)

# --- Other kwargs --------------------------------------------------------
read_ldata((:e_cusp,), l200, (:jldsp, FK_PHY, DET_GED); n_evts=1000)
read_ldata((:e_cusp,), l200, (:jldsp, FK_PHY, "DOESNOTEXIST"); ignore_missing=true)

Removes the ChannelOrDetectorIdLike helpers (_get_channelid,
_get_detectorid, _is_valid_id_or_tier) and the LDMUtils channel
conversion imports. All read_ldata methods now take DetectorIdLike
directly; the partition-method's channel-consistency check was made
obsolete by this and is removed. The jlevt+det post-filter switches
from trig_e_ch to trig_e_det.
Introduces _evt_tiers and _perdet_tiers as the central dispatch
constants and rewrites _lh5_data_open to use them (per-det tiers
require a det, evt tiers don't). The shared PropSel/filterby/identity
logic is extracted into _apply_read, and per-det reads now go through
_resolve_perdet_path to support both the current 'tier/<det>' and the
legacy '<det>/tier' layout. Adds a 'subgroup' kwarg for descents like
jlhit/<det>/dataQC and an ignore_missing warning for missing detectors.

Detector discovery in read_ldata((tier, fk)) now lists 'tier/<det>'
primarily and falls back to top-level keys; evt tiers route to the
no-det read directly. _convert_rsel2dsel disambiguates DataTier the
same way it already did for DataCategory.

The jlevt+det post-filter still uses the old shape (trig_e_det); the
proper per-detector evt-tier reads come in the next commit.
read_ldata((:jlevt, fk, det)) now returns a flat-prefixed table with
the events restricted to those where the detector is present. Per-det
columns (e_cusp_ctc_cal, …) are unwrapped at the detector's per-event
slot in the 'detector' VoV, per-trigger columns (trig_e_cusp_ctc_cal,
…) at its slot in 'trig_e_det'. The two index schemes are distinguished
per column via the materialized inner-length signature (_index_col), so
the reader picks the right indexing automatically and works for both
GED-like (mixed) and SPM/PMT-like (single-scheme) systems.

The flat-prefix name-map is built dynamically by walking the HDF5 group
(_build_evt_namemap / _walk_evt_h5!), so new sub-groups added on the
file side need no change here. PropSel + filterby reads only the
columns they actually reference; the eager fallback materializes the
whole table and then applies f/filterby/n_evts via _apply_read.

jlpmt is added to _evt_tiers, with its own (tier=jlpmt, sys=pmts) entry
in _evt_persubdet_path so that the table itself is the per-pmt root.

The no-det evt-tier read still returns the native nested LH5 structure
(e.g. evt.aux.pulser.aux_trig); only the with-det path produces the
flat-prefixed table.
A new filtertier kwarg lets a per-det raw/jldsp read use an event-tier
filter: read filtertier with the given filterby for the same detector
to obtain a Vector of per-det row indices (geds_dataidx / spms_dataidx
/ dataidx, depending on system), then load the target tier and slice.
The dataidx columns are 1-based for both raw and jldsp, so they index
directly without an offset.

Default filtertier=nothing keeps the existing behavior. The kwarg is
ignored when filtertier == tier; cross-tier requires a DetectorId and
an event tier as filtertier (geds, spms, pmts have known index columns).

Multi-filekey works automatically — the per-fk dispatch forwards
filterby and filtertier through kwargs.
raw and jldsp share an identical row layout per (det, fk) — same row
count and 1:1 trigger alignment. The cross-tier filter now also
handles this case: filterby is evaluated on the filtertier's data, the
resulting Bool mask is applied to the target tier directly.

This works for both phy (where jlevt also exists) and cal (where it
does not), so 'show me cal raw waveforms with dsp e_cusp > 1000' is
now a one-liner — previously you would have had to read jldsp,
compute a mask, read raw, and slice manually.
Drop the ChannelId-era examples (channels are no longer used by
read_ldata) and document what the refactor enables: per-detector
slicing on event tiers (jlevt/jlskm/jlpmt) with flat-prefixed columns,
auto-detected detector system, sub-group descent via the subgroup
kwarg, and the cross-tier filter (filtertier kwarg) covering both
event tier → raw/jldsp and raw ↔ jldsp directions.
Add: per-tier with/without-detector behavior table, no-detector
event-tier read example (nested LH5 + dot syntax), sub-group descent
examples for jlhit/jlpls, multi-detector discovery (NamedTuple
return), explicit @pf flat-vs-dot syntax explanation, listed system
index columns for cross-tier filter, distributed-read example, and
notes on n_evts and ignore_missing semantics.
@jonasschlegel jonasschlegel self-assigned this Apr 29, 2026
@jonasschlegel jonasschlegel added bug Something isn't working enhancement New feature or request labels Apr 29, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 29, 2026

Codecov Report

❌ Patch coverage is 0% with 178 lines in your changes missing coverage. Please review.
✅ Project coverage is 46.28%. Comparing base (d2134f7) to head (2e15d35).

Files with missing lines Patch % Lines
ext/LegendDataManagementLegendHDF5IOExt.jl 0.00% 178 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #172      +/-   ##
==========================================
- Coverage   47.85%   46.28%   -1.57%     
==========================================
  Files          27       27              
  Lines        2466     2547      +81     
==========================================
- Hits         1180     1179       -1     
- Misses       1286     1368      +82     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jonasschlegel jonasschlegel marked this pull request as draft April 29, 2026 13:54
In some datasets (observed in p16-r000-phy) the *_dataidx column on
the event tier is a run-wide cumulative offset, not a per-(fk, det)
1-based row index — using it as the slice index produces out-of-bounds
errors. The per-detector event number *_detevtno is the right column:
it is consistently 1-based per (fk, det) and matches the raw / jldsp
row layout for the same detector. Updated the index dictionary, docs,
and test references accordingly.
…bgroup whose inner-length signature matches — e.g. ged_spm

per-fiber VoVs share the spms detector axis and pick up the SPM's slot. Columns that match neither are returned as-is.
@DaGeibl
Copy link
Copy Markdown
Contributor

DaGeibl commented May 15, 2026

Hi, a review has been requested, but this PR is still a draft and nothing has changed since two weeks. @jonasschlegel is your work finished and is this PR ready to reviewed? If yes please convert this into a proper PR.

Two additions to read_ldata:

1. _apply_read: when both PropSel and a PropertyFunction filterby are given,
   load only the union of source columns referenced by projection and filter,
   then filter+project. Avoids loading all columns when only a subset is used.

2. _read_evt_no_det_table: for event-tier reads without a detector, allow
   PropSel on flat-prefixed leaf columns (e.g. :geds_multiplicity,
   :aux_pulser_aux_trig) via the existing _build_evt_namemap. Falls back to
   the legacy nested-NamedTuple read when no PropSel is given or when a
   requested column is not a flat leaf (back-compat).
Three call-sites (_apply_read, _read_evt_table, _read_evt_no_det_table)
shared the same load-needed/filter/n_evts/reproject pattern; extracted to
_propsel_filter_apply, parameterised by a per-column load callback. Also
extracted _sample_idx so the multi-method _load_all_keys reads cleanly.

Trimmed multi-line comment blocks throughout the extension to 0-1 lines
where the code is self-documenting. Kept the *_detevtno vs *_dataidx
note (load-bearing) and the trig_e_det vs detector mask hint. Fixed the
stale "uses *_dataidx" comment in the cross-tier filter dispatch. Removed
an unactioned TODO in the parallel partition reader.

No behaviour change for back-compat paths (no-PropSel evt-tier reads still
return the native nested LH5).
Reorganise the LegendHDF5IO extension section of extensions.md:

- Tier overview table (file layout × with/without-det × subgroup × per-det evt slicing)
- Call forms section enumerating all supported rsel shapes
- Per-detector evt-tier system support table (jlevt/jlskm/jlpmt × GED/SPM/PMT)
- filtertier compatibility matrix (target × source)
- No-detector evt-tier reads now document both flat-leaf PropSel and nested LH5 modes

Also wrap all inline @pf kwarg usages in parens (@pf(...)) so the
examples are copy-paste-safe in the REPL — naked @pf in kwarg position
gets cut off at the next comma and PropertyFunctions then fails with
"invalid assignment location".
In the no-detector (tier, fk) read for non-evt tiers, the fallback path
for the legacy "<det>/<tier>" layout iterated over every top-level key
in the file and called haskey(h, "<k>/<tier>") on each. If a top-level
key happened to be a Dataset (e.g. a runinfo / metadata marker), HDF5
descended into it and threw

    MethodError: no method matching haskey(::HDF5.Dataset, ::String, ...)

since haskey is only defined for File/Group. Filter the top-level keys
to Groups before calling haskey, so non-Group entries are silently
skipped — they can't be detector folders anyway.
The (tier, fk) no-detector dispatch lists detectors by enumerating
top-level groups (current "tier/<det>") or by walking top-level keys
that contain a tier sub-group (legacy "<det>/tier"). Older datasets
have group names that don't parse as a valid DetectorId (e.g. early
SiPM/manufacturer-style names like "BF862-06"); those used to abort
the whole read with

    ArgumentError: String "BF862-06" does not look like a valid LEGEND
    detector id

once the per-detector recursion reached DetectorId(rsel[3]).

Filter both layouts via _is_valid_detid so non-parseable entries are
silently skipped — only entries that can become a DetectorId end up
in the per-det fan-out.
Reverts d729e8a and 43239b5. The local try/catch wrapper is replaced
by LegendDataManagement._can_convert_to(DetectorId, ...) in the
following commit.
The no-detector (tier, fk) dispatch enumerates top-level groups to
list per-detector subtrees. Older datasets contain entries whose
names don't parse as DetectorId (e.g. early SiPM/manufacturer names
like "BF862-06"); without filtering, the per-detector fan-out aborts
with

    ArgumentError: String "BF862-06" does not look like a valid LEGEND
    detector id

Use the existing LegendDataManagement._can_convert_to(DetectorId, ...)
predicate (regex-based, dispatched for AbstractString/Symbol/Integer)
instead of a local try/catch wrapper around the DetectorId constructor.
@jonasschlegel jonasschlegel marked this pull request as ready for review May 19, 2026 13:49
@DaGeibl
Copy link
Copy Markdown
Contributor

DaGeibl commented May 20, 2026

Can you maybe add a few test, to see that we can guarantee we can open the data in any combination of DataSelectors?
In the testdata there is a cal file with the new data format, so you can use that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants