
Run the TrackQuality ANN directly from the onnx file with ONNX Runtime #4

Draft
AndrewEdmonds11 wants to merge 3 commits into Mu2e:main from AndrewEdmonds11:onnxruntime

Conversation

@AndrewEdmonds11
Contributor

This PR contains a draft of the code needed to run the TrkQual ANN directly from the onnx file, which will remove TMVA::SOFIE from the workflow. It stays a draft while I work out the final details and remove the old code. However, the output has been validated against the existing code:

Begin processing the 1st record. run: 1430 subRun: 0 event: 5 at 30-Mar-2026 10:45:40 CDT
[TrackQuality::produce::TrkQualAll] Inputs = 25, 1.0000, 1.2688, 0.2400, 0.6373, 0.1149 1.0800 --> output = 0.9459 (ORT: 0.9459)
[TrackQuality::produce::TrkQualAll] Inputs = 20, 0.8000, 1.4036, 0.4500, 0.0159, 0.1751 1.4000 --> output = 0.1627 (ORT: 0.1627)
[TrackQuality::produce::TrkQualAll] Inputs = 25, 1.0000, 1.2698, 0.4400, 0.1736, 0.2697 1.0800 --> output = 0.1440 (ORT: 0.1440)
[TrackQuality::produce::TrkQualAll] Inputs = 22, 0.8800, 1.3619, 0.4091, 0.0000, 0.1486 1.2273 --> output = 0.2698 (ORT: 0.2698)
Begin processing the 2nd record. run: 1430 subRun: 0 event: 8 at 30-Mar-2026 10:45:40 CDT
[TrackQuality::produce::TrkQualAll] Inputs = 42, 1.0000, 0.4454, 0.0952, 0.0000, 0.1808 1.0476 --> output = 0.8104 (ORT: 0.8104)
[TrackQuality::produce::TrkQualAll] Inputs = 32, 0.7619, 1.1694, 0.2500, 0.0000, 0.2672 1.3125 --> output = 0.1182 (ORT: 0.1182)
[TrackQuality::produce::TrkQualAll] Inputs = 42, 1.0000, 0.4456, 0.1429, 0.0000, 0.2196 1.1190 --> output = 0.5403 (ORT: 0.5403)
[TrackQuality::produce::TrkQualAll] Inputs = 34, 0.8095, 1.1090, 0.2353, 0.0000, 0.3372 1.2647 --> output = 0.0798 (ORT: 0.0798)
Begin processing the 3rd record. run: 1430 subRun: 0 event: 9 at 30-Mar-2026 10:45:40 CDT
[TrackQuality::produce::TrkQualAll] Inputs = 53, 0.7162, 0.8714, 0.3019, 0.0000, 0.0619 1.6226 --> output = 0.8246 (ORT: 0.8246)
[TrackQuality::produce::TrkQualAll] Inputs = 73, 0.9865, 0.4028, 0.0959, 0.0008, 0.0537 1.1233 --> output = 0.9948 (ORT: 0.9948)
[TrackQuality::produce::TrkQualAll] Inputs = 55, 0.7432, 0.8394, 0.3091, 0.0000, 0.0855 1.4727 --> output = 0.7908 (ORT: 0.7908)
[TrackQuality::produce::TrkQualAll] Inputs = 73, 0.9865, 0.4080, 0.0959, 0.2266, 0.0709 1.1233 --> output = 0.9954 (ORT: 0.9954)
[TrackQuality::produce::TrkQualAll] Inputs = 88, 0.9670, 0.4044, 0.0455, 0.0000, 0.0744 1.0227 --> output = 0.9959 (ORT: 0.9959)
[TrackQuality::produce::TrkQualAll] Inputs = 74, 0.8043, 0.7272, 0.2838, 0.0000, 0.0930 1.1757 --> output = 0.9649 (ORT: 0.9649)
[TrackQuality::produce::TrkQualAll] Inputs = 84, 0.9231, 0.7048, 0.1071, 0.0000, 0.1344 1.0952 --> output = 0.9896 (ORT: 0.9896)
[TrackQuality::produce::TrkQualAll] Inputs = 65, 0.7065, 0.8516, 0.2923, 0.0000, 0.1807 1.2923 --> output = 0.5042 (ORT: 0.5042)

where "output" is the value from the original code and "ORT" is the value from the new ONNX Runtime code.
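The by-eye comparison in the log above could also be automated. The following is a minimal sketch, not code from this PR: the helper name `scoresAgree` and the default tolerance of 5e-5 (roughly "equal at four printed decimals") are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the PR): returns true when every pair of
// scores agrees to within `tol`. The default tolerance is an assumption that
// approximates agreement at four printed decimal places.
bool scoresAgree(const std::vector<float>& reference,
                 const std::vector<float>& candidate,
                 float tol = 5e-5f) {
  // A size mismatch means the two code paths did not score the same tracks.
  if (reference.size() != candidate.size()) return false;
  for (std::size_t i = 0; i < reference.size(); ++i) {
    if (std::fabs(reference[i] - candidate[i]) > tol) return false;
  }
  return true;
}
```

Filling `reference` with the original outputs and `candidate` with the ONNX Runtime outputs for each event would turn the printout into a pass/fail check.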

@oksuzian
Contributor

Took a look through this draft. Validation numbers match SOFIE to 4 decimals — behavior looks correct. A few structural items for when you come back to clean it up:

Blockers

  • Hardcoded ONNX path.
    _session(_env, "ArtAnalysis/TrkDiag/data/TrkQual_ANN1_v2.onnx", _session_options),
    Only works when cwd is set correctly. The existing SOFIE path uses ConfigFileLookupPolicy(conf().datFilename()) — add a fhicl::Atom<std::string> onnxFilename and resolve it the same way. Since _session is initialized in the ctor init list, you'll need to resolve the path inline in the init-list expression (or via a small helper).
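A rough sketch of the "small helper" option, using only the standard library so that it compiles on its own. Everything here is a stand-in: `resolveDataFile` imitates a search-path lookup (it is not the ConfigFileLookupPolicy API), `FakeSession` takes the place of the ORT session, and `TrackQualitySketch` is a hypothetical class name. The point is only that the init list can call a free function to turn a configured relative name into a full path before the session is constructed.

```cpp
#include <cassert>
#include <filesystem>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical stand-in for a search-path lookup: returns the first
// candidate path that exists, and throws if none does.
std::string resolveDataFile(const std::string& relative,
                            const std::vector<std::string>& searchDirs) {
  for (const auto& dir : searchDirs) {
    const fs::path candidate = fs::path(dir) / relative;
    if (fs::exists(candidate)) return candidate.string();
  }
  throw std::runtime_error("data file not found on search path: " + relative);
}

// Stand-in for the inference session: it only records the model path.
struct FakeSession {
  explicit FakeSession(std::string path) : modelPath(std::move(path)) {}
  std::string modelPath;
};

class TrackQualitySketch {
public:
  // The helper runs inside the init-list expression, so the session member
  // never sees an unresolved, cwd-dependent path.
  TrackQualitySketch(const std::string& onnxFilename,
                     const std::vector<std::string>& searchDirs)
    : _session(resolveDataFile(onnxFilename, searchDirs)) {}

  const std::string& modelPath() const { return _session.modelPath; }

private:
  FakeSession _session;
};
```

In the real module the file name would come from the new fhicl parameter and the lookup from the existing policy class. Throwing on a missing file makes a bad configuration fail at construction rather than at the first event.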

Bugs / risks

  • print_shape is unused, and v.size() - 1 underflows to SIZE_MAX if v is empty — either wire it up behind _debug or delete.
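  • Member-init ordering is load-bearing and fragile. _env → _session_options → _session → _input_name → _type_info → _tensor_info → _input_shape → _memory_info → _output_name must be declared in this order in the class body, because C++ initializes members in declaration order regardless of how the ctor init list is written. Keep the init list in the same order as well, so it reads correctly and does not trigger -Wreorder. Worth a one-line comment above the member declarations noting this.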
  • (double) casts before assignment into std::vector<float> are meaningless (value gets narrowed to float anyway). Either make the vector double or drop the casts.
  • output_data[0] = 0 mutates the ORT tensor after Run. Works but surprising — float ort_score = entrance_found ? output_data[0] : 0.f; reads more naturally.
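Two of the items above can be illustrated with a short sketch. The signatures are assumptions, since the draft's `print_shape` is not reproduced here: `formatShape` is a hypothetical replacement that returns a string with an arbitrary "x" separator, and `selectScore` is a hypothetical wrapper around the suggested ternary.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Formats a tensor shape as "AxBxC". Tracking a "first element" flag avoids
// evaluating v.size() - 1, which wraps around to SIZE_MAX for an empty vector.
std::string formatShape(const std::vector<std::int64_t>& v) {
  std::ostringstream out;
  bool first = true;
  for (const std::int64_t dim : v) {
    if (!first) out << 'x';
    out << dim;
    first = false;
  }
  return out.str();
}

// Picks the reported score without writing into the output buffer:
// the network output if an entrance was found, otherwise 0.
float selectScore(const float* output_data, bool entrance_found) {
  return entrance_found ? output_data[0] : 0.f;
}
```

An empty shape yields an empty string instead of a runaway loop, and the tensor contents stay untouched after the call.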

Nits

  • SConscript indentation is inconsistent: surrounding lines use 2 spaces, new 'onnxruntime' uses 4 (mainlib) and 8 (plugins). Match surrounding style.
  • _session_options and _allocator are only needed at construction. Could be locals. Minor.
  • Dynamic-dim rewrite (if (dim == -1) dim = 1;) fixes batch=1. Batching N tracks per event would be a cheap win later — worth a TODO.
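For the batching TODO, a minimal sketch of the two pieces that would change, written without the ORT headers so it stands alone. Both function names are hypothetical, and `resolveShape` assumes the batch axis is the only dynamic dimension in the model.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Replaces dynamic dimensions (reported as -1) with the requested batch size.
// Assumption: the only dynamic dimension is the batch axis.
std::vector<std::int64_t> resolveShape(std::vector<std::int64_t> shape,
                                       std::int64_t batch) {
  for (auto& dim : shape) {
    if (dim == -1) dim = batch;
  }
  return shape;
}

// Packs per-track feature vectors into one contiguous row-major buffer of
// nTracks * nFeatures floats, the layout a single batched input would need.
std::vector<float> packBatch(const std::vector<std::vector<float>>& tracks,
                             std::size_t nFeatures) {
  std::vector<float> flat;
  flat.reserve(tracks.size() * nFeatures);
  for (const auto& t : tracks) {
    assert(t.size() == nFeatures);  // every track must supply all features
    flat.insert(flat.end(), t.begin(), t.end());
  }
  return flat;
}
```

With the seven inputs per track shown in the validation log, an event with N tracks would give a resolved shape of {N, 7} and a buffer of 7N floats, and the output tensor would then hold N scores from one Run call.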

zwl0331 added a commit to zwl0331/CalorimeterGNNClustering that referenced this pull request on May 10, 2026
Interface contract for the Mu2e Offline art::EDProducer that will
consume the exported CaloClusterNet .onnx model, following the pattern
in Andy Edmonds's Mu2e/ArtAnalysis#4. Covers: model artifact metadata
(opset 17, regenerate/validate commands); exact input/output tensor
names, shapes, dtypes, dynamic axes; the six node and eight edge
z-score stats as literal mean/std values from the train split so the
C++ caller doesn't need to parse a PyTorch .pt blob for six floats;
upstream graph construction (one graph per disk, r_max=210mm,
dt_max=25ns, kNN fallback, degree cap); and the full CCN+BFS10
cluster-assembly recipe as pseudocode with every hyperparameter frozen
(tau_edge=0.20, bfs_expand_cut=10 MeV, min_hits=2, min_energy=10 MeV).

Also embeds the 15c parity proof (max logit diff 9.06e-06, zero
threshold flips on 166K val edges, 12.5x CPU speedup vs PyTorch) so
the argument for trusting the deployment is in one place, and a list
of open items Sophie and Andy need to decide at the integration
meeting: central onnxruntime muse install status, module boundary
between graph construction and inference, whether to reuse Offline's
Calorimeter::neighbors/nextNeighbors vs porting the cKDTree builder,
normalisation-stats sidecar format for C++, and a model versioning
policy to catch silent tensor-layout drift.

Completes 15d in docs/plan.md. Milestone J is now 4/5; the remaining
gate is externally blocked on the integration meeting itself.