Conversation
- Add extract_dicom_plans.py for treatment plan extraction
- Add CLAUDE.md with repository documentation
- Add comprehensive XVI analysis documentation
- Add DICOM conformance statement
- Add .gitignore with patient data protection
- Configure Claude Code settings

🔒 Patient confidential data is excluded from repository
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
…ed by claude unverified, extract_elekta_rps_matrices.py also untested
- Add site_name to DicomAnonymiser; PatientName now AnonID^SiteName - Add anonymise_all_dcm() for recursive DICOM discovery regardless of directory layout - Anonymise filenames: replace parenthesised patient name with anon_id - Add source_base param to anonymise_file for arbitrary source dirs - Add classify_dicom_files() to sort .dcm by Modality tag (CT/RTPLAN/RTSTRUCT/RTDOSE) - Expand copy_anonymised_plans to handle Structure Set and Dose categories - Update execute() with anon_struct_dir and anon_dose_dir params
Replace real patient identifiers (MRN, name) used as test data with generic placeholders across source, tests, and scripts. Delete untracked files containing raw DICOM tag dumps.
- Swap step order: Config → Preview → Folder Sort → Anonymise → PII → Report - Fix segfault: QtLogHandler now uses pyqtSignal bridge for thread-safe logging - Add QScrollArea to ConfigPage for scrollable form - Copy anonymised plans to LEARN structure after anonymisation step - Enrich all session types (including MotionView) for datetime extraction - Improve undated session matching: use treatment_id + directory proximity
This workflow runs Bandit, a security linter for Python, on code pushes and pull requests to the master branch, as well as on a scheduled basis.
- Fraction numbering now starts at FX1 instead of FX0 - Fix _on_continue skipping worker launch for next step - PII check auto-detects MRN from source path - Show amber SKIPPED banner when no PII strings configured
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Move extract_elekta_rps_matrices.py, read_dicom_tags.py, and elektafdt_crawler.py into scripts/. Move run_patient_example.py into examples/. Delete empty Support/ directory and stale root __pycache__. Fix sys.path import in cbct-shifts/compare_rps_mosaiq.py to reference the new scripts/ location. Update path references in config.py, utils.py, Docs, and the example docstring. Rewrite README.md with full repo layout, GUI usage, and Python API docs.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Cover anonymise_ini_file, anonymise_centroid_file, anonymise_trajectory_log, and the end-to-end anonymise_output_folder function including TPS import and progress callback behaviour.
Remove MIM export and anonymisation sections (images 07-10) and add new Anonymisation and Folder Sorting section pointing to the LEARN Pipeline GUI and GUI Walkthrough document.
# Conflicts:
#	README.md
Pull request overview
Adds the learn-crawler / learn_upload toolchain to automate the LEARN data transfer pipeline (Elekta XVI exports → structured LEARN output), including anonymisation, folder sorting, PII verification, and supporting CLI utilities, with tests/docs and CI packaging workflows.
Changes:
- Introduces `learn_upload` Python package modules for DICOM anonymisation, folder mapping/copying, and PII verification (+ shared parsing utilities).
- Adds standalone CLI scripts for DICOM inspection and Elekta/XVI-specific crawling/extraction tasks.
- Adds pytest coverage, SOP/format documentation, GUI walkthrough assets, and GitHub Actions workflows (CI, Bandit, Windows release build).
Reviewed changes
Copilot reviewed 32 out of 55 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| tests/test_verify_pii.py | Unit tests for PII verification behavior across filenames/DICOM/XML/text. |
| tests/test_utils.py | Unit tests for INI/XML parsing and ScanUID datetime extraction helpers. |
| tests/test_anonymise_dicom.py | Extensive tests for anonymisation logic across file types and pipeline steps. |
| tests/__init__.py | Marks tests as a package (empty). |
| scripts/read_dicom_tags.py | CLI + importable helper to dump DICOM tags to stdout/file. |
| scripts/extract_elekta_rps_matrices.py | CLI utility to extract Elekta RPS embedded ZIP/INI matrices and alignment info. |
| scripts/elektafdt_crawler.py | Utility to scan patient exports and extract plan names into CSV. |
| scripts/dir_tree.py | CLI utility to generate a truncated directory tree audit report. |
| requirements-dev.txt | Dev dependency list for tests/linting. |
| pyproject.toml | Package metadata, dependencies, optional extras, and pytest/ruff config. |
| learn_upload/verify_pii.py | Core PII scanning implementation for output verification. |
| learn_upload/utils.py | Shared parsers for INI/XML/ScanUID and RPS ZIP-INI extraction. |
| learn_upload/folder_sort.py | Session discovery, classification, fraction assignment, LEARN tree creation, and file copy logic. |
| learn_upload/config.py | Central constants: paths, DICOM tag policies, logging helpers, RPS tag constant. |
| learn_upload/anonymise_dicom.py | DICOM + ancillary file anonymisation utilities and output-folder anonymisation pass. |
| learn_upload/main.py | python -m learn_upload entrypoint selecting GUI implementation. |
| learn_upload/__init__.py | Package overview/rationale; version definition. |
| learn_upload.spec | PyInstaller spec for building the Windows GUI executable. |
| examples/run_patient_example.py | End-to-end example pipeline invocation. |
| cbct-shifts/report_patient_details.py | Generates markdown CBCT shift report from RPS data. |
| cbct-shifts/compare_rps_mosaiq.py | Compares Mosaiq shift logs to RPS clipbox/couch-shift data to validate mapping. |
| README.md | Adds full project README for learn-crawler usage, developer setup, and pipeline documentation. |
| Docs/images/gui-walkthrough/09.png | GUI walkthrough screenshot asset. |
| Docs/images/gui-walkthrough/08.png | GUI walkthrough screenshot asset. |
| Docs/images/gui-walkthrough/06.png | GUI walkthrough screenshot asset. |
| Docs/images/gui-walkthrough/02.png | GUI walkthrough screenshot asset. |
| Docs/images/GC_Elekta_Patient_Upload_Process/04_mosaiq_images_icon.png | SOP image asset for Mosaiq toolbar icon. |
| Docs/elekta_xvi_sro_experimental_validation.md | Experimental validation notes document. |
| Docs/elekta_rps_format_documentation.md | RPS private-tag/ZIP/INI format documentation + Mosaiq mapping summary. |
| Docs/LEARN_Upload_Automation_Plan.md | Automation plan and design notes for pipeline modules/workflow. |
| Docs/GUI_Walkthrough.md | Step-by-step GUI walkthrough document. |
| Docs/GC_Elekta_Patient_Upload_Process.md | Updated SOP referencing the GUI pipeline and transfer workflow. |
| Docs/Elekta_XVI_Reconstruction_Directory_Analysis.md | Reconstruction directory/file format analysis reference. |
| .gitignore | Adds strong ignores for patient data, imaging files, outputs, and common Python artifacts. |
| .github/workflows/release.yml | Windows build + PyInstaller release workflow for tagged versions. |
| .github/workflows/ci.yml | CI workflow running ruff + pytest on PRs and master. |
| .github/workflows/bandit.yml | Bandit security scan workflow (SARIF upload). |
In the 'Analysis code' folder, the codes are used extract different data properties. The SPARK Triangulation code is used to perform kV/MV triangulation for post-treatment verification of KIM tracking accuracy.

In Data anonymisation folder, the codes were written to anonymize the SPARK data follwoing TCIA recommendation.
Typo in README: "follwoing" → "following".
[build-system]
requires = ["setuptools>=64"]
build-backend = "setuptools.backends._legacy:_Backend"
build-backend = "setuptools.backends._legacy:_Backend" opts into the legacy setuptools backend. Unless there's a specific need for legacy behavior, using the standard PEP 517 backend (setuptools.build_meta) is more common and tends to behave better with modern tooling. Consider switching to build-backend = "setuptools.build_meta" to avoid packaging surprises.
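A sketch of the suggested fix — declaring the standard PEP 517 backend rather than the legacy one:

```toml
[build-system]
requires = ["setuptools>=64"]
build-backend = "setuptools.build_meta"
```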
match = _SCAN_DATETIME_PATTERN.search(scan_uid)
if not match:
    logger.warning("Could not parse datetime from ScanUID: %s", scan_uid)
    return None
parse_scan_datetime logs a warning whenever the ScanUID pattern isn't found. In normal operation you may encounter empty/missing ScanUIDs (and tests cover returning None), so this can create a lot of warning noise. Consider lowering this to logger.debug(...) (or only warning when the input looks like a ScanUID but fails to parse).
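A minimal sketch of the suggested change. The regex here is a hypothetical stand-in for the module's `_SCAN_DATETIME_PATTERN` (the real pattern is not shown in this hunk); the point is only the lowered log level:

```python
import logging
import re
from datetime import datetime

logger = logging.getLogger(__name__)

# Hypothetical stand-in for the module's ScanUID pattern (YYYYMMDDHHMMSS run).
_SCAN_DATETIME_PATTERN = re.compile(r"(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})")


def parse_scan_datetime(scan_uid):
    """Return a datetime parsed from a ScanUID, or None if absent/unparseable."""
    match = _SCAN_DATETIME_PATTERN.search(scan_uid or "")
    if not match:
        # debug, not warning: empty/missing ScanUIDs are routine in practice
        logger.debug("Could not parse datetime from ScanUID: %r", scan_uid)
        return None
    return datetime(*map(int, match.groups()))
```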
try:
    ds = pydicom.dcmread(path, force=True)
except Exception:
    logger.warning("Could not read DICOM file: %s", path)
    return findings

for elem in ds.iterall():
    if elem.VR not in _STRING_VRS:
        continue
    value_str = str(elem.value).lower()
    for pii, original in zip(pii_lower, pii_originals):
        if pii in value_str:
            tag_name = elem.keyword or str(elem.tag)
_check_dicom uses pydicom.dcmread(..., force=True) and then iterates all elements, which can be significantly slower than necessary and may parse non-DICOM/binary files as datasets. For performance and reliability, consider stop_before_pixels=True and avoiding force=True (or at least only using force=True behind a flag) since you only need metadata text tags for PII scanning.
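A sketch of the suggested reading strategy. The `reader` parameter is an addition of this sketch (injectable purely so the example runs without DICOM files on disk); `_STRING_VRS` here is an abbreviated stand-in for the module's set:

```python
import logging

logger = logging.getLogger(__name__)

try:
    import pydicom
except ImportError:  # keep the sketch importable without pydicom installed
    pydicom = None

# Abbreviated stand-in for the module's string-VR set.
_STRING_VRS = {"LO", "SH", "PN", "LT", "ST", "UT", "CS", "AE"}


def check_dicom_for_pii(path, pii_terms, reader=None):
    """Scan string-VR elements of one DICOM file for PII substrings.

    Default read uses stop_before_pixels=True and *no* force=True, so pixel
    data is skipped and non-DICOM files raise instead of being mis-parsed.
    """
    if reader is None:
        reader = lambda p: pydicom.dcmread(p, stop_before_pixels=True)
    findings = []
    try:
        ds = reader(path)
    except Exception:
        logger.warning("Could not read DICOM file: %s", path)
        return findings
    for elem in ds.iterall():
        if elem.VR not in _STRING_VRS:
            continue
        value_str = str(elem.value).lower()
        for term in pii_terms:
            if term.lower() in value_str:
                findings.append((elem.keyword or str(elem.tag), term))
    return findings
```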
for tag in DICOM_TAGS_REPLACE:
    if tag == (0x0010, 0x0010):  # PatientName
        dcm[tag].value = PersonName(f"{self.anon_id}^{self.site_name}")
    else:
        dcm[tag].value = self.anon_id
In anonymise_file, the loop over DICOM_TAGS_REPLACE assumes every tag is present and does dcm[tag].value = .... If a file is missing one of these tags (notably StudyID is optional in many datasets), this will raise KeyError and abort anonymisation. Consider guarding with if tag in dcm (and deciding whether to skip or create the element) to make anonymisation robust across real-world exports.
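A sketch of the suggested guard, using a plain dict of elements as a stand-in for a pydicom `Dataset` (the real code wraps the name in `PersonName`); the tag list below is a hypothetical subset of the module's `DICOM_TAGS_REPLACE`:

```python
# Hypothetical subset of DICOM_TAGS_REPLACE:
# PatientName, PatientID, and the frequently-absent StudyID.
DICOM_TAGS_REPLACE = [(0x0010, 0x0010), (0x0010, 0x0020), (0x0020, 0x0010)]


def replace_identifying_tags(dcm, anon_id, site_name):
    """Overwrite identifying tags in-place, skipping any that are absent."""
    for tag in DICOM_TAGS_REPLACE:
        if tag not in dcm:  # guard: e.g. StudyID is optional in many datasets
            continue
        if tag == (0x0010, 0x0010):  # PatientName
            dcm[tag].value = f"{anon_id}^{site_name}"  # real code uses PersonName
        else:
            dcm[tag].value = anon_id
```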
# pydicom VR types that contain human-readable strings worth checking.
_STRING_VRS = {
    "LO", "SH", "PN", "LT", "ST", "UT", "DA", "DS", "IS", "CS",
    "AE", "AS", "DT", "TM", "UC", "UI",
}
_STRING_VRS includes the "UI" VR, which will scan DICOM UIDs for PII substrings. Since UIDs are preserved by design (and are long digit strings), this can create false-positive PII findings (e.g., random MRN-like substrings appearing inside UIDs). Consider removing "UI" from the scanned VR set (and, if needed, explicitly scan only a curated list of human-entered string tags).
# --- Human-readable summary ---
print(f"\nPII Verification: scanned {files_scanned} files in {directory}")
if findings:
    print(f"FAIL — {len(findings)} PII finding(s):")
    for f in findings:
        print(f"  {f['file']} [{f['location']}] matched '{f['matched']}'")
else:
    print("PASS — no residual PII detected")
verify_no_pii always prints a PASS/FAIL summary to stdout. This function is also used from the GUI (and potentially other Python callers), so unconditional printing can produce noisy/duplicated output or interfere with UI logging. Consider switching these to logger.info(...) or adding a verbose/print_summary flag (defaulting to False for library use, True for CLI).
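A sketch of the suggested flag. The function name and signature here are hypothetical; the idea is that library/GUI callers get the summary via logging and the return value, while only a CLI wrapper passes `print_summary=True`:

```python
import logging

logger = logging.getLogger(__name__)


def report_pii_summary(findings, files_scanned, directory, print_summary=False):
    """Build the PASS/FAIL summary line; print only when explicitly asked."""
    if findings:
        status = f"FAIL — {len(findings)} PII finding(s)"
    else:
        status = "PASS — no residual PII detected"
    message = f"PII Verification: scanned {files_scanned} files in {directory}: {status}"
    if print_summary:
        print(message)      # CLI path
    else:
        logger.info(message)  # library/GUI path
    return message
```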
if recon_dir.is_dir():
    ini_files = sorted(recon_dir.glob("*.INI"))
    if ini_files:
        session.ini_path = ini_files[0]
        ini_text = ini_files[0].read_text(encoding="utf-8", errors="ignore")
        ini_data = parse_xvi_ini(ini_text)
        scan_uid = ini_data.get("ScanUID")
        if scan_uid:
            session.scan_datetime = parse_scan_datetime(scan_uid)
_enrich_cbct_session only looks for Reconstruction/*.INI to extract ScanUID, but XVI exports can also contain only .INI.XVI (or lowercase extensions). If the .INI file is absent, sessions will remain undated and fraction assignment will degrade. Consider expanding the glob to include *.INI.XVI (and possibly case-insensitive matching) and/or scanning all INI-like files in Reconstruction/ until a ScanUID is found.
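A sketch of the broadened, case-insensitive discovery. The helper name is hypothetical; it matches `*.INI`, `*.ini`, and `*.INI.XVI` variants so sessions without a plain `.INI` can still yield a ScanUID:

```python
from pathlib import Path


def find_ini_candidates(recon_dir):
    """Return INI-like files in a Reconstruction dir, sorted, any case.

    Accepts both plain .INI files and .INI.XVI variants, since XVI exports
    are inconsistent about which they contain.
    """
    suffixes = (".ini", ".ini.xvi")
    return sorted(
        p for p in Path(recon_dir).iterdir()
        if p.is_file() and p.name.lower().endswith(suffixes)
    )
```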
dict[str, list[CBCTSession]]
    ``{"FX0": [...], "FX1": [...], ...}`` sorted chronologically.
"""
# Sort all sessions by datetime
sorted_sessions = sorted(
    sessions,
    key=lambda s: s.scan_datetime or datetime.min,
)

# Group by date
date_groups: dict[str, list[CBCTSession]] = {}
for s in sorted_sessions:
    if s.scan_datetime is None:
        date_key = "unknown"
    else:
        date_key = s.scan_datetime.strftime("%Y-%m-%d")

    if date_key not in date_groups:
        date_groups[date_key] = []
    date_groups[date_key].append(s)

# Assign fraction labels in chronological order
fraction_map: dict[str, list[CBCTSession]] = {}
for fx_idx, date_key in enumerate(sorted(date_groups.keys()), start=1):
    fx_label = f"FX{fx_idx}"
    fraction_map[fx_label] = date_groups[date_key]
The docstring for assign_fractions states it returns keys like {"FX0": ..., "FX1": ...}, but the implementation starts enumeration at 1 and produces FX1, FX2, ... (and tests assert this behavior). Please update the docstring to match the actual output to avoid confusing downstream users.
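A condensed sketch of the function with the docstring corrected to describe the actual FX1-based numbering (sessions are assumed to expose a `scan_datetime` attribute):

```python
from datetime import datetime


def assign_fractions(sessions):
    """Group dated sessions into fractions by calendar date.

    Returns
    -------
    dict
        ``{"FX1": [...], "FX2": [...], ...}`` in chronological order —
        numbering starts at FX1 (not FX0), matching the implementation.
    """
    sorted_sessions = sorted(sessions, key=lambda s: s.scan_datetime or datetime.min)
    date_groups = {}
    for s in sorted_sessions:
        key = ("unknown" if s.scan_datetime is None
               else s.scan_datetime.strftime("%Y-%m-%d"))
        date_groups.setdefault(key, []).append(s)
    return {f"FX{i}": date_groups[k]
            for i, k in enumerate(sorted(date_groups), start=1)}
```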
def read_dicom_tags(filepath: str, output: str = None, show_private: bool = False) -> None:
    """Print every DICOM tag in the file, optionally writing to an output file."""
    try:
        ds = dcmread(filepath)
    except (InvalidDicomError, FileNotFoundError, PermissionError) as e:
        print(f"Error reading {filepath}: {e}", file=sys.stderr)
        sys.exit(1)
read_dicom_tags is documented as usable from Python, but on error it calls sys.exit(1) from inside the library function. This makes it hard to reuse programmatically (it terminates the whole process). Consider raising an exception (or returning a non-zero status) from read_dicom_tags, and only calling sys.exit in main().
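A sketch of the suggested split. The `reader` parameter stands in for `pydicom.dcmread` so the example is self-contained, and the real function would also honour its `output`/`show_private` options; the key change is raising from the library function and exiting only in `main()`:

```python
import sys


def read_dicom_tags(filepath, reader=None):
    """Return a text dump of every DICOM tag; raise on read failure."""
    if reader is None:
        from pydicom import dcmread  # real dependency in the repo
        reader = dcmread
    try:
        ds = reader(filepath)
    except Exception as exc:
        # Raise instead of sys.exit(1) so Python callers can handle failures.
        raise RuntimeError(f"Error reading {filepath}: {exc}") from exc
    return str(ds)


def main(argv):
    """CLI wrapper: process termination belongs only here."""
    try:
        print(read_dicom_tags(argv[1]))
    except RuntimeError as exc:
        print(exc, file=sys.stderr)
        sys.exit(1)
```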
Summary
- `learn-crawler` toolset for automating the LEARN data transfer pipeline (Elekta XVI CBCT patient data from GenesisCare to USYD RDS).
- `.exe` distribution

Contents
- `learn_upload/` — Core Python package (anonymise, folder sort, PII check, GUI)
- `scripts/` — Standalone CLI tools (dir_tree, DICOM reader, XVI crawler)
- `tests/` — pytest suite
- `Docs/` — SOP documentation and format specs
- `examples/` — Pipeline usage examples