Conversation
- Add extract_dicom_plans.py for treatment plan extraction
- Add CLAUDE.md with repository documentation
- Add comprehensive XVI analysis documentation
- Add DICOM conformance statement
- Add .gitignore with patient data protection
- Configure Claude Code settings

🔒 Patient confidential data is excluded from repository
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
…ed by claude unverified, extract_elekta_rps_matrices.py also untested
- Add site_name to DicomAnonymiser; PatientName now AnonID^SiteName - Add anonymise_all_dcm() for recursive DICOM discovery regardless of directory layout - Anonymise filenames: replace parenthesised patient name with anon_id - Add source_base param to anonymise_file for arbitrary source dirs - Add classify_dicom_files() to sort .dcm by Modality tag (CT/RTPLAN/RTSTRUCT/RTDOSE) - Expand copy_anonymised_plans to handle Structure Set and Dose categories - Update execute() with anon_struct_dir and anon_dose_dir params
Replace real patient identifiers (MRN, name) used as test data with generic placeholders across source, tests, and scripts. Delete untracked files containing raw DICOM tag dumps.
- Swap step order: Config → Preview → Folder Sort → Anonymise → PII → Report - Fix segfault: QtLogHandler now uses pyqtSignal bridge for thread-safe logging - Add QScrollArea to ConfigPage for scrollable form - Copy anonymised plans to LEARN structure after anonymisation step - Enrich all session types (including MotionView) for datetime extraction - Improve undated session matching: use treatment_id + directory proximity
This workflow runs Bandit, a security linter for Python, on code pushes and pull requests to the master branch, as well as on a scheduled basis.
- Fraction numbering now starts at FX1 instead of FX0 - Fix _on_continue skipping worker launch for next step - PII check auto-detects MRN from source path - Show amber SKIPPED banner when no PII strings configured
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Move extract_elekta_rps_matrices.py, read_dicom_tags.py, and elektafdt_crawler.py into scripts/. Move run_patient_example.py into examples/. Delete empty Support/ directory and stale root __pycache__. Fix sys.path import in cbct-shifts/compare_rps_mosaiq.py to reference the new scripts/ location. Update path references in config.py, utils.py, Docs, and the example docstring. Rewrite README.md with full repo layout, GUI usage, and Python API docs.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Cover anonymise_ini_file, anonymise_centroid_file, anonymise_trajectory_log, and the end-to-end anonymise_output_folder function including TPS import and progress callback behaviour.
Remove MIM export and anonymisation sections (images 07-10) and add new Anonymisation and Folder Sorting section pointing to the LEARN Pipeline GUI and GUI Walkthrough document.
# Conflicts:
#	README.md
Pull request overview
Adds the learn-crawler / learn_upload toolchain to automate the LEARN data transfer pipeline (Elekta XVI exports → structured LEARN output), including anonymisation, folder sorting, PII verification, and supporting CLI utilities, with tests/docs and CI packaging workflows.
Changes:
- Introduces `learn_upload` Python package modules for DICOM anonymisation, folder mapping/copying, and PII verification (+ shared parsing utilities).
- Adds standalone CLI scripts for DICOM inspection and Elekta/XVI-specific crawling/extraction tasks.
- Adds pytest coverage, SOP/format documentation, GUI walkthrough assets, and GitHub Actions workflows (CI, Bandit, Windows release build).
Reviewed changes
Copilot reviewed 32 out of 55 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| tests/test_verify_pii.py | Unit tests for PII verification behavior across filenames/DICOM/XML/text. |
| tests/test_utils.py | Unit tests for INI/XML parsing and ScanUID datetime extraction helpers. |
| tests/test_anonymise_dicom.py | Extensive tests for anonymisation logic across file types and pipeline steps. |
| tests/__init__.py | Marks tests as a package (empty). |
| scripts/read_dicom_tags.py | CLI + importable helper to dump DICOM tags to stdout/file. |
| scripts/extract_elekta_rps_matrices.py | CLI utility to extract Elekta RPS embedded ZIP/INI matrices and alignment info. |
| scripts/elektafdt_crawler.py | Utility to scan patient exports and extract plan names into CSV. |
| scripts/dir_tree.py | CLI utility to generate a truncated directory tree audit report. |
| requirements-dev.txt | Dev dependency list for tests/linting. |
| pyproject.toml | Package metadata, dependencies, optional extras, and pytest/ruff config. |
| learn_upload/verify_pii.py | Core PII scanning implementation for output verification. |
| learn_upload/utils.py | Shared parsers for INI/XML/ScanUID and RPS ZIP-INI extraction. |
| learn_upload/folder_sort.py | Session discovery, classification, fraction assignment, LEARN tree creation, and file copy logic. |
| learn_upload/config.py | Central constants: paths, DICOM tag policies, logging helpers, RPS tag constant. |
| learn_upload/anonymise_dicom.py | DICOM + ancillary file anonymisation utilities and output-folder anonymisation pass. |
| learn_upload/main.py | python -m learn_upload entrypoint selecting GUI implementation. |
| learn_upload/__init__.py | Package overview/rationale; version definition. |
| learn_upload.spec | PyInstaller spec for building the Windows GUI executable. |
| examples/run_patient_example.py | End-to-end example pipeline invocation. |
| cbct-shifts/report_patient_details.py | Generates markdown CBCT shift report from RPS data. |
| cbct-shifts/compare_rps_mosaiq.py | Compares Mosaiq shift logs to RPS clipbox/couch-shift data to validate mapping. |
| README.md | Adds full project README for learn-crawler usage, developer setup, and pipeline documentation. |
| Docs/images/gui-walkthrough/09.png | GUI walkthrough screenshot asset. |
| Docs/images/gui-walkthrough/08.png | GUI walkthrough screenshot asset. |
| Docs/images/gui-walkthrough/06.png | GUI walkthrough screenshot asset. |
| Docs/images/gui-walkthrough/02.png | GUI walkthrough screenshot asset. |
| Docs/images/GC_Elekta_Patient_Upload_Process/04_mosaiq_images_icon.png | SOP image asset for Mosaiq toolbar icon. |
| Docs/elekta_xvi_sro_experimental_validation.md | Experimental validation notes document. |
| Docs/elekta_rps_format_documentation.md | RPS private-tag/ZIP/INI format documentation + Mosaiq mapping summary. |
| Docs/LEARN_Upload_Automation_Plan.md | Automation plan and design notes for pipeline modules/workflow. |
| Docs/GUI_Walkthrough.md | Step-by-step GUI walkthrough document. |
| Docs/GC_Elekta_Patient_Upload_Process.md | Updated SOP referencing the GUI pipeline and transfer workflow. |
| Docs/Elekta_XVI_Reconstruction_Directory_Analysis.md | Reconstruction directory/file format analysis reference. |
| .gitignore | Adds strong ignores for patient data, imaging files, outputs, and common Python artifacts. |
| .github/workflows/release.yml | Windows build + PyInstaller release workflow for tagged versions. |
| .github/workflows/ci.yml | CI workflow running ruff + pytest on PRs and master. |
| .github/workflows/bandit.yml | Bandit security scan workflow (SARIF upload). |
In the 'Analysis code' folder, the codes are used extract different data properties. The SPARK Triangulation code is used to perform kV/MV triangulation for post-treatment verification of KIM tracking accuracy.

In Data anonymisation folder, the codes were written to anonymize the SPARK data follwoing TCIA recommendation.
Typo in README: "follwoing" → "following".
[build-system]
requires = ["setuptools>=64"]
build-backend = "setuptools.backends._legacy:_Backend"
build-backend = "setuptools.backends._legacy:_Backend" opts into the legacy setuptools backend. Unless there's a specific need for legacy behavior, using the standard PEP 517 backend (setuptools.build_meta) is more common and tends to behave better with modern tooling. Consider switching to build-backend = "setuptools.build_meta" to avoid packaging surprises.
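A sketch of the suggested fix — declaring the standard PEP 517 backend rather than the legacy one:

```toml
[build-system]
requires = ["setuptools>=64"]
build-backend = "setuptools.build_meta"
```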
match = _SCAN_DATETIME_PATTERN.search(scan_uid)
if not match:
    logger.warning("Could not parse datetime from ScanUID: %s", scan_uid)
    return None
parse_scan_datetime logs a warning whenever the ScanUID pattern isn't found. In normal operation you may encounter empty/missing ScanUIDs (and tests cover returning None), so this can create a lot of warning noise. Consider lowering this to logger.debug(...) (or only warning when the input looks like a ScanUID but fails to parse).
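A minimal sketch of the suggested change. The regex here is a hypothetical stand-in for the module's `_SCAN_DATETIME_PATTERN` (the real pattern is not shown in this hunk); the point is only the lowered log level:

```python
import logging
import re
from datetime import datetime

logger = logging.getLogger(__name__)

# Hypothetical stand-in for the module's ScanUID pattern (YYYYMMDDHHMMSS run).
_SCAN_DATETIME_PATTERN = re.compile(r"(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})")


def parse_scan_datetime(scan_uid):
    """Return a datetime parsed from a ScanUID, or None if absent/unparseable."""
    match = _SCAN_DATETIME_PATTERN.search(scan_uid or "")
    if not match:
        # debug, not warning: empty/missing ScanUIDs are routine in practice
        logger.debug("Could not parse datetime from ScanUID: %r", scan_uid)
        return None
    return datetime(*map(int, match.groups()))
```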
try:
    ds = pydicom.dcmread(path, force=True)
except Exception:
    logger.warning("Could not read DICOM file: %s", path)
    return findings

for elem in ds.iterall():
    if elem.VR not in _STRING_VRS:
        continue
    value_str = str(elem.value).lower()
    for pii, original in zip(pii_lower, pii_originals):
        if pii in value_str:
            tag_name = elem.keyword or str(elem.tag)
_check_dicom uses pydicom.dcmread(..., force=True) and then iterates all elements, which can be significantly slower than necessary and may parse non-DICOM/binary files as datasets. For performance and reliability, consider stop_before_pixels=True and avoiding force=True (or at least only using force=True behind a flag) since you only need metadata text tags for PII scanning.
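A sketch of the suggested reading strategy. The `reader` parameter is an addition of this sketch (injectable purely so the example runs without DICOM files on disk); `_STRING_VRS` here is an abbreviated stand-in for the module's set:

```python
import logging

logger = logging.getLogger(__name__)

try:
    import pydicom
except ImportError:  # keep the sketch importable without pydicom installed
    pydicom = None

# Abbreviated stand-in for the module's string-VR set.
_STRING_VRS = {"LO", "SH", "PN", "LT", "ST", "UT", "CS", "AE"}


def check_dicom_for_pii(path, pii_terms, reader=None):
    """Scan string-VR elements of one DICOM file for PII substrings.

    Default read uses stop_before_pixels=True and *no* force=True, so pixel
    data is skipped and non-DICOM files raise instead of being mis-parsed.
    """
    if reader is None:
        reader = lambda p: pydicom.dcmread(p, stop_before_pixels=True)
    findings = []
    try:
        ds = reader(path)
    except Exception:
        logger.warning("Could not read DICOM file: %s", path)
        return findings
    for elem in ds.iterall():
        if elem.VR not in _STRING_VRS:
            continue
        value_str = str(elem.value).lower()
        for term in pii_terms:
            if term.lower() in value_str:
                findings.append((elem.keyword or str(elem.tag), term))
    return findings
```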
for tag in DICOM_TAGS_REPLACE:
    if tag == (0x0010, 0x0010):  # PatientName
        dcm[tag].value = PersonName(f"{self.anon_id}^{self.site_name}")
    else:
        dcm[tag].value = self.anon_id
In anonymise_file, the loop over DICOM_TAGS_REPLACE assumes every tag is present and does dcm[tag].value = .... If a file is missing one of these tags (notably StudyID is optional in many datasets), this will raise KeyError and abort anonymisation. Consider guarding with if tag in dcm (and deciding whether to skip or create the element) to make anonymisation robust across real-world exports.
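A sketch of the suggested guard, using a plain dict of elements as a stand-in for a pydicom `Dataset` (the real code wraps the name in `PersonName`); the tag list below is a hypothetical subset of the module's `DICOM_TAGS_REPLACE`:

```python
# Hypothetical subset of DICOM_TAGS_REPLACE:
# PatientName, PatientID, and the frequently-absent StudyID.
DICOM_TAGS_REPLACE = [(0x0010, 0x0010), (0x0010, 0x0020), (0x0020, 0x0010)]


def replace_identifying_tags(dcm, anon_id, site_name):
    """Overwrite identifying tags in-place, skipping any that are absent."""
    for tag in DICOM_TAGS_REPLACE:
        if tag not in dcm:  # guard: e.g. StudyID is optional in many datasets
            continue
        if tag == (0x0010, 0x0010):  # PatientName
            dcm[tag].value = f"{anon_id}^{site_name}"  # real code uses PersonName
        else:
            dcm[tag].value = anon_id
```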
# pydicom VR types that contain human-readable strings worth checking.
_STRING_VRS = {
    "LO", "SH", "PN", "LT", "ST", "UT", "DA", "DS", "IS", "CS",
    "AE", "AS", "DT", "TM", "UC", "UI",
}
_STRING_VRS includes the "UI" VR, which will scan DICOM UIDs for PII substrings. Since UIDs are preserved by design (and are long digit strings), this can create false-positive PII findings (e.g., random MRN-like substrings appearing inside UIDs). Consider removing "UI" from the scanned VR set (and, if needed, explicitly scan only a curated list of human-entered string tags).
# --- Human-readable summary ---
print(f"\nPII Verification: scanned {files_scanned} files in {directory}")
if findings:
    print(f"FAIL — {len(findings)} PII finding(s):")
    for f in findings:
        print(f"  {f['file']} [{f['location']}] matched '{f['matched']}'")
else:
    print("PASS — no residual PII detected")
verify_no_pii always prints a PASS/FAIL summary to stdout. This function is also used from the GUI (and potentially other Python callers), so unconditional printing can produce noisy/duplicated output or interfere with UI logging. Consider switching these to logger.info(...) or adding a verbose/print_summary flag (defaulting to False for library use, True for CLI).
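A sketch of the suggested flag. The function name and signature here are hypothetical; the idea is that library/GUI callers get the summary via logging and the return value, while only a CLI wrapper passes `print_summary=True`:

```python
import logging

logger = logging.getLogger(__name__)


def report_pii_summary(findings, files_scanned, directory, print_summary=False):
    """Build the PASS/FAIL summary line; print only when explicitly asked."""
    if findings:
        status = f"FAIL — {len(findings)} PII finding(s)"
    else:
        status = "PASS — no residual PII detected"
    message = f"PII Verification: scanned {files_scanned} files in {directory}: {status}"
    if print_summary:
        print(message)      # CLI path
    else:
        logger.info(message)  # library/GUI path
    return message
```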
if recon_dir.is_dir():
    ini_files = sorted(recon_dir.glob("*.INI"))
    if ini_files:
        session.ini_path = ini_files[0]
        ini_text = ini_files[0].read_text(encoding="utf-8", errors="ignore")
        ini_data = parse_xvi_ini(ini_text)
        scan_uid = ini_data.get("ScanUID")
        if scan_uid:
            session.scan_datetime = parse_scan_datetime(scan_uid)
_enrich_cbct_session only looks for Reconstruction/*.INI to extract ScanUID, but XVI exports can also contain only .INI.XVI (or lowercase extensions). If the .INI file is absent, sessions will remain undated and fraction assignment will degrade. Consider expanding the glob to include *.INI.XVI (and possibly case-insensitive matching) and/or scanning all INI-like files in Reconstruction/ until a ScanUID is found.
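A sketch of the broadened, case-insensitive discovery. The helper name is hypothetical; it matches `*.INI`, `*.ini`, and `*.INI.XVI` variants so sessions without a plain `.INI` can still yield a ScanUID:

```python
from pathlib import Path


def find_ini_candidates(recon_dir):
    """Return INI-like files in a Reconstruction dir, sorted, any case.

    Accepts both plain .INI files and .INI.XVI variants, since XVI exports
    are inconsistent about which they contain.
    """
    suffixes = (".ini", ".ini.xvi")
    return sorted(
        p for p in Path(recon_dir).iterdir()
        if p.is_file() and p.name.lower().endswith(suffixes)
    )
```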
dict[str, list[CBCTSession]]
    ``{"FX0": [...], "FX1": [...], ...}`` sorted chronologically.
"""
# Sort all sessions by datetime
sorted_sessions = sorted(
    sessions,
    key=lambda s: s.scan_datetime or datetime.min,
)

# Group by date
date_groups: dict[str, list[CBCTSession]] = {}
for s in sorted_sessions:
    if s.scan_datetime is None:
        date_key = "unknown"
    else:
        date_key = s.scan_datetime.strftime("%Y-%m-%d")

    if date_key not in date_groups:
        date_groups[date_key] = []
    date_groups[date_key].append(s)

# Assign fraction labels in chronological order
fraction_map: dict[str, list[CBCTSession]] = {}
for fx_idx, date_key in enumerate(sorted(date_groups.keys()), start=1):
    fx_label = f"FX{fx_idx}"
    fraction_map[fx_label] = date_groups[date_key]
The docstring for assign_fractions states it returns keys like {"FX0": ..., "FX1": ...}, but the implementation starts enumeration at 1 and produces FX1, FX2, ... (and tests assert this behavior). Please update the docstring to match the actual output to avoid confusing downstream users.
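A condensed sketch of the function with the docstring corrected to describe the actual FX1-based numbering (sessions are assumed to expose a `scan_datetime` attribute):

```python
from datetime import datetime


def assign_fractions(sessions):
    """Group dated sessions into fractions by calendar date.

    Returns
    -------
    dict
        ``{"FX1": [...], "FX2": [...], ...}`` in chronological order —
        numbering starts at FX1 (not FX0), matching the implementation.
    """
    sorted_sessions = sorted(sessions, key=lambda s: s.scan_datetime or datetime.min)
    date_groups = {}
    for s in sorted_sessions:
        key = ("unknown" if s.scan_datetime is None
               else s.scan_datetime.strftime("%Y-%m-%d"))
        date_groups.setdefault(key, []).append(s)
    return {f"FX{i}": date_groups[k]
            for i, k in enumerate(sorted(date_groups), start=1)}
```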
def read_dicom_tags(filepath: str, output: str = None, show_private: bool = False) -> None:
    """Print every DICOM tag in the file, optionally writing to an output file."""
    try:
        ds = dcmread(filepath)
    except (InvalidDicomError, FileNotFoundError, PermissionError) as e:
        print(f"Error reading {filepath}: {e}", file=sys.stderr)
        sys.exit(1)
read_dicom_tags is documented as usable from Python, but on error it calls sys.exit(1) from inside the library function. This makes it hard to reuse programmatically (it terminates the whole process). Consider raising an exception (or returning a non-zero status) from read_dicom_tags, and only calling sys.exit in main().
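A sketch of the suggested split. The `reader` parameter stands in for `pydicom.dcmread` so the example is self-contained, and the real function would also honour its `output`/`show_private` options; the key change is raising from the library function and exiting only in `main()`:

```python
import sys


def read_dicom_tags(filepath, reader=None):
    """Return a text dump of every DICOM tag; raise on read failure."""
    if reader is None:
        from pydicom import dcmread  # real dependency in the repo
        reader = dcmread
    try:
        ds = reader(filepath)
    except Exception as exc:
        # Raise instead of sys.exit(1) so Python callers can handle failures.
        raise RuntimeError(f"Error reading {filepath}: {exc}") from exc
    return str(ds)


def main(argv):
    """CLI wrapper: process termination belongs only here."""
    try:
        print(read_dicom_tags(argv[1]))
    except RuntimeError as exc:
        print(exc, file=sys.stderr)
        sys.exit(1)
```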
Summary
- `learn-crawler` toolset for automating the LEARN data transfer pipeline (Elekta XVI CBCT patient data from GenesisCare to USYD RDS).
- `.exe` distribution

Contents
- `learn_upload/` — Core Python package (anonymise, folder sort, PII check, GUI)
- `scripts/` — Standalone CLI tools (dir_tree, DICOM reader, XVI crawler)
- `tests/` — pytest suite
- `Docs/` — SOP documentation and format specs
- `examples/` — Pipeline usage examples