Add Sage upload scripts and documentation for maintainability by wilke0818 · Pull Request #239 · sensein/b2aiprep

wilke0818 · 2026-02-05T02:06:29Z

Adds Sage script for generating manifest file that Sage API uses to create folder structure that matches the local upload folder
Adds Sage script for running upload onto Sage through bulk manifest process
- Uses indices in order to upload the manifest in parallel (choosing the right amount per job is tricky due to Sage slow upload speeds but also limited ability to handle concurrent uploads)
Adds Sage script for running a verification of the current Sage folders compared to local copies
- It can tell whether folder structure is different and whether files are different through the use of an MD5 hash but not how files might differ from each other
README for instructions on uses of different files

gemini-code-assist · 2026-02-05T02:06:47Z

Summary of Changes

Hello @wilke0818, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a suite of Python scripts designed to enhance the maintainability and efficiency of data management with Sage (Synapse). The scripts automate the process of preparing data for upload, executing bulk uploads, and verifying the consistency of data between local storage and the Synapse platform. This significantly streamlines data synchronization and integrity checks for both pediatric and adult datasets.

Highlights

Manifest Generation Script: A new Python script (sage_generate_manifest.py) has been added to create Synapse manifest files from local BIDS folder structures, facilitating the organization of data for upload.
Bulk Upload Script: A new Python script (sage_upload_manifest.py) enables bulk data uploads to Synapse using generated manifest files. It supports partial uploads by specifying start and end indices, which can be used for parallel processing.
Data Verification Script: A new Python script (verify_sage_contents.py) has been introduced to compare local folder structures and files with their counterparts on Synapse. It can detect differences in folder structure and file content (via MD5 hashes) to ensure data integrity.
Git Ignore Updates: The .gitignore file has been updated to exclude generated log files, TSV manifests, and shell scripts from the new Sage upload utilities, keeping the repository clean.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

.gitignore
- Added ignore rules for external_scripts/sage_upload_scripts/logs/, *.tsv, and *.sh to prevent tracking generated files.
external_scripts/sage_upload_scripts/sage_generate_manifest.py
- New file: Implements a script to generate a Synapse manifest TSV file from a local BIDS folder structure.
- Supports specifying a BIDS folder, output manifest file name, and whether it's an adult dataset.
- Authenticates with Synapse using an environment variable SAGE_PAT.
external_scripts/sage_upload_scripts/sage_upload_manifest.py
- New file: Implements a script to upload data to Synapse using a manifest file.
- Allows specifying start and end indices to upload portions of a manifest, enabling parallel uploads.
- Reads the manifest using pandas and utilizes synapseutils.syncToSynapse for the upload.
external_scripts/sage_upload_scripts/verify_sage_contents.py
- New file: Implements a script to verify the consistency between local folders/files and their Synapse counterparts.
- Includes functions (walk_synapse_folder, walk_local_folder) to recursively traverse and collect metadata from both Synapse and local directories.
- Compares folder structures and, optionally, file MD5 hashes to detect discrepancies.
- Supports a dry_run mode to report differences without making actual changes on Synapse.

Activity

No specific pull request activity (comments, reviews, etc.) was provided in the context.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a set of Python scripts for interacting with the Sage Synapse platform, including generating manifests, uploading data, and verifying contents. The changes are a good step towards improving data management maintainability. My review focuses on improving the robustness, clarity, and usability of these new scripts. I've identified several areas for improvement, including fixing potential bugs related to project ID selection, clarifying command-line argument descriptions, and adopting standard logging practices over print statements.

gemini-code-assist · 2026-02-05T02:08:39Z

external_scripts/sage_upload_scripts/sage_generate_manifest.py

+    print(f"Received args: {args}")
+
+    sage_pat = os.getenv("SAGE_PAT")
+    sage_project_id = OVERALL_PROJECT#ADULT_PROJECT if args.adult else PEDS_PROJECT


The logic for selecting the project ID based on the --adult flag is commented out, causing sage_project_id to be hardcoded to OVERALL_PROJECT. This seems to defeat the purpose of the --adult argument and should be corrected.

Suggested change

sage_project_id = OVERALL_PROJECT#ADULT_PROJECT if args.adult else PEDS_PROJECT

sage_project_id = ADULT_PROJECT if args.adult else PEDS_PROJECT

gemini-code-assist · 2026-02-05T02:08:39Z

external_scripts/sage_upload_scripts/sage_upload_manifest.py

+    args = parser.parse_args()
+
+    sage_pat = os.getenv("SAGE_PAT")
+    sage_project_id = OVERALL_PROJECT#ADULT_PROJECT if args.adult else PEDS_PROJECT


This line appears to be a copy-paste error from another script. It references args.adult, which is not an argument in this script. This will cause an AttributeError if the commented-out code is ever used. The logic for determining sage_project_id should be corrected for this script's context.

gemini-code-assist · 2026-02-05T02:08:39Z

external_scripts/sage_upload_scripts/verify_sage_contents.py

+    args = parser.parse_args()
+
+    sage_pat = os.getenv("SAGE_PAT")
+    sage_project_id = OVERALL_PROJECT#ADULT_PROJECT if args.adult else PEDS_PROJECT


This line appears to be a copy-paste error from another script. It references args.adult, which is not an argument in this script. This will cause an AttributeError if the commented-out code is ever used. The logic for determining sage_project_id should be corrected for this script's context.

gemini-code-assist · 2026-02-05T02:08:39Z

external_scripts/sage_upload_scripts/sage_generate_manifest.py

+ADULT_PROJECT = "syn72370534"
+OVERALL_PROJECT = "syn72370534"


The constants ADULT_PROJECT and OVERALL_PROJECT have the same value. If this is not intentional, it should be corrected to avoid confusion and potential bugs.

gemini-code-assist · 2026-02-05T02:08:40Z

external_scripts/sage_upload_scripts/sage_generate_manifest.py

+    parser = argparse.ArgumentParser(description="A simple script for ...")
+    parser.add_argument('--bids_folder', default='./', type=str, help='Example help information')
+    parser.add_argument('--manifest_file', default='temp_manifest_file.tsv',type=str)
+    parser.add_argument('--adult', default=False, action='store_true', help='is adult dataset')


The ArgumentParser description and argument help messages are placeholders. They should be updated to be more descriptive and informative for users of this script.

Suggested change

parser = argparse.ArgumentParser(description="A simple script for ...")

parser.add_argument('--bids_folder', default='./', type=str, help='Example help information')

parser.add_argument('--manifest_file', default='temp_manifest_file.tsv',type=str)

parser.add_argument('--adult', default=False, action='store_true', help='is adult dataset')

parser = argparse.ArgumentParser(description="Generates a manifest file for uploading BIDS data to Sage Synapse.")

parser.add_argument('--bids_folder', default='./', type=str, help='Path to the local BIDS folder to be synced.')

parser.add_argument('--manifest_file', default='temp_manifest_file.tsv',type=str, help='Path to save the generated manifest file.')

parser.add_argument('--adult', default=False, action='store_true', help='Specify if the dataset is for adults. Defaults to pediatric.')

gemini-code-assist · 2026-02-05T02:08:40Z

external_scripts/sage_upload_scripts/sage_upload_manifest.py

+def main():
+    load_dotenv()
+
+    parser = argparse.ArgumentParser(description="A simple script for ...")


The ArgumentParser description is a placeholder. It should be updated to provide a meaningful summary of what the script does.

Suggested change

parser = argparse.ArgumentParser(description="A simple script for ...")

parser = argparse.ArgumentParser(description="Uploads a manifest file to Sage Synapse in chunks.")

gemini-code-assist · 2026-02-05T02:08:40Z

external_scripts/sage_upload_scripts/verify_sage_contents.py

+    """
+    Recursively walk through a Synapse folder structure.
+    Returns dict mapping relative paths to Synapse entity info.
+    """


The docstring is inaccurate. It states that the function returns a dictionary, but it actually returns a tuple of a dictionary and a list: (dict, list). The docstring should be updated to reflect the correct return signature.

Suggested change

"""

Recursively walk through a Synapse folder structure.

Returns dict mapping relative paths to Synapse entity info.

"""

"""

Recursively walk through a Synapse folder structure.

Returns a tuple containing a dictionary of files and a list of folders.

"""

gemini-code-assist · 2026-02-05T02:08:40Z

external_scripts/sage_upload_scripts/verify_sage_contents.py

+            rel_path = os.path.join(path, child_name) if path else child_name
+
+            if child_type == 'org.sagebionetworks.repo.model.Folder':
+                print(f"Scanning folder: {rel_path}")


Consider using the logging module instead of print() for output. The logging module provides more flexibility, such as controlling log levels, formatting, and directing output to different destinations (e.g., files, stderr). This applies to other print statements in this file as well.

gemini-code-assist · 2026-02-05T02:08:40Z

external_scripts/sage_upload_scripts/verify_sage_contents.py

+
+    sage_folder_set = set([f['path'] for f in sage_folders])
+    local_folder_set = set([f['path'] for f in local_folders])
+    if (sage_folder_set - local_folder_set) != set():


The comparison (sage_folder_set - local_folder_set) != set() can be simplified. An empty set evaluates to False in a boolean context, so you can write this more concisely. This applies to similar checks in this file.

Suggested change

if (sage_folder_set - local_folder_set) != set():

if sage_folder_set - local_folder_set:

gemini-code-assist · 2026-02-05T02:08:40Z

external_scripts/sage_upload_scripts/verify_sage_contents.py

+    parser.add_argument('--bids_folder', default='./', type=str, help='Example help information')
+    parser.add_argument('--adult', default=False, action='store_true', help='is adult dataset')
+    parser.add_argument('--get_md5', default=False, action='store_true')
+    parser.add_argument('--dry_run', default=True, action='store_false')


The --dry_run argument is configured with action='store_false', which is counter-intuitive. When a user provides the --dry_run flag, it sets args.dry_run to False, effectively disabling the dry run. A more conventional approach would be to have a flag that enables the non-default behavior, for example, an --execute flag with action='store_true'.

wilke0818 · 2026-02-05T02:09:20Z

I wonder if it is perhaps smarter to write a script that uploads folders one at a time. This could possibly help with certain Sage errors I ran into while uploading while still parallelizing to deal with the fact that I get < 1 Mb/s during uploads.

Initial sage upload scripts

c4c1d2f

gemini-code-assist bot reviewed Feb 5, 2026

View reviewed changes

wilke0818 added 2 commits February 4, 2026 22:27

Upaded sage upload files and added a readme

52fda0b

Slight readme update

5f7b266

	sage_project_id = OVERALL_PROJECT#ADULT_PROJECT if args.adult else PEDS_PROJECT
	sage_project_id = ADULT_PROJECT if args.adult else PEDS_PROJECT

		ADULT_PROJECT = "syn72370534"
		OVERALL_PROJECT = "syn72370534"

	parser = argparse.ArgumentParser(description="A simple script for ...")
	parser = argparse.ArgumentParser(description="Uploads a manifest file to Sage Synapse in chunks.")

	if (sage_folder_set - local_folder_set) != set():
	if sage_folder_set - local_folder_set:

Conversation

wilke0818 commented Feb 5, 2026

Uh oh!

gemini-code-assist bot commented Feb 5, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

wilke0818 commented Feb 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant