Add Sage upload scripts and documentation for maintainability#239
Add Sage upload scripts and documentation for maintainability#239
Conversation
wilke0818
commented
Feb 5, 2026
- Adds Sage script for generating manifest file that Sage API uses to create folder structure that matches the local upload folder
- Adds Sage script for running upload onto Sage through bulk manifest process
- Uses indices in order to upload the manifest in parallel (choosing the right amount per job is tricky due to Sage slow upload speeds but also limited ability to handle concurrent uploads)
- Adds Sage script for running a verification of the current Sage folders compared to local copies
- It can tell whether folder structure is different and whether files are different through the use of an MD5 hash but not how files might differ from each other
- README for instructions on uses of different files
Summary of ChangesHello @wilke0818, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a suite of Python scripts designed to enhance the maintainability and efficiency of data management with Sage (Synapse). The scripts automate the process of preparing data for upload, executing bulk uploads, and verifying the consistency of data between local storage and the Synapse platform. This significantly streamlines data synchronization and integrity checks for both pediatric and adult datasets. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a set of Python scripts for interacting with the Sage Synapse platform, including generating manifests, uploading data, and verifying contents. The changes are a good step towards improving data management maintainability. My review focuses on improving the robustness, clarity, and usability of these new scripts. I've identified several areas for improvement, including fixing potential bugs related to project ID selection, clarifying command-line argument descriptions, and adopting standard logging practices over print statements.
| print(f"Received args: {args}") | ||
|
|
||
| sage_pat = os.getenv("SAGE_PAT") | ||
| sage_project_id = OVERALL_PROJECT#ADULT_PROJECT if args.adult else PEDS_PROJECT |
There was a problem hiding this comment.
The logic for selecting the project ID based on the --adult flag is commented out, causing sage_project_id to be hardcoded to OVERALL_PROJECT. This seems to defeat the purpose of the --adult argument and should be corrected.
| sage_project_id = OVERALL_PROJECT#ADULT_PROJECT if args.adult else PEDS_PROJECT | |
| sage_project_id = ADULT_PROJECT if args.adult else PEDS_PROJECT |
| args = parser.parse_args() | ||
|
|
||
| sage_pat = os.getenv("SAGE_PAT") | ||
| sage_project_id = OVERALL_PROJECT#ADULT_PROJECT if args.adult else PEDS_PROJECT |
There was a problem hiding this comment.
| args = parser.parse_args() | ||
|
|
||
| sage_pat = os.getenv("SAGE_PAT") | ||
| sage_project_id = OVERALL_PROJECT#ADULT_PROJECT if args.adult else PEDS_PROJECT |
There was a problem hiding this comment.
| ADULT_PROJECT = "syn72370534" | ||
| OVERALL_PROJECT = "syn72370534" |
| parser = argparse.ArgumentParser(description="A simple script for ...") | ||
| parser.add_argument('--bids_folder', default='./', type=str, help='Example help information') | ||
| parser.add_argument('--manifest_file', default='temp_manifest_file.tsv',type=str) | ||
| parser.add_argument('--adult', default=False, action='store_true', help='is adult dataset') |
There was a problem hiding this comment.
The ArgumentParser description and argument help messages are placeholders. They should be updated to be more descriptive and informative for users of this script.
| parser = argparse.ArgumentParser(description="A simple script for ...") | |
| parser.add_argument('--bids_folder', default='./', type=str, help='Example help information') | |
| parser.add_argument('--manifest_file', default='temp_manifest_file.tsv',type=str) | |
| parser.add_argument('--adult', default=False, action='store_true', help='is adult dataset') | |
| parser = argparse.ArgumentParser(description="Generates a manifest file for uploading BIDS data to Sage Synapse.") | |
| parser.add_argument('--bids_folder', default='./', type=str, help='Path to the local BIDS folder to be synced.') | |
| parser.add_argument('--manifest_file', default='temp_manifest_file.tsv',type=str, help='Path to save the generated manifest file.') | |
| parser.add_argument('--adult', default=False, action='store_true', help='Specify if the dataset is for adults. Defaults to pediatric.') |
| def main(): | ||
| load_dotenv() | ||
|
|
||
| parser = argparse.ArgumentParser(description="A simple script for ...") |
There was a problem hiding this comment.
The ArgumentParser description is a placeholder. It should be updated to provide a meaningful summary of what the script does.
| parser = argparse.ArgumentParser(description="A simple script for ...") | |
| parser = argparse.ArgumentParser(description="Uploads a manifest file to Sage Synapse in chunks.") |
| """ | ||
| Recursively walk through a Synapse folder structure. | ||
| Returns dict mapping relative paths to Synapse entity info. | ||
| """ |
There was a problem hiding this comment.
The docstring is inaccurate. It states that the function returns a dictionary, but it actually returns a tuple of a dictionary and a list: (dict, list). The docstring should be updated to reflect the correct return signature.
| """ | |
| Recursively walk through a Synapse folder structure. | |
| Returns dict mapping relative paths to Synapse entity info. | |
| """ | |
| """ | |
| Recursively walk through a Synapse folder structure. | |
| Returns a tuple containing a dictionary of files and a list of folders. | |
| """ |
| rel_path = os.path.join(path, child_name) if path else child_name | ||
|
|
||
| if child_type == 'org.sagebionetworks.repo.model.Folder': | ||
| print(f"Scanning folder: {rel_path}") |
There was a problem hiding this comment.
|
|
||
| sage_folder_set = set([f['path'] for f in sage_folders]) | ||
| local_folder_set = set([f['path'] for f in local_folders]) | ||
| if (sage_folder_set - local_folder_set) != set(): |
There was a problem hiding this comment.
The comparison (sage_folder_set - local_folder_set) != set() can be simplified. An empty set evaluates to False in a boolean context, so you can write this more concisely. This applies to similar checks in this file.
| if (sage_folder_set - local_folder_set) != set(): | |
| if sage_folder_set - local_folder_set: |
| parser.add_argument('--bids_folder', default='./', type=str, help='Example help information') | ||
| parser.add_argument('--adult', default=False, action='store_true', help='is adult dataset') | ||
| parser.add_argument('--get_md5', default=False, action='store_true') | ||
| parser.add_argument('--dry_run', default=True, action='store_false') |
There was a problem hiding this comment.
The --dry_run argument is configured with action='store_false', which is counter-intuitive. When a user provides the --dry_run flag, it sets args.dry_run to False, effectively disabling the dry run. A more conventional approach would be to have a flag that enables the non-default behavior, for example, an --execute flag with action='store_true'.
|
I wonder if it is perhaps smarter to write a script that uploads folders one at a time. This could possibly help with certain Sage errors I ran into while uploading while still parallelizing to deal with the fact that I get < 1 Mb/s during uploads. |