Skip to content

Add Sage upload scripts and documentation for maintainability#239

Draft
wilke0818 wants to merge 3 commits intomainfrom
wilke18/sage_scripts
Draft

Add Sage upload scripts and documentation for maintainability#239
wilke0818 wants to merge 3 commits intomainfrom
wilke18/sage_scripts

Conversation

@wilke0818
Copy link
Copy Markdown
Contributor

  • Adds Sage script for generating manifest file that Sage API uses to create folder structure that matches the local upload folder
  • Adds Sage script for running upload onto Sage through bulk manifest process
    • Uses indices in order to upload the manifest in parallel (choosing the right amount per job is tricky due to Sage slow upload speeds but also limited ability to handle concurrent uploads)
  • Adds Sage script for running a verification of the current Sage folders compared to local copies
    • It can tell whether folder structure is different and whether files are different through the use of an MD5 hash but not how files might differ from each other
  • README for instructions on uses of different files

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello @wilke0818, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a suite of Python scripts designed to enhance the maintainability and efficiency of data management with Sage (Synapse). The scripts automate the process of preparing data for upload, executing bulk uploads, and verifying the consistency of data between local storage and the Synapse platform. This significantly streamlines data synchronization and integrity checks for both pediatric and adult datasets.

Highlights

  • Manifest Generation Script: A new Python script (sage_generate_manifest.py) has been added to create Synapse manifest files from local BIDS folder structures, facilitating the organization of data for upload.
  • Bulk Upload Script: A new Python script (sage_upload_manifest.py) enables bulk data uploads to Synapse using generated manifest files. It supports partial uploads by specifying start and end indices, which can be used for parallel processing.
  • Data Verification Script: A new Python script (verify_sage_contents.py) has been introduced to compare local folder structures and files with their counterparts on Synapse. It can detect differences in folder structure and file content (via MD5 hashes) to ensure data integrity.
  • Git Ignore Updates: The .gitignore file has been updated to exclude generated log files, TSV manifests, and shell scripts from the new Sage upload utilities, keeping the repository clean.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • .gitignore
    • Added ignore rules for external_scripts/sage_upload_scripts/logs/, *.tsv, and *.sh to prevent tracking generated files.
  • external_scripts/sage_upload_scripts/sage_generate_manifest.py
    • New file: Implements a script to generate a Synapse manifest TSV file from a local BIDS folder structure.
    • Supports specifying a BIDS folder, output manifest file name, and whether it's an adult dataset.
    • Authenticates with Synapse using an environment variable SAGE_PAT.
  • external_scripts/sage_upload_scripts/sage_upload_manifest.py
    • New file: Implements a script to upload data to Synapse using a manifest file.
    • Allows specifying start and end indices to upload portions of a manifest, enabling parallel uploads.
    • Reads the manifest using pandas and utilizes synapseutils.syncToSynapse for the upload.
  • external_scripts/sage_upload_scripts/verify_sage_contents.py
    • New file: Implements a script to verify the consistency between local folders/files and their Synapse counterparts.
    • Includes functions (walk_synapse_folder, walk_local_folder) to recursively traverse and collect metadata from both Synapse and local directories.
    • Compares folder structures and, optionally, file MD5 hashes to detect discrepancies.
    • Supports a dry_run mode to report differences without making actual changes on Synapse.
Activity
  • No specific pull request activity (comments, reviews, etc.) was provided in the context.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a set of Python scripts for interacting with the Sage Synapse platform, including generating manifests, uploading data, and verifying contents. The changes are a good step towards improving data management maintainability. My review focuses on improving the robustness, clarity, and usability of these new scripts. I've identified several areas for improvement, including fixing potential bugs related to project ID selection, clarifying command-line argument descriptions, and adopting standard logging practices over print statements.

print(f"Received args: {args}")

sage_pat = os.getenv("SAGE_PAT")
sage_project_id = OVERALL_PROJECT#ADULT_PROJECT if args.adult else PEDS_PROJECT
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The logic for selecting the project ID based on the --adult flag is commented out, causing sage_project_id to be hardcoded to OVERALL_PROJECT. This seems to defeat the purpose of the --adult argument and should be corrected.

Suggested change
sage_project_id = OVERALL_PROJECT#ADULT_PROJECT if args.adult else PEDS_PROJECT
sage_project_id = ADULT_PROJECT if args.adult else PEDS_PROJECT

args = parser.parse_args()

sage_pat = os.getenv("SAGE_PAT")
sage_project_id = OVERALL_PROJECT#ADULT_PROJECT if args.adult else PEDS_PROJECT
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This line appears to be a copy-paste error from another script. It references args.adult, which is not an argument in this script. This will cause an AttributeError if the commented-out code is ever used. The logic for determining sage_project_id should be corrected for this script's context.

args = parser.parse_args()

sage_pat = os.getenv("SAGE_PAT")
sage_project_id = OVERALL_PROJECT#ADULT_PROJECT if args.adult else PEDS_PROJECT
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This line appears to be a copy-paste error from another script. It references args.adult, which is not an argument in this script. This will cause an AttributeError if the commented-out code is ever used. The logic for determining sage_project_id should be corrected for this script's context.

Comment on lines +10 to +11
ADULT_PROJECT = "syn72370534"
OVERALL_PROJECT = "syn72370534"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The constants ADULT_PROJECT and OVERALL_PROJECT have the same value. If this is not intentional, it should be corrected to avoid confusion and potential bugs.

Comment on lines +18 to +21
parser = argparse.ArgumentParser(description="A simple script for ...")
parser.add_argument('--bids_folder', default='./', type=str, help='Example help information')
parser.add_argument('--manifest_file', default='temp_manifest_file.tsv',type=str)
parser.add_argument('--adult', default=False, action='store_true', help='is adult dataset')
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The ArgumentParser description and argument help messages are placeholders. They should be updated to be more descriptive and informative for users of this script.

Suggested change
parser = argparse.ArgumentParser(description="A simple script for ...")
parser.add_argument('--bids_folder', default='./', type=str, help='Example help information')
parser.add_argument('--manifest_file', default='temp_manifest_file.tsv',type=str)
parser.add_argument('--adult', default=False, action='store_true', help='is adult dataset')
parser = argparse.ArgumentParser(description="Generates a manifest file for uploading BIDS data to Sage Synapse.")
parser.add_argument('--bids_folder', default='./', type=str, help='Path to the local BIDS folder to be synced.')
parser.add_argument('--manifest_file', default='temp_manifest_file.tsv',type=str, help='Path to save the generated manifest file.')
parser.add_argument('--adult', default=False, action='store_true', help='Specify if the dataset is for adults. Defaults to pediatric.')

def main():
load_dotenv()

parser = argparse.ArgumentParser(description="A simple script for ...")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The ArgumentParser description is a placeholder. It should be updated to provide a meaningful summary of what the script does.

Suggested change
parser = argparse.ArgumentParser(description="A simple script for ...")
parser = argparse.ArgumentParser(description="Uploads a manifest file to Sage Synapse in chunks.")

Comment on lines +28 to +31
"""
Recursively walk through a Synapse folder structure.
Returns dict mapping relative paths to Synapse entity info.
"""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The docstring is inaccurate. It states that the function returns a dictionary, but it actually returns a tuple of a dictionary and a list: (dict, list). The docstring should be updated to reflect the correct return signature.

Suggested change
"""
Recursively walk through a Synapse folder structure.
Returns dict mapping relative paths to Synapse entity info.
"""
"""
Recursively walk through a Synapse folder structure.
Returns a tuple containing a dictionary of files and a list of folders.
"""

rel_path = os.path.join(path, child_name) if path else child_name

if child_type == 'org.sagebionetworks.repo.model.Folder':
print(f"Scanning folder: {rel_path}")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Consider using the logging module instead of print() for output. The logging module provides more flexibility, such as controlling log levels, formatting, and directing output to different destinations (e.g., files, stderr). This applies to other print statements in this file as well.


sage_folder_set = set([f['path'] for f in sage_folders])
local_folder_set = set([f['path'] for f in local_folders])
if (sage_folder_set - local_folder_set) != set():
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The comparison (sage_folder_set - local_folder_set) != set() can be simplified. An empty set evaluates to False in a boolean context, so you can write this more concisely. This applies to similar checks in this file.

Suggested change
if (sage_folder_set - local_folder_set) != set():
if sage_folder_set - local_folder_set:

parser.add_argument('--bids_folder', default='./', type=str, help='Example help information')
parser.add_argument('--adult', default=False, action='store_true', help='is adult dataset')
parser.add_argument('--get_md5', default=False, action='store_true')
parser.add_argument('--dry_run', default=True, action='store_false')
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The --dry_run argument is configured with action='store_false', which is counter-intuitive. When a user provides the --dry_run flag, it sets args.dry_run to False, effectively disabling the dry run. A more conventional approach would be to have a flag that enables the non-default behavior, for example, an --execute flag with action='store_true'.

@wilke0818
Copy link
Copy Markdown
Contributor Author

I wonder if it is perhaps smarter to write a script that uploads folders one at a time. This could possibly help with certain Sage errors I ran into while uploading while still parallelizing to deal with the fact that I get < 1 Mb/s during uploads.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant