Add Sage upload scripts and documentation for maintainability #239
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,65 @@ | ||
| # Sage Upload Process | ||
|
|
||
| This file details the process for uploading data to the Sage Synapse repository. | ||
|
|
||
| ## 0. Prerequisites | ||
|
|
||
| As a prerequisite, we assume that the main Sage project has already been created, that the pediatric and adult data share that project, that a folder exists for each dataset, and that the correct Synapse IDs are set in each of the scripts. | ||
|
|
||
| Additionally, to run any of these scripts, you need to set up a personal access token (PAT) with Sage. To do so, go to Account Settings on Sage, scroll down to Personal Access Tokens, and generate one. This token should ideally have all three permissions (View, Download, Modify). The code currently assumes that the PAT is saved in an environment variable called `SAGE_PAT`. It is also possible to set up a Sage profile so that the environment variable does not have to be passed every time, but this is currently not configured. | ||
|
|
||
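| Finally, run `pip install synapseclient` to install the required Python API. The scripts also import `dotenv` (the `python-dotenv` package) and `pandas`, so those need to be installed as well. | ||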
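A missing or empty `SAGE_PAT` only surfaces later as a login failure, so it can help to check for it up front. The following is a minimal sketch, not part of the scripts in this PR: the helper name `require_sage_pat` and its error message are hypothetical, and it only assumes the `SAGE_PAT` variable described above.

```python
import os


def require_sage_pat(env=None):
    """Return the token stored in SAGE_PAT, or raise with a hint if it is unset.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    """
    env = os.environ if env is None else env
    token = env.get("SAGE_PAT", "").strip()
    if not token:
        raise RuntimeError(
            "SAGE_PAT is not set; export it (or add it to a .env file) "
            "before running the Sage scripts."
        )
    return token


if __name__ == "__main__":
    try:
        require_sage_pat()
        print("SAGE_PAT found")
    except RuntimeError as err:
        print(err)
```

The returned token could then be passed to `synapseclient.login(authToken=...)` in place of the bare `os.getenv("SAGE_PAT")` call.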
|
|
||
| ## 1. Generate Sage manifest file | ||
|
|
||
| To bulk upload data to the Sage Synapse repository, we use the functions `synapseutils.generate_sync_manifest` and `synapseutils.syncToSynapse`. We start by generating a Sage manifest using the command below: | ||
|
|
||
| ``` | ||
| python sage_generate_manifest.py \ | ||
| --bids_folder path/to/bids/folder \ | ||
| --manifest_file path/to/save/manifest.tsv \ | ||
| --adult | ||
| ``` | ||
|
|
||
| `--bids_folder` gives the path to a BIDS folder we want to upload. This should be the output of the deidentify command in `RELEASE.md` but without any features. | ||
|
|
||
| `--manifest_file` describes where to save the generated manifest file and what to name it. | ||
|
|
||
| `--adult` flags the data as adult rather than pediatric, which determines the Synapse folder to upload to. Leave the flag off to generate the pediatric data manifest. | ||
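Before uploading, it can be useful to sanity-check the generated manifest. The sketch below reads the TSV with the standard library and reports rows whose local file no longer exists. It assumes the manifest has a `path` column holding local file paths; that column name is an assumption about the generated file, so adjust `path_column` if your manifest differs. The function name `check_manifest` is hypothetical.

```python
import csv
import os


def check_manifest(manifest_path, path_column="path"):
    """Return (row_count, missing), where `missing` lists manifest paths
    that do not exist as files on the local disk."""
    with open(manifest_path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    missing = [
        row[path_column] for row in rows if not os.path.isfile(row[path_column])
    ]
    return len(rows), missing
```

Calling `check_manifest("path/to/save/manifest.tsv")` returns the number of rows, which is also a convenient upper bound when choosing `--start` and `--end` values in step 2.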
|
|
||
| ## 2. Upload the data using the generated manifest | ||
|
|
||
| Taking the generated manifest file from step 1, we can then upload the data using the command below. Since the manifest is already specific to either the adult or the pediatric data, the upload command is the same for both, and there is no flag to specify the dataset. | ||
|
|
||
| ``` | ||
| python sage_upload_manifest.py \ | ||
| --manifest_file path/to/manifest.tsv \ | ||
| --start 0 \ | ||
| --end 1000 | ||
| ``` | ||
|
|
||
| `--manifest_file` should match the file generated from running the command in step one. | ||
|
|
||
| This script is set up for parallel uploads: `--start` and `--end` select the range of manifest rows to upload. If they are not specified, all of the data is uploaded at once, which might take quite a long time. | ||
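To split a manifest across several runs, the row ranges can be computed ahead of time. This is a small sketch under the assumption that `--end` is exclusive, as in a Python slice; `chunk_ranges` is a hypothetical helper, and the row count and chunk size are made-up values.

```python
def chunk_ranges(n_rows, chunk_size):
    """Split n_rows manifest rows into (start, end) pairs, end exclusive."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, n_rows))
        for start in range(0, n_rows, chunk_size)
    ]


if __name__ == "__main__":
    # Print one upload command per chunk for a hypothetical 2500-row manifest.
    for start, end in chunk_ranges(2500, 1000):
        print(
            "python sage_upload_manifest.py "
            f"--manifest_file path/to/manifest.tsv --start {start} --end {end}"
        )
```

Each printed command can be run in its own terminal or job, so the chunks upload side by side.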
|
|
||
| Note: I have found that this script will fail with a variety of errors, typically related to concurrent uploads. For that reason I typically run it multiple times; reportedly, under the hood it caches the MD5 of the uploaded files and shouldn't re-upload files whose hashes match. | ||
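Re-running by hand can be replaced by a small retry loop. The sketch below is generic and does not call Synapse itself: `run_with_retries` is a hypothetical helper, and it catches every `Exception`, which is a deliberate simplification given that the failures described above are varied.

```python
import time


def run_with_retries(action, attempts=3, delay_seconds=0.0):
    """Call `action()` until it succeeds or `attempts` tries are used up.

    Returns the result of the first successful call and re-raises the
    last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as err:  # broad on purpose: upload errors vary
            last_error = err
            print(f"Attempt {attempt} of {attempts} failed: {err}")
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise last_error
```

In the upload script, `action` could be a lambda wrapping the `synapseutils.syncToSynapse(...)` call, so a transient failure triggers another pass over the same manifest slice.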
|
|
||
| ## 3. Verify uploaded data | ||
|
|
||
| This command serves two purposes: checking that the data uploaded properly, and comparing the data on Sage against a local folder, for example to make sure that future changes to the code only affected the intended files. The verification checks that the folder structures are equivalent, reporting any folders that exist on Sage but not locally or that are missing on Sage. It does the same for files: it checks that the filenames match and, if they do, that the contents (compared through an MD5 hash) are the same. | ||
|
|
||
| ``` | ||
| python verify_sage_contents.py \ | ||
| --bids_folder path/to/bids/folder \ | ||
| --adult \ | ||
| --get_md5 | ||
| ``` | ||
|
|
||
| `--bids_folder` specifies the BIDS folder that we want to compare the current data on Sage against. This would typically be the folder that was used for the generation of the manifest file, but might also be a new BIDS folder to check that only certain data and structures have changed. | ||
|
|
||
| `--adult` specifies that the Sage folder to compare against is the adult data folder. Leave the flag off to check the pediatric data. | ||
|
|
||
| `--get_md5` is a flag to specify whether to generate MD5 hashes of the files in the BIDS folder. If specified, it will generate the hash for the local files and compare it against the corresponding Sage file's MD5 hash. | ||
|
|
||
| `--sync` is an in-progress flag for synchronizing Sage with the local folder. This would not be a bulk synchronization like the upload above, but would act on individual files and folders. Currently it is not configured to do anything, and the script defaults to a dry-run check of the differences between the two datasets.
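The checks described in step 3 boil down to comparing two mappings from relative path to MD5. The following is a simplified sketch of that idea, not the script's actual implementation; `compare_md5_maps` and the example paths and hashes are hypothetical.

```python
def compare_md5_maps(remote, local):
    """Compare two {relative_path: md5} mappings.

    Returns paths found only remotely, paths found only locally, and
    paths present on both sides whose hashes differ.
    """
    remote_paths, local_paths = set(remote), set(local)
    return {
        "only_remote": sorted(remote_paths - local_paths),
        "only_local": sorted(local_paths - remote_paths),
        "mismatched": sorted(
            path for path in remote_paths & local_paths
            if remote[path] != local[path]
        ),
    }


if __name__ == "__main__":
    report = compare_md5_maps(
        {"sub-01/a.json": "aaa", "sub-01/b.json": "bbb"},
        {"sub-01/a.json": "aaa", "sub-01/b.json": "ccc", "sub-02/c.json": "ddd"},
    )
    for label, paths in report.items():
        print(label, paths)
```

Returning a report instead of printing inside the comparison keeps the dry-run output and any future `--sync` actions driven by the same data.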
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,47 @@ | ||
| import os | ||
| import argparse | ||
| from dotenv import load_dotenv | ||
|
|
||
| import synapseclient | ||
| import synapseutils | ||
| from synapseclient.models import Project, Folder | ||
|
|
||
| OVERALL_PROJECT = "syn72370534" | ||
| PEDS_FOLDER = "syn72493849" | ||
| ADULT_FOLDER = "syn72493850" | ||
|
|
||
| def main(): | ||
| load_dotenv() | ||
|
|
||
| parser = argparse.ArgumentParser(description="A simple script for ...") | ||
| parser.add_argument('--bids_folder', default='./', type=str, help='Example help information') | ||
| parser.add_argument('--manifest_file', default='temp_manifest_file.tsv',type=str) | ||
| parser.add_argument('--adult', default=False, action='store_true', help='is adult dataset') | ||
|
|
||
| args = parser.parse_args() | ||
| print(f"Received args: {args}") | ||
|
|
||
|
|
||
| sage_pat = os.getenv("SAGE_PAT") | ||
| sage_project_id = OVERALL_PROJECT | ||
|
|
||
| syn = synapseclient.login(authToken=sage_pat) | ||
| project = Project(id=sage_project_id).get() | ||
| print(f"Looking at project: {project.name}, id: {project.id}") | ||
|
|
||
| parent_id = ADULT_FOLDER if args.adult else PEDS_FOLDER | ||
| print(f"Syncing data with parent folder: {parent_id}") | ||
| root_data_folder = Folder(id=parent_id).get() | ||
|
|
||
| # Quick assertion that we are in the correct project for this code | ||
| assert root_data_folder.parent_id == project.id | ||
|
|
||
| synapseutils.generate_sync_manifest( | ||
| syn=syn, | ||
| directory_path=args.bids_folder, | ||
| parent_id=parent_id, | ||
| manifest_path=args.manifest_file, | ||
| ) | ||
|
|
||
|
|
||
| if __name__=='__main__': | ||
| main() | ||
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,43 @@ | ||||||
| import os | ||||||
| import argparse | ||||||
| from dotenv import load_dotenv | ||||||
|
|
||||||
| import synapseclient | ||||||
| import synapseutils | ||||||
| from synapseclient.models import Project | ||||||
| import tempfile | ||||||
| import pandas as pd | ||||||
|
|
||||||
| PEDS_PROJECT = "syn72418607" | ||||||
| ADULT_PROJECT = "syn72370534" | ||||||
| OVERALL_PROJECT = "syn72370534" | ||||||
|
|
||||||
| PEDS_FOLDER = "syn72493849" | ||||||
| ADULT_FOLDER = "syn72493850" | ||||||
|
|
||||||
| def main(): | ||||||
| load_dotenv() | ||||||
|
|
||||||
| parser = argparse.ArgumentParser(description="A simple script for ...") | ||||||
|
|
||||||
| parser.add_argument('--start', default=None,type=int) | ||||||
| parser.add_argument('--end',default=None,type=int) | ||||||
| parser.add_argument('--manifest_file', default='temp_manifest_file.tsv',type=str) | ||||||
|
|
||||||
| args = parser.parse_args() | ||||||
|
|
||||||
| sage_pat = os.getenv("SAGE_PAT") | ||||||
| sage_project_id = OVERALL_PROJECT | ||||||
|
|
||||||
|
|
||||||
| syn = synapseclient.login(authToken=sage_pat) | ||||||
| with tempfile.NamedTemporaryFile(mode='w+', suffix='.tsv', delete=True) as temp: | ||||||
| data = pd.read_csv(args.manifest_file, sep='\t') | ||||||
| start = args.start if args.start is not None and args.start >= 0 else 0 | ||||||
| end = args.end if args.end is not None else len(data) | ||||||
| df = data.iloc[start:end] | ||||||
| df.to_csv(temp.name, sep='\t', index=False) | ||||||
| synapseutils.syncToSynapse( | ||||||
| syn=syn, manifestFile=temp.name, sendMessages=False | ||||||
| ) | ||||||
|
|
||||||
|
|
||||||
| if __name__=='__main__': | ||||||
| main() | ||||||
| Original file line number | Diff line number | Diff line change | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,185 @@ | ||||||||||||||||||
| import os | ||||||||||||||||||
| import argparse | ||||||||||||||||||
| from dotenv import load_dotenv | ||||||||||||||||||
|
|
||||||||||||||||||
| import synapseclient | ||||||||||||||||||
| import synapseutils | ||||||||||||||||||
| from synapseclient.models import Project, Folder, File | ||||||||||||||||||
| import hashlib | ||||||||||||||||||
|
|
||||||||||||||||||
| def get_file_md5(filepath): | ||||||||||||||||||
| """Calculate MD5 hash of a local file.""" | ||||||||||||||||||
| hash_md5 = hashlib.md5() | ||||||||||||||||||
| with open(filepath, "rb") as f: | ||||||||||||||||||
| for chunk in iter(lambda: f.read(4096), b""): | ||||||||||||||||||
| hash_md5.update(chunk) | ||||||||||||||||||
| return hash_md5.hexdigest() | ||||||||||||||||||
|
|
||||||||||||||||||
|
|
||||||||||||||||||
|
|
||||||||||||||||||
| PEDS_PROJECT = "syn72418607" | ||||||||||||||||||
| ADULT_PROJECT = "syn72370534" | ||||||||||||||||||
| OVERALL_PROJECT = "syn72370534" | ||||||||||||||||||
| PEDS_FOLDER = "syn72493849" | ||||||||||||||||||
| ADULT_FOLDER = "syn72493850" | ||||||||||||||||||
|
|
||||||||||||||||||
|
|
||||||||||||||||||
| def walk_synapse_folder(syn, folder_id, path=""): | ||||||||||||||||||
| """ | ||||||||||||||||||
| Recursively walk through a Synapse folder structure. | ||||||||||||||||||
| Returns dict mapping relative paths to Synapse entity info. | ||||||||||||||||||
| """ | ||||||||||||||||||
|
Contributor (comment on lines +28 to +31): The docstring is inaccurate. It states that the function returns a dictionary, but it actually returns a tuple of a dictionary and a list.
|
||||||||||||||||||
| synapse_files = {} | ||||||||||||||||||
| synapse_folders = [] | ||||||||||||||||||
|
|
||||||||||||||||||
| try: | ||||||||||||||||||
| # Get all children of this folder | ||||||||||||||||||
| children = syn.getChildren(folder_id) | ||||||||||||||||||
|
|
||||||||||||||||||
| for child in children: | ||||||||||||||||||
| child_name = child['name'] | ||||||||||||||||||
| child_id = child['id'] | ||||||||||||||||||
| child_type = child['type'] | ||||||||||||||||||
|
|
||||||||||||||||||
| # Build relative path | ||||||||||||||||||
| rel_path = os.path.join(path, child_name) if path else child_name | ||||||||||||||||||
|
|
||||||||||||||||||
| if child_type == 'org.sagebionetworks.repo.model.Folder': | ||||||||||||||||||
| print(f"Scanning folder: {rel_path}") | ||||||||||||||||||
|
|
||||||||||||||||||
| synapse_folders.append({'path':rel_path,'id':child_id}) | ||||||||||||||||||
| # Recursively process subfolder | ||||||||||||||||||
| syn_files, syn_folders = walk_synapse_folder(syn, child_id, rel_path) | ||||||||||||||||||
| synapse_files.update(syn_files) | ||||||||||||||||||
| synapse_folders += syn_folders | ||||||||||||||||||
|
|
||||||||||||||||||
| elif child_type == 'org.sagebionetworks.repo.model.FileEntity': | ||||||||||||||||||
| # Get file metadata without downloading | ||||||||||||||||||
| file_entity = syn.get(child_id, downloadFile=False) | ||||||||||||||||||
|
|
||||||||||||||||||
| synapse_files[rel_path] = { | ||||||||||||||||||
| 'id': child_id, | ||||||||||||||||||
| 'name': child_name, | ||||||||||||||||||
| 'md5': file_entity.get('md5', None), | ||||||||||||||||||
| 'size': file_entity.get('dataFileHandleId', {}).get('contentSize', 0) if isinstance(file_entity.get('dataFileHandleId'), dict) else 0, | ||||||||||||||||||
| 'modifiedOn': file_entity.get('modifiedOn', None), | ||||||||||||||||||
| 'path': rel_path | ||||||||||||||||||
| } | ||||||||||||||||||
|
|
||||||||||||||||||
| #print(f"Found file: {rel_path}") | ||||||||||||||||||
|
|
||||||||||||||||||
| except Exception as e: | ||||||||||||||||||
| print(f"Error processing folder {path}: {str(e)}") | ||||||||||||||||||
|
|
||||||||||||||||||
| return synapse_files, synapse_folders | ||||||||||||||||||
|
|
||||||||||||||||||
|
|
||||||||||||||||||
| def walk_local_folder(path, get_md5=False): | ||||||||||||||||||
| local_files = {} | ||||||||||||||||||
| local_folders = [] | ||||||||||||||||||
|
|
||||||||||||||||||
| for entity in os.listdir(path): | ||||||||||||||||||
| rel_path = os.path.join(path,entity) | ||||||||||||||||||
| if os.path.isdir(rel_path): | ||||||||||||||||||
| local_folders.append({'path':rel_path}) | ||||||||||||||||||
| temp_files, temp_folders = walk_local_folder(rel_path, get_md5=get_md5) | ||||||||||||||||||
| local_files.update(temp_files) | ||||||||||||||||||
| local_folders += temp_folders | ||||||||||||||||||
| else: | ||||||||||||||||||
| local_files[rel_path] = { | ||||||||||||||||||
| 'md5': get_file_md5(rel_path) if get_md5 else None, | ||||||||||||||||||
| 'path': rel_path | ||||||||||||||||||
| } | ||||||||||||||||||
| return local_files, local_folders | ||||||||||||||||||
|
|
||||||||||||||||||
|
|
||||||||||||||||||
| def run_folder_comparisons(sage_folders, local_folders, dry_run=True): | ||||||||||||||||||
| for folder in sage_folders: | ||||||||||||||||||
| sage_path = folder['path'] | ||||||||||||||||||
| if os.path.exists(sage_path) and not os.path.isdir(sage_path): | ||||||||||||||||||
| print(f"Sage ID {folder['id']} ({sage_path}) found locally but is not a directory") | ||||||||||||||||||
| if not dry_run: | ||||||||||||||||||
| # not even sure what or why this might happen so not sure what to do | ||||||||||||||||||
| pass | ||||||||||||||||||
|
|
||||||||||||||||||
| sage_folder_set = set([f['path'] for f in sage_folders]) | ||||||||||||||||||
| local_folder_set = set([f['path'] for f in local_folders]) | ||||||||||||||||||
| if (sage_folder_set - local_folder_set) != set(): | ||||||||||||||||||
|
|
||||||||||||||||||
| print("The following folders were found on Sage and not locally") | ||||||||||||||||||
| for folder in (sage_folder_set - local_folder_set): | ||||||||||||||||||
| print(f"\t{folder}") | ||||||||||||||||||
| if not dry_run: | ||||||||||||||||||
| # Add code to remove folders from Sage | ||||||||||||||||||
| pass | ||||||||||||||||||
| if (local_folder_set - sage_folder_set) != set(): | ||||||||||||||||||
| print("The following folders were found locally and not on Sage") | ||||||||||||||||||
| for folder in (local_folder_set - sage_folder_set): | ||||||||||||||||||
| print(f"\t{folder}") | ||||||||||||||||||
| if not dry_run: | ||||||||||||||||||
| # Add code to add folder to Sage | ||||||||||||||||||
| pass | ||||||||||||||||||
|
|
||||||||||||||||||
|
|
||||||||||||||||||
| def run_file_comparisons(sage_files, local_files, md5=False, dry_run=True): | ||||||||||||||||||
| for f in sage_files: | ||||||||||||||||||
| sage_path = f | ||||||||||||||||||
| if not os.path.exists(sage_path): | ||||||||||||||||||
| print(f"Sage ID {sage_files[f]['id']} ({sage_path}) not found locally") | ||||||||||||||||||
| elif not os.path.isfile(sage_path): | ||||||||||||||||||
| print(f"Sage ID {sage_files[f]['id']} ({sage_path}) found locally but is not a file") | ||||||||||||||||||
| elif sage_path in local_files and md5: | ||||||||||||||||||
| if sage_files[f]['md5'] != local_files[f]['md5']: | ||||||||||||||||||
| print(f"Sage ID {sage_files[f]['id']} ({sage_path}) md5 ({sage_files[f]['md5']}) does not match local md5 hash ({local_files[f]['md5']})") | ||||||||||||||||||
| if not dry_run: | ||||||||||||||||||
| # Add logic for updating the files on sage | ||||||||||||||||||
| pass | ||||||||||||||||||
|
|
||||||||||||||||||
| sage_file_set = set(sage_files.keys()) | ||||||||||||||||||
| local_file_set = set(local_files.keys()) | ||||||||||||||||||
| if (sage_file_set - local_file_set) != set(): | ||||||||||||||||||
| print("The following files were found on Sage and not locally") | ||||||||||||||||||
| for file_path in (sage_file_set - local_file_set): | ||||||||||||||||||
| print(f"\t{file_path}") | ||||||||||||||||||
| if not dry_run: | ||||||||||||||||||
| # Add logic for removing files from Sage | ||||||||||||||||||
| pass | ||||||||||||||||||
| if (local_file_set - sage_file_set) != set(): | ||||||||||||||||||
| print("The following files were found locally and not on Sage") | ||||||||||||||||||
| for file_path in (local_file_set - sage_file_set): | ||||||||||||||||||
| print(f"\t{file_path}") | ||||||||||||||||||
| if not dry_run: | ||||||||||||||||||
| # Add logic for adding file to Sage | ||||||||||||||||||
| pass | ||||||||||||||||||
|
|
||||||||||||||||||
|
|
||||||||||||||||||
| def main(): | ||||||||||||||||||
| load_dotenv() | ||||||||||||||||||
|
|
||||||||||||||||||
| parser = argparse.ArgumentParser(description="A simple script for ...") | ||||||||||||||||||
| parser.add_argument('--bids_folder', default='./', type=str, help='Example help information') | ||||||||||||||||||
| parser.add_argument('--adult', default=False, action='store_true', help='is adult dataset') | ||||||||||||||||||
| parser.add_argument('--get_md5', default=False, action='store_true') | ||||||||||||||||||
| parser.add_argument('--sync', default=False, action='store_true') | ||||||||||||||||||
|
|
||||||||||||||||||
| args = parser.parse_args() | ||||||||||||||||||
|
|
||||||||||||||||||
| sage_pat = os.getenv("SAGE_PAT") | ||||||||||||||||||
| sage_project_id = OVERALL_PROJECT | ||||||||||||||||||
|
|
||||||||||||||||||
|
|
||||||||||||||||||
| syn = synapseclient.login(authToken=sage_pat) | ||||||||||||||||||
| project = Project(id=sage_project_id).get() | ||||||||||||||||||
| print(f"I just got my project: {project.name}, id: {project.id}") | ||||||||||||||||||
|
|
||||||||||||||||||
| parent_id = ADULT_FOLDER if args.adult else PEDS_FOLDER | ||||||||||||||||||
|
|
||||||||||||||||||
| my_project = Folder(id=parent_id).get() | ||||||||||||||||||
| synapse_files, synapse_folders = walk_synapse_folder(syn,my_project,args.bids_folder) | ||||||||||||||||||
| local_files, local_folders = walk_local_folder(args.bids_folder, args.get_md5) | ||||||||||||||||||
|
|
||||||||||||||||||
| run_folder_comparisons(synapse_folders, local_folders, dry_run=not args.sync) | ||||||||||||||||||
| run_file_comparisons(synapse_files, local_files, md5=args.get_md5, dry_run=not args.sync) | ||||||||||||||||||
|
|
||||||||||||||||||
| #run_comparison(dir_mapping, args.bids_folder) | ||||||||||||||||||
|
|
||||||||||||||||||
|
|
||||||||||||||||||
| if __name__=='__main__': | ||||||||||||||||||
| main() | ||||||||||||||||||
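A note on deriving `dry_run` from a boolean flag such as `args.sync`: Python's `~` is bitwise inversion, so applied to a `bool` it yields `-1` or `-2`, both of which are truthy. Logical negation needs `not`. The short sketch below illustrates the difference; `dry_run_from_sync` is a hypothetical helper, and the warning filter is only there because recent Python versions emit a deprecation warning for `~` on `bool`.

```python
import warnings


def dry_run_from_sync(sync):
    """Dry run is the logical opposite of the sync flag."""
    return not sync


if __name__ == "__main__":
    with warnings.catch_warnings():
        # Silence the deprecation warning for `~` on bool in newer Pythons.
        warnings.simplefilter("ignore")
        for flag in (False, True):
            inverted = ~flag  # bitwise: -1 for False, -2 for True
            print(
                f"sync={flag}: ~sync={inverted} (truthy={bool(inverted)}), "
                f"not sync={dry_run_from_sync(flag)}"
            )
```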
The `ArgumentParser` description and argument help messages are placeholders. They should be updated to be more descriptive and informative for users of this script.