4 changes: 4 additions & 0 deletions .gitignore
@@ -148,3 +148,7 @@ guide/data/

*.DS_Store
cache-wf/

external_scripts/sage_upload_scripts/logs/
external_scripts/sage_upload_scripts/*.tsv
external_scripts/sage_upload_scripts/*.sh
65 changes: 65 additions & 0 deletions external_scripts/sage_upload_scripts/README.md
@@ -0,0 +1,65 @@
# Sage Upload Process

This file details the process for uploading data to the Sage Synapse repository.

## 0. Prerequisites

As a prerequisite, we assume that the main Sage project has been created, that the pediatric and adult data share that project, that a folder exists for each dataset, and that the correct Synapse IDs are set in each of the scripts.

Additionally, to run any of these scripts, you need to set up a personal access token with Sage. To do so, go to Account Settings on Sage, scroll down to Personal Access Tokens (PATs), and generate one. This token should ideally have all 3 permissions (View, Download, Modify). The code currently assumes that this PAT is saved as an environment variable called `SAGE_PAT`. It is also possible to set up a Sage profile so that the environment variable does not have to be passed every time, but this is currently not configured.
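
A minimal sketch of reading the token and failing early when it is missing. The helper name `require_sage_pat` is hypothetical and not part of these scripts; only the `SAGE_PAT` variable name comes from the setup described above.

```python
import os


def require_sage_pat(env=None):
    """Return the personal access token, or raise a clear error if it is unset."""
    # `env` defaults to the real environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    token = env.get("SAGE_PAT", "").strip()
    if not token:
        raise RuntimeError("SAGE_PAT is not set; generate a PAT and export it first")
    return token
```

The returned value could then be passed as `authToken` to `synapseclient.login`, as the scripts in this folder do.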

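Finally, run `pip install synapseclient` to install the required Python API. The scripts also import `dotenv` (the `python-dotenv` package) and `pandas`, so these must be installed as well.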

## 1. Generate Sage manifest file

To bulk upload data to the Sage Synapse repository, we use the functions `synapseutils.generate_sync_manifest` and `synapseutils.syncToSynapse`. We start by generating a Sage manifest using the command below:

```
python sage_generate_manifest.py \
--bids_folder path/to/bids/folder \
--manifest_file path/to/save/manifest.tsv \
--adult
```

`--bids_folder` gives the path to a BIDS folder we want to upload. This should be the output of the deidentify command in `RELEASE.md` but without any features.

`--manifest_file` describes where to save the generated manifest file and what to name it.

`--adult` is a flag indicating that this is the adult data rather than the pediatric data, which determines the Synapse folder to upload to. Leave off the flag to generate the pediatric data manifest.
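
Once the manifest exists, it can help to know how many rows it holds before choosing upload ranges. The sketch below assumes only that the manifest is a tab-separated file with one header row (the upload script reads it with `pd.read_csv(..., sep='\t')`); the helper name `count_manifest_rows` is hypothetical.

```python
import csv


def count_manifest_rows(manifest_path):
    """Count data rows (excluding the header) in a tab-separated manifest."""
    with open(manifest_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return sum(1 for _ in reader)
```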

## 2. Upload the data using the generated manifest

Taking the manifest file generated in step 1, we can then upload the data using the command below. Since the manifest is already specific to either the adult or the pediatric data, the upload command is the same for both and there is no flag to specify the dataset.

```
python sage_upload_manifest.py \
--manifest_file path/to/manifest.tsv \
--start 0 \
--end 1000
```

`--manifest_file` should match the file generated by the command in step 1.

This script is set up for parallel uploads: `--start` and `--end` select the range of manifest rows to upload, so several invocations can each handle a different range. If they are not specified, the script uploads all of the data at once, which might take quite a long time.
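
As a rough sketch of how the ranges might be chosen, the snippet below splits a manifest into non-overlapping `(start, end)` pairs and prints one command per range. It assumes `--end` is exclusive, which matches the row slicing in `sage_upload_manifest.py`; the helper name `chunk_ranges` and the row count of 2500 are illustrative.

```python
def chunk_ranges(total_rows, chunk_size):
    """Yield (start, end) pairs covering rows 0..total_rows-1 without overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_rows, chunk_size):
        # The last chunk is clipped to the end of the manifest.
        yield start, min(start + chunk_size, total_rows)


for start, end in chunk_ranges(2500, 1000):
    print(
        "python sage_upload_manifest.py "
        f"--manifest_file path/to/manifest.tsv --start {start} --end {end}"
    )
```

Each printed command can then be launched in its own shell or job.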

Note: I have found that this script will fail with a variety of errors, typically related to concurrent uploads. For that reason I typically run it multiple times; reportedly, under the hood it caches the MD5 of each uploaded file and should not re-upload files whose hashes match.
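
Re-running by hand could be replaced with a small retry loop. This is a sketch under the assumption that repeating the upload is safe, as the note above suggests; `run_with_retries` is a hypothetical helper, and `action` stands in for whatever callable performs the upload.

```python
import time


def run_with_retries(action, attempts=3, delay_seconds=0.0):
    """Call action() until it succeeds or attempts run out, then re-raise."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # broad on purpose: the upload errors vary
            last_error = error
            print(f"Attempt {attempt}/{attempts} failed: {error}")
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise last_error
```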

## 3. Verify uploaded data

This command can be run both as a check that the data uploaded properly and as a comparison between the data on Sage and a local folder, in particular to make sure that future changes to the code only affected the intended files. The verification checks that the folder structures are equivalent, reporting any folders that exist on Sage but not locally, or locally but not on Sage. It also checks files, verifying that the filenames match and, where they do, that the contents (compared through an MD5 hash) are the same.

```
python verify_sage_contents.py \
--bids_folder path/to/bids/folder \
--adult \
--get_md5
```

`--bids_folder` specifies the BIDS folder that we want to compare the current data on Sage against. This would typically be the folder that was used for the generation of the manifest file, but might also be a new BIDS folder to check that only certain data and structures have changed.

`--adult` is a flag specifying that the Sage folder to compare against is the adult data folder. Leave off the flag to check the pediatric data.

`--get_md5` is a flag to specify whether to generate MD5 hashes of the files in the BIDS folder. If specified, it will generate the hash for the local files and compare it against the corresponding Sage file's MD5 hash.

`--sync` is an in-progress flag for synchronizing Sage with the local folder. This would not be a bulk synchronization like the earlier upload, but would act on individual files and folders. Currently it is not configured to do anything, and the script just performs a dry-run check of the differences between the two datasets.
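
The checks described in this section can be sketched on plain dictionaries. This is an illustration of the comparison logic only, not the code in `verify_sage_contents.py`; `compare_listings` is a hypothetical helper and both inputs are assumed to map relative paths to MD5 strings.

```python
def compare_listings(remote, local):
    """Compare two {relative_path: md5} dicts.

    Returns (only_remote, only_local, mismatched) as sorted lists of paths.
    """
    only_remote = sorted(set(remote) - set(local))
    only_local = sorted(set(local) - set(remote))
    # Paths present on both sides whose hashes differ.
    mismatched = sorted(
        path for path in set(remote) & set(local)
        if remote[path] != local[path]
    )
    return only_remote, only_local, mismatched
```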
47 changes: 47 additions & 0 deletions external_scripts/sage_upload_scripts/sage_generate_manifest.py
@@ -0,0 +1,47 @@
import os
import argparse
from dotenv import load_dotenv

import synapseclient
import synapseutils
from synapseclient.models import Project, Folder

OVERALL_PROJECT = "syn72370534"
PEDS_FOLDER = "syn72493849"
ADULT_FOLDER = "syn72493850"

def main():
    load_dotenv()

    parser = argparse.ArgumentParser(description="A simple script for ...")
    parser.add_argument('--bids_folder', default='./', type=str, help='Example help information')
    parser.add_argument('--manifest_file', default='temp_manifest_file.tsv',type=str)
    parser.add_argument('--adult', default=False, action='store_true', help='is adult dataset')
Comment on lines +16 to +19
Contributor

medium

The ArgumentParser description and argument help messages are placeholders. They should be updated to be more descriptive and informative for users of this script.

Suggested change
-    parser = argparse.ArgumentParser(description="A simple script for ...")
-    parser.add_argument('--bids_folder', default='./', type=str, help='Example help information')
-    parser.add_argument('--manifest_file', default='temp_manifest_file.tsv',type=str)
-    parser.add_argument('--adult', default=False, action='store_true', help='is adult dataset')
+    parser = argparse.ArgumentParser(description="Generates a manifest file for uploading BIDS data to Sage Synapse.")
+    parser.add_argument('--bids_folder', default='./', type=str, help='Path to the local BIDS folder to be synced.')
+    parser.add_argument('--manifest_file', default='temp_manifest_file.tsv',type=str, help='Path to save the generated manifest file.')
+    parser.add_argument('--adult', default=False, action='store_true', help='Specify if the dataset is for adults. Defaults to pediatric.')


    args = parser.parse_args()
    print(f"Received args: {args}")
Contributor

medium

Consider using the logging module instead of print() for output. The logging module provides more flexibility, such as controlling log levels, formatting, and directing output to different destinations (e.g., files, stderr).
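
A minimal sketch of what that could look like, using only the standard library `logging` module. The logger name, format string, and the `report_args` helper are illustrative choices and not part of the PR.

```python
import logging

logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("sage_generate_manifest")
# Set the level on this logger so INFO messages are emitted regardless of root config.
logger.setLevel(logging.INFO)


def report_args(args):
    """Log parsed arguments at INFO level instead of printing them."""
    logger.info("Received args: %s", args)
```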


    sage_pat = os.getenv("SAGE_PAT")
    sage_project_id = OVERALL_PROJECT

    syn = synapseclient.login(authToken=sage_pat)
    project = Project(id=sage_project_id).get()
    print(f"Looking at project: {project.name}, id: {project.id}")

    parent_id = ADULT_FOLDER if args.adult else PEDS_FOLDER
    print(f"Syncing data with parent folder: {parent_id}")
    root_data_folder = Folder(id=parent_id).get()

    # Quick assertion that we are in the correct project for this code
    assert root_data_folder.parent_id == project.id

    synapseutils.generate_sync_manifest(
        syn=syn,
        directory_path=args.bids_folder,
        parent_id=parent_id,
        manifest_path=args.manifest_file,
    )


if __name__=='__main__':
    main()
43 changes: 43 additions & 0 deletions external_scripts/sage_upload_scripts/sage_upload_manifest.py
@@ -0,0 +1,43 @@
import os
import argparse
from dotenv import load_dotenv

import synapseclient
import synapseutils
from synapseclient.models import Project
import tempfile
import pandas as pd

PEDS_PROJECT = "syn72418607"
ADULT_PROJECT = "syn72370534"
OVERALL_PROJECT = "syn72370534"
Comment on lines +12 to +13
Contributor

medium

The constants ADULT_PROJECT and OVERALL_PROJECT have the same value. If this is not intentional, it should be corrected to avoid confusion and potential bugs.

PEDS_FOLDER = "syn72493849"
ADULT_FOLDER = "syn72493850"

def main():
    load_dotenv()

    parser = argparse.ArgumentParser(description="A simple script for ...")
Contributor

medium

The ArgumentParser description is a placeholder. It should be updated to provide a meaningful summary of what the script does.

Suggested change
-    parser = argparse.ArgumentParser(description="A simple script for ...")
+    parser = argparse.ArgumentParser(description="Uploads a manifest file to Sage Synapse in chunks.")

    parser.add_argument('--start', default=None,type=int)
    parser.add_argument('--end',default=None,type=int)
    parser.add_argument('--manifest_file', default='temp_manifest_file.tsv',type=str)

    args = parser.parse_args()

    sage_pat = os.getenv("SAGE_PAT")
    sage_project_id = OVERALL_PROJECT#ADULT_PROJECT if args.adult else PEDS_PROJECT
Contributor

high

This line appears to be a copy-paste error from another script. It references args.adult, which is not an argument in this script. This will cause an AttributeError if the commented-out code is ever used. The logic for determining sage_project_id should be corrected for this script's context.


    syn = synapseclient.login(authToken=sage_pat)
    with tempfile.NamedTemporaryFile(mode='w+', suffix='.tsv', delete=True) as temp:
        data = pd.read_csv(args.manifest_file, sep='\t')
        # Fall back to the full manifest when --start/--end are not given
        start = args.start if args.start and args.start >= 0 else 0
        end = args.end if args.end else len(data)
        # Slice with the validated bounds rather than the raw arguments
        df = data.iloc[start:end]
        df.to_csv(temp.name, sep='\t', index=False)
        synapseutils.syncToSynapse(
            syn=syn, manifestFile=temp.name, sendMessages=False
        )


if __name__=='__main__':
    main()
185 changes: 185 additions & 0 deletions external_scripts/sage_upload_scripts/verify_sage_contents.py
@@ -0,0 +1,185 @@
import os
import argparse
from dotenv import load_dotenv

import synapseclient
import synapseutils
from synapseclient.models import Project, Folder, File
import hashlib

def get_file_md5(filepath):
    """Calculate MD5 hash of a local file."""
    hash_md5 = hashlib.md5()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()



PEDS_PROJECT = "syn72418607"
ADULT_PROJECT = "syn72370534"
OVERALL_PROJECT = "syn72370534"
PEDS_FOLDER = "syn72493849"
ADULT_FOLDER = "syn72493850"


def walk_synapse_folder(syn, folder_id, path=""):
    """
    Recursively walk through a Synapse folder structure.
    Returns dict mapping relative paths to Synapse entity info.
    """
Comment on lines +28 to +31
Contributor

medium

The docstring is inaccurate. It states that the function returns a dictionary, but it actually returns a tuple of a dictionary and a list: (dict, list). The docstring should be updated to reflect the correct return signature.

Suggested change
-    """
-    Recursively walk through a Synapse folder structure.
-    Returns dict mapping relative paths to Synapse entity info.
-    """
+    """
+    Recursively walk through a Synapse folder structure.
+    Returns a tuple containing a dictionary of files and a list of folders.
+    """

    synapse_files = {}
    synapse_folders = []

    try:
        # Get all children of this folder
        children = syn.getChildren(folder_id)

        for child in children:
            child_name = child['name']
            child_id = child['id']
            child_type = child['type']

            # Build relative path
            rel_path = os.path.join(path, child_name) if path else child_name

            if child_type == 'org.sagebionetworks.repo.model.Folder':
                print(f"Scanning folder: {rel_path}")
Contributor

medium

Consider using the logging module instead of print() for output. The logging module provides more flexibility, such as controlling log levels, formatting, and directing output to different destinations (e.g., files, stderr). This applies to other print statements in this file as well.

                synapse_folders.append({'path':rel_path,'id':child_id})
                # Recursively process subfolder
                syn_files, syn_folders = walk_synapse_folder(syn, child_id, rel_path)
                synapse_files.update(syn_files)
                synapse_folders += syn_folders

            elif child_type == 'org.sagebionetworks.repo.model.FileEntity':
                # Get file metadata without downloading
                file_entity = syn.get(child_id, downloadFile=False)

                synapse_files[rel_path] = {
                    'id': child_id,
                    'name': child_name,
                    'md5': file_entity.get('md5', None),
                    'size': file_entity.get('dataFileHandleId', {}).get('contentSize', 0) if isinstance(file_entity.get('dataFileHandleId'), dict) else 0,
                    'modifiedOn': file_entity.get('modifiedOn', None),
                    'path': rel_path
                }

                #print(f"Found file: {rel_path}")

    except Exception as e:
        print(f"Error processing folder {path}: {str(e)}")

    return synapse_files, synapse_folders


def walk_local_folder(path, get_md5=False):
    """Recursively walk a local folder; returns (files dict, folders list)."""
    local_files = {}
    local_folders = []

    for entity in os.listdir(path):
        rel_path = os.path.join(path,entity)
        if os.path.isdir(rel_path):
            local_folders.append({'path':rel_path})
            temp_files, temp_folders = walk_local_folder(rel_path, get_md5=get_md5)
            local_files.update(temp_files)
            local_folders += temp_folders
        else:
            local_files[rel_path] = {
                'md5': get_file_md5(rel_path) if get_md5 else None,
                'path': rel_path
            }
    return local_files, local_folders


def run_folder_comparisons(sage_folders, local_folders, dry_run=True):
    for folder in sage_folders:
        sage_path = folder['path']
        if os.path.exists(sage_path) and not os.path.isdir(sage_path):
            print(f"Sage ID {folder['id']} ({sage_path}) found locally but is not a directory")
            if not dry_run:
                # not even sure what or why this might happen so not sure what to do
                pass

    sage_folder_set = set([f['path'] for f in sage_folders])
    local_folder_set = set([f['path'] for f in local_folders])
    if (sage_folder_set - local_folder_set) != set():
Contributor

medium

The comparison (sage_folder_set - local_folder_set) != set() can be simplified. An empty set evaluates to False in a boolean context, so you can write this more concisely. This applies to similar checks in this file.

Suggested change
-    if (sage_folder_set - local_folder_set) != set():
+    if sage_folder_set - local_folder_set:
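
A short illustration of why the two forms are equivalent: an empty set is falsy in Python, so testing the set difference directly gives the same result as comparing it to `set()`. The folder names below are made-up sample values.

```python
sage_folder_set = {"sub-01", "sub-02"}
local_folder_set = {"sub-01", "sub-02"}

# With identical sets the difference is empty, and an empty set is falsy.
difference = sage_folder_set - local_folder_set
assert bool(difference) == (difference != set())

# Remove one local folder so the difference becomes non-empty (truthy).
local_folder_set.discard("sub-02")
difference = sage_folder_set - local_folder_set
if difference:
    print(f"Only on Sage: {sorted(difference)}")
```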

        print("The following folders were found on Sage and not locally")
        for folder in (sage_folder_set - local_folder_set):
            print(f"\t{folder}")
            if not dry_run:
                # Add code to remove folders from Sage
                pass
    if (local_folder_set - sage_folder_set) != set():
        print("The following folders were found locally and not on Sage")
        for folder in (local_folder_set - sage_folder_set):
            print(f"\t{folder}")
            if not dry_run:
                # Add code to add folder to Sage
                pass


def run_file_comparisons(sage_files, local_files, md5=False, dry_run=True):
    for f in sage_files:
        sage_path = f
        if not os.path.exists(sage_path):
            print(f"Sage ID {sage_files[f]['id']} ({sage_path}) not found locally")
        elif not os.path.isfile(sage_path):
            print(f"Sage ID {sage_files[f]['id']} ({sage_path}) found locally but is not a file")
        elif sage_path in local_files and md5:
            if sage_files[f]['md5'] != local_files[f]['md5']:
                print(f"Sage ID {sage_files[f]['id']} ({sage_path}) md5 ({sage_files[f]['md5']}) does not match local md5 hash ({local_files[f]['md5']})")
                if not dry_run:
                    # Add logic for updating the files on sage
                    pass

    sage_file_set = set(sage_files.keys())
    local_file_set = set(local_files.keys())
    if (sage_file_set - local_file_set) != set():
        print("The following files were found on Sage and not locally")
        for file_path in (sage_file_set - local_file_set):
            print(f"\t{file_path}")
            if not dry_run:
                # Add logic for removing files from Sage
                pass
    if (local_file_set - sage_file_set) != set():
        print("The following files were found locally and not on Sage")
        for file_path in (local_file_set - sage_file_set):
            print(f"\t{file_path}")
            if not dry_run:
                # Add logic for adding file to Sage
                pass


def main():
    load_dotenv()

    parser = argparse.ArgumentParser(description="A simple script for ...")
    parser.add_argument('--bids_folder', default='./', type=str, help='Example help information')
    parser.add_argument('--adult', default=False, action='store_true', help='is adult dataset')
    parser.add_argument('--get_md5', default=False, action='store_true')
    parser.add_argument('--sync', default=False, action='store_true')

    args = parser.parse_args()

    sage_pat = os.getenv("SAGE_PAT")
    sage_project_id = OVERALL_PROJECT#ADULT_PROJECT if args.adult else PEDS_PROJECT
Contributor

high

This line appears to be a copy-paste error from another script. It references args.adult, which is not an argument in this script. This will cause an AttributeError if the commented-out code is ever used. The logic for determining sage_project_id should be corrected for this script's context.


    syn = synapseclient.login(authToken=sage_pat)
    project = Project(id=sage_project_id).get()
    print(f"I just got my project: {project.name}, id: {project.id}")

    parent_id = ADULT_FOLDER if args.adult else PEDS_FOLDER

    my_project = Folder(id=parent_id).get()
    synapse_files, synapse_folders = walk_synapse_folder(syn,my_project,args.bids_folder)
    local_files, local_folders = walk_local_folder(args.bids_folder, args.get_md5)

    # Use `not` rather than bitwise `~`: `~False` is -1 and `~True` is -2, both truthy,
    # so `~args.sync` would always leave dry_run enabled.
    run_folder_comparisons(synapse_folders, local_folders, dry_run=not args.sync)
    run_file_comparisons(synapse_files, local_files, md5=args.get_md5, dry_run=not args.sync)

    #run_comparison(dir_mapping, args.bids_folder)


if __name__=='__main__':
    main()