Improve file management in large-scale analyses #13

edg1983 (Maintainer) · 2025-03-13T10:48:03Z

We experienced issues caused by the massive number of files generated when the pipeline is run on large-scale datasets.

This is likely due to a combination of excessive scattering across processes and data.table temp files generated during writing.

How can we overcome this?

There are different possible strategies we can explore:

Aggregate small tasks (such as per-locus fine-mapping) into chunks and run one job per chunk (implementation is ongoing).
Use the scratch directive in critical processes. This helps avoid writing many files to central storage, since they are written to the compute node's temp space instead. However, it will not help if the process itself generates too many files while running.
Use the new job array implementation in Nextflow >=24.04. This makes our pipeline incompatible with older Nextflow versions (not a big issue in my mind). We need to test whether this reduces the number of files generated.
Check the data.table docs on whether and how tmp file generation can be limited.
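
As a rough illustration of the chunking strategy above, the shell sketch below splits a list of loci into fixed-size chunk files, so that one job can work through a whole chunk instead of one job per locus. The function name `chunk_loci`, the one-locus-per-line input file and the chunk size are assumptions made for illustration only; in the pipeline the same grouping could also be done at the channel level in Nextflow.

```shell
#!/usr/bin/env bash

# chunk_loci LOCI_FILE CHUNK_SIZE OUT_DIR   (hypothetical helper)
# Split a one-locus-per-line file into chunk files of at most
# CHUNK_SIZE lines each, named OUT_DIR/chunk_aa, chunk_ab, ...
chunk_loci() {
    local loci_file=$1 chunk_size=$2 out_dir=$3
    mkdir -p "$out_dir"
    # split -l writes consecutive blocks of chunk_size lines
    split -l "$chunk_size" "$loci_file" "$out_dir/chunk_"
}
```

For example, `chunk_loci loci.txt 100 chunks` turns a list of 250 loci into three chunk files (100, 100 and 50 loci), so three jobs and three work directories instead of 250.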

ariannalandini (Maintainer) · 2025-03-13T11:17:12Z

Another problem: large summary statistics files. data.table::fread() creates a huge temporary file (as big as the original one) while reading, which is then deleted automatically (if the read is successful).
This is an issue when the process is killed by the scheduler, because the temporary files accumulate on the compute node.

Implementations in place to mitigate this:

tmpdir=getwd() in fread() calls
Indexing with tabix (works for molecular phenotypes, to read in only the specific trait)
Munging and locus breaker collapsed into a single step, to avoid re-reading the same sumstat only for locus breaker

Questions:

Is there a way to delete temporary files when the process is killed (e.g. for OOM)?
Does fread() read in chunks? Would this be helpful?
Explore other frameworks as alternatives to data.table::fread() (e.g. https://www.tidyverse.org/blog/2019/05/vroom-1-0-0/)?
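
On the first question, one possible approach is a shell wrapper that gives each task its own temp directory and removes it when the task ends. The sketch below is only an illustration: `run_with_tmp_cleanup` is a hypothetical helper, and it assumes the R code takes its temp location from the `TMPDIR` environment variable (R's `tempdir()` does, and `fread()` uses `tempdir()` unless `tmpdir` is set explicitly, so a hard-coded `tmpdir=getwd()` would bypass it). It also only covers signals that can be caught, such as SIGTERM. A SIGKILL, which is what the Linux OOM killer sends, cannot be trapped, so in that case the files would still be left behind.

```shell
#!/usr/bin/env bash

# run_with_tmp_cleanup CMD [ARGS...]   (hypothetical helper)
# Run CMD with TMPDIR pointing to a private directory that is removed
# when the wrapper exits, whether normally, on error, or on SIGTERM/SIGINT.
# The function body is a subshell, so the traps and TMPDIR stay local.
run_with_tmp_cleanup() (
    job_tmp=$(mktemp -d "${PWD}/tmp.XXXXXX")
    # Remove the private temp directory whenever this subshell exits
    trap 'rm -rf "$job_tmp"' EXIT
    # Turn catchable signals into a normal exit so the EXIT trap runs
    trap 'exit 143' TERM
    trap 'exit 130' INT
    export TMPDIR="$job_tmp"
    "$@"
)
```

A call such as `run_with_tmp_cleanup Rscript munge.R` (script name made up here) would leave no `tmp.*` directory behind after a normal exit, a failure, or a SIGTERM from the scheduler. For kills that cannot be trapped, writing temp files to node-local scratch space remains the fallback.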

edg1983 (Maintainer, Author) · 2025-04-01T12:25:47Z

The new version of the munging script now uses readr::read_tsv, which does not create temp files; hence, this is solved.
@ariannalandini, you are also working on aggregating fine mapping in chunks, right?

An open question remains about experimenting with the job array, but it's not a priority anymore.

edg1983 (Maintainer, Author) · 2025-04-16T05:13:34Z

We agreed to replace all fread calls with readr::read_delim or readr::read_tsv. This will solve the tmp file issue without significantly increasing read time.
