7.16.3 Performance Update (with logs!) #4363
Replies: 5 comments
-
Thanks @cmoore24-24, we'll take a look!
-
On a first look at trijet_B_7163, a lot of workers are being lost. The maximum number of concurrently connected workers is 118, but 1103 workers connected in total. The workflow runs for ~800 minutes, and most of the disconnections occur after 500 minutes, which matches your experience of the workflow slowing down at the end. The number of waiting tasks also shoots up after minute 600, which corresponds to a high number of recovery tasks. Overall, 470902 tasks completed, but 449174 recovery tasks were submitted. There are only 24065 distinct temp files, but 477990 tasks were submitted to produce them. So we need to determine why the workers are being lost. As long as a worker is connected it is busy, so workers are not disconnecting because they are idle.
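To put those counts in proportion, here is a minimal sketch that only does arithmetic on the figures quoted above. The variable names are made up for illustration; nothing here reads the actual logs.

```python
# Figures quoted in the comment above for the trijet_B_7163 run.
max_concurrent_workers = 118
total_workers_connected = 1103
tasks_completed = 470_902
recovery_tasks_submitted = 449_174
distinct_temp_files = 24_065
temp_file_task_submissions = 477_990

# Worker connections needed per concurrently available worker slot.
worker_turnover = total_workers_connected / max_concurrent_workers

# Recovery tasks submitted relative to all completed tasks.
recovery_ratio = recovery_tasks_submitted / tasks_completed

# Average number of task submissions per distinct temp file.
submissions_per_file = temp_file_task_submissions / distinct_temp_files

print(f"worker turnover: {worker_turnover:.1f}x")
print(f"recovery tasks per completed task: {recovery_ratio:.2f}")
print(f"submissions per temp file: {submissions_per_file:.1f}")
```

That works out to roughly 9.3 worker connections per slot, about 0.95 recovery tasks for every completed task, and close to 20 submissions per temp file on average.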
-
A little more data: of the 24065 distinct temp files, only 4104 did not need a recovery task. Some needed 140 recovery tasks or more.
-
I think I've fixed (at least a large part of) the issue. Here's a set of logs that should look a lot healthier than the previous ones. The problem seems to have been with
-
@cmoore24-24 thanks for the logs! Most tasks run for less than 3 minutes, but about 1% run for more than 30 minutes. It is these tasks that are the bottleneck of the workflow, and you may benefit from reducing the chunksize. I think it is worth a try. The longest task runs for about 75 minutes, almost the entire duration of the workflow, and it would have been very costly if the worker running it had been lost.
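As a rough illustration of why a smaller chunksize may shorten the longest tasks, here is a minimal sketch. It assumes a task's runtime scales linearly with the number of events in its chunk, which may not hold for this workflow. The function name `chunk_plan`, the event count, and the per-event cost are all hypothetical; they are not taken from the logs or from any TaskVine or Coffea API.

```python
import math

def chunk_plan(total_events: int, chunksize: int, seconds_per_event: float):
    """Return (number_of_tasks, longest_task_minutes) for a given chunksize.

    Assumes runtime is proportional to the events in a chunk (an assumption).
    """
    n_tasks = math.ceil(total_events / chunksize)
    longest_chunk = min(chunksize, total_events)
    longest_minutes = longest_chunk * seconds_per_event / 60
    return n_tasks, longest_minutes

# Hypothetical numbers chosen only for illustration.
total_events = 10_000_000
seconds_per_event = 0.009  # hypothetical cost for a slow input

for chunksize in (500_000, 250_000, 100_000):
    n_tasks, longest = chunk_plan(total_events, chunksize, seconds_per_event)
    print(f"chunksize={chunksize:>7}: {n_tasks:>4} tasks, longest ~{longest:.1f} min")
```

Under these assumptions, halving the chunksize doubles the number of tasks and halves the longest task, so less work is at risk when a worker is lost. The trade-off is more per-task overhead, so the best value is something to find by experiment.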
-
Hello! I've been working with the most recent version of TaskVine from cctools (7.16.3), and wanted to share a little about how it has been going (slightly) ahead of the computing meeting on Feb 26th.
I have logs from a few different at-scale runs that I've placed in a tar file in my Notre Dame Google Drive (link: https://drive.google.com/drive/folders/1_JHwKW8-2IhcuE-3UV4vynyslm4f5Tui?usp=sharing). There are four runs in this tar file: three were run with 7.16.3, and one was run with 7.15.10 (the version I had been using previously) as a kind of control.
I think the biggest takeaway is that the performance seems good during the run, but the of the application on the final few tasks of the run seems more aggressive than in any previous version I've used. Two things seemed to always happen:
I think of particular note is the log from the directory trijet_B. This run went for nearly a full day before I decided to manually pull the plug on it. Here are a few screenshots showing how this run ended:
As you can see, the run had reached the point where it only had 8 tasks left according to the progress bar, but was running 1200 tasks with nearly 10000 still waiting, with that number bouncing between 7000 and 12000 for a couple of hours prior to this point. This is on top of nearly 470,000 tasks having been completed throughout the run. I've included the status page for this manager at the time I ended the run for added context.
Hopefully this is helpful info, let me know if there is anything else I can do or information I can provide. Thanks for all the help!