[ENH] OpenmMLBenchmarkRunner class #1676
Conversation
Codecov Report
❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1676      +/-   ##
==========================================
- Coverage   52.80%   52.15%   -0.66%
==========================================
  Files          37       38       +1
  Lines        4363     4418      +55
==========================================
  Hits         2304     2304
- Misses       2059     2114      +55

☔ View full report in Codecov by Sentry.
@jgyasu @fkiraly I need a review here.
I haven't reviewed the code yet, but I will answer your question for now.
The goal is to make the benchmark runner robust to crashes or estimator failures. For example, if the 4th estimator fails during execution, the entire benchmark run currently breaks, and previously computed results may not be safely stored (correct me if I am wrong?). Instead, the runner should persist results incrementally to local storage after each successful run. This would allow us to:
So, this is more about checkpointing and resumability than about manual pause/resume. Of course, you could then argue that the user can also pause the experiment with a keyboard interrupt, and in that case the checkpointing and resumability would work too. @fkiraly do you want to add anything?
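A minimal sketch of the checkpoint-and-resume idea described above. All names here (`run_all`, `run_one`, the `results.jsonl` file and its `name`/`score` fields) are hypothetical illustrations, not the actual API of this PR:

```python
# Sketch: persist each result as soon as its run finishes, and skip
# already-completed runs on restart. Names and file format are assumptions.
import json
from pathlib import Path


def run_all(estimators, run_one, checkpoint_path="results.jsonl"):
    """Run each benchmark, checkpointing every result incrementally.

    Completed runs found in the checkpoint file are skipped, so an
    interrupted experiment can simply be restarted.
    """
    path = Path(checkpoint_path)
    done = set()
    if path.exists():
        with path.open() as f:
            done = {json.loads(line)["name"] for line in f}

    results = []
    with path.open("a") as f:
        for name in estimators:
            if name in done:
                continue  # resume: this run finished before the interruption
            try:
                score = run_one(name)
            except Exception as exc:
                # one failing estimator must not abort the whole experiment
                print(f"{name} failed: {exc}")
                continue
            record = {"name": name, "score": score}
            f.write(json.dumps(record) + "\n")  # persist immediately
            results.append(record)
    return results
```

Because each record is written the moment its run succeeds, a crash (or a keyboard interrupt) loses at most the run that was in flight.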
As far as I know, we deal with classical models currently, right? So wouldn't checkpointing be too much? Also, what exactly do you mean by storing the results? Is it saving the output that is printed on the terminal, or something else?
So, storing is done by publishing, and you can publish each single run when it's finished; you don't need to wait until the whole experiment is done. The problem with looping is that if some run fails, the entire benchmark experiment crashes. I solved this by using threading instead of looping. So, I am not sure what exactly you mean by
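A hedged sketch of the "threading instead of looping" point: each run executes in its own worker, exceptions are captured per future instead of crashing the experiment, and each successful run is published as soon as it completes. The names `run_parallel`, `run_one`, and `publish` are illustrative, not this PR's actual interface:

```python
# Sketch: isolate per-run failures with a thread pool rather than a for loop.
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_parallel(estimators, run_one, publish, max_workers=4):
    """Run benchmarks concurrently; a failed run is recorded, not fatal."""
    outcomes = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_one, name): name for name in estimators}
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                result = fut.result()  # re-raises the worker's exception, if any
                publish(name, result)  # publish each run as soon as it finishes
                outcomes[name] = ("ok", result)
            except Exception as exc:
                # the failure is captured here instead of aborting the loop
                outcomes[name] = ("failed", str(exc))
    return outcomes
```

Note that the same failure isolation could also be achieved in a plain loop with try/except; threading additionally overlaps the runs.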
What does checkpointing have to do with classical models? I did not get you.
What if there is an error in publishing? What happens when the internet connection breaks?
Storing the results of the benchmark runs on local disk, maybe in JSON or CSV format (design question). The results should then be loadable in memory as dataframes.
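One way the design question above could go, sketched with the standard library: append each run's result to a CSV file, which `pandas.read_csv` can later load as a DataFrame. The file name and the columns (`estimator`, `dataset`, `score`) are hypothetical:

```python
# Sketch: append-only CSV persistence for benchmark results.
import csv
from pathlib import Path


def append_result(path, record, fieldnames=("estimator", "dataset", "score")):
    """Append one benchmark result, writing the header only on first use."""
    p = Path(path)
    new_file = not p.exists()
    with p.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if new_file:
            writer.writeheader()
        writer.writerow(record)


def load_results(path):
    """Load saved results back into memory as a list of dicts.

    With pandas available, pandas.read_csv(path) would give a DataFrame
    directly, matching the "loadable as dataframes" requirement above.
    """
    with Path(path).open(newline="") as f:
        return list(csv.DictReader(f))
```

JSON Lines would work equally well and handles nested metadata better; CSV has the advantage of loading into a DataFrame with no custom parsing.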
As far as I know, checkpointing is about saving the model's state. This is necessary with deep learning, as training usually takes time, but with classical models a single run usually doesn't take much time. So, I think checkpointing might be too much. Let me know what you think.
I think it is not going to get published then? You can still save the results of the run locally, even with the model pickled, if you use the method. I think what I can automate here is the
This is a draft PR for creating an OpenmMLBenchmarkRunner class that handles running benchmarks in parallel instead of using for loops.