
[ENH] OpenmMLBenchmarkRunner class #1676

Draft
EmanAbdelhaleem wants to merge 1 commit into openml:main from EmanAbdelhaleem:benchmark

Conversation

@EmanAbdelhaleem
Contributor

This is a draft PR for creating an OpenmMLBenchmarkRunner class that handles running benchmarks in parallel instead of using for loops.
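The parallel-execution idea could be sketched roughly like this with the standard library. Note this is only an illustration of the approach, not the PR's actual class; the `run_task` callable and the result/error bookkeeping are placeholder assumptions:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_benchmark(tasks, run_task, max_workers=4):
    """Run `run_task` on every task in parallel, collecting results and errors.

    A failure in one task is recorded but does not abort the other tasks,
    unlike a plain for loop that would stop at the first exception.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_task, t): t for t in tasks}
        for fut in as_completed(futures):
            task = futures[fut]
            try:
                results[task] = fut.result()
            except Exception as exc:  # isolate the failure, keep going
                errors[task] = exc
    return results, errors
```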

@codecov-commenter

Codecov Report

❌ Patch coverage is 0% with 55 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.15%. Comparing base (ede1497) to head (c413501).

Files with missing lines                | Patch % | Lines
openml/benchmarks/benchmark_runner.py   | 0.00%   | 55 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1676      +/-   ##
==========================================
- Coverage   52.80%   52.15%   -0.66%     
==========================================
  Files          37       38       +1     
  Lines        4363     4418      +55     
==========================================
  Hits         2304     2304              
- Misses       2059     2114      +55     

☔ View full report in Codecov by Sentry.

@EmanAbdelhaleem
Contributor Author

EmanAbdelhaleem commented Feb 23, 2026

@jgyasu @fkiraly I need a review here.
I also have 2 questions regarding the pause and resume methods:

  1. Should the pause and resume methods be called when the user presses a keyboard key, like P for pause and R for resume? Or do you have a different workflow in mind?
  2. If the user pauses while a task is in progress, should the task finish before pausing, or should it be interrupted and discarded?

@jgyasu
Contributor

jgyasu commented Feb 23, 2026

I haven't reviewed the code yet but I will answer your question for now,

Should the pause and resume methods be called when the user presses a keyboard key, like P for pause and R for resume? Or do you have a different workflow in mind?

The goal is to make the benchmark runner robust to crashes or estimator failures. For example, if the 4th estimator fails during execution, the entire benchmark run currently breaks, and previously computed results may not be safely stored (correct me if I am wrong). Instead, the runner should persist results locally and incrementally after each successful run.

This would allow us to:

  • Continuously save results as they are produced.
  • Avoid losing progress if a failure occurs.
  • Restart the benchmark later and automatically skip already completed runs.
  • Execute only the remaining combinations.

So, this is more about checkpointing and resumability than about manual pause/resume. Of course, you can then also let the user pause the experiment with a keyboard interrupt, and in that case checkpointing and resumability would still work. @fkiraly do you want to add anything?
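A rough sketch of the checkpoint-and-skip behaviour described above, assuming string run keys and a hypothetical `run_one` callable (neither is the PR's actual API): results are persisted to a local JSON file after each success, and a restarted session skips keys already present in the file.

```python
import json
from pathlib import Path

def run_with_checkpointing(combinations, run_one, checkpoint="results.json"):
    """Run each combination once, persisting results after every success."""
    path = Path(checkpoint)
    # Load results from a previous (possibly interrupted) session.
    done = json.loads(path.read_text()) if path.exists() else {}
    for key in combinations:  # keys must be JSON-serializable strings
        if key in done:       # completed in an earlier session: skip it
            continue
        try:
            done[key] = run_one(key)
        except Exception:
            continue          # one failure must not lose earlier progress
        path.write_text(json.dumps(done))  # checkpoint after each success
    return done
```

On restart, only the remaining (or previously failed) combinations execute, which is exactly the "execute only the remaining combinations" point above.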

@EmanAbdelhaleem
Contributor Author

this is more about checkpointing

As far as I know, we currently deal with classical models, right? So wouldn't checkpointing be overkill?

Also, what exactly do you mean by storing the results? Is it saving the output that is printed to the terminal, or something else?
As far as I know, the current benchmarking behaviour is as follows:

  1. Select a benchmarking Suite
  2. Iterate over the tasks
  3. Run model on task
  4. Store the results by publishing the runs (if you want)

So, storing is done by publishing, and you can publish each run as soon as it finishes; you don't need to wait until the whole experiment is done. The problem with looping is that if one run fails, the whole benchmark experiment crashes. I solved this by using threading instead of looping.

So, I am not sure what exactly you mean by saving results as they are produced. You can already do this by publishing the run, right?

@jgyasu
Contributor

jgyasu commented Feb 23, 2026

As far as I know, we currently deal with classical models, right? So wouldn't checkpointing be overkill?

What does checkpointing have to do with classical models? I did not get you.

So, I am not sure what exactly you mean by saving results as they are produced. You can already do this by publishing the run, right?

What if there is an error in publishing? What happens when the internet connection breaks?

Also, what exactly do you mean by storing the results? Is it saving the output that is printed to the terminal, or something else?

Storing the results of the benchmark runs on local disk, maybe in JSON or CSV format (design question). The results should then be loadable in-memory as dataframes.
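As a sketch of that design option, results could be appended to a local CSV as they are produced and read back afterwards. The field names below are illustrative assumptions only; a real design would pin down the schema:

```python
import csv
from pathlib import Path

FIELDS = ("task_id", "estimator", "score")  # illustrative schema

def append_result(path, row):
    """Append one benchmark result to a CSV file, writing the header once."""
    file = Path(path)
    is_new = not file.exists()
    with file.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)

def load_results(path):
    """Read all stored results back as a list of dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))
```

The same file can then be loaded in-memory as a dataframe with `pandas.read_csv(path)`.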

@EmanAbdelhaleem
Contributor Author

EmanAbdelhaleem commented Feb 23, 2026

What does checkpointing have to do with classical models? I did not get you.

As far as I know, checkpointing is about saving the model's state. This is necessary with deep learning, where training usually takes time, but with classical models a single run usually doesn't take long, so I think checkpointing might be overkill. Let me know what you think.

What if there is an error in publishing? What happens when the internet connection breaks?

I think it would simply not get published then. You can still save the results of a run locally, even with the model pickled, using run.to_filesystem(path). However, it's not saved as a dataframe; it's serialized so it can be uploaded later via OpenMLRun.from_filesystem(path), which instantiates an OpenMLRun object from the saved files.

I think what I can automate here is the path passed to these methods. What do you think?
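One possible way to automate that path, purely as an illustration: derive a deterministic directory per (task, flow) pair and pass it to run.to_filesystem / OpenMLRun.from_filesystem. The `checkpoint_dir` helper and its naming scheme are assumptions for this sketch, not the PR's behaviour:

```python
from pathlib import Path

def checkpoint_dir(root, task_id, flow_name):
    """Derive a deterministic per-run directory, e.g. root/task_31/RandomForest.

    Because the path is a pure function of (task_id, flow_name), a resumed
    session recomputes the same location and can find the serialized run.
    """
    d = Path(root) / f"task_{task_id}" / flow_name
    d.mkdir(parents=True, exist_ok=True)
    return d
```

A run would then be saved with `run.to_filesystem(str(checkpoint_dir(root, task.task_id, flow_name)))` and restored from the same computed path, without the user managing paths by hand.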

