In a recent study (https://dl.acm.org/doi/abs/10.1145/3597312) I noticed that the difference between the top-N (N = 15 or more) algorithms is insignificant on most datasets; they only differ on a small selection of the Friedman datasets. Maybe it is a good idea to separate the comparison of the algorithms into different groups:
Given this, my other proposal is to add the benchmarks from those two competitions, together with the one proposed by @MilesCranmer, to the benchmark suite. For the 2023 competition I can also generate datasets with different levels of noise and other nasty features! We could also grab other benchmark functions from multimodal optimization to create more of those.
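To illustrate the noise-level idea, here is a minimal sketch of how such datasets could be generated. The function name, sampling range, and noise model (Gaussian noise scaled to the target's standard deviation) are my assumptions, not anything from the competitions:

```python
import numpy as np

def make_noisy_dataset(f, n_samples=200, n_features=2, noise_level=0.1, seed=0):
    """Sample X uniformly on [-1, 1]^n_features and return (X, y),
    where y = f(X) plus Gaussian noise scaled by noise_level times
    the clean target's standard deviation. Hypothetical helper."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n_samples, n_features))
    y_clean = f(X)
    noise = rng.normal(0.0, noise_level * np.std(y_clean), size=y_clean.shape)
    return X, y_clean + noise

# Stand-in ground-truth expression (not one of the actual competition functions)
X, y = make_noisy_dataset(lambda X: np.sin(X[:, 0]) + X[:, 1] ** 2,
                          noise_level=0.25)
```

Sweeping `noise_level` over, say, {0, 0.01, 0.1, 0.5} would give one clean and several progressively noisier variants of each ground-truth function.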