
Datasets core #900

Open
ThomSerg wants to merge 180 commits into master from dataset_core

Conversation

@ThomSerg
Collaborator

@ThomSerg ThomSerg commented Apr 1, 2026

This is the first PR in a larger sequence of upcoming ones, bringing the work on datasets, IO and benchmarks from the benchmark_datasets branch into master. I tried to keep it as minimal as possible, with just a single dataset implementation: XCSP3Dataset. In some places you will notice placeholders / things that don't seem needed right now, but that future PRs will build upon. Some of these are labelled with "TODO".

@ThomSerg ThomSerg requested a review from OrestisLomis April 1, 2026 14:00
Contributor

@OrestisLomis OrestisLomis left a comment


I think it is really nice! Some questions about the class hierarchy remain for me, please have a look. A few more small comments as well.

Comment thread cpmpy/tools/datasets/core.py Outdated
}


class IndexedDataset(Dataset):
Contributor


Why is this functionality not directly in Dataset? I do not really see what kind of data would not be implementable in an indexable way. Is it, for instance, for generator datasets?

Collaborator Author


Any dataset in which the instances must be consumed in order (where random access is not possible). This can for example happen when instances are generated on the spot and this generation process is sequence dependent, e.g. it uses an initial random seed and then goes on from there. Or maybe a very large dataset that is for some reason in one very big file, where reading the entire file into memory is not efficient and it is better to just stream instance by instance. I do agree that for now we don't have a use case for it, but I wanted to prevent that in the future we or any other contributor couldn't add one of these alternative dataset types due to us hard-coding indexability into the base class. With a very generic and limited base class, these future additions remain possible.

Collaborator Author


After internal discussion, I removed the intermediate IndexedDataset. All datasets now inherit from Dataset, which requires both indexability and iterability to be implemented. Datasets that support only one of the two are more the exception than the norm, so one idea is to just keep the code simple. This will cause issues for those datasets (streaming datasets, generator-based datasets), though, since they will have to throw an exception that random access is not allowed. And it's not that nice of a design that the user has to take into account that index-based access could throw an exception for some datasets. In my opinion the extra code for keeping indexability out of the base dataset class was not a lot, but I can also see the other point of view in terms of complexity and maintainability.

Contributor


I do agree that forcing someone who wants to implement a Dataset to throw errors for abstract functions they don't need is maybe not the cleanest design, but I would argue it is also not necessary. Both the cases you mentioned can naturally be iterated over. The indexing is perhaps a little trickier and maybe not so clean, but still possible by naively starting from the beginning every time you want to index an instance. This can perhaps be improved with hashing/caching, but that is up to the implementor.
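The fallback described here could look roughly like this minimal sketch (class and method names are assumed for illustration, not taken from the PR's actual code):

```python
from abc import ABC, abstractmethod


class Dataset(ABC):
    """Sketch of a base dataset: iteration is the only required primitive."""

    @abstractmethod
    def __iter__(self):
        """Yield instances in their natural order."""
        ...

    def __getitem__(self, index):
        # Naive default indexing: restart iteration and walk forward.
        # O(n) per access, but even streaming/generator datasets stay indexable.
        if index < 0:
            raise IndexError(index)
        for i, instance in enumerate(self):
            if i == index:
                return instance
        raise IndexError(index)


class RangeDataset(Dataset):
    """Toy streaming dataset used only to illustrate the fallback."""

    def __iter__(self):
        yield from range(5)
```

A subclass that can do better (e.g. a file-backed dataset) simply overrides `__getitem__`; the caching suggested above can be layered on top of this default.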

Comment thread cpmpy/tools/datasets/core.py
Comment thread cpmpy/tools/datasets/core.py Outdated
Comment thread cpmpy/tools/datasets/core.py Outdated
print(f"Finished downloading {len(files)} instances")

files = self._list_instances()
if len(files) == 0:
Contributor


Can we check that the data is not tampered with? Maybe for starters, if `__len__` is actually hardcoded to a specific number, you can change this line to `len(files) != len(self)`. The tradeoff is that you have to add a lot of hardcoded values for each year/track/... combination for some benchmarks, which might not be desirable. What if the benchmark changes size for some reason?

TL;DR I think this is fine, but perhaps think a little more about data correctness guarantees.

Collaborator

@tias tias Apr 21, 2026


(response to Orestis' response:)

it's OK to work with what we have...

we check/reload the metadata for each instance on disk in any case, if I understand the code further down correctly

Collaborator

@tias tias Apr 21, 2026


(continued response to my response to Orestis' response)

(with what we have = if I download the data, and choose to delete some tracks to save space, I don't expect my code to complain but rather to just use what is on the disk)

Comment thread cpmpy/tools/datasets/core.py
Comment thread cpmpy/tools/datasets/core.py Outdated
Comment thread cpmpy/tools/datasets/core.py Outdated
Comment thread cpmpy/tools/datasets/xcsp3.py
@ThomSerg ThomSerg requested a review from OrestisLomis April 17, 2026 08:51
Contributor

@OrestisLomis OrestisLomis left a comment


I think it is alright as is now. See comment for more detailed discussion.


Collaborator

@tias tias left a comment


great!

also nicely extensive test-suite.

Some questions/remarks where I think it can be a bit cleaner/simpler, but you know better what is still coming, so responses very welcome.

import cpmpy as cp


def _format_bytes(bytes_num):
Collaborator


if you import typing I kind of expect typed functions... e.g. is this float or int?

bytes_num /= 1024.0


class classproperty:
Collaborator


that seems like sugar coating... do we care?

things like this can stop us from using mypyc in the future, while file parsing/loading is something where C code can really shine...
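For context, such a helper is typically a small descriptor like the sketch below (not the PR's actual code); the mypyc-friendly alternative is a plain classmethod that is called explicitly:

```python
class classproperty:
    """Minimal descriptor sketch: a read-only property computed on the class."""

    def __init__(self, fget):
        self.fget = fget

    def __get__(self, obj, owner):
        # Called on attribute access; `owner` is the class itself.
        return self.fget(owner)


class WithSugar:
    @classproperty
    def name(cls):
        return cls.__name__


class WithoutSugar:
    @classmethod
    def name(cls):
        return cls.__name__
```

`WithSugar.name` evaluates to `"WithSugar"` without parentheses, while the plain version needs `WithoutSugar.name()` — that call syntax is the only thing the sugar buys.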

Collaborator


(read other comments lower first)


data = self.parse(file_path)

if self.transform:
# TODO maybe revisit this flow of execution once CPMpy model feature extraction has been added
Collaborator


is this an old todo, or an actual todo for a soon-but-later PR? I think we describe the flow well in the paper, and there is no actual intended todo left? (one can always refactor anything, but that should not be written in the code)

fp = futures[future]
print(f"Error collecting metadata for {fp.name}: {e}")

def _collect_one_metadata(self, file_path):
Collaborator


euh... we have 'instance_metadata()', why do we need this? It looks duplicate... (it's also untyped)

filepath.parent.mkdir(parents=True, exist_ok=True)

req = Request(url)
with urlopen(req) as response:
Collaborator


aren't you overdoing it? If I understand this correctly, you implement your own pretty-print downloader, but I'm not sure I want to maintain a custom pretty printer in a combinatorial optimisation library...

class FromFilesDataset(FileDataset):
# Plain class attributes so that dataset_metadata() (a classmethod
# that reads cls.name / cls.description / ...) works correctly.
name = ""
Collaborator


I think this is the better, cleaner way... just plain class attributes, should be like that in the superclass too?

Collaborator


if it's required, you can check in the constructor that it should be non-empty...
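The suggested constructor check could look like this sketch (the class shape is assumed for illustration, not copied from the PR):

```python
class FromFilesDataset:
    """Sketch: plain class attributes plus a fail-fast constructor check."""

    name = ""  # subclasses must override with a non-empty value

    def __init__(self):
        # Fail fast if a subclass forgot to set the required attribute.
        if not self.name:
            raise ValueError(
                f"{type(self).__name__} must set a non-empty 'name' class attribute"
            )
```

A subclass that sets `name = "xcsp3"` instantiates fine, while one that forgets it fails immediately at construction rather than later at metadata time.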

return {}

def download(self) -> None:
raise NotImplementedError("from_files() datasets are already local; downloading is not supported.")
Collaborator


I would say `pass  # already in local files`? There is no error here...
