
Datasets core #900

Open
ThomSerg wants to merge 180 commits into master from dataset_core

Conversation

@ThomSerg
Collaborator

@ThomSerg ThomSerg commented Apr 1, 2026

This is the first PR in a larger sequence of upcoming ones, bringing the work on datasets, IO and benchmarks from the benchmark_datasets branch into master. I tried to keep it as minimal as possible, with just a single dataset implementation: XCSP3Dataset. In some places you will notice placeholders / things that don't seem needed right now, but that future PRs will build upon. Some of these are labelled with "TODO".

@ThomSerg ThomSerg requested a review from OrestisLomis April 1, 2026 14:00
Contributor

@OrestisLomis OrestisLomis left a comment


I think it is really nice! Some questions about the class hierarchy remain for me, please have a look. A few more small comments as well.

Comment thread cpmpy/tools/datasets/core.py Outdated
}


class IndexedDataset(Dataset):
Contributor


Why is this functionality not directly in Dataset? I do not really see what kind of data would not be implementable in an indexable way. Is it, for instance, for generator datasets?

Collaborator Author


Any dataset in which the instances must be consumed in order (where random access is not possible). This can for example happen when instances are generated on the spot and this generation process is sequence dependent, e.g. it uses an initial random seed and then goes on from there. Or maybe a very large dataset that is for some reason in one very big file, where reading the entire file into memory is not efficient and it is better to just stream instance by instance. I do agree that for now we don't have a use case for it, but I wanted to prevent that in the future we or any other contributor couldn't add one of these alternative dataset types due to us hard-coding indexability into the base class. With a very generic and limited base class, these future additions remain possible.

Collaborator Author


After internal discussion, I removed the intermediate IndexedDataset. All datasets now inherit from Dataset, which requires both indexability and iterability to be implemented. Datasets that support only one of the two are more the exception than the norm, so one idea is to just keep the code simple. This will cause issues for those datasets (streaming datasets, generator-based datasets), though, since they will have to throw an exception that random access is not allowed. And it's not that nice of a design that the user has to take into account that index-based access could throw an exception for some datasets. In my opinion the extra code for keeping indexability out of the base dataset class was not a lot, but I can also see the other point of view in terms of complexity and maintainability.

Contributor


I do agree that forcing someone who wants to implement a Dataset to throw errors for abstract functions they don't need is maybe not the cleanest design, but I would argue it is also not necessary. Both the cases you mentioned can naturally be iterated over. The indexing is perhaps a little trickier and maybe not so clean, but still possible by naively starting from the beginning every time you want to index an instance. This can perhaps be improved with hashing/caching, but that is up to the implementor.
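The fallback described here could look roughly like this minimal sketch (class and method names are assumed for illustration, not taken from the PR's actual code):

```python
from abc import ABC, abstractmethod


class Dataset(ABC):
    """Sketch of a base dataset: iteration is the only required primitive."""

    @abstractmethod
    def __iter__(self):
        """Yield instances in their natural order."""
        ...

    def __getitem__(self, index):
        # Naive default indexing: restart iteration and walk forward.
        # O(n) per access, but even streaming/generator datasets stay indexable.
        if index < 0:
            raise IndexError(index)
        for i, instance in enumerate(self):
            if i == index:
                return instance
        raise IndexError(index)


class RangeDataset(Dataset):
    """Toy streaming dataset used only to illustrate the fallback."""

    def __iter__(self):
        yield from range(5)
```

A subclass that can do better (e.g. a file-backed dataset) simply overrides `__getitem__`; the caching suggested above can be layered on top of this default.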

Comment thread cpmpy/tools/datasets/core.py
Comment thread cpmpy/tools/datasets/core.py Outdated
Comment thread cpmpy/tools/datasets/core.py Outdated
print(f"Finished downloading {len(files)} instances")

files = self._list_instances()
if len(files) == 0:
Contributor


Can we check that the data is not tampered with? Maybe for starters, if `__len__` is actually hardcoded to a specific number, you can change this line to `len(files) != len(self)`. The tradeoff is that you have to add a lot of hardcoded values for each year/track/... combination for some benchmarks, which might not be desirable. What if the benchmark changes size for some reason?

TL;DR I think this is fine, but perhaps think a little more about data correctness guarantees.

Collaborator

@tias tias Apr 21, 2026


(response to Orestis' response:)

it's OK to work with what we have...

we check/reload the metadata for each instance on disk in any case, if I understand the code further down correctly

Collaborator

@tias tias Apr 21, 2026


(continued response to my response to Orestis' response)

(with what we have = if I download the data, and choose to delete some tracks to save space, I don't expect my code to complain but rather to just use what is on the disk)

Comment thread cpmpy/tools/datasets/core.py
Comment thread cpmpy/tools/datasets/core.py Outdated
Comment thread cpmpy/tools/datasets/core.py Outdated
Comment thread cpmpy/tools/datasets/xcsp3.py
@ThomSerg ThomSerg requested a review from OrestisLomis April 17, 2026 08:51
Contributor

@OrestisLomis OrestisLomis left a comment


I think it is alright as is now. See comment for more detailed discussion.


Collaborator

@tias tias left a comment


great!

also nicely extensive test-suite.

Some questions/remarks where I think it can be a bit cleaner/simpler, but you know better what is still coming, so responses very welcome.

import cpmpy as cp


def _format_bytes(bytes_num):
Collaborator


if you import typing I kind of expect typed functions... e.g. is this float or int?

bytes_num /= 1024.0


class classproperty:
Collaborator


that seems like sugar coating... do we care?

things like this can stop us from using mypyc in the future, while file parsing/loading is something where C code can really shine...
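For context, such a helper is typically a small descriptor like the sketch below (not the PR's actual code); the mypyc-friendly alternative is a plain classmethod that is called explicitly:

```python
class classproperty:
    """Minimal descriptor sketch: a read-only property computed on the class."""

    def __init__(self, fget):
        self.fget = fget

    def __get__(self, obj, owner):
        # Called on attribute access; `owner` is the class itself.
        return self.fget(owner)


class WithSugar:
    @classproperty
    def name(cls):
        return cls.__name__


class WithoutSugar:
    @classmethod
    def name(cls):
        return cls.__name__
```

`WithSugar.name` evaluates to `"WithSugar"` without parentheses, while the plain version needs `WithoutSugar.name()` — that call syntax is the only thing the sugar buys.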

Collaborator


(read other comments lower first)


data = self.parse(file_path)

if self.transform:
# TODO maybe revisit this flow of execution once CPMpy model feature extraction has been added
Collaborator


is this an old todo, or an actual todo for a soon-but-later PR? I think we describe the flow well in the paper, and there is no actual intended todo left? (one can always refactor anything, but that should not be written in the code)

fp = futures[future]
print(f"Error collecting metadata for {fp.name}: {e}")

def _collect_one_metadata(self, file_path):
Collaborator


euh... we have 'instance_metadata()', why do we need this? It looks duplicate... (it's also untyped)

filepath.parent.mkdir(parents=True, exist_ok=True)

req = Request(url)
with urlopen(req) as response:
Collaborator


aren't you overdoing it? If I understand this correctly, you implement your own pretty-print downloader, but I'm not sure I want to maintain a custom pretty printer in a combinatorial optimisation library...

class FromFilesDataset(FileDataset):
# Plain class attributes so that dataset_metadata() (a classmethod
# that reads cls.name / cls.description / ...) works correctly.
name = ""
Collaborator


I think this is the better, cleaner way... just plain class attributes, should be like that in the superclass too?

Collaborator


if it's required, you can check in the constructor that it should be non-empty...
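The suggested constructor check could look like this sketch (the class shape is assumed for illustration, not copied from the PR):

```python
class FromFilesDataset:
    """Sketch: plain class attributes plus a fail-fast constructor check."""

    name = ""  # subclasses must override with a non-empty value

    def __init__(self):
        # Fail fast if a subclass forgot to set the required attribute.
        if not self.name:
            raise ValueError(
                f"{type(self).__name__} must set a non-empty 'name' class attribute"
            )
```

A subclass that sets `name = "xcsp3"` instantiates fine, while one that forgets it fails immediately at construction rather than later at metadata time.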

return {}

def download(self) -> None:
raise NotImplementedError("from_files() datasets are already local; downloading is not supported.")
Collaborator


I would say `pass  # already in local files`? There is no error here...
