Conversation
OrestisLomis
left a comment
I think it is really nice! Some questions about the class hierarchy remain for me, please have a look. A few more small comments as well.
class IndexedDataset(Dataset):
Why is this functionality not directly in Dataset? I do not really see what kind of data would not be implementable in an indexable way. Is it for instance generator datasets?
Any dataset in which the instances must be consumed in order (where random access is not possible). This can for example happen when instances are generated on the spot and this generation process is sequence-dependent, e.g. it uses an initial random seed and then goes on from there. Or maybe a very large dataset that is for some reason in one very big file, where reading the entire file into memory is not efficient and it is better to just stream instance by instance. I do agree that for now we don't have a use case for it, but I wanted to prevent a situation where, in the future, we or any other contributor could not add one of these alternative dataset types because we hard-coded indexability into the base class. With a very generic and limited base class, these future additions remain possible.
After internal discussion, I removed the intermediate IndexedDataset. All datasets now inherit from Dataset, which requires implementations for both indexability and iterability. Datasets that are only one of the two are more the exception than the norm, so one idea is to just keep the code simple. This will cause issues for those datasets (streaming datasets, generator-based datasets), though, since they will have to throw an exception when random access is attempted. And it's not that nice of a design that the user has to take into account that index-based access could throw an exception for some datasets. In my opinion, the extra code previously needed to keep indexability out of the base dataset class was not much, but I can also see the other point of view in terms of complexity and maintainability.
I do agree that forcing someone who wants to implement a Dataset to throw errors for abstract functions they don't need is maybe not the cleanest design, however I would argue it is also not necessary. Both the cases you mentioned can naturally be iterated over; the indexing is perhaps a little trickier and maybe not so clean, but still possible by naively restarting from the beginning every time you want to index an instance. This can perhaps be improved with hashing/caching, but that is up to the implementor.
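The naive fallback indexing described above could be sketched like this (illustrative names and a toy generator-backed dataset, not the actual PR code):

```python
from abc import ABC, abstractmethod

class Dataset(ABC):
    """Minimal base class: iteration is required, indexing has a naive fallback."""

    @abstractmethod
    def __iter__(self):
        """Yield instances in order; every dataset must implement this."""
        ...

    def __getitem__(self, index):
        # Naive fallback for stream-style datasets: restart iteration and
        # consume instances until the requested position is reached.
        # Subclasses with true random access should override this.
        for i, instance in enumerate(self):
            if i == index:
                return instance
        raise IndexError(index)


class RangeStream(Dataset):
    """Toy dataset whose instances are generated on the fly, sequentially."""

    def __iter__(self):
        yield from (x * x for x in range(10))


print(RangeStream()[3])  # falls back to iterating from the start -> 9
```

Caching the already-consumed prefix, as mentioned, could then be layered on top of this `__getitem__` without changing the interface.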
print(f"Finished downloading {len(files)} instances")
files = self._list_instances()
if len(files) == 0:
Can we check that the data is not tampered with? Maybe for starters, if __len__ is actually hardcoded to be a specific number, you can change this line to check len(files) != len(self). The tradeoff is that you have to add a lot of hardcoded values for each year/track/... combination for some benchmarks, which might not be desirable. What if the benchmark changes size for some reason?
TL;DR I think this is fine, but perhaps think a little about more data correctness guarantees
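A sketch of what such a check could look like (hypothetical helper and signature, not the PR's code; warning instead of failing hard reflects the preference, voiced further down, for tolerating tracks that the user deleted locally):

```python
from typing import Optional

def check_downloaded(files: list, expected_len: Optional[int] = None) -> None:
    """Hypothetical integrity check for a finished download.

    expected_len would come from hardcoded per-benchmark sizes when known.
    """
    if len(files) == 0:
        raise RuntimeError("No instances found; the download may have failed.")
    if expected_len is not None and len(files) != expected_len:
        # A user may have deleted some tracks to save space, so only warn
        # about the mismatch instead of raising.
        print(f"Warning: found {len(files)} instances, expected {expected_len}.")
```

Stronger guarantees (e.g. per-file checksums) would need extra metadata shipped with each benchmark.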
(response to Orestis' response:)
its OK to work with what we have...
we check/reload the metadata for each instance on disk in any case if I understand the code lower well
(continued response to my response to Orestis' response)
(with what we have = if I download the data and choose to delete some tracks to save space, I don't expect my code to complain, but rather to just use what is on the disk)
OrestisLomis
left a comment
I think it is alright as is now. See comment for more detailed discussion.
tias
left a comment
great!
also nicely extensive test-suite.
Some questions/remarks where I think it can be a bit cleaner/simpler, but you know better what is still coming, so responses are very welcome.
import cpmpy as cp
def _format_bytes(bytes_num):
if you import typing I kind of expect typed functions... e.g. is this a float or an int?
bytes_num /= 1024.0
class classproperty:
that seems like sugar coating... do we care?
things like this can stop us from using mypyc in the future, while file parsing/loading is something where C code can really shine...
(read other comments lower first)
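A descriptor-free alternative to the classproperty sugar discussed above, using plain class attributes plus a classmethod (illustrative sketch with made-up class names, not the PR's actual code):

```python
class DatasetBase:
    # Plain class attributes instead of a custom classproperty descriptor:
    # standard constructs that tools like mypyc handle well.
    name: str = ""
    description: str = ""

    @classmethod
    def dataset_metadata(cls) -> dict:
        # A classmethod gives the same "no instance needed" access as a
        # classproperty, at the cost of parentheses at the call site.
        return {"name": cls.name, "description": cls.description}


class XCSP3Dataset(DatasetBase):
    name = "XCSP3"
    description = "XCSP3 competition instances"


print(XCSP3Dataset.dataset_metadata())
```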
data = self.parse(file_path)
if self.transform:
    # TODO maybe revisit this flow of execution once CPMpy model feature extraction has been added
is this an old todo, or an actual todo for a soon-but-later PR? I think we describe the flow well in the paper, and there is no actual intended todo left? (one can always refactor anything, but that should not be written in the code)
fp = futures[future]
print(f"Error collecting metadata for {fp.name}: {e}")
def _collect_one_metadata(self, file_path):
euh... we have 'instance_metadata()', why do we need this? it looks like a duplicate... (it's also untyped)
filepath.parent.mkdir(parents=True, exist_ok=True)
req = Request(url)
with urlopen(req) as response:
aren't you overdoing it? If I understand this correctly, you implemented your own pretty-print downloader, but I'm not sure I want to maintain a custom pretty printer in a combinatorial optimisation library...
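For comparison, a minimal downloader built on the standard library's urlretrieve, which already handles the read loop and accepts an optional reporthook callback for progress (hypothetical helper, not the PR's implementation):

```python
from pathlib import Path
from urllib.request import urlretrieve

def download_file(url: str, filepath: Path) -> Path:
    """Download url to filepath using only the standard library."""
    filepath.parent.mkdir(parents=True, exist_ok=True)
    # urlretrieve does the buffered copy; pass a reporthook callback here
    # if progress reporting is wanted, instead of a hand-rolled printer.
    urlretrieve(url, filepath)
    return filepath
```

This drops the per-chunk pretty-printing, which is exactly the maintenance burden being questioned above.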
class FromFilesDataset(FileDataset):
    # Plain class attributes so that dataset_metadata() (a classmethod
    # that reads cls.name / cls.description / ...) works correctly.
    name = ""
I think this is the better, cleaner way... just plain class attributes, should be like that in the superclass too?
if it's required, you can check in the constructor that it is non-empty...
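Such a constructor check might look like this (illustrative class names, not the PR's code):

```python
class FileDataset:
    # Plain class attribute; concrete datasets are expected to override it.
    name: str = ""

    def __init__(self):
        # Enforce the "required, non-empty" contract at construction time,
        # as suggested above.
        if not self.name:
            raise ValueError(f"{type(self).__name__} must define a non-empty 'name'")


class XCSP3Dataset(FileDataset):
    name = "XCSP3"


XCSP3Dataset()  # ok: name is set; instantiating FileDataset itself would raise
```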
return {}
def download(self) -> None:
    raise NotImplementedError("from_files() datasets are already local; downloading is not supported.")
I would replace this with pass # already in local files? There is no error here...
A first PR in a larger sequence of upcoming ones, bringing the work on datasets, IO and benchmarks from the branch
benchmark_datasets into master. Tried to keep it as minimal as possible, also with just a single dataset implementation: XCSP3Dataset. In some places you will notice some placeholders / things that don't seem needed right now but that future PRs will build upon. Some of these are labelled with "TODO".