phohenecker/country-data-gen

Country Data Generator

This repository contains the implementation of a tool for generating toy datasets that pose reasoning tasks about countries. It was created as part of the work for the following paper and was used to generate a dataset for one of the experiments reported there:

Patrick Hohenecker and Thomas Lukasiewicz.
Ontology Reasoning with Deep Neural Networks.
Preprint at https://arxiv.org/abs/1808.07980 (2018).

You are very welcome to use the generator to create data for your own research. If you do, please make sure to cite the paper above. For any questions about the paper or the code provided here, feel free to contact us via e-mail.

Important: The datasets generated by this tool are based on the countries knowledge base, which was introduced by Bouchard et al. (2015). That knowledge base has been published under the Open Database License, which you should familiarize yourself with before using the data generator.

Note: Any data created by the country data generator are written to disk in the rel-data format, which is specified in detail here.

Inference Task

The countries knowledge base contains information about countries, regions, and subregions. It specifies, on the one hand, the neighbors of each country (which we formalize by means of the relation neighborOf) and, on the other hand, the subregion and region that a country is located in (expressed using the predicate locatedIn). Nickel et al. (2016) introduced a corresponding learning task, in which some of the locatedIn relations are missing and thus have to be predicted based on what is known about the respective neighborhoods. To create datasets, the country data generator closely follows the instructions in that paper. There are a few aspects to note, however:

  • We endowed the data in the countries knowledge base with an ontology that formalizes those inferences that can be drawn with certainty. It is available as an answer set program here.
  • Not all of the missing locatedIn relations can be inferred with certainty, though. This is reflected by the fact that some of them are written to the corresponding relations.data.inf files, while others are stored in relations.data.pred.
  • In addition to what is specified in the countries knowledge base, we introduced three unary predicates that describe the different types of individuals that appear in the data: country, region, subregion. All of these have to be inferred as part of the learning task, and are never provided as facts. Again, not all of them can be inferred with certainty in every case.
  • The individuals that represent regions and subregions, respectively, are always considered as part of the dev/test set.

Just like the original paper, the generator considers three different versions of the problem: S1 (easy), S2, and S3 (hard). For additional details about the generation process, please refer to Nickel et al. (2016) as well as the methods section of our own paper.
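To make the notion of a locatedIn relation that "can be inferred with certainty" concrete, here is a minimal Python sketch. It assumes that locatedIn is transitive, so that a missing country-to-region fact follows from the country's subregion. This is an illustrative assumption only: the authoritative rules are those in the answer set program mentioned above, and the individuals and the helper `transitive_closure` are made up for this example rather than taken from the generator.

```python
# Toy facts (illustrative only, not read from any generated dataset).
located_in = {
    ("austria", "western_europe"),  # country -> subregion
    ("western_europe", "europe"),   # subregion -> region
}


def transitive_closure(pairs):
    """Return the transitive closure of a binary relation given as a set of pairs."""
    closure = set(pairs)
    while True:
        # Chain every pair (a, b) with every pair (b, d) to obtain (a, d).
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c} - closure
        if not new:
            return closure
        closure |= new


# Facts that are not stated explicitly but follow under the transitivity assumption.
inferred = transitive_closure(located_in) - located_in
print(sorted(inferred))  # [('austria', 'europe')]
```

If the subregion fact were missing as well, no such certain derivation would exist in this toy setting, and the relation could only be predicted, for example from the neighbors of the country.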

Usage

Running the data generator is as easy as cloning this repository and launching the shell script run-data-gen.sh. Note, however, that the application depends on numerous Python packages that need to be installed before it can run. For a complete list of dependencies, see setup.py. While you could install all of the required packages directly on your machine, a better solution is to create a Conda environment. To that end, the file environment.yaml specifies an environment that is suitable for running the data generator.

The following steps show how to run the country data generator in a Conda environment:

  1. Download this repository:

    $ git clone https://github.com/phohenecker/country-data-gen.git
    $ cd country-data-gen
    
  2. Create and activate an appropriate Conda environment:

    $ conda env create -f environment.yaml  # create the environment
    $ source activate country-data-gen      # activate it
    

    Note that Conda environments can be reused, i.e., the one for the data generator has to be created only once.

  3. Run the Python application:

    (country-data-gen)$ ./run-data-gen.sh [ARGS]
    

For a detailed description of how to invoke the data generator and of all the options it accepts, refer to the application's help text. It is printed when the application is launched with the flag --help or -h:

(country-data-gen)$ ./run-data-gen.sh --help

Important: There are numerous options available for adjusting the generation process. While all of these have default values, there is one positional argument that must be provided. It is described in the next section.

The DLV System

The country data generator makes use of the DLV system to perform symbolic reasoning over countries by means of the ontology mentioned above. Therefore, you have to download the DLV executable for your platform from the official website and provide the path to it as the (only) positional argument:

(country-data-gen)$ ./run-data-gen.sh [OPTIONS] /path/to/dlv
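When the generator is launched from another script, it can help to check the DLV path before the call. The following Python sketch builds the command line shown above, with the DLV path as the last argument. The helper `build_command` is hypothetical and not part of this repository, and the sketch assumes a POSIX-like system where the downloaded DLV binary has been marked as executable.

```python
import os


def build_command(dlv_path, options=(), script="./run-data-gen.sh"):
    """Return the argument list: script, then options, then the DLV path (positional)."""
    # Fail early if the path does not point to an executable file.
    if not (os.path.isfile(dlv_path) and os.access(dlv_path, os.X_OK)):
        raise FileNotFoundError(
            f"DLV executable not found or not executable: {dlv_path}"
        )
    return [script, *options, dlv_path]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from within the activated Conda environment and the repository directory.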

Note that DLV is free for academic and non-commercial educational use. Details can be found here.

References

Guillaume Bouchard, Sameer Singh, and Theo Trouillon. On approximate reasoning capabilities of low-rank vector spaces. In Proceedings of the 2015 AAAI Spring Symposium on Knowledge Representation and Reasoning (KRR): Integrating Symbolic and Neural Approaches (2015).

Maximilian Nickel, Lorenzo Rosasco, and Tomaso A. Poggio. Holographic embeddings of knowledge graphs. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (2016).
