This repository was archived by the owner on Jul 3, 2023. It is now read-only.

Link Invent Dataset inconsistent with the code base and prior model. #39

@vincrichard

Hello, and thank you for the open-source repository.

I was going through LinkInvent and wanted to try training the model in a transfer learning (TL) fashion, using the dataset provided in ReinventCommunity/notebooks/data/linkinvent_prior_training_data and the prior model. This was mainly for testing the code, and I am aware there is no particular use in doing this TL. However, I think there was an error in the dataset creation process.

The code expects the data to have warheads/inputs in the first column and linkers/targets in the second column. This can be seen in the code as well as in the vocabulary of ReinventCommunity/notebooks/models/linkinvent.prior, which has * and | as input tokens and [*] as the target token.
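A quick way to check which layout a row follows is to look at the tokens described above. The sketch below is only a heuristic built on that convention (warheads use `*` and `|`, linkers use `[*]`); the function name `warheads_first` is hypothetical and is not part of the Reinvent code base.

```python
def warheads_first(fields):
    """Heuristic: True if fields[0] looks like warheads and fields[1] like a linker.

    Assumes the token convention from the prior vocabulary described in this
    issue: warheads are joined by "|" and use a bare "*", linkers use "[*]".
    """
    warheads, linker = fields[0], fields[1]
    return "|" in warheads and "[*]" not in warheads and "[*]" in linker


# Row in the order the code expects (warheads, linker, full SMILES).
expected_order = [
    "*C#CCO|*CCC#CCCCCCCC(C)C",
    "[*]C#CC(O)CCCCCCC[*]",
    "CC(C)CCCCCCC#CCCCCCCCCCC(O)C#CC#CCO",
]
# Same row in the order found in the provided dataset (linker first).
dataset_order = [expected_order[1], expected_order[0], expected_order[2]]
```

Running the check on a few rows of a dataset should be enough to tell whether the columns need to be swapped before training.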

The dataset provided, however, uses the following layout:
Linkers/targets ---- Warheads/inputs ---- Full SMILES
[*]C#CC(O)CCCCCCC[*] ---- *C#CCO|*CCC#CCCCCCCC(C)C ---- CC(C)CCCCCCC#CCCCCCCCCCC(O)C#CC#CCO

The columns should be reordered to:

Warheads/inputs ---- Linkers/targets ---- Full SMILES
*C#CCO|*CCC#CCCCCCCC(C)C ---- [*]C#CC(O)CCCCCCC[*] ---- CC(C)CCCCCCC#CCCCCCCCCCC(O)C#CC#CCO

I made this change myself, and after doing so the training worked fine.
This might not be a big issue, since TL is less important for LinkInvent, and the vocabulary is recreated anyway when training a new model. I still wanted to share this feedback because the dataset does not match the code logic.
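The column swap described above can be sketched as follows. This is a minimal illustration, not the script actually used: it assumes the dataset is a plain-text file with one record per line and tab-separated fields (the real delimiter may differ), and the helper name `swap_first_two_columns` is hypothetical.

```python
def swap_first_two_columns(lines, delimiter="\t"):
    """Yield each record with its first two fields exchanged.

    Assumption: one record per line, fields separated by `delimiter`.
    Lines with fewer than two fields (e.g. blank lines) are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) < 2:
            continue
        fields[0], fields[1] = fields[1], fields[0]
        yield delimiter.join(fields)


# Example row in the order found in the provided dataset (linker first).
row = "\t".join([
    "[*]C#CC(O)CCCCCCC[*]",
    "*C#CCO|*CCC#CCCCCCCC(C)C",
    "CC(C)CCCCCCC#CCCCCCCCCCC(O)C#CC#CCO",
])
fixed = list(swap_first_two_columns([row + "\n"]))
```

To convert a whole file, the same generator can be fed an open file handle and its output written line by line to a new file, leaving the original dataset untouched.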
