Hello, and thank you for the open-source repository.
I was going through LinkInvent and wanted to try training the model with transfer learning (TL), using the dataset provided in ReinventCommunity/notebooks/data/linkinvent_prior_training_data together with the prior model. However, I think an error was made when the dataset was created. I did this mainly to test the code, and I am aware that this particular TL run has no practical use.
The code expects the data to have warheads/inputs in the first column and linkers/targets in the second column. This can be seen in the code as well as in the vocabulary of ReinventCommunity/notebooks/models/linkinvent.prior, which has * and | as input tokens and [*] as the target token.
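The expectation above can be turned into a quick sanity check on a single row. This is only a sketch based on the tokens mentioned in this issue: the function name `looks_like_expected_order` and the tab delimiter are my own assumptions, not part of the REINVENT code base, and the delimiter should be adjusted to match the actual file.

```python
def looks_like_expected_order(row: str, delimiter: str = "\t") -> bool:
    """Heuristic check that a row is ordered warheads/inputs, then linker/target.

    Assumption (from the vocabulary described above): warheads contain '|'
    and use a bare '*', while the linker uses the '[*]' token.
    The tab delimiter is hypothetical; change it to match the real file.
    """
    fields = row.rstrip("\n").split(delimiter)
    if len(fields) < 2:
        return False
    first, second = fields[0], fields[1]
    return "|" in first and "[*]" not in first and "[*]" in second


# Row in the order found in the provided dataset (linker first).
dataset_row = "\t".join([
    "[*]C#CC(O)CCCCCCC[*]",
    "*C#CCO|*CCC#CCCCCCCC(C)C",
    "CC(C)CCCCCCC#CCCCCCCCCCC(O)C#CC#CCO",
])
# Same row with the first two columns exchanged (warheads first).
expected_row = "\t".join([
    "*C#CCO|*CCC#CCCCCCCC(C)C",
    "[*]C#CC(O)CCCCCCC[*]",
    "CC(C)CCCCCCC#CCCCCCCCCCC(O)C#CC#CCO",
])

dataset_ok = looks_like_expected_order(dataset_row)
expected_ok = looks_like_expected_order(expected_row)
```

With the example row from this issue, the check fails for the order found in the dataset and passes once the first two columns are exchanged.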
The provided dataset, however, uses the following column order:
Linkers/target ---- warheads/inputs ---- Full smiles
[*]C#CC(O)CCCCCCC[*] ---- *C#CCO|*CCC#CCCCCCCC(C)C ---- CC(C)CCCCCCC#CCCCCCCCCCC(O)C#CC#CCO
The columns should instead be ordered as:
Warheads/inputs ---- linker/target ---- Full smiles
*C#CCO|*CCC#CCCCCCCC(C)C ---- [*]C#CC(O)CCCCCCC[*] ---- CC(C)CCCCCCC#CCCCCCCCCCC(O)C#CC#CCO
I made this change by hand on my side, and training then worked fine.
This might not be a big issue, since TL is less important for LinkInvent, and when a new model is trained from scratch the vocabulary is recreated anyway. I still wanted to share this feedback because the dataset does not match the code logic.
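For anyone who wants to apply the same fix, the column swap can be sketched as below. This is a minimal sketch under assumptions: the helper name `swap_first_two_columns` is hypothetical, and I assume one record per line with tab-separated fields, which may differ from the actual file format.

```python
from typing import Iterable, Iterator


def swap_first_two_columns(rows: Iterable[str], delimiter: str = "\t") -> Iterator[str]:
    """Yield each row with its first two fields exchanged.

    Assumption: one record per line, fields separated by `delimiter`
    (tab is a guess; adjust it to the real dataset). Blank lines are skipped.
    """
    for row in rows:
        row = row.rstrip("\n")
        if not row:
            continue
        fields = row.split(delimiter)
        if len(fields) < 2:
            raise ValueError(f"Expected at least two columns, got: {row!r}")
        fields[0], fields[1] = fields[1], fields[0]
        yield delimiter.join(fields)


# Example row in the order found in the dataset: linker, warheads, full SMILES.
original = "\t".join([
    "[*]C#CC(O)CCCCCCC[*]",
    "*C#CCO|*CCC#CCCCCCCC(C)C",
    "CC(C)CCCCCCC#CCCCCCCCCCC(O)C#CC#CCO",
])
fixed = list(swap_first_two_columns([original]))
fixed_fields = fixed[0].split("\t")
```

To convert a whole file, the same generator can be fed an open file handle and its output written line by line to a new file, leaving the original dataset untouched.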