Description
The NMT engine does not have good support for scripts that NLLB was not originally trained on. Every character in such a script is added to the vocabulary as an individual token, which results in poor translation quality. silnlp provides better support for unknown scripts. Here is an example of the silnlp config that enables this support:
tokenizer:
  src_vocab_size: 0
  trained_tokens: true
  trg_vocab_size: 500
  update_src: true
  update_trg: true

When silnlp is configured in this way, it trains a new subword tokenizer on the target data using the specified trg_vocab_size and adds the new tokens to the NLLB vocabulary. This results in better translation quality for unknown scripts than the default settings. Similar functionality should be added to the NMT engine. The engine could automatically detect when to enable this feature, or it could be enabled using build options.
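One possible way to approach the automatic detection is sketched below. It is only an illustration: the function names, the idea of measuring character coverage against the existing vocabulary, and the 5% threshold are all assumptions, not existing silnlp or NMT engine behavior.

```python
from collections import Counter
from typing import Iterable, Set


def chars_in_vocab(tokens: Iterable[str]) -> Set[str]:
    """Collect every character that appears in any token of the base vocabulary."""
    return {ch for token in tokens for ch in token}


def uncovered_char_ratio(lines: Iterable[str], known_chars: Set[str]) -> float:
    """Return the fraction of non-whitespace characters not found in known_chars."""
    counts = Counter(ch for line in lines for ch in line if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    unknown = sum(n for ch, n in counts.items() if ch not in known_chars)
    return unknown / total


def should_update_tokenizer(
    lines: Iterable[str],
    known_chars: Set[str],
    threshold: float = 0.05,  # hypothetical cutoff, would need tuning
) -> bool:
    """Suggest training new tokens when too much of the text is uncovered."""
    return uncovered_char_ratio(lines, known_chars) > threshold


if __name__ == "__main__":
    # Toy vocabulary standing in for the tokens of the base model.
    known = chars_in_vocab(["▁he", "llo", "▁wor", "ld"])
    print(should_update_tokenizer(["hello world"], known))
    print(should_update_tokenizer(["hello ᏣᎳᎩ"], known))
```

In a sketch like this, the check would run on the target (and optionally source) training data before a build, and a build option could still override the result in either direction.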