Add support for unknown scripts in NLLB #807

@ddaspit

Description

The NMT engine does not have good support for scripts that NLLB was not originally trained on. Every character of such a script is added to the vocabulary as an individual token, which results in poor translation quality. silnlp provides better support for unknown scripts. Here is an example of the silnlp config that enables this support:

  tokenizer:
    src_vocab_size: 0
    trained_tokens: true
    trg_vocab_size: 500
    update_src: true
    update_trg: true

When silnlp is configured in this way, it trains a new subword tokenizer on the target data using the specified trg_vocab_size and adds the new tokens to the NLLB vocabulary. This results in better translation quality for unknown scripts than the default settings. Similar functionality should be added to the NMT engine. The engine could either detect automatically when to enable this feature, or the feature could be enabled through build options.
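To make the two ideas above concrete, here is a rough Python sketch of (a) a possible automatic-detection heuristic and (b) the step that picks which newly trained tokens to add. It is only an illustration under assumptions, not how silnlp or the NMT engine implements this: the function names, the 0.05 threshold, and the use of a plain set of strings in place of the real NLLB vocabulary are all hypothetical. In practice the trained pieces would come from the new subword tokenizer trained on the target data.

```python
from collections import Counter
from typing import Iterable


def uncovered_char_ratio(lines: Iterable[str], vocab: set[str]) -> float:
    """Fraction of non-space characters that occur in no vocabulary token."""
    # Characters the base vocabulary can represent at all.
    known_chars = {ch for token in vocab for ch in token}
    counts = Counter(ch for line in lines for ch in line if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    uncovered = sum(n for ch, n in counts.items() if ch not in known_chars)
    return uncovered / total


def should_update_vocab(
    lines: Iterable[str], vocab: set[str], threshold: float = 0.05
) -> bool:
    """Hypothetical auto-detect rule: enable when too many chars are unknown."""
    return uncovered_char_ratio(lines, vocab) > threshold


def select_new_tokens(trained_pieces: Iterable[str], vocab: set[str]) -> list[str]:
    """Keep trained pieces missing from the base vocabulary, in order."""
    seen = set(vocab)
    new_tokens = []
    for piece in trained_pieces:
        if piece not in seen:
            seen.add(piece)
            new_tokens.append(piece)
    return new_tokens
```

With a check like this, the engine could fall back to the default behavior for scripts NLLB already covers and only train and merge a new target tokenizer when the ratio is high. A build option could still override the heuristic in either direction.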


Status: 📋 Backlog
