Add support for unknown scripts in NLLB #807

The NMT engine does not have good support for scripts that NLLB was not originally trained on. Every character of such a script is added to the vocabulary as an individual token, which results in poor translation quality. silnlp provides better support for unknown scripts. Here is an example of the silnlp config that enables this support:

```yaml
  tokenizer:
    src_vocab_size: 0
    trained_tokens: true
    trg_vocab_size: 500
    update_src: true
    update_trg: true
```

When silnlp is configured this way, it trains a new subword tokenizer on the target data using the specified `trg_vocab_size` and adds the new tokens to the NLLB vocabulary. This results in better translation quality for unknown scripts than the default settings. Similar functionality should be added to the NMT engine. The engine could detect automatically when to enable this feature, or it could be enabled through build options.
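To make the idea concrete, here is a minimal, standard-library-only sketch of the two pieces described above: deciding whether the target data is poorly covered by the base vocabulary, and learning a capped number of new subword tokens to append to it. This is not silnlp's or the NMT engine's implementation. The function names (`uncovered_ratio`, `learn_tokens`, `extend_vocab`), the toy BPE-style merge loop, and the assumption that vocabulary ids are contiguous from 0 are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def uncovered_ratio(lines, base_vocab):
    """Fraction of non-space characters that appear in no base-vocabulary token."""
    known_chars = set("".join(base_vocab))
    chars = [ch for line in lines for ch in line if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(ch not in known_chars for ch in chars) / len(chars)


def learn_tokens(lines, max_new_tokens):
    """Toy BPE-style learner (illustrative only): single characters first,
    then the most frequent adjacent pairs, capped at max_new_tokens."""
    words = Counter(word for line in lines for word in line.split())
    # Start with every distinct character so the script is fully covered.
    tokens = sorted({ch for word in words for ch in word})[:max_new_tokens]
    segmented = {word: tuple(word) for word in words}
    while len(tokens) < max_new_tokens:
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, symbols in segmented.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += words[word]
        if not pairs:
            break  # every word is already a single symbol
        (left, right), _ = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        merged = left + right
        if merged not in tokens:
            tokens.append(merged)
        # Apply the merge to every word.
        for word in list(segmented):
            symbols, out, i = segmented[word], [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == (left, right):
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            segmented[word] = tuple(out)
    return tokens


def extend_vocab(base_vocab, new_tokens):
    """Return a copy of base_vocab with unseen tokens appended.
    Assumes existing ids are contiguous and start at 0."""
    vocab = dict(base_vocab)
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab
```

An engine could call `uncovered_ratio` on the target side of the training corpus and, when the ratio exceeds some threshold, call `learn_tokens` with a budget analogous to `trg_vocab_size` before extending the vocabulary. The threshold and the budget would need to be tuned, or exposed as build options.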
