Add download-tokenizer(s) targets for model tokenizer downloads #399
maczg wants to merge 2 commits into argmaxinc:main
Adds a make target to download tokenizer.json files from HuggingFace for each openai_whisper model found in the Models directory.
ZachNagengast left a comment
Thanks for the contribution. This is definitely a good script to add, but it would be great if you could adjust it to match the other scripts, which download files with git rather than curl, for consistency and clarity.
…pport
- Add download-tokenizer target for single model downloads
- Add download-tokenizers target for batch downloads
- Support both openai_whisper-* and distil-whisper_* models
- Auto-detect model type and resolve correct HuggingFace repo
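For illustration, the auto-detection step in the commit message could look roughly like the sketch below. The function name `resolve_tokenizer_repo` is hypothetical, and the mapping assumes the folder suffix matches the upstream repo name exactly (for example `openai_whisper-tiny` to `openai/whisper-tiny`); folders with extra variant suffixes would need additional handling that is not shown here.

```shell
# Hypothetical helper (not taken from the PR): map a local model folder name
# to the HuggingFace repo that is assumed to host its tokenizer.json.
resolve_tokenizer_repo() {
  name="$1"
  case "$name" in
    openai_whisper-*)
      # openai_whisper-tiny -> openai/whisper-tiny
      printf 'openai/whisper-%s\n' "${name#openai_whisper-}"
      ;;
    distil-whisper_*)
      # distil-whisper_distil-large-v3 -> distil-whisper/distil-large-v3
      printf 'distil-whisper/%s\n' "${name#distil-whisper_}"
      ;;
    *)
      # Unknown prefix: report on stderr and signal failure to the caller.
      echo "unknown model type: $name" >&2
      return 1
      ;;
  esac
}
```

A Makefile recipe could call a helper like this once per folder and pass the result to whichever download tool the project settles on.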
c4e952a to 55808f5
The tokenizer files come from the original Would it make sense to add
I see what you mean, was mainly commenting on the curl usage. I think the model parsing makes sense since it's a bit difficult to really correlate a tokenizer with our current model names without that. The reason in my mind to avoid curl in this case is that the hardcoded URLs may become out of date at some point, whereas git or huggingface-cli will be more future-proof. I'd suggest two options:
Open to other thoughts as well if you have any.
This PR adds two new Makefile targets to download tokenizers from the respective OpenAI Whisper repositories.
This is useful when using a local model folder with WhisperKit: the tokenizer is not downloaded automatically even if the model files are present. These targets allow fetching tokenizer.json separately.
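As a rough sketch of the situation described above, a batch target might first find which local model folders still lack a tokenizer. The helper name `models_missing_tokenizer` and the assumption that each model lives in its own subfolder of a single models directory are illustrative only, not taken from the PR.

```shell
# Hypothetical helper: print the name of each model folder under $1
# that does not yet contain a tokenizer.json file.
models_missing_tokenizer() {
  models_dir="$1"
  for dir in "$models_dir"/*/; do
    [ -d "$dir" ] || continue            # skip when the glob matches nothing
    if [ ! -f "${dir}tokenizer.json" ]; then
      basename "$dir"                    # folder name only, without the path
    fi
  done
}
```

Each printed name could then be handed to the single-model download step, so folders that already have a tokenizer are left untouched.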