Add download-tokenizer(s) targets for model tokenizer downloads#399

Open
maczg wants to merge 2 commits into argmaxinc:main from maczg:feature/download-tokenizers

Conversation


@maczg maczg commented Jan 15, 2026

This PR adds two new Makefile targets to download tokenizers from the respective OpenAI Whisper repositories.
This can be useful when using a local model folder with WhisperKit: the tokenizer is not downloaded automatically even when the model files are present. These targets allow fetching the tokenizer.json separately.

Adds a make target to download tokenizer.json files from HuggingFace
for each openai_whisper model found in the Models directory.
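As a rough sketch, a curl-based fetch along these lines matches the description; the repo mapping and Models/ path here are assumptions for illustration, not the PR's actual diff:

```shell
# Hypothetical sketch only: build the Hugging Face "resolve" URL a
# curl-based tokenizer download would hit. "base" is an example model.
model="base"
repo="openai/whisper-${model}"
url="https://huggingface.co/${repo}/resolve/main/tokenizer.json"
echo "$url"
# A real target would then run something like:
#   curl -sL "$url" -o "Models/openai_whisper-${model}/tokenizer.json"
```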
Contributor

@ZachNagengast ZachNagengast left a comment


Thanks for the contribution, definitely a good script to add, but it would be great if you could adjust it to match the other scripts, which use git to download the files rather than curl, for consistency and clarity.

…pport

- Add download-tokenizer target for single model downloads
- Add download-tokenizers target for batch downloads
- Support both openai_whisper-* and distil-whisper_* models
- Auto-detect model type and resolve correct HuggingFace repo
@maczg maczg force-pushed the feature/download-tokenizers branch from c4e952a to 55808f5 on January 20, 2026 at 17:54
Author

maczg commented Jan 20, 2026

The tokenizer files come from the original openai/whisper-* and distil-whisper/* repos, not whisperkit-coreml.
Using git would require cloning multiple external repos into a temp directory that is removed after copying the requested file.

Would it make sense to add tokenizer.json and tokenizer_config.json directly to each model folder in whisperkit-coreml HuggingFace repo? That way, git lfs pull would automatically include them.
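For reference, if the tokenizer JSONs lived next to each model folder in the whisperkit-coreml repo, a filtered LFS pull could fetch them alongside the weights; the path pattern below is illustrative, and the command is only printed here since it needs an actual clone with git-lfs installed:

```shell
# Illustrative only: a filtered git-lfs pull for per-model tokenizer
# files. Printed rather than executed, since it requires a checkout of
# whisperkit-coreml with git-lfs installed.
cmd='git lfs pull --include "openai_whisper-*/tokenizer*.json"'
echo "$cmd"
```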

@ZachNagengast
Contributor

I see what you mean; I was mainly commenting on the curl usage. I think the model parsing makes sense, since it's a bit difficult to correlate a tokenizer with our current model names without it. The reason in my mind to avoid curl in this case is that the hardcoded URLs may become out of date at some point, whereas git or huggingface-cli will be more future-proof.

I'd suggest two options:

  1. Since WhisperKit already checks for the tokenizer in its associated HF repo (the openai/whisper-* repo), you could use huggingface-cli and filter for just the needed .json files. It should work via the CLI as normal, because the tokenizer first checks $(HOME)/Documents/huggingface/models/ before downloading if a tokenizer path is not provided as an argument. Example:
download-tokenizer:
	@if [ -z "$(MODEL)" ]; then \
		echo "Error: MODEL is not set."; \
		exit 1; \
	fi
	@if echo "$(MODEL)" | grep -q "^distil-"; then \
		base_model=$$(echo "$(MODEL)" | sed 's/_[0-9]*MB$$//' | sed 's/_turbo$$//' | sed 's/-v[0-9]\{8\}$$//'); \
		repo="distil-whisper/$$base_model"; \
	else \
		base_model=$$(echo "$(MODEL)" | sed 's/_[0-9]*MB$$//' | sed 's/_turbo$$//' | sed 's/-v[0-9]\{8\}$$//'); \
		repo="openai/whisper-$$base_model"; \
	fi; \
	echo "Downloading tokenizer from $$repo..."; \
	huggingface-cli download "$$repo" tokenizer.json tokenizer_config.json \
		--local-dir "$(HOME)/Documents/huggingface/models/$$repo" --local-dir-use-symlinks False
  2. Add an option to bundle the tokenizer with the model, keeping the same destination logic and just replacing curl with huggingface-cli. Example:
# Download tokenizer (optionally bundle to model folder)
# Usage: 
#   make download-tokenizer MODEL=base                    # Just download
#   make download-tokenizer MODEL=base BUNDLE=true        # Download and bundle
download-tokenizer:
	@if [ -z "$(MODEL)" ]; then \
		echo "Error: MODEL is not set. Usage: make download-tokenizer MODEL=base [BUNDLE=true]"; \
		exit 1; \
	fi
	@if echo "$(MODEL)" | grep -q "^distil-"; then \
		dest="$(MODEL_REPO_DIR)/distil-whisper_$(MODEL)"; \
		base_model=$$(echo "$(MODEL)" | sed 's/_[0-9]*MB$$//' | sed 's/_turbo$$//' | sed 's/-v[0-9]\{8\}$$//'); \
		repo="distil-whisper/$$base_model"; \
	else \
		dest="$(MODEL_REPO_DIR)/openai_whisper-$(MODEL)"; \
		base_model=$$(echo "$(MODEL)" | sed 's/_[0-9]*MB$$//' | sed 's/_turbo$$//' | sed 's/-v[0-9]\{8\}$$//'); \
		repo="openai/whisper-$$base_model"; \
	fi; \
	echo "Downloading tokenizer from $$repo..."; \
	huggingface-cli download "$$repo" tokenizer.json tokenizer_config.json \
		--local-dir "$(HOME)/Documents/huggingface/models/$$repo" --local-dir-use-symlinks False; \
	if [ "$(BUNDLE)" = "true" ]; then \
		src="$(HOME)/Documents/huggingface/models/$$repo"; \
		mkdir -p "$$dest"; \
		echo "Bundling tokenizer from $$src to $$dest..."; \
		cp "$$src/tokenizer.json" "$$src/tokenizer_config.json" "$$dest/"; \
		echo "Tokenizer bundled to $$dest"; \
	fi

bundle-tokenizer:
	@$(MAKE) download-tokenizer MODEL=$(MODEL) BUNDLE=true
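
The suffix-stripping used in both snippets can be exercised on its own; a plain-shell sketch (the model names below are examples of the naming scheme, not an exhaustive list):

```shell
# Mirror of the Makefile's sed chain: strip a quantization-size suffix
# (_632MB), a _turbo suffix, and an 8-digit date suffix (-v20240930),
# then pick the upstream repo by prefix.
resolve_repo() {
  model="$1"
  base=$(echo "$model" | sed 's/_[0-9]*MB$//' | sed 's/_turbo$//' | sed 's/-v[0-9]\{8\}$//')
  case "$model" in
    distil-*) echo "distil-whisper/$base" ;;
    *)        echo "openai/whisper-$base" ;;
  esac
}

resolve_repo "large-v3_turbo_632MB"   # -> openai/whisper-large-v3
resolve_repo "distil-large-v3"        # -> distil-whisper/distil-large-v3
```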

Open to other thoughts as well if you have any.
