Add download-tokenizer(s) targets for model tokenizer downloads#399

Open
maczg wants to merge 2 commits into argmaxinc:main from maczg:feature/download-tokenizers

Conversation


@maczg maczg commented Jan 15, 2026

This PR adds two new Makefile targets to download tokenizers from the respective OpenAI Whisper repositories.
This can be useful when using a local model folder with WhisperKit: the tokenizer is not downloaded automatically even when the model files are present. These targets allow fetching the tokenizer.json separately.

Adds a make target to download tokenizer.json files from HuggingFace
for each openai_whisper model found in the Models directory.
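As a rough sketch, a curl-based fetch along these lines matches the description; the repo mapping and Models/ path here are assumptions for illustration, not the PR's actual diff:

```shell
# Hypothetical sketch only: build the Hugging Face "resolve" URL a
# curl-based tokenizer download would hit. "base" is an example model.
model="base"
repo="openai/whisper-${model}"
url="https://huggingface.co/${repo}/resolve/main/tokenizer.json"
echo "$url"
# A real target would then run something like:
#   curl -sL "$url" -o "Models/openai_whisper-${model}/tokenizer.json"
```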
Contributor

@ZachNagengast ZachNagengast left a comment


Thanks for the contribution, definitely a good script to add, but it would be great if you could adjust it to match the other scripts, which use git to download the files rather than curl, for consistency and clarity.

…pport

- Add download-tokenizer target for single model downloads
- Add download-tokenizers target for batch downloads
- Support both openai_whisper-* and distil-whisper_* models
- Auto-detect model type and resolve correct HuggingFace repo
@maczg maczg force-pushed the feature/download-tokenizers branch from c4e952a to 55808f5 on January 20, 2026 at 17:54
Author

maczg commented Jan 20, 2026

The tokenizer files come from the original openai/whisper-* and distil-whisper/* repos, not whisperkit-coreml.
Using git would require cloning multiple external repos into a temp directory that is removed after copying the requested file.

Would it make sense to add tokenizer.json and tokenizer_config.json directly to each model folder in whisperkit-coreml HuggingFace repo? That way, git lfs pull would automatically include them.
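For reference, if the tokenizer JSONs lived next to each model folder in the whisperkit-coreml repo, a filtered LFS pull could fetch them alongside the weights; the path pattern below is illustrative, and the command is only printed here since it needs an actual clone with git-lfs installed:

```shell
# Illustrative only: a filtered git-lfs pull for per-model tokenizer
# files. Printed rather than executed, since it requires a checkout of
# whisperkit-coreml with git-lfs installed.
cmd='git lfs pull --include "openai_whisper-*/tokenizer*.json"'
echo "$cmd"
```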

@ZachNagengast
Contributor

I see what you mean; I was mainly commenting on the curl usage. I think the model parsing makes sense, since it's a bit difficult to correlate a tokenizer with our current model names without it. The reason in my mind to avoid curl in this case is that the hardcoded URLs may become out of date at some point, whereas git or huggingface-cli will be more future-proof.

I'd suggest two options:

  1. Since WhisperKit already checks for the tokenizer in its associated HF repo (the openai/whisper-* repo), you could use huggingface-cli and filter for just the needed .json files. It should work via the CLI as normal, because the tokenizer first checks $(HOME)/Documents/huggingface/models/ before downloading if a tokenizer path is not provided as an argument. Example:
download-tokenizer:
	@if [ -z "$(MODEL)" ]; then \
		echo "Error: MODEL is not set."; \
		exit 1; \
	fi
	@if echo "$(MODEL)" | grep -q "^distil-"; then \
		base_model=$$(echo "$(MODEL)" | sed 's/_[0-9]*MB$$//' | sed 's/_turbo$$//' | sed 's/-v[0-9]\{8\}$$//'); \
		repo="distil-whisper/$$base_model"; \
	else \
		base_model=$$(echo "$(MODEL)" | sed 's/_[0-9]*MB$$//' | sed 's/_turbo$$//' | sed 's/-v[0-9]\{8\}$$//'); \
		repo="openai/whisper-$$base_model"; \
	fi; \
	echo "Downloading tokenizer from $$repo..."; \
	huggingface-cli download "$$repo" tokenizer.json tokenizer_config.json \
		--local-dir "$(HOME)/Documents/huggingface/models/$$repo" --local-dir-use-symlinks False
  2. Add an option to bundle the tokenizer with the model, keeping the same destination logic and just replacing curl with huggingface-cli. Example:
# Download tokenizer (optionally bundle to model folder)
# Usage: 
#   make download-tokenizer MODEL=base                    # Just download
#   make download-tokenizer MODEL=base BUNDLE=true        # Download and bundle
download-tokenizer:
	@if [ -z "$(MODEL)" ]; then \
		echo "Error: MODEL is not set. Usage: make download-tokenizer MODEL=base [BUNDLE=true]"; \
		exit 1; \
	fi
	@if echo "$(MODEL)" | grep -q "^distil-"; then \
		dest="$(MODEL_REPO_DIR)/distil-whisper_$(MODEL)"; \
		base_model=$$(echo "$(MODEL)" | sed 's/_[0-9]*MB$$//' | sed 's/_turbo$$//' | sed 's/-v[0-9]\{8\}$$//'); \
		repo="distil-whisper/$$base_model"; \
	else \
		dest="$(MODEL_REPO_DIR)/openai_whisper-$(MODEL)"; \
		base_model=$$(echo "$(MODEL)" | sed 's/_[0-9]*MB$$//' | sed 's/_turbo$$//' | sed 's/-v[0-9]\{8\}$$//'); \
		repo="openai/whisper-$$base_model"; \
	fi; \
	echo "Downloading tokenizer from $$repo..."; \
	huggingface-cli download "$$repo" tokenizer.json tokenizer_config.json \
		--local-dir "$(HOME)/Documents/huggingface/models/$$repo" --local-dir-use-symlinks False; \
	if [ "$(BUNDLE)" = "true" ]; then \
		src="$(HOME)/Documents/huggingface/models/$$repo"; \
		mkdir -p "$$dest"; \
		echo "Bundling tokenizer from $$src to $$dest..."; \
		cp "$$src/tokenizer.json" "$$src/tokenizer_config.json" "$$dest/"; \
		echo "Tokenizer bundled to $$dest"; \
	fi

bundle-tokenizer:
	@$(MAKE) download-tokenizer MODEL=$(MODEL) BUNDLE=true
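
The suffix-stripping used in both snippets can be exercised on its own; a plain-shell sketch (the model names below are examples of the naming scheme, not an exhaustive list):

```shell
# Mirror of the Makefile's sed chain: strip a quantization-size suffix
# (_632MB), a _turbo suffix, and an 8-digit date suffix (-v20240930),
# then pick the upstream repo by prefix.
resolve_repo() {
  model="$1"
  base=$(echo "$model" | sed 's/_[0-9]*MB$//' | sed 's/_turbo$//' | sed 's/-v[0-9]\{8\}$//')
  case "$model" in
    distil-*) echo "distil-whisper/$base" ;;
    *)        echo "openai/whisper-$base" ;;
  esac
}

resolve_repo "large-v3_turbo_632MB"   # -> openai/whisper-large-v3
resolve_repo "distil-large-v3"        # -> distil-whisper/distil-large-v3
```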

Open to other thoughts as well if you have any.
