Add Molmo2 #43451

Open
SangbumChoi wants to merge 16 commits into huggingface:main from SangbumChoi:molmo2

Conversation

@SangbumChoi
Contributor

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@merveenoyan merveenoyan requested a review from molbap February 27, 2026 05:55
Adds AllenAI Molmo2 multimodal VLM to transformers, supporting:
- Molmo2ForConditionalGeneration (image+video+text → text)
- Molmo2TextModel / Molmo2TextForCausalLM (text-only)
- Molmo2ImageProcessor and Molmo2VideoProcessor
- Molmo2Processor

Key implementation details:
- Uses is_first_iteration (v5 API) for prepare_inputs_for_generation
- Custom Molmo2Embedding with embedding + new_embedding parameters
- Vision backbone with pooling adapter and multi-layer ViT features
- Dynamic full cache support for generation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
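The "embedding + new_embedding" split mentioned above can be sketched as follows: token ids below the original vocabulary size index the pretrained table, while ids at or above it index a separate table for newly added tokens. This is a hedged illustration with numpy; the function name `lookup` and the split rule are assumptions, not the PR's exact code.

```python
import numpy as np

def lookup(ids, embedding, new_embedding):
    # embedding: (vocab_size, dim) pretrained table
    # new_embedding: (num_new_tokens, dim) table for added tokens
    num_embeddings = embedding.shape[0]
    out = np.empty((len(ids), embedding.shape[1]), dtype=embedding.dtype)
    for i, tok in enumerate(ids):
        if tok < num_embeddings:
            out[i] = embedding[tok]
        else:
            # Added tokens are offset by the original vocab size.
            out[i] = new_embedding[tok - num_embeddings]
    return out
```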
SangbumChoi and others added 14 commits March 27, 2026 08:56
…odel_prefix

- Replace einops.rearrange with native numpy reshape+transpose+reshape
- Add @strict decorator to all 4 config classes (Molmo2VitConfig,
  Molmo2AdapterConfig, Molmo2TextConfig, Molmo2Config) to satisfy TRF010
- Set Molmo2Model.base_model_prefix = "model" (was empty, violating TRF002)
- Fix image_mean/image_std mutable shared list (copy constants on init)
- Fix test_image_processing: use image_processing_class instead of
  image_processor_list; skip CHW torch and 4-channel unsupported tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
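The einops replacement described in this commit follows the standard reshape→transpose→reshape recipe. A hedged sketch for a patchify-style rearrange (the exact pattern used in the processor is an assumption; the recipe is the general idea):

```python
import numpy as np

def patchify(x: np.ndarray, P: int) -> np.ndarray:
    # Equivalent of einops.rearrange(
    #     x, "(h p1) (w p2) c -> (h w) (p1 p2 c)", p1=P, p2=P)
    # using only native numpy operations.
    H, W, C = x.shape
    x = x.reshape(H // P, P, W // P, P, C)  # split height/width into patch dims
    x = x.transpose(0, 2, 1, 3, 4)          # bring the patch-grid dims together
    return x.reshape((H // P) * (W // P), P * P * C)
```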
- Re-sort _toctree.yml to place Molmo2 after mllama alphabetically
- Add None guard in test_video_processor_from_dict_with_kwargs to skip
  when fast_video_processing_class is not defined

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Molmo2TextModel is an internal sub-component used by Molmo2Model and
Molmo2ForConditionalGeneration and is tested implicitly through those.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
requests is not part of the standard library and caused ImportError in
minimal environments (e.g. HuggingFace Jobs). Use urllib.request instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
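The swap from `requests` to the standard library looks roughly like this; a `data:` URL is used here so the sketch runs offline, whereas the PR's code presumably fetches image URLs the same way:

```python
from urllib.request import urlopen

# Fetch bytes without the third-party `requests` package.
with urlopen("data:text/plain,hello") as resp:
    payload = resp.read()
```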
Molmo2's processor has several behaviors that are incompatible with the
default ProcessorTesterMixin assumptions:
- Chat template enforces strict user/assistant alternation (no system role)
- Processor inserts BOS token, shifting sequence length by 1
- Image processor patchifies output, so rescale_factor passthrough fails
- Video processor requires FPS metadata not provided by base tests
- Hub processor_config.json contains auto_map not preserved in save/load

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
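The strict alternation rule noted above can be sketched as a simple check; this is an assumption about the check's shape, not the PR's actual chat-template code:

```python
def check_alternation(messages: list) -> bool:
    # Messages must alternate user/assistant, starting with user,
    # with no system role permitted.
    expected = "user"
    for msg in messages:
        if msg["role"] != expected:
            return False
        expected = "assistant" if expected == "user" else "user"
    return True
```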
Add @auto_docstring(checkpoint="allenai/Molmo2-8B") decorator to
Molmo2TextConfig and Molmo2Config with custom_args for documenting
non-standard parameters. This fixes check_config_docstrings CI check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… date

Add parameter docstrings to Molmo2TextConfig and Molmo2Config __init__
methods so @strict-wrapped classes pass config docstring CI checks.
Update model doc date to 2026-03-28.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move top-level `import torch` and `import torchvision.transforms` behind
`is_torch_available()` / `is_torchvision_available()` guards in both
image and video processors to prevent ModuleNotFoundError when
torchvision is not installed.

Also skip test_kwargs_overrides_default_image_processor_kwargs since
Molmo2's patchifying image processor doesn't support rescale_factor
passthrough.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
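The guarded-import pattern this commit applies can be sketched as below. transformers ships `is_torch_available()` / `is_torchvision_available()` in its utils; the stand-in here shows the idea without depending on the library:

```python
import importlib.util

def is_torch_available() -> bool:
    # Simplified stand-in for transformers' utility of the same name:
    # check for the package without importing it.
    return importlib.util.find_spec("torch") is not None

if is_torch_available():
    import torch  # the heavy import only runs when torch is installed
```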
Convert all absolute imports (from transformers.xxx) to relative imports
(from ...xxx) in image_processing, video_processing, and processing
modules to match the convention used by all other in-library models.

Remove register_for_auto_class() calls which are only needed for custom
hub models and were causing dynamic_module_utils to incorrectly scan
local files for relative imports during save_pretrained.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…n_available

The processor's top-level imports from image_processing_molmo2 and
video_processing_molmo2 pull in PILImageResampling which requires PIL.
Guard these imports with is_vision_available() so `from transformers
import *` works when only torch is installed (no PIL/torchvision).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…L imports

Move Molmo2ImagesKwargs and Molmo2VideosKwargs definitions directly into
processing_molmo2.py instead of importing them from image/video processor
modules which require PIL. Also remove Molmo2ImageProcessor/VideoProcessor
type hints from __init__ to avoid NameError when vision is unavailable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
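Defining the kwargs classes locally might look like this minimal sketch; the field name `max_crops` is illustrative, not the PR's exact schema, and transformers' real kwargs classes extend shared base classes rather than plain `TypedDict`:

```python
from typing import TypedDict

# Defined directly in processing_molmo2.py so that importing the processor
# pulls in nothing PIL-dependent.
class Molmo2ImagesKwargs(TypedDict, total=False):
    max_crops: int
```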
@SangbumChoi
Copy link
Copy Markdown
Contributor Author

@molbap Hi, I am still working on this since I have to build an example visualizer for it (most of the code was generated by Claude Code). However, you can already start with a brief, high-level review! cc @merveenoyan

Add integration tests for Molmo2-8B covering:
- Image generation with exact expected text verification
- Video QA (penguin identification)
- Video pointing (coordinate output)
- Multi-image comparison

All expected values derived from actual model inference on A10G.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, molmo2
