About using ToMe in ImageEncoderViT within segment anything #42

Hello Author,

Thank you for your contributions. I am trying to speed up the [ImageEncoderViT](https://github.com/facebookresearch/segment-anything/blob/main/segment_anything/modeling/image_encoder.py) module from “Segment Anything” with your token merging method, but I have run into two issues:

1. The Block in ImageEncoderViT uses windowed attention, and it keeps the tokens on a 2D grid of shape (B, H, W, C), for example (1, 64, 64, 1280) for vit_h. Bipartite soft matching cannot process this layout as is, since it expects a single token dimension. I am considering flattening H and W into one token dimension before the matching step. Would that work, or do you have a better suggestion?
2. ToMe can reduce the number of tokens by about 98%, which changes the shape of the final feature map. ImageEncoderViT ends with two Conv2d operations, and once the token count has changed, the output can no longer be reshaped into the 2D grid those convolutions need. Would it be feasible to add an operation at this point that expands the tokens back to the original shape?
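
To make the two questions concrete, here is a minimal PyTorch sketch of what I have in mind. It flattens the grid to (B, N, C), merges tokens with a simplified bipartite soft matching, then unmerges and reshapes so that a Conv2d can run again. The names `flatten_grid` and `soft_matching_sketch` are hypothetical helpers written for this illustration and are not the ToMe or Segment Anything API. The toy sizes stand in for the real (1, 64, 64, 1280) tensor, and the sketch does not deal with window partitioning or relative position embeddings, which also depend on the grid.

```python
import torch


def flatten_grid(x):
    """(B, H, W, C) -> (B, H*W, C); also return (H, W) so the grid can be rebuilt."""
    B, H, W, C = x.shape
    return x.reshape(B, H * W, C), (H, W)


def soft_matching_sketch(metric, r):
    """Simplified bipartite soft matching over (B, N, C) tokens, N even.

    Illustrative re-implementation (an assumption, not the library API).
    Returns (merge, unmerge) closures that remove and restore r tokens.
    """
    B, N, _ = metric.shape
    with torch.no_grad():
        metric = metric / metric.norm(dim=-1, keepdim=True)
        a, b = metric[:, ::2], metric[:, 1::2]         # alternate tokens into sets A and B
        scores = a @ b.transpose(-1, -2)               # cosine similarity between A and B
        node_max, node_idx = scores.max(dim=-1)        # best B partner for each A token
        order = node_max.argsort(dim=-1, descending=True)[..., None]
        src_idx, unm_idx = order[:, :r], order[:, r:]  # merged vs. kept A tokens
        dst_idx = node_idx[..., None].gather(1, src_idx)

    def merge(x):
        src, dst = x[:, ::2], x[:, 1::2]
        C = x.shape[-1]
        unm = src.gather(1, unm_idx.expand(-1, -1, C))
        moved = src.gather(1, src_idx.expand(-1, -1, C))
        # Average each B token with the A tokens merged into it.
        total = dst.scatter_add(1, dst_idx.expand(-1, -1, C), moved)
        count = torch.ones_like(dst[..., :1]).scatter_add(
            1, dst_idx, torch.ones_like(moved[..., :1]))
        return torch.cat([unm, total / count], dim=1)

    def unmerge(x):
        C = x.shape[-1]
        n_unm = unm_idx.shape[1]
        unm, dst = x[:, :n_unm], x[:, n_unm:]
        out = torch.zeros(B, N, C, dtype=x.dtype, device=x.device)
        out[:, 1::2] = dst
        out.scatter_(1, (2 * unm_idx).expand(-1, -1, C), unm)
        # Each merged A token receives a copy of the B token it was merged into.
        copies = dst.gather(1, dst_idx.expand(-1, -1, C))
        out.scatter_(1, (2 * src_idx).expand(-1, -1, C), copies)
        return out

    return merge, unmerge


x = torch.randn(1, 8, 8, 16)                  # toy stand-in for (1, 64, 64, 1280)
tokens, (H, W) = flatten_grid(x)              # (1, 64, 16)
merge, unmerge = soft_matching_sketch(tokens, r=16)
merged = merge(tokens)                        # (1, 48, 16): 16 tokens removed
restored = unmerge(merged)                    # (1, 64, 16): original token count
grid = restored.reshape(1, H, W, -1).permute(0, 3, 1, 2)  # (1, 16, 8, 8)
out = torch.nn.Conv2d(16, 4, kernel_size=1)(grid)         # (1, 4, 8, 8)
```

In this sketch the unmerge step only copies merged values back to their original positions, so the restored grid has the right shape but less detail than the unmerged input would have had.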

Thank you for your help.
