This repository was archived by the owner on Jan 1, 2025. It is now read-only.

About using ToMe in ImageEncoderViT within segment anything #42

@yangzijia

Hello Author,

Thank you for your contributions. I am trying to speed up the ImageEncoderViT module from “Segment Anything” with your token merging (ToMe) method, but I have run into two issues:

  1. The Block in ImageEncoderViT uses windowed attention, and its tokens have shape (B, W, H, C), for example (1, 64, 64, 1280) for vit_h. Bipartite soft matching cannot process this layout directly, since it expects a flat token sequence. Would flattening W and H into a single token dimension before matching work, or do you have a better suggestion?
  2. ToMe can reduce the number of tokens by about 98%, which changes the shape of the final features. ImageEncoderViT ends with two Conv2d operations, and once the number of tokens has changed, the output can no longer be reshaped into the grid those convolutions expect. Would adding an operation that expands the tokens back to the original shape at this point be feasible?
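
To make the two questions concrete, here is a minimal shape-level sketch of both ideas: flattening the (B, W, H, C) grid into a (B, N, C) sequence before matching, then expanding the merged tokens back to the full grid before the convolutions. It is written in NumPy only so the shapes can be checked without PyTorch. The helper `bipartite_merge_sketch`, the tiny grid size, and the value of `r` are all assumptions made for illustration. This is a simplified imitation of bipartite soft matching, not the ToMe library's code.

```python
import numpy as np

def bipartite_merge_sketch(x, r):
    """Hypothetical helper: merge the r most similar (A -> B) token pairs.

    x has shape (B, N, C). Returns the merged tokens, shape (B, N - r, C),
    and an unmerge function mapping (B, N - r, C) back to (B, N, C).
    """
    B, N, C = x.shape
    r = min(r, N // 2)
    metric = x / np.linalg.norm(x, axis=-1, keepdims=True)
    a, b = metric[:, ::2], metric[:, 1::2]      # alternating split into sets A and B
    scores = a @ b.transpose(0, 2, 1)           # cosine similarity, shape (B, Na, Nb)
    node_max = scores.max(axis=-1)              # best score for each A token
    node_idx = scores.argmax(axis=-1)           # index of its best partner in B
    order = np.argsort(-node_max, axis=-1)      # most similar A tokens first
    src_idx, unm_idx = order[:, :r], order[:, r:]
    dst_idx = np.take_along_axis(node_idx, src_idx, axis=-1)

    src_all, dst_all = x[:, ::2], x[:, 1::2]
    unm = np.take_along_axis(src_all, unm_idx[..., None], axis=1)
    src = np.take_along_axis(src_all, src_idx[..., None], axis=1)
    dst = dst_all.copy()
    count = np.ones((B, dst.shape[1], 1), dtype=x.dtype)
    for i in range(B):
        np.add.at(dst[i], dst_idx[i], src[i])   # sum merged A tokens into their partner
        np.add.at(count[i], dst_idx[i], 1)
    merged = np.concatenate([unm, dst / count], axis=1)  # mean over each merged group

    def unmerge(y):
        # Copy every merged token back to all of the positions it came from.
        n_unm = unm_idx.shape[1]
        unm_y, dst_y = y[:, :n_unm], y[:, n_unm:]
        out = np.zeros((B, N, y.shape[-1]), dtype=y.dtype)
        out[:, 1::2] = dst_y
        for i in range(B):
            out[i, 2 * unm_idx[i]] = unm_y[i]
            out[i, 2 * src_idx[i]] = dst_y[i, dst_idx[i]]
        return out

    return merged, unmerge

# Tiny stand-in for the (1, 64, 64, 1280) vit_h feature map.
B, H, W, C = 1, 8, 8, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((B, H, W, C)).astype(np.float32)

tokens = x.reshape(B, H * W, C)                    # question 1: flatten the grid
merged, unmerge = bipartite_merge_sketch(tokens, r=16)
restored = unmerge(merged).reshape(B, H, W, C)     # question 2: expand back to the grid
neck_input = restored.transpose(0, 3, 1, 2)        # channels-first layout for Conv2d
print(tokens.shape, merged.shape, neck_input.shape)
```

The sketch only shows that the shapes can be made to line up. Whether the unmerged features stay accurate enough for the mask decoder is not something it can show, and the windowed-attention blocks would still need the full grid restored before they partition it into windows.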

Thank you for your help.
