Description
Is your feature request related to a problem? Please describe.
Currently, when changing the ViT img size from a rectangular original size, resample_abs_pos_embed()
does not work correctly, since it does not know the original rectangular size and assumes a square one.
pytorch-image-models/timm/models/vision_transformer.py
Lines 1096 to 1103 in 5dce710
pytorch-image-models/timm/layers/pos_embed.py
Lines 32 to 34 in 5dce710
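To illustrate the failure mode, here is a minimal sketch (not timm's actual code, just a simplified model of the square-grid inference) showing why inferring the old grid from the token count breaks for rectangular inputs: a 1024x128 input with patch 16 gives a 64x8 grid, i.e. 512 tokens, which is not a perfect square.

```python
import math

def infer_square_grid(num_tokens: int) -> tuple[int, int]:
    """Simplified sketch: infer a square grid from the pos embed token count,
    as happens when the original (old) size is not passed in."""
    side = math.isqrt(num_tokens)
    if side * side != num_tokens:
        # A rectangular grid cannot be recovered from the token count alone.
        raise ValueError(f"{num_tokens} tokens do not form a square grid")
    return (side, side)

# Square ViT: 224 / 16 = 14 patches per side -> inference works.
print(infer_square_grid(196))  # (14, 14)

# AudioMAE-style rectangular input: 1024x128 image, patch 16
# -> grid (64, 8) -> 512 tokens, not a perfect square.
grid = (1024 // 16, 128 // 16)
num_tokens = grid[0] * grid[1]
try:
    infer_square_grid(num_tokens)
except ValueError as e:
    print(e)
```

In the real code the wrong square size leads to an incorrect (or failing) reshape of the pos embed rather than an explicit error, which is why passing the original grid size explicitly is needed.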
Describe the solution you'd like
It should work out of the box.
Describe alternatives you've considered
Manually resize the pos embed.
Additional context
Apparently dynamic img size also will not work when the original img size is rectangular.
pytorch-image-models/timm/models/vision_transformer.py
Lines 603 to 609 in 5dce710
This is a rare problem since most image ViTs use square inputs. The particular model I'm using is my previously ported AudioMAE (https://huggingface.co/gaunernst/vit_base_patch16_1024_128.audiomae_as2m), which uses rectangular input (mel-spectrogram).
I understand this is not so straightforward to support, since once the model is created (with the updated image size), the original image size is lost. Some hacks can probably bypass this, but they are not so nice:

- Propagate the original image size to the _load_weights() function.
- Create a model with the original image size and load weights as usual, then add a new method like .set_img_size() which updates the internal img_size attribute and resamples the pos embed.
Perhaps an easier solution is to fix dynamic img size to pass the original img size (which I tested locally and works):

```python
if self.dynamic_img_size:
    B, H, W, C = x.shape
    pos_embed = resample_abs_pos_embed(
        self.pos_embed,
        (H, W),
        self.patch_embed.grid_size,  # pass the original (possibly rectangular) grid size
        num_prefix_tokens=0 if self.no_embed_class else self.num_prefix_tokens,
    )
```