### Description
When using `segmentation_models_pytorch.DPT` with the encoder `tu-vit_base_patch16_224.augreg_in21k` and default parameters, the forward pass raises a `RuntimeError` due to a tensor shape mismatch.
### Code to reproduce
```python
import segmentation_models_pytorch as smp
import torch

model = smp.DPT(
    encoder_name='tu-vit_base_patch16_224.augreg_in21k',
    encoder_depth=4,
    encoder_weights='imagenet',
    encoder_output_indices=None,
    decoder_readout='cat',
    decoder_intermediate_channels=(256, 512, 1024, 1024),
    decoder_fusion_channels=256,
    in_channels=3,
    classes=1,
    activation=None,
    aux_params=None,
)
x = torch.rand(8, 3, 224, 224)
y = model(x)  # RuntimeError occurs here
```
### Error traceback
```
RuntimeError: The expanded size of the tensor (196) must match the existing size (8) at non-singleton dimension 1. Target sizes: [8, 196, 768]. Tensor sizes: [8, 768]
```
### Environment
- segmentation-models-pytorch: 0.4.1.dev0 (latest)
- timm: 1.0.15
- pytorch: 2.4.0
- python: 3.10.14
- OS: Windows 10
I also tried setting `encoder_weights=None` and explicitly specifying `encoder_output_indices=(3, 6, 9, 11)`, but the same error occurs. The encoder appears to return a pooled `[B, C]` tensor (e.g. `[8, 768]`) instead of the expected `[B, N, C]` token sequence, so the reshape/readout operations in the DPT decoder fail. Please let me know if I'm missing something about using ViT encoders with DPT.
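For reference, a minimal pure-`torch` sketch (my own reconstruction, not code from the library) that reproduces the exact expand error above: if the readout path receives a pooled `[B, C]` tensor where DPT expects a `[B, N, C]` token sequence, broadcasting it to `[B, 196, 768]` fails, whereas a `[B, 1, C]` readout token expands cleanly for the `'cat'` readout.

```python
import torch

B, N, C = 8, 196, 768           # batch, patch tokens (14x14), embed dim for vit_base_patch16_224
patch_tokens = torch.rand(B, N, C)  # expected encoder output: [B, N, C]
pooled = torch.rand(B, C)           # what the encoder seems to return: [B, C]

# What the 'cat' readout presumably needs: a per-image readout token
# broadcast across all N patch positions, then concatenated channel-wise.
readout = pooled.unsqueeze(1).expand(-1, N, -1)     # [8, 196, 768] -- works
fused = torch.cat([patch_tokens, readout], dim=-1)  # [8, 196, 1536]

# Expanding the raw [B, C] tensor directly reproduces the reported error:
# dim 1 (size 8) is non-singleton, so it cannot be expanded to 196.
err = None
try:
    pooled.expand(B, N, C)
except RuntimeError as e:
    err = e
print(type(err).__name__)  # RuntimeError
```

This suggests the fix is on the encoder side (return the token sequence, or `unsqueeze` the prefix token before broadcasting), but I may be misreading where the reshape happens.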