Description
In https://pytorch.org/tutorials/intermediate/named_tensor_tutorial.html I think there is a bug:
```python
dot_prod = q.div_(scale).matmul(k.align_to(..., 'D_head', 'T_key'))
# [...]
attn_weights = self.attn_dropout(F.softmax(dot_prod / scale,
                                           dim='T_key'))
```
The scaling is applied twice: once via the in-place `q.div_(scale)` before the matmul, and again via `dot_prod / scale` inside the softmax. I think it should be applied only once, e.g. by dropping the `/ scale` inside the softmax (or the earlier `q.div_(scale)`). A small demonstration follows.
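For concreteness, here is a minimal, self-contained sketch of the effect (using plain tensors rather than named tensors, with made-up shapes) showing that dividing the attention logits by `scale` twice produces different attention weights than dividing once:

```python
import math
import torch
import torch.nn.functional as F

# Hypothetical shapes, just to illustrate the double scaling.
# With the tutorial's code, the logits end up divided by scale**2
# instead of scale, because q is divided in place before the matmul
# and dot_prod is divided again inside the softmax.
D_head = 4
scale = math.sqrt(D_head)
q = torch.randn(3, D_head)  # (T_query, D_head)
k = torch.randn(5, D_head)  # (T_key, D_head)

once = F.softmax(q.matmul(k.t()) / scale, dim=-1)             # intended: scale once
twice = F.softmax((q / scale).matmul(k.t()) / scale, dim=-1)  # what the tutorial computes

print(torch.allclose(once, twice))  # False: the extra division changes the weights
```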
Thanks.