Description
Hi,
I am opening this issue because I noticed a weird behavior of the spatial transformer networks implementation. I summarized my findings here. In short, what is happening is that when the input is normalised and then fed to the STN, the `F.grid_sample` call adds zero-padding; however, the normalisation changes the background value from 0 to `-mean/std`, so the padded pixels no longer match the background.
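To make the mismatch concrete, here is a minimal sketch (assuming the standard MNIST statistics `mean=0.1307`, `std=0.3081` used in the tutorial): after normalisation the background sits at `-mean/std ≈ -0.4242`, while the regions padded by `F.grid_sample` stay at 0, i.e. brighter than the background:

```python
import torch
import torch.nn.functional as F

# Assumed MNIST statistics, as in the tutorial's transforms.Normalize call.
mean, std = 0.1307, 0.3081

# A normalised "image" whose pixels were originally all 0 (pure background).
img = torch.zeros(1, 1, 28, 28)
img = (img - mean) / std                  # background is now -mean/std ~ -0.4242

# A zoomed-out affine transform, so the grid samples outside the input.
theta = torch.tensor([[[2.0, 0.0, 0.0],
                       [0.0, 2.0, 0.0]]])
grid = F.affine_grid(theta, img.size(), align_corners=False)
out = F.grid_sample(img, grid, padding_mode="zeros", align_corners=False)

print(img.min().item())  # -0.4242... -> the normalised background
print(out.max().item())  # 0.0        -> the zero-padding, brighter than the background
```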
This causes the STN to collapse very early and never learn the correct transformation. You can actually see this in the example code already (https://pytorch.org/tutorials/intermediate/spatial_transformer_tutorial.html), because the learnt transformation is zooming OUT instead of zooming IN on the digits. For the original 28 x 28 images this is not such a big problem; however, when you move on to cluttered MNIST, as in the original publication, the difference is huge. Once again, please have a look here.
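For reference, one possible workaround (just a sketch under the same assumed statistics, not necessarily the right fix for the tutorial) is to shift the input so that the background sits at 0 before sampling and shift it back afterwards, which makes the zero-padding coincide with the background:

```python
import torch
import torch.nn.functional as F

mean, std = 0.1307, 0.3081  # assumed MNIST statistics
bg = -mean / std            # normalised background value

def grid_sample_matching_bg(x, grid):
    # Shift so the background is at 0, let grid_sample zero-pad,
    # then shift back: padded pixels now equal the background value.
    return F.grid_sample(x - bg, grid, padding_mode="zeros",
                         align_corners=False) + bg
```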
I think the tutorial for the STN should be updated and should also include the cluttered MNIST example, because that is what drives the point home. I would volunteer to do so if I am given permission to go ahead.
Unfortunately, most other implementations I was able to find on the web also have this bug.
cc @sekyondaMeta @svekars @carljparker @NicolasHug @kit1980 @subramen