Note on the embedding of the input coordinates. Tried using the image coordinates natively (in the 720x576 pixel space) and noticed that the initialization of the input embedding layer (e.g. setting the standard deviation of the random normal initializer to 1000) had a big effect on training. Decided to normalize the coordinates to the unit square, and the effect vanished (the standard value of 1 works fine now).
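A minimal sketch of the normalization (the TensorFlow/Keras API and the embedding size of 64 are assumptions, not specified in these notes):

\begin{verbatim}
# Sketch only: framework (TensorFlow/Keras) and layer size are illustrative.
import tensorflow as tf

IMG_W, IMG_H = 720, 576   # native pixel space of the input coordinates

def normalize_coords(xy):
    """Map pixel coordinates (x, y) into the unit square [0, 1]^2."""
    return xy / tf.constant([IMG_W, IMG_H], dtype=xy.dtype)

# With unit-square inputs the default-scale initializer is enough;
# no need for the extreme stddev=1000 that raw pixel coordinates required.
coord_embedding = tf.keras.layers.Dense(
    64,
    kernel_initializer=tf.keras.initializers.RandomNormal(stddev=1.0),
)

xy_pixels = tf.constant([[360.0, 288.0]])           # a point in 720x576 space
emb = coord_embedding(normalize_coords(xy_pixels))  # embed the normalized coords
\end{verbatim}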
Note on the initialization and normalization (via the exponential function) of the standard deviation parameters. Found a big factor that slows down learning: the exponentiation of the sx, sy parameters. This is done to ensure the values are always positive, but it creates a problem for the extractors: to make sx, sy really small (as when training nails the correct predictions), the underlying parameters need to reach large negative values so that the exponentiation yields small outputs. That is hard to do quickly, since each update step can only change the parameters by a limited amount. A quick workaround is to use sx = exp(30 * param), or any other suitable multiplier, so that a small change in the parameter produces a large change in the exponential and updates take effect faster (see the sketch below). Also, check that the initializers give sensible values at the start. With these fixes, training went from needing 10000 iterations down to 3000, with good results within a few hundred iterations already.
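A minimal sketch of the scaled-exponential parameterization (TensorFlow is again an assumption, and the initial value sx0 = 0.5 is only illustrative; the factor 30 is the one from the note):

\begin{verbatim}
# Sketch only: framework and the initial value sx0 are illustrative.
import math
import tensorflow as tf

SCALE = 30.0  # multiplier from the note; lets small parameter steps move sx a lot

# Initialize the raw parameter so that the *initial* sx is a sensible value,
# e.g. sx0 = 0.5:  param0 = log(sx0) / SCALE.
sx0 = 0.5
s_param = tf.Variable(math.log(sx0) / SCALE, dtype=tf.float32)

def sx():
    # Plain exp(param) needs param ~ -7 to reach sx ~ 1e-3; with the scale,
    # param only has to move to about -0.23, which gradient descent reaches
    # much faster.
    return tf.exp(SCALE * s_param)
\end{verbatim}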
Figs. \ref{241438}, \ref{646912} and \ref{633461} show a few visualizations of the results after the fixes.