Note on the optimizers. Adam performs better than RMSProp, but RMSProp produces an interesting cell state sequence, much cleaner than Adam's. I could not figure out whether scheduling an explicit learning rate decay helps Adam or not; I kept it anyway, since it seems to add a little stability at lower rates. Adam also behaves in a very spiky way in the loss plot, possibly due to vanishing gradient variances.
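For reference, scheduling a decay for Adam can look like the following minimal sketch (TF 1.x API). The stand-in loss and the initial rate and decay constants are placeholders for illustration, not the values used in these experiments.

```python
import tensorflow as tf

# Stand-in loss so the sketch runs on its own; the real model's loss goes here.
w = tf.Variable(1.0)
loss = tf.square(w - 3.0)

global_step = tf.Variable(0, trainable=False, name="global_step")
learning_rate = tf.train.exponential_decay(
    1e-3,              # initial learning rate (placeholder)
    global_step,
    decay_steps=1000,  # decay interval in steps (placeholder)
    decay_rate=0.96,   # multiplicative decay factor (placeholder)
    staircase=True)

optimizer = tf.train.AdamOptimizer(learning_rate)
train_op = optimizer.minimize(loss, global_step=global_step)
```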
Note on regularization. Apparently, regularization is a big no-no in LSTMs and RNNs because it kills the memorization process. This was another thing that was wrong in the original code.
Note on spikiness in the loss plot. I was looking at the gradients and their clipping: there are some huge values in the gradients, why? I reduced some of the spikiness at the start of training by correctly downsizing the initializer for the output weights (see Fig. \ref{610161}). There are still quite a lot of spikes, and the global norm of the gradients increases as the learning rate is lowered. It seems that a spike in the global norm corresponds to an attempt at correcting a bad direction taken in the previous step. It looks like this: from a good situation with a small loss, the algorithm takes a chance with an update that lands in a bad situation with a worse loss, and the correction then requires a big update to reverse the mistake. Why are the gradients so big? Aren't they supposed to get closer to zero as we approach a minimum? Looking at the norms of the separate gradients (there are 6: input_w, input_b, output_w, output_b, lstm_weights, lstm_bias), I noticed that the biggest ones are always output_w and output_b, the parameter-extraction layer, followed by lstm_weights. The default clipping by the global norm is going to screw up some updates, so I decided to clip every gradient separately: output_w at 5, lstm_weights at 20, lstm_bias at 5, and the rest at 1. Performance improved by a couple of loss points.
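A minimal sketch of this per-gradient clipping (TF 1.x), reusing the `optimizer`, `loss`, and `global_step` from the sketch above. Matching gradients to variables by name is an assumption about how the variables are defined; the thresholds are the ones quoted above.

```python
import tensorflow as tf

# Per-variable clip norms, taken from the values quoted above;
# anything not listed here is clipped at 1.
clip_norms = {"output_w": 5.0, "lstm_weights": 20.0, "lstm_bias": 5.0}
default_clip = 1.0

grads_and_vars = optimizer.compute_gradients(loss)
clipped = []
for grad, var in grads_and_vars:
    if grad is None:
        clipped.append((grad, var))
        continue
    # Pick the clip norm by variable name (assumed naming convention).
    norm = next((n for key, n in clip_norms.items() if key in var.name),
                default_clip)
    clipped.append((tf.clip_by_norm(grad, norm), var))

train_op = optimizer.apply_gradients(clipped, global_step=global_step)
```

Unlike `tf.clip_by_global_norm`, which rescales all gradients by the same factor when their joint norm is too large, `tf.clip_by_norm` caps each tensor independently, so a spike in output_w no longer shrinks the LSTM updates along with it.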