Loss (top) and global norm (bottom) during training when the gradients are clipped individually. The final loss reliably reaches the range -15 to -16, with the best run reaching -16.6. The learning rate starts at 0.0008 and is multiplied by 0.2 every 1000 epochs, so the final rate after three decays is 0.0008 × 0.2³ = 0.0000064. The remaining spikes in the loss are probably caused by the Adam optimizer overshooting the best update due to accumulated momentum.
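A minimal sketch of this setup, assuming PyTorch: the model, data, loss, and clip value below are placeholders, not the actual experiment. It shows per-element (individual) gradient clipping, logging the global norm for the bottom plot, and the step-decay schedule that multiplies the learning rate by 0.2 every 1000 epochs.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import StepLR

# Hypothetical stand-ins; the real architecture and loss are not given here.
model = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 2))
data = torch.randn(256, 2)

optimizer = torch.optim.Adam(model.parameters(), lr=8e-4)
# Multiply the learning rate by 0.2 every 1000 epochs:
# 8e-4 * 0.2**3 = 6.4e-6 after three decays.
scheduler = StepLR(optimizer, step_size=1000, gamma=0.2)

for epoch in range(4000):
    optimizer.zero_grad()
    loss = model(data).pow(2).mean()  # placeholder loss
    loss.backward()

    # Clip each gradient element individually (value clipping); the clip
    # value 1.0 is an assumption.
    torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)

    # Global gradient norm, logged only for monitoring (bottom plot).
    global_norm = torch.norm(
        torch.stack([p.grad.norm() for p in model.parameters() if p.grad is not None])
    )

    optimizer.step()
    scheduler.step()
```

Clipping each gradient value separately (rather than rescaling by the global norm) changes the update direction when only a few components are large, which is one reason the global norm is still worth tracking alongside the loss.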