After watching the gradient norms separately, I believe clipping each of them on its own may perform better than a single global-norm clip. The following figures show the results. The loss reached a new best of -16.8 with the following clip values: 1, 1, 10, 5, 50, 5.
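To make the difference between the two clipping schemes concrete, here is a minimal sketch in plain Python. It is not the training code behind the results above: the function names and the toy gradients are hypothetical, and the assignment of the six clip values to particular parameter groups is an assumption, since only the values themselves are given.

```python
import math


def l2_norm(values):
    """L2 norm of a flat list of gradient values."""
    return math.sqrt(sum(v * v for v in values))


def clip_per_group(grads, max_norms):
    """Clip each gradient group to its own maximum L2 norm.

    grads: list of groups, each a flat list of floats (hypothetical layout).
    max_norms: one clip threshold per group.
    Returns the clipped groups and the norms measured before clipping.
    """
    if len(grads) != len(max_norms):
        raise ValueError("need exactly one clip value per gradient group")
    clipped, norms = [], []
    for grad, max_norm in zip(grads, max_norms):
        norm = l2_norm(grad)
        norms.append(norm)
        if norm > max_norm:
            # Rescale only this group; the other groups are unaffected.
            scale = max_norm / norm
            clipped.append([g * scale for g in grad])
        else:
            clipped.append(list(grad))
    return clipped, norms


def clip_global(grads, max_norm):
    """Clip all groups together using one norm over every gradient value."""
    total = l2_norm([g for grad in grads for g in grad])
    scale = max_norm / total if total > max_norm else 1.0
    return [[g * scale for g in grad] for grad in grads], total


# Clip values quoted in the text; their mapping to groups is assumed here.
CLIP_VALUES = [1, 1, 10, 5, 50, 5]

# Toy gradients, invented purely for illustration (not measured data).
toy_grads = [
    [3.0, 4.0],
    [0.3, 0.4],
    [6.0, 8.0],
    [0.0, 2.0],
    [30.0, 40.0],
    [0.0, 0.0],
]

per_group, group_norms = clip_per_group(toy_grads, CLIP_VALUES)
globally, total_norm = clip_global(toy_grads, max_norm=1.0)
```

With these toy numbers, per-group clipping rescales only the first group, whose norm exceeds its threshold, while the global clip shrinks every group by the same factor because the one large group dominates the total norm. That is the effect I suspect makes separate clipping work better, though this sketch does not demonstrate it on the actual model.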