The bottom plot shows the individual gradient norms before clipping. The largest values are in the output layer's bias and weights. The network might be trying very hard to nail those normal distributions and struggling, or it might be hitting numerical problems that keep making the gradients explode.
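For reference, here is a minimal sketch of how per-parameter gradient norms can be recorded before global-norm clipping. It uses plain Python, not the actual training code. The parameter names (`hidden.weight`, `output.weight`, `output.bias`), the gradient values, and the `max_norm` threshold are all made up for illustration, and global-norm clipping is an assumption about the clipping scheme.

```python
import math

def grad_norms(grads):
    """Return the L2 norm of each named gradient (given as a flat list of floats)."""
    return {name: math.sqrt(sum(g * g for g in values))
            for name, values in grads.items()}

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so their combined L2 norm is at most max_norm.

    Returns the clipped gradients, the per-parameter norms measured
    before clipping, and the global norm before clipping.
    """
    norms = grad_norms(grads)
    total = math.sqrt(sum(n * n for n in norms.values()))
    # A scale of 1.0 leaves the gradients unchanged when they are already small enough.
    scale = min(1.0, max_norm / (total + 1e-6))
    clipped = {name: [g * scale for g in values]
               for name, values in grads.items()}
    return clipped, norms, total

# Hypothetical gradients: the output layer dominates, as in the plot described above.
example_grads = {
    "hidden.weight": [0.1, -0.2, 0.05],
    "output.weight": [3.0, -4.0],
    "output.bias": [12.0],
}

clipped, pre_clip_norms, total_norm = clip_by_global_norm(example_grads, max_norm=1.0)
largest = max(pre_clip_norms, key=pre_clip_norms.get)

# List the pre-clipping norms from largest to smallest.
for name, norm in sorted(pre_clip_norms.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} {norm:.4f}")
```

Logging the norms before the scaling step is what makes the comparison between layers meaningful: after clipping, every gradient has been shrunk by the same factor, so the plot would only show the relative sizes and hide how large the raw values were.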