Attempt 1 - Parallelization in one dimension, non-blocking communication
In the previous example, the ghostlines were retrieved from the neighbouring processes (the northern and southern neighbours) with blocking communication before any computations. For this problem it is entirely possible to do computations while waiting for the communication to finish, since only the first and last rows (and columns in the 2D case) need information from neighbouring processes; the shallow water equation can be solved for all the "inner" grid cells independently of the other processes. Computations can therefore be overlapped with communication by
  1. Starting a non-blocking exchange of the ghostlines with the north and south neighbours (using MPI_Irecv and MPI_Isend) for the field arrays
  2. Doing the calculations for the inner grid; in the 1D case these are the inner rows (the first and last local columns can still be calculated, since the decomposition is only along rows)
  3. Waiting for the communication to finish with MPI_Waitall
  4. Calculating the first and last rows of the local domain, which need the ghostlines (see the sketch after this list)
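A minimal sketch of this ordering for a single field, assuming a row-major local array with one ghost row above and below; the names (step, update_row, north, south) and the stencil are placeholders for illustration, not the actual scheme used here:

```cpp
// Hedged sketch of steps 1-4, assuming a (local_rows + 2) x cols row-major
// array of doubles where row 0 and row local_rows + 1 are the ghostlines, and
// north/south are the neighbour ranks (MPI_PROC_NULL at the global edges).
#include <mpi.h>
#include <vector>

// Placeholder stencil update for one local row (not the actual scheme).
static void update_row(std::vector<double>& f, int i, int cols)
{
    for (int j = 1; j < cols - 1; ++j)
        f[i * cols + j] = 0.25 * (f[(i - 1) * cols + j] + f[(i + 1) * cols + j] +
                                  f[i * cols + j - 1] + f[i * cols + j + 1]);
}

static void step(std::vector<double>& field, int local_rows, int cols,
                 int north, int south, MPI_Comm comm)
{
    MPI_Request reqs[4];

    // 1. Start the non-blocking ghostline exchange with north and south.
    MPI_Irecv(&field[0],                       cols, MPI_DOUBLE, north, 0, comm, &reqs[0]);
    MPI_Irecv(&field[(local_rows + 1) * cols], cols, MPI_DOUBLE, south, 1, comm, &reqs[1]);
    MPI_Isend(&field[1 * cols],                cols, MPI_DOUBLE, north, 1, comm, &reqs[2]);
    MPI_Isend(&field[local_rows * cols],       cols, MPI_DOUBLE, south, 0, comm, &reqs[3]);

    // 2. Update the inner rows, which do not need the ghostlines.
    for (int i = 2; i < local_rows; ++i)
        update_row(field, i, cols);

    // 3. Wait for the ghostline exchange to complete.
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

    // 4. Update the first and last local rows, which do need the ghostlines.
    update_row(field, 1, cols);
    update_row(field, local_rows, cols);
}
```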
Because the ghostlines are exchanged inside a separate function, the MPI_Request objects need to be passed between functions, e.g. created first inside the integrate function, passed to exchange_horizontal_code_lines_mpi, and then passed on to the computational function to be used with MPI_Waitall. Unfortunately, I couldn't work out how to do this in C++, which is a new language to me, even after several hours of trying. In the end, I had to move the ghostline communication inside the integrate function, which makes the code a lot uglier but works.
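For reference, one way the request handles could travel between functions in C++ is to let integrate own a std::array<MPI_Request, 4> and hand it to the other functions by reference. The sketch below uses the same hypothetical layout as above; all signatures and names are illustrative, not the actual code behind this report:

```cpp
#include <mpi.h>
#include <array>
#include <vector>

using Requests = std::array<MPI_Request, 4>;

// Starts the non-blocking ghostline exchange and stores the handles in reqs,
// which is owned by the caller.
void exchange_horizontal_code_lines_mpi(std::vector<double>& field, int local_rows,
                                        int cols, int north, int south,
                                        MPI_Comm comm, Requests& reqs)
{
    MPI_Irecv(&field[0],                       cols, MPI_DOUBLE, north, 0, comm, &reqs[0]);
    MPI_Irecv(&field[(local_rows + 1) * cols], cols, MPI_DOUBLE, south, 1, comm, &reqs[1]);
    MPI_Isend(&field[cols],                    cols, MPI_DOUBLE, north, 1, comm, &reqs[2]);
    MPI_Isend(&field[local_rows * cols],       cols, MPI_DOUBLE, south, 0, comm, &reqs[3]);
}

// The computational routine receives the same Requests object and completes
// the exchange before touching the boundary rows.
void compute_boundary_rows(std::vector<double>& field, int local_rows, int cols,
                           Requests& reqs)
{
    MPI_Waitall(static_cast<int>(reqs.size()), reqs.data(), MPI_STATUSES_IGNORE);
    // ... update rows 1 and local_rows here, using the received ghostlines ...
}

void integrate(std::vector<double>& field, int local_rows, int cols,
               int north, int south, MPI_Comm comm)
{
    Requests reqs;                                         // created in integrate
    exchange_horizontal_code_lines_mpi(field, local_rows, cols,
                                       north, south, comm, reqs);
    // ... update the inner rows here while the exchange is in flight ...
    compute_boundary_rows(field, local_rows, cols, reqs);  // MPI_Waitall happens here
}
```

Since the array is passed by reference, the handles started in one function are the same ones completed with MPI_Waitall in another, and no copies of the MPI_Request objects are made.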
With latency hiding, the runtime went from 4.3 to 3.4 seconds on DAG using 4 processes, --size 2000 and --iter 400.