Hi all

I've got a coupled model, FOCI-OpenIFS, which consists of OpenIFS 40R1 + NEMO 3.6 + OASIS3-MCT3 + river routing (similar to EC-Earth). 
I'm running it on HLRN-IV in Goettingen, Germany, which has Intel Skylake chips and Intel compilers. 

I can compile and run the model with Intel Fortran 18.0.5 and Intel MPI 2018.4 without any problems. One year for a T159/ORCA05 configuration takes less than 45 minutes using 279 CPUs for OpenIFS and 480 CPUs for NEMO. 

However, when upgrading to Intel Fortran 19.0.3 and Intel MPI 2019.3 the model got around 100 times slower! 

Digging into the details with various performance tools, the support team found that NEMO runs fine with the new compiler versions, but sending the ocean state to OpenIFS (SST, sea ice, etc.) takes up to 20 minutes. Basically, OpenIFS gets stuck at an OASIS_GET call, while NEMO completed the corresponding OASIS_PUT several minutes earlier. 
The river routing scheme, which is a single MPI task, also freezes and waits for runoff from OpenIFS. 

I can run with Intel Fortran 19.0.3 + Intel MPI 2018.4 without problems, so the issue lies with Intel MPI 2019. I don't know whether the problem is specific to my hardware. 

The support team suspects that Intel MPI 2019, for some reason, cannot handle the many asynchronous MPI messages that are sent each coupling step. 

OpenIFS standalone, i.e. no coupling, runs maybe 20% slower with Intel MPI 2019 compared to 2018, which is bad but not as bad as the coupled model. 

I'm wondering if anyone else has tried to run OpenIFS with Intel compilers version 2019 and noticed anything like this. 
In particular, this would cause problems for coupled models like EC-Earth. 

Cheers
Joakim


9 Comments

  1. Hi Joakim,

    Interesting behaviour. I have not done much testing with Intel 19, and I would not have the OASIS coupler in place to test your setup myself anyway. I think we have Intel 18 & 19, so I'll try some speed tests to see what I get. 20% slower is not particularly good.

    If it's an issue with the number of async MPI messages, maybe Intel MPI 2019 has lowered some internal defaults and you need to set some Intel MPI specific environment variables that were not needed in the 2018 version?
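
    For example (just a guess at where to start looking), Intel MPI's own debug output will at least show which fabric and provider it actually picks at start-up. I_MPI_DEBUG is a standard Intel MPI variable, but the exact output format differs between versions, and the launch line below is only a placeholder for your normal run command:

      export I_MPI_DEBUG=5    # print fabric/provider selection and pinning info at MPI_Init
      mpirun ./your_model     # hypothetical: substitute your usual coupled-run launch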

    Are you restricted to using Intel's MPI?  Is there an installation of OpenMPI or MPICH you could try instead?

    Let us know what you find.

    Cheers,
    Glenn

  2. Hi Glenn 

    Thanks, it would be great if you could test it out. I don't think too many people use Intel 19 yet since it's quite new. We had to switch for the runs where we need XIOS detached, otherwise we would just stick with Intel 18. 

    The only way I can run OpenIFS+NEMO+OASIS with Intel 19 at a reasonable speed is to compile with Intel Fortran 19 and Intel MPI 2019 but at run time make sure I use "mpirun" from MVAPICH2. This confirms that Intel MPI 2019 is the issue. OASIS refuses to compile with OpenMPI, so I can't use that one for now. 

    What bugs me is that our ECHAM6+NEMO+OASIS model is slightly faster with Intel 19, even though ECHAM6 runs on 600 CPUs compared to 280 CPUs for OpenIFS. ECHAM sends more MPI messages and still does not freeze, so it can't be the sheer number of MPI messages that is the problem. I was thinking it could be something specific to my OASIS interfaces in OpenIFS, but I'm not sure what it would be. 

    If you find a slowdown with Intel 19, or a speed up, or no change at all, please let me know. I'd be really interested to know. 

    Hopefully, I can persuade the support team to get some wizard from Intel to look it over... 

    Cheers

    Joakim 

  3. We don't have Intel v19 on our systems at the moment so I can't try it.

    Since you can compile & link with Intel v19 and Intel MPI v19 but need to run with mpirun from another MPI installation, that strongly suggests to me that some part of the Intel MPI 19 environment is misconfigured somewhere. Is the ECHAM setup doing anything different in terms of the environment? Different modules loaded, for example?
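
    A quick way to compare the two setups would be to run something like the following (generic commands, nothing site-specific intended) in both the ECHAM and the OpenIFS job scripts and diff the output:

      which mpirun          # which launcher is actually first in PATH
      module list           # which modules the job has loaded
      env | grep -i I_MPI   # any Intel MPI variables already set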

    Cheers,
    Glenn

  4. Hi Glenn

    I believe I've found a solution to my slowdown problem, and the cause seems to have been that I was not using the proper settings for our hardware and software. 
    The trick seems to have been switching launcher from "mpirun" (part of Intel MPI) to "srun" (Slurm's native launcher) and setting some new environment variables. 
    This changed a few things, e.g. how tasks are distributed across nodes, how the MPI fabric is chosen, etc.

    Now my coupled model even speeds up a little bit compared to Intel Fortran & MPI 2018. I'm not sure what solved it in the end, but at least I've got something that works. 
    OpenIFS standalone also seems to work fine. 
    I'd be glad to share the full set of environment variables, compiler flags etc if anyone else experiences similar problems in the future. 

    Cheers
    Joakim 

    1. Hi Joakim,

      I'd be interested to see the compiler flags and env variables you used. They might be useful for others. Could you post them?

      Thanks,
      Glenn

  5. Hi Glenn

    No problem. So in summary my changes were: 

    1. Use "srun" rather than "mpirun". Full command "

      srun --mpi=pmi2 -l --kill-on-bad-exit=1 --cpu_bind=cores --distribution=cyclic:cyclic --multi-prog ./hostfile_srun" 
      where hostfile_srun is 

      0-479 ./nemo.exe
      480-480 ./rnfmap.exe
      481-759 ./master.exe -e KCM2


    2. Set the following environment variables: 

      export I_MPI_PMI_LIBRARY=libpmi.so
      export I_MPI_FABRICS=shm:ofi
      export I_MPI_OFI_PROVIDER=psm2
      export I_MPI_FALLBACK=disable
      export I_MPI_SLURM_EXT=0
      export I_MPI_LARGE_SCALE_THRESHOLD=8192
      export I_MPI_DYNAMIC_CONNECTION=1
      export I_MPI_CHECK_DAPL_PROVIDER_COMPATIBILITY=0
      export I_MPI_HARD_FINALIZE=1
      export KMP_STACKSIZE=64m
      export KMP_AFFINITY=verbose,granularity=fine,compact,1,0
      export OMP_NUM_THREADS=1
      export FORT_BUFFERED=no
      export SLURM_CPU_FREQ_REQ=High


      There's something in the release notes for Intel MPI 2019 about "PSM2 support", which apparently was not implemented in Intel MPI 2018, so maybe that's part of the explanation. I think most of these flags are very specific to our hardware, so another user at e.g. ECMWF or SMHI probably could not just pick them up and expect them to work. 
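
      In case anyone wants to check what the OFI layer actually offers on their own nodes, libfabric ships a small query tool for this; the snippet below is only useful if fi_info happens to be installed on the compute nodes:

      fi_info -l        # list the available libfabric providers; psm2 should appear on Omni-Path systems
      fi_info -p psm2   # show details for the psm2 provider only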


    I did not change the compiler flags, but here they are anyway: 


    OIFS_OPT="-g -traceback -O3 -xCORE-AVX512 -qopt-zmm-usage=high -Wl,-z,noexecstack"
    export OIFS_FFLAGS="-qopenmp -m64 -align array32byte -fp-model precise -convert big_endian ${OIFS_OPT}"
    export OIFS_CFLAGS="-fp-model precise -qopenmp ${OIFS_OPT}"


    I'm not sure how necessary the last three OIFS_OPT flags are, but they were recommended so I've got them there. 

    Again, I'm not using any multi-threading, so it's always 1 thread per task. 

    Hope this helps!

    Cheers
    Joakim 

    1. Thanks. What was the reason for not using OpenMP? 

      (If you're not using OpenMP, the -qopenmp flag on OIFS_CFLAGS can go.)

      1. I compiled with OpenMP simply because I copied everything from the intel-opt.cfg file originally. But I could never get multi-threading to work. Now that I'm launching with srun rather than mpirun, I think multi-threading does work, but I haven't tried it. 
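
        If I do get around to trying it, my understanding is that it's mostly a matter of giving each task more cores and setting the thread count, roughly like this (untested sketch, and the "4" is just an example):

          export OMP_NUM_THREADS=4
          srun --mpi=pmi2 --cpus-per-task=4 --cpu_bind=cores --distribution=cyclic:cyclic --multi-prog ./hostfile_srun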

        Is there a good reason to not compile with OpenMP unless I'm using it? 

        Cheers
        Joakim 



        1. No, not really. Just housekeeping.