Jose Maria Rodriguez-Gonzalez reports that when the model is compiled with the --build-type=DEBUG option, one test (ifs_t21_test_compo_fc) fails. This is seen both on the ECMWF Atos hpc2020 and on other platforms.

3 Comments

  1. In case it helps, the test seems to fail because of a floating-point exception (SIGFPE) raised in chem_main_. This is the error trace that I get:

    [EC_DRHOOK:hostname:myproc:omltid:pid:unixtid] [YYYYMMDD:HHMMSS:walltime] [function@file:lineno]
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.609] [signal_drhook@drhook.c:1641] Received signal#8 (SIGFPE) :: 413MB (heap), 486MB (maxrss), 35MB (maxstack), 1601MB (vmpeak), 0 (paging), nsigs = 1
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.609] [signal_drhook@drhook.c:1649] Hardlimit for core file is now 18446744073709551615 (0xffffffffffffffff)
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.609] [signal_drhook@drhook.c:1656] Also activating Harakiri-alarm (SIGALRM=14) to expire after 500s elapsed to prevent hangs, nsigs = 1
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.612] [signal_drhook@drhook.c:1745] Signal#8 was caused by floating-point invalid operation [memaddr=0x154bb48ccb86] [excepts=0x0 [0]] : 0x154bb48ccb86 at /etc/ecmwf/nfs/dh2_perm_b/spk/openifs/openifs-48r1/build/bin/../lib/libarpifs.DP.so(chem_main_), nsigs = 1
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.612] [signal_drhook@drhook.c:1778] : [00]: <unknown_function> /etc/ecmwf/nfs/dh2_perm_b/spk/openifs/openifs-48r1/build/bin/../lib/libfiat.so 0x154bac218000 0x154bac235e4f # addr2line
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.612] [signal_drhook@drhook.c:1778] : [01]: <unknown_function> /lib64/libpthread.so.0 0x154ba3f5f000 0x154ba3f71ce0 # addr2line
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [signal_drhook@drhook.c:1778] : [02]: chem_main_ /etc/ecmwf/nfs/dh2_perm_b/spk/openifs/openifs-48r1/build/bin/../lib/libarpifs.DP.so 0x154bb431e000 0x154bb48ccb86 # addr2line
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [signal_drhook@drhook.c:1778] : [03]: chem_main_layer_ /etc/ecmwf/nfs/dh2_perm_b/spk/openifs/openifs-48r1/build/bin/../lib/libarpifs.DP.so 0x154bb431e000 0x154bb5a4189d # addr2line
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [signal_drhook@drhook.c:1778] : [04]: callpar_ /etc/ecmwf/nfs/dh2_perm_b/spk/openifs/openifs-48r1/build/bin/../lib/libarpifs.DP.so 0x154bb431e000 0x154bb5a29036 # addr2line
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [signal_drhook@drhook.c:1778] : [05]: ec_phys_ /etc/ecmwf/nfs/dh2_perm_b/spk/openifs/openifs-48r1/build/bin/../lib/libarpifs.DP.so 0x154bb431e000 0x154bb5b9f394 # addr2line
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [signal_drhook@drhook.c:1778] : [06]: ec_phys_drv_ /etc/ecmwf/nfs/dh2_perm_b/spk/openifs/openifs-48r1/build/bin/../lib/libarpifs.DP.so 0x154bb431e000 0x154bb5bc8af7 # addr2line
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [signal_drhook@drhook.c:1778] : [07]: __kmp_invoke_microtask /usr/local/apps/intel/2021.4.0/compiler/latest/linux/compiler/lib/intel64/libiomp5.so 0x154ba393b000 0x154ba3a79053 # addr2line
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [signal_drhook@drhook.c:1778] : [08]: <unknown_function> /usr/local/apps/intel/2021.4.0/compiler/latest/linux/compiler/lib/intel64/libiomp5.so 0x154ba393b000 0x154ba39f5353 # addr2line
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [signal_drhook@drhook.c:1778] : [09]: <unknown_function> /usr/local/apps/intel/2021.4.0/compiler/latest/linux/compiler/lib/intel64/libiomp5.so 0x154ba393b000 0x154ba39f4362 # addr2line
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [signal_drhook@drhook.c:1778] : [10]: <unknown_function> /usr/local/apps/intel/2021.4.0/compiler/latest/linux/compiler/lib/intel64/libiomp5.so 0x154ba393b000 0x154ba3a79cdc # addr2line
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [signal_drhook@drhook.c:1778] : [11]: <unknown_function> /lib64/libpthread.so.0 0x154ba3f5f000 0x154ba3f671cf # addr2line
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [signal_drhook@drhook.c:1778] : [12]: clone /lib64/libc.so.6 0x154ba3374000 0x154ba33addd3 # addr2line
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [signal_drhook@drhook.c:1826] Starting DrHook backtrace for signal#8, nsigs = 1
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [DrHookCallTree] DR_HOOK call tree : 413 MB (maxheap), 487 MB (maxrss), 35 MB (maxstack), 1601 MB (vmpeak)
    [EC_DRHOOK:ac6-114:2:1:1247798:1247798] [20240130:174949:14.613] [DrHookCallTree] DR_HOOK call tree : 413 MB (maxheap), 487 MB (maxrss), 35 MB (maxstack), 1601 MB (vmpeak)
    [EC_DRHOOK:ac6-114:2:1:1247798:1247798] [20240130:174949:14.613] [DrHookCallTree] :  MASTER ,#1,st=1,wall=0.000s/0.000s
    [EC_DRHOOK:ac6-114:2:1:1247798:1247798] [20240130:174949:14.613] [DrHookCallTree] :   CNT0 ,#1,st=1,wall=0.000s/0.000s
    [EC_DRHOOK:ac6-114:2:1:1247798:1247798] [20240130:174949:14.613] [DrHookCallTree] :    CNT1 ,#1,st=1,wall=0.000s/0.000s
    [EC_DRHOOK:ac6-114:2:1:1247798:1247798] [20240130:174949:14.613] [DrHookCallTree] :     CNT2 ,#1,st=1,wall=0.000s/0.000s
    [EC_DRHOOK:ac6-114:2:1:1247798:1247798] [20240130:174949:14.613] [DrHookCallTree] :      CNT3 ,#1,st=1,wall=0.000s/0.000s
    [EC_DRHOOK:ac6-114:2:1:1247798:1247798] [20240130:174949:14.613] [DrHookCallTree] :       CNT4 ,#1,st=1,wall=0.000s/0.000s
    [EC_DRHOOK:ac6-114:2:1:1247798:1247798] [20240130:174949:14.613] [DrHookCallTree] :        STEPO ,#1,st=1,wall=0.000s/0.000s
    [EC_DRHOOK:ac6-114:2:1:1247798:1247798] [20240130:174949:14.613] [DrHookCallTree] :         SCAN2M ,#1,st=1,wall=0.000s/0.000s
    [EC_DRHOOK:ac6-114:2:1:1247798:1247798] [20240130:174949:14.613] [DrHookCallTree] :          GP_MODEL_HEAP2 ,#1,st=1,wall=0.000s/0.000s
    [EC_DRHOOK:ac6-114:2:1:1247798:1247798] [20240130:174949:14.613] [DrHookCallTree] :           GP_MODEL ,#1,st=1,wall=0.000s/0.000s
    [EC_DRHOOK:ac6-114:2:1:1247798:1247798] [20240130:174949:14.613] [DrHookCallTree] :            EC_PHYS_DRV ,#1,st=1,wall=0.000s/0.000s
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [DrHookCallTree] :             >OMP-PHYSICS CLDPP T/S    (1002) ,#1,st=1,wall=0.000s/0.000s
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [DrHookCallTree] :              EC_PHYS ,#1,st=1,wall=0.000s/0.000s
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [DrHookCallTree] :               CALLPAR ,#1,st=1,wall=0.000s/0.000s
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [DrHookCallTree] :                CHEM_MAIN_LAYER ,#1,st=1,wall=0.000s/0.000s
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [DrHookCallTree] :                 CHEM_MAIN ,#1,st=1,wall=0.000s/0.000s
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [LinuxTraceBack] Backtrace(s) for program '/etc/ecmwf/nfs/dh2_perm_b/spk/openifs/openifs-48r1/build/bin/ifsMASTER.DP' : sigcontextptr=0x154b7ef09c38
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [LinuxTraceBack] Backtrace (size = 14) with addr2line-cmd
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [LinuxTraceBack] /usr/bin/addr2line -fs -e '/etc/ecmwf/nfs/dh2_perm_b/spk/openifs/openifs-48r1/build/bin/ifsMASTER.DP' 0x154bac260c21 0x154bac2607dd 0x154ba3f71ce0 0x154bb48ccb86 0x154bb5a4189d 0x154bb5a29036 0x154bb5b9f394 0x154bb5bc8af7 0x154ba3a79053 0x154ba39f5353 0x154ba39f4362 0x154ba3a79cdc 0x154ba3f671cf 0x154ba33addd3
    OMP: Warning #191: Forking a process while a parallel region is active is potentially unsafe.
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [LinuxTraceBack] [00]: libfiat.so(LinuxTraceBack+0x4a5) [0x154bac260c21] : ??() at ??:0
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [LinuxTraceBack] [01]: libfiat.so(LinuxTraceBack+0x61) [0x154bac2607dd] : ??() at ??:0
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [LinuxTraceBack] [02]: libpthread.so.0(+0x12ce0) [0x154ba3f71ce0] : ??() at ??:0
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [LinuxTraceBack] [03]: libarpifs.DP.so(chem_main_+0xfdaa) [0x154bb48ccb86] : ??() at ??:0
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [LinuxTraceBack] [04]: libarpifs.DP.so(chem_main_layer_+0x25a1) [0x154bb5a4189d] : ??() at ??:0
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [LinuxTraceBack] [05]: libarpifs.DP.so(callpar_+0x3c792) [0x154bb5a29036] : ??() at ??:0
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [LinuxTraceBack] [06]: libarpifs.DP.so(ec_phys_+0xf99c) [0x154bb5b9f394] : ??() at ??:0
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [LinuxTraceBack] [07]: libarpifs.DP.so(ec_phys_drv_+0x1d14b) [0x154bb5bc8af7] : ??() at ??:0
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [LinuxTraceBack] [08]: libiomp5.so(__kmp_invoke_microtask+0x93) [0x154ba3a79053] : ??() at ??:0
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [LinuxTraceBack] [09]: libiomp5.so(+0xba353) [0x154ba39f5353] : ??() at ??:0
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [LinuxTraceBack] [10]: libiomp5.so(+0xb9362) [0x154ba39f4362] : ??() at ??:0
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [LinuxTraceBack] [11]: libiomp5.so(+0x13ecdc) [0x154ba3a79cdc] : ??() at ??:0
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [LinuxTraceBack] [12]: libpthread.so.0(+0x81cf) [0x154ba3f671cf] : ??() at ??:0
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [LinuxTraceBack] [13]: libc.so.6(clone+0x43) [0x154ba33addd3] : ??() at ??:0
    [EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] [LinuxTraceBack] End of backtrace(s)
    srun: error: ac6-114: task 1: Segmentation fault (core dumped)

  2. Hi Jose,

    Thanks for this, and for your efforts.

    I was going to message you this morning to say that I think I have fixed this error: all tests in openifs-test passed for me with the DEBUG build type, using both GNU 11.2 and Intel. I am testing this further, and we will then apply the change and produce a new tarball. This may take a few days, so I will report back once the testing is complete.

    Hope this is OK

    Cheers

    Adrian
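A note on reading the trace in comment 1: the LinuxTraceBack section resolves every frame to `??() at ??:0`, most likely because addr2line was run against the ifsMASTER.DP executable while the failing frames live in shared libraries. The sketch below is one possible way to turn the DrHook frame lines into per-library addr2line commands. It assumes that the first hex value on each frame line is the library load base and the second is the absolute address, and that the library was built with debug information and is mapped so that subtracting the base gives an address addr2line accepts. The helper name `addr2line_commands` is hypothetical and not part of DrHook or OpenIFS.

```python
import re
import shlex

# Matches DrHook frame lines such as:
#   ... : [02]: chem_main_ /path/libarpifs.DP.so 0x154bb431e000 0x154bb48ccb86 # addr2line
# Assumption: the first hex value is the library load base, the second the
# absolute address of the frame.
FRAME_RE = re.compile(
    r"\[(?P<frame>\d+)\]:\s+(?P<func>\S+)\s+(?P<lib>/\S+)\s+"
    r"(?P<base>0x[0-9a-fA-F]+)\s+(?P<addr>0x[0-9a-fA-F]+)\s+#\s*addr2line"
)


def addr2line_commands(log_text):
    """Return (frame number, function name, command) for each frame line."""
    commands = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue  # not a frame line
        # Offset of the frame inside the library (absolute address - load base).
        offset = int(match["addr"], 16) - int(match["base"], 16)
        cmd = f"addr2line -f -C -e {shlex.quote(match['lib'])} {offset:#x}"
        commands.append((int(match["frame"]), match["func"], cmd))
    return commands


# Frame [02] copied from the trace in comment 1.
sample = (
    "[EC_DRHOOK:ac6-114:2:2:1247798:1247998] [20240130:174949:14.613] "
    "[signal_drhook@drhook.c:1778] : [02]: chem_main_ "
    "/etc/ecmwf/nfs/dh2_perm_b/spk/openifs/openifs-48r1/build/bin/../lib/libarpifs.DP.so "
    "0x154bb431e000 0x154bb48ccb86 # addr2line"
)

for frame, func, cmd in addr2line_commands(sample):
    print(frame, func, cmd)
```

For the chem_main_ frame this gives an offset of 0x5aeb86 into libarpifs.DP.so. Running the printed command on the machine where the library was built should report the source file and line of the invalid operation, provided the assumptions above hold.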