WRF aborts for unknown reasons

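Hello TA, I compiled WRF successfully (build_wrf.log reports success), but when I run RUN.pbs, which I adapted from RUN.slurm, WRF aborts for no apparent reason.
I have tried swapping HDF5 versions, including Taiwania's built-in 1.10.1 and the HDF5 bundled with netcdf, and the same error appears every time.
Here is the compile script: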

#!/bin/bash

#PBS -l nodes=1:ppn=40
#PBS -P ACD112218
#PBS -q ctest
#PBS -N compileWrf

export MODULEPATH=/home/93wilsonlu/wrf-hw/modules:/cm/local/modulefiles:/cm/shared/modulefiles:/cm/shared/applications

module purge

module load gcc/7.5.0
module load mpich-3.1.4-t
module load pnetcdf-1.12.0-t
module load netcdf/gnu/c-4.6.2/fortran-4.4.5
module load hdf5/1.10.1

export APPROOT=/home/93wilsonlu/wrf-hw
export NETCDF=/pkg/netcdf/gnu/c-4.6.2/fortran-4.4.5
export WRFIO_NCD_LARGE_FILE_SUPPORT=1
export NETCDF_classic=1
export PNETCDF=$APPROOT/opt/pnetcdf-1.12.0

# export PATH=/pkg/netcdf/gnu/c-4.6.2/hdf5/1.12.0/bin:$PATH
# export LIBRARY_PATH=/pkg/netcdf/gnu/c-4.6.2/hdf5/1.12.0/lib
# export LD_LIBRARY_PATH=/pkg/netcdf/gnu/c-4.6.2/hdf5/1.12.0/lib
# export CPATH=/pkg/netcdf/gnu/c-4.6.2/hdf5/1.12.0/include
# export INCLUDE=/pkg/netcdf/gnu/c-4.6.2/hdf5/1.12.0/include

cd $APPROOT/WRF-ISC21
./compile -j 40 em_real >& build_wrf.log

Run script:

#!/bin/bash
#PBS -P ACD112218
#PBS -N HPC-WinteCamp-Wrf
#PBS -l nodes=4:ppn=2
#PBS -l walltime=2:00:00
#PBS -q ct160
#PBS -o job-out.log
#PBS -e job-err.log
MODULEPATH=/home/93wilsonlu/wrf-hw/modules:/cm/local/modulefiles:/cm/shared/modulefiles:/cm/shared/applications
PATH=/home/93wilsonlu/wrf-hw/WRF-ISC21/main:$PATH

module load gcc/7.5.0
module load mpich-3.1.4-t
module load pnetcdf-1.12.0-t
module load netcdf/gnu/c-4.6.2/fortran-4.4.5
module load hdf5/1.10.1

export APPROOT=/home/93wilsonlu/wrf-hw
export NETCDF=/pkg/netcdf/gnu/c-4.6.2/fortran-4.4.5
export WRFIO_NCD_LARGE_FILE_SUPPORT=1
export NETCDF_classic=1
export PNETCDF=$APPROOT/opt/pnetcdf-1.12.0

# export PATH=/pkg/netcdf/gnu/c-4.6.2/hdf5/1.12.0/bin:$PATH
# export LIBRARY_PATH=/pkg/netcdf/gnu/c-4.6.2/hdf5/1.12.0/lib
# export LD_LIBRARY_PATH=/pkg/netcdf/gnu/c-4.6.2/hdf5/1.12.0/lib
# export CPATH=/pkg/netcdf/gnu/c-4.6.2/hdf5/1.12.0/include
# export INCLUDE=/pkg/netcdf/gnu/c-4.6.2/hdf5/1.12.0/include

export OMP_NUM_THREADS=1

cd WRF_practice_kit

ln -sf $PWD/namelist.input-VALIDATE namelist.input
/usr/bin/time -p mpirun -np 8 wrf.exe
rm -rf VALIDATE
mkdir VALIDATE
mv rsl.* namelist.input namelist.output VALIDATE
mv wrfo* VALIDATE


ln -sf $PWD/namelist.input-TIMING namelist.input
/usr/bin/time -p mpirun -np 8 wrf.exe
rm -rf TIMING
mkdir TIMING
mv rsl.* namelist.input namelist.output TIMING

Here are job-out.log and job-err.log:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 93791 RUNNING AT cn0402
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 93983 RUNNING AT cn0402
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
 starting wrf task            0  of            8
 starting wrf task            2  of            8
 starting wrf task            5  of            8
 starting wrf task            7  of            8
 starting wrf task            3  of            8
 starting wrf task            4  of            8
 starting wrf task            6  of            8
 starting wrf task            1  of            8
real 92.87
user 693.57
sys 43.37
mv: cannot stat ‘wrfo*’: No such file or directory
 starting wrf task            0  of            8
 starting wrf task            1  of            8
 starting wrf task            4  of            8
 starting wrf task            5  of            8
 starting wrf task            2  of            8
 starting wrf task            3  of            8
 starting wrf task            7  of            8
 starting wrf task            6  of            8
real 90.78
user 681.69
sys 41.86

Hi,

Please check or post the complete job-out.log so that we can help you further.

With one node:

With multiple nodes:

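Hi,

Judging from the error message, this is quite likely an MPICH problem.
You could first try running an MPI Hello World program to see whether it runs correctly, which would narrow down the debugging scope.

MPI Hello World runs without any problem; even bitonic sort runs fine.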

#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello World from proc %d\n", rank);

    MPI_Finalize();
    return 0;
}

In that case, the failure is probably in the WRF processes themselves.
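Before reaching for a debugger, it may be worth reading the per-rank logs that the run script moves into VALIDATE and TIMING (the `mv rsl.* ...` lines), since job-out.log only shows the MPI launcher's summary. The following is a minimal sketch, assuming WRF's usual `rsl.error.NNNN` file naming; `show_rsl_errors` is a hypothetical helper name, not part of WRF.

```shell
#!/bin/bash
# Sketch: print the tail of every per-rank WRF error log in a directory.
# Assumes the rsl.error.NNNN naming; show_rsl_errors is a hypothetical helper.
show_rsl_errors() {
    local dir=${1:-.} lines=${2:-20} f found=0
    for f in "$dir"/rsl.error.*; do
        [ -e "$f" ] || continue      # the glob matched nothing
        found=1
        echo "==> $f <=="
        tail -n "$lines" "$f"
    done
    if [ "$found" -eq 0 ]; then
        echo "no rsl.error.* files in $dir" >&2
        return 1
    fi
}
```

For example, `show_rsl_errors WRF_practice_kit/VALIDATE 20` would print the last 20 lines of each rank's log, which is usually where a fatal message from WRF itself ends up.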

You can use gdbserver to attach gdb to each rank and debug it.

You can refer to the following wrapper script (untested):

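#!/bin/bash

# PMI_RANK is set per process by the MPICH launcher.
RANK=$PMI_RANK
PORT=$((30000 + RANK))

echo "GDB server for rank $RANK listening on $(hostname):$PORT"

# "$@" keeps each argument intact, unlike an unquoted $*.
gdbserver ":$PORT" "$@"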

Then run

mpirun ./wrapper.sh wrf.exe

to start one gdbserver per rank, then attach gdb to each one and `continue` each rank.
(I trust you can handle the rest XD)
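The attach step could look roughly like the sketch below, which only prints the gdb command for each rank using the wrapper's port scheme (30000 + rank). It is an illustration under assumptions: `gdb_attach_cmds` is a hypothetical helper name, and passing a single host is a simplification, because with `nodes=4:ppn=2` the ranks are spread over several nodes and each command needs the node where that rank actually runs.

```shell
#!/bin/bash
# Sketch: print one gdb invocation per rank for the wrapper's port scheme
# (port = base + rank). gdb_attach_cmds is a hypothetical helper name, and
# using one host for all ranks is a simplifying assumption.
gdb_attach_cmds() {
    local host=$1 nranks=$2 base=${3:-30000} r
    for ((r = 0; r < nranks; r++)); do
        echo "gdb wrf.exe -ex 'target remote ${host}:$((base + r))'"
    done
}
```

Calling `gdb_attach_cmds cn0402 8` (node name taken from the log above purely as an example) prints eight commands for ports 30000 to 30007; each one is run in its own terminal, followed by `continue` at the gdb prompt.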