VASP Job Hangs
Posted: Thu Jan 16, 2025 9:23 pm
by franklin_goldsmith1
A VASP job that runs with 32 cores on a single node sometimes hangs. With the options "--get-stack-traces --report-state-on-timeout" passed to mpirun, the following error is reported:
Rank: 17 Node: node1648 PID: 3344693 State: WAITPID FIRED ExitCode 0
The attached zip file has all the files, including the job script "submit.sh", for reproducing the issue.
Re: VASP Job Hangs
Posted: Fri Jan 17, 2025 11:01 am
by ahampel
Hi,
thank you for reaching out to us on the official VASP forum.
From what I can see in your output files, one of the MPI ranks seems not to have returned. It seems that this did not happen in the middle of a calculation but right at the beginning of one? In your output log I can see that the last calculation finished:
Code: Select all
Total vdW correction in eV: 55.6945537
1 F= -.10914094E+03 E0= -.10907376E+03 d E =-.201543E+00
writing wavefunctions
and the following iteration in your Slurm job has not produced any output yet, correct? Or does this also happen during an electronic SCF calculation?
Can you give me some details on how VASP was compiled (compilers, libraries, makefile.include, etc.)? Otherwise it will be hard for me to test this. Such crashes can be very compiler-specific. I am not aware of any known bug that sounds close to what you report. Once I have this information, I will try to reproduce the problem.
Best regards,
Alex H.
Re: VASP Job Hangs
Posted: Tue Jan 21, 2025 4:53 pm
by franklin_goldsmith1
Hi Alex,
Please see our user's response below:
The hang is happening right at the end of an SCF cycle. In this example, I am running the exact same single-point energy calculation 1000 times in a row to try to get this error to show up. In this example, 7 of the 1000 single-point energy calculations finished and exited properly. On the 8th single-point energy calculation, the SCF cycle finished completely, but the mpirun job never exits; it just hangs.
I don't have the installation instructions for the vasp module, but I can see the modules it depends on:
Code: Select all
depends_on("python/3.11.0s")
depends_on("zlib/1.2.13")
depends_on("intel-oneapi-mkl/2023.1.0")
depends_on("intel-oneapi-compilers/2023.1.0")
depends_on("cmake/3.26.3")
depends_on("cuda/12.2.0")
depends_on("cudnn/8.9.6.50-12")
depends_on("netlib-lapack/3.11.0")
depends_on("openmpi/5.0.0")
depends_on("libbeef/Nov2020")
depends_on("hdf5/1.14.1-2")
depends_on("openblas/0.3.23")
depends_on("netlib-scalapack-mpi/2.2.0")
depends_on("hdf5-mpi/1.14.3")
The directories and files of the installed module are shown below:
Code: Select all
$ ls
source source_original vdw_files
$ ls source
arch bin build makefile makefile.include old_bin old_build potpaw_LDA potpaw_LDA.64 potpaw_PBE potpaw_PBE.64 README.md src testsuite testsuite_spack tools vasp.6.4.2.tar vdw_kernel.bindat vdw_kernel.bindat.big_endian
Here is the makefile:
Code: Select all
$ cat source/makefile
#optional: use a custom build directory
ifdef PREFIX
VASP_BUILD_DIR=$(PREFIX)
else
VASP_BUILD_DIR=build
endif
VERSIONS = std gam ncl
.PHONY: all veryclean test test_all versions $(VERSIONS)
all: std gam ncl
versions: $(VERSIONS)
$(VERSIONS):
if [ ! -d $(VASP_BUILD_DIR)/$@ ] ; then mkdir -p $(VASP_BUILD_DIR)/$@ ; fi
cp src/makefile src/.objects src/makedeps.awk makefile.include $(VASP_BUILD_DIR)/$@
$(MAKE) -C $(VASP_BUILD_DIR)/$@ VERSION=$@ check
ifdef DEPS
$(MAKE) -C $(VASP_BUILD_DIR)/$@ VERSION=$@ dependencies -j1
else
$(MAKE) -C $(VASP_BUILD_DIR)/$@ VERSION=$@ cleandependencies -j1
endif
ifdef MODS
$(MAKE) -C $(VASP_BUILD_DIR)/$@ VERSION=$@ modfiles -j1
endif
$(MAKE) -C $(VASP_BUILD_DIR)/$@ VERSION=$@ all
veryclean:
rm -rf $(VASP_BUILD_DIR)/std
rm -rf $(VASP_BUILD_DIR)/gam
rm -rf $(VASP_BUILD_DIR)/ncl
test:
$(MAKE) -C testsuite test
test_all:
$(MAKE) -C testsuite test_all
Re: VASP Job Hangs
Posted: Tue Jan 21, 2025 8:23 pm
by ahampel
Perfect, thanks, that is already helpful. Can you also please post the content of makefile.include in the directory? That file contains all the necessary library and compiler dependencies of VASP. Thank you!
Off the top of my head, I have no idea why VASP does not return properly at the end of the job. I am aware that in the user's example a bash loop simply calls VASP 1000 times, but it is not clear to me what exactly is still pending at that moment. The OUTCAR file is the most complete output file, and it finishes normally. Once I have the makefile.include, I will try to compile a matching version and run the user's test calculation.
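To separate "the calculation finished but the process did not exit" from "the calculation stalled partway", one option is to check whether the OUTCAR contains its closing timing summary. The following is only a sketch: the function names are hypothetical, and it assumes that a normally finished run writes a block beginning with "General timing and accounting" at the end of the OUTCAR.

```shell
# Sketch only: helper names are hypothetical, not part of VASP.
# Assumption: a run that completed normally ends its OUTCAR with a
# timing block that starts with "General timing and accounting".

outcar_finished() {
    # Exit status 0 if the closing timing block is present in file $1.
    grep -q "General timing and accounting" "$1"
}

classify_run() {
    # Print a one-line verdict for the OUTCAR given as $1.
    if outcar_finished "$1"; then
        echo "finished: OUTCAR is complete, so a hang would be at exit"
    else
        echo "incomplete: OUTCAR has no closing timing block"
    fi
}
```

Calling `classify_run OUTCAR` in the directory of a stuck job then indicates which of the two cases applies.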
Best regards,
Alex
Re: VASP Job Hangs
Posted: Wed Jan 22, 2025 1:55 pm
by franklin_goldsmith1
Code: Select all
$ cat makefile.include
# Default precompiler options
CPP_OPTIONS = -DHOST=\"LinuxIFC\" \
-DMPI -DMPI_BLOCK=8000 -Duse_collective \
-DscaLAPACK \
-DCACHE_SIZE=4000 \
-Davoidalloc \
-Dvasp6 \
-Duse_bse_te \
-Dtbdyn \
-Dfock_dblbuf \
-DLAPACK \
-DscaLAPACK
CPP = fpp -f_com=no -free -w0 $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)
FC = mpif90
FCL = mpif90
FREE = -free -names lowercase
FFLAGS = -assume byterecl -w -fallow-argument-mismatch -ffree-line-length-512
OFLAG = -O2
OFLAG_IN = $(OFLAG)
DEBUG = -O0
OBJECTS = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o
OBJECTS_O1 += fftw3d.o fftmpi.o fftmpiw.o
OBJECTS_O2 += fft3dlib.o
# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = $(FC)
CC_LIB = icc
CFLAGS_LIB = -O
FFLAGS_LIB = -O1
FREE_LIB = $(FREE)
OBJECTS_LIB = linpack_double.o
# For the parser library
CXX_PARS = icpc
LLIBS = -lstdc++
##
## Customize as of this point! Of course you may change the preceding
## part of this file as well if you like, but it should rarely be
## necessary ...
##
# When compiling on the target machine itself, change this to the
# relevant target when cross-compiling for another architecture
VASP_TARGET_CPU ?= -xHOST
FFLAGS += $(VASP_TARGET_CPU)
# Intel MKL for FFTW, BLAS, LAPACK, and scaLAPACK
# (Note: for Intel Parallel Studio's MKL use -mkl instead of -qmkl)
FCL += -qmkl
MKLROOT ?=
LLIBS += -L$(MKLROOT)/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_openmpi_lp64 -lpthread -lm -ldl
INCS =-I$(MKLROOT)/include/fftw
# Use a separate scaLAPACK installation (optional but recommended in combination with OpenMPI)
# Comment out the two lines below if you want to use scaLAPACK from MKL instead
SCALAPACK_ROOT ?=
LLIBS += -L${SCALAPACK_ROOT}/lib -lscalapack
# For libbeef
CPP_OPTIONS += -Dlibbeef
LIBBEEF_ROOT ?=
LLIBS += -L$(LIBBEEF_ROOT)/lib -lbeef
# HDF5-support (optional but strongly recommended)
CPP_OPTIONS+= -DVASP_HDF5
HDF5_ROOT ?=
LLIBS += -L$(HDF5_ROOT)/lib -lhdf5_fortran
INCS += -I$(HDF5_ROOT)/include
# For the VASP-2-Wannier90 interface (optional)
#CPP_OPTIONS += -DVASP2WANNIER90
#WANNIER90_ROOT ?= /path/to/your/wannier90/installation
#LLIBS += -L$(WANNIER90_ROOT)/lib -lwannier
# For the fftlib library (hardly any benefit in combination with MKL's FFTs)
#CPP_OPTION += -Dsysv
#FCL = mpif90 fftlib.o -qmkl
#CXX_FFTLIB = icpc -qopenmp -std=c++11 -DFFTLIB_USE_MKL -DFFTLIB_THREADSAFE
#INCS_FFTLIB = -I./include -I$(MKLROOT)/include/fftw
#LIBS += fftlib
Re: VASP Job Hangs
Posted: Thu Jan 23, 2025 9:09 am
by ahampel
Hi,
I have now compiled VASP 6.4.3 as closely as possible to what you have:
Code: Select all
1) git/2.40.0 7) openmpi/4.1.2
2) git-lfs/3.3.0 8) hdf5/1.13.0
3) slurm/23.02.3 9) wannier90/3.1.0
4) makedepf90/2.8.9 10) libxc/5.2.2
5) intel/2022.0.1 11) vasp-intel-dev/2022.0.1_mkl-2022.0.1_ompi-4.1.2
6) intel-oneapi-mkl/2022.0.1 12) scalapack/2.1.0
I do not have an OpenMPI 5.0.0 installation at hand to test with, nor an OpenMPI compiled with oneAPI 2023. Putting this aside, I have now run the job 500 times in a loop, also with 32 MPI ranks, without any issues. So unfortunately I cannot reproduce this on my system (Rocky Linux 9.4). However, could you try the following things to check whether one of them is the problem:
user site:
building VASP:
If none of these help, we should take a closer look, but this is all that comes to my mind that could cause trouble here.
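For repeated test runs like the loops described in this thread, it can help to wrap each run in a time limit so that a hang is recorded against a specific iteration instead of blocking the whole job. Below is a minimal sketch using the coreutils `timeout` command. The function name, the `LOGDIR` variable, the 600 second limit, and the commented-out mpirun line are assumptions for illustration and are not taken from the attached submit.sh.

```shell
# Sketch only: run_until_hang and LOGDIR are hypothetical names.
# Repeats a command and stops at the first run that exceeds the limit.

run_until_hang() {
    # $1 = time limit in seconds, $2 = number of runs, rest = command
    local limit runs i status
    limit="$1"
    runs="$2"
    shift 2
    i=1
    while [ "$i" -le "$runs" ]; do
        status=0
        # coreutils `timeout` terminates the command after $limit seconds
        # and then exits with status 124.
        timeout "$limit" "$@" > "${LOGDIR:-.}/run_${i}.log" 2>&1 || status=$?
        if [ "$status" -eq 124 ]; then
            echo "run $i exceeded ${limit}s (possible hang)"
            return 1
        elif [ "$status" -ne 0 ]; then
            echo "run $i failed with exit code $status"
            return 2
        fi
        i=$((i + 1))
    done
    echo "all $runs runs finished"
}

# Hypothetical usage in a job script (limit and binary name are assumptions):
# run_until_hang 600 1000 mpirun -np 32 vasp_std
```

The per-run log files make it possible to compare the output of the run that timed out with that of the runs that exited normally.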