
VASP Job Hangs

Posted: Thu Jan 16, 2025 9:23 pm
by franklin_goldsmith1

A VASP job running with 32 cores on a single node sometimes hangs. With the options "--get-stack-traces --report-state-on-timeout" passed to mpirun, the following state is reported:
Rank: 17 Node: node1648 PID: 3344693 State: WAITPID FIRED ExitCode 0
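
For reference, a minimal sketch of how these diagnostic options are typically combined in an Open MPI mpirun invocation (the timeout value and binary name are assumptions; the actual command is in the attached submit.sh). As far as I understand, the two reporting options only fire once mpirun's --timeout expires, which is when per-rank state lines like the one above are printed:

```shell
# Sketch (assumptions: Open MPI mpirun, 1-hour budget, binary vasp_std).
# If the job has not finished after --timeout seconds, mpirun reports the
# state of every rank and collects stack traces before aborting.
mpirun -np 32 --timeout 3600 \
       --report-state-on-timeout --get-stack-traces \
       vasp_std
```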

The attached zip file contains all the files, including the job script "submit.sh", needed to reproduce the issue.


Re: VASP Job Hangs

Posted: Fri Jan 17, 2025 11:01 am
by ahampel

Hi,

thank you for reaching out to us on the official VASP forum.

From what I can see in your output files, one of the MPI ranks seems not to have returned. It seems that this did not happen during a calculation but right at the beginning of one? In your output log I can see that the last calculation finished:

Code: Select all

Total vdW correction in eV:     55.6945537                       
   1 F= -.10914094E+03 E0= -.10907376E+03  d E =-.201543E+00     
 writing wavefunctions                                           

and the following iteration in your Slurm job had not produced any output yet, correct? Or does this also happen during an electronic SCF calculation?

Can you give me some details on how VASP was compiled? Compilers, libraries, makefile.include, etc., please; otherwise it will be hard for me to test this. Such hangs can be very compiler-specific, and no known bug comes to mind that matches what you report. Once I have this information I will try to reproduce the problem.

Best regards,
Alex H.


Re: VASP Job Hangs

Posted: Tue Jan 21, 2025 4:53 pm
by franklin_goldsmith1

Hi Alex,
Please see our user's response below:

The hang happens right at the end of an SCF cycle. In this example, I am running the exact same single-point energy calculation 1000 times in a row to try to make the error show up. Here, 7 of the 1000 single-point energy calculations finished and exited properly. On the 8th, the SCF cycle finished completely, but the mpirun job never exits; it just hangs.
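
A minimal sketch of this kind of reproduction loop (the actual submit.sh is in the attached zip; the mpirun command, rank count, and 30-minute budget are placeholders). Wrapping each run in coreutils `timeout` makes a hang show up as exit code 124 instead of stalling the whole loop:

```shell
#!/bin/bash
# Hypothetical reproduction loop -- the real submit.sh is in the attached zip.
# `timeout` converts a hung run into exit code 124 so the loop keeps going
# and the hanging iteration is recorded instead of blocking the whole job.
run_loop() {
    local n=$1; shift
    for i in $(seq 1 "$n"); do
        timeout 30m "$@" > "run_${i}.log" 2>&1
        local status=$?
        if [ "$status" -eq 124 ]; then
            echo "run $i hung (killed by timeout)"
        else
            echo "run $i exited with status $status"
        fi
    done
}

# Actual reproduction run would look like (placeholder command):
# run_loop 1000 mpirun -np 32 vasp_std
```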

I don't have the installation instructions for the vasp module, but I can see its dependent modules:

Code: Select all

depends_on("python/3.11.0s")
depends_on("zlib/1.2.13")
depends_on("intel-oneapi-mkl/2023.1.0")
depends_on("intel-oneapi-compilers/2023.1.0")
depends_on("cmake/3.26.3")
depends_on("cuda/12.2.0")
depends_on("cudnn/8.9.6.50-12")
depends_on("netlib-lapack/3.11.0")
depends_on("openmpi/5.0.0")
depends_on("libbeef/Nov2020")
depends_on("hdf5/1.14.1-2")
depends_on("openblas/0.3.23")
depends_on("netlib-scalapack-mpi/2.2.0")
depends_on("hdf5-mpi/1.14.3")

Below are the directories/files of the installed module:

Code: Select all

$ ls
source  source_original  vdw_files
$ ls source
arch  bin  build  makefile  makefile.include  old_bin  old_build  potpaw_LDA  potpaw_LDA.64  potpaw_PBE  potpaw_PBE.64  README.md  src  testsuite  testsuite_spack  tools  vasp.6.4.2.tar  vdw_kernel.bindat  vdw_kernel.bindat.big_endian

Here is the makefile:

Code: Select all

$ cat source/makefile
#optional: use a custom build directory 
ifdef PREFIX
    VASP_BUILD_DIR=$(PREFIX)
else
    VASP_BUILD_DIR=build
endif

VERSIONS = std gam ncl
.PHONY: all veryclean test test_all versions $(VERSIONS)
all: std gam ncl
versions: $(VERSIONS)
$(VERSIONS):
	if [ ! -d $(VASP_BUILD_DIR)/$@ ] ; then mkdir -p $(VASP_BUILD_DIR)/$@  ; fi
	cp src/makefile src/.objects src/makedeps.awk makefile.include $(VASP_BUILD_DIR)/$@ 

	$(MAKE) -C $(VASP_BUILD_DIR)/$@ VERSION=$@ check

ifdef DEPS
	$(MAKE) -C $(VASP_BUILD_DIR)/$@ VERSION=$@ dependencies -j1
else
	$(MAKE) -C $(VASP_BUILD_DIR)/$@ VERSION=$@ cleandependencies -j1
endif


ifdef MODS
	$(MAKE) -C $(VASP_BUILD_DIR)/$@ VERSION=$@ modfiles -j1
endif
	$(MAKE) -C $(VASP_BUILD_DIR)/$@ VERSION=$@ all

veryclean: 
	rm -rf $(VASP_BUILD_DIR)/std
	rm -rf $(VASP_BUILD_DIR)/gam
	rm -rf $(VASP_BUILD_DIR)/ncl

test:
	$(MAKE) -C testsuite test

test_all:
	$(MAKE) -C testsuite test_all
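
For completeness, the makefile above would typically be driven like this (hypothetical invocations, based only on the targets and variables it defines):

```shell
# Build only the standard version:
make std
# Build all three versions (std, gam, ncl); same as the default target:
make all
# Regenerate source dependencies before building (the DEPS branch above):
make DEPS=1 std
# Build into a custom directory via the optional PREFIX variable:
make PREFIX=/tmp/vasp-build std
# Remove all build directories:
make veryclean
```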

Re: VASP Job Hangs

Posted: Tue Jan 21, 2025 8:23 pm
by ahampel

Perfect, thanks, that is already helpful. Can you also please post the content of makefile.include in that directory? That file contains all the necessary library and compiler dependencies of VASP. Thank you!

Off the top of my head I have no idea why VASP does not return properly at the end of the job. I am well aware that in the user's example he just made a bash loop and called VASP 1000 times, but it is not clear to me what exactly is still pending at that moment. The OUTCAR file is the most complete output file, and it finishes normally. Once I have the makefile.include I will try to compile a matching version and run the user's test calculation.

Best regards,
Alex


Re: VASP Job Hangs

Posted: Wed Jan 22, 2025 1:55 pm
by franklin_goldsmith1

Code: Select all

$ cat makefile.include

# Default precompiler options
CPP_OPTIONS = -DHOST=\"LinuxIFC\" \
              -DMPI -DMPI_BLOCK=8000 -Duse_collective \
              -DscaLAPACK \
              -DCACHE_SIZE=4000 \
              -Davoidalloc \
              -Dvasp6 \
              -Duse_bse_te \
              -Dtbdyn \
              -Dfock_dblbuf \
              -DLAPACK \
              -DscaLAPACK

CPP         = fpp -f_com=no -free -w0  $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)

FC          = mpif90 
FCL         = mpif90

FREE        = -free -names lowercase

FFLAGS      = -assume byterecl -w -fallow-argument-mismatch -ffree-line-length-512

OFLAG       = -O2
OFLAG_IN    = $(OFLAG)
DEBUG       = -O0

OBJECTS     = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o
OBJECTS_O1 += fftw3d.o fftmpi.o fftmpiw.o
OBJECTS_O2 += fft3dlib.o

# For what used to be vasp.5.lib
CPP_LIB     = $(CPP)
FC_LIB      = $(FC)
CC_LIB      = icc
CFLAGS_LIB  = -O
FFLAGS_LIB  = -O1
FREE_LIB    = $(FREE)

OBJECTS_LIB = linpack_double.o

# For the parser library
CXX_PARS    = icpc
LLIBS       = -lstdc++

##
## Customize as of this point! Of course you may change the preceding
## part of this file as well if you like, but it should rarely be
## necessary ...
##

# When compiling on the target machine itself, change this to the
# relevant target when cross-compiling for another architecture
VASP_TARGET_CPU ?= -xHOST
FFLAGS     += $(VASP_TARGET_CPU)
 
# Intel MKL for FFTW, BLAS, LAPACK, and scaLAPACK
# (Note: for Intel Parallel Studio's MKL use -mkl instead of -qmkl)
FCL        += -qmkl
MKLROOT    ?=
LLIBS      += -L$(MKLROOT)/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_openmpi_lp64 -lpthread -lm -ldl
INCS        =-I$(MKLROOT)/include/fftw

# Use a separate scaLAPACK installation (optional but recommended in combination with OpenMPI)
# Comment out the two lines below if you want to use scaLAPACK from MKL instead
SCALAPACK_ROOT ?= 
LLIBS      += -L${SCALAPACK_ROOT}/lib -lscalapack

# For libbeef
CPP_OPTIONS += -Dlibbeef
LIBBEEF_ROOT ?= 
LLIBS += -L$(LIBBEEF_ROOT)/lib -lbeef

# HDF5-support (optional but strongly recommended)
CPP_OPTIONS+= -DVASP_HDF5
HDF5_ROOT  ?= 
LLIBS      += -L$(HDF5_ROOT)/lib -lhdf5_fortran
INCS       += -I$(HDF5_ROOT)/include

# For the VASP-2-Wannier90 interface (optional)
#CPP_OPTIONS    += -DVASP2WANNIER90
#WANNIER90_ROOT ?= /path/to/your/wannier90/installation
#LLIBS          += -L$(WANNIER90_ROOT)/lib -lwannier

# For the fftlib library (hardly any benefit in combination with MKL's FFTs)
#CPP_OPTION += -Dsysv
#FCL         = mpif90 fftlib.o -qmkl
#CXX_FFTLIB  = icpc -qopenmp -std=c++11 -DFFTLIB_USE_MKL -DFFTLIB_THREADSAFE
#INCS_FFTLIB = -I./include -I$(MKLROOT)/include/fftw
#LIBS       += fftlib

Re: VASP Job Hangs

Posted: Thu Jan 23, 2025 9:09 am
by ahampel

Hi,

I now compiled VASP 6.4.3 as closely as possible to what you have:

Code: Select all

  1) git/2.40.0                  7) openmpi/4.1.2
  2) git-lfs/3.3.0               8) hdf5/1.13.0
  3) slurm/23.02.3               9) wannier90/3.1.0
  4) makedepf90/2.8.9           10) libxc/5.2.2
  5) intel/2022.0.1             11) vasp-intel-dev/2022.0.1_mkl-2022.0.1_ompi-4.1.2
  6) intel-oneapi-mkl/2022.0.1  12) scalapack/2.1.0

I do not have an OpenMPI 5.0.0 installation at hand to test with, nor OpenMPI compiled with oneAPI 2023. Putting this aside, I ran the job 500 times in a loop, also with 32 MPI ranks, without any issues. So I unfortunately can't really reproduce this on my system (Rocky Linux 9.4). However, can you maybe try the following things to check whether one of them is the problem:

user site:

  • set NCORE to some reasonable value, e.g. 8

  • try a slightly different number of MPI ranks

  • use vasp_gam instead of vasp_std for Gamma point only calculations
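
For the first point, NCORE is set in the INCAR file; a minimal one-line sketch with the value suggested above (NCORE is typically chosen as a divisor of the number of cores per node):

```
NCORE = 8
```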

building VASP:

  • try to compile VASP 6.4.3 (you are using 6.4.2). I do not see any obvious fix related to your problem in 6.4.3, but it is probably safest to use this latest patch release, which should be available to you

  • lower VASP_TARGET_CPU ?= -xHOST in case the node that runs the job differs slightly from the machine on which VASP was compiled. Maybe set VASP_TARGET_CPU ?= -march=broadwell for safety; this disables any AVX512 instructions.

  • remove scaLAPACK (just for testing!)

  • if possible, use an older OpenMPI 4.x.x version; I am always a bit cautious with .0.0 releases of anything.
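
The build-side suggestions above would translate into something like the following edits to makefile.include (a sketch for testing only; restore scaLAPACK once the hang is understood):

```
# Target a fixed, older architecture instead of the build host (no AVX512):
VASP_TARGET_CPU ?= -march=broadwell

# To test without scaLAPACK: remove -DscaLAPACK from CPP_OPTIONS and drop
# the -lscalapack entry (and the MKL BLACS library) from LLIBS.
```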

If none of these help, we should take a closer look, but this is all that comes to mind that could cause trouble here.