vasp_gam [VASP 5.x and 6.x] with Intel hangs for large supercells

khoang
Newbie
Posts: 3
Joined: Tue Nov 12, 2019 10:05 pm

vasp_gam [VASP 5.x and 6.x] with Intel hangs for large supercells

#1 Post by khoang » Mon Mar 20, 2023 8:32 pm

Hi All,

I have been having this issue for a long time, and I am wondering if anyone else encounters it.

1. The issue occurs with the vasp_gam version of VASP 5.x compiled with Intel Parallel Studio (e.g., 2020) and VASP 6.x (including 6.4.0) compiled with Intel oneAPI (including the latest version, 2023.0.0). I have tried many different versions/combinations.

2. Supercell calculations (e.g., a 96-atom GaN supercell; either PBE or HSE) using vasp_gam hang at the first electronic step:

running on 16 total cores
distrk: each k-point on 16 cores, 1 groups
distr: one band on 2 cores, 8 groups
using from now: INCAR
vasp.5.4.4.18Apr17-6-g9f103f2a35 (build Jun 18 2021 15:11:44) gamma-only

POSCAR found type information on POSCAR Ga N
POSCAR found : 2 types and 96 ions
scaLAPACK will be used
LDA part: xc-table for Pade appr. of Perdew
POSCAR, INCAR and KPOINTS ok, starting setup
FFT: planning ...
WAVECAR not read
entering main loop
N E dE d eps ncg rms rms(c)


[... and it stays here forever ...]

running 8 mpi-ranks, with 4 threads/rank, on 1 nodes
distrk: each k-point on 8 cores, 1 groups
distr: one band on 1 cores, 8 groups
vasp.6.4.0 14Feb23 (build Mar 20 2023 14:38:27) gamma-only

POSCAR found type information on POSCAR GaN
POSCAR found : 2 types and 96 ions
Reading from existing POTCAR
scaLAPACK will be used
Reading from existing POTCAR
LDA part: xc-table for Pade appr. of Perdew
POSCAR, INCAR and KPOINTS ok, starting setup
FFT: planning ... GRIDC
FFT: planning ... GRID_SOFT
FFT: planning ... GRID
WAVECAR not read
entering main loop
N E dE d eps ncg rms rms(c)


[... and it stays here forever ...]

3. The same vasp_gam executables work fine for small-cell calculations (e.g., 2-atom SiC cells), so the issue appears to occur only for medium-to-large supercells.

4. The issue with vasp_gam does NOT occur when VASP is compiled with the NVIDIA HPC SDK (although that build has other performance issues); e.g.,

running 8 mpi-ranks, with 4 threads/rank, on 1 nodes
distrk: each k-point on 8 cores, 1 groups
distr: one band on 1 cores, 8 groups
vasp.6.4.0 14Feb23 (build Mar 18 2023 13:29:23) gamma-only
POSCAR found type information on POSCAR GaN
POSCAR found : 2 types and 96 ions
Reading from existing POTCAR
scaLAPACK will be used selectively (only on CPU)
Reading from existing POTCAR
LDA part: xc-table for Pade appr. of Perdew
POSCAR, INCAR and KPOINTS ok, starting setup
FFT: planning ... GRIDC
FFT: planning ... GRID_SOFT
FFT: planning ... GRID
WAVECAR not read
entering main loop
N E dE d eps ncg rms rms(c)
DAV: 1 0.250782203599E+04 0.25078E+04 -0.17229E+05 1192 0.101E+03
DAV: 2 -0.369381138270E+03 -0.28772E+04 -0.26693E+04 1480 0.255E+02
DAV: 3 -0.642432618445E+03 -0.27305E+03 -0.26881E+03 1504 0.799E+01
DAV: 4 -0.648335273151E+03 -0.59027E+01 -0.58373E+01 1504 0.123E+01
DAV: 5 -0.648495137422E+03 -0.15986E+00 -0.15845E+00 1528 0.181E+00 0.762E+01
DAV: 6 -0.558711986211E+03 0.89783E+02 -0.35601E+02 1440 0.275E+01 0.324E+01
DAV: 7 -0.573486239834E+03 -0.14774E+02 -0.29068E+01 1384 0.784E+00 0.187E+01
DAV: 8 -0.578620056558E+03 -0.51338E+01 -0.32284E+00 1416 0.291E+00 0.748E+00
DAV: 9 -0.579416938305E+03 -0.79688E+00 -0.11702E+00 1448 0.176E+00 0.202E+00
DAV: 10 -0.579488224920E+03 -0.71287E-01 -0.20028E-01 1472 0.604E-01 0.780E-01
DAV: 11 -0.579502899887E+03 -0.14675E-01 -0.16029E-02 1520 0.194E-01 0.381E-01
DAV: 12 -0.579506778534E+03 -0.38786E-02 -0.39974E-03 1360 0.120E-01 0.174E-01
DAV: 13 -0.579507633129E+03 -0.85459E-03 -0.71188E-04 1368 0.499E-02 0.371E-02
DAV: 14 -0.579507678529E+03 -0.45400E-04 -0.52162E-05 976 0.142E-02
1 F= -.57950768E+03 E0= -.57950768E+03 d E =-.973631E-13 mag= -0.0084


I would appreciate any comments and/or suggestions on how to fix this issue.

Thank you.

henrique_miranda
Global Moderator
Posts: 505
Joined: Mon Nov 04, 2019 12:41 pm

Re: vasp_gam [VASP 5.x and 6.x] with Intel hangs for large supercells

#2 Post by henrique_miranda » Wed Mar 22, 2023 5:12 pm

This is strange indeed, but it is hard to say what is going wrong from the information you posted.
Could you share the makefile.include you used?
On what machine are you running the code? How many sockets and cores per node?
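For reference, the socket/core layout of a node can be checked with standard Linux tools (nothing VASP-specific); a minimal example:

Code: Select all

lscpu | grep -E 'Socket|Core|Thread'   # prints sockets, cores per socket, threads per core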

It looks to me like vasp.5.4.4 is compiled without OpenMP

Code: Select all

running on 16 total cores
distrk: each k-point on 16 cores, 1 groups
distr: one band on 2 cores, 8 groups
using from now: INCAR
vasp.5.4.4.18Apr17-6-g9f103f2a35 (build Jun 18 2021 15:11:44) gamma-only
so you are using 16 CPUs,

while vasp.6.4.0 is compiled with OpenMP

Code: Select all

running 8 mpi-ranks, with 4 threads/rank, on 1 nodes
distrk: each k-point on 8 cores, 1 groups
distr: one band on 1 cores, 8 groups
vasp.6.4.0 14Feb23 (build Mar 20 2023 14:38:27) gamma-only
so you are using 32 CPUs (8 MPI ranks × 4 OpenMP threads).

Are you running these calculations on the same machine?
Can it be that you have more threads than CPUs available?
You can try setting OMP_NUM_THREADS=1 and see if the calculation runs through (a minimal sketch follows below).
Does this issue also occur with vasp_std?
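A minimal sketch of that OMP_NUM_THREADS test, assuming a plain mpirun launch (the binary path and rank count are placeholders for your own setup):

Code: Select all

export OMP_NUM_THREADS=1          # force a single OpenMP thread per MPI rank
mpirun -np 16 /path/to/vasp_gam   # placeholder path; adjust ranks to your node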

For additional information I would like to point you to our list of validated toolchains:
https://www.vasp.at/wiki/index.php/Toolchains

khoang
Newbie
Posts: 3
Joined: Tue Nov 12, 2019 10:05 pm

Re: vasp_gam [VASP 5.x and 6.x] with Intel hangs for large supercells

#3 Post by khoang » Mon Apr 10, 2023 9:44 pm

Hi Henrique,

Thank you very much for your attention. I should have emphasized that the issue occurs only on AMD CPUs (not on Intel CPUs). The code is compiled on a login node with an AMD EPYC 7532 32-Core Processor (1 socket, 32 cores/socket) and runs on a compute node with either one AMD EPYC 7662 64-Core Processor (1 socket, 64 cores/socket) or two of them (2 sockets, 64 cores/socket). The same issue is encountered when running the code on the node where it was compiled (i.e., the login node).

For your reference, attached is the makefile.include [vasp.6.4.0 with Intel oneAPI 2023.0.0 (specifically, compiler/2023.0.0 mpi/2021.8.0 mkl/2023.0.0)].

+ "It looks to me like vasp.5.4.4 is compiled without openMP... so you are using 16 cpus" --That is correct. 16 cores for 16 MPI processes.
+ "while vasp.6.4.0 is compiled with openMP... so you are using 32 cpus" -- That is correct. 32 cores for 8 MPI processes x 4 OpenMP threads.
+ "Are you running these calculations on the same machine?" -- Yes.
+ "Can it be that you have more threads than CPUs available?" --No, as seen above. For such a simple calculation (96 atoms at the PBE level), the job may still run even with accidental oversubscription.
+ "You can try setting OMP_NUM_THREADS=1 and see if the calculation runs through." --It encounters the same issue.
+ "Does this issue also occur with vasp_std?" --No, only vasp_gam; vasp_std works fine.
+ "For additional information I would like to point you to our list of validated toolchains..."--I followed one of VASP's validated chains as close as possible [using compiler/2022.0.2 mpi/2021.5.1 mkl/2022.0.2] but got the same issue.

Please let me know if you need any other information.

henrique_miranda
Global Moderator
Posts: 505
Joined: Mon Nov 04, 2019 12:41 pm

Re: vasp_gam [VASP 5.x and 6.x] with Intel hangs for large supercells

#4 Post by henrique_miranda » Tue Nov 14, 2023 8:41 am

I am really sorry for my very late reply.
Somehow your answer slipped under my radar, and I only now encountered it again while reviewing old threads.

If this issue still persists, could you try running the code with LSCALAPACK = .FALSE. in the INCAR?
I have not encountered this type of problem myself, but the fact that you don't see it for smaller cells makes me think scaLAPACK might be the culprit.
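A minimal INCAR fragment for that test (LSCALAPACK is a standard INCAR tag; the rest of your INCAR stays unchanged):

Code: Select all

LSCALAPACK = .FALSE.   ! bypass scaLAPACK for the parallel diagonalization/orthonormalization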

khoang
Newbie
Posts: 3
Joined: Tue Nov 12, 2019 10:05 pm

Re: vasp_gam [VASP 5.x and 6.x] with Intel hangs for large supercells

#5 Post by khoang » Mon Jan 22, 2024 10:09 pm

Thanks, Henrique.

Yes, setting LSCALAPACK = .FALSE. works in some cases. In others, it triggers a failure in one of the LAPACK subroutines ("LAPACK: Routine ZPOTRF failed!"). In those cases, I have to abandon the Intel-compiled vasp_gam and go back to using vasp_std.
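For reference, the vasp_std fallback uses the standard Gamma-only KPOINTS file below; it reproduces the vasp_gam results, just with complex instead of real wavefunctions:

Code: Select all

Gamma-point only
0
Gamma
1 1 1
0 0 0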
