VASP 6.2.0 GPU Hybrid Functional Tests Fail

Questions regarding the compilation of VASP on various platforms: hardware, compilers and libraries, etc.


Moderators: Global Moderator, Moderator

jglazar
Newbie
Posts: 5
Joined: Mon Apr 27, 2020 3:07 pm

VASP 6.2.0 GPU Hybrid Functional Tests Fail

#1 Post by jglazar » Fri Mar 05, 2021 7:04 pm

Hi VASP team,

I successfully installed VASP 6.2.0 using OpenMPI a few weeks ago. Now, I'd like to install the OpenACC GPU port of VASP. I was able to set up NVIDIA SDK 20.9 and use my cluster's copy of Intel 17.0.3 MKL to make the VASP binaries. I edited the "makefile.include.linux_nv_acc+omp+mkl" file listed on the VASP wiki to match the particulars of my system. I also had to manually prepend the NVIDIA SDK .../20.9/compilers/lib and .../20.9/compilers/extras/qd/lib directories to the LD_LIBRARY_PATH environment variable in order to successfully make the binaries.
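For reference, the environment tweak looked roughly like the sketch below. The SDK install prefix is an assumption; substitute your cluster's actual path.

```shell
# Hypothetical NVIDIA HPC SDK 20.9 prefix -- adjust to your site's install path.
NVROOT=/opt/nvidia/hpc_sdk/Linux_x86_64/20.9

# Prepend the compiler runtime and quad-double (qd) libraries so the loader
# finds them before any system copies; keep the existing path if one is set.
export LD_LIBRARY_PATH="$NVROOT/compilers/lib:$NVROOT/compilers/extras/qd/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```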

Running the testsuite shows that all tests succeed except the Hybrid functional tests, which generally fail the force calculations. The "TOTAL-FORCE" section at the end of the OUTCAR shows "NaN" for each entry. Interestingly, the Electron-Ion, Ewald-Force, and Non-Local-Force entries are all correct. Digging into the force.F source code shows that the HARFOR, PARFOR, FORHF, or TAUFOR variables could be causing the issue. The TAUFOR variable is used for MetaGGA functionals, so that's likely not it. Some function call that calculates one of the other variables must be producing the NaNs.

Some of the tests also fail the frequency calculations. Those don't show "NaN" values, but are instead just incorrect. I'm not sure what's going wrong there.

I tried making the VASP binaries with the "makefile.include.linux_nv_acc" file given in the .tar.gz package, but that had the same errors. It also showed "DSYEV" errors. These did not appear in my "makefile.include.linux_nv_acc+omp+mkl" tests.

Perhaps there's something wrong with my LAPACK or scaLAPACK packages? Maybe something is going wrong with the NVIDIA linear algebra packages?

Please let me know if you have any ideas.

Best,
James

merzuk.kaltak
Administrator
Posts: 285
Joined: Mon Sep 24, 2018 9:39 am

Re: VASP 6.2.0 GPU Hybrid Functional Tests Fail

#2 Post by merzuk.kaltak » Thu Mar 11, 2021 1:17 pm

Hello,

I have noticed that there are MPI errors in your testsuite.log:

Code:

--------------------------------------------------------------------------
A process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          [[62262,1],0] (PID 3830)

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
This seems troublesome.
To make sure your environment is set up correctly,
I suggest you compile VASP without the ACC option (it seems you are not using GPUs for your run anyway).
Use one of the corresponding makefiles in arch/, for instance: makefile.include.linux_nv_omp.
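As a sketch, the CPU-only rebuild could look like the following, run from the VASP source root (the targets are the standard VASP 6 build-system targets; whether `make test` is wired up depends on the release):

```shell
# Switch to the OpenMP (non-ACC) makefile shipped in arch/.
cp arch/makefile.include.linux_nv_omp makefile.include

make veryclean   # remove objects left over from the previous ACC build
make all         # builds vasp_std, vasp_gam, and vasp_ncl
make test        # rerun the testsuite against the new binaries
```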

jglazar
Newbie
Posts: 5
Joined: Mon Apr 27, 2020 3:07 pm

Re: VASP 6.2.0 GPU Hybrid Functional Tests Fail

#3 Post by jglazar » Wed Mar 17, 2021 2:27 am

Hi again,

I tried installing VASP 6.2.0 with the `makefile.include.linux_nv_omp` settings, and now it hangs when running tests. The `make test` output (attached) shows that VASP loads up but never enters the "main loop." Note that the `mpirun: Forwarding signal 18 to job` line at the end of my file just reflects that I killed the job after letting it run for an hour.

I noticed that the OpenMPI fork() error doesn't show up in the `acc+omp+mkl` tests. Perhaps something else is going wrong? Also, why does VASP not show that I'm running on GPU? I submitted my job to a GPU node on my computing cluster and specified "--constraint=gpu" in the SLURM header.
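For context, my SLURM header looks roughly like the sketch below. On many clusters, `--constraint=gpu` only steers the job onto GPU nodes; a separate `--gres` (or `--gpus`) request is usually needed for a GPU to actually be allocated to the job. The exact option names are site-specific assumptions here.

```shell
#!/bin/bash
# Hedged SLURM sketch -- option names and values vary by site.
#SBATCH --constraint=gpu
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1

# Quick sanity check that a GPU is actually visible inside the job.
nvidia-smi

mpirun -np 1 vasp_std
```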

Best,
James
