Compile vasp.6.x with NVIDIA NCCL

Message

khoang · #1 Post by **khoang** » Fri Feb 21, 2025 5:30 am

Dear All,

I am trying to compile a vasp.6.x version that can run on multiple GPU nodes (not just multiple GPUs on a single node). I would greatly appreciate it if you could share some tips on how to compile vasp.6.x with NVIDIA NCCL (via NVIDIA HPC SDK).

Thank you.

#2 Post by **manuel_engel1** » Fri Feb 21, 2025 9:26 am

Hello,

Thanks for reaching out. Running VASP on multiple nodes with GPUs is possible out of the box when you compile with the NVIDIA HPC SDK. For that, you can follow our VASP 6 installation guide and select one of the available makefile.include for the NVIDIA compiler.

Let me know if that answers your question.

Kind regards

khoang · #3 Post by **khoang** » Fri Feb 21, 2025 9:32 pm

Hi Manuel,

Thank you very much. It's comforting to know that the compilation is quite straightforward.

That also encouraged me to give it another try for vasp.6.5.0 with Nvidia HPC SDK 25.1. The executables work fine on a single node (with one or more GPUs).

The situation with running on multiple GPU nodes is more complicated. Using the same calculation as test jobs, here is the outcome:

Some test jobs ran fine without any issue (again, they really ran on different nodes--not on single node).
Some test jobs failed with the error:

Code: Select all

 ../../bin/vasp_std: error while loading shared libraries: libqdmod.so.0: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
../../bin/vasp_std: error while loading shared libraries: libqdmod.so.0: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
 
  Process name: [[29894,1],2]
  Exit code:    127
--------------------------------------------------------------------------

NOTE: The jobs were on different combinations of nodes. The successful/failed ratio is about 50/50. libqdmod.so.0 is available in the environment as shown by ldd. Adding the path to libqdmod.so.0 explicitly to the job submission script didn't help.

A small number of test jobs failed with the error:

Code: Select all

Connection closed by xxx.xxx.xx.xx port 22
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   gpu0023
  target node:  gpu0034

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------

I would appreciated any additional comments/suggestions.

#4 Post by **manuel_engel1** » Mon Feb 24, 2025 10:04 am

I'm sorry to hear that you're having problems executing VASP on multiple nodes.

From what I can tell, it seems like you are not able to start VASP on multiple nodes. Or does the calculation start and at some point you get these error messages?

I do not believe that you have an issue with compilation but that it is somehow an issue with the environment. Based on the error messages, it's likely that there are certain nodes that are not setup correctly or that are missing the required libraries. Here are a few things that would help identify the problem:

Did you explicitly check on these nodes where the jobs are failing that all the required libraries are available?
Do you always submit to the exact same nodes or are they different?
Can you submit VASP explicitly on these nodes to see if the issue is reproducible?
If you can identify the nodes where it doesn't work, can you submit a single-node run to see if that works?

khoang · #5 Post by **khoang** » Mon Mar 03, 2025 6:55 pm

Thanks, Manuel. I'll do more systematic tests and get back to you.

My Community

Compile vasp.6.x with NVIDIA NCCL

Compile vasp.6.x with NVIDIA NCCL

Re: Compile vasp.6.x with NVIDIA NCCL

Re: Compile vasp.6.x with NVIDIA NCCL

Re: Compile vasp.6.x with NVIDIA NCCL

Re: Compile vasp.6.x with NVIDIA NCCL