VASP crashes on one specific GPU
I have compiled VASP 6.4.3 with GPU support on Ubuntu Server 24.04 without any errors. We have four identical NVIDIA A100 GPUs in our system and use Slurm to manage resources. Whenever a job attempts to use GPU 2, it crashes immediately. The crash does not depend on the job: so far, all jobs run perfectly fine on GPUs 0, 1, and 3. We only run jobs on single GPUs, and the test suite ran successfully (using GPU 0). All four GPUs work perfectly fine for other tasks, e.g., training neural networks for interatomic potentials. What could cause this crash?
This is the stdout of a crashed job:
running 1 mpi-ranks, with 2 threads/rank, on 1 nodes
distrk: each k-point on 1 cores, 1 groups
distr: one band on 1 cores, 1 groups
OpenACC runtime initialized ... 1 GPUs detected
And this is the stderr message:
 -----------------------------------------------------------------------------
|                     _     ____    _    _    _____     _                     |
|                    | |   |  _ \  | |  | |  / ____|   | |                    |
|                    | |   | |_) | | |  | | | |  __    | |                    |
|                    |_|   |  _ <  | |  | | | | |_ |   |_|                    |
|                     _    | |_) | | |__| | | |__| |    _                     |
|                    (_)   |____/   \____/   \_____|   (_)                    |
|                                                                             |
|     internal error in: mpi.F at line: 903                                   |
|                                                                             |
|     M_init_nccl: Error in ncclCommInitRank                                  |
|                                                                             |
|     If you are not a developer, you should not encounter this problem.      |
|     Please submit a bug report.                                             |
|                                                                             |
 -----------------------------------------------------------------------------
Warning: ieee_inexact is signaling
1
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[55828,1],0]
Exit code: 1
--------------------------------------------------------------------------
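One way to get more detail on the ncclCommInitRank failure would be to rerun the same job on each GPU with NCCL's own logging switched on. The following is only a minimal sketch under stated assumptions: it is meant to be run directly on the node rather than through Slurm (Slurm may set or renumber CUDA_VISIBLE_DEVICES itself), the helper names nccl_debug_env and run_on_gpu are hypothetical, and the launch command has to be supplied by the caller. CUDA_VISIBLE_DEVICES and NCCL_DEBUG are standard CUDA and NCCL environment variables.

```python
import os
import subprocess


def nccl_debug_env(gpu_index, base_env=None):
    """Return a copy of the environment that exposes a single GPU
    and enables verbose NCCL logging."""
    env = dict(os.environ if base_env is None else base_env)
    # Make only the chosen physical GPU visible to the process.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    # Ask NCCL to print its initialization details to the job output.
    env["NCCL_DEBUG"] = "INFO"
    return env


def run_on_gpu(gpu_index, command):
    """Run `command` (a list of arguments) pinned to one GPU.

    Returns the exit code and the combined stdout/stderr text.
    """
    result = subprocess.run(
        command,
        env=nccl_debug_env(gpu_index),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return result.returncode, result.stdout
```

For example, calling run_on_gpu(2, [...]) and run_on_gpu(0, [...]) with the same launch line as in the Slurm script, from inside a calculation directory, and comparing the NCCL lines in the two outputs might show at which step the initialization on GPU 2 diverges.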