memory leak: AIMD with openmpi 4.1.4 on GPUs

liu_jiyuan
Newbie
Posts: 4
Joined: Tue Mar 29, 2022 6:44 am

#1 Post by liu_jiyuan » Sat Nov 19, 2022 2:12 pm

Hi all,

I am Liu Jiyuan, who asked about this memory leak problem in the Q&A during the VASP workshop.

This job ran on 2x A30 GPUs attached to two Xeon Gold 6326 sockets with 256 GB of memory. Memory usage exceeded the total available memory when the calculation reached 6700+ steps. VASP was compiled with nvhpc 22.7 along with CUDA 11.7 and VTST. Open MPI 4.1.4 (ompi414) was compiled with nvc and nvfortran, with CUDA-aware support.

Thanks!

henrique_miranda
Global Moderator
Posts: 505
Joined: Mon Nov 04, 2019 12:41 pm

#2 Post by henrique_miranda » Mon Nov 21, 2022 5:06 pm

Hi Liu,

Could you try running the same calculation using OMP_NUM_THREADS=1 and check if the problem persists?
Recently we had a report about a similar issue in this thread:
https://www.vasp.at/forum/viewtopic.php?f=3&t=18493
We are still looking into it, but knowing whether setting OMP_NUM_THREADS=1 alleviates the issue would help us narrow down the possible causes.
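For anyone wanting to repeat this test from a job script, here is a minimal Python sketch of launching a run with OMP_NUM_THREADS=1. The launch command (`mpirun -np 2 -x OMP_NUM_THREADS vasp_std`) and the helper names `build_env` and `run_vasp` are assumptions for illustration, not taken from the original post; adapt them to your own setup.

```python
import os
import subprocess


def build_env(omp_threads=1, base_env=None):
    """Return a copy of the environment with OMP_NUM_THREADS set."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(omp_threads)
    return env


def run_vasp(command, omp_threads=1):
    """Run the given launch command with the requested OpenMP thread count.

    `command` is a list such as
    ["mpirun", "-np", "2", "-x", "OMP_NUM_THREADS", "vasp_std"]
    (hypothetical; the executable name and rank count depend on your build).
    """
    return subprocess.run(command, env=build_env(omp_threads), check=False)
```

Calling `run_vasp(["mpirun", "-np", "2", "-x", "OMP_NUM_THREADS", "vasp_std"], omp_threads=1)` from the calculation directory would start the job with a single OpenMP thread per MPI rank. The `-x` option asks Open MPI's `mpirun` to export the variable to the launched ranks, which matters mainly for multi-node runs.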

liu_jiyuan
Newbie
Posts: 4
Joined: Tue Mar 29, 2022 6:44 am

#3 Post by liu_jiyuan » Thu Nov 24, 2022 1:02 am

Hi Henrique,

OMP_NUM_THREADS=1 works! The memory usage is greatly reduced.

OUTCAR for OMP_NUM_THREADS=16, ionic steps 0~6000:
Total CPU time used (sec): 96743.969
User time (sec): 94038.925
System time (sec): 2705.043
Elapsed time (sec): 70288.818

Maximum memory used (kb): 131238960.
Average memory used (kb): N/A

Minor page faults: 102568499
Major page faults: 5194
Voluntary context switches: 59524318

OUTCAR for OMP_NUM_THREADS=1, ionic steps 6001~12000 (continuation run):
Total CPU time used (sec): 83684.477
User time (sec): 83524.712
System time (sec): 159.766
Elapsed time (sec): 83831.101

Maximum memory used (kb): 16510832.
Average memory used (kb): N/A

Minor page faults: 18466879
Major page faults: 4622
Voluntary context switches: 846170

The real memory usage is much higher than the recorded value, but the magnitude makes sense.
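To put the two "Maximum memory used" values side by side, here is a small Python sketch that extracts the figure from an OUTCAR summary and converts it to GiB. It assumes "kb" means 1024 bytes; the function names are hypothetical, and the two sample strings are copied from the summaries above.

```python
import re

# Matches the summary line, e.g. "Maximum memory used (kb): 131238960."
_MAX_MEM = re.compile(r"Maximum memory used \(kb\):\s*([0-9.]+)")


def max_memory_kb(outcar_text):
    """Return the last 'Maximum memory used (kb)' value in the text, or None."""
    matches = _MAX_MEM.findall(outcar_text)
    return float(matches[-1]) if matches else None


def kb_to_gib(kb):
    """Convert kb to GiB, assuming 1 kb = 1024 bytes."""
    return kb / 1024**2


threads16 = "Maximum memory used (kb): 131238960."
threads1 = "Maximum memory used (kb): 16510832."

mem16 = max_memory_kb(threads16)
mem1 = max_memory_kb(threads1)
print(f"16 threads: {kb_to_gib(mem16):.1f} GiB")
print(f" 1 thread:  {kb_to_gib(mem1):.1f} GiB")
print(f"ratio:      {mem16 / mem1:.2f}")
```

Under that unit assumption, the recorded peak drops from roughly 125 GiB with 16 threads to roughly 16 GiB with 1 thread, a factor of about 8. For a real file, pass the contents of OUTCAR to `max_memory_kb` instead of the sample strings.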

Thanks.
