PLUGINS_STRUCTURE_errors

Problems running VASP: crashes, internal errors, "wrong" results.


thomas_pigeon
Newbie
Posts: 6
Joined: Thu Feb 20, 2025 12:00 pm

PLUGINS_STRUCTURE_errors

#1 Post by thomas_pigeon » Thu Feb 20, 2025 3:34 pm

I compiled VASP 6.5.0 with the Python plugins option, with two different compilers (gcc and ifort); see the attached makefile.include files.
I run VASP on a node with two AMD EPYC™ Milan 7763 processors (64 cores, 2.45 GHz, 256 MB cache each).
The plugin is only used to update the atomic positions at every step through a Python code that runs Langevin dynamics with an integrator from ASE adapted for the plugin.
Depending on the ML_MODE and ML_LMLFF tags in the INCAR, I obtain two types of errors with both builds.
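
For context, the position update that such a plugin performs each step is roughly of the following form. This is only a simplified Euler-Maruyama sketch in ASE units, not the actual ASE integrator nor the real VASP plugin entry point; the function name and signature are illustrative. In the actual setup, VASP supplies the forces at each ionic step and the plugin returns the updated positions.

Code: Select all

import numpy as np
from ase import units  # ASE unit system: Angstrom, eV, amu

def langevin_step(positions, velocities, forces, masses,
                  dt=1.0 * units.fs, temperature_K=300.0,
                  friction=0.01, rng=None):
    """Advance positions and velocities by one simplified Langevin step.

    All quantities are in ASE units; friction is in inverse ASE time units.
    """
    rng = np.random.default_rng() if rng is None else rng
    m = masses[:, None]
    # Deterministic drift: Newtonian acceleration plus velocity friction.
    accel = forces / m - friction * velocities
    # Random kick with the fluctuation-dissipation amplitude.
    sigma = np.sqrt(2.0 * friction * units.kB * temperature_K / m)
    velocities = velocities + dt * accel + np.sqrt(dt) * sigma * rng.standard_normal(positions.shape)
    positions = positions + dt * velocities
    return positions, velocities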

With ML_LMLFF=.FALSE., the dynamics (through the plugin) runs for 4500 steps (out of 10 000) and then I obtain the following error:

Code: Select all

slurmstepd-topaze1701: error: Detected 1 oom_kill event in StepId=7485027.0. Some of the step tasks have been OOM Killed.
srun: error: topaze1701: task 64: Out Of Memory
slurmstepd-topaze1701: error:  mpi/pmix_v4: _errhandler: topaze1701 [0]: pmixp_client_v2.c:212: Error handler invoked: status = -61, source = [slurm.pmix.7485027.0:64]
slurmstepd-topaze1701: error: *** STEP 7485027.0 ON topaze1701 CANCELLED AT 2025-02-19T23:19:21 ***
srun: Job step aborted: Waiting up to 302 seconds for job step to finish.
slurmstepd-topaze1701: error:  mpi/pmix_v4: _errhandler: topaze1701 [0]: pmixp_client_v2.c:212: Error handler invoked: status = -61, source = [slurm.pmix.7485027.0:0]
+ exit 0

With ML_LMLFF=.TRUE. and ML_MODE = train, I do not obtain any error and can run the dynamics (through the plugin) for 10 000 steps. In that case, ML_CTIFOR was set to a high value so that there are no DFT calls and only force-field evaluations.

With ML_LMLFF=.TRUE. and ML_MODE = run, the VASP execution stops before calling the Python interface but after writing the first energy and forces to the OUTCAR.
I obtain the following error (many times):

Code: Select all

[topaze1150:3973629:0:3973629] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:3973629) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x00000000005aa353 rc_add_()  ???:0
 2 0x00000000004cd81b plugins_mp_plugins_structure_()  ???:0
 3 0x0000000001eff5f1 MAIN__()  ???:0
 4 0x000000000041fba2 main()  ???:0
 5 0x000000000003ad85 __libc_start_main()  ???:0
 6 0x000000000041faae _start()  ???:0
=================================
[topaze1150:3973585:0:3973585] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid:3973585) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x0000000000584695 map_forward_()  ???:0
 2 0x000000000058921d fftbrc_plan_mpi_()  ???:0
 3 0x000000000058d33b fft3d_mpi_()  ???:0
 4 0x0000000000590e98 fft3d_()  ???:0
 5 0x00000000004cd838 plugins_mp_plugins_structure_()  ???:0
 6 0x0000000001eff5f1 MAIN__()  ???:0
 7 0x000000000041fba2 main()  ???:0
 8 0x000000000003ad85 __libc_start_main()  ???:0
 9 0x000000000041faae _start()  ???:0
=================================
[topaze1150:3973645:0:3973645] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1268000007f)
==== backtrace (tid:3973645) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x0000000000584695 map_forward_()  ???:0
 2 0x000000000058921d fftbrc_plan_mpi_()  ???:0
 3 0x000000000058d33b fft3d_mpi_()  ???:0
 4 0x0000000000590e98 fft3d_()  ???:0
 5 0x00000000004cd838 plugins_mp_plugins_structure_()  ???:0
 6 0x0000000001eff5f1 MAIN__()  ???:0
 7 0x000000000041fba2 main()  ???:0
 8 0x000000000003ad85 __libc_start_main()  ???:0
 9 0x000000000041faae _start()  ???:0
=================================

manuel_engel1
Global Moderator
Posts: 212
Joined: Mon May 08, 2023 4:08 pm

Re: PLUGINS_STRUCTURE_errors

#2 Post by manuel_engel1 » Fri Feb 21, 2025 1:26 pm

Hello,

Thank you kindly for the report. After talking with our ML and plugin experts, I am able to come back with a partial answer.

In the case where ML_LMLFF=True and ML_MODE=run, there is indeed a problem as some of the DFT quantities are not allocated. When running with the VASP plugin, these non-allocated quantities are accessed, causing the segmentation fault you see. We are already working on a fix for this issue.

Why the first case is running out of memory is still a bit unclear to me. It might be due to an unrelated bug, or it might be something more benign; this still needs to be investigated.

Kind regards

Manuel Engel
VASP developer


manuel_engel1
Global Moderator
Posts: 212
Joined: Mon May 08, 2023 4:08 pm

Re: PLUGINS_STRUCTURE_errors

#3 Post by manuel_engel1 » Fri Feb 21, 2025 2:16 pm

We have now started to investigate the issue with ML_LMLFF=False that you described first. We suspect that it could be caused by a memory leak. Could you please tell us exactly what compiler and library versions you used to build VASP?

In particular, we are interested in the exact version numbers of

  • the Fortran compiler

  • the MPI library

  • the HDF5 library (if used)

  • scaLAPACK/LAPACK

This information would be greatly appreciated.

Manuel Engel
VASP developer


thomas_pigeon
Newbie
Posts: 6
Joined: Thu Feb 20, 2025 12:00 pm

Re: PLUGINS_STRUCTURE_errors

#4 Post by thomas_pigeon » Mon Feb 24, 2025 9:46 am

Thank you very much for your answers. The errors I reported previously were obtained with two different builds of VASP.
I attach the makefile.include files used and some more details concerning the libraries.

A summary of it is:

For the first vasp build I used:

Code: Select all

$ mpif90 --version
GNU Fortran (GCC) 11.2.0
$ ompi_info --version
Open MPI v4.1.4

The AOCL scaLAPACK version is 3.2.0
The AOCL LAPACK version is 3.2.0

For the second build I used:

Code: Select all

$ mpifort --version
ifort (IFORT) 19.1.0.166 20191121
$ ompi_info --version
Open MPI v4.1.4

The ScaLAPACK and LAPACK libraries come from MKL version 20.0.0.


andreas.singraber
Global Moderator
Posts: 273
Joined: Mon Apr 26, 2021 7:40 am

Re: PLUGINS_STRUCTURE_errors

#5 Post by andreas.singraber » Mon Feb 24, 2025 9:53 am

Hello!

Thanks a lot for this information; it helps a lot! Just a quick follow-up question: does the first issue (out of memory after 4500 steps) occur with both toolchains, and is it reproducible? Thank you!

All the best,
Andreas Singraber


thomas_pigeon
Newbie
Posts: 6
Joined: Thu Feb 20, 2025 12:00 pm

Re: PLUGINS_STRUCTURE_errors

#6 Post by thomas_pigeon » Mon Feb 24, 2025 10:01 am

Yes, this first out-of-memory issue after roughly 4500 steps (sometimes it goes up to 4800 steps, but the same OOM kill event occurs) is reproducible and occurs with both builds, as does the other error with ML_LMLFF = .TRUE. and ML_MODE = run.

Thank you a lot,
Thomas Pigeon


thomas_pigeon
Newbie
Posts: 6
Joined: Thu Feb 20, 2025 12:00 pm

Re: PLUGINS_STRUCTURE_errors

#7 Post by thomas_pigeon » Wed Feb 26, 2025 8:34 am

I spotted additional bugs when running Langevin dynamics with the plugin using the same builds. As I still wonder whether it is an error in the way I built VASP or whether it comes from somewhere else, I post my observations here.

Whatever the choice of ML_LMLFF, when using the plugin the structures written to the XDATCAR are always the one given in the initial POSCAR. The CONTCAR is updated every ionic step, but this same initial structure is written inside it. On the other hand, the POSCAR is rewritten every step and contains the correct structure. I can still check that the dynamics runs correctly because the VASP plugin writes the structures into another trajectory file.
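
As a side note, one quick way to confirm this behaviour (a sketch only, assuming ASE can parse the file in place) is to compare every XDATCAR frame against the first one:

Code: Select all

# Illustrative check (assumes ASE is installed and this is run from the
# calculation directory) that every XDATCAR frame repeats the initial structure.
import numpy as np
from ase.io import read

frames = read("XDATCAR", index=":")   # one Atoms object per ionic step
first = frames[0].get_positions()
all_same = all(np.allclose(f.get_positions(), first) for f in frames)
print(f"{len(frames)} frames read; all identical to the first: {all_same}")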

When running MD using the plugin with ML_LMLFF=.TRUE., ML_MODE = train and ML_ICRITERIA = 1, starting from scratch, after 598 steps (DFT and MLFF) I obtain the following error in the std_out:

Code: Select all

[1740503092.549007] [topaze1613:369406:0]        mm_xpmem.c:161  UCX  ERROR   failed to attach xpmem apid 0x350005a2fe offset 0x1871f000 length 233472: Cannot allocate memory
[1740503092.549057] [topaze1613:369406:0]        ucp_rkey.c:418  UCX  ERROR   failed to unpack remote key from remote md[6]: Input/output error

and in the std_err:

Code: Select all

 
free(): double free detected in tcache 2

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Could not print backtrace: mmap, errno: 12

where this last line is repeated 222 times (the number of MPI tasks is 128).


andreas.singraber
Global Moderator
Posts: 273
Joined: Mon Apr 26, 2021 7:40 am

Re: PLUGINS_STRUCTURE_errors

#8 Post by andreas.singraber » Wed Feb 26, 2025 9:33 am

Dear Thomas,

thanks for your updated report. I also did quite some testing in the meantime; see below. But first, regarding the two new issues:

thomas_pigeon wrote:

    Whatever the choice of ML_LMLFF, when using the plugin the structures written to the XDATCAR are always the one given in the initial POSCAR. The CONTCAR is updated every ionic step, but this same initial structure is written inside it.

This seems to be a separate problem and I will forward this to my colleague for further investigation.

The UCX ERROR issue could be related to the memory leak. Would it be possible for you to log into the node and monitor the memory (e.g. with top) while the job is running? It would help us to know whether the memory increases on all MPI processes or only on a single one (presumably rank 0).
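
Alternatively, a small helper called from your Python plugin once per step could append each rank's resident memory to its own file, for example (just a sketch; the step hook it is called from and the file naming are assumptions, and /proc/self/status plus SLURM_PROCID are Linux/Slurm specific):

Code: Select all

import os
import time

RANK = int(os.environ.get("SLURM_PROCID", "0"))

def log_rss(step):
    """Append this rank's resident memory (VmRSS) for the given ionic step."""
    with open("/proc/self/status") as status:
        vmrss = next(line.strip() for line in status if line.startswith("VmRSS"))
    with open(f"memory_rank{RANK:04d}.log", "a") as log:
        log.write(f"{time.time():.1f}  step={step}  {vmrss}\n")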

So far I have tried to replicate your setup as closely as possible; at least for the GNU toolchain I am pretty close:

  1. GNU compiler 11.2.0, OpenMPI 4.1.4, AOCL 3.2, almost identical makefile.include

  2. Single node with 2x AMD Epyc 7713 64-core, 256 GB memory

  3. Compiled with OpenMP-support (not used though), Python plugins, fftlib, shared memory

I cannot exactly recreate the build with the Intel compiler because I do not have that version available; I tested an alternative with Intel 2022 and OpenMPI 4.1.2. However, for both the Intel and the GNU toolchain I could not yet find a substantial memory leak. I saw some memory increase over time, but on the order of kilobytes per time step, so this cannot be the culprit for running out of memory on your machine with ~240 GB.

I will continue my search for the memory leak on our hardware. In the meantime, may I ask you to perform a few more tests, if feasible:

  • Try to run the LMLFF_F test completely without the plugins by replacing your custom integrator with a VASP-internal one, e.g. like this:

    Code: Select all

    ISIF          = 0
    IBRION        = 0
    MDALGO        = 1
    ANDERSEN_PROB = 0.05
    POTIM         = 1.0
    RANDOM_SEED   =  248489752   0   0
    TEBEG         = 50
    ISYM          = -1
    NSW           = 10000
    

    Do you also observe a memory leak in this case (if possible observe with top)?

  • Since both toolchains use OpenMPI 4.1.4 there is a chance that the memory leak issue is related to the MPI setup. Do you maybe have a different toolchain available (e.g. with Intel MPI) which you could test?

  • When running the LMLFF_F test could you please add the flag

    Code: Select all

    --mca btl_base_verbose 5
    

    to your mpirun command and post (or attach) the output here? Also, in the same environment can you run the command

    Code: Select all

    ompi_info
    

    and send us the output? I have a suspicion that this may be related to UCX, which you seem to use but I do not.

Thanks a lot!

All the best,
Andreas Singraber


thomas_pigeon
Newbie
Posts: 6
Joined: Thu Feb 20, 2025 12:00 pm

Re: PLUGINS_STRUCTURE_errors

#9 Post by thomas_pigeon » Thu Feb 27, 2025 9:19 am

Dear Andreas,

Thank you a lot for your help.
I did the following tests with both builds:

  • The same LMLFF_F run with a modified VASP plugin that saves the top output to a file. These output files are attached.

  • An LMLFF_F run with the VASP MD integrator (no plugin). I connected to the node several times to print the top output; see the files attached. The OOM error appears as well in that case with both builds. I could not capture the exact moment when the calculation crashed and print top right before, but I expect the behaviour is similar to the one observed with the plugin.

  • On the machine on which I run calculations, the mpirun command is run through another command:

    Code: Select all

    ccc_mprun

    I tried to add the flag you mentioned above, but it did not write anything more (or less) than in the previous test, so I did not report those results.

The ompi_info output is also in the attached files.

I have access to Intel MPI and Intel compilers. I did a third build using mpiifort.

Code: Select all

$ mpiifort --version
ifort (IFORT) 19.1.0.166 20191121
Copyright (C) 1985-2019 Intel Corporation.  All rights reserved.

With that last build, the MD from VASP can run for at least 8000 steps with the same system, which shows that the same memory issue is no longer present. On the other hand, I cannot use the MLFF in that build: the calculation stays frozen right after it writes the beginning of the ML_LOGFILE.
Again, thank you a lot!

Thomas Pigeon


andreas.singraber
Global Moderator
Posts: 273
Joined: Mon Apr 26, 2021 7:40 am

Re: PLUGINS_STRUCTURE_errors

#10 Post by andreas.singraber » Fri Feb 28, 2025 2:35 pm

Dear Thomas,

thank you for the log files, they helped me a lot! It is evident that there is some kind of memory leak involved on your system, because one can clearly see the memory consumption building up over time. You "lose" around 50 MB per time step, which is quite substantial and finally results in the OOM error. Also, now we know it has nothing to do with the plugin system.

Regarding the additional flags to the mpirun command: you can also try to export the corresponding environment variable before using the ccc_mprun command; this should also instruct the underlying OpenMPI to report the communication infrastructure it is using:

Code: Select all

export OMPI_MCA_btl_base_verbose=5

I did a couple more tests, this time also including an OpenMPI build with UCX. Upon execution it still did not use this layer by default, but I could ultimately enforce UCX usage. However, this run also did not show a memory leak.

Moreover, I investigated the possibility of a memory leak with the tool heaptrack, which I have used successfully in the past to identify memory issues. This tool also did not report any leaks directly from VASP, only some minor ones from linked libraries (mostly MPI, scaLAPACK, etc.). However, the rates were negligible compared to what we are looking for. I did this test for both an Intel and a GNU toolchain with the same outcome.

I was able to run some longer trajectories overnight, and at least one of them showed almost no further memory increase. In this case the memory ramped up from 6.2 GB to 7.6 GB (in total!) over 20000 seconds and then stayed almost constant for another 10000 seconds (then I stopped it), with a residual leak rate of 19 kB per time step. This would not have been possible if there were really a substantial memory leak within VASP, because the MD and electronic steps just keep going. So, in summary, all the memory increase I could observe seems to come from various linked libraries, for reasons I can only speculate about (growing buffers, ...).

Unfortunately, at this point I am running out of ideas for further testing on our hardware and software stack. From my perspective there does not seem to be any hint of a memory leak in VASP, and I believe there is an issue with the setup of the machine you are using. The common denominator between the two toolchains seems to be OpenMPI, so I would focus on this part. This would also explain why you did not observe a leak for the Intel 19 toolchain (with Intel MPI). I compared your ompi_info output to mine, and although we both use version 4.1.4 there seems to be one major difference. It seems you are using a custom-built OpenMPI specific to your cluster (Bull?):

Code: Select all

                Open MPI: 4.1.4
  Open MPI repo revision: 4.1.4-Bull.4.0-221004150830
   Open MPI release date: Oct 04, 2022

The release date of OpenMPI 4.1.4 was actually May 26, 2022, so I guess this is in the end not the same source code. One also finds the bullnbc collective component, which I think is custom-made for your cluster. Maybe you can circumvent any custom communication code by setting

Code: Select all

export OMPI_MCA_btl=self,vader

before executing VASP and monitoring the memory consumption again. Otherwise, I would really recommend getting in contact with your system administrators and asking them whether they have observed similar issues before. Maybe it is possible to switch to a generic OpenMPI version for testing?

If I were to debug this further on your machine, I would recommend using heaptrack (or valgrind) to identify the source of the leak. If it is indeed OpenMPI, maybe one can pinpoint which component is causing the trouble and disable it. Unfortunately, installing heaptrack without admin rights may be cumbersome because you would also need to install its dependencies (parts of boost, libunwind); maybe your sysadmins can do this for you. You would only need the "data collector" part, the GUI is not strictly required (you could send me one output file and I could analyze it with the GUI).

Regarding the frozen run with the Intel-only toolchain when using MLFFs: I also tried to reproduce this behavior but had no problems, neither with our oldest (Intel 2022.0.1) nor our newest (Intel oneAPI 2025.0.3) compiler. Sometimes these hangups are related to MPI issues; in the past this could often be resolved by setting this environment variable before executing VASP:

Code: Select all

export I_MPI_FABRICS=shm

Finally, you mentioned incorrect file output in the POSCAR, XDATCAR and CONTCAR files: the changes in the POSCAR file come from your Python script. However, there is indeed a bug in VASP 6.5.0 affecting the other two files. This problem has been fixed and should be gone in the upcoming 6.5.1 release.

All the best,
Andreas Singraber


thomas_pigeon
Newbie
Posts: 6
Joined: Thu Feb 20, 2025 12:00 pm

Re: PLUGINS_STRUCTURE_errors

#11 Post by thomas_pigeon » Fri Feb 28, 2025 2:52 pm

Dear Andreas,

Thank you very much for your help. I managed to identify the reason for the MLFF bug with Intel MPI: it was a simple mistake in the makefile.include, I had forgotten the -DML_AVAILABLE flag. Sorry for the disturbance.
For now, I think I will stick to the Intel-only toolchain and ask the system administrators to address the memory issue.

Again thank you a lot for all your help

All the best

Thomas Pigeon


andreas.singraber
Global Moderator
Posts: 273
Joined: Mon Apr 26, 2021 7:40 am

Re: PLUGINS_STRUCTURE_errors

#12 Post by andreas.singraber » Mon Mar 03, 2025 2:13 pm

Dear Thomas,

actually, the flag -DML_AVAILABLE should not be necessary; the machine-learned force field feature is always compiled in as long as MPI is present. Anyway, I am glad you can move forward with the other toolchain! If you ever happen to find out the reason for the memory leak issue, it would be great if you could briefly update us. Thanks a lot!

All the best,
Andreas Singraber

