BSE memory estimate

Queries about input and output files, running specific calculations, etc.



xiaoming_wang
Jr. Member
Posts: 58
Joined: Tue Nov 12, 2019 4:34 am

BSE memory estimate

#1 Post by xiaoming_wang » Mon Oct 21, 2024 3:23 pm

Hello,

I'm performing BSE calculations to write out the eigenfunctions (BSEFATBAND). As I keep increasing NBSEEIG, a segfault eventually occurs, as expected. The question is how to estimate the required memory. I'm confused by the memory estimate in the output.

Code:

available memory per node:   37.38 GB, setting MAXMEM to   38278

It seems this number is the memory per MPI rank rather than per node, since I have 2 TB of memory on one node and 48 MPI ranks for the calculation.
And my BSE size is

Code:

BSE (scaLAPACK) single prec attempting allocation of   0.195 Gbyte  rank=  48384

So the 48384x48384 BSE matrix with single-precision complex numbers only costs about 17.5 GB of memory. I should have enough memory to store the matrix on each MPI rank, so why do I still encounter the segfault?
By the way, this memory problem only occurs while writing out the BSE eigenfunctions, not in the oscillator-strength calculation.
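A quick sanity check of these numbers (assuming MAXMEM is reported in MB and a single-precision complex number takes 8 bytes):

```python
# Back-of-the-envelope check (assumptions: MAXMEM is in MB,
# single-precision complex = 8 bytes per element).
rank = 48384                                  # BSE matrix rank from the output

matrix_gib = rank * rank * 8 / 2**30          # full matrix, single precision
print(f"full BSE matrix: {matrix_gib:.1f} GiB")   # ~17.4 GiB, matching ~17.5 GB

maxmem_mb = 38278                             # MAXMEM from the output
print(f"MAXMEM: {maxmem_mb / 1024:.2f} GiB")      # ~37.38, matching the log
```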

Best,
Xiaoming Wang


alexey.tal
Global Moderator
Posts: 313
Joined: Mon Sep 13, 2021 12:45 pm

Re: BSE memory estimate

#2 Post by alexey.tal » Tue Oct 22, 2024 10:02 am

Hi Xiaoming Wang,

That's correct: this memory estimate is given per MPI rank, so it is consistent with your 48 MPI ranks and the 2 TB of memory on the node.
I see in the code that we allocate a complex array of dimensions RANK*NBSEEIG on each rank in double precision, even if the BSE matrix is in single precision.
Perhaps the allocation of this matrix does exceed the available memory.
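As a rough sketch of why that matters, assuming NBSEEIG is set to the full rank of 48384 and 16 bytes per double-precision complex element:

```python
# Rough per-rank cost of the collected eigenvector array
# (assumption: NBSEEIG = NCV = 48384, complex double = 16 bytes).
ncv = nbseeig = 48384
eigvect_gib = ncv * nbseeig * 16 / 2**30
# ~34.9 GiB per rank, close to the ~37.4 GiB MAXMEM budget, so the
# distributed BSE matrix and wavefunctions on top of it can plausibly
# push a rank over the limit.
print(f"EIGVECT per rank: {eigvect_gib:.1f} GiB")
```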
How many eigenvectors are you trying to write into BSEFATBAND?
Could you please provide the output/input files, so that I can get a better idea of what calculations you are doing?


xiaoming_wang
Jr. Member
Posts: 58
Joined: Tue Nov 12, 2019 4:34 am

Re: BSE memory estimate

#3 Post by xiaoming_wang » Tue Oct 22, 2024 2:41 pm

Hi,

I was requesting all the eigenvectors. I have attached my inputs and outputs; the run used 2 nodes, each with 2 TB of memory, with 12 MPI ranks per node.

Best,
Xiaoming


xiaoming_wang
Jr. Member
Posts: 58
Joined: Tue Nov 12, 2019 4:34 am

Re: BSE memory estimate

#4 Post by xiaoming_wang » Tue Oct 22, 2024 2:53 pm

Hi,

If it is indeed a memory problem caused by allocating the giant EIGVECT matrix on each MPI rank, is there a way to write out AMAT from each rank instead? I know AMAT is distributed over the ranks and, once collected, contains the same information as EIGVECT. Basically, each rank would write out its locally stored part of AMAT together with an index indicating which NBSEEIG entry it belongs to. That way, the memory requirement should be relaxed.

Best,
Xiaoming


xiaoming_wang
Jr. Member
Posts: 58
Joined: Tue Nov 12, 2019 4:34 am

Re: BSE memory estimate

#5 Post by xiaoming_wang » Tue Oct 22, 2024 10:53 pm

Hi,
After checking the source code line by line, it seems that the allocation of the matrix EIGVECT(NBSEEIG*NCV) is not what causes the segfault. Instead, the error occurs at the line

Code:

EIGVECT((LAMBDA-1)*NCV+NCV13)=AMAT(IROW+(JCOL-1)*DESC(LLD_))

which is located at line 6407 of the subroutine WRITE_BSE_SCALA in bse.F.
So I guess there is some mismatch between the dimensions of the two matrices?


alexey.tal
Global Moderator
Posts: 313
Joined: Mon Sep 13, 2021 12:45 pm

Re: BSE memory estimate

#6 Post by alexey.tal » Wed Oct 23, 2024 8:33 am

I would like to note that BSEFATBAND was not really intended for writing out the full BSE matrix. Usually, we use it to make a fatband plot or to analyze the first few excitons, and only a few of the lowest states are needed for that.
When you try to write out all eigenvectors with this routine, a number of issues pop up. The first is that we collect all NBSEEIG eigenvectors on every node, which is not a problem when NBSEEIG is small, but is a bad idea when writing out all eigenvectors. The second problem is the 4-byte integers used in this routine. The line you correctly identified fails because of the multiplication (LAMBDA-1)*NCV: it is a product of two large integers. For example, for the last eigenvector it is 48384*48383, which exceeds the 4-byte integer range.
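The overflow can be reproduced outside VASP. A minimal Python sketch that emulates Fortran's default 4-byte signed arithmetic shows the wrap-around to a negative offset, which is what produces the out-of-bounds access:

```python
# Emulate 4-byte signed integer arithmetic (Fortran default INTEGER).
def int32(x):
    return (x + 2**31) % 2**32 - 2**31

LAMBDA = NCV = 48384                 # last eigenvector index, matrix rank
offset_4byte = int32((LAMBDA - 1) * NCV)
offset_8byte = (LAMBDA - 1) * NCV    # Python ints don't overflow

print(offset_4byte)   # -1954004224: the index has wrapped negative
print(offset_8byte)   # 2340963072, exceeds INT32_MAX (2147483647)
```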

I made a patch for VASP 6.4.3 that should solve this problem by using 8-byte integers in this routine.
However, I think this is really not the optimal way to write out all the eigenvectors. Could you describe what exactly you need all the eigenvectors for and how you intend to use them, so that we can find a better mechanism for writing out the full BSE matrix and/or all eigenvectors?

Code:

diff --git a/src/bse.F b/src/bse.F
index 6420e3931..50cf2bfc7 100644
--- a/src/bse.F
+++ b/src/bse.F
@@ -1440,7 +1440,7 @@ ccintegration: DO IALPHA=1,NALPHA
     IF ( LscaLAPACKaware) THEN
 #ifdef scaLAPACK
        CALL WRITE_BSE_SCALA(WHF, ISP_LOW, ISP_HIGH, &
-            BD, BSE_INDEX, R, AMAT_SCALA, NCV, NBSEEIG, IO%IU6, BSE_DESC)
+            BD, BSE_INDEX, R, AMAT_SCALA, INT(NCV,8), INT(NBSEEIG,8), IO%IU6, BSE_DESC)
 #endif
     ELSE
        IF (IO%IU6>=0) THEN
@@ -5663,12 +5663,12 @@ cp: DO N=1,WHF%WDES%COMM_INTER%NCPU
     TYPE (wavespin) WHF
     TYPE (latt) LATT_CUR
     TYPE(bse_matrix_index) :: BSE_INDEX
-    INTEGER :: NCV
-    INTEGER :: NBSEEIG
+    INTEGER(8) :: NCV
+    INTEGER(8) :: NBSEEIG
     COMPLEX(q) :: EIGVECT(:)
     REAL(q) :: R(:)
  ! local
-    INTEGER K1,K2,K3,K4,NPOS1,NPOS2,MINL
+    INTEGER(8) K1,K2,K3,K4,NPOS1,NPOS2,MINL
     INTEGER I, J
     INTEGER     :: LAMBDA, ISP
     REAL(q) :: RIP(NCV),MINV 
@@ -5853,7 +5853,7 @@ cp: DO N=1,WHF%WDES%COMM_INTER%NCPU
 #else
     GDEF  :: AMAT(:)
 #endif
-    INTEGER :: NCV
+    INTEGER(8) :: NCV
     INTEGER :: ANTIRES
     INTEGER :: IU6
  ! local
@@ -5867,12 +5867,12 @@ cp: DO N=1,WHF%WDES%COMM_INTER%NCPU
     INTEGER, OPTIONAL :: DESC(DLEN_)
 !MB
     COMPLEX(q), ALLOCATABLE :: EIGVECT(:)
-    INTEGER :: NBSEEIG
+    INTEGER(8) :: NBSEEIG
 
 ! BLACS variables
     INTEGER, EXTERNAL ::     NUMROC
     INTEGER MYROW, MYCOL, NPROW, NPCOL, NP,NQ
-    INTEGER I1RES, J1RES, IROW, JCOL
+    INTEGER(8) I1RES, J1RES, IROW, JCOL
     INTEGER I1, I2, J1, J2
 
 !Large NBSEEIG value can overrun the memory of a local node.
@@ -5938,7 +5938,7 @@ cp: DO N=1,WHF%WDES%COMM_INTER%NCPU
 
 !Gather all parts of the eigenvectors stored on the different nodes
 !    CALLMPI( M_sum_z(WHF%WDES%COMM, EIGVECT, NBSEEIG*NCV))
-    CALLMPI( M_sum_z(WHF%WDES%COMM_INTER, EIGVECT, NBSEEIG*NCV))
+    CALLMPI( M_sum_z8(WHF%WDES%COMM_INTER, EIGVECT, NBSEEIG*NCV))
     IF (IU6>=0) CALL WRITE_BSE(WHF,NCV,BSE_INDEX,EIGVECT,R,NBSEEIG)
       
DEALLOCATE(EIGVECT)

xiaoming_wang
Jr. Member
Posts: 58
Joined: Tue Nov 12, 2019 4:34 am

Re: BSE memory estimate

#7 Post by xiaoming_wang » Wed Oct 23, 2024 12:59 pm

Hi,
Thanks for your help. I will try the patch.
I'm investigating second-harmonic generation and transient absorption, both of which involve sums over states of the exciton-exciton transition dipoles. These require many more exciton states, if not all of them, than a simple optical absorption calculation. I need the exciton wavefunctions A_{cv} to multiply some DFT matrix elements between the c/v states.
Anyway, I think it is a good idea to seek an efficient way to write out all the eigenvectors. For my part, I'm thinking about dumping AMAT to binary files from all the ranks instead of collecting it. For example, at line 6407, we would just write AMAT to disk. But then we would still need a script to collect the eigenvectors, since AMAT is distributed over NCV rather than NBSEEIG.


alexey.tal
Global Moderator
Posts: 313
Joined: Mon Sep 13, 2021 12:45 pm

Re: BSE memory estimate

#8 Post by alexey.tal » Wed Oct 23, 2024 1:36 pm

Could you show the equation or a reference paper for what you are implementing? It should be much easier to do this directly in VASP, instead of writing out so much data and then reading it back in.

For me, I'm thinking about dumping the AMAT to binary files by all the nodes instead of collecting them. For example, at line 6407, we just write the AMAT to disk. But then I think we still need a script to collect the eigenvectors since AMAT is distributed over NCV instead of NBSEEIG.

That would allow parallel IO, but it would not solve the problem of allocating the very large EIGVECT. Are you trying to speed up the IO? In fact, AMAT is block-cyclically distributed across all MPI ranks, so we compute the corresponding part of every eigenvector on every rank and then sum up all the elements over all ranks. After that, we could simply write out a subset of eigenvectors from every rank, since the full set of eigenvectors is stored on every rank.
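For illustration, here is a small Python sketch of how a block-cyclic distribution splits one matrix dimension over the ranks, in the spirit of scaLAPACK's NUMROC (the block size nb=64 is just an assumption, not the value VASP uses):

```python
# Sketch of scaLAPACK-style block-cyclic ownership along one dimension
# (assumptions: zero-based process ids, zero source-process offset,
# block size nb=64 chosen arbitrarily for illustration).
def numroc(n, nb, iproc, nprocs):
    """Number of elements of an n-sized dimension, split into blocks of
    nb and dealt out cyclically, that land on process iproc of nprocs."""
    nblocks, extra = divmod(n, nb)          # full blocks + partial tail
    count = (nblocks // nprocs) * nb        # whole rounds of dealing
    rem = nblocks % nprocs                  # leftover full blocks
    if iproc < rem:
        count += nb                         # gets one extra full block
    elif iproc == rem:
        count += extra                      # gets the partial tail block
    return count

n, nb, nprocs = 48384, 64, 48               # matrix rank, block size, ranks
local = [numroc(n, nb, p, nprocs) for p in range(nprocs)]
print(local[0], local[-1])                  # near-equal local row counts
```

Every rank thus holds a similar-sized slice of AMAT, which is why distributing the write, rather than collecting EIGVECT, keeps the per-rank memory flat.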


xiaoming_wang
Jr. Member
Posts: 58
Joined: Tue Nov 12, 2019 4:34 am

Re: BSE memory estimate

#9 Post by xiaoming_wang » Wed Oct 23, 2024 2:28 pm

You can check R_{nm} in Eq. 1 of arXiv:2310.09674 and mu_{lambda_i, lambda} in Eq. 16 of PRB 107, 205203.
I'm not trying to speed up the IO; I just want to reduce the memory overhead. If increasing the number of nodes reduces the memory usage on each node thanks to the distribution of AMAT, that is fine for me.
By the way, the patch works.

