ZPOTRF, Sub-Space-Matrix not Hermitian errors occur for large systems using MPI, but not serial

Posted: Mon Dec 17, 2012 7:49 pm
by dhfphysics
Hello All,
It has long been known in our little group that we cannot run certain large VASP problems under MPI (Open MPI) because of the errors mentioned in the title. I recently ran tests on a different Linux (Intel Xeon) cluster from the one we usually use to see whether things are different, but the same behavior occurs. Below is a description of the problem, with one of the INCARs and a POSCAR. The atoms have reasonable starting bond lengths, and I am using the lapack_double supplied with VASP along with Intel MKL BLAS. (The same problem occurs using Intel MKL's LAPACK.)

-VASP version 5.3.2 (several older versions have shown same type of problem)
-Intel ifort 10.1 compiler (newer compiler on our usual cluster which displays the same type of problem)
-openmpi-1.4.3
-intel mkl 10.0.010 for BLAS (BLAS = -lmkl_em64t -lguide -lpthread; we also have a much newer mkl version on our usual cluster)
-optimization flags: -xT -O2 -ip

The ZPOTRF error often occurs before the results of the first non-self-consistent DAV step have been written to OSZICAR. When the INCAR is simplified and ENCUT, AMIX, and AMIN are lowered, thousands of Sub-Space-Matrix not Hermitian errors appear instead.
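
For readers unfamiliar with the first error: ZPOTRF is the LAPACK routine for the Cholesky factorization of a complex Hermitian positive definite matrix, and it reports failure when a leading minor is not positive definite. The following is only a minimal pure-Python sketch of that failure condition on toy 2x2 matrices, not VASP or LAPACK code; the names `is_hermitian` and `cholesky_lower` are made up for illustration.

```python
def is_hermitian(a, tol=1e-10):
    """True if a[i][j] equals the complex conjugate of a[j][i] within tol."""
    n = len(a)
    return all(
        abs(complex(a[i][j]) - complex(a[j][i]).conjugate()) <= tol
        for i in range(n)
        for j in range(n)
    )


def cholesky_lower(a):
    """Return lower-triangular L with a = L L^H (hypothetical helper).

    Raises ValueError when a pivot is not positive, which is the
    condition under which a Cholesky routine reports failure.
    """
    n = len(a)
    low = [[0j] * n for _ in range(n)]
    for j in range(n):
        # Diagonal pivot: a[j][j] minus the squared norms already used in row j.
        pivot = complex(a[j][j]).real - sum(abs(low[j][k]) ** 2 for k in range(j))
        if pivot <= 0.0:
            raise ValueError(f"leading minor of order {j + 1} is not positive definite")
        low[j][j] = complex(pivot ** 0.5)
        for i in range(j + 1, n):
            s = complex(a[i][j]) - sum(
                low[i][k] * low[j][k].conjugate() for k in range(j)
            )
            low[i][j] = s / low[j][j]
    return low


good = [[2, 1j], [-1j, 2]]       # Hermitian and positive definite
indefinite = [[1, 2], [2, 1]]    # Hermitian, but eigenvalues are 3 and -1
skewed = [[1, 1j], [1j, 1]]      # off-diagonal entries are not conjugates

print(is_hermitian(good), is_hermitian(skewed))
cholesky_lower(good)             # factorizes without error
try:
    cholesky_lower(indefinite)
except ValueError as err:
    print("factorization failed:", err)
```

The point of the toy example is only that a matrix which has lost its Hermitian or positive definite character, for whatever reason, is enough to make a Cholesky step fail; it says nothing about why that happens in the MPI build.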

The following INCAR is perhaps the simplest of dozens that I have tried in an attempt to circumvent the problem. It produces many Sub-Space-Matrix errors. Variants of AMIX, AMIN, ISMEAR, PREC, LASPH, LDAU (ultimately we need this), ALGO, NPAR, KPAR, ENCUT, and EDIFF have been tried, as well as dynamic library compilation vs. "-static -static-intel". In general, simpler INCARs with slower mixing cause Sub-Space errors, while more accurate or default-mixing INCARs exit with ZPOTRF errors. (I have been surprised by a few exceptions to this rule, with ZPOTRF errors on gentle INCARs.)
All tests start from random wavefunctions and initially run the non-self-consistent Davidson electronic minimization method. Thus setting ICHARG=12 does not change anything.
The serial version works; the MPI version always fails. A supercell with fewer bulk unit cells in each slab (96 atoms, I believe) has no problem running in MPI, while this cell has 144 atoms. The following was tested on a single computer with 2 quad-core processors, asking mpirun to use 8 processors and setting OMP_NUM_THREADS=1.

POTCAR: PAW_PBEs, default versions for Si, Zn, S

KPOINTS: Gamma centered: 6x6x1

INCAR:
SYSTEM = Si-ZnS
LWAVE=.FALSE.
LCHARG=.FALSE.
LVHAR=.FALSE.

ALGO = Normal
NSIM = 1
NPAR = 1
LREAL=Auto
#LASPH=.TRUE.
NELMDL=5
AMIX=0.04
AMIN=0.01

ENCUT = 280
PREC = normal

ISMEAR = 0
SIGMA = 0.03

EDIFF = 1e-5
EDIFFG = 1e-4
NSW = 0
IBRION = 2
ISIF = 1


POSCAR:
Si-ZnS test
5.468586
-1.0000000000000000 1.0000000000000000 0.0000000000000000
-1.0000000000000000 0.0000000000000000 1.0000000000000000
5.9443739999999989 5.9443739999999989 5.9443739999999989
Si S Zn
73 35 36
direct
0.3332701161042725 0.3334487121844025 0.0210358466499900
0.3333840668170857 -0.1667096936641326 0.0210216381980764
-0.1666185198799777 0.3332100618893786 0.0210218006941743
-0.1666817656802643 -0.1667567467642730 0.0210212118632949
-0.0000258056240640 0.0001347657523997 0.0350501862075766
-0.0000383021252124 0.4999523286191414 0.0350293281284112
-0.4997623162555701 -0.0001165234802117 0.0350370839237600
0.4999488340185663 -0.4999935350031234 0.0350405407763392
-0.0000953161261905 0.0001764813133159 0.0771043042690625
0.0000662100776604 -0.4999251321172099 0.0771123916639672
-0.4998903733395471 0.0000328945359120 0.0771011117058084
0.4999233615794916 -0.4998685875211670 0.0770939201916907
0.1666652893977039 0.1667685091423266 0.0911176886953862
0.1666567813998168 -0.3332765110218207 0.0911398642276121
-0.3332656159246743 0.1666198006608161 0.0911179003399126
-0.3334375498116086 -0.3332946888651714 0.0911147049424962
0.1665574557609579 0.1668111286560735 0.1331759288807631
0.1665292535687981 -0.3332520274838268 0.1331960830620760
-0.3332954851018329 0.1667543408928532 0.1331831549220725
-0.3332027983341149 -0.3333388473591442 0.1331844165842326
0.3332193312641562 0.3332779176448344 0.1471929305105706
0.3335277560113273 -0.1667230075731006 0.1471939686422086
-0.1666850365698827 0.3331966643467916 0.1471905741570010
-0.1668475649938693 -0.1665326231515160 0.1472011487344980
0.3333438745583680 0.3333291655932813 0.1892505599998482
0.3332362718045342 -0.1667173620617851 0.1892531628679436
-0.1666742151194670 0.3334405961571077 0.1892417716193337
-0.1667696033616081 -0.1664634015415564 0.1892454029147715
-0.0001089232847485 0.0000734215850406 0.2032666328455413
-0.0000448767171000 -0.4999183649907142 0.2032581593669197
-0.4999230747060615 0.0000553582085602 0.2032844189849891
-0.4998618838456362 0.4998619462415974 0.2032733670028024
-0.0000585389747536 -0.0000643573481173 0.2453220661565005
-0.0000231395845980 -0.4999714037944525 0.2453240769991303
0.4999608357224623 -0.0000497446587071 0.2453323888747364
0.4999315455992220 0.4999971029466527 0.2453252958540946
0.1666523768841424 0.1668105404802223 0.2593511413702672
0.1667665026237506 -0.3333924892834208 0.2593354588389654
-0.3333058263943880 0.1666222593481801 0.2593693976870021
-0.3333352710866327 -0.3333653020509709 0.2593267051876815
0.1665794890699174 0.1666073266049741 0.3013973817285648
0.1665834775297847 -0.3333995495892781 0.3014003575645577
-0.3332774613423672 0.1667357424302440 0.3014178887246128
-0.3335057376473508 -0.3332846395737344 0.3014129307014829
0.3333880542504090 0.3333251466523761 0.3154049207558848
0.3334424638162009 -0.1666397402181601 0.3154254632341582
-0.1666598465872767 0.3334627659744798 0.3154306098244862
-0.1667426023197670 -0.1667231579003317 0.3154124729993562
0.3334610853149438 0.3332026320530412 0.3574729648213454
0.3331970253866918 -0.1665537869048597 0.3574897623610454
-0.1666337259490237 0.3332506677851330 0.3574669275081340
-0.1665513814977407 -0.1668636432340909 0.3574838250862336
-0.0000831605305952 0.0000203826945777 0.3714958664570466
0.0001142952116475 0.4999426799231557 0.3715030867127696
-0.4999772331268719 -0.0000765156282789 0.3715136153441687
0.4998488818037131 0.4999741738348280 0.3714956340929624
0.0000666925842563 -0.0000441592546949 0.4135523853538227
0.0001084549512166 -0.4999652050338592 0.4135582606454121
-0.4999372637806231 -0.0001098015479439 0.4135694531829381
-0.4998886248842307 0.4998957607170313 0.4135522753683675
0.1668348917403356 0.1665541567986697 0.4275648278093045
0.1665294468356562 -0.3333352200875113 0.4275794760998223
-0.3333018989891179 0.1667711633087894 0.4275797965278231
-0.3335041938679780 -0.3333109435164714 0.4275757712579036
0.1667055874227011 0.1667844475422116 0.4696274452245786
0.1666303195999216 -0.3333844315893739 0.4696466524027317
-0.3333245598600232 0.1667053558077265 0.4696427963461762
-0.3334082817221474 -0.3334084981634393 0.4696234398212198
0.3333992600200615 0.3333291963913205 0.4836586922146967
0.3333680377132716 -0.1666194926997007 0.4836671987861440
-0.1667614352982923 0.3332547261288292 0.4836502704527834
-0.1666358992122718 -0.1668168867401114 0.4836500157863499
-0.1668381663838797 -0.1665721338500614 -0.4745090751937238
0.3334532912664145 0.3332790797279082 -0.4745300068701603
0.3334055987506374 -0.1667407106353387 -0.4745341582648517
-0.1667502219484245 0.3333209844134142 -0.4745207495929998
-0.0000112987142732 -0.0000227739527203 -0.4194901699352881
-0.0000721792617862 -0.4998627707737962 -0.4194936039155318
-0.4998669936539681 -0.0001626191087352 -0.4194820098262425
-0.4999141158442429 0.4999193844672645 -0.4194711729992875
0.1667171940574214 0.1666120966390137 -0.3644642711067068
0.1667504589640494 -0.3332053847838745 -0.3644438001753422
-0.3332206299601710 0.1665700871566402 -0.3644476646886273
-0.3332763281951785 -0.3333323702608670 -0.3644287699071860
0.3332236133180079 0.3333143082985517 -0.3094256650915417
0.3335237307440807 -0.1667363284439205 -0.3094183944232640
-0.1668712782460607 0.3334452144943940 -0.3094061514223238
-0.1667421791196450 -0.1666649141707319 -0.3094107092229308
0.0000392737315814 -0.0000729000673114 -0.2543661751667987
0.0000877981431496 0.4999891561819209 -0.2543780514392033
0.4999311334606509 -0.0000614114556772 -0.2543930072746277
-0.4999221398914857 0.4999716940087414 -0.2543813002756039
0.1666345715966626 0.1668121623301703 -0.1993527476409120
0.1667211837664618 -0.3333439743995965 -0.1993369236102139
-0.3334284986483446 0.1666845692499294 -0.1993518138330508
-0.3333364065442332 -0.3332678759885740 -0.1993464409485898
0.3332007169904655 0.3334318354040041 -0.1443090490903434
0.3333706018066668 -0.1666569569890427 -0.1442948026992679
-0.1665352585122272 0.3331861808934581 -0.1442992774475660
-0.1667306519494440 -0.1666340481512335 -0.1443085133537441
-0.0000857649025797 0.0000034890829546 -0.0892731168159685
0.0000100426369947 -0.4999844870701367 -0.0892613734642091
-0.4999545002653688 -0.0000676829355163 -0.0892743718044648
-0.4999065467193972 0.4999787416073116 -0.0892656252184963
0.1666519364827267 0.1665998696335678 -0.0342502968860840
0.1666437101160805 -0.3333262332409225 -0.0342594382638962
-0.3332937860019511 0.1666675022349230 -0.0342277071350919
-0.3334054147761223 -0.3333625334215875 -0.0342286433070994
0.0000101828226405 0.0000790856298458 -0.4610767043287973
0.0000462303468558 -0.4999691659455030 -0.4610764967455639
-0.4999526783464271 -0.0000041730853240 -0.4610754131825739
0.4999440099983854 -0.4999804396968608 -0.4610818526085538
0.1668091007962695 0.1665332825967287 -0.4060533308546262
0.1666442899203099 -0.3333353281192306 -0.4060338950061567
-0.3333302069344450 0.1666847352603549 -0.4060760603763299
-0.3333283567088552 -0.3332272011705013 -0.4060448373663480
0.3332695241498024 0.3333387375149279 -0.3509996664451893
0.3332734726397492 -0.1667240169987021 -0.3510225205936363
-0.1665269479656771 0.3332648032195944 -0.3510181840579200
-0.1665275399897213 -0.1666222012565122 -0.3510101134726178
-0.0000828600109596 -0.0000800598807327 -0.2959837692387465
0.0000296313156918 -0.4999707681729446 -0.2959674549440757
-0.4999282506171521 0.0000635263703810 -0.2959797622767003
-0.4999861955373690 -0.4998541000444462 -0.2959729556425186
0.1666389916459226 0.1666377914938493 -0.2409632374322218
0.1668050475629942 -0.3334049840165683 -0.2409440389841894
-0.3334121188254038 0.1666483973638052 -0.2409346036755921
-0.3334725732040010 -0.3332268964403695 -0.2409430574617726
0.3332921030975213 0.3334841783026276 -0.1859135093329313
0.3333481694034060 -0.1668347289231257 -0.1859072244158739
-0.1666837213840007 0.3334216927226507 -0.1859259855520721
-0.1666960416652966 -0.1666521514043575 -0.1859323229377019
0.0000108468905068 0.0000992618913873 -0.1308646862689997
-0.0000594799798549 0.4999751879400238 -0.1308700202053212
-0.4999881038585041 0.0000039856030325 -0.1308742288970177
-0.4999238082943203 0.4999318515899318 -0.1308805054945497
0.1666827000619334 0.1666951570805720 -0.0758581153802631
0.1667917787174852 -0.3332532218653256 -0.0758318469102036
-0.3333210930869563 0.1667585365110482 -0.0758483483244968
-0.3332766463649418 -0.3332342650979203 -0.0758293677918450
0.3333041812007727 0.3332313672856156 -0.0207962739802632
0.3333380245791318 -0.1667332373286115 -0.0208022946356239
-0.1665698509872580 0.3332422633996034 -0.0208051328868731
-0.1664969613970304 -0.1667691587511596 -0.0208046665733718
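
When copying a POSCAR of this size into a post, it is easy to lose or duplicate a row, so a quick consistency check between the header counts and the coordinate rows can be useful. The following is a minimal sketch, assuming the VASP 5 layout used above (species names on line 6, counts on line 7); `count_atoms` is a hypothetical helper, not part of VASP, and the two-atom sample cell is invented for the demonstration.

```python
def count_atoms(poscar_text):
    """Return (atoms declared in the header, coordinate rows found)."""
    lines = poscar_text.splitlines()
    # Lines 0-4: comment, scaling factor, three lattice vectors.
    # Line 5: species names; line 6: atoms per species (VASP 5 layout assumed).
    counts = [int(tok) for tok in lines[6].split()]
    row = 7
    if lines[row].strip().lower().startswith("s"):
        row += 1                      # optional "Selective dynamics" line
    row += 1                          # skip the Direct/Cartesian line
    found = 0
    for line in lines[row:]:
        fields = line.split()
        if len(fields) < 3:
            break                     # a blank or short line ends the coordinate list
        try:
            [float(x) for x in fields[:3]]
        except ValueError:
            break                     # non-numeric row: stop counting
        found += 1
    return sum(counts), found


# Invented two-atom cell, used only to exercise the helper.
sample = """demo cell
1.0
1.0 0.0 0.0
0.0 1.0 0.0
0.0 0.0 1.0
Si S
1 1
direct
0.0 0.0 0.0
0.5 0.5 0.5
"""

declared, found = count_atoms(sample)
print("declared:", declared, "rows found:", found)
```

To use it on a real file, read the file contents into a string and pass that string to `count_atoms`; the two returned numbers should agree.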

I can provide more info on the compiler, Open MPI, etc., if needed.
Thanks for your help.
David


Posted: Mon Dec 17, 2012 7:56 pm
by dhfphysics
Note: A single computer may not have enough memory (8 GB) to run this computation to completion, but a serial test shows that it can successfully complete non-self-consistent DAV cycles using about 6 GB. I have also tested similar INCARs running MPI on 2 and 4 machines (with NPAR set to 2 and 4, respectively), and the same errors occur.

It would be very helpful if someone with a Linux cluster could run this on their system. If it works, perhaps we could talk by email about compiler/MPI2 details.

Thanks,
fosterd@physics.oregonstate.edu
[ Edited Tue Dec 18 2012, 09:55PM ]