fnal geant4 performance group issues and progress daniel elvira for m. fischler, j. kowalkowski, m....

FNAL Geant4 Performance Group

Issues and Progress

Daniel Elvira for

M. Fischler, J. Kowalkowski, M. Paterno

G4 Performance Work @FNAL

The LHC experiments are the main G4 customers within HEP, and FNAL (and the FNAL/G4 team) is involved in CMS. It therefore makes sense for the FNAL/G4 performance group to use (primarily) the CMS simulation application.

For the work presented here: G4.8.1p01, QGSP/QGSP_Bertini (CMSSW180), Z’(dijets).

The aim is to profile Geant4 code, identify places for improvement, design and implement improvements, and feed those improvements back to Geant4.

We use a tool of our own design, SimpleProfiler, to collect detailed call stack samples. We also use our own tools to manage, analyze, and present the data. Work is in progress to make these tools available to the public.

CHIPS performance analysis

CMS reported that a large number of small memory allocations were causing memory fragmentation, which was a concern for them.

An average t-tbar event in the CMS simulation created and destroyed about 1.6 million G4QHadron objects.

G4QHadron constructors took >1% of program time; odd for a constructor.

Constructors of the derived class G4QNucleus took ~2% of program time.

CHIPS code modification & result

The problem was one data member:

std::vector<G4double>* Tb;

There was no need for this vector to be on the heap; change the data member to:

std::vector<G4double> Tb;

The result is a 1.5% speed improvement in the whole CMS simulation.
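The difference between the two declarations can be sketched as follows. This is a minimal illustration, not the CHIPS code: `HadronHeapMember` and `HadronValueMember` are hypothetical stand-ins for G4QHadron, `G4double` is aliased to `double`, and the counter tracks only the explicit `new` of the vector object itself.

```cpp
#include <cstddef>
#include <vector>

using G4double = double;  // stand-in alias for this sketch

// Counts explicit heap allocations of the vector object (illustration only).
static std::size_t g_vector_news = 0;

// Before: the vector object itself lives on the heap, so every construction
// pays for one `new` and every destruction for one `delete`.
struct HadronHeapMember {
    std::vector<G4double>* Tb;
    HadronHeapMember() : Tb(new std::vector<G4double>()) { ++g_vector_news; }
    ~HadronHeapMember() { delete Tb; }
    HadronHeapMember(const HadronHeapMember&) = delete;
    HadronHeapMember& operator=(const HadronHeapMember&) = delete;
};

// After: the vector is a direct member. Constructing the object needs no
// separate allocation for the vector object; element storage is requested
// only if and when elements are added.
struct HadronValueMember {
    std::vector<G4double> Tb;
};

// Create and destroy `n` objects of type T and return how many explicit
// vector allocations that caused.
template <typename T>
std::size_t explicit_allocations(std::size_t n) {
    g_vector_news = 0;
    for (std::size_t i = 0; i < n; ++i) {
        T hadron;      // constructed here ...
        (void)hadron;  // ... and destroyed at the end of each iteration
    }
    return g_vector_news;
}
```

With about 1.6 million G4QHadron objects created and destroyed per event, the pointer layout implies on the order of 1.6 million extra allocation/deallocation pairs per event for this member alone.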

Bertini Performance Analysis

2% (median) of program time is spent in the G4ElementaryParticleCollider constructor.

These large spreads will be discussed later.

Figure: Partial profiling result for 100 jobs of 100 events each. Overall percentage of time taken in functions (no children).
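A median and a job-to-job spread of the kind quoted on this slide can be computed as in the sketch below. The function names are illustrative assumptions, and the group's actual analysis tools are not shown in this talk.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Median of per-job percentages. Takes a copy so the caller's data is untouched.
double median(std::vector<double> v) {
    if (v.empty()) throw std::invalid_argument("median of empty sample");
    std::sort(v.begin(), v.end());
    const std::size_t mid = v.size() / 2;
    return (v.size() % 2 == 1) ? v[mid] : 0.5 * (v[mid - 1] + v[mid]);
}

// Smallest and largest value: a crude measure of the spread across jobs.
std::pair<double, double> min_max(const std::vector<double>& v) {
    if (v.empty()) throw std::invalid_argument("min_max of empty sample");
    const auto mm = std::minmax_element(v.begin(), v.end());
    return {*mm.first, *mm.second};
}
```

The median is preferred over the mean here because a few unusually slow or fast jobs would otherwise dominate the summary.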

Bertini code modification

We reorganized G4ElementaryParticleCollider.

Removed 20 of 21 data members that did not need to be part of its state (they were equivalent for all instances of the class).

Much of the redundant code (across the 20 data members) was replaced by a few templates.

The result was a reduction in source code bulk of about 40 printed pages.
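The kind of reorganization described on this slide can be sketched as follows. This is not the actual G4ElementaryParticleCollider code: `shared_table`, `ColliderSketch`, and the table contents are hypothetical, chosen only to show data that is identical for all instances being moved out of per-object state and generated by a single template.

```cpp
#include <array>
#include <cstddef>

// One immutable table per (N, Scale) combination, built once on first use and
// shared by every caller. The fill rule is a placeholder for real physics data.
template <std::size_t N, int Scale>
const std::array<double, N>& shared_table() {
    static const std::array<double, N> table = [] {
        std::array<double, N> t{};
        for (std::size_t i = 0; i < N; ++i) {
            t[i] = Scale * static_cast<double>(i);
        }
        return t;
    }();
    return table;
}

// Hypothetical collider: the tables are no longer data members, so
// constructing an instance does not rebuild or copy them.
class ColliderSketch {
public:
    double two_body(std::size_t i) const { return shared_table<6, 2>()[i]; }
    double three_body(std::size_t i) const { return shared_table<8, 3>()[i]; }
    const void* two_body_address() const { return &shared_table<6, 2>(); }
    int verbose_level() const { return verbose_level_; }

private:
    int verbose_level_ = 0;  // stands in for the one member that is real per-instance state
};
```

Every `ColliderSketch` sees the same table objects, so the per-object footprint shrinks to the single remaining member and construction cost no longer depends on the table sizes.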

Bertini performance improvement

Performance increase: ~4%.

Note that the increase in speed is greater than the obvious expectation from the initial profiling.

The reduction in object allocation / deletion benefits other classes.

Note: we have a large enough data sample to accurately characterize the differences.

Irregularity in random number use

Recall that the Bertini performance analysis showed an unusually large spread in function speed across runs of the same job. We took this to indicate that there might be a reproducibility problem. We decided to investigate this further.

It looks as though there are two distinct groups of measurements.

First observation of irreproducibility

91 runs, each represented by a line on the plot.

Each line shows the time taken per event.

We expect all lines to form one “cable”.

Observed that after event 38 the jobs separate into two branches, each of which appears to process the events differently.

What is the cause of this?

The effect was highly reproducible for a long period of time. Unfortunately, it disappeared.

Apparently, there were changes to the cluster.

Architecture dependence in output

Excerpt from a 4 GB Geant4 log file (at line 1,093,802) showing a subtle difference in physics output on different platforms. Discovered while trying to understand the irreproducibility problem.

AMD:
507 -598 -482 -2.75e+03 0 0.0962 0.282 2.92e+03 TECBackDisk KaonPlusInelastic
:----- List of 2ndaries -#SpawnInStep= 10
: -598 -482 -2.75e+03 4.39e+03 pi+
: -598 -482 -2.75e+03 362 pi-
: -598 -482 -2.75e+03 93.9 pi+
: -598 -482 -2.75e+03 1.43e+03 pi-
: -598 -482 -2.75e+03 654 proton
: -598 -482 -2.75e+03 3.78e+03 pi-
: -598 -482 -2.75e+03 3.14e+03 kaon+
: -598 -482 -2.75e+03 5.68 gamma
: -598 -482 -2.75e+03 2 gamma
: -598 -482 -2.75e+03 0.835 C11[0.0]

INTEL:
507 -598 -482 -2.75e+03 0 0.0962 0.282 2.92e+03 TECBackDisk KaonPlusInelastic
:----- List of 2ndaries -#SpawnInStep= 9
: -598 -482 -2.75e+03 4.39e+03 pi+
: -598 -482 -2.75e+03 362 pi-
: -598 -482 -2.75e+03 93.9 pi+
: -598 -482 -2.75e+03 1.43e+03 pi-
: -598 -482 -2.75e+03 654 proton
: -598 -482 -2.75e+03 3.78e+03 pi-
: -598 -482 -2.75e+03 3.14e+03 kaon+
: -598 -482 -2.75e+03 7.68 gamma
: -598 -482 -2.75e+03 0.854 C11[0.0]
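Locating the first discrepancy between two multi-gigabyte logs can be done with a streaming, line-by-line comparison. A minimal sketch, assuming two plain-text logs; `first_difference` is a hypothetical helper, not one of the group's tools.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Returns the 1-based number of the first line at which the two streams
// differ (including one stream ending before the other), or 0 if they are
// identical. Only one line per stream is held in memory at a time.
std::size_t first_difference(std::istream& a, std::istream& b) {
    std::string line_a;
    std::string line_b;
    std::size_t line_number = 0;
    while (true) {
        const bool got_a = static_cast<bool>(std::getline(a, line_a));
        const bool got_b = static_cast<bool>(std::getline(b, line_b));
        ++line_number;
        if (!got_a && !got_b) return 0;  // both ended together: no difference
        if (got_a != got_b || line_a != line_b) return line_number;
    }
}
```

In practice the two arguments would be `std::ifstream` objects opened on the AMD and Intel log files; `std::istringstream` is convenient for small checks.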

Random number usage issue

We started looking at the random number generators as the cause of the physics differences.

The random number streams are identical on AMD and Intel architectures.

A different number of random numbers is drawn from the generator depending on the architecture, observable on the first event.

This result is completely reproducible.

AMD = 59,230,872 random numbers
Intel = 59,511,723 random numbers
(difference of 280,851)

We are still investigating the cause.
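The talk does not say how the draws were counted, and the cause of the discrepancy was still under investigation, so the following is only an illustration of the idea, not a diagnosis. It shows a counting wrapper around a generator and a toy loop in which the same seed yields the same stream while a branch decision changes how many numbers are consumed. `CountingEngine` and `draws_needed` are hypothetical names, and `std::mt19937` is a stand-in generator.

```cpp
#include <cstdint>
#include <random>

// Hypothetical wrapper: forwards to an underlying engine and counts every draw.
class CountingEngine {
public:
    explicit CountingEngine(std::uint32_t seed) : engine_(seed) {}

    // Uniform value in [0, 1); mt19937 produces 32-bit unsigned values.
    double flat() {
        ++draws_;
        return static_cast<double>(engine_()) / 4294967296.0;
    }

    std::uint64_t draws() const { return draws_; }

private:
    std::mt19937 engine_;
    std::uint64_t draws_ = 0;
};

// Toy consumer: rejection sampling that keeps drawing until `n_accept` values
// fall below `cut` (which must be greater than 0). The number of draws depends
// on the outcome of each comparison, not only on the seed.
std::uint64_t draws_needed(std::uint32_t seed, double cut, int n_accept) {
    CountingEngine rng(seed);
    int accepted = 0;
    while (accepted < n_accept) {
        if (rng.flat() < cut) ++accepted;
    }
    return rng.draws();
}
```

Comparing the counter at the end of each event on two platforms narrows down where the draw counts first diverge, even when the streams themselves are identical.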

Plans for the future

What we had in mind in the short term (rest of 2008):

1- Perform a major design review for the CHIPS library.
2- Investigate the Intel vs. AMD dependence of the simulation output.
3- Return to profiling, resuming with the newest version of Geant4.
4- Continue work with two interns from Northern Illinois University, who are helping to improve our data collation, analysis, and display tools; the tools will be made public once they are sufficiently robust.

But… Fermilab management has temporarily pulled J. Kowalkowski and M. Paterno out of G4 efforts (for at least 2-3 months, effective ~Sep 15th).

M. Fischler + 1 FTE (computer scientist) will undertake (1) shortly, with minimum guidance from M. Paterno.

Plans for the future

In the longer term (2009):

M. Fischler, J. Kowalkowski, and M. Paterno will resume (2) and (3).

The FNAL/G4 team has interest in efforts to support multi-core/multi-threaded programming: make code thread safe, create code that scales well with multiple CPUs.

Manpower is an issue. During this workshop we should discuss with Gabriele/John a wish list for 2009. A well-defined long-term program would help allocate FNAL resources to G4.
