Machine learning models to generate top events


Page 1: Machine learning models to generate top events

Machine learning models to generate top events

Anja Butter

ITP, Universität Heidelberg

arXiv:1907.03764, 1912.08824, 1912.00477, 2006.06685, and 2008.06545

with Armand Rousselot, Marco Bellagente, Sascha Diefenbacher, Gregor Kasieczka, Benjamin Nachman, Tilman Plehn, and Ramon Winterhalder

Anja Butter, Top 2020

Page 2: Machine learning models to generate top events

The HEP trinity

Theory

Fundamental Lagrangian

• Perturbative QFT

Standard Model vs. new physics

• Matrix elements, loop integrals

Experiment

Complex detector

• ATLAS, CMS, LHCb, ALICE, ...

Reconstruction of individual events

• Big data: jet images, tracks, ...

Precision simulations

First-principle Monte Carlo generators

• Simulation of parton/particle-level events

• Herwig, Pythia, Sherpa, MadGraph, ...

Detector simulation

• Geant4, PGS, Delphes, ...

⇒ Unweighted event samples


Page 3: Machine learning models to generate top events

Neural networks for precision simulations

Problems in MC simulations

• Event generation:
  • High-dimensional phase space
  • Low unweighting efficiency
  • Higher orders: exponential growth in computing time

• Highly complex full detector simulation → very slow

• HL-LHC: factor 25 in data → need higher precision

• Limited resources: precision vs. computing time

Advantages of neural networks

• Flexible parametrisation

• Interpolation properties

• Fast evaluation


Page 4: Machine learning models to generate top events

Possibilities for ML in event generation

Monte Carlo integration

• Estimating matrix element

• Neural importance sampling

[1707.00028] Bendavid, Regression & GAN

[1810.11509] Klimek and Perelstein

[1912.11055] Bishara and Montull Regression

[2001.05478] Bothmann et al. NF

[2001.05486, 2001.10028] Gao et al. NF

[2002.07516] Badger and Bullock Regression

Event generation

• Generating 4-momenta

• Z → ll, pp → jj, pp → tt̄ + decay

[1901.00875] Otten et al. VAE & GAN

[1901.05282] Hashemi et al. GAN

[1903.02433] Di Sipio et al. GAN

[1903.02556] Lin et al. GAN

[1907.03764, 1912.08824] Butter et al. GAN

[1912.02748] Martinez et al. GAN

[2001.11103] Alanazi et al. GAN

Detector simulation

• Jet images

• Fast shower simulation in calorimeters

[1701.05927] de Oliveira et al. GAN

[1705.02355, 1712.10321] Paganini et al. GAN

[1802.03325, 1807.01954] Erdmann et al. GAN

[1805.00850] Musella et al. GAN

[ATL-SOFT-PUB-2018-001, ATL-SOFT-PROC-2019-007] ATLAS VAE & GAN

[1909.01359] Carazza and Dreyer GAN

[2005.05334] Buhmann et al. VAE

Unfolding

• Detector to parton/particle level distributions

[1806.00433] Datta et al. GAN

[1911.09107] Andreassen et al.

[1912.00477] Bellagente et al. GAN

[2006.06685] Bellagente et al. NF

NO claim to completeness! Review: Generative Networks for LHC events [2008.08558]


Page 5: Machine learning models to generate top events

Boosting standard event generation...

1. Generate phase space points

2. Calculate event weight

$$w_\text{event} = f(x_1, Q^2)\, f(x_2, Q^2) \times M(x_1, x_2, p_1, \dots, p_n) \times J(p_i(r))^{-1}$$

3. Unweighting via importance sampling → optimal for w ≈ 1 (see the sketch below)
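A minimal hit-or-miss sketch of this three-step loop; `event_weight` is a hypothetical stand-in for the full weight above, not a real matrix element:

```python
import numpy as np

rng = np.random.default_rng(42)

def event_weight(r):
    # Hypothetical stand-in for f(x1,Q^2) f(x2,Q^2) * M * J^-1
    # evaluated at the phase-space point mapped from r.
    return np.exp(-np.sum(r**2, axis=-1))

# 1. generate phase-space points from random numbers in [0,1]^3
points = rng.random((100_000, 3))

# 2. calculate event weights
weights = event_weight(points)

# 3. unweight by hit-or-miss: accept with probability w / w_max.
#    The efficiency is <w>/w_max, hence optimal when all w ≈ 1.
w_max = weights.max()
keep = rng.random(weights.size) < weights / w_max
unweighted = points[keep]
print(f"unweighting efficiency: {keep.mean():.3f}")
```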


Page 6: Machine learning models to generate top events

Boosting standard event generation...

$$w_\text{event} = \underbrace{f(x_1, Q^2)\, f(x_2, Q^2)}_{\text{PDF}} \times \underbrace{M(x_1, x_2, p_1, \dots, p_n)}_{\text{matrix element}} \times \underbrace{J(p_i(r))^{-1}}_{\text{phase space mapping}}$$


Page 7: Machine learning models to generate top events

Boosting standard event generation...


PDF: NNPDF since 2002(!)
- genetic algorithm
- n3fit: deterministic gradient descent
- S. Carrazza, J. Cruz-Martinez [1907.05075]



Page 8: Machine learning models to generate top events

Boosting standard event generation...

[Figure 5 of 2002.07516: comparison of a single neural network (left) vs. the ensemble approach (right) in estimating the differential cross-section against y, where y is the minimum y_ij as ordered by p_T; data normalised to the maximum N_jet bin value; uncertainty bands denote 1 s.d. over 20 trained models (red and green) and the Monte Carlo error on the N_jet result (blue).]

Matrix element: regression network
- learn amplitude or K factor
- S. Badger, J. Bullock [2002.07516]
- J. Bendavid [1707.00028]



Page 9: Machine learning models to generate top events

Boosting standard event generation...


[Figure 4 of 2001.05478: event weight distributions for sampling the total cross section of gg → n jets at √s = 1 TeV with N = 10⁶ points, comparing Uniform, VEGAS, and NN-based sampling for (a) 3-jet and (b) 4-jet production; the inset in (b) shows the peak region on a linear scale.]

Phase space mapping: learn efficient mapping (→ w ≈ 1)
- normalizing flow
- Gao et al. [2001.10028]
- Bothmann et al. [2001.05478]
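To make the idea concrete, here is a toy version of importance sampling through an invertible map with known Jacobian; a fixed analytic map stands in for the trained normalizing flow, and `f`, `q`, `g` are hypothetical one-dimensional densities, not the actual matrix element or flow:

```python
import numpy as np

def g(z):
    return np.sqrt(z)       # invertible map z -> x, inverse CDF of q

def q(x):
    return 2.0 * x          # density implied by the map g

def f(x):
    return 3.0 * x**2       # toy "matrix element" density to integrate

rng = np.random.default_rng(0)
z = rng.random(1_000_000)   # latent points
x = g(z)                    # mapped phase-space points
w = f(x) / q(x)             # event weights; a perfect map gives w = 1
print(f"<w> = {w.mean():.4f}, std(w) = {w.std():.4f}")
```

Training the flow pushes q towards f, so std(w) shrinks and the unweighting efficiency rises.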


Page 10: Machine learning models to generate top events

... or train generative network directly on events

Early stage of research. Possible gains:

• Generate more statistics

• Use network to sample phase space (similar to NF)

• Ship model instead of samples (profit from fast evaluation of NN)

• Use conditional generative network to interpolate between measured samples

• Use invertible architectures for unfolding

• Replace fast detector simulations

...


Page 11: Machine learning models to generate top events

A generative model

• Generative Adversarial Networks (GAN)

• Data: true events {x_T} vs. generated events {x_G}

• Discriminator distinguishes {x_T}, {x_G} [D(x_T) → 1, D(x_G) → 0]

$$L_D = \bigl\langle -\log D(x) \bigr\rangle_{x \sim P_T} + \bigl\langle -\log(1 - D(x)) \bigr\rangle_{x \sim P_G} \;\xrightarrow{\;D(x) \to 0.5\;}\; -2 \log 0.5$$

• Generator fools discriminator [D(xG ) → 1]

$$L_G = \bigl\langle -\log D(x) \bigr\rangle_{x \sim P_G}$$

⇒ New statistically independent samples

[Diagram: the generator maps random numbers {r} to events {x_G}; the discriminator compares {x_G} against MC data {x_T}, yielding the losses L_G and L_D.]
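A minimal PyTorch sketch of one training step with exactly these two losses; the toy data, architectures, and hyperparameters are assumptions, not the setup of the cited papers:

```python
import torch
import torch.nn as nn

dim, noise_dim = 2, 8
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, dim))
D = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)

def train_step(x_true):
    n = x_true.shape[0]
    # discriminator update: push D(x_T) -> 1 and D(x_G) -> 0
    x_gen = G(torch.randn(n, noise_dim)).detach()
    loss_D = -(torch.log(D(x_true)).mean() + torch.log(1 - D(x_gen)).mean())
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    # generator update: fool the discriminator, i.e. push D(x_G) -> 1
    x_gen = G(torch.randn(n, noise_dim))
    loss_G = -torch.log(D(x_gen)).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()

x_true = torch.randn(128, dim) * 0.5 + 1.0   # toy "MC events"
print(train_step(x_true))
```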


Page 12: Machine learning models to generate top events

Why GANs? Features, problems and solutions

+ Generate better samples than VAE

+ Large community working on GANs

- Unstable training

Solutions

• Regularization of the discriminator, e.g. gradient penalty [Ghosh, Butter et al., ...]

• Modified training objective:
  • Wasserstein GAN (incl. gradient penalty) [Lin et al., Erdmann et al., ...]

• Least square GAN (LSGAN) [Martinez et al., ...]

• MMD-GAN [Otten et al., ...]

• MSGAN [Datta et al., ...]

• Cycle GAN [Carazza et al., ...]

• Use of symmetries [Hashemi et al., ...]

• Whitening of data [Di Sipio et al., ...]

• Feature augmentation [Alanazi et al., ...]


Page 13: Machine learning models to generate top events

What is the statistical value of GANned events?

• Example: 1D camel function

• Compare sample vs. GAN vs. 5-parameter fit (mean, width, relative height)

• Evaluation on quantiles

$$\text{MSE} = \frac{1}{N_\text{quant}} \sum_{j=1}^{N_\text{quant}} \left( x_j - \frac{1}{N_\text{quant}} \right)^2$$

• Convergence to amplification factor 2.5 for 20 quantiles (quantile MSE sketched below)
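A sketch of one way to implement this quantile-based MSE; the camel shape, sample sizes, and quantile construction here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def camel(n):
    # 1D camel function: two equal-weight Gaussians
    left = rng.random(n) < 0.5
    return np.where(left, rng.normal(-2, 1, n), rng.normal(2, 1, n))

def quantile_mse(sample, edges):
    # x_j = fraction of the sample in quantile j; truth is 1/N_quant each
    n_quant = len(edges) - 1
    x = np.histogram(sample, bins=edges)[0] / len(sample)
    return np.mean((x - 1.0 / n_quant) ** 2)

n_quant = 20
truth = camel(1_000_000)                     # defines the quantile edges
edges = np.quantile(truth, np.linspace(0, 1, n_quant + 1))

print(quantile_mse(camel(100), edges))       # training-sized sample
print(quantile_mse(camel(10_000), edges))    # larger independent sample
```

Comparing the MSE of GANned samples against fresh true samples of varying size then yields the amplification factor quoted above.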


Page 14: Machine learning models to generate top events

What is the statistical value of GANned events?

• Example: Sphere in 5 dimensions

• Sum over 5-dimensional quantiles

• Amplification factor increases for sparse quantiles:
  • 3 for 3⁵ quantiles
  • 15 for 6⁵ quantiles


Page 15: Machine learning models to generate top events

How to GAN LHC events

Idea: generate hard process

• Realistic LHC final state tt̄ → 6 jets [1907.03764]

• 18-dim output [fixed external masses, no momentum conservation]

• Flat observables reproduced precisely

• Systematic undershoot in tails [10-20% deviation]

[Figure: 1/σ dσ/dp_{T,t} [GeV⁻¹], True vs. GAN, with the GAN/True ratio over p_{T,t} = 0...400 GeV and the relative statistical uncertainty 1/√N_tail; tail deviations reach the 20% level.]


Page 16: Machine learning models to generate top events

How to GAN LHC events

Idea: generate hard process


• Sharp phase-space structures, not using ΓW

$$\text{MMD}^2(P_T, P_G) = \bigl\langle k(x, x') \bigr\rangle_{x, x' \sim P_T} + \bigl\langle k(y, y') \bigr\rangle_{y, y' \sim P_G} - 2 \bigl\langle k(x, y) \bigr\rangle_{x \sim P_T,\, y \sim P_G}$$

[Figure: 1/σ dσ/dm_{W⁻} around the W pole: True vs. GAN with Breit-Wigner kernel, Gauss kernel, and without MMD.]
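A short sketch of the corresponding MMD² estimator, using a Gaussian kernel for concreteness where the talk also uses a Breit-Wigner kernel; the samples and bandwidth are toy assumptions:

```python
import torch

def mmd2(x, y, sigma=1.0):
    # biased estimator of MMD^2(P_T, P_G) for samples x ~ P_T, y ~ P_G
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

x = torch.randn(512, 1) * 2.0 + 80.4   # toy "true" m_W values
y = torch.randn(512, 1) * 2.5 + 80.0   # toy generated m_W values
print(mmd2(x, y).item())
# e.g. add lambda * mmd2(...) on m_W to the generator loss to sharpen
# the resonance peak
```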


Page 17: Machine learning models to generate top events

How to GAN LHC events

Idea: generate hard process


• Sharp phase-space structures, not using ΓW [MMD-loss]

• 2D correlations

[Figure: 2D correlations of p_{T,b} vs. p_{T,t}, True vs. GAN, with 1/σ dσ/dp_{T,b} slices at p_{T,t} = 100 GeV (True vs. GAN).]


Page 18: Machine learning models to generate top events

Reaching precision

1. Representation p_T, η, φ

2. Momentum conservation

3. Resolve log p_T

4. Regularization: spectral norm

5. Batch information

→ 1% precision ✓ [example: W + 2 jets]

Next step: automation


Page 19: Machine learning models to generate top events

How to GAN event subtraction

Idea: sample-based subtraction of distributions [1912.08824]

1. Consistent multidimensional difference between two distributions

2. Beat bin-induced statistical uncertainty [interpolation of distributions]

$$\Delta_{B-S} = \sqrt{n_B^2 N_B + n_S^2 N_S} > \max(\Delta_B, \Delta_S)$$
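A toy numerical reading of this inequality; the bin contents and weights are made-up numbers:

```python
import numpy as np

# one histogram bin: count N with per-event weight n, Poisson error n*sqrt(N)
n_B, N_B = 1.0, 10_000          # sample B
n_S, N_S = 1.0, 9_000           # sample S to be subtracted
content = n_B * N_B - n_S * N_S                  # B - S = 1000
delta_B = n_B * np.sqrt(N_B)                     # 100  (1% of B)
delta_S = n_S * np.sqrt(N_S)                     # ~95
delta_BS = np.sqrt(n_B**2 * N_B + n_S**2 * N_S)  # ~138 (~14% of B - S)
print(content, delta_B, delta_S, delta_BS)
```

The relative error explodes after binned subtraction, which is what the unbinned, GAN-based subtraction is designed to avoid.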

• Many applications:
  - Soft-collinear subtraction, multi-jet merging, on-shell subtraction
  - Background subtraction [4-body decays → preserves correlations]

[Diagram: generator G maps {r} to events {x_G, c} with class label c ∈ C_{B−S} ∪ C_S; discriminator D_B sees all generated events against data B {x_B}, discriminator D_S sees the c ∈ C_S subset against data S {x_S}, with losses L_{D_B}, L_{D_S}, L_G.]


Page 20: Machine learning models to generate top events

Example I: Z pole

• Training data:
  • pp → e⁺e⁻
  • pp → γ → e⁺e⁻

• 1 M events per dataset, MadGraph5

• Generated events: Z-Pole + interference


Page 21: Machine learning models to generate top events

Example II: Dipole subtraction

• Theory uncertainties → limiting factor for HL-LHC

• Higher orders: subtract diverging Catani-Seymour dipole from the real-emission term

• 1 M events per dataset, SHERPA

[Diagram: real-emission process Z → e⁺e⁻ with an additional gluon.]

[Figure: dσ/dE_g [pb/GeV], GAN vs. truth, for B, S, and B − S (left), and the normalized B − S distribution: (B − S)_GAN vs. (B − S)_Truth ± 1σ (right).]


Page 22: Machine learning models to generate top events

How to GAN away detector effects

Idea: invert Markov process [1912.00477]

Detector simulation

• Typical Markov process

• Prior-dependent inversion possible [Datta et al.]

• Aim: unfolding multidimensional phase space

Reconstruct parton level pp → ZW → (ll)(jj)

• GAN: no connection between input and discriminator → use fully conditional GAN (FCGAN)

[Diagram: generator G maps {r} to parton-level candidates {x_G}; discriminator D compares {x_G} against true parton-level events {x_p}, both conditioned on the detector-level event {x_d}, with losses L_D and L_G + MMD.]
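A minimal sketch of the fully conditional idea: generator and discriminator both receive the detector-level event, so each detector event can be unfolded many times; dimensions and layer sizes are assumptions, and the MMD term is omitted:

```python
import torch
import torch.nn as nn

dim_p, dim_d, noise_dim = 12, 12, 8
G = nn.Sequential(nn.Linear(noise_dim + dim_d, 128), nn.ReLU(),
                  nn.Linear(128, dim_p))
D = nn.Sequential(nn.Linear(dim_p + dim_d, 128), nn.ReLU(),
                  nn.Linear(128, 1), nn.Sigmoid())

x_d = torch.randn(64, dim_d)              # detector-level batch (condition)
r = torch.randn(64, noise_dim)
x_g = G(torch.cat([r, x_d], dim=1))       # parton-level candidates
score = D(torch.cat([x_g, x_d], dim=1))   # judged *given* the condition x_d

# unfolding at test time: sample many r for one fixed detector event
x_d_one = x_d[:1].expand(3200, dim_d)
unfoldings = G(torch.cat([torch.randn(3200, noise_dim), x_d_one], dim=1))
```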


Page 23: Machine learning models to generate top events

How to GAN away detector effects

Idea: invert Markov process [1912.00477]

Reconstruct parton level pp → ZW → (ll)(jj)

• Use fully conditional GAN (FCGAN)

• Inversion works ✓

Eq.(7): p_{T,j1} = 30 ... 100 GeV (∼ 88%)

Eq.(8): p_{T,j1} = 30 ... 60 GeV and p_{T,j2} = 30 ... 50 GeV (∼ 38%)

[Figure: 1/σ dσ/dp_{T,j1}, Truth vs. FCGAN, for the full sample and the sliced samples of Eq.(7) and Eq.(8).]


Page 24: Machine learning models to generate top events

How to GAN away detector effects

Idea: invert Markov process [1912.00477]

Reconstruct parton level pp → ZW → (ll)(jj)

• Use fully conditional GAN (FCGAN)

• Inversion works ✓

• BSM injection ✓

- train: SM events
- test: 10% events with W′ in s-channel

[Figure: 1/σ dσ/dm_{ℓℓjj}: Truth (W′) vs. FCGAN vs. Truth (SM).]


Page 25: Machine learning models to generate top events

Curing shortcomings with invertible structure

• cGAN calibration curves: mean correct, distribution too narrow

• INN: Normalizing flow with fast evaluation in both directions

$$\begin{pmatrix} x_p \\ r_p \end{pmatrix} \underset{\text{unfolding: } \bar{g}}{\overset{\text{Pythia, Delphes: } g}{\longleftrightarrow}} \begin{pmatrix} x_d \\ r_d \end{pmatrix}$$

[Figure: p_{T,q1} distribution from 3200 unfoldings of a single detector event, eINN vs. FCGAN vs. parton truth, plus calibration curve: fraction of events vs. quantile of p_{T,q1}.]


Page 26: Machine learning models to generate top events

Conditional invertible neural networks

Condition INN on detector data [2006.06685]

$$x_p \underset{\text{unfolding: } \bar{g}(r, f(x_d))}{\overset{g(x_p,\, f(x_d))}{\longleftrightarrow}} r$$

Minimizing

$$L = -\bigl\langle \log p(\theta \,|\, x_p, x_d) \bigr\rangle_{x_p \sim P_p,\, x_d \sim P_d} = \left\langle \frac{\lVert g(x_p, f(x_d)) \rVert_2^2}{2} - \log \left| \frac{\partial g(x_p, f(x_d))}{\partial x_p} \right| \right\rangle_{x_p \sim P_p,\, x_d \sim P_d} - \log p(\theta)$$

→ calibrated parton-level distributions
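A sketch of this maximum-likelihood loss for a conditional invertible map, with a single conditional affine layer standing in for the full cINN and the prior term −log p(θ) dropped; all sizes are assumptions:

```python
import torch
import torch.nn as nn

dim_p, dim_d = 6, 6
f = nn.Sequential(nn.Linear(dim_d, 32), nn.ReLU())   # condition network f(x_d)
st = nn.Linear(32, 2 * dim_p)                        # predicts scale and shift

def g(x_p, x_d):
    s, t = st(f(x_d)).chunk(2, dim=1)
    z = x_p * torch.exp(s) + t          # invertible in x_p for fixed condition
    log_jac = s.sum(dim=1)              # log |det dg/dx_p|
    return z, log_jac

def loss(x_p, x_d):
    z, log_jac = g(x_p, x_d)
    # <||g(x_p, f(x_d))||^2 / 2 - log|dg/dx_p|>, i.e. -log p up to the prior
    return (0.5 * (z**2).sum(dim=1) - log_jac).mean()

x_p, x_d = torch.randn(128, dim_p), torch.randn(128, dim_d)
print(loss(x_p, x_d).item())
# unfolding: draw r ~ N(0,1) and invert g at fixed f(x_d):
# x_p = (r - t) * torch.exp(-s)
```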

[Figure: same single-event unfolding (3200 unfoldings) and calibration curve, now including the cINN alongside eINN and FCGAN.]


Page 27: Machine learning models to generate top events

Conditional invertible neural networks

Condition INN on detector data [2006.06685]


• Use detector information x_d of arbitrary dimension

• Unfold inclusive channels

• Cross-check 2/3/4 jet exclusive channels

[Figure 8 of 2006.06685: cINNed p_{T,q1}, p_{T,q2}, and M_{W,reco} distributions for 3-jet and 4-jet exclusive events (Parton Truth, Parton cINN, Detector Truth), plus the 2-jet inclusive p_{T,q2} spectrum with the cINN/Truth ratio split into 2/3/4-jet contributions; the underlying process is pp → ZW± + jets → (ℓ⁻ℓ⁺)(jj) + jets, with jets and partons sorted by p_T and at most four jets.]


Page 28: Machine learning models to generate top events

Summary

• We can boost standard event generation using ML

• GANs can learn underlying distributions from event samples

• Possibilities to stabilize GAN training: gradient penalty, WGAN-GP, LSGAN,...

• MMD improves performance for special features

• Successful sample-based subtraction implemented

• Applications: background subtraction, soft-collinear subtraction, . . .

• Unfold high-dimensional detector-level distributions with cGANs and INNs

• Stable under injection of new data, proper calibration achieved by cINN


Page 29: Machine learning models to generate top events

Important next steps

1. Quantify uncertainties (e.g. Bayesian networks)
   • including correlations

2. High precision

3. Automation
   • move away from hand-engineered networks
