

MNRAS 000, 1–15 (2015) Preprint 5 April 2020 Compiled using MNRAS LATEX style file v3.0

Machine Learning and Cosmological Simulations II: Hydrodynamical Simulations

Harshil M. Kamdar¹,²⋆, Matthew J. Turk²,⁴ and Robert J. Brunner¹,²,³,⁴,⁵
¹ Department of Physics, University of Illinois, Urbana, IL 61801 USA
² Department of Astronomy, University of Illinois, Urbana, IL 61801 USA
³ Department of Statistics, University of Illinois, Champaign, IL 61820 USA
⁴ National Center for Supercomputing Applications, Urbana, IL 61801 USA
⁵ Beckman Institute For Advanced Science and Technology, University of Illinois, Urbana, IL 61801 USA

21 October 2015

ABSTRACT
We extend a machine learning (ML) framework presented previously to model galaxy formation and evolution in a hierarchical universe using N-body + hydrodynamical simulations. In this work, we show that ML is a promising technique to study galaxy formation in the backdrop of a hydrodynamical simulation. We use the Illustris Simulation to train and test various sophisticated machine learning algorithms. By using only essential dark matter halo physical properties and no merger history, our model predicts the gas mass, stellar mass, black hole mass, star formation rate, g − r color, and stellar metallicity fairly robustly. Our results provide a unique and powerful phenomenological framework to explore the galaxy-halo connection that is built upon a solid hydrodynamical simulation. The promising reproduction of the listed galaxy properties demonstrably places ML as a promising and significantly more computationally efficient tool to study small-scale structure formation. We find that ML mimics a full-blown hydrodynamical simulation surprisingly well in a computation time of mere minutes. The population of galaxies simulated by ML, while not numerically identical to Illustris, is statistically and physically robust and follows the same fundamental observational constraints. Machine learning offers an intriguing and promising technique to create quick mock galaxy catalogs in the future.

Key words: galaxies: halo – galaxies: formation – galaxies: evolution – cosmology: theory – large-scale structure of Universe

1 INTRODUCTION

In a ΛCDM universe, gas cools hierarchically in the centers of haloes through mergers. The evolution of collisionless dark matter particles at large scales has been studied extensively at unprecedentedly high resolutions, given the meteoric rise in computational power and the relative simplicity of these simulations (Springel 2005; Springel et al. 2005; Klypin et al. 2011; Angulo et al. 2012; Skillman et al. 2014). The formation of cosmic structure on the scale of galaxies, however, has been incredibly difficult to model (Baugh 2006; Somerville & Davé 2014); the difficulty arises primarily because baryonic physics at this scale is governed by a wide range of dissipative and/or nonlinear processes, some of which are poorly understood (Kang et al. 2005; Baugh 2006; Somerville & Davé 2014).

Dark matter plays an essential role in galaxy formation; broadly speaking, dark matter haloes are ‘cradles’ of galaxy formation. It is well-established that gas cools hierarchically in the centers of dark matter haloes through mergers; the evolution of galaxies, however, is dictated by a wide variety of baryonic processes that are discussed later in this paper. While baryonic physics plays a crucial role in the outcome of gaseous interactions, the story always starts with gravitational collapse. The connection between these two regimes (i.e. the galaxy-halo connection) is an important problem in modern cosmology. However, no simple mapping has been found between the internal dark matter halo properties and the final galaxy properties because of the sheer complexity of the baryonic interactions (Contreras et al. 2015).

⋆ E-mail: [email protected]

There are two prevalent techniques used to understand galaxy formation and evolution alongside N-body dark matter simulations: semi-analytical models (hereafter, SAMs) and simulations that include both hydrodynamics and gravity. The former is a post hoc technique that combines dark matter only simulations with approximate physical processes at the scale of a galaxy. For a general, exhaustive review of the motivation of SAMs and a comparison of different SAMs, the reader is referred to Baugh (2006); Somerville & Davé (2014) and Knebe et al. (2015). N-body + hydrodynamical simulations (hereafter, NBHS) evolve baryonic components using fluid dynamics alongside regular dark matter evolution. The biggest advantage of NBHS over SAMs is the self-consistent way in which gaseous interactions are treated by the hydrodynamical codes. For a comparison of different hydrodynamical codes, the reader is referred to Kim et al. (2014).

© 2015 The Authors

arXiv:1510.07659v1 [astro-ph.GA] 26 Oct 2015

In recent years, the number of hydrodynamical simulations that somewhat reproduce observed global galaxy properties has been on the rise (Crain et al. 2009; Schaye et al. 2010; McCarthy et al. 2012; Puchwein et al. 2013; Kannan et al. 2013; Khandai et al. 2015; Schaye et al. 2015; Vogelsberger et al. 2014a). The rise has been due to the rapid increase in computational power. Moreover, the subgrid models used in hydrodynamical simulations have been significantly improved for star formation (Springel & Hernquist 2003; Hopkins et al. 2011), black hole formation, and accretion (Sijacki et al. 2007; Dubois et al. 2012). Lastly, the numerical techniques used in hydrodynamical simulations have become vastly more robust (Springel 2010).

However, it must be noted that the computational costs associated with both NBHS and SAMs are high; Illustris took a total of 19 million CPU hours to run¹ and the largest EAGLE simulation took 4.5 million CPU hours (Schaye et al. 2015). Most SAMs, by construction, are meant to be significantly faster than NBHS; however, they still require an appreciable amount of computational power. For example, consider the open source GALACTICUS SAM put forth in Benson (2012); a halo of mass $10^{12}\,M_\odot$ is evolved in around 2 seconds and a halo of mass $10^{15}\,M_\odot$ is evolved in around 1.25 hours. A very rough order of magnitude estimate for about 500,000 dark matter haloes, with an average evolution time of approximately 2 minutes (corresponding to about $10^{13}\,M_\odot$), implies that the time taken for GALACTICUS to build merger trees to z = 0 is $\mathcal{O}(15{,}000)$ CPU hours. The inherent complexity of physical processes and the computational costs associated with a fully self-consistent treatment motivate many of the assumptions that SAMs make and the subgrid models employed in NBHS.
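The order of magnitude estimate above can be reproduced with a few lines of arithmetic. The sketch below is only illustrative: it assumes every halo takes the same average time, and the function and variable names are hypothetical rather than part of GALACTICUS.

```python
# Rough CPU-hour estimate for running a SAM over a halo catalogue.
N_HALOES = 500_000        # approximate number of dark matter haloes (from the text)
MINUTES_PER_HALO = 2.0    # average evolution time per halo (from the text)
ILLUSTRIS_CPU_HOURS = 19e6  # Illustris cost quoted in the text


def sam_cpu_hours(n_haloes: int, minutes_per_halo: float) -> float:
    """Total CPU hours, assuming every halo takes the same average time."""
    return n_haloes * minutes_per_halo / 60.0


total_hours = sam_cpu_hours(N_HALOES, MINUTES_PER_HALO)
ratio = ILLUSTRIS_CPU_HOURS / total_hours

print(f"SAM estimate: {total_hours:,.0f} CPU hours")
print(f"Illustris cost relative to the SAM estimate: {ratio:,.0f}x")
```

The result is roughly 1.7 × 10⁴ CPU hours, which agrees with the quoted figure at the order of magnitude level and is about three orders of magnitude below the Illustris cost.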

In Kamdar et al. (2015) (hereafter referred to as K15), we explored the application of supervised machine learning (ML) techniques to the problem of galaxy formation and evolution in the backdrop of SAMs. Machine learning is a subfield of computer science that provides a platform to learn complex, non-trivial relationships in large data sets. ML has previously been applied to Astronomy with considerable success (Ball et al. 2006, 2007; Fiorentin et al. 2007; Banerji et al. 2010; Ball & Brunner 2010; Gerdes et al. 2010; Kind & Brunner 2013; Xu et al. 2013; Ivezic et al. 2014; Ness et al. 2015; Kim 2015; Dieleman et al. 2015). As shown in K15, ML enabled the inference of some complex physical phenomena and provided a unique and powerful framework to explore the connection between the dark matter regime (large scale) and the baryonic regime (smaller, galaxy scales).

1 http://www.illustris-project.org/about/

In this previous work, the Millennium simulation (Springel 2005) along with the Guo et al. (2011) SAM was used to train a few ML algorithms (Breiman 2001; Geurts et al. 2006) to predict the total stellar mass, stellar mass in the bulge, the hot gas mass, cold gas mass, and the black hole mass. The results obtained were promising for the hot gas mass, stellar mass in the bulge, and the total stellar mass with regression scores ($R^2$) of 0.99, 0.77, and 0.78 respectively; the distributions for each of these masses, the BH mass-bulge mass relation, and the stellar mass-halo mass relation were also reproduced well. The cold gas mass prediction using solely DM inputs was less robust ($R^2 = 0.39$) with severe underpredictions; it was shown that the cold gas mass prediction was not robust because of ML’s inability to pick up on the time evolution of the mass cooling ODE prescribed in Guo et al. (2011) by itself. The analysis was repeated with the inclusion of the cooling radius and the hot gas mass (two important ingredients in the cooling ODE) over the last two snapshots and significantly better results were obtained ($R^2 = 0.82$). These promising results raise a very interesting question: can ML techniques reproduce a numerically and physically reasonable population of galaxies if trained on an N-body + hydrodynamical simulation? Furthermore, can we apply these trained ML algorithms to an N-body only simulation and reproduce a statistically reasonable population of galaxies that capture the essence of hydrodynamical simulations (i.e. essentially mimic a hydrodynamical simulation in a dark matter only simulation)?

Furthermore, in K15, we also discussed Neistein & Weinmann (2010), where key processes were parametrized as a function of halo mass and redshift. In Neistein et al. (2012), the physics from an NBHS is extracted using the same technique and is used within a SAM to explore whether a similar population of galaxies can be created. The promising results presented there, combined with results from K15, place further confidence in the approach presented in this paper.

In this paper, we explore the feasibility of using ML to populate a hydrodynamical simulation. For our study, we use the Illustris simulation presented in Vogelsberger et al. (2014a,b) and Genel et al. (2014), one of the highest resolution and most ambitious N-body + hydrodynamical simulations attempted to date. The Illustris simulation has been able to reproduce a wide variety of observed galaxy properties. We extract the internal dark matter halo properties from the public database (Nelson et al. 2015) and use the corresponding galaxy masses and star formation rate of each for our training and testing. To test the validity of our model, we make predictions at multiple epochs: z = 0, 1, 2, 4. A key difference between this work and K15 is the absence of merger tree history in the present work. We chose to exclude the merger history because we empirically found that its inclusion did not make a substantial difference to our results, and excluding the merger history lets us train our algorithms very quickly (down from hours to minutes).

This work enables the exploration of some key questions in galaxy formation physics. How much information can be extracted from the dark matter substructures about the evolution of galaxies inside? Can similar populations of galaxies be reproduced in the case of NBHS where the physics employed is vastly more complicated? Can the approximate rules of galaxy formation be modeled by a machine to build a phenomenological model that is orders of magnitude quicker than traditional galaxy formation models?

It must be emphasized here that no baryonic processes are included in our inputs and no explicit baryonic recipes are included in our model. Our model is phenomenological; it does not seek to replace existing galaxy formation models. Instead, it serves as a powerful tool to explore the connection between dark matter haloes and their baryonic counterparts.

The paper is organized as follows. In Section 2, we discuss general details about the Illustris simulation, the data extraction, the basics of machine learning, and the primary algorithms we used. In Section 3, we present the results we obtained when the ML algorithms were applied to the Illustris data. In Section 4, we evaluate the effectiveness of our model, point out specific deficiencies in the results we obtained, and compare the ML simulated galaxies and the Illustris galaxies. In Section 5, we conclude the paper with an extensive summary of our findings and a discussion of future work.

2 DATA & BACKGROUND

In this section, we briefly discuss the data extraction from the set of Illustris simulations and the ML algorithms that were used in our analyses. We discuss general details and briefly outline how key physical processes are handled in Illustris. Next, we briefly review the basics of ML and discuss the techniques that were employed in this work. Finally, we discuss our reasons for choosing Illustris.

2.1 Illustris Simulation

What follows is only a brief overview of the Illustris simulations. For a thorough description of the physical models employed and the simulation code, the reader is referred to the following papers: Springel (2010); Vogelsberger et al. (2013); Torrey et al. (2014); Vogelsberger et al. (2014b,a); Genel et al. (2014); Sijacki et al. (2014); Nelson et al. (2015). The suite of Illustris simulations uses the state-of-the-art hydrodynamical code AREPO (Springel 2010) to evolve resolution elements in a box of size $(106.5\,\mathrm{Mpc})^3$ to z = 0. The cosmology employed in the simulations is consistent with WMAP9: $\Omega_m = 0.2726$, $\Omega_b = 0.0456$, $\Omega_\Lambda = 0.7274$, $n_s = 0.963$ and $\sigma_8 = 0.809$, and the Hubble constant is $H_0 = 70.4\,\mathrm{km\,s^{-1}\,Mpc^{-1}}$. Three different N-body + hydrodynamical simulations were run: Illustris-1 with $2\times1820^3$ particles, Illustris-2 with $2\times910^3$ particles and Illustris-3 with $2\times450^3$ particles. A set of N-body dark matter only simulations with the same number of particles was also run with the same cosmology.

Hydrodynamical equations in the Illustris simulation are solved using AREPO, a novel quasi-Lagrangian moving mesh code that uses Voronoi cells. The mesh is used to solve the Euler equations using finite volume techniques, while being completely Galilean invariant. The gravitational forces are calculated using the standard TreePM method (Xu 1994), where short-range forces are calculated using a hierarchical octree algorithm and long-range forces are calculated using a particle mesh method.

Substructure in the Illustris simulation is identified through two different algorithms: Friends of Friends (FoF) (Davis et al. 1985) and SUBFIND (Springel et al. 2001; Dolag et al. 2009). FoF was applied to the snapshots with a linking length of 0.2 times the mean dark matter particle separation to find dark matter haloes with at least 32 dark matter particles. The corresponding stellar, gas, and black hole elements are attached to the dark matter haloes as described in Dolag et al. (2009). To find gravitationally bound structures in the simulation, a modified version of SUBFIND was used. At z = 0, 7,713,601 FoF groups were found with more than 32 particles and 4,366,546 subhaloes were identified (Vogelsberger et al. 2014a).
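To give a sense of scale for the FoF linking length, the following sketch evaluates it for Illustris-1. It assumes that half of the quoted 2 × 1820³ resolution elements are dark matter particles (i.e. 1820 per dimension) and that the mean separation is simply the box side divided by that number; the helper names are hypothetical.

```python
BOX_SIZE_MPC = 106.5     # box side length quoted in the text
N_DM_PER_DIM = 1820      # assumed dark matter particles per dimension (Illustris-1)
LINKING_FACTOR = 0.2     # linking length in units of the mean separation


def mean_separation(box_size: float, n_per_dim: int) -> float:
    """Mean inter-particle separation for n_per_dim**3 particles in a cubic box."""
    return box_size / n_per_dim


def linking_length(box_size: float, n_per_dim: int, factor: float) -> float:
    """FoF linking length as a fixed fraction of the mean separation."""
    return factor * mean_separation(box_size, n_per_dim)


sep_kpc = 1e3 * mean_separation(BOX_SIZE_MPC, N_DM_PER_DIM)
link_kpc = 1e3 * linking_length(BOX_SIZE_MPC, N_DM_PER_DIM, LINKING_FACTOR)
print(f"mean separation ~ {sep_kpc:.1f} kpc, linking length ~ {link_kpc:.1f} kpc")
```

Under these assumptions, particles closer than roughly 12 kpc are linked into the same group.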

Illustris includes the treatment of several key physical processes that shape (quite literally, sometimes) galaxy formation and evolution. The simulation includes treatments of gas cooling with self-shielding corrections, star formation, black hole seeding, black hole accretion, AGN feedback (thermal quasar-mode, thermal-mechanical radio mode, radiative feedback), supernovae feedback, stellar evolution with associated chemical enrichment, stellar mass loss, and star formation feedback. The fifteen or so free parameters that are related to the modeling of the subresolution processes are finely tuned to the star formation rate history and the stellar mass function at z = 0.

Due to the limited resolution of large NBHS, a subgrid treatment of star formation is required. Like many previous simulations (Springel 2005; Few et al. 2012), the star formation recipe used in Illustris is an effective equation of state, where stars form above a certain gas density ($\rho_{\rm SFR}$) with a star formation time scale ($t_{\rm SFR}$). The values used in Illustris are $\rho_{\rm SFR} = 0.13\,\mathrm{cm^{-3}}$ and $t_{\rm SFR} = 2.2$ Gyr. The stochastic prescription for star formation follows the Kennicutt-Schmidt law and adopts a Chabrier initial mass function.

The cooling rate of gas in Illustris is calculated as a function of gas density, temperature, metallicity, the radiation fields of AGN, and the ionising background radiation from galaxies and quasars. The primordial cooling is calculated and combined with the cooling due to metals by using CLOUDY (Ferland et al. 1998). When a stellar particle is formed in the Illustris simulation, it inherits the metallicity of the local gas. Star particles slowly return mass to the interstellar medium to account for mass loss from aging stellar populations.

Every FoF group with mass $M > 7.1\times10^{7}\,M_\odot$ is seeded with a supermassive black hole of mass $1.4\times10^{5}\,M_\odot$. Illustris includes treatments for both quasar mode and radio mode feedback (Sijacki et al. 2014), depending on what the accretion rate is at a particular time. The mass accretion is defined by a Bondi-Hoyle-Lyttleton based Eddington-limited rate given by:

$$\dot{M}_{\rm BH} = \min\left[\frac{4\pi\alpha G^2 M_{\rm BH}^2\,\rho}{\left(c_s^2 + v_{\rm BH}^2\right)^{3/2}},\ \dot{M}_{\rm edd}\right] \tag{1}$$

A novel prescription for radiative AGN feedback is implemented by assuming an average AGN SED and a bolometric luminosity-dependent scaling; this model is further described in Vogelsberger et al. (2013); Sijacki et al. (2014).
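The Eddington-limited Bondi-Hoyle-Lyttleton rate of Equation (1) can be evaluated directly. The sketch below works in cgs units and reads $\alpha$ as a dimensionless boost factor, $\rho$ as the local gas density, $c_s$ as the sound speed and $v_{\rm BH}$ as the black hole velocity relative to the gas. The default values of `alpha` and `eps_r` (the radiative efficiency entering the standard Eddington rate) are illustrative assumptions, not values stated in this paper.

```python
import math

# Physical constants in cgs units.
G = 6.674e-8          # gravitational constant [cm^3 g^-1 s^-2]
C_LIGHT = 2.998e10    # speed of light [cm s^-1]
M_PROTON = 1.6726e-24 # proton mass [g]
SIGMA_T = 6.6524e-25  # Thomson cross-section [cm^2]
M_SUN = 1.989e33      # solar mass [g]


def bondi_rate(m_bh, rho, c_s, v_bh, alpha=100.0):
    """First argument of the min[...] in Eq. (1); alpha is an assumed boost factor."""
    return 4.0 * math.pi * alpha * G**2 * m_bh**2 * rho / (c_s**2 + v_bh**2) ** 1.5


def eddington_rate(m_bh, eps_r=0.2):
    """Standard Eddington accretion rate; eps_r is an assumed radiative efficiency."""
    return 4.0 * math.pi * G * m_bh * M_PROTON / (eps_r * SIGMA_T * C_LIGHT)


def accretion_rate(m_bh, rho, c_s, v_bh, alpha=100.0, eps_r=0.2):
    """Eddington-limited Bondi-Hoyle-Lyttleton accretion rate [g s^-1]."""
    return min(bondi_rate(m_bh, rho, c_s, v_bh, alpha), eddington_rate(m_bh, eps_r))


# Illustrative inputs: a seed-mass black hole at rest in gas with a 10 km/s sound speed.
m_seed = 1.4e5 * M_SUN
rho_gas = 0.13 * M_PROTON   # 0.13 protons per cm^3 expressed as a mass density
mdot = accretion_rate(m_seed, rho_gas, c_s=1.0e6, v_bh=0.0)
```

Because the Bondi term scales as $M_{\rm BH}^2$ while the Eddington term scales as $M_{\rm BH}$, sufficiently massive black holes in dense gas end up accreting at the Eddington cap.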

For this work, we extracted the SUBFIND catalogs (Nelson et al. 2015) at four epochs, z = 0, 1, 2, 4, from http://illustris.org. The following dark matter (sub)halo properties were used as the inputs to our ML algorithms: $M_{\rm DM}$, $S_x$, $S_y$, $S_z$, $V_{\rm dispersion}$, $V_{\rm circular}$, and $N_{\rm DM}$. Here $S_{x,y,z}$ refer to the different components of the spin. We use these to predict the following attributes that accumulate through cosmic time due to a wide range of processes: $M_{\rm gas}$, $M_\star$, $M_{\rm BH}$, SFR, metallicity, and $M_{\star,{\rm half}}$. To make our training and testing set, we impose two constraints: $M_{\rm DM} > 10^{9}\,M_\odot\,h^{-1}$ and $M_\star \geq 0$. No minimum on the number of stellar particles was imposed, in order to end up with a wider dynamic range of masses; the larger range enables us to apply results from a lower resolution simulation to a higher resolution simulation. The number of (sub)haloes at z = 0, 1, 2, 4 are: 249370, 342622, 433406, and 403268 respectively.
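The selection and input assembly described above can be sketched as follows. The catalogue here is a synthetic stand-in generated with random numbers, and the dictionary keys are hypothetical; they are not the field names of the public Illustris SUBFIND catalogs.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in for a (sub)halo catalogue; keys are hypothetical names.
catalog = {
    "m_dm": 10 ** rng.uniform(8, 13, n),     # dark matter mass [M_sun / h]
    "spin": rng.normal(size=(n, 3)),         # spin components S_x, S_y, S_z
    "v_disp": rng.uniform(10, 500, n),       # velocity dispersion
    "v_circ": rng.uniform(10, 500, n),       # circular velocity
    "n_dm": rng.integers(32, 10**6, n),      # number of dark matter particles
    "m_star": 10 ** rng.uniform(6, 11, n),   # stellar mass
}


def selection_mask(cat, m_dm_min=1e9):
    """Apply the two cuts from the text: M_DM > 1e9 M_sun/h and M_star >= 0."""
    return (cat["m_dm"] > m_dm_min) & (cat["m_star"] >= 0)


def feature_matrix(cat, mask):
    """Stack the seven dark matter inputs into an (n_selected, 7) array."""
    return np.column_stack([
        cat["m_dm"][mask],
        cat["spin"][mask],      # contributes three columns
        cat["v_disp"][mask],
        cat["v_circ"][mask],
        cat["n_dm"][mask],
    ])


mask = selection_mask(catalog)
X = feature_matrix(catalog, mask)
```

Each row of `X` then holds the seven dark matter inputs for one selected (sub)halo, with the target arrays built by applying the same mask.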

We chose to use the Illustris simulation primarily for two reasons. First, the novel numerical code AREPO has been shown to outperform Gadget-2’s SPH (Springel 2010), and the extensive treatment of the wide range of physical processes that are included in the simulation makes Illustris one of the most robust hydrodynamical simulations currently available. Second, the data is publicly available (Nelson et al. 2015). In the growing climate of scientific reproducibility, this approach is becoming increasingly important; we follow this trend and release all our data and code at: http://github.com/ProfessorBrunner/ml-sims.

2.2 Machine Learning

In this section, we briefly discuss the basics of machine learning and we also provide the pseudocode for and briefly discuss the ML algorithm that performed the best in our analyses.

2.2.1 General Overview

Machine learning is a popular field in computer science, with a wide variety of applications in a number of other areas. The basic idea of ML algorithms is to ‘learn’ approximate relationships between the input data and the output data without any explicit analytical prescription being used. Supervised learning techniques are provided some training data (X, y) and they try to learn the mapping G(X → y) in order to apply this mapping to the test data.

Machine learning has been applied to several subfields in Astronomy with a lot of success; see, for example, Ball & Brunner (2010); Ivezic et al. (2014). A majority of the applications of ML in astronomy have either been classification problems, such as star-galaxy classification (Ball et al. 2006; Kim 2015) and galaxy morphology classification (Banerji et al. 2010; Dieleman et al. 2015), or regression applications, such as photometric redshift estimation (Ball et al. 2007; Gerdes et al. 2010; Kind & Brunner 2013), estimation of stellar atmospheric parameters (Fiorentin et al. 2007), and determining stellar labels from spectroscopic data (Ness et al. 2015).

We use various statistics later in the paper to quantify the effectiveness of ML in predicting the galaxy properties. First, we use the standard mean squared error (MSE) metric, which is defined as:

$$\mathrm{MSE} = \frac{1}{N_{\rm test}} \sum_{i=1}^{N_{\rm test}} \left(X^{i}_{\rm test} - X^{i}_{\rm predicted}\right)^2 \tag{2}$$

Here, $X^{i}_{\rm test}$ is the $i$th value of the actual test set and $X^{i}_{\rm predicted}$ is the $i$th value of the predicted set. Furthermore, to quantify the effectiveness of the ML algorithms, we use the base MSE ($\mathrm{MSE}_b$), defined as:

$$\mathrm{MSE}_b = \frac{1}{N_{\rm test}} \sum_{i=1}^{N_{\rm test}} \left(X^{i}_{\rm test} - X_{\rm mean,train}\right)^2 \tag{3}$$

Here, $X_{\rm mean,train}$ is the mean of the training data set. $\mathrm{MSE}_b$ is the error of an extremely naive prediction, since each test point is simply predicted as the mean of the training data set. The factor $\mathrm{MSE}_b/\mathrm{MSE}$ quantifies how good our model is at minimizing the error in the prediction of different galaxy attributes.

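We will also be using the following two metrics to check for the robustness of the prediction: the Pearson correlation and the coefficient of determination (‘regression score’). The Pearson correlation is written as:

$$\rho = \frac{\mathrm{cov}\left(X_{\rm predicted}, X_{\rm test}\right)}{\sigma_{X_{\rm predicted}}\,\sigma_{X_{\rm test}}} \tag{4}$$

and $R^2$ as:

$$R^2 = 1 - \frac{\sum_i \left(X^{i}_{\rm test} - X^{i}_{\rm predicted}\right)^2}{\sum_i \left(X^{i}_{\rm test} - X_{\rm mean,train}\right)^2} \tag{5}$$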
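The four statistics defined above are straightforward to compute with NumPy. The sketch below follows the definitions in this section, including the use of the training-set mean as the baseline; the function names and the toy arrays are hypothetical. Note that, with these definitions, $R^2 = 1 - \mathrm{MSE}/\mathrm{MSE}_b$, whereas scikit-learn's `r2_score` uses the mean of the test targets as its baseline, so the two can differ slightly.

```python
import numpy as np


def mse(x_test, x_pred):
    """Mean squared error between true and predicted test values."""
    x_test, x_pred = np.asarray(x_test, float), np.asarray(x_pred, float)
    return np.mean((x_test - x_pred) ** 2)


def base_mse(x_test, x_train):
    """Error of the naive model that always predicts the training-set mean."""
    return np.mean((np.asarray(x_test, float) - np.mean(x_train)) ** 2)


def pearson(x_test, x_pred):
    """Pearson correlation between predicted and true values."""
    return np.corrcoef(x_pred, x_test)[0, 1]


def r2_train_mean(x_test, x_pred, x_train):
    """Coefficient of determination with the training mean as the baseline."""
    return 1.0 - mse(x_test, x_pred) / base_mse(x_test, x_train)


# Toy example with made-up numbers.
x_train = [1.0, 2.0, 3.0]
x_test = [1.0, 2.0, 3.0, 4.0]
x_pred = [1.5, 2.0, 2.5, 4.0]

factor = base_mse(x_test, x_train) / mse(x_test, x_pred)
r2 = r2_train_mean(x_test, x_pred, x_train)
rho = pearson(x_test, x_pred)
```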

2.2.2 Extremely Randomized Trees

As in K15, the two primary algorithms initially used in this work were extremely randomized trees (ERT; Geurts et al. 2006) and random forests (RF; Breiman 2001). Both techniques are ensemble techniques that build on a weak learner (decision trees, in this case). We found that ERT slightly outperformed random forests and, therefore, chose to focus on ERT.

The essence of ERT is to build a large ensemble of regression trees where both the attribute and split-point choice are randomized while splitting a tree node. We provide pseudocode for the full algorithm in Table 1, which closely follows the algorithm outlined in Geurts et al. (2006). In the algorithm, the Score is the reduction in the variance. For the two subtrees $S_l$ and $S_r$ corresponding to the split $s_\star$, the $\mathrm{Score}(s_\star, S)$, abbreviated to $\mathrm{Sc}(s_\star, S)$, is given by:

$$\mathrm{Sc}(s_\star, S) = \frac{\mathrm{var}(y, S) - \frac{|S_l|}{|S|}\,\mathrm{var}(y, S_l) - \frac{|S_r|}{|S|}\,\mathrm{var}(y, S_r)}{\mathrm{var}(y, S)} \tag{6}$$
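The split score of Equation (6) is the fractional reduction in output variance achieved by a split. A minimal NumPy sketch, with a hypothetical function name and a boolean mask marking the samples sent to the left subset, could look like this:

```python
import numpy as np


def split_score(y, left_mask):
    """Relative variance reduction of Eq. (6) for a split given by a boolean mask."""
    y = np.asarray(y, float)
    left_mask = np.asarray(left_mask, bool)
    y_left, y_right = y[left_mask], y[~left_mask]
    total_var = np.var(y)
    # A split that leaves one side empty, or a constant output, gains nothing.
    if total_var == 0.0 or y_left.size == 0 or y_right.size == 0:
        return 0.0
    weighted = (y_left.size * np.var(y_left) + y_right.size * np.var(y_right)) / y.size
    return (total_var - weighted) / total_var


y = [1.0, 1.0, 5.0, 5.0]
perfect = split_score(y, [True, True, False, False])   # separates the two values
useless = split_score(y, [True, False, True, False])   # mixes them evenly
```

A split that perfectly separates the outputs scores 1, while one that leaves each subset as varied as the parent scores 0.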

The estimates produced by the M trees in the ERT ensemble are finally combined by averaging y over all trees in the ensemble. The use of the original training data set in place of a bootstrap sample (as is done for random forests) is done to minimize bias in the prediction. Furthermore, the use of both randomization and averaging is aimed at reducing the variance of the prediction (Geurts et al. 2006). The added randomness corresponds to uncorrelated errors and mitigates bias in the predictions.

We used the implementation of ERT and RF found in the Python package scikit-learn (Pedregosa et al. 2011). The parameters we used and the runtime of the techniques for the problem are discussed in the results section. ERT tends to be faster than RF because of the randomization in finding the split, which reduces the training time. The reduced training time lets us build a bigger ensemble of trees and explains why our ERT results are, generally, marginally better than RF’s.
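As a minimal illustration of the scikit-learn interface, the sketch below trains an `ExtraTreesRegressor` on synthetic data with seven input columns. The data, the train/test split and the hyperparameter values are assumptions chosen for the example; they are not the settings used for the results in this paper.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic regression problem: seven inputs, only two of which matter.
X = rng.uniform(0.0, 1.0, size=(2000, 7))
y = 3.0 * X[:, 0] + np.sin(6.0 * X[:, 1]) + 0.1 * rng.normal(size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# bootstrap=False trains every tree on the full training set, as ERT prescribes.
model = ExtraTreesRegressor(
    n_estimators=100,      # M, the number of trees (illustrative value)
    min_samples_split=5,   # n_min,samples (illustrative value)
    bootstrap=False,
    random_state=0,
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)          # average over all trees in the ensemble
score = model.score(X_test, y_test)     # scikit-learn's R^2 on the test set
```

In this interface, `n_estimators` plays the role of M and `max_features` (left at its default here) controls how many candidate attributes are screened at each node.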


Table 1. An outline of the extremely randomized trees regression algorithm

Extremely Randomized Trees

Inputs: A training set $S$ corresponding to $(X, y)$ input-output vectors, where $X = (X_1, X_2, ..., X_N)$ and $y = (y_1, y_2, ..., y_l)$; $M$ (number of trees in the ensemble); $K$ (number of random splits screened at each node); and $n_{\rm min,samples}$ (number of samples required to split a node)

Outputs: An ensemble of $M$ trees: $T = (t_1, t_2, ..., t_M)$

Step 1: Randomly select $K$ inputs $(X_1, X_2, ..., X_K)$, where $1 \leq K \leq N$.

Step 2: For each selected input variable $X_i$, $i = 1, 2, ..., K$:

• Compute the minimal and maximal values of $X_i$ in the set: $X_i^{\rm min}$ and $X_i^{\rm max}$

• Randomly select a cut-point $X_c$ in the interval $[X_i^{\rm min}, X_i^{\rm max}]$

• Return the split $[X_i \leq X_c]$

Step 3: Select the best split $s_\star$ such that $\mathrm{Score}(s_\star, S) = \max_{i=1,2,...,K} \mathrm{Score}(s_i, S)$

Step 4: Using $s_\star$, split $S$ into $S_l(X_i)$ and $S_r(X_i)$

Step 5: For $S_l(X_i)$ and $S_r(X_i)$, check the following conditions:

• $|S_l(X_i)|$ or $|S_r(X_i)|$ is lower than $n_{\rm min,samples}$

• All input attributes $(X_1, X_2, ..., X_N)$ are constant in $S_l(X_i)$ or $S_r(X_i)$

• The output vector $(y_1, y_2, ..., y_l)$ is constant in $S_l(X_i)$ or $S_r(X_i)$

Step 6: If any of the conditions in Step 5 is satisfied, stop; the node is a leaf. If none of the conditions is satisfied, repeat Steps 1 through 5 on the corresponding subset.
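Steps 1 through 3 of Table 1, the randomized choice of a split at a single node, can be sketched in NumPy as follows. This is an illustrative implementation under simplifying assumptions (a single output variable, uniform cut-points, hypothetical function names), not the scikit-learn code used for our results.

```python
import numpy as np


def variance_reduction(y, left):
    """Relative variance reduction (the Score) for a boolean left/right split."""
    y_l, y_r = y[left], y[~left]
    total = np.var(y)
    if total == 0.0 or y_l.size == 0 or y_r.size == 0:
        return 0.0
    weighted = (y_l.size * np.var(y_l) + y_r.size * np.var(y_r)) / y.size
    return (total - weighted) / total


def pick_random_split(X, y, k, rng):
    """Return (score, attribute index, cut-point) of the best of k random splits."""
    n_features = X.shape[1]
    candidates = rng.choice(n_features, size=k, replace=False)  # Step 1
    best = None
    for i in candidates:                                        # Step 2
        lo, hi = X[:, i].min(), X[:, i].max()
        if lo == hi:
            continue  # a constant attribute cannot be split
        cut = rng.uniform(lo, hi)        # random cut-point between min and max
        left = X[:, i] <= cut
        score = variance_reduction(y, left)
        if best is None or score > best[0]:                     # Step 3
            best = (score, int(i), float(cut))
    return best


rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = (X[:, 0] > 0.5).astype(float) + 0.05 * rng.normal(size=200)
best_score, best_attr, best_cut = pick_random_split(X, y, k=3, rng=rng)
```

Growing a full tree amounts to applying this choice recursively to the two resulting subsets until one of the stopping conditions in Step 5 is met.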

3 RESULTS

In this section, we present our results of applying ML techniques to the Illustris simulation. The following attributes are predicted at multiple epochs: gas mass, stellar mass, black hole mass, stellar mass inside the half-mass radius, star formation rate, metallicity, and g − r color. We perform our analysis for z = 0, 1, 2, 4 and present our results in this paper for z = 0 and z = 2. The rest can be found in the linked GitHub repository. We also exclude the stellar half mass results as these are incredibly similar to the stellar mass results.

The structure of this section is as follows. For each physical attribute that is predicted, we provide a table summarizing some basic statistical quantities that are indicative of the goodness of fit at z = 0, 2. $\mathrm{MSE}_b$ and the MSE are listed for each technique. The factor reduction of the MSE is also listed to test the relative performance of the algorithms and to quantify how much they are learning. Finally, the Pearson correlation between the predicted and the true data set and the coefficient of determination ($R^2$) are also listed. We provide a hexbin plot (on a log scale) of the predicted quantity versus the quantity in Illustris. For the hexbin plot, a gridsize of 30 was used and the colormap was logarithmically scaled. A violin plot is also shown to compare the distributions of a particular physical attribute in Illustris galaxies and the ML galaxies. We also provide, when appropriate, a plot that shows the physical robustness of our results, supplementing the hexbin plot and the violin plot.

3.1 Gas Mass

In Table 2 and Figures 1 and 2, the results for the gas mass prediction are shown. There are several physical processes that play a role in how gas mass is accumulated through cosmic time. Changes in the gas mass are driven by two key processes: gas cooling and turning into stars and stars returning gas mass through gas recycling. As we can see in the two plots and the table provided, ML is able to model these two processes reasonably well.

Table 2. Gas mass statistics

Redshift   MSE_b    MSE     MSE_b/MSE   ρ       R^2
z = 0      20.294   6.641   3.055       0.849   0.673
z = 2      0.761    0.063   12.002      0.959   0.917
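As a quick consistency check on Table 2, the tabulated R^2 values agree to three decimals with a simple identity linking them to the MSE ratio. The identity holds when MSE_b is the error of a mean-value baseline; that reading of MSE_b is an assumption here, supported only by the numerical agreement.

```latex
R^2 = 1 - \frac{\mathrm{MSE}}{\mathrm{MSE}_b}
    = 1 - \left(\frac{\mathrm{MSE}_b}{\mathrm{MSE}}\right)^{-1},
\qquad
z = 0:\; 1 - \frac{1}{3.055} \approx 0.673,
\qquad
z = 2:\; 1 - \frac{1}{12.002} \approx 0.917 .
```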

In the hexbin plot, we notice an overprediction at lower masses. This discrepancy is explained by the fact that we are using halo masses in our inputs; the amount of gas mass that is present at z = 0 based solely on DM inputs is off (by an order of magnitude in certain lower mass haloes) because the feedback processes are not explicitly included in our analyses. For z = 2, the prediction is significantly better. We would like to emphasize here that, as we can see in the hexbin plot and the violinplot, our results are statistically robust; there is noticeable scatter but the distribution of the two quantities is similar. In Nelson et al. (2013), the gas accretion rate was shown as a function of halo mass. We note that while there is no trivial relationship, ML is able to model the amount of gas accreted onto a dark matter halo reasonably well at both epochs.

3.2 Stellar Mass

Figure 1. Left: A hexbin plot of M_{Illustris,gas} and M_{predicted,gas} at z = 0. The black dashed line corresponds to a perfect prediction. Right: A violinplot showing the distributions of M_{Illustris,gas} and M_{predicted,gas}. The median and the interquartile range are also shown.

Figure 2. Left: A hexbin plot of M_{Illustris,gas} and M_{predicted,gas} at z = 2. The black dashed line corresponds to a perfect prediction. Right: A violinplot showing the distributions of M_{Illustris,gas} and M_{predicted,gas}. The median and the interquartile range are also shown.

The results for the stellar mass prediction are shown in Table 3 and Figures 3, 4, 18, and 19. The stellar mass, like the gas mass, is predicted very well. The buildup of stellar mass in Illustris occurs when a gas cell exceeds some critical density ρ_sfr; when this condition is met, star particles are produced with a timescale t_sfr using free parameters that are fine-tuned to follow the Kennicutt-Schmidt relation. Our results for the stellar mass using solely halo properties are promising and imply that ML is able to model the recipes for stellar mass reasonably well.

However, there is still some scatter in the predicted results. The feedback processes again play an important role here; these processes cannot be modeled solely using ML as they have no dark matter dependence. In the case of the stellar mass, the AGN feedback and the supernovae feedback quench star formation (especially for higher mass haloes) and a purely dark matter based phenomenological model is unable to capture these phenomena. The high R^2 values and the hexbin plot place confidence in our predictions. Moreover, the distribution of the stellar mass is reproduced almost perfectly as shown in Figure 4. The results shown are statistically robust and constitute a set of galaxies with stellar masses that is similar to that found in Illustris.

Table 3. Stellar mass statistics

Redshift   MSE_b   MSE     MSE_b/MSE   ρ       R^2
z = 0      1.506   0.126   11.928      0.957   0.916
z = 2      0.081   0.011   7.486       0.936   0.866

Table 4. Black hole mass statistics

Redshift   MSE_b          MSE         MSE_b/MSE   ρ       R^2
z = 0      2.68 × 10^-5   7 × 10^-6   3.624       0.852   0.724
z = 2      2.63 × 10^-6   1 × 10^-6   2.471       0.813   0.595

Furthermore, in Figures 18 and 19, we show the stellar mass-halo mass (SMHM) relation at z = 0 and z = 2. As shown in both plots, a physically robust population of galaxies is reproduced. The SMHM is reproduced almost perfectly at lower masses. There are some deviations at higher masses, but we posit that this is because the number of dark matter haloes with such a high mass is low, thereby leading to a smaller sample size and some deviations from expectations. For instance, at z = 0 there are only 60 dark matter haloes with M_{halo,DM} > 10^13 M_⊙. The SMHM being reproduced is incredibly promising because it shows that ML is able to approximate the mapping between dark matter haloes and stellar mass buildup very well. Another point to note here, that we will come back to later in the paper, is that there is no direct relationship being input or assumed by ML; instead, a relationship is approximated from the results of an NBHS to predict how dark matter haloes are populated with galaxies. An important feature of Figures 18 and 19 is the noticeably smaller scatter in the ML galaxies compared to the Illustris galaxies. This point will be revisited in Section 4.
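A binned relation of the kind shown for the SMHM (a mean with a standard-deviation band in bins of halo mass) could be computed as sketched below. The function name binned_mean_std, the choice of the mean as the central value, and the bin edges are assumptions for illustration; the exact binning used for the figures is not specified in the text.

```python
import math


def binned_mean_std(x, y, edges):
    """Mean and standard deviation of y in bins of log10(x).

    x     -- positive values to bin on (e.g. halo masses)
    y     -- values to summarize (e.g. stellar-to-halo mass ratios)
    edges -- ascending bin edges in log10(x)
    Returns a list of (bin_center, mean, std) for the non-empty bins.
    """
    bins = [[] for _ in range(len(edges) - 1)]
    for xi, yi in zip(x, y):
        log_x = math.log10(xi)
        for k in range(len(edges) - 1):
            if edges[k] <= log_x < edges[k + 1]:
                bins[k].append(yi)
                break  # each sample falls in at most one bin

    summary = []
    for k, values in enumerate(bins):
        if not values:
            continue  # skip empty bins
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        summary.append((0.5 * (edges[k] + edges[k + 1]), mean, std))
    return summary
```

Running the same function on the Illustris values and on the ML predictions, with identical bin edges, would allow both the central relation and the scatter in each bin to be compared directly.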

3.3 Black Hole Mass

The results for the black hole mass, which is also reproduced very well, are shown in Table 4 and Figures 5, 6, 12, and 13. In Section 2, we discussed how the black hole mass is accreted in Illustris. As with other quantities, the hexbin plot shows some scatter for lower masses but as the mass increases, the predictions and simulated values match up very well. Our results for the black hole mass using solely halo properties are promising and imply that ML is able to model the recipes for black hole mass.

In Figures 12 and 13, the black hole mass-bulge mass relation is plotted. The prediction and the simulated relationship match up almost perfectly, implying that the ML-prescribed black hole mass and stellar mass are not just numerically but also physically robust. The reproduction of the BH-bulge relationship further places confidence in the utility of ML to produce a global population of galaxies that is both statistically and physically robust. As with the stellar mass-halo mass relation, there is more scatter in the Illustris galaxies than in the ML galaxies. The reason for this difference in scatter is discussed in detail in Section 4.

3.4 Star Formation Rate

In Table 5 and Figures 7, 8, 14, 15, 16, and 17, the results for the performance of ML at modeling the star formation rate using DM halo properties are presented. The statistics reported in Table 5 and the hexbin plot show that the predictions of the SFR are good but there is significantly more scatter compared to the other attributes. There is a noticeable overprediction, as evidenced by the hexbin plot and the violinplot. The distribution of the ML galaxies nevertheless closely resembles that of the Illustris galaxies. At z = 2, the predictions have significantly less scatter and the distributions match better.

Table 5. SFR statistics

Redshift   MSE_b    MSE     MSE_b/MSE   ρ       R^2
z = 0      0.377    0.140   2.702       0.794   0.630
z = 2      10.565   2.754   3.836       0.865   0.739

Table 6. Metallicity statistics

Redshift   MSE_b           MSE            MSE_b/MSE   ρ       R^2
z = 0      1.515 × 10^-5   9.22 × 10^-7   16.438      0.969   0.939
z = 2      2.12 × 10^-6    2.10 × 10^-7   10.106      0.949   0.901

In Figures 14 and 15, the SFR is plotted as a function of stellar mass; the results for Illustris and the predicted results match up quite well at both epochs. The robust results imply that the set of galaxies, while it may not be numerically identical to Illustris, is statistically and physically robust. As with the earlier results, the standard deviation for the Illustris results is higher than that of the SFR predicted by ML. In Figures 16 and 17, the specific SFR is plotted as a function of stellar mass. The results align very well at both epochs and demonstrate that the ML simulated galaxies have a physically robust mapping from dark matter halo properties to the SFR.

3.5 Stellar Metallicity

The stellar metallicity of an entire galaxy (M_Z/M_tot) is also predicted at z = 0 and z = 2. The results for the stellar metallicity are shown in Table 6 and Figures 9 and 10. Both plots indicate that the stellar metallicity is reconstructed very well using ML. There is some noticeable scatter at lower metallicities, similar to the other physical attributes that are discussed in this work. However, the distribution is reproduced very well, implying that the predictions are statistically robust.

3.6 Color

g − r color is also predicted at z = 0. The results are shown in Figure 11. In Figure 11, a hexbin plot and a violinplot are shown, and in Figure 20, g − r is plotted as a function of stellar mass. The hexbin plot shows a lot of scatter in the predicted values and a noticeable overprediction for lower magnitudes. However, the violinplot shows a peak at about the same point and the distributions look similar. This discrepancy may be because a few bins around ≈ 0.45 have greater than 10^4 galaxies in them.

Figure 20 shows the g − r color as a function of stellar mass. The Illustris and the predicted curves match up very well, placing confidence in our results. Following the recurring trend, the standard deviation in the predicted values is lower than that found in Illustris, i.e., there is more scatter in the Illustris galaxies.

4 DISCUSSION

Figure 3. Left: A hexbin plot of M_{Illustris,⋆} and M_{predicted,⋆} at z = 0. The black dashed line corresponds to a perfect prediction. Right: A violinplot showing the distributions of M_{Illustris,⋆} and M_{predicted,⋆}. The median and the interquartile range are also shown.

Figure 4. Left: A hexbin plot of M_{Illustris,⋆} and M_{predicted,⋆} at z = 2. The black dashed line corresponds to a perfect prediction. Right: A violinplot showing the distributions of M_{Illustris,⋆} and M_{predicted,⋆}. The median and the interquartile range are also shown.

The results presented above show that machine learning techniques are able to reproduce a strikingly similar population of galaxies to a full-blown hydrodynamical simulation by using only important physical properties of the dark matter halo in which the galaxy resides. The following physical attributes were predicted for each galaxy: gas mass, stellar mass, black hole mass, SFR, stellar metallicity, and g − r color. There are two central differences between this work and the results presented in K15. First, we are using an N-body + hydrodynamical simulation where the baryonic physics employed is vastly more complicated and the treatment of physical processes is self-consistent. Second, the merger history was not included in our analyses, reducing the computation time to a matter of minutes.

In the results shown previously, a wide variety of observed physical relationships are reproduced: the BH-bulge mass relation, the stellar mass-halo mass relation, SFR-stellar mass relation, SSFR-stellar mass relation, and magnitude-stellar mass relation. The fact that these important observational constraints are consistent with the Illustris galaxies is very promising. The reproduction of these relations is important because along with numerical and statistical robustness, the results show that the population of galaxies that is produced using ML is also physically robust.

It is important to note here that our model is a purely phenomenological one. Unlike SAMs, ML does not presuppose any relationship between the dark matter haloes and the galaxies residing in the haloes. ML, therefore, does not offer a replacement for SAMs or NBHS; instead, it can be used as a tool to explore the halo-galaxy connection and could be used as an analysis tool to explore how different simulation physics influences structure formation in the universe.

Figure 5. Left: A hexbin plot of M_{Illustris,BH} and M_{predicted,BH} at z = 0. The black dashed line corresponds to a perfect prediction. Right: A violinplot showing the distributions of M_{Illustris,BH} and M_{predicted,BH}. The median and the interquartile range are also shown.

Figure 6. Left: A hexbin plot of M_{Illustris,BH} and M_{predicted,BH} at z = 2. The black dashed line corresponds to a perfect prediction. Right: A violinplot showing the distributions of M_{Illustris,BH} and M_{predicted,BH}. The median and the interquartile range are also shown.

An interesting note here is the similarity between our model and the methodology of SHAMs (Kravtsov et al. 2004; Conroy & Wechsler 2009). Both methods in question use physical halo properties to make statements about the properties of the galaxies that the haloes hold. A subtle, but important, distinction between the two methods must be made. SHAMs a priori assume that a relationship between the dark matter halo and the galaxy residing in the halo exists but ML does not make this assumption. Indeed, the reproduction of the SMHM places confidence in the key assumption that most SHAMs make: observable properties of galaxies are monotonically related to the dynamical properties of dark matter substructures.
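The monotonic assumption behind SHAMs can be illustrated with a rank-ordering sketch: the halo with the largest property value receives the largest galaxy value, and so on down the list. This is a deliberately simplified illustration without scatter, not the procedure of any particular SHAM paper; the function name rank_order_match and the equal-length inputs are assumptions made for the example.

```python
def rank_order_match(halo_property, galaxy_property):
    """Assign galaxy values to haloes by rank (largest halo gets largest value)."""
    if len(halo_property) != len(galaxy_property):
        raise ValueError("both lists must have the same length")

    # Indices of haloes sorted from largest to smallest property value.
    halo_order = sorted(
        range(len(halo_property)), key=lambda i: halo_property[i], reverse=True
    )
    # Galaxy values sorted from largest to smallest.
    ranked_galaxies = sorted(galaxy_property, reverse=True)

    assigned = [None] * len(halo_property)
    for rank, halo_index in enumerate(halo_order):
        assigned[halo_index] = ranked_galaxies[rank]
    return assigned
```

The contrast with the ML approach is that here the monotonic mapping is imposed by construction, whereas the tree-based regressors are free to learn a non-monotonic or multi-variate mapping from the training data.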

The results obtained give a unique and powerful look into the halo-galaxy connection that is somewhat similar to that which we presented in K15. We are able to quantitatively and qualitatively show that there is a surprisingly strong environmental halo dependence for galaxy formation and evolution. The key difference between this work and K15 is that the results obtained herein show that the results obtained in K15 are valid in the context of N-body + hydrodynamical simulations. Furthermore, the computational costs associated with ML in the context of NBHS are minuscule. The full pipeline (preprocessing, running ML algorithms and plotting) took a total of 4 minutes for z = 0 and 6 minutes for z = 2, making ML 2-4 orders of magnitude faster than SAMs and 3-6 orders of magnitude faster than NBHS.

Figure 7. Left: A hexbin plot of SFR_Illustris and SFR_predicted at z = 0. The black dashed line corresponds to a perfect prediction. Right: A violinplot showing the distributions of SFR_Illustris and SFR_predicted. The median and the interquartile range are also shown.

Figure 8. Left: A hexbin plot of SFR_Illustris and SFR_predicted at z = 2. The black dashed line corresponds to a perfect prediction. Right: A violinplot showing the distributions of SFR_Illustris and SFR_predicted. The median and the interquartile range are also shown.

A recurring discrepancy that we find between our results and the results presented in Illustris is the smaller scatter in our results (i.e. a lower standard deviation in ML galaxies compared to the Illustris galaxies). An explanation for this reduced scatter is the inability of ML to pick up on extreme cases. Indeed, the inability of our selected approach to pick up on these extreme cases implies that it is not able to fully model the physics that plays a role in galaxy formation; this discrepancy is expected since the only inputs from which the ML algorithms are trying to learn a relationship are the dark matter halo inputs. The absence of physical processes and, consequently, the phenomenological nature of the model help explain why ML is unable to model some of the subtler physics that plays an important role in galaxy formation and evolution.
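One way to see why a predictor driven only by halo inputs would tend to show less scatter is the law of total variance. This is a textbook identity rather than a result of this work. Here X denotes the halo inputs, y a galaxy property, and ŷ(X) an idealized predictor equal to the conditional mean of y given X, which is an assumption about what an averaged ensemble of trees approximates.

```latex
\mathrm{Var}(y)
  = \mathrm{Var}\!\big(\mathbb{E}[y \mid X]\big)
  + \mathbb{E}\!\big[\mathrm{Var}(y \mid X)\big]
\quad\Longrightarrow\quad
\mathrm{Var}\!\big(\hat{y}(X)\big)
  = \mathrm{Var}\!\big(\mathbb{E}[y \mid X]\big)
  \le \mathrm{Var}(y).
```

Under this idealization, the second term is the part of the scatter that the halo inputs do not determine, for example feedback-driven variation, and it is absent from the predicted population.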

However, the goal of this work is not to exactly model each physical process with high accuracy and produce a numerically identical set of galaxies; instead, the goal is to explore the halo-galaxy connection in the backdrop of NBHS using a new technique and to evaluate how much information can be extracted from the dark matter properties about the eventual baryonic evolution of galaxies. The results shown in this work demonstrate that a numerically, statistically, and physically robust population of galaxies is produced by ML when the algorithms are trained and tested on a robust N-body + hydrodynamical simulation. The relative simplicity, the computational efficiency, and the physically consistent population of galaxies that is produced cement ML as an invaluable analysis tool in future galaxy formation studies. Our current approach does not replace any current galaxy formation models; instead, the applicability of ML lies in supplementing, validating, and extending current models. In a forthcoming work, ML algorithms are trained on an NBHS and applied to completely independent N-body only simulations to populate them with galaxies by, essentially, mimicking an NBHS, with the mock galaxy catalogs created on the order of minutes.

Figure 9. Left: A hexbin plot of Metallicity_Illustris and Metallicity_predicted at z = 0. The black dashed line corresponds to a perfect prediction. Right: A violinplot showing the distributions of Metallicity_Illustris and Metallicity_predicted. The median and the interquartile range are also shown.

Figure 10. Left: A hexbin plot of Metallicity_Illustris and Metallicity_predicted at z = 2. The black dashed line corresponds to a perfect prediction. Right: A violinplot showing the distributions of Metallicity_Illustris and Metallicity_predicted. The median and the interquartile range are also shown.

Another important point is the role of the subgrid models in N-body + hydrodynamical simulations. The subgrid models employed in Illustris are fairly similar to the prescriptions used in SAMs. However, there are more free parameters in SAMs that can be fine-tuned better but lead to degeneracies. Even though the method used to evolve physical quantities in hydrodynamical simulations is significantly more sophisticated, the same underlying physics helps explain why ML is successful at reproducing a physically robust population of galaxies. The success of ML in modeling galaxy formation is a statement on the ability of ML to infer these subgrid models and global observational constraints in the midst of hydrodynamical evolution.

Overall, the results presented in this paper and our previous work on SAMs quantitatively show that an appreciable amount of information about galactic formation and evolution can be extracted from dark matter substructures. By reproducing some fundamental observational constraints, we show that ML is able to mimic a full-blown hydrodynamical simulation reasonably well, but up to 6 orders of magnitude faster.

5 CONCLUSIONS

In this work, a variety of ML techniques were used to reconstruct a set of galaxies in an NBHS using solely the DM halo physical properties. Using the Illustris simulation to train and test the ML algorithms, the gas mass, stellar mass, BH mass, SFR, stellar metallicity, and g − r color are predicted. ML provides a unique and powerful framework for this particular problem for three reasons: simplicity of ML algorithms, computational efficiency of ML algorithms, and their ability to model incredibly complex relationships. The results shown in this work demonstrate that a numerically, statistically and physically robust population of galaxies is produced by ML when the algorithms are trained and tested on a robust N-body + hydrodynamical simulation.

Our primary conclusions are as follows:


Figure 11. Left: A hexbin plot of (g − r)_Illustris and (g − r)_predicted at z = 0. The black dashed line corresponds to a perfect prediction. Right: A violinplot showing the distributions of (g − r)_Illustris and (g − r)_predicted. The median and the interquartile range are also shown.

Figure 12. The BH-bulge mass relation at z = 0 for the simulated ML galaxies and Illustris galaxies. Both quantities are binned using the stellar half mass. The two different shadings (blue for Illustris and green for ML) represent the standard deviation at each binned point.

(1) Exploring the extent of the influence of dark matter haloes and their environment on galaxy formation and evolution is a non-trivial problem with poorly defined inputs and mappings. The problem is even murkier in the backdrop of hydrodynamical simulations, where the evolution is vastly more sophisticated and self-consistent with fewer fine-tuned parameters. ML offers a powerful framework to explore this problem.

Figure 13. The BH-bulge mass relation at z = 2 for the simulated ML galaxies and Illustris galaxies. Both quantities are binned using the stellar half mass. The two different shadings (blue for Illustris and green for ML) represent the standard deviation at each binned point.

(2) Using the Illustris simulation, a few important physical properties of dark matter haloes were used to predict the gas mass, stellar mass, black hole mass, SFR, stellar metallicity, and g − r color. No baryonic processes were explicitly included in our analysis. We used two sophisticated ML algorithms (ERT, RF) in our analyses.

Figure 14. The SFR as a function of stellar mass at z = 0 for the simulated ML galaxies and Illustris galaxies. Both quantities are binned using the stellar mass. The two different shadings (blue for Illustris and green for ML) represent the standard deviation at each binned point.

Figure 15. The SFR as a function of stellar mass at z = 2 for the simulated ML galaxies and Illustris galaxies. Both quantities are binned using the stellar mass. The two different shadings (blue for Illustris and green for ML) represent the standard deviation at each binned point.

Figure 16. The SSFR as a function of stellar mass at z = 0 for the simulated ML galaxies and Illustris galaxies. Both quantities are binned using the stellar mass. The two different shadings (blue for Illustris and green for ML) represent the standard deviation at each binned point.

Figure 17. The SSFR as a function of stellar mass at z = 2 for the simulated ML galaxies and Illustris galaxies. Both quantities are binned using the stellar mass. The two different shadings (blue for Illustris and green for ML) represent the standard deviation at each binned point.

Figure 18. The stellar mass-halo mass relation at z = 0 for the simulated ML galaxies and Illustris galaxies. Both quantities are binned using the halo mass. The two different shadings (blue for Illustris and green for ML) represent the standard deviation at each binned point.

Figure 19. The stellar mass-halo mass relation at z = 2 for the simulated ML galaxies and Illustris galaxies. Both quantities are binned using the halo mass. The two different shadings (blue for Illustris and green for ML) represent the standard deviation at each binned point.

Figure 20. g − r at z = 0 as a function of stellar mass for the simulated ML galaxies and Illustris galaxies. Both quantities are binned using the stellar mass. The two different shadings (blue for Illustris and green for ML) represent the standard deviation at each binned point.

(3) A remarkably similar population of galaxies is reconstructed when the ML algorithms are trained and tested on the Illustris simulation. The individual physical attributes are generally predicted quite well with few discrepancies. The ML simulated galaxies match up with the Illustris galaxies by following a variety of global constraints: the BH-bulge mass relation, the stellar mass-halo mass relation, SFR-stellar mass relation, SSFR-stellar mass relation, and magnitude-stellar mass relation.

(4) A recurring, and important, discrepancy was the noticeably smaller scatter (lower standard deviation) for a given attribute in the ML galaxies compared to the Illustris galaxies. We hypothesize this is due to the inability of our ML approach to pick up on extreme cases, indicating that the physical processes are not fully captured. This is likely because some feedback processes have little to no dependence on the dark matter halo properties and, therefore, cannot be modeled well by our current approach.

(5) However, the goal of this work was not to construct a population of galaxies that is numerically identical. Instead, the goal of this work was to evaluate how much information can be extracted from the dark matter only properties about the eventual baryonic evolution of galaxies. If the ML simulated galaxies are physically reasonable, which they are, then the approximate mapping found by ML between the dark matter halo properties and the galaxy properties solidifies ML's role in future galaxy formation studies. The conclusion is clear: ML is able to mimic how galaxies are evolved in an NBHS reasonably well. Furthermore, ML is able to recreate the population of galaxies in about 4 minutes in contrast to millions of hours, highlighting its potential in future galaxy formation studies.

The success of ML algorithms at modeling galaxy formation reasonably well in an NBHS opens up a wide variety of avenues for future work. Most notably, in a forthcoming work, we adopt this methodology to train an ML algorithm on Illustris (N-body + hydro) and apply it to three separate N-body only simulations (Illustris-dark, Bolshoi, and Dark Sky) to mimic an NBHS and populate an N-body only simulation with galaxies on the order of minutes. The approach laid out in this work could help with the rapid creation of mock galaxy catalogs for upcoming surveys.

6 ACKNOWLEDGMENTS

HMK and RJB acknowledge support from the National Science Foundation Grant No. AST-1313415. HMK has been supported in part by funding from the LAS Honors Council at the University of Illinois and by the Office of Student Financial Aid at the University of Illinois. HMK also acknowledges support from the Shodor Foundation and Blue Waters. RJB has been supported in part by the Center for Advanced Studies at the University of Illinois. MJT is supported by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through Grant GBMF4561.

The Illustris-1 simulation was run on the CURIE supercomputer at CEA/France as part of PRACE project RA0844, and the SuperMUC computer at the Leibniz Computing Centre, Germany, as part of GCS-project pr85je. The further simulations were run on the Harvard Odyssey and CfA/ITC clusters, the Ranger and Stampede supercomputers at the Texas Advanced Computing Center through XSEDE, and the Kraken supercomputer at Oak Ridge National Laboratory through XSEDE.

REFERENCES

Angulo R., Springel V., White S., Jenkins A., Baugh C., FrenkC., 2012, MNRAS, 426, 2046

Ball N. M., Brunner R. J., 2010, International Journal of Modern

Physics D, 19, 1049

Ball N. M., Brunner R. J., Myers A. D., Tcheng D., 2006, ApJ,

650, 497

Ball N. M., Brunner R. J., Myers A. D., Strand N. E., AlbertsS. L., Tcheng D., Llora X., 2007, ApJ, 663, 774

Banerji M., et al., 2010, MNRAS, 406, 342

Baugh C. M., 2006, Reports on Progress in Physics, 69, 3101

Benson A. J., 2012, New Astronomy, 17, 175

Breiman L., 2001, Machine learning, 45, 5

Conroy C., Wechsler R. H., 2009, ApJ, 696, 620

Contreras S., Baugh C., Norberg P., Padilla N., 2015, arXiv preprint arXiv:1502.06614

Crain R. A., et al., 2009, MNRAS, 399, 1773

Davis M., Efstathiou G., Frenk C. S., White S. D., 1985, ApJ, 292, 371

Dieleman S., Willett K. W., Dambre J., 2015, MNRAS, 450, 1441

Dolag K., Borgani S., Murante G., Springel V., 2009, MNRAS, 399, 497

Dubois Y., Devriendt J., Slyz A., Teyssier R., 2012, MNRAS, 420, 2662

Ferland G., Korista K., Verner D., Ferguson J., Kingdon J., Verner E., 1998, Publications of the Astronomical Society of the Pacific, 110, 761

Few C. G., Courty S., Gibson B. K., Kawata D., Calura F., Teyssier R., 2012, MNRAS: Letters, 424, L11

Fiorentin P. R., Bailer-Jones C., Lee Y., Beers T., Sivarani T., Wilhelm R., Prieto C. A., Norris J., 2007, A&A, 467, 1373

Genel S., et al., 2014, MNRAS, 445, 175

Gerdes D. W., Sypniewski A. J., McKay T. A., Hao J., Weis M. R., Wechsler R. H., Busha M. T., 2010, ApJ, 715, 823

Geurts P., Ernst D., Wehenkel L., 2006, Machine learning, 63, 3

Guo Q., et al., 2011, MNRAS, 413, 101

Hopkins P. F., Quataert E., Murray N., 2011, MNRAS, 417, 950

Ivezic Z., Connolly A. J., VanderPlas J. T., Gray A., 2014, Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data. Princeton University Press

Kamdar H., Turk M., Brunner R., 2015, MNRAS

Kang X., Jing Y., Mo H., Borner G., 2005, ApJ, 631, 21

Kannan R., Stinson G. S., Maccio A. V., Brook C., Weinmann S. M., Wadsley J., Couchman H. M., 2013, MNRAS, p. stt2144

Khandai N., Di Matteo T., Croft R., Wilkins S., Feng Y., Tucker E., DeGraf C., Liu M.-S., 2015, MNRAS, 450, 1349

Kim Edward B. R. C.-K. M., 2015, MNRAS, 453, 507

Kim J.-h., et al., 2014, ApJS, 210, 14

Kind M. C., Brunner R. J., 2013, MNRAS, 432, 1483

Klypin A. A., Trujillo-Gomez S., Primack J., 2011, ApJ, 740, 102

Knebe A., et al., 2015, arXiv preprint arXiv:1505.04607

Kravtsov A. V., Berlind A. A., Wechsler R. H., Klypin A. A., Gottlober S., Allgood B., Primack J. R., 2004, ApJ, 609, 35

McCarthy I. G., Font A. S., Crain R. A., Deason A. J., Schaye J., Theuns T., 2012, MNRAS, 420, 2245

Neistein E., Weinmann S. M., 2010, MNRAS, 405, 2717

Neistein E., Khochfar S., Dalla Vecchia C., Schaye J., 2012, MNRAS, 421, 3579

Nelson D., Vogelsberger M., Genel S., Sijacki D., Keres D., Springel V., Hernquist L., 2013, MNRAS, p. sts595

Nelson D., et al., 2015, arXiv preprint arXiv:1504.00362

Ness M., Hogg D. W., Rix H.-W., Ho A., Zasowski G., 2015, arXiv preprint arXiv:1501.07604

Pedregosa F., et al., 2011, The Journal of Machine Learning Research, 12, 2825

Puchwein E., Baldi M., Springel V., 2013, MNRAS, 436, 348

Schaye J., et al., 2010, MNRAS, 402, 1536

Schaye J., et al., 2015, MNRAS, 446, 521

Sijacki D., Springel V., Di Matteo T., Hernquist L., 2007, MNRAS, 380, 877

Sijacki D., Vogelsberger M., Genel S., Springel V., Torrey P., Snyder G., Nelson D., Hernquist L., 2014, arXiv preprint arXiv:1408.6842

Skillman S. W., Warren M. S., Turk M. J., Wechsler R. H., Holz D. E., Sutter P., 2014, arXiv preprint arXiv:1407.2600

Somerville R. S., Dave R., 2014, arXiv preprint arXiv:1412.2712

Springel V., 2005, MNRAS, 364, 1105

Springel V., 2010, MNRAS, 401, 791

Springel V., Hernquist L., 2003, MNRAS, 339, 289

Springel V., White S. D., Tormen G., Kauffmann G., 2001, MNRAS, 328, 726

Springel V., et al., 2005, Nature, 435, 629

Torrey P., Vogelsberger M., Genel S., Sijacki D., Springel V., Hernquist L., 2014, MNRAS, p. stt2295

Vogelsberger M., Genel S., Sijacki D., Torrey P., Springel V., Hernquist L., 2013, MNRAS, p. stt1789

Vogelsberger M., et al., 2014a, MNRAS, 444, 1518

Vogelsberger M., et al., 2014b, Nature, 509, 177

Xu G., 1994, arXiv preprint astro-ph/9409021

Xu X., Ho S., Trac H., Schneider J., Poczos B., Ntampaka M., 2013, ApJ, 772, 147

This paper has been typeset from a TeX/LaTeX file prepared by the author.
