Hadoop & HEP

Simon, Wednesday 12 August 2009


DESCRIPTION

Simon Metson, of Bristol University and CERN's CMS experiment, discusses how Hadoop could be used to process CMS event data and other data generated in and by the experiment.

TRANSCRIPT

Page 1: Hadoop & HEP

Hadoop and HEP

Simon

Wednesday, 12 August 2009

Page 2: Hadoop & HEP

About us

• CMS will take 1-10PB of data a year

• We’ll generate approximately the same volume again in simulated data

• The experiment could run for 20-30 years

• We have ~80 large computing centres around the world (>0.5 PB of storage and hundreds of job slots each)

• ~3000 members of the collaboration


Page 3: Hadoop & HEP

Why so much data?

• We have a very big digital camera

• Each event is ~1MB for normal running

• Event size increases for heavy-ion (HI) running and upgrade studies

• Need many millions of events to get statistically significant results out for rare processes

• In my thesis I started with ~5M events to see an eventual “signal” of ~300
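A rough back-of-envelope check (my numbers, not the slides'): at ~1 MB per event, a billion events is already a petabyte, consistent with the 1-10 PB/year figure earlier:

    10^9 events x 1 MB/event = 10^15 bytes = 1 PB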


Page 4: Hadoop & HEP

What’s an event?

• We collide protons, which are made up of quarks

• Quarks interact to produce excited states of matter

• These excited states decay and we record the decay products

• We then work back from the products to “see” the original event

• Many events happen at once

• Think of working out how a carburettor works by crashing 6 cars together on a motorway


Page 5: Hadoop & HEP

An event


Page 6: Hadoop & HEP

Duplication of data

• We keep events in multiple “tiers” of data

• Each tier contains a subset of the information in its parent tier (a sketch follows this list)

• We do this to let people work on huge amounts of data quickly

• In reality this style of working hasn’t really kicked off yet, but it’s early days

• Data is housed at >1 site
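As a hedged illustration of the tier idea (the class and field names below are invented; CMS's actual tiers are RAW, RECO and AOD), each tier can be modelled as a projection of its parent:

    // Hypothetical sketch only: each tier keeps a subset of its parent.
    class RawEvent {                  // full detector readout, ~1 MB/event
        long eventId;
        byte[] detectorReadout;
    }

    class RecoEvent {                 // reconstructed physics objects
        long eventId;
        double[] trackMomenta;

        RecoEvent(RawEvent raw) {
            this.eventId = raw.eventId;
            this.trackMomenta = reconstruct(raw.detectorReadout);
        }

        static double[] reconstruct(byte[] readout) {
            return new double[0];     // stand-in for real reconstruction
        }
    }

    class AnalysisEvent {             // compact summary for end users
        long eventId;
        double missingEnergy;

        AnalysisEvent(RecoEvent reco) {
            this.eventId = reco.eventId;
            this.missingEnergy = 0.0; // stand-in for a derived quantity
        }
    }

Working from a smaller tier is what makes scanning huge datasets quick; only when something interesting turns up do you reach back to the parent tier.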


Page 7: Hadoop & HEP

Duplication of work

• One person’s signal is another’s background

• Common framework (CMSSW) for analysis but very little ability to share large amounts of work

• People coalesce into working groups, but these are generally small

• While everyone is trying to do the same thing they’re all trying to do it in different ways

• I suspect this is different from, say, Yahoo or last.fm


Page 8: Hadoop & HEP

How we work

• Large, ~dedicated compute farms

• PBS/Torque/Maui/SGE accessed via grid interface

• ACLs to prevent misuse of resources

• Not worried about people reading our data, but worried they might delete it accidentally

• Prevent DDoS


Page 9: Hadoop & HEP

Where we use Hadoop

• We currently use Hadoop’s HDFS at some of our T2 sites, mainly in the US (a minimal client-API sketch follows this list)

• Led by Nebraska, it has been very successful to date

• I suspect more people will switch as centres expand

• The administration tools, as well as the performance, are particularly appreciated

• Alternatives are academic/research projects and tend to have a different focus (pub for details/rants)

• Maintenance & stability of code a big issue

• Storage on worker nodes (WNs) is also interesting
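For a flavour of what using HDFS means at the code level, here is a minimal sketch against the standard Hadoop FileSystem API; the namenode URI and file path are invented:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical namenode for a T2 site.
            conf.set("fs.default.name", "hdfs://namenode.example.org:9000");
            FileSystem fs = FileSystem.get(conf);
            // Hypothetical path to an event data file.
            FSDataInputStream in = fs.open(new Path("/cms/run1234/events-000.dat"));
            try {
                IOUtils.copyBytes(in, System.out, 4096, false); // stream to stdout
            } finally {
                in.close();
            }
        }
    }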


Page 10: Hadoop & HEP

What would we have to do to run analysis with Hadoop?

• Split events sensibly over the cluster

• By event? by file? don’t care?

• Data files are ~2 GB; we’d need to reliably reconstruct them for export if we split them up (one whole-file option is sketched after this list)

• Have CMSSW run in Hadoop

• There are many, many pitfalls there; it may not even be possible...
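One possible answer to the "by event? by file?" question, as a sketch (my illustration, not an existing CMS tool): mark each ~2 GB file unsplittable, so a map task always sees a whole file and export needs no reassembly. Real event data would also need a custom RecordReader that understands the file format.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Whole-file splits: one ~2 GB event file goes to exactly one map
    // task, so files never have to be reconstructed from pieces.
    public class WholeFileInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // never split a file across map tasks
        }
    }

The trade-off is lumpier scheduling: whole-file tasks give up HDFS's block-level data locality within a file.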


Page 11: Hadoop & HEP

Metadata

• Lots of metadata associated with the data itself

• Moving that to HBase or similar and mining it with Hadoop would be interesting (sketched below)

• Currently this is stored in big Oracle databases

• Also, log mining - probably harder to get people interested in this
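As a hedged sketch of what metadata-in-HBase might look like (the table, column family and values are all invented; the calls are HBase's classic 0.20-era client API):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DatasetMetadata {
        public static void main(String[] args) throws Exception {
            // Assumes a table "datasets" with column family "info" exists.
            HTable table = new HTable(HBaseConfiguration.create(), "datasets");

            // Store one dataset's metadata, keyed by dataset path.
            Put put = new Put(Bytes.toBytes("/MinBias/Summer09/RAW"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("events"), Bytes.toBytes("5000000"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("site"), Bytes.toBytes("T2_US_Nebraska"));
            table.put(put);

            // Read it back.
            Result r = table.get(new Get(Bytes.toBytes("/MinBias/Summer09/RAW")));
            System.out.println(Bytes.toString(
                    r.getValue(Bytes.toBytes("info"), Bytes.toBytes("events"))));
            table.close();
        }
    }

Rows like this would then be scannable directly from MapReduce jobs via HBase's TableInputFormat, which is where the mining-with-Hadoop part would come in.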


Page 12: Hadoop & HEP

Issues

• Some analyses don’t map onto MapReduce

• Data is complex and in a weird file format

• CMSSW has a large memory footprint

• It’s not efficient to run only a few events, as start-up/tear-down is expensive (see the JVM-reuse note after this list)

• Sociologically it would be difficult to persuade people to move to MapReduce algorithms

• At least until people see the benefits; demonstrating those benefits is hard, since physicists don’t think in cost terms
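On the start-up/tear-down point above: Hadoop can at least amortize task start-up by reusing JVMs across tasks. A minimal sketch (this only helps with the JVM's own start-up; a long-lived CMSSW process would need its own wrapper):

    import org.apache.hadoop.mapred.JobConf;

    public class ReuseJvm {
        public static JobConf configure() {
            JobConf conf = new JobConf();
            // -1 = reuse each task JVM indefinitely, paying start-up cost
            // once per slot rather than once per task.
            conf.set("mapred.job.reuse.jvm.num.tasks", "-1");
            return conf;
        }
    }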
