
Large-scale data processing
at SARA and BiG Grid
with Apache Hadoop

Evert Lammerts
April 10, 2012, SZTAKI

First off...

About me

Consultant for SARA's eScience & Cloud Services
Technical lead for LifeWatch Netherlands
Lead for the Hadoop infrastructure

About you

Who uses large-scale computing as a supporting tool?
For whom is large-scale computing core business?

In this talk

Large-scale data processing?

Large-scale @ SARA & BiG Grid

An introduction to Hadoop & MapReduce

Hadoop @ SARA & BiG Grid

Large-scale data processing?
Large-scale @ SARA & BiG Grid
An introduction to Hadoop & MapReduce
Hadoop @ SARA & BiG Grid

Three observations

I: Data is easier to collect

(Jimmy Lin, University of Maryland / Twitter, 2011)

More business is done on-line
Mobile devices are more sophisticated
Governments collect more data
Sensing devices are becoming a commodity
Technology advanced: DNA sequencers!
Enormous funding for research infrastructures
And so on...

Lesson: everybody collects data

Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2011-2016

Three observations

II: Data is easier to store

Storage price decreases

http://www.mkomo.com/cost-per-gigabyte

Storage capacity increases

http://en.wikipedia.org/wiki/File:Hard_drive_capacity_over_time.svg

Three observations

III: Quantity beats quality

(IEEE Intelligent Systems, March/April 2009, vol. 24, issue 2, pp. 8-12)

s/knowledge/data/g

(Jimmy Lin, University of Maryland / Twitter, 2011)

How are these observations addressed?

We collect data, we store data, we have the knowledge to interpret data. What tools do we have that bring these together?

Pioneers: HPC centers, universities, and in recent years, Internet companies. (Lots of knowledge exchange, by the way.)

Some background (bear with me...) 1/3

Amdahl's Law
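As a refresher: Amdahl's Law bounds the speedup S of a program on N processors when only a fraction p of the work can be parallelized:

S(N) = 1 / ((1 - p) + p/N)

As N grows, S(N) approaches 1/(1 - p): the serial fraction dominates no matter how many machines you add.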

Some background (bear with me...) 2/3

(The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)

Some background (bear with me...) 3/3

(The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)

Nodes (x2000): 8 GB DRAM, 4 x 1 TB disks

Rack: 40 nodes, 1 Gbps switch

Datacenter: 8 Gbps rack-to-cluster switch connection

(NYT, 14/06/2006)

Large-scale data processing?
Large-scale @ SARA & BiG Grid
An introduction to Hadoop & MapReduce
Hadoop @ SARA & BiG Grid

SARA
the national center for scientific computing

Facilitating Science in The Netherlands with Equipment for and Expertise on Large-Scale Computing, Large-Scale Data Storage, High-Performance Networking, eScience, and Visualization

Large-scale data != new

Compute @ SARA

How do categories in Wikipedia evolve over time?
(And how do they relate to internal links?)

2.7 TB raw text, single file

Java application, searches for
categories in Wiki markup,
like [[Category:NAME]]

Executed on the Grid

http://simshelf2.virtualknowledgestudio.nl/activities/biggrid-wikipedia-experiment
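To make the markup scanning concrete, a minimal sketch in Java of the kind of extraction described above (hypothetical code, not the actual Virtual Knowledge Studio application):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CategoryExtractor {
    // Matches wiki category tags such as [[Category:Distributed computing]]
    private static final Pattern CATEGORY =
            Pattern.compile("\\[\\[Category:([^\\]|]+)");

    public static void main(String[] args) {
        String markup = "Article text... [[Category:Distributed computing]] more text";
        Matcher m = CATEGORY.matcher(markup);
        while (m.find()) {
            System.out.println(m.group(1).trim()); // prints the category name
        }
    }
}

Running one such process over 2.7 TB of markup is what forces the split-and-distribute workflow described below.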

Case Study: Virtual Knowledge Studio

Method:
Take an article, including history, as input

Extract categories and links for each revision

Output all links for each category, per revision

Aggregate all links for each category, per revision

Generate graph linking all categories on links, per revision

1.1) Copy file from local machine to Grid storage

2.1) Stream file from Grid Storage to a single machine
2.2) Cut into pieces of 10 GB
2.3) Stream back to Grid Storage

3.1) Process all files in parallel: N machines run the Java application, each fetching a 10 GB file as input, processing it, and putting the result back

Case Study: Virtual Knowledge Studio

Large-scale data processing?
Large-scale @ SARA & BiG Grid
An introduction to Hadoop & MapReduce
Hadoop @ SARA & BiG Grid

A bit of history

2002: Nutch*

2004: MR/GFS**

2006: Hadoop

* http://nutch.apache.org/
** http://labs.google.com/papers/mapreduce.html and http://labs.google.com/papers/gfs.html

http://wiki.apache.org/hadoop/PoweredBy

2010 - 2012: A Hype in Production

What's different about Hadoop?

No more do-it-yourself parallelism (it's hard!),
but rather linearly scalable data parallelism

Separating the what from the how

(The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)

Core principles

Scale out, not up

Move processing to the data

Process data sequentially, avoid random reads

Seamless scalability

(Jimmy Lin, University of Maryland / Twitter, 2011)

A typical data-parallel problem in abstraction

Iterate over a large number of records

Extract something of interest

Create an ordering in intermediate results

Aggregate intermediate results

Generate output

MapReduce: functional abstraction of step 2 & step 4

(Jimmy Lin, University of Maryland / Twitter, 2011)

MapReduce

Programmer specifies two functions:

map(k, v) → [(k', v')]

reduce(k', [v']) → [(k'', v'')]

All values associated with a single key are sent to the same reducer

The framework handles the rest

The rest?

Scheduling, data distribution, ordering, synchronization, error handling...
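To make the two functions concrete, a minimal word count in the standard Hadoop Java API (the canonical MapReduce example; class names here are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map(k, v): for each word in a line of input, emit (word, 1)
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// reduce(k', [v']): sum all counts emitted for the same word
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        context.write(word, new IntWritable(sum));
    }
}

Everything between the two functions (partitioning, shuffling, sorting) is the framework's job.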

1) Load file into
HDFS

2) Submit code to MR

Case Study: Virtual Knowledge Studio

Automatic distribution of data,
Parallelism based on data,
Automatic ordering of intermediate results

This is how it would be done with Hadoop
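For the case study itself, the map might look like the following hypothetical sketch (not the Virtual Knowledge Studio's actual code): each input record is the markup of one revision, and the map emits a (category, link) pair for every combination found, so the framework delivers all links of a category to one reducer for aggregation.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: input value = wiki markup of a single revision
public class CategoryLinkMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final Pattern CATEGORY =
            Pattern.compile("\\[\\[Category:([^\\]|]+)");
    private static final Pattern LINK =
            Pattern.compile("\\[\\[(?!Category:)([^\\]|]+)");

    @Override
    protected void map(LongWritable offset, Text revision, Context context)
            throws IOException, InterruptedException {
        String markup = revision.toString();
        List<String> links = new ArrayList<>();
        Matcher link = LINK.matcher(markup);
        while (link.find()) links.add(link.group(1).trim());

        Matcher cat = CATEGORY.matcher(markup);
        while (cat.find()) {
            for (String l : links) {
                // key = category, value = link; the shuffle groups by category
                context.write(new Text(cat.group(1).trim()), new Text(l));
            }
        }
    }
}

Loading the 2.7 TB input and running the job then come down to two commands, roughly hadoop fs -put wiki.txt /data/ and hadoop jar job.jar (paths illustrative); no manual cutting into 10 GB pieces.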

The ecosystem

The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012

Large-scale data processing?
Large-scale @ SARA & BiG Grid
An introduction to Hadoop & MapReduce
Hadoop @ SARA & BiG Grid

Timeline

2009: Piloting Hadoop on Cloud

2010: Test cluster available for scientists
6 machines * 4 cores / 24 TB storage / 16 GB RAM
Just me!

2011: Funding granted for production service

2012: Production cluster available (~March)
72 machines * 8 cores / 8 TB storage / 64 GB RAM
Integration with Kerberos for secure multi-tenancy
3 devops, team of consultants

Architecture

Components

Hadoop, Hive, Pig, HBase, HCatalog - others?

What are scientists doing?

Information Retrieval

Natural Language Processing

Machine Learning

Econometrics

Bioinformatics

Computational Ecology / Ecoinformatics

Machine learning: Infrawatch, Hollandse Brug

Structural health monitoring

145 sensors x 100 Hz x 60 seconds x 60 minutes x 24 hours x 365 days = large data

(Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
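Spelled out, that product is:

145 x 100 = 14,500 readings per second
14,500 x 86,400 seconds/day ≈ 1.25 x 10^9 readings per day
1.25 x 10^9 x 365 ≈ 4.6 x 10^11 readings per year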

And others: NLP & IR

e.g. ClueWeb: a ~13.4 TB webcrawl

e.g. Twitter gardenhose data

e.g. Wikipedia dumps

e.g. del.icio.us & Flickr tags

Finding named entities: person / company / place names

Creating inverted indexes

Piloting real-time search

Personalization

Semantic web

Interest from industry

We're opening up shop.

Experiences: Data Science

DevOps

Programming algorithms

Domain knowledge

Experience: How we embrace Hadoop

Parallelism has never been easy... so we teach!
December 2010: hackathon (~50 participants - full)

April 2011: Workshop for Bioinformaticians

November 2011: 2-day PhD course (~60 participants - full)

June 2012: 1-day PhD course

The data scientist is still in school... so we fill the gap!
Devops maintain the system, fix bugs, develop new functionality

Technical consultants learn how to efficiently implement algorithms

Users bring domain knowledge

Methods are developing fast... so we build the community!
Netherlands Hadoop User Group

http://www.nlhug.org/

Final thoughts

Hadoop is the first to provide commodity computing
Hadoop is not the only one

Hadoop is probably not the best

Hadoop has momentum

What degree of diversification of infrastructure should we embrace?
MapReduce fits surprisingly well as a programming model for data-parallelism

Where is the data scientist?
Teach. A lot. And work together.
