scaling out with hadoop and hbase

36
Scaling Out Hadoop and NoSQL Age Mooij

Upload: age-mooij

Post on 01-Dec-2014

7.640 views

Category:

Technology


2 download

DESCRIPTION

A very high-level introduction to scaling out wth Hadoop and NoSQL combined with some experiences on my current project. I gave this presentation at the JFall 2009 conference in the Netherlands

TRANSCRIPT

Page 1: Scaling Out With Hadoop And HBase

Scaling Out Hadoop and NoSQL

Age Mooij

Page 2: Scaling Out With Hadoop And HBase

Big DataAn Introduction to Dealing with

Page 3: Scaling Out With Hadoop And HBase

About me...

@agemooij

Page 4: Scaling Out With Hadoop And HBase

...and meBig Data

Page 5: Scaling Out With Hadoop And HBase

IP Address Registration for Europe, Middle East, Russia

Ipv4: 232 (4.3×109) addressesIpv6: 2128 (3.4×1038) addresses

My Current Project...

Page 6: Scaling Out With Hadoop And HBase

Challenge

10 years of historical registration/routing data in flat files200+ billion (!) historical data records (25 TB)

30 billion records per year (4 TB)80 million per day / 1,000 per second

Make it searchable...

Page 7: Scaling Out With Hadoop And HBase

...and youBig Data

Page 8: Scaling Out With Hadoop And HBase

Twitter

Google AmazonYahoo

LinkedIn

FacebookeBay

Marktplaats

HyvesDigg

Flickr YouTube

WikipediaMySpace300M users

45M users

6.5M users, 5.5M ads

50M users

264M users

32M users

Page 9: Scaling Out With Hadoop And HBase

Scalability:

Handling more load / requestsHandling more data

Handling more types of data

...without anything breaking or falling over

...and without going bankrupt

Page 10: Scaling Out With Hadoop And HBase

UPOutOut

OutOut

OutOut

OutOut

OutOut

OutOut

OutOut

OutOut

OutOut

OutOut

OutOut

OutOut

VS

Page 11: Scaling Out With Hadoop And HBase

Scaling Out, Part 1

Processing Data

a.k.a. Data Crunching

Page 12: Scaling Out With Hadoop And HBase

Map/Reduce

Break the data into chunks

Process the chunks in parallel

Parallel Batch Processing of Data

Distribute the chunks

Merge the results

Page 13: Scaling Out With Hadoop And HBase

Reliable, Scalable, Distributed Computing

(written in Java)

Page 14: Scaling Out With Hadoop And HBase

Distributed File System (DFS)

Automatic file replication

Automatic checksumming / error correction

Foundation for all Hadoop projects

Based on Google’s File System (GFS)

Page 15: Scaling Out With Hadoop And HBase

Map / Reduce

Simple Java APIPowerful supporting frameworkPowerful toolsGood support for non-java languages

Page 16: Scaling Out With Hadoop And HBase
Page 17: Scaling Out With Hadoop And HBase

24 hours, about $240

4TB of raw image TIFF data (stored in S3)

100 Amazon EC2 instances

Hadoop Map/Reduce

11 million finished PDFs

Page 18: Scaling Out With Hadoop And HBase

Scaling Out, Part 1I

Storing & Retrieving DataReads and Writes

Page 19: Scaling Out With Hadoop And HBase

Relational Databases are hard to scale out

Page 20: Scaling Out With Hadoop And HBase

Replication

Master-Slave

Master-Master Limited scaling of writes

Single point of failureSingle point of bottleneck

Ways to Scale out an RDBMS (1)

Good for scaling reads

Complicated

Page 21: Scaling Out With Hadoop And HBase

Partitioning

Vertical : by function / table

Not truly Relational anymore (application joins)

Horizontal : by key / id (Sharding)

Limited Scalability (relocating, resharding)

Ways to Scale out an RDBMS (2)

Page 22: Scaling Out With Hadoop And HBase

Why are RDBMSsso hard toscale out

Page 23: Scaling Out With Hadoop And HBase

ConsistencyAvailabilityPartition Tolerance ...pick any two

Brewer’s CAP Theorem

Page 24: Scaling Out With Hadoop And HBase

ACID vs BASE

AtomicConsistentIsolatedDurable

BasicAvailabilitySoft StateEventual Consistency

Relational Non-Relational

Page 25: Scaling Out With Hadoop And HBase

NoSQL NO-SQL

Better Different

Non-Relational Databases

Page 26: Scaling Out With Hadoop And HBase

Types of NOSQL

(Distributed) Key-Value

Document Oriented

Column Oriented

Graph Oriented

RedisVoldemortScalaris (D)

CouchDBMongoDBRiak (D)

Cassandra (D)HBase (D)

Neo4J

(D) = Distributed (automatic out scaling)

Page 27: Scaling Out With Hadoop And HBase

RIPE NCC

Experiences so far...

Page 28: Scaling Out With Hadoop And HBase

Those Big Numbers Again...

10 years of historical data in flat files200+ billion (!) historical data records (25 TB)

30 billion records per year (4 TB)80 million per day / 1,000 per second

Make it searchable...

Page 29: Scaling Out With Hadoop And HBase

~ 200 000 000 000 records

~ 15 000 000 000 records

Map / Reduce

Page 30: Scaling Out With Hadoop And HBase

TimestampRecord

Our Data is 3D

IP Address Record Timestamp

Row Column Name (!) Values (!)

Best fit & performance:Column Oriented

1 10..* 0..*

Page 31: Scaling Out With Hadoop And HBase

CassandraDigg

Facebook

0.4.1

Tunable: Availability vs Consistency

Twitter

Very active community

No documentation

Page 32: Scaling Out With Hadoop And HBase

Tumblr

Yahoo

StumbleUpon

Meetup

Streamy

Adobe

0.20.1Good Documentation

Very active communityBuilt on top of Hadoop DFS

Page 33: Scaling Out With Hadoop And HBase

Initial Results:Tested on an EC2 cluster of 8 XLarge instances

3.8 B (23 GB) 33 M (1 GB)5 hours

33 M (1 GB)

75 minutes “Needle in a haystack” full on-disk table scan:44000 inserts/second 0.5 M records/second

15 GBRecord duplication: 6x

Page 34: Scaling Out With Hadoop And HBase

In order to choose the right scaling tools, you need to:

Know what you want to query and howUnderstand your data

Page 35: Scaling Out With Hadoop And HBase

Big Data...Be Prepared !

Page 36: Scaling Out With Hadoop And HBase

Try some Scala in the basement !

val shameless = <SelfPromotion>

</SelfPromotion>