O’Reilly – Hadoop: The Definitive Guide
Ch.1 Meet Hadoop
May 28th, 2010
Taewhi Lee

Outline
Data!
Data Storage and Analysis
Comparison with Other Systems
– RDBMS
– Grid Computing
– Volunteer Computing
The Apache Hadoop Project

‘Digital Universe’ Nears a Zettabyte

Digital Universe: the total amount of data stored in the world’s computers
Zettabyte: 10^21 bytes >> Exabyte >> Petabyte >> Terabyte
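
To make the unit ladder on this slide concrete, here is a small sketch of the ratios between the units. It assumes decimal (SI) prefixes, which matches the 10^21 figure above; the `UNITS` table and `ratio` helper are illustrative names, not part of the slides.

```python
# Decimal (SI) storage units, expressed in bytes (assumption: SI, not binary, prefixes).
UNITS = {
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
}


def ratio(larger: str, smaller: str) -> int:
    """Return how many `smaller` units fit in one `larger` unit."""
    return UNITS[larger] // UNITS[smaller]


print(ratio("zettabyte", "exabyte"))   # 1000
print(ratio("zettabyte", "terabyte"))  # 1000000000
```

Each step up the ladder is a factor of one thousand, so a zettabyte is a billion terabytes.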

Flood of Data

The NYSE generates about 1 TB of new trade data per day

Facebook hosts about 10 billion photos, taking up 1 petabyte of storage

The Internet Archive stores about 2 petabytes of data

Individuals’ Data are Growing Apace

It is becoming ever easier to take more and more photos

Microsoft Research’s MyLifeBits Project
[Figure: “LifeLog, my life in a terabyte”; diagram labels: SQL, Capture and encoding]

Amount of Public Data Increases

Available Public Data Sets on AWS
– Annotated Human Genome
– Public database of chemical structures
– Various census data and labor statistics

Large Data!

How do we store and analyze large data?

“More data usually beats better algorithms”

Current HDD
Capacity: 1 TB
Transfer rate: 100 MB/s

How long does it take to read all the data off the disk?

How about using multiple disks?
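
The question on this slide can be answered with simple arithmetic. The sketch below uses the slide's figures (1 TB, 100 MB/s) with decimal units; the helper name `read_time_seconds` and the choice of 100 disks for the parallel case are assumptions made for illustration, and the model ignores seek time and other overheads.

```python
TB = 10**12  # bytes, decimal units (assumption)
MB = 10**6   # bytes


def read_time_seconds(capacity_bytes, rate_bytes_per_s, num_disks=1):
    """Time to read a dataset split evenly across num_disks read in parallel.

    Idealized model: no seek time, no contention, perfectly even split.
    """
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    return capacity_bytes / num_disks / rate_bytes_per_s


single = read_time_seconds(1 * TB, 100 * MB)
parallel = read_time_seconds(1 * TB, 100 * MB, num_disks=100)  # 100 disks is hypothetical

print(f"one disk: {single / 3600:.2f} hours")    # one disk: 2.78 hours
print(f"100 disks: {parallel:.0f} seconds")      # 100 disks: 100 seconds
```

Under these assumptions a single disk needs close to three hours, while spreading the same data over 100 disks and reading them in parallel brings the time under two minutes.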

Problems with Multiple Disks
Hardware failure
Most analysis tasks need to combine the distributed data

What Hadoop Provides
– Reliable shared storage (HDFS)
– Reliable analysis system (MapReduce)
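
As a rough illustration of how MapReduce combines data that lives in separate pieces, here is a minimal single-process sketch of the map, shuffle, and reduce steps using word counting. It is plain Python, not the Hadoop API; the function names and the sample `splits` are hypothetical, and a real job would run the map and reduce functions on different machines.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def run_job(splits):
    """Run map over every record of every split, then shuffle, then reduce."""
    mapped = [pair
              for split in splits
              for record in split
              for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Each inner list stands in for the data held on one disk (hypothetical sample data).
splits = [
    ["hadoop stores data", "hadoop analyzes data"],
    ["data beats algorithms"],
]
counts = run_job(splits)
print(counts["data"])    # 3
print(counts["hadoop"])  # 2
```

The shuffle step is where records from different splits meet: every value for the same key ends up in one group regardless of which disk it came from.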

RDBMS

* Low latency for point queries or updates
** Update times of a relatively small amount of data

Grid Computing

Shared storage (SAN)
Works well for predominantly CPU-intensive jobs
Becomes a problem when nodes need to access large data volumes

Volunteer Computing
Volunteers donate CPU time from their idle computers
Work units are sent to computers around the world
Suitable for very CPU-intensive work with small data sets
Risky due to running work on untrusted machines

Brief History of Hadoop
Created by Doug Cutting
Originated in Apache Nutch (2002)
– Open source web search engine, a part of the Lucene project
NDFS (Nutch Distributed File System, 2004)
MapReduce (2005)
Doug Cutting joins Yahoo! (Jan 2006)
Official start of Apache Hadoop project (Feb 2006)
Adoption of Hadoop by the Yahoo! Grid team (Feb 2006)

The Apache Hadoop Project

Pig, Chukwa, Hive, HBase
MapReduce, HDFS, ZooKeeper
Core, Avro