Introduction to Apache Hadoop Zibo Wang


Post on 12-Jan-2016



Introduction to Apache Hadoop

Zibo Wang


Introduction

What is Apache Hadoop?

Apache Hadoop is an open-source software framework that provides libraries for data-intensive computing through a simple map-reduce interface and its own distributed file system, HDFS.

Started by Doug Cutting and Mike Cafarella. Written in Java.


Introduction

The use of Hadoop: compute, storage, database.

The advantages of Hadoop: scalable algorithms, log management, Extract-Transform-Load (ETL) platform.


Map-Reduce

Introduced by Google.

A simple and powerful interface that enables automatic parallelization and distribution of large-scale computation.

Two major functions: Map and Reduce.

Nodes and trackers.
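
To make the two functions concrete, here is a minimal word-count sketch in plain Java. It deliberately avoids the Hadoop API and runs in a single process, so it only illustrates the shape of a map step, a grouping step, and a reduce step. The class name WordCountSketch and its method names are made up for this example and are not part of Hadoop.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of map and reduce; hypothetical names, no Hadoop API used.
class WordCountSketch {

    // Map: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce: combine all values emitted for one key into a single count.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Driver: map every line, group the pairs by key, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hadoop stores data", "hadoop processes data");
        System.out.println(run(lines));
    }
}
```

In a real cluster the map calls and the reduce calls would run on different nodes, and the framework would handle the grouping between them; here the grouping is a single in-memory map.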


Map-Reduce


Hadoop Distributed File System (HDFS)

Large block size. HDFS uses a large block size (default 64 MB) so that the cost of a seek is small compared with the time spent transferring the block. Very large files are therefore ideal.

Streaming data access. HDFS follows a write-once, read-many-times architecture. Because files are large, the time to read the whole file matters more than the latency of seeking to the first record.

Commodity hardware. HDFS is designed to run on commodity hardware, which may fail, and it is capable of handling such failures.
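
The effect of the large block size can be checked with some rough arithmetic. The sketch below counts the blocks needed for a file and estimates what fraction of the time to read one block goes to seeking. The 10 ms seek time and 100 MB/s transfer rate are assumed round numbers for illustration, not measurements, and the class BlockSizeSketch is hypothetical.

```java
// Back-of-the-envelope arithmetic for block sizes; disk figures are assumptions.
class BlockSizeSketch {
    static final long MB = 1024L * 1024L;

    // Number of blocks needed to store a file (the last block may be partly filled).
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Fraction of the time to read one block that is spent seeking, not transferring.
    static double seekOverhead(long blockBytes, double seekSeconds, double bytesPerSecond) {
        double transferSeconds = blockBytes / bytesPerSecond;
        return seekSeconds / (seekSeconds + transferSeconds);
    }

    public static void main(String[] args) {
        double seek = 0.010;       // assumed 10 ms average seek time
        double rate = 100.0 * MB;  // assumed 100 MB/s sustained transfer rate
        long file = 1024 * MB;     // a 1 GB file

        // Compare a small 4 KB block with the 64 MB default mentioned above.
        for (long block : new long[] {4 * 1024L, 64 * MB}) {
            System.out.printf("block=%d bytes, blocks=%d, seek overhead=%.1f%%%n",
                    block, blockCount(file, block), 100 * seekOverhead(block, seek, rate));
        }
    }
}
```

Under these assumptions a 64 MB block spends under 2% of its read time on the seek, while a 4 KB block spends almost all of it seeking, which is why large blocks and large files suit this design.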


HDFS Architecture

Filesystem metadata

Framework of write

Framework of read


Prominent Users of Hadoop

Yahoo!: a Linux cluster with more than 10,000 cores; open source.

Facebook: 30 PB of data.

Amazon: Amazon Elastic Compute Cloud, Amazon Simple Storage Service.


Thank you!