Introduction to Apache Hadoop
Zibo Wang
Introduction What is Apache Hadoop?
Apache Hadoop is an open-source software framework that provides libraries for data-intensive computing through a simple MapReduce interface and its own distributed file system, HDFS.
Started by Doug Cutting and Mike Cafarella. Written in Java.
Introduction The use of Hadoop
Compute, Storage, Database
The advantages of Hadoop: Scalable Algorithms, Log Management, Extract-Transform-Load (ETL) Platform
Map-Reduce Introduced by Google
A simple and powerful interface that enables automatic parallelization and distribution of large-scale computation.
Two major functions: Map and Reduce
Nodes and trackers
Map-Reduce
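To make the two functions concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps using word counting as the example. It does not use the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. In a real cluster the framework runs many map and reduce tasks in parallel across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for the step between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map over every line, shuffle by key, then reduce each group."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

Because each call to map_phase depends only on its own line, and each call to reduce_phase only on its own key, both steps can be distributed without coordination, which is what allows automatic parallelization.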
Hadoop Distributed File System (HDFS) It has a large block size (default 64 MB) so that the time spent seeking is small compared with the time spent transferring data. Very large files are therefore ideal for storage.
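The effect of block size can be checked with simple arithmetic. The sketch below assumes a 10 ms seek time and a 100 MB/s transfer rate; both figures are illustrative assumptions, not values from the slides, and the helper names are hypothetical. Only the 64 MB default comes from the text above.

```python
import math

MB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes=64 * MB):
    """Number of blocks needed to store a file of the given size."""
    return math.ceil(file_size_bytes / block_size_bytes)


def seek_overhead(block_size_bytes,
                  seek_seconds=0.010,                # assumed seek time
                  transfer_bytes_per_second=100 * MB):  # assumed transfer rate
    """Fraction of the time to read one block that is spent seeking."""
    transfer_seconds = block_size_bytes / transfer_bytes_per_second
    return seek_seconds / (seek_seconds + transfer_seconds)


if __name__ == "__main__":
    print("Blocks in a 1 GB file:", block_count(1024 * MB))
    print("Seek overhead, 64 MB block: {:.1%}".format(seek_overhead(64 * MB)))
    print("Seek overhead, 4 KB block: {:.1%}".format(seek_overhead(4 * 1024)))
```

Under these assumptions, seeking takes under 2% of the read time for a 64 MB block, while for a 4 KB block almost all of the time goes to seeking.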
Streaming data access. Write-once, read-many-times architecture. Because files are large, the time to read the whole dataset matters more than the latency of seeking to the first record.
Commodity hardware. It is designed to run on commodity hardware, which may fail. HDFS is designed to tolerate such failures.
HDFS Architecture: Filesystem Metadata, Framework of write, Framework of read
Prominent Users of Hadoop Yahoo!
A Linux cluster with more than 10,000 cores. Open source.
Facebook 30 PB data
Amazon Amazon Elastic Compute Cloud Amazon Simple Storage Service
Thank you!