Hadoop Basics - Venkat Cherukupalli

Upload: stuart-davis

Post on 12-Jan-2016


What is Hadoop?

Open Source

Distributed processing

Large data sets across clusters

Commodity, shared-nothing servers

Local computation and storage

Key Services

Hadoop Distributed File System (HDFS): reliable data storage

MapReduce: high-performance parallel data processing

HDFS

Splits user data across servers in a cluster

Replication - each block is copied to several nodes (three by default), so node failures do not cause data loss as long as one replica of each block survives

Reliable, scalable and low-cost storage

RAID-like redundancy at massive scale, across servers rather than disks

NameNode (file system metadata) and DataNodes (block storage)
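The splitting and replication described on this slide can be illustrated with a small simulation. This is only a sketch in plain Python, not HDFS code: the function names, the node names (`dn1` to `dn4`) and the tiny block size are made up for illustration, and the round-robin placement stands in for the rack-aware policy a real cluster uses.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int = 3) -> Dict[int, List[str]]:
    """Assign every block to `replication` distinct nodes, round-robin.

    Assumption: round-robin is a simplification chosen for this sketch.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        # Consecutive offsets modulo len(nodes) are always distinct nodes.
        placement[block_id] = [
            nodes[(block_id + k) % len(nodes)] for k in range(replication)
        ]
    return placement


# Hypothetical 1000-byte "file" and a four-node cluster.
blocks = split_into_blocks(b"x" * 1000, block_size=256)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

In this toy model the mapping from block to nodes plays the role of the NameNode's metadata, while the block contents would live on the DataNodes.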

MapReduce

Parallel distributed processing system

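No special parallel-programming techniques: the framework handles distribution and scheduling

Existing algorithms can be reused once they are expressed as map and reduce steps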

MapReduce Framework

• Processes large jobs in parallel across many nodes and combines results.

• Eliminates the bottlenecks imposed by monolithic storage systems.

• Results are collated and digested into a single output after each piece has been analyzed.
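The flow in these bullets can be sketched as a word count, the usual introductory MapReduce example. The code below is a single-process illustration in plain Python, not the Hadoop API: the function names are hypothetical, and on a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, the step between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
```

Each input line is analyzed independently in the map step, which is what lets the work be spread across nodes; only the reduce step needs to see all values for a given key.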

Self-healing

Shifts work from failed nodes to the remaining nodes

Creates additional copies of the data from the surviving replicas

Self-healing for both storage and computation

No sysadmin intervention
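The storage side of self-healing can be sketched as follows: when a node disappears, each block that lost a replica is copied from a surviving replica to another live node. This is a simplified model in plain Python with hypothetical names (`heal`, `dn1` to `dn4`), not the mechanism as implemented in HDFS.

```python
def heal(placement, live_nodes, replication=3):
    """Return a new placement with lost replicas re-created on live nodes."""
    healed = {}
    for block_id, holders in placement.items():
        survivors = [node for node in holders if node in live_nodes]
        if not survivors:
            # Every replica was on a failed node: nothing left to copy from.
            raise RuntimeError(f"block {block_id} lost: no surviving replica")
        # Live nodes that do not already hold this block, in a stable order.
        candidates = [n for n in sorted(live_nodes) if n not in survivors]
        missing = replication - len(survivors)
        healed[block_id] = survivors + candidates[:missing]
    return healed


# Hypothetical cluster state: dn2 has just failed.
placement = {0: ["dn1", "dn2", "dn3"], 1: ["dn2", "dn3", "dn4"]}
live = {"dn1", "dn3", "dn4"}
healed = heal(placement, live)
```

After the call, both blocks are back to three replicas and none of them refer to the failed node.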

What is Sqoop?

Imports individual tables or entire databases to files in HDFS

Generates Java classes to allow you to interact with your imported data

Provides the ability to import from SQL databases straight into your Hive data warehouse
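As a rough illustration of a single-table import, the helper below assembles a `sqoop import` command line. The helper itself, the JDBC URL, the table name, the user and the target directory are all hypothetical; `--connect`, `--username`, `-P`, `--table`, `--target-dir` and `--hive-import` are standard Sqoop import options, but check the documentation for the version you run.

```python
import shlex


def sqoop_import_args(jdbc_url, table, username,
                      target_dir=None, hive_import=False):
    """Build the argument list for a single-table `sqoop import` call."""
    args = ["sqoop", "import",
            "--connect", jdbc_url,
            "--username", username,
            "-P",  # prompt for the password instead of embedding it
            "--table", table]
    if target_dir:
        args += ["--target-dir", target_dir]  # HDFS directory for the files
    if hive_import:
        args.append("--hive-import")  # also load the data into Hive
    return args


# Hypothetical database and paths, for illustration only.
args = sqoop_import_args("jdbc:mysql://db.example.com/sales", "orders",
                         "etl_user", target_dir="/data/sales/orders")
command_line = shlex.join(args)
```

The resulting list could be passed to `subprocess.run` on a machine where Sqoop is installed; the sketch stops at building the command so that it runs anywhere.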

Other Concepts

HBase - an open source, non-relational, distributed database modeled after Google's Bigtable and written in Java

Hive - a data warehouse infrastructure built on top of Hadoop for data summarization, query, and analysis

Pig - a platform for creating MapReduce programs used with Hadoop

ZooKeeper - a service for reliable distributed coordination