Hadoop Basics -Venkat Cherukupalli

Upload: stuart-davis

Post on 12-Jan-2016

What is Hadoop?

Open source

Distributed processing

Large data sets across clusters

Commodity, shared-nothing servers

Local computation and storage

Key Services

Hadoop Distributed File System (HDFS): reliable data storage

MapReduce: high-performance parallel data processing

HDFS

Splits user data across servers in a cluster

Replication: each block is copied to several nodes, so data survives multiple node failures as long as one replica remains

Reliable, scalable and low-cost storage

RAID-like redundancy, applied at massive scale across servers

NameNode (file system metadata) and DataNodes (block storage)
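
The splitting and replication described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the tiny block size, the node names, and the round-robin placement are all hypothetical. Real HDFS uses much larger blocks and a rack-aware placement policy decided by the NameNode.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: round-robin stands in for the real placement policy.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)]
            for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(b"abcdefghij", block_size=4)
    nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
    for block_id, holders in place_replicas(len(blocks), nodes).items():
        print(block_id, blocks[block_id], holders)
```

In this toy model the placement map plays the role of the NameNode's metadata, while the node names stand in for DataNodes that hold the actual bytes.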

HDFS

MapReduce

Parallel distributed processing system

No special programming techniques

Existing algorithms work without change

MapReduce Framework

• Processes large jobs in parallel across many nodes and combines results.

• Eliminates the bottlenecks imposed by monolithic storage systems.

• Results are collated and digested into a single output after each piece has been analyzed.
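
The map, combine, and collate steps listed above can be imitated in a single Python process. The word-count job below is a minimal sketch, not Hadoop's API: the function names are hypothetical, and on a real cluster each map call would run on a different node against its local block of data.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map over every line, shuffle, then reduce into one output dict."""
    emitted = []
    for line in lines:  # on a cluster these calls would run in parallel
        emitted.extend(map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


if __name__ == "__main__":
    print(run_job(["Hadoop stores data", "Hadoop processes data"]))
```

Because each map call only sees its own line and each reduce call only sees one key, the pieces can be analyzed independently and merged at the end, which is the property the slide describes.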

Self-healing

Shifts work to the remaining nodes when a node fails

Creates additional copies of the data from the surviving replicas

Self-healing for both storage and computation

No sysadmin intervention
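
The storage side of self-healing can be sketched as a re-replication pass over a block-to-node map. This is a simplified model under assumptions: the function name, the node names, and the "first available node" choice are hypothetical, and it says nothing about how Hadoop itself detects failures or schedules the copies.

```python
def re_replicate(placement, failed, live_nodes, replication=3):
    """Return a new placement in which blocks that lost a replica on
    `failed` are copied to other live nodes until the target is restored."""
    healed = {}
    for block_id, nodes in placement.items():
        survivors = [n for n in nodes if n != failed]
        # Nodes that are alive and do not already hold this block.
        candidates = [n for n in live_nodes
                      if n != failed and n not in survivors]
        while len(survivors) < replication and candidates:
            survivors.append(candidates.pop(0))
        healed[block_id] = survivors
    return healed


if __name__ == "__main__":
    before = {0: ["dn1", "dn2", "dn3"], 1: ["dn2", "dn3", "dn4"]}
    after = re_replicate(before, failed="dn2",
                         live_nodes=["dn1", "dn3", "dn4"])
    print(after)
```

If too few live nodes remain, the loop simply stops, leaving the block under-replicated rather than lost, which matches the idea that data survives as long as one replica does.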

What is Sqoop?

Imports individual tables or entire databases to files in HDFS

Generates Java classes to allow you to interact with your imported data

Provides the ability to import from SQL databases straight into your Hive data warehouse

http://hadoop.apache.org/hive

Other Concepts

HBase: an open source, non-relational, distributed database written in Java and modeled after Google's Bigtable

Hive: a data warehouse infrastructure built on top of Hadoop that provides data summarization, query, and analysis

Pig: a platform for creating MapReduce programs used with Hadoop

ZooKeeper: reliable distributed coordination

http://en.wikipedia.org/wiki/Data_warehouse

http://en.wikipedia.org/wiki/Hadoop

http://en.wikipedia.org/wiki/MapReduce
