Back to School – Hadoop Ecosystem Overview
St. Louis Hadoop Meetup, September 2016
Matt Miller, Solutions Engineer, MapR Technologies


TRANSCRIPT

Page 1: Back to School - St. Louis Hadoop Meetup September 2016

© 2016 MapR Technologies

Back to School – Hadoop Ecosystem Overview

Matt Miller, Solutions Engineer
September 2016

Page 2: Back to School - St. Louis Hadoop Meetup September 2016


Roadmap
• What is Apache?
• Hadoop Timeline and Level Set
• Hadoop Suite of Tools
  1. Hive
  2. Sqoop
  3. Pig
  4. Oozie
  5. HBase
  6. Flume
  7. Kafka
  8. Drill
  9. YARN
  10. ZooKeeper
• Use Cases
• Q&A

Page 3: Back to School - St. Louis Hadoop Meetup September 2016


What is Apache?
• Non-profit organization (the Apache Software Foundation)

• Governs the development of open source “Projects”

• “Top Level” projects are the most prominent

• Features “committers” from all over the world

Page 4: Back to School - St. Louis Hadoop Meetup September 2016


Hadoop Timeline

2003: GFS white paper published

2004: MapReduce white paper published

2006: Hadoop is born (HDFS + MapReduce)

2007 - Present: Hadoop continuously evolves; new tools are released to improve usability and make it easier to adopt

2009: Hadoop distributions start popping up

2016: Organized chaos: new projects are released every few months, and only the winners gain traction

Page 5: Back to School - St. Louis Hadoop Meetup September 2016


What is Hadoop?

Distributed File System + Processing Engine
• HDFS (storage)
• MapReduce (processing)

Page 6: Back to School - St. Louis Hadoop Meetup September 2016


What is MapReduce?
• Three-phase program model built for distributed processing
  – Map
  – Shuffle/Sort
  – Reduce

• Processing overhead associated with MR jobs (~30 seconds)

• Heavy disk usage
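
To make the Map / Shuffle-Sort / Reduce flow concrete, here is a minimal sketch using the word-count example that ships with Hadoop; the jar path and HDFS directories below are assumptions and vary by distribution and version.

  # Stage some input in HDFS, then run the bundled word-count MapReduce job.
  hadoop fs -mkdir -p /user/demo/input
  hadoop fs -put shakespeare.txt /user/demo/input/

  # The examples jar ships with Hadoop; its exact path depends on your install.
  hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /user/demo/input /user/demo/output

  # Each reducer writes one part file; inspect the results.
  hadoop fs -cat /user/demo/output/part-r-00000 | head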

Page 7: Back to School - St. Louis Hadoop Meetup September 2016


1.) Hive
• First SQL on Hadoop; HiveQL is the language

• Hadoop data warehousing tool

• Converts HiveQL into MapReduce jobs

• Bash, Java, and Python scripts can execute Hive commands

• Not ANSI SQL compliant, but very similar

Use Hive for long-running batch jobs, not ad-hoc queries (a quick sketch follows)
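
As a hedged sketch of what that looks like in practice: the table, columns, and HDFS path below are invented for illustration, but the pattern (an external table over files already in HDFS, queried with HiveQL from the shell) is the standard one.

  # Define an external table over CSV files already in HDFS, then aggregate.
  # Hive compiles the SELECT into MapReduce jobs behind the scenes.
  hive -e "
    CREATE EXTERNAL TABLE IF NOT EXISTS sales (
      order_id INT,
      region   STRING,
      amount   DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/sales';

    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region;
  "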

Page 8: Back to School - St. Louis Hadoop Meetup September 2016


2.) Sqoop
• RDBMS connector for Hadoop

• Execute Sqoop scripts via the command line

• Sqoop can move schemas, tables, or SELECT statement results

• Helps improve ETL or enable data warehouse offload

Use Sqoop anytime data needs to move to/from an RDBMS
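
A minimal command-line sketch, assuming a MySQL source; the JDBC URL, credentials, table, and target directory are placeholders.

  # Pull a table from the RDBMS into HDFS with four parallel map tasks.
  sqoop import \
    --connect jdbc:mysql://dbhost:3306/salesdb \
    --username etl_user -P \
    --table orders \
    --target-dir /data/orders \
    --num-mappers 4

  # Or land it directly in a Hive table for data warehouse offload.
  sqoop import \
    --connect jdbc:mysql://dbhost:3306/salesdb \
    --username etl_user -P \
    --table orders \
    --hive-import --hive-table sales.orders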

Page 9: Back to School - St. Louis Hadoop Meetup September 2016


3.) Pig
• High-level coding language for processing data

• Language used to express data flows is called Pig Latin

• Pig turns data flows into a series of MR jobs

• Can run in a single JVM or on a Hadoop Cluster

• User Defined Functions (UDFs) make Pig code easy to repurpose

Use Pig to speed up the development process (see the sketch below)
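
Here is a small, illustrative Pig Latin data flow (the classic word count); the file names and aliases are made up, and the same script runs unchanged in a single local JVM or as MapReduce jobs on the cluster.

  -- wordcount.pig: load lines, split into words, count occurrences per word
  lines  = LOAD 'input.txt' AS (line:chararray);
  words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
  grpd   = GROUP words BY word;
  counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
  STORE counts INTO 'wordcount_out';

  # Single-JVM local mode for development, or MapReduce mode on the cluster:
  pig -x local wordcount.pig
  pig -x mapreduce wordcount.pig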

Page 10: Back to School - St. Louis Hadoop Meetup September 2016


4.) Oozie
• Workflow orchestration

• Schedule tasks to be completed based on time or completion of a previous task

• Used for Automation

• Develop these workflows either in a GUI or in XML
  – Hint: the GUI is much, much, MUCH simpler

Use Oozie when you need workflows
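
A hedged CLI sketch: a job.properties file (pointing at a workflow.xml stored in HDFS) is assumed to exist, and the Oozie server host/port are placeholders.

  # Submit and start a workflow; job.properties supplies nameNode, resource
  # manager, and oozie.wf.application.path (the workflow.xml location in HDFS).
  oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run

  # Check on a workflow using the job id returned by the previous command.
  oozie job -oozie http://oozie-host:11000/oozie -info <job-id>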

Page 11: Back to School - St. Louis Hadoop Meetup September 2016


5.) HBase
• Database built on HDFS

• Meant for big and fast data

• HBase is a NoSQL database
  – There are multiple types of NoSQL databases: wide-column stores, document DBs, graph DBs, and key-value stores
  – HBase is a wide-column store

Use HBase when “real-time read/write access to very large datasets” is required (see the sketch below)
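
For a feel of the wide-column model, here are a few commands from the HBase shell; the table, column family, and row key names are invented.

  # Inside 'hbase shell':
  create 'player_stats', 'stats'                        # table with one column family
  put 'player_stats', 'player42', 'stats:kills', '17'   # write individual cells
  put 'player_stats', 'player42', 'stats:wins',  '3'
  get 'player_stats', 'player42'                        # low-latency random read by row key
  scan 'player_stats', {LIMIT => 10}                    # scan a range of rows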

Page 12: Back to School - St. Louis Hadoop Meetup September 2016


6.) Flume
• Meant for ingesting streams of data

• Runs on the same cluster and stores data in HDFS
  – Also flexible enough to stream into HBase or Solr

• Flume PUSHES data to its destination

• Flume does NOT store data within itself

Use Flume when basic streaming is required
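
A minimal agent configuration sketch: one source tailing a log file, a memory channel, and an HDFS sink; the file paths and the agent/component names are placeholders.

  # agent.conf: one source, one memory channel, one HDFS sink
  a1.sources  = r1
  a1.channels = c1
  a1.sinks    = k1

  a1.sources.r1.type     = exec
  a1.sources.r1.command  = tail -F /var/log/app/events.log
  a1.sources.r1.channels = c1

  a1.channels.c1.type = memory

  a1.sinks.k1.type      = hdfs
  a1.sinks.k1.hdfs.path = /data/flume/events
  a1.sinks.k1.channel   = c1

  # Start the agent:
  flume-ng agent --conf ./conf --conf-file agent.conf --name a1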

Page 13: Back to School - St. Louis Hadoop Meetup September 2016


7.) Kafka
• …Also meant for ingesting streams of data

• Runs on its own cluster

• Kafka does not PUSH data to other places
  – Other places pull from Kafka

• Kafka streams in the data, then PUBLISHES it on its cluster; multiple consumers can SUBSCRIBE to that data and each get their own copy

Use Kafka for advanced streaming
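
The console tools that ship with Kafka make the publish/subscribe model easy to see; the broker and ZooKeeper addresses and the topic name below are placeholders (flags shown are the 2016-era, Kafka 0.x syntax).

  # Create a topic with 4 partitions, replicated across 3 brokers.
  kafka-topics.sh --create --zookeeper zk1:2181 \
    --replication-factor 3 --partitions 4 --topic game-events

  # A producer PUBLISHES messages to the topic...
  echo '{"player":"p42","event":"win"}' | \
    kafka-console-producer.sh --broker-list broker1:9092 --topic game-events

  # ...and any number of consumer groups SUBSCRIBE and each get their own copy.
  kafka-console-consumer.sh --bootstrap-server broker1:9092 \
    --topic game-events --from-beginning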

Page 14: Back to School - St. Louis Hadoop Meetup September 2016


8.) Drill
• Flexible SQL tool

• Works with a wide range of data formats and storage platforms

• Does not require transformations to the data

• For ad-hoc analytics and performant queries on LARGE data sets

• Scales to thousands of nodes

Use Drill for data exploration and performant SQL
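
A hedged sketch of ad-hoc exploration with Drill: connect SQLLine through ZooKeeper (or use drill-embedded on a laptop) and query raw files in place; the file path and JSON fields below are invented.

  # Connect to a Drill cluster via ZooKeeper (or run drill-embedded locally).
  sqlline -u jdbc:drill:zk=zk1:2181

  -- Then, at the Drill prompt: no schema definition or ETL first,
  -- just query the JSON files directly.
  SELECT t.customer_id, SUM(t.amount) AS total
  FROM dfs.`/data/sales/orders.json` t
  GROUP BY t.customer_id
  ORDER BY total DESC
  LIMIT 10;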

Page 15: Back to School - St. Louis Hadoop Meetup September 2016


9.) YARN
• Yet Another Resource Negotiator
• Helps you allocate resources (and enforce usage quotas) across multiple groups/users

10.) ZooKeeper
• Coordinates the distribution of jobs
• Handles partial failures
• Provides synchronization of jobs

Use YARN for multitenancy

ALWAYS use ZooKeeper with Hadoop (see the sketch below)
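
A few commands for poking at both services on a running cluster; host names are placeholders.

  # YARN: see what is running, who owns it, and what resources the nodes have.
  yarn application -list
  yarn node -list

  # ZooKeeper: browse the coordination data that other services keep there.
  zkCli.sh -server zk1:2181
  # ...then, at the zkCli prompt:
  ls /        # typically shows znodes such as /hbase, /brokers (Kafka), etc.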

Page 16: Back to School - St. Louis Hadoop Meetup September 2016


Use Case 1: Expensive RDBMS
• Organization has 5 TB of sales data in an RDBMS ($$$)

• Currently 50 reports being generated regularly

• Largest report takes ~24 hours to generate

• Team only knows SQL

Solution components (from the slide diagram): Sqoop, HDFS, Hive, Hive/Drill
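
A hedged end-to-end sketch of the offload, assuming a PostgreSQL source; the connection details, table, and column names are placeholders. Sqoop copies the fact table into Hive on HDFS, and the SQL-only team re-points its reports at Hive (or Drill) instead of the expensive RDBMS.

  # 1. Offload the sales table from the RDBMS into a Hive table on HDFS.
  sqoop import \
    --connect jdbc:postgresql://dbhost:5432/salesdb \
    --username etl_user -P \
    --table sales_fact \
    --hive-import --hive-table warehouse.sales_fact

  # 2. The long-running report becomes a HiveQL batch query on the cluster.
  hive -e "SELECT region, product, SUM(amount) AS revenue
           FROM warehouse.sales_fact
           GROUP BY region, product;"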

Page 17: Back to School - St. Louis Hadoop Meetup September 2016


Use Case 2: Customer 360 Data Lake/Hub
• 50 TB of customer data

• Data consists of everything from ERP data to JSON data from a REST API

• Four different business units need access to the data and they each have performance requirements

• Basic users need ad-hoc query capabilities

• Weekly jobs need to be kicked off during off hours

Solution components (from the slide diagram): HDFS, Drill, YARN, Oozie

Page 18: Back to School - St. Louis Hadoop Meetup September 2016


Use Case 3: Online Video Game Support
• Stats need to be updated milliseconds after the game finishes

• Player needs to be able to randomly look up other player stats in less than a second

• System can never go down or lose information

• Management wants to save this data so analytics can be run on these datasets.

Solution components (from the slide diagram, matching the requirements above): Kafka/Flume and HBase; HBase; Kafka and HBase; HDFS

Page 19: Back to School - St. Louis Hadoop Meetup September 2016


Advice for those getting started…
• Don’t try to hire a big data team; build from within
  – MOTIVATED Linux and SQL people are enough to get started

• Target a legacy RDBMS and move ~80% of it to Hadoop
  – Quick win
  – Instant validation and justification if you can cut costs and improve speed at the same time

• Have fun

Page 20: Back to School - St. Louis Hadoop Meetup September 2016


Additional Resources
• Full list of Hadoop ecosystem projects

• Books:
  – Hadoop: The Definitive Guide
  – Hadoop Application Architectures

• Free training:
  – Coursera and edX (my favorite is a Python specialization series)
  – learn.mapr.com (free courses from 100 level to 400 level)

Page 21: Back to School - St. Louis Hadoop Meetup September 2016


Q & A

Engage with us!
• @mapr
• maprtech
• mapr-technologies
• MapR
• [email protected]