Hadoop at LinkedIn

July 27, 2015 Keith Dsouza

Upload: keith-dsouza

Post on 11-Feb-2017

Category:

Engineering


TRANSCRIPT

Page 1: Hadoop at LinkedIn

July 27, 2015 Keith Dsouza

Page 2: Hadoop at LinkedIn

What is Hadoop?

● Open source framework developed and maintained by the Apache Software Foundation

● Consists of a distributed file system (HDFS) used for storing large blocks of data

● MapReduce framework: a model for large-scale data processing

● YARN: a resource management platform for managing computing resources in clusters

Page 3: Hadoop at LinkedIn

MapReduce In Practice

Problem: We have a large number of attendees at this meetup and want to count the number of Java engineers, C++ engineers and C engineers present here.
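The counting problem above maps naturally onto the three MapReduce stages. The following is a minimal single-process sketch in plain Python, not Hadoop code: the attendee list and the function names (`map_phase`, `shuffle`, `reduce_phase`, `count_engineers`) are hypothetical and exist only to illustrate the model.

```python
from collections import defaultdict

# Hypothetical attendee records (name, language); made up for illustration.
ATTENDEES = [
    ("alice", "Java"),
    ("bob", "C++"),
    ("carol", "Java"),
    ("dave", "C"),
    ("erin", "C++"),
    ("frank", "Java"),
]


def map_phase(record):
    """Map: emit one (language, 1) pair for each attendee."""
    _name, language = record
    yield (language, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    would do between the map and reduce stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one language."""
    return key, sum(values)


def count_engineers(attendees):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    pairs = [pair for record in attendees for pair in map_phase(record)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(count_engineers(ATTENDEES))
```

On a real cluster the map calls would run in parallel on different nodes, each over its own slice of the input, and the shuffle would move data across the network; the sketch keeps only the logical flow.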

Page 4: Hadoop at LinkedIn

MapReduce In Practice

Page 5: Hadoop at LinkedIn

MapReduce In Practice

Page 6: Hadoop at LinkedIn

Summing it up

Page 7: Hadoop at LinkedIn

Page 8: Hadoop at LinkedIn

Scale

● One of the largest and most active clusters in the world

● Hadoop 2.0

● 2000+ nodes

● 50 million blocks of data

● Thousands of workflows
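To give a feel for what 50 million blocks can mean in bytes, here is a rough back-of-the-envelope sketch. It assumes the Hadoop 2.x default block size of 128 MiB; the slides do not state the block size actually configured, whether the count includes replicas, or how full the blocks are, so the result is only an upper bound under that assumption.

```python
# Assumption: Hadoop 2.x default dfs.blocksize (128 MiB). The cluster's
# real setting is not given in the slides.
BLOCK_SIZE_BYTES = 128 * 1024 * 1024

# Block count taken from the "Scale" slide.
BLOCK_COUNT = 50_000_000


def max_bytes(block_count, block_size=BLOCK_SIZE_BYTES):
    """Upper bound on stored bytes if every block were completely full."""
    return block_count * block_size


def to_pib(num_bytes):
    """Convert bytes to pebibytes (2**50 bytes)."""
    return num_bytes / 1024**5


if __name__ == "__main__":
    print(f"at most {to_pib(max_bytes(BLOCK_COUNT)):.2f} PiB")
```

Since the last block of each file is usually only partly filled, the real volume would be lower than this bound.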

Page 9: Hadoop at LinkedIn

Project Takeout - Member Data Export

Page 10: Hadoop at LinkedIn

Project Takeout - Member Data Export

● 350 million LinkedIn members

● Reads hundreds of terabytes of data

● Built using Scalding

● Low-maintenance

Page 11: Hadoop at LinkedIn

Supported Development Platforms

Page 12: Hadoop at LinkedIn

ETL (Extract, Transform and Load)

Page 13: Hadoop at LinkedIn

Azkaban - Hadoop Project/Workflow Manager

● Project management for Hadoop jobs

● Hadoop job dependencies / job workflows

● Execution tracker

● Job history

● Continuous job scheduler
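The idea of job dependencies forming a workflow can be illustrated with a topological ordering: a job runs only after every job it depends on has finished. The sketch below uses Python's standard `graphlib` module (Python 3.9+). It is not Azkaban's API or job-file format, and the job names in `WORKFLOW` are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each job maps to the jobs it depends on.
WORKFLOW = {
    "extract": [],
    "clean": ["extract"],
    "aggregate": ["clean"],
    "report": ["aggregate", "clean"],
}


def execution_order(workflow):
    """Return the jobs in an order that respects every dependency.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(workflow).static_order())


if __name__ == "__main__":
    try:
        for job in execution_order(WORKFLOW):
            print("run", job)
    except CycleError as err:
        print("workflow has a dependency cycle:", err)
```

A workflow manager would additionally track each execution, keep its history and trigger runs on a schedule, as the bullets above list; the sketch covers only the ordering step.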

Page 14: Hadoop at LinkedIn

Azkaban - Hadoop Project/Workflow Manager

Page 15: Hadoop at LinkedIn

Azkaban Execution/Logs

Page 16: Hadoop at LinkedIn

Hadoop DSL - Workflow

Page 17: Hadoop at LinkedIn

Hadoop DSL - Job File

Page 18: Hadoop at LinkedIn

Dr. Elephant - Analyze Jobs

Page 19: Hadoop at LinkedIn

Dr. Elephant - Analyze Problems

Page 20: Hadoop at LinkedIn

Production Workflows

Page 21: Hadoop at LinkedIn

LinkedIn open source contributions

● Contribute to Apache Hadoop

● Apache DataFu - http://datafu.incubator.apache.org/

● Azkaban - http://azkaban.github.io/

● Gobblin - https://github.com/linkedin/gobblin

● Pinot - https://github.com/linkedin/pinot

● Samza - http://samza.apache.org/

● Data@LinkedIn - http://data.linkedin.com

● Keep in touch - http://engineering.linkedin.com/