Hadoop at LinkedIn

Post on 11-Feb-2017


July 27, 2015

Keith Dsouza

What is Hadoop?

● Open source framework developed and maintained by the Apache Software Foundation
● Consists of a distributed file system (HDFS) used for storing large blocks of data
● MapReduce framework - model for large-scale data processing
● YARN - resource management platform for managing computing resources in clusters
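To make "large blocks" concrete, here is a minimal sketch of how many HDFS blocks a file would span. It assumes a block size of 128 MB, which is the Hadoop 2 default for `dfs.blocksize`; a cluster may be configured differently, and the function name `block_count` is purely illustrative.

```python
# Assumed block size: 128 MB, the Hadoop 2 default for dfs.blocksize.
# Real clusters may override this value.
BLOCK_SIZE_BYTES = 128 * 1024 * 1024


def block_count(file_size_bytes: int, block_size: int = BLOCK_SIZE_BYTES) -> int:
    """Return how many blocks a file of the given size would span."""
    if file_size_bytes < 0 or block_size <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    # Integer ceiling division: the last block may be only partially filled.
    return (file_size_bytes + block_size - 1) // block_size


if __name__ == "__main__":
    one_gib = 1024 ** 3
    print(block_count(one_gib))      # a 1 GiB file
    print(block_count(one_gib + 1))  # one extra byte needs one more block
```

The last block of a file is usually only partly full, which is why the calculation rounds up.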

MapReduce In Practice

Problem: We have a large number of attendees at this meetup and want to count the number of Java engineers, C++ engineers, and C engineers present here.
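The counting problem can be sketched in plain Python to show the shape of a MapReduce job: a map step emits a `(language, 1)` pair per attendee, a shuffle step groups the pairs by language, and a reduce step sums each group. This is a single-process illustration only, not Hadoop code; the attendee records and all function names are made up for the example.

```python
from collections import defaultdict


def map_phase(attendee):
    """Map: emit a (language, 1) pair for one attendee record."""
    _name, language = attendee
    yield (language, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one language."""
    return key, sum(values)


def count_engineers(attendees):
    """Run map, shuffle, and reduce in sequence over the attendee list."""
    pairs = [pair for attendee in attendees for pair in map_phase(attendee)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical attendee records: (name, primary language).
ATTENDEES = [
    ("Ana", "Java"),
    ("Ben", "C++"),
    ("Chen", "Java"),
    ("Dee", "C"),
    ("Eli", "C++"),
    ("Fay", "Java"),
]

if __name__ == "__main__":
    print(count_engineers(ATTENDEES))
```

On a real cluster the map calls run in parallel on different nodes and the shuffle moves data across the network; the per-key logic stays the same.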


Summing it up

Scale

● One of the largest and most active clusters in the world
● Hadoop 2.0
● 2000+ nodes
● 50 million blocks of data
● Thousands of workflows

Project Takeout - Member Data Export

● 350 million LinkedIn members
● Reads hundreds of terabytes of data
● Built using Scalding
● Low-maintenance

Supported Development Platforms


ETL (Extract, Transform and Load)

Azkaban - Hadoop Project/Workflow Manager

● Project management for Hadoop jobs
● Hadoop job dependencies / job workflows
● Execution tracker
● Job history
● Continuous job scheduler
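The core idea behind job dependencies in a workflow is that a job runs only after everything it depends on has finished. The sketch below orders a small workflow with a depth-first topological sort. It illustrates the concept only and is not Azkaban's implementation or API; the job names (`extract`, `transform`, `load`, `report`) and the function `execution_order` are hypothetical.

```python
def execution_order(dependencies):
    """Return job names so that every job appears after its dependencies.

    `dependencies` maps each job name to the list of jobs it depends on.
    Raises ValueError if the dependencies contain a cycle.
    """
    order = []
    state = {}  # job -> "visiting" while on the current path, "done" afterwards

    def visit(job):
        if state.get(job) == "done":
            return
        if state.get(job) == "visiting":
            raise ValueError(f"dependency cycle detected at {job!r}")
        state[job] = "visiting"
        for dep in dependencies.get(job, ()):
            visit(dep)
        state[job] = "done"
        order.append(job)

    # Sort the job names so the result is deterministic.
    for job in sorted(dependencies):
        visit(job)
    return order


# Hypothetical workflow: each job lists the jobs it depends on.
WORKFLOW = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
    "report": ["load"],
}

if __name__ == "__main__":
    print(execution_order(WORKFLOW))
```

A scheduler built on this ordering could also run jobs with no mutual dependencies in parallel; the sketch keeps a single linear order for simplicity.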


Azkaban Execution/Logs


Hadoop DSL - Workflow


Hadoop DSL - Job File


Dr. Elephant - Analyze Jobs


Dr. Elephant - Analyze Problems


Production Workflows

LinkedIn open source contributions

● Contribute to Apache Hadoop
● Apache DataFu - http://datafu.incubator.apache.org/
● Azkaban - http://azkaban.github.io/
● Gobblin - https://github.com/linkedin/gobblin
● Pinot - https://github.com/linkedin/pinot
● Samza - http://samza.apache.org/
● Data@LinkedIn - http://data.linkedin.com
● Keep in touch - http://engineering.linkedin.com/
