Hive at LinkedIn

Uploaded by: mislam77

Posted on 19-Aug-2015

TRANSCRIPT

Page 1: Hive at LinkedIn
Page 2: Hive at LinkedIn

©2013 LinkedIn Corporation. All Rights Reserved. ENGINEERING

Hive at LinkedIn
Mohammad Islam, Mark Wagner, Karthik Ramasamy

Page 3: Hive at LinkedIn

Agenda

LinkedIn Data and its Ecosystem
Performance Improvements – Avro
User experiences

Page 4: Hive at LinkedIn

LinkedIn Data Sources

Event Data
– Page Views
– Clicks
– Search queries

Database Data
– Profile (Users & Companies)
– Connections

External Data
– Salesforce, DoubleClick

Page 5: Hive at LinkedIn

Data Ecosystem at LinkedIn

[Diagram: Member Facing Systems]

Page 6: Hive at LinkedIn

Data Ecosystem at LinkedIn

[Diagram: Member Facing Systems; three elements labeled "Data"]

Page 7: Hive at LinkedIn

Data Ecosystem at LinkedIn

[Diagram: Member Facing Systems; two elements labeled "Data"]

Page 8: Hive at LinkedIn

Data Ecosystem at LinkedIn

[Diagram: Member Facing Systems; two elements labeled "Data"]

Page 9: Hive at LinkedIn

Data Ecosystem at LinkedIn

[Diagram: Member Facing Systems; one element labeled "Data"]

Page 10: Hive at LinkedIn

Data in Hadoop

Almost all LinkedIn data is stored in Hadoop

Tools used
– Hive/HCatalog
– Pig
– Java MapReduce
– Azkaban

Page 11: Hive at LinkedIn

Hive Usage

Use-cases
– Ad-hoc query
– Reporting
– Building Platforms
  Segmentation Engine
  Experimentation Engine

Users
– Data Scientists
– Business Analytics
– Security team
– Product team

Page 12: Hive at LinkedIn

Hive Challenges

Performance
– Faster query execution
  Efficient MR* execution plan
– Effective resource usage
– Ensure cluster stability

Page 13: Hive at LinkedIn

LinkedIn Hive Initiatives

Make HCatalog work and deploy [Ongoing]
Hive Performance Improvement (Avro data reading) [Ongoing]
Stabilize Hive Server 2 at LI [About to Start]
Expand the scope of HCatalog metadata [Planning]

Page 14: Hive at LinkedIn

HCatalog Initiatives

Expand scope of metadata
– Who creates this data?
– What are the inputs?
  Helpful to create data lineage
– Who is the maintainer of data?
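
The three questions above (creator, inputs, maintainer) map naturally onto a small metadata record, and recording inputs is what makes lineage traversal possible. The following is a minimal sketch under stated assumptions: `DatasetMeta`, `lineage`, and every dataset, job, and team name are hypothetical and chosen only for illustration; none of them come from HCatalog or from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMeta:
    """Hypothetical metadata record; field names are illustrative, not HCatalog's."""
    name: str
    creator: str       # who creates this data
    maintainer: str    # who maintains this data
    inputs: list = field(default_factory=list)  # names of upstream datasets


def lineage(catalog, name):
    """Return every upstream dataset reachable from `name` via recorded inputs."""
    seen = []
    stack = list(catalog[name].inputs)
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        if current in catalog:
            stack.extend(catalog[current].inputs)
    return seen


# Made-up catalog entries, for demonstration only.
catalog = {
    "page_views": DatasetMeta("page_views", "tracking-job", "team-a"),
    "profiles": DatasetMeta("profiles", "db-export-job", "team-b"),
    "views_by_member": DatasetMeta(
        "views_by_member", "hive-report-job", "team-c", ["page_views", "profiles"]
    ),
    "weekly_report": DatasetMeta(
        "weekly_report", "hive-report-job", "team-c", ["views_by_member"]
    ),
}

upstream = lineage(catalog, "weekly_report")
```

With only the `inputs` field populated, the upstream set for any dataset can be derived on demand, which is one way the "helpful to create data lineage" point could play out.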

Page 15: Hive at LinkedIn

Courtesy: iclipart.com

Page 16: Hive at LinkedIn

What is the Problem?

Reading an Avro record takes a long time
– 52 μs/record

Found the hotspot using VisualVM

Page 17: Hive at LinkedIn

Improvement #1

Reduce the number of Schema.equals() calls
Schema equality checks are required primarily for evolved schemas
Solution includes caching to avoid unnecessary expensive calls
Results
– Trunk read overhead: 52 μs/record
– After this patch read overhead: 32 μs/record
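
The slides do not show the patch itself, so the following is only a generic sketch of the caching idea: remember the result of an expensive equality check for a given pair of schema objects so that it runs once rather than once per record. The `Schema` class here is a toy stand-in, not the Avro class, and `EqualityCache` is a hypothetical name.

```python
class Schema:
    """Toy stand-in for a record schema; NOT the Avro Schema class."""
    deep_calls = 0  # counts deep comparisons, for demonstration only

    def __init__(self, fields):
        self.fields = tuple(fields)

    def deep_equals(self, other):
        # Stands in for an expensive structural comparison.
        Schema.deep_calls += 1
        return self.fields == other.fields


class EqualityCache:
    """Hypothetical cache of comparison results, keyed by object identity."""

    def __init__(self):
        self._results = {}

    def equal(self, a, b):
        if a is b:
            return True
        key = (id(a), id(b))
        hit = self._results.get(key)
        if hit is None:
            # Keep references to a and b so their ids cannot be reused.
            hit = (a.deep_equals(b), a, b)
            self._results[key] = hit
        return hit[0]


writer = Schema([("id", "long"), ("name", "string")])
reader = Schema([("id", "long"), ("name", "string")])

cache = EqualityCache()
# Simulate checking the same schema pair once per record.
results = [cache.equal(writer, reader) for _ in range(1000)]
```

In this sketch the deep comparison runs a single time for the 1000 simulated records; whether the real patch caches by identity or by some other key is not stated in the slides.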

Page 18: Hive at LinkedIn

Improvement #2

Reduce extra data transformations
Solution is to provide custom object inspectors
Results
– Current read overhead: 52 μs/record
– After this patch read overhead: 30 μs/record
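
The slides give no detail on the custom object inspectors, so this is a rough illustration of the general idea only: rather than copying each native record into an intermediate structure before reading it, an inspector reads the requested field in place. `RecordInspector`, `eager_convert`, and the sample records are hypothetical and do not reflect Hive's actual ObjectInspector interfaces.

```python
class RecordInspector:
    """Hypothetical inspector that reads fields from a native record in place."""

    def __init__(self, field_names):
        self._index = {name: i for i, name in enumerate(field_names)}

    def get_field(self, record, name):
        # No intermediate object is built; the native record is indexed directly.
        return record[self._index[name]]


def eager_convert(record, field_names):
    """Baseline: copy the whole native record into an intermediate dict."""
    return dict(zip(field_names, record))


# Made-up sample data, for demonstration only.
fields = ["member_id", "page", "timestamp"]
records = [(1, "/home", 100), (2, "/jobs", 105), (1, "/jobs", 110)]

# Baseline path: convert every record, then read one field.
pages_eager = [eager_convert(r, fields)["page"] for r in records]

# Inspector path: read the one field directly from each native record.
inspector = RecordInspector(fields)
pages_direct = [inspector.get_field(r, "page") for r in records]
```

Both paths return the same values; the difference is that the inspector path skips the per-record copy, which is the kind of extra transformation the slide refers to.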

Page 19: Hive at LinkedIn

Final Results

[Chart, y-axis 0 to 60: Trunk 55; Improvement #1 32; Improvement #2 30; Combined 11]
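
To put the chart values in relative terms, the short calculation below derives the percentage reduction of each bar against the Trunk bar. It uses the chart's own numbers (Trunk 55, Improvement #1 32, Improvement #2 30, Combined 11); the individual improvement slides quote 52 μs/record for trunk instead, so the percentages would differ slightly with that baseline. The unit is assumed to be μs/record, as on the earlier slides, and the function name is hypothetical.

```python
def reduction_pct(before, after):
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Values read off the chart; unit assumed to be microseconds per record.
chart = {"Trunk": 55, "Improvement #1": 32, "Improvement #2": 30, "Combined": 11}

baseline = chart["Trunk"]
summary = {
    name: round(reduction_pct(baseline, value), 1)
    for name, value in chart.items()
    if name != "Trunk"
}
speedup = baseline / chart["Combined"]

for name, pct in summary.items():
    print(f"{name}: {pct}% lower than trunk")
print(f"Combined speedup: {speedup:.1f}x")
```

On these numbers the combined patches cut per-record read overhead by about 80%, a fivefold speedup over the charted trunk value.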

Page 20: Hive at LinkedIn

Courtesy: iclipart.com

Page 21: Hive at LinkedIn

Hive User Base at LinkedIn

Out of all our Hadoop users:
– 56% Never Used Hive
– 44% Use Hive
– 27% Primarily use Hive

32% of Hive jobs were from ad-hoc queries

Page 22: Hive at LinkedIn

Who uses Hive and who doesn’t

[Chart: Hive adoption among Hadoop users by job title, 0% to 100%. Job titles shown: Data Scientists, Engineers, Product Managers, Customer Support Specialists, Analysts]

Page 23: Hive at LinkedIn

Top concerns about Hive

Not friendly for long/complex workflows

Performance, especially for ad-hoc queries

Steep learning curve for tuning

Data/UDFs unavailability

Page 24: Hive at LinkedIn