Data infrastructure and Hadoop at LinkedIn


Upload: hari-shankar

Post on 02-Nov-2014


Page 1: Data infrastructure and Hadoop at LinkedIn

Big data and Hadoop

September 2012

Hari Shankar Menon

Software engineer

LinkedIn


Page 2: Data infrastructure and Hadoop at LinkedIn

About me

• LinkedIn Engineering, Data warehouse team
• Previously, software engineer at Clickable
  – Worked on building the reporting and analytics platform on Hadoop and HBase
• Hadoop and open-source enthusiast

Page 3: Data infrastructure and Hadoop at LinkedIn

Agenda

• About LinkedIn
• Data Infrastructure overview
• Hadoop@LinkedIn
• Challenges

Page 4: Data infrastructure and Hadoop at LinkedIn

Our mission

Connect the world’s professionals to make them more productive and successful


Page 5: Data infrastructure and Hadoop at LinkedIn

LinkedIn by numbers

LinkedIn Members (Millions)
2004: 2
2005: 4
2006: 8
2007: 17
2008: 32
2009: 55
2010: 90

• Members: 175M+
• 85% of Fortune 100 companies use LinkedIn to hire
• Company Pages: >2M **
• New members joining: ~2/sec
• Professional searches in 2011: ~4.2B

* as of Nov 4, 2011
** as of June 30, 2011

Page 6: Data infrastructure and Hadoop at LinkedIn

• About LinkedIn
• Data Infrastructure overview
• Hadoop@LinkedIn
• Challenges

Page 7: Data infrastructure and Hadoop at LinkedIn

What is big data?

* Chart from Philip Russom, Research Director, TDWI

Page 8: Data infrastructure and Hadoop at LinkedIn

Infrastructure technologies

• Primary data store (front-end)
• Distributed key-value store
• Document-oriented store
• Distributed PubSub messaging
• Database change replication: Databus
• Search technologies: SenseiDB, Zoie, Bobo

Page 9: Data infrastructure and Hadoop at LinkedIn

Open source

http://data.linkedin.com/opensource

Page 10: Data infrastructure and Hadoop at LinkedIn

• About LinkedIn
• Data Infrastructure overview
• Hadoop@LinkedIn
• Challenges

Page 11: Data infrastructure and Hadoop at LinkedIn

• What is Hadoop
• Evolution of Hadoop
• Impact

Page 12: Data infrastructure and Hadoop at LinkedIn

Hadoop@LinkedIn

• Recommendation systems
  – Generating recommendations
  – Modeling
  – A/B Testing
  – Grandfathering
• Data warehouse/ETL
  – Raw data storage
  – Aggregations
  – Heavy lifting
• Data sciences
  – Strategic analyses
  – Experimentation sandbox

Page 13: Data infrastructure and Hadoop at LinkedIn

The Recommendations opportunity

Pandora Search for People
Events You May Be Interested In
Groups browse maps

• Relevance/Latency
• Offline computation
• Caching

Page 14: Data infrastructure and Hadoop at LinkedIn


Improving recommendations

• Mathematical modeling

• A/B Testing

• Grandfathering
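The A/B testing bullet above can be illustrated with a minimal sketch of deterministic bucket assignment. This is a common general technique, not necessarily how LinkedIn assigns members to experiments; the function `assign_variant`, the experiment name, and the member ids are all hypothetical.

```python
import hashlib


def assign_variant(member_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a member to 'control' or 'treatment' (hypothetical helper)."""
    # Hash the experiment name together with the member id so the same
    # member can land in different buckets for different experiments.
    digest = hashlib.sha256(f"{experiment}:{member_id}".encode("utf-8")).hexdigest()
    # Scale the first 8 hex digits (32 bits) into the range [0, 1).
    position = int(digest[:8], 16) / 0x100000000
    return "treatment" if position < treatment_share else "control"


if __name__ == "__main__":
    # Made-up member ids, used only to show that the split is roughly even.
    members = [f"member-{i}" for i in range(10000)]
    variants = [assign_variant(m, "new-ranking-model") for m in members]
    print(variants.count("treatment") / len(variants))
```

Because the assignment depends only on the ids, an offline job and an online service would compute the same bucket without sharing any state.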

Page 15: Data infrastructure and Hadoop at LinkedIn

Hadoop in the Data warehouse

• Source of truth
• Lower retention
• Ad-hoc analysis

• Longer retention
• Complex transformations
• Algorithmic computations

Page 16: Data infrastructure and Hadoop at LinkedIn


Hadoop in Data Sciences

• Deep dives

• Sandbox

• Hackday projects

Page 17: Data infrastructure and Hadoop at LinkedIn


Data Insights - 1

Job migration after financial collapse

Page 18: Data infrastructure and Hadoop at LinkedIn


Data Insights - 2

Page 19: Data infrastructure and Hadoop at LinkedIn


Data Insights - 3

Page 20: Data infrastructure and Hadoop at LinkedIn

• About LinkedIn
• Data Infrastructure overview
• Hadoop@LinkedIn
• Challenges

Page 21: Data infrastructure and Hadoop at LinkedIn

Challenges

1. User adoption of new technologies
2. Real-time processing
3. Graph/Network algorithms
4. Making data accessible

Page 22: Data infrastructure and Hadoop at LinkedIn


User adoption

Page 23: Data infrastructure and Hadoop at LinkedIn

Real-time processing

• Challenges
  – Random reads/writes
  – Warm-up time
• Solutions
  – Parts of the problem that can be moved offline?
  – HBase, Voldemort
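The "Solutions" bullets on this slide suggest moving part of the work offline and serving the results from a key-value store. A minimal sketch of that split follows, assuming a made-up view log and using a plain in-memory dictionary as a stand-in for the store; `precompute_recommendations` and `RecommendationStore` are hypothetical names and do not represent the HBase or Voldemort client APIs.

```python
from collections import Counter, defaultdict


def precompute_recommendations(views, top_n=2):
    """Offline step: derive 'viewed together' suggestions from batch logs.

    `views` maps a member id to the list of item ids that member viewed.
    """
    co_counts = defaultdict(Counter)
    for items in views.values():
        unique = set(items)
        for item in unique:
            for other in unique:
                if other != item:
                    co_counts[item][other] += 1
    # Rank by co-view count (descending), then by id so ties are deterministic.
    return {
        item: [o for o, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]]
        for item, counts in co_counts.items()
    }


class RecommendationStore:
    """In-memory stand-in for a read-optimized key-value store."""

    def __init__(self, table):
        self._table = dict(table)

    def get(self, key):
        # Online step: a single key lookup, no computation at request time.
        return list(self._table.get(key, []))


if __name__ == "__main__":
    views = {"m1": ["a", "b", "c"], "m2": ["a", "b"], "m3": ["b", "c"]}
    store = RecommendationStore(precompute_recommendations(views))
    print(store.get("a"))        # ['b', 'c']
    print(store.get("unknown"))  # []
```

The expensive pairwise counting runs in batch, while the request path is reduced to one lookup, which is the part a store built for random reads handles well.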

Page 24: Data infrastructure and Hadoop at LinkedIn

Map-reduce-incompatible problems

• Graph problems
• Traditional joins
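To illustrate why a traditional join fits MapReduce awkwardly, here is a small pure-Python simulation of a reduce-side join, one common workaround. It is a sketch under assumptions, not code from the talk: the phase functions, the tags, and the sample records are all hypothetical, and no Hadoop API is used.

```python
from collections import defaultdict


def map_phase(members, views):
    """Emit (join_key, tagged_record) pairs from both inputs."""
    for member_id, name in members:
        yield member_id, ("member", name)
    for member_id, page in views:
        yield member_id, ("view", page)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Pair every member record with every view record sharing the key (inner join)."""
    for key in sorted(groups):
        names = [v for tag, v in groups[key] if tag == "member"]
        pages = [v for tag, v in groups[key] if tag == "view"]
        for name in names:
            for page in pages:
                yield key, name, page


if __name__ == "__main__":
    members = [(1, "Ana"), (2, "Raj")]
    views = [(1, "/jobs"), (2, "/groups"), (1, "/inbox"), (3, "/home")]
    for row in reduce_phase(shuffle(map_phase(members, views))):
        print(row)
```

Every record from both inputs has to be tagged, moved through the shuffle, and buffered per key before any joined row is produced, which is much more machinery than a one-line join in SQL.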

Page 25: Data infrastructure and Hadoop at LinkedIn

Making data accessible

• Hadoop Tons of data

Page 26: Data infrastructure and Hadoop at LinkedIn


Finally!

No Silver bullet

Hadoop Offline processing

Scalability by design

Page 27: Data infrastructure and Hadoop at LinkedIn


www.linkedin.com/in/harisreekumar

www.linkedin.com/company/linkedin/careers