DataEngConf: Building Satori, a Hadoop Tool for Data Extraction at LinkedIn

Building Satori: Web Data Extraction On Hadoop Nikolai Avteniev Sr. Staff Software Engineer LinkedIn

Upload: hakka-labs

Post on 09-Jan-2017

Category: Technology


TRANSCRIPT

Page 1: DataEngConf: Building Satori, a Hadoop Tool for Data Extraction at LinkedIn

Building Satori: Web Data Extraction On Hadoop

Nikolai Avteniev, Sr. Staff Software Engineer, LinkedIn

Page 2:

LinkedIn NYC

Building Opportunity from the Empire State Building

Page 3:

The Team

Nikita Lytkin, Staff Software Engineer

Pi-Chuan Chang, Sr. Software Engineer

David Astle, Sr. Software Engineer

Nikolai Avteniev, Sr. Staff Software Engineer

Eran Leshem, Sr. Staff Software Engineer

Page 4:

THE ECONOMIC GRAPH

Page 5:

Connecting talent with opportunity at massive scale

Members, Companies, Jobs, Skills, Schools, Updates

Page 6:

What we thought we needed

The BIG Idea

Inspired by Hsieh, Jonathan M., Steven D. Gribble, and Henry M. Levy. "The Architecture and Implementation of an Extensible Web Crawler." NSDI. 2010.

Page 7:

Questions we wanted to answer

Focused our Vision

Who would use this tool?

Do we need to crawl the entire web?

Do we need to process the pages near line?

Where would we store this data?

How would we correct mistakes in the flow?

Page 8:

Identity Team

Page 9:

Virtually All Member Value Relies On Identity Data

Susan Kaplan, Sr. Marketing Manager at Weblo

SEARCH: Research & Contact

AD TARGETING: Market Products & Services

PYMK: Build Your Network

RECRUITER: Recruit & Hire

FEED: Get Daily News

NETWORK: Keep in Touch

RECOMMENDATIONS: Get a Job/Gig

WVMP: Establish Yourself as Expert

Page 10:

Identity Use Case

A smarter way to build your profile

• Suggest 1-click profile updates to members

• Using this, we can help members easily fill in profile gaps & get credit for certificates, patents, publications…

Page 11:

Kafka/Samza Team

Page 12:

Not a perfect fit

• Avg. HTML document is 6K; 37% < 10K

• Samza can handle 1.2M messages per second on a single node [2]

• Data retention is limited to between 7 and 30 days

• Most of the data is filtered out

• Need to bootstrap Samza stores

1. HTML Document Transfer Size. http://httparchive.org/interesting.php?a=All&l=Oct%2015%202015#bytesHtmlDoc

2. Feng, Tao. "Benchmarking Apache Samza: 1.2 million messages per second on a single node." https://engineering.linkedin.com/performance/benchmarking-apache-samza-12-million-messages-second-single-node

Page 13:

The Project: Satori

Help 400M members fully realize their professional identity on LinkedIn.

Find sources of professional content on the public internet.

Fetch the content, extract structured data and match it to member profiles.

Page 14:

Web Data Extraction HOW TO:

Page 15:

Web Data Extraction System

• Enterprise vs. Social Web use cases

• Web Sources

• Wrappers

3. Ferrara, Emilio, et al. "Web data extraction, applications and techniques: A survey." Knowledge-Based Systems 70 (2014): 301-323.

Page 16:

What is a Wrapper?

Candy Wrapper

Web Wrapper

Page 17:

Industrial Web Data Extraction

• Induce wrappers based on data [4]

• Build wrappers that are robust [5]

• Cluster similar pages by URL [6]

• The web is huge and there are interesting things in the long tail [7]

4. Dalvi, Nilesh, Ravi Kumar, and Mohamed Soliman. "Automatic wrappers for large scale web extraction." Proceedings of the VLDB Endowment 4.4 (2011): 219-230.

5. Dalvi, Nilesh, Philip Bohannon, and Fei Sha. "Robust web extraction: an approach based on a probabilistic tree-edit model." Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009.

6. Blanco, Lorenzo, Nilesh Dalvi, and Ashwin Machanavajjhala. "Highly efficient algorithms for structural clustering of large websites." Proceedings of the 20th international conference on World wide web. ACM, 2011.

7. Dalvi, Nilesh, Ashwin Machanavajjhala, and Bo Pang. "An analysis of structured data on the web." Proceedings of the VLDB Endowment 5.7 (2012): 680-691.
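The "cluster similar pages by URL" idea can be illustrated with a deliberately naive sketch. It is not the algorithm from reference [6]; it simply assumes that pages whose URLs differ only in numeric path segments share a template. The function names and the `{num}` placeholder are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def url_signature(url: str) -> str:
    """Reduce a URL to host + path pattern, masking purely numeric segments."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    # Assumption: numeric segments are record ids, so they are masked out.
    pattern = ["{num}" if s.isdigit() else s for s in segments]
    return "/".join([parts.netloc.lower()] + pattern)


def cluster_by_url(urls):
    """Group URLs that share the same signature."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_signature(url)].append(url)
    return dict(clusters)


EXAMPLE_URLS = [
    "https://example.org/paper/123",
    "https://Example.org/paper/456",
    "https://example.org/author/ada",
]
```

Pages that land in the same cluster are candidates for sharing one wrapper; a real system would need a much richer notion of URL similarity.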

Page 18:

Picking a Crawler

Page 19:

The Contestants

HERITRIX: powers archive.org

NUTCH: powers Common Crawl

BUbiNG: part of LAW

Scrapy: used within LinkedIn

8. Web crawling, C Olston, M Najork - Foundations and Trends in Information Retrieval, 2010

9. An Introduction to Heritrix: An Open Source Archival Quality Web Crawler, A Dan, K Michele - 2004

10. BUbiNG: massive crawling for the masses, P Boldi, A Marino, M Santini, S Vigna - 2014

11. Nutch: A Flexible and Scalable Open-Source Web Search Engine. CommerceNet Labs, R Khare, D Cutting, K Sitaker, A Rifkin - 2004 - CN-TR-04-04, November

Page 20:

And the winner is …

Page 21:

Satori

Page 22:

Crawl Flow

• Built on Nutch 1.9

• Runs on Hadoop 2.3

• Scheduled to run every 5 hours

• Respects robots.txt

• Default crawl delay of 5 seconds
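The politeness rules on this slide (respect robots.txt, fall back to a 5-second crawl delay) can be sketched with Python's standard `urllib.robotparser`. This is an illustration only, not how Nutch implements it; the helper names, the `example-bot` agent and the sample robots.txt are assumptions.

```python
from urllib.robotparser import RobotFileParser

DEFAULT_CRAWL_DELAY = 5.0  # seconds; the default quoted on the slide


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file into a policy object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawl_delay_for(parser: RobotFileParser, agent: str) -> float:
    """Use the site's Crawl-delay if it declares one, else the default."""
    delay = parser.crawl_delay(agent)
    return float(delay) if delay is not None else DEFAULT_CRAWL_DELAY


def may_fetch(parser: RobotFileParser, agent: str, url: str) -> bool:
    """True if robots.txt allows this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt used only for illustration.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""
```

A fetch loop would call `may_fetch` before each request and sleep for `crawl_delay_for(...)` seconds between requests to the same site.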

Page 23:

Extract Flow

• Output into target schema

• Apply XPath wrappers

• Wrappers are hierarchical mappings of schema field to XPath expression

• Indexed by data domain and data source
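A wrapper as described here, a hierarchical mapping from schema fields to XPath expressions indexed by data domain and data source, might look roughly like the sketch below. The wrapper layout, the field names and the `example.org` source are hypothetical, not Satori's actual format. The sketch uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and assumes well-formed markup; real pages would need a tolerant HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper: a record-level XPath plus one XPath per schema field.
PUBLICATION_WRAPPER = {
    "path": ".//div[@class='publication']",
    "fields": {
        "title": {"path": "./h2"},
        "authors": {"path": "./ul[@class='authors']/li", "many": True},
    },
}

# Wrappers indexed by (data domain, data source), as the slide describes.
WRAPPERS = {("publication", "example.org"): PUBLICATION_WRAPPER}


def extract(root, wrapper):
    """Apply a wrapper to a parsed page and return records in the target schema."""
    records = []
    for node in root.findall(wrapper["path"]):
        record = {}
        for field, spec in wrapper["fields"].items():
            texts = [(m.text or "").strip() for m in node.findall(spec["path"])]
            if spec.get("many"):
                record[field] = texts
            else:
                record[field] = texts[0] if texts else None
        records.append(record)
    return records


# Well-formed sample page used only for illustration.
PAGE = """
<html><body>
  <div class="publication">
    <h2>A Paper</h2>
    <ul class="authors"><li>A. Author</li><li>B. Writer</li></ul>
  </div>
</body></html>
"""
```

Calling `extract(ET.fromstring(PAGE), WRAPPERS[("publication", "example.org")])` yields one record with a `title` string and an `authors` list.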

Page 24:

Crawl rate is bound by the number of sites and the site crawl delay
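As a back-of-the-envelope illustration of this bound: if each site is fetched sequentially with a fixed delay between requests, and the time spent on the fetch itself is ignored, one crawl run can retrieve at most `num_sites * floor(run_duration / crawl_delay)` pages. The function name and the sample numbers below are assumptions for illustration, not figures from the talk.

```python
def max_pages_per_run(num_sites: int, crawl_delay_s: float, run_duration_s: float) -> int:
    """Upper bound on pages fetched in one run with one polite fetcher per site."""
    per_site = int(run_duration_s // crawl_delay_s)
    return num_sites * per_site


# Hypothetical example: 100 sites, 5-second delay, a one-hour run.
EXAMPLE_BOUND = max_pages_per_run(100, 5, 3600)
```

Under these assumptions the only ways to raise throughput are to crawl more sites in parallel or to run for longer; a shorter delay would breach the politeness policy.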

Page 25:

Bootstrap From Bulk Sources

Common Crawl: great source, https://commoncrawl.org/

Gobblin: great ingestion framework, https://github.com/linkedin/gobblin

Page 26:

XPath extractors can be challenging on sites with rich data

Page 27:

It is easy to exceed the Hadoop quota

Page 28:

Match[in]

Page 29:

Matching authors and publications to members to power profile edit experiences

Page 30:

Overview

Page 31:

Start Simple

Match using global identifiers, email or full name.

The data might not be clean after extraction.

Start with a small set of data and get it to the users quickly.
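The "start simple" matcher could be as small as two dictionary lookups: try the email first, then fall back to a normalized full name. The record layout (`id`, `name`, `email`) and the function names are hypothetical, and the normalization is kept minimal because, as the slide notes, extracted data may not be clean.

```python
def normalize_name(name: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings agree."""
    return " ".join(name.lower().split())


def build_indexes(members):
    """Index member records by email and by normalized full name."""
    by_email, by_name = {}, {}
    for member in members:
        if member.get("email"):
            by_email[member["email"].lower()] = member["id"]
        by_name.setdefault(normalize_name(member["name"]), []).append(member["id"])
    return by_email, by_name


def match_author(author, by_email, by_name):
    """Return candidate member ids: an exact email hit wins, else a full-name match."""
    email = (author.get("email") or "").lower()
    if email in by_email:
        return [by_email[email]]
    return by_name.get(normalize_name(author["name"]), [])


# Hypothetical member records used only for illustration.
MEMBERS = [
    {"id": 1, "name": "Ada Lovelace", "email": "ada@example.org"},
    {"id": 2, "name": "Alan Turing", "email": None},
    {"id": 3, "name": "alan  turing", "email": None},
]
```

A full-name match can return several candidates, which is where a more careful model becomes necessary.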

Page 32:

Keep It Simple

Narrow the candidates with LSH [1]

Use the simple model to generate the ground truth

Train using a simple algorithm and a few hundred features

1. https://en.wikipedia.org/wiki/Locality-sensitive_hashing
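One common way to narrow candidates with LSH is MinHash with banding over character shingles. The talk does not say which LSH family was used, so the sketch below is an assumption throughout: the shingle size, the number of hash functions and the number of bands are arbitrary illustrative choices.

```python
import hashlib
from collections import defaultdict


def shingles(text: str, k: int = 3) -> set:
    """Character k-grams of the lowercased, whitespace-normalized text."""
    s = " ".join(text.lower().split())
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}


def _hash(seed: int, token: str) -> int:
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(tokens: set, num_hashes: int = 32) -> list:
    """MinHash signature: the minimum hash value per seeded hash function."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def candidate_pairs(records: dict, num_hashes: int = 32, bands: int = 16) -> set:
    """Return pairs of keys whose signatures agree on at least one band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for key, text in records.items():
        sig = minhash(shingles(text), num_hashes)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(key)
    pairs = set()
    for keys in buckets.values():
        ordered = sorted(keys)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs


# Hypothetical names: "m*" keys stand for members, "a*" for extracted authors.
EXAMPLE = {"m1": "Nikolai Avteniev", "a1": "nikolai  avteniev", "m2": "Grace Hopper"}
```

Only the pairs returned by `candidate_pairs` would be scored by the matching model, which avoids comparing every extracted author against every member.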

Page 33:

Current Status

[Chart: Extractor Objects (Total, Processed) for Publications and Companies]

[Chart: Crawler Objects (Unfetched, Fetched, Gone) for Publication and Company]

Page 34:

Target a data source which has data that will be easy to fetch, extract and match.

Page 35:

Add tracking to the entire flow

Page 36:

Do it all offline if you can

Page 37:

Get the product to the customers early to validate the process and value proposition

Page 38:

Most important of all: write it all down and share it with everyone

Page 39:

©2014 LinkedIn Corporation. All Rights Reserved.