Building Satori: Web Data Extraction on Hadoop
TRANSCRIPT
The Team
Nikita Lytkin, Staff Software Engineer
Pi-Chuan Chang, Sr. Software Engineer
David Astle, Sr. Software Engineer
Nikolai Avteniev, Sr. Staff Software Engineer
Eran Leshem, Sr. Staff Software Engineer
The BIG Idea: What we thought we needed
Inspired by Hsieh, Jonathan M., Steven D. Gribble, and Henry M. Levy. "The Architecture and Implementation of an Extensible Web Crawler." NSDI. 2010.
Focused our Vision: Questions we wanted to answer
Who would use this tool?
Do we need to crawl the entire web?
Do we need to process the pages nearline?
Where would we store this data?
How would we correct mistakes in the flow?
Virtually All Member Value Relies On Identity Data
Susan Kaplan, Sr. Marketing Manager at Weblo
SEARCH: Research & Contact
AD TARGETING: Market Products & Services
PYMK: Build Your Network
RECRUITER: Recruit & Hire
FEED: Get Daily News
NETWORK: Keep in Touch
RECOMMENDATIONS: Get a Job/Gig
WVMP: Establish Yourself as Expert
Identity Use Case: A smarter way to build your profile
• Suggest 1-click profile updates to members
• Using this, we can help members easily fill in profile gaps & get credit for certificates, patents, publications…
• Avg. HTML document transfer size is 6 KB; 37% are under 10 KB [1]
• Samza can handle 1.2M messages per second on a single node [2]
• Retention is limited: data is kept for only 7 to 30 days
• Most of the data is filtered out
• Need to bootstrap Samza stores
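The retention constraint can be sanity-checked with a rough back-of-envelope calculation. The 6 KB average document size is from the slides; the daily crawl volume below is a hypothetical placeholder, not a Satori figure.

```python
# Rough sizing sketch. AVG_DOC_KB comes from the slides; DOCS_PER_DAY is a
# hypothetical placeholder volume, not an actual Satori number.
AVG_DOC_KB = 6
DOCS_PER_DAY = 100_000_000

def retention_bytes(days: int, docs_per_day: int = DOCS_PER_DAY,
                    doc_kb: int = AVG_DOC_KB) -> int:
    """Raw bytes the log must hold to retain `days` worth of crawled pages."""
    return days * docs_per_day * doc_kb * 1024

for days in (7, 30):
    print(f"{days} days of retention ~ {retention_bytes(days) / 1024**4:.1f} TB")
```

At these assumed volumes even the short end of the retention window is terabytes of raw HTML, which is why most of the data has to be filtered out early.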
Not a perfect fit
1. HTML Document Transfer size http://httparchive.org/interesting.php?a=All&l=Oct%2015%202015#bytesHtmlDoc
2. Feng, Tao “Benchmarking Apache Samza: 1.2 million messages per second on a single node” https://engineering.linkedin.com/performance/benchmarking-apache-samza-12-million-messages-second-single-node
Help 400M members fully realize their professional identity on LinkedIn.
Find sources of professional content on the public internet.
Fetch the content, extract structured data, and match it to member profiles.
The Project: Satori
• Enterprise vs. Social Web use cases
• Web Sources
• Wrappers
Web Data Extraction System
3. Ferrara, Emilio, et al. "Web data extraction, applications and techniques: A survey." Knowledge-Based Systems 70 (2014): 301-323.
Induce wrappers based on data [4]
Build wrappers that are robust [5]
Cluster similar pages by URL [6]
The web is huge and there are interesting things in the long tail [7]
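The idea of clustering similar pages by URL [6] can be sketched with a simple structural template: mask identifier-like path segments so pages generated by the same site template collapse into one cluster. The masking rules below are illustrative, not a production heuristic.

```python
import re
from collections import defaultdict

def url_template(url: str) -> str:
    """Reduce a URL to a structural template by masking id-like segments.
    These masking rules are an illustrative sketch, not production logic."""
    path = re.sub(r"^https?://", "", url)
    path = re.sub(r"(?<=/)[0-9a-f]{16,}", "<id>", path)  # long hex ids
    path = re.sub(r"\d+", "<num>", path)                  # numeric ids
    return path

def cluster_by_url(urls):
    """Group URLs that share the same structural template."""
    clusters = defaultdict(list)
    for u in urls:
        clusters[url_template(u)].append(u)
    return dict(clusters)

pages = [
    "http://example.com/profile/123",
    "http://example.com/profile/456",
    "http://example.com/about",
]
print(cluster_by_url(pages))
```

Pages under `/profile/<num>` end up in one cluster, so one wrapper can be induced per cluster instead of per page.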
Industrial Web Data Extraction
4. Dalvi, Nilesh, Ravi Kumar, and Mohamed Soliman. "Automatic wrappers for large scale web extraction." Proceedings of the VLDB Endowment 4.4 (2011): 219-230.
5. Dalvi, Nilesh, Philip Bohannon, and Fei Sha. "Robust web extraction: an approach based on a probabilistic tree-edit model." Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009.
6. Blanco, Lorenzo, Nilesh Dalvi, and Ashwin Machanavajjhala. "Highly efficient algorithms for structural clustering of large websites." Proceedings of the 20th international conference on World wide web. ACM, 2011.
7. Dalvi, Nilesh, Ashwin Machanavajjhala, and Bo Pang. "An analysis of structured data on the web." Proceedings of the VLDB Endowment 5.7 (2012): 680-691.
HERITRIX: powers archive.org
NUTCH: powers Common Crawl
BUbiNG: part of LAW
Scrapy: used within LinkedIn
The Contestants
8. Olston, C., and M. Najork. "Web Crawling." Foundations and Trends in Information Retrieval, 2010.
9. Dan, A., and K. Michele. "An Introduction to Heritrix: An Open Source Archival Quality Web Crawler." 2004.
10. Boldi, P., A. Marino, M. Santini, and S. Vigna. "BUbiNG: Massive Crawling for the Masses." 2014.
11. Khare, R., D. Cutting, K. Sitaker, and A. Rifkin. "Nutch: A Flexible and Scalable Open-Source Web Search Engine." CommerceNet Labs, CN-TR-04-04, November 2004.
• Built on Nutch 1.9
• Runs on Hadoop 2.3
• Scheduled to run every 5 hours
• Respects robots.txt
• Default crawl delay of 5 seconds
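Respecting robots.txt and a crawl delay can be illustrated with Python's standard `urllib.robotparser`. Nutch has its own robots handling; this is only a minimal sketch of the same policy, with a made-up agent name and rules.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse an inline robots.txt instead of fetching one over the network.
rp.parse("""\
User-agent: *
Crawl-delay: 5
Disallow: /private/
""".splitlines())

print(rp.can_fetch("satori-bot", "http://example.com/jobs/42"))    # True
print(rp.can_fetch("satori-bot", "http://example.com/private/x"))  # False
print(rp.crawl_delay("satori-bot"))                                # 5
```

A polite crawler would sleep for `crawl_delay()` seconds between requests to the same host and skip any URL where `can_fetch()` is false.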
Crawl Flow
• Output into target schema
• Apply XPath wrappers
• Wrappers are hierarchical mappings of schema field to XPath expression
• Indexed by data domain and data source
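A minimal sketch of the wrapper structure described above, using Python's `xml.etree` (which supports a subset of XPath). The field names, source key, and sample markup are made up for illustration; they are not Satori's schema.

```python
import xml.etree.ElementTree as ET

# A wrapper maps schema field -> XPath expression, indexed by
# (data domain, data source). Names here are illustrative only.
WRAPPERS = {
    ("publication", "example.com"): {
        "title":  ".//h1",
        "author": ".//span[@class='author']",
        "year":   ".//span[@class='year']",
    },
}

def extract(markup: str, domain: str, source: str) -> dict:
    """Apply the wrapper for (domain, source) and emit a record in the
    target schema. Assumes well-formed markup for this sketch."""
    root = ET.fromstring(markup)
    wrapper = WRAPPERS[(domain, source)]
    return {field: root.findtext(xpath) for field, xpath in wrapper.items()}

page = """<html><body>
  <h1>Deep Learning</h1>
  <span class='author'>I. Goodfellow</span>
  <span class='year'>2016</span>
</body></html>"""

print(extract(page, "publication", "example.com"))
```

Keying wrappers by domain and source means adding a new site is a data change (a new wrapper entry), not a code change.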
Extract Flow
Common Crawl, a great source: https://commoncrawl.org/
Gobblin, a great ingestion framework: https://github.com/linkedin/gobblin
Bootstrap From Bulk Sources
Match using global identifiers, email, or full name.
The data might not be clean after extraction
Start with a small set of data and get it to the users quickly
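The first-pass matching described above (exact join on email, falling back to normalized full name) might look like this sketch. The record shapes and function are hypothetical, not the production matcher.

```python
def match_profiles(extracted, members):
    """Simple first-pass matcher: exact email join, then normalized
    full-name join. Illustrative sketch, not production logic."""
    by_email = {m["email"].lower(): m for m in members if m.get("email")}
    by_name = {m["name"].strip().lower(): m for m in members}
    matches = []
    for rec in extracted:
        member = None
        if rec.get("email"):
            member = by_email.get(rec["email"].lower())
        if member is None and rec.get("name"):
            member = by_name.get(rec["name"].strip().lower())
        if member is not None:
            matches.append((rec, member))
    return matches

members = [{"name": "Susan Kaplan", "email": "susan@example.com"}]
records = [{"name": "susan kaplan", "email": None}]
print(match_profiles(records, members))
```

Exact joins like this are cheap and precise but miss near-matches, which is what motivates the LSH candidate-narrowing step.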
Start Simple
Narrow the candidates with LSH [1]
Use the simple model to generate the ground truth
Train using a simple algorithm and a few hundred features
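The candidate-narrowing step [1] can be sketched with MinHash-based LSH: documents whose signatures agree on every row of any band land in the same bucket, and only bucket-mates become candidate pairs. Shingle size, band count, and the choice of hash below are illustrative, not Satori's parameters.

```python
import hashlib

def shingles(text: str, k: int = 3) -> set:
    """Overlapping k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(sh: set, num_hashes: int) -> list:
    """MinHash signature: min of a seeded hash over the shingle set."""
    return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                for s in sh)
            for seed in range(num_hashes)]

def candidate_pairs(docs: dict, bands: int = 5, rows: int = 4) -> set:
    """Docs whose signatures agree on all rows of any band become candidates."""
    buckets = {}
    for doc_id, text in docs.items():
        sig = minhash(shingles(text), bands * rows)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        pairs.update((a, b) for i, a in enumerate(ordered) for b in ordered[i + 1:])
    return pairs

docs = {
    "a": "Senior Software Engineer at LinkedIn working on Hadoop",
    "b": "senior software engineer at linkedin working on hadoop",
    "c": "completely unrelated words about cooking dinner at home tonight",
}
print(candidate_pairs(docs))  # "a" and "b" share every bucket
```

Only candidate pairs need the expensive full comparison, which is how the matcher scales past a quadratic all-pairs check.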
Keep It Simple
1. https://en.wikipedia.org/wiki/Locality-sensitive_hashing
Extractor (Publications / Companies):
Objects: 5.3 / 2.3
Total Processed: 3.9 / 0.6
Current Status
Crawler (Publication / Company):
Unfetched: 56 / 2
Fetched: 5.6 / 2.5
Gone: 1.2 / 0.1