DataEngConf: Building Satori, a Hadoop Tool for Data Extraction at LinkedIn
TRANSCRIPT
Building Satori: Web Data Extraction On Hadoop
Nikolai Avteniev, Sr. Staff Software Engineer, LinkedIn
Building Opportunity from the Empire State Building
LinkedIn NYC
The Team
Nikita Lytkin, Staff Software Engineer
Pi-Chuan Chang, Sr. Software Engineer
David Astle, Sr. Software Engineer
Nikolai Avteniev, Sr. Staff Software Engineer
Eran Leshem, Sr. Staff Software Engineer
THE ECONOMIC GRAPH
Connecting talent with opportunity at massive scale
Members Companies Jobs Skills Schools Updates
The BIG Idea: What we thought we needed
Inspired by Hsieh, Jonathan M., Steven D. Gribble, and Henry M. Levy. "The Architecture and Implementation of an Extensible Web Crawler." NSDI. 2010.
Focused our Vision: Questions we wanted to answer
Who would use this tool?
Do we need to crawl the entire web?
Do we need to process the pages nearline?
Where would we store this data?
How would we correct mistakes in the flow?
Identity Team
Virtually All Member Value Relies On Identity Data
Susan Kaplan, Sr. Marketing Manager at Weblo
• SEARCH: Research & Contact
• AD TARGETING: Market Products & Services
• PYMK: Build Your Network
• RECRUITER: Recruit & Hire
• FEED: Get Daily News
• NETWORK: Keep in Touch
• RECOMMENDATIONS: Get a Job/Gig
• WVMP: Establish Yourself as an Expert
Identity Use Case: A smarter way to build your profile
• Suggest 1-click profile updates to members
• Using this, we can help members easily fill in profile gaps & get credit for certificates, patents, publications…
Kafka/Samza Team
• The average HTML document transfer size is ~6 KB, and 37% are under 10 KB [1]
• Samza can handle 1.2M messages per second per node [2]
• Kafka retains data only for a limited window, typically between 7 and 30 days
• Most of the data is filtered out
• Samza's local stores would need to be bootstrapped
Not a perfect fit
1. HTML Document Transfer size http://httparchive.org/interesting.php?a=All&l=Oct%2015%202015#bytesHtmlDoc
2. Feng, Tao “Benchmarking Apache Samza: 1.2 million messages per second on a single node” https://engineering.linkedin.com/performance/benchmarking-apache-samza-12-million-messages-second-single-node
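To make the mismatch concrete, here is a minimal back-of-envelope in Python using only the figures cited above; the arithmetic is illustrative, and the conclusion is the slide's:

    # Figures from [1] and [2] above.
    AVG_DOC_BYTES = 6 * 1024            # ~6 KB average HTML document
    MSGS_PER_SEC_PER_NODE = 1_200_000   # Samza throughput on one node

    # Throughput was never the problem: a single node could stream
    # roughly 6.9 GiB of raw HTML per second.
    print(AVG_DOC_BYTES * MSGS_PER_SEC_PER_NODE / 2**30)  # ≈ 6.87

    # Retention was: with only 7-30 days of Kafka history, pages older
    # than the window are gone, so Samza's local stores cannot be
    # rebuilt from the log alone and must be bootstrapped elsewhere.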
Help 400M members fully realize their professional identity on LinkedIn.
Find sources of professional content on the public internet.
Fetch the content, extract structured data, and match it to member profiles.
The Project: Satori
Web Data Extraction HOW TO:
• Enterprise vs. Social Web use cases
• Web Sources
• Wrappers
Web Data Extraction System
3. Ferrara, Emilio, et al. "Web data extraction, applications and techniques: A survey." Knowledge-Based Systems 70 (2014): 301-323.
What is a Wrapper?
Candy Wrapper vs. Web Wrapper
• Induce wrappers based on data [4]
• Build wrappers that are robust [5]
• Cluster similar pages by URL [6]
• The web is huge and there are interesting things in the long tail [7]
Industrial Web Data Extraction
4. Dalvi, Nilesh, Ravi Kumar, and Mohamed Soliman. "Automatic wrappers for large scale web extraction." Proceedings of the VLDB Endowment 4.4 (2011): 219-230.
5. Dalvi, Nilesh, Philip Bohannon, and Fei Sha. "Robust web extraction: an approach based on a probabilistic tree-edit model." Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009.
6. Blanco, Lorenzo, Nilesh Dalvi, and Ashwin Machanavajjhala. "Highly efficient algorithms for structural clustering of large websites." Proceedings of the 20th international conference on World wide web. ACM, 2011.
7. Dalvi, Nilesh, Ashwin Machanavajjhala, and Bo Pang. "An analysis of structured data on the web." Proceedings of the VLDB Endowment 5.7 (2012): 680-691.
Picking a Crawler
• HERITRIX: powers archive.org
• NUTCH: powers Common Crawl
• BUbiNG: part of LAW (Laboratory for Web Algorithmics)
• Scrapy: used within LinkedIn
The Contestants
8. Olston, C., and M. Najork. "Web Crawling." Foundations and Trends in Information Retrieval, 2010.
9. Dan, A., and K. Michele. "An Introduction to Heritrix: An Open Source Archival Quality Web Crawler." 2004.
10. Boldi, P., A. Marino, M. Santini, and S. Vigna. "BUbiNG: Massive Crawling for the Masses." 2014.
11. Khare, R., D. Cutting, K. Sitaker, and A. Rifkin. "Nutch: A Flexible and Scalable Open-Source Web Search Engine." CommerceNet Labs, CN-TR-04-04, November 2004.
And the winner is …
Satori
• Built on Nutch 1.9
• Runs on Hadoop 2.3
• Scheduled to run every 5 hours
• Respects robots.txt
• Default crawl delay of 5 seconds
Crawl Flow
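As a sketch of what one iteration of that crawl cycle looks like when driven from Python, assuming a local Nutch 1.9 install and hypothetical paths (the deck does not show Satori's actual job definitions):

    import os
    import subprocess

    NUTCH = "bin/nutch"        # hypothetical path to a Nutch 1.9 install
    CRAWLDB = "crawl/crawldb"
    SEGMENTS = "crawl/segments"

    def nutch(*args):
        subprocess.run([NUTCH, *args], check=True)

    # One iteration of the classic Nutch crawl cycle, repeated by a
    # scheduler (every 5 hours, per the slide). The 5-second politeness
    # delay corresponds to Nutch's fetcher.server.delay setting.
    nutch("inject", CRAWLDB, "urls")      # seed the CrawlDb with start URLs
    nutch("generate", CRAWLDB, SEGMENTS)  # select the next fetch list
    segment = os.path.join(SEGMENTS, sorted(os.listdir(SEGMENTS))[-1])
    nutch("fetch", segment)               # fetch pages, honoring robots.txt
    nutch("parse", segment)               # parse the fetched content
    nutch("updatedb", CRAWLDB, segment)   # fold results back into the CrawlDb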
• Output into the target schema
• Apply XPath wrappers
• Wrappers are a hierarchical mapping of schema fields to XPath expressions
• Indexed by data domain and data source
Extract Flow
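As an illustration of such a wrapper, here is a sketch in Python with lxml; the schema fields and XPath expressions are invented for this example, not taken from Satori:

    from lxml import html

    # A wrapper: a hierarchical mapping of schema fields to XPath
    # expressions, indexed by (data domain, data source).
    PUBLICATION_WRAPPER = {
        "title":   "//h1[@class='pub-title']/text()",
        "authors": "//span[@class='author']/text()",
        "year":    "//time[@class='published']/@datetime",
    }

    def apply_wrapper(page_source, wrapper):
        # Evaluate each field's XPath against the page, producing a
        # record in the target schema.
        tree = html.fromstring(page_source)
        return {field: tree.xpath(xpath) for field, xpath in wrapper.items()}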
Crawl rate is bounded by the number of sites and the per-site crawl delay.
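A worked example of that bound, using the 5-second default delay from the crawler slide:

    SECONDS_PER_DAY = 86_400
    CRAWL_DELAY = 5   # seconds between requests to any one site

    # Politeness caps each site at one fetch per delay interval:
    pages_per_site_per_day = SECONDS_PER_DAY // CRAWL_DELAY   # 17,280

    # So total daily capacity scales with the number of sites,
    # not with fetcher hardware:
    def daily_capacity(num_sites, delay=CRAWL_DELAY):
        return num_sites * (SECONDS_PER_DAY // delay)

    print(daily_capacity(1_000))   # 17,280,000 pages/day across 1,000 sites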
Common Crawl, a great source: https://commoncrawl.org/
Gobblin, a great ingestion framework: https://github.com/linkedin/gobblin
Bootstrap From Bulk Sources
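A sketch of what bootstrapping from Common Crawl can look like, filtering WARC response records by host with the warcio library; this is one way to do it, not necessarily how Satori's Gobblin-based ingestion worked:

    from urllib.parse import urlparse
    from warcio.archiveiterator import ArchiveIterator

    def pages_from_warc(stream, wanted_hosts):
        # Yield (url, html_bytes) for response records whose host is
        # one of the professional-content sources being targeted.
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            if url and urlparse(url).netloc in wanted_hosts:
                yield url, record.content_stream().read()

    # Usage sketch (archive path elided):
    # with open("CC-MAIN-....warc.gz", "rb") as f:
    #     for url, body in pages_from_warc(f, {"example.org"}):
    #         ...  # feed into the extract flow above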
XPath extractors can be challenging on sites with rich data.
It is easy to exceed the Hadoop quota.
Match[in]
Matching authors and publications to members to power profile edit experiences
Overview
Match using global identifiers, email, or full name.
The data might not be clean after extraction.
Start with a small set of data and get it to the users quickly.
Start Simple
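A minimal sketch of those first matching rules in Python; the field names, and the use of ORCID as the global identifier, are assumptions for illustration:

    def simple_match(extracted, member):
        # Exact joins in priority order: global identifier, then email,
        # then normalized full name. Returns which rule fired, or None.
        def norm(name):
            return " ".join(name.lower().split()) if name else None

        if extracted.get("orcid") and extracted["orcid"] == member.get("orcid"):
            return "global-id"
        if extracted.get("email") and extracted["email"] == member.get("email"):
            return "email"
        a, b = norm(extracted.get("author_name")), norm(member.get("full_name"))
        if a is not None and a == b:
            return "full-name"
        return None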
Narrow the candidates with LSH [1]
Use the simple model to generate the ground truth
Train using a simple algorithm and a few hundred features
Keep It Simple
1. https://en.wikipedia.org/wiki/Locality-sensitive_hashing
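To show how the two ideas fit together, here is a sketch using MinHash LSH for candidate narrowing (via the datasketch library) and a simple classifier for scoring (via scikit-learn); the libraries, tokens, threshold, and toy features are all my choices, since the deck names none:

    from datasketch import MinHash, MinHashLSH
    from sklearn.linear_model import LogisticRegression

    def minhash(tokens, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for t in tokens:
            m.update(t.encode("utf8"))
        return m

    # 1) Narrow: index member token sets once, query per extracted
    #    author instead of scoring every (author, member) pair.
    lsh = MinHashLSH(threshold=0.5, num_perm=128)
    lsh.insert("member:42", minhash(["susan", "kaplan", "weblo", "marketing"]))
    candidates = lsh.query(minhash(["susan", "kaplan", "marketing", "manager"]))
    print(candidates)   # typically ['member:42']

    # 2) Score: train a simple model on labels generated by the
    #    high-precision rules above; three toy features stand in for
    #    the "few hundred" used in practice.
    X = [[1.0, 0.9, 1], [0.1, 0.2, 0], [0.8, 0.7, 1], [0.0, 0.1, 0]]
    y = [1, 0, 1, 0]
    clf = LogisticRegression().fit(X, y)
    print(clf.predict_proba([[0.9, 0.8, 1]])[0, 1])   # match probability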
Current Status

Extractor objects, total vs. processed:
• Publications: 5.3 total, 2.3 processed
• Companies: 3.9 total, 0.6 processed

[Chart: crawler objects by state (unfetched, fetched, gone) for Publication and Company sources]
Target a data source whose data will be easy to fetch, extract, and match.
Add tracking to the entire flow.
Do it all offline if you can.
Get the product to customers early to validate the process and the value proposition.
Most important of all: write it all down and share it with everyone.
©2014 LinkedIn Corporation. All Rights Reserved.