DataEngConf: Building Satori, a Hadoop Tool for Data Extraction at LinkedIn
TRANSCRIPT
Building Satori: Web Data Extraction On Hadoop
Nikolai Avteniev, Sr. Staff Software Engineer, LinkedIn
Building Opportunity from the Empire State Building
LinkedIn NYC
The Team
Nikita Lytkin, Staff Software Engineer
Pi-Chuan Chang, Sr. Software Engineer
David Astle, Sr. Software Engineer
Nikolai Avteniev, Sr. Staff Software Engineer
Eran Leshem, Sr. Staff Software Engineer
THE ECONOMIC GRAPH
Connecting talent with opportunity at massive scale
Members Companies Jobs Skills Schools Updates
The BIG Idea: What we thought we needed
Inspired by Hsieh, Jonathan M., Steven D. Gribble, and Henry M. Levy. "The Architecture and Implementation of an Extensible Web Crawler." NSDI. 2010.
Focused our Vision: Questions we wanted to answer
Who would use this tool?
Do we need to crawl the entire web?
Do we need to process the pages nearline?
Where would we store this data?
How would we correct mistakes in the flow?
Identity Team
Virtually All Member Value Relies On Identity Data
Susan Kaplan, Sr. Marketing Manager at Weblo
• SEARCH: Research & Contact
• AD TARGETING: Market Products & Services
• PYMK: Build Your Network
• RECRUITER: Recruit & Hire
• FEED: Get Daily News
• NETWORK: Keep in Touch
• RECOMMENDATIONS: Get a Job/Gig
• WVMP: Establish Yourself as an Expert
Identity Use Case: A smarter way to build your profile
• Suggest 1-click profile updates to members
• Using this, we can help members easily fill in profile gaps & get credit for certificates, patents, publications…
Kafka/Samza Team
• The average HTML document transfer size is ~6 KB, and 37% are under 10 KB [1]
• Samza can handle 1.2M messages per second per node [2]
• Kafka retains data only for a limited window, typically between 7 and 30 days
• Most of the data is filtered out
• Samza's local stores would need to be bootstrapped
Not a perfect fit
1. HTML Document Transfer size http://httparchive.org/interesting.php?a=All&l=Oct%2015%202015#bytesHtmlDoc
2. Feng, Tao “Benchmarking Apache Samza: 1.2 million messages per second on a single node” https://engineering.linkedin.com/performance/benchmarking-apache-samza-12-million-messages-second-single-node
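To make the mismatch concrete, here is a minimal back-of-envelope in Python using only the figures cited above; the arithmetic is illustrative, and the conclusion is the slide's:

    # Figures from [1] and [2] above.
    AVG_DOC_BYTES = 6 * 1024            # ~6 KB average HTML document
    MSGS_PER_SEC_PER_NODE = 1_200_000   # Samza throughput on one node

    # Throughput was never the problem: a single node could stream
    # roughly 6.9 GiB of raw HTML per second.
    print(AVG_DOC_BYTES * MSGS_PER_SEC_PER_NODE / 2**30)  # ≈ 6.87

    # Retention was: with only 7-30 days of Kafka history, pages older
    # than the window are gone, so Samza's local stores cannot be
    # rebuilt from the log alone and must be bootstrapped elsewhere.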
Help 400M members fully realize their professional identity on LinkedIn.
Find sources of professional content on the public internet.
Fetch the content, extract structured data, and match it to member profiles.
The Project: Satori
Web Data Extraction HOW TO:
• Enterprise vs. Social Web use cases
• Web Sources
• Wrappers
Web Data Extraction System
3. Ferrara, Emilio, et al. "Web data extraction, applications and techniques: A survey." Knowledge-Based Systems 70 (2014): 301-323.
What is a Wrapper?
Candy Wrapper vs. Web Wrapper
• Induce wrappers based on data [4]
• Build wrappers that are robust [5]
• Cluster similar pages by URL [6]
• The web is huge and there are interesting things in the long tail [7]
Industrial Web Data Extraction
4. Dalvi, Nilesh, Ravi Kumar, and Mohamed Soliman. "Automatic wrappers for large scale web extraction." Proceedings of the VLDB Endowment 4.4 (2011): 219-230.
5. Dalvi, Nilesh, Philip Bohannon, and Fei Sha. "Robust web extraction: an approach based on a probabilistic tree-edit model." Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009.
6. Blanco, Lorenzo, Nilesh Dalvi, and Ashwin Machanavajjhala. "Highly efficient algorithms for structural clustering of large websites." Proceedings of the 20th international conference on World wide web. ACM, 2011.
7. Dalvi, Nilesh, Ashwin Machanavajjhala, and Bo Pang. "An analysis of structured data on the web." Proceedings of the VLDB Endowment 5.7 (2012): 680-691.
Picking a Crawler
• HERITRIX: powers archive.org
• NUTCH: powers Common Crawl
• BUbiNG: part of LAW (Laboratory for Web Algorithmics)
• Scrapy: used within LinkedIn
The Contestants
8. Olston, C., and M. Najork. "Web Crawling." Foundations and Trends in Information Retrieval, 2010.
9. Dan, A., and K. Michele. "An Introduction to Heritrix: An Open Source Archival Quality Web Crawler." 2004.
10. Boldi, P., A. Marino, M. Santini, and S. Vigna. "BUbiNG: Massive Crawling for the Masses." 2014.
11. Khare, R., D. Cutting, K. Sitaker, and A. Rifkin. "Nutch: A Flexible and Scalable Open-Source Web Search Engine." CommerceNet Labs, CN-TR-04-04, November 2004.
And the winner is …
Satori
• Built on Nutch 1.9
• Runs on Hadoop 2.3
• Scheduled to run every 5 hours
• Respects robots.txt
• Default crawl delay of 5 seconds
Crawl Flow
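As a sketch of what one iteration of that crawl cycle looks like when driven from Python, assuming a local Nutch 1.9 install and hypothetical paths (the deck does not show Satori's actual job definitions):

    import os
    import subprocess

    NUTCH = "bin/nutch"        # hypothetical path to a Nutch 1.9 install
    CRAWLDB = "crawl/crawldb"
    SEGMENTS = "crawl/segments"

    def nutch(*args):
        subprocess.run([NUTCH, *args], check=True)

    # One iteration of the classic Nutch crawl cycle, repeated by a
    # scheduler (every 5 hours, per the slide). The 5-second politeness
    # delay corresponds to Nutch's fetcher.server.delay setting.
    nutch("inject", CRAWLDB, "urls")      # seed the CrawlDb with start URLs
    nutch("generate", CRAWLDB, SEGMENTS)  # select the next fetch list
    segment = os.path.join(SEGMENTS, sorted(os.listdir(SEGMENTS))[-1])
    nutch("fetch", segment)               # fetch pages, honoring robots.txt
    nutch("parse", segment)               # parse the fetched content
    nutch("updatedb", CRAWLDB, segment)   # fold results back into the CrawlDb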
• Output into the target schema
• Apply XPath wrappers
• Wrappers are a hierarchical mapping of schema fields to XPath expressions
• Indexed by data domain and data source
Extract Flow
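As an illustration of such a wrapper, here is a sketch in Python with lxml; the schema fields and XPath expressions are invented for this example, not taken from Satori:

    from lxml import html

    # A wrapper: a hierarchical mapping of schema fields to XPath
    # expressions, indexed by (data domain, data source).
    PUBLICATION_WRAPPER = {
        "title":   "//h1[@class='pub-title']/text()",
        "authors": "//span[@class='author']/text()",
        "year":    "//time[@class='published']/@datetime",
    }

    def apply_wrapper(page_source, wrapper):
        # Evaluate each field's XPath against the page, producing a
        # record in the target schema.
        tree = html.fromstring(page_source)
        return {field: tree.xpath(xpath) for field, xpath in wrapper.items()}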
Crawl rate is bounded by the number of sites and the per-site crawl delay.
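A worked example of that bound, using the 5-second default delay from the crawler slide:

    SECONDS_PER_DAY = 86_400
    CRAWL_DELAY = 5   # seconds between requests to any one site

    # Politeness caps each site at one fetch per delay interval:
    pages_per_site_per_day = SECONDS_PER_DAY // CRAWL_DELAY   # 17,280

    # So total daily capacity scales with the number of sites,
    # not with fetcher hardware:
    def daily_capacity(num_sites, delay=CRAWL_DELAY):
        return num_sites * (SECONDS_PER_DAY // delay)

    print(daily_capacity(1_000))   # 17,280,000 pages/day across 1,000 sites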
Common Crawl, a great source: https://commoncrawl.org/
Gobblin, a great ingestion framework: https://github.com/linkedin/gobblin
Bootstrap From Bulk Sources
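A sketch of what bootstrapping from Common Crawl can look like, filtering WARC response records by host with the warcio library; this is one way to do it, not necessarily how Satori's Gobblin-based ingestion worked:

    from urllib.parse import urlparse
    from warcio.archiveiterator import ArchiveIterator

    def pages_from_warc(stream, wanted_hosts):
        # Yield (url, html_bytes) for response records whose host is
        # one of the professional-content sources being targeted.
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            if url and urlparse(url).netloc in wanted_hosts:
                yield url, record.content_stream().read()

    # Usage sketch (archive path elided):
    # with open("CC-MAIN-....warc.gz", "rb") as f:
    #     for url, body in pages_from_warc(f, {"example.org"}):
    #         ...  # feed into the extract flow above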
XPath extractors can be challenging on sites with rich data.
It is easy to exceed the Hadoop quota.
Match[in]
Matching authors and publications to members to power profile edit experiences
Overview
Match using global identifiers, email, or full name.
The data might not be clean after extraction.
Start with a small set of data and get it to the users quickly.
Start Simple
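A minimal sketch of those first matching rules in Python; the field names, and the use of ORCID as the global identifier, are assumptions for illustration:

    def simple_match(extracted, member):
        # Exact joins in priority order: global identifier, then email,
        # then normalized full name. Returns which rule fired, or None.
        def norm(name):
            return " ".join(name.lower().split()) if name else None

        if extracted.get("orcid") and extracted["orcid"] == member.get("orcid"):
            return "global-id"
        if extracted.get("email") and extracted["email"] == member.get("email"):
            return "email"
        a, b = norm(extracted.get("author_name")), norm(member.get("full_name"))
        if a is not None and a == b:
            return "full-name"
        return None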
Narrow the candidates with LSH [1]
Use the simple model to generate the ground truth
Train using a simple algorithm and a few hundred features
Keep It Simple
1. https://en.wikipedia.org/wiki/Locality-sensitive_hashing
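To show how the two ideas fit together, here is a sketch using MinHash LSH for candidate narrowing (via the datasketch library) and a simple classifier for scoring (via scikit-learn); the libraries, tokens, threshold, and toy features are all my choices, since the deck names none:

    from datasketch import MinHash, MinHashLSH
    from sklearn.linear_model import LogisticRegression

    def minhash(tokens, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for t in tokens:
            m.update(t.encode("utf8"))
        return m

    # 1) Narrow: index member token sets once, query per extracted
    #    author instead of scoring every (author, member) pair.
    lsh = MinHashLSH(threshold=0.5, num_perm=128)
    lsh.insert("member:42", minhash(["susan", "kaplan", "weblo", "marketing"]))
    candidates = lsh.query(minhash(["susan", "kaplan", "marketing", "manager"]))
    print(candidates)   # typically ['member:42']

    # 2) Score: train a simple model on labels generated by the
    #    high-precision rules above; three toy features stand in for
    #    the "few hundred" used in practice.
    X = [[1.0, 0.9, 1], [0.1, 0.2, 0], [0.8, 0.7, 1], [0.0, 0.1, 0]]
    y = [1, 0, 1, 0]
    clf = LogisticRegression().fit(X, y)
    print(clf.predict_proba([[0.9, 0.8, 1]])[0, 1])   # match probability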
Current Status

Extractor objects, total vs. processed:
• Publications: 5.3 total, 2.3 processed
• Companies: 3.9 total, 0.6 processed

[Chart: crawler objects by state (unfetched, fetched, gone) for Publication and Company sources]
Target a data source whose data will be easy to fetch, extract, and match.
Add tracking to the entire flow.
Do it all offline if you can.
Get the product to customers early to validate the process and the value proposition.
Most important of all: write it all down and share it with everyone.
©2014 LinkedIn Corporation. All Rights Reserved.