Building Satori: Web Data Extraction on Hadoop
TRANSCRIPT
The Team
Nikita Lytkin, Staff Software Engineer
Pi-Chuan Chang, Sr. Software Engineer
David Astle, Sr. Software Engineer
Nikolai Avteniev, Sr. Staff Software Engineer
Eran Leshem, Sr. Staff Software Engineer
The BIG Idea: What we thought we needed
Inspired by Hsieh, Jonathan M., Steven D. Gribble, and Henry M. Levy. "The Architecture and Implementation of an Extensible Web Crawler." NSDI. 2010.
Focused our Vision: Questions we wanted to answer
Who would use this tool?
Do we need to crawl the entire web?
Do we need to process the pages nearline?
Where would we store this data?
How would we correct mistakes in the flow?
Virtually All Member Value Relies On Identity Data
Susan Kaplan, Sr. Marketing Manager at Weblo
SEARCH: Research & Contact
AD TARGETING: Market Products & Services
PYMK: Build Your Network
RECRUITER: Recruit & Hire
FEED: Get Daily News
NETWORK: Keep in Touch
RECOMMENDATIONS: Get a Job/Gig
WVMP: Establish Yourself as Expert
Identity Use Case: A smarter way to build your profile
• Suggest 1-click profile updates to members
• Using this, we can help members easily fill in profile gaps & get credit for certificates, patents, publications…
• Avg. HTML document transfer size is 6 KB; 37% are under 10 KB [1]
• Samza can handle 1.2M messages per second on a single node [2]
• Retention is limited: data is kept for only 7 to 30 days
• Most of the data is filtered out
• Need to bootstrap Samza stores
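The retention constraint can be sanity-checked with a rough back-of-envelope calculation. The 6 KB average document size is from the slides; the daily crawl volume below is a hypothetical placeholder, not a Satori figure.

```python
# Rough sizing sketch. AVG_DOC_KB comes from the slides; DOCS_PER_DAY is a
# hypothetical placeholder volume, not an actual Satori number.
AVG_DOC_KB = 6
DOCS_PER_DAY = 100_000_000

def retention_bytes(days: int, docs_per_day: int = DOCS_PER_DAY,
                    doc_kb: int = AVG_DOC_KB) -> int:
    """Raw bytes the log must hold to retain `days` worth of crawled pages."""
    return days * docs_per_day * doc_kb * 1024

for days in (7, 30):
    print(f"{days} days of retention ~ {retention_bytes(days) / 1024**4:.1f} TB")
```

At these assumed volumes even the short end of the retention window is terabytes of raw HTML, which is why most of the data has to be filtered out early.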
Not a perfect fit
1. HTML Document Transfer size http://httparchive.org/interesting.php?a=All&l=Oct%2015%202015#bytesHtmlDoc
2. Feng, Tao “Benchmarking Apache Samza: 1.2 million messages per second on a single node” https://engineering.linkedin.com/performance/benchmarking-apache-samza-12-million-messages-second-single-node
Help 400M members fully realize their professional identity on LinkedIn.
Find sources of professional content on the public internet.
Fetch the content, extract structured data, and match it to member profiles.
The Project: Satori
• Enterprise vs. Social Web use cases
• Web Sources
• Wrappers
Web Data Extraction System
3. Ferrara, Emilio, et al. "Web data extraction, applications and techniques: A survey." Knowledge-Based Systems 70 (2014): 301-323.
Induce wrappers based on data [4]
Build wrappers that are robust [5]
Cluster similar pages by URL [6]
The web is huge and there are interesting things in the long tail [7]
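The idea of clustering similar pages by URL [6] can be sketched with a simple structural template: mask identifier-like path segments so pages generated by the same site template collapse into one cluster. The masking rules below are illustrative, not a production heuristic.

```python
import re
from collections import defaultdict

def url_template(url: str) -> str:
    """Reduce a URL to a structural template by masking id-like segments.
    These masking rules are an illustrative sketch, not production logic."""
    path = re.sub(r"^https?://", "", url)
    path = re.sub(r"(?<=/)[0-9a-f]{16,}", "<id>", path)  # long hex ids
    path = re.sub(r"\d+", "<num>", path)                  # numeric ids
    return path

def cluster_by_url(urls):
    """Group URLs that share the same structural template."""
    clusters = defaultdict(list)
    for u in urls:
        clusters[url_template(u)].append(u)
    return dict(clusters)

pages = [
    "http://example.com/profile/123",
    "http://example.com/profile/456",
    "http://example.com/about",
]
print(cluster_by_url(pages))
```

Pages under `/profile/<num>` end up in one cluster, so one wrapper can be induced per cluster instead of per page.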
Industrial Web Data Extraction
4. Dalvi, Nilesh, Ravi Kumar, and Mohamed Soliman. "Automatic wrappers for large scale web extraction." Proceedings of the VLDB Endowment 4.4 (2011): 219-230.
5. Dalvi, Nilesh, Philip Bohannon, and Fei Sha. "Robust web extraction: an approach based on a probabilistic tree-edit model." Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009.
6. Blanco, Lorenzo, Nilesh Dalvi, and Ashwin Machanavajjhala. "Highly efficient algorithms for structural clustering of large websites." Proceedings of the 20th international conference on World wide web. ACM, 2011.
7. Dalvi, Nilesh, Ashwin Machanavajjhala, and Bo Pang. "An analysis of structured data on the web." Proceedings of the VLDB Endowment 5.7 (2012): 680-691.
HERITRIX: powers archive.org
NUTCH: powers Common Crawl
BUbiNG: part of LAW
Scrapy: used within LinkedIn
The Contestants
8. Olston, C., and M. Najork. "Web Crawling." Foundations and Trends in Information Retrieval, 2010.
9. Dan, A., and K. Michele. "An Introduction to Heritrix: An Open Source Archival Quality Web Crawler." 2004.
10. Boldi, P., A. Marino, M. Santini, and S. Vigna. "BUbiNG: Massive Crawling for the Masses." 2014.
11. Khare, R., D. Cutting, K. Sitaker, and A. Rifkin. "Nutch: A Flexible and Scalable Open-Source Web Search Engine." CommerceNet Labs, CN-TR-04-04, November 2004.
• Built on Nutch 1.9
• Runs on Hadoop 2.3
• Scheduled to run every 5 hours
• Respects robots.txt
• Default crawl delay of 5 seconds
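Respecting robots.txt and a crawl delay can be illustrated with Python's standard `urllib.robotparser`. Nutch has its own robots handling; this is only a minimal sketch of the same policy, with a made-up agent name and rules.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse an inline robots.txt instead of fetching one over the network.
rp.parse("""\
User-agent: *
Crawl-delay: 5
Disallow: /private/
""".splitlines())

print(rp.can_fetch("satori-bot", "http://example.com/jobs/42"))    # True
print(rp.can_fetch("satori-bot", "http://example.com/private/x"))  # False
print(rp.crawl_delay("satori-bot"))                                # 5
```

A polite crawler would sleep for `crawl_delay()` seconds between requests to the same host and skip any URL where `can_fetch()` is false.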
Crawl Flow
• Output into target schema
• Apply XPath wrappers
• Wrappers are hierarchical mappings of schema field to XPath expression
• Indexed by data domain and data source
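A minimal sketch of the wrapper structure described above, using Python's `xml.etree` (which supports a subset of XPath). The field names, source key, and sample markup are made up for illustration; they are not Satori's schema.

```python
import xml.etree.ElementTree as ET

# A wrapper maps schema field -> XPath expression, indexed by
# (data domain, data source). Names here are illustrative only.
WRAPPERS = {
    ("publication", "example.com"): {
        "title":  ".//h1",
        "author": ".//span[@class='author']",
        "year":   ".//span[@class='year']",
    },
}

def extract(markup: str, domain: str, source: str) -> dict:
    """Apply the wrapper for (domain, source) and emit a record in the
    target schema. Assumes well-formed markup for this sketch."""
    root = ET.fromstring(markup)
    wrapper = WRAPPERS[(domain, source)]
    return {field: root.findtext(xpath) for field, xpath in wrapper.items()}

page = """<html><body>
  <h1>Deep Learning</h1>
  <span class='author'>I. Goodfellow</span>
  <span class='year'>2016</span>
</body></html>"""

print(extract(page, "publication", "example.com"))
```

Keying wrappers by domain and source means adding a new site is a data change (a new wrapper entry), not a code change.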
Extract Flow
Common Crawl, a great source: https://commoncrawl.org/
Gobblin, a great ingestion framework: https://github.com/linkedin/gobblin
Bootstrap From Bulk Sources
Match using global identifiers, email, or full name.
The data might not be clean after extraction
Start with a small set of data and get it to the users quickly
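The first-pass matching described above (exact join on email, falling back to normalized full name) might look like this sketch. The record shapes and function are hypothetical, not the production matcher.

```python
def match_profiles(extracted, members):
    """Simple first-pass matcher: exact email join, then normalized
    full-name join. Illustrative sketch, not production logic."""
    by_email = {m["email"].lower(): m for m in members if m.get("email")}
    by_name = {m["name"].strip().lower(): m for m in members}
    matches = []
    for rec in extracted:
        member = None
        if rec.get("email"):
            member = by_email.get(rec["email"].lower())
        if member is None and rec.get("name"):
            member = by_name.get(rec["name"].strip().lower())
        if member is not None:
            matches.append((rec, member))
    return matches

members = [{"name": "Susan Kaplan", "email": "susan@example.com"}]
records = [{"name": "susan kaplan", "email": None}]
print(match_profiles(records, members))
```

Exact joins like this are cheap and precise but miss near-matches, which is what motivates the LSH candidate-narrowing step.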
Start Simple
Narrow the candidates with LSH [1]
Use the simple model to generate the ground truth
Train using a simple algorithm and a few hundred features
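The candidate-narrowing step [1] can be sketched with MinHash-based LSH: documents whose signatures agree on every row of any band land in the same bucket, and only bucket-mates become candidate pairs. Shingle size, band count, and the choice of hash below are illustrative, not Satori's parameters.

```python
import hashlib

def shingles(text: str, k: int = 3) -> set:
    """Overlapping k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(sh: set, num_hashes: int) -> list:
    """MinHash signature: min of a seeded hash over the shingle set."""
    return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                for s in sh)
            for seed in range(num_hashes)]

def candidate_pairs(docs: dict, bands: int = 5, rows: int = 4) -> set:
    """Docs whose signatures agree on all rows of any band become candidates."""
    buckets = {}
    for doc_id, text in docs.items():
        sig = minhash(shingles(text), bands * rows)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        pairs.update((a, b) for i, a in enumerate(ordered) for b in ordered[i + 1:])
    return pairs

docs = {
    "a": "Senior Software Engineer at LinkedIn working on Hadoop",
    "b": "senior software engineer at linkedin working on hadoop",
    "c": "completely unrelated words about cooking dinner at home tonight",
}
print(candidate_pairs(docs))  # "a" and "b" share every bucket
```

Only candidate pairs need the expensive full comparison, which is how the matcher scales past a quadratic all-pairs check.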
Keep It Simple
1. https://en.wikipedia.org/wiki/Locality-sensitive_hashing
Extractor (Publications / Companies):
Objects: 5.3 / 2.3
Total Processed: 3.9 / 0.6
Current Status
Crawler (Publication / Company):
Unfetched: 56 / 2
Fetched: 5.6 / 2.5
Gone: 1.2 / 0.1