Clustering the output of Apache Nutch using Apache Spark
Thamme Gowda N. Dr. Chris Mattmann
May 12, 2016. Vancouver, Canada
1
About
● ThammeGowda Narayanaswamy - TG in short - @thammegowda
○ Contributor to Apache Tika and Apache Nutch
○ Now - a grad student @ University of Southern California
○ Past - Technical Co-Founder @ Datoin - http://datoin.com
● Dr. Chris Mattmann @chrismattmann
○ Adj. Prof. and the director of IRDS group @ University of Southern California, Los Angeles
○ Director @ Apache Software Foundation
○ Chief Architect, NASA JPL
2
Overview
● Problem Statement
● Clustering - a solution
● Structure and Style Similarity
● Shared Near Neighbor Clustering
● Scaling it up using Spark’s Distributed Matrices and GraphX
● A demo
3
Audience
● Those who crawl the web
● Those who extract data from the web
● Those who filter web pages
● Anyone who would like to know about -
○ web page structure and style similarity
○ shared near neighbor clustering
4
Problem Statement
● Scraping data from online marketplaces
● Start with homepage → categories → listing pages → actual stuff (detail page)
5
Sample set of web pages credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov
6
Sample set of web pages, annotated by how useful each page is:
● USELESS
● REQUIRED FOR CRAWLER, BUT NOT IMPORTANT FOR ANALYSIS
● USEFUL FOR ANALYSIS
9
Question : How do we solve this?
Answer : Cluster the web pages
10
Why Cluster?
● Separate the interesting web pages
○ Drop uninteresting/noisy web pages
○ Categorical treatment of clusters
● Extract structured data using XPath
○ Automated extraction using alignment
11
Goal
● Group web pages that are similar
● Similar in terms of
○ CSS Styles
○ DOM Structure
● Toolkit for experimentation with various thresholds
○ % of similarity in style and/or structure
○ Nice visualizations
12
How do we cluster?
● Based on similarity between pages
● Semantic similarity
○ Meaning of the web pages
● Syntactic similarity
○ Web page structure, CSS styles
● This session focuses on the syntactic aspect
13
Structural similarity
● Web pages are built with HTML
● HTML Doc → DOM tree
● The DOM is a labeled ordered tree
● Structural similarity using tree edit distance (TED)
Example DOM tree: HTML → [ HEAD → [ TITLE ], BODY → [ DIV, P ] ]
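A minimal sketch of the "HTML Doc → DOM tree" step, using Python's standard `html.parser` rather than the parser used in the talk. The `TreeBuilder` class, the `[label, children]` node shape and the `#document` root label are illustrative assumptions, and the sketch only handles reasonably well-formed markup.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Builds a nested [label, children] structure (hypothetical node shape)."""

    def __init__(self):
        super().__init__()
        self.root = ["#document", []]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag.upper(), []]
        self.stack[-1][1].append(node)      # children keep document order
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, if there is one.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth][0] == tag.upper():
                del self.stack[depth:]
                break


def html_to_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def to_text(node, indent=0):
    """Renders the tree as indented lines, one label per line."""
    label, children = node
    lines = ["  " * indent + label]
    for child in children:
        lines.extend(to_text(child, indent + 1))
    return lines


page = "<html><head><title>t</title></head><body><div></div><p>x</p></body></html>"
tree = html_to_tree(page)
print("\n".join(to_text(tree)))
```

Only element labels and their order are kept here; text and attributes are dropped, which matches the idea of comparing structure only.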
14
(Minimum) Tree Edit Distance
● Edit distance measure similar to the one for strings, but on hierarchical data instead of sequences
● Number of editing operations required to transform one tree into another
● Three basic editing operations: INSERT, REMOVE and REPLACE
● A useful measure to quantify how similar (or dissimilar) two trees are
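To make the definition concrete, here is a small sketch of an ordered tree edit distance with unit costs for INSERT, REMOVE and REPLACE. It uses the plain memoized forest recursion, not the optimized Zhang-Shasha algorithm, so it is only meant for small trees; the `node` helper and the tuple representation are assumptions made for this example.

```python
from functools import lru_cache


def node(label, *children):
    """A tree is (label, tuple_of_child_trees); tuples are hashable for caching."""
    return (label, tuple(children))


def size(forest):
    """Number of nodes in a forest (a tuple of trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    # If one forest is empty, every node of the other is inserted or removed.
    if not f or not g:
        return size(f) + size(g)
    v_label, v_children = f[-1]          # rightmost root of f
    w_label, w_children = g[-1]          # rightmost root of g
    # REMOVE v: its children take its place in the forest.
    remove = forest_distance(f[:-1] + v_children, g) + 1
    # INSERT w: symmetric to the removal above.
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Match v with w (REPLACE if labels differ), then solve both sub-problems.
    replace = (forest_distance(v_children, w_children)
               + forest_distance(f[:-1], g[:-1])
               + (0 if v_label == w_label else 1))
    return min(remove, insert, replace)


def tree_edit_distance(a, b):
    return forest_distance((a,), (b,))


page = node("HTML", node("HEAD", node("TITLE")), node("BODY", node("DIV"), node("P")))
no_p = node("HTML", node("HEAD", node("TITLE")), node("BODY", node("DIV")))
with_span = node("HTML", node("HEAD", node("TITLE")), node("BODY", node("SPAN"), node("P")))

print(tree_edit_distance(page, no_p))       # one REMOVE (the P node)
print(tree_edit_distance(page, with_span))  # one REPLACE (DIV -> SPAN)
```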
15
Example: Tree Edit Distance*
● Edit operations
● Normalized distance
* Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.
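The slide does not spell out how the distance is normalized. One possible choice, shown here purely as an assumption, divides the raw edit distance by the combined node count of both trees, which is an upper bound on the unit-cost distance (remove every node of one tree, insert every node of the other).

```python
def normalized_distance(distance, size_a, size_b):
    """Raw tree edit distance scaled into [0, 1] (assumed normalization)."""
    total = size_a + size_b
    return distance / total if total else 0.0


def structural_similarity(distance, size_a, size_b):
    """Similarity as the complement of the normalized distance."""
    return 1.0 - normalized_distance(distance, size_a, size_b)


# A 6-node tree and a 5-node tree that differ by a single removed node.
print(normalized_distance(1, 6, 5))
print(structural_similarity(1, 6, 5))
```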
16
Style Similarity
● Have you noticed?
○ Similar web pages have similar CSS styles
● XPath : "//*[@class]/@class"
● Simple measure -
○ Jaccard Similarity on CSS class names
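A rough sketch of the style measure: collect the CSS class names of each page and compute the Jaccard similarity of the two sets. Python's `html.parser` stands in for the XPath query here, splitting each `class` attribute into individual names is an assumption, and so is treating two pages without any classes as identical in style.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every name found in any element's class attribute."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def css_classes(html):
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def jaccard_similarity(a, b):
    """|a ∩ b| / |a ∪ b| for two sets of class names."""
    if not a and not b:
        return 1.0  # assumption: no classes on either page counts as same style
    return len(a & b) / len(a | b)


page_a = '<div class="listing price"><span class="title">x</span></div>'
page_b = '<div class="listing"><p class="title footer">y</p></div>'
print(jaccard_similarity(css_classes(page_a), css_classes(page_b)))
```

The two sample pages share `listing` and `title` out of four distinct names in total.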
17
Web pages consist of:
● HTML ✓
● CSS ✓
● JavaScript ×
18
Aggregating the Style and Structure
● StructuralSimilarity : derived from the Normalized Tree Edit Distance
● StyleSimilarity : Jaccard Similarity on CSS class names
● Combine on a linear scale
○ Aggregated = k · StructuralSimilarity + (1 - k) · StyleSimilarity
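The linear combination can be sketched in a few lines. The function name and the restriction of `k` to the range 0 to 1 are assumptions for this example; both inputs are taken to be similarities already scaled to [0, 1].

```python
def aggregated_similarity(structural, style, k=0.5):
    """Weighted mix of structural and style similarity; k weights the structure."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be between 0 and 1")
    return k * structural + (1.0 - k) * style


# k = 1 uses structure only, k = 0 uses style only.
print(aggregated_similarity(0.8, 0.4, k=1.0))
print(aggregated_similarity(0.8, 0.4, k=0.0))
print(aggregated_similarity(0.8, 0.4, k=0.5))
```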
19
Implementation
20
Implementation
● Read Nutch’s Segments
○ sparkContext.sequenceFile(...)
● Filter web pages
○ Robust content type detection -- Tika
● Structural Similarity
○ HTML to DOM Tree -- NekoHTML
○ Tree Edit Distance -- Zhang-Shasha algorithm
21
Implementation …
● Style Similarity
○ Query CSS class names using XPath
● Similarity Matrix
○ RDD.cartesian() to get n x n cells
○ Spark’s Distributed (Coordinate) Matrix
● Persist the matrix for later experimentation with multiple thresholds
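The similarity matrix step can be pictured without a cluster. The sketch below is plain Python, not Spark: `itertools.product` plays the role of the cartesian product, and the `(row, column, value)` tuples mimic the entries of a coordinate matrix. The function names and the toy similarity are assumptions for illustration.

```python
from itertools import product


def similarity_entries(pages, similarity):
    """All n x n cells as (row, column, value) tuples, coordinate-matrix style."""
    entries = []
    for (i, a), (j, b) in product(enumerate(pages), repeat=2):
        entries.append((i, j, similarity(a, b)))
    return entries


def filter_entries(entries, threshold):
    """Keeps off-diagonal cells at or above the threshold; can be re-run per threshold."""
    return [(i, j, v) for i, j, v in entries if i != j and v >= threshold]


def toy_similarity(a, b):
    """Jaccard similarity on sets, standing in for the real page similarity."""
    return len(a & b) / len(a | b)


pages = [{"a", "b"}, {"a", "b"}, {"c"}]
entries = similarity_entries(pages, toy_similarity)
print(len(entries))
print(filter_entries(entries, 0.5))
```

Computing the entries once and filtering afterwards is what makes it cheap to try several thresholds on the same persisted matrix.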
22
Clustering
● Shared Near Neighbor Clustering
○ Jarvis & Patrick, 1973
● With improvements
○ Graph based Implementation
■ Spark GraphX for the win!
* Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, C-22(11), 1025-1034.
23
What’s good about this algorithm?
● What’s the difficulty with the most popular k-means?
○ Prior knowledge of the number of clusters?
○ Mean/Average of documents in a cluster?
■ Average of DOM Trees?
■ Average of CSS styles?
○ Circular/Spherical/Globular shapes?
● Shared Near Neighbor Clustering
○ Similarity matrix - pluggable similarity measures - generic
○ Thresholds - numbers, percent of match
24
Shared Near Neighbor Algorithm
“If two data points share a threshold number of neighbors, then they must belong to the same cluster”
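A simplified sketch of that rule on a small similarity matrix: each point keeps its `k` most similar points as near neighbors, and two points are joined when they appear in each other's lists and share at least `min_shared` neighbors. This is a plain-Python illustration with assumed names and parameters, not the code from the talk.

```python
def k_nearest(sim, i, k):
    """Indices of the k points most similar to i (ties broken by index)."""
    others = [j for j in range(len(sim)) if j != i]
    others.sort(key=lambda j: (-sim[i][j], j))
    return set(others[:k])


def shared_near_neighbor_clusters(sim, k, min_shared):
    n = len(sim)
    neighbors = [k_nearest(sim, i, k) for i in range(n)]
    parent = list(range(n))              # union-find forest over the points

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            mutual = j in neighbors[i] and i in neighbors[j]
            if mutual and len(neighbors[i] & neighbors[j]) >= min_shared:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


sim = [
    [1.0, 0.9, 0.9, 0.1, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1, 0.1],
    [0.9, 0.9, 1.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.8, 0.8],
    [0.1, 0.1, 0.1, 0.8, 1.0, 0.8],
    [0.1, 0.1, 0.1, 0.8, 0.8, 1.0],
]
print(shared_near_neighbor_clusters(sim, k=2, min_shared=1))
```

No cluster count is supplied up front and no "average page" is ever computed; only the similarity matrix and the two thresholds are needed.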
25
Clustering Implementation
● Similarity Matrix to Graph
○ Clusters as nodes, similarity measure as edges
● Check for Similar neighbors
○ Filter on threshold and Merge
■ Immutable! - new graph for next iteration
○ Repeat
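The loop above can be sketched with plain dictionaries standing in for a GraphX graph. Every cluster is a node, an edge exists where the similarity passes `sim_threshold`, and two connected nodes merge when they share at least `min_shared` neighbors. Each pass returns a new graph instead of changing the old one. All names, and the choice of an absolute shared-neighbor count, are assumptions for this illustration.

```python
def build_graph(sim, sim_threshold):
    """Initial graph: one node per page, edges where similarity passes the threshold."""
    n = len(sim)
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= sim_threshold:
                adjacency[i].add(j)
                adjacency[j].add(i)
    members = {i: {i} for i in range(n)}
    return adjacency, members


def merge_step(adjacency, members, min_shared):
    """One iteration; builds and returns a NEW graph plus a 'merged anything' flag."""
    parent = {v: v for v in adjacency}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    merged_any = False
    for a, nbrs in adjacency.items():
        for b in nbrs:
            if a < b and len(adjacency[a] & adjacency[b]) >= min_shared:
                ra, rb = find(a), find(b)
                if ra != rb:
                    parent[max(ra, rb)] = min(ra, rb)
                    merged_any = True

    new_members = {}
    for v, pages in members.items():
        new_members.setdefault(find(v), set()).update(pages)
    new_adjacency = {root: set() for root in new_members}
    for a, nbrs in adjacency.items():
        for b in nbrs:
            ra, rb = find(a), find(b)
            if ra != rb:                 # keep only edges between different clusters
                new_adjacency[ra].add(rb)
    return new_adjacency, new_members, merged_any


def cluster(sim, sim_threshold, min_shared):
    adjacency, members = build_graph(sim, sim_threshold)
    merged_any = True
    while merged_any:                    # repeat until a pass merges nothing
        adjacency, members, merged_any = merge_step(adjacency, members, min_shared)
    return sorted(sorted(pages) for pages in members.values())


bridged = [
    [1.0, 0.9, 0.9, 0.1, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1, 0.1],
    [0.9, 0.9, 1.0, 0.6, 0.1, 0.1],
    [0.1, 0.1, 0.6, 1.0, 0.9, 0.9],
    [0.1, 0.1, 0.1, 0.9, 1.0, 0.9],
    [0.1, 0.1, 0.1, 0.9, 0.9, 1.0],
]
print(cluster(bridged, sim_threshold=0.5, min_shared=1))
```

In the sample matrix, pages 2 and 3 are linked by an edge but share no neighbors, so the two groups stay separate even though the graph is connected.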
26
Shared Near Neighbor Clustering on Apache Spark GraphX
27
Challenges
● Tree Edit Distance is very expensive
28
What’s ahead on the road?
● Integrate into Apache Nutch
● Auto Extraction
○ Unsupervised learning on the structure of pages to scrape the actual data of the web page
● Faster Tree Edit Distance
○ Maybe with approximation techniques
29
Summary
● Example Scenario
● Similarity measures
● Clustering as a solution
● Demo
31
Acknowledgements
● Dr. Chris Mattmann
○ My mentor
○ Professor, Director at IRDS @ USC - http://irds.usc.edu
○ Director, Apache Software Foundation
● DARPA Memex project
32
Thank You!
● Source Code
● Tutorial
● Follow up
○ Thamme Gowda - @thammegowda
○ Chris Mattmann - @chrismattmann
33