Clustering the output of Apache Nutch using Apache Spark Thamme Gowda N. Dr. Chris Mattmann May 12, 2016. Vancouver, Canada 1

Page 1: Apache Con Slides Nutch Clustering - schd.wsschd.ws/hosted_files/apachecon2016/9b/Apache Con Slides-Nutch... · Clustering the output of Apache Nutch using Apache Spark Thamme Gowda

Clustering the output of Apache Nutch using Apache Spark

Thamme Gowda N. Dr. Chris Mattmann

May 12, 2016. Vancouver, Canada

1

About

● ThammeGowda Narayanaswamy - TG in short - @thammegowda
  ○ Contributor to Apache Tika and Apache Nutch
  ○ Now - a grad student @ University of Southern California
  ○ Past - Technical Co-Founder @ Datoin - http://datoin.com
● Dr. Chris Mattmann @chrismattmann
  ○ Adj. Prof. and the director of IRDS group @ University of Southern California, Los Angeles
  ○ Director @ Apache Software Foundation
  ○ Chief Architect, NASA JPL

2

Overview

● Problem Statement
● Clustering - a solution
● Structure and Style Similarity
● Shared Near Neighbor Clustering
● Scaling it up using Spark’s Distributed Matrices and GraphX
● A demo

3

Audience

● Who crawls the web
● Who extracts data from the web
● Who filters web pages
● Who likes to know about -
  ○ web page structure and style similarity
  ○ shared near neighbor clustering

4

Problem Statement

● Scraping data from online marketplaces
● Start with homepage → categories → listing pages → Actual stuff (Detail page)

5

Sample set of web pages credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov

6

Sample set of web pages credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov

[Figure labels: USELESS (×2)]

7

Sample set of web pages credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov

[Figure labels: USELESS (×2); REQUIRED FOR CRAWLER, BUT NOT IMPORTANT FOR ANALYSIS (×3)]

8

Sample set of web pages credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov

[Figure labels: USELESS (×2); REQUIRED FOR CRAWLER, BUT NOT IMPORTANT FOR ANALYSIS (×3); USEFUL FOR ANALYSIS (×3)]

9

Question : How do we solve this?

Answer : Cluster the web pages

10

Why Cluster?

● Separate the interesting web pages
  ○ Drop uninteresting/noisy web pages
  ○ Categorical treatment of clusters
● Extract structured data using XPath
  ○ Automated extraction using alignment

11

Goal

● Group web pages that are similar
● Similar in terms of
  ○ CSS Styles
  ○ DOM Structure
● Toolkit for experimentation with various thresholds
  ○ % of similarity in style and/or structure
  ○ Nice visualizations

12

How do we cluster?

● Based on similarity between pages
● Semantic similarity
  ○ meaning of the web pages
● Syntactic similarity
  ○ Web page structure, CSS styles
● This session focuses on the syntactic aspect

13

Structural similarity

● Web pages are built with HTML
● HTML Doc → DOM tree
● a labeled ordered tree
● Structural similarity using tree edit distance (TED)

[Figure: example DOM tree, by level: HTML; HEAD, BODY; TITLE, DIV, P]
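The "HTML Doc → DOM tree" step can be sketched in a few lines. This is a minimal illustration using Python's standard `html.parser`, not the parser used in the talk; the names `Node`, `DomTreeBuilder` and `html_to_tree` are hypothetical, text and attributes are ignored, and the synthetic `#document` root is an assumption made so that malformed input still yields a single tree.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay open on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """A node of a labeled ordered tree: a tag name plus children in document order."""

    def __init__(self, label):
        self.label = label
        self.children = []

    def to_tuple(self):
        # Nested (label, [children]) form, convenient for printing and comparing.
        return (self.label, [child.to_tuple() for child in self.children])


class DomTreeBuilder(HTMLParser):
    """Builds the tree from start/end tag events; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")  # synthetic root (an assumption of this sketch)
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag.upper())
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        label = tag.upper()
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].label == label:
                del self.stack[i:]
                break


def html_to_tree(html):
    builder = DomTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


page = "<html><head><title>t</title></head><body><div>a</div><p>b</p></body></html>"
print(html_to_tree(page).to_tuple())
```

Sibling order is preserved because children are appended as tags are encountered, which is what makes the result an ordered tree.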

14

(Minimum) Tree Edit Distance

● Edit distance measure similar to strings, but on hierarchical data instead of sequences
● Number of editing operations required to transform one tree into another.
● Three basic editing operations: INSERT, REMOVE and REPLACE.
● A useful measure to quantify how similar (or dissimilar) two trees are.

15

Example: Tree Edit Distance*

● Edit operations
● Normalized distance

* Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.

16

Style Similarity

● Have you noticed?
  ○ Similar web pages have similar CSS styles
● XPath: "//*[@class]/@class"
● Simple measure -
  ○ Jaccard Similarity on CSS class names
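The style measure can be sketched as follows. Instead of evaluating the XPath expression, this illustration collects `class` attribute values with Python's standard `html.parser`, which yields the same set of class names for ordinary pages; `ClassCollector`, `css_classes` and `jaccard_similarity` are hypothetical names, and treating two pages without any classes as identical is an assumption.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every CSS class name that appears in a class="..." attribute."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # One attribute may carry several whitespace-separated class names.
                self.classes.update(value.split())


def css_classes(html):
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two sets of class names."""
    if not a and not b:
        return 1.0  # assumption: two pages without classes count as identical
    return len(a & b) / len(a | b)


page1 = '<div class="item price"><span class="title">x</span></div>'
page2 = '<div class="item"><p class="title desc">y</p></div>'
print(jaccard_similarity(css_classes(page1), css_classes(page2)))
```

The two sample pages share `item` and `title` out of four distinct class names, so the similarity is 2/4.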

17

Web pages consist of:

● HTML ✓
● CSS ✓
● JavaScript ×

18

Aggregating the Style and Structure

● StructuralSimilarity : Normalized Tree Edit Distance

● StyleSimilarity : Jaccard Distance

● Combine on a linear scale

  ○ Aggregated = k · Structural + (1 − k) · Style
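The linear combination is a one-liner, shown here as a hedged sketch. The function names are hypothetical, and it assumes both inputs are similarities in [0, 1]; when a measure is available as a normalized distance, converting it with 1 − distance is an assumption of this sketch rather than something the slide states.

```python
def distance_to_similarity(distance):
    # Assumption: the distance is normalized to [0, 1], so similarity = 1 - distance.
    return 1.0 - distance


def aggregate_similarity(structural, style, k=0.5):
    """Aggregated = k * Structural + (1 - k) * Style, with the weight k in [0, 1]."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be between 0 and 1")
    return k * structural + (1.0 - k) * style


# Weight structure at 25% and style at 75%.
print(aggregate_similarity(0.8, 0.4, k=0.25))
```

Setting k = 1 uses structure only and k = 0 uses style only, which makes the weight a convenient knob when experimenting with thresholds.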

19

Implementation

20

Implementation

● Read Nutch’s Segments
  ○ sparkContext.sequenceFile(...)
● Filter web pages
  ○ Robust content type detection -- Tika
● Structural Similarity
  ○ HTML to DOM Tree -- NekoHTML
  ○ Tree Edit Distance -- Zhang Shasha’s algorithm

21

Implementation …

● Style Similarity
  ○ Query CSS class names using XPath
● Similarity Matrix
  ○ rdd.cartesian(rdd) to get n×n cells
  ○ Spark’s Distributed (Coordinate) Matrix
● Persist the matrix for later experimentation with multiple thresholds
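The shape of the similarity matrix can be illustrated without a cluster. The sketch below is a plain-Python stand-in for the Spark pipeline, not the talk's code: it pairs every item with every other item, as a cartesian product does, and emits coordinate-style `(row, column, value)` entries. The name `similarity_entries` and the toy Jaccard measure over class-name sets are hypothetical.

```python
from itertools import product


def similarity_entries(items, similarity):
    """Yield (i, j, value) for every cell of the n x n similarity matrix."""
    for (i, a), (j, b) in product(enumerate(items), repeat=2):
        yield (i, j, similarity(a, b))


def jaccard(a, b):
    # Toy measure for this sketch; any pairwise similarity function can be plugged in.
    return len(a & b) / len(a | b) if (a or b) else 1.0


pages = [{"item", "price"}, {"price", "title"}, {"item", "price"}]
entries = list(similarity_entries(pages, jaccard))

# Keeping the full entry list around allows different thresholds to be tried
# later without recomputing the pairwise similarities.
strong = [(i, j) for i, j, value in entries if i < j and value >= 0.9]
print(len(entries), strong)
```

Because the measure is symmetric, computing only the cells with i ≤ j would halve the work; the full product is kept here to mirror the n×n description.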

22

Clustering

● Shared Near Neighbor Clustering
  ○ Jarvis et al., 1973
● With improvements
  ○ Graph based Implementation
    ■ Spark GraphX for the win!

* Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, 100(11), 1025-1034.

23

What’s good about this algorithm?

● What’s the difficulty with the most popular k-means?
  ○ Prior knowledge of the number of clusters?
  ○ Mean/Average of documents in a cluster?
    ■ Average of DOM Trees?
    ■ Average of CSS styles?
  ○ Circular/Spherical/Globular shapes?
● Shared Near Neighbor Cluster
  ○ Similarity matrix - pluggable similarity measures - generic
  ○ Thresholds - numbers, percent of match

24

Shared Near Neighbor Algorithm

“If two data points share a threshold number of neighbors, then they must belong to the same cluster”

25

Clustering Implementation

● Similarity Matrix to Graph
  ○ Clusters as nodes, similarity measure as edges
● Check for Similar neighbors
  ○ Filter on threshold and Merge
    ■ Immutable! - new graph for next iteration
  ○ Repeat
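The steps above can be sketched on a single machine. This is a simplified illustration of shared near neighbor clustering, not the GraphX implementation from the talk: it merges in one pass with a union-find structure instead of rebuilding an immutable graph per iteration. The function name and both thresholds (`sim_threshold` for what counts as a neighbor, `shared_threshold` for how many neighbors two points must share) are assumptions of this sketch.

```python
def snn_clusters(sim, sim_threshold, shared_threshold):
    """Cluster items 0..n-1 given an n x n similarity matrix (list of lists)."""
    n = len(sim)
    # A point's neighbors are all points at least sim_threshold similar to it;
    # with sim[i][i] == 1.0 every point is also its own neighbor.
    neighbors = [{j for j in range(n) if sim[i][j] >= sim_threshold}
                 for i in range(n)]

    parent = list(range(n))  # union-find forest, one singleton cluster per point

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            are_neighbors = j in neighbors[i] and i in neighbors[j]
            shared = len(neighbors[i] & neighbors[j])
            if are_neighbors and shared >= shared_threshold:
                parent[find(j)] = find(i)  # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Pages 0-2 resemble each other, as do pages 3-4; the two groups do not.
sim = [
    [1.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.1, 0.9, 1.0],
]
print(snn_clusters(sim, sim_threshold=0.8, shared_threshold=2))
```

Raising `shared_threshold` makes merging stricter: with a value of 3, pages 3 and 4 share only two neighbors and stay in separate clusters.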

26

Shared Near Neighbor Clustering on Apache Spark GraphX

27

Challenges

● Tree Edit Distance is very expensive

28

What’s ahead on the road?

● Integrate into Apache Nutch
● Auto Extraction
  ○ Unsupervised learning on the structure of pages to scrape the actual data of the web page
● Faster Tree Edit Distance
  ○ Maybe with approximation techniques

29

Summary

● Example Scenario
● Similarity measures
● Clustering as a solution
● Demo

31

Acknowledgements

● Dr. Chris Mattmann
  ○ My mentor
  ○ Professor, Director at IRDS @ USC - http://irds.usc.edu
  ○ Director, Apache Software Foundation

● DARPA Memex project

32