Clustering output of Apache Nutch using Apache Spark

Clustering the output of Apache Nutch using Apache Spark. Thamme Gowda N., Dr. Chris Mattmann. May 12, 2016. Vancouver, Canada.


Page 1: Clustering output of Apache Nutch using Apache Spark

Clustering the output of Apache Nutch using Apache Spark

Thamme Gowda N. Dr. Chris Mattmann

May 12, 2016. Vancouver, Canada


Page 2: Clustering output of Apache Nutch using Apache Spark

About

● ThammeGowda Narayanaswamy - TG in short - @thammegowda
    ○ Contributor to Apache Tika and Apache Nutch
    ○ Now - a grad student @ University of Southern California
    ○ Past - Technical Co-Founder @ Datoin - http://datoin.com
● Dr. Chris Mattmann @chrismattmann
    ○ Adj. Prof. and the director of IRDS group @ University of Southern California, Los Angeles
    ○ Director @ Apache Software Foundation
    ○ Chief Architect, NASA JPL

Page 3: Clustering output of Apache Nutch using Apache Spark

Overview

● Problem Statement
● Clustering - a solution
● Structure and Style Similarity
● Shared Near Neighbor Clustering
● Scaling it up using Spark’s Distributed Matrices and GraphX
● A demo

Page 4: Clustering output of Apache Nutch using Apache Spark

Audience

● Those who crawl the web
● Those who extract data from the web
● Those who filter web pages
● Anyone who would like to know about
    ○ web page structure and style similarity
    ○ shared near neighbor clustering

Page 5: Clustering output of Apache Nutch using Apache Spark

Problem Statement

● Scraping data from online marketplaces
● Start with homepage → categories → listing pages → actual stuff (detail page)

Page 6: Clustering output of Apache Nutch using Apache Spark

Sample set of web pages

Credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov

Page 7: Clustering output of Apache Nutch using Apache Spark

Sample set of web pages

Credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov

[Figure labels: USELESS (×2)]

Page 8: Clustering output of Apache Nutch using Apache Spark

Sample set of web pages

Credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov

[Figure labels: USELESS (×2); REQUIRED FOR CRAWLER, BUT NOT IMPORTANT FOR ANALYSIS (×3)]

Page 9: Clustering output of Apache Nutch using Apache Spark

Sample set of web pages

Credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov

[Figure labels: USELESS (×2); REQUIRED FOR CRAWLER, BUT NOT IMPORTANT FOR ANALYSIS (×3); USEFUL FOR ANALYSIS (×3)]

Page 10: Clustering output of Apache Nutch using Apache Spark

Question: How do we solve this?

Answer: Cluster the web pages

Page 11: Clustering output of Apache Nutch using Apache Spark

Why Cluster?

● Separate the interesting web pages
    ○ Drop uninteresting/noisy web pages
    ○ Categorical treatment of clusters
● Extract structured data using XPath
    ○ Automated extraction using alignment

Page 12: Clustering output of Apache Nutch using Apache Spark

Goal

● Group web pages that are similar
● Similar in terms of
    ○ CSS styles
    ○ DOM structure
● Toolkit for experimentation with various thresholds
    ○ % of similarity in style and/or structure
    ○ Nice visualizations

Page 13: Clustering output of Apache Nutch using Apache Spark

How do we cluster?

● Based on similarity between pages
● Semantic similarity
    ○ Meaning of the web pages
● Syntactic similarity
    ○ Web page structure, CSS styles
● This session focuses on the syntactic aspect

Page 14: Clustering output of Apache Nutch using Apache Spark

Structural similarity

● Web pages are built with HTML
● HTML doc → DOM tree
● A labeled ordered tree
● Structural similarity using tree edit distance (TED)

HTML
    HEAD
        TITLE
    BODY
        DIV
        P

Page 15: Clustering output of Apache Nutch using Apache Spark

(Minimum) Tree Edit Distance

● An edit distance measure similar to the one for strings, but on hierarchical data instead of sequences
● The number of editing operations required to transform one tree into another
● Three basic editing operations: INSERT, REMOVE and REPLACE
● A useful measure to quantify how similar (or dissimilar) two trees are
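A minimal sketch of the idea, assuming a unit cost for every operation and trees written as nested `(label, children)` tuples. These choices and all function names are illustrative, not taken from the talk. It uses the plain memoized recursion over ordered forests, which is easier to read but slower than the Zhang and Shasha algorithm cited later in the slides.

```python
from functools import lru_cache


def size(forest):
    """Count the nodes in a forest (a tuple of (label, children) trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f or not g:
        # Only INSERTs (or only REMOVEs) are left.
        return size(f) + size(g)
    (v_label, v_children), (w_label, w_children) = f[-1], g[-1]
    return min(
        # REMOVE the rightmost root of f; its children move up.
        forest_distance(f[:-1] + v_children, g) + 1,
        # INSERT the rightmost root of g.
        forest_distance(f, g[:-1] + w_children) + 1,
        # Match the two roots; REPLACE the label if it differs.
        forest_distance(f[:-1], g[:-1])
        + forest_distance(v_children, w_children)
        + (v_label != w_label),
    )


def tree_edit_distance(a, b):
    """Edit distance between two single trees."""
    return forest_distance((a,), (b,))


page_a = ("html", (
    ("head", (("title", ()),)),
    ("body", (("div", ()), ("p", ()))),
))
page_b = ("html", (
    ("head", (("title", ()),)),
    ("body", (("span", ()),)),
))
print(tree_edit_distance(page_a, page_b))  # 2: remove P, replace DIV with SPAN
```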


Page 16: Clustering output of Apache Nutch using Apache Spark

Example: Tree Edit Distance*

● Edit operations
● Normalized distance

* Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.
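The normalization formula on this slide did not survive extraction, so the following is only one plausible way to do it, not necessarily the one used in the talk. It assumes unit-cost operations and divides by the combined node count of both trees, which is an upper bound on the distance (remove every node of one tree, insert every node of the other). The function names are hypothetical.

```python
def normalized_distance(distance, size_a, size_b):
    """Scale a unit-cost tree edit distance into the range [0, 1].

    Assumption: the divisor is size_a + size_b, the cost of removing
    one tree entirely and inserting the other.
    """
    total = size_a + size_b
    return distance / total if total else 0.0


def structural_similarity(distance, size_a, size_b):
    """Turn the normalized distance into a similarity score."""
    return 1.0 - normalized_distance(distance, size_a, size_b)


# A 6-node tree and a 5-node tree that are 2 edits apart.
print(normalized_distance(2, 6, 5))
print(structural_similarity(2, 6, 5))
```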


Page 17: Clustering output of Apache Nutch using Apache Spark

Style Similarity

● Have you noticed?
    ○ Similar web pages have similar CSS styles
● XPath: "//*[@class]/@class"
● Simple measure
    ○ Jaccard similarity on CSS class names
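As a rough illustration, the sketch below collects CSS class names and compares two pages with the Jaccard similarity, |A ∩ B| / |A ∪ B|. It swaps the XPath query for Python's standard `html.parser`, and it assumes that each `class` attribute is split on whitespace into individual names. The helper names are made up for this example.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name used in a document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # Assumption: split multi-class attributes on whitespace.
                self.classes.update(value.split())


def css_classes(html):
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


page_a = '<div class="item price"><span class="title">x</span></div>'
page_b = '<div class="item"><p class="title desc">y</p></div>'
print(jaccard(css_classes(page_a), css_classes(page_b)))  # 0.5
```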


Page 18: Clustering output of Apache Nutch using Apache Spark

Web pages consist of:

● HTML ✓
● CSS ✓
● JavaScript ×

Page 19: Clustering output of Apache Nutch using Apache Spark

Aggregating the Style and Structure

● StructuralSimilarity: based on the normalized tree edit distance
● StyleSimilarity: Jaccard similarity on CSS class names
● Combine on a linear scale
    ○ Aggregated = k · Structural + (1 - k) · Style, with weight 0 ≤ k ≤ 1
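The linear combination can be written as a small function. This is a sketch: the function name and the default weight of 0.5 are assumptions, and both inputs are expected to be similarity scores in [0, 1].

```python
def aggregate_similarity(structural, style, k=0.5):
    """Weighted mix of structural and style similarity.

    k = 1 uses structure only, k = 0 uses style only.
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be between 0 and 1")
    return k * structural + (1.0 - k) * style


# Weight structure three times as heavily as style.
print(aggregate_similarity(0.8, 0.4, k=0.75))
```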


Page 20: Clustering output of Apache Nutch using Apache Spark

Implementation


Page 21: Clustering output of Apache Nutch using Apache Spark

Implementation

● Read Nutch’s segments
    ○ sparkContext.sequenceFile(...)
● Filter web pages
    ○ Robust content type detection -- Tika
● Structural similarity
    ○ HTML to DOM tree -- NekoHTML
    ○ Tree edit distance -- Zhang and Shasha’s algorithm
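To make the "HTML to DOM tree" step concrete, here is a single-machine sketch that turns markup into a labeled ordered tree of nested `(tag, children)` tuples. It uses Python's standard `html.parser` in place of NekoHTML and does far less error recovery than a real HTML parser would. The class and function names, the `#document` root label, and the list of void elements are choices made for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Build a tree of [tag, children] lists from start and end tags."""

    def __init__(self):
        super().__init__()
        self.root = ["#document", []]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: add a leaf, do not descend.
        self.stack[-1][1].append([tag, []])

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def freeze(node):
    """Convert [tag, children] lists into hashable nested tuples."""
    return (node[0], tuple(freeze(child) for child in node[1]))


def html_to_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return freeze(builder.root)


doc = ("<html><head><title>t</title></head>"
       "<body><div></div><p>x</p></body></html>")
print(html_to_tree(doc))
```

Text nodes and attributes are dropped on purpose, since only the element structure matters for the structural comparison.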


Page 22: Clustering output of Apache Nutch using Apache Spark

Implementation …

● Style similarity
    ○ Query CSS class names using XPath
● Similarity matrix
    ○ rdd.cartesian(rdd) to get n × n cells
    ○ Spark’s distributed (coordinate) matrix
● Persist the matrix for later experimentation with multiple thresholds
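The shape of the similarity matrix can be illustrated without a cluster. In this sketch `itertools.product` plays the role that the RDD `cartesian` operation plays in Spark, and the `(row, column, value)` triples resemble the entries of a coordinate matrix. The function names and the toy input are assumptions for illustration only.

```python
from itertools import product


def similarity_entries(docs, similarity):
    """Return (row, column, value) triples for every pair of documents."""
    indexed = list(enumerate(docs))
    return [
        (i, j, similarity(a, b))
        for (i, a), (j, b) in product(indexed, indexed)
    ]


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Each page is represented by its set of CSS class names.
pages = [{"item", "price"}, {"item", "title"}, {"nav"}]
entries = similarity_entries(pages, jaccard)
print(len(entries))  # 9 cells for 3 pages
```

Because the measure is passed in as a function, the same code works for the structural or the aggregated similarity. Keeping the triples around makes it possible to try several thresholds later without recomputing the distances.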


Page 23: Clustering output of Apache Nutch using Apache Spark

Clustering

● Shared Near Neighbor Clustering
    ○ Jarvis and Patrick, 1973
● With improvements
    ○ Graph based implementation
        ■ Spark GraphX for the win!

* Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, C-22(11), 1025-1034.

Page 24: Clustering output of Apache Nutch using Apache Spark

What’s good about this algorithm?

● What’s the difficulty with the most popular k-means?
    ○ Prior knowledge of the number of clusters?
    ○ Mean/average of documents in a cluster?
        ■ Average of DOM trees?
        ■ Average of CSS styles?
    ○ Circular/spherical/globular shapes?
● Shared Near Neighbor Clustering
    ○ Similarity matrix - pluggable similarity measures - generic
    ○ Thresholds - numbers, percent of match

Page 25: Clustering output of Apache Nutch using Apache Spark

Shared Near Neighbor Algorithm

“If two data points share a threshold number of neighbors, then they must belong to the same cluster”


Page 26: Clustering output of Apache Nutch using Apache Spark

Clustering Implementation

● Similarity matrix to graph
    ○ Clusters as nodes, similarity measure as edges
● Check for similar neighbors
    ○ Filter on threshold and merge
        ■ Immutable! - new graph for next iteration
    ○ Repeat
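The steps above can be sketched on a single machine as follows. This is an approximation of the idea, not the GraphX implementation from the talk: it treats two pages as neighbors when their similarity reaches `sim_threshold`, merges neighbors that share at least `shared_threshold` other neighbors, and uses a union-find structure in one pass instead of building a new immutable graph per iteration. All names and both thresholds are assumptions.

```python
def snn_clusters(n, similarity, sim_threshold, shared_threshold):
    """Cluster nodes 0..n-1 by shared near neighbors.

    similarity maps (i, j) pairs to a similarity value.
    """
    # Step 1: similarity matrix to graph (keep only strong edges).
    neighbors = [set() for _ in range(n)]
    for (i, j), value in similarity.items():
        if i != j and value >= sim_threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)

    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Step 2: merge neighbors that share enough neighbors.
    for i in range(n):
        for j in neighbors[i]:
            shared = neighbors[i] & neighbors[j]
            if i < j and len(shared) >= shared_threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


sim = {
    (0, 1): 0.90, (0, 2): 0.80, (1, 2): 0.85,  # first group
    (3, 4): 0.90, (3, 5): 0.90, (4, 5): 0.80,  # second group
    (2, 3): 0.75,                              # a single bridge
    (0, 6): 0.10,                              # a weak link
}
print(snn_clusters(7, sim, sim_threshold=0.7, shared_threshold=1))
# [[0, 1, 2], [3, 4, 5], [6]]
```

Pages 2 and 3 are similar to each other but share no neighbors, so the two groups stay separate. That is the behavior the shared-neighbor rule is meant to give.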


Page 27: Clustering output of Apache Nutch using Apache Spark

Shared Near Neighbor Clustering on Apache Spark GraphX


Page 28: Clustering output of Apache Nutch using Apache Spark

Challenges

● Tree edit distance is very expensive

Page 29: Clustering output of Apache Nutch using Apache Spark

What’s ahead on the road?

● Integrate with Apache Nutch
● Auto extraction
    ○ Unsupervised learning on the structure of pages to scrape the actual data of the web page
● Faster tree edit distance
    ○ Maybe with approximation techniques

Page 31: Clustering output of Apache Nutch using Apache Spark

Summary

● Example scenario
● Similarity measures
● Clustering as a solution
● Demo

Page 32: Clustering output of Apache Nutch using Apache Spark

Acknowledgements

● Dr. Chris Mattmann
    ○ My mentor
    ○ Professor, Director at IRDS @ USC - http://irds.usc.edu
    ○ Director, Apache Software Foundation
● DARPA Memex project