Clustering the output of Apache Nutch using Apache Spark

Thamme Gowda N., Dr. Chris Mattmann

May 12, 2016. Vancouver, Canada


About

● Thamme Gowda Narayanaswamy - TG in short - @thammegowda
  ○ Contributor to Apache Tika and Apache Nutch
  ○ Now - a grad student @ University of Southern California
  ○ Past - Technical Co-Founder @ Datoin - http://datoin.com
● Dr. Chris Mattmann - @chrismattmann
  ○ Adj. Prof. and the director of IRDS group @ University of Southern California, Los Angeles
  ○ Director @ Apache Software Foundation
  ○ Chief Architect, NASA JPL


Overview

● Problem Statement
● Clustering - a solution
● Structure and Style Similarity
● Shared Near Neighbor Clustering
● Scaling it up using Spark’s Distributed Matrices and GraphX
● A demo

Audience

● Anyone who crawls the web
● Anyone who extracts data from the web
● Anyone who filters web pages
● Anyone who would like to know about
  ○ web page structure and style similarity
  ○ shared near neighbor clustering

Problem Statement

● Scraping data from online marketplaces
● Start with homepage → categories → listing pages → actual stuff (detail pages)

Sample set of web pages. Credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov


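[Figure: sample web page; two regions annotated "USELESS"]

[Figure: sample web page; two regions annotated "USELESS", one annotated "REQUIRED FOR CRAWLER, BUT NOT IMPORTANT FOR ANALYSIS"]

[Figure: sample web page; two regions annotated "USELESS", three annotated "USEFUL FOR ANALYSIS"]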




Question : How do we solve this?

Answer : Cluster the web pages


Why Cluster?

● Separate the interesting web pages
  ○ Drop uninteresting/noisy web pages
  ○ Categorical treatment of clusters
● Extract structured data using XPath
  ○ Automated extraction using alignment

Goal

● Group web pages that are similar
● Similar in terms of
  ○ CSS Styles
  ○ DOM Structure
● Toolkit for experimentation with various thresholds
  ○ % of similarity in style and/or structure
  ○ Nice visualizations

How do we cluster?

● Based on similarity between pages
● Semantic similarity
  ○ Meaning of the web pages
● Syntactic similarity
  ○ Web page structure, CSS styles
● This session focuses on the syntactic aspect

Structural similarity

● Web pages are built with HTML
● HTML Doc → DOM tree
● A DOM tree is a labeled ordered tree
● Structural similarity using tree edit distance (TED)

HTML
├── HEAD
│   └── TITLE
└── BODY
    ├── DIV
    └── P
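The HTML Doc → DOM tree step can be sketched with Python's standard `html.parser`. This is only an illustration: the talk's implementation uses a Java HTML parser, and the names `TreeBuilder`, `parse_dom` and `labels` are made up for this sketch. Text nodes and attributes are ignored, so only the element labels and their order survive.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay open on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Builds a labeled ordered tree of nested [tag, children] lists."""

    def __init__(self):
        super().__init__()
        self.root = ["#document", []]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.stack[-1][1].append(node)      # children keep document order
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def parse_dom(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def labels(node, depth=0):
    """Flatten the tree into (depth, tag) pairs in document order."""
    out = [(depth, node[0])]
    for child in node[1]:
        out.extend(labels(child, depth + 1))
    return out


page = "<html><head><title>t</title></head><body><div></div><p></p></body></html>"
tree = parse_dom(page)
for depth, tag in labels(tree):
    print("  " * depth + tag)
```

A real crawl contains malformed markup, which is why a lenient, error-correcting parser is the safer choice in practice.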

(Minimum) Tree Edit Distance

● Edit distance measure similar to strings, but on hierarchical data instead of sequences
● Number of editing operations required to transform one tree into another
● Three basic editing operations: INSERT, REMOVE and REPLACE
● A useful measure to quantify how similar (or dissimilar) two trees are
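A minimal sketch of the definition above, assuming unit cost for each INSERT, REMOVE and REPLACE. It uses the plain memoized forest recursion rather than the optimized Zhang-Shasha dynamic program, so it is only practical for small trees. The tuple representation and the names `forest_distance`, `tree_edit_distance` and `leaf` are assumptions of this sketch.

```python
from functools import lru_cache

# A tree is (label, children) where children is a tuple of trees;
# a forest is a tuple of trees. Tuples are hashable, so results can be cached.


def size(forest):
    """Number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Minimum number of unit-cost INSERT, REMOVE and REPLACE operations."""
    if not f:
        return size(g)          # insert everything that is left in g
    if not g:
        return size(f)          # remove everything that is left in f
    (v_label, v_children), (w_label, w_children) = f[-1], g[-1]
    # Remove the rightmost root of f; its children move up into the forest.
    remove = forest_distance(f[:-1] + v_children, g) + 1
    # Insert the rightmost root of g.
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Match the two rightmost roots, relabeling if the labels differ.
    replace = (forest_distance(v_children, w_children)
               + forest_distance(f[:-1], g[:-1])
               + (0 if v_label == w_label else 1))
    return min(remove, insert, replace)


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


def leaf(label):
    return (label, ())


a = ("html", (("head", (leaf("title"),)), ("body", (leaf("div"), leaf("p")))))
b = ("html", (("head", (leaf("title"),)), ("body", (leaf("div"),))))
c = ("html", (("head", (leaf("title"),)), ("body", (leaf("span"), leaf("p")))))

print(tree_edit_distance(a, b))  # one REMOVE (the p element)
print(tree_edit_distance(a, c))  # one REPLACE (div becomes span)
```

With unit costs the distance never exceeds the combined node count of both trees, so dividing by `size((t1,)) + size((t2,))` is one possible way to normalize it into the range 0..1. That choice is an assumption here, not necessarily the normalization used in the talk.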

Example: Tree Edit Distance*

● Edit operations
● Normalized distance

* Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.

Style Similarity

● Have you noticed?
  ○ Similar web pages have similar CSS styles
● XPath: "//*[@class]/@class"
● Simple measure
  ○ Jaccard Similarity on CSS class names
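The style measure can be sketched in a few lines. Instead of evaluating the XPath, this sketch collects `class` attributes with Python's standard `html.parser`. Splitting each attribute value on whitespace into individual class names is an assumption, as is treating two pages without any classes as identical. `ClassCollector`, `css_classes` and `jaccard_similarity` are hypothetical names.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every CSS class name found in class="..." attributes."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def css_classes(html):
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def jaccard_similarity(a, b):
    """|a ∩ b| / |a ∪ b|; two empty sets count as identical (an assumption)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


page1 = '<div class="nav top"><p class="price">1</p></div>'
page2 = '<div class="nav"><p class="price big">2</p></div>'
similarity = jaccard_similarity(css_classes(page1), css_classes(page2))
print(similarity)  # 2 shared names out of 4 distinct names
```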

Web pages consist of:

● HTML ✓
● CSS ✓
● JavaScript ×

Aggregating the Style and Structure

● StructuralSimilarity: Normalized Tree Edit Distance
● StyleSimilarity: Jaccard Distance
● Combine on a linear scale
  ○ Aggregated = k · Structural + (1 - k) · Style
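The linear combination is small enough to write out directly. This sketch assumes both inputs are already on the same 0..1 scale and point the same way (both similarities or both distances); a normalized distance d would first be turned into a similarity as 1 - d. The function name `aggregate` and the sample scores are made up for illustration.

```python
def aggregate(structural, style, k):
    """Weighted linear combination: k * structural + (1 - k) * style."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    return k * structural + (1.0 - k) * style


# Hypothetical scores for one pair of pages, both on a 0..1 similarity scale.
structural_similarity = 0.8
style_similarity = 0.5
combined = aggregate(structural_similarity, style_similarity, k=0.5)
print(combined)
```

Setting k = 1 uses structure only and k = 0 uses style only, which makes k a convenient knob for experiments.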


Implementation

● Read Nutch’s Segments
  ○ sparkContext.sequenceFile(...)
● Filter web pages
  ○ Robust content type detection -- Tika
● Structural Similarity
  ○ HTML to DOM Tree -- NekoHTML
  ○ Tree Edit Distance -- Zhang Shasha’s algorithm

Implementation …

● Style Similarity
  ○ Query CSS class names using XPath
● Similarity Matrix
  ○ rdd.cartesian(rdd) to get n × n cells
  ○ Spark’s Distributed (Coordinate) Matrix
● Persist the matrix for later experimentation with multiple thresholds
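The n × n similarity computation can be illustrated without a cluster. The sketch below is a plain-Python stand-in and does not call Spark: `itertools.product` plays the role of the cartesian product, and each `(row, column, value)` triple corresponds to one entry of a coordinate-format matrix. The page data and the names `similarity_entries` and `jaccard` are hypothetical.

```python
from itertools import product


def similarity_entries(items, similarity):
    """Return (row, column, value) triples for every ordered pair of items."""
    indexed = list(enumerate(items))
    return [(i, j, similarity(a, b)) for (i, a), (j, b) in product(indexed, indexed)]


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0


# Hypothetical CSS class sets standing in for three crawled pages.
pages = [{"nav", "price"}, {"nav", "price", "big"}, {"footer"}]
entries = similarity_entries(pages, jaccard)

for row, column, value in entries:
    print(row, column, round(value, 3))
```

Because the triples are independent of any threshold, they can be computed once, saved, and reused for every clustering experiment.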

Clustering

● Shared Near Neighbor Clustering
  ○ Jarvis & Patrick, 1973*
● With improvements
  ○ Graph-based implementation
    ■ Spark GraphX for the win!

* Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, C-22(11), 1025-1034.

What’s good about this algorithm?

● What’s the difficulty with the most popular k-means?
  ○ Prior knowledge of clusters?
  ○ Mean/Average of documents in a cluster?
    ■ Average of DOM Trees?
    ■ Average of CSS styles?
  ○ Circular/Spherical/Globular shapes?
● Shared Near Neighbor Clustering
  ○ Similarity matrix - pluggable similarity measures - generic
  ○ Thresholds - numbers, percent of match

Shared Near Neighbor Algorithm

“If two data points share a threshold number of neighbors, then they must belong to the same cluster”

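The rule quoted above can be sketched as a simplified variant of shared near neighbor clustering. The assumptions: a point's neighbors are all points whose similarity reaches `sim_threshold` (the original Jarvis-Patrick formulation uses k-nearest-neighbor lists instead), each point counts as its own neighbor, and only points that are neighbors of each other are compared. All names and the sample matrix are hypothetical.

```python
def shared_near_neighbor_clusters(sim, sim_threshold, share_threshold):
    """Cluster indices 0..n-1 of a square similarity matrix `sim`."""
    n = len(sim)
    # Neighbor set of each point: every point (itself included) that is similar enough.
    neighbors = [{j for j in range(n) if sim[i][j] >= sim_threshold} for i in range(n)]

    parent = list(range(n))          # union-find forest over the points

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            close = j in neighbors[i] and i in neighbors[j]
            if close and len(neighbors[i] & neighbors[j]) >= share_threshold:
                parent[find(i)] = find(j)   # same cluster

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


sim = [
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 1.0, 0.1, 0.2],
    [0.1, 0.2, 0.1, 1.0, 0.9],
    [0.0, 0.1, 0.2, 0.9, 1.0],
]
clusters = shared_near_neighbor_clusters(sim, sim_threshold=0.6, share_threshold=2)
print(clusters)
```

No cluster count is supplied and no averages of pages are computed; only the similarity matrix and two thresholds are needed.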

Clustering Implementation

● Similarity Matrix to Graph
  ○ Clusters as nodes, similarity measure as edges
● Check for similar neighbors
  ○ Filter on threshold and merge
    ■ Immutable! - new graph for next iteration
  ○ Repeat
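The loop above can be sketched in plain Python, without GraphX. This is a guess at the general shape, not the talk's implementation. Each node is a cluster (a frozenset of page indices), an edge joins clusters whose similarity reaches `sim_threshold`, and two adjacent clusters merge when the overlap of their neighborhoods reaches `share_fraction`. Using the Jaccard overlap as the "percent of match" is an assumption, and all names are hypothetical.

```python
def build_graph(sim, sim_threshold):
    """Each page starts as its own cluster; an edge joins clusters that are similar enough."""
    n = len(sim)
    nodes = [frozenset([i]) for i in range(n)]
    return {
        nodes[i]: frozenset(nodes[j] for j in range(n)
                            if j != i and sim[i][j] >= sim_threshold)
        for i in range(n)
    }


def merge_step(graph, share_fraction):
    """Return a NEW graph in which adjacent clusters with enough shared neighbors are merged."""
    group = {node: node for node in graph}   # cluster -> merged cluster containing it
    for a, adjacent in graph.items():
        for b in adjacent:
            # Closed neighborhoods, so that the two endpoints count as shared.
            na, nb = graph[a] | {a}, graph[b] | {b}
            if len(na & nb) / len(na | nb) >= share_fraction:
                merged = group[a] | group[b]
                for node in graph:
                    if group[node] <= merged:
                        group[node] = merged
    new_graph = {}
    for node, adjacent in graph.items():
        target = group[node]
        linked = {group[b] for b in adjacent if group[b] != target}
        new_graph[target] = new_graph.get(target, frozenset()) | linked
    return new_graph


def cluster(sim, sim_threshold, share_fraction):
    graph = build_graph(sim, sim_threshold)
    while True:
        merged = merge_step(graph, share_fraction)
        if len(merged) == len(graph):        # nothing merged: fixed point reached
            return sorted(sorted(c) for c in merged)
        graph = merged                       # the old graph is never modified


sim_path = [
    [1.0, 0.7, 0.1],
    [0.7, 1.0, 0.7],
    [0.1, 0.7, 1.0],
]
print(cluster(sim_path, sim_threshold=0.6, share_fraction=0.9))
print(cluster(sim_path, sim_threshold=0.6, share_fraction=0.6))
```

Every iteration builds a fresh dictionary instead of editing the old one, mirroring the immutable-graph constraint noted on the slide.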

Shared Near Neighbor Clustering on Apache Spark GraphX


Challenges

● Tree Edit Distance is very expensive

What’s ahead on the road?

● Integrate with Apache Nutch
● Auto Extraction
  ○ Unsupervised learning on the structure of pages to scrape the actual data of the web page
● Faster Tree Edit Distance
  ○ Maybe with approximation techniques

Demo


https://git.io/vwS69


Summary

● Example Scenario
● Similarity measures
● Clustering as a solution
● Demo

Acknowledgements

● Dr. Chris Mattmann
  ○ My mentor
  ○ Professor, Director at IRDS @ USC - http://irds.usc.edu
  ○ Director, Apache Software Foundation
● DARPA Memex project

Thank You!

● Source Code - https://github.com/USCDataScience/autoextractor
● Tutorial
● Follow up
  ○ Thamme Gowda - @thammegowda
  ○ Chris Mattmann - @chrismattmann
