SILOs - Distributed Web Archiving & Analysis using Map Reduce

Anushree Venkatesh, Sagar Mehta, Sushma Rao


Post on 05-Jan-2016


Page 1: SILOs - Distributed Web Archiving & Analysis using Map Reduce

Anushree Venkatesh, Sagar Mehta, Sushma Rao

Page 2: SILOs - Distributed Web Archiving & Analysis using Map Reduce

Agenda

Motivation
What is Map-Reduce?
Why Map-Reduce?
The HADOOP Framework
Map Reduce in SILOs
SILOs Architecture
Modules
Experiments

Page 3: SILOs - Distributed Web Archiving & Analysis using Map Reduce

Motivation

Life span of a web page – 44 to 75 days
Limitations of centralized/distributed crawling
Exploring map reduce
Analysis of web [subset]
    Web graph
    Search response quality
    Tweaked page rank
    Inverted index

Page 4: SILOs - Distributed Web Archiving & Analysis using Map Reduce
Page 5: SILOs - Distributed Web Archiving & Analysis using Map Reduce

Divide and conquer
Functional programming counterparts -> distributed data processing
Plumbing behind the scenes -> Focus on the problem
Map – Division of key space
Reduce – Combine results
Pipelining functionality
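The map, group-by-key, and reduce steps named above can be sketched in a few lines of single-process Python. This is only an illustration of the programming model, not of Hadoop itself; `run_map_reduce`, `count_words_map`, and `count_words_reduce` are hypothetical names that do not appear in the slides.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Run one map -> group-by-key -> reduce pass in a single process."""
    groups = defaultdict(list)
    for record in records:
        # Map: each input record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle: collect intermediate values under their key.
            groups[key].append(value)
    # Reduce: combine all values of a key into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def count_words_map(line):
    # Emit <word, 1> for every word in the line.
    for word in line.split():
        yield word.lower(), 1


def count_words_reduce(word, counts):
    # Sum the 1s emitted for this word.
    return sum(counts)


word_counts = run_map_reduce(
    ["map reduce", "Map then reduce", "divide and conquer"],
    count_words_map,
    count_words_reduce,
)
```

In a real framework the grouping step is the "plumbing behind the scenes": the user supplies only the mapper and the reducer.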

Page 6: SILOs - Distributed Web Archiving & Analysis using Map Reduce

Open source implementation of Map reduce in Java
HDFS – Hadoop Distributed File System (Hadoop specific)
Takes care of
    fault tolerance
    dependencies between nodes
Setup through VM instance - Problems

Page 7: SILOs - Distributed Web Archiving & Analysis using Map Reduce

Currently Single Node cluster

HDFS Setup

Incorporation of Berkeley DB

Page 8: SILOs - Distributed Web Archiving & Analysis using Map Reduce
Page 9: SILOs - Distributed Web Archiving & Analysis using Map Reduce
Page 10: SILOs - Distributed Web Archiving & Analysis using Map Reduce

[Architecture diagram: SILOs modules and tables. M = map step, R = reduce step.]

Seed List
URL Extractor: M – parse for URL; R – <URL, 1> (remove duplicates)
Distributed Crawler: M – <URL, value>; R – <URL, page content>
Key Word Extractor: M – parse for key word; R – <KeyWord, URL>
Back Links Mapper: M – <Parent, URL>; R – <URL, Parent>
Graph Builder: <URL, parent URL>
Tables: URL Table, Page Content Table, Inverted Index Table, Back Links Table, Adjacency List Table
Other components: Compression, Diff
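As one reading of the Back Links Mapper, the map step can invert each <Parent, URL> edge so that the link target becomes the key, and the reduce step can collect all parents of a URL. The sketch below assumes that interpretation; the function names and the example URLs are hypothetical.

```python
from collections import defaultdict


def back_links_map(parent, url):
    # Invert the edge: the link target becomes the key.
    yield url, parent


def back_links_reduce(url, parents):
    # Deduplicate and sort the parents for a stable result.
    return url, sorted(set(parents))


def build_back_links(edges):
    """Group inverted edges by target URL, standing in for the shuffle step."""
    grouped = defaultdict(list)
    for parent, url in edges:
        for key, value in back_links_map(parent, url):
            grouped[key].append(value)
    return dict(back_links_reduce(u, p) for u, p in grouped.items())


edges = [
    ("a.example/", "a.example/x"),
    ("a.example/y", "a.example/x"),
    ("a.example/", "a.example/y"),
]
back_links = build_back_links(edges)
```

Grouping the same edges by parent instead of by target would give the adjacency list used by the Graph Builder.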

Page 11: SILOs - Distributed Web Archiving & Analysis using Map Reduce
Page 12: SILOs - Distributed Web Archiving & Analysis using Map Reduce

Map

Input: <url, 1>

if (!duplicate(url)) {
    page_content = http_get(url);
    insert into url_table <hash(url), url, hash(page_content), time_stamp>;
    output intermediate pair <url, page_content>;
}
else if (duplicate(url) && (current_time - time_stamp(url) > threshold)) {
    page_content = http_get(url);
    update url_table(hash(url), current_time);
    output intermediate pair <url, page_content>;
}
else {
    update url_table(hash(url), current_time);
}

Reduce

Input: <url, page_content>

if (!exists hash(url) in page_content_table) {
    insert into page_content_table <hash(page_content), compress(page_content)>;
}
else if (hash(page_content_table(hash(url))) != hash(page_content)) {
    insert into page_content_table <hash(page_content), compress(diff_with_latest(page_content))>;
}
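The crawler pseudocode can be approximated as follows. This is a sketch under several assumptions that the slides do not state: in-memory dictionaries stand in for the Berkeley DB tables, SHA-1 is the hash, zlib is the compressor, a unified diff is the diff format, the threshold is one day, and `http_get` is passed in as a function so that no network access is needed. For brevity the first two branches of the map step are merged, since both fetch the page and emit it.

```python
import difflib
import hashlib
import time
import zlib

THRESHOLD_SECONDS = 24 * 3600  # assumed refresh interval; the slides give no value

url_table = {}           # hash(url) -> {"url", "content_hash", "time_stamp"}
page_content_table = {}  # hash(url) -> list of (content_hash, compressed payload)
latest_content = {}      # hash(url) -> latest full text, needed to compute diffs


def digest(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def crawl_map(url, http_get, now=None):
    """Fetch new or stale URLs and emit <url, page_content> pairs."""
    now = time.time() if now is None else now
    key = digest(url)
    entry = url_table.get(key)
    if entry is None or now - entry["time_stamp"] > THRESHOLD_SECONDS:
        page_content = http_get(url)
        url_table[key] = {
            "url": url,
            "content_hash": digest(page_content),
            "time_stamp": now,
        }
        return [(url, page_content)]
    # Recently seen URL: only refresh its time stamp, as the slide does.
    entry["time_stamp"] = now
    return []


def crawl_reduce(url, page_content):
    """Store the first version in full and later versions as compressed diffs."""
    key = digest(url)
    content_hash = digest(page_content)
    versions = page_content_table.setdefault(key, [])
    if not versions:
        payload = page_content
    elif versions[-1][0] != content_hash:
        diff = difflib.unified_diff(
            latest_content[key].splitlines(),
            page_content.splitlines(),
            lineterm="",
        )
        payload = "\n".join(diff)
    else:
        return False  # content unchanged, nothing to store
    versions.append((content_hash, zlib.compress(payload.encode("utf-8"))))
    latest_content[key] = page_content
    return True
```

Keeping the latest full text next to the diffs trades storage for simplicity; a real archive would more likely rebuild it from the stored versions.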

Page 13: SILOs - Distributed Web Archiving & Analysis using Map Reduce

Currently outside of Map-Reduce

Manual transfer of files to HDFS

Currently Depth First Search, will be modified for Breadth First Search
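A breadth-first traversal of the kind planned here could look like the sketch below. It walks an already-known link graph instead of fetching pages, and `bfs_crawl_order`, `links`, and `max_depth` are hypothetical names introduced for illustration.

```python
from collections import deque


def bfs_crawl_order(seed, links, max_depth):
    """Return URLs in breadth-first order, at most max_depth hops from seed."""
    order = []
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not expand links beyond the depth limit
        for child in links.get(url, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return order
```

Swapping the `popleft()` call for `pop()` turns the queue into a stack and gives a depth-first order, which is the current behaviour described on the slide.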

Page 14: SILOs - Distributed Web Archiving & Analysis using Map Reduce

Map

Input: <url, page_content>
List<keywords> = parse(page_content);
For each keyword, emit
Output intermediate pair <keyword, url>

Reduce

Combine all <keyword, url> pairs with the same keyword to emit <keyword, List<urls>>
Insert into inverted index table <keyword, List<urls>>
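The keyword extractor maps directly onto a small inverted-index builder. In this sketch the parser is a simple regular expression that lowercases the text and splits on non-alphanumeric characters, which is an assumption; the slides only say `parse(page_content)`. All function names are hypothetical.

```python
import re
from collections import defaultdict


def keyword_map(url, page_content):
    # Emit <keyword, url> once per distinct keyword on the page.
    for keyword in set(re.findall(r"[a-z0-9]+", page_content.lower())):
        yield keyword, url


def keyword_reduce(keyword, urls):
    # Combine all URLs that share a keyword into one sorted list.
    return keyword, sorted(set(urls))


def build_inverted_index(pages):
    """pages maps url -> page_content; the result maps keyword -> list of urls."""
    grouped = defaultdict(list)
    for url, page_content in pages.items():
        for keyword, emitted_url in keyword_map(url, page_content):
            grouped[keyword].append(emitted_url)
    return dict(keyword_reduce(k, urls) for k, urls in grouped.items())
```

For example, `build_inverted_index({"u1": "Carnegie Mellon University", "u2": "Cornell University"})` maps `"university"` to both URLs and `"cornell"` to `"u2"` only.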

Page 15: SILOs - Distributed Web Archiving & Analysis using Map Reduce

Top Words Along with their Frequency

CMU

Carnegie 2456
Mellon 2107
University 1157
Alumni 786
Center 466
News 395
Library 393
PA 373
Research 357
Pittsburgh, 352
Information 313
School 309

Cornell

Cornell 742
University 378
College 158
Admissions 128
Research 99
Student 94
School 89
Information 77
York 74
Alumni 71
Academics 62
Ithaca 59

Gatech

Tech 2704
Georgia 1882
Alumni 1115
Services 885
Association 646
Career 493
Baseball 416
Engineering 408
Tennis 222
Information 219
students 198
Institute 173
Atlanta 164

Page 16: SILOs - Distributed Web Archiving & Analysis using Map Reduce
Page 17: SILOs - Distributed Web Archiving & Analysis using Map Reduce
Page 18: SILOs - Distributed Web Archiving & Analysis using Map Reduce

Top 6 URL domains that get traversed

CMU

alumni.cmu.edu 92
hr.web.cmu.edu 13
www.alumniconnections.com 16
www.carnegiemellontoday.com 10
www.cmu.edu 170
www.library.cmu.edu 69

Cornell

www.cornell.edu 43
www.cuinfo.cornell.edu 2
www.gradschool.cornell.edu 2
www.news.cornell.edu 7
www.sce.cornell.edu 8
www.vet.cornell.edu 1

Gatech

centennial.gtalumni.org 4
cyberbuzz.gatech.edu 7
georgiatech.searchease.com 9
gtalumni.org 236
ramblinwreck.cstv.com 56
www.gatech.edu 14
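Per-domain counts like these can be derived from a list of traversed URLs. The slides do not say how the figures were produced, so the following is only one plausible way to compute such a table; `top_domains` is a hypothetical helper and the URLs in the usage note are made up.

```python
from collections import Counter
from urllib.parse import urlparse


def top_domains(urls, n=6):
    """Count traversed URLs per host name and return the n most frequent."""
    counts = Counter(urlparse(url).netloc for url in urls)
    return counts.most_common(n)
```

Calling `top_domains(["http://www.cmu.edu/a", "http://www.cmu.edu/b", "http://alumni.cmu.edu/"], n=2)` ranks `www.cmu.edu` first with a count of 2.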

Page 19: SILOs - Distributed Web Archiving & Analysis using Map Reduce

Avg URL Depth

CMU

cmu.edu 2.73
alumni.cmu.edu 2.18
www.library.cmu.edu 2.23
www.alumniconnections.com 4.81

Cornell

cornell.edu 1.34
www.gradschool.cornell.edu 1
www.news.cornell.edu 2.57
www.sce.cornell.edu 1

Gatech

gatech.edu 1
gtalumni.org 3
ramblinwreck.cstv.com 2.57
cyberbuzz.gatech.edu 2

Page 20: SILOs - Distributed Web Archiving & Analysis using Map Reduce


Questions, Comments, Criticisms

Page 21: SILOs - Distributed Web Archiving & Analysis using Map Reduce

HTML Parser
Hadoop Framework (Apache)
Peer Crawl

Page 22: SILOs - Distributed Web Archiving & Analysis using Map Reduce