SILOs - Distributed Web Archiving & Analysis using Map Reduce

Post on 05-Jan-2016


Anushree Venkatesh, Sagar Mehta, Sushma Rao

Agenda
- Motivation
- What is Map-Reduce?
- Why Map-Reduce?
- The HADOOP Framework
- Map Reduce in SILOs
- SILOs Architecture
- Modules
- Experiments

Motivation
- Life span of a web page – 44 to 75 days
- Limitations of centralized/distributed crawling
- Exploring map reduce
- Analysis of web [subset]
  - Web graph
  - Search response quality
  - Tweaked page rank
  - Inverted Index

- Divide and conquer
- Functional programming counterparts -> distributed data processing
- Plumbing behind the scenes -> Focus on the problem
- Map – Division of key space
- Reduce – Combine results
- Pipelining functionality
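The map, group-by-key, and reduce steps listed above can be illustrated with a small in-memory sketch. This is not Hadoop code: `run_job`, `word_mapper`, and `sum_reducer` are hypothetical names chosen for this example, and word counting is used only because it is the simplest job to follow.

```python
from collections import defaultdict
from itertools import chain

def map_phase(records, mapper):
    """Apply the mapper to every input record and collect all (key, value) pairs."""
    return list(chain.from_iterable(mapper(r) for r in records))

def shuffle(pairs):
    """Group intermediate values by key, dividing the work along the key space."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Combine the values collected for each key into one result."""
    return {key: reducer(key, values) for key, values in groups.items()}

def run_job(records, mapper, reducer):
    """Run one map -> shuffle -> reduce pass entirely in memory."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)

def word_mapper(line):
    # Emit <word, 1> for every word in the line.
    return [(word.lower(), 1) for word in line.split()]

def sum_reducer(key, values):
    # Combine all counts that share the same word.
    return sum(values)

counts = run_job(["map reduce", "reduce results"], word_mapper, sum_reducer)
print(counts)  # {'map': 1, 'reduce': 2, 'results': 1}
```

Pipelining then amounts to passing `counts.items()` as the input records of a second `run_job` call.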

- Open source implementation of Map reduce in Java
- HDFS – Hadoop specific file system
- Takes care of
  - fault tolerance
  - dependencies between nodes

- Setup through VM instance - Problems
- Currently Single Node cluster
- HDFS Setup
- Incorporation of Berkeley DB

[Figure: SILOs architecture diagram. Recoverable labels, where M marks a map step and R a reduce step:]

- Seed List
- URL Extractor: M parse for URL; R <URL, 1> (remove duplicates)
- Distributed Crawler: M <URL, value>; R <URL, page content>
- Key Word Extractor: M parse for key word; R <KeyWord, URL>
- Back Links Mapper: M <Parent, URL>; R <URL, Parent>
- Graph Builder: <URL, parent URL>
- Tables: URL Table, Page Content Table, Inverted Index Table, Back Links Table, Adjacency List Table
- Other components: Diff, Compression
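As an illustration of the Back Links Mapper and Graph Builder labels in the diagram, the sketch below reverses <parent, URL> pairs into <URL, parents> and builds an adjacency list from the same edges. The function names, the in-memory dictionaries, and the choice to collect parents into a de-duplicated list are assumptions made for this example; the slides do not show the actual implementation.

```python
from collections import defaultdict

def backlinks_map(parent, url):
    # M: receives <parent, url> and emits the pair keyed by the target URL.
    yield url, parent

def backlinks_reduce(url, parents):
    # R: emits <url, parents>, with duplicate parents removed (an assumption).
    return url, sorted(set(parents))

def build_tables(edges):
    """Build an adjacency list and a back-links table from (parent, url) edges."""
    adjacency = defaultdict(list)  # parent -> outgoing links
    grouped = defaultdict(list)    # url -> parents, grouped before the reduce step
    for parent, url in edges:
        adjacency[parent].append(url)
        for key, value in backlinks_map(parent, url):
            grouped[key].append(value)
    backlinks = dict(backlinks_reduce(u, ps) for u, ps in grouped.items())
    return dict(adjacency), backlinks

edges = [
    ("a.example", "b.example"),
    ("a.example", "c.example"),
    ("b.example", "c.example"),
]
adjacency, backlinks = build_tables(edges)
print(backlinks["c.example"])  # ['a.example', 'b.example']
```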

Map

Input: <url, 1>

    if (!duplicate(url)) {
        page_content = http_get(url);
        insert into url_table <hash(url), url, hash(page_content), time_stamp>
        output intermediate pair <url, page_content>
    }
    else if (duplicate(url) && (current_time - time_stamp(url) > threshold)) {
        page_content = http_get(url);
        update url_table(hash(url), current_time);
        output intermediate pair <url, page_content>
    }
    else {
        update url_table(hash(url), current_time);
    }

Reduce

Input: <url, page_content>

    if (!exists hash(url) in page_content_table) {
        insert into page_content_table <hash(page_content), compress(page_content)>
    }
    else if (hash(page_content_table(hash(url))) != hash(current_page_content)) {
        insert into page_content_table <hash(page_content), compress(diff_with_latest(page_content))>
    }
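The crawler's map and reduce steps above can be tried out with a small in-memory sketch. Several things here are assumptions rather than the authors' design: the tables are plain dictionaries instead of Berkeley DB, versions are keyed by `hash(url)`, SHA-1 stands in for `hash`, `difflib.unified_diff` stands in for `diff_with_latest`, and `fetch` is an injected function so that no real HTTP request is made. The helper `latest_text` is hypothetical and exists only so that a diff can be computed.

```python
import difflib
import hashlib
import zlib

url_table = {}           # hash(url) -> {"url", "content_hash", "time_stamp"}
page_content_table = {}  # hash(url) -> list of (hash(page_content), compressed bytes)
latest_text = {}         # hash(url) -> newest full text (hypothetical helper for diffs)

def h(text):
    """Hash a string; SHA-1 is an arbitrary choice for this sketch."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def crawl_map(url, fetch, now, threshold):
    """Return the intermediate <url, page_content> pairs for one input URL."""
    key = h(url)
    entry = url_table.get(key)
    if entry is None:  # first visit: fetch, record, emit
        content = fetch(url)
        url_table[key] = {"url": url, "content_hash": h(content), "time_stamp": now}
        return [(url, content)]
    stale = now - entry["time_stamp"] > threshold
    entry["time_stamp"] = now  # both duplicate branches in the pseudocode update the time stamp
    return [(url, fetch(url))] if stale else []

def crawl_reduce(url, content):
    """Store the first version in full and later changed versions as compressed diffs."""
    key = h(url)
    versions = page_content_table.setdefault(key, [])
    if not versions:
        payload = content
    elif versions[-1][0] != h(content):
        payload = "\n".join(difflib.unified_diff(
            latest_text[key].splitlines(), content.splitlines(), lineterm=""))
    else:
        return  # unchanged page: nothing to store
    versions.append((h(content), zlib.compress(payload.encode("utf-8"))))
    latest_text[key] = content

# Demo with an in-memory "web" instead of real HTTP requests.
web = {"http://example.org/": "hello\nworld"}
for t in (0, 5, 20):
    if t == 20:
        web["http://example.org/"] = "hello\nthere"  # the page changes before the last crawl
    for url, content in crawl_map("http://example.org/", web.get, now=t, threshold=10):
        crawl_reduce(url, content)
```

In this run the crawl at `t=5` emits nothing because the page is still fresh, while the crawl at `t=20` stores a second, diff-based version.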

Currently outside of Map-Reduce

Manual transfer of files to HDFS

Currently Depth First Search, will be modified for Breadth First Search
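The difference between the current depth-first traversal and the planned breadth-first one comes down to how the frontier of pending URLs is consumed. The sketch below is a generic illustration, not the SILOs code: `traverse` and the `links` dictionary are hypothetical, and the link graph is held in memory rather than discovered by crawling.

```python
from collections import deque

def traverse(seed, links, breadth_first=True):
    """Visit URLs reachable from seed; the frontier acts as a queue (BFS) or a stack (DFS)."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        # BFS takes the oldest pending URL, DFS takes the newest one.
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for child in links.get(url, []):
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return order

links = {"seed": ["a", "b"], "a": ["c"], "b": [], "c": []}
print(traverse("seed", links, breadth_first=True))   # ['seed', 'a', 'b', 'c']
print(traverse("seed", links, breadth_first=False))  # ['seed', 'b', 'a', 'c']
```

Breadth-first order visits every page at one depth before going deeper, which keeps a crawl from following a single chain of links far away from the seed.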

Map

Input: <url, page_content>

    List<keywords> = parse(page_content);
    for each keyword, emit
        output intermediate pair <keyword, url>

Reduce

    Combine all <keyword, url> pairs with the same keyword to emit <keyword, List<urls>>
    Insert into inverted index table <keyword, List<urls>>
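A minimal in-memory sketch of the keyword map and reduce steps above follows. The regular expression standing in for `parse`, the lower-casing of keywords, and the removal of duplicate URLs in the reduce step are assumptions made for this example; the slides do not specify how keywords are tokenized.

```python
import re
from collections import defaultdict

def keyword_map(url, page_content):
    # Emit an intermediate <keyword, url> pair for every keyword occurrence.
    for keyword in re.findall(r"[a-z0-9]+", page_content.lower()):
        yield keyword, url

def keyword_reduce(keyword, urls):
    # Combine all URLs seen for the same keyword into one de-duplicated list.
    return keyword, sorted(set(urls))

def build_inverted_index(pages):
    """pages maps url -> page_content; the result maps keyword -> list of urls."""
    grouped = defaultdict(list)
    for url, content in pages.items():
        for keyword, u in keyword_map(url, content):
            grouped[keyword].append(u)
    return dict(keyword_reduce(k, us) for k, us in grouped.items())

pages = {"u1": "Alumni news", "u2": "Library news"}
index = build_inverted_index(pages)
print(index["news"])  # ['u1', 'u2']
```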

Top Words Along with their Frequency

CMU
  Carnegie      2456
  Mellon        2107
  University    1157
  Alumni         786
  Center         466
  News           395
  Library        393
  PA             373
  Research       357
  Pittsburgh,    352
  Information    313
  School         309

Cornell
  Cornell        742
  University     378
  College        158
  Admissions     128
  Research        99
  Student         94
  School          89
  Information     77
  York            74
  Alumni          71
  Academics       62
  Ithaca          59

Gatech
  Tech          2704
  Georgia       1882
  Alumni        1115
  Services       885
  Association    646
  Career         493
  Baseball       416
  Engineering    408
  Tennis         222
  Information    219
  students       198
  Institute      173
  Atlanta        164
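Frequency tables of this kind can be produced by counting tokens over the crawled page contents. The sketch below is one possible way to do it, not the authors' script: `top_words` is a hypothetical helper, and splitting on whitespace without stripping punctuation or changing case is a guess based on entries such as "Pittsburgh," and "students" in the table.

```python
from collections import Counter

def top_words(page_contents, n):
    """Return the n most frequent whitespace-separated tokens across all pages."""
    counts = Counter()
    for content in page_contents:
        counts.update(content.split())
    return counts.most_common(n)

sample = ["Tech Georgia Tech", "Georgia Tech Alumni"]
print(top_words(sample, 2))  # [('Tech', 3), ('Georgia', 2)]
```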

Top 6 URL domains that get traversed

CMU
  alumni.cmu.edu                 92
  hr.web.cmu.edu                 13
  www.alumniconnections.com      16
  www.carnegiemellontoday.com    10
  www.cmu.edu                   170
  www.library.cmu.edu            69

Cornell
  www.cornell.edu                43
  www.cuinfo.cornell.edu          2
  www.gradschool.cornell.edu      2
  www.news.cornell.edu            7
  www.sce.cornell.edu             8
  www.vet.cornell.edu             1

Gatech
  centennial.gtalumni.org         4
  cyberbuzz.gatech.edu            7
  georgiatech.searchease.com      9
  gtalumni.org                  236
  ramblinwreck.cstv.com          56
  www.gatech.edu                 14
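Per-domain counts like these can be derived from the list of traversed URLs by grouping on the host part. The following sketch assumes the traversed URLs are available as absolute URL strings; `top_domains` and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

def top_domains(urls, n=6):
    """Count traversed URLs per host name and return the n most common hosts."""
    counts = Counter(urlparse(u).netloc.lower() for u in urls)
    return counts.most_common(n)

traversed = [
    "http://www.cmu.edu/a",
    "http://www.cmu.edu/b",
    "http://alumni.cmu.edu/",
]
print(top_domains(traversed))  # [('www.cmu.edu', 2), ('alumni.cmu.edu', 1)]
```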

Avg URL Depth

CMU
  cmu.edu                       2.73
  alumni.cmu.edu                2.18
  www.library.cmu.edu           2.23
  www.alumniconnections.com     4.81

Cornell
  cornell.edu                   1.34
  www.gradschool.cornell.edu    1
  www.news.cornell.edu          2.57
  www.sce.cornell.edu           1

Gatech
  gatech.edu                    1
  gtalumni.org                  3
  ramblinwreck.cstv.com         2.57
  cyberbuzz.gatech.edu          2


Questions, Comments, Criticisms

- HTML Parser
- Hadoop Framework (Apache)
- Peer Crawl
