
Anushree Venkatesh, Sagar Mehta, Sushma Rao


Post on 06-Jan-2018



Page 1: Anushree Venkatesh Sagar Mehta Sushma Rao.  Motivation  What is Map-Reduce?  Why Map-Reduce?  The HADOOP Framework  Map Reduce in SILOs  SILOs Architecture

Anushree Venkatesh
Sagar Mehta
Sushma Rao


Motivation
What is Map-Reduce?
Why Map-Reduce?
The HADOOP Framework
Map Reduce in SILOs
SILOs Architecture
    Modules
Experiments


Life span of a web page: 44 to 75 days
Limitations of centralized/distributed crawling
Exploring map reduce
Analysis of web [subset]
    Web graph
    Search response quality
        Tweaked page rank
        Inverted index


Divide and conquer
Functional programming counterparts -> distributed data processing
Plumbing behind the scenes -> Focus on the problem
Map: Division of key space
Reduce: Combine results
Pipelining functionality
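The map and reduce roles listed above can be illustrated with a minimal in-memory sketch. It is not the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_mapper`, `run_word_count`) and the word-count task are assumptions chosen for illustration. The `shuffle` step stands in for the "plumbing" that a framework would normally hide.

```python
from collections import defaultdict
from functools import reduce


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key, i.e. divide the key space among reducers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Combine the values collected for each key into a single result."""
    return {key: reduce(reducer, values) for key, values in groups.items()}


def word_mapper(line):
    """Hypothetical mapper: emit (word, 1) for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def run_word_count(lines):
    """Pipeline the three phases: map, shuffle, reduce."""
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), lambda a, b: a + b)
```

Because each phase only consumes the output of the previous one, the output of one job can be fed as the input of another, which is the pipelining property mentioned on the slide.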


Open source implementation of Map reduce in Java
HDFS: Hadoop specific file system
Takes care of
    fault tolerance
    dependencies between nodes
Setup through VM instance - Problems


Currently Single Node cluster

HDFS Setup

Incorporation of Berkeley DB


[Figure: SILOs architecture diagram. Labels recovered from the slide; M = map step, R = reduce step]
Seed List
URL Extractor: M Parse for URL; R <URL, 1> (Remove Duplicates)
Key Word Extractor: M Parse for key word; R <KeyWord, URL>
Distributed Crawler: M <URL, value>; R <URL, page content>
Back Links Mapper: M <Parent, URL>; R <URL, Parent>
Graph Builder: <URL, parent URL>
Tables: URL Table, Page Content Table, Inverted Index Table, Back Links Table, Adjacency List Table
Other labels: Compression, Diff
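The Back Links Mapper labels in the diagram (map on <Parent, URL>, reduce to <URL, Parent>) suggest an edge inversion. The following is a minimal sketch under that reading; the function names and the choice to collect parents as a sorted list are assumptions, not details from the slides.

```python
from collections import defaultdict


def back_links_map(parent, url):
    """Invert one edge: key the pair by the link target instead of the source."""
    return (url, parent)


def back_links_reduce(pairs):
    """Collect, for each URL, the distinct pages that link to it."""
    table = defaultdict(set)
    for url, parent in pairs:
        table[url].add(parent)
    return {url: sorted(parents) for url, parents in table.items()}


def build_back_links(edges):
    """edges is an iterable of (parent, url) pairs, e.g. from a graph builder."""
    return back_links_reduce(back_links_map(p, u) for p, u in edges)
```

Using a set per URL drops repeated links from the same parent, which may or may not match how the Back Links Table was actually populated.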


Map
Input: <url, 1>

    if (!duplicate(url)) {
        page_content = http_get(url);
        insert into url_table <hash(url), url, hash(page_content), time_stamp>;
        output intermediate pair <url, page_content>;
    } else if (current_time - time_stamp(url) > threshold) {
        page_content = http_get(url);
        update url_table(hash(url), current_time);
        output intermediate pair <url, page_content>;
    } else {
        update url_table(hash(url), current_time);
    }

Reduce
Input: <url, page_content>

    if (!exists(hash(url)) in page_content_table) {
        insert into page_content_table <hash(page_content), compress(page_content)>;
    } else if (hash(page_content_table(hash(url))) != hash(current_page_content)) {
        insert into page_content_table <hash(page_content), compress(diff_with_latest(page_content))>;
    }
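The crawler's map and reduce steps can be tried out in memory with the sketch below. Several things here are assumptions rather than details from the slides: the class and method names, the injected `fetch` function (so no network access is needed), SHA-1 as the hash, `zlib` for compression, `difflib` for the diff, and keying the page content table by `hash(url)` with a list of stored versions.

```python
import difflib
import hashlib
import zlib


def digest(text):
    """Hash a string; SHA-1 is an arbitrary stand-in for the slide's hash()."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


class CrawlerSketch:
    def __init__(self, fetch, threshold_seconds):
        self.fetch = fetch                  # hypothetical stand-in for http_get
        self.threshold = threshold_seconds
        self.url_table = {}                 # hash(url) -> (url, hash(content), time_stamp)
        self.page_content_table = {}        # hash(url) -> list of compressed versions
        self.latest = {}                    # hash(url) -> latest plain text, used for diffs

    def map_url(self, url, now):
        """Fetch new or stale URLs and emit <url, page_content>; otherwise emit nothing."""
        key = digest(url)
        row = self.url_table.get(key)
        if row is None or now - row[2] > self.threshold:
            content = self.fetch(url)
            self.url_table[key] = (url, digest(content), now)
            return [(url, content)]
        # Seen recently: only refresh the time stamp, as in the pseudocode.
        self.url_table[key] = (row[0], row[1], now)
        return []

    def reduce_page(self, url, content):
        """Store the full page the first time, and a compressed diff when it changes."""
        key = digest(url)
        if key not in self.page_content_table:
            self.page_content_table[key] = [zlib.compress(content.encode("utf-8"))]
        elif digest(self.latest[key]) != digest(content):
            diff = "\n".join(difflib.unified_diff(
                self.latest[key].splitlines(), content.splitlines(), lineterm=""))
            self.page_content_table[key].append(zlib.compress(diff.encode("utf-8")))
        self.latest[key] = content
```

Passing `now` explicitly instead of reading the clock keeps the staleness check easy to exercise with small numbers.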


Currently outside of Map-Reduce

Manual transfer of files to HDFS

Currently Depth First Search, will be modified for Breadth First Search
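The planned switch from depth first to breadth first search comes down to how the frontier of unvisited URLs is consumed. The sketch below is a generic illustration, not the project's code; the `links` dictionary and the function name `crawl_order` are assumptions.

```python
from collections import deque


def crawl_order(links, seed, breadth_first):
    """Return the order in which URLs are visited starting from seed.

    links maps each URL to the list of URLs it points to.
    """
    frontier = deque([seed])
    seen = set()
    order = []
    while frontier:
        # A queue (FIFO) gives breadth first order, a stack (LIFO) gives depth first order.
        url = frontier.popleft() if breadth_first else frontier.pop()
        if url in seen:
            continue
        seen.add(url)
        order.append(url)
        for child in links.get(url, []):
            if child not in seen:
                frontier.append(child)
    return order
```

Breadth first order visits all pages at a given distance from the seed before going deeper, which tends to suit crawls that are cut off at a fixed depth.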


Map
Input: <url, page_content>

    List<keywords> = parse(page_content);
    for each keyword, emit intermediate pair <keyword, url>

Reduce

    Combine all <keyword, url> pairs with the same keyword to emit <keyword, List<urls>>
    Insert into inverted index table <keyword, List<urls>>
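The keyword extractor's map and reduce steps can be sketched in a few lines. The slides do not say how `parse` tokenizes a page, so a plain alphanumeric tokenizer is assumed here; the function names are also hypothetical.

```python
import re
from collections import defaultdict


def keyword_map(url, page_content):
    """Emit one <keyword, url> pair per token found in the page."""
    # Assumption: keywords are runs of letters and digits; the real parse() is unspecified.
    return [(word, url) for word in re.findall(r"[A-Za-z0-9]+", page_content)]


def keyword_reduce(pairs):
    """Combine pairs that share a keyword into <keyword, list of urls>."""
    index = defaultdict(set)
    for keyword, url in pairs:
        index[keyword].add(url)
    return {keyword: sorted(urls) for keyword, urls in index.items()}


def build_inverted_index(pages):
    """pages maps url -> page_content; returns the inverted index as a dict."""
    pairs = []
    for url, content in pages.items():
        pairs.extend(keyword_map(url, content))
    return keyword_reduce(pairs)
```

For example, `build_inverted_index({"u1": "Tech alumni", "u2": "alumni news"})` maps `"alumni"` to both URLs and the other two words to one URL each.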


Top Words Along with their Frequency

CMU
Word           Frequency
Carnegie       2456
Mellon         2107
University     1157
Alumni         786
Center         466
News           395
Library        393
PA             373
Research       357
Pittsburgh,    352
Information    313
School         309

Cornell
Word           Frequency
Cornell        742
University     378
College        158
Admissions     128
Research       99
Student        94
School         89
Information    77
York           74
Alumni         71
Academics      62
Ithaca         59

Gatech
Word           Frequency
Tech           2704
Georgia        1882
Alumni         1115
Services       885
Association    646
Career         493
Baseball       416
Engineering    408
Tennis         222
Information    219
students       198
Institute      173
Atlanta        164


Top 6 URL domains that get traversed

CMU
Domain                         Count
alumni.cmu.edu                 92
hr.web.cmu.edu                 13
www.alumniconnections.com      16
www.carnegiemellontoday.com    10
www.cmu.edu                    170
www.library.cmu.edu            69

Cornell
Domain                         Count
www.cornell.edu                43
www.cuinfo.cornell.edu         2
www.gradschool.cornell.edu     2
www.news.cornell.edu           7
www.sce.cornell.edu            8
www.vet.cornell.edu            1

Gatech
Domain                         Count
centennial.gtalumni.org        4
cyberbuzz.gatech.edu           7
georgiatech.searchease.com     9
gtalumni.org                   236
ramblinwreck.cstv.com          56
www.gatech.edu                 14
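A per-domain tally like the one above can be produced from a list of crawled URLs with the standard library. This is only a sketch of one way to do it; the slides do not describe how the counts were actually computed, and the function name `top_domains` is an assumption.

```python
from collections import Counter
from urllib.parse import urlparse


def top_domains(urls, n=6):
    """Count crawled URLs per host name and return the n most frequent hosts."""
    hosts = Counter(urlparse(url).netloc.lower() for url in urls)
    return hosts.most_common(n)
```

Note that `urlparse` only fills in `netloc` when the URL carries a scheme such as `http://`, so scheme-less links would need to be normalized first.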


Avg URL Depth

CMU
Domain                        Avg depth
cmu.edu                       2.73
alumni.cmu.edu                2.18
www.library.cmu.edu           2.23
www.alumniconnections.com     4.81

Cornell
Domain                        Avg depth
cornell.edu                   1.34
www.gradschool.cornell.edu    1
www.news.cornell.edu          2.57
www.sce.cornell.edu           1

Gatech
Domain                        Avg depth
gatech.edu                    1
gtalumni.org                  3
ramblinwreck.cstv.com         2.57
cyberbuzz.gatech.edu          2



Questions, Comments, Criticisms


HTML Parser
Hadoop Framework (Apache)
Peer Crawl
