Introduction to Nutch Zhao Dongsheng 2008.9.29


Page 1: Introduction to Nutch

Introduction to Nutch

Zhao Dongsheng
2008.9.29

Page 2: Introduction to Nutch

Summary

What's Nutch
Nutch's architecture
How to use Nutch
About the first homework

Page 3: Introduction to Nutch

What's Nutch

Written in Java
Open-source project
An application for building a search engine (SE)
Runs behind the search on a lot of web sites

Page 4: Introduction to Nutch

What's Nutch: Lucene and Nutch

Nutch grew out of Lucene
Both are open-source projects
Both are written in Java
But
  Lucene is a Java library for text indexing and search
  Nutch is an application
  Nutch uses Lucene for indexing

Page 5: Introduction to Nutch

Nutch's architecture

Page 6: Introduction to Nutch

Nutch's core components

Fetcher
  Requests web pages
  Parses pages and extracts links
Web DB
  Page DB
    Used for fetch scheduling
  Link DB
    Stores the link graph
    Stores anchor text with each link
    Used for link analysis and anchor-text indexing
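The Link DB idea above (a link graph plus the anchor text on each link) can be sketched in a few lines. This is only a toy in-memory model: the class name LinkDB, its methods, and the example URLs other than the PKU one are hypothetical and do not reflect Nutch's actual storage format or API.

```python
from collections import defaultdict


class LinkDB:
    """Toy in-memory link store (hypothetical, not Nutch's on-disk format)."""

    def __init__(self):
        # source URL -> list of (target URL, anchor text)
        self.outlinks = defaultdict(list)
        # target URL -> list of (source URL, anchor text)
        self.inlinks = defaultdict(list)

    def add_link(self, source, target, anchor):
        """Record one hyperlink together with its anchor text."""
        self.outlinks[source].append((target, anchor))
        self.inlinks[target].append((source, anchor))

    def anchors_for(self, target):
        """Anchor texts of all links pointing at target (for anchor-text indexing)."""
        return [anchor for _, anchor in self.inlinks.get(target, [])]

    def inlink_count(self, target):
        """Number of inbound links, the simplest link-analysis signal."""
        return len(self.inlinks.get(target, []))


db = LinkDB()
db.add_link("http://a.example/", "http://www.pku.edu.cn/", "Peking University")
db.add_link("http://b.example/", "http://www.pku.edu.cn/", "PKU home page")
print(db.anchors_for("http://www.pku.edu.cn/"))
```

Keeping the inbound direction indexed by target is what lets the indexer attach other sites' descriptions of a page to that page.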

Page 7: Introduction to Nutch

Nutch's core components (cont.)

Indexer
  Creates the inverted index
  Uses Lucene
Searcher
  Finds relevant docs quickly
  Ranks the docs
  Summarizes the results
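To make "inverted index" concrete, here is a minimal sketch of the data structure: a map from each term to the set of documents containing it, plus an AND query over it. Nutch delegates this work to Lucene; the functions below (tokenize, build_inverted_index, search) and the sample documents are hypothetical illustrations, not Lucene's API or on-disk format, and ranking and summarizing are left out.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of doc ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the doc ids that contain every term of the query (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


docs = {
    1: "Nutch is an open-source search engine",
    2: "Lucene is a Java library for text indexing and search",
    3: "Nutch uses Lucene for indexing",
}
index = build_inverted_index(docs)
print(sorted(search(index, "lucene indexing")))
```

The searcher is fast because a query only touches the posting sets of its own terms instead of scanning every document.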

Page 8: Introduction to Nutch

Functions Nutch supports

Politeness when crawling
Duplicate removal
PageRank analysis
Distributed searching
Summarizing
...
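Politeness usually means leaving a minimum delay between two requests to the same host. The sketch below plans fetch start times under that rule. It is an illustration only: the function schedule_fetches, the 5-second delay, and the example hosts are assumptions, each fetch is treated as taking zero time, and this is not how Nutch's Fetcher is implemented.

```python
from urllib.parse import urlsplit


def schedule_fetches(urls, fetch_delay=5.0):
    """Plan (start_time, url) pairs so that requests to the same host
    are at least fetch_delay seconds apart (fetch duration ignored)."""
    next_allowed = {}  # host -> earliest time the next request may start
    schedule = []
    for url in urls:
        host = urlsplit(url).hostname
        start = next_allowed.get(host, 0.0)
        schedule.append((start, url))
        next_allowed[host] = start + fetch_delay
    # Sort by start time so the plan reads in the order fetches would run.
    return sorted(schedule)


plan = schedule_fetches([
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
])
for start, url in plan:
    print(f"t={start:>4.1f}s  {url}")
```

Different hosts can be fetched in parallel, which is why a crawler can stay polite per host and still reach a high overall rate.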

Page 9: Introduction to Nutch

Nutch's Technical Goals

Fetch several billion pages per month
Maintain an index of these pages
Search that index up to 1000 times per second
Provide very high quality search results
Operate at minimal cost
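The fetch goal is easier to appreciate as a sustained rate. A quick back-of-the-envelope calculation, assuming a 30-day month and evenly spread fetching (the function name pages_per_second is just for illustration):

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 s, assuming a 30-day month


def pages_per_second(pages_per_month):
    """Average fetch rate needed to reach a monthly page count."""
    return pages_per_month / SECONDS_PER_MONTH


for billions in (1, 3):
    rate = pages_per_second(billions * 1_000_000_000)
    print(f"{billions} billion pages/month ~ {rate:.0f} pages/second")
# 1 billion pages/month ~ 386 pages/second
# 3 billion pages/month ~ 1157 pages/second
```

So every billion pages per month costs roughly 386 fetches per second around the clock, before any retries or re-fetches.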

Page 10: Introduction to Nutch

Source code & API

Source dirs
  analysis crawl html plugin scoring segment tools
  fetcher indexer net parse protocol searcher ...
crawl/Crawl.java
fetcher/Fetcher.java

Page 11: Introduction to Nutch

How to use Nutch

Download & unpack
  Nutch requires a JVM
  Set environment variables
Configure
  Specify root URLs
  Specify URL filters
  Optionally specify
    Number of threads
    Levels to crawl
    Fetch delay

Page 12: Introduction to Nutch

How to use Nutch (cont.)

Root URLs example
  http://www.pku.edu.cn
URL filter example (crawl-urlfilter.txt)
  -^(file|ftp|mailto):
  -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
  -[?*!@=]
  +^http://([a-z0-9]*\.)*pku.edu.cn/
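The filter rules above are read top to bottom: a leading "-" excludes, a leading "+" includes, the first rule whose pattern is found in the URL decides, and a URL that matches no rule is ignored. The sketch below mimics that behaviour for the four rules on this slide. Nutch itself applies them with Java regular expressions; Python's re module is used here as a stand-in, and compile_rules and accepts are hypothetical helper names, not Nutch functions.

```python
import re

# The four rules from the slide, in file order.
RULES = [
    r"-^(file|ftp|mailto):",
    r"-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls"
    r"|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$",
    r"-[?*!@=]",
    r"+^http://([a-z0-9]*\.)*pku.edu.cn/",
]


def compile_rules(lines):
    """Turn '+regex' / '-regex' lines into (include, compiled_pattern) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL matching no rule is rejected."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False


rules = compile_rules(RULES)
for url in [
    "http://www.pku.edu.cn/",
    "http://www.pku.edu.cn/logo.gif",
    "http://www.pku.edu.cn/index.jsp?id=1",
    "http://example.com/",
]:
    print(accepts(rules, url), url)
```

Because order matters, the exclusion rules sit before the "+" rule: an image or a query URL on pku.edu.cn is rejected before the domain rule is ever reached.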

Page 13: Introduction to Nutch

How to use Nutch (cont.)

Run Nutch
  Just one command line:
    bin/nutch crawl myurl.txt -dir mycrawl -depth 4 >& crawl.log
  Use Tomcat to try out the search!

Page 14: Introduction to Nutch

Home page

Page 15: Introduction to Nutch

Search result

Page 16: Introduction to Nutch

Score Explanation

Page 17: Introduction to Nutch

Anchor texts with a link

Page 18: Introduction to Nutch

About the first homework

About web crawling
Get familiar with Nutch & Java
Fetch blogs, BBS, etc.?
Need your advice!

Page 19: Introduction to Nutch

Q & A

Thanks!