Page 1: Introduction to Nutch

Introduction to Nutch

CSCI 572: Information Retrieval and Search Engines

Summer 2011

Page 2: Introduction to Nutch

May-26-11 CS572-Summer2011 CAM-2

Outline

• What is Nutch?
  – Motivation
  – Architecture
  – What currently exists?
  – How I got involved
• Deploying Nutch on NASA’s Planetary Data System (PDS)
  – Free text “Google-like” search of the PDS catalog
  – Architecture/Implementation

Page 3: Introduction to Nutch

What is Nutch?

• The brainchild of Doug Cutting
  – Research/programmer guru who has worked at several high-profile research labs (Yahoo, Bell Labs)
• Nutch builds upon Cutting’s lower-level text indexing library and API, called Lucene
• Nutch provides crawling, protocol, parsing, and content management services on top of the indexing capability provided by Lucene
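
To make the "indexing capability" concrete, here is a minimal sketch of an inverted index, the kind of data structure a text indexing library is built around. It is plain Python and does not use the Lucene API; the names `tokenize`, `build_index`, and `search`, and the sample documents, are assumptions made up for illustration.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive analyzer)."""
    return text.lower().split()


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, not taken from any real crawl.
docs = {
    1: "Nutch builds on Lucene",
    2: "Lucene is a text indexing library",
    3: "Nutch provides crawling services",
}
index = build_index(docs)
print(sorted(search(index, "nutch lucene")))  # prints [1]
```

A crawler, parser, and content store then exist mainly to feed text into a structure like this one.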

Page 4: Introduction to Nutch

Motivation

• Observation: Web Search is a commodity
  – Why can’t it be provided freely?

• Allows tweaking of typically “hidden” ranking algorithms

• Allows developers to focus less on the infrastructure (since Brin & Page’s paper, the infrastructure is well-known), and more on providing value-added capabilities

Page 5: Introduction to Nutch

Motivation

• Value-added capabilities
  – Improving fetching speed
  – Parsing and handling of the hundreds of different content types available on the internet
  – Handling different protocols for obtaining content
  – Better ranking algorithms (OPIC, PageRank)

• More or less, in Nutch, these capabilities all map to extension points available via Nutch’s plugin framework
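
As an illustration of the ranking algorithms mentioned above, the following is a minimal sketch of textbook PageRank computed by power iteration. It is not Nutch's scoring code, and OPIC is not shown; the function name `pagerank`, the damping factor of 0.85, the iteration count, and the three-page link graph are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))  # prints ['c', 'a', 'b']
```

Page `c` ranks highest because it receives links from both other pages. A scoring extension would replace exactly this kind of computation.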

Page 6: Introduction to Nutch

Nutch’s Architecture

• Nutch Core facilities
  – Parsing
  – Indexing
  – Crawling
  – Content Management
  – Querying
  – Plugin Framework
• Nutch’s extension points
  – Scoring, Parsing, Indexing, Querying, URLFiltering
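
The idea of extension points can be sketched with a small registry: the core defines a named hook, plugins register implementations against it, and the core calls whatever is registered. This is a language-neutral illustration in Python, not Nutch's plugin API; `PluginRegistry`, `accept_urls`, the two filters, and the convention that a filter returns the URL to accept it or `None` to reject it are all assumptions for this example.

```python
class PluginRegistry:
    """Holds plugin implementations keyed by extension point name."""

    def __init__(self):
        self._extensions = {}

    def register(self, extension_point, plugin):
        self._extensions.setdefault(extension_point, []).append(plugin)

    def extensions(self, extension_point):
        return list(self._extensions.get(extension_point, []))


def accept_urls(registry, urls):
    """Pass each URL through every registered URL filter in order.

    A filter returns the URL to accept it or None to reject it.
    """
    accepted = []
    for url in urls:
        for url_filter in registry.extensions("URLFilter"):
            url = url_filter(url)
            if url is None:
                break
        if url is not None:
            accepted.append(url)
    return accepted


# Two hypothetical plugins for the URL filtering extension point.
def http_only(url):
    return url if url.startswith("http://") else None


def skip_images(url):
    return None if url.endswith((".gif", ".jpg", ".png")) else url


registry = PluginRegistry()
registry.register("URLFilter", http_only)
registry.register("URLFilter", skip_images)

urls = [
    "http://nutch.apache.org/",
    "ftp://example.org/file",
    "http://example.org/logo.png",
]
print(accept_urls(registry, urls))  # prints ['http://nutch.apache.org/']
```

The core code in `accept_urls` never names a concrete filter, which is what lets scoring, parsing, indexing, and querying behavior be swapped without touching the core.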

Page 7: Introduction to Nutch

Nutch’s Architecture

• Maps to the search engine architecture proposed by Brin & Page

Page 8: Introduction to Nutch

What Currently Exists?

• Version 0.6.x
  – First easily deployable version
• Version 0.7.x
  – Added several new features, including several new parsers (MS-WORD, PowerPoint), URLFilter extension point, first Apache release after Incubation, mime type system
• Version 0.8.x
  – Completely new underlying architecture based on Hadoop
  – Parse plugins framework, multi-valued metadata container
  – Parser Factory enhancement
• Version 0.9.x
  – Major bug fixes
  – Hadoop and Lucene library upgrades
• Version 1.0
  – Flexible filter framework
  – Flexible scoring
  – Initial integration with Tika
  – Full Search Engine functionality and capabilities, in production at large scale (Internet Archive)
• Version 1.1; for the full list, see http://svn.apache.org/repos/asf/nutch/trunk/CHANGES.txt

Page 9: Introduction to Nutch

What Doesn’t?

• Plenty!
• Bug fixes (> 200 issues in JIRA right now with no resolution)
• Nutch 2.0 architecture
  – http://search-lucene.com/m/gbrBF1RMWk9
  – Refactored Nutch architecture, delegating to Solr, HBase, Tika, and ORM

Page 10: Introduction to Nutch

How I got involved

• In this very class!
  – Okay, well, it used to be called CS599, but you get the picture
• Started out by contributing an RSS parsing plugin
  – My final project in 599
• Moved on from there to
  – NUTCH-88, redesign of the parsing framework
  – NUTCH-139, Metadata container support
  – NUTCH-210, Web Context application file
  – Various other bug fixes and contributions here and there
  – Mailing list support
  – Wiki support
• Became committer in October 2006
• Helped spin Nutch into Apache TLP, March 2010, Nutch PMC member

Page 11: Introduction to Nutch

Real world application of Nutch

• I work at NASA’s Jet Propulsion Laboratory
• NASA’s Planetary Data System
  – NASA’s archive for all planetary science data collected by missions over the past 30 years
  – Collected 20 TB over the past 30 years
    • Increasing to over 200 TB in the next 3 years!
  – Built up a catalog of all data collected
• Where does Nutch fit in?

Page 12: Introduction to Nutch

Where does Nutch fit into the PDS?

• The PDS Management Council decided it wanted “Google-like” search of the PDS catalog

• Our plan: use Nutch to implement this capability for the PDS

Page 13: Introduction to Nutch

PDS Google-like Search Architecture

[Figure: PDS search components shown against a generic search engine architecture (e.g. Nutch, Google). Diagram labels: PDS Catalog, PDS-D, Existing PDS, Query, Indexer, Index, Lucene, Crawler, PDSExtract, Parser, PDSParser, pds.war, Tomcat, WebServer, CatalogMetadata]

Page 14: Introduction to Nutch

Approach

• Export PDS catalog datasets in RDF format (flat files)
• Use Nutch to crawl the RDF files
  – protocol-file plugin in Nutch
• Wrote our own parse-pds plugin
  – Parse the RDF files, and then extract the metadata
• Wrote our own index-pds plugin
  – Index the fields that we want from the parsed metadata
• Wrote our own query-pds plugin
  – Search the index on the fields that we want
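
The parse, index, and query steps above can be sketched end to end in a few lines. This is a rough illustration in Python using only the standard library, not the actual parse-pds, index-pds, or query-pds plugins. The sample RDF, the `ex:` namespace, the field names `title` and `target`, and the function names are all hypothetical; the real PDS export format is not shown in these slides.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical RDF/XML standing in for an exported catalog flat file.
SAMPLE_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/pds#">
  <rdf:Description rdf:about="urn:example:dataset-1">
    <ex:title>Mars surface images</ex:title>
    <ex:target>Mars</ex:target>
  </rdf:Description>
  <rdf:Description rdf:about="urn:example:dataset-2">
    <ex:title>Saturn ring occultation data</ex:title>
    <ex:target>Saturn</ex:target>
  </rdf:Description>
</rdf:RDF>
"""


def parse_records(rdf_text):
    """Parse step: turn each rdf:Description into a flat metadata dict."""
    root = ET.fromstring(rdf_text)
    records = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        metadata = {"id": desc.get(f"{{{RDF_NS}}}about")}
        for child in desc:
            field = child.tag.split("}", 1)[-1]  # drop the XML namespace
            metadata[field] = (child.text or "").strip()
        records.append(metadata)
    return records


def index_fields(records, fields):
    """Index step: index only the chosen fields, keyed by (field, term)."""
    field_index = defaultdict(set)
    for record in records:
        for field in fields:
            for term in record.get(field, "").lower().split():
                field_index[(field, term)].add(record["id"])
    return field_index


def query_field(field_index, field, term):
    """Query step: look up one term in one field."""
    return sorted(field_index.get((field, term.lower()), set()))


records = parse_records(SAMPLE_RDF)
field_index = index_fields(records, ["title", "target"])
print(query_field(field_index, "target", "Mars"))  # prints ['urn:example:dataset-1']
```

Keeping the three steps as separate functions mirrors the split into separate parse, index, and query plugins: each step can choose its own fields without the others changing.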

Page 15: Introduction to Nutch

Search Interface

Page 16: Introduction to Nutch

Results

Page 17: Introduction to Nutch

Lessons Learned

• Nutch currently isn’t exactly simple to deploy or configure
  – Much of the discussion on the mailing lists refers to “magic configuration” properties that aren’t intuitive
• Nutch documentation is currently… lacking
• Once you know how to use Nutch, it is extremely easy to use and a time-saver
• Active participation in the mailing lists and wiki is necessary to use Nutch

Page 18: Introduction to Nutch

Good News

• Nutch is here to stay
  – The only open source implementation of commodity web search
  – If you want to start your own Google++, Nutch is a great place to start
• Participation is welcome
  – Look what happened to me (student -> committer)
  – Plenty of areas to improve (including documentation)

Page 19: Introduction to Nutch

Your Class Project

• It’s probably a good idea to at least take a look at Nutch, whether you use it or not
• You can see how a real implementation of the theory described in class operates
  – Implemented in pure Java (1.5)
• Add/extend capabilities within Nutch
  – Help finish plugging Nutch into HBase
  – Configure Nutch using Spring
  – Fully integrate Nutch and Solr
  – Fix *important* bugs
  – Add more scoring algorithm implementations

Page 20: Introduction to Nutch

Wrapup

• Thanks for your attention!
• Nutch home page:
  – http://nutch.apache.org
• Mailing lists
  – [email protected] (developer’s list)
  – [email protected] (user’s list)