introduction to nutch csci 572: information retrieval and search engines summer 2010
TRANSCRIPT
Introduction to Nutch
CSCI 572: Information Retrieval and Search Engines
Summer 2010
May-20-10 CS572-Summer2010 CAM-2
Outline
• What is Nutch?– Motivation
– Architecture
– What currently exists?
– How I got involved
• Deploying Nutch on NASA’s Planetary Data System (PDS)– Free text “Google-like” search of the PDS catalog
– Architecture/Implementation
May-20-10 CS572-Summer2010 CAM-3
What is Nutch?
• The brainchild of Doug Cutting– Research/programmer guru who has worked at several
high profile research labs (Yahoo, Bell Labs)
• Nutch builds upon Cutting’s lower level text indexing library and API called Lucene
• Nutch provides crawling services, protocol services, parsing services, content management services on top of the indexing capability provided by Lucene
May-20-10 CS572-Summer2010 CAM-4
Motivation
• Observation: Web Search is a commodity– Why can’t it be provided freely?
• Allows tweaking of typically “hidden” ranking algorithms
• Allows developers to focus less on the infrastructure (since Brin & Page’s paper, the infrastructure is well-known), and more on providing value-added capabilities
May-20-10 CS572-Summer2010 CAM-5
Motivation
• Value-added capabilities– Improving fetching speed
– Parsing and handling of the hundreds of different content types available on the internet
– Handling different protocols for obtaining content
– Better ranking algorithms (OPIC, PageRank)
• More or less, in Nutch, these capabilities all map to extension points available via Nutch’s plugin framework
May-20-10 CS572-Summer2010 CAM-6
Nutch’s Architecture
• Nutch Core facilities– Parsing
– Indexing
– Crawling
– Content Management
– Querying
– Plugin Framework
• Nutch’s extension points– Scoring, Parsing, Indexing, Querying, URLFiltering
May-20-10 CS572-Summer2010 CAM-7
Nutch’s ArchitectureMaps to
Search engine architecture
proposed by Brin & Page
May-20-10 CS572-Summer2010 CAM-8
What Currently Exists?• Version 0.6.x
– First easily deployable version
• Version 0.7.x– Added several new features including several new parsers (MS-WORD, PowerPoint), URLFilter
extension point, first Apache release after Incubation, mime type system
• Version 0.8.x– Completely new underlying architecture based on Hadoop– Parse plugins framework, multi-valued metadata container– Parser Factory enhancement
• Version 0.9.x– Major bug fixes– Hadoop, and Lucene library upgrades
• Version 1.0– Flexible filter framework– Flexible scoring– Initial integration with Tika– Full Search Engine functionality and capabilities, in production at large scale (Internet Archive)
• Version 1.1, For full list, see http://svn.apache.org/repos/asf/nutch/trunk/CHANGES.txt
May-20-10 CS572-Summer2010 CAM-9
What Doesn’t?• Plenty!• Bug fixes (> 200 issues in JIRA right now with no
resolution)
• Nutch 2.0 architecture– http://search-lucene.com/m/gbrBF1RMWk9
– Refactored Nutch architecture, delegating to Solr, HBase, Tika, and ORM
May-20-10 CS572-Summer2010 CAM-10
How I got involved• In this very class!
– Okay well it used to be called Cs599, but you get the picture
• Started out by contributing RSS parsing plugin– My final project in 599
• Moved on from there to– NUTCH-88, redesign of the parsing framework– NUTCH-139, Metadata container support– NUTCH-210, Web Context application file– And various other bug fixes, and contributions here and there– Mailing list support– Wiki support
• Became committer in October 2006• Helped spin Nutch into Apache TLP, March 2010, Nutch
PMC member
May-20-10 CS572-Summer2010 CAM-11
Real world application of Nutch• I work at NASA’s Jet Propulsion
Laboratory• NASA’s Planetary Data System
– NASA’s archive for all planetary science data collected by missions over the past 30 years
– Collected 20 TB over the past 30 years• Increasing to over 200 TB in the next 3
years!
– Built up a catalog of all data collected
• Where does Nutch fit in?
May-20-10 CS572-Summer2010 CAM-12
Where does Nutch fit into the PDS?
• PDS Management Council decide they want “Google-like” search of the PDS catalog
• Our plan: use Nutch to implement capability for PDS
May-20-10 CS572-Summer2010 CAM-13
PDS Google-like Search ArchitectureSearch Engine Architecture (e.g. Nutch, Google)
PDS Catalog
PDS-D
Existing PDS
Query
Indexer Index
Lucene
Crawler
PDSExtract
Parser
PDSParser
pds.war
Tomcat
WebServer
CatalogMetadata
May-20-10 CS572-Summer2010 CAM-14
Approach• Export PDS catalog datasets in RDF format (flat files)
• Use nutch to crawl RDF files– protocol-file plugin in Nutch
• Wrote our own parse-pds plugin– Parse the RDF files, and then extract the metadata
• Wrote our own index-pds plugin– Index the fields that we want from the parsed metadata
• Wrote our own query-pds plugin– Search the index on the fields that we want
May-20-10 CS572-Summer2010 CAM-15
Search Interface
May-20-10 CS572-Summer2010 CAM-16
Results
May-20-10 CS572-Summer2010 CAM-17
Lessons Learned• Nutch currently isn’t exactly simple to deploy, or
configure– There is much discussion on mailing lists that refer to
“magic configuration” properties that aren’t intuitive
• Nutch documentation is currently…lacking• If you know how to use Nutch then it is extremely
easy to use, and a time-saver• Active participation in mailing lists, wiki,
necessary to use Nutch
May-20-10 CS572-Summer2010 CAM-18
Good News
• Nutch is here to stay– Only open source, implementation for commodity web
search
– If you want to start your own Google++, Nutch is a great place to start
• Participation is welcome– Look what happened to me (student-> commiter)
– Plenty of areas to improve (including documentation)
May-20-10 CS572-Summer2010 CAM-19
Your Class Project• It’s probably a good idea to at least take a look at Nutch,
whether you use it or not• You can see how a real implementation of theory described
in class operates– Implemented in pure Java (1.5)
• Add/extend capabilities within Nutch– Help finish plugging Nutch into HBase– Configure Nutch using Spring– Fully integrate Nutch and Solr– Fix *important* bugs– Add more scoring algorithm implementations
May-20-10 CS572-Summer2010 CAM-20
Wrapup
• Thanks for your attention!• Nutch home page:
– http://nutch.apache.org
• Mailing lists– [email protected] (developer’s list)
– [email protected] (user’s list)