harnessing the power of nutch with scala

Post on 11-Nov-2014



TRANSCRIPT

Crawling the web, Nutch with Scala

Vikas Hazrati @


about

CTO at Knoldus Software

Co-Founder at MyCellWasStolen.com

Community Editor at InfoQ.com

Dabbling with Scala – last 40 months

Enterprise grade implementations on Scala – 18 months


nutch

Web search software

lucene

solr

crawler, link-graph, parsing

nutch – but we have google!

transparent

understanding

extensible


nutch – basic architecture

crawler, searcher

nutch - architecture

web database, crawl db, fetchlists

links

pages

segments

crawler

Recursive


nutch – crawl cycle: the generate – fetch – update cycle

Create crawldb

Inject root URLs In crawldb

Generate fetchlist

Fetch content

Update crawldb

repeat until depth reached

Update segments

Index fetched pages

deduplication

Merge indexes for searching

bin/nutch crawl urls -dir crawl -depth 3 -topN 5
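The generate – fetch – update loop above can be sketched as plain Scala over a toy in-memory link graph. This is only an illustration of the cycle's shape; the graph, URLs, and names like `crawlDb` and `fetchlist` are stand-ins, not Nutch APIs.

```scala
// Toy link graph standing in for the web (illustrative data).
val webGraph = Map(
  "http://a" -> List("http://b", "http://c"),
  "http://b" -> List("http://d"),
  "http://c" -> List[String](),
  "http://d" -> List[String]()
)

// Inject root URLs into the crawl db.
var crawlDb: Set[String] = Set("http://a")
var fetched: Set[String] = Set.empty

val depth = 3
for (_ <- 1 to depth) {                  // repeat until depth reached
  val fetchlist = crawlDb -- fetched     // generate fetchlist
  val newLinks = fetchlist.flatMap(url =>
    webGraph.getOrElse(url, Nil))        // fetch content, extract links
  fetched ++= fetchlist
  crawlDb ++= newLinks                   // update crawl db with new links
}

println(fetched.size)  // all four pages reached within depth 3
```

The `-depth` and `-topN` flags of `bin/nutch crawl` bound how many times this loop runs and how many URLs each fetchlist may contain.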


nutch - plugins

Create crawldb

Inject root URLs In crawldb

Generate fetchlist

Fetch content

Update crawldb

parser

HTMLParserFilter

URL Filter

scoring filter

generate – fetch – update cycle


nutch – extension points

plugin.xml

build.xml

ivy.xml

// tells Nutch about the plugin

// build the plugin

// plugin dependencies

src // plugin source


nutch - example

<plugin id="KnoldusAggregator" name="Knoldus Parse Filter"
        version="1.0.0" provider-name="nutch.org">
  <runtime>
    <library name="kdaggregator.jar">
      <export name="*" />
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints" />
  </requires>
  <extension id="org.apache.nutch.parse.headings"
             name="Nutch Headings Parse Filter"
             point="org.apache.nutch.parse.HtmlParseFilter">
    <implementation id="KDParseFilter"
                    class="com.knoldus.aggregator.server.plugins.DetailParserFilter" />
  </extension>
</plugin>


public ParseResult filter(Content content, ParseResult parseResult,
    HTMLMetaTags metaTags, DocumentFragment doc) {
  LOG.debug("Parsing URL: " + content.getUrl());
  Parse parse = parseResult.get(content.getUrl());
  Metadata metadata = parse.getData().getParseMeta();
  for (String tag : tags) {
    metadata.add(TAG_KEY, tag);
  }
  return parseResult;
}


scala – I have Java!

concurrency verbose

popular

OO library

Strongly typed

jvm


scala

Java:

class Person {
  private String firstName;
  private String lastName;
  private int age;

  public Person(String firstName, String lastName, int age) {
    this.firstName = firstName;
    this.lastName = lastName;
    this.age = age;
  }

  public void setFirstName(String firstName) { this.firstName = firstName; }
  public String getFirstName() { return this.firstName; }
  public void setLastName(String lastName) { this.lastName = lastName; }
  public String getLastName() { return this.lastName; }
  public void setAge(int age) { this.age = age; }
  public int getAge() { return this.age; }
}

Scala:

class Person(var firstName: String, var lastName: String, var age: Int)

Source: http://blog.objectmentor.com/articles/2008/08/03/the-seductions-of-scala-part-i
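The one-line Scala class is equivalent because each `var` constructor parameter gets a generated field, getter, and setter. A quick usage sketch (the name and age values here are made up):

```scala
// Each var parameter yields a field plus a generated getter and setter.
class Person(var firstName: String, var lastName: String, var age: Int)

val p = new Person("Vikas", "Hazrati", 35)
p.age = p.age + 1               // calls the generated setter age_=
println(p.firstName + " " + p.age)
```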


scala

Java – everything is an object unless it is primitive

Scala – everything is an object. period.

Java – has operators (+, -, < ..) and methods

Scala – operators are methods

Java – statically typed – Thing thing = new Thing()

Scala – statically typed but uses type inference – val thing = new Thing
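The last two points can be seen together in a few lines of Scala; the values are arbitrary examples:

```scala
// Type inference: no declared types, yet everything is statically typed.
val n = 40 + 2          // n is inferred as Int
val m = (40).+(2)       // '+' is an ordinary method on Int – same call as above
val s = "scala".reverse // s is inferred as String
println(n + " " + m + " " + s)
```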


evolution


scala and concurrency

Fine grained, coarse grained

Actors


actors
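The essence of an actor – a mailbox drained by its own thread of control, with fire-and-forget message sends – can be sketched in plain Scala. This toy `ToyActor` is illustrative only; the talk's plugin used a real actor library, not this class:

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, LinkedBlockingQueue}

// A toy actor: a mailbox drained by a dedicated thread (illustrative only).
class ToyActor(handler: String => Unit) {
  private val mailbox = new LinkedBlockingQueue[String]()
  private val worker = new Thread(() => {
    var msg = mailbox.take()
    while (msg != "stop") {     // "stop" is this sketch's poison pill
      handler(msg)
      msg = mailbox.take()
    }
  })
  worker.start()

  def !(msg: String): Unit = mailbox.put(msg)  // fire-and-forget send
  def join(): Unit = worker.join()
}

val seen = new ConcurrentLinkedQueue[String]()
val actor = new ToyActor(seen.add(_))
actor ! "page-1"
actor ! "page-2"
actor ! "stop"
actor.join()
println(seen.size)  // 2
```

The sender never blocks on the receiver; messages queue up in the mailbox and are processed one at a time, which is what makes actors a coarse-grained alternative to lock-based fine-grained concurrency.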


problem context

Aggregator

UGC


solution

Aggregator

Supplier 1

Supplier 2

Supplier 3


Create crawldb

Inject root URLs In crawldb

Generate fetchlist

Fetch content

Update crawldb

Supplier URLs

plugins written in Scala


logic

Crawl the supplier

Is URL interesting? Parse

Pass extraction to actor

seed database


plugin - scala

class DetailParserFilter extends HtmlParseFilter {

  def filter(content: Content, parseResult: ParseResult,
      metaTags: HTMLMetaTags, doc: DocumentFragment): ParseResult = {
    if (isDetailURL(content.getUrl)) {
      val rawHtml = content.getContent
      if (rawHtml.length > 0) processContent(rawHtml)
    }
    parseResult
  }

  private def isDetailURL(url: String): Boolean = {
    val result = url.matches(AggregatorConfiguration.regexEventDetailPages)
    result
  }

  private def processContent(rawHtml: Array[Byte]) = {
    (new DetailProcessor).start ! rawHtml
  }
}


result

5 suppliers crawled

Crawl cycles run continuously for a few days

> 500K seed data collected

All with Nutch and 823 lines of Scala code


demo

in action ….


resources

http://blog.knoldus.com

http://wiki.apache.org/nutch/NutchTutorial

http://www.scala-lang.org/

vikas@knoldus.com
