harnessing the power of nutch with scala

Post on 11-Nov-2014



TRANSCRIPT

Crawling the web, Nutch with Scala

Vikas Hazrati @


about

CTO at Knoldus Software

Co-Founder at MyCellWasStolen.com

Community Editor at InfoQ.com

Dabbling with Scala – last 40 months

Enterprise grade implementations on Scala – 18 months


nutch

Web search software

lucene

solr

crawler, link-graph, parsing

nutch – but we have google!

transparent

understanding

extensible


nutch – basic architecture

crawler, searcher

nutch - architecture

web database, crawl db, fetchlists

links

pages

segments

crawler

Recursive


nutch – crawl cycle: the generate – fetch – update cycle

Create crawldb

Inject root URLs In crawldb

Generate fetchlist

Fetch content

Update crawldb

repeat until depth reached

Update segments

Index fetched pages

deduplication

Merge indexes for searching

bin/nutch crawl urls -dir crawl -depth 3 -topN 5
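The generate – fetch – update loop above can be sketched as plain Scala over a toy in-memory link graph. This is only an illustration of the cycle's shape; the graph, URLs, and names like `crawlDb` and `fetchlist` are stand-ins, not Nutch APIs.

```scala
// Toy link graph standing in for the web (illustrative data).
val webGraph = Map(
  "http://a" -> List("http://b", "http://c"),
  "http://b" -> List("http://d"),
  "http://c" -> List[String](),
  "http://d" -> List[String]()
)

// Inject root URLs into the crawl db.
var crawlDb: Set[String] = Set("http://a")
var fetched: Set[String] = Set.empty

val depth = 3
for (_ <- 1 to depth) {                  // repeat until depth reached
  val fetchlist = crawlDb -- fetched     // generate fetchlist
  val newLinks = fetchlist.flatMap(url =>
    webGraph.getOrElse(url, Nil))        // fetch content, extract links
  fetched ++= fetchlist
  crawlDb ++= newLinks                   // update crawl db with new links
}

println(fetched.size)  // all four pages reached within depth 3
```

The `-depth` and `-topN` flags of `bin/nutch crawl` bound how many times this loop runs and how many URLs each fetchlist may contain.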


nutch - plugins

Create crawldb

Inject root URLs In crawldb

Generate fetchlist

Fetch content

Update crawldb

parser

HTMLParserFilter

URL Filter

scoring filter

generate – fetch – update cycle


nutch – extension points

plugin.xml

build.xml

ivy.xml

// tells Nutch about the plugin

// build the plugin

// plugin dependencies

src // plugin source


nutch - example

<plugin id="KnoldusAggregator" name="Knoldus Parse Filter"
        version="1.0.0" provider-name="nutch.org">
  <runtime>
    <library name="kdaggregator.jar">
      <export name="*" />
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints" />
  </requires>
  <extension id="org.apache.nutch.parse.headings"
             name="Nutch Headings Parse Filter"
             point="org.apache.nutch.parse.HtmlParseFilter">
    <implementation id="KDParseFilter"
                    class="com.knoldus.aggregator.server.plugins.DetailParserFilter" />
  </extension>
</plugin>


public ParseResult filter(Content content, ParseResult parseResult,
    HTMLMetaTags metaTags, DocumentFragment doc) {
  LOG.debug("Parsing URL: " + content.getUrl());
  Parse parse = parseResult.get(content.getUrl());
  Metadata metadata = parse.getData().getParseMeta();
  for (String tag : tags) {
    metadata.add(TAG_KEY, tag);
  }
  return parseResult;
}


scala – I have Java!

concurrency verbose

popular

OO library

Strongly typed

jvm


scala

Java:

class Person {
  private String firstName;
  private String lastName;
  private int age;

  public Person(String firstName, String lastName, int age) {
    this.firstName = firstName;
    this.lastName = lastName;
    this.age = age;
  }

  public void setFirstName(String firstName) { this.firstName = firstName; }
  public String getFirstName() { return this.firstName; }
  public void setLastName(String lastName) { this.lastName = lastName; }
  public String getLastName() { return this.lastName; }
  public void setAge(int age) { this.age = age; }
  public int getAge() { return this.age; }
}

Scala:

class Person(var firstName: String, var lastName: String, var age: Int)

Source: http://blog.objectmentor.com/articles/2008/08/03/the-seductions-of-scala-part-i
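The one-line Scala class is equivalent because each `var` constructor parameter gets a generated field, getter, and setter. A quick usage sketch (the name and age values here are made up):

```scala
// Each var parameter yields a field plus a generated getter and setter.
class Person(var firstName: String, var lastName: String, var age: Int)

val p = new Person("Vikas", "Hazrati", 35)
p.age = p.age + 1               // calls the generated setter age_=
println(p.firstName + " " + p.age)
```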


scala

Java – everything is an object unless it is primitive

Scala – everything is an object. period.

Java – has operators (+, -, < ..) and methods

Scala – operators are methods

Java – statically typed – Thing thing = new Thing()

Scala – statically typed but uses type inference – val thing = new Thing
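The last two points can be seen together in a few lines of Scala; the values are arbitrary examples:

```scala
// Type inference: no declared types, yet everything is statically typed.
val n = 40 + 2          // n is inferred as Int
val m = (40).+(2)       // '+' is an ordinary method on Int – same call as above
val s = "scala".reverse // s is inferred as String
println(n + " " + m + " " + s)
```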


evolution


scala and concurrency

Fine grained, coarse grained

Actors


actors
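The essence of an actor – a mailbox drained by its own thread of control, with fire-and-forget message sends – can be sketched in plain Scala. This toy `ToyActor` is illustrative only; the talk's plugin used a real actor library, not this class:

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, LinkedBlockingQueue}

// A toy actor: a mailbox drained by a dedicated thread (illustrative only).
class ToyActor(handler: String => Unit) {
  private val mailbox = new LinkedBlockingQueue[String]()
  private val worker = new Thread(() => {
    var msg = mailbox.take()
    while (msg != "stop") {     // "stop" is this sketch's poison pill
      handler(msg)
      msg = mailbox.take()
    }
  })
  worker.start()

  def !(msg: String): Unit = mailbox.put(msg)  // fire-and-forget send
  def join(): Unit = worker.join()
}

val seen = new ConcurrentLinkedQueue[String]()
val actor = new ToyActor(seen.add(_))
actor ! "page-1"
actor ! "page-2"
actor ! "stop"
actor.join()
println(seen.size)  // 2
```

The sender never blocks on the receiver; messages queue up in the mailbox and are processed one at a time, which is what makes actors a coarse-grained alternative to lock-based fine-grained concurrency.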


problem context

Aggregator

UGC


solution

Aggregator

Supplier 1

Supplier 2

Supplier 3


Create crawldb

Inject root URLs In crawldb

Generate fetchlist

Fetch content

Update crawldb

Supplier URLs

plugins written in Scala


logic

Crawl the supplier

Is URL interesting? Parse

Pass extraction to actor

seed database


plugin - scala

class DetailParserFilter extends HtmlParseFilter {

  def filter(content: Content, parseResult: ParseResult,
      metaTags: HTMLMetaTags, doc: DocumentFragment): ParseResult = {
    if (isDetailURL(content.getUrl)) {
      val rawHtml = content.getContent
      if (rawHtml.length > 0) processContent(rawHtml)
    }
    parseResult
  }

  private def isDetailURL(url: String): Boolean = {
    val result = url.matches(AggregatorConfiguration.regexEventDetailPages)
    result
  }

  private def processContent(rawHtml: Array[Byte]) = {
    (new DetailProcessor).start ! rawHtml
  }
}


result

5 suppliers crawled

Crawl cycles run continuously for a few days

> 500K seed data collected

All with Nutch and 823 lines of Scala code


demo

in action ….


resources

http://blog.knoldus.com

http://wiki.apache.org/nutch/NutchTutorial

http://www.scala-lang.org/

vikas@knoldus.com
