Apache Nutch: Web Crawling and Data Gathering
Steve Watt - @wattsteve, IBM Big Data Lead
Data Day Austin


DESCRIPTION

Apache Nutch Presentation by Steve Watt at Data Day Austin 2011

TRANSCRIPT

Page 1: Web Crawling and Data Gathering with Apache Nutch

Apache Nutch

Web Crawling and Data Gathering

Steve Watt - @wattsteve
IBM Big Data Lead
Data Day Austin

Page 2: Web Crawling and Data Gathering with Apache Nutch


Topics

Introduction

The Big Data Analytics Ecosystem

Load Tooling

How is Crawl data being used?

Web Crawling - Considerations

Apache Nutch Overview

Apache Nutch Crawl Lifecycle, Setup and Demos

Page 3: Web Crawling and Data Gathering with Apache Nutch


The Offline (Analytics) Big Data Ecosystem

[Diagram] Load Tooling feeds Web Content and Your Content into Hadoop; Data Catalogs, Analytics Tooling, and Export Tooling sit above it to Find, Analyze, Visualize, and Consume the data.

Page 4: Web Crawling and Data Gathering with Apache Nutch


Load Tooling - Data Gathering Patterns and Enablers

Web Content

– Downloading – Amazon Public DataSets / InfoChimps

– Stream Harvesting – Collecta / Roll-your-own (Twitter4J; see the sketch after this list)

– API Harvesting – Roll your own (Facebook REST Query)

– Web Crawling – Nutch

Your Content

– Copy from FileSystem

– Load from Database - SQOOP

– Event Collection Frameworks - Scribe and Flume
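
By way of illustration, a minimal roll-your-own stream harvester with Twitter4J might look like the sketch below. This is not from the deck: the "austin" track keyword and printing to stdout are placeholders, the class name is made up, and OAuth credentials are assumed to live in a twitter4j.properties file on the classpath.

    import twitter4j.FilterQuery;
    import twitter4j.Status;
    import twitter4j.StatusAdapter;
    import twitter4j.TwitterStream;
    import twitter4j.TwitterStreamFactory;

    public class StreamHarvester {
        public static void main(String[] args) {
            // Credentials are read from twitter4j.properties on the classpath
            TwitterStream stream = new TwitterStreamFactory().getInstance();
            stream.addListener(new StatusAdapter() {
                @Override
                public void onStatus(Status status) {
                    // One matching tweet per call; a real harvester would write
                    // these to HDFS or a queue instead of stdout
                    System.out.println(status.getUser().getScreenName()
                            + "\t" + status.getText());
                }
            });
            // Harvest only tweets matching a keyword filter
            stream.filter(new FilterQuery().track(new String[]{"austin"}));
        }
    }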

Page 5: Web Crawling and Data Gathering with Apache Nutch


How is Crawl data being used?

Build your own search engine

– Built-in Lucene indexes for querying

– Solr integration for multi-faceted search

Analytics

– Selective filtering and extraction with data from a single provider

– Joining datasets from multiple providers for further analytics

– Event Portal example: Is Austin really a startup town?

Extension of the mashup paradigm - “Content Providers cannot predict how their data will be re-purposed”

Page 6: Web Crawling and Data Gathering with Apache Nutch


Web Crawling - Considerations

Robots.txt (see the example after this list)

Facebook lawsuit against API Harvester

“No Crawling without written approval” in Mint.com Terms of Use

What if the web had as many crawlers as Apache Web Servers?
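
As a refresher, robots.txt is the convention a polite crawler checks first: a plain-text file at the site root saying what may be fetched. An illustrative example (the path and delay are made up; Crawl-delay is a de facto extension honored by some crawlers, not part of the original standard):

    User-agent: *
    Disallow: /private/
    Crawl-delay: 10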

Page 7: Web Crawling and Data Gathering with Apache Nutch


Apache Nutch – What is it?

Apache Nutch Project – nutch.apache.org – Hadoop + Web Crawler + Lucene

A Hadoop-based web crawler? How does that work?

Page 8: Web Crawling and Data Gathering with Apache Nutch


Apache Nutch Overview

Seeds and Crawl Filters

Crawl Depths

Fetch Lists and Partitioning

Segments - Segment Reading using Hadoop

Indexing / Lucene

Web Application for Querying

Page 9: Web Crawling and Data Gathering with Apache Nutch

Apache Nutch - Web Application

Page 10: Web Crawling and Data Gathering with Apache Nutch

Crawl Lifecycle

[Diagram] Inject → Generate → Fetch → CrawlDB Update (looping once per level of crawl depth) → LinkDB → Index → Dedup → Merge

Page 11: Web Crawling and Data Gathering with Apache Nutch

Single Process Web Crawling

Page 12: Web Crawling and Data Gathering with Apache Nutch

Single Process Web Crawling

- Create the seed file and copy it into a “urls” directory

- Export JAVA_HOME

- Edit the conf/crawl-urlfilter.txt regex to constrain the crawl (usually by domain; see the examples after this list)

- Edit the conf/nutch-site.xml and specify an http.agent.name

- bin/nutch crawl urls -dir crawl -depth 2
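
For example, with illustrative placeholder values (example.com and the agent name are mine, not the deck's):

In conf/crawl-urlfilter.txt, keep URLs under one domain and drop everything else:

    +^http://([a-z0-9]*\.)*example.com/
    -.

In conf/nutch-site.xml, set the agent name the fetcher requires before it will run:

    <property>
      <name>http.agent.name</name>
      <value>MyNutchCrawler</value>
    </property>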

D E M O

Page 13: Web Crawling and Data Gathering with Apache Nutch

Distributed Web Crawling

Page 14: Web Crawling and Data Gathering with Apache Nutch

Distributed Web Crawling

- The Nutch distribution is overkill if you already have a Hadoop cluster. It's also not how you'd really integrate with Hadoop these days, but there is some history to consider. The Nutch Wiki has a Distributed Setup guide.

- Why orchestrate your crawl?

- How?

– Create the seed file and copy it into a “urls” directory, then copy the directory up to HDFS

– Edit the conf/crawl-urlfilter.txt regex to constrain the crawl (Usually via domain)

– Copy the conf/nutch-site.xml, conf/nutch-default.xml, conf/nutch-conf.xml & conf/crawl-urlfilter.txt to the Hadoop conf directory.

– Restart Hadoop so the new files are picked up in the classpath

Page 15: Web Crawling and Data Gathering with Apache Nutch

Distributed Web Crawling

- Code Review: org.apache.nutch.crawl.Crawl

- Orchestrated Crawl Example (Step 1 - Inject):

bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.Injector crawl/crawldb urls
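
Subsequent steps chain the same pattern through the rest of the lifecycle. A sketch, assuming the Nutch 1.2 class layout (<segment> stands for the directory Generate creates under crawl/segments; exact flags vary by version, so check each class's usage message):

bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.Generator crawl/crawldb crawl/segments

bin/hadoop jar nutch-1.2.0.job org.apache.nutch.fetcher.Fetcher crawl/segments/<segment>

bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.CrawlDb crawl/crawldb crawl/segments/<segment>

bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.LinkDb crawl/linkdb -dir crawl/segments

Repeat Generate, Fetcher and CrawlDb once per level of crawl depth before building the LinkDb and indexing.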

D E M O

Page 16: Web Crawling and Data Gathering with Apache Nutch

Segment Reading

Page 17: Web Crawling and Data Gathering with Apache Nutch


Segment Readers

The SegmentReader class is not all that useful. But here it is anyway:

– bin/nutch readseg -list crawl/segments/20110128170617

– bin/nutch readseg -dump crawl/segments/20110128170617 dumpdir

What you really want to do is process each crawled page in MapReduce as an individual record

– SequenceFileInputFormat over Nutch HDFS segments FTW

– RecordReader returns Content objects as values

Code Walkthrough
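
In that spirit, here is a minimal sketch of the pattern, not the deck's actual walkthrough code. It assumes the Nutch 1.x segment layout, where a segment's content subdirectory holds SequenceFile data of <Text URL, Content> records, and uses the old org.apache.hadoop.mapred API of that era; the class name is hypothetical.

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.nutch.protocol.Content;

    public class SegmentContentReader {

        // The RecordReader hands us one crawled page per call:
        // URL as the key, the raw Content object as the value
        public static class PageMapper extends MapReduceBase
                implements Mapper<Text, Content, Text, Text> {
            public void map(Text url, Content content,
                            OutputCollector<Text, Text> output, Reporter reporter)
                    throws IOException {
                // Emit the URL and its MIME type; a real job would
                // process content.getContent() (the raw page bytes)
                output.collect(url, new Text(content.getContentType()));
            }
        }

        public static void main(String[] args) throws IOException {
            JobConf job = new JobConf(SegmentContentReader.class);
            job.setJobName("segment-content-reader");
            // args[0]: e.g. crawl/segments/20110128170617/content
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setInputFormat(SequenceFileInputFormat.class);
            job.setMapperClass(PageMapper.class);
            job.setNumReduceTasks(0); // map-only: one output record per page
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            JobClient.runJob(job);
        }
    }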

D E M O

Page 18: Web Crawling and Data Gathering with Apache Nutch

Thanks

Questions ?

Steve Watt - [email protected]

Twitter: @wattsteve

Blog: stevewatt.blogspot.com

austinhug.blogspot.com