
Crawl the entire web in 10 minutes... and just 100 €

Using AWS-EMR, AWS-S3, PIG, CommonCrawl

Copyright ©: 2015 OnPage.org GmbH

Since 2011 in Munich

Work at OnPage.org

Interested in web crawling and BigData frameworks

Building low-cost, scalable BigData solutions

About Me

Twitter: @danny_munich

Facebook: https://www.facebook.com/danny.linden2

E-mail: [email protected]

Do you want to build your own Search Engine?

- High hardware / cloud costs

- Nutch needs ~1 hour per 1 million URLs

- You want to crawl > 1 billion URLs (at that rate roughly 1,000 hours, i.e. about six weeks of continuous crawling)

Solution ?

Don't Crawl!

- Use Common Crawl: https://commoncrawl.org

- A non-profit organization

- Roughly monthly crawls, each covering over 2 billion crawled URLs

- Over 1,000 TB in total since 2009

- URL seed list from Blekko: https://blekko.com

Don't Crawl! – Use Common Crawl!

- Scalably stored on Amazon AWS S3

- Hadoop-compatible format, powered by Archive.org (Wayback Machine)

- Partitionable via S3 object prefixes (see the sketch below)

- 100 MB - 1 GB file sizes (gzip), a good size for Hadoop jobs
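
A minimal sketch of how the prefix partitioning looks from Pig (FileLoaderClass is the ArcLoader alias DEFINEd in the PIG example later in this deck, and the crawl-002 paths are taken from there; how much data each prefix actually covers is an assumption):

-- Broad prefix: every segment of crawl-002 from September 2010 (a lot of data)
all_pages = LOAD 's3://aws-publicdatasets/common-crawl/crawl-002/2010/09/' USING FileLoaderClass AS (url, html);

-- Narrow prefix plus a glob: just one segment's files, handy for cheap test runs
sample_pages = LOAD 's3://aws-publicdatasets/common-crawl/crawl-002/2010/09/25/0/1285398*.arc.gz' USING FileLoaderClass AS (url, html);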

Nice Data Format

Format 1: WARC – stores the raw crawl data (the raw HTML)

Format 2: WAT – stores only the meta-information, as JSON

Format 3: WET – stores only the plain-text content

Choose the right format

- WARC (raw HTML): ~1,000 MB

- WAT (metadata as JSON): ~450 MB

- WET (plain text): ~150 MB
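
If only the extracted text is needed, WET files can be read with plain Pig and no special loader. A minimal sketch, assuming Pig's built-in TextLoader, a hypothetical S3 path, and a crude header-filtering regex (not the official way to parse WET records):

-- Load WET records line by line; .gz files are decompressed automatically by Hadoop
wet_lines = LOAD 's3://example-bucket/wet/*.warc.wet.gz' USING TextLoader() AS (line:chararray);

-- Crude filter: drop WARC/WET record header lines, keep the extracted body text
text_only = FILTER wet_lines BY NOT (line MATCHES 'WARC.*' OR line MATCHES 'Content-.*');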

Processing

- Pure Hadoop with MapReduce

- Input Classes: http://commoncrawl.org/the-data/get-started/

Processing

- High-level ETL layer like PIG: http://pig.apache.org

- Example code:

- https://github.com/norvigaward/warcexamples

- https://github.com/mortardata/mortar-examples

- https://github.com/matpalm/common-crawl

PIG Example

-- Register the Piggybank UDFs and define the Common Crawl ARC loader
REGISTER file:/home/hadoop/lib/pig/piggybank.jar
DEFINE FileLoaderClass org.commoncrawl.pig.ArcLoader();

-- Input: a single crawl segment; the commented-out path would process a whole month
%default INPUT_PATH "s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/25/0/1285398*.arc.gz";
-- %default INPUT_PATH "s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/";
%default OUTPUT_PATH "s3://example-bucket/out";

-- Load (url, html) tuples directly from the ARC files on S3
pages = LOAD '$INPUT_PATH' USING FileLoaderClass AS (url, html);

-- Extract the <title> tag from each page and keep only pages that have one
meta_titles = FOREACH pages GENERATE url, REGEX_EXTRACT(html, '<title>(.*)</title>', 1) AS meta_title;
filtered = FILTER meta_titles BY meta_title IS NOT NULL;

-- Write the results back to S3 as tab-separated values
STORE filtered INTO '$OUTPUT_PATH' USING PigStorage('\t');
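
The %default values are only fallbacks: when the script is run (locally, or as a Pig step on EMR), the input and output paths can be overridden with Pig's -param option, e.g. pig -param INPUT_PATH=... -param OUTPUT_PATH=... crawl_titles.pig (the script file name here is hypothetical).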

Hadoop & PIG on AWS

- Supports new Hadoop releases

- PIG integration

- Replaces HDFS with S3

- Easy UI to get started quickly

- Pay per hour, scale out as much as possible

It's Demo Time!

Let's cross our fingers now

That's it!

Contact:

Twitter: @danny_munich

Facebook: https://www.facebook.com/danny.linden2

E-mail: [email protected]

And: We are hiring!

https://de.onpage.org/about/jobs/