SemaGrow demonstrator: “Web Crawler + AgroTagger”

Crawling the Web

Fabrizio Celli

Rome, 25th September 2014

Outline

• Purpose of this Webinar
• The Web Crawler
• The AgroTagger
• The AGRIS use case
  – What’s next?

Purpose of this Webinar

• SemaGrow is a project funded by the Seventh Framework Programme (FP7) of the European Commission

• Algorithms, infrastructures and methodologies to cope with large data volumes and real time performance

• http://www.semagrow.eu

• One of the SemaGrow demonstrators is the “Web Crawler + AgroTagger” component, the subject of this Webinar

The demonstrator

• It is based on two command line applications (no user interface):
  – Web Crawler
  – AgroTagger

• Goal:
  – discover resources on the Web
  – tag resources with AGROVOC URIs
  – keep only resources about agriculture and interlink them to AGRIS

What we expect from the Webinar

• Comments, suggestions, opinions
• Other real case scenarios for the demonstrator
• You can send your feedback to agris@fao.org

THE WEB-CRAWLER

Apache Nutch

• http://nutch.apache.org/
• Highly extensible and scalable open source Web crawler
• Configurable
• Input: a list of pre-selected URLs
• Output: a list of discovered URLs

How it works

• The user defines a list of Web sites (URLs)
• Each URL is a ROOT
• The user defines the “depth”: the number of "hops" a discovered link is away from the ROOT
  – Links very "far away" from the ROOT are unlikely to hold much information
• Start to crawl the Web!

Example: depth = 3

ROOT (URL)
  depth = 1: URL_1_1, URL_1_2, …, URL_1_n
  depth = 2: URL_2_2_1, …, URL_2_2_m
  depth = 3: URL_3_2_1_1, …, URL_3_2_1_p, …
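
The depth rule can be illustrated with a small breadth-first traversal. This is only a sketch of the idea, not how Apache Nutch is implemented: the `crawl` function, the `get_links` callback and the in-memory `LINKS` graph are hypothetical names that stand in for real page fetching.

```python
from collections import deque

def crawl(roots, get_links, max_depth):
    """Breadth-first discovery of URLs at most max_depth hops from a root."""
    depth_of = {url: 0 for url in roots}
    queue = deque(roots)
    while queue:
        url = queue.popleft()
        if depth_of[url] >= max_depth:
            continue  # links found here would be too far from the ROOT
        for link in get_links(url):
            if link not in depth_of:  # keep the first (shortest) depth seen
                depth_of[link] = depth_of[url] + 1
                queue.append(link)
    return depth_of

# Hypothetical link graph replacing real HTTP requests.
LINKS = {
    "ROOT": ["URL_1_1", "URL_1_2"],
    "URL_1_2": ["URL_2_2_1"],
    "URL_2_2_1": ["URL_3_2_1_1"],
    "URL_3_2_1_1": ["URL_4_too_far"],
}

discovered = crawl(["ROOT"], lambda u: LINKS.get(u, []), max_depth=3)
for url, depth in sorted(discovered.items(), key=lambda kv: kv[1]):
    print(depth, url)
```

With depth = 3, the link found on `URL_3_2_1_1` is never followed, because it would be four hops away from the ROOT.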

The application

• https://github.com/agrisfao/agrotagger/tree/master/crawler/application

• Command line application
• Provided with bash scripts to run in Linux environments
• Example of usage:
  – depth = 5
  – output directory = work/output
  – directory with source URLs = work/urls

crawler_exec.sh 5 work/output work/urls

The output

URL:: http:/
URL:: http://%20www.umabroad.umn.edu/students/healthsafety/emergency.php
URL:: http://10-29-2013-tfic-luncheon.eventbrite.com/
URL:: http://1z8jbr3nz90837simd2d2fwoktj.wpengine.netdna-cdn.com/wp-content/uploads/2014/05/Nina-Hale-Inc-FactSheet.pdf
URL:: http://2014.northernspark.org/
URL:: http://2014.northernspark.org/project/chimera
  outlink: toUrl: http://media2.northernspark.org/wp-includes/wlwmanifest.xml anchor:
  outlink: toUrl: http://2014.northernspark.org/partners/arts-culture-and-the-creative-economy-program-of-the-city-of-minneapolis anchor:
  outlink: toUrl: http://2014.northernspark.org/project/bell-museum-staff anchor:
URL:: http://aaea.execinc.com/edibo/JobMarketCandidates
  outlink: toUrl: http://www.aaea.org/ anchor: AAEA
  outlink: toUrl: http://aaea.execinc.com/edibo/LoginHelp anchor: Create an Account / Need Help Logging In
  outlink: toUrl: http://www.aaea.org/about-aaea/aaea-sections anchor: AAEA Sections
  outlink: toUrl: http://www.aaea.org/about-aaea/aaea-committees anchor: AAEA Committees
  outlink: toUrl: http://www.aaea.org/about-aaea/awards-and-honors anchor: Awards and Honors
...
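
A dump like the one above could be read back with a few lines of Python. This is a sketch under one assumption: that in the real file each `URL::` record and each `outlink:` entry sits on its own line (the slide does not show the line breaks). The function name `parse_crawler_output` is hypothetical and not part of the crawler application.

```python
import re

# One outlink entry: target URL, then an optional anchor text.
OUTLINK = re.compile(r"^\s*outlink: toUrl: (\S+) anchor:\s*(.*)$")

def parse_crawler_output(lines):
    """Map each crawled URL to its list of (target URL, anchor text) pairs."""
    pages = {}
    current = None
    for line in lines:
        if line.startswith("URL:: "):
            current = line[len("URL:: "):].strip()
            pages[current] = []
        else:
            match = OUTLINK.match(line)
            if match and current is not None:
                pages[current].append((match.group(1), match.group(2).strip()))
    return pages

SAMPLE = """\
URL:: http://2014.northernspark.org/
URL:: http://aaea.execinc.com/edibo/JobMarketCandidates
  outlink: toUrl: http://www.aaea.org/ anchor: AAEA
  outlink: toUrl: http://aaea.execinc.com/edibo/LoginHelp anchor: Create an Account / Need Help Logging In
""".splitlines()

pages = parse_crawler_output(SAMPLE)
for url, outlinks in pages.items():
    print(url, len(outlinks))
```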

THE AGROTAGGER

AGROVOC

• FAO multilingual vocabulary
• Over 32 000 concepts in up to 21 languages
• Part of the LOD cloud
• Extensively used by cataloguers for indexing data in agricultural information systems
• http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc

The AgroTagger

• At a high level of abstraction, AgroTagger is a keyword extractor that uses the AGROVOC thesaurus to extract keywords from some URLs

• Or better… to extract URIs
• It is based on MAUI

• Maui is named after the Polynesian mythological hero and demi-god, who would transform himself into different kinds of birds to perform many of his exploits

• Maui automatically identifies main topics in text documents

• It uses different kinds of algorithms (Kea and Weka, named after New Zealand native birds)

• https://code.google.com/p/maui-indexer

How it works

• Input:
  – A text file with a list of URLs
  – The output file of an Apache Nutch crawler

• Output:
  – A set of triples
    <URL> dcterms:subject <AGROVOC_URI>
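
As an illustration of the output shape, the sketch below writes such triples as N-Triples lines. The helper name `to_ntriples` is hypothetical, `dcterms:subject` is expanded to its full Dublin Core URI, and the AGROVOC concept identifier is only an illustrative placeholder, not real AgroTagger output.

```python
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"

def to_ntriples(url, agrovoc_uris):
    """Return one N-Triples line per AGROVOC URI assigned to the URL."""
    return [f"<{url}> <{DCTERMS_SUBJECT}> <{uri}> ." for uri in agrovoc_uris]

# Illustrative values only: the page URL and the concept id are placeholders.
lines = to_ntriples(
    "http://example.org/page",
    ["http://aims.fao.org/aos/agrovoc/c_12332"],
)
print("\n".join(lines))
```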

The algorithm

• For each URL in the input file
  – Download the resource
  – Run the MAUI indexer trained with AGROVOC
  – Create a set of triples

• Multi-threaded
• Currently, MAUI is trained only for English
  – It can be trained in other languages that use Latin characters
  – Other solutions are needed for Chinese, Arabic, Russian, etc.
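
The per-URL loop and its multi-threaded execution can be sketched as follows. This is not the AgroTagger code (which is written in Java and uses MAUI): `download` and `tag` are hypothetical stubs, `FAKE_WEB` replaces real HTTP requests, and `FAKE_VOCAB` replaces the trained indexer with a naive word lookup.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the Web and for the AGROVOC-trained indexer.
FAKE_WEB = {
    "http://example.org/a": "maize yield under drought",
    "http://example.org/b": "rice irrigation and maize rotation",
}
FAKE_VOCAB = {
    "maize": "http://example.org/vocab/maize",
    "rice": "http://example.org/vocab/rice",
}

def download(url):
    """Stub: return the text of a resource (a real version would fetch it)."""
    return FAKE_WEB[url]

def tag(text):
    """Stub: return the URIs of vocabulary labels that occur in the text."""
    words = set(text.lower().split())
    return sorted(uri for label, uri in FAKE_VOCAB.items() if label in words)

def process(url):
    """Download one resource, tag it, and build (URL, URI) pairs."""
    return [(url, uri) for uri in tag(download(url))]

def run(urls, workers=4):
    """Process all URLs on a thread pool and flatten the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(process, urls)  # results keep the input order
        return [pair for pairs in results for pair in pairs]

triples = run(sorted(FAKE_WEB))
for url, uri in triples:
    print(url, uri)
```

Threads suit this kind of workload because most of the time per URL is spent waiting for the download.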

The application

• https://github.com/agrisfao/agrotagger
• Command line application
• Entirely based on Java
• Provided with bash scripts
• Example of usage:
  – directory with source files = work/source
  – output directory = work/output
  – type of source files = nutchOutput
  – output format = rdfnt

taggerDir.sh /work/source /work/output nutchOutput rdfnt

The output

[Diagram: Input → AgroTagger → Output]

THE AGRIS USE CASE

• http://agris.fao.org
• A collection of more than 7.8 million bibliographic references in agriculture
• AGRIS records come with AGROVOC descriptors
• An RDF-aware system
  – the AGRIS database is publicly exposed as RDF
  – AGROVOC is the backbone to interlink to external sources of information (statistics, distribution maps, country profiles, germplasm data…)

SemaGrow demonstrator

• The core idea is to harvest the Web
  – Input: pre-selected sources of information about agriculture
• Crawl and assign AGROVOC URIs
  – Store triples in the “crawler” database

• Definition of combinations between the “crawler” database and the AGRIS database

• New widget in AGRIS mashup pages!

Related resources available on the Web

• http://...
• https://...

Current status

• The Web Crawler gathers data from the Web
• The AgroTagger computes triples to assign AGROVOC URIs to discovered URLs
• A “crawler” triplestore is ready for computations

What’s next

• Processing phase
• Discover meaningful combinations between the AGRIS core database and “crawler” database
• A triplestore of combinations will be set up and used by AGRIS to generate a widget in the mashup page
• Evaluation of the quality of the widget
• What does “meaningful combinations” mean?

Naïve Algorithm

• Just for testing purposes
• Meaningful combinations = at least N common AGROVOC URIs
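
A minimal sketch of this rule, assuming both databases have already been loaded as dictionaries mapping an identifier to its set of AGROVOC URIs. The function name `combinations` and all record identifiers, URLs and concept ids below are hypothetical.

```python
def combinations(agris, crawled, n):
    """Return (record, url, shared) for pairs sharing at least n AGROVOC URIs."""
    pairs = []
    for record, record_uris in agris.items():
        for url, url_uris in crawled.items():
            shared = len(record_uris & url_uris)  # size of the set intersection
            if shared >= n:
                pairs.append((record, url, shared))
    return pairs

# Hypothetical data: short concept ids stand in for full AGROVOC URIs.
agris = {
    "rec1": {"c_1", "c_2", "c_3", "c_4"},
    "rec2": {"c_1", "c_9"},
}
crawled = {
    "http://example.org/a": {"c_1", "c_2", "c_3"},
    "http://example.org/b": {"c_2", "c_3", "c_4", "c_5"},
}

for n in (3, 4):
    print(n, combinations(agris, crawled, n))
```

This version compares every record with every URL, which is fine for a toy example. At the scale of the real databases some form of indexing by AGROVOC URI would probably be needed.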

Example

• http://ageconsearch.umn.edu/
• 101,000 distinct Web resources discovered by the Web Crawler (depth = 5)
• ~1 million triples generated by the AgroTagger (“crawler” database)

Number of AGRIS records | N: common AGROVOC URIs between AGRIS and the output of the Crawler | Number of associations
900 K | 3 | 17 MLN
900 K | 4 | 3.2 MLN
1 MLN | 5 | 0.6 MLN

Your feedback

• Comments, suggestions, other real case scenarios

• Ideas about the meaning of “meaningful combinations”

• If you test the application, any comments on how to improve it

• Can the demonstrator help to overcome data problems?

• You can send your feedback to agris@fao.org

谢谢

σας ευχαριστώ

Gracias
