SemaGrow demonstrator: “Web Crawler + AgroTagger”

Post on 12-Nov-2014


DESCRIPTION

The webinar will present the SemaGrow demonstrator “Web Crawler + AgroTagger”, in order to collect feedback, ideas and comments about the status of the development and how the demonstrator helps to overcome data problems. SemaGrow is a project funded by the Seventh Framework Programme (FP7) of the European Commission, aiming at developing algorithms, infrastructures and methodologies to cope with large data volumes and real-time performance. In this context, FAO is providing a component that can be used to crawl the Web and give meaning to the discovered resources by using the AgroTagger, which assigns AGROVOC URIs to resources gathered by a Web crawler. The demonstrator is publicly available at https://github.com/agrisfao/agrotagger.

TRANSCRIPT

Crawling the Web

Fabrizio Celli

Rome, 25th September 2014

Outline

• Purpose of this Webinar
• The Web Crawler
• The AgroTagger
• The AGRIS use case
  – What’s next?

Purpose of this Webinar

• SemaGrow is a project funded by the Seventh Framework Programme (FP7) of the European Commission
• Algorithms, infrastructures and methodologies to cope with large data volumes and real-time performance
• http://www.semagrow.eu
• One of the SemaGrow demonstrators is the component “Web Crawler + AgroTagger”, the subject of this Webinar

The demonstrator

• It is based on two command line applications (no user interface):
  – Web Crawler
  – AgroTagger
• Goal:
  – discover resources on the Web
  – tag resources with AGROVOC URIs
  – filter only resources about agriculture and interlink to AGRIS

What we expect from the Webinar

• Comments, suggestions, opinions
• Other real case scenarios for the demonstrator
• You can send your feedback to agris@fao.org


THE WEB-CRAWLER

Apache Nutch

• http://nutch.apache.org/
• Highly extensible and scalable open source Web crawler
• Configurable
• Input: a list of pre-selected URLs
• Output: a list of discovered URLs

How it works

• The user defines a list of Web sites (URLs)
• Each URL is a ROOT
• The user defines the “depth”: the number of "hops" a discovered link is away from the ROOT
  – Links very "far away" from the ROOT are unlikely to hold much information
• Start to crawl the Web!
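The notion of "depth" can be illustrated with a minimal breadth-first sketch. This is not how Apache Nutch is implemented; it only shows the idea of stopping a fixed number of hops away from each ROOT. The function `crawl`, the `get_links` callback and the toy link graph are hypothetical names introduced for the illustration.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl(roots: Iterable[str],
          max_depth: int,
          get_links: Callable[[str], List[str]]) -> Dict[str, int]:
    """Breadth-first crawl: map each discovered URL to its hop count from a ROOT."""
    depth_of: Dict[str, int] = {}
    queue = deque()
    for root in roots:
        if root not in depth_of:
            depth_of[root] = 0
            queue.append(root)
    while queue:
        url = queue.popleft()
        depth = depth_of[url]
        if depth >= max_depth:
            continue  # pages at the maximum depth are kept, but their links are not followed
        for link in get_links(url):
            if link not in depth_of:  # visit each URL only once
                depth_of[link] = depth + 1
                queue.append(link)
    return depth_of


# Toy link graph standing in for the real Web (hypothetical page names).
TOY_WEB = {
    "root": ["a", "b"],
    "a": ["c"],
    "c": ["d"],
    "d": ["e"],
}


def toy_links(url: str) -> List[str]:
    return TOY_WEB.get(url, [])


found = crawl(["root"], 3, toy_links)
print(found)
```

With a depth of 3, page "d" (three hops from the ROOT) is still discovered, while "e" (four hops away) is not.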

Example: depth = 3

ROOT (URL)
depth = 1: URL_1_1, URL_1_2, …, URL_1_n
depth = 2: URL_2_2_1, …, URL_2_2_m
depth = 3: URL_3_2_1_1, …, URL_3_2_1_p

The application

• https://github.com/agrisfao/agrotagger/tree/master/crawler/application
• Command line application
• Provided with bash scripts to run in Linux environments
• Example of usage:
  – depth = 5
  – output directory = work/output
  – directory with source URLs = work/urls

crawler_exec.sh 5 work/output work/urls

The output

URL:: http:/
URL:: http://%20www.umabroad.umn.edu/students/healthsafety/emergency.php
URL:: http://10-29-2013-tfic-luncheon.eventbrite.com/
URL:: http://1z8jbr3nz90837simd2d2fwoktj.wpengine.netdna-cdn.com/wp-content/uploads/2014/05/Nina-Hale-Inc-FactSheet.pdf
URL:: http://2014.northernspark.org/
URL:: http://2014.northernspark.org/project/chimera
  outlink: toUrl: http://media2.northernspark.org/wp-includes/wlwmanifest.xml anchor:
  outlink: toUrl: http://2014.northernspark.org/partners/arts-culture-and-the-creative-economy-program-of-the-city-of-minneapolis anchor:
  outlink: toUrl: http://2014.northernspark.org/project/bell-museum-staff anchor:
URL:: http://aaea.execinc.com/edibo/JobMarketCandidates
  outlink: toUrl: http://www.aaea.org/ anchor: AAEA
  outlink: toUrl: http://aaea.execinc.com/edibo/LoginHelp anchor: Create an Account / Need Help Logging In
  outlink: toUrl: http://www.aaea.org/about-aaea/aaea-sections anchor: AAEA Sections
  outlink: toUrl: http://www.aaea.org/about-aaea/aaea-committees anchor: AAEA Committees
  outlink: toUrl: http://www.aaea.org/about-aaea/awards-and-honors anchor: Awards and Honors
...
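A dump like the one above could be read back with a small parser. The sketch below assumes one record per line, with "URL:: " introducing a page and "outlink: toUrl: … anchor: …" listing a link found on it. This layout is inferred from the sample only, not from a format specification, and `parse_crawler_output` is a hypothetical helper name.

```python
import re
from typing import Dict, List, Tuple

# Layout inferred from the sample output: "outlink: toUrl: <url> anchor: <text>"
OUTLINK_RE = re.compile(r"^outlink: toUrl: (\S+) anchor: ?(.*)$")


def parse_crawler_output(text: str) -> Dict[str, List[Tuple[str, str]]]:
    """Map each crawled URL to the (target URL, anchor text) pairs listed under it."""
    pages: Dict[str, List[Tuple[str, str]]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("URL:: "):
            current = line[len("URL:: "):]
            pages.setdefault(current, [])
        elif current is not None:
            match = OUTLINK_RE.match(line)
            if match:
                pages[current].append((match.group(1), match.group(2).strip()))
    return pages


SAMPLE = """\
URL:: http://2014.northernspark.org/
URL:: http://aaea.execinc.com/edibo/JobMarketCandidates
  outlink: toUrl: http://www.aaea.org/ anchor: AAEA
  outlink: toUrl: http://aaea.execinc.com/edibo/LoginHelp anchor: Create an Account / Need Help Logging In
"""

pages = parse_crawler_output(SAMPLE)
for url, outlinks in pages.items():
    print(url, len(outlinks))
```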


THE AGROTAGGER

AGROVOC

• FAO multilingual vocabulary
• Over 32 000 concepts in up to 21 languages
• Part of the LOD cloud
• Extensively used by cataloguers for indexing data in agricultural information systems
• http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc

The AgroTagger

• At a high level of abstraction, AgroTagger is a keyword extractor that uses the AGROVOC thesaurus to extract keywords from the resources found at a set of URLs
• Or better… to extract URIs
• It is based on MAUI

MAUI

• Maui is named after the Polynesian mythological hero and demi-god, who would transform himself into different kinds of birds to perform many of his exploits
• Maui automatically identifies main topics in text documents
• It builds on Kea (keyphrase extraction) and Weka (machine learning), both named after New Zealand native birds

• https://code.google.com/p/maui-indexer

How it works

• Input:
  – A text file with a list of URLs
  – The output file of an Apache Nutch crawler
• Output:
  – A set of triples: <URL> dcterms:subject <AGROVOC_URI>
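As an illustration of the triple pattern above, the sketch below serialises it as N-Triples lines. It assumes that dcterms:subject expands to http://purl.org/dc/terms/subject and that AGROVOC concept URIs live under http://aims.fao.org/aos/agrovoc/. The concept identifiers, the example URL and the helper `to_ntriples` are placeholders, not real tagger output, and no escaping of unusual characters in URLs is attempted.

```python
from typing import Iterable, List

DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"


def to_ntriples(url: str, agrovoc_uris: Iterable[str]) -> List[str]:
    """Build one '<URL> dcterms:subject <AGROVOC_URI> .' line per assigned concept."""
    return [f"<{url}> <{DCTERMS_SUBJECT}> <{uri}> ." for uri in agrovoc_uris]


# Placeholder concept identifiers, for illustration only.
lines = to_ntriples(
    "http://example.org/report.html",
    [
        "http://aims.fao.org/aos/agrovoc/c_0000001",
        "http://aims.fao.org/aos/agrovoc/c_0000002",
    ],
)
print("\n".join(lines))
```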

The algorithm

• For each URL in the input file
  – Download the resource
  – Run the MAUI indexer trained with AGROVOC
  – Create a set of triples
• Multi-threaded
• Currently, MAUI is trained only for English
  – It can be trained in other languages that use Latin characters
  – Other solutions are needed for Chinese, Arabic, Russian, etc.
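The loop described above (download, index, emit triples, several URLs at a time) might be sketched as follows. The real AgroTagger is a Java application built on MAUI; this Python sketch only mirrors the control flow. The `download` and `extract_uris` callbacks, the in-memory pages and the tiny vocabulary are hypothetical stand-ins for the network fetch and the trained MAUI indexer.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple

Triple = Tuple[str, str, str]
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"


def tag_urls(urls: Iterable[str],
             download: Callable[[str], str],
             extract_uris: Callable[[str], List[str]],
             workers: int = 4) -> List[Triple]:
    """Process URLs in parallel and return (URL, dcterms:subject, concept URI) triples."""
    def process(url: str) -> List[Triple]:
        try:
            text = download(url)
        except Exception:
            return []  # skip resources that cannot be fetched
        return [(url, DCTERMS_SUBJECT, uri) for uri in extract_uris(text)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the input order, so the output is deterministic
        return [t for triples in pool.map(process, list(urls)) for t in triples]


# Stand-ins for the network and for the indexer (illustrative data only).
PAGES = {
    "http://example.org/a": "maize and rice yields",
    "http://example.org/b": "rice prices",
}
VOCAB = {
    "maize": "http://aims.fao.org/aos/agrovoc/c_0000001",
    "rice": "http://aims.fao.org/aos/agrovoc/c_0000002",
}


def fake_download(url: str) -> str:
    return PAGES[url]  # raises KeyError for unknown URLs


def fake_extract(text: str) -> List[str]:
    words = text.split()
    return [uri for term, uri in VOCAB.items() if term in words]


triples = tag_urls(
    ["http://example.org/a", "http://example.org/b", "http://example.org/missing"],
    fake_download,
    fake_extract,
)
print(len(triples))
```

Passing the downloader and the indexer in as functions keeps the threading logic separate from the tagging logic, so either part can be replaced without touching the other.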

The application

• https://github.com/agrisfao/agrotagger
• Command line application
• Entirely based on Java
• Provided with bash scripts
• Example of usage:
  – directory with source files = work/source
  – output directory = work/output
  – type of source files = nutchOutput
  – output format = rdfnt

taggerDir.sh /work/source /work/output nutchOutput rdfnt

The output

[Figure: Input → AgroTagger → Output]


THE AGRIS USE CASE

AGRIS

• http://agris.fao.org
• A collection of more than 7.8 million bibliographic references in agriculture
• AGRIS records come with AGROVOC descriptors
• An RDF-aware system
  – the AGRIS database is publicly exposed as RDF
  – AGROVOC is the backbone to interlink to external sources of information (statistics, distribution maps, country profiles, germplasm data…)

SemaGrow demonstrator

• The core idea is to harvest the Web
  – Input: pre-selected sources of information about agriculture
• Crawl and assign AGROVOC URIs
  – Store triples in the “crawler” database

• Definition of combinations between the “crawler” database and the AGRIS database

• New widget in AGRIS mashup pages!

Related resources available on the Web

• http://...
• https://...

Current status

• The Web Crawler gathers data from the Web
• The AgroTagger computes triples to assign AGROVOC URIs to discovered URLs
• A “crawler” triplestore is ready for computations

What’s next

• Processing phase
• Discover meaningful combinations between the AGRIS core database and the “crawler” database
• A triplestore of combinations will be set up and used by AGRIS to generate a widget in the mashup page
• Evaluation of the quality of the widget
• What does “meaningful combinations” mean?

Naïve Algorithm

• Just for testing purposes
• Meaningful combinations = at least N common AGROVOC URIs
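The "at least N common AGROVOC URIs" rule can be sketched with an inverted index from concept URI to crawled URL, so that only pairs sharing at least one concept are ever compared. The function `associations`, the record identifiers and the single-letter concept names are hypothetical. The actual demonstrator works on triplestores and may compute this differently.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def associations(agris: Dict[str, Set[str]],
                 crawled: Dict[str, Set[str]],
                 n: int) -> List[Tuple[str, str, int]]:
    """Return (AGRIS record, crawled URL, shared count) for pairs sharing >= n concepts."""
    # Inverted index: concept URI -> crawled URLs tagged with it
    by_uri: Dict[str, Set[str]] = defaultdict(set)
    for url, uris in crawled.items():
        for uri in uris:
            by_uri[uri].add(url)

    result = []
    for record, uris in agris.items():
        shared: Dict[str, int] = defaultdict(int)
        for uri in uris:
            for url in by_uri.get(uri, ()):
                shared[url] += 1  # one more concept in common with this URL
        for url, count in shared.items():
            if count >= n:
                result.append((record, url, count))
    return sorted(result)


# Toy data: letters stand in for AGROVOC URIs.
agris_records = {"rec1": {"A", "B", "C"}, "rec2": {"A", "D"}}
crawled_urls = {"url1": {"A", "B", "C", "E"}, "url2": {"A", "B"}}

pairs_n2 = associations(agris_records, crawled_urls, 2)
pairs_n3 = associations(agris_records, crawled_urls, 3)
print(pairs_n2)
print(pairs_n3)
```

Raising N shrinks the result: in the toy data, two pairs qualify for N = 2 but only one for N = 3.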

Example

• http://ageconsearch.umn.edu/
• 101,000 distinct Web resources discovered by the Web Crawler (depth = 5)
• ~1 million triples generated by the AgroTagger (“crawler” database)

Number of AGRIS records | N: common AGROVOC URIs between AGRIS and the output of the Crawler | Number of associations
900 K | 3 | 17 MLN
900 K | 4 | 3.2 MLN
1 MLN | 5 | 0.6 MLN

Your feedback

• Comments, suggestions, other real case scenarios
• Ideas about the meaning of “meaningful combinations”
• If you test the application, any comments to improve it
• Can the demonstrator help to overcome data problems?
• You can send your feedback to agris@fao.org


谢谢

σας ευχαριστώ

Gracias
