SemaGrow demonstrator: “Web Crawler + AgroTagger”
Upload: aims-agricultural-information-management-standards-fao-of-the-un
Post on 12-Nov-2014
DESCRIPTION
The webinar will present the SemaGrow demonstrator “Web Crawler + AgroTagger”, in order to collect feedback, ideas and comments about the status of the development and about how the demonstrator helps to overcome data problems. SemaGrow is a project funded by the Seventh Framework Programme (FP7) of the European Commission, aiming at developing algorithms, infrastructures and methodologies to cope with large data volumes and real-time performance. In this context, FAO is providing a component that can be used to crawl the Web and give meaning to the discovered resources by using the AgroTagger, which assigns AGROVOC URIs to the resources gathered by a Web crawler. The demonstrator is publicly available at https://github.com/agrisfao/agrotagger.
TRANSCRIPT
Crawling the Web
Fabrizio Celli
Rome, 25th September 2014
2
Outline
• Purpose of this Webinar
• The Web Crawler
• The AgroTagger
• The AGRIS use case
– What’s next?
3
Purpose of this Webinar
• SemaGrow is a project funded by the Seventh Framework Programme (FP7) of the European Commission
• Algorithms, infrastructures and methodologies to cope with large data volumes and real time performance
• http://www.semagrow.eu
• One of the SemaGrow demonstrators is the “Web Crawler + AgroTagger” component, the subject of this Webinar
4
The demonstrator
• It is based on two command line applications (no user interface):
– Web Crawler
– AgroTagger
• Goal:
– discover resources on the Web
– tag resources with AGROVOC URIs
– filter only resources about agriculture and interlink to AGRIS
5
What we expect from the Webinar
• Comments, suggestions, opinions
• Other real-case scenarios for the demonstrator
• You can send your feedback to [email protected]
6
THE WEB-CRAWLER
7
Apache Nutch
• http://nutch.apache.org/
• Highly extensible and scalable open source Web crawler
• Configurable
• Input: a list of pre-selected URLs
• Output: a list of discovered URLs
8
How it works
• The user defines a list of Web sites (URLs)
• Each URL is a ROOT
• The user defines the “depth”: the number of "hops" a discovered link is away from the ROOT
– Links very "far away" from the ROOT are unlikely to hold much information
• Start to crawl the Web!
9
Example: depth = 3
ROOT (URL)
depth = 1: URL_1_1, URL_1_2, …, URL_1_n
depth = 2: URL_2_2_1, …, URL_2_2_m (links discovered from URL_1_2)
depth = 3: URL_3_2_1_1, …, URL_3_2_1_p (links discovered from URL_2_2_1)
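The role of the depth parameter can be illustrated with a minimal sketch of a breadth-first, depth-limited crawl. This is not Apache Nutch's implementation: the function `crawl`, the `get_links` callback and the toy link graph `LINKS` are hypothetical names introduced only for illustration, and no network access is performed.

```python
from collections import deque

def crawl(roots, get_links, max_depth):
    """Breadth-first crawl: map every URL within max_depth hops of a ROOT to its depth."""
    depth_of = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        url = queue.popleft()
        if depth_of[url] == max_depth:
            continue  # links found on this page would exceed the depth limit
        for link in get_links(url):
            if link not in depth_of:  # keep the shortest distance from a ROOT
                depth_of[link] = depth_of[url] + 1
                queue.append(link)
    return depth_of

# Hypothetical link graph standing in for real Web pages.
LINKS = {
    "ROOT": ["URL_1_1", "URL_1_2"],
    "URL_1_2": ["URL_2_2_1"],
    "URL_2_2_1": ["URL_3_2_1_1"],
    "URL_3_2_1_1": ["URL_4_too_far"],
}

found = crawl(["ROOT"], lambda url: LINKS.get(url, []), max_depth=3)
for url, depth in found.items():
    print(depth, url)
```

With depth = 3, the page four hops away from the ROOT is never added to the result, which matches the idea that distant links are unlikely to be relevant.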
10
The application
• https://github.com/agrisfao/agrotagger/tree/master/crawler/application
• Command line application
• Provided with bash scripts to run in Linux environments
• Example of usage:
– depth = 5
– output directory = work/output
– directory with source URLs = work/urls
crawler_exec.sh 5 work/output work/urls
11
The output
URL:: http:/
URL:: http://%20www.umabroad.umn.edu/students/healthsafety/emergency.php
URL:: http://10-29-2013-tfic-luncheon.eventbrite.com/
URL:: http://1z8jbr3nz90837simd2d2fwoktj.wpengine.netdna-cdn.com/wp-content/uploads/2014/05/Nina-Hale-Inc-FactSheet.pdf
URL:: http://2014.northernspark.org/
URL:: http://2014.northernspark.org/project/chimera
 outlink: toUrl: http://media2.northernspark.org/wp-includes/wlwmanifest.xml anchor:
 outlink: toUrl: http://2014.northernspark.org/partners/arts-culture-and-the-creative-economy-program-of-the-city-of-minneapolis anchor:
 outlink: toUrl: http://2014.northernspark.org/project/bell-museum-staff anchor:
URL:: http://aaea.execinc.com/edibo/JobMarketCandidates
 outlink: toUrl: http://www.aaea.org/ anchor: AAEA
 outlink: toUrl: http://aaea.execinc.com/edibo/LoginHelp anchor: Create an Account / Need Help Logging In
 outlink: toUrl: http://www.aaea.org/about-aaea/aaea-sections anchor: AAEA Sections
 outlink: toUrl: http://www.aaea.org/about-aaea/aaea-committees anchor: AAEA Committees
 outlink: toUrl: http://www.aaea.org/about-aaea/awards-and-honors anchor: Awards and Honors
...
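A minimal sketch of how such an output could be read back into a structure, assuming one `URL::` or `outlink:` record per line (the exact layout of the real file is an assumption based on the sample above). The function name `parse_dump` and the `SAMPLE` text are hypothetical.

```python
import re

# One outlink record: target URL, then the (possibly empty) anchor text.
OUTLINK = re.compile(r"^outlink: toUrl: (\S+) anchor:\s*(.*)$")

def parse_dump(text):
    """Map each crawled URL to the list of (target URL, anchor text) pairs found on it."""
    pages = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("URL:: "):
            current = line[len("URL:: "):].strip()
            pages.setdefault(current, [])
        else:
            match = OUTLINK.match(line)
            if match and current is not None:
                pages[current].append((match.group(1), match.group(2).strip()))
    return pages

SAMPLE = """\
URL:: http://2014.northernspark.org/
URL:: http://aaea.execinc.com/edibo/JobMarketCandidates
 outlink: toUrl: http://www.aaea.org/ anchor: AAEA
 outlink: toUrl: http://aaea.execinc.com/edibo/LoginHelp anchor: Create an Account / Need Help Logging In
"""

pages = parse_dump(SAMPLE)
for url, outlinks in pages.items():
    print(url, len(outlinks))
```

Pages without outlinks are kept with an empty list, so every discovered URL can still be passed on for tagging.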
12
THE AGROTAGGER
13
AGROVOC
• FAO multilingual vocabulary
• Over 32 000 concepts in up to 21 languages
• Part of the LOD cloud
• Extensively used by cataloguers for indexing data in agricultural information systems
• http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc
14
The AgroTagger
• At a high level of abstraction, AgroTagger is a keyword extractor that uses the AGROVOC thesaurus to extract keywords from some URLs
• Or better… to extract URIs
• It is based on MAUI
15
MAUI
• Maui is named after the Polynesian mythological hero and demi-god, who would transform himself into different kinds of birds to perform many of his exploits
• Maui automatically identifies main topics in text documents
• It uses different kinds of algorithms (Kea and Weka, named after New Zealand native birds)
• https://code.google.com/p/maui-indexer
16
How it works
• Input:
– A text file with a list of URLs
– The output file of an Apache Nutch crawler
• Output:
– A set of triples
<URL> dcterms:subject <AGROVOC_URI>
17
The algorithm
• For each URL in the input file
– Download the resource
– Run the MAUI indexer trained with AGROVOC
– Create a set of triples
• Multi-threaded
• Currently, MAUI is trained only for English
– It can be trained in other languages that use Latin characters
– Other solutions are needed for Chinese, Arabic, Russian, etc.
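The loop above can be sketched as follows. This is not the AgroTagger code: the real application is written in Java and uses MAUI, while here `fetch` and `tag` are hypothetical stand-ins (canned text and a simple word lookup), and the concept IDs in `VOCAB` are placeholders rather than real AGROVOC identifiers. Only the triple shape, `<URL> dcterms:subject <AGROVOC_URI>` in N-Triples form, follows the slides.

```python
from concurrent.futures import ThreadPoolExecutor

DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"

# Placeholder vocabulary: label -> AGROVOC-style URI (IDs are made up).
VOCAB = {
    "maize": "http://aims.fao.org/aos/agrovoc/c_0001",
    "soil": "http://aims.fao.org/aos/agrovoc/c_0002",
}

# Canned page texts standing in for downloaded resources.
PAGES = {
    "http://example.org/a": "Maize yields and soil quality",
    "http://example.org/b": "Unrelated text",
}

def fetch(url):
    """Hypothetical downloader: returns canned text instead of using the network."""
    return PAGES.get(url, "")

def tag(text):
    """Stand-in for the MAUI indexer: return URIs whose label occurs in the text."""
    words = set(text.lower().split())
    return sorted(uri for label, uri in VOCAB.items() if label in words)

def triples_for(url):
    """Download, tag, and emit one N-Triples line per assigned URI."""
    return [f"<{url}> <{DCTERMS_SUBJECT}> <{uri}> ." for uri in tag(fetch(url))]

def run(urls, workers=4):
    """Process URLs in a thread pool; results keep the order of the input list."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_url = list(pool.map(triples_for, urls))
    return [triple for triples in per_url for triple in triples]

output = run(["http://example.org/a", "http://example.org/b"])
print("\n".join(output))
```

Threads fit this workload because most of the time per URL is spent waiting for the download, not computing.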
18
The application
• https://github.com/agrisfao/agrotagger
• Command line application
• Entirely based on Java
• Provided with bash scripts
• Example of usage:
– directory with source files = work/source
– output directory = work/output
– type of source files = nutchOutput
– output format = rdfnt
taggerDir.sh /work/source /work/output nutchOutput rdfnt
20
THE AGRIS USE CASE
21
AGRIS
• http://agris.fao.org• A collection of more than 7.8 million
bibliographic references in agriculture• AGRIS records come with AGROVOC descriptors• An RDF-aware system– the AGRIS database is publicly exposed as RDF– AGROVOC is the backbone to interlink to external
sources of information (statistics, distribution maps, country profiles, germplasm data…)
22
23
SemaGrow demonstrator
• The core idea is to harvest the Web
– Input: pre-selected sources of information about agriculture
• Crawl and assign AGROVOC URIs
– Store triples in the “crawler” database
• Definition of combinations between the “crawler” database and the AGRIS database
• New widget in AGRIS mashup pages!
24
Related resources available on the Web
• http://...
• https://...
25
Current status
• The Web Crawler gathers data from the Web
• The AgroTagger computes triples to assign AGROVOC URIs to discovered URLs
• A “crawler” triplestore is ready for computations
26
What’s next
• Processing phase
• Discover meaningful combinations between the AGRIS core database and the “crawler” database
• A triplestore of combinations will be set up and used by AGRIS to generate a widget in the mashup page
• Evaluation of the quality of the widget
• What does “meaningful combinations” mean?
27
Naïve Algorithm
• Just for testing purposes
• Meaningful combinations = at least N common AGROVOC URIs
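A minimal sketch of this naïve rule, assuming both databases can be viewed as mappings from an identifier to its set of AGROVOC URIs. The function name `combinations`, the record and URL identifiers, and the short concept labels are hypothetical; the actual processing runs against triplestores and is not described in the slides.

```python
from collections import Counter, defaultdict

def combinations(agris, crawled, n):
    """Return (AGRIS record, crawled URL, shared count) for pairs sharing at least n URIs."""
    # Inverted index: concept URI -> crawled URLs tagged with it.
    by_concept = defaultdict(set)
    for url, concepts in crawled.items():
        for concept in concepts:
            by_concept[concept].add(url)

    pairs = []
    for record, concepts in agris.items():
        shared = Counter()
        for concept in set(concepts):
            for url in by_concept.get(concept, ()):
                shared[url] += 1
        pairs.extend((record, url, k) for url, k in sorted(shared.items()) if k >= n)
    return pairs

# Toy data with abbreviated concept names instead of full AGROVOC URIs.
agris = {"rec1": {"c1", "c2", "c3"}, "rec2": {"c1"}}
crawled = {"u1": {"c1", "c2", "c3", "c4"}, "u2": {"c1", "c2"}}

print(combinations(agris, crawled, 3))
print(combinations(agris, crawled, 2))
```

The inverted index avoids comparing every record with every crawled URL: only pairs that share at least one concept are ever counted. Raising N makes the rule stricter, so the number of associations can only shrink.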
28
Example
• http://ageconsearch.umn.edu/
• 101,000 distinct Web resources discovered by the Web Crawler (depth = 5)
• ~1 million triples generated by the AgroTagger (“crawler” database)

Number of AGRIS records    N: common AGROVOC URIs between AGRIS and the output of the Crawler    Number of associations
900 K                      3                                                                      17 MLN
900 K                      4                                                                      3.2 MLN
1 MLN                      5                                                                      0.6 MLN
29
Your feedback
• Comments, suggestions, other real case scenarios
• Ideas about the meaning of “meaningful combinations”
• If you test the application, any comments on how to improve it
• Can the demonstrator help to overcome data problems?
• You can send your feedback to [email protected]
30
Thank you (谢谢, σας ευχαριστώ, Gracias)