SemaGrow demonstrator: “Web Crawler + AgroTagger”
Crawling the Web
Fabrizio Celli
Rome, 25th September 2014
Outline
• Purpose of this Webinar
• The Web Crawler
• The AgroTagger
• The AGRIS use case
– What’s next?
Purpose of this Webinar
• SemaGrow is a project funded by the Seventh Framework Programme (FP7) of the European Commission
• Algorithms, infrastructures and methodologies to cope with large data volumes and real-time performance
• http://www.semagrow.eu
• One of the SemaGrow demonstrators is the “Web Crawler + AgroTagger” component, the subject of this Webinar
The demonstrator
• It is based on two command line applications (no user interface):
– Web Crawler
– AgroTagger
• Goal:
– discover resources on the Web
– tag resources with AGROVOC URIs
– filter only resources about agriculture and interlink to AGRIS
What we expect from the Webinar
• Comments, suggestions, opinions
• Other real case scenarios for the demonstrator
• You can send your feedback at [email protected]
THE WEB-CRAWLER
Apache Nutch
• http://nutch.apache.org/
• Highly extensible and scalable open source Web crawler
• Configurable
• Input: a list of pre-selected URLs
• Output: a list of discovered URLs
How it works
• The user defines a list of Web sites (URLs)
• Each URL is a ROOT
• The user defines the “depth”: the number of "hops" a discovered link is away from the ROOT
– Links very "far away" from the ROOT are unlikely to hold much information
• Start to crawl the Web!
Example: depth = 3
ROOT (URL)
depth = 1: URL_1_1, URL_1_2, …, URL_1_n
depth = 2: URL_2_2_1, …, URL_2_2_m
depth = 3: URL_3_2_1_1, …, URL_3_2_1_p
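The notion of depth can be illustrated with a small breadth-first traversal. This is only a sketch of the idea, not how Apache Nutch is implemented: the `crawl` function, the `get_links` callback and the in-memory `LINKS` graph are all hypothetical names introduced here, and no network access takes place.

```python
from collections import deque

def crawl(roots, get_links, max_depth):
    """Breadth-first discovery of URLs at most max_depth hops from a root.

    get_links is a caller-supplied function (an assumption of this sketch)
    that returns the outgoing links of a URL.
    """
    depth_of = {}
    queue = deque()
    for root in roots:
        if root not in depth_of:
            depth_of[root] = 0
            queue.append(root)
    while queue:
        url = queue.popleft()
        if depth_of[url] >= max_depth:
            continue  # do not follow links beyond the configured depth
        for link in get_links(url):
            if link not in depth_of:
                depth_of[link] = depth_of[url] + 1
                queue.append(link)
    return depth_of

# Hypothetical link graph mirroring the naming used in the example slide.
LINKS = {
    "ROOT": ["URL_1_1", "URL_1_2"],
    "URL_1_2": ["URL_2_2_1"],
    "URL_2_2_1": ["URL_3_2_1_1"],
    "URL_3_2_1_1": ["URL_4_too_far"],
}

found = crawl(["ROOT"], lambda u: LINKS.get(u, []), max_depth=3)
for url, depth in found.items():
    print(depth, url)
```

With `max_depth=3`, the link four hops away from the ROOT is never recorded, which matches the intuition that distant links are unlikely to hold much information.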
The application
• https://github.com/agrisfao/agrotagger/tree/master/crawler/application
• Command line application
• Provided with bash scripts to run in Linux environments
• Example of usage:
– depth = 5
– output directory = work/output
– directory with source URLs = work/urls
crawler_exec.sh 5 work/output work/urls
The output
URL:: http:/
URL:: http://%20www.umabroad.umn.edu/students/healthsafety/emergency.php
URL:: http://10-29-2013-tfic-luncheon.eventbrite.com/
URL:: http://1z8jbr3nz90837simd2d2fwoktj.wpengine.netdna-cdn.com/wp-content/uploads/2014/05/Nina-Hale-Inc-FactSheet.pdf
URL:: http://2014.northernspark.org/
URL:: http://2014.northernspark.org/project/chimera
  outlink: toUrl: http://media2.northernspark.org/wp-includes/wlwmanifest.xml anchor:
  outlink: toUrl: http://2014.northernspark.org/partners/arts-culture-and-the-creative-economy-program-of-the-city-of-minneapolis anchor:
  outlink: toUrl: http://2014.northernspark.org/project/bell-museum-staff anchor:
URL:: http://aaea.execinc.com/edibo/JobMarketCandidates
  outlink: toUrl: http://www.aaea.org/ anchor: AAEA
  outlink: toUrl: http://aaea.execinc.com/edibo/LoginHelp anchor: Create an Account / Need Help Logging In
  outlink: toUrl: http://www.aaea.org/about-aaea/aaea-sections anchor: AAEA Sections
  outlink: toUrl: http://www.aaea.org/about-aaea/aaea-committees anchor: AAEA Committees
  outlink: toUrl: http://www.aaea.org/about-aaea/awards-and-honors anchor: Awards and Honors
...
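A dump in this shape can be read back into a simple structure for inspection. The sketch below assumes only the two markers visible in the sample (`URL::` and `outlink: toUrl: … anchor: …`). The `parse_dump` function is a hypothetical helper and is not part of the crawler application.

```python
import re

# The two markers seen in the sample output; anything else is treated as payload.
TOKEN = re.compile(r"(URL::|outlink: toUrl:)")

def parse_dump(text):
    """Map each page URL to a list of (target URL, anchor text) outlinks."""
    pages = {}
    current = None
    parts = TOKEN.split(text)
    # parts looks like [prefix, marker, body, marker, body, ...]
    for marker, body in zip(parts[1::2], parts[2::2]):
        body = body.strip()
        if marker == "URL::":
            current = body.split()[0] if body else ""
            pages.setdefault(current, [])
        elif current is not None:
            target, _, anchor = body.partition("anchor:")
            pages[current].append((target.strip(), anchor.strip()))
    return pages

sample = (
    "URL:: http://2014.northernspark.org/ "
    "URL:: http://aaea.execinc.com/edibo/JobMarketCandidates "
    "outlink: toUrl: http://www.aaea.org/ anchor: AAEA "
    "outlink: toUrl: http://aaea.execinc.com/edibo/LoginHelp "
    "anchor: Create an Account / Need Help Logging In"
)

pages = parse_dump(sample)
for page, outlinks in pages.items():
    print(page, len(outlinks))
```

Because the parser splits on the markers rather than on line breaks, it works whether or not the dump has one entry per line.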
THE AGROTAGGER
AGROVOC
• FAO multilingual vocabulary
• Over 32 000 concepts in up to 21 languages
• Part of the LOD cloud
• Extensively used by cataloguers for indexing data in agricultural information systems
• http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc
The AgroTagger
• At a high level of abstraction, AgroTagger is a keyword extractor that uses the AGROVOC thesaurus to extract keywords from the resources behind a list of URLs
• Or better… to extract AGROVOC URIs
• It is based on MAUI
MAUI
• Maui is named after the Polynesian mythological hero and demi-god, who would transform himself into different kinds of birds to perform many of his exploits
• Maui automatically identifies main topics in text documents
• It uses different kinds of algorithms (Kea and Weka, named after New Zealand native birds)
• https://code.google.com/p/maui-indexer
How it works
• Input:
– A text file with a list of URLs
– The output file of an Apache Nutch crawler
• Output:
– A set of triples
<URL> dcterms:subject <AGROVOC_URI>
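As an illustration of the triple pattern, the sketch below serializes URL/concept pairs as N-Triples lines, expanding `dcterms:subject` to its full IRI. The `to_ntriples` helper, the example page URL and the concept identifier `c_0000` are placeholders chosen for this example. The exact output of the AgroTagger may differ, and no IRI escaping is attempted here.

```python
# Full IRI behind the dcterms:subject prefixed name.
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"

def to_ntriples(url, agrovoc_uris):
    """Return one N-Triples line per AGROVOC URI assigned to the URL."""
    return [f"<{url}> <{DCTERMS_SUBJECT}> <{uri}> ." for uri in agrovoc_uris]

# Placeholder values: the page and the concept id are illustrative only.
lines = to_ntriples(
    "http://example.org/paper.html",
    ["http://aims.fao.org/aos/agrovoc/c_0000"],
)
print("\n".join(lines))
```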
The algorithm
• For each URL in the input file
– Download the resource
– Run the MAUI indexer trained with AGROVOC
– Create a set of triples
• Multi-threaded
• Currently, MAUI is trained only for English
– It can be trained in other languages that use Latin characters
– Other solutions are needed for Chinese, Arabic, Russian, etc.
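The per-URL loop and its multi-threaded execution can be sketched with a thread pool. Everything here is a stand-in: `fetch` returns canned text instead of downloading, `tag` is a plain keyword lookup and not the MAUI indexer, and `PAGES`, `VOCAB` and the `example.org` URIs are hypothetical. The real application is written in Java, so this Python sketch shows only the shape of the pipeline.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical canned pages and vocabulary, so the sketch needs no network.
PAGES = {
    "http://example.org/a": "Maize yields under irrigation",
    "http://example.org/b": "Conference venue and travel information",
}
VOCAB = {
    "maize": "http://example.org/vocab/maize",
    "irrigation": "http://example.org/vocab/irrigation",
}

def fetch(url):
    """Stand-in for 'download the resource'."""
    return PAGES.get(url, "")

def tag(text):
    """Stand-in for the trained indexer: naive substring matching."""
    lowered = text.lower()
    return sorted(uri for term, uri in VOCAB.items() if term in lowered)

def process(url):
    """One unit of work: fetch, tag, and build the triples for a URL."""
    return [(url, "dcterms:subject", uri) for uri in tag(fetch(url))]

def run(urls, workers=4):
    """Process URLs concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process, urls))
    return [triple for triples in results for triple in triples]

triples = run(list(PAGES))
for triple in triples:
    print(triple)
```

Threads fit this workload because most of the time per URL is spent waiting on the download rather than computing.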
The application
• https://github.com/agrisfao/agrotagger
• Command line application
• Entirely based on Java
• Provided with bash scripts
• Example of usage:
– directory with source files = work/source
– output directory = work/output
– type of source files = nutchOutput
– output format = rdfnt
taggerDir.sh /work/source /work/output nutchOutput rdfnt
THE AGRIS USE CASE
AGRIS
• http://agris.fao.org
• A collection of more than 7.8 million bibliographic references in agriculture
• AGRIS records come with AGROVOC descriptors
• An RDF-aware system
– the AGRIS database is publicly exposed as RDF
– AGROVOC is the backbone to interlink to external sources of information (statistics, distribution maps, country profiles, germplasm data…)
SemaGrow demonstrator
• The core idea is to harvest the Web
– Input: pre-selected sources of information about agriculture
• Crawl and assign AGROVOC URIs
– Store triples in the “crawler” database
• Definition of combinations between the “crawler” database and the AGRIS database
• New widget in AGRIS mashup pages!
Related resources available on the Web
• http://...
• https://...
Current status
• The Web Crawler gathers data from the Web
• The AgroTagger computes triples to assign AGROVOC URIs to discovered URLs
• A “crawler” triplestore is ready for computations
What’s next
• Processing phase
• Discover meaningful combinations between the AGRIS core database and “crawler” database
• A triplestore of combinations will be set up and used by AGRIS to generate a widget in the mashup page
• Evaluation of the quality of the widget
• What does “meaningful combinations” mean?
Naïve Algorithm
• Just for testing purposes
• Meaningful combinations = at least N common AGROVOC URIs
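The "at least N common AGROVOC URIs" rule amounts to a set intersection with a threshold. The sketch below assumes each AGRIS record and each crawled URL is already reduced to a set of concept URIs. The `meaningful_pairs` function and the toy identifiers are hypothetical.

```python
def meaningful_pairs(agris, crawled, n):
    """Pair every AGRIS record with every crawled URL sharing >= n concepts.

    agris and crawled map an identifier to a set of AGROVOC URIs.
    Returns (record, url, number_of_common_uris) tuples.
    """
    pairs = []
    for record, record_uris in agris.items():
        for url, url_uris in crawled.items():
            common = record_uris & url_uris
            if len(common) >= n:
                pairs.append((record, url, len(common)))
    return pairs

# Toy data: single letters stand in for AGROVOC URIs.
agris = {"rec1": {"a", "b", "c"}, "rec2": {"a"}}
crawled = {"url1": {"a", "b", "c", "d"}, "url2": {"a", "b"}}

print(meaningful_pairs(agris, crawled, 2))
print(meaningful_pairs(agris, crawled, 3))
```

Raising N shrinks the result, since fewer pairs clear the threshold. The all-pairs loop is quadratic, so at the scale of the full databases an inverted index from concept URI to documents would likely be needed.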
Example
• http://ageconsearch.umn.edu/
• 101,000 distinct Web resources discovered by the WebCrawler (depth = 5)
• ~1 million triples generated by the AgroTagger (“crawler” database)

| Number of AGRIS records | N: common AGROVOC URIs between AGRIS and the output of the Crawler | Number of associations |
|---|---|---|
| 900 K | 3 | 17 MLN |
| 900 K | 4 | 3.2 MLN |
| 1 MLN | 5 | 0.6 MLN |
Your feedback
• Comments, suggestions, other real case scenarios
• Ideas about the meaning of “meaningful combinations”
• If you test the application, any comments that could improve it
• Can the demonstrator help to overcome data problems?
• You can send your feedback at [email protected]
Thank you (谢谢, σας ευχαριστώ, Gracias)