ePSIplatform Topic Report No. 2015 / 10, December 2015
WEB SCRAPING, APPLICATIONS AND TOOLS
European Public Sector Information Platform
Topic Report No. 2015 / 10
Web Scraping: Applications and Tools
Author: Osmar Castrillo-Fernández
Published: December 2015
Table of Contents
Keywords
Abstract/Executive Summary
1 Introduction
2 What is web scraping?
   The necessity to scrape web sites and PDF documents
   The API mana
3 Web scraping tools
   Partial tools for PDF extraction
   Partial tools to extract from HTML
   Complete tools
   Confronting the three tools
   Other tools
4 Decision map
5 Scraping and legislation
6 Conclusions and recommendations
About the Author
Copyright information
Keywords
web scraping, data extracting, web content extracting, data mining, data harvester, crawler,
open government data
Abstract/Executive Summary
The Internet is the vastest information and data source ever built by mankind. However, it is a huge
collection of heterogeneous and poorly structured data, difficult to collect manually and
complicated to use in automated processes.
Over the last years, techniques and tools have emerged that allow data to be collected and converted
into structured data that can be managed by B2C and B2B systems. This article offers an introduction
to web scraping techniques and to some of the most popular and novel tools for data
extraction and reuse in complex processes.
There are many ways to benefit from such data, in areas like Open Government
Data, Big Data, Business Intelligence, aggregators and comparators, development of new
applications and mashups, among others.
1 Introduction
Every single day, several petabytes of information are published via the Internet in various
formats, such as HTML, PDF, CSV or XML. Curiously, HTML is not necessarily the most common
format to publish content. For instance, 70% of the content indexed by Google is extracted
from PDF documents. This is an additional obstacle for the different roles involved in data
extraction from multiple sources. Journalists, researchers, data analysts, sales agents and
software developers are some examples of professionals who typically use the copy-and-paste
technique to get information in a specific format and export it to a spreadsheet, an executive
report or some data exchange format such as XML, JSON or any of the several vocabularies
based on them.
Focusing on data acquisition from HTML (i.e., a web document), this article explains the
mechanisms and tools that may help us minimize the tedious duties of data extraction from
the Internet.
With regard to the commitment to transparency and data openness that public
administrations have assumed over the last decade, scraping and crawling are techniques
that may be useful. Whereas current information systems in use by public administrations
consider these techniques, a significant number of websites, content management systems and
ECMs (Enterprise Content Management systems) do not. Replacing these systems with new ones implies
considerable economic effort. Scraping tools provide an alternative to replacement that
minimizes such effort. Open Government Data and Transparency Policies should take
advantage of this opportunity.
[Figure: extraction, transformation, reuse]
2 What is web scraping?
One of the many definitions of this concept, and the author's favourite, is:
A web scraping tool is a technology solution to extract data from web sites in a quick, efficient and
automated manner, offering data in a more structured and easier-to-use format, either for B2B or for
B2C processes.
Scraping processes may be written in different programming languages. The most popular are
Java, Python, Ruby and Node. Obviously, expert programmers are required to develop and
evolve them, and even to use them. Nonetheless, some software companies have designed
tools that enable other people to use scraping techniques by means of attractive and
powerful user interfaces.
Web scraping tools are also referred to as Web Data Extractors, Data Harvesters, Crawling Tools
or Web Content Mining Tools.
The necessity to scrape web sites and PDF documents
As already stated, approximately 70% of the information generated on the Internet is available
in PDF documents, an unstructured and hard-to-handle format. A web page, in contrast, has a
structured format (HTML code), although not one that is directly reusable.
PDF scraping is not the subject of this article, although some tools do
exist to extract information from PDFs, mainly data tables. This enormous amount of
information, published but held captive by this kind of format, is usually called “the tyranny of PDF”.
Some tools presented in later sections of this document can read PDF documents and
return information in a structured format, although in a basic and rudimentary way.
Returning to the main scope of this document (HTML documents), their structured nature
multiplies the possibilities opened up by scraping techniques. Web scraping techniques and scraping
tools rely on the structure and properties of the HTML language. This facilitates the task of
scraping for tools and robots, sparing humans the boring, repetitive and error-prone duty
of manual data retrieval. Finally, these tools offer data in friendly formats for later processing
and integration: JSON, XML, CSV, XLS or RSS.
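To make the idea of relying on HTML structure concrete, the following is a minimal sketch in Python (standard library only). It assumes a hypothetical page in which every headline is a link inside an `<h2>` element; the sample page, the class name `HeadlineParser` and the function `scrape_headlines` are illustrative assumptions, not part of any tool discussed in this report.

```python
import json
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collects the text and link of every <a> nested inside an <h2>."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.current = None  # headline currently being read, or None
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
        elif tag == "a" and self.in_h2:
            self.current = {"title": "", "url": dict(attrs).get("href")}

    def handle_data(self, data):
        if self.current is not None:
            self.current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            self.current["title"] = self.current["title"].strip()
            self.items.append(self.current)
            self.current = None
        elif tag == "h2":
            self.in_h2 = False


def scrape_headlines(html):
    """Return a list of {"title", "url"} dictionaries found in the page."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical page used only for illustration.
SAMPLE_PAGE = """
<html><body>
  <h2><a href="/news/1">Open data portal launched</a></h2>
  <p>Some text with a <a href="/about">link</a> outside a headline.</p>
  <h2><a href="/news/2">Budget figures published</a></h2>
</body></html>
"""

headlines = scrape_headlines(SAMPLE_PAGE)
print(json.dumps(headlines, indent=2))
```

The link inside the paragraph is ignored because it is not inside an `<h2>`: the selection is driven entirely by the document structure, which is what makes the extraction repeatable.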
The API mana
Collecting data by hand is truly inefficient: search, copy and paste data into a
spreadsheet to process them later. This is a tedious, annoying and tiresome process.
Therefore, it makes much more sense to automate it. Scraping allows this kind of
automation, as the majority of the available tools provide an API to facilitate access to the
content generated. An API (Application Programming Interface) is a mechanism to connect two
applications, allowing them to share data. Scraping tools provide a URL (an Internet address like
those shown in the address bar of a web browser) that gives access to the scraped data. In
some cases, APIs are not limited to a URL: they can also be programmed to
modify the final result of the scraping process. This feature is very useful for B2B
integration processes, enabling subsequent applications and services based on such data.
Therefore, a good scraping tool should provide a simple and programmable
API. The concept of API is very relevant to this topic. Occasionally, an API of this kind is also called an
EndPoint.
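As an illustration of the B2B integration described above, the sketch below converts the JSON body returned by a scraping EndPoint into CSV text. The payload shape (`{"results": [...]}`), the field names and the function `endpoint_json_to_csv` are assumptions made for this example; each real tool defines its own response format.

```python
import csv
import io
import json


def endpoint_json_to_csv(payload_text, fields):
    """Turn a JSON payload of the assumed shape {"results": [...]} into CSV text."""
    rows = json.loads(payload_text)["results"]
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)  # keys not listed in `fields` are dropped
    return buffer.getvalue()


# Hypothetical response body, as a scraping EndPoint might return it.
SAMPLE_RESPONSE = (
    '{"results": ['
    '{"city": "Bilbao", "max": 21, "min": 12}, '
    '{"city": "Oviedo", "max": 19, "min": 11}]}'
)

csv_text = endpoint_json_to_csv(SAMPLE_RESPONSE, ["city", "max"])
print(csv_text)
```

In real use the payload would be obtained with an HTTP GET on the EndPoint URL (for instance with `urllib.request.urlopen`) instead of a hard-coded string.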
3 Web scraping tools
Now that the basic concepts related to scraping have been discussed, this document
focuses on tools currently available in the market. Basic information about their functionality
and how they work is provided, avoiding deep technical details as much as possible.
Firstly, we suggest an initial segmentation of tools:
1. Partial tools. These are typically plugins to third-party software. They usually focus on
a specific scraping technique (for instance, HTML tables). They do not provide an API
for B2B integration.
2. Complete tools. This segment includes the latest tools in the market and some older
tools, created as a general scraping service. They offer features such as a powerful
graphical user interface, a visual scraping utility, SaaS and/or desktop licensing models,
query caching and recording, APIs, or reporting and audit dashboards.
Partial tools for PDF extraction
This segment includes tools oriented exclusively to reading PDF documents and extracting all or part of
the information they contain. The table below provides a brief description of some of these tools.
PDFtoExcelOnLine (www.pdftoexcelonline.com)
Works online. Allows several input and output formats (including Word, Excel, PowerPoint and PDF). The resulting file is sent via email. Free to use.
Zamzar (www.zamzar.com)
Works online. Large number of input and output formats. The resulting file is sent via email. Free to use.
CometDocs (www.cometdocs.com)
Works online. Converts PDF to Word, Excel and PowerPoint. Cloud-based document storage. Application versions for desktop, smartphone and tablet. Free account with limited features. API for document translation.
PDFTables (pdftables.com)
Works online. Only converts existing tables in a PDF document to Excel. API available. Free account (very limited) plus paid accounts with various licensing terms.
Tabula (tabula.technology)
Desktop tool. Focused on extraction of tables in PDF documents. Source code available. Oriented to software developers.

Feature                 PDFtoExcelOnLine  Zamzar  CometDocs  PDFTables  Tabula
On-line                 X                 X       X          X
Off-line                                                                X
SaaS free service       X                 X       X          X
SaaS paid service                                 X          X
API                                               X          X
Oriented to developers                                                  X
Source code available                                                   X

Document formats:
PDFtoExcelOnLine: Word, Excel, PowerPoint, PDF
Zamzar: many
CometDocs: converts PDF to Word, Excel and PowerPoint
PDFTables: converts existing tables in PDF documents to Excel format
Tabula: only extracts data from tables in PDF documents
Partial tools to extract from HTML
Regarding techniques and tools for partial scraping of web content (HTML documents), some
examples of tools are discussed below.
Google Spreadsheets and the IMPORTHTML formula
This is a simple solution, but sufficiently effective to extract data from an HTML table into a
Google Spreadsheets document. The actual format of the formula is:
=IMPORTHTML("URL";"table";N)
Testing the IMPORTHTML formula is fairly simple, as long as the value N is known. This value
represents the position of the table in the list of tables present in the HTML code found at
the address URL. An example of use based on this format is:
=IMPORTHTML("http://www.euskalmet.euskadi.net/s07-5853x/es/meteorologia/app/predmun_o.apl?muni=";"table";1)
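For readers who prefer code to spreadsheets, the behaviour of the formula (pick the N-th table of a page, counting from 1) can be approximated with a few lines of Python. This is only a sketch under simplifying assumptions: it uses the standard `html.parser` module, does not handle nested tables, and works on a hypothetical sample page; `TableParser` and `import_html_table` are illustrative names, not part of Google Spreadsheets.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects every <table> of a page as a list of rows of cell texts.

    Simplification: nested tables are not supported.
    """

    def __init__(self):
        super().__init__()
        self.tables = []
        self.row = None   # row being read, or None
        self.cell = None  # cell text being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self.cell = ""

    def handle_data(self, data):
        if self.cell is not None:
            self.cell += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            self.row.append(self.cell.strip())
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.tables[-1].append(self.row)
            self.row = None


def import_html_table(html, n):
    """Return the n-th table (1-based, as in IMPORTHTML) of an HTML string."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables[n - 1]


# Hypothetical page: the first table is layout, the second holds the data.
SAMPLE_PAGE = """
<table><tr><td>navigation</td></tr></table>
<table>
  <tr><th>Day</th><th>Max</th></tr>
  <tr><td>Monday</td><td>21</td></tr>
</table>
"""

forecast = import_html_table(SAMPLE_PAGE, 2)
print(forecast)
```

As with the formula, the main difficulty is knowing which value of N corresponds to the table of interest, since layout tables are counted too.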
Table Capture, a Google Chrome extension
Table Capture is an extension to the popular Google Chrome web browser. It enables users to
copy the content of tables included in a web page in an easy manner. The extension can be installed
from that browser by typing the following URL in its address bar:
https://chrome.google.com/webstore/detail/table-capture/iebpjdmgckacbodjpijphcplhebcmeop
Once installed, if Chrome detects tables in the web page currently rendered, a red icon is
shown to the right of the address bar. Clicking on it shows a list of the tables detected and
two controls. The first one copies the content to the clipboard, and the second one opens a Google
Docs document into which the content of the table can later be pasted. An example is shown in the following
figure.
Example of use of Table Capture
Table to Clipboard, Firefox add-on
As in the previous case, the Firefox web browser also supports add-ons (analogous to Chrome’s
extensions) to extract data from HTML tables. An example is Table2Clipboard, which can be
downloaded and installed from https://addons.mozilla.org/es/firefox/addon/dafizilla-table2clipboard/?src=userprofile
In this case, a context menu shown upon a right-click on a table allows copying the whole table or just
the clicked row or column. This is a quite useful and less intrusive mechanism that offers
interesting functionality in many cases.
Example of use of Table to Clipboard
Complete tools
Over the last years, a number of companies, many of them start-ups, have realized that the market
was demanding tools to extract, store, process and render data via APIs. The software industry
moves fast and in many directions, and a good web scraper can help in application
development, Public Administration transparency, Big Data processes, data analysis, online
marketing, digital reputation, information reuse or content comparers and aggregators,
among other typical scenarios.
But what should a tool of this kind include, at the least, to be considered a serious alternative? In
the opinion of the author:
• A powerful and friendly graphical user interface.
• An easy-to-use API to link and integrate data.
• Visual access to web sites to perform data picking.
• Data caching and storage.
• A logical organization and management of the queries used to extract data.
Import.io
Company located in London, specialized in data mining and in transforming Internet data so that users can
exploit the information.
Website → import.io and enterprise.import.io
Motto → Instantly turn web pages into data
Undoubtedly, this is one of the reference tools in the market. It may be used in four different
ways. The first one (named Magic, which can be classified as basic) consists of accessing import.io and
typing the address of the web site to be scraped. The result is
shown in an attractive visual tabular format. The main drawback is that the process is not
configurable at all.
The second way of use is named Extractor. It is the most common usage of import.io
technology: download, install and execute on your own computer. This tool is a customized web
browser available for Windows, OS X and Linux. This way requires some previous skill with
software tools and some time to learn how to use the tool. However, “picking” is offered in a
quite reasonable manner, although it is open to improvement. “Picking” is
performed by clicking on the parts of the scraped web site that we want to extract, in a simple
and visual way. This is a feature that any web scraping tool must include these days.
Once queries have been created, there are only two output formats: JSON and CSV. Queries may
also be combined in order to page results (“Bulk Extract”) or aggregate them (“URLs from
another API”). It is also relevant to note that users will have a RESTful API EndPoint to access
the data source, which is a mandatory feature in any relevant complete scraping tool
nowadays.
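The idea behind paging results can be sketched as applying one extraction rule to a list of page URLs and concatenating the rows. The code below is an assumption-laden illustration, not import.io's implementation: the regular expression stands in for a visually picked query, `FAKE_SITE` stands in for real downloads, and all URLs and function names are hypothetical.

```python
import re

# Stand-in for a picked query: every <li class="item"> on the page.
ITEM_PATTERN = re.compile(r'<li class="item">(.*?)</li>', re.S)


def extract_items(html):
    """Apply the same extraction rule to one page."""
    return [match.strip() for match in ITEM_PATTERN.findall(html)]


def bulk_extract(urls, fetch):
    """Run the extraction on every URL and concatenate the rows."""
    results = []
    for url in urls:
        for value in extract_items(fetch(url)):
            results.append({"source": url, "value": value})
    return results


# Hypothetical paginated site, kept in memory so the sketch runs offline.
FAKE_SITE = {
    "http://example.org/list?page=1":
        '<ul><li class="item">A</li><li class="item">B</li></ul>',
    "http://example.org/list?page=2":
        '<ul><li class="item">C</li></ul>',
}

page_urls = ["http://example.org/list?page=%d" % n for n in (1, 2)]
rows = bulk_extract(page_urls, FAKE_SITE.__getitem__)
for row in rows:
    print(row)
```

Passing the download function as the `fetch` argument keeps the extraction logic independent of how pages are obtained, which also makes it easy to test.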
The import.io application requires a simple user registration process. With a username and
password, the application can be used for free, with a set of basic features that may be
sufficient for small developer teams without complex scraping requirements. Companies or
professionals demanding more flexibility and backend resources must contact the import.io sales
team.
The third and fourth ways of use provide more value to the tool. They are the Crawler and the
Connector, respectively. The Crawler tries to follow all the links in the document indicated via
its URL and it allows information extraction based on the picking process carried out in the
initial document. In the tests carried out to write this article, we did not manage to complete
this process, as it seemed to keep working indefinitely without producing any results. The
Connector allows recording a browsing script to reach the web document from which to
extract information. This approach is very interesting, for instance, if the data to be scraped are
the result of a search.
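The general idea of a crawler (follow the links of a start page and apply the same extraction to each page reached) can be sketched as a breadth-first traversal. The code below is a generic illustration and makes no claim about how import.io's Crawler works; the in-memory `FAKE_SITE`, the choice of `<h1>` as the extracted field and the `max_pages` limit are all assumptions of this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects the href of every <a> and the text of the <h1> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1":
            self._in_h1 = True

    def handle_data(self, data):
        if self._in_h1:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl; returns {url: extracted title}."""
    seen = {start_url}            # avoids visiting the same page twice
    queue = deque([start_url])
    extracted = {}
    while queue and len(extracted) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:          # page could not be retrieved
            continue
        parser = LinkAndTitleParser()
        parser.feed(html)
        extracted[url] = parser.title.strip()
        for href in parser.links:
            target = urljoin(url, href)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return extracted


# Hypothetical three-page site with a cycle and a broken link.
FAKE_SITE = {
    "http://example.org/": '<h1>Home</h1><a href="/a">A</a><a href="/b">B</a>',
    "http://example.org/a": '<h1>Page A</h1><a href="/">back</a>',
    "http://example.org/b": '<h1>Page B</h1><a href="/a">A</a><a href="/missing">x</a>',
}

pages = crawl("http://example.org/", FAKE_SITE.get)
print(pages)
```

The set of visited URLs and the page limit are one common way to guarantee that a crawl terminates even when pages link back to each other.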
In summary, import.io is a tool with interesting functionality that is free to use, with a high level of
maturity, an attractive and modern graphical user interface and cloud storage of
queries, but which demands local installation to take full advantage of its features.
Strengths                                                       Weaknesses
Visual interface                                                Desktop installation
Blog and documentation                                          Limited number of output document formats
Allows pagination, aggregation, crawling and script recording   Learning curve
Strengths vs weaknesses for import.io
View of the basic use of import.io (available at https://import.io)
import.io desktop application
Screen to select data extraction mode in import.io
Kimono Labs
Kimono Labs is a company located in Mountain View, California. They offer a SaaS tool used through
a Chrome extension. It allows extracting data from web documents in an intuitive and easy way. They
provide results in JSON, CSV and RSS formats.
Website → www.kimonolabs.com
Motto → Turn websites into structured APIs from your browser in seconds
Kimono Labs is another key player in the field of web scraping. It uses a strategy similar to
import.io, using a web browser. In their case, they offer a Chrome extension, rather than
embedding a web browser in a native desktop application. Therefore, the first step to use this
tool is to install Google Chrome and then this extension. Registration is optional, but it is a quick
and simple process that is recommended.
After installation and registration, we can use Chrome to reach a web document with
interesting information. At this point, we click the icon installed by the Kimono extension
and the picking process starts. On first execution, the user is offered help in a
visual and attractive format. The graphical user interface is really polished and the tool is
very friendly.
Initial help view offered by Kimono
To start working with Kimono, the user must create a “ball” at the top of the screen. By
default, a ball is already created. Each ball is shown with a different colour.
Afterwards, sections of the document may be selected to extract data and, subsequently, they
are highlighted with the colour of the associated ball. At the same time, the ball shows the
number of elements that match the selection. Balls may be used to select different zones of
the same document, although their purpose is to refer to zones with a certain semantic
consistency.
For instance, in the web site of a newspaper, we might create a ball named “Title” and then
select a headline in a web page. Then, the tool highlights one or two additional headlines as
interesting. After selecting a new headline, the tool highlights another 20 interesting
elements in the web page. We may notice that there are some headlines which are not
highlighted by the tool. We select one of them and now over 60 headlines are highlighted. This
process may be repeated until the tool highlights all the headlines after selecting a small
number of them. This process is known as “selector overload” and is available in several
scraping tools.
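One way to picture why each additional click widens the selection is to treat the picker as generalizing from examples: it keeps only what all clicked elements have in common and then highlights everything matching that pattern. The sketch below illustrates this idea with a deliberately simple model (a tag name plus a set of CSS classes). It is an assumption made for explanation and does not describe Kimono's actual algorithm; the `PAGE` data and both functions are hypothetical.

```python
def generalize(examples):
    """Build the most specific (tag, classes) pattern shared by all examples."""
    tags = {element["tag"] for element in examples}
    tag = tags.pop() if len(tags) == 1 else None  # None means "any tag"
    classes = set.intersection(*(set(e["classes"]) for e in examples))
    return tag, classes


def select(elements, pattern):
    """Return every element matching the generalized pattern."""
    tag, classes = pattern
    return [
        element for element in elements
        if (tag is None or element["tag"] == tag)
        and classes <= set(element["classes"])
    ]


# Hypothetical newspaper page, reduced to tag and class information.
PAGE = [
    {"tag": "h2", "classes": ["headline", "lead"], "text": "Main story"},
    {"tag": "h2", "classes": ["headline", "lead"], "text": "Second story"},
    {"tag": "h2", "classes": ["headline"], "text": "Minor story"},
    {"tag": "h3", "classes": ["headline", "brief"], "text": "Brief"},
    {"tag": "p", "classes": ["summary"], "text": "Opening paragraph"},
]

first_pick = select(PAGE, generalize([PAGE[0]]))
second_pick = select(PAGE, generalize([PAGE[0], PAGE[2]]))
third_pick = select(PAGE, generalize([PAGE[0], PAGE[2], PAGE[3]]))
print(len(first_pick), len(second_pick), len(third_pick))
```

Each new example can only relax the pattern, so the number of highlighted elements never decreases, which matches the behaviour described for the headlines above.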
Once all the headlines have been selected, we can try doing the same with the opening
paragraph of each news item: create a ball, click on the text area of an opening paragraph,
then another one, and go on with the process until all the desired information is ready for
extraction.
Although the idea is really good, our tests found the process somewhat
bothersome. Sometimes box highlighting in web pages does not work well, for instance
with texts which are links at the same time.
Once the picking process is finished by clicking on the “Done” button, we can name our
new API and set its execution schedule. With this, the system may execute the
API every 15 minutes, hour, day, etc. and store the results in the cloud. Whenever the user
calls the API by accessing the associated URL, Kimono does not access the target site but
returns the most recent data stored in its cache. This caching mechanism is highly useful but
not exclusive to Kimono.
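The scheduling and caching behaviour described above can be sketched as two separate operations: a periodic refresh that scrapes the target site, and an API call that only reads the stored result. The class below is an illustrative assumption, not Kimono's implementation; `CachedEndpoint`, `fake_scrape` and the injectable `clock` are hypothetical names introduced so the example runs without waiting or network access.

```python
import time


class CachedEndpoint:
    """Serves the latest stored scrape; refreshes only on schedule."""

    def __init__(self, scrape, refresh_seconds, clock=time.time):
        self.scrape = scrape                  # function that scrapes the site
        self.refresh_seconds = refresh_seconds
        self.clock = clock                    # injectable for testing
        self.data = None
        self.fetched_at = None

    def refresh_if_due(self):
        """What a background scheduler would call periodically."""
        now = self.clock()
        if self.fetched_at is None or now - self.fetched_at >= self.refresh_seconds:
            self.data = self.scrape()
            self.fetched_at = now

    def get(self):
        """What an API call returns: cached data, never a live scrape."""
        return self.data


calls = []


def fake_scrape():
    """Stand-in for a real scrape; counts how often the site is accessed."""
    calls.append(1)
    return {"headlines": len(calls)}


now = [0.0]  # simulated clock, in seconds
endpoint = CachedEndpoint(fake_scrape, refresh_seconds=900, clock=lambda: now[0])

endpoint.refresh_if_due()   # first scheduled run scrapes the site
first = endpoint.get()
now[0] = 600                # 10 minutes later: not due yet
endpoint.refresh_if_due()
second = endpoint.get()
now[0] = 900                # 15 minutes later: due again
endpoint.refresh_if_due()
third = endpoint.get()
print(first, second, third, "site accessed", len(calls), "times")
```

Separating `get` from `refresh_if_due` means that the load on the target site depends only on the schedule, not on how often the API is called.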
Example of highlighting in the picking process of Kimono
The query management console, available at www.kimonolabs.com, provides access to the
newly created API and various controls and panels to read data and configure how they are
obtained. This includes an interesting feature: email alerts that are sent when data
change in the target site.
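A change alert of this kind needs some way to decide whether newly scraped data differ from the previous run. A common approach, shown here as a sketch and not as Kimono's documented mechanism, is to compare fingerprints (hashes) of a canonical serialization of the data. The function names and sample data are assumptions; sending the actual email (for example with Python's `smtplib`) is left out.

```python
import hashlib
import json


def fingerprint(data):
    """Hash a canonical JSON form so that key order does not matter."""
    canonical = json.dumps(data, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_change(previous_fingerprint, new_data):
    """Return (changed, new_fingerprint) for the latest scrape."""
    current = fingerprint(new_data)
    return current != previous_fingerprint, current


# Hypothetical scraped records from two consecutive runs.
baseline = fingerprint({"price": 10, "stock": 3})

same_data_changed, _ = detect_change(baseline, {"stock": 3, "price": 10})
new_price_changed, latest = detect_change(baseline, {"price": 11, "stock": 3})

print("same data changed?", same_data_changed)
print("new price changed?", new_price_changed)
```

Only the fingerprint of the previous run has to be stored, which keeps the comparison cheap even for large result sets.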
There is an additional interesting option named “Mobile App” that integrates the content of
the created API in a view resembling a mobile application, allowing some styling configuration.
However, the view generated by this option is a web document accessible via the URL
provided, intended to be rendered in a mobile browser. Unfortunately, the name of the option
is misleading: it does not generate a mobile application to be published in any mobile
application store. Still, it may be a useful option for rapid prototyping.
The console menu also offers the “Combine APIs” option. Initially, it may look like an
aggregator, assembling the data obtained from several heterogeneous APIs into a single API.
Nevertheless, the help information for this option indicates that the aggregated APIs must have
exactly the same data collection names. The conclusion is that this option is useful to paginate
information, but not to aggregate.
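The restriction described above (all combined APIs must expose exactly the same collection names) can be illustrated with a short sketch. The result shape, a dictionary mapping collection names to lists of rows, and the function `combine_apis` are assumptions made for this example and are not taken from Kimono's documentation.

```python
def combine_apis(results):
    """Concatenate collections from results that share the same collection names."""
    names = set(results[0])
    for result in results[1:]:
        if set(result) != names:
            raise ValueError(
                "collection names differ: %s vs %s" % (sorted(names), sorted(result))
            )
    return {
        name: [row for result in results for row in result[name]]
        for name in names
    }


# Two pages of the same query: same collection name, so they can be combined.
page1 = {"headlines": [{"title": "A"}]}
page2 = {"headlines": [{"title": "B"}]}
combined = combine_apis([page1, page2])
print(combined)

# A heterogeneous source is rejected instead of being aggregated.
prices = {"prices": [{"item": "X", "price": 10}]}
try:
    combine_apis([page1, prices])
except ValueError as error:
    print("cannot combine:", error)
```

With this rule, combining is equivalent to appending pages of one query; merging data with different structures would require a genuine aggregation step.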
Management console of Kimono
In summary, Kimono is a free tool with a high level of maturity and a very good graphical user
interface, providing cloud storage for queries and requiring the Chrome browser and its extension,
both of them installed locally.
Strengths          Weaknesses
Visual interface   Chrome browser dependency
Documentation      Does not allow aggregation
Picker             Weak mobile app option
Strengths vs weaknesses for Kimono
myTrama
myTrama is a web crawling tool developed by Vitesia Mobile Solutions, a company located in Gijón,
Spain. myTrama allows any user to extract data located in different Internet sites and obtain them in
an ordered and structured way in various formats (JSON, XML, CSV and PDF).
Websites → www.mytrama.com and www.vitesia.com
Motto → Data is the new oil
myTrama is a new web crawling tool positioned as a clear competitor to those previously
discussed. It is a purely SaaS service, thus avoiding the need for users to install any software
or to depend on a specific web browser. myTrama works on Chrome, Firefox, Internet Explorer
and Safari. It is available at https://www.mytrama.com.
A general analysis of this tool suggests that myTrama takes the best ideas of import.io and
Kimono. It presents information in a graphical user interface that is perhaps not as polished but is more
compact, with the look and feel of a project management tool. Some of the most
interesting features of this tool are discussed below:
• The main view is organized in a way similar to an email client, with 3 vertical zones: 1)
folders, 2) queries, and 3) query detail. It is efficient and friendly.
• Besides JSON, XML and CSV, the classical structured formats for B2B integrations, it
adds PDF for quick reporting, delivering results in an easily viewable and printable
format.
• It includes a query language named Trama-WQL (quite similar to SQL), which is simple
to use yet powerful. It is useful when visual picking is not sufficient, providing a
programmatic manner to define the picking process. Documentation of this language is
available in the tool as a menu option.
• The “Audit” menu option gives access to a compact control panel with information
about the requests currently being made to each of the APIs (EndPoints).
• The picker is completely integrated. No additional software is necessary.
It is similar to the approach used in Kimono, although it uses “boxes” instead of “balls”.
A subtle difference is that a magic wand replaces the default mouse pointer when
picking is available. In addition, the picking process may be stopped by right-clicking on
the area being picked.
• myTrama permits grouping boxes within boxes, although only one level of grouping
and only one group per query are allowed. This is a very useful feature in order to have
results properly grouped. Hopefully, the development team will improve this feature
soon to provide users with more flexibility.
• Query configuration allows setting the update frequency with a granularity of minutes, from 0 to
9999999. Zero means real time (that is, accessing the target site upon each request to
the EndPoint). For any other value, information is obtained from the cache, as in
Kimono.
• APIs may be programmed using parameters sent via GET and POST requests.
Unfortunately, the dev team has not published sufficient documentation on this
feature. For example, it is possible to use the URL of an API and overwrite the FROM
parameter (the URL referencing the target document) in real time. It is also possible to
pass parameters via GET and POST in the same API. Additionally, there is a service that
allows the execution of a Trama-WQL sentence without any query created in the tool.
As these features are not very well documented, the best choice is to contact the
people at Vitesia.
• Paging queries and aggregation of heterogeneous queries are supported in a fairly
simple and comfortable way.
• For those preferring the browser extension way of scraping, a Chrome extension is also
available. This mechanism allows users to browse sites and start the scraping process
by clicking on the button installed by the extension. This plugin is not yet published but
can be requested from Vitesia.
• PDF is not only a format available as output, but also as input. Therefore, a URL may
reference not only HTML documents but also PDF documents. For instance, users will be able to
extract information from PDFs and generate JSON documents that feed a database for
later information analysis. The business hypothesis to support this is based on the
figure cited in the introduction of this article, which stated that 70% of
the content published on the Internet is contained in PDF documents. Vitesia considers
that this may be a differentiating feature between myTrama and its competitors.
• APIs preserve session. This allows chaining calls to queries in myTrama to fulfil
business processes, such as searches, step-based forms (wizards) or access to information
available behind a login mechanism.
• It is available in two languages: English and Spanish.
• Access to this platform is by invitation. Users remain active for 30 days. After that,
contact with the dev team is required in order to become a permanent user.
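As an illustration of the programmable APIs mentioned in the list above, the following sketch shows how a client might overwrite a FROM parameter in an EndPoint URL so that the same query is applied to a different target document. Since this feature is described as poorly documented, everything concrete here is an assumption: the endpoint address, the `format` parameter and the helper `with_parameter` are hypothetical and are not taken from myTrama's documentation.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_parameter(endpoint_url, name, value):
    """Return the URL with query parameter `name` set to `value`."""
    parts = urlsplit(endpoint_url)
    query = [
        (key, val)
        for key, val in parse_qsl(parts.query, keep_blank_values=True)
        if key != name                      # drop any previous value
    ]
    query.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical EndPoint; only the idea of a FROM parameter comes from the text.
endpoint = (
    "https://example.org/api/my-query"
    "?format=json&FROM=http%3A%2F%2Fexample.com%2Fpage1"
)

retargeted = with_parameter(endpoint, "FROM", "http://example.com/page2")
print(retargeted)
```

The target URL has to be percent-encoded when it is passed as a parameter value, which `urlencode` takes care of here.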
Among all the tools analysed, myTrama seems to be the most complete and compact, although
its user interface is one step behind Kimono and import.io. For users with software
development skills, myTrama seems to be the best choice, although it requires direct contact
with Vitesia.
Initial screen of myTrama
Query management console in myTrama
Picking process in myTrama
Dashboard screen in myTrama
In summary, myTrama is a tool offered solely as a SaaS service, very complete for carrying out
scraping processes, with cloud storage, that may be operated with any web browser. Its
major weakness is the lack of documentation on many differentiating features relevant to
developers interested in taking advantage of scraping processes.
Strengths                                   Weaknesses
The Trama-WQL language                      Limited documentation
Dashboards                                  Free license only for 30 days
Picker                                      More oriented to developers
Session preservation between API requests
Strengths vs weaknesses for myTrama
Confronting the three tools
Feature                                        import.io                     Kimono                          myTrama
Distribution
SaaS model                                                                   X (requires Chrome extension)   X
Desktop installer                              X (Windows, OS X and Linux)
Chrome extension                                                             X (required)                    X (optional)
Free license                                   X                             X                               X
Cross-browser compatibility                    Own browser                   Only Chrome                     X
Operation
Valuation of user interface                    Good                          Very good                       Good
API creation simplicity                        X                             X                               X
Visual picking                                 X                             X                               X
Caching and storage                            X                             X                               X
Query organization                             X                             X                               X
Own query language                                                                                           Trama-WQL
Statistics and audit dashboards                                                                              X
PDF extraction                                                                                               X
Output formats                                 JSON, CSV                     JSON, CSV, RSS                  JSON, CSV, XML, PDF
Session preservation between API invocations                                                                 X
Features
Automatic crawling                             ?
Maturity level                                 High                          High                            Medium/High
Complexity of use                              Medium/High                   Medium                          Medium
Cloud storage                                  X                             X                               X
Query pagination                               X                             X                               X
Query aggregation                              X                                                             X
Real-time data                                 ?                             ?                               X
Others
Orientation to software development            Medium/Low                    Medium/Low                      Medium
Google Docs integration                        X                             X
Level of documentation                         High                          High                            Low
Other tools
This article analyses three tools in the category of complete tools, but many others exist. For the
sake of brevity, and for those readers interested in this kind of tool, some other interesting
tools, utilities or frameworks permitting web scraping are listed below.
Mozenda, QuBole, ScraperWiki, Scrapy, Apache Nutch,
Scrapinghub, ParseHub, Ubot Studio 5, Scraper (Chrome Plugin), Outwit Hub,
Fminer.com, 80legs, Content Grabber, CloudScrape, Webhose.io,
UIPath, Winautomation, Visual Web Ripper, AddTolt, Agent Community,
AllinOneStats, Automation Anywhere, Clarabridge Enterprise, Darcy Ripper, Data Integration,
DataCrops, Dataddo, Diffbot, Easy Web Extract, Espion,
Feedity, Ficstar Web Grabber, ForNova Big Data Platform, Helium Scraper, Kapow Katalyst,
PDF Collector, PDF Plain Text Extractor, RedCritter, Scrape.it, Solid Converter,
Spinn3r, SyncFirst Standard, Text from PDF, Trapeze, UnitMiner,
Web Content Extractor, Web Data Extraction, WebDataMiner, Web Robots Scraping, WebHarvy
If you are not convinced by any of the three recommended tools, Mozenda or ParseHub
may be interesting alternatives.
4 Decision map
The following diagram may be helpful in the process of deciding which tool meets which
scraping requirements. Obviously, the diagram could be more complex, as more questions may
be asked in a decision process. However, added complexity would make the figure harder to read, so it has been kept
simple but sufficiently illustrative.
5 Scraping and legislation
The question to be asked is simple: is web scraping legal?
The answer is not as simple. Firstly, we need to know the specific legislation of each country
concerned. The United States of America is more permissive than Europe. In Spain, where the
author lives and works, laws are rather more restrictive. However, there are various rulings
(including one from the Supreme Court of Spain) in which web scraping is considered legal under
specific conditions.
For example, users must be careful with the legal conditions of certain web sites. If their content is
proprietary and oriented to commercial purposes, scraping might be an illegal activity. Copyright
is another legal aspect of special interest before scraping the information of a
web site. For instance, news from digital newspapers cannot be extracted to be published in a
blog or application without stating the source.
As expected, all this generates doubts, so it is good practice to consult lawyers specialized in
digital content, intellectual property, unfair competition and licensing. In any case, if data are
clearly open, you will not need to worry.
6 Conclusions and recommendations
Web scraping is a familiar term that has gained importance because of the need to “free” data
stored in PDF documents or web pages. Many professionals and researchers need the data in
order to process it, analyse it and extract meaningful results. On the other hand, people
dealing with B2B use cases need to access data from multiple sources to integrate it in new
applications that provide added value and innovation.
The market demands holistic web scraping solutions that encompass cloud storage and make it easy
to build interoperable APIs. In this article we have analysed and compared three web scraping
solutions: import.io, Kimono and myTrama. These solutions differ in their implementation
details but share more common ground than might be expected: a visual picker, JSON results, data
caching, background robots to gather data, programmatic APIs, etc.
An experienced team using these solutions can develop services such as content aggregators,
ranking tools, mobile apps, data monitoring systems, reputation management applications,
business intelligence solutions, Big Data applications, etc.
About the Author
Osmar Castrillo Fernández has more than 15 years of experience in the IT industry. He holds a
Bachelor’s degree in Computer Science from the University of Oviedo. His main fields of expertise
are web development, service-oriented architectures and public administration.
In April 2004 he started working for CTIC Foundation in the R&D team that was in charge of
developing a new J2EE framework for the Principality of Asturias (FWPA). The FWPA was a key
element in the success of the E-Government model of the Principality of Asturias and helped
to simplify and homogenise the development of new government-related applications and
services.
He was a teacher in the first and second editions of the J2EE-FWPA course for IT professionals
in Asturias. He also taught several other courses related to UML 2.0 and n-tier architectures.
In October 2012 he founded and became CTO of Vitesia Mobile Solutions, a company devoted
to extracting and analysing data published on the Internet. At the start of 2013 he started
leading the TRAMA project, a technology to facilitate web scraping that later became the tool
myTrama.
Copyright information
© 2015 European PSI Platform. This document and all material therein has been compiled
with great care. However, the author, editor and/or publisher and/or any party within the
European PSI Platform or its predecessor projects, the ePSIplus Network project or the ePSINet
consortium, cannot be held liable in any way for the consequences of using the content of this
document and/or any material referenced therein. This report has been published under the
auspices of the European Public Sector Information Platform.
The report may be reproduced provided acknowledgement is made to the European Public
Sector Information (PSI) Platform.