
Journal of Software, Volume 13, Number 5, May 2018

Web Crawling and Processing with Limited Resources for Business Intelligence and Analytics Applications

Loredana M. Genovese, Filippo Geraci*
Institute for Informatics and Telematics, CNR, Via G. Moruzzi, 1 Pisa, Italy.
* Corresponding author. Email: [email protected]

Manuscript submitted January 10, 2018; accepted March 8, 2018.
doi: 10.17706/jsw.13.5.300-316

Abstract: Business intelligence (BI) is the activity of extracting strategic information from big data. The benefits of this activity for enterprises range from a reduction of operating costs, due to a more sensible internal organization, to a more productive and aware decision process. To be effective, BI relies heavily on the availability of a huge amount of (possibly high-quality) data. The steady decrease of the costs for acquiring, storing and analyzing large knowledge bases has motivated big companies to invest in BI technologies. So far, however, SMEs (small and medium-sized enterprises) have been excluded from the benefits of BI because of their limited budget and resources. In this paper we show that a satisfactory BI activity is possible even with a small budget. Our ultimate goal is not necessarily to propose novel solutions but to provide practitioners with a sort of hitchhiker's guide to cost-effective web-based BI. In particular, we discuss how the Web can be used as a cheap yet reliable source of information where crawling, data cleaning and classification can be achieved using a limited amount of CPU, storage space and bandwidth.

Key words: Big data analytics, business intelligence, spam detection, web classification, web crawling.

1. Introduction

Intelligence is the activity of learning information from data. The benefits of its application to business activities started to be discussed in the mid-twentieth century, when Hans Peter Luhn coined the term business intelligence (BI) in a 1958 paper [1]. In that paper Luhn described some guidelines and functions of a business intelligence system under development at the IBM labs.

In the course of the years, BI has evolved, extending outside the scope of the database community and increasing its complexity. Structured information has been replaced by unstructured information; data volumes have grown, posing new challenges for gathering, storing and analyzing them; analytics (namely, the possibility of creating profiles from data) has emerged as a new opportunity to help companies drive their business efforts in the most promising directions. Forecasts on future trends [2] predict an ever increasing role of BI and big data analytics in the success of companies. However, the cost of implementing BI has excluded small and medium enterprises from the benefits of big data analysis. In fact, collecting and storing data is still considered expensive and computationally demanding. Although practical guides such as [3] provide wise suggestions for building scalable infrastructures that can optimize long-term costs, the initial effort required to set up even a minimalistic BI system still remains unaffordable.

In this paper we deal with the problem of building a good quality business intelligence infrastructure while keeping an eye on the company budget. Achieving this goal requires a careful choice of the data sources, extraction procedures, and pre-processing. As other studies suggest, given the universal accessibility of its contents as well as its intrinsic multidisciplinarity, the Web can be considered the most convenient source of information. However, using the Web poses several issues in terms of data coherence, representativeness, and quality. Moreover, technical challenges have to be considered. In fact, although downloading a page seems to be free, crawling large portions of the web may quickly become expensive and computationally demanding. Seed selection is a key factor to ensure the data are up to date. For example, new initiatives can be promoted by registering ad-hoc websites that may not be reachable yet from other websites but are spread through other channels such as social media. Download ordering can influence representativeness, while data filtering can bias coherence.

In Section 2 we discuss crawling strategies, tailored to the needs of small companies, that can drastically simplify the design of crawlers and data storage without sacrificing quality. In particular, analyzing the bottlenecks of standard crawling, we discovered that most of the CPU effort is due to the need to check whether a URL has already been discovered/downloaded or not. However, a more in-depth analysis shows that, leveraging an ad-hoc download ordering, these checks can be reduced in the specific context of limited resources. This, in turn, has the effect of simplifying the storage architecture and providing data stream access for the subsequent analyses.

In Section 3 we try to answer the question of whether web data is suitable for BI or not. Statistical analyses in the literature show that, although morphologically different, web corpora have the potential to be good knowledge bases once noise is filtered. We pinpoint spam and web templates as the two main sources of noise. Besides data coherence, spam has a profound impact on the consumption of bandwidth and storage resources. In order to limit resource waste, we show how to exploit DNS (Domain Name System) queries to predict whether a newly discovered domain is likely to be spam and, if so, remove it from the list of websites pending download.

In Section 4 we discuss the problem of per-topic classification, comparing post-processing with focused crawling. In the absence of resource limits the former solution would be preferable because it prevents a useful document linked by an off-topic website from being lost. However, in a real scenario the higher cost of post-processing compared to its benefits makes focused crawling more feasible. In either case, dealing with classification requires owning a large set of examples that, at least for specialized companies, can be hard to find in publicly available ontologies or annotated resources. Thus, we show how such a set of examples can be built using semi-supervised techniques that balance accuracy and human effort.

Summarizing: in this paper we investigate the technological and methodological challenges that IT practitioners have to deal with when building a BI architecture on a limited budget. We also propose solutions that can help to attain a satisfactory tradeoff between costs and BI quality.

2. Crawling

Crawling is the activity of creating a local copy of a relevant portion of the World Wide Web. This task usually represents the first challenge that IT professionals have to deal with when performing BI and web analytics. In its simplest form, a crawler is a program that receives as input a list of valid URLs (called seeds) and starts downloading and archiving the corresponding websites. Intra-domain hyperlinks are immediately followed in breadth-first order, while links to external resources are appended to the list of seeds. The crawling process terminates once the list of seeds becomes empty. Although simple in principle, the variety of purposes for which data is collected, as well as technical issues, has made crawling a rich and complex research field. As a measure of the interest in this topic, the authors of a recent survey [4] report having catalogued 1488 articles about crawling; restricting to actual implementations, they list 62 works. In the following sections we dissect the main crawling design strategies and propose cost-effective options tailored to the BI process.


2.1. Seed Selection

The seed list is the initial set of URLs (most often websites' home pages) taken as input by a crawler starting to build its web image. Although not discussed in depth in the scientific literature about general purpose crawling, how to build a comprehensive seed list is one of the most debated topics in specialized forums.

In [5] the authors provide a successful macroscopic description of the web that can be useful to speculate on the effects of seed selection. They start by recalling that the web can be modeled as a directed graph where pages are nodes and hyperlinks are edges. According to the experiments in [5], web pages can be divided into five categories that form the famous bow-tie shaped graph. The core of the web consists of the SCC: a strongly connected subgraph (i.e. a graph where, given any two random nodes, there exists a path connecting them). The SCC is linked from the class of IN pages and links to the OUT pages. The two other relevant classes are the TENDRILS (namely, pages not connected with the SCC) and the disconnected pages (i.e. pages with both in-degree and out-degree equal to 0).

Crawling the SCC and OUT components is relatively easy. In fact, given a single URL belonging to the SCC it is theoretically possible to find a path that connects to all the other elements of the SCC and, in turn, to OUT. The elements of these categories are typically popular websites, thus finding pre-compiled lists of them on the web is a simple task.

Pages in IN and in TENDRILS, which often are new websites not yet discovered and linked, are very difficult or even impossible to discover. In fact, these pages either have in-degree equal to 0 or are linked from pages of the same category. According to the estimates in [5], IN and TENDRILS account for about half of the entire web. Thus, in order to crawl as many pages as possible of these two classes, their URLs should be present in the initial seed list.

Collaborative efforts have produced large lists of domains freely available on the web. Although useful to increase the coverage of the IN and TENDRILS classes, these resources cannot be kept updated because of the dynamicity of the web. Recently, harvesting URLs posted on Twitter has emerged as an effective alternative source of fresh and updated seeds. A first application of Twitter URLs to enhance web searching is shown in [6], the ability of this social medium to immediately detect new trends is proven in [7], while a machine learning approach to identify malicious URLs in Twitter posts is described in [8].

2.2. URL Crawling Order

In the context of limited resources, the URL download order matters [9]. Pioneering papers like [10] indicate breadth-first as the strategy that provides the best empirical guarantees to quickly download high quality (in terms of PageRank) pages. The intuition at the basis of this idea is that page quality is somehow related to the depth from the document root [11].

In [11] the authors provide a nice comparison of several ordering strategies. According to [11], the main qualifying feature to classify ordering algorithms is the amount of information they need, ranging from no pre-existing information to the requirement of historical information about the web-graph structure.

In the context of BI for small and medium enterprises, however, supposing the existence of historical information (i.e. a previous crawl) may be inappropriate. In fact, small companies cannot be expected to refresh their crawls too often. Thus, even if it exists, historical information could be old enough to become misleading. Hence, ordering strategies that do not require pre-existing information should be preferred.

In [12] the authors propose a hybrid approach in which websites are kept sorted using a priority queue and priorities are given by the number of URLs per host already discovered but whose download is still pending. Once a website is selected, a pre-defined number k of pages is downloaded. This strategy has several advantages in terms of network usage. In fact, the number of DNS requests can be reduced thanks to caching, while the use of the keep-alive directive [13] reduces network latency. Experiments in [12] prove that this strategy still downloads high quality pages first.

Building upon [12] we propose an ordering strategy that can drastically simplify the crawler architecture in the practical case of limited resources. We exploit the important observation from [14], where the authors observe that chains of dynamic pages can lead to an infinite website even though the real information stored in the web server is finite. According to this observation, setting a limit on the maximum number of pages per host appears sensible. We thus keep the hosts sorted as in [12], but once a website is selected, the downloading process is done all at once. Some necessary caveats are discussed in the subsequent sections.
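As a rough illustration (not the implementation used in our experiments), the following Python sketch shows the resulting scheduling loop: hosts are kept in a priority queue ordered by the number of pending URLs, and a selected website is downloaded all at once, breadth-first, up to a per-host page limit. The fetch() and extract_links() helpers, as well as the page cap value, are hypothetical placeholders.

```python
import heapq
from collections import defaultdict
from urllib.parse import urlparse

MAX_PAGES_PER_HOST = 5000          # illustrative cap, motivated by the observation in [14]

pending = defaultdict(int)          # host -> estimated number of pending URLs
done = set()                        # hosts already downloaded

def fetch(url):
    """Hypothetical helper: download a page and return its HTML (or None on failure)."""
    raise NotImplementedError

def extract_links(html, base_url):
    """Hypothetical helper: return the absolute URLs found in the page."""
    raise NotImplementedError

def crawl(seed_urls):
    for url in seed_urls:
        pending[urlparse(url).netloc] += 1

    heap = [(-count, host) for host, count in pending.items()]
    heapq.heapify(heap)

    while heap:
        _, host = heapq.heappop(heap)
        if host in done:               # stale queue entries are simply skipped
            continue
        done.add(host)
        # Download the whole website at once, breadth-first, up to the cap.
        queue, seen, n_pages = ["http://%s/" % host], set(), 0
        while queue and n_pages < MAX_PAGES_PER_HOST:
            page_url = queue.pop(0)
            if page_url in seen:
                continue
            seen.add(page_url)
            html = fetch(page_url)
            if html is None:
                continue
            n_pages += 1
            for link in extract_links(html, page_url):
                link_host = urlparse(link).netloc
                if link_host == host:
                    queue.append(link)          # intra-site link: follow immediately
                elif link_host not in done:
                    pending[link_host] += 1     # external link: bump the host priority
                    heapq.heappush(heap, (-pending[link_host], link_host))
```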

2.2.1. Politeness

Crawling etiquette requires not affecting web server performance by overloading it with download requests. According to the experiments in [12], a time interval of 15 seconds is appropriate for real life applications so as to balance politeness constraints with the timeout of the keep-alive directive. Moreover, using a time-series analysis of web server access logs as in [15], we observed that a limited degree of tolerance to this limit could be acceptable in the initial phase of the download. This comes from the observation of browser behavior. In fact, since a page may require several supplementary files to be rendered (i.e. CSS directives, images, scripts, etc.), browsers download all these files in parallel in order to display it promptly. As a result, mimicking the browser download pattern can speed up crawling (at least of small websites) without influencing web server performance.

Despite the above considerations, our URL ordering policy still requires a large per-host degree of parallelism to maximize network throughput. In order to avoid breaking the etiquette rules because of the use of virtual hosts, however, some care is required in the selection of the hosts that are downloaded in parallel. In fact, even low rates of requests can become problematic if they have to be served by the same physical machine. We thus modified the ordering policy described in [12], selecting the host with the highest priority among those that do not resolve to an IP address already in download.
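A minimal sketch of this IP-aware selection rule under our assumptions (the 15-second figure comes from [12]; everything else is illustrative): before starting a host, its name is resolved and the host is skipped if another host on the same IP address is currently being fetched.

```python
import socket
import time

POLITENESS_DELAY = 15.0     # seconds between requests to the same server, as in [12]

ips_in_download = set()     # IP addresses with an active download
                            # (entries are removed when a host completes, not shown)

def pick_next_host(sorted_hosts):
    """Return the highest-priority host whose IP is not already in download.

    `sorted_hosts` is assumed to be ordered by decreasing priority.
    """
    for host in sorted_hosts:
        try:
            ip = socket.gethostbyname(host)   # DNS lookup, done only at selection time
        except OSError:
            continue                          # unresolvable host: skip it
        if ip not in ips_in_download:
            ips_in_download.add(ip)
            return host, ip
    return None, None

def polite_download(pages, fetch):
    """Fetch `pages` (all on the same host) leaving POLITENESS_DELAY between requests."""
    for url in pages:
        fetch(url)
        time.sleep(POLITENESS_DELAY)
```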

2.2.2. URLs management

According to the breadth-first strategy, once a URL is found, a crawler has to check whether it is already in the list of known URLs and, if not, append it to the list. Since this check is URL-oriented, it has to be repeated for each outgoing link, both internal and external, in a webpage, resulting in a computationally demanding operation. In particular, distributed crawling architectures split the URL list among nodes, thus checking for a URL and updating the list might also imply extra network traffic.

Our crawling strategy introduces a drastic simplification in URL management. Since a host is downloaded all at once, we do not need to keep track of each individual URL in the crawl: per-host information suffices. In fact, a newly discovered external URL will be retrieved again once the corresponding website is downloaded, unless the crawler has reached the maximum number of pages for that host or the corresponding page is not connected to the home page. However, except for the case of a disconnected page belonging to a small enough website (which we expect to be of marginal utility for BI applications), keeping the newly discovered URL does not change its crawling fate, thus it can be discarded.

We arrange URLs in two data structures: the intra-links table and the host queue. An intra-links table (see Fig. 1) is locally maintained by each individual crawling agent and destroyed once the download of the corresponding website is completed. This data structure is used to maintain the set of internal pages (sorted by discovery time). We implemented intra-links tables as fast hash sorted sets.


Fig. 1. Intra-links table example: internal URLs (Url1 ... Url5) sorted by discovery time, with a "next" pointer separating the already downloaded pages from the newly discovered ones; a rediscovered Url4 is not added because it is already known.

The host queue is a priority queue storing the host URL and IP address (which is looked up only when the host is selected for downloading), the status (complete, in download, pending), and the priority, expressed as the number of pending downloads. In the absence of the complete list of already discovered URLs, however, duplicates (namely URLs discovered more than once) cannot be detected unless a degree of error is accepted, thus the number of pending downloads can only be estimated. We propose two simple estimation methods. The easier approach leverages the idea that the importance of a website depends on the number of in-links, thus the number of pending downloads can be replaced with the host in-degree. As an alternative, a small Bloom filter can be associated with each host. Bloom filters are fast and compact data structures to sketch object sets that enable probabilistic membership queries. In our case, once an external link is extracted, the Bloom filter associated with the corresponding host is queried for membership. If the URL is not found, it is added to the Bloom filter and the counter of pending downloads is incremented. Despite being computationally more demanding than host in-degree, Bloom filters are still much lighter than URL lists. Moreover, accessing the host queue happens only for external links, which account for an average of 11.62% of the outgoing links in our experiments.
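To make the second estimation method concrete, here is a small self-contained sketch (the filter size and the number of hash functions are arbitrary illustrative choices, not values from our implementation) of a host-queue entry whose per-host Bloom filter approximates the number of distinct pending URLs.

```python
import hashlib

class HostEntry:
    """Host-queue record: a tiny Bloom filter approximates the pending-URL count."""

    def __init__(self, host, size_bits=1024, num_hashes=3):
        self.host = host
        self.status = "pending"            # pending / in download / complete
        self.pending = 0                   # estimated number of pending downloads
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive `num_hashes` bit positions from salted MD5 digests of the URL.
        for i in range(self.num_hashes):
            digest = hashlib.md5((str(i) + url).encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add_url(self, url):
        """Increment the pending counter only if the URL looks new to the filter."""
        positions = list(self._positions(url))
        if all(self.bits[p // 8] & (1 << (p % 8)) for p in positions):
            return                          # probably already discovered: ignore
        for p in positions:
            self.bits[p // 8] |= 1 << (p % 8)
        self.pending += 1
```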

2.3. Crawling Depth

As mentioned in Section 2.2, in a real scenario storage and bandwidth are limited and expensive, as opposed to the possibility of dynamically creating new web pages, which is unlimited and free. The authors of [14] conclude that crawling up to five levels from the home page is enough to include most of the web that is actually visited. Hypotheses on the power-law distribution of the number of pages per website [16], however, suggest that most websites are smaller. Based only upon common sense, here we try to speculate on how small the threshold on the crawling depth can be for BI applications with severe resource limits.

Setting the depth to the minimum possible, a crawler would only download the website document root. This page, however, is often used as a sort of gateway to collect information on the user configuration and redirect the browser to the appropriate version of the home page. In this case, a crawler should be aware of redirections and follow them until reaching the proper home page.

As expected, home pages are very informative, containing most of the basic information about the website. In particular, leveraging only home pages it is possible to derive some structural information about the underlying technologies [17], accomplish large scale usability tests [18], [19] and perform web classification [20].

Extending the crawl to two levels has a large impact in terms of required storage and bandwidth. In fact, home pages often tend to act as a sort of hub, thus the expected volume of pages directly reachable from the home page is usually high. Based on a sample of 60 among the most popular websites, we estimated the second level of the web hierarchy to have on average 137 pages (notice that this can be an underestimation, since news websites and general-purpose portals tend to have more links than other websites and can easily exceed 500 internal links). However, the higher cost in terms of resources is balanced by the much richer information content. For example, e-commerce companies may specify only commodity macro categories in the home page, leaving the subdivision by micro category or brand to the second level. Other important details like contact information and partnerships are often reported in the second level of the hierarchy.

A crawling depth of three is often synonymous with downloading the entire website. In fact, in order to facilitate accessibility as well as indexing by search engines, several websites use sitemaps. This technique consists in creating a page directly descendent from the home page in which all the internal links of the website are enumerated. As a result, sitemaps bound the website hierarchy to three levels.

2.4. Data Storage

Data storage is probably the most complex key factor in a crawling architecture. Nonetheless, even high impact publications give few details about the storage model. For example, in [21] the authors focus on load-balancing, tolerance to failures, and other network aspects, while in [22] the authors discuss the download ordering of the pages, the updates and the content filtering. In both cases, however, storage is only mentioned. A simple classification of the huge amount of data produced during crawling is, instead, provided in [23]. Crawling data can be classified according to two main characteristics: fixed/variable size and output-only/recurrent usage.

Fixed size data is easier to store and access: conventional databases can handle billions of records and serve hundreds of concurrent requests. Variable size data, instead, requires some further consideration about its usage frequency. In fact, for frequently accessed data fast access should be preferred, while for output-only data a compact representation would be better. Frequently, however, fixed and variable size data co-exist in the same record. This is the case, for example, of the record storing the information about a single webpage. In fact, according to the level of detail that the crawler provides, this record can contain several fixed size fields, for example a timestamp, and some variable size information like the page URL. However, also in this case conventional databases provide well-qualified solutions.

The most verbose variable size crawling data is the HTML content of webpages. This information, however, is used only once, to extract outgoing links from the page, and is then stored for output. Small crawlers such as wget store each page (or website) in a dedicated file, leaving to the file system the burden of managing them. Page updates are supported by deleting the corresponding file and replacing it with a new one. Obviously, this solution cannot scale even for small crawls of a few million pages (or websites).

Another simple storing strategy is that of defining a certain number of memory classes (defining buckets of fixed size) and memorizing each page in the bucket that best fits the page size. Each memory class can be stored in a separate file. This solution introduces a certain waste of space that can be minimized with a careful choice of the sizes of the memory classes. A simple method to set memory class sizes is that of downloading a small sample of pages and estimating suitable values from them. Updates are supported also in this case. However, the possibility for the new copy of a webpage to change size (thus requiring a bucket of a different memory class) can introduce a form of fragmentation in which certain buckets become empty. In order to avoid this phenomenon, each memory class can be endowed with a priority queue (e.g. using a min-heap) to take note of the empty slots.

As an intermediate strategy between the two previously described storage solutions, when updates are not required, HTML pages can be stored in log files. Once a new page is downloaded, it is stored just after the previous one in the log file. Keeping a record containing the length and a reference to the offset in the log file, accessing a webpage is very fast since it only requires a disk seek and a single read operation. Storage space is minimized because only two integers per page are required. Although supporting updates is possible, this model is recommended only for static crawling. In fact, updating a page would require appending the new copy to the log file and updating the corresponding record. However, the previous version of the page would not be cancelled, thus resulting in a waste of space.
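A minimal sketch of the log-file model under our assumptions (the class name and the in-memory index are illustrative choices): pages are appended to a single log file and located later through an (offset, length) pair.

```python
import os

class PageLog:
    """Append-only log storage: each page is addressed by (offset, length)."""

    def __init__(self, path):
        self.path = path
        self.index = {}                       # url -> (offset, length)

    def append(self, url, html_bytes):
        with open(self.path, "ab") as log:
            log.seek(0, os.SEEK_END)          # make the write offset explicit
            offset = log.tell()
            log.write(html_bytes)
        self.index[url] = (offset, len(html_bytes))

    def read(self, url):
        offset, length = self.index[url]
        with open(self.path, "rb") as log:
            log.seek(offset)                  # one disk seek ...
            return log.read(length)           # ... and a single read, as described above
```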

Fig. 2. Running time (in seconds) for the operations of download, parsing and retrieving from disk of 60 among the most popular websites (series: first download, second download, parsing, disk read).

We investigated a further class of data, not discussed in the literature, that we call auxiliary data. We define as auxiliary any piece of information that crawlers have to compute for their activity but discard immediately after usage. Sometimes, however, the same information is necessary for the subsequent analysis of the crawled document corpus; thus, storing it during crawling would speed up subsequent analyses, reducing the CPU requirements at the cost of a higher storage consumption. The most important information belonging to this category is parsing. As discussed in Section 2.4.1, parsing is necessary in all data analyses and can dominate the computational time. Saving it, at the cost of a moderate increase of storage usage, could be useful in most cases.

2.4.1. Parsing

As mentioned before, crawlers perform parsing to extract outgoing links from webpages. However, since this is considered a mere internal operation, general-purpose crawlers do not return the parsing in output, thus forcing subsequent analysis software to redo it.

We performed a simple experiment to show how big the impact of parsing is both for crawling and for data analysis. We tested the home pages of 60 popular websites and computed the speed of downloading, parsing and retrieving them from disk. We repeated this operation twice so as to measure the downloading speed before and after network caching. As expected, with few exceptions (see Fig. 2), parsing is in general much faster than downloading and takes respectively 24.47% of the overall processing time for the first download and 32.64% for the second. Comparing parsing with disk access (as it would be during the analysis), instead, the situation drastically changes. In fact, in this case parsing takes on average 99% of the overall wall-clock processing time. Avoiding repeated parsing during analyses would thus enable a consistent speedup.

With the purpose of BI applications in mind, we developed an ultra-fast parser that stores the offsets of tags in a vector of quadruples (see Fig. 3). Since the tree structure of HTML is rarely necessary for analysis, we do not keep track of it. However, we notice that building it would cost O(n log2 n), where n is the number of tags in the page (in our experiment n ≅ 2988). Each tag is described by its starting position and length (namely, the distance in bytes between the opening and the closing angle brackets). In addition, the offset of the tag name and its length are stored. In order to minimize the number of disk read operations needed to load the whole parsing data structure, we maintain in the first word of the vector the size of a quadruple and the number of tags. As a result, loading the data structure requires only two disk accesses.
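As an illustration of the data structure (a simplified reading of Fig. 3, not our actual parser), the sketch below scans a page once and records, for every tag, its starting position, its length, the distance of the tag name from the tag start and the name length.

```python
import re

# Opening or closing tags: group 1 captures the tag name.
TAG = re.compile(rb"<\s*/?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")

def parse_offsets(html_bytes):
    """Return one (tag_start, tag_length, name_distance, name_length) quadruple per tag.

    `name_distance` is the offset of the tag name from the beginning of the tag,
    as in the description above; the HTML tree structure is deliberately ignored.
    """
    quadruples = []
    for match in TAG.finditer(html_bytes):
        tag_start, tag_end = match.span()        # from '<' to '>' included
        name_start, name_end = match.span(1)     # position of the tag name
        quadruples.append((tag_start,
                           tag_end - tag_start,
                           name_start - tag_start,
                           name_end - name_start))
    return quadruples
```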



Fig. 3. Graphical description of the parsing data structure: a header word storing the field bit-widths (5 + 4 + 4 = 13 bits), the tag name length (4 bits) and the number of tags (19 bits), followed by one quadruple per tag (starting position, tag length, tag name starting position, tag name length).

A relevant problem of this approach (and of parsing in general) is that we cannot make any assumption on the dimensions of pages and tags. In general, except for the length of the tag name, which is limited to 10 bytes in HTML5, we should use up to 4 bytes to store each element of the quadruple. Using relative distances does not mitigate the problem. This would mean that storing the information of each tag would require 13 bytes. Given that our experiments indicate the median length of HTML tags to be 8 bytes and 56.27% of the tags to be smaller than 13 bytes, this would be an unacceptable blowup of the required space. To deal with this problem, we use an adaptive strategy in which, after processing a page, the parser computes the minimum number of bits necessary for each field of the quadruple (except for the tag name length, which is kept constant at 4 bits). These values are stored in the header word of the parsing structure. In particular, we use 5 bits for the width of the page offset field (i.e. limiting the maximum page size to 2^32 ≅ 4 GB), 4 bits for the width of the tag length field (i.e. limiting the tag length to 2^16 = 65536 bytes), and 4 bits for the distance from the beginning of the tag. In order to fit the header in a 32-bit word, we reserve the remaining 19 bits for the counter of the number of tags. This, however, is not a limitation in practice, since it means that our data structure can deal with pages of up to 524288 tags.

We experimentally evaluated the worst-case number of bits required to store each field. We observed that 20 bits are enough for the maximum page length, 11 for the tag length and 7 for the tag name distance. Overall, a quadruple would require a total of 42 bits in the worst case.
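The following sketch (our own simplified illustration of the header layout described above, building on the quadruples of the previous sketch; the exact bit ordering is an assumption) shows how the per-page field widths could be computed and packed into a 32-bit header word.

```python
def bit_width(value):
    """Number of bits needed to represent `value` (at least 1)."""
    return max(1, value.bit_length())

def pack_header(quadruples):
    """Build the 32-bit header: widths of the three variable fields plus the tag count.

    Illustrative layout: 5 bits for the width of the page-offset field, 4 bits for the
    width of the tag-length field, 4 bits for the width of the name-distance field,
    and 19 bits for the number of tags (the tag name length is fixed at 4 bits).
    """
    w_offset = max(bit_width(q[0]) for q in quadruples)    # tag starting position
    w_length = max(bit_width(q[1]) for q in quadruples)    # tag length
    w_name = max(bit_width(q[2]) for q in quadruples)      # name distance from tag start
    n_tags = len(quadruples)
    assert w_offset < 2 ** 5 and w_length < 2 ** 4 and w_name < 2 ** 4
    assert n_tags < 2 ** 19
    header = (w_offset << 27) | (w_length << 23) | (w_name << 19) | n_tags
    return header, (w_offset, w_length, w_name)
```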

3. Filtering

What makes BI practitioners skeptical about using the Web as a source of information for their analyses is the untrustworthiness of the sources and the possibly low text quality due to the presence of every sort of artificially added content in the data [24] (aka boilerplate). Artificial text is often used to try to circumvent the ranking algorithms of search engines so as to obtain a privileged position in the list of results. Other sources of unrelated content are: interfaces (menus, navigation bars, etc.), advertising banners and parked domains.

In order to discern whether it is possible to use web data for BI or not, we thus have to answer two questions:
1) Are web-based corpora qualitatively comparable with traditional ones?
2) Can web-based corpora influence the BI activity?

In [25], the author tries to answer the first question by carrying out a comparison of "traditional" corpora and web-based ones, focusing on Czech and Slovak. Experiments show that none of the compared corpora is morphosyntactically superior: web-based corpora are simply different. A further insight comes from [26], where the author shows the effect of the size of the corpus.


The most interesting result is the empirical demonstration that, when the dataset is large enough, the language of the web perfectly reflects the standard written language. Lowering the size of the dataset, the equality persists for the most frequent nouns, while differences begin to appear in the least frequent nouns.

A partial answer to the second question is provided in [27], where the authors prove that computer-mediated text can deceive the reader. However, this category of text has specific linguistic features that can be easily detected and removed [28].

We observe that most of the uninteresting text in web-based corpora comes from two main sources: templates (which include advertisement banners, navigational elements, etc.) and spam websites (aka parked domains). In [29] we proposed a general-purpose algorithm for template removal. A brief discussion of its benefits for BI, as well as a sketch of the algorithm, is presented in Section 3.1. Differently from approaches like that in [30], our approach is not technology-specific and, thus, it is not expected to become deprecated with the evolution of web technologies. Moreover, integrating our approach with the crawling ordering described in Section 2.2, it is possible to reduce the storage requirements of crawling while still being able to restore the original webpages.

In [31] we proposed a clustering-based approach to detect the most common category of spam: that of DNS-based parked domains. Our experiments showed that these websites resolve to dedicated name servers. As a result, filtering at the DNS level would introduce a marginal loss of useful information while being much more efficient than a per-website classification. However, narrowing to only this category of spam is not enough for BI applications. In Section 3.2.2 we extend the approach in [31] to all the categories of spam websites.

3.1. Template Removal

Since the advent of content management systems (CMS) in the late nineties, web sites have started using templates. Estimates in [32] rate 48% of the top websites as using CMSs, while the number of sites that employ templates is probably much higher. Web templates consist of stretches of invariable HTML code interlaced with the main text of the page. The goal of this technology is to provide a uniform look-and-feel to all the pages of a site, relieving the website manager from the heavy task of manually curating it. Thus, templates do not introduce useful information from the BI point of view, but have the mere role of improving the user experience. However, the impact of templates on the amount of resources used to store websites is not negligible. According to [29], the wasted space due to templates is highly variable and can span from 22% (estimated on en.wikipedia.org) to 97% (estimated on zdnet.com).

A further undesired effect of web templates is that they have been reported to negatively affect the accuracy of data mining tools because of the noise they introduce in the text [33]. In fact, large templates can introduce a non-marginal number of extra words in the webpage. These words, in turn, can either mislead duplicate detection algorithms or artificially bias the similarity of documents.

Consequently, stripping templates from webpages is expected to produce a drastic saving of storage consumption as well as an increase of the data quality for the subsequent BI process.

3.1.1. An algorithm for template removal

In this section we sketch out the main ideas behind our template extraction algorithm, leaving the details to the original work in [29]. Our algorithm assumes that the template of a website is made up of a set of overrepresented stretches of HTML tokens (namely, a tag or a piece of text between two tags) and consists of two main routines: the aligner and a bottom-up template distillation algorithm. The aligner takes in input two tokenized webpages and uses the result of the Needleman and Wunsch global sequence alignment algorithm [34] to build a consensus page containing the union of the tags of the two pages. Each tag is endowed with a numerical value representing the frequency of the tag in the alignment.


Starting from a set P = (p1, ..., pt) of t pages, the template distillation algorithm (see Fig. 4) uses a bottom-up approach that recursively halves the number of elements in P by aligning pairs of pages. Once P contains only one page, the tags with estimated frequency higher than 0.5 (i.e. the tags that are present in at least half of the original pages) are returned as the website template. For ease of computation, t is constrained to be a power of 2.

Fig. 4. Example run of the template distillation algorithm: the t input pages are aligned pairwise in a binary tree of Align() calls, and a final Filter() step extracts the template.
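A compact sketch of the distillation scheme (our own simplified reading of [29]; the align() step is a hypothetical placeholder for the Needleman-Wunsch consensus construction): t pages are merged pairwise until one consensus remains, and tags present in at least half of the original pages form the template.

```python
def align(page_a, page_b):
    """Hypothetical helper: global alignment (Needleman-Wunsch) of two token lists.

    It is assumed to return a consensus list of (token, frequency) pairs, where the
    frequency is the estimated fraction of the aligned pages containing the token.
    """
    raise NotImplementedError

def distill_template(pages):
    """Bottom-up template distillation over a set of pages (len(pages) must be a power of 2)."""
    level = [[(token, 1.0) for token in page] for page in pages]
    while len(level) > 1:
        # Recursively halve the set by aligning pairs of (consensus) pages.
        level = [align(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    consensus = level[0]
    # Tags appearing in at least half of the original pages form the template.
    return [token for token, freq in consensus if freq > 0.5]
```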

3.2. Spam Websites Removal

As mentioned in Section 3, spamming is one of the two major sources of noise in a web-based corpus of documents for BI applications. Although spam websites no longer look like long flat lists of ads, every Internet user has gathered experience on how to recognize them at a glance. In fact, most spam websites tend to look similar to each other. Exploiting this fact, in [31] we developed a clustering-based approach to identify them.

Going further, we observed that it is possible to leverage the mechanism used to create spam websites on a large scale to predict whether a website contains spam without downloading it. In fact, big providers of parking facilities only require an ad-hoc setting of the website name server (NS) to supply the service. As shown in Section 3.2.3, classifying domains according to their NSs leads to an accurate and computationally cheap method to remove spam, saving bandwidth and storage.

3.2.1. A clustering algorithm for spam detection

In this section we sketch our approach, leaving the details to the original work in [31]. According to the intuition that spam websites tend to look similar (not equal) to each other, given a model page and a notion of distance among pages, all spam websites should form a single homogeneous cluster.

However, the reality is a bit more complicated. In fact, parking service companies provide several templates that the customer can use and personalize. Consequently, spam websites do not partition into a single cluster, but form a large number (as large as the number of templates and customizations) of homogeneous and well-separated clusters. Distilling and keeping updated a comprehensive list of templates is unfeasible, thus a model-based approach is not pursuable. Yet, to save resources a website should be classified before downloading it, thus the classification mechanism should be based on features that can be known in advance.


Based upon the above considerations, our algorithm takes in input a set of home pages resolved from the same name server and classifies it as a regular ISP or as a DPP (Domain Parking Provider). The list of DPPs can subsequently be used during crawling to filter out domains resolving to these NSs, since they are likely to be spam.

As clustering algorithm we used our own modified version of the furthest-point-first (FPF) algorithm for the k-center problem (namely: find a subset of k input elements to use as cluster centers so as to minimize the maximum cluster radius) originally described in [36]. Our implementation returns the same result as the original version; however, it drastically reduces the number of distance computations by leveraging the triangle inequality in metric spaces. Thus, to use our algorithm it is necessary that the distance function defines a metric space.

Let P be an input set of pages and let C = {c1, ..., ck} ⊂ P be the (initially empty) subset of cluster centers. The first center is chosen at random among the elements in P. Then, as a new center, the element p ∈ P such that the distance min_j d(p, c_j) is maximized is selected. The procedure stops when the desired number k of centers is selected; then the remaining elements of P are assigned to the closest center.

In its simplest form, the distance function d() should measure the proportion of tags in common between two pages given their overall length. Keeping track of the relative ordering of the tags in the pages requires using the quadratic algorithm described in [37]. However, d() can be estimated with a computationally less expensive function f() that (ignoring the relative ordering of the tags) computes the Jaccard score of the frequency vectors (a vector that for each tag counts the number of occurrences) of the pages. Since it holds that f(x, y) ≤ d(x, y), if the value of f() is high enough to decide that a certain page is too far from a center, the computation of d() becomes unnecessary.
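A minimal sketch of the FPF selection step with the lower-bound pruning described above (the functions d() and f() are placeholders for the order-aware distance and the Jaccard-based estimate, which are not reproduced here):

```python
import random

def fpf_centers(pages, k, d, f):
    """Furthest-point-first: pick k centers, each maximizing the distance to the chosen ones.

    `d` is the exact (metric) distance, `f` a cheaper lower bound with f <= d.
    """
    centers = [random.choice(pages)]
    best = {id(p): None for p in pages}          # distance to the closest center so far
    while len(centers) < k:
        new_center = centers[-1]
        for p in pages:
            if p is new_center:
                continue
            current = best[id(p)]
            # Pruning: if even the lower bound exceeds the current closest distance,
            # the exact distance cannot improve it and need not be computed.
            if current is not None and f(p, new_center) >= current:
                continue
            dist = d(p, new_center)
            if current is None or dist < current:
                best[id(p)] = dist
        # Next center: the page furthest from its closest center.
        centers.append(max((p for p in pages if p not in centers),
                           key=lambda p: best[id(p)]))
    return centers
```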

3.2.2. Detecting non-NS based spam websites

The method described in [31] focuses only on spam websites parked through DNS-level redirection. Narrowing to crawling purposes only, this choice is agreeable because classification can be done before the download, thus preventing wasted space and bandwidth. For BI applications, however, all types of redirection must be taken into account. Besides DNS-based, there are two other possible redirection types (see [35] for details): at HTTP level and at HTML level.

HTTP redirections are directly managed by web servers, which return a 3XX class status code and a target URL, while HTML redirections are small pieces of code aimed at causing the browser not to render the page but to immediately load another web page following the appropriate URL (i.e. using the meta tag with refresh set at zero time, calling a JavaScript function updating either the href or the location variable, or introducing a full-window iframe tag). Independently from the type, both redirection types cause the browser to land on a target URL belonging to a DPP website [38].

Spam websites using non-DNS redirections can still be detected and filtered out through some considerations on the distribution of in-links per host. According to the famous model of websites in [39], there exist two important categories of websites: hubs (websites with several out-links) and authorities (websites with several in-links). Let C(γ) be the set of pages belonging to an authority website γ. We can suppose that γ is a spam provider if it satisfies the following conditions:

1) most of the in-links of γ are due to redirections;
2) the pages in C(γ) cluster evenly.

Condition 1 prevents considering as spam those pages that link to popular services, while condition 2 is the same hypothesis at the base of our approach.

Fig. 5. Number of hosts per Name Server (NS). The blue bar indicates regular websites, the red bar indicates parked domains. The red boxes on the x axis represent DPP NSs, the black box a misclassified NS.

3.2.3. Evaluation

We performed an experiment to quantify the fraction of relevant pages that would be lost using our approach and the volume of irrelevant pages that would be wrongly left in the corpus. We analyzed a crawl of about 2 million valid web pages hosted by more than 100 thousand different primary name servers (NS) and distributed among the four TLDs ".it", ".com", ".org" and ".net". We then focused on the most represented NSs, which are, indeed, those that most likely could host spam websites, removing those with less than 1000 pages. We obtained a corpus of 541,138 valid home pages distributed among 38 name servers. Subsequently, we used the computer-aided approach described in [31] to classify the home pages into two categories: irrelevant/relevant. As Fig. 5 confirms, most NSs (we obfuscated the full names for privacy reasons) focus only on one business, either regular hosting or domain parking, with only four notable exceptions: NS10, NS11, NS19 and NS36. NS10 and NS19 are two medium-size private NSs belonging to two Italian TV production companies that registered a large number (respectively 87.15% and 89.19%) of dictionary-based domain names for their own future use. NS11 and NS36, instead, are specialized companies for domain parking that also host a fraction (6.23% and 11.43%) of their own regular pages. Notice that NS11 is the only NS misclassified using our method.

Table 1. Per-Website Confusion Matrix

                               Human Classification
                               Relevant      Irrelevant     Total
  Algorithm    Relevant        430,676       25,464         456,140
               Irrelevant      253           84,745         84,998
  Total                        430,929       110,209        541,138

In Table 1 we show in detail the benefits of using our method to filter out spam websites. Out of the 25,464 irrelevant pages misclassified, 7,408 come from NS10 and 4,897 come from NS19. Although these pages are indeed not interesting for the human judge, they cannot properly be considered as spam since they are not simple lists of ads but still contain human-generated text. Only 7,118 pages can directly be ascribed to software misclassification.

The fraction of relevant content that would be lost is negligible and consists of only 253 pages. This result is particularly important for the class of BI applications in which the discovery of not yet emerged new phenomena is crucial.


As a further important observation, since a NS-based classification does not require downloading and storing webpages, in the case of our experiment the elements of the NSs classified as DPP would not be downloaded, thus saving 15.7% of crawling resources.

Summarizing: experiments show that using the NS to classify whether the corresponding websites are regular or spam provides a reasonable accuracy (as high as 95.25%) and is computationally convenient.

4. Classification

Letting a general-purpose crawler loose, it will start downloading the web without discriminating among topics until it is stopped or until its storage/bandwidth resources are completely exhausted. The resulting corpus will consist of a huge mass of documents spread over almost all human knowledge. However, since typical BI applications need to focus only on documents about the investigated topic, a mechanism to filter out unnecessary websites is necessary.

Filtering can be applied either at crawling time (focused crawling) or ex post. In the first case, once a website is discovered, the crawler classifies it and decides whether to download it or not. The most important advantage of this strategy is that of saving resources. In fact, if a website is classified as off topic, it is neither crawled nor stored. On the other hand, discarding a website could cause links to interesting pages to be irreparably lost. A posteriori filtering is much more resource demanding than focused crawling and should be preferred only in those situations in which the application requires crawling a sample as complete as possible. Independently of the approach, however, a mechanism to classify websites is necessary.

4.1. Creation of the Training Set

Supervised classification is the task of assigning a document to a class based on pre-existing examples. This tool has successfully been used in focused crawling as well as in general website classification. The hunger of classifiers for training examples, however, poses severe limitations to the practical usage of supervised classification. In fact, building a large-enough, high quality training set can require a long and tedious manual classification by a domain expert. In order to reduce human involvement in the labeling of examples, some authors proposed approaches that leverage existing knowledge bases. In [40] and [41] the authors use ontologies to drive crawling. In particular, in [40] the authors show how to exploit the newly discovered documents to extend the ontology, obtaining an increasingly more accurate training set, while in [41] the authors describe a framework to compute the relevance of a document to an ontology entity. In [42] the authors experimentally prove that website classification can be done without sacrificing accuracy using only positive examples. All those works, however, have in common the assumption that such a form of domain knowledge exists and is available.

The above assumption is acceptable when dealing with general topics that are more likely to be covered by such a knowledge base. Small and medium enterprises, instead, may be interested in uncommon specialized topics not sufficiently represented in the existing freely available datasets. Using clustering in place of a supervised approach would mean trading classification accuracy for the simplicity and ubiquity of the methodology.

In [20] we faced the problem of using a small description of a class (as small as a weighted list of keywords) to automatically derive from the Web a list of pages that are likely to be good examples to train a classifier. As most semi-supervised methods, our approach attempts to balance the use of human resources with the decrease of classification accuracy. In fact, to the easy task of compiling a list of keywords corresponds the introduction of some misclassified examples and, in turn, a slight reduction of accuracy.


Fig. 6. Graphical representation of the labeling pipeline: a URL is tokenized, checked for page/URL coherence, ranked against the class descriptions and, depending on the thresholds, either discarded or emitted as a classified example.

4.1.1. A semisupervised pipeline for the extraction of training examples

In this section we sketch out our pipeline, leaving the details to the original work in [20]. Both the plethora of informal guides on how to choose a domain name and the scientific literature on the psychological effects of brand names agree that a successful brand should consist in the juxtaposition of a few words. In particular, in [43] the authors perform an experiment in which they discuss the effects of using meaningful brand names. Results show how names conveying the benefits of the product (e.g. smartphone) achieve a commercial advantage over other brands. Going a step further, in [44] the authors analyze the linguistic characteristics of brand names, concluding that semantic appositeness is one of the three main features that contribute to making a brand familiar and, in turn, successful. Although popular brands of the net economy coined in the mid-nineties represent eminent exceptions to the above studies, it is realistic to hypothesize the abundance of domain names consisting of meaningful short sentences semantically related to the website content.

Given a set C = {c1, ..., cn} of n classes, each of which is described as a document, and a URL tokenized as a list of keywords, our approach maps the problem of assigning a website to a class to the problem of ranking classes according to their relevance to the URL text. Figure 6 graphically depicts the sequence of steps of our pipeline.

We used a dictionary-based greedy algorithm to distill a list of words from a URL. The domain name is scanned from left to right until a dictionary word is found, then the word is extracted and the procedure is recursively invoked on the remaining part of the URL. In order to avoid misleading or trivial tokenizations, our algorithm searches words in decreasing order of length. If the procedure does not succeed in using all the characters of the domain, it backtracks to test different alternatives. If no complete tokenization is found, the URL is discarded and the corresponding website is no longer considered as a potential candidate to be added to the training set.
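A small sketch of the greedy, longest-first tokenizer with backtracking (the dictionary is assumed to be a plain set of lowercase words supplied by the caller):

```python
def tokenize_domain(name, dictionary):
    """Split `name` into dictionary words, trying longer words first (with backtracking).

    Returns the list of words, or None if no complete tokenization exists; in the
    latter case the URL is discarded, as described above.
    """
    if name == "":
        return []
    for length in range(len(name), 0, -1):        # longest candidate word first
        word = name[:length]
        if word in dictionary:
            rest = tokenize_domain(name[length:], dictionary)
            if rest is not None:                  # backtrack if the remainder fails
                return [word] + rest
    return None
```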

URL labeling is done using a learning-to-rank approach [45]. This framework provides a background to solve the problem of sorting a set of documents according to a pool of measures of relevance to a query. Each relevance measure induces a ranking that is averaged with the others in order to produce a final ranking. In our case, we used the class descriptions in place of the set of documents and the text extracted from the URL as the query. Relevance is computed through: cosine similarity, generalized Jaccard coefficient [46], Dice coefficient, overlap coefficient, Weighted Coordination Match Similarity [47], BM25 [48], and LMIR.DIR [49]. As label we chose the class with the highest ranking, while as degree of membership we used the average of the relevance measures normalized to the range [0, 1].

In order to minimize the number of mislabelled examples, which would in turn bias the subsequent learning, the labelling pipeline implements two filters: one after tokenization and one after ranking. The first filter consists in measuring the degree of page/URL coherence: our algorithm computes the cosine similarity between the text derived from the URL and the page content, then it verifies that this measure exceeds a user-defined threshold.


The second filter ensures that the class assignment is neither weak nor ambiguous. In particular, weakness is measured as the inverse of the degree of membership to the labelling class, while ambiguity is evaluated as the number of classes whose degree of membership exceeds a pre-defined threshold.

Experiments in [20] over 20,016 webpages belonging to 9 non-overlapping categories show that 32.36% of the URLs can be successfully tokenized and 26.33% of the pages are labeled, with an accuracy of 85.8%.
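A condensed sketch of the labeling step described above (the relevance functions, the coherence measure and the threshold values are abstract placeholders standing for the measures listed earlier, not the settings used in [20]):

```python
def label_url(url_text, class_descriptions, relevance_measures,
              coherence, page_text, t_coherence=0.5, t_membership=0.6):
    """Assign `url_text` to a class, or return None if one of the two filters rejects it.

    `relevance_measures` is a list of functions (query, document) -> score in [0, 1];
    `coherence` computes the page/URL cosine similarity used by the first filter.
    """
    # First filter: page/URL coherence must exceed the user-defined threshold.
    if coherence(url_text, page_text) < t_coherence:
        return None
    # Average the normalized relevance measures for every class (rank aggregation).
    scores = {}
    for name, description in class_descriptions.items():
        values = [measure(url_text, description) for measure in relevance_measures]
        scores[name] = sum(values) / len(values)
    best = max(scores, key=scores.get)
    # Second filter: the assignment must be neither weak nor ambiguous.
    strong = [name for name, score in scores.items() if score >= t_membership]
    if scores[best] < t_membership or len(strong) > 1:
        return None
    return best
```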

5. Conclusion

In this paper we addressed the problem of enabling BI and big data analytics also for those small and medium enterprises that, because of their limited budget and IT resources, have until now been excluded from the benefits of these activities. In particular, we observed that, despite being morphologically different from standard text corpora, the Web as a source of information does not necessarily lead to low quality data. In fact, by applying convenient template removal filters as well as spam filters, it is possible to remove undesired text and pages without altering the useful informative content of the Web. We also showed how careful design choices can contribute to drastically reduce the cost of crawling and, in turn, allow downloading a large-enough portion of the Web so as to make BI statistically significant. Finally, we dealt with the problem of per-topic website classification, showing that a good tradeoff between the human effort of creating a manual description of the classes and the classification accuracy is possible. Summarizing: our experience shows that accurate resource-driven design choices can lead to a satisfactory BI activity also for small and medium businesses without sacrificing quality.

Acknowledgment

This work was supported by the Regione Toscana of Italy under the grant POR CRO 2007/2013 Asse IV Capitale Umano, and by the Italian Registry of the ccTLD .it "Registro.it".

References
[1] Luhn, H. P. (1958). A business intelligence system. IBM J. Res. Dev., 314-319.
[2] Hsinchun, C., Chiang, R. H. L., & Storey, V. C. (2012). Business intelligence and analytics: From big data to big impact. MIS Quarterly, 36.
[3] Hu, H., Wen, Y., Chua, T., & Li, X. (2014). Toward scalable systems for big data analytics: A technology tutorial. IEEE Access, 2, 652-687.
[4] Kumar, M., Bhatia, R., & Rattan, D. (2017). A survey of Web crawlers for information retrieval. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery.
[5] Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., & Wiener, J. (2000). Graph structure in the web. Computer Networks, 33(1), 309-320.
[6] Rowlands, T., Hawking, D., & Sankaranarayana, R. (2010). New-web search with microblog annotations. Proceedings of the 19th International Conference on World Wide Web.
[7] Aiello, L. M., Petkos, G., Martin, C., Corney, D., Papadopoulos, S., Skraba, R., Oker, A., Kompatsiaris, I., & Jaimes, A. (2013). Sensing trending topics in Twitter. IEEE Trans. on Multimedia.
[8] Wang, D., Navathe, S. B., Liu, L., Irani, D., Tamersoy, A., & Pu, C. (2013). Click traffic analysis of short URL spam on Twitter. Proceedings of the 9th Int. Conf. on Collaborative Computing: Networking, Applications and Worksharing (Collaboratecom).
[9] Cho, J., Garcia-Molina, H., & Page, L. (1998). Efficient crawling through URL ordering. Computer Networks and ISDN Systems.
[10] Najork, M., & Wiener, J. L. Breadth-first crawling yields high-quality pages. Proceedings of the 10th International Conference on World Wide Web (WWW '01).
[11] Baeza-Yates, R., Castillo, C., Marin, M., & Rodriguez, A. Crawling a country: Better strategies than breadth-first for web page ordering. Proceedings of the Special Interest Tracks and Posters of the 14th Int. Conf. on World Wide Web (WWW '05).
[12] Castillo, C., Marin, M., Rodriguez, A., & Baeza-Yates, R. (2004). Scheduling algorithms for Web crawling.
[13] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., & Berners-Lee, T. (1999). RFC 2616 - HTTP/1.1, the hypertext transfer protocol. http://w3.org/Protocols/rfc2616/rfc2616.html
[14] Baeza-Yates, R., & Castillo, C. (2004). Crawling the infinite web: Five levels are enough. Algorithms and Models for the Web-Graph.
[15] Iyengar, A. K., Squillante, M. S., & Zhang, L. (1999). Analysis and characterization of large-scale Web server access patterns and performance.
[16] Adamic, L. A., & Huberman, B. A. (2001). The Web's hidden order. Commun. ACM.
[17] Gomes, D., Nogueira, A., Miranda, J., & Costa, M. (2009). Introducing the Portuguese web archive initiative. 8th International Web Archiving Workshop.
[18] William, A., & Tullis, T. (2013). Measuring the User Experience: Collecting, Analyzing, and Presenting Usability Metrics. Newnes.
[19] Lopes, R., Gomes, D., & Carriço, L. (2010). Web not for all: A large scale study of web accessibility. Proceedings of the Int. Cross Disciplinary Conference on Web Accessibility.
[20] Geraci, F., & Papini, T. (2017). Approximating multi-class text classification via automatic generation of training examples. Proceedings of the 18th International Conference on Computational Linguistics and Intelligent Text Processing.
[21] Boldi, P., Codenotti, B., Santini, M., & Vigna, S. (2004). UbiCrawler: A scalable fully distributed web crawler. Software: Practice and Experience.
[22] Olston, C., & Najork, M. (2010). Web crawling. Foundations and Trends in Information Retrieval, 4(3).
[23] Felicioli, C., Geraci, F., & Pellegrini, M. (2011). Medium sized crawling made fast and easy through Lumbricus webis. Int. Conf. on Machine Learning and Cybernetics.
[24] Gyongyi, Z., & Garcia-Molina, H. (2005). Web spam taxonomy. 1st Int. Workshop on Adversarial Information Retrieval on the Web (AIRWeb).
[25] Benko, V. (2017). Are web corpora inferior? The case of Czech and Slovak. Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing.
[26] Khokhlova, M. (2016). Large corpora and frequency nouns. Proceedings of the Int. Conf. on Computational Linguistics and Intellectual Technologies: "Dialogue 2016".
[27] Zhou, L., Burgoon, J. K., Nunamaker, J. F., & Twitchell, D. (2004). Automating linguistics-based cues for detecting deception in text-based asynchronous computer-mediated communications. Group Decision and Negotiation.
[28] Piskorski, J., Sydow, M., & Weiss, D. (2008). Exploring linguistic features for web spam detection: A preliminary study. Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb '08).
[29] Geraci, F., & Maggini, M. (2011). A fast method for web template extraction via a multi-sequence alignment approach. International Joint Conference on Knowledge Discovery, Knowledge Engineering, and Knowledge Management.
[30] Schafer, R. (2017). Accurate and efficient general-purpose boilerplate detection for crawled web corpora. Language Resources and Evaluation, 51(3), 873-889.
[31] Geraci, F. (2015). Identification of web spam through clustering of website structures. Proceedings of the 24th International Conference on World Wide Web.
[32] W3Techs. Usage of content management systems for websites. https://w3techs.com/technologies/overview/content_management/all/
[33] Martin, L., & Gottron, T. (2012). Readability and the Web. Future Internet, 4(1).
[34] Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology.
[35] Almishari, M., & Yang, X. (2010). Ads-portal domains: Identification and measurements. ACM Trans. Web, 4(2).
[36] Gonzalez, T. F. (1985). Clustering to minimize the maximum intercluster distance. Theoretical Computer Science.
[37] Myers, E. W. (1986). An O(ND) difference algorithm and its variations. Algorithmica, 1(1).
[38] Li, Z., Alrwais, S., Xie, Y., Yu, F., & Wang, X. (2013). Finding the linchpins of the dark web: A study on topologically dedicated hosts on malicious web infrastructures. IEEE Symposium on Security and Privacy.
[39] Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM.
[40] Luong, H. P., Gauch, S., & Wang, Q. (2009). Ontology-based focused crawling. Proceedings of the Int. Conf. on Information, Process, and Knowledge Management.
[41] Ehrig, M., & Maedche, A. (2003). Ontology-focused crawling of Web documents. Proceedings of the 2003 ACM Symposium on Applied Computing (SAC '03).
[42] Yu, H., Han, J., & Chang, K. C. (2004). PEBL: Web page classification without negative examples. IEEE Transactions on Knowledge and Data Engineering, 16(1).
[43] Keller, K. L., Heckler, S. E., & Houston, M. J. (1998). The effects of brand name suggestiveness on advertising recall. The Journal of Marketing.
[44] Lowrey, T. M., Shrum, L. J., & Dubitsky, T. M. (2003). The relation between brand-name linguistic characteristics and brand-name memory. Journal of Advertising.
[45] Hang, L. (2011). A short introduction to learning to rank. IEICE Trans. on Information and Systems.
[46] Charikar, M. S. (2002). Similarity estimation techniques from rounding algorithms. Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing.
[47] Wilkinson, R., Zobel, J., & Sacks-Davis, R. (1995). Similarity measures for short queries.
[48] Robertson, S. E. (1997). Overview of the Okapi projects. Journal of Documentation.
[49] Bennett, G., Scholer, F., & Uitdenbogerd, A. (2008). A comparative study of probabilistic and language models for information retrieval. Proceedings of the 19th Conf. on Australasian Database.

Loredana M. Genovese was born in Crotone, Italy, in June 1971. She received her master degree in computer science from the University of Pisa in 2002 with a thesis on parallel replicated databases. She was also awarded a "laurea specialistica" degree in informatics in 2004. After several years spent at a private internet company, she joined the Institute for Informatics and Telematics of the Italian National Research Council in 2013 with the position of research assistant.

Filippo Geraci was born in Sicily, Italy, in September 1977. He received his degree in computer science from the University of Pisa and his Ph.D. in information engineering from the University of Siena. He is currently a researcher at the Institute for Informatics and Telematics of the Italian National Research Council in Pisa. He held the chair of the course of Information Systems for Business Management at the University of Siena. He is tutor of the course of Algorithm Design at the University of Pisa. His research fields are algorithms for the web and bioinformatics.