
BlogForever Crawler: Techniques and Algorithms to Harvest Modern Weblogs

Olivier Blanvillain
École Polytechnique Fédérale de Lausanne (EPFL)
1015 Lausanne, Switzerland
olivier.blanvillain@epfl.ch

Nikos Kasioumis
European Organization for Nuclear Research (CERN)
1211 Geneva 23, Switzerland
[email protected]

Vangelis Banos
Department of Informatics
Aristotle University of Thessaloniki, Greece
[email protected]

ABSTRACT
Blogs are a dynamic communication medium which has been widely established on the web. The BlogForever project has developed an innovative system to harvest, preserve, manage and reuse blog content. This paper presents a key component of the BlogForever platform, the web crawler. More precisely, our work concentrates on techniques to automatically extract content such as articles, authors, dates and comments from blog posts. To achieve this goal, we introduce a simple and robust algorithm to generate extraction rules based on string matching, using the blog's web feed in conjunction with the blog hypertext. This approach leads to a scalable blog data extraction process. Furthermore, we show how we integrate a web browser into the web harvesting process in order to support data extraction from blogs with JavaScript-generated content.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Information filtering, Query formulation, Selection process; D.2.8 [Software Engineering]: Metrics—Complexity measures, Performance measures

General Terms
Design, Algorithms, Performance, Experimentation

Keywords
Blog crawler, web data extraction, wrapper generation

1. INTRODUCTION
Blogs disappear every day [13]. Losing data is obviously undesirable, but even more so when this data has historic, political or scientific value. In contrast to books, newspapers or centralized web platforms, there is no standard method or authority to ensure blog archiving and long-term digital preservation. Yet, blogs are an important part of today's web: WordPress reports more than 1 million new posts and 1.5 million new comments each day [30]. Blogs also proved to be an important resource during the 2011 Egyptian revolution by playing an instrumental role in the organization and implementation of protests [6]. The need to preserve this volatile communication medium is now very clear.

Among the challenges in developing a blog archiving software application is the design of a web crawler capable of efficiently traversing blogs to harvest their content. The sheer size of the blogosphere, combined with an unpredictable publishing rate of new information, calls for a highly scalable system, while the lack of programmatic access to the complete blog content makes the use of automatic extraction techniques necessary. The variety of available blog publishing platforms offers a limited common set of properties that a crawler can exploit, further narrowed by the ever-changing structure of blog contents. Finally, an increasing number of blogs heavily rely on dynamically created content to present information, using the latest web technologies, hence invalidating traditional web crawling techniques.

A key characteristic of blogs which differentiates them from regular websites is their association with web feeds [18]. The primary use of web feeds is to provide a uniform subscription mechanism, allowing users to keep track of the latest updates without the need to actually visit blogs. Concretely, a web feed is an XML file containing links to the latest blog posts along with their articles (abstract or full text) and associated metadata [23]. While web feeds essentially solve the question of update monitoring, their limited size makes it necessary to download blog pages in order to harvest previous content.

This paper presents the open-source BlogForever Crawler, a key component of the BlogForever platform [14] responsible for traversing blogs, extracting their content and monitoring their updates. Our main objective in this work is to design a crawler capable of extracting blog articles, authors, publication dates and comments. Our contributions can be summarized as follows:


• We present a new algorithm to build extraction rules from web feeds. We then derive an optimized reformulation tied to a particular string similarity metric and show that this reformulated algorithm has a linear time complexity.

• We show how to use this algorithm for blog article extraction and how it can be adapted to authors, publication dates and comments.

• We present the overall crawler architecture and the specific components we implemented to efficiently traverse blogs. We explain how our design allows for both modularity and scalability.

• We show how we make use of a complete web browser to render JavaScript-powered web pages before processing them. This step allows our crawler to effectively harvest blogs built with modern technologies, such as the increasingly popular third-party commenting systems.

• We evaluate the content extraction and execution time of our algorithm against three state-of-the-art web article extraction algorithms.

Although our crawler implementation is integrated with the BlogForever platform, the presented techniques and algorithms can be used in other applications related to Wrapper Generation and Web Data Extraction.

2. ALGORITHMS
This section explains in detail the algorithm we developed to extract blog post articles, as well as its variations for extracting authors, dates and comments. Our approach uses blog-specific characteristics to build extraction rules which are applicable throughout a blog. Our focus is on minimising the algorithmic complexity while keeping our approach simple and generic.

2.1 Motivation
Extracting metadata and content from HTML documents is a challenging task. Standards and format recommendations have been around for quite some time, strictly specifying how HTML documents should be organised [26]. For instance, the <h1></h1> tags have to contain the highest-level heading of the page and must not appear more than once per page [27]. More recently, specifications such as microdata [28] define ways to embed semantic information and metadata inside HTML documents, but these still suffer from very low usage: estimated to be used in less than 0.5% of websites [22]. In fact, the majority of websites rely on the generic <span></span> and <div></div> container elements with custom id or class attributes to organise the structure of pages, and more than 95% of pages do not pass HTML validation [29]. Under such circumstances, relying on HTML structure to extract content from web pages is not viable and other techniques need to be employed.

Having blogs as our target websites, we made the following observations, which play a central role in the extraction process:

(a) Blogs provide web feeds: structured and standardized views of the latest posts of a blog.

(b) Posts of the same blog share a similar HTML structure.

(Our experiments on a large dataset of blogs showed that failing tests were either due to a violation of one of these observations, or to an insufficient amount of text in posts.)

Web feeds usually contain about 20 blog posts [20], often fewer than the total number of posts in a blog. Consequently, in order to effectively archive the entire content of a blog, it is necessary to download and process pages beyond the ones referenced in the web feed.

2.2 Content extraction overview
To extract content from blog posts, we proceed by building extraction rules from the data given in the blog's web feed. The idea is to use a set of training data, pairs of HTML pages and target content, to build an extraction rule capable of locating the target content on each HTML page.

Observation (a) allows the crawler to obtain input for the extraction rule generation algorithm: each web feed entry contains a link to the corresponding web page as well as the blog post article (either abstract or full text), its title, authors and publication date. We call these fields targets as they constitute the data our crawler aims to extract. Observation (b) guarantees the existence of an appropriate extraction rule, as well as its applicability to all posts of the blog.
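To make observation (a) concrete, the following sketch collects (page, targets) training pairs from a blog's web feed. It is only an illustration: the paper does not name the libraries used, so feedparser and requests are assumptions, and the field mapping follows the usual RSS/Atom conventions exposed by feedparser. The sketches in this and the following listings are written in Python 3.

    import feedparser
    import requests

    def training_pairs(feed_url):
        """Collect (HTML page, target fields) pairs from a blog's web feed.

        Minimal sketch: the real crawler also has to handle feed quirks,
        encodings, redirects and download failures.
        """
        feed = feedparser.parse(feed_url)
        pairs = []
        for entry in feed.entries:
            page_html = requests.get(entry.link).text
            targets = {
                "article": entry.get("summary", ""),  # abstract or full text
                "title": entry.get("title", ""),
                "author": entry.get("author", ""),
                "date": entry.get("published", ""),
            }
            pairs.append((page_html, targets))
        return pairs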

Algorithm 1 shows the generic procedure we use to build extraction rules. The idea is quite simple: for each (page, target) input, compute, out of all possible extraction rules, the best one with respect to a certain ScoreFunction. The rule which is most frequently the best rule is then returned.

Algorithm 1: Best Extraction Rule
    input : Set pageZipTarget of (page, target) pairs
    output: Best extraction rule

    bestRules ← new list
    foreach (page, target) in pageZipTarget do
        score ← new map
        foreach rule in AllRules(page) do
            extracted ← Apply(rule, page)
            score of rule ← ScoreFunction(extracted, target)
        bestRules ← bestRules + rule with highest score
    return rule with highest occurrence in bestRules

One might notice that each best rule computation is independent and operates on a different input pair. This implies that Algorithm 1 is embarrassingly parallel: iterations of the outer loop can trivially be executed on multiple threads.
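For illustration, the outer loop maps directly onto a process pool. The sketch below assumes the helpers all_rules, apply_rule and score_function (sketched in the following subsections) are available at module level; it is not the authors' implementation.

    from collections import Counter
    from multiprocessing import Pool

    def best_rule_for_pair(pair):
        """Inner loop of Algorithm 1: best-scoring rule for one (page, target) pair."""
        page, target = pair
        return max(all_rules(page),
                   key=lambda rule: score_function(apply_rule(rule, page), target))

    def best_extraction_rule(page_zip_target, processes=4):
        """Outer loop: score each pair independently, then majority-vote the winners."""
        with Pool(processes) as pool:
            best_rules = pool.map(best_rule_for_pair, page_zip_target)
        return Counter(best_rules).most_common(1)[0][0]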

Functions in Algorithm 1 are voluntarily abstract at this point and will be explained in detail in the remainder of this section. Subsection 2.3 defines AllRules, Apply and the ScoreFunction we use for article extraction. In subsection 2.4 we analyse the time complexity of Algorithm 1 and give a linear time reformulation using dynamic programming. Finally, subsection 2.5 shows how the ScoreFunction can be adapted to extract authors, dates and comments.


2.3 Extraction rules and string similarity
In our implementation, rules are queries in the XML Path Language (XPath). Consequently, standard libraries can be used to parse HTML pages and apply extraction rules, providing the Apply function used in Algorithm 1. We experimented with three types of XPath queries: selection over the HTML id attribute, selection over the HTML class attribute and selection using the relative path in the HTML tree. id attributes are expected to be unique, and class attributes have shown in our experiments to have better consistency than relative paths over the pages of a blog. For these reasons we opt to always favour class over path, and id over class, such that the AllRules function returns a single rule per node.

Function AllRules(page)
    rules ← new set
    foreach node in page do
        if node has id attribute then
            rules ← rules + {"//*[@id='node.id']"}
        else if node has class attribute then
            rules ← rules + {"//*[@class='node.class']"}
        else
            rules ← rules + {RelativePathTo(node)}
    return rules
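A possible rendering of AllRules and Apply on top of lxml, following the preference order described above (id over class over path). Quoting of attribute values and multi-valued class attributes are deliberately left simplified; this is a sketch, not the crawler's code.

    import lxml.html

    def all_rules(page_html):
        """One candidate XPath rule per HTML node: id, else class, else path."""
        root = lxml.html.fromstring(page_html)
        tree = root.getroottree()
        rules = set()
        for node in root.iter():
            if node.get("id"):
                rules.add('//*[@id="%s"]' % node.get("id"))
            elif node.get("class"):
                rules.add('//*[@class="%s"]' % node.get("class"))
            else:
                rules.add(tree.getpath(node))  # fall back to the path in the tree
        return rules

    def apply_rule(rule, page_html):
        """Apply an XPath rule and return the concatenated text of the matched nodes."""
        root = lxml.html.fromstring(page_html)
        return " ".join(node.text_content() for node in root.xpath(rule))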

Unsurprisingly, the choice of ScoreFunction greatly influences the running time and precision of the extraction process. When targeting articles, extraction rule scores are computed with a string similarity function comparing the extracted strings with the target strings. We chose the Sørensen–Dice coefficient similarity [4], which is, to the best of our knowledge, the only string similarity algorithm fulfilling the following criteria:

1. Has low sensitivity to word ordering,

2. Has low sensitivity to length variations,

3. Runs in linear time.

Properties 1 and 2 are essential when dealing with cases where the blog's web feed only contains an abstract or a subset of the entire post article. Table 1 gives examples to illustrate how these two properties hold for the Sørensen–Dice coefficient similarity but do not for edit-distance-based similarities such as the Levenshtein [17] similarity.

string1           string2                                Dice    Leven.
"Scheme Scala"    "Scala Scheme"                         90%     50%
"Rachid"          "Richard"                              18%     61%
"Rachid"          "Amy, Rachid and all their friends"    29%     31%

Table 1: Examples of string similarities

The Sørensen–Dice coefficient similarity algorithm operates by first building sets of pairs of adjacent characters, also known as bigrams, and then applying the quotient of similarity formula:

Function Similarity(string1, string2)
    bigrams1 ← Bigrams(string1)
    bigrams2 ← Bigrams(string2)
    return 2 |bigrams1 ∩ bigrams2| / (|bigrams1| + |bigrams2|)

Function Bigrams(string)
    return set of pairs of adjacent characters in string
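The two functions above translate almost verbatim into Python; a small sketch equivalent to the definitions above:

    def bigrams(string):
        """Set of pairs of adjacent characters, e.g. "blog" -> {"bl", "lo", "og"}."""
        return {string[i:i + 2] for i in range(len(string) - 1)}

    def similarity(string1, string2):
        """Sorensen-Dice coefficient over character bigrams, in [0, 1]."""
        bigrams1, bigrams2 = bigrams(string1), bigrams(string2)
        if not bigrams1 and not bigrams2:
            return 1.0  # two empty (or single-character) strings
        return 2.0 * len(bigrams1 & bigrams2) / (len(bigrams1) + len(bigrams2))

For example, similarity("Scheme Scala", "Scala Scheme") evaluates to 0.9, matching the first row of Table 1.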

2.4 Time complexity and linear reformulation
With the functions AllRules, Apply and Similarity (as ScoreFunction) being defined, the definition of Algorithm 1 for article extraction is now complete. We can therefore proceed with a time complexity analysis.

First, let us assume that we have at our disposal a linear time HTML parser that constructs an appropriate data structure, indexing HTML nodes on their id and class attributes, effectively making Apply ∈ O(1). As stated before, the outer loop splits the input into independent computations, and each call to AllRules returns (in linear time) at most as many rules as the number of nodes in its page argument. Therefore, the body of the inner loop will be executed O(n) times. Because each extraction rule can return any subtree of the queried page, each call to Similarity takes O(n), leading to an overall quadratic running time.

We now present Algorithm 2, a linear time reformulation of Algorithm 1 for article extraction using dynamic programming.

Algorithm 2: Linear Time Best Content Extraction Rule
    input : Set pageZipTarget of (page, target) pairs
    output: Best extraction rule

    bestRules ← new list
    foreach (page, target) in pageZipTarget do
        score ← new map
        bigrams ← new map
        bigrams of target ← Bigrams(target)
        foreach node in page with post-order traversal do
            bigrams of node ← Bigrams(node.text) ∪ bigrams of all node.children
            score of node ← 2 |(bigrams of node) ∩ (bigrams of target)| / (|bigrams of node| + |bigrams of target|)
        bestRules ← bestRules + Rule(node with best score)
    return rule with highest occurrence in bestRules

While very intuitive, the original idea of first generating extraction rules and then picking the best ones prevents us from effectively reusing previously computed bigrams (sets of pairs of adjacent characters). For instance, when evaluating the extraction rule for the HTML root node, Algorithm 1 will obtain the complete string of the page and pass it to the Similarity function. At this point, the information on where the string could be split into substrings with already computed bigrams is not accessible, and the bigrams of the page have to be computed by linearly traversing the entire string. To overcome this limitation and implement memoization over the bigram computations, Algorithm 2 uses a post-order traversal of the HTML tree and computes node bigrams from their children's bigrams. This way, we avoid serializing HTML subtrees for each bigram computation and have the guarantee that each character of the HTML page will be read at most once during the bigram computation.

With bigrams computed in this dynamic programming manner, the overall time to compute all Bigrams(node.text) is linear. To conclude the proof that Algorithm 2 runs in linear time, we show that all other computations of the inner loop can be done in constant amortized time. As the number of edges in a tree is one less than the number of nodes, the amortized number of bigram unions per inner-loop iteration tends to one. Each quotient-of-similarity computation requires one bigram intersection and three bigram length computations. Over a finite alphabet (we used printable ASCII), bigram sets have bounded size and each of these operations takes constant time.
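A sketch of Algorithm 2 for a single (page, target) pair, using lxml and the bigrams helper above. The treatment of text that sits between sibling elements (lxml tail text) is simplified, and the final rule construction mirrors the AllRules preference order; it illustrates the post-order memoization rather than reproducing the production code.

    import lxml.html

    def best_rule_for_page(page_html, target_text):
        """Score every node bottom-up, reusing the children's bigram sets."""
        root = lxml.html.fromstring(page_html)
        tree = root.getroottree()
        target = bigrams(target_text)
        best = {"node": root, "score": -1.0}

        def visit(node):
            # Post-order: children are processed first so their bigrams can be reused.
            node_bigrams = bigrams(node.text or "")
            for child in node:
                node_bigrams |= visit(child)
                node_bigrams |= bigrams(child.tail or "")  # text following the child
            denominator = len(node_bigrams) + len(target)
            score = 2.0 * len(node_bigrams & target) / denominator if denominator else 0.0
            if score > best["score"]:
                best["node"], best["score"] = node, score
            return node_bigrams

        visit(root)
        node = best["node"]
        if node.get("id"):
            return '//*[@id="%s"]' % node.get("id")
        if node.get("class"):
            return '//*[@class="%s"]' % node.get("class")
        return tree.getpath(node)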

2.5 Variations for authors, dates, comments
Using string similarity as the only score measurement leads to poor performance on author and date extraction, and is not suitable for comment extraction. This subsection presents variations of the ScoreFunction which address the issues raised by these other types of content.

The case of authors is problematic because authors' names often appear in multiple places of a page, which results in several rules with maximum Similarity score. The heuristic we use to get around this issue consists of adding a new component to the ScoreFunction for author extraction rules: the tree distance between the evaluated node and the post content node. This new component takes advantage of the positioning of a post's author node, which is often a direct child of, or shares its parent with, the post content node.

Dates are affected by the same duplication issue, as well as by inconsistencies of format between web feeds and web pages. Our solution for date extraction extends the ScoreFunction for authors by comparing the extracted string to multiple targets, each being a different string representation of the original date obtained from the web feed. For instance, if the feed indicates that a post was published at "Thu, 01 Jan 1970 00:00:00", our algorithm will search for a rule that returns one of "Thursday January 1, 1970", "1970-01-01", "43 years ago" and so on. So far we do not support dates in multiple languages, but adding new target formats based on language detection would be a simple extension of our date extraction algorithm.

Comments are usually available in separate web feeds, one per blog post. Similarly to blog feeds, comment feeds have a limited number of entries, and when the number of comments on a blog post exceeds this limit, comments have to be extracted from web pages. To do so, we use the following ScoreFunction:

• Rules returning fewer HTML nodes than the number of comments on the feed are filtered out with a zero score,

• The scores of the remaining rules are computed with the value of the maximum weighted matching in the complete bipartite graph G = (U, V, E), where U is the set of HTML nodes returned by the rule, V is the set of target comment fields from the web feed (such as comment authors) and E(u, v) has weight equal to Similarity(u, v).

Our crawler executes this algorithm on each post with an overflow on its comment feed, thus supporting blogs with multiple commenting engines. The comment content is extracted first, allowing us to narrow down the initial filtering by fixing a target number of comments.

Regarding time complexity, computing the tree distance of each node of a graph to a single reference node can be done in linear time, and multiplying the number of targets by a constant factor does not affect the asymptotic computational complexity. Computing the scores of comment extraction rules requires a more expensive algorithm. However, this is compensated by the fact that the proportion of candidates left after filtering out rules not returning enough results is very low in practice. Reformulations analogous to the one done with Algorithm 2 can be straightforwardly applied to each ScoreFunction in order to minimize the time spent in Similarity.
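For illustration, the comment ScoreFunction can be written as follows. The paper does not say how the maximum weighted matching is computed; this sketch solves the underlying assignment problem with SciPy's Hungarian-algorithm implementation on negated similarities, which is one standard way to do it.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def comment_rule_score(extracted_texts, feed_comment_fields):
        """Score a candidate rule by the maximum-weight matching between the
        texts of the HTML nodes it returns and the comment fields of the feed."""
        if len(extracted_texts) < len(feed_comment_fields):
            return 0.0  # the rule cannot account for every comment in the feed
        weights = np.array([[similarity(text, field)
                             for field in feed_comment_fields]
                            for text in extracted_texts])
        rows, cols = linear_sum_assignment(-weights)  # maximise total similarity
        return weights[rows, cols].sum()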

3. ARCHITECTURE
This section provides an overview of the crawler system architecture and the different techniques we used. The overall software architecture is presented and discussed, introducing the Scrapy framework and the enrichments we implemented for our specific usage. Then, we show how we integrated a headless web browser into the harvesting process to support blogs that use JavaScript to display page content. Finally, we talk about the design choices we made in view of a large-scale deployment.

3.1 Overview
Our crawler is built on top of Scrapy (http://scrapy.org/), an open-source framework for web crawling. Scrapy provides an elegant and modular architecture, illustrated in Figure 1. Several components can be plugged into the Scrapy core infrastructure: Spiders, Item Pipeline, Downloader Middlewares and Spider Middlewares, each allowing us to implement a different type of functionality.

Our use case has two types of spiders: NewCrawl and UpdateCrawl, which implement the logic to respectively crawl a new blog and get updates from a previously crawled blog. After being downloaded and identified as blog posts (details in subsection 3.2), pages are packed into Items and sent through the following pipeline of operations (a minimal sketch of a pipeline stage is given after the list):

1. Render JavaScript

2. Extract content

3. Extract comments

4. Download multimedia files

5. Propagate resulting records to the back-end
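As an illustration of the modularity this gives, a Scrapy pipeline stage is a plain class exposing a process_item method, and the stages are wired together in the project settings. The class, module and attribute names below are ours, not necessarily those used in the crawler, and apply_rule refers to the sketch of section 2.

    # pipelines.py -- one stage of the item pipeline (sketch)
    class ContentExtractionPipeline(object):
        def process_item(self, item, spider):
            # The item carries the (already rendered) HTML of a blog post;
            # spider.article_rule is a hypothetical per-blog extraction rule.
            item["article"] = apply_rule(spider.article_rule, item["html"])
            return item  # hand over to the next stage

    # settings.py -- stages run in increasing order of their numbers
    ITEM_PIPELINES = {
        "blogforever.pipelines.JavaScriptRenderingPipeline": 100,
        "blogforever.pipelines.ContentExtractionPipeline": 200,
        "blogforever.pipelines.CommentExtractionPipeline": 300,
        "blogforever.pipelines.MediaDownloadPipeline": 400,
        "blogforever.pipelines.BackendExportPipeline": 500,
    }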


Figure 1: Overview of the crawler architecture. (Credit: Pablo Hoffman, Daniel Grana, Scrapy)

This pipeline design provides great modularity. For example, disabling JavaScript rendering or plugging in an alternative back-end can be done by editing a single line of code.

3.2 Enriching Scrapy
In order to identify web pages as blog posts, our implementation enriches Scrapy with two components that narrow the extraction process down to the subset of pages which are blog posts: a blog post identification function and a download priority heuristic.

Given a URL entry point to a website, the default Scrapy behaviour traverses all the pages of the same domain in a last-in-first-out manner. The blog post identification function is able to identify whether a URL points to a blog post or not. Internally, for each blog, this function uses a regular expression constructed from the blog post URLs found in the web feed. This simple approach requires that blogs use the same URL pattern for all their posts (or false negatives will occur), which has to be distinct from the pattern of pages that are not posts (or false positives will occur). In practice, this assumption holds for all blog platforms we encountered and seems to be a common practice among web developers.
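One simple way to build such a regular expression, not necessarily the authors' exact heuristic, is to keep the path segments shared by all post URLs from the feed literally and to generalise the segments that vary:

    import re
    from urllib.parse import urlparse

    def build_post_url_regex(post_urls):
        """Derive a post-URL pattern from feed examples (assumes equal path depth)."""
        host = re.escape(urlparse(post_urls[0]).netloc)
        paths = [urlparse(u).path.strip("/").split("/") for u in post_urls]
        segments = []
        for position in zip(*paths):  # compare the i-th segment of every URL
            distinct = set(position)
            segments.append(re.escape(distinct.pop()) if len(distinct) == 1
                            else r"[^/]+")
        return re.compile(r"^https?://%s/%s/?$" % (host, "/".join(segments)))

    def is_blog_post(url, pattern):
        return bool(pattern.match(url))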

In order to efficiently deal with blogs that have a large number of pages which are not posts, the blog post identification mechanism is not sufficient. Indeed, after all pages identified as blog posts are processed, the crawler needs to download all other pages to search for additional blog posts. To replace the naive random walk, depth-first search or breadth-first search website traversals, we use a priority queue where priorities for new URLs are determined by a machine learning system. This mechanism has proven to be mandatory for blogs hosted on a single domain alongside a large number of other types of web pages, such as those in forums or wikis.

The idea is to give high priority to URLs which are believed to point to pages with links to blog posts. These predictions are made using an active Distance-Weighted k-Nearest-Neighbour classifier [5]. Let L(u) be the number of links to blog posts contained in a page with URL u. Whenever a page is downloaded, its URL u and L(u) are given to the machine learning system as training data. When the crawler encounters a new URL v, it asks the machine learning system for an estimation of L(v), and uses this value as the download priority of v. L(v) is estimated by calculating a weighted average of the L values of the k URLs most similar to v.
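The paper does not specify the URL distance measure or the weighting scheme of the classifier; the sketch below is one plausible instantiation, reusing the character-bigram similarity of section 2 as a crude URL similarity (an assumption of ours).

    def download_priority(url, observed, k=5):
        """Distance-weighted k-NN estimate of L(url).

        observed is a list of (known_url, link_count) pairs collected so far,
        where link_count is the number of links to blog posts on that page.
        """
        if not observed:
            return 0.0
        nearest = sorted(observed,
                         key=lambda pair: similarity(url, pair[0]),
                         reverse=True)[:k]
        weights = [max(similarity(url, known), 1e-6) for known, _ in nearest]
        counts = [count for _, count in nearest]
        return sum(w * c for w, c in zip(weights, counts)) / sum(weights)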

3.3 JavaScript rendering
JavaScript is a widely used language for client-side scripting. While some applications simply use it for aesthetics, an increasing number of websites use JavaScript to download and display content. In such cases, traditional HTML-based crawlers do not see web pages as they are presented to a human visitor by a web browser, and might therefore be ineffective for data extraction.

In our experiments whilst crawling the blogosphere, we encountered several blogs where the crawled data was incomplete because of the lack of JavaScript interpretation. The most frequent cases were blogs using the Disqus (http://disqus.com/websites) and LiveFyre (http://web.livefyre.com) comment hosting services. For webmasters, these tools are very handy because the entire commenting infrastructure is externalized and their setup essentially comes down to including a JavaScript snippet in each target page. Both of these services heavily rely on JavaScript to download and display the comments, even providing functionalities such as real-time updates for edits and newly written comments. Less commonly, some blogs are fully rendered using JavaScript. When loading such websites, the web browser does not receive the page content as an HTML document, but instead has to execute JavaScript code to download and display the page content. The Blogger platform provides Dynamic Views as a default template, which uses this mechanism [10].

To support blogs with JavaScript-generated content, we embed a full web browser into the crawler. After considering multiple options, we opted for PhantomJS (http://phantomjs.org), a headless web browser with great performance and scripting capabilities. The JavaScript rendering is the very first step of web page processing. Therefore, extracting blog post articles, comments or multimedia files works equally well on blogs with JavaScript-generated content and on traditional HTML-only blogs.

When the number of comments on a page exceeds a certain threshold, both Disqus and LiveFyre will only load the most recent ones and the stream of comments will end with a Show More Comments button. As part of the page loading process, we instruct PhantomJS to repeatedly click on these buttons until all comments are loaded. Paths to the Disqus and LiveFyre Show More buttons are manually obtained. They constitute the only non-generic elements of our extraction stack which require human intervention to maintain and extend to other commenting platforms.
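A sketch of the comment expansion step. The crawler scripts PhantomJS directly; for brevity this example drives it through Selenium's (since retired) PhantomJS driver instead, and the CSS selector is a hypothetical placeholder for the manually maintained button paths.

    import time
    from selenium import webdriver

    SHOW_MORE_SELECTOR = "a.load-more-comments"  # hypothetical, maintained by hand

    def load_all_comments(url, pause=1.0, max_clicks=50):
        driver = webdriver.PhantomJS()  # headless rendering; old Selenium releases only
        driver.get(url)
        for _ in range(max_clicks):
            buttons = driver.find_elements_by_css_selector(SHOW_MORE_SELECTOR)
            if not buttons:
                break  # every comment is now present in the DOM
            buttons[0].click()
            time.sleep(pause)  # give the asynchronous load time to finish
        html = driver.page_source
        driver.quit()
        return html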

3.4 Scalability
When aiming to work with a large amount of input, it is crucial to build every layer of a system with scalability in mind [25]. The BlogForever Crawler, and in particular its two core procedures NewCrawl and UpdateCrawl, is designed to be usable as part of an event-driven, scalable and fault-resilient distributed system.

Heading in this direction, we made the key design choice to have both NewCrawl and UpdateCrawl as stateless components. From a high-level point of view, these two components are purely functional:

NewCrawl : URL → P(RECORD)

UpdateCrawl : URL × DATE → P(RECORD)

where URL, DATE and RECORD are respectively the sets of all URLs, dates and records, and P designates the power set operator. By delegating all shared mutable state to the back-end system, web crawler instances can be added, removed and used interchangeably.

4. EVALUATION
Our evaluation is articulated in two parts. First, we compare the article extraction procedure presented in section 2 with three open-source projects capable of extracting articles and titles from web pages. The comparison will show that our blog-targeted solution has better performance both in terms of success rate and running time. Second, a discussion is held regarding the different solutions available to archive data beyond what is available in the HTML source code. Extraction of authors, dates and comments is not part of this evaluation because of the lack of publicly available competing projects and reference data sets.

In our experiments we used Debian GNU/Linux 7.2, Python 2.7 and an Intel Core i7-3770 3.4 GHz processor. Timing measurements were made on a single dedicated core with garbage collection disabled. The Git repository for this paper (https://github.com/OlivierBlanvillain/bfc-paper) contains the necessary scripts and instructions to reproduce all the evaluation experiments presented in this section. The crawler source code is available under the MIT license from the project's website (https://github.com/BlogForever/crawler).

4.1 Extraction success rates
To evaluate article and title extraction from blog posts we compare our approach to three open-source projects: Readability (https://github.com/gfxmonk/python-readability), Boilerpipe [15] and Goose (https://github.com/GravityLabs/goose), which are implemented in JavaScript, Java and Scala respectively. These projects are more generic than our blog-specific approach in the sense that they are able to identify and extract data directly from HTML source code, and do not make use of web feeds or structural similarities between pages of the same blog (observations (a) and (b)). Table 2 shows the extraction success rates for article and title on a test sample of 2300 blog posts from 230 blogs obtained from the Spinn3r dataset [3].

On our test dataset, Algorithm 1 outperformed the competition by 4.9% on article extraction and 10.1% on title extraction. It is important to stress that Readability, Boilerpipe and Goose rely on generic techniques such as word density, paragraph clustering and heuristics on HTML tagging conventions, which are designed to work for any type of web page.


Target     Our approach    Readability    Boilerpipe    Goose
Article    93.0%           88.1%          79.3%         79.2%
Title      95.0%           74.0%          N/A           84.9%

Table 2: Extraction success rates

On the contrary, our algorithm is only suitable for pages with associated web feeds, as these provide the reference data used to build extraction rules. Therefore, the results shown in Table 2 should not be interpreted as a general quality evaluation of the different projects, but simply as evidence that our approach is more suitable when working with blogs.

4.2 Article extraction running times
In addition to the quality of the extracted data, we also evaluated the running time of the extraction procedure. The main point of interest is the ability of the extraction procedure to scale as the number of posts in the processed blog increases. This corresponds to the evaluation of a NewCrawl task, which is in charge of harvesting all published content on a blog.

Figure 2 shows the cumulated time spent in each article extraction procedure (this excludes common tasks such as downloading pages and storing results) as a function of the number of blog posts processed. We used the Quantum Diaries (http://www.quantumdiaries.org) blog for this experiment.

Data presented in this graph was obtained by taking the arithmetic mean over 10 measurements. These results are believed to be significant given that standard deviations are of the order of 2 milliseconds.

Figure 2: Running time of article extraction. (Cumulative running time in seconds versus number of processed blog posts, for our approach, Readability, Boilerpipe and Goose.)

As illustrated in Figure 2, our approach spends the majority of its total running time between the initialisation and the processing of the first blog post. This initial increase of about 0.4 seconds corresponds to the cost of executing Algorithm 2 to compute the extraction rule for articles. As already mentioned, this consists of computing the best extraction rule of each page referenced by the web feed and picking the most appropriate one. Once we have this extraction rule, processing subsequent blog posts only requires parsing each page and applying the rule, which takes about 3 milliseconds per post and is barely visible on the scale of Figure 2. The other evaluated solutions do not function this way: each blog post is processed as a new and independent input, leading to approximately linear running times.

The vertical dashed line at 15 processed blog posts represents a suitable point of comparison of processing time per blog post. Indeed, as the web feed of our test blog contains 15 blog posts, the extraction rule computation performed by our approach includes the cost of entirely processing these 15 entries. That being said, comparing the raw performance of algorithms implemented in different programming languages is not very informative, given the high variation of running times observed across programming languages [12].

5. RELATED WORK
Our crawler combines ideas from previous work on general web crawlers and wrapper generation algorithms. The word wrapper is commonly used to designate procedures to extract structured data from unstructured documents. We did not use this word in the present paper in favour of the term extraction rule, which better reflects our implementation and is decoupled from the XPath engine that concretely performs the extraction.

Web crawling has been a well-studied topic over the past decade. One direction which we believe to be of crucial importance is that of large scale distributed crawlers. Mercator [11], UbiCrawler [2] and the crawler discussed in [24] are examples of successful distributed crawlers, and the papers describing them provide useful information regarding the challenges encountered when working on a distributed architecture. One of the core issues when scaling out seems to be sharing the list of URLs that have already been visited and those that need to be visited next. While [11] and [24] rely on a central node to hold this information, [2] uses a fully distributed architecture where URLs are divided among nodes using consistent hashing. Both of these approaches require the crawlers to implement complex mechanisms to achieve fault tolerance. The BlogForever Crawler circumvents this problem by delegating all shared mutable state to the back-end system. In addition, since we process web pages on the fly and directly emit the extracted content to the back-end, there is no need for persistent storage on the crawler side. This removes one layer of complexity compared to general crawlers, which need to use a distributed file system ([24] uses NFS, [1] uses HDFS) or implement an aggregation mechanism in order to further exploit the collected data. Our design is similar to the distributed active object pattern presented in [16], further simplified by the fact that the state of the crawler instances is not kept between crawls.

A common approach in web content extraction is to manually build wrappers for the targeted websites. This approach is taken by the crawler discussed in [7], which automatically assigns web sites to predefined categories and gets the appropriate wrapper from a static knowledge base. The limiting factor in this type of approach is the substantial amount of manual work needed to write and maintain the wrappers, which is not compatible with the increasing size and diversity of the web. Several projects try to simplify this process and provide various degrees of automation. This is the case of the Stalker algorithm [19], which generates wrappers based on user-labelled training examples. Some commercial solutions such as the Lixto project [9] simplify the task of building wrappers by offering a complete integrated development environment where the training data set is obtained via a graphical user interface.

Automated solutions use other techniques to identify and extract information directly from the structure and content of the web page. The Boilerpipe project [15] (mentioned in our evaluation) uses text density analysis to extract the main article of a web page. The approach presented in [21] is based on a tree structure analysis of pages with similar templates, such as news websites or blogs. Automatic solutions have also been designed specifically for blogs. Similarly to our approach, Oita and Senellart [20] describe a procedure to automatically build wrappers by matching web feed articles with HTML pages. This work was further extended by Gkotsis, Stepanyan, Cristea and Joy [8] with a focus on extracting content anterior to the one indexed in web feeds. [8] also reports having successfully extracted blog post titles, publication dates and authors, but their approach is less generic than the one used for article extraction. Finally, neither [20] nor [8] provides a complexity analysis, which we believe to be essential before putting an algorithm in production.

6. CONCLUSION AND FUTURE WORK
In this paper, we presented the internals of the BlogForever Crawler. Its central article extraction procedure, based on extraction rule generation, was introduced along with theoretical and empirical evidence validating the approach. A simple adaptation of this procedure that allows the extraction of different types of content, including authors, dates and comments, was then presented. In order to support rapidly evolving web technologies such as JavaScript-generated content, the crawler uses a web browser to render pages before processing them. We also discussed the overall software architecture, highlighting the design choices made to achieve both modularity and scalability. Finally, we evaluated our content extraction algorithm against three state-of-the-art web article extraction algorithms.

Future work could investigate hybrid extraction algorithms to try to achieve near 100% success rates. Indeed, we have observed that the primary causes of failure of our approach were the insufficient quality of web feeds or the high structural variation of blog pages (an in-depth analysis of the causes of failure was not included in this paper given the high amount of manual work required to identify them on problematic pages). This suggests that combining our approach with other techniques such as word density or spatial reasoning could lead to better performance, given that these techniques are insensitive to the above issues.

Another possible research direction would be the deployment of the BlogForever Crawler on a large-scale distributed system. This is particularly relevant in the domain of web crawling given that intensive network operations can be a serious bottleneck. Crawlers greatly benefit from the use of multiple Internet access points, which makes them natural candidates for distributed computing. We intend to explore these opportunities in our future work.

7. ACKNOWLEDGMENTS
We thank our colleagues and friends from CERN, J. Cowton, M. Hobbs and A. Oviedo, for their careful reading and helpful comments that improved the quality of this paper. We are also very grateful to G. Gkotsis from the University of Warwick for generously sharing his research material, time, and ideas with us.

8. REFERENCES
[1] P. Berger, P. Hennig, J. Bross, and C. Meinel. Mapping the Blogosphere – Towards a universal and scalable Blog-Crawler. In Third International Conference on Social Computing, pages 672–677, 2011.
[2] P. Boldi, B. Codenotti, M. Santini, and S. Vigna. UbiCrawler: a scalable fully distributed web crawler. 2003.
[3] K. Burton, N. Kasch, and I. Soboroff. The ICWSM 2011 Spinn3r dataset. In Fifth Annual Conference on Weblogs and Social Media, 2011.
[4] L. R. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297, July 1945.
[5] S. A. Dudani. The Distance-Weighted k-Nearest-Neighbor rule. IEEE Transactions on Systems, Man and Cybernetics, SMC-6(4):325–327, 1976.
[6] N. Eltantawy and J. B. Wiest. Social media in the Egyptian revolution: Reconsidering resource mobilization theory (1-3), 2012.
[7] M. Faheem. Intelligent crawling of web applications for web archiving. In Proceedings of the 21st International Conference Companion on World Wide Web, pages 127–132, 2012.
[8] G. Gkotsis, K. Stepanyan, A. I. Cristea, and M. Joy. Self-supervised automated wrapper generation for weblog data extraction. In Proceedings of the 29th British National Conference on Big Data, BNCOD'13, pages 292–302, Berlin, Heidelberg, 2013.
[9] G. Gottlob, C. Koch, R. Baumgartner, M. Herzog, and S. Flesca. The Lixto data extraction project: Back and forth between theory and practice. In Proceedings of the Twenty-third Symposium on Principles of Database Systems, pages 1–12, 2004.
[10] A. Harasymiv. Blogger Dynamic Views. http://buzz.blogger.com/2011/09/dynamic-views-seven-new-ways-to-share.htm. Last visited 20 Jan 2014.
[11] A. Heydon and M. Najork. Mercator: A scalable, extensible web crawler. World Wide Web, 1999.
[12] R. Hundt. Loop recognition in C++/Java/Go/Scala. In Proceedings of Scala Days, 2011.
[13] K. Johnson. Are blogs here to stay?: An examination of the longevity and currency of a static list of library and information science weblogs. Serials Review, 34(3):199–204, 2008.
[14] N. Kasioumis, V. Banos, and H. Kalb. Towards building a blog preservation platform. World Wide Web, pages 1–27, 2013.
[15] C. Kohlschütter, P. Fankhauser, and W. Nejdl. Boilerplate detection using shallow text features. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10, pages 441–450, New York, NY, USA, 2010.
[16] R. G. Lavender and D. C. Schmidt. Active Object – an object behavioral pattern for concurrent programming. 1996.
[17] V. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, Feb. 1966.
[18] C. Lindahl and E. Blount. Weblogs: simplifying web publishing. Computer, 36(11):114–116, 2003.
[19] I. Muslea, S. Minton, and C. A. Knoblock. Hierarchical wrapper induction for semistructured information sources. Journal of Autonomous Agents and Multi-Agent Systems, 4:93–114, 2001.
[20] M. Oita and P. Senellart. Archiving data objects using web feeds. Sept. 2010.
[21] D. C. Reis, P. B. Golgher, A. S. Silva, and A. F. Laender. Automatic web news extraction using tree edit distance. In Proceedings of the 13th International Conference on World Wide Web, WWW '04, pages 502–511, New York, NY, USA, 2004.
[22] A. Rogers and G. Brewer. Microdata usage statistics. http://trends.builtwith.com/docinfo/Microdata. Last visited 20 Jan 2014.
[23] RSS Advisory Board. RSS 2.0 specification. 2007.
[24] V. Shkapenyuk and T. Suel. Design and implementation of a high-performance distributed web crawler. In 18th International Conference on Data Engineering, 2002. Proceedings, pages 357–368, 2002.
[25] Various authors. The Reactive Manifesto. http://reactivemanifesto.org. Last visited 20 Jan 2014.
[26] Various authors, W3C. W3C standards. http://w3.org/standards. Last visited 20 Jan 2014.
[27] Various authors, W3C. Use h1 for top level heading. http://www-mit.w3.org/QA/Tips/Use_h1_for_Title. Last visited 20 Jan 2014.
[28] WHATWG. Microdata – HTML5 draft standard. http://whatwg.org/specs/web-apps/current-work/multipage/microdata.html. Last visited 20 Jan 2014.
[29] B. Wilson. Metadata analysis and mining application. http://dev.opera.com/articles/view/mama. Last visited 20 Jan 2014.
[30] WordPress. Posting activity. http://wordpress.com/stats/posting. Last visited 20 Jan 2014.