Boilerplate Detection using Shallow Text Features

Christian Kohlschütter, Peter Fankhauser, Wolfgang Nejdl
L3S Research Center / Leibniz Universität Hannover
Appelstr. 9a, 30167 Hannover, Germany
{kohlschuetter, fankhauser, nejdl}@L3S.de

ABSTRACT
In addition to the actual content Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal to retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straight-forward heuristics, achieving a remarkable accuracy.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval

General Terms
Algorithms, Experimentation, Theory

Keywords
Boilerplate Removal, Template Detection, Full-text Extraction, Web Document Modeling, Text Cleaning

1. INTRODUCTION
When examining a Web page, humans can easily distinguish the main content from navigational text, advertisements, related articles and other text portions. A number of approaches have been introduced to automatize this distinction, using a combination of heuristic segmentation and features. However, we are not aware of a systematic analysis of which features are the most salient for boilerplate content. In this paper, we analyse the most popular features used for boilerplate detection on two corpora.
We show that a combination of just two features - number of words and link density - leads to a simple classification model that achieves competitive accuracy. The features have a strong correspondence to stochastic text models introduced in the field of Quantitative Linguistics. Moreover, we show that removing boilerplate content based on these features significantly improves precision on the BLOGS06 benchmark, at almost no cost.

The paper is structured as follows. After shortly reviewing related work in Section 2 we discuss potential features for detecting boilerplate content in Section 3. Section 4 describes our content classification experiments, which we performed in two flavors: the two-class problem for boilerplate/content and a four-class problem specific for the news domain. In Section 5 we give a statistical linguistic interpretation of our observations. In Section 6 we apply the established model to the problem of Information Retrieval and show that precision can significantly be improved. Section 7 concludes with a discussion of further work.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. WSDM'10, February 4–6, 2010, New York City, New York, USA. Copyright 2010 ACM 978-1-60558-889-6/10/02 ...$10.00.

2. RELATED WORK
Approaches to boilerplate detection typically exploit DOM-level features of segments by means of handcrafted rules or trained classifiers, or they identify common, i.e., frequently used segments or patterns/shingles on a website [3, 8, 9, 14, 24]. Using a combination of approaches, Gibson et al. quantify the amount of template content in the Web (40%-50%) [14].
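
To make the notion of shallow text features concrete, the two features named in the Introduction, number of words and link density, can be sketched for a single HTML fragment as follows. This is a minimal illustration and not the implementation evaluated in this paper: the whitespace tokenization, the definition of link density as the fraction of words that occur inside an A tag, and the names BlockFeatures and shallow_features are all assumptions made for the example.

```python
from html.parser import HTMLParser


class BlockFeatures(HTMLParser):
    """Counts all words, and words inside <a> elements, in one HTML fragment."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0       # > 0 while the parser is inside an <a> element
        self.num_words = 0          # all words seen in the fragment
        self.num_linked_words = 0   # words seen inside anchor text

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        # Assumption: a "word" is any whitespace-delimited token.
        n = len(data.split())
        self.num_words += n
        if self.anchor_depth > 0:
            self.num_linked_words += n


def shallow_features(fragment):
    """Return (number of words, link density) for an HTML fragment.

    Link density is taken here to be linked words / all words (an assumption),
    and is defined as 0.0 for a fragment without any words.
    """
    parser = BlockFeatures()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    if parser.num_words == 0:
        return 0, 0.0
    return parser.num_words, parser.num_linked_words / parser.num_words
```

Under these assumptions, a navigation-like fragment that consists only of links receives a link density of 1.0, whereas a sentence of running text with a single short link receives a value close to 0. How the two features are combined into a classifier is left open here.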
Chakrabarti et al. determine the "templateness" of DOM nodes by a classifier based upon regularized isotonic regression [6] using various DOM-level based features, including shallow text features as well as site-level hyperlink information. Yi et al. simplify the DOM structure by deriving a so-called Site Style Tree which is then used for classification [26]. Baluja [2] employs decision tree learning and entropy reduction for template detection at DOM level.

Template detection is strongly related to the more generic problem of web page segmentation, which has been addressed at DOM-level [7], by exploiting term entropies [17] or by using vision-based features [5]. Kohlschütter et al. present a statistical model for the distribution of segment-level text densities, and use the text density ratios of subsequent blocks to identify page-level segments [18, 19].

The CleanEval competition [4] aims at establishing a representative corpus with a gold standard in order to provide a transparent and comparable platform for boilerplate removal experiments. The evaluated algorithms mainly apply machine learning techniques for the classification [4]. For instance, NCleaner [10] utilizes a trained n-gram based language model, and Victor [23] employs a multi-feature sequence-labeling approach
