boiler pipe

Chris Boston

Contents
• Html Structure
• Goal of html content extraction
• Html Stripping
• Java Options
  - BoilerPipe
  - Apache Tika
• Python Options
  - BoilerPipe Web API
  - Html2Text
  - Beautiful Soup

Html Structure

<HTML>
  <HEAD>
    <TITLE> Title of the page. </TITLE>
  </HEAD>
  <BODY> Page content. </BODY>
</HTML>

Example Page
http://www.minecraft.net/about.jsp
• Navigation links at the top
• Main text in the body
• “Buy Now” on the side
• Copyright on the bottom

Goal of Html extraction
• Given HTML, identify the relevant text.
  - Strip page navigation links.
  - Strip site-specific text (copyright, etc.).
• Many tools can find data based on where it occurs in the structure of the html.
  - In our case, we are trying to strip out words that are obviously related to site functions.

HTML Stripping
• Regular Expression replacement:
  - <[^>]+>
  - In Java:
      String noHTMLString = htmlString.replaceAll("<.*?>", "");
  - In Python:
      re.compile(r'<.*?>').sub('', toStrip)  # Must import re!
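To make the limits of plain tag stripping concrete, here is a minimal Python sketch that applies the `<[^>]+>` pattern to a small made-up page. The helper name `strip_tags` and the sample markup are illustrative assumptions, not part of the original slides.

```python
import re

# Same idea as the regular expression on the slide: match anything between < and >.
TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_tags(html):
    """Remove everything that looks like a tag and collapse the leftover whitespace."""
    text = TAG_PATTERN.sub(" ", html)
    return " ".join(text.split())

# Hypothetical page with navigation, body text, and a footer.
sample = (
    "<html><body>"
    "<div id='nav'><a href='/'>Home</a> <a href='/buy'>Buy Now</a></div>"
    "<p>Main text in the body.</p>"
    "<div id='footer'>Copyright notice</div>"
    "</body></html>"
)

print(strip_tags(sample))
```

The tags are gone, but the navigation and copyright words are still in the output, which is the problem the content extractors discussed next try to solve.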

Example Stripped Page

BoilerPipe
• Tool that intelligently removes html tags (and even irrelevant text).
• Much smarter than a regular expression
• Provides several extraction methods.
• Returns text in a variety of formats.

How Boilerpipe Extracts Content
• Retrieves Html given URL (optional)
• Parses Html to find text content
  - Separates text into text blocks
• Uses a variety of classifiers to determine which blocks are important
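The block-then-classify idea can be illustrated with a toy Python sketch. This is not BoilerPipe's actual algorithm: the splitting rule, the two features (word count and the share of words inside links), and the thresholds are all assumptions chosen only to show the shape of the approach.

```python
import re

# Block-level tags used as split points (an illustrative, incomplete list).
BLOCK_TAG = r"</?(?:p|div|li|ul|h[1-6]|td|tr|table|body|br)\b[^>]*>"

def split_into_blocks(html):
    """Return a list of (text, link_word_count) pairs, one per text block."""
    blocks = []
    for chunk in re.split(BLOCK_TAG, html, flags=re.IGNORECASE):
        # Words that sit inside <a>...</a> count as link words.
        link_text = " ".join(re.findall(r"(?is)<a\b[^>]*>(.*?)</a>", chunk))
        link_words = len(re.sub(r"<[^>]+>", " ", link_text).split())
        text = " ".join(re.sub(r"<[^>]+>", " ", chunk).split())
        if text:
            blocks.append((text, link_words))
    return blocks

def is_content(text, link_words, min_words=8, max_link_density=0.3):
    """Toy classifier: keep blocks that are long enough and not mostly links."""
    words = len(text.split())
    return words >= min_words and link_words / words <= max_link_density

def extract_main_text(html):
    """Join the blocks that the toy classifier accepts."""
    return "\n".join(t for t, lw in split_into_blocks(html) if is_content(t, lw))

sample = """
<div id="nav"><a href="/">Home</a> <a href="/about">About</a> <a href="/buy">Buy Now</a></div>
<p>This paragraph is the main text of the page and it is long enough to count as content.</p>
<p>Copyright notice</p>
"""

print(extract_main_text(sample))
```

On this sample the navigation block is rejected for being short and made only of links, and the copyright line is rejected for being short, so only the paragraph survives.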

BoilerPipe Extractors
• ARTICLE_EXTRACTOR: Specializes in finding articles.
• DEFAULT_EXTRACTOR: Picks up more than just articles. Filters navigation links.
• CANOLA_EXTRACTOR: Extractor based on krdwrd.
• KEEP_EVERYTHING_EXTRACTOR: Gets everything. Could use this for extracting the title.

BoilerPipe Tests
Try BoilerPipe. No setup required!
http://boilerpipe-web.appspot.com/

Getting Started with BoilerPipe
1. Download BoilerPipe:
   http://code.google.com/p/boilerpipe/downloads/detail?name=boilerpipe-1.1.0-bin.tar.gz
2. Extract and add all JARs to path/workspace.
3. Adapt code from next slide.

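Example Java Code

import java.net.URL;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.extractors.ExtractorBase;

// Fetches the page at targetUrl and returns its main article text.
public static String extractFromUrl(String targetUrl) throws Exception {
    ExtractorBase extractor = CommonExtractors.ARTICLE_EXTRACTOR;
    return extractor.getText(new URL(targetUrl));
}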

Apache Tika
• Apache Tika: Java library that can parse many formats, including html. Lets you have a lot of control.
  - http://tika.apache.org/

Apache Tika Features
• Unlike BoilerPipe, Apache Tika can generate an xml parse tree from documents of almost any format.
• Allows traversal of the parse tree as parse events, meaning the entire document need not be in memory at one time to parse it.
• Parses and preserves metadata.

Options for Python
1. Use the BoilerPipe Web API
2. Make a simple helper BoilerPipe JAR, then do the heavy lifting in python.
3. Html2Text
4. Beautiful Soup

1. Web API
• Generate a URL that requests a text file from the test site
  - Unfortunately, your system depends on the site and will fail whenever it is unavailable.
  - You can use different arguments to get different formats.
  - Experiment with the web site to find out what kinds of options are available.
• Url has three parts:
  1. http://boilerpipe-web.appspot.com/extract?url=
  2. http://www.myurl.net/
  3. &extractor=ArticleExtractor&output=text
     - Choose your Extractor type and return type here
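The three-part URL can be assembled as in the Python sketch below. The helper name `build_extract_url` is hypothetical; the base address and the `url`, `extractor`, and `output` parameters are the ones listed above. Using `urlencode` percent-escapes the target address, which plain string concatenation would not do if the target URL had its own query string.

```python
from urllib.parse import urlencode

# Part 1 of the URL, without the trailing "?url=".
BASE = "http://boilerpipe-web.appspot.com/extract"

def build_extract_url(target_url, extractor="ArticleExtractor", output="text"):
    """Combine the service address, the target page, and the extractor options."""
    query = urlencode({"url": target_url, "extractor": extractor, "output": output})
    return BASE + "?" + query

print(build_extract_url("http://www.myurl.net/"))
```

Changing the `extractor` or `output` arguments selects a different extractor type or return format, provided the web service supports the value you pass.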

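1. Web API: Example Python Code

import urllib.parse
import urllib.request
import html2text

def extract(url):
    fullUrl = "http://boilerpipe-web.appspot.com/extract?url="
    fullUrl += urllib.parse.quote(url, safe="")  # escape the target URL
    fullUrl += "&extractor=ArticleExtractor&output=text"
    html = urllib.request.urlopen(fullUrl)
    # Assumes the service replies in UTF-8.
    return html2text.html2text(html.read().decode("utf-8"), fullUrl)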

2. Call Executable JAR from Python

import os

if __name__ == "__main__":
    startingDir = os.getcwd()  # remember the current directory
    jarDir = "Path/To/Jar"
    os.chdir(jarDir)  # change to the directory that holds the JAR
    os.system("java -jar myJar.jar myParameters")
    os.chdir(startingDir)  # change back to where we started
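An alternative sketch uses the standard `subprocess` module, which can run the JAR in its own directory without changing the Python process's working directory and can capture what the JAR prints. The function names are hypothetical, and `myJar.jar` and `myParameters` remain the placeholders from the slide.

```python
import subprocess

def build_jar_command(jar_name, params):
    """Build the argument list for `java -jar <jar_name> <params...>`."""
    return ["java", "-jar", jar_name, *params]

def run_jar(jar_dir, jar_name, params):
    """Run the JAR inside jar_dir and return whatever it writes to standard output."""
    result = subprocess.run(
        build_jar_command(jar_name, params),
        cwd=jar_dir,          # run in the JAR's directory, no os.chdir needed
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout

# Example call (requires Java and the JAR to exist):
# extracted = run_jar("Path/To/Jar", "myJar.jar", ["myParameters"])
```

Passing the arguments as a list avoids shell quoting problems when a parameter, such as a URL, contains characters like `&`.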

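3. Html2Text
• Get it at: http://www.aaronsw.com/2002/html2text/
• Demo also on web site.

Example code

import urllib.request
import html2text

url = "http://www.myurl.net/"  # placeholder: the page to convert
test = urllib.request.urlopen(url)
result = html2text.html2text(test.read().decode("utf-8"), url)  # assumes a UTF-8 page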

4. Beautiful Soup
• Generates a parse tree of a webpage
• Have to find relevant content on your own
• Handles pages made with bad markup

Additional Resources
• CLEANEVAL: A contest for html extractors.
  - http://cleaneval.sigwac.org.uk/

Questions?
