boiler pipe
Chris Boston
Contents
• Html Structure
• Goal of html content extraction
• Html Stripping
• Java Options
  • BoilerPipe
  • Apache Tika
• Python Options
  • BoilerPipe Web API
  • Html2Text
  • Beautiful Soup
Html Structure
<HTML>
<HEAD> <TITLE> Title of the page. </TITLE> </HEAD>
<BODY> Page content. </BODY>
</HTML>
Example Page
http://www.minecraft.net/about.jsp
• Navigation links at the top
• Main text in the body
• “Buy Now” on the side
• Copyright on the bottom
Goal of Html extraction
• Given HTML, identify the relevant text.
  • Strip page navigation links.
  • Strip site-specific text (copyright, etc.).
• Many tools can find data based on where it occurs in the structure of the html.
• In our case, we are trying to strip out words that are obviously related to site functions.
HTML Stripping
• Regular Expression replacement:
  • <[^>]+>
• In Java:
  • String noHTMLString = htmlString.replaceAll("\\<.*?>", "");
• In Python:
  • re.compile(r'<.*?>').sub('', toStrip)  # Must import re!
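A minimal Python sketch of the regular expression replacement above, applied to the page from the Html Structure slide. The names TAG_PATTERN and strip_tags are made up for this example. Note what it does not do: it only deletes tags, so navigation links and copyright text would survive as plain words.

```python
import re

# Matches anything that looks like a tag: "<", one or more non-">" characters, ">".
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(html_string):
    """Delete every tag-like span and return the text that is left."""
    return TAG_PATTERN.sub("", html_string)


page = ("<HTML><HEAD><TITLE>Title of the page.</TITLE></HEAD>"
        "<BODY>Page content.</BODY></HTML>")
print(strip_tags(page))
```

The title and the body text run together in the result because nothing marks where one element ended and the next began.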
Example Stripped Page
BoilerPipe
• Tool that intelligently removes html tags (and even irrelevant text).
• Much smarter than a regular expression
• Provides several extraction methods.
• Returns text in a variety of formats.
How Boilerpipe Extracts Content
• Retrieves Html given URL (optional)
• Parses Html to find text content
• Separates text into text blocks
• Uses variety of classifiers to determine which blocks are important
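To make the parse / split into blocks / classify steps concrete, here is a toy sketch using only the Python standard library. It is not BoilerPipe's code: the class BlockSplitter, the function extract_main_text, the list of block tags, and both thresholds are assumptions chosen for illustration. BoilerPipe's real classifiers are more elaborate.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (a rough, assumed list).
BLOCK_TAGS = {"p", "div", "li", "ul", "td", "tr", "h1", "h2", "h3", "br", "body", "title"}


class BlockSplitter(HTMLParser):
    """Splits a page into text blocks and counts the words inside <a> links."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks: (text, word_count, link_word_count)
        self._words = []       # words of the block being built
        self._link_words = 0   # how many of those words are link text
        self._in_link = 0      # depth of open <a> tags

    def flush_block(self):
        """Close the current block, if it has any words."""
        if self._words:
            self.blocks.append((" ".join(self._words), len(self._words), self._link_words))
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush_block()

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush_block()

    def handle_data(self, data):
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)


def extract_main_text(html, min_words=5, max_link_density=0.3):
    """Keep blocks that are long enough and are not mostly link text."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    splitter.flush_block()
    kept = [text for text, words, link_words in splitter.blocks
            if words >= min_words and link_words / words <= max_link_density]
    return "\n".join(kept)


sample = """<div><a href="/">Home</a> <a href="/buy">Buy Now</a> <a href="/help">Help</a></div>
<p>Minecraft is a game about placing blocks to build anything you can imagine.</p>
<div>Copyright Mojang</div>"""
print(extract_main_text(sample))
```

Short blocks and link-heavy blocks (navigation, "Buy Now", copyright) are dropped, which is the kind of decision a regular expression cannot make.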
BoilerPipe Extractors
ARTICLE_EXTRACTOR: Specializes in finding articles.
DEFAULT_EXTRACTOR: Picks up more than just articles. Filters navigation links.
CANOLA_EXTRACTOR: Extractor based on krdwrd.
KEEP_EVERYTHING_EXTRACTOR: Gets everything. Could use this for extracting the title.
BoilerPipe Tests
Try BoilerPipe. No setup required!
http://boilerpipe-web.appspot.com/
Getting Started with BoilerPipe
1. Download BoilerPipe:
   http://code.google.com/p/boilerpipe/downloads/detail?name=boilerpipe-1.1.0-bin.tar.gz
2. Extract and add all JARs to path/workspace.
3. Adapt code from next slide.
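Example Java Code
import java.net.URL;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.extractors.ExtractorBase;

// Fetches the page at targetUrl and returns its main article text.
public static String extractFromUrl(String targetUrl) throws Exception {
    ExtractorBase extractor = CommonExtractors.ARTICLE_EXTRACTOR;
    return extractor.getText(new URL(targetUrl));
}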
Apache Tika
• Apache Tika: Java library that can parse many formats, including html. Lets you have a lot of control.
• http://tika.apache.org/
Apache Tika Features
• Unlike BoilerPipe, Apache Tika can generate an xml parse tree from documents of almost any format.
• Allows traversal of the parse tree as parse events, meaning the entire document need not be in memory at one time to parse it.
• Parses and preserves metadata.
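The "parse events" idea can be illustrated without Tika. The sketch below uses Python's standard html.parser, not Tika's API: the document arrives in chunks, each tag or piece of text triggers a callback, and only the collected metadata and text are kept. The names EventCollector and parse_in_chunks are made up for this example.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Receives parse events one at a time instead of building a full tree."""

    def __init__(self):
        super().__init__()
        self.metadata = {}      # title and <meta name=... content=...> pairs
        self.text_parts = []    # raw text pieces outside the title
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs and "content" in attrs:
            self.metadata[attrs["name"]] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        # Separate the text of neighbouring elements with a space.
        self.text_parts.append(" ")

    def handle_data(self, data):
        # Text may arrive in several pieces, so always append.
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data
        else:
            self.text_parts.append(data)


def parse_in_chunks(chunks):
    """Feed the document piece by piece; return (metadata, text)."""
    parser = EventCollector()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    text = " ".join("".join(parser.text_parts).split())  # normalize whitespace
    return parser.metadata, text


chunks = ['<html><head><title>Ab', 'out</title><meta name="author" con',
          'tent="Chris"></head><body><p>Page con', 'tent.</p></body></html>']
print(parse_in_chunks(chunks))
```

The chunks here are split at awkward places on purpose, to show that the parser copes with input that is never held in memory as one string.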
Options for Python
1. Use the BoilerPipe Web API
2. Make a simple helper BoilerPipe JAR, then do the heavy lifting in Python.
3. Html2Text
4. Beautiful Soup
1. Web API
• Generate a URL that requests a text file from the test site
• Unfortunately, your system will fail if the site is unavailable.
• You can use different arguments to get different formats.
• Experiment with using the web site to find out what kinds of options are available.
• Url has three parts:
  1. http://boilerpipe-web.appspot.com/extract?url=
  2. http://www.myurl.net/
  3. &extractor=ArticleExtractor&output=text
     • Choose your Extractor type and return type here
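The three URL parts can be assembled with a small helper. This is a sketch: build_extract_url is a made-up name, and percent-encoding the target URL is an addition of this example (the slide simply concatenates the strings), included because a target URL containing "&" or "?" would otherwise be misread as extra arguments.

```python
from urllib.parse import urlencode

# Part 1 of the URL, without the trailing "?url=".
BASE_URL = "http://boilerpipe-web.appspot.com/extract"


def build_extract_url(page_url, extractor="ArticleExtractor", output="text"):
    """Join base URL, encoded target URL, and the extractor/output arguments."""
    query = urlencode({"url": page_url, "extractor": extractor, "output": output})
    return BASE_URL + "?" + query


print(build_extract_url("http://www.myurl.net/"))
```

Changing the extractor or output argument is how you choose the extractor type and return type mentioned in the last bullet.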
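1. Web API: Example Python Code
import urllib.request  # Python 3; in Python 2 this was urllib.urlopen
import html2text       # third-party module, see 3. Html2Text

def extract(url):
    fullUrl = "http://boilerpipe-web.appspot.com/extract?url="
    fullUrl += url
    fullUrl += "&extractor=ArticleExtractor&output=text"
    html = urllib.request.urlopen(fullUrl)
    # read() returns bytes; decoding as UTF-8 is an assumption about the response
    return html2text.html2text(html.read().decode("utf-8"), fullUrl)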
2. Call Executable JAR from Python
import os

if __name__ == "__main__":
    startingDir = os.getcwd()  # remember the current directory
    jarDir = "Path/To/Jar"
    os.chdir(jarDir)  # change to our test directory
    os.system("java -jar myJar.jar myParameters")
    os.chdir(startingDir)  # change back to where we started
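A possible variation on the slide above, using subprocess instead of os.system. It runs the JAR in its own directory without changing the directory of the Python process, and it returns whatever the JAR prints. The helper names build_java_command and run_jar are made up, and the sketch assumes the JAR writes its result to standard output.

```python
import subprocess


def build_java_command(jar_name, jar_args=()):
    """Return the argument list for: java -jar <jar_name> <args...>"""
    return ["java", "-jar", jar_name, *jar_args]


def run_jar(jar_dir, jar_name, jar_args=()):
    """Run the JAR with jar_dir as working directory and return its stdout."""
    result = subprocess.run(
        build_java_command(jar_name, jar_args),
        cwd=jar_dir,          # replaces the os.chdir() calls
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode output to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


if __name__ == "__main__":
    print(run_jar("Path/To/Jar", "myJar.jar", ["myParameters"]))
```

Passing the command as a list avoids shell quoting problems when a parameter, such as a URL, contains special characters.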
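3. Html2Text
• Get it at: http://www.aaronsw.com/2002/html2text/
• Demo also on web site.
Example code
import html2text
import urllib.request  # Python 3; in Python 2 this was urllib.urlopen

url = "http://www.myurl.net/"  # placeholder: the page to convert
test = urllib.request.urlopen(url)
# read() returns bytes; decoding as UTF-8 is an assumption about the page
result = html2text.html2text(test.read().decode("utf-8"), url)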
4. Beautiful Soup
• Generates a parse tree of a webpage
• Have to find relevant content on your own
• Handles pages made with bad markup
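A small sketch of what "find relevant content on your own" can look like, assuming the bs4 package (Beautiful Soup 4) is installed. The function name body_text_without_links and the rule it applies (drop links, scripts, and styles, then keep the body text) are choices made for this example, not a recommended extraction method.

```python
from bs4 import BeautifulSoup


def body_text_without_links(html):
    """Parse the page, remove link/script/style elements, return the body text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(["a", "script", "style"]):
        tag.decompose()  # remove the element and everything inside it
    body = soup.body if soup.body is not None else soup
    return body.get_text(separator=" ", strip=True)


# Unclosed <p> tags and a missing </html>: bad markup still yields a tree.
messy = "<html><body><div><a href='/'>Home</a><p>Main text<p>More text</div></body>"
print(body_text_without_links(messy))
```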
Additional Resources
• CLEANEVAL: A contest for html extractors.
  • http://cleaneval.sigwac.org.uk/
Questions?