boiler pipe

Chris Boston

Upload: lokhantorie

Post on 11-Nov-2014


TRANSCRIPT

Page 1: Boiler Pipe

Chris  Boston  

Page 2: Boiler Pipe

Contents
• HTML Structure
• Goal of HTML content extraction
• HTML Stripping
• Java Options
  • BoilerPipe
  • Apache Tika
• Python Options
  • BoilerPipe Web API
  • Html2Text
  • Beautiful Soup

Page 3: Boiler Pipe

HTML Structure
<HTML>
    <HEAD>
        <TITLE>
            Title of the page.
        </TITLE>
    </HEAD>
    <BODY>
        Page content.
    </BODY>
</HTML>

Page 4: Boiler Pipe

Example Page
http://www.minecraft.net/about.jsp
• Navigation links at the top
• Main text in the body
• "Buy Now" on the side
• Copyright on the bottom

Page 5: Boiler Pipe

Goal of HTML extraction
• Given HTML, identify the relevant text.
  • Strip page navigation links.
  • Strip site-specific text (copyright, etc.).
• Many tools can find data based on where it occurs in the structure of the HTML.
  • In our case, we are trying to strip out words that are obviously related to site functions.

 

Page 6: Boiler Pipe

HTML Stripping
• Regular expression replacement:
  • <[^>]+>
  • In Java:
      String noHTMLString = htmlString.replaceAll("\\<.*?>", "");
  • In Python:
      re.compile(r'<.*?>').sub('', toStrip)  # Must import re!
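A minimal sketch of the regular-expression approach above, using the `<[^>]+>` pattern from this slide. The function name `strip_tags` and the sample string are made up for illustration. The sample also shows a weakness of plain stripping: only the tags go away, so text inside a `<script>` element is left behind.

```python
import re

# Pattern from the slide: "<", one or more characters that are not ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_tags(html):
    """Remove anything that looks like an HTML tag; keep all other text."""
    return TAG_PATTERN.sub("", html)

# Hypothetical sample input, not taken from a real page.
sample = "<p>Main <b>text</b></p><script>var x = 1;</script>"
print(strip_tags(sample))  # the tags are gone, but the script body remains
```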

Page 7: Boiler Pipe

Example  Stripped  Page  

Page 8: Boiler Pipe

BoilerPipe
• Tool that intelligently removes HTML tags (and even irrelevant text).
• Much smarter than a regular expression.
• Provides several extraction methods.
• Returns text in a variety of formats.

Page 9: Boiler Pipe

How BoilerPipe Extracts Content
• Retrieves HTML given a URL (optional)
• Parses HTML to find text content
• Separates text into text blocks
• Uses a variety of classifiers to determine which blocks are important
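The parse, split-into-blocks, and classify steps above can be sketched in a few lines of Python. This is a toy illustration, not BoilerPipe's actual classifiers: the two features (word count and share of link text), the thresholds, the `BLOCK_TAGS` set, and the sample page are all assumptions chosen to keep the example small.

```python
from html.parser import HTMLParser

# Tags that start or end a text block in this toy segmentation (an assumption).
BLOCK_TAGS = {"p", "div", "li", "ul", "td", "h1", "h2", "h3", "body"}

class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts the words inside <a> links."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished blocks as (text, link_word_count)
        self._words = []
        self._link_words = 0
        self._in_link = False

    def flush(self):
        """Close the current block, if it holds any words."""
        if self._words:
            self.blocks.append((" ".join(self._words), self._link_words))
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

def extract_content(html, min_words=5, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    kept = []
    for text, link_words in collector.blocks:
        n_words = len(text.split())
        if n_words >= min_words and link_words / n_words <= max_link_density:
            kept.append(text)
    return "\n".join(kept)

# Made-up page with navigation links, main text, and a copyright line.
page = """<html><body>
<ul><li><a href="/">Home</a></li><li><a href="/buy">Buy Now</a></li></ul>
<p>Minecraft is a game about placing blocks and going on adventures.</p>
<div>Copyright 2014</div>
</body></html>"""
print(extract_content(page))
```

On this sample only the paragraph survives: the navigation items and the copyright line are too short, and the navigation items are all link text.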

Page 10: Boiler Pipe

BoilerPipe Extractors
ARTICLE_EXTRACTOR: Specializes in finding articles.
DEFAULT_EXTRACTOR: Picks up more than just articles. Filters navigation links.
CANOLA_EXTRACTOR: Extractor based on krdwrd.
KEEP_EVERYTHING_EXTRACTOR: Gets everything. Could use this for extracting the title.

Page 11: Boiler Pipe

BoilerPipe Tests
Try BoilerPipe. No setup required!
http://boilerpipe-web.appspot.com/

Page 12: Boiler Pipe

Getting Started with BoilerPipe
1. Download BoilerPipe:
   http://code.google.com/p/boilerpipe/downloads/detail?name=boilerpipe-1.1.0-bin.tar.gz
2. Extract and add all JARs to path/workspace.
3. Adapt code from next slide.

Page 13: Boiler Pipe

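Example Java Code
    import java.net.URL;
    import de.l3s.boilerpipe.extractors.CommonExtractors;
    import de.l3s.boilerpipe.extractors.ExtractorBase;
    // Fetch the page at targetUrl and return its main article text.
    public static String extractFromUrl(String targetUrl) throws Exception {
        ExtractorBase extractor = CommonExtractors.ARTICLE_EXTRACTOR;
        return extractor.getText(new URL(targetUrl));
    }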

Page 14: Boiler Pipe

Apache Tika
• Apache Tika: Java library that can parse many formats, including HTML. Lets you have a lot of control.
• http://tika.apache.org/

Page 15: Boiler Pipe

Apache Tika Features
• Unlike BoilerPipe, Apache Tika can generate an XML parse tree from documents of almost any format.
• Allows traversal of the parse tree as parse events, meaning the entire document need not be in memory at one time to parse it.
• Parses and preserves metadata.
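To make the "parse events" point concrete, here is an analogy using Python's standard `html.parser` module. It is not Tika's API. The parser is fed the document in small chunks and reacts to each start tag, end tag, and text event as it arrives, so no tree is ever built in memory. The class name, the chunk size of 16 characters, and the sample page are assumptions for illustration.

```python
from html.parser import HTMLParser

class TitleAndTextCounter(HTMLParser):
    """Handles parse events one at a time; never builds a full tree."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_chars = 0     # number of non-whitespace text characters seen
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate instead of assigning.
        if self._in_title:
            self.title += data
        self.text_chars += sum(1 for ch in data if not ch.isspace())

def parse_in_chunks(chunks):
    """Feed the parser piece by piece and return (title, text_chars)."""
    parser = TitleAndTextCounter()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.title.strip(), parser.text_chars

page = ("<html><head><title>Title of the page.</title></head>"
        "<body>Page content.</body></html>")
chunks = [page[i:i + 16] for i in range(0, len(page), 16)]
print(parse_in_chunks(chunks))
```

In a real program the chunks would come from a file or network stream rather than from a string that is already in memory.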

Page 16: Boiler Pipe

Options for Python
1. Use the BoilerPipe Web API
2. Make a simple helper BoilerPipe JAR, then do the heavy lifting in Python.
3. Html2Text
4. Beautiful Soup

Page 17: Boiler Pipe

1. Web API
• Generate a URL that requests a text file from the test site.
  • Unfortunately, your system will fail if the site is unavailable.
  • You can use different arguments to get different formats.
  • Experiment with the web site to find out what kinds of options are available.
• The URL has three parts:
  1. http://boilerpipe-web.appspot.com/extract?url=
  2. http://www.myurl.net/
  3. &extractor=ArticleExtractor&output=text
     (Choose your extractor type and return type here.)
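The three URL parts can be assembled with the standard library, which also percent-encodes the target URL so that its own `:` and `/` characters do not confuse the query string. This is a sketch under assumptions: the function names and the timeout are made up, and only the `ArticleExtractor` and `text` values shown on the slide are used. Because the system fails when the site is unavailable, `fetch_text` returns `None` in that case so the caller can fall back to another method.

```python
import urllib.parse
import urllib.request

# Part 1 of the URL: the service endpoint from the slide.
SERVICE = "http://boilerpipe-web.appspot.com/extract"

def build_extract_url(page_url, extractor="ArticleExtractor", output="text"):
    """Join the three parts: endpoint, encoded target URL, and options."""
    query = urllib.parse.urlencode(
        {"url": page_url, "extractor": extractor, "output": output}
    )
    return SERVICE + "?" + query

def fetch_text(page_url, timeout=10):
    """Return the extracted text, or None if the service cannot be reached."""
    try:
        with urllib.request.urlopen(build_extract_url(page_url), timeout=timeout) as reply:
            return reply.read().decode("utf-8", errors="replace")
    except OSError:  # urllib.error.URLError is a subclass of OSError
        return None

print(build_extract_url("http://www.myurl.net/"))
```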

 

Page 18: Boiler Pipe

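1. Web API: Example Python Code
    import urllib.parse
    import urllib.request
    import html2text

    def extract(url):
        fullUrl = "http://boilerpipe-web.appspot.com/extract?url="
        fullUrl += urllib.parse.quote(url, safe="")  # encode the target URL
        fullUrl += "&extractor=ArticleExtractor&output=text"
        html = urllib.request.urlopen(fullUrl)
        return html2text.html2text(html.read().decode("utf-8"), fullUrl)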

Page 19: Boiler Pipe

2. Call Executable JAR from Python
    import os

    if __name__ == "__main__":
        startingDir = os.getcwd()  # remember the current directory
        jarDir = "Path/To/Jar"
        os.chdir(jarDir)  # change to our test directory
        os.system("java -jar myJar.jar myParameters")
        os.chdir(startingDir)  # change back to where we started
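If the Python side needs the text that the JAR prints, one option is `subprocess.run`, which can run the command in a given directory and capture its output without changing the current directory of the Python process. This is a sketch, not the slide author's method. The helper names are made up, and `myJar.jar`, `myParameters`, and `Path/To/Jar` are the placeholders from the slide. The demo line runs the Python interpreter itself as a stand-in command so the example works without Java installed.

```python
import subprocess
import sys

def build_jar_command(jar_name, *params):
    """Build the argument list for: java -jar <jar_name> <params...>"""
    return ["java", "-jar", jar_name, *params]

def run_and_capture(command, working_dir=None):
    """Run a command in working_dir and return what it wrote to stdout."""
    completed = subprocess.run(
        command, cwd=working_dir, capture_output=True, text=True, check=True
    )
    return completed.stdout

# With a real helper JAR, the call would look like this (placeholders from the slide):
#   text = run_and_capture(build_jar_command("myJar.jar", "myParameters"), "Path/To/Jar")

# Stand-in command so the sketch runs anywhere Python does.
print(run_and_capture([sys.executable, "-c", "print('hello from the child process')"]))
```

With `check=True`, a non-zero exit status from the command raises `subprocess.CalledProcessError` instead of failing silently.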

Page 20: Boiler Pipe

3. Html2Text
• Get it at: http://www.aaronsw.com/2002/html2text/
• Demo also on web site.
Example code
    import html2text
    import urllib.request

    # url holds the address of the page to convert
    test = urllib.request.urlopen(url)
    result = html2text.html2text(test.read().decode("utf-8"), url)

Page 21: Boiler Pipe

4. Beautiful Soup
• Generates a parse tree of a web page
• Have to find relevant content on your own
• Handles pages made with bad markup

Page 22: Boiler Pipe

Additional Resources
• CLEANEVAL: A contest for HTML extractors.
  • http://cleaneval.sigwac.org.uk/

Page 23: Boiler Pipe

Questions?
• HTML Structure
• Goal of HTML content extraction
• HTML Stripping
• Java Options
  • BoilerPipe
  • Apache Tika
• Python Options
  • BoilerPipe Web API
  • Html2Text
  • Beautiful Soup