LARGE SCALE TEXT ANALYSIS WITH APACHE SPARK
Ron Daniel Jr., Director, Elsevier Labs
2015-03-31

TRANSCRIPT

Page 1:

LARGE SCALE TEXT ANALYSIS WITH APACHE SPARK

Ron Daniel Jr.

Director, Elsevier Labs

2015-03-31

Page 2:

Large Scale Text Analysis with Apache Spark

Abstract

Elsevier Labs has developed an internal text analysis system, which runs a variety of standard Natural Language Processing steps over our archive of XML documents. We are now extending that basic system by using Spark and other parts of the Berkeley Data Analytics Stack for additional analyses, including ones at interactive speeds. We expect this to let us prototype new features more easily and to look at making current features more efficient. This talk will cover the architecture of the system and our experiences with Spark and other BDAS components.

Page 3:

LARGE SCALE TEXT ANALYSIS USING APACHE SPARK, DATABRICKS, AND THE BDAS STACK

Agenda

•  A Brief Introduction to Spark, BDAS, and Databricks

•  Demo: Word Count in Theory and in Practice

•  External Libraries

•  Demo: Exploring Content with a Concordancer

•  Large-Scale Issues and Architectural Changes

•  Demo: Converting XML to Plain Text for NLP Processing

•  Demo: Abbreviation Handling

•  Conclusion

Page 4:

Spark

•  Modern successor to Hadoop
•  Key Idea: Resilient Distributed Datasets (RDDs)
   •  List of objects, partitioned and distributed to multiple processors
   •  When possible, RDDs remain memory-resident. Will spill to disk if needed.
•  Easier to program than Hadoop's Map-Reduce
   •  Can use Scala anonymous functions or Python lambdas to provide functions inline that will be executed over all objects in an RDD:
      someRDD = anRDD.map(lambda obj: func(obj)).map(lambda obj: func2(obj))
•  Many additional operations (sort, filter, reduce, union, distinct, join, pipe, count, …)
•  Libraries for SQL, Machine Learning, and Graph Processing
•  Support for Scala, Python, Java, SQL, and R (soon)
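To make a few of those operations concrete, here is a minimal PySpark sketch, assuming an active SparkContext `sc`; the data and variable names are illustrative only:

# Build a small RDD and chain some of the operations listed above
nums = sc.parallelize([3, 1, 4, 1, 5, 9, 2, 6])
evens = nums.filter(lambda n: n % 2 == 0)       # keep even numbers
uniq = evens.distinct()                         # drop duplicates
total = uniq.reduce(lambda x, y: x + y)         # sum what remains
print(total)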

[Diagram: a Master node coordinating several Workers, with RDDs split into partitions across the Workers]

Page 5:

For More on Spark:

•  Some very good books are being published
•  https://spark.apache.org/
   •  The provided documentation is good
•  https://databricks.com/spark/developer-resources
   •  Databricks has a lot of additional Spark documentation

Page 6:

Berkeley Data Analytics Stack (BDAS)

•  BDAS is:
   •  Spark Core +
   •  Major libraries built on Spark +
   •  Special I/O and Cluster Management
•  The libraries read and write RDDs, which typically remain in memory between calls
   •  No need to save intermediate results to disk, transform formats, or fire up a new environment to use a separate tool
   •  Much faster end-to-end!
•  BDAS runs on top of well-established standards and/or cluster management systems (e.g. HDFS, S3, HBase, YARN, Mesos)
•  Spark documentation also covers the BDAS libraries
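As a minimal sketch of that in-memory reuse, assuming a SparkContext `sc`, an RDD can be cached once and fed to several computations without touching disk (path reused from the word count example later in the deck):

# Cache an RDD so repeated uses stay in memory
lines = sc.textFile("s3://.../example.txt").cache()
n_lines = lines.count()                                    # first action materializes the cache
n_words = lines.flatMap(lambda s: s.split(" ")).count()    # reuses the cached data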

Page 7:

Databricks Workspace

•  Notebook interface
   •  Like Mathematica or IPython Notebooks
•  Interactive development
   •  Attack small parts of problems and leave an audit trail
   •  Caching of results
•  Documentation and reports

Page 8:

Word Count - the "Hello World" of Big Data


# Make RDD of lines from text file
rdd = sc.textFile("s3://.../example.txt")

# Make 'tall' RDD with words of each line
wordsRDD = rdd.flatMap(lambda s: s.split(" "))

# Put a count of 1 with each word, then sum the counts for each word
result = wordsRDD.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)

Page 9:

"Demo": Word Count


rdd = sc.textFile("s3://...")
words = rdd.flatMap(lambda s: s.split(" "))
result = (words.map(lambda w: (w, 1))
               .reduceByKey(lambda x, y: x + y))

result.take(20)
result.takeSample(False, 20, 1)

# More useful when sorted
srtd_results = (result.map(lambda x: (x[1], x[0]))
                      .sortByKey(False))
srtd_results.take(20)

Note: Indentation is for legibility; the chained calls are wrapped in parentheses so they parse.

Page 10:

Real World Results

•  Data: Customer email feedback
•  Top 20 words are stop words or punctuation
•  Tokenization issues
   •  Trailing punctuation: Youngs, Utilization:
•  Stemming/lemmatization and case folding would be good
   •  utilisation, Utilization:
•  Random sampling over-emphasizes the tail

[(867015, u'the'),
 (618065, u'to'),
 (524285, u'you'),
 (417616, u'and'),
 (368275, u'of'),
 (364962, u'in'),
 (333093, u'your'),
 (329289, u'for'),
 (275007, u'this'),
 (245704, u'is'),
 (228733, u'I'),
 (212472, u'please'),
 (212406, u''),
 (197095, u'Customer'),
 (192925, u'a'),
 (184271, u'be'),
 (184011, u'or'),
 (183961, u'have'),
 (183944, u'-'),
 (175925, u'not')]

(251, u'utilisation'),
(119, u'Michal'),
(87, u'Utilization:'),
(22, u'table,'),
(6, u'Youngs,'),
(2, u'Minus'),
(1, u'neuroregeneration?'),
(1, u'beach-ridge'),
(1, u'www.clch.nhs.uk'),
(1, u'61.237.229.255'),
(1, u'40203'),
(1, u'Duisberg-Essen'),
(1, u'journal/article.')

Page 11:

A More Realistic Word Count

•  Do wordcount with:
   •  Smarter tokenizer
   •  Junk token elimination
   •  Lowercasing
   •  Stopword removal
   •  Porter stemmer?
   •  Minimum count threshold
•  Conclusions:
   •  A custom stopword list is important
   •  Wordcount is not the right tool for determining problems

(13256, u'problem'),
(13099, u'error'),
(13082, u'free'),
(13059, u'user'),
(13044, u'sincerely'),
(13022, u'shall'),
(13001, u'automated'),
(12916, u'profile'),
(12887, u'selection'),
(12845, u'receiving'),
(12827, u'list'),
(12701, u'acknowledgement'),
(12653, u'road'),
(12642, u'comment')

Page 12:

Better Tokenization (and more) with Libraries

import nltk
from nltk.tokenize import TreebankWordTokenizer
from nltk.stem.porter import *

stemmer = PorterStemmer()
words = sentsRDD.flatMap(lambda s:
            TreebankWordTokenizer().tokenize(s.lower()))
stems = words.map(lambda w: stemmer.stem(w))
# 'stopwords' is assumed to be a set defined elsewhere (e.g. a custom stopword list)
stopped = stems.filter(lambda w: w not in stopwords)

Page 13:

Useful Libraries

•  NLTK, NumPy, scikit-learn, and other Python libraries
   •  NLTK is easy to install in Databricks; just give the name of the PyPI package
   •  NumPy & scikit-learn already installed
•  Stanford CoreNLP and other Java libraries in .jar files
   •  Provides more advanced capabilities than NLTK (dependencies, constituents, co-ref)
   •  More difficult to install and use
      •  Must cut unneeded model files from the .jar
      •  Must work around non-serializability of the StanfordCoreNLP class (see the sketch after this list)
•  LXML (Python) or Spark-XML-Utils + Saxon (Scala calling Java)
   •  Can read XML content as strings in RDDs
   •  Very efficient for I/O
   •  XPath and other tools to extract data from the XML for new RDDs
   •  *Plug* Spark-XML-Utils was developed by a coworker and is available from GitHub
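One common workaround for the serializability problem, sketched here in PySpark with an NLTK tokenizer standing in for the heavyweight object (the real fix applies the same pattern to the CoreNLP pipeline), is to construct the object inside mapPartitions so it is built once per partition on each worker instead of being pickled into the closure:

def tokenize_partition(lines):
    # Build the non-serializable helper on the worker, once per partition
    from nltk.tokenize import TreebankWordTokenizer
    tokenizer = TreebankWordTokenizer()
    for line in lines:
        yield tokenizer.tokenize(line)

# Assumes the sentsRDD from the previous slide
tokensRDD = sentsRDD.mapPartitions(tokenize_partition)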

Page 14:

Better Exploration with a Concordancer

concordancer(KVSchemaRDD, r"trouble with ([^\.\,;:\-]+)", 20, 70, 25)

Redacted

Page 15:

Concordancer Source Code

import re

def makeTableRows(pii, str, pattern, pre, post):
    iterator = pattern.finditer(str)
    for match in iterator:
        (lo, hi) = match.span()
        start = max(0, lo - pre)
        end = min(len(str), hi + post)
        # Escape the surrounding context and highlight the match in red
        tmpStr1 = "<tr><td>" + pii + "</td><td style='white-space: nowrap'>" \
            + str[start:lo].replace("&", "&amp;").replace("<", "&lt;") \
            + "<span style='color:red'>" + str[lo:hi].replace("&", "&amp;").replace("<", "&lt;") \
            + "</span>" + str[hi:end].replace("&", "&amp;").replace("<", "&lt;") + "</td></tr>"
        yield tmpStr1

def concordancer(rdd, regex, pre, post, nrows):
    pattern = re.compile(regex)
    rdd1 = rdd.flatMap(lambda row: makeTableRows(row.pii, row.text, pattern, pre, post))
    lis1 = rdd1.take(nrows)
    tmpStr = "\n".join(lis1)
    displayHTML("<table border='1' style='font-family: monospace'>" + tmpStr + "</table>")

Page 16:

Large-Scale Issues

•  Avoid shuffles!
   •  Hash partition the wordcount example for faster reduceByKey (see the sketch after this list)
•  I/O: Many small files give very poor I/O utilization
   •  45 min for 13M text files, 6 min for a Hadoop sequence file
•  Weekly 'cron' job to update the Hadoop Sequence Files
   •  Immutability of RDDs makes HSF maintenance harder
   •  Splittable compression is not so important if HSFs are heavily partitioned
   •  Easy to work with subsets: numeric suffixes handled with wildcards
•  Unreliable writes to S3
   •  Save a copy to HDFS first, then write to S3. No need to re-compute from scratch if the write fails.
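As a minimal sketch of the hash-partitioning point, assuming the `words` RDD from the word count demo, pre-partitioning the pairs by key keeps each key's values co-located so reduceByKey can combine them without a fresh shuffle; 64 partitions is an arbitrary example value:

# Hash-partition the (word, 1) pairs before reducing
pairs = words.map(lambda w: (w, 1)).partitionBy(64)
counts = pairs.reduceByKey(lambda x, y: x + y)   # keys are already co-located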

Page 17:

Content Analysis Toolbench ver. 2 & 3

[Diagram: Architecture of CAT 2. The production ScienceDirect workflow supplies SD XML; initial NLP processing (gazetteer, parse) runs on EC2 against content in S3, and indexing produces annotation files and annotation-search indices for Labs.]

[Diagram: Architecture of CAT 3. Spark and Databricks on EC2 sync against S3, where the ScienceDirect production workflow maintains an XML snapshot and updates (.hsf); INP++ and indexing produce annotations & updates (.par) and annotation-search indices (.par) for Labs.]

Page 18:

<?xml version="1.0" encoding="UTF-8"?><xocs:doc xsi:schemaLocation="http://www.elsevier.com/xml/xocs/dtd http://schema.elsevier.com/dtds/document/fulltext/xcr/xocs-article.xsd" xmlns:xocs="http://www.elsevier.com/xml/xocs/dtd" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.elsevier.com/xml/ja/dtd" xmlns:ja="http://www.elsevier.com/xml/ja/dtd" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:tb="http://www.elsevier.com/xml/common/table/dtd" xmlns:sb="http://www.elsevier.com/xml/common/struct-bib/dtd" xmlns:ce="http://www.elsevier.com/xml/common/dtd" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:cals="http://www.elsevier.com/xml/common/cals/dtd"><xocs:meta><xocs:content-family>serial</xocs:content-family><xocs:content-type>JL</xocs:content-type><xocs:cid>271205</xocs:cid><xocs:srctitle>Vaccine</xocs:srctitle><xocs:normalized-srctitle>VACCINE</xocs:normalized-srctitle><xocs:orig-load-date yyyymmdd="20080809">2008-08-09</xocs:orig-load-date><xocs:ew-transaction-id>2011-07-23T03:13:57</xocs:ew-transaction-id><xocs:eid>1-s2.0-S0264410X08009791</xocs:eid><xocs:pii-formatted>S0264-410X(08)00979-1</xocs:pii-formatted><xocs:pii-unformatted>S0264410X08009791</xocs:pii-unformatted><xocs:doi>10.1016/j.vaccine.2008.07.053</xocs:doi><xocs:item-stage>S300</xocs:item-stage><xocs:item-version-number>S300.2</xocs:item-version-number><xocs:item-weight>FULL-TEXT</xocs:item-weight><xocs:hub-eid>1-s2.0-S0264410X08X00477</xocs:hub-eid><xocs:timestamp yyyymmdd="20111006">2011-10-06T16:33:23.221307-04:00</xocs:timestamp><xocs:dco>0</xocs:dco><xocs:tomb>0</xocs:tomb><xocs:date-search-begin>20081009</xocs:date-search-begin><xocs:year-nav>2008</xocs:year-nav><xocs:indexeddate epoch="1218240000">2008-08-09T00:00:00Z</xocs:indexeddate><xocs:articleinfo>articleinfo crossmark dco dateupdated tomb dateloaded datesearch indexeddate issuelist volumelist yearnav articletitlenorm authfirstinitialnorm authfirstsurnamenorm cid cids contenttype copyright dateloadedtxt docsubtype doctype doi eid ewtransactionid hubeid issfirst issn issnnorm itemstage itemtransactionid itemweight openaccess openarchive pg pgfirst pglast pii piinorm pubdatestart pubdatetxt pubyr sectiontitle sortorder srctitle srctitlenorm srctype subheadings volfirst volissue webpdf webpdfpagecount figure table body affil articletitle auth authfirstini authfull authkeywords authlast primabst ref</xocs:articleinfo><xocs:issns><xocs:issn-primary-formatted>0264-410X</xocs:issn-primary-formatted><xocs:issn-primary-unformatted>0264410X</xocs:issn-primary-unformatted></xocs:issns><xocs:crossmark is-crossmark="0"/><xocs:vol-first>26</xocs:vol-first>…

"Demo": XML Conversion

Effect of concurrent intratracheal lipopolysaccharide and human serum albumin challenge on primary and secondary antibody responses in poultry Abstract Activation of the innate immune system by pathogen-associated molecular patterns (PAMPs) may direct specific immune responses and as a consequence probably significantly affect vaccination. Previously, we described modulation of specific antibody responses to systemically administered model antigens by intravenously (i.v.) as well as intratracheally (i.t.) administered PAMP such as the endotoxin lipopolysaccharide (LPS). In this study effects of various doses of i.t.-administered LPS on primary and secondary specific total and individual isotype (IgM, IgG and IgA)-specific antibody responses of chickens simultaneously i.t. challenged with various doses of human serum albumin (HuSA) were determined. i.t.-administered LPS enhanced primary and secondary HuSA-specific total and isotype-specific antibody titers depending on the dose of LPS and the dose of simultaneously administered HuSA. i.t.-administered HuSA enhanced primary and secondary total antibody responses to ‘environmental’ LPS as shown in birds receiving the zero i.t. LPS treatment, which also depended on dose of HuSA. HuSA administration decreased antibody responses to high doses of LPS. Body weight gain as a measurement of an acute phase cachectin response to LPS was affected by a HuSA by LPS interaction, indicating that simultaneously administered higher doses of HuSA decreased LPS-induced cachectin responses of the birds. Our results suggest a complex interaction of innate and specific immune system activating airborne antigens, which may have significant consequences for vaccination and husbandry management procedures.
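The conversion itself is not shown on the slide; as a minimal sketch of the idea, assuming lxml and an RDD of XML strings named xmlRDD (both names illustrative, not the actual Labs pipeline), the markup can be stripped to plain text like this:

from lxml import etree

def xml_to_text(xml_string):
    # Parse the XML and concatenate all text nodes, dropping the markup
    root = etree.fromstring(xml_string.encode("utf-8"))
    return " ".join(t.strip() for t in root.itertext() if t.strip())

plainRDD = xmlRDD.map(xml_to_text)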

Page 19:

"Demo": Abbreviation Handling •  Common patterns:

•  Three Letter Acronyms (TLA). •  TLA (Three Letter Acronym).

•  Schwartz-Hearst Algorithm:

[Diagram: the Schwartz-Hearst algorithm matching "Three Letter Acronym (TLA)" by scanning right to left from the acronym: look for 'A', then 'L', then 'T' in the preceding text.]
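As a rough illustration, here is a heavily simplified sketch of that core right-to-left matching step; the published Schwartz-Hearst algorithm adds further conditions on lengths and word counts, so this version is illustrative only:

def find_long_form(short_form, preceding_text):
    # Match the short form's characters right-to-left against the text
    s_index = len(short_form) - 1
    l_index = len(preceding_text) - 1
    while s_index >= 0:
        c = short_form[s_index].lower()
        # Move left until the character matches; the short form's first
        # character must additionally start a word
        while l_index >= 0 and (preceding_text[l_index].lower() != c or
                (s_index == 0 and l_index > 0 and
                 preceding_text[l_index - 1].isalnum())):
            l_index -= 1
        if l_index < 0:
            return None            # no long form found
        l_index -= 1
        s_index -= 1
    # The long form runs from the start of the matched word onward
    start = preceding_text.rfind(" ", 0, l_index + 1) + 1
    return preceding_text[start:]

print(find_long_form("TLA", "Three Letter Acronym"))   # -> Three Letter Acronym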

Page 20:

Conclusion

•  The combination of Spark, the notebook UI, and library import provides a very productive environment for text analysis.
•  Data quality depends on detecting errors. An interactive environment, and the ability to use good supporting tools like NLTK's tokenizer instead of toy solutions like split(' '), lead to improved data quality.
•  Very easy to compute a variety of analyses and keep them for future use.
•  Combining the analyses to achieve a goal is the real trick.
•  Combination is slightly simplified by the best practice of using SchemaRDDs and common keys.