TRANSCRIPT
Text Analysis: Methods for Searching, Organizing, Labeling and Summarizing Document Collections
Danny Dunlavy
Computer Science and Informatics Department (1415)
Sandia National Laboratories
CSRI Student Seminar Series, July 16, 2008
SAND2008-4999P
Outline
• Introduction
• Motivational Problems
• Data
• Analysis Pipeline
• Transformation, Analysis, and Post-processing
• Hybrid Systems
• Examples
• Conclusions
Introduction
• Knowledge discovery
– Goal of text analysis
– Data → information → knowledge
• Challenges
– Too much information to process manually
– Data ambiguity
• Word sense, multilingual, errors, weak signals
– Heterogeneous data sources
– Interpretability
• Goals of this talk
– Exposure to research in text analysis at Sandia
– Focus on methods based on mathematical principles
Example 1: Information Retrieval
Problem: ambiguous queries lead to information overload and topic confusion
Solutions: optimization, linear algebra, machine learning, and probabilistic modeling
[Figure: search results for an ambiguous name query — basketball player at ranks 5, 50, 109, …; mathematician; jazz musician? at rank > 200]
Example 2: Spam Detection
[Figure: raw e-mail headers showing spam-filter metadata — X-TMWD-Spam-Summary, X-TMWD-IP-Reputation, X-MMS-Spam-Summary, X-MMS-Spam-Filter-ID, X-PMX-Version, X-PerlMx-Spam (Gauge=IIIIIII, Probability=7%) — for a message from Dianne O'Leary dated Mon, 14 Jul 2008]
Techniques and representative products:
• Bayesian statistics: SpamAssassin, Cloudmark Authority, MailSweeper Business Suite
• Neural networks: SurfControl E-mail Filter, AntiSpam for SMTP
• Signature (hash) analysis: Cloudmark SpamNet, IM Message Inspector
• Lexical analysis: Brightmail Anti-Spam, Tumbleweed Email Firewall
• Heuristic patterns: McAfee SpamKiller, Brightmail Anti-Spam
[S. Ali and Y. Xiang (2007), "Spam Classification Using Adaptive Boosting Algorithm," Proc. ICIS 2007.]
Solutions: optimization, linear algebra, machine learning, probabilistic modeling
Problem: ambiguity in term meaning/usage and deliberate deceit create confusion between spam and good e-mail
Example 3: Topic Detection and Association
http://cloud.clusty.com
http://www.kartoo.com
Problem: determine topics in text collections and identify the most important, novel, or significant relationships
Clustering and visualization are key analysis methods
Solutions: optimization, linear algebra, machine learning, probabilistic modeling, and visualization
Text Data
• Text collection(s)
– Corpus (corpora)
• Structured
– Database fielded data
• Semi-structured
– XML, HTML
• Unstructured
– Formal: newspaper articles, scientific articles, business reports, …
– Informal: e-mail, chat, code comments, …
• Other characteristics
– Incomplete, noisy (errors, ambiguity), multilingual
Unstructured Text Processing
[Diagram: a common data model for semi-structured data (e-mail, web pages, blogs, etc.) and unstructured data (reports, newswire, etc.). Each source carries metadata — raw source index, date collected, source reliability, processing tool, parameters used, date processed — and data: named entities, relationships, facts, events; for e-mail, headers (to, from, date, subject, etc.), message body, and attachments. Processing turns raw text into data ready for analysis.]
Text Analysis Pipeline
• Ingestion: file readers (ASCII, UTF-8, XML, PDF, ...)
• Pre-processing: tokenization, stemming, part-of-speech tagging, named entity extraction, sentence boundaries
• Transformation: data model, dimensionality reduction, feature weighting, feature extraction/selection
• Analysis: information retrieval, clustering, summarization, classification, pattern recognition, statistics
• Post-processing: visualization, filtering, summary statistics
• Archiving: database, file, web site
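The first three stages above can be sketched in a few lines; this is a minimal illustrative pipeline, not the actual Sandia implementation, and the stopword list and tokenizer are assumptions.

```python
import re
from collections import Counter

# Hypothetical sketch of ingestion -> pre-processing -> transformation.
STOPWORDS = {"a", "an", "the", "is", "of", "in", "and"}

def ingest(raw_bytes):
    """Ingestion: decode a raw file (here, assume UTF-8 text)."""
    return raw_bytes.decode("utf-8", errors="replace")

def preprocess(text):
    """Pre-processing: lowercase, tokenize, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def transform(docs_tokens):
    """Transformation: build term counts (a minimal vector space model)."""
    vocab = sorted({t for toks in docs_tokens for t in toks})
    return vocab, [Counter(toks) for toks in docs_tokens]

docs = [b"A hurricane is a catastrophe.", b"An earthquake is bad."]
tokens = [preprocess(ingest(d)) for d in docs]
vocab, counts = transform(tokens)
```

The later stages (analysis, post-processing, archiving) would consume `vocab` and `counts` downstream.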
Vector Space Model
• Vector Space Model for Text
– Terms (features): t_1, …, t_m
– Documents (objects): d_1, …, d_n
– Term-document matrix: A = [a_ij]
– a_ij: measure of importance of term i in document j
• Term Examples
– Sentence: “Danny re-sent $1.”
– Words: danny, sent, re [# chars?], $ [sym?], 1 [#?], re-sent [-?]
– n-grams (3): dan, ann, nny, ny_, _re, re-, e-s, sen, ent, nt_, …
– Named entities (people, orgs, money, etc.): danny, $1
• Document Examples
– Documents, paragraphs, sentences, fixed-size chunks
[G. Salton, A. Wong, and C. S. Yang (1975), "A Vector Space Model for Automatic Indexing," Comm. ACM, 18(11), 613–620.]
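A term-document matrix for a toy corpus can be built directly from the definitions above; the log term-frequency weighting here is one common choice of "importance", an assumption rather than the weighting used in the talk.

```python
import numpy as np

# Build A = [a_ij]: rows are terms, columns are documents.
docs = [
    ["hurricane", "hurricane", "catastrophe"],
    ["example", "catastrophe", "hurricane"],
]
terms = sorted({t for d in docs for t in d})
A = np.zeros((len(terms), len(docs)))
for j, d in enumerate(docs):
    for t in d:
        A[terms.index(t), j] += 1.0
A = np.log1p(A)  # a_ij = log(1 + tf_ij): damp the effect of repeated terms
```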
Feature Extraction: Dimension Reduction
• Goal: find new, smaller set of features (dimensions) that best captures variability, correlations, or structure in the data
• Methods
– Principal component analysis (PCA)
• Eigenvalue decomposition of the covariance matrix of the data
• Pre-processing: mean of each feature is 0
– Singular value decomposition of the data matrix
– Local linear embedding (LLE)
• Express points as combinations of neighbors and embed points into a lower-dimensional space (preserving neighbors)
– Multidimensional scaling (MDS)
• Preserve pairwise distances in a lower-dimensional space
– ISOMAP (nonlinear)
• Extends MDS to use geodesic distances on a weighted graph
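PCA via the SVD of mean-centered data is a compact way to realize the first two bullets; a minimal sketch on synthetic data, assuming rows are objects and columns are features.

```python
import numpy as np

# PCA: center each feature to mean 0, take the SVD, project onto the
# top-k right singular vectors (principal directions).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # synthetic stand-in for document features
Xc = X - X.mean(axis=0)              # pre-processing: each feature has mean 0
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
Z = Xc @ Vt[:k].T                    # k-dimensional embedding
explained = (s[:k] ** 2).sum() / (s ** 2).sum()  # variance captured
```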
Analysis Tasks in This Talk
• Information retrieval
– Goal: find documents most related to a query
– Challenges: pseudonyms, synonyms, stemming, errors
– Methods: LSA (later), boolean search, probabilistic retrieval
• Clustering
– Goal: find a set of partitions that best separates groups of like objects
– Challenges: distance metrics, number of clusters, uniqueness
– Methods: k-means (later), agglomerative, graph-based
• Summarization
– Goal: find a compact representation of text with the same meaning
– Challenges: single- vs. multi-document summaries, subjectivity
– Methods: HMM+QR (later), probabilistic
• Classification
– Goal: predict labels/categories of data instances (documents)
– Challenges: data overfitting
– Methods: HEMLOCK (S. Gilpin, later), decision trees, naïve Bayes, SVM
Other Analysis Tasks
• Machine translation
• Speech recognition
• Cross-language information retrieval
• Word sense disambiguation
– Determining the sense of ambiguous words from context
• Lexical acquisition
– Filling in gaps in dictionaries built from text corpora
• Concept drift detection
– Change in general topics in streaming data
• Association analysis
– Discovering novel relationships hidden in text
Hybrid Systems
• Rules + statistics/probabilities
– Entity extraction (persons, organizations, locations)
• Rules: list of common names, capitalization
• Probabilities: chance a name occurs given a sequence of words
• Any combination of data analytic tools
[Diagram: a hybrid system built from often independently developed components — parser, data modeler, feature extractor, clustering tool]
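The rules-plus-probabilities idea can be illustrated with a toy name tagger; the name list, context cues, and weights below are all hypothetical, chosen only to show how rule evidence and probability-like scores combine.

```python
# Toy hybrid entity tagger: rules (capitalization, common-name list)
# combined with a probability-like context cue. All lists/weights are
# illustrative assumptions.
KNOWN_NAMES = {"Danny", "Dianne"}

def name_score(token, prev_token):
    """Combine rule evidence and a crude context score into [0, 1]."""
    score = 0.0
    if token[:1].isupper():
        score += 0.4                      # rule: capitalization
    if token in KNOWN_NAMES:
        score += 0.5                      # rule: common-name list
    if prev_token.lower() in {"mr", "dr", "from"}:
        score += 0.3                      # context cue (probability-like)
    return min(score, 1.0)

tokens = ["mail", "from", "Danny", "arrived"]
scores = [name_score(t, p) for p, t in zip([""] + tokens, tokens)]
```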
Hybrid System Development
• Data model
– Cross-system, cross-platform accessibility
– Accommodation of multiple data structures
• System
– Modularized framework (plug-and-play capabilities)
– Compatible interfaces
– Multiple user interfaces
• TITAN: customizable front-ends to analysis pipelines
• YALE: required parameters vs. complete set of parameters
• Performance, verification & validation
– Tests for independent systems and the overall system
– Compatible test data and benchmarks
– Analysis of parameter dependencies across individual systems
Motivation
• Query
– methods plasma physics
• Retrieval
– General: Google, 7.8×10^6 of >2.5×10^10 documents
– Targeted: arXiv, 9,000 of >403,000 documents
• Problems
– Too much information
– Redundant information
– Results: link, title, abstract, snippet (?), etc.
– Ordering of results (meaning of “best” match?)
Problems to Solve
• QCS (Query, Cluster, Summarize)
– Unstructured text parsing (common representation)
– Data fusion (cleaning, assimilating, normalizing)
– Natural language processing (sentences, POS)
– Document retrieval (ranking)
– High-dimensional clustering (data organization)
– Automatic text summarization (data reduction)
– Data representation/visualization (multiple perspectives)
Query: Latent Semantic Analysis (LSA)
• SVD: A = U Σ V^T
• Truncated SVD (rank k): A_k = U_k Σ_k V_k^T
• Query scores (query as new “doc”): s = q^T A_k
• LSA ranking: order documents by decreasing score
[Diagram: term-document matrix A (terms t_1 … t_m by documents d_1 … d_n) factored by truncated SVD into a terms-by-concepts matrix, a diagonal matrix of singular values, and a concepts-by-documents matrix]
[Deerwester, S. C., et al. (1990), "Indexing by Latent Semantic Analysis," J. Am. Soc. Inform. Sci., 41(6), 391–407.]
LSA Example 1

d1: Hurricane. A hurricane is a catastrophe.
d2: An example of a catastrophe is a hurricane.
d3: An earthquake is bad.
d4: Earthquake. An earthquake is a catastrophe.

Term counts (stopwords removed):

              d1   d2   d3   d4
hurricane      2    1    0    0
earthquake     0    0    1    2
catastrophe    1    1    0    1

Query: q = (hurricane: 1, earthquake: 0, catastrophe: 0)

A (columns normalized):

              d1   d2   d3   d4
hurricane    .89  .71    0    0
earthquake     0    0    1  .89
catastrophe  .45  .71    0  .45

A_2 (rank-2 approximation):

              d1    d2    d3    d4
hurricane    .78   .78  -.11   .11
earthquake  -.03   .02   .96   .92
catastrophe  .59   .60   .15   .30

q^T A   = (.89, .71, 0, 0)        — normalization only
q^T A_2 = (.78, .78, -.11, .11)   — rank-2 approximation captures link to doc 4
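The example can be reproduced numerically; this sketch assumes the normalized matrix A above and recomputes the rank-2 scores with numpy.

```python
import numpy as np

# Normalized term-document matrix from the example
# (rows: hurricane, earthquake, catastrophe; columns: d1..d4).
A = np.array([
    [0.89, 0.71, 0.00, 0.00],   # hurricane
    [0.00, 0.00, 1.00, 0.89],   # earthquake
    [0.45, 0.71, 0.00, 0.45],   # catastrophe
])
q = np.array([1.0, 0.0, 0.0])   # query: hurricane

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2]   # rank-2 approximation

exact = q @ A      # no link to d4 in the raw matrix
lsa = q @ A2       # rank-2 scores recover a positive link to d4
```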
LSA Example 2
Cluster terms: policy, planning, politics, tomlinson, 1986
o Sport in Society: Policy, Politics and Culture, ed A. Tomlinson (1990)
o Policy and Politics in Sport, PE and Leisure, eds S. Fleming, M. Talbot and A. Tomlinson (1995)
o Policy and Planning (II), ed J. Wilkinson (1986)
o Policy and Planning (I), ed J. Wilkinson (1986)
o Leisure: Politics, Planning and People, ed A. Tomlinson (1985)

Cluster terms: parker, lifestyles, 1989, part
o Work, Leisure and Lifestyles (Part 2), ed S. R. Parker (1989)
o Work, Leisure and Lifestyles (Part 1), ed S. R. Parker (1989)
[Leisure Studies of America Data: 97 documents, 335 terms]
Cluster: Generalized Spherical K-Means (gmeans)
• The Players
– Documents: unit-norm vectors x_1, …, x_n
– Partition/disjoint sets: π_1, …, π_k
– Concept vectors (centroids): c_j, the normalized mean of the documents in π_j
• The Game
– Maximize Σ_j Σ_{x ∈ π_j} x^T c_j
• The Rules
– Adaptive, but bounded k
– Similarity estimation
– First variation (stochastic perturbation)

[Dhillon, I. S., et al. (2002), "Iterative Clustering of High Dimensional Text Data Augmented by Local Search," Proc. IEEE ICDM.]
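A stripped-down version of the objective above can be sketched as plain spherical k-means (no first variation or adaptive k — those refinements from the cited paper are omitted); the data here is synthetic.

```python
import numpy as np

def spherical_kmeans(X, k, iters=20, seed=0):
    """X: n x m matrix of unit-norm rows. Assign by cosine similarity,
    then renormalize cluster means into concept vectors."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]   # init from data
    for _ in range(iters):
        labels = np.argmax(X @ C.T, axis=1)            # maximize x^T c_j
        for j in range(k):
            members = X[labels == j]
            if len(members):
                s = members.sum(axis=0)
                C[j] = s / np.linalg.norm(s)           # concept vector
    return labels, C

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)          # unit-norm documents
labels, C = spherical_kmeans(X, k=3)
```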
Summarize: Hidden Markov Model + Pivoted QR
• Single-document summarization
– Mark summary sentences in training documents
– Build a probabilistic model
• Markov chain observations
– log(#subject terms + 1)
• terms showing up in titles, topics, subject descriptions, etc.
– log(#topic terms + 1)
• terms above a threshold using a mutual information statistic
• Hidden Markov Model (HMM)
– Hidden states: {summary, non-summary}
– Score sentences in each document
• Probability of each sentence being a summary sentence

[Diagram: HMM chain alternating non-summary and summary states]
[Conroy, J. M., et al. (2001). Text summarization via hidden markov models and pivoted QR matrix decomposition.]
Summarize: Hidden Markov Model + Pivoted QR
• Multi-document summarization
– Goal: generate w-word summaries
– Use HMM scores to select candidate sentences (~2w words)
– Terms as sentence features
• Terms: rows of the term-sentence matrix
• Sentences: columns of the term-sentence matrix
• Scaling: each column scaled by the HMM score of its sentence
• Pivoted QR
– Choose the column with maximum norm
– Subtract components along the chosen column from the remaining columns
– Stop when the chosen sentences (columns) total ~w words
• Removes semantic redundancy
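The greedy pivoted-QR column selection above can be sketched directly; random data stands in for real HMM-scaled sentence features, and the fixed selection count replaces the word-budget stopping rule.

```python
import numpy as np

def pivoted_qr_select(B, n_select):
    """Greedy pivoted QR: repeatedly pick the column (sentence) with the
    largest remaining norm, then deflate the rest along it."""
    R = B.astype(float).copy()
    chosen = []
    for _ in range(n_select):
        norms = np.linalg.norm(R, axis=0)
        norms[chosen] = -1.0                      # skip already-picked columns
        j = int(np.argmax(norms))
        chosen.append(j)
        qj = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(qj, qj @ R)                 # remove component along qj
    return chosen

rng = np.random.default_rng(2)
B = rng.normal(size=(30, 10))                     # 30 terms x 10 sentences
picked = pivoted_qr_select(B, n_select=3)
```

Because each deflation removes the chosen direction, near-duplicate sentences lose most of their norm and are unlikely to be picked again — the "removes semantic redundancy" property.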
QCS: Evaluation
• Document Understanding Conference (DUC)
– Automatic evaluation of summarizers (ROUGE)
• Measures agreement with human summaries
– Human, QCS, and S-only summaries compared
– QCS finds subtopics and outliers

[Plot: ROUGE-2 score vs. summarizers (humans, QCS, S) for Cluster 1 and Cluster 2]
QCS: Evaluation
• Document Understanding Conference (DUC)
– Scoring as a function of QCS cluster size (k)
– QCS vs. S-only summaries
– Best results for different clusters use different k

[Plot: ROUGE-2 scores vs. number of clusters for Cluster 1 and Cluster 2]
Benefits of QCS
• Dynamic data organization and compression
– Subset of documents relevant to a query
– Topic clusters, a single summary per cluster
• Multiple perspectives (analyses)
– Relevance ranking, topic clusters, summaries
• Efficient use of computation
– Parsing, term counts, natural language processing, etc.
ParaText™: Scalable Text Analysis
• ParaText™ Lite
– Serial client-server text analysis
– Parser, vector space model, SVD, data similarities/graph creation
– Built on vtkTextEngine (Titan)
– Works with ~10K–100K documents
• ParaText™
– End-to-end scalable text analysis
– Challenge 1: parsing [parallel string hashing, hierarchical agglomeration]
– Challenge 2: text modeling [initial Trilinos/Titan integration complete]
– Challenge 3: load balancing [initial: documents; goal: Isorropia/Zoltan]
• Impact
– Available in ThreatView 1.2.0+ directly or through the ParaText™ server
– Plans to interface to LSAView, OverView (1424), Sisyphus (1422)
[Diagram: ParaText™ architecture — a ParaText™ client connects via XML/HTTP to a master ParaText™ server, which dispatches work to ParaText™ Servers (PTS) on an HPC resource (cluster, multicore server, etc.); processes P0 … Pk each run a parallel pipeline of Reader → Parser → Matrix → SVD; one or two database servers host the Artifact DB and Matrices DB]
LSAView: Algorithm Analysis/Development
• LSAView
– Analysis and exploration of the impact of informatics algorithms on end-user visual analysis of data
– Aids in discovering optimal algorithm parameters for given data and tasks
• Features
– Side-by-side comparison of visualizations for two sets of parameters
– Small-multiple view for analyzing 2+ parameter sets simultaneously
– Linked document, graph, matrix, and tree data views
– Interactive, zoomable, hierarchical matrix and matrix-difference views
– Statistical inference tests used to highlight novel parameter impact
• Impact
– Used in developing and understanding ParaText™ and LSALIB algorithms
[Plots: four panels of values vs. k (x-axis 20–80, y-axis 0–2), one per scaling]
LSAView Impact
• Document similarities:
• Inner product view:
• Scaled inner product view:
What is the best scaling for document similarity graph generation?
Scalings compared: original scaling, no scaling, inverse sqrt, inverse
[Leisure Studies of America Data: 97 documents, 335 terms]
E-Mail Classification
• LSN Assistant / Sandia Categorization Framework
– Yucca Mountain: categorize e-mail (Relevant, Federal Record, Privileged)
– Machine learning library and GUI for document categorization & review
– For review of existing categorizations, recommendations for new documents
– Balanced learning
• Skewed class distributions
• Importance
– Solved an important, real problem
• ~400K e-mails incorrectly categorized
– Foundation for LSN Online Assistant
• Real-time system for recommendations
• Impact
– Dong Kim, lead of DOE/OCRWM LSN certification, is “very impressed with the LSN Assistant Tool and the approach to doing the review.”
– Factor of 3 speedup over manual-only categorization review
Conclusions
• Text analysis relies heavily upon mathematics
– Linear algebra, optimization, machine learning, probability theory, statistics, graph theory
• Hybrid system development is a challenge
– More than just gluing pieces together
• Large-scale analysis is important
– Storing and processing large amounts of data
– Scaling algorithms up
– Developing new algorithms for large data
• Useful across many application domains
Collaborations
• QCS
– Dianne O’Leary (Maryland), John Conroy & Judith Schlesinger (IDA/CCS)
• LSALIB
– Tammy Kolda (8962)
• ParaText™
– Tim Shead & Pat Crossno (1424)
• LSAView
– Pat Crossno (1424)
• Sandia Categorization Framework
– Justin Basilico (6341) and Steve Verzi (6343)
• HEMLOCK
– Sean Gilpin (1415)
Thank You
Text Analysis: Methods for Searching, Organizing, Labeling and Summarizing Document Collections
Danny Dunlavy
http://www.cs.sandia.gov/~dmdunla