©2002 Paula Matuszek
iMiner from IBM
Text mining tool with multiple components
Text analysis tools include:
– Language Identification Tool
– Feature Extraction Tool
– Summarizer Tool
– Topic Categorization Tool
– Clustering Tools
– http://www-4.ibm.com/software/data/iminer/fortext/index.html
– http://www-4.ibm.com/software/data/iminer/fortext/presentations/im4t23engl/im4t23engl1.htm
iMiner for Text 2
Basic technology includes:
– authority file with terms
– heuristics for extracting additional terms
– heuristics for extracting other features
– Dictionaries with parts of speech
– Partial parsing for part-of-speech tagging
– Significance measure for terms: Information Quotient (IQ).
Knowledge base cannot be directly expanded by end user
Strong machine-learning component
Language Identification
Can analyze:
– an entire document
– a text string input from the command line
Currently handles about a dozen languages
Can be trained; ML tool with input in language to be learned
Determines approximate proportion in bilingual documents
Language Identification
Basically treated as a categorization problem, where each language is a category
Training documents are processed to extract terms
Importance of terms for categorization is determined statistically
Dictionaries of weighted terms are used to determine the language of new documents
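The categorization approach above can be sketched as follows; the term weights here are invented for illustration (real dictionaries are built statistically from training documents):

```python
# Hypothetical per-language dictionaries of weighted terms; real
# weights come from statistical training, these are made up.
WEIGHTED_TERMS = {
    "english": {"the": 3.0, "and": 2.5, "of": 2.5},
    "german": {"der": 3.0, "und": 2.5, "das": 2.0},
}

def identify_language(text):
    """Score each language by summing weights of terms found in the text."""
    tokens = text.lower().split()
    scores = {lang: sum(terms.get(t, 0.0) for t in tokens)
              for lang, terms in WEIGHTED_TERMS.items()}
    return max(scores, key=scores.get)
```

The per-language scores could also be normalized to estimate the proportions in a bilingual document.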
Feature Extraction
Locate and categorize relevant features in text
Some features are themselves of interest
Also a starting point for other tools like classifiers and categorizers
Features may or may not be “meaningful” to a person
Goal is to find aspects of a document which somehow characterize it
Name Extraction Extracting Proper Names
– People, places, organizations
– Valuable clues to subject of text
Dictionaries of canonical forms
Additional names extracted from documents
– Parsing finds tokens
– Additional parsing groups tokens into noun phrases
– Rules identify tokens which are names
– Variant groups are assigned a canonical name which is the most explicit variant found in document
Examples for Name Extraction
“This subject is taught by Paula Matuszek.”
– Recognize Paula as a first name of a person
– Recognize Matuszek as a capitalized word following a first name
– Therefore “Paula Matuszek” is probably the name of a person
“This subject is taught by Villanova University.”
– Recognize Villanova as a probable name based on capitalization
– Recognize University as a term which normally names an institution
– Therefore “Villanova University” is probably the name of an institution
“This subject is taught by Howard University.”
– BOTH of these sets of rules could apply, so rules need to be prioritized to determine the more likely parse
Other Rule Examples
Dr., Mr., Ms. are titles, and titles followed by capitalized words frequently indicate names. If followed by only one word, it’s the last name.
Capitalized word followed by a single capitalized letter followed by a capitalized word is probably FN MI LN.
Nouns can be names. Verbs can’t.
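Prioritized rules like these can be sketched as an ordered list; the names, rule set, and priorities below are invented for illustration, but the ordering shows how a case like “Howard University” gets the more likely parse:

```python
# Hypothetical dictionaries and rules; higher priority wins.
FIRST_NAMES = {"Paula", "Howard"}
INSTITUTION_WORDS = {"University", "College", "Institute"}

RULES = [
    # (priority, label, test on a capitalized two-word phrase)
    (10, "institution", lambda w1, w2: w2 in INSTITUTION_WORDS),
    (5, "person", lambda w1, w2: w1 in FIRST_NAMES),
]

def classify_name(phrase):
    """Apply rules in priority order to a two-word capitalized phrase."""
    words = phrase.split()
    if len(words) != 2 or not all(w[0].isupper() for w in words):
        return "unknown"
    for priority, label, test in sorted(RULES, key=lambda r: -r[0]):
        if test(*words):
            return label
    return "unknown"
```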
Abbreviation/Acronym Extraction
Fruitful source of variants for names and terms
Existing dictionary of common terms
Name followed by “(“ [A-Z]+ “)” probably gives an abbreviation
Conventions regarding word-internal case and prefixes: “MSDOS” matches “MicroSoft DOS”, “GB” matches “gigabyte”
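The parenthesized-abbreviation pattern above can be sketched as a regular expression; the exact pattern iMiner uses is not documented here, so the details are an assumption:

```python
import re

# A capitalized phrase followed by an all-caps token in parentheses
# likely defines an abbreviation (pattern details assumed).
ABBREV = re.compile(r"((?:[A-Z][a-z]+\s+)+)\(([A-Z]+)\)")

def extract_abbreviations(text):
    """Return (expansion, abbreviation) pairs for Name (ABBR) patterns."""
    return [(m.group(1).strip(), m.group(2)) for m in ABBREV.finditer(text)]
```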
Number Extraction
Useful primarily to improve performance of other extractors
Variant expressions of numbers
– one thousand three hundred and twenty seven
– thirteen twenty seven
– 1327
Other numeric expressions
– twenty-seven percent
– 27%
Base forms are easy; most of the effort is variants and determining canonical form based on rules
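The variant-to-canonical-form step can be sketched for spelled-out numbers; the word table below is truncated and the combination rules simplified:

```python
# Truncated vocabulary; a real extractor would cover all number words.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "seven": 7,
                "twenty": 20, "hundred": 100, "thousand": 1000}

def words_to_number(text):
    """Reduce a spelled-out number to its canonical integer form."""
    total, current = 0, 0
    for word in text.lower().replace("-", " ").split():
        if word == "and":
            continue
        value = NUMBER_WORDS[word]
        if value == 100:
            current *= 100           # "three hundred" -> 300
        elif value == 1000:
            total += current * 1000  # shift accumulated value into thousands
            current = 0
        else:
            current += value
    return total + current
```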
Date Extraction
Absolute and relative dates
Produces canonical form:
– March 27, 1997 → 1997/03/27
– tomorrow → ref+0000/00/01
– a year ago → ref-0001/00/00
Similar techniques and issues as for numbers
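A minimal sketch of this canonicalization, assuming a direct lookup table for relative dates and a truncated month table:

```python
import re

MONTHS = {"January": 1, "February": 2, "March": 3, "December": 12}  # truncated
RELATIVE = {"tomorrow": "ref+0000/00/01", "a year ago": "ref-0001/00/00"}

def canonical_date(text):
    """Map an absolute or relative date expression to its canonical form."""
    if text in RELATIVE:
        return RELATIVE[text]
    match = re.match(r"(\w+) (\d{1,2}), (\d{4})$", text)
    if match and match.group(1) in MONTHS:
        return "%s/%02d/%02d" % (match.group(3),
                                 MONTHS[match.group(1)],
                                 int(match.group(2)))
    return None
```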
Money Extraction
Recognizes currencies and produces canonical representation
Uses number extractor
Examples:
– “twenty-seven dollars” → “27.000 dollars USA”
– “DM 27” → “27.000 marks Germany”
Term Extraction
Identify other important terms found in text
Other major lexical clue for subject, especially if repeated
May use output from other extractors in rules
Recognizes common lexical variants and reduces to canonical form -- stemming
Machine learning is much more important here
Term Extraction
Dictionary with parts-of-speech info for English
Pattern matching to find noun phrase structure typical of technical terms
Feature repositories:
– Authority dictionary: canonical forms, variants, correct feature map. Used BEFORE heuristics
– Residue dictionary: complex feature type (name, term, pattern). Used AFTER heuristics
Authority and residue dictionaries trained
Information Quotient
Each feature (word, phrase, name) extracted is assigned an information quotient (IQ)
Represents the significance of the feature in the document
Factors include:
– TF-IDF: term frequency-inverse document frequency
– position information
– stop words
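The TF-IDF component can be sketched directly (the full IQ computation, which also folds in position and stop-word information, is proprietary and not shown here). Documents are lists of tokens:

```python
import math

def tf_idf(term, doc, collection):
    """TF-IDF score of a term in a document, relative to a collection."""
    tf = doc.count(term)                          # term frequency
    df = sum(1 for d in collection if term in d)  # document frequency
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(collection) / df)    # idf = log(N / df)
```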
Feature Extraction Demo
Tool may be used for highlighting, etc., on documents to be displayed
Features extracted also form basis for other tools
Note that this is not full information extraction, although it is a starting point
http://www-4.ibm.com/software/data/iminer/fortext/extract/extractDemo.html
Other Features
Feature Extractor also identifies other features used by other text analysis tools:
– sentence boundaries
– paragraph boundaries
– document tags
– document structure
– collection statistics
Summarizer Tools
Collection of sentences extracted from document
Characteristic of document content
Works best for well-structured documents
Can specify length
Must apply feature extraction first
Summarizer
Feature extractor run first
Words are ranked
Sentences are ranked
Highest-ranked sentences are chosen
Configurable: for length of sentence, for word salience
Works best when document is part of a collection
Word Ranking
Words scored if they:
– appear in structures such as titles and captions
– occur more often in the document than in the collection (word salience)
– occur more than once in a document
Score is:
– salience if > threshold: tf*idf (by default)
– weighting factor if the word occurs in a title, heading, or caption
Sentence Ranking
Scored according to relevance in document and position in document
Sum of:
– scores of individual words
– proximity of sentence to beginning of its paragraph
– “bonus” for final sentence in a long paragraph and final paragraph in a long document
– proximity of paragraph to beginning of document
All configurable
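The word- and sentence-ranking scheme can be sketched as follows; the position bonus value is invented, and only one of the position factors is modeled:

```python
def summarize(sentences, word_scores, length=2):
    """Pick the highest-scoring sentences, returned in document order."""
    def score(index, sentence):
        s = sum(word_scores.get(w.lower().strip(".,"), 0.0)
                for w in sentence.split())
        if index == 0:
            s += 1.0  # invented bonus: proximity to beginning of document
        return s

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(i, sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:length])]
```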
Summarization Examples
Examples from IBM documentation
http://www-4.ibm.com/software/data/iminer/fortext/summarize/summarizeDemo.html
Some Common Statistical Measures (a brief digression)
TF x IDF
Pairwise and multiple-word phrase counts
Some other common statistical measures:
– information gain: how many bits of information we gain by knowing that a term is present in a document
– mutual information: how likely a term is to occur in a document
– term strength: likelihood that a term will occur in both of two closely related documents
Topic Categorization Tool
Assign documents to predetermined categories
Must first be trained:
– training tool creates category scheme
– dictionary stores significant vocabulary statistics
Output is a list of possible categories and probabilities for each document
Can filter initial schema for faster processing
Features Used for Categorizing
Linguistic features
– uses the features extracted by the Feature Extraction tool
N-grams
– letter groupings and short words
– can be used for non-English, because it doesn’t depend on heuristics
– used by the language categorizer
Document Categorizing
Individual document is analyzed for features
Features are compared to those determined for categories:
– terms present/absent
– IQ of terms
– frequencies
– document structure
Document Categorization
Important issue is determining which features! High dimensionality is expensive.
Ideally you want a small set of features which is:
– present in all documents of one category
– absent in all other documents
In actuality, not that clean. So:
– use features with relatively high separation
– eliminate features which correlate very highly with another feature (to reduce dimension space)
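The high-separation heuristic can be sketched for a simple two-category case; the frequency-difference measure and threshold are assumptions:

```python
def select_features(features, freq_in_category, freq_elsewhere,
                    min_separation=0.5):
    """Keep features whose document frequency differs strongly
    between one category and the rest of the collection."""
    return [f for f in features
            if abs(freq_in_category.get(f, 0.0)
                   - freq_elsewhere.get(f, 0.0)) >= min_separation]
```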
Categorization Demo
Typically categorization is a component in a system which then “does something” with the categorized documents
Ambiguous documents (not assigned to any one category with high probability) often indicate a new category evolving
http://www-4.ibm.com/software/data/iminer/fortext/categorize/categorize.
Clustering Tools
Organize documents without pre-existing categories
Hierarchical clustering
– creates a tree where each leaf is a document and each cluster is positioned under the most similar cluster one step up
Binary relational clustering
– creates a flat set of clusters, with each document assigned to its best fit and relations between clusters captured
Hierarchical Clustering
Input is a set of documents
Output is a dendrogram:
– root
– intermediate levels
– leaves: link to actual documents
Slicing is used to create a manageable HTML tree
Steps in Hierarchical Clustering
Select linguistic preprocessing technique: determines “similarity”
Cluster documents: create dendrogram based on similarity
Define shape of tree with slicing technique and produce HTML output
Linguistic Preprocessing
Determining similarity between documents and clusters: how do we define “similar”?
– Lexical affinity: does not require any preprocessing
– Linguistic features: requires that the feature extractor be run first
iMiner is either/or; you cannot combine the two methods of determining similarity
Clustering: Lexical Affinities
Lexical affinities: groups of words which appear frequently close together
– created “on the fly” during a clustering task
– word pairs
– stemming and other morphological analysis
– stop words
Results in documents with textual similarity being clustered together
Clustering: Linguistic Features
Linguistic features: use features extracted by the feature extraction tool
– names of organizations
– domain technical terms
– names of individuals
Can allow focusing on specific areas of interest
Best if you have some idea what you are interested in
Hierarchical Clustering Steps
Put each document in a cluster, characterized by its lexical or linguistic features
Merge the two most similar clusters
Continue until all clusters are merged
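These steps can be sketched as a standard agglomerative loop; Jaccard overlap of feature sets stands in for whichever similarity measure the linguistic preprocessing step selected (an assumption on my part):

```python
def jaccard(a, b):
    """Overlap between two non-empty feature sets."""
    return len(a & b) / len(a | b)

def agglomerate(docs):
    """Merge the two most similar clusters until one remains;
    the merge history corresponds to the dendrogram."""
    clusters = [set(d) for d in docs]
    merges = []
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = max(pairs,
                   key=lambda p: jaccard(clusters[p[0]], clusters[p[1]]))
        merges.append((frozenset(clusters[i]), frozenset(clusters[j])))
        clusters[i] |= clusters.pop(j)  # j > i, so index i stays valid
    return merges
```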
Hierarchical Clustering: Slicing
The dendrogram is too big to be useful
Slicing reduces the size of the tree by merging clusters if they are “similar enough”:
– top threshold: collapse any tree which exceeds it
– bottom threshold: group under root any cluster which is lower
– remaining clusters make a new tree
– # of steps sets depth of tree
Typical Slicing Parameters
Bottom:
– start around 5% or 10% similar
– 90% would mean only virtually identical documents get grouped
Top:
– good default is 90%
– if you want really identical, set to 100%
Depth:
– typically 2 to 10
– two would give you duplicates and the rest
Binary Relational Clustering
Binary relational clustering:
– creates a flat set of clusters
– each document assigned to its best fit
– relations between clusters captured
Similarity based on features extracted by the Feature Extraction tool
Relational Clustering: Document Similarity
Based on comparison of descriptors:
– frequent descriptors across the collection given more weight: priority to wide topics
– rare descriptors given more weight: large number of very focused clusters
– both, with rare descriptors given slightly higher weight: relatively focused topics but fewer clusters
Descriptors are binary: present or absent
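The descriptor weighting options can be sketched as follows; the IDF-style formula for the rare-descriptor option is an assumption, not iMiner's actual weighting:

```python
import math

def descriptor_similarity(desc_a, desc_b, doc_freq, n_docs, favor_rare=True):
    """Sum weights over shared binary descriptors; rare descriptors
    weigh more under favor_rare, frequent ones otherwise."""
    score = 0.0
    for d in desc_a & desc_b:  # descriptors are simply present or absent
        if favor_rare:
            score += math.log(n_docs / doc_freq[d])  # rare -> heavy
        else:
            score += doc_freq[d] / n_docs            # frequent -> heavy
    return score
```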
Relational Clustering
Descriptors are features extracted by the feature extraction tool
Similarity threshold: at 100%, only identical documents are clustered
Max # of clusters: overrides similarity threshold to get the number of clusters specified
Binary Relational Clustering Outputs
Outputs are:
– clusters: topics found, importance of topics, degree of similarity in cluster
– links: sets of common descriptors between clusters
Clustering Demo
Patents from “class 395”: information processing system organization
10% for top, 1% for bottom, total of 5 slices, lexical affinity
http://www-4.ibm.com/software/data/iminer/fortext/cluster/clusterDemo.html
Summary
iMiner has a rich set of text mining tools
Product is well-developed, stable
No explicit user-modifiable knowledge base -- uses automated techniques and built-in KB to extract relevant information
Can be deployed to new domains without a lot of additional work
BUT not as effective in many domains as a tool with a good KB
No real information extraction capability
Information Extraction Overview
Given a body of text: extract from it some well-defined set of information
MUC conferences
Typically draws heavily on NLP
Three main components:
– domain knowledge base
– extraction engine
– knowledge model
Information Extraction Domain Knowledge Base
Terms: enumerated list of strings which are all members of some class
– “January”, “February”
– “Smith”, “Wong”, “Martinez”, “Matuszek”
– “lysine”, “alanine”, “cysteine”
Classes: general categories of terms
– month names, last names, amino acids
– capitalized nouns
– verb phrases
Domain Knowledge Base
Rules: LHS, RHS, salience
Left-hand side (LHS): a pattern to be matched, written as relationships among terms and classes
Right-hand side (RHS): an action to be taken when the pattern is found
Salience: priority of this rule (weight, strength, confidence)
Some Rule Examples
<Monthname> <Year> => <Date>
<Date> <Name> => print “Birthdate”, <Name>, <Date>
<Name> <Address> => create address database record
<daynumber> “/” <monthnumber> “/” <year> => create date database record (50)
<monthnumber> “/” <daynumber> “/” <year> => create date database record (60)
<capitalized noun> <single letter> “.” <capitalized noun> => <Name>
<noun phrase> <to-be verb> <noun phrase> => create “relationship” database record
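Rule application with salience can be sketched as follows; plain regexes stand in for the class-based LHS patterns, and the two date rules mirror the 50/60 example above (the rule bodies themselves are invented):

```python
import re

RULES = [
    # (salience, LHS pattern, RHS action name) -- illustrative only
    (60, re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
     "create month/day/year date record"),
    (50, re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
     "create day/month/year date record"),
]

def apply_rules(text):
    """Fire the highest-salience rule whose LHS pattern matches."""
    for salience, pattern, action in sorted(RULES, key=lambda r: -r[0]):
        if pattern.search(text):
            return action
    return None
```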
Generic KB
Generic KB: a KB likely to be useful in many domains
– names
– dates
– places
– organizations
Almost all systems have one
Limited by cost of development: it takes about 200 rules to define dates reasonably well, for instance
Domain-specific KB
We mostly can’t afford to build a KB for the entire world.
However, most applications are fairly domain-specific.
Therefore we build domain-specific KBs which identify the kind of information we are interested in:
– protein-protein interactions
– airline flights
– terrorist activities
Domain-specific KBs
Typically start with the generic KBs
Add terminology
Figure out what kinds of information you want to extract
Add rules to identify it
Test against documents which have been human-scored to determine precision and recall for individual items
Knowledge Model
We aren’t looking for documents, we are looking for information. What information?
Typically we have a knowledge model or schema which identifies the information components we want and their relationships
Typically looks very much like a DB schema or object definition
Knowledge Model Examples
Personal records– Name
– First name– Middle Initial– Last Name
– Birthdate– Month– Day– Year
– Address
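A model like this maps naturally onto object definitions; the field names follow the outline above, and the types are assumptions:

```python
from dataclasses import dataclass

# Personal-record knowledge model rendered as object definitions.
@dataclass
class Name:
    first_name: str
    middle_initial: str
    last_name: str

@dataclass
class Birthdate:
    month: int
    day: int
    year: int

@dataclass
class PersonalRecord:
    name: Name
    birthdate: Birthdate
    address: str
```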
Knowledge Model Examples
Protein Inhibitors
– Protein name (class?)
– Compound name (class?)
– Pointer to source
– Cache of text
– Offset into text
Knowledge Model Examples
Airline Flight Record
– Airline
– Flight Number
– Origin
– Destination
– Date
» Status
» departure time
» arrival time
Summary
Text mining below the document level
NOT typically interactive, because it’s slow (1 to 100 meg of text/hr)
Typically builds up a DB of information which can then be queried
Uses a combination of term- and rule-driven analysis and NLP parsing
AeroText: very good system developed by LMCO; we will get a complete demo on March 26