©2002 Paula Matuszek
iMiner from IBM
Text mining tool with multiple components
Text analysis tools include:
– Language Identification Tool
– Feature Extraction Tool
– Summarizer Tool
– Topic Categorization Tool
– Clustering Tools
– http://www-4.ibm.com/software/data/iminer/fortext/index.html
– http://www-4.ibm.com/software/data/iminer/fortext/presentations/im4t23engl/im4t23engl1.htm
iMiner for Text 2
Basic technology includes:
– authority file with terms
– heuristics for extracting additional terms
– heuristics for extracting other features
– Dictionaries with parts of speech
– Partial parsing for part-of-speech tagging
– Significance measure for terms: Information Quotient (IQ).
Knowledge base cannot be directly expanded by end user
Strong machine-learning component
Language Identification
Can analyze:
– an entire document
– a text string input from the command line
Currently handles about a dozen languages
Can be trained; ML tool with input in language to be learned
Determines approximate proportion in bilingual documents
Language Identification
Basically treated as a categorization problem, where each language is a category
Training documents are processed to extract terms
Importance of terms for categorization is determined statistically
Dictionaries of weighted terms are used to determine the language of new documents
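The categorization approach above can be sketched as follows; the term weights here are invented for illustration (real dictionaries are built statistically from training documents):

```python
# Hypothetical per-language dictionaries of weighted terms; real
# weights come from statistical training, these are made up.
WEIGHTED_TERMS = {
    "english": {"the": 3.0, "and": 2.5, "of": 2.5},
    "german": {"der": 3.0, "und": 2.5, "das": 2.0},
}

def identify_language(text):
    """Score each language by summing weights of terms found in the text."""
    tokens = text.lower().split()
    scores = {lang: sum(terms.get(t, 0.0) for t in tokens)
              for lang, terms in WEIGHTED_TERMS.items()}
    return max(scores, key=scores.get)
```

The per-language scores could also be normalized to estimate the proportions in a bilingual document.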
Feature Extraction
Locate and categorize relevant features in text
Some features are themselves of interest
Also a starting point for other tools like classifiers and categorizers
Features may or may not be “meaningful” to a person
Goal is to find aspects of a document which somehow characterize it
Name Extraction Extracting Proper Names
– People, places, organizations
– Valuable clues to subject of text
Dictionaries of canonical forms
Additional names extracted from documents
– Parsing finds tokens
– Additional parsing groups tokens into noun phrases
– Rules identify tokens which are names
– Variant groups are assigned a canonical name which is the most explicit variant found in document
Examples for Name Extraction
“This subject is taught by Paula Matuszek.”
– Recognize Paula as a first name of a person
– Recognize Matuszek as a capitalized word following a first name
– Therefore “Paula Matuszek” is probably the name of a person
“This subject is taught by Villanova University.”
– Recognize Villanova as a probable name based on capitalization
– Recognize University as a term which normally names an institution
– Therefore “Villanova University” is probably the name of an institution
“This subject is taught by Howard University.”
– BOTH of these sets of rules could apply, so rules need to be prioritized to determine the more likely parse
Other Rule Examples
Dr., Mr., Ms. are titles, and titles followed by capitalized words frequently indicate names. If followed by only one word, it’s the last name.
Capitalized word followed by a single capitalized letter followed by a capitalized word is probably FN MI LN.
Nouns can be names. Verbs can’t.
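Prioritized rules like these can be sketched as an ordered list; the names, rule set, and priorities below are invented for illustration, but the ordering shows how a case like “Howard University” gets the more likely parse:

```python
# Hypothetical dictionaries and rules; higher priority wins.
FIRST_NAMES = {"Paula", "Howard"}
INSTITUTION_WORDS = {"University", "College", "Institute"}

RULES = [
    # (priority, label, test on a capitalized two-word phrase)
    (10, "institution", lambda w1, w2: w2 in INSTITUTION_WORDS),
    (5, "person", lambda w1, w2: w1 in FIRST_NAMES),
]

def classify_name(phrase):
    """Apply rules in priority order to a two-word capitalized phrase."""
    words = phrase.split()
    if len(words) != 2 or not all(w[0].isupper() for w in words):
        return "unknown"
    for priority, label, test in sorted(RULES, key=lambda r: -r[0]):
        if test(*words):
            return label
    return "unknown"
```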
Abbreviation/Acronym Extraction
Fruitful source of variants for names and terms
Existing dictionary of common terms
Name followed by “(“ [A-Z]+ “)” probably gives an abbreviation
Conventions regarding word-internal case and prefixes: “MSDOS” matches “MicroSoft DOS”, “GB” matches “gigabyte”
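The parenthesized-abbreviation pattern above can be sketched as a regular expression; the exact pattern iMiner uses is not documented here, so the details are an assumption:

```python
import re

# A capitalized phrase followed by an all-caps token in parentheses
# likely defines an abbreviation (pattern details assumed).
ABBREV = re.compile(r"((?:[A-Z][a-z]+\s+)+)\(([A-Z]+)\)")

def extract_abbreviations(text):
    """Return (expansion, abbreviation) pairs for Name (ABBR) patterns."""
    return [(m.group(1).strip(), m.group(2)) for m in ABBREV.finditer(text)]
```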
Number Extraction
Useful primarily to improve performance of other extractors
Variant expressions of numbers
– one thousand three hundred and twenty seven
– thirteen twenty seven
– 1327
Other numeric expressions
– twenty-seven percent
– 27%
Base forms are easy; most of the effort is variants and determining canonical form based on rules
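The variant-to-canonical-form step can be sketched for spelled-out numbers; the word table below is truncated and the combination rules simplified:

```python
# Truncated vocabulary; a real extractor would cover all number words.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "seven": 7,
                "twenty": 20, "hundred": 100, "thousand": 1000}

def words_to_number(text):
    """Reduce a spelled-out number to its canonical integer form."""
    total, current = 0, 0
    for word in text.lower().replace("-", " ").split():
        if word == "and":
            continue
        value = NUMBER_WORDS[word]
        if value == 100:
            current *= 100           # "three hundred" -> 300
        elif value == 1000:
            total += current * 1000  # shift accumulated value into thousands
            current = 0
        else:
            current += value
    return total + current
```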
Date Extraction
Absolute and relative dates
Produces canonical form:
– March 27, 1997 → 1997/03/27
– tomorrow → ref+0000/00/01
– a year ago → ref-0001/00/00
Similar techniques and issues as for numbers
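A minimal sketch of this canonicalization, assuming a direct lookup table for relative dates and a truncated month table:

```python
import re

MONTHS = {"January": 1, "February": 2, "March": 3, "December": 12}  # truncated
RELATIVE = {"tomorrow": "ref+0000/00/01", "a year ago": "ref-0001/00/00"}

def canonical_date(text):
    """Map an absolute or relative date expression to its canonical form."""
    if text in RELATIVE:
        return RELATIVE[text]
    match = re.match(r"(\w+) (\d{1,2}), (\d{4})$", text)
    if match and match.group(1) in MONTHS:
        return "%s/%02d/%02d" % (match.group(3),
                                 MONTHS[match.group(1)],
                                 int(match.group(2)))
    return None
```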
Money Extraction
Recognizes currencies and produces canonical representation
Uses number extractor
Examples:
– “twenty-seven dollars” → “27.000 dollars USA”
– “DM 27” → “27.000 marks Germany”
Term Extraction
Identify other important terms found in text
Other major lexical clue for subject, especially if repeated
May use output from other extractors in rules
Recognizes common lexical variants and reduces to canonical form -- stemming
Machine learning is much more important here
Term Extraction
Dictionary with parts-of-speech info for English
Pattern matching to find noun phrase structure typical of technical terms
Feature repositories:
– Authority dictionary: canonical forms, variants, correct feature map. Used BEFORE heuristics
– Residue dictionary: complex feature type (name, term, pattern). Used AFTER heuristics
Authority and residue dictionaries trained
Information Quotient
Each feature (word, phrase, name) extracted is assigned an information quotient (IQ)
Represents the significance of the feature in the document
Factors include:
– TF-IDF: term frequency-inverse document frequency
– position information
– stop words
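The TF-IDF component can be sketched directly (the full IQ computation, which also folds in position and stop-word information, is proprietary and not shown here). Documents are lists of tokens:

```python
import math

def tf_idf(term, doc, collection):
    """TF-IDF score of a term in a document, relative to a collection."""
    tf = doc.count(term)                          # term frequency
    df = sum(1 for d in collection if term in d)  # document frequency
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(collection) / df)    # idf = log(N / df)
```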
Feature Extraction Demo
Tool may be used for highlighting, etc., on documents to be displayed
Features extracted also form basis for other tools
Note that this is not full information extraction, although it is a starting point
http://www-4.ibm.com/software/data/iminer/fortext/extract/extractDemo.html
Other Features
Feature Extractor also identifies other features used by other text analysis tools:
– sentence boundaries
– paragraph boundaries
– document tags
– document structure
– collection statistics
Summarizer Tools
Collection of sentences extracted from document
Characteristic of document content
Works best for well-structured documents
Can specify length
Must apply feature extraction first
Summarizer
Feature extractor run first
Words are ranked
Sentences are ranked
Highest-ranked sentences are chosen
Configurable: for length of sentence, for word salience
Works best when document is part of a collection
Word Ranking
Words scored if they:
– appear in structures such as titles and captions
– occur more often in the document than in the collection (word salience)
– occur more than once in a document
Score is:
– salience if > threshold: tf*idf (by default)
– weighting factor if the word occurs in a title, heading, or caption
Sentence Ranking
Scored according to relevance in document and position in document
Sum of:
– scores of individual words
– proximity of sentence to beginning of its paragraph
– “bonus” for final sentence in a long paragraph and final paragraph in a long document
– proximity of paragraph to beginning of document
All configurable
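The word- and sentence-ranking scheme can be sketched as follows; the position bonus value is invented, and only one of the position factors is modeled:

```python
def summarize(sentences, word_scores, length=2):
    """Pick the highest-scoring sentences, returned in document order."""
    def score(index, sentence):
        s = sum(word_scores.get(w.lower().strip(".,"), 0.0)
                for w in sentence.split())
        if index == 0:
            s += 1.0  # invented bonus: proximity to beginning of document
        return s

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(i, sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:length])]
```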
Summarization Examples
Examples from IBM documentation
http://www-4.ibm.com/software/data/iminer/fortext/summarize/summarizeDemo.html
Some Common Statistical Measures (a brief digression)
TF x IDF
Pairwise and multiple-word phrase counts
Some other common statistical measures:
– information gain: how many bits of information we gain by knowing that a term is present in a document
– mutual information: how likely a term is to occur in a document
– term strength: likelihood that a term will occur in both of two closely related documents
Topic Categorization Tool
Assign documents to predetermined categories
Must first be trained:
– training tool creates category scheme
– dictionary stores significant vocabulary statistics
Output is a list of possible categories and probabilities for each document
Can filter initial schema for faster processing
Features Used for Categorizing
Linguistic features
– uses the features extracted by the Feature Extraction tool
N-grams
– letter groupings and short words
– can be used for non-English, because it doesn’t depend on heuristics
– used by the language categorizer
Document Categorizing
Individual document is analyzed for features
Features are compared to those determined for categories:
– terms present/absent
– IQ of terms
– frequencies
– document structure
Document Categorization
Important issue is determining which features! High dimensionality is expensive.
Ideally you want a small set of features which is:
– present in all documents of one category
– absent in all other documents
In actuality, not that clean. So:
– use features with relatively high separation
– eliminate features which correlate very highly with another feature (to reduce dimension space)
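The high-separation heuristic can be sketched for a simple two-category case; the frequency-difference measure and threshold are assumptions:

```python
def select_features(features, freq_in_category, freq_elsewhere,
                    min_separation=0.5):
    """Keep features whose document frequency differs strongly
    between one category and the rest of the collection."""
    return [f for f in features
            if abs(freq_in_category.get(f, 0.0)
                   - freq_elsewhere.get(f, 0.0)) >= min_separation]
```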
Categorization Demo
Typically categorization is a component in a system which then “does something” with the categorized documents
Ambiguous documents (not assigned to any one category with high probability) often indicate a new category evolving
http://www-4.ibm.com/software/data/iminer/fortext/categorize/categorize.
Clustering Tools
Organize documents without pre-existing categories
Hierarchical clustering
– creates a tree where each leaf is a document and each cluster is positioned under the most similar cluster one step up
Binary relational clustering
– creates a flat set of clusters, with each document assigned to its best fit and relations between clusters captured
Hierarchical Clustering
Input is a set of documents
Output is a dendrogram:
– root
– intermediate levels
– leaves: link to actual documents
Slicing is used to create a manageable HTML tree
Steps in Hierarchical Clustering
Select linguistic preprocessing technique: determines “similarity”
Cluster documents: create dendrogram based on similarity
Define shape of tree with slicing technique and produce HTML output
Linguistic Preprocessing
Determining similarity between documents and clusters: how do we define “similar”?
– Lexical affinity: does not require any preprocessing
– Linguistic features: requires that the feature extractor be run first
iMiner is either/or; you cannot combine the two methods of determining similarity
Clustering: Lexical Affinities
Lexical affinities: groups of words which appear frequently close together
– created “on the fly” during a clustering task
– word pairs
– stemming and other morphological analysis
– stop words
Results in documents with textual similarity being clustered together
Clustering: Linguistic Features
Linguistic features: use features extracted by the feature extraction tool
– names of organizations
– domain technical terms
– names of individuals
Can allow focusing on specific areas of interest
Best if you have some idea what you are interested in
Hierarchical Clustering Steps
Put each document in a cluster, characterized by its lexical or linguistic features
Merge the two most similar clusters
Continue until all clusters are merged
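These steps can be sketched as a standard agglomerative loop; Jaccard overlap of feature sets stands in for whichever similarity measure the linguistic preprocessing step selected (an assumption on my part):

```python
def jaccard(a, b):
    """Overlap between two non-empty feature sets."""
    return len(a & b) / len(a | b)

def agglomerate(docs):
    """Merge the two most similar clusters until one remains;
    the merge history corresponds to the dendrogram."""
    clusters = [set(d) for d in docs]
    merges = []
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = max(pairs,
                   key=lambda p: jaccard(clusters[p[0]], clusters[p[1]]))
        merges.append((frozenset(clusters[i]), frozenset(clusters[j])))
        clusters[i] |= clusters.pop(j)  # j > i, so index i stays valid
    return merges
```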
Hierarchical Clustering: Slicing
The dendrogram is too big to be useful
Slicing reduces the size of the tree by merging clusters if they are “similar enough”:
– top threshold: collapse any tree which exceeds it
– bottom threshold: group under root any cluster which is lower
– remaining clusters make a new tree
– # of steps sets depth of tree
Typical Slicing Parameters
Bottom:
– start around 5% or 10% similar
– 90% would mean only virtually identical documents get grouped
Top:
– good default is 90%
– if you want really identical, set to 100%
Depth:
– typically 2 to 10
– two would give you duplicates and the rest
Binary Relational Clustering
Binary relational clustering:
– creates a flat set of clusters
– each document assigned to its best fit
– relations between clusters captured
Similarity based on features extracted by the Feature Extraction tool
Relational Clustering: Document Similarity
Based on comparison of descriptors:
– frequent descriptors across the collection given more weight: priority to wide topics
– rare descriptors given more weight: large number of very focused clusters
– both, with rare descriptors given slightly higher weight: relatively focused topics but fewer clusters
Descriptors are binary: present or absent
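The descriptor weighting options can be sketched as follows; the IDF-style formula for the rare-descriptor option is an assumption, not iMiner's actual weighting:

```python
import math

def descriptor_similarity(desc_a, desc_b, doc_freq, n_docs, favor_rare=True):
    """Sum weights over shared binary descriptors; rare descriptors
    weigh more under favor_rare, frequent ones otherwise."""
    score = 0.0
    for d in desc_a & desc_b:  # descriptors are simply present or absent
        if favor_rare:
            score += math.log(n_docs / doc_freq[d])  # rare -> heavy
        else:
            score += doc_freq[d] / n_docs            # frequent -> heavy
    return score
```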
Relational Clustering
Descriptors are features extracted by the feature extraction tool
Similarity threshold: at 100%, only identical documents are clustered
Max # of clusters: overrides similarity threshold to get the number of clusters specified
Binary Relational Clustering Outputs
Outputs are:
– clusters: topics found, importance of topics, degree of similarity in cluster
– links: sets of common descriptors between clusters
Clustering Demo
Patents from “class 395”: information processing system organization
10% for top, 1% for bottom, total of 5 slices, lexical affinity
http://www-4.ibm.com/software/data/iminer/fortext/cluster/clusterDemo.html
Summary
iMiner has a rich set of text mining tools
Product is well-developed, stable
No explicit user-modifiable knowledge base -- uses automated techniques and built-in KB to extract relevant information
Can be deployed to new domains without a lot of additional work
BUT not as effective in many domains as a tool with a good KB
No real information extraction capability
Information Extraction Overview
Given a body of text: extract from it some well-defined set of information
MUC conferences
Typically draws heavily on NLP
Three main components:
– domain knowledge base
– extraction engine
– knowledge model
Information Extraction Domain Knowledge Base
Terms: enumerated list of strings which are all members of some class
– “January”, “February”
– “Smith”, “Wong”, “Martinez”, “Matuszek”
– “lysine”, “alanine”, “cysteine”
Classes: general categories of terms
– month names, last names, amino acids
– capitalized nouns
– verb phrases
Domain Knowledge Base
Rules: LHS, RHS, salience
Left-hand side (LHS): a pattern to be matched, written as relationships among terms and classes
Right-hand side (RHS): an action to be taken when the pattern is found
Salience: priority of this rule (weight, strength, confidence)
Some Rule Examples
<Monthname> <Year> => <Date>
<Date> <Name> => print “Birthdate”, <Name>, <Date>
<Name> <Address> => create address database record
<daynumber> “/” <monthnumber> “/” <year> => create date database record (50)
<monthnumber> “/” <daynumber> “/” <year> => create date database record (60)
<capitalized noun> <single letter> “.” <capitalized noun> => <Name>
<noun phrase> <to-be verb> <noun phrase> => create “relationship” database record
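Rule application with salience can be sketched as follows; plain regexes stand in for the class-based LHS patterns, and the two date rules mirror the 50/60 example above (the rule bodies themselves are invented):

```python
import re

RULES = [
    # (salience, LHS pattern, RHS action name) -- illustrative only
    (60, re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
     "create month/day/year date record"),
    (50, re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
     "create day/month/year date record"),
]

def apply_rules(text):
    """Fire the highest-salience rule whose LHS pattern matches."""
    for salience, pattern, action in sorted(RULES, key=lambda r: -r[0]):
        if pattern.search(text):
            return action
    return None
```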
Generic KB
Generic KB: a KB likely to be useful in many domains
– names
– dates
– places
– organizations
Almost all systems have one
Limited by cost of development: it takes about 200 rules to define dates reasonably well, for instance
Domain-specific KB
We mostly can’t afford to build a KB for the entire world.
However, most applications are fairly domain-specific.
Therefore we build domain-specific KBs which identify the kind of information we are interested in:
– protein-protein interactions
– airline flights
– terrorist activities
Domain-specific KBs
Typically start with the generic KBs
Add terminology
Figure out what kinds of information you want to extract
Add rules to identify it
Test against documents which have been human-scored to determine precision and recall for individual items
Knowledge Model
We aren’t looking for documents, we are looking for information. What information?
Typically we have a knowledge model or schema which identifies the information components we want and their relationships
Typically looks very much like a DB schema or object definition
Knowledge Model Examples
Personal records– Name
– First name– Middle Initial– Last Name
– Birthdate– Month– Day– Year
– Address
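A model like this maps naturally onto object definitions; the field names follow the outline above, and the types are assumptions:

```python
from dataclasses import dataclass

# Personal-record knowledge model rendered as object definitions.
@dataclass
class Name:
    first_name: str
    middle_initial: str
    last_name: str

@dataclass
class Birthdate:
    month: int
    day: int
    year: int

@dataclass
class PersonalRecord:
    name: Name
    birthdate: Birthdate
    address: str
```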
Knowledge Model Examples
Protein Inhibitors
– Protein name (class?)
– Compound name (class?)
– Pointer to source
– Cache of text
– Offset into text
Knowledge Model Examples
Airline Flight Record
– Airline
– Flight Number
– Origin
– Destination
– Date
» Status
» departure time
» arrival time
Summary
Text mining below the document level
NOT typically interactive, because it’s slow (1 to 100 meg of text/hr)
Typically builds up a DB of information which can then be queried
Uses a combination of term- and rule-driven analysis and NLP parsing
AeroText: very good system developed by LMCO; we will get a complete demo on March 26