BAS 250 Lesson 7: Text Mining
This Week’s Learning Objectives
Explain what text mining is, how it is used, and the
benefits of using it
Recognize the various text formats that can be used when
performing text mining
Understand common text-parsing operators such as
tokenization, stop word filtering, n-gram construction,
stemming, etc.
Text mining is a powerful way of analyzing data
that is in an unstructured format, such as
paragraphs of text
The results of text analysis can reveal the
frequency and commonality of strong words or
n-grams across groups of documents
Overview
There are several important transformations
that must be made to unstructured data in
order to prepare it for text mining procedures
Text Transformations
When mining text, the words in the text must be
grouped together and counted
Without some numeric structure, the computer
cannot assess the meaning of the words
o Tokenize operators turn words in an input document
into attributes that can be mined
Text Transformations
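The tokenize-and-count idea above can be sketched in a few lines of Python. This is only an illustration, not RapidMiner's Tokenize operator: the helper name `tokenize`, the letters-only regular expression, and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into word tokens, treating any non-letter as a separator (hypothetical helper)."""
    return re.findall(r"[A-Za-z]+", text)

sample = "the data is in the text"  # assumed sample document
tokens = tokenize(sample)

# Each distinct token becomes an attribute whose value is its count.
counts = Counter(tokens)

print(tokens)
print(counts)
```

Once the words are counted, the document has the numeric structure that later mining steps need.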
Text contains conjunctions and articles that make it readable in
English, along with some abbreviations and even typographical errors
Such filler words, errors, and abbreviations should be removed
because they say little about the meaning of the text data
o Stopwords operators - remove stop words
o Filter Tokens operators - set minimum character length for words to be
included in analysis
Text Transformations
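A minimal Python sketch of the two filters just described, stop word removal and a minimum token length, might look like the following. The stop word list here is a tiny assumed sample, and the function names `remove_stop_words` and `filter_by_length` are hypothetical; they are not the RapidMiner operators themselves.

```python
# Assumed, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "or", "but", "of", "in", "is", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word list (compared in lowercase)."""
    return [t for t in tokens if t.lower() not in stop_words]

def filter_by_length(tokens, min_chars=3):
    """Keep only tokens with at least min_chars characters."""
    return [t for t in tokens if len(t) >= min_chars]

tokens = ["the", "price", "of", "a", "tv", "and", "an", "antenna"]
kept = filter_by_length(remove_stop_words(tokens), min_chars=3)
print(kept)
```

In this sample, the length filter also drops the short abbreviation "tv", which shows how a minimum length can remove abbreviations and stray fragments.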
In some instances, uppercase letters will not
match the same letters in lowercase. This is
known as case sensitivity
o For example, ‘Data’ might be interpreted differently
than ‘data’
o Can be addressed using a Transform Cases operator
Text Case Sensitivity
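To see why case matters when counting, here is a small Python sketch that counts the same tokens with and without case folding. Lowercasing is used here as a stand-in for what a case-transforming step does; the token list is an assumed example.

```python
from collections import Counter

tokens = ["Data", "data", "DATA", "mining"]  # assumed sample tokens

# Case-sensitive counting treats each spelling as a different attribute.
case_sensitive = Counter(tokens)

# Converting every token to lowercase first merges the variants.
lowered = Counter(t.lower() for t in tokens)

print(case_sensitive)
print(lowered)
```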
Stemming is the act of finding terms that share a
common root and combining them to mean
essentially the same thing
o For example, the words ‘Company’, ‘Companies’, and
‘Company’s’ can all be reduced to a common form such
as ‘Compan’ and represented as a single attribute
Text Stemming
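The stemming example above can be mimicked with a toy Python function. This is a deliberately simplified sketch that handles only the possessive and the "y"/"ies" endings from the example; it is not the Porter algorithm or any stemmer that a text mining tool actually ships, and the name `toy_stem` is hypothetical.

```python
def toy_stem(word):
    """Reduce a word to a crude root: lowercase, drop a possessive, strip 'ies' or 'y'."""
    w = word.lower()
    # Drop a trailing possessive written with a straight or curly apostrophe.
    for possessive in ("'s", "’s"):
        if w.endswith(possessive):
            w = w[: -len(possessive)]
    # Strip one plural or singular ending, leaving at least three letters of root.
    for suffix in ("ies", "y"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w

words = ["Company", "Companies", "Company's"]
stems = {toy_stem(w) for w in words}
print(stems)
```

All three variations collapse to one root, so they can be counted as a single attribute. A real stemmer applies many more rules than this.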
An n-gram is a phrase or combination of words that may take on a meaning that is different from, or greater than, the meaning of each word individually
o The ’n’ is the maximum number of terms you want to group together
o For example, the single term ‘angry’ vs. an n-gram of two, ‘violent_angry’, vs. an n-gram of three, ‘violent_angry_behavior’
o Can be an excellent way to bring more granular analysis to text mining activity
Text n-gram
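As a sketch of how n-grams up to a maximum length can be built from a token list, consider the Python function below. The name `ngrams_up_to` is hypothetical, and joining terms with an underscore simply follows the notation used in the example above.

```python
def ngrams_up_to(tokens, max_n):
    """Return every run of 1..max_n consecutive tokens, joined with underscores."""
    grams = []
    for n in range(1, max_n + 1):
        # Slide a window of size n across the token list.
        for i in range(len(tokens) - n + 1):
            grams.append("_".join(tokens[i:i + n]))
    return grams

tokens = ["violent", "angry", "behavior"]
print(ngrams_up_to(tokens, 2))
print(ngrams_up_to(tokens, 3))
```

With a maximum of two, the result contains the single terms plus ‘violent_angry’ and ‘angry_behavior’; raising the maximum to three adds ‘violent_angry_behavior’.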
Replace Tokens is similar to replacing missing
or inconsistent values, but it is useful for combining
several terms into a single token
o For example, given ‘product’, ‘item’, and ‘device’, change ‘product’ and ‘item’ to ‘device’
Missing or Replacing Text
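The token replacement described above amounts to a lookup table. A minimal Python sketch, assuming a hypothetical `REPLACEMENTS` mapping and helper name, could be:

```python
# Assumed mapping from the example: fold 'product' and 'item' into 'device'.
REPLACEMENTS = {"product": "device", "item": "device"}

def replace_tokens(tokens, mapping=REPLACEMENTS):
    """Swap each token for its replacement if one is defined; otherwise keep it."""
    return [mapping.get(t, t) for t in tokens]

tokens = ["product", "item", "device", "price"]
print(replace_tokens(tokens))
```

After replacement, the three synonyms are counted together under the single token ‘device’.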
An example of text mining results converted to a horizontal bar chart for viewing frequencies using RapidMiner
Visualizing Text
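For readers without the chart at hand, a rough text-only version of a horizontal frequency bar chart can be produced in Python. This is an assumption-laden sketch, not RapidMiner output: the function name `text_bar_chart`, the word counts, and the bar width are all invented for illustration, and the function expects a non-empty set of counts.

```python
from collections import Counter

def text_bar_chart(counts, width=20):
    """Return one line per word, most frequent first, with a bar scaled to `width`."""
    top = max(counts.values())  # assumes counts is not empty
    lines = []
    for word, n in counts.most_common():
        bar = "#" * max(1, round(n / top * width))
        lines.append(f"{word:<10} {bar} {n}")
    return lines

counts = Counter({"device": 4, "price": 2, "screen": 1})  # made-up frequencies
lines = text_bar_chart(counts)
for line in lines:
    print(line)
```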
Summary - Learning Objectives
Explain what text mining is, how it is used, and the
benefits of using it
Recognize the various text formats that can be used when
performing text mining
Understand common text-parsing operators such as
tokenization, stop word filtering, n-gram construction,
stemming, etc.
“This workforce solution was funded by a grant awarded by the U.S. Department of Labor’s
Employment and Training Administration. The solution was created by the grantee and does not
necessarily reflect the official position of the U.S. Department of Labor. The Department of Labor
makes no guarantees, warranties, or assurances of any kind, express or implied, with respect to such
information, including any information on linked sites and including, but not limited to, accuracy of the
information or its completeness, timeliness, usefulness, adequacy, continued availability, or
ownership.”
Except where otherwise stated, this work by Wake Technical Community College Building Capacity in
Business Analytics, a Department of Labor, TAACCCT funded project, is licensed under the Creative
Commons Attribution 4.0 International License. To view a copy of this license, visit
http://creativecommons.org/licenses/by/4.0/
Copyright Information