BAS 250 Lesson 7: Text Mining

Upload: wake-tech-bas

Post on 14-Apr-2017


TRANSCRIPT

Page 1: BAS 250 Lecture 7

BAS 250 Lesson 7: Text Mining

Page 2: BAS 250 Lecture 7

This Week’s Learning Objectives

Explain what text mining is, how it is used, and the benefits of using it

Recognize the various text formats that can be used when performing text mining

Understand common text-parsing operators such as tokenization, stop word filtering, n-gram construction, stemming, etc.

Page 3: BAS 250 Lecture 7

Text mining is a powerful way of analyzing data that is in an unstructured format, such as paragraphs of text

The results of text analysis can reveal the frequency and commonality of strong words or n-grams across groups of documents

Overview

Page 4: BAS 250 Lecture 7

There are several important transformations that must be made to unstructured data in order to prepare it for text mining procedures

Text Transformations

Page 5: BAS 250 Lecture 7

When mining text, the words in the text must be grouped together and counted

Without some numeric structure, the computer cannot assess the meaning of the words

o Tokenize operators turn words in an input document into attributes that can be mined

Text Transformations
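The grouping-and-counting step on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not the RapidMiner Tokenize operator itself; the function names `tokenize` and `count_tokens` and the sample documents are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into word tokens: runs of letters, optionally with an internal apostrophe."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)


def count_tokens(documents):
    """Return one Counter per document; each distinct token acts like an attribute with a count."""
    return [Counter(tokenize(doc)) for doc in documents]


# Hypothetical sample documents, used only for illustration.
docs = [
    "Text mining finds patterns in text.",
    "Mining text takes preparation.",
]
counts = count_tokens(docs)

for doc_counts in counts:
    print(dict(doc_counts))
```

Note that 'Text' and 'text' are counted as separate tokens here, because nothing in this sketch changes the case of the words.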

Page 6: BAS 250 Lecture 7

Text contains the conjunctions and articles needed to make it readable in English, along with some abbreviations and even typographical errors

These stop words, errors, and abbreviations should be removed because they say little about the meaning of the text data

o Stopwords operators - remove stop words

o Filter Tokens operators - set a minimum character length for words to be included in the analysis

Text Transformations
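The two filters on this slide can be sketched in Python as follows. The stop word list is a tiny hand-picked sample, and the names `STOP_WORDS`, `remove_stop_words`, and `filter_by_length` are assumptions for this example rather than RapidMiner operator names.

```python
# A deliberately small, illustrative stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "or", "but"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word list (compared in lowercase)."""
    return [t for t in tokens if t.lower() not in stop_words]


def filter_by_length(tokens, min_chars=3):
    """Keep only tokens with at least min_chars characters."""
    return [t for t in tokens if len(t) >= min_chars]


# Hypothetical token list, used only for illustration.
tokens = ["the", "price", "of", "a", "pc", "is", "falling"]
kept = filter_by_length(remove_stop_words(tokens), min_chars=3)
print(kept)
```

A minimum length of three characters removes the abbreviation 'pc' in this sample; the right threshold depends on the text being mined.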

Page 7: BAS 250 Lecture 7

In some instances, uppercase letters will not match the same letters in lowercase; this is known as Case Sensitivity

o For example, ‘Data’ might be interpreted differently than ‘data’

o Can be addressed using a Transform Cases operator

Text Case Sensitivity
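A minimal Python sketch of the case problem and its fix is shown below. The function name `transform_case` is an assumption that mirrors the operator named on the slide; it is not RapidMiner code.

```python
from collections import Counter


def transform_case(tokens):
    """Convert every token to lowercase so that 'Data' and 'data' match."""
    return [t.lower() for t in tokens]


# Hypothetical token list, used only for illustration.
tokens = ["Data", "data", "DATA", "mining"]

before = Counter(tokens)                  # three separate spellings of the same word
after = Counter(transform_case(tokens))   # one combined count for 'data'

print(dict(before))
print(dict(after))
```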

Page 8: BAS 250 Lecture 7

Stemming is the act of finding terms that share a common root and combining them to mean essentially the same thing

o For example, the words ‘Company’, ‘Companies’, and ‘Company’s’ can all be reduced to a common form such as ‘Compan’ and represented as a single attribute

Text Stemming
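The example on this slide can be reproduced with a very small suffix-stripping function. This is a naive sketch written for this example only: `naive_stem` is a hypothetical name, it handles just a few English endings, and it is not the Porter algorithm or any other standard stemmer.

```python
def naive_stem(word):
    """Reduce a word to a rough root by lowercasing and stripping a few common endings."""
    w = word.lower().replace("’", "'")  # treat curly and straight apostrophes alike
    if w.endswith("'s"):
        w = w[:-2]  # drop the possessive ending
    for suffix in ("ies", "y", "s"):
        # Strip the first matching suffix, but keep at least three letters of root.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


# The three word variations from the slide.
words = ["Company", "Companies", "Company's"]
stems = [naive_stem(w) for w in words]
print(stems)
```

All three variations collapse to the same root, so they can be counted as a single attribute.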

Page 9: BAS 250 Lecture 7

An n-gram is a phrase or combination of words that may take on a meaning that is different from, or greater than, the meaning of each word individually

o The ‘n’ is the maximum number of terms you want to group together

o For example, the single term ‘angry’ vs. an n-gram of two, ‘violent_angry’, vs. an n-gram of three, ‘violent_angry_behavior’

o Can be an excellent way to bring more granular analysis to a text mining activity

Text n-gram
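N-gram construction can be sketched in Python as shown below, treating 'n' as the maximum number of adjacent terms to join, as described on this slide. The function name `build_ngrams` is an assumption for this example; the underscore separator follows the slide's own examples.

```python
def build_ngrams(tokens, max_n):
    """Return every run of 1 to max_n adjacent tokens, joined with underscores."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append("_".join(tokens[i:i + n]))
    return grams


# The terms from the slide's example.
tokens = ["violent", "angry", "behavior"]
grams = build_ngrams(tokens, max_n=3)
print(grams)
```

With a maximum of three, the result contains the single terms, the two-term grams, and the one three-term gram.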

Page 10: BAS 250 Lecture 7

Replace Tokens is similar to replacing missing or inconsistent values, but it is useful for combining several terms into a single token

o For example, given the terms ‘product’, ‘item’, and ‘device’, change ‘product’ and ‘item’ to ‘device’

Missing or Replacing Text
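The token replacement described on this slide amounts to a lookup table. A minimal Python sketch follows; the names `REPLACEMENTS` and `replace_tokens` are assumptions for this example, not RapidMiner settings.

```python
# Map each term to the single token that should represent it (from the slide's example).
REPLACEMENTS = {"product": "device", "item": "device"}


def replace_tokens(tokens, replacements=REPLACEMENTS):
    """Swap any token found in the mapping for its replacement; leave the rest unchanged."""
    return [replacements.get(t, t) for t in tokens]


# Hypothetical token list, used only for illustration.
tokens = ["product", "item", "device", "price"]
print(replace_tokens(tokens))
```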

Page 11: BAS 250 Lecture 7

An example of text mining results converted to a horizontal bar chart for viewing frequencies using RapidMiner

Visualizing Text
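The chart on this slide was produced in RapidMiner and is not reproduced in this transcript. As a rough stand-in, the Python sketch below prints word frequencies as a plain-text horizontal bar chart. The function name `bar_chart_lines` and the sample counts are assumptions made for this example and are not results from the lecture.

```python
from collections import Counter


def bar_chart_lines(counts, width=30):
    """Build one text line per term, most frequent first, with a bar scaled to width."""
    if not counts:
        return []
    top = max(counts.values())
    label_width = max(len(term) for term in counts)
    lines = []
    # Sort by descending count, then alphabetically to keep ties stable.
    for term, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        bar = "#" * max(1, round(n / top * width))
        lines.append(f"{term.ljust(label_width)} | {bar} {n}")
    return lines


# Hypothetical frequencies, used only for illustration.
sample = Counter({"device": 4, "price": 2})
for line in bar_chart_lines(sample, width=8):
    print(line)
```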

Page 12: BAS 250 Lecture 7

Summary - Learning Objectives

Explain what text mining is, how it is used, and the benefits of using it

Recognize the various text formats that can be used when performing text mining

Understand common text-parsing operators such as tokenization, stop word filtering, n-gram construction, stemming, etc.

Page 13: BAS 250 Lecture 7

“This workforce solution was funded by a grant awarded by the U.S. Department of Labor’s Employment and Training Administration. The solution was created by the grantee and does not necessarily reflect the official position of the U.S. Department of Labor. The Department of Labor makes no guarantees, warranties, or assurances of any kind, express or implied, with respect to such information, including any information on linked sites and including, but not limited to, accuracy of the information or its completeness, timeliness, usefulness, adequacy, continued availability, or ownership.”

Except where otherwise stated, this work by Wake Technical Community College Building Capacity in Business Analytics, a Department of Labor, TAACCCT funded project, is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

Copyright Information