
TEXT MINING DAY 2: KAGGLE COMPETITION

BDA17, May 9, 2017 Revised 9 PM

R. Bohn + Sai Kolasani + entire class


AGENDA

➤ Understand the Kaggle challenge and basic approach to solving it.

➤ Review bag-of-words methods: create data matrix.


THE DATA: AMAZON REVIEWS, 1 SENTENCE EACH

➤ 500 Train

➤ 200 Validate

➤ 300 Test

➤ 60% of these (180 sentences) show up on the PUBLIC leaderboard

➤ 120 observations are never on the public board

➤ (Cheating is theoretically possible since there are only 300 sentences in the test set: classify them by hand, then build an algorithm that reproduces those labels.)


RAW DATA = 1 SENTENCE OF TEXT FROM AMAZON REVIEWS (ID, sentence, label: 1 = positive, 0 = negative)

➤ 88 Terrible.. My car will not accept this cassette. 0

➤ 89 Buttons are too small. 0

➤ 90 This is a great phone!. 1

➤ 92 Worst ever. 0

➤ 93 I really recommend this faceplates since it looks very nice, elegant and cool. 1

➤ 94 It does everything the description said it would. 1

➤ 95 This is a VERY average phone with bad battery life that operates on a weak network. 0

➤ 96 Yes it's shiny on front side - and I love it! 1

➤ 101 There's a horrible tick sound in the background on all my calls that I have never experienced before. 0


Always examine the rawest possible data; look at 10-100 observations.
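For instance, in R (the file name train.csv is an assumption):

train.df <- read.csv("train.csv", stringsAsFactors = FALSE)  # raw, unprocessed
head(train.df, 10)  # eyeball the first 10 observations before any cleaning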

CREATING ADDITIONAL VARIABLES = GENERAL STRATEGY

➤ Text mining problems usually go beyond straight text variables.

➤ Many possibilities, such as punctuation, capitalized words, bigrams, sentence length, etc.

➤ Will these be useful? There is no way to predict, so just try them.

➤ In this exercise, the small sample size (500) means they are probably not going to be very useful.

➤ In broader situations, look for outside variables such as source of the text, time of day, etc.


BASIC MODEL FOR TEXT MINING: BAG OF WORDS (REVIEW)

➤ Filter the sentences as desired (e.g. stop words)

➤ Tokenize words. Count occurrences of each token

➤ Result = matrix with 500 rows, many columns (# of unique tokens)

➤ Add other variables if desired, e.g. sentence length

➤ Create a matrix of numbers, sentences x variables

➤ Use a model to classify this data as 0 or 1

➤ Validate the model.

➤ Repeat until satisfied (a minimal R sketch of these steps follows)
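Something like the following sketch, using the tm package (the names train.df and sentence are assumptions):

library(tm)
corpus <- Corpus(VectorSource(train.df$sentence))
corpus <- tm_map(corpus, content_transformer(tolower))       # normalize case
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # filter stop words
dtm <- DocumentTermMatrix(corpus)  # 500 rows x (# of unique tokens)
X <- as.matrix(dtm)
X <- cbind(X, sent.length = rowSums(X))  # extra variable: tokens per sentence

From here, any classifier (logistic regression, naive Bayes, etc.) can be fit on X against the 0/1 labels.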



PUBLIC LEADERBOARD SOMEWHAT DIFFERENT


WHAT’S THE ROLE OF TEST DATA?

➤ Why separate test and validation data?

➤ “Then we realized that we can’t apply our model if the training data and other two data frames are separated, thus we combine all three data frames together. We change the sentences into corpus and do the standard processing: tokenization, stopwords and stemming.”

➤ This is DANGEROUS! It risks leakage and invalid test results.

➤ Sometimes called “time travel”: the analysis is done using information from the future (the test data). A safer pattern is sketched below.
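A sketch of the safe pattern, by contrast: build the vocabulary from the training corpus only, then force the test sentences onto that fixed vocabulary (object names are assumptions):

library(tm)
train.dtm <- DocumentTermMatrix(Corpus(VectorSource(train.df$sentence)))
test.dtm  <- DocumentTermMatrix(Corpus(VectorSource(test.df$sentence)),
                                control = list(dictionary = Terms(train.dtm)))
# test.dtm now has exactly the training vocabulary; nothing flows backward from test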



WHAT STOPWORDS TO USE?

CUSTOM VS GENERIC OUTSIDE KNOWLEDGE?

Submissions 1 and 3 use the existing stopwords list in the tm package. Submissions 2 and 4 use a customized stopwords list, which excludes "aren't", "wasn't", "weren't", "hasn't", "didn't", "won't", "wouldn't", "shouldn't", "can't", "cannot", "couldn't", "mustn't", "not" and "no" from the existing list. We believe these words are more likely to convey negative opinions and are important in predicting the tone of the reviews.

Submission No.  Stopwords        Stemming  Validation accuracy  Test score
1               Default English  No        104/140 = 74%        0.72222
2               Customized       No         93/140 = 66%        0.74444
3               Default English  Yes       105/140 = 75%        0.71667
4               Customized       Yes        95/140 = 68%        0.77778

Yuwen Xu + Ziqing Li:

ret <- tm_map(ret, removeWords, stopwords("english"))
ret <- tm_map(ret, removeWords, list.df)  # list.df holds the customized list
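One plausible way to construct the customized list (the excluded words come from the memo below; the setdiff construction itself is an assumption, not the team's code):

negation.words <- c("aren't", "wasn't", "weren't", "hasn't", "didn't",
                    "won't", "wouldn't", "shouldn't", "can't", "cannot",
                    "couldn't", "mustn't", "not", "no")
custom.stopwords <- setdiff(stopwords("english"), negation.words)  # keep negations
ret <- tm_map(ret, removeWords, custom.stopwords)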


To: Prof. Roger Bohn and TA Sai Kolasani

From: Yuwen Xu [email protected]; Ziqing Li [email protected] (Team name: Yuwen & Ziqing)

Subject: Week 6 Assignment / Kaggle competition for text mining

Date: May 09, 2017

We have submitted four times on the Kaggle Competition. Submissions 3 and 4 were completed after May 8, 11:59pm.

All submissions use similar R code, shown in the Appendix. All four submissions use the tm package to process texts to lowercase, strip whitespace, remove punctuation, remove numbers, and remove stopwords. From the processed text in the training dataset, we picked out the top 25 positive words and the top 25 negative words. We use a for loop to calculate the rate of positive words and the rate of negative words for each entry. If the rate of positive words is higher, the entry is classified positive; if the rate of negative words is higher, the entry is classified negative. The rate of positive words is the total frequency of top positive words in the entry divided by the sum of the frequency of top positive words and the frequency of top negative words. Similarly, the rate of negative words is the total frequency of top negative words in the entry divided by the same sum.

Submissions 1 and 2 do not use ‘stemDocument’, but Submissions 3 and 4 include ‘stemDocument’ as part of text processing. Submissions 1 and 3 use the existing stopwords list in the tm package. Submissions 2 and 4 use a customized stopwords list, which excludes "aren't", "wasn't", "weren't", "hasn't", "haven't", "hadn't", "doesn't", "don't", "didn't", "won't", "wouldn't", "shan't", "shouldn't", "can't", "cannot", "couldn't", "mustn't", "not" and "no" from the existing list. We believe these words are more likely to convey negative opinions and are important in predicting the tone of the reviews.

Submission No.  Stopwords        Stemming  Validation accuracy  Test score
1               Default English  No        104/140 = 74%        0.72222
2               Customized       No         93/140 = 66%        0.74444
3               Default English  Yes       105/140 = 75%        0.71667
4               Customized       Yes        95/140 = 68%        0.77778

The table above shows that there is little improvement from stemming the text after text processing. The accuracy on the validation dataset is lower with customized stopwords; however, the test score (based on 60% of the test data) is higher with customized stopwords.

This was the original version of the previous page.


WORD2VEC = NEW METHOD

➤ Heavily used at Google, e.g. for translation

➤ Each word is mapped to a vector of ~100 dimensions

➤ Trained on giant corpora

➤ Uses a shallow neural network

➤ Words with similar meanings are “close” in the vector space (see the sketch below)

➤ Very new. Implications unclear.

➤ Many other methods are also better than bag of words.
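A toy illustration of “close in the vector space” via cosine similarity; the vectors below are random stand-ins, not trained embeddings:

cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
set.seed(1)
v.good  <- rnorm(100)                     # pretend embedding for "good"
v.great <- v.good + rnorm(100, sd = 0.3)  # a nearby vector, like "great"
v.phone <- rnorm(100)                     # an unrelated word
cosine(v.good, v.great)  # near 1: similar meanings
cosine(v.good, v.phone)  # near 0: unrelated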


class <- vector(mode = "numeric", length = 300)

for (i in 1:300) {
  row <- test.df[i, ]
  str <- row$sentence
  temp <- Corpus(VectorSource(row))
  temp <- tokenizer(temp)
  temp <- rowSums(as.matrix(TermDocumentMatrix(temp)))
  rateP <- 1
  rateN <- 1
  for (word in names(temp)) {
    p <- 0.01  # small constant so unseen words don't zero the product
    n <- 0.01
    if (!is.na(pos[word])) { p <- p + pos[word] }
    if (!is.na(neg[word])) { n <- n + neg[word] }
    rateP <- rateP * (p / (p + n))^temp[word]
    rateN <- rateN * (n / (p + n))^temp[word]
  }
  # classify once per sentence, after all words are tallied
  if (rateP > rateN) { class[i] <- 1 } else { class[i] <- 0 }
}

Don’t Use Loops in R!!


AVOIDING LOOPS: THE APPLY FAMILY

apply acts on the rows or columns of a matrix.

apply(X, Dimension, Function, ...)

where X is a matrix; Dimension indicates whether to consider the rows (1), the columns (2), or both (c(1, 2)); Function is a function to apply; and ... are possible optional arguments for Function.

For sums and means of matrix dimensions, we have shortcuts:

rowSums  == apply(x, 1, sum)
rowMeans == apply(x, 1, mean)
colSums  == apply(x, 2, sum)

colMeans == apply(x, 2, mean)

Many variants of apply to use in different situations.

apply

> x <- matrix(rnorm(200), 20, 10)

> apply(x, 2, mean)

[1] 0.04868268 0.35743615 -0.09104379

[4] -0.05381370 -0.16552070 -0.18192493

[7] 0.10285727 0.36519270 0.14898850

[10] 0.26767260

> apply(x, 1, sum)

[1] -1.94843314 2.60601195 1.51772391

[4] -2.80386816 3.73728682 -1.69371360

[7] 0.02359932 3.91874808 -2.39902859

[10] 0.48685925 -1.77576824 -3.34016277

[13] 4.04101009 0.46515429 1.83687755

[16] 4.36744690 2.21993789 2.60983764

[19] -1.48607630 3.58709251

(Source: “The R Language” lecture slides.)
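Applied to the slide-13 classifier, a loop-free version might look like this sketch of the memo's rate rule (it assumes the same named frequency vectors pos and neg, plus a crude tokenizer; comparing the two rates reduces to comparing raw counts, since both rates share a denominator):

score <- function(sentence) {
  tokens <- unlist(strsplit(tolower(sentence), "[^a-z']+"))
  p <- sum(pos[tokens], na.rm = TRUE)  # total frequency of top positive words
  n <- sum(neg[tokens], na.rm = TRUE)  # total frequency of top negative words
  as.numeric(p > n)                    # rateP > rateN exactly when p > n
}
class <- sapply(test.df$sentence, score, USE.NAMES = FALSE)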


HOW TO IMPROVE MODEL FIT - 5 BASIC STRATEGIES

1. Get more observations (rows in the dataset)

2. Create more/different features from existing data.

• E.g. interaction terms; change from continuous to discrete; log transform.

3. Bring in outside knowledge, e.g. stopword lists or weather data.

4. Change the algorithm. Use validation data to decide on “best”

• Switch to an entirely new algorithm, e.g. LASSO.

• Tune the coefficients. Every algorithm needs tuning.

• E.g. reduce overfitting; use a cutoff >> 0.5 (see the sketch below).

5. Redefine the problem; check your metric of model performance.
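For the cutoff idea in strategy 4, a sketch of picking the cutoff on validation data rather than defaulting to 0.5 (valid.prob and valid.label are assumed objects: predicted probabilities and true 0/1 labels):

cutoffs  <- seq(0.3, 0.8, by = 0.05)
accuracy <- sapply(cutoffs, function(cut) mean((valid.prob > cut) == valid.label))
best.cutoff <- cutoffs[which.max(accuracy)]  # chosen on validation, applied to test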