TRANSCRIPT
TEXT MINING DAY 2: KAGGLE COMPETITION
BDA17, May 9, 2017 Revised 9 PM
R. Bohn + Sai Kolasani + entire class
AGENDA
➤ Understand the Kaggle challenge and basic approach to solving it.
➤ Review bag-of-words methods: create data matrix.
THE DATA: AMAZON REVIEWS, 1 SENTENCE EACH
➤ 500 Train
➤ 200 Validate
➤ 300 Test
➤ 60% of these (180 observations) show up on the PUBLIC leaderboard
➤ The other 120 observations never appear on the public board
➤ (Cheating is theoretically possible since there are only 300 sentences in the test set: classify them by hand, then create an algorithm that reproduces the hand labels.)
RAW DATA = 1 SENTENCE OF TEXT FROM AMAZON REVIEWS
➤ 88 Terrible.. My car will not accept this cassette. 0
➤ 89 Buttons are too small. 0
➤ 90 This is a great phone!. 1
➤ 92 Worst ever. 0
➤ 93 I really recommend this faceplates since it looks very nice, elegant and cool. 1
➤ 94 It does everything the description said it would. 1
➤ 95 This is a VERY average phone with bad battery life that operates on a weak network. 0
➤ 96 Yes it's shiny on front side - and I love it! 1
➤ 101 There's a horrible tick sound in the background on all my calls that I have never experienced before. 0
Always examine the rawest possible data, even if only 10-100 observations.
CREATING ADDITIONAL VARIABLES = GENERAL STRATEGY
➤ Text mining problems usually go beyond straight text variables.
➤ Many possibilities, such as punctuation, capitalized words, bigrams, sentence length, etc.
➤ Will these be useful? There is no way to predict, so just try them.
➤ In this exercise, the small sample size (500) means they are probably not going to be very useful.
➤ In broader situations, look for outside variables such as source of the text, time of day, etc.
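As a sketch, such extra variables might be computed in base R like this (the feature names and example sentences are illustrative, not part of the class assignment):

```r
# Hypothetical extra features computed directly from raw sentences.
sentences <- c("This is a great phone!.", "Worst ever.")

extra.features <- data.frame(
  n.chars   = nchar(sentences),                                   # sentence length
  n.words   = lengths(strsplit(sentences, "\\s+")),               # word count
  n.exclaim = lengths(regmatches(sentences,
                                 gregexpr("!", sentences, fixed = TRUE))),  # punctuation
  n.caps    = lengths(regmatches(sentences,
                                 gregexpr("\\b[A-Z]{2,}\\b", sentences)))   # ALL-CAPS words
)
```

These columns can then be bound onto the bag-of-words matrix alongside the token counts.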
BASIC MODEL FOR TEXT MINING: BAG OF WORDS (REVIEW)
➤ Filter the sentences as desired (e.g. stop words)
➤ Tokenize words. Count occurrences of each token
➤ Result = matrix with 500 rows, many columns (# of unique tokens)
➤ Add other variables if desired, e.g. sentence length
➤ Create a matrix of numbers, sentences x variables
➤ Use a model to classify this data as 0 or 1
➤ Validate the model.
➤ Repeat until satisfied
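A minimal sketch of these steps with the tm package (the example sentences and variable names are illustrative; the filtering choices are up to you):

```r
library(tm)

sentences <- c("This is a great phone!.", "Worst ever.", "Buttons are too small.")
corpus <- Corpus(VectorSource(sentences))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # filter stop words

dtm <- DocumentTermMatrix(corpus)   # tokenize + count occurrences of each token
X <- as.matrix(dtm)                 # rows = sentences, columns = unique tokens
X <- cbind(X, sent.length = nchar(sentences))  # add other variables if desired
```

The resulting matrix X is what gets fed to a 0/1 classifier.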
WHAT’S THE ROLE OF TEST DATA?
➤ Why separate test and validation data?
➤ “Then we realized that we can’t apply our model if the training data and other two data frames are separated, thus we combine all three data frames together. We change the sentences into corpus and do the standard processing: tokenization, stopwords and stemming.”
➤ This is DANGEROUS! It risks leakage and invalid test results.
➤ Sometimes called “time travel”: the analysis is done using information from the future (the test data).
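A safer pattern, sketched with tm (train.df and test.df are assumed data frames with a sentence column): derive the vocabulary from the training data only, then reuse that dictionary on the test data, so no information flows backwards from the test set.

```r
library(tm)

# Build the term dictionary from TRAINING sentences only.
train.corpus <- Corpus(VectorSource(train.df$sentence))
train.dtm <- DocumentTermMatrix(train.corpus)

# Apply the SAME dictionary to the test sentences: unseen words are dropped,
# and no test-set information leaks into the features.
test.corpus <- Corpus(VectorSource(test.df$sentence))
test.dtm <- DocumentTermMatrix(test.corpus,
                               control = list(dictionary = Terms(train.dtm)))
```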
WHAT STOPWORDS TO USE?
Yuwen Xu + Ziqing Li

ret <- tm_map(ret, removeWords, stopwords("english"))
ret <- tm_map(ret, removeWords, list.df)

Submissions 1 and 3 use the default stopwords list in the tm package. Submissions 2 and 4 use a customized stopwords list, which removes "aren't", "wasn't", "weren't", "hasn't", "didn't", "won't", "wouldn't", "shouldn't", "can't", "cannot", "couldn't", "mustn't", "not", and "no" from the default list. We believe these words are more likely to convey negative opinions and are important in predicting the tone of the reviews.

CUSTOM VS GENERIC OUTSIDE KNOWLEDGE?

Submission No.  Stopwords        Stemming  Validation accuracy  Test score
1               Default English  No        104/140 = 74%        0.72222
2               Customized       No         93/140 = 66%        0.74444
3               Default English  Yes       105/140 = 75%        0.71667
4               Customized       Yes        95/140 = 68%        0.77778
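One way such a customized list could be built (a sketch, not necessarily the team's actual code) is to subtract the negation words from the default English stopword list before removal:

```r
library(tm)

# Keep negation words in the text so they can signal negative reviews.
negations <- c("aren't", "wasn't", "weren't", "hasn't", "didn't", "won't",
               "wouldn't", "shouldn't", "can't", "cannot", "couldn't",
               "mustn't", "not", "no")
custom.stopwords <- setdiff(stopwords("english"), negations)

ret <- tm_map(ret, removeWords, custom.stopwords)
```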
To: Prof. Roger Bohn and TA Sai Kolasani
From: Yuwen Xu [email protected]; Ziqing Li [email protected] (Team name: Yuwen &
Ziqing)
Subject: Week 6 Assignment / Kaggle competition for text mining
Date: May 09, 2017
We have submitted four times in the Kaggle competition. Submissions 3 and 4 were completed after May 8, 11:59 pm.
All submissions use similar R code, shown in the Appendix. All four submissions use the tm package to process the text: convert to lowercase, strip whitespace, remove punctuation, remove numbers, and remove stopwords. From the processed text in the training dataset, we picked out the top 25 positive words and the top 25 negative words. We use a for loop to calculate the rate of positive words and the rate of negative words for each entry. If the rate of positive words is higher, the entry is classified positive; if the rate of negative words is higher, the entry is classified negative. The rate of positive words is the total frequency of top positive words in the entry divided by the sum of the frequency of top positive words and the frequency of top negative words. The rate of negative words is defined symmetrically, with the frequency of top negative words in the numerator.
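The rate computation described above can be sketched for a single entry as follows (the word lists here are illustrative toys, not the team's actual top-25 lists, and the sketch assumes the entry contains at least one listed word):

```r
top.pos <- c("great", "good", "love")      # illustrative top positive words
top.neg <- c("bad", "worst", "terrible")   # illustrative top negative words

entry <- strsplit(tolower("This is a great phone"), "\\s+")[[1]]
freq.pos <- sum(entry %in% top.pos)        # frequency of top positive words
freq.neg <- sum(entry %in% top.neg)        # frequency of top negative words

rate.pos <- freq.pos / (freq.pos + freq.neg)
label <- ifelse(rate.pos > 1 - rate.pos, 1, 0)   # 1 = positive, 0 = negative
```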
Submissions 1 and 2 do not use ‘stemDocument’, but Submissions 3 and 4 include ‘stemDocument’ as part of text processing. Submissions 1 and 3 use the existing stopwords list in the tm package. Submissions 2 and 4 use a customized stopwords list, which removes "aren't", "wasn't", "weren't", "hasn't", "haven't", "hadn't", "doesn't", "don't", "didn't", "won't", "wouldn't", "shan't", "shouldn't", "can't", "cannot", "couldn't", "mustn't", "not", and "no" from the existing list. We believe these words are more likely to convey negative opinions and are important in predicting the tone of the reviews.
Submission No.  Stopwords        Stemming  Validation accuracy  Test score
1               Default English  No        104/140 = 74%        0.72222
2               Customized       No         93/140 = 66%        0.74444
3               Default English  Yes       105/140 = 75%        0.71667
4               Customized       Yes        95/140 = 68%        0.77778
The table above shows that there is little improvement from stemming the text after text processing. The accuracy on the validation dataset is lower with customized stopwords; however, the test scores (based on 60% of the test data) are higher with customized stopwords.
This was the original version of the previous page.
WORD2VEC = NEW METHOD
➤ Heavily used at Google, e.g. for translation
➤ Each word is mapped to a vector of ~100 dimensions
➤ Trained on giant corpora
➤ Uses a shallow neural network
➤ Words with similar meanings are “close” in the vector space
➤ Very new; implications unclear
➤ Many other methods are also better than bag of words.
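A toy sketch of the "close in vector space" idea (the three-dimensional vectors here are made up for illustration; real word2vec vectors have ~100-300 trained dimensions):

```r
# Words as dense vectors; similar meanings give high cosine similarity.
vec <- list(
  good  = c(0.9, 0.1, 0.3),
  great = c(0.8, 0.2, 0.35),
  awful = c(-0.7, 0.9, -0.2)
)

cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine(vec$good, vec$great)  # high: similar meaning
cosine(vec$good, vec$awful)  # much lower: dissimilar meaning
```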
class <- vector(mode = "numeric", length = 300)
for (i in 1:300) {
  row <- test.df[i, ]
  temp <- Corpus(VectorSource(row$sentence))
  temp <- tokenizer(temp)
  temp <- rowSums(as.matrix(TermDocumentMatrix(temp)))
  rateP <- 1
  rateN <- 1
  for (word in names(temp)) {
    p <- 0.01
    n <- 0.01
    if (!is.na(pos[word])) { p <- p + pos[word] }
    if (!is.na(neg[word])) { n <- n + neg[word] }
    rateP <- rateP * (p / (p + n))^temp[word]  # parentheses added: ^ binds tighter than /
    rateN <- rateN * (n / (p + n))^temp[word]
  }
  class[i] <- if (rateP > rateN) 1 else 0      # decide once per sentence, after the loop
}
Don’t Use Loops in R!!
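As a hedged sketch (not the team's actual code), the outer for loop could be replaced with sapply over the sentences, using a simplified score per sentence and assuming pos and neg are named numeric vectors of word frequencies:

```r
# Vectorized alternative to the explicit for loop above.
score.sentence <- function(sentence) {
  words <- strsplit(tolower(sentence), "\\s+")[[1]]
  p <- sum(pos[words], na.rm = TRUE)  # total "positive" evidence in the sentence
  n <- sum(neg[words], na.rm = TRUE)  # total "negative" evidence in the sentence
  as.numeric(p >= n)                  # 1 = positive, 0 = negative
}

class <- sapply(test.df$sentence, score.sentence)
```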
AVOIDING LOOPS: THE APPLY FAMILY
apply acts on the rows or columns of a matrix.
apply(X, Dimension, Function, ...), where X is a matrix; Dimension indicates whether to consider the rows (1), the columns (2), or both (c(1, 2)); Function is a function to apply; and ... are possible optional arguments for Function.
For sums and means of matrix dimensions, we have shortcuts:
rowSums(x)  == apply(x, 1, sum)
rowMeans(x) == apply(x, 1, mean)
colSums(x)  == apply(x, 2, sum)
colMeans(x) == apply(x, 2, mean)
Many variants of apply to use in different situations.
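A quick sketch of the most common variants (base R, illustrative calls):

```r
lapply(1:3, sqrt)              # always returns a list
sapply(1:3, sqrt)              # simplifies the result to a vector when possible
vapply(1:3, sqrt, numeric(1))  # like sapply, but with a declared result type (safer)
mapply(rep, 1:3, 3:1)          # applies over multiple argument vectors in parallel
```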
apply
> x <- matrix(rnorm(200), 20, 10)
> apply(x, 2, mean)
[1] 0.04868268 0.35743615 -0.09104379
[4] -0.05381370 -0.16552070 -0.18192493
[7] 0.10285727 0.36519270 0.14898850
[10] 0.26767260
> apply(x, 1, sum)
[1] -1.94843314 2.60601195 1.51772391
[4] -2.80386816 3.73728682 -1.69371360
[7] 0.02359932 3.91874808 -2.39902859
[10] 0.48685925 -1.77576824 -3.34016277
[13] 4.04101009 0.46515429 1.83687755
[16] 4.36744690 2.21993789 2.60983764
[19] -1.48607630 3.58709251
The R Language
HOW TO IMPROVE MODEL FIT - 4 BASIC STRATEGIES
1. Get more observations (rows in the dataset)
2. Create more/different features from existing data.
• E.g. interaction terms; change from continuous to discrete; log transform.
3. Bring in outside knowledge e.g. stopword lists; weather data
4. Change the algorithm. Use validation data to decide on “best”
• Switch to an entirely new algorithm, e.g. LASSO
• Tune the coefficients. Every algorithm needs tuning.
• E.g. reduce overfitting; use cutoff >> .5
5. Redefine the problem; check your metric of model performance
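Strategy 4's tuning step can be sketched as a cutoff search on the validation set (prob and truth are assumed vectors of predicted probabilities and true 0/1 labels for the validation rows):

```r
# Pick the classification cutoff that maximizes validation accuracy.
cutoffs <- seq(0.3, 0.7, by = 0.05)
acc <- sapply(cutoffs, function(cut) mean((prob > cut) == truth))
best.cutoff <- cutoffs[which.max(acc)]
```

The chosen cutoff is then applied unchanged to the test set, never re-tuned on it.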