Automatic Search Event by Automatic Keyword Extraction
Xiwei Yan, 08-10-2016
Overview
Ads landing pages -> HTML source code -> Text -> Keywords & Key Phrases -> Similar Webpages -> Audience
Motivation
• Automate the search events (free BA from manually generating the keywords)
• Identify users for campaigns that don’t have pixels
A First Glimpse at Result

Approach
• Preprocessing
• Keyword Extraction models
– TF-IDF
– TextRank
– Word2Vec + TextRank
– TextRank + Word2Vec
Approach 1 - TFIDF
• Preprocessing
  • Lower case, lemmatize, remove stop words and punctuation, tokenize, tag and filter by part-of-speech tags
• Keyword Extraction models
  – TF-IDF
    • TF-IDF(w, d, n, N) = TF(w, d) * IDF(n, N)
    • TF(w, d) = # times word w occurred in doc d
    • IDF(n, N) = log(N / n), where n = # docs the word w appears in and N = total # docs

Word        Term freq in doc1   Appears in # docs   TF-IDF
car         27                  3                   0
auto        3                   2                   1.216
insurance   0                   2                   0
best        14                  2                   5.676

The values are consistent with N = 3 docs and the natural logarithm, e.g. 3 * ln(3/2) = 1.216.
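A minimal sketch that recomputes the TF-IDF column of the table. It assumes raw term counts for TF, a natural-log IDF of the form ln(N / n), and N = 3 documents; none of these are stated on the slide, they are inferred from the numbers. The function name `tf_idf` is illustrative only.

```python
import math

def tf_idf(term_freq, docs_with_term, total_docs):
    """TF-IDF with raw term frequency and natural-log IDF (assumed form)."""
    return term_freq * math.log(total_docs / docs_with_term)

# (term frequency in doc1, number of docs containing the word), from the table.
table = {
    "car": (27, 3),
    "auto": (3, 2),
    "insurance": (0, 2),
    "best": (14, 2),
}
TOTAL_DOCS = 3  # assumption: inferred from the table values, not stated on the slide

scores = {word: tf_idf(tf, n, TOTAL_DOCS) for word, (tf, n) in table.items()}
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:10s} {score:.3f}")
```

A word that appears in every document ("car") gets IDF = 0 and drops out regardless of how often it occurs, which is why "best" outranks it here.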
Approach 2 - TextRank
• Preprocessing
  • Lower case, lemmatize, remove stop words and punctuation, tokenize, tag and filter by part-of-speech tags
• Identify structurally important keywords
• Iteratively calculate:
  $S(V_i) = (1 - d) + d \sum_{j \in \mathrm{ngbr}(V_i)} \frac{1}{|\mathrm{degree}(V_j)|} S(V_j)$
d is the damping factor, usually set to 0.85
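Approach 2 - TextRank
[Figure: word graph with nodes geico, auto, insurance, policy, privacy, find, car, coverage, call, service. Node degrees: geico 5, insurance 5, policy 4, auto 2, all others 1.]
Initial scores: every node starts at 1.
Update rule:
$S(V_i) = (1 - d) + d \sum_{j \in \mathrm{ngbr}(V_i)} \frac{1}{|\mathrm{degree}(V_j)|} S(V_j)$
First iteration:
$\mathrm{Score}(\text{geico}) = 0.15 + 0.85 \left( \frac{1}{1} \cdot 1 + \frac{1}{1} \cdot 1 + \frac{1}{2} \cdot 1 + \frac{1}{5} \cdot 1 + \frac{1}{4} \cdot 1 \right) = 2.65$ (neighbors: service, call, auto, insurance, policy)
$\mathrm{Score}(\text{policy}) = 0.15 + 0.85 \left( \frac{1}{1} \cdot 1 + \frac{1}{1} \cdot 1 + \frac{1}{5} \cdot 1 + \frac{1}{5} \cdot 1 \right) = 2.19$ (neighbors: find, privacy, insurance, geico)
$\mathrm{Score}(\text{service}) = 0.15 + 0.85 \left( \frac{1}{5} \cdot 1 \right) = 0.32$ (neighbor: geico)
Scores after the first iteration: geico 2.65, insurance 2.65, policy 2.19, auto 0.49, find 0.36, privacy 0.36, car 0.32, coverage 0.32, call 0.32, service 0.32.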
Approach 2 - TextRank
[Figure: the same word graph after 10 iterations.]
Converged scores: geico 2.12, insurance 2.12, policy 1.77, auto 0.87, find 0.52, privacy 0.52, car 0.51, coverage 0.51, call 0.51, service 0.51.
Check at convergence (inputs rounded to two decimals):
$\mathrm{Score}(\text{geico}) = 0.15 + 0.85 \left( \frac{1}{1} \cdot 0.51 + \frac{1}{1} \cdot 0.51 + \frac{1}{2} \cdot 0.87 + \frac{1}{5} \cdot 2.12 + \frac{1}{4} \cdot 1.77 \right) = 2.12$ (neighbors: service, call, auto, insurance, policy)
$\mathrm{Score}(\text{policy}) = 0.15 + 0.85 \left( \frac{1}{1} \cdot 0.52 + \frac{1}{1} \cdot 0.52 + \frac{1}{5} \cdot 2.12 + \frac{1}{5} \cdot 2.12 \right) = 1.75$ (neighbors: find, privacy, insurance, geico)
$\mathrm{Score}(\text{service}) = 0.15 + 0.85 \left( \frac{1}{5} \cdot 2.12 \right) = 0.51$ (neighbor: geico)
Converges quickly (at most 20 iterations).
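The iteration on these two slides can be sketched as follows. This is an illustration, not the author's implementation. The geico and policy neighborhoods are read off the worked examples. The insurance-auto, insurance-car and insurance-coverage edges are an assumption inferred from the degree labels (5, 5, 4, 2, 1, ...). The update is assumed to be synchronous, and the stopping tolerance and iteration cap are arbitrary choices.

```python
DAMPING = 0.85

# Undirected co-occurrence edges. The last three (insurance-auto,
# insurance-car, insurance-coverage) are inferred from the degree labels
# on the slide, not shown explicitly (assumption).
EDGES = [
    ("geico", "service"), ("geico", "call"), ("geico", "auto"),
    ("geico", "insurance"), ("geico", "policy"),
    ("policy", "find"), ("policy", "privacy"), ("policy", "insurance"),
    ("insurance", "auto"), ("insurance", "car"), ("insurance", "coverage"),
]

def build_graph(edges):
    """Adjacency sets for an undirected graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def textrank_step(graph, scores, d=DAMPING):
    """One synchronous update of S(V_i) for every node."""
    return {
        node: (1 - d) + d * sum(scores[nb] / len(graph[nb]) for nb in nbrs)
        for node, nbrs in graph.items()
    }

def textrank(graph, d=DAMPING, tol=1e-6, max_iter=200):
    """Iterate from all-ones scores until the largest change is below tol."""
    scores = {node: 1.0 for node in graph}
    for _ in range(max_iter):
        new_scores = textrank_step(graph, scores, d)
        converged = max(abs(new_scores[n] - scores[n]) for n in graph) < tol
        scores = new_scores
        if converged:
            break
    return scores

graph = build_graph(EDGES)
first = textrank_step(graph, {node: 1.0 for node in graph})
final = textrank(graph)
for node in sorted(final, key=final.get, reverse=True):
    print(f"{node:10s} first={first[node]:.2f} converged={final[node]:.2f}")
```

The slide values appear to be truncated to two decimals, so the printed numbers may differ from them in the last digit.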
Approach 3 – Word2vec + ?
• Preprocessing
  • No preprocessing (ideally)
• Keyword Extraction models
  – Word2Vec + Clustering
Continuous Bag-of-Words Model + Negative Sampling
[Figure: the context words "The", "cat", "on", "that" enter as one-hot vectors (V*1 each); projection matrix W (D*V) maps them to a D*1 vector; projection matrix W' and a softmax score the candidate words sits, covers, sample, input, predict, learn, believe, type, five, design, human.]
Training steps named on the slide: cost function, backpropagation, gradient descent.
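A minimal sketch of the forward pass drawn on this slide, in plain Python. The toy vocabulary, the embedding size, and the random weights are all assumptions for illustration. It uses a full softmax as in the figure; negative-sampling training is not shown. Multiplying a one-hot vector by W is implemented as a row lookup, so W is stored here as V rows of length D.

```python
import math
import random

random.seed(0)

VOCAB = ["the", "cat", "sits", "on", "that", "covers", "human"]  # toy vocabulary (assumption)
DIM = 4  # embedding size D (assumption)
word_to_id = {w: i for i, w in enumerate(VOCAB)}

# Projection matrix W (one D-dimensional row per word) and output matrix W'.
W = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in VOCAB]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in VOCAB]

def cbow_forward(context_words):
    """Average the context embeddings, score every word, apply softmax."""
    rows = [W[word_to_id[w]] for w in context_words]
    hidden = [sum(col) / len(rows) for col in zip(*rows)]
    logits = [sum(h * o for h, o in zip(hidden, out_row)) for out_row in W_out]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {w: e / total for w, e in zip(VOCAB, exps)}

probs = cbow_forward(["the", "cat", "on", "that"])
print(max(probs, key=probs.get), round(sum(probs.values()), 6))
```

With untrained weights the predicted word is arbitrary; training would adjust W and W' so that the probability of the true center word ("sits" in the slide's example) increases.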
Approach 3 – Word2vec + Clustering
• k-means
• DBSCAN
Approach 3 – Word2vec + TextRank
[Figure: Document Text (words W(1) ... W(n): john deere compact utility tractor taylor messick Inc ... company profile agricultural equipment) -> Trained Word2vec Model (N*D) -> related words (tractor, tillage, mower, excavator, sprayer, shredder, agriculture, harvest) -> TextRank -> mower, excavator, shredder, tillage, harvest, sprayer]
• Identify semantically important keywords
Approach 4 – TextRank + Word2vec

TextRank Result
Word        TextRank Score
tractor     0.015847
john        0.013281
sale        0.012494
standard    0.012474
equipment   0.010799
power       0.009747
messick     0.008162
new         0.008151
work        0.007907
series      0.007707
mower       0.006099
utility     0.006035
compact     0.005751

Word2vec Similarity (to "tractor")
Word        Similarity
mower       0.8502
excavator   0.7708
shredder    0.7451
tillage     0.7341
harvest     0.7154
sprayer     0.7101

Word        New Score
tractor     0.015847
mower       0.015847 * 0.8502 = 0.013473
john        0.013281
sale        0.012494
standard    0.012474
excavator   0.015847 * 0.7708 = 0.012215
shredder    0.015847 * 0.7451 = 0.011808
tillage     0.015847 * 0.7341 = 0.011633
harvest     0.015847 * 0.7154 = 0.011337
sprayer     0.015847 * 0.7101 = 0.011253
equipment   0.010799
power       0.009747
messick     0.008162
new         0.008151
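The re-scoring shown in these tables can be sketched as below. The scores and similarities are copied from the slide. The function name `expand_keywords` is hypothetical. Expanding only the top keyword ("tractor") and keeping the higher of the two scores for a word that already has a TextRank score (such as "mower") are assumptions based on what the table shows.

```python
# TextRank scores and Word2vec similarities copied from the slide.
textrank_scores = {
    "tractor": 0.015847, "john": 0.013281, "sale": 0.012494,
    "standard": 0.012474, "equipment": 0.010799, "power": 0.009747,
    "messick": 0.008162, "new": 0.008151, "mower": 0.006099,
}
similar_to_tractor = {
    "mower": 0.8502, "excavator": 0.7708, "shredder": 0.7451,
    "tillage": 0.7341, "harvest": 0.7154, "sprayer": 0.7101,
}

def expand_keywords(scores, seed, neighbors):
    """Add each neighbor of `seed` with score = seed score * similarity."""
    merged = dict(scores)
    for word, similarity in neighbors.items():
        candidate = scores[seed] * similarity
        # Assumption: keep the higher score if TextRank already ranked the word.
        merged[word] = max(merged.get(word, 0.0), candidate)
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

ranked = expand_keywords(textrank_scores, "tractor", similar_to_tractor)
for word, score in ranked:
    print(f"{word:10s} {score:.6f}")
```

The effect is that words semantically close to the top keyword can outrank words that TextRank scored higher on structure alone (here "mower" moves above "john").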
Google’s Pre-trained Word2vec
Campaign                                   Words in Pre-trained Model Vocab. (fraction)   Keywords in Pre-trained Model Vocab. (fraction)
Geico                                      0.929985                                       0.88888
Taylor Messick (Agricultural Equipment)    0.929784                                       0.41176
Trane (AC)                                 0.922018                                       0.71428
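A coverage figure like the ones in this table could be computed as sketched below. The toy vocabulary and keyword list are hypothetical, and whether the slide counts every token or only unique words is not stated; this sketch counts every item it is given.

```python
def vocab_coverage(words, vocab):
    """Fraction of `words` that have an entry in `vocab`."""
    words = list(words)
    if not words:
        return 0.0
    return sum(1 for w in words if w in vocab) / len(words)

# Hypothetical toy data; a real run would use the pre-trained model's vocabulary.
pretrained_vocab = {"tractor", "mower", "equipment", "insurance", "car"}
keywords = ["tractor", "messick", "mower", "deere"]
print(vocab_coverage(keywords, pretrained_vocab))
```

Low keyword coverage, as in the Taylor Messick row, means many extracted keywords (brand or dealer names, for example) have no vector in the pre-trained model and cannot be expanded by similarity.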
Model Testing
1. Generate keywords from the 4 models
2. Feed into Lucene and find URLs
3. Track the audience who visited these URLs
4. Compare the audience we find to the audience the pixels find
Results (Dell) - Keyword
TFIDF                  TextRank    Word2vec_Textrank   TextRank_Word2vec
office                 dell        outlet              dell
dellcom                support     collaboration       acquire
view                   service     acquire             laptop
electronics            product     work                desktop
customer               price       purchase            software
dell                   use         spare               rebate
representative         software    poster              welding
dellcomreturnspolicy   customer    transformation      windows
dells                  system      apg                 corporations
information practices new dell please dell software
prosupport dell dell inc poster laptop desktop
products view dell outlet apg transformation dell new
services support dell dell today purchase acquire dell tablet
dell sales dell team spare transformation dell inc
Results (Toyota) - Keyword
TFIDF           TextRank      Word2vec_Textrank   TextRank_Word2vec
highlander      toyota        generate            toyota
kbbcom          information   acquire             preowned
edmundscom      site          misuse              certified
certify         vehicle       tale                highlander
information     use           govern              rav
certification   program       tradein             yaris
site            email         fourwheel           avalon
program service generate tale corolla
assistance sale rubbed bologna sequoia
violated please toyota site identify tundra
hybrid highlander toyota vehicle wheel camry
car certification toyota dealer rubbed tale venza
personal information new toyota help toyota vehicle
cruiser preowned toyota certified new avalon preowned
Results - URLs
Dell
http://thetechjournal.com/electronics/laptop/dell-inspiron-15r-laptop.xhtml
http://www.adverts.ie/laptop-parts-and-accessories/dell-laptop-charger-19-5v-4-62a-90w/10838435
http://www.dellservicecentreinchennai.in/tablet-repair-center-medavakkam.html
http://www.dell.com/us/business/p/poweredge-c6320p/pd?oc=&model_id=poweredge-c6320p&l=en&s=bsd
http://forum.notebookreview.com/threads/dell-2012-outlet-coupons.636641/page-21
Toyota
http://www.macdonaldtoyota.ca/
http://www.stcharlestoyota.net
http://www.baldwintoyotaofpoplarbluff.com/
http://www.lafontainetoyota.com/
http://www.cedarrapidstoyota.com/
http://www.craigtoyota.com/
http://www.planettoyotaonline.com/
http://www.gatewaytoyotapierre.com/
Result - # of Converters (% of Converters)
CampaignId   TFIDF        TextRank     TextRank_Word2vec   Word2vec_TextRank
13405        25 (0.2%)    99 (0.8%)    44 (0.4%)           1 (0.008%)
13553        229 (3.2%)   269 (3.7%)   252 (3.5%)          8 (0.1%)
14099        6 (0.03%)    57 (0.3%)    16 (0.08%)          2 (0.01%)
14545        247 (3%)     250 (3%)     482 (5.7%)          7 (0.08%)
15077        0 (0%)       4 (0.02%)    15 (0.08%)          6 (0.03%)
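A count-and-percentage pair like the ones in this table could come from a set comparison such as the one sketched below. The user IDs are hypothetical, and taking the pixel-identified converters as the denominator is an assumption: the slide does not define the percentage, although the ratios in each row point to one roughly constant denominator per campaign.

```python
def converter_overlap(found_audience, pixel_converters):
    """Return (count, percent) of pixel converters also present in found_audience."""
    pixel_converters = set(pixel_converters)
    hits = len(set(found_audience) & pixel_converters)
    percent = 100.0 * hits / len(pixel_converters) if pixel_converters else 0.0
    return hits, percent

# Hypothetical user IDs, for illustration only.
pixel_converters = {"u1", "u2", "u3", "u4", "u5", "u6", "u7", "u8"}
found_by_model = {
    "TFIDF": {"u1", "x9"},
    "TextRank": {"u1", "u2", "u3", "x7"},
}
for model, audience in found_by_model.items():
    count, percent = converter_overlap(audience, pixel_converters)
    print(f"{model}: {count} ({percent:.1f}%)")
```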
Conclusion
• TextRank and TextRank_Word2vec consistently perform better than TFIDF
• TextRank doesn’t require extra space for model saving
• All 3 models need O(n) computational time
Appendix
Neural Net Language Model
[Figure: the context words "The", "cat", "sits", "on", "that" enter as one-hot vectors (5 vectors of size V*1, labeled W(1), W(2), ..., W(t-1), W(t)); a D*V projection matrix maps them to a 5D*1 vector; a tanh hidden layer feeds a softmax output layer over the vocabulary (Apple, ..., Computer, point, traffic, inbox, policy, print, couch, choice, ..., choose, later, media). The output layer is labeled "Most Computation". The slide also carries the labels "Maximize" and "Time Complexity".]

Hierarchical Probabilistic Neural Net Language Model
[Figure: the same one-hot inputs ("The", "cat", "sits", "on", "that"), D*V projection matrix, 5D*1 vector and tanh hidden layer; output words shown: TV, Computer, couch, table, make, choose, write.]

Continuous Bag-of-Words Model
[Figure: the context words "The", "cat", "on", "that" enter as one-hot vectors (V*1 each); a D*V projection matrix maps them to a D*1 vector; output words shown: TV, Computer, couch, table, make, choose, sits, crawl.]