Post on 09-Jul-2015
Discovering Context: Classifying tweets through a semantic
transform based on Wikipedia
Yegin Genc, Yasuaki Sakamoto, and Jeffrey V. Nickerson
"So I'm told by a reputable person they have killed Osama Bin Laden. …"
“I hate how my phone has this stupid … spell check …”
Twitter can function as a large sensor system and can increase our awareness of our surroundings
Why classify?
"So I'm told by a reputable person they have killed Osama Bin Laden. …" → Terrorism (?)
“I hate how my phone has this stupid … spell check …” → Irritating technology
[Figure: a stream of tweets, each labeled either "important" or "NOT IMPORTANT"]
How to classify?
Each message is passed through a transform T, and the distance is measured between the transformed messages: distance(T(m1), T(m2))
d(message1, message2) ∝ d(T(message1), T(message2))
A Two-Step Approach
STEP 1: FINDING WIKI PAGES
Tweet 1 → Wiki Page 1 (WP1)
Tweet 2 → Wiki Page 2 (WP2)
Tweet 3 → Wiki Page 3 (WP3)
...
Tweet n → Wiki Page n (WPn)
STEP 2: CALCULATING DISTANCE
Pairwise distances between the pages WP1, WP2, WP3, ..., WPn: d12, d13, d1n, d32, d2n, d3n
Step – 1: Finding Wiki Pages
Tweet 1 Word-Set (WS) = Word11, Word12, Word13, ..., Word1n
Each word yields its own candidate pages (CP): Candidate Pages (word11), Candidate Pages (word12), Candidate Pages (word13), ..., Candidate Pages (word1n)
Wiki Page 1 = the candidate page with the max overlap between WS and CP content
Tweet: RT ashajayy Rest in peace JD Salinger Catcher in the Rye is one of my absolute favourite books Sad day
Words: Rest, peace, JD, Salinger, Catcher, Rye, absolute, favourite, books, Sad, day
Candidate Pages                           Hits
//en.wikipedia.org/wiki/J.D._Salinger     290
//en.wikipedia.org/wiki/J._D._Salinger    289
//en.wikipedia.org/wiki/books             145
//en.wikipedia.org/wiki/Doris_Day         138
//en.wikipedia.org/wiki/peace             131
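The page-selection step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the helper names `word_set` and `best_page` are hypothetical, the page texts are toy strings rather than real Wikipedia content, and overlap is assumed to mean the number of distinct shared words (the Hits column in the slides may be counted differently).

```python
import re

def word_set(text, stopwords=frozenset()):
    """Lowercased alphabetic tokens of the text, minus any stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords}

def best_page(tweet_words, candidates):
    """Return the candidate title whose text shares the most words with the tweet.

    candidates: dict mapping a page title to its text (hypothetical stand-in
    for the candidate pages retrieved for each word of the tweet).
    """
    def overlap(title):
        return len(tweet_words & word_set(candidates[title]))
    return max(candidates, key=overlap)

tweet = ("Rest in peace JD Salinger Catcher in the Rye is one of my "
         "absolute favourite books Sad day")
# Toy page texts, invented for illustration only.
pages = {
    "J._D._Salinger": "Salinger wrote The Catcher in the Rye and other books",
    "Doris_Day": "Doris Day was a singer and actress",
}
print(best_page(word_set(tweet), pages))  # J._D._Salinger
```

A real pipeline would also need a stopword list and a way to retrieve candidate pages per word; both are left out here because the slides do not specify them.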
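Step – 2: Calculating the Distance
[Figure: Wiki Page 1 (WP1) and Wiki Page 2 (WP2) are each expanded through levels L1, L2, and L3. The two expansions meet at a node that is at L3 of WP1 and L2 of WP2; the path between the two pages is numbered 1, 2, 3, giving d12 = 3.]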
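One way to read the figure is as a shortest-path count between two pages. The sketch below assumes exactly that: pages are nodes in some graph, and the distance is the number of edges on the shortest path. The slides do not say which graph is used, so the toy graph and the function name `hop_distance` are illustrative assumptions.

```python
from collections import deque

def hop_distance(graph, start, goal):
    """Number of edges on the shortest path from start to goal, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two pages are not connected

# Hypothetical undirected toy graph: WP1 - A - B - WP2
edges = [("WP1", "A"), ("A", "B"), ("B", "WP2")]
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

print(hop_distance(graph, "WP1", "WP2"))  # 3
```

Breadth-first search is used because it visits nodes in order of hop count, so the first time the goal is reached the count is minimal.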
Method
Tweets
- T1 (Topic 1)
- T2 (Topic 1)
- T3 (Topic 2)
...
Distance Matrix (one per technique: SED, LSA, Wikipedia, giving DSED, DLSA, DWIKI)
     T1    T2    T3
T1   0     d12   d13
T2   d21   0     d23
T3   d31   d32   0
MDS
     X     Y
T1   t1x   t1y
T2   t2x   t2y
T3   t3x   t3y
Discriminant Analysis
Accuracy Rate (one per technique): Acc. SED, Acc. LSA, Acc. WIKI
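The MDS step turns a distance matrix into X/Y coordinates. The slides do not say which MDS variant was used; the sketch below assumes classical (Torgerson) MDS on a symmetric distance matrix, implemented with power iteration so that it needs only the standard library. The function name `classical_mds` and the 3-4-5 test matrix are illustrative choices, not taken from the slides.

```python
import math

def classical_mds(dist, dims=2, iters=500):
    """Embed points in `dims` dimensions from a symmetric distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J (row means equal column means
    # because the matrix is assumed symmetric).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Deterministic start vector, then power iteration for the top eigenpair.
        v = [math.sin(i + k + 1) for i in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        eig = sum(v[i] * b[i][j] * v[j] for i in range(n) for j in range(n))
        if eig < 1e-9:
            break  # no further positive eigenvalue: remaining axes stay 0
        for i in range(n):
            coords[i][k] = v[i] * math.sqrt(eig)
        # Deflate so the next pass finds the next eigenpair.
        b = [[b[i][j] - eig * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords

# Distances of a 3-4-5 right triangle; the embedding should reproduce them.
d = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
points = classical_mds(d)
```

Power iteration finds the eigenvalue of largest magnitude, so this sketch is only appropriate when the leading eigenvalues of the centered matrix are positive, as they are for distances that come from points in Euclidean space.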
Other Techniques
String Edit Distance (SED)
Minimum number of edits needed to transform one string into the other
Kitten → sitten (subst. of 's' for 'k')
SED = 1
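For reference, string edit distance in its common Levenshtein form (insertions, deletions, and substitutions each cost 1) can be computed with a short dynamic program. The slides do not state which edit operations or costs were used, so unit costs are an assumption here, and `edit_distance` is an illustrative name.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b with unit edit costs."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitten"))  # 1
```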
Latent Semantic Analysis (LSA)
Natural language processing technique for classification based on term occurrences in documents
Data

Without Noise
Category         Count
J.D. Salinger    15
iPad             15
Haiti            15
TOTAL            45

With Noise
Category         Count
J.D. Salinger    15
iPad             15
Haiti            15
Random           55
TOTAL            100
RT @ashajayy Rest in peace, JD Salinger. Catcher in the Rye is one of my absolute favourite books. Sad day.
@JMNelis I fear I may have killed him because I talked about how I hate "Catcher." (1/2)
'Catcher In The Rye' Author J.D. Salinger Dies At 91 - The author of The Catcher in the Rye died of natural causes,... http//ow.ly/16rETF
What Yall think about me buying a whole bunch of sour patch kids and giving them to haiti i bet they would be HAPPY!
Please ReTweet (http//caltweet.com/4gx ) - Lets ALL really AID Haiti
RT @UNC_Health_Care Video Want to help the #Haitian patients at #UNC Hospitals? Here's how. http//bit…
@Alitas_Way naw im kiddin but ma'am it really looks great on u
Please come to our Legal Studies Open House on Tuesday February 2nd from 6-730pm.Please call for exact location and to RSVP …
Most impressive stat for Warner is he holds the top 3 most passing yards in a superbowl. Three games three most passing yards in 40
iPad..not so appealing to me (Yet!) It's basically the MacBook&iPhone combined.I have both so don't think i'll be getting the iPad soon.
Have u seen it?Apple iPad Tablet Steve Jobs Unveils Visionary Computer http//bit.ly/9IslTP
The new Apple formula Hype
Tweets without noise:
Technique                   J. D. Salinger   iPad   Haiti
String Edit Distance        .67              .13    .60
Latent Semantic Analysis    .67              .73    .80
Wikipedia                   .93              .87    .80
[Figure: three plots, one each for SED, LSA, and Wiki; axes: Coordinate 1 and Coordinate 2; legend: J.D. Salinger, iPad, Haiti]
Tweets with noise:
Technique                   J. D. Salinger   iPad   Haiti
Latent Semantic Analysis    .60              .60    .20
Wikipedia                   .93              .87    .73
[Figure: three plots, one each for SED, LSA, and Wiki; axes: Coordinate 1 and Coordinate 2; legend: J.D. Salinger, iPad, Haiti, Random]
Conclusion
The Wikipedia space shows promising results for defining the similarity of short texts
– Socially constructed
– Large space
– Immune to noise
Future Work
• Adaptive classification
– What we consider as noise may contain useful information depending on the context
• Improved mapping and distance calculations
• Utilizing other social aspects of Wikipedia
Thank you!
Q&A