discovering context

Post on 09-Jul-2015

DESCRIPTION

Presented at HCII2011

TRANSCRIPT

Discovering Context: Classifying tweets through a semantic transform based on Wikipedia

Yegin Genc, Yasuaki Sakamoto, and Jeffrey V. Nickerson

"So I'm told by a reputable person they have killed Osama Bin Laden. …"

“I hate how my phone has this stupid … spell check …”

Twitter can function as a large sensor system and increase our awareness of our surroundings


Why classify?

"So I'm told by a reputable person they have killed Osama Bin Laden. …"

“I hate how my phone has this stupid … spell check …”

Topics, in the order of the messages above: Terrorism (?); Irritating technology

[Figure: messages labeled "important" or "NOT IMPORTANT"]

How to classify?

[Figure: each message is passed through a transform T, and the distance is computed between the transformed messages: distance(T(m1), T(m2))]

d(message1, message2) ∝ d(T(message1), T(message2))

A Two-Step Approach

Step 1: Finding Wiki pages. Each tweet (Tweet 1, Tweet 2, Tweet 3, …, Tweet n) is mapped to a Wikipedia page (WP1, WP2, WP3, …, WPn).

Step 2: Calculating distance. Pairwise distances (d12, d13, d1n, d2n, d32, d3n) are calculated between the mapped Wiki pages.

Step – 1: Finding Wiki Pages

Tweet 1 Word-Set (WS) = {Word11, Word12, Word13, …, Word1n}

Each word in WS has its own set of candidate pages (CP): Candidate Pages (word11), Candidate Pages (word12), Candidate Pages (word13), …, Candidate Pages (word1n)

Wiki Page 1 = the candidate page with the maximum overlap between WS and CP content

Tweet: RT ashajayy Rest in peace JD Salinger Catcher in the Rye is one of my absolute favourite books Sad day

Words: Rest, peace, JD, Salinger, Catcher, Rye, absolute, favourite, books, Sad, day

Candidate Pages                           Hits
//en.wikipedia.org/wiki/J.D._Salinger     290
//en.wikipedia.org/wiki/J._D._Salinger    289
//en.wikipedia.org/wiki/books             145
//en.wikipedia.org/wiki/Doris_Day         138
//en.wikipedia.org/wiki/peace             131
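The ranking of candidate pages by overlap with the tweet's word set can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the slides do not say how candidate pages are retrieved or exactly how "Hits" is counted, so the sketch assumes a hit is one occurrence of a tweet word in the page text. The helper names (`tokenize`, `count_hits`, `best_page`) and the two toy page texts are hypothetical and are not real Wikipedia content.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def count_hits(word_set, page_text):
    """Assumed scoring rule: total occurrences of tweet words in the page."""
    counts = Counter(tokenize(page_text))
    return sum(counts[word] for word in word_set)


def best_page(tweet_words, candidate_pages):
    """Return the candidate with the largest overlap, plus all scores."""
    word_set = {word.lower() for word in tweet_words}
    scores = {title: count_hits(word_set, text)
              for title, text in candidate_pages.items()}
    return max(scores, key=scores.get), scores


# Word set taken from the example tweet in the slides.
words = ["Rest", "peace", "JD", "Salinger", "Catcher", "Rye",
         "absolute", "favourite", "books", "Sad", "day"]

# Toy stand-ins for page content (hypothetical, for illustration only).
pages = {
    "J._D._Salinger": "Salinger wrote The Catcher in the Rye. "
                      "Salinger published few books.",
    "Doris_Day": "Doris Day was a singer. Day starred in films.",
}

winner, scores = best_page(words, pages)
print(winner, scores)
```

With real page content the counts would be much larger, as in the Hits column above, but the selection rule (take the maximum) stays the same.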

Step – 2: Calculating the Distance

[Figure: pages linked from Wiki Page 1 (WP1) and Wiki Page 2 (WP2), arranged by level (L1, L2, L3); the path shown between WP1 and WP2 has steps numbered 1, 2, 3, giving d12 = 3]
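One way to read the distance in this step is as the number of link hops between two Wiki pages. The sketch below assumes exactly that: it treats links as undirected and uses a breadth-first search to count the hops on the shortest path. The slides do not state the precise definition, so this is an assumption, and the function name `link_distance`, the toy graph, and the intermediate page names `A` and `B` are all hypothetical.

```python
from collections import deque


def link_distance(links, start, goal):
    """Shortest number of link hops from start to goal, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        for neighbour in links.get(page, ()):
            if neighbour == goal:
                return depth + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return None


# Toy link graph (hypothetical): WP1 - A - B - WP2, stored in both directions
# because the sketch assumes links can be followed either way.
links = {
    "WP1": ["A"],
    "A": ["WP1", "B"],
    "B": ["A", "WP2"],
    "WP2": ["B"],
}

d12 = link_distance(links, "WP1", "WP2")
print(d12)
```

In this toy graph the two pages are three hops apart, which matches the d12 = 3 shown on the slide. A real link graph would need a depth limit to keep the search tractable.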

Method

Tweets → Distance Matrix → MDS → Discriminant Analysis → Accuracy Rate

Tweets
T1 (Topic 1)
T2 (Topic 1)
T3 (Topic 2)
…

Distance Matrix (one per technique: SED → DSED, LSA → DLSA, Wikipedia → DWIKI)

      T1    T2    T3
T1    0     d12   d13
T2    d21   0     d23
T3    d31   d32   0

MDS

      X     Y
T1    t1x   t1y
T2    t2x   t2y
T3    t3x   t3y

Discriminant Analysis → Accuracy Rate (Acc. SED, Acc. LSA, Acc. WIKI)
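The MDS step, which turns a distance matrix into X and Y coordinates per tweet, can be sketched as below. The slides do not say which MDS variant was used, so this sketch assumes classical (Torgerson) MDS and finds the leading eigenvectors with power iteration to stay within the Python standard library. It also assumes the distances are close to Euclidean, so that the leading eigenvalues are positive. The function name `classical_mds` and the 3-4-5 example matrix are hypothetical, and the discriminant analysis step is not shown.

```python
import math
import random


def classical_mds(dist, dims=2, iters=500, seed=0):
    """Embed points in `dims` dimensions from a symmetric distance matrix."""
    n = len(dist)
    # Double-center the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration for the dominant eigenvector of b.
        v = [rng.random() + 0.1 for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        eigval = sum(v[i] * sum(b[i][j] * v[j] for j in range(n))
                     for i in range(n))
        if eigval <= 1e-12:
            break  # no further positive eigenvalue to use
        scale = math.sqrt(eigval)
        for i in range(n):
            coords[i][k] = v[i] * scale
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords


# Hypothetical distances for three tweets (a 3-4-5 right triangle).
distance_matrix = [[0, 3, 4],
                   [3, 0, 5],
                   [4, 5, 0]]
coords = classical_mds(distance_matrix)
for name, (x, y) in zip(["T1", "T2", "T3"], coords):
    print(name, round(x, 3), round(y, 3))
```

The coordinates are only determined up to rotation and reflection, so what matters is that the pairwise distances between the embedded points reproduce the input matrix. Those coordinates would then be the input to the discriminant analysis.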

Other Techniques

String Edit Distance (SED)

Minimum number of edits needed to transform one string into the other

Kitten → sitten (subst. of 's' for 'k')

SED = 1
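The "minimum number of edits" definition above corresponds to the Levenshtein distance, which can be computed with a standard dynamic-programming table. The sketch below assumes unit cost for insertion, deletion, and substitution; the slides do not specify the cost scheme, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b with unit edit costs."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        curr = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitten"))  # 1, as in the slide example
```

Applied to two tweets, the same function gives the character-level distance that fills the SED distance matrix.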

Latent Semantic Analysis (LSA)

Natural language processing technique for classification based on term occurrences in documents

Data

Without Noise

Category         Count
J.D. Salinger    15
iPad             15
Haiti            15
TOTAL            45

With Noise

Category         Count
J.D. Salinger    15
iPad             15
Haiti            15
Random           55
TOTAL            100

RT @ashajayy Rest in peace, JD Salinger. Catcher in the Rye is one of my absolute favourite books. Sad day.

@JMNelis I fear I may have killed him because I talked about how I hate "Catcher." (1/2)

'Catcher In The Rye' Author J.D. Salinger Dies At 91 - The author of The Catcher in the Rye died of natural causes,... http//ow.ly/16rETF

What Yall think about me buying a whole bunch of sour patch kids and giving them to haiti i bet they would be HAPPY!

Please ReTweet (http//caltweet.com/4gx ) - Lets ALL really AID Haiti

RT @UNC_Health_Care Video Want to help the #Haitian patients at #UNC Hospitals? Here's how. http//bit…

@Alitas_Way naw im kiddin but ma'am it really looks great on u

Please come to our Legal Studies Open House on Tuesday February 2nd from 6-730pm.Please call for exact location and to RSVP …

Most impressive stat for Warner is he holds the top 3 most passing yards in a superbowl. Three games three most passing yards in 40

iPad..not so appealing to me (Yet!) It's basically the MacBook&iPhone combined.I have both so don't think i'll be getting the iPad soon.

Have u seen it?Apple iPad Tablet Steve Jobs Unveils Visionary Computer http//bit.ly/9IslTP

The new Apple formula Hype

Tweets without noise:

Technique                   J. D. Salinger   iPad   Haiti
String Edit Distance        .67              .13    .60
Latent Semantic Analysis    .67              .73    .80
Wikipedia                   .93              .87    .80

[Figure: two-dimensional plots of the tweets without noise, one panel each for SED, LSA, and Wiki; axes: Coordinate 1, Coordinate 2; legend: J.D. Salinger, iPad, Haiti]

Tweets with noise:

Technique                   J. D. Salinger   iPad   Haiti
Latent Semantic Analysis    .60              .60    .20
Wikipedia                   .93              .87    .73

[Figure: two-dimensional plots of the tweets with noise, one panel each for SED, LSA, and Wiki; axes: Coordinate 1, Coordinate 2; legend: J.D. Salinger, iPad, Haiti, Random]

Conclusion

Wikipedia Space shows promising results in defining similarity of short text

– Socially constructed

– Large space

– Immune to noise

Future Work

• Adaptive classification

– What we consider as noise may contain useful information depending on the context

• Improved mapping and distance calculations

• Utilizing other social aspects of Wikipedia

Thank you!

Q&A
