proposal defense

DISSERTATION PROPOSAL DEFENSE, XIAOJU ZHENG, JUNE 9, 2010. Life Cycle of #hashtags: “words” in the #Twittertopia

Upload: xiaojuzheng

Post on 31-Oct-2014

TRANSCRIPT

Page 1: Proposal defense

1

DISSERTATION PROPOSAL DEFENSE
XIAOJU ZHENG
JUNE 9, 2010

Life Cycle of #hashtags: “words” in the #Twittertopia

Page 2: Proposal defense

2

Road Map of the Presentation

Background of research questions
Research Questions
Overview of the data
Follow-up Experiments
Diffusion models
Challenges

Page 3: Proposal defense

3

The Laws of Imitation

“why, given one hundred different innovations conceived of at the same time – innovations in the form of words, in mythological ideas, in industrial processes, etc. – ten will spread abroad, while ninety will be forgotten.”

--- Gabriel Tarde (1903) “The Laws of Imitation”

Merit of its own? Or something else?

Page 4: Proposal defense

5

The Laws of Imitation

Merit is not the only catalyst of the spread of an idea.

In situations where “the poorest innovations, from the point of view of logic, are selected because of their place, or even date of birth,” Tarde attributes these irrational occurrences to “extra-logical influences.”

Page 5: Proposal defense

6

Social Network Analysis

Current research in social network analysis asserts that these “extra-logical” influences can be explained by examining the dynamics of the network through which influence is transmitted between individuals.

In other words, if we view individuals as nodes in a social network, where a directed edge indicates that one node influences another, then some graph configurations make wide adoption of an innovation more likely than other configurations do.

Page 6: Proposal defense

7

Basic Research Questions

How is a word created?
What makes a newly created word better than others?
How is a newly created word picked up by users at large?
How does a word gain popularity among the population?

In short: the life cycle of a word.

Page 7: Proposal defense

8

Research Question --- Data

Word creation in English: in spoken English, it can take decades – even centuries – for new words to emerge, become part of common parlance, and then fade into disuse.

Word creation on Twitter: a word in the form of a #hashtag can live its entire life cycle in a very short period of time, e.g. a couple of days.

A news story breaks, and competing hashtags vie for dominance. Then a few influential people adopt the same one. Suddenly the conversation coalesces around it, the term trends, the spammers start using it, and then the conversation peters out as we move on to the next topic. (This is only one possible pattern.)

Is that the pattern? And how closely does it map onto the ways that words and phrases emerge in spoken language?

#hashtag – a word on Twitter

Page 8: Proposal defense

9

Twitter

Twitter.com: Twitter is a social networking and micro-blogging service that enables its users to send and read messages known as tweets. Tweets are text-based posts of up to 140 characters, displayed on the author's profile page and delivered to the author's subscribers, who are known as followers.

Page 9: Proposal defense

10

Twitter: some conventions

@mentions – the word following “@” is the name of a Twitter user, so the tweet refers to that user, e.g. “@dave thanks for the help” or “Talking with @paul about twitter”. (Can be used to spot smaller networks.)

Retweets – “RT” means “I am retweeting (copying) something from elsewhere”, e.g. “RT @john I just saw Madonna” means that I am retweeting the original message from John. (Can be used to spot smaller networks.)

#hashtags – give contextual relevance to a tweet or mark a keyword, e.g. “Like this demo #acita09” or “Why does #ms-word keep crashing”.
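
As a rough illustration of how these three conventions could be pulled out of raw tweet text, here is a minimal Python sketch. The regular expressions and the function name `parse_tweet` are assumptions made for this example; they are not Twitter's own tokenization rules, and they will over-match in some cases (for instance, e-mail addresses look like @mentions).

```python
import re

# Assumed patterns for illustration only; not Twitter's official rules.
MENTION_RE = re.compile(r"@(\w+)")         # "@dave" -> "dave"
HASHTAG_RE = re.compile(r"#([\w-]+)")      # "#ms-word" -> "ms-word"
RETWEET_RE = re.compile(r"\bRT\s*@(\w+)")  # "RT @john" or "RT@john" -> "john"


def parse_tweet(text):
    """Return the mentions, hashtags, and retweeted users found in one tweet."""
    return {
        "mentions": MENTION_RE.findall(text),
        "hashtags": HASHTAG_RE.findall(text),
        "retweet_of": RETWEET_RE.findall(text),
    }


examples = [
    "Talking with @paul about twitter",
    "RT @john I just saw Madonna",
    "Why does #ms-word keep crashing",
]
for tweet in examples:
    print(tweet, "->", parse_tweet(tweet))
```

Mentions and retweets extracted this way give the user-to-user edges from which a smaller network can be built.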

Page 10: Proposal defense

13

Page 11: Proposal defense

14

Research Question

Word creation and its prosperity: what counts as the criteria for a newly coined “word” to be accepted as a good #hashtag, and how does a good #hashtag gain popularity among groups of people?

Logical: linguistic groundings of a good #hashtag
Linguistic analysis of the #hashtags and behavioral studies

Extra-logical: social groundings of a popular #hashtag
e.g. network structure and dynamics

Page 12: Proposal defense

15

#Hashtag

Page 13: Proposal defense

16

I. Linguistic Analysis of #hashtags
More at https://docs.google.com/Doc?docid=0AWbvIzcQLhQXZGdoY256cDJfMTExZjZjbjNxbTM&hl=en

Linking words into a sentence: e.g. whatsyourbackground, tweetwhatyoueat, LetsMakeATrendingTopic, goodluckjustin

Part of word + existing word: e.g. animtip, appstore

Compounding: noun + noun, e.g. sundayhug, pubquiz, waikikilunch

Compounding: verb + noun, e.g. hashtagme, pickon, killcapscop

Compounding: adv + verb, e.g. currentlycrushing

Compounding: adj + noun, e.g. digitalbritan, GoodTimes, morningsickness

Splinter: e.g. socialmem (SocialCamp Memphis)

Acronym & Initials: e.g. smlb (St Michael Le Belfrey church), emr (electronic medical records), #eu (European Union), #cah (Crimes against humanity)

Neologism (splinter involved): e.g. twacker (Twitter users who lose their user account), tweetie (Twitter client for Mac and iPhone), twitvorce (to divorce yourself from a Twitter member by unfollowing them), twittertopia, twendsetter

MISC: omgfact, #tcot (Top Conservatives on Twitter)

Page 14: Proposal defense

17

Preliminary Analysis

Public timeline: 20 tweets per minute
20 days of non-stop crawling
Total tweets = 567,091
Total words = 8,495,323
Average words per tweet = 14.98
NPS Chat Corpus: 45,010 tokens / 6,066 types
Webtext corpus in NLTK: 396,736 tokens / 21,537 types
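
Figures of this kind (tokens, types, average words per tweet, most frequent words) can be computed with a few lines of Python. The sketch below is only an illustration: it assumes plain whitespace tokenization and lowercasing, which may differ from how the counts above were obtained, and the three sample tweets are made up.

```python
from collections import Counter


def corpus_stats(tweets, top_n=10):
    """Whitespace-tokenize tweets and summarize the resulting corpus."""
    # Assumption: tokens are whitespace-separated and case-folded.
    tokens = [tok.lower() for tweet in tweets for tok in tweet.split()]
    counts = Counter(tokens)
    return {
        "tweets": len(tweets),
        "tokens": len(tokens),   # running words
        "types": len(counts),    # distinct words
        "avg_words_per_tweet": len(tokens) / len(tweets) if tweets else 0.0,
        "top_words": counts.most_common(top_n),
    }


# Made-up sample data, for illustration only.
sample = [
    "I am watching the game",
    "I just saw the news",
    "RT @john I just saw Madonna",
]
stats = corpus_stats(sample)
print(stats["tokens"], stats["types"], stats["top_words"][:3])
```

The same function can be run over any reference corpus that is available as a list of strings, which makes the token/type comparison between corpora straightforward.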

Page 15: Proposal defense

18

Top 10 Frequent words

Rank  Twitter  NPS Chat  Webtext  Oxford English Corpus
1     I        Lol       I        the
2     the      to        the      is
3     to       i         to       to
4     a        the       A        and
5     and      you       you      of
6     Is       I         in       a
7     In       a         and      in
8     It       hi        on       that
9     you      me        of       have
10    of       is        is       i

Page 18: Proposal defense

21

Twitter presents a different genre of text

Self expression: "I" is the top-ranking word that tweets begin with.

Status update: "Watching", "trying", "listening", "reading" and "eating" are all in the Top 100 first words, revealing just how often people use Twitter to report on whatever they are experiencing at the time.

News broadcast: The abbreviation "RT" (retweet) is extremely common.

Page 19: Proposal defense

22

Twitter presents a different genre of text

Popular web addresses (e.g. URL shortening services) among the top 500: "tinyurl.com", "twitpic.com", "ff.im", "twurl.nl". These all appear because they offer services useful to twitterers.

Tech vocabulary among the top 500: "Google", "Facebook", "internet", "website", "blog", "Mac", and "app".

Page 20: Proposal defense

23

Research Question

Linguistic groundings of a good #hashtag
Linguistic analysis of the #hashtags and behavioral studies

Social groundings of a popular #hashtag
e.g. network structure and dynamics

Would linguistically equally good #hashtags have different degrees of popularity? Is it because of different network structures?

Behavioral studies will provide quantitative measurements of the linguistic goodness of #hashtags.

Page 21: Proposal defense

24

Linguistic Grounding

Question 1: Does the length distribution of adopted #hashtags differ from that of words? Does it conform to a power-law distribution or a lognormal distribution? Do #hashtags of different lengths receive different goodness judgments (e.g. are extremely short tags better than extremely short words)?
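
One way to approach the power-law versus lognormal question is to fit both by maximum likelihood and compare log-likelihoods. The sketch below is a simplified illustration, not the proposal's method: it treats tag lengths as continuous values (an approximation, since lengths are integers), uses the closed-form continuous power-law estimate, and takes the smallest observed length as `xmin` by default. The function names are hypothetical, and the sample tags are taken from the classification list earlier in the slides.

```python
import math
from collections import Counter


def tag_lengths(tags):
    """Length of each tag in characters, ignoring a leading '#'."""
    return [len(tag.lstrip("#")) for tag in tags]


def fit_lognormal(lengths):
    """MLE for a lognormal; returns (mu, sigma, log-likelihood)."""
    logs = [math.log(x) for x in lengths]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n
    if var == 0:
        raise ValueError("all lengths are equal; nothing to fit")
    loglik = sum(
        -v - 0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        for v in logs
    )
    return mu, math.sqrt(var), loglik


def fit_power_law(lengths, xmin=None):
    """Continuous power-law MLE for x >= xmin; returns (alpha, log-likelihood)."""
    if xmin is None:
        xmin = min(lengths)  # assumption: fit the whole sample
    tail = [x for x in lengths if x >= xmin]
    s = sum(math.log(x / xmin) for x in tail)
    if s == 0:
        raise ValueError("all lengths equal xmin; nothing to fit")
    alpha = 1 + len(tail) / s
    loglik = len(tail) * math.log((alpha - 1) / xmin) - alpha * s
    return alpha, loglik


tags = ["#whatsyourbackground", "#tweetwhatyoueat", "#goodluckjustin",
        "#animtip", "#appstore", "#sundayhug", "#pubquiz", "#smlb", "#emr",
        "#eu", "#cah", "#twacker", "#tweetie", "#twitvorce", "#omgfact", "#tcot"]
lengths = tag_lengths(tags)
mu, sigma, ll_lognormal = fit_lognormal(lengths)
alpha, ll_power = fit_power_law(lengths)
print(Counter(lengths))
print("lognormal log-likelihood:", ll_lognormal)
print("power-law log-likelihood:", ll_power)
```

A higher log-likelihood suggests a better fit, but with a sample this small the comparison is only indicative; a real analysis would need a proper goodness-of-fit test and a discrete formulation.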

Page 22: Proposal defense

25

Linguistic Grounding

Question 2: What are the linguistic processes of creating a #hashtag? What counts as a good #hashtag (morphologically, phonotactically, and semantically)?

A more qualitative analysis of the #hashtags needs to be done in order to design a metric for the analysis: e.g. compounding, splinter (of what kind).

Page 23: Proposal defense

26

Linguistic Grounding – behavioral experiments

Word vs. Nonword: Subjects will be presented with #hashtags collected from twitter.com and asked to label them as either word or nonword.

Develop specific criteria for word vs. nonword.

Morpheme identification: Based on the results of the Word vs. Nonword experiment, #hashtags will be presented to subjects, who will divide them into morphemes and identify meaningful subparts.

Page 24: Proposal defense

27

Linguistic Grounding – behavioral experiments

Semantic transparency (word association game): for hashtags like “twitvorce”, subjects will be asked to provide free word associations. For instance, subjects are likely to provide “twitter” and “divorce” for “twitvorce”.

Page 25: Proposal defense

28

Goodness rating

General: for #hashtags, including those judged to be “nonwords”, subjects will provide subjective goodness ratings, e.g. on a scale from 1 to 7.

Phonotactic: subjects rate pronounceability, e.g. for acronyms and initials. For instance, some acronyms are strings of consonants without vowels, some are strings of vowels, and others are mixtures of consonants and vowels. Would more pronounceable #hashtags be perceived as better #hashtags?
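
The consonant/vowel distinction for acronyms can be coded automatically before collecting ratings. The following is a minimal sketch under stated assumptions: it works on spelling rather than pronunciation, treats only a, e, i, o, u as vowels (so "y" counts as a consonant), and the function names are hypothetical.

```python
VOWELS = set("aeiou")  # assumption: orthographic vowels only


def cv_pattern(tag):
    """Map each letter of a tag to C (consonant) or V (vowel)."""
    letters = [ch for ch in tag.lower() if ch.isalpha()]
    return "".join("V" if ch in VOWELS else "C" for ch in letters)


def composition(tag):
    """Classify a tag as consonants only, vowels only, or mixed."""
    pattern = cv_pattern(tag)
    if not pattern:
        return "empty"
    if "V" not in pattern:
        return "consonants only"
    if "C" not in pattern:
        return "vowels only"
    return "mixed"


for tag in ["#smlb", "#eu", "#emr", "#tcot"]:
    print(tag, cv_pattern(tag), composition(tag))
```

Grouping stimuli by this label would let pronounceability ratings be compared across the three acronym types.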

Page 26: Proposal defense

29

Linguistic Grounding – statistical parser

Phonotactic likelihood: Develop a statistical parser (e.g. a finite state machine) for #hashtags and words, and compare their phonotactic probabilities. Also compare the statistical parser with an existing model, e.g. Vitevitch (2004).
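
As a simple stand-in for such a parser, a character bigram model can be viewed as a probabilistic finite state machine whose states are the previous symbol. The sketch below is an assumption-laden illustration: it uses letters in place of phonemes, add-one smoothing, a tiny made-up training list, and a hypothetical class name `BigramModel`. It is not the model proposed here, nor the Vitevitch model.

```python
import math
from collections import Counter

START, END = "<", ">"  # assumed boundary markers


class BigramModel:
    """Letter-bigram model with add-one smoothing (letters stand in for phonemes)."""

    def __init__(self, words):
        self.pair_counts = Counter()
        self.context_counts = Counter()
        self.alphabet = {END}
        for word in words:
            symbols = [START] + list(word.lower()) + [END]
            self.alphabet.update(symbols[1:])
            for prev, cur in zip(symbols, symbols[1:]):
                self.pair_counts[(prev, cur)] += 1
                self.context_counts[prev] += 1

    def log_prob(self, word):
        """Average log-probability per transition, so lengths are comparable."""
        symbols = [START] + list(word.lower()) + [END]
        vocab = len(self.alphabet) + 1  # +1 reserves mass for unseen symbols
        total = 0.0
        for prev, cur in zip(symbols, symbols[1:]):
            num = self.pair_counts[(prev, cur)] + 1
            den = self.context_counts[prev] + vocab
            total += math.log(num / den)
        return total / (len(symbols) - 1)


# Tiny made-up training list, for illustration only.
model = BigramModel(["twitter", "tweet", "divorce", "trend", "setter", "twit"])
print(model.log_prob("twitvorce"), model.log_prob("xqzk"))
```

With a realistic training lexicon and phonemic transcriptions, the same scoring could be applied to both #hashtags and ordinary words to compare their phonotactic probabilities.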

Page 27: Proposal defense

30

Social Grounding

Using real data from Twitter, diffusion models can be tested.

Diffusion models:
Linear Threshold Model
Cascade Model

Page 28: Proposal defense

31

The Threshold model

Threshold Model. It says that people adopt a new behavior because a sufficiently large proportion of their friends have adopted that behavior. E.g. early adopters have a very low threshold, say 5% or 10%, while late adopters have a much higher threshold. Every person, however, has their own individual threshold.

The key variable here is the initial distribution of thresholds across a social network, which determines the final extent of the behavior.

But this model says nothing about how people initially adopt behavior. That is, it says nothing about innovators or the things that are being invented, only about the spread of innovation through a social network.

Page 29: Proposal defense

32

The Threshold model

In the threshold model, every person u has a threshold $\theta_u$, and each of their neighbors v is weighted according to $W_{u,v}$. If

$$\sum_{v \,\in\, \text{adopting neighbors of } u} W_{u,v} \;\ge\; \theta_u$$

then person u adopts the behavior. The set of thresholds, weights, and initial adopters determines the extent of the behavior in the social network.
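
The threshold dynamics can be simulated directly. The sketch below assumes the usual linear-threshold rule (a person adopts once the summed weights of their adopting neighbors reach their threshold) and repeats passes over the network until nobody else adopts. The function name, the data layout, and the toy network are assumptions for illustration.

```python
def linear_threshold(weights, thresholds, seeds):
    """Run a linear threshold process to its fixed point.

    weights[u][v] is the influence of neighbor v on person u,
    thresholds[u] is u's personal threshold, seeds are the initial adopters.
    Returns the final set of adopters.
    """
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for u, threshold in thresholds.items():
            if u in active:
                continue
            # Total weight of u's neighbors who have already adopted.
            influence = sum(
                w for v, w in weights.get(u, {}).items() if v in active
            )
            if influence >= threshold:
                active.add(u)
                changed = True
    return active


# Toy network with made-up weights and thresholds.
weights = {"b": {"a": 0.6}, "c": {"a": 0.3, "b": 0.3}, "d": {"c": 0.2}}
thresholds = {"a": 0.5, "b": 0.5, "c": 0.5, "d": 0.5}
print(sorted(linear_threshold(weights, thresholds, seeds={"a"})))
```

In this toy run, "b" adopts because of "a", then "c" adopts because of "a" and "b" together, while "d" never reaches its threshold. Changing only the seed set or the thresholds changes the final extent, which is the point the slide makes.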

Page 30: Proposal defense

33

The Cascade Model

Cascade Model: every person has a chance of adopting a new behavior whenever one of their neighbors adopts it.

The probability that a person adopts the new behavior is the conversion rate for the notification.

This probability is a function of both the sender and the recipient, so more influential people are more likely to convince others to adopt a behavior.

Page 31: Proposal defense

34

The Cascade Model

In the cascade model, for every person u and neighbor v there is a random variable $X_{u,v}$ which describes the likelihood of u adopting the behavior if v has adopted it.
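
A cascade of this kind can be simulated as follows. This sketch assumes the independent cascade variant, in which each new adopter gets exactly one chance to convert each neighbor; the slides do not say which variant is intended. For convenience the dictionary is keyed sender first, so `prob[v][u]` plays the role of the probability attached to u adopting after v; the function name and toy network are made up.

```python
import random


def independent_cascade(prob, seeds, rng):
    """Simulate one run of an independent cascade.

    prob[v][u] is the chance that v, once adopted, converts neighbor u.
    Each newly adopted person gets a single attempt per neighbor.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for v in frontier:
            for u, p in prob.get(v, {}).items():
                if u not in active and rng.random() < p:
                    active.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return active


# Toy network with made-up conversion probabilities.
prob = {"a": {"b": 0.9, "c": 0.2}, "b": {"d": 0.5}, "c": {"d": 0.5}}
rng = random.Random(0)  # seeded so a run can be repeated
runs = [len(independent_cascade(prob, {"a"}, rng)) for _ in range(1000)]
print("average cascade size:", sum(runs) / len(runs))
```

Because each run is random, the quantity of interest is the average cascade size over many runs, which can then be compared with the spread observed for real #hashtags.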

Page 32: Proposal defense

35

Diffusion Model

Threshold model: neighborhood density
Adopt if enough friends do so.

Cascade model: function of the sender and receiver
People have a chance of doing something if one of their friends is doing it.

Page 33: Proposal defense

36

Several Challenges at this step

Design a metric for #hashtag classification (e.g. p. 16): position of the #hashtag, functions, word structure.

Different #hashtags may have different adoption patterns and diffusion patterns.

Quantitative measurement of the “success” of a #hashtag: by frequency of mention and by longevity (within a short or long time frame).

Design a way to find competing, equally good #hashtags

Representative sample
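
For the "success" measurement, frequency and longevity can both be read off timestamped observations of each #hashtag. The sketch below is one possible operationalization, not a settled definition: longevity is taken as the time between the first and last observed use, tags are case-folded, and the function name and sample observations are made up.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def hashtag_success(observations):
    """Summarize each hashtag by frequency of mention and longevity.

    observations is an iterable of (timestamp, hashtag) pairs.
    Longevity is assumed to be last use minus first use.
    """
    times = defaultdict(list)
    for timestamp, tag in observations:
        times[tag.lower()].append(timestamp)
    return {
        tag: {
            "frequency": len(stamps),
            "longevity": max(stamps) - min(stamps),
        }
        for tag, stamps in times.items()
    }


# Made-up observations, for illustration only.
observations = [
    (datetime(2010, 6, 1, 9), "#acita09"),
    (datetime(2010, 6, 1, 12), "#acita09"),
    (datetime(2010, 6, 3, 9), "#ACITA09"),
    (datetime(2010, 6, 2, 9), "#pubquiz"),
]
summary = hashtag_success(observations)
print(summary["#acita09"]["frequency"], summary["#acita09"]["longevity"])
```

Restricting the observations to a fixed window before calling the function gives the short versus long time frame comparison mentioned above.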