tcs: efficient topic discovery over crowd-oriented service

SIGKDD 2014. TCS: Efficient Topic Discovery over Crowd-oriented Service Data. Yongxin Tong, Caleb Chen Cao, Lei Chen. Department of Computer Science and Engineering, The Hong Kong University of Science and Technology.


Post on 20-Jan-2022


TRANSCRIPT

Page 1: TCS: Efficient Topic Discovery over Crowd-oriented Service

SIGKDD 2014

TCS: Efficient Topic Discovery over Crowd-oriented Service Data

Yongxin Tong, Caleb Chen Cao, Lei Chen

Department of Computer Science and Engineering

The Hong Kong University of Science and Technology

Page 2: TCS: Efficient Topic Discovery over Crowd-oriented Service

Outline

– Motivations
– Problem Definitions
– Topic Crowd Service Model
– Pairwise Sketch
– Parameter Estimation
– Experimental Study
– Conclusion

2

Page 3: TCS: Efficient Topic Discovery over Crowd-oriented Service

Crowdsourcing in Social Media

3

Page 4: TCS: Efficient Topic Discovery over Crowd-oriented Service

Crowdsourcing Process

MTurk workers (Photo by Andrian Chen)

AMT

Requesters

The web connects the tasks from requesters and the responses from workers.

4

Page 5: TCS: Efficient Topic Discovery over Crowd-oriented Service

Crowd-Oriented Services

The information services provided by crowdsourcing usually include a massive number of task-response pairs.

5

Page 13: TCS: Efficient Topic Discovery over Crowd-oriented Service

Crowd-Oriented Service Data

A snippet of crowd-oriented service data from Stack Overflow

– Tasks: Task ID, Task Details, Timestamp, etc.

– Responses: Response ID, Task ID, Response Details, Timestamp, etc.

Task ID | Tasks | Timestamp
T1 | Android application database to save images ... | 2014-01-31 17:30:33
T2 | How to use simpleadapter with activity Android... | 2014-02-02 10:31:25
T3 | How to find mobile phones on hotspot networks in iPhone? | 2014-02-03 21:40:01

Response ID | Task ID | Responses | Timestamp
R1 | T1 | Android SQLite database with multiple... | 2014-01-31 18:23:01
R2 | T1 | Save the image to your sdcard. ... | 2014-02-01 15:01:53
R3 | T1 | Storing images in your database will… | 2014-02-01 16:38:17

13

Page 14: TCS: Efficient Topic Discovery over Crowd-oriented Service

Characteristic of Crowd-Oriented Service Data-I

Task-Response Pairs
– Task-Response Correlation

Response ID | Task ID | Responses | Timestamp
R1 | T1 | Android SQLite database with multiple... | 2014-01-31 18:23:01
R2 | T1 | Save the image to your sdcard. ... | 2014-02-01 15:01:53
R3 | T1 | Storing images in your database will… | 2014-02-01 16:38:17

Task ID | Tasks | Timestamp
T1 | Android application database to save images ... | 2014-01-31 17:30:33
T2 | How to use simpleadapter with activity Android... | 2014-02-02 10:31:25
T3 | How to find mobile phones on hotspot networks in iPhone? | 2014-02-03 21:40:01

14

Page 15: TCS: Efficient Topic Discovery over Crowd-oriented Service

Characteristic of Crowd-Oriented Service Data-II

Big volume
– Each task may have a large number of responses

Response ID | Task ID | Responses | Timestamp
R1 | T1 | Android SQLite database with multiple... | 2014-01-31 18:23:01
… | … | … | …
R100 | T1 | Storing images in your database will… | 2014-02-04 11:36:02
… | … | … | …

Task ID | Tasks | Timestamp
T1 | Android application database to save images ... | 2014-01-31 17:30:33
… | … | …

15

Page 16: TCS: Efficient Topic Discovery over Crowd-oriented Service

Characteristic of Crowd-Oriented Service Data-III

Dynamic Evolution with Time

Timeline

Task ID | Tasks | Timestamp
T1 | Android application database to save images ... | 2014-01-31 17:30:33

1st Bucket

16

Page 17: TCS: Efficient Topic Discovery over Crowd-oriented Service

Characteristic of Crowd-Oriented Service Data-III

Dynamic Evolution with Time
– Accumulates as Multiple Consecutive Buckets

Timeline

Task ID | Tasks | Timestamp
T1 | Android application database to save images ... | 2014-01-31 17:30:33

Response ID | Task ID | Responses | Timestamp
R1 | T1 | Android SQLite database with multiple... | 2014-01-31 18:23:01

1st Bucket

17

Page 18: TCS: Efficient Topic Discovery over Crowd-oriented Service

Characteristic of Crowd-Oriented Service Data-III

Dynamic Evolution with Time

Timeline

Task ID | Tasks | Timestamp
T1 | Android application database to save images ... | 2014-01-31 17:30:33

Response ID | Task ID | Responses | Timestamp
R1 | T1 | Android SQLite database with multiple... | 2014-01-31 18:23:01
R2 | T1 | Save the image to your sdcard. ... | 2014-02-01 15:01:53

1st Bucket

2nd Bucket

18

Page 19: TCS: Efficient Topic Discovery over Crowd-oriented Service

Characteristic of Crowd-Oriented Service Data-III

Dynamic Evolution with Time

Timeline

Task ID | Tasks | Timestamp
T1 | Android application database to save images ... | 2014-01-31 17:30:33

Response ID | Task ID | Responses | Timestamp
R1 | T1 | Android SQLite database with multiple... | 2014-01-31 17:30:33
R2 | T1 | Save the image to your sdcard. ... | 2014-02-01 10:31:25
R3 | T1 | Storing images in your database will… | 2014-02-01 21:40:01

1st Bucket

2nd Bucket

19

Page 20: TCS: Efficient Topic Discovery over Crowd-oriented Service

Characteristic of Crowd-Oriented Service Data-III

Dynamic Evolution with Time

Timeline

Task ID | Tasks | Timestamp
T1 | Android application database to save images ... | 2014-01-31 17:30:33

Response ID | Task ID | Responses | Timestamp
R1 | T1 | Android SQLite database with multiple... | 2014-01-31 17:30:33
R2 | T1 | Save the image to your sdcard. ... | 2014-02-01 10:31:25
R3 | T1 | Storing images in your database will… | 2014-02-01 21:40:01

Task ID | Tasks | Timestamp
T2 | How to use simpleadapter with activity Android... | 2014-02-02 10:31:25

1st Bucket

2nd Bucket

3rd Bucket

20

Page 21: TCS: Efficient Topic Discovery over Crowd-oriented Service

Characteristic of Crowd-Oriented Service Data-III

Dynamic Evolution with Time
– Accumulates as Multiple Consecutive Buckets

Timeline

Task ID | Tasks | Timestamp
T1 | Android application database to save images ... | 2014-01-31 17:30:33

Response ID | Task ID | Responses | Timestamp
R1 | T1 | Android SQLite database with multiple... | 2014-01-31 17:30:33
R2 | T1 | Save the image to your sdcard. ... | 2014-02-01 10:31:25
R3 | T1 | Storing images in your database will… | 2014-02-01 21:40:01

Task ID | Tasks | Timestamp
T2 | How to use simpleadapter with activity Android... | 2014-02-02 10:31:25
… | … | …

21

Page 22: TCS: Efficient Topic Discovery over Crowd-oriented Service

Applications of Topics in Crowd-Oriented Service Data

Accuracy of Crowdsourcing Systems
– Quality Control

User Personalization
– Task Requesters
– Crowd Workers

Task Recommendation
– Task Assignment in Crowdsourcing Platforms

22

Page 23: TCS: Efficient Topic Discovery over Crowd-oriented Service

Challenges

How to model crowd-oriented service data containing
– Task-Response Correlation

High training efficiency is important

Topics over crowd-oriented service data are evolving

Response ID | Task ID | Responses | Timestamp
R1 | T1 | Android SQLite database with multiple... | 2014-01-31 17:30:33
R2 | T1 | Save the image to your sdcard. ... | 2014-02-01 10:31:25
R3 | T1 | Storing images in your database will… | 2014-02-01 21:40:01

23

Page 24: TCS: Efficient Topic Discovery over Crowd-oriented Service

Our Contributions

We design a new probabilistic topic model for crowd-oriented service data.
– Topic Crowd Service Model

We propose an efficient solution to discover latent topics from crowd-oriented service data.
– Pairwise Sketch (pSketch)
– Bucket Parameter Estimation (BPE) Algorithm

We verify the effectiveness and efficiency of the proposed methods through extensive experiments.

24

Page 25: TCS: Efficient Topic Discovery over Crowd-oriented Service

Outline

– Motivations
– Problem Definitions
– Topic Crowd Service Model
– Pairwise Sketch
– Parameter Estimation
– Experimental Study
– Conclusion

25

Page 26: TCS: Efficient Topic Discovery over Crowd-oriented Service

Problem Definitions

Task-Response Pairs
– Given a crowd-oriented task $T_i$ and its set of corresponding responses $\{R_{i,1}, \ldots, R_{i,m}\}$, any pair $(T_i, R_{i,j})$ with $j \in [1, m]$ is called a task-response pair.

Crowd-oriented Service Data
– Let $CS = \{(T_1, R_{1,1}), \ldots, (T_n, R_{n,1}), \ldots, (T_n, R_{n,m})\}$ be a set of task-response pairs, where each task and each response is a document. Each document $d$ is represented by a subset of the collection of words. Given an arbitrary task-response pair, a word-pair consists of two words: one from the task document and the other from the response document.

26

Page 27: TCS: Efficient Topic Discovery over Crowd-oriented Service

Problem Definitions

$(T_1, R_{1,1})$ is a task-response pair. In this example, $CS = \{(T_1, R_{1,1}), (T_1, R_{1,2}), (T_1, R_{1,3}), (T_3, R_{3,1})\}$. (iPhone, iOS7) is a word-pair in $(T_3, R_{3,1})$.

Task ID | Tasks | Timestamp
T1 | Android application database to save images ... | 2014-01-31 17:30:33
T2 | How to use simpleadapter with activity Android... | 2014-02-02 10:31:25
T3 | How to find mobile phones on hotspot networks in iPhone? | 2014-02-03 21:40:01

Response ID | Task ID | Responses | Timestamp
R1,1 | T1 | Android SQLite database with multiple... | 2014-01-31 18:23:01
R1,2 | T1 | Save the image to your sdcard. ... | 2014-02-01 15:01:53
R1,3 | T1 | Storing images in your database will… | 2014-02-01 16:38:17
R3,1 | T3 | iOS 7 system of Apple devices provide... | 2014-02-03 22:14:27
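The word-pair definition can be illustrated with a minimal Python sketch. The `tokenize` and `word_pairs` helpers are hypothetical names, and the whitespace tokenizer is an assumption made for illustration; the slides do not say how documents are split into words (here "iOS 7" becomes the two tokens "ios" and "7", whereas the slide treats "iOS7" as one word).

```python
from itertools import product


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation (a simplification)."""
    words = (w.strip(".,?!…").lower() for w in text.split())
    return {w for w in words if w}


def word_pairs(task_text, response_text):
    """Return every (task word, response word) combination for one task-response pair."""
    return set(product(tokenize(task_text), tokenize(response_text)))


# Texts taken from the T3 / R3,1 rows of the example tables (truncation marks dropped).
task = "How to find mobile phones on hotspot networks in iPhone?"
response = "iOS 7 system of Apple devices provide"
pairs = word_pairs(task, response)
```

The number of word-pairs grows as the product of the two vocabulary sizes, which is why counting and storing all of them quickly becomes expensive.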

27

Page 28: TCS: Efficient Topic Discovery over Crowd-oriented Service

Problem Definitions

Topic
– A semantically coherent topic $\phi$ is a multinomial distribution of words $\{p(w \mid \phi)\}_{w \in W}$ with the constraint $\sum_{w \in W} p(w \mid \phi) = 1$.

Topic Discovery in Crowd-oriented Service Data
– Given a crowd-oriented service data set $CS$ as input, we are required to infer the latent topics $\phi$ in $CS$.

28

Page 29: TCS: Efficient Topic Discovery over Crowd-oriented Service

Outline

– Motivations
– Problem Definitions
– Topic Crowd Service Model
– Pairwise Sketch
– Parameter Estimation
– Experimental Study
– Conclusion

29

Page 30: TCS: Efficient Topic Discovery over Crowd-oriented Service

Generative Process of TCS Model

Each task and each response is viewed as a document.

TCS shares ingredients with Latent Dirichlet Allocation (LDA):
– Each topic has a distribution over words;
– Each document has a distribution over topics;
– If a document d is a task, sample a response from the set of task-response pairs;
– Otherwise, d is a response, so select its corresponding task;
– Combine the task and response as a new document and generate the new distribution over topics;
– Each sentence is the basic unit for topic assignment.
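The task/response pairing step of the generative process can be sketched as follows. This is only an illustration under assumptions: the function name `combined_document`, the dictionary layout, the task-first ordering of the combined document, and the handling of tasks without responses are not taken from the slides.

```python
import random


def combined_document(doc_id, tasks, responses, rng):
    """Build the combined task+response document for one task or response.

    tasks:     {task_id: [sentence, ...]}
    responses: {response_id: (task_id, [sentence, ...])}
    """
    if doc_id in tasks:
        # d is a task: sample one of its responses.
        candidates = [r for r, (t, _) in responses.items() if t == doc_id]
        if not candidates:
            # Assumption: a task without responses stands alone.
            return list(tasks[doc_id])
        response_id = rng.choice(candidates)
        return list(tasks[doc_id]) + list(responses[response_id][1])
    # d is a response: select its corresponding task.
    task_id, sentences = responses[doc_id]
    return list(tasks[task_id]) + list(sentences)


tasks = {
    "T1": ["Android application database to save images"],
    "T3": ["How to find mobile phones on hotspot networks in iPhone?"],
}
responses = {
    "R1,1": ("T1", ["Android SQLite database with multiple"]),
    "R3,1": ("T3", ["iOS 7 system of Apple devices provide"]),
}
doc_from_response = combined_document("R3,1", tasks, responses, random.Random(0))
doc_from_task = combined_document("T1", tasks, responses, random.Random(0))
```

Each element of the returned list is a sentence, matching the slide's statement that the sentence is the basic unit for topic assignment.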

30

Page 31: TCS: Efficient Topic Discovery over Crowd-oriented Service

Challenges of TCS Model

It is infeasible to count and store the frequencies of all word-pairs due to the excessively high cost.
– Our Solution: Store only significant (frequent) word-pairs and remove extremely infrequent word-pairs.

How can the TCS model be trained efficiently when the correlation of task-response pairs is considered?
– Our Solution: Speed up the training and belief-updating process according to significant word-pairs.

31

Page 32: TCS: Efficient Topic Discovery over Crowd-oriented Service

Outline

– Motivations
– Problem Definitions
– Topic Crowd Service Model
– Pairwise Sketch
– Parameter Estimation
– Experimental Study
– Conclusion

32

Page 33: TCS: Efficient Topic Discovery over Crowd-oriented Service

Key ideas of Pairwise Sketch

Main ideas
– A sketch-based method
– Approximate the frequency of word-pairs in tasks and responses with an error bound that holds with a given probability.

Only frequent word-pairs are significant for topic modeling
– Extremely infrequent word-pairs in tasks and responses are removed.

Effective Space Complexity
– O( )
– Refer to our paper for the proof if you are interested.

1 | |Tδεδ+
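To make the sketch-based counting idea concrete, here is a minimal Python illustration built on a generic count-min sketch, with one sketch for task words and one for response words. It is not the authors' pSketch: the class names, the SHA-256 based hashing, the default width and depth, and the rule that a word-pair is significant when both words pass relative-frequency thresholds (`lam_t`, `lam_r`, standing in for λT and λR) are all assumptions.

```python
import hashlib


class CountMinSketch:
    """Generic count-min sketch: estimates never fall below the true count."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]
        self.total = 0  # total number of counted word occurrences

    def _index(self, word, row):
        # One hash function per row, derived by prefixing the row number.
        digest = hashlib.sha256(f"{row}:{word}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, word, count=1):
        self.total += count
        for row in range(self.depth):
            self.rows[row][self._index(word, row)] += count

    def estimate(self, word):
        return min(self.rows[row][self._index(word, row)] for row in range(self.depth))


class PairwiseSketch:
    """Two sketches: one for words in tasks, one for words in responses."""

    def __init__(self, width=1024, depth=4):
        self.task = CountMinSketch(width, depth)
        self.response = CountMinSketch(width, depth)

    def add_pair(self, task_words, response_words):
        for w in task_words:
            self.task.add(w)
        for w in response_words:
            self.response.add(w)

    def is_significant(self, task_word, response_word, lam_t=0.001, lam_r=0.001):
        # Assumed rule: both words must reach their relative-frequency threshold.
        return (self.task.estimate(task_word) >= lam_t * self.task.total
                and self.response.estimate(response_word) >= lam_r * self.response.total)


sketch = PairwiseSketch()
sketch.add_pair(["iphone", "hotspot"], ["ios", "apple"])
sketch.add_pair(["iphone", "hotspot"], ["ios", "apple"])
sketch.add_pair(["android"], ["sqlite"])
```

Memory use is fixed by `width` and `depth` regardless of vocabulary size, at the cost of possible overestimates when words collide in every row.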

33

Page 34: TCS: Efficient Topic Discovery over Crowd-oriented Service

Pairwise Sketch

[Figure: two sketches, one for counting the frequency of words in tasks (task word frequency) and one for counting the frequency of words in responses (response word frequency)]

34

Page 35: TCS: Efficient Topic Discovery over Crowd-oriented Service

Pairwise Sketch

[Figure: pairwise sketch update. Labels: word in task "iPhone", word in response "iOS", hash function, +1 counter increments]

35

Page 36: TCS: Efficient Topic Discovery over Crowd-oriented Service

Pairwise Sketch

[Figure: pairwise sketch update, continued. Labels: word in task "iPhone", word in response "iOS", two hash functions, +1 counter increments]

36

Page 37: TCS: Efficient Topic Discovery over Crowd-oriented Service

Outline

– Motivations
– Problem Definitions
– Topic Crowd Service Model
– Pairwise Sketch
– Parameter Estimation
– Experimental Study
– Conclusion

37

Page 38: TCS: Efficient Topic Discovery over Crowd-oriented Service

Belief Residual of Sentences

The belief that a sentence $s$ of a specific document $d$ is generated by topic $k$ is denoted by $\mu^k_{d,s}$.
– $\mu^k_{d,s}$ is calculated as follows,

The belief residual $r^k_{d,s}$ between two successive iterations $t$ and $(t-1)$ is calculated as follows,

The estimation of the topic distribution at the document level.

The estimation of the word distribution at the topic level.

38

Page 39: TCS: Efficient Topic Discovery over Crowd-oriented Service

Belief Residual of Documents

The residual of a specific document d at topic k is as follows:

The residual of the document d is calculated as follows:

Updating beliefs with large residuals accelerates convergence.

39

Page 40: TCS: Efficient Topic Discovery over Crowd-oriented Service

Belief Update Algorithm

After each iteration
– Sort $r_d$ in descending order for all documents;
– Select several documents with the largest residuals;
– For each selected document
   Sort $r^k_d$ in descending order;
   Select several topics with the largest residuals;
– Update the corresponding $\mu^k_{d,s}$;
– Normalize the corresponding $\mu^k_{d,s}$;
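The selection steps listed above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the document residual is assumed to be the sum of the per-topic residuals (the slide's formula is not preserved in this transcript), and `rho_d` / `rho_k` stand for the proportions ρD and ρK of documents and topics that are updated.

```python
def select_for_update(residuals, rho_d=0.5, rho_k=0.5):
    """Pick the documents and topics with the largest residuals.

    residuals: {doc_id: [r_d^1, ..., r_d^K]}, the per-topic residuals of each document.
    Returns {doc_id: [topic indices to update]} for the selected documents only.
    """
    # Assumption: the residual of a document is the sum of its per-topic residuals.
    doc_residual = {d: sum(r_k) for d, r_k in residuals.items()}
    n_docs = max(1, int(rho_d * len(residuals)))
    top_docs = sorted(doc_residual, key=doc_residual.get, reverse=True)[:n_docs]

    selection = {}
    for d in top_docs:
        r_k = residuals[d]
        n_topics = max(1, int(rho_k * len(r_k)))
        ranked = sorted(range(len(r_k)), key=r_k.__getitem__, reverse=True)
        selection[d] = ranked[:n_topics]
    return selection


def normalize(beliefs):
    """Rescale the beliefs of one sentence so that they sum to 1 over topics."""
    total = sum(beliefs)
    return [b / total for b in beliefs]


residuals = {
    "d1": [0.1, 0.4, 0.0, 0.2],
    "d2": [0.01, 0.02, 0.0, 0.0],
    "d3": [0.5, 0.3, 0.1, 0.05],
    "d4": [0.0, 0.0, 0.0, 0.01],
}
selection = select_for_update(residuals, rho_d=0.5, rho_k=0.5)
```

With four documents and four topics, half of each is selected, so documents d2 and d4 are skipped in this iteration and only two topics per selected document are updated.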

40

Page 41: TCS: Efficient Topic Discovery over Crowd-oriented Service

Belief Update Algorithm

A Running Example

[Figure: three rows of beliefs over Topic 1, Topic 2, Topic 3, and Topic 4. Legend: selected documents, selected topics, document ignored in current iteration]

41

Page 42: TCS: Efficient Topic Discovery over Crowd-oriented Service

Parameter Estimation in A Bucket

Topic distribution in the document d,

Word distribution in the topic k,

The prior of the topic distribution of the document d.

The prior of the word distribution of the topic k.

42

Page 43: TCS: Efficient Topic Discovery over Crowd-oriented Service

Belief Calculation over Multiple Buckets

Statistic of the previous consecutive (m-1) buckets
– Word information accumulated from the previous buckets: $\Omega^k_w[m-1] = n_{\cdot,\cdot,w}[m-1]\,\mu^k_{\cdot,\cdot,w}[m-1]$
– $n_{\cdot,\cdot,w}[m-1]$: the number of times the word $w$ occurs in all sentences of all documents in the previous (m-1) buckets.
– $\mu^k_{\cdot,\cdot,w}[m-1]$: the belief that the word $w$ in all sentences of all documents is assigned to the topic $k$ in the previous (m-1) buckets.

For the current bucket, $n_{\cdot,-s,w}[m]\,\mu^k_{\cdot,-s,w}[m] + \Omega^k_w[m-1]$ works as the delegate of $n_{\cdot,-s,w}[m]\,\mu^k_{\cdot,-s,w}[m]$.

[Figure: Bucket (m-1) passes $\Omega^k_w[m-1]$ to Bucket (m), which uses $n_{\cdot,-s,w}[m]\,\mu^k_{\cdot,-s,w}[m] + \Omega^k_w[m-1]$]
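A small Python sketch of how the carried-over word statistic could be combined with the current bucket. The function names and the dictionary layout are hypothetical, and the sketch assumes that the carried-over term is simply added word by word and topic by topic, as the expression on the slide suggests; how the statistic is maintained across many buckets is not specified here.

```python
def carry_over(counts, beliefs):
    """Omega[w][k] = n_w * mu_w^k, computed from the previous buckets.

    counts:  {word: number of occurrences in the previous buckets}
    beliefs: {word: [belief that the word belongs to topic k, for each k]}
    """
    return {w: [counts[w] * mu for mu in beliefs[w]] for w in counts}


def add_previous(current, omega, num_topics):
    """Add the carried-over Omega to the current bucket's n * mu term, per word and topic."""
    zero = [0.0] * num_topics
    return {
        w: [c + o for c, o in zip(current.get(w, zero), omega.get(w, zero))]
        for w in set(current) | set(omega)
    }


# Previous buckets: word counts and per-topic beliefs (two topics).
omega = carry_over(
    counts={"android": 4, "sqlite": 2},
    beliefs={"android": [0.75, 0.25], "sqlite": [0.5, 0.5]},
)
# Current bucket: n * mu already multiplied out per word and topic.
current = {"android": [1.0, 0.0], "iphone": [0.0, 2.0]}
combined = add_previous(current, omega, num_topics=2)
```

Words seen only in earlier buckets keep their accumulated mass, and words that are new in the current bucket start from zero history.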

43

Page 44: TCS: Efficient Topic Discovery over Crowd-oriented Service

Outline

– Motivations
– Problem Definitions
– Topic Crowd Service Model
– Pairwise Sketch
– Parameter Estimation
– Experimental Study
– Conclusion

44

Page 45: TCS: Efficient Topic Discovery over Crowd-oriented Service

Efficiency

The efficiency of TCS (BPE) is better than that of the other topic models:
– TCS is a lightweight topic model.
– TCS reduces the amount of data that needs to be processed.
– BPE reduces the document scope and topic scope that need to be scanned in each iteration.

46

Page 46: TCS: Efficient Topic Discovery over Crowd-oriented Service

Memory Consumption

The memory consumption of each topic model increases with the size of a bucket and with the number of topics.

pSketch consumes the least memory
– It reduces the amount of data that needs to be processed

47

Page 47: TCS: Efficient Topic Discovery over Crowd-oriented Service

Effectiveness

TCS demonstrates good performance in terms of perplexity.

Perplexity1 describes the held-out perplexity on the learned model.

Perplexity2 is used to evaluate the effectiveness of the model's prediction.

48

Page 48: TCS: Efficient Topic Discovery over Crowd-oriented Service

BPE vs. GS and VB

BPE is always the fastest of the three methods.
– GS and VB use the collapsed Gibbs sampling and variational Bayes parameter estimation methods, respectively.

BPE shows the best effectiveness when the data size is small.
– The performance gain of BPE decreases as the data size increases.
– The effectiveness of BPE is not worse than that of GS and VB.

49

Page 49: TCS: Efficient Topic Discovery over Crowd-oriented Service

Topic Evolution (Sport Topic)

Bucket 1 | Bucket 2 | Bucket 3 | Bucket 4 | Bucket 5
Running | Running | Running | Running | Running
Basketball | Basketball | Swimming | Swimming | Swimming
Injury | Gym | Basketball | Gym | Gym
Shoe | Swimming | Gym | Football | Football
Gym | Injury | Football | Basketball | Basketball
Football | Football | Injury | Shoe | Shoe
Swimming | Shoe | Shoe | Injury | Injury

We compare the topics about sport that are discovered from different consecutive buckets.

50

Page 50: TCS: Efficient Topic Discovery over Crowd-oriented Service

Outline

– Motivations
– Problem Definitions
– Topic Crowd Service Model
– Pairwise Sketch
– Parameter Estimation
– Experimental Study
– Conclusion

51

Page 51: TCS: Efficient Topic Discovery over Crowd-oriented Service

Conclusion

Propose a new problem of discovering latent topics over massive crowd-oriented service data.

Design a new probabilistic topic model for crowd-oriented service data.
– Topic Crowd Service Model

Propose an efficient solution to discover latent topics from crowd-oriented service data.
– Pairwise Sketch (pSketch)
– Bucket Parameter Estimation (BPE) Algorithm

Show the effectiveness and efficiency of the proposed methods through extensive experiments.

52

Page 52: TCS: Efficient Topic Discovery over Crowd-oriented Service

53

Page 53: TCS: Efficient Topic Discovery over Crowd-oriented Service

Active Learning vs. Crowdsourcing

There are three main differences

Range of Applications
– Active Learning: improving the accuracy of learning algorithms only
– Crowdsourcing: enhancing the quality of any questions

Quality of Feedback
– Active Learning: assuming that workers are correct
– Crowdsourcing: considering confidences of different crowd workers

Task Assignment Strategy
– Active Learning: no consideration of task assignment
– Crowdsourcing: optimal task assignment result

54

Page 54: TCS: Efficient Topic Discovery over Crowd-oriented Service

Time Complexity of Algorithm 3

The time complexity of Algorithm 3 is $O(|D| \times |I| \times |ASW|)$
– |D| is the number of all documents, including tasks and responses.
– |I| is the number of iterations, which is set as an input parameter.
– |ASW| is the average number of words among significant word-pairs in each document.

55

Page 55: TCS: Efficient Topic Discovery over Crowd-oriented Service

Effects of ρD and ρK

ρD is the selected proportion of documents for message passing in each iteration.

ρK is the selected proportion of topics for message passing in each iteration.

Lower ρD and ρK make the BPE algorithm faster, since the number of updated topics and documents decreases.

The default values of ρD and ρK are set to 0.5, since the perplexity is worse if ρD and ρK are lower than 0.5.

56

Page 56: TCS: Efficient Topic Discovery over Crowd-oriented Service

Effects of λT and λR

λT is the significance (frequency) threshold for words among tasks in pSketch.

λR is the significance (frequency) threshold for words among responses in pSketch.

Higher λT and λR lower the memory cost of pSketch and lead to a faster training process for the BPE algorithm.

The default values of λT and λR are set to 0.001 (1‰).

57

Page 57: TCS: Efficient Topic Discovery over Crowd-oriented Service

BPE vs. GS vs. VB

The three parameter estimation approaches have the same inputs from pSketch, so we only compare their effectiveness in the parameter estimation process.

GS is stable but slow.

VB is faster, but its accuracy is lower.

BPE only updates the selected proportions of documents and topics in each iteration, so it is the fastest. In particular, the effectiveness of BPE is not lower than that of GS and VB.

58