TRANSCRIPT
5 December 2012, The University of Hong Kong
Crowd Mining (joint work with Y. Amsterdamer, Y. Grossman, and T. Milo)
PIERRE SENELLART
2 / 26 Télécom PT & Tel Aviv U. Pierre Senellart
Association rule mining
One of the most studied aspects of data mining [Agrawal et al., 1993]
Discovering rules in a database of transactions D
Transaction: set of items
Rule: X → Y with X, Y sets of items
Only interested in rules with support and confidence greater than given thresholds θs, θc

supp(X → Y) = #{t ∈ D | X ∪ Y ⊆ t} / #D

conf(X → Y) = #{t ∈ D | X ∪ Y ⊆ t} / #{t ∈ D | X ⊆ t}
Typical application: market basket analysis (Diaper → Beer)
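These definitions can be sketched in a few lines of Python (the transaction database below is made up for illustration, not from the talk):

```python
# Toy sketch: support and confidence of a rule X -> Y over transactions D.
D = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
]

def supp(X, Y):
    # Fraction of all transactions containing every item of X ∪ Y
    return sum(1 for t in D if (X | Y) <= t) / len(D)

def conf(X, Y):
    # Among transactions containing X, the fraction also containing Y
    return sum(1 for t in D if (X | Y) <= t) / sum(1 for t in D if X <= t)

print(supp({"diaper"}, {"beer"}))  # 0.5
print(conf({"diaper"}, {"beer"}))  # ≈ 0.667
```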
Crowd-sourced data
Many applications where raw, extensional, exhaustive data is not available
But intensionally hidden in people's collective minds
⇒ Resort to asking humans (the crowd) for bits of the data they know (shopping history, life habits, etc.)
Humans are bad at remembering the full history; also bad at discovering correlations
The crowd is a costly resource [Parameswaran and Polyzotis, 2011]
Mining association rules from the crowd
Goal of this work
Determining association rules on crowd-sourced data, by:
asking questions to humans that are easy to answer;
determining which is the best question to ask at any given point;
deducing from all answers a (probabilistic) set of valid associationrules;
optimizing this computation as much as possible.
Outline
Introduction
Concepts
Crowd Mining Algorithm
The CrowdMiner System
Experiments
Conclusions
User support and confidence
A set of users U
Each user u ∈ U has a (hidden) transaction database Du
Each rule X → Y is associated with its user support and user confidence:

suppu(X → Y) = #{t ∈ Du | X ∪ Y ⊆ t} / #Du

confu(X → Y) = #{t ∈ Du | X ∪ Y ⊆ t} / #{t ∈ Du | X ⊆ t}
Significant rules
Significant rules are those whose overall support and confidence are above specified thresholds θs, θc
Overall support and confidence defined as the mean user support and confidence:

supp(r) = avg_{u ∈ U} suppu(r)        conf(r) = avg_{u ∈ U} confu(r)
Goal: finding significant rules while asking as few questions to the crowd as possible
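A minimal sketch of this aggregation, with made-up per-user values and assumed thresholds:

```python
# Sketch: overall support and confidence are the means of the per-user
# values, compared against thresholds. All numbers here are made up.
user_supp = {"u1": 3 / 21, "u2": 5 / 21, "u3": 1 / 21}
user_conf = {"u1": 3 / 7, "u2": 5 / 7, "u3": 1 / 3}

overall_supp = sum(user_supp.values()) / len(user_supp)
overall_conf = sum(user_conf.values()) / len(user_conf)

theta_s, theta_c = 0.1, 0.4   # assumed thresholds
significant = overall_supp >= theta_s and overall_conf >= theta_c
print(overall_supp, overall_conf, significant)
```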
Questions to the crowd
Two kinds of questions:
Closed questions (X → Y?): ask a user for her (approximate) support and confidence for this rule;
Open questions (? → ?): ask a user for one arbitrary rule and its (approximate) support and confidence.
Users will not be precise, but that's fine.
Example (Morning → Jogging)
“How often do you go jogging in the morning?”
“I go jogging three times per week in the morning.”
confu(Morning → Jogging) = 3/7    suppu(Morning → Jogging) = 3/21
(if there is one transaction for each morning, afternoon, evening)
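The conversion in this example can be sketched as follows (assuming, as above, three transactions per day over a week):

```python
# Sketch of the conversion above: one transaction per morning, afternoon,
# and evening, i.e. 21 transactions per week, 7 of which are mornings.
def answer_to_supp_conf(times_per_week, periods_per_day=3, days=7):
    transactions = periods_per_day * days    # 21 transactions in a week
    supp = times_per_week / transactions     # jogging mornings over all transactions
    conf = times_per_week / days             # jogging mornings over all mornings
    return supp, conf

s, c = answer_to_supp_conf(3)
print(s, c)  # 3/21 and 3/7
```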
Algorithm components
[Flowchart: Choose the next question → Open or closed question? → Choose the next closed question → Choose candidate rules → Rank the rules by grade → Estimate next error / Estimate current error → estimate mean distribution / estimate sample distribution / estimate rule significance]

One general framework for crowd mining
One particular choice of implementation of all black boxes
We do not claim any optimality
But we validate by experiments
Estimating distributions
Attention: support and confidence are correlated, we need to consider bivariate distributions!
Central limit theorem: the sample distribution of (confidence, support) pairs for a rule is normally distributed
Hypothesis: the distribution of (confidence, support) values for rules among the whole set of users is normally distributed
The sample mean μ and covariance matrix Σ are unbiased estimators of those of the original distribution
[Scatter plot: (support, confidence) samples for one rule; both axes range from 0.0 to 1.0]
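Estimating μ and Σ from per-user samples is short with NumPy (the sample values below are made up):

```python
import numpy as np

# Sketch: estimate the mean and covariance of the (support, confidence)
# distribution from per-user answers. Sample values are made up.
samples = np.array([
    [0.10, 0.40],   # (support, confidence) reported by one user
    [0.15, 0.55],
    [0.12, 0.45],
    [0.20, 0.60],
])
mu = samples.mean(axis=0)               # sample mean
sigma = np.cov(samples, rowvar=False)   # sample covariance (ddof=1, unbiased)
print(mu)
print(sigma)
```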
Estimating rule significance
A rule is significant if:

∫_{θs}^{1} ∫_{θc}^{1} N_{μ, Σ/K}(c, s) dc ds > 0.5

μ, Σ are the sample mean and covariance matrix
K is the number of samples
N is the bivariate normal density
Efficient algorithms [Genz, 2004] for numerical integration ofbivariate normal distributions.
The current error probability on rule significance is simply thedistance of this integral to 0 or 1
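A sketch of this significance test with SciPy, using inclusion-exclusion on CDFs rather than a dedicated rectangle-probability routine like [Genz, 2004]; all numeric parameters are assumptions for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Sketch: probability that the (support, confidence) mean exceeds the
# thresholds (theta_s, theta_c) under N(mu, Sigma/K). Numbers made up.
mu = np.array([0.30, 0.60])                       # sample mean
Sigma = np.array([[0.02, 0.01], [0.01, 0.03]])    # sample covariance
K = 25                                            # number of samples
cov = Sigma / K
theta_s, theta_c = 0.25, 0.50

# Upper-tail rectangle via inclusion-exclusion:
# P(S > a, C > b) = 1 - F_S(a) - F_C(b) + F(a, b)
F_joint = multivariate_normal(mu, cov).cdf([theta_s, theta_c])
F_s = norm(mu[0], np.sqrt(cov[0, 0])).cdf(theta_s)
F_c = norm(mu[1], np.sqrt(cov[1, 1])).cdf(theta_c)
p_significant = 1 - F_s - F_c + F_joint
error_prob = min(p_significant, 1 - p_significant)
print(p_significant, error_prob)
```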
Estimating next error
The current distribution N(μ, Σ) for a rule can be used as an estimator of what the next answer would be
We sample according to N(μ, Σ), recompute rule significance and error probabilities, and deduce from that the next error probability in this particular case
[Two scatter plots of (support, confidence) samples, axes from 0.0 to 1.0: the current distribution, and the distribution after one additional sampled answer]
By averaging over all samples, we obtain an estimate of the nexterror probability
The difference between next error and current error (expectederror reduction) is an estimate of how much we gain by asking aquestion on this rule!
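A Monte Carlo sketch of this estimate, under stated assumptions: thresholds, μ, Σ, and K are made up, and only the sample mean is updated with the simulated answer (the covariance update is omitted for simplicity):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Sketch: simulate the next answer from N(mu, Sigma), fold it into the
# sample mean, recompute the error, and average over many simulations.
rng = np.random.default_rng(0)
theta = (0.25, 0.50)                              # assumed thresholds

def significance(mu, Sigma, K):
    cov = Sigma / K
    F = multivariate_normal(mu, cov).cdf(list(theta))
    Fs = norm(mu[0], np.sqrt(cov[0, 0])).cdf(theta[0])
    Fc = norm(mu[1], np.sqrt(cov[1, 1])).cdf(theta[1])
    return 1 - Fs - Fc + F                        # P(rule is significant)

def error(p):
    return min(p, 1 - p)                          # distance to a sure answer

mu = np.array([0.28, 0.55])
Sigma = np.array([[0.02, 0.01], [0.01, 0.03]])
K = 10
current_error = error(significance(mu, Sigma, K))

next_errors = []
for _ in range(200):
    answer = rng.multivariate_normal(mu, Sigma)   # simulated next answer
    new_mu = (K * mu + answer) / (K + 1)          # updated sample mean only
    next_errors.append(error(significance(new_mu, Sigma, K + 1)))

expected_reduction = current_error - np.mean(next_errors)
print(current_error, expected_reduction)
```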
Putting everything together
[Flowchart: same algorithm components as before (choose the next question, open or closed, candidate rules, grades, error estimates)]
Candidate rules are rules of length 1, rules for which we have samples, and rules for which subrules are significant (analogous to Apriori [Agrawal et al., 1994])
The grade of a rule is the expected error reduction when known, an estimate based on subrules otherwise
We decide between closed or open by flipping a coin (exploitation vs. exploration)
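The candidate-rule step can be sketched as follows; this is a simplification for illustration, not the paper's exact algorithm (only the consequent is grown, and the rule sets are made up):

```python
from itertools import combinations

# Hedged sketch: keep all rules with one item on each side, and grow a
# rule's consequent only when every immediate subrule is significant,
# in the spirit of Apriori.
items = ["morning", "jogging", "coffee"]
significant = {
    (frozenset({"morning"}), frozenset({"jogging"})),
    (frozenset({"morning"}), frozenset({"coffee"})),
}

def candidates(items, significant):
    cands = set()
    for x in items:                       # all length-1 -> length-1 rules
        for y in items:
            if x != y:
                cands.add((frozenset({x}), frozenset({y})))
    # Merge two significant rules sharing an antecedent into a larger one
    for (X1, Y1), (X2, Y2) in combinations(significant, 2):
        if X1 == X2 and len(Y1) == len(Y2):
            merged_Y = Y1 | Y2
            subrules = {(X1, merged_Y - {y}) for y in merged_Y}
            if subrules <= significant:   # Apriori-style pruning
                cands.add((X1, merged_Y))
    return cands

cands = candidates(items, significant)
print((frozenset({"morning"}), frozenset({"jogging", "coffee"})) in cands)  # True
```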
Architecture
[Architecture diagram: Portal User Interface with Question Display; Query Selector (produces rule + query), Data Aggregator (stores [rule, conf, supp] answers), Best Rules Extractor (returns results for user queries); Rule Database; Initial Data]
Outcome
Rank   Change   Rule                              Support   Conf.   Error Prob.
1      +1       Morning → Jogging                 0.087     0.61    0.521e-11
2      -1       Jogging → Energy Drink, Granola   0.085     0.5     0.66e-8
3               Morning → Coffee                  0.067     0.52    0.54e-7
…      …        …                                 …         …       …
1752   -8       Upset Stomach → Chamomile         0.032     0.05    0.03
1753            Vegetarian, Yoga → Raw Foods      0.009     0.047   0.012
…      …        …                                 …         …       …
Datasets
We experimented on several datasets:
Real-world Retail dataset [Brijs et al., 1999] from a shopping basket application; since the data is anonymized, users are assigned transactions in a random fashion
Edits on categories in Simple English Wikipedia: transactions are articles, items are high-level categories (WordNet-level classes of YAGO [Suchanek et al., 2007]) assigned to articles, users are editors of these articles
Synthetic dataset (not discussed here)
Experimental setting
Baselines:
Random: at each step, we choose a random rule to ask a user about
Greedy: ask about the known rule with the fewest samples (starting with smaller rules)

Settings:
Zero-knowledge: we start with no information about the world
Known items: the set of items is known, no information about rules
Rule refinement: already know some rules (not discussed here)

We evaluate in terms of precision, recall, F-measure of predicted significant rules, as well as absolute number of errors (not discussed here)
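The evaluation metrics can be sketched as follows (rule names are placeholders):

```python
# Sketch: precision, recall and F-measure of the predicted significant
# rules against the truly significant ones.
def prf(predicted, true):
    tp = len(predicted & true)                    # correctly predicted rules
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(true) if true else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

predicted = {"A->B", "A->C", "B->C"}
true = {"A->B", "B->C", "C->D"}
print(prf(predicted, true))  # precision = recall = F = 2/3
```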
F-measure, zero-knowledge
[Two plots of F-measure (y-axis) vs. Number of Samples (x-axis, 0 to 2000), comparing CrowdMiner, Random, and Greedy: left, Retail dataset (y-axis up to 0.9); right, Wikipedia dataset (y-axis up to 0.5)]
Precision and recall, zero-knowledge
[Two plots on the Retail dataset, comparing CrowdMiner, Random, and Greedy: Precision vs. Number of Samples (0 to 2000), and Recall vs. Number of Samples (0 to 2000)]
Better precision: we make sure to reduce the global expected number of errors; Greedy loses precision as new rules are explored
Much better recall: due to adding potentially large rules as candidates once candidate subrules are found (Greedy will only add such rules much later)
F-measure, known items
[Plot on the Retail dataset: F-measure vs. Number of Samples (0 to 2000) for CrowdMiner, Random, and Greedy]
Good initial precision of the greedy algorithm: the best thing to do is to start by asking about rules of small size anyway
CrowdMiner overtakes Greedy: larger rules are soon made candidate and their significance assessed
In brief
How to design an interactive poll? Many situations where one wants to find correlations in non-extensionally accessible data
“Crowd-sourced” Apriori (but with subtleties)
Good behavior in practice
Many other design choices for replacing black boxes, especially inthe presence of priors
Connections with active learning [Lindenbaum et al., 2004]
Perspectives
What are the best next k questions to ask? Allows parallelization. Also possible to do that by sampling, and not significantly more costly!
Take into account correlations between rules to refine estimates
Which user to ask which question?
References I
R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD Record, 22(2), 1993.
R. Agrawal, R. Srikant, et al. Fast algorithms for mining association rules. In VLDB, 1994.
T. Brijs, G. Swinnen, K. Vanhoof, and G. Wets. Using association rules for product assortment decisions: A case study. In Knowledge Discovery and Data Mining, 1999.
A. Genz. Numerical computation of rectangular bivariate and trivariate normal and t probabilities. Statistics and Computing, 14, 2004.
References II
M. Lindenbaum, S. Markovitch, and D. Rusakov. Selective sampling for nearest neighbor classifiers. Machine Learning, 54(2), 2004.
A. G. Parameswaran and N. Polyzotis. Answering queries using humans, algorithms and databases. In CIDR, 2011.
F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge unifying WordNet and Wikipedia. In WWW, 2007.