TRANSCRIPT
5 December 2012, The University of Hong Kong
Crowd Mining (joint work with Y. Amsterdamer, Y. Grossman, and T. Milo)
PIERRE SENELLART
2 / 26 Télécom PT & Tel Aviv U. Pierre Senellart
Association rule mining
One of the most studied aspects of data mining [Agrawal et al., 1993]
Discovering rules in a database of transactions D
Transaction: set of items
Rule: X → Y with X, Y sets of items
Only interested in rules with support and confidence greater than given thresholds θs, θc

supp(X → Y) = #{t ∈ D | X ∪ Y ⊆ t} / #D

conf(X → Y) = #{t ∈ D | X ∪ Y ⊆ t} / #{t ∈ D | X ⊆ t}
Typical application: market basket analysis (Diaper → Beer)
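These definitions can be sketched in a few lines of Python (the transaction database below is made up for illustration, not from the talk):

```python
# Toy sketch: support and confidence of a rule X -> Y over transactions D.
D = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
]

def supp(X, Y):
    # Fraction of all transactions containing every item of X ∪ Y
    return sum(1 for t in D if (X | Y) <= t) / len(D)

def conf(X, Y):
    # Among transactions containing X, the fraction also containing Y
    return sum(1 for t in D if (X | Y) <= t) / sum(1 for t in D if X <= t)

print(supp({"diaper"}, {"beer"}))  # 0.5
print(conf({"diaper"}, {"beer"}))  # ≈ 0.667
```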
Crowd-sourced data
Many applications where raw, extensional, exhaustive data is not available
But intensionally hidden in people's collective minds
⇒ Resort to asking humans (the crowd) for bits of the data they know (shopping history, life habits, etc.)
Humans are bad at remembering the full history; also bad at discovering correlations
The crowd is a costly resource [Parameswaran and Polyzotis, 2011]
Mining association rules from the crowd
Goal of this work
Determining association rules on crowd-sourced data, by:
asking questions to humans that are easy to answer;
determining which is the best question to ask at any given point;
deducing from all answers a (probabilistic) set of valid associationrules;
optimizing this computation as much as possible.
Outline
Introduction
Concepts
Crowd Mining Algorithm
The CrowdMiner System
Experiments
Conclusions
User support and confidence
A set of users U
Each user u ∈ U has a (hidden) transaction database Du
Each rule X → Y is associated with its user support and user confidence:

suppu(X → Y) = #{t ∈ Du | X ∪ Y ⊆ t} / #Du

confu(X → Y) = #{t ∈ Du | X ∪ Y ⊆ t} / #{t ∈ Du | X ⊆ t}
Significant rules
Significant rules are those whose overall support and confidence are above specified thresholds θs, θc
Overall support and confidence defined as the mean user support and confidence:

supp(r) = avg_{u ∈ U} suppu(r)        conf(r) = avg_{u ∈ U} confu(r)
Goal: finding significant rules while asking as few questions to the crowd as possible
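A minimal sketch of this aggregation, with made-up per-user values and assumed thresholds:

```python
# Sketch: overall support and confidence are the means of the per-user
# values, compared against thresholds. All numbers here are made up.
user_supp = {"u1": 3 / 21, "u2": 5 / 21, "u3": 1 / 21}
user_conf = {"u1": 3 / 7, "u2": 5 / 7, "u3": 1 / 3}

overall_supp = sum(user_supp.values()) / len(user_supp)
overall_conf = sum(user_conf.values()) / len(user_conf)

theta_s, theta_c = 0.1, 0.4   # assumed thresholds
significant = overall_supp >= theta_s and overall_conf >= theta_c
print(overall_supp, overall_conf, significant)
```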
Questions to the crowd
Two kinds of questions:
Closed questions (X → Y?): ask a user for her (approximate) support and confidence for this rule;
Open questions (? → ?): ask a user for one arbitrary rule and its (approximate) support and confidence.
Users will not be precise, but that's fine.
Example (Morning → Jogging)
“How often do you go jogging in the morning?”
“I go jogging three times per week in the morning.”
confu(Morning → Jogging) = 3/7    suppu(Morning → Jogging) = 3/21
(if there is one transaction for each morning, afternoon, evening)
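The conversion in this example can be sketched as follows (assuming, as above, three transactions per day over a week):

```python
# Sketch of the conversion above: one transaction per morning, afternoon,
# and evening, i.e. 21 transactions per week, 7 of which are mornings.
def answer_to_supp_conf(times_per_week, periods_per_day=3, days=7):
    transactions = periods_per_day * days    # 21 transactions in a week
    supp = times_per_week / transactions     # jogging mornings over all transactions
    conf = times_per_week / days             # jogging mornings over all mornings
    return supp, conf

s, c = answer_to_supp_conf(3)
print(s, c)  # 3/21 and 3/7
```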
Algorithm components
[Flowchart: Choose the next question → Open or closed question? → Choose the next closed question → Choose candidate rules → Rank the rules by grade → Estimate next error / Estimate current error → estimate mean distribution / estimate sample distribution / estimate rule significance]

One general framework for crowd mining
One particular choice of implementation of all black boxes
We do not claim any optimality
But we validate by experiments
Estimating distributions
Attention: support and confidence are correlated, we need to consider bivariate distributions!
Central limit theorem: the sample distribution of (confidence, support) pairs for a rule is normally distributed
Hypothesis: the distribution of (confidence, support) values for rules among the whole set of users is normally distributed
The sample mean μ and covariance matrix Σ are unbiased estimators of those of the original distribution
[Scatter plot: (support, confidence) samples for one rule; both axes range from 0.0 to 1.0]
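Estimating μ and Σ from per-user samples is short with NumPy (the sample values below are made up):

```python
import numpy as np

# Sketch: estimate the mean and covariance of the (support, confidence)
# distribution from per-user answers. Sample values are made up.
samples = np.array([
    [0.10, 0.40],   # (support, confidence) reported by one user
    [0.15, 0.55],
    [0.12, 0.45],
    [0.20, 0.60],
])
mu = samples.mean(axis=0)               # sample mean
sigma = np.cov(samples, rowvar=False)   # sample covariance (ddof=1, unbiased)
print(mu)
print(sigma)
```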
Estimating rule significance
A rule is significant if:

∫_{θs}^{1} ∫_{θc}^{1} N_{μ, Σ/K}(c, s) dc ds > 0.5

μ, Σ are the sample mean and covariance matrix
K is the number of samples
N is the bivariate normal density
Efficient algorithms [Genz, 2004] for numerical integration ofbivariate normal distributions.
The current error probability on rule significance is simply thedistance of this integral to 0 or 1
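A sketch of this significance test with SciPy, using inclusion-exclusion on CDFs rather than a dedicated rectangle-probability routine like [Genz, 2004]; all numeric parameters are assumptions for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Sketch: probability that the (support, confidence) mean exceeds the
# thresholds (theta_s, theta_c) under N(mu, Sigma/K). Numbers made up.
mu = np.array([0.30, 0.60])                       # sample mean
Sigma = np.array([[0.02, 0.01], [0.01, 0.03]])    # sample covariance
K = 25                                            # number of samples
cov = Sigma / K
theta_s, theta_c = 0.25, 0.50

# Upper-tail rectangle via inclusion-exclusion:
# P(S > a, C > b) = 1 - F_S(a) - F_C(b) + F(a, b)
F_joint = multivariate_normal(mu, cov).cdf([theta_s, theta_c])
F_s = norm(mu[0], np.sqrt(cov[0, 0])).cdf(theta_s)
F_c = norm(mu[1], np.sqrt(cov[1, 1])).cdf(theta_c)
p_significant = 1 - F_s - F_c + F_joint
error_prob = min(p_significant, 1 - p_significant)
print(p_significant, error_prob)
```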
Estimating next error
The current distribution N(μ, Σ) for a rule can be used as an estimator of what the next answer would be
We sample according to N(μ, Σ), recompute rule significance and error probabilities, and deduce from that the next error probability in this particular case
[Two scatter plots of (support, confidence) samples, axes from 0.0 to 1.0: the current distribution, and the distribution after one additional sampled answer]
By averaging over all samples, we obtain an estimate of the nexterror probability
The difference between next error and current error (expectederror reduction) is an estimate of how much we gain by asking aquestion on this rule!
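A Monte Carlo sketch of this estimate, under stated assumptions: thresholds, μ, Σ, and K are made up, and only the sample mean is updated with the simulated answer (the covariance update is omitted for simplicity):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Sketch: simulate the next answer from N(mu, Sigma), fold it into the
# sample mean, recompute the error, and average over many simulations.
rng = np.random.default_rng(0)
theta = (0.25, 0.50)                              # assumed thresholds

def significance(mu, Sigma, K):
    cov = Sigma / K
    F = multivariate_normal(mu, cov).cdf(list(theta))
    Fs = norm(mu[0], np.sqrt(cov[0, 0])).cdf(theta[0])
    Fc = norm(mu[1], np.sqrt(cov[1, 1])).cdf(theta[1])
    return 1 - Fs - Fc + F                        # P(rule is significant)

def error(p):
    return min(p, 1 - p)                          # distance to a sure answer

mu = np.array([0.28, 0.55])
Sigma = np.array([[0.02, 0.01], [0.01, 0.03]])
K = 10
current_error = error(significance(mu, Sigma, K))

next_errors = []
for _ in range(200):
    answer = rng.multivariate_normal(mu, Sigma)   # simulated next answer
    new_mu = (K * mu + answer) / (K + 1)          # updated sample mean only
    next_errors.append(error(significance(new_mu, Sigma, K + 1)))

expected_reduction = current_error - np.mean(next_errors)
print(current_error, expected_reduction)
```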
Putting everything together
[Flowchart: same algorithm components as before (choose the next question, open or closed, candidate rules, grades, error estimates)]
Candidate rules are rules of length 1, rules for which we have samples, and rules for which subrules are significant (analogous to Apriori [Agrawal et al., 1994])
The grade of a rule is the expected error reduction when known, an estimate based on subrules otherwise
We decide between closed or open by flipping a coin (exploitation vs. exploration)
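The candidate-rule step can be sketched as follows; this is a simplification for illustration, not the paper's exact algorithm (only the consequent is grown, and the rule sets are made up):

```python
from itertools import combinations

# Hedged sketch: keep all rules with one item on each side, and grow a
# rule's consequent only when every immediate subrule is significant,
# in the spirit of Apriori.
items = ["morning", "jogging", "coffee"]
significant = {
    (frozenset({"morning"}), frozenset({"jogging"})),
    (frozenset({"morning"}), frozenset({"coffee"})),
}

def candidates(items, significant):
    cands = set()
    for x in items:                       # all length-1 -> length-1 rules
        for y in items:
            if x != y:
                cands.add((frozenset({x}), frozenset({y})))
    # Merge two significant rules sharing an antecedent into a larger one
    for (X1, Y1), (X2, Y2) in combinations(significant, 2):
        if X1 == X2 and len(Y1) == len(Y2):
            merged_Y = Y1 | Y2
            subrules = {(X1, merged_Y - {y}) for y in merged_Y}
            if subrules <= significant:   # Apriori-style pruning
                cands.add((X1, merged_Y))
    return cands

cands = candidates(items, significant)
print((frozenset({"morning"}), frozenset({"jogging", "coffee"})) in cands)  # True
```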
Architecture
[Architecture diagram: Portal User Interface with Question Display; Query Selector (produces rule + query), Data Aggregator (stores [rule, conf, supp] answers), Best Rules Extractor (returns results for user queries); Rule Database; Initial Data]
Outcome
Rank   Change   Rule                              Support   Conf.   Error Prob.
1      +1       Morning → Jogging                 0.087     0.61    0.521e-11
2      -1       Jogging → Energy Drink, Granola   0.085     0.5     0.66e-8
3               Morning → Coffee                  0.067     0.52    0.54e-7
…      …        …                                 …         …       …
1752   -8       Upset Stomach → Chamomile         0.032     0.05    0.03
1753            Vegetarian, Yoga → Raw Foods      0.009     0.047   0.012
…      …        …                                 …         …       …
Datasets
We experimented on several datasets:
Real-world Retail dataset [Brijs et al., 1999] from a shopping basket application; since the data is anonymized, users are assigned transactions in a random fashion
Edits on categories in Simple English Wikipedia: transactions are articles, items are high-level categories (WordNet-level classes of YAGO [Suchanek et al., 2007]) assigned to articles, users are editors of these articles
Synthetic dataset (not discussed here)
Experimental setting
Baselines:
Random: at each step, we choose a random rule to ask a user about
Greedy: ask about the known rule with the fewest samples (starting with smaller rules)

Settings:
Zero-knowledge: we start with no information about the world
Known items: the set of items is known, no information about rules
Rule refinement: already know some rules (not discussed here)

We evaluate in terms of precision, recall, F-measure of predicted significant rules, as well as absolute number of errors (not discussed here)
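The evaluation metrics can be sketched as follows (rule names are placeholders):

```python
# Sketch: precision, recall and F-measure of the predicted significant
# rules against the truly significant ones.
def prf(predicted, true):
    tp = len(predicted & true)                    # correctly predicted rules
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(true) if true else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

predicted = {"A->B", "A->C", "B->C"}
true = {"A->B", "B->C", "C->D"}
print(prf(predicted, true))  # precision = recall = F = 2/3
```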
F-measure, zero-knowledge
[Two plots of F-measure (y-axis) vs. Number of Samples (x-axis, 0 to 2000), comparing CrowdMiner, Random, and Greedy: left, Retail dataset (y-axis up to 0.9); right, Wikipedia dataset (y-axis up to 0.5)]
Precision and recall, zero-knowledge
[Two plots on the Retail dataset, comparing CrowdMiner, Random, and Greedy: Precision vs. Number of Samples (0 to 2000), and Recall vs. Number of Samples (0 to 2000)]
Better precision: we make sure to reduce the global expected number of errors; Greedy loses precision as new rules are explored
Much better recall: due to adding potentially large rules as candidates once candidate subrules are found (Greedy will only add such rules much later)
F-measure, known items
[Plot on the Retail dataset: F-measure vs. Number of Samples (0 to 2000) for CrowdMiner, Random, and Greedy]
Good initial precision of the greedy algorithm: the best thing to do is to start by asking about rules of small size anyway
CrowdMiner overtakes Greedy: larger rules are soon made candidate and their significance assessed
In brief
How to design an interactive poll? Many situations where one wants to find correlations in non-extensionally accessible data
“Crowd-sourced” Apriori (but with subtleties)
Good behavior in practice
Many other design choices for replacing black boxes, especially inthe presence of priors
Connections with active learning [Lindenbaum et al., 2004]
Perspectives
What are the best next k questions to ask? Allows parallelization. Also possible to do that by sampling, and not significantly more costly!
Take into account correlations between rules to refine estimates
Which user to ask which question?
References I
R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD Record, 22(2), 1993.
R. Agrawal, R. Srikant, et al. Fast algorithms for mining association rules. In VLDB, 1994.
T. Brijs, G. Swinnen, K. Vanhoof, and G. Wets. Using association rules for product assortment decisions: A case study. In Knowledge Discovery and Data Mining, 1999.
A. Genz. Numerical computation of rectangular bivariate and trivariate normal and t probabilities. Statistics and Computing, 14, 2004.
References II
M. Lindenbaum, S. Markovitch, and D. Rusakov. Selective sampling for nearest neighbor classifiers. Machine Learning, 54(2), 2004.
A. G. Parameswaran and N. Polyzotis. Answering queries using humans, algorithms and databases. In CIDR, 2011.
F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge unifying WordNet and Wikipedia. In WWW, 2007.