Ronny Kohavi, Product Unit Manager, Microsoft
Joint work with Llew Mason, Rajesh Parekh, Zijian Zheng
Machine Learning, vol. 57, 2004
Focus the Mining Beacon: Lessons and Challenges from the World of E-Commerce
ECML and PKDD Oct 3rd, 2005
Talk (and ML paper) available at http://www.kohavi.com
Ronny Kohavi, Microsoft
2
Solar Eclipse Today
I’d like to thank the organizers for arranging the conference on Oct 3rd in Porto! The Sun was obscured 89.7% here.
A few pictures I took from my hotel room with the Sky & Space glasses provided
Overview
Background/experience
E-commerce: great domain for data mining
Business lessons and Simpson’s paradox
Technical lessons
Challenges
Q&A
Background (I)
1993-1995: Led development of MLC++, the Machine Learning Library in C++ (Stanford University). Implemented or interfaced many ML algorithms. Source code is public domain, used for algorithm comparisons.
1995-1998: Developed and managed MineSet. MineSet™ was a “horizontal” data mining and visualization product at Silicon Graphics, Inc. (SGI). Utilized MLC++. Now owned by Purple Insight.
Key insight: customers want simple stuff: Naïve Bayes + Viz.
ICML 1998 keynote: claimed that to be successful, data mining needs to be part of a complete solution in a vertical market. I followed this vision to Blue Martini Software.
A consultant is someone who
• borrows your razor,
• charges you by the hour,
• learns to shave on your face
Background (II)
1998-2003: Director of Data Mining, then VP of Business Intelligence at Blue Martini Software. Developed an end-to-end e-commerce platform with integrated business intelligence from collection, extract-transform-load (ETL) to data warehouse, reporting, mining, visualizations. Analyzed data from over 20 clients.
Key insight: collection and ETL worked great. Found many insights. However, customers mostly just ran the reports/analyses we provided.
2003-2005: Director, Data Mining and Personalization, Amazon.
Key insights: (i) simple things work, and (ii) human insight is key.
Recently moved to Microsoft. Building a platform utilizing machine learning and user feedback to improve interactions. Shameless plug: we are hiring.
Ingredients for Successful Data Mining
Large amount of data (many records)
Rich data with many attributes (wide records)
Clean data / reliable collection (avoid GIGO)
Actionable domain (have real-world impact, experiment)
Measurable return-on-investment (did the recipe help)
E-commerce has all the right ingredients.
If you are choosing to work in a domain, make sure it has these ingredients.
Business-level Lessons (I)
Auto-creation of the data warehouse worked very well. At Blue Martini we owned the operational side as well as the analysis; we had a ‘DSSGen’ process that auto-generated a star-schema data warehouse. For example, if a new customer attribute was added at the operational side, it automatically became available in the data warehouse.
Clients are reluctant to list specific questions. Conduct an interim meeting with basic findings: clients often came up with a long list of questions when faced with basic statistics about their data.
Business-level Lessons (II)
Collect business-level data from the operational side. Many things are not observable in weblogs (search information, shopping cart events, registration forms, time to return results). Log at the app-server.
External events: marketing promotions, advertisements, site changes.
Choose to collect as much data as you realistically can, because you do not know what might be relevant for a future question. Discoveries that contradict our prior thinking are usually the most interesting.
How Priors Fail Us
We tend to interpret the picture to the left as a serious problem
We are not Used to Seeing Pacifiers with Teeth
Collection example – Form Errors
Here is a good example of data collection that we introduced without knowing a priori whether it would help: form errors.
If a web form was filled in and a field did not pass validation, we logged the field and the value filled in.
This was the Bluefly home page when they went live.
Looking at form errors, we saw thousands of errors every day on this page.
Any guesses?
Business-level Lessons (III)
Crawl, Walk, Run: do basic reporting first, generate univariate statistics, then use OLAP for hypothesis testing, and only then start asking characterization questions and using data mining algorithms.
Agree on terminology. What is the difference between a visit and a session? How do you define a customer (e.g., did every customer purchase)? How is “top seller” defined when showing best sellers?
Why are lists from Amazon (left) and Barnes & Noble (right) so different? The answer: no agreed-upon definition of sales rank.
Twyman’s Law
Any statistic that appears interesting is almost certainly a mistake
Validate “amazing” discoveries in different ways. They are usually the result of a business process.
5% of customers were born on the same day
o 11/11/11 is the easiest way to satisfy the mandatory birth date field
For US web sites, there will be a small sales spike later this month, on Oct 30, 2005
o Hint: between 1-2AM, sales will approximately double relative to the prior week
o Due to daylight saving ending, after 1:59AM DST comes 1:00AM standard time, so there are two actual hours from 1AM to 2AM
Twyman’s Law (II)
KDD Cup 2000: customers who were willing to receive e-mail correlated with heavy spenders (target variable)
o Default for the registration question was changed from “yes” to “no” on 2/28
o When it was realized that nobody was opting in, the default was changed back
o This coincided with a $10 discount off every purchase
o Lots of participants found this spurious correlation, but it was terrible for predictions on the test set
Sites go through phases (launches) and multiple things change together
[Chart: percentage of customers by date (2/1 to 3/28); y-axis 0%-100%; series: Heavy Spenders, Accepts Email]
Simpson’s Paradox
Every talk (hopefully) has a few key points to take away. Simpson’s paradox is one key takeaway from this talk. Lack of awareness of the phenomenon can lead to mistaken conclusions. Unlike esoteric brain teasers, it happens in real life.
Flow for the next few slides: examples that most of you might think are “impossible”; explanation of why they are possible and do happen; implications/warning.
Example 1: Paper reviews
Ann and Bob are paper reviewers for conferences. They participate in two review cycles, C1 and C2 (e.g., two conferences). Both reviewed the same number of papers in total.
Ann accepted 55%, Bob accepted 35% (stricter). Who is the stricter reviewer?
Adapted from Wikipedia: Simpson’s paradox
It appears to be Bob, but it’s possible to show that there are cases where Ann is stricter in both cycles. Specifically:
For C1, Ann is stricter
o Ann accepted 60% of papers (stricter), Bob accepted 90% of papers
For C2, Ann is stricter
o Ann accepted 10% of papers (stricter), Bob accepted 30% of papers
Example 2: Drug Treatment
Real-life example for kidney stone treatments. Overall success rates: Treatment A succeeded 78%, Treatment B succeeded 83% (better).
Further analysis splits the population by stone size.
For small stones: Treatment A succeeded 93% (better), Treatment B succeeded 87%.
For large stones: Treatment A succeeded 73% (better), Treatment B succeeded 69%.
Hence treatment A is better in both cases, yet was worse in total.
A similar real-life example happened when the two population segments were cities.
Adapted from Wikipedia: Simpson’s paradox
Example 3: Sex Bias?
Adapted from real data for UC Berkeley admissions. Women claimed sex discrimination: only 34% of women were accepted, while 44% of men were accepted.
Segmenting by departments to isolate the bias, they found that all departments accept a higher percentage of women applicants than men applicants. (If anything, there is a slight bias in favor of women!)
There is no conflict in the above bullets. It’s possible, and it happened.
Bickel, P. J., Hammel, E. A., and O’Connell, J. W. (1975). Sex bias in graduate admissions: Data from Berkeley. Science, 187, 398-404.
Example 4: Purchase Channels
Real example from a Blue Martini customer. We plotted the average customer spending for customers purchasing on the web vs. “on the web and offline (POS)” (multi-channel), segmented by number of purchases per customer.
In all segments, multi-channel customers spent less.
However, as shop.org predicted, ignoring the segments, multi-channel customers spent more on average.
“Multichannel customers spend 72% more per year than single channel customers” -- State of Retailing Online, shop.org
[Chart: customer average spending (y-axis 0 to 2000) by number of purchases (1, 2, 3, 4, 5, >5); series: Multi-channel, Web-channel only]
Last Example: Batting Average
Baseball example (for those not familiar with baseball, batting average is the percentage of hits). One player can hit for a higher batting average than another player during the first half of the year, do so again during the second half, but have a lower batting average for the entire year.
Example:
      First Half       Second Half      Total Season
A     4/10   = 0.400   25/100 = 0.250   29/110 = 0.264
B     35/100 = 0.350   2/10   = 0.200   37/110 = 0.336
Key to the “paradox” is that the segmenting variable (e.g., half year) interacts with “success” and with the counts. E.g., “A” was sick and rarely played in the 1st half, then “B” was sick in the 2nd half, but the 1st half was “easier” overall.
Not Really a Paradox, Yet Non-Intuitive
If a/b < A/B and c/d < C/D, it’s still possible that (a+c)/(b+d) > (A+C)/(B+D).
We are essentially dealing with weighted averages when we combine segments.
Here is a simple example with two treatments. Each cell shows Success / Total = Percent Success. T1 is superior in both segment C1 and segment C2, yet loses overall: C1 is “harder” (lower success for both treatments), and T1 gets tested more in C1.
        T1             T2
C1      2/8  = 25%     1/5  = 20%
C2      4/5  = 80%     6/8  = 75%
Both    6/13 = 46%     7/13 = 54%
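The reversal in the table above is pure weighted-average arithmetic; a minimal check in Python, using the same counts:

```python
from fractions import Fraction

# Success/total counts copied from the treatment table above.
t1 = {"C1": (2, 8), "C2": (4, 5)}
t2 = {"C1": (1, 5), "C2": (6, 8)}

def rate(counts):
    s, n = counts
    return Fraction(s, n)

# T1 wins inside each segment...
for seg in ("C1", "C2"):
    assert rate(t1[seg]) > rate(t2[seg])

# ...yet loses once segments are pooled, because the weights differ:
# T1 ran 8 of its 13 trials in the "hard" segment C1; T2 ran only 5 of 13 there.
def pooled(t):
    return Fraction(sum(s for s, _ in t.values()), sum(n for _, n in t.values()))

assert pooled(t1) < pooled(t2)
print(pooled(t1), pooled(t2))  # 6/13 7/13
```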
The Other Examples
Paper reviews: Ann was tougher in general, but she reviewed most of her papers in the “write-only” conference where acceptance is always higher.
Kidney stones: the treatments did not work well against large stones, but treatment A was heavily tested on those.
Sex bias: departments differed in their acceptance rates, and women applied more to departments where such rates were lower.
Web vs. multi-channel: customers that visited often spent more on average, and multi-channel customers visited more.
Key Takeaway
Why is this so important? In knowledge discovery, we state probabilities (correlations) and associate them with causality:
Reviewer Bob is stricter
Treatment T1 works better
Berkeley discriminates against women
We must be careful to check for confounding variables.
Confounding variables may not be ones we are collecting (e.g., latent/hidden).
Controlled Experiments (I)
Controlled experiments (A/B test, or control/treatment) are the gold standard.
Make sure to randomize properly:
You cannot run option A on day 1 and option B on day 2; you have to run them in parallel.
When running in parallel, you cannot randomize based on IP (e.g., load-balancer randomization), because all of AOL’s traffic comes from a few proxy servers.
Every customer must have an equal chance of falling into control or treatment, and must stick to that group.
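One common way to satisfy the last requirement is to hash a stable customer id together with the experiment name; a minimal sketch (the hash choice and 50/50 split are assumptions, not something the talk specifies):

```python
import hashlib

def assign(customer_id: str, experiment: str) -> str:
    """Deterministic 50/50 assignment to control or treatment.

    Hashing (experiment, customer_id) gives every customer an equal chance
    of either group, and the same customer always lands in the same group
    on repeat visits -- unlike IP- or day-based schemes.
    """
    digest = hashlib.sha256(f"{experiment}:{customer_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"

# Sticky: repeat visits land in the same bucket every time.
assert assign("cust-42", "new-checkout") == assign("cust-42", "new-checkout")
```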
Controlled Experiments (II)
Issues with controlled experiments:
Duration: we measure only short-term impact. Hard to assess long-term effects.
Primacy effect: changing navigation in a website may degrade customer experience, even if the new navigation is better.
Multiple experiments: on a large site, you may have multiple experiments running in parallel. Scheduling and QA are complex.
Consistency/contamination: on the web, assignment is usually cookie-based, but people may use multiple computers.
Statistical tests: distributions are far from normal. E.g., 97% of sessions do not purchase, so there is a large mass at zero spending.
Technical Lessons – Cleansing (I)
Auditing data:
Make sure time-series data exists for the whole period. It is very easy to conclude that this week was bad relative to last week because some data is missing (e.g., a collection bug).
Synchronize clocks across all data collection points. In one example, some servers were set to GMT and others to EST, leading to strange anomalies. Even being a few minutes off can cause add-to-carts to appear “prior” to the search.
Technical Lessons – Cleansing (II)
Auditing data (continued):
Remove test data. QA organizations constantly test the system. Make sure the data can be identified and removed from analysis.
Remove robots/bots. 5-40% of e-commerce site traffic is generated by crawlers from search engines and students learning Perl. These significantly skew results unless removed.
Data Processing
Utilize hierarchies. Generalizations are hard to find when there are many attribute values (e.g., every product has a Stock Keeping Unit number). Collapse such attribute values based on hierarchies.
Remember date/time attributes. Date/time attributes are often ignored, but they contain information. Convert them into cyclical attributes, such as hour of day or morning/afternoon/evening, day of week, etc. Compute deltas between such attributes (e.g., ship date minus order date).
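The date/time transformations above can be sketched in a few lines; the bucket boundaries and feature names here are illustrative assumptions, not the talk's exact scheme:

```python
from datetime import datetime

def datetime_features(order: datetime, ship: datetime) -> dict:
    """Derive cyclical and delta attributes from raw timestamps."""
    # Four six-hour buckets: an assumed morning/afternoon/evening split.
    part_of_day = ("night", "morning", "afternoon", "evening")[order.hour // 6]
    return {
        "hour_of_day": order.hour,            # cyclical attribute
        "day_of_week": order.strftime("%A"),  # cyclical attribute
        "part_of_day": part_of_day,
        "days_to_ship": (ship - order).days,  # delta: ship date minus order date
    }

feats = datetime_features(datetime(2005, 10, 3, 14, 30), datetime(2005, 10, 6, 9, 0))
# {'hour_of_day': 14, 'day_of_week': 'Monday', 'part_of_day': 'afternoon', 'days_to_ship': 2}
```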
Analysis / Model Building
Mining at the right granularity level: to answer questions about customers, we must aggregate clickstreams, purchases, and other information to the customer level. Defining the right transformation and creating summary attributes is the key to success.
Phrase the problem to avoid leaks. A leak is an attribute that “gives away” the label. E.g., heavy spenders pay more sales tax (VAT). Phrasing the problem to avoid leaks is a key insight: instead of asking who is a heavy spender, ask which customers migrate from spending a small amount in period 1 to a large amount in period 2.
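The migration reframing can be sketched as label construction; the customer names and spending thresholds below are illustrative assumptions:

```python
# Label customers by movement between periods rather than by total spend,
# so correlates of period-2 spending (like sales tax paid) cannot leak
# into a label drawn from that same period.
spend_p1 = {"alice": 20.0, "bob": 500.0, "carol": 15.0}
spend_p2 = {"alice": 600.0, "bob": 550.0, "carol": 30.0}

LOW, HIGH = 50.0, 300.0  # assumed "small" and "large" spending cutoffs

def migrated(customer: str) -> bool:
    """True if the customer moved from low spend in period 1 to high spend in period 2."""
    return spend_p1[customer] < LOW and spend_p2[customer] > HIGH

labels = {c: migrated(c) for c in spend_p1}
# alice migrated (20 -> 600); bob was already a heavy spender; carol stayed low.
assert labels == {"alice": True, "bob": False, "carol": False}
```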
Data Visualizations
Picking the right visualization is key to seeing patterns.
On the left is traffic by day -- note the weekends (but hard to see patterns).
On the right is a heatmap, showing traffic colored from green to yellow to red, utilizing the cyclical nature of the week (going up in columns). It’s easy to see the weekends, Labor Day on Sept 3, and the effect of Sept 11.
Model Visualizations
When we build models for prediction, it is sometimes important to understand them
For MineSet™, we built visualizations for all models
Here is one: Naïve-Bayes / Evidence model (movie)
UI Tweaks – Feedback in Help
Small UI changes can make a big difference. Example from Microsoft Help: when reading help (from the product or the web), you have the option to give feedback.
Two Variants of Feedback
A B
Feedback A puts everything together, whereas feedback B is two-stage: the question follows the rating.
Feedback A just has 5 stars, whereas B annotates the stars from “Not helpful” to “Very helpful” and makes them lighter.
Which one has a higher response rate?
Feedback B gets more than double the response rate!
Another Feedback Variant
Call this variant C. Which one has a higher response rate, B or C?
C
Feedback C outperforms B by a factor of 3.5!
A Real Technical Lesson:Computing Confidence Intervals
In many situations we need to compute confidence intervals, which are simply estimated as: acc ± z·stdDev
where acc is the estimated mean accuracy, stdDev is the estimated standard deviation, and z is usually 1.96 for a 95% confidence interval.
This fails miserably for small amounts of data. For example: if you see three coin tosses that all come up heads, the confidence interval for the probability of heads would be [1, 1].
Use a more accurate formula that does not require using stdDev
(but still assumes Normality):
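(The formula itself did not survive extraction. The score interval below, obtained by inverting the normal approximation around the observed proportion, fits the description -- it needs no stdDev estimate, still assumes Normality, and gives a sensible answer for three heads in three tosses. Treat it as an assumed reconstruction, not the slide's exact formula.)

```python
import math

def naive_interval(p_hat: float, n: int, z: float = 1.96) -> tuple:
    """acc +/- z*stdDev: collapses to [1, 1] when p_hat = 1."""
    std = math.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - z * std, p_hat + z * std)

def score_interval(p_hat: float, n: int, z: float = 1.96) -> tuple:
    """Wilson-style score interval: assumes Normality, needs no stdDev estimate."""
    center = (p_hat + z * z / (2 * n)) / (1 + z * z / n)
    half = (z / (1 + z * z / n)) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (center - half, center + half)

# Three coin tosses, all heads:
print(naive_interval(1.0, 3))  # (1.0, 1.0) -- useless
print(score_interval(1.0, 3))  # roughly (0.44, 1.0) -- much more reasonable
```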
It’s not used often because it’s more complex, but that’s what computers are for.
See Kohavi, “A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection”, IJCAI-95.
Challenges (I)
Finding a way to map business questions to data transformations. Don Chamberlin wrote on the design of SQL: “What we thought we were doing was making it possible for non-programmers to interact with databases.” The SQL99 standard is now about 1,000 pages, and many operations that are needed for mining are not easy to write in SQL.
Explaining models to users: what are ways to make models more comprehensible? How can association rules be visualized/summarized?
Challenges (II)
Dealing with “slowly changing dimensions”: customer attributes change (people get married, their children grow, and we need to change recommendations). Product attributes change, or products are packaged differently; new editions of books come out.
Supporting hierarchical attributes.
Deploying models: models are built based on constructed attributes in the data warehouse. Translating them back to attributes available at the operational side is an open problem.
For web sites, detecting robots/spiders. Detection is based on heuristics (user-agent, IP, JavaScript).
Challenges (III)
Analyzing and measuring long-term impact of changes: control/treatment experiments give us short-term value. How do we address the long-term impact of changes?
For non-commerce sites, how do we measure user satisfaction? Example: users hit F1 for help in Microsoft Office and execute a series of queries, browsing through documents. How do we measure satisfaction other than through surveys?
Summary
Pick a domain that has the right ingredients. The Web and e-commerce are excellent.
Think about the problem end-to-end, from collection, transformations, reporting, visualizations, and modeling to taking action.
The lessons and challenges are from e-commerce, but they are likely to be applicable in other domains.
Beware of hidden variables when concluding causality. Think about Simpson’s paradox. Conduct control/treatment experiments with proper randomization.
Fun Lessons
For eBay: do not bid on every word in Google’s AdWords
“One accurate measurement is worth a thousand expert opinions” -- Admiral Grace Hopper
Advertising may be described as the science of arresting the human intelligence long enough to get money from it
“Not everything that can be counted counts, and not everything that counts can be counted” -- Albert Einstein
Entropy requires no maintenance
In God we trust. All others must have data
For a copy of the talk and the full paper, visit http://kohavi.com