ACM SIGKDD Innovation Award
Raghu Ramakrishnan
Mirrors and Crystal Balls: A Personal Perspective on Data Mining
Outline
• This award recognizes the work of many people, and I represent the many
– A warp-speed tour of some earlier work
• What’s a data mining talk without predictions?
– Some exciting directions for data mining that we’re working on at Yahoo!
A Look in the Mirror …
(and the faces I found there; unfortunately, I couldn’t find photos for some people)
(and apologies in advance for not discussing the related work that provided context and, often, tools and motivation)
1987 2007
Sequences, Streams
• SEQ
– Sequence Data Processing. P. Seshadri, M. Livny, and R. Ramakrishnan. SIGMOD 1994
– SEQ: A Model for Sequence Databases. P. Seshadri, M. Livny, and R. Ramakrishnan. ICDE 1995
– The Design and Implementation of a Sequence Database System. P. Seshadri, M. Livny, and R. Ramakrishnan. VLDB 1996
• SRQL
– SRQL: Sorted Relational Query Language. R. Ramakrishnan, D. Donjerkovic, A. Ranganathan, K. Beyer, and M. Krishnaprasad. SSDBM 1998
Scalable Clustering
• BIRCH
– BIRCH: A Clustering Algorithm for Large Multidimensional Datasets. T. Zhang, R. Ramakrishnan, and M. Livny. SIGMOD 1996
– Fast Density Estimation Using CF-Kernels. T. Zhang, R. Ramakrishnan, and M. Livny. KDD 1999
– Clustering Large Databases in Arbitrary Metric Spaces. V. Ganti, R. Ramakrishnan, J. Gehrke, A. Powell, and J. French. ICDE 1999
• Clustering Categorical Data
– CACTUS: A Scalable Clustering Algorithm for Categorical Data. V. Ganti, J. Gehrke, and R. Ramakrishnan. KDD 1999
Scalable Decision Trees
• RainForest
– RainForest: A Framework for Fast Decision Tree Construction of Large Datasets. J. Gehrke, R. Ramakrishnan, and V. Ganti. VLDB 1998
• BOAT
– BOAT: Optimistic Decision Tree Construction. J. Gehrke, V. Ganti, R. Ramakrishnan, and W-Y. Loh. SIGMOD 1999
Streaming and Evolving Data, Incremental Mining
• FOCUS
– FOCUS: A Framework for Measuring Changes in Data Characteristics. V. Ganti, J. Gehrke, R. Ramakrishnan, and W-Y. Loh. PODS 1999
• DEMON
– DEMON: Mining and Monitoring Evolving Data. V. Ganti, J. Gehrke, and R. Ramakrishnan. ICDE 1999
Mass Collaboration
• The QUIQ Engine: A Hybrid IR-DB System. N. Kabra, R. Ramakrishnan, and V. Ercegovac. ICDE 2003
• Mass Collaboration: A Case Study. R. Ramakrishnan, A. Baptist, V. Ercegovac, M. Hanselman, N. Kabra, A. Marathe, U. Shaft. IDEAS 2004
[Diagram: a customer's QUESTION goes to SELF SERVICE, which draws on the KNOWLEDGEBASE; ANSWERs come from support agents, partner experts, customer champions, and employees; each answer is added to the knowledgebase to power self service]
OLAP, Hierarchies, and Exploratory Mining
• Prediction Cubes. B-C. Chen, L. Chen, Y. Lin, R. Ramakrishnan. VLDB 2005
• Bellwether Analysis: Predicting Global Aggregates from Local Regions. B-C. Chen, R. Ramakrishnan, J.W. Shavlik, P. Tamma. VLDB 2006
Hierarchies Redux
• OLAP Over Uncertain and Imprecise Data. D. Burdick, P. Deshpande, T.S. Jayram, R. Ramakrishnan, S. Vaithyanathan. VLDB 2005
• Efficient Allocation Algorithms for OLAP Over Imprecise Data. D. Burdick, P.M. Deshpande, T. S. Jayram, R. Ramakrishnan, S. Vaithyanathan.
• Learning from Aggregate Views. B-C. Chen, L. Chen, D. Musicant, and R. Ramakrishnan. ICDE 2006
• Mondrian: Multidimensional K-Anonymity. K. LeFevre, D.J. DeWitt, R. Ramakrishnan. ICDE 2006
• Workload-Aware Anonymization. K. LeFevre, D.J. DeWitt, R. Ramakrishnan. KDD 2006
• Privacy Skyline: Privacy with Multidimensional Adversarial Knowledge. B-C. Chen, R. Ramakrishnan, K. LeFevre. VLDB 2007
• Composite Subset Measures. L. Chen, R. Ramakrishnan, P. Barford, B-C. Chen, V. Yegneswaran. VLDB 2006
Many Other Connections …
• Scalable Inference
– Optimizing MPF Queries: Decision Support and Probabilistic Inference. H. Corrada Bravo, R. Ramakrishnan. SIGMOD 2007
• Relational Learning
– View Learning for Statistical Relational Learning, with an Application to Mammography. J. Davis, E.S. Burnside, I. Dutra, D. Page, R. Ramakrishnan, V. Santos Costa, J.W. Shavlik.
Community Information Management
• Efficient Information Extraction over Evolving Text Data. F. Chen, A. Doan, J. Yang, R. Ramakrishnan. ICDE 2008
• Toward Best-Effort Information Extraction. W. Shen, P. DeRose, R. McCann, A. Doan, R. Ramakrishnan. SIGMOD 2008
• Declarative Information Extraction Using Datalog with Embedded Extraction Predicates. W. Shen, A. Doan, J.F. Naughton, R. Ramakrishnan. VLDB 2007
• Source-aware Entity Matching: A Compositional Approach. W. Shen, P. DeRose, L. Vu, A. Doan, R. Ramakrishnan. ICDE 2007
… Through the Looking Glass
Prediction is very hard, especially about the future. Yogi Berra
Information Extraction
… and the challenge of managing it
DBLife
Integrated information about a (focused) real-world community
Collaboratively built and maintained by the community
CIMple software package
Search Results of the Future
[Figure: example result sources: babycenter, epicurious, yelp.com, answers.com, webmd, Gawker, New York Times]
(Slide courtesy Andrew Tomkins)
Opening Up Yahoo! Search
Phase 1: Giving site owners and developers control over the appearance of Yahoo! Search results.
Phase 2: BOSS takes Yahoo!’s open strategy to the next level by providing Yahoo! Search infrastructure and technology to developers and companies to help them build their own search experiences.
(Slide courtesy Prabhakar Raghavan)
Custom Search Experiences
Social Search
Vertical Search
Visual Search
(Slide courtesy Prabhakar Raghavan)
Economics of IE
• Data $, Supervision $
– The cost of supervision, especially large, high-quality training sets, is high
– By comparison, the cost of data is low
• Therefore
– Rapid training set construction/active learning techniques
– Tolerance for low- (or low-quality) supervision
– Take feedback and iterate rapidly
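One common way to spend a small labeling budget is uncertainty sampling, a standard active learning heuristic. The sketch below is illustrative only and is not taken from the talk: the logistic model, its weights, and the example pool are all made-up assumptions.

```python
import math

def predict_proba(weights, bias, x):
    """Logistic model: probability that feature vector x is a positive example."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def select_for_labeling(weights, bias, unlabeled, budget):
    """Return indices of the `budget` unlabeled examples the model is least sure about."""
    # Uncertainty is measured as closeness of the predicted probability to 0.5.
    scored = [(abs(predict_proba(weights, bias, x) - 0.5), i)
              for i, x in enumerate(unlabeled)]
    scored.sort()
    return [i for _, i in scored[:budget]]

# Hypothetical pool of unlabeled feature vectors and a hypothetical current model.
pool = [[3.0, 1.0], [0.1, -0.1], [-2.5, 0.5], [0.2, 0.1]]
picked = select_for_labeling([1.0, 1.0], 0.0, pool, budget=2)
print(picked)
```

Only the selected examples are sent to a human for labeling; the model is then retrained and the loop repeats, which is one way to "take feedback and iterate rapidly".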
Example: Accepted Papers
• Every conference comes with a slightly different format for accepted papers
– We want to extract accepted papers directly (before they make their way into DBLP etc.)
• Assume
– Lots of background knowledge (e.g., DBLP from last year)
– No supervision on the target page
• What can you do?
Down the Page a Bit
Record Identification
• To get started, we need to identify records
– Hey, we could write an XPath, no?
– So, what if no supervision is allowed?
• Given a crude classifier for paper records, can we recursively split up this page?
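A minimal sketch of the recursive-splitting idea, under strong assumptions: the page is already parsed into a tree, and the "crude classifier" is a hypothetical stand-in that counts the substring " by " as one title/author pair. A real classifier would use background knowledge such as known author names; nothing below is the actual system.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """A page region: its own text plus nested sub-regions."""
    text: str = ""
    children: List["Node"] = field(default_factory=list)

    def full_text(self) -> str:
        parts = [self.text] + [c.full_text() for c in self.children]
        return " ".join(parts).strip()

def crude_paper_count(text: str) -> int:
    """Hypothetical crude classifier: estimate how many paper records the text holds."""
    return text.count(" by ")

def split_records(node: Node) -> List[str]:
    """Recursively split a region until each piece looks like a single record."""
    n = crude_paper_count(node.full_text())
    if n == 0:
        return []                      # no records here (e.g., a heading)
    if n == 1 or not node.children:
        return [node.full_text()]      # looks like one record, or cannot split further
    records = []
    for child in node.children:
        records.extend(split_records(child))
    return records

page = Node(children=[
    Node(text="Accepted Papers"),
    Node(children=[
        Node(text="Mining Streams by A. Author"),
        Node(text="Fast Clustering by B. Writer and C. Coder"),
    ]),
])
records = split_records(page)
print(records)
```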
First Level Splits
After More Splits …
Now Get the Records
• Goal: To extract fields of individual records
• We need training examples, right?
– But these papers are new
• The best we can do without supervision is noisy labels
– From having seen other such pages
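As a rough illustration of partial, noisy labeling from background knowledge, the sketch below labels comma-separated segments of a record by looking them up in a set of previously seen author names. The name set is a hypothetical stand-in for something like last year's DBLP, and splitting on commas is a simplifying assumption.

```python
# Hypothetical background knowledge: author names seen on earlier pages.
KNOWN_AUTHORS = {"raghu ramakrishnan", "johannes gehrke"}

def noisy_label(record: str):
    """Label each comma-separated segment of a record using background knowledge.

    Segments that match a known author are labeled AUTHOR; everything else stays
    UNKNOWN, so the labels are partial (new authors are missed) and noisy.
    """
    labels = []
    for segment in (s.strip() for s in record.split(",")):
        if not segment:
            continue
        tag = "AUTHOR" if segment.lower() in KNOWN_AUTHORS else "UNKNOWN"
        labels.append((segment, tag))
    return labels

labels = noisy_label("Scalable Mining of Widgets, Raghu Ramakrishnan, New Person")
print(labels)
```

Here "New Person" is an author the background knowledge has never seen, so it stays UNKNOWN; a learner trained on such labels has to tolerate the gaps.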
Partial, Noisy Labels
Extracted Records
Refining Results via Feedback
• Now let’s shift slightly to consider extraction of publications from academic home pages
– Must identify publication sections of faculty home pages, and extract paper citations from them
• Underlying data model for extracted data is
– A flexible graph-based model (similar to RDF or ER conceptual model)
– “Confidence” scores per attribute or relationship
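One possible shape for such a model is sketched below: entities carry attributes, edges carry relationships, and each has its own confidence score. The class names, fields, and example values are assumptions made for illustration, not the actual system's schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Entity:
    kind: str
    # attribute name -> (value, confidence in [0, 1])
    attributes: Dict[str, Tuple[str, float]] = field(default_factory=dict)

@dataclass
class ExtractionGraph:
    entities: Dict[str, Entity] = field(default_factory=dict)
    # (source id, relationship, target id, confidence in [0, 1])
    edges: List[Tuple[str, str, str, float]] = field(default_factory=list)

    def low_confidence(self, threshold: float) -> List[tuple]:
        """List attributes and relationships whose confidence is below threshold."""
        flagged = []
        for eid, entity in self.entities.items():
            for name, (_, conf) in entity.attributes.items():
                if conf < threshold:
                    flagged.append(("attribute", eid, name))
        for src, rel, dst, conf in self.edges:
            if conf < threshold:
                flagged.append(("edge", src, rel, dst))
        return flagged

graph = ExtractionGraph()
graph.entities["person1"] = Entity("Person", {"name": ("A. Professor", 0.9)})
graph.entities["pub1"] = Entity("Publication", {"title": ("Scalable Widgets", 0.95)})
graph.entities["pub2"] = Entity("Publication", {"title": ("Office hours: Tuesday", 0.3)})
graph.edges.append(("person1", "authorOf", "pub1", 0.9))
graph.edges.append(("person1", "authorOf", "pub2", 0.35))
suspects = graph.low_confidence(0.5)
print(suspects)
```

Keeping confidence at the attribute and relationship level lets a reviewer focus on the dubious pieces rather than re-checking whole records.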
Extracted Publication Titles
A Dubious Extracted Publication…
PSOX provides declarative lineage tracking over operator executions
Where’s the Problem?
Use lineage to find the source of the problem.
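The sketch below shows the general idea of walking lineage backwards from a dubious output to the raw source it came from. It is not PSOX's interface: the lineage table, the operator names, and the item identifiers are all hypothetical.

```python
# Hypothetical lineage table: output item -> (operator that produced it, input items).
LINEAGE = {
    "publication:17": ("extract_citation", ["section:4"]),
    "section:4": ("classify_section", ["page:home"]),
}

def trace_lineage(item, lineage):
    """Walk back from an extracted item to the operators and raw sources behind it.

    Assumes the lineage forms a DAG (no cycles).
    """
    steps, sources, frontier = [], [], [item]
    while frontier:
        current = frontier.pop()
        if current in lineage:
            operator, inputs = lineage[current]
            steps.append((current, operator))
            frontier.extend(inputs)
        else:
            sources.append(current)   # nothing produced it, so it is a raw input
    return steps, sources

steps, sources = trace_lineage("publication:17", LINEAGE)
print(steps)
print(sources)
```

Each step names an operator execution that could be at fault, so a user can inspect the section classifier's decision before blaming the citation extractor.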
Source Page
Hmm, not a publication page (but may have looked like one to a classifier)
Feedback
User corrects the classification of that section.
Faculty or Student?
• NLP
• Build a Classifier
• Or…
…Another Clue…
…Stepping Back…
[Diagram: a Student-List and a Prof-List, with Student and Prof entities connected by AdvisorOf links]
• Leads to large-scale, partially-labeled relational learning
• Involving different types of entities and links
Maximizing the Value of What You Select to Show Users
p1 p2 p3
Content Optimization
• PROBLEM: Match-making between content, user, context
– Content
  • Programmed (e.g., editors); Acquired (e.g., RSS feeds, UGC)
– User
  • Individual (e.g., b-cookie), or user segment
– Context
  • E.g., Y! or non-Y! property; device; time period
• APPROACH: Scalable algorithms that select content to show, using editorially determined content mixes, and respecting editorially set constraints and policies.
Team from Y! Research
Raghu Ramakrishnan
Deepak Agarwal
Pradheep Elango
Seung-Taek Park
Wei Chu
Bee-Chung Chen
Team from Y! Engineering
Scott Roy
Nitin Motgi
Joe Zachariah
Kenneth Fox
Todd Beaupre
Yahoo! Home Page Featured Box
• It is the top-center part of the Y! Front Page
• It has four tabs: Featured, Entertainment, Sports, and Video
Traditional Role of Editors
• Strict quality control
– Preserve “Yahoo! Voice”
  • E.g., typical mix of content
– Community standards
– Quality guidelines
  • E.g., Topical articles shown for limited time
• Program articles periodically
– New ones pushed, old ones taken out
• Few tens of unique articles per day
– 16 articles at any given time; editors keep up with novel articles and remove fading ones
– Choose which articles appear in which tabs
Content Optimization Approach
• Editors continue to determine content sources, program some content, determine policies to ensure quality, and specify business constraints
– But we use a statistically based machine learning algorithm to determine what articles to show where when a user visits the FP
Modeling Approach
• Pure feature based (did not work well):
– Article: URL, keywords, categories
– Build offline models to predict CTR when article shown to users
– Models considered
  • Logistic Regression with feature selection
  • Decision Trees, Feature segments through clustering
• Track CTR per article in user segments through online models
– This worked well; the approach we took eventually
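A minimal sketch of tracking CTR per article within a user segment, updated online as views and clicks arrive. The prior counts, the discount factor, and the segment and article names are illustrative assumptions, not the models used in the deployed system.

```python
from collections import defaultdict

class SegmentCTRTracker:
    """Online click-through-rate estimates keyed by (article, user segment)."""

    def __init__(self, prior_clicks=1.0, prior_views=20.0, discount=0.99):
        # Prior pseudo-counts keep early estimates from swinging wildly;
        # discount < 1 gradually forgets old traffic.
        self.discount = discount
        self.stats = defaultdict(lambda: [prior_clicks, prior_views])

    def update(self, article, segment, views, clicks):
        s = self.stats[(article, segment)]
        s[0] = self.discount * s[0] + clicks
        s[1] = self.discount * s[1] + views

    def ctr(self, article, segment):
        clicks, views = self.stats[(article, segment)]
        return clicks / views

    def best(self, articles, segment):
        """Article with the highest estimated CTR for this segment."""
        return max(articles, key=lambda a: self.ctr(a, segment))

tracker = SegmentCTRTracker(discount=1.0)
tracker.update("article_a", "segment_1", views=80, clicks=9)
tracker.update("article_b", "segment_1", views=80, clicks=3)
print(tracker.best(["article_a", "article_b"], "segment_1"))
```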
Challenges
• Non-stationary CTR
• To ensure webpage stability, we show the same article until we find a better one
– CTR decays over time; sharply at F1
– Time-of-day and day-of-week effects in CTR
Modeling Approach
• Track item scores through dynamic linear models (fast Kalman filter algorithms)
• We model decay explicitly in our models
• We have a global time-of-day curve explicitly in our online models
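To make the idea concrete, here is a scalar Kalman filter step for a simple dynamic linear model of an article's score. The specific form is an assumption for illustration: the latent score decays multiplicatively between intervals, and the observed CTR equals the score scaled by a time-of-day multiplier `tod` plus noise. The parameter values are made up.

```python
def kalman_step(mean, var, observed_ctr, obs_var,
                decay=0.98, process_var=1e-6, tod=1.0):
    """One predict/update step for a scalar dynamic linear model.

    State:       score_t = decay * score_{t-1} + process noise
    Observation: observed_ctr_t = tod * score_t + observation noise
    """
    # Predict: apply the explicit decay and inflate the variance.
    pred_mean = decay * mean
    pred_var = decay * decay * var + process_var
    # Update: correct the prediction using the new observation.
    innovation = observed_ctr - tod * pred_mean
    innovation_var = tod * tod * pred_var + obs_var
    gain = pred_var * tod / innovation_var
    new_mean = pred_mean + gain * innovation
    new_var = (1.0 - gain * tod) * pred_var
    return new_mean, new_var

# Hypothetical stream of observed CTRs for one article over three intervals.
mean, var = 0.05, 0.01
for observed in [0.040, 0.038, 0.035]:
    mean, var = kalman_step(mean, var, observed, obs_var=1e-4)
print(mean, var)
```

Each step costs a handful of arithmetic operations per article, which is what makes this style of tracking attractive under tight latency constraints.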
Explore/Exploit
• What is the best strategy for new articles?
– If we show it and it’s bad: lose clicks
– If we delay and it’s good: lose clicks
• Solution: Show it while we don’t have much data if it looks promising
– Classical multi-armed bandit type problem
– Our setup is different from the ones studied in the literature; new ML problem
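For reference, the classical setting can be illustrated with UCB1, a textbook bandit rule. This is not the method used in the talk, which addresses a different setup; the view and click counts below are made up.

```python
import math

def ucb1_choose(views, clicks):
    """Pick an article index by the classical UCB1 rule.

    views[i] and clicks[i] are the counts observed so far for article i.
    """
    # Any article that has never been shown is tried first.
    for i, v in enumerate(views):
        if v == 0:
            return i
    total = sum(views)

    def score(i):
        # Empirical CTR plus an exploration bonus that shrinks with more views.
        return clicks[i] / views[i] + math.sqrt(2.0 * math.log(total) / views[i])

    return max(range(len(views)), key=score)

# A well-known article versus a barely shown one: the bonus favors exploring.
print(ucb1_choose([1000, 10], [50, 0]))
# With equal exposure, the higher empirical CTR wins.
print(ucb1_choose([1000, 1000], [50, 20]))
```

The rule captures the trade-off on the slide: an article with little data gets the benefit of the doubt until enough views accumulate to judge it.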
Novel Aspects
• Classical: Arms assumed fixed over time
– We gain and lose arms over time
  • Some theoretical work by Whittle in the 80’s; operations research
• Classical: Serving rule updated after each pull
– We compute optimal design in batch mode
• Classical: Generally, CTR assumed stationary
– We have highly dynamic, non-stationary CTRs
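The sketch below illustrates only the mechanics of two of these differences: the set of live articles changes between intervals, and the serving plan is computed once per batch interval rather than after every page view. The allocation rule is a simple made-up heuristic (a fixed exploration share for under-observed articles), not the optimal design referred to above, and all thresholds and counts are assumptions.

```python
def batch_serving_plan(stats, explore_fraction=0.1, min_views=1000):
    """Compute traffic shares for the next interval.

    stats maps each currently live article to (views, clicks). Articles may
    appear or disappear between calls as the pool changes.
    """
    if not stats:
        return {}

    def ctr(article):
        views, clicks = stats[article]
        return clicks / views if views else 0.0

    young = [a for a, (views, _) in stats.items() if views < min_views]
    mature = [a for a in stats if a not in young]
    best = max(mature or list(stats), key=ctr)

    plan = {a: 0.0 for a in stats}
    explore = explore_fraction if young else 0.0
    for a in young:
        plan[a] += explore / len(young)   # spread exploration over new articles
    plan[best] += 1.0 - explore           # remaining traffic goes to the current best
    return plan

# Interval 1: a new article has just entered the pool.
plan1 = batch_serving_plan({"old1": (5000, 250), "old2": (5000, 100), "new": (0, 0)})
# Interval 2: old1 has expired and the new article has gathered enough data.
plan2 = batch_serving_plan({"old2": (6000, 120), "new": (1200, 90)})
print(plan1)
print(plan2)
```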
Some Other Complications
• We run multiple experiments (possibly correlated) simultaneously; effective sample size calculation is a challenge
• Serving bias: Incorrect to learn from data for serving scheme A and apply to serving scheme B
– Need unbiased quality score
– Bias sources: positional effects, time effect, set of articles shown together
• Incorporating feature-based techniques
– Regression style, e.g., logistic regression
– Tree-based (hierarchical bandit)
System Challenges
• Highly dynamic system characteristics:
– Short article lifetimes, pool constantly changing, user population is dynamic, CTRs non-stationary
– Quick adaptation is key to success
• Scalability:
– 1000’s of page views/sec; data collection, model training, article scoring done under tight latency constraints
Results
• We built an experimental infrastructure to test new content serving schemes
– Ran side-by-side experiments on live traffic
– Experiments performed for several months; we consistently outperformed the old system
– Results showed we get more clicks by engaging more users
– Editorial overrides
  • Did not reduce lift numbers substantially
Comparing buckets
Experiments
• Daily CTR Lift relative to editorial serving
Lift is Due to Increased Reach
• Lift in fraction of clicking users
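Lift here can be read as the relative improvement of an experimental bucket over the editorial bucket. The sketch below computes it for CTR and for the fraction of clicking users; the bucket numbers are made up for illustration and are not results from the talk.

```python
def rate(successes, total):
    """Fraction of successes, e.g., clicks per view or clicking users per visitor."""
    return successes / total

def lift(treatment_rate, control_rate):
    """Relative improvement of the treatment bucket over the control bucket."""
    return (treatment_rate - control_rate) / control_rate

# Hypothetical one-day totals for the two buckets.
editorial = {"views": 100_000, "clicks": 4_000, "users": 50_000, "clicking_users": 3_000}
optimized = {"views": 100_000, "clicks": 5_000, "users": 50_000, "clicking_users": 3_600}

ctr_lift = lift(rate(optimized["clicks"], optimized["views"]),
                rate(editorial["clicks"], editorial["views"]))
reach_lift = lift(rate(optimized["clicking_users"], optimized["users"]),
                  rate(editorial["clicking_users"], editorial["users"]))
print(f"CTR lift: {ctr_lift:.1%}, lift in fraction of clicking users: {reach_lift:.1%}")
```

Comparing the two numbers shows how much of a CTR gain comes from reaching more users rather than from more clicks by the same users.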
Related Work
• Amazon, Netflix, Y! Music, etc.:
– Collaborative filtering with large content pool
– Achieve lift by eliminating bad articles
– We have a small number of high quality articles
• Search, Advertising
– Matching problem with large content pool
– Match through feature based models
Summary of Approach
• Offline models to initialize online models
• Online models to track performance
• Explore/exploit to converge fast
• Study user visit patterns and behavior; program content accordingly
Summary
• There are some exciting “grand challenge” problems that will require us to bring to bear ideas from data management, statistics, learning, and optimization
– i.e., data mining problems!
• Our field is too young to think about growing old, but the best is yet to be …