Semantics on the Web: How Do We Get There?
DESCRIPTION
Semantics on the Web: How Do We Get There? Raghu Ramakrishnan, Chief Scientist for Audience & Cloud Computing, Yahoo!
TRANSCRIPT
NASA CIDU 2008 Keynote
Raghu Ramakrishnan, Chief Scientist for Audience & Cloud Computing
Yahoo!
Semantics on the Web: How Do We Get There?
"Prediction is very hard, especially about the future."
– Yogi Berra
Trends in Search
Search Intent
• Premise:
– People don't want to search
– People want to get tasks done
"I want to book a vacation in Tuscany." (Start → Finish)
(Broder 2002, A Taxonomy of Web Search)
Y! Shortcuts
Google Base
Opening Up Yahoo! Search
Phase 1: Giving site owners and developers control over the appearance of Yahoo! Search results.
Phase 2: BOSS takes Yahoo!'s open strategy to the next level by providing Yahoo! Search infrastructure and technology to developers and companies to help them build their own search experiences.
(Slide courtesy Prabhakar Raghavan)
Search Results of the Future
(Example sources shown: babycenter, epicurious, yelp.com, answers.com, WebMD, Gawker, New York Times)
(Slide courtesy Andrew Tomkins)
Custom Search Experiences
Social Search
Vertical Search
Visual Search
(Slide courtesy Prabhakar Raghavan)
How Does SearchMonkey Work?
1. Site owners/publishers share structured data with Yahoo!: RDF/microformat markup, a DataRSS feed, or web services expose data from the site's database (e.g., Acme.com's), and page extraction covers the site's web pages; all of it feeds the index.
2. Site owners & third-party developers build SearchMonkey apps.
3. Consumers customize their search experience with Enhanced Results.
(Slide courtesy Amit Kumar)
BOSS Offerings
BOSS offers two options for companies and developers, and has partnered with top technology universities to drive search experimentation, innovation and research into next-generation search.
API: A self-service, web services model for developers and start-ups to quickly build and deploy new search experiences.
CUSTOM: Working with 3rd parties to build a more relevant, brand/site-specific web search experience. This option is jointly built by Yahoo! and select partners.
ACADEMIC: Working with the following universities to allow for wide-scale research in the search field:
• University of Illinois Urbana-Champaign
• Carnegie Mellon University
• Stanford University
• Purdue University
• MIT
• Indian Institute of Technology Bombay
• University of Massachusetts
(Slide courtesy Prabhakar Raghavan)
Partner Examples
Aside: Cloud Computing
• On-demand access to enormous datasets and computing power
• Rapid application development
• No maintenance worries for the application developer
Cloud Services
Compute-centric:
• Analytic / batch processing (e.g., Hadoop, SSDS)
• HTC job scheduling (e.g., Condor)
• Virtual machines (e.g., EC2, PoolBoy, Tashi)
Data-centric:
• Serving stores (e.g., Mobstor, MySQL, S3, Sherpa)
• Messaging: ordered? guaranteed delivery? scalability (# msgs, producers, subscribers)?
• Archival / warehouse storage
Google Co-Op
Subscribed Links: query-based direct displays, programmed by a Contributor; users "opt in" by "subscribing" to them. When a query matches a pattern provided by the Contributor, the SERP displays (query-specific) links programmed by the Contributor.
Structured Search Content: Where Do We Get It?
• Semantic Web?
• Three ways to create semantically rich summaries that address the user's information needs:
– Editorial, extraction, UGC
• Unleash community computing—PeopleWeb!
Challenge: Design social interactions that lead to creation and maintenance of high-quality structured content
Evolution of Online Communities
Metadata
• Estimated growth of metadata:
– Anchortext: 100 MB/day
– Tags: 40 MB/day
– Pageviews: 100-200 GB/day
– Reviews: around 10 MB/day
– Ratings: <small>
Drove most advances in search from 1996-present
Increasingly rich and available, but not yet useful in search
This is despite the fact that interactions on the web are currently limited: each site is essentially a silo
PeopleWeb: Site-Centric → People-Centric
• Common web-wide id for objects (incl. users)
– Even common attributes? (e.g., pixels for camera objects)
• As users move across sites, their personas and social networks will be carried along
• Increased semantics on the web through community activity (another path to the goals of the Semantic Web)
(Diagram: Global Object Model, Portable Social Environment, Community Search)
(Towards a PeopleWeb, Ramakrishnan & Tomkins, IEEE Computer, August 2007)
Social Search
• Explicitly open up search
– Enable communities, sites and consumers to explicitly re-define search results (e.g., SearchMonkey, BOSS)
• Right unit for a "search result"? Can we "stitch together" more informative abstracts from multiple sources?
• Creation of specialized ranking engines for different tasks, or different user communities
• Implicitly leverage socially engaged users and their interactions
– Learning from shared community interactions, and leveraging community interactions to create and refine content
• Expanding search results to include sources of information
– E.g., experts, sub-communities of shared interest, particular search engines (in a world with many, this is valuable!)
Reputation, Quality, Trust, Privacy
Web Search Results for “Lisa”
Latest news results for "Lisa" are mostly about people, because Lisa is a popular name
Web search results are very diversified, covering pages about organizations, projects, people, events, etc.
41 results from My Web!
Save / Tag Pages You Like
You can save / tag pages you like into My Web from toolbar / bookmarklet / save buttons
You can pick tags from the suggested tags based on collaborative tagging technology
Type-ahead based on the tags you have used
Enter your note for personal recall and sharing purposes
You can specify a sharing mode
You can save a cache copy of the page content
(Slide courtesy: Raymie Stata)
My Web 2.0 Search Results for “Lisa”
Excellent set of search results from my community because a couple of people in my community are interested in Usenix Lisa-related topics
Tech Support at COMPAQ
“In newsgroups, conversations disappear and you have to ask the same question over and over again. The thing that makes the real difference is the ability for customers to collaborate and have information be persistent. That’s how we found QUIQ. It’s exactly the philosophy we’re looking for.”
“Tech support people can’t keep up with generating content and are not experts on how to effectively utilize the product … Mass Collaboration is the next step in Customer Service.”
– Steve Young, VP of Customer Care, Compaq
How It Works
(Diagram: a Customer asks a QUESTION; SELF SERVICE answers it from the KNOWLEDGEBASE when possible. Otherwise the community (partner experts, customer champions, employees) or a Support Agent provides an ANSWER, which is added to the knowledgebase to power self service.)
Timely Answers
• 6,845 questions; 74% answered
• Answers provided in 3h: 40% (2,057)
• Answers provided in 12h: 65% (3,247)
• Answers provided in 24h: 77% (3,862)
• Answers provided in 48h: 86% (4,328)
With:
• No effort to answer each question
• No added experts
• No monetary incentives for enthusiasts
77% of answers provided within 24h
Power of Knowledge Creation
(Diagram: of all support incidents, roughly 80% are resolved by self-service and customer mass collaboration (SHIELD 1 and SHIELD 2, powered by knowledge creation), leaving 5-10% as agent cases.)
*) Averages from QUIQ implementations
Mass Contribution
• Users who on average provide only 2 answers provide 50% of all answers
• Contributing users: top users 7% (120); mass of users 93% (1,503)
• Answers: of 6,718 total (100%), the mass of users contributed 3,329 (50%)
Interesting Problems
• Question categorization
• Detecting undesirable questions & answers
• Identifying “trolls”
• Ranking results in Answers search
• Finding related questions
• Estimating question & answer quality
(Byron Dom: SIGIR 2006 talk)
Better Search via Information Extraction
• Extract, then exploit, structured data from raw text:

"For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. 'We can be open source. We love the concept of shared source,' said Bill Veghte, a Microsoft VP. 'That's a super-important shift for us in terms of code access.' Richard Stallman, founder of the Free Software Foundation, countered saying…"

PEOPLE:
Name             | Title   | Organization
Bill Gates       | CEO     | Microsoft
Bill Veghte      | VP      | Microsoft
Richard Stallman | Founder | Free Soft..

SELECT Name FROM PEOPLE WHERE Organization = 'Microsoft'
→ Bill Gates, Bill Veghte

(from Cohen's IE tutorial, 2003)
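The extract-then-query pipeline above can be sketched in a few lines. This is a toy illustration, not the tutorial's actual extractor: the regular expressions and field names below are invented for this one snippet, standing in for a real IE system.

```python
import re

text = ("For years, Microsoft Corporation CEO Bill Gates was against open "
        "source. 'We can be open source. We love the concept of shared "
        "source,' said Bill Veghte, a Microsoft VP.")

# Step 1 (extract): hand-written patterns stand in for a learned extractor.
people = []
for m in re.finditer(r"(Microsoft) Corporation (CEO) (\w+ \w+)", text):
    people.append({"Name": m.group(3), "Title": m.group(2), "Organization": m.group(1)})
for m in re.finditer(r"(\w+ \w+), a (Microsoft) (VP)", text):
    people.append({"Name": m.group(1), "Title": m.group(3), "Organization": m.group(2)})

# Step 2 (exploit): the slide's SQL query, run over the extracted PEOPLE table.
result = [p["Name"] for p in people if p["Organization"] == "Microsoft"]
print(result)  # ['Bill Gates', 'Bill Veghte']
```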
DBLife: Community Information Mgmt
• Integrated information about a (focused) real-world community
• Collaboratively built and maintained by the community
• Semantic web, "bottom-up"
• CIMple software package
DBLife
• Integrate data of the DB research community
• 1164 data sources, crawled daily, 11000+ pages = 160+ MB / day
Entity Extraction and Resolution
Raghu Ramakrishnan
co-authors = A. Doan, Divesh Srivastava, ...
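A minimal sketch of the kind of co-author evidence shown above: merge two name mentions if the names are initial-compatible and the mentions share a co-author. The rule and the field names are illustrative only; DBLife's actual matcher is far more sophisticated.

```python
def same_person(m1, m2):
    """Toy entity-resolution rule (illustrative, not DBLife's matcher):
    two author mentions co-refer if their names agree on last name and
    first initial, and they share at least one co-author."""
    def compatible(a, b):
        a_first, a_last = a.split()[0], a.split()[-1]
        b_first, b_last = b.split()[0], b.split()[-1]
        return a_last == b_last and a_first[0] == b_first[0]
    return (compatible(m1["name"], m2["name"])
            and bool(set(m1["coauthors"]) & set(m2["coauthors"])))

m1 = {"name": "Raghu Ramakrishnan", "coauthors": ["A. Doan", "Divesh Srivastava"]}
m2 = {"name": "R. Ramakrishnan", "coauthors": ["A. Doan"]}
m3 = {"name": "R. Ramaswamy", "coauthors": ["A. Doan"]}
print(same_person(m1, m2))  # True
print(same_person(m1, m3))  # False
```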
Resulting ER Graph
(Diagram: an ER graph over the paper "Proactive Re-optimization", the venue SIGMOD 2005, and the people Jennifer Widom, Shivnath Babu, David DeWitt, and Pedro Bizarro, linked by coauthor, write, advise, PC-Chair, and PC-member edges.)
Mass Collaboration
• We want to leverage user feedback to improve the quality of extraction over time
– Maintaining an extracted "view" on a collection of documents over time is very costly; getting feedback from users can help
– In fact, distributing the maintenance task across a large group of users may be the best approach
Mass Collaboration Example
Not David!
Picture is removed if enough users vote “no”.
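The voting mechanism can be sketched as below. The threshold of 3 votes and the class names are invented for illustration; counting only distinct voters is a first, weak guard against the spam problem raised on the next slide (a real deployment would also weight votes by user reputation).

```python
# Toy vote-based correction for an extracted fact (e.g., a person's photo).
# The removal threshold and class are illustrative, not DBLife's actual code.
class ExtractedFact:
    def __init__(self, value, removal_threshold=3):
        self.value = value
        self.removal_threshold = removal_threshold
        self.no_voters = set()   # one vote per user guards against ballot stuffing
        self.removed = False

    def vote_no(self, user_id):
        self.no_voters.add(user_id)
        if len(self.no_voters) >= self.removal_threshold:
            self.removed = True  # enough distinct users disagree: retract it

photo = ExtractedFact("david_dewitt.jpg")
for user in ["u1", "u2", "u2", "u3"]:   # u2 votes twice but is counted once
    photo.vote_no(user)
print(photo.removed)  # True: 3 distinct "no" votes
```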
Mass Collaboration Meets Spam
Jeffrey F. Naughton swears that this is David J. DeWitt
DBLife
• Faculty: AnHai Doan & Raghu Ramakrishnan
• Students: P. DeRose, W. Shen, F. Chen, R. McCann, Y. Lee, M. Sayyadian
• Prototype system up and running since early 2005
• Plan to release a public version of the system in Spring 2007
• 1164 sources, crawled daily, 11000+ pages / day
• 160+ MB, 121400+ people mentions, 5600+ persons
• See DE overview article, CIDR 2007 demo
Information Extraction
… and the challenge of managing it
Economics of IE
• Data is cheap; supervision is expensive
– The cost of supervision, especially large, high-quality training sets, is high
– By comparison, the cost of data is low
• Therefore:
– Rapid training-set construction / active-learning techniques
– Tolerance for little (or low-quality) supervision
– Take feedback and iterate rapidly
The Purple SOX Project (SOcial eXtraction)
(Architecture: an Application Layer with shopping/travel/autos, academic portals (DBLife/MeYahoo), an enthusiast platform, and many others, built on an Extraction Management System (e.g., Vertex, Societek) and an Operator Library.)
Example: Accepted Papers
• Every conference comes with a slightly different format for accepted papers
– We want to extract accepted papers directly (before they make their way into DBLP etc.)
• Assume:
– Lots of background knowledge (e.g., DBLP from last year)
– No supervision on the target page
• What can you do?
Down the Page a Bit
Record Identification
• To get started, we need to identify records
– Hey, we could write an XPath, no?
– So, what if no supervision is allowed?
• Given a crude classifier for paper records, can we recursively split up this page?
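The recursive-splitting idea can be sketched as follows. The page is modeled as a nested list of regions, and the "crude classifier" is a stand-in predicate; both are invented for illustration, not the actual system.

```python
def looks_like_record(text):
    # Crude stand-in classifier: an author list (commas) followed by a
    # title (period). A real system would use a learned model.
    return "," in text and "." in text

def split_records(node):
    """Recursively split a page region; return each leaf that the crude
    classifier accepts as one candidate paper record."""
    if isinstance(node, str):
        return [node] if looks_like_record(node) else []
    records = []
    for child in node:
        records.extend(split_records(child))
    return records

page = [
    "Accepted Papers",
    ["Session 1",
     "J. Widom, S. Babu. Proactive Re-optimization.",
     "A. Doan, R. Ramakrishnan. Community Information Management."],
]
print(len(split_records(page)))  # 2 candidate paper records
```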
First Level Splits
After More Splits …
Now Get the Records
• Goal: to extract fields of individual records
• We need training examples, right?
– But these papers are new
• The best we can do without supervision is noisy labels
– From having seen other such pages
Partial, Noisy Labels
Extracted Records
Refining Results via Feedback
• Now let's shift slightly to consider extraction of publications from academic home pages
– Must identify publication sections of faculty home pages, and extract paper citations from them
• Underlying data model for extracted data:
– A flexible graph-based model (similar to RDF or the ER conceptual model)
– "Confidence" scores per attribute or relationship
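The graph-based model with per-fact confidence can be sketched as quads. The predicate names, identifiers, and scores below are invented for illustration; the talk only specifies "an RDF/ER-like graph with confidence scores".

```python
# Sketch of an RDF/ER-style extraction store with per-fact confidence.
facts = []

def assert_fact(subj, pred, obj, conf):
    facts.append({"s": subj, "p": pred, "o": obj, "conf": conf})

assert_fact("homepage#section2", "is_a", "publication_section", 0.91)
assert_fact("homepage#section2", "contains_citation",
            "Proactive Re-optimization", 0.74)

def apply_feedback(subj, pred, obj, correct):
    # User feedback pins confidence rather than silently deleting the fact,
    # so downstream operators can re-derive dependent extractions.
    for f in facts:
        if (f["s"], f["p"], f["o"]) == (subj, pred, obj):
            f["conf"] = 1.0 if correct else 0.0

apply_feedback("homepage#section2", "is_a", "publication_section", correct=False)
print(facts[0]["conf"])  # 0.0
```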
Extracted Publication Titles
A Dubious Extracted Publication…
PSOX provides declarative lineage tracking over operator executions
Where’s the Problem?
Use lineage to find the source of the problem.
Source Page
Hmm, not a publication page (but it may have looked like one to a classifier).
Feedback
The user corrects the classification of that section.
Faculty or Student?
• NLP
• Build a classifier
• Or…
…Another Clue…
…Stepping Back…
(Diagram: a Student-List containing Student nodes and a Prof-List containing Prof nodes, linked by AdvisorOf edges.)
• Leads to large-scale, partially-labeled relational learning
• Involving different types of entities and links
Maximizing the Value of What You Select to Show Users
Content Optimization
• PROBLEM: match-making between content, user, and context
– Content: programmed (e.g., by editors); acquired (e.g., RSS feeds, UGC)
– User: individual (e.g., b-cookie), or user segment
– Context: e.g., Y! or non-Y! property; device; time period
• APPROACH: scalable algorithms that select content to show, using editorially determined content mixes, and respecting editorially set constraints and policies.
Team
Deepak Agarwal, Todd Beaupre, Bee-Chung Chen, Wei Chu, Pradheep Elango, Ken Fox, Nitin Motgi, Seung-Taek Park, Raghu Ramakrishnan, Scott Roy, Joe Zachariah
Yahoo! Home Page Featured Box
• It is the top-center part of the Y! Front Page
• It has four tabs: Featured, Entertainment, Sports, and Video
Traditional Role of Editors
• Strict quality control
– Preserve "Yahoo! Voice" (e.g., typical mix of content)
– Community standards
– Quality guidelines (e.g., topical articles shown for a limited time)
• Program articles periodically
– New ones pushed, old ones taken out
– Few tens of unique articles per day
– 16 articles at any given time; editors keep up with novel articles and remove fading ones
– Choose which articles appear in which tabs
Content Optimization Approach
• Editors continue to determine content sources, program some content, determine policies to ensure quality, and specify business constraints
– But we use a statistically based machine-learning algorithm to determine which articles to show where when a user visits the FP
Modeling Approach
• Pure feature-based (did not work well):
– Article: URL, keywords, categories
– Build offline models to predict CTR
– Models considered: logistic regression with feature selection, decision trees, etc.
• Track CTR per article in user segments through online models
– Offline models used for initializing CTR
– This worked well; the approach we eventually took
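One standard way to realize "offline models initialize, online models track" is a pseudo-count prior on each article's CTR. The talk does not give the exact model, so the prior strength below is an invented parameter in this sketch.

```python
# Per-article online CTR tracking, initialized from an offline estimate.
class ArticleCTR:
    def __init__(self, offline_ctr, prior_strength=100.0):
        # Seed pseudo-counts from the offline, feature-based prediction
        self.clicks = offline_ctr * prior_strength
        self.views = prior_strength

    def update(self, views, clicks):
        # Fold in live traffic; recent data gradually dominates the prior
        self.views += views
        self.clicks += clicks

    def estimate(self):
        return self.clicks / self.views

a = ArticleCTR(offline_ctr=0.02)   # offline model predicts 2% CTR
a.update(views=1000, clicks=50)    # live traffic shows 5% CTR
print(round(a.estimate(), 3))      # 0.047: pulled toward the online data
```

A decayed version of the same update (down-weighting old counts) would handle the non-stationary CTRs discussed two slides later.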
Explore/Exploit
• What is the best strategy for new articles?
– If we show it and it's bad: lose clicks
– If we delay and it's good: lose clicks
• Solution: show it, while we don't yet have much data, if it looks promising
– A classical multi-armed bandit type problem
– But our setup differs from the ones studied in the literature; a new ML problem
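For context, the classical baseline this slide generalizes is the epsilon-greedy bandit: exploit the highest-estimated article most of the time, explore a random one otherwise. The talk's actual scheme is a batch-mode optimal design over a changing article pool, which this sketch does not capture; article ids and CTRs below are invented.

```python
import random

def choose_article(ctr_estimates, epsilon=0.1, rng=random):
    """ctr_estimates: dict of article_id -> estimated CTR."""
    if rng.random() < epsilon:
        return rng.choice(list(ctr_estimates))        # explore: random article
    return max(ctr_estimates, key=ctr_estimates.get)  # exploit: current best

rng = random.Random(0)
ctrs = {"a1": 0.05, "a2": 0.03, "new": 0.04}
picks = [choose_article(ctrs, epsilon=0.1, rng=rng) for _ in range(1000)]
print(picks.count("a1") > 800)  # True: mostly exploits the current best
```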
Novel Aspects
• Classical: arms assumed fixed over time
– We gain and lose arms over time (some theoretical work by Whittle in the '80s; operations research)
• Classical: serving rule updated after each pull
– We compute an optimal design in batch mode
• Classical: CTR generally assumed stationary
– We have highly dynamic, non-stationary CTRs
Some Other Complications
• We run multiple experiments (possibly correlated) simultaneously; effective sample-size calculation is a challenge
• Serving bias: it is incorrect to learn from data for serving scheme A and apply it to serving scheme B
– Need an unbiased quality score
– Bias sources: positional effects, time effects, set of articles shown together
• Incorporating feature-based techniques
– Regression style, e.g., logistic regression
– Tree-based (hierarchical bandit)
System Challenges
• Highly dynamic system characteristics:
– Short article lifetimes, constantly changing pool, dynamic user population, non-stationary CTRs
– Quick adaptation is key to success
• Scalability:
– 1000s of page views/sec; data collection, model training, and article scoring done under tight latency constraints
Results
• We built an experimental infrastructure to test new content-serving schemes
– Ran side-by-side experiments on live traffic
– Experiments performed over several months; we consistently outperformed the old system
– Results showed we get more clicks by engaging more users
• Editorial overrides did not reduce lift numbers substantially
Comparing buckets
Lift is Due to Increased Reach
• Lift in fraction of clicking users
Related Work
• Amazon, Netflix, Y! Music, etc.:
– Collaborative filtering with a large content pool
– Achieve lift by eliminating bad articles
– We instead have a small number of high-quality articles
• Search, advertising:
– Matching problem with a large content pool
– Match through feature-based models
Summary of Content Optimization
• Offline models to initialize online models
• Online models to track performance
• Explore/exploit to converge fast
• Study user visit patterns and behavior; program content accordingly
Summary
• Online communities represent a tremendous resource for organizing information online
– Open APIs and cloud services = mass engagement
– Extraction + mass collaboration = semantics
• The web is becoming:
– More people-centric, less site-centric
– Highly intertwined, distributed, dynamic, personalized
– Models of ownership, trust, incentives?
– Next generation of search algorithms and infrastructure?
• Exciting "grand challenge" problems that require us to combine ideas from data management, statistics, learning, and optimization
– i.e., data mining problems!
Further Reading
• Content, Metadata, and Behavioral Information: Directions for Yahoo! Research, The Yahoo! Research Team, IEEE Data Engineering Bulletin, Dec 2006 (Special Issue on Web-Scale Data, Systems, and Semantics)
• Systems, Communities, Community Systems on the Web, Community Systems Group at Yahoo! Research, SIGMOD Record, Sept 2007
• Towards a PeopleWeb, R. Ramakrishnan and A. Tomkins, IEEE Computer, August 2007 (Special Issue on Web Search)
DBLife Papers
• Efficient Information Extraction over Evolving Text Data, F. Chen, A. Doan, J. Yang, R. Ramakrishnan. ICDE-08.
• Building Structured Web Community Portals: A Top-Down, Compositional, and Incremental Approach, P. DeRose, W. Shen, F. Chen, A. Doan, R. Ramakrishnan. VLDB-07.
• Declarative Information Extraction Using Datalog with Embedded Extraction Predicates, W. Shen, A. Doan, J. Naughton, R. Ramakrishnan. VLDB-07.
• Source-aware Entity Matching: A Compositional Approach, W. Shen, A. Doan, J.F. Naughton, R. Ramakrishnan: ICDE 2007.
• OLAP over Imprecise Data with Domain Constraints, D. Burdick, A. Doan, R. Ramakrishnan, S. Vaithyanathan. VLDB-07.
• Community Information Management, A. Doan, R. Ramakrishnan, F. Chen, P. DeRose, Y. Lee, R. McCann, M. Sayyadian, and W. Shen. IEEE Data Engineering Bulletin, Special Issue on Probabilistic Databases, 29(1), 2006.
• Managing Information Extraction, A. Doan, R. Ramakrishnan, S. Vaithyanathan. SIGMOD-06 Tutorial.