Semantics on the Web: How Do We Get There?

NASA CIDU 2008 Keynote. Raghu Ramakrishnan, Chief Scientist for Audience & Cloud Computing, Yahoo! Semantics on the Web: How Do We Get There?

TRANSCRIPT

Page 1: Semantics on the Web: How Do We Get There?

NASA CIDU 2008 Keynote

Raghu Ramakrishnan, Chief Scientist for Audience & Cloud Computing

Yahoo!

Semantics on the Web: How Do We Get There?

Page 2: Semantics on the Web: How Do We Get There?

"Prediction is very hard, especially about the future." – Yogi Berra

Page 3: Semantics on the Web: How Do We Get There?

Trends in Search

Page 4: Semantics on the Web: How Do We Get There?

Search Intent

• Premise:
– People don't want to search
– People want to get tasks done

"I want to book a vacation in Tuscany." (Start → Finish)

Broder 2002, A Taxonomy of Web Search

Page 5: Semantics on the Web: How Do We Get There?

Y! Shortcuts

Page 6: Semantics on the Web: How Do We Get There?

Google Base

Page 7: Semantics on the Web: How Do We Get There?

Opening Up Yahoo! Search

Phase 1: Giving site owners and developers control over the appearance of Yahoo! Search results.

Phase 2: BOSS takes Yahoo!'s open strategy to the next level by providing Yahoo! Search infrastructure and technology to developers and companies to help them build their own search experiences.

(Slide courtesy Prabhakar Raghavan)

Page 8: Semantics on the Web: How Do We Get There?

Search Results of the Future

Example sources: babycenter, epicurious, yelp.com, answers.com, LinkedIn, webmd, Gawker, New York Times

(Slide courtesy Andrew Tomkins)

Page 9: Semantics on the Web: How Do We Get There?

Custom Search Experiences
• Social Search
• Vertical Search
• Visual Search

(Slide courtesy Prabhakar Raghavan)

Page 10: Semantics on the Web: How Do We Get There?

How Does SearchMonkey Work?

1. Site owners/publishers share structured data with Yahoo! for indexing (RDF/microformat markup on Acme.com's web pages, DataRSS feeds, web services, or page extraction from Acme.com's database).
2. Site owners & third-party developers build SearchMonkey apps.
3. Consumers customize their search experience with Enhanced Results.

(Slide courtesy Amit Kumar)

Page 11: Semantics on the Web: How Do We Get There?

BOSS Offerings

BOSS offers two options for companies and developers, and has partnered with top technology universities to drive search experimentation, innovation and research into next-generation search.

• API: A self-service, web-services model for developers and start-ups to quickly build and deploy new search experiences.

• CUSTOM: Working with third parties to build a more relevant, brand/site-specific web search experience. This option is jointly built by Yahoo! and select partners.

• ACADEMIC: Working with the following universities to allow for wide-scale research in the search field: University of Illinois Urbana-Champaign, Carnegie Mellon University, Stanford University, Purdue University, MIT, Indian Institute of Technology Bombay, University of Massachusetts.

(Slide courtesy Prabhakar Raghavan)

Page 12: Semantics on the Web: How Do We Get There?

Partner Examples

Page 13: Semantics on the Web: How Do We Get There?

Aside: Cloud Computing

• On-demand access to enormous datasets and computing power
• Rapid application development
• No maintenance worries for the application developer

Page 14: Semantics on the Web: How Do We Get There?

Cloud Services

Compute-centric:
• Analytic/batch processing, e.g., Hadoop, SSDS
• HTC job scheduling, e.g., Condor
• Virtual machines, e.g., EC2, PoolBoy, Tashi

Data-centric:
• Serving stores, e.g., Mobstor, MySQL, S3, Sherpa
• Archival/warehouse storage
• Messaging: ordered? guaranteed delivery? scalability (# msgs, producers, subscribers)?

Page 15: Semantics on the Web: How Do We Get There?

Google Co-Op

• This query matches a pattern provided by a Contributor, so the SERP displays (query-specific) links programmed by the Contributor
• "Subscribed Link": a query-based direct display, programmed by the Contributor
• Users opt in by subscribing to them

Page 16: Semantics on the Web: How Do We Get There?

Structured Search Content: Where Do We Get It?

• Semantic Web?
• Three ways to create semantically rich summaries that address the user's information needs: editorial, extraction, UGC
• Unleash community computing: the PeopleWeb!

Challenge: Design social interactions that lead to creation and maintenance of high-quality structured content

Page 17: Semantics on the Web: How Do We Get There?

Evolution of Online Communities

Page 18: Semantics on the Web: How Do We Get There?

Page 19: Semantics on the Web: How Do We Get There?

Metadata

• Estimated growth of metadata:
– Anchortext: 100 MB/day
– Tags: 40 MB/day
– Pageviews: 100–200 GB/day
– Reviews: around 10 MB/day
– Ratings: small

• Metadata drove most advances in search from 1996 to the present

• Increasingly rich and available, but not yet useful in search; this is in spite of the fact that interactions on the web are currently limited because each site is essentially a silo

Page 20: Semantics on the Web: How Do We Get There?

PeopleWeb: Site-Centric → People-Centric

• Common web-wide ID for objects (incl. users)
– Even common attributes? (e.g., pixels for camera objects)
• As users move across sites, their personas and social networks will be carried along
• Increased semantics on the web through community activity (another path to the goals of the Semantic Web)

Key elements: Global Object Model, Portable Social Environment, Community Search

(Towards a PeopleWeb, Ramakrishnan & Tomkins, IEEE Computer, August 2007)

Page 21: Semantics on the Web: How Do We Get There?

Social Search

Page 22: Semantics on the Web: How Do We Get There?

Social Search

• Explicitly open up search
– Enable communities, sites and consumers to explicitly re-define search results (e.g., SearchMonkey, BOSS)
• What is the right unit for a "search result"? Can we "stitch together" more informative abstracts from multiple sources?
• Creation of specialized ranking engines for different tasks, or different user communities
• Implicitly leverage socially engaged users and their interactions
– Learning from shared community interactions, and leveraging community interactions to create and refine content
• Expanding search results to include sources of information
– E.g., experts, sub-communities of shared interest, particular search engines (in a world with many, this is valuable!)

Cross-cutting: reputation, quality, trust, privacy

Page 23: Semantics on the Web: How Do We Get There?

Web Search Results for “Lisa”

Latest news results for "Lisa" are mostly about people, because Lisa is a popular name

Web search results are very diversified, covering pages about organizations, projects, people, events, etc.

41 results from My Web!

Page 24: Semantics on the Web: How Do We Get There?

Save / Tag Pages You Like

• You can save/tag pages you like into My Web from the toolbar, bookmarklet, or save buttons
• You can pick tags from the suggested tags, based on collaborative tagging technology
• Type-ahead based on the tags you have used
• Enter a note for personal recall and sharing purposes
• You can specify a sharing mode
• You can save a cached copy of the page content

(Slide courtesy: Raymie Stata)

Page 25: Semantics on the Web: How Do We Get There?

My Web 2.0 Search Results for “Lisa”

Excellent set of search results from my community, because a couple of people in my community are interested in USENIX LISA-related topics

Page 26: Semantics on the Web: How Do We Get There?

Page 27: Semantics on the Web: How Do We Get There?

Tech Support at COMPAQ

“In newsgroups, conversations disappear and you have to ask the same question over and over again. The thing that makes the real difference is the ability for customers to collaborate and have information be persistent. That’s how we found QUIQ. It’s exactly the philosophy we’re looking for.”

“Tech support people can’t keep up with generating content and are not experts on how to effectively utilize the product … Mass Collaboration is the next step in Customer Service.”

– Steve Young, VP of Customer Care, Compaq

Page 28: Semantics on the Web: How Do We Get There?

How It Works

A customer's QUESTION first goes through SELF SERVICE against the KNOWLEDGEBASE. If needed, answers come from partner experts, customer champions, employees, and the support agent; each ANSWER is then added to the knowledge base to power self-service.

Page 29: Semantics on the Web: How Do We Get There?

Timely Answers

6,845 questions; 74% answered. Of the answered questions:
• 40% (2,057) answered within 3h
• 65% (3,247) within 12h
• 77% (3,862) within 24h
• 86% (4,328) within 48h

• No effort to answer each question
• No added experts
• No monetary incentives for enthusiasts

77% of answers provided within 24h
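
The slide's percentages only line up if they are taken over answered questions (74% of 6,845 ≈ 5,065), not over all questions; a quick arithmetic check:

```python
# Counts from the slide. The percentages check out against the number of
# answered questions (74% of 6,845), not the total question count.
questions = 6845
answered = 0.74 * questions          # ~5,065 answered questions
within = {3: 2057, 12: 3247, 24: 3862, 48: 4328}
slide_pct = {3: 0.40, 12: 0.65, 24: 0.77, 48: 0.86}

for hours, count in within.items():
    # Each stated percentage matches count/answered to within rounding.
    assert abs(count / answered - slide_pct[hours]) < 0.015
```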

Page 30: Semantics on the Web: How Do We Get There?

Power of Knowledge Creation

• ~80% of support incidents are handled by self-service (SHIELD 1)*
• 5–10% become agent cases
• Customer mass collaboration (SHIELD 2)* feeds the knowledge creation that powers self-service

*) Averages from QUIQ implementations

Page 31: Semantics on the Web: How Do We Get There?

Mass Contribution

Users who on average provide only 2 answers provide 50% of all answers.

• Contributing users: top users are 7% (120); the mass of users is 93% (1,503)
• Answers: top users contributed 50% (3,329) of all 6,718 answers; the mass of users contributed the rest

Page 32: Semantics on the Web: How Do We Get There?

Interesting Problems

• Question categorization

• Detecting undesirable questions & answers

• Identifying “trolls”

• Ranking results in Answers search

• Finding related questions

• Estimating question & answer quality

(Byron Dom: SIGIR 2006 talk)

Page 33: Semantics on the Web: How Do We Get There?

Better Search via Information Extraction

• Extract, then exploit, structured data from raw text:

For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

PEOPLE
Name             | Title   | Organization
-----------------|---------|-------------
Bill Gates       | CEO     | Microsoft
Bill Veghte      | VP      | Microsoft
Richard Stallman | Founder | Free Soft..

Select Name From PEOPLE Where Organization = 'Microsoft'
→ Bill Gates, Bill Veghte

(from Cohen's IE tutorial, 2003)
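
To make the extract-then-exploit pipeline concrete, here is a minimal sketch. The two regex patterns, variable names, and toy text are mine, and they only cover the first two mentions; real extractors use learned models rather than hand-written patterns.

```python
import re

text = ("For years, Microsoft Corporation CEO Bill Gates was against open source. "
        '"We can be open source," said Bill Veghte, a Microsoft VP. '
        "Richard Stallman, founder of the Free Software Foundation, countered.")

# Toy patterns (hand-written for this snippet; they miss the Stallman mention).
patterns = [
    # "<Org> Corporation <Title> <Name>", as in "Microsoft Corporation CEO Bill Gates"
    (re.compile(r"(Microsoft) Corporation (CEO) ([A-Z]\w+ [A-Z]\w+)"), (3, 2, 1)),
    # "<Name>, a <Org> <Title>", as in "Bill Veghte, a Microsoft VP"
    (re.compile(r"([A-Z]\w+ [A-Z]\w+), a (Microsoft) (VP)"), (1, 3, 2)),
]

people = []  # rows of the PEOPLE "table": (name, title, organization)
for pat, order in patterns:
    for m in pat.finditer(text):
        people.append(tuple(m.group(i) for i in order))

# The "exploit" step: SELECT Name FROM PEOPLE WHERE Organization = 'Microsoft'
microsoft_names = [name for name, _, org in people if org == "Microsoft"]
print(microsoft_names)  # ['Bill Gates', 'Bill Veghte']
```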

Page 34: Semantics on the Web: How Do We Get There?

DBLife: Community Information Mgmt

• Integrated information about a (focused) real-world community
• Collaboratively built and maintained by the community
• Semantic web, "bottom-up"
• CIMple software package

Page 35: Semantics on the Web: How Do We Get There?

DBLife

• Integrates data of the DB research community
• 1,164 data sources, crawled daily: 11,000+ pages = 160+ MB/day

Page 36: Semantics on the Web: How Do We Get There?

Entity Extraction and Resolution

Raghu Ramakrishnan

co-authors = A. Doan, Divesh Srivastava, ...

Page 37: Semantics on the Web: How Do We Get There?

Resulting ER Graph

[Figure: ER graph in which Shivnath Babu, Pedro Bizarro, and David DeWitt are linked by coauthor edges and by write edges to the paper "Proactive Re-optimization" (SIGMOD 2005); advise edges connect Jennifer Widom and David DeWitt to students; PC-Chair and PC-member edges connect people to SIGMOD 2005.]

Page 38: Semantics on the Web: How Do We Get There?

Mass Collaboration

• We want to leverage user feedback to improve the quality of extraction over time
– Maintaining an extracted "view" on a collection of documents over time is very costly; getting feedback from users can help
– In fact, distributing the maintenance task across a large group of users may be the best approach

Page 39: Semantics on the Web: How Do We Get There?

Mass Collaboration Example

Not David!

Picture is removed if enough users vote “no”.

Page 40: Semantics on the Web: How Do We Get There?

Mass Collaboration Meets Spam

Jeffrey F. Naughton swears that this is David J. DeWitt

Page 41: Semantics on the Web: How Do We Get There?

DBLife

• Faculty: AnHai Doan & Raghu Ramakrishnan
• Students: P. DeRose, W. Shen, F. Chen, R. McCann, Y. Lee, M. Sayyadian
• Prototype system up and running since early 2005
• Plan to release a public version of the system in Spring 2007
• 1,164 sources, crawled daily, 11,000+ pages/day
• 160+ MB, 121,400+ people mentions, 5,600+ persons
• See DE overview article, CIDR 2007 demo

Page 42: Semantics on the Web: How Do We Get There?

Information Extraction

… and the challenge of managing it

Page 43: Semantics on the Web: How Do We Get There?

Page 44: Semantics on the Web: How Do We Get There?

Economics of IE

• Data is cheap; supervision is expensive
– The cost of supervision, especially large, high-quality training sets, is high
– By comparison, the cost of data is low
• Therefore:
– Rapid training-set construction / active learning techniques
– Tolerance for low (or low-quality) supervision
– Take feedback and iterate rapidly

Page 45: Semantics on the Web: How Do We Get There?

The Purple SOX (SOcial eXtraction) Project

• Core: an Extraction Management System (e.g., Vertex, Societek) built on an Operator Library
• Application layer: Shopping/Travel/Autos; academic portals (DBLife/MeYahoo); enthusiast platform; …and many others

Page 46: Semantics on the Web: How Do We Get There?

Example: Accepted Papers

• Every conference comes with a slightly different format for accepted papers
– We want to extract accepted papers directly (before they make their way into DBLP etc.)
• Assume:
– Lots of background knowledge (e.g., DBLP from last year)
– No supervision on the target page
• What can you do?

Page 47: Semantics on the Web: How Do We Get There?

Page 48: Semantics on the Web: How Do We Get There?

Down the Page a Bit

Page 49: Semantics on the Web: How Do We Get There?

Record Identification

• To get started, we need to identify records
– Hey, we could write an XPath, no?
– So, what if no supervision is allowed?
• Given a crude classifier for paper records, can we recursively split up this page?
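
A minimal sketch of that recursive-splitting idea, under stated assumptions: the helper names are mine, the page is flat text rather than a DOM, and the "crude classifier" is a stand-in (one title line followed by one comma-separated author line). The real system would score candidate split points with the classifier instead of blindly bisecting at the midpoint.

```python
def looks_like_record(lines):
    """Crude stand-in classifier: a record is one title line followed by
    one comma-separated author line (an assumption for this sketch)."""
    return len(lines) == 2 and lines[1].count(",") >= 1

def split_records(lines):
    """Recursively bisect the page until each segment passes the classifier."""
    if not lines:
        return []
    if looks_like_record(lines):
        return [lines]
    if len(lines) <= 2:
        return []          # too small and still not a record: discard
    mid = len(lines) // 2
    return split_records(lines[:mid]) + split_records(lines[mid:])

page = [
    "Proactive Re-optimization",
    "Shivnath Babu, Pedro Bizarro, David DeWitt",
    "Managing Information Extraction",
    "A. Doan, R. Ramakrishnan, S. Vaithyanathan",
]
print(split_records(page))  # two (title, authors) segments
```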

Page 50: Semantics on the Web: How Do We Get There?

First Level Splits

Page 51: Semantics on the Web: How Do We Get There?

After More Splits …

Page 52: Semantics on the Web: How Do We Get There?

Now Get the Records

• Goal: to extract fields of individual records
• We need training examples, right?
– But these papers are new
• The best we can do without supervision is noisy labels
– From having seen other such pages

Page 53: Semantics on the Web: How Do We Get There?

Partial, Noisy Labels

Page 54: Semantics on the Web: How Do We Get There?

Extracted Records

Page 55: Semantics on the Web: How Do We Get There?

Refining Results via Feedback

• Now let's shift slightly to consider extraction of publications from academic home pages
– Must identify publication sections of faculty home pages, and extract paper citations from them
• Underlying data model for extracted data:
– A flexible graph-based model (similar to RDF or the ER conceptual model)
– "Confidence" scores per attribute or relationship
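
One way to picture such a model (a sketch, not the PSOX implementation; every identifier and operator name below is illustrative): store extractions as subject–predicate–object edges, each carrying a confidence score and the lineage of operators that produced it.

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    """One extracted fact: an edge in the graph-based data model."""
    subject: str
    predicate: str
    obj: str
    confidence: float                 # per-attribute/relationship score
    lineage: list = field(default_factory=list)  # operators that produced it

graph = [
    Edge("homepage:jdoe", "hasSection", "sec:publications", 0.92,
         ["SectionClassifier"]),
    Edge("sec:publications", "containsCitation", "paper:example-title", 0.35,
         ["SectionClassifier", "CitationSegmenter"]),
]

# A dubious extraction is just a low-confidence edge; its lineage tells us
# which operator to inspect when a user flags the result.
dubious = [e for e in graph if e.confidence < 0.5]
print([(e.obj, e.lineage) for e in dubious])
```

This is the property exploited two slides later: following the lineage of a dubious publication leads back to the section classifier that mislabeled the source page.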

Page 56: Semantics on the Web: How Do We Get There?

Extracted Publication Titles

Page 57: Semantics on the Web: How Do We Get There?

A Dubious Extracted Publication…

PSOX provides declarative lineage tracking over operator executions

Page 58: Semantics on the Web: How Do We Get There?

Where’s the Problem?

Use lineage to find the source of the problem.

Page 59: Semantics on the Web: How Do We Get There?

Source Page

Hmm, not a publication page … (but it may have looked like one to a classifier)

Page 60: Semantics on the Web: How Do We Get There?

Feedback

User corrects the classification of that section.

Page 61: Semantics on the Web: How Do We Get There?

Faculty or Student?

• NLP
• Build a classifier
• Or…

Page 62: Semantics on the Web: How Do We Get There?

…Another Clue…

Page 63: Semantics on the Web: How Do We Get There?

…Stepping Back…

[Figure: a page with a Prof-List (Prof, Prof, Prof) and a Student-List (Student, Student), connected by AdvisorOf edges.]

• Leads to large-scale, partially-labeled relational learning
• Involving different types of entities and links

Page 64: Semantics on the Web: How Do We Get There?

Maximizing the Value of What You Select to Show Users

[Figure: candidate content items p1, p2, p3.]

Page 65: Semantics on the Web: How Do We Get There?

Content Optimization

• PROBLEM: Match-making between content, user, and context
– Content: programmed (e.g., by editors); acquired (e.g., RSS feeds, UGC)
– User: an individual (e.g., b-cookie), or a user segment
– Context: e.g., Y! or non-Y! property; device; time period

• APPROACH: Scalable algorithms that select content to show, using editorially determined content mixes, and respecting editorially set constraints and policies.

Page 66: Semantics on the Web: How Do We Get There?

Team

• Deepak Agarwal
• Todd Beaupre
• Bee-Chung Chen
• Wei Chu
• Pradheep Elango
• Ken Fox
• Nitin Motgi
• Seung-Taek Park
• Raghu Ramakrishnan
• Scott Roy
• Joe Zachariah

Page 67: Semantics on the Web: How Do We Get There?

Yahoo! Home Page Featured Box

• It is the top-center part of the Y! Front Page

• It has four tabs: Featured, Entertainment, Sports, and Video

Page 68: Semantics on the Web: How Do We Get There?

Traditional Role of Editors

• Strict quality control
– Preserve "Yahoo! Voice" (e.g., typical mix of content)
– Community standards
– Quality guidelines (e.g., topical articles shown for a limited time)
• Program articles periodically
– New ones pushed, old ones taken out
– Few tens of unique articles per day
– 16 articles at any given time; editors keep up with novel articles and remove fading ones
– Choose which articles appear in which tabs

Page 69: Semantics on the Web: How Do We Get There?

Content Optimization Approach

• Editors continue to determine content sources, program some content, determine policies to ensure quality, and specify business constraints
– But we use a statistically based machine-learning algorithm to determine which articles to show where when a user visits the FP

Page 70: Semantics on the Web: How Do We Get There?

Modeling Approach

• Pure feature-based (did not work well):
– Article: URL, keywords, categories
– Build offline models to predict CTR
– Models considered: logistic regression with feature selection, decision trees, etc.
• Track CTR per article in user segments through online models
– Offline models used for initializing CTR
– This worked well; the approach we took eventually

Page 71: Semantics on the Web: How Do We Get There?

Explore/Exploit

• What is the best strategy for new articles?
– If we show one and it's bad: lose clicks
– If we delay one and it's good: lose clicks
• Solution: show an article while we don't have much data, if it looks promising
– A classical multi-armed-bandit-type problem
– Our setup is different from the ones studied in the literature; a new ML problem
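
For intuition, the simplest classical bandit policy is epsilon-greedy: occasionally show a random article to gather data, otherwise show the current best. This is purely illustrative (names and numbers are mine); as the slide notes, the production setup departs from classical bandits in several ways.

```python
import random

def epsilon_greedy(ctr_by_article, eps=0.1, rng=random):
    """Pick an article id: explore a random one with probability eps,
    otherwise exploit the one with the highest estimated CTR."""
    if rng.random() < eps:
        return rng.choice(list(ctr_by_article))
    return max(ctr_by_article, key=lambda a: ctr_by_article[a])

ctr_estimates = {"article_a": 0.041, "article_b": 0.058, "article_c": 0.012}
random.seed(0)
picks = [epsilon_greedy(ctr_estimates) for _ in range(1000)]
print(picks.count("article_b") / 1000)  # mostly exploits the best arm
```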

Page 72: Semantics on the Web: How Do We Get There?

Novel Aspects

• Classical: arms assumed fixed over time
– We gain and lose arms over time (some theoretical work by Whittle in the '80s; operations research)
• Classical: serving rule updated after each pull
– We compute an optimal design in batch mode
• Classical: generally, CTR assumed stationary
– We have highly dynamic, non-stationary CTRs

Page 73: Semantics on the Web: How Do We Get There?

Some Other Complications

• We run multiple experiments (possibly correlated) simultaneously; effective sample-size calculation is a challenge
• Serving bias: it is incorrect to learn from data for serving scheme A and apply it to serving scheme B
– Need an unbiased quality score
– Bias sources: positional effects, time effects, the set of articles shown together
• Incorporating feature-based techniques
– Regression-style, e.g., logistic regression
– Tree-based (hierarchical bandit)

Page 74: Semantics on the Web: How Do We Get There?

System Challenges

• Highly dynamic system characteristics:
– Short article lifetimes, a constantly changing pool, a dynamic user population, non-stationary CTRs
– Quick adaptation is key to success
• Scalability:
– 1000s of page views/sec; data collection, model training, and article scoring done under tight latency constraints

Page 75: Semantics on the Web: How Do We Get There?

Results

• We built an experimental infrastructure to test new content-serving schemes
– Ran side-by-side experiments on live traffic
– Experiments performed for several months; we consistently outperformed the old system
– Results showed we get more clicks by engaging more users
• Editorial overrides did not reduce lift numbers substantially

Page 76: Semantics on the Web: How Do We Get There?

Comparing Buckets

Page 77: Semantics on the Web: How Do We Get There?

Lift is Due to Increased Reach

• Lift in fraction of clicking users

Page 78: Semantics on the Web: How Do We Get There?

Related Work

• Amazon, Netflix, Y! Music, etc.:
– Collaborative filtering with a large content pool
– Achieve lift by eliminating bad articles
– We have a small number of high-quality articles
• Search, advertising:
– A matching problem with a large content pool
– Match through feature-based models

Page 79: Semantics on the Web: How Do We Get There?

Summary of Content Optimization

• Offline models to initialize online models

• Online models to track performance

• Explore/exploit to converge fast

• Study user visit patterns and behavior; program content accordingly

Page 80: Semantics on the Web: How Do We Get There?

Summary

• Online communities represent a tremendous resource for organizing information online
– Open APIs and cloud services = mass engagement
– Extraction + mass collaboration = semantics
• The Web is becoming:
– More people-centric, less site-centric
– Highly intertwined, distributed, dynamic, personalized
– Models of ownership, trust, incentives?
– Next generation of search algorithms and infrastructure?
• Exciting "grand challenge" problems that require us to combine ideas from data management, statistics, learning, and optimization
– i.e., data mining problems!

Page 81: Semantics on the Web: How Do We Get There?

Further Reading

• Content, Metadata, and Behavioral Information: Directions for Yahoo! Research, The Yahoo! Research Team, IEEE Data Engineering Bulletin, Dec 2006 (Special Issue on Web-Scale Data, Systems, and Semantics)

• Systems, Communities, Community Systems on the Web, Community Systems Group at Yahoo! Research, SIGMOD Record, Sept 2007

• Towards a PeopleWeb, R. Ramakrishnan and A. Tomkins, IEEE Computer, August 2007 (Special Issue on Web Search)

Page 82: Semantics on the Web: How Do We Get There?

DBLife Papers

• Efficient Information Extraction over Evolving Text Data, F. Chen, A. Doan, J. Yang, R. Ramakrishnan. ICDE-08.

• Building Structured Web Community Portals: A Top-Down, Compositional, and Incremental Approach, P. DeRose, W. Shen, F. Chen, A. Doan, R. Ramakrishnan. VLDB-07.

• Declarative Information Extraction Using Datalog with Embedded Extraction Predicates, W. Shen, A. Doan, J. Naughton, R. Ramakrishnan. VLDB-07.

• Source-aware Entity Matching: A Compositional Approach, W. Shen, A. Doan, J.F. Naughton, R. Ramakrishnan: ICDE 2007.

• OLAP over Imprecise Data with Domain Constraints, D. Burdick, A. Doan, R. Ramakrishnan, S. Vaithyanathan. VLDB-07.

• Community Information Management, A. Doan, R. Ramakrishnan, F. Chen, P. DeRose, Y. Lee, R. McCann, M. Sayyadian, and W. Shen. IEEE Data Engineering Bulletin, Special Issue on Probabilistic Databases, 29(1), 2006.

• Managing Information Extraction, A. Doan, R. Ramakrishnan, S. Vaithyanathan. SIGMOD-06 Tutorial.