
March 2000 Gio XIT 1

Increasing the Precision when

Obtaining Information from the Web

Gio Wiederhold Stanford University

4 April 2000

related report: www-db.stanford.edu/pub/gio/1999/miti.htm

EPFL seminar

Supported by the AFOSR - New World Vistas Program


March 2000 Gio XIT 2

Growth Factors

[Diagram labels:]
• Consumer pull
• Research & innovation
• Tool building
• Product building & marketing
• General technology push
• Business needs
• Government responsibilities
• Information technology


March 2000 Gio XIT 3

Trends 1998 : 1999

• Users of the Internet: 40% (1998), 52% (1999) of U.S. population
• Growth of Net Sites (now 2.2M public sites with 288M pages)
• Expected growth in E-commerce by Internet users [BW, 6 Sep. 1999]

segment          1998     1999
books            7.2%    16.0%
music & video    6.3%    16.4%
toys             3.1%    10.3%
travel           2.6%     4.0%
tickets          1.4%     4.2%
Overall          8.0%    33.0% = $9.5 Billion

An unsustainable trend cannot be sustained [Herbert Stein]

[Chart: E-penetration (labels: Toys, new services). Year / %: 98: 0.3, 99: 1, 00: 3, 01: 9, 02: 27, 03: 81, 04: **. Annotation: centroid, in 1999 ~1% of total market.]
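The plotted values (0.3, 1, 3, 9, 27, 81, **) roughly triple each year. A minimal sketch of that extrapolation follows. It assumes a constant yearly factor of 3 anchored at ~1% in 1999; the constant factor and the function names are assumptions read off the chart, not a forecast given in the talk.

```python
# Naive extrapolation of e-penetration, assuming it keeps tripling every
# year from ~1% in 1999 (an assumption read off the plotted values).
BASE_YEAR = 1999
BASE_SHARE = 1.0      # percent of total market in 1999
GROWTH_FACTOR = 3.0   # assumed constant yearly multiplier


def projected_share(year: int) -> float:
    """Projected share in percent; may exceed 100, which is impossible."""
    return BASE_SHARE * GROWTH_FACTOR ** (year - BASE_YEAR)


def first_impossible_year(limit: float = 100.0) -> int:
    """First year in which the naive projection exceeds the whole market."""
    year = BASE_YEAR
    while projected_share(year) <= limit:
        year += 1
    return year


if __name__ == "__main__":
    for y in range(1998, 2005):
        print(y, round(projected_share(y), 1))
    print("projection exceeds 100% in", first_impossible_year())
```

Under these assumptions the projection passes 100% within a few years, which is the point of the Stein quote: the trend has to bend.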


March 2000 Gio XIT 4

Expect continuing growth

• Hardware technology will continue to lead and encourage broader usage

• Communication technology will continue to lead and become more economical

• User interfaces will improve and not be a barrier to the acceptance of technology

• Government policies will not hinder open interaction - or will not be able to


March 2000 Gio XIT 5

The Problem of Information Growth:

"We are drowning in information but starved for knowledge. This level of information is clearly impossible to be handled by present means. Uncontrolled and unorganized information is no longer a resource in an information society, instead it becomes the enemy."

-- John Naisbitt, author of 1982 bestseller Megatrends

. . . and it’s not getting better

Dealing with this issue requires Precision:
• Helpful for casual users -- reduce human filtering when browsing
• Essential for business -- regular tasks require automation

[Diagram label: needs Knowledge]


March 2000 Gio XIT 6

Knowledge from Experts

encoded for reuse

• The product: Information

[Diagram labels:]
Observations
Filters
Analyses
Aggregation of instances
Integration of sources
Data

Data + Knowledge → Information


March 2000 Gio XIT 7

Precision to be improved in

• Relevance of Information for the Customer
  – modeling the customer
• Timeliness of Information
  – resolving temporal mismatch for past data
• Search for Information
  – precision versus recall
• Meaning of the Information (our focus here)
  – resolving semantic mismatch

Service model to achieve these objectives: services add value by increasing precision
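For reference, the "precision versus recall" trade-off can be made concrete with the standard information-retrieval definitions. The sketch below uses hypothetical document identifiers and is only an illustration of the two measures, not part of the talk.

```python
# Precision: fraction of retrieved items that are relevant.
# Recall:    fraction of relevant items that were retrieved.
def precision_recall(retrieved: set, relevant: set) -> tuple:
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four documents are truly relevant.
relevant = {"d1", "d2", "d3", "d4"}
broad_search = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}
narrow_search = {"d1", "d2"}

broad = precision_recall(broad_search, relevant)    # finds everything, half is noise
narrow = precision_recall(narrow_search, relevant)  # no noise, misses half
```

A broad search maximizes recall at the cost of human filtering; a narrow one is precise but misses items.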


March 2000 Gio XIT 8

Search techniques add value

Yahoo: humans catalog and organize useful web sites.

Junglee: integrates diverse sources using wrappers.

AltaVista: automatically surfs and indexes the web.

Excite: also tracks queries and classifies customers.

Firefly: provides customer control over their profiles.

Cookies: track users’ activities between sessions.

Alexa: collects webpages and their usage.

Google: ranks the reference importance of web pages.

. . .


March 2000 Gio XIT 9

Problems for search engines and progress

• Unsuitable source representations
  – part classification: HTML --- XML
  – print formats: PostScript, Adobe PDF
  – non-text: images, sound, video
  – hidden in databases behind CGI scripts
• Inconsistent semantics
  – context distinct / scope / view
• Naïve modeling of customers
  – roles & growth

Search engines cannot solve all problems

Being improved. Rate?


March 2000 Gio XIT 10

Large quantities affect cost

The human genome: ~ 4 000 000 000 base pairs

Genes, and gene abnormalities

1 human

~10 000

proteins ?

diseases

Everybody’s genes: 6 000 000 000 humans

Small organic molecules - affect proteins - suitable for drugs: ~2 000 000 molecules

Metabolic pathways: <1000 systems

Nature Progress


March 2000 Gio XIT 11

Need for precision

[Chart, adapted from Warren Powell, Princeton Univ.: data errors versus information quantity. Labels: human limit, acceptable limit, human with tools?, Information Wall.]

More precision is needed as data volume increases --- a small error rate still leads to too many errors
• False Positives have to be investigated (attractive-looking supplier - makes toys, not real cars; apparent drug-target with poor annotation)
• False Negatives cause lost opportunities, suboptimal to some degree

False positives (= poor precision) typically cost more than false negatives (= poor recall)

Testing a false lead in pharmaceutics costs > $100 000 in stage 1.
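The arithmetic behind "a small error rate still leads to too many errors" can be sketched as follows. The candidate count and the error rate are hypothetical; only the cost of more than $100 000 per false lead comes from the slide.

```python
# Cost of investigating false positives as data volume grows.
COST_PER_FALSE_LEAD = 100_000  # dollars, lower bound quoted on the slide


def false_positives(candidates: int, errors_per_million: int) -> int:
    """Expected number of false leads at a fixed error rate."""
    return candidates * errors_per_million // 1_000_000


def investigation_cost(candidates: int, errors_per_million: int) -> int:
    """Dollars spent checking false leads (stage 1 only)."""
    return false_positives(candidates, errors_per_million) * COST_PER_FALSE_LEAD


# Hypothetical: 2 million candidates screened at a 0.1% false-positive rate.
leads = false_positives(2_000_000, 1_000)
cost = investigation_cost(2_000_000, 1_000)
# Ten times the data at the same rate means ten times the false leads.
leads_10x = false_positives(20_000_000, 1_000)
```

With the rate held constant, the number of false leads grows linearly with volume, so the rate itself has to fall to stay under a fixed human or budget limit.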


March 2000 Gio XIT 12

Heterogeneity among Domains

If interoperation involves distinct domains, mismatch ensues
• Autonomy conflicts with consistency
  – Local Needs have Priority
  – Outside uses are a Byproduct

Heterogeneity must be addressed
• Platform and Operating Systems
• Representation and Access Conventions
• Naming and Ontology


March 2000 Gio XIT 13

Semantic Mismatches

Information comes from many autonomous sources
• Differing viewpoints (by source)
  – differing terms for similar items: { lorry, truck }
  – same terms for dissimilar items: trunk (luggage, car)
  – differing coverage: vehicles (DMV, AIA)
  – differing granularity: trucks (shipper, manuf.)
  – different scope: student museum fee, Stanford
• Hinders use of information from disjoint sources
  – missed linkages: loss of information, opportunities
  – irrelevant linkages: overload on user or application program
• Poor precision when merged
  ok for web browsing, poor for business
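The two failure modes above, missed and irrelevant linkages, can be illustrated with a toy example. The two vocabularies and their concept labels are hypothetical; the sketch only shows what happens when terms from autonomous sources are linked by spelling alone.

```python
# Each source maps its own term to the concept it actually means
# (hypothetical data built from the lorry/truck and trunk examples).
shipper = {"lorry": "cargo vehicle", "trunk": "luggage case"}
dealer = {"truck": "cargo vehicle", "trunk": "car storage compartment"}


def naive_links(a: dict, b: dict) -> set:
    """Link terms purely because they are spelled the same."""
    return set(a) & set(b)


def true_links(a: dict, b: dict) -> set:
    """Pairs of terms that denote the same concept, whatever the spelling."""
    return {(ta, tb) for ta, ca in a.items() for tb, cb in b.items() if ca == cb}


naive = naive_links(shipper, dealer)
truth = true_links(shipper, dealer)

# Irrelevant linkage: same spelling, different meaning.
irrelevant = {t for t in naive if shipper[t] != dealer[t]}
# Missed linkage: same meaning, different spelling.
missed = {(ta, tb) for (ta, tb) in truth if ta != tb}
```

Here the naive match produces one irrelevant link (trunk) and misses the one real link (lorry, truck).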


March 2000 Gio XIT 14

Proposed Solutions

Specify and standardize terminology usage: ontology
• Globally: all interacting sources
  – wonderful for users and their programs
  – long time to achieve: 2 sources (UAL, BA), 3 (+ trucks), 4, … all?
  – costly maintenance, since all sources evolve
  – who has the authority to dictate conformance?
• Domain-specific: XML DTD assumption
  – small, focused, cooperating groups
  – high quality, some examples: genomics, arthritis, Shakespeare plays
  – allows sharable, formal tools
  – ongoing, local maintenance affecting users: annual updates
  – poor interoperation, users still face inter-domain mismatches
• solves only part of the problem


March 2000 Gio XIT 15

Domains and Consistency

• a domain will contain many objects
• the object configuration is consistent
• within a domain all terms are consistent &
• relationships among objects are consistent
• context is implicit

No committee is needed to forge compromises * within a domain

* Compromises hide valuable details

Domain Ontology


March 2000 Gio XIT 16

Objective of Scalable Knowledge Composition

Provide for Maintainable Application Ontologies

• devolve maintenance onto many domain-specific experts / authorities

• provide an algebra to compute composed ontologies that are limited to their articulation terms

• enable interpretation within the source contexts

SKC


March 2000 Gio XIT 17

Sample Operation: INTERSECTION

Source Domain 1: Owned and maintained by Store

Result contains shared terms, useful for purchasing

Source Domain 2: Owned and maintained by Factory

Articulation


March 2000 Gio XIT 18

Tools to create articulations

Graph matcher for Articulation-creating Expert

Vehicle ontology

Transport ontology

Suggestions for articulations


March 2000 Gio XIT 19

Continue from initial point

Tool suggests terms for further articulation:
• by spelling similarity
• by graph position
• by term match nexus

Expert response:
1. Okay
2. False
3. Irrelevant to this articulation

All results are recorded

Okays are converted into articulation rules
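The suggest-and-confirm loop can be sketched as below, covering only the spelling-similarity heuristic. This is not the talk's tool: it substitutes Python's standard `difflib.SequenceMatcher` as the similarity measure, and the term lists, the 0.7 threshold, and the simulated expert answers are all hypothetical.

```python
from difflib import SequenceMatcher


def suggest_by_spelling(terms_a, terms_b, threshold=0.7):
    """Propose term pairs whose spelling is similar; best candidates first."""
    suggestions = []
    for a in terms_a:
        for b in terms_b:
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                suggestions.append((a, b, score))
    return sorted(suggestions, key=lambda s: -s[2])


def review(suggestions, expert_answers):
    """Record every expert response; only 'Okay' pairs become rules."""
    log, rules = [], []
    for a, b, _score in suggestions:
        answer = expert_answers.get((a, b), "Irrelevant")
        log.append((a, b, answer))      # all results are recorded
        if answer == "Okay":
            rules.append((a, b))        # converted into an articulation rule
    return log, rules


vehicle_terms = ["Truck", "Car", "Trunk"]      # hypothetical ontology terms
transport_terms = ["truck", "Cargo", "Lorry"]
suggestions = suggest_by_spelling(vehicle_terms, transport_terms)

# Simulated expert: accepts the real match, rejects a look-alike.
answers = {("Truck", "truck"): "Okay", ("Trunk", "truck"): "False"}
log, rules = review(suggestions, answers)
```

Spelling alone proposes look-alikes such as Trunk/truck and cannot find Truck/Lorry, which is why the expert stays in the loop and why the other two heuristics exist.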


March 2000 Gio XIT 20

Candidate Match Nexus

Term linkages automatically extracted from Webster’s* / Oxford dictionary +

* freely available

+ restricted

Based on processing headwords definitions using algebra primitives

Notice presence of 2 domains: chemistry, transport


March 2000 Gio XIT 21

Using the Match Nexus

Experiment: on government structures of NATO countries, the SKEIN system resolved over 70% of unmatched terms


March 2000 Gio XIT 22

Using the Match Nexus


March 2000 Gio XIT 23

An Ontology Algebra

A knowledge-based algebra for ontologies

The Articulation Ontology (AO) consists of matching rules that link domain ontologies

Intersection: create a subset ontology, keep sharable entries

Union: create a joint ontology, merge entries

Difference: create a distinct ontology, remove shared entries
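A minimal sketch of the three operations on toy data follows. It is a deliberate simplification: an ontology is reduced to a set of terms (real ontologies also carry relationships), and the articulation ontology to a set of (store term, factory term) matching rules. All terms and function names are hypothetical.

```python
# Toy ontologies and expert-supplied matching rules (all hypothetical).
store = {"truck", "price", "shelf", "customer"}
factory = {"lorry", "cost", "assembly line", "supplier"}
articulation = {("truck", "lorry"), ("price", "cost")}


def intersection(a, b, rules):
    """Keep sharable entries: rules whose terms exist in both sources."""
    return {(x, y) for (x, y) in rules if x in a and y in b}


def difference(a, b, rules):
    """Remove shared entries: terms of `a` not linked to any term of `b`."""
    linked = {x for (x, _) in intersection(a, b, rules)}
    return a - linked


def union(a, b, rules):
    """Merge entries: linked pairs become one entry, the rest are kept."""
    shared = intersection(a, b, rules)
    only_a = {(x, None) for x in difference(a, b, rules)}
    only_b = {(None, y) for y in b - {y for (_, y) in shared}}
    return shared | only_a | only_b


shared_terms = intersection(store, factory, articulation)
store_only = difference(store, factory, articulation)
merged = union(store, factory, articulation)
```

Note that the sources themselves are never edited: every result is computed from the two term sets plus the rules, so the rules can be re-applied when a source changes.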


March 2000 Gio XIT 24

INTERSECTION support

Store Ontology

Articulation ontology

Matching rules that use terms from the 2 source domains

Factory Ontology

Terms useful for purchasing


March 2000 Gio XIT 25

Other Basic Operations

typically prior intersections

UNION: merging entire ontologies

DIFFERENCE: material fully under local control

Articulation ontology


March 2000 Gio XIT 26

Features of an algebra

Operations can be composed

Operations can be rearranged

Alternate arrangements can be evaluated

Optimization is enabled

The record of past operations can be kept and reused when sources change


March 2000 Gio XIT 27

Knowledge Composition

[Diagram: knowledge resources A, B, C, D, E. Articulation knowledge is formed for (A ∩ B), (B ∩ C), (C ∩ D), and (C ∩ E). Articulation knowledge for (A ∩ B) ∪ (B ∩ C) ∪ (C ∩ E) gives the composed knowledge for applications using A, B, C, E.]

Legend: ∪ : union, ∩ : intersection
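Assuming the diagram denotes the union of pairwise articulations, (A ∩ B) ∪ (B ∩ C) ∪ (C ∩ E), the composition can be sketched with plain sets. This is a strong simplification: each resource is reduced to a set of term identifiers that are taken to be already aligned, and all terms are hypothetical.

```python
from functools import reduce

# Hypothetical knowledge resources, reduced to sets of aligned terms.
resources = {
    "A": {"truck", "price", "shelf"},
    "B": {"truck", "price", "axle"},
    "C": {"axle", "steel", "route"},
    "D": {"steel", "ore"},
    "E": {"route", "toll"},
}


def articulate(x: str, y: str) -> set:
    """Articulation of two resources: the terms they share."""
    return resources[x] & resources[y]


def compose(pairs) -> set:
    """Union of the pairwise articulations an application needs."""
    return reduce(set.union, (articulate(x, y) for x, y in pairs), set())


# An application using A, B, C, E does not need the (C, D) articulation.
composed = compose([("A", "B"), ("B", "C"), ("C", "E")])
```

Only the articulation terms enter the composed result; terms private to a resource (shelf, ore, toll) stay under local control.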


March 2000 Gio XIT 28

SKC Primitive Operations

Unary
• Summarize -- abstract
• Glossarize -- list terms
• Filter -- reduce instances
• Extract -- move into context

Binary
• Match -- data corroboration
• Difference -- distance measure
• Intersect -- use of articulation
• Union -- search broadening

Constructors
• create object
• create set

Connectors
• match object
• match set

Editors
• insert value
• edit value
• move value
• delete value

Converters
• object - value
• object indirection
• reference indirection

Model and Instance


March 2000 Gio XIT 29

Exploiting the result

Processing & query evaluation is best performed within Source Domains & by their engines

Result has links to source

Avoid n² problem of interpreter mapping


March 2000 Gio XIT 30

Domain Specialization

• Knowledge Acquisition (20% effort) &
• Knowledge Maintenance (80% effort *)

to be performed by
• Domain specialists
• Professional organizations
• Field teams

of modest size

Empowerment: autonomously maintainable

* based on experience with software


March 2000 Gio XIT 31

Summary

To sustain the growth of web usage
1. The value of the results has to keep increasing
   precision, relevance; not volume, nor recall
2. Value is provided by experts, encoded as models of diverse resources, customers

Problems to be addressed
• mismatches
• quality
• maintenance

Needed: clear, scalable models + tools for these tasks


March 2000 Gio XIT 32

Acknowledgments

Supported by AF Office of Scientific Research – New World Vistas program

Participants
• David Maluf, postdoc, PhD EE, McGill Univ., 1997.
• Jan Jannink, PhD candidate, CS, grad. June 2000?
• Shrish Agarwal, MS graduate, CS, 1999.
• Prasenjit Mitra, PhD candidate, EE, grad. 2001?
• Martin Kerstens, PhD, summer visitor from CWI.
• Stefan Decker, postdoc, PhD Univ. Karlsruhe 1999.


March 2000 Gio XIT 33

• April-June 2000, at 14:15 - 15:15, room ?

Presentations in English -- but I'll try to manage discussions in French and/or German.

• I plan to cover the material in an integrating fashion, drawing from concepts in databases, artificial intelligence, software engineering, and business principles.

1. 13/4 Historical background, enabling technology: ARPA, Internet, DB, OO, AI, IR, XML.

2. 27/4 Search engines and methods (recall, precision, overload, semantic problems).

3. 4/5 Digital libraries, information resources. Value of services, copyright.

4. 11/5 E-commerce. Client-servers. Portals. Payment mechanisms, dynamic pricing.

5. 19/5 Mediated systems. Functions, interfaces, and standards. Intelligence in processing. Role of humans and automation, maintenance.

6. 26/5 Software composition. Distribution of functions. Parallelism. [ww D.Beringer]

7. 31/5 Application to Bioinformatics.

8. 15/6 Educational challenges. Expected changes in teaching and learning.

9. 22/6 Privacy protection and security. Security mediation.

10. 29/6 Summary and projection for the future.

• Feedback and comments are appreciated.

Seminar Course on Intelligent Information Systems