april 2002 gio artint 1 information interoperation versus integration gio wiederhold stanford...
Post on 22-Dec-2015
217 views
TRANSCRIPT
April 2002 Gio ArtInt 1
Information
Interoperation versus Integration
Gio Wiederhold
Stanford University
April 2002
www-db.stanford.edu/people/gio.html
Thanks to Jan Jannink, Prasenjit Mitra, Stefan Decker.
April 2002 Gio ArtInt 2
Layout
• Objectives and definitions 3 – 8– information, knowledge, actions
• Integration, articulation, interoperation 9 -- 13• Semantic problems 14 – 20• Effect of the Internet 21 – 26• Precision 27 -- 31• SKC project: 31 -- 52
– Ontology algebra & tools
• Summary 53 -- 59
April 2002 Gio ArtInt 3
Today's sources and needs
Many resources• computing
– databases– analytical
software– simulations
• people– organizations– domains
semantics
Many objectives
• operations– reports– routine actions– optimization
• decision-making– crisis response– allocations– conflict resolution
Require information
April 2002 Gio ArtInt 4
Objective of Information Systems
• Initiate Optimal Actions, i.e., that are – highest in delivering benefits for the costs and risks
Effective -- achieve the goal
Satisfactory -- move towards the goal
Feasible -- valid and possible
Optimality requires perfect -- complete and error-free --
data and knowledge. Not always attainable
The best is often the enemy of the good
April 2002 Gio ArtInt 5
Sources of Information
• Recall of what you have forgotten
• Information from external sources
• news• papers• books
• Induction from Data – Experience
• Processing of Data – Statistics • Combining Information that was previously disjoint
Processing by intermediaries
April 2002 Gio ArtInt 6
Data + Knowledge Information
ActionAction
Knowledge LoopKnowledge Loop
= rules = rules processprocess
ExperienceExperience
Data LoopData Loop
= facts= factsStorageStorage
Education
Education SelectionSelection
IntegrationIntegration
AbstractionAbstraction
Decision-makingDecision-making
State changesState changes
RecordingRecording
ProjectionProjection
Info
rmat
ion
Info
rmat
ionAssim
ilation
Assimilatio
n
April 2002 Gio ArtInt 7
4 Information Technology Tasks
Actionable Information is created at the confluence
of data -- the state &
knowledge -- the ability toselect, integrate, abstract
and project the stateinto the future
Selection: SQL Selection: SQL
one verb languageone verb languageIntegration: Middleware Integration: Middleware
Mediation . Mediation .
Abstraction: VisualizationAbstraction: VisualizationSummarization Summarization
Projection: Simulation Projection: Simulation Spreadsheets Spreadsheets
Mediation * Mediation *
* Focus of this talk
April 2002 Gio ArtInt 8
Encoded Knowledge
Lexicons: collect terms used in information systems
Taxonomies: categorize, abstract, classify terms
Schemas of databases: attributes, ranges, constraints
Data dictionaries: systems with multiple files, owners
Object libraries: grouped attributes, inherit., methods
Symbol tables: terms bound to implemented programs
Domain object models: (XML DTD): interchange terms
Ontologies: networks of terms and relationships
. . . Metadata formalizes Knowledge
April 2002 Gio ArtInt 9
Combining data the obvious way
Application
Data A
MetaData A
Data B
MetaData B
Data AB
MetaData AB }Integrate
1. Integration:
Create a large database,that incorporates all
material in the sources
2. Operation:
Access the integrated dataUpdate the integrated data
when sources change
April 2002 Gio ArtInt 10
or linkages to the data
Data A
MetaData A
Data B
MetaData B
Application
AB
}
Articulate
Rules for
AB
1. Articulation:
Create Links to needed Dataperhaps with rules
2. Interoperation:
Exploit the linkages and rules
Inter- operation
April 2002 Gio ArtInt 11
Benefits and problems
Integration– Long time to establish common metadata more+ Easy for applications, everything is there– Massive replication
– Difficult to maintain as sources are updated
Articulation+ Metadata defined by application need more
+ Rapid implementation and modification– Interoperation via indirect access
+ Mitigated by caching– Many articulations
– Up to one per application+ Affected by selected source changes
April 2002 Gio ArtInt 12
Alternatives for Applications
Data A
MetaData A
Data B
MetaData B
Application
AB
}
Articulate
Rules for
AB
Application
Data A
MetaData A
Data B
MetaData B
Data AB
MetaData AB }
Integrate
Operate&Maintain
Inter-Operate
April 2002 Gio ArtInt 13
Central Solutions do not Scale
What workswith 7 modules(Apps and DBs)and one person in charge
fails when we have 100
and need a committee
Any changes in sources affect the integrated module
April 2002 Gio ArtInt 14
Semantic Heterogeneity:
A Major Cause of Problems:
Integration / Interoperation joins domains
Domains have their own terminologies Need autonomy to deal with knowledge growth
The usage of terms in a domain is efficient Appropriate granularity
Mechanic working on a truck vs. logistics manager Shorthand notations
PS (power steering) vs. PS (partial shipment)
Functions differ in scope Payroll versus Personnel
getting paid vs. available (includes contract staff)
April 2002 Gio ArtInt 15
Why now semantic heterogeneity
Heterogeneity has been addressed
• Platform and Operating Systems
• Representation and Access Conventions • Naming and Ontology
Whenever interoperation involves distinct domains mismatch ensues
• Autonomy conflicts with consistency, – Local Needs have Priority,– Outside uses are a Byproduct
April 2002 Gio ArtInt 16
Domains and Consistency .
• a domain will contain many objects• the object configuration is consistent• within a domain all terms are consistent &• relationships among objects are consistent
• context is implicit
No committee is needed to forge compromises * within a domain
Compromises hide valuable details
Domain Ontology
April 2002 Gio ArtInt 17
SKC grounded definition .
• Ontology: a set of terms and their relationships• Term: a reference to real-world and abstract objects• Relationship: a named and typed set of links between
objects• Reference: a label that names objects• Abstract object: a concept which refers to other objects• Real-world object: an entity instance with a physical
manifestation
April 2002 Gio ArtInt 18
Examples of Semantic Mismatches
Information comes from many autonomous sources• Differing viewpoints ( by source )
– differing terms for similar items { lorry, truck }
– same terms for dissimilar items trunk( luggage, car)
– differing coverage vehicles ( DMV, AIA )
– differing granularity trucks ( shipper, manuf. )
– different scope student ( museum fee, Stanford )
• Hinders use of information from disjoint sources – missed linkages loss of information, opportunities– irrelevant linkages overload on user or application program
• Poor precision when merged
Still ok for web browsing , poor for business & science
April 2002 Gio ArtInt 19
Structural Semantic Heterogeneity
Same concept?If incorporated in differing structures?
April 2002 Gio ArtInt 20
Effect on Information Integration
• Big integrated systems ?Big integrated systems ?– cost, delaycost, delay– obsolescenceobsolescence– inflexibilityinflexibility– unmaintainableunmaintainable
• Composed systems ?Composed systems ?– no or limited controlno or limited control– operational inefficiencyoperational inefficiency– understanding heterogeneityunderstanding heterogeneity– distributed maintenance distributed maintenance
April 2002 Gio ArtInt 21
Integration up-to-dateness
Frequency of source change
frequency of visits Feb.2000
F(capability given 2.2M public sites with 288M pages )
never
1/second
1/minute
1/hour
1/day
1/week
1/month
1/year
as often as possible0 1 ?
100%
50%
0%
%tageup-to-date F(user need)
effort, methods
April 2002 Gio ArtInt 22
Web capabilities: more domains
• World-wide connectivity– internet– world-wide web
• More databases– public & corporate
• Faster communication– digital– packeting: TCP-IP, ATM
• Disintermediation– ubiquitous publishing
Information overload Data starvation
April 2002 Gio ArtInt 23
Global consistency: Hope, but . .
Common semantic assumptions in assembling and integrating distributed information resources
• The language used by the resources is the same
• Sub languages used by the resources are subsets of a globally consistent language
These assumptions are provably false
Working towards the goal of globally consistency is
1. naïve -- the goal cannot be achieved
2. inefficient -- languages are efficient in local contexts
3. unmaintainable – terminology evolves with progress
April 2002 Gio ArtInt 24
Knowledge from the Web
• Knowledge about the data sources– where is what– contents - semantics– scope of contents
• Knowledge to apply the data sources– mapping to application terms– mapping to application scope– integration from multiple sources
April 2002 Gio ArtInt 25
Availability of Knowledge
• XML based schemasDTDs, extended semantics
a very small fraction now
• Embedded or related knowledgetext, dictionaries, context ontologies, DB schemas
the majority of sources
must exploit both flexibly
in order to make progress with KB-processing
costly to develop
April 2002 Gio ArtInt 26
Search engines need knowledge
Yahoo humans catalog and organize useful web sites into hierarchies with crosslinks. About 300 topic experts in 2001. Scalable?demoz public service classification with 3000 editors. Quality?
AltaVista worms (surfs) and indexes all of the web. Frequency/volatility?
Excite also tracks queries and classifies customers. 1D, privacy?
Firefly provides customer control over their profiles. Ease/effective?
Junglee integrates diverse wrapped sources, compares Semantics?
Alexa collects webpages and their usage, keeps old pages Massive, suggests pages with similar access reference pattern. Value?
Google Pagerank gives the reference importance of web pages, using weighted votes, iterated closure. Find new stuff?
and others, used by most of the e-commerce sites .
April 2002 Gio ArtInt 27
Problems for integrating the web
• Unsuitable source representations• part classification: HTML --- XML . . . . . . . . . .• print formats: postscript, adobe PDF . . . . . . .
• non-text: images, video (more ), • sound [Audio Mining (Dragon)], . . . . . . . . .
• hidden in databases behind CGI scripts . . .
• Inconsistent semantics • context distinct / scope / view . . . . . . . . . . . . .
• Naïve modeling of customers/apps• personalization, versus context, roles & growth . . . . . . . . . . . . . . . . . . . . . . . . .
Search engines, trying to deal with all of the web, cannot solve the problems precisely
Being improved.
Rate?
Progress
April 2002 Gio ArtInt 28
Historical note
Ich weiss nicht was soll es bedeuten, ...
• -- an early complaint about semantics [Heinrich Heine: Die Lorelei]
April 2002 Gio ArtInt 29
Precision of Language
• Our languages must serve our needs– Efficiency of expression: minimal symbols
carpenter versus householder domain
– Commitment surgeon versus pathologist
– Locality provides context geographical locality: downtown irrelevant on the webconceptual locality: 5 pounds remains important
Business requires precision, including that terms used map consistently to instances
knowledge controls use of data
April 2002 Gio ArtInt 30
Need for precision
adapted from Warren Powell, Princeton Un.
More precision is needed as data volume increases--- a small error rate still leads to too many errors False Positives have to be investigated ( attractive-looking supplier - makes toys apparent drug-target with poor annotation )
False Negatives cause lost opportunities, suboptimal to some degree
False positives = poor precision typically cost more thanfalse negatives = poor recall
information quantity
human lim
it
acceptable limit
hum
an w
ith to
ols?
data
err
ors
Information Wall
April 2002 Gio ArtInt 31
Relationships among search parameters
volume retrie
ved
volume available
100%
50%
0%space of methods
% tage actually relevant
perfect recall
v.relevantp= v.retrieved
v.relevantr = v.availablerecall
perfect precision
precision
April 2002 Gio ArtInt 32
Large scale demands articulation
• Many source ontologies
• Diverse application domains– with their ontologies
Articulation = application-specific intersection of sources
– keeps and exploits connection to sources
– many articulations are possible
– articulation results are small ( not )
– reliable articulations require consistent sources
– modest amount of knowledge needed
SKC project:Scalable
KnowledgeComposition
April 2002 Gio ArtInt 33
Sample Intersections .
Shoe Store• Shoes { . . . }• Customers { . . . }• Employees { . . . }
size = size color =table(colcode)
style = style
Ana-tomy {. . . }
• Material inventory {...}• Employees { . . . }• Machinery { . . . }• Processes { . . . }• Shoes { . . . }
Shoe Factory
Hard-ware
Articulation ontology
matching rules :
foot = foot Employees Employees Nail (toe, foot) Nail (fastener). . . . . .
Department Store
April 2002 Gio ArtInt 34
System setting: mediation
data and simulation resources
value-added services
decision-making support applicationsApplication Layer
Mediation Layer
Foundation Layer
April 2002 Gio ArtInt 35
Sample Operation: INTERSECTION
Source Domain 1:Owned and maintained by Store
Result contains shared terms,useful for purchasing
Source Domain 2:Owned and maintainedby Factory
Articulation
April 2002 Gio ArtInt 36
Sample Processing in HPKB
• What is the most recent year an OPEC member nation was on the UN security council?– Related to DARPA HPKB
Challenge Problem– SKC resolves 3 Sources
CIA Factbook ‘96 (nation) OPEC (members, dates) UN (SC members, years)
– SKC obtains the Correct Answer
1996 (Indonesia)
– Other groups obtained more,
but factually wrong answers
– Problems resolved by SKC* Factbook has out of date
OPEC & UN SC lists Indonesia not listed Gabon (left OPEC 1994)
* different country names Gambia => The Gambia
* historical country names Yugoslavia
UN lists future security council members
Gabon 1999 intent of original question
Temporal variants
April 2002 Gio ArtInt 37
Many ontologies Interoperation
Have devolved maintenance onto domain-specific experts / authorities
• Many articulations
But applications often span multiple domains
• Can't make domains bigger and keep them consistent
• Need an algebra to compute composed domains, – whose ontologies are limited to
the terms defined by their articulations
SKCSKCSKCSKC
April 2002 Gio ArtInt 38
An Ontology Algebra
A knowledge-based algebra for ontologies
The Articulation Ontology (AO) consists of matching rules that link domain ontologies
Intersection create a subset ontology keep sharable entries
Union create a joint ontology merge entries
Difference create a distinct ontology remove shared entries
April 2002 Gio ArtInt 39
Other Basic Operations
typically priorintersections
UNION: mergingentire ontologies
DIFFERENCE: materialfully under local control
Arti-culation ontology
April 2002 Gio ArtInt 40
Features of an algebra
Operations can be composed
Operations can be rearranged
Alternate arrangements can be evaluated
Optimization is enabled
The record of past operations can be
kept and reused when sources change
April 2002 Gio ArtInt 41
INTERSECTION support
Store Ontology
Articulation ontology
Matching rules that use terms from the 2 source domains
Factory Ontology
Terms usefulfor purchasing
April 2002 Gio ArtInt 42
Knowledge CompositionArticulationknowledgefor U
U
U
(A B)U
(B C)U
(C E)
Knowledge resource
B
Knowledge resource
A
Knowledge resource
C Knowledge
resourceD
U
(C D)
U
(B C)
Articulationknowledge
Composed knowledge forapplications using A,B,C,E
Knowledge resource
E
U
(C E)
Legend:
U : unionU: intersection
Articulationknowledgefor (A B)
U
April 2002 Gio ArtInt 43
Tools to create articulations
Combine ontology graphs with expert selection based on spelling, graph matching, and a nexus derived from a dictionary (O.E.D.)
Veh
icle
reg
istr
atio
n
on
tolo
gy
Veh
icle
sale
s o
nto
log
y
more
April 2002 Gio ArtInt 44
1. Initialize by word matching
Graph matcherforArticulation- creatingExpert
Vehicle ontology
Transport ontology
Suggestionsfor articulations
April 2002 Gio ArtInt 45
2. continue from initial point
Also suggest similar terms for further articulation:
• by spelling similarity,• by graph position• by term match repository
Expert response:1. Okay2. False3. Irrelevant to this articulation
All results are recorded
Okay’s are converted into articulation rules
April 2002 Gio ArtInt 46
3. A Term Net to Propose Match Candidates
Term linkages automatically extracted from 1912 Webster’s dictionary *
* free, other sources .have been processed.
Based on processing headwords definitions using algebra primitives
Notice presence of 2 domains: chemistry, transport
April 2002 Gio ArtInt 47
4. Using the term net
April 2002 Gio ArtInt 48
5. Navigating the term net
April 2002 Gio ArtInt 49
Primitive Operations
Unary• Summarize -- structure up• Glossarize - list terms• Filter - reduce instances• Extract - circumscription
Binary • Match - data corrobaration• Difference - distance
measure• Intersect - schem discovery• Blend - schema extension
Constructors• create object• create setConnectors• match object• match setEditors• insert value• edit value• move value• delete valueConverters• object - value• object indirection• reference indirection
Model and Instance
April 2002 Gio ArtInt 50
Matching rules (in process)
• A equals U• A equals {u, … }• A equals to Union (U, V) • A equals to Union (U, v ...) • A equals to Intersection (U, W) • A equals to Intersection (U, w ...) • A equals to Difference (U, X) • A equals to Difference (U, x ...)
must obey algebraic
properties
April 2002 Gio ArtInt 51
Future: exploiting the result
Processing & query evaluation is best performed within Source
Domains & by their engines
Result has linksto source
Avoid n2 problem of interpretermapping as stated by Swartout as an issue in HPKB year 1
April 2002 Gio ArtInt 52
Innovation in SKC
• No need to harmonize full ontologies• Focus on what is critical for interoperation• Rules specific for articulation• Potentially many sets of articulation rules
• Maintenance is distributed– to n sources– to m articulation agents
is m < n2 , depends on architecture density a research question
April 2002 Gio ArtInt 53
Application• Informal, pragmatic
• Client-control
• Use up to 72 mediators
Mediation• Formal, reliable service
• Domain-Expert control
• Use up to 72 sources
Integration at higher Levels
April 2002 Gio ArtInt 54
Domain Specialization .
• Knowledge Acquisition (20% effort) &• Knowledge Maintenance (80% effort *)
to be performed• Domain specialists• Professional organizations• Field teams of modest size
Empowermentautonomouslymaintainable
* based on experience with software
April 2002 Gio ArtInt 55
Summary
To sustain the growth of web usage1. The value of the results has to keep increasing
precision, relevance not volume2. Value is provided by mediating experts,
encoded as models of diverse resources, diverse customersProblems being addressed redundancy mismatches quality maintenance
Need to develop flexible tools for these tasks
} Clear models
April 2002 Gio ArtInt 56
Integration Science ?
IntegrationScience
IntegrationScience
ArtificialIntelligence
knowledge mgmtlinguistic models
uncertainty
ArtificialIntelligence
knowledge mgmtlinguistic models
uncertainty
Systems Engineering
analysisdocumentation
costing
Systems Engineering
analysisdocumentation
costing
Databasesaccessstoragealgebras
scalability
Databasesaccessstoragealgebras
scalability
April 2002 Gio ArtInt 57
E-commerce
• Gartner: 2000 prediction for 2004: 7.3 T$
• Revision:2001 prediction for 2004: 5.9 T$ drastic loss?
Growth and Perception
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ...
Realist
ic growth
Perceived growth
Invisible growth
Extrapolated growth
Disap- pointment Combi-
natorial growth
Perceived initial growthPerception level
ExamplesArtificial IntelligenceDatabasesNeural networksE-commerce
50 companies, each after 20% of the market
Failures
April 2002 Gio ArtInt 58
E-Commerce Trends 1998/1999/ ...
• Users of the Internet 40% 52% of U.S. population
• Growth of Net Sites (now 2.2M public sites with 288M pages)• Expected growth in E-commerce by Internet users [BW, 6 Sep.1999]
segment 1998 1999– books 7.2% 16.0%– music & video 6.3% 16.4%– toys 3.1% 10.3%– travel 2.6% 4.0%– tickets 1.4% 4.2%– Overall 8.0% 33.0% = $9.5Billion
• Will change all retail businesses, when? An unstainable trend cannot be sustained [Herbert Stein]
new services
98 99 00 01 02 03 04 0.3 1 3 9 27 81 **
90 80 70 60 50 40 30 20 10
0
Year / %
%
Centroid, in 1999 ~1% of total market
E-penetration Toys
April 2002 Gio ArtInt 59
Stanford
Some Groups at Stanford University• KSL: Feigenbaum, Fikes et al.• Logic: Genesereth, Koller, McCarthy, Nielsen et al.• Industrial: Petrie et al.• Medical Informatics: Musen et al.• Infolab: Garcia-Molina, Manning, Ullman, Wiederhold
www-db.stanford.edu/people/gio.html
The configurations are not fixed,
projects specifically can include a variety of combinations
April 2002 Gio ArtInt 60
Final Panel
• Agreement with Colette Rolland and Arne Solvberg– consider the research question as a Hypothesis (Hx)
note that the job is to Convert a hypo (less than) -thesis to a thesis
– justify the Hx in terms of a real-world problem
• Computer Science is broadening into all areas of human endeavor– Most new reserachers will interact with other disciplines– Many new entrants will contribute competencies from other areas
• This means that we all will need to develop an understanding for the differences among these areas
April 2002 Gio ArtInt 61
Differences among Sciences
Scientific incompatibilities sciences have often been characterized as
• differences in language, perhaps, mor formally• differences in Ontology, and • differences in problem decomposition
– alternate high-level objectives– divide-and-conquer paradigms
One person's (or discipline's) optimal approach is viewed by others as a bureaucracy to be fought
There are a more fundamental differences ..
April 2002 Gio ArtInt 62
Paradigms of Science
Disciplines use distinct scientific models1. Mathematical Sciences:
• Formal proofs, to hold in all cases. • A single exception invalidates the proof• Complexity is dealt with explicitly (big Oh)
2. Engineering sciences• recognition of alternative approaches to a goal• A thesis shows the best (cost-benefit) case in some range
3. Social Sciences• surveys can measure truth of a Hx to some degree of belief• having some exceptions does not invalidate the Hx
4. Biological Sciences• cases provide distinct examples and • prediction assumes a linear world
One should know in which paradigm one's research is performed