Learning Retrieval Knowledge from Data

Helge Langseth
Norwegian University of Science and Technology, Dept. of Mathematical Sciences
Agnar Aamodt
Norwegian University of Science and Technology, Dept. of Computer and Information Science
Ole Martin Winnem
SINTEF Telecom and Informatics, Dept. of Computer Science
Work partly performed within NOEMIE, ESPRIT project no. 22312
Participants: NTNU, SINTEF, Saga, JRC, Schlumberger, Matra, Acknosoft, Dauphine
NTNUSlide no.: 2
Outline
• Background / NOEMIE project
• CREEK
• A data mining method
• Integrating semantic networks with automatically generated network structures:
  – Problems with the semantics
  – Benefits
• Initial empirical results
Data and User views
(Figure: the data view and the user view of the task reality.)
Study of the task reality
(Figure: the task reality is studied along two paths — experience gathering yields past cases and general domain knowledge, which feed CBR; data capturing fills a data warehouse, which feeds DM.)
An example case

case-16
  instance-of                value  case
  has-activity               value  tripping-in, circulating
  has-depth-of-occurrence    value  5318
  has-task                   value  solve-lc-problem
  has-observable-parameter   value  high-pump-pressure, high-mud-density-1.41-1.7kg/l, high-viscosity-30-40cp, normal-yield-point-10-30-lb/100ft2, large-final-pit-volume-loss->100m3, long-lc-repair-time->15h, low-pump-rate, low-running-in-speed-<2m/s, complete-initial-loss, decreasing-loss-when-pump-off, very-depleted-reservoir->0.3kg/l, tight-spot, high-mud-solids-content->20%, small-annular-hydraulic-diameter-2-4in, small-leak-off/mw-margin-0.021-0.050kg/l, very-long-stands-still-time->2h
  has-well-section-position  value  in-reservoir-section
  has-failure                value  induced-fracture-lc
  has-repair-activity        value  pooh-to-casing-shoe, waited-<1h, increased-pump-rate-stepwise, lost-circulation-again, pumped-numerous-lcm-pills, no-return-obtained, set-and-squeezed-balanced-cement-plug
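Such a case is essentially an attribute–value frame. A minimal sketch of how case-16 could be held in memory (the dict representation and the `findings` helper are illustrative, not the Creek frame system, and the value lists are abbreviated):

```python
# Illustrative attribute-value representation of case-16 (abbreviated).
case_16 = {
    "instance-of": ["case"],
    "has-activity": ["tripping-in", "circulating"],
    "has-depth-of-occurrence": [5318],
    "has-task": ["solve-lc-problem"],
    "has-observable-parameter": [
        "high-pump-pressure",
        "high-mud-density-1.41-1.7kg/l",
        "complete-initial-loss",
        # ... remaining observable parameters omitted for brevity
    ],
    "has-well-section-position": ["in-reservoir-section"],
    "has-failure": ["induced-fracture-lc"],
    "has-repair-activity": [
        "pooh-to-casing-shoe",
        "set-and-squeezed-balanced-cement-plug",
        # ... remaining repair steps omitted for brevity
    ],
}

def findings(case):
    """Collect the observable findings that serve as retrieval indexes."""
    return set(case["has-observable-parameter"])
```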
Initial design
(Figure: user experiences, problem descriptions and solutions enter a data warehouse (DW); a Controller mediates between the Data Mining and Case-based reasoning components.)
Tangled CreekL Network
(Figure: a tangled CreekL semantic network rooted at thing, spanning domain objects such as transportation, vehicle, car, van, engine, battery, starter-motor, wheel, fuel-system and electrical-system; faults such as electrical-fault, battery-fault, engine-fault, car-fault, fuel-system-fault and broken-carburettor-membrane; findings such as battery-low, engine-turns and starter-motor-turns; cases and tasks such as case#54, N-DD-234567, diagnosis, diagnostic-case and diagnostic-hypothesis; goals find-fault and find-treatment; and test knowledge such as test-procedure, test-step, engine-test and turning-of-ignition-key. Relation types: hsc = has-subclass, hi = has-instance, hp = has-part, hd = has-descriptor, plus subclass-of, instance-of, case-of, status-of, has-status, possible-status-of, has-state, has-function, has-fault, has-output, described-in, part-of, tested-by and test-for.)
Suitable DM methods must be:
• Able to generate structures from data, including a method for using (and updating) the domain expert's model
• Able to learn new entities when exposed to new data
• Sufficiently expressive; limited models (like decision trees) are not suitable
• Open for inspection, since our system performs explanation-driven CBR
• Able to handle uncertainty: as we work in open, weak-theory domains, we cannot expect a deterministic structure to capture the main effects
• Semantically similar to a semantic network structure

Bayesian networks are our initial method of choice, although there are significant differences which impose some limitations on the integration. Other methods (e.g. ILP) are candidates for future activities.
Bayesian networks (BN)

Left: Alarm (A) is caused by earthquake (E) and burglary (B). Alarm is independent of radio (R) given E and B.
Right: The degree of belief in A (and not A) given the state of E and B. Ex.: belief in A is 0.2 given E and not B (2nd row).

• A computationally efficient representation of probability distributions, exploiting conditional independence among the attributes/states of a domain.
• Has a qualitative part (below left), representing statistical dependence/independence statements. Can often be interpreted as a causal model among states.
• Has a quantitative part (below right), representing conditional probability values for a specific state given one or more other states. Can be interpreted as a degree of belief in one state given other states.
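The alarm example can be made concrete. A minimal sketch, with made-up CPT numbers except for the 0.2 (belief in A given E and not B) mentioned on the slide:

```python
# Earthquake/Burglary/Alarm network with illustrative CPTs.
# Only P(A=1 | E=1, B=0) = 0.2 comes from the slide; the rest is made up.
from itertools import product

p_e = {1: 0.01, 0: 0.99}             # P(Earthquake)
p_b = {1: 0.05, 0: 0.95}             # P(Burglary)
p_a = {                               # P(Alarm = 1 | E, B)
    (1, 1): 0.95, (1, 0): 0.2,
    (0, 1): 0.9,  (0, 0): 0.01,
}

def joint(e, b, a):
    """P(E=e, B=b, A=a) via the factorization P(E) P(B) P(A | E, B)."""
    pa1 = p_a[(e, b)]
    return p_e[e] * p_b[b] * (pa1 if a == 1 else 1 - pa1)

# Posterior P(B=1 | A=1): sum the joint over the unobserved E, then normalize.
num = sum(joint(e, 1, 1) for e in (0, 1))
den = sum(joint(e, b, 1) for e, b in product((0, 1), repeat=2))
posterior = num / den    # belief in burglary after hearing the alarm
```

Observing the alarm raises the belief in burglary well above its prior of 0.05, which is exactly the kind of evidence propagation the quantitative part supports.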
(Figure: revised architecture. User experiences, problem descriptions and solutions enter the DW; a Controller connects Data Mining and Case-based reasoning. Data mining is split into causal DM (BNs) and general DM (clustering, time series, etc.); the CBR side combines knowledge-intensive CBR (Creek) with "data driven" CBR. Information flow: 1) data preprocessing/cleaning; 2) structure learning and parameter tuning in the Bayesian network; 3) generation of similarity matrices etc.)
CBR and BN integration: General picture
(Figure: user DBs and general-purpose DBs feed the case base; general domain knowledge is partly human generated and partly machine generated; knowledge-intensive CBR draws on both, while general data mining and causal data mining produce the machine-generated knowledge.)
The experiment of Heckerman et al.
(Figure: a sample database of 10,000 cases over 37 discrete variables x1, x2, x3, …, x37, and the network structures over the 37 nodes recovered by structural learning, with differences such as deleted arcs highlighted.)
Generating networks:
• Initialize the network
repeat
  • Propose some change to the structure
  • Fit parameters to the new structure
  • Evaluate the new network according to some measure (e.g. BIC, AIC, MDL)
  • If the new network is better than the previous one, keep the change
until finished
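The loop above can be sketched as greedy hill-climbing with a BIC score. This is a generic illustration for binary variables with complete data, not the NOEMIE implementation:

```python
# Greedy BN structure search: propose an edge toggle, score with BIC,
# keep the change if it improves the score. Illustrative sketch only.
import numpy as np
from itertools import product

def family_bic(data, child, parents):
    """BIC contribution of `child` given its parent set (binary variables)."""
    n = data.shape[0]
    ll = 0.0
    pcols = data[:, parents]                       # shape (n, len(parents))
    for cfg in product((0, 1), repeat=len(parents)):
        mask = np.all(pcols == cfg, axis=1)
        m = int(mask.sum())
        if m == 0:
            continue
        k1 = int(data[mask, child].sum())          # count of child = 1
        for k in (k1, m - k1):
            if k > 0:
                ll += k * np.log(k / m)            # maximized log-likelihood
    return ll - 0.5 * (2 ** len(parents)) * np.log(n)   # BIC penalty

def acyclic(parents, n_vars):
    """Kahn's algorithm: True iff the graph admits a topological order."""
    indeg = {v: len(parents[v]) for v in range(n_vars)}
    queue = [v for v in range(n_vars) if indeg[v] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for v in range(n_vars):
            if u in parents[v]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    queue.append(v)
    return seen == n_vars

def learn_structure(data):
    """Repeat: propose a change, fit/score it, keep it if BIC improves."""
    n_vars = data.shape[1]
    parents = {v: [] for v in range(n_vars)}
    improved = True
    while improved:
        improved, best = False, None
        for i in range(n_vars):
            for j in range(n_vars):
                if i == j:
                    continue
                # Proposed change: add edge i -> j, or remove it if present.
                new_pa = ([p for p in parents[j] if p != i]
                          if i in parents[j] else parents[j] + [i])
                cand = dict(parents)
                cand[j] = new_pa
                if not acyclic(cand, n_vars):
                    continue
                gain = family_bic(data, j, new_pa) - family_bic(data, j, parents[j])
                if gain > 1e-9 and (best is None or gain > best[0]):
                    best = (gain, j, new_pa)
        if best is not None:
            _, j, new_pa = best
            parents[j] = new_pa
            improved = True
    return parents

# Toy data: x1 is a noisy copy of x0, x2 is independent noise.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 2000)
x1 = x0 ^ (rng.random(2000) < 0.1).astype(int)
x2 = rng.integers(0, 2, 2000)
parents = learn_structure(np.column_stack([x0, x1, x2]))
```

On this toy data the search reliably links x0 and x1, while the BIC penalty discourages spurious arcs to x2.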
BNs are powered by Conditional Independencies
(Figure: a network over Age, Gender, Exposure To Toxic, Smoking, Cancer, Serum Calcium and Lung Tumour.)
Cancer is independent of Age and Gender given Exposure To Toxic and Smoking.
Bayesian Networks: semantics
(Figure: a network over the variables S, C, L, E, X, D.)

conditional independencies in BN structure + local probability models = full joint distribution over domain

P(s, c, l, e, x, d) = P(s) P(c) P(l|s) P(e|s,c) P(x|l) P(d|l,e)

• Compact & natural representation:
  – nodes have ≤ k parents ⇒ O(2^k · n) vs. O(2^n) parameters
  – parameters are natural and easy to elicit.

Slide taken from Nir Friedman: “Learning the Structure of Probabilistic Models”
Can we learn causation from data?
Can we learn causation … (continued)
The newspaper's theory: “The Bimbo Theory”.
(Figure: two networks over Sex, Clothes, IQ and Test result, differing only in the direction of their arcs.)
The “meaning” is different, but the two networks are equally plausible given the newspaper story.
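That the two networks are equally plausible can be checked numerically: Markov-equivalent structures (here the two-variable case, X → Y versus Y → X) reach exactly the same maximum likelihood on any data set. A small sketch with made-up counts:

```python
# Two Markov-equivalent structures fit any data set equally well:
# P(x)P(y|x) and P(y)P(x|y) both factor the same joint distribution.
from collections import Counter
from math import log

def max_loglik(pairs, direction):
    """Maximized log-likelihood of (x, y) pairs under X->Y or Y->X."""
    n = len(pairs)
    joint = Counter(pairs)                       # counts of each (x, y)
    ll = 0.0
    if direction == "x->y":
        marg = Counter(x for x, _ in pairs)      # factor as P(X) P(Y|X)
        for (x, y), c in joint.items():
            ll += c * (log(marg[x] / n) + log(c / marg[x]))
    else:
        marg = Counter(y for _, y in pairs)      # factor as P(Y) P(X|Y)
        for (x, y), c in joint.items():
            ll += c * (log(marg[y] / n) + log(c / marg[y]))
    return ll

# Made-up data: X and Y strongly associated, direction unknown.
pairs = [(0, 0)] * 40 + [(0, 1)] * 10 + [(1, 0)] * 5 + [(1, 1)] * 45
```

Both directions collapse to the same value, which is why observational data alone cannot pick between them.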
Inferred Causation
Integration of BN and EDoMo
(Figure: a fragment of the EDoMo for car faults. Causal chain: condensation-in-gas-tank causes water-in-gas-tank, which causes water-in-gas-mixture, which causes no-chamber-ignition, which causes engine-does-not-fire; carburettor-valve-stuck causes too-rich-gas-mixture-in-cylinder, which also causes no-chamber-ignition. Taxonomic and structural links (hsc, hi, hp) relate fuel-system, carburettor, fuel-system-fault, carburettor-fault, carburettor-valve-fault, observable-state, observed-finding and the finding engine-turns; has-fault links attach the faults to their components.)
Integration Level: Low / Medium / High
• Purpose: (Low) — ; (Medium) domain-level integration; (High) inference-level integration
• Data source: (Low) separate data files; (Medium) common data format, different use; (High) everything represented as frames
• Typical BN inference task: (Low) retrieve cases; (Medium) Explain Similarity(AttrA, AttrB); (High) no dedicated BN inference unit
• EDoMo verification: (Low) no verification; (Medium) verify substructures by examining “hidden nodes” and KL divergence; (High) verification on arc level — IMPOSSIBLE?
Effect of Evidence During BN-retrieve
(Figure: observed findings propagate through the domain-model attributes to the case nodes in the BN.)
Case Indexing During BN-retain
(Figure: the index structure in the BN vs. the index structure in Creek, with remindings drawn solid and causal links dotted. For Case#1, only F#1 is observed, and F#2 is relevant through its influence on F#1; for Case#2, both F#1 and F#2 are relevant, with F#1 relevant through its influence on F#2.)
Validation of EDoMo
(Figure: the car-fault model fragment from the previous EDoMo slide, compared in two variants — with a hidden node and with no hidden nodes.)
Test: KL-div < α? If so, OK.
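The KL-div < α test can be sketched as follows; the two distributions and the threshold α are made up for illustration:

```python
# Sketch of the validation test: compare the distribution predicted by the
# model with a hidden node against the simplified model without it, and
# accept the simplification when KL(P || Q) stays below a threshold alpha.
# Both distributions and alpha are illustrative numbers.
from math import log

def kl_divergence(p, q):
    """KL(P || Q) for distributions over the same finite outcome set."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_hidden    = [0.70, 0.20, 0.10]   # model with the hidden node
p_no_hidden = [0.68, 0.22, 0.10]   # simplified model, hidden node removed
alpha = 0.05                       # acceptance threshold

ok = kl_divergence(p_hidden, p_no_hidden) < alpha
```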
Advantages of BN+CBR combination
The BN model strengthens:

CBR Retrieval, by
• reducing the number of indexes needed to identify a case, due to the interdependency of indexes in the BN
• matching cases with syntactically different but semantically similar features

CBR Reuse, by
• suggesting solution adaptation based on a causal explanation from within the BN
• explaining results to the user

CBR Retain, by
• checking the inter-consistency of case features (indexes) and identifying relevant features when storing a new case
• learning general domain knowledge by updating the BN
Setup of Empirical Study
• Generate a BN from the semantic network of the drilling-fluid domain:
  – Select entities manually
  – Use causal and taxonomic links as prior
  – Structural learning
• Enter parts of a known case (Case-16) as a new situation to both the CBR system and the BN.
• Evaluate differences in the retrieved cases, and compare the quality of the retrieval regarding both the ability to score similar cases highly and to penalize weaker correspondence.
Preliminary empirical results
• Generated a BN with 146 links between 128 cases from the semantic net of 1254 entities and 2434 relationships. Structural learning was difficult because of too small an overlap between the data and user views.
• Both methods were able to select Case-16 as the best fit; discrepancies otherwise.
• The BN separated well between good and not-so-good matches.
(Chart: number of cases per similarity score, in the buckets 0–0.5, 0.5–0.75 and >0.75.)
Further research / still to come:
• Perform a more elaborate empirical study
• Examine other machine learning methods in addition to BNs (ILP is a strong candidate)
• Look into different ways of collaboration between the two models (e.g. BN used only to activate)
• Continue our effort to make BNs as well suited as possible for the integration with CBR
• Extend the methods to handle time sequences (e.g. to handle a planning task)
• Examine the use of “event-type” DBs (discrepancy DBs) for automatic case generation through data mining
Others doing the job for us:
• Daphne Koller's group at Stanford: extending the expressiveness of a BN
• Elisabeth van de Stadt (TU Delft): spread-activation algorithm for BNs
• Judea Pearl's group at UCLA: causation in probabilistic models
• Friedman & Goldszmidt: learning BN structure from data
• Many more …
Finishing Statement

“His world is built up by rules. Therefore he can never be as quick or as smart as we can be.”

— Morpheus describing an opposing agent in the movie “The Matrix”