Page 1: Probabilistic RDF

Probabilistic RDF

Octavian Udrea (1), V.S. Subrahmanian (1), Zoran Majkić (2)

(1) University of Maryland, College Park
(2) University “La Sapienza”, Rome, Italy

Page 2: Probabilistic RDF

Motivation

Not all information on the Web is easily expressible in “classic” models (i.e., relational)

RDF extraction from text
  STORY is the first, very successful prototype
  Need to extend RDF with temporal and uncertainty components

Goal: build a logical model of RDF with uncertainty and provide query algorithms

Page 3: Probabilistic RDF

The Probabilistic RDF idea

An RDF theory is a set of triples (subject, property, value)
  Examples: (USA hasCapital Washington DC), (Washington DC hasPopulation 500,000)

Probabilistic RDF extends this model with uncertainty over the set of values.

(USA hasCapital {(Washington DC, 0.95), (State of Washington, 0.05)})
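
One way to picture the extended triple is as a subject and property paired with a map from candidate values to probabilities. The sketch below is only illustrative: the class name `PTriple`, its field names, and the helper `probability` are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PTriple:
    """A probabilistic triple: (subject, property, {value: probability})."""
    subject: str
    prop: str
    values: Dict[str, float]  # plays the role of the function delta


def probability(triple: PTriple, value: str) -> float:
    """Probability attached to one candidate value (0.0 if it is not listed)."""
    return triple.values.get(value, 0.0)


# The hasCapital example from the slide.
capital = PTriple(
    "USA",
    "hasCapital",
    {"Washington DC": 0.95, "State of Washington": 0.05},
)
```

A call such as `probability(capital, "Washington DC")` then looks up the probability recorded for that value.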

Page 4: Probabilistic RDF

Probabilistic RDF example

[Figure: example pRDF graph of medical conditions and symptoms. Nodes include Condition, Infection, BacterialInfection, ViralInfection, Respiratory, Digestive, Sleeping, FoodPoisoning, Botulism, E-ColiPoisoning, Flu, AcuteBronchitis, Pneumonia, Middle Ear Infection, Emphysema, Cor pulmonale, Symptom, Metabolic, Mental, Fatigue. Edge labels: subClassOf, rdf:type, hasComplication, associatedWith, causeOf, subPropertyOf. Some edges carry probabilities, for example subClassOf .85 and .15, hasComplication .7, associatedWith .65.]

Extracted based on www.wrongdiagnosis.com

Page 8: Probabilistic RDF

Probabilistic RDF syntax

Schema uncertainty: (c subClassOf (C, δ)), with Σ_{d ∈ C} δ(d) <= 1

Class-instance uncertainty: (x rdf:type (C, δ)), with Σ_{d ∈ C} δ(d) <= 1

Instance-based uncertainty: (x p (Y, δ)), with Σ_{y ∈ Y} δ(y) <= 1
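
All three forms share the same constraint: the probabilities assigned by δ must sum to at most 1. A minimal check of that constraint might look as follows; the function name `is_valid_delta` and the floating-point tolerance are assumptions made for this sketch, not part of the slides.

```python
import math
from typing import Dict


def is_valid_delta(delta: Dict[str, float], tol: float = 1e-9) -> bool:
    """True if every probability lies in [0, 1] and the total is at most 1."""
    in_range = all(0.0 <= p <= 1.0 for p in delta.values())
    # fsum limits rounding error; tol absorbs what remains (an assumption).
    return in_range and math.fsum(delta.values()) <= 1.0 + tol


ok = is_valid_delta({"Washington DC": 0.95, "State of Washington": 0.05})
too_much = is_valid_delta({"a": 0.7, "b": 0.4})
```

Here `ok` holds for the hasCapital example, while `too_much` is rejected because 0.7 + 0.4 exceeds 1.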

Page 9: Probabilistic RDF

Probabilistic RDF syntax

Sanity requirements
  (c subClassOf (C1, δ1)) and (c subClassOf (C2, δ2)) => (C1 = C2 and δ1 = δ2) or C1 ∩ C2 = Ø
  Same applies for other types of uncertainty

Transitive properties
  Simple inferential capability
  Examples: associatedWith, controlledBy

P-path: A set of triples connected by transitive properties
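
The sanity requirement says that two statements about the same subject and property must either coincide or talk about disjoint value sets. A small sketch of that test, assuming each (C, δ) pair is stored as a dictionary from value to probability (the name `compatible` is hypothetical):

```python
from typing import Dict


def compatible(delta1: Dict[str, float], delta2: Dict[str, float]) -> bool:
    """Sanity requirement for two value distributions on the same subject and property.

    Either both the value sets and the probabilities are identical,
    or the two value sets share no element.
    """
    same = delta1 == delta2                      # C1 = C2 and delta1 = delta2
    disjoint = not (delta1.keys() & delta2.keys())  # C1 and C2 do not intersect
    return same or disjoint
```

For instance, `{"A": 0.5}` and `{"B": 0.2}` are compatible (disjoint), whereas `{"A": 0.5}` and `{"A": 0.3, "B": 0.2}` are not, because they overlap on `A` with different probabilities.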

Page 10: Probabilistic RDF

Example p-path

[Figure: the example pRDF graph from page 4, shown again to illustrate a p-path.]

Page 11: Probabilistic RDF

P-path semantics and t-norms

We cannot generally assume independence between triples on a transitive path
  Example: Flu, AcuteBronchitis, Pneumonia

T-norms are used to express the user’s knowledge of the relationship between triples
  ⊗ is associative and commutative
  0 ⊗ x = 0, 1 ⊗ x = x
  x <= y, z <= w => x ⊗ z <= y ⊗ w

P-path probability: t-norm applied to individual probabilities on the path
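
To make the p-path probability concrete, the sketch below folds a t-norm over the probabilities along a path. Only the product t-norm is mentioned in the slides; the minimum and Łukasiewicz t-norms are standard textbook examples added here for comparison, and all function names are hypothetical.

```python
from functools import reduce
from typing import Callable, Iterable

TNorm = Callable[[float, float], float]


def t_product(x: float, y: float) -> float:
    """Product t-norm (corresponds to treating the triples as independent)."""
    return x * y


def t_minimum(x: float, y: float) -> float:
    """Minimum (Goedel) t-norm."""
    return min(x, y)


def t_lukasiewicz(x: float, y: float) -> float:
    """Lukasiewicz t-norm."""
    return max(0.0, x + y - 1.0)


def path_probability(probs: Iterable[float], tnorm: TNorm) -> float:
    """Combine the probabilities on a p-path; 1.0 is the identity of any t-norm."""
    return reduce(tnorm, probs, 1.0)


# Probabilities .7 and .65 on the path Flu, AcuteBronchitis, Pneumonia.
p_prod = path_probability([0.7, 0.65], t_product)
p_min = path_probability([0.7, 0.65], t_minimum)
```

With the product t-norm this gives 0.7 × 0.65 = 0.455, while the minimum t-norm would give 0.65 for the same path, which shows how the choice of t-norm encodes different assumptions about how the triples relate.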

Page 12: Probabilistic RDF

Example p-path

[Figure: the example pRDF graph from page 4, shown again with the derived triple below.]

(Flu, associatedWith, (Pneumonia, 0.455)) w.r.t. the product t-norm

Page 13: Probabilistic RDF

pRDF semantics

A world W is a set of simple triples (with no probabilities)

An interpretation I associates a probability to each world

I satisfies a pRDF theory:
  For each (s, p, (V, δ)) and each v ∈ V, δ(v) <= Σ I(W), where the sum ranges over the worlds W that contain (s, p, v)
  Same applies to paths w.r.t. a given t-norm
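
The satisfaction condition for plain triples can be checked mechanically once worlds and interpretations are given explicitly. The sketch below assumes a world is a frozenset of (s, p, v) tuples and an interpretation is a list of (world, probability) pairs; these representations, the name `satisfies`, and the tolerance are illustrative choices, and the path condition is not covered.

```python
from typing import Dict, FrozenSet, List, Tuple

Triple = Tuple[str, str, str]
World = FrozenSet[Triple]
PTheory = List[Tuple[str, str, Dict[str, float]]]


def satisfies(interp: List[Tuple[World, float]], theory: PTheory,
              tol: float = 1e-9) -> bool:
    """Check delta(v) <= sum of I(W) over the worlds W containing (s, p, v)."""
    for s, p, delta in theory:
        for v, lower_bound in delta.items():
            mass = sum(pr for world, pr in interp if (s, p, v) in world)
            if mass + tol < lower_bound:
                return False
    return True


theory = [("USA", "hasCapital",
           {"Washington DC": 0.95, "State of Washington": 0.05})]
w_dc = frozenset({("USA", "hasCapital", "Washington DC")})
w_state = frozenset({("USA", "hasCapital", "State of Washington")})

good = satisfies([(w_dc, 0.95), (w_state, 0.05)], theory)
bad = satisfies([(w_dc, 0.5), (w_state, 0.5)], theory)
```

The first interpretation gives each value at least its stated probability, so it satisfies the theory; the second puts only 0.5 on the worlds containing Washington DC, which is below 0.95.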

Page 14: Probabilistic RDF

pRDF semantics

A theory is consistent iff it has a satisfying interpretation
  Every pRDF theory is consistent

Entailment: T entails T’ iff every satisfying interpretation of T satisfies T’

Closure of a theory: The entire set of triples entailed by the theory
  Maximal w.r.t. the probability values

Page 15: Probabilistic RDF

pRDF fixpoint semantics

The closure operator Δ adds exactly one entailed triple at each step
  (Flu associatedWith (AcuteBronchitis, .7)) and (AcuteBronchitis associatedWith (Pneumonia, .65)) yields:
  (Flu associatedWith (Pneumonia, 0.455)) w.r.t. the product t-norm

Δ has a fixpoint, which is the theory closure.
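
The fixpoint idea can be sketched for a single transitive property as a loop that keeps composing edges until nothing improves. This is a simplification rather than the operator Δ itself: it may add several triples per pass, it handles only one property, and it keeps the highest probability per pair to mirror the "maximal" closure. The function name `transitive_closure` and the edge dictionary layout are assumptions.

```python
from typing import Callable, Dict, Tuple

Edge = Tuple[str, str]


def transitive_closure(edges: Dict[Edge, float],
                       tnorm: Callable[[float, float], float]) -> Dict[Edge, float]:
    """Iterate to a fixpoint, keeping the best probability for each (subject, value)."""
    best = dict(edges)
    changed = True
    while changed:
        changed = False
        snapshot = list(best.items())
        for (a, b), p1 in snapshot:
            for (c, d), p2 in snapshot:
                if b != c or a == d:
                    continue  # edges do not chain, or would form a self-loop
                p = tnorm(p1, p2)
                if p > best.get((a, d), 0.0):
                    best[(a, d)] = p
                    changed = True
    return best


# associatedWith edges from the slide, combined with the product t-norm.
edges = {("Flu", "AcuteBronchitis"): 0.7,
         ("AcuteBronchitis", "Pneumonia"): 0.65}
closed = transitive_closure(edges, lambda x, y: x * y)
```

On the two edges above, the loop adds the single derived pair (Flu, Pneumonia) with probability 0.7 × 0.65 and then stops, since a further pass changes nothing.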

Page 16: Probabilistic RDF

pRDF query processing

We will consider only simple queries: a triple with a variable term
  Example: (? associatedWith Pneumonia .4)
  What is associated with Pneumonia with probability above .4?

Simple method:
  Compute the closure
  Select any triple in the closure that matches the query
  VERY expensive computationally
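
The simple method amounts to a filter over an already computed closure. The sketch below assumes the closure is available as a list of (subject, property, value, probability) quadruples; the function name `answer_subject_query` and the sample data layout are hypothetical.

```python
from typing import List, Tuple

Quad = Tuple[str, str, str, float]


def answer_subject_query(closure: List[Quad], prop: str, value: str,
                         threshold: float) -> List[str]:
    """Answer (? prop value threshold): subjects whose probability exceeds the threshold."""
    return sorted(s for (s, p, v, pr) in closure
                  if p == prop and v == value and pr > threshold)


# Closure of the two associatedWith triples under the product t-norm.
closure = [
    ("Flu", "associatedWith", "AcuteBronchitis", 0.7),
    ("AcuteBronchitis", "associatedWith", "Pneumonia", 0.65),
    ("Flu", "associatedWith", "Pneumonia", 0.455),
]
answers = answer_subject_query(closure, "associatedWith", "Pneumonia", 0.4)
```

The filter itself is cheap; the cost noted on the slide lies in materializing the whole closure before any query can be answered.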

Page 17: Probabilistic RDF

pRDF query processing

Set of algorithms for answering simple queries and conjunctions: pRDF_Subject, pRDF_Property, …, pRDF_conjunction

Central idea:
  Apply Δ in only those directions that yield tuples relevant to the query
  Cut off path computations when the threshold can no longer be reached: min(current_probability, threshold)
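
The central idea can be illustrated with a backward search from the queried value that abandons a path as soon as its probability drops to the threshold or below. This is not the pRDF_Subject algorithm from the talk, only a sketch of the pruning principle for one transitive property; it relies on the fact that a t-norm never exceeds either of its arguments, so extending a path cannot raise its probability. All names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Edge = Tuple[str, str]


def subjects_above(edges: Dict[Edge, float], target: str, threshold: float,
                   tnorm: Callable[[float, float], float]) -> Dict[str, float]:
    """Subjects connected to target by a p-path with probability above threshold."""
    incoming: Dict[str, List[Tuple[str, float]]] = {}
    for (s, o), pr in edges.items():
        incoming.setdefault(o, []).append((s, pr))

    best: Dict[str, float] = {}
    stack = [(target, 1.0)]  # 1.0 is the identity of the t-norm
    while stack:
        node, prob = stack.pop()
        for s, pr in incoming.get(node, []):
            q = tnorm(pr, prob)
            if q <= threshold or s == target:
                continue  # prune: longer paths can only be less probable
            if q > best.get(s, 0.0):
                best[s] = q
                stack.append((s, q))
    return best


edges = {("Flu", "AcuteBronchitis"): 0.7,
         ("AcuteBronchitis", "Pneumonia"): 0.65}
hits = subjects_above(edges, "Pneumonia", 0.4, lambda x, y: x * y)
```

Only edges that lead toward the queried value are ever visited, and with a threshold of 0.5 the search would stop at AcuteBronchitis without exploring Flu at all.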

Page 18: Probabilistic RDF

Experimental results

Implementation
  Java, 1700 LOC
  Disk-based storage for pRDF theories

Synthetically generated datasets
  According to varying underlying distributions

Datasets extracted from Web sources

Page 19: Probabilistic RDF

Experimental questions

Does the underlying distribution affect query running time?

From a practical point of view, which are the “fastest” types of queries?

How does running time vary with the number of atoms in a conjunction?

What other theory-dependent factors affect running time?
  Theory width
  Number of properties

Page 20: Probabilistic RDF

Query running time (Poisson)

[Figure: Atomic queries running time (Poisson). x-axis: Dataset size [no. quadruples], 5000 to 100000. y-axis: Time [ms], 0 to 9000. Series: pRDF_Subject, pRDF_Property, pRDF_Probability.]

Page 21: Probabilistic RDF

Query running time (Zipf)

[Figure: Atomic queries running time (Zipf). x-axis: Dataset size [no. quadruples], 5000 to 100000. y-axis: Time [ms], 0 to 10000. Series: pRDF_Subject, pRDF_Property, pRDF_Probability.]

Page 22: Probabilistic RDF

Conjunctive queries running time

[Figure: Conjunctive queries running time. x-axis: Dataset size [no. quadruples], 5000 to 100000. y-axis: Time [ms], 0 to 30000. Series: 5 queries, 10 queries, 20 queries, 30 queries.]

Page 23: Probabilistic RDF

Dependence on property width

[Figure: Atomic query running time dependence on width. x-axis: Dataset width average, 5 to 35. y-axis: Time [ms], 0 to 10000. Series: pRDF_Subject, pRDF_Property, pRDF_Probability.]

Page 24: Probabilistic RDF

Number of properties

[Figure: Atomic query running time dependence on the number of properties. x-axis: Number of properties, 20 to 100. y-axis: Time [ms], 0 to 35000. Series: pRDF_Subject, pRDF_Property, pRDF_Probability.]

Page 25: Probabilistic RDF

Take away points

RDF syntax with uncertainty

Model-theory and fixpoint semantics for pRDF

Efficient query algorithms for pRDF

Page 26: Probabilistic RDF

The end

http://om.umiacs.umd.edu/

Thank you!

Questions & comments