probabilistic rdf

Post on 12-Jan-2016


Probabilistic RDF

Octavian Udrea (1), V.S. Subrahmanian (1), Zoran Majkić (2)

(1) University of Maryland, College Park
(2) University “La Sapienza”, Rome, Italy

Motivation

Not all information on the Web is easily expressible in “classic” models (i.e., relational)

RDF extraction from text: STORY is the first, very successful prototype

Need to extend RDF with temporal and uncertainty components

Goal: build a logical model of RDF with uncertainty and provide query algorithms

The Probabilistic RDF idea

An RDF theory is a set of triples (subject, property, value), e.g. (USA hasCapital Washington DC), (Washington DC hasPopulation 500,000)

Probabilistic RDF extends this model with uncertainty over the set of values.

(USA hasCapital {(Washington DC, 0.95), (State of Washington, 0.05)})
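A minimal sketch of how such a probabilistic triple might be held in memory. The class name `PTriple` and its fields are illustrative assumptions, not notation from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PTriple:
    """Hypothetical container for (subject, property, {(value, probability), ...})."""
    subject: str
    prop: str
    values: tuple  # tuple of (value, probability) pairs

    def probability(self, value):
        """Return the probability attached to `value`, or 0.0 if it is not listed."""
        for candidate, prob in self.values:
            if candidate == value:
                return prob
        return 0.0


# The example triple from the slide above.
capital = PTriple(
    "USA",
    "hasCapital",
    (("Washington DC", 0.95), ("State of Washington", 0.05)),
)
print(capital.probability("Washington DC"))  # 0.95
```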

Probabilistic RDF example

[Figure: probabilistic RDF graph of medical conditions. Nodes: Condition, Infection, BacterialInfection, ViralInfection, Respiratory, Digestive, Sleeping, FoodPoisoning, Botulism, E-ColiPoisoning, Flu, AcuteBronchitis, Pneumonia, Middle Ear Infection, Emphysema, Cor pulmonale, Symptom, Metabolic, Mental, Fatigue. Edge labels: subClassOf (some annotated .85, .15), rdf:type (some annotated .7, .2), hasComplication (.7, .15, .1, .02, .001), associatedWith (.1, .65), causeOf, subPropertyOf.]

Extracted based on www.wrongdiagnosis.com


Probabilistic RDF syntax

Schema uncertainty: (c subClassOf (C, δ)), with $\sum_{d \in C} \delta(d) \le 1$

Class-instance uncertainty: (x rdf:type (C, δ)), with $\sum_{d \in C} \delta(d) \le 1$

Instance-based uncertainty: (x p (Y, δ)), with $\sum_{y \in Y} \delta(y) \le 1$

Probabilistic RDF syntax

Sanity requirements: (c subClassOf (C1, δ1)) and (c subClassOf (C2, δ2)) ⇒ (C1 = C2 and δ1 = δ2) or C1 ∩ C2 = Ø

The same applies to the other types of uncertainty

Transitive properties: a simple inferential capability. Examples: associatedWith, controlledBy

P-path: a set of triples connected by transitive properties
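The syntactic constraints above (probability mass at most 1, and value sets for the same subject and property either identical or disjoint) could be checked along these lines. This is a sketch under assumptions: the function names, the dict-based encoding of δ, and the extra check that each probability lies in [0, 1] are illustrative choices, not taken from the slides.

```python
def is_valid_distribution(delta, eps=1e-9):
    """delta maps each candidate value to its probability.

    Checks that every probability is in [0, 1] (an added assumption) and that
    the total mass does not exceed 1, up to a small floating-point tolerance.
    """
    in_range = all(0.0 <= p <= 1.0 for p in delta.values())
    return in_range and sum(delta.values()) <= 1.0 + eps


def is_sane(triples):
    """triples is a list of (subject, property, delta) with delta as above.

    Two triples sharing subject and property must carry either the same
    value set with the same probabilities, or disjoint value sets.
    """
    seen = {}
    for subject, prop, delta in triples:
        for previous in seen.setdefault((subject, prop), []):
            same = previous == delta
            disjoint = not (previous.keys() & delta.keys())
            if not (same or disjoint):
                return False
        seen[(subject, prop)].append(delta)
    return True


ok = [("USA", "hasCapital", {"Washington DC": 0.95, "State of Washington": 0.05})]
clash = ok + [("USA", "hasCapital", {"Washington DC": 0.5})]
print(is_sane(ok), is_sane(clash))  # True False
```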

Example p-path

[Figure: the medical-conditions example graph, shown again to illustrate a p-path.]

P-path semantics and t-norms

We cannot generally assume independence between triples on a transitive path (e.g., Flu, AcuteBronchitis, Pneumonia)

T-norms are used to express the user’s knowledge of the relationship between triples. A t-norm $\otimes$:

is associative and commutative

$0 \otimes x = 0$, $1 \otimes x = x$

$x \le y,\ z \le w \Rightarrow x \otimes z \le y \otimes w$

P-path probability: the t-norm applied to the individual probabilities on the path

Example p-path

[Figure: the medical-conditions example graph, repeated.]

(Flu, associatedWith, (Pneumonia, 0.455)) w.r.t. the product t-norm
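The p-path probability can be sketched as a fold of a t-norm over the edge probabilities. The slides only name the product t-norm; the minimum and Łukasiewicz t-norms below are standard textbook examples added for comparison, and all function names are illustrative.

```python
from functools import reduce


def product_tnorm(x, y):
    """Product t-norm: appropriate when the triples are treated as independent."""
    return x * y


def minimum_tnorm(x, y):
    """Minimum (Gödel) t-norm: the largest t-norm."""
    return min(x, y)


def lukasiewicz_tnorm(x, y):
    """Łukasiewicz t-norm: max(0, x + y - 1)."""
    return max(0.0, x + y - 1.0)


def path_probability(probs, tnorm):
    """Combine the probabilities along a p-path; 1.0 is the identity of any t-norm."""
    return reduce(tnorm, probs, 1.0)


# Flu -> AcuteBronchitis (.7), AcuteBronchitis -> Pneumonia (.65)
path = [0.7, 0.65]
print(round(path_probability(path, product_tnorm), 3))  # 0.455
print(path_probability(path, minimum_tnorm))            # 0.65
```

The choice of t-norm changes the answer for the same path, which is why the user is expected to supply it.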

pRDF semantics

A world W is a set of simple triples (with no probabilities)

An interpretation I associates a probability with each world

I satisfies a pRDF theory if, for each (s, p, (V, δ)) and each v ∈ V, $\delta(v) \le \sum_{W \ni (s,p,v)} I(W)$, i.e. the sum ranges over the worlds W that contain (s, p, v)

The same applies to paths w.r.t. a given t-norm
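A small sketch of the satisfaction condition for direct triples only (the path condition is not covered). Worlds are encoded as frozensets of plain triples and an interpretation as a dict from world to probability; this encoding and the function name `satisfies` are assumptions made for illustration.

```python
def satisfies(interpretation, theory, eps=1e-9):
    """Check delta(v) <= sum of I(W) over the worlds W containing (s, p, v).

    interpretation: dict mapping frozenset-of-triples (a world) to a probability
    theory: list of (subject, property, delta), delta mapping value -> probability
    """
    for subject, prop, delta in theory:
        for value, bound in delta.items():
            mass = sum(
                prob
                for world, prob in interpretation.items()
                if (subject, prop, value) in world
            )
            if mass + eps < bound:
                return False
    return True


w1 = frozenset({("USA", "hasCapital", "Washington DC")})
w2 = frozenset({("USA", "hasCapital", "State of Washington")})
theory = [("USA", "hasCapital",
           {"Washington DC": 0.95, "State of Washington": 0.05})]

print(satisfies({w1: 0.95, w2: 0.05}, theory))  # True
print(satisfies({w1: 0.5, w2: 0.5}, theory))    # False
```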

pRDF semantics

A theory is consistent iff it has a satisfying interpretation

Every pRDF theory is consistent

Entailment: T entails T’ iff every satisfying interpretation of T satisfies T’

Closure of a theory: the entire set of triples entailed by the theory, maximal w.r.t. the probability values

pRDF fixpoint semantics

The closure operator Δ adds exactly one entailed triple at each step

(Flu associatedWith (Acute Bronchitis, .7)) and (Acute Bronchitis associatedWith (Pneumonia, .65)) yield (Flu associatedWith (Pneumonia, 0.455)) w.r.t. the product t-norm

Δ has a fixpoint, which is the closure of the theory.
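A naive sketch of iterating to a fixpoint for a single transitive property. It is not the authors' operator: it applies every available composition in each pass rather than exactly one triple per step, it keeps the highest probability found for each pair, and the encoding of edges as a dict `(subject, object) -> probability` is an assumption.

```python
def closure(edges, tnorm=lambda x, y: x * y):
    """Repeat composition of edges until no probability can be raised.

    edges: dict mapping (subject, object) to a probability for one
    transitive property. The default t-norm is the product.
    """
    closed = dict(edges)
    changed = True
    while changed:
        changed = False
        snapshot = list(closed.items())
        for (a, b), p_ab in snapshot:
            for (b2, c), p_bc in snapshot:
                if b != b2:
                    continue
                candidate = tnorm(p_ab, p_bc)
                # Keep only the maximal probability for each derived triple.
                if candidate > closed.get((a, c), 0.0):
                    closed[(a, c)] = candidate
                    changed = True
    return closed


associated_with = {
    ("Flu", "AcuteBronchitis"): 0.7,
    ("AcuteBronchitis", "Pneumonia"): 0.65,
}
closed = closure(associated_with)
print(round(closed[("Flu", "Pneumonia")], 3))  # 0.455
```

The loop terminates because each stored probability can only increase and a t-norm never yields a value above either argument.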

pRDF query processing

We will consider only simple queries: a triple with a variable term

Example: (? associatedWith Pneumonia, .4): what is associated with Pneumonia with probability above .4?

Simple method:

Compute the closure

Select any triple in the closure that matches the query

VERY expensive computationally

pRDF query processing

Set of algorithms for answering simple queries and conjunctions: pRDF_Subject, pRDF_Property, …, pRDF_conjunction

Central idea:

Apply Δ in only those directions that yield tuples relevant to the query

Cut off path computations when the threshold can no longer be reached. min(current_probability, threshold)
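The central idea can be illustrated with a backward search from the queried object that abandons a path as soon as its probability falls to the threshold or below; since a t-norm never exceeds either argument, such a path cannot recover. This is only a sketch of the pruning principle, not the pRDF_Subject algorithm itself; the function name, the edge encoding, and the Emphysema edge in the example are hypothetical.

```python
from collections import defaultdict


def subjects_above(edges, target, threshold, tnorm=lambda x, y: x * y):
    """Answer (? property target, threshold) for one transitive property.

    edges: dict mapping (subject, object) to a probability.
    Returns a dict subject -> best path probability strictly above threshold.
    """
    incoming = defaultdict(list)
    for (subject, obj), prob in edges.items():
        incoming[obj].append((subject, prob))

    best = {}
    frontier = [(target, 1.0)]
    while frontier:
        node, prob_to_target = frontier.pop()
        for subject, prob in incoming[node]:
            candidate = tnorm(prob, prob_to_target)
            if candidate <= threshold:
                continue  # pruned: extending this path can only lower it further
            if candidate > best.get(subject, 0.0):
                best[subject] = candidate
                frontier.append((subject, candidate))
    return best


edges = {
    ("Flu", "AcuteBronchitis"): 0.7,
    ("AcuteBronchitis", "Pneumonia"): 0.65,
    ("Emphysema", "Pneumonia"): 0.1,  # hypothetical edge, pruned by the threshold
}
answers = subjects_above(edges, "Pneumonia", 0.4)
print(sorted(answers))  # ['AcuteBronchitis', 'Flu']
```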

Experimental results

Implementation: Java, 1700 LOC; disk-based storage for pRDF theories

Synthetically generated datasets, according to varying underlying distributions

Datasets extracted from Web sources

Experimental questions

Does the underlying distribution affect query running time?

From a practical point of view, which are the “fastest” types of queries?

How does running time vary with the number of atoms in a conjunction?

What other theory-dependent factors affect running time? Theory width; number of properties

Query running time (Poisson)

[Figure: Atomic queries running time (Poisson). x-axis: dataset size [no. of quadruples], 5000 to 100000; y-axis: time [ms], 0 to 9000; series: pRDF_Subject, pRDF_Property, pRDF_Probability.]

Query running time (zipf)

[Figure: Atomic queries running time (zipf). x-axis: dataset size [no. of quadruples], 5000 to 100000; y-axis: time [ms], 0 to 10000; series: pRDF_Subject, pRDF_Property, pRDF_Probability.]

Conjunctive queries running time

[Figure: Conjunctive queries running time. x-axis: dataset size [no. of quadruples], 5000 to 100000; y-axis: time [ms], 0 to 30000; series: 5 queries, 10 queries, 20 queries, 30 queries.]

Dependence on property width

[Figure: Atomic query running time dependence on width. x-axis: dataset width average, 5 to 35; y-axis: time [ms], 0 to 10000; series: pRDF_Subject, pRDF_Property, pRDF_Probability.]

Number of properties

[Figure: Atomic query running time dependence on the number of properties. x-axis: number of properties, 20 to 100; y-axis: time [ms], 0 to 35000; series: pRDF_Subject, pRDF_Property, pRDF_Probability.]

Take away points

RDF syntax with uncertainty

Model-theory and fixpoint semantics for pRDF

Efficient query algorithms for pRDF

The end

http://om.umiacs.umd.edu/

Thank you!

Questions & comments
