Recognizing Ontology-Applicable Multiple-Record Web Documents
David W. Embley
Dennis Ng
Li Xu
Brigham Young University
Problem: Recognizing Applicable Documents
Document 1: Car Ads
Document 2: Items for Sale or Rent
A Conceptual Modeling Solution
Car-Ads Ontology
Car [->object];
Car [0:0.975:1] has Year [1:*];
Car [0:0.925:1] has Make [1:*];
Car [0:0.908:1] has Model [1:*];
Car [0:0.45:1] has Mileage [1:*];
Car [0:2.1:*] has Feature [1:*];
Car [0:0.8:1] has Price [1:*];
PhoneNr [1:*] is for Car [1:1.15:*];
Year matches [4]
constant {extract "\d{2}";
context "([^\$\d]|^)[4-9]\d,[^\d]";
substitute "^" -> "19"; },
…
End;
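As a rough illustration of how a rule like the Year rule above might be applied, the following Python sketch finds each context match, extracts the two digits inside it, and prefixes "19". The function name `extract_years` and the way the three clauses (extract, context, substitute) are chained are assumptions for illustration; the slides do not show the actual extraction engine.

```python
import re

# Patterns copied from the Year rule on the slide.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d,[^\d]")
EXTRACT = re.compile(r"\d{2}")


def extract_years(text):
    """Return four-digit years found via the Year rule (hypothetical helper)."""
    years = []
    for ctx in CONTEXT.finditer(text):
        # Apply the extract pattern only inside the matched context.
        digits = EXTRACT.search(ctx.group(0))
        if digits:
            # Substitute "^" -> "19": prefix the two-digit value with 19.
            years.append("19" + digits.group(0))
    return years


print(extract_years("'97, CHEVY Cavalier, red"))
print(extract_years("$95, obo"))
```

Under these assumptions the first call yields a 1997 year, while the second yields nothing because the context pattern rejects two digits that directly follow a dollar sign.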
Recognition Heuristics
• H1: Density
• H2: Expected Values
• H3: Grouping
H1: Density
• Car Ads
– Number of Matched Characters: 626
– Total Number of Characters: 2048
– Density: 0.306
• Items for Rent or Sale
– Number of Matched Characters: 196
– Total Number of Characters: 2671
– Density: 0.073
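The density figures above are the ratio of matched characters to total characters. A minimal Python sketch follows; the helper `matched_char_count`, which counts each character at most once even when several patterns overlap, is an assumption, since the slides do not say how overlapping matches are counted.

```python
import re


def matched_char_count(text, patterns):
    """Count characters covered by at least one pattern match (assumed rule)."""
    covered = [False] * len(text)
    for pattern in patterns:
        for m in re.finditer(pattern, text):
            for i in range(m.start(), m.end()):
                covered[i] = True
    return sum(covered)


def density(matched_chars, total_chars):
    """H1: matched characters divided by total characters."""
    if total_chars <= 0:
        return 0.0
    return matched_chars / total_chars


# Figures taken from the slide.
print(round(density(626, 2048), 3))   # Car Ads
print(round(density(196, 2671), 3))   # Items for Rent or Sale
```

With the slide's character counts, the two ratios round to 0.306 and 0.073.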
H2: Expected Values
Document 1: Car Ads
Year: 3, Make: 2, Model: 3, Mileage: 1, Price: 1, Feature: 15, PhoneNr: 3
Document 2: Items for Sale or Rent
Year: 1, Make: 0, Model: 0, Mileage: 1, Price: 0, Feature: 0, PhoneNr: 4
Object Set    OV     D1    D2
Year          0.98   16     6
Make          0.93   10     0
Model         0.91   12     0
Mileage       0.45    6     2
Price         0.80   11     8
Feature       2.10   29     0
PhoneNr       1.15   15    11

Similarity to OV:
D1: 0.996
D2: 0.567
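The slide does not name the measure behind the 0.996 and 0.567 figures. One reading that reproduces both values is the cosine of the angle between the OV column and each document's count vector. The Python sketch below works under that assumption; the function name `cosine_similarity` is illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Order: Year, Make, Model, Mileage, Price, Feature, PhoneNr (from the table).
OV = [0.98, 0.93, 0.91, 0.45, 0.80, 2.10, 1.15]
D1 = [16, 10, 12, 6, 11, 29, 15]
D2 = [6, 0, 0, 2, 8, 0, 11]

print(round(cosine_similarity(OV, D1), 3))
print(round(cosine_similarity(OV, D2), 3))
```

Because the cosine ignores vector length, a long document and a short one with the same proportions of object-set matches score the same.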
H3: Grouping (of 1-Max Object Sets)
Document 1: Car Ads
Year Make Model Price Year Model Year Make Model Mileage …
Document 2: Items for Sale or Rent
Year Mileage … Mileage Year Price Price …
H3: Grouping

Car Ads
Group                          Distinct
Year Year Make Model           3
Price Year Model Year          3
Make Model Mileage Year        4
Model Mileage Price Year       4
…
Grouping: 0.865

Sale Items
Group                          Distinct
Year Year Year Mileage         2
Mileage Year Price Price       3
Year Price Price Year          2
Price Price Price Price        1
…
Grouping: 0.500
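$$\text{Expected Number in a Group} = \sum_{\text{1-Max}} \text{Ave} \approx 4 \quad \text{(for our example)}$$

$$\text{Grouping} = \frac{\text{Sum of Distinct 1-Max in each Group}}{\text{Number of Groups} \times \text{Expected Number in a Group}}$$

For the four groups shown:

$$\text{Car Ads: } \frac{3+3+4+4}{4 \times 4} = 0.875 \qquad \text{Sale Items: } \frac{2+3+2+1}{4 \times 4} = 0.500$$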
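The grouping arithmetic can be sketched in Python as follows. The function name `grouping_measure` is illustrative, and dropping a trailing partial group is an assumption; the slides only show complete groups of four.

```python
def grouping_measure(sequence, group_size):
    """H3: distinct 1-Max object sets per group, normalized by group size."""
    # Cut the sequence into consecutive full groups (partial tail dropped: assumption).
    groups = [
        sequence[i:i + group_size]
        for i in range(0, len(sequence) - group_size + 1, group_size)
    ]
    if not groups:
        return 0.0
    distinct_total = sum(len(set(group)) for group in groups)
    return distinct_total / (len(groups) * group_size)


# The four groups shown on the slide for each document.
car_ads = (
    "Year Year Make Model Price Year Model Year "
    "Make Model Mileage Year Model Mileage Price Year"
).split()
sale_items = (
    "Year Year Year Mileage Mileage Year Price Price "
    "Year Price Price Year Price Price Price Price"
).split()

print(grouping_measure(car_ads, 4))
print(grouping_measure(sale_items, 4))
```

For these two sixteen-item sequences the sketch reproduces the 0.875 and 0.500 values worked out above.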
Combining Heuristics
• Decision-Tree Learning Algorithm C4.5
– (H1, H2, H3, Positive)
– (H1, H2, H3, Negative)
• Training Set
– 20 positive examples
– 30 negative examples (some purposely similar, e.g. classified ads)
• Test Set
– 10 positive examples
– 20 negative examples
Car Ads: Rule & Results
• Precision: 100%
• Recall: 91%
• Accuracy: 97%
– Harmonic Mean
– 2/(1/Precision + 1/Recall)
False Negative
Obituaries
Obituaries: Rule & Results
• Precision: 91%
• Recall: 100%
• Accuracy: 97%
False Positive: Missing Person Report
Universal Rule
• Precision: 84%
• Recall: 100%
• Accuracy: 93%
Additional and Future Work
• Other Approaches
– Naïve Bayes [McCallum96] (accuracy near 90%)
– Logistic Regression [Wang01] (accuracy near 95%)
– Multivariate Analysis with Continuous Random Vectors [Tang01] (accuracy near 100%)
• More Extensive Testing
– Similar documents (motorcycles, wedding announcements, …)
– Accuracy drops to near 87%
– Naïve Bayes drops to near 77%
– Others … ?
• Other Types of Documents
– XML Documents
– Forms and the Hidden Web
– Tables
Summary
• Objective: Automatically Recognize Document Applicability
• Approach:
– Conceptual Modeling
– Recognition Heuristics
   • Density
   • Expected Values
   • Grouping
• Result: Accuracy Near 95%
www.deg.byu.edu