Post on 19-Dec-2015
Data-Extraction Ontology Generation by Example
Yuanqiu (Joe) Zhou
Data Extraction Group
Brigham Young University
Sponsored by NSF
Motivation
Semi-structured Web data must be extracted before it can be manipulated further.
In contrast to other wrapper-generation techniques, the BYU ontology-based data-extraction technique is resilient.
A by-example approach lets ordinary users generate ontologies easily.
Web-based System GUI
Canon PowerShot S40
4.0
1600 x 1200, 1024 x 768, 640 x 480
Architecture
Data Frame Library
User Defined Form
System GUI
Sample Pages
Ontology Generator
Extraction Engine
Test Pages
Populated Database
Extraction Ontology
Extraction Ontology
Object and Relationship Sets and Constraints
Extraction Patterns
Keywords
Context Expressions
Base
A
B
C
D1 D2
E1 E2
Base [0:1] A [1:*]
Base [0:2] B [1:*]
Base [0:*] C [1:*]
Base [0:2] D1 [1:*] D2 [1:*]
Base [0:*] E1 [1:*] E2 [1:*]
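The mapping from form fields to participation constraints shown above can be sketched in a few lines. This is only an illustration: the `FormField` structure and the reading of each field (A holds one value, B at most two, C any number, D and E are multi-column fields) are assumptions made for the example, not the tool's actual data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FormField:
    """One field of a user-defined form (hypothetical structure)."""
    columns: List[str]         # one name for a plain field, several for a multi-column field
    max_values: Optional[int]  # maximum number of entries; None means unbounded


def participation_constraint(base: str, form_field: FormField) -> str:
    """Render a constraint such as 'Base [0:2] D1 [1:*] D2 [1:*]'."""
    upper = "*" if form_field.max_values is None else str(form_field.max_values)
    parts = [f"{base} [0:{upper}]"]
    parts += [f"{name} [1:*]" for name in form_field.columns]
    return " ".join(parts)


fields = [
    FormField(["A"], 1),
    FormField(["B"], 2),
    FormField(["C"], None),
    FormField(["D1", "D2"], 2),
    FormField(["E1", "E2"], None),
]
constraints = [participation_constraint("Base", f) for f in fields]
for line in constraints:
    print(line)
```

Under these assumptions the printed lines reproduce the five constraints listed above.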
Ontology Generation: Object and Relationship Sets and Constraints
Base
A
B
…
…
A
B1
B2
B1, B2 : B
G
H I
F
A [0:1] F [1:*]
B1 [0:1] G [1:*]
B2 [0:1] H [1:*] I [1:*]
Ontology Generation: Object and Relationship Sets and Constraints
Sample Web Page / User-Created Form
Digital Camera: Brand, Model, CCD Resolution, Image Resolution, Zoom (Optical Zoom, Digital Zoom)
Sample values: Canon, PowerShot G2, 4.0, 2272 x 1074, 3, 2
Object and Relationship Sets and Constraints
DigitalCamera [-> object]
DigitalCamera [0:1] Brand [1:*]
DigitalCamera [0:1] Model [1:*]
DigitalCamera [0:1] CCDResolution [1:*]
DigitalCamera [0:1] ImageResolution [1:*]
DigitalCamera [0:1] Zoom [1:*]
Zoom [0:1] DigitalZoom [1:*]
Zoom [0:1] OpticalZoom [1:*]
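As a rough sketch of how a nested form could yield the object and relationship sets above, the following walks a form description and emits one constraint per field. The dictionary layout, the `object_set_name` helper, and the rule that every field is single-valued are assumptions for illustration only.

```python
from typing import Dict, List

# Hypothetical nested form: each key is a form label, each value lists its single-valued fields.
form: Dict[str, List[str]] = {
    "Digital Camera": ["Brand", "Model", "CCD Resolution", "Image Resolution", "Zoom"],
    "Zoom": ["Digital Zoom", "Optical Zoom"],
}


def object_set_name(label: str) -> str:
    """Turn a form label into an object-set name by dropping spaces."""
    return label.replace(" ", "")


def generate_constraints(form: Dict[str, List[str]], primary: str) -> List[str]:
    """Emit the primary-object declaration, then one constraint per (parent, field) pair."""
    lines = [f"{object_set_name(primary)} [-> object]"]
    for parent, children in form.items():
        for child in children:
            lines.append(
                f"{object_set_name(parent)} [0:1] {object_set_name(child)} [1:*]"
            )
    return lines


lines = generate_constraints(form, "Digital Camera")
print("\n".join(lines))
```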
Ontology Generation: Extraction Patterns
Data Frame Library: lexicons, synonym dictionaries or thesauri, regular expressions
Matching extraction patterns:
Only one (bingo!)
More than one (use extraction pattern filters)
No matching extraction pattern (create one)
Features a high-quality 4.0 Megapixel Resolution CCD
The new Nikon Coolpix 995 boasts of a 3.34 Megapixel CCD
3 effective megapixel
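The three matching cases (one pattern, several, none) can be illustrated with a toy library of regular expressions. The library contents, the pattern names, and the rule that a pattern qualifies only if it fully matches every sample value are all assumptions for this sketch; the actual data frame library and its filters are not described in these slides.

```python
import re
from typing import Dict, List

# Hypothetical stand-in for a data frame library: pattern name -> regular expression.
LIBRARY: Dict[str, str] = {
    "DecimalNumber": r"\d(\.\d{1,2})?",
    "PixelDimensions": r"\d{3,4} x \d{3,4}",
    "Year": r"(19|20)\d{2}",
}


def matching_patterns(samples: List[str], library: Dict[str, str]) -> List[str]:
    """Return the names of library patterns that fully match every sample value."""
    return [
        name
        for name, pattern in library.items()
        if all(re.fullmatch(pattern, s) for s in samples)
    ]


def choose_pattern(samples: List[str], library: Dict[str, str]) -> str:
    """Mirror the three cases: exactly one match, several matches, or none."""
    candidates = matching_patterns(samples, library)
    if len(candidates) == 1:
        return f"use {candidates[0]}"
    if candidates:
        return "filter among " + ", ".join(candidates)
    return "create a new pattern"


# Sample values a user might highlight for CCD Resolution in the three snippets above.
ccd_samples = ["4.0", "3.34", "3"]
decision = choose_pattern(ccd_samples, LIBRARY)
print(decision)
```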
Ontology Generation: Keywords
3.5x optical zoom (2.5x digital)
a superior 4x Optical Zoom Nikkor lens, plus 4x stepless digital zoom
optical 3X /digital 6X zoom
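The snippets above show why keywords matter: the same numeric pattern matches both optical and digital zoom values. One simple way to use keywords, sketched here as an assumption rather than as the extraction engine's actual heuristic, is to assign each matched value to the object set whose keyword lies closest to it in the text.

```python
import re
from typing import Dict

# Assumed value pattern and keywords for the two zoom object sets.
ZOOM_VALUE = re.compile(r"\b(\d(?:\.\d)?)x", re.IGNORECASE)
KEYWORDS = {
    "OpticalZoom": re.compile(r"\boptical\b", re.IGNORECASE),
    "DigitalZoom": re.compile(r"\bdigital\b", re.IGNORECASE),
}


def gap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of characters between two spans (0 if they touch or overlap)."""
    return max(b_start - a_end, a_start - b_end, 0)


def assign_zoom_values(text: str) -> Dict[str, str]:
    """Give each zoom value to the object set with the nearest keyword occurrence."""
    result: Dict[str, str] = {}
    for value in ZOOM_VALUE.finditer(text):
        best_name, best_gap = None, None
        for name, keyword in KEYWORDS.items():
            for kw in keyword.finditer(text):
                d = gap(value.start(), value.end(), kw.start(), kw.end())
                if best_gap is None or d < best_gap:
                    best_name, best_gap = name, d
        # Keep only the first value found for each object set.
        if best_name is not None and best_name not in result:
            result[best_name] = value.group(1)
    return result


print(assign_zoom_values("3.5x optical zoom (2.5x digital)"))
print(assign_zoom_values("optical 3X /digital 6X zoom"))
```

Nearest-keyword assignment is deliberately naive; it handles the three snippets above but would need refinement for text where keywords and values are interleaved differently.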
Ontology Generation: Context Expressions
Extraction Ontology
DigitalCamera [-> object];
DigitalCamera [0:1] Brand [1:*];
DigitalCamera [0:1] ImageResolution [1:*];
DigitalCamera [0:1] Zoom [1:*];
DigitalCamera [0:1] CCDResolution [1:*];
Zoom [0:1] OpticalZoom [1:*];

Brand matches [10]
  constant { extract "\bNikon\b"; },
           { extract "\bCanon\b"; },
           { extract "\bOlympus\b"; },
           { extract "\bMinolta\b"; },
           { extract "\bSony\b"; };
end;

CCDResolution matches [20]
  constant { extract "\b\d(\.\d{1,2})?\b"; };
  keyword "\bMegapixel\b", "\bCCD\b", "\bCCD Resolution\b";
end;

OpticalZoom matches [10]
  constant { extract "\b\d(\.\d)";
             context "\b\d(\.\d)?(x)\b"; };
  keyword "\boptical\b";
end;
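To make the role of the constant and keyword expressions concrete, here is a minimal sketch that applies the Brand and CCDResolution rules to one of the sample sentences. The `Rule` class and `apply_rule` function are hypothetical, keyword matching is assumed to be case-insensitive, and the sketch ignores context expressions and the bracketed numbers after `matches`; the real extraction engine is certainly more involved.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rule:
    """Simplified stand-in for one 'matches ... end;' block (hypothetical)."""
    extract: str
    keywords: List[str] = field(default_factory=list)


def apply_rule(rule: Rule, text: str) -> List[str]:
    """Return every constant found in text, provided a keyword is present."""
    # Assumption: a rule with keywords fires only if at least one keyword occurs in the text.
    if rule.keywords and not any(
        re.search(k, text, re.IGNORECASE) for k in rule.keywords
    ):
        return []
    return [m.group(0) for m in re.finditer(rule.extract, text)]


brand = Rule(extract=r"\b(Nikon|Canon|Olympus|Minolta|Sony)\b")
ccd = Rule(
    extract=r"\b\d(\.\d{1,2})?\b",
    keywords=[r"\bMegapixel\b", r"\bCCD\b", r"\bCCD Resolution\b"],
)

text = "The new Nikon Coolpix 995 boasts of a 3.34 Megapixel CCD"
print(apply_rule(brand, text))
print(apply_rule(ccd, text))
```

Note that the word boundaries in the CCDResolution pattern keep it from matching inside the model number 995, so only the resolution value is returned for this sentence.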
Measurements
How much of the ontology was generated relative to how much could have been generated?
How many of the generated components should not have been generated?
How do the precision and recall of data extracted with a system-generated ontology compare with those obtained with an expert-generated ontology?
How many sample pages are necessary for acceptable system performance?
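The first two questions correspond to recall and precision over ontology components. A minimal sketch of that computation follows; the component sets are invented purely to show the arithmetic and are not experimental results.

```python
from typing import Set, Tuple


def precision_recall(generated: Set[str], expected: Set[str]) -> Tuple[float, float]:
    """Precision: share of generated components that are correct.
    Recall: share of expected components that were generated."""
    correct = generated & expected
    precision = len(correct) / len(generated) if generated else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall


# Illustrative (made-up) component sets, not measured data.
expected = {"Brand", "Model", "CCDResolution", "ImageResolution", "Zoom"}
generated = {"Brand", "Model", "Zoom", "Price"}

precision, recall = precision_recall(generated, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The same function applies to the third question if the sets hold extracted values instead of ontology components, computed once for the system-generated ontology and once for the expert-generated one.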
Contributions
Proposes a by-example approach to semi-automatically generate data-extraction ontologies
Constructs a Web-based tool to generate data-extraction ontologies