Mary McGee Wood, Shenghui Wang Dept of Computer Science, U. of Manchester Valentin Tablan, Diana Maynard, Hamish Cunningham Dept of Computer Science, U. of Sheffield Populating a Database from Parallel Texts using “Ontology-based” Information Extraction


Page 1:

Mary McGee Wood, Shenghui Wang
Dept of Computer Science, U. of Manchester

Valentin Tablan, Diana Maynard, Hamish Cunningham
Dept of Computer Science, U. of Sheffield

Susannah Lydon
Earth Science Education Unit, U. of Keele

Populating a Database from Parallel Texts using “Ontology-based” Information Extraction

Page 2:

The hypothesis

Page 3:

Overview

Parallel texts

Legacy data in the natural sciences

“Ontology-based” Information Extraction

Page 4:

NLDB’04 - a few running threads

Multiple / semi-overlapping text sources

Sophisticated vs shallow or statistical text processing

“Ontologies” are not the same as gazetteers or lexicons (or semantic nets!)

Autonomous agents vs HCC (Human-Computer Collaborative) approaches

Page 5:

We are doing…

Highly homogeneous data sources

Shallow text processing

“Ontologies” only as a last resort

HCC approach

Page 6:

We are not doing…

Heterogeneous data sources

Sophisticated language processing

Improvement of single-source IE or question-answering

Autonomous agents

Page 7:

Parallel texts

Text descriptions in the traditional descriptive sciences.

Descriptions of protein sequences and functions in molecular biology.

Press coverage of news stories.

Police witness-of-crime reports.

(Semi-) automatic marking of free text answers in examinations.

Page 8:

Legacy data in the natural sciences

Text descriptions in the traditional descriptive sciences:

Species descriptions in botany and zoology

Descriptions of diseases in medicine.

Page 9:

Data sources

Five species of Ranunculus (buttercups)

Six botanists’ text descriptions (Floras)

Page 10:

Typical data

R. acris L. - Meadow Buttercup. Erect perennial to 1m; basal leaves deeply palmately lobed, pubescent; flowers 15-25mm across; sepals not reflexed; achenes 2-3.5mm, glabrous, smooth, with short hooked beak; 2n=14.

Page 11:

Hand Parsing & Correlation

Hand-parsed species descriptions for Ranunculus bulbosus

Sources: CTM, FE, FNA, GLEASON, GRAY, STACE

Petals: Petals; Petals; petals; Petals; petals

number: 5; 5

length: usually 10-15 mm; 9-13 mm; 8-14 mm. long; 0.8-1.4 cm long

width: 8-11 mm; nearly as broad

shape: broadly obovate with cuneate base; broadly obovate; rounded-obovate

colour: bright glossy yellow, rarely paler or white; yellow

Page 12:

Results of hand-analysis of Ranunculus descriptions from six sources

- Most data from one source only

- Individual texts contain on average 39% of the total information for each species

Page 13:

MultiFlora I

Automatic compilation of accurate taxonomic databases from multiple non-computerised sources

Department of Computer Science, University of Manchester
Mary McGee Wood, David Rydeheard, Susannah Lydon

Department of Botany, Natural History Museum, London
Rob Huxley, David Sutton

Supported by the BBSRC / EPSRC joint Bioinformatics Initiative, grant reference number 34/BIO12072

Page 14:

GATE I

Page 15:

Tagger output

Page 16:

Parse trees

Page 17:

Names & verbs

‘Basal leaves more or less deeply divided…’

1231 semantics 179 191 (qlf:[ne_tag(e13, offsets(179, 184)), name(e13, 'Basal'), realisation(e13, offsets(179, 184)), leave(e12), time(e12, present), aspect(e12, simple), voice(e12, active), realisation(e12, offsets(185, 191)), realisation(e12, offsets(185, 191)), lsubj(e12, e13)])

1247 semantics 200 226 (qlf:[divide(e14), adv(e14, less), adv(e14, deeply), time(e14, none), aspect(e14, simple), voice(e14, passive), into(e14, e15), count(e15, 3), realisation(e15, offsets(225, 226)), realisation(e14, offsets(200, 226)), realisation(e14, offsets(200, 226))])

Page 18:

Template output (1)

Erect perennial to 1m; basal leaves deeply palmately lobed, pubescent;

HEAD KIND FEATURE TYPE KIND

Erect

Perennial

to 1m measure unknown

basal position pubescent

leaves Prefix

deeply

palmately

lobed

Page 19:

Template output (2)

flowers 15-25mm across; sepals not reflexed; achenes 2-3.5mm, glabrous, smooth, with short hooked beak;

HEAD KIND FEATURE TYPE KIND NEGATION

flowers 15-25mm measure width

across

sepals reflexed true

achenes short

hooked

smooth

glabrous

2-3.5mm measure unknown
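
The measure slots in the template output above ("to 1m", "15-25mm", "2-3.5mm") are still plain strings. Before values from different Floras can be compared, they would presumably need to be normalised to a common numeric form. The following is a minimal Python sketch of that step. The function name `parse_measure`, the regular expression, and the choice of millimetres as the common unit are all assumptions made for illustration; they are not part of GATE or the MultiFlora system.

```python
import re

# Hypothetical helper, not part of GATE or MultiFlora.
# Conversion factors to millimetres for the units seen in the sample descriptions.
_TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0}

_MEASURE = re.compile(
    r"^(?:(?:up\s+)?to\s+)?"        # optional "to" / "up to" (upper bound only)
    r"(?:(\d+(?:\.\d+)?)\s*-\s*)?"  # optional lower bound followed by a hyphen
    r"(\d+(?:\.\d+)?)\s*"           # upper bound, or the only value given
    r"(mm|cm|m)\b"                  # unit
)


def parse_measure(text):
    """Return (low_mm, high_mm) for a measure string, or None if it does not match.

    low_mm is None when only an upper bound is given, as in 'to 1m'.
    """
    match = _MEASURE.match(text.strip().lower())
    if match is None:
        return None
    low, high, unit = match.groups()
    factor = _TO_MM[unit]
    return (float(low) * factor if low else None, float(high) * factor)


flower_width = parse_measure("15-25mm")   # range with both bounds
plant_height = parse_measure("to 1m")     # upper bound only
```

With a shared unit, a value such as "0.8-1.4 cm long" from one source can be set beside "8-14 mm. long" from another, which is the kind of correlation the hand-parsing slide does by eye.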

Page 20:

MultiFlora II: Combining Information Extraction and Knowledge Representation for Biodiversity Informatics

Department of Computer Science, University of Manchester
Mary McGee Wood, Susannah Lydon, Alan Rector

Department of Botany, Natural History Museum, London
Rob Huxley

Natural Language Processing Group, University of Sheffield
Hamish Cunningham, Valentin Tablan, Diana Maynard

Supported by the BBSRC Bioinformatics and E-science Programme, grant reference number 34/BEP17049

Page 21:

GATE II

Page 22:

“Ontology-based” Information Extraction

“Ontology” – classes of heads, properties, and features

Gazetteers – instances of these classes

(Lexicons – not currently used)

Page 23:

Head categories

Specific plant parts:

Flower: Flower, floret, Fl

Leaf: leaf, leaves, Fronds

Petal: petal, honey-leaf, vexillum

Collective categories:

PlantSeparatablePart: appendage, glume, tuber

PlantUnseparatablePart: beak, lobe, segment

SpecificRegionOfWhole: apex, border, head
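
The split between ontology classes and gazetteer instances can be illustrated with a small lookup. The sketch below is in Python and is only an illustration of the idea: the class names and instance terms are copied from the head-category slide, while the dictionary layout and the functions `build_index` and `tag_heads` are assumptions, not the GATE gazetteer format or API.

```python
import re

# Class names and instances are taken from the head-category slide.
# The data layout and both functions are hypothetical, for illustration only.
HEAD_GAZETTEER = {
    "Flower": ["Flower", "floret", "Fl"],
    "Leaf": ["leaf", "leaves", "Fronds"],
    "Petal": ["petal", "honey-leaf", "vexillum"],
    "PlantSeparatablePart": ["appendage", "glume", "tuber"],
    "PlantUnseparatablePart": ["beak", "lobe", "segment"],
    "SpecificRegionOfWhole": ["apex", "border", "head"],
}


def build_index(gazetteer):
    """Invert class -> terms into a lower-cased term -> class lookup."""
    index = {}
    for head_class, terms in gazetteer.items():
        for term in terms:
            index[term.lower()] = head_class
    return index


def tag_heads(text, index):
    """Return (token, head class) pairs for tokens that exactly match a gazetteer term."""
    tokens = re.findall(r"[A-Za-z][A-Za-z-]*", text)
    return [(tok, index[tok.lower()]) for tok in tokens if tok.lower() in index]


INDEX = build_index(HEAD_GAZETTEER)
tagged = tag_heads(
    "basal leaves deeply palmately lobed; achenes with short hooked beak", INDEX
)
```

Matching here is exact, so "lobed" is not tagged even though "lobe" is listed; a real pipeline would need morphological handling, which this sketch deliberately leaves out.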

Page 24:

Ontology: Heads

ontology-heads.eps

Page 25:

Properties

2DShape: arching, linear, toothed

3DShape: branching, thickened, tube

Colour: glossy, golden, greenish

Count: numerous, several

Page 26:

Ontology: Properties

Page 27:

Features

Habit: bush, shrub, succulent

MorphologicalProperty: dense, contiguous, separate

SurfaceProperty: pilose, pitted, rugose

Page 28:

Ontology: Features

Page 29:

More typical data

Perennial herb with overwintering lf-rosettes from the short oblique to erect premorse stock up to 5 cm, rarely longer and more rhizome-like; roots white, rather fleshy, little branched.

Page 30:

System output

| Head Class | Head | Property | FeatClass | Feature |
| Plant | herb | hasLifeform | Lifeform | Perennial |
| Leaf | lf-rosettes | hasLifeform | Lifeform | overwintering |
| PlantSepPart | stock | hasRelProperty | RelProperty | short |
| PlantSepPart | stock | hasOrientation | Orientation | oblique to erect |
| PlantSepPart | stock | hasLength | Length | up to 5 cm |
| PlantSepPart | stock | hasRelProperty | RelProperty | rhizome-like |
| Root | roots | hasColour | Colour | white |
| Root | roots | hasShape3D | Shape3D | rather fleshy |
| Root | roots | hasShape3D | Shape3D | little branched |
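
Each row of the system output is a five-field record (head class, head, property, feature class, feature), which maps directly onto a relational table. The Python sketch below loads the rows shown on this slide into an in-memory SQLite database. The table name `extraction`, the column names, and the function `populate` are assumptions chosen for the example; the slides do not say what database or schema the project uses.

```python
import sqlite3

# Rows copied from the system-output slide.
ROWS = [
    ("Plant", "herb", "hasLifeform", "Lifeform", "Perennial"),
    ("Leaf", "lf-rosettes", "hasLifeform", "Lifeform", "overwintering"),
    ("PlantSepPart", "stock", "hasRelProperty", "RelProperty", "short"),
    ("PlantSepPart", "stock", "hasOrientation", "Orientation", "oblique to erect"),
    ("PlantSepPart", "stock", "hasLength", "Length", "up to 5 cm"),
    ("PlantSepPart", "stock", "hasRelProperty", "RelProperty", "rhizome-like"),
    ("Root", "roots", "hasColour", "Colour", "white"),
    ("Root", "roots", "hasShape3D", "Shape3D", "rather fleshy"),
    ("Root", "roots", "hasShape3D", "Shape3D", "little branched"),
]


def populate(conn, rows):
    """Create the (hypothetical) extraction table if needed and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extraction ("
        "head_class TEXT, head TEXT, property TEXT, feat_class TEXT, feature TEXT)"
    )
    conn.executemany("INSERT INTO extraction VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
populate(conn, ROWS)

# Example query: every feature recorded for the root, in insertion order.
root_features = [
    row[0]
    for row in conn.execute(
        "SELECT feature FROM extraction WHERE head_class = ? ORDER BY rowid",
        ("Root",),
    )
]
```

A single flat table keeps the sketch short. Separate tables for heads, properties, and features would mirror the three-part ontology more closely.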

Page 31:

Precision

| | R. acris | R. bulbosus | R. hederaceus | Avg |
| Single description, average | 78 | 60 | 83 | 74 |
| Single description, average, for whole template | 78 | 60 | 83 | 74 |
| Merged, for whole template | 63 | 58 | 69 | 63 |

Page 32:

Recall

| | R. acris | R. bulbosus | R. hederaceus | Avg |
| Single description, average | 70 | 55 | 74 | 66 |
| Single description, average, for whole template | 22 | 18 | 26 | 22 |
| Merged, for whole template | 69 | 61 | 82 | 71 |

Page 33:

F-measure

| | R. acris | R. bulbosus | R. hederaceus | Avg |
| Single description, average | 73.78 | 57.39 | 78.24 | 69.77 |
| Single description, average, for whole template | 34.32 | 27.69 | 39.60 | 33.92 |
| Merged, for whole template | 65.86 | 59.46 | 74.94 | 66.76 |
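
The F-measure figures on this slide appear consistent with the balanced F-measure, the harmonic mean of precision and recall, F = 2PR / (P + R), applied to the precision and recall values on the two preceding slides. The short Python sketch below recomputes three of the R. acris and average cells as a check; the function name `f_measure` is chosen for the example only.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs taken from the Precision and Recall slides.
acris_single = round(f_measure(78, 70), 2)          # single description, average
acris_whole_template = round(f_measure(78, 22), 2)  # single description, whole template
avg_merged = round(f_measure(63, 71), 2)            # merged, whole template, Avg column
```

The second case shows why merging matters: precision stays at 78 while recall against the whole template falls to 22, and the harmonic mean is pulled down towards the weaker of the two.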

Page 34:

Information merging

| | R. acris | R. bulbosus | R. hederaceus | Avg |
| Of all instances of missed information, percentage compensated for by merging | 50 | 46 | 55 | 50 |
| Of total number of slots in template, percentage where merging allowed compensation for missed information | 25 | 29 | 18 | 24 |

Page 35:

Information merging

These figures are based on human judgement

Automated “merging reasoner” under active construction
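
The merging step itself is not specified in these slides, and the automated reasoner is described as still under construction. Purely as an illustration of the idea, the Python sketch below merges slot values from several filled templates and records which source supplied each value. It is not the authors' reasoner. The source labels (`source_a` and so on), the slot names, and the functions `merge_templates` and `coverage` are hypothetical; the slot values are borrowed from the hand-parsed petal descriptions, without implying which Flora each came from.

```python
def merge_templates(templates):
    """Merge per-source templates.

    templates: dict mapping source name -> dict mapping slot -> value.
    Returns a dict mapping slot -> {value: [sources that gave this value]}.
    """
    merged = {}
    for source, slots in templates.items():
        for slot, value in slots.items():
            merged.setdefault(slot, {}).setdefault(value, []).append(source)
    return merged


def coverage(template, merged):
    """Fraction of the merged template's slots that one source fills."""
    return len(template) / len(merged)


# Hypothetical source labels; values taken from the hand-parsed petal table.
TEMPLATES = {
    "source_a": {"petal.number": "5", "petal.colour": "yellow"},
    "source_b": {"petal.number": "5", "petal.shape": "broadly obovate"},
    "source_c": {"petal.length": "9-13 mm", "petal.shape": "rounded-obovate"},
}

MERGED = merge_templates(TEMPLATES)
coverage_a = coverage(TEMPLATES["source_a"], MERGED)
```

Agreeing values ("5" from two sources) collapse into one entry with two supporting sources, while differing values ("broadly obovate" against "rounded-obovate") are kept side by side for a human or a later reasoner to reconcile. The `coverage` ratio corresponds to the per-text share of total information reported in the hand analysis.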

Page 36:

Future work – short term

Fine-tuning to improve precision

(Semi-) automatic template correlation heuristics

(Semi-) automatic data correlation heuristics

Extend coverage and evaluation

Page 37:

Future targets

Techniques:

Merging reasoner

Temporal reasoner

Data types:

Large-scale legacy data in biodiversity studies

Free text annotations in Bioinformatics databases
