
Toward Making Online Biological Data Machine Understandable

Cui Tao

Data Extraction Research Group, Department of Computer Science, Brigham Young University, Provo, UT 84602

Introduction

Source Location by Semantic Indexing

Contact Information

Data Extraction Research Group, Department of Computer Science, Brigham Young University, Provo, UT 84602

Cui Tao, [email protected]

http://www.deg.byu.edu/

Conclusions

PROBLEMS:

Huge, evolving number of bio-databases, e.g. the Molecular Biology Database Collection:

2004: total 548, 162 more than 2003

2005: total 719, 171 more than 2004

Different access capabilities

Syntactic heterogeneity

Semantic heterogeneity

Updated at anytime by independent authorities

SOLUTION:

Source page understanding

Table Interpretation

Aligning with an ontology

Source location through semantic annotation

Metadata vs. instance data annotation

Use of annotation in query processing

Ontology evolution

Adjustments to ISA and Part-Of hierarchies

Addition of attributes

GOALS:

To help biologists search across various resources

Examples:

Cross-linked information (join queries): “Find genes which are longer than 5 kbp, whose products have at least two helices, and which participate in glycolysis” – GenBank, PDB, KEGG

Collecting information from similar data sources (union queries): “Find genes newly annotated after Jan. 2003 in the fly and worm genomes” – FlyBase, WormBase
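The two query styles can be sketched over toy in-memory records. Everything here is an assumption made for illustration: the field names (`length_bp`, `helices`, `pathways`, `annotated`), the gene identifiers, and all values are hypothetical placeholders, not data from GenBank, PDB, KEGG, FlyBase, or WormBase.

```python
from datetime import date

# Hypothetical records standing in for three cross-linked sources,
# all keyed by the same (made-up) gene identifier.
genbank = {"g1": {"length_bp": 6200}, "g2": {"length_bp": 3100}, "g3": {"length_bp": 7400}}
pdb = {"g1": {"helices": 3}, "g3": {"helices": 1}}
kegg = {"g1": {"pathways": {"glycolysis"}}, "g3": {"pathways": {"glycolysis"}}}


def join_query(genbank, pdb, kegg):
    """Join: a gene must satisfy one condition in each of the three sources."""
    hits = []
    for gene, rec in genbank.items():
        if rec["length_bp"] <= 5000:
            continue
        if pdb.get(gene, {}).get("helices", 0) < 2:
            continue
        if "glycolysis" not in kegg.get(gene, {}).get("pathways", set()):
            continue
        hits.append(gene)
    return sorted(hits)


# Hypothetical records standing in for two similar sources.
flybase = [
    {"gene": "fly_a", "annotated": date(2003, 5, 1)},
    {"gene": "fly_b", "annotated": date(2002, 3, 9)},
]
wormbase = [{"gene": "worm_a", "annotated": date(2004, 1, 15)}]


def union_query(sources, after):
    """Union: apply the same condition to every source and pool the results."""
    return sorted(r["gene"] for src in sources for r in src if r["annotated"] > after)


join_result = join_query(genbank, pdb, kegg)
union_result = union_query([flybase, wormbase], date(2003, 1, 31))
```

The join needs a shared identifier across sources, while the union needs the sources to agree on what a field means; both depend on the sources being aligned with a common ontology first.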

[Figure: DOM trees (table, tr, td nodes) of two sample tables. The cell contents shown in the figure are listed below, column by column, in source order.]

Table 1

Gene Model: F47G6.1 1, 2

Status: confirmed by cDNA(s)

Nucleotides (coding/transcript): 1773/7391 bp

Protein: WP:CE26812

Swissprot: DTN1_CAEEL

Amino Acids: 590 aa

Table 2

Gene Model: F18H3.5b 1, 2, 3; F18H3.5a 1, 2

Status: confirmed by cDNA(s); partially confirmed by cDNA(s)

Nucleotides (coding/transcript): 1029/3051 bp; 1221/1704 bp

Protein: WP:CE18608; WP:CE28918

Amino Acids: 342 aa; 406 aa

SAMPLE ONTOLOGY OBJECT RECOGNITION

Key Concepts: sample ontology object, expected values

Steps:

Map the values to the sample ontology object set

Map the labels to the ontology concepts

Understand all pages from the same web site
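A minimal sketch of these steps, under stated assumptions: the sample ontology object is modeled as a plain dictionary from concept name to expected value, a page is modeled as a dictionary from label to value, and matching is exact after light normalization. The concept names (`Protein ID`, `Swiss-Prot Entry`, `Length in Amino Acids`) and the helper names are hypothetical; the cell values are borrowed from the poster's example tables.

```python
def normalize(text):
    """Lowercase and drop whitespace and commas so '590 AA' matches '590 aa'."""
    return "".join(ch for ch in text.lower() if not ch.isspace() and ch != ",")


def recognize(sample_object, page_fields):
    """Steps 1 and 2: find the sample object's expected values on the page,
    then map each matching page label to the corresponding ontology concept."""
    by_value = {normalize(v): concept for concept, v in sample_object.items()}
    mapping = {}
    for label, value in page_fields.items():
        concept = by_value.get(normalize(value))
        if concept is not None:
            mapping[label] = concept
    return mapping


def annotate(mapping, page_fields):
    """Step 3: reuse the label mapping on another page from the same site."""
    return {mapping[l]: v for l, v in page_fields.items() if l in mapping}


# Sample ontology object (hypothetical concept names, expected values).
sample_object = {
    "Protein ID": "WP:CE26812",
    "Swiss-Prot Entry": "DTN1_CAEEL",
    "Length in Amino Acids": "590 AA",
}
# The page that describes the same object.
sample_page = {
    "Status": "confirmed by cDNA(s)",
    "Protein": "WP:CE26812",
    "Swissprot": "DTN1_CAEEL",
    "Amino Acids": "590 aa",
}
# Another page from the same site, with the same labels.
other_page = {"Status": "confirmed by cDNA(s)", "Protein": "WP:CE18608", "Amino Acids": "342 aa"}

mapping = recognize(sample_object, sample_page)
annotated = annotate(mapping, other_page)
```

Labels whose values do not appear in the sample object (here `Status`) stay unmapped, which is one reason a single sample object may not be enough for a whole site.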

Ontology Evolution

Source Page Understanding

Key Concepts: sibling pages and sibling tables

Main Idea:

Compare two sibling tables:

variable fields ~ values & fixed fields ~ labels

Structure pattern for one pair of sibling tables

General structure pattern for all sibling tables
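The move from a pair pattern to a general pattern can be sketched as follows. This is an illustration under assumptions, not the group's implementation: tables are simplified to lists of rows with identical shape, cells are compared by position, and the rule "a field is a label only if it is fixed in every compared pair" is one plausible way to generalize. The rows reuse values from the poster's example, rearranged into three one-row tables with a reduced set of columns.

```python
def pair_pattern(table_a, table_b):
    """Compare two sibling tables cell by cell.
    Fixed cells are treated as labels, variable cells as values."""
    pattern = {}
    for r, (row_a, row_b) in enumerate(zip(table_a, table_b)):
        for c, (cell_a, cell_b) in enumerate(zip(row_a, row_b)):
            pattern[(r, c)] = "label" if cell_a == cell_b else "value"
    return pattern


def general_pattern(tables):
    """Merge the pair patterns of the first table against every other table.
    Once a position has been seen to vary, it stays a value."""
    first = tables[0]
    merged = {}
    for other in tables[1:]:
        for pos, kind in pair_pattern(first, other).items():
            if merged.get(pos) == "value":
                continue
            merged[pos] = kind
    return merged


header = ["Gene Model", "Status", "Protein", "Amino Acids"]
t1 = [header, ["F47G6.1", "confirmed by cDNA(s)", "WP:CE26812", "590 aa"]]
t2 = [header, ["F18H3.5b", "confirmed by cDNA(s)", "WP:CE18608", "342 aa"]]
t3 = [header, ["F18H3.5a", "partially confirmed by cDNA(s)", "WP:CE28918", "406 aa"]]

one_pair = pair_pattern(t1, t2)
general = general_pattern([t1, t2, t3])
```

In the single pair `t1`/`t2` the Status cell happens to be identical and would be misread as a label; comparing against `t3` as well corrects it, which is the motivation for generalizing over all sibling tables.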

SIBLING PAGE COMPARISON

Steps:

Transform each HTML table into a DOM tree

Find sibling tree pairs

Compare and find matched nodes

Generate a structure pattern for all sibling tables
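The tree-building and node-matching steps might look like the sketch below, which uses Python's standard `html.parser`. It rests on several assumptions: the markup is well formed with no unclosed void elements such as a bare `<br>`, the sibling pair has already been identified, and nodes are matched simply by having the same path in both trees. The class and function names are hypothetical, and the two HTML snippets are hand-made from values in the poster's example.

```python
from html.parser import HTMLParser


class TableTree(HTMLParser):
    """Records the text of each node, keyed by its path in the DOM tree."""

    def __init__(self):
        super().__init__()
        self.stack = []      # path components of the currently open elements
        self.counts = [{}]   # per-depth counters of child tags seen so far
        self.text = {}       # path -> text content

    def handle_starttag(self, tag, attrs):
        seen = self.counts[-1]
        seen[tag] = seen.get(tag, 0) + 1
        self.stack.append(f"{tag}[{seen[tag]}]")
        self.counts.append({})

    def handle_endtag(self, tag):
        # Assumes each end tag closes the innermost open element.
        if self.stack:
            self.stack.pop()
            self.counts.pop()

    def handle_data(self, data):
        data = data.strip()
        if data and self.stack:
            path = "/".join(self.stack)
            self.text[path] = self.text.get(path, "") + data


def to_tree(html):
    """Step 1: turn an HTML table into a path -> text mapping."""
    parser = TableTree()
    parser.feed(html)
    parser.close()
    return parser.text


def compare(tree_a, tree_b):
    """Step 3: matched nodes with equal text are labels, the rest are values."""
    shared = tree_a.keys() & tree_b.keys()
    labels = sorted(p for p in shared if tree_a[p] == tree_b[p])
    values = sorted(p for p in shared if tree_a[p] != tree_b[p])
    return labels, values


page_a = "<table><tr><td>Protein</td><td>WP:CE26812</td></tr><tr><td>Amino Acids</td><td>590 aa</td></tr></table>"
page_b = "<table><tr><td>Protein</td><td>WP:CE18608</td></tr><tr><td>Amino Acids</td><td>342 aa</td></tr></table>"

labels, values = compare(to_tree(page_a), to_tree(page_b))
```

Matching by exact path is the simplest possible choice; sibling tables with extra or missing rows and columns would need a more tolerant tree-matching method.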

[Figure labels: Source Organism; Accession Number; Protein Name; Length in Amino Acids; Molecular Weight in Da. Source shown: ProtoNet]

Semantic Web

Semantic annotation

Query

META-DATA ANNOTATION

DATA ANNOTATION

Likely to have “imperfect” ontologies

Can enrich semi-automatically

Two possibilities:

Value enrichment

Object-set and relationship-set enrichment
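Value enrichment can be pictured with a short sketch. It assumes that each ontology concept carries a value lexicon (modeled here as a set of strings) and that page labels have already been mapped to concepts; new values found under a mapped label are added to that concept's lexicon and reported so that a person can review them, in keeping with the semi-automatic setting. The function name, the label-to-concept mapping, and the example values are hypothetical.

```python
def enrich_values(lexicons, label_to_concept, page_fields):
    """Add unseen values from mapped fields to the concept's lexicon.
    Returns the (concept, value) additions so they can be reviewed."""
    added = []
    for label, value in page_fields.items():
        concept = label_to_concept.get(label)
        if concept is None:
            continue  # unmapped labels are left for object-set enrichment
        lexicon = lexicons.setdefault(concept, set())
        if value not in lexicon:
            lexicon.add(value)
            added.append((concept, value))
    return added


# Hypothetical starting lexicon and mapping.
lexicons = {"Species": {"Caenorhabditis elegans"}}
label_to_concept = {"Source Organism": "Species"}
page_fields = {"Source Organism": "Drosophila melanogaster", "Molecular Weight in Da": "12345"}

first_pass = enrich_values(lexicons, label_to_concept, page_fields)
second_pass = enrich_values(lexicons, label_to_concept, page_fields)
```

Running the function twice on the same page adds nothing the second time, so repeated crawls of a source do not inflate the lexicon.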

VALUE ENRICHMENT

[Figure labels: Source; Target; Source Organism; Accession Number; Protein Name; Molecular Weight in Da]

RELATIONSHIP-SET ENRICHMENT

OBJECT-SET ENRICHMENT

Start End


Location

Gene

“37,612,680”;

“37,610,585”;

“3,095”:

A sample ontology object (partial information)

Two sample pages (partial information)

Species

Protein Name

Map to

Update values

Implementation Status:

Finished: sibling table comparison technique

Working on: sample ontology object recognition; ontology generation in the biological domain

Delimitations:

Ontology: will not cover everything in the domain

Source page understanding: structured/semi-structured

Value enrichment: only value lexicons

Object-set and relationship-set enrichment: only ISA and Part-Of hierarchies and simple attribute additions

Old ontology

Updated ontology

Possible new object sets that could be added to the ontology

