
Post on 21-Dec-2015


Toward Making Online Biological Data Machine Understandable Cui Tao

Data Extraction Research Group, Department of Computer Science, Brigham Young University, Provo, UT 84602

Introduction

Source Location by Semantic Indexing

Contact Information

Data Extraction Research Group, Department of Computer Science, Brigham Young University, Provo, UT 84602

Cui Tao, [email protected]

http://www.deg.byu.edu/

Conclusions

PROBLEMS:

Huge, evolving number of bio-databases, e.g. the molecular biology database collection

2004: total 548, 162 more than 2003

2005: total 719, 171 more than 2004

Different access capabilities

Syntactic heterogeneity

Semantic heterogeneity

Updated at any time by independent authorities

SOLUTION:

Source page understanding

Table Interpretation

Aligning with an ontology

Source location through semantic annotation

Metadata vs. instance data annotation

Use of annotation in query processing

Ontology evolution

Adjustments to ISA and Part-Of hierarchies

Addition of attributes
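The "source location through semantic annotation" item in the outline above can be illustrated with a small sketch. It assumes that metadata annotation records which ontology concepts each source covers, and that query processing looks up the sources for each concept a query mentions. The source names, the concept coverage, and the function names are all hypothetical; the poster does not describe the index structure.

```python
from collections import defaultdict

# Hypothetical metadata annotations: ontology concepts covered by each source.
metadata_annotations = {
    "source_a": {"Gene", "Length in Base Pairs"},
    "source_b": {"Protein Name", "Length in Amino Acid"},
    "source_c": {"Gene", "Pathway"},
}


def build_semantic_index(annotations):
    """Invert source -> concepts into concept -> set of sources."""
    index = defaultdict(set)
    for source, concepts in annotations.items():
        for concept in concepts:
            index[concept].add(source)
    return index


def locate_sources(index, query_concepts):
    """For each concept in a query, list the sources annotated with it."""
    return {c: sorted(index.get(c, set())) for c in query_concepts}


index = build_semantic_index(metadata_annotations)
plan = locate_sources(index, ["Gene", "Pathway", "Molecular Weight in Da"])
print(plan)
```

A concept that maps to an empty list signals that no annotated source covers it yet.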

GOALS:

To help biologists cross-search various resources

Examples:

Cross-linked information (Join queries)

“Find genes which are longer than 5 kbp, whose products have at least two helices, and participate in glycolysis” – GenBank, PDB, KEGG

Collecting information from similar data sources (Union queries)

“Find genes newly annotated after Jan. 2003 in the fly and worm genomes” – FlyBase, WormBase
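The difference between the two query types can be sketched over in-memory records. This is a minimal illustration, not how the actual sources are accessed: the records, field names, gene identifiers, and the cutoff date are invented, and each list merely stands in for one source.

```python
from datetime import date

# Invented records; each list stands in for one source.
genes = [{"gene": "g1", "length_bp": 7200},
         {"gene": "g2", "length_bp": 3100},
         {"gene": "g3", "length_bp": 9800}]
structures = [{"gene": "g1", "helices": 3},
              {"gene": "g2", "helices": 4},
              {"gene": "g3", "helices": 1}]
pathways = [{"gene": "g1", "pathway": "glycolysis"},
            {"gene": "g3", "pathway": "glycolysis"}]


def join_query(genes, structures, pathways):
    """Join: a gene must satisfy a condition in every source."""
    helices = {s["gene"]: s["helices"] for s in structures}
    glycolysis = {p["gene"] for p in pathways if p["pathway"] == "glycolysis"}
    return [g["gene"] for g in genes
            if g["length_bp"] > 5000
            and helices.get(g["gene"], 0) >= 2
            and g["gene"] in glycolysis]


def union_query(sources, cutoff):
    """Union: collect matching records from each similar source."""
    return [(name, r["gene"])
            for name, records in sources.items()
            for r in records
            if r["annotated"] > cutoff]


similar_sources = {
    "fly": [{"gene": "f1", "annotated": date(2003, 6, 1)},
            {"gene": "f2", "annotated": date(2002, 11, 5)}],
    "worm": [{"gene": "w1", "annotated": date(2004, 2, 9)}],
}

joined = join_query(genes, structures, pathways)
collected = union_query(similar_sources, date(2003, 1, 31))
print(joined, collected)
```

The join needs the sources to agree on what identifies a gene, while the union needs them to agree on what "annotated" means, which is where the syntactic and semantic heterogeneity listed under PROBLEMS becomes an obstacle.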

Two sibling tables, each shown as a DOM tree of table, tr, and td nodes:

Table 1

Gene Model | Status | Nucleotides (coding/transcript) | Protein | Swissprot | Amino Acids

F47G6.1 1, 2 | confirmed by cDNA(s) | 1773/7391 bp | WP:CE26812 | DTN1_CAEEL | 590 aa

Table 2 (Gene Model column: F18H3.5b 1, 2, 3; F18H3.5a 1, 2)

Status | Nucleotides (coding/transcript) | Protein | Amino Acids

confirmed by cDNA(s) | 1029/3051 bp | WP:CE18608 | 342 aa

partially confirmed by cDNA(s) | 1221/1704 bp | WP:CE28918 | 406 aa

SAMPLE ONTOLOGY OBJECT RECOGNITION

Key Concepts: sample ontology object, expected values

Steps:

Map the values with the sample ontology object set

Map the labels with the ontology concepts

Understand all pages from the same web site
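The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions: a sample ontology object is modelled as a dictionary from ontology concept to expected value, a page as a dictionary from label to value, and matching as normalized substring containment. The function names, the page labels, and most values are hypothetical; the poster does not specify the matching procedure.

```python
def normalize(text):
    """Lowercase and keep only letters and digits, so '590 aa' -> '590aa'."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def map_labels(sample_object, page_fields):
    """Steps 1 and 2: find the page value that contains an expected value,
    then map that value's label to the ontology concept."""
    mapping = {}
    for label, value in page_fields.items():
        for concept, expected in sample_object.items():
            if normalize(expected) and normalize(expected) in normalize(value):
                mapping[label] = concept
                break
    return mapping


def annotate(mapping, page_fields):
    """Step 3: reuse the label mapping on another page from the same site."""
    return {mapping[label]: value
            for label, value in page_fields.items() if label in mapping}


# Hypothetical sample ontology object and two pages from one site.
sample_object = {"Source Organism": "Caenorhabditis elegans",
                 "Protein Name": "dystrobrevin",
                 "Length in Amino Acid": "590"}
page = {"Species": "Caenorhabditis elegans",
        "Description": "Dystrobrevin homolog",
        "Amino Acids": "590 aa",
        "Status": "confirmed by cDNA(s)"}
sibling = {"Species": "Caenorhabditis elegans",
           "Description": "hypothetical protein",
           "Amino Acids": "406 aa",
           "Status": "partially confirmed by cDNA(s)"}

mapping = map_labels(sample_object, page)
annotated = annotate(mapping, sibling)
print(mapping)
print(annotated)
```

Labels with no matching expected value, such as Status here, stay unmapped.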

Ontology Evolution

Source Page Understanding

Key Concepts: sibling pages and sibling tables

Main Idea:

Compare two sibling tables:

variable fields ~ values & fixed fields ~ labels

Structure pattern for one pair of sibling tables

General structure pattern for all sibling tables

SIBLING PAGE COMPARISON

Steps:

Transfer each HTML table to a DOM tree

Find sibling tree pairs

Compare and find matched nodes

Generate a structure pattern for all sibling tables
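A minimal sketch of the comparison idea, assuming both sibling tables share the same layout: cells whose text is identical in both tables are treated as labels (fixed fields), and cells whose text differs are treated as values (variable fields). It matches cells by row and column position rather than matching nodes of a full DOM tree, which is a simplification and not the poster's algorithm. The class and function names are hypothetical; the cell values are taken from the sample tables.

```python
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect the text of each td/th cell, keyed by (row, column)."""

    def __init__(self):
        super().__init__()
        self.cells = {}
        self.row = -1
        self.col = -1
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row += 1
            self.col = -1
        elif tag in ("td", "th"):
            self.col += 1
            self.in_cell = True
            self.cells[(self.row, self.col)] = ""

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[(self.row, self.col)] += data.strip()


def parse_table(html):
    parser = TableCells()
    parser.feed(html)
    parser.close()
    return parser.cells


def compare_siblings(cells_a, cells_b):
    """Mark each shared position as a fixed 'label' or a variable 'value'."""
    return {pos: "label" if cells_a[pos] == cells_b[pos] else "value"
            for pos in sorted(set(cells_a) & set(cells_b))}


table_a = """<table>
<tr><td>Status</td><td>Protein</td><td>Amino Acids</td></tr>
<tr><td>confirmed by cDNA(s)</td><td>WP:CE26812</td><td>590 aa</td></tr>
</table>"""
table_b = """<table>
<tr><td>Status</td><td>Protein</td><td>Amino Acids</td></tr>
<tr><td>partially confirmed by cDNA(s)</td><td>WP:CE28918</td><td>406 aa</td></tr>
</table>"""

cells_a = parse_table(table_a)
pattern = compare_siblings(cells_a, parse_table(table_b))
labels = [cells_a[pos] for pos, kind in pattern.items() if kind == "label"]
print(pattern)
print(labels)
```

A value that happens to be identical on both pages would be misread as a label, which is one reason to compare more than one pair of sibling tables before generalizing the pattern.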

Source Organism

Accession Number

Protein Name

Length in Amino Acid

Molecular Weight in Da

ProtoNet

Semantic Web

Semantic annotation

Query

META-DATA ANNOTATION

DATA ANNOTATION

Likely to have “imperfect” ontologies

Can enrich semi-automatically

Two possibilities:

Value enrichment

Object-set and relationship-set enrichment
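Value enrichment, the first possibility above, can be sketched as a two-stage process that keeps a human in the loop, in line with "semi-automatically". The sketch assumes each ontology concept carries a value lexicon (a set of known values). Values seen during annotation that the lexicon lacks are proposed first and added only after approval. The function names, the `approve` callback, and the example values are assumptions made for illustration.

```python
def propose_values(lexicons, annotated_values):
    """Collect values seen for a known concept that its lexicon lacks."""
    proposals = {}
    for concept, value in annotated_values:
        known = lexicons.get(concept)
        if known is None:
            # Unknown concept: a case for object-set enrichment, skipped here.
            continue
        if value not in known:
            proposals.setdefault(concept, set()).add(value)
    return proposals


def accept_values(lexicons, proposals, approve):
    """Add each proposed value that the reviewer callback approves."""
    for concept, values in proposals.items():
        for value in sorted(values):
            if approve(concept, value):
                lexicons[concept].add(value)
    return lexicons


# Hypothetical lexicon and annotated (concept, value) pairs.
lexicons = {"Source Organism": {"Homo sapiens"}}
annotated_values = [("Source Organism", "Caenorhabditis elegans"),
                    ("Source Organism", "Homo sapiens"),
                    ("Gene Location", "chromosome I")]

proposals = propose_values(lexicons, annotated_values)
accept_values(lexicons, proposals, approve=lambda concept, value: True)
print(proposals)
print(lexicons)
```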

VALUE ENRICHMENT

Source

Target

Source Organism

Accession Number

Protein Name

Length in Amino Acid

Molecular Weight in Da

RELATIONSHIP-SET ENRICHMENT

OBJECT-SET ENRICHMENT

Start End

Length in Amino Acid

Location

Gene

“37,612,680”;

“37,610,585”;

“3,095”:

A sample ontology object (partial information)

Two sample pages (partial information)

Species

Protein Name

Map to

Update values

Implementation Status:

Finished: sibling table comparison technique

Working on: sample ontology object recognition; ontology generation in the biological domain

Delimitations:

Ontology: will not cover everything in the domain

Source page understanding: structured/semi-structured

Value enrichment: only value lexicons

Object set and relationship set enrichment: only ISA and Part-Of hierarchies and simple attribute additions

Old ontology

Updated ontology

Possible new object sets that could be added to the ontology
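The step from an old ontology to an updated one can be sketched with a small data structure restricted to what the delimitations allow: ISA and Part-Of hierarchies plus simple attribute additions. The class, its fields, and the example (a Location object set with Start and End attributes placed under Gene) are assumptions made for illustration. The poster lists these labels but does not state how they are related.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    object_sets: set = field(default_factory=set)
    isa: dict = field(default_factory=dict)         # child -> parent
    part_of: dict = field(default_factory=dict)     # part -> whole
    attributes: dict = field(default_factory=dict)  # object set -> attribute names

    def add_object_set(self, name, isa=None, part_of=None):
        """Add an object set, optionally placing it in a hierarchy."""
        for parent in (isa, part_of):
            if parent is not None and parent not in self.object_sets:
                raise ValueError(f"unknown object set: {parent}")
        self.object_sets.add(name)
        if isa is not None:
            self.isa[name] = isa
        if part_of is not None:
            self.part_of[name] = part_of

    def add_attribute(self, object_set, attribute):
        """Attach a simple attribute to an existing object set."""
        if object_set not in self.object_sets:
            raise ValueError(f"unknown object set: {object_set}")
        self.attributes.setdefault(object_set, set()).add(attribute)


# Hypothetical old ontology, then an update with a candidate object set.
ontology = Ontology(object_sets={"Gene", "Protein"})
ontology.add_object_set("Location", part_of="Gene")
ontology.add_attribute("Location", "Start")
ontology.add_attribute("Location", "End")
print(ontology)
```

Rejecting a parent that is not already in the ontology keeps both hierarchies connected after each update.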

Data Extraction Research Group