Supporting on-the-fly data Integration for bioinformatics
Candidate: Xuan Zhang
Advisor: Gagan Agrawal
Road Map
• Mission Statement
• Motivation
• Implementation
• Comprehensive Examples
• Future work
• Conclusion
Mission Statement
• Enhance information integration systems on
  – Functionality
    • On-the-fly data incorporation
    • Flat file data processing
  – Usability
    • Declarative interface
    • Low programming requirement
Motivation
• Integration is essential for biological research
  – Biological data include
    • Sequences: DNA (GenBank), protein (Swiss-Prot)
    • Structure: RNA (RNAbase), protein (PDB)
    • Interaction: pathway (KEGG), regulation (GRBase)
    • Function: disease (OMIM)
    • Secondary: protein family (Pfam)
  – Biological data are inter-related.
Motivation
• Challenges of bioinformatics integration
  – Data volume: overwhelming
    • DNA sequence: 100 gigabases (August, 2005)
  – Data growth: exponential
[Figure provided by PDB]
Motivation
• Challenges of bioinformatics integration (cont.)
  – Tools: many, and growing
  – Service interfaces: varied
    • Web pages
    • Web service
    • Grid service
Motivation
• Challenges of bioinformatics integration (cont.)
  – Inter-operability: low
    • Heterogeneous data sources
      – Semi-structured by nature
      – Flat file, relational, object-oriented databases
    • Independently developed tools
    • No data exchange standard
  – Little collaboration
Road Map
• Mission Statement
• Motivation
• Implementation
  – Approach Overview
  – Advantage
  – Components
• Future
• Conclusion
Approach Summary
• Metadata
  – Declarative description of data
  – Data mining algorithms for semi-automatic writing
  – Reusable by different requests on same data
• Code generation
  – Request analysis and execution separated
  – General modules with plug-in data module
System Overview
[Figure: Information Integration System. "Understand Data": Layout Miner and Schema Miner read a data file and produce the metadata description (a layout descriptor and a schema descriptor per dataset). "Process Data": Code Generation and the Request Processor take the metadata description and the user request and return the answer.]
Advantages
• Simple interface
  – At metadata level, declarative
• General data model
  – Semi-structured data
  – Flat file data
• Low human involvement
  – Semi-automatic data incorporation
  – Low maintenance cost
• Acceptable performance
  – Linear scaling guaranteed
Road Map
• Mission Statement
• Motivation
• Implementation
  – Approach Overview
  – Advantage
  – Components
• Future
• Conclusion
System Components
• Understand data
  – Layout mining
  – Schema mining
• Process data
  – Wrapper generation
  – Query Process
  – Query Process with indices
Layout Mining
• Goal 1: Separate delimiters from values
  – D-score: location & frequency
• Goal 2: Organize delimiters and values
  – NFA
[Figure: layout mining pipeline: Data File → Token Parser → Tokens → Delimiter Mining → Candidate Delimiters → Layout Learning → Layout Descriptor]
Schema Mining Road Map
• Schema Mining
  – Overview
  – Mining System
  – Core Mining Algorithm
  – Experiments
Schema Mining Goals
• Ultimate goal: discover schema about an unknown flat file dataset
• Immediate goal: Assign attributes with meaningful labels
Our Approach
• Summarize values from the bottom up
• Use knowledge from
  – Ontology
  – Heuristics
• A heads-up: attribute label ≠ attribute name
  – What we can mine
    • date
  – What we cannot do
    • Creation date, last modification date, birthday, …
Schema Mining Road Map
• Schema Mining
  – Overview
  – Mining System
  – Core Mining Algorithm
  – Experiments
Schema Mining System
• Major Components
  – Data cleaning and summarization
  – Score calculation
    • Score function
    • Ontology
    • Heuristics
  – Score clustering
[Figure: mining pipeline: Raw attribute values → Value cleaning and summarization → Attribute summaries → Score calculation → Scores → Clustering algorithm → Cutoff values → Labeling → Attribute labels]
Data Summarization
• Goal: reduce amount of data
• Collect frequent tokens
  – Approximate frequent token mining algorithm
• Token categorization by profile
  – Token profile: an ordered list of N (numerical), A (alphabetic) and special characters
  – Token categories:
    • Word, number, else and other user defined categories
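A minimal sketch of how a token profile could be computed, assuming that each run of digits collapses to N, each run of letters collapses to A, and every other character is kept as-is. The function names and the exact category rules are illustrative assumptions, not the system's actual code.

```python
import re

# One pass over the token: a run of digits or a run of ASCII letters.
_RUN = re.compile(r"[0-9]+|[A-Za-z]+")


def token_profile(token: str) -> str:
    """Collapse digit runs to 'N' and letter runs to 'A'; keep special characters."""
    return _RUN.sub(lambda m: "N" if m.group()[0].isdigit() else "A", token)


def token_category(token: str) -> str:
    """Hypothetical mapping from a profile to the word/number/else categories."""
    profile = token_profile(token)
    if profile == "A":
        return "word"
    if profile in {"N", "N.N"}:
        return "number"
    return "else"
```

For example, a GenBank-style date such as 12-SEP-1993 has the profile N-A-N, which is one of the profiles the heuristics later associate with dates.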
Score Function Template
• Desired properties
  – Simple
  – Adjustable trade-off between sensitivity and error tolerance
[Figure: score function template, scale 0.0 to 1.0; labels: F_pt, B_pt, t; Temperature]
Score Clustering
• Goal: Sort attributes into three groups, H (high), L (low) and M (middle), by scores
• Mathematically, find two cutoff scores, score_i and score_j, from {score_1, score_2, score_3, …, score_N}, to minimize the standard deviation
• N (number of attributes) is not large. Exact answer can be found.
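Because N is small, every pair of cut points can be tried. The sketch below assumes that the quantity minimized is the sum of the population standard deviations of the three groups; the slides do not spell out the objective, so this choice and the function name are assumptions.

```python
from itertools import combinations
from statistics import pstdev


def split_three_ways(scores):
    """Exhaustively split sorted scores into low, middle and high groups.

    Assumed objective: minimize the sum of within-group population
    standard deviations. Every group must be non-empty.
    """
    ordered = sorted(scores)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least three scores")
    best_cost, best_groups = None, None
    # i and j are the indices where the middle and high groups start.
    for i, j in combinations(range(1, n), 2):
        groups = (ordered[:i], ordered[i:j], ordered[j:])
        cost = sum(pstdev(group) for group in groups)
        if best_cost is None or cost < best_cost:
            best_cost, best_groups = cost, groups
    return best_groups
```

The search visits O(N^2) splits, which is cheap when N is the number of attributes in a dataset.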
Schema Mining Road Map
• Schema Mining
  – Overview
  – Mining System
  – Core Mining Algorithm
    • Mining with ontology
    • Mining with heuristics
  – Experiments
Use of Ontology
• An observation: a similarity between ontology and schema
  – Both satisfy the “is-a” relation
    • E.g. “Diabetes is a disease.”
    • Ontology: “diabetes” is a child of “disease”
    • Schema: “diabetes” is a valid instance of attribute “disease”
• Common ancestors in ontology ~ attribute label
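The observation can be illustrated with a toy ontology stored as a child-to-parent map. All terms other than "diabetes" and "disease" are hypothetical, and the lookup below is only a sketch of the idea that a shared ancestor suggests an attribute label.

```python
# Hypothetical toy ontology: child term -> parent term.
PARENT = {
    "diabetes": "disease",
    "asthma": "disease",
    "disease": "phenotype",
    "nucleus": "cellular component",
}


def ancestors(term):
    """Return the chain of ancestors of a term, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def common_label(values):
    """Return the nearest ancestor shared by all values, or None."""
    chains = [ancestors(value) for value in values]
    if not chains:
        return None
    for candidate in chains[0]:
        if all(candidate in chain for chain in chains[1:]):
            return candidate
    return None
```

With this toy data, attribute values "diabetes" and "asthma" share the ancestor "disease", which would be the candidate label.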
Real-world Complications
• To find an arbitrary value in an ontology
  – Complete and comprehensive ontology?
    • Selective sampling
  – Error-free dataset?
    • Adjustable sensitivity & fault tolerance
• Performance
Ontology Database
• Goal: to approximate a complete comprehensive ontology database
• Approach
  – “Complete”: sample popular terms
  – “Comprehensive”: public ontology databases + common facts
• Result
  – 6 major categories
  – 386 terms
Ontology Based Metrics (1)
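1. Occurrence(term):
\[
\mathrm{Occurrence}(term) =
\begin{cases}
\mathrm{Frequent\_Count}[i], & \text{if } term = \mathrm{Frequent\_Token}[i] \\
\min_{i \in [0, t]} \mathrm{Frequent\_Count}[i], & \text{if } term = \mathrm{Frequent\_Token}[0] \mid \dots \mid \mathrm{Frequent\_Token}[t] \\
0, & \text{otherwise}
\end{cases}
\]
2. Strength(term):
\[
\mathrm{Strength}(term) = \mathrm{Occurrence}(term) + \mathrm{Strength}(child\_term)
\]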
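The two metrics can be sketched as follows. Two points are assumptions rather than statements from the slides: a multi-token term is taken to be a whitespace-separated sequence of frequent tokens, and Strength is summed over all children of a term.

```python
def occurrence(term, frequent):
    """Occurrence of an ontology term given a {token: count} summary.

    A single-token term gets its own count; a multi-token term gets the
    minimum count of its tokens; anything else gets 0.
    """
    tokens = term.split()
    if tokens and all(token in frequent for token in tokens):
        return min(frequent[token] for token in tokens)
    return 0


def strength(term, children, frequent):
    """Occurrence of the term plus the strength of its children (assumed: all of them)."""
    return occurrence(term, frequent) + sum(
        strength(child, children, frequent) for child in children.get(term, ())
    )
```

Here `frequent` stands for the frequent-token summary of one attribute and `children` for a parent-to-children view of the ontology; both names are illustrative.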
Ontology Based Metrics (2)
• Two factors
  – Relative strength compared with other concepts
  – Completeness of ontology as a whole
• Ontology score = product of two factors
  – Each modulated by the template score function
Mining With Heuristics (1)
• Use token profile
  – “number”: {N, N.N}
  – “date”: {N-A-N, N/N/N}
• Use frequent token counts
  – “identification”: Frequent_Counts[]=1
• Use other token information
  – “biological sequence”: length >45, or in 10’s
Mining With Heuristics (2)
• Use token sequence information
  – “people name”: length (2~3), separator (“,” or “and”), profile (not number, date)
• Again, these counts are modulated by the template function to calculate scores
Schema Mining Road Map
• Schema Mining
  – Overview
  – Mining System
  – Core Mining Algorithm
  – Experiments
Schema Mining Experiment Design
• Datasets
  – GenBank, UniProt SWISSPROT and Pfam
• Cutoff values
  – Exact clustering
• Evaluation
  – Weighted Cohen’s Kappa
  – Compare the groups “most”, “middle” and “little” with the true labels Y (yes), P (partial) and N (no)
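For reference, a weighted Cohen's Kappa over an ordinal agreement table can be computed as below. The slides do not say which weighting scheme was used; linear disagreement weights |i - j| are an assumption here, as is the table layout (rows for the mined groups, columns for the true labels, both in the same order).

```python
def weighted_kappa(table):
    """Weighted Cohen's Kappa with linear disagreement weights.

    table[i][j] counts items placed in mined group i whose true label is j.
    Raises ZeroDivisionError if chance-expected disagreement is zero
    (for example, when every item falls in a single category).
    """
    k = len(table)
    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    observed = 0.0
    expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j)
            observed += weight * table[i][j] / total
            expected += weight * row_totals[i] * col_totals[j] / total ** 2
    return 1.0 - observed / expected
```

A value of 1 means perfect agreement and 0 means agreement no better than chance.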
Result Summary: Kappa
[Figure: Kappa results with bands marked Very good, Good and Moderate; panels: Cellular Component (O), Date (H), Organism Name (O)]
Labels: 1: cellular component, 2: database, 3: date, 4: free text, 5: ID, 6: molecule type, 7: name, 8: number, 9: organism, 10: publication method, 11: sequence
Schema Mining Summary
• According to Kappa tests, results are good or very good
• Possible improvement
  – Clustering method with better intelligence
  – Better ontology database
  – More involved language analysis
  – Hybrid of bottom-up and top-down approaches
System Components
• Understand data
  – Layout mining
  – Schema mining
• Process data
  – Metadata description language
  – Wrapper generation
  – Query Process
  – Query Process with indices
Data Process Overview
• Automatic code generation approach
• Input
  – Metadata about datasets involved
  – Optional:
    • Implicit data transformation task
    • Request by users
    • Indexing functions
• Output
  – Executable programs
    • General modules
    • Task-specific data module
Metadata Description
• Two aspects of data in flat files
  – Logical view of the data
  – Physical data organization
• Two components of every data descriptor
  – Schema description
  – Layout description
• Design goals
  – Powerful
  – Easy to write and interpret
Metadata Challenges
• Examples of sequence formats
  – ALN/ClustalW format
  – AMPS Block file format
  – ClustalW
  – Codata
  – EMBL
  – GCG/MSF
  – GDE
  – GenBank
  – Fasta (Pearson)
  – NBRF/PIR
  – PDB format
  – Pfam/Stockholm format
  – Phylip
  – Raw
  – RSF
  – UniProtKB/Swiss-Prot
List and example provided by EMBL-EBI
>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
{
name "Short name for sequence"
longname "Long (more descriptive) name for sequence"
sequence-ID "Unique ID number"
creation-date "mm/dd/yy hh:mm:ss"
direction [-1|1]
strandedness [1|2]
type [DNA|RNA||PROTEIN|TEXT|MASK]
offset (-999999,999999)
group-ID (0,999)
creator "Author's name"
descrip "Verbose description"
comments "Lines of comments that can be fairly arbitrary text about a sequence. Return characters are allowed, but no internal double quotes or brace characters. Remember to close with a double quote"
sequence "gctagctagctagctagctcttagctgtagtcgtagctgatgctagct
gatgctagctagctagctagctgatcgatgctagctgatcgtagctgacg
gactgatgctagctagctagctagctgtctagtgtcgtagtgcttattgc"
}
LOCUS       MMFOSB 4145 bp mRNA linear ROD 12-SEP-1993
DEFINITION  Mouse fosB mRNA.
ACCESSION   X14897
VERSION     X14897.1 GI:50991
KEYWORDS    fos cellular oncogene; fosB oncogene; oncogene.
SOURCE      Mus musculus.
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
REFERENCE   1 (bases 1 to 4145)
  AUTHORS   Zerial,M., Toschi,L., Ryseck,R.P., Schuermann,M., Muller,R. and
            Bravo,R.
  TITLE     The product of a novel growth factor activated gene, fos B,
            interacts with JUN proteins enhancing their DNA binding activity
  JOURNAL   EMBO J. 8 (3), 805-813 (1989)
  MEDLINE   89251612
  PUBMED    2498083
COMMENT     clone=AC113-1; cell line=NIH3T3.
FEATURES             Location/Qualifiers
     source          1..4145
                     /organism="Mus musculus"
                     /db_xref="taxon:10090"
     CDS             1202..2218
                     /note="fosB protein (AA 1-338)"
                     /codon_start=1
                     /protein_id="CAA33026.1"
                     /db_xref="GI:50992"
                     /db_xref="MGD:95575"
                     /db_xref="SWISS-PROT:P13346"
                     /translation="MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQEC
                     AGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGT
                     SYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRV
                     RRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAH
                     KPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNL
                     TASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPS
                     LLAL"
BASE COUNT  960 a 1186 c 1007 g 991 t 1 others
ORIGIN
        1 ataaattctt attttgacac tcaccaaaat agtcacctgg aaaacccgct ttttgtgaca
       61 aagtacagaa ggcttggtca catttaaatc actgagaact agagagaaat actatcgcaa
      121 actgtaatag acattacatc cataaaagtt tccccagtcc ttattgtaat attgcacagt
      181 gcaattgcta catggcaaac tagtgtagca tagaagtcaa agcaaaaaca aaccaaagaa
      241 aggagccaca agagtaaaac tgttcaacag ttaatagttc aaactaagcc attgaatcta
      301 tcattgggat cgttaaaatg aatcttccta caccttgcag tgtatgattt aacttttaca
      361 gaacacaagc caagtttaaa atcagcagta gagatattaa aatgaaaagg tttgctaata
      421 gagtaacatt aaataccctg aaggaaaaaa aacctaaata tcaaaataac tgattaaaat
      481 tcacttgcaa attagcacac gaatatgcaa cttggaaatc atgcagtgtt ttatttaaga
      541 aaacataaaa caaaactatt aaaatagttt tagagggggt aaaatccagg tcctctgcca
      601 ggatgctaaa attagacttc aggggaattt tgaagtcttc aattttgaaa cctattaaaa
      661 agcccatgat tacagttaat taagagcagt gcacgcaaca gtgacacgcc tttagagagc
      721 attactgtgt atgaacatgt tggctgctac cagccacagt caatttaaca aggctgctca
      781 gtcatgaact taatacagag agagcacgcc taggcagcaa gcacagcttg ctgggccact
      841 ttcctccctg tcgtgacaca atcaatccgt gtacttggtg tatctgaagc gcacgctgca
      901 ccgcggcact gcccggcggg tttctgggcg gggagcgatc cccgcgtcgc cccccgtgaa
      961 accgacagag cctggacttt caggaggtac agcggcggtc tgaaggggat ctgggatctt
     1021 gcagagggaa cttgcatcga aacttgggca gttctccgaa ccggagacta agcttccccg
     1081 agcagcgcac tttggagacg tgtccggtct actccggact cgcatctcat tccactcggc
     1141 catagccttg gcttcccggc gacctcagcg tggtcacagg ggcccccctg tgcccaggga
     1201 aatgtttcaa gcttttcccg gagactacga ctccggctcc cggtgtagct catcaccctc
     1261 cgccgagtct cagtacctgt cttcggtgga ctccttcggc agtccaccca ccgccgccgc
     1321 ctcccaggag tgcgccggtc tcggggaaat gcccggctcc ttcgtgccaa cggtcaccgc
     1381 aatcacaacc agccaggatc ttcagtggct cgtgcaaccc accctcatct cttccatggc
     1441 c
• Major Challenges:
1. Various representation
2. Semi-structured data
Schema Descriptors
• Follow XML DTD standard for semi-structured data
• Simple attribute list for relational data
<?xml version='1.0' encoding='UTF-8'?>
<!ELEMENT FASTA (ID, DESCRIPTION, SEQ)>
<!ELEMENT ID (#PCDATA)>
<!ELEMENT DESCRIPTION (#PCDATA)>
<!ELEMENT SEQ (#PCDATA)>

[FASTA]               //Schema Name
ID = string           //Data type definitions
DESCRIPTION = string
SEQ = string
Layout Descriptors
• Overall structure (FASTA example)
DATASET "FASTAData" {        //Dataset name
  DATATYPE {FASTA}           //Schema name
  DATASPACE LINESIZE=80 {
    // ---- File layout details go here ----
  }
  DATA {osu/fasta}           //File location
}
File Layout
• Key observations on line-based biological data files
  – Strings of variable length
  – Delimiters widely used
  – Data fields may be divided into variables
  – Repetitive structures
>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3 …
Layout Descriptors
• File layout (FASTA example)
DATASPACE LINESIZE=80 {
  <
    ">" ID " " DESCRIPTION
    < "\n" SEQ >
    "\n" | EOF
  >
}
>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3 …
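To make the layout concrete, here is a hand-written reader that follows the same structure: ">" starts an entry, the ID runs to the first space, the DESCRIPTION runs to the end of the line, and each following line is one SEQ value. It is only a sketch of what the descriptor expresses, not the code the system generates; the function name and the dictionary shape are assumptions.

```python
def parse_fasta(text):
    """Parse FASTA text into entries with ID, DESCRIPTION and a list of SEQ lines."""
    entries = []
    for raw_line in text.splitlines():
        line = raw_line.rstrip()
        if line.startswith(">"):
            # ">" ID " " DESCRIPTION
            entry_id, _, description = line[1:].partition(" ")
            entries.append({"ID": entry_id, "DESCRIPTION": description, "SEQ": []})
        elif line and entries:
            # "\n" SEQ, repeated until the next ">" or end of input
            entries[-1]["SEQ"].append(line)
    return entries
```

SEQ is kept as a list because the layout allows a variable number of sequence lines per entry.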
System Component
• Understand data
  – Layout mining
  – Schema mining
• Process data
  – Metadata description language
  – Wrapper generation
  – Query execution
  – Query execution with indices
Wrapper Generation Road Map
• Motivation and overview
• System structure
• Wrapper generation
• Wrapper execution
• Experiments
Wrapper Generation Motivation
• Wrappers are essential for bioinformatics integration
  – Heterogeneous data sources
  – Function: transform data
• Current solutions
  – Manually written wrappers
  – Scripts
Wrapper Generation Advantages
• Wrapper generated automatically
  – Stand-alone programs for integration systems and workflows
  – Little human interference. New resources can be integrated on-the-fly
  – Direct transformation. No unnecessary intermediate form needed
  – Only requires data description at metadata level, one descriptor/data source
• Transfer data from flat files directly
  – No DB support required
  – No other domain or format heuristics
Wrapper Generation System Overview
[Figure: wrapper generation system. Components shown: Schema Descriptors, Mapping Generator, Mapping File, Mapping Parser, Schema Mapping, Layout Descriptor, Layout Parser, Data Entry Representation, Application Analyzer, WRAPINFO. The generated wrapper consists of DataReader, Synchronizer and DataWriter, between the Source Dataset and the Target Dataset.]
• FASTA example
DATASPACE LINESIZE=80 {
  <
    ">" ID " " DESCRIPTION
    < "\n" SEQ >
    "\n" | EOF
  >
}
Parse tree:
DATASPACE root (linesize = 80)
  < >
    ">"-ID
    " "-DESCRIPTION
    < >
      "\n"-SEQ
    "\n"-DUMMY | EOF
• Leaf: delimiter-variable (DLM-VAR) pair
• Internal node: environment
Schema Mapping
• Algorithm: strict name matching
    for field ft in target schema
      for field fs in source schema
        if ft = fs then add pair (fs, ft) to the mapping
• Output
  – A list of attribute pairs
  – An editable file for the user to verify and modify
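The matching loop above translates directly into a few lines of Python. The function name is hypothetical, and the set lookup is just a shortcut for the inner loop.

```python
def strict_name_mapping(source_fields, target_fields):
    """Pair each target field with the source field of exactly the same name."""
    source = set(source_fields)
    return [(field, field) for field in target_fields if field in source]
```

Strict matching misses semantically equal fields with different names, for example a source field DE and a target field DESCRIPTION, which is why the mapping is written to an editable file for the user to correct.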
Wrapping Assumptions
• Convert semi-structured (and structured) data to structured data
• Both datasets are stored record-wise
• Order of records not disturbed after wrapping
[Figure: Semi-structured → Structured]
Data can be transformed entry by entry
Application Analyzer
• Task: generate clear directions for the wrapper and organize them in WRAPINFOR
• Sub-tasks
  – What values to store
  – How to extract values
  – How to store values
  – How to write values
Important Concepts (1)
• “Useful”
  – An attribute is useful iff its values are in the target
• “Reachable”
  – Node b is reachable from node a if there exists a valid layout configuration such that a.DLM and b.DLM define the boundaries of a.VAR, i.e. “… a.DLM a.VAR b.DLM …”
  – A value instance lies between
    • Its own delimiter
    • The first appearance of its reachable delimiters
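The boundary rule can be sketched as a small extraction helper: find the attribute's own delimiter, then cut the value at the earliest occurrence of any reachable delimiter. The function name, the return shape, and the treatment of end-of-text as a boundary are assumptions made for illustration.

```python
def extract_value(text, start, own_dlm, reachable_dlms):
    """Return (value, end_position), or None if own_dlm is not found after start.

    The value begins right after own_dlm and ends at the first appearance
    of any reachable delimiter (or at the end of the text).
    """
    begin = text.find(own_dlm, start)
    if begin < 0:
        return None
    begin += len(own_dlm)
    hits = [pos for pos in (text.find(dlm, begin) for dlm in reachable_dlms) if pos >= 0]
    end = min(hits) if hits else len(text)
    return text[begin:end], end
```

Chaining calls, each starting where the previous one ended, walks through one data entry in layout order.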
Important Concepts (2)
• Attribute Cardinality
  – Regular attribute: fixed number of values per entry
    • ID
  – Semi-structured attribute: varied number of values per entry
    • References
WRAPINFOR
• Contents: information to answer a particular wrapping task
• Form: XML
  – 5 look-up tables
    • Delimiter, Usefulness, Cardinality, Label, Reachable
  – 3 parameters
    • one_to_one_total, one_to_multiple_total, complete_in
• Function: plug into general modules to form a functional wrapper
Wrapper Generation Road Map
• Motivation and overview of our approach
• System structure
• Wrapper generation
• Wrapper execution
• Experiments
Wrapper Overview
[Figure: wrapper structure. Elements shown: Input dataset, Dataset buffer, DataReader, Value buffer (one_to_one_values, one_to_multiple_values), DataWriter, Output dataset, Synchronizer. State labels as printed: load, run, FARA, run, RA, halt.]
Wrapper Structure
• One data module: WRAPINFO
• Three general action modules
  – Synchronizer: central controller
  – DataReader, DataWriter: interact with datasets
• One value buffer
• Suitable for data grid
• Transform data one entry at a time
Wrapper Execution
• DataReader
  – Extract attribute value
    • Delimiter table + Reachable table
  – Fill value buffer: Label look-up table
• DataWriter
  – Retrieve from value buffer: Label look-up table
  – Write target file
    • Delimiter table + Reachable table + Label table
• Synchronizer
  – Call DataReader on source: parameters
  – Call DataWriter on target: parameters
Wrapper Experiments (1)
TRANSFAC-to-Reference Problem
[Figure: both axes in logarithmic scale]
• Analysis time constant
• Execution time linear
Wrapper Experiments (2)
SWISSPROT-to-FASTA Problem
• Performance comparable to handwritten code
System Components
• Understand data
  – Layout mining
  – Schema mining
• Process data
  – Metadata description language
  – Wrapper generation
  – Query execution
  – Query execution with indices
Query Execution Road Map
• Motivation
• System Overview
• System Implementation
  – Languages
  – System
• Experiments
Limitation of Wrapper
• Data Wrapping = Data formatting + Data projection
• Other query types
  – Selection
  – Cross Product
  – Join
• New Functionalities
  – Value examination
  – Multiple datasets
Advantages
• Retrieve multiple pieces of information all at once
• Data easily available
• Declarative languages only
• High flexibility
• Low over-head
• Suitable for data grid
System Enhanced
[Figure: enhanced system in two phases. Query analysis: query, Query parser, Metadata collection, Dataset descriptors, Descriptor parser, Application analyzer, QUERYINFOR; labels on the connections: Source/target names, Schema & Layout information, mappings. Query execution: DataReader, Synchronizer, DataWriter, Source data files, Target data file.]
Query Execution Road Map
• Motivation
• System Overview
• System Implementation
  – Languages
    • Metadata Description Language
    • Query Language
  – System
    • Query Analysis
    • Query Execution
• Experiments
Query Language
• Declarative, SQL-like
• Projection, selection, cross product, join queries
• Example
AUTOWRAP POSTBLAST
FROM BLASTP, SWISSPROT
BY BLASTP.SP_ID = SWISSPROT.ID
WHERE
  POSTBLAST.QUERY = BLASTP.QUERY
  POSTBLAST.SP_AC = BLASTP.SP_AC
  POSTBLAST.SP_ID = BLASTP.SP_ID
  POSTBLAST.FULL_DESCR = SWISSPROT.DE
  POSTBLAST.SEQUENCE = SWISSPROT.SQ
  POSTBLAST.SCORE = BLASTP.SCORE
  POSTBLAST.E_VALUE = BLASTP.E_VALUE
• Clauses: AUTOWRAP names the target dataset, FROM lists the source datasets, BY gives the join criteria, WHERE lists the attribute pairs
Application Analyzer Enhancement
• Constant values in query
  – Pseudo-label look-up table
• Other query information
  – Parameters: comparing field pairs
• Output: QUERYINFOR
Query Execution
• Query-Proc Structure
• DataReader and DataWriter
  – Similar to wrapper
• Value buffer
  – Store useful values from one data entry of every source dataset
[Figure: query-proc structure: QUERYINFOR; DataReader, Synchronizer, DataWriter; Source data files; Target data file]
Enhanced Synchronizer
• Synchronizer
  – Set up pseudo-attributes: Pseudo label look-up table
  – Call DataReader on source 1 and 2; call DataWriter on target: Parameters
  – Test join conditions: Parameters
  – Clean value buffer: Parameters
Post-BLAST Query
• Goal: Enhance BLAST output to FASTA format
• Query: Join query between BLAST output (source 1) and SWISSPROT (source 2)
• 2 modes
  – UNIQUE: halt once a match is found in source 2
  – ALL: search all source 2 entries
[Figure: time (sec, 0 to 2000) vs. query size (sequence number: 3, 5, 12); series: UNIQUE, ALL]
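The difference between the two modes can be shown with a plain nested-loop join over in-memory records. This is an illustrative sketch only: the generated query programs stream flat files entry by entry, and the function name, record shape, and sample field names used in the usage note are assumptions.

```python
def nested_loop_join(source1, source2, key1, key2, mode="ALL"):
    """Join two lists of dict records on source1[key1] == source2[key2].

    mode="UNIQUE" stops scanning source 2 after the first match for each
    source 1 entry; mode="ALL" scans every source 2 entry.
    """
    if mode not in ("UNIQUE", "ALL"):
        raise ValueError("mode must be 'UNIQUE' or 'ALL'")
    results = []
    for left in source1:
        for right in source2:
            if left[key1] == right[key2]:
                results.append({**left, **right})
                if mode == "UNIQUE":
                    break
    return results
```

When the join key is unique in source 2, as a Swiss-Prot ID would be, UNIQUE returns the same rows as ALL while scanning, on average, only part of source 2 per lookup.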
Chip-Supplement Query
• Goal: Look up information on microarray genes and present it in tabular format
• Query: Join query between protein array and yeast genome database
• 2 queries
  – Chip-Supplement:
    • array join genome
  – Chip-Supplement-Sorted:
    • genome join array
[Figure: time (sec, 0 to 90) by query type (Chip-Supplement, Chip-Supplement-Sorted); series: UNIQUE, ALL]
OMIM-Plus Query
• Add reverse links of proteins to disease database
• Join query between OMIM database and SWISSPROT database
• Results in OMIM form
• 86.38 seconds/entry × 12,158 OMIM entries ≈ 1,050,208 seconds ≈ 291.7 hours
System Components
• Understand data
  – Layout mining
  – Schema mining
• Process data
  – Metadata description language
  – Wrapper generation
  – Query execution
  – Query execution with indices
Query with Indices Road Map
• Motivation and Overview
• System
• System Enhancement
  – Language
  – System Implementation
• Experiments
Query With Indices Motivation
• Goal
  – Improve the performance of query-proc program
    • Index
  – Maintain the advantages
    • Flat file based
    • Low requirement on programming
Challenges & Approaches
• Various indexing algorithms for various biological data
  – User defined indexing functions
  – Standard function interfaces
• Flat file data
  – Values parsed implicitly and ready to be indexed
  – Byte offset as pointer
• Metadata about indices
  – Layout descriptor
System Revisit
[Figure: system with index support. Query analysis: query, Query parser, Metadata collection, Dataset descriptors, Descriptor parser, Application analyzer, QUERYINFOR; labels on the connections: Source/target names, Schema & Layout information, mappings. Query execution: DataReader, Synchronizer, DataWriter, Source data files, Target data file, plus Index file and Index functions.]
Language Enhancement
• Describe indices
  – Indexing is a property of dataset
  – Extend layout descriptors
  – Maintain query format
DATASET "name" {
  …
  INDEX {attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc
         [, attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc]}
}
AUTOWRAP GNAMES
FROM CHIPDATA, YEASTGENOME
BY CHIPDATA.GENE = YEASTGENOME.ID
WHERE …
• New meaning of “=”:
  – If an index is available, use the index retrieving function
  – Else, compare values directly
System Enhancement
• Metadata Descriptor Parser
  + parse index information
• Application Analyzer
  + index information: index look-up table
  + test condition: compare_field_indexing
Query-Proc Enhancement
• Synchronizer
  + If an index is applicable, check availability of the index data file
    • If not available, call the index generation function
  + Load indices
  + Call the index retrieving function first for the candidate entry list
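A simple pair of index functions in the spirit of "byte offset as pointer" might look like this. It is a sketch under stated assumptions: the function names, the in-memory dictionary, and the FASTA-style key extractor are illustrative, whereas the system described here takes user-defined generation and retrieval functions and stores the index in a file.

```python
import io


def build_offset_index(stream, key_of):
    """Map each key found by key_of(line) to the byte offsets of its lines."""
    index = {}
    while True:
        offset = stream.tell()          # position before reading the line
        line = stream.readline()
        if not line:
            break
        key = key_of(line)
        if key is not None:
            index.setdefault(key, []).append(offset)
    return index


def fetch_lines(stream, index, key):
    """Seek straight to the candidate entries for a key instead of scanning."""
    lines = []
    for offset in index.get(key, []):
        stream.seek(offset)
        lines.append(stream.readline())
    return lines


def fasta_key(line):
    """Hypothetical key extractor: the ID on a FASTA header line."""
    parts = line[1:].split()
    return parts[0] if line.startswith(b">") and parts else None
```

The stream must be opened in binary mode so that `tell()` and `seek()` operate on plain byte offsets; `io.BytesIO` is enough for a quick experiment.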
Microarray Gene Information Look-up
• Goal: gather information about genes (120)
• Query: microarray output join genome database
• Index: gene names in genome
[Figure: performance (sec, 0 to 90); bars: query analysis, index generation, query with indices, query w/o indices; data labels: 0.01, 0.72, 20.89, 81.59]
BLAST-ENHANCE Query
• Goal: Add extra information to BLAST output
• Query: BLAST output join Swiss-Prot database
• Index: protein ID in Swiss-Prot
[Figure: performance (sec, 0 to 1200); x-axis ticks: 3, 5, 12; series: index generation, query w/ indices, query w/o indices]
OMIM-PLUS Query
• Goal: add Swiss-Prot link to OMIM
• Query: OMIM join Swiss-Prot
• Index: protein ID in Swiss-Prot
[Figure: performance (sec, logarithmic scale from 1 to 10,000,000); series: index generation, query w/ indices, query w/o indices]
Homology Search Query
• Goal: find similar sequences
• Query: query sequence list * sequence database
• Indexing algorithm
  – Sequence-based
  – Transformation of sub-string composition
  – Indexing n-D numerical values
Homology Search (1)
• Index (Singh’s algorithm)
  – Data: yeast genome
  – Wavelet coefficients
  – Minimum bounding rectangles
[Figure: performance (sec, 0 to 350) vs. Database size (9.8MB), ticks 1 to 5; legend: Index generation, 10, 20, 40]
Homology Search (2)
• Index (Ferhatosmanoglu’s algorithm)
  – Data: GenBank
  – Wavelet coefficients
  – Scalar quantization
  – R-tree
[Figure: performance (sec, 0 to 30) vs. Database size (250MB), ticks 1 to 5; legend: 10, 20, 40]
Road Map
• Mission Statement
• Motivation
• Implementation
• Comprehensive Example
• Future work
• Conclusion
Gene Name Nomenclature
• It is crucial to identify genes CORRECTLY and UNAMBIGUOUSLY
  – Genes with multiple names
  – Multiple genes sharing the same name
• Historically, little central control over the naming process
  “…As biologists strive to make sense of the growing wealth of genomic information, this messy nomenclature is becoming a bugbear…”
  Helen Pearson, Nature, 2001
Gene Name in DBs
• Databases related to genes
  – Genome databases (main force in nomenclature)
    • SGD (yeast)
    • HGNC (human)
    • TAIR (a plant)
    • dictyBase (a one-cell amoeba)
  – Curated gene databases
    • Entrez Gene by NCBI
  – Curated gene product databases
    • Swiss-Prot by SIB and EBI
Queries About Gene Name
• Gene identifier usage in databases
  – How are gene symbols in DB A used in DB B?
  – How are gene aliases in DB A used in DB B?
• Nomenclature across species
  – Q1-Q2: genome – Entrez Gene, Swiss-Prot
  – Q3-Q4: Entrez Gene – Swiss-Prot
• Nomenclature over time
  – Q5-Q7: Swiss-Prot – genome
Challenges
• Various data representation
  – Line-based texts
  – Tabular forms with or without title
  – Format evolves over time
• Data storage
  – Large volume
  – Each file queried limited times
[Figure callouts: Metadata descriptors; Format and schema learning; Flat file processing]
Integration System Revisit
[Figure: Information Integration System. "Understand Data": Layout Miner and Schema Miner produce the metadata description (layout and schema descriptors) from the data files, here Genome, Entrez Gene and Swiss-Prot. "Process Data": Code Generation and the Query Processor handle the user requests, here join queries.]
Nomenclature Results (1)
• Across Species
[Figure Q1-Q2: percentage (%, 0 to 90) for Entrez Gene ID, Entrez Gene Alias, Swiss-Prot ID, Swiss-Prot Alias; series: SGD, HGNC, TAIR, dictyBase]
[Figure Q3-Q4: percentage (%, 0 to 60) for Swiss-Prot ID, Swiss-Prot Alias; series: SGD, HGNC, TAIR, dictyBase]
Nomenclature Results (2)
• Over time
Q5: How many gene IDs in Swiss-Prot are gene IDs in genome?
Q6: How many gene IDs in Swiss-Prot are aliases in genome?
Q7: How many gene aliases in Swiss-Prot are gene IDs in genome?
Performance
• Linear w.r.t. source 1 size
Conclusion
• A framework and a set of tools for on-the-fly flat file data integration
  – New data sources understood semi-automatically by data mining tools
  – New data processed automatically by generated programs
• Advantages
  – High level interface, flat file based, acceptable performance, low maintenance cost