![Page 1: A Study of Computational Methods for Storing and Sequencing Genetic Databases CSC 545 – Advanced Database Systems By: Nnamdi Ihuegbu 12/2/03](https://reader035.vdocuments.us/reader035/viewer/2022081519/56649d775503460f94a5903c/html5/thumbnails/1.jpg)
A Study of Computational Methods for Storing and Sequencing Genetic Databases
CSC 545 – Advanced Database Systems
By: Nnamdi Ihuegbu
12/2/03
Abstract
• Scope of Study (i.e. aspect of Genetic Databases)
• Types of Genetic Databases
• Storage/organization/access/manipulation techniques
• Sequencing (querying) of data in Genetic Databases
• Logical Layout of Genetic Databases
Brief Introduction
• Human Genome Project (and others) -> vast amount of biological data
• Venture: Computer Science and Biology (BCB) -> Genetic Databases (map, genomic, proteomic)
• Expected date of completed map of human genome: end of 2003
• Next stage: sequence comparison and sequence-to-protein function.
• Useful to pharmaceutical companies (CADD, i.e. computer-aided drug design; e.g. GSK's Relenza).
Results - Sequence
• Current Sequence Generation Technologies
  • Maxam-Gilbert (uses chemicals to cleave DNA at a specific base, giving fragments of different lengths)
  • Sanger (uses enzymatic procedures to produce DNA fragments that end at a specific base, so fragment length gives the base position)
Derivation of nucleotide sequence from human chromosome
Results - Sequence
• Types of Sequence Comparisons/Alignments
  • Global (“How similar are these two sequences?”)
    • To find the best overall alignment between two sequences
    • 1970: Needleman and Wunsch (global, dynamic programming)
    • Shortcoming: misses small regions of similarity within two subsequences
  • Local (“What sequences in a database are most similar to this sequence?”)
    • To find the best subsequence match between two sequences
    • 1981: Smith and Waterman (local, dynamic programming)
    • Shortcomings: not computationally efficient, slow
Results - Sequence
Figure 3: Illustrating the differences between global and local sequence alignment (panels: global alignment, local alignment)
Results - Sequence
• Heuristic Search (Quick, Approximate)
  • Quickly search for “words” that match the sequence, then recursively perform a local search on each matched word until there are no other matches
  • FASTA (1988), BLAST (1990)
  • Shortcomings: approximate, not exact; matches must be judged by their E-value (significant if < 0.05)
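The word-matching idea above can be sketched roughly as follows. This is a simplified illustration only, not how FASTA or BLAST are actually implemented: it indexes exact words of length `k` and extends each hit without gaps while characters keep matching, whereas the real tools use scoring matrices, gapped extension, and statistical scoring. The function names and the default `k=3` are assumptions made for this example.

```python
from collections import defaultdict


def build_word_index(subject, k):
    """Map every length-k word in the subject sequence to its start positions."""
    index = defaultdict(list)
    for i in range(len(subject) - k + 1):
        index[subject[i:i + k]].append(i)
    return index


def extend_hit(query, subject, q, s, k):
    """Grow an exact word match to the left and right while characters agree."""
    left = 0
    while (q - left > 0 and s - left > 0
           and query[q - left - 1] == subject[s - left - 1]):
        left += 1
    right = k
    while (q + right < len(query) and s + right < len(subject)
           and query[q + right] == subject[s + right]):
        right += 1
    # (start in query, start in subject, length of the extended match)
    return q - left, s - left, left + right


def seed_and_extend(query, subject, k=3):
    """Return distinct ungapped exact matches seeded by shared k-letter words."""
    index = build_word_index(subject, k)
    hits = set()
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            hits.add(extend_hit(query, subject, q, s, k))
    # Longest match first; ties broken by position for a stable order.
    return sorted(hits, key=lambda h: (-h[2], h[0], h[1]))


if __name__ == "__main__":
    for q_start, s_start, length in seed_and_extend("TTACGTAGG", "CCACGTACC"):
        print(q_start, s_start, "TTACGTAGG"[q_start:q_start + length])
```

The index makes each word lookup cheap, which is where the speed of the heuristic comes from; the price is that matches without a shared exact word are never found.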
Results – Sequence (CSC Implementation)
• Sequence alignment can be represented as matrices and graphs (using rules and costs)
• When converted into a directed acyclic graph, the solution of the sequence alignment is the longest path (maximum-path problem).
Results – Sequence (CSC Implementation)
Diagonal edge = character match; down edge = gap in string 2; across edge = gap in string 1
• Can be solved by dynamic programming as a ‘running max score’ (RMS).
• For each D(i,j), best RMS = max(west+gap1, north+gap2, NW+current_score)
• Replace D(i,j) with this max
• Needleman-Wunsch Dynamic Program
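The recurrence above can be written out as a short Needleman-Wunsch sketch. Rows follow string 1 and columns follow string 2, so the north neighbour corresponds to a gap in string 2 and the west neighbour to a gap in string 1, as on the slide. The scoring values (`match=1`, `mismatch=-1`, and a single `gap=-2` used for both gap1 and gap2) are assumptions chosen for illustration; the slides do not give concrete costs.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    D = [[0] * cols for _ in range(rows)]

    # First column / first row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        D[i][0] = i * gap
    for j in range(1, cols):
        D[0][j] = j * gap

    # Fill: D(i,j) = max(west + gap, north + gap, NW + current_score).
    for i in range(1, rows):
        for j in range(1, cols):
            score = match if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = max(D[i][j - 1] + gap,
                          D[i - 1][j] + gap,
                          D[i - 1][j - 1] + score)

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        score = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + score:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")   # gap in string 2
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])   # gap in string 1
            j -= 1
    return D[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("GATTACA", "GATACA"))
```

The table has (len(a)+1) × (len(b)+1) cells and each is filled in constant time, so the cost grows with the product of the two sequence lengths.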
Results – Sequence (CSC Implementation)
• The Needleman-Wunsch dynamic program is similar to Smith-Waterman
• Differences in Smith-Waterman:
  • Restricts the RMS: discontinues if < 0 after several iterations
  • For each iteration, saves the max for each cell separately rather than replacing it -> trace back through the max. scores for the best local alignment
• BLAST Implementation (http://www.ebi.ac.uk/blast2/#)
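A minimal Smith-Waterman sketch makes the two differences concrete. It uses the standard textbook formulation, in which each cell is floored at zero (taken here as what the slide means by discontinuing a negative running score) and the traceback starts from the highest-scoring cell rather than the corner. The scoring values are assumptions for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            score = match if a[i - 1] == b[j - 1] else mismatch
            # Flooring at 0 drops any alignment whose running score went negative.
            H[i][j] = max(0,
                          H[i][j - 1] + gap,
                          H[i - 1][j] + gap,
                          H[i - 1][j - 1] + score)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        score = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + score:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTAGGG", "CCCACGTACCC"))
```

Because every cell is still computed, the cost remains proportional to the product of the sequence lengths, which is the inefficiency that motivates heuristic tools such as BLAST for database-scale searches.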
Results - Storage
• EMBL Nucleotide Sequence Database (on Oracle)
• Scale: over 130 tables, 140 relationships (80 GB of data)
• Object-oriented organization with 5 related packages.
• Operations that return an attribute type -> supports on-demand object creation
• ‘Live object cache’: copies the most accessed instances of the DB into a cache by primary key and performs queries on this cache.
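The combination of on-demand object creation and a cache keyed by primary key could look roughly like the sketch below. It is not the EMBL implementation: the class name, the `loader` callable standing in for a database query, the accession-style keys, and the least-recently-used eviction policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class LiveObjectCache:
    """Keep recently used objects in memory, keyed by primary key."""

    def __init__(self, loader, capacity=1000):
        self._loader = loader          # hypothetical: primary key -> object
        self._capacity = capacity
        self._objects = OrderedDict()  # insertion order tracks recency of use
        self.loads = 0                 # how many times the database was hit

    def get(self, primary_key):
        if primary_key in self._objects:
            # Cache hit: mark as most recently used and skip the database.
            self._objects.move_to_end(primary_key)
            return self._objects[primary_key]
        # Cache miss: create the object on demand from the backing store.
        obj = self._loader(primary_key)
        self.loads += 1
        self._objects[primary_key] = obj
        if len(self._objects) > self._capacity:
            self._objects.popitem(last=False)  # evict least recently used
        return obj


if __name__ == "__main__":
    # A plain dict stands in for the database table in this example.
    table = {"X00001": "ACGT", "X00002": "GGCC"}
    cache = LiveObjectCache(table.__getitem__, capacity=1)
    cache.get("X00001")
    cache.get("X00001")
    print(cache.loads)
```

Repeated requests for the same key are served from memory, so only the first access (or an access after eviction) reaches the database.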
Results - Storage
• 5 EMBL Packages:
  • Sequence Info: general information on the biological sequence.
  • Feature Info: sequence annotation/comment
  • Reference Info: bibliographic references on the sequence.
  • Taxonomy Info: taxonomy of the organism's sequence (i.e. kingdom, phylum, family, genus, species, etc.)
  • Location Info: location of the sequence on DNA/RNA
Results – Storage (General Relation Between the 5 Packages)
Results – Storage (Sequence Info)
Results – Storage (Feature Info)
Results – Storage (Reference Info)
Results – Storage (Taxonomy Info)
Results – Storage (Location Info)
Conclusion
• Genetic Databases (3 main types) are essential to store, manage, and query the massive bio-data from studies like HGP.
• Object-oriented design and data organization
• Sequence Analysis: Global (N-W), Local (S-W), Heuristic (FASTA, BLAST)
Conclusion - Future Enhancements
• Storage/Management: highly dependent on hardware industry progress
• Sequence Analysis:
  • Use of parallel programming for faster analysis of 2 sequences (BLAZE-Stanford)
  • Faster means of comparing and aligning multiple sequences simultaneously (e.g. comparing a novel protein sequence to a family).
Any Questions?