managing gene annotation information the search is over … one problem solved … another begins
DESCRIPTION
Managing Gene Annotation Information: the search is over … one problem solved … another begins. Observations from a foot soldier in the bio-information (r)evolution. Bill Farmerie -- ICBR Genomics Group. Interdisciplinary Center for Biotechnology Research.
![Page 1: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/1.jpg)
Managing Gene Annotation Information
the search is over … one problem solved … another begins
observations from a foot soldier in the bio-information (r)evolution
Bill Farmerie -- ICBR Genomics Group
![Page 2: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/2.jpg)
Interdisciplinary Center for Biotechnology Research
Established at the University of Florida in 1987 by the Florida Legislature
Centralized organization of biomedical core facilities supporting biotechnology-based research
How did information management become my problem?
![Page 3: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/3.jpg)
1998 GSAC Miami Beach
![Page 4: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/4.jpg)
Why should I care about this problem?
Because my paycheck depends on it. Avoid fatal failure in the funding loop.
PI has $ for large gene-based project
Core Lab generates data
Downstream data management & analysis
PI writes papers, gives talks
PI applies for new funding
Other PIs think this looks like a good idea
![Page 5: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/5.jpg)
From Sequence to Function
The genomic sequence identifies the 'parts'; the next trick is understanding gene function
Post-genomic era = functional genomics
Critical concept: genes of similar sequence may have similar functions
Inferring function for a new gene begins with searching for its nearest neighbor (or homolog) of known function
![Page 6: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/6.jpg)
BLAST
Most common starting point for gene identification
Similarity search of sequence repository (GenBank)
Output
  Calculated scores (bit score and e-value)
  Text string (definition line), ID Reference Tag
  Sequence alignment
Advantages
  Fast algorithm, very good at finding close homologs
Disadvantages
  Not good at finding distant relatives
Cluster and Grid-enabled versions available
![Page 7: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/7.jpg)
HMMER
HMMER developed by Sean Eddy
Uses Hidden Markov Models
Searches unknown protein query sequence against a database of protein family models
  Statistical models constructed from alignment of conserved protein regions (Pfam)
Advantages
  Superior to BLAST for discovering more distant homology relations
Disadvantages
  More computationally intensive than BLAST
GRID enabled
![Page 8: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/8.jpg)
OK! Great!
Sequencing done. Homology searches complete.
But how will I deliver this information to scientists spread all over campus, and their worldwide collaborators?
![Page 9: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/9.jpg)
Search for summarizing information that restores sanity
CTGGGTTCTGTTCGGGATCCCAGTCACAGGGACAATGGCGCATTCATATGTCACTTCCTTTACCTGCCTGGA
GAGGTGTGGCCACAGACTCTGGTGGCTGCGAACGGGGACTCTGACCCAGTCGACTTTATCGCCTTGACGAAG
AACCAGATTGACGTTGTCGGAGTCGGAACTCACCTGGTCACCTGTACGACTCAGCCGTCGCTGGGTTGCGTT
CTGACACGCGGCTCCTCGTGTGGAGCCGAAACCCCGACAAAAGCGAAGGAGAGAGTGAGTATGAGCAGGCGG
![Page 10: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/10.jpg)
BlastQuest
A small idea with a big mission
![Page 11: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/11.jpg)
BlastQuest Requirements
Accessible to research groups at remote locations
Privacy-constrained sharing of results among the scientists
Selective browsing of BLAST homology search results
Selective data filtering on statistical criteria
  e-value or bit score
Selective data grouping on criteria such as GI number, or a defined number of top-scoring results
Ad hoc search capability on user-determined criteria:
  text terms
  Boolean logic
From a computational point of view, BlastQuest is embarrassingly simple. However, it solved our problem for information storage, selective retrieval, and distribution.
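The requirements above map naturally onto relational queries. The following is a minimal sketch, not BlastQuest's actual schema or code: it uses Python's built-in `sqlite3` module as a stand-in for the MySQL back end, and the table name `hits`, its columns, the helper function names, and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical schema and invented rows; the slides do not describe the real tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE hits (
           query_id  TEXT,
           gi        INTEGER,
           defline   TEXT,
           bit_score REAL,
           evalue    REAL
       )"""
)
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
    [
        ("contig1", 101, "heat shock protein 70", 250.0, 1e-60),
        ("contig1", 102, "hypothetical protein", 40.0, 0.5),
        ("contig2", 101, "heat shock protein 70", 180.0, 1e-40),
        ("contig2", 103, "ATP synthase beta subunit", 300.0, 1e-80),
        ("contig3", 104, "heat shock protein 90 kinase-like", 95.0, 1e-12),
    ],
)


def filter_by_evalue(max_evalue):
    """Selective filtering on a statistical criterion (e-value cutoff)."""
    return conn.execute(
        "SELECT query_id, gi, evalue FROM hits WHERE evalue <= ? ORDER BY evalue",
        (max_evalue,),
    ).fetchall()


def group_by_gi():
    """Grouping on GI number: count how many stored hits point at each entry."""
    return conn.execute(
        "SELECT gi, COUNT(*) FROM hits GROUP BY gi ORDER BY gi"
    ).fetchall()


def top_hits(n):
    """Keep the n top-scoring hits per query, ranked by bit score."""
    rows = conn.execute(
        "SELECT query_id, gi FROM hits ORDER BY query_id, bit_score DESC"
    ).fetchall()
    best = {}
    for query_id, gi in rows:
        kept = best.setdefault(query_id, [])
        if len(kept) < n:
            kept.append(gi)
    return best


def text_search(include, exclude):
    """Ad hoc Boolean text search: defline contains `include` AND NOT `exclude`."""
    return conn.execute(
        "SELECT query_id, defline FROM hits "
        "WHERE defline LIKE ? AND defline NOT LIKE ? ORDER BY query_id",
        (f"%{include}%", f"%{exclude}%"),
    ).fetchall()
```

For example, `filter_by_evalue(1e-30)` keeps only the three strongest hits in the sample data, and `text_search("heat shock", "kinase")` returns the two "heat shock protein 70" rows. Pushing the filtering into SQL is one plausible reading of the "SQL Constructor" idea: the interface only has to translate user choices into query parameters.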
![Page 12: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/12.jpg)
Overview of BlastQuest Architecture
Tier 3: Web Browser, Client-Side GUI
Tier 2: Web Server, Client Interface Module, SQL Constructor, XML Loader, ACE Loader, JDBC
Tier 1: MySQL DBMS
Inputs: BLAST XML document, Assembly ACE file
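As one way to picture the XML Loader step, the sketch below parses a fragment of BLAST XML output with Python's standard library and flattens each hit into a row that could be loaded into a database. This is an illustration only, not the BlastQuest loader. The element names follow NCBI's BLAST XML output format, while the sample record (including the accession) and the function name `parse_hits` are invented.

```python
import xml.etree.ElementTree as ET

# Invented sample record using element names from NCBI BLAST XML output.
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>contig1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_id>gi|101|ref|XP_000001.1|</Hit_id>
          <Hit_def>heat shock protein 70</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_bit-score>250.0</Hsp_bit-score>
              <Hsp_evalue>1e-60</Hsp_evalue>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def parse_hits(xml_text):
    """Return one dict per hit, using the first HSP of each hit for the scores."""
    root = ET.fromstring(xml_text)
    rows = []
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        for hit in iteration.iter("Hit"):
            hsp = hit.find("Hit_hsps/Hsp")
            if hsp is None:
                continue  # skip hits that carry no alignment data
            rows.append(
                {
                    "query": query,
                    "hit_id": hit.findtext("Hit_id"),
                    "defline": hit.findtext("Hit_def"),
                    "bit_score": float(hsp.findtext("Hsp_bit-score")),
                    "evalue": float(hsp.findtext("Hsp_evalue")),
                }
            )
    return rows


rows = parse_hits(SAMPLE_XML)
```

Each resulting dict corresponds to one row of search results, which is the unit a loader would hand to the database tier.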
![Page 13: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/13.jpg)
Welcome to BlastQuest
![Page 14: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/14.jpg)
Choose among client projects
![Page 15: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/15.jpg)
Results Selection
![Page 16: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/16.jpg)
Grouped Results
![Page 17: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/17.jpg)
![Page 18: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/18.jpg)
![Page 19: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/19.jpg)
Ad Hoc Text Searching
![Page 20: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/20.jpg)
Internal BLAST Searches
![Page 21: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/21.jpg)
Viewing a Gene Ontology Tree
![Page 22: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/22.jpg)
Viewing a Gene Ontology Tree
![Page 23: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/23.jpg)
Viewing a Gene Ontology Tree
![Page 24: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/24.jpg)
KEGG Classification
Kyoto Encyclopedia of Genes and Genomes
“Wiring diagrams of life”
KEGG Protein Networks
  Metabolic pathways
  Regulatory pathways
  Molecular complexes
  Network-network relations
  Network-environment relations
![Page 25: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/25.jpg)
Unique to non-Unigene
Common to both
Unique to Unigene
![Page 26: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/26.jpg)
Bacterial Genome Annotation Workbench
Another simple idea driven by necessity
![Page 27: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/27.jpg)
Start
![Page 28: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/28.jpg)
Project Summary
![Page 29: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/29.jpg)
Contig Browser
![Page 30: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/30.jpg)
Contig summary
![Page 31: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/31.jpg)
Physical map linked to annotation
![Page 32: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/32.jpg)
Simple problems.
Simple solutions.
Why are these simple ideas important?
![Page 33: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/33.jpg)
Human Genome Project
HGP drove innovation in biotechnology
2 major technological benefits
  stimulated development of high throughput methods
  reliance on computational tools for data mining and visualization of biological information
![Page 34: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/34.jpg)
The HGP and the cost of DNA sequencing
“finished” quality DNA sequence
  a DNA base call is considered finished if the probability of base call error is less than 1 in 10,000, also known as phred > 40
  contiguous DNA sequence of phred > 40 is usually achieved by multifold sequencing of the same region; typically 7-10X coverage
1985: $10 per finished base
2001: $1 per 10 finished bases
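The link between "1 in 10,000" and "phred 40" follows from the standard Phred definition, Q = -10 log10(P), where P is the probability that a base call is wrong. A minimal sketch of that conversion, plus the cost arithmetic from the two price points on the slide (the function and variable names are illustrative):

```python
import math


def phred_to_error_prob(q):
    """Probability of a wrong base call for Phred quality q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality corresponding to error probability p."""
    return -10 * math.log10(p)


# Cost per finished base, taken from the slide.
cost_1985 = 10.0      # $10 per finished base
cost_2001 = 1.0 / 10  # $1 per 10 finished bases
fold_drop = cost_1985 / cost_2001
```

Under this definition, phred 40 corresponds to an error probability of 1 in 10,000, and the two price points imply roughly a 100-fold drop in cost per finished base between 1985 and 2001.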
![Page 35: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/35.jpg)
GenBank, August 22, 2005
Public Collections of DNA and RNA Sequence Reach 100 Gigabases
![Page 36: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/36.jpg)
Trends in the cost efficiency of DNA sequencing§
§Shendure, J., Mitra, R., Varma, C., and Church, G.M. (2004) “Advanced sequencing technologies: methods and goals” Nature Reviews Genetics 5:335
![Page 37: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/37.jpg)
454 Life Sciences Corporation
The first commercial, massively parallel, DNA sequencing technology
![Page 38: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/38.jpg)
454 Technology
Cyclic-array sequencing on in vitro amplified DNA molecules
  individual molecules must be amplified to give a detectable sequencing signal
Instead of biological cloning, we amplify individual DNA fragments on solid state beads using PCR
Instead of terminator-based sequencing, pyrosequencing used to determine nucleotide order
  “sequencing by synthesis”
![Page 39: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/39.jpg)
454 Process Overview
![Page 40: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/40.jpg)
The bottom line …
  efficiency of DNA sequencing increased 100X
  cost per finished base declined 10- to 30-fold
… so what happens next?
  The “democratization” of large-scale genomic biology
  Many projects are now possible that were once fiscally inviable
  We must deal with basic local data management and information issues or lose this opportunity
![Page 41: Managing Gene Annotation Information the search is over … one problem solved … another begins](https://reader035.vdocuments.us/reader035/viewer/2022070503/5681568e550346895dc43a1c/html5/thumbnails/41.jpg)
If you thought bioinformatics was important before
By terminator-based sequencing we @ UF produce 60-70 Mbp per year
By synthesis-based sequencing we produce 60-70 Mbp per day
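To put those two rates on the same scale, the short calculation below converts the daily figure into an annual one. It assumes, purely for illustration, that the instrument runs every day of a 365-day year, which the slide does not state.

```python
DAYS_PER_YEAR = 365  # assumption: sequencing runs every day of the year


def annual_output_mbp(daily_mbp, days=DAYS_PER_YEAR):
    """Annual output in Mbp for a given daily output in Mbp."""
    return daily_mbp * days


low = annual_output_mbp(60)   # lower bound from the slide
high = annual_output_mbp(70)  # upper bound from the slide
fold_increase = low // 60     # same Mbp figure per day instead of per year

print(f"{low} to {high} Mbp per year, a {fold_increase}-fold increase")
```

Under that assumption the same lab would produce roughly 22 to 26 Gbp per year, which is the scale of data the downstream management tools would have to absorb.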