1
Database Resources of the National Center for Biotechnology Information
Baharak Rastegari
MEDG 505 presentation
February 3, 2005 [email protected]
David L. Wheeler et al. Nucleic Acids Research, Vol. 33, Database issue
2
NCBI! What is it?
• Created in 1988
• At the National Institutes of Health
• To develop information systems for molecular biology
• Maintains: GenBank(R) nucleic acid sequence database
• Provides: Data retrieval systems & computational resources
3
4
DB Resources Categories
• Database retrieval tools
• The BLAST family of sequence-similarity search programs
• Resources for gene-level sequences
• Resources for genome-scale analysis
• Resources for the analysis of patterns of gene expression and phenotypes
• The Molecular Modeling Database, the Conserved Domain Database search, CDART and protein interactions
6
Entrez
• Text searching
  → using Boolean queries
  → of a diverse set of over 20 databases
• Simultaneous searches across all Entrez databases at speeds comparable to a single database search
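A Boolean Entrez query can also be composed programmatically. The sketch below only builds a request URL for the NCBI E-utilities ESearch service and does not send it. The helper names, the example search terms and the choice of the "nucleotide" database are illustrative assumptions, not part of the presentation.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def boolean_query(*terms, op="AND"):
    """Join search terms with an upper-case Boolean operator (hypothetical helper)."""
    return f" {op} ".join(f"({t})" for t in terms)

def esearch_url(db, query, retmax=20):
    """Build, but do not send, an ESearch request URL (hypothetical helper)."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query, "retmax": retmax})

# Example terms are assumptions chosen for illustration.
query = boolean_query("BRCA1[Gene Name]", "human[Organism]")
url = esearch_url("nucleotide", query)
print(query)
print(url)
```

Keeping query construction separate from the request makes it easy to reuse the same Boolean expression against several Entrez databases.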
7
8
Entrez
• Retrieved records can be displayed in a wide variety of formats
  → GenBank Flatfile, FASTA, XML, …
• Graphical display is offered for some types of records
• Search history
  → allows users to recall the results of previous searches and combine them using Boolean logic
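Combining earlier result sets with Boolean logic behaves like set algebra. The sketch below models this with Python sets; the record IDs and the `combine` helper are hypothetical and do not reproduce Entrez's own implementation.

```python
# Hypothetical record IDs returned by two earlier searches.
history = {
    1: {101, 102, 103, 104},  # results of search #1
    2: {103, 104, 105},       # results of search #2
}

def combine(a, op, b):
    """Combine two stored result sets with AND, OR or NOT."""
    left, right = history[a], history[b]
    if op == "AND":
        return left & right   # records found by both searches
    if op == "OR":
        return left | right   # records found by either search
    if op == "NOT":
        return left - right   # records in the first search only
    raise ValueError(f"unknown operator: {op}")

print(sorted(combine(1, "AND", 2)))
print(sorted(combine(1, "OR", 2)))
print(sorted(combine(1, "NOT", 2)))
```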
9
Entrez
• PubMed
  → includes 12.8 million references and abstracts in MEDLINE(R)
  → with links to the full text of more than 4400 journals available on the web
• PubMed Central
  → digital archive of peer-reviewed journals in the life sciences
  → access to over 300 000 full-text articles
  → over 160 journals
• Books database
  → contains more than 35 online scientific textbooks
12
Taxonomy
• Indexes over 165 000 named organisms
• Can be used to view taxonomic position or retrieve data from a database for a particular organism or group
• Searches can be made on whole, partial or phonetically spelled organism names
• Links to organisms commonly used in biological research are provided
• Displays custom taxonomic trees, representing user-defined subsets of the full NCBI taxonomy
13
14
15
16
Entrez Gene
• Successor to LocusLink
• Provides an interface to curated sequences and descriptive information about genes
• With links to gene-related resources
  → NCBI’s Map Viewer, Evidence Viewer, BLink, …
18
BLAST Family
• BLAST
  → Basic Local Alignment Search Tool
  → performs sequence-similarity searches against a variety of sequence databases
  → returns a set of gapped alignments between the query and database sequences
• BLAST2Sequences
  → compares two DNA or protein sequences
  → produces a dot-plot representation of the alignments
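To illustrate what a dot plot conveys, here is a minimal sketch that marks positions where short words of two sequences match exactly. This is a simplified stand-in and not how BLAST2Sequences works (it plots BLAST alignments); the function names and the word size are assumptions.

```python
def dot_plot(seq_a, seq_b, word=3):
    """Return (i, j) positions where a word of seq_a equals a word of seq_b."""
    hits = []
    for i in range(len(seq_a) - word + 1):
        for j in range(len(seq_b) - word + 1):
            if seq_a[i:i + word] == seq_b[j:j + word]:
                hits.append((i, j))
    return hits

def render(seq_a, seq_b, hits):
    """Draw the hits as a text grid: rows follow seq_a, columns follow seq_b."""
    grid = [["." for _ in seq_b] for _ in seq_a]
    for i, j in hits:
        grid[i][j] = "*"
    return "\n".join("".join(row) for row in grid)

hits = dot_plot("ACGTAC", "ACGTAC")
print(render("ACGTAC", "ACGTAC", hits))
```

Similar regions show up as diagonal runs of marks, which is why comparing a sequence with itself produces a main diagonal.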
19
20
BLAST Family
• MegaBLAST
  → designed to search for nearly exact matches
  → handles batch nucleotide queries
  → operates up to 10 times faster than standard nucleotide BLAST
• BLASTLink (BLink)
  → displays pre-computed protein BLAST alignments for each protein in the Entrez databases
  → can display subsets of these alignments by taxonomic criteria, database of origin, …
22
UniGene
• System for automatically partitioning GenBank sequences, including ESTs, into a non-redundant set of gene-oriented clusters
• Each cluster contains sequences that represent a unique gene, and is linked to related information
• Human UniGene
  → over 4.5 million human ESTs
  → reduced 42-fold in number to approximately 107 000 sequence clusters
• Has been used as a source of unique sequences for the fabrication of microarrays for the large-scale study of gene expression
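The 42-fold figure for human UniGene follows directly from the two counts quoted above, as this short check shows (the variable names are illustrative):

```python
ests = 4_500_000      # human ESTs, "over 4.5 million"
clusters = 107_000    # approximate number of sequence clusters

fold_reduction = ests / clusters
print(f"{fold_reduction:.1f}-fold")  # about 42-fold
```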
23
ProtEST
• Analogous to BLASTLink
• Presents pre-computed BLAST alignments between protein sequences from model organisms and six-frame translations of UniGene nucleotide sequences
• Reports are updated in tandem with UniGene protein similarities
24
Trace & Assembly Archives
• Trace Archive allows for flexible searching and download of sequencing traces
• Assembly Archive links the raw sequence information found in the Trace Archive with assembly information found in GenBank
25
HomoloGene
• System for automated detection of homologs among the annotated genes of several completely sequenced eukaryotic genomes
• New HomoloGene build is guided by the taxonomic tree and relies on:
  → conserved gene order & measures of DNA similarity among closely related species
  → protein similarity for more distantly related organisms
26
…HomoloGene
• ‘Ancestor’ field
  → refers to the taxonomic group of the last common ancestor of the species represented in a HomoloGene entry
  → using it, it is possible to limit a search to genes conserved in one of 22 ancestral groups
• ‘Pairwise Score’ display gives a table of pairwise statistics for members of a HomoloGene group that includes
  → percent amino acid and nucleotide identities
  → Jukes-Cantor genetic distance parameter
  → the ratio of non-synonymous to synonymous amino acid substitutions (Ka/Ks)
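For reference, the Jukes-Cantor distance corrects the observed fraction p of differing nucleotide sites for multiple substitutions at the same site: d = -(3/4) ln(1 - (4/3) p). The sketch below implements that textbook formula; the function names are assumptions, and it is not claimed to reproduce how HomoloGene computes its table.

```python
import math

def proportion_different(a, b):
    """Fraction of aligned, equal-length positions at which two sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jukes_cantor_distance(p):
    """Jukes-Cantor distance d = -(3/4) * ln(1 - (4/3) * p), defined for 0 <= p < 0.75."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)

p = proportion_different("ACGTACGTAC", "ACGTACGTAA")
print(p, round(jukes_cantor_distance(p), 4))
```

The corrected distance is always at least as large as p, and it grows without bound as p approaches 0.75, the level expected for unrelated sequences under this model.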
28
Reference Sequences
• RefSeq provides curated references for
  → transcripts, proteins and genomic regions
  → computationally derived nucleotide sequences and proteins
• Containing 1.3 million sequences
  → including more than 1 million protein sequences
  → representing more than 2400 organisms
29
ORF Finder and Spidey
• ORF Finder
  → performs a six-frame translation of a nucleotide sequence
  → returns the location of each ORF within a specified size range
• Spidey
  → alignment tool for eukaryotic genomic sequences
  → takes into account predicted splice sites in constructing its alignment, and can use one of four splice-site models
  → returns exon alignments, protein translations and a summary showing the alignment quality, …
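The six-frame ORF scan can be sketched as follows. This is a minimal illustration that assumes an ORF runs from an ATG to the next in-frame stop codon under the standard genetic code; the function names and the coordinate convention are assumptions, and NCBI's ORF Finder may define and report ORFs differently.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_len=30, max_len=None):
    """Return (strand, frame, start, end) for ATG...stop ORFs in all six frames.

    start and end are 0-based, end-exclusive offsets on the scanned strand;
    for strand '-', they refer to the reverse complement.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            # Walk the frame one complete codon at a time.
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    length = i + 3 - start
                    if length >= min_len and (max_len is None or length <= max_len):
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs

print(find_orfs("CCATGAAATTTTAGGG", min_len=9))
```

The `min_len` and `max_len` arguments mirror the "specified size range" mentioned on the slide.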
30
Electronic PCR (e-PCR)
• Forward e-PCR
  → searches for matches to STS primer pairs in the UniSTS database of over 450 000 markers
  → to increase sensitivity, allows the size of the primer segment to be matched, the number of mismatches, the number of gaps and the size of the STS to be adjusted
• Reverse e-PCR
  → used to estimate the genomic binding site, amplicon size and specificity for sets of primer pairs by searching against the genomic and transcript databases
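The core idea of matching a primer pair against a sequence can be sketched as below. This is a naive illustration, not the e-PCR algorithm: it handles mismatches but not gaps, scans a single strand, and every function and parameter name is an assumption.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def mismatches(a, b):
    """Count differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def forward_epcr(template, fwd, rev, size_range, max_mismatch=0):
    """Return (start, end, size) for each predicted amplicon on the template."""
    template = template.upper()
    rev_site = revcomp(rev)  # how the reverse primer appears on this strand
    lo, hi = size_range
    hits = []
    for i in range(len(template) - len(fwd) + 1):
        if mismatches(template[i:i + len(fwd)], fwd) > max_mismatch:
            continue
        # Try every allowed product size for this forward-primer site.
        for size in range(lo, hi + 1):
            j = i + size - len(rev_site)
            if j < i or j + len(rev_site) > len(template):
                continue
            if mismatches(template[j:j + len(rev_site)], rev_site) <= max_mismatch:
                hits.append((i, i + size, size))
    return hits

template = "GG" + "ACGTAC" + "TTTTTTTT" + "CCTGAA" + "AA"
print(forward_epcr(template, "ACGTAC", "TTCAGG", (15, 25)))
```

Raising `max_mismatch` or widening `size_range` trades specificity for sensitivity, which is the adjustment described on the slide.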
31
32
33
dbSNP
• Database of single nucleotide polymorphisms
• Repository for single-base nucleotide substitutions and short deletion and insertion polymorphisms
• Contains 9.8 million human SNPs as well as about 5 million from a variety of other organisms
35
Entrez Genomes
• Provides access to genomic data contributed by the scientific community for species whose sequencing and mapping is complete or in progress
• Includes:
  → over 180 complete microbial genomes
  → more than 1600 viral genomes
  → over 550 reference sequences for eukaryotic organelles
  → …
• Complete genomes can be accessed hierarchically starting from either
  → an alphabetical listing
  → a phylogenetic tree for each of six principal taxonomic groups
36
COGs database
• Clusters of Orthologous Groups
• Presents a compilation of orthologous groups of proteins from 66 completely sequenced organisms
• Eukaryotic version, KOGs, is available for seven eukaryotes
37
Map & Evidence Viewer
• Map Viewer displays
  → genome assemblies
  → genetic and physical markers
  → the results of annotation and other analyses, using sets of aligned maps
• Evidence Viewer displays the alignments to a genomic contig of
  → RefSeq transcripts
  → GenBank mRNAs
  → known or potential transcripts
  → ESTs supporting a gene model
39
Cancer Chromosomes
• Consists of
  → NCI/NCBI SKY, M-FISH and CGH databases
  → NCI Mitelman Database of Chromosome Aberrations in Cancer
  → NCI Recurrent Chromosome Aberrations in Cancer database
• Three search formats are available
  → conventional Entrez query
  → Quick/Simple search: set of menus to select a disease site or diagnosis
  → Advanced search: combination of forms for more complex queries
41
SAGEmap
• Provides two-way mapping between
  → regular (10 base) and LongSAGE (17 base) SAGE tags
  → UniGene clusters
• SAGEmap repository contains
  → 381 SAGE experiments from 11 organisms
• Can also construct a user-configurable table of data comparing one group of SAGE libraries with another
• Is updated weekly
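To show what a tag-to-transcript mapping starts from, the sketch below extracts a virtual SAGE tag from a transcript sequence. It assumes the common protocol in which the tag is the stretch immediately downstream of the 3'-most CATG (NlaIII) site; the function name, the example sequence and this anchoring choice are assumptions, and SAGEmap's actual pipeline is not reproduced.

```python
def sage_tag(transcript, tag_len=10, anchor="CATG"):
    """Return the tag following the 3'-most anchor site, or None if unavailable."""
    seq = transcript.upper()
    pos = seq.rfind(anchor)           # 3'-most occurrence of the anchor site
    if pos == -1:
        return None
    start = pos + len(anchor)
    tag = seq[start:start + tag_len]
    return tag if len(tag) == tag_len else None

# Hypothetical transcript with two CATG sites; only the last one is used.
transcript = "AAACATGCCCCATG" + "ACGTACGTACGT" + "TTTTTTTTTTTT"
print(sage_tag(transcript))              # regular 10-base tag
print(sage_tag(transcript, tag_len=17))  # LongSAGE 17-base tag
```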
42
43
Gene Expression Omnibus
• Data repository and retrieval system for any high-throughput gene expression or molecular abundance data
• Contains
  → microarray-based experiments measuring the abundance of mRNA
  → genomic DNA and protein molecules
  → non-array-based technologies such as SAGE
  → mass spectrometry peptide profiling
• Now contains
  → high-throughput gene expression data from about 30 000 hybridization experiments
  → about 1000 array definitions
  → half a billion individual spot measurements derived from over 100 organisms
44
OMIM
• Catalog of human genes and genetic disorders authored and edited by Victor A. McKusick at the Johns Hopkins University
• Contains information on disease phenotypes and genes
• Contains
  → about 16 000 entries
46
MMDB
• Built by processing entries from the Protein Data Bank
• Structures are linked to sequences in Entrez and to the Conserved Domain Database.
• Conserved Domain Search can be used to search a protein sequence for conserved domains in CDD
• Wherever possible, CDD hits are linked to structures, which can be viewed with NCBI’s 3D molecular structure viewer, Cn3D
47
HIV-1/Human Protein Interaction DB
• Concise summary of documented interactions between HIV-1 proteins and
  → host cell proteins
  → other HIV-1 proteins
  → proteins from disease organisms associated with HIV or AIDS
• Summaries, including protein RefSeq accession numbers, Entrez Gene ID numbers, … are presented
48
Summary / Conclusion
• NCBI provides many tools for retrieval and analysis of data in GenBank and other biological databases
• All of the tools and resources can be found easily on the website http://www.ncbi.nih.gov/ along with documentation and explanatory material
• The NCBI Handbook and several tutorials are available
• One can search for tools and information on the NCBI website by choosing NCBI Website as the database
49
50
Thank you!
51
Outline
• Introduction
• Related work
• Components of a Pseudoknotted Sec. Str.
• Parsing algorithm
• Enumerating loops
• Akutsu’s structure class
• Conclusion & Future work