
Database Resources of the National Center for Biotechnology Information

Baharak Rastegari
MEDG 505 presentation

February 3, 2005

David L. Wheeler et al. Nucleic Acids Research, Vol. 33, Database issue

NCBI! What is it?

• Created in 1988
• At the National Institutes of Health
• To develop information systems for molecular biology
• Maintains: GenBank(R) nucleic acid sequence database
• Provides: data retrieval systems & computational resources

DB Resources Categories

• Database retrieval tools
• The BLAST family of sequence-similarity search programs
• Resources for gene-level sequences
• Resources for genome-scale analysis
• Resources for the analysis of patterns of gene expression and phenotypes
• The Molecular Modeling Database, the Conserved Domain Database search, CDART and protein interactions


Entrez

• Text searching
  → using Boolean queries
  → of a diverse set of over 20 databases
• Simultaneous searches across all Entrez databases at speeds comparable to a single-database search
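
A Boolean Entrez query can also be expressed programmatically. The sketch below only builds a query string and a request URL; it does not contact NCBI. The E-utilities ESearch endpoint, the field tags `[Gene Name]` and `[Organism]`, and the helper names `boolean_query` and `esearch_url` are assumptions for illustration and do not come from the slides.

```python
from urllib.parse import urlencode

# Public E-utilities endpoint (an assumption: not named on the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def boolean_query(*terms, op="AND"):
    """Join search terms with an Entrez Boolean operator (AND, OR, NOT)."""
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {op}")
    return f" {op} ".join(f"({t})" for t in terms)


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch request URL for one Entrez database."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# Hypothetical example: nucleotide records for a human gene.
query = boolean_query("BRCA1[Gene Name]", "Homo sapiens[Organism]")
url = esearch_url("nucleotide", query)
print(query)
print(url)
```

Keeping query construction separate from the request makes it easy to reuse the same Boolean expression against several databases.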

Entrez

• Retrieved records can be displayed in a wide variety of formats
  → GenBank flat file, FASTA, XML, …
• Graphical display is offered for some types of records
• Search history
  → allows users to recall the results of previous searches and combine them using Boolean logic

Entrez

• PubMed
  → includes 12.8 million references and abstracts in MEDLINE(R)
  → with links to the full text of more than 4400 journals available on the web
• PubMed Central
  → digital archive of peer-reviewed journals in the life sciences
  → access to over 300 000 full-text articles
  → over 160 journals
• Books database
  → contains more than 35 online scientific textbooks

Taxonomy

• Indexes over 165 000 named organisms
• Can be used to view the taxonomic position of a particular organism or group, or to retrieve its data from a database
• Searches can be made on whole, partial or phonetically spelled organism names
• Links to organisms commonly used in biological research are provided
• Can display custom taxonomic trees representing user-defined subsets of the full NCBI taxonomy

Entrez Gene

• Successor to LocusLink
• Provides an interface to curated sequences and descriptive information about genes
• With links to gene-related resources
  → NCBI’s Map Viewer, Evidence Viewer, BLASTLink, …


BLAST Family

• BLAST
  → local alignment search tool
  → performs sequence-similarity searches against a variety of sequence databases
  → returns a set of gapped alignments between the query and database sequences
• BLAST2Sequences
  → compares two DNA or protein sequences
  → produces a dot-plot representation of the alignments
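
To make "local alignment" concrete, here is a minimal sketch of the Smith-Waterman dynamic-programming score with a linear gap penalty. This is not how BLAST works internally (BLAST uses a faster seed-and-extend heuristic); it only illustrates the kind of gapped local alignment score that BLAST approximates. The function name and the scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Smith-Waterman, linear gap cost)."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(local_alignment_score("ACGT", "ACGT"))
print(local_alignment_score("TTACGTTT", "GGACGTGG"))
print(local_alignment_score("AAAA", "TTTT"))
```

The second call shows the "local" part: only the shared core ACGT contributes to the score, and the unrelated flanks are ignored.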

BLAST Family

• MegaBLAST
  → designed to search for nearly exact matches
  → handles batch nucleotide queries
  → operates up to 10 times faster than standard nucleotide BLAST
• BLASTLink (BLink)
  → displays pre-computed protein BLAST alignments for each protein in the Entrez databases
  → can display subsets of these alignments by taxonomic criteria, database of origin, …


UniGene

• System for automatically partitioning GenBank sequences, including ESTs, into a non-redundant set of gene-oriented clusters
• Each cluster contains sequences that represent a unique gene, and is linked to related information
• Human UniGene
  → over 4.5 million human ESTs
  → reduced 42-fold in number to approximately 107 000 sequence clusters
• Has been used as a source of unique sequences for the fabrication of microarrays for the large-scale study of gene expression
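
As a quick check of the 42-fold figure for human UniGene, using only the two numbers quoted on this slide:

```latex
\frac{4.5 \times 10^{6}\ \text{ESTs}}{1.07 \times 10^{5}\ \text{clusters}} \approx 42
```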

ProtEST

• Analogous to BLASTLink
• Presents pre-computed BLAST alignments between protein sequences from model organisms and six-frame translations of UniGene nucleotide sequences
• Reports are updated in tandem with UniGene protein similarities


Trace & Assembly Archives

• Trace Archive allows for flexible searching and download of sequencing traces

• Assembly Archive links the raw sequence information found in the Trace Archive with assembly information found in GenBank

HomoloGene

• System for automated detection of homologs among the annotated genes of several completely sequenced eukaryotic genomes
• The new HomoloGene build is guided by the taxonomic tree and relies on:
  → conserved gene order and measures of DNA similarity among closely related species
  → protein similarity for more distantly related organisms

…HomoloGene

• ‘Ancestor’ field
  → refers to the taxonomic group of the last common ancestor of the species represented in a HomoloGene entry
  → makes it possible to limit a search to genes conserved in one of 22 ancestral groups
• ‘Pairwise Score’ display gives a table of pairwise statistics for members of a HomoloGene group that includes
  → percent amino acid and nucleotide identities
  → Jukes-Cantor genetic distance parameter
  → the ratio of non-synonymous to synonymous substitutions (Ka/Ks)
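
The Jukes-Cantor distance listed above corrects the observed fraction of differing sites p for multiple substitutions at the same site: d = -(3/4) ln(1 - 4p/3). The sketch below computes p and d for a pair of already-aligned nucleotide sequences. The function names and the toy sequences are illustrative assumptions; this is not HomoloGene's own code.

```python
import math


def p_distance(seq_a, seq_b):
    """Fraction of differing sites between two aligned sequences, skipping gap columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(seq_a.upper(), seq_b.upper())
             if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance d = -(3/4) ln(1 - 4p/3) for nucleotide data."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)


# Toy alignment (hypothetical data): one mismatch in ten sites.
p = p_distance("ACGTACGTAC", "ACGTACGTAA")
print(f"percent identity = {100 * (1 - p):.1f}")
print(f"Jukes-Cantor d   = {jukes_cantor_distance(p):.4f}")
```

For small p the corrected distance is only slightly larger than p itself. The correction grows as p approaches 0.75, the value expected for unrelated sequences under this model.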

Reference Sequences

• RefSeq provides curated references for
  → transcripts, proteins and genomic regions
  → computationally derived nucleotide sequences and proteins
• Contains 1.3 million sequences
  → including more than 1 million protein sequences
  → representing more than 2400 organisms

ORF Finder and Spidey

• ORF Finder
  → performs a six-frame translation of a nucleotide sequence
  → returns the location of each ORF within a specified size range
• Spidey
  → alignment tool for eukaryotic genomic sequences
  → takes into account predicted splice sites in constructing its alignment, and can use one of four splice-site models
  → returns exon alignments, protein translations and a summary showing the alignment quality, …
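
The ORF Finder idea can be sketched as a scan of the six reading frames (three per strand) that reports each ATG-to-stop stretch within a size range. This is only an illustration under stated assumptions: it uses the standard stop codons, requires an ATG start, and reports positions rather than protein translations. The function names are hypothetical, and NCBI's tool may differ, for example in genetic code and start-codon options.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=30, max_len=None):
    """Return (strand, frame, start, end) for each ATG..stop ORF within the size range.

    start and end are 0-based, end-exclusive offsets on the strand that was scanned;
    lengths are in nucleotides and include the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # offset of the first ATG since the last stop codon
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start
                    if length >= min_len and (max_len is None or length <= max_len):
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs


dna = "CCATGAAATTTGGGTAACC"  # toy sequence containing ATG AAA TTT GGG TAA
print(find_orfs(dna, min_len=9))
```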

Electronic PCR (e-PCR)

• Forward e-PCR
  → searches for matches to STS primer pairs in the UniSTS database of over 450 000 markers
  → to increase sensitivity, allows the size of the primer segment to be matched, the number of mismatches, the number of gaps and the size of the STS to be adjusted
• Reverse e-PCR
  → used to estimate the genomic binding site, amplicon size and specificity for sets of primer pairs by searching against genomic and transcript databases
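
The matching step behind both e-PCR modes can be sketched as follows: locate a forward primer and the reverse complement of a reverse primer on a template, allow a limited number of mismatches, and keep pairs whose implied amplicon size falls in a given range. This is a naive illustration under stated assumptions, not NCBI's algorithm. It allows substitutions only (no gaps), scans every position, and uses hypothetical function names and toy sequences.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_sites(template, primer, max_mismatches=0):
    """Start offsets where primer matches template with at most max_mismatches substitutions."""
    n = len(primer)
    return [
        i for i in range(len(template) - n + 1)
        if sum(a != b for a, b in zip(template[i:i + n], primer)) <= max_mismatches
    ]


def e_pcr(template, forward, reverse, min_size, max_size, max_mismatches=0):
    """Return (start, end, size) of predicted amplicons; end is exclusive.

    forward matches the given strand; reverse is written 5'->3' on the opposite
    strand, so its reverse complement is searched on the given strand.
    """
    hits = []
    rev_sites = primer_sites(template, reverse_complement(reverse), max_mismatches)
    for start in primer_sites(template, forward, max_mismatches):
        for rev_start in rev_sites:
            end = rev_start + len(reverse)
            size = end - start
            if rev_start >= start and min_size <= size <= max_size:
                hits.append((start, end, size))
    return hits


template = "TTGACCTGAAACCCGGGTTTACGTAGCAA"  # toy template (hypothetical data)
print(e_pcr(template, "GACCTG", "GCTACG", min_size=10, max_size=40))
```

Raising `max_mismatches` or widening the size range trades specificity for sensitivity, which is the adjustment the forward e-PCR bullet describes.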

dbSNP

• Database of single nucleotide polymorphisms
• Repository for single-base nucleotide substitutions and short deletion and insertion polymorphisms
• Contains 9.8 million human SNPs as well as about 5 million from a variety of other organisms


Entrez Genomes

• Provides access to genomic data contributed by the scientific community for species whose sequencing and mapping is complete or in progress
• Includes:
  → over 180 complete microbial genomes
  → more than 1600 viral genomes
  → over 550 reference sequences for eukaryotic organelles
  → …
• Complete genomes can be accessed hierarchically starting from either
  → an alphabetical listing
  → a phylogenetic tree for each of six principal taxonomic groups

COGs database

• Clusters of Orthologous Groups
• Presents a compilation of orthologous groups of proteins from 66 completely sequenced organisms
• A eukaryotic version, KOGs, is available for seven eukaryotes

Map & Evidence Viewer

• Map Viewer displays
  → genome assemblies
  → genetic and physical markers
  → the results of annotation and other analyses, using sets of aligned maps
• Evidence Viewer displays the alignments to a genomic contig of
  → RefSeq transcripts
  → GenBank mRNAs
  → known or potential transcripts
  → ESTs supporting a gene model

Cancer Chromosomes

• Consists of
  → NCI/NCBI SKY, M-FISH and CGH databases
  → NCI Mitelman Database of Chromosome Aberrations in Cancer
  → NCI Recurrent Chromosome Aberrations in Cancer database
• Three search formats are available
  → conventional Entrez query
  → Quick/Simple search: set of menus to select a disease site or diagnosis
  → Advanced search: combination of forms for more complex queries


SAGEmap

• Provides two-way mapping between
  → regular (10 base) and LongSAGE (17 base) SAGE tags
  → UniGene clusters
• SAGEmap repository contains
  → 381 SAGE experiments from 11 organisms
• Can also construct a user-configurable table of data comparing one group of SAGE libraries with another
• Is updated weekly

Gene Expression Omnibus

• Data repository and retrieval system for any high-throughput gene expression or molecular abundance data
• Contains
  → microarray-based experiments measuring the abundance of mRNA, genomic DNA and protein molecules
  → data from non-array-based technologies such as SAGE and mass spectrometry peptide profiling
• Now contains
  → high-throughput gene expression data from about 30 000 hybridization experiments
  → about 1000 array definitions
  → half a billion individual spot measurements derived from over 100 organisms

OMIM

• Catalog of human genes and genetic disorders authored and edited by Victor A. McKusick at the Johns Hopkins University
• Contains information on disease phenotypes and genes
• Contains about 16 000 entries


MMDB

• Built by processing entries from the Protein Data Bank
• Structures are linked to sequences in Entrez and to the Conserved Domain Database (CDD)
• Conserved Domain Search can be used to search a protein sequence for conserved domains in CDD
• Wherever possible, CDD hits are linked to structures, which can be viewed with NCBI’s 3D molecular structure viewer, Cn3D

HIV-1/Human Protein Interaction DB

• Concise summary of documented interactions between HIV-1 proteins and
  → host cell proteins
  → other HIV-1 proteins
  → proteins from disease organisms associated with HIV or AIDS
• Summaries, including protein RefSeq accession numbers, Entrez Gene ID numbers, …, are presented

Summary / Conclusion

• NCBI provides many tools for retrieving and analyzing data in GenBank and other biological databases
• All of the tools and resources can be found easily on the website http://www.ncbi.nih.gov/ along with documentation and explanatory material
• The NCBI Handbook and several tutorials are available
• One can search for tools and information on the NCBI website by choosing NCBI Website as the database




Thank you!

Outline

• Introduction
• Related work
• Components of a pseudoknotted secondary structure
• Parsing algorithm
• Enumerating loops
• Akutsu’s structure class
• Conclusion & future work
