NCBI Molecular Biology Resources

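The International Sequence Database Collaboration

[Diagram: the three collaborating databases, each of which accepts submissions and updates.]

• GenBank (NCBI, NIH): retrieval through Entrez; submission through Sequin and BankIt
• EMBL (EBI): retrieval through SRS; submission through WebIn
• DDBJ (CIB, NIG): retrieval through getentry; submission through SAKURA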

Primary vs. Derivative Databases

[Diagram: labs and sequencing centers submit sequences to GenBank, which is updated ONLY by submitters. Derivative resources are built from those records by curators and algorithms and are updated continually by NCBI.]

• GenBank divisions shown: EST, STS, GSS, HTG, PRI, ROD, PLN, MAM, BCT, INV, VRT, PHG, VRL
• Derivative resources shown: UniGene, UniSTS, RefSeq (LocusLink and Genomes pipelines), RefSeq (annotation pipeline)

Types of Databases

Primary Databases
• Original submissions by experimentalists
• Remember biology's Central Dogma: DNA → RNA → protein. "Primary" refers to one-dimensional 'symbol' information written in sequential order, as needed to specify a particular biological molecular entity, be it polypeptide or polynucleotide.
• Content controlled by the submitter
• Examples: GenBank, SNP, GEO, PubChem Substance

Derivative Databases
• Built from primary data
• Content controlled by a third party (NCBI)
• Examples: RefSeq, TPA, RefSNP, UniGene, Protein, Structure, Conserved Domain, PubChem Compound

What is Entrez?

• Entrez Global Query is an integrated search and retrieval system for the databases of the National Center for Biotechnology Information (NCBI).
• It provides access to all NCBI databases simultaneously through a single query string and user interface.
• It supports Boolean operators and search term tags that limit parts of the search statement to particular fields.
• It returns a unified results page showing the number of hits in each database; each count links to the search results for that database.
• A text search and retrieval engine, NOT A DATABASE.
• A tool for finding biologically linked data.
• A virtual workspace for manipulating large datasets.
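As a rough illustration of Boolean operators and field tags, the sketch below assembles a fielded query string. It also builds, without sending, a request URL for NCBI's E-utilities ESearch service. Using E-utilities instead of the Entrez web page is an assumption made here for the example, and the helper names (`fielded`, `combine`, `esearch_url`) are hypothetical.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service (programmatic Entrez search).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded(term: str, field: str) -> str:
    """Attach an Entrez field tag, e.g. 'Homo sapiens[organism]'."""
    return f"{term}[{field}]"


def combine(operator: str, *clauses: str) -> str:
    """Join clauses with an upper-case Boolean operator (AND, OR, NOT)."""
    op = operator.upper()
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return "(" + f" {op} ".join(clauses) + ")"


def esearch_url(db: str, query: str, retmax: int = 20) -> str:
    """Build (but do not send) an ESearch request URL for one database."""
    params = {"db": db, "term": query, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Restrict one clause to the title field and the other to the organism field.
query = combine(
    "AND",
    fielded("thyroid peroxidase", "title"),
    fielded("Homo sapiens", "organism"),
)
url = esearch_url("nucleotide", query)
print(query)
print(url)
```

The same query string can be pasted into the Entrez search box; only the URL wrapper is specific to programmatic access.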

Entrez Databases

• Each record is assigned a UID: a unique integer identifier for internal tracking (the GI number for Nucleotide).
• Each record is given a Document Summary (DocSum): a summary of the record’s content.
• Each record is assigned links to biologically related UIDs.
• Each record is indexed by data fields: [author], [title], [organism], and many others.
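The four properties above can be pictured with a toy data structure. This is only a sketch of the idea, not how Entrez is implemented: the class names are hypothetical, UID 4680720 is the M17755 example used later in these slides, and the second record and the linked UID 999 are made up.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EntrezRecord:
    uid: int                                       # unique integer identifier
    docsum: str                                    # short document summary
    fields: dict[str, str]                         # indexed data fields
    links: set[int] = field(default_factory=set)   # biologically related UIDs


class FieldIndex:
    """Toy inverted index: (field name, lower-cased term) -> set of UIDs."""

    def __init__(self) -> None:
        self._index: dict[tuple[str, str], set[int]] = defaultdict(set)
        self._records: dict[int, EntrezRecord] = {}

    def add(self, record: EntrezRecord) -> None:
        self._records[record.uid] = record
        for name, value in record.fields.items():
            self._index[(name, value.lower())].add(record.uid)

    def search(self, term: str, field_name: str) -> set[int]:
        """Return UIDs whose given field matches the term exactly (case-insensitive)."""
        return set(self._index.get((field_name, term.lower()), set()))

    def docsum(self, uid: int) -> str:
        return self._records[uid].docsum


index = FieldIndex()
index.add(EntrezRecord(
    uid=4680720,
    docsum="M17755 Homo sapiens thyroid peroxidase (TPO) mRNA",
    fields={"primary accession": "M17755", "organism": "Homo sapiens"},
    links={999},  # hypothetical related UID
))
index.add(EntrezRecord(
    uid=1,  # made-up record for contrast
    docsum="hypothetical mouse record",
    fields={"organism": "Mus musculus"},
))

hits = index.search("homo sapiens", "organism")
print(hits, [index.docsum(uid) for uid in hits])
```

A search is scoped to one field, which is why the organism name does not match when looked up under a different field.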

The Entrez System

[Diagram: Entrez at the center, linking these databases: Nucleotide, Protein, Structure, PubMed, PopSet, Genome, OMIM, Taxonomy, Books, ProbeSet, 3D Domains, UniSTS, SNP, CDD, UniGene, Journals, PubMed Central.]

The Entrez System: Text Searches

Entrez Taxonomy

The backbone of NCBI

[organism]

An Entrez Database - Nucleotide

GenBank: Primary Data (97.9%)
• original submissions by experimentalists
• submitters retain editorial control of records
• archival in nature

RefSeq: Derivative Data (2.1%)
• curated by NCBI staff
• NCBI retains editorial control of records
• record content is updated continually

What is GenBank? NCBI’s Primary Sequence Database

• Nucleotide-only sequence database
• Archival in nature
• Each record is assigned a stable accession number

GenBank Data
• Direct submissions (traditional records)
• Batch submissions (EST, GSS, STS)
• ftp accounts (genome data)

Three collaborating databases
• GenBank
• DNA Database of Japan (DDBJ)
• European Molecular Biology Laboratory (EMBL) Database

A Traditional GenBank Record

LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
                     LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
                     ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
                     EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
                     KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt

     1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
     1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
     1921 aaaaaaaaaa a
//

The Flatfile Format

A flatfile record has three parts:
• Header
• Feature Table
• Sequence
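The three-part layout of a flatfile can be recovered with a few lines of code. The sketch below is a simplification made for illustration: it assumes the feature table begins at the line starting with FEATURES, the sequence at ORIGIN, and the record ends at `//`. The function names are hypothetical, and the sample is a heavily abbreviated excerpt of AY182241 (first 60 bases only), so it is not a valid full record. For real work an established parser, such as Biopython's `SeqIO` with the `"genbank"` format, is the safer choice.

```python
def split_flatfile(text: str) -> dict[str, list[str]]:
    """Split one GenBank flatfile record into header, features, and sequence lines."""
    sections: dict[str, list[str]] = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith("ORIGIN"):
            current = "sequence"
        elif line.startswith("//"):
            break  # end-of-record marker
        if line.strip():
            sections[current].append(line)
    return sections


def sequence_string(sequence_lines: list[str]) -> str:
    """Drop the ORIGIN keyword, position numbers, and spaces; keep only bases."""
    bases: list[str] = []
    for line in sequence_lines:
        if line.startswith("ORIGIN"):
            continue
        bases.extend(tok for tok in line.split() if tok.isalpha())
    return "".join(bases)


# Abbreviated excerpt of the record, for illustration only.
SAMPLE = """\
LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004
ACCESSION   AY182241
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
//
"""

parts = split_flatfile(SAMPLE)
seq = sequence_string(parts["sequence"])
print(len(parts["header"]), len(parts["features"]), len(seq))
```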

An Example Record – M17755

Indexing for Nucleotide UID 4680720

Field                 Indexed Terms
[primary accession]   M17755
[title]               Homo sapiens thyroid peroxidase (TPO) mRNA…
[organism]            Homo sapiens
[sequence length]     3060
[modification date]   1999/04/26
[properties]          biomol mrna
                      gbdiv pri
                      srcdb genbank

M17755: Feature Table

• CDS position in bp
• TPO [gene name]
• thyroiditis [text word]
• thyroid peroxidase [protein name]
• protein accession

Sequence: 99.99% Accurate

The sequence itself is not indexed…

Use BLAST for that!

RefSeq: NCBI’s Derivative Sequence Database

The RefSeq database is a collection of taxonomically diverse, non-redundant, and richly annotated sequences representing naturally occurring molecules of DNA, RNA, and protein.

It holds non-redundant nucleotide and protein sequences from plasmids, organelles, viruses, archaea, bacteria, and eukaryotes.

It is updated to reflect current sequence data and biology.

Each RefSeq is constructed wholly from sequence data submitted to the International Nucleotide Sequence Database Collaboration (INSDC).

Similar to a review article, a RefSeq integrates information from multiple sources at a given time, and so provides a foundation for uniting sequence data with genetic and functional information.

RefSeqs are generated to provide reference standards for purposes ranging from genome annotation to reporting locations of sequence variation in medical records.

Common RefSeq accession prefixes

Accession prefix   Molecular type
NC_                Complete genomic molecule (chromosome; microbial or organelle genome)
NT_                Genomic contig
NM_                Curated mRNA
XM_                mRNA (computed)
NP_                Curated protein
XP_                Protein (computed)
NR_                Curated RNA
XR_                RNA (computed)
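The prefix table lends itself to a simple lookup. The sketch below covers only the eight prefixes listed above; the function name and the fallback message are hypothetical, and accessions with other prefixes (including ordinary GenBank accessions such as M17755) fall through to the default.

```python
# Prefix -> molecule type, limited to the prefixes listed in the table above.
REFSEQ_PREFIXES = {
    "NC_": "Complete genomic molecule",
    "NT_": "Genomic contig",
    "NM_": "Curated mRNA",
    "XM_": "mRNA (computed)",
    "NP_": "Curated protein",
    "XP_": "Protein (computed)",
    "NR_": "Curated RNA",
    "XR_": "RNA (computed)",
}


def classify_accession(accession: str) -> str:
    """Return the molecule type for a RefSeq accession, based on its 3-character prefix."""
    prefix = accession.strip()[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "not a RefSeq prefix in this table")


for acc in ("NM_000537", "M17755"):
    print(acc, "->", classify_accession(acc))
```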

Entrez Gene and RefSeq

• Entrez Gene is the central depository for information about a gene available at NCBI, and often provides links to sites beyond NCBI

• Entrez Gene includes records for organisms that have NCBI Reference Sequences (RefSeqs)

• Entrez Gene records contain RefSeq mRNAs, proteins, and genomic DNA (if known) for a gene locus, plus links to other Entrez databases

GenBank RefSeq Gene

Nucleotide

Entrez Gene: RefSeq Annotations

Beyond RefSeq

If your organism does not have RefSeqs…

UniGene: gene-based clusters of cDNAs and ESTs

WGS sequences in Entrez Nucleotide (wgs[prop])

Trace Archive

What is UniGene?

A gene-oriented view of sequence entries

• MegaBlast-based automated sequence clustering
• Now informed by genome hits (new)
• Nonredundant set of gene-oriented clusters
• Each cluster a unique gene
• Information on tissue types and map locations
• Includes known genes and uncharacterized ESTs
• Useful for gene discovery and selection of mapping reagents

Organisms in UniGene: Top Ten

1. Human
2. Rice
3. Mouse
4. Cow
5. Wheat
6. Zebrafish
7. Pig
8. Chicken
9. Frog (X. laevis)
10. Frog (X. tropicalis)

Finding UniGene Clusters

• by link
• by Entrez search

UniGene Cluster for TPO

Entrez Protein

Source                          Records
GenPept (DDBJ, EMBL, GenBank)   4,444,405
RefSeq                          1,753,167
PIR                               222,395
Swiss Prot                        189,005
PDB                                68,621
PRF                                12,079
Third Party Annotation              4,219
Total                           6,693,891

Protein Sources and Links

[Diagram: protein sources (PIR, RefSeq, SWISS-PROT, GenPept) and their links to mRNA records. Labels shown: NM_000537, M17755, and "no mRNA!" (twice).]

PubMed is the NCBI gateway to MEDLINE.

MEDLINE contains bibliographic citations and author abstracts from over 4,000 journals published world-wide. It has 12 million records dating back to 1966.

To impose uniformity and consistency on the indexing of biomedical literature, the MeSH vocabulary is used to index journal articles for MEDLINE.

MeSH is the acronym for "Medical Subject Headings."

MeSH is the list of the vocabulary terms used for subject analysis of biomedical literature at NLM.

Taxonomy Browser is…

• browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses)
• taxonomy information such as genetic codes
• molecular data on extinct organisms

Structure site includes…

• Molecular Modelling Database (MMDB)
• biopolymer structures obtained from the Protein Data Bank (PDB)
• Cn3D (a 3D-structure viewer)
• vector alignment search tool (VAST)

OMIM is…

• Online Mendelian Inheritance in Man
• catalog of human genes and genetic disorders
• edited by Dr. Victor McKusick, others at JHU

General Protein Databases

• SWISS-PROT: manually curated, high-quality annotations; less data
• GenPept/TrEMBL: translated coding sequences from GenBank/EMBL; few annotations, more up to date
• PIR: phylogenetic-based annotations

All three are now combining efforts to form UniProt (http://www.uniprot.org)

Swiss-Prot format: http://us.expasy.org/sprot/userman.html

Non-redundant Databases

• Sequence data only: cannot be browsed, can only be searched using a sequence
• Combine sequences from more than one database
• Examples:
  NR Nucleic (GenBank + EMBL + DDBJ + PDB DNA)
  NR Protein (SWISS-PROT + TrEMBL + GenPept + PDB protein)
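One way to picture a non-redundant set is to merge several sources and collapse entries whose sequences are identical. The sketch below assumes that "non-redundant" means exact-duplicate removal; the function name, identifiers, and peptide strings are all made up for illustration.

```python
from collections import defaultdict


def build_nonredundant(*databases: dict[str, str]) -> dict[str, list[str]]:
    """Map each distinct sequence to every identifier that carries it."""
    merged: dict[str, list[str]] = defaultdict(list)
    for db in databases:
        for identifier, sequence in db.items():
            # Compare case-insensitively so 'mkv' and 'MKV' count as one sequence.
            merged[sequence.upper()].append(identifier)
    return dict(merged)


# Toy source databases with made-up identifiers and sequences.
source_a = {"A:0001": "MEFRVHLQ", "A:0002": "MKTAYIAK"}
source_b = {"B:9001": "mefrvhlq"}  # same peptide as A:0001, different case

nr = build_nonredundant(source_a, source_b)
for sequence, ids in nr.items():
    print(sequence, ids)
```

Three input entries become two non-redundant sequences, and the shared one keeps both identifiers so no cross-reference is lost.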

Protein domain databases

• Pfam (http://www.sanger.ac.uk/Software/Pfam/): collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families
• SMART, a Simple Modular Architecture Research Tool (http://smart.embl-heidelberg.de/help/smart_about.shtml): identification and annotation of genetically mobile domains and the analysis of domain architectures
• CDD (http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi): combines the SMART and Pfam databases; easier and quicker search

Sequence Motif Databases

• Scan Prosite (http://www.expasy.org/prosite) and PRINTS (http://bioinf.man.ac.uk/dbbrowser/PRINTS/)
• Store conserved motifs occurring in nucleic acid or protein sequences
• Motifs can be stored as consensus sequences, alignments, or statistical representations such as residue frequency tables
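The three motif representations are closely related: a residue frequency table can be computed from an alignment, and a consensus sequence can be read off the table. The sketch below shows this on a made-up four-sequence alignment. The function names, the 0.5 threshold, and the `x` placeholder for unconserved columns are illustrative choices, not the conventions of Prosite or PRINTS.

```python
from collections import Counter


def frequency_table(alignment: list[str]) -> list[dict[str, float]]:
    """Per-column residue frequencies for equal-length aligned sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    table = []
    for column in zip(*alignment):
        counts = Counter(column)
        table.append({res: n / len(alignment) for res, n in counts.items()})
    return table


def consensus(table: list[dict[str, float]], threshold: float = 0.5,
              unknown: str = "x") -> str:
    """Most frequent residue per column, or `unknown` if none exceeds the threshold."""
    out = []
    for column in table:
        residue, freq = max(column.items(), key=lambda item: item[1])
        out.append(residue if freq > threshold else unknown)
    return "".join(out)


# Made-up alignment of a short motif.
alignment = ["GDSGG", "GDSGA", "GESGG", "GDTGG"]
table = frequency_table(alignment)
motif = consensus(table)
print(table[1])
print(motif)
```

The frequency table keeps information the consensus discards, such as the minority E at position 2, which is why statistical representations can be more sensitive in searches.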
