NCBI Molecular Biology Resources
The International Sequence Database Collaboration
[Figure: the three collaborating databases. GenBank (NCBI, NIH) is searched through Entrez; EMBL (EBI) through SRS; DDBJ (CIB, NIG) through getentry. Each database receives submissions and updates. Submission tools: Sequin, BankIt, WebIn, SAKURA.]
Primary vs. Derivative Databases
[Figure: GenBank receives sequences from sequencing centers and labs and is updated ONLY by submitters. GenBank divisions shown: EST, STS, GSS, HTG, PRI, ROD, PLN, MAM, BCT, INV, VRT, PHG, VRL. Derivative databases (UniGene, UniSTS, and RefSeq, the latter built by the LocusLink and Genomes pipelines and the Annotation pipeline) are produced by curators and algorithms and are updated continually by NCBI.]
Types of Databases
Primary Databases
• Original submissions by experimentalists
• Remember biology's Central Dogma: DNA → RNA → protein. "Primary" refers to one-dimensional "symbol" information, written in sequential order, that specifies a particular biological molecule, be it a polypeptide or a polynucleotide.
• Content controlled by the submitter
• Examples: GenBank, SNP, GEO, PubChem Substance
Derivative Databases
• Built from primary data
• Content controlled by a third party (NCBI)
• Examples: RefSeq, TPA, RefSNP, UniGene, Protein, Structure, Conserved Domain, PubChem Compound
What is Entrez?
• Entrez Global Query is an integrated search and retrieval system for the databases of the National Center for Biotechnology Information (NCBI).
• It provides access to all NCBI databases simultaneously with a single query string and user interface.
• It supports Boolean operators and search term tags that limit parts of the search statement to particular fields.
• A search returns a unified results page that shows the number of hits in each database; each count links to the actual search results for that database.
• A text search and retrieval engine, NOT A DATABASE.
• A tool for finding biologically linked data.
• A virtual workspace for manipulating large datasets.
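As a minimal sketch of how Boolean operators and field tags combine into one query string, the Python below builds a tagged search term and wraps it in a URL. The [title] and [organism] tags come from these slides; the E-utilities esearch address and its db and term parameters are not mentioned in the slides and are introduced here as an assumption, and the helper names are hypothetical. Nothing is sent over the network.

```python
from urllib.parse import urlencode

# Assumption: NCBI's E-utilities "esearch" endpoint, used here only to build a URL.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(term: str, field: str) -> str:
    """Limit a search term to one field, e.g. Homo sapiens[organism]."""
    return f"{term}[{field}]"


def all_of(*clauses: str) -> str:
    """Join clauses with the Boolean AND operator."""
    return " AND ".join(clauses)


def esearch_url(db: str, term: str) -> str:
    """Build (but do not send) a search URL for one Entrez database."""
    return f"{ESEARCH}?{urlencode({'db': db, 'term': term})}"


term = all_of(
    tagged("thyroid peroxidase", "title"),
    tagged("Homo sapiens", "organism"),
)
url = esearch_url("nucleotide", term)

print(term)
print(url)
```

The same term string can be pasted directly into the Entrez search box; the URL form is only one possible way to submit it programmatically.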
Entrez Databases
• Each record is assigned a UID: a unique integer identifier for internal tracking (the GI number for Nucleotide)
• Each record is given a Document Summary (DocSum): a summary of the record's content
• Each record is assigned links to biologically related UIDs
• Each record is indexed by data fields: [author], [title], [organism], and many others
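The four properties of an Entrez record listed above can be pictured with a small in-memory model. This is an illustrative sketch only, not NCBI's implementation: the Record class, build_index, and search are hypothetical names, and the second record (UID 1) is made up. The first record reuses the M17755 values shown elsewhere in these slides.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Record:
    uid: int                      # unique integer identifier
    docsum: str                   # short summary of the record's content
    fields: dict[str, list[str]]  # indexed data fields, e.g. "organism"
    links: dict[str, list[int]] = field(default_factory=dict)  # related UIDs


def build_index(records: list[Record]) -> dict[tuple[str, str], set[int]]:
    """Map (field name, lowercased term) to the set of UIDs containing it."""
    index: dict[tuple[str, str], set[int]] = defaultdict(set)
    for rec in records:
        for name, terms in rec.fields.items():
            for term in terms:
                index[(name, term.lower())].add(rec.uid)
    return index


def search(index: dict[tuple[str, str], set[int]], term: str, field_name: str) -> list[int]:
    """Return the sorted UIDs whose given field contains the term."""
    return sorted(index.get((field_name, term.lower()), set()))


records = [
    Record(
        uid=4680720,
        docsum="Homo sapiens thyroid peroxidase (TPO) mRNA",
        fields={
            "primary accession": ["M17755"],
            "organism": ["Homo sapiens"],
            "sequence length": ["3060"],
        },
    ),
    # Hypothetical second record, present only to show that the field filter works.
    Record(uid=1, docsum="made-up record", fields={"organism": ["Mus musculus"]}),
]

index = build_index(records)
hits = search(index, "homo sapiens", "organism")
print(hits)
```

A fielded search is then a dictionary lookup on the pair (field, term), which is why a term tagged [organism] does not match the same words in a title.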
The Entrez System
[Figure: Entrez at the center, linking the databases Nucleotide, Protein, Structure, PubMed, PopSet, Genome, OMIM, Taxonomy, Books, ProbeSet, 3D Domains, UniSTS, SNP, CDD, UniGene, Journals, and PubMed Central.]
The Entrez System: Text Searches
Entrez Taxonomy: the backbone of NCBI, searched with the [organism] field
An Entrez Database: Nucleotide
GenBank: Primary Data (97.9%)
• original submissions by experimentalists
• submitters retain editorial control of records
• archival in nature
RefSeq: Derivative Data (2.1%)
• curated by NCBI staff
• NCBI retains editorial control of records
• record content is updated continually
What is GenBank? NCBI's Primary Sequence Database
• Nucleotide-only sequence database
• Archival in nature
• Each record is assigned a stable accession number
GenBank Data
• Direct submissions (traditional records)
• Batch submissions (EST, GSS, STS)
• ftp accounts (genome data)
Three collaborating databases
• GenBank
• DNA Database of Japan (DDBJ)
• European Molecular Biology Laboratory (EMBL) Database
A Traditional GenBank Record
LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
                     LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
                     ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
                     EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
                     KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
      ...
     1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
     1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
     1921 aaaaaaaaaa a
//
The Flatfile Format
• Header
• Feature Table
• Sequence
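To make the header and feature table concrete, here is a minimal sketch that reads the LOCUS line of the AY182241 record and computes the length of its CDS from the location 54..1784. The function names are hypothetical, the LOCUS parser assumes the eight whitespace-separated tokens seen in this record, and the location parser handles only the simple start..end form (no joins or complements).

```python
from dataclasses import dataclass


@dataclass
class Locus:
    name: str
    length_bp: int
    mol_type: str
    topology: str
    division: str
    date: str


def parse_locus(line: str) -> Locus:
    """Parse a LOCUS line laid out as: LOCUS name length bp type topology division date."""
    parts = line.split()
    if len(parts) != 8 or parts[0] != "LOCUS" or parts[3] != "bp":
        raise ValueError(f"unexpected LOCUS line layout: {line!r}")
    return Locus(parts[1], int(parts[2]), parts[4], parts[5], parts[6], parts[7])


def span_length(location: str) -> int:
    """Length of a simple 'start..end' location; coordinates are 1-based and inclusive."""
    start, end = (int(x) for x in location.split(".."))
    return end - start + 1


locus = parse_locus(
    "LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004"
)
cds_nt = span_length("54..1784")
cds_codons = cds_nt // 3  # includes the stop codon for a complete CDS

print(locus.name, locus.length_bp, locus.division)  # AY182241 1931 PLN
print(cds_nt, cds_codons)                           # 1731 577
```

Because the coordinates are inclusive, the length is end minus start plus one; the CDS length of 1731 nt divides evenly by three, as expected for a complete coding sequence with /codon_start=1.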
An Example Record – M17755
Indexing for Nucleotide UID 4680720
Field                 Indexed Terms
[primary accession]   M17755
[title]               Homo sapiens thyroid peroxidase (TPO) mRNA…
[organism]            Homo sapiens
[sequence length]     3060
[modification date]   1999/04/26
[properties]          biomol mrna
                      gbdiv pri
                      srcdb genbank
M17755: Feature Table
CDS position in bp
TPO [gene name]
thyroiditis [text word]
thyroid peroxidase [protein name]
protein accession
Sequence: 99.99% Accurate
The sequence itself is not indexed…
Use BLAST for that!
RefSeq: NCBI's Derivative Sequence Database
• The RefSeq database is a collection of taxonomically diverse, non-redundant, and richly annotated sequences representing naturally occurring molecules of DNA, RNA, and protein.
• It holds non-redundant nucleotide and protein sequences from plasmids, organelles, viruses, archaea, bacteria, and eukaryotes.
• Records are updated to reflect current sequence data and biology.
• Each RefSeq is constructed wholly from sequence data submitted to the International Nucleotide Sequence Database Collaboration (INSDC).
• Similar to a review article, a RefSeq integrates information from multiple sources at a given time, and so provides a foundation for uniting sequence data with genetic and functional information.
• RefSeqs are generated to provide reference standards for purposes ranging from genome annotation to reporting the locations of sequence variation in medical records.
Common RefSeq accession prefixes
Accession prefix   Molecule type
NC_                Complete genomic molecule (chromosome; microbial or organelle genome)
NT_                Genomic contig
NM_                Curated mRNA
XM_                mRNA (computed)
NP_                Curated protein
XP_                Protein (computed)
NR_                Curated RNA
XR_                RNA (computed)
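The prefix table lends itself to a simple lookup. The sketch below classifies an accession by its first three characters; the function name describe_refseq is hypothetical, and only the eight prefixes listed on the slide are covered, so any other accession (including a GenBank accession such as M17755) returns None.

```python
from typing import Optional

# The eight prefixes from the slide; RefSeq uses others that are not listed here.
REFSEQ_PREFIXES = {
    "NC_": "Complete genomic molecule",
    "NT_": "Genomic contig",
    "NM_": "Curated mRNA",
    "XM_": "mRNA (computed)",
    "NP_": "Curated protein",
    "XP_": "Protein (computed)",
    "NR_": "Curated RNA",
    "XR_": "RNA (computed)",
}


def describe_refseq(accession: str) -> Optional[str]:
    """Return the molecule type for a listed RefSeq prefix, or None otherwise."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


for acc in ("NM_000537", "M17755", "AY182241"):
    print(acc, "->", describe_refseq(acc))
```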
Entrez Gene and RefSeq
• Entrez Gene is the central depository for information about a gene available at NCBI, and often provides links to sites beyond NCBI
• Entrez Gene includes records for organisms that have NCBI Reference Sequences (RefSeqs)
• Entrez Gene records contain RefSeq mRNAs, proteins, and genomic DNA (if known) for a gene locus, plus links to other Entrez databases
[Diagram labels: GenBank, RefSeq, Gene, Nucleotide]
Entrez Gene: RefSeq Annotations
Beyond RefSeq
If your organism does not have RefSeqs…
• UniGene: gene-based clusters of cDNAs and ESTs
• WGS sequences in Entrez Nucleotide (wgs[prop])
• Trace Archive
What is UniGene?
A gene-oriented view of sequence entries
• MegaBlast-based automated sequence clustering
• Now informed by genome hits (new)
• Nonredundant set of gene-oriented clusters
• Each cluster a unique gene
• Information on tissue types and map locations
• Includes known genes and uncharacterized ESTs
• Useful for gene discovery and selection of mapping reagents
Organisms in UniGene: Top Ten
1. Human
2. Rice
3. Mouse
4. Cow
5. Wheat
6. Zebrafish
7. Pig
8. Chicken
9. Frog (X. laevis)
10. Frog (X. tropicalis)
Finding UniGene Clusters
• by link
• by Entrez search
UniGene Cluster for TPO
Entrez Protein
Source                          Records
GenPept (DDBJ, EMBL, GenBank)   4,444,405
RefSeq                          1,753,167
PIR                               222,395
SWISS-PROT                        189,005
PDB                                68,621
PRF                                12,079
Third Party Annotation              4,219
Total                           6,693,891
Protein Sources and Links
[Figure: protein records from PIR, RefSeq, SWISS-PROT, and GenPept. Labels shown: NM_000537, M17755, and "no mRNA!" (twice).]
PubMed is the NCBI gateway to MEDLINE.
MEDLINE contains bibliographic citations and author abstracts from over 4,000 journals published world-wide. It has 12 million records dating back to 1966.
To impose uniformity and consistency on the indexing of biomedical literature, the MeSH vocabulary is used to index journal articles for MEDLINE.
MeSH stands for "Medical Subject Headings."
MeSH is the list of vocabulary terms used for subject analysis of biomedical literature at NLM.
Taxonomy Browser is…
• a browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses)
• taxonomy information such as genetic codes
• molecular data on extinct organisms
Structure site includes…
• Molecular Modeling Database (MMDB)
• biopolymer structures obtained from the Protein Data Bank (PDB)
• Cn3D (a 3D-structure viewer)
• Vector Alignment Search Tool (VAST)
OMIM is…
• Online Mendelian Inheritance in Man
• a catalog of human genes and genetic disorders
• edited by Dr. Victor McKusick and others at JHU
General Protein Databases
• SWISS-PROT: manually curated, high-quality annotations; less data
• GenPept/TrEMBL: translated coding sequences from GenBank/EMBL; few annotations, more up to date
• PIR: phylogenetic-based annotations
All three are now combining efforts to form UniProt (http://www.uniprot.org)
Swiss-Prot format: http://us.expasy.org/sprot/userman.html
Non-redundant Databases
• Sequence data only: cannot be browsed, can only be searched using a sequence
• Combine sequences from more than one database
• Examples:
  NR Nucleic (GenBank + EMBL + DDBJ + PDB DNA)
  NR Protein (SWISS-PROT + TrEMBL + GenPept + PDB protein)
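One way to picture how a non-redundant set is assembled from several source databases is to key the merged collection on the sequence itself, so that identical sequences from different sources collapse into one entry. The sketch below is an illustration under that assumption, not a description of how the NR databases are actually built; the function name and the toy sequences and accessions are all made up.

```python
def merge_nonredundant(*sources: tuple[str, list[tuple[str, str]]]) -> dict[str, list[str]]:
    """Collapse identical sequences; each sequence keeps every source identifier."""
    merged: dict[str, list[str]] = {}
    for source_name, records in sources:
        for accession, sequence in records:
            key = sequence.upper()  # compare sequences case-insensitively
            merged.setdefault(key, []).append(f"{source_name}:{accession}")
    return merged


# Toy data: accessions and sequences are hypothetical.
swissprot = ("SWISS-PROT", [("SP0001", "MEFRVHLQ"), ("SP0002", "MKPEPEAS")])
genpept = ("GenPept", [("GP0001", "mefrvhlq")])
pdb = ("PDB", [("1ABC_A", "MKPEPEAS"), ("2XYZ_A", "NDFLDQSL")])

nr = merge_nonredundant(swissprot, genpept, pdb)

for sequence, ids in nr.items():
    print(sequence, ids)
```

Five input records become three entries, each carrying the identifiers of every source that contributed the same sequence.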
Protein domain databases
• Pfam (http://www.sanger.ac.uk/Software/Pfam/): collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families
• SMART (a Simple Modular Architecture Research Tool; http://smart.embl-heidelberg.de/help/smart_about.shtml): identification and annotation of genetically mobile domains and the analysis of domain architectures
• CDD (http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi): combines the SMART and Pfam databases; easier and quicker search
Sequence Motif Databases
• Scan Prosite (http://www.expassy.org/prosite) and PRINTS (http://bioinf.man.ac.uk/dbbrowser/PRINTS/)
• Store conserved motifs occurring in nucleic acid or protein sequences
• Motifs can be stored as consensus sequences, alignments, or using statistical representations such as residue frequency tables
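The three motif representations named above are closely related: an alignment of motif instances yields a residue frequency table, and the table yields a consensus sequence. The sketch below shows that chain on a toy ungapped alignment. The four six-letter instances are hypothetical, the function names are illustrative, and ties are broken alphabetically as an arbitrary choice.

```python
from collections import Counter


def frequency_table(aligned: list[str]) -> list[Counter]:
    """Count residues in each column of an ungapped alignment of equal-length strings."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    return [Counter(column) for column in zip(*aligned)]


def consensus(table: list[Counter]) -> str:
    """Most frequent residue per column; ties are broken alphabetically."""
    return "".join(min(col, key=lambda r: (-col[r], r)) for col in table)


# Hypothetical motif instances, already aligned.
alignment = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]

table = frequency_table(alignment)
motif = consensus(table)

print(motif)  # TATAAT
for position, counts in enumerate(table, start=1):
    freqs = {residue: count / len(alignment) for residue, count in sorted(counts.items())}
    print(position, freqs)
```

The consensus keeps only the most common residue at each position, while the frequency table retains how strongly each position is conserved, which is why statistical representations are more sensitive for searching.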