ncbi field guide ncbi molecular biology resources march 2007 ncbi databases
Post on 18-Dec-2015
245 views
TRANSCRIPT
![Page 1: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/1.jpg)
NC
BI F
ield
G
uid
e
NCBI Molecular Biology Resources
March 2007
NCBI Databases
![Page 2: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/2.jpg)
NC
BI F
ield
G
uid
e The National Center for
Biotechnology Information
Created in 1988 as a part of theNational Library of Medicine at NIH
– Establish public databases– Research in computational biology– Develop software tools for sequence analysis– Disseminate biomedical information
Bethesda,MD
![Page 3: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/3.jpg)
NC
BI F
ield
G
uid
e
Web Access: www.ncbi.nlm.nih.gov
![Page 4: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/4.jpg)
NC
BI F
ield
G
uid
e
NCBI Databases and Services
• GenBank largest sequence database
• Free public access to biomedical literature– PubMed free Medline
– PubMed Central full text online access
• Entrez integrated molecular and literature databases
• BLAST highest volume sequence search service
• VAST structure similarity searches
• Software and Databases
![Page 5: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/5.jpg)
NC
BI F
ield
G
uid
e
Types of Databases
• Primary Databases– Original submissions by experimentalists– Content controlled by the submitter
• Examples: GenBank, SNP, GEO
• Derivative Databases– Built from primary data– Content controlled by third party (NCBI)
• Examples: Refseq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain
![Page 6: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/6.jpg)
NC
BI F
ield
G
uid
e
Entrez Nucleotides
Primary • GenBank / EMBL / DDBJ 86,766,287
Derivative• RefSeq 1,715,255
• Third Party Annotation 5,312
• PDB 7,334 Total 88,494,392
![Page 7: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/7.jpg)
NC
BI F
ield
G
uid
e
What is GenBank? NCBI’s Primary Sequence
Database• Nucleotide only sequence database • Archival in nature
– Historical– Reflective of submitter point of view (subjective)– Redundant
• GenBank Data– Direct submissions (traditional records)– Batch submissions (EST, GSS, STS)– ftp accounts (genome data)
• Three collaborating databases– GenBank– DNA Database of Japan (DDBJ) – European Molecular Biology Laboratory (EMBL)
Database
![Page 8: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/8.jpg)
NC
BI F
ield
G
uid
e
EBI
GenBankGenBank
DDBJDDBJ
EMBLEMBL
EMBLEMBL
Entrez
SRS
getentry
NIGNIGCIB
NCBI
NIHNIH
•Submissions•Updates •Submissions
•Updates
•Submissions•Updates
International Sequence Database Collaboration
![Page 9: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/9.jpg)
NC
BI F
ield
G
uid
e
GenBank: NCBI’s Primary Sequence Database
ftp://ftp.ncbi.nih.gov/genbank/ftp://ftp.ncbi.nih.gov/genbank/
Release 158 February 2007
86,639,920 Records
157,335,689,977 Total Bases
263 Gigabytes (non-WGS) 1115 files (non-WGS)
• full release every two months• incremental updates daily• available only via ftp
• full release every two months• incremental updates daily• available only via ftp
![Page 10: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/10.jpg)
NC
BI F
ield
G
uid
e
Aug-97 Aug-98 Aug-99 Aug-00 Aug-01 Aug-02 Aug-03 Aug-04 Aug-05 Aug-060
20
40
60
80
100
120
140
160
Bas
es
(bil
lio
ns)
The Growth of GenBank
Non-WGS: 71.3 billion basesNon-WGS: 71.3 billion bases
WGS: 86.0 billion bases WGS: 86.0 billion bases
Release 158Release 158
Doubling time 12-14 months
![Page 11: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/11.jpg)
NC
BI F
ield
G
uid
e Organization of GenBank:
Traditional Divisions
Records are divided into 18 Divisions.12 Traditional 6 Bulk
Traditional Divisions: Traditional Divisions: • Direct Submissions (Sequin and BankIt)
• Accurate• Well characterized
PRI Primate PLN Plant and FungalBCT Bacterial and Archeal INV InvertebrateROD RodentVRL ViralVRT Other VertebrateMAM Mammalian PHG PhageSYN Synthetic (cloning vectors)ENV Environmental Samples UNA Unannotated
Entrez query: gbdiv_xxx[Properties]
![Page 12: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/12.jpg)
NC
BI F
ield
G
uid
e Organization of GenBank:
Bulk Divisions
Records are divided into 18 Divisions.12 Traditional 6 Bulk
BULK Divisions: BULK Divisions: • Batch Submission (Email and FTP)
• Inaccurate• Poorly characterized
EST Expressed Sequence Tag GSS Genome Survey SequenceHTG High Throughput GenomicSTS Sequence Tagged SiteHTC High Throughput cDNAPAT Patent
Entrez query: gbdiv_xxx[Properties]
![Page 13: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/13.jpg)
NC
BI F
ield
G
uid
e
A TraditionalGenBank
Record
LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004DEFINITION Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA, complete cds.ACCESSION AY182241VERSION AY182241.2 GI:32265057KEYWORDS .SOURCE Malus x domestica (cultivated apple) ORGANISM Malus x domestica Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.REFERENCE 1 (bases 1 to 1931) AUTHORS Pechous,S.W. and Whitaker,B.D. TITLE Cloning and functional expression of an (E,E)-alpha-farnesene synthase cDNA from peel tissue of apple fruit JOURNAL Planta 219, 84-94 (2004)REFERENCE 2 (bases 1 to 1931) AUTHORS Pechous,S.W. and Whitaker,B.D. TITLE Direct Submission JOURNAL Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab, USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USAREFERENCE 3 (bases 1 to 1931) AUTHORS Pechous,S.W. and Whitaker,B.D. TITLE Direct Submission JOURNAL Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab, USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA REMARK Sequence update by submitterCOMMENT On Jun 26, 2003 this sequence version replaced gi:27804758.FEATURES Location/Qualifiers source 1..1931 /organism="Malus x domestica" /mol_type="mRNA" /cultivar="'Law Rome'" /db_xref="taxon:3750" /tissue_type="peel" gene 1..1931 /gene="AFS1" CDS 54..1784 /gene="AFS1" /note="terpene synthase" /codon_start=1 /product="(E,E)-alpha-farnesene synthase" /protein_id="AAO22848.2" /db_xref="GI:32265058" /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI LSLLFQPLVN"ORIGIN 1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat 61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg 121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt 181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga 241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt//
Header
Feature Table
Sequence
The Flatfile Format
![Page 14: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/14.jpg)
NC
BI F
ield
G
uid
e
Traditional GenBank Record
ACCESSION U07418
VERSION U07418.1 GI:466461
ACCESSION U07418
VERSION U07418.1 GI:466461
Accession•Stable•Reportable•Universal
Accession•Stable•Reportable•Universal
VersionTracks changes in sequenceVersionTracks changes in sequence
GI numberNCBI internal useGI numberNCBI internal use
well annotatedwell annotated
the sequence is the datathe sequence is the data
![Page 15: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/15.jpg)
NC
BI F
ield
G
uid
e
Bulk Divisions
• Expressed Sequence Tag– 1st pass single read cDNA
• Genome Survey Sequence– 1st pass single read gDNA
• High Throughput Genomic– incomplete sequences of genomic clones
• Sequence Tagged Site– PCR-based mapping reagents
•Batch Submission and htg (email and ftp)•Inaccurate•Poorly Characterized
![Page 16: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/16.jpg)
NC
BI F
ield
G
uid
e
GenBank Bulk Sequence: EST
poorly characterizedpoorly characterized
![Page 17: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/17.jpg)
NC
BI F
ield
G
uid
e
ESTs in Entrez
Total 41 million recordsHuman 7.9 millionMouse 4.7 millionCow 1.3 millionRice 1.2 millionZebrafish 1.2 millionMaize 1.2 millionXenopus tropicalis 1.0 millionRat 0.9 millionWheat 0.9 millionChicken 0.6 millionBarley 0.4 million
Total 41 million recordsHuman 7.9 millionMouse 4.7 millionCow 1.3 millionRice 1.2 millionZebrafish 1.2 millionMaize 1.2 millionXenopus tropicalis 1.0 millionRat 0.9 millionWheat 0.9 millionChicken 0.6 millionBarley 0.4 million
![Page 18: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/18.jpg)
NC
BI F
ield
G
uid
e
Derivative Databases
![Page 19: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/19.jpg)
NC
BI F
ield
G
uid
e Entrez Protein: Derivative
DatabaseData Source
GenPept
Sequences
6,937,176
RefSeq 3,359,561
Third Party Annotation 5,136
Swiss Prot 255,159
PIR 29,996
PRF 12,079
PDB 91,116
PAT Division 669,035
Total 10,690,223
BLAST nr total(no patents or env)
4,545,310
![Page 20: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/20.jpg)
NC
BI F
ield
G
uid
e
FEATURES Location/Qualifiers source 1..2484 /organism="Homo sapiens" /mol_type="mRNA" /db_xref="taxon:9606" /chromosome="3" /map="3p22-p23" gene 1..2484 /gene="MLH1" CDS 22..2292 /gene="MLH1" /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession Number P14242), S. cerevisiae MLH1 (GenBank Accession Number U07187), E. coli MUTL (Swiss-Prot Accession Number P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession Number P14161) and Streptococcus pneumoniae (Swiss-Prot Accession Number P14160)" /codon_start=1 /product="DNA mismatch repair protein homolog" /protein_id="AAC50285.1" /db_xref="GI:463989" /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS
GenPept: GenBank CDS translations
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote... MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote... MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
![Page 21: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/21.jpg)
NC
BI F
ield
G
uid
e
Redundant Proteins
>gi|741682|prf||2007430A DNA mismatch repair protei...MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair...MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote... MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|4557757|ref|NP_000240.1| MutL protein homolog 1...MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|13905126|gb|AAH06850.1| MutL protein homolog 1 ... MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|1079787|gb|AAA82079.1| DNA mismatch repair prot... MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
GenPept
NCBI RefSeq
Swiss-Prot
PRF
![Page 22: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/22.jpg)
NC
BI F
ield
G
uid
e
Protein Sequences from Structures
>gi|5542073|pdb|1B63|A Chain A, Mutl Complexed With AdpnpSHMPIQVLPPQLANQIAAGEVVERPASVVKELVENSLDAGATRIDIDIERGGAKLIRIRDNGCGIKKDELALALARHATSKIASLDDLEAIISLGFRGEALASISSVSRLTLTSRTAEQQEAWQAYAEGRDMNVTVKPAAHPVGTTLEVLDLFYNTPARRKFLRTEKTEFNHIDEIIRRIALARFDVTINLSHNGKIVRQYRAVPEGGQKERRLGAICGTAFLEQALAIEWQHGDLTLRGWVADPNHTTPALAEIQYCYVNGRMMRDRLINHAIRQACEDKLGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQ
>gi|5542073|pdb|1B63|A Chain A, Mutl Complexed With AdpnpSHMPIQVLPPQLANQIAAGEVVERPASVVKELVENSLDAGATRIDIDIERGGAKLIRIRDNGCGIKKDELALALARHATSKIASLDDLEAIISLGFRGEALASISSVSRLTLTSRTAEQQEAWQAYAEGRDMNVTVKPAAHPVGTTLEVLDLFYNTPARRKFLRTEKTEFNHIDEIIRRIALARFDVTINLSHNGKIVRQYRAVPEGGQKERRLGAICGTAFLEQALAIEWQHGDLTLRGWVADPNHTTPALAEIQYCYVNGRMMRDRLINHAIRQACEDKLGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQ
![Page 23: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/23.jpg)
NC
BI F
ield
G
uid
e
Primary vs. DerivativeSequence Databases
GenBankGenBank
SequencingSequencingCentersCenters
GA
GAGA
ATTAT
TC
CGAGA
ATTAT
TC
C
AT
GAGA
ATTC
C GAGA
ATTC
C
TTGACAAT
TGACTA
ACGTGC
TTGACA
CGTGAATTGACTA
TATAGCCG
ACGTGC
ACGTGCACGTGC
TTGACA
TTGACA
CGTGA
CGTGA
CGTGA
ATTGACTA
ATTGACTAATTGACTA
ATTGACTA
TATAGCCG
TATAGCCGTATAGCCGTATAGCCG
TATAGCCG TATAGCCGTATAGCCG TATAGCCGCAT
T
GAGA
ATTC
C GAGA
ATTC
C LabsLabs
AlgorithmsAlgorithms
UniGene
CuratorsCurators
RefSeq
GenomeAssembly
TATAGCCGAGCTCCGATACCGATGACAA
Updated continuall
y by NCBI
Updated ONLY by submitters
![Page 24: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/24.jpg)
NC
BI F
ield
G
uid
e
RefSeq: NCBI’s Derivative Sequence Database
• Curated transcripts and proteins– reviewed– human, mouse, rat, fruit fly, zebrafish, arabidopsis microbial genomes (proteins), and more
• Model transcripts and proteins• Assembled Genomic Regions (contigs)
– human genome– mouse genome– rat genome
• Chromosome records– Human genome– microbial– organelle
ftp://ftp.ncbi.nih.gov/refseq/release/
srcdb_refseq[Properties]
– chicken– honeybee– sea urchin
![Page 25: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/25.jpg)
NC
BI F
ield
G
uid
e Selected RefSeq Accession
Numbers
mRNAs and Proteins
NM_123456 Curated mRNANP_123456 Curated ProteinNR_123456 Curated non-coding RNAXM_123456 Predicted mRNAXP_123456 Predicted Protein XR_123456 Predicted non-coding RNAGene RecordsNG_123456 Reference Genomic SequenceChromosomeNC_123455 Microbial replicons, organelle genomes, human chromosomesAssembliesNT_123456 Contig NW_123456 WGS Supercontig
![Page 26: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/26.jpg)
NC
BI F
ield
G
uid
e
GenBank to RefSeq
![Page 27: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/27.jpg)
NC
BI F
ield
G
uid
e
RefSeqs: Annotation Reagents
Genomic DNAGenomic DNA((NCNC, , NT, NWNT, NW))
Model mRNAModel mRNA (XM)(XM)(XR)(XR)
Curated mRNACurated mRNA (NM)(NM)(NR)(NR)
Model protein Model protein (XP)(XP)
Curated ProteinCurated Protein (NP)(NP)
Scanning....
= ?
GenBankSequences
RefSeq
![Page 28: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/28.jpg)
NC
BI F
ield
G
uid
e
RefSeq Benefits
• non-redundancy • explicitly linked nucleotide and protein sequences• updates to reflect current sequence data and biology• data validation • format consistency• distinct accession series • stewardship by NCBI staff and collaborators
![Page 29: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/29.jpg)
NC
BI F
ield
G
uid
e
Mouse Assembly
RefSeq ContigRefSeq Contig
BACBAC
WGSWGS
OtherGenBankOtherGenBank
RefSeq TranscriptRefSeq Transcript
UniGene TranscriptUniGene Transcript
![Page 30: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/30.jpg)
NC
BI F
ield
G
uid
e
Expressed Sequences
UniGene
GEO
![Page 31: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/31.jpg)
NC
BI F
ield
G
uid
e
A gene-oriented view of sequence entries
•MegaBlast based automated sequence clustering
•Now informed by genome hits New!
•Nonredundant set of gene oriented clusters
•Each cluster a unique gene
•Information on tissue types and map locations
•Includes known genes and uncharacterized ESTs
•Useful for gene discovery and selection of
mapping reagents
What is UniGene?
![Page 32: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/32.jpg)
NC
BI F
ield
G
uid
e
EST hits: Human mRNA
Albumin mRNAAlbumin mRNA
5’ EST hits5’ EST hits
3’ EST hits3’ EST hits
![Page 33: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/33.jpg)
NC
BI F
ield
G
uid
e
UniGeneChordates
Invertebrates
Plants
Fungi et al.
![Page 34: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/34.jpg)
NC
BI F
ield
G
uid
e
Xenopus laevis MLH1Cluster
Uncharacterized ESTsUncharacterized ESTs
![Page 35: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/35.jpg)
NC
BI F
ield
G
uid
e
Human ALB Cluster
![Page 36: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/36.jpg)
NC
BI F
ield
G
uid
e
Expression Data
![Page 37: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/37.jpg)
NC
BI F
ield
G
uid
e
Other NCBI Databases
•Structure: imported structures (PDB)Cn3D viewer, NCBI
curation
•CDD: conserved domain databaseProtein families (COGs
and KOGs)
Single domains (PFAM, SMART, CD)
•dbSNP: nucleotide polymorphism
•Gene: gene recordsUnifies LocusLink and
Microbial Genomes
![Page 38: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/38.jpg)
NC
BI F
ield
G
uid
e
NCBI Structures and Domains
![Page 39: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/39.jpg)
NC
BI F
ield
G
uid
e
MMDB: MMolecular MModeling Data Base
• Derived from experimentally determined PDB records• Value added to PDB records including:
– Addition of explicit chemical graph information– Validation (secondary structure elements)– Inclusion of Taxonomy, Citation – Conversion to ASN.1 data description language
• Structure neighbors determined by
Vector Alignment Search Tool (VAST)
![Page 40: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/40.jpg)
NC
BI F
ield
G
uid
e
Cn3D 4.1: Bacillus thuringiensis Toxin
![Page 41: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/41.jpg)
NC
BI F
ield
G
uid
e
VAST: Structure NeighborsVector Alignment Search Tool
For each protein chain,
locate SSEs (secondarystructure elements),
and represent them asindividual vectors. 1
2
3
4
5 6
Human IL-4
IL-4 &Leptinalign the vectors
![Page 42: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/42.jpg)
NC
BI F
ield
G
uid
e
Protein Domains
• Structural Domain– Discrete independently folding unit of a protein
• Conserved Domain (sequence-based)– Protein region with recognizable position-specific
pattern of sequence conservation
• Sequence-based domains often roughly correspond to structural domains
• Domains often have distinct, identifiable functions
![Page 43: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/43.jpg)
NC
BI F
ield
G
uid
e
NCBI’s Conserved Domain Database
• PSI-BLAST –based score matrices
• Searchable with RPS-BLAST
• Sources – SMART– PFAM– COGs– NCBI curated domains
• structure informed alignments
![Page 44: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/44.jpg)
NC
BI F
ield
G
uid
e
Src Domains
Four 3d domainsThree conserved domainsFour 3d domainsThree conserved domains
![Page 45: NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases](https://reader036.vdocuments.us/reader036/viewer/2022081503/56649d245503460f949fada0/html5/thumbnails/45.jpg)
NC
BI F
ield
G
uid
e
Structure vs Conserved Domain
SH2
SH3
TyrKC
SH2
Conserved phosphotyrosine binding residuesConserved phosphotyrosine binding residues
Cn3DCn3D