NCBI FieldGuide
NCBI Molecular Biology Resources
A Field Guide, part 2
August 2-3, 2005
Web Access
Entrez: Text
BLAST: Sequence
VAST: Structure
Why do we need similarity searching?
To identify and annotate sequences with…
• incomplete (or no) annotations (GenBank)
• incorrect annotations
To assemble genomes
To explore evolutionary relationships by…
• finding homologous molecules
• developing phylogenetic trees
NOTE: Similar sequences may NOT have similar function!
Searching with Sequences
Basic Local Alignment Search Tool
• Widely used similarity search tool
• Heuristic approach based on the Smith-Waterman algorithm
• Finds best local alignments
• Provides statistical significance
• All combinations (DNA/Protein) of query and database
– DNA vs DNA
– DNA translation vs Protein
– Protein vs Protein
– Protein vs DNA translation
– DNA translation vs DNA translation
• www, standalone, and network clients
Global vs Local Alignment
[Figure: Seq 1 and Seq 2 shown twice, once as a global alignment and once as a local alignment]
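The difference between the two modes can be sketched with textbook dynamic programming. This is only an illustration, not BLAST itself (BLAST is a heuristic): the function name `align_score` and the +1/-1/-1 scores for match, mismatch, and gap are assumptions chosen for the example.

```python
def align_score(a, b, local, match=1, mismatch=-1, gap=-1):
    """Best alignment score of a and b by dynamic programming.

    local=False: global alignment (Needleman-Wunsch), end to end.
    local=True:  local alignment (Smith-Waterman), best matching subregion.
    """
    # Row 0: a global alignment pays for leading gaps, a local one does not.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score = max(prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                        prev[j] + gap,       # gap in b
                        cur[j - 1] + gap)    # gap in a
            if local:
                score = max(score, 0)        # a local alignment may restart anywhere
                best = max(best, score)
            cur.append(score)
        prev = cur
    return best if local else prev[-1]


print(align_score("ACGT", "TTTTACGTTTTT", local=False))  # -4
print(align_score("ACGT", "TTTTACGTTTTT", local=True))   # 4
```

The short sequence matches a subregion of the long one perfectly, so the local score is high while the global score is dragged down by the eight gap positions needed to span the full length.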
Global vs. Local Alignment
Human: 15 IAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFH 84 +A + + + DL F K D+L I+ T+ W+ GR G IP+NYV + + +++ PW+ Worm: 63 VALFQYDARTDDDLSFKKDDILEILNDTQGDWWFARHKATGRTGYIPSNYVAREKSIES------QPWYF 125
Human: 85 GKITREQAERLLYPP--ETGLFLVRESTNYPGDYTLCVSCDGKVEHYRI-MYHASKLSIDEEVYFENLMQ 151 GK+ R AE+ L E G FLVR+S + D +L V + V+HYRI + H I F L Worm: 126 GKMRRIDAEKCLLHTLNEHGAFLVRDSESRQHDLSLSVRENDSVKHYRIQLDHGGYF-IARRRPFATLHD 194
Human: 152 LVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGN-KVA 220 L+ HY +ADGLC L P Y W ++ + ++L++ IG G+FG+V G + N VAWorm: 195 LIAHYQREADGLCVNLGAPCAKSEAPQTTTFTYDDQWEVDRRSVRLIRQIGAGQFGEVWEGRWNVNVPVA 264
Human: 221 VKCIK-NDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGD 289 VK +K A FLAEA +M +LRH L+ L V ++ + IVTE M + +L+ +L+ RGR Worm: 265 VKKLKAGTADPTDFLAEAQIMKKLRHPKLLSLYAVCTRDE-PILIVTELMQE-NLLTFLQRRGRQCQMPQ 332 Human: 290 CLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLT----KEASSTQDTG-KLPVKWTA 353 L++ S V M YLE NF+HRDLAARN+L++ K++DFGL KE TG + P+KWTA Worm: 333 -LVEISAQVAAGMAYLEEMNFIHRDLAARNILINNSLSVKIADFGLARILMKENEYEARTGARFPIKWTA 401
Human: 354 PEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWH 423 PEA +F+TKSDVWSFGILL EI +FGR+PYP + +V+ +V+ GY+M P GCP +Y++M+ CW Worm: 402 PEAANYNRFTTKSDVWSFGILLTEIVTFGRLPYPGMTNAEVLQQVDAGYRMPCPAGCPVTLYDIMQQCWR 471
Human: 424 LDAAMRPSFLQLREQLEHI 443 D RP+F L+ +LE +Worm: 472 SDPDKRPTFETLQWKLEDL 492
human M--------------SAIQ----------------------AAWPSGT------------ECIAKYNFHG M S .. AA SG. . .A ... .worm MGSCIGKEDPPPGATSPVHTSSTLGRESLPSHPRIPSIGPIAASSSGNTIDKNQNISQSANFVALFQYDA 1 20 40 60
440 450human REQLEHI--------KTHELHL . .:: . : ...worm QWKLEDLFNLDSSEYKEASINF 500
Align program (Lipman and Pearson)
BLASTp
Nucleotide Words
Query: GTACTGGACATGGACCCTACAGGAA
Word Size = 11
GTACTGGACAT
TACTGGACATG
ACTGGACATGG
CTGGACATGGA
TGGACATGGAC
GGACATGGACC
GACATGGACCC
ACATGGACCCT
...........
Make a lookup table of words
Minimum word size = 7
blastn default = 11
megablast default = 28
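The word list and lookup table on this slide can be sketched as follows. The function names `build_lookup` and `find_seeds` and the subject sequence are hypothetical; the real BLAST lookup table is a more compact structure, so treat this as an illustration of the idea only.

```python
from collections import defaultdict


def build_lookup(query, word_size=11):
    """Map every overlapping word of the query to its start offsets."""
    table = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        table[query[i:i + word_size]].append(i)
    return table


def find_seeds(table, subject, word_size=11):
    """Return (query_offset, subject_offset) for each exact word match."""
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in table.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds


QUERY = "GTACTGGACATGGACCCTACAGGAA"
table = build_lookup(QUERY)
print(list(table)[:4])
# ['GTACTGGACAT', 'TACTGGACATG', 'ACTGGACATGG', 'CTGGACATGGA']

subject = "TTTT" + "CTGGACATGGA" + "TTTT"  # hypothetical database sequence
print(find_seeds(table, subject))  # [(3, 4)]
```

Each seed is a starting point that would then be extended in both directions into a longer alignment.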
Protein Words
Query: GTQITVEDLFYNIATRRKALKN
Word Size = 3
Neighborhood Words
LTV, MTV, ISV, LSV, etc.
GTQ
TQI
QIT
ITV
TVE
VED
EDL
DLF
...
Make a lookup table of words
Word Size can be 2 or 3 (default = 3)
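A minimal sketch of protein words and neighborhood words, assuming BLOSUM62 scoring. Only the three BLOSUM62 columns needed to score the word ITV are included, and the threshold of 11 is an assumption (the slide does not state the cutoff used for its list); the names `words`, `word_score`, and `neighborhood` are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 scores against I, T and V only (enough to score the word "ITV"),
# listed in AMINO_ACIDS order.
BLOSUM62_VS = {
    "I": [-1, -3, -3, -3, -1, -3, -3, -4, -3, 4, 2, -3, 1, 0, -3, -2, -1, -3, -1, 3],
    "T": [0, -1, 0, -1, -1, -1, -1, -2, -2, -1, -1, -1, -1, -2, -1, 1, 5, -2, -2, 0],
    "V": [0, -3, -3, -3, -1, -2, -2, -3, -3, 3, 1, -2, 1, -1, -2, -2, 0, -3, -1, 4],
}


def words(seq, size=3):
    """All overlapping words of the given size."""
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]


def word_score(query_word, candidate):
    """Sum of substitution scores, position by position."""
    return sum(BLOSUM62_VS[q][AMINO_ACIDS.index(c)]
               for q, c in zip(query_word, candidate))


def neighborhood(query_word, threshold):
    """Every word scoring at least `threshold` against the query word."""
    hits = []
    for letters in product(AMINO_ACIDS, repeat=len(query_word)):
        candidate = "".join(letters)
        score = word_score(query_word, candidate)
        if score >= threshold:
            hits.append((candidate, score))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


print(words("GTQITVEDLFYNIATRRKALKN")[:8])
# ['GTQ', 'TQI', 'QIT', 'ITV', 'TVE', 'VED', 'EDL', 'DLF']
print(neighborhood("ITV", threshold=11))
# [('ITV', 13), ('ITI', 12), ('VTV', 12), ('LTV', 11), ('VTI', 11)]
print([word_score("ITV", w) for w in ("LTV", "MTV", "ISV", "LSV")])
# [11, 10, 9, 7]
```

Lowering the threshold admits more of the slide's example words (MTV, ISV, LSV) at the cost of more seeds to extend.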
Initial Matches and Extensions
Protein BLAST requires two neighboring matches within 40 aa
GTQITVEDLFYNI
<---- SEI YYN ---->
ATCGCCATGCTTAATTGGGCTT
<--- CATGCTTAATT ----->
neighborhood words
exact word match one match
two matches
Nucleotide BLAST requires one exact match
An alignment that BLAST can’t find
1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG || | || || || | || || || || | ||| |||||| | | || | ||| |
1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG
61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
| || || || ||| || | |||||| || | |||||| ||||| | |
61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT
121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
|||| || ||||| || || | | |||| || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
An Alignment BLAST Can Make
Solution: compare protein sequences; BLASTX
BLAST 2 Sequences (blastx) output:
Score = 290 bits (741), Expect = 7e-77
Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)
Frame = +3
Scoring Systems - Nucleotides
     A    G    C    T
A   +1   -3   -3   -3
G   -3   +1   -3   -3
C   -3   -3   +1   -3
T   -3   -3   -3   +1
Identity matrix
CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||    raw score = 19 - 9 = 10
CACGTAGCAAGCTTG-GTGTCA
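The raw score on this slide can be reproduced with a few lines. The +1/-3 match and mismatch values come from the matrix; the gap cost of -3 is an assumption inferred from the slide's arithmetic (19 matches minus 9), and `raw_score` is a hypothetical helper name.

```python
def raw_score(top, bottom, match=1, mismatch=-3, gap=-3):
    """Sum the column scores of an already aligned pair of sequences."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += gap          # assumed cost of one gap position
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


print(raw_score("CAGGTAGCAAGCTTGCATGTCA",
                "CACGTAGCAAGCTTG-GTGTCA"))  # 10
```

The alignment has 19 matches, 2 mismatches, and 1 gap position, giving 19 - 6 - 3 = 10.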
Scoring Systems - Proteins
Position Independent Matrices
PAM Matrices (Percent Accepted Mutation)
• Derived from observation; small dataset of alignments
• Implicit model of evolution
• All calculated from PAM1
• PAM250 widely used
BLOSUM Matrices (BLOck SUbstitution Matrices)
• Derived from observation; large dataset of highly conserved blocks
• Each matrix derived separately from blocks with a defined percent identity cutoff
• BLOSUM62 - default matrix for BLAST
Position Specific Score Matrices (PSSMs)
PSI- and RPS-BLAST
BLOSUM62
A   4
R  -1  5
N  -2  0  6
D  -2 -2  1  6
C   0 -3 -3 -3  9
Q  -1  1  0  0 -3  5
E  -1  0  0  2 -4  2  5
G   0 -2  0 -1 -3 -2 -2  6
H  -2  0  1 -1 -3  0  0 -2  8
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X   0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X
Common amino acids have low weights
Rare amino acids have high weights
Negative for less likely substitutions
Positive for more likely substitutions
Gapped Alignments
• Gapping provides more biologically realistic alignments
• Statistical behavior is not completely understood for gapped alignments
• Gapped BLAST parameters must be found by simulations for each matrix
• Affine gap costs = -(a+bk)
a = gap open penalty
b = gap extend penalty
k = gap length
A gap of length 1 receives the score -(a+b)
Scores
            V    D    S    –    C    Y
            V    E    T    L    C    F
BLOSUM62   +4   +2   +1  -12   +9   +3   = 7
PAM30      +7   +2    0  -10  +10   +2   = 11
Simply add the scores for each pair of aligned residues
Different matrices produce different scores!
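The BLOSUM62 total on this slide can be recomputed as a sketch. The dictionary holds only the five BLOSUM62 entries this alignment needs, and the gap parameters a = 11, b = 1 are an assumption chosen because they reproduce the slide's -12 for a gap of length 1; the helper names are hypothetical.

```python
# BLOSUM62 entries needed for this example only (keys are sorted pairs).
BLOSUM62_PAIRS = {
    ("V", "V"): 4,
    ("D", "E"): 2,
    ("S", "T"): 1,
    ("C", "C"): 9,
    ("F", "Y"): 3,
}


def gap_cost(length, gap_open=11, gap_extend=1):
    """Affine gap score -(a + b*k) for a gap of length k."""
    return -(gap_open + gap_extend * length)


def alignment_score(top, bottom):
    """Score an aligned pair; '-' marks a gap position.

    Simplification: consecutive gap columns count as one gap,
    regardless of which sequence carries the '-'.
    """
    score, gap_len = 0, 0
    for x, y in zip(top, bottom):
        if "-" in (x, y):
            gap_len += 1
            continue
        if gap_len:
            score += gap_cost(gap_len)
            gap_len = 0
        score += BLOSUM62_PAIRS[tuple(sorted((x, y)))]
    if gap_len:
        score += gap_cost(gap_len)
    return score


print(alignment_score("VDS-CY", "VETLCF"))  # 7
print(gap_cost(1), gap_cost(3))             # -12 -14
```

Because the open penalty is paid once, a gap of length 3 costs -14 rather than three times -12, which is the point of affine gap costs.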
Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution
[Figure: number of alignments (y-axis) plotted against score (x-axis)]
(applies to ungapped alignments)
$E = K m n e^{-\lambda S}$
$E = m n 2^{-S'}$
K = scale for search space
λ = scale for scoring system
S' = bit score = $(\lambda S - \ln K)/\ln 2$
Expect Value
E = number of database hits you expect to find by chance
mn = size of database
S' = your score
E = expected number of random hits
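The relationship between raw score, bit score, and Expect value can be checked numerically. In this sketch the values λ = 0.267 and K = 0.041 are assumptions (figures commonly reported for BLOSUM62 with gap costs 11/1), and the search-space sizes m and n are made up; the raw score 741 is the one from the blastx output earlier in this deck.

```python
import math


def bit_score(raw, lam, k):
    """S' = (lambda*S - ln K) / ln 2"""
    return (lam * raw - math.log(k)) / math.log(2)


def expect_from_raw(raw, m, n, lam, k):
    """E = K * m * n * exp(-lambda * S)"""
    return k * m * n * math.exp(-lam * raw)


def expect_from_bits(bits, m, n):
    """E = m * n * 2^(-S')"""
    return m * n * 2 ** (-bits)


LAMBDA, K = 0.267, 0.041   # assumed scoring-system parameters
m, n = 331, 1_000_000      # hypothetical query and database lengths

bits = bit_score(741, LAMBDA, K)
print(round(bits))  # 290
print(math.isclose(expect_from_raw(741, m, n, LAMBDA, K),
                   expect_from_bits(bits, m, n)))  # True
```

Both forms of the equation give the same E; the bit score simply folds λ and K into the score so that results from different scoring systems are comparable.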
Advanced BLAST Options: Nucleotide
Example Entrez Queries
nucleotide all[Filter] NOT mammalia[Organism]
green plants[Organism]
biomol mrna[Properties]
gbdiv est[Properties] AND rat[organism]
Other Advanced
-e 10000 expect value
-v 2000 descriptions
-b 2000 alignments
Advanced BLAST Options: Protein
Matrix Selection
• PAM30 -- most stringent
• BLOSUM45 -- least stringent
Example Entrez Queries
proteins all[Filter] NOT mammalia[Organism]
green plants[Organism]
srcdb refseq[Properties]
Other Advanced
-e 10000 expect value
-v 2000 descriptions
-b 2000 alignments
Limit by taxon
Mus musculus[Organism]
Mammalia[Organism]
Viridiplantae[Organism]
sp|P27476|NSR1_YEAST NUCLEAR LOCALIZATION SEQUENCE BINDING PROTEIN (P67)
Length = 414
Score = 40.2 bits (92), Expect = 0.013
Identities = 35/131 (26%), Positives = 56/131 (42%), Gaps = 4/131 (3%)
Query: 362 STTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDAFLQPLSKPLS---SQPQAIVTEDKTD 418 S++S SSS+S SS + + ++S + + S S S+ + E K Sbjct: 29 SSSSSESSSSSSSSSESESESESESESSSSSSSSDSESSSSSSSDSESEAETKKEESKDS 88
Filtered
Unfiltered
Low Complexity Filtering
>gi|20140146|sp|Q96RF0|SNXI_HUMAN Sorting nexin 18
Length = 628
Score = 1048 bits (2710), Expect = 0.0
Identities = 528/628 (84%), Positives = 528/628 (84%)
Query: 1 MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIR 60 MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIRSbjct: 1 MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIR 60
Query: 61 XXXXXXXXXXXXXXXXXXXNVPPGGFEXXXXXXXXXXXXXXXXXXXXXXXXXXXXXSTFQ 120 NVPPGGFE STFQSbjct: 61 APEPGPAGDGGPGAPARYANVPPGGFEPLPVAPPASFKPPPDAFQALLQPQQAPPPSTFQ 120
. . .
Low Complexity Filter
low complexity sequence
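The idea of masking low complexity sequence with X can be illustrated with a simple sliding-window Shannon entropy filter. This is a swapped-in technique for illustration, not the filter BLAST itself applies; the window of 12, the threshold of 2.2 bits, and the function names are all assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace every residue inside a low-entropy window with 'X'."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        # Entropy is always measured on the original, unmasked sequence.
        if shannon_entropy(seq[i:i + window]) < threshold:
            masked[i:i + window] = "X" * window
    return "".join(masked)


print(mask_low_complexity("S" * 20))                 # XXXXXXXXXXXXXXXXXXXX
print(mask_low_complexity("ACDEFGHIKLMNPQRSTVWY"))   # ACDEFGHIKLMNPQRSTVWY
```

Serine-rich stretches like the one in the NSR1_YEAST hit have very low entropy, which is why unfiltered searches pick up matches to them that carry little biological meaning.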
Neighbors: Precomputed BLAST
Nucleotide
Protein
Entrez Related Sequences produces a list of sequences sorted by BLAST score, but with no alignment details.
Blink – Protein BLAST Alignments
• Lists only 200 hits
• List is nonredundant
Blink – Best Hits
Megablast: NCBI’s Genome Annotator
• Long alignments of similar DNA sequences
• Greedy algorithm
• Concatenation of query sequences
• Faster than blastn; less sensitive
MegaBLAST
AI217550
AI251192
AI254381
BE645079
C:\seq\hs.4.fsa
> 1133045 gnl|UG|Hs#S1133045 qd43b11.x1 Homo sapiens cDNA, 3' end CATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGTTTGGTGAGAAGTGCTCGATTAGTTCAGACAACATCTGGCACTTGATGTCTGTCCTTCCCTCCTTTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAAGGTGACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACACCGTCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAAAACCACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGC> 1141828 gnl|UG|Hs#S1141828 qv37f11.x1 Homo sapiens cDNA, 3' end GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGTGCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATACATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAAGTCGTATCGATGT> 1145899 gnl|UG|Hs#S1145899 qv33c06.x1 Homo sapiens cDNA, 3' endGAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGTGCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATACATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAAGTCGTATCGATGT> 2291670 gnl|UG|Hs#S2291670 7e65f04.x1 Homo sapiens cDNA, 3' end TTTCATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGTTTGGTGAGAAGTGCTCGATTAGTTCAAACAACATCTGGCACTTGATGTCTGTCCTTCCCTCCTTTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAAGGTGACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACACCGTCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAAAACCACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGCCTCCCAACCGCATTCCTGCCTGTGTAGCAGGCGGTGAGCACCCAGAAGGGGCACATACCTCTCCAAGCCTTGAAAGCAAAGCATGGAGATCTACAAAAATAGGATTTCCACTTGGAGAAATGTCGCTGGGACAGT
Discontiguous Megablast
• Uses discontiguous word matches
• Better for cross-species comparisons
Templates for Discontiguous MegaBLAST
W = 11, t = 16, coding: 1101101101101101
W = 11, t = 16, non-coding: 1110010110110111
W = 12, t = 16, coding: 1111101101101101
W = 12, t = 16, non-coding: 1110110110110111
W = 11, t = 18, coding: 101101100101101101
W = 11, t = 18, non-coding: 111010010110010111
W = 12, t = 18, coding: 101101101101101101
W = 12, t = 18, non-coding: 111010110010110111
W = 11, t = 21, coding: 100101100101100101101
W = 11, t = 21, non-coding: 111010010100010010111
W = 12, t = 21, coding: 100101101101100101101
W = 12, t = 21, non-coding: 111010010110010010111
Ma, B., Tromp, J., Li, M., "PatternHunter: faster and more sensitive homology search", Bioinformatics 2002 Mar;18(3):440-5
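A sketch of how a discontiguous template can be applied, assuming the usual spaced-seed reading in which a 1 marks a position that must match and a 0 marks a position that is ignored. The template is the first one in the list above; the function names and the two 16-base sequences are hypothetical.

```python
CODING_W11_T16 = "1101101101101101"  # W = 11, t = 16, coding


def spaced_word(window, template):
    """Keep only the bases that sit under a '1' in the template."""
    if len(window) != len(template):
        raise ValueError("window and template must have the same length")
    return "".join(base for base, flag in zip(window, template) if flag == "1")


def spaced_words(seq, template):
    """Spaced word for every window of template length along the sequence."""
    t = len(template)
    return [spaced_word(seq[i:i + t], template) for i in range(len(seq) - t + 1)]


a = "GTACTGGACATGGACC"
b = "GTGCTTGAAATGGACC"  # differs from a at offsets 2, 5 and 8, all ignored positions

print(CODING_W11_T16.count("1"), len(CODING_W11_T16))   # 11 16
print(spaced_word(a, CODING_W11_T16))                   # GTCTGAATGAC
print(spaced_word(a, CODING_W11_T16) == spaced_word(b, CODING_W11_T16))  # True
```

The two sequences share no contiguous 11-mer, yet they produce the same spaced word, which is why this kind of seed tolerates third-position differences in coding regions across species.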
Nucleotide vs. Protein BLAST
       aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc
Human:  N  R  V  T  V  V  L  G  A  Q  W  G  D  E  G
        +  +  V  +     V  L  G     Q  W  G  D  E  G
A.th.:  S  Q  V  S  G  V  L  G  C  Q  W  G  D  E  G
       agtcaagtatctggtgtactcggttgccaatggggagatgaaggt
Comparing ADSS from H. sapiens and A. thaliana
BLASTp finds three matching words
BLASTn finds no match, because the two sequences share no identical 7 bp words
Protein searches are generally more sensitive than nucleotide searches.
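The ADSS comparison above can be reproduced in a short sketch: translate both DNA fragments with the standard genetic code, then compare word sharing at the DNA level with identity at the protein level. The helper names are hypothetical; the sequences are the two fragments shown on the slide.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate frame +1 of a DNA string into protein."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def shared_words(a, b, size):
    """Exact words of the given size that occur in both sequences."""
    words_a = {a[i:i + size] for i in range(len(a) - size + 1)}
    words_b = {b[i:i + size] for i in range(len(b) - size + 1)}
    return words_a & words_b


HUMAN = "aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc"
ATH = "agtcaagtatctggtgtactcggttgccaatggggagatgaaggt"

human_protein, ath_protein = translate(HUMAN), translate(ATH)
print(human_protein)                 # NRVTVVLGAQWGDEG
print(ath_protein)                   # SQVSGVLGCQWGDEG
print(shared_words(HUMAN, ATH, 7))   # set()
print(sum(x == y for x, y in zip(human_protein, ath_protein)))  # 10
```

Ten of the fifteen residues are identical even though the DNA has no exact 7 bp word in common, because many of the nucleotide differences fall on synonymous codon positions.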
Translated BLAST
Program    Query                      Database
blastx     Nucleotide (translated)    Protein
tblastn    Protein                    Nucleotide (translated)
tblastx    Nucleotide (translated)    Nucleotide (translated)
Particularly useful for nucleotide sequences without protein annotations, such as ESTs or genomic DNA
Genomic BLAST
• These pages provide customized nucleotide and protein databases for each genome
• If a Map Viewer is available, the BLAST hits can be viewed on the maps
BLAST the Chicken Genome
Program
Accession for human TPO mRNA
BLAST Hit on the Genome
BLASTn Hit on the Map Viewer
TBLASTN Results Using NP_000538
Linking Protein Sequence, Structure, and Function
sequence function (pfam, smart)
Structure
PSI-BLAST
RPS-BLAST
VAST
BLASTp sequence structure
structure structure
sequence structure + function (cd)
Position Specific Substitution Rates
Active site serine
Weakly conserved serine
Position Specific Score Matrix (PSSM)
         A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
206 D    0 -2  0  2 -4  2  4 -4 -3 -5 -4  0 -2 -6  1  0 -1 -6 -4 -1
207 G   -2 -1  0 -2 -4 -3 -3  6 -4 -5 -5  0 -2 -3 -2 -2 -1  0 -6 -5
208 V   -1  1 -3 -3 -5 -1 -2  6 -1 -4 -5  1 -5 -6 -4  0 -2 -6 -4 -2
209 I   -3  3 -3 -4 -6  0 -1 -4 -1  2 -4  6 -2 -5 -5 -3  0 -1 -4  0
210 S   -2 -5  0  8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5  1 -3 -7 -5 -6
211 S    4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1  4  3 -6 -5 -3
212 C   -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5  0 -7 -4 -4 -5  0 -4
213 N   -2  0  2 -1 -6  7  0 -2  0 -6 -4  2  0 -2 -5 -1 -3 -3 -4 -3
214 G   -2 -3 -3 -4 -4 -4 -5  7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6
215 D   -5 -5 -2  9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7
216 S   -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4  7 -2 -6 -5 -5
217 G   -3 -6 -4 -5 -6 -5 -6  8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7
218 G   -3 -6 -4 -5 -6 -5 -6  8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7
219 P   -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7  9 -4 -4 -7 -7 -6
220 L   -4 -6 -7 -7 -5 -5 -6 -7  0 -1  6 -6  1  0 -6 -6 -5 -5 -4  0
221 N   -1 -6  0 -6 -4 -4 -6 -6 -1  3  0 -5  4 -3 -6 -2 -1 -6 -1  6
222 C    0 -4 -5 -5 10 -2 -5 -5  1 -1 -1 -5  0 -1 -4 -1  0 -5  0  0
223 Q    0  1  4  2 -5  2  0  0  0 -4 -2  1  0  0  0 -1 -1 -3 -3 -4
224 A   -1 -1  1  3 -4 -1  1  4 -3 -4 -3 -1 -2 -2 -3  0 -2 -2 -2 -3
Serine is scored differently in these two positions
Active site nucleophile
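A small lookup makes the position dependence concrete. The two rows are copied from the PSSM above (positions 211 and 216, both serine in the query); `pssm_score` is a hypothetical helper, and reading position 216 as the active site serine is an inference from its place in the G-D-S-G-G-P stretch.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Rows for two query serines, in AMINO_ACIDS column order.
PSSM = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
}


def pssm_score(position, residue):
    """Score for placing `residue` at the given query position."""
    return PSSM[position][AMINO_ACIDS.index(residue)]


for residue in "SAT":
    print(residue, pssm_score(211, residue), pssm_score(216, residue))
# S 4 7
# A 4 -2
# T 3 -2
```

At position 211 alanine and threonine score about as well as serine, while at position 216 only serine is rewarded; a position independent matrix such as BLOSUM62 would give serine the same score at both positions.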
PSI-BLAST
Create your own PSSM: Confirming relationships of purine nucleotide metabolism proteins
[Diagram: query + BLOSUM62 → alignment → PSSM → alignment]
PSI BLAST
>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE AMINOH
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFLAKFDYYVIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDLVNQGLQEQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYEGAVKNGRTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAWDPKTTHVRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKKELLERLY
e value cutoff for PSSM
PSI Results: Initial BLAST Run
First PSSM Search
Other purine nucleotide metabolizing enzymes not found by ordinary BLAST
Third PSSM Search: Convergence
Just below threshold, another nucleotide metabolism enzyme
Entrez Domains (CDD)
Pfam (Sanger), pfam01234: Pfam-A seeds, HMM based models representing a wide variety of functional domains derived from SWISS-PROT
SMART (EMBL), smart00123: HMM based models originally concentrating on eukaryotic signaling domains, now expanding
COG (NCBI), COG0123: BLAST based alignments derived from complete proteomes of prokaryotes
CD (NCBI), cd01234: NCBI curated domains based on sequence and structural alignments
Single Domains
Protein Families
Protein Links: Domains
Results of a CD-Search
CD
SMART
Pfam
Click on a colored bar to align your sequence to the CD
CDD Record – heme peroxidases
aligned query
red = high conservation
blue = low conservation
Curated CD Record
Curated CDs (cd12345) are based on sequence and structure alignments
Annotated features
Structural evidence
aligned query
Blink: Sequence to Structure
related structures
Related Structures
Cn3D
Entrez Structure
MMDB: Molecular Modeling Data Base
• Derived from experimentally determined PDB records
• Add value to PDB records by:
– Adding explicit chemical bonding information
– Validating and indexing the sequences
– Annotating 3D domains and secondary structure
– Adding links to CDD, Taxonomy, PubMed
– Converting PDB data to ASN.1
• Structure neighbors determined by Vector Alignment Search Tool (VAST)
Structure
Structure Summary Page
Conserved Domains
VAST Neighbors for chain C (domain 0)
Cn3D
VAST Neighbors for domain 2
VAST: Structure Neighbors
Vector Alignment Search Tool
For each 3D domain, locate SSEs (secondary structure elements), and represent them as individual vectors.
[Figure: Human IL-4 with its SSE vectors numbered 1 to 6]
VAST uses 3D Domains only!
Whole polypeptides are assigned 3D domain 0 (zero).
VAST Neighbors
1D2V
1D2V
1Q4G
3D domains!
Viewing a VAST Alignment
RMSD in Angstroms
Sequence percent identity
VAST P value
Cn3D
Submitting a PDB File to VAST
• Redesigned interface!
• This is the best way to convert PDB into MMDB format!
New!
Entrez PubChem
PC Substance: Primary database of chemical samples
PC Compound: Derived database of known chemicals from PC Substance records
PC BioAssay: Primary database of bioactivity screens of samples in PC Substance
New!
Links from Structure
N-acetylglucosamine
heme
mannose
fucose
Search for thyroxine
ChemID      24
KEGG         4
DTP-NCI      3
NIST         3
Biocyc       2
BIND         2
Chembank     2
NIAID        1
TOTAL       41
Sequence Polymorphisms
SNP: General Polymorphisms
• Primary database of submitted SNPs
• Curated database of reference SNPs
• Contains more than just SNPs:
– True SNPs
– MNP (multiple nucleotide)
– Insertions
– Deletions
– Microsatellites
– Mixed
– No variation (constant)
OMIM: Human Phenotypes
• Clinical literature database
• Curated at Johns Hopkins Univ
• Links human genes and genetic disorders to human disease
• Lists allelic variants that have clinical consequences
Variations in SNP are not necessarily in OMIM, and vice versa!
Linking to SNP
Links to SNP are also available from Nucleotide and Protein
Entrez Gene - TPO
Entrez SNP
primary data: ss#
SNP UID: rs#
Find Non-synonymous SNPs
#7 AND coding nonsynon[Function Class]
Function Class
Non-synonymous TPO SNPs
Link to Map Viewer
View all SNPs in locus
Link to related 3D structures
GeneView in dbSNP
Links to OMIM
Links to SNP are also available from Nucleotide and Protein
Entrez Gene - TPO
OMIM Record
Explore a Disease SNP
799
Curated CD Record
E799
Cn3D
For More Information…
E-mail addresses
• General Help: info@ncbi.nlm.nih.gov
• BLAST: blast-help@ncbi.nlm.nih.gov
The NCBI Education Page
http://www.ncbi.nih.gov/Education/index.html
The (free!) NCBI Newsletter
http://www.ncbi.nih.gov/About/newsletter.html
The NCBI Handbook
Follow the link from the NCBI Home Page