ncbi fieldguide ncbi molecular biology resources a field guide part 2 september 30, 2004 icgeb
TRANSCRIPT
NCBI FieldGuide
NCBI Molecular Biology Resources
A Field Guide, part 2
September 30, 2004 ICGEB
Links Between and Within Nodes
[Figure: Entrez nodes (PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structures, Genomes, Taxonomy) joined by computational links: Word weight, BLAST, VAST, Phylogeny]
[Figure: three kinds of neighbors: Text (Pubmed), Sequence (BLAST), Structure (VAST)]
Pubmed: Computation of Related Articles
The neighbors of a document are those documents in the database that are the most similar to it. The similarity between documents is measured by the words they have in common, with some adjustment for document lengths.
The weight of a term depends on two types of information, global and local:
1) global: the number of different documents in the database that contain the term;
2) local: the number of times the term occurs in a particular document.
Global and local weights
• The global weight of a term is greater for the less frequent terms. The presence of a term that occurred in most of the documents would really tell one very little about a document. On the other hand, a term that occurred in only 100 documents of one million would be very helpful in limiting the set of documents of interest.
• The local weight of a term is the measure of its importance in a particular document. Generally, the more frequent a term is within a document, the more important it is in representing the content of that document. However, this relationship is saturating, i.e., as the frequency continues to go up, the importance of the word increases less rapidly and finally comes to a finite limit.
How we define similar documents
• The similarity between two documents is computed by adding up the weights (local wt1 × local wt2 × global wt) of all of the terms the two documents have in common. This provides an indication of how related two documents are.
• Once the similarity score of a document in relation to each of the other documents in the database has been computed, that document's neighbors are identified as the most similar (highest scoring) documents found. These closely related documents are pre-computed for each document in PubMed.
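A minimal Python sketch of the scoring described above. The slides give only the qualitative behavior of the weights, so the formulas are assumptions: `global_weight` is a logarithmic inverse document frequency and `local_weight` is a saturating tf/(tf + k) curve. Neither is claimed to be the formula PubMed uses, and the toy corpus is hypothetical.

```python
import math
from collections import Counter

def global_weight(doc_freq, n_docs):
    # Assumed form: the rarer the term, the larger the weight.
    return math.log(n_docs / doc_freq)

def local_weight(term_count, k=1.0):
    # Assumed saturating form: grows with term_count but stays below 1.
    return term_count / (term_count + k)

def similarity(doc1, doc2, doc_freq, n_docs):
    # Sum of local wt1 x local wt2 x global wt over the terms both documents share.
    c1, c2 = Counter(doc1), Counter(doc2)
    return sum(
        local_weight(c1[t]) * local_weight(c2[t]) * global_weight(doc_freq[t], n_docs)
        for t in c1.keys() & c2.keys()
    )

def neighbors(query, docs, doc_freq, n_docs, top=2):
    # Rank every other document by its similarity to `query`, highest first.
    scored = [(similarity(docs[query], d, doc_freq, n_docs), name)
              for name, d in docs.items() if name != query]
    return [name for _, name in sorted(scored, reverse=True)[:top]]

# Toy corpus (hypothetical data, for illustration only).
docs = {
    "a": "prestin outer hair cell motor protein".split(),
    "b": "prestin motor protein of cochlear outer hair cells".split(),
    "c": "the protein kinase domain".split(),
}
n_docs = len(docs)
doc_freq = Counter(t for d in docs.values() for t in set(d))
print(neighbors("a", docs, doc_freq, n_docs))
```

In this toy corpus the word "protein" occurs in every document, so its global weight is zero and it contributes nothing to any score.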
Related articles: difficult task
E-utilities: Top Level of Entrez
E-utilities course
E-utilities
• A set of seven server-side programs.
• Support a uniform URL syntax.
• Translate a standard set of URL-encoded input parameters for the array of programs comprising the Entrez system.
Entrez Functions and E-utilities
• Searches: esearch.fcgi
• DocSums: esummary.fcgi
• Links: elink.fcgi
• Uploads: epost.fcgi
• Downloads: efetch.fcgi
• Global Query: egquery.fcgi
• Information: einfo.fcgi
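The uniform URL syntax can be sketched in a few lines of Python. The base URL is the one used in the Perl pipeline of this deck; the helper `eutil_url` and the keys of `PROGRAMS` are hypothetical names chosen for illustration, and no request is sent.

```python
from urllib.parse import urlencode

EBASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"  # base URL used in this deck

# Entrez function -> E-utility program, as listed on the slide.
PROGRAMS = {
    "search": "esearch.fcgi",
    "docsum": "esummary.fcgi",
    "link": "elink.fcgi",
    "upload": "epost.fcgi",
    "download": "efetch.fcgi",
    "global_query": "egquery.fcgi",
    "info": "einfo.fcgi",
}

def eutil_url(function, **params):
    # Every program takes its input as URL-encoded parameters after '?'.
    return EBASE + PROGRAMS[function] + "?" + urlencode(params)

print(eutil_url("search", db="pubmed", term="prestin"))
```

Because all seven programs share one syntax, switching from a search to a DocSum request only changes the program name and the parameters.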
A Docsum via esummary.fcgi and via the Web
A Simple Eutilities Pipeline
Search for upstream regions of homologous genes
#!/usr/local/bin/perl
# the line above says where Perl is located
use LWP::Simple;   # LWP::Simple's get() fetches the content of a URL
$ebase = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/";   # base URL; details are appended below
while (<>) {   # read gene names, one per line; the file name is given on the command line
    chomp;
    $gene = $_;
    $term = $gene . "[gene+name]+AND+human[orgn]";   # we are interested in human genes only

    #1. Search in Homologene
    $url = $ebase . "esearch.fcgi?db=homologene&term=$term";   # search HomoloGene with the gene name
    $result = get($url);   # download the content of the corresponding URL
    $id = "";   # start with an empty list for each gene
    while ($result =~ /<Id>(\d+)<\/Id>/sg) { $id .= "$1,"; }   # read the IDs from the Id lines, comma-delimited
    chop $id;   # drop the trailing comma

    #2. Link Homologene -> Nucleotide
    $url = $ebase . "elink.fcgi?db=nucleotide&id=$id&dbfrom=homologene";   # link to Nucleotide to get the homolog NM gi's
    $result = get($url);
    $id = "";
    while ($result =~ /<Link>[^<]+<Id>(\d+)<\/Id>/sg) { $id .= "$1,"; }
    chop $id;

    #3. Link Nucleotide -> Gene
Lots of precomputed data and a little bit of parsing
    $url = $ebase . "elink.fcgi?db=gene&id=$id&dbfrom=nucleotide";   # link to Entrez Gene to get the genomic coordinates
    $result = get($url);
    @ids = ();   # start with an empty list for each input gene
    while ($result =~ /<Link>[^<]+<Id>(\d+)<\/Id>/sg) { push @ids, $1; }
    print "@ids\n";
    foreach $id (@ids) {   # for each linked Gene record

        #4. Fetch XML document with gene information from Gene
        $url = $ebase . "efetch.fcgi?db=gene&id=$id&retmode=xml";   # the gene report gives the genomic sequence and coordinates
        $result = get($url);
        $result =~ /<Gene-commentary_type value=.genomic.>.+?<Seq-id_gi>(\d+)/s;
        $id = $1;   # gi of the genomic sequence
        $result =~ /<Seq-interval_from>(\d+)/;  $from = $1;
        $result =~ /<Seq-interval_to>(\d+)/;    $to = $1;
        $result =~ /<Na-strand value="(\w+)"/;  $strand = $1;
        if ($strand eq "minus") { $strand = 2; } else { $strand = 1; }
        # take the 1000 bases upstream of the gene on the appropriate strand
        if ($strand == 1) { $to = $from; $from -= 1000; } else { $from = $to; $to += 1000; }

        #5. Fetch upstream sequence from Nucleotide
        $url = $ebase . "efetch.fcgi?db=nucleotide&id=$id&retmode=text&rettype=fasta&seq_start=$from&seq_stop=$to&strand=$strand";
        $result = get($url);   # fetch the sequence
        $result =~ s/>ref/>lcl|$gene|/;   # label the FASTA header with the gene name
        print "$result";
    }
}
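The coordinate arithmetic of step 4 is the part of the pipeline most easily gotten wrong, so it is worth checking on its own. The Python sketch below mirrors the Perl logic; the function name `upstream_window` and the example coordinates are hypothetical.

```python
def upstream_window(seq_from, seq_to, strand, length=1000):
    # Mirrors step 4 of the Perl script. Returns (from, to, strand_code),
    # where strand_code is 1 for plus and 2 for minus, as passed to efetch.
    if strand == "minus":
        # On the minus strand the upstream region lies past the 'to' coordinate.
        return seq_to, seq_to + length, 2
    # On the plus strand it lies before the 'from' coordinate.
    return seq_from - length, seq_from, 1

print(upstream_window(5000, 9000, "plus"))
print(upstream_window(5000, 9000, "minus"))
```

As in the script, the window includes the first base of the gene itself and is not clipped at the start of the sequence; both are left unchanged here.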
A General Design Approach
• Know what you want before you begin
– Do I need the full record? (EFetch)
– Will a DocSum be sufficient? (ESummary)
• Know what Entrez database contains the data you want
– If it’s not in Entrez, the eUtils can’t access it
• Try your pipeline in interactive web Entrez first
– Some Entrez queries may surprise you
– Some Entrez data may surprise you
– Some Entrez links may surprise you
Others use E-utilities too: PubCrawler
MedBlast: searching for articles related to a sequence.
Why Regulate?
Fairness issue. The gate is only so wide. Scripts use the resources of many to satisfy a few.
Scripts are like “fat” bunnies!!!
Web Servers and Browsers
• Your browser makes one connection.
• Each server has a finite number of slots.
• A slot is allotted to a connection first come, first served.
• Connections are (typically) not persistent.
• Scripts use more slots, and their traffic approaches a “persistent” connection.
Normal Use
Scripting
Detection
• Weblogs are monitored by a script.
• Alarm e-mails are sent hourly, with a daily summary once a day.
• Analysis – copyright versus volume. Not automatic!
• Blocking occurs.
– Copyrighted material can be very light in volume.
– BLAST is “sensitive”; it can also be light in volume.
– Entrez and PubMed are mostly a “fairness” issue.
How you are blocked.
• The IP address is blacklisted from the main NCBI web servers.
• You get a very obvious error message.
• Remember Spock: “The needs of the many outweigh the needs of the few.”
How to avoid blockage.
• Plan your project.
• Can I use other methods?
– FTP
– Batch Entrez
• Write good scripts.
– Expect errors
– Multiple UIDs
• Follow the E-utils recommendations.
• Ask us for advice.
Recommendations.
• Use ‘eutils.ncbi.nlm.nih.gov’.
• Use the &tool and &email fields.
• Do not submit more than once every 3 seconds.
• Limit to 9 PM – 5 AM EST (our time).
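A minimal Python sketch of how a script might follow the first three recommendations. The class `Throttle`, the helper `polite_url`, and the example tool name and e-mail address are hypothetical; only the host name, the `tool` and `email` fields, and the 3-second interval come from the slide.

```python
import time
from urllib.parse import urlencode

class Throttle:
    """Allow at most one request per `min_interval` seconds."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock      # injectable so the logic can be tested without waiting
        self.sleep = sleep
        self.last = None        # time of the previous request, if any

    def wait(self):
        # Sleep just long enough that consecutive calls are min_interval apart.
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now

def polite_url(program, tool, email, **params):
    # Add the tool and email fields to every request.
    query = dict(params, tool=tool, email=email)
    return "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/" + program + "?" + urlencode(query)

throttle = Throttle()
print(polite_url("esearch.fcgi", "my_script", "me@example.org", db="pubmed", term="prestin"))
```

A script would call `throttle.wait()` immediately before each request, so the delay is enforced in one place rather than scattered through the code.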
[Figure: three kinds of neighbors: Text (Pubmed), Sequence (BLAST), Structure (VAST)]
BLAST®
Basic Local Alignment Search Tool
• Why align sequences? Because it is the best way to infer structure-function relationships for unknown biomolecules
• Global vs local alignments
• BLAST basics
• MegaBLAST
• Discontiguous MegaBLAST
Basic Local Alignment Search Tool
• Calculates similarity for biological sequences
• Finds best local alignments
• Heuristic approach based on Smith-Waterman algorithm
• Searches for matching “words” and then extends the hits
• Uses statistical theory to determine if a match might have occurred by chance
Global Alignment

H:  15 IAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFH 84
Worm:   63 VALFQYDARTDDDLSFKKDDILEILNDTQGDWWFARHKATGRTGYIPSNYVAREKSIES------QPWYF 125

H:  85 GKITREQAERLLYPP--ETGLFLVRESTNYPGDYTLCVSCDGKVEHYRI-MYHASKLSIDEEVYFENLMQ 151
Worm:  126 GKMRRIDAEKCLLHTLNEHGAFLVRDSESRQHDLSLSVRENDSVKHYRIQLDHGGYF-IARRRPFATLHD 194

H: 152 LVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGN-KVA 220
Worm:  195 LIAHYQREADGLCVNLGAPCAKSEAPQTTTFTYDDQWEVDRRSVRLIRQIGAGQFGEVWEGRWNVNVPVA 264

H: 221 VKCIK-NDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGD 289
Worm:  265 VKKLKAGTADPTDFLAEAQIMKKLRHPKLLSLYAVCTRDE-PILIVTELMQE-NLLTFLQRRGRQCQMPQ 332

H: 290 CLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLT----KEASSTQDTG-KLPVKWTA 353
Worm:  333 -LVEISAQVAAGMAYLEEMNFIHRDLAARNILINNSLSVKIADFGLARILMKENEYEARTGARFPIKWTA 401

H: 354 PEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWH 423
Worm:  402 PEAANYNRFTTKSDVWSFGILLTEIVTFGRLPYPGMTNAEVLQQVDAGYRMPCPAGCPVTLYDIMQQCWR 471

H: 424 LDAAMRPSFLQLREQLEHI 443
Worm:  472 SDPDKRPTFETLQWKLEDL 492

human M--------------SAIQ----------------------AAWPSGT------------ECIAKYNFHG
worm  MGSCIGKEDPPPGATSPVHTSSTLGRESLPSHPRIPSIGPIAASSSGNTIDKNQNISQSANFVALFQYDA
...
human REQLEHI--------KTHELHL
worm  QWKLEDLFNLDSSEYKEASINF
Align program (Lipman and Pearson)
How BLAST Works
Make a lookup table of all “words” in the query
Scan the database for matching words
Initiate extensions from these matches
Words
Make a lookup table of words
Query: GTQITVEDLFYNIATRRKALKN
Word Size = 3
GTQ TQI QIT ITV TVE VED EDL DLF LFY …
Neighborhood Words: LTV, MTV, ISV, LSV, etc.
Word size is adjustable: 2 or 3 for protein (3 default); > 7 for blastn (11 default)
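The lookup table of exact words for the query on this slide can be sketched in Python as follows. This covers only the exact words; neighborhood words would additionally need a substitution matrix and a threshold. The function name `word_table` is a hypothetical choice.

```python
from collections import defaultdict

def word_table(query, w=3):
    # Map each length-w word to the list of 0-based positions where it starts.
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table

query = "GTQITVEDLFYNIATRRKALKN"
table = word_table(query)
print(list(table)[:9])
```

A query of length L yields L - w + 1 overlapping words, so this 22-residue query produces 20 words of size 3.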
Scan Database… Initiate Extensions
Protein BLAST requires two hits
GTQITVEDLFYNI
<------ TVE FFN ------>
two neighborhood words (threshold score)
Nucleotide BLAST requires exact matches
ATCGCCATGCTTAATTGGGCTT
<------ CATGCTTAATT ------>
exact word match
An Alignment That BLAST Can’t Find…
  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
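One way to see why this alignment is missed is to measure the longest stretch of consecutive identical columns, as in the Python sketch below. This is an illustration rather than BLAST's own procedure: it examines only the alignment as drawn here, whereas BLAST looks for word matches anywhere in the two sequences. The function name is a hypothetical choice.

```python
def longest_identical_run(a, b):
    # Length of the longest stretch of consecutive identical columns
    # in two aligned sequences of equal length.
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best

# The two sequences of the alignment above, joined across its three blocks.
top = ("GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG"
       "GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT"
       "GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC")
bottom = ("GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG"
          "GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT"
          "GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC")

print(longest_identical_run(top, bottom))
```

The longest run along this alignment is 6 bases, well short of the default nucleotide word size of 11, so no exact word match is available to seed an extension.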
…but the corresponding amino acid sequences are conserved much better
Protein alignment looks good
…and they have the same domains, too
Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution (applies to ungapped alignments)
[Figure: distribution of the number of Alignments versus Score]
Expect Value
E = number of database hits you expect to find by chance
$E = K m n e^{-\lambda S}$
$E = m n 2^{-S'}$
K = scale for search space
λ = scale for scoring system
S' = bitscore = $(\lambda S - \ln K)/\ln 2$
Figure labels: size of database; your score; expected number of random hits
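The two forms of the Expect value are the same quantity written in different units, which a short Python sketch can confirm. The formulas are the standard Karlin-Altschul ones, with λ as the scale of the scoring system. The numeric values λ = 0.267 and K = 0.041 and the search-space sizes are assumptions chosen for illustration; they are not given in the slides.

```python
import math

def bit_score(raw, lam, K):
    # S' = (lambda * S - ln K) / ln 2
    return (lam * raw - math.log(K)) / math.log(2)

def evalue_from_raw(raw, m, n, lam, K):
    # E = K * m * n * exp(-lambda * S)
    return K * m * n * math.exp(-lam * raw)

def evalue_from_bits(bits, m, n):
    # E = m * n * 2^(-S')
    return m * n * 2.0 ** (-bits)

lam, K = 0.267, 0.041      # assumed parameters, for illustration only
m, n = 450, 1000000        # hypothetical query length and database length
raw = 50
print(evalue_from_raw(raw, m, n, lam, K))
print(evalue_from_bits(bit_score(raw, lam, K), m, n))
```

With these assumed parameters a raw score of 742 converts to roughly 290 bits, in line with the "290 bits (742)" reported in the CSK example later in the deck.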
Scoring Systems - Nucleotides
Identity matrix
     A   G   C   T
A   +1  -3  -3  -3
G   -3  +1  -3  -3
C   -3  -3  +1  -3
T   -3  -3  -3  +1

CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||    raw score = 19-9 = 10
CACGTAGCAAGCTTG-GTGTCA
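The raw score on this slide can be reproduced with a few lines of Python. The match and mismatch values come from the identity matrix; the gap cost of 3 is an assumption inferred from the "19-9" arithmetic (two mismatches and one gap), since the slide does not state it.

```python
def raw_score(a, b, match=1, mismatch=-3, gap=-3):
    # Sum the column scores of two aligned sequences of equal length.
    # A '-' in either sequence marks a gap.
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score

# 19 matches, 2 mismatches and 1 gap: 19 - 9 = 10
print(raw_score("CAGGTAGCAAGCTTGCATGTCA", "CACGTAGCAAGCTTG-GTGTCA"))
```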
Scoring Systems - Proteins
Position Independent Matrices
PAM Matrices (Percent Accepted Mutation)
• Derived from observation; small dataset of alignments
• Implicit model of evolution
• All calculated from PAM1
• PAM250 widely used
BLOSUM Matrices (BLOck SUbstitution Matrices)
• Derived from observation; large dataset of highly conserved blocks
• Each matrix derived separately from blocks with a defined percent identity cutoff
• BLOSUM62 - default matrix for BLAST
Position Specific Score Matrices (PSSMs)
PSI- and RPS-BLAST
BLOSUM62
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X
Common amino acids have low weights
Rare amino acids have high weights
Negative for less likely substitutions
Positive for more likely substitutions
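As a worked example, the Python sketch below scores the query word ITV against itself and against the neighborhood words listed earlier in the deck (LTV, MTV, ISV, LSV). Only the six BLOSUM62 entries needed for these words are included, copied from the matrix on this slide; the helper names are hypothetical.

```python
# Subset of BLOSUM62, copied from the matrix on this slide.
BLOSUM62_SUBSET = {
    ("I", "I"): 4, ("T", "T"): 5, ("V", "V"): 4,
    ("I", "L"): 2, ("I", "M"): 1, ("T", "S"): 1,
}

def pair_score(a, b):
    # The matrix is symmetric, so look the pair up in either order.
    key = (a, b) if (a, b) in BLOSUM62_SUBSET else (b, a)
    return BLOSUM62_SUBSET[key]

def word_score(query_word, candidate):
    # Score of aligning two words of equal length, position by position.
    return sum(pair_score(q, c) for q, c in zip(query_word, candidate))

for candidate in ["ITV", "LTV", "MTV", "ISV", "LSV"]:
    print(candidate, word_score("ITV", candidate))
```

The scores fall from 13 for the exact word to 7 for LSV. Whether a candidate counts as a neighborhood word depends on the threshold score chosen for the search.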
Options for Advanced Blast: Protein
Matrix Selection
• PAM30 -- most stringent
• BLOSUM45 -- least stringent
Limit by taxon
Mus musculus[Organism]
Mammalia[Organism]
Viridiplantae[Organism]
Example Entrez queries
proteins all[Filter] NOT mammalia[Organism]
green plants[Organism]
srcdb refseq[Properties]
Other advanced
-W 2 word size
-e 10000 expect value
-v 2000 descriptions
-b 2000 alignments
Options for Advanced Blasting: Nucleotide
Example Entrez Queries
nucleotide all[Filter] NOT mammalia[Organism]
green plants[Organism]
biomol mrna[Properties]
biomol genomic[Properties]
Other Advanced
-W 7 word size
-e 10000 expect value
-v 2000 descriptions
-b 2000 alignments
Find a homolog of human CSK in C. elegans
Query = c-src tyrosine kinase (CSK) NP_004374 (450 aa) [Homo sapiens]
Database = NCBI protein nr Entrez limit: Caenorhabditis elegans [ORGN]
Program = BLASTP
Homology Searches
Hits to the Conserved Domain Database:
Query=
>gi|4758078|ref|NP_004374.1| c-src tyrosine kinase [Homo sapiens]
MSAIQAAWPSGTECIAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCDGKVEHYRIMYHASKLSIDEEVYFENLMQLVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGNKVAVKCIKNDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGDCLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLTKEASSTQDTGKLPVKWTAPEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWHLDAAMRPSFLQLREQLEHIKTHELHL
BLAST Graphical Overview
SH3 SH2 tyr kinase domain
BLAST Alignments
gi|7160701|emb|CAB04427.2| C. elegans KIN-22 protein (corresponding sequence F49B2.5) [Caenorhabditis elegans]
gi|17508235|ref|NP_493502.1| Tyrosine kinase with SH2, SH3 and N myristoylation domains, Drosophila suppressor of pole hole homolog (57.5 kD) (kin-22) [Caenorhabditis elegans] Length = 507
Score = 290 bits (742), Expect = 1e-78 Identities = 170/440 (38%), Positives = 245/440 (55%), Gaps = 21/440 (4%)
3D Domains
[Figure: domain diagram showing SH3, SH2 and TyrKc]
Low Complexity Filtering
[Figure: the same search result shown Unfiltered and Filtered]
sp|P27476|NSR1_YEAST NUCLEAR LOCALIZATION SEQUENCE BINDING PROTEIN (P67) Length = 414
Score = 40.2 bits (92), Expect = 0.013
Identities = 35/131 (26%), Positives = 56/131 (42%), Gaps = 4/131 (3%)
Query: 362 STTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDAFLQPLSKPLS---SQPQAIVTEDKTD 418
Sbjct:  29 SSSSSESSSSSSSSSESESESESESESSSSSSSSDSESSSSSSSDSESEAETKKEESKDS 88
PSI-BLAST
Position-Specific Iterated BLAST
• Mining for protein domains
• Confirming relationships among related proteins
Position Specific Substitution Rates
Active site serine
Weakly conserved serine
Position Specific Score Matrix (PSSM)
        A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
206 D   0 -2  0  2 -4  2  4 -4 -3 -5 -4  0 -2 -6  1  0 -1 -6 -4 -1
207 G  -2 -1  0 -2 -4 -3 -3  6 -4 -5 -5  0 -2 -3 -2 -2 -1  0 -6 -5
208 V  -1  1 -3 -3 -5 -1 -2  6 -1 -4 -5  1 -5 -6 -4  0 -2 -6 -4 -2
209 I  -3  3 -3 -4 -6  0 -1 -4 -1  2 -4  6 -2 -5 -5 -3  0 -1 -4  0
210 D  -2 -5  0  8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5  1 -3 -7 -5 -6
211 S   4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1  4  3 -6 -5 -3
212 C  -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5  0 -7 -4 -4 -5  0 -4
213 N  -2  0  2 -1 -6  7  0 -2  0 -6 -4  2  0 -2 -5 -1 -3 -3 -4 -3
214 G  -2 -3 -3 -4 -4 -4 -5  7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6
215 D  -5 -5 -2  9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7
216 S  -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4  7 -2 -6 -5 -5
217 G  -3 -6 -4 -5 -6 -5 -6  8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7
218 G  -3 -6 -4 -5 -6 -5 -6  8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7
219 P  -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7  9 -4 -4 -7 -7 -6
220 L  -4 -6 -7 -7 -5 -5 -6 -7  0 -1  6 -6  1  0 -6 -6 -5 -5 -4  0
221 N  -1 -6  0 -6 -4 -4 -6 -6 -1  3  0 -5  4 -3 -6 -2 -1 -6 -1  6
222 C   0 -4 -5 -5 10 -2 -5 -5  1 -1 -1 -5  0 -1 -4 -1  0 -5  0  0
223 Q   0  1  4  2 -5  2  0  0  0 -4 -2  1  0  0  0 -1 -1 -3 -3 -4
224 A  -1 -1  1  3 -4 -1  1  4 -3 -4 -3 -1 -2 -2 -3  0 -2 -2 -2 -3
Active site nucleophile
Serine scored differently in these two positions
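The point about serine can be checked directly from the matrix. The Python sketch below copies the two rows whose query residue is serine (positions 211 and 216) and looks up the score for a given residue; the helper name `pssm_score` is a hypothetical choice.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # column order of the PSSM above

# The two rows of the PSSM above where the query residue is serine.
PSSM_ROWS = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
}

def pssm_score(position, residue):
    # Score for placing `residue` at `position` of the profile.
    return PSSM_ROWS[position][AMINO_ACIDS.index(residue)]

for pos in (211, 216):
    print(pos, "S:", pssm_score(pos, "S"), "A:", pssm_score(pos, "A"))
```

Serine scores 4 at position 211 and 7 at position 216, and alanine is tolerated at 211 (score 4) but penalized at 216 (score -2). A position-independent matrix such as BLOSUM62 would score both positions identically.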
PSI-BLAST
>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFVIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDEQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYRTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAVRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKK
e value cutoff for PSSM
RESULTS: Initial BLASTP
Same results as protein-protein BLAST
Results of First PSSM Search
Other purine nucleotide metabolizing enzymes not found by ordinary BLAST
Third PSSM Search: Convergence
Just below threshold, another nucleotide metabolism enzyme
Check to add to PSSM
MegaBLAST
AI217550
AI251192
AI254381
BE645079
C:\seq\hs.4.fsa
> 1133045 gnl|UG|Hs#S1133045 qd43b11.x1 Homo sapiens cDNA, 3' end
CATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGTTTGGTGAGAAGTGCTCGATTAGTTCAGACAACATCTGGCACTTGATGTCTGTCCTTCCCTCCTTTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAAGGTGACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACACCGTCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAAAACCACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGC
> 1141828 gnl|UG|Hs#S1141828 qv37f11.x1 Homo sapiens cDNA, 3' end
GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGTGCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATACATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAAGTCGTATCGATGT
> 1145899 gnl|UG|Hs#S1145899 qv33c06.x1 Homo sapiens cDNA, 3' end
GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGTGCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATACATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAAGTCGTATCGATGT
> 2291670 gnl|UG|Hs#S2291670 7e65f04.x1 Homo sapiens cDNA, 3' end
TTTCATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGTTTGGTGAGAAGTGCTCGATTAGTTCAAACAACATCTGGCACTTGATGTCTGTCCTTCCCTCCTTTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAAGGTGACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACACCGTCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAAAACCACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGCCTCCCAACCGCATTCCTGCCTGTGTAGCAGGCGGTGAGCACCCAGAAGGGGCACATACCTCTCCAAGCCTTGAAAGCAAAGCATGGAGATCTACAAAAATAGGATTTCCACTTGGAGAAATGTCGCTGGGACAGT
What is Discontiguous (Cross-species) MegaBLAST?
W = 11, t = 16, coding:     1101101101101101
W = 11, t = 16, non-coding: 1110010110110111
W = 12, t = 16, coding:     1111101101101101
W = 12, t = 16, non-coding: 1110110110110111
W = 11, t = 18, coding:     101101100101101101
W = 11, t = 18, non-coding: 111010010110010111
W = 12, t = 18, coding:     101101101101101101
W = 12, t = 18, non-coding: 111010110010110111
W = 11, t = 21, coding:     100101100101100101101
W = 11, t = 21, non-coding: 111010010100010010111
W = 12, t = 21, coding:     100101101101100101101
W = 12, t = 21, non-coding: 111010010110010010111
Ma, B., Tromp, J., Li, M., "PatternHunter: faster and more sensitive homology search", Bioinformatics 2002 Mar;18(3):440-5
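A discontiguous word is built by sampling only the positions marked 1 in a template. The Python sketch below applies the W = 11, t = 16 coding template from the list above to two 16-base windows. Both windows are hypothetical sequences constructed to differ only at the positions marked 0, which in the coding template fall on every third base.

```python
def apply_template(window, template):
    # Keep only the bases that fall under a '1' in the template.
    if len(window) != len(template):
        raise ValueError("window and template must have the same length")
    return "".join(base for base, bit in zip(window, template) if bit == "1")

CODING_11_16 = "1101101101101101"  # W = 11, t = 16, coding

a = "ATGGCTAAAGGTCTGA"
b = "ATCGCAAAGGGGCTTA"  # differs from a only under the '0' positions

print(apply_template(a, CODING_11_16))
print(apply_template(b, CODING_11_16))
```

Both windows give the same 11-base word even though they share no contiguous 11-mer, which is how a discontiguous template can seed a match between diverged coding sequences.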
Neighbors: Precomputed BLAST
Nucleotide
Protein
Entrez Related Sequences produces a list of sequences sorted by BLAST score, but with no alignment details.
Blink – Protein BLAST Alignments
• Lists only 200 hits
• List is nonredundant
BLAST Databases: Non-redundant protein
nr (non-redundant protein sequences)
– GenBank CDS translations
– NP_ RefSeqs
– Outside Protein
• PIR, Swiss-Prot, PRF
– PDB (sequences from structures)
BLAST Databases: Nucleic Acid
• nr (nt)
– Traditional GenBank Divisions
– NM_ and XM_ RefSeqs
• dbest
– EST Division
• htgs
– HTG division
• gss
– GSS division
• chromosome
– NC_ RefSeqs
• wgs
– whole genome shotgun
Genomic BLAST
• These pages provide customized nucleotide and protein databases for each genome
• If a Map Viewer is available, the BLAST hits can be viewed on the maps
What if Your Favorite Gene is not found in the latest genome build?
POSSIBLE VARIANTS:
• The gene does not exist;
• It exists, but there is a problem with assembly;
• It exists, but there is a problem with annotation
An example: finding prestin in the Human genome
• We start with rat prestin, BLAST it against the Human genome, and look for evidence that human prestin exists as well.
Searching the Human Genome
>gi|12188917|emb|AJ303372.1|RNO303372 Rattus norvegicus
ATGGATCATGCTGAAGAAAATGAAATTCCTGCAGAGATCAGAAGTACCTCGTGGAA
GTCATCCGGTCCTCCAGGAGAGGCTGCACGTCAAGGACAAAGTCACAGACTCCATC
GCAGGCATTCACGTGCACTCCTAAAAAAGTAAGAAACATCATCTACATGTTCTTGC
TTGCCAGCATATAAATTCAAGGAGTATGTGCTGGGTGACTTGGTCTCGGGCATAAG
AGCTCCCCCAAGGCTTAGCCTTCGCGATGCTGGCAGCTGTGCCTCCGGTGTTCGGC
On for same species comparisons
BLAST Results
16 hits to one contig
Human Genome Database
953 contigs
2.9 billion letters
Map Viewer: Genomic Context of BLAST Hits
[Figure: Map Viewer tracks: Genes, Genome Scan Models, Human EST hits, Contig, GenBank, Mouse EST hits]
Human prestin: now appears in Build 34
Now we can compare genes
Three prestin genes: finally together!
Same prestin, different assemblies
Does homology mean a common biological function?
• Not always; the existence of a common ancestor does not guarantee that a function was not lost or acquired after the divergence.
An example: zeta-crystallin is a component of the transparent lens matrix of the vertebrate eye. Its homolog in E. coli is the metabolic enzyme quinone oxidoreductase.
[Figure: three kinds of neighbors: Text (Entrez), Sequence (BLAST), Structure (VAST)]
Structure similarity: No More BLASTing!
• Three-dimensional structure is the most conserved feature during evolution;
• One can still detect the existence of a common ancestor based on structure similarity;
• Spatial similarity is not calculated the way sequence similarity is.
VAST: Structure Neighbors
Vector Alignment Search Tool
For each protein chain, locate SSEs (secondary structure elements), and represent them as individual vectors.
[Figure: Human IL-4 with its SSE vectors numbered 1 to 6]
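A small Python sketch of what representing an SSE as a vector can mean. The slide states only that each SSE becomes a vector; taking the vector from the first to the last point of the element and comparing two elements by the angle between them are illustrative assumptions, not a description of VAST's actual comparison. The coordinates are hypothetical.

```python
import math

def sse_vector(start, end):
    # Vector from the first to the last point of a secondary structure element.
    return tuple(e - s for s, e in zip(start, end))

def angle_deg(u, v):
    # Angle between two SSE vectors, in degrees.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding error
    return math.degrees(math.acos(cosine))

# Hypothetical endpoint coordinates (in angstroms) for two helices.
helix1 = sse_vector((0.0, 0.0, 0.0), (0.0, 0.0, 15.0))
helix2 = sse_vector((5.0, 0.0, 15.0), (5.0, 0.0, 0.0))
print(angle_deg(helix1, helix2))
```

Reducing each element to one vector turns a chain of hundreds of atoms into a handful of geometric objects, which is what makes comparing one structure against a whole database tractable.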
VAST: Structure Neighbors
Structure Neighbors in Cn3D
[Figure: C-Src kinase, Human vs. Chicken, with the SH3 and SH2 domains labeled]
3D Domain Neighbors
Human C-Src Kinase (Tyr) vs. Chk1 kinase (Ser/Thr)
NCBI is changing
From a sequence data storage facility to a one-stop shop with integrated databases of various kinds.
You can be part of the future – work with us! Your expertise and data are indispensable.
GenBank
Refseq
Entrez Gene
Homologene database
New generation of databases: an example
Protein interaction database: a seed for future precomputed resources
New databases: GenSAT
PubChem
Headache? Take Aspirin
Aspirin has 432 neighbors
Link to 3D protein structures
For More Information…
E-mail addresses
• General Help [email protected]
• BLAST [email protected]
The (free!) NCBI Newsletter
http://www.ncbi.nih.gov/About/newsletter.html
The NCBI Handbook
Follow the link from the NCBI Home Page
The NCBI Education Page
http://www.ncbi.nih.gov/Education/index.html