Fundamentals of Sequence Analysis
DESCRIPTION
Fundamentals of Sequence Analysis. Chuong Huynh, NIH/NLM/NCBI, Sept 30, 2004. [email protected]. Molecular evolution: common ancestry allows us to infer similar function.
TRANSCRIPT
![Page 2: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/2.jpg)
NCBI
Molecular Evolution
Common ancestry allows us to infer similar function
[Figure: Human, Fly, Worm, Yeast, Bacteria; divergence times 540 Myr, 1000 Myr, 3000 Myr; MLH1 and MutL; disease labels: Alzheimer’s disease, ataxia telangiectasia, colon cancer, pancreatic carcinoma]
![Page 3: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/3.jpg)
Why do we need similarity searching?
• Identification and annotation
  – Incomplete or no annotations (GenBank)
  – Incorrectly annotated sequences
• Evolutionary relationships
  – Homologous molecules may have similar functions
![Page 4: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/4.jpg)
Why Search Databases?
• To find out if a new DNA sequence is already fully or partially present in the databanks.
• To find homologous proteins to a putative coding ORF that might share similar 3D structure.
• To identify homology (“relatedness”) between a query and entries in a database.
![Page 5: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/5.jpg)
Some Terminology
![Page 6: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/6.jpg)
Searching Sequence Databases
• Two sequences are homologous when they share a common ancestor. Common ancestry is often, but not always, reflected in strong sequence similarity.
• Computationally, thresholds for sequence similarity can be defined by:
  – length of the stretch of similar sequence
  – percentage of identity between the sequences
  – statistical measures such as E-value, P-value, and bit score
![Page 7: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/7.jpg)
Similarity and Homology
• Similarity can be expressed as a percentage. It does not imply any reasons for the observed sameness.
• Homology is an evolutionary term used to describe relationship via descent from a common ancestor.
• Homologous things are often similar, but not always (whale flipper <-> human arm)
• Homology is NEVER expressed as a percentage
![Page 8: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/8.jpg)
Orthologs vs Paralogs
• Homologs can be separated into two classes: orthologs and paralogs.
• Orthologs are homologous genes in different species that diverged through speciation; they usually perform the same function.
• Paralogs are homologous genes that arose by duplication within a genome; they may perform different functions.
![Page 9: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/9.jpg)
Similarity and Homology
• Sequence homology can be reliably inferred from statistically significant similarity over a majority of the sequence length.
• Non-homology CANNOT be inferred from non-similarity, because non-similar things can still share a common ancestor.
• Homologous proteins share common structures, but not necessarily common sequence or function (e.g. FtsZ <-> tubulin).
• Remember: a pair of sequences either is or isn't homologous. There is no such thing as “64% homologous”.
![Page 10: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/10.jpg)
Searching sequence databases
• When we search a sequence database, we are usually looking for related sequences.
• Unfortunately, the algorithms we have for searching databases do not search for homology; they search for similarity.
• When similarity is found, we must determine whether it is a result of homology or comes from another source.
![Page 11: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/11.jpg)
Pairwise Sequence Alignments
• Purpose:
  – identification of sequences with significant similarity to (a) sequence(s) in a sequence repository
  – identification of all homologous sequences in the repository
  – identification of domains with sequence similarity
• Terminology:
  – Global alignment
  – Local alignment
![Page 12: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/12.jpg)
Terminology: Global Alignment
• Finds the optimal alignment over the entire length of the two compared sequences
• Unlikely to detect genes that have evolved by recombination (e.g. domain shuffling) or insertion/deletion of DNA
• Suitable for sequences of homologous molecules
![Page 13: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/13.jpg)
Terminology: Local Alignment
• Finds short regions of similarity between a pair of sequences.
• Compared sequences can receive high local similarity scores without needing high similarity over their entire length.
• Useful when looking for domains within proteins or for regions of genomic DNA that contain coding exons.
![Page 14: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/14.jpg)
Substitutions, Insertions, Deletions
• Mutation: one of
  – switch from one nucleotide to another
  – insertion
  – deletion
• Substitution: a switch in nucleotides which spreads throughout most of a species.
• Substitutions, insertions and deletions passed along two independent lines of descent cause a divergence of the two sequences from the original (and from each other):
Original: cggtatgcca
Descendant 1: cgggtatccaa
Descendant 2: ccctaggtccca
![Page 15: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/15.jpg)
Example
• For the previous example (original cggtatgcca; descendants cgggtatccaa and ccctaggtccca), the two descendant sequences align as follows:
c g g g t a - - t - c c a a
c c c - t a g g t c c c - a
• “-” (indel) represents an insertion or deletion.
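The alignment above can be summarised column by column. The following is a minimal Python sketch, not part of the original slides; the function name `alignment_stats` is hypothetical, and percent identity is computed over the full alignment length (gap columns included), which is an assumption about the denominator.

```python
def alignment_stats(row1, row2, gap="-"):
    """Count matches, mismatches, and indel columns in a gapped pairwise alignment."""
    if len(row1) != len(row2) or not row1:
        raise ValueError("aligned rows must be non-empty and of equal length")
    matches = mismatches = indels = 0
    for x, y in zip(row1, row2):
        if x == gap or y == gap:
            indels += 1          # column contains an insertion or deletion
        elif x == y:
            matches += 1         # identical residues
        else:
            mismatches += 1      # substitution
    return {
        "matches": matches,
        "mismatches": mismatches,
        "indels": indels,
        # Assumption: identity is reported relative to alignment length.
        "percent_identity": 100.0 * matches / len(row1),
    }


# The two descendant sequences from the slide, with the spaces removed.
stats = alignment_stats("cgggta--t-ccaa", "ccc-taggtccc-a")
print(stats)
```

For the slide's alignment this gives 7 identical columns, 2 substitutions, and 5 indel columns out of 14.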
![Page 16: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/16.jpg)
Algorithms: definition
Webster’s definition: “a procedure for solving a mathematical problem in a finite number of steps that frequently involves a repetition of an operation; or broadly: a step-by-step procedure for solving a problem or accomplishing some end”
![Page 17: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/17.jpg)
Algorithms
• Needleman-Wunsch
  – Exhaustive global alignment
  – Most rigorous method when aligning conserved sequences of similar length (no exon shuffling, insertion/deletion, etc.)
• Smith-Waterman
  – Exhaustive local alignment
  – Alignment does not have to extend along the full length of the sequences
  – In contrast to N-W, alignments initiating at all possible positions of the sequence space are considered
  – Can be very slow
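The difference between the two algorithms is easiest to see side by side. The sketch below is a simplified illustration, not taken from the slides: it assumes a linear gap penalty and a simple match/mismatch scheme (the values 1, -1, -2 are arbitrary), and it returns only the optimal score, not the alignment itself. The function name `alignment_score` is hypothetical.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score by dynamic programming.

    local=False: Needleman-Wunsch (global, end-to-end alignment).
    local=True:  Smith-Waterman (best-scoring local region).
    """
    # First row: global alignment pays for leading gaps, local does not.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)   # a local alignment may restart anywhere
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    # Global score is the bottom-right cell; local score is the best cell seen.
    return best if local else prev[-1]


print(alignment_score("ACGT", "AGT"))                          # global
print(alignment_score("TTTTACGTTTTT", "GGACGGG", local=True))  # local
```

Both variants fill a table of size len(a) × len(b), which is why exhaustive alignment against a whole database "can be very slow" and why heuristic tools are used for database searches.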
![Page 18: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/18.jpg)
Basic Local Alignment Search Tool
http://www.ncbi.nlm.nih.gov/BLAST/
![Page 19: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/19.jpg)
Basic Local Alignment Search Tool
• Widely used similarity search tool
• Heuristic approach based on the Smith-Waterman algorithm
• Finds best local alignments
• Provides statistical significance
• All combinations (DNA/protein) of query and database:
  – DNA vs DNA
  – DNA translation vs Protein
  – Protein vs Protein
  – Protein vs DNA translation
  – DNA translation vs DNA translation
• WWW, standalone, and network clients
![Page 20: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/20.jpg)
BLAST Selection Matrix
![Page 21: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/21.jpg)
Choosing The Right BLAST Flavor for Proteins

| What do you want to do? | The right BLAST flavor |
| --- | --- |
| Find out something about the function of the protein | Use blastp to compare your protein with other proteins contained in the databases. |
| Discover new genes encoding similar proteins | Use tblastn to compare your protein with DNA sequences translated into their 6 possible reading frames. |

Claverie & Notredame 2003
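The "6 possible reading frames" are three offsets on the forward strand and three on the reverse complement. The following Python sketch illustrates the idea under stated assumptions: it uses the standard genetic code, marks stop codons as `*`, and translates any codon containing a non-ACGT character as `X`. The function names are hypothetical, and this is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Map frame labels (+1..+3, -1..-3) to their conceptual translations."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


print(six_frame_translation("ATGGCCTAA"))
```

A tblastn-style search compares the protein query against each of these six translations of every database sequence.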
![Page 22: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/22.jpg)
Choosing the Right BLAST Flavor for DNA

| Question | Answer |
| --- | --- |
| Am I interested in non-coding DNA? | Yes: use blastn. Remember: blastn is only for closely related DNA sequences (more than 70% identical). |
| Do I want to discover new proteins? | Yes: use tblastx. |
| Do I want to discover proteins encoded in my query DNA sequences? | Yes: use blastx. |
| Am I unsure of the quality of my DNA? | Yes: use blastx, especially if you suspect your DNA sequence codes for a protein but may contain sequencing errors. |

Claverie & Notredame 2003
![Page 23: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/23.jpg)
Choosing The Right BLAST Flavor for DNA Sequences

| Usage | Query | Database | Program |
| --- | --- | --- | --- |
| Find very similar DNA sequence | DNA | DNA | blastn |
| Protein discovery and ESTs | Translated DNA | Translated DNA | tblastx |
| Analysis of query DNA sequence | Translated DNA | Protein | blastx |

Claverie & Notredame 2003
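The selection tables reduce to a lookup on three things: the query type, the database type, and whether the comparison is made on translated DNA. A small Python sketch of that lookup follows; the function name and the argument spellings ("dna", "protein") are hypothetical conventions chosen for illustration.

```python
# (query type, database type, compared via translation?) -> program
BLAST_PROGRAMS = {
    ("dna", "dna", False): "blastn",        # DNA vs DNA
    ("protein", "protein", False): "blastp",  # Protein vs Protein
    ("dna", "protein", True): "blastx",     # DNA translation vs Protein
    ("protein", "dna", True): "tblastn",    # Protein vs DNA translation
    ("dna", "dna", True): "tblastx",        # DNA translation vs DNA translation
}


def choose_blast_program(query, database, translated=False):
    """Pick a BLAST program name from the query/database types."""
    query, database = query.lower(), database.lower()
    if query != database:
        # Mixed DNA/protein searches are only possible through translation.
        translated = True
    key = (query, database, translated)
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"no BLAST program for query={query!r}, database={database!r}")
    return BLAST_PROGRAMS[key]


print(choose_blast_program("protein", "dna"))           # tblastn
print(choose_blast_program("dna", "dna", translated=True))  # tblastx
```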
![Page 24: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/24.jpg)
WWW BLAST
![Page 25: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/25.jpg)
Web BLAST
![Page 26: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/26.jpg)
BLAST Databases: Non-redundant protein
• nr (non-redundant protein sequences)
  – GenBank CDS translations
  – NP_ RefSeqs
  – Outside Protein
    • PIR, Swiss-Prot, PRF
  – PDB (sequences from structures)
![Page 27: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/27.jpg)
BLAST Databases: Nucleic Acid
• nr (nt)
  – Traditional GenBank Divisions
  – NM_ and XM_ RefSeqs
• dbest
  – EST Division
• htgs
  – HTG division
• gss
  – GSS division
• chromosome
  – NC_ RefSeqs
![Page 28: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/28.jpg)
Protein BLAST Page
>Mutated in Colon Cancer
IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLLGSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDAFLQPLSKPLSS
Protein database
![Page 29: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/29.jpg)
BLAST Formatting Page
![Page 30: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/30.jpg)
BLAST Output Overview
• Graphic Display: Shows you where your query is similar to other sequences.
• Hit List: Name of sequences similar to your query ranked by similarity
• Alignments: Every alignment between your query and the reported hits
• Parameters: List of the various parameters used for the search
![Page 31: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/31.jpg)
BLAST Output: Graphic
Mouse over, click for active links
Sort by taxonomy
Red bar = most similar sequence
Pink = almost as similar
Green = even less similar
Blue/Black = worse scores
![Page 32: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/32.jpg)
BLAST Output: Descriptions
Bacterial mismatch repair proteins
link to entrez
sorted by e values
4 × 10^-56
Default e value cutoff 10
LocusLink
Bit scores < 50 unreliable
![Page 33: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/33.jpg)
A Little on Interpretation
• How similar must sequences be in order to be considered homologous?
• Rule of thumb: more than 25% identical amino acids for proteins, or more than 70% identical nucleotides for DNA. Above these limits, two sequences very probably share a common ancestor, and two proteins very probably have a similar structure.
• Remember: this applies only to alignments longer than 100 aa or nt.
![Page 34: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/34.jpg)
A Little on Interpretation: E-value
• Determines how much you can trust your conclusion on homology.
• E-value = Expectation value
• Allows comparison of pairwise alignments with different similarities and different lengths. This is an advantage over percent identity (not discussed).
• Definition: the number of matches with a score at least this good that you would expect to find by chance in a database of this size. A match that is unlikely to occur by chance is a good match. The lowest E-values (as close to 0 as possible) are the best and the most significant, since we can trust them enough to infer homology.
• If you want to be certain of homology, your E-values must be below 10^-4 (0.0001).
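The relationship between bit score and E-value can be made concrete. For a bit score S' and a search space of m × n (query length times database length), the standard approximation is E = m · n · 2^(-S'). The Python sketch below is an illustration added here, not from the slides: the lengths passed in are assumptions, and because BLAST uses edge-corrected "effective" lengths, the result will not exactly reproduce the Expect values that BLAST reports. The function names are hypothetical.

```python
def expect_from_bit_score(bit_score, query_length, database_length):
    """Approximate E-value: E = m * n * 2**(-S') for bit score S'."""
    return query_length * database_length * 2.0 ** (-bit_score)


def passes_homology_cutoff(e_value, cutoff=1e-4):
    """Apply the slide's rule of thumb: E-value below 10^-4."""
    return e_value < cutoff


# Each additional bit of score halves the expected number of chance hits.
for bits in (30, 40, 50):
    e = expect_from_bit_score(bits, query_length=131, database_length=1_000_000)
    print(bits, e, passes_homology_cutoff(e))
```

The same bit score becomes less significant as the database grows, which is one reason to restrict a search to the subset of the database you actually care about.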
![Page 35: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/35.jpg)
TaxBLAST: Taxonomy Reports
![Page 36: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/36.jpg)
BLAST Output: Pairwise Alignments
>gi|127552|sp|P23367|MUTL_ECOLI DNA mismatch repair protein mutL
Length = 615
Score = 44.3 bits (103), Expect = 5e-05
Identities = 25/59 (42%), Positives = 33/59 (55%), Gaps = 8/59 (13%)
Query: 9   LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILERVQQHIESKL 59
           L  +  P   L LEI P  VDVNVHP KHEV F     +H+   + +L  +QQ +E+ L
Sbjct: 280 LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL 338
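The statistics lines in this text report are regular enough to pull apart with a couple of regular expressions. This is a hedged sketch for the classic text format shown on the slide only; the function name `parse_alignment_stats` is hypothetical, and other BLAST output formats (tabular, XML) would need different handling.

```python
import re

# Matches e.g. "Score = 44.3 bits (103), Expect = 5e-05" and "Expect(3) = 1e-42".
SCORE_RE = re.compile(
    r"Score\s*=\s*(?P<bits>[\d.]+)\s*bits\s*\((?P<raw>\d+)\),\s*"
    r"Expect(?:\(\d+\))?\s*=\s*(?P<expect>[\d.]+(?:e[+-]?\d+)?)",
    re.IGNORECASE,
)
# Matches e.g. "Identities = 25/59", "Positives = 33/59", "Gaps = 8/59".
FRACTION_RE = re.compile(r"(Identities|Positives|Gaps)\s*=\s*(\d+)/(\d+)")


def parse_alignment_stats(text):
    """Extract score, E-value, and identity/positive/gap counts from a report block."""
    m = SCORE_RE.search(text)
    if m is None:
        raise ValueError("no 'Score = ... Expect = ...' line found")
    stats = {
        "bits": float(m["bits"]),
        "raw_score": int(m["raw"]),
        "expect": float(m["expect"]),
    }
    for name, numerator, denominator in FRACTION_RE.findall(text):
        stats[name.lower()] = (int(numerator), int(denominator))
    return stats


report = (
    "Score = 44.3 bits (103), Expect = 5e-05 "
    "Identities = 25/59 (42%), Positives = 33/59 (55%), Gaps = 8/59 (13%)"
)
print(parse_alignment_stats(report))
```

The "Gaps" entry is simply absent from the result when the alignment has no gaps, as in the ungapped hits shown on later slides.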
![Page 37: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/37.jpg)
BLAST Output: Alignments
>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair protein Mlh1 (MutL protein homolog 1)
Length = 756
Score = 233 bits (593), Expect = 8e-62
Identities = 117/131 (89%), Positives = 117/131 (89%)
Query: 1   IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 60
           IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL
Sbjct: 276 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 335
Query: 61  GSNSSRMYFTQTLLPGLAGPSGEMVKXXXXXXXXXXXXXXDKVYAHQMVRTDSREQKLDA 120
           GSNSSRMYFTQTLLPGLAGPSGEMVK              DKVYAHQMVRTDSREQKLDA
Sbjct: 336 GSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDA 395
Query: 121 FLQPLSKPLSS 131
           FLQPLSKPLSS
Sbjct: 396 FLQPLSKPLSS 406
Low-complexity sequence filtered (shown as X in the query)
![Page 38: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/38.jpg)
Results from nr
Sequences producing significant alignments: Score (bits), E value
gi|604369|gb|AAA85687.1| (U17857) hMLH1 gene product [Homo ... 233 3e-61
gi|4557757|ref|NP_000240.1| (NM_000249) mutL homolog 1; mut... 233 4e-61
gi|466462|gb|AAA17374.1| (U07418) human homolog of E. coli ... 233 4e-61
gi|13878583|sp|Q9JK91|MLH1_MOUSE DNA mismatch repair protei... 214 2e-55
gi|19387852|ref|NP_081086.1| (NM_026810) mutL homolog 1; DN... 213 2e-55
gi|13591989|ref|NP_112315.1| (NM_031053) mismatch repair pr... 212 5e-55
gi|12835158|dbj|BAB23172.1| (AK004105) DNA MISMATCH REPAIR ... 205 6e-53
gi|3192877|gb|AAC19117.1| (AF068257) mutL homolog [Drosophi... 128 1e-29
gi|17136968|ref|NP_477022.1| (NM_057674) Mlh1-P1 [Drosophil... 127 1e-29
gi|17861656|gb|AAL39305.1| (AY069160) GH18717p [Drosophila ... 125 8e-29
gi|20146218|dbj|BAB89000.1| (AP003238) putative MLH1 [Oryza... 87 2e-17
gi|11357265|pir||T51620 DNA mismatch repair protein MLH1 [i... 83 5e-16
gi|18413196|ref|NP_567345.1| (NM_116983) MLH1 protein [Arab... 83 5e-16
gi|6323819|ref|NP_013890.1| (NC_001145) Required for mismat... 72 1e-12
gi|460627|gb|AAA16835.1| (U07187) Mlh1p [Saccharomyces cere... 71 2e-12
gi|19112991|ref|NP_596199.1| (NC_003423) putative DNA misma... 70 5e-12
gi|13517948|gb|AAK29067.1|AF346620_1 (AF346620) MLH1 [Trypa... 57 3e-08
gi|16272041|ref|NP_438240.1| (NC_000907) DNA mismatch repai... 54 3e-07
gi|19173567|ref|NP_597370.1| (NC_003232) DNA MISMATCH REPAI... 52 9e-07
gi|13543339|gb|AAH05833.1|AAH05833 (BC005833) Similar to mu... 50 5e-06
gi|15602769|ref|NP_245841.1| (NC_002663) MutL [Pasteurella ... 50 6e-06
gi|15642797|ref|NP_227838.1| (NC_000853) DNA mismatch repai... 48 2e-05
>gi|4557757|ref|NP_000240.1| (NM_000249) mutL homolog 1; mutL (E. coli) homolog 1; coli) homolog 1 (colon cancer, nonpolyposis type 2) [Homo sapiens]
gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair protein Mlh1 (MutL protein homolog 1)
gi|631299|pir||S43085 DNA mismatch repair protein MLH1 - human
gi|463989|gb|AAC50285.1|(U07343) hMLH1 [Homo sapiens]
gi|1079787|gb|AAA82079.1|(U40978) DNA mismatch repair protein homolog [Homo sapiens]
gi|13905126|gb|AAH06850.1|AAH06850 (BC006850) mutL (E. coli) homolog 1 type 2) [Homo sapiens]
gi|741682|prf||2007430A DNA mismatch repair protein [Homo sapiens]
Length = 756
Score = 233 bits (593), Expect = 4e-61
Identities = 117/131 (89%), Positives = 117/131 (89%)
Query: 1   IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 60
           IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL
Sbjct: 276 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 335
![Page 39: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/39.jpg)
tblastn Results Against ESTs
>gi|12794555|emb|AL531062.1|AL531062 AL531062 LTI_NFL001_NBC4 Homo sapiens cDNA clone CS0DM005YM23 5 prime.
Length = 878
Score = 167 bits (422), Expect(3) = 1e-42
Identities = 81/82 (98%), Positives = 81/82 (98%)
Frame = +2
Query: 1   IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 60
           IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL
Sbjct: 512 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 691
Query: 61  GSNSSRMYFTQTLLPGLAGPSG 82
           GSNSSRMYFTQTLLPGLAGP G
Sbjct: 692 GSNSSRMYFTQTLLPGLAGPLG 757
Score = 24.3 bits (51), Expect(3) = 1e-42
Identities = 11/26 (42%), Positives = 11/26 (42%)
Frame = +1
Query: 80  PSGEMVKXXXXXXXXXXXXXXDKVYA 105
           PSG MVK              DKVYA
Sbjct: 748 PSG*MVKSTTSLTSSSTSGSSDKVYA 825
Combined Expect for hits to multiple frames
![Page 40: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/40.jpg)
BLAST Tips
• It is faster and more accurate to BLAST proteins (blastp) rather than nucleotides.
• If in doubt use blastp.
• When possible restrict to the subset of the database you are interested in.
• Look around for the database you need or create your own custom BLAST database. BUT HOW???
• When is the best time to use the BLAST server?
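One possible answer to "BUT HOW???", sketched in Python under clearly stated assumptions: it relies on the `makeblastdb` program from the current NCBI BLAST+ command-line suite being installed, which postdates these 2004 slides (the legacy toolkit used a different formatting program). The file name `my_proteins.fasta` and database name `my_db` are hypothetical placeholders.

```python
import shutil
import subprocess


def makeblastdb_command(fasta_path, dbtype, out_name):
    """Build the argument list for NCBI BLAST+ makeblastdb."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", out_name]


def build_database(fasta_path, dbtype, out_name):
    """Run makeblastdb if it is available on PATH (assumes BLAST+ is installed)."""
    if shutil.which("makeblastdb") is None:
        raise RuntimeError("makeblastdb not found on PATH; install NCBI BLAST+ first")
    subprocess.run(makeblastdb_command(fasta_path, dbtype, out_name), check=True)


# Hypothetical file and database names, shown without running anything.
print(" ".join(makeblastdb_command("my_proteins.fasta", "prot", "my_db")))
```

Building a small custom database is also a direct way to follow the previous tip of restricting the search to the sequences of interest.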
![Page 41: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/41.jpg)
Asking Biological Problems with BLAST

| What You Want to Do | General (but More Complicated) Computational Method | Using BLAST |
| --- | --- | --- |
| Finding genes in a genome | Run gene prediction software or an ORF Finder (for bacteria) | Cut your genome sequence into small (2-5 kb) overlapping pieces. Use blastx to BLAST each piece against NR (nonredundant protein db). Works better for sequences with no introns (bacteria). |
| Predicting protein function | Domain analysis or wet-lab experimentation | Use blastp to BLAST your protein sequence against Swiss-Prot (future = UniProt). If you get a good hit (more than 25% identity) over the complete length of the protein, your protein likely has the same function as the Swiss-Prot protein. |
| Predicting protein 3-D structure | Homology modeling, X-ray, NMR analysis of protein of interest | Use blastp to BLAST your protein against PDB (protein structure DB). If you get a hit with >25% identity, your protein and the good hit(s) likely have a similar 3-D structure. |
| Finding protein family members | Clone new family members using PCR techniques | Use blastp (or better, PSI-BLAST) against NR (nonredundant protein db). Once you have all members of the family, you can build a multiple sequence alignment and a phylogenetic tree. |

Claverie & Notredame 2003
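The first row's "cut your genome into overlapping pieces" step can be sketched in Python as follows. The slide gives only the piece size (2-5 kb); the 500-base overlap used as a default here is an assumption, and the function name `overlapping_chunks` is hypothetical.

```python
def overlapping_chunks(sequence, size=5000, overlap=500):
    """Split a sequence into (start_offset, piece) tuples with overlapping ends."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not sequence:
        return []
    step = size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, sequence[start:start + size]))
        if start + size >= len(sequence):
            break  # this piece reached the end of the sequence
        start += step
    return chunks


# Tiny example so the overlaps are visible; real pieces would be 2-5 kb.
for offset, piece in overlapping_chunks("ABCDEFGHIJK", size=4, overlap=2):
    print(offset, piece)
```

Keeping the start offset with each piece lets blastx hit coordinates be mapped back onto the original genome sequence, and the overlap reduces the chance that a gene is split across a boundary and missed.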
![Page 42: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/42.jpg)
BLAST and PSI-BLAST Servers on the Internet

| Country | Program | URL |
| --- | --- | --- |
| USA | BLAST/PSI-BLAST | http://www.ncbi.nlm.nih.gov/BLAST |
| USA | BLAST | http://genome.wustl.edu/gsc/BLAST |
| Europe | BLAST | http://www.ch.embnet.org/software/bBLAST.html |
| Europe | BLAST | http://www.ebi.ac.uk/blast2/ |
| Japan | BLAST/PSI-BLAST | http://www.ddbj.nig.ac.jp/E-mail/homology.html |
![Page 43: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/43.jpg)
Alternative Methods for Homology Searches
• Smith-Waterman (ssearch): slower but more accurate
• FASTA: slower than BLAST, but more accurate when making DNA comparisons
• BLAT: for locating cDNA in a genome or finding close proteins in a genome
![Page 44: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/44.jpg)
Common Mistake
• Seq1 has domains A and B; Seq2 has only domain A; Seq3 has only domain B.
• Use Seq1 as the query sequence.
• What happens? Both hits may be highly significant (very low E-values) if domains A and B are long and well conserved.
• Seq1 is homologous to Seq2 and to Seq3, but remember that Seq1 is not homologous to either over its entire length, and Seq2 and Seq3 need not be homologous to each other.
• Do not depend on the E-value alone.
• “BLAST hits are not transitive, unless the alignments are overlapping”
• Most proteins have more than one domain, so be careful when looking at BLAST results: not all reported hits belong to the same big family.
Sequence 1: AAAAAABBBBBB
Sequence 2: AAAAAA
Sequence 3: BBBBBB
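A quick way to guard against this mistake is to check whether two hits cover the same region of the query before treating them as members of one family. The Python sketch below assumes each hit is summarised by its 1-based, inclusive start and end coordinates on the query; the function names and the `min_overlap` parameter are hypothetical.

```python
def ranges_overlap(a, b, min_overlap=1):
    """True if query ranges a and b (1-based, inclusive) share enough positions."""
    shared = min(a[1], b[1]) - max(a[0], b[0]) + 1
    return shared >= min_overlap


def overlapping_hit_pairs(hits, min_overlap=1):
    """Return pairs of hit names whose alignments overlap on the query."""
    names = sorted(hits)
    return [
        (x, y)
        for i, x in enumerate(names)
        for y in names[i + 1:]
        if ranges_overlap(hits[x], hits[y], min_overlap)
    ]


# Query Seq1 = AAAAAABBBBBB: Seq2 aligns to positions 1-6, Seq3 to positions 7-12.
hits = {"Seq2": (1, 6), "Seq3": (7, 12)}
print(overlapping_hit_pairs(hits))  # no overlap, so no inference about Seq2 vs Seq3
```

Only hits that overlap on the query give any support for relating the hits to each other; here both are good hits to Seq1 but say nothing about Seq2 versus Seq3.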
![Page 45: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/45.jpg)
Common Questions
• When I run a BLAST job using WU-BLAST vs NCBI BLAST with the same query sequence, I get a different result. Both are based on the same algorithm but are different implementations. So why the difference?
Usually this is due to slight variation in the database version, but differences in BLAST program version also play a minor role. Usually the results do not change dramatically, but they do change a bit.
![Page 46: Fundamentals of Sequence Analysis](https://reader035.vdocuments.us/reader035/viewer/2022081514/56814625550346895db33203/html5/thumbnails/46.jpg)
Self Guided Exercises - BLAST
• If you need further help on BLAST:
• First READ, then try the problem set.
• BLAST Course: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
• BLAST Tutorial: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
• BLAST Quick Start (click on P for the problem set): http://www.ncbi.nlm.nih.gov/Class/minicourses/blastex2.html