Database search
Overview:
1. FastA: suitable for protein sequence searching (nucleotide searches are also supported)
2. BLAST: suitable for DNA, RNA, and protein sequence searching
FastA
History: FastA grew out of FASTP, developed by Lipman and Pearson in 1985, and was among the first widely used database search programs.
EBI provides a FastA service, available at
http://www.ebi.ac.uk/Tools/fasta/
Idea: identify short substrings (words, or k-tuples) that match between the query and the target sequence.
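The word-matching idea can be sketched in a few lines of Python. This is a minimal illustration, not the FastA implementation: the function names `shared_words` and `best_diagonal`, the word length `k=2`, and the short `target` string are all assumptions made for the example. The query is the protein sequence used later in these slides.

```python
from collections import defaultdict


def shared_words(query, target, k=2):
    """Return (query_pos, target_pos) pairs where the same k-letter word occurs."""
    # Index every k-letter word of the target by its start positions.
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    # Look up each k-letter word of the query in that index.
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


def best_diagonal(hits):
    """Return (offset, count) for the diagonal target_pos - query_pos with most hits."""
    counts = defaultdict(int)
    for i, j in hits:
        counts[j - i] += 1
    if not counts:
        return None
    return max(counts.items(), key=lambda item: item[1])


query = "EDCIAVGQLCVFWNIGRPCCSGLCVFACTVKLP"
target = "GQLCVFWNIG"  # hypothetical target, chosen for illustration
hits = shared_words(query, target, k=2)
print(best_diagonal(hits))
```

Word hits that fall on the same diagonal suggest a region where the two sequences line up without gaps, which is where a more careful alignment would be attempted next.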
Other commonly used software:
http://www.ebi.ac.uk/Tools/sss/
Example: protein sequence: EDCIAVGQLCVFWNIGRPCCSGLCVFACTVKLP
Input: sequence, parameters, database selection
Results: one hit with 100% identity; another with 17/28 = 60.7% identity in a 28 aa overlap
BLAST
Basic Local Alignment Search Tool (BLAST).
BLAST was developed by NCBI.
BLAST finds regions of similarity between biological sequences.
Basic BLAST

Program   Query sequence   Database     Program description
blastn    Nucleotide       Nucleotide   Search a nucleotide database using a nucleotide query. Algorithms: blastn, megablast, discontiguous megablast
blastp    Protein          Protein      Search a protein database using a protein query. Algorithms: blastp, psi-blast, phi-blast, delta-blast
blastx    Nucleotide       Protein      Search a protein database using a translated nucleotide query
tblastn   Protein          Nucleotide   Search a translated nucleotide database using a protein query
tblastx   Nucleotide       Nucleotide   Search a translated nucleotide database using a translated nucleotide query

t: translation, n: nucleotide, p: protein, x: cross
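The program table above amounts to a lookup from query type and database type to a program name. A small Python sketch of that lookup follows; the dictionary `BLAST_PROGRAMS` and the function `candidate_programs` are names made up for this illustration and are not part of BLAST.

```python
# (query type, database type) for each program, as listed in the table above.
BLAST_PROGRAMS = {
    "blastn": ("nucleotide", "nucleotide"),
    "blastp": ("protein", "protein"),
    "blastx": ("nucleotide", "protein"),
    "tblastn": ("protein", "nucleotide"),
    "tblastx": ("nucleotide", "nucleotide"),
}


def candidate_programs(query_type, database_type):
    """Return the programs that accept this query/database combination."""
    wanted = (query_type, database_type)
    return sorted(name for name, types in BLAST_PROGRAMS.items() if types == wanted)


print(candidate_programs("protein", "nucleotide"))
print(candidate_programs("nucleotide", "nucleotide"))
```

A nucleotide query against a nucleotide database matches two programs: blastn compares the sequences directly, while tblastx translates both sides before comparing.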
BLASTALL
Query sequence: amino acid sequence or DNA sequence
Amino acid query: BLASTp (protein database), TBLASTn (translated nucleotide database)
DNA query: BLASTn (nucleotide database), BLASTx (translated query, protein database), TBLASTx (translated query, translated nucleotide database)
Blast source
1. NCBI: http://blast.ncbi.nlm.nih.gov/Blast.cgi/ (online version)
ftp://ftp.ncbi.nih.gov/blast/ (standalone)
2. Other websites: http://life.zsu.edu.cn/blast/
http://www.fruitfly.org/blast/
http://www.mcgb.uestc.edu.cn/blast/blast.html
…
BLAST
1. Online: use it from the website
2. Standalone: download the software
Comparison between them
Web server advantages: 1. Easy to use. 2. Kept up to date. 3. No need to download databases.
Web server disadvantages: 1. Not suitable for large data sets. 2. You cannot define your own database.
Web BLAST provided by NCBI
Blastn for nucleotide
Blastp for protein
http://blast.ncbi.nlm.nih.gov/Blast.cgi
An example:
1. cctggcgataaccgtcttgtcggcggttgcgctgacgttgcgtcgtgatatcatcagggcAgaccggttacatccccctaa
2. gatcgaaaaacgcttgtgttaaaaatttgctaaattttgccaatttggtaaaacagttgcAtcacaacaggagatagcaat
Input form: the first sequence, the second sequence, sequence range, software (similarity from high to low), results shown in a new window
Results of pairwise alignment: No significant similarity found. The results page also shows information about the two sequences and the parameters selected.
Why do we need the standalone version of BLAST?
1. Specific database
2. Privacy
3. Batch processing
Blast (standalone version)
How to download BLAST: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release
File: blast-2.2.23-ia32-win32.exe
After unzipping, we get three folders:
bin: all the exe files
data: data for BLAST
doc: readme
We need to format the database for BLAST.
First, save your database in FASTA format. Second, use formatdb, provided in the BLAST package, to format the database.
DOS command: formatdb -i sequence.fa -p T/F -o T/F -n db_name
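Before running formatdb, it can help to check that the database file really is in FASTA format, that is, a ">" header line followed by one or more sequence lines per record. The reader below is a minimal sketch for that check; the function name `read_fasta` and the record names `example_1` and `example_2` are assumptions for illustration, and the second sequence is made up.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-record file; the first sequence is wrapped over two lines.
example = """>example_1
EDCIAVGQLCVFWNIG
RPCCSGLCVFACTVKLP
>example_2
ACDEFGHIK
"""
records = read_fasta(example.splitlines())
for name, sequence in records:
    print(name, len(sequence))
```

For a real file, the same function can be called on an open file object, for example `read_fasta(open("sequence.fa"))`.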
An example
1. The file “Delta.txt” contains 13 proteins and serves as the database.
2. One protein is selected as the query sequence and stored in the file “seq.txt”.
1. Format Delta.txt:
formatdb -i Delta.txt -p T
Parameters: 1. -i: database file 2. -p: T for protein, F for nucleotide
2. Search Delta.txt using BLAST:
blastall -p blastp -d Delta.txt -i seq.txt -o out.txt
Parameters: 1. -p: program name (blastp, blastn, blastx, tblastn, tblastx) 2. -d: database name 3. -i: query sequences 4. -o: output file
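For batch processing, the two commands of this example can be assembled from a script. The sketch below only builds and prints the command lines, using the flags given in these slides; the helper names `formatdb_command` and `blastall_command` are hypothetical. Actually running them assumes the legacy BLAST executables are installed and on the PATH.

```python
import shlex


def formatdb_command(database, is_protein=True):
    """Argument list for formatting a FASTA file as a BLAST database."""
    return ["formatdb", "-i", database, "-p", "T" if is_protein else "F"]


def blastall_command(program, database, query, output):
    """Argument list for searching a formatted database with blastall."""
    return ["blastall", "-p", program, "-d", database, "-i", query, "-o", output]


commands = [
    formatdb_command("Delta.txt"),
    blastall_command("blastp", "Delta.txt", "seq.txt", "out.txt"),
]
for args in commands:
    print(shlex.join(args))
    # To execute instead of printing (assumes the binaries are installed):
    # import subprocess; subprocess.run(args, check=True)
```

Keeping each command as a list of arguments, rather than one string, avoids quoting problems when file names contain spaces.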
3. To see the other parameters, just type blastall
4. Results:
                                              Score    E
Sequences producing significant alignments:   (bits)   Value
P83301|CXO_CONVE                              69       1e-017
P69749|CXD6A_CONBU                            20       0.009
P69750|CXD6A_CONCN                            18       0.036
P24159|CXDB_CONTE P18511|CXDA_CONTE           18       0.042
P60179|CXD66_CONAA                            17       0.066
P60513|CXD6A_CONER                            17       0.11
P69751|CXD6E_CONCT P69748|CXD6A_CONAI         16       0.19
P69754|CXD6B_CONMA P69753|CXD6A_CONMA         14       0.56
P69752|CXD6B_CONER P58913|CXD6A_CONPU         14       0.62
P69756|CXD6D_CONMA P69755|CXD6C_CONMA         13       0.89
Q9XZK5|CXSO6_CONST P69757|CXD6A_CONSE         12       2.6
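Each summary line carries an identifier, a bit score, and an E value, and a lower E value means a more significant hit. The sketch below parses the first three summary lines and keeps the hits under a cutoff. The function names and the cutoff of 0.01 are assumptions for illustration, not BLAST defaults.

```python
def parse_hit(line):
    """Split a summary line into (identifier, bit score, E value)."""
    identifier, score, evalue = line.split()
    return identifier, float(score), float(evalue)


def significant_hits(lines, max_evalue):
    """Keep the hits whose E value does not exceed max_evalue."""
    hits = [parse_hit(line) for line in lines if line.strip()]
    return [hit for hit in hits if hit[2] <= max_evalue]


# The first three summary lines from the results above.
summary = [
    "P83301|CXO_CONVE 69 1e-017",
    "P69749|CXD6A_CONBU 20 0.009",
    "P69750|CXD6A_CONCN 18 0.036",
]
for identifier, score, evalue in significant_hits(summary, max_evalue=0.01):
    print(identifier, score, evalue)
```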
>P83301|CXO_CONVE Length = 33
Score = 69.3 bits (168), Expect = 1e-017, Method: Compositional matrix adjust.
Identities = 33/33 (100%)
Query: 1 EDCIAVGQLCVFWNIGRPCCSGLCVFACTVKLP 33
         EDCIAVGQLCVFWNIGRPCCSGLCVFACTVKLP
Sbjct: 1 EDCIAVGQLCVFWNIGRPCCSGLCVFACTVKLP 33
>P69749|CXD6A_CONBU Length = 27
Score = 20.0 bits (40), Expect = 0.009, Method: Compositional matrix adjust.
Identities = 13/30 (43%), Gaps = 6/30 (20%)
Query: 1 EDCIAVGQLCVFWNIGRP--CCSGLCVFAC 28
           C A G  C      RP  CCS  C FAC
Sbjct: 1 DECSAPGAFCLI----RPGLCCSEFCFFAC 26
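The Identities and Gaps figures in the report can be recomputed from the two aligned rows. In the sketch below the gap positions are written as "-", which is an assumption about how the gaps in this alignment are laid out (two in the query row, four in the subject row); the function name `alignment_stats` is made up for the example.

```python
def alignment_stats(query_row, subject_row, gap="-"):
    """Return (identities, gaps, alignment length) for two aligned rows."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    pairs = list(zip(query_row, subject_row))
    # An identity is a column where both rows show the same residue.
    identities = sum(1 for q, s in pairs if q == s and q != gap)
    # A gap column has a gap character in either row.
    gaps = sum(1 for q, s in pairs if q == gap or s == gap)
    return identities, gaps, len(pairs)


query_row = "EDCIAVGQLCVFWNIGRP--CCSGLCVFAC"
subject_row = "DECSAPGAFCLI----RPGLCCSEFCFFAC"
identities, gaps, length = alignment_stats(query_row, subject_row)
print(f"Identities = {identities}/{length} ({100 * identities / length:.0f}%)")
print(f"Gaps = {gaps}/{length} ({100 * gaps / length:.0f}%)")
```

Note that the percentages use the alignment length (30 columns, gaps included) as the denominator, not the length of either sequence.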
5. Pairwise alignment:
bl2seq -p blastp -i seq.txt -j 1.txt -o out.txt
Parameters: 1. -p: program name (blastp, blastn, …) 2. -i: first sequence 3. -j: second sequence 4. -o: output file
To see the other parameters, just type bl2seq
6. Databases can be downloaded from:
ftp://ftp.ncbi.nih.gov/blast/db/
Scoring matrices can be downloaded from:
ftp://ftp.ncbi.nih.gov/blast/matrices/
PSI-BLAST
Position-Specific Iterative BLAST (PSI-BLAST).
Altschul et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389-3402.
Target: proteins only
Position-Specific Iterative BLAST (PSI-BLAST) refers to a feature of BLAST 2.0 in which a profile is automatically constructed from the first set of BLAST alignments. PSI-BLAST is similar to NCBI BLAST2 except that it uses position-specific scoring matrices derived during the search. This tool is used to detect distant evolutionary relationships.
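To make the idea of a position-specific scoring matrix concrete, here is a deliberately simplified sketch: each alignment column gets its own log-odds score per residue, so a conserved position rewards the conserved residue more than a variable position does. This is not the PSI-BLAST procedure itself. The uniform background frequency, the fixed pseudocount of 1, the absence of sequence weighting, the function names, and the three short aligned sequences are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build one dict of residue -> log-odds score per alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # uniform background: a simplification
    pssm = []
    for col in range(length):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(seq[col] for seq in aligned if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score(pssm, seq):
    """Sum the position-specific scores of seq against the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


aligned = ["CCSGLC", "CCSEFC", "CCSGFC"]  # hypothetical aligned hits
pssm = build_pssm(aligned)
print(round(score(pssm, "CCSGLC"), 2), round(score(pssm, "AAAAAA"), 2))
```

In an iterative search, the matrix built from one round of hits would be used as the scoring scheme for the next round, which is how weakly similar sequences can be picked up.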
Online sources:
http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_psiblast.html
http://blast.ncbi.nlm.nih.gov/Blast.cgi
http://www.ebi.ac.uk/Tools/blastpgp/