day8 - new york botanical garden(goad and kanehisa 1982) nucleic acids research 50bkpjc eases 150 to...
TRANSCRIPT
1 #!/usr/bin/env node2 let fs = require('fs');3 let readline = require('readline');4 let a = false;5 let i = '';6 let r = false;
7 function checkFile(file){8 try {9 return(fs.statSync(file).isFile());
10 } catch(error){11 return(false);12 }13 }
14 for(let k = process.argv.length-1; k >= 0; k--){15 if(process.argv[k] === '-a'){16 a = true;17 } else if(process.argv[k] === '-i'){18 if(checkFile(process.argv[k+1]) === true){19 i = process.argv[k+1];20 }21 } else if(process.argv[k] === '-r'){22 r = true;23 }24 }
25 if(i.length >= 1){2627 } else {28 let o = '\nA JavaScript for processing GenBank DNA FASTA files.\n\nUsage:\n'29 + '-a\treduces names to GenBank accessions only (default = ' + a + ')\n'30 + '-i\tspecifies a GenBank input fasta file (required)\n'31 + '-r\tproduces the reverse complement of each sequence (default = ' + r +
')\n\n';32 process.stderr.write(o, 'UTF8');33 }
26 function format(n, s){27 if((typeof(n) === 'string') && (typeof(s) === 'string')){28 if((n.length > 2) && (s.length > 1)){29 let labels = n.replace(/^>/, '').split(' ');30 let o = '>';31 if(a === true){32 o += labels[0];33 } else {34 o += labels[1] + '_' + labels[2] + '_' + labels[0];35 }36 let b = s.toUpperCase().replace(/[^ACGTNVDBHWMRKSY]/g, '').split('');37 if(r === true){38 b = b.join('').replace(/[ACGTVDBHWMRKY]/g, function(x){39 return({40 A: 'T',41 C: 'G',42 G: 'C',43 T: 'A',44 V: 'B',45 D: 'H',46 B: 'V',47 H: 'D',48 M: 'K',49 R: 'Y',50 K: 'M',51 Y: 'R'52 }[x]);53 }).split('').reverse();54 }55 let l = b.length;56 for(let k = 0; k < l; k++){57 if((k % 80) === 0){58 o += '\n' + b[k];59 } else {60 o += b[k];61 }62 }63 return(o + '\n');64 }65 }
66 return('');67 }
68 let line = readline.createInterface({69 input: fs.createReadStream(i),70 output: process.stdout,71 terminal: false72 });73 let n = '';74 let nRE = new RegExp(/^>/);75 let s = '';76 line.on('line', function(line){77 line = line.trim();78 if(nRE.test(line) === true){79 process.stdout.write(format(n, s), 'UTF8');80 s = '';81 n = line;82 } else {83 s += line;84 }85 return(false);86 });87 line.on('close', function(){88 process.stdout.write(format(n, s), 'UTF8');89 return(false);90 });
searching for sequences...
DNA/RNA/protein sequences are ‘special’ textcase is (usually) meaningless
can be used to denote sequence/alignment qualityorientation is not (usually) important
sequences are archived in arbitrary orientation[protein sequences have just one orientation]
commonly coded as letters, but could be numbers, etc.‘query’ sequence used to find ‘reference’ sequence(s)
queries are entire sequence or sequence fragment(s)
...searching for sequences...
grepone reference sequence per linequery twice (both orientations)will find exact matches only (if query is a substring)maybe not so useful
tre-agrepone reference sequence per linespecify maximum allowable number of mismatchesquery twice (both orientations)similar results as megaBLAST (if query is a ‘substring’)
...searching for sequences...
hidden Markov model (e.g. HMMR)uses a probabilistic model of a sequence alignmentor a model of one sequence using a substitution matrixfinds similar sequences
machine learning (neural networks)a complex model of particular sequence typesused to find features (e.g. splice sites)
...searching for sequences
a more robust sequence search
alignment of query (both orientations) to all references
‘global’ alignment (all positions aligned)
‘local’ alignment (core most similar positions aligned)
compute similarity
edit distance (a.k.a. p–distance, raw distance)
similarity matrices
return the most similar sequences/fragments
effective, but slow (SEQHP: Goad and Kanehisa 1982)
Nucleic Acids Research
BOUSCKPJ BASES VWA TO 150. r----- ----teert--- ---r--rt--r---T.teeT ere rrerrrr---- _ _ . ....... I.... - - 1 .... ......
I
8
Ap
Figure 2. A forward path matrix for the comparison of bases 1050 to 1150 and1680to 1780 within the 5495 base segment of the mouse immunoglobulin K lightchain gene complex (Ref. 9).
5495 base-pair region of the mouse K immunoglobulin light chain.9 Figure 2shows a 101 base by 101 base sector of a forward path matrix, Fig. 3 thecorresponding reverse path matrix, and Fig. 4 their logical product.
It is only the path matrix that need be stored as a full I x J matrix; a
few rows of the D-matrix suffice--the row being calculated at any point andthe previous one (or in the generalization to follow, a few previous ones) on
which the calculation depends. And since the product path matrix can beformed as the elements of the reverse path matrix are calculated, storage foronly one matrix is required. We pack 20 elements per 60-bit word in the CDC7600.
252
C~~~~~~~~~ N
i, tc \\ N
O\ \ ~ ~ ~ ~ N
(Goad and Kanehisa 1982)
(Goad and Kanehisa 1982)Nucleic Acids Research
50BKPJC EASES 150 TO 1150
Figure 3. The corresponding reverse path matrix.
In the calculations of Figs. 2-5 we use a more general set of weightsthan was introduced in Eq. (1) above. These allow contiguous bases to bedeleted with weights that need not be proportional to their number. That is,we generalize Eq. (1) to add
-n if n contiguous bases are deleted, n =1,2..N-. (1')Then in the induction defined by Eq. (2) must consider an expanded set ofpossibilities:
D(i,j) = max{max(D(i-n,j) - ~n)' max(D(i,j-n) - 13n)' D(i-n,j-n) + 6) . (2')n n
For two sequences of length I and J the algorithm requires a number of opera-tions proportional to IJ(1 + Np); there is an obvious advantage in limitingNf3. Except where noted the calculations reported here have used al = 2; N =
A~~ ~ ~ ~ ~ ~ ~ ~ '
253
(Goad and Kanehisa 1982)Nucleic Acids Research
cminiwi s I TO NmmoT CC! -C AC-T-TTTT!SCTATCCGCTTGAOTCAACCCACACCACT TCCTGTCA^C^CTTCCGACCCCCGACCAAGCT CCAAA! AAAACC TAA^CTACTC TTCTCAAC
I
i0
Figure 4. The logical product of the two path matrices.
3' l = 2 = p3 = 3. This sets the concordance threshold at one mismatch ordeletion per three matches and allows from one to three bases to be units ofdeletion.
Further generalizations have been implemented, but will not be discussedfurther here; 6 may be replaced by a 4 x 4 matrix that provides differentweights for the various kinds of mismatches; or it may be evaluated for threebases at a time and made a function of the two codons juxtaposed. In any case
a relationship between the p's and mismatch penalties should be maintainedsuch that two adjacent deletions, one from each sequence, cannot produce a
lower penalty than the mismatches that they could replace.Any pair of subsequences whose overall degree of concordance exceeds the
threshold is represented by a continuous path in the product path matrix,
254
NN~N
NN~~~~~~~~~~~
\\\\\\\\ \ \~
\\ N
N~~~~~~~~~~~~~~~~
A \\\\\ \\ ~~~~N\N N N N
\\\\ \~~~~~~~~~
FASTP and FASTA...
Lipman and Pearson (1985); Pearson and Lipman (1988)
the ‘ancestor’ of BLAST
search for likely ‘similar’ sequences
use a pre–computed table of oligo by sequence position
compute offsets to find clustered similarities (diagonals)
estimate sequence similarity (number of shared oligos)
pick best segments
join segments, recalculate similarity with substitutions
local alignment including best segments
oligo by sequence positionseq oligo0 oligo1 oligo2 oligo3 oligo4 oligo5 ...
Q 1,28 20,22 15 5 2,10 12 ...
0 5,32 24,26 19 9 6,14 16 ...
1 1 – 28,13 – – – ...
2 – 1 – – – 7,11 ...
3 4,31 23,25 18 – 5,13 15,32 ...
oligo offsetsseq oligo0 oligo1 oligo2 oligo3 oligo4 oligo5 ...
Q 0,0 0,0 0 0 0,0 0 ...
0 4,4 4,4 4 4 4,4 4 ...
1 0,– – 13,– – – – ...
2 – -19 – – – -5,– ...
3 3,3 3,3 3 – 3,3 3,– ...
Proc. Natl. Acad. Sci. USA 85 (1988) 2445
A 50 100
50
100
50
100
N
''\X\\\'
* \\' \\
\\ * '~\' \
C50 100
B
FIG. 1. Identification of sequence similarities by FASTA. Thefour steps used by the FASTA program to calculate the initial andoptimal similarity scores between two sequences are shown. (A)Identify regions of identity. (B) Scan the regions using a scoringmatrix and save the best initial regions. Initial regions with scores
less than the joining threshold (27) are dashed. The asterisk denotesthe highest scoring region reported by FASTP. (C) Optimally joininitial regions with scores greater than a threshold. The solid linesdenote regions that are joined to make up the optimized initial score.
(D) Recalculate an optimized alignment centered around the highestscoring initial region. The dotted lines denote the bounds of theoptimized alignment. The result of this alignment is reported as theoptimized score.
much closer to the optimized score for many sequences. Infact, unlike FASTP, the FASTA method may yield initialscores that are higher than the corresponding optimizedscores.
Local Similarity Analyses. Molecular biologists are ofteninterested in the detection of similar subsequences withinlonger sequences. In contrast to FASTP and FASTA, whichreport only the one highest scoring alignment between twosequences, local sequence comparison tools can identifymultiple alignments between smaller portions of two se-
quences. Local similarity searches can clearly show theresults of gene duplications (see Fig. 2) or repeated struc-tural features (see Fig. 3) and are frequently displayed usinga "graphic matrix" plot (7), which allows one to detectregions of local similarity by eye. Optimal algorithms forsensitive local sequence comparison (6, 8, 9) can havetremendous computational requirements in time and mem-
ory, which make them impractical on microcomputers and,when comparing longer sequences, on larger machines as
well.The program for detecting local similarities, LFASTA,
uses the same first two steps for finding initial regions thatFASTA uses. However, instead of saving 10 initial regions,LFASTA saves all diagonal regions with similarity scores
greater than a threshold. LFASTA and FASTA also differ inthe construction of optimized alignments. Instead of focus-ing on a single region, LFASTA computes a local alignmentfor each initial region. Thus LFASTA considers all of theinitial regions shown in Fig. 1B, instead of just the diagonalshown in Fig. 1D. Furthermore, LFASTA considers not
50 100 only the band around each initial region but also potentialsequence alignments for some distance before and after theinitial region. Starting at the end of the initial region, anoptimization (6) proceeds in the reverse direction until allpossible alignment scores have gone to zero. The location ofthe maximal local similarity score in the reverse direction isthen used to start a second optimization that proceeds in theforward direction. An optimal path starting from the forwardmaximum is then displayed (5). The local homologies can bedisplayed as sequence alignments (see Fig. 2B) or on atwo-dimensional graphic matrix style plot (see Figs. 2A and3).
Statistical Significance. The rapid sequence comparisonalgorithms we have developed also provide additional toolsfor evaluating the statistical significance of an alignment.There are approximately 5000 protein sequences, with 1.1million amino acid residues, in the NBRF protein sequence
library, and any computer program that searches the libraryby calculating a similarity score for each sequence in thelibrary will find a highest scoring sequence, regardless ofwhether the alignment between the query and library se-
quence is biologically meaningful or not. Accompanying theprevious version of FASTP was a program for the evaluationof statistical significance, RDF, which compares one se-
quence with randomly permuted versions of the potentiallyrelated sequence.
We have written a new version of RDF (RDF2) that hasseveral improvements. (i) RDF2 calculates three scores foreach shuffled sequence: one from the best single initial region(as found by FASTP), a second from the joined initial regions(used by FASTA), and a third from the optimized diagonal.(it) RDF2 can be used to evaluate amino acid or DNAsequences and allows the user to specify the scoring matrix tobe employed. Thus sequences found using the PAM250scoring matrix can be evaluated using the identity or geneticcode matrix. (iii) The user may specify either a global or localshuffle routine.
Locally biased amino acid or nucleotide composition isperhaps the most common reason for high similarity scores
of dubious biological significance (10). High scoring align-ments between query and library sequences may be due topatches of hydrophobic or charged amino acid residues or toA + T- or G + C-rich regions in DNA. A simple Monte Carloshuffle analysis that constructs random sequences by takingeach residue in one sequence and placing it randomly alongthe length of the new sequence will break up these patches ofbiased composition. As a result, the scores of the shuffledsequences may be much lower than those of the unshuffledsequence, and the sequences will appear to be related.Alternatively, shuffled sequences can be constructed bypermuting small blocks of 10 or 20 residues so that, while theorder of the sequence is destroyed, the local composition isnot. By shuffling the residues within short blocks along thesequence, patches of G + C- or A+ T-rich regions in DNA,for example, are undisturbed. Evaluating significance with a
local shuffle is more stringent than the global approach, andthere may be some circumstances in which both should beused in conjunction. Whereas two proteins that share a
common evolutionary ancestor may have clearly significantsimilarity scores using either shuffling strategy, proteinsrelated because of secondary structure or hydropathic pro-
file may have similarity scores whose significance decreasesdramatically when the results of global and local shufflingare compared.
Implementation. The FASTA/LFASTA package of se-
quence analysis tools is written in the C programming lan-guage and has been implemented under the Unix, VAX/VMS, and IBM PC DOS operating systems. Versions of theprogram that run on the IBM PC are limited to query se-
50
100.
\l
6F
1
I
I
Biochemistry: Pearson and Lipman
16
(Pearson and Lipman 1988)
aa substitution matrices...
derived from ‘curated’ multiple sequence alignments
highly sample dependent
assume alignment is correct, sequences are homologous
PAM (Dayhoff, Schwartz, and Orcutt 1978)
Point Accepted Mutation
PAMx: x = x mutations per 100 amino acids
derived from alignments of protein ‘families’
...aa substitution matrices
BLOSUM (Henikoff and Henikoff 1992)
BLOcks of Amino Acid SUbstitution Matrix
BLOSUMx: x = similarity used to merge sequences prior to making the matrix
higher numbers are for use with more similar sequences
derived from indel free alignment segments
original BLOSUM62 has an error, use corrected version
nt substitution matricesformulae can be used, but generally are not
JC69 (Jukes and Cantor 1969), K80 (K2P; Kimura 1980), HKY85 (Hasegawa, Kishino, and Yano 1985), T92 (Tamura 1992), GTR (Tavaré 1986)
nucleotide matrices not usually based on empirical dataequal weights commonly used+5/-4 or +1/-2 (match vs. mismatch)
ATGCCTGCACGC |||| |||| || ATGCATGCATGC 111121111211
= 10-4 = 6 (50%)
ATGCCTGCACGC |||| |||| || ATGCATGCATGC 555545555455
= 50-8 = 42 (70%)
...FASTP and FASTA...
assessment of statistical significance
using RDF or RDF2
permute (shuffle) reference sequences
the entire database or just similar sequences (faster)
number of positions and base composition constant
order jumbled
recalculate similarity
...FASTP and FASTA
p–values are not calculated (distribution not normal)
z–values used instead
z = (similarity - mean of permuted similarity) / standard deviation of permuted similarity
z > 3 = possibly significant (0.0013%)
z > 6 = probably significant
z > 10 = significant
(Lipman and Pearson 1985)
FASTA vs. BLAST
BLAST has larger default word size (== faster)
can be less ‘sensitive’ (ignores low similarly sequences)
BLAST focuses on most informative k–mers
improved treatment of indels
‘small’ indels and/or ‘short’ segments of mismatch between matching words are tolerated
different methods of calculating significance
BLAST provides a ‘significance’ of match assuming that search settings are correct
BLAST flavors…Altschul et al. (1990, 1997)
the ‘original’ source of algorithms1997 overcomes non–matching regions (e.g. indels)freely available from NCBI (binary and code)
Kent (2002; BLAT)faster in some circumstances (e.g. large batches)designed to search against whole genomesfreely available from the author (binary and code)
Gish (2006; WU-BLAST)same algorithm as NCBI BLAST, but fasterdifferent default settings => different resultsbinary (possibly available), but code is not distributed
…BLAST flavors…
Nguyen and Lavenier (2009; PLAST)parallel version3–6× faster than multithreaded BLASTfreely available from the authors (binary and code)
Panagiotis and Sahinidis (2011; GPU-BLAST)requires a CUDA graphics processorup to 10× fasterprotein sequences onlyfreely available from the authors (binary and code)
…BLAST flavors
Zhao and Chu (2014; G-BLASTN)requires a CUDA graphics processorup to 14× fasternucleotide sequences only
inefficient on short sequences (> 1,000,000 bp)database must fit into the GPU memoryfreely available from the authors (binary and code)