day8 - new york botanical garden(goad and kanehisa 1982) nucleic acids research 50bkpjc eases 150 to...

1 #!/usr/bin/env node2 let fs = require('fs');3 let readline = require('readline');4 let a = false;5 let i = '';6 let r = false;

7 function checkFile(file){8 try {9 return(fs.statSync(file).isFile());

10 } catch(error){11 return(false);12 }13 }

14 for(let k = process.argv.length-1; k >= 0; k--){15 if(process.argv[k] === '-a'){16 a = true;17 } else if(process.argv[k] === '-i'){18 if(checkFile(process.argv[k+1]) === true){19 i = process.argv[k+1];20 }21 } else if(process.argv[k] === '-r'){22 r = true;23 }24 }

25 if(i.length >= 1){2627 } else {28 let o = '\nA JavaScript for processing GenBank DNA FASTA files.\n\nUsage:\n'29 + '-a\treduces names to GenBank accessions only (default = ' + a + ')\n'30 + '-i\tspecifies a GenBank input fasta file (required)\n'31 + '-r\tproduces the reverse complement of each sequence (default = ' + r +

')\n\n';32 process.stderr.write(o, 'UTF8');33 }

26 function format(n, s){27 if((typeof(n) === 'string') && (typeof(s) === 'string')){28 if((n.length > 2) && (s.length > 1)){29 let labels = n.replace(/^>/, '').split(' ');30 let o = '>';31 if(a === true){32 o += labels[0];33 } else {34 o += labels[1] + '_' + labels[2] + '_' + labels[0];35 }36 let b = s.toUpperCase().replace(/[^ACGTNVDBHWMRKSY]/g, '').split('');37 if(r === true){38 b = b.join('').replace(/[ACGTVDBHWMRKY]/g, function(x){39 return({40 A: 'T',41 C: 'G',42 G: 'C',43 T: 'A',44 V: 'B',45 D: 'H',46 B: 'V',47 H: 'D',48 M: 'K',49 R: 'Y',50 K: 'M',51 Y: 'R'52 }[x]);53 }).split('').reverse();54 }55 let l = b.length;56 for(let k = 0; k < l; k++){57 if((k % 80) === 0){58 o += '\n' + b[k];59 } else {60 o += b[k];61 }62 }63 return(o + '\n');64 }65 }

66 return('');67 }

68 let line = readline.createInterface({69 input: fs.createReadStream(i),70 output: process.stdout,71 terminal: false72 });73 let n = '';74 let nRE = new RegExp(/^>/);75 let s = '';76 line.on('line', function(line){77 line = line.trim();78 if(nRE.test(line) === true){79 process.stdout.write(format(n, s), 'UTF8');80 s = '';81 n = line;82 } else {83 s += line;84 }85 return(false);86 });87 line.on('close', function(){88 process.stdout.write(format(n, s), 'UTF8');89 return(false);90 });

searching for sequences...

DNA/RNA/protein sequences are ‘special’ textcase is (usually) meaningless

can be used to denote sequence/alignment qualityorientation is not (usually) important

sequences are archived in arbitrary orientation[protein sequences have just one orientation]

commonly coded as letters, but could be numbers, etc.‘query’ sequence used to find ‘reference’ sequence(s)

queries are entire sequence or sequence fragment(s)

...searching for sequences...

grepone reference sequence per linequery twice (both orientations)will find exact matches only (if query is a substring)maybe not so useful

tre-agrepone reference sequence per linespecify maximum allowable number of mismatchesquery twice (both orientations)similar results as megaBLAST (if query is a ‘substring’)

...searching for sequences...

hidden Markov model (e.g. HMMR)uses a probabilistic model of a sequence alignmentor a model of one sequence using a substitution matrixfinds similar sequences

machine learning (neural networks)a complex model of particular sequence typesused to find features (e.g. splice sites)

...searching for sequences

a more robust sequence search

alignment of query (both orientations) to all references

‘global’ alignment (all positions aligned)

‘local’ alignment (core most similar positions aligned)

compute similarity

edit distance (a.k.a. p–distance, raw distance)

similarity matrices

return the most similar sequences/fragments

effective, but slow (SEQHP: Goad and Kanehisa 1982)

Nucleic Acids Research

BOUSCKPJ BASES VWA TO 150. r----- ----teert--- ---r--rt--r---T.teeT ere rrerrrr---- _ _ . ....... I.... - - 1 .... ......

I

8

Ap

Figure 2. A forward path matrix for the comparison of bases 1050 to 1150 and1680to 1780 within the 5495 base segment of the mouse immunoglobulin K lightchain gene complex (Ref. 9).

5495 base-pair region of the mouse K immunoglobulin light chain.9 Figure 2shows a 101 base by 101 base sector of a forward path matrix, Fig. 3 thecorresponding reverse path matrix, and Fig. 4 their logical product.

It is only the path matrix that need be stored as a full I x J matrix; a

few rows of the D-matrix suffice--the row being calculated at any point andthe previous one (or in the generalization to follow, a few previous ones) on

which the calculation depends. And since the product path matrix can beformed as the elements of the reverse path matrix are calculated, storage foronly one matrix is required. We pack 20 elements per 60-bit word in the CDC7600.

252

C~~~~~~~~~ N

i, tc \\ N

O\ \ ~ ~ ~ ~ N

(Goad and Kanehisa 1982)

(Goad and Kanehisa 1982)Nucleic Acids Research

50BKPJC EASES 150 TO 1150

Figure 3. The corresponding reverse path matrix.

In the calculations of Figs. 2-5 we use a more general set of weightsthan was introduced in Eq. (1) above. These allow contiguous bases to bedeleted with weights that need not be proportional to their number. That is,we generalize Eq. (1) to add

-n if n contiguous bases are deleted, n =1,2..N-. (1')Then in the induction defined by Eq. (2) must consider an expanded set ofpossibilities:

D(i,j) = max{max(D(i-n,j) - ~n)' max(D(i,j-n) - 13n)' D(i-n,j-n) + 6) . (2')n n

For two sequences of length I and J the algorithm requires a number of opera-tions proportional to IJ(1 + Np); there is an obvious advantage in limitingNf3. Except where noted the calculations reported here have used al = 2; N =

A~~ ~ ~ ~ ~ ~ ~ ~ '

253

(Goad and Kanehisa 1982)Nucleic Acids Research

cminiwi s I TO NmmoT CC! -C AC-T-TTTT!SCTATCCGCTTGAOTCAACCCACACCACT TCCTGTCA^C^CTTCCGACCCCCGACCAAGCT CCAAA! AAAACC TAA^CTACTC TTCTCAAC

I

i0

Figure 4. The logical product of the two path matrices.

3' l = 2 = p3 = 3. This sets the concordance threshold at one mismatch ordeletion per three matches and allows from one to three bases to be units ofdeletion.

Further generalizations have been implemented, but will not be discussedfurther here; 6 may be replaced by a 4 x 4 matrix that provides differentweights for the various kinds of mismatches; or it may be evaluated for threebases at a time and made a function of the two codons juxtaposed. In any case

a relationship between the p's and mismatch penalties should be maintainedsuch that two adjacent deletions, one from each sequence, cannot produce a

lower penalty than the mismatches that they could replace.Any pair of subsequences whose overall degree of concordance exceeds the

threshold is represented by a continuous path in the product path matrix,

254

NN~N

NN~~~~~~~~~~~

\\\\\\\\ \ \~

\\ N

N~~~~~~~~~~~~~~~~

A \\\\\ \\ ~~~~N\N N N N

\\\\ \~~~~~~~~~

FASTP and FASTA...

Lipman and Pearson (1985); Pearson and Lipman (1988)

the ‘ancestor’ of BLAST

search for likely ‘similar’ sequences

use a pre–computed table of oligo by sequence position

compute offsets to find clustered similarities (diagonals)

estimate sequence similarity (number of shared oligos)

pick best segments

join segments, recalculate similarity with substitutions

local alignment including best segments

oligo by sequence positionseq oligo0 oligo1 oligo2 oligo3 oligo4 oligo5 ...

Q 1,28 20,22 15 5 2,10 12 ...

0 5,32 24,26 19 9 6,14 16 ...

1 1 – 28,13 – – – ...

2 – 1 – – – 7,11 ...

3 4,31 23,25 18 – 5,13 15,32 ...

oligo offsetsseq oligo0 oligo1 oligo2 oligo3 oligo4 oligo5 ...

Q 0,0 0,0 0 0 0,0 0 ...

0 4,4 4,4 4 4 4,4 4 ...

1 0,– – 13,– – – – ...

2 – -19 – – – -5,– ...

3 3,3 3,3 3 – 3,3 3,– ...

Proc. Natl. Acad. Sci. USA 85 (1988) 2445

A 50 100

50

100

50

100

N

''\X\\\'

* \\' \\

\\ * '~\' \

C50 100

B

FIG. 1. Identification of sequence similarities by FASTA. Thefour steps used by the FASTA program to calculate the initial andoptimal similarity scores between two sequences are shown. (A)Identify regions of identity. (B) Scan the regions using a scoringmatrix and save the best initial regions. Initial regions with scores

less than the joining threshold (27) are dashed. The asterisk denotesthe highest scoring region reported by FASTP. (C) Optimally joininitial regions with scores greater than a threshold. The solid linesdenote regions that are joined to make up the optimized initial score.

(D) Recalculate an optimized alignment centered around the highestscoring initial region. The dotted lines denote the bounds of theoptimized alignment. The result of this alignment is reported as theoptimized score.

much closer to the optimized score for many sequences. Infact, unlike FASTP, the FASTA method may yield initialscores that are higher than the corresponding optimizedscores.

Local Similarity Analyses. Molecular biologists are ofteninterested in the detection of similar subsequences withinlonger sequences. In contrast to FASTP and FASTA, whichreport only the one highest scoring alignment between twosequences, local sequence comparison tools can identifymultiple alignments between smaller portions of two se-

quences. Local similarity searches can clearly show theresults of gene duplications (see Fig. 2) or repeated struc-tural features (see Fig. 3) and are frequently displayed usinga "graphic matrix" plot (7), which allows one to detectregions of local similarity by eye. Optimal algorithms forsensitive local sequence comparison (6, 8, 9) can havetremendous computational requirements in time and mem-

ory, which make them impractical on microcomputers and,when comparing longer sequences, on larger machines as

well.The program for detecting local similarities, LFASTA,

uses the same first two steps for finding initial regions thatFASTA uses. However, instead of saving 10 initial regions,LFASTA saves all diagonal regions with similarity scores

greater than a threshold. LFASTA and FASTA also differ inthe construction of optimized alignments. Instead of focus-ing on a single region, LFASTA computes a local alignmentfor each initial region. Thus LFASTA considers all of theinitial regions shown in Fig. 1B, instead of just the diagonalshown in Fig. 1D. Furthermore, LFASTA considers not

50 100 only the band around each initial region but also potentialsequence alignments for some distance before and after theinitial region. Starting at the end of the initial region, anoptimization (6) proceeds in the reverse direction until allpossible alignment scores have gone to zero. The location ofthe maximal local similarity score in the reverse direction isthen used to start a second optimization that proceeds in theforward direction. An optimal path starting from the forwardmaximum is then displayed (5). The local homologies can bedisplayed as sequence alignments (see Fig. 2B) or on atwo-dimensional graphic matrix style plot (see Figs. 2A and3).

Statistical Significance. The rapid sequence comparisonalgorithms we have developed also provide additional toolsfor evaluating the statistical significance of an alignment.There are approximately 5000 protein sequences, with 1.1million amino acid residues, in the NBRF protein sequence

library, and any computer program that searches the libraryby calculating a similarity score for each sequence in thelibrary will find a highest scoring sequence, regardless ofwhether the alignment between the query and library se-

quence is biologically meaningful or not. Accompanying theprevious version of FASTP was a program for the evaluationof statistical significance, RDF, which compares one se-

quence with randomly permuted versions of the potentiallyrelated sequence.

We have written a new version of RDF (RDF2) that hasseveral improvements. (i) RDF2 calculates three scores foreach shuffled sequence: one from the best single initial region(as found by FASTP), a second from the joined initial regions(used by FASTA), and a third from the optimized diagonal.(it) RDF2 can be used to evaluate amino acid or DNAsequences and allows the user to specify the scoring matrix tobe employed. Thus sequences found using the PAM250scoring matrix can be evaluated using the identity or geneticcode matrix. (iii) The user may specify either a global or localshuffle routine.

Locally biased amino acid or nucleotide composition isperhaps the most common reason for high similarity scores

of dubious biological significance (10). High scoring align-ments between query and library sequences may be due topatches of hydrophobic or charged amino acid residues or toA + T- or G + C-rich regions in DNA. A simple Monte Carloshuffle analysis that constructs random sequences by takingeach residue in one sequence and placing it randomly alongthe length of the new sequence will break up these patches ofbiased composition. As a result, the scores of the shuffledsequences may be much lower than those of the unshuffledsequence, and the sequences will appear to be related.Alternatively, shuffled sequences can be constructed bypermuting small blocks of 10 or 20 residues so that, while theorder of the sequence is destroyed, the local composition isnot. By shuffling the residues within short blocks along thesequence, patches of G + C- or A+ T-rich regions in DNA,for example, are undisturbed. Evaluating significance with a

local shuffle is more stringent than the global approach, andthere may be some circumstances in which both should beused in conjunction. Whereas two proteins that share a

common evolutionary ancestor may have clearly significantsimilarity scores using either shuffling strategy, proteinsrelated because of secondary structure or hydropathic pro-

file may have similarity scores whose significance decreasesdramatically when the results of global and local shufflingare compared.

Implementation. The FASTA/LFASTA package of se-

quence analysis tools is written in the C programming lan-guage and has been implemented under the Unix, VAX/VMS, and IBM PC DOS operating systems. Versions of theprogram that run on the IBM PC are limited to query se-

50

100.

\l

6F

1

I

I

Biochemistry: Pearson and Lipman

16

(Pearson and Lipman 1988)

aa substitution matrices...

derived from ‘curated’ multiple sequence alignments

highly sample dependent

assume alignment is correct, sequences are homologous

PAM (Dayhoff, Schwartz, and Orcutt 1978)

Point Accepted Mutation

PAMx: x = x mutations per 100 amino acids

derived from alignments of protein ‘families’

...aa substitution matrices

BLOSUM (Henikoff and Henikoff 1992)

BLOcks of Amino Acid SUbstitution Matrix

BLOSUMx: x = similarity used to merge sequences prior to making the matrix

higher numbers are for use with more similar sequences

derived from indel free alignment segments

original BLOSUM62 has an error, use corrected version

nt substitution matricesformulae can be used, but generally are not

JC69 (Jukes and Cantor 1969), K80 (K2P; Kimura 1980), HKY85 (Hasegawa, Kishino, and Yano 1985), T92 (Tamura 1992), GTR (Tavaré 1986)

nucleotide matrices not usually based on empirical dataequal weights commonly used+5/-4 or +1/-2 (match vs. mismatch)

ATGCCTGCACGC |||| |||| || ATGCATGCATGC 111121111211

= 10-4 = 6 (50%)

ATGCCTGCACGC |||| |||| || ATGCATGCATGC 555545555455

= 50-8 = 42 (70%)

...FASTP and FASTA...

assessment of statistical significance

using RDF or RDF2

permute (shuffle) reference sequences

the entire database or just similar sequences (faster)

number of positions and base composition constant

order jumbled

recalculate similarity

...FASTP and FASTA

p–values are not calculated (distribution not normal)

z–values used instead

z = (similarity - mean of permuted similarity) / standard deviation of permuted similarity

z > 3 = possibly significant (0.0013%)

z > 6 = probably significant

z > 10 = significant

(Lipman and Pearson 1985)

FASTA vs. BLAST

BLAST has larger default word size (== faster)

can be less ‘sensitive’ (ignores low similarly sequences)

BLAST focuses on most informative k–mers

improved treatment of indels

‘small’ indels and/or ‘short’ segments of mismatch between matching words are tolerated

different methods of calculating significance

BLAST provides a ‘significance’ of match assuming that search settings are correct

BLAST flavors…Altschul et al. (1990, 1997)

the ‘original’ source of algorithms1997 overcomes non–matching regions (e.g. indels)freely available from NCBI (binary and code)

Kent (2002; BLAT)faster in some circumstances (e.g. large batches)designed to search against whole genomesfreely available from the author (binary and code)

Gish (2006; WU-BLAST)same algorithm as NCBI BLAST, but fasterdifferent default settings => different resultsbinary (possibly available), but code is not distributed

…BLAST flavors…

Nguyen and Lavenier (2009; PLAST)parallel version3–6× faster than multithreaded BLASTfreely available from the authors (binary and code)

Panagiotis and Sahinidis (2011; GPU-BLAST)requires a CUDA graphics processorup to 10× fasterprotein sequences onlyfreely available from the authors (binary and code)

…BLAST flavors

Zhao and Chu (2014; G-BLASTN)requires a CUDA graphics processorup to 14× fasternucleotide sequences only

inefficient on short sequences (> 1,000,000 bp)database must fit into the GPU memoryfreely available from the authors (binary and code)

day8 - new york botanical garden(goad and kanehisa 1982) nucleic acids research 50bkpjc eases 150 to...

Documents