introduction to perl for bioinformatics

Introduction to Perl for Bioinformatics

Thursday, April 7

Programming languages

• Self-contained language
– Platform-independent
– Used to write O/S
– C (imperative, procedural)
– C++, Java (object-oriented)
– Lisp, Haskell (functional); Prolog (logic)

• Scripting language
– Closely tied to O/S
– Perl, Python, Ruby

• Domain-specific language
– R (statistics)
– MATLAB (numerics)
– SQL (databases)

• An O/S typically manages…
– Devices (see above)
– Files & directories
– Users & permissions
– Processes & signals

Bioinformatics “pipelines” often involve chaining together multiple tools

Perl is the most-used bioinformatics language

Most popular bioinformatics programming languages

Bioinformatics career survey, 2008

Michael Barton

Perl overview

• Interpreted, not compiled
– Fast edit-run-revise cycle

• Procedural & imperative
– Sequence of instructions (“control flow”)
– Variables, subroutines

• Syntax close to C (the de facto standard minimal language)
– Dynamically, weakly typed (unlike C, which is statically typed)
– Redundant, not minimal (“there’s more than one way to do it”)

• High-level data structures & algorithms
– Hashes, arrays
– Operating System support (files, processes, signals)
– String manipulation

Perl basics

• Basic syntax of a Perl program:

#!/usr/local/bin/perl
# Elementary Perl program
print "Hello World\n";

– "\n" means new line
– The print statement tells Perl to print the following text to the screen
– Single or double quotes enclose a "string literal" (double quotes are "interpolated")
– All statements end with a semicolon
– Lines beginning with "#" are comments, and are ignored by Perl

Output:
Hello World

Variables

• We can tell Perl to "remember" a particular value, using the assignment operator "=":

$x = 3;
print $x;

Output:
3

• The $x is referred to as a "scalar variable". Variable names can contain alphabetic characters, numbers (but not at the start of the name), and underscore symbols "_". Scalar variable names are all prefixed with the dollar symbol.

$x = "ACGCGT";
print $x;

Output:
ACGCGT

(ACGCGT is the binding site for the yeast transcription factor MCB.)

Arithmetic operations

• Basic operators are + - / * %

• Can also use += -= /= *= ++ --

$x = 14;
$y = 3;
print "Sum: ", $x + $y, "\n";
print "Product: ", $x * $y, "\n";
print "Remainder: ", $x % $y, "\n";

Output:
Sum: 17
Product: 42
Remainder: 2

$x = 5;
print "x started as $x\n";
$x = $x * 2;      # could write $x *= 2;
print "Then x was $x\n";
$x = $x + 1;      # could write $x += 1; or even ++$x;
print "Finally x was $x\n";

Output:
x started as 5
Then x was 10
Finally x was 11

String operations

• Concatenation . and .=

• Can find the length of a string using the function length($x)

$a = "pan";
$b = "cake";
$a = $a . $b;
print $a;

Output:
pancake

$a = "soap";
$b = "dish";
$a .= $b;
print $a;

Output:
soapdish

$mcb = "ACGCGT";
print "Length of $mcb is ", length($mcb);

Output:
Length of ACGCGT is 6

More string operations

$x = "A simple sentence";
print $x, "\n";
print uc($x), "\n";        # convert to upper case
print lc($x), "\n";        # convert to lower case
$y = reverse($x);          # reverse the string
print $y, "\n";
$x =~ tr/i/a/;             # transliterate "i"'s into "a"'s
print $x, "\n";
print length($x), "\n";    # calculate the length of the string

Output:
A simple sentence
A SIMPLE SENTENCE
a simple sentence
ecnetnes elpmis A
A sample sentence
17

Concatenating DNA fragments

$dna1 = "accacgt";
$dna2 = "taggtct";
print $dna1 . $dna2;

Output:
accacgttaggtct

"Transcribing" DNA to RNA

$dna = "accACgttAGGTct";   # DNA string is a mixture of upper & lower case
$rna = lc($dna);           # make it all lower case
$rna =~ tr/t/u/;           # transliterate "t" to "u"
print $rna;

Output:
accacguuaggucu

Conditional blocks

• The ability to execute an action contingent on some condition is what distinguishes a computer from a calculator. In Perl, it looks like this:

if (condition) { action } else { alternative }

$x = 149;
$y = 100;
if ($x > $y)
{
    print "$x is greater than $y\n";
}
else
{
    print "$x is not greater than $y\n";
}

Output:
149 is greater than 100

The braces { } tell Perl which piece of code is contingent on the condition.

Conditional operators

• Numeric: > >= < <= != ==
– == means "equals"; != means "does not equal"

• String: eq ne gt lt ge le
– eq means "equals"; ne means "does not equal"
– gt means "is alphabetically greater than"; lt means "is alphabetically less than"
– ge means "is alphabetically greater-or-equal"; le means "is alphabetically less-or-equal"

$x = 5 * 4;
$y = 17 + 3;
if ($x == $y) { print "$x equals $y"; }

Output:
20 equals 20

Note that the test for "$x equals $y" is $x==$y, not $x=$y.

($x, $y) = ("Apple", "Banana");   # shorthand syntax for assigning more than one variable at a time
if ($y gt $x) { print "$y after $x "; }

Output:
Banana after Apple

Logical operators

• Logical operators: && means "and", || means "or". The words "and" and "or" can be used in the same way; they have lower precedence.

• An exclamation mark ! is used to negate what follows. Thus !($x < $y) means the same as ($x >= $y)

• In computers, the value zero is often used to represent falsehood, while any non-zero value (e.g. 1) represents truth. Thus:

if (1) { print "1 is true\n"; }
if (0) { print "0 is true\n"; }
if (-99) { print "-99 is true\n"; }

Output:
1 is true
-99 is true

$x = 222;
if ($x % 2 == 0 and $x % 3 == 0)
{
    print "$x is an even multiple of 3\n";
}

Output:
222 is an even multiple of 3
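The operators above can be combined as in the following minimal sketch. The variable names and test values are illustrative only, and the loop at the end shows that Perl treats the empty string, the string "0" and undef as false in addition to the number 0.

```perl
use strict;
use warnings;

my ($x, $y) = (222, 7);   # illustrative values, not from the slides

# && is true only if both sides are true
my $both = ($x % 2 == 0 && $x % 3 == 0) ? "yes" : "no";

# || is true if at least one side is true
my $either = ($y % 2 == 0 || $y % 7 == 0) ? "yes" : "no";

# ! negates: !($x < $y) is the same test as ($x >= $y)
my $negated = (!($x < $y)) ? "yes" : "no";

print "both: $both, either: $either, negated: $negated\n";

# Count how many of these values Perl treats as false
my $false_count = 0;
foreach my $value (0, "", "0", undef, 1, -99, "a") {
    if (!$value) {
        ++$false_count;
    }
}
print "Number of false values: $false_count\n";
```

The first four values in the list are false and the last three are true.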

Loops

• Here's how to print out the numbers 1 to 10:

$x = 1;
while ($x <= 10) {
    print $x, " ";
    ++$x;            # equivalent to $x = $x + 1;
}

Output:
1 2 3 4 5 6 7 8 9 10

• This is a while loop. The code inside the braces is repeatedly executed as long as the condition $x<=10 remains true.

A common kind of loop

• Let's dissect the code of the while loop again:

$x = 1;                 # initialization
while ($x <= 10) {      # test for completion
    print $x, " ";
    ++$x;               # continuation
}

• This form of while loop is common enough to have its own shorthand: the for loop.

for ($x = 1; $x <= 10; ++$x) {   # initialization; test for completion; continuation
    print $x, " ";
}

defined and undef

• The function defined($x) is true if $x has been assigned a value:

if (defined($newvar)) {
    print "newvar is defined\n";
} else {
    print "newvar is not defined\n";
}

Output:
newvar is not defined

• A variable that has not yet been assigned a value has the special value undef

• Often, if you try to do something "illegal" (like reading from a nonexistent file), you end up with undef as a result

Reading a line of data

• To read from a file, we first need to open the file and give it a filehandle.

open FILE, "sequence.txt";

This code snippet opens a file called "sequence.txt", and associates it with a filehandle called FILE.

• Once the file is opened, we can read a single line from it into the scalar $x:

$x = <FILE>;

This reads the next line from the file, including the newline at the end, "\n". If the end of the file is reached, $x is assigned the special value undef.
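The same steps can be sketched with a lexical filehandle, the three-argument form of open, and an explicit error check. This is one possible style rather than the form used on the slides. The file name example_sequence.txt is hypothetical; the script creates it first so that the example runs on its own.

```perl
use strict;
use warnings;

my $filename = "example_sequence.txt";   # hypothetical file name

# Create a small two-line file so the example is self-contained
open(my $out, ">", $filename) or die "Cannot write $filename: $!";
print $out "ACGCGT\nTTGACA\n";
close($out);

# Open for reading; 'or die' reports the reason ($!) if the open fails
open(my $in, "<", $filename) or die "Cannot read $filename: $!";
my $first  = <$in>;   # first line, newline included
my $second = <$in>;   # second line, newline included
my $third  = <$in>;   # undef: no lines left
close($in);
unlink($filename);    # remove the example file again

print "First line: $first";
print "Third read is ", (defined($third) ? "defined" : "undef"), "\n";
```

Without the "or die" check, a failed open goes unnoticed and every later read simply returns undef.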

Reading an entire file

• The following piece of code reads every line in a file and prints it out to the screen:

open FILE, "sequence.txt";
while (defined ($x = <FILE>)) {
    print $x;
}
close FILE;

This reads a line of data into $x, then checks if $x is defined. If $x is undef, then the file must have ended.

• A shorter version of this is as follows:

open FILE, "sequence.txt";
while ($x = <FILE>) {      # this is equivalent to defined($x=<FILE>)
    print $x;
}
close FILE;

The default variable, $_

• Many operations that take a scalar argument, such as length($x), are assumed to work on $_ if the $x is omitted:

$_ = "Hello";
print;
print length;

Output:
Hello5

• So we can also read a whole file like this:

open FILE, "sequence.txt";
while (<FILE>) {       # this line is equivalent to while (defined($_=<FILE>)) {
    print;
}
close FILE;

Summary: scalars and loops

• Assignment operator: $x = 5;

• Arithmetic operations: $y = $x * 3;

• String operations: $s = "Value of y is " . $y;

• Conditional tests: if ($y > 10) { print $s; }

• Logical operators: if ($y>10 && $s eq "") { exit; }

• Loops: for ($x=1; $x<10; ++$x) { print $x, "\n"; }

• defined and undef

• Reading a file

Pattern-matching

• A more sophisticated kind of logical test is to ask whether a string contains a pattern

• e.g. does a yeast promoter sequence contain the MCB binding site, ACGCGT?

$name = "YBR007C";
$dna = "TAATAAAAAACGCGTTGTCG";    # 20 bases upstream of the yeast gene YBR007C
if ($dna =~ /ACGCGT/)             # =~ is the pattern binding operator; /ACGCGT/ is the pattern for the MCB binding site
{
    print "$name has MCB!\n";
}

Output:
YBR007C has MCB!
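Patterns are case-sensitive by default, which matters when a sequence file mixes upper and lower case. The sketch below reuses the same 20 bases in lower case (an assumption made for illustration) to show the /i modifier, and uses the built-in index function to report where the site starts.

```perl
use strict;
use warnings;

my $name = "YBR007C";
my $dna  = "taataaaaaacgcgttgtcg";   # the same 20 bases, written in lower case

# Without /i the upper-case pattern does not match lower-case text
my $hit_exact  = ($dna =~ /ACGCGT/)  ? 1 : 0;

# The /i modifier makes the match case-insensitive
my $hit_nocase = ($dna =~ /ACGCGT/i) ? 1 : 0;

# index returns the zero-based offset of a substring, or -1 if absent
my $offset = index(lc($dna), "acgcgt");

print "Exact match: $hit_exact, case-insensitive match: $hit_nocase\n";
if ($hit_nocase) {
    print "$name has MCB starting at base ", $offset + 1, "\n";
}
```

Here the site begins at offset 9, so the second line reports base 10.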

FASTA format

• A format for storing multiple named sequences in a single file

• This file contains 3' UTRs for Drosophila genes CG11604, CG11455 and CG11488

>CG11604
TAGTTATAGCGTGAGTTAGTTGTAAAGGAACGTGAAAGATAAATACATTTTCAATACC
>CG11455
TAGACGGAGACCCGTTTTTCTTGGTTAGTTTCACATTGTAAAACTGCAAATTGTGTAAAAATAAAATGAGAAACAATTCTGGT
>CG11488
TAGAAGTCAAAAAAGTCAAGTTTGTTATATAACAAGAAATCAAAAATTATATAATTGTTTTTCACTCT

The name of each sequence is preceded by a > symbol.

NB sequences can span multiple lines.

Call this file fly3utr.txt

Printing all sequence names in a FASTA database

open FILE, "fly3utr.txt";
while ($x = <FILE>) {
    if ($x =~ />/) {
        print $x;
    }
}
close FILE;

Output:
>CG11604
>CG11455
>CG11488

• The key to this program is this block:

if ($x =~ />/) { print $x; }

The pattern matches (and returns TRUE) if the variable $x contains the FASTA sequence-name symbol >. The print statement then prints $x if the pattern matched.

Pattern replacement

open FILE, "fly3utr.txt";
while (<FILE>) {
    if (/>/) {
        s/>//;         # new statement removes the ">"
        print;
    }
}
close FILE;

Output:
CG11604
CG11455
CG11488

• The new statement s/>// is an example of a replacement.
• General form: s/OLD/NEW/ replaces OLD with NEW
• Thus s/>// replaces ">" with "" (the empty string)

$_ is the default variable for these operations.

Finding all sequence lengths

[Flowchart: Start → Open file → Read line → End of file? If yes: print last sequence length, then Stop. If no: remove "\n" newline character at end of line → Line starts with ">"? If yes (sequence name): unless this is the first sequence, print last sequence length; then record the name and reset the running total of the current sequence length. If no (sequence data): add length of line to running total. Then read the next line.]

Finding all sequence lengths

open FILE, "fly3utr.txt";
while (<FILE>) {
    chomp;
    if (/>/) {
        if (defined $len) {
            print "$name $len\n";
        }
        $name = $_;
        $len = 0;
    } else {
        $len += length;
    }
}
print "$name $len\n";
close FILE;

Output:
>CG11604 58
>CG11455 83
>CG11488 68

The chomp statement trims the newline character "\n" off the end of the default variable, $_. Try it without this and see what happens, and see if you can work out why.

Input file fly3utr.txt:

>CG11604
TAGTTATAGCGTGAGTTAGTTGTAAAGGAACGTGAAAGATAAATACATTTTCAATACC
>CG11455
TAGACGGAGACCCGTTTTTCTTGGTTAGTTTCACATTGTAAAACTGCAAATTGTGTAAAAATAAAATGAGAAACAATTCTGGT
>CG11488
TAGAAGTCAAAAAAGTCAAGTTTGTTATATAACAAGAAATCAAAAATTATATAATTGTTTTTCACTCT
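The same logic can be sketched under "use strict" with named variables instead of $_. To keep the example self-contained it reads made-up FASTA records (seqA and seqB are hypothetical names) from a string through an in-memory filehandle; with a real file, the open line would name the file instead. Two small changes from the program above are assumptions of this sketch: the pattern is anchored with ^ so that only a ">" at the start of a line counts, and the final report is skipped if no record was seen.

```perl
use strict;
use warnings;

# Made-up FASTA data; the second record spans two lines
my $fasta = ">seqA\nACGT\n>seqB\nACGTAC\nGT\n";

# Reading from a string instead of a file keeps the example self-contained
open(my $fh, "<", \$fasta) or die "Cannot open in-memory file: $!";

my ($name, $len);
my @report;
while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ /^>/) {              # ^ anchors the match to the start of the line
        if (defined $name) {          # a previous record exists, so report it
            push @report, "$name $len";
        }
        $name = $line;
        $len  = 0;
    } else {
        $len += length($line);        # sequence data: add to the running total
    }
}
if (defined $name) {                  # report the last record, if there was one
    push @report, "$name $len";
}
close($fh);

foreach my $entry (@report) {
    print "$entry\n";
}
```

Because seqB is split over two lines, its reported length of 8 is the sum of both lines.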

Reverse complementing DNA

• A common operation due to double-helix symmetry of DNA

$dna = "accACgttAGgtct";
$revcomp = lc($dna);               # start by making the string lower case again; this is generally good practice
$revcomp = reverse($revcomp);      # reverse the string
$revcomp =~ tr/acgt/tgca/;         # replace 'a' with 't', 'c' with 'g', 'g' with 'c' and 't' with 'a'
print $revcomp;

Output:
agacctaacgtggt
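If the original case should be preserved rather than folded to lower case, the transliteration can list both cases. The sketch below wraps the steps in a subroutine; the name revcomp is chosen here for illustration, and subroutines are only mentioned in passing in this lecture.

```perl
use strict;
use warnings;

# Return the reverse complement of a DNA string, preserving case
sub revcomp {
    my ($dna) = @_;
    my $rc = reverse($dna);           # scalar assignment, so the string is reversed
    $rc =~ tr/acgtACGT/tgcaTGCA/;     # complement both lower and upper case
    return $rc;
}

my $dna    = "accACgttAGgtct";
my $result = revcomp($dna);
print "$result\n";

# Reverse complementing twice should give back the original string
my $round_trip = revcomp($result);
print "Round trip OK\n" if $round_trip eq $dna;
```

Applying the operation twice returns the original sequence, which makes a convenient sanity check.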

Arrays

• An array is a variable holding a list of items

@nucleotides = ('a', 'c', 'g', 't');
print "Nucleotides: @nucleotides\n";

Output:
Nucleotides: a c g t

• We can think of this as a list with 4 entries:

element 0: a
element 1: c
element 2: g
element 3: t

The array is the set of all four elements. Note that the element indices start at zero.

Array literals

• There are several, equally valid ways to assign an entire array at once.

@a = (1,2,3,4,5);          # the most common form: a comma-separated list, delimited by parentheses
print "a = @a\n";
@b = ('a','c','g','t');
print "b = @b\n";
@c = 1..5;
print "c = @c\n";
@d = qw(a c g t);
print "d = @d\n";

Output:
a = 1 2 3 4 5
b = a c g t
c = 1 2 3 4 5
d = a c g t

Accessing arrays

• To access array elements, use square brackets; e.g. $x[0] means "element zero of array @x"

• Remember, element indices start at zero!

@x = ('a', 'c', 'g', 't');
print $x[0], "\n";
$i = 2;
print $x[$i], "\n";

Output:
a
g

• If you use an array @x in a scalar context, such as @x+0, then Perl assumes that you wanted the length of the array.

@x = ('a', 'c', 'g', 't');
print @x + 0;

Output:
4
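A few related idioms, shown as a short sketch: scalar(@x) makes the scalar context explicit, $#x gives the index of the last element, and a negative index counts from the end of the array.

```perl
use strict;
use warnings;

my @x = ('a', 'c', 'g', 't');

my $count      = scalar(@x);   # number of elements, same value as @x + 0
my $last_index = $#x;          # index of the last element
my $last       = $x[-1];       # negative indices count from the end

print "Count: $count\n";
print "Last index: $last_index\n";
print "Last element: $last\n";
```

For a four-element array the count is 4 and the last index is 3, since indices start at zero.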

Array operations

• You can sort and reverse arrays...

@x = ('a', 't', 'g', 'c');
@y = sort @x;
@z = reverse @y;
print "x = @x\n";
print "y = @y\n";
print "z = @z\n";

Output:
x = a t g c
y = a c g t
z = t g c a

• You can read the entire contents of a file into an array (each line of the file becomes an element of the array)

open FILE, "sequence.txt";
@x = <FILE>;

push, pop, shift, unshift

@x = ('A', 'T', 'W');
print "I started with @x\n";
$y = pop @x;             # pop removes the last element of an array
push @x, 'G';            # push adds an element to the end of an array
print "Then I had @x\n";
$z = shift @x;           # shift removes the first element of an array
unshift @x, 'C';         # unshift adds an element to the start of an array
print "Now I have @x\n";
print "I lost $y and $z\n";

Output:
I started with A T W
Then I had A T G
Now I have C T G
I lost W and A

foreach

• Finding the total of a list of numbers:

@val = (4, 19, 1, 100, 125, 10);
$total = 0;
foreach $x (@val) {        # the foreach statement loops through each entry in an array
    $total += $x;
}
print $total;

Output:
259

• Equivalent to:

@val = (4, 19, 1, 100, 125, 10);
$total = 0;
for ($i = 0; $i < @val; ++$i) {
    $total += $val[$i];
}
print $total;

Output:
259

The @ARGV array

• A special array is @ARGV

• This contains the command-line arguments when the program is invoked at the Unix prompt

• It's a way for the user to pass information into the program
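A minimal sketch of using @ARGV follows. The script name show_args.pl in the comment is hypothetical, and falling back to sequence.txt when no argument is given is an assumption made for illustration.

```perl
use strict;
use warnings;

# Run as, for example:  perl show_args.pl fly3utr.txt 10
# (show_args.pl is a hypothetical script name)
my $count = @ARGV;                 # array in scalar context gives its length
print "Got $count argument(s)\n";

foreach my $arg (@ARGV) {
    print "Argument: $arg\n";
}

# Use the first argument as a file name, falling back to a default
my $filename = "sequence.txt";
if ($count > 0) {
    $filename = $ARGV[0];
}
print "Would read from $filename\n";
```

Note that, unlike argv in C, $ARGV[0] is the first argument and not the name of the program.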

Exploding a sequence into an array

• The programming language C treats all strings as arrays. In Perl a string is a single scalar, but it can be converted into an array of characters:

$dna = "accggtgtgcg";
print "String: $dna\n";
@array = split( //, $dna);
print "Array: @array\n";

Output:
String: accggtgtgcg
Array: a c c g g t g t g c g

The split statement turns a string into an array. Here, it splits after every character, but we can also split at specific points, like a restriction enzyme.
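As a sketch of splitting "like a restriction enzyme", the example below cuts a made-up sequence at EcoRI sites (GAATTC, cut between the G and the AATTC). It relies on lookbehind and lookahead assertions, which are not covered in this lecture: they match a position without consuming any characters, so no bases are lost from the fragments.

```perl
use strict;
use warnings;

my $dna = "ttGAATTCggaGAATTCcc";   # made-up sequence with two EcoRI sites

# (?<=G) requires a G just before the cut point;
# (?=AATTC) requires AATTC just after it.
# The match itself is empty, so split removes nothing.
my @fragments = split(/(?<=G)(?=AATTC)/, $dna);

print "Fragments: @fragments\n";
print "Number of fragments: ", scalar(@fragments), "\n";
```

A simpler split(/GAATTC/, $dna) would also break the string at each site, but it would discard the recognition sequence itself.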

Taking a slice of an array

• The syntax @x[i,j,k...] returns a (3-element) array containing elements i,j,k... of array @x

@nucleotides = ('a', 'c', 'g', 't');
@purines = @nucleotides[0,2];
@pyrimidines = @nucleotides[1,3];
print "Nucleotides: @nucleotides\n";
print "Purines: @purines\n";
print "Pyrimidines: @pyrimidines\n";

Output:
Nucleotides: a c g t
Purines: a g
Pyrimidines: c t

Finding elements in an array

• The grep command is used to select some elements from an array

• The statement grep(EXPR,LIST) returns all elements of LIST for which EXPR evaluates to true (when $_ is set to the appropriate element)

• e.g. select all numbers over 100:

@numbers = (101, 235, 10, 50, 100, 66, 1005);
@numbersOver100 = grep ($_ > 100, @numbers);
print "Numbers: @numbers\n";
print "Numbers over 100: @numbersOver100\n";

Output:
Numbers: 101 235 10 50 100 66 1005
Numbers over 100: 101 235 1005
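grep also accepts a block in place of the expression, and the test can be a pattern match. The sketch below, using made-up sequences, selects those containing the MCB site and also shows that grep in scalar context returns a count rather than a list.

```perl
use strict;
use warnings;

# Made-up sequences for illustration
my @seqs = ("TTACGCGTAA", "GGGGCCCC", "ACGCGT", "ATATATAT");

# Block form: keep the elements for which the pattern matches $_
my @with_mcb = grep { /ACGCGT/ } @seqs;

# In scalar context grep returns the number of matching elements
my $long_count = grep { length($_) > 6 } @seqs;

print "Sequences with MCB: @with_mcb\n";
print "Sequences longer than 6 bases: $long_count\n";
```

Two of the four sequences contain the site, and three are longer than 6 bases.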

Applying a function to an array

• The map command applies a function to every element in an array

• Similar syntax to grep: map(EXPR,LIST) applies EXPR to every element in LIST

• Example: multiply every number by 3

@numbers = (101, 235, 10, 50, 100, 66, 1005);
@numbersTimes3 = map ($_ * 3, @numbers);
print "Numbers: @numbers\n";
print "Numbers times 3: @numbersTimes3\n";

Output:
Numbers: 101 235 10 50 100 66 1005
Numbers times 3: 303 705 30 150 300 198 3015
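map works equally well with string functions. This short sketch, using made-up sequences, applies uc and length to every element with the block form of map.

```perl
use strict;
use warnings;

# Made-up sequences for illustration
my @seqs = ("acgt", "ggcc", "ttaacc");

# Block form: the block is evaluated once per element, with $_ set to it
my @upper   = map { uc($_) } @seqs;
my @lengths = map { length($_) } @seqs;

print "Upper case: @upper\n";
print "Lengths: @lengths\n";
```

The original array @seqs is left unchanged; map returns a new list.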

Review: pattern-matching

• The following code:

if (/ACGCGT/) { print "Found MCB binding site!\n"; }

prints the string "Found MCB binding site!" if the pattern "ACGCGT" is present in the default variable, $_

• Instead of using $_ we can "bind" the pattern to another variable (e.g. $dna) using this syntax:

if ($dna =~ /ACGCGT/) { print "Found MCB binding site!\n"; }

• We can replace the first occurrence of ACGCGT with the string _MCB_ using the following syntax:

$dna =~ s/ACGCGT/_MCB_/;

• We can replace all occurrences by appending a 'g':

$dna =~ s/ACGCGT/_MCB_/g;
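The difference between the two replacements can be seen in a short sketch on a made-up sequence with two sites. It also counts the sites with a while loop over a /g match, a use of the 'g' modifier that goes slightly beyond what the slides show.

```perl
use strict;
use warnings;

my $dna = "ttACGCGTaaACGCGTcc";   # made-up sequence with two MCB sites

# In a while condition, a /g match finds successive occurrences
my $count = 0;
while ($dna =~ /ACGCGT/g) {
    ++$count;
}

# Work on copies so the original string stays intact
my $first_only = $dna;
$first_only =~ s/ACGCGT/_MCB_/;    # replaces the first occurrence only

my $all = $dna;
$all =~ s/ACGCGT/_MCB_/g;          # replaces every occurrence

print "Sites found: $count\n";
print "First only: $first_only\n";
print "All: $all\n";
```

Only the version with the 'g' modifier rewrites the second site.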
