introduction to perl for bioinformatics

40
Introduction to Perl for Bioinformatics Thursday, April 7

Upload: dagan

Post on 04-Jan-2016

127 views

Category:

Documents


20 download

DESCRIPTION

Introduction to Perl for Bioinformatics. Thursday, April 7. Programming languages. Self-contained language Platform-independent Used to write O/S C (imperative, procedural) C++, Java (object-oriented) Lisp, Haskell, Prolog (functional) Scripting language Closely tied to O/S - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Introduction to Perl for Bioinformatics

Introduction to Perlfor Bioinformatics

Thursday, April 7

Page 2: Introduction to Perl for Bioinformatics

Programming languages

• Self-contained language– Platform-independent

– Used to write O/S

– C (imperative, procedural)

– C++, Java (object-oriented)

– Lisp, Haskell, Prolog (functional)

• Scripting language– Closely tied to O/S

– Perl, Python, Ruby

• Domain-specific language– R (statistics)

– MatLab (numerics)

– SQL (databases)

• An O/S typically manages…– Devices (see above)

– Files & directories

– Users & permissions

– Processes & signals

Page 3: Introduction to Perl for Bioinformatics

Bioinformatics “pipelines” often involve chaining together multiple tools

Page 4: Introduction to Perl for Bioinformatics

Perl is the most-used bioinformatics language

Most popular bioinformatics programming languages

Bioinformatics career survey, 2008

Michael Barton

Page 5: Introduction to Perl for Bioinformatics

Perl overview

• Interpreted, not compiled– Fast edit-run-revise cycle

• Procedural & imperative– Sequence of instructions (“control flow”)– Variables, subroutines

• Syntax close to C (the de facto standard minimal language)

– Weakly typed (unlike C)– Redundant, not minimal (“there’s more than one way to do it”)– High-level data structures & algorithms

– Hashes, arrays

– Operating System support (files, processes, signals)– String manipulation

Page 6: Introduction to Perl for Bioinformatics

Perl basics

• Basic syntax of a Perl program:

#!/usr/local/bin/perl# Elementary Perl programprint "Hello World\n";

"\n" means new line

print statement tells Perl to print the following text to the screen

Single or double quotesenclose a "string literal"(double quotes are "interpolated")

All statements endwith a semicolon

Linesbeginningwith "#" arecomments,and are ignoredby Perl

Hello World

Page 7: Introduction to Perl for Bioinformatics

Variables• We can tell Perl to "remember" a particular

value, using the assignment operator “=“:

• The $x is referred to as a "scalar variable".Variable names can contain alphabetic characters, numbers(but not at the start of the name), and underscore symbols "_"Scalar variable names are all prefixed with the dollar symbol.

$x = 3;print $x;

3

$x = "ACGCGT";print $x;

ACGCGT

Binding site for yeasttranscription factor MCB

Page 8: Introduction to Perl for Bioinformatics

Arithmetic operations

• Basic operators are + - / * %

• Can also use += -= /= *= ++ --

$x = 14;$y = 3;print "Sum: ", $x + $y, "\n";print "Product: ", $x * $y, "\n";print "Remainder: ", $x % $y, "\n";

Sum: 17Product: 42Remainder: 2

$x = 5;print "x started as $x\n";$x = $x * 2;print "Then x was $x\n";$x = $x + 1;print "Finally x was $x\n";

x started as 5Then x was 10Finally x was 11

Could write$x *= 2;

Could write$x += 1;or even++$x;

Page 9: Introduction to Perl for Bioinformatics

String operations

• Concatenation . and .=

• Can find the length of a string using the function length($x)

$a = "pan";$b = "cake";$a = $a . $b;print $a;

pancake

$a = "soap";$b = "dish";$a .= $b;print $a;

soapdish

$mcb = "ACGCGT";print "Length of $mcb is ", length($mcb);

Length of ACGCGT is 6

Page 10: Introduction to Perl for Bioinformatics

More string operations

$x = "A simple sentence";print $x, "\n";print uc($x), "\n";print lc($x), "\n";$y = reverse($x);print $y, "\n";$x =~ tr/i/a/;print $x, "\n";print length($x), "\n";

A simple sentenceA SIMPLE SENTENCEa simple sentenceecnetnes elpmis AA sample sentence17

Convert to upper case

Convert to lower case

Reverse the string

Transliterate "i"'s into "a"'s

Calculate the length of the string

Page 11: Introduction to Perl for Bioinformatics

Concatenating DNA fragments

$dna1 = "accacgt";$dna2 = "taggtct";print $dna1 . $dna2;

"Transcribing" DNA to RNA

accacguuaggucu

$dna = "accACgttAGGTct";$rna = lc($dna);$rna =~ tr/t/u/;print $rna;

Make it alllower case

DNA string is a mixtureof upper & lower case

Transliterate "t" to "u"

accacgttaggtct

Page 12: Introduction to Perl for Bioinformatics

Conditional blocks

• The ability to execute an action contingent on some condition is what distinguishes a computer from a calculator. In Perl, this looks like this:if (condition) { action } else { alternative }

$x = 149;$y = 100;if ($x > $y){ print "$x is greater than $y\n";}else{ print "$x is less than $y\n";}

149 is greater than 100

These braces { }tell Perl whichpiece of codeis contingent onthe condition.

Page 13: Introduction to Perl for Bioinformatics

Conditional operators

• Numeric: > >= < <= != ==

• String: eq ne gt lt ge le

$x = 5 * 4;$y = 17 + 3;if ($x == $y) { print "$x equals $y"; } 20 equals 20

"equals""does not equal"

"is alphabeticallygreater than" "is alphabetically

less than"

"is alphabeticallygreater-or-equal"

"is alphabeticallyless-or-equal"

Note that the testfor "$x equals $y" is$x==$y, not $x=$y

($x, $y) = ("Apple", "Banana");if ($y gt $x) { print "$y after $x "; } Banana after Apple

"does not equal"

Shorthand syntax forassigning more thanone variable at a time

Page 14: Introduction to Perl for Bioinformatics

Logical operators• Logical operators: && means "and", || means "or"

• An exclamation mark ! is used to negate what follows Thus !($x < $y) means the same as ($x >= $y)

• In computers, the value zero is often used to represent falsehood, while any non-zero value (e.g. 1) represents truth. Thus:

if (1) { print "1 is true\n"; }if (0) { print "0 is true\n"; }if (-99) { print "-99 is true\n"; }

1 is true-99 is true

$x = 222;if ($x % 2 == 0 and $x % 3 == 0){ print "$x is an even multiple of 3\n"; }

222 is an even multiple of 3

Page 15: Introduction to Perl for Bioinformatics

Loops

• Here's how to print out the numbers 1 to 10:

• This is a while loop.The code is executed while the condition is true.

$x = 1;while ($x <= 10) { print $x, " "; ++$x;}

1 2 3 4 5 6 7 8 9 10

The code insidethe braces isrepeatedlyexecuted as longas the condition$x<=10 remainstrue

Equivalent to$x = $x + 1;

Page 16: Introduction to Perl for Bioinformatics

A common kind of loop

• Let's dissect the code of the while loop again:

• This form of while loop is common enough to have its own shorthand: the for loop.

$x = 1;while ($x <= 10) { print $x, " "; ++$x;}

Initialization

Test for completion

Continuation

for ($x = 1; $x <= 10; ++$x) { print $x, " ";}

InitializationTest for completion

Continuation

Page 17: Introduction to Perl for Bioinformatics

defined and undef

• The function defined($x) is true if $x has been assigned a value:

• A variable that has not yet been assigned a value has the special value undef

• Often, if you try to do something "illegal" (like reading from a nonexistent file), you end up with undef as a result

if (defined($newvar)) { print "newvar is defined\n";} else { print "newvar is not defined\n";}

newvar is not defined

Page 18: Introduction to Perl for Bioinformatics

Reading a line of data• To read from a file, we first need to open

the file and give it a filehandle.

• Once the file is opened, we can read a single line from it into the scalar $x :This code snippet opens a file called"sequence.txt", and associates it witha filehandle called FILE

open FILE, "sequence.txt";

$x = <FILE>;This reads the next line from the file,including the newline at the end, "\n".if the end of the file is reached, $x isassigned the special value undef

Page 19: Introduction to Perl for Bioinformatics

Reading an entire file

• The following piece of code reads every line in a file and prints it out to the screen:

• A shorter version of this is as follows:

open FILE, "sequence.txt";while (defined ($x = <FILE>)) { print $x;}close FILE;

open FILE, "sequence.txt";while ($x = <FILE>) { print $x;}close FILE;

This reads a line of data into$x, then checks if $x is defined.If $x is undef, then the filemust have ended.

this is equivalent todefined($x=<FILE>)

Page 20: Introduction to Perl for Bioinformatics

The default variable, $_

• Many operations that take a scalar argument, such as length($x), are assumed to work on $_ if the $x is omitted:

• So we can also read a whole file like this:

$_ = "Hello";print; print length;

Hello5

open FILE, "sequence.txt";while (<FILE>) { print;}close FILE;

This line is equivalent towhile (defined($_=<FILE>)) {

Page 21: Introduction to Perl for Bioinformatics

Summary: scalars and loops

• Assignment operator

• Arithmetic operations

• String operations

• Conditional tests

• Logical operators

• Loops • defined and undef• Reading a file

$x = 5;

$y = $x * 3;

if ($y > 10) { print $s; }

$s = "Value of y is " . $y;

if ($y>10 && $s eq "") { exit; }

for ($x=1; $x<10; ++$x) { print $x, "\n"; }

Page 22: Introduction to Perl for Bioinformatics

Pattern-matching

• A more sophisticated kind of logical test is to ask whether a string contains a pattern

• e.g. does a yeast promoter sequence contain the MCB binding site, ACGCGT?

$name = "YBR007C";$dna="TAATAAAAAACGCGTTGTCG";if ($dna =~ /ACGCGT/){ print "$name has MCB!\n"; }

20 bases upstream ofthe yeast gene YBR007C

The pattern binding operator =~

The pattern for the MCB binding siteYBR007C has MCB!

Page 23: Introduction to Perl for Bioinformatics

FASTA format

• A format for storing multiple named sequences in a single file

• This file contains 3' UTRsfor Drosophila genes CG11604,CG11455 and CG11488

>CG11604TAGTTATAGCGTGAGTTAGTTGTAAAGGAACGTGAAAGATAAATACATTTTCAATACC>CG11455TAGACGGAGACCCGTTTTTCTTGGTTAGTTTCACATTGTAAAACTGCAAATTGTGTAAAAATAAAATGAGAAACAATTCTGGT>CG11488TAGAAGTCAAAAAAGTCAAGTTTGTTATATAACAAGAAATCAAAAATTATATAATTGTTTTTCACTCT

Name of sequence ispreceded by > symbol

NB sequences canspan multiple lines

Call this file fly3utr.txt

Page 24: Introduction to Perl for Bioinformatics

Printing all sequence names in a FASTA database

• The key to this program is this block:

open FILE, "fly3utr.txt";while ($x = <FILE>) { if ($x =~ />/) { print $x; }}close FILE;

>CG11604>CG11455>CG11488

if ($x =~ />/) { print $x; }

This pattern matches (and returns TRUE) if the defaultvariable $_ contains the FASTA sequence-name symbol >

This line prints $_ ifthe pattern matched

Page 25: Introduction to Perl for Bioinformatics

Pattern replacement

open FILE, "fly3utr.txt";while (<FILE>) { if (/>/) { s/>//; print; }}close FILE;

CG11604CG11455CG11488

New statementremoves the ">"

•The new statement s/>// is an example of a replacement.•General form: s/OLD/NEW/ replaces OLD with NEW•Thus s/>// replaces ">" with "" (the empty string)

$_ is thedefaultvariablefor theseoperations

Page 26: Introduction to Perl for Bioinformatics

Finding all sequence lengthsOpen file

Read line

End of file?

Line starts with “>” ?

Remove “\n” newline character at end of line

Sequence name

Sequence data

Add length of line to running totalRecord the name

Reset running total of current sequence length

First sequence?Print last sequence length

Stop

noyes

yes

yes

no

no

Start

Print last sequence length

Page 27: Introduction to Perl for Bioinformatics

Finding all sequence lengthsopen FILE, "fly3utr.txt";while (<FILE>) { chomp; if (/>/) { if (defined $len) {

print "$name $len\n"; } $name = $_; $len = 0; } else { $len += length; }}print "$name $len\n";close FILE;

>CG11604 58>CG11455 83>CG11488 68

The chomp statementtrims the newline character"\n" off the end of thedefault variable, $_.Try it without this andsee what happens – andif you can work out why

>CG11604TAGTTATAGCGTGAGTTAGTTGTAAAGGAACGTGAAAGATAAATACATTTTCAATACC>CG11455TAGACGGAGACCCGTTTTTCTTGGTTAGTTTCACATTGTAAAACTGCAAATTGTGTAAAAATAAAATGAGAAACAATTCTGGT>CG11488TAGAAGTCAAAAAAGTCAAGTTTGTTATATAACAAGAAATCAAAAATTATATAATTGTTTTTCACTCT

Page 28: Introduction to Perl for Bioinformatics

Reverse complementing DNA

$dna = "accACgttAGgtct";$revcomp = lc($dna);$revcomp = reverse($revcomp);$revcomp =~ tr/acgt/tgca/;print $revcomp;

agacctaacgtggt

Start by making string lower caseagain. This is generally good practice

Reverse the string

Replace 'a' with 't', 'c' with 'g','g' with 'c' and 't' with 'a'

• A common operation due to double-helix symmetry of DNA

Page 29: Introduction to Perl for Bioinformatics

Arrays• An array is a variable holding a list of items

• We can think of this as a list with 4 entries

@nucleotides = ('a', 'c', 'g', 't');print "Nucleotides: @nucleotides\n";

Nucleotides: a c g t

a c g telement 0

element 1 element 2 element 3

the array is theset of all four elements

Note that the elementindices start at zero.

Page 30: Introduction to Perl for Bioinformatics

Array literals

• There are several, equally valid ways to assign an entire array at once.

@a = (1,2,3,4,5);print "a = @a\n";@b = ('a','c','g','t');print "b = @b\n";@c = 1..5;print "c = @c\n";@d = qw(a c g t);print "d = @d\n";

a = 1 2 3 4 5b = a c g tc = 1 2 3 4 5d = a c g t

This is the most common: a comma-separated list, delimited by parentheses

Page 31: Introduction to Perl for Bioinformatics

Accessing arrays

• To access array elements, use square brackets; e.g. $x[0] means "element zero of array @x"

• Remember, element indices start at zero!• If you use an array @x in a scalar context, such

as @x+0, then Perl assumes that you wanted the length of the array.

@x = ('a', 'c', 'g', 't');print $x[0], "\n";$i = 2;print $x[$i], "\n";

ag

@x = ('a', 'c', 'g', 't');print @x + 0;

4

Page 32: Introduction to Perl for Bioinformatics

Array operations• You can sort and reverse arrays...

• You can read the entire contents of a file into an array (each line of the file becomes an element of the array)

@x = ('a', 't', 'g', 'c');@y = sort @x;@z = reverse @y;print "x = @x\n";print "y = @y\n";print "z = @z\n";

x = a t g cy = a c g tz = t g c a

open FILE, "sequence.txt";@x = <FILE>;

Page 33: Introduction to Perl for Bioinformatics

push, pop, shift, unshift

@x = (‘A’, ‘T’, ‘W’);print "I started with @x\n";$y = pop @x;push @x, ‘G’;print "Then I had @x\n";$z = shift @x;unshift @x, ‘C’;print "Now I have @x\n";print "I lost $y and $z\n";

I started with A T WThen I had A T GNow I have C T GI lost W and A

pop removes the lastelement of an array

push adds an elementto the end of an array

shift removes the firstelement of an array

unshift adds an elementto the start of an array

Page 34: Introduction to Perl for Bioinformatics

foreach

• Finding the total of a list of numbers:

• Equivalent to:

@val = (4, 19, 1, 100, 125, 10);$total = 0;foreach $x (@val) { $total += $x;}print $total; 259

@val = (4, 19, 1, 100, 125, 10);$total = 0;for ($i = 0; $i < @val; ++$i) { $total += $val[$i];}print $total; 259

foreach statementloops through eachentry in an array

Page 35: Introduction to Perl for Bioinformatics

The @ARGV array

• A special array is @ARGV• This contains the command-line

arguments when the program is invoked at the Unix prompt

• It's a way for the user to pass information into the program

Page 36: Introduction to Perl for Bioinformatics

Exploding a sequence into an array

• The programming language C treats all strings as arrays

$dna = "accggtgtgcg";print "String: $dna\n";@array = split( //, $dna);print "Array: @array\n";

String: accggtgtgcgArray: a c c g g t g t g c g

The split statement turnsa string into an array.Here, it splits after everycharacter, but we can alsosplit at specific points,like a restriction enzyme

Page 37: Introduction to Perl for Bioinformatics

Taking a slice of an array

• The syntax @x[i,j,k...] returns a (3-element) array containing elements i,j,k... of array @x

@nucleotides = ('a', 'c', 'g', 't');@purines = @nucleotides[0,2];@pyrimidines = @nucleotides[1,3];print "Nucleotides: @nucleotides\n";print "Purines: @purines\n";print "Pyrimidines: @pyrimidines\n";

Nucleotides: a c g tPurines: a gPyrimidines: c t

Page 38: Introduction to Perl for Bioinformatics

Finding elements in an array

• The grep command is used to select some elements from an array

• The statement grep(EXPR,LIST) returns all elements of LIST for which EXPR evaluates to true (when $_ is set to the appropriate element)

• e.g. select all numbers over 100:

@numbers = (101, 235, 10, 50, 100, 66, 1005);@numbersOver100 = grep ($_ > 100, @numbers);print "Numbers: @numbers\n";print "Numbers over 100: @numbersOver100\n";

Numbers: 101 235 10 50 100 66 1005Numbers over 100: 101 235 1005

Page 39: Introduction to Perl for Bioinformatics

Applying a function to an array

• The map command applies a function to every element in an array

• Similar syntax to list: map(EXPR,LIST) applies EXPR to every element in LIST

• Example: multiply every number by 3

@numbers = (101, 235, 10, 50, 100, 66, 1005);@numbersTimes3 = map ($_ * 3, @numbers);print "Numbers: @numbers\n";print "Numbers times 3: @numbersTimes3\n";

Numbers: 101 235 10 50 100 66 1005Numbers times 3: 303 705 30 150 300 198 3015

Page 40: Introduction to Perl for Bioinformatics

Review: pattern-matching

• The following code:

prints the string "Found MCB binding site!" if the pattern "ACGCGT" is present in the default variable, $_

• Instead of using $_ we can "bind" the pattern to another variable (e.g. $dna) using this syntax:

• We can replace the first occurrence of ACGCGT with the string _MCB_ using the following syntax:

• We can replace all occurrences by appending a 'g':

if (/ACGCGT/) { print "Found MCB binding site!\n"; }

if ($dna =~ /ACGCGT/) { print "Found MCB binding site!\n"; }

$dna =~ s/ACGCGT/_MCB_/;

$dna =~ s/ACGCGT/_MCB_/g;