Perl for Bioinformatics

8/4/2019

1/158

Programming for Computational Biology

Ian Holmes

Department of Bioengineering

University of California, Berkeley


2/158

Programming languages

Self-contained language

Platform-independent

Used to write O/S

C (imperative, procedural)

C++, Java (object-oriented)

Lisp, Haskell (functional); Prolog (logic)

Scripting language

Closely tied to O/S

Perl, Python, Ruby

Domain-specific language

R (statistics)

MatLab (numerics)

SQL (databases)

An O/S typically manages

Devices (see above)

Files & directories

Users & permissions

Processes & signals


3/158

Bioinformatics pipelines often involve chaining together multiple tools


4/158

Perl is the most-used bioinformatics language

Most popular bioinformatics programming languages

Bioinformatics career survey, 2008

Michael Barton


5/158

Pros and Cons of Perl

Reasons for Perl's popularity in bioinformatics (Lincoln Stein)

Perl is remarkably good for slicing, dicing, twisting, wringing, smoothing, summarizing and otherwise mangling text

Perl is forgiving

Perl is component-oriented

Perl is easy to write and fast to develop in

Perl is a good prototyping language

Perl is a good language for Web CGI scripting

Problems with Perl

Hard to read ("there's more than one way to do it", cryptic syntax)

Too forgiving (no strong typing, allows sloppy code)


6/158

Perl overview

Interpreted, not compiled

Fast edit-run-revise cycle

Procedural & imperative

Sequence of instructions (control flow)

Variables, subroutines

Syntax close to C (the de facto standard minimal language)

Weakly typed (unlike C)

Redundant, not minimal ("there's more than one way to do it")

Syntactic sugar

High-level data structures & algorithms

Hashes, arrays

Operating System support (files, processes, signals)

String manipulation


7/158

Goals of this course

Concepts of computer programming

Rudimentary Perl (widely-used language)

"How Perl saved the Human Genome Project" (Lincoln Stein)

Introduction to Bioinformatics file formats

Practical data-handling algorithms

Exposure to Bioinformatics software


8/158

Structural elements

Learning Perl, Schwartz et al

ISBN 0-596-10105-8 O'Reilly

"There's more than one way to do it"

Q: But which is best? A: TESTS

Tests (above) supersede texts (below):

The main program

The program output

Files are shown in yellow

Filename

Standard output stream

Terminal input

Description of test conditions

Terminal session


9/158

General principles of programming

Make incremental changes

Test everything you do

the edit-run-revise cycle

Write so that others can read it

(when possible, write with others)

Think before you write

Use a good text editor

Good debugging style


10/158

Perl for Bioinformatics

Section 1: Scalars and Loops

Ian Holmes

Department of Bioengineering

University of California, Berkeley


11/158

Perl basics

Basic syntax of a Perl program:

# Elementary Perl program

print "Hello World\n";

"\n" means new line

The print statement tells Perl to print the following stuff to the screen

Single or double quotes enclose a "string literal" (double quotes are "interpolated")

All statements end with a semicolon

Lines beginning with "#" are comments, and are ignored by Perl

Hello World


12/158

Variables

We can tell Perl to "remember" a particular value, using the assignment operator =:

$x = 3;
print $x;

3

The $x is referred to as a "scalar variable".

Variable names can contain alphabetic characters, numbers (but not at the start of the name), and underscore symbols "_"

Scalar variable names are all prefixed with the dollar symbol.

$x = "ACGCGT";
print $x;

ACGCGT

Binding site for yeast transcription factor MCB


13/158

Arithmetic operations

Basic operators are + - / * %

Can also use += -= /= *= ++ --

$x = 14;
$y = 3;
print "Sum: ", $x + $y, "\n";
print "Product: ", $x * $y, "\n";
print "Remainder: ", $x % $y, "\n";

Sum: 17
Product: 42
Remainder: 2

$x = 5;
print "x started as $x\n";
$x = $x * 2;      # could write $x *= 2;
print "Then x was $x\n";
$x = $x + 1;      # could write $x += 1; or even ++$x;
print "Finally x was $x\n";

x started as 5
Then x was 10
Finally x was 11


14/158

String operations

Concatenation: . and .=

Can find the length of a string using the function length($x)

$a = "pan";
$b = "cake";
$a = $a . $b;
print $a;

pancake

$a = "soap";
$b = "dish";
$a .= $b;
print $a;

soapdish

$mcb = "ACGCGT";
print "Length of $mcb is ", length($mcb);

Length of ACGCGT is 6


15/158

More string operations

$x = "A simple sentence";
print $x, "\n";
print uc($x), "\n";          # convert to upper case
print lc($x), "\n";          # convert to lower case
$y = reverse($x);            # reverse the string
print $y, "\n";
$x =~ tr/i/a/;               # transliterate "i"'s into "a"'s
print $x, "\n";
print length($x), "\n";      # calculate the length of the string

A simple sentence
A SIMPLE SENTENCE
a simple sentence
ecnetnes elpmis A
A sample sentence
17


16/158

Concatenating DNA fragments

$dna1 = "accacgt";
$dna2 = "taggtct";
print $dna1 . $dna2;

accacgttaggtct

"Transcribing" DNA to RNA

$dna = "accACgttAGGTct";     # DNA string is a mixture of upper & lower case
$rna = lc($dna);             # make it all lower case
$rna =~ tr/t/u/;             # transliterate "t" to "u"
print $rna;

accacguuaggucu


17/158

Comparison: variables in C are typed

C does not have a basic type for strings, only individual characters.

Strings are built up from more basic elements as arrays of characters (we'll get to arrays later).

Much of this functionality is provided in C and C++ as part of the standard library.


18/158

Conditional blocks

The ability to execute an action contingent on some condition is what distinguishes a computer from a calculator. In Perl, this looks like this:

if (condition) { action } else { alternative }

$x = 149;
$y = 100;
if ($x > $y)
{
    print "$x is greater than $y\n";
}
else
{
    print "$x is not greater than $y\n";
}

149 is greater than 100

These braces { } tell Perl which piece of code is contingent on the condition.


19/158

Conditional operators

Numeric: > >= < <= == !=
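Perl keeps two families of comparison operators: numeric (== != < <= > >=) and string (eq ne lt le gt ge). The sketch below is only an illustration of why the distinction matters; the variable names are made up for this example.

```perl
use strict;
use warnings;

my $p = 10;
my $q = 9;

# Numeric comparison treats the values as numbers: 10 > 9.
my $numeric_greater = ($p > $q) ? 1 : 0;

# String comparison works character by character: "10" sorts before "9"
# because the character "1" comes before the character "9".
my $string_greater = ($p gt $q) ? 1 : 0;

print "numeric: $numeric_greater, string: $string_greater\n";
```

Using a string operator on numbers (or the reverse) is a common source of silent bugs, so it is worth choosing the operator family deliberately.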


20/158

Logical operators

&& means "and", || means "or"

An exclamation mark ! is used to negate what follows

Thus !($x < $y) means the same as ($x >= $y)

In computers, the value zero is often used to represent falsehood, while any non-zero value (e.g. 1) represents truth. Thus:

if (1) { print "1 is true\n"; }

if (0) { print "0 is true\n"; }

if (-99) { print "-99 is true\n"; }

1 is true

-99 is true

$x= 222;

if ($x % 2 == 0 and $x % 3 == 0)

{ print "$x is an even multiple of 3\n"; }

222 is an even multiple of 3


21/158

Loops

Here's how to print out the numbers 1 to 10:

$x = 1;
while ($x <= 10) {
    print $x, "\n";
    ++$x;
}

This is a while loop. The code is executed while the condition is true.


22/158

A common kind of loop

Let's dissect the code of the while loop again:

$x = 1;                  # initialisation
while ($x <= 10) {       # test
    print $x, "\n";
    ++$x;                # increment
}

This form of while loop is common enough to have its own shorthand: the for loop.
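As a rough sketch of the for-loop shorthand, the two loops below count from 1 to 10 in the same way. The bound of 10 and the array names are assumptions chosen for illustration; the results are collected in arrays so that they can be compared.

```perl
use strict;
use warnings;

# The while form: initialisation, test and increment are spread out.
my @from_while;
my $x = 1;                          # initialisation
while ($x <= 10) {                  # test
    push @from_while, $x;
    ++$x;                           # increment
}

# The for form: the same three parts gathered on one line.
my @from_for;
for (my $i = 1; $i <= 10; ++$i) {   # initialisation; test; increment
    push @from_for, $i;
}

print "@from_while\n";
print "@from_for\n";
```

Both loops visit the same values; the for form simply keeps the loop bookkeeping in one place.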


23/158

Loops in C++ are similar to Perl

cout is the standard output stream, part of the standard library. Used in C++ only (C has a complicated printf command)


24/158

defined and undef

The function defined($x) is true if $x has been assigned a value:

if (defined($newvar)) {
    print "newvar is defined\n";
} else {
    print "newvar is not defined\n";
}

newvar is not defined

A variable that has not yet been assigned a value has the special value undef

Often, if you try to do something "illegal" (like reading from a nonexistent file), you end up with undef as a result

C does not have defined or undef. At best, using an uninitialized value will cause a compiler warning; at worst, it will lead to undefined behavior (i.e. disaster)


25/158

Reading a line of data

To read from a file, we first need to open the file and give it a filehandle.

open FILE, "sequence.txt";

This code snippet opens a file called "sequence.txt", and associates it with a filehandle called FILE

Once the file is opened, we can read a single line from it into the scalar $x:

$x = <FILE>;

This reads the next line from the file, including the newline at the end, "\n".

If the end of the file is reached, $x is assigned the special value undef


26/158

Reading an entire file

The following piece of code reads every line in a file and prints it out to the screen:

while (defined ($x = <FILE>)) {
    print $x;
}
close FILE;

This reads a line of data into $x, then checks if $x is defined. If $x is undef, then the file must have ended.

A shorter version of this is as follows:

while ($x = <FILE>) {
    print $x;
}
close FILE;

This is equivalent to defined($x = <FILE>)
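The slides open files without checking whether the open succeeded. A more defensive sketch is shown here, assuming a throwaway temporary file (created with the core File::Temp module) in place of sequence.txt so that the example runs anywhere; the variable names are illustrative.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small throwaway file so that the example is self-contained.
my ($tmp, $filename) = tempfile(UNLINK => 1);
print $tmp "ACGCGT\nTTGACA\n";
close $tmp;

# Three-argument open with a lexical filehandle; die reports why it failed.
open(my $fh, '<', $filename) or die "Cannot open $filename: $!";

my $count = 0;
while (my $line = <$fh>) {
    chomp $line;              # remove the trailing newline
    ++$count;
    print "Line $count: $line\n";
}
close $fh;
```

Checking the result of open means that a missing file stops the program with a clear message instead of silently reading nothing.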


27/158

The default variable, $_

Many operations that take a scalar argument, such as length($x), are assumed to work on $_ if the $x is omitted:

$_ = "Hello";
print;
print length;

Hello5

So we can also read a whole file like this:

while (<FILE>) {
    print;
}
close FILE;

The first line is equivalent to while (defined($_ = <FILE>)) {


28/158

Files in C++ are streams


29/158

Debugging

Most programs don't work first time

Most apparently "working" programs actually aren't

Bugs are cryptic

Debugging is a scientific process

As you gain experience, you will begin to "insure" against bugs with your programming technique


30/158

Mars Climate Orbiter

Mars Climate Orbiter was the third spacecraft to be launched under the Mars Surveyor program to map & explore Mars

Around 2am PDT on September 23, 1999, the spacecraft disappeared behind Mars following a manoeuvre that should have put it into Mars orbit

This failure, along with a subsequent (unexplained) craft loss, cost NASA $327.6 million


31/158

What was the problem?

Following a certain kind of engine burn, designed to stabilise the craft's angular momentum, the Orbiter sent data to the ground station, so that its trajectory could be recalibrated (by a software module called SM_FORCE)

The Orbiter also internally recomputed its trajectory following a burn

The Orbiter's internal software module used metric units (Newton-seconds) while the ground station's SM_FORCE module used Imperial (pound-seconds). The specification called for metric units

The manoeuvre executed on September 23rd was therefore computed using the wrong trajectory, taking the Orbiter too low into Mars' atmosphere


32/158

Why was the bug not detected?

The spacecraft periodically transmitted its computed trajectory to the ground station. A quick comparison between the two trajectories would have revealed the error. However,

Other bugs in the SM_FORCE module prevented its use until 4 months into the flight

The ground crew weren't aware that trajectory data from the spacecraft were available

Discrepancies were noticed, but were only reported informally by email, and not taken seriously enough

i.e. incomplete testing; ignoring unexpected results; institutional complacency.


33/158

Debugging is scientific

Finding bugs can be very frustrating

A job that you thought was nearly finished, for which you have budgeted a certain amount of time, stretches out indefinitely

Often you may have no idea what's wrong

If you think of debugging as a scientific problem and approach it systematically, much of the pain disappears


34/158

The Process of Debugging

Step 1: Identify the Problem

observe it (e.g. because a test fails)

reproduce it (so you can make it happen 100% of the time)

isolate it (strip it down to its bare essentials)


35/158


Step 2: Gather Information

record all symptoms (disparate symptoms may be related; if not, you should tackle them systematically one by one)

follow the flow of control of the program (many ways of doing this: e.g. you can use a "debugger" to watch the variables; the time-honored method, and definitely the best, is to insert debugging print statements into your code)

note recent changes (usually the cause of bugs)

look for similar problems (can ask other developers)

check "machine environment" (e.g. if you move to a different computer, does it have less memory? less disk space?)


36/158


Step 3: Form a Hypothesis

try to isolate the code that causes the problem

e.g. strip away all "working" code that is not essential to reproducing the bug

if you can't find the bug, use a systematic "deletion" strategy (cf. genetics!) until you have narrowed down the problem

what should that code be doing?

this can be seen as a continuation of Step 1 ("identify the problem")

debugging is a cyclic, interactive process


37/158


Step 4: Test Your Hypothesis

do not skip this step!

often the hypothesis will come to you in a flash of inspiration, but you still need to test it

for simple bugs, testing just means fixing the problem

for more complex bugs, you'll need to proceed to the next steps...


38/158


Step 5: Propose a Solution

keep it minimal: try not to redesign all the code unless this is absolutely necessary

then again, do not flinch from redesign if this is what is called for

Step 6: Test the Solution

also make sure you didn't break existing code


39/158

Process of Debugging: Summary

Step 1: Identify the Problem

Step 2: Gather Information

Step 3: Form a Hypothesis

Step 4: Test Your Hypothesis

Step 5: Propose a Solution

Step 6: Test the Solution


40/158

Proactive debugging

Place consistency checks in your code

also called assertions

Put comments in your code

this saves time when debugging

Comment known (and fixed) bugs

keep a record of what you've fixed

Put log messages into your code

you can make these optional (e.g. comment them out); having them there can save lots of time
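One possible way to write a consistency check and an optional log message in Perl is sketched below. The helpers assert_dna and log_msg and the $debug flag are hypothetical names invented for this illustration; they are not part of Perl or of the course code.

```perl
use strict;
use warnings;

# Hypothetical assertion: stop the program if a sequence contains
# anything other than the four DNA letters.
sub assert_dna {
    my ($seq) = @_;
    die "Assertion failed: '$seq' is not a DNA sequence\n"
        unless $seq =~ /^[acgtACGT]+$/;
    return 1;
}

my $debug = 0;   # set to 1 to switch the log messages on

# Hypothetical logger: prints to standard error only when $debug is set.
sub log_msg {
    my ($msg) = @_;
    print STDERR "LOG: $msg\n" if $debug;
}

assert_dna("ACGCGT");
log_msg("ACGCGT passed the consistency check");

# eval traps the die, so the failure can be shown without stopping the script.
my $ok = eval { assert_dna("ACGXGT"); 1 };
print "Second check ", ($ok ? "passed\n" : "failed: $@");
```

A flag such as $debug is an alternative to commenting log messages out: the messages stay in the code and can be switched on when a bug appears.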


41/158

Perl for Bioinformatics

Section 2: Sequences and Arrays


42/158

Summary: scalars and loops

Assignment operator

Arithmetic operations

String operations

Conditional tests

Logical operators

Loops

defined and undef

Reading a file

$x= 5;

$y =$x * 3;

if ($y > 10) { print $s; }

$s = "Value of y is " .$y;

if ($y>10 && $s eq "") { exit; }

for ($x=1; $x


43/158

Pattern-matching

A very sophisticated kind of logical test is to ask whether a string contains a pattern

e.g. does a yeast promoter sequence contain the MCB binding site, ACGCGT?

$name = "YBR007C";
$dna = "TAATAAAAAACGCGTTGTCG";
if ($dna =~ /ACGCGT/)
{
    print "$name has MCB!\n";
}

YBR007C has MCB!

$dna holds the 20 bases upstream of the yeast gene YBR007C

=~ is the pattern binding operator

/ACGCGT/ is the pattern for the MCB binding site


44/158

FASTA format

A format for storing multiple named sequences in a single file

This file contains 3' UTRs for Drosophila genes CG11604, CG11455 and CG11488

>CG11604
TAGTTATAGCGTGAGTTAGT
TGTAAAGGAACGTGAAAGAT
AAATACATTTTCAATACC
>CG11455
TAGACGGAGACCCGTTTTTC
TTGGTTAGTTTCACATTGTA
AAACTGCAAATTGTGTAAAA
ATAAAATGAGAAACAATTCT
GGT
>CG11488
TAGAAGTCAAAAAAGTCAAG
TTTGTTATATAACAAGAAAT
CAAAAATTATATAATTGTTT
TTCACTCT

Name of sequence is preceded by > symbol

NB sequences can span multiple lines

Call this file fly3utr.txt


45/158

Printing all sequence names in a FASTA database

open FILE, "fly3utr.txt";
while ($x = <FILE>) {
    if ($x =~ />/) {
        print $x;
    }
}
close FILE;

>CG11604
>CG11455
>CG11488

The key to this program is this block:

if ($x =~ />/) {
    print $x;
}

This pattern matches (and returns TRUE) if the variable $x contains the FASTA sequence-name symbol >

This line prints $x if the pattern matched


46/158

Pattern replacement

open FILE, "fly3utr.txt";
while (<FILE>) {
    if (/>/) {
        s/>//;
        print;
    }
}
close FILE;

CG11604
CG11455
CG11488

New statement removes the ">"

The new statement s/>// is an example of a replacement.

General form: s/OLD/NEW/ replaces OLD with NEW

Thus s/>// replaces ">" with "" (the empty string)

$_ is the default variable for these operations


47/158

Finding all sequence lengths

Flowchart:

Start

Open file

Read line

End of file? If yes: print last sequence length, then stop

If no: remove \n newline character at end of line

Line starts with > ?

If yes (sequence name): unless this is the first sequence, print last sequence length; then record the name and reset running total of current sequence length

If no (sequence data): add length of line to running total

Return to "Read line"


48/158

Finding all sequence lengths

open FILE, "fly3utr.txt";
while (<FILE>) {
    chomp;
    if (/>/) {
        if (defined $len) {
            print "$name $len\n";
        }
        $name = $_;
        $len = 0;
    } else {
        $len += length;
    }
}
print "$name $len\n";
close FILE;

>CG11604 58
>CG11455 83
>CG11488 68

The chomp statement trims the newline character "\n" off the end of the default variable, $_.

Try it without this and see what happens, and if you can work out why

>CG11604

TAGTTATAGCGTGAGTTAGT

TGTAAAGGAACGTGAAAGAT

AAATACATTTTCAATACC

>CG11455

TAGACGGAGACCCGTTTTTC

TTGGTTAGTTTCACATTGTA

AAACTGCAAATTGTGTAAAA

ATAAAATGAGAAACAATTCT

GGT

>CG11488

TAGAAGTCAAAAAAGTCAAG

TTTGTTATATAACAAGAAAT

CAAAAATTATATAATTGTTT

TTCACTCT


49/158

Reverse complementing DNA

A common operation due to double-helix symmetry of DNA

$dna = "accACgttAGgtct";
$revcomp = lc($dna);              # start by making string lower case again; this is generally good practice
$revcomp = reverse($revcomp);     # reverse the string
$revcomp =~ tr/acgt/tgca/;        # replace 'a' with 't', 'c' with 'g', 'g' with 'c' and 't' with 'a'
print $revcomp;

agacctaacgtggt


50/158

Running external programs

Suppose you want to get the output of another program into a variable.

e.g. the following shell command prints the number of lines in the file myfile.txt

wc -l myfile.txt

You can execute a command like this from Perl using system

system "wc -l myfile.txt";

but that only prints the result to standard output; it does not give you access to the output of the command from within the Perl program.

One way to get the output is by enclosing the command in backticks:

$lines = `wc -l myfile.txt`;

An (equivalent) way is to open a pipe from the command:

open FILEHANDLE, "wc -l myfile.txt |";
$lines = <FILEHANDLE>;


51/158

Arrays

An array is a variable holding a list of items

@nucleotides = ('a', 'c', 'g', 't');
print "Nucleotides: @nucleotides\n";

Nucleotides: a c g t

We can think of this as a list with 4 entries

element 0: a
element 1: c
element 2: g
element 3: t

the array is the set of all four elements

Note that the element indices start at zero.


52/158

Array literals

There are several, equally valid ways to assign an entire array at once.

@a = (1,2,3,4,5);
print "a = @a\n";
@b = ('a','c','g','t');
print "b = @b\n";
@c = 1..5;
print "c = @c\n";
@d = qw(a c g t);
print "d = @d\n";

a = 1 2 3 4 5
b = a c g t
c = 1 2 3 4 5
d = a c g t

The first form is the most common: a comma-separated list, delimited by parentheses


53/158

Accessing arrays

To access array elements, use square brackets; e.g. $x[0] means "element zero of array @x"

@x = ('a', 'c', 'g', 't');
print $x[0], "\n";
$i = 2;
print $x[$i], "\n";

a
g

Remember, element indices start at zero!

If you use an array @x in a scalar context, such as @x+0, then Perl assumes that you wanted the length of the array.

@x = ('a', 'c', 'g', 't');
print @x + 0;

4


54/158

Array operations

You can sort and reverse arrays...

@x = ('a', 't', 'g', 'c');
@y = sort @x;
@z = reverse @y;
print "x = @x\n";
print "y = @y\n";
print "z = @z\n";

x = a t g c
y = a c g t
z = t g c a

You can read the entire contents of a file into an array (each line of the file becomes an element of the array)

@x = <FILE>;


55/158

push, pop, shift, unshift

@x = ("Fame", "Power", "Money");
print "I started with @x\n";
$y = pop @x;
push @x, "Success";
print "Then I had @x\n";
$z = shift @x;
unshift @x, "Glamour";
print "Now I have @x\n";
print "I lost $y and $z\n";

I started with Fame Power Money
Then I had Fame Power Success
Now I have Glamour Power Success
I lost Money and Fame

pop removes the last element of an array

push adds an element to the end of an array

shift removes the first element of an array

unshift adds an element to the start of an array


56/158

foreach

Finding the total of a list of numbers:

@val = (4, 19, 1, 100, 125, 10);
$total = 0;
foreach $x (@val) {
    $total += $x;
}
print $total;

259

The foreach statement loops through each entry in an array

Equivalent to:

@val = (4, 19, 1, 100, 125, 10);
$total = 0;
for ($i = 0; $i < @val; ++$i) {
    $total += $val[$i];
}
print $total;

259


57/158

Iterator comparison

foreach

for

iMac G5 1.8GHz 512MB, Mac OS X 10.4.2, perl v5.8.6 built for darwin-thread-multi-2level

[yoko:~] yam% time perl -e 'for ($n = 1; $n


58/158

The @ARGV array

A special array is @ARGV

This contains the command-line arguments when the program is invoked at the Unix prompt

It's a way for the user to pass information into the program
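A small sketch of reading @ARGV follows. The helper describe_args is a hypothetical name introduced only for this example, and the script name args.pl mentioned afterwards is likewise made up.

```perl
use strict;
use warnings;

# Hypothetical helper that describes a list of command-line arguments.
sub describe_args {
    my @args = @_;
    return "No arguments given" if @args == 0;
    return "Got " . scalar(@args) . " argument(s): " . join(", ", @args);
}

# @ARGV holds whatever followed the script name on the command line.
print describe_args(@ARGV), "\n";
```

If this were saved as args.pl and run as perl args.pl fly3utr.txt CG11604, then $ARGV[0] would be "fly3utr.txt" and $ARGV[1] would be "CG11604".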


59/158

Exploding a sequence into an array

The programming language C treats all strings as arrays

$dna = "accggtgtgcg";
print "String: $dna\n";
@array = split //, $dna;
print "Array: @array\n";

String: accggtgtgcg
Array: a c c g g t g t g c g

The split statement turns a string into an array. Here, it splits after every character, but we can also split at specific points, like a restriction enzyme
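To make the restriction-enzyme analogy concrete, here is one possible sketch that cuts a made-up sequence at EcoRI sites (GAATTC, cut between the G and the first A). The sequence is invented for illustration, and the zero-width lookbehind/lookahead pattern is just one way of splitting without discarding any bases.

```perl
use strict;
use warnings;

my $dna = "ccGAATTCggtGAATTCaa";   # made-up sequence with two EcoRI sites

# Zero-width pattern: the position just after a G that is followed by AATTC.
# Nothing is consumed by the match, so no bases are lost from the fragments.
my @fragments = split /(?<=G)(?=AATTC)/, $dna;

print join(" | ", @fragments), "\n";
```

A plain split /GAATTC/, $dna would also cut at each site, but it would throw the recognition sequence itself away.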


60/158

Taking a slice of an array

The syntax @x[i,j,k...] returns a (3-element) array containing elements i,j,k... of array @x

@nucleotides = ('a', 'c', 'g', 't');
@purines = @nucleotides[0,2];
@pyrimidines = @nucleotides[1,3];

print "Nucleotides: @nucleotides\n";

print "Purines: @purines\n";

print "Pyrimidines: @pyrimidines\n";

Nucleotides: a c g t

Purines: a g

Pyrimidines: c t


61/158

Finding elements in an array

The grep command is used to select some elements from an array

The statement grep(EXPR,LIST) returns all elements of LIST for which EXPR evaluates to true (when $_ is set to the appropriate element)

e.g. select all numbers over 100:

@numbers = (101, 235, 10, 50, 100, 66, 1005);

@numbersOver100 = grep ($_ > 100, @numbers);

print "Numbers: @numbers\n";

print "Numbers over 100: @numbersOver100\n";

Numbers: 101 235 10 50 100 66 1005

Numbers over 100: 101 235 1005


62/158

Applying a function to an array

The map command applies a function to every element in an array

Similar syntax to grep: map(EXPR,LIST) applies EXPR to every element in LIST

Example: multiply every number by 3

@numbers = (101, 235, 10, 50, 100, 66, 1005);

@numbersTimes3 = map ($_ * 3, @numbers);

print "Numbers: @numbers\n";

print "Numbers times 3: @numbersTimes3\n";

Numbers: 101 235 10 50 100 66 1005

Numbers times 3: 303 705 30 150 300 198 3015


63/158


Section 3: Patterns and Subroutines


64/158

Review: pattern-matching

The following code:

if (/ACGCGT/) {
    print "Found MCB binding site!\n";
}

prints the string "Found MCB binding site!" if the pattern "ACGCGT" is present in the default variable, $_

Instead of using $_ we can "bind" the pattern to another variable (e.g. $dna) using this syntax:

if ($dna =~ /ACGCGT/) {
    print "Found MCB binding site!\n";
}

We can replace the first occurrence of ACGCGT with the string _MCB_ using the following syntax:

$dna =~ s/ACGCGT/_MCB_/;

We can replace all occurrences by appending a 'g':

$dna =~ s/ACGCGT/_MCB_/g;


65/158

Regular expressions

Perl provides a pattern-matching engine

Patterns are called regular expressions

They are extremely powerful: probably Perl's strongest feature, compared to other languages

Often called "regexps" for short


66/158


Motivation: N-glycosylation motif

Common post-translational modification in ER

Membrane & secreted proteins

Purpose: folding, stability, cell-cell adhesion

Attachment of a 14-sugar oligosaccharide

Occurs at asparagine residues with the consensus sequence NX1X2, where

X1 can be anything (but proline & aspartic acid inhibit)

X2 is serine or threonine

Can we detect potential N-glycosylation sites in a protein sequence?


67/158

Interlude: interactive testing

This script echoes input from the keyboard

while (<STDIN>) {
    print;
}

The special filehandle STDIN means "standard input", i.e. the keyboard

Sometimes (e.g. in Windows IDEs) the output isn't printed until the script stops

This is because of buffering.

To stop buffering, set to "autoflush":

$| = 1;
while (<STDIN>) {
    print;
}

$| is the autoflush flag


68/158

Matching alternative characters

[ACGT] matches one A, C, G or T:

while (<STDIN>) {
    print "Matched: $_" if /[ACGT]/;
}

this is not printed
This is printed
Matched: This is printed

Lines without the "Matched:" prefix are input text

In general square brackets denote a set of alternative possibilities

Use - to match a range of characters: [A-Z]

. matches anything

\s matches spaces or tabs

\S is anything that's not a space or tab

[^X] matches anything but X


69/158

Matching alternative strings

/(this|that)/ matches "this" or "that"

...and is equivalent to /th(is|at)/

while (<STDIN>) {
    print "Matched: $_" if /this|that|other/;
}

Won't match THIS
Will match this
Matched: Will match this
Won't match ThE oThER
Will match the other
Matched: Will match the other

Remember, regexps are case-sensitive


70/158

Matching multiple characters

x* matches zero or more x's (greedily)

x*? matches zero or more x's (sparingly)

x+ matches one or more x's (greedily)

x{n} matches n x's

x{m,n} matches from m to n x's

Word and string boundaries

^ matches the start of a string

$ matches the end of a string

\b matches word boundaries
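The difference between greedy and sparing matching is easiest to see side by side. This is a minimal sketch using made-up strings; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = "<gene><exon>";

# Greedy: .* grabs as much as possible, so the match runs to the last ">".
my ($greedy) = $text =~ /<(.*)>/;

# Sparing: .*? grabs as little as possible, so the match stops at the first ">".
my ($sparing) = $text =~ /<(.*?)>/;

# {m,n} bounds the repeat count: here, a run of 2 to 4 a's.
my ($run) = "caaaaat" =~ /(a{2,4})/;

print "greedy: $greedy\n";
print "sparing: $sparing\n";
print "run: $run\n";
```

The greedy capture is "gene><exon", the sparing capture is "gene", and the bounded repeat takes four of the five a's.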


71/158

"Escaping" special characters

\ is used to "escape" characters that otherwise have meaning in a regexp

so \[ matches the character "["

if not escaped, "[" signifies the start of a list of alternative characters, as in [ACGT]


72/158

Retrieving what was matched

If parts of the pattern are enclosed by parentheses, then (following the match) those parts can be retrieved from the scalars $1, $2...

e.g. /the (\S+) sat on the (\S+) drinking (\S+)/ matches "the cat sat on the mat drinking milk" with $1="cat", $2="mat", $3="milk"

$| = 1;
while (<STDIN>) {
    if (/(a|the) (\S+)/i) {
        print "Noun: $2\n";
    }
}

Pick up the cup
Noun: cup
Sit on a chair
Noun: chair
Put the milk in the tea
Noun: milk

Note: only the first "the" is picked up by this regexp


73/158

Variations and modifiers

//i ignores upper/lower case distinctions:

while (<STDIN>) {
    print "Matched: $_" if /pattern/i;
}

pAttERn
Matched: pAttERn

//g starts search where last match left off

pos($_) is index of first character after last match

s/OLD/NEW/ replaces first "OLD" with "NEW"

s/OLD/NEW/g is "global" (i.e. replaces every occurrence of "OLD" in the string)


74/158

N-glycosylation site detector

$| = 1;
while (<STDIN>) {
    $_ = uc $_;                        # convert to upper case
    while (/(N[^PD][ST])/g) {          # 'g' modifier gets all matches in sequence
        print "Potential N-glycosylation sequence ",
            $1, " at residue ", pos() - 2, "\n";
    }
}

pos() is index of first residue after match, starting at zero; so, pos()-2 is index of first residue of three-residue match, starting at one.

The main regular expression

while (/(N[^PD][ST])/g) { ... }
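For testing without typing at the keyboard, the same regular expression can be wrapped in a subroutine. This is a sketch only: the name find_nglyc_sites and the peptide are made up for illustration, and subroutines themselves are introduced a little later in the course.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one string per site, of the form
# "position motif", with positions counted from one.
sub find_nglyc_sites {
    my ($protein) = @_;
    $protein = uc $protein;
    my @sites;
    while ($protein =~ /(N[^PD][ST])/g) {
        # pos() is the zero-based index just past the three-residue match.
        push @sites, (pos($protein) - 2) . " " . $1;
    }
    return @sites;
}

my @sites = find_nglyc_sites("mknvsapnpsgangtq");   # made-up peptide
foreach my $site (@sites) {
    print "Potential N-glycosylation site: $site\n";
}
```

In this peptide NVS and NGT are reported, while NPS is skipped because proline sits in the middle position.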


75/158

PROSITE and Pfam

PROSITE: a database of regular expressions for protein families, domains and motifs

Pfam: a database of Hidden Markov Models (HMMs), equivalent to probabilistic regular expressions


76/158

Subroutines

Often, we can identify self-contained tasks that occur in so many different places we may want to separate their description from the rest of our program.

Code for such a task is called a subroutine.

Examples of such tasks:

finding the length of a sequence (NB: Perl provides the subroutine length($x) to do this already)

reverse complementing a sequence

finding the mean of a list of numbers


77/158

Finding all sequence lengths (2)

open FILE, "fly3utr.txt";
while (<FILE>) {
    chomp;
    if (/>/) {
        print_name_and_len();       # subroutine call
        $name = $_;
        $len = 0;
    } else {
        $len += length;
    }
}
print_name_and_len();               # subroutine call
close FILE;

sub print_name_and_len {
    if (defined ($name)) {
        print "$name $len\n";
    }
}

Subroutine definition; code in here is not executed unless subroutine is called


78/158

Reverse complement subroutine

sub revcomp {
    my $rev;
    $rev = reverse ($dna);
    $rev =~ tr/acgt/tgca/;
    return $rev;
}

$rev = 12345;
$dna = "accggcatg";
$rev1 = revcomp();
print "Revcomp of $dna is $rev1\n";
$dna = "cggcgt";
$rev2 = revcomp();
print "Revcomp of $dna is $rev2\n";
print "Value of rev is $rev\n";

Revcomp of accggcatg is catgccggt
Revcomp of cggcgt is acgccg
Value of rev is 12345

Value of $rev is unchanged by calls to revcomp

"my" announces that $rev is local to the subroutine revcomp

"return" announces that the return value of this subroutine is whatever's in $rev


79/158

Revcomp with arguments

sub revcomp {
    my ($dna) = @_;
    my $rev = reverse ($dna);
    $rev =~ tr/acgt/tgca/;
    return $rev;
}

$dna1 = "accggcatg";
$rev1 = revcomp ($dna1);
print "Revcomp of $dna1 is $rev1\n";
$dna2 = "cggcgt";
$rev2 = revcomp ($dna2);
print "Revcomp of $dna2 is $rev2\n";

Revcomp of accggcatg is catgccggt
Revcomp of cggcgt is acgccg

The array @_ holds the arguments to the subroutine (in this case, the sequence to be revcomp'd)

Now we don't have to re-use the same variable for the sequence to be revcomp'd


80/158

Mean & standard deviation

@xdata = (1, 5, 1, 12, 3, 4, 6);
($x_mean, $x_sd) = mean_sd (@xdata);
@ydata = (3.2, 1.4, 2.5, 2.4, 3.6, 9.7);
($y_mean, $y_sd) = mean_sd (@ydata);

sub mean_sd {
    my @data = @_;                 # subroutine takes a list of $n numeric arguments
    my $n = @data + 0;
    my $sum = 0;
    my $sqSum = 0;
    foreach my $x (@data) {
        $sum += $x;
        $sqSum += $x * $x;
    }
    my $mean = $sum / $n;
    my $variance = $sqSum / $n - $mean * $mean;
    my $sd = sqrt ($variance);     # square root
    return ($mean, $sd);           # subroutine returns a two-element list: (mean, sd)
}


81/158

Maximum element of an array

Subroutine to find the largest entry in an array

@num = (1, 5, 1, 12, 3, 4, 6);
$max = find_max (@num);
print "Numbers: @num\n";
print "Maximum: $max\n";

sub find_max {
    my @data = @_;
    my $max = pop @data;
    foreach my $x (@data) {
        if ($x > $max) {
            $max = $x;
        }
    }
    return $max;
}

Numbers: 1 5 1 12 3 4 6

Maximum: 12


82/158

Including variables in patterns

Subroutine to find number of instances of a given binding site in a sequence

$dna = "ACGCGTAAGTCGGCACGCGTACGCGT";
$mcb = "ACGCGT";
print "$dna has ", count_matches ($mcb, $dna),
    " matches to $mcb\n";

sub count_matches {
    my ($pattern, $text) = @_;
    my $n = 0;
    while ($text =~ /$pattern/g) { ++$n }
    return $n;
}

ACGCGTAAGTCGGCACGCGTACGCGT has 3 matches to ACGCGT


83/158


Section 4: Hashes


84/158

Data structures

Suppose we have a file containing a table of Drosophila gene names and cellular compartments, one pair on each line:

Cyp12a5 Mitochondrion
MRG15 Nucleus
Cop Golgi
bor Cytoplasm
Bx42 Nucleus

Suppose this file is in "genecomp.txt"


85/158

Reading a table of data

We can split each line into a 2-element array using the split command. This breaks the line at each space:

open FILE, "genecomp.txt";
while (<FILE>) {
    ($g, $c) = split;
    push @gene, $g;
    push @comp, $c;
}
close FILE;
print "Genes: @gene\n";
print "Compartments: @comp\n";

Genes: Cyp12a5 MRG15 Cop bor Bx42
Compartments: Mitochondrion Nucleus Golgi Cytoplasm Nucleus

The opposite of split is join, which makes a scalar from an array:

print join (" and ", @gene);

Cyp12a5 and MRG15 and Cop and bor and Bx42


86/158

Finding an entry in a table

The following code assumes that we've already read in the table from the file:

$geneToFind = shift @ARGV;
print "Searching for gene $geneToFind\n";
for ($i = 0; $i < @gene; ++$i) {
    if ($gene[$i] eq $geneToFind) {
        print "Gene: $gene[$i]\n";
        print "Compartment: $comp[$i]\n";
        exit;
    }
}
print "Couldn't find gene\n";

Example: $ARGV[0] = "Cop"

Searching for gene Cop
Gene: Cop
Compartment: Golgi


87/158

Binary search

The previous algorithm is inefficient. If there are N entries in the list, then on average we have to search through (N+1)/2 entries to find the one we want.

For the full Drosophila genome, N=12,000. This is painfully slow.

An alternative is the Binary Search algorithm:

Start with a sorted list.

Compare the middle element with the one we want. Pick the half of the list that contains our element.

Iterate this procedure to "home in" on the right element. This takes around log2(N) steps.
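The binary search procedure just described can be sketched in Perl as follows. The subroutine name binary_search is an assumption made for this illustration, and the gene names are taken from the example table.

```perl
use strict;
use warnings;

# Returns the index of $target in the sorted list, or -1 if it is absent.
sub binary_search {
    my ($target, @sorted) = @_;
    my ($lo, $hi) = (0, $#sorted);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);          # middle element
        my $cmp = $target cmp $sorted[$mid];     # string comparison: -1, 0 or 1
        if    ($cmp == 0) { return $mid; }
        elsif ($cmp < 0)  { $hi = $mid - 1; }    # target is in the lower half
        else              { $lo = $mid + 1; }    # target is in the upper half
    }
    return -1;
}

# The list must be sorted with the same ordering that cmp uses.
my @gene = sort ("Cyp12a5", "MRG15", "Cop", "bor", "Bx42");
print "Sorted genes: @gene\n";
print "Cop is at index ", binary_search("Cop", @gene), "\n";
```

Passing the whole list to the subroutine copies it, which itself takes N steps; a serious implementation would pass a reference to the array instead, a feature not yet covered at this point in the course.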


88/158

Associative arrays (hashes)

Implementing algorithms like binary search is a common task in languages like C.

Conveniently, Perl provides a type of array called an associative array (also called a hash) that is pre-indexed for quick search.

An associative array is a set of key-value pairs (like our gene-compartment table)

$comp{"Cop"} = "Golgi";

Curly braces {} are used to index an associative array


89/158

Reading a table using hashes

open FILE, "genecomp.txt";

while (<FILE>) {

($g, $c)= split;

$comp{$g} =$c;

}

$geneToFind = shift @ARGV;

print "Gene: $geneToFind\n";

print "Compartment: ", $comp{$geneToFind}, "\n";

...with $ARGV[0] = "Cop" as before:

Gene: Cop

Compartment: Golgi
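A hash lookup for a key that is not in the table yields an undefined value rather than an error, so the program above prints an empty compartment for an unknown gene. A small sketch of how the miss could be reported, using Perl's built-in exists test (the subroutine name describe_gene is hypothetical):

```perl
use strict;
use warnings;

my %comp = (Cyp12a5 => "Mitochondrion", MRG15 => "Nucleus", Cop => "Golgi",
            bor => "Cytoplasm", Bx42 => "Nucleus");

# Return a report for one gene, or a "not found" message if the key is absent.
sub describe_gene {
    my ($gene) = @_;
    if (exists $comp{$gene}) {
        return "Gene: $gene\nCompartment: $comp{$gene}\n";
    }
    return "Couldn't find gene $gene\n";
}

print describe_gene ("Cop");
print describe_gene ("NoSuchGene");
```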


90/158

Reading a FASTA file into a hash

sub read_FASTA {

my ($filename)=@_;

my (%name2seq, $name, $seq);

open FILE, $filename;

while (<FILE>) {

chomp;

if (/>/) {

s/>//;

if (defined $name) {

$name2seq{$name} =$seq;

}

$name =$_;

$seq = "";

} else {

$seq .= $_;

}

}

$name2seq{$name} =$seq;

close FILE;

return %name2seq;

}


91/158

Formatted output of sequences

sub print_seq {

my ($name, $seq) = @_;

print ">$name\n";

my $width = 50;

for (my $i = 0; $i < length($seq); $i +=$width) {

if ($i +$width > length($seq)) {

$width = length($seq) - $i;

}

print substr ($seq, $i, $width), "\n";

}

}

The term substr($x,$i,$len) returns the substring of $x starting at position $i (counting from 0) with length $len.

For example, substr("Biology",3,3) is "log"

50-column output


92/158

keys and values

keys returns the list of keys in the hash, e.g. names, in the %name2seq hash

values returns the list of values, e.g. sequences, in the %name2seq hash

%name2seq = read_FASTA ("fly3utr.txt");

print "Sequence names: ",

join (" ", keys (%name2seq)), "\n";

my $len = 0;

foreach $seq (values %name2seq) {

$len += length ($seq);

}

print "Total length: $len\n";

Sequence names: CG11488 CG11604 CG11455

Total length: 210


93/158

Files of sequence names

Easy way to specify a subset of a given FASTA database

Each line is the name of a sequence in a given database, e.g.

CG1167

CG685

CG1041

CG1043


94/158

Get named sequences

Given a FASTA database and a "file of sequence names", print every named sequence:

($fasta, $fosn)=@ARGV;

%name2seq = read_FASTA ($fasta);

open FILE, $fosn;

while ($name = <FILE>) {

chomp $name;

$seq =$name2seq{$name};

if (defined $seq) {

print_seq ($name, $seq);

} else {

warn "Can't find sequence: $name. ",

"Known sequences: ",

join (" ", keys %name2seq), "\n";

}

}

close FILE;


95/158

Intersection of two sets

Two files of sequence names:

fosn1.txt

CG1167
CG685
CG1041
CG1043

fosn2.txt

CG215
CG1041
CG483
CG1167
CG1163

What is the overlap?

Find intersection using hashes:

open FILE1, "fosn1.txt";
while (<FILE1>) { $gotName{$_} = 1; }
close FILE1;

open FILE2, "fosn2.txt";
while (<FILE2>) {
print if $gotName{$_};
}
close FILE2;

CG1041
CG1167


96/158

Assigning hashes

A hash can be assigned directly, as a list of "key=>value" pairs:

%comp = ('Cyp12a5' => 'Mitochondrion',

'MRG15'=> 'Nucleus',

'Cop' => 'Golgi',

'bor' => 'Cytoplasm',

'Bx42' => 'Nucleus');

print "keys: ", join(";",keys(%comp)), "\n";

print "values: ", join(";",values(%comp)), "\n";

keys: bor;Cop;Bx42;Cyp12a5;MRG15

values: Cytoplasm;Golgi;Nucleus;Mitochondrion;Nucleus



97/158

The genetic code as a hash

%aa = ('ttt'=>'F', 'tct'=>'S', 'tat'=>'Y', 'tgt'=>'C',

'ttc'=>'F', 'tcc'=>'S', 'tac'=>'Y', 'tgc'=>'C',

'tta'=>'L', 'tca'=>'S', 'taa'=>'!', 'tga'=>'!',

'ttg'=>'L', 'tcg'=>'S', 'tag'=>'!', 'tgg'=>'W',

'ctt'=>'L', 'cct'=>'P', 'cat'=>'H', 'cgt'=>'R',

'ctc'=>'L', 'ccc'=>'P', 'cac'=>'H', 'cgc'=>'R',

'cta'=>'L', 'cca'=>'P', 'caa'=>'Q', 'cga'=>'R',

'ctg'=>'L', 'ccg'=>'P', 'cag'=>'Q', 'cgg'=>'R',

'att'=>'I', 'act'=>'T', 'aat'=>'N', 'agt'=>'S',

'atc'=>'I', 'acc'=>'T', 'aac'=>'N', 'agc'=>'S',

'ata'=>'I', 'aca'=>'T', 'aaa'=>'K', 'aga'=>'R',

'atg'=>'M', 'acg'=>'T', 'aag'=>'K', 'agg'=>'R',

'gtt'=>'V', 'gct'=>'A', 'gat'=>'D', 'ggt'=>'G',

'gtc'=>'V', 'gcc'=>'A', 'gac'=>'D', 'ggc'=>'G',

'gta'=>'V', 'gca'=>'A', 'gaa'=>'E', 'gga'=>'G',

'gtg'=>'V', 'gcg'=>'A', 'gag'=>'E', 'ggg'=>'G' );


98/158

Translating: DNA to protein

$prot = translate ("gatgacgaaagttgt");

print $prot;

sub translate {

my ($dna)=@_;

$dna = lc ($dna);

my $len = length ($dna);

if ($len % 3 != 0) {

die "Length $len is not a multiple of 3";

}

my $protein = "";

for (my $i = 0; $i < $len; $i += 3) {

my $codon = substr ($dna, $i, 3);

if (!defined ($aa{$codon})) {

die "Codon $codon is illegal";

}

$protein .=$aa{$codon};

}

return $protein;

}

DDESC


99/158

Counting residue frequencies

%count = count_residues ("gatgacgaaagttgt");

@residues = keys (%count);

foreach $residue (@residues) {

print "$residue: $count{$residue}\n";

}

sub count_residues {

my ($seq)=@_;

my %freq;

$seq = lc ($seq);

for (my $i = 0; $i < length($seq); ++$i) {

my $residue = substr ($seq, $i, 1);

++$freq{$residue};

}

return %freq;

}

g: 5

a: 5

c: 1

t: 4


100/158

Counting N-mer frequencies

%count = count_nmers ("gatgacgaaagttgt", 2);

@nmers = keys (%count);

foreach $nmer (@nmers) {

print "$nmer: $count{$nmer}\n";

}

sub count_nmers {

my ($seq, $n)=@_;

my %freq;

$seq = lc ($seq);

for (my $i = 0; $i
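The listing above breaks off mid-loop. By analogy with count_residues on the previous slide, one way such a counter could be written is sketched below. The loop bound and body are assumptions, not the original slide's code: a window of length $n is slid along the sequence one position at a time, so overlapping N-mers are all counted.

```perl
use strict;
use warnings;

# Count every (overlapping) substring of length $n in $seq.
sub count_nmers {
    my ($seq, $n) = @_;
    my %freq;
    $seq = lc ($seq);
    # The last full window starts at position length($seq) - $n.
    for (my $i = 0; $i <= length ($seq) - $n; ++$i) {
        my $nmer = substr ($seq, $i, $n);
        ++$freq{$nmer};
    }
    return %freq;
}

my %count = count_nmers ("gatgacgaaagttgt", 2);
foreach my $nmer (sort keys %count) {
    print "$nmer: $count{$nmer}\n";
}
```

Sorting the keys is only for reproducible output; the order returned by keys on its own is not predictable.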


101/158

N-mer frequencies for a whole file

my %name2seq = read_FASTA ("fly3utr.txt");

while (($name, $seq) = each %name2seq) {

%count = count_nmers ($seq, 2, %count);

}

@nmers = keys (%count);

foreach $nmer (@nmers) {

print "$nmer: $count{$nmer}\n";

}

sub count_nmers {

my ($seq, $n, %freq)=@_;

$seq = lc ($seq);

for (my $i = 0; $i


102/158

Files and filehandles

Opening a file:

open XYZ, $filename;

Closing a file:

close XYZ;

This XYZ is the filehandle

Reading a line:

$data = <XYZ>;

Reading an array:

@data = <XYZ>;

Printing a line:

print XYZ $data;

Read-only:

open XYZ, "<$filename";

Write-only:

open XYZ, ">$filename";

Test if file exists:

if (-e $filename) {
print "$filename exists!\n";
}
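The operations above can be combined into one short round trip. The sketch below uses the three-argument form of open with a lexical filehandle, which is the style recommended in current Perl documentation; the file name filehandle_demo.txt is a hypothetical scratch file, created and then deleted by the script.

```perl
use strict;
use warnings;

my $filename = "filehandle_demo.txt";   # hypothetical scratch file

# Write-only: ">" creates the file, or truncates it if it already exists.
open (my $out, ">", $filename) or die "Can't write $filename: $!";
print $out "Cop\tGolgi\n";
print $out "bor\tCytoplasm\n";
close ($out) or die $!;

# Read-only: "<". In list context, <$in> returns every remaining line.
open (my $in, "<", $filename) or die "Can't read $filename: $!";
my @data = <$in>;
close ($in);

print "$filename exists and holds ", scalar (@data), " lines\n" if -e $filename;

unlink $filename;   # remove the scratch file
```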


103/158


Section 5: References



104/158

Behind the Scenes

PC = memory + CPU (+ peripherals)

Memory is just a list of bytes (e.g. 2^27 bytes in a machine with 128 MB of RAM)

To a first approximation, this is just one huge array. The array index is called the address

Some of the array elements are interpreted as instruction codes by the CPU

[Figure: the CPU alongside memory, drawn as a column of values at addresses 0, 1, 2, ..., 2^27 - 2, 2^27 - 1]



105/158

Buffer overflow attack


106/158



107/158

Hexadecimal notation

Computers use binary notation, which is tricky to interconvert to/from decimal notation

However, binary notation is big & unwieldy

A compromise is to use hexadecimal

Hexadecimal is base 16 (decimal is base 10, binary is base 2)

The letters A-F are used to represent the extra digits for 10-15

Binary:      Decimal:   Hexadecimal:

101          5          5

1011         11         B

11100        28         1C

101000011    323        143
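The rows of the table can be reproduced with Perl's built-in conversion functions. A short sketch, using only oct, hex and sprintf from core Perl:

```perl
use strict;
use warnings;

# Convert each binary string to decimal, then format it as hexadecimal.
foreach my $binary ("101", "1011", "11100", "101000011") {
    my $decimal = oct ("0b$binary");          # oct() understands a "0b" prefix
    my $hex     = sprintf ("%X", $decimal);   # %X prints upper-case hexadecimal
    print "$binary $decimal $hex\n";
}

# hex() goes the other way, from a hexadecimal string to a number.
print "1C in decimal: ", hex ("1C"), "\n";
```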



108/158

References

Recall the subroutine find_max(@x), which returns the largest element in the array @x

Count the number of times we create an array in this code:

@x = (1, 5, 1, 12, 3, 4, 6);     # Array @x created here
$max = find_max (@x);            # @x copied into @_ here
sub find_max {
my @data = @_;                   # @_ copied into @data here
...

All in all, we've created three copies of this array. Each copy uses up time and memory. This seems unnecessary... and it is.

Instead of passing the whole array into the subroutine, we could simply tell the subroutine where in memory the array begins.

The memory address of a particular variable is called a reference to that variable. This is a useful abstraction.

Addresses are often displayed in hexadecimal.



109/158

Reference syntax

To create a reference to

a scalar, $x:      $scalar_ref = \$x;
an array, @x:      $array_ref = \@x;
a hash, %x:        $hash_ref = \%x;

To access a reference to

a scalar:          $x = $$scalar_ref;
an array:          @x = @$array_ref;
an array element:  $x = $array_ref->[3];
a hash:            %x = %$hash_ref;
a hash element:    $x = $hash_ref->{'key'};

Alternative syntax for arrays:

$x = $$array_ref[3];



110/158

References to scalars

$x = 10;

$y = 20;

print "Initially: x=$x, y=$y\n";

$xReference = \$x;

print "X-reference: $xReference\n";

print "Referenced variable: $$xReference\n";

$$xReference += 3;

print "Now: x=$x, y=$y\n";

$yReference = \$y;

print "Y-reference: $yReference\n";

print "Referenced variable: $$yReference\n";

$$yReference *= 2;

print "Finally: x=$x, y=$y\n";

Initially: x=10, y=20

X-reference: SCALAR(0x1832ac0)

Referenced variable: 10

Now: x=13, y=20

Y-reference: SCALAR(0x1832ae4)

Referenced variable: 20

Finally: x=13, y=40

This reference points to $x

This changes the value of $x

This reference points to $y

This changes the value of $y

This is the memory location used to store $x

This is the memory location used to store $y



111/158

References to arrays

@x = ('a', 'c', 'g', 't');

@y = 1..10;

print "x: @x\n";

print "y: @y\n";

$xReference = \@x;

print "X-reference: $xReference\n";

print "Referenced array: @$xReference\n";

$$xReference[3] =~ tr/t/u/;

print "New x: @x\n";

$yReference = \@y;

print "Referenced array: @$yReference\n";

$yReference->[3] *= 2;

print "New y: @y\n";

x: a c g t

y: 1 2 3 4 5 6 7 8 9 10

X-reference: ARRAY(0x1832b08)

Referenced array: a c g t

New x: a c g u

Referenced array: 1 2 3 4 5 6 7 8 9 10

New y: 1 2 3 8 5 6 7 8 9 10

This reference points to @x

This reference points to @y

This changes the 4th element of @x

This changes the 4th element of @y (NB alternative notation)

Note that the type of reference is now ARRAY, not SCALAR



112/158

References to hashes

%comp = ('Cyp12a5' => 'Mitochondrion',

'MRG15' => 'Nucleus',

'Cop' => 'Golgi',

'bor' => 'Cytoplasm',

'Bx42' => 'Nucleus');

$ref = \%comp;

print "Values: ", join(" ",values(%comp)), "\n";

print "Ref: $ref\n";

print "Ref values: ", join(" ",values(%$ref)), "\n";

$$ref{'MRG15'} =~ s/N/n/;

print "New values: ", join(" ",values(%comp)), "\n";

Values: Cytoplasm Golgi Nucleus Mitochondrion Nucleus

Ref: HASH(0x1832b08)

Ref values: Cytoplasm Golgi Nucleus Mitochondrion Nucleus

New values: Cytoplasm Golgi Nucleus Mitochondrion nucleus

The reference points to %comp

This changes $comp{'MRG15'}

Note lower-case 'n' after change



113/158

References to subroutines

We can also have references to subroutines

Syntax for assigning a subroutine reference:

$subref = \&read_FASTA;

Syntax for calling a subroutine reference:

%name2seq = &$subref ("fly3utr.txt");

Anonymous subroutines:

$subref = sub { print "Hello world\n"; };
&$subref();

Hello world



114/158

References to code

sub hello {

print "Hello @_!\n";

}

my $codeRef1 = \&hello;

&$codeRef1 ("Mr", "President");

print "Ref: $codeRef1\n";

my $codeRef2 = sub { print "Goodbye @_!" };

&$codeRef2 ("cruel", "world");

Hello Mr President!

Ref: CODE(0x180cc3c)

Goodbye cruel world!

The reference points to the subroutine hello

This is an anonymous subroutine reference

An anonymous subroutine is one that is never named, but only referenced.

We'll be seeing more about anonymous references on the following slides.



115/158

Reasons for references

Increased efficiency/performance (pass a reference instead of the whole thing)

Allowing a subroutine to modify the value of a variable, and have this modification be propagated back to the caller of the subroutine

Allowing arrays/hashes to contain (references to) other arrays/hashes

Abstract representation of subroutines
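The first reason can be illustrated with find_max. The sketch below is one possible rewrite, not the course's own version: it assumes the caller passes a reference, so no copy of the array is made inside the subroutine, and it assumes the array is non-empty.

```perl
use strict;
use warnings;

# Takes a reference to an array; the elements are read in place, not copied.
sub find_max {
    my ($dataRef) = @_;
    my $max = $dataRef->[0];
    foreach my $value (@$dataRef) {
        $max = $value if $value > $max;
    }
    return $max;
}

my @x = (1, 5, 1, 12, 3, 4, 6);
my $max = find_max (\@x);
print "Max: $max\n";
```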



116/158

Anonymous arrays and hashes

Recall the syntax for assigning an entire array...

@nucleotide = ('a', 'c', 'g', 't');

...and the syntax for assigning an entire hash...

%dna2rna = ('a'=>'a', 'c'=>'c', 'g'=>'g', 't'=>'u');

We can also create an array and assign a reference to it, without explicitly naming the array variable:

$nucleotide_ref = ['a', 'c', 'g', 't'];

Note square brackets instead of parentheses

This is called an anonymous array. We can also create anonymous hashes:

$dna2rna_ref = {'a'=>'a', 'c'=>'c', 'g'=>'g', 't'=>'u'};

Note curly brackets



117/158

Arrays of arrays

More precisely, arrays of references-to-arrays.

Suppose we want to represent this matrix:

0 0 0 2
0 0 3 0
0 3 0 1
2 0 1 0

We could do it like this:

$row1 = [0,0,0,2];
$row2 = [0,0,3,0];
$row3 = [0,3,0,1];
$row4 = [2,0,1,0];
@matrix = ($row1,$row2,$row3,$row4);

Or, more succinctly, like this:

@matrix = ([0,0,0,2],
[0,0,3,0],
[0,3,0,1],
[2,0,1,0]);

@matrix is an array of references to arrays

This matrix could be a table of RNA base-pairing scores if the row and column indices are (A,C,G,U). The score of a pair is the number of strong hydrogen bonds that it forms. Thus, A-U and U-A pairs score +2; C-G and G-C pairs score +3; G-U and U-G pairs score +1; and all other pairs score 0.
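Individual entries of such a matrix are reached with two subscripts, since $matrix[$i][$j] is shorthand for $matrix[$i]->[$j]. A small sketch of a lookup by base letter; the helper names pair_score and %index are hypothetical:

```perl
use strict;
use warnings;

my @base   = ('a', 'c', 'g', 'u');
my %index  = map { $base[$_] => $_ } 0 .. $#base;   # base letter => row/column number
my @matrix = ([0,0,0,2],
              [0,0,3,0],
              [0,3,0,1],
              [2,0,1,0]);

# Look up the base-pairing score for two bases (case-insensitive).
sub pair_score {
    my ($x, $y) = @_;
    return $matrix[$index{lc $x}][$index{lc $y}];
}

print "G-C scores ", pair_score ('G', 'C'), "\n";
print "G-U scores ", pair_score ('G', 'U'), "\n";

# Each row is itself an array reference, so @$row recovers its elements.
foreach my $row (@matrix) {
    print "@$row\n";
}
```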


118/158

Arrays in C and C++

C has nothing like Perl's hashes, although various libraries (e.g. GLIB) have equivalents.

C++'s Standard Template Library offers the map template, which is similar to a hash.

The vector is a C++ template.

Templates (like C arrays) are strongly typed, unlike Perl's weakly typed arrays & hashes.


119/158

Genome annotations



120/158

GFF annotation format

Nine-column tab-delimited format for simple annotations:

SEQ1  EMBL     atg      103  105  .     +  0  group1
SEQ1  EMBL     exon     103  172  .     +  0  group1
SEQ1  EMBL     splice5  172  173  .     +  .  group1
SEQ1  netgene  splice5  172  173  0.94  +  .  group1
SEQ1  genie    sp5-20   163  182  2.3   +  .  group1
SEQ1  genie    sp5-10   168  177  2.1   +  .  group1
SEQ2  grail    ATG      17   19   2.1   -  0  group2

Columns, left to right:

Sequence name
Program
Feature type
Start residue (starts at 1)
End residue (starts at 1)
Score
Strand (+ or -)
Coding frame ("." if not applicable)
Group

Many of these are now obsolete, but name/start/end/strand (and sometimes type) are useful

Methods: read, write, compareTo(GFF_file), getSeq(FASTA_file)



121/158

Reading a GFF file

This subroutine reads a GFF file

Each line is made into an array via the split command

The subroutine returns an array of such arrays

sub read_GFF {

my ($filename) = @_;

open GFF, "
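The listing above is cut off after the open call. Going by the three bullet points (split each line into an array, return an array of such arrays), a reader might look roughly like the sketch below. Everything past the open is an assumption, including the choice to skip comment and blank lines; the demo file name is a hypothetical scratch file.

```perl
use strict;
use warnings;

# Read a GFF file; return a list of array references, one per feature line.
sub read_GFF {
    my ($filename) = @_;
    open (my $gffHandle, "<", $filename) or die "Can't read $filename: $!";
    my @gff;
    while (my $line = <$gffHandle>) {
        chomp $line;
        next if $line =~ /^#/ || $line !~ /\S/;   # assumption: skip comments and blank lines
        my @field = split (/\t/, $line, 9);        # at most nine tab-separated columns
        push @gff, \@field;
    }
    close ($gffHandle);
    return @gff;
}

# Demonstration on a two-line scratch file.
my $demo = "read_gff_demo.gff";   # hypothetical scratch file
open (my $out, ">", $demo) or die $!;
print $out join ("\t", qw(SEQ1 EMBL exon 103 172 . + 0 group1)), "\n";
print $out join ("\t", qw(SEQ2 grail ATG 17 19 2.1 - 0 group2)), "\n";
close ($out);

my @gff = read_GFF ($demo);
print "Read ", scalar (@gff), " features; the first starts at $gff[0][3]\n";
unlink $demo;
```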


122/158

Writing a GFF file

We should be able to write as well as read all datatypes

Each array is made into a line via the join command

Arguments: filename & reference to array of arrays

sub write_GFF {

my ($filename, $gffRef) = @_;

open GFF, ">$filename" or die $!;

foreach my $gff (@$gffRef) {

print GFF join ("\t", @$gff), "\n";

}

close GFF or die $!;

}

open evaluates FALSE if the file failed to open, and $! contains the error message

close evaluates FALSE if there was an error with the file



123/158

GFF intersect detection

Let (name1,start1,end1) and (name2,start2,end2) be the co-ordinates of two segments

If they don't overlap, there are three possibilities:

name1 and name2 are different;
name1 = name2 but start1 > end2;
name1 = name2 but start2 > end1;

Checking every possible pair takes time N^2 to run, where N is the number of GFF lines (how can this be improved?)
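The three non-overlap cases translate directly into a test. A minimal sketch (the subroutine name segments_overlap is hypothetical): two segments overlap exactly when none of the three cases applies.

```perl
use strict;
use warnings;

# Return 1 if the two segments overlap, 0 otherwise.
sub segments_overlap {
    my ($name1, $start1, $end1, $name2, $start2, $end2) = @_;
    return 0 if $name1 ne $name2;    # different sequences
    return 0 if $start1 > $end2;     # segment 1 lies entirely after segment 2
    return 0 if $start2 > $end1;     # segment 2 lies entirely after segment 1
    return 1;
}

print "exon vs sp5-20: ", segments_overlap ("SEQ1", 103, 172, "SEQ1", 163, 182), "\n";
print "atg vs sp5-20:  ", segments_overlap ("SEQ1", 103, 105, "SEQ1", 163, 182), "\n";
```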



124/158

Self-intersection of a GFF file

sub self_intersect_GFF {

my @gff = @_;

my @intersect;

foreach $igff (@gff) {

foreach $jgff (@gff) {

if ($igff ne $jgff) {

if ($$igff[0] eq $$jgff[0]) {

if (!($$igff[3] > $$jgff[4]

|| $$jgff[3] > $$igff[4])) {

push @intersect, $igff;

last;

}

}

}

}

}

return @intersect;

}

Note: this code is slow. Vast improvements in speed can be gained if we sort the @gff array before checking for intersection.

Fields 0, 3 and 4 of the GFF line are the sequence name, start and end co-ordinates of the feature



125/158

Converting GFF to sequence

Puts together several previously-described subroutines

Namely: read_FASTA read_GFF revcomp print_seq

($gffFile, $seqFile)=@ARGV;

@gff = read_GFF ($gffFile);

%seq = read_FASTA ($seqFile);

foreach $gffLine (@gff) {

$seqName =$gffLine->[0];

$seqStart =$gffLine->[3];

$seqEnd =$gffLine->[4];

$seqStrand =$gffLine->[6];

$seqLen =$seqEnd + 1 - $seqStart;

$subseq = substr ($seq{$seqName}, $seqStart-1, $seqLen);

if ($seqStrand eq "-") { $subseq = revcomp ($subseq); }

print_seq ("$seqName/$seqStart-$seqEnd/$seqStrand", $subseq);

}

126/158

DNA Microarrays


127/158

Normalizing microarray data

Often microarray data are normalized as a precursor to further analysis (e.g. clustering)

This can eliminate systematic bias; e.g.

if every level for a particular gene is elevated, this might signal a problem with the probe for that gene

if every level for a particular experiment is elevated, there might have been a problem with that experiment, or with the subsequent image analysis

Normalization is crude (it can eliminate real signal as well as noise), but common



128/158

Rescaling an array

For each element of the array: add a, then multiply by b

@array = (1, 3, 5, 7, 9);

print "Array before rescaling: @array\n";

rescale_array (\@array, -1, 2);

print "Array after rescaling: @array\n";

sub rescale_array {

my ($arrayRef, $a, $b)=@_;

foreach my $x (@$arrayRef) {

$x= ($x+$a) * $b;

}

}

Array before rescaling: 1 3 5 7 9

Array after rescaling: 0 4 8 12 16

Array is passed by reference



129/158

Microarray expression data

A simple format with tab-separated fields

First line contains experiment names

Subsequent lines contain:

gene name
expression levels for each experiment

* EmbryoStage1 EmbryoStage2 EmbryoStage3 ...

Cyp12a5 104.556 102.441 55.643 ...

MRG15 4590.15 6691.11 9472.22 ...

Cop 33.12 56.3 66.21 ...

bor 5512.36 3315.12 1044.13 ...

Bx42 1045.1 632.7 200.11 ...

... ... ... ...

Messages: readFrom(file), writeTo(file), normalizeEachRow, normalizeEachColumn



130/158

Reading a file of expression data

sub read_expr {

my ($filename) = @_;

open EXPR, "
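The listing above is cut off after the open call. Judging from how read_expr is called on the following slides, it returns a reference to the array of experiment names and a reference to a hash mapping each gene name to a reference to its array of expression levels. A sketch under that assumption (the body and the demo file name are not from the original slides):

```perl
use strict;
use warnings;

# Read tab-separated expression data.
# Returns (reference to experiment names, reference to hash: gene => [levels]).
sub read_expr {
    my ($filename) = @_;
    open (my $exprHandle, "<", $filename) or die "Can't read $filename: $!";
    my $header = <$exprHandle>;
    chomp $header;
    my ($placeholder, @experiment) = split (/\t/, $header);   # first header field is "*"
    my %expr;
    while (my $line = <$exprHandle>) {
        chomp $line;
        my ($gene, @level) = split (/\t/, $line);
        $expr{$gene} = \@level;
    }
    close ($exprHandle);
    return (\@experiment, \%expr);
}

# Demonstration on a small scratch file.
my $demo = "read_expr_demo.txt";   # hypothetical scratch file
open (my $out, ">", $demo) or die $!;
print $out join ("\t", "*", "EmbryoStage1", "EmbryoStage2"), "\n";
print $out join ("\t", "Cop", 33.12, 56.3), "\n";
close ($out);

my ($experiment, $expr) = read_expr ($demo);
print "Experiments: @$experiment\n";
print "Cop in $experiment->[1]: $expr->{Cop}->[1]\n";
unlink $demo;
```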


131/158

Normalizing by gene

A program to normalize expression data from a set of microarray experiments

Normalizes by gene

($experiment, $expr)= read_expr ("expr.txt");

while (($geneName, $lineRef)= each %$expr) {

normalize_array ($lineRef);

}

sub normalize_array {

my ($data)=@_;

my ($mean, $sd) = mean_sd (@$data);

@$data = map (($_ - $mean) / $sd, @$data);

}

NB $data is a reference to an array

Could also use the following:

rescale_array($data,-$mean,1/$sd);
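normalize_array relies on a mean_sd subroutine that is not shown on this slide. One way it might be written is sketched below; it assumes the population standard deviation (dividing by N rather than N-1) and a non-empty array whose values are not all identical, since a zero standard deviation would cause a division by zero in normalize_array.

```perl
use strict;
use warnings;

# Return (mean, standard deviation) of a list of numbers.
# Assumption: population standard deviation, i.e. divide by N.
sub mean_sd {
    my @data = @_;
    my $n = @data;
    my ($sum, $sumSq) = (0, 0);
    foreach my $x (@data) {
        $sum   += $x;
        $sumSq += $x * $x;
    }
    my $mean     = $sum / $n;
    my $variance = $sumSq / $n - $mean * $mean;
    $variance = 0 if $variance < 0;   # guard against rounding error
    return ($mean, sqrt ($variance));
}

sub normalize_array {
    my ($data) = @_;
    my ($mean, $sd) = mean_sd (@$data);
    @$data = map { ($_ - $mean) / $sd } @$data;
}

my @levels = (2, 4, 4, 4, 5, 5, 7, 9);
normalize_array (\@levels);
print "Normalized: @levels\n";
```

After normalization the array has mean 0 and standard deviation 1.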



132/158

Normalizing by column

Remaps gene arrays to column arrays

($experiment, $expr) = read_expr ("expr.txt");

my @genes = sort keys %$expr;

for ($i = 0; $i < @$experiment; ++$i) {

my @col;

foreach $j (0..@genes-1) {

$col[$j] =$expr->{$genes[$j]}->[$i];

}

normalize_array(\@col);

foreach $j (0..@genes-1) {

$expr->{$genes[$j]}->[$i] =$col[$j];

}

}

Puts column data in @col

Normalizes (note use of reference)

Puts @col back into %$expr


133/158


Section 6: Advanced topics



134/158

Sorting

It is often useful to be able to sort an array
e.g. smallest element first, largest last

Many sort algorithms exist
Bubblesort (swaps)
Quicksort (pivots)
Binary tree sort (inserts)

Typically, in older languages, you have to implement one of these yourself
although qsort is provided in C

This is changing...



135/158

Sorting string data

Perl provides the sort function to sort an

array of strings into alphabetic order:

@nucleotides = ('g', 'c', 't', 'a');

@sorted_nucleotides = sort @nucleotides;

print "Nucleotides: @nucleotides\n";

print "Sorted: @sorted_nucleotides\n";

Nucleotides: g c t a

Sorted: a c g t



136/158

Sorting numeric data

To sort numeric data, we have to provide a sort function

This is a subroutine that compares two items, $a and $b

It must return -1 if $a < $b, 0 if $a == $b, and +1 if $a > $b

Fortunately, Perl provides an operator that does just this. It is the spaceship operator $a <=> $b

The syntax is as follows:

@x= (5, 1, 16, 2, -1, 10);

@y = sort by_number @x;

print "y: @y\n";

sub by_number {

return $a <=> $b;

}

y: -1 1 2 5 10 16

The variables $a and $b get passed "automagically" into this subroutine. Yet another example of arbitrary Perl weirdness...
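The comparison can also be written inline as a block, which avoids naming a separate subroutine. A short sketch contrasting numeric order, reversed numeric order, and the default string order:

```perl
use strict;
use warnings;

my @x = (5, 1, 16, 2, -1, 10);

my @ascending  = sort { $a <=> $b } @x;   # numeric, smallest first
my @descending = sort { $b <=> $a } @x;   # swapping $a and $b reverses the order
my @asStrings  = sort @x;                 # default: compared as strings

print "Ascending:  @ascending\n";
print "Descending: @descending\n";
print "As strings: @asStrings\n";
```

The last line shows why a numeric sort function is needed: compared as strings, "10" and "16" sort before "2".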



137/158

Standard sort functions

$a <=> $b is the "standard" numeric sort

The "standard" alphabetic sort is $a cmp $b

The alphabetic sort is the one used by default:

$x = "Pears";

$y = "Apples";

$z= "Oranges";

print "$x cmp $y: ", $x cmp $y, "\n";

print "$x cmp $z: ", $x cmp $z, "\n";

print "$y cmp $z: ", $y cmp $z, "\n";

print "$x cmp $x: ", $x cmp $x, "\n";

Pears cmp Apples: 1

Pears cmp Oranges: 1

Apples cmp Oranges: -1

Pears cmp Pears: 0



138/158

Sorting a GFF file

We can "chain" multiple sort functions to sort by sequence name, then by startpoint, then by endpoint:

This works because (X or Y or Z) = X (if X != 0), or Y (if X == 0 and Y != 0), or Z (if X == Y == 0)

($infile, $outfile)=@ARGV;

@gff = read_GFF ($infile);

@gff = sort by_GFF_startpoint (@gff);

write_GFF ($outfile, \@gff);

sub by_GFF_startpoint {

return ($$a[0] cmp $$b[0]

or $$a[3] <=> $$b[3]

or $$a[4] <=> $$b[4]);

}

"chaining" multiple sort comparisons

this line does the actual sort

Fields 0, 3 and 4 of the GFF line are the sequence name, start and end co-ordinates of the feature



139/158

Packages

Perl allows you to organise your subroutines in packages, each with its own namespace

use PackageName;

This line includes a file called "PackageName.pm" in your code

PackageName::doSomething();

This invokes a subroutine called doSomething() in the package called "PackageName"

Perl looks for the packages in a list of directories specified by the array @INC

print "INC dirs: @INC\n";

INC dirs: Perl/lib Perl/site/lib .

The "." means the current working directory

Many packages available at http://www.cpan.org/



140/158

Object oriented programming

Data structures are often associated with code
FASTA: read_FASTA print_seq revcomp ...
GFF: read_GFF write_GFF ...
Expression data: read_expr mean_sd ...

Object-oriented programming makes this association explicit.

A type of data structure, with an associated set of subroutines, is called a class
The subroutines themselves are called methods
A particular instance of the class is an object



141/158

OOP concepts

Abstraction
represent the essentials, hide the details

Encapsulation
storing data and subroutines in a single unit
hiding private data (sometimes all data, via accessors)

Inheritance
abstract base interfaces
multiple derived classes

Polymorphism
different derived classes exhibit different behaviors in response to the same requests



142/158

OOP: Analogy


143/158

o Messages (the words in the speech balloons, and also perhaps the coffee itself)

o Overloading (Waiter's response to "A coffee", different response to "A black coffee")

o Polymorphism (Waiter and Kitchen implement "A black coffee" differently)

o Encapsulation (Customer doesn't need to know about Kitchen)

o Inheritance (not exactly used here, except implicitly: all types of coffee can be drunk or spilled, all humans can speak basic English and hold cups of coffee, etc.)

o Various OOP Design Patterns: the Waiter is an Adapter and/or a Bridge, the Kitchen is

a Factory (and perhaps the Waiter is too), asking for coffee is a Factory Method, etc.



144/158

OOP: Advantages

Often more intuitive
Data has behavior

Modularity
Interfaces are well-defined
Implementation details are hidden

Maintainability
Easier to debug, extend

Framework for code libraries
Graphics & GUIs
BioPerl, BioJava

145/158

OOP: Jargon

Member, method
A variable/subroutine associated with a particular class

Overriding
When a derived class implements a method differently from its parent class

Constructor, destructor
Methods called when an object is created/destroyed

Accessor
A method that provides [partial] access to hidden data

Factory
An [abstract] object that creates other objects

Singleton
A class which is only ever instantiated once (i.e. there's only ever one object of this class)
C.f. static member variables, which occur once per class

146/158

Objects in Perl

An object in Perl is usually a reference to a hash

The method subroutines for an object are found in a class-specific package

Command bless $x, MyPackage associates variable $x with package MyPackage

Syntax of method calls e.g. $x->save();
this is equivalent to MyPackage::save($x);

Typical constructor: MyPackage->new();

@EXPORT and @EXPORT_OK arrays used to export method names to user's namespace

Many useful Perl objects available at CPAN
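The points above can be seen together in a very small class. The sketch below is illustrative only; the class name GeneRecord and its fields are hypothetical.

```perl
use strict;
use warnings;

package GeneRecord;    # hypothetical class name

# Constructor: build a hash reference and bless it into this package.
sub new {
    my ($class, %args) = @_;
    my $self = { name => $args{name}, compartment => $args{compartment} };
    return bless $self, $class;
}

# Accessor for the gene name.
sub name {
    my ($self) = @_;
    return $self->{name};
}

# A method: the object arrives as the first argument.
sub describe {
    my ($self) = @_;
    return $self->name() . " is found in the " . $self->{compartment};
}

package main;

my $gene = GeneRecord->new (name => "Cop", compartment => "Golgi");
print $gene->describe(), "\n";
print GeneRecord::describe ($gene), "\n";   # the same call, written as a plain subroutine
```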

147/158

AUTOLOAD

When an undefined method is called on an object, the special method AUTOLOAD is called, if defined

Special variable $AUTOLOAD contains function name

Allows implementation of e.g. default accessors for hash elements
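A default accessor of the kind described might be sketched as follows. The class name Feature and its fields are hypothetical; the sketch assumes the object is a hash reference and treats any undefined method name as a request for the hash element of the same name.

```perl
use strict;
use warnings;

package Feature;    # hypothetical class name

our $AUTOLOAD;      # set by Perl to the fully qualified name of the missing method

sub new {
    my ($class, %field) = @_;
    my $self = { %field };
    return bless $self, $class;
}

# Called whenever an undefined method is invoked on a Feature object.
sub AUTOLOAD {
    my ($self) = @_;
    my $method = $AUTOLOAD;
    $method =~ s/.*:://;                 # strip the package prefix
    return if $method eq 'DESTROY';      # ignore object clean-up calls
    die "No such field: $method\n"
        unless ref ($self) && exists $self->{$method};
    return $self->{$method};
}

package main;

my $exon = Feature->new (start => 103, end => 172);
print "Exon spans ", $exon->start, "-", $exon->end, "\n";
```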

148/158

GD.pm

A graphics package by Lincoln Stein

use GD;

# create a new image

$im = new GD::Image(100,100);

# allocate some colors

$white =$im->colorAllocate(255,255,255);

$black = $im->colorAllocate(0,0,0);

$red =$im->colorAllocate(255,0,0);

$blue =$im->colorAllocate(0,0,255);

# make the background transparent

$im->transparent($white);

# Put a black frame around the picture

$im->rectangle(0,0,99,99,$black);

# Draw a blue oval

$im->arc(50,50,95,75,0,360,$blue);

# And fill it with red

$im->fill(50,50,$red);

# Convert the image to PNG and print it out

print $im->png;

149/158

CGI.pm

CGI (Common Gateway Interface)
Page-based web programming paradigm

CGI.pm (also by Lincoln Stein)
Perl CGI interface
runs on a webserver
allows you to write a program that runs behind a webpage

CGI (static, page-based) is gradually being supplemented by AJAX

150/158

BioPerl

A set of Open Source Bioinformatics packages
largely object-oriented

Can be downloaded from bio.perl.org

Handles various different file formats

Parses BLAST and other programs

Basis for Ensembl
the human genome annotation project
www.ensembl.org

151/158

Example: GenBank


152/158

Example: Bio::DB::GenBank

Interface to the GenBank database

Saves having to rewrite same old parsers

use Bio::DB::GenBank;

$gb= new Bio::DB::GenBank;

$seq =$gb->get_Seq_by_id('MUSIGHBA1'); # Unique ID

# or ...

$seq =$gb->get_Seq_by_acc('J00522'); # Accession Number

$seq =$gb->get_Seq_by_version('J00522.1'); # Accession.version

$seq =$gb->get_Seq_by_gi('405830'); # GI Number

153/158

Digest::MD5

MD5 is a one-way hash function

e.g. gravatar.com uses MD5 to map (authenticated) email addresses to avatar icons

use Digest::MD5 qw(md5 md5_hex md5_base64);

my $baseURL = "http://www.gravatar.com/avatar/";

while (<>) {

chomp;

print $baseURL, md5_hex(lc($_)), "\n";

}

155/158

Other programming languages

Procedural languages
  Interpreted/scripting languages
    "Shell" languages (TCSH, BASH, CSH)
    Python: cleaner, object-oriented
    Ruby: even more object-oriented
  Compiled languages
    C: very basic, portable and fast
    C++: more elaborate, object-oriented C
    Java: stripped-down portable C++; "safer" & cleaner

Functional languages
  More mathematical, cleaner; but less pragmatic
  Lisp, Scheme
    Lisp is the oldest. (Lots (of (parentheses)))
  Prolog, ML, Haskell


156/158

Co-ordinate transformation


157/158

Motivation: map clones to chromosomes

Chromosome

Clones

17455 17855

403 803

Co-ordinate transformations (cont.)


158/158

What if a segment spans multiple clones?
