Perl for Bioinformatics

Lecture 4

Variables - review

• A variable name starts with a $
• It contains a number or a text string
• Use my to define a variable
• Use = to assign a value
• Use \ to stop the variable being interpolated
• Take care with variable names and with changing the contents of variables

Conditional Blocks - review

• An if test can be used to control a command in a conditional block, according to the outcome of a decision made by comparing variables.

• It’s important to keep track of whether variables are strings or numbers. Numbers are compared with ==, strings with eq.

• It’s usual to indent the block to make it easier to read the code

Arrays

• An array can store multiple pieces of data.
• They are essential for the most useful functions of Perl. They can store data such as:
– the lines of a text file (e.g. primer sequences)
– a list of numbers (e.g. BLAST e values)

• Arrays are designated with the symbol @

my @bases = ("A", "C", "G", "T");

Converting a variable to an array

split splits a variable into parts and puts them in an array.

my $dnastring = "ACGTGCTA";

my @dnaarray = split //, $dnastring;

@dnaarray is now (A, C, G, T, G, C, T, A)

@dnaarray = split /T/, $dnastring;

@dnaarray is now (ACG, GC, A)

Converting an array to a variable

• join combines the elements of an array into a single scalar variable (a string)

$dnastring = join('', @dnaarray);

The first argument is the spacer placed between the elements (empty here); the second is the array to join.
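A small round trip may help here. This is only a sketch using the same $dnastring as the slides; the names @letters, @fragments, $rebuilt and $dashed are made up for the illustration.

```perl
use strict;
use warnings;

my $dnastring = "ACGTGCTA";

# An empty pattern splits between every character.
my @letters = split //, $dnastring;
print "Letters: @letters\n";

# Splitting on T throws the T's away and keeps what lies between them.
my @fragments = split /T/, $dnastring;
print "Fragments: @fragments\n";

# join with an empty spacer rebuilds the original string from the letters.
my $rebuilt = join('', @letters);
print "Rebuilt: $rebuilt\n";

# join with a different spacer puts that spacer between the elements.
my $dashed = join('-', @fragments);
print "Dashed: $dashed\n";
```

Note that split followed by join only gives back the original string when nothing was thrown away, as with the empty pattern.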

Loops

• A loop repeats a set of commands until it is done. The commands are placed in a BLOCK – some code delimited with curly brackets {}

• Loops are really useful with arrays.

• The “foreach” loop is probably the most useful of all:

foreach my $base (@dnaarray) {
    print "$base ";
}

Comparing strings

• String comparison (is the text the same?)
• eq (equal)
• ne (not equal)

There are others (lt, gt, le, ge), but beware of them: they compare strings in alphabetical order, not by numerical value.

Getting part of a string

• substr takes characters out of a string

$letter = substr($dnastring, $position, 1);

The three arguments are: which string, where in the string (positions count from 0), and how many letters to take.
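As a quick illustration of the three substr arguments, here is a sketch with the same example string; the variable names $first, $fourth and $codon are invented for this example.

```perl
use strict;
use warnings;

my $dnastring = "ACGTGCTA";

# Positions count from 0, so position 0 is the first letter.
my $first  = substr($dnastring, 0, 1);

# Position 3 is therefore the fourth letter.
my $fourth = substr($dnastring, 3, 1);

# Taking more than one letter: three letters starting at position 2.
my $codon  = substr($dnastring, 2, 3);

print "First: $first, fourth: $fourth, codon: $codon\n";
```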

Combining strings

• Strings can be concatenated (joined).

• Use the dot . operator

$seq1 = "ACTG";
$seq2 = "GGCTA";
$seq3 = $seq1 . $seq2;
print $seq3;

This prints: ACTGGGCTA

Making Decisions - review

• The if statement is generally used together with numerical or string comparison operators, inside a bracketed (expression).

numerical: ==, !=, >, <, >=, <=
strings: eq, ne

• You can make decisions on each member of an array using a loop which puts each part of the array through the test, one at a time
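The review points above can be put together in a short sketch: a foreach loop sends each base through string tests, and the resulting counts are then compared as numbers. The counter names are made up for the illustration.

```perl
use strict;
use warnings;

my @dnaarray = split //, "ACGTGCTA";

my $g_count     = 0;    # how many G's we have seen
my $not_g_count = 0;    # everything else

foreach my $base (@dnaarray) {
    # $base holds text, so compare with eq / ne rather than == / !=
    if ($base eq "G") {
        $g_count = $g_count + 1;
    }
    if ($base ne "G") {
        $not_g_count = $not_g_count + 1;
    }
}

# The counts are numbers, so they are compared with the numerical operators.
if ($g_count >= 2) {
    print "Found $g_count G's\n";
}
print "Other bases: $not_g_count\n";
```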

More healthy exercise

• Write a program that asks the user for a DNA restriction site, and then tells them whether that particular sequence matches the site for the restriction enzyme EcoRI, or BamHI, or HindIII.

• Site for EcoRI: GAATTC
• BamHI: GGATCC
• HindIII: AAGCTT

Pseudocode

• Read in restriction site to variable
• Remove newline character
• Check if variable contains “GAATTC”
• Check if variable contains “GGATCC”
• ...etc.

What about longer sequences?

• Read in sequence to variable
• Remove newline character
• Split sequence in variable to array using “GAATTC”.
• Count and report number of fragments.
• Measure the length of each fragment and report the site positions, adding six for each site that split removed
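One possible way to turn the longer-sequence steps into code is sketched below. It is not the only solution: the sequence is a made-up example typed into the script rather than read from the user, and the variable names are invented. It also uses length and push, which have not been covered yet: length gives the number of characters in a string, and push adds an element to the end of an array.

```perl
use strict;
use warnings;

# Made-up sequence containing two EcoRI sites
my $sequence = "AAGAATTCCCCGAATTCTT";

# split removes the sites and keeps the pieces between them
my @fragments      = split /GAATTC/, $sequence;
my $fragment_count = scalar(@fragments);
my $site_count     = $fragment_count - 1;   # one site between each pair of fragments

my $position   = 0;    # how far along the sequence we have walked
my $sites_seen = 0;
my @site_positions;

foreach my $fragment (@fragments) {
    $position = $position + length($fragment);
    if ($sites_seen < $site_count) {
        # the site starts on the next base (reported counting from 1)
        push @site_positions, $position + 1;
        $position   = $position + 6;        # add six for the site split removed
        $sites_seen = $sites_seen + 1;
    }
}

print "Fragments: $fragment_count\n";
print "Sites start at: @site_positions\n";
```

One caveat with this approach: split drops empty fragments at the end of the list, so a sequence that ends exactly on a site would be under-counted.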

Arrays and loops - review

• An array starts with @. It contains multiple bits of data in a list-like format.

• my @bases = ("A", "C", "G", "T");

• You can make decisions on each member of an array using a foreach loop which puts each part of the array through the test, one at a time

Test time, again

• Remember –

keep track of what’s in a variable

don’t over-write a variable with another value, unless you intend to

syntax and case are critical

lines end with a semicolon

brackets and quotes must match.

Opening and closing files

• So we can input large amounts of data, Perl has to read data out of files, and write results into output files

• This is done in two steps

• First, you must give the file a name within the script - this is known as a filehandle

• Use the open command:

open MYFILE, 'exampleprotein.txt';

Reading a file

• Once the file is open, you can read from it, line by line, using the readline <> operator again
– (put the filehandle between the angle brackets)

• Perl reads files one line at a time: each time you input data from the file, the next line is read:

open FILE1, 'exampleprotein.txt';
$line1 = <FILE1>;
chomp $line1;
$line2 = <FILE1>;

Using loops to read in a file

• The while loop keeps running its block for as long as the expression in brackets is true. So it will keep reading lines from the file until it runs out.

my $longsequence = '';
open FILE, 'exampleprotein.txt';
while (my $line = <FILE>) {
    chomp $line;
    $longsequence = $longsequence . $line;
}
close FILE;

• This reads the whole file, adding each line to the end of the variable $longsequence, one at a time.

• Think about what happens to the ID line….
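One way to deal with the ID line is to read it into its own variable before the loop starts. The sketch below assumes a fasta-style file whose first line is the ID line. So that it runs without a real file, the made-up file contents are held in a string and opened through a reference to that string; it also uses the three-argument form of open with a $fh variable as the filehandle, which is a common alternative to the bare FILE style used above.

```perl
use strict;
use warnings;

# Made-up file contents, held in a string for the sake of the example
my $fasta_text = ">example_protein made-up ID line\nMKVLAA\nGHTPLL\n";

# Opening a reference to a string lets Perl read it like a file
open(my $fh, '<', \$fasta_text) or die "Cannot open: $!";

# The first read returns the ID line, so it never reaches the sequence
my $idline = <$fh>;
chomp $idline;

my $longsequence = '';
while (my $line = <$fh>) {    # each later read returns the next line
    chomp $line;
    $longsequence = $longsequence . $line;
}
close $fh;

print "ID: $idline\n";
print "Sequence: $longsequence\n";
```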

Now More Fun Exercises

• Read a DNA sequence from a fasta format file
• Calculate the GC content.
• What about the non-DNA characters in the file?
– >header lines with the name of the sequence
– carriage returns !! You know this one.
– blank spaces
– N’s or X’s or unexpected letters

Pseudocode

• Open file
• Load ID line and sequence into different variables
• Split the sequence into an array of individual letters
• For each letter in the array:
– if A, increment “A” counting variable
– if T, increment “T” counting variable
– …etc
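The GC content pseudocode might look something like this in Perl. Treat it as a sketch under a few assumptions: the file has a single ID line, the sequence is in upper case, and anything that is not A, C, G or T is simply left out of the percentage. The file contents are made up and held in a string so the example runs on its own; all variable names are invented.

```perl
use strict;
use warnings;

# Made-up fasta contents, including one N
my $fasta_text = ">example_dna made-up ID line\nACGTNACG\nGGCCAT\n";
open(my $fh, '<', \$fasta_text) or die "Cannot open: $!";

# Load the ID line and the sequence into different variables
my $idline = <$fh>;
chomp $idline;
my $sequence = '';
while (my $line = <$fh>) {
    chomp $line;
    $sequence = $sequence . $line;
}
close $fh;

# Split the sequence into letters and count each base
my ($count_a, $count_c, $count_g, $count_t) = (0, 0, 0, 0);
foreach my $base (split //, $sequence) {
    if ($base eq "A") { $count_a = $count_a + 1; }
    if ($base eq "C") { $count_c = $count_c + 1; }
    if ($base eq "G") { $count_g = $count_g + 1; }
    if ($base eq "T") { $count_t = $count_t + 1; }
}

# Anything that was not counted is an unexpected letter (N, X, spaces...)
my $dna_total   = $count_a + $count_c + $count_g + $count_t;
my $other_count = length($sequence) - $dna_total;

if ($dna_total > 0) {
    my $gc_percent = 100 * ($count_g + $count_c) / $dna_total;
    printf "GC content: %.1f%%\n", $gc_percent;
}
print "Unexpected letters: $other_count\n";
```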

Writing to a File

• Writing to a file is similar to reading from it
• Put a > in front of the file name to open the file for writing:

open OUTPUT, '>/home/class30/output.txt';

• This creates a new file with that name, or overwrites an existing file
• Use >> to append text to an existing file
• print to the file using the filehandle:

print OUTPUT $myoutputdata;
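To see the difference between > and >> without risking a real file, the sketch below writes to a scratch file created by the File::Temp module (a standard module that comes with Perl). It uses the three-argument form of open, where the > or >> is given separately from the file name; the variable names are made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# tempfile() creates a scratch file and returns a filehandle and its name
my ($tmp_fh, $filename) = tempfile();
close $tmp_fh;

# '>' creates the file, or overwrites it if it already exists
open(my $out, '>', $filename) or die "Cannot write $filename: $!";
print $out "first line\n";
close $out;

# '>>' adds to the end of what is already there
open($out, '>>', $filename) or die "Cannot append to $filename: $!";
print $out "second line\n";
close $out;

# Read the file back to check what it contains
open(my $in, '<', $filename) or die "Cannot read $filename: $!";
my @lines = <$in>;
close $in;
print @lines;

unlink $filename;    # tidy up the scratch file
```

If the second open had used > instead of >>, only "second line" would be left in the file.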

Some more stuff you need to know

else

• Instead of just letting the script go on if it fails an if test, you can get it to execute a second block of code if the statement in brackets isn’t true.

elsif

• You can string a lot of “if”s together using elsif

if ($site eq "GAATTC") {
    print "EcoRI site\n";
} elsif ($site eq "GGATCC") {
    print "BamHI site\n";
} elsif ($site eq "AAGCTT") {
    print "HindIII site\n";
} else {    # only happens if none of the preceding are true
    die "I can't find any of the sites I know\n";
}

Subscripts

• Bioinformatics data often can be made into array format:
– multi-line sequence files
– Microarray or statistics data in “tab delimited” format

• You can address part of the array as if it was a variable using a subscript

@numbers = (8, 8, 8, 23984092, 8);
print "$numbers[3]\n";

Please note – the first element is number zero! Second is 1!
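Subscripts are handy with tab delimited data. In this sketch one made-up row (a gene name, an expression value and a p value, all invented) is split on tabs and the columns are then picked out by number.

```perl
use strict;
use warnings;

# One made-up row of tab delimited data
my $row = "geneA\t2.5\t0.001";
my @fields = split /\t/, $row;

# Column one is subscript 0, column three is subscript 2
print "Gene name: $fields[0]\n";
print "P value:   $fields[2]\n";

# A subscripted element behaves like an ordinary variable
$fields[1] = $fields[1] * 2;
print "Doubled expression: $fields[1]\n";

# With three elements, the last subscript is 2
my $how_many = scalar(@fields);
print "Last subscript is ", $how_many - 1, "\n";
```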

Regular Expressions

• Sounds odd, doesn’t it? It means a pattern that the computer can match, in a standard format.
• Very useful in bioinformatics work
• DNA patterns
– restriction sites
– promoters/transcription factor binding sites
– intron splice sites
• Protein patterns
– conserved domains (motifs)
– active sites
– structural motifs (membrane spanning, signal peptide, etc.)

The Binding and Match Operators: =~ / /

• The =~ operator binds a variable to a pattern match
• The // operator matches things to patterns

Together they can be translated as “contains”

• The forward slashes contain the pattern to be matched, like this:

if ($dnaseq =~ /GAATTC/) {
    print "EcoRI site found\n";
}

A regular expression..is a joy forever. And a pattern to match:

can be just a text string, such as: /GATC/

it can have alternative characters: /G[AT]TC/

or contain a wildcard that matches any character: /G.TC/

Or be something bizarre: /\/[^\/]*\/\.\./

Perl Regular Expressions

• It never ceases to amaze me what people can do with regular expressions, but you can match pretty much anything you can think of and a lot you can’t:

#man perlrequick

Alternative Characters

• Square brackets within the match expression allow for alternative characters:

if ($dna =~ /CAG[AT]CAG/)

• This will match any DNA string that contains CAG, then A or T in the 4th position, followed by another CAG.

• A vertical line within the /expression/ means “or”; it allows you to look for either of two completely different patterns:

if ($dna =~ /GAAT|ATTC/)
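To see both kinds of alternative in action, the sketch below runs the two patterns over a few made-up test strings and counts the matches. The test strings and counter names are invented for the illustration.

```perl
use strict;
use warnings;

# Square brackets: A or T is allowed in the middle, G is not
my @tests = ("TTCAGACAGTT", "TTCAGTCAGTT", "TTCAGGCAGTT");
my $class_hits = 0;
foreach my $dna (@tests) {
    if ($dna =~ /CAG[AT]CAG/) {
        print "$dna matches\n";
        $class_hits = $class_hits + 1;
    } else {
        print "$dna does not match\n";
    }
}

# Vertical line: either of two completely different patterns will do
my @others = ("CCGAATCC", "CCATTCCC", "CCGATCCC");
my $either_hits = 0;
foreach my $dna (@others) {
    if ($dna =~ /GAAT|ATTC/) {
        $either_hits = $either_hits + 1;
    }
}
print "Strings containing GAAT or ATTC: $either_hits\n";
```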

Special characters

• Perl has a large set of special characters to use in regular expressions:
– the dot (.) matches any character (except a newline)
– \d matches any digit (a number from 0-9)
– \w matches any “word” character (a letter, a number or an underscore, not punctuation or space)
– \s matches one white space character (use \s+ for any amount)
– \t matches a tab (useful for tab delimited files)
– ^ matches the beginning of a line
– $ matches the end of a line

– Knowing this makes you lots of fun at parties.

“Special” characters

• What if you need to match text that contains a special character?

• Aren’t there dots at the end of sentences?

• Now you have to use a backslash (\) to “escape” the special meaning of that character:

if ($onewordsentence =~ /\w+\./)

- This would match any text that has one or more word characters, followed by a dot.

Finished.
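The difference the backslash makes can be checked with a small test. This sketch tries the pattern with and without the escape on two made-up samples, then uses a few of the special characters on an invented tab delimited row.

```perl
use strict;
use warnings;

my @samples = ("Finished.", "Finished");

my $escaped_hits   = 0;
my $unescaped_hits = 0;
foreach my $text (@samples) {
    # \. only matches a real dot
    if ($text =~ /\w+\./) {
        $escaped_hits = $escaped_hits + 1;
    }
    # an unescaped . matches any character, so a plain letter satisfies it too
    if ($text =~ /\w+./) {
        $unescaped_hits = $unescaped_hits + 1;
    }
}
print "Escaped dot matched $escaped_hits of 2\n";
print "Unescaped dot matched $unescaped_hits of 2\n";

# ^ and $ pin the pattern to the whole line: a word, a tab, then digits
my $row    = "chr7\t12345";
my $row_ok = 0;
if ($row =~ /^\w+\t\d+$/) {
    $row_ok = 1;
    print "Row looks like a name, a tab and a number\n";
}
```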

Bringing it together

• So now, when you think about it, you can:

– Open a file
– Check whether each line of the file contains a particular pattern
– Recover part of that line
– Write it out to another file

So.. isn’t that what you wanted to know?
But really, it’s very useful combined with the UNIX command line.
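Here is one way those four steps could fit together, as a sketch only. It pulls the sequence names out of fasta-style ID lines. Recovering part of a line relies on something not covered above: round brackets inside a pattern capture whatever they match into the special variable $1. Both the input and the output are strings standing in for real files, and their contents are made up.

```perl
use strict;
use warnings;

# Made-up input, and an empty string to collect the output
my $input_text  = ">seq1 first made-up entry\nACGT\n>seq2 second made-up entry\nGGCC\n";
my $output_text = '';

# Open the input for reading and the output for writing
open(my $in,  '<', \$input_text)  or die "Cannot read input: $!";
open(my $out, '>', \$output_text) or die "Cannot write output: $!";

while (my $line = <$in>) {
    chomp $line;
    # Does the line start with > followed by a name? The brackets capture the name.
    if ($line =~ /^>(\w+)/) {
        print $out "$1\n";    # write the recovered part to the output
    }
}
close $in;
close $out;

print $output_text;
```

With real files, the two references to strings would be replaced by file names.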

A last exercise?...

• Now we’re getting up to speed with Perl, let’s try something more fun:

• Open up a BLAST output file

• Spit out the name of the query sequence, the top hit, and how many hits there were.

Only the beginning

• Sadly, there is much, much more than this to the Power of Perl.

• You can create websites, and download other people’s
• Make Linux and Windows graphical programs
• Do almost anything on the internet
• Interact with databases
• And much much more

Why won’t I teach you more stuff?

• Whoa!
• Programming takes time to learn properly
• You’ve got the tools now to get started on a programming project

• We will go through some more Perl functions in the later classes, especially modules such as BioPerl.

Practice makes perfect

• You can now practice your Perl skills and understand a lot of the books and help files, which are probably more useful.

#man perlintro

#man perlrequick

#perldoc bioperl

• Also, check out Radhika’s resource page

http://compbio.sph.harvard.edu/chb/training/training-resources
