perl for bioinformatics
DESCRIPTION
Perl for Bioinformatics. Lecture 4. Variables - review. A variable name starts with a $ It contains a number or a text string Use my to define a variable Use = to assign a value Use \ to stop the variable being interpolated - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/1.jpg)
Perl for Bioinformatics
Lecture 4
![Page 2: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/2.jpg)
Variables - review
• A variable name starts with a $• It contains a number or a text string• Use my to define a variable• Use = to assign a value• Use \ to stop the variable being
interpolated• Take care with variable names and with
changing the contents of variables
![Page 3: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/3.jpg)
Conditional Blocks, review• An if test can be used to control a
command in a conditional block, according to the outcome of a decision made by comparing variables.
• It’s important to keep track of whether variables are strings or numbers. Numbers are compared with ==, strings with eq.
• It’s usual to indent the block to make it easier to read the code
![Page 4: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/4.jpg)
Arrays• An array can store multiple pieces of data. • They are essential for the most useful
functions of Perl. They can store data such as:
– the lines of a text file (e.g. primer sequences)– a list of numbers (e.g. BLAST e values)
• Arrays are designated with the symbol @
my @bases = (“A”, “C”, “G”, “T”);
![Page 5: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/5.jpg)
Converting a variable to an array
split splits a variable into parts and puts them in an array.
my $dnastring = "ACGTGCTA";
my @dnaarray = split //, $dnastring;
@dnaarray is now (A, C, G, T, G, C, T, A)
@dnaarray = split /T/, $dnastring;
@dnaarray is now (ACG, GC, A)
![Page 6: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/6.jpg)
• join combines the elements of an array into a single scalar variable (a string)
$dnastring = join('', @dnaarray);
Converting an array to a variable
which arrayspacer(empty here)
![Page 7: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/7.jpg)
Loops• A loop repeats a bunch of functions until it is done.
The functions are placed in a BLOCK – some code delimited with curly brackets {}
• Loops are really useful with arrays.
• The “foreach” loop is probably the most useful of all:
foreach my $base (@dnaarray) {
print "$base “;
}
![Page 8: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/8.jpg)
• String comparison (is the text the same?)
• eq (equal ) • ne (not equal )
There are others but beware of them!
Comparing strings
![Page 9: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/9.jpg)
Getting part of a string
• substr takes characters out of a string
$letter = substr($dnastring, $position, 1)
which string where in the string
how many letters to take
![Page 10: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/10.jpg)
Combining strings
• Strings can be concatenated (joined).
• Use the dot . operator$seq1= “ACTG”;
$seq2= “GGCTA”;
$seq3= $seq1 . $seq2;
print $seq3;ACTGGGCTA
![Page 11: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/11.jpg)
Making Decisions - review
• The if operator is generally used together with numerical or string comparison operators, inside an (expression).
numerical: ==, !=, >, <, ≥, ≤strings: eq, ne
• You can make decisions on each member of an array using a loop which puts each part of the array through the test, one at a time
![Page 12: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/12.jpg)
More healthy exercise
• Write a program that asks the user for a DNA restriction site, and then tells them whether that particular sequence matches the site for the restriction enzyme EcoRI, or Bam HI, or Hind III.
• Site for EcoR1: GAATTC• Bam H1: GGATCC• Hind III: AAGCTT
![Page 13: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/13.jpg)
• Read in restriction site to variable
• Remove newline character
• Check if variable contains “GAATTC”
• Check if variable contains “GGATCC”
• ..etc.
pseudocode
![Page 14: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/14.jpg)
• Read in sequence to variable
• Remove newline character
• Split sequence in variable to array using “GAATTC”.
• Count and report number of fragments.
• Measure length of fragments and report site positions, adding six for missing sites
What about longer sequences?
![Page 15: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/15.jpg)
Arrays and loops - review
• An array starts with @. It contains multiple bits of data in a list-like format.
• @bases = (“A”, “C”, “G”, “T”);
• You can make decisions on each member of an array using a foreach loop which puts each part of the array through the test, one at a time
![Page 16: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/16.jpg)
Test time, again
• Remember –
keep track of what’s in a variable
don’t over-write a variable with another value, unless you intend to
syntax and case are critical
lines end with a semicolon
brackets and quotes must match.
![Page 17: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/17.jpg)
Opening and closing files
• So we can input large amounts of data, Perl has to read data out of files, and write results into output files
• This is done in two steps
• First, you must give the file a name within the script - this is known as a filehandle
• Use the open command:open MYFILE, ‘exampleprotein.txt’;
![Page 18: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/18.jpg)
Reading a file• Once the file is open, you can read from it, line by
line, using the readline <> operator again
– (put the filehandle between the angle brackets)
• Perl reads files one line at a time, each time you input data from the file, the next line is read:
open FILE1,’exampleprotein.txt’;$line1 = <FILE1>;chomp $line1;$line2 = <FILE1>;
![Page 19: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/19.jpg)
Using loops to read in a file
• The while loop just keeps doing an expression while it’s true. So it will keep reading lines from the file until it runs out.
my $longsequence;open FILE, ‘exampleprotein.txt’;while (my $line = <FILE>){
chomp $line;$longsequence = $longsequence . $line;
}close FILE;
• This reads the whole file, and puts each line into the variable $longsequence one at a time.
• Think about what happens to the ID line….
![Page 20: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/20.jpg)
Now More Fun Excercises
• Read a DNA sequence from a fasta format file
• Calculate the GC content.• What about the non-DNA characters in
the file?>header lines with the name of the sequencecarriage returns !! You know this one.blank spacesN’s or X’s or unexpected letters
![Page 21: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/21.jpg)
• Open file
• Load ID line and sequence into different variables
• Split the sequence into an array of individual letters
• For each letter in the array: – if A, increment “A” counting number variable– If T, increment “T” counting variable– …etc
Pseudocode
![Page 22: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/22.jpg)
Writing to a File• Writing to a file is similar to reading from it• Use the > operator to open a file for writing:
open OUTPUT, ‘>/home/class30/output.txt’;
• This creates a new file with that name, or overwrites an existing file
• Use >> to append text to an existing file• print to the file using the filehandle:
print OUTPUT $myoutputdata;
![Page 23: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/23.jpg)
else
• Instead of just letting the script go on if it fails an if test, you can get it to execute a second block of code if the statement in brackets isn’t true.
Some more stuff you need to know
![Page 24: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/24.jpg)
elsif• You can string a lot of “if”s together using elsif
if ($site eq “GAATTC” {print “EcoR1 site\n”;
} elsif ($site eq “CCATGG” {print “BamHI site\n”;
} elsif ($site eq “AAGCTT”) {print “HindIII site\n”;
} else { #only happens if none of the preceeding are truedie “I can’t find any of the sites I know\n”;
}
![Page 25: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/25.jpg)
Subscripts
• Bioinformatics data often can be made into array format:– multi-line sequence files– Microarray or statistics data in “tab delimited”
format
• You can address part of the array as if it was a variable using a subscript
@numbers = (8, 8, 8, 23984092, 8);print “$numbers[3]\n”;
Please note – the first element is number zero! Second is 1!
![Page 26: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/26.jpg)
Regular Expressions• Sounds odd, doesn’t it? It means a pattern that
the computer can match, in a standard format.
• Very useful in bioinformatics work
• DNA patterns• restriction sites• promoters/transcription factor binding sites• intron splice site
• Protein patterns• conserved domains (motifs)• active sites• structural motifs (membrane spanning, signal peptide, etc.)
![Page 27: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/27.jpg)
The Binding and Match Operators: =~ / /
• The =~ operator binds functions together• The // operator matches things to patterns
It can be translated as “contains”
• The forward slashes contain the pattern to be matched, like this:
if ($dnaseq =~ /GAATTC/) {print “EcoRI site found\n”}
![Page 28: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/28.jpg)
A regular expression..is a joy forever. And a pattern to match:
can be just a text string, such as: /GATC/
it can have alternative characters: /G[AT]TC/
or contain a wildcard that matches any character: /G.TC/
Or be something bizzare: /\/[^\/]*\/\.\./
![Page 29: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/29.jpg)
Perl Regular Expressions
• It never ceases to amaze me what people can do with regular expressions, but you can match pretty much anything you can think of and a lot you can’t:
#man perlrequick
![Page 30: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/30.jpg)
Alternative Characters• Square brackets within the match expression allow
for alternative characters:if ($dna =~ /CAG[AT]CAG/)
• This will match an DNA string that starts with CAG; has A or T in the 4th position, followed by another CAG.
• A vertical line within the /expression/ means “or”; it allows you to look for either of two completely different patterns:
if ($dna =~ /GAAT|ATTC/)
![Page 31: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/31.jpg)
Special characters• Perl has a large set of special characters to use
in regular expressions:
– the dot (.) matches any character– \d matches any digit (a number from 0-9)– \w matches any “word” character
(a letter or a number, not punctuation or space)
– \s matches white space (any amount)– \t matches a tab (useful for tab delimited files)– ^ matches the beginning of a line– $ matches the end of a line
– Knowing this makes you lots of fun at parties.
![Page 32: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/32.jpg)
“Special” characters
• What if you need to match text that contains a special character?
• Aren’t there dots at the end of sentences?
• Now you have to use a backslash (\) to “escape” the special meaning of that character:
if $onewordsentence =~ /\w+ \ ./
-This would match any text that has one or more text characters, followed by a dot.
Finished.
![Page 33: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/33.jpg)
Bringing it together
• So now, when you think about it, you can:
Open a fileCheck whether each line of the file contains a particular
patternRecover part of that lineWrite it out to another file
So.. isn’t that what you wanted to know?But really, it’s very useful combined with the UNIX
command line.
![Page 34: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/34.jpg)
A last exercise?...
• Now we’re getting up to speed with Perl, lets try something more fun:
• Open up a BLAST output file
• Spit out the name of the query sequence, the top hit, and how many hits there were.
![Page 35: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/35.jpg)
Only the beginning
• Sadly, there is much, much more than this to the Power of Perl.
• You can make, create and download other people’s websites
• Make Linux and Windows graphical programs• Do almost anything on the internet• Interact with databases• And much much more
![Page 36: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/36.jpg)
Why won’t I teach you more stuff?
• Whoa!• Programming takes time to learn properly• You’ve got the tools now to get started on a
programming project
• We will go through some more Perl functions in the later classes, especially modules such as Bioperl.
![Page 37: Perl for Bioinformatics](https://reader035.vdocuments.us/reader035/viewer/2022062301/56815b11550346895dc8bd36/html5/thumbnails/37.jpg)
Practice makes perfect
• You can now practice your Perl skills and understand a lot of the books and help files, which are probably more useful.
#man perlintro
#man perlrequick
#perldoc bioperl
• Also, check out Radhika’s resource page
http://compbio.sph.harvard.edu/chb/training/training-resources