PERL
subroutines and references
Andrew Emerson, High Performance Systems, CINECA
Consider the following code:
# Counts Gs in various bits of DNA
$dna="CGGTAATTCCTGCA";
$G_count=0;
for ($pos=0; $pos < length($dna); $pos++) {
    $base=substr($dna,$pos,1);
    ++$G_count if ($base eq 'G');
} # end for
# ... do something else
$new_dna = <DNA_FILE>;    # read the next sequence from an open filehandle
$G_count=0;
for ($pos=0; $pos < length($new_dna); $pos++) {
    $base=substr($new_dna,$pos,1);
    ++$G_count if ($base eq 'G');
} # end for
It is inconvenient to repeat a piece of code many times when it does the same thing each time.
It would be better if we could write something like:
# Counts Gs in various bits of DNA
# improved version (PSEUDO PERL)
# Main program
$dna="CGGTAATTCCTGCA";
count_g using $dna;
# ... do something else
$new_dna = <DNA_FILE>;
count_g using $new_dna;
# ...
count_G subroutine:
for ($pos=0; $pos < length($dna); $pos++) {
    $base=substr($dna,$pos,1);
    ++$G_count if ($base eq 'G');
}
Subroutines
The pieces of code used in this way are often called subroutines and are common to all programming languages (but with different names):
subroutines (Perl, FORTRAN)
functions (C, C++*, FORTRAN, Java*)
procedures (PASCAL)
Essential for procedural or structured programming.
* object-oriented programming languages
Advantages of using subroutines
Saves typing → fewer lines of code → less likely to make a mistake
Re-usable:
    if the subroutine needs to be modified, it can be changed in only one place
    other programs can use the same subroutine
    can be tested separately
Makes the overall structure of the program clearer
Program design using subroutines - conceptual flow
subroutines can use other subroutines to make more complex and flexible programs
Program design using subroutines - pseudo-code
#
# Main program
# pseudo-code
..set variables
.
call sub1
.
call sub2
.
call sub3
.
exit program
sub 1
    # code for sub 1
exit subroutine
sub 2
    # code for sub 2
exit subroutine
sub 3
    # code for sub 3
    call sub 4
exit subroutine
sub 4
    # code for sub 4
exit subroutine
Using subroutines in Perl
# Program to count Gs in DNA sequences
# (valid perl)
# Main program
$dna="GGCCTAACCTCCGGT";
count_G();    # parentheses needed: the sub is defined further down
print "no. of G in $dna=$number_of_g\n";
# subroutines
sub count_G {
    for ($pos=0; $pos < length($dna); $pos++) {
        $base=substr($dna,$pos,1);
        ++$number_of_g if ($base eq 'G');
    } # end for
}
Example 1.
Subroutines in Perl
Defined using the sub keyword:
sub name {
    ...
}
Called from the main program or another subroutine using its name:
name();
The parentheses can be dropped (name;) only if the sub has already been declared earlier in the file.
Sometimes you will see in old Perl programs (like mine)
&name;
but the & is optional in modern Perl.
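As a minimal sketch of these calling styles (the sub greet and the variable names are invented for illustration), the following defines the sub before it is used, so that all three forms work:

```perl
use strict;
use warnings;

# Hypothetical sub, defined before the calls so that the bare name also works
sub greet {
    return "hello from greet";
}

my $with_parens = greet();   # recommended form
my $bare_name   = greet;     # allowed only because greet is already declared
my $old_style   = &greet;    # old style; also passes the current @_ along

print "$with_parens\n$bare_name\n$old_style\n";
```

All three variables should hold the same string. If the definition of greet were moved below the calls, the bare form would no longer be treated as a call, which is one reason to prefer greet().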
Subroutines in Perl
# main program
...
exit;
# subroutines defined here
sub sub1 {...}
sub sub2 {...}
sub sub3 {...}
Subroutines can be placed anywhere in the program, but it is best to group them at the end.
The exit is not strictly necessary, but it makes it clear that we want to leave the program here.
Return to example 1 - why is this bad?
# Program to count Gs in DNA sequences
# Main program
$dna="GGCCTAACCTCCGGT";
count_G();
print "no. of G in $dna=$number_of_g\n";
exit;
# subroutines
sub count_G {
    for ($pos=0; $pos < length($dna); $pos++) {
        $base=substr($dna,$pos,1);
        ++$number_of_g if ($base eq 'G');
    } # end for
}
What does count_G need? (Nothing in the call shows that it reads $dna.)
Where did this come from? ($number_of_g is only ever set inside the sub.)
Perl subroutines - passing parameters
The inputs and outputs of a subroutine are specified using parameters (or arguments) given when the subroutine is called:
$no_of_G = count_g($dna);
It is now clear what the subroutine expects as input and what is returned as output.
Other examples:
$day_in_yr = calc_day($day,$month);
$num_sequences = read_sequence_file(@database);
Input parameters
All the parameters in a Perl subroutine (including arrays) end up in a single array called @_
Therefore in the code:
#
$pos = find_motif($motif,$protein);
...
sub find_motif {
    $a = $_[0];    # first parameter
    $b = $_[1];    # second parameter
    # (in real code avoid the names $a and $b: sort uses them)
    ...
}
$a takes the value of $motif
$b takes the value of $protein
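A runnable sketch of the same idea follows. The body of find_motif is an assumption made for illustration (it uses Perl's built-in index to locate the motif); the slide does not show the real implementation. It also unpacks @_ with a list assignment, which is equivalent to reading $_[0] and $_[1].

```perl
use strict;
use warnings;

# Illustrative body only: the original find_motif is not shown in the slides
sub find_motif {
    my ($motif, $protein) = @_;        # same values as $_[0] and $_[1]
    return index($protein, $motif);    # 0-based position, or -1 if not found
}

my $pos = find_motif('KDEL', 'MSTAKDELQ');
print "pos=$pos\n";    # prints "pos=4"
```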
Subroutine output (return values)
A subroutine does not have to explicitly return something to the main program:
print_title();
sub print_title {
    print "Sequence Manipulation program\n";
    print "-----------------------------\n";
    print "Written by: A.Nother \n";
    print "Version 1.1: \n";
}
but often it does, even if only to signal the procedure went well or gave an error.
Subroutine return values
By default the subroutine returns the last thing evaluated, but you can use the return statement to make this explicit:
sub count_G {
    $dna=$_[0];                # input
    for ($pos=0; $pos < length($dna); $pos++) {
        $base=substr($dna,$pos,1);
        ++$number_of_g if ($base eq 'G');
    } # end for
    return $number_of_g;       # output
}
return also exits the subroutine
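The two behaviours can be compared side by side. This is only a sketch; the sub names seq_length_implicit and seq_length_explicit are invented for the example.

```perl
use strict;
use warnings;

# Implicit: the value of the last expression evaluated is returned
sub seq_length_implicit {
    my ($dna) = @_;
    length($dna);
}

# Explicit: return hands back the value and leaves the sub immediately
sub seq_length_explicit {
    my ($dna) = @_;
    return length($dna);
    print "never printed\n";    # not reached, because return exits the sub
}

my $implicit = seq_length_implicit('GATC');
my $explicit = seq_length_explicit('GATC');
print "$implicit $explicit\n";    # prints "4 4"
```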
Return values
You can return from more than one point in the sub, which can be useful for signalling errors:
sub count_G {
    ...
    if ($dna eq "") {
        return "No DNA was given"; # exit with error message
    } else {
        ..
    } # end if
    return $number_of_G;
} # end sub
In that case the calling program or sub should check the return value before continuing. However, in general it is best to return from only one point; otherwise the logic is difficult to follow.
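One way the caller's check might look is sketched below. Note the swapped-in convention: instead of returning an error message string as in the slide, this version returns undef on error (a bare return), which is an assumption made here because it cannot be confused with a valid count.

```perl
use strict;
use warnings;

# Sketch: returns the number of Gs, or undef if no DNA was given
sub count_G {
    my ($dna) = @_;
    return if !defined $dna || $dna eq '';    # early exit signals an error
    my $number_of_g = 0;
    for (my $pos = 0; $pos < length($dna); $pos++) {
        $number_of_g++ if substr($dna, $pos, 1) eq 'G';
    }
    return $number_of_g;
}

# The caller checks the return value before continuing
for my $seq ('GGCCTAACCTCCGGT', '') {
    my $n = count_G($seq);
    if (defined $n) {
        print "no. of G = $n\n";
    } else {
        print "error: no DNA was given\n";
    }
}
```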
Return values
Can also return multiple scalars, arrays, etc. but just as for the input everything ends up in a single array or list:
@DNA = read_file($filename);
.
.
($nG,$nC,$nT,$nA) = count_bases($dna);
sub count_bases {
...
return ($num_G,$num_C,$num_T,$num_A);
} # end sub
note ( and ) for the list
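A possible body for count_bases is sketched here so that the list return can be tried out. The implementation (a hash of counters filled by splitting the string into characters) is an assumption; the slide elides the real code.

```perl
use strict;
use warnings;

# Illustrative implementation: the slide does not show the real body
sub count_bases {
    my ($dna) = @_;
    my %count = (G => 0, C => 0, T => 0, A => 0);
    for my $base (split //, $dna) {
        $count{$base}++ if exists $count{$base};    # ignore other characters
    }
    return ($count{G}, $count{C}, $count{T}, $count{A});    # a list of four scalars
}

my ($nG, $nC, $nT, $nA) = count_bases('GGCCTAACCTCCGGT');
print "G=$nG C=$nC T=$nT A=$nA\n";    # prints "G=4 C=6 T=3 A=2"
```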
Counting bases – Attempt 2
# Program to count Gs in DNA sequences
# using input/output parameters
# Main program
$dna="GGCCTAACCTCCGGT";
$num_g = count_G($dna);
print "no. of G in $dna=$num_g\n";
exit;
# subroutines
sub count_G {
    $dna=$_[0];
    for ($pos=0; $pos < length($dna); $pos++) {
        $base=substr($dna,$pos,1);
        ++$number_of_g if ($base eq 'G');
    } # end for
    return $number_of_g; # return value of sub
}
Better, but ...
all the variables inside the sub are also visible outside it, not only the parameters!
The need for variable scoping
A subroutine written this way is in danger of overwriting a variable used elsewhere in the program. Remember that a subroutine should work like a black box: apart from well-defined inputs/outputs, it should not affect the rest of the program.
[Figure: the sub as a box with an input and an output]
Apart from the input/output, all variables needed by the sub should appear and disappear within the sub.
This also allows us to use the same variable names outside and inside the sub without conflict.
Variable scoping in Perl
By default, all variables defined outside a sub are visible within it and vice versa: all variables are global.
Therefore the sub can change variables used outside the sub.
Solution?
Restrict the scope of the variables by making them local to the subroutine → eliminate the risk of altering a variable present outside the sub. This also makes it clear what the subroutine needs to function.
Variable scoping in Perl
In Perl, variables are made local to a subroutine (or a block) using the my keyword. For example,
my $variable1;               # simple declaration
my $dna="GGTTCACCACCTG";     # with initialization
my ($seq1,$seq2,$seq3);      # more than 1
Caution
my $seq1, $seq2;
This means
my $seq1;
$seq2;
which is valid Perl (unless use strict is in effect), so the compiler won't give an error.
Must use () if multiple vars per line.
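The pitfall can be checked without breaking the surrounding script by compiling the suspect line inside a string eval. This is a sketch only; the variable names $x, $y and $compiled are invented for the example.

```perl
use strict;
use warnings;

# Correct form: parentheses declare both variables
my ($seq1, $seq2) = ('GATTACA', 'CCGG');
print "seq1=$seq1 seq2=$seq2\n";

# Wrong form, compiled in isolation: only $x is declared by my,
# so under strict the undeclared $y is a compile-time error.
# (Perl may also print a "Parentheses missing" warning to STDERR.)
my $compiled = eval 'use strict; my $x, $y; 1';
print "strict rejected it: ", ($compiled ? "no" : "yes"), "\n";
```

With use strict in effect the second print should report "yes", which is a good reason to enable strict in every program.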
Subroutines with local variables
# Program to count Gs in DNA sequences – final version
# Main program
$dna="GGCCTAACCTCCGGT";
$num_g = count_G($dna);
print "no. of G in $dna=$num_g\n";
exit;
# subroutines
sub count_G {
    my $dna=$_[0];
    my ($pos,$base);
    my $number_of_g=0;
    for ($pos=0; $pos < length($dna); $pos++) {
        $base=substr($dna,$pos,1);
        ++$number_of_g if ($base eq 'G');
    } # end for
    return $number_of_g; # return value
}
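Declaring variables like this is called lexical scoping in the Perl man pages.
Remember that my $dna=$_[0]; copies the value of the parameter into a new variable, so changes to $dna inside the sub do not affect the caller's variable.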
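A short sketch of why this matters: the main program below has its own $pos (an invented value, purely for illustration), and the sub's lexical $pos leaves it alone.

```perl
use strict;
use warnings;

my $pos = 42;    # a variable of the main program that happens to share a name

sub count_G {
    my ($dna) = @_;
    my $number_of_g = 0;
    # this $pos is a new lexical variable, visible only inside the loop
    for (my $pos = 0; $pos < length($dna); $pos++) {
        $number_of_g++ if substr($dna, $pos, 1) eq 'G';
    }
    return $number_of_g;
}

my $num_g = count_G('GGCCTAACCTCCGGT');
print "num_g=$num_g pos=$pos\n";    # prints "num_g=4 pos=42"
```

Without the my declarations the loop would have left the main program's $pos at 15.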
Other examples of subroutine use
sub find_motif {
    my $motif=shift;       # shifts the @_ array
    my $protein=shift;     # (avoids $_[0], etc)
    ...
}
sub count_C_and_G {
    my @fasta_lines=@_;
    ...
    return ($num_C,$num_G);    # returns a list
}
sub reverse_seq {
    use strict;      # the strict pragma enforces use of my
    $seq=$_[0];      # so this line will give an error (even
                     # if defined in main program)
    ...
}
Question: what if we want to pass two arrays?
$seq=compare_seqs(@seqs1,@seqs2);
Remember that everything arrives in the sub in a single array. Likewise for return values:
(@annotations,@dna) = parse_genbank(@dna);
...
sub parse_genbank {
...
return (@annotations,@dna);
}
In the first example, what will be in the special array @_?
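The flattening can be observed directly. In this sketch the sub pass_through and the sample sequences are invented for illustration; it simply returns whatever it received.

```perl
use strict;
use warnings;

# Hypothetical sub: returns everything it received, unchanged
sub pass_through {
    my @received = @_;    # both arrays arrive merged into one list
    return @received;
}

my @seqs1 = ('GATC', 'TTAA');
my @seqs2 = ('CCGG');

# The first array on the left swallows the whole returned list
my (@first, @second) = pass_through(@seqs1, @seqs2);
print "first has ", scalar(@first), " elements, second has ", scalar(@second), "\n";
# prints "first has 3 elements, second has 0"
```

The boundary between the two arrays is lost both on the way in and on the way out.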
Solution: use references
Subroutines – call by value
$i=2;
$j=add_100($i);
print "i=$i\n";
sub add_100 {
    my $i=$_[0];
    $i=$i+100;
    return $i;
}
This is called call by value because the sub works on a copy of the parameter (made here by my $i=$_[0]), and whatever happens to this copy in the subroutine doesn't affect the variable in the main program: $i is unaffected by the subroutine.
What are references?
A reference can be thought of as an identifier or some other description of an object, rather than the object itself.
[Figure: a shopping list (To buy: bananas, beer, fruit, pasta, frozen pizza, more beer); labels: reference or copy]
References
In computing, references are often addresses of objects (scalars, arrays, ...) in memory:
[Figure: memory addresses 100, 200, 300; labels: address of @genbank array, address of $dna scalar]
References
References (sometimes called pointers in other languages) can be more convenient because in Perl they are scalars → often much smaller than the object they refer to (e.g. an array or hash).
Array references can be passed around and copied very efficiently, often also using less memory.
Being scalars, they can be used to make complicated data structures such as arrays of arrays, arrays of hashes and so on.
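As a small sketch of such a data structure, an array of arrays can be built by storing array references (created with the backslash operator) as ordinary scalar elements. The names @database, @seqs1 and @seqs2 and their contents are made up for the example.

```perl
use strict;
use warnings;

my @seqs1 = ('GATC', 'TTAA');
my @seqs2 = ('CCGG');

# Each element of @database is a scalar: a reference to another array
my @database = (\@seqs1, \@seqs2);

my $second_of_first = $database[0]->[1];          # element 1 of the first array
my $size_of_second  = scalar @{ $database[1] };   # number of elements in the second

print "$second_of_first $size_of_second\n";       # prints "TTAA 1"
```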
References in Perl
The simplest way to create a reference in Perl is with the backslash operator \
$scalar_ref = \$sequence;       # reference to a scalar
$dna_ref = \@DNA_list;          # reference to an array
$hash_ref = \%genetic_code;     # reference to a hash
To get back the original object the reference needs to be dereferenced:
$scalar = $$scalar_ref;         # for scalars just add $
@new_dna = @$dna_ref;           # for arrays just add @
%codon_lookup = %$hash_ref;     # similarly for hashes
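A runnable sketch of creating and dereferencing references follows. The sample data are invented, and the arrow notation for reaching a single element (->[0], ->{ATG}) goes slightly beyond what the slides show.

```perl
use strict;
use warnings;

my @DNA_list     = ('GATC', 'TTAA', 'CCGG');
my %genetic_code = (ATG => 'M', TGG => 'W');

my $dna_ref  = \@DNA_list;        # reference to the array
my $hash_ref = \%genetic_code;    # reference to the hash

my @new_dna = @$dna_ref;          # dereferencing copies the whole array
push @new_dna, 'AAAA';            # changes only the copy

my $first = $dna_ref->[0];        # one element through the reference
my $amino = $hash_ref->{ATG};     # one value through the reference

print scalar(@DNA_list), " ", scalar(@new_dna), "\n";    # prints "3 4"
print "$first $amino\n";                                 # prints "GATC M"
```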
Passing two arrays into a sub using references
# compare two databases, each held as an array
#
$results = compare_dbase(\@dbase1,\@dbase2);   # supply refs
...
sub compare_dbase {
    my ($db1_ref,$db2_ref) = @_;   # params are refs to arrays
    my @db1 = @$db1_ref;           # dereference (makes a local copy)
    my @db2 = @$db2_ref;           # dereference (makes a local copy)
    ...                            # now use @db1,@db2
    return $results;
}
Similarly, we can return two or more arrays by the same method.
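Returning two arrays by reference might look like the sketch below. The sub split_records and its crude rule (a line made only of A, C, G and T counts as sequence) are assumptions for illustration and are not a real GenBank parser.

```perl
use strict;
use warnings;

# Hypothetical sub: separates sequence lines from everything else
sub split_records {
    my ($lines_ref) = @_;
    my (@annotations, @dna);
    for my $line (@$lines_ref) {
        if ($line =~ /^[ACGT]+$/) {
            push @dna, $line;
        } else {
            push @annotations, $line;
        }
    }
    return (\@annotations, \@dna);    # two scalars, so the arrays stay separate
}

my @lines = ('LOCUS demo', 'GATTACA', 'SOURCE example', 'CCGG');
my ($ann_ref, $dna_ref) = split_records(\@lines);
print scalar(@$ann_ref), " annotations, ", scalar(@$dna_ref), " sequences\n";
# prints "2 annotations, 2 sequences"
```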
References – final words
Caution: calling by reference can change the original variables:
@dna1=qw(G G T C T G);
@dna2=qw(A A A A A);
add_seqs(\@dna1,\@dna2);
print "dna1=@dna1\ndna2=@dna2\n";
sub add_seqs {
    my ($seq1,$seq2) = @_;
    push(@$seq1,@$seq2);    # appends to the caller's @dna1
}
OUTPUT
dna1=G G T C T G A A A A A
dna2=A A A A A
If you don't want this behaviour, create local copies of the arrays as in the previous example.
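A variant that leaves the caller's arrays untouched is sketched here. The name add_seqs_copy is invented; it builds a new array from the two dereferenced inputs and returns a reference to it.

```perl
use strict;
use warnings;

# Hypothetical variant of add_seqs that does not modify its inputs
sub add_seqs_copy {
    my ($seq1_ref, $seq2_ref) = @_;
    my @combined = (@$seq1_ref, @$seq2_ref);    # new array built from copies
    return \@combined;
}

my @dna1 = qw(G G T C T G);
my @dna2 = qw(A A A A A);
my $combined_ref = add_seqs_copy(\@dna1, \@dna2);

print "dna1=@dna1\n";                 # prints "dna1=G G T C T G"
print "combined=@$combined_ref\n";    # prints "combined=G G T C T G A A A A A"
```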
Subroutines - summary
subroutines defined with sub represent the main tool for structuring programs in Perl.
variables used only by the subroutine should be declared with my, to prevent conflict with external variables (lexical scoping)
parameters passed in to the sub end up in the single array @_; similarly for any return values
array references need to be used to pass two or more arrays in (call by reference) or out of a sub.