An Introduction to Perl
MBG8680 2006
Gerard Tromp
References
Books:
• Wall L, Christiansen T, Orwant J. Programming Perl. Sebastopol, CA: O'Reilly, 2000:1-1070
• Cozens S. Advanced Perl Programming. Sebastopol, CA: O'Reilly, 2005:1-281
• Christiansen T, Torkington N. Perl Cookbook. Sebastopol, CA: O'Reilly, 1998:1-757
• Perl manual pages
Web (not exhaustive; try Google: learning perl):
• http://www.oreilly.com/
• http://www.perl.com/ (O’Reilly maintains)
• http://www.cpan.org (Comprehensive Perl Archive Network)
• http://learn.perl.org/
What is Perl?
Scripting language
• Interpreted at run-time
Developed as an improved awk/nawk (data/text extraction tool on UNIX)
• Aho, Weinberger and Kernighan (Bell Laboratories)
– A. Aho, B. Kernighan, and P. Weinberger. AWK -- A pattern scanning and processing language. Software Practice and Experience, 9(4):267--280, 1979
Extremely powerful pattern matching capabilities (regular expression engine)
What is Perl? (2)
Extensible
• Modules and Packages (CPAN: www.cpan.org)
General programming language
Can be used for:
• system calls (date, time, sockets, network)
• file IO
Complex programming tasks
• Genome builds are performed with Perl
What is Perl? The official description.
Perl is a general-purpose programming language originally developed for text manipulation and now used for a wide range of tasks including system administration, web development, network programming, GUI development, and more.
The language is intended to be practical (easy to use, efficient, complete) rather than beautiful (tiny, elegant, minimal). Its major features are that it's easy to use, supports both procedural and object-oriented (OO) programming, has powerful built-in support for text processing, and has one of the world's most impressive collections of third-party modules.
Some important concepts
Perl uses punctuation and some characters to distinguish specific meaning (as do most computer languages). Train your eyes to note the difference between:
• ( ), [ ], { } – important delimiters
• $, @, % – important data types (variables)
Basics – variables
Variable syntax: variables contain data
Types:
• scalar      $   $foo   simple value, e.g., string, number
• array       @   @foo   list of values
• hash        %   %foo   paired lists of keys – values
• subroutine  &   &foo   block (chunk) of code that can be called
• typeglob    *   *foo   all things called foo
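The variable types above can be sketched in a few lines. This is only an illustration: the variable names and the small data set are made up for the example and are not part of the lecture material.

```perl
use strict;
use warnings;

# Illustrative data only; the names are assumptions for this sketch.
my $gene  = 'COL3A1';                    # scalar: a single value
my @exons = (1, 2, 3, 4);                # array: an ordered list of values
my %chrom = (COL3A1 => 2, FBN1 => 15);   # hash: key => value pairs

# A subroutine is a named block of code that can be called.
sub describe {
    my ($name) = @_;
    return "$name is on chromosome $chrom{$name}";
}

# A single element of an array or hash is a scalar, so it is written with $.
my $second_exon = $exons[1];
my $summary     = describe($gene);

print "$summary\n";
print "Second exon: $second_exon; number of exons: ", scalar(@exons), "\n";
```

Note that the sigil follows what is being accessed: @exons is the whole list, while $exons[1] is one value taken from it.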
Basics – functions (procedures)
Function syntax:
Perl does not distinguish between functions, procedures and subroutines (other languages do)
Function syntax is defined in the manual pages
• perldoc -f FUNCTION (documentation for a single built-in function)
• see perldoc perlfunc
Some functions take no arguments, others take variable/optional arguments, e.g.,
• print FILEHANDLE LIST (LIST is a list of values or variables)
• print LIST
• print
Basics – operators
Operators “do things”:
Mathematical
• addition         +    $foo + $bar
• multiplication   *    $foo * $bar
• division         /    $foo / $bar
• subtraction      -    $foo - $bar
• modulus          %    $foo % $bar
• exponentiation   **   $foo ** $bar
Basics – operators (2)
assignment
• simple    =    $a = 3; $a = "abc"
• complex
– mathematical (starting from $a = 3)
*=    multiply      $a *= 3     ($a == 9)
-=    subtract      $a -= 4     ($a == 5)
+=    add           $a += 5     ($a == 10)
– string (starting from $a = "abc")
.=    concatenate   $a .= "d"   ($a eq "abcd")
x=    repeat        $a x= 3     ($a eq "abcdabcdabcd")
||=   conditional   $a ||= "a"  (assigns only if $a is false)
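The compound assignments can be traced step by step. A minimal sketch follows; the variable names are chosen for the example ($n and $s rather than $a, to keep clear of the special sort variables).

```perl
use strict;
use warnings;

# Mathematical compound assignment
my $n = 3;
$n *= 3;    # 9
$n -= 4;    # 5
$n += 5;    # 10

# String compound assignment
my $s = "abc";
$s .= "d";  # "abcd"
$s x= 3;    # "abcdabcdabcd"

# Conditional assignment: assigns only when the left side is false
my $default = "";
$default ||= "a";       # "" is false, so $default becomes "a"

my $already_set = "z";
$already_set ||= "a";   # "z" is true, so it is left unchanged

print "$n $s $default $already_set\n";
```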
Basics – operators (3)
Logical
• and   &&, and   $a && $b
• or    ||, or    $a || $b
• not   !, not    ! $a
• xor   xor       $a xor $b
Basics – operators (4)
Test
                            numeric   string
• equality                  ==        eq
• inequality                !=        ne
• less than                 <         lt
• greater than              >         gt
• less than or equal        <=        le
• greater than or equal     >=        ge
• comparison                <=>       cmp
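The choice between the numeric and string columns matters, because the same values can compare differently. A short sketch with made-up values:

```perl
use strict;
use warnings;

# The same two values are equal as numbers but not as strings.
my $numeric_equal = ("10" == "10.0") ? 1 : 0;
my $string_equal  = ("10" eq "10.0") ? 1 : 0;

# <=> and cmp return -1, 0 or 1, which is what sort expects.
my @ids       = (10, 9, 100, 2);
my @by_number = sort { $a <=> $b } @ids;   # numeric order
my @by_string = sort { $a cmp $b } @ids;   # character-by-character order

print "numeric: $numeric_equal, string: $string_equal\n";
print "by number: @by_number\n";
print "by string: @by_string\n";
```

Sorting numbers with cmp is a common source of surprises: as strings, "100" sorts before "2".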
Basics – control
Flow control (execute till condition is met)
conditional
• if        if( CONDITION ){ }
            if( CONDITION ){ }
            elsif( CONDITION ){ }
            else{ }
• unless    unless( CONDITION ){ }
• while     while( CONDITION ){ }
• for       for( $a=1; $a<10; $a++ ){ }
• foreach   foreach( LIST ){ }
Basics – control (2)
Flow control (execute till condition is met)
termination
• next   next;
         next if ( CONDITION );
         skips the rest of the current iteration
• last   last;
         last if ( CONDITION );
         terminates the loop
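The control constructs from the last two slides can be combined in one small loop. The list of values below is invented for illustration.

```perl
use strict;
use warnings;

my @values = (3, -1, 8, 0, 12, 5);
my @kept;

foreach my $v (@values) {
    next if $v < 0;     # skip the rest of this iteration for negative values
    last if $v > 10;    # leave the loop entirely at the first value above 10

    if ($v == 0) {
        push @kept, 'zero';
    }
    elsif ($v % 2 == 0) {
        push @kept, "even:$v";
    }
    else {              # else takes no condition
        push @kept, "odd:$v";
    }
}

print "@kept\n";
```

The value 5 is never examined because last ends the loop when 12 is reached.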
Text Manipulation in Perl.
Text manipulation was the primary reason for developing Perl originally
The text manipulation “engine” in Perl is an extended Unix Regular Expression (REGEX)
History
• Derived from “regular sets” (mathematical language theory)
• Part of Unix editors ‘qed’ and ‘ed’ -> grep/egrep
• Incorporated into sed, awk (nawk)
• Extended in some current versions of Unix (Linux) to reflect the Perl extensions
• Incorporated into Java regular expressions
Regular Expressions
Way to specify a set of strings without enumerating each possibility
Way to specify a pattern to match
Distinct syntax
Delimiters
• /PATTERN/     traditional
• m#PATTERN#    with a leading m, almost any other character can serve as the delimiter
Metacharacters
• special interpretation given to specific characters/character combinations
Regular Expression – Metacharacters (1)
[ ]    list / character class      Match any character listed between brackets
[^ ]   negated list                Match any character except listed characters
$      terminal anchor             Match end of string
^      proximal anchor             Match beginning of string
.      single-character wildcard   Match any character (once)
*      multi-character wildcard    Match the preceding item as many times as possible, including zero times (greedy)
|      alternation / or            Match pattern preceding or following
Regular Expression – Metacharacters (2)
Unix escape characters (metacharacters)
\ – backslash
• “escapes” meaning of special (non-alphanumeric) character, e.g., $, %, ^
• converts some alphabetical characters into special metacharacters
– \n newline
– \r carriage return
– \t tab
– \f form-feed
– \a alarm (BEL)
– \0 ASCII NULL
– \e escape
Regular Expression – Metacharacters (3)
Perl extensions
\s [ \t\n\r\f] whitespace
\S [^ \t\n\r\f] not whitespace
\w [a-zA-Z_0-9] word character
\W [^a-zA-Z_0-9] not word character
\d [0-9] digit
\D [^0-9] non-digit
\b true at word boundary
\B true not at word boundary
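A few of the metacharacters from the preceding slides, applied to one line of text. The sample line is made up for this sketch and is not taken from any real data file.

```perl
use strict;
use warnings;

# Illustrative input line: a label, a tab, a number and a gene symbol.
my $line = "GeneID:\t1281 COL3A1\n";

my $starts_with_gene = $line =~ /^Gene/          ? 1 : 0;  # ^ anchors at the start
my $tab_then_digits  = $line =~ /\t\d\d\d\d\s/   ? 1 : 0;  # \t, \d and \s classes
my $either_symbol    = $line =~ /FBN1|COL3A1/    ? 1 : 0;  # | alternation
my $no_digits_at_all = $line =~ /^[^0-9]+$/      ? 1 : 0;  # negated list
my $ends_like_symbol = $line =~ /[A-Z]\d$/       ? 1 : 0;  # $ also matches before a final newline

# \b marks the boundary between word and non-word characters.
my ($first_word) = $line =~ /\b(\w+)\b/;

print "$starts_with_gene $tab_then_digits $either_symbol ",
      "$no_digits_at_all $ends_like_symbol $first_word\n";
```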
Regular Expression – Quantifiers
Quantifiers allow specification of how many times the previous character/pattern should be matched
Originally limited in Unix
*             match 0 or more times
{Min,Max}     match at least Min times and no more than Max times
{Min,}        match at least Min times
{,Max}        match no more than Max times (Perl before version 5.34 treats this form literally; {0,Max} works in all versions)
{Count}       match exactly Count times
Regular Expression – Quantifiers (2)
Perl extensions
+ match at least once
? match zero or 1 times
*? match minimum of 0 or more times
+? match at least once but minimum times
?? match minimum of 0 or 1 times
{}? minimal form of specific quantifiers
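The difference between greedy and minimal quantifiers is easiest to see on a string where both forms can match. The sample strings here are invented for illustration.

```perl
use strict;
use warnings;

my $html = '<b>one</b> and <b>two</b>';

# Greedy: .* takes as much as possible, so it runs to the LAST </b>.
my ($greedy) = $html =~ /<b>(.*)<\/b>/;

# Minimal: .*? takes as little as possible, so it stops at the FIRST </b>.
my ($minimal) = $html =~ /<b>(.*?)<\/b>/;

# A counted quantifier: between 2 and 4 digits, greedy by default.
my ($digits) = 'id=12345' =~ /(\d{2,4})/;

# The same quantifier in its minimal form.
my ($few_digits) = 'id=12345' =~ /(\d{2,4}?)/;

print "greedy:  $greedy\n";
print "minimal: $minimal\n";
print "digits:  $digits, minimal digits: $few_digits\n";
```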
Regular Expression – Capturing
Capturing allows (a portion of) the pattern to be used elsewhere
Originally limited in Unix (awk/sed)
\(PATTERN\)   escaped parentheses
Captured pattern(s) stored in numbered buffers: \1, \2 … \n
For input line:
“This is a test”
the pattern:
/\([Tt]his\).*\(t[es]*t\)/
yields two buffers:
\1 holds “This”; \2 holds “test”
Regular Expression – Perl Capturing
Capturing allows (a portion of) the pattern to be used elsewhere
In Perl – do NOT escape parentheses
(PATTERN)   parentheses
Captured pattern(s) stored in variables: $1, $2 … $n
For input line:
“This is a test”
the pattern:
/([Tt]his).*(t[es]*t)/
yields two variables:
$1 eq "This"; $2 eq "test"
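The capturing example from the slide can be run directly. A minimal sketch; only the variable names are additions.

```perl
use strict;
use warnings;

my $line = "This is a test";

# $1 and $2 are only meaningful after a successful match,
# so copy them inside the if block.
my ($first, $second);
if ( $line =~ /([Tt]his).*(t[es]*t)/ ) {
    ($first, $second) = ($1, $2);
}

# In list context a match returns the captured strings directly.
my ($word_a, $word_b) = $line =~ /([Tt]his).*(t[es]*t)/;

print "first: $first, second: $second\n";
```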
Regular Expression – Perl Capturing and Clustering
(?#…) comment – ignore
(?:…) cluster, but do not capture
(?=…) test to see if pattern matches ahead – look ahead
(?!…) look ahead to test if pattern does NOT match (negative look ahead)
(?<=…) look behind
(?<!…) Negative look behind
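Clustering and the look-around tests can be sketched on a short string. The labels and numbers below are illustrative sample text only.

```perl
use strict;
use warnings;

my $text = "GeneID:1281 MIM:120180 GeneID:2200";

# Look ahead: words that are followed by a colon (the colon is not consumed).
my @labels = $text =~ /(\w+)(?=:)/g;

# Look behind: numbers that are preceded by "GeneID:".
my @gene_ids = $text =~ /(?<=GeneID:)(\d+)/g;

# Negative look behind: whole numbers NOT preceded by "GeneID:".
my @other_ids = $text =~ /(?<!GeneID:)\b(\d+)/g;

# Clustering: (?:ab)+ groups "ab" for the quantifier without capturing it,
# so the only capture is the outer pair of parentheses.
my ($repeat) = "ababab-c" =~ /^((?:ab)+)/;

print "labels: @labels\n";
print "gene ids: @gene_ids\n";
print "other ids: @other_ids\n";
print "repeat: $repeat\n";
```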
Perl quotes
Different quote characters have specific meaning and properties.
Interpolation is the expansion of variables inside a quoted string; it occurs for some quote types but not others
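A brief sketch of interpolation with the conventional quotes and with Perl's generic quote operators (q, qq, qw, qr). The variable names are made up for the example, and command execution with backticks is left out because its result depends on the system.

```perl
use strict;
use warnings;

my $name = "perl";

my $single  = 'Hello $name\n';     # single quotes: no interpolation, taken literally
my $double  = "Hello $name";       # double quotes: $name is expanded
my $generic = qq{Say "$name"};     # qq{} behaves like double quotes
my $literal = q{Cost: $5};         # q{} behaves like single quotes
my @words   = qw(alpha beta gamma);   # word list, no quotes or commas needed

my $re      = qr/p.rl/;            # precompiled regular expression
my $matched = ($name =~ $re) ? 1 : 0;

(my $upper = $name) =~ tr/a-z/A-Z/;   # character translation
(my $fixed = $name) =~ s/perl/Perl/;  # pattern substitution

print "$single | $double | $generic | $literal\n";
print "@words | $matched | $upper | $fixed\n";
```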
Perl quotes (2)
Conventional   Generic   Interpretation          Interpolation
' '            q//       Literal string          No
" "            qq//      Literal string          Yes
` `            qx//      Command execution       Yes
( )            qw//      Word list               No
/ /            m//       Pattern match           Yes
s///           s///      Pattern substitution    Yes
y///           tr///     Character translation   No
" "            qr//      Regular expression      Yes
Variable assignment
$x = "abc";
@x = ("abc", "def", "ghi", "klm");
%x = (1, "abc", 2, "def", 3, "ghi", 4, "klm");
What does the following produce?
print $x, "\n";
print $x[3], "\n";
print $x{2}, "\n";
Output:
abc
klm
def
What happened and why?
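One way to read the result: $x, @x and %x are three separate variables that happen to share a name, and the brackets after the name decide which one is used. A sketch of the same exercise under strict, with helper variable names added for the example:

```perl
use strict;
use warnings;

my $x = "abc";                                              # the scalar $x
my @x = ("abc", "def", "ghi", "klm");                       # the array @x
my %x = (1 => "abc", 2 => "def", 3 => "ghi", 4 => "klm");   # the hash %x

my $from_scalar = $x;       # no brackets: the scalar
my $from_array  = $x[3];    # square brackets: element 3 of @x (counting from 0)
my $from_hash   = $x{2};    # curly braces: value stored under key 2 in %x

print "$from_scalar\n$from_array\n$from_hash\n";
```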
A Simple Command-line Script Using an Array
Type each of the following on one line in the PuTTY window (shell window)
perl -e '@x=(2,5,7,9,11); print "@x\n";'
perl -e '@x=(2,5,7,9,11); print "$x[4]\n";'
perl -e '@x=(2,5,7,9,11); foreach $x (@x) {print "$x\n"; }'
NOTE: command-line scripts are tricky since the entire script must be enclosed in single quotes
A Simple Command-line Script Using a Hash
Type each of the following on one line
perl -e '%x=(2,5,7,9,11,15); print "%x\n";'
perl -e '%x=(2,5,7,9,11,15); print "$x{5}\n";'
perl -we '%x=(2,5,7,9,11,15); print "$x{5}\n";'
perl -e '%x=(2,5,7,9,11,15); print "$x{7}\n";'
perl -e '%x=(2,5,7,9,11,15); foreach $x (keys %x) {print "$x\t$x{$x}\n"; }'
A Simple (file) Program
A program to extract specific URL data from HTML generated by NCBI Map Viewer “view as table”
#! /usr/bin/perl -w

while(<>){
    if ( /href=\"(http:.*?list_uids=)(\d+)\">([-\w\*]+?)</ ){
        print "$1$2\t$2\t$3\n";
        &mysub($1,$2,$3);
    }
}

sub mysub{
    # … do something;
}
Dissection of a simple program
Examine program line by line
 1 #! /usr/bin/perl -w
 2
 3 while(<>){
 4     if ( /href=\"(http:.*?list_uids=)(\d+)\">([-\w\*]+?)</ ){
 5         print "$1$2\t$2\t$3\n";
 6         &mysub($1,$2,$3);
 7     }
 8 }
 9
10 sub mysub{
11     # … do something;
12 }
Dissection of a simple program (2)
Invocation line
1 #! /usr/bin/perl -w
This line ‘starts’ the Perl program
Syntax is derived from Unix shell script syntax
• #! (pound-bang)
– tells the Unix shell that the next argument is the name, or path and name, of a program (executable)
• /usr/bin/perl
– tells the Unix shell which executable (perl) to find in which path (directory location)
• -w
– “flag(s)” passed to the executable (program)
– tell the program to “do things” or adopt specific behavior
– here: turn on perl warnings
Dissection of a simple program (3)
Control loop and input operator
3 while(<>){
  # elided lines 4 – 7
8 }
while ( CONDITION ) BLOCK
• execute loop until condition becomes false
• here CONDITION is <> , an input operator
– reads from the files named on the command line or, if none are given, from STDIN, the standard input filehandle accessible to every program
– reads until the end-of-file, i.e., until no further data
• BLOCK is a block (chunk) of code
Dissection of a simple program (4)
IF statement – if ( CONDITION ) BLOCK
4     if ( /href=\"(http:.*?list_uids=)(\d+)\">([-\w\*]+?)</ ){
5         print "$1$2\t$2\t$3\n";
6         &mysub($1,$2,$3);
7     }
if ( /PATTERN/ ) BLOCK
• if PATTERN matches execute the BLOCK
href=\"(http:.*?list_uids=)(\d+)\">([-\w\*]+?)<
• what are literals?
• what are character classes?
• what does the pattern match?
Dissection of a simple program (5)
IF statement – if ( CONDITION ) BLOCK
4     if ( /href=\"(http:.*?list_uids=)(\d+)\">([-\w\*]+?)</ ){
5         print "$1$2\t$2\t$3\n";
6         &mysub($1,$2,$3);
7     }
print "$1$2\t$2\t$3\n";
• what does the line do?
• what are $1, $2 and $3? where are they in the pattern?
&mysub($1,$2,$3);
• what is &mysub?
• what are $1, $2 and $3 with respect to mysub?
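To experiment with these questions, the pattern can be applied to a single line held in a variable. The HTML line below is a made-up stand-in (the example.org address and the values in it are assumptions, not real Map Viewer output); only the pattern itself comes from the program.

```perl
use strict;
use warnings;

# Hypothetical input line, invented for this sketch.
my $html = '<a href="http://example.org/query?list_uids=1281">COL3A1</a>';

my ($url_prefix, $id, $name);
if ( $html =~ /href=\"(http:.*?list_uids=)(\d+)\">([-\w\*]+?)</ ) {
    $url_prefix = $1;   # from http: up to and including list_uids=
    $id         = $2;   # the run of digits that follows
    $name       = $3;   # the link text, up to the next <
}

# The same line the program prints: full URL, id, name, separated by tabs.
my $output = "$url_prefix$id\t$id\t$name";
print "$output\n";
```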
Dissection of a simple program (6)
Subroutine
10 sub mysub{
11     # … do something;
12 }
sub mysub BLOCK
• subroutine declaration and code BLOCK
• BLOCK consists of { code }
• everything after # is a comment
• this is a null subroutine – does nothing
• perl does not require declaration of parameters
• all parameters are made available to the subroutine as an array – @_
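How a subroutine picks its arguments out of @_ can be shown with a short sketch. The subroutine names and the argument values are invented for the example; the slide's mysub itself is left as it is.

```perl
use strict;
use warnings;

# Arguments arrive in @_; copying them into named variables is the usual idiom.
sub format_row {
    my ($url, $id, $name) = @_;
    return join("\t", $url, $id, $name);
}

# No parameter declaration is needed, so any number of arguments is accepted.
sub count_args {
    return scalar @_;
}

my $row   = format_row('http://example.org/?id=', 42, 'GENE1');
my $count = count_args('a', 'b', 'c');

print "$row\n";
print "count_args received $count arguments\n";
```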
Getting Help
perldoc                very important program
perldoc perldoc        how to use perldoc
perldoc perl           list of available topics*
perldoc perlintro      useful material like this lecture
perldoc perlfaq        common questions answered
perldoc 'topic'        information on specified topic from list above
* perldoc will extract documentation embedded in packages. The list returned by ‘perldoc perl’ is for the base perl installation
Getting Help (2)
Books – see references
Web – see references
Unix ‘man’ command.
Although perldoc will return help/information for most perl-related items, there are still a few that only have ‘man pages’
Hands-on Problems
Write a Perl script that will do the following.
1: for chromosome 20, create a tab-delimited list of:
• gene names
• gene ids (GeneID number)
• chromosomal location (beginning, end)
• orientation
2: extend the columns to include (where appropriate):
• HUGO HGNC ID
• OMIM ID
Improvements on the script(s)
Wouldn’t it be great if you could skip the browsing part and go straight to the web page in Perl?
Look at the LWP module
• http://search.cpan.org/dist/libwww-perl/lib/LWP.pm
What is a ‘module’?
Perl modules
A module is a collection of scripts (code) that have already been written for you
Strictly speaking, a module is a collection of one or more packages
A package is a small collection of code
package NAME;
BLOCK1;
Perl modules (2)
Why packages?
• allows namespace to be uncluttered
• keeps related code in one place
• allows reusability of code
Modules?
• can think of as extended packages
• can be procedural (traditional) or object-oriented
Perl modules (3)
Modules must be installed from source (CPAN)
A module is included in a script by:
• use MODULE;
– executes the module at compile time
– complains immediately if not found
• require MODULE;
– executes the module at run time
– only complains later
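The difference in timing between use and require can be sketched with two modules that ship with Perl, List::Util and File::Basename. The module name No::Such::Module::Hopefully is deliberately made up so that loading it fails.

```perl
use strict;
use warnings;

# use: the module is loaded while the script is being compiled.
# If List::Util were missing, the script would stop before running anything.
use List::Util qw(sum);
my $total = sum(1 .. 4);

# require: the module is loaded when this statement is reached at run time,
# so a failure can be trapped with eval and handled.
my $have_basename = eval { require File::Basename; 1 } ? 1 : 0;
my $program       = File::Basename::basename('/usr/bin/perl');

# A module that (presumably) does not exist: the error only appears here,
# at run time, and the script carries on.
my $have_missing = eval { require No::Such::Module::Hopefully; 1 } ? 1 : 0;

print "total: $total, program: $program\n";
print "File::Basename loaded: $have_basename, missing module loaded: $have_missing\n";
```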
Perl modules (4)
Module allows access to module specific functions (methods)
Some Modules have hundreds of functions
Functions are written as generically as possible to make them extensible
Perl modules (5)
DBI: database interface
• abstract database interface that makes database access as generic as possible
DBD::* : DBI database drivers (specific to database or interface, e.g., Oracle, Sybase, MySQL, WINODBC32)
• perform the database-specific calls and allow DBI to ‘hide’ them from the user
• interpret DBI generic calls to database in database-specific manner
Modules
Insufficient time to delve into these very important bioinformatic modules
DBI
• http://search.cpan.org/~timb/DBI-1.51/DBI.pm
BioPerl
• http://search.cpan.org/~birney/bioperl-1.4/Bio/Perl.pm
• http://www.bioperl.org/wiki/Main_Page
• http://www.bioperl.org/wiki/Bptutorial.pl
• http://doc.bioperl.org/releases/bioperl-1.4
Homework Problem
You have performed a large-scale SNP genotyping project. The data are provided to you in a tabular list in the following format:
• Some header lines
– includes blank lines
– column descriptions
• columns
– Gene ID
– Polymorphism ID
– Fragment (no data [-])
– Subject ID
– Allele 1
– Allele 2
Homework Problem (2)
You have to write a script to transform the data into a wide table that has
• Individual ID as rows
• Polymorphisms as columns
• Genotype data as a string “Allele1/Allele2”
• Polymorphisms must be grouped by gene
• Genes must be in order (left to right)
Notes
• There will be about 5,300 individuals, 200 genes and a total of about 1,300 polymorphisms
• the solution is to use hashes and nested hashes
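As a starting point only, the nested-hash idea can be sketched on three invented rows. Everything specific here is an assumption: the rows are made up, the columns are assumed to be tab-delimited in the order listed above, header lines are ignored, and genes are ordered by a plain string sort of their IDs.

```perl
use strict;
use warnings;

# Hypothetical rows: gene, polymorphism, fragment, subject, allele 1, allele 2
my @rows = (
    "G1\tSNP1\t-\tS1\tA\tG",
    "G1\tSNP1\t-\tS2\tA\tA",
    "G2\tSNP2\t-\tS1\tC\tT",
);

my %genotype;       # $genotype{subject}{polymorphism} = "Allele1/Allele2"
my %snps_by_gene;   # $snps_by_gene{gene}{polymorphism} = 1

foreach my $row (@rows) {
    my ($gene, $snp, $fragment, $subject, $a1, $a2) = split /\t/, $row;
    $genotype{$subject}{$snp}  = "$a1/$a2";
    $snps_by_gene{$gene}{$snp} = 1;
}

# Column order: genes sorted by ID, polymorphisms grouped within each gene.
my @columns;
foreach my $gene (sort keys %snps_by_gene) {
    push @columns, sort keys %{ $snps_by_gene{$gene} };
}

# One line per subject; '-' marks a genotype that was not recorded.
my @table = ( join("\t", 'Subject', @columns) );
foreach my $subject (sort keys %genotype) {
    my @cells;
    foreach my $snp (@columns) {
        push @cells, exists $genotype{$subject}{$snp} ? $genotype{$subject}{$snp} : '-';
    }
    push @table, join("\t", $subject, @cells);
}

print "$_\n" for @table;
```

The real data set will need header handling and a decision about how genes should be ordered; both are left open here.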