An Introduction to Perl

MBG8680 2006

Gerard Tromp

References

Books:
  • Wall L, Christiansen T, Orwant J. Programming Perl. Sebastopol, CA: O'Reilly, 2000:1-1070.
  • Cozens S. Advanced Perl Programming. Sebastopol, CA: O'Reilly, 2005:1-281.
  • Christiansen T, Torkington N. Perl Cookbook. Sebastopol, CA: O'Reilly, 1998:1-757.
Perl manual pages
Web (not exhaustive; try Google: learning perl):
  • http://www.oreilly.com/
  • http://www.perl.com/ (maintained by O'Reilly)
  • http://www.cpan.org (Comprehensive Perl Archive Network)
  • http://learn.perl.org/

What is Perl?

Scripting language
  • Interpreted at run-time
Developed as improved awk/nawk
  • Data/text extraction tool on UNIX
  • awk: Aho, Weinberger and Kernighan (Bell Laboratories)
    – A. Aho, B. Kernighan, and P. Weinberger. AWK -- A pattern scanning and processing language. Software Practice and Experience, 9(4):267--280, 1979
Extremely powerful pattern matching capabilities (regular expression engine)

What is Perl? (2)

Extensible
  • Modules and Packages (CPAN: www.cpan.org)
General programming language
Can be used for:
  • system calls (date, time, sockets, network)
  • file IO
Complex programming tasks
  • Genome builds are performed with Perl

What is Perl? The official description.

Perl is a general-purpose programming language originally developed for text manipulation and now used for a wide range of tasks including system administration, web development, network programming, GUI development, and more.

The language is intended to be practical (easy to use, efficient, complete) rather than beautiful (tiny, elegant, minimal). Its major features are that it's easy to use, supports both procedural and object-oriented (OO) programming, has powerful built-in support for text processing, and has one of the world's most impressive collections of third-party modules.

Some important concepts

Perl uses punctuation and some characters to distinguish specific meaning (as do most computer languages). Train your eyes to note the difference between:
  • ( ), [ ], { } – important delimiters
  • $, @, % – important data types
    – variables

Basics – variables

Variable syntax: variables contain data
Types
  • scalar       $    $foo    simple value, e.g., string, number
  • array        @    @foo    list of values
  • hash         %    %foo    paired lists of keys – values
  • subroutine   &    &foo    block (chunk) of code that can be called
  • typeglob     *    *foo    all things called foo
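
The types in the table can be tried out in a few lines. This is a minimal sketch, not part of the lecture: the names and values are illustrative, and the typeglob is left out because it is rarely needed in everyday scripts.

```perl
use strict;
use warnings;

# Scalar: a single value (string or number).
my $foo = "chr20";

# Array: an ordered list of values, indexed from 0.
my @foo = ("COL3A1", "ELN", "FBN1");

# Hash: unordered pairs of keys and values (illustrative numbers).
my %foo = (COL3A1 => 1281, ELN => 2006);

# Subroutine: a named block of code.
sub foo { return "called foo" }

# $foo, @foo, %foo and &foo are four different things that share a name.
# A single element is a scalar, so it is written with $.
my $second  = $foo[1];         # second element of @foo
my $gene_id = $foo{COL3A1};    # value stored under the key COL3A1 in %foo
my $result  = &foo();          # call the subroutine

print "$foo $second $gene_id $result\n";
```

The brackets decide which variable is meant: [ ] indexes the array, { } looks up a hash key.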

Basics – functions (procedures)

Function syntax:
Perl does not distinguish between functions, procedures and subroutines (other languages do)
Function syntax is defined in the manual pages
  • perldoc -f “function name”
  • see perldoc perlfunc
Some functions take no arguments, others take variable/optional arguments, e.g.,
  • print FILEHANDLE LIST   (LIST is a list of variables)
  • print LIST
  • print
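
The three forms of print can be sketched as follows. The in-memory filehandle is an assumption made only so that the output can be inspected; any open filehandle (a file, STDOUT, STDERR) behaves the same way.

```perl
use strict;
use warnings;

# print FILEHANDLE LIST: no comma between the filehandle and the list.
my $buffer = "";
open(my $out, '>', \$buffer) or die "cannot open in-memory file: $!";
print $out "gene", "\t", "id", "\n";
close($out);

# print LIST: writes to the currently selected filehandle (STDOUT by default).
my @list = ("gene", "\t", "id", "\n");
print @list;

# print with no arguments: prints the default variable $_.
foreach ("first line\n", "second line\n") {
    print;
}
```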

Basics – operators

Operators “do things”:
Mathematical
  • addition          +     $foo + $bar
  • multiplication    *     $foo * $bar
  • division          /     $foo / $bar
  • subtraction       -     $foo - $bar
  • modulus           %     $foo % $bar
  • exponentiation    **    $foo ** $bar

Basics – operators (2)

assignment
  • simple    =    $a = 3; $a = "abc"
  • complex
    – mathematical (starting from $a = 3)
        *=     multiply       $a *= 3      ($a == 9)
        -=     subtract       $a -= 4      ($a == 5)
        +=     add            $a += 5      ($a == 10)
    – string (starting from $a = "abc")
        .=     concatenate    $a .= "d"    ($a eq "abcd")
        x=     repeat         $a x= 3
        ||=    conditional    $a ||= "a"
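
The sequence on the slide can be run step by step. This sketch uses $n and $s rather than $a, an illustrative choice that avoids the special sort variable $a when strict is on.

```perl
use strict;
use warnings;

# Mathematical assignment operators, starting from 3.
my $n = 3;
$n *= 3;            # 9
$n -= 4;            # 5
$n += 5;            # 10

# String assignment operators, starting from "abc".
my $s = "abc";
$s .= "d";          # "abcd"
$s x= 3;            # "abcdabcdabcd"

# ||= assigns only when the left-hand side is false.
my $empty = "";
$empty ||= "a";     # "" is false, so "a" is assigned

my $kept = "z";
$kept ||= "a";      # "z" is true, so it is kept

print "$n $s $empty $kept\n";
```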

Basics – operators (3)

Logical
  • and    &&, and    $a && $b
  • or     ||, or     $a || $b
  • not    !, not     ! $a
  • xor    xor        $a xor $b

Basics – operators (4)

Test
                               numeric    string
  • equality                   ==         eq
  • inequality                 !=         ne
  • less than                  <          lt
  • greater than               >          gt
  • less than or equal         <=         le
  • greater than or equal      >=         ge
  • comparison                 <=>        cmp
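
Choosing the wrong column of the table is a common source of bugs, because Perl converts between strings and numbers silently. A minimal sketch with illustrative values:

```perl
use strict;
use warnings;

# The same two values compared as numbers and as strings.
my ($x, $y) = ("10", "9");

my $numeric_less = ($x < $y)  ? 1 : 0;    # 0: 10 is not less than 9
my $string_less  = ($x lt $y) ? 1 : 0;    # 1: "1" sorts before "9"

my $num_equal = ("1.0" == "1") ? 1 : 0;   # 1: equal as numbers
my $str_equal = ("1.0" eq "1") ? 1 : 0;   # 0: different as strings

# <=> and cmp return -1, 0 or 1, which is what sort expects.
my @numeric = sort { $a <=> $b } (10, 9, 100);   # 9 10 100
my @string  = sort { $a cmp $b } (10, 9, 100);   # 10 100 9

print "@numeric\n@string\n";
```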

Basics – control

Flow control (execute till condition is met)
conditional
  • if         if( CONDITION ){ }

               if( CONDITION ){ }
               elsif( CONDITION ){ }
               else{ }
  • unless     unless( CONDITION ){ }
  • while      while( CONDITION ){ }
  • for        for( $a=1; $a<10; $a++ ){ }
  • foreach    foreach( LIST ){ }
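
Each construct above can be seen in a small script. This is a sketch under illustrative assumptions: the subroutine classify and all values are made up for the example.

```perl
use strict;
use warnings;

# if / elsif / else: note that else takes no condition.
sub classify {
    my ($n) = @_;
    if ( $n < 0 ) {
        return "negative";
    }
    elsif ( $n == 0 ) {
        return "zero";
    }
    else {
        return "positive";
    }
}

# unless: the block runs when the condition is false.
my $note = "";
unless ( classify(5) eq "negative" ) {
    $note = "not negative";
}

# for: C-style loop, here summing 1 to 9.
my $total = 0;
for ( my $i = 1; $i < 10; $i++ ) {
    $total += $i;
}

# while: repeats as long as the condition is true.
my $countdown = 3;
my @seen;
while ( $countdown > 0 ) {
    push @seen, $countdown;
    $countdown--;
}

# foreach: visits every element of a list.
my @labels;
foreach my $n ( -2, 0, 7 ) {
    push @labels, classify($n);
}

print "$note $total @seen @labels\n";
```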

Basics – control (2)

Flow control (execute till condition is met)
termination
  • next    next;
            next if ( CONDITION );
            skips the rest of the current iteration
  • last    last;
            last if ( CONDITION );
            terminates the loop
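
The difference between next and last shows up clearly in one loop. A minimal sketch with an illustrative range of numbers:

```perl
use strict;
use warnings;

my @even_up_to_ten;
foreach my $n ( 1 .. 100 ) {
    next if ( $n % 2 );      # odd number: skip to the next iteration
    last if ( $n > 10 );     # first even number above 10: leave the loop
    push @even_up_to_ten, $n;
}

print "@even_up_to_ten\n";
```

The loop never reaches 100: last ends it as soon as 12 is seen.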

Text Manipulation in Perl.

Text manipulation was the primary reason for developing Perl originally
The text manipulation “engine” in Perl is an extended Unix Regular Expression (REGEX)
History
  • Derived from “regular sets” (mathematical language theory)
  • Part of Unix editors ‘qed’ and ‘ed’ -> grep/egrep
  • Incorporated into sed, awk (nawk)
  • Extended in some current versions of Unix (Linux) to reflect the Perl extensions
  • Incorporated into Java regular expressions

Regular Expressions

Way to specify a set of strings without enumerating each possibility
Way to specify a pattern to match
Distinct syntax
Delimiters
  • /PATTERN/    traditional
  • ?PATTERN?    almost any other character (in Perl, with a leading m, e.g., m#PATTERN#)
Metacharacters
  • special interpretation to specific characters/character combinations

Regular Expression – Metacharacters (1)

  [ ]     list / character class       Match any character listed between brackets
  [^ ]    negated list                 Match any character except listed characters
  $       terminal anchor              Match end of string
  ^       proximal anchor              Match beginning of string
  .       single-character wildcard    Match any character (once), except newline by default
  *       multi-character wildcard     Match the preceding item zero or more times, as many as possible (greedy)
  |       alternation / or             Match pattern preceding or following
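
Each metacharacter in the table can be tested against a short word list. This is a sketch; the words are illustrative and grep is used only to collect the words that match.

```perl
use strict;
use warnings;

my @words = ("cat", "cot", "cut", "coat", "scat");

my @class   = grep { /c[ao]t/ }      @words;   # cat cot scat
my @negated = grep { /c[^ao]t/ }     @words;   # cut
my @start   = grep { /^c/ }          @words;   # cat cot cut coat
my @single  = grep { /^c.t$/ }       @words;   # cat cot cut
my @star    = grep { /^c.*t$/ }      @words;   # cat cot cut coat
my @either  = grep { /^(cat|cut)$/ } @words;   # cat cut

print "@class\n@negated\n@start\n@single\n@star\n@either\n";
```

Without the anchors ^ and $, a pattern may match anywhere inside the string, which is why "scat" matches /c[ao]t/.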

Regular Expression – Metacharacters (2)

Unix escape characters (metacharacters)
\ – backslash
  • “escapes” meaning of special (non-alphanumeric) character, e.g., $, %, ^
  • converts some alphabetical characters into special metacharacters
    – \n    newline
    – \r    carriage return
    – \t    tab
    – \f    form-feed
    – \a    alarm (BEL)
    – \0    ASCII NULL
    – \e    escape

Regular Expression – Metacharacters (3)

Perl extensions
  \s    [ \t\n\r\f]      whitespace
  \S    [^ \t\n\r\f]     not whitespace
  \w    [a-zA-Z_0-9]     word character
  \W    [^a-zA-Z_0-9]    not word character
  \d    [0-9]            digit
  \D    [^0-9]           non-digit
  \b                     true at word boundary
  \B                     true not at word boundary
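
The shorthand classes are what make short Perl patterns readable. A minimal sketch on a made-up input line:

```perl
use strict;
use warnings;

# Illustrative input: fields separated by a mix of spaces and a tab.
my $line = "GeneID: 1281\tCOL3A1  chr2";

my @fields = split /\s+/, $line;        # split on any run of whitespace
my ($id)   = $line =~ /(\d+)/;          # first run of digits
my @words  = $line =~ /(\w+)/g;         # every run of word characters

# \b matches only at the edge of a word, \B only inside one.
my $whole_word  = ("the cat sat" =~ /\bcat\b/) ? 1 : 0;   # 1
my $not_a_word  = ("concatenate" =~ /\bcat\b/) ? 1 : 0;   # 0
my $inside_word = ("concatenate" =~ /\Bcat\B/) ? 1 : 0;   # 1

print join("|", @fields), "\n$id\n@words\n";
```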

Regular Expression – Quantifiers

Quantifiers allow specification of how many times the previous character/pattern should be matched
Originally limited in Unix
  *             match 0 or more times
  {Min,Max}     match at least Min times and no more than Max times
  {Min,}        match at least Min times
  {,Max}        match no more than Max times (accepted by Perl only from version 5.34; otherwise write {0,Max})
  {Count}       match exactly Count times

Regular Expression – Quantifiers (2)

Perl extensions
  +      match at least once
  ?      match zero or 1 times
  *?     match 0 or more times, but as few times as possible (minimal)
  +?     match at least once, but as few times as possible
  ??     match 0 or 1 times, preferring 0
  {}?    minimal form of specific quantifiers
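
The difference between greedy and minimal quantifiers is easiest to see when a delimiter occurs more than once. A minimal sketch with made-up strings:

```perl
use strict;
use warnings;

my $html = "<b>one</b> and <b>two</b>";

# Greedy: .* takes as much as it can, so the match runs to the LAST </b>.
my ($greedy)  = $html =~ /<b>(.*)<\/b>/;     # one</b> and <b>two

# Minimal: .*? takes as little as it can, so the match stops at the FIRST </b>.
my ($minimal) = $html =~ /<b>(.*?)<\/b>/;    # one

# Counted quantifiers.
my $year_ok   = ("2006"  =~ /^\d{4}$/)         ? 1 : 0;   # 1: exactly four digits
my $too_long  = ("ACGT"  =~ /^[ACGT]{2,3}$/)   ? 1 : 0;   # 0: four letters, at most three allowed
my $optional  = ("color" =~ /^colou?r$/)       ? 1 : 0;   # 1: the u may be absent

print "$greedy\n$minimal\n";
```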

Regular Expression – Capturing

Capturing allows (a portion of) the pattern to be used elsewhere
Originally limited in Unix (awk/sed)
  \(PATTERN\)    escaped parentheses
Captured pattern(s) stored in buffers: \1, \2 … \9 (sed notation)
For input line:
    “This is a test”
the pattern:
    /\([Tt]his\).*\(t[es]*t\)/
yields two buffers:
    \1 == “This”; \2 == “test”

Regular Expression – Perl Capturing

Capturing allows (a portion of) the pattern to be used elsewhere
In Perl – do NOT escape parentheses
  (PATTERN)    parentheses
Captured pattern(s) stored in variables: $1, $2 … $n
For input line:
    “This is a test”
the pattern:
    /([Tt]his).*(t[es]*t)/
yields two variables:
    $1 == “This”; $2 == “test”
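
The example on the slide can be run as written. The sketch below shows two common ways of getting at the captured text; the variable names are illustrative.

```perl
use strict;
use warnings;

my $line = "This is a test";

# After a successful match, the captures are available as $1 and $2.
my ($first, $second);
if ( $line =~ /([Tt]his).*(t[es]*t)/ ) {
    ($first, $second) = ($1, $2);
}

# In list context the match returns the captures directly.
my @captured = $line =~ /([Tt]his).*(t[es]*t)/;

print "$first $second\n";
```

Testing the match in an if before using $1 and $2 matters: after a failed match they still hold the values from an earlier successful one.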

Regular Expression – Perl Capturing and Clustering

  (?#…)     comment – ignore
  (?:…)     cluster, but do not capture
  (?=…)     test to see if pattern matches ahead – look ahead
  (?!…)     look ahead to test if pattern does NOT match (negative look ahead)
  (?<=…)    look behind
  (?<!…)    negative look behind
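
Look-ahead and look-behind test the surrounding text without making it part of the match. A minimal sketch on made-up strings:

```perl
use strict;
use warnings;

my $text = "price: 100 USD, 200 EUR";

# (?=…) look ahead: numbers that are followed by " USD".
my @usd = $text =~ /(\d+)(?= USD)/g;             # 100

# (?!…) negative look ahead: whole numbers NOT followed by " USD".
# The \b stops the engine from settling for part of a number such as "10".
my @not_usd = $text =~ /(\d+)\b(?! USD)/g;       # 200

# (?<=…) look behind: numbers directly preceded by "price: ".
my @after_label = $text =~ /(?<=price: )(\d+)/g; # 100

# (?<!…) negative look behind: count "20" not preceded by "chr".
my $count = () = "chr20 gene20" =~ /(?<!chr)20/g;   # 1

# (?:…) groups without capturing; (?#…) is a comment inside the pattern.
my @parts = "2006-12-17" =~ /^(\d+)(?#year)(?:-\d+)-(\d+)$/;   # 2006 17

print "@usd | @not_usd | @after_label | $count | @parts\n";
```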

Perl quotes

Different quote characters have specific meaning and properties.
Interpolation, the expansion of variables, occurs for some quote types but not others

Perl quotes (2)

  Conventional    Generic    Interpretation           Interpolation
  ' '             q//        Literal string           No
  " "             qq//       Literal string           Yes
  ` `             qx//       Command execution        Yes
  ( )             qw//       Word list                No
  //              m//        Pattern match            Yes
  s///            s///       Pattern substitution     Yes
  y///            tr///      Character translation    No
  " "             qr//       Regular expression       Yes

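Most rows of the quote table can be checked in a few lines. This is a sketch with illustrative values; qx// is left out because it runs an external command, and its result depends on the system.

```perl
use strict;
use warnings;

my $gene = "COL3A1";

my $single = 'gene: $gene';     # no interpolation: $gene stays as typed
my $double = "gene: $gene";     # interpolation: $gene is replaced by its value
my $q      = q{gene: $gene};    # generic form of single quotes
my $qq     = qq{gene: $gene};   # generic form of double quotes

my @words = qw(abc def ghi);    # same as ('abc', 'def', 'ghi')

my $re      = qr/col\d+a\d+/i;  # compiled pattern, case-insensitive
my $matches = ($gene =~ $re) ? 1 : 0;

(my $lower   = $gene) =~ tr/A-Z/a-z/;        # character translation
(my $renamed = $gene) =~ s/COL/collagen-/;   # pattern substitution

print "$single\n$double\n@words\n$lower\n$renamed\n";
```
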
Variable assignment

    $x = "abc";
    @x = ( "abc", "def", "ghi", "klm" );
    %x = ( 1, "abc", 2, "def", 3, "ghi", 4, "klm" );

what does the following produce?

    print $x, "\n";
    print $x[3], "\n";
    print $x{2}, "\n";

Output:

    abc
    klm
    def

What happened and why?

A Simple Command-line Script Using an Array

Type the following on a line in the PuTTY window (shell window)

    perl -e '@x=(2,5,7,9,11); print "@x\n";'

    perl -e '@x=(2,5,7,9,11); print "$x[4]\n";'

    perl -e '@x=(2,5,7,9,11); foreach $x (@x) {print "$x\n"; }'

NOTE: command-line scripts are tricky since the entire script must be enclosed in single quotes

A Simple Command-line Script Using a Hash

Type the following on a line

    perl -e '%x=(2,5,7,9,11,15); print "%x\n";'

    perl -e '%x=(2,5,7,9,11,15); print "$x{5}\n";'

    perl -we '%x=(2,5,7,9,11,15); print "$x{5}\n";'

    perl -e '%x=(2,5,7,9,11,15); print "$x{7}\n";'

    perl -e '%x=(2,5,7,9,11,15); foreach $x (keys %x) {print "$x\t$x{$x}\n"; }'
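
The same hash can be examined in a script file, which avoids the shell quoting problem. A minimal sketch; the variable names other than %x are illustrative.

```perl
use strict;
use warnings;

# A flat list assigned to a hash is read as key, value, key, value, ...
my %x = (2, 5, 7, 9, 11, 15);          # keys 2, 7, 11; values 5, 9, 15

my $has_five = exists $x{5} ? 1 : 0;   # 0: 5 is a value, not a key
my $seven    = $x{7};                  # 9

# Hashes are not interpolated in double quotes; this is the literal text %x.
my $literal = "%x";

# keys returns the keys in no particular order, so sort them for display.
foreach my $key ( sort { $a <=> $b } keys %x ) {
    print "$key\t$x{$key}\n";
}
```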

A Simple (file) Program

A program to extract specific URL data from html generated by NCBI Map viewer “view as table”

    #! /usr/bin/perl -w

    while(<>){
        if ( /href=\"(http:.*?list_uids=)(\d+)\">([-\w\*]+?)</ ){
            print "$1$2\t$2\t$3\n";
            &mysub($1,$2,$3);
        }
    }

    sub mysub{
        # … do something;
    }

Dissection of a simple program

Examine program line by line

     1  #! /usr/bin/perl -w
     2
     3  while(<>){
     4      if ( /href=\"(http:.*?list_uids=)(\d+)\">([-\w\*]+?)</ ){
     5          print "$1$2\t$2\t$3\n";
     6          &mysub($1,$2,$3);
     7      }
     8  }
     9
    10  sub mysub{
    11      # … do something;
    12  }

Dissection of a simple program (2)

Invocation line

     1  #! /usr/bin/perl -w

This line ‘starts’ the Perl program
Syntax is derived from Unix shell script syntax
  • #! (pound-bang)
    – tells Unix shell that the next argument is the name, or path and name, of a program (executable)
  • /usr/bin/perl
    – tells Unix shell which executable (perl) to find in which path (directory location)
  • -w
    – “flag(s)” passed to executable (program)
    – tell program to “do things” or adopt specific behavior
    – here: turn on perl warnings

Dissection of a simple program (3)

Control loop and input operator

     3  while(<>){
        # elided lines 4 – 7
     8  }

while ( CONDITION ) BLOCK
  • execute loop until condition becomes false
  • here CONDITION is <> , an input operator
    – reads from the files named on the command line or, if none are named, from STDIN, a C filehandle accessible to every program
    – reads until the end-of-file, i.e., until no further data
  • BLOCK is a block (chunk) of code

Dissection of a simple program (4)

IF statement – if ( CONDITION ) BLOCK

     4  if ( /href=\"(http:.*?list_uids=)(\d+)\">([-\w\*]+?)</ ){
     5      print "$1$2\t$2\t$3\n";
     6      &mysub($1,$2,$3);
     7  }

if ( /PATTERN/ ) BLOCK
  • if PATTERN matches execute the BLOCK

href=\"(http:.*?list_uids=)(\d+)\">([-\w\*]+?)<
  • what are literals?
  • what are character classes?
  • what does the pattern match?

Dissection of a simple program (5)

IF statement – if ( CONDITION ) BLOCK

     4  if ( /href=\"(http:.*?list_uids=)(\d+)\">([-\w\*]+?)</ ){
     5      print "$1$2\t$2\t$3\n";
     6      &mysub($1,$2,$3);
     7  }

print "$1$2\t$2\t$3\n";
  • what does the line do?
  • what are $1, $2 and $3? where are they in the pattern?

&mysub($1,$2,$3);
  • what is &mysub?
  • what are $1, $2 and $3 with respect to mysub?
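
One way to explore these questions is to run the pattern against a single line. The sketch below assumes a made-up anchor tag: the host www.example.org, the query string and the gene symbol are hypothetical and stand in for a real line of Map Viewer output, whose exact layout is not shown in the lecture.

```perl
use strict;
use warnings;

# Hypothetical input line; real NCBI output will differ in detail.
my $line = '<a href="http://www.example.org/query.fcgi?db=gene&list_uids=1281">COL3A1</a>';

my ($url_stem, $gene_id, $symbol);
if ( $line =~ /href=\"(http:.*?list_uids=)(\d+)\">([-\w\*]+?)</ ) {
    $url_stem = $1;    # everything from http: up to and including list_uids=
    $gene_id  = $2;    # the run of digits that follows
    $symbol   = $3;    # the link text, up to the next <
    print "$1$2\t$2\t$3\n";
}
```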

Dissection of a simple program (6)

Subroutine

    10  sub mysub{
    11      # … do something;
    12  }

sub mysub BLOCK
  • subroutine declaration and code BLOCK
  • BLOCK consists of { code }
  • everything after # is a comment
  • this is a null subroutine – does nothing
  • perl does not require declaration of parameters
  • all perl parameters are made available to subroutine as an array – @_
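
A subroutine usually begins by copying @_ into named variables. The body below is a hypothetical stand-in for the elided “do something” in the lecture's mysub, and the argument values are made up.

```perl
use strict;
use warnings;

sub mysub {
    # The arguments arrive in @_ in the order in which they were passed.
    my ($url, $gene_id, $symbol) = @_;
    my $count = scalar(@_);      # number of arguments received
    return "$count arguments: $symbol has GeneID $gene_id";
}

my $report = &mysub("http://www.example.org/", 1281, "COL3A1");
print "$report\n";
```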

Getting Help

  perldoc              very important program
  perldoc perldoc      how to use perldoc
  perldoc perl         list of available topics*
  perldoc perlintro    useful material like this lecture
  perldoc perlfaq      common questions answered
  perldoc 'topic'      information on specified topic from list above

* perldoc will extract documentation embedded in packages. The list returned by ‘perldoc perl’ is for the base perl installation

Getting Help (2)

Books – see references
Web – see references
Unix ‘man’ command.
  • Although perldoc will return help/information for most perl-related items, there are still a few that only have ‘man pages’

Hands-on Problems

Write a Perl script that will do the following.
1: for chromosome 20, create a tab-delimited list of:
  • gene names
  • gene ids (GeneID number)
  • chromosomal location (beginning, end)
  • orientation
2: extend the columns to include (where appropriate):
  • HUGO HGNC ID
  • OMIM ID

Improvements on the script(s)

Wouldn’t it be great if you could skip the browsing part and go straight to the web page in Perl?
Look at the LWP module
  • http://search.cpan.org/dist/libwww-perl/lib/LWP.pm
What is a ‘module’?

Perl modules

A module is a collection of scripts (code) that have already been written for you
Strictly speaking, a module is a collection of one or more packages
A package is a small collection of code

    package NAME;
    BLOCK1;

Perl modules (2)

Why packages?
  • allows namespace to be uncluttered
  • keeps related code in one place
  • allows reusability of code
Modules?
  • can think of as extended packages
  • can be procedural (traditional) or object-oriented

Perl modules (3)

modules must be installed from source (CPAN)
module included in script by:
  • use MODULE;
    – executes the module at compile time
    – complains immediately if not found
  • require MODULE;
    – executes the module at run time
    – only complains later
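
The difference between use and require can be sketched with two modules that ship with Perl, List::Util and File::Basename. The name Hypothetical::Missing::Module is made up to show what happens when a module is not installed.

```perl
use strict;
use warnings;

# use: loaded at compile time, and the listed functions are imported.
use List::Util qw(sum max);

my $total   = sum(2, 5, 7);     # 14
my $largest = max(2, 5, 7);     # 7

# require: loaded when this line is reached at run time. Nothing is
# imported, so the function is called by its full name.
require File::Basename;
my $name = File::Basename::basename("/usr/bin/perl");    # perl

# Because require fails at run time, the failure can be trapped with eval.
my $loaded = eval { require Hypothetical::Missing::Module; 1 } ? 1 : 0;

print "$total $largest $name $loaded\n";
```

A use of the missing module would stop the script before any line had run, which is the “complains immediately” behavior described above.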

Perl modules (4)

Module allows access to module specific functions (methods)

Some Modules have hundreds of functions

Functions are written as generically as possible to make them extensible

Perl modules (5)

DBI – database interface
  • abstract database interface that makes database access as generic as possible
DBD::* – DBI database driver (specific to database or interface, e.g., Oracle, Sybase, MySQL, WINODBC32)
  • performs the database-specific calls and allows DBI to ‘hide’ them from the user
  • interprets DBI generic calls to database in database-specific manner

Modules

Insufficient time to delve into these very important bioinformatic modules
DBI
  • http://search.cpan.org/~timb/DBI-1.51/DBI.pm
BioPerl
  • http://search.cpan.org/~birney/bioperl-1.4/Bio/Perl.pm
  • http://www.bioperl.org/wiki/Main_Page
  • http://www.bioperl.org/wiki/Bptutorial.pl
  • http://doc.bioperl.org/releases/bioperl-1.4

Homework Problem

You have performed a large-scale SNP genotyping project.
The data are provided to you in a tabular list in the following format:
  • Some header lines
    – includes blank lines
    – column descriptions
  • columns
    – Gene ID
    – Polymorphism ID
    – Fragment (no data [-])
    – Subject ID
    – Allele 1
    – Allele 2

Homework Problem (2)

You have to write a script to transform the data into a wide table that has
  • Individual ID as rows
  • Polymorphisms as columns
  • Genotype data as a string “Allele1/Allele2”
  • Polymorphisms must be grouped by gene
  • Genes must be in order (left to right)
Notes
  • There will be about 5,300 individuals, 200 genes and a total of about 1,300 polymorphisms
  • the solution is to use hashes and nested hashes
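
The nested-hash idea can be sketched on a handful of records. Everything here is an assumption made for illustration: the four tab-delimited records, the identifiers, alphabetical order for genes, and "-" for a missing genotype. Header handling and file input are left out.

```perl
use strict;
use warnings;

# Hypothetical records: gene, polymorphism, fragment, subject, allele 1, allele 2.
my @records = (
    "GENE1\tSNP_A\t-\tS001\tA\tG",
    "GENE1\tSNP_B\t-\tS001\tC\tC",
    "GENE1\tSNP_A\t-\tS002\tG\tG",
    "GENE2\tSNP_C\t-\tS002\tT\tC",
);

my %genotype;        # $genotype{subject}{polymorphism} = "Allele1/Allele2"
my %snps_in_gene;    # $snps_in_gene{gene}{polymorphism} = 1

foreach my $record (@records) {
    my ($gene, $snp, $fragment, $subject, $a1, $a2) = split /\t/, $record;
    $genotype{$subject}{$snp}  = "$a1/$a2";
    $snps_in_gene{$gene}{$snp} = 1;
}

# Column order: genes in sorted order, polymorphisms grouped within each gene.
my @columns;
foreach my $gene ( sort keys %snps_in_gene ) {
    push @columns, sort keys %{ $snps_in_gene{$gene} };
}

# One row per subject; "-" marks a polymorphism with no genotype.
my @rows;
foreach my $subject ( sort keys %genotype ) {
    my @cells = map { defined $genotype{$subject}{$_} ? $genotype{$subject}{$_} : "-" } @columns;
    push @rows, join( "\t", $subject, @cells );
}

print join( "\t", "Subject", @columns ), "\n";
print "$_\n" foreach @rows;
```

The first level of each hash is the outer grouping (subject or gene) and the second level is the polymorphism, so a cell of the wide table is a single lookup.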