Perl for Bioinformatics

Upload: amitha-sampath

Post on 07-Apr-2018


  • 8/4/2019 Perl for Bio in for Ma Tics

    1/158

    Programming for Computational Biology

    Ian Holmes, Department of Bioengineering

    University of California, Berkeley

    Programming languages

    Self-contained language

    Platform-independent

    Used to write O/S

    C (imperative, procedural)

    C++, Java (object-oriented)

    Lisp, Haskell (functional); Prolog (logic)

    Scripting language

    Closely tied to O/S

    Perl, Python, Ruby

    Domain-specific language

    R (statistics)

    MATLAB (numerics)

    SQL (databases)

    An O/S typically manages

    Devices (see above)

    Files & directories

    Users & permissions

    Processes & signals

    Bioinformatics pipelines often involve chaining together multiple tools

    Perl is the most-used bioinformatics language

    Most popular bioinformatics programming languages

    Bioinformatics career survey, 2008

    Michael Barton

    Pros and Cons of Perl

    Reasons for Perl's popularity in bioinformatics (Lincoln Stein)

    Perl is remarkably good for slicing, dicing, twisting, wringing, smoothing,

    summarizing and otherwise mangling text

    Perl is forgiving

    Perl is component-oriented

    Perl is easy to write and fast to develop in

    Perl is a good prototyping language

    Perl is a good language for Web CGI scripting

    Problems with Perl

    Hard to read ("there's more than one way to do it", cryptic syntax)

    Too forgiving (no strong typing, allows sloppy code)

    Perl overview

    Interpreted, not compiled

    Fast edit-run-revise cycle

    Procedural & imperative

    Sequence of instructions (control flow)

    Variables, subroutines

    Syntax close to C (the de facto standard minimal language)

    Weakly typed (unlike C)

    Redundant, not minimal ("there's more than one way to do it")

    Syntactic sugar

    High-level data structures & algorithms

    Hashes, arrays

    Operating System support (files, processes, signals)

    String manipulation

    Goals of this course

    Concepts of computer programming

    Rudimentary Perl (widely-used language)

    "How Perl saved the Human Genome Project" (Lincoln Stein)

    Introduction to Bioinformatics file formats

    Practical data-handling algorithms

    Exposure to Bioinformatics software

    Structural elements

    Learning Perl, Schwartz et al., O'Reilly, ISBN 0-596-10105-8

    "There's more than one way to do it"

    Q: But which is best? A: TESTS

    Tests (above) supersede texts (below):

    [Figure: slide layout legend. The main program; the program output; files are shown in yellow; filename; standard output stream; terminal input; description of test conditions; terminal session]

    General principles of programming

    Make incremental changes

    Test everything you do

    the edit-run-revise cycle

    Write so that others can read it

    (when possible, write with others)

    Think before you write

    Use a good text editor

    Good debugging style

    Perl for Bioinformatics

    Section 1: Scalars and Loops

    Ian Holmes, Department of Bioengineering

    University of California, Berkeley

    Perl basics

    Basic syntax of a Perl program:

    # Elementary Perl program

    print "Hello World\n";

    "\n" means new line

    The print statement tells Perl to print what follows to the screen

    Single or double quotes enclose a "string literal" (double-quoted strings are "interpolated")

    All statements end with a semicolon

    Lines beginning with "#" are comments, and are ignored by Perl

    Hello World

    Variables

    We can tell Perl to "remember" a particular value, using the assignment operator =:

    The $x is referred to as a "scalar variable".

    Variable names can contain alphabetic characters, numbers (but not at the start of the name), and underscore symbols "_"

    Scalar variable names are all prefixed with the dollar symbol.

    $x= 3;

    print $x;

    3

    $x= "ACGCGT";

    print $x;

    ACGCGT

    Binding site for yeast transcription factor MCB

    Arithmetic operations

    Basic operators are + - / * %

    Can also use += -= /= *= ++ --

    $x= 14;

    $y = 3;

    print "Sum: ", $x+$y, "\n";

    print "Product: ", $x * $y, "\n";

    print "Remainder: ", $x % $y, "\n";

    Sum: 17

    Product: 42

    Remainder: 2

    $x = 5;

    print "x started as $x\n";

    $x = $x * 2;

    print "Then x was $x\n";

    $x=$x+ 1;

    print "Finally x was $x\n";

    x started as 5

    Then x was 10

    Finally x was 11

    Could write $x *= 2;

    Could write $x += 1;

    or even ++$x;

    String operations

    Concatenation: . and .=

    Can find the length of a string using the function length($x)

    $a = "pan";

    $b= "cake";

    $a =$a .$b;

    print $a;

    pancake

    $a = "soap";

    $b= "dish";

    $a .=$b;

    print $a;

    soapdish

    $mcb = "ACGCGT";

    print "Length of $mcb is ", length($mcb);

    Length of ACGCGT is 6

    More string operations

    $x= "A simple sentence";

    print $x, "\n";

    print uc($x), "\n";

    print lc($x), "\n";

    $y = reverse($x);

    print $y, "\n";

    $x=~ tr/i/a/;

    print $x, "\n";

    print length($x), "\n";

    A simple sentence

    A SIMPLE SENTENCE

    a simple sentence

    ecnetnes elpmis A

    A sample sentence

    17

    Convert to upper case

    Convert to lower case

    Reverse the string

    Transliterate "i"'s into "a"'s

    Calculate the length of the string

    Concatenating DNA fragments

    $dna1 = "accacgt";

    $dna2 = "taggtct";

    print $dna1 . $dna2;

    accacgttaggtct

    "Transcribing" DNA to RNA

    $dna = "accACgttAGGTct";

    $rna = lc($dna);

    $rna =~ tr/t/u/;

    print $rna;

    accacguuaggucu

    DNA string is a mixture of upper & lower case

    Make it all lower case

    Transliterate "t" to "u"

    Comparison: variables in C are typed

    C does not have a basic type for strings, only individual characters.

    Strings are built up from more basic elements as arrays of characters (we'll get to arrays later).

    Much of this functionality is provided in C and C++ as part of the standard library.

    Conditional blocks

    The ability to execute an action contingent on some condition is what distinguishes a computer from a calculator. In Perl, this looks like this:

    if (condition) { action } else { alternative }

    $x= 149;

    $y = 100;

    if ($x > $y)

    {

    print "$x is greater than $y\n";

    } else

    {

    print "$x is not greater than $y\n";

    }

    149 is greater than 100

    These braces { } tell Perl which piece of code is contingent on the condition.

    Conditional operators

    Numeric: > >= < <= == !=

    String: gt ge lt le eq ne
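
Perl keeps numeric and string comparisons separate, and mixing them up is a common source of bugs. The sketch below (subroutine names are illustrative, not from the slides) compares the same pair of values both ways:

```perl
use strict;
use warnings;

# Numeric comparison: both operands are treated as numbers.
sub numerically_less {
    my ($p, $q) = @_;
    return $p < $q ? 1 : 0;
}

# String comparison: operands are compared character by character.
sub stringwise_less {
    my ($p, $q) = @_;
    return $p lt $q ? 1 : 0;
}

print "10 <  9 gives ", numerically_less(10, 9), "\n";  # 0: ten is not less than nine
print "10 lt 9 gives ", stringwise_less(10, 9), "\n";   # 1: the character "1" sorts before "9"
print "Sequences are identical\n" if "ACGCGT" eq "ACGCGT";
```

For sequence data, the string operators (eq, ne and friends) are usually the ones wanted.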

    Logical operators

    Logical operators: && means "and", || means "or"

    An exclamation mark ! is used to negate what follows. Thus !($x < $y) means the same as ($x >= $y)

    In computers, the value zero is often used to represent falsehood, while any non-zero value (e.g. 1) represents truth. Thus:

    if (1) { print "1 is true\n"; }

    if (0) { print "0 is true\n"; }

    if (-99) { print "-99 is true\n"; }

    1 is true

    -99 is true

    $x= 222;

    if ($x % 2 == 0 and $x % 3 == 0)

    { print "$x is an even multiple of 3\n"; }

    222 is an even multiple of 3

    Loops

    Here's how to print out the numbers 1 to 10:

    This is a while loop. The code is executed while the condition is true.

    $x= 1;

    while ($x
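
A minimal sketch of a while loop that prints the numbers 1 to 10, as described above. The variable $seen is an illustrative addition that records what was printed:

```perl
use strict;
use warnings;

my $seen = "";          # illustrative: remembers each number printed
my $x = 1;
while ($x <= 10) {      # keep looping while the condition is true
    print "$x\n";
    $seen .= "$x ";
    ++$x;               # without this line the loop would never end
}
```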

    A common kind of loop

    Let's dissect the code of the while loop again:

    This form of while loop is common enough to have its own shorthand: the for loop.

    $x= 1;

    while ($x
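
Assuming the loop being dissected is the count from 1 to 10, the for shorthand gathers the initialisation, the test and the increment onto one line. A sketch (the running total is an illustrative addition):

```perl
use strict;
use warnings;

my $total = 0;                       # illustrative: running sum of the numbers printed
for (my $x = 1; $x <= 10; ++$x) {    # initialise; test; increment
    print "$x\n";
    $total += $x;
}
print "Sum of 1..10 is $total\n";
```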

    Loops in C++ are similar to Perl

    cout is the standard output stream, part of the standard library. Used in C++ only (C has a complicated printf command)

    defined and undef

    The function defined($x) is true if $x has been assigned a value:

    A variable that has not yet been assigned a value has the special value undef

    Often, if you try to do something "illegal" (like reading from a nonexistent file), you end up with undef as a result

    if (defined($newvar)) {

    print "newvar is defined\n";

    } else {

    print "newvar is not defined\n";

    }

    newvar is not defined

    C does not have defined or undef. At best, using an uninitialized value will cause a compiler warning or error; at worst, it will lead to undefined behavior (i.e. disaster)

    Reading a line of data

    To read from a file, we first need to open the file and give it a filehandle.

    open FILE, "sequence.txt";

    This code snippet opens a file called "sequence.txt", and associates it with a filehandle called FILE

    Once the file is opened, we can read a single line from it into the scalar $x:

    $x = <FILE>;

    This reads the next line from the file, including the newline at the end, "\n".

    If the end of the file is reached, $x is assigned the special value undef
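
The open call above does not check whether the file could actually be opened. A common defensive pattern, sketched here with a lexical filehandle and the three-argument form of open (the subroutine name first_line is illustrative), stops with an error message if the open fails:

```perl
use strict;
use warnings;

# Return the first line of the named file (including its "\n"),
# or undef if the file is empty.
sub first_line {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!\n";
    my $line = <$fh>;
    close $fh;
    return $line;
}

# Only runs when a filename is given on the command line.
if (@ARGV) {
    my $line = first_line($ARGV[0]);
    print defined $line ? $line : "(file is empty)\n";
}
```

The special variable $! holds the operating system's reason for the failure, such as a missing file or insufficient permissions.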

    Reading an entire file

    The following piece of code reads every line in a file and prints it out to the screen:

    open FILE, "sequence.txt";
    while (defined($x = <FILE>)) {
        print $x;
    }
    close FILE;

    This reads a line of data into $x, then checks if $x is defined. If $x is undef, then the file must have ended.

    A shorter version of this is as follows:

    open FILE, "sequence.txt";
    while ($x = <FILE>) {
        print $x;
    }
    close FILE;

    Here while ($x = <FILE>) is equivalent to while (defined($x = <FILE>))

    The default variable, $_

    Many operations that take a scalar argument, such as length($x), are assumed to work on $_ if the $x is omitted:

    So we can also read a whole file like this:

    $_= "Hello";

    print;

    print length;

    Hello5

    open FILE, "sequence.txt";

    while (<FILE>) {

    print;

    }

    close FILE;

    This line is equivalent to while (defined($_ = <FILE>)) {

    Files in C++ are streams

    Debugging

    Most programs don't work first time

    Most apparently "working" programs actually aren't

    Bugs are cryptic

    Debugging is a scientific process

    As you gain experience, you will begin to "insure" against bugs with your programming technique

    Mars Climate Orbiter

    Mars Climate Orbiter was the third spacecraft to be launched under the Mars Surveyor program to map & explore Mars

    Around 2am PDT on September 23, 1999, the spacecraft disappeared behind Mars following a manoeuvre that should have put it into Mars orbit

    This failure, along with a subsequent (unexplained) craft loss, cost NASA $327.6 million

    What was the problem?

    Following a certain kind of engine burn, designed to stabilise the craft's angular momentum, the Orbiter sent data to the ground station, so that its trajectory could be recalibrated (by a software module called SM_FORCE)

    The Orbiter also internally recomputed its trajectory following a burn

    The Orbiter's internal software module used metric units (newton-seconds) while the ground station's SM_FORCE module used Imperial (pound-seconds). The specification called for metric units

    The manoeuvre executed on September 23rd was therefore computed using the wrong trajectory, taking the Orbiter too low into Mars' atmosphere

    Why was the bug not detected?

    The spacecraft periodically transmitted its computed trajectory to the ground station. A quick comparison between the two trajectories would have revealed the error. However,

    Other bugs in the SM_FORCE module prevented its use until 4 months into the flight

    The ground crew weren't aware that trajectory data from the spacecraft were available

    Discrepancies were noticed, but were only reported informally by email, and not taken seriously enough

    i.e. incomplete testing; ignoring unexpected results; institutional complacency.

    Debugging is scientific

    Finding bugs can be very frustrating

    A job that you thought was nearly finished, for which you have budgeted a certain amount of time, stretches out indefinitely

    Often you may have no idea what's wrong

    If you think of debugging as a scientific problem and approach it systematically, much of the pain disappears

    The Process of Debugging

    Step 1: Identify the Problem

    observe it (e.g. because a test fails)

    reproduce it (so you can make it happen 100% of the time)

    isolate it (strip it down to its bare essentials)

    The Process of Debugging

    Step 2: Gather Information

    record all symptoms (disparate symptoms may be related; if not, you should tackle them systematically one by one)

    follow the flow of control of the program (many ways of doing this: e.g. you can use a "debugger" to watch the variables; the time-honored method, and definitely the best, is to insert debugging print statements into your code)

    note recent changes (usually the cause of bugs)

    look for similar problems (can ask other developers)

    check "machine environment" (e.g. if you move to a different computer, does it have less memory? less disk space?)

    The Process of Debugging

    Step 3: Form a Hypothesis

    try to isolate the code that causes the problem

    e.g. strip away all "working" code that is not essential to reproducing the bug

    if you can't find the bug, use a systematic "deletion" strategy (cf. genetics!) until you have narrowed down the problem

    what should that code be doing?

    this can be seen as a continuation of Step 1 ("identify the problem")

    debugging is a cyclic, interactive process

    The Process of Debugging

    Step 4: Test Your Hypothesis

    do not skip this step!

    often the hypothesis will come to you in a flash of inspiration, but you still need to test it

    for simple bugs, testing just means fixing the problem

    for more complex bugs, you'll need to proceed to the next steps...

    The Process of Debugging

    Step 5: Propose a Solution

    keep it minimal: try not to redesign all the code unless this is absolutely necessary

    then again, do not flinch from redesign if this is what is called for

    Step 6: Test the Solution

    also make sure you didn't break existing code

    Process of Debugging: Summary

    Step 1: Identify the Problem

    Step 2: Gather Information

    Step 3: Form a Hypothesis

    Step 4: Test Your Hypothesis

    Step 5: Propose a Solution

    Step 6: Test the Solution

    Proactive debugging

    Place consistency checks in your code

    also called assertions

    Put comments in your code

    this saves time when debugging

    Comment known (and fixed) bugs

    keep a record of what you've fixed

    Put log messages into your code

    you can make these optional (e.g. comment them out); having them there can save lots of time
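
As a rough illustration of consistency checks and optional log messages, the sketch below computes the GC fraction of a sequence. The subroutine names and the DEBUG environment variable are assumptions made for this example, not part of the course material:

```perl
use strict;
use warnings;

# Hypothetical switch: run with DEBUG=1 in the environment to see log messages.
my $DEBUG = $ENV{DEBUG} ? 1 : 0;

sub debug_log {
    my ($msg) = @_;
    print STDERR "DEBUG: $msg\n" if $DEBUG;
}

# Fraction of bases in a DNA string that are G or C.
sub gc_content {
    my ($dna) = @_;

    # Consistency checks (assertions): stop early on input we do not expect.
    die "gc_content: empty sequence\n" unless defined $dna && length($dna) > 0;
    die "gc_content: unexpected character in '$dna'\n" if $dna =~ /[^ACGTacgt]/;

    my $gc = ($dna =~ tr/GCgc//);   # tr with an empty replacement just counts
    debug_log("length " . length($dna) . ", GC count $gc");
    return $gc / length($dna);
}

print gc_content("ACGCGT"), "\n";
```

Making the log messages depend on a switch means they can stay in the code permanently, rather than being commented in and out.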

    Perl for Bioinformatics

    Section 2: Sequences and Arrays

    Summary: scalars and loops

    Assignment operator

    Arithmetic operations

    String operations

    Conditional tests

    Logical operators

    Loops

    defined and undef

    Reading a file

    $x= 5;

    $y =$x * 3;

    if ($y > 10) { print $s; }

    $s = "Value of y is " .$y;

    if ($y>10 && $s eq "") { exit; }

    for ($x=1; $x

    Pattern-matching

    A very sophisticated kind of logical test is to ask whether a string contains a pattern

    e.g. does a yeast promoter sequence contain the MCB binding site, ACGCGT?

    $name = "YBR007C";

    $dna="TAATAAAAAACGCGTTGTCG";

    if ($dna =~ /ACGCGT/)

    {

    print "$name has MCB!\n";

    }

    20 bases upstream of the yeast gene YBR007C

    The pattern binding operator =~

    The pattern for the MCB binding site

    YBR007C has MCB!

    FASTA format

    A format for storing multiple named sequences in a single file

    This file contains 3' UTRs for Drosophila genes CG11604, CG11455 and CG11488

    >CG11604

    TAGTTATAGCGTGAGTTAGT

    TGTAAAGGAACGTGAAAGAT

    AAATACATTTTCAATACC

    >CG11455

    TAGACGGAGACCCGTTTTTC

    TTGGTTAGTTTCACATTGTA

    AAACTGCAAATTGTGTAAAA

    ATAAAATGAGAAACAATTCT

    GGT

    >CG11488

    TAGAAGTCAAAAAAGTCAAG

    TTTGTTATATAACAAGAAAT

    CAAAAATTATATAATTGTTT

    TTCACTCT

    Name of sequence is preceded by > symbol

    NB sequences can span multiple lines

    Call this file fly3utr.txt

    Printing all sequence names in a FASTA database

    open FILE, "fly3utr.txt";
    while ($x = <FILE>) {
        if ($x =~ />/) {
            print $x;
        }
    }
    close FILE;

    >CG11604

    >CG11455

    >CG11488

    The key to this program is this block:

    if ($x =~ />/) {
        print $x;
    }

    This pattern matches (and returns TRUE) if the variable $x contains the FASTA sequence-name symbol >

    This line prints $x if the pattern matched

    Pattern replacement

    open FILE, "fly3utr.txt";

    while (<FILE>) {

    if (/>/) {

    s/>//;

    print;

    }

    }

    close FILE;

    CG11604

    CG11455

    CG11488

    New statement removes the ">"

    The new statement s/>// is an example of a replacement.

    General form: s/OLD/NEW/ replaces OLD with NEW

    Thus s/>// replaces ">" with "" (the empty string)

    $_ is the default variable for these operations

    Finding all sequence lengths

    [Flowchart: Start; open file; read line; end of file? If yes, print the last sequence length and stop. If no, remove the \n newline character at the end of the line; does the line start with >? If yes (sequence name): unless this is the first sequence, print the last sequence length; record the name; reset the running total of the current sequence length. If no (sequence data): add the length of the line to the running total. Then read the next line.]

    Finding all sequence lengths

    open FILE, "fly3utr.txt";
    while (<FILE>) {
        chomp;
        if (/>/) {
            if (defined $len) {
                print "$name $len\n";
            }
            $name = $_;
            $len = 0;
        } else {
            $len += length;
        }
    }
    print "$name $len\n";
    close FILE;

    >CG11604 58

    >CG11455 83

    >CG11488 68

    The chomp statement trims the newline character "\n" off the end of the default variable, $_.

    Try it without this and see what happens, and if you can work out why

    >CG11604

    TAGTTATAGCGTGAGTTAGT

    TGTAAAGGAACGTGAAAGAT

    AAATACATTTTCAATACC

    >CG11455

    TAGACGGAGACCCGTTTTTC

    TTGGTTAGTTTCACATTGTA

    AAACTGCAAATTGTGTAAAA

    ATAAAATGAGAAACAATTCT

    GGT

    >CG11488

    TAGAAGTCAAAAAAGTCAAG

    TTTGTTATATAACAAGAAAT

    CAAAAATTATATAATTGTTT

    TTCACTCT

    Reverse complementing DNA

    $dna = "accACgttAGgtct";

    $revcomp = lc($dna);

    $revcomp = reverse($revcomp);

    $revcomp =~ tr/acgt/tgca/;

    print $revcomp;

    agacctaacgtggt

    Start by making the string lower case again. This is generally good practice

    Reverse the string

    Replace 'a' with 't', 'c' with 'g',

    'g' with 'c' and 't' with 'a'

    A common operation due to double-helix symmetry of DNA
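
The same steps can be wrapped up for reuse. This sketch (the name revcomp is illustrative) skips the lower-casing step and instead complements both cases, so mixed-case input keeps its case pattern:

```perl
use strict;
use warnings;

sub revcomp {
    my ($dna) = @_;
    my $rc = reverse $dna;          # scalar assignment, so reverse works on the string
    $rc =~ tr/acgtACGT/tgcaTGCA/;   # complement each base, preserving case
    return $rc;
}

print revcomp("accACgttAGgtct"), "\n";
```

Applying revcomp twice should return the original sequence, which makes a convenient self-check.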

    Running external programs

    Suppose you want to get the output of another program into a variable.

    e.g. the following shell command prints the number of lines in the file myfile.txt

    wc -l myfile.txt

    You can execute a command like this from Perl using system

    system "wc -l myfile.txt";

    but that only prints the result to standard output; it does not give you access to the output of the command from within the Perl program.

    One way to get the output is by enclosing the command in backticks:

    $lines = `wc -l myfile.txt`;

    An (equivalent) way is to open a pipe from the command:

    open FILEHANDLE, "wc -l myfile.txt |";

    $lines = <FILEHANDLE>;

    Arrays

    An array is a variable holding a list of items

    We can think of this as a list with 4 entries

    @nucleotides = ('a', 'c', 'g', 't');

    print "Nucleotides: @nucleotides\n";

    Nucleotides: a c g t

    element 0: a, element 1: c, element 2: g, element 3: t

    the array is the set of all four elements

    Note that the element indices start at zero.

    Array literals

    There are several, equally valid ways to assign an entire array at once.

    @a = (1,2,3,4,5);
    print "a = @a\n";
    @b = ('a','c','g','t');
    print "b = @b\n";
    @c = 1..5;
    print "c = @c\n";
    @d = qw(a c g t);
    print "d = @d\n";

    a = 1 2 3 4 5

    b = a c g t

    c = 1 2 3 4 5

    d = a c g t

    The most common form is a comma-separated list, delimited by parentheses

    Accessing arrays

    To access array elements, use square brackets; e.g. $x[0] means "element zero of array @x"

    Remember, element indices start at zero!

    @x = ('a', 'c', 'g', 't');
    print $x[0], "\n";
    $i = 2;
    print $x[$i], "\n";

    a

    g

    If you use an array @x in a scalar context, such as @x+0, then Perl assumes that you wanted the length of the array.

    @x = ('a', 'c', 'g', 't');
    print @x + 0;

    4
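
Adding zero is one way to force scalar context; two other common idioms are the scalar function and the last-index variable $#x. A short sketch:

```perl
use strict;
use warnings;

my @x = ('a', 'c', 'g', 't');

my $count      = scalar(@x);   # number of elements
my $last_index = $#x;          # index of the final element, one less than the count

print "Array has $count elements\n";
print "Last element is $x[$last_index]\n";
print "Last element again: $x[-1]\n";   # negative indices count from the end
```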

    Array operations

    You can sort and reverse arrays...

    @x = ('a', 't', 'g', 'c');
    @y = sort @x;
    @z = reverse @y;
    print "x = @x\n";
    print "y = @y\n";
    print "z = @z\n";

    x = a t g c

    y = a c g t

    z = t g c a

    You can read the entire contents of a file into an array (each line of the file becomes an element of the array)

    open FILE, "sequence.txt";
    @x = <FILE>;

    push, pop, shift, unshift

    @x= ("Fame", "Power", "Money");

    print "I started with @x\n";

    $y = pop @x;

    push @x, "Success";

    print "Then I had @x\n";

    $z = shift @x;

    unshift @x, "Glamour";

    print "Now I have @x\n";

    print "I lost $y and $z\n";

    I started with Fame Power Money

    Then I had Fame Power Success

    Now I have Glamour Power Success

    I lost Money and Fame

    pop removes the last element of an array

    push adds an element to the end of an array

    shift removes the first element of an array

    unshift adds an element to the start of an array

    foreach

    Finding the total of a list of numbers:

    @val = (4, 19, 1, 100, 125, 10);
    $total = 0;
    foreach $x (@val) {
        $total += $x;
    }
    print $total;

    259

    The foreach statement loops through each entry in an array

    Equivalent to:

    @val = (4, 19, 1, 100, 125, 10);
    $total = 0;
    for ($i = 0; $i < @val; ++$i) {
        $total += $val[$i];
    }
    print $total;

    259

    Iterator comparison

    foreach

    for

    iMac G5 1.8GHz 512MB, Mac OS X 10.4.2, perl v5.8.6 built for darwin-thread-multi-2level

    [yoko:~] yam% time perl -e 'for ($n = 1; $n

    The @ARGV array

    A special array is @ARGV

    This contains the command-line arguments when the program is invoked at the Unix prompt

    It's a way for the user to pass information into the program
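
A small sketch of a program that reports the length of each sequence given on the command line. The script name seqlen.pl and the subroutine describe are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical usage: perl seqlen.pl ACGCGT TTAGG
sub describe {
    my @lines;
    foreach my $seq (@_) {
        push @lines, "$seq has length " . length($seq);
    }
    return @lines;
}

print "Number of arguments: ", scalar(@ARGV), "\n";
print "$_\n" foreach describe(@ARGV);
```

Note that @ARGV holds only the arguments; the name of the script itself is kept separately in $0.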

    Exploding a sequence into an array

    The programming language C treats all strings as arrays

    $dna = "accggtgtgcg";

    print "String: $dna\n";

    @array = split //, $dna;

    print "Array: @array\n";

    String: accggtgtgcg

    Array: a c c g g t g t g c g

    The split statement turns a string into an array. Here, it splits after every character, but we can also split at specific points, like a restriction enzyme
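
To illustrate splitting at specific points, the sketch below mimics digestion by the restriction enzyme EcoRI, which recognises GAATTC and cuts between the G and the first A. The subroutine name and the test sequence are made up, and the lookbehind and lookahead constructs used in the pattern go beyond the regular expression features covered in these slides:

```perl
use strict;
use warnings;

sub ecori_digest {
    my ($dna) = @_;
    # Zero-width pattern: split at every position that has G just before it
    # and AATTC just after it, without removing any characters.
    return split /(?<=G)(?=AATTC)/i, $dna;
}

my @fragments = ecori_digest("ccGAATTCggGAATTCtt");   # made-up sequence
print "Fragments: @fragments\n";
```

Because the pattern matches a position rather than characters, joining the fragments back together reproduces the original sequence.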

    Taking a slice of an array

    The syntax @x[i,j,k...] returns an array containing elements i,j,k... of array @x (three elements for three indices)

    @nucleotides = ('a', 'c', 'g', 't');

    @purines = @nucleotides[0,2];

    @pyrimidines = @nucleotides[1,3];

    print "Nucleotides: @nucleotides\n";

    print "Purines: @purines\n";

    print "Pyrimidines: @pyrimidines\n";

    Nucleotides: a c g t

    Purines: a g

    Pyrimidines: c t

    Finding elements in an array

    The grep command is used to select some elements from an array

    The statement grep(EXPR,LIST) returns all elements of LIST for which EXPR evaluates to true (when $_ is set to the appropriate element)

    e.g. select all numbers over 100:

    @numbers = (101, 235, 10, 50, 100, 66, 1005);

    @numbersOver100 = grep ($_ > 100, @numbers);

    print "Numbers: @numbers\n";

    print "Numbers over 100: @numbersOver100\n";

    Numbers: 101 235 10 50 100 66 1005

    Numbers over 100: 101 235 1005
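
grep combines naturally with pattern matching. In this sketch the first sequence is the YBR007C example from earlier and the other two are made up; it also uses the block form grep { EXPR } LIST, an alternative to the grep(EXPR,LIST) form shown above:

```perl
use strict;
use warnings;

my @promoters = ("TAATAAAAAACGCGTTGTCG", "TTTTTTTTTTTTTTTTTTTT", "GGACGCGTAA");

my @with_mcb = grep { /ACGCGT/ } @promoters;   # keep sequences containing the MCB site
my $how_many = grep { /ACGCGT/ } @promoters;   # in scalar context grep returns a count

print "Sequences with MCB: @with_mcb\n";
print "Count: $how_many\n";
```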

    Applying a function to an array

    The map command applies a function to every element in an array

    Similar syntax to grep: map(EXPR,LIST) applies EXPR to every element in LIST

    Example: multiply every number by 3

    @numbers = (101, 235, 10, 50, 100, 66, 1005);

    @numbersTimes3 = map ($_ * 3, @numbers);

    print "Numbers: @numbers\n";

    print "Numbers times 3: @numbersTimes3\n";

    Numbers: 101 235 10 50 100 66 1005

    Numbers times 3: 303 705 30 150 300 198 3015
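
map works equally well with string functions. A sketch using illustrative sequences and the block form map { EXPR } LIST:

```perl
use strict;
use warnings;

my @fragments = ("accacgt", "taggtct", "ACGCGT");   # illustrative sequences

my @lengths = map { length($_) } @fragments;   # one length per sequence
my @upper   = map { uc($_) } @fragments;       # the originals are left unchanged

print "Lengths: @lengths\n";
print "Upper case: @upper\n";
```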

    Perl for Bioinformatics

    Section 3: Patterns and Subroutines

    Review: pattern-matching

    The following code:

    if (/ACGCGT/) {
        print "Found MCB binding site!\n";
    }

    prints the string "Found MCB binding site!" if the pattern "ACGCGT" is present in the default variable, $_

    Instead of using $_ we can "bind" the pattern to another variable (e.g. $dna) using this syntax:

    if ($dna =~ /ACGCGT/) {
        print "Found MCB binding site!\n";
    }

    We can replace the first occurrence of ACGCGT with the string _MCB_ using the following syntax:

    $dna =~ s/ACGCGT/_MCB_/;

    We can replace all occurrences by appending a 'g':

    $dna =~ s/ACGCGT/_MCB_/g;

    Regular expressions

    Perl provides a pattern-matching engine

    Patterns are called regular expressions

    They are extremely powerful, probably Perl's strongest feature compared to other languages

    Often called "regexps" for short

    Motivation: N-glycosylation motif

    Common post-translational modification in ER

    Membrane & secreted proteins

    Purpose: folding, stability, cell-cell adhesion

    Attachment of a 14-sugar oligosaccharide

    Occurs at asparagine residues with the consensus sequence NX1X2, where

    X1 can be anything (but proline & aspartic acid inhibit)

    X2 is serine or threonine

    Can we detect potential N-glycosylation sites in a protein sequence?

    Interlude: interactive testing

    This script echoes input from the keyboard

    while (<STDIN>) {
        print;
    }

    The special filehandle STDIN means "standard input", i.e. the keyboard

    Sometimes (e.g. in Windows IDEs) the output isn't printed until the script stops

    This is because of buffering.

    To stop buffering, set to "autoflush":

    $| = 1;
    while (<STDIN>) {
        print;
    }

    $| is the autoflush flag

    Matching alternative characters

    [ACGT] matches one A, C, G or T:

    while (<STDIN>) {
        print "Matched: $_" if /[ACGT]/;
    }

    this is not printed

    This is printed

    Matched: This is printed

    Italics denote input text

    In general square brackets denote a set of alternative possibilities

    Use - to match a range of characters: [A-Z]

    . matches any single character (except a newline)

    \s matches whitespace (spaces, tabs, newlines)

    \S matches anything that's not whitespace

    [^X] matches anything but X

    Matching alternative strings

    /(this|that)/ matches "this" or "that"

    ...and is equivalent to /th(is|at)/

    while (<STDIN>) {
        print "Matched: $_" if /this|that|other/;
    }

    Won't match THIS

    Will match this

    Matched: Will match this

    Won't match ThE oThER

    Will match the other

    Matched: Will match the other

    Remember, regexps are case-sensitive

    Matching multiple characters

    x* matches zero or more x's (greedily)

    x*? matches zero or more x's (sparingly)

    x+ matches one or more x's (greedily)

    x{n} matches n x's

    x{m,n} matches from m to n x's

    Word and string boundaries

    ^ matches the start of a string

    $ matches the end of a string

    \b matches word boundaries
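
The difference between greedy and sparing matching is easiest to see side by side. In this sketch the sequence is made up and contains two TAG stop codons:

```perl
use strict;
use warnings;

my $dna = "ATGAAATAGCCCTAG";   # made-up sequence

my ($greedy)  = $dna =~ /(ATG.*TAG)/;    # .*  grabs as much as it can
my ($sparing) = $dna =~ /(ATG.*?TAG)/;   # .*? stops at the first opportunity

print "Greedy:  $greedy\n";
print "Sparing: $sparing\n";
print "Starts with ATG\n" if $dna =~ /^ATG/;
print "Ends with TAG\n"   if $dna =~ /TAG$/;
```

The greedy pattern runs through to the last TAG, while the sparing one stops at the first.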

    "Escaping" special characters

    \ is used to "escape" characters that otherwise have meaning in a regexp

    so \[ matches the character "["

    if not escaped, "[" signifies the start of a list of alternative characters, as in [ACGT]
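
When the text to be matched literally is held in a variable, escaping by hand is awkward. Perl's \Q...\E construct, which is not covered in the slides, escapes every special character between the two markers. A sketch with made-up text:

```perl
use strict;
use warnings;

my $line = "The class [ACGT] matches one base";

# Escaped by hand: the brackets are matched literally.
my $by_hand = ($line =~ /\[ACGT\]/) ? 1 : 0;

# \Q...\E escapes every special character in the interpolated variable.
my $literal = "[ACGT]";
my $quoted  = ($line =~ /\Q$literal\E/) ? 1 : 0;

# Unescaped, [ACGT] is a set of alternatives, so a single "A" matches it...
my $as_class = ("A" =~ /[ACGT]/) ? 1 : 0;
# ...but "A" does not contain the literal text "[ACGT]".
my $as_text  = ("A" =~ /\Q$literal\E/) ? 1 : 0;

print "by hand: $by_hand, quoted: $quoted, class: $as_class, text: $as_text\n";
```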

    Retrieving what was matched

    If parts of the pattern are enclosed by parentheses, then (following the match) those parts can be retrieved from the scalars $1, $2...

    e.g. /the (\S+) sat on the (\S+) drinking (\S+)/

    matches "the cat sat on the mat drinking milk"

    with $1="cat", $2="mat", $3="milk"

    $| = 1;

    while (<STDIN>) {

    if (/(a|the) (\S+)/i) {

    print "Noun: $2\n";

    }

    }

    Pick up the cup

    Noun: cup

    Sit on a chair

    Noun: chair

    Put the milk in the tea

    Noun: milk

    Note: only the first "the"is picked up by this regexp

    Variations and modifiers

    //i ignores upper/lower case distinctions:

    while (<STDIN>) {
        print "Matched: $_" if /pattern/i;
    }

    Input:  pAttERn
    Output: Matched: pAttERn

    //g starts search where last match left off

    pos($_) is index of first character after last match

    s/OLD/NEW/ replaces first "OLD" with "NEW"

    s/OLD/NEW/g is "global" (i.e. replaces every occurrence of "OLD" in the string)
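    A short sketch of the substitution and //g modifiers in action; the string and variable names are illustrative only.

```perl
use strict;
use warnings;

my $dna = "acgtacgt";

# Copy $dna, then substitute on the copy so the original is left intact.
(my $first = $dna) =~ s/t/u/;      # replaces only the first "t"
(my $all   = $dna) =~ s/t/u/g;     # replaces every "t"

print "First only: $first\n";      # acguacgt
print "Global:     $all\n";        # acguacgu

# With //g in a while loop, each match resumes where the previous one ended,
# and pos() reports the index just past the current match.
my @ends;
while ($dna =~ /cg/g) {
    push @ends, pos($dna);
}
print "Matches of 'cg' end at: @ends\n";   # 3 7
```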

    N-glycosylation site detector

    $| = 1;
    while (<STDIN>) {
        $_ = uc $_;                        # convert to upper case
        while (/(N[^P][ST])/g) {           # 'g' modifier gets all matches in sequence
            print "Potential N-glycosylation sequence ",
                  $1, " at residue ", pos() - 2, "\n";
        }
    }

    The main regular expression is /(N[^P][ST])/g

    pos() is the index of the first residue after the match, starting at zero; so pos()-2 is the index of the first residue of the three-residue match, starting at one.

    PROSITE and Pfam

    PROSITE: a database of regular expressions for protein families, domains and motifs

    Pfam: a database of Hidden Markov Models (HMMs), equivalent to probabilistic regular expressions

    Subroutines

    Often, we can identify self-contained tasks that occur in so many different places we may want to separate their description from the rest of our program. Code for such a task is called a subroutine.

    Examples of such tasks:

    finding the length of a sequence (NB: Perl provides the subroutine length($x) to do this already)

    reverse complementing a sequence

    finding the mean of a list of numbers

    Finding all sequence lengths (2)

    open FILE, "fly3utr.txt";
    while (<FILE>) {
        chomp;
        if (/>/) {
            print_name_and_len();          # subroutine call
            $name = $_;
            $len = 0;
        } else {
            $len += length;
        }
    }
    print_name_and_len();                  # subroutine call
    close FILE;

    # Subroutine definition; code in here is not executed unless the subroutine is called
    sub print_name_and_len {
        if (defined ($name)) {
            print "$name $len\n";
        }
    }

    Reverse complement subroutine

    sub revcomp {
        my $rev;                           # "my" announces that $rev is local to the subroutine revcomp
        $rev = reverse ($dna);
        $rev =~ tr/acgt/tgca/;
        return $rev;                       # "return" announces that the return value is whatever's in $rev
    }

    $rev = 12345;
    $dna = "accggcatg";
    $rev1 = revcomp();
    print "Revcomp of $dna is $rev1\n";
    $dna = "cggcgt";
    $rev2 = revcomp();
    print "Revcomp of $dna is $rev2\n";
    print "Value of rev is $rev\n";

    Revcomp of accggcatg is catgccggt
    Revcomp of cggcgt is acgccg
    Value of rev is 12345

    The value of $rev is unchanged by calls to revcomp

    Revcomp with arguments

    sub revcomp {
        my ($dna) = @_;                    # @_ holds the arguments to the subroutine
        my $rev = reverse ($dna);
        $rev =~ tr/acgt/tgca/;
        return $rev;
    }

    $dna1 = "accggcatg";
    $rev1 = revcomp ($dna1);
    print "Revcomp of $dna1 is $rev1\n";
    $dna2 = "cggcgt";
    $rev2 = revcomp ($dna2);
    print "Revcomp of $dna2 is $rev2\n";

    Revcomp of accggcatg is catgccggt
    Revcomp of cggcgt is acgccg

    The array @_ holds the arguments to the subroutine (in this case, the sequence to be revcomp'd)

    Now we don't have to re-use the same variable for the sequence to be revcomp'd

    Mean & standard deviation

    @xdata = (1, 5, 1, 12, 3, 4, 6);
    ($x_mean, $x_sd) = mean_sd (@xdata);
    @ydata = (3.2, 1.4, 2.5, 2.4, 3.6, 9.7);
    ($y_mean, $y_sd) = mean_sd (@ydata);

    sub mean_sd {
        my @data = @_;                     # subroutine takes a list of $n numeric arguments
        my $n = @data + 0;
        my $sum = 0;
        my $sqSum = 0;
        foreach my $x (@data) {
            $sum += $x;
            $sqSum += $x * $x;
        }
        my $mean = $sum / $n;
        my $variance = $sqSum / $n - $mean * $mean;
        my $sd = sqrt ($variance);         # square root
        return ($mean, $sd);               # returns a two-element list: (mean, sd)
    }

    Maximum element of an array

    Subroutine to find the largest entry in an array

    @num = (1, 5, 1, 12, 3, 4, 6);
    $max = find_max (@num);
    print "Numbers: @num\n";
    print "Maximum: $max\n";

    sub find_max {
        my @data = @_;
        my $max = pop @data;
        foreach my $x (@data) {
            if ($x > $max) {
                $max = $x;
            }
        }
        return $max;
    }

    Numbers: 1 5 1 12 3 4 6
    Maximum: 12

    Including variables in patterns

    Subroutine to find number of instances of a given binding site in a sequence

    $dna = "ACGCGTAAGTCGGCACGCGTACGCGT";
    $mcb = "ACGCGT";
    print "$dna has ", count_matches ($mcb, $dna),
          " matches to $mcb\n";

    sub count_matches {
        my ($pattern, $text) = @_;
        my $n = 0;
        while ($text =~ /$pattern/g) { ++$n }
        return $n;
    }

    ACGCGTAAGTCGGCACGCGTACGCGT has 3 matches to ACGCGT


    Perl for Bioinformatics

    Section 4: Hashes

    Data structures

    Suppose we have a file containing a table of Drosophila gene names and cellular compartments, one pair on each line:

    Cyp12a5 Mitochondrion
    MRG15 Nucleus
    Cop Golgi
    bor Cytoplasm
    Bx42 Nucleus

    Suppose this file is in "genecomp.txt"

    Reading a table of data

    We can split each line into a 2-element array using the split command. This breaks the line at each run of whitespace:

    open FILE, "genecomp.txt";
    while (<FILE>) {
        ($g, $c) = split;
        push @gene, $g;
        push @comp, $c;
    }
    close FILE;
    print "Genes: @gene\n";
    print "Compartments: @comp\n";

    Genes: Cyp12a5 MRG15 Cop bor Bx42
    Compartments: Mitochondrion Nucleus Golgi Cytoplasm Nucleus

    The opposite of split is join, which makes a scalar from an array:

    print join (" and ", @gene);

    Cyp12a5 and MRG15 and Cop and bor and Bx42

    Finding an entry in a table

    The following code assumes that we've already read in the table from the file:

    $geneToFind = shift @ARGV;
    print "Searching for gene $geneToFind\n";
    for ($i = 0; $i < @gene; ++$i) {
        if ($gene[$i] eq $geneToFind) {
            print "Gene: $gene[$i]\n";
            print "Compartment: $comp[$i]\n";
            exit;
        }
    }
    print "Couldn't find gene\n";

    Example: $ARGV[0] = "Cop"

    Searching for gene Cop
    Gene: Cop
    Compartment: Golgi

    Binary search

    The previous algorithm is inefficient. If there are N entries in the list, then on average we have to search through (N+1)/2 entries to find the one we want.

    For the full Drosophila genome, N=12,000. This is painfully slow.

    An alternative is the Binary Search algorithm:

    Start with a sorted list.

    Compare the middle element with the one we want. Pick the half of the list that contains our element.

    Iterate this procedure to "home in" on the right element. This takes around log2(N) steps.
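    The binary search steps above can be sketched as follows. This is an illustrative version, not code from the original slides: the subroutine name binary_search is hypothetical, and it assumes the list has been sorted with Perl's default string sort so that cmp gives a consistent ordering.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the index of $target in the sorted list, or -1 if absent.
sub binary_search {
    my ($target, @sorted) = @_;
    my ($lo, $hi) = (0, $#sorted);
    while ($lo <= $hi) {
        my $mid = int (($lo + $hi) / 2);            # middle element of the current range
        my $cmp = $target cmp $sorted[$mid];
        if    ($cmp == 0) { return $mid }           # found it
        elsif ($cmp < 0)  { $hi = $mid - 1 }        # keep the lower half
        else              { $lo = $mid + 1 }        # keep the upper half
    }
    return -1;
}

# Default sort is by string comparison, matching the cmp used in the search.
my @gene = sort qw(Cyp12a5 MRG15 Cop bor Bx42);
my $i = binary_search ("Cop", @gene);
print "Found $gene[$i] at index $i\n" if $i >= 0;
```

    Each pass through the loop halves the range still under consideration, which is where the log2(N) step count comes from.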

    Associative arrays (hashes)

    Implementing algorithms like binary search is a common task in languages like C.

    Conveniently, Perl provides a type of array called an associative array (also called a hash) that is pre-indexed for quick search.

    An associative array is a set of key-value pairs (like our gene-compartment table)

    $comp{"Cop"} = "Golgi";

    Curly braces {} are used to index an associative array

    Reading a table using hashes

    open FILE, "genecomp.txt";
    while (<FILE>) {
        ($g, $c) = split;
        $comp{$g} = $c;
    }
    $geneToFind = shift @ARGV;
    print "Gene: $geneToFind\n";
    print "Compartment: ", $comp{$geneToFind}, "\n";

    ...with $ARGV[0] = "Cop" as before:

    Gene: Cop
    Compartment: Golgi

    Reading a FASTA file into a hash

    sub read_FASTA {
        my ($filename) = @_;
        my (%name2seq, $name, $seq);
        open FILE, $filename;
        while (<FILE>) {
            chomp;
            if (/>/) {
                s/>//;
                if (defined $name) {
                    $name2seq{$name} = $seq;
                }
                $name = $_;
                $seq = "";
            } else {
                $seq .= $_;
            }
        }
        $name2seq{$name} = $seq;
        close FILE;
        return %name2seq;
    }

    Formatted output of sequences

    sub print_seq {
        my ($name, $seq) = @_;
        print ">$name\n";
        my $width = 50;                    # 50-column output
        for (my $i = 0; $i < length($seq); $i += $width) {
            if ($i + $width > length($seq)) {
                $width = length($seq) - $i;
            }
            print substr ($seq, $i, $width), "\n";
        }
    }

    The term substr($x,$i,$len) returns the substring of $x starting at position $i with length $len.

    For example, substr("Biology",3,3) is "log"

    keys and values

    keys returns the list of keys in the hash, e.g. names, in the %name2seq hash

    values returns the list of values, e.g. sequences, in the %name2seq hash

    %name2seq = read_FASTA ("fly3utr.txt");
    print "Sequence names: ",
          join (" ", keys (%name2seq)), "\n";
    my $len = 0;
    foreach $seq (values %name2seq) {
        $len += length ($seq);
    }
    print "Total length: $len\n";

    Sequence names: CG11488 CG11604 CG11455
    Total length: 210

    Files of sequence names

    Easy way to specify a subset of a given FASTA database

    Each line is the name of a sequence in a given database, e.g.

    CG1167
    CG685
    CG1041
    CG1043

    Get named sequences

    Given a FASTA database and a "file of sequence names", print every named sequence:

    ($fasta, $fosn) = @ARGV;
    %name2seq = read_FASTA ($fasta);
    open FILE, $fosn;
    while ($name = <FILE>) {
        chomp $name;
        $seq = $name2seq{$name};
        if (defined $seq) {
            print_seq ($name, $seq);
        } else {
            warn "Can't find sequence: $name. ",
                 "Known sequences: ",
                 join (" ", keys %name2seq), "\n";
        }
    }
    close FILE;

    Intersection of two sets

    Two files of sequence names:

    fosn1.txt    fosn2.txt
    CG1167       CG215
    CG685        CG1041
    CG1041       CG483
    CG1043       CG1167
                 CG1163

    What is the overlap? Find intersection using hashes:

    open FILE1, "fosn1.txt";
    while (<FILE1>) { $gotName{$_} = 1; }
    close FILE1;
    open FILE2, "fosn2.txt";
    while (<FILE2>) {
        print if $gotName{$_};
    }
    close FILE2;

    CG1041
    CG1167

    Assigning hashes

    A hash can be assigned directly, as a list of "key=>value" pairs:

    %comp = ('Cyp12a5' => 'Mitochondrion',

    'MRG15'=> 'Nucleus',

    'Cop' => 'Golgi',

    'bor' => 'Cytoplasm',

    'Bx42' => 'Nucleus');

    print "keys: ", join(";",keys(%comp)), "\n";

    print "values: ", join(";",values(%comp)), "\n";

    keys: bor;Cop;Bx42;Cyp12a5;MRG15

    values: Cytoplasm;Golgi;Nucleus;Mitochondrion;Nucleus

    The genetic code as a hash

    %aa = ('ttt'=>'F', 'tct'=>'S', 'tat'=>'Y', 'tgt'=>'C',

    'ttc'=>'F', 'tcc'=>'S', 'tac'=>'Y', 'tgc'=>'C',

    'tta'=>'L', 'tca'=>'S', 'taa'=>'!', 'tga'=>'!',

    'ttg'=>'L', 'tcg'=>'S', 'tag'=>'!', 'tgg'=>'W',

    'ctt'=>'L', 'cct'=>'P', 'cat'=>'H', 'cgt'=>'R',

    'ctc'=>'L', 'ccc'=>'P', 'cac'=>'H', 'cgc'=>'R',

    'cta'=>'L', 'cca'=>'P', 'caa'=>'Q', 'cga'=>'R',

    'ctg'=>'L', 'ccg'=>'P', 'cag'=>'Q', 'cgg'=>'R',

    'att'=>'I', 'act'=>'T', 'aat'=>'N', 'agt'=>'S',

    'atc'=>'I', 'acc'=>'T', 'aac'=>'N', 'agc'=>'S',

    'ata'=>'I', 'aca'=>'T', 'aaa'=>'K', 'aga'=>'R',

    'atg'=>'M', 'acg'=>'T', 'aag'=>'K', 'agg'=>'R',

    'gtt'=>'V', 'gct'=>'A', 'gat'=>'D', 'ggt'=>'G',

    'gtc'=>'V', 'gcc'=>'A', 'gac'=>'D', 'ggc'=>'G',

    'gta'=>'V', 'gca'=>'A', 'gaa'=>'E', 'gga'=>'G',

    'gtg'=>'V', 'gcg'=>'A', 'gag'=>'E', 'ggg'=>'G' );

    Translating: DNA to protein

    $prot = translate ("gatgacgaaagttgt");
    print $prot;

    sub translate {
        my ($dna) = @_;
        $dna = lc ($dna);
        my $len = length ($dna);
        if ($len % 3 != 0) {
            die "Length $len is not a multiple of 3";
        }
        my $protein = "";
        for (my $i = 0; $i < $len; $i += 3) {
            my $codon = substr ($dna, $i, 3);
            if (!defined ($aa{$codon})) {
                die "Codon $codon is illegal";
            }
            $protein .= $aa{$codon};
        }
        return $protein;
    }

    DDESC

    Counting residue frequencies

    %count = count_residues ("gatgacgaaagttgt");
    @residues = keys (%count);
    foreach $residue (@residues) {
        print "$residue: $count{$residue}\n";
    }

    sub count_residues {
        my ($seq) = @_;
        my %freq;
        $seq = lc ($seq);
        for (my $i = 0; $i < length($seq); ++$i) {
            my $residue = substr ($seq, $i, 1);
            ++$freq{$residue};
        }
        return %freq;
    }

    g: 5
    a: 5
    c: 1
    t: 4

    Counting N-mer frequencies

    %count = count_nmers ("gatgacgaaagttgt", 2);
    @nmers = keys (%count);
    foreach $nmer (@nmers) {
        print "$nmer: $count{$nmer}\n";
    }

    sub count_nmers {
        my ($seq, $n) = @_;
        my %freq;
        $seq = lc ($seq);
        for (my $i = 0; $i
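    The body of count_nmers is cut off in this transcript. The following is a minimal sketch of how such a subroutine might be completed, assuming a sliding window of width $n moved one residue at a time. The name count_nmers_sketch is hypothetical, and the loop body is a guess modelled on count_residues, not the original code.

```perl
use strict;
use warnings;

# Hypothetical reconstruction: counts every overlapping N-mer in a sequence.
sub count_nmers_sketch {
    my ($seq, $n) = @_;
    my %freq;
    $seq = lc ($seq);
    # The last full window starts at index length($seq) - $n.
    for (my $i = 0; $i <= length($seq) - $n; ++$i) {
        ++$freq{ substr ($seq, $i, $n) };
    }
    return %freq;
}

my %count = count_nmers_sketch ("gatgacgaaagttgt", 2);
foreach my $nmer (sort keys %count) {
    print "$nmer: $count{$nmer}\n";
}
```

    A sequence of length L contains L - N + 1 overlapping N-mers, so the 15-residue example yields 14 dinucleotide counts in total.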

    N-mer frequencies for a whole file

    my %name2seq = read_FASTA ("fly3utr.txt");
    while (($name, $seq) = each %name2seq) {
        %count = count_nmers ($seq, 2, %count);
    }
    @nmers = keys (%count);
    foreach $nmer (@nmers) {
        print "$nmer: $count{$nmer}\n";
    }

    sub count_nmers {
        my ($seq, $n, %freq) = @_;
        $seq = lc ($seq);
        for (my $i = 0; $i

    Files and filehandles

    Opening a file:        open XYZ, $filename;
    Closing a file:        close XYZ;
    Reading a line:        $data = <XYZ>;
    Reading an array:      @data = <XYZ>;
    Printing a line:       print XYZ $data;
    Read-only:             open XYZ, "<$filename";
    Write-only:            open XYZ, ">$filename";
    Test if file exists:   if (-e $filename) {
                               print "$filename exists!\n";
                           }

    This XYZ is the filehandle


    Perl for Bioinformatics

    Section 5: References

    Behind the Scenes

    PC = memory + CPU (+ peripherals)

    Memory is just a list of bytes (e.g. 2^27 bytes in a machine with 128MB of RAM)

    To a first approximation, this is just one huge array. The array index is called the address

    some of the array elements are interpreted as instruction codes by the CPU

    [Figure: the CPU connected to memory, drawn as a column of byte values at addresses 0, 1, 2, ..., 2^27-2, 2^27-1]


    Buffer overflow attack

    Hexadecimal notation

    Computers use binary notation, which is tricky to interconvert to/from decimal notation

    however, binary notation is big & unwieldy

    A compromise is to use hexadecimal

    Hexadecimal is base 16 (decimal is base 10, binary is base 2)

    The letters A-F are used to represent the extra digits for 10-15

    Binary:      Decimal:   Hexadecimal:
    101          5          5
    1011         11         B
    11100        28         1C
    101000011    323        143
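    Perl can do these conversions directly. The sketch below reproduces the table using printf format codes, and converts back with the built-in hex and oct functions; the variable names are illustrative.

```perl
use strict;
use warnings;

# %b, %d and %X format the same number in binary, decimal and hexadecimal.
foreach my $n (5, 11, 28, 323) {
    printf "%-12b %-10d %X\n", $n, $n, $n;
}

# Converting back to a number:
my $from_hex = hex ("1C");            # hexadecimal string to number
my $from_bin = oct ("0b11100");       # oct() also understands a "0b" binary prefix
print "1C in hex is $from_hex; 11100 in binary is $from_bin\n";
```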

    References

    Recall the subroutine find_max(@x) which returns the largest element in the array @x

    Count the number of times we create an array in this code.

    @x = (1, 5, 1, 12, 3, 4, 6);           # array @x created here
    $max = find_max (@x);                  # @x copied into @_ here
    sub find_max {
        my @data = @_;                     # @_ copied into @data here
        ...

    All in all, we've created three copies of this array. Each copy uses up time and memory. This seems unnecessary... and it is.

    Instead of passing the whole array into the subroutine, we could simply tell the subroutine where in memory the array begins.

    The memory address of a particular variable is called a reference to that variable. This is a useful abstraction.

    Addresses are often displayed in hexadecimal.

    Reference syntax

    To create a reference to

    a scalar, $x:          $scalar_ref = \$x;
    an array, @x:          $array_ref = \@x;
    a hash, %x:            $hash_ref = \%x;

    To access a reference to

    a scalar:              $x = $$scalar_ref;
    an array:              @x = @$array_ref;
    an array element:      $x = $array_ref->[3];
    a hash:                %x = %$hash_ref;
    a hash element:        $x = $hash_ref->{'key'};

    Alternative syntax for arrays:

    $x = $$array_ref[3];

    References to scalars

    $x = 10;
    $y = 20;
    print "Initially: x=$x, y=$y\n";
    $xReference = \$x;                     # this reference points to $x
    print "X-reference: $xReference\n";
    print "Referenced variable: $$xReference\n";
    $$xReference += 3;                     # this changes the value of $x
    print "Now: x=$x, y=$y\n";
    $yReference = \$y;                     # this reference points to $y
    print "Y-reference: $yReference\n";
    print "Referenced variable: $$yReference\n";
    $$yReference *= 2;                     # this changes the value of $y
    print "Finally: x=$x, y=$y\n";

    Initially: x=10, y=20
    X-reference: SCALAR(0x1832ac0)
    Referenced variable: 10
    Now: x=13, y=20
    Y-reference: SCALAR(0x1832ae4)
    Referenced variable: 20
    Finally: x=13, y=40

    0x1832ac0 is the memory location used to store $x; 0x1832ae4 is the memory location used to store $y

    References to arrays

    @x = ('a', 'c', 'g', 't');
    @y = 1..10;
    print "x: @x\n";
    print "y: @y\n";
    $xReference = \@x;                     # this reference points to @x
    print "X-reference: $xReference\n";
    print "Referenced array: @$xReference\n";
    $$xReference[3] =~ tr/t/u/;            # this changes the 4th element of @x
    print "New x: @x\n";
    $yReference = \@y;                     # this reference points to @y
    print "Referenced array: @$yReference\n";
    $yReference->[3] *= 2;                 # this changes the 4th element of @y (NB alternative notation)
    print "New y: @y\n";

    x: a c g t
    y: 1 2 3 4 5 6 7 8 9 10
    X-reference: ARRAY(0x1832b08)
    Referenced array: a c g t
    New x: a c g u
    Referenced array: 1 2 3 4 5 6 7 8 9 10
    New y: 1 2 3 8 5 6 7 8 9 10

    Note that the type of reference is now ARRAY, not SCALAR

    References to hashes

    %comp = ('Cyp12a5' => 'Mitochondrion',
             'MRG15' => 'Nucleus',
             'Cop' => 'Golgi',
             'bor' => 'Cytoplasm',
             'Bx42' => 'Nucleus');
    $ref = \%comp;                         # the reference points to %comp
    print "Values: ", join(" ",values(%comp)), "\n";
    print "Ref: $ref\n";
    print "Ref values: ", join(" ",values(%$ref)), "\n";
    $$ref{'MRG15'} =~ s/N/n/;              # this changes $comp{'MRG15'}
    print "New values: ", join(" ",values(%comp)), "\n";

    Values: Cytoplasm Golgi Nucleus Mitochondrion Nucleus
    Ref: HASH(0x1832b08)
    Ref values: Cytoplasm Golgi Nucleus Mitochondrion Nucleus
    New values: Cytoplasm Golgi Nucleus Mitochondrion nucleus

    Note lower-case 'n' after change

    References to subroutines

    We can also have references to subroutines

    Syntax for assigning a subroutine reference:

    $subref = \&read_FASTA;

    Syntax for calling a subroutine reference:

    %name2seq = &$subref ("fly3utr.txt");

    Anonymous subroutines:

    $subref = sub { print "Hello world\n"; };
    &$subref();

    Hello world

    References to code

    sub hello {
        print "Hello @_!\n";
    }
    my $codeRef1 = \&hello;                # the reference points to the subroutine hello
    &$codeRef1 ("Mr", "President");
    print "Ref: $codeRef1\n";
    my $codeRef2 = sub { print "Goodbye @_!" };   # this is an anonymous subroutine reference
    &$codeRef2 ("cruel", "world");

    Hello Mr President!
    Ref: CODE(0x180cc3c)
    Goodbye cruel world!

    An anonymous subroutine is one that is never named, but only referenced.

    We'll be seeing more about anonymous references on the following slides.

    Reasons for references

    Increased efficiency/performance (pass a reference instead of the whole thing)

    Allowing a subroutine to modify the value of a variable, and have this modification be propagated back to the caller of the subroutine

    Allowing arrays/hashes to contain (references to) other arrays/hashes

    Abstract representation of subroutines
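    The first reason can be illustrated by rewriting find_max to take a reference rather than a list. This is a sketch only; the name find_max_ref is hypothetical and does not appear in the original slides.

```perl
use strict;
use warnings;

# Hypothetical variant of find_max: receives one scalar (a reference),
# so the elements of the caller's array are never copied.
sub find_max_ref {
    my ($data) = @_;
    my $max = $data->[0];
    foreach my $x (@$data) {
        $max = $x if $x > $max;
    }
    return $max;
}

my @x = (1, 5, 1, 12, 3, 4, 6);
my $max = find_max_ref (\@x);          # pass a reference to @x, not @x itself
print "Maximum: $max\n";
```

    The trade-off is that the subroutine now works on the caller's own array, so any assignment through $data would be visible to the caller.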

    Anonymous arrays and hashes

    Recall the syntax for assigning an entire array...

    @nucleotide = ('a', 'c', 'g', 't');

    ...and the syntax for assigning an entire hash...

    %dna2rna = ('a'=>'a', 'c'=>'c', 'g'=>'g', 't'=>'u');

    We can also create an array and assign a reference to it, without explicitly naming the array variable:

    $nucleotide_ref = ['a', 'c', 'g', 't'];    # note square brackets instead of parentheses

    This is called an anonymous array. We can also create anonymous hashes:

    $dna2rna_ref = {'a'=>'a', 'c'=>'c', 'g'=>'g', 't'=>'u'};    # note curly brackets

    Arrays of arrays

    More precisely, arrays of references-to-arrays.

    Suppose we want to represent this matrix:

    0 0 0 2
    0 0 3 0
    0 3 0 1
    2 0 1 0

    We could do it like this:

    $row1 = [0,0,0,2];
    $row2 = [0,0,3,0];
    $row3 = [0,3,0,1];
    $row4 = [2,0,1,0];
    @matrix = ($row1,$row2,$row3,$row4);

    Or, more succinctly, like this:

    @matrix = ([0,0,0,2],
               [0,0,3,0],
               [0,3,0,1],
               [2,0,1,0]);

    @matrix is an array of references to arrays

    This matrix could be a table of RNA base-pairing scores if the row and column indices are (A,C,G,U). The score of a pair is the number of strong hydrogen bonds that it forms. Thus, A-U and U-A pairs score +2; C-G and G-C pairs score +3; G-U and U-G pairs score +1; and all other pairs score 0.
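    Looking up an entry in such a matrix takes two index steps: one to pick the row reference, one to pick the element within that row. The sketch below assumes the (A,C,G,U) index order described above; the %index hash and the pair_score subroutine are illustrative names, not part of the original slides.

```perl
use strict;
use warnings;

my @matrix = ([0,0,0,2],
              [0,0,3,0],
              [0,3,0,1],
              [2,0,1,0]);

# Hypothetical lookup table mapping each base to its row/column index.
my %index = ('A' => 0, 'C' => 1, 'G' => 2, 'U' => 3);

sub pair_score {
    my ($x, $y) = @_;
    # $matrix[$i] is a reference to row $i; ->[$j] picks column $j of that row.
    return $matrix[ $index{$x} ]->[ $index{$y} ];
}

foreach my $pair (['A','U'], ['G','C'], ['G','U'], ['A','C']) {
    my ($x, $y) = @$pair;
    print "$x-$y scores ", pair_score ($x, $y), "\n";
}
```

    Perl also accepts the shorthand $matrix[$i][$j], since the arrow is optional between adjacent subscripts.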

    Arrays in C and C++

    C has nothing like Perl's hashes, although various libraries (e.g. GLIB) have equivalents.

    C++'s Standard Template Library offers the map template, which is similar to a hash.

    The vector is a C++ template.

    Templates (like C arrays) are strongly typed, unlike Perl's weakly typed arrays & hashes.


    Genome annotations

    GFF annotation format

    Nine-column tab-delimited format for simple annotations:

    Name  Program  Type     Start  End  Score  Strand  Frame  Group
    SEQ1  EMBL     atg      103    105  .      +       0      group1
    SEQ1  EMBL     exon     103    172  .      +       0      group1
    SEQ1  EMBL     splice5  172    173  .      +       .      group1
    SEQ1  netgene  splice5  172    173  0.94   +       .      group1
    SEQ1  genie    sp5-20   163    182  2.3    +       .      group1
    SEQ1  genie    sp5-10   168    177  2.1    +       .      group1
    SEQ2  grail    ATG      17     19   2.1    -       0      group2

    Name: sequence name

    Type: feature type

    Start, End: start and end residues (numbering starts at 1)

    Strand: + or -

    Frame: coding frame ("." if not applicable)

    Many of these now obsolete, but name/start/end/strand (and sometimes type) are useful

    Methods: read, write, compareTo(GFF_file), getSeq(FASTA_file)

    Reading a GFF file

    This subroutine reads a GFF file

    Each line is made into an array via the split command

    The subroutine returns an array of such arrays

    sub read_GFF {
        my ($filename) = @_;
        open GFF, "
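    The rest of read_GFF is cut off in this transcript. The following is a minimal sketch of a subroutine matching the description above (one array per line, built with split, returned as a list of array references). The name read_GFF_sketch, the lexical filehandle, and the skipping of comment and blank lines are assumptions, not the original code.

```perl
use strict;
use warnings;

# Hypothetical reconstruction: reads a tab-delimited GFF file into a list of array references.
sub read_GFF_sketch {
    my ($filename) = @_;
    open my $fh, "<", $filename or die "Can't open $filename: $!";
    my @gff;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^#/ or $line !~ /\S/;     # assumption: skip comments and blank lines
        # Split into at most nine fields, so any tabs in the group field are kept.
        push @gff, [ split /\t/, $line, 9 ];
    }
    close $fh;
    return @gff;
}
```

    Each element of the returned list is a reference, so field 3 of the first line would be read as $gff[0]->[3].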

    Writing a GFF file

    We should be able to write as well as read all datatypes

    Each array is made into a line via the join command

    Arguments: filename & reference to array of arrays

    sub write_GFF {
        my ($filename, $gffRef) = @_;
        open GFF, ">$filename" or die $!;
        foreach my $gff (@$gffRef) {
            print GFF join ("\t", @$gff), "\n";
        }
        close GFF or die $!;
    }

    open evaluates FALSE if the file failed to open, and $! contains the error message

    close evaluates FALSE if there was an error with the file

    GFF intersect detection

    Let (name1,start1,end1) and (name2,start2,end2) be the co-ordinates of two segments

    If they don't overlap, there are three possibilities:

    name1 and name2 are different;

    name1 = name2 but start1 > end2;

    name1 = name2 but start2 > end1;

    Checking every possible pair takes time N^2 to run, where N is the number of GFF lines (how can this be improved?)

    Self-intersection of a GFF file

    sub self_intersect_GFF {
        my @gff = @_;
        my @intersect;
        foreach $igff (@gff) {
            foreach $jgff (@gff) {
                if ($igff ne $jgff) {
                    if ($$igff[0] eq $$jgff[0]) {
                        if (!($$igff[3] > $$jgff[4]
                              || $$jgff[3] > $$igff[4])) {
                            push @intersect, $igff;
                            last;
                        }
                    }
                }
            }
        }
        return @intersect;
    }

    Note: this code is slow. Vast improvements in speed can be gained if we sort the @gff array before checking for intersection.

    Fields 0, 3 and 4 of the GFF line are the sequence name, start and end co-ordinates of the feature
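    One way the sorting idea might be realised is a single sweep over the features after sorting by sequence name and start co-ordinate. This is a sketch under stated assumptions, not the author's method: the name self_intersect_sorted is hypothetical, every feature is assumed to have start <= end, and the result comes back in sorted order rather than input order.

```perl
use strict;
use warnings;

# Hypothetical faster variant: sort once, then one pass instead of all pairs.
sub self_intersect_sorted {
    my @gff = sort { $a->[0] cmp $b->[0] or $a->[3] <=> $b->[3] } @_;
    my @intersect;
    my ($name, $maxEnd);       # current sequence name; largest end seen so far on it
    for my $i (0 .. $#gff) {
        my $f = $gff[$i];
        if (!defined $name or $f->[0] ne $name) {
            ($name, $maxEnd) = ($f->[0], undef);    # new sequence: reset the sweep
        }
        # Overlaps an earlier feature if it starts at or before the furthest end so far.
        my $hitsEarlier = defined $maxEnd && $f->[3] <= $maxEnd;
        # Overlaps a later feature if the very next one (same sequence) starts before this one ends.
        my $next = $gff[$i + 1];
        my $hitsLater = defined $next && $next->[0] eq $name && $next->[3] <= $f->[4];
        push @intersect, $f if $hitsEarlier or $hitsLater;
        $maxEnd = $f->[4] if !defined $maxEnd or $f->[4] > $maxEnd;
    }
    return @intersect;
}
```

    The sort dominates the running time, so the whole procedure takes on the order of N log N steps rather than N^2.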

    Converting GFF to sequence

    Puts together several previously-described subroutines

    Namely: read_FASTA, read_GFF, revcomp, print_seq

    ($gffFile, $seqFile) = @ARGV;
    @gff = read_GFF ($gffFile);
    %seq = read_FASTA ($seqFile);
    foreach $gffLine (@gff) {
        $seqName = $gffLine->[0];
        $seqStart = $gffLine->[3];
        $seqEnd = $gffLine->[4];
        $seqStrand = $gffLine->[6];
        $seqLen = $seqEnd + 1 - $seqStart;
        $subseq = substr ($seq{$seqName}, $seqStart-1, $seqLen);
        if ($seqStrand eq "-") { $subseq = revcomp ($subseq); }
        print_seq ("$seqName/$seqStart-$seqEnd/$seqStrand", $subseq);
    }

    DNA Microarrays

    Normalizing microarray data

    Often microarray data are normalized as a precursor to further analysis (e.g. clustering)

    This can eliminate systematic bias; e.g.

    if every level for a particular gene is elevated, this might signal a problem with the probe for that gene

    if every level for a particular experiment is elevated, there might have been a problem with that experiment, or with the subsequent image analysis

    Normalization is crude (it can eliminate real signal as well as noise), but common

    Rescaling an array

    For each element of the array: add a, then multiply by b

    @array = (1, 3, 5, 7, 9);
    print "Array before rescaling: @array\n";
    rescale_array (\@array, -1, 2);        # array is passed by reference
    print "Array after rescaling: @array\n";

    sub rescale_array {
        my ($arrayRef, $a, $b) = @_;
        foreach my $x (@$arrayRef) {
            $x = ($x + $a) * $b;
        }
    }

    Array before rescaling: 1 3 5 7 9
    Array after rescaling: 0 4 8 12 16

    Microarray expression data

    A simple format with tab-separated fields

    First line contains experiment names

    Subsequent lines contain: gene name, then expression levels for each experiment

    *        EmbryoStage1  EmbryoStage2  EmbryoStage3  ...
    Cyp12a5  104.556       102.441       55.643        ...
    MRG15    4590.15       6691.11       9472.22       ...
    Cop      33.12         56.3          66.21         ...
    bor      5512.36       3315.12       1044.13       ...
    Bx42     1045.1        632.7         200.11        ...
    ...      ...           ...           ...

    Messages: readFrom(file), writeTo(file), normalizeEachRow, normalizeEachColumn

    Reading a file of expression data

    sub read_expr {
        my ($filename) = @_;
        open EXPR, "
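    The rest of read_expr is cut off in this transcript. Judging by how its result is used in the normalization code, it returns a reference to the list of experiment names and a reference to a hash mapping each gene name to an array of expression levels. The following sketch is written under that assumption; the name read_expr_sketch and the details of the parsing are guesses, not the original code.

```perl
use strict;
use warnings;

# Hypothetical reconstruction: returns (\@experiment, \%expr), where
# $expr{$gene} is a reference to that gene's list of expression levels.
sub read_expr_sketch {
    my ($filename) = @_;
    open my $fh, "<", $filename or die "Can't open $filename: $!";
    my $header = <$fh>;
    die "Empty file: $filename" unless defined $header;
    chomp $header;
    # The first header field is a placeholder ("*"); the rest are experiment names.
    my ($placeholder, @experiment) = split /\t/, $header;
    my %expr;
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /\S/;                  # skip blank lines
        my ($gene, @level) = split /\t/, $line;
        $expr{$gene} = \@level;                     # @level is a fresh array on each iteration
    }
    close $fh;
    return (\@experiment, \%expr);
}
```

    With this layout, the level of gene $g in experiment number $i is $expr->{$g}->[$i].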

    Normalizing by gene

    A program to normalize expression data from a set of microarray experiments

    Normalizes by gene

    ($experiment, $expr) = read_expr ("expr.txt");
    while (($geneName, $lineRef) = each %$expr) {
        normalize_array ($lineRef);
    }

    sub normalize_array {
        my ($data) = @_;                   # NB $data is a reference to an array
        my ($mean, $sd) = mean_sd (@$data);
        @$data = map (($_ - $mean) / $sd, @$data);
    }

    Could also use the following: rescale_array($data,-$mean,1/$sd);

    Normalizing by column

    Remaps gene arrays to column arrays

    ($experiment, $expr) = read_expr ("expr.txt");
    my @genes = sort keys %$expr;
    for ($i = 0; $i < @$experiment; ++$i) {
        my @col;
        foreach $j (0..@genes-1) {             # puts column data in @col
            $col[$j] = $expr->{$genes[$j]}->[$i];
        }
        normalize_array (\@col);               # normalizes (note use of reference)
        foreach $j (0..@genes-1) {             # puts @col back into %$expr
            $expr->{$genes[$j]}->[$i] = $col[$j];
        }
    }


    Perl for Bioinformatics

    Section 6: Advanced topics

    Sorting

    It is often useful to be able to sort an array, e.g. smallest element first, largest last

    Many sort algorithms exist

    Bubblesort (swaps)

    Quicksort (pivots)

    Binary tree sort (inserts)

    Typically, in older languages, you have to implement one of these yourself, although qsort is provided in C

    This is changing...

    Sorting string data

    Perl provides the sort function to sort an array of strings into alphabetic order:

    @nucleotides = ('g', 'c', 't', 'a');
    @sorted_nucleotides = sort @nucleotides;
    print "Nucleotides: @nucleotides\n";
    print "Sorted: @sorted_nucleotides\n";

    Nucleotides: g c t a
    Sorted: a c g t

    Sorting numeric data

    To sort numeric data, we have to provide a sort function

    This is a subroutine that compares two items, $a and $b

    It must return -1 if $a<$b, 0 if $a==$b, and +1 if $a>$b

    Fortunately, Perl provides an operator that does just this. It is the spaceship operator $a <=> $b

    The syntax is as follows:

    @x = (5, 1, 16, 2, -1, 10);
    @y = sort by_number @x;
    print "y: @y\n";

    sub by_number {
        return $a <=> $b;
    }

    y: -1 1 2 5 10 16

    The variables $a and $b get passed "automagically" into this subroutine. Yet another example of arbitrary Perl weirdness...
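The comparison can also be written inline as a block, which avoids naming a separate subroutine such as by_number. A short sketch using the same numbers; swapping $a and $b reverses the order.

```perl
use strict;
use warnings;

my @x = (5, 1, 16, 2, -1, 10);

# Inline comparison block, equivalent to "sort by_number @x"
my @ascending = sort { $a <=> $b } @x;

# Swapping $a and $b sorts largest first
my @descending = sort { $b <=> $a } @x;

print "ascending: @ascending\n";      # ascending: -1 1 2 5 10 16
print "descending: @descending\n";    # descending: 16 10 5 2 1 -1
```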

    Standard sort functions

    $a <=> $b is the "standard" numeric sort

    The "standard" alphabetic sort is $a cmp $b

    The alphabetic sort is the one used by default:

    $x = "Pears";
    $y = "Apples";
    $z = "Oranges";

    print "$x cmp $y: ", $x cmp $y, "\n";

    print "$x cmp $z: ", $x cmp $z, "\n";

    print "$y cmp $z: ", $y cmp $z, "\n";

    print "$x cmp $x: ", $x cmp $x, "\n";

    Pears cmp Apples: 1

    Pears cmp Oranges: 1

    Apples cmp Oranges: -1

    Pears cmp Pears: 0
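Because the default comparison is cmp, calling sort on numbers without a comparison function orders them as strings. A small sketch of the difference:

```perl
use strict;
use warnings;

my @x = (5, 1, 16, 2, -1, 10);

my @as_strings = sort @x;                  # default: compares with cmp
my @as_numbers = sort { $a <=> $b } @x;    # explicit numeric comparison

print "default: @as_strings\n";    # default: -1 1 10 16 2 5
print "numeric: @as_numbers\n";    # numeric: -1 1 2 5 10 16
```

In the default ordering "10" and "16" come before "2" because the strings are compared character by character.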

    Sorting a GFF file

    We can "chain" multiple sort functions to sort by sequence name, then by startpoint, then by endpoint:

    ($infile, $outfile) = @ARGV;
    @gff = read_GFF ($infile);
    @gff = sort by_GFF_startpoint (@gff);    # this line does the actual sort
    write_GFF ($outfile, \@gff);

    sub by_GFF_startpoint {
        # "chaining" multiple sort comparisons
        return ($$a[0] cmp $$b[0]
                or $$a[3] <=> $$b[3]
                or $$a[4] <=> $$b[4]);
    }

    This works because (X or Y or Z) = X (if X != 0) or Y (if X == 0 and Y != 0) or Z (if X == Y == 0)

    Fields 0, 3 and 4 of the GFF line are the sequence name, start and end co-ordinates of the feature
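To try the chained comparison without a GFF file, the sketch below sorts a few in-memory features. The rows are invented, and only fields 0, 3 and 4 carry meaning here; read_GFF and write_GFF from the slide are not needed.

```perl
use strict;
use warnings;

# Each feature is a reference to an array of GFF fields:
# 0 = sequence name, 3 = start, 4 = end (fields 1 and 2 are placeholders).
my @gff = (
    ['chr2', 'src', 'exon', 100, 200],
    ['chr1', 'src', 'exon', 500, 900],
    ['chr1', 'src', 'exon', 500, 600],
    ['chr1', 'src', 'exon',  50, 700],
);

# Name first, then start, then end; "or" moves on only when a comparison ties.
sub by_GFF_startpoint {
    return ($$a[0] cmp $$b[0]
            or $$a[3] <=> $$b[3]
            or $$a[4] <=> $$b[4]);
}

my @sorted = sort by_GFF_startpoint @gff;
foreach my $feature (@sorted) {
    print join ("\t", @$feature), "\n";
}
```

The two chr1 features that both start at 500 are ordered by their end co-ordinate, which is the third comparison in the chain.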

    Packages

    Perl allows you to organise your subroutines in packages, each with its own namespace

    Perl looks for the packages in a list of directories specified by the array @INC

    Many packages available at http://www.cpan.org/

    use PackageName;
    PackageName::doSomething();

    The "use" line includes a file called "PackageName.pm" in your code

    The second line invokes a subroutine called doSomething() in the package called "PackageName" (defined in "PackageName.pm")

    print "INC dirs: @INC\n";

    INC dirs: Perl/lib Perl/site/lib .

    The "." means the current working directory (often the directory that the script is saved in)
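A package does not have to live in a separate file to get its own namespace. The sketch below declares a hypothetical package SeqUtils inline so that it runs as a single script; in practice the same code would usually go in SeqUtils.pm, end with "1;", and be loaded with "use SeqUtils;".

```perl
use strict;
use warnings;

# Hypothetical package, declared inline instead of in SeqUtils.pm
{
    package SeqUtils;

    # Reverse-complements a DNA string
    sub revcomp {
        my ($seq) = @_;
        my $rev = reverse $seq;
        $rev =~ tr/acgtACGT/tgcaTGCA/;
        return $rev;
    }
}

# The fully qualified name selects the subroutine in the SeqUtils namespace
my $rc = SeqUtils::revcomp ("aacgt");
print "$rc\n";    # acgtt
```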

    Object-oriented programming

    Data structures are often associated with code
        FASTA: read_FASTA print_seq revcomp ...
        GFF: read_GFF write_GFF ...
        Expression data: read_expr mean_sd ...

    Object-oriented programming makes this association explicit.

    A type of data structure, with an associated set of subroutines, is called a class

    The subroutines themselves are called methods

    A particular instance of the class is an object

    OOP concepts

    Abstraction
        represent the essentials, hide the details

    Encapsulation
        storing data and subroutines in a single unit
        hiding private data (sometimes all data, via accessors)

    Inheritance
        abstract base interfaces
        multiple derived classes

    Polymorphism
        different derived classes exhibit different behaviors in response to the same requests
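Inheritance and polymorphism can be sketched in Perl with the @ISA array, which lists a package's parent classes. All class and method names below are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;

{
    package Feature;                  # base class
    sub new {
        my ($class, %args) = @_;
        my $self = { %args };
        return bless $self, $class;
    }
    sub describe {                    # the same request for every feature
        my ($self) = @_;
        return ref ($self) . " " . $self->label;
    }
    sub label { return "generic" }
}
{
    package Exon;
    our @ISA = ('Feature');           # inherits new and describe
    sub label { return "coding" }     # overrides label
}
{
    package Intron;
    our @ISA = ('Feature');
    sub label { return "non-coding" }
}

my @descriptions = map { $_->describe } (Exon->new, Intron->new, Feature->new);
print "$_\n" foreach @descriptions;
```

Each object answers the same describe request, but the label it reports depends on the derived class it was blessed into.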

    OOP: Analogy

    o Messages (the words in the speech balloons, and also perhaps the coffee itself)

    o Overloading (Waiter's response to "A coffee", different response to "A black coffee")

    o Polymorphism (Waiter and Kitchen implement "A black coffee" differently)

    o Encapsulation (Customer doesn't need to know about Kitchen)

    o Inheritance (not exactly used here, except implicitly: all types of coffee can be drunk or spilled, all humans can speak basic English and hold cups of coffee, etc.)

    o Various OOP Design Patterns: the Waiter is an Adapter and/or a Bridge, the Kitchen is a Factory (and perhaps the Waiter is too), asking for coffee is a Factory Method, etc.

    OOP: Advantages

    Often more intuitive
        Data has behavior

    Modularity
        Interfaces are well-defined
        Implementation details are hidden

    Maintainability
        Easier to debug, extend

    Framework for code libraries
        Graphics & GUIs
        BioPerl, BioJava

    OOP: Jargon

    Member, method
        A variable/subroutine associated with a particular class

    Overriding
        When a derived class implements a method differently from its parent class

    Constructor, destructor
        Methods called when an object is created/destroyed

    Accessor
        A method that provides [partial] access to hidden data

    Factory
        An [abstract] object that creates other objects

    Singleton
        A class which is only ever instantiated once (i.e. there's only ever one object of this class)
        C.f. static member variables, which occur once per class

    Objects in Perl

    An object in Perl is usually a reference to a hash

    The method subroutines for an object are found in a class-specific package

    Command bless $x, MyPackage associates variable $x with package MyPackage

    Syntax of method calls, e.g. $x->save();
        this is equivalent to MyPackage::save($x); if $x was blessed into MyPackage and save is defined there

    Typical constructor: MyPackage->new();

    @EXPORT and @EXPORT_OK arrays are used (with the Exporter module) to export subroutine names to the user's namespace

    Many useful Perl objects available at CPAN
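The points above can be put together in a minimal sketch. The class name Sequence and its fields are hypothetical; the example shows a constructor that blesses a hash reference, an accessor, and the two equivalent ways of calling a method.

```perl
use strict;
use warnings;

{
    package Sequence;    # hypothetical class name

    # Constructor: blesses a hash reference into this package
    sub new {
        my ($class, %args) = @_;
        my $self = { name => $args{name}, seq => $args{seq} };
        return bless $self, $class;
    }

    # Accessor: read-only access to the stored name
    sub name {
        my ($self) = @_;
        return $self->{name};
    }

    # Method: length of the stored sequence
    sub seqlen {
        my ($self) = @_;
        return length $self->{seq};
    }
}

my $x = Sequence->new (name => "clone1", seq => "acgtacgt");
my $n1 = $x->seqlen ();            # method-call syntax
my $n2 = Sequence::seqlen ($x);    # equivalent plain subroutine call
printf "%s: %d %d\n", $x->name, $n1, $n2;    # clone1: 8 8
```

The arrow syntax is usually preferred because it also finds methods inherited from parent classes, which the fully qualified call does not.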

    AUTOLOAD

    When an undefined method is called on an object, the special method AUTOLOAD is called, if defined

    Special variable $AUTOLOAD contains function name

    Allows implementation of e.g. default accessors for hash elements
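A minimal sketch of default accessors built on AUTOLOAD, assuming a hypothetical class Feature whose objects are hash references. Any method name that matches a hash key returns that element; other names raise an error.

```perl
use strict;
use warnings;

{
    package Feature;    # hypothetical class name

    our $AUTOLOAD;      # set by Perl to the fully qualified method name

    sub new {
        my ($class, %fields) = @_;
        my $self = { %fields };
        return bless $self, $class;
    }

    # Called for any method that Feature does not define
    sub AUTOLOAD {
        my ($self) = @_;
        my $name = $AUTOLOAD;
        $name =~ s/.*:://;                # strip the package prefix
        return if $name eq 'DESTROY';     # object destruction is not an accessor
        die "No such field: $name\n" unless exists $self->{$name};
        return $self->{$name};
    }
}

my $f = Feature->new (start => 403, end => 803);
printf "%d..%d\n", $f->start, $f->end;    # 403..803
```

The DESTROY check matters because Perl also routes the implicit destructor call through AUTOLOAD when no DESTROY method is defined.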

    GD.pm

    A graphics package by Lincoln Stein

    use GD;

    # create a new image
    $im = new GD::Image(100,100);

    # allocate some colors
    $white = $im->colorAllocate(255,255,255);
    $black = $im->colorAllocate(0,0,0);
    $red = $im->colorAllocate(255,0,0);
    $blue = $im->colorAllocate(0,0,255);

    # make the background transparent
    $im->transparent($white);

    # Put a black frame around the picture
    $im->rectangle(0,0,99,99,$black);

    # Draw a blue oval
    $im->arc(50,50,95,75,0,360,$blue);

    # And fill it with red
    $im->fill(50,50,$red);

    # Convert the image to PNG and print it out
    print $im->png;

    CGI.pm

    CGI (Common Gateway Interface)
        Page-based web programming paradigm

    CGI.pm (also by Lincoln Stein)
        Perl CGI interface
        runs on a webserver
        allows you to write a program that runs behind a webpage

    CGI (static, page-based) is gradually being supplemented by AJAX

    BioPerl

    A set of Open Source Bioinformatics packages
        largely object-oriented

    Can be downloaded from bio.perl.org

    Handles various different file formats

    Parses output of BLAST and other programs

    Basis for Ensembl
        the human genome annotation project www.ensembl.org

    Example: Bio::DB::GenBank

    Interface to the GenBank database

    Saves having to rewrite same old parsers

    use Bio::DB::GenBank;

    $gb = new Bio::DB::GenBank;

    $seq = $gb->get_Seq_by_id('MUSIGHBA1');        # Unique ID
    # or ...
    $seq = $gb->get_Seq_by_acc('J00522');          # Accession Number
    $seq = $gb->get_Seq_by_version('J00522.1');    # Accession.version
    $seq = $gb->get_Seq_by_gi('405830');           # GI Number

    Digest::MD5

    MD5 is a one-way hash function

    e.g. gravatar.com uses MD5 to map (authenticated) email addresses to avatar icons

    use Digest::MD5 qw(md5 md5_hex md5_base64);

    my $baseURL = "http://www.gravatar.com/avatar/";

    while (<>) {
        chomp;
        print $baseURL, md5_hex(lc($_)), "\n";
    }
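A short sketch of what md5_hex returns, and why the slide's code lower-cases each address before hashing. The email address is a made-up example.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# md5_hex returns the digest as 32 hexadecimal characters
my $digest = md5_hex ("hello");
print "$digest\n";    # 5d41402abc4b2a76b9719d911017c592

# The digest is case-sensitive, so differently capitalised spellings of the
# same address only map to the same URL if they are lower-cased first.
my $mixed = "User\@Example.org";
my $lower = "user\@example.org";
my $raw_differs   = md5_hex ($mixed) ne md5_hex ($lower);
my $lc_matches    = md5_hex (lc $mixed) eq md5_hex ($lower);
print "lower-casing makes the digests agree\n" if $raw_differs && $lc_matches;
```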

    Other programming languages

    Procedural languages
        Interpreted/scripting languages
            "Shell" languages (TCSH, BASH, CSH)
            Python: cleaner, object-oriented
            Ruby: even more object-oriented
        Compiled languages
            C: very basic, portable and fast
            C++: more elaborate, object-oriented C
            Java: stripped-down portable C++; "safer" & cleaner

    Functional languages
        More mathematical, cleaner; but less pragmatic
        Lisp, Scheme
            Lisp is the oldest. (Lots (of (parentheses)))
        Prolog, ML, Haskell

    Co-ordinate transformation

    Motivation: map clones to chromosomes

    [Figure: chromosome and clones, with co-ordinate labels 17455 to 17855 and 403 to 803]

    Co-ordinate transformations (cont.)

    What if a segment spans multiple clones?
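One possible way to handle both directions is sketched below. It rests on several assumptions that the slides do not state: co-ordinates are 1-based, clones tile the chromosome without gaps or overlaps, and the figure's numbers mean that clone positions 403 to 803 correspond to chromosome positions 17455 to 17855. The clone table and subroutine names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical clone table: each clone covers chromosome positions
# chr_start..chr_end, and clone position 1 corresponds to chr_start.
my @clones = (
    { name => 'cloneA', chr_start => 17053, chr_end => 18052 },
    { name => 'cloneB', chr_start => 18053, chr_end => 19052 },
);

# Clone co-ordinate to chromosome co-ordinate
sub clone_to_chr {
    my ($clone, $pos) = @_;
    return $pos + $clone->{chr_start} - 1;
}

# Splits a chromosome segment into one piece per overlapping clone,
# returning [clone name, clone start, clone end] for each piece.
sub chr_to_clones {
    my ($start, $end) = @_;
    my @pieces;
    foreach my $clone (@clones) {
        my $s = $start > $clone->{chr_start} ? $start : $clone->{chr_start};
        my $e = $end   < $clone->{chr_end}   ? $end   : $clone->{chr_end};
        next if $s > $e;    # no overlap with this clone
        push @pieces, [$clone->{name},
                       $s - $clone->{chr_start} + 1,
                       $e - $clone->{chr_start} + 1];
    }
    return @pieces;
}

my @mapped = (clone_to_chr ($clones[0], 403), clone_to_chr ($clones[0], 803));
print "@mapped\n";    # 17455 17855

# A segment that runs past the end of cloneA is split into two pieces
my @pieces = chr_to_clones (17855, 18100);
printf "%s %d..%d\n", @$_ foreach @pieces;
```

A segment spanning multiple clones is reported as one piece per clone, each clipped to that clone's extent.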