Biology 595M – Practical Biocomputing

Post on 02-Oct-2021


Page 1: Biology 595M – Practical Biocomputing

Biol 59500-033 - Practical Biocomputing 1

Biology 59500-033 – Practical Biocomputing

Michael Gribskov

Hock 331

[email protected]

x46933

Page 2: Biology 595M – Practical Biocomputing

Introduction

Goals

• Basic skills in acquiring, transforming, and handling data

• Understand what is hard and what is easy to do with computers

• Basic introduction to good programming practices

• Not…

○ A bioinformatics course per se

○ Designed for professional programmers

Page 3: Biology 595M – Practical Biocomputing

Introduction

Course Wiki

• Because auditors do not have easy access to Blackboard, I use a Wiki for course materials.

• Go to https://wiki.itap.purdue.edu/display/wl49402201720/spring-2017-biol-59500-033+Home

Page 4: Biology 595M – Practical Biocomputing

Introduction

Survey - To help me target the course to your needs

Fill this in on the class roster on the wiki to confirm that you can access and edit the wiki:

https://wiki.itap.purdue.edu/display/wl49402201720/Student+Roster

• Name

• Email

• What kind of computer do you use (Mac, Windows, UNIX, Linux etc)

• What is your programming background – None, or language and

level of expertise

• What is your major and/or area of interest? This could include a short description of a problem that made you want to take this course. If you are an auditor (not taking the course for credit), please note it in this section.

Page 5: Biology 595M – Practical Biocomputing

Introduction

Overall Schedule

• 6-8 weeks Perl Programming

○ Basics of Computers and Programming

○ Basics of Perl

○ Regular expressions and text processing

○ Writing user agent or robot scripts

• 3 weeks Databases

• 3 - 4 weeks Putting it all together / Advanced topics

Page 6: Biology 595M – Practical Biocomputing

Introduction

Effort / Work Required

• Weekly Problem Sets

• Sporadic Quizzes

• One or Two Midterm Exams

• Final Project

○ A working website or script that combines data manipulation and

computing to do something novel

Page 7: Biology 595M – Practical Biocomputing

Introduction

Texts

• Online texts

○ Safari – SAMS Teach Yourself Perl in 24 Hours, 3rd ed

− Available on Purdue Safari,

http://proquestcombo.safaribooksonline.com/

○ Paper texts (many available, not required)

Page 8: Biology 595M – Practical Biocomputing

Introduction

Additional Online Texts

• I've posted some additional texts gathered from the internet on the Wiki. If you are averse to buying a book, have a look at these and see what you think.

Page 9: Biology 595M – Practical Biocomputing

Introduction

Week 1 Goals

• Get Perl up and running somewhere you have ready access

○ Your personal PC

○ Lab computer

○ Computer lab

○ Genomics Computing Facility

• Write some simple scripts that actually do something

Page 10: Biology 595M – Practical Biocomputing

Introduction

Reading for this week

• Perl in 24 Hours (P24h)

○ Wednesday 1/11

− Hour 1 - Installing perl etc.

− Hour 2 - Variables and operators

○ Friday 1/13

− Hour 3 - Program flow

• Programming Perl (PP)

○ Wednesday 1/11

− Ch 5 - Creating and running a perl program

− Ch 6 - Perl variables, pg 23-29

− Ch 7 – Operators, pg 43-50

○ Friday 1/13

− Ch 8 – Conditional constructs

Page 11: Biology 595M – Practical Biocomputing

Introduction

Where to Run Perl

• Versions: Perl 5.8-5.18 are OK, earlier versions are not

• Options

○ Install on your own PC (instructions posted Wiki)

○ Install on your lab computer

○ Use Genomics Computing Facility Unix computers

− Get account/password from instructor

○ Use ITAP labs or RCAC clusters

Page 12: Biology 595M – Practical Biocomputing

Introduction

History

• PERL was invented by Larry Wall in 1987 as Practical Extraction and

Report Language

• Gained Popularity with the advent of the world wide web

○ Perl 4 – 1991

○ Perl 5 – 1994, current version 5.22

○ Perl 6 is essentially a different language

• Widely used for CGI scripts, and processing electronic documents

○ "The Swiss Army chainsaw of programming languages"

○ The duct tape of the Internet

○ Most common scripting language used in genomics/bioinformatics

Page 13: Biology 595M – Practical Biocomputing

Introduction

Basics of Perl

• Perl is an interpreted language. A Perl script can run on any computer with a Perl interpreter. Other languages that run through an interpreter or virtual machine include Python and Java.

• Perl replaces many “shell” scripting tools used in UNIX such as sed

and awk

• Designed to be practical and useful rather than elegant, minimal, or

as an example of a theoretical philosophy

• Perl maxims

○ TMTOWTDI (Tim Toady) – There’s more than one way to do it

○ DWIM - Do what I mean or the "principle of least astonishment"

○ “What is the sound of Perl? Is it not the sound of a wall that people have

stopped banging their heads against?” – Larry Wall

Page 14: Biology 595M – Practical Biocomputing

Introduction

Perl Pros and Cons

• Pros

○ Powerful

○ Expressive

○ Easy and flexible to use

○ String and list processing

○ Automatic memory management

• Cons

○ Perl can be ugly (punctuation resembles cartoon cursing)

○ Perl can be excessively complex and compact, leading to unreadable

code

○ Perl is not efficient at highly mathematical operations

○ Many computer scientists look down on Perl

Page 15: Biology 595M – Practical Biocomputing

Am I Cut Out for Programming?

Programming Skills

• Logical thinking

• Detail oriented

• Able to solve problems based on incorrect results

• Laboratory biologists are natural programmers!

Page 16: Biology 595M – Practical Biocomputing

Biological Protocol

Reagents

• SOB (Super Optimal Broth)

○ 2% w/v bacto-tryptone

○ 0.5% w/v yeast extract

○ 10 mM NaCl

○ 2.5 mM KCl

• WB (Washing Buffer)

○ 10% redistilled glycerol (v/v)

○ 90% distilled water

○ chilled to 4°C

Protocol

• Use a fresh colony of DH5α to inoculate 5 ml of SOB

• Grow cells with vigorous aeration overnight at 37°C.

• Dilute 2.5 ml of cells into 250 ml of SOB in a 1 liter flask.

• Grow with vigorous aeration at 37°C until the cells reach an OD550 = 0.8.

• Harvest cells by centrifugation at 5000 RPM in a GSA rotor for 10 min.

• Repeat 2X

○ Resuspend the cell pellet in 250 ml of WB.

○ Centrifuge the cell suspension at 5,000 RPM for 15 min .

○ Carefully pour off the supernatant as soon as the rotor stops.

○ Cells washed in WB do not pellet well. If the supernatant is turbid, increase the centrifugation time and repeat step 6.

• Resuspend the cell pellet in 1 ml WB.

• Cells can be used immediately or frozen in 0.2 ml aliquots at -70°C.

Page 17: Biology 595M – Practical Biocomputing

Biological Protocol

[Same reagents and protocol as on the first Biological Protocol slide, shown again with a callout.]

Callout: Definitions

Page 18: Biology 595M – Practical Biocomputing

Biological Protocol

[Same reagents and protocol as on the first Biological Protocol slide, shown again with a callout.]

Callout: Repeat for a certain time, or number of times, or until a condition is met

Page 19: Biology 595M – Practical Biocomputing

Biological Protocol

[Same reagents and protocol as on the first Biological Protocol slide, shown again with callouts.]

Callouts:

• Conditional or alternative steps

• Final result (output)

Page 20: Biology 595M – Practical Biocomputing

What is a program (script)

A detailed set of instructions for how to do something

• Definitions/symbols

○ pi = 3.1416

• Actions

○ Area = pi × radius²

• Loops

○ Repeat 2 times

○ Repeat until …

• Conditional

○ If (something) do (something)

Computers are like very hard working but very stupid lab helpers

• Instructions must be exact – computers are quite happy to do the wrong thing over and over

• All possible alternatives must be covered – when undefined situations occur computers either

○ Do the wrong thing

○ Stop and wait (forever)

○ Fail catastrophically (shoot themselves)
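The four ingredients listed above (definitions, actions, loops, conditionals) can be sketched in a few lines of Perl. This is only an illustration and not part of the original slides: the helper circle_area, the variable names, the threshold of 10, and the use of strict, warnings, my, and sub are all assumptions made for the example.

```perl
use strict;
use warnings;

# Definition: a symbol that stands for a value
my $pi = 3.1416;

# Action: compute something from the definitions
# (circle_area is a hypothetical helper, not from the slides)
sub circle_area {
    my ($radius) = @_;
    return $pi * $radius**2;
}

# Loop: repeat 2 times
my $radius = 1;
my $count  = 0;
while ( $count < 2 ) {
    my $area = circle_area($radius);

    # Conditional: if (something) do (something)
    if ( $area > 10 ) {
        print "radius $radius gives a large circle, area $area\n";
    }
    $radius = $radius + 1;
    $count  = $count + 1;
}
```

Note that every case the loop can meet is spelled out; nothing is left for the computer to guess.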

Page 21: Biology 595M – Practical Biocomputing

A More Detailed Protocol

Preparation of E. coli cells for electroporation

1. Use a fresh colony of DH5α (or other appropriate host strain) to inoculate 5 ml of SOB (without magnesium) medium in a 50 ml sterile conical tube. Grow cells with vigorous aeration overnight at 37°C.

2. Dilute 2.5 ml of cells into 250 ml of SOB (without magnesium) in a 1 liter flask. Grow for 2 to 3 hours with vigorous aeration at 37°C until the cells reach an OD550 = 0.8.

3. Harvest cells by centrifugation at 5000 RPM in a GSA rotor for 10 min in sterile centrifuge bottles. (Make sure you use autoclaved bottles!).

4. Wash the cell pellet in 250 ml of ice-cold WB as follows. First, add a small amount of WB to cell pellet; pipet up and down or gently vortex until cells are resuspended. Then fill centrifuge bottle with ice cold WB and gently mix. NOTE-the absolute volume of WB added at this point is not important.

5. Centrifuge the cell suspension at 5,000 RPM for 15 min and carefully pour off the supernatant as soon as the rotor stops. Cells washed in WB do not pellet well. If the supernatant is turbid, increase the centrifugation time.

6. Wash the cell pellet a second time by resuspending in 250 ml of sterile ice-cold WB using the same technique described above. Centrifuge the cell suspension at 5000 RPM for 15 min.

7. Gently pour off the supernatant leaving a small amount of WB in the bottom of the bottle. Resuspend the cell pellet in the WB - no additional WB needs to be added – and the final volume should be about 1 ml. Cells can be used immediately or can be frozen in 0.2 ml aliquots in freezer vials using a dry ice-ethanol bath. Store frozen cells at -70°C.

Page 22: Biology 595M – Practical Biocomputing

Introduction

Virtues of a programmer …

• Laziness - Makes you write labor-saving programs that other people

will find useful, and document what you wrote so you don't have to

answer so many questions about it.

• Impatience - The anger you feel when the computer is being lazy.

This makes you write programs that don't just react to your needs,

but actually anticipate them. Or at least pretend to.

• Hubris - Excessive pride. Also the quality that makes you write (and

maintain) programs that other people won't want to say bad things

about.

… In contrast to Cowboy Programming

• galloping off on one's own without a prior plan (the runaway one-liner)

• unnecessarily dense, unreadable code (False Hubris)

• reinventing the wheel unnecessarily (False Impatience)

• brute-force programming (False Laziness)

Page 23: Biology 595M – Practical Biocomputing

Computer Architecture

[Diagram: the CPU connects to memory (cache and main memory) and to input/output devices (keyboard, display, disk, USB, network). Labels rank storage by speed: Fast Storage, Slow Storage, Even Slower Storage, Excruciatingly Slow Storage.]

Page 24: Biology 595M – Practical Biocomputing

Computer Architecture

[Diagram: CPU, memory (cache and main memory), and input/output (keyboard, display, disk, network), now with the Operating System added.]

Perl Program

• Stored on disk

• Executed by the perl interpreter

Page 25: Biology 595M – Practical Biocomputing

Computer Architecture

[Diagram: CPU, memory (cache and main memory), input/output (keyboard, display, disk, USB, network), and the Operating System.]

Perl Program

• Stored on disk

• Executed by the perl interpreter

Perl Interpreter

• Reads the perl script and carries out its instructions

Page 26: Biology 595M – Practical Biocomputing

Introduction

Basics

• Terminology

• Getting information into and back from a program

• Arithmetic

• Doing things repeatedly (looping)

• Making decisions (true or false?)

Page 27: Biology 595M – Practical Biocomputing

Introduction

Basics

• Variables – Storage – named so that values can be assigned and

accessed by symbols

• Operators – perform various operations on variables (e.g., + - * = )

• Expressions – A piece of Perl code that evaluates to a result (don’t

worry for now exactly what this means)

• Statements – variables + operators

• Functions – segments of programs that are reused. Predefined

functions look like parts of the language

• Programs / Scripts – series of statements

• Algorithms – Not the main focus of this course. Abstract

descriptions of how to accomplish a task – implemented as

programs

Page 28: Biology 595M – Practical Biocomputing

Basics of Programming

Simple Perl Programs

• Comments begin with #

• Every statement ends with a semicolon (;)

• Comments are very important because they provide a place to explain what the program does

# this is a comment

# this is a test program

print "testing";

Page 29: Biology 595M – Practical Biocomputing

Basics of Programming

Simple Perl Programs

• To get started, a few informal elements. We’ll come back to these

formally later

• Simple variable names begin with $ (technically called scalar

variables)

• A simple program

$one = 1;

$one = $one + 1;

# a simple adding program

$one = 1;

$two = 2;

$sum = $one + $two;

Page 30: Biology 595M – Practical Biocomputing

Basics of Programming

Simple Perl Programs

• Input operator <>

○ Reads a line of input

• Print function

○ Output to the screen is usually held until a newline is printed, so you may see nothing until then

○ use \n to generate a newline (carriage return)

print "any text inside quotes\n";

$in = <>;

Page 31: Biology 595M – Practical Biocomputing

Basics of Programming

Simple Perl Programs

• While loop

○ while executes a block of code (inside the curly braces) until the

condition in the parentheses becomes false (1=true, 0=false).

while ( 1 ) {

# anything here repeats forever

}

# program echo:

# repeat terminal input back to the display

while ( 1 ) {

$in = <>;

print $in;

}

Page 32: Biology 595M – Practical Biocomputing

Basics of Programming

Simple Perl Programs

• Where does <> read from?

• If there is nothing else on the command line, input is from the keyboard

% echo.pl

% perl echo.pl

• If a file is provided, input is from the file (you will see nothing unless you have the script print)

% echo.pl file.txt

% echo.pl <file.txt

% perl echo.pl file.txt

# program echo

# repeat terminal input back to the display

while ( $in = <> ) {

print $in;

}
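The read loop above can also be tried without a keyboard or a data file. The sketch below is an assumption-laden illustration, not part of the slides: count_lines is a hypothetical helper, chomp and explicit filehandles have not been introduced yet, and an in-memory filehandle (opening a reference to a string) stands in for a real file.

```perl
use strict;
use warnings;

# count_lines is a hypothetical helper: it reads every line from a
# filehandle and returns how many lines it saw.
sub count_lines {
    my ($fh) = @_;
    my $n = 0;
    while ( my $line = <$fh> ) {
        chomp $line;    # remove the trailing newline that the read keeps
        $n = $n + 1;
    }
    return $n;
}

# Assumption: an in-memory filehandle stands in for a real file here,
# so the sketch runs without any input file.
my $text = "first line\nsecond line\nthird line\n";
open( my $fh, '<', \$text ) or die "cannot open in-memory file: $!";
print "lines read: ", count_lines($fh), "\n";
close $fh;
```

The loop ends by itself because the read returns an undefined value at the end of the input, which the while condition treats as false.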

Page 33: Biology 595M – Practical Biocomputing

Basics of Programming

Simple Perl Programs

• Read numbers from a file and sum

> sample1.pl data1.in

> perl sample1.pl data1.in

# program sample1.pl

# Calculate the sum of a list of numbers in a file

while ( $in = <> ) {

    print $in;

    $sum = $sum + $in;

    $n_values = $n_values + 1;

}

print "There are $n_values values in the file\n";

print "The sum is $sum.\n";

data1.in (partial):

3

4.5

-1

6

7.3

8.11

Page 34: Biology 595M – Practical Biocomputing

Basics of Programming

Simple Perl Programs

• Conditional operators – test whether something is true or false

• Simple conditional tests

○ == tests whether two numbers are the same

○ eq tests whether two strings are the same

if ( some_condition ) {

# this executes if true

}

$one = 1;

$two = 2;

$two == $one + $one; # true

$two == $one; # false

"me" eq "you"; # false

Page 35: Biology 595M – Practical Biocomputing

Basics of Programming

Simple Perl Programs

• Read number from a file and sum

Ignore values of -1 (maybe this is the way I mark missing values)

# program sample1.pl

# Calculate the sum of a list of numbers in a file,

# skip any -1 values

while ( $in = <> ) {

    if ( $in != -1 ) {    # != means "not equal"

        print $in;

        $sum = $sum + $in;

        $n_values = $n_values + 1;

    }

}

print "There are $n_values values in the file\n";

print "The sum is $sum.\n";
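The same skip-the-marker idea can be written as a small reusable piece. This is a sketch under stated assumptions rather than the course's solution: sum_valid is a hypothetical helper, and lists, foreach, my, and the != (not equal) test are used here before the slides introduce them. The input values are the ones shown for data1.in.

```perl
use strict;
use warnings;

# sum_valid is a hypothetical helper: it adds up a list of values,
# skipping any value equal to the missing-value marker -1.
sub sum_valid {
    my @values   = @_;
    my $sum      = 0;    # start from zero explicitly
    my $n_values = 0;
    foreach my $value (@values) {
        if ( $value != -1 ) {
            $sum      = $sum + $value;
            $n_values = $n_values + 1;
        }
    }
    return ( $sum, $n_values );
}

# The values shown for data1.in
my ( $sum, $n_values ) = sum_valid( 3, 4.5, -1, 6, 7.3, 8.11 );
print "There are $n_values values in the file\n";
print "The sum is $sum.\n";
```

Starting $sum and $n_values at zero avoids relying on Perl treating an undefined value as 0, which would otherwise produce warnings.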

Page 36: Biology 595M – Practical Biocomputing

Before next class

• Identify a location where you can run perl

○ Already installed on your personal or lab computer

○ ITAP or RCAC lab

○ Download and install on your own computer (instructions on wiki)

• Check the version and make sure it is ≥ 5.8

○ type perl -v to find out the version (perl -V prints the full configuration, which also includes the version)

• Verify that you can create a file containing a perl script and run it

(see next page for suggestions)

○ perl test.pl should work everywhere

○ test.pl should work if configured to recognize .pl suffix

○ Windows – use a command window

○ Mac – use a terminal window

• It is essential that you confirm that you are able to write and run perl

scripts as soon as possible. If you have difficulty it is usually easy

to solve – don't just assume it will work, actually try it

Page 37: Biology 595M – Practical Biocomputing

Before next class

• Try the following to make sure you understand.

Write a script that …

○ prints out a message, the traditional message is "hello world" but you

may prefer something else such as "Mr. Watson come here I want you."

○ prints the square of each number you enter on the keyboard

○ prints the numbers in the Fibonacci series up to a value specified in a

variable

− a little harder – read the variable from the keyboard

○ calculates the sum of a series of numbers you enter from the keyboard

− hmm, how to make it stop and give the answer

• This is a learn by jumping in the deep end experience. All of these

examples are fairly easy, but may cause you some difficulty if this is

your first programming experience.

• Don't spend too long on these (remember, Perl is the sound of

people not beating their heads against the wall), but do make a list of

questions about what you don't understand. We will discuss these

in class.

Page 38: Biology 595M – Practical Biocomputing

January 11

Today – Hour 2

• Scalar variables

• Numeric and assignment operators

• Conditional operators

• Operator precedence

• Style

Friday

• Logical expressions

• Logical Statements

• Looping

• Lists

• Reading for Friday

○ P24H

− Hour 3 - Controlling the Program's Flow

○ PP

− ch 8: 57-73 - Conditional Constructs

Page 39: Biology 595M – Practical Biocomputing

Basic Perl

Scripts / Programs

• A perl script or program is a series of statements

• A statement is made up of variables and operators

○ Variables – a symbolic name that refers to a value stored in memory

perl scalars always begin with $ (sometimes called a sigil)

○ Operator – an action that is used to modify a variable

○ Statements are normally terminated by a ; (semicolon)

○ The term expression is sometimes used to refer to a fragment of code

that evaluates to a result

Variable   Memory
$x         22
$pi        3.14
$letter    a
$vowel     aeiou
$label     this is a string

$x = $pi * $r**2;   # a simple statement with three variables and three operators

Page 40: Biology 595M – Practical Biocomputing

Basic Perl

Expressions

• Expressions are fragments of code that evaluate to a result

• Expressions are often used with the assignment operator (=) to

assign values to variables, e.g.,

○ $x = $y + 1; # y + 1 is an expression

• Understanding the logical (true/false) value of expressions is critical

to making decisions and using loops with comparisons such as

○ < (less than)

○ > (greater than)

○ == && etc.

• Because Perl does not distinguish much between strings (text) and

numbers, context is important in determining the logical value

Page 41: Biology 595M – Practical Biocomputing

Basic Perl

Scalar Variables – Numbers and Strings

• Scalar variables correspond to single numbers or strings of

characters (as opposed to arrays, matrices or other collections)

○ Numbers – unlike some languages, Perl does not distinguish between

integer, floating point, and scientific notation

○ Strings – a piece of text; one or more letters including spaces,

punctuation, digits, and nonprinting characters such as tabs, returns,

form feeds, etc.

$x = 12000;

$y = 12000.00;

$z = 1.2e4;

$name = "Gribskov";

$alphabet = "abcdefghijklmnopqrstuvwxyz";

$space = " ";

$nothing = "";

Page 42: Biology 595M – Practical Biocomputing

Basic Perl

Scalar Variables - Strings

• Strings – a piece of text

• Double quotes or single quotes?

○ Double quotes are interpolating quotes, any Perl variables in the string

are replaced by their values. Perl variables are detected by their sigils

($, @, %)

○ Single quotes are non-interpolating quotes, apparent Perl variables are

retained exactly as written

$x = 12;

$name = "adam";

$name = "adam-$x"; # value of $name is adam-12

$name = 'adam-$x'; # value of $name is adam-$x
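A runnable version of the quoting rules may help. The three helpers below are hypothetical names invented for this sketch, and the ${name} brace form is an addition not shown on the slide; it marks where a variable name ends inside an interpolating string.

```perl
use strict;
use warnings;

# All three helpers are hypothetical; they differ only in the quoting.
sub label_double {
    my ( $name, $x ) = @_;
    return "$name-$x";    # $name and $x are replaced by their values
}

sub label_single {
    my ( $name, $x ) = @_;
    return '$name-$x';    # the text is kept exactly as written
}

sub label_braces {
    my ( $name, $x ) = @_;

    # braces mark where the variable name ends; without them Perl
    # would look for a variable called $name_
    return "${name}_$x";
}

print label_double( "adam", 12 ), "\n";    # adam-12
print label_single( "adam", 12 ), "\n";    # $name-$x
print label_braces( "adam", 12 ), "\n";    # adam_12
```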

Page 43: Biology 595M – Practical Biocomputing

Basic Perl

Scalar Variables – Numeric operators

• For convenience we can refer to an expression with an operator as having a left hand side (lhs) and a right hand side (rhs)

$x + 1 (with respect to +: lhs is $x, rhs is 1)

$y = $x + 1 (with respect to =: lhs is $y, rhs is $x + 1)

• Arithmetic operators

+ plus

- minus

/ divide

* multiply

** exponentiate (not ^)

% modulo (remainder) 7 % 2 is 1

• Assignment operators

= assignment, $x = 1

+= increment, $x += 12 same as $x = $x + 12

-= decrement, $x -= 12 same as $x = $x - 12

++ autoincrement, $x++ same as $x = $x + 1

-- autodecrement, $x-- same as $x = $x - 1
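The operators above can be checked with a few lines of Perl. This is a minimal sketch; power and remainder are hypothetical wrapper names, and the ^ line is added only to show why the slide warns "not ^": in Perl, ^ is bitwise exclusive-or, not exponentiation.

```perl
use strict;
use warnings;

# power and remainder are hypothetical wrappers around ** and %
sub power     { my ( $base, $exp ) = @_; return $base**$exp; }
sub remainder { my ( $n,    $d )   = @_; return $n % $d; }

print "2 ** 10 = ", power( 2, 10 ), "\n";        # 1024
print "2 ^ 10  = ", 2 ^ 10, "\n";                # 8: bitwise XOR, not a power
print "7 % 2   = ", remainder( 7, 2 ), "\n";     # 1

my $x = 5;
$x += 12;    # same as $x = $x + 12
$x -= 2;     # same as $x = $x - 2
print "x = $x\n";    # x = 15
```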

Page 44: Biology 595M – Practical Biocomputing

Basic Perl

Assignment Operators

• Store a new value for a variable

○ a new value is placed in memory, destroying any previously stored

value

• Newbies sometimes find statements like the following confusing

$x = $x + 1

Assignment does the following

1. Evaluate the rhs: 22 + 1 is 23

2. Replace the current value of the lhs with the new value

Before:

Variable   Memory
$x         22
$pi        3.14
$letter    a
$vowel     aeiou
$label     this is a string

After:

Variable   Memory
$x         23
$pi        3.14
$letter    a
$vowel     aeiou
$label     this is a string

Page 45: Biology 595M – Practical Biocomputing

Basic Perl

Assignment Operators – += and -=

• Shorthand for common assignments involving addition and subtraction

○ $x = $x + 4; is the same as $x += 4;

○ $y = $y - 7; is the same as $y -= 7;

○ no space between + and =, or - and =

• Also *=, /=, **= etc by the same principle

Page 46: Biology 595M – Practical Biocomputing

Basic Perl

Scalar Variables

• Autoincrement and autodecrement (++ and --)

○ $x++; and $x--; are shorthand for $x = $x + 1; and $x = $x - 1;

○ position of the operator before or after the variable determines whether the increment/decrement happens before or after the variable's value is used in the expression.

THESE CONSTRUCTIONS ARE ERROR PRONE – AVOID THEM

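$x = 12;

print ++$x; # increments first, prints 13

print "\n";

print $x++; # prints 13, then $x becomes 14

print "\n";

print --$x; # decrements first, prints 13

print "\n";

print $x--; # prints 13, then $x becomes 12

print "\n";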

$y = $x++; # easy to miss the increment

$y = ++$x; # more cryptic

while ( $x++ < 10 )

while ( ++$x < 10 ) # are these the same?
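To answer the question in the last comment, the two loops can be run side by side. This is a sketch for illustration only; count_post and count_pre are hypothetical helpers that count how many times each loop body executes.

```perl
use strict;
use warnings;

# count_post and count_pre are hypothetical helpers that count how many
# times each loop body runs when $x starts at the given value.
sub count_post {
    my ($x) = @_;
    my $passes = 0;
    while ( $x++ < 10 ) {    # compare first, then add 1
        $passes = $passes + 1;
    }
    return $passes;
}

sub count_pre {
    my ($x) = @_;
    my $passes = 0;
    while ( ++$x < 10 ) {    # add 1 first, then compare
        $passes = $passes + 1;
    }
    return $passes;
}

print "post-increment loop ran ", count_post(0), " times\n";    # 10
print "pre-increment loop ran ",  count_pre(0),  " times\n";    # 9
```

Starting from 0, the two forms differ by one pass, which is exactly the kind of off-by-one error the warning above is about.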

Page 47: Biology 595M – Practical Biocomputing

January 13

Today

• Logical expressions

• Logical Statements

• Looping

• Lists

Next week

• Lists (arrays)

• Hashes

• Files

• Text processing

• Subroutines

Reading for today

• P24H

○ Hour 3 - Controlling the Program's Flow

○ Hour 4 - Stacking Building Blocks: Lists and Arrays

• PP

○ ch 6: 25-29 - Arrays

○ ch 8: 49-59 - Conditional Constructs

Page 48: Biology 595M – Practical Biocomputing

Homework 1

Why homework?

• Programming is like learning an instrument: you must practice or the concepts will just slip out of your memory

• Homework is posted on the Wiki each week

• There is no TA for this class so some standards are needed to make reviewing programs feasible

○ Homework will be graded primarily based on whether it produces the output it is supposed to. An example input and output will be provided for you to use in testing your code; a different input will usually be used for grading

○ Style and format count

○ I can only spend a limited amount of time figuring out why your program does not work, so you must make sure it runs before submitting it.

○ Contact me if you can't make your script run after a reasonable amount of effort. Many times a script will not work because of some trivial typographical or syntactic problem. I can often find these quickly because I have made most of these mistakes myself in the past.

○ You can also try asking for help on the wiki. Likewise, feel free to answer questions on the wiki if you know the answer

Page 49: Biology 595M – Practical Biocomputing

Homework 1

Homework?

• Email completed homework to [email protected] with the

subject "biol59500 homework X" where X is replaced with the actual

homework number, e.g., "biol59500 homework 1".

• Include your script file as an attachment, not in the body of the mail.

Name the attached file with your last name and the homework

number, e.g., "huang_hw1", or "smith_hw17".

• The attached file should be able to be run as a script – it should have

no extraneous non-code content.

• Try to not embed filenames or paths in your script that will only work

on your computer. If you use these, your script will fail when I test

it.

Page 50: Biology 595M – Practical Biocomputing

Homework 1

fastq - sequencing read file

• 4 lines per read

○ ID (begins with @)

○ Sequence

○ + separator

○ quality

• How many reads?

• What is the average length of a read?

Page 51: Biology 595M – Practical Biocomputing

Basic Perl

Scalar Variables

• String operators

. (period) concatenation

.= append

x repeat; $a x $b repeats $a, $b times

index($a,$b) integer offset of string $b in string $a

substr($s,$o,$l) get substring $l long beginning at $o in string $s

length($s) length of string $s

$month = "Jan ";

$day = "twentieth ";

$year = "2006";

$month_day = $month.$day; # "Jan twentieth "

$full_date = $month; # "Jan "

$full_date .= $day; # "Jan twentieth "

$full_date .= $year; # "Jan twentieth 2006"

$six_a = "a" x 6; # $six_a is "aaaaaa"

$strlen = length($full_date); # strlen is 18

$date = substr($full_date,0,13); # date is "Jan twentieth"

Page 52: Biology 595M – Practical Biocomputing

Basic Perl

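Scalar Variables

• substr( expression, offset [,length] )

○ Return the substring of expression that starts at offset and is length characters long (to the end of the string if length is omitted)

• index( string, substring [, position] )

○ Return the offset of the first occurrence of substring in string at or after position (or from offset zero if position is not specified); returns -1 if substring is not found

○ use index for getting the offset

• length( string )

○ use for finding the length of a string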

$a = "you are the one";

$b = substr( $a, 4, 3 ); # "are"

$a = "you are the one";

$offset = index( $a, "are" );

$b = substr( $a, $offset, 3 ); # "are"

$a = "you are the one";

$length = length( $a ); # length is 15
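The three functions combine naturally when working with sequences. The sketch below is an illustration under stated assumptions: first_atg_offset is a hypothetical helper and the sequence is made up for the example. It also shows the -1 that index returns when nothing is found.

```perl
use strict;
use warnings;

# first_atg_offset is a hypothetical helper: it returns the offset of
# the first "ATG" in a sequence, or -1 if "ATG" does not occur.
sub first_atg_offset {
    my ($seq) = @_;
    return index( $seq, "ATG" );
}

my $seq    = "CCGATGAAATAG";    # made-up example sequence
my $offset = first_atg_offset($seq);
if ( $offset >= 0 ) {
    my $codon = substr( $seq, $offset, 3 );
    my $rest  = length($seq) - $offset;
    print "found $codon at offset $offset, $rest letters remain\n";
}
```

Offsets count from zero, so the first letter of the string is at offset 0.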

Page 53: Biology 595M – Practical Biocomputing

Basic Perl

Scalar Variables

• Special characters in strings

\n newline or line feed

\r return

\t tab

\f form feed

\b backspace

• What if you need to use " or ' or \ in a string

○ Must be marked with the \ symbol to tell the Perl interpreter they are not

quotes around strings

− use \" \' \\

○ Commonly called "escaping"

• Special characters you probably won't use

\c control (ctrl)

\u force next character uppercase

\l force next character lowercase

\U force following characters to uppercase

\L force following characters to lowercase

\E end \U or \L

print "\"Don\'t do it\"\n"; # prints "Don't do it"

print ""Don't do it"\n"; # error

Bareword found where operator expected at aa.pl line 1, near """Don't"

(Missing operator before Don't?)

String found where operator expected at aa.pl line 1, near "do it"\n""

(Do you need to predeclare do?)

syntax error at aa.pl line 1, near """Don't "

Execution of aa.pl aborted due to compilation errors.
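A few of the escapes above in use. This is a small sketch with made-up strings; the variable names are illustrative only, and `my` is used so the script also runs under `use strict`.

```perl
use strict;
use warnings;

my $header = "gene\tlength\n";          # tab between fields, newline at the end
my $quoted = "She said \"hello\"";      # escaped double quotes
my $single = 'It\'s here';              # escaped single quote in a single-quoted string
my $path   = "C:\\data";                # \\ gives one literal backslash
my $mixed  = "\Uacgt\E and \uperl";     # \U...\E uppercases a run, \u one character

print $header;
print "$quoted\n$single\n$path\n$mixed\n";
```

The last line printed is `ACGT and Perl`: \U applies until \E, while \u affects only the character that follows it.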

Page 54: Biology 595M – Practical Biocomputing

Basic Perl

Scalar Variables

• Perl tries to do the right thing, even when you are syntactically wrong (DWIM – do what I mean)

$a = "aaa";

$b = "bbb";

$ab = $a + $b; # operands are strings not numbers

print "ab = $ab\n"; # the numeric value of a string is zero

$one = "1";

$two = "2";

$three = $one + $two; # operands are strings that are integers

print "three = $three\n";

$one = "1"; # mixed characters and numbers

$two = 2;

$three = $one . $two; # concatenate strings

print "three = $three\n";

$three += 4;

print "three plus 4 = $three\n";

$one_a = $one + $a;

print "one_a = $one_a\n";

$a_three = $a . $three;

print "a_three = $a_three\n";

ab = 0

three = 3

three = 12

three plus 4 = 16

one_a = 1

a_three = aaa16
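Silent conversion of a non-numeric string to zero can hide bad data. One possible safeguard, sketched here with made-up input values, is to test each value with looks_like_number from the core Scalar::Util module before doing arithmetic. The variable names are illustrative, and `use warnings` is what makes Perl report non-numeric strings used as numbers.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

my $total   = 0;
my $skipped = 0;

# made-up sample values: two numbers and two strings that are not numbers
foreach my $value ( "12", "3.5", "aaa", "7 apples" ) {
    if ( looks_like_number( $value ) ) {
        $total += $value;       # safe: the string really is a number
    } else {
        $skipped++;
        print "skipping non-numeric value '$value'\n";
    }
}

print "total = $total, skipped = $skipped\n";
```

Without the check, "aaa" would be added as 0 and "7 apples" as 7, and the total would look plausible but be wrong.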

Page 55: Biology 595M – Practical Biocomputing

Basic Perl

Conditional (Logical) Operators

• Test whether something is true or false

○ Simple conditional tests

− == tests whether two numbers are the same

− eq tests whether two strings are the same

if ( some_condition ) {
    # this executes if true
}

$one = 1;
$two = 2;

$two == $one + $one;    # true, 2 == 2
$two == $one;           # false, 2 is not equal to 1
"me" eq "you";          # false
"me" == "you";          # true, 0 == 0

Page 56: Biology 595M – Practical Biocomputing

Basic Perl

Operator Precedence

• When a statement has more than one operator, there are rules that define which operation is done first. This is called operator precedence. Just as in basic algebra, the use of parentheses overrules the default rules.

• Associativity     Operators         Precedence
  non-associative   ++ --             highest (applied first)
  right             **
  left              * / % x
  left              + - .
  non-associative   == eq
  right             = += -= *= etc    lowest (last)

• Associativity

○ non-associative – applies only to the immediate operand

○ left – operations carried out left to right

○ right – operations carried out right to left

Page 57: Biology 595M – Practical Biocomputing

Basic Perl

Operator Precedence

• Examples

○ x + y + z = (x + y) + z

○ x + y * z = x + (y * z)

○ x**2 + y**2 * 4 = (x**2) + ((y**2) * 4), i.e., x² + 4y²

○ x+=2 > y + z = x+= (2 > (y + z))

• Take home lesson: operator precedence is hard to remember

If in doubt, DON'T RELY ON IT. USE PARENTHESES

Associativity   Operators
non             ++ --
right           **
left            * / % x
left            + - .
non             < > <= >= lt gt le ge
non             == != <=> eq ne cmp
right           = += -= *= etc
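A short sketch of how precedence, associativity, and parentheses change a result. The values and variable names are made up for illustration; `my` is used so the script also runs under `use strict`.

```perl
use strict;
use warnings;

my $x = 2;
my $y = 3;
my $z = 4;

my $default = $x + $y * $z;         # * binds tighter than +: 2 + 12 = 14
my $grouped = ( $x + $y ) * $z;     # parentheses first: 5 * 4 = 20

my $power         = 2 ** 3 ** 2;    # ** is right associative: 2 ** 9 = 512
my $power_grouped = ( 2 ** 3 ) ** 2;    # forced left to right: 8 ** 2 = 64

my $difference = 10 - 4 - 3;        # - is left associative: (10 - 4) - 3 = 3

print "$default $grouped $power $power_grouped $difference\n";
```

In each pair the parenthesized form states the intent directly, so a reader does not have to recall the table.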

Page 58: Biology 595M – Practical Biocomputing

Style

What we want to avoid

• Programming style is supposed to

○ Save time and effort reusing your programs

○ Make your program easy to read

○ Make it easy to figure out what it does (validate)

○ Help prevent mistakes

○ Help find mistakes (debug)

• Style is partially convention and partially esthetics

$_='\=*Sxw!jds@j$.jl.dt#Rw%^dcn"K1x(=Bl1nwl!\*1enab^h"F=!J$h%fhcq',

tr&J-ZA-Ij-za-i&A-Za-z&&s&\(&logic&&&s&\*&un&g&s&=&al&g&s&\^&it&g&&

s&%&st&g&&s&\$&ber&g&s&\#&\n&&s&"& of&g,s&([A-Z])& $1&g&&s&\\u&U&&&

s&!&es, &g&s&\\a&A&&s&1&i&g&&print" $_\n";sub liminal{"use perl!";}

Page 59: Biology 595M – Practical Biocomputing

Style

General Rules

• Use space to emphasize parts of code

○ Use white space (spaces, tabs, etc.) and blank lines to emphasize segments that go together

while ( $count < 3 ) {
    # do some stuff
}

○ is easier to read than

while ($count<3){#do some stuff
}

− Use space inside braces {} and parentheses ()

− Use space around operators: +, -, *, /, =, ==, eq, etc.

• Break lines at a page width (usually about 80 characters)

○ Why? So you can print it out if you want to

Page 60: Biology 595M – Practical Biocomputing

Style

General

• Use indentation to indicate blocks of code

• Align braces so it's easy to see where a block begins and ends

○ Preferred style (Kernighan and Ritchie “snuggled” style)

while ( $forever ) {        # Kernighan and Ritchie "snuggled" style
    $text = <>;             # read one line from terminal
    print $text;
}

○ Other styles

while ( $forever )          # GNU style
  {
    $text = <>;             # read one line from terminal
    print $text;
  }

while ( $forever )          # BSD style
{
    $text = <>;             # read one line from terminal
    print $text;
}

Page 61: Biology 595M – Practical Biocomputing

Style

Variable names

Use mnemonic variable names

• Why? Makes it much easier to understand what the program does

• $height, $weight, $age, $sex instead of $x1, $x2, $x3, $x4

Develop some consistent style for capitalization etc.

• Why? So you don’t get confused

• Perl is case sensitive, $quaint is not the same as $Quaint

Use underlines and capitalization to improve readability

• Improves naturalness of code – improves your ability to understand and remember what it does

• $number_of_lines is easier to read than $numberoflines; either is better than $nl

• Alternative style: numberOfLines (I prefer this for function names)

Do not make variable names counterintuitive or misleading

• Pretty obvious, don’t say $done when you mean $not_done

while ( $not_done ) {
    # do some stuff
}

Page 62: Biology 595M – Practical Biocomputing

Style

Variable Names

• Variable names should be unique

○ Create a new variable for a new purpose: don’t just use one from an earlier part of the program “because it’s there”

○ Reusing variable names is confusing because the same variable means

different things in different parts of the program

$c = 0;
$value = 1;
$threshold = 1000;

# find the first power of 2 that is not less than the threshold
while ( $value < $threshold ) {
    $c = $c + 1;                # $c counts the doublings
    $value = $value * 2;        # next power of 2
}

# display result and find out if we should continue
print "enter a new set of thresholds, ending with 0\n";
$more_numbers = 1;
while ( $more_numbers ) {
    $c = <>;                    # confusing: $c now holds input, not the count
    chomp $c;                   # remove the trailing newline
    ...
    $more_numbers = $c ne "0";  # continue until the entered number is zero
}

Page 63: Biology 595M – Practical Biocomputing

Style

Document your code

• To avoid writing the same scripts over and over, you have to be able

to (quickly) figure out what an existing script does

• ALWAYS include comments to explain what the script does.

• Date helps you identify scripts that you used for a specific purpose

at a certain time (important when you discover bugs)

• Name is helpful if you give your scripts to your co-workers

while (1){ $a = <>; print $a; }

# echo.pl

#

# Get input from the terminal and echo to the display

#

# 11 January 2006 Michael Gribskov

#

$forever = 1;
while ( $forever ) {
    $text = <>;     # read one line from terminal
    print $text;
}

Page 64: Biology 595M – Practical Biocomputing

Style

• Establish a style and stick to it

• Style is a matter of habit; good habits == better code

• More style as we go on

• Required for homework

○ Basic documentation at top of script

− Purpose of script and method (if not obvious)

− Author

− Date

○ Comments describing function of code segments

○ Indentation of code blocks

○ White space separating code segments
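A script that meets these requirements might look like the sketch below. The script name, purpose, author, and date are placeholders invented for illustration, not a required template; `my` is used so the script also runs under `use strict`.

```perl
# count_to_limit.pl   (hypothetical script name)
#
# Purpose: print the integers from 1 up to a limit, one per line
# Method:  foreach loop over a range
#
# Author:  Your Name        (placeholder)
# Date:    DD Month YYYY    (placeholder)

use strict;
use warnings;

# set up the limit and a counter for the lines printed
my $limit   = 3;
my $printed = 0;

# print each value and keep count
foreach my $number ( 1 .. $limit ) {
    print "$number\n";
    $printed++;
}

# report how much was done
print "printed $printed lines\n";
```

Each code segment starts with a one-line comment and is separated from the next by a blank line, and the loop body is indented one level.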

Page 65: Biology 595M – Practical Biocomputing

Basic Perl

Logical Expressions

• Give a result of false or true

• Operators

○ different for numbers and strings

○ Numeric test   String test   Meaning
  ==             eq            equal to
  !=             ne            not equal to
  >              gt            greater than
  >=             ge            greater than or equal to
  <              lt            less than
  <=             le            less than or equal to
  <=>            cmp           comparison with a signed result (-1, 0, or 1)*

*don't worry about cmp and <=> for now
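For reference, the signed result of <=> and cmp can be seen in a brief sketch with made-up values. The variable names are illustrative only, and `my` is used so the script also runs under `use strict`.

```perl
use strict;
use warnings;

my $numeric_less = ( 5 <=> 10 );        # -1: 5 is numerically smaller than 10
my $numeric_same = ( 3 <=> 3.0 );       #  0: the same number
my $string_more  = ( "5" cmp "10" );    #  1: "5" sorts after "1" as a string
my $string_same  = ( "atg" cmp "atg" ); #  0: identical strings

print "$numeric_less $numeric_same $string_more $string_same\n";
```

The same pair of values gives opposite signs depending on whether it is compared as numbers or as strings.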

Page 66: Biology 595M – Practical Biocomputing

Basic Perl

Logical Expressions

• What is true?

○ True is anything that is not false

• Only four things are false

○ The number 0 (zero) is false.

○ The string "0" (again zero) is false.

○ The empty string ("" or '') is false.

○ The undefined value, undef , is false.

• Everything else is true.

Understanding what is true and what is false is

essential to understanding how decisions are made
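The rules can be checked directly. The sketch below tests a handful of made-up values; note that the strings "0.0", "00", and " " are all true because none of them is exactly the string "0" or the empty string. Variable names are illustrative, and `my` is used so the script also runs under `use strict`.

```perl
use strict;
use warnings;

my $true_count  = 0;
my $false_count = 0;

# sample values chosen for illustration
foreach my $value ( 0, "0", "", "0.0", "00", " ", "false" ) {
    if ( $value ) {
        $true_count++;
        print "'$value' is true\n";
    } else {
        $false_count++;
        print "'$value' is false\n";
    }
}

print "$true_count true, $false_count false\n";
```

Even the string "false" is true, since it is a non-empty string other than "0".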

Page 67: Biology 595M – Practical Biocomputing

Basic Perl

Logical Expressions

• Greater than – Less than

○ Common mistake is to assume that the limit will be included. This is only true for >= and <=.

$count = 0;
while ( $count < 3 ) {      # FALSE when $count == 3
    print "$count\n";
    $count++;
}

• Trying to do the right thing

$five = "5";
$ten = "10";
$result = $five < $ten;     # true, 5 < 10
$result = $five lt $ten;    # false, "1" sorts before "5" in strings (alphabetical)

Page 68: Biology 595M – Practical Biocomputing

Basic Perl

Logical Operators

• To combine logical expressions you need

○ and && (AND)

○ or || (OR)

○ not ! (NOT)

○ the written forms (and,or,not) have low precedence

○ the symbolic forms ( &&, ||, ! ) have high precedence

AND ( and && ), true only when both operands are true

operand 1   operand 2   result
True        True        True
True        False       False
False       True        False
False       False       False

OR ( or || ), true when either operand is true

operand 1   operand 2   result
True        True        True
True        False       True
False       True        True
False       False       False

Page 69: Biology 595M – Practical Biocomputing

Basic Perl

Logical Operators

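• AND and OR are evaluated in short-circuit (“drop dead”) fashion: operands are checked left to right, and only as many as are needed to decide the result are evaluated

• Important later when the operands may not be simple variables

• Every expression, whether a numeric, string, or logical variable, has a logical value.

$x = 1;
$y = 0;
if ( $y || ($x=2) ) {       # $y is false, so $x=2 is evaluated
    print "x: $x y:$y\n";   # prints x: 2 y:0
}

$x = 1;
$y = 0;
if ( $y or ($x=3) ) {       # $y is false, so $x=3 is evaluated
    print "x: $x y:$y\n";   # prints x: 3 y:0
}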
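A common practical use of this left-to-right evaluation is guarding an operation that would otherwise fail. In this sketch (values and names are made up; `my` is used so the script also runs under `use strict`), the division is never attempted when $count is zero, because the first operand of && is already false.

```perl
use strict;
use warnings;

my $total  = 10;
my $count  = 0;     # no observations yet
my $status = "";

# the division on the right is evaluated only if $count != 0 is true
if ( $count != 0 && $total / $count > 2 ) {
    $status = "high average";
} else {
    $status = "no data or low average";
}

print "$status\n";
```

Reversing the two operands would cause an "Illegal division by zero" error at run time, so the order of the tests matters.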

Page 70: Biology 595M – Practical Biocomputing

Basic Perl

Operator Precedence

• Including logical operators

Associativity     Operators
non-associative   ++ --
right             **
right             !
left              * / % x
left              + - .
non-associative   < > <= >= lt gt le ge
non-associative   == != <=> eq ne cmp
left              &&
left              ||
right             = += -= *= etc
right             not
left              and
left              or

Page 71: Biology 595M – Practical Biocomputing

Basic Perl

Logical Statements

• Used for making decisions

○ if / elsif / else

○ Always apply to a block of code delimited by { }

• if ( logical expression ) {
      Block of code       # executes only if expression is true
  }

• if ( logical expression ) {
      Block of code       # if expression is true
  } else {
      Block of code       # if expression is false
  }

Page 72: Biology 595M – Practical Biocomputing

Basic Perl

Logical Statements

• Making a series of comparisons with if / elsif / else

• if ( logical expression 1 ) {
      Block of code       # if expression 1 is true
  } elsif ( logical expression 2 ) {
      Block of code       # if expression 2 is true
  } else {
      Block of code       # if both expressions are false
  }

Page 73: Biology 595M – Practical Biocomputing

Basic Perl

Logical Statements

• Some languages have a special statement for making multiway decisions (called a case statement)

• if / elsif / else is the closest thing to a Perl case statement

while ( $action ne "done\n" ) {
    $action = <>;
    if ( $action eq "add\n" ) {

    } elsif ( $action eq "subtract\n" ) {

    } elsif ( $action eq "divide\n" ) {

    } elsif ( $action eq "multiply\n" ) {

    } else {
        print "I don\'t understand command $action\n";
    }
}
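To show the chain with its branches filled in, here is a self-contained sketch that loops over a fixed list of commands instead of reading from the terminal. The operands, the command list, and the names $report and $result are assumptions made for illustration; `my` is used so the script also runs under `use strict`.

```perl
use strict;
use warnings;

my $x      = 12;
my $y      = 4;
my $report = "";

# made-up commands; "modulo" is deliberately not handled
foreach my $action ( "add", "subtract", "divide", "multiply", "modulo" ) {
    my $result;

    if ( $action eq "add" ) {
        $result = $x + $y;
    } elsif ( $action eq "subtract" ) {
        $result = $x - $y;
    } elsif ( $action eq "divide" ) {
        $result = $x / $y;
    } elsif ( $action eq "multiply" ) {
        $result = $x * $y;
    } else {
        $result = "unknown";    # none of the tests matched
    }

    $report .= "$action=$result ";
}

print "$report\n";
```

Only the first matching branch runs, and the final else catches anything the earlier tests did not recognize.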

Page 74: Biology 595M – Practical Biocomputing

Basic Perl

Logical Statements

• Unless - the opposite of if

○ if the logical expression is false, the block is executed

○ unless (logical expression) {

Block of code

}

while ( $action ne "done" ) {
    $action = <>;
    chomp $action;                  # remove the newline so eq can match
    unless ( $action eq "quit" ) {
        if ( $action eq "add" ) {

        } elsif ( $action eq "subtract" ) {

        } elsif ( $action eq "divide" ) {

        } else {
            print "I do not understand command $action\n";
        }
    }
}

Page 75: Biology 595M – Practical Biocomputing

Basic Perl

Logical Statements

• unless

○ unless can also take an else clause

○ unless is more confusing than if

Use else sparingly with unless

unless ( $name eq "Frank" ) {
    print "Hi $name\n";
} else {
    print "Oh, it\'s you again, Frank\n";
    exit;
}

# compared to

if ( $name eq "Frank" ) {
    print "Oh, it\'s you again, Frank\n";
    exit;
} else {
    print "Hi $name\n";
}

Page 76: Biology 595M – Practical Biocomputing

Basic Perl

Logical Statements

• Additional syntax for if and unless – one line tests

○ expression if ( logical condition );

○ expression unless ( logical condition );

• Can be more readable in some contexts

# these are the same
if ( $x > 3 ) {
    $x = $y + 1;
}

$x = $y + 1 if ( $x > 3 );

# these are the same
unless ( $x > 3 ) {
    $x = $y + 1;
}

$x = $y + 1 unless ( $x > 3 );

Page 77: Biology 595M – Practical Biocomputing

Basic Perl

Looping

• Looping allows a program to do something over and over, one of the

main reasons for using a program. Looping in Perl uses

○ while

○ do / while and do / until

○ foreach

○ for

• while

○ while ( logical expression ) {

Block of code

}

○ while loops test the condition before every execution of the loop.

○ If you want it to be tested after the loop use do … while or do … until.

$value = 5;
while ( $value > 10 ) {     # false on the first test, so the block never runs
    print "$value\n";
    $value = $value - 1;
}

Page 78: Biology 595M – Practical Biocomputing

Basic Perl

Looping

• do … while

○ do {
      Block of code
  } while ( logical expression );   # continues if expression is TRUE

• do … until

○ do {
      Block of code
  } until ( logical expression );   # continues if expression is FALSE

• Loop always executes at least once

• do loops test the condition after each execution of the loop, i.e., before each subsequent execution.

$value = 5;
do {
    print "$value\n";       # prints 5 once, then the test fails
    $value = $value - 1;
} while ( $value > 10 );
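The difference between the loop forms is easiest to see by counting how many times each block runs from the same starting value. This is a minimal sketch; the counter names are made up for illustration, and `my` is used so the script also runs under `use strict`.

```perl
use strict;
use warnings;

# while: condition tested first, so the block may never run
my $while_runs = 0;
my $value      = 5;
while ( $value > 10 ) {
    $while_runs++;
    $value--;
}

# do ... while: block runs once before the first test
my $do_while_runs = 0;
$value = 5;
do {
    $do_while_runs++;
    $value--;
} while ( $value > 10 );

# do ... until: repeats as long as the condition is false
my $do_until_runs = 0;
$value = 5;
do {
    $do_until_runs++;
    $value--;
} until ( $value <= 3 );

print "while: $while_runs, do-while: $do_while_runs, do-until: $do_until_runs\n";
```

The script prints `while: 0, do-while: 1, do-until: 2`.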

Page 79: Biology 595M – Practical Biocomputing

Basic Perl

Looping

• foreach, executes once for each item in a list

○ Most common looping statement in Perl

○ temporary variable assumes the value of the list item at each iteration

○ foreach $tmp_variable ( list of values ) {
      Block of code
  }

○ foreach ( list of values ) {      # no variable named: each item is placed in $_
      Block of code
  }

$total = 0;
foreach $number ( 1,2,3 ) {
    $total += $number;
    print "$number $total\n";
}

$cycle = 0;
foreach ( 1 .. 3 ) {
    $cycle++;
    print "cycle $cycle\n";
}

Page 80: Biology 595M – Practical Biocomputing

Basic Perl

Range Operator

• .. (two periods)

○ 1 .. 10

○ $i .. $j

○ $i .. $i + 10

• very handy for foreach loops

$total = 0;
foreach $number ( 1 .. 3 ) {
    $total += $number;
    print "$number $total\n";
}
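Two more uses of the range operator with foreach, as a brief sketch: a numeric range for a running sum, and a range of letters with the loop item taken from $_ when no loop variable is named. The variable names are illustrative, and `my` is used so the script also runs under `use strict`.

```perl
use strict;
use warnings;

# sum the integers 1 through 100
my $sum = 0;
foreach my $number ( 1 .. 100 ) {
    $sum += $number;
}

# ranges also work on letters; with no loop variable, each item is in $_
my $letters = "";
foreach ( "a" .. "e" ) {
    $letters .= $_;
}

print "sum = $sum, letters = $letters\n";
```

This prints `sum = 5050, letters = abcde`.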

Page 81: Biology 595M – Practical Biocomputing

Basic Perl

Looping

• for

○ familiar to C/C++ programmers, used much less in Perl

○ defines a temporary variable for use in loop with a specified initial value, ending value, and increment for each iteration

○ for ( initial_expression; test_expression; change_expression ) {
      Block of code
  }

○ Gives detailed control over begin, end, and step of the loop

# Print numbers 1 to 99 by 2
for ( $i=1; $i<=100; $i+=2 ) {
    print "$i\n";
}
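A case where the explicit step of a for loop is convenient is walking along a sequence three characters at a time. This is a sketch under stated assumptions: the sequence is made up, its length is a multiple of three, and the variable names are illustrative; `my` is used so the script also runs under `use strict`.

```perl
use strict;
use warnings;

# hypothetical sequence, for illustration only
my $dna    = "atggcttaa";
my $codons = "";

# start at offset 0, stop at the end of the string, advance 3 per iteration
for ( my $i = 0; $i < length( $dna ); $i += 3 ) {
    $codons .= substr( $dna, $i, 3 ) . " ";
}

print "$codons\n";
```

The loop visits offsets 0, 3, and 6, so $codons ends up as "atg gct taa " (with a trailing space).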

Page 82: Biology 595M – Practical Biocomputing

Basic Perl

Looping

• Break outs

○ Particularly useful with forever loops

○ next – stops the current iteration, proceeds with the next normal iteration, i.e., skip things you don't want

○ last – stop the current iteration and immediately exit the loop, i.e., stop when you find what you want

$forever = 1;

# break out of loop using last
while ( $forever ) {        # compare while ( 1 ) {
    $count++;
    print "count is $count\n";
    if ( $count > 3 ) {
        last;
    }
}

Page 83: Biology 595M – Practical Biocomputing

Basic Perl

Looping

• Breakouts are often most readable using the one line syntax for if

$forever = 1;

# skip processing using next
while ( $forever ) {
    $count++;
    next if ( $count == 1 );    # nothing is printed for count 1
    print "count is $count\n";
    if ( $count > 3 ) {
        last;
    }
}
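next and last work the same way inside foreach. The sketch below, with a made-up search task and illustrative variable names, skips odd numbers with next and stops at the first even multiple of 7 with last; `my` is used so the script also runs under `use strict`.

```perl
use strict;
use warnings;

my $found   = 0;    # first even multiple of 7, or 0 if none is found
my $checked = 0;    # how many even numbers were examined

foreach my $number ( 1 .. 20 ) {
    next if ( $number % 2 );    # odd: skip to the next number

    $checked++;

    if ( $number % 7 == 0 ) {
        $found = $number;
        last;                   # stop as soon as the answer is found
    }
}

print "found $found after checking $checked even numbers\n";
```

The loop stops at 14 after examining seven even numbers, so the values 15 through 20 are never visited.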