Introduction to UNIX and Perl

Todd Scheetz

Sept. 6, 2001

Computational Methods in Molecular Biology

Definitions

Operating System
• Provides a uniform interface between a computer’s hardware and user-level programs.
• Manages the low-level functionality of the hardware automatically.

Programming Language
• Provides a formal structure/syntax for implementing algorithmic procedures.

What is UNIX?

Operating system developed at Bell Labs.
• Originally written in assembly code
• The C programming language was designed to implement a more portable version of UNIX

Multi-user

Multi-tasking

What is UNIX? (part 2)

Made available with source code at no cost
• Could fix bugs, add features or just test alternative methods
• EXCELLENT for learning or teaching

Adopted by Berkeley to make BSD
• virtual memory
• paging
• networking (TCP/IP)

What is UNIX? (part 3)

By programmers, for programmers
• Extensive facilities to allow people to work together and share information in controlled ways
• Time sharing system

Basic Guidelines
• Principle of least surprise
• Every program should do one thing and do it well

UNIX hierarchy

Adapted from Tanenbaum, p. 273

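Layers, from top to bottom, with the interface between each pair:

Users
    -- User interface --
Standard utility programs (shell, editor, compiler)
    -- Library interface --
Standard library (open, close, fork, read, print, etc.)
    -- System call interface --
UNIX operating system (process mgmt, memory mgmt, file system, I/O, etc.)
Hardware (CPU, memory, disks, keyboard, etc.)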

UNIX Basics

User Accounts - required to log-on to the computer with username and password.

Groups - entity made up of one or more users.

Sharing...

[Figure: users Bob, Stacie, Diane, Mike, and Bill, divided between two groups, group1 and group2]

UNIX Basics

File Sharing - Regulated by three sets of permissions.

Permissions: read, write, execute

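Subjects: owner (u), group (g), all others (o)

Each subject has its own read (R), write (W), and execute (X) setting. A file listing shows them as three groups of three characters, in the order owner, group, others:

-rwxr-xr-x foo.pl
-r-xr-xr-x bar.pl
-rw------- secret
-rw-r--r-- public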
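The listings above can be reproduced from a numeric mode. The following is a minimal sketch, assuming the usual octal encoding of the nine permission bits (for example 0755); the helper name `mode_to_string` is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn the low nine permission bits of a numeric mode (e.g. 0755)
# into the rwxr-xr-x form used in a file listing.
# mode_to_string is a hypothetical helper, not a built-in.
sub mode_to_string {
    my ($mode) = @_;
    my @letters = ('r', 'w', 'x');
    my $out = '';
    # Walk from the owner's read bit (bit 8) down to others' execute (bit 0).
    for my $bit (reverse 0 .. 8) {
        my $letter = $letters[(8 - $bit) % 3];
        $out .= ($mode & (1 << $bit)) ? $letter : '-';
    }
    return $out;
}

# The four example files from the slide, written as octal modes.
print mode_to_string(0755), " foo.pl\n";
print mode_to_string(0555), " bar.pl\n";
print mode_to_string(0600), " secret\n";
print mode_to_string(0644), " public\n";
```

The leading `-` in the slide's listings marks a regular file and is not part of the nine permission bits, so it is left out here.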

UNIX Basics

Super-user account
• complete access to all files

Required for system administration tasks
• add accounts/groups
• change permissions/owners of any file
• change password of any account
• shut down a machine

UNIX Basics

UNIX Filesystem Hierarchy

/
    bin
    etc
    usr
        bin
        doc
        lib
        local
            bin
            etc
            lib
            tmp
    var
    tmp
    dev
    lib

Two shortcuts
. - the current directory
.. - the directory one level “up”

Example paths: /usr, /usr/bin, /usr/local, /usr/local/bin

What is UNIX?

Processes

Each program executes as a process

A process provides encapsulation for the program

Under UNIX, multiple processes can be running at the same time!

How to control processes:
^C -- break
^Z -- stop
& -- start in background
ps -- show which processes are running
kill -- kill a process

What is UNIX?

grep - show every line from a file that matches a supplied pattern
Ex. grep sub my_program.pl
(would return every line in the file that contained the string ‘sub’)

ls - list files
Ex. ls *.pl
(would list all files in the current directory that end in ‘.pl’)

head - list the first lines in a file
Ex. head -20 my_program.pl
(would show the first 20 lines from my_program.pl)

sort - performs a lexical sorting of a file
Ex. sort my_program.pl

What is UNIX?

UNIX also provides a method for chaining multiple programs together, so that the output of one program becomes the input of the next.

Pipes…

Ex.
head -20 *.pl | grep File | sort
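The same three stages can be imitated inside a single Perl program, which may help show what each stage of the pipeline contributes. This is only a sketch: the input lines and the helper `head_lines` are invented for the illustration, and a real pipeline would read from files instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the contents of a small Perl file.
my @lines = (
    '#!/usr/bin/perl',
    'use File::Temp;',
    'use File::Basename;',
    'my $name = "data";',
    '# File names are read from STDIN',
);

# Stage 1 (like "head -N"): keep at most the first $n lines.
sub head_lines {
    my ($n, @in) = @_;
    my $last = $n < @in ? $n - 1 : $#in;
    return @in[0 .. $last];
}

my @first    = head_lines(20, @lines);      # head -20
my @matching = grep { /File/ } @first;      # grep File
my @sorted   = sort @matching;              # sort (lexical)

print "$_\n" foreach @sorted;
```

Each array plays the role of one pipe: the output of a stage is handed unchanged to the next stage.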

UNIX Basics

UNIX Command Summary

pwd - print working directory
cd - change directory
ls - list files
mv - move a file (relocate/rename)
rm - remove a file
cp - copy a file

mkdir - make a new directory
rmdir - remove a directory
more - display the contents of a file (one screen at a time)

chmod - change the permissions on a file
chgrp - change the group associated with a file

UNIX Shell

Shells

a.k.a. command interpreter
the primary user interface to UNIX
interpret and execute commands

1. Interactive use
2. Customization of UNIX session (environment)
3. Programmability

/bin/sh - Bourne shell
/bin/csh - C shell
/bin/bash - Bourne again shell
/bin/tcsh - modified, updated C shell

UNIX Shell

bash

prompt -- by default shows who you are, what machine the shell is running on, and what directory you are in.

PATH -- environment variable that defines where the shell should look for the programs you are running.

/bin
/usr/bin
/usr/local/bin
/usr/X11R6/bin
/usr/sbin
.
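The lookup that the shell performs with PATH can be sketched in Perl. This assumes the usual convention that PATH is a colon-separated string searched from left to right; `path_dirs` and `find_program` are hypothetical helper names, and the example string simply joins the directories listed above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a PATH-style string into its directories, in search order.
sub path_dirs {
    my ($path) = @_;
    return split /:/, $path;
}

# Return the first "dir/program" that is an executable file, or undef.
sub find_program {
    my ($program, $path) = @_;
    foreach my $dir (path_dirs($path)) {
        my $candidate = "$dir/$program";
        return $candidate if -f $candidate && -x $candidate;
    }
    return;
}

# The directories from the slide, joined the way PATH stores them.
my $example_path = "/bin:/usr/bin:/usr/local/bin:/usr/X11R6/bin:/usr/sbin:.";
my @dirs = path_dirs($example_path);
print scalar(@dirs), " directories are searched, starting with $dirs[0]\n";

# Use the real PATH if it is set, otherwise fall back to the example.
my $found = find_program("perl", $ENV{PATH} // $example_path);
print defined $found ? "perl found at $found\n" : "perl not found\n";
```

Because the search stops at the first hit, the order of the directories decides which copy of a program runs when several exist.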

Installing Software

Pre-built vs. source

RPM vs. “raw” binaries

Process
• downloading
• extracting
• compiling
• installation
• configuration

Mini-Tour of UNIX

Go through the most common commands.

Perl

Basics of a Perl program under UNIX

Perl is an interpreted language

The first line of a Perl program (in UNIX) is...
#!/usr/bin/perl

The # character is the comment character.

All single-expression statements must end in a semi-colon.
$area = $pi * $radius * $radius;
while (CONDITION) {
    # some stuff
}

Programming Languages

Input/Output in Perl

Reading in from the keyboard...
$line = <STDIN>;

Filehandles...

File:
open(FH,"filename");      # open for reading
open(FH,">filename");     # open for writing (overwrites the file)
...
$line = <FH>;             # read one line
...
close(FH);

DO HELLO WORLD WALK-THROUGH.
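A minimal walk-through might look like the sketch below. It prints the greeting, then writes two lines to a scratch file and reads them back. Note that it swaps in the three-argument form of `open` with lexical filehandles, a more defensive idiom than the bareword form on the slide, and that the use of `File::Temp` for the scratch file is a choice made for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

print "Hello, world!\n";

# Create a scratch file that is deleted when the program exits.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close($tmp_fh);

# Open for writing ('>') and write two lines.
open(my $out, '>', $filename) or die "Cannot write $filename: $!";
print $out "first line\n";
print $out "second line\n";
close($out);

# Open for reading ('<') and read the file back one line at a time.
open(my $in, '<', $filename) or die "Cannot read $filename: $!";
my @lines_read;
while (my $line = <$in>) {
    chomp $line;                 # strip the trailing newline
    push @lines_read, $line;
}
close($in);

print "Read ", scalar(@lines_read), " lines back from the file.\n";
```

Checking the result of `open` with `or die` reports a missing or unwritable file immediately instead of failing silently on the first read.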

Programming Languages

Data Types

Integer - 0, 1, 2, …, 1000, 1001, …
Floating Point - 0.0, 0.001, 0.0003, 3.14159265, …
Character - a, b, c, d, …, 0, 1, 2, :, !, …

Different languages use different conventions. In Perl, a string is also a basic data type. A string is a sequence of 0 or more characters.

Programming Languages

Variables - Pieces of data stored within a program. (similar to variables in arithmetic)

Scalar variables are distinguished by the ‘$’ at their front.

Any name beginning with a letter is allowed
$a
$a1
$alphabet_soup_is_OK_to_me

Programming Languages

Arithmetic Operations

+ Addition
- Subtraction
* Multiplication
/ Division
% Modulo
++ Increment
-- Decrement

|| Logical OR
&& Logical AND
! Logical Negation

Programming Languages

Arithmetic Operations

Numeric  String  Meaning
==       eq      Equality
!=       ne      Inequality
>        gt      Greater than
>=       ge      … or equal to
<        lt      Less than
<=       le      … or equal to
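The numeric and string forms can give different answers for the same operands, which is why Perl keeps them separate. A small sketch with made-up values:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $x = "1.0";
my $y = "1";

# Compared as numbers, 1.0 equals 1.
my $numeric_equal = ($x == $y) ? 1 : 0;
# Compared as text, "1.0" and "1" are different strings.
my $string_equal  = ($x eq $y) ? 1 : 0;

# Numerically, 10 is not less than 9.
my $numeric_less = (10 < 9) ? 1 : 0;
# As text, "10" sorts before "9" because "1" comes before "9".
my $string_less  = ("10" lt "9") ? 1 : 0;

print "1.0 == 1   : $numeric_equal\n";
print "1.0 eq 1   : $string_equal\n";
print "10 <  9    : $numeric_less\n";
print "10 lt 9    : $string_less\n";
```

As a rule of thumb, use the symbol operators for numbers and the letter operators for text such as `$light eq "green"`.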

Programming LanguagesStatements

A program can be broken down into basic structures called statements. Statements are terminated by a semi-colon.

print "Hello, world!\n";

Assignment statements use a single ‘=’ rather than the ‘==’ of the equality operation.

$pi = 3.1415926;
$area = $pi * $radius * $radius;
$line = <STDIN>;

Programming Languages

Variable Types

Scalar - a single value
Array - a list of values (indexed by sequential number)
Hash - a set of key,value pairs

Prime Numbers = (1, 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, …)

Array (index, value):
0    1
1    2
2    3
3    5
...  ...

Hash (key, value):
First   1
Second  2
Third   3
Fourth  5
...     ...

Programming Languages

Arrays are good when the data is dense, and the algorithm uses a linear access pattern.

Prime Numbers = (1, 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, …)

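Array indexed by the number itself (value 1 marks a prime):
index:  0  1  2  3  4  5  6  7  8  9  10  11
value:     1  1  1  0  1  0  1  0  0  0   1

Array indexed by position in the list:
index:  0  1  2  3  4  5   6   7   8   9   10  11
value:  1  2  3  5  7  11  13  17  19  23  29  31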

Programming Languages

Array (indexed by position in the list):
index:  0  1  2  3  4  5   6   7   8   9   10  11
value:  1  2  3  5  7  11  13  17  19  23  29  31

Hash (keyed by the number itself):
key:    1  2  3  5  7  11  13  17  19  23  29  31
value:  1  1  1  1  1  1   1   1   1   1   1   1

Hash - “associative array”
• array indices can be any unique set of “keys”
• excellent for accessing in random patterns (in sparse data)

(Ex. “is 19 a prime number?”)

Programming Languages

Scalar -- $foo, $a1, $a2000

Array -- @array, @ii

to access the element at index $i
$array[$i]

the last index of an array is $#array

the number of elements in an array is
$num_elements = $#array + 1;
OR
$num_elements = @array;

Programming Languages

Hash -- %hash, %env

to access the element with index of $i
$hash{$i}

to get a list of keys used in a hash
@key_list = keys(%hash);

to determine how many keys are in a hash
$num_elements = @key_list;
OR
$num_elements = keys(%hash);
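The array and hash operations above can be tried together on the prime-number example. This is a sketch under one stated assumption: the list here starts at 2, since 1 is not counted as a prime under the usual definition. The helper `is_listed_prime` is a made-up name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Array: dense data, accessed by position.
my @primes = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31);

my $last_index   = $#primes;      # index of the last element
my $num_elements = @primes;       # array in scalar context gives its size
my $fourth       = $primes[3];    # indices start at 0

# Hash: sparse data, accessed by key. Each listed prime becomes a key.
my %is_prime;
foreach my $p (@primes) {
    $is_prime{$p} = 1;
}
my $num_keys = keys(%is_prime);   # number of keys in the hash

# "Is 19 a prime number?" becomes a single hash lookup.
sub is_listed_prime {
    my ($n) = @_;
    return exists $is_prime{$n} ? 1 : 0;
}

print "Last index: $last_index, elements: $num_elements, keys: $num_keys\n";
print "Fourth prime in the list: $fourth\n";
print "19 listed as prime? ", is_listed_prime(19), "\n";
print "21 listed as prime? ", is_listed_prime(21), "\n";
```

Answering the same question with the array alone would mean scanning it element by element, which is the access pattern the hash avoids.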

Programming Languages

Control of Program Execution

if -- executes a block of code, if the condition evaluates to TRUE

if($light eq "green") {
    continue_driving();
}

if( ($light eq "green") && ($no_traffic) ) {
    continue_driving();
}

Programming Languages

In many cases, a simple if statement is not sufficient, as multiple alternative outcomes need to be evaluated.

if($light eq "green") {
    continue_driving();
} else {
    stop_car();
}

if($light eq "green") {
    continue_driving();
} elsif($light eq "red") {
    stop_car();
} else {
    go_fast_to_beat_the_yellow();
}
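The branches above call subroutines that the slides never define, so they cannot be run as shown. A runnable sketch, assuming a hypothetical function `action_for_light` that returns the name of the chosen action as a string instead of calling it:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the action chosen for a given light colour.
# The action names mirror the subroutine names used on the slide.
sub action_for_light {
    my ($light) = @_;
    if ($light eq "green") {
        return "continue_driving";
    } elsif ($light eq "red") {
        return "stop_car";
    } else {
        # Every other value falls through to the final else branch.
        return "go_fast_to_beat_the_yellow";
    }
}

foreach my $light ("green", "red", "yellow") {
    print "$light -> ", action_for_light($light), "\n";
}
```

Only the first branch whose condition is true runs, so the order of the tests matters when conditions can overlap.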

Programming Languages

Control of Program Execution

Sometimes you need to iterate through a statement multiple times...

Looping constructs:
for (…) { … }
foreach $var (@list) { … }
while (COND) { … }

Programming Languages

Foreach Loop…

foreach $var (@list) {
    do_stuff($var);
}

foreach $name (@name_list) {
    print "Name = $name\n";
}

foreach $name (@name_list) {
    if($hair_color{$name} eq "blond") {
        print "$name has blond hair.\n";
    }
}

Programming Languages

for (INIT; COND; POST) {
    do_stuff();
}

for ($i=0; $i < 50; $i++) {
    print "i = $i\n";
}

for ($i=0; $i < 50; $i++) {
    if($prime{$i} == 1) {
        print "$i is prime!\n";
    } else {
        print "$i is not prime.\n";
    }
}

Programming Languages

while (COND) {
    do_stuff();
}

while($line = <FILE_HANDLE>) {
    print "$line";
}

while($flag == 0) {
    if($prime{$position} == 1) {
        $flag = 1;
    } else {
        $position++;
    }
}
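The last loop searches forward from `$position` until it reaches a prime. A self-contained sketch of that search is shown below. It adds two things that are not on the slide: an upper limit, so the loop cannot run forever once it has passed the last known prime, and a truth test in place of `== 1`, so that missing hash keys do not trigger warnings. The function name and the list of primes (starting at 2) are assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hash of known primes: key = the number, value = 1.
my %prime = map { $_ => 1 } (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31);

# Return the first known prime at or after $position, or undef if
# none is found up to $limit.
sub next_listed_prime {
    my ($position, $limit) = @_;
    while ($position <= $limit) {
        return $position if $prime{$position};
        $position++;
    }
    return;
}

my $found = next_listed_prime(20, 31);
print "First prime at or after 20: $found\n";

my $none = next_listed_prime(32, 40);
print "Nothing found between 32 and 40\n" unless defined $none;
```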

Intermission

Review of Perl Concepts

Data Types
• scalar
• array
• hash

Input/Output
open(FILEHANDLE,"filename");
$line = <FILEHANDLE>;
print "$line";

Arithmetic Operations
+, -, *, /, %
&&, ||, !

Review of Perl Concepts

Control Structures
• if
• if/else
• if/elsif/else
• foreach
• for
• while

Regular Expressions

General approach to the problem of pattern matching

RE’s are a compact method for representing a set of possible strings without explicitly specifying each alternative.

For this portion of the discussion, I will be using {} to represent the scope of a set.

{A}
{A,AA}

Ø = the empty string, so {Ø} is the set containing only the empty string

Regular Expressions

In addition, the [] will be used to denote possible alternatives.

[AB] = {A,B}

With just these semantics available, we can begin building simple Regular Expressions.

[AB][AB] = {AA, AB, BA, BB}
AA[AB]BB = {AAABB, AABBB}

Regular Expressions

Additional Regular Expression components
* = 0 or more of the specified symbol
+ = 1 or more of the specified symbol

A+ = {A, AA, AAA, … }
A* = {Ø, A, AA, AAA, … }

AB* = {A, AB, ABB, ABBB, … }
[AB]* = {Ø, A, B, AA, AB, BA, BB, AAA, … }

Regular Expressions

What if we want a specific number of iterations?

A{2,4} = {AA, AAA, AAAA}
[AB]{1,2} = {A, B, AA, AB, BA, BB}

What if we want any character except one? (The examples here assume an alphabet of only A and B.)
[^A] = {B}

What if we want to allow any symbol?

. = {A, B}

.* = {Ø, A, B, AA, AB, BA, BB, … }

Regular Expressions

All of these operations are available in Perl

Several “shortcuts”

\d = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
\w+\s\w+ = {…, Hello World, … }

Name            Definition              Code
Whitespace      [space, tab, new-line]  \s
Word character  [a-zA-Z_0-9]            \w
Digit           [0-9]                   \d
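Membership in the sets described on the preceding slides can be checked directly in Perl. The sketch below uses a hypothetical helper, `full_match`, that anchors a pattern at both ends so that the whole string must belong to the set, not just a part of it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if the WHOLE string matches the pattern, else 0.
# \A and \z anchor the match to the very start and very end.
sub full_match {
    my ($string, $pattern) = @_;
    return $string =~ /\A(?:$pattern)\z/ ? 1 : 0;
}

# Each row: string, pattern, and a note on why it is or is not in the set.
my @checks = (
    ['AAA',         'A{2,4}',    'three As is between 2 and 4'],
    ['AAAAA',       'A{2,4}',    'five As is too many'],
    ['ABBB',        'AB*',       'an A followed by any number of Bs'],
    ['BA',          '[AB]{1,2}', 'one or two symbols, each A or B'],
    ['',            'A*',        'zero As is allowed by *'],
    ['',            'A+',        'but + needs at least one'],
    ['Hello World', '\w+\s\w+',  'word, whitespace, word'],
    ['7',           '\d',        'a single digit'],
);

foreach my $check (@checks) {
    my ($string, $pattern, $note) = @$check;
    printf "%-13s %-10s %d   (%s)\n",
        "'$string'", $pattern, full_match($string, $pattern), $note;
}
```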

Pattern Matching

Perl supports built-in operations for pattern matching, substitution, and character replacement

Pattern Matching

if($line =~ m/Rn\.\d+/) {
    ...
}

In Perl, a pattern may match any part of the string rather than the whole string. Anchors tie the match to one end of the string.

^ - beginning of string
$ - end of string

Pattern Matching

Back references…

if($line =~ m/(Rn\.\d+)/) {
    # $1 holds the text matched by the first pair of parentheses
    $UniGene_label = $1;
}
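Wrapped in a subroutine, the capture can be tested on a few lines. This is a sketch: `unigene_label` is a made-up name, the input lines are invented, and the dot is escaped as `\.` so that it matches only a literal period.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the first Rn.<digits> label found in a line, or undef.
sub unigene_label {
    my ($line) = @_;
    if ($line =~ m/(Rn\.\d+)/) {
        return $1;          # text captured by the parentheses
    }
    return;
}

# Hypothetical input lines, not taken from a real UniGene file.
foreach my $line ("ID Rn.1234", "see cluster Rn.42 for details", "ID RnX1234") {
    my $label = unigene_label($line);
    print "$line => ", (defined $label ? $label : "(no label)"), "\n";
}
```

With an unescaped dot, the third line would also match, because `.` on its own stands for any character.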

Regular Expressions

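$file = "my_fasta_file";
open(IN, $file) or die "Cannot open $file: $!";
$line_count = 0;
while($line = <IN>) {
    # each FASTA record starts with a header line beginning with '>'
    if($line =~ m/^\>/) {
        $line_count++;
    }
}
close(IN);
print "There are $line_count FASTA sequences in $file.\n";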

Pattern Matching

UniGene data file

ID Bt.1
TITLE Cow casein kinase II alpha …
EXPRESS ;placenta
PROTSIM ORG=Caenorhabditis elegans; …
PROTSIM ORG=Mus musculus; PROTGI=…
SCOUNT 2
SEQUENCE ACC=M93665; NID=g162776; …
SEQUENCE ACC=BF043619; NID=…
//
ID Bt.2
TITLE Bos taurus cyclin-dependent …
...

Pattern Matching

Let’s write a small Perl program to determine how many clusters there are in the Bos taurus UniGene file.
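One possible version of such a program is sketched below. It assumes that every cluster record begins with a line of the form `ID Bt.<number>`, as in the excerpt above, and it reads a shortened copy of that excerpt from an in-memory string so that it runs without the real data file. The name `count_clusters` is invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened excerpt in the layout shown on the previous slide.
my $sample = <<'END';
ID Bt.1
TITLE Cow casein kinase II alpha ...
SCOUNT 2
SEQUENCE ACC=M93665; NID=g162776; ...
//
ID Bt.2
TITLE Bos taurus cyclin-dependent ...
//
END

# Count the records by counting the lines that start a cluster.
sub count_clusters {
    my ($fh) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if $line =~ m/^ID\s+Bt\.\d+/;
    }
    return $count;
}

# Open the string as if it were a file; a real run would open the data file.
open(my $fh, '<', \$sample) or die "Cannot read sample data: $!";
my $clusters = count_clusters($fh);
close($fh);

print "There are $clusters clusters.\n";
```

Counting the `//` separator lines would be an alternative, provided every record, including the last, ends with one.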

Pattern Matching

Now we’ll build a Perl program that can write an HTML file containing some basic links based on the Bos taurus UniGene clustering.

Important:

http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=GID_HERE&dopt=GenBank
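A sketch of the link-building step follows. It rests on two assumptions that should be checked against the real data: that the number after `NID=g` on a SEQUENCE line is the identifier to substitute for GID_HERE, and that SEQUENCE lines follow the `ACC=...; NID=g...;` layout of the excerpt. The subroutine names are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# URL template from the slide; GID_HERE is replaced per sequence.
my $url_template = 'http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=GID_HERE&dopt=GenBank';

# Fill the identifier into a copy of the template.
sub genbank_url {
    my ($gid) = @_;
    (my $url = $url_template) =~ s/GID_HERE/$gid/;
    return $url;
}

# Build one HTML link, using the accession as the link text.
sub html_link {
    my ($acc, $gid) = @_;
    my $href = genbank_url($gid);
    $href =~ s/&/&amp;/g;          # ampersands must be escaped inside HTML
    return qq{<a href="$href">$acc</a>};
}

# Return a link for a SEQUENCE line, or undef for any other line.
sub link_for_line {
    my ($line) = @_;
    if ($line =~ m/^SEQUENCE\s+ACC=(\w+);\s+NID=g(\d+);/) {
        return html_link($1, $2);
    }
    return;
}

my @lines = (
    "ID Bt.1",
    "SEQUENCE ACC=M93665; NID=g162776; ...",
);

print "<html><body><ul>\n";
foreach my $line (@lines) {
    my $link = link_for_line($line);
    print "<li>$link</li>\n" if defined $link;
}
print "</ul></body></html>\n";
```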

Substitution

Pattern matching is useful for counting or indexing items, but to modify the data, substitution is required.

Substitution searches a string for a PATTERN and, if found, replaces it with REPLACEMENT.

$line =~ s/PATTERN/REPLACEMENT/;

Returns a value equal to the number of times the pattern was found and replaced.

$result = $line =~ s/PATTERN/REPLACEMENT/;

Substitution

Substitution can take several different options, specified after the final slash.

The most useful are
g - global (can substitute at more than one location)
i - case insensitive matching

$string = "One fish, Two fish, Red fish, Blue fish.";
$string =~ s/fish/dog/g;
print "$string\n";

One dog, Two dog, Red dog, Blue dog.
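The return value and the two options can be seen together in a short sketch. The second string is invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# /g replaces every occurrence; the return value is the number replaced.
my $string = "One fish, Two fish, Red fish, Blue fish.";
my $replaced = ($string =~ s/fish/dog/g);
print "$string ($replaced replacements)\n";

# Without /i only the lower-case occurrence matches.
my $mixed = "FISH, Fish, fish";
my $case_sensitive = ($mixed =~ s/fish/dog/g);
print "$mixed ($case_sensitive replacement)\n";

# With /i the remaining upper- and mixed-case occurrences match too.
my $case_insensitive = ($mixed =~ s/fish/dog/gi);
print "$mixed ($case_insensitive replacements)\n";
```

When nothing matches, the substitution returns a false value, so the result can also be used directly as the condition of an `if`.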

Substitution

Example: Removing leading and trailing white-space

$line =~ s/^\s*(.*?)\s*$/$1/;

A *? performs a minimal match… it will stop at the first point that the remainder of the expression can be matched.

$line =~ s/^\s*(.*)\s*$/$1/;
This statement will not remove trailing white-space; instead the white-space is retained by the .*
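The difference between the two statements is easy to confirm by running both on the same input. In this sketch the subroutine names are made up, and square brackets are printed around each result so that any surviving spaces are visible.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Minimal match: (.*?) takes as little as possible, so the final \s*
# is left to consume all of the trailing white-space.
sub trim_minimal {
    my ($line) = @_;
    $line =~ s/^\s*(.*?)\s*$/$1/;
    return $line;
}

# Greedy match: (.*) takes as much as possible, including the trailing
# spaces, and the final \s* is left matching nothing.
sub trim_greedy {
    my ($line) = @_;
    $line =~ s/^\s*(.*)\s*$/$1/;
    return $line;
}

my $input = "   ACGT ACGT   ";
print "minimal: [", trim_minimal($input), "]\n";
print "greedy:  [", trim_greedy($input),  "]\n";
```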

Character Replacement

A similar operation to substitution is character replacement.

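$line =~ tr/a-z/A-Z/;               # convert to upper case

$count_CG = $line =~ tr/CG/CG/;     # count the C's and G's (line is unchanged)

$line =~ tr/ACGT/TGCA/;             # complement a DNA sequence

Chained substitutions are not equivalent to the tr above: each later step also changes characters produced by an earlier step.

$line =~ s/A/T/g;
$line =~ s/C/G/g;
$line =~ s/G/C/g;
$line =~ s/T/A/g;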
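Running both approaches on a short, made-up sequence shows why the single `tr` is the one to use for complementing. The subroutine names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# tr maps every character in one pass, so no character is changed twice.
sub complement_tr {
    my ($seq) = @_;
    $seq =~ tr/ACGT/TGCA/;
    return $seq;
}

# Four separate passes: later passes undo the work of earlier ones.
sub complement_chained {
    my ($seq) = @_;
    $seq =~ s/A/T/g;    # A -> T
    $seq =~ s/C/G/g;    # C -> G
    $seq =~ s/G/C/g;    # every G, including the new ones, -> C
    $seq =~ s/T/A/g;    # every T, including the new ones, -> A
    return $seq;
}

# With an empty replacement list, tr only counts and changes nothing.
sub count_cg {
    my ($seq) = @_;
    my $count = ($seq =~ tr/CG//);
    return $count;
}

my $seq = "AACG";
print "tr:      ", complement_tr($seq), "\n";
print "chained: ", complement_chained($seq), "\n";
print "C or G:  ", count_cg($seq), "\n";
```

The chained version returns a sequence containing only A and C, whatever the input was.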

Character Replacement

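$count_CG = 0;
$count_AT = 0;
while($line = <IN>) {
    # accumulate over all lines; plain assignment keeps only the last line's counts
    $count_CG += ($line =~ tr/CG/CG/);
    $count_AT += ($line =~ tr/AT/AT/);
}
$total = $count_CG + $count_AT;
$percent_CG = 100 * ($count_CG/$total);

print "The sequence was $percent_CG% CG-rich.\n";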
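The same calculation can be packaged as a subroutine and tried without an input file. In this sketch the name `percent_cg` and the sample lines are invented, and a guard is added for input that contains no A, C, G or T at all, which would otherwise cause a division by zero.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Percentage of C and G among all A, C, G and T in the given lines.
# Returns undef if the lines contain none of those letters.
sub percent_cg {
    my (@lines) = @_;
    my ($count_cg, $count_at) = (0, 0);
    foreach my $line (@lines) {
        $count_cg += ($line =~ tr/CG//);   # count only, line unchanged
        $count_at += ($line =~ tr/AT//);
    }
    my $total = $count_cg + $count_at;
    return if $total == 0;                 # avoid dividing by zero
    return 100 * $count_cg / $total;
}

# Two made-up sequence lines: 6 of the 8 bases are C or G.
my $percent = percent_cg("ACGT\n", "CCGG\n");
print "The sequence was $percent% CG-rich.\n";
```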

Subroutines

One of the most important aspects of programming is dealing with complexity. A program that is written in one large section is generally more difficult to debug. Thus a major strategy in program development is modularization.

Break the program up into smaller portions that can each be developed and tested independently.

Makes the program more readable, and easier to maintain and modify.

Subroutines

EXAMPLE: Reading in sequences from UniGene.all.seq file

Multiple FASTA sequences in a single file, each annotated with the UniGene cluster they belong to.

GOAL: Make an output file consisting only of the longest sequence from each cluster.

Subroutines

ISSUES:
1. Want to design and implement a usable program.
2. Use subroutines where useful to reduce complexity.
3. Minimize the memory requirements.

(human UniGene seqs > 2 GB)
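One way to approach the goal is sketched below. The header layout is an assumption: each FASTA header is taken to carry its cluster as a `/ug=<cluster>` field, which should be checked against the real file. The subroutine names are made up. To limit memory use, the file is read one line at a time and only the longest sequence seen so far for each cluster is kept, not the whole file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extract the cluster name from a FASTA header (assumed /ug= field).
sub cluster_of {
    my ($header) = @_;
    return $header =~ m{/ug=(\S+)} ? $1 : undef;
}

# Read FASTA records from a filehandle and return a hash reference:
# cluster => { header => ..., seq => ... } for the longest sequence.
sub longest_per_cluster {
    my ($fh) = @_;
    my %best;
    my ($header, $seq) = (undef, '');

    # Compare the record just finished against the best one so far.
    my $finish_record = sub {
        return unless defined $header;
        my $cluster = cluster_of($header);
        return unless defined $cluster;
        if (!exists $best{$cluster}
            || length($seq) > length($best{$cluster}{seq})) {
            $best{$cluster} = { header => $header, seq => $seq };
        }
    };

    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ m/^>/) {          # a header starts a new record
            $finish_record->();
            ($header, $seq) = ($line, '');
        } else {
            $seq .= $line;             # sequence may span several lines
        }
    }
    $finish_record->();                # do not forget the last record
    return \%best;
}

# Hypothetical three-record input standing in for UniGene.all.seq.
my $sample = ">seq1 /ug=Bt.1\nACGT\n>seq2 /ug=Bt.1\nACGT\nACGT\n>seq3 /ug=Bt.2\nGG\n";
open(my $fh, '<', \$sample) or die "Cannot read sample data: $!";
my $best = longest_per_cluster($fh);
close($fh);

foreach my $cluster (sort keys %$best) {
    print "$best->{$cluster}{header}\n$best->{$cluster}{seq}\n";
}
```

If even one sequence per cluster is too much to hold in memory, a two-pass variant could record only lengths and file positions on the first pass and copy the winners on the second.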
