Introduction to UNIX and Perl
Todd Scheetz
Sept. 6, 2001
Computational Methods in Molecular Biology
Definitions
Operating System
• Provides a uniform interface between a computer’s hardware and user-level programs.
• Manages the low-level functionality of the hardware automatically.
Programming Language
• Provides a formal structure/syntax for implementing algorithmic procedures.
What is UNIX?
Operating system developed at Bell Labs.
• Originally written in assembly code
• The C programming language was designed to implement a more portable version of UNIX
Multi-user
Multi-tasking
What is UNIX? (part 2)
Made available with source code at no cost
• could fix bugs, add features or just test alternative methods
• EXCELLENT for learning or teaching
Adopted by Berkeley to make BSD
• virtual memory
• paging
• networking (TCP/IP)
What is UNIX? (part 3)
By programmers, for programmers
• extensive facilities to allow people to work together and share information in controlled ways
• time sharing system
Basic Guidelines
• Principle of least surprise
• Every program should do one thing and do it well
UNIX hierarchy
Layers, from top to bottom (adapted from Tanenbaum, p. 273):
Users
  [User interface]
Standard utility programs (shell, editor, compiler)
  [Library interface]
Standard library (open, close, fork, read, print, etc.)
  [System call interface]
UNIX operating system (process management, memory management, file system, I/O, etc.)
Hardware (CPU, memory, disks, keyboard, etc.)
UNIX Basics
User Accounts - required to log on to the computer with a username and password.
Groups - an entity made up of one or more users.
Sharing...
[Figure: users Bob, Stacie, Diane, Mike, and Bill, divided between group1 and group2]
UNIX Basics
File Sharing - Regulated by three sets of permissions.
Permissions: read, write, execute
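Subjects: owner/user (u), group (g), all other users (o)
Each subject has its own read (r), write (w), and execute (x) setting; in chmod, "a" refers to all three subjects at once.
-rwxr-xr-x foo.pl
-r-xr-xr-x bar.pl
-rw------- secret
-rw-r--r-- public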
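The permission strings above are a display form of nine bits, three per subject. As a rough sketch (the helper name mode_to_string is made up for this illustration), the conversion from an octal mode such as 0755 to the letters shown by ls -l could look like this:

```perl
use strict;
use warnings;

# Convert the low nine permission bits of a file mode into the
# nine-letter form printed by ls -l (without the leading file-type
# character). mode_to_string is a hypothetical helper name.
sub mode_to_string {
    my ($mode) = @_;
    my $string = '';
    foreach my $shift (6, 3, 0) {            # owner, group, other
        my $bits = ($mode >> $shift) & 7;    # three bits for this subject
        $string .= ($bits & 4) ? 'r' : '-';
        $string .= ($bits & 2) ? 'w' : '-';
        $string .= ($bits & 1) ? 'x' : '-';
    }
    return $string;
}

print mode_to_string(0755), "\n";   # same permissions as foo.pl
print mode_to_string(0600), "\n";   # same permissions as secret
```

The octal digits map one-to-one onto the subjects, which is why chmod 755 foo.pl is a common shorthand for the first listing.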
UNIX Basics
Super-user account
• complete access to all files
Required for system administration tasks
• add accounts/groups
• change permissions/owners of any file
• change password of any account
• shut down a machine
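UNIX Basics
UNIX Filesystem Hierarchy
[Figure: directory tree]
/ contains: bin, etc, usr, var, tmp, dev, lib
/usr contains: bin, doc, lib, local
/usr/local contains: bin, etc, lib, tmp
Example paths:
/usr
/usr/bin
/usr/local
/usr/local/bin
Two shortcuts
. - the current directory
.. - the directory one level “up”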
What is UNIX?
Processes
Each program executes as a process
A process provides encapsulation for the program
Under UNIX, multiple processes can be running at the same time!
How to control processes:
^C -- break
^Z -- stop
& -- start in background
ps -- show which processes are running
kill -- kill a process
What is UNIX?
grep - show every line from a file that matches a supplied pattern
Ex. grep sub my_program.pl
(would return every line in the file that contained the string ‘sub’)
ls - list files
Ex. ls *.pl
(would list all files in the current directory that end in ‘.pl’)
head - list the first lines in a file
Ex. head -20 my_program.pl
(would show the first 20 lines from my_program.pl)
sort - performs a lexical sorting of a file
Ex. sort my_program.pl
What is UNIX?
UNIX also provides a method for chaining multiple programs together, so that the output of one program becomes the input of the next.
Pipes…
Ex. head -20 *.pl | grep File | sort
(each "|" is a pipe)
UNIX Basics
UNIX Command Summary
pwd - print working directory
cd - change directory
ls - list files
mv - move a file (relocate/rename)
rm - remove a file
cp - copy a file
mkdir - make a new directory
rmdir - remove a directory
more - display the contents of a file (one screen at a time)
chmod - change the permissions on a file
chgrp - change the group associated with a file
UNIX Shell
Shells
a.k.a. command interpreter
the primary user interface to UNIX
interpret and execute commands
1. Interactive use
2. Customization of UNIX session (environment)
3. Programmability
/bin/sh - Bourne shell
/bin/csh - C shell
/bin/bash - Bourne again shell
/bin/tcsh - modified, updated C shell
UNIX Shell
bash
prompt -- by default shows who you are, what machine the shell is running on, and what directory you are in.
PATH -- environment variable that defines where the shell should look for the programs you are running.
/bin
/usr/bin
/usr/local/bin
/usr/X11R6/bin
/usr/sbin
.
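To make the PATH lookup concrete, here is a minimal sketch of the search a shell performs: try each directory in order and stop at the first executable with the requested name. The helper name find_in_path is an assumption of this example, and real shells add details (caching, built-in commands) that are left out.

```perl
use strict;
use warnings;

# Return the full path of the first executable named $program found in
# @dirs, or undef if none of the directories contains it.
# find_in_path is a hypothetical helper written for this sketch.
sub find_in_path {
    my ($program, @dirs) = @_;
    foreach my $dir (@dirs) {
        my $candidate = "$dir/$program";
        return $candidate if -f $candidate && -x $candidate;
    }
    return undef;
}

# PATH is a colon-separated list of directories.
my @path_dirs = split(/:/, $ENV{PATH} || '');
my $found = find_in_path('perl', @path_dirs);
if (defined $found) {
    print "perl found at $found\n";
} else {
    print "perl is not on the PATH\n";
}
```

Because the search stops at the first hit, the order of the directories in PATH decides which copy of a program runs when several are installed.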
Installing Software
Pre-built vs. source
RPM vs. “raw” binaries
Process
downloading
extracting
compiling
installation
configuration
Mini-Tour of UNIX
Go through the most common commands.
Perl
Basics of a Perl program under UNIX
Perl is an interpreted language
The first line of a Perl program (in UNIX) is...
#!/usr/bin/perl
The # character is the comment character.
All single-expression statements must end in a semi-colon.
$area = $pi * $radius * $radius;
while (CONDITION) {
    # some stuff
}
Programming Languages
Input/Output in Perl
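Reading in from the keyboard...
$line = <STDIN>;
Filehandles...
File:
open(FH, "filename");     # open an existing file for reading
open(FH, ">filename");    # or: open a file for writing
...
$line = <FH>;             # read one line from the file
...
close(FH);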
DO HELLO WORLD WALK-THROUGH.
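A possible starting point for the walk-through is sketched below. It prints the greeting, writes two lines to a scratch file, and reads them back. The file name is made up for the example, and the sketch uses lexical filehandles (my $out) with the three-argument form of open, a more recent idiom than the bareword FH handles shown on the slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

print "Hello, world!\n";

# Hypothetical scratch file in the system temporary directory.
my $filename = File::Spec->catfile(File::Spec->tmpdir(), "hello_$$.txt");

# Open for writing ('>') and print two lines into the file.
open(my $out, '>', $filename) or die "Cannot write $filename: $!";
print $out "Hello, file!\n";
print $out "Goodbye, file!\n";
close($out);

# Open for reading ('<') and collect the lines, minus their newlines.
open(my $in, '<', $filename) or die "Cannot read $filename: $!";
my @lines;
while (my $line = <$in>) {
    chomp($line);
    push(@lines, $line);
}
close($in);
unlink($filename);    # remove the scratch file

foreach my $line (@lines) {
    print "Read back: $line\n";
}
```

Checking the result of open with "or die" is worth the extra words: without it, a missing file silently produces an empty read loop.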
Programming Languages
Data Types
Integer - 0, 1, 2, …, 1000, 1001, …
Floating Point - 0.0, 0.001, 0.0003, 3.14159265, …
Character - a, b, c, d, …, 0, 1, 2, :, !, …
Different languages use different conventions. In Perl, a string is also a basic data type. A string is a sequence of 0 or more characters.
Programming Languages
Variables - Pieces of data stored within a program (similar to variables in algebra).
Scalar variables are distinguished by the ‘$’ at their front.
Any name beginning with a letter is allowed:
$a
$a1
$alphabet_soup_is_OK_to_me
Programming Languages
Arithmetic Operations
+ Addition
- Subtraction
* Multiplication
/ Division
% Modulo
++ Increment
-- Decrement
|| Logical OR
&& Logical AND
! Logical Negation
Programming Languages
Arithmetic Operations
Numeric  String  Meaning
==       eq      Equality
!=       ne      Inequality
>        gt      Greater than
>=       ge      … or equal to
<        lt      Less than
<=       le      … or equal to
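Perl keeps separate operators for comparing values as numbers and as strings, and picking the wrong family is a common source of bugs. The short sketch below (helper names are made up for illustration) shows the same pair of values giving different answers under the two kinds of comparison.

```perl
use strict;
use warnings;

# Hypothetical helpers: compare two values as numbers and as strings.
sub same_number { my ($x, $y) = @_; return $x == $y ? 1 : 0; }
sub same_string { my ($x, $y) = @_; return $x eq $y ? 1 : 0; }

my ($first, $second) = ("10", "10.0");
print "same number\n"       if same_number($first, $second);
print "different strings\n" unless same_string($first, $second);

# String comparison works character by character, so "9" sorts after
# "10" because the character 9 comes after the character 1.
my $string_order  = ("9" lt "10") ? "yes" : "no";
my $numeric_order = (9 < 10)      ? "yes" : "no";
print "9 before 10 as strings? $string_order\n";
print "9 before 10 as numbers? $numeric_order\n";
```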
Programming Languages
Statements
A program can be broken down into basic structures called statements. Statements are terminated by a semi-colon.
print "Hello, world!\n";
Assignment statements use a single ‘=’ rather than the ‘==’ of the equality operation.
$pi = 3.1415926;
$area = $pi * $radius * $radius;
$line = <STDIN>;
Programming Languages
Variable Types
Scalar - a single value
Array - a list of values (indexed by sequential number)
Hash - a set of key/value pairs
Prime Numbers = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, …)
Array (index: value)
0: 2
1: 3
2: 5
3: 7
...
Hash (key: value)
First: 2
Second: 3
Third: 5
Fourth: 7
...
Programming Languages
Arrays are good when the data is dense and the algorithm uses a linear access pattern.
Prime Numbers = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, …)
Array indexed by the number itself (1 = prime, 0 = not prime):
index:  0  1  2  3  4  5  6  7  8  9 10 11
value:  0  0  1  1  0  1  0  1  0  0  0  1
Array of the primes in order:
index:  0  1  2  3  4  5  6  7  8  9 10 11
value:  2  3  5  7 11 13 17 19 23 29 31 37
Programming Languages
Array of the primes in order:
index:  0  1  2  3  4  5  6  7  8  9 10 11
value:  2  3  5  7 11 13 17 19 23 29 31 37
Hash keyed by the primes themselves:
key:    2  3  5  7 11 13 17 19 23 29 31 37
value:  1  1  1  1  1  1  1  1  1  1  1  1
Hash - “associative array”
• array indices can be any unique set of “keys”
• excellent for accessing in random patterns (in sparse data)
(Ex. “is 19 a prime number?”)
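The question in the example can be answered with a single hash lookup. The sketch below builds a hash from a short list of primes and then tests membership; the names %is_prime and in_prime_list are chosen for this illustration only.

```perl
use strict;
use warnings;

my @primes = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31);

# Build the hash once: each prime becomes a key with the value 1.
my %is_prime;
foreach my $p (@primes) {
    $is_prime{$p} = 1;
}

# A hash lookup answers the question directly, without scanning @primes.
sub in_prime_list {
    my ($n) = @_;
    return exists $is_prime{$n} ? 1 : 0;
}

foreach my $n (19, 20) {
    my $verdict = in_prime_list($n) ? "is" : "is not";
    print "$n $verdict in the list of primes.\n";
}
```

With an array, the same question would need a loop over the elements; the hash trades a little memory for a direct answer.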
Programming Languages
Scalar -- $foo, $a1, $a2000
Array -- @array, @ii
to access the element at index $i
$array[$i]
the last index of an array is $#array
the number of elements in an array is
$num_elements = $#array + 1;
OR
$num_elements = @array;
Programming Languages
Hash -- %hash, %env
to access the element with key $i
$hash{$i}
to get a list of keys used in a hash
@key_list = keys(%hash);
to determine how many keys are in a hash
$num_elements = @key_list;
OR
$num_elements = keys(%hash);
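The array and hash operations from the last two slides can be tried together in a few lines. The variable names and the codon data are picked for this sketch; the point is the difference between an element ($codons[0]), the last index ($#codons), and the count obtained by using the array or keys() where a single value is expected.

```perl
use strict;
use warnings;

# Array: an ordered list, indexed from 0.
my @codons = ("ATG", "TAA", "TAG", "TGA");
my $last_index   = $#codons;     # index of the final element
my $num_elements = @codons;      # array used where a scalar is expected
print "Element 0 is $codons[0]; the last index is $last_index.\n";
print "The array holds $num_elements elements.\n";

# Hash: values looked up by key instead of by position.
my %amino_acid = ("ATG" => "Met", "TGG" => "Trp");
my @key_list = keys(%amino_acid);    # key order is not guaranteed
my $num_keys = keys(%amino_acid);    # number of keys
print "ATG codes for $amino_acid{ATG}; the hash has $num_keys keys.\n";
```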
Programming Languages
Control of Program Execution
if -- executes a block of code, if the condition evaluates to TRUE
if ($light eq "green") {
    continue_driving();
}

if ( ($light eq "green") && ($no_traffic) ) {
    continue_driving();
}
Programming Languages
In many cases, a simple if statement is not sufficient, as multiple alternative outcomes need to be evaluated.
if ($light eq "green") {
    continue_driving();
} else {
    stop_car();
}

if ($light eq "green") {
    continue_driving();
} elsif ($light eq "red") {
    stop_car();
} else {
    go_fast_to_beat_the_yellow();
}
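The snippets above call subroutines that the slides never define, so they cannot be run as they stand. A self-contained variant might return a description of the action instead; the subroutine name and the returned strings are inventions of this sketch.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the driving subroutines on the slide:
# return a description of the action instead of performing it.
sub action_for_light {
    my ($light) = @_;
    if ($light eq "green") {
        return "continue driving";
    } elsif ($light eq "red") {
        return "stop the car";
    } else {
        return "prepare to stop";
    }
}

foreach my $light ("green", "red", "yellow") {
    my $action = action_for_light($light);
    print "$light: $action\n";
}
```

Only one branch of an if/elsif/else chain ever runs, and the final else catches every value not named explicitly.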
Programming Languages
Control of Program Execution
Sometimes you need to iterate through a statement multiple times...
Looping constructs:
for (…) { … }
foreach $var (@list) { … }
while (COND) { … }
Programming Languages
Foreach Loop…
foreach $var (@list) {
    do_stuff($var);
}

foreach $name (@name_list) {
    print "Name = $name\n";
}

foreach $name (@name_list) {
    if ($hair_color{$name} eq "blond") {
        print "$name has blond hair.\n";
    }
}
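To run the last loop, @name_list and %hair_color need contents. The sketch below fills them with made-up data (the hair colors are invented for this example) and also collects the matching names in an array.

```perl
use strict;
use warnings;

# Illustrative data only; the hair colors are invented.
my @name_list  = ("Bob", "Stacie", "Diane");
my %hair_color = ("Bob" => "brown", "Stacie" => "blond", "Diane" => "black");

my @blonds;
foreach my $name (@name_list) {
    print "Name = $name\n";
    if ($hair_color{$name} eq "blond") {
        print "$name has blond hair.\n";
        push(@blonds, $name);    # remember the match for later use
    }
}
print "Number of blond people: ", scalar(@blonds), "\n";
```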
Programming Languages
for (INIT; COND; POST) {
    do_stuff();
}

for ($i = 0; $i < 50; $i++) {
    print "i = $i\n";
}

for ($i = 0; $i < 50; $i++) {
    if ($prime{$i} == 1) {
        print "$i is prime!\n";
    } else {
        print "$i is not prime.\n";
    }
}
Programming Languages
while (COND) {
    do_stuff();
}

while ($line = <FILE_HANDLE>) {
    print "$line";
}

while ($flag == 0) {
    if ($prime{$position} == 1) {
        $flag = 1;
    } else {
        $position++;
    }
}
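The for and while examples rely on a %prime hash that the slides never fill in. One way to make them runnable, sketched here with a simple trial-division test (is_prime is a helper written for this example, not something from the slides), is to fill the hash with a for loop and then search it with the while loop.

```perl
use strict;
use warnings;

# Trial division: $n is prime if it is at least 2 and no $d with
# $d * $d <= $n divides it evenly.
sub is_prime {
    my ($n) = @_;
    return 0 if $n < 2;
    for (my $d = 2; $d * $d <= $n; $d++) {
        return 0 if $n % $d == 0;
    }
    return 1;
}

# for loop: record 1 or 0 for every number below 50.
my %prime;
for (my $i = 0; $i < 50; $i++) {
    $prime{$i} = is_prime($i);
}

# while loop: advance $position until it lands on a prime.
my $position = 24;
my $flag = 0;
while ($flag == 0) {
    if ($prime{$position} == 1) {
        $flag = 1;
    } else {
        $position++;
    }
}
print "The first prime at or after 24 is $position.\n";
```

The while loop only terminates here because a prime exists below 50 after the starting position; a search that might run past the filled range would need an extra bounds check.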
Intermission
Review of Perl Concepts
Data Types
scalar
array
hash
Input/Output
open(FILEHANDLE, "filename");
$line = <FILEHANDLE>;
print "$line";
Arithmetic Operations
+, -, *, /, %
&&, ||, !
Review of Perl Concepts
Control Structures
if
if/else
if/elsif/else
foreach
for
while
Regular Expressions
General approach to the problem of pattern matching
REs are a compact method for representing a set of possible strings without explicitly specifying each alternative.
For this portion of the discussion, I will be using {} to represent the scope of a set.
{A}
{A,AA}
Ø = the empty string (a string of zero characters)
Regular Expressions
In addition, the [] will be used to denote possible alternatives.
[AB] = {A,B}
With just these semantics available, we can begin building simple Regular Expressions.
[AB][AB] = {AA, AB, BA, BB}
AA[AB]BB = {AAABB, AABBB}
Regular Expressions
Additional Regular Expression components
* = 0 or more of the specified symbol
+ = 1 or more of the specified symbol
A+ = {A, AA, AAA, … }
A* = {Ø, A, AA, AAA, … }
AB* = {A, AB, ABB, ABBB, … }
[AB]* = {Ø, A, B, AA, AB, BA, BB, AAA, … }
Regular Expressions
What if we want a specific number of iterations?
A{2,4} = {AA, AAA, AAAA}
[AB]{1,2} = {A, B, AA, AB, BA, BB}
What if we want any character except one?
[^A] = {B}   (when A and B are the only symbols)
What if we want to allow any symbol?
. = {A, B}   (when A and B are the only symbols)
.* = {Ø, A, B, AA, AB, BA, BB, … }
Regular Expressions
All of these operations are available in Perl
Several “shortcuts”
\d = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
\w+\s\w+ = {…, Hello World, … }
Name            Definition              Code
Whitespace      [space, tab, new-line]  \s
Word character  [a-zA-Z_0-9]            \w
Digit           [0-9]                   \d
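The set notation used on these slides can be checked directly in Perl. In the sketch below, in_set is a hypothetical helper that anchors a pattern at both ends, so it reports whether the whole string belongs to the set the pattern describes. The (?: ) around the pattern is a grouping that does not capture anything.

```perl
use strict;
use warnings;

# Return 1 if the entire string is a member of the set described by
# $pattern, 0 otherwise. in_set is a made-up helper for this sketch.
sub in_set {
    my ($string, $pattern) = @_;
    return $string =~ m/^(?:$pattern)$/ ? 1 : 0;
}

print "AB in [AB][AB]?      ", in_set('AB', '[AB][AB]'), "\n";
print "AAABB in AA[AB]BB?   ", in_set('AAABB', 'AA[AB]BB'), "\n";
print "empty string in A*?  ", in_set('', 'A*'), "\n";
print "empty string in A+?  ", in_set('', 'A+'), "\n";
print "AAAAA in A{2,4}?     ", in_set('AAAAA', 'A{2,4}'), "\n";
print "Hello World matches? ", in_set('Hello World', '\w+\s\w+'), "\n";
```

The patterns are written in single quotes so that the backslashes in \w and \s reach the regular expression engine unchanged.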
Pattern Matching
Perl supports built-in operations for pattern matching, substitution, and character replacement
Pattern Matching
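if ($line =~ m/Rn\.\d+/) {    # \. matches a literal dot; a bare . would match any character
    ...
}
In Perl, an RE can match a part of the string rather than the whole string. Anchors tie the match to a position:
^ - beginning of string
$ - end of string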
Pattern Matching
Back references…
if ($line =~ m/(Rn\.\d+)/) {
    $UniGene_label = $1;    # $1 holds the text matched inside the first ( )
}
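A runnable version of the capture idea is sketched below. The helper names and the sample input lines are assumptions of this example; the second helper shows that each additional pair of parentheses fills the next numbered variable ($2, $3, and so on).

```perl
use strict;
use warnings;

# Return the UniGene label (such as Rn.1234) found in $line, or undef.
sub unigene_label {
    my ($line) = @_;
    if ($line =~ m/(Rn\.\d+)/) {
        return $1;
    }
    return undef;
}

# Two pairs of parentheses: $1 is the organism code, $2 the number.
sub split_label {
    my ($label) = @_;
    if ($label =~ m/^(\w+)\.(\d+)$/) {
        return ($1, $2);
    }
    return ();
}

my $label = unigene_label("ID          Rn.1234");
my ($organism, $number) = split_label($label);
print "Label $label: organism code $organism, cluster number $number\n";
```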
Regular Expressions
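$file = "my_fasta_file";
open(IN, $file) or die "Cannot open $file: $!";
$line_count = 0;
while ($line = <IN>) {
    # every FASTA record starts with a header line beginning with '>'
    if ($line =~ m/^>/) {
        $line_count++;
    }
}
close(IN);
print "There are $line_count FASTA sequences in $file.\n";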
Pattern Matching
UniGene data file
ID Bt.1
TITLE Cow casein kinase II alpha …
EXPRESS ;placenta
PROTSIM ORG=Caenorhabditis elegans; …
PROTSIM ORG=Mus musculus; PROTGI=…
SCOUNT 2
SEQUENCE ACC=M93665; NID=g162776; …
SEQUENCE ACC=BF043619; NID=…
//
ID Bt.2
TITLE Bos taurus cyclin-dependent …
...
Pattern Matching
Let’s write a small Perl program to determine how many clusters there are in the Bos taurus UniGene file.
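One possible version of this program is sketched below. It assumes, based on the sample record shown earlier, that every cluster begins with a line starting with "ID" followed by a Bt.N label. The file name Bt.data and the subroutine name count_clusters are placeholders chosen for this sketch.

```perl
use strict;
use warnings;

# Count the cluster records readable from the filehandle $in.
# Assumption: each cluster starts with a line such as "ID Bt.1".
sub count_clusters {
    my ($in) = @_;
    my $count = 0;
    while (my $line = <$in>) {
        if ($line =~ m/^ID\s+Bt\.\d+/) {
            $count++;
        }
    }
    return $count;
}

# "Bt.data" is a hypothetical file name; substitute the real path.
my $file = "Bt.data";
if (open(my $in, '<', $file)) {
    my $clusters = count_clusters($in);
    close($in);
    print "There are $clusters clusters in $file.\n";
} else {
    warn "Cannot open $file: $!\n";
}
```

Counting the "//" separator lines would be an alternative, under the assumption that every record is terminated by one.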
Pattern Matching
Now we’ll build a Perl program that can write an HTML file containing some basic links based on the Bos taurus UniGene clustering.
Important:
http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=GID_HERE&dopt=GenBank
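A rough sketch of such a program follows. It rests on two assumptions taken from the sample record shown earlier and not otherwise verified: that SEQUENCE lines have the form "ACC=...; NID=g<digits>;", and that the digits after the "g" are the GID expected by the URL above. The subroutine names and the page layout are inventions of this example.

```perl
use strict;
use warnings;

# URL from the slide; GID_HERE is replaced for every sequence.
my $URL_TEMPLATE = 'http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=GID_HERE&dopt=GenBank';

# Turn one SEQUENCE line into an HTML link, or return undef if the
# line does not look like a SEQUENCE line (format is an assumption).
sub sequence_link {
    my ($line) = @_;
    if ($line =~ m/^SEQUENCE\s+ACC=(\w+);\s*NID=g(\d+)/) {
        my ($accession, $gid) = ($1, $2);
        my $url = $URL_TEMPLATE;
        $url =~ s/GID_HERE/$gid/;
        $url =~ s/&/&amp;/g;      # & must be escaped inside HTML attributes
        return "<a href=\"$url\">$accession</a>";
    }
    return undef;
}

# Read UniGene records from $in and write a simple HTML page to $out:
# one heading per cluster ID, one link per sequence.
sub clusters_to_html {
    my ($in, $out) = @_;
    print {$out} "<html><body>\n";
    while (my $line = <$in>) {
        if ($line =~ m/^ID\s+(\S+)/) {
            print {$out} "<h2>$1</h2>\n";
        } else {
            my $link = sequence_link($line);
            print {$out} "$link<br>\n" if defined $link;
        }
    }
    print {$out} "</body></html>\n";
}
```

A call such as clusters_to_html(\*STDIN, \*STDOUT) would let the program be used in a pipe, with the data file redirected in and the HTML redirected to a file.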
Substitution
Pattern matching is useful for counting or indexing items, but to modify the data, substitution is required.
Substitution searches a string for a PATTERN and, if found, replaces it with REPLACEMENT.
$line =~ s/PATTERN/REPLACEMENT/;
Returns a value equal to the number of times the pattern was found and replaced.
$result = $line =~ s/PATTERN/REPLACEMENT/;
Substitution
Substitution can take several different options, specified after the final slash.
The most useful are
g - global (can substitute at more than one location)
i - case insensitive matching
$string = "One fish, Two fish, Red fish, Blue fish.";
$string =~ s/fish/dog/g;
print "$string\n";
Output:
One dog, Two dog, Red dog, Blue dog.
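The effect of the two options, and the count returned by a substitution, can be seen side by side in a small experiment. The test string below is a variation on the slide's example with one word in upper case, chosen for this sketch so that the i option makes a visible difference.

```perl
use strict;
use warnings;

my $original = "One fish, Two fish, Red FISH, Blue fish.";

# No options: only the first match is replaced.
my $first_only  = $original;
my $count_first = ($first_only =~ s/fish/dog/);

# g: every lower-case match is replaced; FISH is left alone.
my $global       = $original;
my $count_global = ($global =~ s/fish/dog/g);

# g and i together: matches are found regardless of case.
my $any_case       = $original;
my $count_any_case = ($any_case =~ s/fish/dog/gi);

print "$count_first: $first_only\n";
print "$count_global: $global\n";
print "$count_any_case: $any_case\n";
```

Each substitution works on a copy because s/// modifies the variable it is bound to.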
Substitution
Example: Removing leading and trailing white-space
$line =~ s/^\s*(.*?)\s*$/$1/;
A *? performs a minimal match: it will stop at the first point at which the remainder of the expression can be matched.
$line =~ s/^\s*(.*)\s*$/$1/;
This statement will not remove trailing white-space; instead, the white-space is retained by the greedy .*
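The difference between the two statements is easy to see when both are wrapped in subroutines and run on the same input. The names trim and trim_greedy are chosen for this sketch; square brackets in the output make the remaining white-space visible.

```perl
use strict;
use warnings;

# Minimal match: (.*?) gives the trailing white-space back to \s*.
sub trim {
    my ($line) = @_;
    $line =~ s/^\s*(.*?)\s*$/$1/;
    return $line;
}

# Greedy match: (.*) swallows the trailing white-space itself.
sub trim_greedy {
    my ($line) = @_;
    $line =~ s/^\s*(.*)\s*$/$1/;
    return $line;
}

my $input = "   ACGT   ";
print "[", trim($input), "]\n";
print "[", trim_greedy($input), "]\n";
```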
Character Replacement
A similar operation to substitution is character replacement.
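$line =~ tr/a-z/A-Z/;             # convert to upper case
$count_CG = $line =~ tr/CG/CG/;   # count the C and G characters (tr returns the count)
$line =~ tr/ACGT/TGCA/;           # complement a DNA sequence in one pass
The same complement cannot be built from successive substitutions, because each later substitution also changes characters produced by an earlier one:
$line =~ s/A/T/g;
$line =~ s/C/G/g;
$line =~ s/G/C/g;
$line =~ s/T/A/g;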
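Running both approaches on a short sequence shows why tr is the right tool for complementing. The subroutine names are made up for this sketch.

```perl
use strict;
use warnings;

# tr replaces every character in a single pass over the string.
sub complement_tr {
    my ($sequence) = @_;
    $sequence =~ tr/ACGT/TGCA/;
    return $sequence;
}

# Four passes: by the time s/T/A/g runs, the Ts produced from the
# original As are converted straight back, and likewise for C and G.
sub complement_subst {
    my ($sequence) = @_;
    $sequence =~ s/A/T/g;
    $sequence =~ s/C/G/g;
    $sequence =~ s/G/C/g;
    $sequence =~ s/T/A/g;
    return $sequence;
}

my $sequence = "AACG";
print "tr/ACGT/TGCA/ gives ", complement_tr($sequence), "\n";
print "four s///g give     ", complement_subst($sequence), "\n";
```

For the input AACG the single-pass version yields TTGC, while the four substitutions end at AACC, a string containing only A and C.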
Character Replacement
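$count_CG = 0;
$count_AT = 0;
while ($line = <IN>) {
    # tr returns the count for this line only, so add it to the running total
    $count_CG += ($line =~ tr/CG/CG/);
    $count_AT += ($line =~ tr/AT/AT/);
}
$total = $count_CG + $count_AT;
$percent_CG = 100 * ($count_CG / $total);
print "The sequence was $percent_CG% CG-rich.\n";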
Subroutines
One of the most important aspects of programming is dealing with complexity. A program that is written in one large section is generally more difficult to debug. Thus a major strategy in program development is modularization.
Break the program up into smaller portions that can each be developed and tested independently.
Makes the program more readable, and easier to maintain and modify.
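As a small illustration of the idea, the sketch below packages a CG-content calculation as a subroutine. The name gc_percent and the choice to return 0 for an empty sequence are assumptions of this example. Arguments arrive in the special array @_ and the result is handed back with return.

```perl
use strict;
use warnings;

# Return the percentage of C and G among the A, C, G, T characters of
# $sequence (upper or lower case). Returns 0 when there are none.
sub gc_percent {
    my ($sequence) = @_;
    my $count_CG = ($sequence =~ tr/CGcg//);   # tr counts without changing
    my $count_AT = ($sequence =~ tr/ATat//);
    my $total = $count_CG + $count_AT;
    return 0 if $total == 0;                   # avoid dividing by zero
    return 100 * $count_CG / $total;
}

# The same code is reused for every sequence.
foreach my $sequence ("ACGT", "GGCC", "ATATAT") {
    my $percent = gc_percent($sequence);
    print "$sequence is $percent% CG\n";
}
```

Because the calculation lives in one place, it can be tested on tiny inputs before it is applied to a large file.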
Subroutines
EXAMPLE:
Reading in sequences from the UniGene.all.seq file
Multiple FASTA sequences in a single file, each annotated with the UniGene cluster it belongs to.
GOAL: Make an output file consisting only of the longest sequence from each cluster.
Subroutines
ISSUES:
1. Want to design and implement a usable program.
2. Use subroutines where useful to reduce complexity.
3. Minimize the memory requirements.
(human UniGene seqs > 2 GB)
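One way to meet all three goals is sketched below. It depends on two assumptions that should be checked against the real file: that each FASTA header carries the cluster in a tag of the form /ug=Bt.1, and that all sequences of a cluster are stored next to each other. Under those assumptions only the current record and the best record of the current cluster are held in memory, however large the file is. All subroutine names are invented for this sketch.

```perl
use strict;
use warnings;

# ASSUMPTION: the header contains a tag such as "/ug=Bt.1".
sub cluster_of {
    my ($header) = @_;
    return $header =~ m{/ug=(\w+\.\d+)} ? $1 : '';
}

# Count the letters of a sequence, ignoring newlines.
sub seq_length {
    my ($seq) = @_;
    return ($seq =~ tr/A-Za-z//);
}

# Copy the longest sequence of each cluster from $in to $out.
# ASSUMPTION: sequences of one cluster are contiguous in the input.
sub longest_per_cluster {
    my ($in, $out) = @_;
    my ($header, $seq) = (undef, '');                  # record being read
    my ($best_cluster, $best_record, $best_len) = ('', '', -1);

    # Compare a completed record with the best one seen for its cluster.
    my $finish_record = sub {
        return unless defined $header;
        my $cluster = cluster_of($header);
        if ($cluster ne $best_cluster) {
            print {$out} $best_record;                 # previous cluster is done
            ($best_cluster, $best_record, $best_len) = ($cluster, '', -1);
        }
        my $len = seq_length($seq);
        if ($len > $best_len) {
            ($best_record, $best_len) = ($header . $seq, $len);
        }
    };

    while (my $line = <$in>) {
        if ($line =~ m/^>/) {          # a header starts a new record
            $finish_record->();
            ($header, $seq) = ($line, '');
        } elsif (defined $header) {
            $seq .= $line;
        }
    }
    $finish_record->();                # last record in the file
    print {$out} $best_record;         # best record of the last cluster
}
```

If the clusters turned out not to be contiguous, a hash keyed by cluster label would be needed to remember the best sequence for each cluster, at the cost of keeping one sequence per cluster in memory.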