STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology

STAT115

Lab 3 PART I

Homework Q8 The Dot Matrix Method


The Dot Matrix Method

Gets you started thinking about sequence alignment in general.

Provides a ‘Gestalt’ of all possible alignments between two sequences.

To begin, I will use a very simple 1/0 (match/no-match) identity scoring function without any windowing. As you will see later today, more complex scoring functions are normally used in sequence analysis (especially with amino acid sequences).

A general way to see similarities in pair-wise comparisons:


Since this is a comparison of a sequence with itself, an intra-sequence comparison, the most obvious feature is the main identity diagonal. Two short perfect palindromes can also be seen as crosses directly off the main diagonal; they are “ANA” and “SIS.”
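The figure for this slide is not reproduced in the transcript. As a minimal sketch, assuming the compared sequence is the string SEQUENCEANALYSISPRIMER (an assumption, chosen only because it contains the “ANA”, “SIS” and ‘PRIMER’ pieces the text mentions), an unwindowed 1/0 identity dot matrix could be built in R roughly like this. The helper name dot_matrix is hypothetical.

```r
# Minimal sketch of an unwindowed 1/0 identity dot matrix.
# dot_matrix() is a hypothetical helper, not part of any package.
dot_matrix <- function(a, b) {
  x <- strsplit(a, "")[[1]]        # letters of the first sequence (rows)
  y <- strsplit(b, "")[[1]]        # letters of the second sequence (columns)
  m <- outer(x, y, "==") * 1L      # 1 where the letters match, 0 otherwise
  dimnames(m) <- list(x, y)
  m
}

s <- "SEQUENCEANALYSISPRIMER"      # assumed example sequence
m <- dot_matrix(s, s)

all(diag(m) == 1)                  # the main identity diagonal is all dots
m[9, 11] == 1 && m[11, 9] == 1     # off-diagonal dots of the palindrome "ANA"
m[14, 16] == 1 && m[16, 14] == 1   # off-diagonal dots of the palindrome "SIS"
```

Calling image(m) draws the matrix; in that view the main diagonal runs from the bottom left to the top right.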


The biggest asset of dot matrix analysis is that it allows you to visualize the entire comparison at once, not concentrating on any one ‘optimal’ region, but rather giving you the ‘Gestalt’ of the whole thing.


Here you can easily see the effect of a sequence ‘insertion’ or ‘deletion.’ It is impossible to tell whether the evolutionary event that caused the discrepancy between the two sequences was an insertion or a deletion, and hence this phenomenon is called an ‘indel.’ A jump or shift in the register of the main diagonal on a dotplot clearly points out the existence of an indel (again using the 1:0 match scoring function).

Check out the ‘mutated’ inter-sequence comparison below:
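The plot itself is missing from the transcript, so here is a small sketch of the same effect. The two strings are assumptions for illustration (the second one is the first with ‘ANALYSIS’ removed), and dot_matrix is a hypothetical helper. Tabulating the offset (column minus row) of every dot shows the diagonal shifting register after the indel.

```r
# Hypothetical helper: 1/0 identity dot matrix of two strings.
dot_matrix <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  outer(x, y, "==") * 1L
}

a <- "SEQUENCEANALYSISPRIMER"      # assumed original sequence
b <- "SEQUENCEPRIMER"              # assumed 'mutated' sequence: ANALYSIS removed
m <- dot_matrix(a, b)

# Each dot lies on the diagonal with offset = column - row.
hits   <- which(m == 1, arr.ind = TRUE)
offset <- hits[, "col"] - hits[, "row"]
counts <- table(offset)
counts                             # large counts at offsets 0 and -8
```

The run at offset 0 is the shared prefix, and the run at offset -8 is the part after the gap: the 8-position jump in register is the indel. The remaining offsets hold only scattered single dots.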


Another phenomenon that is very easy to visualize with dot matrix analysis is duplication, also called direct repeats. These are shown in the following example:

The ‘duplication’ here is seen as a distinct column of diagonals; whenever you see either a row or column of diagonals in a dotplot, you are looking at direct repeats.
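To make the ‘row or column of diagonals’ idea concrete, the sketch below finds the longest run of consecutive matches on every diagonal. The strings and the helper name longest_run_by_offset are assumptions for illustration; a direct repeat shows up as more than one offset with a long run.

```r
# Hypothetical helper: longest run of matches on each diagonal,
# where a diagonal is identified by its offset (column - row).
longest_run_by_offset <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  m <- outer(x, y, "==")                      # logical match matrix
  offsets <- (1 - length(x)):(length(y) - 1)
  runs <- sapply(offsets, function(d) {
    v <- m[col(m) - row(m) == d]              # cells of one diagonal, in order
    r <- rle(v)                               # run-length encoding of TRUE/FALSE
    if (any(r$values)) max(r$lengths[r$values]) else 0L
  })
  names(runs) <- offsets
  runs
}

a <- "ANALYSIS"                               # assumed sequence
b <- "ANALYSISANALYSIS"                       # assumed sequence with a duplication
runs <- longest_run_by_offset(a, b)
long_diagonals <- names(runs)[runs >= 4]      # diagonals with at least 4 matches in a row
long_diagonals                                # offsets "0" and "8"
```

Two full-length diagonals, eight columns apart, are exactly the stacked diagonals that mark a direct repeat on the plot.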


Now consider the more complicated ‘mutation’ in the following comparison:

Again, notice the diagonals. However, they have now been displaced off the center diagonal of the plot and, in this example, show the occurrence of a ‘transposition.’ Dot matrix analysis is one of the few sensible ways to locate such transpositions in sequences. Inverted repeats still show up as lines perpendicular to the diagonals; they are just no longer at the center of the plot. The ‘deletion’ of ‘PRIMER’ is shown by the lack of a corresponding diagonal.


Filtered Windowing

Reconsider the same plot. Notice the extraneous dots that indicate neither runs of identity between the two sequences nor inverted repeats. These merely contribute ‘noise’ to the plot and are due to the ‘random’ occurrence of the letters in the sequences, that is, the composition of the sequences themselves.

How can we ‘clean up’ the plots so that this noise does not detract from our interpretations? Consider the implementation of a filtered windowing approach; a dot will only be placed if some ‘stringency’ is met.

What is meant by this is that a dot is placed at the middle of a window of some defined size if, and only if, some defined criterion is met within that window. Then the window is shifted one position and the entire process is repeated. This very successfully rids the plot of unwanted noise.


In this plot a window of size three and a stringency of two is used to considerably improve the signal-to-noise ratio (remember, I am using a 1:0 identity scoring function).
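One possible reading of the windowing rule, sketched in R: slide a window along each diagonal, count the matches inside it, and place a dot at the window's middle cell only when the count reaches the stringency. The function name filtered_dot_matrix, the restriction to odd window sizes and the choice to leave the border cells empty are all assumptions of this sketch, not something the slides specify.

```r
# Hypothetical helper: windowed 1/0 dot matrix.
# Assumes an odd window size so that the window has a middle cell.
filtered_dot_matrix <- function(a, b, window = 3, stringency = 2) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  stopifnot(window %% 2 == 1, length(x) >= window, length(y) >= window)
  m    <- outer(x, y, "==") * 1L           # raw 1/0 identity matrix
  half <- (window - 1) %/% 2
  k    <- -half:half                       # positions inside the window
  out  <- matrix(0L, nrow(m), ncol(m))
  # Border cells without a full window are left at 0 (a simplification).
  for (i in (1 + half):(nrow(m) - half)) {
    for (j in (1 + half):(ncol(m) - half)) {
      score <- sum(m[cbind(i + k, j + k)]) # matches along the diagonal window
      if (score >= stringency) out[i, j] <- 1L
    }
  }
  out
}

noise  <- filtered_dot_matrix("ABCDE", "XXCXX")  # one isolated match
signal <- filtered_dot_matrix("ABCDE", "ABCDE")  # a full run of identity
sum(noise)                                       # isolated dot is filtered out
diag(signal)                                     # interior of the diagonal is kept
```

An isolated match scores only 1 inside its window and is dropped, while a run of identity scores 3 and survives, which is the signal-to-noise improvement the slide describes.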


TUTORIAL I

LAB 3

Alejandro Quiroz-Zárate

Daniel Fernandez


A little history

R is a dialect of the S language


• Essentially we work with a 40-year-old technology!
• R is divided in 2 parts
  – The BASE system
    • What comes with the download from CRAN (Comprehensive R Archive Network)
  – The packages that you download
    • Based on your needs!!!
• Over 1000 packages on CRAN
  – http://www.r-project.org/
• Last but NOT least
  – R is FREE!!!!!!


Outline
• The Console and the Script
  – Workspace management
• Objects
  – Classes and Mode
  – Some Classes:
    • Vectors, Matrices and data.frames
  – Some Modes:
    • Lists, strings
• Loops and conditional statements
• Functions
  – R functions
  – My own functions
• Handling data
  – Reading and writing!
• Plotting!
• Libraries
• Exercises


Getting started

The Console

Essentially where the commands are executed

The Script

Where the code is written


An R session

Type code here

Adjust/Extend code

Output appears


Workspace Management
• Before jumping into R, it is important to ask ourselves
  – Where am I?
    • getwd()
  – I want to be there…
    • setwd("C:/")
  – Who is with me?
    • dir() # lists all the files in the working directory
  – Whom can I count on?
    • ls() # lists all the objects in the current workspace
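A short session tying the four questions together might look like the sketch below. It uses R's temporary directory rather than "C:/" so that it runs on any system; the file name example.txt is made up for illustration.

```r
old_dir <- getwd()                    # Where am I?
setwd(tempdir())                      # I want to be there (a scratch directory)

writeLines("hello", "example.txt")    # create a file so dir() has something to show
files <- dir()                        # Who is with me? Files in the working directory

x <- 1:3
objects_here <- ls()                  # Whom can I count on? Objects in the workspace

setwd(old_dir)                        # go back to where we started
```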


Workspace Management (2)
• Saving
  – save(x,file="name.RData")
    • Saves specific objects
  – save.image("name.RData")
    • Saves the whole workspace
• Loading
  – load("name.RData")
• ‘?function’ and ‘??function’
  – ? gets the documentation of the function
  – ?? finds functions related to the query
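As a quick check that saving and loading round-trips an object, here is a sketch that writes to a temporary file (the path is an assumption; any writable location works).

```r
x <- c(10, 5, 3, 6)
rdata_file <- file.path(tempdir(), "name.RData")

save(x, file = rdata_file)   # save one specific object
rm(x)                        # remove it from the workspace
load(rdata_file)             # restores x under its original name

x
```

Note that load() recreates the object under the name it had when saved, silently overwriting any object with the same name.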


R Objects

• Almost all things in R are OBJECTS!
  – Functions, datasets, results, etc… (graphs NO)
• OBJECTS are classified by two criteria
  – MODE: How objects are stored in R
    • Character, numeric, logical, list, function…
    • To obtain the mode of an object
      – mode(object)
  – CLASS: How objects are treated by functions
    • Vector, matrix, array, factor, data.frame,…
    • To obtain the class of an object
      – class(object)
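The difference between the two criteria is easiest to see by asking for both on a few objects. The objects below are made up for illustration.

```r
x  <- c(10, 5, 3, 6)
m  <- matrix(1:8, 2, 4)
df <- data.frame(a = 1:3, b = c("u", "v", "w"))
f  <- factor(c("a", "b", "a"))

mode(x);  class(x)    # numeric storage, numeric class
mode(m);  is.matrix(m) # numeric storage, treated as a matrix
mode(df); class(df)   # stored as a list, treated as a data.frame
mode(f);  class(f)    # stored as numeric codes, treated as a factor
mode(mean)            # functions are objects too
```

The factor case shows why the two notions are kept apart: the storage is numeric, but functions treat the object as categorical.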


R Objects (2)

[Figure: a data table with columns x1 to x6 and rows 1 to 8]

MODE: Is determined by the type of things stored (numbers, characters, Booleans).
If only numbers: numeric
If it is a mixture: list

CLASS: Is determined by how functions deal with this object.
If only numbers: matrix
If it is a mixture: data.frame


Some classes
• Vectors!!!
  – x=c(10,5,3,6)
  – Calculations on vectors are performed on each entry
    • y=c(log(x),x,x^2)
  – Vectors in an operation do not need to have the same length (the shorter one is recycled)!
    • w=sqrt(x)+2
    • z=c(pi,exp(1),sqrt(2))
    • x+z # works, with a warning because 4 is not a multiple of 3
  – Logical vectors
    • aux=x<7
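Running the slide's commands shows how recycling behaves. The variable name recycled and the use of suppressWarnings() are additions for this sketch; without the latter R prints a warning that the longer length is not a multiple of the shorter one.

```r
x <- c(10, 5, 3, 6)
y <- c(log(x), x, x^2)              # three length-4 pieces glued together
w <- sqrt(x) + 2                    # the single number 2 is recycled to length 4
z <- c(pi, exp(1), sqrt(2))

# x has 4 entries and z has 3, so z is recycled: the 4th sum uses z[1] again.
recycled <- suppressWarnings(x + z)

aux <- x < 7                        # logical vector, one TRUE/FALSE per entry
aux
```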


Some classes (2)
• Matrices !!!
  – x=1:8
  – dim(x)=c(2,4)
  – y=matrix(1:8,2,4,byrow=F)
  – Operations are applied on each element
    • x*x, max(x)
    • x=matrix(1:28,ncol=4), y=7:10 so then x*y is…?
  – y=matrix(1:8,ncol=2)
    • y%*%t(y)
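The sketch below works through the slide's commands, including the ‘x*y is…?’ question. The names x2, y2, y3, prod_elem and gram are renamed copies introduced here so that nothing is overwritten.

```r
x <- 1:8
dim(x) <- c(2, 4)                       # reshape the vector into a 2 x 4 matrix
y <- matrix(1:8, 2, 4, byrow = FALSE)   # same matrix, filled column by column
identical(x, y)

x * x                                   # element-wise product, not a matrix product
max(x)

x2 <- matrix(1:28, ncol = 4)            # 7 x 4
y2 <- 7:10
prod_elem <- x2 * y2                    # y2 is recycled down the columns of x2

y3 <- matrix(1:8, ncol = 2)             # 4 x 2
gram <- y3 %*% t(y3)                    # true matrix product, 4 x 4
dim(gram)
```

So x*y is still an element-wise product: the vector is recycled in column-major order, which means entry [1,2] (the 8th element, value 8) is multiplied by y2[4] = 10, not by y2[2].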


Some classes (3)

• Extracting info
  – y[1,] or y[,1]
• Extending matrices
  – cbind(y,seq(101,104))
  – rbind(y,c(102,109))
• apply is a useful function!
  – apply(y,2,mean)
  – apply(y,1,log)
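Assuming y is still the 4 x 2 matrix from the previous slide, these commands behave as sketched below. The result names are introduced here for illustration.

```r
y <- matrix(1:8, ncol = 2)            # 4 x 2 matrix, as on the previous slide

first_row <- y[1, ]                   # 1 5
first_col <- y[, 1]                   # 1 2 3 4

wider  <- cbind(y, seq(101, 104))     # adds a third column (needs 4 values)
taller <- rbind(y, c(102, 109))       # adds a fifth row (needs 2 values)

col_means <- apply(y, 2, mean)        # margin 2 = columns
row_logs  <- apply(y, 1, log)         # margin 1 = rows
dim(row_logs)                         # 2 x 4: each row's result becomes a column
```

The last line is a common surprise: when the applied function returns a vector, apply() stores each result as a column, so applying over rows returns a transposed shape.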


Some classes (4)

• data.frame!!!
  – Creation
    • Several ways to create a data frame
      – 1)
        » logical=sample(c(T,F),size=20,replace=T)
        » numeric=rnorm(20)
        » my.df=data.frame(logical, numeric)
      – 2)
        » test=matrix(rnorm(21),7,3)
        » test=data.frame(test)
    • class(my.df[1,])
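A runnable version of both creation routes follows. The column names flag and value replace the slide's logical and numeric, which also happen to be names of base R functions, and set.seed() is added only to make the random numbers reproducible.

```r
set.seed(1)                                   # reproducible random numbers

# 1) Build a data frame from vectors of different types
flag  <- sample(c(TRUE, FALSE), size = 20, replace = TRUE)
value <- rnorm(20)
my.df <- data.frame(flag, value)

# 2) Convert a matrix into a data frame
test <- matrix(rnorm(21), 7, 3)
test <- data.frame(test)
names(test)                                   # default column names X1, X2, X3

class(my.df[1, ])                             # one row is still a data.frame
class(my.df[, 1])                             # one column is a plain vector
```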


A mode

• Lists!!!
  – Is like a vector
    • An element of a list can be an object of any type and structure
  – x1=1:5
  – x2=c(T,T,F,T,F)
  – y=list(numbers=x1,questions=x2)
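Once the list exists, its elements can be reached by name or by position, as in this short sketch.

```r
x1 <- 1:5
x2 <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
y  <- list(numbers = x1, questions = x2)

y$numbers            # element by name
y[["questions"]]     # same idea, with double brackets
y[1]                 # single brackets return a list of length 1, not the element
mode(y)
```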


Functions!
• My own functions
  – function.name=function(arg1,arg2,…,argN)
    {
      Body of the function
    }
  – fun.plot=function(y,z){
      y=log(y)*z-z^3+z^2
      plot(z,y)
    }
  – z=seq(-11,10)
  – y=seq(11,32)
  – fun.plot(y,z)


Functions! (2)
• The ‘…’ argument
  – Can be used to pass arguments from one function to another
    • Without the need to specify arguments in the header

fun.plot=function(y,z,...)
{
  y=log(y)*z-z^3+z^2
  plot(z,y,...)
}

fun.plot(y,z,type="l",col="red")

fun.plot(y,z,type="l",col="red",lwd=4)
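Put together with the z and y vectors from the previous slide, the example runs as below. Returning the transformed values with invisible() is an addition of this sketch, made so the result can be inspected; it is not part of the slide's function.

```r
# fun.plot as on the slide, plus an invisible return value (an addition).
fun.plot <- function(y, z, ...) {
  y <- log(y) * z - z^3 + z^2
  plot(z, y, ...)          # anything passed in '...' goes straight to plot()
  invisible(y)
}

z <- seq(-11, 10)
y <- seq(11, 32)

# type, col and lwd are not arguments of fun.plot; plot() receives them via '...'
out <- fun.plot(y, z, type = "l", col = "red", lwd = 4)
length(out)
```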


Handling data I/O

• Reading files
  – read.csv("filename.csv") # reads csv files into a data.frame
  – read.table("filename.txt") # reads txt files in a table format into a data.frame
  – scan("filename") # not friendly for matrices or tables!!!
• Writing to files
  – write(x,file="filename") # writes the object x to filename
  – write.table(x,"filename") # writes the object x to filename in a table format
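A round trip through each pair of functions is a simple way to see what they do. The data frame and the file names under tempdir() are made up for this sketch.

```r
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.25))

# csv: write with write.csv(), read back with read.csv()
csv_file <- file.path(tempdir(), "filename.csv")
write.csv(df, csv_file, row.names = FALSE)
from_csv <- read.csv(csv_file)

# whitespace-separated table: write.table() and read.table()
txt_file <- file.path(tempdir(), "filename.txt")
write.table(df, txt_file, row.names = FALSE)
from_txt <- read.table(txt_file, header = TRUE)   # header = TRUE keeps the column names

# plain vector: write() and scan()
vec_file <- file.path(tempdir(), "vector.txt")
write(c(1, 2, 3), file = vec_file)
from_scan <- scan(vec_file, quiet = TRUE)
```

read.table() does not assume a header line by default, which is why header = TRUE is passed explicitly here.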


Plotting!
• x.data=rnorm(1000)
• y.data=x.data^3-10*x.data^2
• z.data=-0.5*y.data-90
• plot(x.data,y.data,main="Title of the graph",xlab="x label",ylab="y label")
• points(x.data,z.data,col="red")
• legend(-2,2,legend=c("Black points","Red points"),col=c("black","red"),pch=1,text.col=c("black","red"))


Plotting! (2)
• You can export graphs in many formats
  – To check the formats that are available in your R installation
    • capabilities()
  – png
    • png("Lab2_plot.png",width=520,height=440)
    • plot(x.data,y.data,main="Title of the graph",xlab="x label",ylab="y label")
    • points(x.data,z.data,col="red")
    • legend(-2,2,legend=c("Black points","Red points"),col=c("black","red"),pch=1,text.col=c("black","red"))
    • dev.off()
  – eps
    • postscript("Lab2_plot.eps",width=7,height=6) # for postscript() width and height are in inches, not pixels
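The export pattern is always the same: open a file device, run the plotting commands, then close the device with dev.off(). Below is a sketch for the eps case, writing to a temporary path (an assumption). The extra arguments horizontal, onefile and paper are the settings the R documentation recommends for producing EPS, and the 7 x 6 inch size is an arbitrary choice.

```r
x.data <- rnorm(1000)
y.data <- x.data^3 - 10 * x.data^2
z.data <- -0.5 * y.data - 90

eps_file <- file.path(tempdir(), "Lab2_plot.eps")

# Open the device; width and height are in inches for postscript().
postscript(eps_file, width = 7, height = 6,
           horizontal = FALSE, onefile = FALSE, paper = "special")
plot(x.data, y.data, main = "Title of the graph", xlab = "x label", ylab = "y label")
points(x.data, z.data, col = "red")
dev.off()                       # nothing is fully written until the device is closed

file.exists(eps_file)
```

The png case works the same way with png() in place of postscript(), except that png() measures width and height in pixels by default.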


Libraries!!

• A collection of R functions that together perform a specialized analysis or task.
• Installing packages from CRAN
  – install.packages("PackageName")
• Loading libraries
  – library(LibraryName)
• Getting the documentation of a library
  – library(help=LibraryName)
• Listing all the installed packages
  – library()
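A common pattern is to check whether a package is installed before trying to load it. The sketch below uses tseries, the package named in Exercise 2, purely as an example; it does not install anything by itself.

```r
pkg <- "tseries"   # example package name

if (requireNamespace(pkg, quietly = TRUE)) {
  # character.only = TRUE lets library() take the name from a variable
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
} else {
  message("Not installed yet; run install.packages(\"", pkg, "\") first")
}

# Names of all packages installed on this machine
installed <- rownames(installed.packages())
head(installed)
```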


Exercise 1 – Probability Transform

We know that , and we want to know the probability associated with

(a) Plot the theoretical pdf and cdf of X.

(b) Generate 10,000,000 observations of the random variable X.

(c) Compute Y = 3X^5 + 4X^2 - 7.

(d) Estimate the probability that

(e) Plot the histogram and empirical CDF of Y.


Exercise 2 – The empire strikes back: GOOG versus BAIDU

Plot historical stock price time series using prices from Yahoo Finance.

(a) Download and install the tseries package.

(b) Load the tseries package as a library in your code.

(c) Use get.hist.quote to download GOOG and BAIDU (ticker symbol BIDU) historical data.

(d) Plot both time series in the same panel and add a legend to the plot.


Exercise 3 – Challenging Challenger

On January 28, 1986, the Space Shuttle Challenger exploded in the early stages of its flight. Feynman, along with a committee, determined that the explosion was due to low temperatures and the failure of the O-ring seals on the booster rockets. The ambient temperature was 36 degrees Fahrenheit on the morning of the launch. The scientists had data (temperature, number of failures) from previous flights.


Exercise 3 – Challenging Challenger (2)

(a) Plot the number of failures versus the temperature for flights with one or more O-ring failures. Is there any evidence that temperature affects O-ring performance?

(b) Plot the number of failures versus temperature for all the flights. Is there any evidence that temperature affects O-ring performance?

(c) What’s your conclusion? What do you think the scientists plotted before taking the decision to fly that day? As a historical curiosity: who played a central role in discovering the causes of the failure, and how did he announce it?