
Bioinformatics Assignment

Ritajit Majumdar

M.Tech 1st Semester, Computer Science and Engineering

Class Roll: 1

Exam Roll: 97/CSM/140001

Registration No: 0029169 of 2008-2009

March 26, 2015

Contents

1 Problem 1
  1.1 Theoretical Support
  1.2 Algorithm
  1.3 Result

2 Problem 2
  2.1 Theoretical Support
  2.2 Algorithm
  2.3 Result

3 Problem 3
  3.1 Theoretical Support
  3.2 Algorithm
  3.3 Result

4 Problem 4
  4.1 Theoretical Support
  4.2 K-means Algorithm
  4.3 Result

5 Discussion


1 Problem 1

Consider the following table based on the experimental result of the Roth cancer research lab. The table consists of 8 genes with 3 attributes (viz., GO attributes, Expression level and Pseudo gene found) and one class label: cancer mediating. Write a program which can perform the following tasks:

a) Find out the test attribute and draw the decision tree.

b) Find out the class label of the gene with GO attribute > 40, Expression level = medium and Pseudo gene found = No.

c) Construct the classifier that can predict the class label of the unknown genes.

Gene-ID | GO attributes | Expression Level | Pseudo Gene found | Cancer Mediating
g1      | <=30          | High             | No                | No
g2      | <=30          | High             | No                | No
g3      | 31...40       | High             | No                | Yes
g4      | >40           | Medium           | No                | Yes
g5      | >40           | Low              | Yes               | Yes
g6      | >40           | Low              | Yes               | No
g7      | 31...40       | Low              | Yes               | Yes
g8      | <=30          | Medium           | No                | No

1.1 Theoretical Support

In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Here, the test attribute and the decision tree of part (a) are found with the entropy-based information gain measure described below, while the classifier of part (c) is a naive Bayes classifier.

1.2 Algorithm

Let S be a set of s data samples. Suppose the class label attribute has m distinct values defining m classes C_1, ..., C_m, and let s_i be the number of samples of S in class C_i. Hence, the expected information needed to classify a given sample is given by

I(s_1, s_2, ..., s_m) = -\sum_i p_i \log_2(p_i)

where p_i is the probability that an arbitrary sample belongs to class C_i and is estimated as p_i = s_i / s.

Let attribute A have v distinct values {a_1, a_2, ..., a_v}. This attribute divides the entire data S into v subsets S_1, S_2, ..., S_v, where S_j contains those samples of S that have value a_j. Let s_{ij} be the number of samples of class C_i in subset S_j. Hence the entropy based on the partitioning into subsets by A is

E(A) = \sum_j \frac{s_{1j} + ... + s_{mj}}{s} I(s_{1j}, ..., s_{mj})

where

I(s_{1j}, ..., s_{mj}) = -\sum_i p_{ij} \log_2(p_{ij}),  p_{ij} = s_{ij} / |S_j|

Hence the gain of A is

Gain(A) = I(s_1, s_2, ..., s_m) - E(A)

The attribute with the highest gain is chosen as the test attribute.
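As a concrete illustration, the short sketch below computes the information gain of the GO-attribute column of the table in Problem 1 using the formulas above. This is only a sketch; the variable and function names are illustrative and not taken from the assignment's actual program.

from __future__ import division
import math
from collections import Counter

def info(labels):
    # I(s1, ..., sm): expected information of a list of class labels
    total = len(labels)
    return -sum((n / total) * math.log(n / total, 2)
                for n in Counter(labels).values())

def gain(rows, labels):
    # Gain(A) = I(s1, ..., sm) - E(A); each row is (attribute value, class label)
    total = len(rows)
    values = set(a for a, _ in rows)
    e_a = sum((len(part) / total) * info(part)
              for part in ([c for a, c in rows if a == v] for v in values))
    return info(labels) - e_a

# GO-attribute column and class labels taken from the table in Problem 1
rows = [("<=30", "No"), ("<=30", "No"), ("31...40", "Yes"), (">40", "Yes"),
        (">40", "Yes"), (">40", "No"), ("31...40", "Yes"), ("<=30", "No")]
labels = [c for _, c in rows]

print(gain(rows, labels))  # about 0.656 bits

Repeating the computation for the other two attributes and picking the largest gain yields the test attribute at the root of the decision tree.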


Bayes' Theorem

Let X be a data sample whose class label is unknown, and let H be the hypothesis that the data sample X belongs to class C_i. For the classification problem we want to determine P(H | X), i.e. the probability that the hypothesis holds given the data sample X. P(H | X) is known as the posterior probability of H conditioned on X:

P(H | X) = P(X | H) P(H) / P(X)

1. Each data sample is represented by an n-dimensional feature vector X = (x_1, x_2, ..., x_n) with n attributes.

2. Suppose there are m classes C_1, C_2, ..., C_m. According to the theorem, an unknown sample X belongs to class C_i iff

P(C_i | X) > P(C_j | X) for all j ≠ i, 1 ≤ j ≤ m

where

P(C_i | X) = P(X | C_i) P(C_i) / P(X)

3. P(X) is constant for all classes. Hence only P(X | C_i) P(C_i) needs to be maximised, with P(C_i) = s_i / s.

4. Under the naive independence assumption, P(X | C_i) = \prod_k P(x_k | C_i). Each factor of the product can be estimated in the following way:

• if A_k is categorical, then compute P(x_k | C_i) = s_{ik} / s_i, where s_{ik} is the number of samples of class C_i having value x_k for attribute A_k;

• if A_k is continuous, then apply a Gaussian distribution:

P(x_k | C_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{C_i}} \exp\left( -\frac{(x_k - \mu_{C_i})^2}{2\sigma_{C_i}^2} \right)

5. Thus the unknown sample X is assigned the class label C_i for which P(X | C_i) P(C_i) is maximal.
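A minimal sketch of the categorical case of this classifier, trained on the Problem 1 table, is given below (the Gaussian case for continuous attributes is omitted, no smoothing is applied, and all names are illustrative rather than the assignment's actual code).

from __future__ import division
from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    # Estimate P(Ci) = si/s and the counts behind P(xk|Ci) = sik/si
    class_count = Counter(labels)
    total = len(labels)
    cond = defaultdict(int)  # cond[(ci, k, v)] = #samples of class ci with value v for attribute k
    for x, c in zip(samples, labels):
        for k, v in enumerate(x):
            cond[(c, k, v)] += 1
    priors = dict((c, n / total) for c, n in class_count.items())
    return priors, cond, class_count

def classify(x, priors, cond, class_count):
    # Assign the class Ci maximising P(Ci) * prod_k P(xk|Ci)
    def score(c):
        p = priors[c]
        for k, v in enumerate(x):
            p *= cond[(c, k, v)] / class_count[c]
        return p
    return max(priors, key=score)

# (GO attribute, Expression level, Pseudo gene found) -> Cancer mediating
X = [("<=30", "High", "No"), ("<=30", "High", "No"), ("31...40", "High", "No"),
     (">40", "Medium", "No"), (">40", "Low", "Yes"), (">40", "Low", "Yes"),
     ("31...40", "Low", "Yes"), ("<=30", "Medium", "No")]
y = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No"]

model = train_naive_bayes(X, y)
print(classify((">40", "Medium", "No"), *model))  # the query gene of part (b)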

1.3 Result

First the snapshot of decision tree generation is provided.


The snapshot of the result has been provided below.


2 Problem 2

Consider the following table:

Set 1: g1 g2 g5
Set 2: g2 g4
Set 3: g2 g3
Set 4: g1 g2 g4
Set 5: g1 g3
Set 6: g2 g3
Set 7: g1 g3
Set 8: g1 g2 g3 g5
Set 9: g1 g2 g3

Find out the frequent item sets for support count 2. Also find out the set of significant rules from the frequent item sets (a confidence level ≥ 70% signifies a significant rule).

2.1 Theoretical Support

The Apriori algorithm is the original algorithm for mining frequent item sets for Boolean association rules, proposed by R. Agrawal and R. Srikant in 1994. Its core principle is that all subsets of a frequent item set are frequent, and all supersets of an infrequent item set are infrequent. It is regarded as the most typical data mining algorithm of its kind[6].

2.2 Algorithm

The algorithm is provided in pseudo-code format; this formulation was obtained from Wikipedia.

L_1 ← {large 1-itemsets}
k ← 2
while L_{k-1} ≠ Ø do
    C_k ← {a ∪ {b} | a ∈ L_{k-1} ∧ b ∈ ∪L_{k-1} ∧ b ∉ a}
    for transactions t ∈ T do
        C_t ← {c | c ∈ C_k ∧ c ⊆ t}
        for candidates c ∈ C_t do
            count[c] ← count[c] + 1
        end
    end
    L_k ← {c | c ∈ C_k ∧ count[c] ≥ ε}
    k ← k + 1
end
return ∪_k L_k

Algorithm 1: Apriori(T, ε)
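As a sketch of how this pseudo-code translates into Python (the function name and data layout below are illustrative assumptions, not the assignment's actual program), the following finds all item sets with support count ≥ 2 in the transactions of the problem statement:

from __future__ import division

def apriori(transactions, min_support):
    # Return all item sets with support count >= min_support
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    current = [frozenset([i]) for i in items
               if sum(1 for t in transactions if i in t) >= min_support]  # L1
    frequent = list(current)
    k = 2
    while current:
        # Candidate generation: a ∪ {b}, a in L_{k-1}, b an item of L_{k-1}, b not in a
        candidates = set(a | frozenset([b]) for a in current
                         for other in current for b in other if b not in a)
        candidates = set(c for c in candidates if len(c) == k)
        # Support counting: one scan of the transactions per level
        counts = dict((c, sum(1 for t in transactions if c <= t)) for c in candidates)
        current = [c for c, n in counts.items() if n >= min_support]
        frequent.extend(current)
        k += 1
    return frequent

sets = [["g1", "g2", "g5"], ["g2", "g4"], ["g2", "g3"], ["g1", "g2", "g4"],
        ["g1", "g3"], ["g2", "g3"], ["g1", "g3"], ["g1", "g2", "g3", "g5"],
        ["g1", "g2", "g3"]]
for itemset in apriori(sets, 2):
    print(sorted(itemset))

From the frequent item sets, rules are then derived with confidence conf(A → B) = support(A ∪ B) / support(A); rules whose confidence reaches the 70% threshold form the significant rule set reported in the next section.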


2.3 Result

The table L1:

g1 | 6
g2 | 7
g3 | 6
g4 | 2
g5 | 2

C2:
-----------------------
g1, g2 : 4
g1, g3 : 4
g1, g4 : 1
g1, g5 : 2
g2, g3 : 4
g2, g4 : 2
g2, g5 : 2
g3, g4 : 0
g3, g5 : 1
g4, g5 : 0
-----------------------

The table L2:

g1, g2 | 4
g1, g3 | 4
g1, g5 | 2
g2, g3 | 4
g2, g4 | 2
g2, g5 | 2

C3:
-----------------------
g1, g2, g3 : 2
g1, g2, g5 : 2
g1, g2, g4 : 1
g1, g3, g5 : 1
g1, g2, g3, g4 : 0
g1, g2, g3, g5 : 1
g1, g2, g4, g5 : 0
g2, g3, g4 : 0
g2, g3, g5 : 1
g2, g4, g5 : 0
-----------------------

The table L3:

g1, g2, g3 | 2
g1, g2, g5 | 2

C4:
-----------------------
g1, g2, g3, g5 : 1
-----------------------

The table L4: (empty)

The frequent item sets (the table L3):

g1, g2, g3 | 2
g1, g2, g5 | 2

--------------------------------------
* Derived rule set:
--------------------------------------
{g1} --> {g2, g3} : 0.33
{g2} --> {g1, g3} : 0.29
{g3} --> {g1, g2} : 0.33
{g2, g3} --> {g1} : 0.5
{g1, g3} --> {g2} : 0.5
{g1, g2} --> {g3} : 0.5
{g1} --> {g2, g5} : 0.33
{g2} --> {g1, g5} : 0.29
{g5} --> {g1, g2} : 1.0
{g2, g5} --> {g1} : 1.0
{g1, g5} --> {g2} : 1.0
{g1, g2} --> {g5} : 0.5
--------------------------------------

* Significant rule set:

{g5} --> {g1, g2}
{g2, g5} --> {g1}
{g1, g5} --> {g2}

3 Problem 3

Design a Backpropagation learning algorithm for a 3-2-1 feedforward neural network. The given training set is (1,0,1) → 1. Discover the class label of all remaining patterns.

3.1 Theoretical Support

Backpropagation, an abbreviation for "backward propagation of errors", is a common method of training artificial neural networks, used in conjunction with an optimisation method such as gradient descent. The backpropagation algorithm was originally introduced in the 1970s, but its importance wasn't fully appreciated until a famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams[5]. That paper describes several neural networks where backpropagation works far faster than earlier approaches to learning, making it possible to use neural nets to solve problems which had previously been insoluble. Today, the backpropagation algorithm is the workhorse of learning in neural networks.

3.2 Algorithm

Initialize all weights with small random numbers (in the program, random numbers between 0 and 1 are used).

repeat
    for every pattern in the training set do
        Present the pattern to the network.
        for each layer in the network do
            for every node in the layer do
                1. Calculate the weighted sum of the inputs to the node.
                2. Add the threshold to the sum; the net input is I_j = \sum_i w_{ij} O_i + θ_j.
                3. Calculate the activation of the node; typically the sigmoid function is used, giving the output of each node as O_j = 1 / (1 + e^{-I_j}).
            end
        end
        for every node in the output layer do
            Calculate the error signal as Err_j = O_j (1 - O_j)(T_j - O_j), where T_j is the true output.
        end
        for all hidden layers do
            for every node in the layer do
                1. Calculate the node's error signal as Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}, where w_{jk} is the weight of the connection from unit j to unit k (in the next higher layer) and Err_k is the error of unit k.
                2. Update each node's weights in the network.
            end
        end
        Calculate the global error function.
    end
until (the maximum number of iterations is reached) OR (the error function falls below the specified threshold);

Algorithm 2: Backpropagation
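To make the forward and backward passes concrete, here is a small NumPy sketch of training the 3-2-1 network of the problem on the single given pattern. The learning rate, the iteration count, the random seed and the use of NumPy are assumptions, not details taken from the assignment's program.

import numpy as np

rng = np.random.RandomState(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 3-2-1 network: weights and thresholds initialized with random numbers in [0, 1)
W1, b1 = rng.rand(2, 3), rng.rand(2)   # input layer -> hidden layer
W2, b2 = rng.rand(1, 2), rng.rand(1)   # hidden layer -> output layer
lr = 0.5                               # learning rate (assumed)

x, t = np.array([1.0, 0.0, 1.0]), 1.0  # the given training pattern (1,0,1) -> 1

for _ in range(1000):                  # iteration limit (assumed)
    # Forward pass: I_j = sum_i w_ij O_i + theta_j, O_j = sigmoid(I_j)
    h = sigmoid(W1.dot(x) + b1)
    o = sigmoid(W2.dot(h) + b2)
    # Output-layer error: Err_j = O_j (1 - O_j) (T_j - O_j)
    err_o = o * (1 - o) * (t - o)
    # Hidden-layer error: Err_j = O_j (1 - O_j) sum_k Err_k w_jk
    err_h = h * (1 - h) * W2.T.dot(err_o)
    # Delta-rule updates of weights and thresholds
    W2 += lr * np.outer(err_o, h); b2 += lr * err_o
    W1 += lr * np.outer(err_h, x); b1 += lr * err_h

print(o)  # output for (1,0,1); approaches the target 1 as training proceeds

The remaining 3-bit patterns can then be presented to the trained network and labelled according to whether the output is closer to 0 or to 1.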

3.3 Result

The snapshot of the result has been provided.


4 Problem 4

Given the following table of gene sequences, implement the k-means, k-medoid and fuzzy c-means clustering algorithms to generate the clusters. Tune the K or C value from 2 to 5. Consider index = (average intra-cluster distance) / (1 + average inter-cluster distance). Find out the best result.

4.1 Theoretical Support

Clustering can be considered the most important unsupervised learning problem; it deals with finding a structure in a collection of unlabelled data. A cluster is therefore a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters[1].

K-means[2] is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. The next step is to take each point belonging to the data set and associate it with the nearest centroid. Then for each cluster, we compute the mean coordinate and take it as the new cluster centre. This process continues until the cluster centres no longer change.


4.2 K-means Algorithm

Data: E = {e_1, e_2, ..., e_n} (set of entities), k (number of clusters), MaxIters (limit of iterations)
Result: C = {c_1, c_2, ..., c_k} (set of cluster centroids), L = {l(e_i) | i = 1, ..., n} (set of cluster labels of E)

foreach c_i ∈ C do
    c_i ← e_j ∈ E (random selection)
end
foreach e_i ∈ E do
    l(e_i) ← argmin_j Distance(e_i, c_j), j ∈ {1, ..., k}
end
iter ← 0
repeat
    changed ← false
    foreach c_i ∈ C do
        UpdateCluster(c_i)
    end
    foreach e_i ∈ E do
        minDist ← argmin_j Distance(e_i, c_j), j ∈ {1, ..., k}
        if minDist ≠ l(e_i) then
            l(e_i) ← minDist
            changed ← true
        end
    end
    iter ← iter + 1
until changed = false or iter > MaxIters;

Algorithm 3: K-Means Algorithm
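A plain-Python sketch of this loop, together with one reasonable reading of the validity index defined in Problem 4, is shown below. The Euclidean distance, the interpretation of "average inter-cluster distance" as the average pairwise distance between centroids, and all helper names are assumptions rather than details of the assignment's program.

from __future__ import division
import math
import random

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, max_iters=100):
    centroids = random.sample(points, k)   # random initial centres
    labels = [0] * len(points)
    for _ in range(max_iters):
        changed = False
        # Assignment step: attach every point to its nearest centroid
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda j: distance(p, centroids[j]))
            if nearest != labels[i]:
                labels[i] = nearest
                changed = True
        # Update step: the mean coordinate of each cluster becomes the new centre
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        if not changed:
            break
    return centroids, labels

def validity_index(points, centroids, labels):
    # index = (average intra-cluster distance) / (1 + average inter-cluster distance)
    intra = sum(distance(p, centroids[l]) for p, l in zip(points, labels)) / len(points)
    pairs = [(i, j) for i in range(len(centroids)) for j in range(i + 1, len(centroids))]
    inter = sum(distance(centroids[i], centroids[j]) for i, j in pairs) / len(pairs)
    return intra / (1 + inter)

Running kmeans for k = 2, ..., 5 and keeping the k with the smallest index value gives the best result asked for in the problem.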

4.3 Result

The points are chosen randomly. Then the number of clusters is varied from k = 2 to k = 5. Snapshots are shown for k = 2 and k = 3.


5 Discussion

All the programs were developed using Python 2.7.3 on Macintosh OS. They should run smoothly on Linux and Windows as well, though they were not tested in those environments. However, some programs will fail to run under Python 3, since some statements differ in the later versions of Python.

References

[1] A Tutorial on Clustering Algorithms, http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/

[2] J. B. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1:281-297, 1967.

[3] J. C. Dunn, "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", Journal of Cybernetics 3:32-57, 1973.

[4] J. C. Bezdek, "Pattern Recognition with Fuzzy Objective Function Algorithms", Plenum Press, New York, 1981.

[5] D. E. Rumelhart, G. E. Hinton, R. J. Williams, "Learning representations by back-propagating errors", Nature 323:533-536, 9 October 1986.

[6] Jiao Yabing, "Research of an Improved Apriori Algorithm in Data Mining Association Rules", International Journal of Computer and Communication Engineering, Vol. 2, No. 1, January 2013.
