Bioinformatics Problems

This document contains some bioinformatics problems with their theoretical support, algorithms and snapshots of the solutions.
7/18/2019 Bioinformatics problems
http://slidepdf.com/reader/full/bioinformatics-problems 1/12
Bioinformatics Assignment
Ritajit Majumdar
M.Tech. 1st Semester, Computer Science and Engineering
Class Roll: 1
Exam Roll: 97/CSM/140001
Registration No: 0029169 of 2008-2009
March 26, 2015
Contents
1 Problem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
    1.1 Theoretical Support . . . . . . . . . . . . . . . . . . . . . . . . 1
    1.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
    1.3 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  2

2 Problem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
    2.1 Theoretical Support . . . . . . . . . . . . . . . . . . . . . . . . 4
    2.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
    2.3 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  5

3 Problem 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
    3.1 Theoretical Support . . . . . . . . . . . . . . . . . . . . . . . . 6
    3.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
    3.3 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  7

4 Problem 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
    4.1 Theoretical Support . . . . . . . . . . . . . . . . . . . . . . . . 8
    4.2 K-means Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 9
    4.3 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  9

5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1 Problem 1
Consider the following table, based on the experimental results of the Roth cancer research lab. The table consists of 8 genes with 3 attributes (viz., GO attribute, Expression Level and Pseudo Gene found) and one class label: Cancer Mediating. Write a program which can perform the following tasks:

a) Find out the test attribute and draw the decision tree.
b) Find out the class label of the gene with GO attribute > 40, Expression Level = Medium and Pseudo Gene found = No.
c) Construct a classifier that can predict the class label of unknown genes.
Gene-ID | GO attribute | Expression Level | Pseudo Gene found | Cancer Mediating
g1      | <= 30        | High             | No                | No
g2      | <= 30        | High             | No                | No
g3      | 31...40      | High             | No                | Yes
g4      | > 40         | Medium           | No                | Yes
g5      | > 40         | Low              | Yes               | Yes
g6      | > 40         | Low              | Yes               | No
g7      | 31...40      | Low              | Yes               | Yes
g8      | <= 30        | Medium           | No                | No
1.1 Theoretical Support
The decision tree is built by choosing, at each node, the test attribute with the highest information gain. For the classifier, naive Bayes is used: in machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features.
1.2 Algorithm
Let S be a set of s data samples. Suppose the class label attribute has m distinct values, where C_i denotes the i-th class label; the classes are C_1 to C_m, and s_i is the number of samples of S in class C_i. Hence, the expected information needed to classify a given sample is

    I(s_1, s_2, ..., s_m) = -Σ_i p_i log2(p_i)

where p_i is the probability that an arbitrary sample belongs to class C_i, estimated as p_i = s_i / s.
Let attribute A have v distinct values {a_1, a_2, ..., a_v}. This attribute divides the entire data set S into v subsets S_1, S_2, ..., S_v, where S_j contains those samples in S that have value a_j for A. Let s_ij be the number of samples of class C_i in subset S_j. Hence the entropy based on the partitioning into subsets by A is

    E(A) = Σ_j [(s_1j + ... + s_mj) / s] · I(s_1j, ..., s_mj)

where

    I(s_1j, ..., s_mj) = -Σ_i p_ij log2(p_ij),    p_ij = s_ij / |S_j|

Hence the gain of A is

    Gain(A) = I(s_1, s_2, ..., s_m) - E(A)
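As a sketch of how the test attribute can be found, the following Python 3 fragment (illustrative only; the variable and function names are mine, not those of the program shown in the Result snapshots) computes Gain(A) for each attribute of the table above:

```python
import math
from collections import Counter

# Rows of the Problem 1 table: (GO attribute, Expression Level,
# Pseudo Gene found, Cancer Mediating).
data = [
    ("<=30",    "High",   "No",  "No"),
    ("<=30",    "High",   "No",  "No"),
    ("31...40", "High",   "No",  "Yes"),
    (">40",     "Medium", "No",  "Yes"),
    (">40",     "Low",    "Yes", "Yes"),
    (">40",     "Low",    "Yes", "No"),
    ("31...40", "Low",    "Yes", "Yes"),
    ("<=30",    "Medium", "No",  "No"),
]

def info(labels):
    """I(s1, ..., sm) = -sum_i p_i * log2(p_i)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain(rows, attr):
    """Gain(A) = I(S) - E(A), E(A) being the weighted entropy of the split."""
    labels = [r[-1] for r in rows]
    e = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[-1] for r in rows if r[attr] == value]
        e += len(subset) / len(rows) * info(subset)
    return info(labels) - e

for i, name in enumerate(("GO attribute", "Expression Level", "Pseudo Gene found")):
    print(name, round(gain(data, i), 3))
```

On this table the GO attribute has the highest gain (about 0.656), so it becomes the test attribute at the root of the decision tree.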
Bayes’ Theorem
Let X be a data sample whose class label C is unknown. Let H be the hypothesis that the data sample X belongs to class C_i. For the classification problem we want to determine P(H | X), i.e. the probability that the hypothesis holds given the data sample X. P(H | X) is known as the posterior probability of H conditioned on X.
    P(H | X) = P(X | H) · P(H) / P(X)
1. Each data sample is represented by an n-dimensional feature vector X = (x_1, x_2, ..., x_n) with n attributes.

2. Suppose there are m classes C_1, C_2, ..., C_m. According to the theorem, an unknown sample X belongs to class C_i iff

       P(C_i | X) > P(C_j | X) for all j ≠ i, 1 ≤ j ≤ m

   where

       P(C_i | X) = P(X | C_i) · P(C_i) / P(X)

3. P(X) is constant for all classes, hence only P(X | C_i) · P(C_i) needs to be maximised, with the prior estimated as P(C_i) = s_i / s.
4. Under the naive independence assumption, P(X | C_i) = Π_k P(x_k | C_i). Each factor of the product can be estimated in the following way:

   • if A_k is categorical, then P(x_k | C_i) = s_ik / s_i, where s_ik is the number of samples of class C_i having value x_k for A_k;

   • if A_k is continuous, then apply a Gaussian distribution:

         P(x_k | C_i) = (1 / (√(2π) σ_Ci)) · exp(-(x_k - μ_Ci)² / (2σ²_Ci))
5. Thus the unknown sample X is assigned class label C_i iff

       P(C_i | X) > P(C_j | X) for all j ≠ i, 1 ≤ j ≤ m
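Steps 1-5 can be sketched in Python 3 for the Problem 1 table. This is an illustrative implementation, not the assignment's actual program; the function and variable names are mine:

```python
from collections import Counter

# Rows of the Problem 1 table: attribute values and the Cancer Mediating label.
data = [
    (("<=30",    "High",   "No"),  "No"),
    (("<=30",    "High",   "No"),  "No"),
    (("31...40", "High",   "No"),  "Yes"),
    ((">40",     "Medium", "No"),  "Yes"),
    ((">40",     "Low",    "Yes"), "Yes"),
    ((">40",     "Low",    "Yes"), "No"),
    (("31...40", "Low",    "Yes"), "Yes"),
    (("<=30",    "Medium", "No"),  "No"),
]

def naive_bayes(sample):
    """Return the class C_i maximising P(X | C_i) * P(C_i)."""
    s = len(data)
    class_counts = Counter(label for _, label in data)     # s_i per class
    best, best_score = None, -1.0
    for ci, si in class_counts.items():
        score = si / s                                     # prior P(C_i)
        for k, xk in enumerate(sample):
            sik = sum(1 for x, label in data
                      if label == ci and x[k] == xk)       # s_ik
            score *= sik / si                              # P(x_k | C_i)
        if score > best_score:
            best, best_score = ci, score
    return best

# Part (b): GO attribute > 40, Expression Level = Medium, Pseudo Gene found = No.
print(naive_bayes((">40", "Medium", "No")))
```

For the query of part (b) this estimate favours the class Yes (score 1/32 against 3/128 for No).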
1.3 Result
First the snapshot of decision tree generation is provided.
The snapshot of the result has been provided below.
2 Problem 2
Consider the following table:

Set 1 | g1 g2 g5
Set 2 | g2 g4
Set 3 | g2 g3
Set 4 | g1 g2 g4
Set 5 | g1 g3
Set 6 | g2 g3
Set 7 | g1 g3
Set 8 | g1 g2 g3 g5
Set 9 | g1 g2 g3
Find out the frequent item sets for support count 2. Also find out the set of significant rules from the frequent item sets (a confidence level ≥ 70% signifies a significant rule).
2.1 Theoretical Support
The Apriori algorithm is the original algorithm for mining the frequent item sets of Boolean association rules, proposed by R. Agrawal and R. Srikant in 1994. Its core principles are that the subsets of frequent item sets are frequent, and the supersets of infrequent item sets are infrequent. It is regarded as the most typical data mining algorithm[6].
2.2 Algorithm
The algorithm is provided in pseudo-code format, as obtained from Wikipedia.

L_1 ← {large 1-itemsets}
k ← 2
while L_{k-1} ≠ Ø do
    C_k ← {a ∪ {b} | a ∈ L_{k-1} ∧ b ∈ ∪L_{k-1} ∧ b ∉ a}
    for transactions t ∈ T do
        C_t ← {c | c ∈ C_k ∧ c ⊆ t}
        for candidates c ∈ C_t do
            count[c] ← count[c] + 1
        end
    end
    L_k ← {c | c ∈ C_k ∧ count[c] ≥ ε}
    k ← k + 1
end
return ∪_k L_k

Algorithm 1: Apriori(T, ε)
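A minimal Python 3 sketch of the pseudo-code above, applied to the transactions of Problem 2 (the candidate-pruning subset test is omitted for brevity, and the names are mine, not the assignment program's):

```python
# Transactions from the Problem 2 table.
transactions = [
    {"g1", "g2", "g5"}, {"g2", "g4"}, {"g2", "g3"},
    {"g1", "g2", "g4"}, {"g1", "g3"}, {"g2", "g3"},
    {"g1", "g3"}, {"g1", "g2", "g3", "g5"}, {"g1", "g2", "g3"},
]

def apriori(transactions, min_support):
    """Return {frozenset: support count} for every frequent item set."""
    level = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        # Join step: merge frequent k-itemsets that share k-1 items.
        level = {a | b for a in current for b in current
                 if len(a | b) == len(a) + 1}
    return frequent

freq = apriori(transactions, 2)
for itemset in sorted(freq, key=lambda s: (len(s), sorted(s))):
    print(", ".join(sorted(itemset)), "|", freq[itemset])
```

The support counts it produces match the L1-L3 tables in the Result section below.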
2.3 Result
The table L1:
g1 | 6
g2 | 7
g3 | 6
g4 | 2
g5 | 2

C2:
-----------------------
g1, g2 : 4
g1, g3 : 4
g1, g4 : 1
g1, g5 : 2
g2, g3 : 4
g2, g4 : 2
g2, g5 : 2
g3, g4 : 0
g3, g5 : 1
g4, g5 : 0
-----------------------

The table L2:
g1, g2 | 4
g1, g3 | 4
g1, g5 | 2
g2, g3 | 4
g2, g4 | 2
g2, g5 | 2
C3:
-----------------------
g1, g2, g3 : 2
g1, g2, g5 : 2
g1, g2, g4 : 1
g1, g3, g5 : 1
g1, g2, g3, g4 : 0
g1, g2, g3, g5 : 1
g1, g2, g4, g5 : 0
g2, g3, g4 : 0
g2, g3, g5 : 1
g2, g4, g5 : 0
-----------------------

The table L3:
g1, g2, g3 | 2
g1, g2, g5 | 2

C4:
-----------------------
g1, g2, g3, g5 : 1
-----------------------

The table L4:
(empty)

The frequent item sets, the table L3:
g1, g2, g3 | 2
g1, g2, g5 | 2

--------------------------------------
* Derived rule set:
--------------------------------------
{g1} --> {g2, g3} : 0.33
{g2} --> {g1, g3} : 0.29
{g3} --> {g1, g2} : 0.33
{g2, g3} --> {g1} : 0.5
{g1, g3} --> {g2} : 0.5
{g1, g2} --> {g3} : 0.5
{g1} --> {g2, g5} : 0.33
{g2} --> {g1, g5} : 0.29
{g5} --> {g1, g2} : 1.0
{g2, g5} --> {g1} : 1.0
{g1, g5} --> {g2} : 1.0
{g1, g2} --> {g5} : 0.5
--------------------------------------
* Significant rule set:
{g5} --> {g1, g2}
{g2, g5} --> {g1}
{g1, g5} --> {g2}
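The significance test can be sketched as follows: a rule X --> Y is significant when confidence(X --> Y) = support(X ∪ Y) / support(X) ≥ 0.7. The support counts below are copied from the tables above; the helper name is mine:

```python
# Support counts copied from the L1-L3 tables above (over the 9 sets).
support = {
    frozenset({"g5"}): 2,
    frozenset({"g1", "g2"}): 4,
    frozenset({"g1", "g5"}): 2,
    frozenset({"g2", "g5"}): 2,
    frozenset({"g1", "g2", "g5"}): 2,
}

def confidence(antecedent, consequent):
    """confidence(X --> Y) = support(X ∪ Y) / support(X)."""
    x = frozenset(antecedent)
    return support[x | frozenset(consequent)] / support[x]

# {g5} --> {g1, g2}: 2/2 = 1.0, significant; {g1, g2} --> {g5}: 2/4 = 0.5, not.
print(confidence({"g5"}, {"g1", "g2"}))
print(confidence({"g1", "g2"}, {"g5"}))
```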
3 Problem 3
Design a backpropagation learning algorithm for a 3-2-1 feedforward neural network. The given training set is (1,0,1) → 1. Discover the class label of all remaining patterns.
3.1 Theoretical Support
Backpropagation, an abbreviation for "backward propagation of errors", is a common method of training artificial neural networks, used in conjunction with an optimization method such as gradient descent. The backpropagation algorithm was originally introduced in the 1970s, but its importance wasn't fully appreciated until a famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams[5]. That paper describes several neural networks where backpropagation works far faster
than earlier approaches to learning, making it possible to use neural nets to solve problems which had previously been insoluble. Today, the backpropagation algorithm is the workhorse of learning in neural networks.
3.2 Algorithm
Initialize all weights with small random numbers (in the program, random numbers between 0 and 1 are used).
repeat
    for every pattern in the training set do
        Present the pattern to the network.
        for each layer in the network do
            for every node in the layer do
                1. Calculate the weighted sum of the inputs to the node.
                2. Add the threshold to the sum. The net input is I_j = Σ_i w_ij O_i + θ_j.
                3. Calculate the activation for the node. Typically the sigmoid function is used, giving the output of each node as O_j = 1 / (1 + e^(-I_j)).
            end
        end
        for every node in the output layer do
            Calculate the error signal as Err_j = O_j (1 - O_j)(T_j - O_j), where T_j is the true output.
        end
        for all hidden layers do
            for every node in the layer do
                1. Calculate the node's error signal as Err_j = O_j (1 - O_j) Σ_k Err_k w_jk, where w_jk is the weight of the connection from unit j to unit k (in the next higher layer) and Err_k is the error of unit k.
                2. Update each node's weight in the network.
            end
        end
        Calculate the global error function.
    end
until (number of iterations ≥ specified maximum) OR (error function ≤ specified threshold);

Algorithm 2: Backpropagation
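A minimal Python 3 sketch of the algorithm above for the 3-2-1 network, trained on the single pattern (1,0,1) → 1. The learning rate, iteration count and all names are my own choices, not the assignment's:

```python
import math
import random

random.seed(0)  # reproducible initial weights

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# 3 inputs -> 2 hidden nodes -> 1 output; weights start as small randoms.
w_hidden = [[random.random() for _ in range(3)] for _ in range(2)]
b_hidden = [random.random() for _ in range(2)]
w_out = [random.random() for _ in range(2)]
b_out = random.random()

def forward(x):
    """Return hidden activations and the network output for pattern x."""
    h = [sigmoid(sum(w * xi for w, xi in zip(w_hidden[j], x)) + b_hidden[j])
         for j in range(2)]
    o = sigmoid(sum(w * hj for w, hj in zip(w_out, h)) + b_out)
    return h, o

def train(x, target, lr=0.5, iters=2000):
    global b_out
    for _ in range(iters):
        h, o = forward(x)
        err_o = o * (1 - o) * (target - o)              # output error signal
        err_h = [h[j] * (1 - h[j]) * err_o * w_out[j]   # hidden error signals
                 for j in range(2)]
        for j in range(2):                              # update output layer
            w_out[j] += lr * err_o * h[j]
        b_out += lr * err_o
        for j in range(2):                              # update hidden layer
            for i in range(3):
                w_hidden[j][i] += lr * err_h[j] * x[i]
            b_hidden[j] += lr * err_h[j]

train((1, 0, 1), 1)
_, out = forward((1, 0, 1))
print(out)  # approaches the target 1 after training
```

Note that the hidden error signals are computed with the pre-update output weights, as the algorithm requires.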
3.3 Result
The snapshot of the result has been provided.
4 Problem 4
Given the following table of gene sequences, implement the k-means, k-medoid and fuzzy c-means clustering algorithms to generate the clusters. Tune the K or C values from 2 to 5. Consider index = (average intra-cluster distance) / (1 + average inter-cluster distance). Find out the best result.
4.1 Theoretical Support
Clustering can be considered the most important unsupervised learning problem; it deals with finding a structure in a collection of unlabelled data. A cluster is therefore a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters[1].

K-means[2] is one of the simplest unsupervised learning algorithms that solve the well known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. The next step is to take each point belonging to the data set and associate it with the nearest centroid. Then, for each cluster, we compute the mean coordinate and take it as the new cluster centre. This process continues till the cluster centres no longer change.
4.2 K-means Algorithm
Data:   E = {e_1, e_2, ..., e_n} → set of entities
        k → number of clusters
        MaxIters → limit of iterations
Result: C = {c_1, c_2, ..., c_k} → set of cluster centroids
        L = {l(e) | e ∈ E} → set of cluster labels of E

foreach c_i ∈ C do
    c_i ← e_j ∈ E (random selection)
end
foreach e_i ∈ E do
    l(e_i) ← argmin_j Distance(e_i, c_j), j ∈ {1, ..., k}
end
iter ← 0
repeat
    foreach c_i ∈ C do
        UpdateCluster(c_i)
    end
    changed ← false
    foreach e_i ∈ E do
        minDist ← argmin_j Distance(e_i, c_j), j ∈ {1, ..., k}
        if minDist ≠ l(e_i) then
            l(e_i) ← minDist
            changed ← true
        end
    end
    iter ← iter + 1
until changed = false or iter ≥ MaxIters;

Algorithm 3: K-Means Algorithm
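A minimal Python 3 sketch of Algorithm 3, together with the index from the problem statement. The 2-D sample points are hypothetical, since the assignment's gene table is not reproduced in this transcript, and the names are mine:

```python
import random

random.seed(1)  # reproducible initial centroids

def dist(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(points, k, max_iters=100):
    """Plain k-means (Algorithm 3): returns (centroids, labels)."""
    centroids = random.sample(points, k)        # random initial centres
    labels = [0] * len(points)
    for _ in range(max_iters):
        changed = False
        for i, p in enumerate(points):          # assign to nearest centroid
            nearest = min(range(k), key=lambda j: dist(p, centroids[j]))
            if nearest != labels[i]:
                labels[i] = nearest
                changed = True
        for j in range(k):                      # recompute cluster centres
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
        if not changed:
            break
    return centroids, labels

def index(points, centroids, labels):
    """index = (avg intra-cluster dist) / (1 + avg inter-cluster dist)."""
    intra = sum(dist(p, centroids[l])
                for p, l in zip(points, labels)) / len(points)
    pairs = [(i, j) for i in range(len(centroids))
             for j in range(i + 1, len(centroids))]
    inter = sum(dist(centroids[i], centroids[j]) for i, j in pairs) / len(pairs)
    return intra / (1 + inter)

# Hypothetical 2-D data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, labels = kmeans(points, 2)
print(labels, index(points, centroids, labels))
```

A lower index means tighter, better-separated clusters, so the best K (or C) is the one minimising it.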
4.3 Result
The initial points are chosen randomly. Then the number of clusters is varied from k = 2 to k = 5. Snapshots are shown for k = 2 and k = 3.
5 Discussion
All the programs were developed using Python 2.7.3 on Macintosh OS. The programs should run smoothly on Linux and Windows as well, though they were not tested in those environments. However, some programs will fail to run in Python 3, since there are some commands which
are different in the later version of Python.
References
[1] A Tutorial on Clustering Algorithms, http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/

[2] J. B. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1:281-297, 1967.

[3] J. C. Dunn, "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", Journal of Cybernetics, 3:32-57, 1973.

[4] J. C. Bezdek, "Pattern Recognition with Fuzzy Objective Function Algorithms", Plenum Press, New York, 1981.

[5] D. E. Rumelhart, G. E. Hinton, R. J. Williams, "Learning Representations by Back-propagating Errors", Nature, 323:533-536, 9 October 1986.

[6] Jiao Yabing, "Research of an Improved Apriori Algorithm in Data Mining Association Rules", International Journal of Computer and Communication Engineering, Vol. 2, No. 1, January 2013.