Bioinformatics Problems

This document contains some bioinformatics problems with their theoretical support, algorithms and snapshots of the solutions.
7/18/2019 Bioinformatics problems
http://slidepdf.com/reader/full/bioinformatics-problems 1/12
Bioinformatics Assignment
Ritajit Majumdar
M.Tech. 1st Semester, Computer Science and Engineering
Class Roll: 1
Exam Roll: 97/CSM/140001
Registration No: 0029169 of 2008-2009
March 26, 2015
Contents
1 Problem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
    1.1 Theoretical Support . . . . . . . . . . . . . . . . . . . . . . . . 1
    1.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
    1.3 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  2

2 Problem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
    2.1 Theoretical Support . . . . . . . . . . . . . . . . . . . . . . . . 4
    2.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
    2.3 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  5

3 Problem 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
    3.1 Theoretical Support . . . . . . . . . . . . . . . . . . . . . . . . 6
    3.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
    3.3 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  7

4 Problem 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
    4.1 Theoretical Support . . . . . . . . . . . . . . . . . . . . . . . . 8
    4.2 K-means Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 9
    4.3 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  9

5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1 Problem 1
Consider the following table, based on the experimental results of the Roth cancer research lab. The table consists of 8 genes with 3 attributes (viz., GO attribute, Expression Level and Pseudo Gene found) and one class label: Cancer Mediating. Write a program which can perform the following tasks:

a) Find out the test attribute and draw the decision tree.
b) Find out the class label of the gene with GO attribute > 40, Expression Level = Medium and Pseudo Gene found = No.
c) Construct a classifier that can predict the class label of unknown genes.
Gene-ID | GO attribute | Expression Level | Pseudo Gene found | Cancer Mediating
g1      | <= 30        | High             | No                | No
g2      | <= 30        | High             | No                | No
g3      | 31...40      | High             | No                | Yes
g4      | > 40         | Medium           | No                | Yes
g5      | > 40         | Low              | Yes               | Yes
g6      | > 40         | Low              | Yes               | No
g7      | 31...40      | Low              | Yes               | Yes
g8      | <= 30        | Medium           | No                | No
1.1 Theoretical Support
The decision tree is built by choosing, at each node, the test attribute with the highest information gain. For the classifier, naive Bayes is used: in machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features.
1.2 Algorithm
Let S be a set of s data samples. Suppose the class label attribute has m distinct values, where C_i denotes the i-th class label; the classes are C_1 to C_m, and s_i is the number of samples of S in class C_i. Hence, the expected information needed to classify a given sample is

    I(s_1, s_2, ..., s_m) = -Σ_i p_i log2(p_i)

where p_i is the probability that an arbitrary sample belongs to class C_i, estimated as p_i = s_i / s.
Let attribute A have v distinct values {a_1, a_2, ..., a_v}. This attribute divides the entire data set S into v subsets S_1, S_2, ..., S_v, where S_j contains those samples in S that have value a_j for A. Let s_ij be the number of samples of class C_i in subset S_j. Hence the entropy based on the partitioning into subsets by A is

    E(A) = Σ_j [(s_1j + ... + s_mj) / s] · I(s_1j, ..., s_mj)

where

    I(s_1j, ..., s_mj) = -Σ_i p_ij log2(p_ij),    p_ij = s_ij / |S_j|

Hence the gain of A is

    Gain(A) = I(s_1, s_2, ..., s_m) - E(A)
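As a sketch of how the test attribute can be found, the following Python 3 fragment (illustrative only; the variable and function names are mine, not those of the program shown in the Result snapshots) computes Gain(A) for each attribute of the table above:

```python
import math
from collections import Counter

# Rows of the Problem 1 table: (GO attribute, Expression Level,
# Pseudo Gene found, Cancer Mediating).
data = [
    ("<=30",    "High",   "No",  "No"),
    ("<=30",    "High",   "No",  "No"),
    ("31...40", "High",   "No",  "Yes"),
    (">40",     "Medium", "No",  "Yes"),
    (">40",     "Low",    "Yes", "Yes"),
    (">40",     "Low",    "Yes", "No"),
    ("31...40", "Low",    "Yes", "Yes"),
    ("<=30",    "Medium", "No",  "No"),
]

def info(labels):
    """I(s1, ..., sm) = -sum_i p_i * log2(p_i)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain(rows, attr):
    """Gain(A) = I(S) - E(A), E(A) being the weighted entropy of the split."""
    labels = [r[-1] for r in rows]
    e = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[-1] for r in rows if r[attr] == value]
        e += len(subset) / len(rows) * info(subset)
    return info(labels) - e

for i, name in enumerate(("GO attribute", "Expression Level", "Pseudo Gene found")):
    print(name, round(gain(data, i), 3))
```

On this table the GO attribute has the highest gain (about 0.656), so it becomes the test attribute at the root of the decision tree.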
Bayes’ Theorem
Let X be a data sample whose class label C is unknown. Let H be the hypothesis that the data sample X belongs to class C_i. For the classification problem we want to determine P(H | X), i.e. the probability that the hypothesis holds given the data sample X. P(H | X) is known as the posterior probability of H conditioned on X.
    P(H | X) = P(X | H) · P(H) / P(X)
1. Each data sample is represented by an n-dimensional feature vector X = (x_1, x_2, ..., x_n) with n attributes.

2. Suppose there are m classes C_1, C_2, ..., C_m. According to the theorem, an unknown sample X belongs to class C_i iff

       P(C_i | X) > P(C_j | X) for all j ≠ i, 1 ≤ j ≤ m

   where

       P(C_i | X) = P(X | C_i) · P(C_i) / P(X)

3. P(X) is constant for all classes, hence only P(X | C_i) · P(C_i) needs to be maximised, with the prior estimated as P(C_i) = s_i / s.
4. Under the naive independence assumption, P(X | C_i) = Π_k P(x_k | C_i). Each factor of the product can be estimated in the following way:

   • if A_k is categorical, then P(x_k | C_i) = s_ik / s_i, where s_ik is the number of samples of class C_i having value x_k for A_k;

   • if A_k is continuous, then apply a Gaussian distribution:

         P(x_k | C_i) = (1 / (√(2π) σ_Ci)) · exp(-(x_k - μ_Ci)² / (2σ²_Ci))
5. Thus the unknown sample X is assigned class label C_i iff

       P(C_i | X) > P(C_j | X) for all j ≠ i, 1 ≤ j ≤ m
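Steps 1-5 can be sketched in Python 3 for the Problem 1 table. This is an illustrative implementation, not the assignment's actual program; the function and variable names are mine:

```python
from collections import Counter

# Rows of the Problem 1 table: attribute values and the Cancer Mediating label.
data = [
    (("<=30",    "High",   "No"),  "No"),
    (("<=30",    "High",   "No"),  "No"),
    (("31...40", "High",   "No"),  "Yes"),
    ((">40",     "Medium", "No"),  "Yes"),
    ((">40",     "Low",    "Yes"), "Yes"),
    ((">40",     "Low",    "Yes"), "No"),
    (("31...40", "Low",    "Yes"), "Yes"),
    (("<=30",    "Medium", "No"),  "No"),
]

def naive_bayes(sample):
    """Return the class C_i maximising P(X | C_i) * P(C_i)."""
    s = len(data)
    class_counts = Counter(label for _, label in data)     # s_i per class
    best, best_score = None, -1.0
    for ci, si in class_counts.items():
        score = si / s                                     # prior P(C_i)
        for k, xk in enumerate(sample):
            sik = sum(1 for x, label in data
                      if label == ci and x[k] == xk)       # s_ik
            score *= sik / si                              # P(x_k | C_i)
        if score > best_score:
            best, best_score = ci, score
    return best

# Part (b): GO attribute > 40, Expression Level = Medium, Pseudo Gene found = No.
print(naive_bayes((">40", "Medium", "No")))
```

For the query of part (b) this estimate favours the class Yes (score 1/32 against 3/128 for No).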
1.3 Result
First the snapshot of decision tree generation is provided.
The snapshot of the result has been provided below.
2 Problem 2
Consider the following table:

Set 1 | g1 g2 g5
Set 2 | g2 g4
Set 3 | g2 g3
Set 4 | g1 g2 g4
Set 5 | g1 g3
Set 6 | g2 g3
Set 7 | g1 g3
Set 8 | g1 g2 g3 g5
Set 9 | g1 g2 g3
Find out the frequent item sets for support count 2. Also find out the set of significant rules from the frequent item sets (a confidence level ≥ 70% signifies a significant rule).
2.1 Theoretical Support
The Apriori algorithm is the original algorithm for mining the frequent item sets of Boolean association rules, proposed by R. Agrawal and R. Srikant in 1994. Its core principles are that the subsets of frequent item sets are frequent, and the supersets of infrequent item sets are infrequent. It is regarded as the most typical data mining algorithm[6].
2.2 Algorithm
The algorithm is provided in pseudo-code format, as obtained from Wikipedia.

L_1 ← {large 1-itemsets}
k ← 2
while L_{k-1} ≠ Ø do
    C_k ← {a ∪ {b} | a ∈ L_{k-1} ∧ b ∈ ∪L_{k-1} ∧ b ∉ a}
    for transactions t ∈ T do
        C_t ← {c | c ∈ C_k ∧ c ⊆ t}
        for candidates c ∈ C_t do
            count[c] ← count[c] + 1
        end
    end
    L_k ← {c | c ∈ C_k ∧ count[c] ≥ ε}
    k ← k + 1
end
return ∪_k L_k

Algorithm 1: Apriori(T, ε)
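A minimal Python 3 sketch of the pseudo-code above, applied to the transactions of Problem 2 (the candidate-pruning subset test is omitted for brevity, and the names are mine, not the assignment program's):

```python
# Transactions from the Problem 2 table.
transactions = [
    {"g1", "g2", "g5"}, {"g2", "g4"}, {"g2", "g3"},
    {"g1", "g2", "g4"}, {"g1", "g3"}, {"g2", "g3"},
    {"g1", "g3"}, {"g1", "g2", "g3", "g5"}, {"g1", "g2", "g3"},
]

def apriori(transactions, min_support):
    """Return {frozenset: support count} for every frequent item set."""
    level = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        # Join step: merge frequent k-itemsets that share k-1 items.
        level = {a | b for a in current for b in current
                 if len(a | b) == len(a) + 1}
    return frequent

freq = apriori(transactions, 2)
for itemset in sorted(freq, key=lambda s: (len(s), sorted(s))):
    print(", ".join(sorted(itemset)), "|", freq[itemset])
```

The support counts it produces match the L1-L3 tables in the Result section below.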
2.3 Result
The table L1:
g1 | 6
g2 | 7
g3 | 6
g4 | 2
g5 | 2

C2:
-----------------------
g1, g2 : 4
g1, g3 : 4
g1, g4 : 1
g1, g5 : 2
g2, g3 : 4
g2, g4 : 2
g2, g5 : 2
g3, g4 : 0
g3, g5 : 1
g4, g5 : 0
-----------------------

The table L2:
g1, g2 | 4
g1, g3 | 4
g1, g5 | 2
g2, g3 | 4
g2, g4 | 2
g2, g5 | 2
C3:
-----------------------
g1, g2, g3 : 2
g1, g2, g5 : 2
g1, g2, g4 : 1
g1, g3, g5 : 1
g1, g2, g3, g4 : 0
g1, g2, g3, g5 : 1
g1, g2, g4, g5 : 0
g2, g3, g4 : 0
g2, g3, g5 : 1
g2, g4, g5 : 0
-----------------------

The table L3:
g1, g2, g3 | 2
g1, g2, g5 | 2

C4:
-----------------------
g1, g2, g3, g5 : 1
-----------------------

The table L4:
(empty)

The frequent item sets, the table L3:
g1, g2, g3 | 2
g1, g2, g5 | 2

--------------------------------------
* Derived rule set:
--------------------------------------
{g1} --> {g2, g3} : 0.33
{g2} --> {g1, g3} : 0.29
{g3} --> {g1, g2} : 0.33
{g2, g3} --> {g1} : 0.5
{g1, g3} --> {g2} : 0.5
{g1, g2} --> {g3} : 0.5
{g1} --> {g2, g5} : 0.33
{g2} --> {g1, g5} : 0.29
{g5} --> {g1, g2} : 1.0
{g2, g5} --> {g1} : 1.0
{g1, g5} --> {g2} : 1.0
{g1, g2} --> {g5} : 0.5
--------------------------------------
* Significant rule set:
{g5} --> {g1, g2}
{g2, g5} --> {g1}
{g1, g5} --> {g2}
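The significance test can be sketched as follows: a rule X --> Y is significant when confidence(X --> Y) = support(X ∪ Y) / support(X) ≥ 0.7. The support counts below are copied from the tables above; the helper name is mine:

```python
# Support counts copied from the L1-L3 tables above (over the 9 sets).
support = {
    frozenset({"g5"}): 2,
    frozenset({"g1", "g2"}): 4,
    frozenset({"g1", "g5"}): 2,
    frozenset({"g2", "g5"}): 2,
    frozenset({"g1", "g2", "g5"}): 2,
}

def confidence(antecedent, consequent):
    """confidence(X --> Y) = support(X ∪ Y) / support(X)."""
    x = frozenset(antecedent)
    return support[x | frozenset(consequent)] / support[x]

# {g5} --> {g1, g2}: 2/2 = 1.0, significant; {g1, g2} --> {g5}: 2/4 = 0.5, not.
print(confidence({"g5"}, {"g1", "g2"}))
print(confidence({"g1", "g2"}, {"g5"}))
```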
3 Problem 3
Design a backpropagation learning algorithm for a 3-2-1 feedforward neural network. The given training set is (1,0,1) → 1. Discover the class label of all remaining patterns.
3.1 Theoretical Support
Backpropagation, an abbreviation for "backward propagation of errors", is a common method of training artificial neural networks, used in conjunction with an optimization method such as gradient descent. The backpropagation algorithm was originally introduced in the 1970s, but its importance wasn't fully appreciated until a famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams[5]. That paper describes several neural networks where backpropagation works far faster
than earlier approaches to learning, making it possible to use neural nets to solve problems which had previously been insoluble. Today, the backpropagation algorithm is the workhorse of learning in neural networks.
3.2 Algorithm
Initialize all weights with small random numbers (in the program, random numbers between 0 and 1 are used).
repeat
    for every pattern in the training set do
        Present the pattern to the network.
        for each layer in the network do
            for every node in the layer do
                1. Calculate the weighted sum of the inputs to the node.
                2. Add the threshold to the sum. The net input is I_j = Σ_i w_ij O_i + θ_j.
                3. Calculate the activation for the node. Typically the sigmoid function is used, giving the output of each node as O_j = 1 / (1 + e^(-I_j)).
            end
        end
        for every node in the output layer do
            Calculate the error signal as Err_j = O_j (1 - O_j)(T_j - O_j), where T_j is the true output.
        end
        for all hidden layers do
            for every node in the layer do
                1. Calculate the node's error signal as Err_j = O_j (1 - O_j) Σ_k Err_k w_jk, where w_jk is the weight of the connection from unit j to unit k (in the next higher layer) and Err_k is the error of unit k.
                2. Update each node's weight in the network.
            end
        end
        Calculate the global error function.
    end
until (number of iterations ≥ specified maximum) OR (error function ≤ specified threshold);

Algorithm 2: Backpropagation
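A minimal Python 3 sketch of the algorithm above for the 3-2-1 network, trained on the single pattern (1,0,1) → 1. The learning rate, iteration count and all names are my own choices, not the assignment's:

```python
import math
import random

random.seed(0)  # reproducible initial weights

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# 3 inputs -> 2 hidden nodes -> 1 output; weights start as small randoms.
w_hidden = [[random.random() for _ in range(3)] for _ in range(2)]
b_hidden = [random.random() for _ in range(2)]
w_out = [random.random() for _ in range(2)]
b_out = random.random()

def forward(x):
    """Return hidden activations and the network output for pattern x."""
    h = [sigmoid(sum(w * xi for w, xi in zip(w_hidden[j], x)) + b_hidden[j])
         for j in range(2)]
    o = sigmoid(sum(w * hj for w, hj in zip(w_out, h)) + b_out)
    return h, o

def train(x, target, lr=0.5, iters=2000):
    global b_out
    for _ in range(iters):
        h, o = forward(x)
        err_o = o * (1 - o) * (target - o)              # output error signal
        err_h = [h[j] * (1 - h[j]) * err_o * w_out[j]   # hidden error signals
                 for j in range(2)]
        for j in range(2):                              # update output layer
            w_out[j] += lr * err_o * h[j]
        b_out += lr * err_o
        for j in range(2):                              # update hidden layer
            for i in range(3):
                w_hidden[j][i] += lr * err_h[j] * x[i]
            b_hidden[j] += lr * err_h[j]

train((1, 0, 1), 1)
_, out = forward((1, 0, 1))
print(out)  # approaches the target 1 after training
```

Note that the hidden error signals are computed with the pre-update output weights, as the algorithm requires.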
3.3 Result
The snapshot of the result has been provided.
4 Problem 4
Given the following table of gene sequences, implement the k-means, k-medoid and fuzzy c-means clustering algorithms to generate the clusters. Tune the K or C values from 2 to 5. Consider index = (average intra-cluster distance) / (1 + average inter-cluster distance). Find out the best result.
4.1 Theoretical Support
Clustering can be considered the most important unsupervised learning problem; it deals with finding a structure in a collection of unlabelled data. A cluster is therefore a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters[1].

K-means[2] is one of the simplest unsupervised learning algorithms that solve the well known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. The next step is to take each point belonging to the data set and associate it with the nearest centroid. Then, for each cluster, we compute the mean coordinate and take it as the new cluster centre. This process continues till the cluster centres no longer change.
4.2 K-means Algorithm
Data:   E = {e_1, e_2, ..., e_n} → set of entities
        k → number of clusters
        MaxIters → limit of iterations
Result: C = {c_1, c_2, ..., c_k} → set of cluster centroids
        L = {l(e) | e ∈ E} → set of cluster labels of E

foreach c_i ∈ C do
    c_i ← e_j ∈ E (random selection)
end
foreach e_i ∈ E do
    l(e_i) ← argmin_j Distance(e_i, c_j), j ∈ {1, ..., k}
end
iter ← 0
repeat
    foreach c_i ∈ C do
        UpdateCluster(c_i)
    end
    changed ← false
    foreach e_i ∈ E do
        minDist ← argmin_j Distance(e_i, c_j), j ∈ {1, ..., k}
        if minDist ≠ l(e_i) then
            l(e_i) ← minDist
            changed ← true
        end
    end
    iter ← iter + 1
until changed = false or iter ≥ MaxIters;

Algorithm 3: K-Means Algorithm
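A minimal Python 3 sketch of Algorithm 3, together with the index from the problem statement. The 2-D sample points are hypothetical, since the assignment's gene table is not reproduced in this transcript, and the names are mine:

```python
import random

random.seed(1)  # reproducible initial centroids

def dist(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(points, k, max_iters=100):
    """Plain k-means (Algorithm 3): returns (centroids, labels)."""
    centroids = random.sample(points, k)        # random initial centres
    labels = [0] * len(points)
    for _ in range(max_iters):
        changed = False
        for i, p in enumerate(points):          # assign to nearest centroid
            nearest = min(range(k), key=lambda j: dist(p, centroids[j]))
            if nearest != labels[i]:
                labels[i] = nearest
                changed = True
        for j in range(k):                      # recompute cluster centres
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
        if not changed:
            break
    return centroids, labels

def index(points, centroids, labels):
    """index = (avg intra-cluster dist) / (1 + avg inter-cluster dist)."""
    intra = sum(dist(p, centroids[l])
                for p, l in zip(points, labels)) / len(points)
    pairs = [(i, j) for i in range(len(centroids))
             for j in range(i + 1, len(centroids))]
    inter = sum(dist(centroids[i], centroids[j]) for i, j in pairs) / len(pairs)
    return intra / (1 + inter)

# Hypothetical 2-D data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, labels = kmeans(points, 2)
print(labels, index(points, centroids, labels))
```

A lower index means tighter, better-separated clusters, so the best K (or C) is the one minimising it.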
4.3 Result
The initial points are chosen randomly. Then the number of clusters is varied from k = 2 to k = 5. Snapshots are shown for k = 2 and k = 3.
5 Discussion
All the programs were developed using Python 2.7.3 on Macintosh OS. The programs should run smoothly on Linux and Windows as well, though they were not tested in those environments. However, some programs will fail to run in Python 3, since there are some commands which
are different in the later version of Python.
References
[1] A Tutorial on Clustering Algorithms, http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/

[2] J. B. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1:281-297, 1967.

[3] J. C. Dunn, "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", Journal of Cybernetics, 3:32-57, 1973.

[4] J. C. Bezdek, "Pattern Recognition with Fuzzy Objective Function Algorithms", Plenum Press, New York, 1981.

[5] D. E. Rumelhart, G. E. Hinton, R. J. Williams, "Learning Representations by Back-propagating Errors", Nature, 323:533-536, 9 October 1986.

[6] Jiao Yabing, "Research of an Improved Apriori Algorithm in Data Mining Association Rules", International Journal of Computer and Communication Engineering, Vol. 2, No. 1, January 2013.