doctoral thesis dissertation 2014-03-20 @polimi

Computational Prediction of Gene Functions through Machine Learning Methods and Multiple Validation Procedures. Candidate: Davide Chicco. Supervisor: Marco Masseroli. PhD thesis defense, 20 March 2014.


DESCRIPTION

Slides of my doctoral thesis dissertation talk, given on 20 March 2014 at Politecnico di Milano. Title: "Computational prediction of gene functions through machine learning methods and multiple validation procedures"

TRANSCRIPT

Page 1: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Computational Prediction of Gene Functions through Machine Learning methods and Multiple Validation Procedures

candidate: Davide Chicco

supervisor: Marco Masseroli

PhD Thesis Defense Dissertation, 20th March 2014

Page 2: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

“Computational Prediction of Gene Functions

through Machine Learning methods

and Multiple Validation Procedures”

1) Analyzed scientific problem

2) Machine learning methods used

3) Validation procedures

4) Main results

5) Annotation list correlation measures

6) Novelty indicator

7) Final list of likely predicted annotations

8) Conclusions

Page 3: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Biomolecular annotations

• The concept of annotation: the association of nucleotide or amino acid sequences with useful information describing their features

• The association of a gene with an information feature term corresponds to a biomolecular annotation

• This information is expressed through controlled vocabularies, sometimes structured as ontologies (e.g. Gene Ontology), where every controlled term of the vocabulary is associated with a unique alphanumeric code

[Diagram: a Gene linked to a Biological function feature by an Annotation (gene2bff)]

Page 4: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Biomolecular annotations

• The association of an information/feature with a gene ID constitutes an annotation

• Annotation example:

• Scientific fact: “the gene GD4 is present in the mitochondrial membrane”

• Corresponds to the coupling: <GD4, mitochondrial membrane>

[Diagram: GD4 linked to mitochondrial membrane]

Page 5: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

The problem

• Many annotations are available in different databanks

• However, the available annotations are incomplete

• Only a few of them represent highly reliable, human-curated information

• In vitro experiments are expensive (e.g. 1,000 € and 3 weeks)

• To support and speed up the time-consuming curation process, prioritized lists of computationally predicted annotations are extremely useful

• These lists can be generated by software based on Machine Learning algorithms

Page 6: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

The problem

• Other researchers have addressed the problem in the past by using:

• Support Vector Machines (SVM) [Barutcuoglu et al., 2006]

• k-nearest neighbor algorithm (kNN) [Tao et al., 2007]

• Decision trees [King et al., 2003]

• Hidden Markov models (HMM) [Mi et al., 2013]

• …

• These methods were all good at stating whether a predicted annotation was correct or not, but were not able to make extrapolations, that is, to suggest new annotations absent from the input dataset

Page 7: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

The software

[Pipeline diagram: Data reading → input matrix A → Statistical method → output matrix A~ → Predicted annotation lists]

BioAnnotationPredictor: a pipeline of steps and tools to predict, validate and analyze biomolecular annotation lists

Page 8: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

• The software reads the data from the GPDW database

• The software creates the input matrix:

Input annotation matrix A ∈ {0, 1}^(m × n)

m rows: genes

n columns: annotation features

A(i,j) = 1 if gene i is annotated to feature j or to any descendant of j in the considered ontology structure (true path rule)

A(i,j) = 0 otherwise (it is unknown)

          feat 1   feat 2   feat 3   feat 4   …   feat N
gene 1    0        0        0        0        …   0
gene 2    0        1        1        0        …   1
…         …        …        …        …        …   …
gene M    0        0        0        0        …   0
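The construction of the input matrix can be sketched as follows. This is a minimal illustration, not the BioAnnotationPredictor code: the function name `build_annotation_matrix`, the `parents` dictionary describing the ontology, and the toy gene and term names are all assumptions made for the example.

```python
import numpy as np

def build_annotation_matrix(genes, terms, direct, parents):
    """Build the binary gene-by-term matrix A.

    direct  : iterable of (gene, term) pairs taken from the database
    parents : dict mapping a term to its parent terms in the ontology
              (hypothetical representation of the DAG)
    """
    row = {g: i for i, g in enumerate(genes)}
    col = {t: j for j, t in enumerate(terms)}
    A = np.zeros((len(genes), len(terms)), dtype=np.int8)
    for gene, term in direct:
        # True path rule: an annotation to a term also holds
        # for every ancestor of that term.
        stack, seen = [term], set()
        while stack:
            t = stack.pop()
            if t in seen:
                continue
            seen.add(t)
            A[row[gene], col[t]] = 1
            stack.extend(parents.get(t, ()))
    return A

# Toy ontology: root <- membrane <- mito_membrane (names are illustrative)
genes = ["g1", "g2"]
terms = ["root", "membrane", "mito_membrane"]
parents = {"membrane": ["root"], "mito_membrane": ["membrane"]}
direct = [("g2", "mito_membrane")]
A = build_annotation_matrix(genes, terms, direct, parents)
print(A)
```

A single direct annotation of `g2` to the most specific term fills the whole path up to the root, while `g1` stays at 0 (unknown).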

Page 9: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

• The software applies a statistical method (Truncated Singular Value Decomposition, Semantically Improved SVD with gene clustering, or Semantically Improved SVD with gene clustering and term-term similarity weights) to the binary input matrix A

• It returns a real-valued output matrix A~

• Every element of the A matrix is compared to its corresponding element of the A~ matrix

Page 10: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

• After the computation, we compare each element Aij to the corresponding element Aij~

Input A:

0     0     0     0     …   0
0     1     1     0     …   1
…     …     …     …     …   …
0     0     0     0     …   0

Output A~:

0.1   0.3   0.6   0.5   …   0.2
0.6   0.8   0.1   0.9   …   0.8
…     …     …     …     …   …
0.3   0.2   0.4   0.6   …   0.8

Input   Output   Condition              Class
Yes     Yes      Aij = 1 & Aij~ > τ     AC (TP)
Yes     No       Aij = 1 & Aij~ ≤ τ     AR (FN)
No      No       Aij = 0 & Aij~ ≤ τ     NAC (TN)
No      Yes      Aij = 0 & Aij~ > τ     AP (FP)

AC: Annotation Confirmed; AR: Annotation to be Reviewed

NAC: No Annotation Confirmed; AP: Annotation Predicted

τ: the threshold that minimizes the sum APs + ARs
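The four classes and the choice of τ can be illustrated with a short sketch. The function names `classify` and `best_threshold`, the small matrices, and the grid of candidate thresholds are assumptions for the example; the slides only state that τ minimizes the sum APs + ARs.

```python
import numpy as np

def classify(A, A_tilde, tau):
    """Count the four classes for a given threshold tau."""
    present = A == 1
    above = A_tilde > tau
    return {
        "AC": int(np.sum(present & above)),     # annotation confirmed (TP)
        "AR": int(np.sum(present & ~above)),    # to be reviewed (FN)
        "NAC": int(np.sum(~present & ~above)),  # no annotation confirmed (TN)
        "AP": int(np.sum(~present & above)),    # annotation predicted (FP)
    }

def best_threshold(A, A_tilde, candidates):
    """Return the candidate tau with the smallest AP + AR (first one on ties)."""
    def cost(tau):
        counts = classify(A, A_tilde, tau)
        return counts["AP"] + counts["AR"]
    return min(candidates, key=cost)

A = np.array([[0, 0, 0, 0],
              [0, 1, 1, 0]])
A_tilde = np.array([[0.1, 0.3, 0.6, 0.5],
                    [0.6, 0.8, 0.1, 0.9]])

counts = classify(A, A_tilde, tau=0.5)
tau = best_threshold(A, A_tilde, [0.1, 0.3, 0.5, 0.7, 0.9])
print(counts, tau)
```

Searching over a fixed grid is only one way to pick τ; any one-dimensional search over the observed values of A~ would serve the same purpose.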

Page 11: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

AC: Annotation Confirmed

AR: Annotation to be Reviewed

NAC: No Annotation Confirmed

AP: Annotation Predicted

• The Annotations Predicted, AP (FP), are the annotations absent from the input and predicted by our software: we suggest them as present

• We record them in ranked lists:

Rank   Annotation ID   Likelihood value
1      218405          0.9742584
2      222571          0.8545574
…      …               …
n      203145          0.1673128
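A ranked list like the one above can be produced from A and A~ in a few lines. This is a sketch under assumptions: the function name `ranked_predictions` is made up, and entries are identified by (gene index, term index) pairs rather than by the database annotation IDs used in the talk.

```python
import numpy as np

def ranked_predictions(A, A_tilde, tau):
    """Return AP entries as (rank, gene index, term index, likelihood),
    sorted by decreasing likelihood value."""
    rows, cols = np.nonzero((A == 0) & (A_tilde > tau))
    scores = A_tilde[rows, cols]
    order = np.argsort(-scores, kind="stable")  # highest score first
    return [(rank + 1, int(rows[i]), int(cols[i]), float(scores[i]))
            for rank, i in enumerate(order)]

A = np.array([[0, 0, 0, 0],
              [0, 1, 1, 0]])
A_tilde = np.array([[0.1, 0.3, 0.6, 0.5],
                    [0.6, 0.8, 0.1, 0.9]])

ranking = ranked_predictions(A, A_tilde, tau=0.5)
for entry in ranking:
    print(entry)
```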

Page 12: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Truncated Singular Value Decomposition (tSVD)

• An annotation prediction is performed by computing a reduced rank approximation A~ of the annotation matrix A (where 0 < k < r, with r the number of non-zero singular values of A, i.e. the rank of A)

Page 13: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Truncated Singular Value Decomposition (tSVD)

• Only the k most «important» singular values of A, with their singular vectors, are used for the reconstruction (where 0 < k < r, with r the number of non-zero singular values of A, i.e. the rank of A)

• In [P. Khatri et al., "A semantic analysis of the annotations of the human genome", Bioinformatics, 2005], the authors argued that the study of the matrix A shows the semantic relationships of the gene-function associations

• A large value of a~ij suggests that gene i should be annotated to term j, whereas a value close to zero suggests the opposite
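The reduced rank approximation is the standard truncated SVD, A~ = U_k Σ_k V_k^T, which keeps the k largest singular values. A minimal NumPy sketch follows; the function name and the toy matrix are assumptions for illustration, not the thesis software.

```python
import numpy as np

def truncated_svd_reconstruction(A, k):
    """Rank-k approximation of A built from its k largest singular values."""
    U, s, Vt = np.linalg.svd(A.astype(float), full_matrices=False)
    r = int(np.sum(s > 1e-10))  # numerical rank of A
    if not 0 < k < r:
        raise ValueError(f"k must satisfy 0 < k < r (r = {r})")
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Toy binary annotation matrix (4 genes x 4 terms, rank 3)
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]])

A_tilde = truncated_svd_reconstruction(A, k=2)
print(np.round(A_tilde, 2))
```

Entries of `A_tilde` that are large where `A` is 0 are the candidates for predicted annotations.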

Page 14: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Truncated Singular Value Decomposition (tSVD)

• We started from this method, developed by Khatri et al. (2005) at Wayne State University, Detroit, and implemented it

• Improvement:

• Khatri et al. used a fixed SVD truncation level k = 500

• We developed a method for the automated, data-driven selection of k, based on the Receiver Operating Characteristic (ROC) curve

• We obtained better results, shown in several publications

Page 15: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Truncated SVD with gene clustering (SIM1)

• Semantically improved (SIM1) version of the Truncated SVD, based on gene clustering [P. Drineas et al., "Clustering large graphs via the singular value decomposition", Machine Learning, 2004]

• Inspiring idea: similar genes can be grouped in clusters, which have different weights

Page 16: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Truncated SVD with gene clustering (SIM1)

1. We choose a number C of clusters, and completely discard the columns j of matrix U where j = C+1, ..., n (we have an algorithm for the choice of C)

2. Each column uc of the SVD matrix U represents a cluster, and the value U(i,c) indicates the membership of gene i in the c-th cluster

3. For each cluster, first we generate Wc = diag(uc), and then the modified gene-to-term matrix Ac = Wc A, in which the i-th row of A is weighted by the membership score of the corresponding gene in the c-th cluster

Page 17: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Truncated SVD with gene clustering (SIM1)

4. Then, we compute Tc = Ac^T Ac, and its SVD(Tc)

5. Then, every row ai~ of the A~ matrix is computed considering the c-th cluster that minimizes its Euclidean norm distance to the original vector: ai~ = ai * Vk,c,i * Vk,c,i^T

6. The output matrix is produced
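Steps 1 to 6 of SIM1 can be read as the following sketch. It is one possible interpretation of the slides, not the thesis implementation: the function name `sim1_reconstruction` is made up, C and k are passed in by hand (the talk mentions a separate algorithm for choosing C), and Vk,c is taken to be the first k right singular vectors of Tc.

```python
import numpy as np

def sim1_reconstruction(A, C, k):
    """Sketch of SIM1: per-cluster projections, best cluster chosen per gene."""
    A = A.astype(float)
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    projectors = []
    for c in range(C):                    # steps 1-2: first C columns of U
        Wc = np.diag(U[:, c])             # step 3: membership weights
        Ac = Wc @ A                       # weighted gene-to-term matrix
        Tc = Ac.T @ Ac                    # step 4: term-to-term matrix
        _, _, Vt = np.linalg.svd(Tc)
        Vk = Vt[:k, :].T                  # first k singular vectors of Tc
        projectors.append(Vk @ Vk.T)
    A_tilde = np.empty_like(A)
    for i, a in enumerate(A):             # step 5: row by row
        candidates = [a @ P for P in projectors]
        distances = [np.linalg.norm(a - cand) for cand in candidates]
        A_tilde[i] = candidates[int(np.argmin(distances))]
    return A_tilde                        # step 6: output matrix

A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]])
A_tilde = sim1_reconstruction(A, C=2, k=2)
print(np.round(A_tilde, 2))
```

Because each candidate is an orthogonal projection of the gene row, the reconstructed row can never be farther from the original than the zero vector is.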

Page 18: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Truncated SVD with gene clustering and term-similarity weights (SIM2)

• Semantically improved (SIM2) version of the Truncated SVD, based on gene clustering and term-term similarity weights [P. Resnik, "Using information content to evaluate semantic similarity in a taxonomy", arXiv.org, 1995]

• Inspiring idea: functionally similar terms should be annotated to the same genes

Page 19: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Truncated SVD with gene clustering and term-similarity weights (SIM2)

In the algorithm shown before, we add the following step:

6. a) Furthermore, to obtain a more accurate clustering, we compute the eigenvectors of the matrix G~ = A S A^T, where the real n × n matrix S is the term similarity matrix. Starting from a pair of ontology terms, j1 and j2, the term functional similarity S(j1, j2) can be calculated using different methods.

Similarity is based on the Resnik measure [P. Resnik, "Using information content to evaluate semantic similarity in a taxonomy", arXiv.org, 1995]
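The additional SIM2 step can be sketched as follows. The Resnik measure scores two terms by the information content, -log p(t), of their most informative common ancestor. Everything else here is an assumption for illustration: p(t) is estimated as the fraction of genes annotated to t, terms never used get information content 0, and the ontology, function names and matrix are toy examples.

```python
import math
import numpy as np

def ancestors(term, parents):
    """The term itself plus all of its ancestors in the DAG."""
    found, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in found:
            found.add(t)
            stack.extend(parents.get(t, ()))
    return found

def resnik_matrix(A, terms, parents):
    """Term-term similarity matrix S based on the Resnik measure."""
    p = A.sum(axis=0) / A.shape[0]  # assumed estimate of p(t)
    ic = {t: (-math.log(p[j]) if p[j] > 0 else 0.0)
          for j, t in enumerate(terms)}
    anc = [ancestors(t, parents) for t in terms]
    n = len(terms)
    S = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            common = anc[a] & anc[b]
            S[a, b] = max((ic[t] for t in common), default=0.0)
    return S

terms = ["root", "membrane", "mito_membrane", "nucleus"]
parents = {"membrane": ["root"], "mito_membrane": ["membrane"],
           "nucleus": ["root"]}
A = np.array([[1, 1, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0]])

S = resnik_matrix(A, terms, parents)
G = A @ S @ A.T                              # G~ = A S A^T (genes x genes)
eigenvalues, eigenvectors = np.linalg.eigh(G)  # G is symmetric
print(np.round(S, 3))
```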

Page 20: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Other methods

With some colleagues at Politecnico di Milano we also implemented other methods (not included in this thesis):

• Probabilistic Latent Semantic Analysis (pLSA)

• Latent Dirichlet Allocation with Gibbs sampling (LDA)

With some colleagues at the University of California, Irvine, we have been trying to design and implement other models:

• Auto-Encoder Deep Neural Network

Page 21: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

• After the computation, we compare each element Aij to the corresponding element Aij~

Input   Output   Condition              Class
Yes     Yes      Aij = 1 & Aij~ > τ     AC (TP)
Yes     No       Aij = 1 & Aij~ ≤ τ     AR (FN)
No      No       Aij = 0 & Aij~ ≤ τ     NAC (TN)
No      Yes      Aij = 0 & Aij~ > τ     AP (FP)

AC: Annotation Confirmed; AR: Annotation to be Reviewed

NAC: No Annotation Confirmed; AP: Annotation Predicted

τ: the threshold that minimizes the sum APs + ARs

Page 22: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

ROC Analysis Validation

• These four result classes can be considered similar to TP, FN, TN, FP:

AC: Annotation Confirmed (TP)

AR: Annotation to be Reviewed (FN)

NAC: No Annotation Confirmed (TN)

AP: Annotation Predicted (FP)

• The software depicts ROC curves:

AC rate = AC / (AC + AR)

AP rate = AP / (AP + NAC)

Page 23: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

ROC Analysis Validation

• Ten-fold cross validation

• The software depicts the ROC curve:

AC rate = AC / (AC + AR)

AP rate = AP / (AP + NAC)

• Compute the Area Under the Curve (AUC)

• If AUC ≥ 2/3 (66.67%), then good matrix reconstruction

• Otherwise, bad matrix reconstruction
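The ROC analysis can be sketched by sweeping the threshold τ, computing the AC rate and AP rate at each value, and integrating with the trapezoidal rule. The function names and the threshold grid are assumptions; the sketch also assumes that A contains at least one 1 and one 0, and it leaves out the ten-fold cross validation.

```python
import numpy as np

def roc_points(A, A_tilde, thresholds):
    """(AP rate, AC rate) pairs, one per threshold, sorted by AP rate."""
    points = []
    for tau in thresholds:
        above = A_tilde > tau
        ac = np.sum((A == 1) & above)
        ar = np.sum((A == 1) & ~above)
        ap = np.sum((A == 0) & above)
        nac = np.sum((A == 0) & ~above)
        points.append((float(ap / (ap + nac)), float(ac / (ac + ar))))
    return sorted(points)

def auc(points):
    """Area under the ROC curve, closed with the corners (0,0) and (1,1)."""
    xs = [0.0] + [x for x, _ in points] + [1.0]
    ys = [0.0] + [y for _, y in points] + [1.0]
    return sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2
               for i in range(len(xs) - 1))

A = np.array([[0, 0, 0, 0],
              [0, 1, 1, 0]])
A_tilde = np.array([[0.1, 0.3, 0.6, 0.5],
                    [0.6, 0.8, 0.1, 0.9]])

area = auc(roc_points(A, A_tilde, np.linspace(0.0, 1.0, 11)))
good_reconstruction = area >= 2 / 3
print(area, good_reconstruction)
```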

Page 24: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Database Validation

Since more recent database versions contain better data and information, we:

• Compute the prediction of annotations on a former database version (e.g. July 2009)

• Compare these predictions to a newer version of that database (e.g. March 2013)

• More Annotations Predicted found in the new version => better predictions

• Percentage of accuracy

[Diagram: database version July 2009 -> March 2013]
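The database version comparison reduces to a set intersection. In this sketch the function name and the (gene, term) identifiers are hypothetical; the real procedure must first map annotation IDs between the two database versions.

```python
def database_validation(predicted, newer_version):
    """Percentage of predicted annotations found in the newer database version."""
    predicted = set(predicted)
    if not predicted:
        return 0.0
    confirmed = predicted & set(newer_version)
    return 100.0 * len(confirmed) / len(predicted)

# Hypothetical (gene, term) pairs predicted on the older version
predicted = [("geneA", "term1"), ("geneB", "term2"),
             ("geneC", "term3"), ("geneD", "term4")]
# Hypothetical content of the newer version
newer_version = [("geneA", "term1"), ("geneE", "term9")]

accuracy = database_validation(predicted, newer_version)
print(f"{accuracy:.1f}% of the predictions are in the newer version")
```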

Page 25: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Database Validation

Two main issues:

- Retrieving the annotation IDs of the former database version, to be used in the updated database version;

- Managing duplicate annotations (i.e. annotations having different evidence codes)

Page 26: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Text Mining and Web Tool Validation

Literature text mining and web tools validation procedure

Databanks may not be up to date, so we manually searched for the predicted annotations through:

• Literature resources such as PubMed

• Web tools such as AmiGO and GeneCards

Page 27: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Results

ROC Curves

[Figure: validation ROC curves for the Homo sapiens CC dataset. SVD-Khatri has k = 500; SVD-us, SIM1 and SIM2 have k = 378; SIM1 and SIM2 use C = 2, and SIM2 uses the Resnik measure.]

Page 28: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Results

Results on the following annotation datasets:

• Homo sapiens genes and CC feature terms

• Homo sapiens genes and MF feature terms

• Homo sapiens genes and BP feature terms

• Homo sapiens genes and CC+MF+BP feature terms

Page 29: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Results

The literature review allowed us to confirm some additional predicted annotations

Page 30: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

List Comparison Measures

Comparing methods and parameters

• When we have different lists of predicted annotations, we want to know how similar or different they are

• Answering this question helps us to understand how the method parameters behave

Example of two predicted annotation lists to compare:

List 1 (Annotation ID): 10,000; 20,000; 90,000

List 2 (Annotation ID): 40,000; 10,000; 90,000

Page 31: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

List Comparison Measures

How similar are these lists?

• Spearman's rank correlation coefficient: the total sum of the position differences of each element between the two lists (e.g. 3rd position - 1st position = 2)

List 1 (Annotation ID): 10,000; 20,000; 30,000

List 2 (Annotation ID): 30,000; 10,000; 40,000
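The sum of position differences described on this slide can be sketched as below. Two caveats: a sum of absolute position differences is usually called Spearman's footrule, whereas the classical Spearman coefficient is based on squared rank differences, and this sketch simply skips elements that appear in only one list (the extended measure presented later in the talk penalizes them instead). The function name is an assumption.

```python
def position_difference_sum(list_a, list_b):
    """Sum of absolute position differences over the elements shared by both lists."""
    position_in_b = {item: i for i, item in enumerate(list_b)}
    return sum(abs(i - position_in_b[item])
               for i, item in enumerate(list_a)
               if item in position_in_b)

list_1 = [10000, 20000, 30000]
list_2 = [30000, 10000, 40000]

# 10000 moves by 1 position, 30000 moves by 2 positions
distance = position_difference_sum(list_1, list_2)
print(distance)
```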

Page 32: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

List Comparison Measures

How similar are these lists?

• Kendall tau distance: the total number of bubble-sort swaps needed to make one list equal to the other

List 1 (Annotation ID): 10,000; 20,000; 90,000

List 2 (Annotation ID): 20,000; 10,000; 90,000
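The number of bubble-sort swaps equals the number of element pairs that appear in opposite order in the two lists, which gives a direct way to compute the Kendall tau distance. This sketch assumes both lists contain exactly the same elements without repetitions; the function name is illustrative.

```python
from itertools import combinations

def kendall_tau_distance(list_a, list_b):
    """Number of pairs ordered differently in the two lists."""
    if len(set(list_a)) != len(list_a) or set(list_a) != set(list_b):
        raise ValueError("lists must contain the same distinct elements")
    position_in_b = {item: i for i, item in enumerate(list_b)}
    ranks = [position_in_b[item] for item in list_a]
    # Each inversion in `ranks` corresponds to one adjacent swap in bubble sort.
    return sum(1 for i, j in combinations(range(len(ranks)), 2)
               if ranks[i] > ranks[j])

list_1 = [10000, 20000, 90000]
list_2 = [20000, 10000, 90000]

swaps = kendall_tau_distance(list_1, list_2)
print(swaps)
```

The pairwise count is quadratic in the list length; for long prediction lists a merge-sort based inversion count would be the usual replacement.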

Page 33: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

List Comparison Measures

Extended Kendall distance

Extended Spearman coefficient

List 1 (Annotation ID)

AP List: 10,000; 20,000; 30,000; ...

NAC List: 70,000; 80,000; 90,000; ...

List 2 (Annotation ID)

AP List: 30,000; 10,000; 40,000; ...

NAC List: 70,000; 20,000; 90,000; ...

• We assign a high penalty if an element is absent from one of the lists

• And a low penalty if an element is absent from one of the AP lists but present in its NAC list

Page 34: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

List Comparison Measures

Significant patterns:

• Extended Kendall distances show that the more similar the SVD truncations are, the lower the Extended Kendall distance is, and so the more similar the lists are.

• Lists generated by predictions that produced similar AUCs have similarly low Extended Spearman coefficients. This means that lists from predictions having similar AUC percentages differ very little in their elements.

Page 35: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Novelty Indicator

[Figure: example DAG tree of the Molecular Function terms predicted for the Homo sapiens gene P2RY14. Black balls: terms already present in the database. Blue hexagons: predicted terms.]

Schlicker rate based on DAG

An indicator to express the “novelty” rate of a prediction in a gene tree

• Statistical rate

• Visual DAG viewer

Page 36: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Novelty Indicator

[Figure: example DAG tree of the Molecular Function terms predicted for the Homo sapiens gene CCR2. Black balls: terms already present in the database. Blue hexagons: predicted terms.]

Schlicker rate based on DAG

An indicator to express the “novelty” rate of a prediction in a gene tree

• Statistical rate

• Visual DAG viewer

Page 37: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Final predictions

We finally get a list of the most likely predicted annotations, which have the following characteristics:

- predicted by all three methods (tSVD, SIM1, SIM2)

- ranked in the first 50% of the list

- having at least one validated parent

Gene symbol   Feature term
PPME1         Organelle organization [BP]
CHST14        Chondroitin sulfate proteoglycan biosynthetic process [BP]
CHST14        Biopolymer biosynthetic process [BP]
ROPN1B        Microtubule-based flagellum [CC]
CHST14        Dermatan sulfate proteoglycan biosynthetic process [BP]
CPA2          Proteolysis involved in cellular protein catabolic process [BP]
PPME1         Chromosome organization [BP]
CNOT2         Positive regulation of cellular metabolic process [BP]
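The three selection criteria can be combined as in the sketch below. It rests on assumptions the slides do not spell out: "first 50%" is applied to each method's own ranked list, and a "validated parent" is read as a parent term whose annotation to the same gene has been validated. The function name, data structures and toy identifiers are hypothetical.

```python
def final_predictions(ranked_lists, validated, parents):
    """Predictions in the top half of every method's list with a validated parent.

    ranked_lists : dict method name -> list of (gene, term), best first
    validated    : set of validated (gene, term) annotations
    parents      : dict term -> list of parent terms
    """
    top_halves = []
    for ranking in ranked_lists.values():
        cutoff = (len(ranking) + 1) // 2      # first 50% of the list
        top_halves.append(set(ranking[:cutoff]))
    common = set.intersection(*top_halves)    # predicted by all methods
    return sorted(
        (gene, term) for gene, term in common
        if any((gene, parent) in validated for parent in parents.get(term, ()))
    )

ranked_lists = {
    "tSVD": [("g1", "t2"), ("g2", "t3"), ("g3", "t1"), ("g4", "t4")],
    "SIM1": [("g2", "t3"), ("g1", "t2"), ("g4", "t4"), ("g3", "t1")],
    "SIM2": [("g1", "t2"), ("g2", "t3"), ("g3", "t1")],
}
parents = {"t2": ["t1"], "t3": ["t1"]}
validated = {("g1", "t1")}

selected = final_predictions(ranked_lists, validated, parents)
print(selected)
```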

Page 38: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Recap

Truncated SVD with the automatically chosen truncation showed better results (percentage of predicted annotations found in the updated database version) than the previous version of the method with fixed parameters.

The new methods (SIM1 and SIM2) outperformed Truncated SVD.

The ROC analysis, database version, and text mining and web tool validation procedures proved very effective.

The Extended Kendall and Spearman coefficients showed interesting patterns, otherwise invisible.

The novelty indicator rate proved very useful in showing which prediction trees are the most interesting, revealing relevant research paths.

Page 39: Doctoral Thesis Dissertation 2014-03-20 @PoliMi

Future

Future developments:

• Integrate the software as a web application into the Search Computing platform

• Implement and test the Auto-Encoder Deep Neural Network algorithm

• Develop an automated text mining validation procedure

• Add statistical tools to analyze the ROC curves

Page 40: Doctoral Thesis Dissertation 2014-03-20 @PoliMi