pattern recognition in clinical data

Page 1: Pattern Recognition in clinical data

INTRODUCTION SIGNIFICANT MUTATIONS VIRAL GENOME DETECTION REPRODUCIBILITY CONCLUSIONS

Pattern Recognition In Clinical Data

Saket Choudhary

Dual Degree Project

Guide: Prof. Santosh Noronha

C G C A T C G A G C T

C G C G T C G A G C T

October 30, 2013

OUTLINE

INTRODUCTION
- Objective

SIGNIFICANT MUTATIONS
- Next Generation Sequencing
- Motivation
- Computational Methods for Driver Detection

VIRAL GENOME DETECTION
- Workflow

REPRODUCIBILITY
- Reproducibility

CONCLUSIONS
- Wrapping up

OBJECTIVE

Figure: mind map centred on "Next Generation Sequencing & Cancer Research", with two main branches:

- Tools for Biologists: Reproducible analysis; Driver & Passenger Mutation Detection; Galaxy Tools; Viral Genome Integration; Galaxy Workflow
- Better/New Algorithms: Benchmarking; Alignment tools; BWA v/s BWA-PSSM; Driver & Passenger Mutation Detection; Ensembl Method; Lit. Survey

NEXT GENERATION SEQUENCING

C G C G T C G A G C T A G C A

G C G T∗ C G A G∗ C T A G∗

A G C G C G T C G A G C T A G C A C A

NGS: WHY?

- Molecular approach: study of variations at the 'base' level
- Low cost: the $1000 genome
- Faster: quicker than traditional sequencing techniques

NGS: WHERE?

- Study variations and genotype-phenotype association
- Look for 'markers of disease'
- Prognosis

NGS: MUTATIONS

- 3 × 10^9 base pairs
- We are all 99.9% similar at the DNA level
- More than 2 million SNPs
- No particular pattern of SNPs
- If a certain mutation causes a change in an amino acid, it is referred to as non-synonymous (nsSNV)
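The synonymous/non-synonymous distinction can be sketched in a few lines. The example below is only an illustration, not part of the original slides: it builds the standard genetic code and checks whether a single-base change alters the encoded amino acid. The helper name `classify_snv` and the example codons are assumptions chosen for the demonstration.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))  # '*' marks a stop codon


def classify_snv(codon: str, offset: int, alt: str) -> str:
    """Classify a single-base change inside one codon (hypothetical helper).

    codon  -- reference codon, e.g. "GAG"
    offset -- 0-based position of the changed base within the codon
    alt    -- the alternate base
    """
    mutated = codon[:offset] + alt + codon[offset + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[mutated]:
        return "synonymous"
    return "non-synonymous"


print(classify_snv("GAA", 2, "G"))  # GAA (Glu) -> GAG (Glu)
print(classify_snv("GAG", 1, "T"))  # GAG (Glu) -> GTG (Val)
```

Note that this sketch also reports a change to or from a stop codon as non-synonymous; a real annotation tool would usually report those separately.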

DRIVERS AND PASSENGERS I

Cancer is known to arise due to mutations. Not all mutations are equally important!

Somatic Mutations: Set of mutations acquired after zygote formation, over and above the germline mutations

Driver Mutations: Mutations that confer growth advantages to the cell, being selected positively in the tumor tissue

DRIVERS AND PASSENGERS

Drivers are NOT simply loss-of-function mutations, but more than that:

- Loss of function: inactivates tumor suppressor proteins
- Gain of function: activates normal genes, transforming them into oncogenes
- Drug resistance mutations: mutations that have evolved to overcome the inhibitory effect of drugs

DRIVER MUTATIONS: WHY?

Identify driver mutations → better therapeutic targets. But how does one zero in on the exact set? → Experiments are too costly, probably infeasible for 2 million+ SNPs → leverage computational analysis.

- The low cost of NGS comes with a heavier roadblock: data analysis
- Searching among 2 million+ SNPs is a non-trivial, computationally intensive problem
- Software tools show low consensus among themselves ←→ defining a driver computationally is non-trivial
- However, there is no tool that lets one visualise the results for an input across the cohort of tools

MACHINE LEARNING I

Two datasets:

- Training: labelled dataset, containing a table of features with mutations labelled as "driver" or "passenger"
- Test: having 'learned' from the training dataset, test the prediction model

Table: Training Dataset

Chromosome  Position  Ref  Alt  Type
1           27822     A    G    Driver
1           27832     T    G    Driver
2           47842     G    C    Passenger
...         ...       ...  ...  ...

MACHINE LEARNING II

Table: Test Dataset

Chromosome  Position  Ref  Alt  Type
1           27824     A    G    ?
1           47832     T    G    ?
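To make the train/test split concrete, here is a minimal sketch that stores the two toy tables and runs a majority-class baseline over them. The baseline is an assumption introduced purely for illustration; it ignores every feature and is not the method used by CHASM or any other tool discussed in these slides. The names `fit_majority` and `predict` are hypothetical.

```python
from collections import Counter

# Toy rows copied from the slide tables: (chromosome, position, ref, alt, label).
training = [
    ("1", 27822, "A", "G", "Driver"),
    ("1", 27832, "T", "G", "Driver"),
    ("2", 47842, "G", "C", "Passenger"),
]
# Test rows carry no label; that is what the model has to supply.
test = [("1", 27824, "A", "G"), ("1", 47832, "T", "G")]


def fit_majority(rows):
    """'Train' by remembering the most frequent label (baseline only)."""
    labels = [row[-1] for row in rows]
    return Counter(labels).most_common(1)[0][0]


def predict(model, rows):
    """Assign the remembered label to every unlabelled row."""
    return [model for _ in rows]


model = fit_majority(training)
print(predict(model, test))
```

A real classifier would replace `fit_majority` with a model trained on informative features; a baseline like this is mainly useful as the floor that any such model has to beat.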

MACHINE LEARNING: FEATURE SELECTION I

Machine learning relies on a set of features for training. Redundant features should be avoided. CHASM [1] makes use of the information-theoretic quantities below.

p(X_i) represents the probability of occurrence of an event X_i. Considering a series of events X_1, X_2, X_3, ..., X_n, analogous to a 'series of packets' in communication theory, the information received at each step can be quantified on a log scale by:

\[ \log_2 \frac{1}{p(X_i)} = -\log_2 p(X_i) \tag{1} \]

The expected value of information from a series of events is called the Shannon entropy, H(X):

\[ H(X) = -\sum_i p(X_i) \log_2 p(X_i) \tag{2} \]
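As a quick numerical check of equation (2), the sketch below computes the Shannon entropy of a discrete distribution in bits. The function name `shannon_entropy` is an assumption, and events with zero probability are skipped, following the usual convention that 0 · log 0 = 0.

```python
import math


def shannon_entropy(probabilities):
    """Return H(X) = -sum_i p_i * log2(p_i), in bits.

    Zero-probability events contribute nothing to the sum.
    """
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# A fair coin carries one bit of information per toss.
print(shannon_entropy([0.5, 0.5]))
# Four equally likely outcomes carry two bits.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))
```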

MACHINE LEARNING: FEATURE SELECTION II

Mutual information between two random variables X and Y is defined as the amount of information gained about X from additional knowledge of the second variable, Y:

\[ I(X,Y) = H(X) - H(X|Y) \tag{3} \]

Here:
X: class label [Driver/Passenger]
Y: predictive feature

Hence I(X,Y) represents how much information is gained about the class label X from knowledge of a feature Y. Simplifying:

\[ I(X,Y) = \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)} \tag{4} \]
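Equation (4) can be estimated directly from paired observations by replacing the probabilities with empirical frequencies. The sketch below does that; the function name `mutual_information` and the toy labels and feature values are assumptions made up for the demonstration, not data from the project.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Estimate I(X,Y) in bits from paired samples, using empirical frequencies."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    total = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        total += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return total


labels = ["Driver", "Driver", "Passenger", "Passenger"]  # class label X
informative = [1, 1, 0, 0]    # hypothetical feature that tracks the label exactly
uninformative = [1, 0, 1, 0]  # hypothetical feature independent of the label

print(mutual_information(labels, informative))
print(mutual_information(labels, uninformative))
```

A feature that determines the label exactly recovers the full entropy of the label, while an independent feature contributes nothing, which is why such a score can be used to rank candidate features.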

FUNCTIONAL IMPACT I

- If a certain mutation confers an advantage to the cell in terms of replication rate, it is probably going to be selected, while mutations that reduce fitness have a higher chance of being eliminated from the population.
- Certain residues in a multiple sequence alignment (MSA) of homologous sequences are more conserved than others. A highly conserved residue, if mutated, is likely to be costly, since what had 'evolved' is disturbed!
- Scores can be assigned based on this "conservation" parameter.
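One simple way to turn "conservation" into a number is to score each alignment column by its entropy. The sketch below is an illustration under stated assumptions and is not the scoring scheme of SIFT or any other tool named in these slides: it uses a made-up four-sequence alignment, ignores gaps, and normalises by the entropy of a uniform 20-letter amino acid alphabet. The name `column_conservation` is hypothetical.

```python
import math
from collections import Counter


def column_conservation(column, alphabet_size=20):
    """Score one MSA column: 1.0 means fully conserved, lower means more variable."""
    counts = Counter(column)
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 1.0 - entropy / math.log2(alphabet_size)


# Hypothetical aligned protein fragments (one string per homologous sequence).
msa = ["MKVLA", "MKILA", "MRVLS", "MKVLT"]

# zip(*msa) walks the alignment column by column.
scores = [column_conservation(col) for col in zip(*msa)]
for position, score in enumerate(scores):
    print(position, round(score, 3))
```

Under this scheme a mutation at a high-scoring position would be flagged as more likely to be damaging than one at a variable position.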

FUNCTIONAL IMPACT II

Figure: SIFT [?] algorithm

Some of the common tools/algorithms used for driver mutation prediction:

- SIFT
- PolyPhen
- Mutation Assessor
- TransFIC
- Condel

FRAMEWORK FOR COMPARING VARIOUS TOOLS I

- Different tools use different formats and give different outputs for similar input
- Running an analysis on multiple tools → having to keep converting between data formats
- Concordance?

Polyphen2 Input

chr1:888659 T/C
chr1:1120431 G/A
chr1:1387764 G/A
chr1:1421991 G/A
chr1:1599812 C/T
chr1:1888193 C/A
chr1:1900186 T/C

FRAMEWORK FOR COMPARING VARIOUS TOOLS II

SIFT Input

1,888659,T,C
1,1120431,G,A
1,1387764,G,A
1,1421991,G,A
1,1599812,C,T
1,1888193,C,A
1,1900186,T,C
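The two input formats shown on these slides carry the same four fields, so converting between them is mechanical. The sketch below handles only the exact line layouts shown here; the function name `polyphen_to_sift` is an assumption, and the real tools may accept further columns or variants of these formats.

```python
def polyphen_to_sift(line: str) -> str:
    """Convert 'chr1:888659 T/C' (layout shown for Polyphen2) to '1,888659,T,C' (layout shown for SIFT)."""
    location, alleles = line.split()
    chrom, position = location.split(":")
    ref, alt = alleles.split("/")
    # Drop the leading 'chr' if present, since the SIFT-style lines use bare chromosome names.
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return ",".join([chrom, position, ref, alt])


polyphen_lines = ["chr1:888659 T/C", "chr1:1120431 G/A", "chr1:1599812 C/T"]
for line in polyphen_lines:
    print(polyphen_to_sift(line))
```

Wrapping a converter like this around each tool is what lets a single input file be fed to every tool without manual reformatting.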

DRIVER MUTATIONS: TOOLS DON’T AGREE

Figure: scatter plot with Condel score on the x-axis and MA score on the y-axis
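Disagreement between tools can be quantified as pairwise concordance: the fraction of variants on which two tools make the same call. The sketch below assumes each tool's score has already been thresholded into a yes/no call; the tool names and calls are invented purely for illustration and are not results from this project.

```python
from itertools import combinations

# Hypothetical calls, invented for illustration: True means the tool flags the variant.
calls = {
    "tool_a": [True, True, False, False],
    "tool_b": [True, False, False, True],
    "tool_c": [True, True, False, True],
}


def concordance(a, b):
    """Fraction of variants on which two tools make the same call."""
    agree = sum(1 for x, y in zip(a, b) if x == y)
    return agree / len(a)


# Compare every pair of tools on the same list of variants.
pairwise = {
    (s, t): concordance(calls[s], calls[t]) for s, t in combinations(calls, 2)
}
for pair, value in pairwise.items():
    print(pair, value)
```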

Solution?: Galaxy [?], an open-source web-based platform for bioinformatics, makes it possible to represent the entire data analysis pipeline in an intuitive graphical interface

Figure: Galaxy workflow for the Polyphen2 algorithm

Run all tools in one go:

Figure: Run all tools

Compare all tools:

Figure: Compare all tools

VIRAL GENOME DETECTION

Cervical cancers have been proven to be associated with Human Papillomavirus (HPV). Cervical cancer datasets from Indian women were put through an analysis to detect:

1. Any possible HPV integration
2. Sites of HPV integration

Who cares?

- Whole genome sequencing can be replaced by targeted sequencing at the sites where the virus has been detected in a cohort of samples, thus speeding up the whole process.

Figure: workflow

1. Raw data
2. Align data to human genome reference
3. Extract unmapped regions
4. Align unmapped regions to virus genome
5. Extract mapped regions
6. BLAST
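The workflow steps above can be sketched as a sequence of command lines. The slides implement this as a Galaxy workflow and do not name the command-line tools behind each step, so the choice of `bwa mem`, `samtools` and `blastn` here is an assumption, as are all file names and the `build_pipeline` helper. The function only builds the commands as strings and does not run them; both references are assumed to be indexed beforehand with `bwa index`.

```python
def build_pipeline(reads, human_ref, virus_ref, blast_db):
    """Return shell commands for each workflow step, in order (nothing is executed)."""
    return [
        # Align raw reads to the human genome reference.
        f"bwa mem {human_ref} {reads} > human.sam",
        # Extract unmapped reads: SAM flag 0x4 means 'read unmapped'.
        "samtools view -b -f 4 human.sam > unmapped.bam",
        "samtools fastq unmapped.bam > unmapped.fq",
        # Align the unmapped reads to the virus genome.
        f"bwa mem {virus_ref} unmapped.fq > virus.sam",
        # Extract reads that did map to the virus (flag 0x4 not set).
        "samtools view -b -F 4 virus.sam > viral_hits.bam",
        "samtools fasta viral_hits.bam > viral_hits.fa",
        # BLAST the candidate viral reads for confirmation.
        f"blastn -query viral_hits.fa -db {blast_db} -outfmt 6 -out blast_hits.tsv",
    ]


# All paths below are placeholders.
for command in build_pipeline("sample.fq", "human.fa", "hpv.fa", "viral_db"):
    print(command)
```

Keeping the steps as data rather than running them inline makes it easy to log the exact commands used, which ties in with the reproducibility concerns raised later in the talk.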

GALAXY WORKFLOW

Figure: Aligned HPV genomes

REPRODUCIBILITY

- In pursuit of novel 'discovery', standardizing the data analysis pipeline is often ignored, leading to dubious conclusions
- Analysis should be reproducible and, above all, correct
- Parameter values can change the results by a big factor; they need to be documented/logged
- Garbage in, garbage out
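Documenting parameters can be as simple as writing a small record next to every result. The sketch below is one possible shape for such a record, not how Galaxy itself stores provenance; the field names, the `make_run_record` helper, and the example tool name and parameter are all assumptions. It stores the tool, its version, the parameters, and a checksum of the input so that a later run can be checked against the same data.

```python
import hashlib
import json


def make_run_record(tool, version, params, input_bytes):
    """Build a JSON-serialisable record of one analysis step (hypothetical layout)."""
    return {
        "tool": tool,
        "version": version,
        "params": params,
        # The checksum ties the record to the exact input that was analysed.
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
    }


record = make_run_record(
    tool="example_aligner",       # placeholder tool name
    version="0.0.0",              # placeholder version
    params={"seed_length": 19},   # placeholder parameter
    input_bytes=b"ACGT\n",
)
# sort_keys makes the serialised record stable, so two runs can be diffed.
print(json.dumps(record, indent=2, sort_keys=True))
```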

CONCLUSIONS

With the Galaxy toolbox for identifying significant mutations in place, and the science behind the methods studied, the next steps would be to:

- Open-source the toolbox to the community: a tool makes little sense if it is not in a usable form. Community feedback will be used to add more tools and improve the existing ones.
- Develop a new method for driver mutation prediction: all the existing methods have a low level of concordance. A new method that takes into account the available data at all levels (mutations, transcriptome and microarray data) is possible. With the Galaxy toolbox in place, it would be possible to integrate information at these various levels.

FUTURE WORK

- Develop an algorithm that integrates the machine learning approach with the functional approach by zeroing in on only those attributes that are known to have an impact
- The algorithm would also account for information at other levels: RNA expression, clinical data
- Integrating information at all levels would provide deeper insight
- The developed Galaxy toolbox will be used as the basic framework for integrating information

REFERENCES I

[1] Hannah Carter, Sining Chen, Leyla Isik, Svitlana Tyekucheva, Victor E. Velculescu, Kenneth W. Kinzler, Bert Vogelstein, and Rachel Karchin. Cancer-specific high-throughput annotation of somatic mutations: computational prediction of driver missense mutations. Cancer Research, 69(16):6660–6667, 2009.