Targeted Projection Pursuit, Joe Faith, Northumbria University, v1.1,

Targeted Projection Pursuit for Microarray Data Analysis

Joe Faith

Northumbria University

Outline

1. Analysing High-Dimensional Array Data

2. Dimension-Reduction Techniques

3. Targeted Projection Pursuit

4. Experimental Results

Array Data

• New array technologies producing floods of quantitative data
– cDNA and oligonucleotide
– Protein arrays
– Combinatorial chemistry arrays
– Tissue arrays

• Typically dozens of samples x thousands of genes (or other attributes)

Array Analysis Tasks

• In case of classified data (samples of known diagnostic classes, e.g. cancer tumours)
– spot clusters in data
– spot outliers
– classify new cases into existing classes
– genetic profiles, feature selection, finding markers for particular conditions

• Similar problems with time series / sequential data
– Genome-wide study of transcription and regulation

Array Analysis Techniques

• Lots of techniques borrowed from statistics, machine learning, data mining.

• Tend to be complicated and ‘opaque’

• Want to find ways to allow the experimenter to:
– Visualise / communicate
– Explore
– Form hypotheses

Statistical Problems

• Nature of data presents many statistical problems:
– Normalisation
– Control of variance
– Determining significance
– Determining reliability

• ‘high p, low n’ (many attributes, few samples)

• Will ignore all these!

Suppose we had just 2 genes…

[Figure: 2D scatter plot of samples; axes: Gene A, Gene B]

Clusters, classifications, outliers, correlations etc are then immediately obvious

3D Scatter Plots

[Figure: 3D scatter plot of samples; axes: Gene A, Gene B, Gene C]

‘Virtual Reality’ 3D Scatter Plots

Angelova et al, 2005

Dimension Reduction Techniques

• But what about p=4, 5, … 1000??

• Need some way of visualising and exploring higher dimensional ‘space’ in 2D

Hierarchical Clustering

• Produce a dendrogram based on sample/gene distances, and optimise order for display

• But single dimension obscures many relationships

Eisen et al, 1998

Multi-Dimensional Scaling

• Finds best possible 2D representation of data points (i.e. preserves distances between points)

• Eg Sammon’s Mapping (Ewing et al, 2001)
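How well a 2D map preserves distances can be summarised by Sammon's stress: the squared mismatch between each original distance and its 2D counterpart, weighted by the inverse of the original distance. The sketch below only evaluates that stress for a given map; it does not perform the optimisation itself. It assumes NumPy and that no two samples coincide (a zero original distance would divide by zero). The function name is illustrative.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def sammon_stress(X, Y):
    """Sammon's stress between high-dimensional data X and its low-dimensional map Y.

    X: (n_samples, n_genes) original data
    Y: (n_samples, 2) mapped points
    Assumes all samples in X are distinct.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    upper = np.triu_indices(len(X), k=1)        # each pair (i < j) once
    d_star = pairwise_distances(X)[upper]       # distances in the original space
    d = pairwise_distances(Y)[upper]            # distances in the 2D map
    return ((d_star - d) ** 2 / d_star).sum() / d_star.sum()
```

A stress of 0 means every pairwise distance is reproduced exactly; the single number says nothing about which pairs are badly represented.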

But…

• ‘curse of dimensionality’ spreads points

• Not projection based, so cannot visualise position of new unclassified samples

• No indication of particular stresses

Linear Projection Based Methods

• Find a 2D ‘window’ through which we view the multidimensional data

• The position of the window then contains useful information about, e.g., respective significance of particular genes

• Principal Components Analysis (Yeung, 2001)
– Find view (window position) that best spreads the data

• Projection Pursuit
– Find projections best suited for particular purposes, such as separating classifications (Lee, 2005)
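As a concrete instance of a linear 'window', the sketch below computes a PCA view: the data are centred and projected onto the two directions of greatest variance. It assumes NumPy and a samples-by-genes matrix; the function name `pca_view` is illustrative and not taken from any tool mentioned here.

```python
import numpy as np

def pca_view(X, n_components=2):
    """Project the rows of X (samples) onto the top principal components.

    Returns the projected points and the (n_genes, n_components)
    projection matrix, i.e. the position of the 'window'.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # centre each gene
    # Rows of Vt are the principal directions, ordered by variance explained.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                      # columns are orthonormal
    return Xc @ P, P
```

Each row of the returned matrix `P` holds one gene's contribution to the two screen axes, which is what makes the window position interpretable.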

Grand Tours

• But each of these only shows a single view out of many

• So ‘Grand Tours’ show a video of all possible views (Asimov, 1985)

• Grand Tours in high dimensions are mostly uninformative, and make it hard to interpret data
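To get a feel for what a tour steps through, the sketch below generates a handful of random 2D views of a data set. This is a crude stand-in: a real grand tour moves smoothly between views, whereas this simply draws independent random orthonormal frames. It assumes NumPy; the function names are illustrative.

```python
import numpy as np

def random_view(n_genes, rng):
    """Random (n_genes, 2) projection matrix with orthonormal columns."""
    Q, _ = np.linalg.qr(rng.normal(size=(n_genes, 2)))
    return Q

def tour_frames(X, n_frames, seed=0):
    """Project centred data through a sequence of random 2D views."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    return [Xc @ random_view(X.shape[1], rng) for _ in range(n_frames)]
```

With thousands of genes, most randomly chosen frames look like a featureless blob, which is consistent with the point above that tours in high dimensions are mostly uninformative.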

Manual Controls

• Try using manual controls to alter projections?

• Controls are ‘opaque’: user has no intuition about the effect their actions will have

• E.g. XGobi (Cook and Buja, 1997)

Targeted Projection Pursuit

• The intuition:
– Allow user to manipulate view of data directly
– Computer then tries to find view that best matches ‘target’
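One simple way to make 'find the view that best matches the target' concrete is a least-squares fit: given the 2D positions the user wants the samples to occupy, solve for the linear projection whose output is closest to them. This is an illustration under that assumption and not necessarily the search used by the author's tool. It assumes NumPy; `class_centroid_target` is a hypothetical helper that builds a target by placing every sample at a chosen position for its class.

```python
import numpy as np

def class_centroid_target(labels, centroids):
    """Hypothetical helper: target view placing each sample at its class's 2D position."""
    return np.array([centroids[c] for c in labels], dtype=float)

def fit_projection_to_target(X, T):
    """Least-squares projection P such that (centred X) @ P is close to (centred T).

    X: (n_samples, n_genes) data, T: (n_samples, 2) target view.
    Returns the projection matrix and the view it actually produces.
    """
    X = np.asarray(X, dtype=float)
    T = np.asarray(T, dtype=float)
    Xc = X - X.mean(axis=0)
    Tc = T - T.mean(axis=0)
    # lstsq returns the minimum-norm solution when there are more genes than samples.
    P, *_ = np.linalg.lstsq(Xc, Tc, rcond=None)
    return P, Xc @ P
```

With far more genes than samples the fit can reproduce the target exactly, so a view found this way should be checked on held-out samples before being trusted.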

Quantitative Evaluation

• Task: find a view of a data set that best shows classifications

• Data: publicly available gene expression data sets of diagnosed cancer tissues

• Method: compare resulting views with standard techniques for degree of class separation

Data

• LEUK50: Gene expression in two types of acute leukemia: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML) [Gol99]. 38 cases of B-cell ALL, 9 cases of T-cell ALL, and 25 cases of AML. Expression levels of 7129 genes.

• SRBCT50: cDNA microarray analysis of small, round blue cell childhood tumors (SRBCT), including neuroblastoma (NB), rhabdomyosarcoma (RMS), Burkitt Lymphoma (BL; a subset of non-Hodgkin lymphoma) and members of the Ewing family of tumors (EWS). 6567 genes for 83 samples [Kha01].

• NCI50: 60 cell lines from the National Cancer Institute's anticancer drug screen [Sch00]: 9 breast, 5 central nervous system (CNS), 7 colon, 6 leukemia, 8 melanoma, 9 non-small-cell lung carcinoma (NSCLC), 6 ovarian, 2 prostate, 8 renal. 9703 cDNA sequences.

DR Techniques

• TPP: Targeted Projection Pursuit

• PP: Projection Pursuit (computer search for optimal view)

• SAM: Sammon Mapping

• VS: VizStruct non-linear projection based on radial coordinates

Metrics

• ILDA: Linear Discriminant Analysis Index (Lee 05)

• 5NN: Generalisation performance of a K-Nearest Neighbours classifier (K = 5) on the projected data
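Both metrics can be computed directly from a 2D view and the class labels. The sketch below assumes the LDA index takes the form 1 − |W| / |W + B|, with W and B the within-class and between-class scatter matrices of the projected points, which is how the index in Lee et al. (2005) is usually stated. It also assumes that 'generalisation performance' is estimated by leave-one-out accuracy, which the slides do not specify. NumPy is assumed and the function names are illustrative.

```python
import numpy as np

def lda_index(Y, labels):
    """1 - |W| / |W + B| for projected points Y (n_samples, 2) and class labels."""
    Y = np.asarray(Y, dtype=float)
    labels = np.asarray(labels)
    grand_mean = Y.mean(axis=0)
    dim = Y.shape[1]
    W = np.zeros((dim, dim))                     # within-class scatter
    B = np.zeros((dim, dim))                     # between-class scatter
    for c in np.unique(labels):
        Yc = Y[labels == c]
        class_mean = Yc.mean(axis=0)
        centred = Yc - class_mean
        W += centred.T @ centred
        diff = (class_mean - grand_mean)[:, None]
        B += len(Yc) * (diff @ diff.T)
    return 1.0 - np.linalg.det(W) / np.linalg.det(W + B)

def knn_loo_accuracy(Y, labels, k=5):
    """Leave-one-out accuracy (%) of a k-nearest-neighbours classifier on Y."""
    Y = np.asarray(Y, dtype=float)
    labels = np.asarray(labels)
    sq_dist = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(sq_dist, np.inf)            # a sample may not vote for itself
    correct = 0
    for i in range(len(Y)):
        nearest = np.argsort(sq_dist[i])[:k]
        values, counts = np.unique(labels[nearest], return_counts=True)
        if values[np.argmax(counts)] == labels[i]:   # ties go to the first label in sorted order
            correct += 1
    return 100.0 * correct / len(Y)
```

An index near 1 indicates tight, well-separated classes in the view; the nearest-neighbour accuracy asks the more practical question of whether a sample would be assigned to the right class from its position alone.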

Results

Data       LEUK              SRBCT             NCI
Metric     ILDA   5NN (%)    ILDA   5NN (%)    ILDA   5NN (%)
DR
TPP        .997   100        .999   100        .994   96.7
PP         .972   98.6       .988   100        .981   62.3
SAM        .959   97.2       .911   95.2       .927   67.2
VS         .952   95.8       .637   56.6       .838   32.8

• Joe Faith, Robert Mintram, Maia Angelova (2006), "Targeted Projection Pursuit for Visualising Gene Expression Data Classifications", Bioinformatics (forthcoming).

• Joe Faith, Michael Brockway (2006), "Targeted Projection Pursuit Tool for Gene Expression Visualisation", Journal of Integrative Biology, (forthcoming).

LEUK Data Views

SRBCT Data Views

NCI Data Views

Classifier Construction

• Find a view in which all classes are clearly separated:
– Components of projection then define combination of genes to define classification
– Can order by significance to find a subset of relevant genes
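Ordering genes by significance can be sketched as ranking the rows of the projection matrix by length: a gene whose row is long moves points a long way on screen per unit of expression. Treating row length as 'significance' is an assumption made for this illustration. The ranking is only meaningful if genes are on a comparable scale, e.g. standardised beforehand. NumPy is assumed and the function name is illustrative.

```python
import numpy as np

def rank_genes_by_weight(P, gene_names):
    """Rank genes by the length of their row in the (n_genes, 2) projection matrix P.

    Returns (gene_name, weight) pairs, most influential first.
    """
    P = np.asarray(P, dtype=float)
    weights = np.linalg.norm(P, axis=1)          # one weight per gene
    order = np.argsort(weights)[::-1]            # descending
    return [(gene_names[i], float(weights[i])) for i in order]
```

Taking the first few entries of the ranking gives a candidate subset of genes, which can then be checked by projecting with only those genes and seeing whether the classes stay separated.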

Outlier Detection

• See LEUK data, outliers between ALL/T and ALL/B

• See which potential outliers move with the rest of the samples of that class

Gene Identification

• Separate each class in turn from the remainder of the data. The most significant genes in this separation can then be found

• NCI data:
– Human melanoma antigen recognized by T-cells (MART-1) mRNA Chr.9 selects for Melanoma samples [Coulie 94]
– Desmoplakin gene selects ovarian cancer cases [Adams 06]

• SRBCT data:
– CD83 selects Burkitt's Lymphoma samples [Dudziak 03]

Future Work

• Quantitatively evaluate TPP on other tasks

• Develop tool to:
– Handle wider range of data formats
– Display time series / sequential data
– Integrate with biological workflows:
  • Standard gene lists
  • Click-through to gene ontologies and DBs

• Work with biologists to trial tool and get feedback

References

• Maia Angelova, D. Ivanov and H. Yasrebi. Classification and visualisation of E.coli genes from microarray experiments, poster presentation, MASAMB05, March, Rothamsted Research, Harpenden, UK. www.rothamsted.bbsrc.ac.uk/bab/masamb/posters/MAngelova.pdf

• Eisen,M.B., Spellman,P.T., Brown,P.O., and Botstein,D. (1998) Cluster analysis and display of genome-wide expression patterns, PNAS 95:25, 14863-14868

• Ewing,R.M. and Cherry,J.M. (2001) Visualisation of expression clusters using Sammon's non-linear mapping. Bioinformatics, 17,658-659.

• K.Y.Yeung and W.L.Ruzzo, Principal Components Analysis for clustering gene expression data, Bioinformatics 17 (9) 763-774 (2001)

• Lee,E.K, Cook,D., Klinke,S. and Lumley,T. (2005), Projection Pursuit for Exploratory Supervised Classification, Journal of Computational and Graphical Statistics, 14(4), 831-846

• Asimov, D. (1985). The Grand Tour: A Tool for Viewing Multidimensional Data. SIAM Journal on Scientific and Statistical Computing 6(1), 128-143.

• D. Cook, and A. Buja (1997), Manual Controls for High-Dimensional Data Projections J. Computational and Graphical Statistics, vol. 6, no. 4, pp. 464-480.

• Golub,T.R., Slonim,D.K., Tamayo,P., Huard,C., Gaasenbeek,M., Mesirov,J.P., Coller,H., Loh,M.L., Downing,J.R., Caligiuri,M.A., Bloomfield,C.D., Lander,E.S. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439), 531-537.

• Scherf,U., Ross,D.T., Waltham,M., Smith,L.H., Lee,J.K., Tanabe,L., Kohn,K.W., Reinhold,W.C., Myers,T.G., Andrews,D.T., Scudiero,D.A., Eisen,M.B., Sausville,E.A., Pommier,Y., Botstein,D., Brown,P.O., and Weinstein,J.N. (2000) A Gene Expression Database for the Molecular Pharmacology of Cancer, Nature Genetics, 24(3), 236-244.

• Khan,J., Wei,J.S., Ringnér,M., Saal,L.H., Ladanyi,M., Westermann,F., Berthold,F., Schwab,M., Antonescu,C.R., Peterson,C., and Meltzer,P.S. (2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7(6): 673--679.

• Coulie,PG, et al, (1994) A new gene coding for a differentiation antigen recognized by autologous cytolytic T lymphocytes on HLA-A2 melanomas, J Exp Med. Jul 1;180(1):35-42

• Dudziak et al (2003) Latent Membrane Protein 1 of Epstein-Barr Virus Induces CD83 by the NF-κB Signaling Pathway, J Virol; 77(15): 8290-8298.

• Adams et al (2006) Meningothelial meningioma in a mature cystic teratoma of the ovary, Pathologe Mar 23