flow chart of affymetrix from sample to information

Flow chart of Affymetrix from sample to information
Generate Affy.dat file
Hyb. cRNA
Hybridize to Affy arrays
Output as Affy.chp file
Text
Self-Organizing Maps (SOMs)
Functional annotation
Pathway assignment
Co-ordinate regulation
Promoter motif commonalities
Tissue

Microarray Data Analysis
Data preprocessing and visualization
Supervised learning
  Machine learning approaches
Unsupervised learning
  Clustering and pattern detection
Gene regulatory region predictions based on co-regulated genes
Linkage between gene expression data and gene sequence/function databases

Data preprocessing
Data preparation or pre-processing
Normalization
Feature selection
  Based on the quality of the signal intensity
  Based on the fold change
  T-test

Normalization
Need to scale the red sample so that the overall intensities for each chip are equivalent

Normalization
To ensure the data are comparable, normalization attempts to correct for the following variables:
  Number of cells in the sample
  Total RNA isolation efficiency
  Signal measurement sensitivity
Can use simple math
  Normalization by global scaling (bring each image to the same average brightness)
  Normalization by sectors
  Normalization to housekeeping genes
Active research area
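Global scaling can be sketched in a few lines. This is a minimal illustration, not the procedure used by any particular Affymetrix software: the function name `global_scale`, the choice of the mean of the chip means as the common target, and the toy intensities are all assumptions made for the example.

```python
def global_scale(chips, target=None):
    """Rescale every chip so that its mean intensity equals a common target.

    chips  : list of chips, each a list of probe intensities (hypothetical data)
    target : desired mean intensity; defaults to the average of the chip means
             (an assumption, any fixed value would serve as well)
    """
    means = [sum(chip) / len(chip) for chip in chips]
    if target is None:
        target = sum(means) / len(means)
    # Multiply each intensity by target / (mean of its own chip).
    return [[value * target / mean for value in chip]
            for chip, mean in zip(chips, means)]


# Two toy chips; the second is uniformly twice as bright as the first.
chips = [[100.0, 200.0, 300.0], [200.0, 400.0, 600.0]]
scaled = global_scale(chips)
scaled_means = [sum(chip) / len(chip) for chip in scaled]
print(scaled_means)
```

After scaling, both toy chips have the same average brightness, so a gene measured on one chip can be compared with the same gene on the other.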

Basic Data Analysis
Fold change (relative change in intensity for each gene)
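As a small illustration of fold change, the sketch below computes the ratio of two intensities for one gene. The log2 form is a common convention that the slides do not mention, and the function names and intensities are hypothetical.

```python
import math


def fold_change(treated, control):
    """Relative change in intensity: treated signal divided by control signal."""
    return treated / control


def log2_fold_change(treated, control):
    """Log2 of the ratio, so that up- and down-regulation are symmetric
    (a common convention; an assumption here, not taken from the slides)."""
    return math.log2(treated / control)


# Hypothetical normalized intensities for one gene.
up = fold_change(800.0, 200.0)
up_log = log2_fold_change(800.0, 200.0)
down_log = log2_fold_change(200.0, 800.0)
print(up, up_log, down_log)
```

Both intensities must be positive and already normalized, otherwise the ratio reflects chip brightness rather than expression.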

Microarray Data Analysis
Data preprocessing and visualization
Supervised learning
  Machine learning approaches
Unsupervised learning
  Clustering and pattern detection
Gene regulatory region predictions based on co-regulated genes
Linkage between gene expression data and gene sequence/function databases

Microarrays: An Example
Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al, Science, v.286, 1999
72 examples (38 train, 34 test), about 7,000 probes
well-studied (CAMDA-2000), good test example
(figure: ALL and AML samples)
Visually similar, but genetically very different

Feature selection

Hypothesis Testing
The null hypothesis is a hypothesis about a population parameter.

Hypothesis testing assesses the viability of the null hypothesis given a set of experimental data.

Example:
Test whether the time to respond to a tone is affected by the consumption of alcohol
Hypothesis: $\mu_1 - \mu_2 = 0$
$\mu_1$ is the mean time to respond after consuming alcohol
$\mu_2$ is the mean time to respond otherwise

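Z-test
Theorem: If the $x_i$ are independent and each has a normal distribution with mean $\mu$ and variance $\sigma^2$, $i = 1, \dots, n$, then $U = \sum_i a_i x_i$ has a normal distribution with mean $E(U) = \mu \sum_i a_i$ and variance $D(U) = \sigma^2 \sum_i a_i^2$.
In particular, $\bar{x} = \sum_i x_i / n \sim N(\mu, \sigma^2 / n)$.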

Z test: $H_0: \mu = \mu_0$ ($\mu_0$ and $\sigma_0$ are known; assume $\sigma = \sigma_0$)
What would one conclude about the null hypothesis that a sample of N = 46 with a mean of 104 could reasonably have been drawn from a population with the parameters of $\mu = 100$ and $\sigma = 8$? Use

Reject the null hypothesis.
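The arithmetic behind this conclusion can be checked with a short sketch. The numbers (N = 46, sample mean 104, population mean 100, standard deviation 8) come from the slide; the significance level is cut off in the source, so the comparison against 0.05 below is only an assumed conventional choice. The function name `z_test` is hypothetical.

```python
import math


def z_test(sample_mean, mu0, sigma, n):
    """One-sample z-test with known population standard deviation.

    Returns the z statistic and the two-sided p-value.
    """
    standard_error = sigma / math.sqrt(n)          # sd of the sample mean
    z = (sample_mean - mu0) / standard_error
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided


z, p = z_test(sample_mean=104, mu0=100, sigma=8, n=46)
print(round(z, 2))   # 3.39
print(p < 0.05)      # True under the assumed 0.05 level
```

A z of about 3.39 lies well beyond the usual two-sided cutoff of 1.96, which is consistent with rejecting the null hypothesis.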

Histogram
(figure: histograms of Set 1 and Set 2)

T-test

William Sealy Gosset (1876-1937) (Guinness Brewing Company)

Project 3
A training data set (38 samples, 7129 probes, 27 ALL, 11 AML)
A testing data set (35 samples, 7129 probes, 22 ALL, 13 AML)

Lab today: pick the top probes that can differentiate the two subtypes and process the testing data set
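One way to pick the top probes is to rank them by the absolute value of a two-sample t statistic. The sketch below uses Welch's unequal-variance form, which is an assumption, since the slides do not say which variant the lab uses. The probe names, the tiny expression table, and the function names are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances assumed)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error   # fails if both variances are 0


def rank_probes(expression, labels, top=2):
    """Return the `top` probe names with the largest |t| between ALL and AML."""
    scores = []
    for probe, values in expression.items():
        all_values = [v for v, lab in zip(values, labels) if lab == "ALL"]
        aml_values = [v for v, lab in zip(values, labels) if lab == "AML"]
        scores.append((abs(welch_t(all_values, aml_values)), probe))
    scores.sort(reverse=True)
    return [probe for _, probe in scores[:top]]


# Hypothetical training data: 3 ALL and 3 AML samples, 3 probes.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expression = {
    "probe_A": [10, 11, 12, 30, 31, 32],   # strongly different between classes
    "probe_B": [10, 30, 12, 31, 11, 32],   # noisy, little class difference
    "probe_C": [5, 6, 7, 9, 10, 11],       # moderately different
}
top_probes = rank_probes(expression, labels, top=2)
print(top_probes)   # ['probe_A', 'probe_C']
```

The selected probes are chosen on the training set only; the testing set is then reduced to the same probes before classification.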

K Nearest Neighbor Classification

Distance measures
Euclidean distance

Manhattan distance
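The two distance measures and a basic k-nearest-neighbor vote can be sketched as follows. The formulas are the standard definitions (the slide images with the equations did not survive); the function names, the default k = 3, and the toy samples are assumptions for illustration.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def knn_classify(query, samples, labels, k=3, distance=euclidean):
    """Label the query by majority vote among its k nearest training samples."""
    nearest = sorted(zip(samples, labels),
                     key=lambda pair: distance(query, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-probe expression profiles.
samples = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

print(euclidean((0, 0), (3, 4)))   # 5.0
print(manhattan((0, 0), (3, 4)))   # 7
print(knn_classify((2, 2), samples, labels))                       # ALL
print(knn_classify((7, 8), samples, labels, distance=manhattan))   # AML
```

An odd k avoids tied votes when there are only two classes.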

Jury Decisions
Use one feature at a time for the classification
Combining the results from the top 51 features
Majority decision
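The jury idea can be sketched with one vote per feature followed by a majority decision. The slide does not say which single-feature classifier casts each vote, so the nearest-class-mean rule below is an assumption, as are the function names and the three-feature toy data (the slide uses 51 features; an odd count avoids ties).

```python
from collections import Counter
from statistics import mean


def feature_vote(value, all_values, aml_values):
    """Vote of a single feature: the class whose training mean is closer.
    (Hypothetical rule; any one-feature classifier could be used instead.)"""
    distance_to_all = abs(value - mean(all_values))
    distance_to_aml = abs(value - mean(aml_values))
    return "ALL" if distance_to_all <= distance_to_aml else "AML"


def jury_decision(sample, train_all, train_aml):
    """Classify one feature at a time, then take the majority decision."""
    votes = Counter(
        feature_vote(sample[i],
                     [s[i] for s in train_all],
                     [s[i] for s in train_aml])
        for i in range(len(sample))
    )
    return votes.most_common(1)[0][0], dict(votes)


# Hypothetical training samples with three features each.
train_all = [(1, 5, 2), (2, 6, 3)]
train_aml = [(8, 1, 9), (9, 2, 8)]
decision, votes = jury_decision((2, 2, 3), train_all, train_aml)
print(decision, votes)
```

In the toy call, two features vote ALL and one votes AML, so the jury returns ALL.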

False Discovery
Two possible errors in making a decision about the null hypothesis.

We could reject the null hypothesis when it is actually true, i.e., our results were obtained by chance (Type I error).
We could fail to reject the null hypothesis when it is actually false, i.e., our experiment failed to detect the true difference that exists (Type II error).

We set $\alpha$ at a level which will minimize the chances of making either of these errors.

False Discovery
Type I error: False Discovery
The expected number of false discoveries (called the False Discovery Rate, FDR, on this slide) is equal to the p-value of the t-test × the number of genes in the array
For a p-value of 0.01 × 10,000 genes = 100 falsely "different" genes
You cannot eliminate false positives, but by choosing a more stringent p-value, you can keep them manageable (try p = 0.001)
The FDR must be smaller than the number of real differences that you find, which in turn depends on the size of the differences and variability of the measured expression values
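The slide's arithmetic can be reproduced directly. This is only a sketch of the product of p-value cutoff and gene count described above; the function name is hypothetical.

```python
def expected_false_positives(p_threshold, n_genes):
    """Genes expected to pass the cutoff by chance alone when no gene
    truly differs: p-value threshold times number of genes tested."""
    return p_threshold * n_genes


at_p01 = expected_false_positives(0.01, 10_000)
at_p001 = expected_false_positives(0.001, 10_000)
print(round(at_p01), round(at_p001))   # 100 10
```

Tightening the cutoff from 0.01 to 0.001 reduces the expected chance hits on a 10,000-gene array from about 100 to about 10.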

RCC subtypes
Clear Cell RCC (70-80%)

Papillary (15-20%)

Chromophobe (4-5%)

Collecting duct

Oncocytoma

Sarcomatoid RCC

Goal:

Identify a panel of discriminator genes

Genetic Algorithm for Feature Selection
(figure: Sample (clear cell RCC, etc.) → raw measurement data → feature vector (f1, f2, f3, f4, f5) = pattern)

Why Genetic Algorithm?
Assuming 2,000 relevant genes, 20 important discriminator genes (features).
Cost of an exhaustive search for the optimal set of features?
$C(n,k) = \frac{n!}{k!\,(n-k)!}$
$C(2000, 20) = \frac{2000!}{20!\,1980!} \ge 100^{20} = 10^{40}$
If it takes one femtosecond ($10^{-15}$ second) to evaluate a set of features, it takes more than $3 \times 10^{17}$ years to find the optimal solution on the computer.
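These figures can be checked with exact integer arithmetic. The sketch below uses the slide's numbers (2,000 genes, 20 features, one femtosecond per evaluation); the year length of 365.25 days is an assumption of the example.

```python
import math

n_subsets = math.comb(2000, 20)        # exact number of 20-gene subsets
lower_bound = 100 ** 20                # the slide's bound, (2000/20)^20 = 10^40

seconds_per_evaluation = 1e-15         # one femtosecond
seconds_per_year = 365.25 * 24 * 3600  # assumed year length
years_for_lower_bound = lower_bound * seconds_per_evaluation / seconds_per_year

print(n_subsets >= lower_bound)        # True
print(f"{years_for_lower_bound:.2e}")  # 3.17e+17
```

The exact count is far larger than the bound, so the time estimate on the slide is conservative.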

Evolutionary Methods
Based on the mechanics of Darwinian evolution
  The evolution of a solution is loosely based on biological evolution
Population of competing candidate solutions
  Chromosomes (a set of features)
Genetic operators (mutation, recombination, etc.) generate new candidate solutions
Selection pressure directs the search
  Those that do well survive (selection) to form the basis for the next set of solutions.

A Simple Evolutionary Algorithm
(figure: loop of Selection, Genetic Operators, Evaluation)

Genetic Algorithm
(figure: flowchart with steps numbered 1 to 5; "Good enough" leads to Stop, "Not good enough" continues the loop)

Encoding
The most difficult and most important part of any GA
Encode so that illegal solutions are not possible
Encode to simplify the evolutionary processes, e.g. reduce the size of the search space
Most GAs use a binary encoding of a solution, but other schemes are possible

GA Fitness
At the core of any optimization approach is the function that measures the quality of a solution or optimization.
Called:
  Objective function
  Fitness function
  Error function
  measure
  etc.

Genetic Operators

Genetic Algorithm/K-Nearest Neighbor Algorithm
(figure: Microarray Database, Feature Selection (GA), Classifier (kNN))
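A compact sketch of how a GA and a nearest-neighbor classifier can be combined for feature selection is given below. It is not the implementation behind these slides: the binary mask encoding, leave-one-out 1-NN accuracy as the fitness function, truncation selection, single-point crossover, bit-flip mutation, and all parameter values and names are assumptions chosen to keep the example short.

```python
import random


def loo_1nn_accuracy(samples, labels, mask):
    """Fitness: leave-one-out 1-NN accuracy using only features where mask is 1."""
    chosen = [i for i, bit in enumerate(mask) if bit]
    if not chosen:
        return 0.0                      # an empty feature set cannot classify
    correct = 0
    for i, query in enumerate(samples):
        best_label, best_dist = None, float("inf")
        for j, other in enumerate(samples):
            if i == j:
                continue                # leave the query itself out
            dist = sum((query[f] - other[f]) ** 2 for f in chosen)
            if dist < best_dist:
                best_label, best_dist = labels[j], dist
        correct += best_label == labels[i]
    return correct / len(samples)


def mutate(mask, rate, rng):
    """Bit-flip mutation: each bit flips independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in mask]


def crossover(a, b, rng):
    """Single-point recombination of two parents (needs at least 2 features)."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def evolve(samples, labels, pop_size=20, generations=30, rate=0.1, seed=0):
    rng = random.Random(seed)
    n_features = len(samples[0])

    def fitness(mask):
        return loo_1nn_accuracy(samples, labels, mask)

    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)         # evaluation
        if fitness(population[0]) == 1.0:                   # good enough: stop
            break
        parents = population[: pop_size // 2]               # selection
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng),
                           rate, rng)                       # genetic operators
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)


# Hypothetical data: feature 0 separates the classes, the others are misleading.
samples = [(1.0, 5.0, 3.0, 7.0), (1.2, 1.0, 8.0, 2.0), (0.9, 9.0, 1.0, 5.0),
           (5.0, 4.8, 3.1, 6.9), (5.1, 1.2, 7.9, 2.2), (4.9, 9.1, 1.2, 5.1)]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

best_mask = evolve(samples, labels)
print(best_mask, loo_1nn_accuracy(samples, labels, best_mask))
```

On the toy data, using every feature gives a fitness of 0.0 while using only feature 0 gives 1.0, which is the kind of gap the search is meant to exploit. Because the best half of each generation is kept unchanged, the best fitness never decreases from one generation to the next.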

The specific pattern recognizer we chose to experiment with first was kNN
  Simple
  Well understood in the literature
  Good results
Before I tell you how we improved on the pattern recognizers we used
  Define the pattern recognition problem terminology
  Quick overview of terms - no defs
  Based on evolution much like ANNs are based on neural function
Brief overview of optimization to provide a common language
Details of EC, and specifically two branches of EC
  Genetic algorithms
  Evolutionary programming
Computer pat rec & how it relates to our biochemistry problems
Application to biochem
