Flow chart of Affymetrix from sample to information


Upload: sonel

Post on 06-Jan-2016



TRANSCRIPT

  • Flow chart of Affymetrix from sample to information

    Diagram labels: Generate Affy.dat file; Hyb. cRNA; Hybridize to Affy arrays; Output as Affy.chp file; Text; Self Organized Maps (SOMs); Functional annotation; Pathway assignment; Co-ordinate regulation; Promoter motif commonalities; Tissue

  • Microarray Data Analysis

    Data preprocessing and visualization

    Supervised learning

    Machine learning approaches

    Unsupervised learning

    Clustering and pattern detection

    Prediction of gene regulatory regions based on co-regulated genes

    Linkage between gene expression data and gene sequence/function databases

  • Data preprocessing

    Data preparation or pre-processing

    Normalization

    Feature selection: based on the quality of the signal intensity; based on the fold change; t-test

  • Normalization

    Need to scale the red sample so that the overall intensities for each chip are equivalent

  • Normalization

    To ensure the data are comparable, normalization attempts to correct for the following variables: number of cells in the sample; total RNA isolation efficiency; signal measurement sensitivity.

    Can use simple math: normalization by global scaling (bring each image to the same average brightness); normalization by sectors; normalization to housekeeping genes.

    Active research area
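Global scaling can be sketched in a few lines. This is a minimal illustration, not the procedure used by any particular Affymetrix software: the function name `global_scale`, the default target (the average of the per-chip means), and the toy intensities are all assumptions made for the example.

```python
from statistics import mean


def global_scale(chips, target=None):
    """Rescale every chip so that its mean intensity equals a common target.

    chips: list of chips, each a list of probe intensities (hypothetical layout).
    target: desired mean intensity; defaults to the average of the chip means.
    """
    chip_means = [mean(chip) for chip in chips]
    if target is None:
        target = mean(chip_means)
    return [
        [value * target / chip_mean for value in chip]
        for chip, chip_mean in zip(chips, chip_means)
    ]


# Toy data: the second chip is uniformly half as bright as the first.
chips = [[100.0, 200.0, 300.0], [50.0, 100.0, 150.0]]
scaled = global_scale(chips)
```

After scaling, both toy chips have the same average brightness, so a gene's intensity can be compared across chips.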

  • Basic Data Analysis

    Fold change (relative change in intensity for each gene)
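A fold change is the ratio of a gene's intensity in one condition to its intensity in another. The sketch below is illustrative only: the `floor` parameter, which guards against division by very small or zero intensities, is an assumption and is not mentioned on the slide.

```python
import math


def fold_change(treated, control, floor=1.0):
    """Ratio of treated to control intensity for one gene.

    floor is a hypothetical lower bound applied to both intensities so that
    near-zero background values do not produce extreme ratios.
    """
    return max(treated, floor) / max(control, floor)


def log2_fold_change(treated, control, floor=1.0):
    """Log2 of the fold change; up- and down-regulation become symmetric."""
    return math.log2(fold_change(treated, control, floor))


up = log2_fold_change(400.0, 100.0)    # four-fold increase
down = log2_fold_change(100.0, 400.0)  # four-fold decrease
```

On the log2 scale a four-fold increase and a four-fold decrease have the same magnitude with opposite signs, which is why log ratios are often preferred over raw ratios.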

  • Microarray Data Analysis

    Data preprocessing and visualization

    Supervised learning

    Machine learning approaches

    Unsupervised learning

    Clustering and pattern detection

    Prediction of gene regulatory regions based on co-regulated genes

    Linkage between gene expression data and gene sequence/function databases

  • Microarrays: An Example

    Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al., Science, v. 286, 1999

    72 examples (38 train, 34 test), about 7,000 probes

    Well-studied (CAMDA-2000), good test example

    ALL and AML: visually similar, but genetically very different

  • Feature selection

  • Hypothesis Testing

    The null hypothesis is a hypothesis about a population parameter.

    Hypothesis testing tests the viability of the null hypothesis against a set of experimental data.

    Example: test whether the time to respond to a tone is affected by the consumption of alcohol.

    Hypothesis: $\mu_1 - \mu_2 = 0$, where $\mu_1$ is the mean time to respond after consuming alcohol and $\mu_2$ is the mean time to respond otherwise.

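  • Z-test

    Theorem: If the $x_i$, $i = 1, \dots, n$, are independent and each has a normal distribution with mean $\mu$ and variance $\sigma^2$, then $U = \sum_i a_i x_i$ has a normal distribution with mean $E(U) = \mu \sum_i a_i$ and variance $D(U) = \sigma^2 \sum_i a_i^2$. In particular, $\bar{x} = \sum_i x_i / n \sim N(\mu, \sigma^2/n)$.

    Z test: $H_0: \mu = \mu_0$ ($\mu_0$ and $\sigma_0$ are known; assume $\sigma = \sigma_0$). What would one conclude about the null hypothesis that a sample of $N = 46$ with a mean of 104 could reasonably have been drawn from a population with the parameters $\mu = 100$ and $\sigma = 8$? Use $z = (\bar{x} - \mu_0) / (\sigma_0 / \sqrt{N})$.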

    Reject the null hypothesis.
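The numbers in the Z-test example can be checked directly. This is a small sketch using the standard one-sample z statistic; the two-sided p-value and the 0.05 significance level are assumptions, since the slide does not state which test direction or level it used.

```python
import math
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma0, n):
    """One-sample z statistic and two-sided p-value (sigma assumed known)."""
    z = (sample_mean - mu0) / (sigma0 / math.sqrt(n))
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Values from the slide: N = 46, sample mean 104, mu0 = 100, sigma0 = 8.
z, p_value = z_test(104, 100, 8, 46)
reject = p_value < 0.05  # assumed significance level
print(f"z = {z:.2f}, two-sided p = {p_value:.4f}, reject H0: {reject}")
```

The statistic is about 3.39, well beyond the usual two-sided cutoff of 1.96, which agrees with the slide's conclusion to reject the null hypothesis.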

  • Histogram (figure: Set 1, Set 2)

  • T-test

  • William Sealy Gosset (1876-1937), Guinness Brewing Company

  • Project 3

    A training data set (38 samples, 7129 probes, 27 ALL, 11 AML)

    A testing data set (35 samples, 7129 probes, 22 ALL, 13 AML)

    Lab today: pick the top probes that can differentiate the two subtypes and process the testing data set
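One way to pick the top probes is to rank them by the absolute value of a two-sample t statistic between the ALL and AML training samples. The sketch below uses Welch's unequal-variance form, which is an assumption (the slides do not say which t-test variant the lab uses); the probe names, the dictionary layout, and the toy values are hypothetical.

```python
import math
from statistics import mean, variance


def t_statistic(a, b):
    """Welch's two-sample t statistic (assumes non-zero variance in a or b)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def top_probes(expression, labels, n_top):
    """Return the n_top probe ids with the largest |t| between ALL and AML.

    expression: dict mapping probe id -> list of intensities, one per sample.
    labels: list of "ALL" / "AML" strings in the same sample order.
    """
    scores = {}
    for probe, values in expression.items():
        all_values = [v for v, lab in zip(values, labels) if lab == "ALL"]
        aml_values = [v for v, lab in zip(values, labels) if lab == "AML"]
        scores[probe] = abs(t_statistic(all_values, aml_values))
    return sorted(scores, key=scores.get, reverse=True)[:n_top]


# Toy training set: three ALL samples followed by three AML samples.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expression = {
    "probe_a": [10.0, 11.0, 12.0, 30.0, 31.0, 32.0],  # strongly different
    "probe_b": [10.0, 30.0, 20.0, 11.0, 29.0, 21.0],  # no real difference
    "probe_c": [5.0, 6.0, 7.0, 9.0, 10.0, 11.0],      # moderately different
}
selected = top_probes(expression, labels, n_top=2)
```

The selected probe ids would then be used to reduce both the training and the testing data sets to the same columns before classification.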

  • K Nearest Neighbor Classification

  • Distance measures

    Euclidean distance

    Manhattan distance
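The two distance measures and a k-nearest-neighbor vote can be sketched as follows. This is a minimal illustration on made-up two-feature samples; the function names, the default `k=3`, and the data are assumptions, not part of the slides.

```python
from collections import Counter


def euclidean(x, y):
    """Square root of the summed squared differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def manhattan(x, y):
    """Sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def knn_predict(train_x, train_y, query, k=3, distance=euclidean):
    """Label the query by majority vote among its k nearest training samples."""
    neighbors = sorted(
        zip(train_x, train_y), key=lambda pair: distance(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy training data: two well-separated groups of samples.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

first = knn_predict(train_x, train_y, (1.1, 1.0))
second = knn_predict(train_x, train_y, (5.0, 4.8), distance=manhattan)
```

An odd `k` avoids tied votes when there are only two classes.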

  • Jury Decisions

    Use one feature at a time for the classification

    Combine the results from the top 51 features

    Majority decision
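A jury decision can be sketched by letting each selected feature cast one vote and taking the majority. The slides do not say which single-feature classifier is used, so the nearest-class-mean rule below is an assumption, as are the function names and the toy data.

```python
from collections import Counter
from statistics import mean


def single_feature_vote(train_values, train_labels, query_value):
    """Vote for the class whose training mean on this feature is closest."""
    class_means = {}
    for label in set(train_labels):
        members = [v for v, lab in zip(train_values, train_labels) if lab == label]
        class_means[label] = mean(members)
    return min(class_means, key=lambda label: abs(class_means[label] - query_value))


def jury_decision(train_rows, train_labels, query_row, feature_indices):
    """Classify with one feature at a time, then return the majority label."""
    votes = Counter(
        single_feature_vote(
            [row[i] for row in train_rows], train_labels, query_row[i]
        )
        for i in feature_indices
    )
    return votes.most_common(1)[0][0]


# Toy data: four training samples, three features each.
train_rows = [(1.0, 10.0, 5.0), (2.0, 11.0, 6.0), (8.0, 20.0, 8.0), (9.0, 21.0, 9.0)]
train_labels = ["ALL", "ALL", "AML", "AML"]

verdict_a = jury_decision(train_rows, train_labels, (2.0, 12.0, 8.0), [0, 1, 2])
verdict_b = jury_decision(train_rows, train_labels, (9.0, 19.0, 6.0), [0, 1, 2])
```

In both toy queries one feature disagrees with the other two, and the majority overrules it. Using an odd number of features, such as the 51 on the slide, rules out tied votes between two classes.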

  • False Discovery

    Two possible errors in making a decision about the null hypothesis:

    We could reject the null hypothesis when it is actually true, i.e., our results were obtained by chance (Type I error).

    We could fail to reject the null hypothesis when it is actually false, i.e., our experiment failed to detect the true difference that exists (Type II error).

    We set $\alpha$ at a level that keeps the chances of making either of these errors acceptably low.

  • False Discovery

    Type I error: false discovery

    The expected number of false discoveries (referred to here as the False Discovery Rate, FDR) is equal to the p-value threshold of the t-test × the number of genes in the array

    For a p-value of 0.01 × 10,000 genes = 100 falsely "different" genes

    You cannot eliminate false positives, but by choosing a more stringent p-value, you can keep them manageable (try p = 0.001)

    The FDR must be smaller than the number of real differences that you find, which in turn depends on the size of the differences and variability of the measured expression values
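The arithmetic on this slide is easy to reproduce. The sketch below treats every gene as if the null hypothesis were true, which gives a rough upper estimate of the false positives; the function names and the example count of 400 called genes are assumptions for illustration.

```python
def expected_false_positives(p_threshold, n_genes):
    """Expected number of genes passing the threshold by chance alone."""
    return p_threshold * n_genes


def estimated_false_fraction(p_threshold, n_genes, n_called):
    """Rough share of the called genes that are expected to be false positives."""
    if n_called == 0:
        return 0.0
    return min(1.0, expected_false_positives(p_threshold, n_genes) / n_called)


loose = expected_false_positives(0.01, 10_000)    # about 100 genes
strict = expected_false_positives(0.001, 10_000)  # about 10 genes

# Hypothetical result: 400 genes called "different" at p = 0.01.
share = estimated_false_fraction(0.01, 10_000, 400)
```

If the expected false positives approach the number of genes called, most of the list is likely noise, which is the point of the last line on the slide.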

  • RCC subtypes

    Clear Cell RCC (70-80%)

    Papillary (15-20%)

    Chromophobe (4-5%)

    Collecting duct

    Oncocytoma

    Sarcomatoid RCC

    Goal: identify a panel of discriminator genes

  • Genetic Algorithm for Feature Selection

    Diagram: Sample (Clear cell RCC, etc.); Raw measurement data; f1, f2, f3, f4, f5; Feature vector = pattern

  • Why Genetic Algorithm?

    Assuming 2,000 relevant genes, 20 important discriminator genes (features).

    Cost of an exhaustive search for the optimal set of features? $C(n,k) = n! / (k!\,(n-k)!)$

    $C(2000, 20) = 2000! / (20!\,1980!) \geq (100)^{20} = 10^{40}$

    If it takes one femtosecond ($10^{-15}$ second) to evaluate a set of features, it takes more than $3 \times 10^{17}$ years to find the optimal solution on the computer.
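The size of the exhaustive search can be verified with exact integer arithmetic. This sketch uses the slide's own figures (2,000 genes, 20 features, one femtosecond per evaluation); the 365.25-day year is an assumption used only for the unit conversion.

```python
import math

N_GENES = 2000
N_FEATURES = 20
SECONDS_PER_EVALUATION = 1e-15               # one femtosecond, as on the slide
SECONDS_PER_YEAR = 365.25 * 24 * 3600        # assumed length of a year

# Exact number of 20-gene subsets of 2,000 genes (requires Python 3.8+).
n_subsets = math.comb(N_GENES, N_FEATURES)

# The slide's bound: C(n, k) >= (n / k) ** k = 100 ** 20 = 10 ** 40.
lower_bound = (N_GENES // N_FEATURES) ** N_FEATURES

years_lower_bound = lower_bound * SECONDS_PER_EVALUATION / SECONDS_PER_YEAR
years_exact = n_subsets * SECONDS_PER_EVALUATION / SECONDS_PER_YEAR

print(f"subsets: about 10^{len(str(n_subsets)) - 1}")
print(f"years using the 10^40 bound: {years_lower_bound:.2e}")
```

The exact count is far larger than the $10^{40}$ bound, so the slide's estimate of the search time is, if anything, conservative.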

  • Evolutionary Methods

    Based on the mechanics of Darwinian evolution: the evolution of a solution is loosely based on biological evolution

    Population of competing candidate solutions: chromosomes (a set of features)

    Genetic operators (mutation, recombination, etc.) generate new candidate solutions

    Selection pressure directs the search: those that do well survive (selection) to form the basis for the next set of solutions.

  • A Simple Evolutionary Algorithm (cycle diagram: Selection; Genetic Operators; Evaluation)

  • Genetic Algorithm (flow diagram, steps numbered 1 to 5; branches: Good enough, Stop; Not good enough)

  • Encoding

    Most difficult, and important, part of any GA

    Encode so that illegal solutions are not possible

    Encode to simplify the evolutionary processes, e.g. reduce the size of the search space

    Most GAs use a binary encoding of a solution, but other schemes are possible

  • GA Fitness

    At the core of any optimization approach is the function that measures the quality of a solution or optimization.

    Called: objective function; fitness function; error function; measure; etc.

  • Genetic Operators
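Two common genetic operators on a binary encoding are bit-flip mutation and one-point crossover. The sketch below is a generic illustration, not the operators used in the study; the function names are hypothetical, and each bit simply marks whether a feature is included.

```python
import random


def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability rate."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]


def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length parents at a random cut point."""
    point = rng.randrange(1, len(parent_a))
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


rng = random.Random(0)  # fixed seed so the example is repeatable
parent_a = [0, 0, 0, 0, 0, 0]
parent_b = [1, 1, 1, 1, 1, 1]
child_a, child_b = one_point_crossover(parent_a, parent_b, rng)
mutant = mutate(child_a, rate=0.1, rng=rng)
```

With this plain binary mask the number of selected features can change freely. An encoding that must keep exactly 20 features would need operators that preserve the count, which is one example of encoding so that illegal solutions are not possible.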

  • Genetic Algorithm/K-Nearest Neighbor Algorithm (diagram: Microarray Database; Feature Selection (GA); Classifier (kNN))

    The specific pattern recognizer we chose to experiment with first was kNN: simple; well understood in the literature; good results.

    Before I tell you how we improved on the pattern recognizers we used: define the pattern recognition problem terminology (quick overview of terms, no defs).

    Based on evolution, much like ANNs are based on neural function.

    Brief overview of optimization to provide a common language.

    Details of EC, and specifically two branches of EC: genetic algorithms; evolutionary programming.

    Computer pat rec and how it relates to our biochemistry problems.

    Application to biochem.
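The GA/kNN combination can be sketched end to end: a binary mask selects features, the fitness of a mask is the leave-one-out accuracy of a kNN classifier restricted to those features, and selection, recombination, and mutation produce the next population. Everything below is an assumption made for illustration (function names, population size, truncation selection, stopping rule, toy data); the slides do not specify these details.

```python
import random
from collections import Counter


def knn_label(train_x, train_y, query, k):
    """Majority label among the k training rows nearest to query (Euclidean)."""
    nearest = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def fitness(mask, rows, labels, k=3):
    """Leave-one-out kNN accuracy using only the features switched on in mask."""
    selected = [i for i, bit in enumerate(mask) if bit]
    if not selected:
        return 0.0
    reduced = [[row[i] for i in selected] for row in rows]
    correct = 0
    for j in range(len(reduced)):
        rest_x = reduced[:j] + reduced[j + 1:]
        rest_y = labels[:j] + labels[j + 1:]
        correct += knn_label(rest_x, rest_y, reduced[j], k) == labels[j]
    return correct / len(rows)


def evolve(rows, labels, n_features, pop_size=20, generations=30,
           mutation_rate=0.1, seed=0):
    """Evolve feature masks; stop early once a mask classifies perfectly."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda m: fitness(m, rows, labels),
                        reverse=True)                      # evaluation
        if fitness(ranked[0], rows, labels) == 1.0:        # good enough: stop
            return ranked[0]
        survivors = ranked[: pop_size // 2]                # selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            point = rng.randrange(1, n_features)           # recombination
            child = a[:point] + b[point:]
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]                     # mutation
            children.append(child)
        population = survivors + children
    return max(population, key=lambda m: fitness(m, rows, labels))


# Toy data: feature 0 separates the classes; features 1 to 3 are noise.
rows = [
    [0.1, 3.0, 7.0, 1.0], [0.2, 8.0, 2.0, 9.0], [0.3, 1.0, 5.0, 4.0],
    [0.4, 6.0, 9.0, 2.0], [0.5, 4.0, 1.0, 7.0], [0.6, 9.0, 6.0, 5.0],
    [9.1, 2.0, 8.0, 3.0], [9.2, 7.0, 3.0, 8.0], [9.3, 5.0, 4.0, 1.0],
    [9.4, 1.0, 6.0, 6.0], [9.5, 8.0, 2.0, 2.0], [9.6, 3.0, 9.0, 9.0],
]
labels = ["ALL"] * 6 + ["AML"] * 6
best_mask = evolve(rows, labels, n_features=4)
```

Keeping the survivors unchanged in the next generation means the best fitness never decreases. On real microarray data the fitness would normally be estimated on held-out samples as well, since a mask tuned only on the training set can overfit.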