analysis and management of microarray data dr g. p. s. raghava
Post on 11-Jan-2016
Analysis and Management of Microarray Data
Dr G. P. S. Raghava
Major Applications
Identification of differentially expressed genes in diseased tissues (in presence of drug)
Classification of differentially expressed genes, or clustering/grouping of genes having similar behaviour in different conditions
Use expression profiles of known diseases to diagnose and classify unknown genes
Management of Microarray Data
Magnitude of Data
– Experiments
50 000 genes in human
320 cell types
2000 compounds
3 time points
2 concentrations
2 replicates
– Data Volume
4 × 10^11 data points
10^15 bytes = 1 petabyte of data
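The data-point count above is just the product of the design factors. A minimal sketch of that arithmetic follows; the bytes-per-point figure is not on the slide, it is only what the two quoted totals would imply if taken together.

```python
# Experimental design factors quoted on the slide.
factors = {
    "genes": 50_000,
    "cell_types": 320,
    "compounds": 2_000,
    "time_points": 3,
    "concentrations": 2,
    "replicates": 2,
}

def count_data_points(design):
    """Multiply all design factors to get the number of measurements."""
    total = 1
    for size in design.values():
        total *= size
    return total

data_points = count_data_points(factors)
print(f"{data_points:.2e} data points")  # 3.84e+11 data points

# Derived, not stated on the slide: storage per data point that would
# make 4 x 10^11 points add up to 10^15 bytes (images, metadata, etc.).
implied_bytes_per_point = 1e15 / data_points
print(f"about {implied_bytes_per_point:.0f} bytes per data point")
```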
Gene expression database – a conceptual view
[Figure: gene expression matrix with axes labelled Genes and Samples; cells hold gene expression levels; gene annotations and sample annotations are attached to the matrix]
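One way to picture the conceptual view is as a matrix bundled with annotations for each axis. The sketch below is illustrative only: the class name, the choice of genes as rows and samples as columns, and the example gene and sample names are all assumptions, not part of the slides.

```python
from dataclasses import dataclass, field

@dataclass
class ExpressionDataset:
    """Expression matrix plus annotations for its two axes (hypothetical layout)."""
    genes: list      # one entry per matrix row (assumed orientation)
    samples: list    # one entry per matrix column (assumed orientation)
    levels: list     # levels[i][j] = expression level of gene i in sample j
    gene_annotations: dict = field(default_factory=dict)
    sample_annotations: dict = field(default_factory=dict)

    def level(self, gene, sample):
        """Look up one expression level by gene and sample name."""
        return self.levels[self.genes.index(gene)][self.samples.index(sample)]

# Hypothetical example values, for illustration only.
dataset = ExpressionDataset(
    genes=["geneA", "geneB"],
    samples=["normal", "tumor"],
    levels=[[1.0, 4.0], [2.0, 0.5]],
    gene_annotations={"geneA": "kinase", "geneB": "transporter"},
    sample_annotations={"normal": "healthy tissue", "tumor": "diseased tissue"},
)
print(dataset.level("geneB", "tumor"))
```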
Management of Microarray Data
Major Issues
Large volume of microarray data in last few years
– Storage and efficient access
– Comparison and integration of data
Problem of data access and exchange
– Data scattered around Internet
– Supplementary material of publications
– Difficult for user to access relevant data
Problems with existing databases
– Diverse purpose
– Developed for specific purpose
Management of Microarray Data
Specific Database
– Platform (e.g. Stanford MA Database; SMD)
– Organism (Yeast MA global viewer)
– Project (Life cycle database of Drosophila)
Problem with Supplement and MA databases
– Lack of direct access
– Quality not checked
– No standard format
– Incomplete data
Comprehensive database server to manage massive amount of Microarray Data
– Biomaterial Information
– Raw Data & Images
– Web Tools (normalization; data viewing; analysis)
Run on local servers allows full management and permission to add and view data
Minimum Information about Microarray Experiment (MIAME)
BASE http://bioinformatics1.uams.edu:8081:/
Public Databases
Gene Expression data is an essential aspect of annotating the genome
Publication and data exchange for microarray experiments
Data mining/Meta-studies
Common data format - XML
MIAME (Minimal Information About a Microarray Experiment)
GEO at the NCBI
Microarray Data Mining Challenges
too few records (samples), usually < 100
too many columns (genes), usually > 1,000
Too many columns likely to lead to False positives
for exploration, a large set of all relevant genes is desired
for diagnostics or identification of therapeutic targets, the smallest set of genes is needed
model needs to be explainable to biologists
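The false-positive point can be made concrete with a small calculation: if every gene is tested separately at significance level alpha and no gene truly differs, about n_genes × alpha genes are still flagged by chance. The sketch below assumes independent per-gene tests; the Bonferroni threshold is added here as one common correction and is not mentioned on the slides.

```python
def expected_false_positives(n_genes, alpha):
    """Expected number of genes flagged by chance when none truly differ."""
    return n_genes * alpha

def bonferroni_threshold(n_genes, alpha):
    """Per-gene p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_genes

n_genes = 1000   # "usually > 1,000" columns, per the slide
alpha = 0.05     # conventional significance level (assumption)

print(expected_false_positives(n_genes, alpha))
print(bonferroni_threshold(n_genes, alpha))
```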
Analysis of Microarray Data
Analysis of images
Preprocessing of gene expression data
Normalization of data
– Subtraction of Background Noise
– Global/local Normalization
– House keeping genes (or same gene)
– Expression in ratio (test/references) in log
Differential Gene expression
– Repeats and calculate significance (t-test)
– Significance of fold used statistical method
Clustering
– Supervised/Unsupervised (Hierarchical, K-means, SOM)
Prediction or Supervised Machine Learning (SVM)
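The "expression in ratio in log" and "repeats and t-test" steps above can be sketched for a single gene as follows. This is a minimal illustration, assuming base-2 logs and a one-sample t statistic that tests whether the mean log ratio across replicates differs from zero; the intensities are made up, and turning the statistic into a p-value would need a t distribution table or a statistics library.

```python
import math
import statistics

def log2_ratios(test, reference):
    """Per-replicate log2(test / reference) for one gene."""
    return [math.log2(t / r) for t, r in zip(test, reference)]

def one_sample_t(values, mu=0.0):
    """t statistic for the hypothesis that the mean of values equals mu."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return (mean - mu) / (sd / math.sqrt(n))

# Hypothetical background-corrected intensities for three replicates.
test_intensities = [400.0, 800.0, 1600.0]
reference_intensities = [100.0, 100.0, 100.0]

ratios = log2_ratios(test_intensities, reference_intensities)
t_stat = one_sample_t(ratios)
print(ratios, round(t_stat, 3))
```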
Low Level Analysis or Preprocessing of gene expression data
Scale Transformation
Normalization and Scaling
Replicate Handling
Missing value Handling
Flat pattern filtering
Pattern standardization
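Three of the preprocessing steps listed above can be sketched in a few lines. The slides do not say how each step is done, so the choices here are assumptions: missing values are filled with the gene's own mean, a gene is "flat" when its standard deviation falls below a hypothetical cutoff, and standardization means a per-gene z-score. Scale transformation and replicate handling are not covered.

```python
import statistics

def impute_row_mean(profile):
    """Replace missing values (None) with the mean of the observed values."""
    present = [v for v in profile if v is not None]
    mean = statistics.mean(present)
    return [mean if v is None else v for v in profile]

def is_flat(profile, min_sd):
    """A profile is 'flat' if it barely varies across conditions."""
    return statistics.pstdev(profile) < min_sd

def standardize(profile):
    """Rescale a profile to mean 0 and standard deviation 1."""
    mean = statistics.mean(profile)
    sd = statistics.pstdev(profile)
    return [(v - mean) / sd for v in profile]

def preprocess(profiles, min_sd=0.1):
    """Impute, drop flat genes, then standardize the rest."""
    cleaned = {}
    for gene, profile in profiles.items():
        filled = impute_row_mean(profile)
        if is_flat(filled, min_sd):
            continue  # flat pattern filtering
        cleaned[gene] = standardize(filled)
    return cleaned

# Hypothetical log-expression profiles; None marks a missing spot.
profiles = {"geneA": [1.0, None, 3.0], "geneB": [2.0, 2.1, 2.0]}
cleaned = preprocess(profiles)
print(cleaned)
```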
Normalization Techniques
Global normalization
– Divide channel value by means
Control spots
– Common spots in both channels
– House keeping genes
– Ratio of intensity of same gene in two channels is used for correction
Iterative linear regression
Parametric nonlinear normalization
– log(CY3/CY5) vs log(CY5)
– Fitted log ratio – observed log ratio
General Non Linear Normalization
– LOESS
– curve between log(R/G) vs log(sqrt(R.G))
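Global normalization and the two quantities plotted for the LOESS fit can be sketched as below. This is only an illustration: base-2 logs and the example intensities are assumptions, and the LOESS curve fit itself is not implemented here.

```python
import math

def global_normalize(channel):
    """Divide every value in a channel by that channel's mean."""
    mean = sum(channel) / len(channel)
    return [v / mean for v in channel]

def ma_values(red, green):
    """Per-spot log ratio M = log2(R/G) and log intensity A = log2(sqrt(R*G))."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [math.log2(math.sqrt(r * g)) for r, g in zip(red, green)]
    return m, a

# Hypothetical intensities: the red channel is uniformly twice as bright.
red = [200.0, 400.0, 800.0]
green = [100.0, 200.0, 400.0]

m_raw, a_raw = ma_values(red, green)
m_norm, a_norm = ma_values(global_normalize(red), global_normalize(green))
print(m_raw)   # constant offset caused by the brighter channel
print(m_norm)  # offset removed once each channel is divided by its mean
```

Global normalization removes only a constant offset between channels; an intensity-dependent bias is what the curve of M against A is meant to capture.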
Classification
Task: assign objects to classes (groups) on the basis of measurements made on the objects
Unsupervised: classes unknown, want to discover them from the data (cluster analysis)
Supervised: classes are predefined, want to use a (training or learning) set of labeled objects to form a classifier for classification of future observations
Cluster analysis
Used to find groups of objects when not already known
"Unsupervised learning"
Associated with each object is a set of measurements (the feature vector)
Aim is to identify groups of similar objects on the basis of the observed measurements
Unsupervised Learning
Hierarchical clustering: merging two branches at a time until all variables (genes) are in one tree. [it does not answer the question of "how many gene clusters are there?"]
K-means clustering: assuming there are K clusters. [what if this assumption is incorrect?]
Self Organizing Maps (SOM)
– Split all genes into similar sub-groups
– Finds its own groups (machine learning)
Principal Component
– every gene is a dimension (vector), find a single dimension that best represents the differences in the data
Model-based clustering: the number of clusters is determined dynamically [could be one of the most promising methods]
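To illustrate the K-means idea from the list above, here is a minimal sketch. It assumes Euclidean distance and, to stay deterministic, starts from the first k points as centroids; real implementations usually pick starting centroids at random and restart several times. The expression profiles are made up.

```python
import math

def kmeans(points, k, n_iter=20):
    """Plain K-means: alternate between assigning points and moving centroids."""
    centroids = [list(p) for p in points[:k]]  # simplistic initialisation
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical two-condition expression profiles for four genes.
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The caller has to supply k, which is the assumption the slide questions.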
[Figure: 'cluster' versus unclustered expression data. Average linkage hierarchical clustering, melanoma only]
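Average linkage, the method named in the figure caption, merges at each step the two clusters whose members are closest on average. A minimal sketch, assuming Euclidean distance and made-up one-dimensional profiles (it is not the analysis behind the melanoma figure):

```python
import math

def average_linkage(points):
    """Agglomerative clustering; returns the merges in the order they happen."""
    clusters = [[i] for i in range(len(points))]  # start with singletons
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average of all pairwise distances between the two clusters.
                dists = [
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                ]
                avg = sum(dists) / len(dists)
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        avg, a, b = best
        merges.append((clusters[a], clusters[b], avg))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical one-dimensional expression values for four genes.
points = [(0.0,), (1.0,), (10.0,), (12.0,)]
merges = average_linkage(points)
for left, right, height in merges:
    print(left, right, height)
```

The merge heights are what a dendrogram draws; cutting the tree at a chosen height is how a number of clusters is picked afterwards.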
Supervised Analysis
Fisher's linear discriminant analysis
Quadratic discriminant analysis
Logistic regression (a linear discriminant analysis)
Neural networks
Support vector machine
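As one example from the list above, logistic regression can be sketched with plain gradient descent. This is an illustration under stated assumptions: the learning rate, iteration count, and the tiny two-gene training set are all made up, and a real analysis would use a tested library and held-out samples.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.1, n_iter=2000):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(n_iter):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this sample: predicted probability - label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict(w, b, x):
    """Class 1 if the predicted probability is at least 0.5, else class 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Hypothetical log ratios of two genes in four labeled samples (0 or 1).
X = [[-2.0, -1.0], [-1.5, -0.5], [1.0, 2.0], [2.0, 1.5]]
y = [0, 0, 1, 1]

w, b = train_logistic(X, y)
print([predict(w, b, x) for x in X])
```

The decision boundary is linear in the gene expression values, and the fitted weights can be read gene by gene.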
Example: Tumor Classification
Reliable and precise classification essential for successful cancer treatment
Current methods for classifying human malignancies rely on a variety of morphological, clinical and molecular variables
Uncertainties in diagnosis remain; likely that existing classes are heterogeneous
Characterize molecular variations among tumors by monitoring gene expression (microarray)
Hope: that microarrays will lead to more reliable tumor classification (and therefore more appropriate treatments and better outcomes)
Higher Level Microarray data analysis
Clustering and pattern detection
Data mining and visualization
Controls and normalization of results
Statistical validation
Linkage between gene expression data and gene sequence/function/metabolic pathways databases
Discovery of common sequences in co-regulated genes
Meta-studies using data from multiple experiments
Thanks