Analysis and Management of Microarray Data

Dr G. P. S. Raghava

Major Applications

Identification of differentially expressed genes in diseased tissues (in presence of drug)
Classification of differentially expressed genes, or clustering/grouping of genes having similar behaviour in different conditions
Use the expression profile of a known disease to diagnose and classify unknown genes

Management of Microarray Data

Magnitude of Data
– Experiments
  50 000 genes in human
  320 cell types
  2000 compounds
  3 time points
  2 concentrations
  2 replicates
– Data Volume
  4*10^11 data points
  10^15 bytes = 1 petabyte of data
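
The data-point estimate can be checked by multiplying the experimental factors listed on the slide. The short sketch below does that arithmetic; the "bytes per data point" figure is only what the two slide numbers imply when divided, not a value stated in the source.

```python
# Factors taken from the slide.
genes = 50_000
cell_types = 320
compounds = 2_000
time_points = 3
concentrations = 2
replicates = 2

data_points = (genes * cell_types * compounds
               * time_points * concentrations * replicates)

petabyte = 10**15  # bytes, as quoted on the slide

# Derived quantity (an inference, not from the slide): storage per data point
# that would be needed for the two slide figures to agree.
bytes_per_point = petabyte / data_points

print(f"data points: {data_points:.2e}")
print(f"implied bytes per data point: {bytes_per_point:.0f}")
```

The product is about 3.8*10^11, which the slide rounds to 4*10^11. The implied few kilobytes per data point suggests the petabyte figure counts more than the expression values alone (for example, raw images).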

Gene expression database – a conceptual view

[Figure: a gene expression matrix of gene expression levels, with genes along one axis and samples along the other, linked to gene annotations and sample annotations]
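
One way to picture this conceptual view in code is a small container that keeps the expression matrix together with its two annotation tables. This is only an illustrative sketch: the class name `ExpressionDataset`, its fields, and the example genes and annotations are all hypothetical, and placing genes in rows and samples in columns is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ExpressionDataset:
    """Hypothetical container: matrix plus gene and sample annotations."""
    genes: list      # row labels
    samples: list    # column labels
    levels: list     # levels[i][j] = expression of genes[i] in samples[j]
    gene_annotations: dict = field(default_factory=dict)
    sample_annotations: dict = field(default_factory=dict)

    def __post_init__(self):
        # The matrix must have one row per gene and one column per sample.
        if len(self.levels) != len(self.genes):
            raise ValueError("one row of levels is needed per gene")
        if any(len(row) != len(self.samples) for row in self.levels):
            raise ValueError("each row needs one value per sample")

    def gene_profile(self, gene):
        """Expression of one gene across all samples (a matrix row)."""
        return self.levels[self.genes.index(gene)]

    def sample_profile(self, sample):
        """Expression of all genes in one sample (a matrix column)."""
        j = self.samples.index(sample)
        return [row[j] for row in self.levels]


# Made-up example values, for illustration only.
dataset = ExpressionDataset(
    genes=["geneA", "geneB"],
    samples=["control", "treated"],
    levels=[[1.0, 2.5], [0.8, 0.7]],
    gene_annotations={"geneA": "hypothetical kinase"},
    sample_annotations={"treated": "drug, 24 h"},
)
print(dataset.gene_profile("geneA"))
print(dataset.sample_profile("treated"))
```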

Management of Microarray Data

Major Issues

Large volume of microarray data in the last few years
– Storage and efficient access
– Comparison and integration of data
Problem of data access and exchange
– Data scattered around the Internet
– Supplementary material of publications
– Difficult for users to access relevant data
Problems with existing databases
– Diverse purposes
– Developed for a specific purpose

Management of Microarray Data

Specific Database
– Platform (e.g. Stanford MA Database; SMD)
– Organism (Yeast MA global viewer)
– Project (Life cycle database of Drosophila)
Problem with Supplement and MA databases
– Lack of direct access
– Quality not checked
– No standard format
– Incomplete data

Comprehensive database server to manage massive amounts of microarray data
– Biomaterial Information
– Raw Data & Images
– Web Tools (normalization; data viewing; analysis)
Running on local servers allows full management and permission to add and view data
Minimum Information About a Microarray Experiment (MIAME)

BASE http://bioinformatics1.uams.edu:8081:/

Public Databases

Gene expression data is an essential aspect of annotating the genome
Publication and data exchange for microarray experiments
Data mining/Meta-studies
Common data format - XML
MIAME (Minimal Information About a Microarray Experiment)
GEO at the NCBI

Microarray Data Mining Challenges

Too few records (samples), usually < 100
Too many columns (genes), usually > 1,000
Too many columns are likely to lead to false positives
For exploration, a large set of all relevant genes is desired
For diagnostics or identification of therapeutic targets, the smallest set of genes is needed
Model needs to be explainable to biologists
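
The false-positive problem can be made concrete with a little arithmetic. If every gene is tested separately at a significance level alpha and none is truly changed, about alpha times the number of genes will still pass by chance. The sketch below computes that expectation; the Bonferroni correction is shown as one common remedy and is an addition here, not something the slides prescribe. The gene count and alpha are example values.

```python
def expected_false_positives(n_genes, alpha):
    """Expected number of genes passing by chance when no gene truly changes."""
    return n_genes * alpha


def bonferroni_threshold(n_genes, alpha):
    """Per-gene p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_genes


# Example values (assumptions for illustration).
n_genes = 10_000
alpha = 0.05

print(expected_false_positives(n_genes, alpha))
print(bonferroni_threshold(n_genes, alpha))
```

With fewer than 100 samples and thousands of genes, hundreds of chance hits are expected at the usual 0.05 level. This is why the per-gene threshold has to be tightened or the gene set reduced before modelling.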

Analysis of Microarray Data

Analysis of images
Preprocessing of gene expression data
Normalization of data
– Subtraction of background noise
– Global/local normalization
– Housekeeping genes (or same gene)
– Expression as ratio (test/reference) in log
Differential gene expression
– Repeats and calculate significance (t-test)
– Significance of fold change using a statistical method
Clustering
– Supervised/Unsupervised (Hierarchical, K-means, SOM)
Prediction or Supervised Machine Learning (SVM)
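
The "log ratio plus t-test on repeats" step can be sketched as follows. This is a minimal illustration, assuming base-2 logs, a one-sample t-test of the replicate log ratios against zero, and a two-sided 5% level. The function names and intensity values are hypothetical, and the critical values are standard t-table entries for 2 to 5 degrees of freedom.

```python
import math
import statistics

# Two-sided 5% critical values of Student's t, keyed by degrees of freedom.
T_CRITICAL_05 = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}


def log_ratios(test, reference):
    """log2(test/reference) for each replicate of one gene."""
    return [math.log2(t / r) for t, r in zip(test, reference)]


def one_sample_t(values):
    """t statistic for the hypothesis that the mean log ratio is zero."""
    spread = statistics.stdev(values)
    if spread == 0:
        raise ValueError("replicates are identical; t is undefined")
    return statistics.fmean(values) / (spread / math.sqrt(len(values)))


def is_differential(test, reference):
    """Return (significant?, t statistic) for one gene."""
    ratios = log_ratios(test, reference)
    t = one_sample_t(ratios)
    return abs(t) > T_CRITICAL_05[len(ratios) - 1], t


# Made-up intensities for four replicate hybridizations.
changed, t_changed = is_differential([820, 790, 905, 860], [400, 410, 430, 395])
flat, t_flat = is_differential([400, 415, 390, 405], [410, 400, 395, 400])
print(changed, round(t_changed, 1))
print(flat, round(t_flat, 1))
```

A consistent two-fold change gives a large t statistic even with few replicates, while a gene whose ratio fluctuates around 1 does not pass.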

Low Level Analysis or Preprocessing of gene expression data

Scale Transformation
Normalization and Scaling
Replicate Handling
Missing value Handling
Flat pattern filtering
Pattern standardization
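
Several of these preprocessing steps can be chained in a few lines. The sketch below is one possible reading of them, not the author's pipeline. It assumes a log2 scale transformation, replicate averaging, missing values (written as `None`) replaced by the gene's mean, a standard-deviation cutoff for flat patterns, and z-score standardization. All names, the cutoff, and the data are hypothetical.

```python
import math
import statistics


def log2_transform(values):
    """Scale transformation; missing or non-positive values become None."""
    return [math.log2(v) if v is not None and v > 0 else None for v in values]


def average_replicates(replicate_rows):
    """Replicate handling: column-wise mean, ignoring missing values."""
    averaged = []
    for column in zip(*replicate_rows):
        present = [v for v in column if v is not None]
        averaged.append(statistics.fmean(present) if present else None)
    return averaged


def impute_row_mean(values):
    """Missing value handling: fill gaps with the mean of the observed values."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("no observed values to impute from")
    mean = statistics.fmean(present)
    return [mean if v is None else v for v in values]


def is_flat(values, min_sd):
    """Flat pattern filtering: little variation across conditions."""
    return statistics.stdev(values) < min_sd


def standardize(values):
    """Pattern standardization: zero mean, unit standard deviation."""
    mean, sd = statistics.fmean(values), statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def preprocess(matrix, min_sd=0.5):
    """Apply the steps to a {gene: raw intensities} mapping."""
    result = {}
    for gene, raw in matrix.items():
        row = impute_row_mean(log2_transform(raw))
        if not is_flat(row, min_sd):
            result[gene] = standardize(row)
    return result


# Made-up intensities; geneB barely changes and is filtered out.
raw = {"geneA": [100, 400, None, 1600], "geneB": [200, 210, 190, 205]}
processed = preprocess(raw)
print(list(processed))
print(average_replicates([[1.0, None], [3.0, 4.0]]))
```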

Normalization Techniques

Global normalization
– Divide channel value by means
Control spots
– Common spots in both channels
– Housekeeping genes
– Ratio of intensity of same gene in two channels is used for correction
Iterative linear regression
Parametric nonlinear normalization
– log(CY3/CY5) vs log(CY5)
– Fitted log ratio – observed log ratio
General Non Linear Normalization
– LOESS
– Curve between log(R/G) vs log(sqrt(R.G))
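
The sketch below illustrates two of these ideas on a two-channel array: global normalization by the channel mean, and an intensity-dependent correction on the log(R/G) versus log(sqrt(R.G)) scale. A single straight-line fit is used as a simplified stand-in for the curve fit. It is not LOESS, which fits many local regressions, and it is not iterated. Base-2 logs, the sign convention (observed minus fitted), the function names, and the intensities are all assumptions.

```python
import math


def global_normalize(channel):
    """Global normalization: divide every spot by the channel mean."""
    mean = sum(channel) / len(channel)
    return [v / mean for v in channel]


def ma_values(red, green):
    """M = log2(R/G) and A = log2(sqrt(R*G)) for every spot."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def intensity_normalize(red, green):
    """Subtract the fitted log ratio from the observed log ratio."""
    m, a = ma_values(red, green)
    slope, intercept = fit_line(a, m)
    return [mi - (intercept + slope * ai) for mi, ai in zip(m, a)]


# Made-up intensities with a dye bias that grows with intensity.
red = [120, 480, 1900, 7600]
green = [100, 300, 900, 2700]

scaled_red = global_normalize(red)
corrected = intensity_normalize(red, green)
print([round(v, 3) for v in corrected])
```

After the correction the log ratios are centred on zero across the intensity range, which is the property both the regression and the LOESS approaches aim for.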

Classification

Task: assign objects to classes (groups) on the basis of measurements made on the objects
Unsupervised: classes unknown, want to discover them from the data (cluster analysis)
Supervised: classes are predefined, want to use a (training or learning) set of labeled objects to form a classifier for classification of future observations

Cluster analysis

Used to find groups of objects when not already known
“Unsupervised learning”
Associated with each object is a set of measurements (the feature vector)
Aim is to identify groups of similar objects on the basis of the observed measurements
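
"Similar" has to be defined before any grouping can happen. The sketch below compares two common choices for gene feature vectors: Euclidean distance and a correlation-based distance (1 minus the Pearson correlation). The slides do not say which measure is intended, so both are shown as assumptions, and the profiles are made up.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.dist(a, b)


def pearson(a, b):
    """Pearson correlation of two equally long profiles."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    norm = math.sqrt(sum((x - ma) ** 2 for x in a)
                     * sum((y - mb) ** 2 for y in b))
    return cov / norm


def correlation_distance(a, b):
    """0 for identical shapes, 2 for exactly opposite shapes."""
    return 1.0 - pearson(a, b)


# Made-up profiles: gene_b has the same shape as gene_a at 10x the level,
# gene_c has the opposite trend.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]
gene_c = [4.0, 3.0, 2.0, 1.0]

print(euclidean(gene_a, gene_b), correlation_distance(gene_a, gene_b))
print(euclidean(gene_a, gene_c), correlation_distance(gene_a, gene_c))
```

Euclidean distance treats `gene_a` and `gene_b` as far apart because of their magnitude, while the correlation distance treats them as nearly identical in behaviour. The choice of measure therefore changes which genes end up grouped together.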

Unsupervised Learning

Hierarchical clustering: merging two branches at a time until all variables (genes) are in one tree. [It does not answer the question of “how many gene clusters are there?”]
K-means clustering: assuming there are K clusters. [What if this assumption is incorrect?]
Self Organizing Maps (SOM)
– Split all genes into similar sub-groups
– Finds its own groups (machine learning)
Principal Component
– Every gene is a dimension (vector), find a single dimension that best represents the differences in the data
Model-based clustering: the number of clusters is determined dynamically [could be one of the most promising methods]
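
A minimal K-means sketch makes the "assuming there are K clusters" caveat concrete, because K is an input that the algorithm never questions. The version below is plain Lloyd-style iteration with Euclidean distance. Using the first K profiles as starting centroids is an assumption made to keep the example deterministic, and the profiles are invented.

```python
import math


def kmeans(points, k, iterations=20):
    """Assign each point to one of k clusters; returns (labels, centroids)."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


# Made-up expression profiles: two low genes and two high genes.
profiles = [[0.1, 0.2, 0.1], [5.0, 5.1, 4.9], [0.0, 0.3, 0.2], [5.2, 4.8, 5.0]]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

Calling the same function with `k=3` would still return three groups for these data, even though only two are visibly present. That is exactly the risk the slide flags.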

[Figure: unclustered vs. clustered expression data; average linkage hierarchical clustering, melanoma only]
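
Average linkage hierarchical clustering can be sketched in a few lines: start with every object in its own cluster and repeatedly merge the two clusters with the smallest mean pairwise distance. Euclidean distance, the stopping rule (a requested number of clusters), and the points are assumptions for illustration. This is unrelated to the melanoma data behind the figure.

```python
import math
from itertools import combinations


def average_linkage(points, n_clusters=1):
    """Agglomerative clustering; returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average linkage: mean distance over all cross-cluster pairs.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: linkage(clusters[pair[0]],
                                            clusters[pair[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]
    return clusters


# Made-up two-dimensional points.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.2], [10.0, 0.0]]
groups = average_linkage(points, n_clusters=3)
print(sorted(sorted(g) for g in groups))
```

Running to `n_clusters=1` builds the whole tree. Where to cut it is a separate decision, which is why hierarchical clustering alone does not say how many clusters there are.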

Supervised Analysis

Fisher’s linear discriminant analysis
Quadratic discriminant analysis
Logistic regression (a linear discriminant analysis)
Neural networks
Support vector machine
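
As an illustration of the first method in the list, the sketch below implements a two-class Fisher linear discriminant for samples described by just two genes. The direction is the inverse within-class scatter matrix applied to the difference of the class means. Restricting to two features, placing the decision threshold midway between the projected class means, and the training data are all assumptions made to keep the example short.

```python
def mean_vector(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def scatter(rows, mean):
    """2x2 scatter matrix of one class around its mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for row in rows:
        d = [row[0] - mean[0], row[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fisher_lda(class0, class1):
    """Return (direction w, threshold) separating two classes."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [dot(inv[0], diff), dot(inv[1], diff)]
    # Assumed decision rule: midpoint between the projected class means.
    threshold = 0.5 * (dot(w, m0) + dot(w, m1))
    return w, threshold


def predict(w, threshold, sample):
    """Class 1 if the projection exceeds the threshold, else class 0."""
    return 1 if dot(w, sample) > threshold else 0


# Made-up training samples: (expression of gene 1, expression of gene 2).
class0 = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.5], [1.2, 2.2]]
class1 = [[4.0, 0.5], [4.5, 1.0], [5.0, 0.8], [4.2, 0.2]]

w, threshold = fisher_lda(class0, class1)
print(predict(w, threshold, [4.4, 0.6]), predict(w, threshold, [1.3, 2.1]))
```

With thousands of genes and few samples the within-class scatter matrix becomes singular, so in practice a gene-selection or regularization step would be needed before a discriminant like this can be fitted.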

Example: Tumor Classification

Reliable and precise classification essential for successful cancer treatment
Current methods for classifying human malignancies rely on a variety of morphological, clinical and molecular variables
Uncertainties in diagnosis remain; likely that existing classes are heterogeneous
Characterize molecular variations among tumors by monitoring gene expression (microarray)
Hope: that microarrays will lead to more reliable tumor classification (and therefore more appropriate treatments and better outcomes)

Higher Level Microarray data analysis

Clustering and pattern detection
Data mining and visualization
Controls and normalization of results
Statistical validation
Linkage between gene expression data and gene sequence/function/metabolic pathways databases
Discovery of common sequences in co-regulated genes
Meta-studies using data from multiple experiments

Thanks
