session ii g3 overview behavior science mmc

Epidemiology modeling (Microarray, NGS & qRT-PCR). Theme: Transcriptional Program in the Response of Human Fibroblasts to Serum. Etienne Z. Gnimpieba, BRIN WS 2013, Mount Marty College, June 24th 2013.


Page 1: Session ii g3 overview behavior science mmc

Epidemiology modeling (Microarray, NGS & qRT-PCR)

Theme: Transcriptional Program in the Response of Human Fibroblasts to Serum.

Etienne Z. Gnimpieba
BRIN WS 2013

Mount Marty College – June 24th 2013

Page 2: Session ii g3 overview behavior science mmc

Data Manipulation: Gene Expression Data Analysis

OMIC World

[Figure: the OMIC world. Diagram labels: DNA, Transcription, mRNA, Translation, E, Catalyse, S, P, Gene Repression, Degradation. Disciplines shown: Genomics, Functional Genomics, Transcriptomics, Proteomics, Metabolomics.]


Page 3: Session ii g3 overview behavior science mmc

OMIC World

GENOMICS


Page 4: Session ii g3 overview behavior science mmc

OMIC World

Genomics is the subdiscipline of genetics devoted to the mapping, sequencing, and functional analysis of genomes. Genomics can be said to have appeared in the 1980s, and it took off in the 1990s with the initiation of genome projects for several biological species.

The most important tools here are microarrays and bioinformatics.

DNA microarrays allow rapid measurement and visualization of differential gene expression at the whole-genome scale. Although implementing the technique is quite complicated, its principle is very simple. The major steps involved in this process are described here.


Page 5: Session ii g3 overview behavior science mmc

Process

1. Biological question (differentially expressed genes, sample class prediction, etc.)
2. Experimental design
3. Microarray experiment
4. Image analysis
5. Normalization
6. Estimation, Testing, Clustering, Discrimination
7. Biological verification and interpretation


Page 6: Session ii g3 overview behavior science mmc

Process


Page 7: Session ii g3 overview behavior science mmc

Microarray Production Process

High density filters (macroarrays)
• Size: 12 cm x 8 cm
• 2,400 clones per membrane
• radioactive labelling
• 1 experimental condition per membrane

Glass slides (microarrays)
• Size: 5.4 cm x 0.9 cm
• 10,000 clones per slide
• fluorescent labelling
• 2 experimental conditions per slide

Oligonucleotide chips
• Size: 1.28 cm x 1.28 cm
• 300,000 oligonucleotides per slide
• fluorescent labelling
• 1 experimental condition per slide


Page 8: Session ii g3 overview behavior science mmc

Microarray Production Process

[Figure: microarray workflow. Panels: Target Preparation; Hybridization; Slide Scanning; Expression Profile Clustering.]

• Frouin, V. & Gidrol, X. (2005)
• CBB group (Berlin)
• Transcriptome ENS (France)

Page 9: Session ii g3 overview behavior science mmc

Microarray Production Process

• Image analysis (GenePix)
• Normalization (R)
• Pre-treatment
• Differential expression
• Clustering
• Data mining
• Annotation


Page 10: Session ii g3 overview behavior science mmc

Excel Used in Genomics

Plan

• How to select columns
• How to use functions
• How to anchor a cell value in a function
• How to copy the function result and not the function itself
• How to sort data by columns
• How to search and replace

Page 11: Session ii g3 overview behavior science mmc

Excel Used in Genomics: Pre-Treatment

Centering and Scaling Data

1. Open the file containing the experiment series (your expression matrix) in Excel, using the tab character as the column separator.

2. For one column (corresponding to one DNA microarray experiment), calculate the mean value using the Excel AVERAGE function. Check whether the value obtained is equal to zero.

3. If it is not, subtract the column mean from each log2(Ratio) value of that experiment. Be careful with missing values (empty cells): replace the empty contents with the string NULL or NA so that Excel does not introduce a zero into the calculation for that cell. A missing value is not the same as a true zero!

4. Once this is done, verify that the final mean value is equal to zero, to rule out errors in Excel handling. Be careful with the decimal separator, which depends on the Excel version (dot or comma)!
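The centering steps above can also be sketched outside Excel. The following is a minimal Python illustration, not part of the original lab: the function name `center_column` and the sample values are hypothetical, and missing values are represented by `None` so that they are never counted as zeros.

```python
from statistics import mean

def center_column(values):
    """Subtract the column mean from every non-missing log2(ratio) value.

    Missing values (None) stay None, so they never enter the mean as zeros.
    """
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    col_mean = mean(present)
    return [None if v is None else v - col_mean for v in values]

# Hypothetical log2(ratio) values for one microarray experiment (one column).
experiment = [1.2, -0.4, None, 0.8, 2.0]
centered = center_column(experiment)

# Step 4 of the procedure: the mean of the centered column should be zero.
centered_mean = mean(v for v in centered if v is not None)
print(centered)
print(abs(centered_mean) < 1e-9)
```

Comparing the final mean against a small tolerance rather than exact zero avoids false alarms caused by floating-point rounding.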


Page 12: Session ii g3 overview behavior science mmc

Excel Used in Genomics: Differential Expression Analysis (1)

Significance Analysis of Microarrays (SAM): SAM is an Excel macro, freely available to academics on the web, that searches for differentially expressed genes using a resampling (permutation) method. Website: http://www-stat.stanford.edu/~tibs/SAM/

Running SAM inside an Excel spreadsheet makes the tool easier to use for most microarray users. Using SAM requires several modifications to your data file:

• The ratio or intensity values in the Excel sheet must use points, not commas, as the decimal separator.

• The header line depends on the type of analysis you want to perform; refer to the SAM manual for more information. You must therefore duplicate your header if you do not want to lose the experiment information (see the slide image).

• Two annotation columns are available. SAM always references its calculations to the line number in the source sheet.


Page 13: Session ii g3 overview behavior science mmc

Excel Used in Genomics: Differential Expression Analysis (2)

When the SAM macro is launched from the toolbar (“SAM”), a settings window appears. For details on the available options, refer to the SAM manual. The first important step is to indicate whether or not the source data have been log2-transformed. Then, because the resampling relies on a random number generator, you need to initialize it, generating a new seed several times.

Once all the chosen iterations have been run, SAM displays a plot that positions each gene by its score in the real distribution against its score in the random distributions. The differentially expressed genes are the ones that move away from the 45° line.

First, display the delta table. For each delta value, this table gives the number of putative differentially expressed genes, the significant genes, and the number of false positive genes estimated using the False Discovery Rate (FDR). The user sets the delta value according to the number of false positives or significant genes they want to obtain.

To choose the delta value, go back to the SAM plot sheet and display the “SAM Plot Controller” by clicking on the SAM macro button.

The SAM Plot Controller window lets you set the delta value you want with “Manually Enter Delta”. If you then click the “List Significant Genes” button, SAM lists the differentially expressed genes in the “SAM output” sheet according to the delta value you chose.

This sheet summarizes the selected parameters and gives the list of induced and repressed genes.
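To make the role of delta and the 45° line more concrete, here is a simplified Python sketch of the idea behind SAM. It is not the Excel add-in's code and it makes several assumptions: the fudge factor `s0` is simply the median standard error, each gene is flagged individually when its observed score differs from its expected score by more than delta, and the data are synthetic. All function and variable names are hypothetical.

```python
import random
from statistics import mean, median, variance

def diff_and_se(rows, labels):
    """Per gene: difference of group means and its pooled standard error."""
    stats = []
    for row in rows:
        a = [v for v, g in zip(row, labels) if g == 1]
        b = [v for v, g in zip(row, labels) if g == 0]
        pooled = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (len(a) + len(b) - 2)
        se = (pooled * (1 / len(a) + 1 / len(b))) ** 0.5
        stats.append((mean(a) - mean(b), se))
    return stats

def sam_sketch(rows, labels, delta, n_perm=50, seed=0):
    """Return observed scores and indices of genes far from the 45-degree line."""
    rng = random.Random(seed)  # seeded generator, as in the SAM settings window
    stats = diff_and_se(rows, labels)
    s0 = median(se for _, se in stats)  # assumption: simplistic fudge factor
    observed = [diff / (se + s0) for diff, se in stats]
    # Expected order statistics: average of sorted scores over label permutations.
    expected = [0.0] * len(rows)
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        permuted = sorted(diff / (se + s0) for diff, se in diff_and_se(rows, shuffled))
        expected = [e + d / n_perm for e, d in zip(expected, permuted)]
    ranked = sorted(range(len(rows)), key=lambda i: observed[i])
    called = sorted(i for rank, i in enumerate(ranked)
                    if abs(observed[i] - expected[rank]) > delta)
    return observed, called

# Synthetic log2 data: 60 genes, 6 treated and 6 control samples;
# genes 0 to 4 are artificially induced in the treated group.
data_rng = random.Random(1)
labels = [1] * 6 + [0] * 6
rows = [[data_rng.gauss(0, 0.5) for _ in labels] for _ in range(60)]
for gene in range(5):
    for sample in range(6):
        rows[gene][sample] += 5.0

observed, called = sam_sketch(rows, labels, delta=1.5)
print(called)
```

Raising delta makes the call more conservative (fewer genes, fewer false positives), which mirrors the trade-off shown in the delta table.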


Page 14: Session ii g3 overview behavior science mmc

GEPAS: Gene Expression Pattern Analysis Suite

• Verify that the data file named FibroGEPAS.txt is available in your folder
• Open the dataset to review its description
• Open the GEPAS portal at http://www.transcriptome.ens.fr/gepas/index.html
• Click on “Tools”
  – Preprocessing
    - Preprocess DNA array data files: log-transformation, replicate handling, missing value imputation, filtering and normalization
    - Filtering
  – Viewing
  – Clustering
  – Differential expression
  – Classification
  – Data mining
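As an illustration of what the preprocessing step involves, the sketch below applies three of the listed operations to a tiny expression matrix: log-transformation, filtering of genes with too many missing values, and missing value imputation. It is only a hedged example, not GEPAS code; the function name `preprocess`, the `max_missing` threshold, the gene names, and the choice of mean imputation are all assumptions.

```python
import math
from statistics import mean

def preprocess(matrix, max_missing=0.5):
    """log2-transform ratios, drop genes with too many gaps, impute the rest."""
    result = {}
    for gene, ratios in matrix.items():
        # Log-transformation; missing values (None) are carried through.
        logged = [None if r is None else math.log2(r) for r in ratios]
        n_missing = sum(v is None for v in logged)
        # Filtering: skip genes that are mostly (or entirely) missing.
        if n_missing == len(logged) or n_missing / len(logged) > max_missing:
            continue
        # Imputation (assumption): replace gaps with the gene's own mean.
        gene_mean = mean(v for v in logged if v is not None)
        result[gene] = [gene_mean if v is None else v for v in logged]
    return result

# Hypothetical expression ratios for three genes across four arrays.
raw = {
    "geneA": [2.0, 4.0, None, 1.0],
    "geneB": [None, None, None, 0.5],
    "geneC": [1.0, 0.25, 8.0, 0.5],
}
clean = preprocess(raw)
print(clean)
```

Filtering before imputing keeps genes with very little real data from being filled with values that are mostly invented.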


Page 15: Session ii g3 overview behavior science mmc

Microarray Dataset: Mining and Gene Profile Analysis using online Tools

Kruer Lab

Page 16: Session ii g3 overview behavior science mmc

Plan

• Gene Expression Measurement
• Microarray Process
• Gene Expression Data Stores
• Data Mining / Querying
• Data Analysis
• Example: ATP13A2 Profile in Stress Conditions

Page 17: Session ii g3 overview behavior science mmc

Gene Expression Measurement

Higher-plex techniques:
• SAGE
• DNA microarray
• Tiling array
• RNA-Seq
• NGS

Low-to-mid-plex techniques:
• Reporter gene
• Northern blot
• Western blot
• Fluorescent in situ hybridization
• Reverse transcription PCR

Page 18: Session ii g3 overview behavior science mmc

Gene Expression Measurement

Database | Microarray Experiment Sets | Sample Profiles | Date Reported
ArrayExpress at EBI | 24,838 | 708,914 | October 28, 2011
ArrayTrack™ | 1,622 | 50,953 | February 11, 2012
caArray at NCI | 41 | 1,741 | November 15, 2006
Gene Expression Omnibus - NCBI | 25,859 | 641,770 | October 28, 2011
Genevestigator database | 2,500 | 65,000 | January 2012
MUSC database | ~45 | 555 | April 1, 2007
Stanford Microarray database | 82,542 | Not reported | October 23, 2011
UNC Microarray database | ~31 | 2,093 | April 1, 2007
UNC modENCODE Microarray database | ~6 | 180 | July 17, 2009
UPenn RAD database | ~100 | ~2,500 | September 1, 2007
UPSC-BASE | ~100 | Not reported | November 15, 2007

SAGE, GEO, GUDMAP (421), MGI, BioGPS

Page 19: Session ii g3 overview behavior science mmc

Data Mining / Querying

• Problem specification
• Query
• Extraction
• Storage
• Load
• Pretreat / prepare for analysis

Page 20: Session ii g3 overview behavior science mmc

Data Analysis

• Question-Answer
  – Experimental condition profile: group comparison
  – Annotation profile: biological systems involved
  – Clustering profile: co-regulation
  – Time course profile: time variation
  – …
• Descriptive
  – Boxplot (SD, mean, median)
  – Scatter plot
• Predictive / inference (clustering)
• Modeling (machine learning, simulation)
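For the descriptive step, the numbers behind a boxplot can be computed directly. The sketch below is an illustration only: the function name `boxplot_summary` and the two sample groups are hypothetical, and the "inclusive" quartile method is one convention among several.

```python
from statistics import mean, quantiles, stdev

def boxplot_summary(values):
    """Five-number summary plus mean and SD for one sample or group."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values), "q1": q1, "median": q2, "q3": q3, "max": max(values),
        "mean": mean(values), "sd": stdev(values),
    }

# Hypothetical log2 expression values for two experimental groups.
samples = {
    "control": [-1.0, -0.5, 0.0, 0.5, 1.0],
    "treated": [0.0, 0.5, 1.0, 1.5, 2.0],
}
summary = {name: boxplot_summary(vals) for name, vals in samples.items()}
for name, stats in summary.items():
    print(name, stats)
```

Comparing medians and interquartile ranges across samples is a quick way to see whether the distributions are comparable before any group comparison.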


Page 21: Session ii g3 overview behavior science mmc

Data Analysis

• 3 Questions
  – What is the right dataset (experimental condition)?
  – Is the dataset ready for analysis (quality)?
  – What is the expression profile for a given gene?
  – Significant differential expression in group comparisons
• Tools
  – ArrayExpress (EBI)
  – Boxplot
  – GEO2R (LIMMA, profile graph)

Page 22: Session ii g3 overview behavior science mmc

Data Analysis

Boxplot

Page 23: Session ii g3 overview behavior science mmc

Example: ATP13A2 Profile in Stress Conditions

• Specification: ATP13A2 profile in stress conditions
• Data querying:
  – GEO
  – ArrayExpress
  – Gene Atlas
• Data analysis:
  – Online: GEO2R, Genospace, …
  – Desktop: R, ArrayTrack, …
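The querying step for GEO can also be scripted. The sketch below only builds a search URL for the NCBI E-utilities `esearch` endpoint against the GEO DataSets database (`db=gds`); it does not send any request. The search term and the helper name `geo_search_url` are assumptions chosen for illustration and were not part of the original lab.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def geo_search_url(term, max_results=20):
    """Build an E-utilities search URL for GEO DataSets (no request is sent)."""
    params = {"db": "gds", "term": term, "retmax": max_results, "retmode": "json"}
    return EUTILS_ESEARCH + "?" + urlencode(params)

# Hypothetical query for the example gene in stress-related datasets.
url = geo_search_url("ATP13A2 AND stress")
print(url)
```

To actually run the query, the URL could be passed to `urllib.request.urlopen`, and the returned dataset identifiers then opened in GEO2R or downloaded for desktop analysis.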


Page 24: Session ii g3 overview behavior science mmc

Lab #2: Gene Expression Data Analysis

Context

Statement of problem / Case study: The temporal program of gene expression during a model physiological response of human cells, the response of fibroblasts to serum, was explored with a complementary DNA microarray representing about 8600 different human genes. Genes could be clustered into groups on the basis of their temporal patterns of expression in this program. Many features of the transcriptional program appeared to be related to the physiology of wound repair, suggesting that fibroblasts play a larger and richer role in this complex multicellular response than had previously been appreciated.

Vishwanath R. Iyer et al., Science, 1999

Specification & Aims

Aim: The purpose of this lab is to introduce the gene expression data analysis process. We simulated its application to “Transcriptional Program in the Response of Human Fibroblasts to Serum”. We can now understand how a researcher comes to identify a significantly expressed gene from a microarray dataset.

Resolution Process

T1. Gene expression overview
T1.1. Review of the place of genomics in the OMIC world
T1.2. Microarray data techniques and process
T1.3. Data analysis cycle and tools

T2. Excel used in genomics
Objective: use basic Excel functionalities to address some gene expression data analysis needs
T2.1. Column manipulation, using functions, anchoring, copying function results, sorting data, search and replace
T2.2. Experiment comparison: data pre-treatment
T2.3. Differentially expressed genes from replicate experiments (SAM)

T3. GEPAS: Gene Expression Pattern Analysis Suite
Objective: use the GEPAS suite to apply the whole microarray data analysis process to the fibroblast data.
http://www.transcriptome.ens.fr/gepas/index.html
Preprocessing, Viewing, Clustering, Differential expression, Classification, Data mining

Acquired skills
- Gene expression data overview
- Excel used for genomics
- Microarray data analysis using GEPAS

Conclusion: ?
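The case study groups genes by the similarity of their temporal expression patterns. As a rough illustration of that idea, the sketch below groups toy time-course profiles by Pearson correlation. It is a simplified greedy grouping, not the clustering method used in the cited paper or in GEPAS; the gene names, profile values, and the 0.9 threshold are invented for the example.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equally long profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def group_profiles(profiles, threshold=0.9):
    """Greedy grouping: a gene joins the first group whose seed it resembles."""
    groups = []  # each group is a list of gene names; the first is the seed
    for gene, profile in profiles.items():
        for group in groups:
            if pearson(profile, profiles[group[0]]) >= threshold:
                group.append(gene)
                break
        else:
            groups.append([gene])
    return groups

# Hypothetical log2 ratios at five time points after serum stimulation.
profiles = {
    "early_up_1": [0.0, 2.0, 3.0, 1.0, 0.0],
    "early_up_2": [0.1, 1.8, 3.2, 0.9, 0.1],
    "late_up_1": [0.0, 0.0, 0.5, 2.0, 3.0],
    "late_up_2": [0.1, 0.0, 0.4, 2.2, 2.9],
    "repressed_1": [0.0, -1.0, -2.0, -2.5, -3.0],
}
groups = group_profiles(profiles)
print(groups)
```

Correlation compares the shape of the profiles rather than their magnitude, which is why it is a common similarity measure for time-course expression data.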