gene expression bmi 731 week 5 catalin barbacioru department of biomedical informatics ohio state...
Post on 21-Dec-2015
![Page 1: Gene Expression BMI 731 week 5 Catalin Barbacioru Department of Biomedical Informatics Ohio State University](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d635503460f94a4656f/html5/thumbnails/1.jpg)
Gene Expression
BMI 731 week 5
Catalin Barbacioru
Department of Biomedical Informatics
Ohio State University
Thesis: the analysis of gene expression data is going to be big in
21st century statistics
Many different technologies, including
High-density nylon membrane arrays
Serial analysis of gene expression (SAGE)
Short oligonucleotide arrays (Affymetrix)
Long oligo arrays (Agilent)
Fibre optic arrays (Illumina)
cDNA arrays (Brown/Botstein)*
[Figure: Total microarray articles indexed in Medline. x-axis: Year, 1995 to 2001 (2001 projected); y-axis: Number of papers, 0 to 600.]
Common themes
• Parallel approach to collection of very large amounts of data (by biological standards)
• Sophisticated instrumentation, requires some understanding
• Systematic features of the data are at least as important as the random ones
• Often more like industrial process than single investigator lab research
• Integration of many data types: clinical, genetic, molecular, … databases
Biological background
DNA
G T A A T C C T C
| | | | | | | | |
C A T T A G G A G
Transcription (RNA polymerase)
mRNA
G U A A U C C
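The base pairing on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `transcribe` is made up here, and the sketch pairs bases position by position exactly as the diagram does, ignoring strand direction (5' to 3') and everything else about the real enzyme.

```python
# Pairing rules for reading a DNA template strand into mRNA.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_strand: str) -> str:
    """Return the mRNA complementary to a DNA template strand, base by base."""
    return "".join(DNA_TO_RNA[base] for base in template_strand)

# Lower strand of the DNA duplex shown on the slide.
template = "CATTAGGAG"
mrna = transcribe(template)
print(mrna)  # same bases as the upper DNA strand, with U in place of T
```

The slide shows only the first seven bases of the transcript (G U A A U C C) because the polymerase is drawn part-way along the template.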
Idea: measure the amount of mRNA to see which genes are being expressed in (used by) the cell.
Measuring protein might be better, but is currently harder.
Reverse transcription
Clone cDNA strands, complementary to the mRNA
mRNA
G U A A U C C U C
Reverse transcriptase
cDNA (many copies)
C A T T A G G A G
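As a small check on the diagram, the sketch below builds the cDNA from the mRNA shown on the slide. The name `reverse_transcribe` is hypothetical, and strand direction is again ignored so that the output lines up with the slide.

```python
# Pairing rules for copying mRNA into complementary DNA (cDNA).
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}

def reverse_transcribe(mrna: str) -> str:
    """Return the cDNA strand complementary to an mRNA, base by base."""
    return "".join(RNA_TO_DNA[base] for base in mrna)

mrna = "GUAAUCCUC"                 # mRNA from the slide
cdna = reverse_transcribe(mrna)
print(cdna)                        # the strand that is cloned many times
```

Under these pairing rules the cDNA has the same sequence as the DNA strand the mRNA was read from, which is why cDNA can stand in for the gene on an array.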
cDNA microarray experiments
mRNA levels compared in many different contexts
• Different tissues, same organism (brain v. liver)
• Same tissue, same organism (treatment v. control, tumor v. non-tumor)
• Same tissue, different organisms (wild-type v. knock-out, transgenic, or mutant)
• Time course experiments (effect of treatment, development)
• Other special designs (e.g. to detect spatial patterns)
• DNA microarrays represent an important new method for determining the complete expression profile of a cell.
• Monitoring gene expression lies at the heart of a wide variety of medical and biological research projects, including classifying diseases, understanding basic biological processes, and identifying new drug targets.
Affymetrix® Instrument System
Platform for GeneChip® Probe Arrays
• Integrated
• Easy to use
• Exportable
• Versatile
Photolithography
Synthesis of Ordered Oligonucleotide Arrays
[Figure: light-directed synthesis cycle. Light passes through a mask onto the substrate and deprotects the exposed sites; a nucleotide (T in the first cycle, C in the next) couples to the deprotected sites; the cycle is repeated with new masks to build up the oligonucleotides.]
Affymetrix GeneChip arrays
GeneChip® Probe Arrays
[Figure labels]
• GeneChip probe array: 1.28 cm
• Hybridized probe cell: 24 µm
• Millions of copies of a specific oligonucleotide probe
• >200,000 different complementary probes
• Single stranded, labeled RNA target
• Oligonucleotide probe
• Image of hybridized probe array
Analysis of expression level from probe sets
A single, contiguous gene set for the rat β-actin gene.
Perfect Match (PM)
Mismatch (MM) control
log(PM / MM) = difference score
All significant difference scores are averaged to create the “average difference” = expression level of the gene.
Each pixel is quantitated and integrated for each oligo feature (range 0-25,000)
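A minimal sketch of the probe-set summary described on this slide, under stated assumptions: the logarithm is taken to base 2, and "significant" is modelled by a hypothetical cutoff `min_abs_score` on the absolute difference score, since the slide does not say how significance is decided. The function name and the example intensities are made up for illustration.

```python
import math

def average_difference(pm, mm, min_abs_score=0.0):
    """Average the per-probe difference scores log2(PM / MM) of one probe set.

    pm, mm        -- intensities of the perfect-match and mismatch features
    min_abs_score -- hypothetical significance cutoff (assumption, not from the slide)
    """
    scores = [math.log2(p / m) for p, m in zip(pm, mm) if p > 0 and m > 0]
    kept = [s for s in scores if abs(s) >= min_abs_score]
    return sum(kept) / len(kept) if kept else float("nan")

# Two probe pairs with made-up intensities inside the 0-25,000 range.
print(average_difference([400, 800], [100, 100]))
```

Pairs with non-positive intensities are skipped here because the log ratio is undefined for them; a real pipeline would need an explicit rule for such probes.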
Expression screening by GeneChip
• each oligo sequence (20-25 mer) is synthesized as a 20 µm square (feature)
• each feature contains > 1 million copies of the oligo
• scanner resolution is about 2 µm (pixel)
• each gene is quantitated by 16-20 oligos and compared to an equal number of mismatched controls
• 22,000 genes are evaluated with 20 matching oligos and 10 mismatched oligos = 480,000 features/chip
• 480,000 features are photolithographically synthesized onto a 2 x 2 cm glass substrate
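The geometry in these bullets can be checked with a little arithmetic. The sketch below uses only the sizes quoted on the slide (20 µm features, 2 µm pixels, a 2 x 2 cm substrate); the variable names are illustrative.

```python
feature_um = 20          # side of one square feature, in micrometres
pixel_um = 2             # scanner pixel size, in micrometres
chip_side_cm = 2         # side of the glass substrate, in centimetres

# Each feature is covered by a 10 x 10 block of scanner pixels.
pixels_per_feature = (feature_um // pixel_um) ** 2

# 1 cm = 10,000 micrometres, so 1,000 features fit along each side.
features_per_side = chip_side_cm * 10_000 // feature_um
max_features = features_per_side ** 2

print(pixels_per_feature, features_per_side, max_features)
```

So each feature is read out as roughly 100 pixels, and a fully tiled 2 x 2 cm substrate would hold about a million 20 µm features, comfortably more than the 480,000 quoted.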
Affymetrix GeneChip arrays
• Global views of gene expression are often essential for obtaining comprehensive pictures of cell function.
• For example, it is estimated that between 0.2 and 10% of the 10,000 to 20,000 mRNA species in a typical mammalian cell are differentially expressed between cancer and normal tissues.
• Whole-genome analyses also benefit studies where the end goal is to focus on small numbers of genes, by providing an efficient tool to sort through the activities of thousands of genes, and to recognize the key players.
• In addition, monitoring multiple genes in parallel allows the identification of robust classifiers, called "signatures", of disease.
• Global analyses frequently provide insights into multiple facets of a project. A study designed to identify new disease classes, for example, may also reveal clues about the basic biology of disorders, and may suggest novel drug targets.
cDNA microarrays
• In "spotted" microarrays, slides carrying spots of target DNA are hybridized to fluorescently labeled cDNA from experimental and control cells, and the arrays are imaged at two or more wavelengths.
• Expression profiling involves the hybridization of fluorescently labeled cDNA, prepared from cellular mRNA, to microarrays carrying thousands of unique sequences.
• Typically, a set of target DNA samples representing different genes is prepared by PCR and transferred to a coated slide to form a 2-D array of spots with a center-to-center distance (pitch) of about 200 μm, providing a pan-genomic profile in an area of 3 cm² or less.
• cDNA samples from experimental and control cells are labeled with different color fluors (the cyanine dyes Cy5 and Cy3) and hybridized simultaneously to microarrays, and the relative levels of mRNA for each gene are then determined by comparing red and green signal intensities.
cDNA microarrays
Scanning Technology
• Microarray slides are imaged with a modified fluorescence microscope designed for scanning large areas at high resolution (arrayWoRx, Applied Precision, Issaquah, WA, Affymetrix).
• Fluorescence illumination is obtained from a metal halide arc lamp focused onto a fiber optic bundle, the output of which is directed at the microarray slide, and emission is recorded through a microscope objective (Nikon) onto a cooled CCD (charge-coupled device) camera.
• Interference filters are used to select the excitation and emission wavelengths corresponding to the Cy3 and Cy5 fluorescent probes (Amersham Pharmacia).
• Each image covered a 2.4 x 2.4 mm area of the slide at 5-μm resolution. To scan the entire microarray, a series of images ("panels") were acquired by moving the slide under the microscope objective in 2.4-mm increments.
Animation: http://www.bio.davidson.edu/courses/genomics/chip/chip.swf
[Figure: microarray analysis workflow]
Biological question (differentially expressed genes, sample class prediction, etc.)
→ Experimental design
→ Microarray experiment (output: 16-bit TIFF files)
→ Image analysis (output: (Rfg, Rbg), (Gfg, Gbg))
→ Normalization (output: R, G)
→ Estimation, Testing, Clustering, Discrimination
→ Biological verification and interpretation
Some statistical questions
Image analysis: addressing, segmenting, quantifying
Normalisation: within and between slides
Quality: of images, of spots, of (log) ratios
Which genes are (relatively) up/down regulated?
Assigning p-values to tests/confidence to results.
Some statistical questions, ctd
Planning of experiments: design, sample size
Discrimination and allocation of samples
Clustering, classification: of samples, of genes
Selection of genes relevant to any given analysis
Analysis of time course, factorial and other special experiments
… and much more.
Some bioinformatic questions
Connecting spots to databases, e.g. to sequence, structure, and pathway databases
Discovering short sequences regulating sets of genes: direct and inverse methods
Relating expression profiles to structure and function, e.g. protein localisation
Identifying novel biochemical or signalling pathways
… and much more.
Part of the image of one channel, false-coloured on a scale running from white (very high) and red (high) through yellow and green (medium) to blue (low) and black.
Does one size fit all?
Segmentation: limitation of the fixed circle method
[Figure panels: SRG (seeded region growing) and Fixed Circle]
Inside the boundary is spot (foreground), outside is not.
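To make the fixed circle rule concrete, here is a minimal sketch in plain Python. It assumes the image patch is a list of rows of pixel intensities and that the spot centre and radius are already known from the addressing step; the function name and the toy patch are illustrative only.

```python
from statistics import mean

def fixed_circle_segment(patch, cx, cy, radius):
    """Split pixels into foreground (inside the circle) and everything else."""
    foreground, outside = [], []
    for y, row in enumerate(patch):
        for x, value in enumerate(row):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                foreground.append(value)
            else:
                outside.append(value)
    return foreground, outside

# Toy 5 x 5 patch: a small bright spot (100) on a dim surround (10).
patch = [[10] * 5 for _ in range(5)]
for x, y in [(2, 2), (1, 2), (3, 2), (2, 1), (2, 3)]:
    patch[y][x] = 100

fg, outside = fixed_circle_segment(patch, cx=2, cy=2, radius=1)
print(mean(fg), mean(outside))
```

The limitation named in the slide title follows directly: the same radius is applied to every spot, so a spot that is larger, smaller, or not round will have pixels assigned to the wrong side of the boundary.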
Some local backgrounds
We use something different again: a smaller, less variable value.
Single channel grey scale
Quantification of expression
For each spot on the slide we calculate
Red intensity (PM) = Rfg - Rbg
Green intensity (MM) = Gfg - Gbg
(fg = foreground, bg = background)
and combine them in the log (base 2) ratio
Log2(Red intensity / Green intensity)
Log2(PM / MM)
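The calculation for a single spot can be sketched as follows. The function name `log_ratio` and the example intensities are made up; returning `None` when a background-corrected intensity is not positive is one possible convention, not something the slides prescribe.

```python
import math

def log_ratio(r_fg, r_bg, g_fg, g_bg):
    """Background-correct both channels and return log2(red / green)."""
    red = r_fg - r_bg      # Red intensity
    green = g_fg - g_bg    # Green intensity
    if red <= 0 or green <= 0:
        return None        # log ratio undefined; flag the spot instead
    return math.log2(red / green)

# Red is four times green after background correction, so the log ratio is 2.
print(log_ratio(r_fg=1100, r_bg=100, g_fg=350, g_bg=100))
```

Working in base 2 makes the scale easy to read: +1 is a two-fold increase in red relative to green, -1 a two-fold decrease, and 0 means no change.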
Gene Expression Data
On p genes for n slides: p is O(10,000), n is O(10-100), but growing.
Rows are genes, columns are slides.

| Gene | slide 1 | slide 2 | slide 3 | slide 4 | slide 5 | … |
|------|---------|---------|---------|---------|---------|---|
| 1 | 0.46 | 0.30 | 0.80 | 1.51 | 0.90 | ... |
| 2 | -0.10 | 0.49 | 0.24 | 0.06 | 0.46 | ... |
| 3 | 0.15 | 0.74 | 0.04 | 0.10 | 0.20 | ... |
| 4 | -0.45 | -1.03 | -0.79 | -0.56 | -0.32 | ... |
| 5 | -0.06 | 1.06 | 1.35 | 1.09 | -1.09 | ... |

Gene expression level of gene 5 in slide 4 = Log2(Red intensity / Green intensity) (the entry 1.09).
These values are conventionally displayed on a red (>0) yellow (0) green (<0) scale.
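The data layout can be sketched as a plain nested list, using the five genes and five slides printed on this slide. The helper names `level` and `display_colour` are illustrative; genes and slides are numbered from 1 as on the slide.

```python
# Rows are genes 1-5, columns are slides 1-5 (log2 ratios from the slide).
expression = [
    [0.46, 0.30, 0.80, 1.51, 0.90],
    [-0.10, 0.49, 0.24, 0.06, 0.46],
    [0.15, 0.74, 0.04, 0.10, 0.20],
    [-0.45, -1.03, -0.79, -0.56, -0.32],
    [-0.06, 1.06, 1.35, 1.09, -1.09],
]

def level(gene, slide):
    """Log2 ratio for a 1-based gene and slide number."""
    return expression[gene - 1][slide - 1]

def display_colour(value):
    """Conventional display colour: red above 0, green below 0, yellow at 0."""
    if value > 0:
        return "red"
    if value < 0:
        return "green"
    return "yellow"

print(level(5, 4), display_colour(level(5, 4)))
```

In a real experiment the matrix would have thousands of rows and tens to hundreds of columns, so an array library would normally replace the nested list.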
The red/green ratios can be spatially biased
• Top 2.5% of ratios red, bottom 2.5% of ratios green
Affymetrix vs. cDNA Arrays
Affy strengths:
- highly reliable: synthesized in situ
- highly reproducible from run to run
- no clone maintenance or ‘drift’
- sealed fluidics and controlled temperature
- standardized chips increase database power
- excellent scanner
- complex, but very reliable labelling
- excellent cost/benefit ratio
- amenable to mutation and SNP detection
Affymetrix weaknesses/limitations
- not easily customized: $300K/chip
- high labeling cost: $170/chip
- high per chip cost: $350 to $1850
- limited choice of species
- requires knowledge of sequence
- not designed for competitive protocols
Limitations to all microarrays
- dynamic range of gene expression: very difficult to simultaneously detect low and high abundance genes accurately
- each gene has multiple splice variants: 2 splice variants may have opposite effects (e.g. trk); arrays can be designed for splicing, but complexity increases 5X
- translational efficiency is a regulated process: mRNA level does not correlate with protein level
- proteins are modified post-translationally: glycosylation, phosphorylation, etc.
- pathogens might have little ‘genomic’ effect
Analysis
• In general the expression level of individual genes is measured by log(PM/MM) or log(R/G).
• Intensity-dependent normalization methods are preferred over global methods.
• To correct intensity- and dye-bias we used location and scale normalization methods, which are based on robust, locally linear fits (lowess).
• Global methods use linear regression models, combined with ANOVA.
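A rough sketch of intensity-dependent normalization is given below. It is a simplified stand-in for lowess, not the procedure used in the lecture: it fits one local, tricube-weighted straight line per spot and skips the robustness iterations that make lowess resistant to outliers. The names `ma_values`, `local_linear_fit`, `normalize` and the parameter `frac` are all illustrative, as is the use of A (mean log intensity) as the intensity axis.

```python
import math

def ma_values(red, green):
    """M = log2(R/G) (log ratio), A = mean log2 intensity of the two channels."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a

def local_linear_fit(x, y, frac=0.5):
    """One-pass local linear smoother with tricube weights (no robustness step)."""
    k = max(2, math.ceil(frac * len(x)))           # points in each neighbourhood
    fitted = []
    for x0 in x:
        h = sorted(abs(xi - x0) for xi in x)[k - 1] or 1e-12   # bandwidth
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def normalize(red, green, frac=0.5):
    """Subtract the intensity-dependent trend from each log ratio."""
    m, a = ma_values(red, green)
    trend = local_linear_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]

# A constant two-fold dye bias is removed: all normalized ratios are about 0.
green = [100, 200, 400, 800, 1600, 3200]
red = [2 * g for g in green]
print(normalize(red, green))
```

A global method would subtract a single constant (for example the median log ratio) from every spot; the local fit instead lets the correction change with spot intensity, which is the reason the slide prefers it.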
Normalization
Why?
To correct for systematic differences between samples on the same slide, or between slides, which do not represent true biological variation between samples.
How do we know it is necessary?
By examining self-self hybridizations, where no true differential expression is occurring.
We find dye biases which vary with overall spot intensity, location on the array, plate origin, pins, scanning parameters, …
Analysis
Pre-normalization Post-normalization
The simplest cDNA microarray data analysis problem is identifying differentially expressed genes using replicated slides
There are a number of different aspects:
• First, between-slide normalization; then
• What should we look at: averages, SDs, t-statistics, other summaries?
• How should we look at them?
• Can we make valid probability statements?
Apo AI experiment (Matt Callow, LBNL)
Goal: to identify genes with altered expression in the livers of Apo AI knock-out mice (T) compared to inbred C57BL/6 control mice (C).
• 8 treatment mice and 8 control mice
• 16 hybridizations: liver mRNA from each of the 16 mice (Ti, Ci) is labelled with Cy5, while pooled liver mRNA from the control mice (C*) is labelled with Cy3.
• Probes: ~6,000 cDNAs (genes), including 200 related to lipid metabolism.
Which genes have changed?
When permutation testing is possible
1. For each gene and each hybridisation (8 ko + 8 ctl), use M = log2(R/G).
2. For each gene form the t statistic:

$$ t = \frac{\bar{M}_{ko} - \bar{M}_{ctl}}{\sqrt{\tfrac{1}{8}\left[(SD_{ko})^2 + (SD_{ctl})^2\right]}} $$

where $\bar{M}_{ko}$ and $\bar{M}_{ctl}$ are the averages of the 8 ko and 8 ctl Ms, and $SD_{ko}$ and $SD_{ctl}$ are their standard deviations.
3. Form a histogram of 6,000 t values.
4. Do a normal q-q plot; look for values “off the line”.
5. Permutation testing (next lecture).
6. Adjust for multiple testing (next lecture).
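Steps 1 and 2 can be sketched for a single gene as below. The function name `t_statistic` and the example log ratios are made up; the code uses sample standard deviations and divides each variance by its own group size, which reduces to the factor 1/8 when both groups have 8 hybridisations.

```python
import math
import statistics

def t_statistic(ko, ctl):
    """Two-sample t statistic for one gene from its log ratios M = log2(R/G)."""
    diff = statistics.mean(ko) - statistics.mean(ctl)
    se = math.sqrt(statistics.variance(ko) / len(ko)
                   + statistics.variance(ctl) / len(ctl))
    return diff / se

# Made-up log ratios for one gene: 8 knock-out and 8 control hybridisations.
ko_ms = [1, 2, 3, 4, 5, 6, 7, 8]
ctl_ms = [0, 1, 2, 3, 4, 5, 6, 7]
print(t_statistic(ko_ms, ctl_ms))
```

Applying this function to every gene gives the roughly 6,000 t values that go into the histogram and q-q plot of steps 3 and 4. A gene whose log ratios are identical within both groups has zero standard error and would need separate handling.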
Histogram & normal q-q plot of t-statistics
ApoA1
Patterns, More Globally...
Can we identify genes with interesting patterns of expression across arrays?
Two approaches:
1. Find the genes whose expression fits specific, predefined patterns.
2. Perform cluster analysis - see what expression patterns emerge.
The 16 groups systematically arranged (6 point representation)