Introduction to Cancer Genomics
Gene expression data analysis – part I
David Gfeller
Computational Cancer Biology
Ludwig Center for Cancer Research
Overview
1. Basic understanding of RNA-Seq data processing.
2. Differential expression.
3. Dimensionality reduction.
Examples of R code
Goals
• Help you understand what can be done with a computer -> programming logic
• Give you some basic idea of how to ask the computer to perform some tasks -> syntax.
• Show you a few examples of gene expression data analysis in R that you could reuse for your projects (see also practical).
Gene expression experiments
• Microarrays: Chip with DNA probes that pair with DNA (reverse-transcribed RNA) in a sample. Intensity is measured as a light signal.
– Very popular in 2000-2010.
• RNA-Seq: Directly count how many transcripts (mRNA molecules) originate from each gene in a sample.
– Increasingly replacing microarrays for gene expression analyses.
RNA-Seq
(Figure: RNA-Seq workflow: RNA -> RNA fragmentation -> Reverse transcription -> Adaptors + amplification -> Sequencing (>100M reads: ACCTAG…, CGGTAA…, ATGGCA…, TGGGAC…, TATAGG…) -> Map to reference transcriptome (Gene A, Gene B).)
Gene expression => Quite easy (count the reads)
Gene fusion => More difficult (especially for new fusion events)
Splicing => More difficult (especially for poorly annotated isoforms)
1 - Typical output of RNA-Seq
Raw sequences:
- Fastq format (sequence of the reads + quality information)
- Bam format (compressed)
- Sra format (compressed)
Processed data:
- Counts: Number of reads mapping to each gene/transcript.
How to think about these data in a computer
(Figure: a matrix M with S samples as rows and N genes as columns.)
Sample1: gene1: 254; gene2: 1284; gene3: 7234; …
Sample2: gene1: 5; gene2: 362; gene3: 0; …
Sample3: gene1: 8902; gene2: 2199; gene3: 722; …
Each expression value corresponds to a scalar. Each sample corresponds to a vector. All samples form a matrix (M).
M[s,n] corresponds to the expression of gene “n” in sample “s”.
Computers like numbers
In R:
- Scalar (numeric)
- Vector (array)
- Matrix (multidimensional arrays, e.g. S x N)
Gene expression data are “naturally” digitalized, which makes them especially appropriate to use with computers.
Many other biological objects can be digitalized as vectors or matrices:
- Protein/DNA sequences <-> vectors of letters/numbers
- Protein structures <-> vectors/matrices of 3D coordinates
- Interactions <-> N x N matrix with 1’s and 0’s
- Image <-> matrix of pixels (1/0 for a two-color image)
- Set of measurements <-> vector of values
How to think about these data in a computer
In R, once you load your data into a matrix (M), you can very easily:
- Print one specific column: M[,2]
- Print one specific row: M[1,]
- Plot the correlation of two genes: plot(M[,5], M[,7])
- Make operations on rows or columns.
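As a minimal sketch, these commands can be tried on a toy matrix built from the three example samples shown earlier (three samples, three genes). The row and column names are added here only for readability; they are an assumption of this example, not something the real files provide in this form.

```r
# Toy expression matrix: rows = samples, columns = genes
# (values taken from the three-sample example)
M <- matrix(c( 254, 1284, 7234,
                 5,  362,    0,
              8902, 2199,  722),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("Sample1", "Sample2", "Sample3"),
                            c("gene1", "gene2", "gene3")))

M[, 2]     # one column: expression of gene2 in all samples
M[1, ]     # one row: expression of all genes in Sample1
M[3, 1]    # a single entry: gene1 in Sample3
dim(M)     # number of samples, number of genes

# Operations on rows or columns
row.totals <- rowSums(M)     # total counts per sample
col.means  <- colMeans(M)    # average expression per gene

# Scatter plot of two genes across samples
plot(M[, 1], M[, 3], xlab = "gene1", ylab = "gene3")
```

With only three samples the scatter plot is not informative; the point is only the indexing syntax.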
Let’s practice…
• Create an empty directory ‘Tutorial_Gfeller’ and a subdirectory ‘Tutorial_Gfeller/Data’.
• Download the file GSE93722_RAW.tar at:
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE93722
• Put it in Tutorial_Gfeller/Data/, uncompress it, then uncompress the zip files it contains. Each of the files corresponds to the gene expression profiling of a melanoma sample.
• Open RStudio.
• Set the working directory (Session -> Set Working Directory) to ‘Tutorial_Gfeller’.
• Create a new R script file (File -> New File -> R Script); this is where you will write your code. Save it in ‘Tutorial_Gfeller’ as ‘file’.R.
Let’s load the data…
• Each GSMxxx corresponds to one sample.
• First have a look at the files in Excel (or any text editor). To start with, we will focus on the ‘expected_count’ column.
• The command to load a file is read.delim():
m1 <- read.delim("Data/GSE93722_RAW/GSM2461003_LAU125.genes.results.txt")
Here m1 is the name of the object that will store the data, and the quoted string is the path to the file to be loaded.
• Then execute the command in the Console (pasting it or command+Enter).
– Now you can look at the elements of m1 (e.g., for the first line, type m1[1,] in the console). Does it correspond to the first line of the file?
– With dim(m1) you can check the dimensions of m1.
Let’s load the data
• Load the other files into m2 (GSM2461007_LAU1255), m3 (GSM2461009_LAU1314) and m4 (GSM2461005_LAU355).
• Build a matrix taking the fifth column in each file:
M <- matrix(nrow=4, ncol=dim(m1)[1])   # Initialize an empty matrix with the correct dimensions
M[1,] <- m1[,5]                        # In the first row, put the 5th column of m1
• Do the same with m2, m3 and m4 (e.g., M[2,] <- m2[,5]). If you had many files, you would use a loop (see exercises).
• Try to query any entry of your matrix (e.g., M[3,5]). Do you get the expected number?
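The loop over many files could look like the sketch below. To keep it runnable on its own, it first writes three small toy files into a temporary directory; their names, gene IDs and column names are made up for this example and only mimic the layout of the real files (expected counts in the 5th column). With the real data, `data.dir` would simply point to the folder containing the uncompressed files.

```r
# Hypothetical setup: write three small tab-separated toy files
data.dir <- file.path(tempdir(), "toy_RAW")
dir.create(data.dir, showWarnings = FALSE)
for (k in 1:3) {
  toy <- data.frame(gene_id = c("g1", "g2", "g3", "g4"),
                    transcript_id = "t",
                    length = 1000,
                    effective_length = 900,
                    expected_count = k * c(10, 20, 30, 40))
  write.table(toy,
              file.path(data.dir, paste0("sample", k, ".genes.results.txt")),
              sep = "\t", quote = FALSE, row.names = FALSE)
}

# List all files, then fill one row of M per file
files <- sort(list.files(data.dir, pattern = "genes.results.txt$",
                         full.names = TRUE))
first <- read.delim(files[1])
M <- matrix(nrow = length(files), ncol = nrow(first))
for (f in seq_along(files)) {
  m <- read.delim(files[f])
  # Check that every file lists the same genes in the same order
  stopifnot(identical(m[, 1], first[, 1]))
  M[f, ] <- m[, 5]
}
```

The check inside the loop is there because filling M row by row only makes sense if column n refers to the same gene in every file.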
Genes have (many) names…
• In these files, we have Ensembl gene IDs… We want to convert them to common gene names.
– We need a file with the mapping (two columns, one for Ensembl IDs, one for gene names).
• Go to: https://www.ensembl.org/biomart/martview/
– Select Ensembl Genes 94, then Human genes.
– In Attributes, select GENE: -> Gene stable ID and EXTERNAL: -> HGNC symbol.
– Click on Results, then ‘Unique results only’, and ‘Go’ to save to a local file (put the file in Tutorial_Gfeller/Data).
Then in R
• Open the file:
mapping <- read.delim("Data/mart_export.txt")
• Use the match() function to find the position in mapping of all the genes for which you have expression data in m1:
i <- match(m1[,1], mapping[,1])
• Then build a vector with the gene names:
gene <- as.character(mapping[i,2])
N <- length(gene)
• Verify that the mapping is correct by checking a few examples.
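A small sketch of what match() returns, using made-up IDs and gene names in place of the real Ensembl IDs and HGNC symbols (all identifiers below are hypothetical). It also shows what happens when an ID is absent from the mapping file.

```r
# Toy stand-ins for m1[,1] and for the BioMart export
ids.expr <- c("ENSG_toy3", "ENSG_toy1", "ENSG_toy9")   # genes with expression data
mapping  <- data.frame(id   = c("ENSG_toy1", "ENSG_toy2", "ENSG_toy3"),
                       name = c("GENE_A", "GENE_B", "GENE_C"))

i    <- match(ids.expr, mapping[, 1])   # position of each ID in the mapping
gene <- as.character(mapping[i, 2])     # gene names, in the order of ids.expr

i                    # NA marks IDs that are absent from the mapping
sum(is.na(gene))     # number of genes left without a name

# Spot check of one example, done independently of match()
mapping[mapping[, 1] == ids.expr[1], 2]
```

IDs without a match give NA in `gene`, so it is worth counting them before going further.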
Computers like simple and sequential calculations
• Additions/subtractions and multiplications/divisions
– You need to decompose any problem into a set of “simple” operations.
– You need to tell the computer about every step of your calculations (e.g., loop over all entries in one column).
Example: Find the average expression of a gene (e.g., EGFR) across samples
How to do it on a computer
(Figure: the matrix M, with the column M[,n] corresponding to gene = EGFR highlighted.)
1) Have a matrix “M” with all expression values and a vector “gene” with the names of the genes (columns of M).
2) Find the column corresponding to your gene: n <- which(gene == "EGFR")
3) Initialize a scalar: av <- 0
4) Go through each element of the column:
S <- dim(M)[1]
for(s in 1:S){ av <- av + M[s,n] }
5) Normalize your value: av <- av/S
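The five steps can be run as written on a toy matrix and compared with R's built-in mean(). The matrix values and the two extra gene names are invented for this sketch.

```r
gene <- c("GENE_A", "EGFR", "GENE_C")     # toy gene names (columns of M)
M <- matrix(c(10, 200, 3,
              20, 400, 0,
              30, 600, 9), nrow = 3, byrow = TRUE)

n <- which(gene == "EGFR")                # column of the gene of interest
S <- dim(M)[1]                            # number of samples
av <- 0
for (s in 1:S) { av <- av + M[s, n] }     # sum over the samples
av <- av / S                              # divide by the number of samples

mean(M[, n])     # built-in equivalent of the loop
colMeans(M)      # the same for all genes at once
```

If the gene name is not found, which() returns an empty vector, so checking that length(n) is 1 before the loop avoids silent errors.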
How programming languages work
• The exact commands will change between programming languages (R, python, perl, C, matlab), but the logic remains the same (“grammar”).
• Learning the syntax (“words”) can be done with many online resources.
• In these two days, we will focus on R, since it is very convenient for graphical visualization of the data.
– Many built-in functions (e.g., mean()), but important to understand the logic.
Typical output of RNA-Seq
Raw sequences:
- Fastq format (sequence of the reads + quality information)
- Bam format (compressed)
- Sra format (compressed)
Processed data:
- Counts: Number of reads mapping to each gene/transcript.
Computational analyses
• Alignments
• Isoforms (splicing)
• Low complexity regions (repeats)
• Variable regions (TCR, MHC)
• Sequencing errors
• Poorly annotated regions / genomes
(Figure: >100M reads (ACCTAG…, CGGTAA…, ATGGCA…, TGGGAC…, TATAGG…) mapped to the reference transcriptome, Gene A and Gene B.)
What else needs to be considered
• Longer genes have more reads.
If you want to compare expression between genes, you need to renormalize by gene length.
(Figure: reads mapped to Gene A and Gene B.)
• Different samples can have different total numbers of reads (e.g., different sequencing depth).
If you want to compare expression between samples, you need to renormalize by the total number of reads.
(Figure: reads mapped to Gene A and Gene B in Sample 1 and in Sample 2.)
How to do it (naïve way)
N <- dim(M)[2]
M.norm <- matrix(nrow=S, ncol=N) # Initialize an empty matrix
for( s in 1:S ){
  tot <- 0
  for (n in 1:N){
    tot <- tot+M[s,n] # Compute the sum over row s
  }
  for (n in 1:N){
    M.norm[s,n] <- M[s,n]/tot # Normalize row s
  }
}
M.norm <- M.norm*1000000 # Avoid having too small numbers
(Figure: the matrix M with example row totals 10’362’093, 12’482’546 and 7’542’733.)
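For comparison, here is a sketch of the same normalization without explicit loops, using rowSums(), on the three toy samples shown earlier. It ends with the kind of sanity check worth doing after any normalization.

```r
M <- matrix(c( 254, 1284, 7234,
                 5,  362,    0,
              8902, 2199,  722), nrow = 3, byrow = TRUE)

tot    <- rowSums(M)        # total number of reads per sample
M.norm <- M / tot * 1e6     # each row is divided by its own total

# Sanity check: every row of M.norm should now sum to one million
stopifnot(isTRUE(all.equal(rowSums(M.norm), rep(1e6, nrow(M)))))
```

Dividing a matrix by a vector whose length equals the number of rows works row by row in R, because matrices are stored column by column and the vector is recycled down each column.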
A few names commonly used
• Raw counts: Number of reads mapping to a gene.
• Scaled counts: After renormalization by the total number of counts in the sample.
• Reads Per Kilobase per Million mapped reads (RPKM): Divide by the total number of reads and then by the gene length.
• Transcripts Per Million (TPM): Divide by gene length and then normalize across all genes (i.e., the sum of TPMs of all genes is the same for all samples).
Multiply by 1’000’000 to have numbers that are easier to read.
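The difference between RPKM and TPM is only the order of the two divisions, which the sketch below makes explicit. The counts and the gene lengths (in kilobases) are invented for illustration.

```r
counts <- matrix(c(100, 200, 700,
                   300, 300, 400), nrow = 2, byrow = TRUE)   # 2 samples x 3 genes
len.kb <- c(1, 2, 5)     # hypothetical gene lengths, in kilobases

# RPKM: divide by the total number of reads (in millions), then by gene length
per.million <- rowSums(counts) / 1e6
rpkm <- sweep(counts / per.million, 2, len.kb, "/")

# TPM: divide by gene length first, then rescale each sample to sum to 1e6
rate <- sweep(counts, 2, len.kb, "/")
tpm  <- rate / rowSums(rate) * 1e6

rowSums(tpm)      # one million for every sample
rowSums(rpkm)     # generally differs between samples
tpm / rpkm        # constant within each row: a per-sample rescaling factor
```

sweep(x, 2, v, "/") divides each column of x by the corresponding element of v, which is the gene-length correction here.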
Scaled counts vs TPM vs RPKM
• TPM are increasingly used.
• The sum is always equal to 10^6 in TPM.
• Within a sample, the two values (TPM vs RPKM) are equivalent, up to a renormalizing factor.
• Scaled counts are enough to compare the same gene in different samples. TPM/RPKM are required to compare different genes.
Studying expression of some gene in two types of samples
(Figure: column M[,n] of the matrix, with the samples of group G1 in a blue box and the samples of group G2 in a red box.)
1) Define the groups: G1 <- c(1,2); G2 <- c(3,4)
2) Find the column corresponding to the gene: n <- which(gene == "CD19")
3) Take the mean over the blue box: av1 <- 0; for(s in G1) { av1 <- av1 + M.norm[s,n] }; av1 <- av1/length(G1)
4) Take the mean over the red box: av2 <- 0; for(s in G2) { av2 <- av2 + M.norm[s,n] }; av2 <- av2/length(G2)
5) Compare expression.
6) With more samples you can do statistics (t-test, boxplot,… see exercises).
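With more samples per group, step 6 could look like the following sketch. The ten expression values are invented, and the groups are larger than the two-sample groups used in the tutorial data.

```r
# Hypothetical normalized expression of one gene in 10 samples
expr <- c(12, 15, 11, 14, 13,     # samples of group 1
          25, 31, 28, 27, 30)     # samples of group 2
G1 <- 1:5
G2 <- 6:10

av1 <- mean(expr[G1])             # mean over group 1
av2 <- mean(expr[G2])             # mean over group 2

tt <- t.test(expr[G1], expr[G2])  # two-sample t-test (Welch version by default)
tt$p.value

boxplot(list(G1 = expr[G1], G2 = expr[G2]), ylab = "Expression")
```

The boxplot shows the spread within each group as well as the difference between them.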
2 - Differential expression
(Figure: expression level of a gene in two groups of samples, S1 and S2.)
How can we quantify these differences?
Differential expression
• Log fold change:
– High expression genes can show big differences in counts (10’000 to 20’000), compared to low expression genes (10 to 20), even if they experience the same relative change.
– Better to use logarithms.
– 10 -> 20 and 10’000 -> 20’000 both give a log2 fold change of 1.
– But warning: log transforms amplify changes in low expression!!! For example, going from 1 to 4 reads is a log2 fold change of 2, although it may be pure noise.
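A short sketch of these numbers in R. The helper lfc() and its pseudocount of 1 are assumptions of this example (one common way to damp the effect at low counts), not something prescribed in the slides.

```r
log2(20000 / 10000)    # high expression gene: log2 fold change of 1
log2(20 / 10)          # low expression gene: same log2 fold change

# Low counts: small absolute changes give large log fold changes
log2(4 / 1)            # 1 -> 4 reads
log2(5 / 0)            # 0 -> 5 reads gives Inf

# Hypothetical helper: add a pseudocount before taking the ratio
lfc <- function(a, b, pseudo = 1) log2((b + pseudo) / (a + pseudo))
lfc(1, 4)              # smaller than 2
lfc(0, 5)              # finite
lfc(10000, 20000)      # still very close to 1
```

The pseudocount barely changes the result for highly expressed genes but shrinks the fold change of genes with only a few reads.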
Differential expression
• P-value:
– Gives a statistical significance, but is not trivial to estimate.
(Figure: three panels showing expression levels in two groups of samples.)
Differences in the mean values are not enough!
Differential expression
• P-value:
– Gives a statistical significance, but is not trivial to estimate.
(Figure: expression level going from 1 to 2 in one case, and from 1’000 to 2’000 in the other.)
Depending on your random model, the first case may be more likely to appear “by chance”.
Advanced statistical methods have been developed to estimate P-values in RNA-Seq data!
Differential expression
• P-value:
– Gives a statistical significance, but is not trivial to estimate.
(Figure: expression levels of many genes, Gene 1 to Gene 11, in two groups of samples.)
Many genes (20’000) => many tests => higher chances that some differences are just due to chance.
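The multiple testing problem can be seen in a small simulation. This sketch assumes 20’000 genes with no true difference, two groups of 5 samples, and a plain t-test per gene (not the dedicated RNA-Seq methods); the adjustment uses the Benjamini-Hochberg method available in p.adjust().

```r
set.seed(1)          # only to make the toy example reproducible
n.genes <- 20000

# One P-value per gene, comparing two groups drawn from the SAME distribution
p <- replicate(n.genes, t.test(rnorm(5), rnorm(5))$p.value)

sum(p < 0.05)        # hundreds of genes pass the threshold by chance alone

# Adjust the P-values for multiple testing
p.adj <- p.adjust(p, method = "BH")
sum(p.adj < 0.05)    # far fewer genes remain, often none
```

This is why tools for differential expression report an adjusted P-value next to the raw one.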
Tools for differential expression
• Accurate estimation of P-values aims at considering these different issues when testing the hypothesis that the expression values come from the same distribution, or have the same mean, in two conditions.
• Consider the multiple testing problem.
Tools in R:
- edgeR
- DESeq2
(Figure: results table P with one row per gene (20’000 genes) and columns for gene, mean, log-fold change, P-value and adjusted P-value.)
How to show your results?
How to plot this in your computer?
1) Select genes with P_adj >= 0.05:
ind1 <- which( P[,5] >= 0.05 )
2) Plot these points, using the range of all genes for the axes:
plot( P[ind1, 2], P[ind1, 3], xlim=range(P[,2], finite=TRUE), ylim=range(P[,3], finite=TRUE) )
3) Select genes with P_adj < 0.05:
ind2 <- which( P[,5] < 0.05 )
4) Add these points to the same plot, in red:
points( P[ind2, 2], P[ind2, 3], col="red" ) # points() draws on the existing plot and keeps its axes
(Figure: scatter plot of all genes, with P_adj < 0.05 in red and P_adj >= 0.05 in the default color, next to the results table P with columns for gene, mean, log-fold change, P-value and adjusted P-value.)
3 - Visualizing high-dimensional data
• Each sample can be considered as a point in a very high dimensional space (N dimensions).
• In this high-dimensional space, are some samples more similar to each other?
– Replicates
– Similar cell types
– Cancer subtypes
Example in 3D (i.e. 3 genes)
(Figure: samples S1 to S5 plotted in 3D, with axes Gene 1, Gene 2 and Gene 3.)
Visually, you can see that:
- S1, S3, S4 are similar to each other.
- S2, S5 are similar to each other.
Can you quantify it?
- Distance
- Angle (correlation)
Distances - How would you do it on a computer?
(Figure: samples S1 to S5 plotted in 3D, with axes Gene 1, Gene 2 and Gene 3.)
S1 <- c(5, 6, -1)
S2 <- c(-2, 5, 3)
d12 <- 0
for(i in 1:3){
  d12 <- d12 + (S1[i]-S2[i])**2
}
d12 <- sqrt(d12)
Here we used “**” for taking the square of a number and the sqrt() function for the square root.
What if you have 20’000 genes?
• Very hard to visualize
• You can still compute distances
d12 <- 0
for(i in 1:length(S1)){
  d12 <- d12 + (S1[i]-S2[i])**2
}
d12 <- sqrt(d12)
This is a big advantage of using programming languages, compared to Excel (or manual calculations…)
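R also has built-in functions for both ways of quantifying similarity mentioned earlier, distance and correlation. The sketch below applies them to the five example points used in this part of the course; with only three genes the correlations are purely illustrative.

```r
S1 <- c(5, 6, -1)
S2 <- c(-2, 5, 3)
S3 <- c(5.5, 6.5, -1.3)
S4 <- c(4, 6.5, -0.3)
S5 <- c(-2.2, 5.3, 3.1)
mat <- rbind(S1, S2, S3, S4, S5)     # one sample per row

d12 <- sqrt(sum((S1 - S2)^2))        # same result as the loop, in one line

D <- as.matrix(dist(mat))            # all pairwise Euclidean distances at once
D["S1", "S2"]
D["S1", "S3"]                        # much smaller: S1 and S3 are close

cor(t(mat))                          # pairwise correlations between samples
```

dist() works on rows, whereas cor() works on columns, hence the transposition with t().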
Visualization
• Distances are still not very intuitive…
• If you have many points (S), the number of pairwise distances is S(S-1)/2…
• Idea: Project the data in 2D, so that it represents optimally the raw data (gene expression profiles) in the N-dimensional space.
2D projection – the good choice
(Figure: samples S1 to S5 in the 3D space of Gene 1, Gene 2 and Gene 3, and their projection in 2D onto the axes PC1 and PC2.)
2D projection – the bad choice
(Figure: samples S1 to S5 in the 3D space of Gene 1, Gene 2 and Gene 3, and their projection in 2D onto a different choice of axes PC1 and PC2.)
Principal Component Analysis (PCA)
• How to select the 2D plane on which to project the data?
(Figure: samples S1 to S5 in the 3D space of Gene 1, Gene 2 and Gene 3, with the axes PC1 and PC2 drawn.)
- Intuitive idea: Take the axes with the largest variance or dispersion (Principal Components).
- The math behind it is not simple (eigenvalue decomposition of the covariance matrix), but the procedure is the same whatever the number of genes (dimension).
- You do not need to understand the math to use it.
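For the curious, the link between PCA and the eigenvalue decomposition of the covariance matrix can be checked on the five example points. This is only a sketch to illustrate the idea; in practice prcomp() does all of it (using a numerically more stable method internally).

```r
S1 <- c(5, 6, -1)
S2 <- c(-2, 5, 3)
S3 <- c(5.5, 6.5, -1.3)
S4 <- c(4, 6.5, -0.3)
S5 <- c(-2.2, 5.3, 3.1)
mat <- rbind(S1, S2, S3, S4, S5)     # one sample per row

# "By hand": centre the data, then diagonalize the covariance matrix
centred <- scale(mat, center = TRUE, scale = FALSE)
eig  <- eigen(cov(centred))
proj <- centred %*% eig$vectors[, 1:2]   # coordinates along the first two axes

# Built-in function (centres the data by default)
pca <- prcomp(mat)

eig$values       # variance along each principal axis
pca$sdev^2       # same numbers
```

The coordinates in `proj` match `pca$x[, 1:2]` up to the sign of each axis, which is arbitrary.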
How to do it on your computer
• In R, use the function prcomp (stats package).
# Each point in space
S1 <- c(5, 6, -1)
S2 <- c(-2, 5, 3)
S3 <- c(5.5, 6.5, -1.3)
S4 <- c(4, 6.5, -0.3)
S5 <- c(-2.2, 5.3, 3.1)
# Coordinates along the x, y, z axes
x <- c(S1[1], S2[1], S3[1], S4[1], S5[1])
y <- c(S1[2], S2[2], S3[2], S4[2], S5[2])
z <- c(S1[3], S2[3], S3[3], S4[3], S5[3])
• Plot the data in 3D:
library(rgl) # you may also need to run: install.packages("rgl")
plot3d(x,y,z, xlim=c(-10,10), ylim=c(-10,10), zlim=c(-10,10))
Or:
library(scatterplot3d) # you may also need to run: install.packages("scatterplot3d")
scatterplot3d(x,y,z, xlim=c(-10,10), ylim=c(-10,10), zlim=c(-10,10))
• Make a PCA analysis:
mat <- t(matrix(c(S1, S2, S3, S4, S5), nrow=3)) # Make a matrix with each point in one row
pca = prcomp(mat)
plot(pca$x[,1], pca$x[,2])
See practical this afternoon
Putting labels on the plot
• For a better visualization, you can add margins and put labels:
plot(pca$x[,1], pca$x[,2], xlim=c(min(pca$x[,1]-1), max(pca$x[,1]+1)), ylim=c(min(pca$x[,2]-0.1), max(pca$x[,2]+0.1)))
text(x=pca$x[,1]+0.2, y=pca$x[,2]+0.05, labels =c("S1", "S2", "S3", "S4", "S5"))
Now let’s look at the tumor expression data
• Run:
pca = prcomp(M.norm)
# Plot the samples along the two first components
plot(pca$x[,1], pca$x[,2])
What do you see?
Does it make sense in light of the expression of CD19?
Principal component analysis – some discussions
- The axes with the largest variance do not necessarily reflect the structures in the data.
- In PCA, the principal components are always orthogonal (linear method).
- It is often useful to make sure the mean of the samples is at 0.
(Figure: points in the plane of Gene 1 and Gene 2, with the PC1 axis drawn.)
Many refinements/alternatives
• In PCA, only select a subset of genes (high expression, high variability,…).
• Multi-dimensional scaling (MDS). Plot the points in 2D so that distances in the original space are best preserved (R function cmdscale(), in the stats package).
• t-distributed Stochastic Neighbor Embedding (tSNE). Very popular these days (R package “tsne”).
• UMAP: Recent algorithm.
tSNE and UMAP are non-linear techniques (not a simple projection).
All these techniques are fully unsupervised: they do not need to know what your data are, which cluster you should expect,…
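As an illustration of one alternative, here is a minimal sketch of classical MDS with cmdscale() on the five example points. tSNE and UMAP are not shown because they need additional packages and many more samples to be meaningful.

```r
S1 <- c(5, 6, -1)
S2 <- c(-2, 5, 3)
S3 <- c(5.5, 6.5, -1.3)
S4 <- c(4, 6.5, -0.3)
S5 <- c(-2.2, 5.3, 3.1)
mat <- rbind(S1, S2, S3, S4, S5)     # one sample per row

D   <- dist(mat)                     # Euclidean distances in the original space
mds <- cmdscale(D, k = 2)            # 2D coordinates, one row per sample

plot(mds[, 1], mds[, 2], xlab = "MDS 1", ylab = "MDS 2")
text(mds[, 1], mds[, 2], labels = rownames(mat), pos = 3)

# For comparison: PCA on the same data
pca <- prcomp(mat)
```

With Euclidean distances, classical MDS gives the same picture as PCA up to the sign of each axis; it becomes a genuinely different method when another distance is used.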
How to choose?
• Start with PCA.
• If you know what your samples are (e.g., different cell types), you can try to play a bit with parameters (e.g., choice of genes, choice of algorithm) to have meaningful clusters.
This can go two ways:
- Find optimal parameters that best capture the signal in your data.
=> Allows you to discover new things.
- Overfit your data:
=> See only what you want to see (even if it is not there).
=> Prevents you from seeing anything new.
Where to access gene expression data
• GEO:
– Largest collection of gene expression data (microarray, RNA-Seq).
– Often has counts (not only raw data).
• ENA (European Nucleotide Archive):
– Large collection of raw RNA-Seq data (bam files).
• ArrayExpress: functional genomics data
See exercises this afternoon
Where can we access cancer gene expression data
• TCGA: large collection of tumor RNA-Seq, Exome-Seq, methylation, clinical information,…
> 10’000 patients with sequenced tumors
See exercises tomorrow
General remarks about programming
• Computers like numbers and simple operations
– Need to decompose complex tasks into simple steps.
• Learning a programming language takes time, but you do not need to know everything before starting.
– First understand the logic, then use books or online resources for the syntax.
• Data analysis takes time
– Analyzing large datasets is often more challenging than producing them…
General remarks about programming
• Many ways of making many mistakes!!!
– We all make mistakes.
– You need to check your outputs when you write code.
• If you do a normalization on matrix rows, check that the row sums are truly equal.
• If there is something incoherent in your output, always go back to find the mistakes (do not blame it on noise…), even if the data come from a bioinformatics “expert”.
General remarks about programming
• In the beginning, it is a big investment to write a script, rather than using Excel.
• But in the long run, it allows you to go much faster and quickly analyze many datasets without having to redo everything each time.
• Many analyses cannot be done in Excel, while R provides many packages that you can use.
How to get support for bioinformatics analyses of gene expression data
• Sequencing facility:
– GTF (Keith Harshman)
– Standard pipelines for normalizing and PCA
• Bioinformatics core facility (Delorenzi) or Vital-IT.
• Very specific analyses: groups working in computational biology.
Questions?