
Introduction to Cancer Genomics

Gene expression data analysis – part I

David Gfeller

Computational Cancer Biology

Ludwig Center for Cancer Research

[email protected]

Overview

1. Basic understanding of RNA-Seq data processing.

2. Differential expression.

3. Dimensionality reduction.


Examples of R code

Goals

• Help you understand what can be done with a computer -> programming logic

• Give you some basic idea of how to ask the computer to perform some tasks -> syntax.

• Show you a few examples of gene expression data analysis in R that you could reuse for your projects (see also practical).


Gene expression experiments

• Microarrays: Chip with DNA probes that will pair with DNA (reverse-transcribed RNA) in a sample. Intensity is measured as a light signal.
  – Very popular between 2000 and 2010.

• RNA-Seq: Directly count how many transcripts (mRNA molecules) originate from each gene in a sample.
  – Increasingly replacing microarrays for gene expression analyses.

RNA-Seq

Gene expression => Quite easy (count the reads)

Gene fusion => More difficult (especially for new fusion events)

Splicing => More difficult (especially for poorly annotated isoforms)

[Figure: RNA-Seq workflow. RNA -> RNA fragmentation -> Reverse transcription -> Adaptors + amplification -> Sequencing (>100M reads: ACCTAG…, CGGTAA…, ATGGCA…, TGGGAC…, TATAGG…) -> Map to reference transcriptome (Gene A, Gene B).]

1 - Typical output of RNA-Seq

Raw sequences:

- Fastq format (sequence of the reads + quality information)

- Bam format (compressed)

- Sra format (compressed)

Processed data:

- Counts: Number of reads mapping to each gene/transcript.


How to think about these data in a computer

[Figure: expression matrix M with S samples (rows) and N genes (columns).]

Sample1: gene1: 254; gene2: 1284; gene3: 7234; …

Each expression value corresponds to a scalar. Each sample corresponds to a vector. All samples form a matrix (M).

M[s,n] corresponds to the expression of gene “n” in sample “s”.

Computers like numbers

In R:
- Scalar (numeric)
- Vector (array)
- Matrix (multidimensional array, e.g. S x N)

Gene expression data are “naturally” numerical, which makes them especially well suited to analysis with computers.

Many other biological objects can be digitized as vectors or matrices:

- Protein/DNA sequences <-> vectors of letters/numbers
- Protein structures <-> vectors/matrices of 3D coordinates
- Interactions <-> N x N matrix with 1’s and 0’s
- Image <-> matrix of pixels (1/0 for a two-color image)
- Set of measurements <-> vector of values

How to think about these data in a computer

In R, once you load your data into a matrix (M), you can very easily:
- Print one specific column: M[,2]
- Print one specific row: M[1,]
- Plot the correlation of two genes: plot(M[,5], M[,7])
- Perform operations on rows or columns.
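The indexing operations listed above can be tried on a small toy matrix before loading any real data. This is only an illustrative sketch: the matrix M_toy, its sample and gene names, and all values are made up for the example.

```r
# Toy expression matrix: 3 samples (rows) x 4 genes (columns); values are made up.
M_toy <- matrix(c(254, 1284, 7234, 12,
                  310,  980, 6500, 30,
                  120, 1500, 8100,  5),
                nrow = 3, byrow = TRUE)
rownames(M_toy) <- c("Sample1", "Sample2", "Sample3")
colnames(M_toy) <- c("gene1", "gene2", "gene3", "gene4")

M_toy[, 2]    # expression of gene2 in all samples (one column)
M_toy[1, ]    # expression of all genes in Sample1 (one row)
M_toy[3, 2]   # a single value: gene2 in Sample3

col_means <- colMeans(M_toy)   # operation on columns: mean of each gene
row_sums  <- rowSums(M_toy)    # operation on rows: total counts per sample
```

Naming the rows and columns is optional, but it makes printed output easier to check by eye.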

Let’s practice…

• Create an empty directory ‘Tutorial_Gfeller’ and, inside it, a subdirectory ‘Tutorial_Gfeller/Data’.

• Download the file GSE93722_RAW.tar at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE93722

• Put it in Tutorial_Gfeller/Data/, uncompress it, and then uncompress the zip files it contains. Each of the files corresponds to the gene expression profiling of a melanoma sample.

• Open RStudio.

• Set the working directory (Session -> Set Working Directory) to ‘Tutorial_Gfeller’.

• Create a new R script file (File -> New File -> R Script); this is where you will write your code. Save it in ‘Tutorial_Gfeller’ as ‘file’.R.

Let’s load the data…

• Each GSMxxx file corresponds to one sample.

• First have a look at the files in Excel (or any text editor). To start with, we will focus on the ‘expected_count’ column.

• The command to load a file is read.delim():

m1 <- read.delim("Data/GSE93722_RAW/GSM2461003_LAU125.genes.results.txt")

Here m1 is the name of the object that will store the data, and the text in quotes is the path to the file to be loaded.

• Then execute the command in the Console (by pasting it, or with Command+Enter from the script).
  – Now you can look at the elements of m1 (e.g., for the first line, type m1[1,] in the console). Does it correspond to the first line of the file?
  – With dim(m1) you can check the dimensions of m1.

Let’s load the data

• Load the other files into m2 (GSM2461007_LAU1255), m3 (GSM2461009_LAU1314) and m4 (GSM2461005_LAU355).

• Build a matrix taking the fifth column in each file:

M <- matrix(nrow=4, ncol=dim(m1)[1])   # Initialize an empty matrix with the correct dimensions
M[1,] <- m1[,5]                        # In the first line, put the 5th column of m1

• Do the same with m2, m3 and m4 (M[2,] <- m2[,5], and so on). If you had many files, you would use a loop (see exercises).

• Try to query any entry of your matrix (e.g., M[3,5]). Do you get the expected number?
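With many files, the same steps can be written as a loop. The sketch below is self-contained: it first writes three tiny fake result files to a temporary directory, so the directory, file names, gene IDs and counts are all made up. With the tutorial data, data_dir would instead be "Data/GSE93722_RAW" and the pattern would have to match the real file names.

```r
# Write three tiny fake result files (made-up IDs and counts) to a temporary directory.
data_dir <- file.path(tempdir(), "toy_RAW")
dir.create(data_dir, showWarnings = FALSE)
for (k in 1:3) {
  toy <- data.frame(gene_id = c("ENSG_A", "ENSG_B", "ENSG_C"),
                    expected_count = c(10, 200, 3000) * k)
  write.table(toy, file.path(data_dir, paste0("sample", k, ".genes.results.txt")),
              sep = "\t", quote = FALSE, row.names = FALSE)
}

# List the files, then fill one row of the matrix per file.
files <- sort(list.files(data_dir, pattern = "genes.results.txt$", full.names = TRUE))
first <- read.delim(files[1])
M_loop <- matrix(nrow = length(files), ncol = nrow(first))
for (s in seq_along(files)) {
  m <- read.delim(files[s])
  M_loop[s, ] <- m[, "expected_count"]   # select the column by name rather than by position
}
```

Selecting the column by name is a design choice of this sketch: it keeps working if the column order differs between files, whereas m[,5] silently picks up whatever is in fifth position.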

Genes have (many) names…

• In these files, we have Ensembl gene IDs… We want to convert them to common gene names.
  – We need a file with the mapping (two columns, one for Ensembl IDs, one for gene names).

• Go to: https://www.ensembl.org/biomart/martview/
  – Select Ensembl Genes 94, then Human genes.
  – In Attributes, select GENE: -> Gene stable ID and EXTERNAL: -> HGNC symbol.
  – Click on Results, then ‘Unique results only’, and ‘Go’ to save to a local file (put the file in Tutorial_Gfeller/Data).

Then in R

• Open the file:

mapping <- read.delim("Data/mart_export.txt")

• Use the match() function to find the position in mapping of all the genes for which you have expression data in m1:

i <- match(m1[,1], mapping[,1])

• Then build a vector with the gene names:

gene <- as.character(mapping[i,2])
N <- length(gene)

• Verify that the mapping is correct by checking a few examples.
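To see what match() returns, it can help to run it on a tiny example first. The IDs and gene symbols below are made up for illustration; the point is that match() gives, for each ID in the first vector, its position in the second, and NA when the ID is absent.

```r
# Made-up IDs and symbols, for illustration only.
expr_ids <- c("ID_3", "ID_1", "ID_9")                 # IDs found in the expression file
mapping_toy <- data.frame(id     = c("ID_1", "ID_2", "ID_3"),
                          symbol = c("GENE_A", "GENE_B", "GENE_C"),
                          stringsAsFactors = FALSE)

i_toy <- match(expr_ids, mapping_toy$id)   # position of each expression ID in the mapping
gene_toy <- mapping_toy$symbol[i_toy]      # gene names, in the same order as expr_ids
n_unmapped <- sum(is.na(i_toy))            # IDs absent from the mapping give NA
```

Counting the NA values is a quick way to check how many genes could not be mapped before going further.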

Computers like simple and sequential calculations

• Additions/subtractions and multiplications/divisions
  – You need to decompose any problem into a set of “simple” operations.
  – You need to tell the computer about every step of your calculations (e.g., loop over all entries in one column).

Example: Find the average expression of a gene (e.g., EGFR) across samples

How to do it on a computer

1) Have a matrix “M” with all expression values and a vector “gene” with the names of the genes (columns of M).

2) Find the column corresponding to your gene: n <- which(gene == "EGFR")

3) Initialize a scalar: av <- 0

4) Go through each element of the column:

S <- dim(M)[1]
for(s in 1:S){ av <- av + M[s,n] }

5) Normalize your value: av <- av/S

[Figure: matrix M with the column M[,n] for gene = EGFR highlighted.]
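The five steps above can be run end to end on toy data and compared with the built-in mean() function. The matrix M_toy and the vector gene_toy are made up for this sketch; only "EGFR" is taken from the example.

```r
# Toy data (made-up values): 4 samples x 3 genes.
M_toy <- matrix(c(5, 100, 20,
                  7, 120, 22,
                  6,  90, 18,
                  6, 110, 20), nrow = 4, byrow = TRUE)
gene_toy <- c("GENE_A", "EGFR", "GENE_C")

n <- which(gene_toy == "EGFR")   # column of the gene of interest

# Explicit loop, as in the steps above
S <- nrow(M_toy)
av <- 0
for (s in 1:S) { av <- av + M_toy[s, n] }
av <- av / S

# Built-in equivalent
av_builtin <- mean(M_toy[, n])
```

Both approaches give the same number. Note that which() returns an empty vector if the gene name is absent, and several positions if it appears more than once, so it is worth checking length(n) on real data.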

How programming languages work

• The exact commands will change between programming languages (R, Python, Perl, C, MATLAB), but the logic remains the same (“grammar”).

• Learning the syntax (“words”) can be done with many online resources.

• In these two days, we will focus on R, since it is very convenient for graphical visualization of the data.

– Many built-in functions (e.g., mean()), but important to understand the logic.



Computational analyses

• Alignments
• Isoforms (splicing)
• Low complexity regions (repeats)
• Variable regions (TCR, MHC)
• Sequencing errors
• Poorly annotated regions / genomes

[Figure: >100M reads (ACCTAG…, CGGTAA…, ATGGCA…, TGGGAC…, TATAGG…) mapped to the reference transcriptome (Gene A, Gene B).]

What else needs to be considered

• Longer genes have more reads.

[Figure: reads mapped to Gene A and Gene B.]

If you want to compare expression between genes, you need to renormalize by gene length.

• Different samples can have different total numbers of reads (e.g., different sequencing depth).

[Figure: reads mapped to Gene A and Gene B in Sample 1 and in Sample 2.]

If you want to compare expression between samples, you need to renormalize by the total number of reads.

How to do it (naïve way)

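S <- dim(M)[1]                      # Number of samples (rows)
N <- dim(M)[2]                      # Number of genes (columns)
M.norm <- matrix(nrow=S, ncol=N)    # Initialize an empty matrix
for (s in 1:S) {
  tot <- 0
  for (n in 1:N) {
    tot <- tot + M[s,n]             # Compute the sum over row s
  }
  for (n in 1:N) {
    M.norm[s,n] <- M[s,n]/tot       # Normalize row s
  }
}
M.norm <- M.norm*1000000            # Avoid having too small numbers

[Figure: matrix M with the total number of reads of each sample (row sums), e.g. 10’362’093, 12’482’546, 7’542’733.]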
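The same normalization can be written without explicit loops, using the built-in rowSums() function. This is a sketch on a made-up count matrix (M_toy), not a replacement for understanding the loop version.

```r
# Toy count matrix (made-up values): 3 samples x 4 genes.
M_toy <- matrix(c(10, 20, 30,  40,
                   1,  2,  3,   4,
                  50, 50, 50, 350), nrow = 3, byrow = TRUE)

# Dividing a matrix by a vector of length nrow(M_toy) divides each row s by
# the s-th element, because R recycles the vector down the columns.
M_norm_toy <- M_toy / rowSums(M_toy) * 1e6

# Sanity check: every row should now sum to one million.
row_totals <- rowSums(M_norm_toy)
```

In this toy example the first two samples have identical normalized profiles, even though the first was "sequenced" ten times deeper, which is the purpose of the normalization.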

A few names commonly used

• Raw counts: Number of reads mapping to a gene.

• Scaled counts: After renormalization by the total number of counts in the sample.

• Reads Per Kilobase per Million mapped reads (RPKM): Divide by the total number of reads and then by the gene length.

• Transcripts Per Million (TPM): Divide by gene length and then normalize across all genes (i.e. the sum of TPMs of all genes is the same for all samples).

Multiply by 1’000’000 to have numbers that are easier to read.
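As a worked example of these definitions, the sketch below computes RPKM and TPM for three genes in a single sample. The counts and gene lengths are made up, and gene length is used directly (real pipelines may use an effective length instead).

```r
# Made-up counts for 3 genes in one sample, and made-up gene lengths in kilobases.
counts    <- c(geneA = 100, geneB = 300, geneC = 600)
length_kb <- c(geneA = 1,   geneB = 3,   geneC = 2)

# RPKM: scale by sequencing depth (in millions of reads), then by gene length.
rpkm <- counts / (sum(counts) / 1e6) / length_kb

# TPM: scale by gene length first, then normalize so that the values sum to 1e6.
rate <- counts / length_kb
tpm  <- rate / sum(rate) * 1e6

# Within one sample, TPM is RPKM rescaled to sum to one million.
tpm_from_rpkm <- rpkm / sum(rpkm) * 1e6
```

Here geneA and geneB get the same RPKM and the same TPM: geneB has three times more reads, but it is also three times longer.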

Scaled counts vs TPM vs RPKM

• TPM are increasingly used.

• The sum is always equal to 10^6 in TPM.

• Within a sample, the two values (TPM vs RPKM) are equivalent, up to a renormalizing factor.

• Scaled counts are enough to compare the same gene in different samples. TPM/RPKM are required to compare different genes.

Studying expression of some gene in two types of samples

1) Define the groups: G1 <- c(1,2); G2 <- c(3,4)

2) Find the column corresponding to the gene: n <- which(gene == "CD19")

3) Take the mean over the first group (blue box): av1 <- 0; for(s in G1) { av1 <- av1 + M.norm[s,n] }; av1 <- av1/length(G1)

4) Take the mean over the second group (red box): av2 <- 0; for(s in G2) { av2 <- av2 + M.norm[s,n] }; av2 <- av2/length(G2)

5) Compare expression.

6) With more samples you can do statistics (t-test, boxplot,… see exercises).

[Figure: column M[,n] of the matrix, with the samples of group G1 in a blue box and the samples of group G2 in a red box.]
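With more samples per group, the comparison in steps 3 to 6 can be completed with a statistical test. The sketch below uses made-up expression values for one gene in ten samples and the built-in t.test() function; it is only meant to show the mechanics, not a recommended analysis for RNA-Seq counts.

```r
# Made-up normalized expression of one gene in 10 samples.
expr <- c(12, 15, 11, 14, 13,    # samples 1-5  (group 1)
          25, 30, 27, 29, 26)    # samples 6-10 (group 2)
G1 <- 1:5
G2 <- 6:10

av1 <- mean(expr[G1])
av2 <- mean(expr[G2])

# Welch two-sample t-test (the default of t.test): is the difference in means
# larger than expected from the spread within each group?
tt <- t.test(expr[G1], expr[G2])
p_value <- tt$p.value

# boxplot(list(G1 = expr[G1], G2 = expr[G2]))   # visual comparison of the two groups
```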

2 - Differential expression

[Figure: expression level (y axis) of genes in samples S1 and S2 (x axis).]

How can we quantify these differences?

Differential expression

• Log fold change:
  – High expression genes can show big differences in counts (10’000 to 20’000), compared to low expression genes (10 to 20), even if they experience the same relative change.
  – Better to use logarithms.
  – 10 -> 20 gives a log2 fold change of 1, the same as 10’000 -> 20’000.
  – But warning: log transforms amplify changes in low expression!!!
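The points above can be checked numerically. In the sketch below all expression values are made up, and the pseudocount of 1 is an arbitrary choice used here only to illustrate one common way of damping fold changes at very low expression.

```r
# Made-up expression values in condition 1 and condition 2.
before <- c(low = 10, high = 10000, very_low = 1)
after  <- c(low = 20, high = 20000, very_low = 4)

# Same relative change (x2) gives the same log2 fold change of 1,
# whether the gene is lowly or highly expressed.
lfc <- log2(after / before)

# A change from 1 to 4 reads gives a log2 fold change of 2, although it may be noise.
# Adding a pseudocount (here 1, an arbitrary choice) damps such fold changes
# and avoids taking log2(0).
lfc_pseudo <- log2((after + 1) / (before + 1))
```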

Differential expression

• P-value:
  – Gives a statistical significance, but is not trivial to estimate.

[Figure: three panels showing expression levels in two groups of samples.]

Differences in the mean values are not enough!


[Figure: expression level going from 1 to 2 in a first case, and from 1’000 to 2’000 in a second case.]

Depending on your random model, the first case may be more likely to appear “by chance”.



Advanced statistical methods have been developed to estimate P-values in RNA-Seq data!



[Figure: expression levels of many genes (Gene 1 to Gene 11) in two conditions.]

Many genes (20’000) => many tests => higher chance that some of the differences are just due to chance.

Tools for differential expression

• Accurate estimation of P-values aims at taking these different issues into account when testing the hypothesis that the expression values come from the same distribution, or have the same mean, in two conditions.

• Consider the multiple testing problem.

Tools in R:
- edgeR
- DESeq2

[Figure: output table P with one row per gene (20’000 genes) and columns including gene, log-fold change, mean, P-value and adjusted P-value.]
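The multiple testing problem can be illustrated with a small simulation in base R. This sketch does not use edgeR or DESeq2, which rely on their own statistical models; it only shows what happens to 20’000 P-values when no gene is truly different, and how an adjustment (here Benjamini-Hochberg, via the built-in p.adjust() function) changes the picture. The simulated P-values are an assumption of the example.

```r
set.seed(1)   # only so that the simulated numbers are reproducible

# Simulate P-values for 20000 genes with no real difference between
# conditions: under the null hypothesis, P-values are uniform on [0, 1].
p_null <- runif(20000)

n_raw <- sum(p_null < 0.05)   # about 5% of the genes pass by chance alone

# Benjamini-Hochberg adjustment (base R, stats package)
p_adj <- p.adjust(p_null, method = "BH")
n_adj <- sum(p_adj < 0.05)    # number of genes still passing after adjustment
```

Comparing n_raw and n_adj shows why the adjusted P-value, rather than the raw one, is used to select genes.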

How to show your results?

How to plot this in your computer?

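1) Select genes with P_adj >= 0.05:

ind1 <- which( P[,5] >= 0.05 )

2) Plot these points, fixing the axis ranges so that all genes fit in the graph:

plot( P[ind1, 2], P[ind1, 3], xlim=range(P[,2], na.rm=TRUE), ylim=range(P[,3], na.rm=TRUE) )

3) Select genes with P_adj < 0.05:

ind2 <- which( P[,5] < 0.05 )

4) Add these points in red to the same graph:

points( P[ind2, 2], P[ind2, 3], col="red" )

[Figure: resulting scatter plot, with genes at P_adj >= 0.05 in black and genes at P_adj < 0.05 in red. P is the results table, with one row per gene and columns including gene, log-fold change, mean, P-value and adjusted P-value.]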

3 - Visualizing high-dimensional data

• Each sample can be considered as a point in a very high dimensional space (N dimensions).

• In this high-dimensional space, are some samples more similar to each other?
  – Replicates
  – Similar cell types
  – Cancer subtypes

Example in 3D (i.e. 3 genes)

[Figure: samples S1 to S5 plotted in 3D, with axes Gene 1, Gene 2 and Gene 3.]

Visually, you can see that:
- S1, S3, S4 are similar to each other.
- S2, S5 are similar to each other.

Can you quantify it?
- Distance
- Angle (correlation)

Distances - How would you do it on a computer?

[Figure: samples S1 to S5 plotted in 3D, with axes Gene 1, Gene 2 and Gene 3.]

S1 <- c(5, 6, -1)
S2 <- c(-2, 5, 3)

d12 <- 0
for(i in 1:3){
  d12 <- d12 + (S1[i]-S2[i])**2
}
d12 <- sqrt(d12)

Here we used “**” for taking the square of a number (“^” is the more common equivalent in R) and the sqrt() function for the square root.

What if you have 20’000 genes?

• Very hard to visualize
• You can still compute distances

d12 <- 0
for(i in 1:length(S1)){
  d12 <- d12 + (S1[i]-S2[i])**2
}
d12 <- sqrt(d12)

This is a big advantage of using programming languages, compared to Excel (or manual calculations…)
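Once the logic of the loop is clear, the same distance can be obtained with vectorized operations or with the built-in dist() function. The sketch below reuses the two points S1 and S2 from the slide; the names pts, d_all and r12 are introduced here for illustration.

```r
S1 <- c(5, 6, -1)
S2 <- c(-2, 5, 3)

# Vectorized Euclidean distance: no explicit loop, works for any vector length
d12 <- sqrt(sum((S1 - S2)^2))

# dist() computes all pairwise distances between the rows of a matrix
pts <- rbind(S1 = S1, S2 = S2)
d_all <- as.matrix(dist(pts))

# Pearson correlation between the two profiles, a similarity measure
# related to the angle between them
r12 <- cor(S1, S2)
```

With S samples stored as the rows of a matrix, dist() returns all S(S-1)/2 pairwise distances in one call.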

Visualization

• Distances are still not very intuitive…

• If you have many points (S), the number of pairwise distances is S(S-1)/2…

• Idea: Project the data in 2D, so that it optimally represents the raw data (gene expression profiles) in the N-dimensional space.

2D projection – the good choice

[Figure: samples S1 to S5 in 3D (axes Gene 1, Gene 2, Gene 3) with a projection plane (PC1, PC2), and the resulting positions of S1 to S5 in 2D.]

2D projection – the bad choice

[Figure: samples S1 to S5 in 3D (axes Gene 1, Gene 2, Gene 3) with a different projection plane (PC1, PC2), and the resulting positions of S1 to S5 in 2D.]

Principal Component Analysis (PCA)

• How to select the 2D plane on which to project the data?

[Figure: samples S1 to S5 in 3D (axes Gene 1, Gene 2, Gene 3) with the axes PC1 and PC2.]

- Intuitive idea: Take the axes with the largest variance or dispersion (Principal Components).

- The math behind it is not simple (eigenvalue decomposition of the covariance matrix), but the approach is the same whatever the number of genes (dimension).

- You do not need to understand the math to use it.

How to do it on your computer

• In R, use the function prcomp (stats package).

# Each point in space
S1 <- c(5, 6, -1)
S2 <- c(-2, 5, 3)
S3 <- c(5.5, 6.5, -1.3)
S4 <- c(4, 6.5, -0.3)
S5 <- c(-2.2, 5.3, 3.1)

# Coordinates along x, y, z axes
x <- c(S1[1], S2[1], S3[1], S4[1], S5[1])
y <- c(S1[2], S2[2], S3[2], S4[2], S5[2])
z <- c(S1[3], S2[3], S3[3], S4[3], S5[3])

• Plot the data in 3D

library(rgl) # you may also need to run: install.packages("rgl")
plot3d(x,y,z, xlim=c(-10,10), ylim=c(-10,10), zlim=c(-10,10))

Or

library(scatterplot3d) # you may also need to run: install.packages("scatterplot3d")
scatterplot3d(x,y,z, xlim=c(-10,10), ylim=c(-10,10), zlim=c(-10,10))

• Make a PCA analysis

# Make a matrix with each point in one line
mat <- t(matrix(c(S1, S2, S3, S4, S5), nrow=3))
pca <- prcomp(mat)
plot(pca$x[,1], pca$x[,2])

See practical this afternoon
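For readers curious about the link between prcomp and the eigenvalue decomposition of the covariance matrix, the sketch below checks it numerically on the same five toy points. It is an illustration only; the names eig, pc_var, axes_match and var_explained are introduced here.

```r
# The five toy samples (one per row), three "genes" (columns)
mat <- rbind(S1 = c(5, 6, -1),
             S2 = c(-2, 5, 3),
             S3 = c(5.5, 6.5, -1.3),
             S4 = c(4, 6.5, -0.3),
             S5 = c(-2.2, 5.3, 3.1))

pca <- prcomp(mat)        # centers each column by default
eig <- eigen(cov(mat))    # eigenvalue decomposition of the covariance matrix

# The variances along the principal components are the eigenvalues...
pc_var <- pca$sdev^2
# ...and the principal axes are the eigenvectors (up to an arbitrary sign).
axes_match <- isTRUE(all.equal(abs(unname(pca$rotation)), abs(eig$vectors)))

# Fraction of the total variance captured by each component
var_explained <- pc_var / sum(pc_var)
```

Looking at var_explained is a simple way to judge how much of the data a 2D plot of the first two components actually represents.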

Putting labels on the plot

• For a better visualization, you can add margins and put labels:


plot(pca$x[,1], pca$x[,2], xlim=c(min(pca$x[,1]-1), max(pca$x[,1]+1)), ylim=c(min(pca$x[,2]-0.1), max(pca$x[,2]+0.1)))

text(x=pca$x[,1]+0.2, y=pca$x[,2]+0.05, labels =c("S1", "S2", "S3", "S4", "S5"))

Now let’s look at the tumor expression data

• Run:

pca <- prcomp(M.norm)
# Plot the samples along the first two components
plot(pca$x[,1], pca$x[,2])

What do you see?

Does it make sense in light of the expression of CD19?

Principal component analysis – some discussions

- The axes with the largest variance do not necessarily reflect the structures in the data.

- In PCA, the principal components are always orthogonal (linear method).

- It is often useful to make sure the mean of the samples is at 0.

[Figure: 2D example with axes Gene 1 and Gene 2, showing the direction of PC1.]

Many refinements/alternatives

• In PCA, only select a subset of genes (high expression, high variability,…).

• Multi-dimensional scaling (MDS). Plot the points in 2D so that distances in the original space are best preserved (R function cmdscale(), stats package).

• t-distributed Stochastic Neighbor Embedding (t-SNE). Very popular these days (R package “tsne”).

• UMAP: Recent algorithm.

Non-linear techniques (not a simple projection).

All these techniques are fully unsupervised: they do not need to know what your data are, which clusters you should expect,…
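As an example of one of these alternatives, classical multi-dimensional scaling can be run with the built-in cmdscale() function. The sketch below applies it to the five toy points used for PCA; the names d, mds, d2 and max_error are introduced here, and the check on distances is just one simple way to see how faithful the 2D picture is.

```r
mat <- rbind(S1 = c(5, 6, -1),
             S2 = c(-2, 5, 3),
             S3 = c(5.5, 6.5, -1.3),
             S4 = c(4, 6.5, -0.3),
             S5 = c(-2.2, 5.3, 3.1))

d <- dist(mat)               # pairwise Euclidean distances in the original 3D space
mds <- cmdscale(d, k = 2)    # 2D coordinates that try to preserve these distances

# How well are the original distances preserved in 2D?
d2 <- dist(mds)
max_error <- max(abs(d - d2))

# plot(mds[, 1], mds[, 2]); text(mds[, 1], mds[, 2], labels = rownames(mat), pos = 3)
```

The commented-out last line draws the 2D map with sample labels when run interactively.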

How to choose?

• Start with PCA.

• If you know what your samples are (e.g., different cell types), you can try to play a bit with parameters (e.g., choice of genes, choice of algorithm) to have meaningful clusters.

Find optimal parameters that best capture the signal in your data.
=> Allows you to discover new things

Overfit your data:
=> See only what you want to see (even if it is not there).
=> Prevents you from seeing anything new

Where to access gene expression data

• GEO:
  – Largest collection of gene expression data (microarray, RNA-Seq).
  – Often has counts (not only raw data).

• ENA (European Nucleotide Archive):
  – Large collection of raw RNA-Seq data (bam files).

• ArrayExpress: functional genomics data

See exercises this afternoon

Where can we access cancer gene expression data

• TCGA: large collection of tumor RNA-Seq, Exome-Seq, methylation, clinical information,…

> 10’000 patients with sequenced tumors

See exercises tomorrow

General remarks about programming

• Computers like numbers and simple operations
  – Need to decompose complex tasks into simple steps.

• Learning a programming language takes time, but you do not need to know everything before starting.
  – First understand the logic, then use books or online resources for the syntax.

• Data analysis takes time
  – Analyzing large datasets is often more challenging than producing them…


• Many ways of making many mistakes!!!
  – We all make mistakes.
  – You need to check your outputs when you write code.
    • If you do a normalization on matrix rows, check that the row sums are truly equal.

• If there is something incoherent in your output, always go back to find the mistake (do not attribute it to noise…), even if the data come from a bioinformatics “expert”.


• In the beginning, it is a big investment to write a script, rather than using Excel.

• But in the long run, it allows you to go much faster and quickly analyze many datasets without having to redo everything each time.

• Many analyses cannot be done in Excel, while R provides many packages that you can use.

How to get support for bioinformatics analyses of gene expression data

• Sequencing facility:
  – GTF (Keith Harshman)
  – Standard pipelines for normalizing and PCA

• Bioinformatics core facility (Delorenzi) or Vital-IT.

• Very specific analyses: groups working in computational biology.

Questions?
