Workshop
CyTOF workflow: differential discovery in high-throughput high-dimensional cytometry datasets
Malgorzata Nowicka, University of Zurich
BioC 2017: Where Software and Biology Connect, Boston, 28 July 2017
Have you analyzed CyTOF or flow cytometry data before?
My approach to this workshop
• This is a shorter version of the original workflow
• The text is essentially the same, so there is no need to focus on it now; you can read it carefully later
• Ideally, I would do live coding, but because of the time constraints, I will:
– explain all the code in the vignette (html file)
– copy-paste it into R, to follow along with you and to be able to modify things if needed
Introduction
CyTOF (mass cytometry) experiment
• Measuring ~40 parameters (metals)
• Hundreds of cells per second
Figure adapted from Bendall et al. Science 2011
Introduction
• "High-dimensional" cytometry because the number of measured parameters is higher than before (from ~12 to ~40)
• We refer to this differential approach as “classic”
– Common strategy:
• identify cell populations of interest by manual gating or automated clustering
• determine which of the cell subpopulations or protein markers are associated with a phenotype of interest using statistical tests
• Other approaches:
– Citrus (Bruggner et al. 2014)
– CellCnn (Arvaniti and Claassen 2016)
• Hybrid approaches (thanks to the modularity)
• This workflow can serve as a template
Overview
1. Data description
2. Data pre-processing (CATALYST package)
3. Data import (flowCore package)
4. Data transformation: arcsinh with cofactor 5
5. Spot checks
– MDS plots (limma package)
6. Cell clustering with the FlowSOM and ConsensusClusterPlus packages
– overclustering + manual merging
7. Visualization of the clusters with
– t-SNE (Rtsne package)
– heatmaps (pheatmap package)
– ggplot2 package
8. Differential analysis with linear mixed models (lme4 and multcomp packages)
– differential cell population abundance
– differential marker expression
Data description
• Subset of data from the Bodenmiller et al. 2012 study, which used mass cytometry to measure PBMCs
– from 8 different patients
– after 11 different stimulation conditions as well as in an unstimulated "Reference" state
• Here, we use the samples that correspond to the BCR-crosslinking and Reference conditions
• For each sample, 10 cell surface markers and 14 signaling markers were measured
Data pre-processing
• The pre-processing of the Bodenmiller data, as available for download, included removal of debris and de-barcoding
• In general, pre-processing steps may involve:
– normalization using bead standards
– de-barcoding
– compensation
• CATALYST package
Data import
• All the data is available on the Robinson Lab server: http://imlspenticton.uzh.ch/robinson_lab/cytofWorkflow/
• Downloaded with the download.file() function
– Metadata
– FCS files
– Panel file
– Files with cluster merging
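Data import: expr
[Diagram: the expression matrix expr has one row per cell (c1, c2, …), with cells grouped by sample (S1, S2, S3, …), and one column per marker (M1, M2, M3, …).]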
Data transformation
• Skewed distributions
– Hard to distinguish between positive and negative populations
• "Magic" transformation: arcsinh (inverse hyperbolic sine) with cofactor 5
– approximately linear for low expression,
– approximately logarithmic for high expression,
– works for negative and zero values (they do not have to be excluded from the analysis)
[Figure: mass cytometry data on a linear scale vs. after the arcsinh transformation with cofactor 5. Figure adapted from Bendall et al. Science 2011]
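The transformation can be illustrated with a few numbers. The following is a minimal sketch in Python (standard library only), not the workflow's R code; the raw intensities are hypothetical and `arcsinh_transform` is a helper name introduced here for illustration.

```python
import math

COFACTOR = 5  # cofactor named on the slide


def arcsinh_transform(values, cofactor=COFACTOR):
    """Return asinh(x / cofactor) for every raw intensity x."""
    return [math.asinh(x / cofactor) for x in values]


# Hypothetical raw intensities, including a negative and a zero value.
raw = [-2.0, 0.0, 1.0, 5.0, 500.0, 5000.0]
transformed = arcsinh_transform(raw)

for x, y in zip(raw, transformed):
    print(f"{x:>8.1f} -> {y:.3f}")

# Near zero the result is close to x / cofactor (linear regime);
# for large x it is close to log(2 * x / cofactor) (logarithmic regime).
```

Negative and zero inputs pass through without special handling, which is why such cells do not have to be dropped before the transformation.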
Diagnostic plots – a quick look at the data
• Plot with per-sample marker expression distributions, colored by condition
– Identify problematic samples or markers
• Cell counts in samples
• MDS plot (or PCA plot)
– Standard in RNA-seq analysis
– Generated with the plotMDS() function from the limma package
– As we want to plot samples, we need to summarize the information about cells at the sample level
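MDS plot: expr_median_sample_tbl
[Diagram: the per-cell expression matrix (rows: cells c1, c2, …, grouped by samples S1, S2, S3, …; columns: markers M1, M2, M3, …) is summarized into a table with one row per sample and one column per marker, where each entry (*) is the median expression of that marker over the cells of that sample.]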
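The summarization step can be sketched as follows. This is an illustrative Python version (standard library only) with hypothetical numbers, not the workflow's R code; `median_per_sample` is a name introduced here, and the sample and marker labels simply mirror the diagram.

```python
from statistics import median

# Hypothetical transformed expression: one row per cell, one column per marker.
markers = ["M1", "M2", "M3"]
expr = [
    [0.1, 2.0, 3.1],
    [0.3, 2.2, 2.9],
    [0.2, 1.8, 3.0],
    [1.5, 0.4, 0.2],
    [1.7, 0.6, 0.1],
]
# Sample of origin for each row of expr.
sample_ids = ["S1", "S1", "S1", "S2", "S2"]


def median_per_sample(expr, sample_ids, markers):
    """Collapse cells to one row per sample: the median of each marker."""
    grouped = {}
    for row, sid in zip(expr, sample_ids):
        grouped.setdefault(sid, []).append(row)
    return {
        sid: {m: median(r[k] for r in rows) for k, m in enumerate(markers)}
        for sid, rows in grouped.items()
    }


expr_median_sample = median_per_sample(expr, sample_ids, markers)
for sid, row in expr_median_sample.items():
    print(sid, row)
```

The resulting sample-by-marker table is the kind of input an MDS or PCA plot of samples would be computed from.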
MDS plot
Cell population identification
• Manual gating
• Comparison of clustering algorithms by Lukas Weber et al., Cytometry Part A, 2016
• FlowSOM (+ ConsensusClusterPlus) is one of the best-performing methods and very fast
• 3 steps:
1. Building the self-organizing map (SOM) with the BuildSOM function: cells are assigned, according to their similarities, to 100 grid points (so-called codes) of the SOM
2. Building a minimal spanning tree, which is mainly used for graphical representation of the clusters
3. Metaclustering of the SOM codes, performed directly with the ConsensusClusterPlus function
Over-clustering
• Some level of over-clustering is necessary in order to detect somewhat rare populations
• In addition, merging can always follow an over-clustering step, but splitting of existing clusters is generally not feasible
Over-clustering
Figures adapted from Nowicka et al. F1000Research 2017
Over-clustering into 20 groups
[Heatmap: median marker expression of data normalized to 0-1. Rows: the 20 clusters, labeled with cluster number and relative size; columns: the surface markers CD7, CD4, CD3, HLA_DR, CD20, IgM, CD123, CD33, CD45, CD14; color scale from 0.2 to 1.]
Clusters as listed in the figure (relative size): 1 (3.85%), 5 (1.89%), 20 (0.34%), 15 (7.8%), 12 (24.25%), 18 (1.85%), 3 (14.19%), 4 (1.97%), 9 (16.62%), 2 (2.26%), 17 (0.64%), 19 (1.51%), 8 (8.82%), 14 (2.2%), 10 (2.08%), 6 (3.58%), 7 (4.37%), 11 (0.64%), 13 (0.42%), 16 (0.72%)
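The heatmap shows marker expression after scaling to the 0-1 range. The slide does not state how the bounds are chosen, so the sketch below assumes plain min-max scaling per marker; the original workflow may use different bounds (for example percentiles). It is written in Python with hypothetical values, and `scale_01` is a name introduced here.

```python
def scale_01(values):
    """Min-max scale a list of numbers to the 0-1 range (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant marker carries no contrast; map it to 0 everywhere.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical arcsinh-transformed expression of one marker across cells.
marker_expr = [0.0, 1.0, 2.5, 5.0]
scaled = scale_01(marker_expr)
print(scaled)
```

Scaling each marker separately puts markers with different dynamic ranges on a common color scale, so cluster medians are comparable across columns of the heatmap.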
Merging by an expert
Figures adapted from Nowicka et al. F1000Research 2017
[Heatmap: median marker expression of data normalized to 0-1 for the 20 clusters and the 10 surface markers, with an additional cluster_merging annotation that assigns each cluster to one of the merged populations: B-cells IgM+, B-cells IgM-, CD4 T-cells, CD8 T-cells, DC, NK cells, monocytes, surface-.]
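Manual merging amounts to a lookup from original cluster number to merged population label. The sketch below is a Python illustration; the mapping shown is hypothetical (the real assignments come from the expert's cluster-merging file), and `merge_clusters` is a name introduced here.

```python
# Hypothetical merging table: original cluster id -> merged population label.
# The real table is provided by the expert and is not reproduced here.
cluster_merging = {
    1: "CD8 T-cells",
    2: "CD8 T-cells",
    3: "CD4 T-cells",
    4: "B-cells IgM+",
    5: "monocytes",
}

# Hypothetical cluster assignment of seven cells.
cell_clusters = [1, 3, 3, 2, 5, 4, 1]


def merge_clusters(cell_clusters, merging):
    """Replace each cell's cluster id with its merged population label."""
    missing = sorted(set(cell_clusters) - set(merging))
    if missing:
        raise KeyError(f"clusters without a merged label: {missing}")
    return [merging[c] for c in cell_clusters]


merged = merge_clusters(cell_clusters, cluster_merging)
print(merged)
```

Because several clusters can map to the same label, merging only ever coarsens the clustering, which is why over-clustering first and merging afterwards is the safer order.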
Statistical testing – differential abundance
[Figure: stacked bar plot of the proportion (0-100%) of each cell population per sample (sample_id: Ref1 to Ref8 and BCRXL1 to BCRXL8), in two panels by condition (Ref, BCRXL). Populations: B-cells IgM+, B-cells IgM-, CD4 T-cells, CD8 T-cells, DC, NK cells, monocytes, surface-.]
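The proportions in such a plot are per-sample cell counts divided by the sample total. A minimal Python sketch with hypothetical cells follows; `count_cells` and `to_percent` are names introduced here for illustration.

```python
from collections import Counter

# Hypothetical per-cell table: (sample_id, merged population label).
cells = [
    ("Ref1", "CD4 T-cells"), ("Ref1", "CD4 T-cells"),
    ("Ref1", "B-cells IgM+"), ("Ref1", "monocytes"),
    ("BCRXL1", "CD4 T-cells"), ("BCRXL1", "B-cells IgM+"),
    ("BCRXL1", "B-cells IgM+"), ("BCRXL1", "monocytes"),
    ("BCRXL1", "monocytes"),
]


def count_cells(cells):
    """Count cells per population within each sample."""
    counts = {}
    for sample_id, population in cells:
        counts.setdefault(sample_id, Counter())[population] += 1
    return counts


def to_percent(counts):
    """Convert per-sample counts to percentages of the sample total."""
    props = {}
    for sample_id, c in counts.items():
        total = sum(c.values())
        props[sample_id] = {pop: 100.0 * n / total for pop, n in c.items()}
    return props


counts = count_cells(cells)
props = to_percent(counts)
print(props)
```

The raw counts and the per-sample totals, rather than the percentages, are what a binomial model of abundance takes as input.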
Statistical testing – differential abundance
[Figure: proportion (%) of each cell population by condition (Ref vs. BCRXL), one panel per population (B-cells IgM+, B-cells IgM-, CD4 T-cells, CD8 T-cells, DC, NK cells, monocytes, surface-), each with its own y-axis range; points are identified by patient_id (Patient1 to Patient8).]
The generalized linear mixed model (GLMM) for each cell population
• observation-level random effects (ξ_ij)
• random intercept for each patient (γ_i)
i = 1,…,8 – patient ID
j = 1, 2 – condition
Y – number of CD4 cells
π – proportion of CD4 cells
m – total number of cells
$$Y_{ij} \sim \mathrm{Bin}(m_{ij}, \pi_{ij})$$
$$\mathrm{logit}(\pi_{ij}) = \beta_0 + \beta_1 x_{ij} + \xi_{ij} + \gamma_i$$
$$\xi_{ij} \sim N(0, \sigma^2_\xi)$$
$$\gamma_i \sim N(0, \sigma^2_\gamma)$$
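To see how the linear predictor translates into a proportion, the logit link can be inverted by hand. The Python sketch below does not fit the model; the coefficient values are hypothetical, and coding x = 0 for Ref and x = 1 for BCRXL is an assumption made for illustration.

```python
import math


def inv_logit(eta):
    """Inverse of the logit link: maps the linear predictor to a proportion."""
    return 1.0 / (1.0 + math.exp(-eta))


def expected_proportion(beta0, beta1, x, xi=0.0, gamma=0.0):
    """pi = inv_logit(beta0 + beta1 * x + xi + gamma)."""
    return inv_logit(beta0 + beta1 * x + xi + gamma)


# Hypothetical fixed effects: a 30% baseline proportion, log-odds shift of 0.5.
beta0 = math.log(0.3 / 0.7)
beta1 = 0.5

p_ref = expected_proportion(beta0, beta1, x=0)   # assumed coding: Ref = 0
p_stim = expected_proportion(beta0, beta1, x=1)  # assumed coding: BCRXL = 1

# On the odds scale, the condition effect is multiplicative: exp(beta1).
odds_ratio = (p_stim / (1 - p_stim)) / (p_ref / (1 - p_ref))
print(round(p_ref, 3), round(p_stim, 3), round(odds_ratio, 3))
```

A nonzero patient effect `gamma` or observation-level effect `xi` shifts the proportion for that patient or sample without changing the odds ratio between conditions, which is the quantity tested through β1.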
Statistical testing – differential expression of signaling markers
$$Y_{ij} = \beta_0 + \beta_1 x_{ij} + \epsilon_{ij} + \gamma_i$$
[Figure: median expression (median_expression) of the 14 signaling markers (antigen: pNFkB, pp38, pStat5, pAkt, pStat1, pSHP2, pZap70, pStat3, pSlp76, pBtk, pPlcg2, pErk, pLat, pS6) by condition (Ref vs. BCRXL), one panel per cell population (B-cells IgM+, B-cells IgM-, CD4 T-cells, CD8 T-cells, DC, NK cells, monocytes, surface-); points are identified by patient_id (Patient1 to Patient8).]
The linear mixed model (LMM) for each cell population
• random intercept for each patient (γ_i)
i = 1,…,8 – patient ID
j = 1, 2 – condition
Y – median expression of marker in CD4 cells
$$Y_{ij} \sim N(\mu_{ij}, \sigma^2)$$
$$\epsilon_{ij} \sim N(0, \sigma^2)$$
$$\gamma_i \sim N(0, \sigma^2_\gamma)$$
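The role of the patient intercept can be illustrated without fitting the model. In a balanced design with exactly one Ref and one BCRXL value per patient, the estimate of the condition effect β1 in a random-intercept model should coincide with the mean within-patient difference, because the patient effect cancels in each difference. The Python sketch below uses hypothetical medians for four patients; it illustrates the pairing idea only and does not compute standard errors or p-values.

```python
from statistics import mean

# Hypothetical median expression of one marker in one cell population:
# patient -> (Ref, BCRXL).
medians = {
    "Patient1": (0.5, 1.4),
    "Patient2": (0.8, 1.9),
    "Patient3": (0.3, 1.1),
    "Patient4": (0.6, 1.6),
}

# Within-patient differences: the patient-specific intercept cancels out.
diffs = [stim - ref for ref, stim in medians.values()]
beta1_hat = mean(diffs)

# In this balanced layout the same number is the difference of condition means.
mean_ref = mean(ref for ref, _ in medians.values())
mean_stim = mean(stim for _, stim in medians.values())

print(round(beta1_hat, 3), round(mean_stim - mean_ref, 3))
```

Patients differ a lot in their baseline level here (0.3 to 0.8), yet the differences are tightly clustered; accounting for the patient is what lets the model see that consistency.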