scRNAseq normalization and gene set selection
Åsa Bjö[email protected]
Outline
• Introduction
• Normalization
• Gene set selection
• Removal of confounders
Biological and technical variation
• Biological variation:
  – Cell type/state
  – Cell cycle
  – Cell size
  – Sex, age, …
  – Etc.
• Technical variation:
  – Cell quality
  – Library prep efficiency
  – Batch effects
  – Etc.
To identify cell types we would like to remove all other sources of variation.
UMIs do not solve the problem
Vallejos et al. Nature Methods 2017
Normalization
• Count normalization – for uneven sequencing depth
• Gene length normalization – for differences in gene detection due to gene length
• Drop-out rate normalization – for differences in RNA content/drop-out rates
Bulk RNAseq methods
• CPM: Controls for sequencing depth by dividing by the total count.
• RPKM/FPKM: Control for sequencing depth and gene length. Good for technical replicates, not good for sample-to-sample comparisons due to compositional bias. Assume total RNA output is the same in all samples.
• TPM: Similar to RPKM/FPKM. Corrects for sequencing depth and gene length. Also comparable between samples, but no correction for compositional bias.
Xi: observed count; li: length of the transcript; N: number of fragments sequenced
Bulk RNAseq methods
• TMM/RLE/MRN: Improved assumption: the output between samples is similar only for a core set of genes. Corrects for compositional bias. RLE and MRN are very similar and correlate well with sequencing depth. edgeR::calcNormFactors() implements TMM, TMMwsp, RLE & UQ. DESeq2::estimateSizeFactors() implements the median ratio method (RLE). Does not correct for gene length.
• VST/RLOG/VOOM: Variance is stabilised across the range of mean values. For use in exploratory analyses. vst() and rlog() functions from DESeq2. voom() function from limma transforms counts to log-CPM with precision weights, so that linear models assuming normality can be used.
scRNAseq normalization
• Deconvolution/Scran (Pooling-Across-Cells)
• SCnorm (Expression-Depth Relation)
• SCTransform
• Census
• Linnorm
• ZINB-WaVE
• BASiCS
• More…
Log transformation
• Log-transformed values approach a normal distribution for bulk RNAseq data
• For scRNAseq – more similar to a zero-inflated binomial
• Non-transformed data, in contrast, is hard to fit.
Depth normalization and log transformation
• The simplest normalization is to divide by sequencing depth, multiply by a scale factor and log-transform the data
• Scater normalize – uses total counts or size factors. Default is return_log = TRUE.
• Seurat NormalizeData – returns log-normalized data with scale.factor = 10K by default.
• Scanpy normalize_per_cell/normalize_total – normalize by sequencing depth – then need to run log1p.
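The depth normalization plus log transformation described above can be sketched in plain NumPy. This assumes a cells x genes count matrix with no empty cells. The function name is hypothetical, and the default scale factor of 10,000 simply mirrors the Seurat default mentioned on the slide.

```python
import numpy as np

def log_normalize(counts, scale_factor=1e4):
    """Divide each cell by its total count, multiply by a scale factor, take log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=1, keepdims=True)  # sequencing depth per cell
    return np.log1p(counts / depth * scale_factor)

# Hypothetical toy data: 3 cells (rows) x 3 genes (columns).
# Cell 2 has the same profile as cell 1 at 10x the depth.
toy = np.array([[1, 1, 2],
                [10, 10, 20],
                [0, 5, 5]])
lognorm = log_normalize(toy)
```

After normalization the first two cells are identical, and zeros stay zero because log1p(0) = 0.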
Depth normalization
• Assumes the same RNA content in all cells – may work well in a homogeneous cell population
• In most cases the amount of RNA – and of UMIs/reads – differs between cells.
• Also important to check for outlier genes that constitute a large proportion of the reads!
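One simple way to look for outlier genes is to find, for each cell, the fraction of counts taken by its single most abundant gene. The sketch below assumes a cells x genes matrix. The function name and the 0.5 cut-off are illustrative choices, not recommendations from the slides.

```python
import numpy as np

def top_gene_fraction(counts):
    """Per cell: index of the most abundant gene and the fraction of counts it takes."""
    counts = np.asarray(counts, dtype=float)
    frac = counts / counts.sum(axis=1, keepdims=True)
    return frac.argmax(axis=1), frac.max(axis=1)

# Hypothetical toy data: in cell 1 a single gene takes 90% of the counts.
toy = np.array([[90, 5, 5],
                [10, 10, 20]])
top_idx, top_frac = top_gene_fraction(toy)
flagged = top_frac > 0.5  # hypothetical threshold for "dominated by one gene"
```

Cells flagged this way are worth inspecting before depth normalization, since a dominant gene shrinks the normalized values of all other genes in that cell.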
Deconvolution
Lun et al. Genome Biol. 2016
Scran - computeSumFactors
• Deconvolution with all cells
  – The assumption is that most genes are not differentially expressed (DE) between cells.
• Deconvolution within clusters (quickCluster beforehand)
  – Size factors computed within each cluster and rescaled by normalization between clusters.
  – Useful when many genes are DE between clusters in a heterogeneous population.
• computeSumFactors – will also remove low abundance genes
Normalization with gene groups
• Global scale factors may lead to overcorrection for weakly and moderately expressed genes and undernormalization for highly expressed genes.
• Solution: Do normalization for genes at different expression levels.
SCnorm: Expression vs. Depth Bias Correction
Bacher et al. Nature Methods 2017
Quantile regression to estimate the count–depth relationship
Identical cells in two groups should result in no DE and FC = 1 if normalization was efficient
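The count-depth relationship can be illustrated by regressing, per gene, log expression on log sequencing depth. The sketch below uses ordinary least squares as a stand-in for the quantile regression named above, so it is not the SCnorm algorithm, and the function name and toy data are hypothetical. A slope near 1 means the gene still scales with depth. A slope near 0 means depth has been removed.

```python
import numpy as np

def count_depth_slopes(expr, depth, min_cells=3):
    """Per-gene slope of log(expression) vs log(depth), using cells where the gene is detected.
    Ordinary least squares is used here instead of quantile regression (a simplification)."""
    expr = np.asarray(expr, dtype=float)
    log_depth = np.log(np.asarray(depth, dtype=float))
    slopes = np.full(expr.shape[1], np.nan)  # NaN for genes detected in too few cells
    for g in range(expr.shape[1]):
        detected = expr[:, g] > 0
        if detected.sum() >= min_cells:
            slopes[g] = np.polyfit(log_depth[detected], np.log(expr[detected, g]), 1)[0]
    return slopes

# Hypothetical toy data: 4 identical cells sequenced at different depths.
profile = np.array([10.0, 20.0, 70.0])
toy_counts = np.outer([1, 2, 4, 8], profile)
toy_depth = toy_counts.sum(axis=1)

raw_slopes = count_depth_slopes(toy_counts, toy_depth)
scaled_slopes = count_depth_slopes(toy_counts / toy_depth[:, None], toy_depth)
```

In this idealized example a global scale factor flattens every gene. In real data the slopes differ by expression level, which is the motivation for normalizing groups of genes separately.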
SCTransform (Seurat)
Hafemeister & Satija Genome Biology 2019
Pearson residuals from regularized negative binomial (NB) regression
• Note: the SCTransform function in Seurat also does variable gene selection in the same step, with a slightly different method than the default in Seurat.
• But you can also specify which genes to run it on.
• You can also run regression in the same step.
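To show what a Pearson residual from an NB model looks like, here is a heavily simplified sketch. It is not the regularized regression of SCTransform: the expected count is taken as cell depth times the gene's overall fraction, the overdispersion theta is fixed at an arbitrary value, and residuals are clipped at the square root of the number of cells. All three choices are assumptions made for illustration, and genes with zero total count are assumed to have been removed.

```python
import numpy as np

def pearson_residuals(counts, theta=100.0):
    """Pearson residuals (x - mu) / sqrt(mu + mu^2 / theta) under a simple NB model.
    mu = cell depth * gene fraction; theta is a fixed, assumed overdispersion."""
    counts = np.asarray(counts, dtype=float)
    cell_depth = counts.sum(axis=1, keepdims=True)
    gene_total = counts.sum(axis=0, keepdims=True)
    mu = cell_depth * gene_total / counts.sum()
    resid = (counts - mu) / np.sqrt(mu + mu**2 / theta)
    clip = np.sqrt(counts.shape[0])  # assumed clipping bound
    return np.clip(resid, -clip, clip)

# Hypothetical toy data: 4 cells with the same profile at different depths...
base = np.outer([1, 2, 4, 8], [10, 20, 70]).astype(float)
flat = pearson_residuals(base)

# ...and the same data with one gene spiked in one cell.
spiked = base.copy()
spiked[0, 0] += 50
resid = pearson_residuals(spiked)
```

Cells that differ only in depth give residuals of zero, while the spiked entry gets the largest positive residual.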
Zero-Inflated Negative Binomial-based Wanted Variation Extraction (ZINB-WaVE)
• Both gene-level and sample-level covariates
• Extension of the RUV model
Risso et al. Nat. Comm. 2018
ZINB-WaVE
Reduces technical influence on PCA, also batch effect.
Size factors with different normalizations
Vieth et al. Nature Comm. 2019
DE with different normalizations
Vieth et al. Nature Comm. 2019
Imputation
• scRNAseq has a lot of zeros in the expression matrix
• Common for GWAS data to impute SNPs
• Many methods recently published:
  – SAVER
  – DrImpute
  – scImpute
  – MAGIC
  – Knn-smooth
  – Deep count autoencoder
Imputation can introduce false correlations
Andrews et al. F1000Research 2018
Imputation has little effect on DE detection
Vieth et al. Nature Comm. 2019
Normalization + imputation comparison
Tian et al. Nature Methods 2019
Scaling data – Z-score transformation
• Z-score transformation - linearly transform data to a mean of zero and a standard deviation of 1.
• Without scaling, PCA or any other type of analysis will be dominated by highly expressed genes with high variance.
• It can be wise to center and scale each gene before performing PCA
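Centering and scaling each gene can be sketched as follows, assuming a cells x genes matrix of normalized expression. The function name is hypothetical, and leaving constant genes at zero rather than dividing by zero is one possible convention.

```python
import numpy as np

def zscore_genes(expr):
    """Center each gene (column) to mean 0 and scale it to standard deviation 1."""
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=0)
    sd = expr.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0  # constant genes stay at zero instead of dividing by zero
    return (expr - mean) / sd

# Hypothetical toy data: gene 2 is 100x more highly expressed than gene 1,
# gene 3 is constant across cells.
toy = np.array([[1, 100, 5],
                [2, 200, 5],
                [3, 300, 5]])
scaled = zscore_genes(toy)
```

After scaling, the lowly and highly expressed genes carry the same weight, so neither dominates a PCA simply because of its expression level.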
What normalization should you use?
• Normalization has big impact on differential gene expression, but not as much on clustering
• In most cases it is enough to do sequencing depth normalization
• When working with highly similar subtypes of the same cell type, or with cell types of very different sizes, individual size factors could help.
• Binning by gene level (SCTransform) helps to remove the effect of different gene detection across cells.
Selecting genes
• Excluding invariable genes that do not contribute interesting information
  – Improved signal to noise ratio
  – Reduced computational requirements
• Highly variable genes (HVGs)
• Correlated gene pairs/groups
• Top PCA loadings
Variable gene selection
• Genes which behave differently from a null model describing technical noise
  – Mean-variance trend: genes with higher than expected variance
  – Coefficient of variation (Brennecke et al. 2013)
• High dropout genes
  – Number of zeros unexpectedly high compared to null model
Highly variable genes (HVGs)
Brennecke et al. Nature Methods 2013
Fit a gamma generalized linear model
No ERCCs? -> estimate technical noise based on all genes
HVGs with spike-in controls – normalization matters
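The idea of comparing each gene with a technical noise trend can be sketched as below, for the "no ERCCs" case where the trend is estimated from all genes. The trend CV² = a1/mean + a0 is fitted by ordinary least squares here, not by the gamma GLM named above, and no statistical test is performed. The function name, the ratio cut-off of 2 and the simulated data are assumptions for illustration only.

```python
import numpy as np

def select_hvgs(counts, min_ratio=2.0):
    """Flag genes whose squared coefficient of variation (CV^2) exceeds a fitted
    trend CV^2 = a1 / mean + a0 by at least min_ratio.
    Least squares is used instead of a gamma GLM (a simplification)."""
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=0)  # assumes every gene has a non-zero mean
    cv2 = counts.var(axis=0, ddof=1) / mean**2
    design = np.column_stack([1.0 / mean, np.ones_like(mean)])
    (a1, a0), *_ = np.linalg.lstsq(design, cv2, rcond=None)
    trend = a1 / mean + a0
    return cv2 / trend > min_ratio

# Hypothetical simulated data: 500 cells, 200 genes with Poisson (technical) noise only,
# plus 5 planted genes that are low in half of the cells and high in the other half.
rng = np.random.default_rng(0)
technical = rng.poisson(np.linspace(1, 100, 200), size=(500, 200))
planted = np.tile(np.repeat([5, 95], 250)[:, None], (1, 5))
toy = np.hstack([technical, planted])

hvg = select_hvgs(toy)
```

The planted bimodal genes sit far above the trend and are flagged, while the purely technical genes follow CV² of roughly 1/mean.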
M3Drop
• Reverse transcription is an enzyme reaction and can thus be modelled using the Michaelis-Menten equation:
  P_dropout = 1 - S / (K_M + S)
S: average expression; K_M: Michaelis-Menten constant
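The Michaelis-Menten dropout model can be tried out on toy numbers. In this sketch the grid search used to estimate K_M is an arbitrary choice made for brevity and is not how the M3Drop package fits the curve. The function names and data are hypothetical.

```python
import numpy as np

def mm_dropout(mean_expr, km):
    """Expected dropout rate under Michaelis-Menten: 1 - S / (K_M + S)."""
    mean_expr = np.asarray(mean_expr, dtype=float)
    return 1.0 - mean_expr / (km + mean_expr)

def fit_km(mean_expr, dropout_rate):
    """Estimate K_M by a grid search on squared error (hypothetical fitting choice)."""
    grid = np.logspace(-2, 3, 501)
    errors = [np.sum((mm_dropout(mean_expr, k) - dropout_rate) ** 2) for k in grid]
    return grid[int(np.argmin(errors))]

# Hypothetical toy data: four genes that follow the curve with K_M = 10,
# and one gene with many more zeros than its mean expression predicts.
S = np.array([0.1, 1.0, 10.0, 100.0])
observed = mm_dropout(S, 10.0)
km_hat = fit_km(S, observed)

excess = 0.9 - mm_dropout(10.0, km_hat)  # observed 90% zeros at S = 10
```

Genes with a large positive excess, meaning more zeros than the curve predicts for their mean expression, are the candidates for differential expression between subpopulations.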
Confounding factors
• Any source of variation that you do not expect to give separation of the cell types.
  – Cell cycle
  – Cell size
  – Sequencing depth
  – Cell quality
  – Batch
  – More…
Linear regression
• Fit a line to the gene expression vs variable of interest
• Calculate residuals
• Remove variance explained by the variable of interest by taking the residuals.
• Multiple linear regression if multiple factors.
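The regression steps above can be sketched with a least-squares fit per gene. This assumes a cells x genes expression matrix and one or more numeric covariates per cell. The function name is hypothetical and the fit includes an intercept, so the residuals are also centered.

```python
import numpy as np

def regress_out(expr, covariates):
    """Fit expr ~ intercept + covariates for every gene and return the residuals."""
    expr = np.asarray(expr, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    if covariates.ndim == 1:
        covariates = covariates[:, None]  # single covariate -> one column
    design = np.column_stack([np.ones(expr.shape[0]), covariates])
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    return expr - design @ coef

# Hypothetical toy data: gene 1 is fully explained by the covariate,
# gene 2 is unrelated to it.
covariate = np.array([1.0, 2.0, 3.0, 4.0])
toy = np.column_stack([2 * covariate + 1, [1.0, -1.0, -1.0, 1.0]])
residuals = regress_out(toy, covariate)
```

Passing a matrix with several columns as covariates gives the multiple regression case.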
Other tools to remove unwanted variance
• RUVSeq or svaseq()
• Linear models with e.g. removeBatchEffect() in limma or scater
• ComBat() in sva
What confounders should you remove?
• Percent mitochondrial reads – often correlates with quality of cell
• Sequencing depth
• Gene detection rate – relates to amount of RNA per cell.
• Cell cycle
• Batch effects (sample, sort date, sex, etc.)
ALWAYS check QC parameters after analysis and see how they influence your data.
BUT, be careful that your confounders are not related to your biological question!
Scaling and regression in practice
• Seurat ScaleData: does Z-score transformation and regression of variables in vars.to.regress. Can use linear (default), poisson or negbinom models.
• Scran: runs scaling but not centering automatically in PCA step. trendVar function estimates unwanted variation either with a design matrix or with block factors. decomposeVar or denoisePCA to remove unwanted variation.
• Scanpy: pp.regress_out and pp.scale functions.
Cell cycle effect
Buettner et al. Nature Biotech. 2015
Predict cell cycle stage/scores
• Seurat – CellCycleScoring – builds on G2M- & S-phase human gene lists from Tirosh et al. paper
• Scran – cyclone function – trained on mouse cell cycle sorted cells. Uses relative expression of pairs of genes.
• Scanpy - tl.score_genes_cell_cycle – uses same gene list as Seurat
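Gene-list based scoring can be illustrated with a much simplified sketch: the score is the mean expression of the phase genes minus the mean of all other genes, and a cell is called G1 when both scores are negative. The actual Seurat and Scanpy functions choose control genes from expression-matched bins, which is not done here. The function names, gene indices and toy matrix are hypothetical.

```python
import numpy as np

def gene_set_score(expr, gene_idx):
    """Mean expression of the gene set minus mean expression of all other genes, per cell.
    Simplification: all remaining genes act as the control set."""
    expr = np.asarray(expr, dtype=float)
    in_set = np.zeros(expr.shape[1], dtype=bool)
    in_set[gene_idx] = True
    return expr[:, in_set].mean(axis=1) - expr[:, ~in_set].mean(axis=1)

def assign_phase(s_score, g2m_score):
    """Highest score decides between S and G2M; both scores negative means G1."""
    phase = np.where(s_score > g2m_score, "S", "G2M")
    phase[(s_score < 0) & (g2m_score < 0)] = "G1"
    return phase

# Hypothetical toy data: 3 cells x 6 genes.
# Genes 0-1 are "S genes", genes 2-3 are "G2M genes", genes 4-5 are background.
toy = np.array([[5, 5, 1, 1, 1, 1],
                [1, 1, 5, 5, 1, 1],
                [1, 1, 1, 1, 5, 5]])
s_score = gene_set_score(toy, [0, 1])
g2m_score = gene_set_score(toy, [2, 3])
phase = assign_phase(s_score, g2m_score)
```

The resulting scores can be used directly as covariates when regressing out cell cycle effects.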
Cell cycle removal
• Regression on cell cycle scores.
• scLVM (beta pre-release) - Designed for cell-cycle variation correction. Also correction of other confounding variables.
• ccRemover (stable version from CRAN). “ccRemover outperforms scLVM slightly.”
• Oscope
• reCAT
Conclusions
• Normalization has big impact on differential gene expression.
• Many different methods to remove unwanted variance – often an important step!
• Selection of variable genes is important to remove noise in the data. Always subset genes before running PCA/clustering.
• Always aim for same sequencing depth in all samples – to avoid at least one confounding factor.
Do not worry!
If you have distinct cell types – the clustering will be the same regardless of how you treat the data.
But, for subclustering of similar cell types, normalization and removal of confounders may be crucial.