Standing on the shoulders of giants, German Demidov, Bioinformatics Summer School 2017
Biology and Big Data
> Discovering truth by building on previous discoveries
Why is it useful?
Just one example:
Using data from consortia
> Which types of data can you obtain from consortia? How to access and download data?
> How to work as a part of a consortium? Which problems may you face?
Important Remark
> Workshops on “How to use consortium_name” usually take ~3 days (e.g. https://www.encodeproject.org/tutorials/encode-meeting-2016/); we will try to give an overview in 1 hour
> However, if you want to find more information – google “consortium_name workshop”
> There are separate papers (e.g. Ewan Birney, 2012, Nature, about ENCODE)
GWAS Consortia
> http://www.wikigenes.org/e/art/e/185.html
> 500,000 genotyped people in the UK
EWAS Consortia
Genomics Consortia
> The Exome Aggregation Consortium
> 1000 Genomes
> Human Reference Genome
> International Cancer Genome Consortium
> The Cancer Genome Atlas
> Pan-Cancer Analysis of Whole Genomes
> GTEx
Epigenomics Consortia
> ENCODE
> Roadmap Epigenomics
> BluePrint
> International Human Epigenome Consortium
ExAC Overview
> http://exac.broadinstitute.org/about
> First thing to do – look up and read the flagship paper!
> The dataset provided on this website spans 60,706 unrelated individuals sequenced as part of various disease-specific and population genetic studies.
ExAC: Why it is useful
It is used to
> calculate objective metrics of pathogenicity for sequence variants,
> identify genes subject to strong selection against various classes of mutation; 3,230 genes were identified with a near-complete reduction in the number of predicted protein-truncating variants, and 72% of these genes have no currently established human disease phenotype,
> efficiently filter candidate disease-causing variants
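The filtering use case in the last bullet can be sketched roughly as follows. This is only an illustration, not ExAC's own tooling: the variant records, the `exac_af` field name and the 0.1% cutoff are all assumptions made for the example.

```python
# Hypothetical candidate variants; "exac_af" stands for the allele frequency
# looked up in a population reference such as ExAC (None = never observed).
candidates = [
    {"id": "var1", "exac_af": 0.15},
    {"id": "var2", "exac_af": 0.00002},
    {"id": "var3", "exac_af": None},
]

def filter_rare(variants, max_af=0.001):
    """Keep variants that are absent from the reference or rarer than max_af."""
    return [v for v in variants
            if v["exac_af"] is None or v["exac_af"] < max_af]

rare = filter_rare(candidates)
print([v["id"] for v in rare])
```

The idea is that a variant common in tens of thousands of reference exomes is unlikely to cause a rare, severe disease, so it can be dropped from the candidate list early. The right cutoff depends on the disease model and is a choice, not a fixed rule.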
ExAC: Results
• ANNOVAR and ATAV were updated using ExAC data
• CADD scores were re-calculated
• Commercial tools such as Golden Helix and GeneTalk also incorporated ExAC data
ExAC: Download
> Download
ExAC: Methods
> Flagship paper – Methods – short description, with detailed pipelines in the Supplementary Information
> 91,796 individual exomes drawn from a wide range of primarily disease-focused consortia
ExAC Quality Assessment
> Comparison within trios: singleton transmission rate of 50.1% (~50% expected)
> >10,000 samples were checked with SNP arrays – 97-99% heterozygous concordance
> Platinum standard genome sequenced with 5 different technologies – 99.8% sensitivity, 0.056% FDR
> Comparison with 13 WGS samples at ~30x, PCR-free
> Indel FDR is higher (4.7%); singleton variants show higher FDR
> FDR differs between annotation classes (missense, synonymous, protein-truncating)
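For reference, the two metrics quoted above can be computed from a comparison against a truth set as in the sketch below. The counts are made up for illustration and are not ExAC numbers.

```python
def sensitivity(tp, fn):
    """Fraction of true variants that were called: TP / (TP + FN)."""
    return tp / (tp + fn)

def false_discovery_rate(tp, fp):
    """Fraction of reported calls that are wrong: FP / (TP + FP)."""
    return fp / (tp + fp)

# Made-up counts: true positives, false positives, false negatives.
tp, fp, fn = 9980, 6, 20
print(f"sensitivity = {sensitivity(tp, fn):.3f}")
print(f"FDR = {false_discovery_rate(tp, fp):.4%}")
```

Note that sensitivity is measured against the truth set while FDR is measured against the call set, so the two can move independently. This is why indels or singletons can have a noticeably higher FDR than the overall figure.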
ExAC Sample Filtering
> Only 60,706 of 91,796 samples passed QC
> A set of common SNPs (5,400) was selected, and samples with outlier heterozygosity were removed prior to PCA
> Per-sample QC metrics: number of variants, transition/transversion (TiTv) ratio, alternate allele heterozygous/homozygous (Het/Hom) ratio and insertion/deletion (indel) ratio
> Close relatives were removed
> Final coverage: 80% of targeted bases >20x
> 77% were enriched with an Agilent kit (33 Mb target)
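Two of the per-sample metrics listed above are simple ratios. A minimal sketch of how they could be computed, assuming the calls are already parsed into plain Python lists (the input layout and the `het`/`hom_alt` labels are hypothetical, not a format used by ExAC):

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """Transition = purine<->purine or pyrimidine<->pyrimidine (assumes ref != alt)."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES

def titv_ratio(snvs):
    """snvs: list of (ref, alt) single-base substitutions for one sample."""
    ti = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    tv = len(snvs) - ti
    return ti / tv if tv else float("inf")

def het_hom_ratio(genotypes):
    """genotypes: list of 'het' / 'hom_alt' labels for non-reference calls."""
    het = genotypes.count("het")
    hom = genotypes.count("hom_alt")
    return het / hom if hom else float("inf")

snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
genotypes = ["het", "het", "het", "hom_alt", "hom_alt"]
print(titv_ratio(snvs), het_hom_ratio(genotypes))
```

Samples whose ratios fall far from the rest of the cohort are candidates for removal. The thresholds are a per-project decision and are not shown here.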
1000GP
> http://www.internationalgenome.org
1000GP: Overview, goals
> http://www.internationalgenome.org/data-portal/sample
> Pretty convenient data portal that lets you filter nicely!
> The goal of the 1000 Genomes Project was to find most genetic variants with frequencies of at least 1% in the populations studied.
> The project planned to sequence each sample to 4x genome coverage; at this depth, sequencing cannot discover all variants in each sample, but can allow the detection of most variants with frequencies as low as 1%.
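Why 4x is too shallow for a single sample but enough for a cohort can be seen with a back-of-the-envelope model. The sketch below is a deliberate simplification and not the project's actual power calculation: it assumes Poisson-distributed depth, a 50% chance that a read at a heterozygous site shows the alternate allele, and no sequencing errors. The `min_alt_reads` threshold is also an assumption.

```python
import math

def p_detect_het(mean_depth, min_alt_reads=1):
    """P(at least min_alt_reads reads carry the alt allele) at a het site.

    Under the assumptions above the alt-read count is Poisson(mean_depth / 2).
    """
    lam = mean_depth / 2
    p_fewer = sum(math.exp(-lam) * lam**k / math.factorial(k)
                  for k in range(min_alt_reads))
    return 1 - p_fewer

def p_detect_in_cohort(mean_depth, n_carriers, min_alt_reads=2):
    """P(the variant passes the read threshold in at least one het carrier)."""
    miss = 1 - p_detect_het(mean_depth, min_alt_reads)
    return 1 - miss**n_carriers

print(f"one sample at 4x:  {p_detect_het(4, min_alt_reads=2):.2f}")
print(f"one sample at 30x: {p_detect_het(30, min_alt_reads=2):.2f}")
print(f"20 carriers at 4x: {p_detect_in_cohort(4, 20):.6f}")
```

In a single 4x sample a heterozygous variant is missed a large fraction of the time, but a variant at 1% frequency is carried by many individuals in a cohort of this size, so the chance that every carrier misses it becomes negligible.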
1000GP: Main Publications
> Pilot: A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (28 October 2010)
> Phase 1: An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (01 November 2012)
> Phase 3: A global reference for human genetic variation. Nature 526, 68–74 (01 October 2015)
> An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81 (01 October 2015)
1000GP: Pipeline
1000GP: Power of Detection, Heterozygous Discordance, Sequencing Depth
1000GP: Results
1000GP: Variant Calling
1000GP: CNVs
1000GP: CNVs concordance
Pan-Cancer Analysis of Whole Genomes
> https://dcc.icgc.org/pcawg
Pan-Cancer Analysis of Whole Genomes
1. Novel somatic mutation calling methods
2. Analysis of mutations in regulatory regions
3. Integration of the transcriptome and genome
4. Integration of the epigenome and genome
5. Consequences of somatic mutations on pathway and network activity
6. Patterns of structural variations, signatures, genomic correlations, retrotransposons and mobile elements
7. Mutation signatures and processes
8. Germline cancer genome
9. Inferring driver mutations and identifying cancer genes and pathways
10. Translating cancer genomes to the clinic
11. Evolution and heterogeneity
12. Portals, visualization and software infrastructure
13. Molecular subtypes and classification
14. Analysis of mutations in non-coding RNA
15. Mitochondrial
16. Pathogens
PCAWG, WG8: Validation
> High-coverage validation
> 3 main callers: Broad Institute – HaplotypeCaller, Annai-RTG (private company), Freebayes (EMBL-DKFZ)
> 50 samples, 5000 sites per sample sequenced at ~1000x depth
> ~2300 SNVs, ~2700 indels
> SNP recall/PPV/concordance ~0.995
> Indels: 0.94 recall, 0.91 PPV, concordance 0.88
PCAWG WG8, CNVs
> CNVs
PCAWG WG8: Results
> Sensitivity: deletions only ~60%, duplications ~40%!
Further Information
> Flagship paper is not informative :/
> 16 papers have been released on bioRxiv
GTEx
> The Genotype-Tissue Expression project aims to provide to the scientific community a resource with which to study human gene expression and regulation and its relationship to genetic variation
> Variations in gene expression that are highly correlated with genetic variation can be identified as expression quantitative trait loci, or eQTLs
GTEx
> A lot of genetic changes associated with common human diseases, such as heart disease, cancer, diabetes, asthma, and stroke, lie outside of the protein-coding regions of genes
> The comprehensive identification of human eQTLs will greatly help to identify genes whose expression is affected by genetic variation
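The core of an eQTL test is a regression of expression on genotype. The sketch below shows that idea on toy data with a plain least-squares fit. It is not the GTEx pipeline: real analyses normalize expression, include covariates and correct for multiple testing, all of which are left out here, and the data values are invented.

```python
def linear_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r), where r is the Pearson correlation.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r

# Toy data: alt-allele dosage (0, 1, 2) at one SNP and expression of one gene,
# one value per individual.
dosage = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]

slope, intercept, r = linear_fit(dosage, expression)
print(f"effect per alt allele = {slope:.2f}, r = {r:.2f}")
```

A slope clearly different from zero means that each additional copy of the alternate allele shifts expression of the gene, which is what marks the SNP as a candidate eQTL.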
GTEx Data Overview
GTEx Scheme
GTEx: Causes of Death
ENCODE: Overview
> https://www.encodeproject.org
> Encyclopedia of DNA Elements
> The goal of ENCODE is to build a comprehensive parts list of functional elements in the human (mouse/fly/worm) genome
ENCODE Timeline
ENCODE as of 2012
ENCODE: Types of Data
> https://www.encodeproject.org
ENCODE: Data Matrix
ENCODE: Audit Category
Each sample can have multiple QC issues and can still be available for downloading!
ENCODE: Result of Analysis
ENCODE: Ground Level
ENCODE: Mid-level
ENCODE: Top-Level
ENCODE publications
> Of course, one of the products is publications!
[Figure: Cumulative ENCODE Publications Over Time. Y-axis: Number of Publications (0 to 600). Series: Papers from Non-ENCODE Authors; Papers from ENCODE 2 Production Groups]
ENCODE standards
> Data Standards
BluePrint
> “BLUEPRINT is a large-scale research project receiving close to 30 million euro funding from the EU.”
> 42 leading European scientific centers
> The aim is to further the understanding of how genes are activated or repressed in both healthy and diseased human cells
> Focus on distinct types of haematopoietic cells from healthy individuals and on their malignant leukaemic counterparts
BluePrint
> http://www.blueprint-epigenome.eu
> Publications (Cell papers) & Data Portal
BluePrint
> http://dcc.blueprint-epigenome.eu/#/home
RoadMap Epigenomics
> The NIH Roadmap Epigenomics research effort aims to transform our understanding of how epigenetics contributes to disease
> The Consortium leverages experimental pipelines built around next-generation sequencing technologies to map DNA methylation, histone modifications, chromatin accessibility and small RNA transcripts in stem cells and primary ex vivo tissues selected to represent the normal counterparts of tissues and organ systems frequently involved in human disease
RoadMap Epigenomics
It looks like we can get protocols by clicking on the link; however, there are not a lot of them there. The protocols are super outdated! (e.g. REMC STANDARDS AND GUIDELINES FOR CHIP-SEQ DEC. 2, 2011, V1.0)
RoadMap Epigenomics
> If you want to work with these data – read the paper “Integrative analysis of 111 reference human epigenomes” (+16 ENCODE 2012, do not print the paper!)
> Go through the “Publications” list
RoadMap Epigenomics
The most useful section is Methods:
> RNA-seq uniform processing and quantification for consolidated epigenomes
> ChIP-seq and DNase-seq uniform reprocessing for consolidated epigenomes
> Methylation data cross-assay standardization and uniform processing for consolidated epigenomes
> Chromatin state learning
> Etc.
RoadMap Epigenomics
> Publications
RoadMap Epigenomics
> Histone mark combinations show distinct levels of DNA methylation and accessibility, and predict differences in RNA expression levels that are not reflected in either accessibility or methylation.
> Megabase-scale regions with distinct epigenomic signatures show strong differences in activity, gene density and nuclear lamina associations, suggesting distinct chromosomal domains.
> Approximately 5% of each reference epigenome shows enhancer and promoter signatures, which are twofold enriched for evolutionarily conserved non-exonic elements on average.
> Epigenomic data sets can be imputed at high resolution from existing data, completing missing marks in additional cell types, and providing a more robust signal even for observed data sets.
> Dynamics of epigenomic marks in their relevant chromatin states allow a data-driven approach to learn biologically meaningful relationships between cell types, tissues and lineages.
Working in Consortia
Working with Data
• Getting raw data
• Working with data from different consortia simultaneously: different QCs, different data analysis pipelines
• Tool versions missing, or outdated/unsupported tools – failure of replication!
Working in Consortia I
• When your server goes down or all your data are accidentally removed
• Deadlines – add 3-6 months to the expected date!
• Communication: teleconferences
• Password renewal, access permissions
• Efficient data sharing – speed, reliability, confidentiality
Working in Consortia II
• Different naming of the same samples in different working groups/labs
• Wrong/missing identifiers (e.g. wrong cancer type or population) – real case: normal and somatic were actually swapped
• The same, but from clinicians
• Different labs, different library preparation (e.g. coverage depths after PCR-free and PCR-based WGS)
• Several tools can be used for the analysis – establish the best tool or generate a joint call set
• Multiple blacklists or outlier lists (every lab/group has its own, and they do not completely overlap)
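One simple way to build a joint call set from several tools is a support vote: keep a variant if enough callers agree on it. The sketch below is a generic illustration of that idea, not the procedure used by any particular consortium. The caller names, the variant tuples and the `min_support` threshold are all hypothetical.

```python
from collections import Counter

def joint_callset(callsets, min_support=2):
    """Keep variants reported by at least min_support callers.

    callsets: dict mapping caller name -> set of (chrom, pos, ref, alt) tuples.
    """
    support = Counter(v for calls in callsets.values() for v in calls)
    return {v for v, n in support.items() if n >= min_support}

# Hypothetical outputs of three callers on the same sample.
callsets = {
    "caller_a": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
    "caller_b": {("chr1", 100, "A", "G"), ("chr2", 50, "G", "A")},
    "caller_c": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
}

consensus = joint_callset(callsets)
print(sorted(consensus))
```

Raising `min_support` trades sensitivity for precision. The same counting approach works for merging per-lab blacklists, where the choice is between the union (`min_support=1`) and the stricter intersection.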
Working in Consortia III
• Unbalanced population structure
• Mix of different effects (e.g. cancer vs. population)
• Is your germline really germline?
Slide from AgENCODE, Ewan Birney
Thank you for your attention!