Post on 16-Mar-2021

Standing on the shoulders of giants

German Demidov, Bioinformatics Summer School 2017

Biology and Big Data

> Discovering truth by building on previous discoveries

Why is it useful?

Just one example:

Using data from consortia

> Which types of data can you obtain from consortia? How do you access and download the data?

> How do you work as part of a consortium? Which problems may you face?

Important Remark

> Workshops “How to use consortium_name” usually take ~3 days (e.g. https://www.encodeproject.org/tutorials/encode-meeting-2016/); we will try to give an overview in 1 hour

> However, if you want to find more information, google “consortium_name workshop”

> There are separate papers (e.g. Ewan Birney, 2012, Nature, about ENCODE)

GWAS Consortia

> http://www.wikigenes.org/e/art/e/185.html

> 500,000 genotyped people in the UK

EWAS Consortia

Genomics Consortia

> The Exome Aggregation Consortium

> 1000 Genomes

> Human Reference Genome

> International Cancer Genome Consortium

> The Cancer Genome Atlas

> PanCancer Analysis of Whole Genomes

> GTEx

Epigenomics Consortia

> ENCODE

> Roadmap Epigenomics

> BluePrint

> International Human Epigenome Consortium

ExAC Overview

> http://exac.broadinstitute.org/about

> First thing to do: find and read the flagship paper!

> The data set provided on this website spans 60,706 unrelated individuals sequenced as part of various disease-specific and population genetic studies.

ExAC: Why it is useful

It is used to

> calculate objective metrics of pathogenicity for sequence variants,

> identify genes subject to strong selection against various classes of mutation; identifying 3,230 genes with near-complete reduction in the number of predicted protein-truncating variants, with 72% of these genes having no currently established human disease phenotype,

> efficiently filter candidate disease-causing variants
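As a rough illustration of the filtering use case above, the sketch below keeps only candidate variants that are rare or absent in a reference panel. The variant records, the `af` field and the 0.001 threshold are all made up for the example; they are not ExAC's file format or a recommended cutoff.

```python
# Hypothetical candidate variants from one patient. `af` stands in for a
# population allele frequency looked up in a resource such as ExAC.
candidates = [
    {"variant": "1-1000-A-G", "af": 0.12},
    {"variant": "2-2000-C-T", "af": 0.00002},
    {"variant": "3-3000-G-A", "af": None},  # absent from the reference panel
]


def filter_rare(variants, max_af=0.001):
    """Keep variants that are absent from the panel or no more frequent than max_af."""
    return [v for v in variants if v["af"] is None or v["af"] <= max_af]


rare = filter_rare(candidates)
print([v["variant"] for v in rare])
```

A variant that is common in a large reference population is unlikely to cause a rare severe disease, which is why a frequency cutoff removes most candidates in practice.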

ExAC: Results

•  ANNOVAR and ATAV were updated using ExAC data

•  CADD scores were re-calculated

•  Commercial tools such as Golden Helix and GeneTalk also incorporated ExAC data

ExAC: Download

> Download

ExAC: Methods

> Flagship paper, Methods section: short description, with detailed pipelines in the Supplementary Information

> 91,796 individual exomes drawn from a wide range of primarily disease-focused consortia

ExAC Quality Assessment

> Comparison within trios: singleton transmission rate of 50.1% (expected ~50%)

> More than 10,000 samples were checked with SNP arrays: 97-99% heterozygous concordance

> Platinum standard genome sequenced with 5 different technologies: 99.8% sensitivity, 0.056% FDR

> Comparison with 13 WGS samples (~30x, PCR-free)

> Indel FDR is higher (4.7%); singleton variants show higher FDR

> FDR differs between annotation classes (missense, synonymous, protein-truncating)

ExAC Sample Filtering

> Only 60,706 samples out of 91,796 passed QC

> A set of common SNPs was selected (5,400) and samples with outlier heterozygosity were removed prior to PCA

> Per-sample QC metrics: number of variants, transition/transversion (TiTv) ratio, alternate allele heterozygous/homozygous (Het/Hom) ratio and insertion/deletion (indel) ratio

> Close relatives were removed

> Final coverage: 80% of targeted bases at >20x

> 77% were enriched with the Agilent kit (33 Mb target)
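Two of the per-sample metrics listed above are simple ratios. The sketch below shows one way to compute them, assuming SNVs are given as (ref, alt) base pairs and genotypes as alternate-allele counts; the input lists are toy data, not ExAC values, and this is not the ExAC pipeline itself.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T);
# every other single-base change is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_ratio(snvs):
    """Transition/transversion ratio for a list of (ref, alt) single-base pairs."""
    snvs = list(snvs)
    ti = sum(1 for pair in snvs if pair in TRANSITIONS)
    tv = len(snvs) - ti
    return ti / tv if tv else float("inf")


def het_hom_ratio(alt_counts):
    """Het/Hom ratio from per-site alternate-allele counts (1 = het, 2 = hom alt)."""
    alt_counts = list(alt_counts)
    het = alt_counts.count(1)
    hom = alt_counts.count(2)
    return het / hom if hom else float("inf")


toy_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
toy_genotypes = [1, 1, 2, 2, 0, 1, 1]
sample_titv = titv_ratio(toy_snvs)
sample_het_hom = het_hom_ratio(toy_genotypes)
print(sample_titv, sample_het_hom)
```

Samples whose ratios fall far from the cohort distribution are candidates for removal; where exactly to draw that line is a project-specific decision.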

1000GP

> http://www.internationalgenome.org

1000GP: Overview, goals

> http://www.internationalgenome.org/data-portal/sample

> A pretty convenient data portal with nice filtering!

> The goal of the 1000 Genomes Project was to find most genetic variants with frequencies of at least 1% in the populations studied.

> The project planned to sequence each sample to 4x genome coverage; at this depth, sequencing cannot discover all variants in each sample, but can allow the detection of most variants with frequencies as low as 1%.
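The trade-off described above can be illustrated with a back-of-the-envelope Poisson model. This is a simplification for intuition only, not the project's own power calculation: it assumes read depth is Poisson-distributed, ignores sequencing errors and caller thresholds, and uses a hypothetical cohort of 1,000 samples.

```python
import math


def prob_at_least(k, mean):
    """P(X >= k) for X ~ Poisson(mean)."""
    below = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return 1.0 - below


def expected_alt_reads(n_samples, allele_freq, depth):
    """Expected alt-supporting reads at one site, summed over the cohort.

    There are 2 * n_samples chromosomes; each carries the alt allele with
    probability allele_freq and contributes depth / 2 reads on average.
    """
    return 2 * n_samples * allele_freq * depth / 2


# One heterozygous carrier at 4x: about 2 reads expected from the alt allele.
single_het = prob_at_least(1, 4 / 2)

# Hypothetical cohort: 1,000 samples, 1% allele frequency, 4x each.
cohort_reads = expected_alt_reads(n_samples=1000, allele_freq=0.01, depth=4)

print(round(single_het, 3), cohort_reads)
```

Under these assumptions a single carrier shows no alt-supporting read at all in roughly one case out of seven, while the cohort as a whole accumulates dozens of such reads, which is the intuition behind low-coverage, many-sample designs.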

1000GP: Main Publications

> Pilot: A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (28 October 2010)

> Phase 1: An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (01 November 2012)

> Phase 3: A global reference for human genetic variation. Nature 526, 68–74 (01 October 2015)

> An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81 (01 October 2015)

1000GP: Pipeline

1000GP: Power of Detection, Heterozygous Discordance, Sequencing Depth

1000GP: Results

1000GP: Variant Calling

1000GP: CNVs

1000GP: CNVs concordance

PanCancer Analysis of Whole Genomes (PCAWG)

> https://dcc.icgc.org/pcawg

PanCancer Analysis of Whole Genomes (PCAWG)

1.  Novel somatic mutation calling methods

2.  Analysis of mutations in regulatory regions

3.  Integration of the transcriptome and genome

4.  Integration of the epigenome and genome

5.  Consequences of somatic mutations on pathway and network activity

6.  Patterns of structural variations, signatures, genomic correlations, retrotransposons and mobile elements

7.  Mutation signatures and processes

8.  Germline cancer genome

9.  Inferring driver mutations and identifying cancer genes and pathways

10.  Translating cancer genomes to the clinic

11.  Evolution and heterogeneity

12.  Portals, visualization and software infrastructure

13.  Molecular subtypes and classification

14.  Analysis of mutations in non-coding RNA

15.  Mitochondrial

16.  Pathogens

PCAWG, WG8: Validation

> High-coverage validation

> 3 main callers: Broad Institute HaplotypeCaller, Annai-RTG (private company), Freebayes (EMBL-DKFZ)

> 50 samples, 5000 sites per sample sequenced at ~1000x depth

> ~2300 SNVs, ~2700 indels

> SNP recall/PPV/concordance: ~0.995

> Indels: recall 0.94, PPV 0.91, concordance 0.88
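For readers less familiar with the validation metrics above, the sketch below computes recall and PPV from true positive, false positive and false negative counts. The counts are invented so that the results land near the indel figures quoted on the slide; they are not the PCAWG validation data.

```python
def recall(tp, fn):
    """Fraction of real variants that the caller found (sensitivity)."""
    return tp / (tp + fn)


def ppv(tp, fp):
    """Fraction of reported calls that are real (positive predictive value)."""
    return tp / (tp + fp)


# Hypothetical indel validation counts, for illustration only.
tp, fp, fn = 940, 93, 60

indel_recall = recall(tp, fn)
indel_ppv = ppv(tp, fp)
indel_fdr = 1 - indel_ppv  # false discovery rate is the complement of PPV

print(round(indel_recall, 2), round(indel_ppv, 2), round(indel_fdr, 2))
```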

PCAWG WG8, CNVs

> CNVs

PCAWG WG8: Results

> Sensitivity: deletions only ~60%, duplications ~40%!

Further Information

> The flagship paper is not informative :/

> 16 papers have been released on bioRxiv

GTEx

> The Genotype-Tissue Expression project aims to provide to the scientific community a resource with which to study human gene expression and regulation and its relationship to genetic variation

> Variations in gene expression that are highly correlated with genetic variation can be identified as expression quantitative trait loci, or eQTLs
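The idea of an eQTL can be sketched as a regression of expression on genotype dosage (0, 1 or 2 copies of the alternate allele). The example below is a minimal least-squares fit on made-up numbers; it is not the GTEx pipeline, which also handles normalization, covariates and multiple testing.

```python
def linear_fit(x, y):
    """Ordinary least-squares slope and intercept for y ~ x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data: alt-allele dosage per individual and expression of one gene.
dosage = [0, 0, 1, 1, 2, 2]
expression = [5.0, 5.2, 6.1, 5.9, 7.0, 7.2]

slope, intercept = linear_fit(dosage, expression)
print(round(slope, 3), round(intercept, 3))
```

A slope clearly different from zero means that each extra copy of the allele shifts expression, which is what marks the variant as a candidate eQTL for that gene.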

GTEx

> A lot of genetic changes associated with common human diseases, such as heart disease, cancer, diabetes, asthma, and stroke, lie outside of the protein-coding regions of genes

> The comprehensive identification of human eQTLs will greatly help to identify genes whose expression is affected by genetic variation

GTEx Data Overview

GTEx Scheme

GTEx: Causes of Death

ENCODE: Overview

> https://www.encodeproject.org

> Encyclopedia of DNA Elements

> The goal of ENCODE is to build a comprehensive parts list of functional elements in the human (mouse/fly/worm) genome

ENCODE Timeline

ENCODE as of 2012

ENCODE: Types of Data

> https://www.encodeproject.org

ENCODE: Data Matrix

ENCODE: Audit Category

Each sample can have multiple QC issues and can still be available for downloading!

ENCODE: Result of Analysis

ENCODE: Ground Level

ENCODE: Mid-level

ENCODE: Top-Level

ENCODE publications

> Of course, one of the products is publications!

[Figure: Cumulative ENCODE Publications Over Time. Y-axis: Number of Publications (0 to 600). Series: Papers from Non-ENCODE Authors; Papers from ENCODE 2 Production Groups.]

ENCODE standards

> Data Standards

BluePrint

> “BLUEPRINT is a large-scale research project receiving close to 30 million euro funding from the EU.”

> 42 leading European scientific centers

> The aim is to further the understanding of how genes are activated or repressed in both healthy and diseased human cells

> Focus on distinct types of haematopoietic cells from healthy individuals and on their malignant leukaemic counterparts

BluePrint

> http://www.blueprint-epigenome.eu

> Publications (Cell papers) & Data Portal

BluePrint

> http://dcc.blueprint-epigenome.eu/#/home

Roadmap Epigenomics

> NIH Roadmap Epigenomics: research to transform our understanding of how epigenetics contributes to disease

> The Consortium leverages experimental pipelines built around next-generation sequencing technologies to map DNA methylation, histone modifications, chromatin accessibility and small RNA transcripts in stem cells and primary ex vivo tissues selected to represent the normal counterparts of tissues and organ systems frequently involved in human disease

Roadmap Epigenomics

It looks like we can get protocols by clicking on the link; however, there are not many of them there. The protocols are very outdated! (e.g. REMC STANDARDS AND GUIDELINES FOR CHIP-SEQ DEC. 2, 2011, V1.0)

Roadmap Epigenomics

> If you want to work with these data, read the paper “Integrative analysis of 111 reference human epigenomes” (+16 ENCODE 2012; do not print the paper!)

> Go through the “Publications” list

Roadmap Epigenomics

The most useful section is Methods:

> RNA-seq uniform processing and quantification for consolidated epigenomes

> ChIP-seq and DNase-seq uniform reprocessing for consolidated epigenomes

> Methylation data cross-assay standardization and uniform processing for consolidated epigenomes

> Chromatin state learning

> Etc.

Roadmap Epigenomics

> Publications

Roadmap Epigenomics

> Histone mark combinations show distinct levels of DNA methylation and accessibility, and predict differences in RNA expression levels that are not reflected in either accessibility or methylation.

> Megabase-scale regions with distinct epigenomic signatures show strong differences in activity, gene density and nuclear lamina associations, suggesting distinct chromosomal domains.

> Approximately 5% of each reference epigenome shows enhancer and promoter signatures, which are twofold enriched for evolutionarily conserved non-exonic elements on average.

> Epigenomic data sets can be imputed at high resolution from existing data, completing missing marks in additional cell types, and providing a more robust signal even for observed data sets.

> Dynamics of epigenomic marks in their relevant chromatin states allow a data-driven approach to learn biologically meaningful relationships between cell types, tissues and lineages.

Working in Consortia

Working with Data

•  Getting raw data

•  Working with data from different consortia simultaneously: different QCs, different data analysis pipelines

•  Tool versions missing, or outdated/unsupported tools: failure of replication!

Working in Consortia I

•  When your server goes down or all your data are accidentally removed

•  Deadlines: add 3-6 months to the expected date!

•  Communication: teleconferences

•  Password renewal, access permissions

•  Efficient data sharing: speed, reliability, confidentiality

Working in Consortia II

•  Different naming of the same samples in different working groups/labs

•  Wrong/missing identifiers (e.g. wrong cancer type or population); one case: normal and somatic samples were actually swapped

•  The same, but from clinicians

•  Different labs, different library preparation (e.g. coverage depths after PCR-free and PCR-based WGS)

•  Several tools can be used for the analysis: establish the best tool or generate a joint call set

•  Multiple blacklists or outlier lists (every lab/group has its own and they do not completely overlap)
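Two of the problems above, inconsistent sample naming and partially overlapping blacklists, can be made concrete with a small sketch. The normalization rule and the sample IDs are hypothetical; a real consortium would need an agreed mapping table rather than string clean-up.

```python
def normalize_sample_id(raw):
    """Hypothetical rule: ignore case, surrounding spaces, and '-' versus '_'."""
    return raw.strip().upper().replace("-", "_")


# Made-up blacklists from two labs that name the same samples differently.
lab_a_blacklist = ["sample_01", "SAMPLE-07"]
lab_b_blacklist = ["Sample_07", "sample-12 "]

a = {normalize_sample_id(s) for s in lab_a_blacklist}
b = {normalize_sample_id(s) for s in lab_b_blacklist}

flagged_by_both = a & b   # samples both labs agree on
flagged_by_any = a | b    # conservative joint blacklist
flagged_by_one = a ^ b    # disagreements worth a manual check

print(sorted(flagged_by_both), sorted(flagged_by_any), sorted(flagged_by_one))
```

Whether to exclude the union or only the intersection is a judgment call: the union is safer for data quality, the intersection keeps more samples.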

Working in Consortia III

•  Unbalanced population structure

•  Mix of different effects (e.g. cancer vs. population)

•  Is your germline really germline?

Slide from AgENCODE, Ewan Birney

Thank you for your attention!
