mapping copy number variation by population scale genome … · 2015. 2. 19. · mapping copy...
TRANSCRIPT
Mapping Copy Number Variation by Population Scale GenomeSequencing
(Article begins on next page)
The Harvard community has made this article openly available.Please share how this access benefits you. Your story matters.
Citation Mills, Ryan E., Robert E. Handsaker, Joshua Korn, JamesNemesh, Xinghua Shi, Charles Lee, Steven A. McCarroll et al.2011. Mapping copy number variation by population scale genomesequencing. Nature 470(7332): 59-65.
Published Version doi:10.1038/nature09708
Accessed February 19, 2015 9:02:09 AM EST
Citable Link http://nrs.harvard.edu/urn-3:HUL.InstRepos:5339501
Terms of Use This article was downloaded from Harvard University's DASHrepository, and is made available under the terms and conditionsapplicable to Open Access Policy Articles, as set forth athttp://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#OAP
1
Mapping copy number variation by population scale genome sequencing Ryan E. Mills1,*, Klaudia Walter2,*, Chip Stewart3,*, Robert E. Handsaker4,*, KenChen5,*, Can Alkan6,7,*, Alexej Abyzov8,*, Seungtai Chris Yoon9,*, Kai Ye10,*, R. KeiraCheetham11,AsifChinwalla5,DonaldF.Conrad2,YutaoFu12,FabianGrubert13.,ImanHajirasouliha14, Fereydoun Hormozdiari14, Lilia M. Iakoucheva15, Zamin Iqbal16,ShuliKang15, JeffreyM.Kidd6,MiriamK.Konkel17, JoshuaKorn4, EktaKhurana8,18,Deniz Kural3, Hugo Y. K. Lam13, Jing Leng8, Ruiqiang Li19, Yingrui Li19, Chang‐YunLin20,RuibangLuo19,XinmengJasmineMu8,JamesNemesh4,HeatherE.Peckham12,Tobias Rausch21, Aylwyn Scally2, Xinghua Shi1, Michael P. Stromberg3, Adrian M.Stütz21,AlexanderEckehartUrban13,JerilynA.Walker17,JiantaoWu3,YujunZhang2,ZhengdongD.Zhang8,MarkA.Batzer17,LiDing5,22,GaborT.Marth3,GilMcVean23,Jonathan Sebat15,Michael Snyder13, JunWang19,24, Kenny Ye20, Evan E. Eichler6,7,*,MarkB.Gerstein8,18,25,*,MatthewE.Hurles2,*,CharlesLee1,*,StevenA.McCarroll4,26,*,andJanO.Korbel21,*,@
forthe1000GenomesProject#
1.DepartmentofPathology,BrighamandWomen’sHospitalandHarvardMedicalSchool,Boston,MA2.TheWellcomeTrustSangerInstitute,WellcomeTrustGenomeCampus,Hinxton,Cambridge,CB101SAUK.3.DepartmentofBiology,BostonCollege,Boston,MA4.BroadInstituteofHarvardandMassachusettsInstituteofTechnology,Cambridge,MA5.TheGenomeCenteratWashingtonUniversity,St.Louis,MO6.DepartmentofGenomeSciences,UniversityofWashingtonSchoolofMedicine,Seattle,WA7.HowardHughesMedicalInstitute,UniversityofWashington,Seattle,Washington,USA.8.PrograminComputationalBiologyandBioinformatics,YaleUniversity,NewHaven,CT9.SeaverAutismCenterandDepartmentofPsychiatry,MountSinaiSchoolofMedicine,NewYork,NY10.DepartmentsofMolecularEpidemiology,MedicalStatisticsandBioinformatics,LeidenUniversityMedicalCenter,Leiden,theNetherlands11.IlluminaCambridgeLtd,ChesterfordResearchPark,LittleChesterford,EssexCB101XL,UK12.LifeTechnologies,Beverly,MA13.DepartmentofGenetics,StanfordUniversity,Stanford,CA14.SchoolofComputingScience,SimonFraserUniversity,Burnaby,BritishColumbia,Canada.15.DepartmentofPsychiatry,DepartmentofCellularandMolecularMedicine,InstituteforGenomicMedicine,UniversityofCalifornia,SanDiego,LaJolla,CA16.WellcomeTrustCentreforHumanGenetics,UniversityofOxford,OX37BN,UK17.DepartmentofBiologicalSciences,LouisianaStateUniversity,BatonRouge,Louisiana18.MolecularBiophysicsandBiochemistryDepartment,YaleUniversity,NewHaven,CT19.BGI‐Shenzhen,Shenzhen518083,China20.AlbertEinsteinCollegeofMedicine,Bronx,NY21.GenomeBiologyResearchUnit,EuropeanMolecularBiologyLaboratory,Heidelberg,Germany22.DepartmentofGenetics,WashingtonUniversity,St.Louis,MO23.DepartmentofStatistics,UniversityofOxford,OX37BN,UK24.DepartmentofBiology,UniversityofCopenhagen,Copenhagen,Denmark25.DepartmentofComputerScience,YaleUniversity,NewHaven,CT26.DepartmentofGenetics,HarvardMedicalSchool,Boston,MA*Theseauthorscontributedequallytothiswork.@CorrespondenceshouldbeaddressedtoJ.O.K.([email protected]).#ListsofparticipantsandaffiliationsappearinSupplementaryInformation.
2
Summary Genomic structural variants (SVs) are abundant in humans, differing from other variation classes in extent, origin, and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (i.e., copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analyzing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serves as a resource for sequencing-based association studies.
3
Introduction
Unbalanced structural variants (SVs), or copy number variants (CNVs), involvinglarge‐scaledeletions,duplications,andinsertionsformoneoftheleastwellstudiedclasses of genetic variation. The fraction of the genome affected by SVs iscomparatively largerthanthataccountedforbysinglenucleotidepolymorphisms1(SNPs),implyingsignificantconsequencesofSVsonphenotypicvariation.SVshavealreadybeenassociatedwithdiversediseases, includingautism2,3,schizophrenia4,5and Crohn’s disease6,7. Furthermore, locus‐specific studies suggest that diversemechanisms may form SVs de novo, with some mechanisms involving complexrearrangementsresultinginmultiplechromosomalbreakpoints8,9.
Initialmicroarray‐based SV surveys focused on large gains and losses10,11,12, withrecentadvancesinarraytechnologywideningtheaccessiblesizespectrumtowardssmaller SVs1,13. Microarray‐based surveys commonlymapped SVs to approximategenomiclocations.However,adetailedSVcharacterization,includinganalysesofSVorigin and impact, requires knowledge of precise SV sequences. Advances insequencing technology have enabled applying sequence‐based approaches formappingSVsatfine‐scale14,15,16,17,18,19,20,21.Theseapproachesinclude:(i)paired‐endmapping (or read pair ‘RP’ analysis) based on sequencing and analysis ofabnormally mapping pairs of clone ends14,22,23,24 or high‐throughput sequencingfragments15,17,18;(ii)read‐depth(‘RD’)analysis,whichdetectsSVsbyanalyzingtheread depth‐of‐coverage16,21,25,26,27; (iii)split‐read (‘SR’) analysis, which evaluatesgappedsequencealignmentsforSVdetection28,29;and(iv)sequenceassembly(‘AS’),which enables the fine‐scale discovery of SVs, including novel (non‐reference)sequence insertions30,31,32. Sequence‐based SVdiscovery approacheshave thus farbeen applied to a limited (<20) number of genomes, leaving the fine‐scalearchitectureofmostcommonSVsunknown.
Sequence data generated by the 1000 Genomes Project (1000GP) provide anunprecedented opportunity to generate a comprehensive SV map. The 1000GPrecentlygenerated4.1Terabasesofrawsequenceinpilotprojectstargetingwholehumangenomes33 (SupplementaryTable1).Thesestudiescompriseapopulation‐scale project, termed ‘low‐coverage project’, in which 179 unrelated individualsweresequencedwithanaveragecoverageof3.6X–including59YorubaindividualsfromNigeria(YRI),60individualsofEuropeanancestryfromUtah(CEU),30ofHanancestry from Beijing (CHB), and 30 of Japanese ancestry from Tokyo (JPT; thelattertwowere jointlyanalyzedas JPT+CHB). Inaddition,ahigh‐coverageproject,termed the ‘trio project’, was carried out, with individuals of a CEU and a YRIparent‐offspringtriosequencedto42Xcoverageonaverage.
We report here the results of analyses undertaken by the Structural VariationAnalysisGroupof the1000GP.Thegroup’sobjectiveswere todiscover, assemble,genotype,andvalidateSVsof50bpand larger insize,andtoassessandcomparedifferent sequence‐based SV detection approaches. The focus of the group wasinitiallyondeletions,avariant classoftenassociatedwithdisease9, forwhichrich
4
controldatasetsanddiverseascertainmentapproachesexist1,13,22,28.Lessfocuswasplaced on insertions and duplications34 and none on balanced SV forms (such asinversions).Specifically,weappliednineteenmethodstogenerateanSVdiscoveryset.Wefurthergeneratedreferencegenotypesformostdeletions,assessedtheSVs’functional impact, and stratified SV formationmechanismwith respect to variantsizeandgenomiccontext.
PredictionofSVcandidatelociandassessmentofdiscoverymethods
WeincorporatedtheSVdiscoverymethodsintoapipeline(Fig.1AB),withthegoalof ascertaining different SV types and assessing each method for its ability todiscoverSVs.ThemethodsdetectedSVsbyanalyzingRD,RP,SR,andASfeatures,orby combining RP andRD features (abbreviated as ‘PD’). Altogetherwe generatedthirty‐sixSVcallsetsbyapplyingthemethodsontrioandlow‐coveragedata,andbyidentifyingSVsasgenomicdifferencesrelativetoahumanreference,correspondingto the reference genome, or to a set of individuals (i.e. population reference;Supplementary Table 2). We initially identified SVs as deletions, tandemduplications, novel sequence insertions, and mobile element insertions (MEIs)relative to the human reference. Subsequent comparative analyses involvingprimategenomesenabledustoclassifySVsasdeletions,duplications,orinsertionsrelative to inferred ancestral genomic loci, reflectingmechanismsof SV formation(seebelow).DNAreadsanalyzedbySVdiscoverymethodswereinitiallymappedtothehumanreferencegenomeusingavarietyofalignmentalgorithms.Mostofthesealgorithmsmappedeachreadtoasinglegenomicposition,althoughonealgorithm(mrFAST16) also considered alternativemappingpositions for reads aligning ontorepetitive regions (see Supplementary Tables 2‐4 formethod‐specific parametersand full SV callsets). We filtered each callset by excluding SVs <50bp, which arereported elsewhere33. Many SVs exhibited support from distinct SV discoverymethods, as exemplified by a common deletion, previously associatedwith body‐mass index35 (BMI), that we identified with RP, RD, and SR methods (Fig. 1C).Nonetheless, we observed notable differences between methods (Fig. 2ABC) interms of genomic regions ascertained (Supplementary Fig. 1), accessible SV size‐range(Fig.2A),andbreakpointprecision(Fig.2C,SupplementaryFig.2).
To estimate callset specificity, we carried out extensive validations (Methods),including PCRs for over 3,000 candidate loci, and microarray data analyses for50,000 candidate loci (Supplementary Tables 3, 4; Supplementary Fig. 3). Wecombined PCR and array‐based analysis results to estimate false discovery rates(FDRs),andfoundthateightcallsets(threedeletion,fourinsertion,andonetandemduplication callset) met the pre‐specified specificity threshold33 (FDR≤10%),whereastheothercallsetsyieldedlowerspecificity(FDRsof13%‐89%).
Wefurtherassessedthesensitivityofdeletiondiscoverymethodsbycollatingdatafrom four earlier surveys1,13,22,28 into a gold standard (Methods, SupplementaryTables 5, 6, and Supplementary Fig. 4A), and specifically assessing the detectionsensitivityforanindividualsequencedathigh‐coverage(NA12878)aswellasforan
5
individualsequencedatlow‐coverage(NA12156).Unsurprisingly,giventhetypicaltrade‐off between sensitivity and specificity, in the trios the highest sensitivitieswereachievedbyRDandRPmethodswithFDR>10%(Fig.2B).Bycomparison, inthe low‐coverage data, the individual method with the greatest accuracy(FDR=3.7%)was the secondmost sensitive basedon our gold standard (Fig. 2B),and the most sensitive when expanding the gold standard to a larger set ofindividuals(SupplementaryFig.4B).Thismethod,GenomeSTRiP(tobedescribedelsewhere36), integrated bothRP andRD features (PD), implying that consideringdifferentevidencetypescanimproveSVdiscovery.
ConstructionofahighconfidenceSVdiscoveryset
To construct our SV discovery set (“release set”), we joined calls from differentdiscoverymethodscorrespondingtothesameSVwithamergingapproachthatwasawareofeachcallset’sprecision inSVbreakpointdetection(SupplementaryFig.5andMethods). Most SVs in the release set (61%)were contributed by individualmethodsmeeting thepre‐definedspecificity threshold(FDR≤10%).Theremaining39%ofcallswerecontributedbylowerspecificitymethodsfollowingexperimentalvalidation. Altogether, the release set comprised 22,025 deletions, 501 tandemduplications,5,371MEIs,and128non‐referenceinsertions(Table1,SupplementaryTable 7). With our gold standard we estimated an overall sensitivity of deletiondiscoveryof82%inthetrios,and69%inlow‐coveragesequence(Fig.2B)usinga1bp overlap criterion. When instead applying a stringent 50% reciprocal overlapcriterion for sensitivity assessment (which requiredSV sizes inferredondifferentexperimental platforms to be in close agreement) our sensitivity estimatesdecreased by 12% and 18%, respectively, in trio and low‐coverage sequence(Supplementary Table 8). We further examined an alternative approach thatinvolved the pairwise integration of deletion discovery methods, and tested itsability to discover SVs without relying on the inclusion of lower specificity callsfollowing experimental validation (“algorithm‐centric set”; Fig. 1B). While thisalternativeapproachresultedinanincreasednumber(by~13%)ofhigh‐specificity(FDR<10%) calls compared to the release set (Supplementary Text), it overallresultedinfewerSVcallsowingto itsdecreasedsensitivityatthe lower(<200bp)SVsizerange.Inthefollowinganalyseswethusfocusedonthereleaseset.ExtentandimpactofourSVdiscoverysetWe next assessed the extent and impact of our SV discovery (release) set. ThemedianSVsizewas729bp(mean=8kb),approximatelyfourtimessmallerthaninarecenttilingCGHbasedstudy1,reflectingthehighresolutionofDNAsequencebasedSVdiscovery.Wealsocomparedourset toarecentsurveyofSVs inan individualgenome37basedoncapillarysequencingandarray‐basedanalyses24,andobservedasimilar size distribution for deletions, but differences in the size distributions ofother SV classes, reflecting underlying differences in SV ascertainment(SupplementaryFig.6).BycomparingourSVs todatabasesof structuralvariationandtoadditionalpersonalgenomedatasets,weclassified15,556SVsinoursetas
6
novel,withanenrichmentof low frequencySVsand small SVsamongst thenovelvariants(MethodsandSupplementaryText).
A major advantage of sequence‐based SV discovery is the nucleotide resolutionmappingofSVs.Weinitiallymappedthebreakpointsof7,066deletionsand3,299MEIsusingSRandASfeatures.UsingtheTIGRA‐targetedassemblyapproach38wefurtheridentifiedthebreakpointsofanadditional4,188deletionsand160tandemduplications, initially discovered by RD, RP, and PD methods (Methods,Supplementary Table 2). Altogether, we mapped ~15,000 SVs at nucleotideresolution,48%ofwhichwerenovel.Fewdeletion loci (4.4%)displayeddifferentSV breakpoints in different samples, which is explainable by rare TIGRA mis‐assemblies, or alternatively, by recurrently formed, multi‐allelic SVs(SupplementaryText).TIGRAfurtherenabledustovalidateanadditional7,359SVsdiscoveredwithRPorRD featuresby identifying the SVs’ breakpoints (Methods),and to evaluate the mapping precision of SV discovery methods (Fig. 2C,SupplementaryFigure2).
We further assessed the putative functional impact of SVs in our set by relatingthem to genomic annotation. Seventeen hundred SVs affected coding sequences,resulting in fullgeneoverlapsorexondisruptions(Table2),manyofwhich ledtoout‐of‐frameexons (SupplementaryTable9).Werelatedgenedisruptions togenefunctions, and observed significant enrichments for several functional categoriesincluding cell defense and sensory perception (Supplementary Table10). Highlevels of structural variation, including copy‐number variation, were previouslydescribed for both processes15,22,39. These SVs might be maintained in thepopulationbyselection for thepurposeof functional redundancy.WhilemostSVsintersectingwithgenesweredeletions, severalvalidated tandemduplicationsandMEIsalsointersectedwithcodingsequences(Table2).
Populationgeneticpropertiesofdeletions
Wenextsoughttogenerategenotypesfordeletionsdiscoveredinthe1000GPdata,bothtofacilitatepopulationgeneticsanalysesandtomakeourSVsetamenabletoassociation studies in the form of a reference genotype set. In this regard, theGenome STRiP36 genotyping method was developed, a method combininginformation fromRD,RP, SR andhaplotype featuresof population‐scale sequencedata for genotyping (Methods, Supplementary Text). Using this approach wegenerated genotypes for 13,826 autosomal deletions in 156 individuals. Thegenotypes displayed 99.1% concordance with CGH array1 based genotypes(availablefor1,970ofthedeletions),suggestinghighgenotypingaccuracy.
Fig. 3 presents allele frequency analyses based on these genotypes. As expected,common polymorphisms (minor allele frequency (MAF) >5%) were generallysharedacrosspopulations,whilerarealleleswerefrequentlyobservedinonlyonepopulation (Figs. 3ABC). We observed several candidates for monomorphicdeletions (i.e., genomicsegmentsputativelydeleted inall individuals), explainable
7
by rare insertions present in the reference genome or by remaining genotypinginaccuracies(SupplementaryText).
We next assessed the allele frequencies of gene deletions (Fig. 3D). Similar to arecentarray‐basedstudy1,weobservedadepletionofhighfrequencyallelesamongdeletions intersecting with protein‐coding sequence compared to other deletions(P=1.1x10‐11; KS test), consistent with purifying selection keeping most genedeletions at low frequency. Nonetheless, several coding sequence deletions wereobserved with high allele frequency (>80%). Most of these occurred in regionsannotated as segmental duplications, consistent with lessened evolutionaryconstraint infunctionallyredundantgenecategories22. Intriguingly,commongenedeletions also affectedmany unique genes with no obvious paralogs.We furtheranalyzed the abundance of gene deletions in different populations and observedhighlydifferentiatedloci,albeitwithnostatisticallysignificantrelationshipbetweendifferentiation and particular categories of gene overlap, i.e., intronic vs. exonic(SupplementaryText).
By comparing deletion genotypes with genotypes of nearby SNPs, we found,consistentwithearlierstudies1,13,40,thatdeletionsingenomicregionsaccessibletoshort read sequencing display extensive linkage disequilibrium (LD) with SNPs.81% of common deletions had one or more SNPs with which they are stronglycorrelated (r2>0.8; Supplementary Fig. 7). This suggests that many deletionsmapped in our study will be identifiable through tagging SNPs in future studies(SupplementaryText).On the other hand, a fifth of the genotypeddeletionswerenot tagged byHapMap SNPs (a figure similar to the fraction of SNPs that are nottaggedbyHapMapSNPs41),implyingthattheseSVsshouldbegenotypeddirectlyinassociationstudies.Furthermore,theLDpropertiesofcomplexSVs(e.g.,multiallelicSV) have not yet been fully ascertained asmethods for genotyping such SVswithsimilaraccuracyarestillbeingdeveloped.
SVformationmechanismanalysis
Nucleotide resolution breakpoint information enables inference of SV formationmechanisms15,22. Recent studies broadly distinguished between several germlinerearrangement classes, someofwhichmaycomprisemore thanoneSV formationmechanism15,22,42,43:non‐allelichomologousrecombination(NAHR),associatedwithlong sequence similarity stretches around thebreakpoints; rearrangements in theabsenceofextendedsequencesimilarity(abbreviatedas“non‐homologous”orNH),associated with DNA repair by non‐homologous end‐joining (NHEJ) or withmicrohomology‐mediated break‐induced replication (MMBIR); the shrinking orexpansion of variable number of tandem repeats (VNTRs), frequently involvingsimplesequences,byslippage;andMEIs.WedistinguishedamongtheclassesNAHR,NH,VNTR,andMEIbyexaminingthebreakpointjunctionsequenceofSVsinitiallydiscoveredasdeletionsortandemduplicationsrelativetoahumanreference.
8
We first compared theSVs toorthologousprimategenomic regions todistinguishdeletionsfrominsertions/duplicationswithrespecttoreconstructedancestral lociusing the BreakSeq classification approach43. This analysis showed that of the11,254 nucleotide‐resolution SVs discovered as deletions relative to a humanreference, 21% actually represented insertions and 2% represented tandemduplications relative to theputativeancestralgenome.Of the remainingSVs,60%were classified as deletions relative to ancestral sequence, whereas the ancestralstate of 17%was undetermined. By comparison, out of 160nucleotide‐resolutionSVsidentifiedastandemduplicationsrelativetothereferencegenome,91.6%wereclassified as duplications relative to the ancestral genome, whereas the ancestralstate of 8.4% remained undetermined (Supplementary Text). Our breakpointanalysisrevealedthat70.8%ofthedeletionsand89.6%oftheinsertionsexhibitedbreakpoint microhomology/homology ranging from 2‐376 bp in size, withdistributionmodes of 2 bp (attributable to NH) and 15 bp (attributable toMEI),respectively (Fig. 4A, Supplementary Text). As expected42, a small portion of thedeletions (16.1%) displayed non‐template inserted sequences at their breakpointjunctions. By comparison, the tandem duplications showed extensive stretchesdisplaying≥95%sequenceidentityatthebreakpointslinearlycorrelatinginlengthwithSVsize (Fig.4A). Inaddition,most tandemduplicationsdisplayed2‐17bpofmicrohomologyatthebreakpointjunctions(SupplementaryText).
We subsequently applied BreakSeq43 to infer formation mechanisms for all SVsclassified with regard to ancestral state. Using BreakSeq, we inferred NH as thedominatingdeletionmechanism, andMEI as thedominating insertionmechanism(Fig.4BC,SupplementaryTable11).Furthermore,anabundanceofmicrohomologyattandemduplicationbreakpointssuggestedfrequentformationofthisSVclassbyarearrangementprocessactingintheabsenceofhomology(NH).WhenrelatingSVformation to the variant size spectrum, we observedmarked insertion peaks forMEIs at 300 bp, corresponding to Alu elements, and at 6 kb, corresponding toL1/LINEs (Fig. 4C). By comparison, NH and NAHR based mechanisms occurredacross a wide size‐range, whereas VNTR expansion/shrinkage, consistent withearlierfindings1,ledtorelativelysmallSVsizes(Figs.4C,D).
Furthermore, when displaying the genomic distribution of SVs (Fig. 5A), weobservedanotableclusteringofSVsinto‘SVhotspots’.Weanalyzedthisclusteringin detail by examining the distribution of non‐overlapping, adjacent SVs, andobservedamarkedclusteringofSVsformedbyNAHR,VNTR,andNH,respectively,asignalextendingtohundredsofkilobases(Fig.5B).TheclusteringwasinfluencedbyanabundanceofVNTRnearthecentromeres43andNAHRnearthetelomeres(Fig.5A). A significant enrichment of NAHR near recombination hotspots (P=1.3e‐15)and segmental duplications (P=3.1e‐17) further contributed to the clustering(SupplementaryTable13).
To further explore this clustering we devised a segmentation approach forpredictingSVhotspots(Methods),whichyieldedamapof51putativeSVhotspots(SupplementaryTable14).80%of thehotspotsmainly comprisedSVsoriginating
9
fromasingleformationmechanism(Fig.5C).MostofthesecorrespondedtoNAHRhotspots,althoughhotspotsdominatedbyNHandVNTRalsowereevident.Theseobservations suggest that SV formation is frequently associated with the locus‐specificpropensityforgenomicrearrangement.
ConclusionsanddiscussionBy generating an SV set of unprecedented size alongwith breakpoint assembliesand reference genotypes, we demonstrate the suitability of population‐scalesequencing for SV analysis. Nucleotide resolution data allow the construction ofreference datasets andmake SVs readily assessable across different experimentalplatformsusinggenotypingapproaches.Ourfine‐scalemapenabledustoexaminethe functional impact of SVs, as exemplified by our analysis of gene disruptionvariants,whichwillbeofvalueforgenomeandexomesequencingstudies.
Ourmap further enabledus to examine size spectra of SV formationmechanismsandledustoidentifygenomicSVhotspotsthatarecommonlydominatedbyasingleformationmechanism.Recurrentrearrangements,implicatedingenomicdisorders,are hypothesized to be associated with local genome architecture44, e.g., withsegmental duplications that facilitate NAHR. Also, DNA rearrangement in theabsenceofhomology,i.e.,MMBIR,hasbeenimplicatedinrecurrentSVformation8,45.In this regard,we noticed that out of the hotspotswe report, six fall into criticalregions of known genetic disorders associated with recurrent de novo deletions,including Miller‐Dieker syndrome and Leri‐Weill dyschondrosteosis(SupplementaryTable 14). Irrespective of potential disease relevance, or inferredmechanism of formation, our analysis revealed a map of SV hotspots that mayconstitute local centers ofde novo SV formation, consistentwith the concept thatlocalgenomearchitecturecontributestogenomicinstability44.
Our study focused on characterizing deletions, which are often associated withdisease9. Facilitated by ancestral analyses of SV loci, we also characterizedinsertionsandtandemduplications,albeitinlessdetailthandeletions.Companionpapers with more detailed analyses of MEIs, and copy‐number variation withinsegmental duplications are published elsewhere34,46. Of note, most SV discoverymethods depend on mapping reads onto their genomic locus of origin, i.e., the‘accessible’ fraction of the genome, a fraction lessened in segmental duplicationsthatareofhighinteresttoSVanalysis.Nonetheless,owingtotheabilitiesofRPandRD methods in detecting SVs in these regions and in interpreting reads withmultiplemappingpositions,the‘accessible’fractionofthegenomeishigherforSVsthanforSNPs16.Inthefuture,sequencingtechnologiesgeneratinglongerDNAreadswill increase the accessible genome, and will enable the assessment of SVsembeddedinlongrepeatstructures,suchasbalancedinversions.
Our SV resourcewill enable the discovery, genotyping, and imputation of SVs inlarger cohorts. Numerous genomes will be sequenced in the coming months tofacilitate disease association studies. Systematic characterization of SVs in thesegenomeswillbenefitfromtheconceptsanddatasetspresentedhere.
10
MethodsSummarySamplesSequence data for 179 unrelated individuals and six individuals from parent‐offspring trios were obtained as part of the 1000GP. These data were generatedwith Illumina/Solexa, Roche/454, and Life Technologies/SOLiD sequencingtechnologyplatforms.SVdiscoveryandbreakpointassemblyTheSVdiscoverymethodsweappliedcomprisedsixRP,fourRD,threeSR,fourAS,andtwoPDbasedmethods.TIGRA38wasusedfortargetedbreakpointassembly.ExperimentalvalidationWevalidatedSVcallsbyPCR,arrayCGHandSNPmicroarrays, targetedassembly,and custom microarray‐based sequence capture. PCR was performed in variousdifferent laboratories33, CGH analysis was performed based on tiling array dataprovidedbytheGenomeStructuralVariationConsortium(ArrayExpress:E‐MTAB‐40),andSNParrayanalysisbasedondataobtainedfromtheInternationalHapMapConsortium(http://hapmap.ncbi.nlm.nih.gov).GenotypingGenomeSTRiP36wasused fordeletiongenotyping in lowcoveragesequencedata.Initial genotype likelihoods were derived with a Bayesian model and imputationintoaSNPgenotypereferencepanelfromtheHapMap41(Hapmap3r2)wasachievedwithBeagle(v3.1;http://faculty.washington.edu/browning/beagle/beagle.html).SVformationmechanismanalysisSVbreakpointsmappedatnucleotideresolutionwereanalyzedwithBreakSeq43toclassifySVsrelativetoputativeancestrallociandtoinferSVformationmechanisms.SVhotspotsweremappedwithcustomPerlandRscripts.
11
DisplayItemsTable 1. Summary of discovered structural variation
Deletions Tandem Duplications
Mobile element
insertions
Novel sequence insertions
Total
Individual Callsets <10% FDR 11215 501 5371 - 17087
Validated Experimentally* 10810 - - 128 10938
Release set 22025 501 5371 128 28025 *Only tabulates validated calls which were not already present in the individual callsets with <10% FDR
Table 2. Functional impact of our fine resolution SV set. Figures in parentheses indicate numbers of validated SVs per category. We inferred gene overlap with Gencode gene annotation47.
Gene Overlap
SV class Full gene
overlap
Coding exon
affected (partial)
UTR overlap Intron overlap
Total Gene
overlap
Total Inter-genic
Deletions 654 (631)
1093 (1031)
315 (290)
7319 (6481)
9381 (8433)
12644 (10386)
Tandem duplications
2 (2)
7 (6)
9 (5)
197 (62)
215 (75)
286 (76)
Mobile element insertions -
3 (-)
36 (-)
1304 (97)
1348 (112)
4023 (758)
Novel sequence insertions - -
2 (2)
49 (49)
51 (51)
77 (77)
Sum 656 (633)
1119 (1040)
351 (309)
8869 (6689)
10995 (8671)
17030 (11280)
12
FigureLegends
Figure 1. SV discovery and genotyping in population scale sequence data.A.Schematicdepictingthedifferentmodes(i.e.,approaches)ofsequencebasedSVdetection we used. The RP approach assesses the orientation and spacing of themappedreadsofpaired‐endsequences14,15(readsaredenotedbyarrows); theRDapproach evaluates the read depth‐of‐coverage25,26; the SR approach maps theboundaries (breakpoints) of SVs by sequence alignment28,29; the AS approachassembles SVs30,31,32. B.Integrated pipeline for SV discovery, validation, andgenotyping. Colored circles represent individual SV discovery methods (listed inSupplementary Table 1), with modes indicated by a color scheme: green=RP;yellow=RD; purple=SR; red=AS; green and yellow=methods evaluatingRPand RD(abbreviated as ‘PD’).C.Example of a deletion, previously associatedwith BMI35,identified independently with RP (green), RD (yellow), and SR (red) methods.TargetedassemblyconfirmedthebreakpointsdetectedbySR.
Figure2. Comparative assessment of deletiondiscoverymethods.A.Deletionsize‐rangeascertainedbydifferentmodesofSVdiscovery.Threegroupsarevisible,withASandSR,PDandRP,aswellasRDand‘RL’(RPanalysisinvolvingrelativelylong range (≥1 kb) insert size libraries, resulting in a different deletion detectionsize range compared to the predominantly used <500kb insert size libraries),respectively,ascertainingsimilarsize‐ranges.PiechartsdisplaythecontributionofdifferentSVdiscoverymodestothereleaseset.Outerpie=basedonnumberofSVcalls; inner pie = based on total number of variable nucleotides. Of note, not allapproaches were applied across all individuals (see Supplementary Table 2).B.SensitivityandFDRestimatesforindividualdeletiondiscoverymethodsbasedongoldstandardsets for individualssequencedathigh(NA12878)and low‐coverage(NA12156),respectively.AlldepictedestimatesaresummarizedinSupplementaryTables 3, 4, 6. Vertical dotted lines correspond to the specificity threshold(FDR≤10%).C.Breakpointmappingresolutionofthreedeletiondiscoverymethods(the respective method names are in Supplementary Table 2). The blue and redhistograms are the breakpoint residuals for predicted deletion start and endcoordinates,respectively,relativetoassembledcoordinates(hereassessedin low‐coveragedata).Thehorizontallinesatthetopofeachplotmarkthe98%confidenceintervals (labeled foreachpanel),withverticalnotches indicating thepositionsofthemostprobablebreakpoint(thedistributionmode).
Figure 3. Analysis of deletion presence and absence in two populations.AC.Deletionallelefrequenciesandobservedsharingofallelesacrosspopulations,displayedfordeletionsdiscoveredintheCEU,YRI,andJPT+CHBpopulationsamplesintermsofstackedbars.D.Allelefrequencyspectrafordeletionsintersectingwithintergenic(blue),intronic(yellow),andprotein‐codingsequences(red).
Figure4. Contributionof SV formationmechanisms to the SV size spectrum.A.Breakpointjunctionhomology/microhomologylengthplottedasafunctionofSVsizeforSVsoriginallyidentifiedasdeletionscomparedtoahumanreference.Dots
13
arecoloredaccordingtotheSVs’classificationasdeletions,insertions/duplications,or “undetermined” relative to inferred ancestral genomic loci. Gray lines markgroups of SVs likely formed by a common formation mechanism. The diagonalhighlights tandemduplications (and few reciprocal deletion events), inwhich thelengthoftheduplicatedsequencecorrelateslinearlywiththelengthofthelongestbreakpoint junction sequence identity stretch. The ellipses indicateMEIs, i.e.,Alu(~300bp)andL1(~6kb)insertions,associatedwithtargetsiteduplicationsofupto28bpinsizeatthebreakpoints.ThehorizontalgroupcorrespondsmostlytoNH‐associateddeletionswith<10bpmicrohomologyatthebreakpoints.Theremaining(ungrouped)SVscomprisetruncatedMEIs,VNTRexpansionandshrinkageevents,aswellasNAHR‐associateddeletionsandduplications.B.RelativecontributionsofSVformationmechanismsinthegenome.NumbersofSVsaredisplayedontheouterpiechartandaffectedbasepairsontheinner.Leftpanel:SVsclassifiedasdeletionsrelative to ancestral loci. Right panel: SVs classified as insertions/duplications.C.Size spectra of deletions classified relative to ancestral loci. D.Size spectra ofinsertions/duplications.
Figure5.MappinghotspotsofSVformationinthegenome.A.DistributionofSVson chromosome 10 (“chr10”). Above the ideogram, colored bars indicate SVformationmechanisms(samecolorschemeasinBandC);barlengthsrelatetothelogarithmofSV size.Below the ideogram,bar lengthsaredirectlyproportional toallele frequencies.Arrows indicate an SVhotspotnear the centromereunderlyingmainly VNTR, and several hotspots near the telomeres underlying mainly NAHRevents. B.Enrichment of SVs inferred to be formed by the same formationmechanism for different genomic window sizes. Displayed is an enrichment ofnearby,non‐overlappingSVs formedby thesamemechanismrelative toanSVsetwhere mechanism assignments are shuffled randomly. C.SV hotspots are mostlydominated by a single formationmechanism. Colored bars depict numbers of SVhotspotsinwhichatleast50%ofthevariantswereinferredtobeformedbyasingleformation mechanism. The average abundance of NAHR‐classified SVs in NAHRhotspotswas70%(comparedwith77%forVNTR‐hotspots;69%forNH).Thegraybar(“mixed”)correspondstoSVhotspotswithnosinglemechanismdominating.
14
References1 Conrad,D.F.etal.Originsandfunctionalimpactofcopynumbervariationin
thehumangenome.Nature464,704‐712(2010).2 Pinto, D. et al. Functional impact of global rare copy number variation in
autismspectrumdisorders.Nature466,368‐372(2010).3 Sebat, J. et al. Strong association of de novo copy number mutations with
autism.Science316,445‐449(2007).4 Stefansson, H. et al. Large recurrent microdeletions associated with
schizophrenia.Nature455,232‐236(2008).5 McCarthy, S. E. et al. Microduplications of 16p11.2 are associated with
schizophrenia.NatGenet41,1223‐1227(2009).6 Craddock,N.etal.Genome‐wideassociationstudyofCNVsin16,000casesof
eight common diseases and 3,000 shared controls. Nature 464, 713‐720,(2010).
7 McCarroll, S. A. et al. Deletion polymorphismupstreamof IRGMassociatedwith altered IRGMexpression andCrohn's disease.NatGenet,40, 1107‐12(2008).
8 Hastings,P.J.,Lupski,J.R.,Rosenberg,S.M.&Ira,G.Mechanismsofchangeingenecopynumber.NatRevGenet10,551‐564(2009).
9 Stankiewicz,P.&Lupski,J.R.Structuralvariationinthehumangenomeanditsroleindisease.AnnuRevMed61,437‐455(2010).
10 Sebat,J.etal.Large‐scalecopynumberpolymorphisminthehumangenome.Science305,525‐528(2004).
11 Iafrate,A.J.etal.Detectionoflarge‐scalevariationinthehumangenome.NatGenet36,949‐951(2004).
12 Sharp, A. J. et al. Segmental duplications and copy‐number variation in thehumangenome.AmJHumGenet77,78‐88(2005).
13 McCarroll,S.A.etal.Integrateddetectionandpopulation‐geneticanalysisofSNPsandcopynumbervariation.NatGenet40,1166‐1174(2008).
14 Tuzun, E. et al. Fine‐scale structural variation of the human genome. NatGenet37,727‐732(2005).
15 Korbel,J.O.etal.Paired‐endmappingrevealsextensivestructuralvariationinthehumangenome.Science318,420‐426(2007).
16 Alkan, C. et al. Personalized copy number and segmental duplicationmapsusingnext‐generationsequencing.NatGenet41,1061‐1067(2009).
17 Chen, K. et al. BreakDancer: an algorithm for high‐resolution mapping ofgenomicstructuralvariation.NatMethods6,677‐681(2009).
18 Hormozdiari, F., Alkan, C., Eichler, E. E. & Sahinalp, S. C. Combinatorialalgorithms for structural variation detection in high‐throughput sequencedgenomes.GenomeRes19,1270‐1278(2009).
19 Medvedev, P., Stanciu, M. & Brudno, M. Computational methods fordiscovering structural variation with next‐generation sequencing. NatMethods6,S13‐20(2009).
15
20 McKernan,K. J.etal.Sequenceandstructuralvariation inahumangenomeuncoveredby short‐read,massively parallel ligation sequencingusing two‐baseencoding.GenomeRes19,1527‐1541(2009).
21 Chiang,D.Y.etal.High‐resolutionmappingofcopy‐numberalterationswithmassivelyparallelsequencing.NatMethods6,99‐103(2009).
22 Kidd, J.M. etal.Mappingand sequencingof structural variation fromeighthumangenomes.Nature453,56‐64(2008).
23 Lee,S.,Cheran,E.&Brudno,M.Arobustframeworkfordetectingstructuralvariationsinagenome.Bioinformatics24,i59‐67(2008).
24 Pang,A.W. et al. Towards a comprehensive structural variationmapof anindividualhumangenome.GenomeBiol11,R52(2010).
25 Bailey, J. A. et al. Recent segmental duplications in the human genome.Science297,1003‐1007(2002).
26 Campbell,P.J.etal.Identificationofsomaticallyacquiredrearrangementsincancer using genome‐wide massively parallel paired‐end sequencing. NatGenet40,722‐729(2008).
27 Yoon, S., Xuan, Z., Makarov, V., Ye, K. & Sebat, J. Sensitive and accuratedetectionofcopynumbervariantsusingreaddepthofcoverage.GenomeRes19,1586‐1592(2009).
28 Mills,R.E.etal.Aninitialmapofinsertionanddeletion(INDEL)variationinthehumangenome.GenomeRes16,1182‐1190(2006).
29 Ye,K.,Schulz,M.H.,Long,Q.,Apweiler,R.&Ning,Z.Pindel:apatterngrowthapproach to detect break points of large deletions and medium sizedinsertions from paired‐end short reads. Bioinformatics 25, 2865‐2871,(2009).
30 Simpson,J.T.etal.ABySS:aparallelassemblerforshortreadsequencedata.GenomeRes19,1117‐1123(2009).
31 Hajirasouliha, I. et al. Detection and characterization of novel sequenceinsertions using paired‐end next‐generation sequencing.Bioinformatics26,1277‐1283(2010).
32 Li,R.etal.Thesequenceanddenovoassemblyofthegiantpandagenome.Nature463,311‐317(2010).
33 The‐1000‐Genomes‐Project‐Consortium.Amapofhumangenomevariationfrompopulation‐scalesequencing.Nature467,1061‐1073(2010).
34 Sudmant, P. H. et al. Diversity of human copy number variation andmulticopygenes.Science330,641‐646(2010).
35 Willer, C. J. & Willer, C. J. Six new loci associated with body mass indexhighlightaneuronal influenceonbodyweightregulation.NatGenet41,25‐34(2009).
36 Handsaker, R. E., Korn, J. M., Nemesh, J. & McCarroll, S. A. Discovery andgenotyping of genome structural polymorphism by sequencing on apopulationscale.submitted.
37 Levy,S.etal.Thediploidgenomesequenceofanindividualhuman.PLoSBiol5,e254(2007).
38 Chen,L.etal.TIGRAlocaltargetedassemblyofstructuralvariants.submitted(2010).
16
39 Hasin‐Brumshtein, Y., Lancet, D. & Olender, T. Human olfaction: fromgenomic variation to phenotypic diversity. Trends Genet 25, 178‐184,(2009).
40 Hinds,D.A.,Kloek,A.P., Jen,M.,Chen,X.&Frazer,K.A.CommondeletionsandSNPsareinlinkagedisequilibriuminthehumangenome.NatGenet38,82‐85,(2006).
41 Altshuler, D. M. et al. Integrating common and rare genetic variation indiversehumanpopulations.Nature467,52‐58,(2010).
42 Conrad,D.F.etal.MutationspectrumrevealedbybreakpointsequencingofhumangermlineCNVs.NatGenet42,385‐391(2010).
43 Lam, H. Y. et al. Nucleotide‐resolution analysis of structural variants usingBreakSeqandabreakpointlibrary.NatBiotechnol28,47‐55(2010).
44 Lupski,J.R.Genomicdisorders:structuralfeaturesofthegenomecanleadtoDNA rearrangements and human disease traits.Trends Genet14, 417‐422,(1998).
45 Lee, J. A., Carvalho, C. M. & Lupski, J. R. A DNA replicationmechanism forgeneratingnonrecurrentrearrangementsassociatedwithgenomicdisorders.Cell131,1235‐1247(2007).
46 Stewart, C. et al. A comprehensive map of mobile element insertionpolymorphismsinhumans.inpreparation.
47 Harrow, J. et al. GENCODE: producing a reference annotation for ENCODE.GenomeBiol7Suppl1,S41‐9(2006).
Acknowledgements:Wewould like to acknowledgeClaireHardy,Richard Smith,AnniekDeWitte,andShaneGilesfortheirassistancewithvalidation.M.A.B’sgroupwassupportedbygrantsfromtheNationalInstitutesofHealth(RO1GM59290)andG.T.M’s group by grants R01 HG004719 and RC2 HG005552, also from the NIH.J.O.K.’s group was supported by an Emmy Noether Fellowship of the GermanResearch Foundation (Deutsche Forschungsgemeinschaft). J.W.’s group wassupported by the National Basic Research Program of China (973 program no.2011CB809200), the National Natural Science Foundation of China (30725008;30890032;30811130531;30221004), theChinese863program (2006AA02Z177;2006AA02Z334; 2006AA02A302; 2009AA022707), the Shenzhen MunicipalGovernment of China (grants JC200903190767A; JC200903190772A;ZYC200903240076A; CXB200903110066A; ZYC200903240077A;ZYC200903240076AandZYC200903240080A),andtheOleRømergrant fromtheDanishNaturalScienceResearchCouncil.C.L.’sgroupwassupportedbygrantsfromthe National Institutes of Health: P41 HG004221, RO1 GM081533, and UO1HG005209 andX.S.was supported by a T32 fellowship award from theNIH. Wethank the Genome Structural Variation Consortium(http://www.sanger.ac.uk/humgen/cnv/42mio/) and the International HapMapConsortium for making available microarray data. The authors acknowledge theindividuals participating in the 1000 Genomes Project by providing samples,includingTheYorubapeopleof Ibadan,Nigeria, the community atBeijingNormalUniversity,thepeopleofTokyo,Japan,andthepeopleoftheUtahCEPHcommunity.
17
Furthermore, we thank Richard Durbin and Lars Steinmetz for comments on themanuscript.
Author Contributions: The authors contributed this study at different levels, asdescribed in the following.SVdiscovery:K.W.,C.S.,R.H.,K.C.,C.A.,A.A.,S.C.Y.,R.K.C.,A.C.,Y.F.,I.H.,F.H.,Z.I.,D.K.,R.L.,Y.L.,C.L.,R.L.,X.J.M.,H.E.P.,L.D.,G.T.M.,J.S.,J.W.,K.Y.,K.Y., E.E.E., M.B.G., M.E.H., S.A.M., and J.O.K. SVvalidation: R.E.M., K.W., K.C., A.A.,S.C.Y.,F.G.,M.K.K.,J.K.,J.N.,A.E.U.,X.S.,A.M.S.,J.A.W.,Y.Z.,Z.Z.,M.A.B.,J.S.,M.S.,M.E.H.,C.L,J.O.K.SVgenotyping:K.W.,R.H.,M.E.H,andS.A.M.Dataanalysis:R.E.M.,C.S.,C.A.,A.A.,R.H.,K.C.,S.C.Y.,R.K.C.,A.C.,D.C.,Y.F.,F.H.,L.M.I.,Z.I.,J.M.K.,M.K.K.,S.K.,J.K.,E.K.,D.K.,H.Y.K.L., J.L.,R.L., Y.L., C.L.,R.L., X.J.M., J.N.,H.E.P.,T.R.,A.S., X.S.,M.P.S., J.A.W.,J.W.,Y.Z.,Z.Z.,M.A.B.,L.D.,G.T.M.,G.M.,J.S.,M.S.,J.W.,K.Y.,K.Y.,E.E.E.,M.B.G.,M.E.H.,C.L,S.A.M.,andJ.O.K.Preparationofmanuscriptdisplayitems:R.E.M.,K.W.,C.S.,C.A.,A.A.,R.H.,S.C.Y.,L.M.I.,S.K.,E.K.,M.K.K.,X.J.M.,X.S.,J.A.W.,M.B.G.,S.A.M.,andJ.O.K.CochairsoftheStructuralVariationAnalysisgroup:E.E.E.,M.E.H.,andC.L.Thefollowingwereleadingcontributorstotheanalysisdescribedinthispaperandthereforeshouldbeconsideredjointfirstauthors:R.E.M.,K.W.,C.S.,R.H.,K.C.,C.A.,A.A.,S.C.Y,andK.Y.The following equally contributed to directing the described analyses andparticipatinginthedesignofthestudyandshouldbeconsideredjointseniorauthors:E.E.E, M.B.G., M.E.H., C.L, S.A.M., and J.O.K. The manuscript was written by thefollowingauthors:R.E.M.andJ.O.K.
Data retrieval: The data sets described here can be obtained from the 1000Genomes Project website at www.1000genomes.org (July 2010 Data Release).Individual SV discovery methods can be obtained from sources mentioned inSupplementary Table 1, or upon request from the authors. Abbreviations used inthispaperaresummarizedintheSupplementaryText.
Application of diverse SV discovery methods
Validation of SVs (deletions, duplications and
insertions)
Targeted SV breakpoint assembly (focused on
deletions)
Precision-aware merging of discovered SVs
Release set (algorithms & extensive
validations)inclusion of SVs inferred with individual methods(criterion: FDR<10%),
followed by validation-awareSV inclusion
Algorithm-centric set(algorithms & sparse
validations)inclusion of SVs inferred with individual methods,and such with evidence
from >2 methods(criterion: FDR<10%)
SV discovery set
Genotyping (focused on deletions)
Deletion (Del), Duplication (Dup), and Insertion (Ins)
RP
RD
SR
AS
PD
ReferenceSample genome
Del Dup Ins
Reference-supportingSV-supporting read-pair (RP)SV-supporting read-depth (RD)
SV-supporting read for split-read analysis (SR) or assembly (AS)Mobile element insertion support
MEI
MEI
a b
c
6070
8090
100
6070
8090
100
72.52 72.54 72.56 72.58 72.60Chromosome 1 position (in Mb)
DN
A re
ad m
appi
ng q
ualit
y sc
ores
NA
1924
0 (Y
RI)
NA
1287
8 (C
EU
)
Dep
th o
f cov
erag
e44
3322
2211
011
044
33
AluLINE
NEGR1Del
0 1000 2000 3000 4000 5000 6000 7000
0.00
000.
0005
0.00
100.
0015
0.00
20
Deletion size (bp)
Den
sity
ASSRPD
RPRLRD
Releaseset
1.3%
19.2%
35.3%17.9%
2.0%
24.3%0.5%
17.4%
49.7%
10.6%
0.6%
21.2%
0.0
0.2
0.4
0.6
0.8
1.0
0% 20% 40% 60% 80% 100%
0
0.2
0.4
0.6
0.8
1
0%20%40%60%80%100%
FDR
FDR
TrioLow co
verag
e
Sen
sitiv
ity
Sensitivity
a b
−10 −5 0 100
1000
2000
3000
4000
5000 SR (LN)n = 5375
19 bp
18 bp
5 −100 −50 0 50 1000
50
100
150
200
250
300 RP (SI)n = 5229
210 bp
220 bp
−500 −250 0 250 5000
10
20
30
40
50RD (YL)n = 501
700 bp
900 bp
Freq
uenc
y
O�set from breakpoint (bp)
c
0 0.2 0.4 0.6 0.8 1Alternate allele frequency
0
500
1000
1500
2000
2500
Num
ber o
f SV
s JPT+CHB
0 0.2 0.4 0.6 0.8 1Alternate allele frequency
0
500
1000
1500
2000
2500N
umbe
r of S
Vs observed also in YRI
observed also in JPT+CHBshared among allobserved only in CEU
CEU
0 0.2 0.4 0.6 0.8 1Alternate allele frequency
0
500
1000
1500
2000
2500
Num
ber o
f SV
s YRI
0 0.2 0.4 0.6 0.8 1Average alternate allele frequency
0
1
2
3
4
Log 10
(num
ber o
f SV
s)
intersect with intronsintergenic
intersect with CDS
a b
c d
observed also in CEUobserved also in JPT+CHBshared among allobserved only in YRI
observed also in CEUobserved also in YRIshared among allobserved only in JPT+CHB
100 300 500 700 900
50
100
150
200
250
300
SV length (bp)
L
engt
h of
long
est s
eque
nce
sim
ilarit
y s
tretc
h at
the
SV
bre
akpo
int j
unct
ion
(bp)
insertions/duplicationsdeletionsundetermined
4000 6000 8000
Size of deletion
Num
ber o
f del
etio
ns
100bp 1kb 10kb 100kb
a
0
200
400
600
800
1000
1200
1400
1600
1800
Size of insertion/duplication
Num
ber o
f Ins
ertio
ns/d
uplic
atio
ns
100bp 1kb 10kb 100kb
LINE
Alu
0
200
400
600
800
1000
1200
1400
1600
1800
UnclassifiedMEIVNTRNAHRNH
c
db
393kb
89kb
1Mb
1,496
4,500
245 272
12Mb
24Mb
226
79
122
1,994
NAHRNHRVNTRMEI
21kb
164kb
UnclassifiedMEIVNTRNAHRNH
200b
p
500b
p
1kb
2kb
5kb
10kb
20kb
50kb
100k
b
200k
b
500k
b
1Mb
NAHRNHMEIVNTRControl (NH vs. NAHR)
Enr
ichm
ent
(clu
ster
ing
of S
V
form
atio
n pr
oces
s)
Dep
letio
n(n
o cl
uste
ring)
10
1
0.1
chr10
Num
bers
of g
enom
ic S
V h
otsp
ots
(col
or :
dom
inat
ed b
y si
ngle
mec
hani
sm)
0
5
10
15
20
25
30
NAHR NH VNTR MEI mixed
a
b c