mapping copy number variation by population scale genome … · 2015. 2. 19. · mapping copy...

23
Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page) The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters. Citation Mills, Ryan E., Robert E. Handsaker, Joshua Korn, James Nemesh, Xinghua Shi, Charles Lee, Steven A. McCarroll et al. 2011. Mapping copy number variation by population scale genome sequencing. Nature 470(7332): 59-65. Published Version doi:10.1038/nature09708 Accessed February 19, 2015 9:02:09 AM EST Citable Link http://nrs.harvard.edu/urn-3:HUL.InstRepos:5339501 Terms of Use This article was downloaded from Harvard University's DASH repository, and is made available under the terms and conditions applicable to Open Access Policy Articles, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of- use#OAP

Upload: others

Post on 20-Sep-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Mapping Copy Number Variation by Population Scale Genome … · 2015. 2. 19. · Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page)

Mapping Copy Number Variation by Population Scale GenomeSequencing

(Article begins on next page)

The Harvard community has made this article openly available.Please share how this access benefits you. Your story matters.

Citation Mills, Ryan E., Robert E. Handsaker, Joshua Korn, JamesNemesh, Xinghua Shi, Charles Lee, Steven A. McCarroll et al.2011. Mapping copy number variation by population scale genomesequencing. Nature 470(7332): 59-65.

Published Version doi:10.1038/nature09708

Accessed February 19, 2015 9:02:09 AM EST

Citable Link http://nrs.harvard.edu/urn-3:HUL.InstRepos:5339501

Terms of Use This article was downloaded from Harvard University's DASHrepository, and is made available under the terms and conditionsapplicable to Open Access Policy Articles, as set forth athttp://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#OAP

Page 2: Mapping Copy Number Variation by Population Scale Genome … · 2015. 2. 19. · Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page)

1

Mapping copy number variation by population scale genome sequencing Ryan E. Mills1,*, Klaudia Walter2,*, Chip Stewart3,*, Robert E. Handsaker4,*, KenChen5,*, Can Alkan6,7,*, Alexej Abyzov8,*, Seungtai Chris Yoon9,*, Kai Ye10,*, R. KeiraCheetham11,AsifChinwalla5,DonaldF.Conrad2,YutaoFu12,FabianGrubert13.,ImanHajirasouliha14, Fereydoun Hormozdiari14, Lilia M. Iakoucheva15, Zamin Iqbal16,ShuliKang15, JeffreyM.Kidd6,MiriamK.Konkel17, JoshuaKorn4, EktaKhurana8,18,Deniz Kural3, Hugo Y. K. Lam13, Jing Leng8, Ruiqiang Li19, Yingrui Li19, Chang‐YunLin20,RuibangLuo19,XinmengJasmineMu8,JamesNemesh4,HeatherE.Peckham12,Tobias Rausch21, Aylwyn Scally2, Xinghua Shi1, Michael P. Stromberg3, Adrian M.Stütz21,AlexanderEckehartUrban13,JerilynA.Walker17,JiantaoWu3,YujunZhang2,ZhengdongD.Zhang8,MarkA.Batzer17,LiDing5,22,GaborT.Marth3,GilMcVean23,Jonathan Sebat15,Michael Snyder13, JunWang19,24, Kenny Ye20, Evan E. Eichler6,7,*,MarkB.Gerstein8,18,25,*,MatthewE.Hurles2,*,CharlesLee1,*,StevenA.McCarroll4,26,*,andJanO.Korbel21,*,@

forthe1000GenomesProject#

1.DepartmentofPathology,BrighamandWomen’sHospitalandHarvardMedicalSchool,Boston,MA2.TheWellcomeTrustSangerInstitute,WellcomeTrustGenomeCampus,Hinxton,Cambridge,CB101SAUK.3.DepartmentofBiology,BostonCollege,Boston,MA4.BroadInstituteofHarvardandMassachusettsInstituteofTechnology,Cambridge,MA5.TheGenomeCenteratWashingtonUniversity,St.Louis,MO6.DepartmentofGenomeSciences,UniversityofWashingtonSchoolofMedicine,Seattle,WA7.HowardHughesMedicalInstitute,UniversityofWashington,Seattle,Washington,USA.8.PrograminComputationalBiologyandBioinformatics,YaleUniversity,NewHaven,CT9.SeaverAutismCenterandDepartmentofPsychiatry,MountSinaiSchoolofMedicine,NewYork,NY10.DepartmentsofMolecularEpidemiology,MedicalStatisticsandBioinformatics,LeidenUniversityMedicalCenter,Leiden,theNetherlands11.IlluminaCambridgeLtd,ChesterfordResearchPark,LittleChesterford,EssexCB101XL,UK12.LifeTechnologies,Beverly,MA13.DepartmentofGenetics,StanfordUniversity,Stanford,CA14.SchoolofComputingScience,SimonFraserUniversity,Burnaby,BritishColumbia,Canada.15.DepartmentofPsychiatry,DepartmentofCellularandMolecularMedicine,InstituteforGenomicMedicine,UniversityofCalifornia,SanDiego,LaJolla,CA16.WellcomeTrustCentreforHumanGenetics,UniversityofOxford,OX37BN,UK17.DepartmentofBiologicalSciences,LouisianaStateUniversity,BatonRouge,Louisiana18.MolecularBiophysicsandBiochemistryDepartment,YaleUniversity,NewHaven,CT19.BGI‐Shenzhen,Shenzhen518083,China20.AlbertEinsteinCollegeofMedicine,Bronx,NY21.GenomeBiologyResearchUnit,EuropeanMolecularBiologyLaboratory,Heidelberg,Germany22.DepartmentofGenetics,WashingtonUniversity,St.Louis,MO23.DepartmentofStatistics,UniversityofOxford,OX37BN,UK24.DepartmentofBiology,UniversityofCopenhagen,Copenhagen,Denmark25.DepartmentofComputerScience,YaleUniversity,NewHaven,CT26.DepartmentofGenetics,HarvardMedicalSchool,Boston,MA*Theseauthorscontributedequallytothiswork.@CorrespondenceshouldbeaddressedtoJ.O.K.([email protected]).#ListsofparticipantsandaffiliationsappearinSupplementaryInformation.

Page 3: Mapping Copy Number Variation by Population Scale Genome … · 2015. 2. 19. · Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page)

2

Summary Genomic structural variants (SVs) are abundant in humans, differing from other variation classes in extent, origin, and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (i.e., copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analyzing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serves as a resource for sequencing-based association studies.

Page 4: Mapping Copy Number Variation by Population Scale Genome … · 2015. 2. 19. · Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page)

3

Introduction

Unbalanced structural variants (SVs), or copy number variants (CNVs), involvinglarge‐scaledeletions,duplications,andinsertionsformoneoftheleastwellstudiedclasses of genetic variation. The fraction of the genome affected by SVs iscomparatively largerthanthataccountedforbysinglenucleotidepolymorphisms1(SNPs),implyingsignificantconsequencesofSVsonphenotypicvariation.SVshavealreadybeenassociatedwithdiversediseases, includingautism2,3,schizophrenia4,5and Crohn’s disease6,7. Furthermore, locus‐specific studies suggest that diversemechanisms may form SVs de novo, with some mechanisms involving complexrearrangementsresultinginmultiplechromosomalbreakpoints8,9.

Initialmicroarray‐based SV surveys focused on large gains and losses10,11,12, withrecentadvancesinarraytechnologywideningtheaccessiblesizespectrumtowardssmaller SVs1,13. Microarray‐based surveys commonlymapped SVs to approximategenomiclocations.However,adetailedSVcharacterization,includinganalysesofSVorigin and impact, requires knowledge of precise SV sequences. Advances insequencing technology have enabled applying sequence‐based approaches formappingSVsatfine‐scale14,15,16,17,18,19,20,21.Theseapproachesinclude:(i)paired‐endmapping (or read pair ‘RP’ analysis) based on sequencing and analysis ofabnormally mapping pairs of clone ends14,22,23,24 or high‐throughput sequencingfragments15,17,18;(ii)read‐depth(‘RD’)analysis,whichdetectsSVsbyanalyzingtheread depth‐of‐coverage16,21,25,26,27; (iii)split‐read (‘SR’) analysis, which evaluatesgappedsequencealignmentsforSVdetection28,29;and(iv)sequenceassembly(‘AS’),which enables the fine‐scale discovery of SVs, including novel (non‐reference)sequence insertions30,31,32. Sequence‐based SVdiscovery approacheshave thus farbeen applied to a limited (<20) number of genomes, leaving the fine‐scalearchitectureofmostcommonSVsunknown.

Sequence data generated by the 1000 Genomes Project (1000GP) provide anunprecedented opportunity to generate a comprehensive SV map. The 1000GPrecentlygenerated4.1Terabasesofrawsequenceinpilotprojectstargetingwholehumangenomes33 (SupplementaryTable1).Thesestudiescompriseapopulation‐scale project, termed ‘low‐coverage project’, in which 179 unrelated individualsweresequencedwithanaveragecoverageof3.6X–including59YorubaindividualsfromNigeria(YRI),60individualsofEuropeanancestryfromUtah(CEU),30ofHanancestry from Beijing (CHB), and 30 of Japanese ancestry from Tokyo (JPT; thelattertwowere jointlyanalyzedas JPT+CHB). Inaddition,ahigh‐coverageproject,termed the ‘trio project’, was carried out, with individuals of a CEU and a YRIparent‐offspringtriosequencedto42Xcoverageonaverage.

We report here the results of analyses undertaken by the Structural VariationAnalysisGroupof the1000GP.Thegroup’sobjectiveswere todiscover, assemble,genotype,andvalidateSVsof50bpand larger insize,andtoassessandcomparedifferent sequence‐based SV detection approaches. The focus of the group wasinitiallyondeletions,avariant classoftenassociatedwithdisease9, forwhichrich

Page 5: Mapping Copy Number Variation by Population Scale Genome … · 2015. 2. 19. · Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page)

4

controldatasetsanddiverseascertainmentapproachesexist1,13,22,28.Lessfocuswasplaced on insertions and duplications34 and none on balanced SV forms (such asinversions).Specifically,weappliednineteenmethodstogenerateanSVdiscoveryset.Wefurthergeneratedreferencegenotypesformostdeletions,assessedtheSVs’functional impact, and stratified SV formationmechanismwith respect to variantsizeandgenomiccontext.

PredictionofSVcandidatelociandassessmentofdiscoverymethods

WeincorporatedtheSVdiscoverymethodsintoapipeline(Fig.1AB),withthegoalof ascertaining different SV types and assessing each method for its ability todiscoverSVs.ThemethodsdetectedSVsbyanalyzingRD,RP,SR,andASfeatures,orby combining RP andRD features (abbreviated as ‘PD’). Altogetherwe generatedthirty‐sixSVcallsetsbyapplyingthemethodsontrioandlow‐coveragedata,andbyidentifyingSVsasgenomicdifferencesrelativetoahumanreference,correspondingto the reference genome, or to a set of individuals (i.e. population reference;Supplementary Table 2). We initially identified SVs as deletions, tandemduplications, novel sequence insertions, and mobile element insertions (MEIs)relative to the human reference. Subsequent comparative analyses involvingprimategenomesenabledustoclassifySVsasdeletions,duplications,orinsertionsrelative to inferred ancestral genomic loci, reflectingmechanismsof SV formation(seebelow).DNAreadsanalyzedbySVdiscoverymethodswereinitiallymappedtothehumanreferencegenomeusingavarietyofalignmentalgorithms.Mostofthesealgorithmsmappedeachreadtoasinglegenomicposition,althoughonealgorithm(mrFAST16) also considered alternativemappingpositions for reads aligning ontorepetitive regions (see Supplementary Tables 2‐4 formethod‐specific parametersand full SV callsets). We filtered each callset by excluding SVs <50bp, which arereported elsewhere33. Many SVs exhibited support from distinct SV discoverymethods, as exemplified by a common deletion, previously associatedwith body‐mass index35 (BMI), that we identified with RP, RD, and SR methods (Fig. 1C).Nonetheless, we observed notable differences between methods (Fig. 2ABC) interms of genomic regions ascertained (Supplementary Fig. 1), accessible SV size‐range(Fig.2A),andbreakpointprecision(Fig.2C,SupplementaryFig.2).

To estimate callset specificity, we carried out extensive validations (Methods),including PCRs for over 3,000 candidate loci, and microarray data analyses for50,000 candidate loci (Supplementary Tables 3, 4; Supplementary Fig. 3). Wecombined PCR and array‐based analysis results to estimate false discovery rates(FDRs),andfoundthateightcallsets(threedeletion,fourinsertion,andonetandemduplication callset) met the pre‐specified specificity threshold33 (FDR≤10%),whereastheothercallsetsyieldedlowerspecificity(FDRsof13%‐89%).

Wefurtherassessedthesensitivityofdeletiondiscoverymethodsbycollatingdatafrom four earlier surveys1,13,22,28 into a gold standard (Methods, SupplementaryTables 5, 6, and Supplementary Fig. 4A), and specifically assessing the detectionsensitivityforanindividualsequencedathigh‐coverage(NA12878)aswellasforan

Page 6: Mapping Copy Number Variation by Population Scale Genome … · 2015. 2. 19. · Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page)

5

individualsequencedatlow‐coverage(NA12156).Unsurprisingly,giventhetypicaltrade‐off between sensitivity and specificity, in the trios the highest sensitivitieswereachievedbyRDandRPmethodswithFDR>10%(Fig.2B).Bycomparison, inthe low‐coverage data, the individual method with the greatest accuracy(FDR=3.7%)was the secondmost sensitive basedon our gold standard (Fig. 2B),and the most sensitive when expanding the gold standard to a larger set ofindividuals(SupplementaryFig.4B).Thismethod,GenomeSTRiP(tobedescribedelsewhere36), integrated bothRP andRD features (PD), implying that consideringdifferentevidencetypescanimproveSVdiscovery.

Constructionofahigh­confidenceSVdiscoveryset

To construct our SV discovery set (“release set”), we joined calls from differentdiscoverymethodscorrespondingtothesameSVwithamergingapproachthatwasawareofeachcallset’sprecision inSVbreakpointdetection(SupplementaryFig.5andMethods). Most SVs in the release set (61%)were contributed by individualmethodsmeeting thepre‐definedspecificity threshold(FDR≤10%).Theremaining39%ofcallswerecontributedbylowerspecificitymethodsfollowingexperimentalvalidation. Altogether, the release set comprised 22,025 deletions, 501 tandemduplications,5,371MEIs,and128non‐referenceinsertions(Table1,SupplementaryTable 7). With our gold standard we estimated an overall sensitivity of deletiondiscoveryof82%inthetrios,and69%inlow‐coveragesequence(Fig.2B)usinga1bp overlap criterion. When instead applying a stringent 50% reciprocal overlapcriterion for sensitivity assessment (which requiredSV sizes inferredondifferentexperimental platforms to be in close agreement) our sensitivity estimatesdecreased by 12% and 18%, respectively, in trio and low‐coverage sequence(Supplementary Table 8). We further examined an alternative approach thatinvolved the pairwise integration of deletion discovery methods, and tested itsability to discover SVs without relying on the inclusion of lower specificity callsfollowing experimental validation (“algorithm‐centric set”; Fig. 1B). While thisalternativeapproachresultedinanincreasednumber(by~13%)ofhigh‐specificity(FDR<10%) calls compared to the release set (Supplementary Text), it overallresultedinfewerSVcallsowingto itsdecreasedsensitivityatthe lower(<200bp)SVsizerange.Inthefollowinganalyseswethusfocusedonthereleaseset.ExtentandimpactofourSVdiscoverysetWe next assessed the extent and impact of our SV discovery (release) set. ThemedianSVsizewas729bp(mean=8kb),approximatelyfourtimessmallerthaninarecenttilingCGHbasedstudy1,reflectingthehighresolutionofDNAsequencebasedSVdiscovery.Wealsocomparedourset toarecentsurveyofSVs inan individualgenome37basedoncapillarysequencingandarray‐basedanalyses24,andobservedasimilar size distribution for deletions, but differences in the size distributions ofother SV classes, reflecting underlying differences in SV ascertainment(SupplementaryFig.6).BycomparingourSVs todatabasesof structuralvariationandtoadditionalpersonalgenomedatasets,weclassified15,556SVsinoursetas

Page 7: Mapping Copy Number Variation by Population Scale Genome … · 2015. 2. 19. · Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page)

6

novel,withanenrichmentof low frequencySVsand small SVsamongst thenovelvariants(MethodsandSupplementaryText).

A major advantage of sequence‐based SV discovery is the nucleotide resolutionmappingofSVs.Weinitiallymappedthebreakpointsof7,066deletionsand3,299MEIsusingSRandASfeatures.UsingtheTIGRA‐targetedassemblyapproach38wefurtheridentifiedthebreakpointsofanadditional4,188deletionsand160tandemduplications, initially discovered by RD, RP, and PD methods (Methods,Supplementary Table 2). Altogether, we mapped ~15,000 SVs at nucleotideresolution,48%ofwhichwerenovel.Fewdeletion loci (4.4%)displayeddifferentSV breakpoints in different samples, which is explainable by rare TIGRA mis‐assemblies, or alternatively, by recurrently formed, multi‐allelic SVs(SupplementaryText).TIGRAfurtherenabledustovalidateanadditional7,359SVsdiscoveredwithRPorRD featuresby identifying the SVs’ breakpoints (Methods),and to evaluate the mapping precision of SV discovery methods (Fig. 2C,SupplementaryFigure2).

We further assessed the putative functional impact of SVs in our set by relatingthem to genomic annotation. Seventeen hundred SVs affected coding sequences,resulting in fullgeneoverlapsorexondisruptions(Table2),manyofwhich ledtoout‐of‐frameexons (SupplementaryTable9).Werelatedgenedisruptions togenefunctions, and observed significant enrichments for several functional categoriesincluding cell defense and sensory perception (Supplementary Table10). Highlevels of structural variation, including copy‐number variation, were previouslydescribed for both processes15,22,39. These SVs might be maintained in thepopulationbyselection for thepurposeof functional redundancy.WhilemostSVsintersectingwithgenesweredeletions, severalvalidated tandemduplicationsandMEIsalsointersectedwithcodingsequences(Table2).

Populationgeneticpropertiesofdeletions

Wenextsoughttogenerategenotypesfordeletionsdiscoveredinthe1000GPdata,bothtofacilitatepopulationgeneticsanalysesandtomakeourSVsetamenabletoassociation studies in the form of a reference genotype set. In this regard, theGenome STRiP36 genotyping method was developed, a method combininginformation fromRD,RP, SR andhaplotype featuresof population‐scale sequencedata for genotyping (Methods, Supplementary Text). Using this approach wegenerated genotypes for 13,826 autosomal deletions in 156 individuals. Thegenotypes displayed 99.1% concordance with CGH array1 based genotypes(availablefor1,970ofthedeletions),suggestinghighgenotypingaccuracy.

Fig. 3 presents allele frequency analyses based on these genotypes. As expected,common polymorphisms (minor allele frequency (MAF) >5%) were generallysharedacrosspopulations,whilerarealleleswerefrequentlyobservedinonlyonepopulation (Figs. 3ABC). We observed several candidates for monomorphicdeletions (i.e., genomicsegmentsputativelydeleted inall individuals), explainable

Page 8: Mapping Copy Number Variation by Population Scale Genome … · 2015. 2. 19. · Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page)

7

by rare insertions present in the reference genome or by remaining genotypinginaccuracies(SupplementaryText).

We next assessed the allele frequencies of gene deletions (Fig. 3D). Similar to arecentarray‐basedstudy1,weobservedadepletionofhighfrequencyallelesamongdeletions intersecting with protein‐coding sequence compared to other deletions(P=1.1x10‐11; KS test), consistent with purifying selection keeping most genedeletions at low frequency. Nonetheless, several coding sequence deletions wereobserved with high allele frequency (>80%). Most of these occurred in regionsannotated as segmental duplications, consistent with lessened evolutionaryconstraint infunctionallyredundantgenecategories22. Intriguingly,commongenedeletions also affectedmany unique genes with no obvious paralogs.We furtheranalyzed the abundance of gene deletions in different populations and observedhighlydifferentiatedloci,albeitwithnostatisticallysignificantrelationshipbetweendifferentiation and particular categories of gene overlap, i.e., intronic vs. exonic(SupplementaryText).

By comparing deletion genotypes with genotypes of nearby SNPs, we found,consistentwithearlierstudies1,13,40,thatdeletionsingenomicregionsaccessibletoshort read sequencing display extensive linkage disequilibrium (LD) with SNPs.81% of common deletions had one or more SNPs with which they are stronglycorrelated (r2>0.8; Supplementary Fig. 7). This suggests that many deletionsmapped in our study will be identifiable through tagging SNPs in future studies(SupplementaryText).On the other hand, a fifth of the genotypeddeletionswerenot tagged byHapMap SNPs (a figure similar to the fraction of SNPs that are nottaggedbyHapMapSNPs41),implyingthattheseSVsshouldbegenotypeddirectlyinassociationstudies.Furthermore,theLDpropertiesofcomplexSVs(e.g.,multiallelicSV) have not yet been fully ascertained asmethods for genotyping such SVswithsimilaraccuracyarestillbeingdeveloped.

SVformationmechanismanalysis

Nucleotide resolution breakpoint information enables inference of SV formationmechanisms15,22. Recent studies broadly distinguished between several germlinerearrangement classes, someofwhichmaycomprisemore thanoneSV formationmechanism15,22,42,43:non‐allelichomologousrecombination(NAHR),associatedwithlong sequence similarity stretches around thebreakpoints; rearrangements in theabsenceofextendedsequencesimilarity(abbreviatedas“non‐homologous”orNH),associated with DNA repair by non‐homologous end‐joining (NHEJ) or withmicrohomology‐mediated break‐induced replication (MMBIR); the shrinking orexpansion of variable number of tandem repeats (VNTRs), frequently involvingsimplesequences,byslippage;andMEIs.WedistinguishedamongtheclassesNAHR,NH,VNTR,andMEIbyexaminingthebreakpointjunctionsequenceofSVsinitiallydiscoveredasdeletionsortandemduplicationsrelativetoahumanreference.

Page 9: Mapping Copy Number Variation by Population Scale Genome … · 2015. 2. 19. · Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page)

8

We first compared theSVs toorthologousprimategenomic regions todistinguishdeletionsfrominsertions/duplicationswithrespecttoreconstructedancestral lociusing the BreakSeq classification approach43. This analysis showed that of the11,254 nucleotide‐resolution SVs discovered as deletions relative to a humanreference, 21% actually represented insertions and 2% represented tandemduplications relative to theputativeancestralgenome.Of the remainingSVs,60%were classified as deletions relative to ancestral sequence, whereas the ancestralstate of 17%was undetermined. By comparison, out of 160nucleotide‐resolutionSVsidentifiedastandemduplicationsrelativetothereferencegenome,91.6%wereclassified as duplications relative to the ancestral genome, whereas the ancestralstate of 8.4% remained undetermined (Supplementary Text). Our breakpointanalysisrevealedthat70.8%ofthedeletionsand89.6%oftheinsertionsexhibitedbreakpoint microhomology/homology ranging from 2‐376 bp in size, withdistributionmodes of 2 bp (attributable to NH) and 15 bp (attributable toMEI),respectively (Fig. 4A, Supplementary Text). As expected42, a small portion of thedeletions (16.1%) displayed non‐template inserted sequences at their breakpointjunctions. By comparison, the tandem duplications showed extensive stretchesdisplaying≥95%sequenceidentityatthebreakpointslinearlycorrelatinginlengthwithSVsize (Fig.4A). Inaddition,most tandemduplicationsdisplayed2‐17bpofmicrohomologyatthebreakpointjunctions(SupplementaryText).

We subsequently applied BreakSeq43 to infer formation mechanisms for all SVsclassified with regard to ancestral state. Using BreakSeq, we inferred NH as thedominatingdeletionmechanism, andMEI as thedominating insertionmechanism(Fig.4BC,SupplementaryTable11).Furthermore,anabundanceofmicrohomologyattandemduplicationbreakpointssuggestedfrequentformationofthisSVclassbyarearrangementprocessactingintheabsenceofhomology(NH).WhenrelatingSVformation to the variant size spectrum, we observedmarked insertion peaks forMEIs at 300 bp, corresponding to Alu elements, and at 6 kb, corresponding toL1/LINEs (Fig. 4C). By comparison, NH and NAHR based mechanisms occurredacross a wide size‐range, whereas VNTR expansion/shrinkage, consistent withearlierfindings1,ledtorelativelysmallSVsizes(Figs.4C,D).

Furthermore, when displaying the genomic distribution of SVs (Fig. 5A), weobservedanotableclusteringofSVsinto‘SVhotspots’.Weanalyzedthisclusteringin detail by examining the distribution of non‐overlapping, adjacent SVs, andobservedamarkedclusteringofSVsformedbyNAHR,VNTR,andNH,respectively,asignalextendingtohundredsofkilobases(Fig.5B).TheclusteringwasinfluencedbyanabundanceofVNTRnearthecentromeres43andNAHRnearthetelomeres(Fig.5A). A significant enrichment of NAHR near recombination hotspots (P=1.3e‐15)and segmental duplications (P=3.1e‐17) further contributed to the clustering(SupplementaryTable13).

To further explore this clustering we devised a segmentation approach forpredictingSVhotspots(Methods),whichyieldedamapof51putativeSVhotspots(SupplementaryTable14).80%of thehotspotsmainly comprisedSVsoriginating

Page 10: Mapping Copy Number Variation by Population Scale Genome … · 2015. 2. 19. · Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page)

9

fromasingleformationmechanism(Fig.5C).MostofthesecorrespondedtoNAHRhotspots,althoughhotspotsdominatedbyNHandVNTRalsowereevident.Theseobservations suggest that SV formation is frequently associated with the locus‐specificpropensityforgenomicrearrangement.

ConclusionsanddiscussionBy generating an SV set of unprecedented size alongwith breakpoint assembliesand reference genotypes, we demonstrate the suitability of population‐scalesequencing for SV analysis. Nucleotide resolution data allow the construction ofreference datasets andmake SVs readily assessable across different experimentalplatformsusinggenotypingapproaches.Ourfine‐scalemapenabledustoexaminethe functional impact of SVs, as exemplified by our analysis of gene disruptionvariants,whichwillbeofvalueforgenomeandexomesequencingstudies.

Ourmap further enabledus to examine size spectra of SV formationmechanismsandledustoidentifygenomicSVhotspotsthatarecommonlydominatedbyasingleformationmechanism.Recurrentrearrangements,implicatedingenomicdisorders,are hypothesized to be associated with local genome architecture44, e.g., withsegmental duplications that facilitate NAHR. Also, DNA rearrangement in theabsenceofhomology,i.e.,MMBIR,hasbeenimplicatedinrecurrentSVformation8,45.In this regard,we noticed that out of the hotspotswe report, six fall into criticalregions of known genetic disorders associated with recurrent de novo deletions,including Miller‐Dieker syndrome and Leri‐Weill dyschondrosteosis(SupplementaryTable 14). Irrespective of potential disease relevance, or inferredmechanism of formation, our analysis revealed a map of SV hotspots that mayconstitute local centers ofde novo SV formation, consistentwith the concept thatlocalgenomearchitecturecontributestogenomicinstability44.

Our study focused on characterizing deletions, which are often associated withdisease9. Facilitated by ancestral analyses of SV loci, we also characterizedinsertionsandtandemduplications,albeitinlessdetailthandeletions.Companionpapers with more detailed analyses of MEIs, and copy‐number variation withinsegmental duplications are published elsewhere34,46. Of note, most SV discoverymethods depend on mapping reads onto their genomic locus of origin, i.e., the‘accessible’ fraction of the genome, a fraction lessened in segmental duplicationsthatareofhighinteresttoSVanalysis.Nonetheless,owingtotheabilitiesofRPandRD methods in detecting SVs in these regions and in interpreting reads withmultiplemappingpositions,the‘accessible’fractionofthegenomeishigherforSVsthanforSNPs16.Inthefuture,sequencingtechnologiesgeneratinglongerDNAreadswill increase the accessible genome, and will enable the assessment of SVsembeddedinlongrepeatstructures,suchasbalancedinversions.

Our SV resourcewill enable the discovery, genotyping, and imputation of SVs inlarger cohorts. Numerous genomes will be sequenced in the coming months tofacilitate disease association studies. Systematic characterization of SVs in thesegenomeswillbenefitfromtheconceptsanddatasetspresentedhere.

Page 11: Mapping Copy Number Variation by Population Scale Genome … · 2015. 2. 19. · Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page)

10

MethodsSummarySamplesSequence data for 179 unrelated individuals and six individuals from parent‐offspring trios were obtained as part of the 1000GP. These data were generatedwith Illumina/Solexa, Roche/454, and Life Technologies/SOLiD sequencingtechnologyplatforms.SVdiscoveryandbreakpointassemblyTheSVdiscoverymethodsweappliedcomprisedsixRP,fourRD,threeSR,fourAS,andtwoPDbasedmethods.TIGRA38wasusedfortargetedbreakpointassembly.ExperimentalvalidationWevalidatedSVcallsbyPCR,arrayCGHandSNPmicroarrays, targetedassembly,and custom microarray‐based sequence capture. PCR was performed in variousdifferent laboratories33, CGH analysis was performed based on tiling array dataprovidedbytheGenomeStructuralVariationConsortium(ArrayExpress:E‐MTAB‐40),andSNParrayanalysisbasedondataobtainedfromtheInternationalHapMapConsortium(http://hapmap.ncbi.nlm.nih.gov).GenotypingGenomeSTRiP36wasused fordeletiongenotyping in lowcoveragesequencedata.Initial genotype likelihoods were derived with a Bayesian model and imputationintoaSNPgenotypereferencepanelfromtheHapMap41(Hapmap3r2)wasachievedwithBeagle(v3.1;http://faculty.washington.edu/browning/beagle/beagle.html).SVformationmechanismanalysisSVbreakpointsmappedatnucleotideresolutionwereanalyzedwithBreakSeq43toclassifySVsrelativetoputativeancestrallociandtoinferSVformationmechanisms.SVhotspotsweremappedwithcustomPerlandRscripts.

Page 12: Mapping Copy Number Variation by Population Scale Genome … · 2015. 2. 19. · Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page)

11

DisplayItemsTable 1. Summary of discovered structural variation

Deletions Tandem Duplications

Mobile element

insertions

Novel sequence insertions

Total

Individual Callsets <10% FDR 11215 501 5371 - 17087

Validated Experimentally* 10810 - - 128 10938

Release set 22025 501 5371 128 28025 *Only tabulates validated calls which were not already present in the individual callsets with <10% FDR

Table 2. Functional impact of our fine resolution SV set. Figures in parentheses indicate numbers of validated SVs per category. We inferred gene overlap with Gencode gene annotation47.

Gene Overlap

SV class Full gene

overlap

Coding exon

affected (partial)

UTR overlap Intron overlap

Total Gene

overlap

Total Inter-genic

Deletions 654 (631)

1093 (1031)

315 (290)

7319 (6481)

9381 (8433)

12644 (10386)

Tandem duplications

2 (2)

7 (6)

9 (5)

197 (62)

215 (75)

286 (76)

Mobile element insertions -

3 (-)

36 (-)

1304 (97)

1348 (112)

4023 (758)

Novel sequence insertions - -

2 (2)

49 (49)

51 (51)

77 (77)

Sum 656 (633)

1119 (1040)

351 (309)

8869 (6689)

10995 (8671)

17030 (11280)

Page 13: Mapping Copy Number Variation by Population Scale Genome … · 2015. 2. 19. · Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page)

12

FigureLegends

Figure 1. SV discovery and genotyping in population scale sequence data.A.Schematicdepictingthedifferentmodes(i.e.,approaches)ofsequencebasedSVdetection we used. The RP approach assesses the orientation and spacing of themappedreadsofpaired‐endsequences14,15(readsaredenotedbyarrows); theRDapproach evaluates the read depth‐of‐coverage25,26; the SR approach maps theboundaries (breakpoints) of SVs by sequence alignment28,29; the AS approachassembles SVs30,31,32. B.Integrated pipeline for SV discovery, validation, andgenotyping. Colored circles represent individual SV discovery methods (listed inSupplementary Table 1), with modes indicated by a color scheme: green=RP;yellow=RD; purple=SR; red=AS; green and yellow=methods evaluatingRPand RD(abbreviated as ‘PD’).C.Example of a deletion, previously associatedwith BMI35,identified independently with RP (green), RD (yellow), and SR (red) methods.TargetedassemblyconfirmedthebreakpointsdetectedbySR.

Figure2. Comparative assessment of deletiondiscoverymethods.A.Deletionsize‐rangeascertainedbydifferentmodesofSVdiscovery.Threegroupsarevisible,withASandSR,PDandRP,aswellasRDand‘RL’(RPanalysisinvolvingrelativelylong range (≥1 kb) insert size libraries, resulting in a different deletion detectionsize range compared to the predominantly used <500kb insert size libraries),respectively,ascertainingsimilarsize‐ranges.PiechartsdisplaythecontributionofdifferentSVdiscoverymodestothereleaseset.Outerpie=basedonnumberofSVcalls; inner pie = based on total number of variable nucleotides. Of note, not allapproaches were applied across all individuals (see Supplementary Table 2).B.SensitivityandFDRestimatesforindividualdeletiondiscoverymethodsbasedongoldstandardsets for individualssequencedathigh(NA12878)and low‐coverage(NA12156),respectively.AlldepictedestimatesaresummarizedinSupplementaryTables 3, 4, 6. Vertical dotted lines correspond to the specificity threshold(FDR≤10%).C.Breakpointmappingresolutionofthreedeletiondiscoverymethods(the respective method names are in Supplementary Table 2). The blue and redhistograms are the breakpoint residuals for predicted deletion start and endcoordinates,respectively,relativetoassembledcoordinates(hereassessedin low‐coveragedata).Thehorizontallinesatthetopofeachplotmarkthe98%confidenceintervals (labeled foreachpanel),withverticalnotches indicating thepositionsofthemostprobablebreakpoint(thedistributionmode).

Figure 3. Analysis of deletion presence and absence in two populations.A­C.Deletionallelefrequenciesandobservedsharingofallelesacrosspopulations,displayedfordeletionsdiscoveredintheCEU,YRI,andJPT+CHBpopulationsamplesintermsofstackedbars.D.Allelefrequencyspectrafordeletionsintersectingwithintergenic(blue),intronic(yellow),andprotein‐codingsequences(red).

Figure4. Contributionof SV formationmechanisms to the SV size spectrum.A.Breakpointjunctionhomology/microhomologylengthplottedasafunctionofSVsizeforSVsoriginallyidentifiedasdeletionscomparedtoahumanreference.Dots

Page 14: Mapping Copy Number Variation by Population Scale Genome … · 2015. 2. 19. · Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page)

13

arecoloredaccordingtotheSVs’classificationasdeletions,insertions/duplications,or “undetermined” relative to inferred ancestral genomic loci. Gray lines markgroups of SVs likely formed by a common formation mechanism. The diagonalhighlights tandemduplications (and few reciprocal deletion events), inwhich thelengthoftheduplicatedsequencecorrelateslinearlywiththelengthofthelongestbreakpoint junction sequence identity stretch. The ellipses indicateMEIs, i.e.,Alu(~300bp)andL1(~6kb)insertions,associatedwithtargetsiteduplicationsofupto28bpinsizeatthebreakpoints.ThehorizontalgroupcorrespondsmostlytoNH‐associateddeletionswith<10bpmicrohomologyatthebreakpoints.Theremaining(ungrouped)SVscomprisetruncatedMEIs,VNTRexpansionandshrinkageevents,aswellasNAHR‐associateddeletionsandduplications.B.RelativecontributionsofSVformationmechanismsinthegenome.NumbersofSVsaredisplayedontheouterpiechartandaffectedbasepairsontheinner.Leftpanel:SVsclassifiedasdeletionsrelative to ancestral loci. Right panel: SVs classified as insertions/duplications.C.Size spectra of deletions classified relative to ancestral loci. D.Size spectra ofinsertions/duplications.

Figure5.MappinghotspotsofSVformationinthegenome.A.DistributionofSVson chromosome 10 (“chr10”). Above the ideogram, colored bars indicate SVformationmechanisms(samecolorschemeasinBandC);barlengthsrelatetothelogarithmofSV size.Below the ideogram,bar lengthsaredirectlyproportional toallele frequencies.Arrows indicate an SVhotspotnear the centromereunderlyingmainly VNTR, and several hotspots near the telomeres underlying mainly NAHRevents. B.Enrichment of SVs inferred to be formed by the same formationmechanism for different genomic window sizes. Displayed is an enrichment ofnearby,non‐overlappingSVs formedby thesamemechanismrelative toanSVsetwhere mechanism assignments are shuffled randomly. C.SV hotspots are mostlydominated by a single formationmechanism. Colored bars depict numbers of SVhotspotsinwhichatleast50%ofthevariantswereinferredtobeformedbyasingleformation mechanism. The average abundance of NAHR‐classified SVs in NAHRhotspotswas70%(comparedwith77%forVNTR‐hotspots;69%forNH).Thegraybar(“mixed”)correspondstoSVhotspotswithnosinglemechanismdominating.

Page 15: Mapping Copy Number Variation by Population Scale Genome … · 2015. 2. 19. · Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page)

14

References1 Conrad,D.F.etal.Originsandfunctionalimpactofcopynumbervariationin

thehumangenome.Nature464,704‐712(2010).2 Pinto, D. et al. Functional impact of global rare copy number variation in

autismspectrumdisorders.Nature466,368‐372(2010).3 Sebat, J. et al. Strong association of de novo copy number mutations with

autism.Science316,445‐449(2007).4 Stefansson, H. et al. Large recurrent microdeletions associated with

schizophrenia.Nature455,232‐236(2008).5 McCarthy, S. E. et al. Microduplications of 16p11.2 are associated with

schizophrenia.NatGenet41,1223‐1227(2009).6 Craddock,N.etal.Genome‐wideassociationstudyofCNVsin16,000casesof

eight common diseases and 3,000 shared controls. Nature 464, 713‐720,(2010).

7 McCarroll, S. A. et al. Deletion polymorphismupstreamof IRGMassociatedwith altered IRGMexpression andCrohn's disease.NatGenet,40, 1107‐12(2008).

8 Hastings,P.J.,Lupski,J.R.,Rosenberg,S.M.&Ira,G.Mechanismsofchangeingenecopynumber.NatRevGenet10,551‐564(2009).

9 Stankiewicz,P.&Lupski,J.R.Structuralvariationinthehumangenomeanditsroleindisease.AnnuRevMed61,437‐455(2010).

10 Sebat,J.etal.Large‐scalecopynumberpolymorphisminthehumangenome.Science305,525‐528(2004).

11 Iafrate,A.J.etal.Detectionoflarge‐scalevariationinthehumangenome.NatGenet36,949‐951(2004).

12 Sharp, A. J. et al. Segmental duplications and copy‐number variation in thehumangenome.AmJHumGenet77,78‐88(2005).

13 McCarroll,S.A.etal.Integrateddetectionandpopulation‐geneticanalysisofSNPsandcopynumbervariation.NatGenet40,1166‐1174(2008).

14 Tuzun, E. et al. Fine‐scale structural variation of the human genome. NatGenet37,727‐732(2005).

15 Korbel,J.O.etal.Paired‐endmappingrevealsextensivestructuralvariationinthehumangenome.Science318,420‐426(2007).

16 Alkan, C. et al. Personalized copy number and segmental duplicationmapsusingnext‐generationsequencing.NatGenet41,1061‐1067(2009).

17 Chen, K. et al. BreakDancer: an algorithm for high‐resolution mapping ofgenomicstructuralvariation.NatMethods6,677‐681(2009).

18 Hormozdiari, F., Alkan, C., Eichler, E. E. & Sahinalp, S. C. Combinatorialalgorithms for structural variation detection in high‐throughput sequencedgenomes.GenomeRes19,1270‐1278(2009).

19 Medvedev, P., Stanciu, M. & Brudno, M. Computational methods fordiscovering structural variation with next‐generation sequencing. NatMethods6,S13‐20(2009).

Page 16: Mapping Copy Number Variation by Population Scale Genome … · 2015. 2. 19. · Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page)

15

20 McKernan,K. J.etal.Sequenceandstructuralvariation inahumangenomeuncoveredby short‐read,massively parallel ligation sequencingusing two‐baseencoding.GenomeRes19,1527‐1541(2009).

21 Chiang,D.Y.etal.High‐resolutionmappingofcopy‐numberalterationswithmassivelyparallelsequencing.NatMethods6,99‐103(2009).

22 Kidd, J.M. etal.Mappingand sequencingof structural variation fromeighthumangenomes.Nature453,56‐64(2008).

23 Lee,S.,Cheran,E.&Brudno,M.Arobustframeworkfordetectingstructuralvariationsinagenome.Bioinformatics24,i59‐67(2008).

24 Pang,A.W. et al. Towards a comprehensive structural variationmapof anindividualhumangenome.GenomeBiol11,R52(2010).

25 Bailey, J. A. et al. Recent segmental duplications in the human genome.Science297,1003‐1007(2002).

26 Campbell,P.J.etal.Identificationofsomaticallyacquiredrearrangementsincancer using genome‐wide massively parallel paired‐end sequencing. NatGenet40,722‐729(2008).

27 Yoon, S., Xuan, Z., Makarov, V., Ye, K. & Sebat, J. Sensitive and accuratedetectionofcopynumbervariantsusingreaddepthofcoverage.GenomeRes19,1586‐1592(2009).

28 Mills,R.E.etal.Aninitialmapofinsertionanddeletion(INDEL)variationinthehumangenome.GenomeRes16,1182‐1190(2006).

29 Ye,K.,Schulz,M.H.,Long,Q.,Apweiler,R.&Ning,Z.Pindel:apatterngrowthapproach to detect break points of large deletions and medium sizedinsertions from paired‐end short reads. Bioinformatics 25, 2865‐2871,(2009).

30 Simpson,J.T.etal.ABySS:aparallelassemblerforshortreadsequencedata.GenomeRes19,1117‐1123(2009).

31 Hajirasouliha, I. et al. Detection and characterization of novel sequenceinsertions using paired‐end next‐generation sequencing.Bioinformatics26,1277‐1283(2010).

32 Li,R.etal.Thesequenceanddenovoassemblyofthegiantpandagenome.Nature463,311‐317(2010).

33 The‐1000‐Genomes‐Project‐Consortium.Amapofhumangenomevariationfrompopulation‐scalesequencing.Nature467,1061‐1073(2010).

34 Sudmant, P. H. et al. Diversity of human copy number variation andmulticopygenes.Science330,641‐646(2010).

35 Willer, C. J. & Willer, C. J. Six new loci associated with body mass indexhighlightaneuronal influenceonbodyweightregulation.NatGenet41,25‐34(2009).

36 Handsaker, R. E., Korn, J. M., Nemesh, J. & McCarroll, S. A. Discovery andgenotyping of genome structural polymorphism by sequencing on apopulationscale.submitted.

37 Levy,S.etal.Thediploidgenomesequenceofanindividualhuman.PLoSBiol5,e254(2007).

38 Chen,L.etal.TIGRAlocaltargetedassemblyofstructuralvariants.submitted(2010).

Page 17: Mapping Copy Number Variation by Population Scale Genome … · 2015. 2. 19. · Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page)

16

39 Hasin‐Brumshtein, Y., Lancet, D. & Olender, T. Human olfaction: fromgenomic variation to phenotypic diversity. Trends Genet 25, 178‐184,(2009).

40 Hinds,D.A.,Kloek,A.P., Jen,M.,Chen,X.&Frazer,K.A.CommondeletionsandSNPsareinlinkagedisequilibriuminthehumangenome.NatGenet38,82‐85,(2006).

41 Altshuler, D. M. et al. Integrating common and rare genetic variation indiversehumanpopulations.Nature467,52‐58,(2010).

42 Conrad,D.F.etal.MutationspectrumrevealedbybreakpointsequencingofhumangermlineCNVs.NatGenet42,385‐391(2010).

43 Lam, H. Y. et al. Nucleotide‐resolution analysis of structural variants usingBreakSeqandabreakpointlibrary.NatBiotechnol28,47‐55(2010).

44 Lupski,J.R.Genomicdisorders:structuralfeaturesofthegenomecanleadtoDNA rearrangements and human disease traits.Trends Genet14, 417‐422,(1998).

45 Lee, J. A., Carvalho, C. M. & Lupski, J. R. A DNA replicationmechanism forgeneratingnonrecurrentrearrangementsassociatedwithgenomicdisorders.Cell131,1235‐1247(2007).

46 Stewart, C. et al. A comprehensive map of mobile element insertionpolymorphismsinhumans.inpreparation.

47 Harrow, J. et al. GENCODE: producing a reference annotation for ENCODE.GenomeBiol7Suppl1,S41‐9(2006).

Acknowledgements:Wewould like to acknowledgeClaireHardy,Richard Smith,AnniekDeWitte,andShaneGilesfortheirassistancewithvalidation.M.A.B’sgroupwassupportedbygrantsfromtheNationalInstitutesofHealth(RO1GM59290)andG.T.M’s group by grants R01 HG004719 and RC2 HG005552, also from the NIH.J.O.K.’s group was supported by an Emmy Noether Fellowship of the GermanResearch Foundation (Deutsche Forschungsgemeinschaft). J.W.’s group wassupported by the National Basic Research Program of China (973 program no.2011CB809200), the National Natural Science Foundation of China (30725008;30890032;30811130531;30221004), theChinese863program (2006AA02Z177;2006AA02Z334; 2006AA02A302; 2009AA022707), the Shenzhen MunicipalGovernment of China (grants JC200903190767A; JC200903190772A;ZYC200903240076A; CXB200903110066A; ZYC200903240077A;ZYC200903240076AandZYC200903240080A),andtheOleRømergrant fromtheDanishNaturalScienceResearchCouncil.C.L.’sgroupwassupportedbygrantsfromthe National Institutes of Health: P41 HG004221, RO1 GM081533, and UO1HG005209 andX.S.was supported by a T32 fellowship award from theNIH. Wethank the Genome Structural Variation Consortium(http://www.sanger.ac.uk/humgen/cnv/42mio/) and the International HapMapConsortium for making available microarray data. The authors acknowledge theindividuals participating in the 1000 Genomes Project by providing samples,includingTheYorubapeopleof Ibadan,Nigeria, the community atBeijingNormalUniversity,thepeopleofTokyo,Japan,andthepeopleoftheUtahCEPHcommunity.

Page 18: Mapping Copy Number Variation by Population Scale Genome … · 2015. 2. 19. · Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page)

17

Furthermore, we thank Richard Durbin and Lars Steinmetz for comments on themanuscript.

Author Contributions: The authors contributed this study at different levels, asdescribed in the following.SVdiscovery:K.W.,C.S.,R.H.,K.C.,C.A.,A.A.,S.C.Y.,R.K.C.,A.C.,Y.F.,I.H.,F.H.,Z.I.,D.K.,R.L.,Y.L.,C.L.,R.L.,X.J.M.,H.E.P.,L.D.,G.T.M.,J.S.,J.W.,K.Y.,K.Y., E.E.E., M.B.G., M.E.H., S.A.M., and J.O.K. SVvalidation: R.E.M., K.W., K.C., A.A.,S.C.Y.,F.G.,M.K.K.,J.K.,J.N.,A.E.U.,X.S.,A.M.S.,J.A.W.,Y.Z.,Z.Z.,M.A.B.,J.S.,M.S.,M.E.H.,C.L,J.O.K.SVgenotyping:K.W.,R.H.,M.E.H,andS.A.M.Dataanalysis:R.E.M.,C.S.,C.A.,A.A.,R.H.,K.C.,S.C.Y.,R.K.C.,A.C.,D.C.,Y.F.,F.H.,L.M.I.,Z.I.,J.M.K.,M.K.K.,S.K.,J.K.,E.K.,D.K.,H.Y.K.L., J.L.,R.L., Y.L., C.L.,R.L., X.J.M., J.N.,H.E.P.,T.R.,A.S., X.S.,M.P.S., J.A.W.,J.W.,Y.Z.,Z.Z.,M.A.B.,L.D.,G.T.M.,G.M.,J.S.,M.S.,J.W.,K.Y.,K.Y.,E.E.E.,M.B.G.,M.E.H.,C.L,S.A.M.,andJ.O.K.Preparationofmanuscriptdisplayitems:R.E.M.,K.W.,C.S.,C.A.,A.A.,R.H.,S.C.Y.,L.M.I.,S.K.,E.K.,M.K.K.,X.J.M.,X.S.,J.A.W.,M.B.G.,S.A.M.,andJ.O.K.Co­chairsoftheStructuralVariationAnalysisgroup:E.E.E.,M.E.H.,andC.L.Thefollowingwereleadingcontributorstotheanalysisdescribedinthispaperandthereforeshouldbeconsideredjointfirstauthors:R.E.M.,K.W.,C.S.,R.H.,K.C.,C.A.,A.A.,S.C.Y,andK.Y.The following equally contributed to directing the described analyses andparticipatinginthedesignofthestudyandshouldbeconsideredjointseniorauthors:E.E.E, M.B.G., M.E.H., C.L, S.A.M., and J.O.K. The manuscript was written by thefollowingauthors:R.E.M.andJ.O.K.

Data retrieval: The data sets described here can be obtained from the 1000Genomes Project website at www.1000genomes.org (July 2010 Data Release).Individual SV discovery methods can be obtained from sources mentioned inSupplementary Table 1, or upon request from the authors. Abbreviations used inthispaperaresummarizedintheSupplementaryText.

Page 19: Mapping Copy Number Variation by Population Scale Genome … · 2015. 2. 19. · Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page)

Application of diverse SV discovery methods

Validation of SVs (deletions, duplications and

insertions)

Targeted SV breakpoint assembly (focused on

deletions)

Precision-aware merging of discovered SVs

Release set (algorithms & extensive

validations)inclusion of SVs inferred with individual methods(criterion: FDR<10%),

followed by validation-awareSV inclusion

Algorithm-centric set(algorithms & sparse

validations)inclusion of SVs inferred with individual methods,and such with evidence

from >2 methods(criterion: FDR<10%)

SV discovery set

Genotyping (focused on deletions)

Deletion (Del), Duplication (Dup), and Insertion (Ins)

RP

RD

SR

AS

PD

ReferenceSample genome

Del Dup Ins

Reference-supportingSV-supporting read-pair (RP)SV-supporting read-depth (RD)

SV-supporting read for split-read analysis (SR) or assembly (AS)Mobile element insertion support

MEI

MEI

a b

c

6070

8090

100

6070

8090

100

72.52 72.54 72.56 72.58 72.60Chromosome 1 position (in Mb)

DN

A re

ad m

appi

ng q

ualit

y sc

ores

NA

1924

0 (Y

RI)

NA

1287

8 (C

EU

)

Dep

th o

f cov

erag

e44

3322

2211

011

044

33

AluLINE

NEGR1Del

Page 20: Mapping Copy Number Variation by Population Scale Genome … · 2015. 2. 19. · Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page)

0 1000 2000 3000 4000 5000 6000 7000

0.00

000.

0005

0.00

100.

0015

0.00

20

Deletion size (bp)

Den

sity

ASSRPD

RPRLRD

Releaseset

1.3%

19.2%

35.3%17.9%

2.0%

24.3%0.5%

17.4%

49.7%

10.6%

0.6%

21.2%

0.0

0.2

0.4

0.6

0.8

1.0

0% 20% 40% 60% 80% 100%

0

0.2

0.4

0.6

0.8

1

0%20%40%60%80%100%

FDR

FDR

TrioLow co

verag

e

Sen

sitiv

ity

Sensitivity

a b

−10 −5 0 100

1000

2000

3000

4000

5000 SR (LN)n = 5375

19 bp

18 bp

5 −100 −50 0 50 1000

50

100

150

200

250

300 RP (SI)n = 5229

210 bp

220 bp

−500 −250 0 250 5000

10

20

30

40

50RD (YL)n = 501

700 bp

900 bp

Freq

uenc

y

O�set from breakpoint (bp)

c

Page 21: Mapping Copy Number Variation by Population Scale Genome … · 2015. 2. 19. · Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page)

0 0.2 0.4 0.6 0.8 1Alternate allele frequency

0

500

1000

1500

2000

2500

Num

ber o

f SV

s JPT+CHB

0 0.2 0.4 0.6 0.8 1Alternate allele frequency

0

500

1000

1500

2000

2500N

umbe

r of S

Vs observed also in YRI

observed also in JPT+CHBshared among allobserved only in CEU

CEU

0 0.2 0.4 0.6 0.8 1Alternate allele frequency

0

500

1000

1500

2000

2500

Num

ber o

f SV

s YRI

0 0.2 0.4 0.6 0.8 1Average alternate allele frequency

0

1

2

3

4

Log 10

(num

ber o

f SV

s)

intersect with intronsintergenic

intersect with CDS

a b

c d

observed also in CEUobserved also in JPT+CHBshared among allobserved only in YRI

observed also in CEUobserved also in YRIshared among allobserved only in JPT+CHB

Page 22: Mapping Copy Number Variation by Population Scale Genome … · 2015. 2. 19. · Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page)

100 300 500 700 900

50

100

150

200

250

300

SV length (bp)

L

engt

h of

long

est s

eque

nce

sim

ilarit

y s

tretc

h at

the

SV

bre

akpo

int j

unct

ion

(bp)

insertions/duplicationsdeletionsundetermined

4000 6000 8000

Size of deletion

Num

ber o

f del

etio

ns

100bp 1kb 10kb 100kb

a

0

200

400

600

800

1000

1200

1400

1600

1800

Size of insertion/duplication

Num

ber o

f Ins

ertio

ns/d

uplic

atio

ns

100bp 1kb 10kb 100kb

LINE

Alu

0

200

400

600

800

1000

1200

1400

1600

1800

UnclassifiedMEIVNTRNAHRNH

c

db

393kb

89kb

1Mb

1,496

4,500

245 272

12Mb

24Mb

226

79

122

1,994

NAHRNHRVNTRMEI

21kb

164kb

UnclassifiedMEIVNTRNAHRNH

Page 23: Mapping Copy Number Variation by Population Scale Genome … · 2015. 2. 19. · Mapping Copy Number Variation by Population Scale Genome Sequencing (Article begins on next page)

200b

p

500b

p

1kb

2kb

5kb

10kb

20kb

50kb

100k

b

200k

b

500k

b

1Mb

NAHRNHMEIVNTRControl (NH vs. NAHR)

Enr

ichm

ent

(clu

ster

ing

of S

V

form

atio

n pr

oces

s)

Dep

letio

n(n

o cl

uste

ring)

10

1

0.1

chr10

Num

bers

of g

enom

ic S

V h

otsp

ots

(col

or :

dom

inat

ed b

y si

ngle

mec

hani

sm)

0

5

10

15

20

25

30

NAHR NH VNTR MEI mixed

a

b c