variation graphs and population assisted genome inference copy

Human Genome Variation Graphs Benedict Paten - UC Santa Cruz Genomics Institute [email protected] https://cgl.genomics.ucsc.edu/ Twitter: @BenedictPaten

Upload: genome-reference-consortium

Post on 22-Jan-2018

TRANSCRIPT

Page 1: Variation graphs and population assisted genome inference copy

Human Genome Variation Graphs

Benedict Paten - UC Santa Cruz Genomics Institute

[email protected] https://cgl.genomics.ucsc.edu/ Twitter: @BenedictPaten

Page 2: Variation graphs and population assisted genome inference copy

Triumph of the reference human genome

• The publication of the human reference genome unleashed the field of large-scale human genomics

• It offers a coordinate system to:

• describe gene sequences

• display annotations

• interpret molecular assays

• However, the reference genome represents only a single instance among billions of unique human genomes...

Page 3: Variation graphs and population assisted genome inference copy

Triumph of the reference human genome

[Figure: UCSC Genome Browser window, Human Feb. 2009 (GRCh37/hg19), chr17:37,783,223-37,826,720 (6,701 bp). Tracks: 100 vertebrates Basewise Conservation by PhyloP; UCSC Genes (RefSeq, GenBank, CCDS, Rfam, tRNAs & Comparative Genomics); Transcription Levels Assayed by RNA-seq on 9 Cell Lines from ENCODE; GTEx RNA-seq read coverage from Brain - Caudate (basal ganglia), Brain - Cortex, Muscle - Skeletal, Skin - Sun Exposed (Lower leg) and Thyroid. Genes shown: PPP1R1B, STARD3, TCAP, PNMT.]

Supplementary Figure 2 | 6700 bp exon-focused view of a 43 Kbp region of human chromosome 17 where GTEx RNA-seq highlights tissue-specific expression of two genes. TCAP (titin cap protein) is highly expressed in muscle tissue, while PPP1R1B (a therapeutic target for neurologic disorders) shows expression in brain basal ganglia but not muscle (or brain cortex). In this UCSC Genome Browser view, 33 samples from 5 tissues were selected for display, from the total 7304 (in 53 tissues) available on the GTEx public track hub. The hub is available on both hg19 (GRCh37) and hg38 (GRCh38) human genome assemblies. The hg19 tracks were generated using the UCSC liftOver tool to transform coordinates from the hg38 bedGraph files generated by STAR2 in the Toil pipeline. The browser display was configured to use the Multi-region exon view.

bioRxiv preprint first posted online Jul. 7, 2016; doi: http://dx.doi.org/10.1101/062497. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC 4.0 International license.

Figure: UCSC Genome Browser view of GTEx data, from Vivian et al., Nature Biotechnology 35, 314–316 (2017), doi:10.1038/nbt.3772
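To make the "coordinate system" point concrete, here is a minimal sketch of how a browser position string such as the one in the figure can be turned into numbers. The helper names `parse_position` and `span_bp` are hypothetical and not part of any UCSC tool; the sketch assumes the position string is 1-based and inclusive at both ends, which is how browser position boxes are normally read.

```python
import re


def parse_position(position):
    """Parse a string like 'chr17:37,783,223-37,826,720' into (chrom, start, end)."""
    match = re.fullmatch(r"(\w+):([\d,]+)-([\d,]+)", position.strip())
    if match is None:
        raise ValueError(f"not a genomic position: {position!r}")
    chrom, start, end = match.groups()
    # Thousands separators are for display only; strip them before converting.
    return chrom, int(start.replace(",", "")), int(end.replace(",", ""))


def span_bp(start, end):
    """Number of bases covered, assuming 1-based inclusive coordinates."""
    return end - start + 1


chrom, start, end = parse_position("chr17:37,783,223-37,826,720")
print(chrom, start, end, span_bp(start, end))
```

The span comes out at roughly 43 kbp, matching the "43 Kbp region" in the caption; the 6,701 bp in the window title is the smaller number of bases actually drawn in the exon-focused view.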

Page 7: Variation graphs and population assisted genome inference copy

Triumph of the reference human genome

• The publication of the human reference genome unleashed the field of large-scale human genomics

• It offers a coordinate system to:

• describe gene sequences

• display annotations

• interpret molecular assays

• However, the primary reference genome represents only a single instance among billions of unique germline human genomes...

Figure: UCSC Genome Browser view of GTEx data, from Vivian et al., Nature Biotechnology 35, 314–316 (2017), doi:10.1038/nbt.3772

Page 8: Variation graphs and population assisted genome inference copy

The problem with the reference

• Avg. 4-5 million point variants per individual

• 80 million point variants with >= 0.1% frequency

• Avg. > 10 megabases (Mb) in copy-number variants (CNVs) per individual

• 350-400 Mb in CNVs with >= 0.1% frequency

• Avg. > 6 Mb in large indels per individual

• > 100 Mb in large indels with >= 0.1% frequency
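As a rough worked example, the per-individual figures above can be expressed as fractions of the genome. This is only a sketch under stated assumptions: the genome length of about 3.1 billion bases is not given on the slide, each point variant is counted as a single base, and the "greater than" figures are taken at their lower bounds. The names `GENOME_BP`, `per_individual_bp` and `fraction_of_genome` are illustrative.

```python
# Assumed approximate haploid human genome length (not stated on the slide).
GENOME_BP = 3.1e9

# Bases affected per individual, taken from the slide's bullets.
per_individual_bp = {
    "point variants": 4.5e6,  # midpoint of 4-5 million, one base each
    "CNVs": 10e6,             # lower bound of "> 10 megabases"
    "large indels": 6e6,      # lower bound of "> 6 megabases"
}


def fraction_of_genome(bases, genome_bp=GENOME_BP):
    """Fraction of the genome covered by the given number of bases."""
    return bases / genome_bp


for name, bases in per_individual_bp.items():
    print(f"{name}: {100 * fraction_of_genome(bases):.2f}% of the genome")
```

Under these assumptions, copy-number variants and large indels each touch more bases per individual than point variants do, which is the slide's argument for why a single linear reference falls short.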

Page 9: Variation graphs and population assisted genome inference copy

The problem with the reference

Structural Variation of the Human Genome

Andrew J. Sharp, Ze Cheng, and Evan E. Eichler

Department of Genome Sciences, University of Washington, Howard Hughes Medical Institute, Seattle, Washington 98195; email: [email protected]

Annu. Rev. Genomics Hum. Genet. 2006. 7:407–42. First published online as a Review in Advance on June 16, 2006. doi:10.1146/annurev.genom.7.080505.115618. Copyright © 2006 by Annual Reviews. All rights reserved.

Key Words: polymorphism, rearrangement, insertion, deletion, inversion

Abstract: There is growing appreciation that the human genome contains significant numbers of structural rearrangements, such as insertions, deletions, inversions, and large tandem repeats. Recent studies have defined approximately 5% of the human genome as structurally variant in the normal population, involving more than 800 independent genes. We present a detailed review of the various structural rearrangements identified to date in humans, with particular reference to their influence on human phenotypic variation. Our current knowledge of the extent of human structural variation shows that the human genome is a highly dynamic structure that shows significant large-scale variation from the currently published genome reference sequence.

Characterization of Missing Human Genome Sequences and Copy-number Polymorphic Insertions

Jeffrey M. Kidd (1), Nick Sampas (2), Francesca Antonacci (1), Tina Graves (3), Robert Fulton (3), Hillary S. Hayden (1), Can Alkan (1), Maika Malig (1), Mario Ventura (4), Giuliana Giannuzzi (4), Joelle Kallicki (3), Paige Anderson (2), Anya Tsalenko (2), N. Alice Yamada (2), Peter Tsang (2), Rajinder Kaul (1), Richard K. Wilson (3), Laurakay Bruhn (2), and Evan E. Eichler (1,5,6)

(1) Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA

(2) Agilent Laboratories, Santa Clara, California 95051, USA

(3) Washington University Genome Sequencing Center, School of Medicine, St. Louis, Missouri 63108, USA

(4) Department of Genetics and Microbiology, University of Bari, Bari 70126, Italy

(5) Howard Hughes Medical Institute, Seattle, Washington 98195, USA

Published in final edited form as: Nat Methods. 2010 May; 7(5): 365–371. NIH Public Access author manuscript; available in PMC 2010 November 1.

Abstract: The extent of human genomic structural variation suggests that there must be portions of the genome yet to be discovered, annotated and characterized at the sequence level. We present a resource and analysis of 2,363 novel insertion sequences corresponding to 720 genomic loci. We show that a substantial fraction of these sequences are either missing, fragmented or mis-assigned when compared to recent de novo sequence assemblies from short-read next-generation sequence data. We determine that 18–37% of these novel insertions are copy-number polymorphic, including loci that show extensive population stratification among Europeans, Asians and Africans. Complete sequencing of 156 of these insertions identifies novel exons and conserved non-coding sequences not yet represented in the reference genome. We develop a method to accurately genotype these novel insertions by mapping next-generation sequencing datasets to the breakpoint, thereby providing a means to characterize copy-number status for regions previously inaccessible to SNP microarrays.

Introduction: The human genome reference assembly is a mosaic of distinct haplotypes sampled from multiple individuals [1]. As a result of both gaps in the assembled sequence and the structural differences that exist among different humans, individual genome projects are expected to uncover human sequences present in some (or all) individuals that are not represented in the assembly. Consistent with this prediction, the first sequences of individual genomes [2, 3] revealed 23–29 Mb of sequence that do not map against the reference assembly. The short-read, high-throughput approaches currently being employed are also expected to uncover unrepresented insertions [4–7]. However, these sequences often assemble only as short (median length of 220 to 314 bp [7]) contiguous sequences (contigs) that are difficult to anchor and incorporate into existing genome assemblies. Thus, while thousands of novel sequences may be discovered over the next few years, their annotation and complete integration into the human genome will remain a significant bottleneck [8]. Since genotyping

Page 10: Variation graphs and population assisted genome inference copy

The problem with the reference

• These differences create a failure of representation, for example:

• Some functional (transcribed) genes are either present in disabled form or absent from the current reference (e.g. some HLA genes)

• Reference Allele Bias: Mapping algorithms are intrinsically biased towards ignoring evidence of variants

• The current reference is largely derived from one individual, making it less suitable for the study of genomes that derive from other subpopulations

• In summary: the current reference genome has become an impediment to personal genomics
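The reference allele bias bullet can be illustrated with a toy model. This is a sketch under simplifying assumptions, not how any particular mapper works: a read maps only if it has at most `max_mismatches` differences from the reference, sequencing errors are independent with rate `error_rate`, and a read carrying the alternate allele always pays one mismatch at the SNP site. The read length of 35 echoes the 35 bp reads mentioned later in the deck; the other parameter values and the function names are illustrative.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); zero when k is negative."""
    if k < 0:
        return 0.0
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(min(k, n) + 1))


def mapping_rates(read_len=35, error_rate=0.01, max_mismatches=2):
    """Return (P(map | reference allele), P(map | alternate allele))."""
    # Reference-allele read: only sequencing errors create mismatches.
    p_ref = binom_cdf(max_mismatches, read_len, error_rate)
    # Alternate-allele read: the SNP uses up one mismatch, so the remaining
    # read_len - 1 bases may contain at most max_mismatches - 1 errors.
    p_alt = binom_cdf(max_mismatches - 1, read_len - 1, error_rate)
    return p_ref, p_alt


p_ref, p_alt = mapping_rates()
ref_fraction = p_ref / (p_ref + p_alt)
print(f"P(map | ref) = {p_ref:.4f}, P(map | alt) = {p_alt:.4f}")
print(f"observed reference fraction at a true 50/50 site = {ref_fraction:.4f}")
```

Even in this simplified setting, a truly heterozygous site yields more mapped reference reads than alternate reads, so the evidence for the variant is systematically undercounted.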

Page 11: Variation graphs and population assisted genome inference copy

The problem with the reference

RESEARCH, Open Access

The GENCODE pseudogene resource

Baikang Pei (1)†, Cristina Sisu (1,2)†, Adam Frankish (3), Cédric Howald (4), Lukas Habegger (1), Xinmeng Jasmine Mu (1), Rachel Harte (5), Suganthi Balasubramanian (1,2), Andrea Tanzer (6), Mark Diekhans (5), Alexandre Reymond (4), Tim J Hubbard (3), Jennifer Harrow (3) and Mark B Gerstein (1,2,7)*

* Correspondence: [email protected]. † Contributed equally. (1) Program in Computational Biology and Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520, USA

Pei et al. Genome Biology 2012, 13:R51. http://genomebiology.com/2012/13/9/R51

© 2012 Pei et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: Pseudogenes have long been considered as nonfunctional genomic sequences. However, recent evidence suggests that many of them might have some form of biological activity, and the possibility of functionality has increased interest in their accurate annotation and integration with functional genomics data.

Results: As part of the GENCODE annotation of the human genome, we present the first genome-wide pseudogene assignment for protein-coding genes, based on both large-scale manual annotation and in silico pipelines. A key aspect of this coupled approach is that it allows us to identify pseudogenes in an unbiased fashion as well as untangle complex events through manual evaluation. We integrate the pseudogene annotations with the extensive ENCODE functional genomics information. In particular, we determine the expression level, transcription-factor and RNA polymerase II binding, and chromatin marks associated with each pseudogene. Based on their distribution, we develop simple statistical models for each type of activity, which we validate with large-scale RT-PCR-Seq experiments. Finally, we compare our pseudogenes with conservation and variation data from primate alignments and the 1000 Genomes project, producing lists of pseudogenes potentially under selection.

Conclusions: At one extreme, some pseudogenes possess conventional characteristics of functionality; these may represent genes that have recently died. On the other hand, we find interesting patterns of partial activity, which may suggest that dead genes are being resurrected as functioning non-coding RNAs. The activity data of each pseudogene are stored in an associated resource, psiDR, which will be useful for the initial identification of potentially functional pseudogenes.

Background

Pseudogenes are defined as defunct genomic loci with sequence similarity to functional genes but lacking coding potential due to the presence of disruptive mutations such as frame shifts and premature stop codons [1–4]. The functional paralogs of pseudogenes are often referred to as parent genes. Based on the mechanism of their creation, pseudogenes can be categorized into three large groups: (1) processed pseudogenes, created by retrotransposition of mRNA from functional protein-coding loci back into the genome; (2) duplicated (also referred to as unprocessed) pseudogenes, derived from duplication of functional genes; and (3) unitary pseudogenes, which arise through in situ mutations in previously functional protein-coding genes [1,4–6].

Different types of pseudogenes exhibit different genomic features. Duplicated pseudogenes have intron-exon-like genomic structures and may still maintain the upstream regulatory sequences of their parents. In contrast, processed pseudogenes, having lost their introns, contain only exonic sequence and do not retain the upstream regulatory regions. Processed pseudogenes may preserve evidence of their insertion in the form of polyadenine features at their 3' end. These features of processed pseudogenes are shared with other genomic elements commonly known as retrogenes [7]. However, retrogenes differ from pseudogenes in that they have intact coding frames and encode functional proteins [8]. The composition of different types of pseudogenes varies among organisms [9]. In the human genome, processed pseudogenes are the most abundant type due to

Page 12: Variation graphs and population assisted genome inference copy

The problem with the reference

[Slide image: first page of Degner et al. (2009), "Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data", Bioinformatics 25(24):3207–3212, doi:10.1093/bioinformatics/btp579. The abstract reports that, at heterozygous SNPs, reads carrying the reference allele map at significantly higher rates than reads carrying the alternative allele; masking known SNP positions eliminated this reference bias, but ∼5–10% of SNPs still showed an inherent bias toward one allele.]

• These differences create a failure of representation, for example:

• Some functional (transcribed) genes are either present in disabled form or absent from the current reference (e.g. some HLA genes)

• Reference Allele Bias: Mapping algorithms are intrinsically biased towards ignoring evidence of variants

• The current reference is largely derived from one individual, making it less suitable for the study of genomes that derive from other subpopulations

• In summary: the current reference genome has become an impediment to personal genomics
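The Reference Allele Bias bullet above can be made concrete with a toy simulation. The sketch below is not the pipeline of Degner et al. (2009) or of any real aligner: the 20-base reference, the alternative allele, the read length, and the one-mismatch "mapper" are all hypothetical choices made for illustration. It enumerates every read that covers one heterozygous SNP, either error-free or with a single sequencing error elsewhere in the read, and counts how many reads from each allele pass the mismatch threshold, first against the plain reference and then against a reference with the SNP masked as `N`.

```python
"""Toy illustration of reference allele bias at a single heterozygous SNP.

All sequences and parameters are hypothetical; the 'mapper' is only a
Hamming-distance filter applied at the read's true position.
"""

REFERENCE = "ACGTTGCAAGCTTACGGATC"  # hypothetical 20-base reference
SNP_POS = 10                         # 0-based position of the heterozygous SNP
REF_ALLELE = REFERENCE[SNP_POS]      # allele present in the reference ("C")
ALT_ALLELE = "T"                     # hypothetical alternative allele
READ_LEN = 8                         # assumed read length
MAX_MISMATCHES = 1                   # assumed mapper tolerance

# Deterministic stand-in for a sequencing error: each base turns into another.
ERROR = {"A": "C", "C": "G", "G": "T", "T": "A"}


def mismatches(read, window):
    """Count positions where the read differs from the reference window."""
    return sum(1 for r, w in zip(read, window) if r != w)


def simulate_reads(allele):
    """Yield (start, read) for every read covering the SNP on one haplotype.

    For each start position this yields the error-free read, then one read
    per single-base error at every position other than the SNP itself.
    """
    haplotype = REFERENCE[:SNP_POS] + allele + REFERENCE[SNP_POS + 1:]
    first = max(0, SNP_POS - READ_LEN + 1)
    last = min(SNP_POS, len(REFERENCE) - READ_LEN)
    for start in range(first, last + 1):
        read = haplotype[start:start + READ_LEN]
        yield start, read
        for i in range(READ_LEN):
            if start + i != SNP_POS:
                yield start, read[:i] + ERROR[read[i]] + read[i + 1:]


def count_mapped(allele, reference):
    """Count simulated reads of one allele that pass the mismatch filter."""
    return sum(
        1
        for start, read in simulate_reads(allele)
        if mismatches(read, reference[start:start + READ_LEN]) <= MAX_MISMATCHES
    )


def mask_snp(reference, pos):
    """Return the reference with the SNP position replaced by 'N'."""
    return reference[:pos] + "N" + reference[pos + 1:]


if __name__ == "__main__":
    masked = mask_snp(REFERENCE, SNP_POS)
    for label, ref in (("plain", REFERENCE), ("masked", masked)):
        print(label,
              "ref allele mapped:", count_mapped(REF_ALLELE, ref),
              "alt allele mapped:", count_mapped(ALT_ALLELE, ref))
```

In this toy setting each allele contributes 64 simulated reads. Against the plain reference all 64 reference-allele reads pass the filter but only the 8 error-free alternative-allele reads do, because the SNP already uses up the single allowed mismatch. Masking the SNP balances the two alleles at 8 reads each, at the cost of discarding every read that also carries a sequencing error.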

Page 13: Variation graphs and population assisted genome inference copy

The problem with the reference

[Slide image: first page of Osoegawa et al. (2001), "A Bacterial Artificial Chromosome Library for Sequencing the Complete Human Genome", Genome Research 11:483–496, doi:10.1101/gr.169601. The abstract notes that DNA for the RPCI-11 BAC library, constructed as the intermediate substrate for the international genome sequencing effort, was obtained from a single anonymous volunteer.]

[Slide image: first page of Pei et al. (2012), "The GENCODE pseudogene resource", Genome Biology 13:R51, http://genomebiology.com/2012/13/9/R51. The abstract presents a genome-wide pseudogene annotation for protein-coding genes and reports that some pseudogenes show conventional characteristics of functionality or patterns of partial activity.]

• These differences create a failure of representation, for example:

• Some functional (transcribed) genes are either present in disabled form or absent from the current reference (e.g. some HLA genes)

• Reference Allele Bias: Mapping algorithms are intrinsically biased towards ignoring evidence of variants

• The current primary reference is largely derived from one individual, making it less suitable for the study of genomes that derive from other subpopulations

• In summary: the current reference genome has become an impediment to personal genomics

Page 14: Variation graphs and population assisted genome inference copy

The problem with the reference

• These differences create a failure of representation, for example:

• Some functional (transcribed) genes are either present in disabled form or absent from the current reference (e.g. some HLA genes)

• Reference Allele Bias: Mapping algorithms are intrinsically biased towards ignoring evidence of variants

• The current primary reference is largely derived from one individual, making it less suitable for the study of genomes that derive from other subpopulations

• In summary: the current primary reference genome is an imperfect lens for personal genomics

[10:50 28/11/2009 Bioinformatics-btp579.tex] Page: 3207 3207–3212

BIOINFORMATICS ORIGINAL PAPER Vol. 25 no. 24 2009, pages 3207–3212doi:10.1093/bioinformatics/btp579

Genome analysis

Effect of read-mapping biases on detecting allele-specificexpression from RNA-sequencing dataJacob F. Degner1,2,∗, John C. Marioni1,∗, Athma A. Pai1, Joseph K. Pickrell1,Everlyne Nkadori1,3, Yoav Gilad1,∗ and Jonathan K. Pritchard1,3,∗1Department of Human Genetics, 2Committee on Genetics, Genomics and Systems Biology and 3Howard HughesMedical Institute, University of Chicago, 920 E. 58th St., CLSC 507, Chicago, IL 60637, USAReceived on June 25, 2009; revised on September 17, 2009; accepted on September 30, 2009

Advance Access publication October 6, 2009

Associate Editor: Limsoon Wong

ABSTRACTMotivation: Next-generation sequencing has become an importanttool for genome-wide quantification of DNA and RNA. However,a major technical hurdle lies in the need to map short sequencereads back to their correct locations in a reference genome. Here,we investigate the impact of SNP variation on the reliability ofread-mapping in the context of detecting allele-specific expression(ASE).Results: We generated 16 million 35 bp reads from mRNA of eachof two HapMap Yoruba individuals. When we mapped these readsto the human genome we found that, at heterozygous SNPs, therewas a significant bias toward higher mapping rates of the allelein the reference sequence, compared with the alternative allele.Masking known SNP positions in the genome sequence eliminatedthe reference bias but, surprisingly, did not lead to more reliableresults overall. We find that even after masking, ∼ 5–10% of SNPsstill have an inherent bias toward more effective mapping of oneallele. Filtering out inherently biased SNPs removes 40% of the topsignals of ASE. The remaining SNPs showing ASE are enriched ingenes previously known to harbor cis-regulatory variation or knownto show uniparental imprinting. Our results have implications for avariety of applications involving detection of alternate alleles fromshort-read sequence data.Availability: Scripts, written in Perl and R, for simulating short reads,masking SNP variation in a reference genome and analyzing thesimulation output are available upon request from JFD. Raw shortread data were deposited in GEO (http://www.ncbi.nlm.nih.gov/geo/)under accession number GSE18156.Contact: [email protected]; [email protected];[email protected]; [email protected] information: Supplementary data are available atBioinformatics online.

1 INTRODUCTIONThere has been a great deal of recent interest in identifying genes forwhich the two alleles in an individual are expressed at different rates(Knight, 2004; Milani et al., 2009; Ronald et al., 2005; Wittkoppet al., 2008; Yan et al., 2002). At least two important biological

∗To whom correspondence should be addressed.

mechanisms can be uncovered through the identification of allele-specific expression (ASE). For example, studies investigating ASEhave uncovered both genes harboring cis-regulatory variation andimprinted genes that are epigenetically silenced in one copy but notthe other (Babak et al., 2008; Serre et al., 2008; Wang et al., 2008).

Recently developed sequencing technologies such as the IlluminaGenome Analyzer, Roche 454 GS FLX sequencer and AppliedBiosystems SOLiD sequencer have the potential to greatly improveour ability to detect ASE and to improve our understanding ofcis-regulatory variation and epigenetic imprinting. However, thedetection of ASE depends critically on accurate mapping of shortreads in the presence of sequence variation. Here, using RNA-Seq data from two HapMap individuals, along with simulationexperiments, we characterize the effects of individual SNPs on thequantification of expression levels. Our results are also relevantto other applications of next-generation sequencing, such as SNPdiscovery, expression QTL mapping and detection of allele-specificdifferences in transcription factor binding.

2 METHODS

2.1 RNA isolation and sequencingTotal RNA from two HapMap Yoruba lymphoblastoid cell lines (GM19238and GM19239) was extracted using an RNeasy Mini Kit (Qiagen,Valencia, CA) and assessed using an Agilent Bioanalyzer. mRNA wasthen isolated with Dyna1 oligo-dT beads (Invitrogen, Carlsbad, CA) from10 µg of total RNA. The mRNA was randomly fragmented using the RNAfragmentation kit from Ambion. First-strand cDNA synthesis was performedusing random primers and SuperScriptII reverse-transcriptase (Invitrogen,Carlsbad, CA). This was followed by second-strand cDNA synthesis usingDNA Polymerase I and RNaseH (Invitrogen, Carlsbad, CA).

The short cDNA fragments from each sample were prepared into a libraryfor Illumina sequencing. Briefly, the Illumina adaptor was ligated to theends of the double-stranded cDNA fragments and a 200 bp size selectionof the final product was performed by gel-excision, following the Illumina-recommended protocol. To create the final library, 200 bp cDNA templatemolecules with the adaptor attached were enriched by PCR. Sequencingwas performed on the Illumina Genome Analyzer II for 36 cycles (resultingin 35 bp reads after discarding the final base). The images taken duringthe sequencing reactions were processed using Illumina’s standard analysispipeline (v.1.3.2). Two lanes of a flow-cell were used for each individualyielding 15 579 717 and 16 780 153 total sequence reads for GM19238 andGM19239, respectively.

© The Author(s) 2009. Published by Oxford University Press.This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

at University of C

alifornia, Santa Cruz on Septem

ber 13, 2012http://bioinform

atics.oxfordjournals.org/D

ownloaded from

A Bacterial Artificial Chromosome Libraryfor Sequencing the Complete Human GenomeKazutoyo Osoegawa,1 Aaron G. Mammoser, Chenyan Wu,2 Eirik Frengen,3

Changjiang Zeng, Joseph J. Catanese,1,2 and Pieter J. de Jong1,2,4

Department of Cancer Genetics, Roswell Park Cancer Institute, Buffalo, New York 14263, USA

A 30-fold redundant human bacterial artificial chromosome (BAC) library with a large average insert size (178kb) has been constructed to provide the intermediate substrate for the international genome sequencing effort.The DNA was obtained from a single anonymous volunteer, whose identity was protected through adouble-blind donor selection protocol. DNA fragments were generated by partial digestion with EcoRI (librarysegments 1–4: 24-fold) and MboI (segment 5: sixfold) and cloned into the pBACe3.6 and pTARBAC1 vectors,respectively. The quality of the library was assessed by extensive analysis of 169 clones for rearrangements andartifacts. Eighteen BACs (11%) revealed minor insert rearrangements, and none was chimeric. This BAC library,designated as “RPCI-11,” has been used widely as the central resource for insert-end sequencing, clonefingerprinting, high-throughput sequence analysis and as a source of mapped clones for diagnostic andfunctional studies.

The sequence data described in this paper have been submitted to the GenBank data library under accessionnos. AQ936150–AQ936491.]

The main goal of the publicly funded human genomeproject is to completely determine the human genomicDNA sequence. Five large centers in the United Statesand the United Kingdom (the G5 group) along withthree smaller centers in France, Germany, and Japan(the G8 group) are the major contributors to the se-quencing effort. The initial draft version of the humanDNA sequence was completed on June 26, 2000, and ahigh-quality version will become accessible by 2003.The human genome project presents unique ethicaland political requirements with respect to the sourceDNA for library construction, because never before hasan individual’s genetic blueprint been decipheredcompletely. One or more volunteers were required todonate their DNA for the sequencing effort. Donor re-cruitment must comply with regulations (Botkin andGut 1996; Marshall 1996) to protect the individual’sinterests and requires informed consent. In addition, itis preferable to obtain the first human genome se-quence with the focus on the composition of genesacross the prototypical human genome rather than ex-ploring the diversity of genes across the human popu-lation. With only a few donors contributing to the pro-totype of the human genome, it is likely that the pro-totype will not be equally derived from all ethnic or

social groups. To avoid a willful bias with respect torepresentatives from one group or another, a double-blind donor selection protocol was desirable and wasformulated in compliance with the stated policies ofthe funding agencies (see http://www.nhgri.nih.gov:80/Grant_info/Funding/Statements/RFA/human_subjects.html).

Large-insert genomic DNA libraries in bacteria, such as bacterial artificial chromosome (BAC; Shizuya et al. 1992) and P1-derived artificial chromosome (PAC; Ioannou et al. 1994) libraries, provide a way to divide the complexity of the human genome into a composite of large DNA segments of reduced complexity. Ideally, BAC libraries should completely represent the genome without cloning artifacts or rearrangements and should be provided in an addressable format with clones physically separated. Libraries arrayed in microtiter dishes provide the opportunity for many researchers around the world to accumulate and use information on particular clones (Green and Olson 1990; Nizetic et al. 1991; Evans et al. 1992; Cohen et al. 1993; Marra et al. 1997; Zhao et al. 2000), thus permitting resource sharing through central repositories. BAC libraries are used as a source of substrates for shotgun sequencing projects, to create a database of end sequences (Mahairas et al. 1999; Zhao 2000; Zhao et al. 2000) and restriction fingerprints for building overlapping clone sets (contigs; Marra et al. 1997, 1999). BACs also provide scaffolding information for mapping sequence contigs to localized genomic regions by using a direct genomic shotgun sequencing approach (Adams et al. 2000; Hoskins et al. 2000). The BAC library (RPCI-11) described in this manuscript represents one of the

Present addresses: 1Children’s Hospital Oakland Research Institute, 747 Fifty-second Street, Oakland, CA 94609-1809, USA; 2Pfizer Global Research and Development, Alameda Laboratories, 1501 Harbor Bay Parkway, Alameda, CA 94502, USA; 3The Biotechnology Centre of Oslo, University of Oslo, N-0317 Oslo, Norway. 4Corresponding author. E-MAIL [email protected]; FAX (510) 450-7924. Article and publication are at www.genome.org/cgi/doi/10.1101/gr.169601.

Resource

Genome Research 11:483–496 ©2001 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/01 $5.00; www.genome.org


RESEARCH Open Access

The GENCODE pseudogene resource

Baikang Pei1†, Cristina Sisu1,2†, Adam Frankish3, Cédric Howald4, Lukas Habegger1, Xinmeng Jasmine Mu1, Rachel Harte5, Suganthi Balasubramanian1,2, Andrea Tanzer6, Mark Diekhans5, Alexandre Reymond4, Tim J Hubbard3, Jennifer Harrow3 and Mark B Gerstein1,2,7*

Abstract

Background: Pseudogenes have long been considered as nonfunctional genomic sequences. However, recent evidence suggests that many of them might have some form of biological activity, and the possibility of functionality has increased interest in their accurate annotation and integration with functional genomics data.

Results: As part of the GENCODE annotation of the human genome, we present the first genome-wide pseudogene assignment for protein-coding genes, based on both large-scale manual annotation and in silico pipelines. A key aspect of this coupled approach is that it allows us to identify pseudogenes in an unbiased fashion as well as untangle complex events through manual evaluation. We integrate the pseudogene annotations with the extensive ENCODE functional genomics information. In particular, we determine the expression level, transcription-factor and RNA polymerase II binding, and chromatin marks associated with each pseudogene. Based on their distribution, we develop simple statistical models for each type of activity, which we validate with large-scale RT-PCR-Seq experiments. Finally, we compare our pseudogenes with conservation and variation data from primate alignments and the 1000 Genomes project, producing lists of pseudogenes potentially under selection.

Conclusions: At one extreme, some pseudogenes possess conventional characteristics of functionality; these may represent genes that have recently died. On the other hand, we find interesting patterns of partial activity, which may suggest that dead genes are being resurrected as functioning non-coding RNAs. The activity data of each pseudogene are stored in an associated resource, psiDR, which will be useful for the initial identification of potentially functional pseudogenes.

Background

Pseudogenes are defined as defunct genomic loci with sequence similarity to functional genes but lacking coding potential due to the presence of disruptive mutations such as frame shifts and premature stop codons [1–4]. The functional paralogs of pseudogenes are often referred to as parent genes. Based on the mechanism of their creation, pseudogenes can be categorized into three large groups: (1) processed pseudogenes, created by retrotransposition of mRNA from functional protein-coding loci back into the genome; (2) duplicated (also referred to as unprocessed) pseudogenes, derived from duplication of functional genes; and (3) unitary pseudogenes, which arise through in situ mutations in previously functional protein-coding genes [1,4–6].

Different types of pseudogenes exhibit different genomic features. Duplicated pseudogenes have intron-exon-like genomic structures and may still maintain the upstream regulatory sequences of their parents. In contrast, processed pseudogenes, having lost their introns, contain only exonic sequence and do not retain the upstream regulatory regions. Processed pseudogenes may preserve evidence of their insertion in the form of polyadenine features at their 3’ end. These features of processed pseudogenes are shared with other genomic elements commonly known as retrogenes [7]. However, retrogenes differ from pseudogenes in that they have intact coding frames and encode functional proteins [8]. The composition of different types of pseudogenes varies among organisms [9]. In the human genome, processed pseudogenes are the most abundant type due to

* Correspondence: [email protected]

† Contributed equally

1Program in Computational Biology and Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520, USA

Full list of author information is available at the end of the article

Pei et al. Genome Biology 2012, 13:R51

http://genomebiology.com/2012/13/9/R51

© 2012 Pei et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Page 15: Variation graphs and population assisted genome inference copy

Alternate haplotypes

Page 16: Variation graphs and population assisted genome inference copy

Alternate haplotypes

GRCh38 is a graph!

Page 17: Variation graphs and population assisted genome inference copy

Human Genome Variation Graph Project

• Goals:

• Develop next generation human genetic reference that includes known variation from all human ethnic populations

• Provide tools to map, call, phase and represent genomes

Figure courtesy Kiran Garimella & Gil McVean

Page 18: Variation graphs and population assisted genome inference copy

Existing Variation is Fragmented

Variants associated with phenotype

Genome- and locus-specific variation databases

Sequencing projects

Human reference genome

Page 19: Variation graphs and population assisted genome inference copy

A Rosetta Stone for human genomics

Page 20: Variation graphs and population assisted genome inference copy

Merge diverse genomes into one graph

The major histocompatibility complex - Kiran Garimella & Gil McVean

Page 21: Variation graphs and population assisted genome inference copy

Zooming in, you see local structure

Page 22: Variation graphs and population assisted genome inference copy

At base level we assign unique position identifiers

Page 23: Variation graphs and population assisted genome inference copy

Variation Graphs – The Essentials

GTCCCAA

ACGTGG

ACTACCA

TTACTAC

Set of sequences (nodes)

Joins (edges) connect sides of sequences.

Page 24: Variation graphs and population assisted genome inference copy

Variation Graphs – The Essentials

GTCCCAAACGTGG TTACTAC

Joins can connect either side of a sequence (bidirected edges)

Walks encode DNA strings, with side of entry determining strand
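The two slides above can be illustrated with a minimal sketch. The node sequences are the ones shown on the slide; the joins, the side labels "L"/"R", and the function names are assumptions made up for this illustration, and this is not the vg data model.

```python
# Minimal bidirected sequence graph: nodes hold sequences, edges join node SIDES.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

# Node sequences from the slide.
nodes = {1: "GTCCCAA", 2: "ACGTGG", 3: "ACTACCA", 4: "TTACTAC"}

# A side is (node_id, "L") or (node_id, "R"). These joins are hypothetical.
edges = {
    frozenset([(1, "R"), (2, "L")]),
    frozenset([(2, "R"), (4, "L")]),
    frozenset([(1, "R"), (3, "R")]),  # joins a right side to a right side
    frozenset([(3, "L"), (4, "L")]),
}

def other_side(side):
    node, end = side
    return (node, "R" if end == "L" else "L")

def walk_to_dna(walk):
    """Spell the DNA string of a walk given as a list of ENTRY sides.

    Entering a node on its left side reads it forward; entering on its
    right side reads the reverse complement.
    """
    pieces = []
    for i, (node, end) in enumerate(walk):
        seq = nodes[node]
        pieces.append(seq if end == "L" else reverse_complement(seq))
        if i + 1 < len(walk):
            # The walk leaves through the opposite side and must follow a join.
            if frozenset([other_side((node, end)), walk[i + 1]]) not in edges:
                raise ValueError(f"no join after node {node}")
    return "".join(pieces)

forward_walk = [(1, "L"), (2, "L"), (4, "L")]
reversing_walk = [(1, "L"), (3, "R"), (4, "L")]  # node 3 is read on the reverse strand
print(walk_to_dna(forward_walk))
print(walk_to_dna(reversing_walk))
```

Storing each edge as an unordered pair of sides means the same join can be traversed in either direction, which is the property that makes the edges bidirected.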

Page 25: Variation graphs and population assisted genome inference copy

Essential operations on variation graphs

• To switch to variation graphs a complete ecosystem must be redeveloped

• “rebooting genomics” - Erik Garrison

Adapted from “Computational Pan-Genomics: Status, Promises and Challenges.” Computational Pan-Genomics Consortium. Briefings in Bioinformatics (2016)

[Figure labels: variation graph; another variation graph]

Page 26: Variation graphs and population assisted genome inference copy

Essential operations on variation graphs

• To switch to variation graphs a complete ecosystem must be redeveloped

Adapted from “Computational Pan-Genomics: Status, Promises and Challenges.” Computational Pan-Genomics Consortium. Briefings in Bioinformatics (2016)

[Figure labels: variation graph; another variation graph]

https://github.com/vgteam/vg

Page 27: Variation graphs and population assisted genome inference copy

Now lots of good genome graph development …

Page 28: Variation graphs and population assisted genome inference copy

Genome Graph Vignettes

• Read mapping

• Haplotypes vs. graphs

• Visualization

• Alleles and sites

• Variant calling

Page 29: Variation graphs and population assisted genome inference copy

Variation graph mapping GRCh38 alts in B-3106 from human MHC

Page 30: Variation graphs and population assisted genome inference copy

Simulation Study - Human

[Figure: two ROC panels, TPR vs. FPR (FPR from 1e−06 to 1e−02). Legend: aligner (bwa.mem.pe, bwa.mem.se, vg.pan.pe, vg.pan.se, vg.ref.pe, vg.ref.se); point size: number of reads. Points carry labels from 0 to 60.]

• 10 M reads from a genome with 1% error

• Subset of reads with >=1 match to non-primary ref match

Page 31: Variation graphs and population assisted genome inference copy

Simulation Study - Human

[Figure: ROC panel, TPR (0.95 to 1.00) vs. FPR (1e−06 to 1e−02). Legend: aligner (bwa.mem.pe, bwa.mem.se, vg.pan.pe, vg.pan.se, vg.ref.pe, vg.ref.se); point size: number of reads (250000 to 1000000). Points carry labels from 0 to 60.]

• 10 M reads from a genome with 1% error

• Subset of reads with >=1 match to non-primary ref match

Page 32: Variation graphs and population assisted genome inference copy

Human - Indel Mapping Bias Alleviated

[Figure, three panels. (a) alignment ROC curve: true positive rate vs. false positive rate (1e−05 to 1e−02) for aligners bwa.mem.pe, bwa.mem.se, vg.pan.pe, vg.pan.se, vg.ref.pe, vg.ref.se; point size: number of reads. (b) allele fraction vs variant size. (c) alternate allele fraction vs distance to nearest variant: x-axis, distance to nearest variant, binned from (0,1] to (1e+03,Inf]; y-axis, fraction of alternate allele (0.480 to 0.500); methods bwa.fb and vg.pan.]

Figure 3: (a) the ROC curves for 10M read pairs simulated from the human pangenome as mapped by bwa mem, vg with a linear genome reference, and vg with the same pangenome reference. Performance is shown for both single end (se) and pair end (pe) mapping. (b) the alternate allele fraction at heterozygous variants called by GIAB in NA24385 as a function of deletion or insertion size. (c) as in (b) but as a function of distance to the nearest non-reference variant.

and paired end alignment.

We then selected a real human genome read set from the Genome in a Bottle (GIAB) Consortium (cite: doi:10.1038/sdata.2016.25) to map to the 1000GP graph. This read set provides roughly 30X coverage (325,420,402 2x148bp Illumina HiSeq 2500 read pairs) for an Ashkenazi Jewish male designated NA24385,

Page 33: Variation graphs and population assisted genome inference copy

Mapping improvements differ by population

[Figure: MHC. x-axis: 1000 Genomes Super Population; y-axis: % Diff. in perfect map. primary vs. 1KG]

Page 34: Variation graphs and population assisted genome inference copy

Embedding Haplotypes

• Genome graphs do not encode linkage

[Figure: graph with nodes 1: 82 bp, 2: A, 3: G, 4: 38 bp, 5: C, 6: T, 7: 24 bp]

• To restrict linkage, natural solution is to duplicate paths:

[Figure: the same graph with an added duplicate node 4': 38 bp]

• But duplication creates mapping ambiguity
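A small sketch of why a graph alone does not encode linkage. The two sites (A/G and C/T) are taken from the node labels on the slide; the set of observed haplotypes is a hypothetical example, not data from the talk.

```python
from itertools import product

# Two biallelic sites, matching nodes 2/3 (A or G) and 5/6 (C or T) on the slide.
sites = [("A", "G"), ("C", "T")]

# Hypothetical population: only these two allele combinations were ever observed.
observed_haplotypes = {("A", "C"), ("G", "T")}

# Every source-to-sink walk picks one allele per site independently,
# so the graph admits all combinations.
graph_walks = set(product(*sites))

# Walks that are valid in the graph but unsupported by any observed haplotype.
unobserved_walks = graph_walks - observed_haplotypes

print(len(graph_walks), "walks in the graph")
print(sorted(unobserved_walks), "are recombinants never seen in the population")
```

With n independent biallelic sites the graph admits 2^n walks, while the number of observed haplotypes is bounded by the cohort size, which is why the linkage information has to be stored separately.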

Page 35: Variation graphs and population assisted genome inference copy

Embedding Haplotypes

[Figure: graph with nodes 1: 82 bp, 2: A, 3: G, 4: 38 bp, 5: C, 6: T, 7: 24 bp]

• Instead maintain projection from haplotypes to graph:

[Figure: node labels 1: 82 bp, 1': 82 bp, 2: A, 3: G, 4: 38 bp, 4': 38 bp, 5: C, 6: T, 7: 24 bp, 7': 24 bp]

• The question then becomes how to encode this projection?

Page 36: Variation graphs and population assisted genome inference copy

Embedding Haplotypes

• The Graph Positional Burrows Wheeler Transform (gPBWT)

From Novak et al, “A Graph Extension of the Positional Burrows-Wheeler Transform and its Applications (PBWT)”, WABI 2016

counting of the number of threads in T that contain a given new thread as a subthread. Figure 2 and Table 1 give a worked example.

[Figure: threads visiting side 0 and entering their next nodes on sides 1, 2, or 3, with the corresponding B0[] array]

Fig. 1. An illustration of the B0[] array for a single side numbered 0. Threads visiting this side may enter their next nodes on sides 1, 2, or 3. The B0[] array records, for each visit of a thread to side 0, the side on which it enters its next node. This determines through which of the available edges it should leave the current node. Because threads tend to be similar to each other, they are likely to run in “ribbons” of multiple threads that both enter and leave together. These ribbons cause the Bs[] arrays to contain runs of identical values, which may be compressed.

4 Extracting Threads

To reproduce T from G and the gPBWT, consider each side s in G in turn. Establish how many threads begin (or, equivalently, end) at s by taking the minimum of c(x, s) for all sides x adjacent to s. If s has no incident edges, take the length of Bs[] instead. Call this number b. Then, for i running from 0 to b, exclusive, begin a new thread at n(s) with the sides [s, s̄], where s̄ is the opposite side of the same node. Next, we traverse from n(s) to the next node. Consult the Bs[i] entry. If it is the null side, stop traversing, yield the thread, and start again from the original node s with the next i value less than b. Otherwise, traverse to side s′ = Bs[i]. Calculate the arrival index i′ as c(s̄, s′) plus the number of entries in Bs[] before entry i that are also equal to s′. This gives the index in s′ of the thread being extracted. Then append s′ and s̄′ to the growing thread, and repeat the traversal process with i ← i′ and s ← s′, until the end of the thread is reached.

bioRxiv preprint first posted online May 2, 2016; doi: http://dx.doi.org/10.1101/051409. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.

gPBWTk[]

• Reversible, compressible, enables efficient indexed queries
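The thread-extraction procedure quoted on this slide can be sketched on a toy example. This is a simplified illustration under stated assumptions, not the authors' implementation: threads are stored in one orientation only, as lists of entry sides; a side is a hypothetical (node, end) tuple; the B arrays are built naively by sorting visit histories instead of using the paper's construction; and no compression or rank structure is used.

```python
from collections import defaultdict

def opposite(side):
    node, end = side
    return (node, 1 - end)

def build_gpbwt(threads):
    """Build B[s] (next entry side per visit) and c[(x, s)] from threads.

    Visits to a side are ordered with thread starts first, then by the side
    they arrived from, then by their order at the previous node. Sorting on
    the reversed history of exit sides gives that order.
    """
    visits = defaultdict(list)
    for t, thread in enumerate(threads):
        for j, side in enumerate(thread):
            history = tuple(opposite(p) for p in reversed(thread[:j]))
            nxt = thread[j + 1] if j + 1 < len(thread) else None  # None = null side
            visits[side].append((history, t, nxt))
    B, c = {}, {}
    for side, vs in visits.items():
        vs.sort(key=lambda v: (v[0], v[1]))
        B[side] = [nxt for _, _, nxt in vs]
        for x in {h[0] for h, _, _ in vs if h}:
            # Visits at `side` ordered before the block arriving from x.
            c[(x, side)] = sum(1 for h, _, _ in vs if not h or h[0] < x)
    return B, c

def extract_threads(B, c):
    """Recover all threads from B and c, following the quoted procedure."""
    threads = []
    for s in sorted(B):
        arrivals = [count for (x, to), count in c.items() if to == s]
        b = min(arrivals) if arrivals else len(B[s])  # threads beginning at s
        for i in range(b):
            side, idx, thread = s, i, [s]
            while B[side][idx] is not None:
                nxt = B[side][idx]
                # Arrival index: block offset plus rank of nxt among earlier entries.
                idx = c[(opposite(side), nxt)] + B[side][:idx].count(nxt)
                side = nxt
                thread.append(side)
            threads.append(thread)
    return threads

# Toy threads over nodes 1..4, always entering on end 0 (hypothetical data).
toy_threads = [
    [(1, 0), (2, 0), (4, 0)],
    [(1, 0), (3, 0), (4, 0)],
    [(1, 0), (2, 0), (4, 0)],
    [(2, 0), (4, 0)],
]
B, c = build_gpbwt(toy_threads)
recovered = extract_threads(B, c)
print(B[(1, 0)])
print(sorted(recovered) == sorted(toy_threads))
```

The extraction touches only the B arrays and the c counts, which is what makes the structure reversible; the repeated values inside each B array are the runs that the figure caption says may be compressed.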

Page 37: Variation graphs and population assisted genome inference copy

gPBWT Performance

• Experiment:

• chr22

• 50,818,468 bp

• 5004 Haplotypes

• Result:

• 356 MB gPBWT + vg graph

• 0.011 bits per base - 200x compression

• ~336 GB for whole genome w/80 million point variants @ 100,000 diploid genomes
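The bits-per-base figure can be checked from the numbers on the slide. The sketch assumes 1 MB = 10^6 bytes and, for the compression ratio, a 2-bits-per-base baseline; the slide states neither, so both are assumptions.

```python
# Numbers taken from the slide.
chr22_length_bp = 50_818_468
num_haplotypes = 5004
index_size_bytes = 356 * 10**6  # assumption: 1 MB = 10**6 bytes

# Total number of haplotype bases represented by the index.
total_bases = chr22_length_bp * num_haplotypes

bits_per_base = index_size_bytes * 8 / total_bases

# Assumed baseline: a plain 2-bit encoding of A, C, G, T.
compression_vs_2bit = 2 / bits_per_base

print(f"{bits_per_base:.4f} bits per base")
print(f"{compression_vs_2bit:.0f}x smaller than 2 bits per base")
```

Under these assumptions the ratio comes out somewhat below 200x, so the slide's figure is best read as an order-of-magnitude statement or as using a different baseline.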

Page 38: Variation graphs and population assisted genome inference copy

Embedding Haplotypes

• Tube Maps

Wolfgang Beyer

Page 39: Variation graphs and population assisted genome inference copy

Embedding Haplotypes

Prototype: Wolfgang Beyer https://vgteam.github.io/sequenceTubeMap/

Page 40: Variation graphs and population assisted genome inference copy

Haplotype Probabilities

• Li & Stephens: Efficiently compute P(h|H), where h is haplotype and H is population

“Li and Stephens” on sequence graphs

Li and Stephens: sequences h are generated by walks x across the space of all haplotypes H

[Figure: a walk x across the haplotypes in H generating the sequence h]
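A minimal sketch of how P(h|H) can be computed with the forward algorithm under a Li and Stephens style copying model on linear haplotypes. The per-site switch probability `rho`, the mismatch probability `mu`, the restriction to biallelic "0"/"1" sites, and the function name are all assumptions for illustration; this is not the graph version described in the talk.

```python
from itertools import product

def li_stephens_probability(h, H, rho=0.01, mu=0.001):
    """P(h | H): h is copied from the panel H with switches and mismatches.

    h: string over {"0", "1"}; H: list of equal-length strings over {"0", "1"}.
    rho: probability of re-choosing the copied haplotype between adjacent sites.
    mu: probability that the copied allele is emitted incorrectly.
    """
    k = len(H)

    def emit(j, i):
        return 1 - mu if H[j][i] == h[i] else mu

    # forward[j] = P(h[:i+1], currently copying panel haplotype j)
    forward = [emit(j, 0) / k for j in range(k)]
    for i in range(1, len(h)):
        total = sum(forward)
        forward = [
            ((1 - rho) * forward[j] + rho * total / k) * emit(j, i)
            for j in range(k)
        ]
    return sum(forward)

panel = ["000", "111"]  # hypothetical reference panel
p_panel_copy = li_stephens_probability("000", panel)
p_recombinant = li_stephens_probability("011", panel)
p_total = sum(
    li_stephens_probability("".join(bits), panel)
    for bits in product("01", repeat=3)
)
print(p_panel_copy, p_recombinant, p_total)
```

The forward recursion costs O(k) per site because the switch term shares one sum over all panel haplotypes, which is what makes the computation efficient for large panels.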

Page 41: Variation graphs and population assisted genome inference copy

Haplotype Probabilities

• Graph Li & Stephens: Efficiently compute P(x|H), where x is haplotype walk in a genome graph

“Li and Stephens” on sequence graphs

Li and Stephens: sequences h are generated by walks x across the space of all haplotypes

Our model: sequences h are generated by walks x through G which follow segments of the haplotypes in H

[Figure: a walk x through the graph consistent with h, following segments of haplotypes g1, g2, g3 ∈ H]

Page 42: Variation graphs and population assisted genome inference copy

Haplotype Probabilities

• Applied to vg mapped reads:

Non recombinants, 90%

Single recombinants, 9%

Double recombinants, 1%

Page 43: Variation graphs and population assisted genome inference copy

What’s a site and an allele in a genome graph?

What’s a site and an allele in a variation graph?

• Use subgraph decomposition to find subgraphs with a single source and a single sink; the set of paths through each such subgraph are the alleles

[Figure: a bubble and a superbubble, with nodes labeled A, T, and C]
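Once a single-source, single-sink subgraph has been found, its alleles can be listed by enumerating the paths between source and sink. The sketch below assumes an acyclic subgraph whose source and sink are already known; the toy graph, node names, and function name are hypothetical, and finding the subgraphs themselves (the decomposition step) is not shown.

```python
def enumerate_alleles(sequences, successors, source, sink):
    """Return the set of allele strings spelled by source-to-sink paths.

    The source and sink sequences are flanking context and are excluded,
    so a direct source-to-sink edge yields the empty (deletion) allele.
    Assumes the subgraph is acyclic.
    """
    alleles = set()

    def extend(node, spelled):
        if node == sink:
            alleles.add(spelled)
            return
        for nxt in successors.get(node, []):
            extend(nxt, spelled + ("" if nxt == sink else sequences[nxt]))

    extend(source, "")
    return alleles

# Hypothetical site: a SNP (T or C) plus a deletion that skips both.
sequences = {"src": "A", "t": "T", "c": "C", "snk": "G"}
successors = {"src": ["t", "c", "snk"], "t": ["snk"], "c": ["snk"]}

site_alleles = enumerate_alleles(sequences, successors, "src", "snk")
print(sorted(site_alleles))
```

Treating every path as an allele means nested or overlapping variants inside one superbubble are reported as combined alleles of a single site.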

Page 44: Variation graphs and population assisted genome inference copy

A haplotype phasing pipeline

Read mapping

Variant calling

Haplotype phasing

Known population information

Population Assisted Variant Calling

Generative model

H: Population of known haplotypes

h1, h2: Diploid genome

R: Read data

Haplotype likelihood

Read likelihood

genome posterior probability
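One way to write down the generative model named on this slide, as a sketch: the factorization below assumes that the two haplotypes are drawn independently given the population H and that the reads depend on H only through the diploid genome. The slide labels the terms but does not give the formula, so this is an assumption.

```latex
% Posterior over the diploid genome (h_1, h_2) given reads R and population H.
% P(R | h_1, h_2): read likelihood.  P(h_i | H): haplotype likelihood.
\[
  P(h_1, h_2 \mid R, H) \;\propto\;
  \underbrace{P(R \mid h_1, h_2)}_{\text{read likelihood}}
  \;\underbrace{P(h_1 \mid H)\, P(h_2 \mid H)}_{\text{haplotype likelihood}}
\]
```

In this form the population enters only through the haplotype likelihood, which acts as the prior on the genome being inferred.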

Page 45: Variation graphs and population assisted genome inference copy

Genome Variation Graphs Summary

• A shared reference graph will provide a single canonical naming scheme for human variants: either it is already a (named) path in the graph, or it is a new canonically named augmentation

• A better prior: Clear benefits for simplifying and improving read mapping and variant calling - could ultimately lower cost of genome inference

• Additional haplotype data can be embedded (gPBWT)

• The natural reference is a population cohort - we should build a public cohort for hundreds of thousands of individuals - let’s change the culture of de-identified sharing

• True population assisted genome inference is coming

• Still many open problems: repeatome, annotations, RNA

Page 46: Variation graphs and population assisted genome inference copy

Thanks!

UCSC

Adam Novak

Glenn Hickey

Sean Blum

Yohei Rosen

Jordan Eizenga

Wolfgang Beyer

Karen Hayden

David Haussler

Team VG:

Erik Garrison

Eric Dawson

Mike Lin

Jouni Siren

(and many more)

GA4GH ref-var group:

Andres Kahles

Ben Murray

Goran Rakocevic

Alex Dilthey

Sarah Guthrie

Jerome Kelleher

Heng Li

Stephen Keenan

Richard Durbin

Gil McVean

Opportunities: https://cgl.genomics.ucsc.edu/ [email protected]