Polymorphism Structure of the Human Genome
Gabor T. Marth
Department of Biology, Boston College, Chestnut Hill, MA 02467
Human variation structure is heterogeneous
chromosomal averages
polymorphism density along chromosomes
Heterogeneity at the level of distributions
[Figure: marker density, from “sparse” to “dense”; axis labels 4 kb, 8 kb, 12 kb, 16 kb]
[Figure: allele frequency, from “rare” to “common”; classes 1 to 10]
What explains nucleotide diversity?
G+C nucleotide content [figure: SNP rate (per 10,000 bp) vs. G+C content (%)]
CpG di-nucleotide content [figure: SNP rate (per 10,000 bp) vs. CpG content (%)]
recombination rate [figure: SNP rate (per 10,000 bp) vs. recombination rate (per Mb)]
functional constraints
3’ UTR            5.00 x 10^-4
5’ UTR            4.95 x 10^-4
Exon, overall     4.20 x 10^-4
Exon, coding      3.77 x 10^-4
  synonymous      366 / 653
  non-synonymous  287 / 653
Variance is so high that these quantities are poor predictors of nucleotide diversity in local regions. Hence random processes, i.e. (random) genetic drift, are likely to govern the basic shape of the genome variation landscape.
Components of drift: Genealogy
present generation
in a randomly mating population, the genealogy evolves in a non-deterministic fashion
Components of drift: Mutation
mutations randomly “drift”: they die out, rise to higher frequency, or become fixed
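The three possible fates of a drifting mutation can be illustrated with a minimal Wright-Fisher sketch. The model choice, the population size of 50 diploid individuals, and the function name `drift_trajectory` are assumptions made for illustration; they do not come from the slides.

```python
import random

def drift_trajectory(pop_size, start_count=1, max_generations=100_000, rng=None):
    """Follow one allele among 2N gene copies until it is lost or fixed.

    Assumes a neutral Wright-Fisher population of constant size
    (an illustrative model, not taken from the slides).
    """
    rng = rng or random.Random()
    copies = 2 * pop_size
    count = start_count
    trajectory = [count / copies]
    for _ in range(max_generations):
        if count == 0 or count == copies:
            break  # allele has died out or become fixed
        freq = count / copies
        # Each gene copy of the next generation draws its parent at random,
        # so the new count is a binomial sample with success probability freq.
        count = sum(1 for _ in range(copies) if rng.random() < freq)
        trajectory.append(count / copies)
    return trajectory

rng = random.Random(1)
outcomes = [drift_trajectory(50, rng=rng)[-1] for _ in range(200)]
lost = outcomes.count(0.0)
fixed = outcomes.count(1.0)
print(f"lost: {lost}, fixed: {fixed}, still segregating: {200 - lost - fixed}")
```

Under this model a new mutation starts at frequency 1/(2N), so most replicates end in loss and only a small minority reach fixation.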
Modulators: Changing population size
genetic bottleneck
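A rough way to see why a bottleneck matters: in a Wright-Fisher population of N diploid individuals, drift is expected to remove a fraction 1/(2N) of the heterozygosity each generation, so a short period at small N removes far more diversity than a long period at large N. The sketch below applies that standard result. The population sizes are hypothetical, and new mutations are ignored for simplicity.

```python
def expected_heterozygosity(h0, pop_sizes):
    """Expected heterozygosity after each generation of drift.

    pop_sizes lists the diploid population size in each generation.
    New mutations are ignored (a simplifying assumption).
    """
    h = h0
    history = [h]
    for n in pop_sizes:
        h *= 1.0 - 1.0 / (2.0 * n)  # drift removes a fraction 1/(2N) per generation
        history.append(h)
    return history

# Hypothetical histories spanning 300 generations each
constant = [10_000] * 300
bottleneck = [10_000] * 100 + [100] * 100 + [10_000] * 100

constant_final = expected_heterozygosity(1.0, constant)[-1]
bottleneck_final = expected_heterozygosity(1.0, bottleneck)[-1]
print(f"constant size: {constant_final:.3f}")
print(f"bottleneck:    {bottleneck_final:.3f}")
```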
Modulators: Population subdivision
subdivision
subdivision promotes private polymorphisms and skews allele frequencies
Modulators: Recombination
accgttatgcaga acagttatgtaga
acagttatgcaga
accgttatgtaga accgttatgcaga acagttatgtaga
recombination
different nucleotide sites within the same DNA segment no longer share the same genealogy
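The sequences on this slide can be used to sketch a single crossover. The two starting haplotypes differ at two sites, and exchanging their right-hand parts at any point between those sites yields the two new allele combinations. The function names and the breakpoint position 5 are assumptions chosen for illustration.

```python
def variable_sites(haplotypes):
    """Return the 0-based positions at which the haplotypes differ."""
    return [i for i, column in enumerate(zip(*haplotypes)) if len(set(column)) > 1]

def crossover(hap_a, hap_b, breakpoint):
    """Exchange the segments to the right of `breakpoint` between two haplotypes."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must have equal length")
    return (hap_a[:breakpoint] + hap_b[breakpoint:],
            hap_b[:breakpoint] + hap_a[breakpoint:])

parents = ("accgttatgcaga", "acagttatgtaga")   # sequences from the slide
sites = variable_sites(parents)                # the two polymorphic positions
recombinants = crossover(*parents, 5)          # breakpoint 5 is an assumed position

print(sites)
print(recombinants)
```

After the exchange, the segment to the left of the breakpoint and the segment to the right trace back to different parental chromosomes, which is why the two sites no longer share one genealogy.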
Modulators: Natural selection
negative (purifying) selection
positive selection
the genealogy is no longer independent of (and hence cannot be decoupled from) the mutation process
Modeling ancestral processes
“forward simulations” vs. the “coalescent” process
By focusing on a small sample, the complexity of the relevant part of the ancestral process is greatly reduced. There are, however, limitations.
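To make the reduction in complexity concrete, here is a minimal sketch of the standard coalescent for a sample of n chromosomes from a constant-size, randomly mating population. Only the sample's lineages are tracked backward in time; the rest of the population never appears. Time is measured in units of 2N generations, and all names and parameter values are illustrative assumptions.

```python
import random

def coalescent_times(sample_size, rng):
    """Waiting times (units of 2N generations) while k = n, n-1, ..., 2 lineages remain."""
    times = []
    for k in range(sample_size, 1, -1):
        rate = k * (k - 1) / 2               # number of lineage pairs that can coalesce
        times.append(rng.expovariate(rate))  # exponential waiting time with that rate
    return times

def tmrca(sample_size, rng):
    """Time to the most recent common ancestor of the whole sample."""
    return sum(coalescent_times(sample_size, rng))

rng = random.Random(42)
n = 10
replicates = 20_000
mean_tmrca = sum(tmrca(n, rng) for _ in range(replicates)) / replicates
expected = 2 * (1 - 1 / n)   # theoretical mean under the same assumptions
print(round(mean_tmrca, 2), expected)
```

Each replicate needs only n - 1 random draws, regardless of N, whereas a forward simulation has to follow every individual in every generation.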
Inferences from variation data
larger population size (N) -> more mutations -> higher diversity (θ)
larger mutation rate (μ) -> more mutations -> higher diversity (θ)
higher diversity -> larger population size OR higher mutation rate (θ = 4Nμ)
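The confounding of N and μ in θ = 4Nμ can be checked with a few lines of arithmetic. The parameter values below are hypothetical. `watterson_theta` is the standard Watterson estimator of per-site θ from the number of segregating sites; it is added here as an illustration and is not mentioned on the slides.

```python
import math

def theta(pop_size, mutation_rate):
    """Diversity parameter theta = 4 * N * mu (per site, per generation)."""
    return 4 * pop_size * mutation_rate

def watterson_theta(segregating_sites, sample_size, sequence_length):
    """Watterson's estimate of per-site theta (standard estimator, not from the slides)."""
    a_n = sum(1 / i for i in range(1, sample_size))  # harmonic number up to n - 1
    return segregating_sites / (a_n * sequence_length)

# Two hypothetical scenarios that give the same diversity
a = theta(10_000, 2e-8)   # smaller population, higher mutation rate
b = theta(20_000, 1e-8)   # larger population, lower mutation rate
print(a, b, math.isclose(a, b))
```

Because only the product Nμ enters θ, diversity data alone cannot separate the two scenarios; an independent estimate of μ or N is needed.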
Ancestral inference: modeling
[Figure: candidate population histories, past to present (stationary, expansion, collapse, bottleneck), each shown with its marker density (MD, by simulation) and allele frequency spectrum (AFS, direct form)]
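For the stationary history, the expected allele frequency spectrum has a simple closed form under the standard neutral model: the proportion of SNPs with derived-allele count i is proportional to 1/i. The sketch below computes that spectrum and folds it into minor-allele counts. The sample size of 20 chromosomes is an assumption, chosen so that minor-allele counts run from 1 to 10. The direct forms for the expansion, collapse, and bottleneck histories are not reproduced here.

```python
def stationary_afs(sample_size):
    """Expected proportion of SNPs with derived-allele count i = 1..n-1.

    Assumes a constant-size, neutral population (proportions scale as 1/i).
    """
    weights = [1 / i for i in range(1, sample_size)]
    total = sum(weights)
    return [w / total for w in weights]

def fold(spectrum):
    """Merge derived-allele counts i and n - i into minor-allele count classes."""
    n = len(spectrum) + 1
    folded = []
    for i in range(1, n // 2 + 1):
        j = n - i
        if i == j:
            folded.append(spectrum[i - 1])  # middle class has no partner
        else:
            folded.append(spectrum[i - 1] + spectrum[j - 1])
    return folded

folded_afs = fold(stationary_afs(20))   # 20 chromosomes is an assumed sample size
for count, proportion in enumerate(folded_afs, start=1):
    print(count, round(proportion, 3))
```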
Ancestral inference: model fitting
[Figure: distribution of minor allele count (1 to 10)]
bottleneck
modest but uninterrupted expansion
Allelic association
accgttatgcaga
acagttatgtaga
acagttatgcaga
accgttatgtaga
possible allele combinations (2-marker haplotypes)
higher recombination rate (r)
Allelic association: LD
[Figure: r² plotted against recombination fraction and distance (kb); series: European, Asian, African American]
measure of allelic association: “linkage disequilibrium (LD)”
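One common LD measure, r², can be computed directly from 2-marker haplotype counts as D² / (p_A (1 - p_A) p_B (1 - p_B)), where D = p_AB - p_A p_B. The sketch below uses the two variable sites of the example sequences (alleles c/a at the first site and c/t at the second). The sample compositions are hypothetical and chosen only to show the two extremes.

```python
from collections import Counter

def r_squared(haplotypes):
    """r^2 between two biallelic markers.

    haplotypes: list of 2-character strings, one allele per marker.
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    a, b = haplotypes[0][0], haplotypes[0][1]   # use the first haplotype's alleles as reference
    p_a = sum(c for h, c in counts.items() if h[0] == a) / n
    p_b = sum(c for h, c in counts.items() if h[1] == b) / n
    p_ab = counts[a + b] / n
    d = p_ab - p_a * p_b                        # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both markers must be polymorphic in the sample")
    return d * d / denom

# Hypothetical samples
no_recombinants = ["cc"] * 5 + ["at"] * 5            # only two of four combinations present
all_combinations = ["cc", "at", "ac", "ct"] * 5      # all four combinations, equally common

print(r_squared(no_recombinants))
print(r_squared(all_combinations))
```

With only two of the four combinations present the markers are perfectly associated; once recombination has produced all four in equal proportion the association vanishes.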
Haplotype structure
“haplotype block”