![Page 1: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics](https://reader036.vdocuments.us/reader036/viewer/2022062504/5a4d1b867f8b9ab0599bc94b/html5/thumbnails/1.jpg)
Characterizing the short tandem repeat mutation process at every locus in the genome
Melissa Gymrek, Genome Informatics 2015
@mgymrek
![Page 2: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics](https://reader036.vdocuments.us/reader036/viewer/2022062504/5a4d1b867f8b9ab0599bc94b/html5/thumbnails/2.jpg)
Genetic variation comes in many forms
SNP
ACGACTCGAGCG
ACGACACGAGCG
μSNP: 1.20 × 10⁻⁸ /loc/gen
Short indel (1-20bp)
ACGACTCGAGCG
ACGAC-CGAGCG
μINDEL: 0.68 × 10⁻⁹ /loc/gen
Short tandem repeat
CAGCAG---CAGCAGCA
CAGCAGCAGCAGCAGCA
μSTR: 10⁻² to 10⁻⁵ /loc/gen
Alu retrotransposition
Struct. Var / CNV (>20bp)
[Bar chart: # de novo/gen (axis 0 to 500). STR 500, SNP 50, Indel 3, SV 0.2, Alu 0.05]
Intro. STR catalog PSMC Mutation process Conclusion
10/29/15 Melissa Gymrek, Genome Informatics 2015
![Page 3: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics](https://reader036.vdocuments.us/reader036/viewer/2022062504/5a4d1b867f8b9ab0599bc94b/html5/thumbnails/3.jpg)
eSTRs contribute to gene expression variability
[Figure: observed p-value [-log10] vs. expected p-value under the null [-log10]]
[Schematic: Gene, (TG) STR, Expression]
![Page 4: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics](https://reader036.vdocuments.us/reader036/viewer/2022062504/5a4d1b867f8b9ab0599bc94b/html5/thumbnails/4.jpg)
Why study the STR mutation process?
1. Identify rapidly mutating STRs
2. Understand biological processes driving mutation patterns
3. Identify STRs under selective pressure
Haasl and Payseur 2013
H0: Locus evolves under neutral model
H1: Locus is under selection
![Page 5: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics](https://reader036.vdocuments.us/reader036/viewer/2022062504/5a4d1b867f8b9ab0599bc94b/html5/thumbnails/5.jpg)
STRs and SNPs provide orthogonal molecular clocks
TIME
Clock 1: SNPs
ACCCATCCTAGCTACCGACTACAACG
ACCGATCCTAGCTTCCGACTACCACG
# mismatches ~ f(μSNP, t, …)
Clock 2: STRs
ACACTCATCTG(CAG)mACACACTGA
ACACTCATCTG(CAG)nACACACTGA
(m-n)² ~ f(μSTR, t, …)
Use known value of μSNP to calibrate the STR molecular clock
μSNP: SNP mutation rate (/loc/gen)
μSTR: STR mutation rate (/loc/gen)
t: Time to the most recent common ancestor (TMRCA)
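As a rough illustration of the two clock statistics on this slide, the sketch below counts mismatches between the two SNP haplotypes shown and computes the squared repeat-count difference for a pair of STR alleles. The function names and the repeat counts (m = 12, n = 9) are illustrative assumptions, not values from the talk.

```python
def count_mismatches(seq_a: str, seq_b: str) -> int:
    """Clock 1: number of positions where two aligned haplotypes differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("haplotypes must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def squared_repeat_difference(m: int, n: int) -> int:
    """Clock 2: squared difference in repeat copy number, (m - n)^2."""
    return (m - n) ** 2


# The two SNP haplotypes shown on the slide.
hap1 = "ACCCATCCTAGCTACCGACTACAACG"
hap2 = "ACCGATCCTAGCTTCCGACTACCACG"
snp_clock = count_mismatches(hap1, hap2)

# Hypothetical repeat counts for (CAG)m and (CAG)n; not from the talk.
str_clock = squared_repeat_difference(12, 9)

print(snp_clock, str_clock)
```

Both statistics grow with t; the SNP count has a known rate, which is what allows it to calibrate the STR clock.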
![Page 6: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics](https://reader036.vdocuments.us/reader036/viewer/2022062504/5a4d1b867f8b9ab0599bc94b/html5/thumbnails/6.jpg)
Estimating STR mutation parameters from WGS
300 high coverage SGDP whole genomes
Diploid locus: CAGm / CAGn
STR calls
SNPs → PSMC (Li and Durbin 2011) → TMRCA
Infer locus-specific mutation params. from STR calls and TMRCA
[Plots: (m-n)² vs. TMRCA; frequency vs. step size]
Learn model to predict mutation parameters from sequence features
![Page 7: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics](https://reader036.vdocuments.us/reader036/viewer/2022062504/5a4d1b867f8b9ab0599bc94b/html5/thumbnails/7.jpg)
We are now armed with deep WGS amenable to STR profiling
SGDP: 300 deeply sequenced, PCR-free genomes with diverse origins
![Page 8: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics](https://reader036.vdocuments.us/reader036/viewer/2022062504/5a4d1b867f8b9ab0599bc94b/html5/thumbnails/8.jpg)
Generating high quality STR genotypes
Sample 1, Sample 2, ..., Sample n
FASTQ → Alignment (BWA-MEM) → BAM (per sample)
Allelotype (multi-sample): lobSTR
VCF
Filtering
![Page 9: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics](https://reader036.vdocuments.us/reader036/viewer/2022062504/5a4d1b867f8b9ab0599bc94b/html5/thumbnails/9.jpg)
High coverage genomes provide accurate STR genotypes
Homopolymers (e.g. AAAAAA) (n=50,398)
R²=0.92
93% concordance with capillary data
Accurately recover population structure
http://strcat.teamerlich.org/
![Page 10: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics](https://reader036.vdocuments.us/reader036/viewer/2022062504/5a4d1b867f8b9ab0599bc94b/html5/thumbnails/10.jpg)
Estimating STR mutation parameters from WGS
300 high coverage SGDP samples
Diploid locus: CAGm / CAGn
STR calls
SNPs → PSMC (Li and Durbin 2011) → TMRCA
Infer locus-specific mutation params. from STR calls and TMRCA
[Plots: (m-n)² vs. TMRCA; frequency vs. step size]
Learn model to predict mutation parameters from sequence features
![Page 11: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics](https://reader036.vdocuments.us/reader036/viewer/2022062504/5a4d1b867f8b9ab0599bc94b/html5/thumbnails/11.jpg)
Measuring TMRCA using PSMC
[Figure: discretized TMRCA along the genome (Li and Durbin, Nature 2011)]
Maternal chromosome
Paternal chromosome
CAGm
CAGn
Measure local TMRCA
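One way to picture "measure local TMRCA" is as a lookup: given genomic segments that each carry a discretized TMRCA, find the segment overlapping the STR. The sketch below assumes the segments have already been decoded into sorted, non-overlapping `(start, end, tmrca)` tuples; that tuple format, the function name, and the numbers are hypothetical and are not PSMC's actual output format.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical decoded segment: half-open interval [start, end) plus its TMRCA.
Segment = Tuple[int, int, float]


def local_tmrca(segments: List[Segment], position: int) -> Optional[float]:
    """Return the TMRCA of the segment covering `position`, or None if uncovered.

    `segments` must be sorted by start coordinate and non-overlapping.
    """
    starts = [start for start, _, _ in segments]
    i = bisect_right(starts, position) - 1  # last segment starting at or before position
    if i < 0:
        return None
    start, end, tmrca = segments[i]
    return tmrca if start <= position < end else None


# Made-up segments and STR coordinate, for illustration only.
segments = [(0, 10_000, 1200.0), (10_000, 25_000, 5400.0), (25_000, 40_000, 800.0)]
tmrca_at_str = local_tmrca(segments, 12_345)
print(tmrca_at_str)
```

Pairing this per-locus, per-individual TMRCA with the STR genotype at the same locus gives one (TMRCA, (m-n)²) data point.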
![Page 12: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics](https://reader036.vdocuments.us/reader036/viewer/2022062504/5a4d1b867f8b9ab0599bc94b/html5/thumbnails/12.jpg)
Estimating STR mutation parameters from WGS
300 high coverage SGDP samples
Diploid locus: CAGm / CAGn
STR calls
SNPs → PSMC (Li and Durbin 2011) → TMRCA
Infer locus-specific mutation params. from STR calls and TMRCA
[Plots: (m-n)² vs. TMRCA; frequency vs. step size]
Learn model to predict mutation parameters from sequence features
![Page 13: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics](https://reader036.vdocuments.us/reader036/viewer/2022062504/5a4d1b867f8b9ab0599bc94b/html5/thumbnails/13.jpg)
What we know about STR mutations
1. Mutate in “unit” lengths
CAGCAGCAGCAGCAGCAGCAGCAG
CAGCAGCAG---CAGCAGCAGCAG
CAGCAG------CAGCAGCAGCAG
CAGCAG---CAGCAGCAGCAGCAG
CAGCAGCA-CAGCAGCAGCAGCAG
2. Step size distribution ~Geometric
p: probability of mutating a single step
3. Length constraint biases mutation direction
[Figure (Sun et al. 2012): short alleles tend to get longer, long alleles shorter]
4. Other important factors not modeled here
• Length-dependent mutation rate
• Motif sequence interruptions
• Large expansions behave differently (e.g. Huntington’s)
• Biased gene conversion?
• Interaction between alleles?
![Page 14: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics](https://reader036.vdocuments.us/reader036/viewer/2022062504/5a4d1b867f8b9ab0599bc94b/html5/thumbnails/14.jpg)
Modeling STR mutation as a mean-centered random walk
Stepwise Mutation Model (SMM): mutate by +/- 1 copy of the repeat unit with probability μ
[Diagram: the MRCA allele (CAG) diverges over time t into alleles CAGm and CAGn]
Observed (Sun et al. 2012)
Mean-centered random walk (Ornstein-Uhlenbeck):
μSTR: Mutation rate (per generation)
β: Length constraint (0 ≤ β ≤ 1)
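To make the contrast between the SMM and a mean-centered walk concrete, here is a minimal simulation sketch. The slide text does not spell out the exact update rule, so the direction rule used here (probability of expansion 0.5·(1 − β·p·x), clipped to [0, 1], where x is the allele length relative to the central allele) is an assumption chosen only to show the qualitative behaviour, and all parameter values are arbitrary.

```python
import random
import statistics


def geometric_step(p: float, rng: random.Random) -> int:
    """Draw a step size k >= 1 with P(k) = p * (1 - p)**(k - 1)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    k = 1
    while rng.random() >= p:
        k += 1
    return k


def simulate_allele(generations: int, mu: float, beta: float, p: float,
                    rng: random.Random, start: int = 0) -> int:
    """Simulate one lineage; returns allele length relative to the central allele."""
    x = start
    for _ in range(generations):
        if rng.random() >= mu:
            continue  # no mutation in this generation
        # Assumed rule: alleles above the center are biased to contract,
        # alleles below it to expand. beta = 0 gives an unbiased walk.
        prob_up = min(1.0, max(0.0, 0.5 * (1.0 - beta * p * x)))
        step = geometric_step(p, rng)
        x += step if rng.random() < prob_up else -step
    return x


def allele_variance(beta: float, n_lineages: int = 200, seed: int = 0) -> float:
    """Variance of final allele lengths across independent lineages."""
    rng = random.Random(seed)
    alleles = [simulate_allele(2000, 0.01, beta, 0.9, rng) for _ in range(n_lineages)]
    return statistics.pvariance(alleles)


var_unconstrained = allele_variance(beta=0.0)  # variance keeps growing with time
var_constrained = allele_variance(beta=0.3)    # variance is held near the center
print(var_unconstrained, var_constrained)
```

Without a length constraint the spread of allele lengths grows with time; with β > 0 it levels off, which is the behaviour a mean-centered (Ornstein-Uhlenbeck-like) walk is meant to capture.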
![Page 15: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics](https://reader036.vdocuments.us/reader036/viewer/2022062504/5a4d1b867f8b9ab0599bc94b/html5/thumbnails/15.jpg)
Estimating the step size distribution
[Figure: step size frequency histograms (step size in # units, from -4 to +4) for alleles at -5, 0, and +5 relative to the mean allele length]
p: Probability that the step size is a single unit.
Tetranucleotides: p = ~0.95
Dinucleotides: p = ~0.7
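If step sizes follow a geometric distribution, as stated earlier in the talk, then p fixes the whole distribution: P(k) = p(1 − p)^(k−1) for a step of k units. The sketch below tabulates this for the two approximate p values on the slide; the function names are illustrative.

```python
from typing import List


def step_size_pmf(p: float, max_step: int = 4) -> List[float]:
    """Geometric step-size probabilities P(k) = p * (1 - p)**(k - 1), k = 1..max_step."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return [p * (1.0 - p) ** (k - 1) for k in range(1, max_step + 1)]


def step_second_moment(p: float) -> float:
    """E[k^2] for the geometric distribution on k = 1, 2, ...: (2 - p) / p**2."""
    return (2.0 - p) / p ** 2


for label, p in [("tetranucleotide", 0.95), ("dinucleotide", 0.7)]:
    pmf = step_size_pmf(p)
    print(label, [round(q, 4) for q in pmf], round(step_second_moment(p), 3))
```

With p = 0.7 roughly one mutation in five is a two-unit step, while with p = 0.95 fewer than one in twenty is.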
![Page 16: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics](https://reader036.vdocuments.us/reader036/viewer/2022062504/5a4d1b867f8b9ab0599bc94b/html5/thumbnails/16.jpg)
Model validation using Y-STRs
Thomas Willems
Find maximum likelihood mutation parameters (1000 Genomes Project):
P(STR data | Y phylogeny, μ, β, σ)
Validation set: Ballantyne et al. (~2,000 father-son pairs)
[Figure: lobSTR vs. Ballantyne et al.; r=0.831, N=64]
![Page 17: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics](https://reader036.vdocuments.us/reader036/viewer/2022062504/5a4d1b867f8b9ab0599bc94b/html5/thumbnails/17.jpg)
Estimating mutation parameters at autosomal loci
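[Figure: ASD vs. TMRCA, one point per individual; ASD axis ticks at 0, 4, 9, 16]
Individual 1: CAG5 / CAG5 (ASD = 0)
Individual 2: CAG5 / CAG8 (ASD = 9)
Individual k: CAGm / CAGn (ASD = (m-n)²)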
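A simplified way to see how the ASD-versus-TMRCA relationship carries information about μ: under a plain stepwise model with no length constraint (β = 0), two lineages separated for t generations each give E[(m−n)²] = 2·μ·t·E[k²], where E[k²] = (2 − p)/p² for geometric step sizes. The sketch below fits that line through the origin by least squares. This is a deliberately reduced stand-in, not the estimator used in the talk, and the data points are made up.

```python
from typing import Sequence


def estimate_mu_smm(tmrcas: Sequence[float], asds: Sequence[float], p: float = 1.0) -> float:
    """Estimate mu from (TMRCA, ASD) pairs, assuming E[ASD] = 2 * mu * t * E[k^2].

    Assumes no length constraint (beta = 0); with beta > 0 the ASD saturates
    at large TMRCA and this linear fit would underestimate mu.
    """
    if len(tmrcas) != len(asds) or len(tmrcas) == 0:
        raise ValueError("need equally sized, non-empty inputs")
    second_moment = (2.0 - p) / p ** 2  # E[k^2] for geometric step sizes
    numerator = sum(t * a for t, a in zip(tmrcas, asds))
    denominator = sum(t * t for t in tmrcas)
    if denominator == 0:
        raise ValueError("all TMRCAs are zero")
    slope = numerator / denominator  # least-squares slope through the origin
    return slope / (2.0 * second_moment)


# Made-up, noise-free toy data (TMRCA in generations), for illustration only.
toy_tmrcas = [1000.0, 2000.0, 4000.0]
toy_asds = [2.0, 4.0, 8.0]
mu_hat = estimate_mu_smm(toy_tmrcas, toy_asds)
print(mu_hat)
```

Real data are far noisier, since each individual contributes a single squared difference, which is one reason to pool many individuals per locus.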
![Page 18: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics](https://reader036.vdocuments.us/reader036/viewer/2022062504/5a4d1b867f8b9ab0599bc94b/html5/thumbnails/18.jpg)
Per-locus estimation of STR mutation parameters
Estimates for 120K multi-allelic STRs
![Page 19: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics](https://reader036.vdocuments.us/reader036/viewer/2022062504/5a4d1b867f8b9ab0599bc94b/html5/thumbnails/19.jpg)
STR mutation trends by motif length
![Page 20: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics](https://reader036.vdocuments.us/reader036/viewer/2022062504/5a4d1b867f8b9ab0599bc94b/html5/thumbnails/20.jpg)
Future directions: a genome-wide scan for STR selection
Features: Motif length, Recomb. rate, Total length, GC content
Linear model
Predict μ, β
Explain: 46% of variation in μ, 4.6% of variation in β
[Figure: Expected vs. Observed]
Develop genome-wide STR selection scan
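"Variation explained" by a linear model is the R² of the fit. The sketch below shows the calculation for a single feature using ordinary least squares; the talk's model uses several features, so this one-feature version is a simplification, and the total-length and log10(μ) values are made-up toy numbers.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(x: Sequence[float], y: Sequence[float], slope: float, intercept: float) -> float:
    """Fraction of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Toy data: hypothetical total STR length (bp) and log10 mutation rate per locus.
total_length = [10.0, 15.0, 20.0, 25.0, 30.0]
log10_mu = [-5.0, -4.4, -4.1, -3.4, -3.1]

slope, intercept = fit_line(total_length, log10_mu)
r2 = r_squared(total_length, log10_mu, slope, intercept)
print(slope, intercept, r2)
```

Loci whose observed parameters fall far from the model's expectation would be the natural candidates for a selection scan.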
![Page 21: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics](https://reader036.vdocuments.us/reader036/viewer/2022062504/5a4d1b867f8b9ab0599bc94b/html5/thumbnails/21.jpg)
Conclusion
The first genome-wide characterization of STR mutation
1. STR mutation model
2. Validation against published de novo mutation rates
3. Strong effect of local sequence features
4. Future work: improve estimation, genome-wide selection scan
An unexplored, important source of genetic variation
![Page 22: Characterizing the short tandem repeat mutation process at every locus in the genome Melissa Gymrek Genome Informatics](https://reader036.vdocuments.us/reader036/viewer/2022062504/5a4d1b867f8b9ab0599bc94b/html5/thumbnails/22.jpg)
Acknowledgements
Yaniv Erlich
David Reich
Mark Daly
Nick Patterson
Swapan Mallick
Thomas Willems
Alon Goren