Bruce Budowle
Assessing the Significance of Y STR Evidence
Characteristics of the Human Y Chromosome
• size: ~ 60 Mb
• ~ 35 Mb euchromatic (transcribed)
• ~ 25 Mb heterochromatic (non-transcribed)
• 95% non-recombining (NRY)
• 5% X-recombining (2 pseudoautosomal regions at telomeres)
• shape: acrocentric - very short p-arm, long q-arm (“Y” name)
• rich in different kinds of repetitive DNA sequences
• lack of recombination
• relatively poor in gene content
Genes on the Human Y Chromosome
• 23 Mb of the euchromatic region determined
• 156 transcription units
• 78 encode proteins (genes)
• 27 distinct Y-specific protein-coding genes (gene families)
• 16 ubiquitously expressed genes = housekeeping genes
– e.g. RPS4Y, ZFY, AMELY, SMCY, DBY
• 9 testis-specific genes = male sex determination, spermatogenesis
– e.g. SRY, TSPY, CDY, RBMY, DAZ
• origin of NRY genes:
– derived / preserved from the proto-sex chromosomes (X-homology)
– specialisation in male-specific function
Genes Mapped to Y Chromosome
Evolution of Mammalian Sex Chromosomes
Lahn, Pearson & Jegalian 2001
Some homology – need to consider in validation
Polymorphisms on the Human Y Chromosome
Repetitive DNA – e.g., STRs
Single-Copy DNA – e.g., SNPs, indels
Y Chromosome Polymorphisms
• ~ 200 binary polymorphisms (Y-SNPs) characterized
• > 300 microsatellites (Y-STRs) characterized
• 1 minisatellite (MSY1)
Not all mutations occur at the same rate: ‘hot spots’ and ‘cold spots’
SNPs
-From J.M. Butler (2003) Forensic Sci. Rev. 15:91-111
Y Chromosome STRs
Nucleic Acids Res. 28(2), e8 (2000)
Marker Name    Repeat Motif     Allele Range
DYS19          TAGA             8-16
DYS385         GAAA             10-22
DYS388         ATT              12-17
DYS389 I       (TCTG) (TCTA)    7-13
DYS389 II      (TCTG) (TCTA)    23-31
DYS390         (TCTA) (TCTG)    18-27
DYS391         TCTA             8-13
DYS392         TAT              7-16
DYS393         AGAT             9-15
YCAII          CA               16-25
YCAIII         CA               19-25

Marker Name    Repeat Motif     Allele Range
DYS434         ATCT             8-11
DYS435         TGGA             9-13
DYS436         GTT              10-15
DYS437         TCTA             8-11
DYS438         TTTTC            6-12
DYS439         AGAT             9-14
Y-GATA-A4      AGAT             11-14
Y-GATA-A7.1    ATAG             7-12
Y-GATA-A7.2    TAGA             8-12
Y-GATA-A8      TCTA             8-14
Y-GATA-A10     TATC             11-14
Y-GATA-C4      TATC             11-16
Y-GATA-H4      TAGA             10-13

Marker Name    Repeat Motif     Allele Range
DYS441         CCTT             12-18
DYS442         TATC             10-14
DYS443         TTCC             12-17
DYS444         TAGA             11-15
DYS445         TTTA             10-13
G09411         TATG             8-14
Mutation Process for STR loci
Scientific Uses
• Genealogy studies, e.g., Hemings-Jefferson case
**Remember paternal lineage issues for identity testing
Scientific Use
• Genealogy studies, e.g., Hemings-Jefferson case
• Human evolution studies, e.g., human migrations
• Haplogroup: set of haplotypes defined
by slowly mutating markers (mainly SNPs) which
have more phylogenetic stability
• Haplotype: combination of allelic states of a
set of polymorphic markers lying on the same
DNA molecule
Definitions
Unique event polymorphisms record history of Y chromosome
Phylogenetic tree of Y haplogroups based on binary SNP data
Applications of Y-STRs
• Forensic Analysis
– Detect male DNA in a sample containing male and female DNA (huge background of female DNA)
– Aspermic males
– Fingernail scrapings
– Additional DP value
– Multiple male donors
– Limits of differential extraction / tissues
– Gender clarification (amelogenin)
• Paternal Lineage
– Paternity testing
– Kinship analysis
– Deficiency cases
Identification of Male Contributor DNA in Crime Scene Material
Autosomal STR profile
Female Victim DNA:
Male Suspect DNA:
Large Female DNA: Perpetrator Male DNA
- See only female DNA profile
- Or partial DNA profile
- no female DNA
- no profile overlap
- only male component
Y STR profile
One Forensic Application
• Some victims of sexual assault may not provide vaginal samples immediately after the incident
• Ability to retrieve typeable autosomal STR profiles from the semen diminishes rapidly
• However, sperm persist in the vaginal canal up to 3 days after intercourse (may be detectable up to 7 days after intercourse in the cervix)
• Sperm decrease in number as interval increases
• Lymphocytes and epithelial cells from male
Finger Nail Scraping Case
Victim was found strangled to death
Suspect had scratches on his face
Based on STR results, suspect could not be excluded; many alleles were below minimal threshold (inconclusive result)
Evidence Profile
Suspect Profile
Applications of Y-STRs
• Forensic Analysis
• Paternal Lineages
– Paternity testing
– Kinship analysis
– Deficiency cases
Deficiency Case Male Lineage
• Y STR analysis - any male relative in pedigree can be a
reference for alleged father
?
Y-STR Haplotype Analysis in Deficiency Paternity Case
          DYS19  DYS389I  DYS389II  DYS390  DYS391  DYS392  DYS393  DYS385  DYS413  YCAII
Nephew    14     13       30 (16)   25      11      13      12      11-14   22      7
Son       14     12       29 (16)   24      10      15      12      11-14   22      7
Exclusion
If true biological nephew, then alleged father is excluded as father of child in question
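The exclusion follows from a locus-by-locus comparison of the two haplotypes. A minimal Python sketch of that comparison, assuming each haplotype is stored as a dictionary keyed by locus name; the function name `mismatched_loci` and the data layout are illustrative and not from the slides, and the parenthesized (16) values at DYS389II are left out:

```python
# Haplotypes copied from the deficiency case table (values kept as strings
# so that two-allele loci such as DYS385 "11-14" compare as a unit).
nephew = {"DYS19": "14", "DYS389I": "13", "DYS389II": "30", "DYS390": "25",
          "DYS391": "11", "DYS392": "13", "DYS393": "12", "DYS385": "11-14",
          "DYS413": "22", "YCAII": "7"}
son = {"DYS19": "14", "DYS389I": "12", "DYS389II": "29", "DYS390": "24",
       "DYS391": "10", "DYS392": "15", "DYS393": "12", "DYS385": "11-14",
       "DYS413": "22", "YCAII": "7"}

def mismatched_loci(h1, h2):
    """Return the loci at which two haplotypes differ (hypothetical helper)."""
    return [locus for locus in h1 if h1[locus] != h2[locus]]

diffs = mismatched_loci(nephew, son)
print(len(diffs), "mismatching loci:", diffs)
```

With differences at five of the ten loci, the two haplotypes are not plausibly one paternal lineage separated by a mutation or two, which is the basis for the exclusion.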
?
Kayser et al. Progress in Forensic Genetics (1998), 7: 494-496
DYS385 a/b: a = b or a ≠ b
Duplicated regions are 40,775 bp apart and facing away from each other
DYS389 I/II
[Figure: forward (F) and reverse (R) primer positions for DYS385 a/b and for DYS389I and DYS389II]
Figure 9.5, J.M. Butler (2005) Forensic DNA Typing, 2nd Edition © 2005 Elsevier Academic Press
Multi-Copy (Duplicated) Marker
Single Region but Two PCR Products (because forward primers bind twice)
Y-STR consensus structure and allele ranges
Marker Name   GenBank Accession  Repeat Motif    Allele Range  PCR Product Sizes  Reference
DYS19         X77751             TAGA            8-16          178-210 bp         Roewer 1992
DYS385        Z93950             GAAA            10-22         252-300 bp         Schneider 1998
DYS388        G09695             ATT             12-17         128-143 bp         Kayser 1997
DYS389 I      G09600             (TCTG) (TCTA)   7-13          239-263 bp         Kayser 1997
DYS389 II     G09600             (TCTG) (TCTA)   23-31         353-385 bp         Kayser 1997
DYS390        G09611             (TCTA) (TCTG)   18-27         191-227 bp         Kayser 1997
DYS391        G09613             TCTA            8-13          275-295 bp         Kayser 1997
DYS392        G09867             TAT             7-16          236-263 bp         Kayser 1997
DYS393        G09601             AGAT            9-15          108-132 bp         Kayser 1997
YCAIII        AC006370           CA              19-25         192-204 bp         Kayser 1997
DYS434        AC002992           ATCT            8-11          110-122 bp         Ayub 2000
DYS435        AC002992           TGGA            9-13          210-228 bp         Ayub 2000
DYS436        AC005820           GTT             10-15         128-143 bp         Ayub 2000
DYS437        AC002992           TCTA            8-11          186-202 bp         Ayub 2000
DYS438        AC002531           TTTTC           6-12          203-233 bp         Ayub 2000
DYS439        AC002992           AGAT            9-14          238-258 bp         Ayub 2000
Y-GATA-A4     G42670             AGAT            11-14         242-254 bp         White 1999
Y-GATA-A7.1   G42675             ATAG            7-12          161-181 bp         White 1999
Y-GATA-A7.2   G42671             TAGA            8-12          174-190 bp         White 1999
Y-GATA-A8     G42672             TCTA            8-14          219-244 bp         White 1999
Y-GATA-A10    G42674             TATC            11-14         160-172 bp         White 1999
Y-GATA-C4     G42673             TATC            11-16         251-271 bp         White 1999
Y-GATA-H4     G42676             TAGA            10-13         362-370 bp         White 1999
Y Chromosome STR Markers
• For effective use, guidelines are needed
• ISFG Recommendations
• Combine with existing recommendations (NRC II Report)
• Nomenclature, Allelic Ladders, Population Genetics, Statistical Issues
• Similar to autosomal STRs
• Thresholds for detection and interpretation
• Stutter
• Mixtures – what constitutes a mixture
• Validation studies in concert with guidelines
• Interpret evidence before knowns
Basic Interpretation Guidelines
Y STR LOCI
• DYS19
• DYS389 I
• DYS389 II
• DYS390
• DYS391
• DYS392
• DYS393
• DYS385 I/II
“Minimal Haplotype” – defined for research only
DYS19, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS438, DYS439, DYS385a/b
SWGDAM
DYS385 – two loci
Y STR Loci
DYS389 – two loci
• Commercially available Y-STR multiplex kits allow for standard markers and QA/QC
• Most have EMH and SWGDAM recommended loci
Kits
DYS19, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439, DYS385
PowerPlex® Y System
DYS385 – two loci
DYS19, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439, DYS385, DYS448, DYS456, DYS458, DYS635, GATAH4.1
AmpFlSTR® Yfiler™ Kit
DYS385 – two loci
Promega Amplification Performance (moderate-to-low sensitivity instrument)
1ng male DNA
0.125ng male DNA
100ng female DNA
0.1ng male + 100ng female
2s injection on ABI PRISM® 310 Genetic Analyzer
0.5ng male
0.5ng male + 5ng female (10-fold)
Neg. control
3s inj. on ABI PRISM® 310 Genetic Analyzer
0.5ng male + 50ng female (100-fold)
0.5ng male + 500ng female (1000-fold)
DNA Mixtures: Increasing female
Promega
0.5ng male + 0.5ng female (equal mix)
0.5ng male + 150ng female (300-fold)
DNA quantified by A260
AmpFlSTR® Yfiler™ Kit 1 ng Male Control DNA 007
DYS458 DYS389 I DYS390 DYS389 II
DYS438 DYS19 DYS385 a/b
DYS393 DYS391 DYS439 Y GATA C4 DYS392
Y GATA H4 DYS456 DYS437 DYS448
AmpFlSTR® Yfiler™
137 alleles
Allelic ladder
Y STR Population Data (Promega Study)
Population   N
CFS AFR      37
CT AFR       182
MI AFR       86
NYC AFR      80
TX AFR       192
CFS CAU      57
CT CAU       164
MI CAU       97
NYC CAU      83
TX CAU       194
CT HIS       160
MI HIS       97
MN HIS       101
NYC HIS      80
TX HIS       192
Apache       138
Navajo       219
CFS ASN      28
MN ASN       101
NYC ASN      45
TX ASN       73
CFS EI       37
Total = 2443
DYS19 Allele Frequencies: African American
[Bar chart: frequency (0 to 0.5) by allele (12 to 18) for Sinha (n=543), CFS (n=37), CT (n=182), MI (n=86), NYC (n=80), TX (n=193)]
Population Parameters
$\chi_i$ = frequency of each haplotype; n = # haplotypes
Genetic Diversity: $h = n\,(1 - \sum_i \chi_i^2)/(n - 1)$
Random Match Probability: $P = \sum_i \chi_i^2$
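Both population parameters can be computed directly from a list of typed haplotypes. A minimal Python sketch, assuming each haplotype is represented by any hashable value (a string or tuple) and reading n as the number of haplotypes in the sample; the function name `haplotype_stats` and the toy sample are illustrative only:

```python
from collections import Counter

def haplotype_stats(haplotypes):
    """Return (genetic diversity h, random match probability P) for a sample."""
    n = len(haplotypes)
    # Relative frequency of each distinct haplotype in the sample.
    freqs = [count / n for count in Counter(haplotypes).values()]
    match_prob = sum(f * f for f in freqs)        # P = sum of squared frequencies
    diversity = n * (1 - match_prob) / (n - 1)    # h = n(1 - sum of squares)/(n - 1)
    return diversity, match_prob

# Toy sample of four males, two of whom share a haplotype.
h, p = haplotype_stats(["A", "A", "B", "C"])
print(round(h, 4), p)
```

When every haplotype in the sample is different, P equals 1/n and h equals 1, which is why the small, highly diverse samples in the tables report values of >0.9999.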
Gene Diversity: Hispanic
[Bar chart: gene diversity (0 to 0.8) at DYS437, DYS19, DYS392, DYS393, DYS390 for CT HIS, MI HIS, MN HIS, NYC HIS, TX HIS]
Y Haplotype Profiles
Population   N     # Haplotypes   % Single   Haplotype Diversity
CFS AFR      37    36             97.3       0.9985
CT AFR       182   172            94.5       0.9994
MI AFR       86    85             98.8       0.9997
NYC AFR      80    80             100        >0.9999
TX AFR       193   181            93.8       0.9993
CFS CAU      57    50             87.7       0.9944
CT CAU       163   153            93.9       0.9991
MI CAU       97    87             89.7       0.9968
NYC CAU      83    80             96.4       0.9991
TX CAU       194   170            87.6       0.9981
CT HIS       158   130            82.3       0.9963
MI HIS       97    90             92.8       0.9985
MN HIS       100   95             95.0       0.9988
NYC HIS      80    74             92.5       0.9968
TX HIS       192   179            93.2       0.9991
Apache       138   70             50.7       0.9701
Navajo       219   101            46.1       0.9806
CFS ASN      28    28             100        >0.9999
MN ASN       100   96             96.0       0.9992
NYC ASN      45    43             95.6       0.9970
TX ASN       70    69             98.6       0.9996
CFS EI       37    35             94.6       0.9955
high haplotype diversity = high inter-individual variation
Haplotype Diversity, N > 80
[Bar chart: haplotype diversity (axis 0.955 to 1.005) for CT CAU, NYC CAU, MI CAU, TX CAU, CT AFR, MI AFR, NYC AFR, TX AFR, CT HIS, MI HIS, MN HIS, NYC HIS, TX HIS, Apache, Navajo, MN ASN]
Multiband Y Patterns
• MN ASIAN PA0077 DYS385 - 3 Bands
• MN HISPANIC PH0031 DYS390 - 2 Bands
• MN HISPANIC PH0063 Multibands
• NYC HISPANIC 26 DYS19 - 2 Bands
• NYC CAUCASIAN 4 DYS19 - 2 Bands
• CT HISPANIC 00-1851 DYS19 - 2 Bands
• CT HISPANIC 99-1695 DYS19 - 2 Bands
• CT HISPANIC 99-0362 DYS19 - 2 Bands
• CT HISPANIC 98-2136 DYS19 - 2 Bands
• CT CAUCASIAN 00-3022 DYS385 - 3 Bands
• ASIAN A-FTA-34-F/C DYS385 - 3 Bands
• ASIAN A-FTA-36-F/C DYS19 – Primer Binding site?
• ASIAN A-FTA-32-F/C DYS385 - 4 Bands
Must consider when analyzing mixtures
What about Equilibrium?
• These markers are all on a single chromosome
• No recombination
• Predict strong linkage and a lack of independence between the loci
• Haplotype
Y STR Loci Pairwise Tests
12 Loci – 66 tests
Population   N     # Equilibrium
CFS AFR      37    35
CT AFR       182   23
MI AFR       86    27
NYC AFR      80    34
TX AFR       193   21
CFS CAU      57    30
CT CAU       163   26
MI CAU       97    30
NYC CAU      83    22
TX CAU       194   11
CT HIS       158   26
MI HIS       97    33
MN HIS       100   42
NYC HIS      80    35
TX HIS       192   37
Apache       138   9
Navajo       219   12
CFS ASN      28    60
MN ASN       100   50
NYC ASN      45    47
TX ASN       70    50
CFS EI       37    35
Fewest – Native American
Most – Asian (sample size)
Y STR Loci Pairwise Tests
22 populations; >17 Equilibrium detected
Loci          # Populations
391/389I      19
391/389II     18
389I/439      17
389I/385-2    18
439/389II     17
439/393       19
439/385-1     18
439/385-2     20
389II/393     17
437/393       18
Y STR Loci Pairwise Tests
22 populations; <5 Equilibrium
Loci          # Populations
391/438       5
389I/389II    1
438/437       4
438/19        4
438/392       3
438/385-1     0
438/385-2     4
437/385-1     5
19/392        5
19/385-1      3
392/385-1     5
390/385-1     5
385-1/385-2   3
Y STR Loci Pairwise Tests
22 Populations – Examples of Population Specific Disequilibrium
Loci          # Populations/Equilibria   Population
389I/392      15                         Caucasian
438/439       11                         Caucasian
437/439       16                         Caucasian
437/385-2     12                         African American
390/385-2     12                         African American
• There is evidence of “independence” between some pairs of loci in the population samples
• Likely due to combination of mutation rate, subdivision and random drift
• A large factor is haplogroup diversity
• The marker selection for increasing haplotype diversity is not directly correlated to gene diversity
• Conservative estimates
Equilibrium and Impact?
Approaching Analysis
• Some may suggest - “Use the set of Core Y STRs and add more as needed to resolve matches”
• First question – when do you stop?
• If you get a match, you would have to continue on ad infinitum!
• Is this a sensible policy?
How much power is needed???
Discriminatory Capacity* for Three US Populations
Y-STR marker combination          African American (N=786)   Caucasians (N=778)   Hispanics (N=381)
European minimal haplotype (9)    75.8%                      61.7%                79.8%
Minimal + SWGDAM + DYS437 (12)    87.7%                      76.7%                88.2%
Eur. Minimal + SWGDAM (11)        86.8%                      74.3%                85.6%
AmpFlSTR® Yfiler kit (17)         97.6%                      95.5%                95.8%
*DC = (# of different haplotypes / pop. size) x 100
Mulero et al., JFS (2006) 51:64-75
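The DC definition is a one-line calculation. A short Python sketch, assuming haplotypes are stored as tuples of allele calls; the function name `discriminatory_capacity` and the toy sample are illustrative and unrelated to the Mulero et al. data:

```python
def discriminatory_capacity(haplotypes):
    """DC = (# of different haplotypes / sample size) x 100."""
    return 100.0 * len(set(haplotypes)) / len(haplotypes)

# Toy sample: four males, three distinct haplotypes over three loci.
sample = [(14, 13, 29), (14, 13, 29), (15, 12, 30), (16, 13, 28)]
dc = discriminatory_capacity(sample)
print(f"DC = {dc:.1f}%")
```

Adding loci can only split existing haplotype groups, never merge them, so DC computed on the same sample cannot decrease as markers are added.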
Number of Unique Haplotypes Observed in Three US Populations
Y-STR marker combination          African American (N=786)   Caucasians (N=778)   Hispanics (N=381)
European minimal haplotype (9)    496                        382                  266
Minimal + SWGDAM + DYS437 (12)    628                        524                  306
Eur. Minimal + SWGDAM (11)        618                        503                  295
AmpFlSTR® Yfiler kit (17)         749                        714                  350
Mulero et al., JFS (2006) 51:64-75
So… more loci do result in better resolution
But…size of the database impacts more on statistics
Approaching Analysis
• So additional testing beyond a single kit is an unlikely approach because information gain is low
• Many samples will already be very limited
• Community will rely on commercially available kits not in-house designer systems
• QC/ Proficiency Testing
• Better to increase size of database(s) to gain power
• We will re-visit substructure issues later
Approaching Analysis
• Some may suggest - “A reference database should contain related individuals” – to better define the population
• Probability of paternal relative having the same haplotype is usually 1
• Typically comprised of unrelated individuals
• Although a small unknown number of related individuals may be in a database
• Able to address significance of a very closely related profile
Exclusion with 1 mismatch among 12 analyzed Y-STRs
Evidence 14 12 28 25 11 11 13 14,14 11 11 15
Known 14 12 28 24 11 11 13 14,14 11 11 15
By having a database of unrelated males one can assess weight of relative (with mutation) versus rarity of haplotype in population
Qualitative Conclusions of Y-STR Haplotype Comparison
Exclusion - The two haplotypes are dissimilar; i.e., the reference person is excluded as the contributor of Y-specific DNA of the evidence sample
Inclusion/Match- The Y haplotypes from two samples are sufficiently similar and potentially could have originated from the same source, or from a common paternal lineage
Inconclusive- Exclusion/Inclusion cannot be definitively inferred due to insufficient data from one or both of the DNA samples
• Unique event polymorphisms record history of Y chromosome
• Effective population size
• Patrilineal inheritance reduces population size
• Variance of offspring further reduces effective population size
• Patrilocal effect causes local differentiation
• Lack of recombination
• But some detectable linkage equilibrium
Now Let’s Think About Application
Population Differentiation
• Effective population size of Y chromosome is 1/4 of autosome or 1/3 of X
– lower sequence diversity on Y
– more susceptible to genetic drift
• random changes in frequency of haplotypes due to sampling bias from one generation to next
• accelerates differences between populations
• Geographical clustering due to patrilocal behavior of men
– women move closer to man’s birthplace
– local geographical differentiation enhanced
• Variance of offspring further reduces Ne (effective population size)
Calculation to Convey to the Court
• Frequency estimate not possible
• Court desires a frequency estimate
• Point Estimate (Counting Method)
• Confidence Interval
• Approach the same as mtDNA
Calculation to Convey to the Court: Issues
• Estimating the rarity of a Y DNA profile is performed differently than for autosomal DNA markers
• No evidence for recombination across the majority of the Y-chromosome
• One cannot employ the product rule to estimate the rarity of the Y types in a profile
• The composite multi-locus profile is treated as a single locus or haplotype
• The vast majority of possible haplotypes will not be observed in any database
• The counting method is likely to be conservative
• A correction for sampling
• A correction for substructure
Calculation to Convey to the Court
Calculation to Convey to the Court: Counting Method
• The counting method is very simple
• A Y STR haplotype (evidence sample) is compared to a reference database(s) of unrelated individuals
• The number of times the haplotype is observed in a database
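The counting method amounts to a database lookup. A minimal Python sketch, assuming the database is a list of haplotypes stored as tuples with loci in a fixed order; the function name `count_haplotype` and the four-profile database are hypothetical:

```python
def count_haplotype(evidence, database):
    """Return (X, N, X/N): observations of the evidence haplotype, database size, point estimate."""
    x = sum(1 for haplotype in database if haplotype == evidence)
    n = len(database)
    return x, n, x / n

# Hypothetical database of unrelated males, three loci per haplotype.
database = [(14, 13, 29), (15, 12, 30), (14, 13, 29), (16, 13, 28)]
x, n, p = count_haplotype((14, 13, 29), database)
print(x, n, p)
```

Only exact matches are counted, so a haplotype that differs at a single locus contributes nothing to X.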
Calculation to Convey to the Court: Approaches
• It is more likely that the counting method will be employed by the U.S. laboratories and courts because of its operational simplicity
Limitations of the Counting Method
• Non-matched sites of the haplotype are given weight equal to that of different origin (but may have some extra value for substructure)
• Mutations are not weighted
• Haplotypes of the same paternal lineage can be excluded, when they are subject to mutations
• Does not recognize evolutionary changes, and/or effect of convergent mutations
Maximum haplotype frequency
• If a Y-haplotype is not seen in a sample of N males then at the α level of significance
• As N becomes larger, maximum frequency becomes closer to point estimate
If the haplotype has not been observed in the database, then the upper bound of the CI is
$1 - \alpha^{1/N}$
N        frequency
100      3/100
500      3/500
1,000    3/1,000
10,000   3/10,000
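The 3/N entries are the α = 0.05 case of the formula, since −ln(0.05) ≈ 3.0 and 1 − α^(1/N) ≈ −ln(α)/N for large N. A short Python sketch comparing the two; the function name `max_frequency_unobserved` is illustrative:

```python
def max_frequency_unobserved(n, alpha=0.05):
    """Upper bound on the frequency of a haplotype not seen in n database profiles."""
    return 1 - alpha ** (1 / n)

for n in (100, 500, 1000, 10000):
    exact = max_frequency_unobserved(n)
    print(f"N = {n:>6}: 1 - alpha^(1/N) = {exact:.6f}, 3/N = {3 / n:.6f}")
```

The exact bound is always slightly below 3/N, so quoting 3/N errs on the conservative side.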
Haplotype frequency (95%)
For Y haplotype observed, count the number of times the profile is observed (X)
$p = X/N$
$CI = p \pm 1.96\sqrt{p(1-p)/N}$
I have confidence that the true haplotype frequency is less than the upper CI value
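For an observed haplotype, the reported figure is the upper end of this interval. A minimal Python sketch of the calculation; the function name `upper_bound_observed` and the counts used are hypothetical:

```python
from math import sqrt

def upper_bound_observed(x, n, z=1.96):
    """Upper 95% bound for a haplotype seen x times in a database of n profiles."""
    p = x / n
    return p + z * sqrt(p * (1 - p) / n)

# Hypothetical example: haplotype observed 10 times in 1,000 profiles.
ub = upper_bound_observed(10, 1000)
print(f"p = {10 / 1000}, upper bound = {ub:.6f}")
```

This is the normal approximation to the binomial, which becomes rough when X is very small; the slide formula is reproduced as stated rather than replaced by an exact interval.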
A potential mtDNA statement could be:
• Correction for population structure may be considered
• Effective population size ¼ of autosomal loci
• May actually be a little lower
• Substructure effects less in US than ancestral populations
• Use when reference database considered not representative
Calculation to Convey to the Court: Population Substructure
Problems created by population subdivision
Haplotype frequencies calculated from population average frequencies could lead to:
– Wrong estimates!
Employ a Theta (θ) Correction
θ is used as a measure of the effects of population subdivision (inbreeding)
NRC θ recommendation was pragmatically set
Empirical values are much less for autosomal loci
National Academy of Sciences, May 1996
Still need to calculate substructure effects
But likely to be low for most major populations, if evaluated under a forensic model vs that of an evolutionary model
• Population is relevant and representative
• Population is relevant, but not representative
• Population is similar and not representative
• Follow NRC II Recommendations
Calculation to Convey to the Court: Population Substructure
U.S. Y-STR Haplotype Reference Database
www.ystr.org/usa
                                   AA          CAU          HIS          Total
Haplotype diversity                99.8%       99.6%        99.5%        99.7%
Number of populations sampled      10          11           9            30
Number of individuals sampled      599         628          478          1,705
Number of Y-STR loci typed (EMH)   9           9            9            9
Number of different haplotypes     454 (76%)   437 (70%)    354 (74%)    1116 (65%)
Most frequent haplotype            12 (2.0%)   25 (3.98%)   19 (3.97%)   53 (3.1%)
Kayser et al. J. Forensic Sci. (2002), 47(3): 513-519
Hispanic
Structure of U.S. Populations with Y-STR Haplotypes
[Figure: U.S. population samples grouped by Y-STR haplotypes into three clusters]
African-American (AA): Virginia, Florida, Maryland, Texas, New York, Pennsylvania, Missouri, Oregon, Indiana, Louisiana
European-American (EA): Florida, Pennsylvania, New York, Indiana, Missouri, Louisiana, Maryland, Texas, Cajun, Virginia, Oregon
Hispanic (H): Pennsylvania, Florida, New York, Connecticut, Oregon, Maryland, Louisiana, Texas, Virginia
RST = 0.1
RST: measure for population differentiation
Kayser et al. J. Forensic Sci. (2002), 47(3): 513-519
Based on Evolutionary Model
Not a Forensic Model
Computation of Frequency of Lineage-based Marker Profile
Using the general theory, the unconditional frequency of a haplotype (say $A_i$), which is count divided by sample size, can be modified to get the conditional probability
$\Pr(A_i \mid A_i) = [p_i^2 + \theta p_i(1 - p_i)]/p_i = p_i + \theta(1 - p_i) = \theta + p_i(1 - \theta)$
Hence, the conditional probability always exceeds θ, the adjustment factor of possible population substructure in the database used
θ Becomes the bound
Computation of Frequency of Lineage-based Marker Profile (Contd.)
Some suggest that the quantity $p_i$ in $\Pr(A_i \mid A_i) = \theta + p_i(1 - \theta)$
can be substituted by (Count of $A_i$ + 2)/(N + 3),
where N is the sample size. When N is large, this has little effect, but can be of help when the count of Ai in the database is zero (i.e., profile in evidence not seen in the database)
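The θ-corrected probability and the optional (count + 2)/(N + 3) substitution can be sketched in a few lines of Python. The function name `conditional_match_probability`, its `adjust_counts` flag, and the θ value used are illustrative assumptions, not part of the slides:

```python
def conditional_match_probability(count, n, theta, adjust_counts=False):
    """Pr(Ai | Ai) = theta + p_i * (1 - theta).

    With adjust_counts=True, p_i is replaced by (count + 2) / (n + 3),
    the substitution some suggest for small or zero counts.
    """
    p = (count + 2) / (n + 3) if adjust_counts else count / n
    return theta + p * (1 - theta)

# Hypothetical case: haplotype not seen in 1,000 profiles, theta = 0.001.
plain = conditional_match_probability(0, 1000, 0.001)
adjusted = conditional_match_probability(0, 1000, 0.001, adjust_counts=True)
print(plain, round(adjusted, 6))
```

With a zero count and no adjustment the result collapses to θ itself, which illustrates the statement that θ becomes the bound.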
Y STR θ -Value
In terms of match versus non-match, how different the haplotypes are is not an issue
The θ values for Y STRs are therefore not computed with mismatch-based approaches (such as AMOVA), but instead by treating all haplotypes as different alleles, generally leading to a much smaller θ value
What θ to use?
• The value of θ (computed by either RST or FST) is dependent indirectly on the number of Y STR loci comprising the haplotype
• Generally, as more loci are included in the haplotype, most haplotypes in a data set will become differentiated
• Therefore, the greater the number of loci within the haplotypes, the smaller should be the value of θ
What θ to use?
• RST has been used to estimate the value of θ for forensic calculations
• However, the analyses based on RST are better applied to evolutionary biology for studying the phylogenetic history of Y-chromosomes
• The RST values are based on allele size variance, exploiting the extent of difference between different haplotypes
• Such an approach, however, typically does not apply to forensic inferences; forensic applications assess the evidence in terms of match or non-match.
What θ to use?
• In these terms, the θ values for Y STRs should not be computed on mismatch based approaches, but instead by simply treating all haplotypes as different alleles
• Thus, FST (or GST) is a better estimator for θ
• Haplotypes are identified solely by their distinctiveness (i.e., haplotypes are considered simply in terms of identity by state)
• This generally leads to a more appropriate and much smaller θ value than estimated by RST
Y STR haplotype is one locus with many alleles
Population 1: A1, A2, ..., A100
Population 2: A101, A102, ..., A200
Databases with reasonable size approximate this model
θ is almost 0
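The model on this slide can be checked numerically by treating each haplotype as one allele and computing a GST-type statistic, (H_T − H_S)/H_T. A minimal Python sketch, assuming equal frequencies of 1/100 within each population, equal population weights, and no sample-size correction; these simplifications and the function names are assumptions made for illustration:

```python
def gene_diversity(freqs):
    """1 minus the sum of squared frequencies."""
    return 1 - sum(f * f for f in freqs.values())

def gst(populations):
    """(H_T - H_S) / H_T with equal population weights (uncorrected sketch)."""
    k = len(populations)
    h_s = sum(gene_diversity(p) for p in populations) / k
    pooled = {}
    for p in populations:
        for haplotype, f in p.items():
            pooled[haplotype] = pooled.get(haplotype, 0.0) + f / k
    h_t = gene_diversity(pooled)
    return (h_t - h_s) / h_t

# Two populations of 100 haplotypes each, none shared between them.
pop1 = {f"A{i}": 1 / 100 for i in range(1, 101)}
pop2 = {f"A{i}": 1 / 100 for i in range(101, 201)}
theta = gst([pop1, pop2])
print(round(theta, 5))
```

Even though the two populations share no haplotype at all, the statistic is only about 0.005, because diversity within each population is already close to its maximum.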
Y STR haplotype is one locus with many alleles
Population 1: A1, A2, ..., An
Population 2: A1', A2', ..., An'
In reality, with large number of loci a few types are shared and most if not all have never been seen in the database
θ approaches 0
What did he just say?
Forensic Model: Population Substructure
DYS19 DYS389I DYS389II DYS390 DYS391 DYS392 DYS393 DYS385
A --- 14 12 29 24 10 15 12 11-14
E --- 18 12 25 24 10 15 15 11-18
B --- 14 13 29 24 10 15 12 11-14
C --- 14 13 29 24 12 15 12 10-14
D --- 18 11 25 24 10 13 15 12-18
Which haplotypes might be more closely related?
Forensic Model: Population Substructure
DYS19 DYS389I DYS389II DYS390 DYS391 DYS392 DYS393 DYS385
A --- 14 12 29 24 10 15 12 11-14
C --- 14 13 29 24 12 15 12 10-14
Are such evolutionary differences considered in forensic evaluation?
E --- 18 12 25 24 10 15 15 11-18
A --- 14 12 29 24 10 15 12 11-14
Exclusion
Exclusion
F-Statistics distribution
[Chart: F-statistic value (0 to 0.0025) by # of loci (16, 15, 13, 11, 10); series: Fst, Fst w/o NA, Fsc, Fsc w/o NA, Fct, Fct w/o NA]
θ Value and Impact (or do we need more Y STR loci?)
θ Value and Impact
• FST values across major populations range from <0.002 to <0.0001 (depending on the number of loci comprising the haplotype)
• Population-specific values are smaller, excluding Native Americans
• 17 loci: A = 0.00005; C = 0.00016; H = 0.00007
• 12 loci: A = 0.00005; C = 0.0002; H = 0.0004
• With the size of current reference population databases, FST rarely will influence the upper bound of the Y STR frequency for most population groups (with the current commercially available systems)
• Increasing the size of reference population data sets and including more populations would be more valuable and a better use of resources for exploiting the full power of Y STR typing
• Dedicating efforts to expanding the battery of Y STR loci will not significantly exploit the power of the assays
Key thoughts
• Diversity
• Most common types in population
• Following NRC II recommendations
Partial Profile θ: Combined Populations
• 0.0008 -- 15 loci
• 0.002 -- 11 loci
• 0.0329 -- 5 loci
• 0.1199 -- 2 loci
Locus   Caucasian (N = 199)   Afr Amer (N = 203)   Hispanic (N = 207)   Asian (N = 83)   Total (N = 692)
DYS391 11:10 1
DYS389I 12:13 1
DYS389II 29:30 30:29 1*
DYS439 13:12 11:12 2
DYS438 0
DYS437 15:16 15:14 2
DYS19 16:17 17:16 2
DYS392 0
DYS393 14:15 1
DYS390 0
DYS385 14:15 14:15 12,14:14 3
Total 2/2388 6/2436 2/2484 3/996 13/8304
0.00084 0.00246 0.00081 0.0031 0.00157
Y STR mutations (father:son allele transmission)
• 692 confirmed father-son pairs (probability > 99.9%)
• 14 mutation events were observed
• Average rate of $1.57 \times 10^{-3}$/locus/generation (13/8304)
• With a 95% confidence bound of $0.83 \times 10^{-3}$ to $2.69 \times 10^{-3}$
• This rate is a little smaller than the Kayser et al. estimate ($2.80 \times 10^{-3}$/locus)
• But the difference is not statistically significant (P > 0.05).
One Asian father-son pair at the DYS389I/II loci complex, (12, 29) → (13, 30), appears as a double mutation, but likely is a single original event.
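The average rate is simply 13 mutations over 8304 allele transmissions (692 pairs × 12 loci). The slides do not say how the 95% bound was derived; the Python sketch below assumes an exact (Clopper-Pearson) binomial interval solved by bisection, which gives values close to, though not necessarily identical with, the quoted bound. Function names are illustrative:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def solve_p(k, n, target, iterations=60):
    """Find p with binom_cdf(k, n, p) == target; the CDF decreases as p grows."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_interval(x, n, alpha=0.05):
    """Clopper-Pearson interval for x events in n trials (assumed method)."""
    lower = solve_p(x - 1, n, 1 - alpha / 2) if x > 0 else 0.0
    upper = solve_p(x, n, alpha / 2) if x < n else 1.0
    return lower, upper

rate = 13 / 8304
lower, upper = exact_interval(13, 8304)
print(f"rate = {rate:.5f}, 95% interval = ({lower:.5f}, {upper:.5f})")
```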
Y STR mutations (father:son allele transmission)
Next Task
• Test independence between autosomal loci and Y haplotypes
Independence Testing of Y Haplotype and 13 Autosomal CODIS STR Loci
(Autosomal Locus/ Y Haplotype Displaying Disequilibria* - 22 populations)
1. Apache: FGA, p = 0.0376; D21S11, p = 0.0346; D18S51, p = 0.0282; D5S818, p = 0.0266
2. Minnesota Asian: D8S1179, p < $10^{-3}$
3. Minnesota Hispanic: D16S539, p = 0.0334; D18S51, p = 0.0210
4. Canada African American: FGA, p = 0.0092
5. Canada Asian Indian: D7S820, p = 0.0282
6. Connecticut African American: FGA, p = 0.0430; TH01, p = 0.0028
7. Connecticut Caucasian: TH01, p = 0.0288
8. Michigan Caucasian: D16S539, p = 0.0482
9. Michigan Hispanic: vWA, p = 0.0316; FGA, p = 0.0224
10. Native American Total: D3S1358, p = 0.0268; D21S11, p = 0.0006; D18S51, p = 0.0084
11. Navajo: D21S11, p = 0.0282
12. New York Asian: D16S539, p = 0.0074
13. New York Caucasian: D7S820, p = 0.0166
14. New York Hispanic: D21S11, p = 0.0134
15. Texas African American: D13S317, p = 0.0120; D18S51, p = 0.0142
16. Texas Hispanic: D5S818, p = 0.0188
Next Task
• Mixtures
• Assume 2 alleles for 11 loci
• $2^{11}$ possible haplotypes = 2048
• Most haplotypes not observed in database
• Assumption of independence not correct
• Minimal haplotype frequency (≈minimum allele frequency) not practical
Mixture
• Probability of Exclusion
• Binomial distribution - haplotypes excluded and haplotypes not excluded
• Count number (m) not excluded; (PI = m/n)
• Estimate upper CI of PI
• PE = 1- PI
• Based on same principles used for autosomal loci (but at haplotype level)
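The PI/PE steps above can be sketched in Python. The sketch assumes a haplotype is "not excluded" when each of its alleles appears among the mixture alleles at the same locus, and it uses the normal-approximation upper bound for PI, which is an assumption because the slides do not name the interval method. The database and function names are hypothetical:

```python
from math import sqrt

def not_excluded(haplotype, mixture):
    """True if every allele of the haplotype is among the mixture alleles at that locus."""
    return all(allele in observed for allele, observed in zip(haplotype, mixture))

def exclusion_summary(database, mixture, z=1.96):
    """Return m, PI = m/n, an upper bound on PI, and PE = 1 - PI."""
    n = len(database)
    m = sum(1 for haplotype in database if not_excluded(haplotype, mixture))
    pi = m / n
    pi_upper = min(1.0, pi + z * sqrt(pi * (1 - pi) / n))
    return {"m": m, "PI": pi, "PI_upper": pi_upper, "PE": 1 - pi}

# Mixture alleles at three loci and a hypothetical five-profile database.
mixture = [(13, 15), (8, 10), (22, 25)]
database = [(13, 8, 22), (15, 10, 25), (14, 8, 22), (13, 9, 25), (16, 11, 23)]
summary = exclusion_summary(database, mixture)
print(summary)
```

A real database would hold hundreds to thousands of profiles; five are used here only to keep the count easy to follow.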
• Four scenarios for two contributor sample in example:
• Hp --- S1 and S2 are source
• Hd --- S1 and unknown are source (same as PI)
• Hd --- S2 and unknown are source (same as PI)
• Hd --- two unknowns are source
Mixture: Likelihood Ratio
• Assume three loci, two alleles at each locus, two male suspects
• Total alleles the same as in evidence
• Equal contribution
• 8 possible haplotypes
Mixture: Likelihood Ratio
13/15, 8/10, 22/25 --- 3 locus profile
13 8 22 --- haplotype 1
15 8 22 --- haplotype 2
13 8 25 --- haplotype 3
15 8 25 --- haplotype 4
13 10 22 --- haplotype 5
15 10 22 --- haplotype 6
13 10 25 --- haplotype 7
15 10 25 --- haplotype 8
PE/PI: all possible haplotypes are included
But certain haplotype pairs can not explain evidence
haplotype 1 + haplotype 2
haplotype 1 + haplotype 3
haplotype 1 + haplotype 4
haplotype 1 + haplotype 5
and so on
Only certain haplotype pairs can explain evidence
haplotype 1 + haplotype 8
haplotype 2 + haplotype 7
haplotype 3 + haplotype 6
haplotype 4 + haplotype 5
$LR = \dfrac{1}{2[\Pr(H_1)\Pr(H_8) + \Pr(H_2)\Pr(H_7) + \Pr(H_3)\Pr(H_6) + \Pr(H_4)\Pr(H_5)]}$
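The eight haplotypes and the four complementary pairs can be enumerated mechanically. A minimal Python sketch, assuming two contributors and that a pair explains the evidence only when it shows both alleles at every locus; the haplotype frequencies are hypothetical placeholders, since (as the slides go on to note) individual frequencies cannot really be estimated:

```python
from itertools import combinations, product

# Alleles observed at each of the three loci in the mixture.
evidence = [(13, 15), (8, 10), (22, 25)]

# Every haplotype that can be built from the observed alleles: 2**3 = 8.
haplotypes = list(product(*evidence))

def explains(h1, h2):
    """True if the two haplotypes together show exactly the evidence alleles at every locus."""
    return all({a, b} == set(locus) for a, b, locus in zip(h1, h2, evidence))

pairs = [(h1, h2) for h1, h2 in combinations(haplotypes, 2) if explains(h1, h2)]

def likelihood_ratio(freq):
    """LR = 1 / (2 * sum of Pr(Hi) * Pr(Hj)) over the pairs that explain the evidence."""
    return 1 / (2 * sum(freq[h1] * freq[h2] for h1, h2 in pairs))

# Hypothetical frequencies, for illustration only.
freq = {h: 0.01 for h in haplotypes}
lr = likelihood_ratio(freq)
print(len(haplotypes), "haplotypes,", len(pairs), "explaining pairs, LR =", round(lr))
```

Of the 28 possible pairs of distinct haplotypes, only the four complementary ones survive, matching the list on the slide.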
Mixture: Likelihood Ratio
• Technically correct
• Can not estimate individual haplotype frequencies
• $2^{17}$ possible haplotypes (although not all combinations can explain the evidence)
• Assuming independence is not correct
• Cannot place types in database, most never seen, too many
Mixture: LR
Use same logic as PI for the denominator in the LR
Haplotypes fall into either category: excluded or not excluded
Pairs that include an excluded haplotype can not explain the evidence
Only pairs of two not-excluded haplotypes can explain the evidence, and only a subset of these fit
m* - those not-excluded pairs that explain evidence
m*/n(n-1) and take upper CI as denominator
Mixture: LR
$LR = \dfrac{1}{m^*/[n(n-1)]}$, with the denominator taken at its upper CI (UCI)
Mixture: LR
• The denominator is the PI with an assumed number of contributors
• Makes better use of data
Online available Y-STR haplotype reference databases/ Commercial Kits
To calculate/search matches - haplotype frequencies
http://www.appliedbiosystems.com/yfilerdatabase/
Applied Biosystems Yfiler
http://www.promega.com/techserv/tools/pplexy/default.htm
Promega Powerplex Y
AB Yfiler
Haplotype data can be input manually or through file upload
There can be no matches when testing this many loci and this size database
So using our formula from before and an α = 0.05
$1 - \alpha^{1/N}$
$1 - (0.05)^{1/3561} = 0.00084$
But would not use total data set, breakdown by population
If there is a match in the database
$CI = p \pm 1.96\sqrt{p(1-p)/N}$
What p…? What N…?
As before, look at match number and sample size:
$CI = 0.0006 + 1.96\sqrt{0.0006(0.9994)/3561}$
Upper bound would be: 0.00140
In practice, use population group in which the match was found (and other populations appropriate):
$CI = 0.0016 + 1.96\sqrt{0.0016(0.9984)/1276}$
Upper bound would be: 0.00379
One issue to consider is
• Expressing results when database sizes are different
• 1 in 70
• 1 in 700
• May have nothing to do with variation
• But may be exploited incorrectly by some
Recap Comments on Lineage Markers
• Y STRs are paternally inherited
• Barring mutations, all paternally related persons will have the same haplotype
• Different markers on the Y chromosome are genetically linked with no recombination
• Consequently, Y STR data are treated as a single haplotype; frequency is NOT multiplicative across markers
Comments on Lineage Markers (Contd.)
• Counting method is one approach that captures the genetic information
• Stated ethnicity of individuals does not necessarily reflect patrilineal ancestry
• e.g., mtDNA of Hispanics may be almost entirely of Native American descent, while for the autosomal STRs, only 30-50% of their genes are of Native American descent
• Thus, grouping of populations by ancestral lineage does not necessarily provide accurate frequency estimates of specific Y STR haplotypes in a forensic context
Summary Steps of Computations Step 1
• A convenience sample, with subjects included without prior knowledge of their DNA profile, is an adequate database
• Subjects are recorded under broad population groupings by self-identified ethnicity
• Sample size of XXX to XXX individuals is generally adequate
• Haplotype counts from samples of genetically affine groups may be pooled to enhance the sample size
Calculation to Convey to the Court: Counting Method
• The counting method is very simple
• A Y STR haplotype (evidence sample) is compared to a reference database(s) of unrelated individuals
• The number of times the haplotype is observed in a database
• The size of a database can be and is often limited
• With databases (e.g., n = 100 to 3000), many possible haplotypes will not be observed and there will be sampling error
• A confidence interval can be placed on the observation
• Can convey with a high degree of confidence that the rarity of the evidence Y STR haplotype among unrelated individuals in a given population(s) is less than the upper bound of the estimate
• θ adjustment – see presentation during meeting