
Bruce Budowle

Assessing the Significance of Y STR Evidence

Characteristics of the Human Y Chromosome

• size: ~ 60 Mb

• ~ 35 Mb euchromatic (transcribed)

• ~ 25 Mb heterochromatic (non-transcribed)

• 95% non-recombining (NRY)

• 5% X-recombining (2 pseudoautosomal regions at telomeres)

• shape: acrocentric - very short p-arm, long q-arm (“Y” name)

• rich in different kinds of repetitive DNA sequences

• lack of recombination

• relatively poor in gene content

Genes on the Human Y Chromosome

• 23 Mb of the euchromatic region determined

• 156 transcription units

• 78 encode proteins (genes)

• 27 distinct Y-specific protein-coding genes (gene families)

• 16 ubiquitously expressed genes = housekeeping genes

– e.g. RPS4Y, ZFY, AMELY, SMCY, DBY

• 9 testis-specific genes = male sex determination, spermatogenesis

– e.g. SRY, TSPY, CDY, RBMY, DAZ

• origin of NRY genes:

– derived / preserved from the proto-sex chromosomes (X-homology)

– specialisation in male-specific function

Genes Mapped to Y Chromosome

Evolution of Mammalian Sex Chromosomes

Lahn, Pearson & Jegalian 2001

Some homology – need to consider in validation

Polymorphisms on the Human Y Chromosome

Repetitive DNA – e.g., STRs

Single-Copy DNA – e.g., SNPs, indels

Y Chromosome Polymorphisms

• ~ 200 binary polymorphisms (Y-SNPs) characterized

• > 300 microsatellites (Y-STRs) characterized

• 1 minisatellite (MSY1)

Not all mutations occur at the same rate: ‘hot spots’ and ‘cold spots’ (SNPs)

-From J.M. Butler (2003) Forensic Sci. Rev. 15:91-111

Y Chromosome STRs

Nucleic Acids Res. 28(2), e8 (2000)

Marker Name    Repeat Motif     Allele Range
DYS19          TAGA             8-16
DYS385         GAAA             10-22
DYS388         ATT              12-17
DYS389 I       (TCTG) (TCTA)    7-13
DYS389 II      (TCTG) (TCTA)    23-31
DYS390         (TCTA) (TCTG)    18-27
DYS391         TCTA             8-13
DYS392         TAT              7-16
DYS393         AGAT             9-15
YCAII          CA               16-25
YCAIII         CA               19-25
DYS434         ATCT             8-11
DYS435         TGGA             9-13
DYS436         GTT              10-15
DYS437         TCTA             8-11
DYS438         TTTTC            6-12
DYS439         AGAT             9-14
Y-GATA-A4      AGAT             11-14
Y-GATA-A7.1    ATAG             7-12
Y-GATA-A7.2    TAGA             8-12
Y-GATA-A8      TCTA             8-14
Y-GATA-A10     TATC             11-14
Y-GATA-C4      TATC             11-16
Y-GATA-H4      TAGA             10-13
DYS441         CCTT             12-18
DYS442         TATC             10-14
DYS443         TTCC             12-17
DYS444         TAGA             11-15
DYS445         TTTA             10-13
G09411         TATG             8-14

Mutation Process for STR loci

Scientific Uses

• Genealogy studies, e.g., Hemings-Jefferson case

• Human evolution studies, e.g., human migrations

**Remember paternal lineage issues for identity testing

Definitions

• Haplogroup: set of haplotypes defined by slowly mutating markers (mainly SNPs) which have more phylogenetic stability

• Haplotype: combination of allelic states of a set of polymorphic markers lying on the same DNA molecule

Unique event polymorphisms record history of Y chromosome

Phylogenetic tree of Y haplogroups based on binary SNP data

Applications of Y-STRs

• Forensic Analysis

– Detect male DNA in a sample containing male and female DNA (huge background of female DNA)

– Aspermic males

– Fingernail scrapings

– Additional DP value

– Multiple male donors

– Limits of differential extraction/tissues

– Gender clarification (amelogenin)

• Paternal Lineage

– Paternity Testing

– Kinship analysis

– Deficiency cases

Identification of Male Contributor DNA in Crime Scene Material

[Figure: profiles of female victim DNA, male suspect DNA, and a mixture of a large amount of female DNA with perpetrator male DNA]

Autosomal STR profile

- See only female DNA profile

- Or partial DNA profile

Y STR profile

- no female DNA

- no profile overlap

- only male component

One Forensic Application

• Some victims of sexual assault may not provide vaginal samples immediately after the incident

• Ability to retrieve typeable autosomal STR profiles from the semen diminishes rapidly

• However, sperm persist in the vaginal canal up to 3 days after intercourse (may be detectable up to 7 days after intercourse in the cervix)

• Sperm decrease in number as interval increases

• Lymphocytes and epithelial cells from male

Finger Nail Scraping Case

Victim was found strangled to death

Suspect had scratches on his face

Based on STR results, suspect could not be excluded; many alleles were below minimal threshold (inconclusive result)

Evidence Profile

Suspect Profile

Applications of Y-STRs

• Forensic Analysis

• Paternal Lineages

− Paternity Testing

− Kinship analysis

− Deficiency cases

Deficiency Case Male Lineage

• Y STR analysis - any male relative in pedigree can be a reference for alleged father

Y-STR Haplotype Analysis in Deficiency Paternity Case

          DYS19  DYS389I  DYS389II  DYS390  DYS391  DYS392  DYS393  DYS385  DYS413  YCAII
Nephew    14     13       30 (16)   25      11      13      12      11-14   22      7
Son       14     12       29 (16)   24      10      15      12      11-14   22      7

Exclusion

If true biological nephew, then alleged father is excluded as father of child in question

?

Kayser et al. Progress in Forensic Genetics (1998), 7: 494-496

Multi-Copy (Duplicated) Marker: DYS385 a/b

[Figure: forward (F) and reverse (R) primers amplify two duplicated regions, a and b; the duplicated regions are 40,775 bp apart and facing away from each other; alleles may be a = b or a ≠ b]

Single Region but Two PCR Products (because forward primers bind twice): DYS389 I/II

[Figure: F primer binding sites and R primer giving the DYS389I and DYS389II products]

Figure 9.5, J.M. Butler (2005) Forensic DNA Typing, 2nd Edition © 2005 Elsevier Academic Press

Y-STR consensus structure and allele ranges

Marker Name    GenBank Accession    Repeat Motif     Allele Range    PCR Product Sizes    Reference
DYS19          X77751               TAGA             8-16            178-210 bp           Roewer 1992
DYS385         Z93950               GAAA             10-22           252-300 bp           Schneider 1998
DYS388         G09695               ATT              12-17           128-143 bp           Kayser 1997
DYS389 I       G09600               (TCTG) (TCTA)    7-13            239-263 bp           Kayser 1997
DYS389 II      G09600               (TCTG) (TCTA)    23-31           353-385 bp           Kayser 1997
DYS390         G09611               (TCTA) (TCTG)    18-27           191-227 bp           Kayser 1997
DYS391         G09613               TCTA             8-13            275-295 bp           Kayser 1997
DYS392         G09867               TAT              7-16            236-263 bp           Kayser 1997
DYS393         G09601               AGAT             9-15            108-132 bp           Kayser 1997
YCAIII         AC006370             CA               19-25           192-204 bp           Kayser 1997
DYS434         AC002992             ATCT             8-11            110-122 bp           Ayub 2000
DYS435         AC002992             TGGA             9-13            210-228 bp           Ayub 2000
DYS436         AC005820             GTT              10-15           128-143 bp           Ayub 2000
DYS437         AC002992             TCTA             8-11            186-202 bp           Ayub 2000
DYS438         AC002531             TTTTC            6-12            203-233 bp           Ayub 2000
DYS439         AC002992             AGAT             9-14            238-258 bp           Ayub 2000
Y-GATA-A4      G42670               AGAT             11-14           242-254 bp           White 1999
Y-GATA-A7.1    G42675               ATAG             7-12            161-181 bp           White 1999
Y-GATA-A7.2    G42671               TAGA             8-12            174-190 bp           White 1999
Y-GATA-A8      G42672               TCTA             8-14            219-244 bp           White 1999
Y-GATA-A10     G42674               TATC             11-14           160-172 bp           White 1999
Y-GATA-C4      G42673               TATC             11-16           251-271 bp           White 1999
Y-GATA-H4      G42676               TAGA             10-13           362-370 bp           White 1999

Y Chromosome STR Markers

• For effective use, guidelines are needed

• ISFG Recommendations

• Combine with existing recommendations (NRC II Report)

• Nomenclature, Allelic Ladders, Population Genetics, Statistical Issues

Basic Interpretation Guidelines

• Similar to autosomal STRs

• Thresholds for detection and interpretation

• Stutter

• Mixtures – what constitutes a mixture

• Validation studies in concert with guidelines

• Interpret evidence before knowns

Y STR LOCI

• DYS19

• DYS389 I

• DYS389 II

• DYS390

• DYS391

• DYS392

• DYS393

• DYS385 I/II

“Minimal Haplotype” – defined for research only

Y STR Loci: SWGDAM

DYS19, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS438, DYS439, DYS385a/b

DYS385 – two loci

DYS389 – two loci

Kits

• Commercially available Y-STR multiplex kits allow for standard markers and QA/QC

• Most have EMH and SWGDAM recommended loci

PowerPlex® Y System: DYS19, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439, DYS385 (DYS385 – two loci)

AmpFlSTR® Yfiler™ Kit: DYS19, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439, DYS385, DYS448, DYS456, DYS458, DYS635, GATAH4.1 (DYS385 – two loci)

Promega Amplification Performance (moderate-to-low sensitivity instrument)

[Figure: electropherograms, 2 s injection on ABI PRISM® 310 Genetic Analyzer: 1 ng male DNA; 0.125 ng male DNA; 100 ng female DNA; 0.1 ng male + 100 ng female]

DNA Mixtures: Increasing female (Promega)

[Figure: electropherograms, 3 s injection on ABI PRISM® 310 Genetic Analyzer: 0.5 ng male; 0.5 ng male + 0.5 ng female (equal mix); 0.5 ng male + 5 ng female (10-fold); negative control]

DNA quantified by A260

AmpFlSTR® Yfiler™ Kit: 1 ng Male Control DNA 007

[Figure: electropherogram panels labeled DYS458, DYS389 I, DYS390, DYS389 II; DYS438, DYS19, DYS385 a/b; DYS393, DYS391, DYS439, Y GATA C4, DYS392; Y GATA H4, DYS456, DYS437, DYS448]

AmpFlSTR® Yfiler™ allelic ladder: 137 alleles

Y STR Population Data: Promega Study

Population    N
CFS AFR       37
CT AFR        182
MI AFR        86
NYC AFR       80
TX AFR        192
CFS CAU       57
CT CAU        164
MI CAU        97
NYC CAU       83
TX CAU        194
CT HIS        160
MI HIS        97
MN HIS        101
NYC HIS       80
TX HIS        192
Apache        138
Navajo        219
CFS ASN       28
MN ASN        101
NYC ASN       45
TX ASN        73
CFS EI        37

Total = 2443

[Figure: DYS19 allele frequencies, African American. x-axis: alleles 12 to 18; y-axis: frequency, 0 to 0.5. Series: Sinha (n=543), CFS (n=37), CT (n=182), MI (n=86), NYC (n=80), TX (n=193)]

Population Parameters

$x_i$ = frequency of each haplotype; $n$ = # haplotypes

Genetic Diversity: $h = \dfrac{n\left(1 - \sum_i x_i^2\right)}{n - 1}$

Random Match Probability: $P = \sum_i x_i^2$
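The two population parameters can be computed directly from a list of typed haplotypes. The following is a minimal sketch, assuming each haplotype is stored as a tuple of allele calls and that n is the number of haplotypes in the sample; the function name and the toy sample are illustrative only and are not from the Promega study.

```python
from collections import Counter


def haplotype_stats(haplotypes):
    """Return (gene diversity h, random match probability P) for a sample."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    counts = Counter(haplotypes)
    # P = sum of squared haplotype frequencies
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    # h = n(1 - sum x^2) / (n - 1)
    h = n * (1.0 - sum_sq) / (n - 1)
    return h, sum_sq


# Hypothetical three-locus haplotypes; one type is seen twice.
sample = [
    (14, 13, 29), (14, 13, 29),
    (15, 12, 28), (16, 13, 30), (14, 12, 29),
]
h, p_match = haplotype_stats(sample)
print(f"h = {h:.3f}, P = {p_match:.3f}")
```

With every haplotype distinct, P falls to 1/n and h reaches 1, which is why the diversity values reported for these databases sit so close to 1.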

[Figure: Gene diversity, Hispanic. x-axis: DYS437, DYS19, DYS392, DYS393, DYS390; y-axis: 0 to 0.8. Series: CT HIS, MI HIS, MN HIS, NYC HIS, TX HIS]

Y Haplotype Profiles

Population    N      # Haplotypes    % Single    Haplotype Diversity
CFS AFR       37     36              97.3        0.9985
CT AFR        182    172             94.5        0.9994
MI AFR        86     85              98.8        0.9997
NYC AFR       80     80              100         >0.9999
TX AFR        193    181             93.8        0.9993
CFS CAU       57     50              87.7        0.9944
CT CAU        163    153             93.9        0.9991
MI CAU        97     87              89.7        0.9968
NYC CAU       83     80              96.4        0.9991
TX CAU        194    170             87.6        0.9981
CT HIS        158    130             82.3        0.9963
MI HIS        97     90              92.8        0.9985
MN HIS        100    95              95.0        0.9988
NYC HIS       80     74              92.5        0.9968
TX HIS        192    179             93.2        0.9991
Apache        138    70              50.7        0.9701
Navajo        219    101             46.1        0.9806
CFS ASN       28     28              100         >0.9999
MN ASN        100    96              96.0        0.9992
NYC ASN       45     43              95.6        0.9970
TX ASN        70     69              98.6        0.9996
CFS EI        37     35              94.6        0.9955

high haplotype diversity = high inter-individual variation

[Figure: Haplotype diversity, N>80. y-axis: 0.955 to 1.005. Populations: CT CAU, NYC CAU, MI CAU, TX CAU, CT AFR, MI AFR, NYC AFR, TX AFR, CT HIS, MI HIS, MN HIS, NYC HIS, TX HIS, Apache, Navajo, MN ASN]

Multiband Y Patterns

• MN ASIAN PA0077 DYS385 - 3 Bands

• MN HISPANIC PH0031 DYS390 - 2 Bands

• MN HISPANIC PH0063 Multibands

• NYC HISPANIC 26 DYS19 - 2 Bands

• NYC CAUCASIAN 4 DYS19 - 2 Bands

• CT HISPANIC 00-1851 DYS19 - 2 Bands

• CT CAUCASIAN 00-3022 DYS385 - 3 Bands

• ASIAN A-FTA-34-F/C DYS385 - 3 Bands

• ASIAN A-FTA-36-F/C DYS19 – Primer Binding site?

• ASIAN A-FTA-32-F/C DYS385 - 4 Bands

Must consider when analyzing mixtures

What about Equilibrium?

• These markers are all on a single chromosome

• No recombination

• Predict strong linkage and a lack of independence between the loci

• Haplotype

Y STR Loci Pairwise Tests: 12 Loci – 66 tests

Population    N      # Equilibrium
CFS AFR       37     35
CT AFR        182    23
MI AFR        86     27
NYC AFR       80     34
TX AFR        193    21
CFS CAU       57     30
CT CAU        163    26
MI CAU        97     30
NYC CAU       83     22
TX CAU        194    11
CT HIS        158    26
MI HIS        97     33
MN HIS        100    42
NYC HIS       80     35
TX HIS        192    37
Apache        138    9
Navajo        219    12
CFS ASN       28     60
MN ASN        100    50
NYC ASN       45     47
TX ASN        70     50
CFS EI        37     35

Fewest – Native American; Most – Asian (sample size)

Y STR Loci Pairwise Tests: 22 populations; >17 Equilibrium detected

Loci          # Populations
391/389I      19
391/389II     18
389I/439      17
389I/385-2    18
439/389II     17
439/393       19
439/385-1     18
439/385-2     20
389II/393     17
437/393       18

Y STR Loci Pairwise Tests: 22 populations; <5 Equilibrium

Loci           # Populations
391/438        5
389I/389II     1
438/437        4
438/19         4
438/392        3
438/385-1      0
438/385-2      4
437/385-1      5
19/392         5
19/385-1       3
392/385-1      5
390/385-1      5
385-1/385-2    3

Y STR Loci Pairwise Tests: 22 Populations – Examples of Population Specific Disequilibrium

Loci          # Populations/Equilibria    Population
389I/392      15                          Caucasian
438/439       11                          Caucasian
437/439       16                          Caucasian
437/385-2     12                          African American
390/385-2     12                          African American

• There is evidence of “independence” between some pairs of loci in the population samples

• Likely due to combination of mutation rate, subdivision and random drift

• A large factor is haplogroup diversity

• The marker selection for increasing haplotype diversity is not directly correlated to gene diversity

• Conservative estimates

Equilibrium and Impact?

Approaching Analysis

• Some may suggest - “Use the set of Core Y STRs and add more as needed to resolve matches”

• First question – when do you stop?

• If you get a match, you would have to continue on ad infinitum!

• Is this a sensible policy?

How much power is needed???

Discriminatory Capacity for Three US Populations

Y-STR marker combination          African American (N=786)    Caucasians (N=778)    Hispanics (N=381)
European minimal haplotype (9)    75.8%                       61.7%                 79.8%
Minimal + SWGDAM + DYS437 (12)    87.7%                       76.7%                 88.2%
Eur. Minimal + SWGDAM (11)        86.8%                       74.3%                 85.6%
AmpFlSTR® Yfiler kit (17)         97.6%                       95.5%                 95.8%

*DC = (# of different haplotypes / pop. size) x 100

Mulero et al., JFS (2006) 51:64-75
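The DC formula and the count of haplotypes seen only once are simple to compute from a list of typed samples. The sketch below assumes haplotypes are tuples of allele calls; the function names and the five-sample toy data are hypothetical and only illustrate how adding a locus can raise DC.

```python
from collections import Counter


def discrimination_capacity(haplotypes):
    """DC = (# of different haplotypes / sample size) x 100."""
    if not haplotypes:
        raise ValueError("empty sample")
    return 100.0 * len(set(haplotypes)) / len(haplotypes)


def singleton_count(haplotypes):
    """Number of haplotypes observed exactly once in the sample."""
    return sum(1 for c in Counter(haplotypes).values() if c == 1)


# Hypothetical three-locus data for five males.
full = [(14, 13, 29), (14, 13, 30), (15, 12, 28), (15, 12, 28), (16, 13, 30)]
# The same samples typed at the first two loci only.
reduced = [h[:2] for h in full]

dc_reduced = discrimination_capacity(reduced)
dc_full = discrimination_capacity(full)
unique_full = singleton_count(full)
print(dc_reduced, dc_full, unique_full)
```

Note that DC depends on the sample size as well as on the loci, so values from databases of different sizes are not directly comparable.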

Number of Unique Haplotypes Observed in Three US Populations

Y-STR marker combination          African American (N=786)    Caucasians (N=778)    Hispanics (N=381)
European minimal haplotype (9)    496                         382                   266
Minimal + SWGDAM + DYS437 (12)    628                         524                   306
Eur. Minimal + SWGDAM (11)        618                         503                   295
AmpFlSTR® Yfiler kit (17)         749                         714                   350

Mulero et al., JFS (2006) 51:64-75

So… more loci do result in better resolution

But…size of the database impacts more on statistics


• So additional testing beyond a single kit is an unlikely approach because information gain is low

• Many samples will already be very limited

• Community will rely on commercially available kits not in-house designer systems

• QC/ Proficiency Testing

• Better to increase size of database(s) to gain power

• We will re-visit substructure issues later


• Some may suggest - “A reference database should contain related individuals” – to better define the population

• Probability of paternal relative having the same haplotype is usually 1

• Typically comprised of unrelated individuals

• Although a small unknown number of related individuals may be in a database

• Able to address significance of a very closely related profile

Exclusion with 1 mismatch among 12 analyzed Y-STRs

Evidence 14 12 28 25 11 11 13 14,14 11 11 15

Known 14 12 28 24 11 11 13 14,14 11 11 15

By having a database of unrelated males, one can assess the weight of a relative (with mutation) versus the rarity of the haplotype in the population

Qualitative Conclusions of Y-STR Haplotype Comparison

Exclusion: The two haplotypes are dissimilar; i.e., the reference person is excluded as the contributor of Y-specific DNA of the evidence sample

Inclusion/Match: The Y haplotypes from two samples are sufficiently similar and potentially could have originated from the same source, or from a common paternal lineage

Inconclusive: Exclusion/Inclusion cannot be definitively inferred due to insufficient data from one or both of the DNA samples
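The three qualitative conclusions can be sketched as a locus-by-locus comparison. This is only an illustration: the `min_loci` threshold for calling a result inconclusive is an assumed parameter (no such number is given here), missing results are represented as `None`, and the allele values are the 12-locus evidence and known profiles from the earlier mismatch example with positional indices standing in for locus names.

```python
def compare_haplotypes(evidence, known, min_loci=1):
    """Return (conclusion, indices of mismatching loci)."""
    # Only loci typed in both samples can be compared.
    typed = [
        (i, e, k)
        for i, (e, k) in enumerate(zip(evidence, known))
        if e is not None and k is not None
    ]
    if len(typed) < min_loci:
        return "inconclusive", []
    mismatches = [i for i, e, k in typed if e != k]
    return ("exclusion" if mismatches else "match"), mismatches


# DYS385 is written as one entry holding both alleles ("14,14").
evidence = ["14", "12", "28", "25", "11", "11", "13", "14,14", "11", "11", "15"]
known = ["14", "12", "28", "24", "11", "11", "13", "14,14", "11", "11", "15"]

conclusion, where = compare_haplotypes(evidence, known)
print(conclusion, where)
```

A strict comparison like this reports an exclusion on a single one-step difference, so the possibility of a mutation within one paternal lineage still has to be weighed separately.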

• Unique event polymorphisms record history of Y chromosome

• Effective population size

• Patrilineal inheritance reduces population size

• Variance of offspring further reduces effective population size

• Patrilocal effect causes local differentiation

• Lack of recombination

• But some detectable linkage equilibrium

Now Let’s Think About Application

Population Differentiation

• Effective population size of Y chromosome is 1/4 of autosome or 1/3 of X

– lower sequence diversity on Y

– more susceptible to genetic drift

• random changes in frequency of haplotypes due to sampling bias from one generation to next

• accelerates differences between populations

• Geographical clustering due to patrilocal behavior of men

– women move closer to man’s birthplace

– local geographical differentiation enhanced

• Variance of offspring further reduces Ne (effective population size)

Calculation to Convey to the Court

• Frequency estimate not possible

• Court desires a frequency estimate

• Point Estimate (Counting Method)

• Confidence Interval

• Approach the same as mtDNA

Calculation to Convey to the Court: Issues

• Estimating the rarity of a Y DNA profile is performed differently than for autosomal DNA markers

• No evidence for recombination across the majority of the Y-chromosome

• One cannot employ the product rule to estimate the rarity of the Y types in a profile

• The composite multi-locus profile is treated as a single locus or haplotype

• The vast majority of possible haplotypes will not be observed in any database

• The counting method is likely to be conservative

• A correction for sampling

• A correction for substructure

Calculation to Convey to the Court

Calculation to Convey to the Court: Counting Method

• The counting method is very simple

• A Y STR haplotype (evidence sample) is compared to a reference database(s) of unrelated individuals

• The number of times the haplotype is observed in a database

Calculation to Convey to the Court: Approaches

• It is more likely that the counting method will be employed by the U.S. laboratories and courts because of its operational simplicity

Limitations of the Counting Method

• Non-matched sites of the haplotype are given weight equal to that of different origin (but may have some extra value for substructure)

• Mutations are not weighted

• Haplotypes of the same paternal lineage can be excluded, when they are subject to mutations

• Does not recognize evolutionary changes, and/or effect of convergent mutations

Maximum haplotype frequency

• If a Y-haplotype is not seen in a sample of N males (i.e., the haplotype has not been observed in the database), then at the α level of significance the upper bound of the CI is $1 - \alpha^{1/N}$

• As N becomes larger, maximum frequency becomes closer to point estimate

N         frequency
100       3/100
500       3/500
1,000     3/1,000
10,000    3/10,000
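The bound for an unobserved haplotype, and its closeness to 3/N when α = 0.05, can be checked numerically. A minimal sketch follows; the function name is illustrative.

```python
def max_frequency_unobserved(n, alpha=0.05):
    """Upper confidence bound 1 - alpha**(1/N) for a haplotype with zero
    observations in a database of N males."""
    if n <= 0:
        raise ValueError("database size must be positive")
    return 1.0 - alpha ** (1.0 / n)


for size in (100, 500, 1000, 10000):
    bound = max_frequency_unobserved(size)
    # At alpha = 0.05 the bound sits just under 3/N.
    print(size, round(bound, 6), 3 / size)
```

The 3/N shortcut works because -ln(0.05) is about 2.996, so the exact bound is always slightly below 3/N.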

Haplotype frequency (95%)

For a Y haplotype that is observed, count the number of times the profile is observed (X)

$p = X/N$

$\mathrm{CI} = p \pm 1.96\sqrt{p(1-p)/N}$

A potential mtDNA statement could be: I have confidence that the true haplotype frequency is less than the upper CI value

• Correction for population structure may be considered

• Effective population size ¼ of autosomal loci

• May actually be a little lower

• Substructure effects less in US than ancestral populations

• Use when reference database considered not representative

Calculation to Convey to the Court: Population Substructure

Problems created by population subdivision

Haplotype frequencies calculated from population average frequencies could lead to:

– Wrong estimates!

Employ a Theta (θ) Correction

θ is used as a measure of the effects of population subdivision (inbreeding)

NRC θ recommendation was pragmatically set

Empirical values are much less for autosomal loci

National Academy of SciencesMay 1996

Still need to calculate substructure effects

But likely to be low for most major populations, if evaluated under a forensic model vs that of an evolutionary model

• Population is relevant and representative

• Population is relevant, but not representative

• Population is similar and not representative

• Follow NRC II Recommendations

Calculation to Convey to the Court: Population Substructure

U.S. Y-STR Haplotype Reference Database

www.ystr.org/usa

                                    AA           CAU           HIS           Total
Haplotype diversity                 99.8%        99.6%         99.5%         99.7%
Number of populations sampled       10           11            9             30
Number of individuals sampled       599          628           478           1,705
Number of Y-STR loci typed (EMH)    9            9             9             9
Number of different haplotypes      454 (76%)    437 (70%)     354 (74%)     1116 (65%)
Most frequent haplotype             12 (2.0%)    25 (3.98%)    19 (3.97%)    53 (3.1%)

Kayser et al. J. Forensic Sci. (2002), 47(3): 513-519

Structure of U.S. Populations with Y-STR Haplotypes

[Figure: clustering of U.S. population samples by Y-STR haplotypes. European-American (EA): Florida, Pennsylvania, New York, Indiana, Missouri, Louisiana, Maryland, Texas, Cajun, Virginia, Oregon. African-American (AA): Virginia, Florida, Maryland, Texas, New York, Pennsylvania, Missouri, Oregon, Indiana, Louisiana. Hispanic (H): Pennsylvania, Florida, New York, Connecticut, Oregon, Maryland, Louisiana, Texas, Virginia]

RST = 0.1

RST: measure for population differentiation

Kayser et al. J. Forensic Sci. (2002), 47(3): 513-519

Based on Evolutionary Model, Not a Forensic Model

Computation of Frequency of Lineage-based Marker Profile

Using the general theory, the unconditional frequency of a haplotype (say $A_i$), which is count divided by sample size, can be modified to get the conditional probability

$\Pr(A_i \mid A_i) = \dfrac{p_i^2 + \theta p_i(1 - p_i)}{p_i} = p_i + \theta(1 - p_i) = \theta + p_i(1 - \theta)$

Hence, the conditional probability always exceeds θ, the adjustment factor of possible population substructure in the database used

θ Becomes the bound

Computation of Frequency of Lineage-based Marker Profile (Contd.)

Some suggest that the quantity $p_i$ in $\Pr(A_i \mid A_i) = \theta + p_i(1 - \theta)$ can be substituted by (Count of $A_i$ + 2)/(N + 3),

where N is the sample size. When N is large, this has little effect, but can be of help when the count of Ai in the database is zero (i.e., profile in evidence not seen in the database)
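The θ-adjusted conditional probability, with the optional (Count + 2)/(N + 3) substitution, can be written as one small function. This is a sketch under the formulas stated above; the function name, the flag name, and the example numbers (a haplotype with zero observations among 997 males and θ = 0.001) are illustrative assumptions.

```python
def conditional_match_probability(count, n, theta, adjusted=True):
    """Pr(Ai | Ai) = theta + p_i * (1 - theta).

    If adjusted is True, p_i is taken as (count + 2) / (n + 3), which keeps
    the estimate above zero when the haplotype is absent from the database;
    otherwise p_i is the plain counting estimate count / n.
    """
    if n <= 0 or not 0 <= count <= n:
        raise ValueError("need 0 <= count <= n and n > 0")
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    p = (count + 2) / (n + 3) if adjusted else count / n
    return theta + p * (1.0 - theta)


# Hypothetical case: haplotype not seen in a database of 997 males.
prob = conditional_match_probability(count=0, n=997, theta=0.001)
print(round(prob, 6))
```

Because the result can never fall below θ, a large θ dominates the estimate for rare haplotypes, which is why the choice of θ matters more than the exact count here.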

Y STR θ -Value

In terms of match versus non-match, how different the haplotypes are is not an issue

The θ values for Y STRs are therefore not computed with mismatch-based approaches (such as AMOVA), but instead by treating all haplotypes as different alleles, which generally leads to a much smaller θ value

What θ to use?

• The value of θ (computed by either RST or FST) is dependent indirectly on the number of Y STR loci comprising the haplotype

• Generally, as more loci are included in the haplotype, most haplotypes in a data set will become differentiated

• Therefore, the greater the number of loci within the haplotypes, the smaller should be the value of θ

What θ to use?

• RST has been used to estimate the value of θ for forensic calculations

• However, the analyses based on RST are better applied to evolutionary biology for studying the phylogenetic history of Y-chromosomes

• The RST values are based on allele size variance, exploiting the extent of difference between different haplotypes

• Such an approach, however, typically does not apply to forensic inferences; forensic applications assess the evidence in terms of match or non-match.

What θ to use?

• In these terms, the θ values for Y STRs should not be computed on mismatch based approaches, but instead by simply treating all haplotypes as different alleles

• Thus, FST (or GST) is a better estimator for θ

• Haplotypes are identified solely by their distinctiveness (i.e., haplotypes are considered simply in terms of identity by state)

• This generally leads to a more appropriate and much smaller θ value than estimated by RST

Y STR haplotype is one locus with many alleles

Population 1: A1, A2, ..., A100

Population 2: A101, A102, ..., A200

Databases with reasonable size approximate this model

θ is almost 0
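The two-population model above can be checked numerically by treating each distinct haplotype as one allele of a single locus. The sketch below uses Nei's G_ST, computed as (H_T - H_S)/H_T with equal population weights and no sample-size correction; those simplifications, the function name, and the labels A1 to A200 are assumptions for illustration, and this is not necessarily the estimator behind the θ values reported later.

```python
from collections import Counter


def gst(populations):
    """Nei's G_ST with every distinct haplotype treated as one allele."""
    freqs = []
    for pop in populations:
        if not pop:
            raise ValueError("empty population sample")
        n = len(pop)
        freqs.append({hap: c / n for hap, c in Counter(pop).items()})
    k = len(freqs)
    all_haps = set().union(*freqs)
    # H_S: mean within-population diversity, 1 - sum of squared frequencies.
    h_s = sum(1.0 - sum(f * f for f in fr.values()) for fr in freqs) / k
    # H_T: diversity of the pooled population, using mean frequencies.
    h_t = 1.0 - sum(
        (sum(fr.get(hap, 0.0) for fr in freqs) / k) ** 2 for hap in all_haps
    )
    return (h_t - h_s) / h_t


# Each population has 100 haplotypes, all different, none shared.
pop1 = [f"A{i}" for i in range(1, 101)]
pop2 = [f"A{i}" for i in range(101, 201)]
value = gst([pop1, pop2])
print(round(value, 4))
```

Even with no haplotype shared between the two populations, the identity-by-state measure stays near 0.005, because almost all of the diversity is already present within each population.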

Y STR haplotype is one locus with many alleles

Population 1: A1, A2, ..., An

Population 2: A1', A2', ..., An'

In reality, with a large number of loci a few types are shared and most if not all have never been seen in the database

θ approaches 0

What did he just say?

Forensic Model: Population Substructure

     DYS19  DYS389I  DYS389II  DYS390  DYS391  DYS392  DYS393  DYS385
A    14     12       29        24      10      15      12      11-14
E    18     12       25        24      10      15      15      11-18
B    14     13       29        24      10      15      12      11-14
C    14     13       29        24      12      15      12      10-14
D    18     11       25        24      10      13      15      12-18

Which haplotypes might be more closely related?

Forensic Model: Population Substructure

     DYS19  DYS389I  DYS389II  DYS390  DYS391  DYS392  DYS393  DYS385
A    14     12       29        24      10      15      12      11-14
C    14     13       29        24      12      15      12      10-14
Exclusion

E    18     12       25        24      10      15      15      11-18
A    14     12       29        24      10      15      12      11-14
Exclusion

Are such evolutionary differences considered in forensic evaluation?

[Figure: F-statistics distribution. x-axis: # of loci (16, 15, 13, 11, 10); y-axis: 0 to 0.0025. Series: Fst, Fst w/o NA, Fsc, Fsc w/o NA, Fct, Fct w/o NA]

θ Value and Impact (or do we need more Y STR loci?)

• FST values across major populations range from <0.002 to <0.0001 (depending on the number of loci comprising the haplotype)

• Population-specific values are smaller, excluding Native Americans

• 17 loci A = 0.00005; C = 0.00016; H = 0.00007; 12 loci A = 0.00005; C = 0.0002; H = 0.0004

• With the size of current reference population databases, FST rarely will influence the upper bound of the Y STR frequency for most population groups (with the current commercially available systems)

• Increasing the size of reference population data sets and including more populations would be more valuable and a better use of resources for exploiting the full power of Y STR typing

• Dedicating efforts to expanding the battery of Y STR loci will not significantly exploit the power of the assays

Key thoughts

• Diversity

• Most common types in population

• Following NRCII recommendations

Partial Profile θ: Combined Populations

• 0.0008 -- 15 loci

• 0.002 -- 11 loci

• 0.0329 -- 5 loci

• 0.1199 -- 2 loci

Locus Caucasian(N = 199)

Afr Amer(N = 203)

Hispanic(N = 207)

Asian(N = 83)

Total(N = 692)

DYS391 11:10 1

DYS389I 12:13 1

DYS389II 29:30 30:29 1*

DYS439 13:12 11:12 2

DYS438 0

DYS437 15:16 15:14 2

DYS19 16:17 2

17:16

DYS392 0

DYS393 14:15 1

DYS390 0

DYS385 14:15 14:15 12,14:14 3

Total 2/2388 6/2436 2/2484 3/996 13/8304

0.00084 0.00246 0.00081 0.0031 0.00157

Y STR mutations (father:son allele transmission)

• 692 confirmed father-son pairs (probability > 99.9%)

• 14 mutation events were observed; the rate below counts 13, treating the DYS389I/II change in one pair as a single event

• Average rate of $1.57 \times 10^{-3}$/locus/generation (13/8304)

• With a 95% confidence bound of $0.83 \times 10^{-3}$ to $2.69 \times 10^{-3}$

• This rate is a little smaller than that of the Kayser et al. estimate ($2.80 \times 10^{-3}$/locus)

• But the difference is not statistically significant (P > 0.05).
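The average rate and an interval around it can be recomputed from the counts 13 and 8304. How the quoted 95% bound was obtained is not stated here, so the sketch below assumes an exact (Clopper-Pearson) binomial interval, found by bisection with the standard library only; it lands close to the quoted bounds but is not claimed to be the original method. All function names are illustrative.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(func, lo=0.0, hi=1.0, iters=100):
    """Root of an increasing function on [lo, hi] by bisection."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if func(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided binomial confidence interval for x events in n trials."""
    lower = 0.0 if x == 0 else _bisect(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2)
    upper = 1.0 if x == n else _bisect(
        lambda p: alpha / 2 - binom_cdf(x, n, p))
    return lower, upper


mutations, transmissions = 13, 8304  # 692 father-son pairs x 12 loci
rate = mutations / transmissions
low, high = clopper_pearson(mutations, transmissions)
print(f"rate = {rate:.2e}, 95% interval = {low:.2e} to {high:.2e}")
```

The interval is wide relative to the point estimate because only 13 events were observed, which is consistent with the difference from the larger published estimate not reaching significance.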

One Asian father-son pair at the DYS389I/II loci complex, (12, 29) → (13, 30), appears as a double mutation, but likely is a single original event.

Y STR mutations (father:son allele transmission)

Next Task

• Test independence between autosomal loci and Y haplotypes

Independence Testing of Y Haplotype and 13 Autosomal CODIS STR Loci

(Autosomal Locus/ Y Haplotype Displaying Disequilibria* - 22 populations)

1. Apache: FGA, p-value = 0.0376; D21S11, p-value = 0.0346; D18S51, p-value = 0.0282; D5S818, p-value = 0.0266

2. Minnesota Asian: D8S1179, p < 10^-3

3. Minnesota Hispanic: D16S539, p-value = 0.0334; D18S51, p-value = 0.0210

4. Canada African American: FGA, p-value = 0.0092

5. Canada Asian Indian: D7S820, p-value = 0.0282

6. Connecticut African American: FGA, p-value = 0.0430; TH01, p-value = 0.0028

7. Connecticut Caucasian: TH01, p-value = 0.0288

8. Michigan Caucasian: D16S539, p-value = 0.0482

9. Michigan Hispanic: vWA, p-value = 0.0316; FGA, p-value = 0.0224

10. Native American Total: D3S1358, p-value = 0.0268; D21S11, p-value = 0.0006; D18S51, p-value = 0.0084

11. Navajo: D21S11, p-value = 0.0282

12. New York Asian: D16S539, p-value = 0.0074

13. New York Caucasian: D7S820, p-value = 0.0166

14. New York Hispanic: D21S11, p-value = 0.0134

15. Texas African American: D13S317, p-value = 0.0120; D18S51, p-value = 0.0142

16. Texas Hispanic: D5S818, p-value = 0.0188

Next Task

• Mixtures

• Assume 2 alleles for 11 loci

• $2^{11}$ = 2048 possible haplotypes

• Most haplotypes not observed in database

• Assumption of independence not correct

• Minimal haplotype frequency (≈minimum allele frequency) not practical

Mixture: Probability of Exclusion

• Binomial distribution - haplotypes excluded and haplotypes not excluded

• Count number (m) not excluded; (PI = m/n)

• Estimate upper CI of PI

• PE = 1- PI

• Based on same principles used for autosomal loci (but at haplotype level)
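The counting steps above can be sketched as a scan of the reference database against the alleles seen in the mixture. The upper bound reuses the two formulas given earlier for single-source haplotypes (1 - α^(1/N) when the count is zero, otherwise p + 1.96√(p(1-p)/N)); applying them to PI is an assumption of this sketch, and the five-haplotype database is invented for illustration.

```python
import math


def mixture_inclusion(mixture, database, alpha=0.05, z=1.96):
    """Return (m, PI, upper bound of PI, PE) for a Y STR mixture.

    mixture  : one collection of observed alleles per locus
    database : list of haplotypes (tuples, same locus order as mixture)
    """
    n = len(database)
    if n == 0:
        raise ValueError("database is empty")
    # A haplotype is "not excluded" if every allele appears in the mixture.
    m = sum(
        1 for hap in database
        if all(allele in locus for allele, locus in zip(hap, mixture))
    )
    pi = m / n
    if m == 0:
        upper = 1.0 - alpha ** (1.0 / n)
    else:
        upper = min(1.0, pi + z * math.sqrt(pi * (1.0 - pi) / n))
    return m, pi, upper, 1.0 - upper


# Three-locus mixture profile 13/15, 8/10, 22/25 and a toy database.
mixture = [(13, 15), (8, 10), (22, 25)]
database = [(13, 8, 22), (15, 10, 25), (14, 8, 22), (13, 9, 25), (16, 11, 23)]

m, pi, pi_upper, pe = mixture_inclusion(mixture, database)
print(m, pi, round(pi_upper, 4), round(pe, 4))
```

A real database would have thousands of entries; the toy size only keeps the arithmetic easy to follow.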

Mixture: Likelihood Ratio

• Four scenarios for two contributor sample in example:

• Hp --- S1 and S2 are source

• Hd --- S1 and unknown are source (same as PI)

• Hd --- S2 and unknown are source (same as PI)

• Hd --- two unknowns are source

• Assume three loci, two alleles at each locus, two male suspects

• Total alleles the same as in evidence

• Equal contribution

• 8 possible haplotypes


PE/PI

13/15, 8/10, 22/25 --- 3 locus profile

13 8 22 --- haplotype 1
15 8 22 --- haplotype 2
13 8 25 --- haplotype 3
15 8 25 --- haplotype 4
13 10 22 --- haplotype 5
15 10 22 --- haplotype 6
13 10 25 --- haplotype 7
15 10 25 --- haplotype 8

All haplotypes are included

PE/PI

All possible haplotypes (1 to 8) from the 13/15, 8/10, 22/25 profile are included

But certain haplotype pairs cannot explain the evidence:

haplotype 1 + haplotype 2
haplotype 1 + haplotype 3
haplotype 1 + haplotype 4
haplotype 1 + haplotype 5
and so on

PE/PI

All possible haplotypes (1 to 8) from the 13/15, 8/10, 22/25 profile are included

Only certain haplotype pairs can explain the evidence:

haplotype 1 + haplotype 8
haplotype 2 + haplotype 7
haplotype 3 + haplotype 6
haplotype 4 + haplotype 5

$\mathrm{LR} = \dfrac{1}{2\,[\Pr(H_1)\Pr(H_8) + \Pr(H_2)\Pr(H_7) + \Pr(H_3)\Pr(H_6) + \Pr(H_4)\Pr(H_5)]}$
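The eight haplotypes and the four pairs that explain the three-locus mixture can be enumerated mechanically. In the sketch below, the haplotype frequencies passed to the likelihood ratio are placeholders (0.01 each), since individual haplotype frequencies cannot be estimated reliably in practice; the function names are illustrative.

```python
from itertools import combinations, product

mixture = [(13, 15), (8, 10), (22, 25)]  # the 3 locus profile

# Every haplotype that can be built from the mixture alleles: 2^3 = 8.
haplotypes = list(product(*mixture))


def explains(h1, h2, mixture):
    """True if the two haplotypes together show exactly the mixture alleles."""
    return all({a, b} == set(locus) for a, b, locus in zip(h1, h2, mixture))


pairs = [
    (h1, h2) for h1, h2 in combinations(haplotypes, 2)
    if explains(h1, h2, mixture)
]


def likelihood_ratio(pairs, freq):
    """LR = 1 / (2 * sum of Pr(Ha) * Pr(Hb)) over the explaining pairs."""
    return 1.0 / (2.0 * sum(freq[a] * freq[b] for a, b in pairs))


# Placeholder frequencies, for illustration only.
freq = {h: 0.01 for h in haplotypes}
lr = likelihood_ratio(pairs, freq)
print(len(haplotypes), len(pairs), lr)
```

Each explaining pair is complementary at every locus, which is why only 4 of the 28 possible pairs survive.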


• Technically correct

• Cannot estimate individual haplotype frequencies

• $2^{17}$ possible haplotypes (although not all combinations can explain the evidence)

• Assuming independence is not correct

• Cannot place types in database, most never seen, too many

Mixture: LR

Use the same logic as PI for the denominator in the LR

E = excluded; NE = not excluded

Haplotypes fall into either category

E/E and E/NE pairs cannot explain the evidence

Only NE/NE pairs can explain the evidence, and only a subset of these fit

m* = those NE/NE pairs that explain the evidence

Take m*/n(n-1) and use its upper CI as the denominator

Mixture: LR

$\mathrm{LR} = \dfrac{1}{\mathrm{UCI}\left[\, m^{*} / n(n-1) \,\right]}$

• The denominator is the PI with an assumed number of contributors

• Makes better use of data
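The count m* can be obtained by scanning ordered pairs of database haplotypes, which matches the n(n-1) denominator. The sketch below stops at the point estimate m*/n(n-1): how the upper confidence bound on this proportion should be computed is not specified here, so none is assumed. The database is a toy example and the function names are illustrative.

```python
from itertools import permutations


def explains(h1, h2, mixture):
    """True if the two haplotypes together show exactly the mixture alleles."""
    return all({a, b} == set(locus) for a, b, locus in zip(h1, h2, mixture))


def pair_proportion(mixture, database):
    """Return (m*, n(n-1), m*/n(n-1)) over ordered pairs of database entries."""
    n = len(database)
    if n < 2:
        raise ValueError("need at least two database haplotypes")
    m_star = sum(
        1 for h1, h2 in permutations(database, 2)
        if explains(h1, h2, mixture)
    )
    total = n * (n - 1)
    return m_star, total, m_star / total


mixture = [(13, 15), (8, 10), (22, 25)]
database = [(13, 8, 22), (15, 10, 25), (15, 8, 22), (13, 10, 25), (14, 9, 23)]

m_star, total, proportion = pair_proportion(mixture, database)
print(m_star, total, proportion)
```

The scan is quadratic in the database size, so for databases of several thousand haplotypes it would be worth first discarding every excluded haplotype and pairing only the remainder.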

Online available Y-STR haplotype reference databases/ Commercial Kits

To calculate/search matches - haplotype frequencies

Applied Biosystems Yfiler: http://www.appliedbiosystems.com/yfilerdatabase/

Promega PowerPlex Y: http://www.promega.com/techserv/tools/pplexy/default.htm

AB Yfiler

Haplotype data can be input manually or through file upload

There can be no matches when testing this many loci and this size database

So using our formula from before and an α = 0.05

$1 - \alpha^{1/N}$

$1 - (0.05)^{1/3561} = 0.00084$

But would not use total data set, breakdown by population

If there is a match in the database

$\mathrm{CI} = p \pm 1.96\sqrt{p(1-p)/N}$

What p…? What N…?

As before, look at match number and sample size:

Upper bound $= 0.0006 + 1.96\sqrt{0.0006(0.9994)/3561} = 0.00140$

In practice, use population group in which the match was found (and other populations appropriate):

Upper bound $= 0.0016 + 1.96\sqrt{0.0016(0.9984)/1276} = 0.00379$
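Both worked upper bounds can be reproduced with a few lines. The sketch below takes the rounded point estimates exactly as quoted (p = 0.0006 with N = 3561, and p = 0.0016 with N = 1276) rather than recomputing p from match counts, which are not shown; the function name is illustrative.

```python
import math


def upper_bound_observed(p, n, z=1.96):
    """Upper limit of the normal-approximation interval p + z*sqrt(p(1-p)/N)."""
    if n <= 0 or not 0.0 <= p <= 1.0:
        raise ValueError("need n > 0 and 0 <= p <= 1")
    return p + z * math.sqrt(p * (1.0 - p) / n)


total_db = upper_bound_observed(0.0006, 3561)   # whole data set
one_group = upper_bound_observed(0.0016, 1276)  # single population group
print(f"{total_db:.5f} {one_group:.5f}")
```

The smaller population-specific database gives the larger upper bound, illustrating that database size influences the reported figure as much as the count does.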

One issue to consider is

• Expressing results when database sizes are different

• 1 in 70

• 1 in 700

• May have nothing to do with variation

• But may be exploited incorrectly by some

Recap Comments on Lineage Markers

• Y STRs are paternally inherited

• Barring mutations, all paternally related persons will have the same haplotype

• Different markers on the Y chromosome are genetically linked with no recombination

• Consequently, Y STR data are treated as a single haplotype; frequencies are NOT multiplied across markers

Comments on Lineage Markers (Contd.)

• Counting method is one approach that captures the genetic information

• Stated ethnicity of individuals does not necessarily reflect patrilineal ancestry

• e.g., mtDNA of Hispanics may be almost entirely of Native American descent, while for the autosomal STRs, only 30-50% of their genes are of Native American descent

• Thus, grouping of populations by ancestral lineage does not necessarily provide accurate frequency estimates of specific Y STR haplotypes in a forensic context

Summary Steps of Computations Step 1

• A convenience sample, with subjects included without prior knowledge of their DNA profiles, is an adequate database

• Subjects are recorded under broad population groupings by self-identified ethnicity

• Sample size of XXX to XXX individuals is generally adequate

• Haplotype counts from samples of genetically affine groups may be pooled to enhance the sample size

Calculation to Convey to the Court: Counting Method

• The counting method is very simple

• A Y STR haplotype (evidence sample) is compared to a reference database(s) of unrelated individuals

• The number of times the haplotype is observed in a database

• The size of a database can be and is often limited

• With databases (e.g., n = 100 to 3000), many possible haplotypes will not be observed and there will be sampling error

• A confidence interval can be placed on the observation

• Can convey with a high degree of confidence that the rarity of the evidence Y STR haplotype among unrelated individuals in a given population(s) is less than the upper bound of the estimate

• θ adjustment – see presentation during meeting
