
IBD · IFT6299 H2014 · UdeM · Miklós Csűrös

IDENTITY BY DESCENT

Consanguinity

between two related individuals, we rely on the identity of haplotypes by descent

[Figure 1, two panels: "Half-siblings: Chromosome 1" and "Fifth cousins: Chromosome 1", with IBD regions marked.]

Figure 1. Identity by descent (IBD) on chromosome 1 for half-siblings and fifth cousins. Chromosome 1 is approximately 250 Mb long and has a genetic length of approximately 280 cM. The common ancestors' copies of chromosome 1 are shown in various colors, and tan represents all other haplotypes. Regions of IBD are shown with black bars.

shared ancestry are approximately exponentially distributed with a mean of $100/m$ cM. Thus, for example, fifth cousins (see Figure 1) are separated by 12 meioses. On average, 0.05% of their genome, or approximately 1.5 cM (≈1.5 Mb), is identical by descent through their great-great-great-great grandmother. If they are full fifth cousins, they may also have IBD sharing through their great-great-great-great grandfather, which doubles the expected IBD proportion to 0.1% of their genome, or approximately 3 cM (≈3 Mb). However, fifth cousins usually have no detectable IBD sharing, and when they do have IBD sharing it is usually composed of a single IBD segment with a mean length of 8.3 cM (≈8 Mb). Extrapolating further, individuals who have shared ancestry through a certain common ancestor 25 generations ago (with 50 meioses of separation) almost always share none of their genome identical by descent through that ancestor, but if they do have an IBD segment through that ancestor it will have a mean length of 2 cM (≈2 Mb).

Any given pair of individuals is related through many common ancestors. For a pair of individuals on different continents, the relationships may be too distant to result in detectable IBD sharing. However, pairs of individuals from the same geographic region may have many recent common ancestors. Such individuals may, however, have only one or two detectable IBD segments, as many of the relationships have not resulted in any IBD sharing.

In a data set with N unrelated individuals, there are $N(N-1)/2$ pairs of individuals. Although any given pair has very little IBD sharing, the total amount of IBD sharing in the sample, and the total amount per individual, can be large.

Annu. Rev. Genet. 2012. 46:617-633 (www.annualreviews.org, Identity by Descent, p. 619)

Browning & Browning (2012)

IBD segments

centiMorgan: distance over which 0.01 recombinations occur, on average, in one generation

human genome: $10^6$ bp ≈ 1 cM

separation of m generations (meioses): proportion of the genome shared $1/2^{m-1}$

segment length: $100/m$ cM

⇒ fifth cousins ($m = 12$): share $3000\ \text{Mb}/2^{11} \approx 1.5$ cM on average through each great-great-great-great-grandparent, in segments of mean length 8.3 cM; no segment in common with probability ≥ 65%, because the expected number of shared segments over the two common ancestors is

$2 \cdot \dfrac{3000/2^{11}}{8.3} \approx 0.35$
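The arithmetic on this slide can be replayed with a minimal sketch. The function name `ibd_expectations` and the default genome length of 3000 cM are assumptions chosen to match the slide's rule of thumb ($10^6$ bp ≈ 1 cM, 3000 Mb genome); they are not taken from any library.

```python
def ibd_expectations(meioses, genome_cm=3000.0):
    """Expected IBD sharing through ONE common ancestor.

    meioses   -- number of meioses m separating the two individuals
    genome_cm -- assumed genome length in cM (3000 cM, as on the slide)
    """
    shared_fraction = 1.0 / 2 ** (meioses - 1)       # 1 / 2^(m-1)
    shared_cm = genome_cm * shared_fraction          # expected shared length
    mean_segment_cm = 100.0 / meioses                # mean segment length 100/m
    expected_segments = shared_cm / mean_segment_cm  # expected number of segments
    return shared_fraction, shared_cm, mean_segment_cm, expected_segments


# Fifth cousins: m = 12 meioses, two common ancestors for full cousins.
frac, cm, seg, nseg = ibd_expectations(12)
print(f"shared fraction per ancestor: {frac:.5%}")
print(f"shared length per ancestor:   {cm:.2f} cM")
print(f"mean segment length:          {seg:.1f} cM")
print(f"expected segments (2 anc.):   {2 * nseg:.2f}")
```

Since the probability of at least one segment cannot exceed the expected number of segments, the last value bounds the chance of any detectable sharing from above.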

Queen Victoria and Prince Albert

[Figure: pedigree of Queen Victoria and Prince Albert. Nodes are labelled with an initial and a year (for example V1819, A1819, V1786, E1784) and grouped by house: Hanover, Saxe-Coburg-Saalfeld, Saxe-Coburg & Gotha, Saxe-Gotha-Altenburg, Mecklenburg-Schwerin, Mecklenburg-Strelitz, Saxe-Meiningen, Brunswick-Wolfenbüttel, Reuss-Ebersdorf, Reuss, Hesse-Darmstadt, Hesse-Philippsthal, Anhalt-Zerbst, Schwarzburg-Rudolstadt, Saxe-Weissenfels, Oettingen-Oettingen, Holstein-Norburg.]

Csuros (2014)

Identity modes (Jacquard)

[Figure: the nine Jacquard IBD modes between the two alleles of individual A and the two alleles of individual B. Column labels: identity coefficient, IBD mode, A's genotype, B's genotype. Example genotype pairs (A's genotype, B's genotype) appearing in the figure: x/x, x/x; x/x, x/y; x/y, x/x; x/y, x/z; x/y, x/y; x/x, y/z; x/x, y/y; x/z, y/y; x/y, z/w.]

The system of equations (4) can be written in matrix form as

$$F \cdot \begin{pmatrix} \Delta_1 \\ \Delta_2 \\ \Delta_3 \\ \Delta_4 \\ \Delta_5 \\ \Delta_6 \\ \Delta_7 \\ \Delta_8 \\ \Delta_9 \end{pmatrix} = \begin{pmatrix} f_{1111} \\ f_{1101} \\ f_{0111} \\ f_{0101} \\ f_{1100} \\ f_{0011} \\ f_{0100} \\ f_{0001} \end{pmatrix}. \tag{5}$$

The matrix $F$ in Eq. (5) is

$$F = \begin{pmatrix}
a & a-b & a-b & a-b-c & a-b & a-b-c & a-b & a-b-c & a-b-c-d \\
0 & 0 & b & 2c & 0 & 0 & 0 & c & 2d \\
0 & 0 & 0 & 0 & b & 2c & 0 & c & 2d \\
0 & 0 & 0 & 0 & 0 & 0 & 2b & b & 4c-4d \\
0 & b & 0 & b-c & 0 & c & 0 & 0 & c-d \\
0 & b & 0 & c & 0 & b-c & 0 & 0 & c-d \\
0 & 0 & 0 & 0 & b & 2b-2c & 0 & b-c & 2b-4c+2d \\
0 & 0 & b & 2b-2c & 0 & 0 & 0 & b-c & 2b-4c+2d
\end{pmatrix} \tag{6}$$

with $a = \mu$, $b = \mu - \mu_2$, $c = \mu_2 - \mu_3$, $d = \mu_3 - \mu_4$.

1.3 Coancestry is not identifiable

The matrix Equation (5) links nine parameters ($\Delta_j$) to eight estimable quantities ($f_{xyzu}$), and the Jacquard coefficients are thus undetermined. The constraint $\sum_j \Delta_j = 1$ does not restrict the solution space because it is already encoded in the joint genotype distribution: let

$$u = \begin{pmatrix} 1 & \tfrac34 & \tfrac34 & \tfrac12 & \tfrac12 & \tfrac12 & \tfrac14 & \tfrac14 \end{pmatrix}.$$

By Equation (6),

$$uF = \begin{pmatrix} a & a & a & a & a & a & a & a & a \end{pmatrix}.$$

Using the Exome Variant Server's data, we can get an idea of the magnitude for $\mu, \mu_2, \mu_3, \mu_4$. Table 1 shows that second- and even higher-order moments cannot be ignored in the formulas, as the contribution from frequent SNPs keeps them high. Table 1 estimates that $\mu_2 \approx 0.26\mu$, $\mu_3 \approx 0.09\mu$ and $\mu_4 \approx 0.4\mu$.
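The identity $uF = (a, \ldots, a)$ stated in the excerpt can be checked mechanically. The sketch below is only a verification aid under one assumption of mine: each entry of $F$ is stored as the integer coefficients of its linear form in $(a, b, c, d)$, with a hypothetical helper `lin`. Exact fractions keep rounding out of the picture.

```python
from fractions import Fraction


def lin(a=0, b=0, c=0, d=0):
    """Coefficients of the linear form a*'a' + b*'b' + c*'c' + d*'d'."""
    return (Fraction(a), Fraction(b), Fraction(c), Fraction(d))


Z = lin()  # the zero entry
# Rows of F as in Eq. (6); lin(1, -1, -1) stands for a - b - c, and so on.
F = [
    [lin(1), lin(1, -1), lin(1, -1), lin(1, -1, -1), lin(1, -1),
     lin(1, -1, -1), lin(1, -1), lin(1, -1, -1), lin(1, -1, -1, -1)],
    [Z, Z, lin(b=1), lin(c=2), Z, Z, Z, lin(c=1), lin(d=2)],
    [Z, Z, Z, Z, lin(b=1), lin(c=2), Z, lin(c=1), lin(d=2)],
    [Z, Z, Z, Z, Z, Z, lin(b=2), lin(b=1), lin(c=4, d=-4)],
    [Z, lin(b=1), Z, lin(b=1, c=-1), Z, lin(c=1), Z, Z, lin(c=1, d=-1)],
    [Z, lin(b=1), Z, lin(c=1), Z, lin(b=1, c=-1), Z, Z, lin(c=1, d=-1)],
    [Z, Z, Z, Z, lin(b=1), lin(b=2, c=-2), Z, lin(b=1, c=-1),
     lin(b=2, c=-4, d=2)],
    [Z, Z, lin(b=1), lin(b=2, c=-2), Z, Z, Z, lin(b=1, c=-1),
     lin(b=2, c=-4, d=2)],
]
u = [Fraction(1), Fraction(3, 4), Fraction(3, 4), Fraction(1, 2),
     Fraction(1, 2), Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]


def row_times_matrix(row, matrix):
    """Left-multiply the matrix of linear forms by a row vector."""
    result = []
    for j in range(len(matrix[0])):
        entry = tuple(
            sum(row[i] * matrix[i][j][k] for i in range(len(matrix)))
            for k in range(4)
        )
        result.append(entry)
    return result


uF = row_times_matrix(u, F)
# Every column must reduce to the pure form 'a' (coefficients 1, 0, 0, 0).
print(all(entry == lin(1) for entry in uF))  # prints True
```

Because the all-ones row equals $uF/a$, it already lies in the row space of $F$, so appending the constraint $\sum_j \Delta_j = 1$ to the system adds no new equation.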

IBD → IBS

(identity by descent and identity by state)

© 2006 Nature Publishing Group

Unordered genotypes The probability of unordered genotypes does not require specifying which genotype belongs to which individual (for example, which is for the parent and which is for the child). By contrast, the probability of ordered genotypes requires this information.

For a single individual, the two alleles at a locus are either IBD or not IBD with probabilities $F$ and $(1 - F)$, respectively. In the first situation, the IBD alleles must be the same type, so the chance that they are both of type $A_i$ is the same as the chance that either of them is of that type; this is the population frequency $P_i$ of that allele. If two alleles at a locus are not IBD then they are independent and each has its own chance $P_i$ of being of type $A_i$. The probability (Pr) of a homozygote $A_iA_i$ is therefore $\Pr(A_iA_i) = F P_i + (1 - F) P_i^2$, and the corresponding result for a heterozygote $A_iA_j$, $i \neq j$, is $\Pr(A_iA_j) = 2(1 - F) P_i P_j$. The factor of 2 allows for each allele to be either maternal or paternal. The same logic leads to the joint probabilities of all seven possible pairs of unordered genotypes, which are shown in TABLE 2.

Distinguishing between relationships. In paternity testing, it is necessary to decide whether an individual is the father of a child or unrelated to the child. For remains identification, it is necessary to decide whether the remains are from a person with a specified relationship to a family member of a missing person. Although an absolute determination of relationship cannot be made, it is possible to find which of the competing putative relationships makes the observed genotypes most probable by using likelihood ratios, which compare the probabilities of the observed genotypes under alternative hypotheses about relationships. For non-inbred relatives, when only the three relationship coefficients are needed, and in the case in which the alternative is that the individuals are unrelated, the likelihood ratio has a simple form (ref. 4; Supplementary information S1 (box)).

Approaches based on likelihood ratios have been used since the earliest days of paternity testing. Here, the putative relationships are that the alleged father is indeed the father of a child or that he is unrelated to the child, and the likelihood ratio is called the paternity index. In a forensic setting, the relationship alternatives might be 'self' or 'unrelated': the suspect in a crime is either the source of a biological stain or is unrelated to the source of that stain.

More recently, a likelihood ratio expression was used (ref. 4) to identify remains from the World Trade Center; genotypes from tissue found at the site and from a family member of a missing person were examined for possible full-sibling or parent–offspring relationships. This approach considerably reduced the number of calculations that would have been necessary if all the possible relationships between a tissue sample and everyone who had lost a relative were considered. In practice it can be difficult to distinguish between full- and half-siblings, because loci with the same genotype are more common in full-siblings whereas loci with different genotypes are more common in half-siblings (ref. 5). Nevertheless, provided the two degrees of relationship that are being

Table 1 | Identity-by-descent probabilities for common, non-inbred relatives

Relationship            k2      k1      k0      θ = k1/4 + k2/2
Identical twins         1       0       0       1/2
Full-siblings           1/4     1/2     1/4     1/4
Parent–child            0       1       0       1/4
Double first cousins    1/16    3/8     9/16    1/8
Half-siblings*          0       1/2     1/2     1/8
First cousins           0       1/4     3/4     1/16
Unrelated               0       0       1       0

*Also grandparent–grandchild and avuncular (for example, uncle–niece). The table shows the three identity-by-descent probabilities (k0–2) and the coancestry coefficients (θ) for common relationships. Note that the coancestry coefficient for full-siblings and parent–child is the same (1/4), but that the pattern of allele sharing is different in each case (that is, there is a different set of k values). ki, the probability of sharing i number of identical-by-descent alleles (where i = 0–2; see also BOX 1; FIG. 1); θ, the coancestry coefficient of two individuals (equivalent to the inbreeding coefficient of their offspring).
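The last column of Table 1 follows from the other three through θ = k1/4 + k2/2. As a small consistency check (the dictionary name `TABLE1` and the function `coancestry` are illustrative names of mine), the rows can be recomputed with exact fractions:

```python
from fractions import Fraction as Fr


def coancestry(k2, k1):
    """Coancestry coefficient theta = k1/4 + k2/2 for non-inbred relatives."""
    return k1 / 4 + k2 / 2


# relationship -> (k2, k1, k0, theta as printed in Table 1)
TABLE1 = {
    "Identical twins":      (Fr(1),     Fr(0),    Fr(0),     Fr(1, 2)),
    "Full-siblings":        (Fr(1, 4),  Fr(1, 2), Fr(1, 4),  Fr(1, 4)),
    "Parent-child":         (Fr(0),     Fr(1),    Fr(0),     Fr(1, 4)),
    "Double first cousins": (Fr(1, 16), Fr(3, 8), Fr(9, 16), Fr(1, 8)),
    "Half-siblings":        (Fr(0),     Fr(1, 2), Fr(1, 2),  Fr(1, 8)),
    "First cousins":        (Fr(0),     Fr(1, 4), Fr(3, 4),  Fr(1, 16)),
    "Unrelated":            (Fr(0),     Fr(0),    Fr(1),     Fr(0)),
}

for name, (k2, k1, k0, theta) in TABLE1.items():
    # The three IBD probabilities form a distribution, and theta matches.
    ok = (k0 + k1 + k2 == 1) and (coancestry(k2, k1) == theta)
    print(f"{name:22s} theta = {coancestry(k2, k1)}  consistent: {ok}")
```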

Table 2 | Joint genotypic probabilities

Each row gives: genotypes; genotypic state; number of shared alleles; probability in general; probability assuming no inbreeding.

1. $A_iA_i, A_iA_i$; Hom/hom; 2
   General: $\Delta_1 P_i + (\Delta_2 + \Delta_3 + \Delta_5 + \Delta_7) P_i^2 + (\Delta_4 + \Delta_6 + \Delta_8) P_i^3 + \Delta_9 P_i^4$
   Non-inbred: $k_2 P_i^2 + k_1 P_i^3 + k_0 P_i^4$

2. $A_iA_i, A_jA_j$; Hom/hom; 0
   General: $\Delta_2 P_i P_j + \Delta_4 P_i P_j^2 + \Delta_6 P_i^2 P_j + \Delta_9 P_i^2 P_j^2$
   Non-inbred: $k_0 P_i^2 P_j^2$

3. $A_iA_i, A_iA_j$; Hom/het; 1
   General: $\Delta_3 P_i P_j + (2\Delta_4 + \Delta_8) P_i^2 P_j + 2\Delta_9 P_i^3 P_j$
   Non-inbred: $k_1 P_i^2 P_j + 2 k_0 P_i^3 P_j$

4. $A_iA_i, A_jA_m$; Hom/het; 0
   General: $2\Delta_4 P_i P_j P_m + 2\Delta_9 P_i^2 P_j P_m$
   Non-inbred: $2 k_0 P_i^2 P_j P_m$

5. $A_iA_j, A_iA_j$; Het/het; 2
   General: $2\Delta_7 P_i P_j + \Delta_8 P_i P_j (P_i + P_j) + 4\Delta_9 P_i^2 P_j^2$
   Non-inbred: $2 k_2 P_i P_j + k_1 P_i P_j (P_i + P_j) + 4 k_0 P_i^2 P_j^2$

6. $A_iA_j, A_iA_m$; Het/het; 1
   General: $\Delta_8 P_i P_j P_m + 4\Delta_9 P_i^2 P_j P_m$
   Non-inbred: $k_1 P_i P_j P_m + 4 k_0 P_i^2 P_j P_m$

7. $A_iA_j, A_mA_l$; Het/het; 0
   General: $4\Delta_9 P_i P_j P_m P_l$
   Non-inbred: $4 k_0 P_i P_j P_m P_l$

The table shows seven distinct patterns of genotypes that are possible for two unordered individuals, and the probabilities of these pairs of genotypes in general, or assuming no inbreeding. Two genotypes could be homozygous (hom) for the same or different alleles (rows 1 and 2), one could be homozygous and the other heterozygous (het) with one or zero shared alleles with the homozygote (rows 3 and 4), or both individuals could be heterozygous with two, one or zero shared alleles (rows 5–7). There are nine pairs of genotypes if the ordering of individuals is important (not shown), as the genotypes in rows 3 and 4 (one homozygote and one heterozygote) each have two orders. ki, the probability of sharing i number of alleles that are identical-by-descent (where i = 0–2; see also FIG. 1); P, allele frequency; $\Delta_{1\text{–}9}$, Jacquard coefficients, which are measures of identity-by-descent status (BOX 1; FIG. 1).
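To make the non-inbred column of Table 2 concrete, here is a minimal sketch for a biallelic locus, where only rows 1, 2, 3 and 5 can occur. The function name `joint_genotype_prob` and the coding of genotypes as allele counts (0, 1, 2) are assumptions of this example, not part of the source; the formulas are the ones listed in the table.

```python
def joint_genotype_prob(g_a, g_b, k, p):
    """P(A has genotype g_a and B has genotype g_b) for a non-inbred pair.

    g_a, g_b -- copies (0, 1 or 2) of the allele whose frequency is p
    k        -- (k0, k1, k2), probabilities of sharing 0, 1, 2 IBD alleles
    p        -- population frequency of the counted allele
    """
    k0, k1, k2 = k
    q = 1.0 - p
    hom_freq = {2: p, 0: q}  # frequency of the allele a homozygote carries
    if g_a == 1 and g_b == 1:
        # Row 5: both heterozygous for the same two alleles.
        return 2 * k2 * p * q + k1 * p * q * (p + q) + 4 * k0 * p**2 * q**2
    if g_a != 1 and g_b != 1:
        pi = hom_freq[g_a]
        if g_a == g_b:
            # Row 1: both homozygous for the same allele.
            return k2 * pi**2 + k1 * pi**3 + k0 * pi**4
        # Row 2: homozygous for different alleles.
        return k0 * pi**2 * hom_freq[g_b] ** 2
    # Row 3: one homozygote and one heterozygote (either order).
    pi = hom_freq[g_a if g_a != 1 else g_b]
    pj = 1.0 - pi
    return k1 * pi**2 * pj + 2 * k0 * pi**3 * pj


# Full siblings (k0, k1, k2) = (1/4, 1/2, 1/4) at a locus with p = 0.3.
full_sibs = (0.25, 0.5, 0.25)
total = sum(joint_genotype_prob(ga, gb, full_sibs, 0.3)
            for ga in (0, 1, 2) for gb in (0, 1, 2))
print(round(total, 12))
```

Summed over the nine ordered genotype pairs the probabilities add up to one, which is what an emission distribution needs.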

Nature Reviews Genetics, Volume 7, October 2006, p. 775 (Reviews: Focus on Statistical Analysis)

→ emission probabilities for an HMM with states $\Delta_i$

Weir et al. (2006)

Detecting IBD segments

Problems

I. Given 2 individuals, identify their exact degree of relatedness: kcoeffs, CARROT

II. Identify relatedness within a population: given n (diploid) genomes, identify the pairs of individuals with IBD segments: fastIBD, GERMLINE, ...

Quantification

inbreeding coefficient $\gamma$: probability of IBD between the 2 alleles of the same individual

iid model (allele A with frequency $p$, allele a with frequency $q = 1 - p$) + inbreeding: probabilities of unordered genotypes

$\phi(Aa) = 2(1 - \gamma)\,pq$

$\phi(AA) = p^2 \left(1 + \gamma \dfrac{q}{p}\right)$

$\phi(aa) = q^2 \left(1 + \gamma \dfrac{p}{q}\right)$

with $\gamma = 0$: Cotterman coefficients for the frequencies of IBD0, IBD1, IBD2
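The three genotype probabilities can be derived by conditioning on whether the individual's two alleles are IBD (probability γ) or independent draws (probability 1 − γ). The sketch below, with an illustrative function name `genotype_probs`, computes them that way and compares the result with the closed forms on the slide.

```python
def genotype_probs(p, gamma):
    """Unordered genotype probabilities with inbreeding coefficient gamma.

    With probability gamma the two alleles are IBD (one draw decides both);
    otherwise they are two independent draws with P(A) = p.
    """
    q = 1.0 - p
    return {
        "AA": gamma * p + (1.0 - gamma) * p * p,
        "Aa": 2.0 * (1.0 - gamma) * p * q,
        "aa": gamma * q + (1.0 - gamma) * q * q,
    }


p, gamma = 0.3, 0.1
q = 1.0 - p
probs = genotype_probs(p, gamma)

# Closed forms as written on the slide.
slide_AA = p * p * (1.0 + gamma * q / p)
slide_aa = q * q * (1.0 + gamma * p / q)

print(abs(probs["AA"] - slide_AA) < 1e-12)      # same value, two derivations
print(abs(probs["aa"] - slide_aa) < 1e-12)
print(abs(sum(probs.values()) - 1.0) < 1e-12)   # probabilities sum to one
```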

Lee's method

discordant homozygotes (D): (AA, aa) or (aa, AA)
concordant heterozygotes (C): (Aa, Aa)

probabilities without IBD: $d = P_D = 2p^2q^2$, $c = P_C = 4p^2q^2$

count only the C or D sites: $X_1, X_2, \ldots, X_n \in \{0, 1\}$, where $X_i = 1$ denotes a concordant heterozygote

define the count $N_C = \sum_{i=1}^{n} X_i$; then

$\mathbb{E} N_C = \dfrac{2n}{3}$; $\operatorname{Var} N_C = \dfrac{2n}{9}$

⇒ hypothesis test: $\dfrac{N_C}{n} \sim N\!\left(\dfrac{2}{3}, \dfrac{\sqrt{2}}{3\sqrt{n}}\right)$ (mean, standard deviation) if the individuals are independent

with IBD, $\mathbb{E} N_C > 2n/3$;
with different populations (no IBD, different $p$), $\mathbb{E} N_C < 2n/3$

Lee, Annals of Human Genetics, 67:618 (2003)
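The test above can be sketched in a few lines. Genotypes are assumed to be coded as the number of copies (0, 1, 2) of allele A at each site, and the function name `lee_statistic` is a label of mine for this illustration, not the name of any published tool. The z-score uses the mean 2/3 and variance 2/(9n) given on the slide.

```python
import math


def lee_statistic(genotypes_a, genotypes_b):
    """Count concordant heterozygotes (C) and discordant homozygotes (D).

    Returns (n_c, n_d, ratio, z), where ratio = N_C / (N_C + N_D) and z is
    the normal score under the null hypothesis of no IBD, same population.
    """
    n_c = n_d = 0
    for x, y in zip(genotypes_a, genotypes_b):
        if x == 1 and y == 1:
            n_c += 1                      # (Aa, Aa)
        elif {x, y} == {0, 2}:
            n_d += 1                      # (AA, aa) or (aa, AA)
    n = n_c + n_d                         # only C or D sites are informative
    if n == 0:
        raise ValueError("no concordant-het or discordant-hom sites")
    ratio = n_c / n
    z = (ratio - 2.0 / 3.0) / math.sqrt(2.0 / (9.0 * n))
    return n_c, n_d, ratio, z


# Toy data, far too few sites for a real test; shown only for the mechanics.
a = [1, 1, 0, 2, 1, 0]
b = [1, 1, 2, 0, 1, 1]
n_c, n_d, ratio, z = lee_statistic(a, b)
print(n_c, n_d, round(ratio, 3), round(z, 3))
```

A large positive z points towards IBD sharing, a large negative z towards individuals drawn from populations with different allele frequencies.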

kcoeff

joint distribution of IBS2*_ratio $= \dfrac{N_C}{N_C + N_D}$ and $\dfrac{N_C + N_D}{N_C + N_D + N_{\text{other}}}$

no IBS2*_ratio values >2/3 (e.g. 0.70). Surprisingly, we observed 25 data points with values >0.70 that potentially corresponded to familial relationships (Table 1 includes a subset of 16 of these pairwise comparisons for which we obtained evidence of familial relationships, as discussed below; 6 other relationships in the table with IBS2*_ratio values <0.70 are described below). The CAU group included a pair of identical samples (Figure 1A arrow, corresponding to NA17255/NA17263). The IBS2*_ratio value was near 1.0 for this pairwise comparison, as expected for identical samples that lack essentially all IBS0 calls. This relationship is supported by plotting IBS for each chromosomal position across all autosomes using SNPduo software [15], a program that performs pairwise comparisons of SNP genotype data and plots IBS (as well as genotypes) for one chromosome or the entire genome. This revealed a predominant pattern of IBS2 as shown for chromosome 2 (Figure 2A). Typical of other genetically identical samples analyzed with low genotyping error rates, these two individuals shared only 11 IBS0 calls and 6,410 IBS1 calls in contrast to 838,898 IBS2 calls from autosomal loci. The samples were annotated by the Human Genetic Cell Repository as a 6

Figure 1. Genetic relatedness plots of the Human Variation Panel genotype data. Abbreviations: AA, African American; CAU, Caucasian; CHI, Chinese; MEX, Mexican. (A) IBS2* plot of the within-group comparisons (n = 19,800). The IBS2*_ratio values are centered on 2/3 for unrelated individuals within a population. The relationship of NA17251 to 99 other AA individuals is indicated (arrow). A group of 9 MEX individuals have atypically low heterozygosity rates and form a cluster separated from other within-MEX comparisons (arrow 1). (B) IBS2* plot in which pairwise comparisons with IBS2*_ratio values >0.8 are removed (n = 13) and data points are colored by the sum of autosomal heterozygosity of each pair of individuals. (C) IBS2* plot for between-group comparisons (n = 60,000) for which none are expected to be genetically related. For groups having individuals with large differences in heterozygosity rates, such as AA-CHI comparisons, the IBS2*_ratio values are significantly lower than 2/3. The MEX individuals with atypical heterozygosity rates tend to form outlier clusters in between-group comparisons such as AA-MEX (arrow 1) and CHI-MEX (arrow 2). A group of five pairwise comparisons having relatively high IBS2*_ratio values (0.685 to 0.692; arrow 3) involve MEX individual NA17709 in comparison to CAU individuals.
doi:10.1371/journal.pgen.1002287.g001

Identity-by-Descent and Identity-by-State

PLoS Genetics | www.plosgenetics.org 3 September 2011 | Volume 7 | Issue 9 | e1002287

Homozygosity and Distant IBD

Our kcoeff IBD method was robust for inferring relationships with an estimated K1 ≥ 0.025. We previously established a method for comparing regions of homozygosity in offspring to possible regions of IBD1 between the parents indicating when the homozygosity is due to autozygosity [25]. We modified this approach to include the minimum regions of homozygosity ≥2 Mb and ≥400 SNPs. Copy number information was not used to discriminate those ROH that result from a hemizygous deletion. A ROH in a child overlapping a region of IBD1 between the parents is evidence of inbreeding (as given in Table 2). Since parents were available for a small percentage of individuals, the majority of the ROH reported in Tables S2–3 could be due to a hemizygous deletion or autozygosity.

Reconstruction of Pedigrees

Inferring the degree of relationship allows for a potential classification of the type of relationship. For example, a pair of individuals inferred to be second-degree relatives could be inferred to be half-siblings, as opposed to grandparent-grandchild or avuncular. We present a method for reconstructing second-degree and third-degree relationships based on multiple pairwise comparisons. This approach requires specific information based on how alleles are shared. We provide five scenarios (as seen in Table 3) for classifying second-degree relationships: Scenario 1, inferring an avuncular (AV) relationship to two half-siblings (HS); Scenario 2, inferring an AV relationship to two full-siblings (FS); Scenario 3, inferring HS; Scenario 4, inferring a third or fourth-degree relationship; and Scenario 5, ruling out specific types of relationships. These methods are described in detail in the supporting information as well as Figures S5–11 and Table S4. The majority of this method was applied to the MKK population and a section of the reconstructed pedigree is presented in Figure 3. The full pedigree is contained in Figure S3 and links all relationships with a K1 value greater than 0.20. Note that some of the relationships are indicated by the estimated degree of relationship as full reconstruction of relationship type is not possible without more information.

Table 3. Cont.

IID1      IID2      Group   k0       k1       k2       Inferred   Annotated   Reason   Comments
NA21678   NA21519   MKK     0.7294   0.2706   0.0000   3°         2°          Pemberton et al. were conservative

Previously designated relationships for annotated pairs are reassigned based on pedigree reconstruction methods or IBD analysis. Note that certain relationships annotated correctly by previous studies (and Pemberton et al. 2010 [19]) are included because of the addition of further information. For example, NA12874 and NA12865 were correctly assigned a parent-child relationship but we amend that to parent-child_IBD0 based on the presence of apparent IBD0 between them. Scenarios used to prove or rule out a relationship type are provided in the Supplemental Method File. Abbreviations used: Inferred, our annotation for a given pairwise comparison; PO, parent-child; AV, avuncular; GG, grandparent-grandchild; HS, half-sibling.
doi:10.1371/journal.pone.0049575.t003

Figure 3. Reconstruction of a partial pedigree from the MKK group. We analyzed MKK genotype data using IBD analysis and inferred the familial relationships of 61 individuals with 46 being related to at least 1 other person. This graph contains relationships constructed from second-degree, full-sibling, parent-child, and identical relationships (with the exception of NA21352 and NA21351 who are inferred to be first-cousins based on their second-degree relationship to NA21414; see top left of figure). All indicated relationships are based on previous analysis (siblings: thick green lines), previous annotation (family trios; family ID), and inferred analyses (sibling relationships, thick blue lines; corrected parent-child orientation, thick red lines; corrections made to annotated relationships, thick yellow lines; other familial relationships, thin black lines). Dashed rectangles indicate family units annotated by the HapMap project at the Coriell website. F indicates family identifier (e.g. F2654). Individual identifiers are shown as the last three digits of NA21xxx (e.g. 353 at the upper left of the figure corresponds to individual NA21353). All IBD information is given in Table S1. Note that several individuals who are part of MKK (e.g. NA12310 in family 2566) and for whom cell lines were created did not have SNP data as part of the HapMap Phase III release.
doi:10.1371/journal.pone.0049575.g003

Relatedness and Inbreeding in HapMap

PLOS ONE | www.plosone.org 9 November 2012 | Volume 7 | Issue 11 | e49575

Stevens & al PLoS Genetics, 7 :e1002287 (2011)

Page 11: IDENTITY BY DESCENT

Fine-scale segments


S.Kyriazopoulou-Panagiotopoulou et al.

fraction of the reference population that has the same haplotype as A in that region.

The use of phased data enables CARROT to assign different likelihoods to relationships that are indistinguishable by previous methods: in the case of an aunt–niece pair, for instance, segments inherited from the common ancestors can lie on either haplotype of the aunt, but must lie on the same haplotype of the niece. This haplotype model, combined with linkage information captured in the haplo-frequencies, gives CARROT a significant advantage in differentiating between rotated relationships compared with the existing methods.

2.2 HMMs and factorial HMMs

An HMM is a probabilistic model for capturing the dependencies of a sequence of observations $G_1, G_2, \dots, G_M$ on a chain of unknown (or hidden) variables $S_1, S_2, \dots, S_M$ taken from a set $\mathcal{S}$. An HMM makes the following conditional independence assumptions: first, given $S_k$, $G_k$ is conditionally independent of all other observations and hidden states, that is, $P(G_k \mid S_1,\dots,S_k,G_1,\dots,G_{k-1}) = P(G_k \mid S_k)$. Second, given $S_{k-1}$, $S_k$ is conditionally independent of all previous hidden states, that is, $P(S_k \mid S_1,\dots,S_{k-1}) = P(S_k \mid S_{k-1})$. An HMM is, therefore, defined by a set of transition probabilities $P(S_k \mid S_{k-1})$, a set of emission probabilities $P(G_k \mid S_k)$, and a probability distribution over the initial states.

Often, we want to infer the value of the hidden variables from the observed variables. The posterior probability $P(S_i \mid G)$ can be computed using the forward–backward algorithm in time $O(M|\mathcal{S}|^2)$ (Rabiner and Juang, 1986), where $|\mathcal{S}|$ is the number of values in $\mathcal{S}$.

In a factorial HMM (Ghahramani and Jordan, 1997), the observation at position k depends on multiple hidden variables, $S_k^1, S_k^2, \dots, S_k^T$, which are assumed to evolve independently, that is:

$$P\big(S_k = (s_k^1, s_k^2, \dots, s_k^T) \mid S_{k-1} = (s_{k-1}^1, s_{k-1}^2, \dots, s_{k-1}^T)\big) = \prod_{t=1}^{T} P\big(S_k^t = s_k^t \mid S_{k-1}^t = s_{k-1}^t\big)$$

A factorial HMM where each hidden variable $S_k^t$ takes values from the set $\mathcal{S}$ is equivalent to an HMM with hidden variables taking values from the Cartesian product $\mathcal{S}^T$. Using the latter representation, running the forward–backward algorithm on a factorial HMM requires $O(M|\mathcal{S}|^{2T})$ time. However, by taking advantage of the independence assumptions for the hidden variables, the forward–backward algorithm can be modified to run in $O(MT|\mathcal{S}|^{T+1})$ time, which is a significant improvement when the number of hidden variables T is large.
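The equivalence between a factorial HMM and an ordinary HMM over the Cartesian product of the chain state sets can be illustrated with a small sketch. The function name `joint_transition` and the two example chains are hypothetical, chosen only for illustration; the code simply multiplies per-chain transition probabilities, as in the factorization above.

```python
from itertools import product

def joint_transition(chains):
    """Build the joint transition table of a factorial HMM.

    chains[t][i][j] is P(S^t_k = j | S^t_{k-1} = i) for chain t.
    Returns (states, table), where table[(prev, nxt)] is the probability
    of moving from joint state prev to joint state nxt.
    """
    sizes = [len(matrix) for matrix in chains]
    # Joint states are tuples with one coordinate per chain.
    states = list(product(*(range(n) for n in sizes)))
    table = {}
    for prev in states:
        for nxt in states:
            p = 1.0
            # Chains evolve independently, so the probabilities multiply.
            for t, matrix in enumerate(chains):
                p *= matrix[prev[t]][nxt[t]]
            table[(prev, nxt)] = p
    return states, table

# Two hypothetical binary chains (values are illustrative only).
chain_a = [[0.9, 0.1], [0.2, 0.8]]
chain_b = [[0.7, 0.3], [0.4, 0.6]]
states, table = joint_transition([chain_a, chain_b])
```

With T chains over the same state set, the explicit table has $|\mathcal{S}|^{2T}$ entries, which is why the text prefers to exploit the factorization rather than materialize the table.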

2.3 Likelihood computation assuming linkage equilibrium

We want to infer the relationship between two individuals A and B, genotyped at a set of M unlinked SNPs. Let $H_{A0}, H_{A1}, H_{B0}, H_{B1} \in \{A,C,G,T\}^M$ be the two haplotypes of A and B, respectively, $G_A = (H_{A0}, H_{A1})$, $G_B = (H_{B0}, H_{B1})$ be their ordered, or phased, genotypes, and $\theta_k$ be the probability of recombination between SNPs k and k+1. Throughout this work, we assume that $\theta_k$ is the same for both sexes.

Let $\mathcal{R}$ be a set of putative relationships for individuals A and B. For any relationship $R \in \mathcal{R}$, we want to compute the likelihood of R, or the probability of the observed genotypes under the assumption that the true relationship between A and B is R, $L_R = P(G_A, G_B \mid R)$. As noted in Skare et al. (2009), assuming that A and B are not inbred, their relationship must fall into exactly one of the following categories:

(1) A and B share exactly 2 MRCAs (e.g. full siblings, first cousins);

(2) A and B share exactly 1 MRCA (e.g. half siblings, half cousins); and

(3) A is the ancestor of B or vice versa.

We call relationships $R_1$ and $R_2$ rotated if $R_1$ and $R_2$ are in the same relationship category and the total number of meioses between the two individuals is the same in $R_1$ and $R_2$. Alternatively, we say that $R_1$ is a rotation of $R_2$.

Fig. 1. Pedigree for a pair of individuals with two common ancestors: individuals A and B share two MRCAs, C and D. There are genA generations between the MRCAs and A (i.e. genA + 1 meioses separating them) and genB generations between the MRCAs and B. The sex of the individuals is arbitrary.

We defined a set of HMMs for each of three relationship categories similarly to Stankovich et al. (2005) and Bercovici et al. (2010), who defined HMMs for cousins parameterized by the number of generations between them. Unlike these methods, the state space of our models does not increase with the number of generations of the pedigree. Below, we describe our HMMs for the first type of relationships. The models for the other two cases are derived along similar lines. Given that A and B have two MRCAs, C and D (Fig. 1), the hidden state at SNP k depends on the following binary variables:

(1) $m_C(k)$ and $m_D(k)$ indicate whether C and D, respectively, passed the same allele to their immediate descendants $E_1$ and $F_1$. For example, if both $E_1$ and $F_1$ inherited the maternal allele of C at position k, then $m_C(k) = 1$. If $E_1$ received the maternal allele of C, and $F_1$ received the paternal allele of C, then $m_C(k) = 0$.

(2) $m_{E_1}(k)$ and $m_{F_1}(k)$ indicate whether $E_1$ and $F_1$ passed to $E_2$ and $F_2$, respectively, the allele of C and not the allele of D.

(3) $d_A(k)$ takes the value 0 if A inherited the allele that $E_2$ got from $E_1$ (which came from either C or D) and the value 1 otherwise. That is, $d_A(k) = 0$ if for all $i > 2$, $E_i$ got from $E_{i-1}$ the allele of $E_{i-2}$ and not the allele of $G_{i-2}$. If $d_A(k) = 1$, we will say that there were off-chain donations in the lineage of A at position k. $d_B(k)$ is defined in an analogous way for the lineage of B.

(4) $p_A(k)$ indicates which of the alleles of A, $H_{A0}(k)$ or $H_{A1}(k)$, comes from $E_{gen_A}$ and is used to capture phasing errors. $p_B(k)$ is defined in an analogous way.

Each of these variables refers to a different set of meioses in the pedigree, therefore they all evolve independently from each other. We thus model the process of generating the genotypes $G_A$ and $G_B$ as a factorial HMM with hidden state $s(k) = (m_C(k), m_D(k), m_{E_1}(k), m_{F_1}(k), d_A(k), d_B(k), p_A(k), p_B(k))$.


HMM whose state [no inbreeding] is a vector of 8 indicators:
mC ∈ {0,1}: E1 and F1 inherit the same haplotype from parent C; likewise mD
mE ∈ {0,1}: E2 receives the allele of C; mF (receives from D)
dE = 0 if E1 → E2 → · · · → Ek
pA: phase of A/EgenA

Transitions: coordinate by coordinate, independently!

Kyriazopoulou-Panagiotopoulou & al ISMB 2011

Page 12: IDENTITY BY DESCENT

CARROT


Table 1. Transition probabilities for the HMMs with two MRCAs.

Variable | Pr(0→0) | Pr(1→0)
$m_C$ | $\theta_k^2 + (1-\theta_k)^2$ | $2\theta_k(1-\theta_k)$
$m_{E_1}$ | $1-\theta_k$ | $\theta_k$
$d_A$ | $(1-\theta_k)^{gen_A-1}$ | $\Big[\sum_{n=1}^{gen_A-1} \binom{gen_A-1}{n} \theta_k^n (1-\theta_k)^{gen_A-1-n}\Big] / \big(2^{gen_A-1}-1\big)$
$p_A$ | $1-\epsilon$ | $\epsilon$

Pr(i→j) is the probability that a variable transitions from state i at SNP k to state j at SNP k+1, genA and genB are the generations between the MRCAs and each of A and B (Fig. 1), $\theta_k$ is the recombination probability between SNPs k and k+1; and $\epsilon$ is the probability of a phasing error. The transition probabilities for the variables $m_D$, $m_{F_1}$, $p_B$ and $d_B$ are derived similarly.

The transition probabilities for all the variables are shown in Table 1. We now derive the transition probabilities for $d_A$. Let genA be the number of generations between the MRCAs and A (Fig. 1). The number of meioses between $E_2$ and A is $gen_A - 1$. Assume that $d_A(k) = 0$, that is, the allele that $E_2$ inherited from the MRCAs at locus k was passed down to A. Any recombination between loci k and k+1 would result in $d_A(k+1) = 1$, therefore $P(d_A(k+1) = 0 \mid d_A(k) = 0) = (1-\theta_k)^{gen_A-1}$. If $d_A(k) = 1$, then there exists at least one off-chain donation in the $gen_A - 1$ meioses between $E_2$ and A. The probability that there are exactly n off-chain donations between $E_2$ and A is $\binom{gen_A-1}{n}(1/2)^{gen_A-1}$. Given that there are exactly n off-chain donations at SNP k, the probability that there are no off-chain donations at SNP k+1 is $\theta_k^n (1-\theta_k)^{gen_A-1-n}$. Therefore:

$$P(d_A(k+1)=0 \mid d_A(k)=1) = \frac{P(d_A(k+1)=0 \text{ and } d_A(k)=1)}{P(d_A(k)=1)}$$

$$= \frac{\sum_{n=1}^{gen_A-1} \theta_k^n (1-\theta_k)^{gen_A-1-n} \binom{gen_A-1}{n} (1/2)^{gen_A-1}}{1-(1/2)^{gen_A-1}}$$

$$= \frac{1}{2^{gen_A-1}-1} \sum_{n=1}^{gen_A-1} \binom{gen_A-1}{n} \theta_k^n (1-\theta_k)^{gen_A-1-n}$$
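The two transition probabilities of the off-chain indicator can be evaluated numerically with a minimal sketch such as the one below. The function names and the argument names `theta` (recombination probability between adjacent SNPs) and `gen` (generations between the MRCAs and the individual) are hypothetical labels; the formulas are the ones derived above.

```python
from math import comb

def d_stay_on_chain(theta, gen):
    """Pr(0 -> 0): no recombination in any of the gen - 1 meioses."""
    return (1.0 - theta) ** (gen - 1)

def d_return_to_chain(theta, gen):
    """Pr(1 -> 0): average over the number n of off-chain donations."""
    m = gen - 1  # meioses between E2 and the individual
    if m < 1:
        raise ValueError("state 1 is only reachable when gen >= 2")
    total = sum(
        comb(m, n) * theta**n * (1.0 - theta) ** (m - n)
        for n in range(1, m + 1)
    )
    return total / (2**m - 1)

p_stay = d_stay_on_chain(0.01, 3)
p_back = d_return_to_chain(0.01, 3)
```

By the binomial theorem the sum equals 1 − (1 − theta)^(gen − 1), which gives a cheap cross-check of the summation.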

Given s(k), we can determine the IBD status at SNP k and use population allele frequencies to compute the emission probabilities (Epstein et al., 2000). To account for genotyping errors, let $\delta$ be the probability of a genotyping error, and $f_\delta(x,y)$ be the probability that allele x is genotyped as y:

$$f_\delta(x,y) = \begin{cases} 1-\delta & \text{if } x = y \\ \delta & \text{if } x \neq y \end{cases}$$

Then:

$$P(H_{A0}(k)=a, H_{B0}(k)=b \mid H_{A0} \text{ and } H_{B0} \text{ are IBD}) = \sum_{c \in \{A,C,G,T\}} q_c\, f_\delta(c,a)\, f_\delta(c,b)$$

where $q_c$ is the frequency of allele c in the reference population. The rest of the emission probabilities are adjusted in a similar way.
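As an illustration, the IBD emission probability with genotyping errors can be computed directly from its definition. This is only a sketch: the names `f_err`, `ibd_emission`, `freqs` and `err`, as well as the allele frequencies used, are assumptions made for the example.

```python
ALLELES = "ACGT"

def f_err(x, y, err):
    """Probability that true allele x is genotyped as y (error rate err)."""
    return 1.0 - err if x == y else err

def ibd_emission(a, b, freqs, err):
    """P(observe a on one haplotype and b on the other | the two are IBD).

    Sums over the unknown shared ancestral allele c, weighted by its
    population frequency freqs[c].
    """
    return sum(freqs[c] * f_err(c, a, err) * f_err(c, b, err) for c in ALLELES)

# Hypothetical allele frequencies at a biallelic A/C site.
freqs = {"A": 0.6, "C": 0.4, "G": 0.0, "T": 0.0}
p_same = ibd_emission("A", "A", freqs, 0.01)
p_diff = ibd_emission("A", "C", freqs, 0.01)
```

With an error rate of zero the expression reduces to the allele frequency when the two observed alleles agree and to zero when they differ.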

2.4 Relationship notation

We refer to relationship types using the notation (mrcas, genA, genB). The variable mrcas is 2 when individuals A and B share two MRCAs and 1 otherwise. Unless A is the ancestor of B or vice versa, genA and genB are the number of generations between the MRCA(s) and A and B, respectively. If A is the ancestor of B, then genA is set to −1, and genB is the number of generations between A and B. For close relationships, we prefer to use the usual verbal description, unless space is limited. Table 2 shows the numerical notation for some common relationships.

Table 2. Numerical notation for some common relationships

Degree | Relationship | (mrcas, genA, genB)
1 | Full siblings | (2, 0, 0)
1 | Parent–child | (1, −1, 0)
2 | Half siblings | (1, 0, 0)
2 | Aunt–niece | (2, 0, 1)
2 | Avuncular | (2, 0, 1) or (2, 1, 0)
2 | Grandparent–grandchild | (1, −1, 1)
3 | First cousins | (2, 1, 1)
3 | Great grandparent–grandchild | (1, −1, 2)
3 | Great aunt–niece | (2, 0, 2)
3 | Half aunt–niece | (1, 0, 1)
4 | Half first cousins | (1, 1, 1)

For all relationships with genA ≠ genB, there is a corresponding symmetric relationship, for example (1, 0, −1) denotes a child–parent pair. Note that the term ‘avuncular’ does not specify a direction. The terms ‘aunt’ and ‘niece’ should be read as ‘aunt/uncle’ and ‘niece/nephew’, respectively.
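The degrees listed in Table 2 can be recovered from the numerical notation by counting meioses. The rule used below is not stated in the text: it is an assumption (total meioses, minus one when there are two MRCAs) that happens to reproduce every row of Table 2, and the function names are hypothetical.

```python
def meioses(gen_a, gen_b):
    """Meioses separating A and B through their MRCA(s).

    Each side contributes gen + 1 meioses (Fig. 1); gen = -1 marks the
    ancestor itself, which contributes none.
    """
    return (gen_a + 1) + (gen_b + 1)

def degree(mrcas, gen_a, gen_b):
    """Assumed rule: a second MRCA doubles the sharing, i.e. one degree closer."""
    return meioses(gen_a, gen_b) - (mrcas - 1)

TABLE_2 = {
    "full siblings": ((2, 0, 0), 1),
    "parent-child": ((1, -1, 0), 1),
    "half siblings": ((1, 0, 0), 2),
    "aunt-niece": ((2, 0, 1), 2),
    "grandparent-grandchild": ((1, -1, 1), 2),
    "first cousins": ((2, 1, 1), 3),
    "great grandparent-grandchild": ((1, -1, 2), 3),
    "great aunt-niece": ((2, 0, 2), 3),
    "half aunt-niece": ((1, 0, 1), 3),
    "half first cousins": ((1, 1, 1), 4),
}
computed = {name: degree(*code) for name, (code, _) in TABLE_2.items()}
```

Rotated relationships, such as (2, 0, 2) and (2, 1, 1), share both the category and the meiosis count, which is why they fall in the same degree.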

2.5 Incorporating linkage information

HMMs that use unlinked markers have limited power to distinguish between relationships of the same degree (Sun et al., 2002). Linkage information can help disambiguate such relationships. Assume, for instance, that we want to determine whether the relationship between individuals A and B is first cousins or great aunt–niece. An IBD block between A and B implies that they inherited overlapping genomic segments from their MRCAs (Fig. 2). If A and B are first cousins, the two scenarios of Figure 2 are equally likely. However, if A is closer to the MRCAs than B, then it is more likely that A inherited a larger segment from the common ancestor than B [scenario (a)], because we expect fewer recombinations between the MRCAs and A than between the MRCAs and B. Therefore, if we compare the haplotypes of A and B in a small window around the IBD transitions to the haplotypes of a reference population, we are more likely to find a match for the haplotype of A than for the haplotype of B.

To quantify this intuition, assume that there is a transition in IBD status between SNPs k and k+1, and let $H_i(k-w+1..k+w)$ be the haplotype of individual i at positions $k-w+1, k-w+2, \dots, k+w$, that is in a window of size 2w around k. The haplo-frequency of A is defined as:

$$C_A = \frac{\sum_i f_\delta\big(H_i(k-w+1..k+w),\, H_A(k-w+1..k+w)\big)}{N} \qquad (1)$$

where the sum is over all haplotypes in the reference population, N is the number of such haplotypes, and $f_\delta(H_i(k-w+1..k+w), H_A(k-w+1..k+w))$ is the probability that $H_i(k-w+1..k+w)$ is genotyped as


* emissions depend on sequencing base quality
* classification of segments (windows) by posterior probability of IBD


Table 3. Comparison between CARROT and two other approaches for relationship inference

Method/Degree | 1 | 2 | 3 | 4 | 5
Max likelihood | 100 | 98.6 | 62.29 | 39.33 | 26.73
CARROT likelihoods | 100 | 100 | 74 | 43.78 | 29
CARROT | 100 | 100 | 84 | 57.56 | 34.45

Max likelihood selects the relationship with the maximum likelihood; CARROT likelihoods uses only the likelihoods as features for the classification. We simulated 100 pairs of individuals from each relationship and performed predictions only within each degree. The number reported is the percentage of pairs classified in the correct relationship.

direct comparison between these methods and CARROT can only be done for a small set of relationships. To extend this comparison to additional relationships, we implemented a classifier that uses the HMMs of Section 2.3 for the likelihood computation and then selects the relationship with the maximum likelihood. This classifier can serve as a proxy for any method that maximizes the likelihood of unlinked markers. We compared the maximum-likelihood classifier with CARROT using a set of 100 simulated pairs of phased individuals for each relationship of degree up to five, including all rotated relationships. We assumed that the degree of the relationship was known, so predictions were made only within each degree. To assess whether a classification-based approach is better than a maximum-likelihood approach, we also ran CARROT using only the likelihoods as features. We observed that for relationships of degree up to two, the likelihoods are sufficient to differentiate between relationships (Table 3). For higher degrees, the likelihoods become less informative and the additional features of CARROT result in a significant increase in accuracy. Additionally, we notice that CARROT performs consistently better than the maximum-likelihood approach, even when we only use the likelihoods as features. Intuitively, the classifier can capture correlations between the likelihoods of different relationships. We observed, for instance, that when the true relationship is great grandparent–grandchild, the likelihood of the relationship great aunt/niece tends to be increased, but this effect is overlooked when we use the maximum likelihood criterion.

Differentiating between rotated relationships: to evaluate the ability of CARROT to distinguish between rotations of relationships, we simulated 100 pairs of individuals for each of the possible relationships of degree up to five, including all possible rotated relationships. We first assumed that the degree of each relationship was known, and ran CARROT separately for each degree. We started by examining the ideal case of perfect phasing, so we set the probability of phasing errors, $\epsilon$, to zero.

We assessed CARROT’s accuracy using 10-fold cross-validation, as described in the previous section: the simulated pairs were divided into 10 subsets each containing 10 pairs from each relationship. CARROT was trained on nine of the subsets and tested on the remaining subset. This process was repeated 10 times and the accuracy was averaged over all 10 runs. We defined the prediction accuracy as the number of pairs that were classified in the correct relationship.

Table 4. Classification accuracy of CARROT on third-degree relatives

Rel. | 2,0,2 | 2,1,1 | 2,2,0 | 1,−1,2 | 1,0,1 | 1,1,0 | 1,2,−1
2,0,2 | 88 | – | – | 5 | 5 | 2 | –
2,1,1 | – | 84 | 1 | – | 7 | 8 | –
2,2,0 | – | 1 | 88 | – | 2 | 4 | 5
1,−1,2 | 2 | – | – | 96 | 1 | 1 | –
1,0,1 | – | 11 | – | – | 68 | 21 | –
1,1,0 | – | 8 | – | – | 24 | 68 | –
1,2,−1 | – | – | 3 | – | 1 | 1 | 95

The value at row i and column j is the percentage of pairs of relationship i that were predicted to be of relationship j. (2, 0, 2): great aunt–niece; (2, 1, 1): first cousins; (2, 2, 0): great niece–aunt; (1, −1, 2): great grandparent–grandchild; (1, 0, 1): half aunt–niece; (1, 1, 0): half niece–aunt; (1, 2, −1): great grandchild–grandparent.

When run on first- and second-degree relatives, CARROT achieved perfect performance. The results for the third- and fourth-degree relationships are summarized in Tables 4 and 5, respectively. The average accuracy over all the pairs of the corresponding degree was 83.86 and 56.89%, respectively. For the fifth-degree relationships, the average accuracy was 34.45% (full results not shown). As expected, the accuracy of our classifiers drops as the degree of the relationship increases. However, even within the same degree, some relationships are much harder to predict correctly than others. For example, the two half-avuncular relationships, (1, 0, 1) and (1, 1, 0), are hard to differentiate from each other and from first cousins, since the difference in the distance of each of the individuals from their MRCA is not enough for the haplo-frequencies to distinguish them from a balanced relationship where both individuals are equally distant from the MRCAs. Similarly, although the average accuracy for the fifth-degree relationships was 34.45%, the (2, 4, 0) and (2, 0, 4) pairs were predicted correctly in 48.5% of the cases.
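The reported third-degree average can be checked against the confusion matrix of Table 4, as in the short sketch below. The variable names are hypothetical and empty cells of the table are entered as 0; since every relationship has the same number of simulated pairs, the average accuracy is the mean of the diagonal.

```python
# Rows and columns in the order of Table 4; values are percentages.
LABELS = ["2,0,2", "2,1,1", "2,2,0", "1,-1,2", "1,0,1", "1,1,0", "1,2,-1"]
TABLE_4 = [
    [88, 0, 0, 5, 5, 2, 0],
    [0, 84, 1, 0, 7, 8, 0],
    [0, 1, 88, 0, 2, 4, 5],
    [2, 0, 0, 96, 1, 1, 0],
    [0, 11, 0, 0, 68, 21, 0],
    [0, 8, 0, 0, 24, 68, 0],
    [0, 0, 3, 0, 1, 1, 95],
]

# Per-relationship accuracy is the diagonal entry of each row.
per_class = {label: row[i] for i, (label, row) in enumerate(zip(LABELS, TABLE_4))}
average_accuracy = sum(per_class.values()) / len(per_class)
print(round(average_accuracy, 2))  # 83.86
```

The same diagonal shows where the loss comes from: the two half-avuncular rows are the only ones below 80%.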

Predictions across degrees: since in practice the degree of the relationship is not necessarily known, we also performed cross validation on a set of 200 pairs of individuals from each of the relationships of degree up to 5, including rotated relationships, as well as 200 pairs of unrelated individuals. The average classification accuracy was 57.5%, varying widely for different degrees: all the first-degree pairs were classified correctly; the average accuracy for the second-degree pairs was 99.5%, for the third-degree pairs 76.57%, for the fourth-degree pairs 46% and for the fifth-degree pairs 23.36%. Finally, 90% of the unrelated individuals were classified correctly. We note that there was a small decrease in accuracy compared with the results of the previous section, because some of the pairs were classified in relationships of the incorrect degree, while this was never the case when only within-degree predictions were made. Table 6 shows the percentage of pairs that were classified in a relationship of the correct degree for each of the degrees examined. On average, the correct degree was predicted for 89.83% of the pairs.

The effect of phasing errors: the phasing error rate is defined as the proportion of successive pairs of heterozygote SNPs that are phased incorrectly with respect to each other. To examine the effect of phasing errors on the classification accuracy of CARROT, we simulated 100 pairs of individuals for each of the third-degree


Kyriazopoulou-Panagiotopoulou & al ISMB 2011

Page 13: IDENTITY BY DESCENT

IBD segments in a population


(GERMLINE method: long segments)

We are given haplotypes [phased!]: a binary matrix H[1..2n][1..s] holding haplotypes over s sites in n genomes

IBD segment: H[i][j..j′] = H[i′][j..j′]

Goal: identify the longest IBD segments

1. group identical haplotypes within each block (hash table)

2. merge pairs across consecutive blocks; keep track of the segment start for each pair

for haplotype pairs (i, i′) that are IBD in block (k − 1): either extend into block k (with errors allowed), or test the length (does it exceed Lmin?)
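The two steps above can be sketched as follows. This is a simplified illustration and not the GERMLINE implementation: it requires exact matches within each block (no errors allowed), ignores a trailing partial block, and the names `long_shared_segments`, `block_size` and `min_sites` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def long_shared_segments(haps, block_size, min_sites):
    """Return (i, j, start, end) for runs of identical blocks, end exclusive."""
    n_blocks = len(haps[0]) // block_size
    open_start = {}  # pair (i, j) -> first site of the run still being extended
    segments = []
    for k in range(n_blocks):
        lo, hi = k * block_size, (k + 1) * block_size
        # Step 1: hash each haplotype's block; equal blocks share a bucket.
        buckets = defaultdict(list)
        for i, hap in enumerate(haps):
            buckets[tuple(hap[lo:hi])].append(i)
        matched = set()
        for members in buckets.values():
            matched.update(combinations(members, 2))
        # Step 2: runs that cannot be extended are closed and length-tested.
        for pair in list(open_start):
            if pair not in matched:
                start = open_start.pop(pair)
                if lo - start >= min_sites:
                    segments.append((pair[0], pair[1], start, lo))
        # Pairs matching in this block either continue or open a new run.
        for pair in matched:
            open_start.setdefault(pair, lo)
    end = n_blocks * block_size
    for pair, start in open_start.items():
        if end - start >= min_sites:
            segments.append((pair[0], pair[1], start, end))
    return sorted(segments)

haps = ["00110011", "00110000", "11110011"]
found = long_shared_segments(haps, block_size=2, min_sites=4)
print(found)  # [(0, 1, 0, 6), (0, 2, 2, 8), (1, 2, 2, 6)]
```

Hashing keeps the work per block close to linear in the number of haplotypes when few of them share a block, instead of comparing all pairs at every site.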

Gusev & al Genome Research 19 :318 (2009)

Page 14: IDENTITY BY DESCENT

IBD segments in a population 2


Beagle and fastIBD: LD (linkage disequilibrium) model for haplotypes

HMM: states = allele 0/1 in the haplotype; transitions given by frequencies among the haplotypes; emissions include sequencing error

haplotype modified by the penalties assessed at each switch between alternate phasings. Thus, a small score (close to zero) for a pair indicates that the two individuals share a low-frequency haplotype and are thus likely to be identical by descent. We use sampled haplotypes with a sliding marker window, as is done in GERMLINE (ref. 5), which permits rapid computation. A critical difference between our method and GERMLINE is that our method is based on shared haplotype frequency rather than shared haplotype length.

Material and Methods

The fastIBD algorithm starts by sampling a fixed number of haplotype pairs (four pairs by default) for each individual from the posterior haplotype distribution. Each sampled haplotype corresponds to a sequence of hidden Markov model (HMM) states. The fastIBD algorithm searches for pairs of sampled haplotypes sharing the same sequence of HMM states for a set of consecutive markers. If the pair of sampled haplotypes belongs to two distinct individuals, the shared haplotype tract is recorded. For each pair of individuals, overlapping shared haplotype tracts are merged, and the merged shared haplotype tract is a mosaic of pairs of sampled haplotypes (see Figure 1). A fastIBD score is calculated for each merged tract, and if the score is below a user-specified threshold, the tract is printed to an output file. We now describe in detail the algorithm for finding shared haplotype tracts, the calculation of fastIBD scores for those tracts, and the algorithmic details that allow for efficient computation. Pseudocode is available as supplemental data.

Shared Haplotype Tracts

A shared haplotype tract T consists of a pair of sampled haplotypes (T.H1 and T.H2), a starting marker index (T.start), an ending marker index (T.end), and a fastIBD score (T.score). We use the convention that the starting marker index is inclusive and the ending marker index is exclusive. When shared haplotype tracts are first discovered, the fastIBD score is equal to the pairwise haplotype score defined below for the two haplotypes in the marker interval. However, after shared haplotype tracts are found, overlapping shared haplotype tracts are merged, and the merging algorithm defines a new fastIBD score for the merged tract. In general, the fastIBD score roughly approximates the frequency of the shared haplotype.

Pairwise Haplotype Scores

For any pair of haplotypes H1 and H2 and any interval of markers m1 < m2, we define a pairwise haplotype score S(H1, H2, m1, m2). The Beagle model defines a unique sequence of HMM states for each haplotype. If both haplotypes have the same sequence of HMM states in the marker interval, the pairwise haplotype score is the haplotype frequency or, more precisely, the frequency of the shared sequence of HMM states. As a consequence of the LD model’s being a HMM, the frequency of a sequence of HMM states $s_m, s_{m+1}, \dots, s_{m+k}$ can be expressed as a product of state and transition probabilities:

$$P(s_m, s_{m+1}, \dots, s_{m+k}) = P(s_m) \prod_{j=1}^{k} P(s_{m+j} \mid s_{m+j-1})$$

In the preceding equation, there is a term corresponding to each marker: $P(s_m)$ for marker m and $P(s_{m+j} \mid s_{m+j-1})$ for marker m + j (j > 0). If the two haplotypes do not have the same HMM state at one or more markers in the marker interval, one obtains the pairwise haplotype score by replacing the corresponding state probability $P(s_m)$ or transition probability $P(s_{m+j} \mid s_{m+j-1})$ with 100 at each marker for which the two haplotypes have different HMM states. This penalizes the pairwise haplotype score by inflating the estimated shared haplotype frequency.
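A minimal sketch of this score is given below. It is not the Beagle or fastIBD code: the state labels, the dictionaries `init_prob` and `trans_prob`, and the choice to condition on the first haplotype's previous state when the two haplotypes disagreed at the preceding marker are all assumptions made for illustration.

```python
MISMATCH_PENALTY = 100.0  # replaces the probability term at a discordant marker

def pairwise_score(states1, states2, init_prob, trans_prob):
    """Product of per-marker terms over the marker interval.

    states1, states2: HMM state sequences of the two haplotypes.
    init_prob[s]: P(s) at the first marker of the interval.
    trans_prob[(prev, s)]: P(s | prev) at later markers.
    """
    score = 1.0
    for j, (s1, s2) in enumerate(zip(states1, states2)):
        if s1 != s2:
            score *= MISMATCH_PENALTY          # penalty term
        elif j == 0:
            score *= init_prob[s1]             # state probability
        else:
            score *= trans_prob[(states1[j - 1], s1)]  # transition probability
    return score

# Hypothetical three-marker model.
init_prob = {"a": 0.5}
trans_prob = {("a", "b"): 0.2, ("b", "c"): 0.1}
identical = pairwise_score(["a", "b", "c"], ["a", "b", "c"], init_prob, trans_prob)
one_mismatch = pairwise_score(["a", "b", "c"], ["a", "x", "c"], init_prob, trans_prob)
```

A single discordant marker moves the score from a small frequency to a value above 1, so a tract with mismatches is unlikely to pass a small score threshold.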

Merging Shared Haplotype Tracts

Two shared haplotype tracts T and U can be merged to create a merged shared haplotype tract M if the pair of sampled haplotypes in each tract corresponds to a single pair of individuals and if either the marker intervals for the two shared haplotype tracts overlap or the starting marker for one tract is the ending marker for the other tract. When merging overlapping shared haplotype tracts for a pair of individuals, we merge tracts with the smallest starting marker indices first.

The marker interval for the merged tract is the union of the two marker intervals. The fastIBD score of the merged tract is defined to be less than or equal to the two component fastIBD scores. For the purposes of further computation, we only need to keep track of the haplotypes at the right end of the merged tract, so the two notated haplotypes of the merged tract are the haplotypes from the tract with the largest ending index. For example, if the marker interval in shared haplotype tract T is a subset of the marker interval in shared haplotype tract U, we say tract U covers tract T, and we define the merged tract as M.H1 = U.H1, M.H2 = U.H2, M.start = U.start, M.end = U.end, and M.score = min{T.score, U.score}.
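The covering case can be written down directly, as in the sketch below. The `Tract` record mirrors the fields named in the text, while `covers` and `merge_covered` are hypothetical helper names; the non-covering case is left out because its score depends on left and right scores not defined in this excerpt.

```python
from collections import namedtuple

# h1, h2: sampled haplotypes; start inclusive, end exclusive; score: fastIBD score.
Tract = namedtuple("Tract", ["h1", "h2", "start", "end", "score"])

def covers(u, t):
    """True if the marker interval of t is a subset of that of u."""
    return u.start <= t.start and t.end <= u.end

def merge_covered(t, u):
    """Merge t into u when u covers t, keeping u's haplotypes and interval."""
    if not covers(u, t):
        raise ValueError("u must cover t")
    return Tract(u.h1, u.h2, u.start, u.end, min(t.score, u.score))

t = Tract(h1=0, h2=5, start=10, end=20, score=1e-8)
u = Tract(h1=1, h2=4, start=5, end=30, score=1e-6)
m = merge_covered(t, u)
```

Taking the minimum keeps the merged score no larger than either component, as the text requires.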

If shared haplotype tracts T and U can be merged, and if one tract does not cover the other tract, then either T.start ≤ U.start and T.end ≤ U.end or U.start ≤ T.start and U.end ≤ T.end. If we assume the former configuration, the merged tract haplotypes are M.H1 = U.H1 and M.H2 = U.H2, the merged tract marker interval is M.start = T.start, M.end = U.end, and the merged-tract fastIBD score M.score is the minimum of a left score and a right score.

Figure 1. Merging of Shared Haplotype Tracts. Four pairs of haplotypes have been sampled from individuals 1 and 2. Two shared haplotype tracts have been found (denoted by patterned regions). The two tracts are merged into a single shared haplotype tract.

174 The American Journal of Human Genetics 88, 173–182, February 11, 2011

haplotype is very rare, the difference between the IBD and non-IBD probabilities can become very large (because the very small haplotype probability occurs in the non-IBD probability but essentially disappears into the mean in the IBD probability). As a result, in some regions several pairs of individuals were reported to have very small (< 0.1 cM) IBD segments. We found that using the minimum avoids this problem. However, the transition probabilities from a state should sum to 1. They do sum to 1 if the mean is used, but not if the minimum is used (in which case they sum to < 1). Thus, using a minimum downweights the probabilities when the two possibly IBD haplotypes are traveling through different paths of the model, which is useful. We plan to investigate this issue further in future research.

To demonstrate these probability calculations, we give a small

example on four SNP markers (however, note that the method is

designed for dense SNP data with thousands of markers per chro-

mosome). Again, we assume that the haplotypes are known.

However, in calculating the posterior probability of IBD, the

HMM method will account for haplotype uncertainty (the full

calculation of IBD probabilities for this example is not shown).

The LD model for the four SNPs is taken from previous work23

and is shown in Figure 1. The transition probabilities for the

model are P(eA) = 0.518, P(eB) = 0.482, P(eC) = 0.627, P(eD) = 0.373, P(eE) = 1.0, P(eF) = 0.490, P(eG) = 0.510, P(eH) = 1.0, P(eI) = 0.194, P(eJ) = 0.806, P(eK) = 1.0, P(eL) = 1.0. Individual 1 has haplotypes H1 = 1 1 1 1 and H2 = 1 2 2 1; individual 2 has haplotypes H3 = 2 1 1 1 and H4 = 2 1 2 2. We calculate the probability of

the four haplotypes given that haplotypes H1 and H3 are IBD at all

four marker positions.

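\[
P(H_1, H_2, H_3, H_4 \mid H_1 \text{ and } H_3 \text{ are IBD}) = P(H_2)\,P(H_4)\,P(H_1, H_3 \mid \text{IBD})
\]

\[
P(H_2) = P(e_A)\,P(e_D)\,P(e_H)\,P(e_L) = (0.518)(0.373)(1.0)(1.0) = 0.193
\]

\[
P(H_4) = P(e_B)\,P(e_E)\,P(e_G)\,P(e_K) = (0.482)(1.0)(0.510)(1.0) = 0.246
\]

\[
P(H_1, H_3 \mid \text{IBD}) = \min(P(e_A), P(e_B))\,\varepsilon\,\min(P(e_C), P(e_E)) \times (1-\varepsilon)\,P(e_F)\,(1-\varepsilon)\,P(e_I)\,(1-\varepsilon)
= (0.482)(0.005)(0.627) \times (0.995)(0.490)(0.995)(0.194)(0.995) = 1.41 \times 10^{-4}
\]

Finally,

\[
P(H_1, H_2, H_3, H_4 \mid H_1 \text{ is IBD with } H_3) = (0.193)(0.246)\left(1.41 \times 10^{-4}\right) = 6.7 \times 10^{-6}.
\]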

Informally, when the probability of the data is much higher under

IBD than under non-IBD (high enough to overcome the low prior

probability of IBD), the posterior probability of IBD will be high.
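The four-SNP worked example can be checked numerically. The sketch below is an illustration rather than the BEAGLE code: the function names are hypothetical, the mismatch probability 0.005 is read off the numerical factors of the example, and the per-marker rule (minimum of the two edge probabilities, times 1 − ε for matching alleles or ε for mismatching alleles) is inferred from this single example. The edge paths for haplotypes 1 1 1 1 and 2 1 1 1 come from the caption of the LD-model figure.

```python
from math import prod

EPS = 0.005  # allele-mismatch probability implied by the (0.005)/(0.995) factors

# Edge transition probabilities of the four-SNP LD model
P = {
    "eA": 0.518, "eB": 0.482, "eC": 0.627, "eD": 0.373,
    "eE": 1.0, "eF": 0.490, "eG": 0.510, "eH": 1.0,
    "eI": 0.194, "eJ": 0.806, "eK": 1.0, "eL": 1.0,
}


def path_prob(edges):
    """Probability of one haplotype: product of its edge probabilities."""
    return prod(P[e] for e in edges)


def ibd_pair_prob(path1, alleles1, path2, alleles2, eps=EPS):
    """Probability of two haplotypes given IBD, as in the worked example."""
    p = 1.0
    for e1, a1, e2, a2 in zip(path1, alleles1, path2, alleles2):
        match_factor = (1 - eps) if a1 == a2 else eps
        p *= min(P[e1], P[e2]) * match_factor
    return p


p_h2 = path_prob(["eA", "eD", "eH", "eL"])   # H2 = 1 2 2 1
p_h4 = path_prob(["eB", "eE", "eG", "eK"])   # H4 = 2 1 2 2
p_h1_h3 = ibd_pair_prob(
    ["eA", "eC", "eF", "eI"], [1, 1, 1, 1],  # H1 = 1 1 1 1
    ["eB", "eE", "eF", "eI"], [2, 1, 1, 1],  # H3 = 2 1 1 1
)
p_all = p_h2 * p_h4 * p_h1_h3

print(f"P(H2) = {p_h2:.3f}, P(H4) = {p_h4:.3f}")
print(f"P(H1,H3 | IBD) = {p_h1_h3:.3g}, product = {p_all:.3g}")
```

Up to rounding, the computed values agree with the figures quoted in the example.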

The estimation of the IBD proceeds by first building the LD

model from the unphased genotypes by using ten iterations of

the model-building algorithm to obtain convergence.23 We then

add the IBD model to the LD model and use the forward-backward

algorithm for HMMs26,27 to obtain posterior probabilities of IBD

for each pair. Our software also reports the most likely haplotype

phasing given IBD, which can be useful for phasing related indi-

viduals. The procedure may be repeated several times with the

use of different random number seeds, with the maximum poste-

rior IBD probability from the multiple runs used. This avoids false

negatives due to the fitted LD model converging to a local

maximum that does not allow the haplotypes to follow their

true IBD configuration. In this study, we use ten runs for IBD prob-

abilities and five runs for HBD probabilities (see below).

Constructing the LD model takes the same amount of computa-

tional time as it would to phase the data set by using BEAGLE,

which is relatively fast.23 However, with n individuals, there are

on the order of n2 potential pairs on which to calculate IBD prob-

abilities, thus increasing the total computation time, relative to

phasing, by the order of n. Thus, it is not currently feasible to

compute IBD probabilities on all pairs of individuals over the

whole genome in a large data set with thousands of individuals.
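The quadratic growth in the number of pairs is easy to make concrete. The short sketch below only counts pairs and the ratio of pairs to individuals; the helper name `n_pairs` is hypothetical and no timing of BEAGLE is implied.

```python
from math import comb


def n_pairs(n: int) -> int:
    """Number of unordered pairs among n individuals: n(n-1)/2."""
    return comb(n, 2)


for n in (100, 1000, 5000):
    # pairs / individuals grows linearly in n, hence the extra factor of order n
    print(n, n_pairs(n), n_pairs(n) / n)
```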

Calculation of HBD probabilities involves only two haplotypes

(from a single individual), but the basic principle is the same.

The probability of the two haplotypes given that they are non-

HBD is found by multiplying the two haplotype probabilities

together. The probability of the two haplotypes given HBD is the

same as the probability of two haplotypes given IBD, as described

above.

For HBD, the basic unit is individuals, rather than pairs of indi-

viduals. Thus, estimating HBD probabilities for all individuals

takes only slightly longer than phasing all individuals in a data

set. We have estimated HBD probabilities on all individuals from

several case-control cohorts from the Wellcome Trust Case Control

Consortium17 with approximately 5000 individuals genotyped on

400,000 autosomal SNPs, thus demonstrating that our HBD detec-

tion method can be applied to large genome-wide association

studies. Genome-wide HBD could be useful for gene mapping in

diseases with rare recessive variants of strong effect.

We define as IBD or HBD any position at which the correspond-

ing IBD or HBD probability exceeds 0.5. To define the length of an

IBD or HBD region, we measure the genetic length from the first

position at which the pair is IBD or the individual is HBD to the

last position before the IBD or HBD probability drops below 0.5.
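The segment definition above can be sketched as a single pass over per-marker posterior probabilities. This is an assumption-laden illustration: `call_segments` is a hypothetical name, positions are taken to be genetic positions in cM sorted along the chromosome, and "exceeds 0.5" is implemented as a strict inequality.

```python
def call_segments(positions_cm, probs, threshold=0.5):
    """Return (start, end) genetic positions of runs with probability > threshold."""
    segments = []
    start = None  # first position of the current run, if any
    last = None   # last position seen with probability above the threshold
    for pos, p in zip(positions_cm, probs):
        if p > threshold:
            if start is None:
                start = pos
            last = pos
        elif start is not None:
            # the probability dropped: close the run at the previous position
            segments.append((start, last))
            start = None
    if start is not None:
        segments.append((start, last))
    return segments


def segment_lengths(segments):
    """Genetic length (cM) of each called segment."""
    return [end - start for start, end in segments]
```

A run consisting of a single marker gets length 0 under this definition, because length is measured between the first and last positions above the threshold.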

Comparison with Other Programs

We tested our method (implemented in BEAGLE) against GERMLINE version 1.4.08 and PLINK version 1.07,2 two existing state-

of-the-art programs for IBD detection. We also attempted to

include RELATE4 in our comparisons. However, we were unable

to successfully run this program. We ran GERMLINE with default

settings (a maximum of two mismatched homozygote markers

in a slice for it to be considered a match, and slice size of 128

markers), except that we adjusted the minimum length of reported

IBD segments, as described in the Results (the default is 5 cM). For

PLINK, we followed the method for pruning SNPs suggested in the

PLINK documentation for shared segment analysis (SNPs with

> 1% missing genotypes and < 5% minor allele frequency

removed, then pairwise LD-based pruning with window size 100,

Figure 1. Example of an LD Model on Four SNPs
SNP 1 is represented by edges eA and eB; SNP 2 by edges eC, eD, eE; SNP 3 by edges eF, eG, eH; and SNP 4 by edges eI, eJ, eK, eL. For each SNP, allele 1 is represented by a solid line, whereas allele 2 is represented by a dashed line. Haplotype H1 (1 1 1 1) follows the orange path (eA, eC, eF, eI), and haplotype H3 (2 1 1 1) follows the blue path (eB, eE, eF, eI).

The American Journal of Human Genetics 86, 526–539, April 9, 2010 529

Browning & Browning American Journal of Human Genetics (2010,2011)

Localized haplotype clustering

To process the data in an efficient manner, we use a slight variant of the algorithm of Browning [2006], which is adapted from that of Ron et al. [1998]. The algorithm is the same as that in Browning [2006] except that the merging threshold is changed as described below. This algorithm processes the markers in chromosomal order, starting with all haplotypes in a single node, and in turn splitting the nodes by considering the alleles at the next marker, and then merging nodes based on the Markov criterion, as illustrated in Figure 1. To implement the Markov criterion in the merging procedure, a score is calculated for each pair of nodes at the current level. The score is described in detail in Browning [2006]. A low score for a pair of nodes means that the two nodes (i.e. haplotype clusters) have similar probabilities for sequences of alleles at markers t+1, t+2, …, t+k, for all k. A pair of nodes may be merged if the score is less than a threshold $m(n_x^{-1} + n_y^{-1})^{1/2} + b$. In this formula, m and b are scale and shift parameters, respectively, that we have introduced because simulation studies (unpublished data) demonstrated increased power using a more parsimonious model with fewer clusters at each position (this is achieved by increasing the threshold). In the results presented here we use m = 4 and b = 0.2, which gives good power over a range of conditions. These parameters were not used in Browning [2006] (effectively m = 1 and b = 0 were used).
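The effect of the scale and shift parameters on the merging threshold can be explored with a few lines of Python. This is only a sketch: `merge_threshold` is a hypothetical name, and n_x and n_y are assumed here to be the numbers of haplotypes in the two nodes being compared, which the excerpt does not define (it defers to Browning [2006]).

```python
from math import sqrt


def merge_threshold(n_x: int, n_y: int, m: float = 4.0, b: float = 0.2) -> float:
    """Threshold m * (1/n_x + 1/n_y)**0.5 + b below which two nodes may merge."""
    return m * sqrt(1.0 / n_x + 1.0 / n_y) + b


# Compare the settings used here (m = 4, b = 0.2) with the earlier
# effective settings (m = 1, b = 0) for nodes of various sizes.
for n in (10, 100, 1000):
    print(n, merge_threshold(n, n), merge_threshold(n, n, m=1.0, b=0.0))
```

With m = 4 and b = 0.2 the threshold is higher for every node size, so more pairs of nodes qualify for merging and the fitted model has fewer clusters per position.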

The outcome of this procedure is a directed acyclic graph. Edges of the graph are labeled by alleles, and each haplotype in the sample traces a path through the graph from the root node to the terminal node, following the sequence of alleles. Edges of the graph correspond to localized haplotype clusters. The localized haplotype cluster given by an edge of the graph consists of all haplotypes in the sample tracing their path through that edge. Figure 2 illustrates these concepts.

INTERPRETATION OF MODEL

The fitted graph model describing the localized haplotype clusters does not correspond to any specific population genetic model, yet it flexibly models the empirical pattern of LD seen in the data. Roughly speaking, merges (where two edges direct into the same node) correspond to historical recombination. At recombination hot-spots one would expect to see many merges, and a low number of haplotype clusters. In fact, in regions of very low LD (due to recombination hot-spots or to widely spaced markers) the number of nodes at each position may be reduced to one and the haplotype clusters will simply correspond to alleles, with one edge (cluster) for each allele at each marker. In this case the haplotypic tests reduce to single-marker tests.

Fig. 1. Initial steps in procedure for creating localized haplotype clusters. (A) Starting with all haplotypes in a single node, the node is split by the alleles at the first marker (a diallelic marker is shown here, however the method can be used with multiallelic markers). The resulting nodes may be merged if the merging criterion is met (for this example the nodes are not merged). (B) The nodes are split by the alleles at the second marker. The four nodes represent the four possible haplotypes when considering the first two markers. (C) Pairs of nodes may be merged. In this example, the first and third nodes are merged. The resulting merged node corresponds to all haplotypes with allele 1 at the second marker. (D) The three nodes are split by the third marker. The process continues by considering merging the resulting nodes, then splitting by the fourth marker and so on.

[Figure labels: initial node, terminal node]

Fig. 2. Example of a directed acyclic graph representing localized haplotype clusters for four markers. For each marker, allele 1 is shown with a solid line and allele 2 by a dashed line. The bolded path through the graph represents the haplotype 2122. The starred edge represents the localized haplotype cluster consisting of haplotypes 1122 and 2122.

367 Whole Genome Multilocus Association Testing

Genet. Epidemiol. DOI 10.1002/gepi

(merge the closest haplotypes according to some criteria)

Browning & Browning Genetic Epidemiology 31 :365 (2007)

Detection of IBD segments

Chromosome 1 in the high variant coverage sequence data is longer than that in the low variant coverage data (Table 3), suggesting not just greater resolution but also greater detection of IBD using sequence data. This may be due to the fact that fastIBD is able to detect many more small segments of size 0.5 cM or smaller with the use of sequence data (high variant density) than that using the low density genotype dataset.

Discussion

In this study, we examined how the density of genetic variants in a dataset affects the power to detect IBD between individuals. We found that analysis of sequence data with high SNP density improves resolution and power for detecting IBD relative to microarray-based genotyping, particularly for small segments. In our simulation, there was good power (80%) to detect IBD segments of size 0.4 cM using high coverage sequence data with a low false positive rate, compared to a power of approximately 77% for segments of size 1 cM using microarray genotype data (WTCCC).

It is possible that the methods we examined in this study may be further refined to improve the power to detect even smaller IBD segments. We found that Germline has slightly higher power to detect IBD using sequence data compared to fastIBD, but it has a much higher false positive rate. That is, for high variant density data, Germline detects many small segments, of which around 25% are false positives. We set the detectable minimum length to 0.1 cM while running Germline, which allows Germline to detect small segments, but it increases the false positive rate. Germline also provided lower power for detecting IBD segments using the microarray dataset (from WTCCC). These results indicate that fastIBD provides more robust and reliable IBD detection than Germline for these types of datasets. Given these observations, the current implementation of fastIBD appears to be better than the current implementation of Germline for detecting IBD

Figure 1 Empirical power of fastIBD. Empirical power of fastIBD to detect an IBD segment as a function of the number of SNPs within a segment in the simulation study. Each plot presents different lengths of IBD segments examined. The power of each dataset is represented by different colored circles and plotted against the number of SNPs contained within a given region.

Table 1 The average power of fastIBD and Germline

                     fastIBD                               Germline
Segment size (cM)    WTCCC   HapMap  1000 g  complete      WTCCC   HapMap  1000 g  complete
0.2                  0.126   0.251   0.562   0.629         0.043   0.344   0.665   0.645
0.4                  0.327   0.518   0.781   0.801         0.099   0.551   0.836   0.806
0.6                  0.495   0.649   0.864   0.874         0.135   0.617   0.901   0.907
1                    0.767   0.840   0.904   0.899         0.231   0.794   0.941   0.944
2                    0.909   0.918   0.935   0.919         0.389   0.905   0.982   0.992

Su et al. BMC Bioinformatics 2012, 13:121, Page 3 of 8, http://www.biomedcentral.com/1471-2105/13/121

WTCCC: sparse SNP array; HapMap: dense array; 1000G: Illumina; Complete Genomics

Su et al., BMC Bioinformatics 13:121 (2012)

Fine points of the simulation

segments of 0.4, 0.6, 1, and 2 cM, we combined 2, 3, 5, and 10 consecutive composite segments. In this way, we generated 10 simulated individuals with specific chromosomal segments that are not IBD. In this simulation setting, we expect that any pair of these 10 individuals with composite chromosomal segments are unlikely to share any part of that segment longer than 0.02 cM. Thus, detecting an IBD segment longer than 0.02 cM among these individuals can be considered as a false positive.

As in the first simulation, we investigated 100 regions on Chromosome 1 for each of the 5 different lengths of composite segments (0.2, 0.4, 0.6, 1, and 2 cM). A subset of 100 individuals from WTCCC and the 1000 Genomes datasets were randomly selected to create the 10 composite individuals. We included the other 900 individuals in the WTCCC data and 183 individuals in the 1000 Genomes data for IBD analysis. We did not investigate the error rate for HapMap and Complete Genomics data due to the limited number of individuals available for study. For each IBD segment length, the false positive rate is calculated as the number of SNPs that are detected as IBD divided by the total number of SNPs within the simulated segment. The error rates were then averaged over 100 regions and any pair of these 10 individuals. For Germline, the input data need to be phased genotype data. Thus, we phased the data using fastIBD before running Germline. Both fastIBD and Germline generate a list of all pairwise IBD segments.

We ran the fastIBD function in Beagle V3.3.1 with default settings. fastIBD applies a score threshold when detecting IBD. The results of the previous study show that a threshold of 10^-10 gives good power to detect IBD and also keeps the false discovery rate close to zero [14]. Here, we used the default threshold of 10^-8. We used default settings in Germline V1.5.0 except that we set the minimum length (-min_m) to 0.1 cM. This allows Germline to have a chance of detecting small segments.


Figure 7 Illustration of the construction of composite segments. Each line represents the chromosome sequence of an individual. The colored circle represents the consecutive sequence of a segment of size 0.02 cM, which may contain multiple SNPs. A composite segment of size 0.2 cM is composed of 10 consecutive segments of size 0.02 cM from 10 different individuals. To create a composite segment of size 0.4 cM, two composite segments of size 0.2 cM are constructed and merged. A similar procedure is conducted to create composite segments of size 0.6, 1 and 2 cM, where three, five and ten small composite segments are constructed and merged respectively.
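The composite construction in this caption can be sketched on toy data. The code below is an illustration only: `make_composite` is a hypothetical helper, haplotypes are plain Python lists of alleles, and each 0.02 cM piece is approximated by a fixed number of consecutive SNP indices (`chunk_len`) rather than by genetic distance.

```python
def make_composite(donors, chunk_len):
    """Concatenate consecutive chunks, taking chunk i from donor i.

    donors: list of haplotypes (lists of alleles over the same SNPs)
    chunk_len: number of SNPs standing in for one 0.02 cM piece
    """
    composite = []
    for i, donor in enumerate(donors):
        lo = i * chunk_len
        composite.extend(donor[lo:lo + chunk_len])
    return composite


# Toy data: donor k carries the value k at every SNP, so the origin of each
# piece of the composite is visible in the result.
donors = [[k] * 20 for k in range(10)]
composite_02 = make_composite(donors, chunk_len=2)
print(composite_02)
```

Longer composites would be built by constructing several such blocks and joining them end to end, as the caption describes for 0.4, 0.6, 1 and 2 cM.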


copying a haplotype from a chromosome of one individual into the same location in the other individual (Figure 6). We simulated 30 pairs for WTCCC, HapMap and 1000 g data as well as 27 pairs for the deep coverage sequencing data, where individuals contain these artificial IBD segments for each of the datasets (a subset of individuals from the WTCCC and 1000 Genomes projects were selected). To ensure that our results were not influenced by the structure of a single region, we randomly selected 100 regions, each 3 cM in length, on Chromosome 1, and conducted the simulation for each of these regions separately. In each region, we created artificial IBD segments of lengths 0.2, 0.4, 0.6, 1, and 2 cM. The program may detect several small segments within an artificially created IBD segment. Thus, we calculated the number of SNPs on these small IBD segments identified by the program. For each length of IBD segment, power was then assessed by taking the average proportion of SNPs that are detected as IBD within the simulated segments, over 100 regions and 30 pairs of individuals.
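The power measure described here is a per-SNP proportion averaged over regions and pairs. A minimal sketch, with hypothetical function names and SNPs identified by their positions, could look as follows. The same ratio computed on composite (non-IBD) segments gives the false positive rate used in the second simulation.

```python
def detected_fraction(segment_snps, detected_segments):
    """Proportion of SNPs in a simulated segment covered by detected IBD segments.

    segment_snps: positions of the SNPs inside the simulated segment
    detected_segments: list of (start, end) positions reported by the program
    """
    if not segment_snps:
        raise ValueError("the simulated segment contains no SNPs")
    hits = sum(
        any(start <= pos <= end for start, end in detected_segments)
        for pos in segment_snps
    )
    return hits / len(segment_snps)


def average_rate(fractions):
    """Average of the per-(region, pair) proportions."""
    return sum(fractions) / len(fractions)


snps = list(range(1, 11))  # ten SNPs in one simulated segment
print(detected_fraction(snps, [(2, 4), (8, 9)]))
```

Counting SNPs rather than whole segments means that a simulated segment detected as several small pieces still contributes partial credit.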

Construction of composite individuals for assessing the false positive rate. We then investigated the rate of falsely detecting IBD when no IBD is present through a second simulation. We started by constructing a composite chromosomal segment for each of 10 simulated individuals [14]. To do this, we selected 100 individuals from each of the WTCCC and 1000 Genomes datasets. Composite chromosome segments of length 0.2 cM were constructed by copying 10 consecutive regions of length 0.02 cM from 10 different individuals (Figure 7). To create

Figure 5 The distribution of detected lengths of IBD segments across four investigated datasets using fastIBD. We report the counts of detected IBD segments against estimated segment lengths on Chromosome 1 for 60 individuals in each of the datasets (54 individuals for the deep coverage sequencing dataset).

Table 3 The total length (cM) of IBD segments detected on Chromosome 1 using fastIBD

                         WTCCC   HapMap  1000 g  complete
Number of individuals    60      60      60      54
Total length (cM)        17409   26843   29372   46936

Figure 6 Illustration of the procedure for creating artificial IBD segments. For a pair of individuals, a segment of the chromosome (shown in red) within a randomly chosen region on Chromosome 1 is copied to one chromosome in another individual.


"True" artificial haplotypes after mixing (no undetected relationships)

IBD by random copying

Su et al., BMC Bioinformatics 13:121 (2012)

False positives...

relatively small population could be affected dramatically by some aspects of population history (e.g., growth type and internal subdivision) [16]. In fact, recent studies have shown that both the amount of the genome shared identical by descent and the proportion of the genome that is covered by long runs of homozygosity differ by population [5,17,18]. A more detailed assessment of IBD across populations could help determine to what extent whole genome sequence data can improve the power of these mapping approaches. Additionally, the continued improvement of IBD detection methods and the testing of those methods on dense genetic data can provide a foundation for future genetic studies.

Materials and methods

Data

To assess the statistical power to detect chromosomal segments that are shared identical by descent between two individuals, we conducted a simulation study. First, we collected genotype data from four sources that represent different levels of coverage (that is, the proportion of all variants in the genome that are assayed by a given platform), ranging from microarray genotype data to deep coverage whole genome sequence data. These empirical genotype data include:

Microarray genotype data (WTCCC) from the Wellcome Trust Case Control Consortium (WTCCC) study. We obtained genotype data on 1000 controls from the UK National Blood Donors (NBS) cohort genotyped on the Illumina 1.2 M chip. We used the SNP set released from the WTCCC database, which represents a cleaned set of data from their default QC procedures [19].

Denser genotype data (HapMap) from the HapMap phase II project. We obtained genotype data on 60 unrelated samples from the CEU population (Utah residents with ancestry from northern and western Europe).

Low coverage sequence data (1000 g) from the 1000 Genomes Project. We obtained genotype data on 283 individuals that originate from Europe sequenced with 4X coverage (2010.08 release).

Deep coverage sequence data (complete) from the University of California at San Francisco Whole Genome Sequencing Consortium and Complete Genomics [20]. We obtained genotype data on 54 samples of European origin sequenced by Complete Genomics with an average of 50X coverage. We used the Complete Genomics default cut offs for full genotype calls (excluding partial and no calls), which pass a strict quality score metric.

Construction of artificial IBD for assessing power

Next, to investigate the power to detect IBD of various lengths, we constructed artificial IBD segments by

Figure 4 Empirical false positive rate of Germline. Empirical false positive rate of Germline to detect an IBD segment as a function of the number of SNPs within a segment in the simulation study. Each plot presents different lengths of IBD segments examined. The error rate of each dataset is represented by different colored circles and plotted against the number of SNPs contained within a given region.

Table 2 The average false positive rate of fastIBD and Germline

                     fastIBD            Germline
Segment size (cM)    WTCCC   1000 g     WTCCC   1000 g
0.2                  0.009   0.009      0.015   0.269
0.4                  0.007   0.009      0.006   0.239
0.6                  0.008   0.007      0.009   0.225
1                    0.010   0.009      0.006   0.223
2                    0.011   0.009      0.003   0.218


Su et al., BMC Bioinformatics 13:121 (2012)

Distribution of IBD segments in the population

the Akaike information criterion42 (AIC) to compare models while

controlling for their different degrees of freedom (see the algo-

rithm reported in Table S2).

Three models were used for the inference in the AJ population

(see Figure 1 and an additional description in the Results): (1) a model of exponential expansion (ME), (2) a model including a founder event followed by exponential expansion (MFE), and (3) a model of two exponential-expansion periods separated by a founder event (MEFE). The ME model did not provide enough

flexibility to fit the IBD-sharing summary extracted for the AJ pop-

ulation, resulting in a poor fit (particularly for shorter segments)

and unrealistically large values for the recent population size.

We therefore excluded this model from further analysis. For

models MFE and MEFE, we used the following rejection-sampling

approach to maximize the model likelihood around the least-

squares solution obtained in the previous step. (1) For each model, for each model parameter, we generated a list of neighboring points by allowing each parameter to vary by ±3% of its current

value. (2) For each point on such a local grid, we sampled several

random data sets of sharing individuals by using the correspond-

ing demographic parameters (details in Table S3). We created

each data set by sampling random sharing values for independent

individual pairs from the distribution of Equation 17. (3) For each

analyzed set of parameter values, we computed a likelihood as

the fraction of data points for which the deviation between AJ

and sampled sharing was smaller than a tolerance threshold

δ (δ ≈ 0.089 for MFE and δ ≈ 0.037 for MEFE). (4) We updated the

current point to the most likely point in the analyzed neighbor-

hood, if any, and iterated steps 1–3 until no point with a higher

likelihood was found. (5) We applied the AIC to compare models.

For both models, only one iteration of the above local maximization was required. The most likely parameter values in the grid

matched those obtained with the least-squares approach, except

for the current population size, which increased by 3% for model

MFE and decreased by 3% for model MEFE. When comparing the

Figure 1. Demographic Models
(A) Population of constant size.
(B) Exponential expansion (contraction for Na > Nc).
(C) A founder event followed by exponential expansion.
(D) Two subsequent exponential expansions divided by a founder event.

two models, we used a tolerance threshold

of δ ≈ 0.037 and obtained an AIC value of 19.21 for the MEFE model, which allows five parameters to vary (such δ results in a likelihood of 0.01 for the MEFE model). Using the same acceptance threshold, we thus required a log likelihood of at least −5.6 (a likelihood of ~3.7 × 10^-3) for model MFE, which has four parameters, to be selected. None of the 10^5 sampled points

were accepted with such a threshold,

leading us to choose the MEFE model. The

likelihoods of additional parameter values

estimated for the MEFE model with the

use of a wider grid are reported in Table S4.
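The AIC figures quoted in this comparison follow from the standard definition AIC = 2k − 2 ln L, where k is the number of free parameters and L the likelihood. The short check below uses hypothetical variable names and only the numbers given in the text (five parameters and likelihood 0.01 for MEFE, four parameters for MFE).

```python
from math import exp, log


def aic(k: int, likelihood: float) -> float:
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log(likelihood)


aic_mefe = aic(5, 0.01)

# For the four-parameter MFE model to be selected, its AIC must not exceed
# that of MEFE: 2*4 - 2 ln(L) <= aic_mefe, i.e. ln(L) >= (8 - aic_mefe) / 2.
min_loglik_mfe = (2 * 4 - aic_mefe) / 2
min_lik_mfe = exp(min_loglik_mfe)

print(f"AIC(MEFE) = {aic_mefe:.2f}")
print(f"required ln L for MFE >= {min_loglik_mfe:.2f} (L >= {min_lik_mfe:.2e})")
```

Dropping one parameter buys exactly one unit of log likelihood, so the required likelihood for MFE is the MEFE likelihood divided by e.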

Note that when sampling from Equa-

tion 17, we assumed independence of the

analyzed sharing length intervals Ri and of the pairs within

a data set, potentially underestimating the variance of randomly

sampled summaries of IBD. To account for the presence of small

correlations, we thus performed full coalescent simulations ac-

cording to the most likely set of parameters of each model by

only sampling a synthetic chromosome 1 for 500 diploid individuals. We repeated the rejection-based comparison by using 10^4

such points for each model and obtained an equivalent result.

Accounting for Phase Errors

The inference procedure described in the previous sections

assumes that high-quality IBD information is available. When

real data sets are analyzed, several sources of noise, such as compu-

tational phasing errors, might distort summary statistics of haplo-

type sharing. In the absence of reliable probabilistic measures for

the quality of shared segments, modeling this potential bias is

complicated. To account for this additional noise, we refined the

inferred AJ demographic model by using simulations that mimic

SNP ascertainment, inaccurate phasing, and IBD discovery in the

analyzed data sets. We expected the distortion of IBD summary

statistics in the AJ data set to not be substantial (Figure S3). The

preliminary inference based on the assumption of high-quality

IBD information therefore provides an efficient means for ex-

ploring large portions of the parameter space and for performing

model comparison. This can be followed by such simulation-based

refinement, which requires considerable computation.

After finding the most likely parameters and selecting model

MEFE for the AJ data as previously described, we refined the ob-

tained solution by using a local-search approach. We iteratively

varied one demographic parameter at a time and kept a tested

value if it resulted in a decreased deviation from the AJ data

summary. Note that in order to account for the stochastic varia-

tion observed across multiple independent simulations of the

same demographic history, we would need to generate several

synthetic data sets for each tested set of demographic parameters.

814 The American Journal of Human Genetics 91, 809–822, November 2, 2012

[Slide annotations on the demographic-model panels: constant size; exponential growth/contraction; founder + growth; EFE]

The distribution of IBD segments depends on the history of the population.

Palamara et al., American Journal of Human Genetics 91:809 (2012)

IBD segments and parameters

simulated mutation rate). An estimate of Ne was obtained for each data set across all simulated times of expansion (Figure 4D). As expected, the obtained estimate of Ne tended to lie in the range between the ancestral and the current size of the population. Long, recently originated segments provide a better prediction of the current population size, especially for remote expansions. In contrast, the high frequency of shorter segments of more remote origins biases the inference toward a smaller population size when these segments are taken into account. For example, the effects of a small ancestral population size can be observed on segments between 4 and 5 cM in length only for expansions that occurred fewer than 120 generations ago; in contrast, when segments between 1 and 2 cM in length are analyzed, traces of a smaller ancestral population are still notable, even for expansions that occurred as far back as 400 generations ago. When comparing these results to population-size estimates obtained with heterozygosity from full synthetic genomic sequence, we observed the heterozygosity-based estimates of Ne to be strongly biased toward the small size of the ancestral population. Although they present less instability than do the IBD-based estimates, the inferred values approached the ancestral population size, even for expansions that occurred 400 generations before the present. This analysis outlines the unique sensitivity of long-range IBD sharing to recent demographic variation.

Evaluation of the Inference in Populations of Varying Size

We tested the accuracy of our inference procedure for the cases of either an exponential increase or decrease in population size (expansion or contraction, respectively; Figure 1B). We simulated 450 synthetic populations that underwent an exponential expansion and 450 that underwent exponential contraction (see Table S1 for a list of parameters). We analyzed the IBD sharing of 500 diploid samples from each simulated population along a 278 cM chromosome. We evaluated the accuracy of the inferred demography by using the ratio between true and predicted sizes of each analyzed population (Figure 4B) for all generations between 1 and 100. We found our inferred population size to be within 10% of the true value 95% of the time. The population size of recent generations was harder to infer because of the scarcity of long IBD segments in very large populations (this scarcity is due to a low chance of recent coalescent events).

Note that the reconstruction accuracy is influenced by sample size and length of the analyzed region (see Material and Methods). The rates of expansion and contraction also substantially affect the ability to recover the correct population size; faster expansion and contraction rates incur more noisy estimates (the testing reported in Figure 4 included extreme and possibly unrealistically large rates of expansion and contraction). This was evident when we classified the synthetic populations as either strong or mild contraction or expansion events and separately assessed the inference accuracy for each of these classes (Figure S5).

Expansion–Founder Event–Expansion Model of the AJ Population

We analyzed the demographic history of the AJ population by applying our method to a real data set of 500 individuals (Material and Methods; segment-length distributions in Figure 5). We initially tested several models by using the proposed procedure. After inferring the most likely parameters for the chosen model, we used simulations to refine the analytical solution and account for potential errors in IBD detection (see Material and Methods and Table S2 for an algorithmic summary of the analysis).

As a first step, we fitted a simple model of exponential growth (Figure 1B). If only long (≥5 cM) segments are considered, the parameters of this model can be optimized to provide a good match for the observed sharing. This supports the occurrence of an expansion event in the recent history of this population, as reported in our previous analysis using a simpler simulation-based approach.33 However, exponential growth alone is unable to provide a good fit for the observed frequency of shorter segments, suggesting additional demographic dynamics during more ancient AJ

Figure 3. Effects of Demographic Parameters on IBD Sharing
When a population of constant size Ne is considered (A), a larger number of individuals in the population results in a decreased chance of sharing IBD segments across all length intervals. A similar behavior is observed for the case of an exponential population expansion (B) parameterized by Na ancestral individuals exponentially expanding to Nc current individuals during G generations. Larger values of Na and Nc correspond to a smaller chance of IBD sharing for short and long segments, respectively. For fixed Na and Nc, changes in G (affecting the expansion rate) have an impact on segments of medium length, i.e., the slope of the distribution between short and long segments.
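The constant-size case of panel A can be illustrated with a short derivation. This is a sketch under simplifying assumptions that are not taken from the excerpt: two haploid genomes whose coalescence time t (in generations) at a given site is exponential with mean 2Ne (Ne counted in diploid individuals), and a shared segment around that site whose length (in Morgans) is the sum of two independent exponential distances, each with rate 2t. It is not necessarily the exact expression used by Palamara et al.

```latex
% f(u): expected fraction of the genome shared in IBD segments of length >= u (Morgans)
\begin{align*}
P(\ell \ge u \mid t) &= (1 + 2tu)\,e^{-2tu}, \\
f(u) &= \int_0^\infty \frac{1}{2N_e}\,e^{-t/(2N_e)}\,(1 + 2tu)\,e^{-2tu}\,dt
      = \frac{1 + 8uN_e}{(1 + 4uN_e)^2}.
\end{align*}
```

When uNe is much larger than 1, f(u) is close to 1/(2uNe), so under these assumptions doubling Ne roughly halves the expected sharing at every length threshold, in line with the behavior described for panel A.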

816 The American Journal of Human Genetics 91, 809–822, November 2, 2012

(constant size, exponential growth)

Palamara et al., American Journal of Human Genetics 91:809 (2012)

Page 21

Demography of the Ashkenazim

a founder event (Figure 1D). We focused our analysis on generations 1–200 (i.e., setting G1 + G2 = 200 in Figure 1D). The considered model allows Na3 founders to exponentially expand to a population of Na2 individuals during G2 generations. After a founder event, Na1 individuals are randomly selected and exponentially expand to reach a current population of Nc individuals during the remaining G1 generations. Using this model, we were able to obtain a good fit for the entire IBD frequency spectrum, corresponding to the parameter values Na3 ≈ 1,800; Na2 ≈ 37,800; Na1 ≈ 230; G1 = 33 (therefore, G2 = 167); and Nc ≈ 42,000,000. Model comparison based on the AIC supports this model over simpler demographic scenarios (see Material and Methods). We note that the most recent expansion period was inferred to have a considerably high rate (r ≈ 0.37, defined in Equation 7). More complex models (e.g., inferring the value of G2 and allowing for a founder event predating the remote expansion) did not significantly improve on the reported demography.

When real data is analyzed, the quality of computational phasing and IBD detection might affect the reconstruction accuracy. Inaccuracies in the recovery of long-range IBD haplotypes are reflected in the inferred current size of the AJ population, which is extremely large. This is most likely due to long IBD segments being shortened to smaller segments because of switch errors during computational phasing, in addition to greater uncertainty associated with the inference of recent large population sizes (Figure 3 and Figure S5). We therefore refined inferred parameters to take into account such potential bias by using realistic coalescent simulations that also reproduce noise due to computational phasing and IBD discovery (Material and Methods). We obtained an improved fit for a population composed of ~2,300 ancestors 200 generations before the present; this population exponentially expanded to reach ~45,000 individuals 34 generations ago. After a severe founder event, the population was reduced to ~270 individuals, which then expanded rapidly during 33 generations (rate r ≈ 0.29) and reached a modern population of ~4,300,000 individuals.
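The expansion rates quoted above can be checked with a few lines of Python. This is a minimal sketch that assumes the rate r is defined through N_end = N_start · exp(r · G); Equation 7 of the paper is not reproduced in this excerpt, so that form is an assumption, and the function names are illustrative.

```python
import math

def exponential_rate(n_start: float, n_end: float, generations: float) -> float:
    """Per-generation rate r such that n_end = n_start * exp(r * generations)."""
    if n_start <= 0 or n_end <= 0 or generations <= 0:
        raise ValueError("sizes and generations must be positive")
    return math.log(n_end / n_start) / generations

def size_at(n_start: float, rate: float, elapsed: float) -> float:
    """Population size after `elapsed` generations of exponential change."""
    return n_start * math.exp(rate * elapsed)

# Analytical fit quoted in the text: ~230 founders -> ~42,000,000 in 33 generations.
r_analytic = exponential_rate(230, 42_000_000, 33)
# Simulation-refined fit: ~270 founders -> ~4,300,000 in 33 generations.
r_refined = exponential_rate(270, 4_300_000, 33)

print(f"r (analytical fit) = {r_analytic:.2f}")
print(f"r (refined fit)    = {r_refined:.2f}")
```

Under this assumed definition the two values round to 0.37 and 0.29, which agrees with the rates reported in the text.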

Exponential Contraction in the MKK Individuals: The Village Model

We additionally investigated the demographic profile of 56 samples of self-reported unrelated MKK individuals from the HapMap 3 data set (Material and Methods). We detected high levels of segmental sharing across individuals, consistent with recent analysis of hidden relatedness in this sample.32,33 Genome-wide IBD sharing was elevated among all individual pairs, suggesting high rates of recent common ancestry across the entire group rather than the presence of occasional cryptic relatives due to errors during sample collection (Figure S6). Optimizing a model of exponential expansion and contraction (Figure 1A), we obtained a good fit to the observed IBD frequency spectrum (Figure 6), suggesting that an ancestral population of ~23,500 individuals decreased to ~500 current individuals during the course of 23 generations (r ≈ −0.17). We note that this result might not be driven by an actual gradual population contraction in the MKK individuals, but it most likely reflects the societal structure of this

Figure 5. Reconstruction for the AJ Demographic History
We applied several demographic models to study the demographic history of 500 self-reported AJ individuals on the basis of the observed distribution of haplotype sharing (green line). The parameters of exponential expansion can be optimized to provide a good fit when only long (≥5 cM) segments are considered (red line, Figure 1B; best fit: Nc ≈ 97,700,000, G = 26, and Na ≈ 1,300). However, this model is not flexible enough to accommodate abundant short segments found in this population. The milder slope observed between segments of 2–5 cM in length suggests a larger ancestral population size that rapidly recovered from a severe founder event by expanding to reach a large modern population size (purple line, Figure 1C; best fit: Nc ≈ 12,800,000; G = 35; Na1 ≈ 230; and Na2 ≈ 70,600). Still, this model cannot provide a good fit for additional slope variation (observed for segments between 1–2 cM) that is well explained by an additional exponential expansion that precedes the founder event but that is distinct from the other, more recent expansion (orange line; Figure 1D; best fit: Nc ≈ 42,000,000; G1 = 33; Na1 ≈ 23; Na2 ≈ 37,800; Na3 ≈ 1,800; and G2 = 167). All population sizes are expressed as diploid individuals. G2 was not optimized because it was assumed that G1 + G2 = 200.


Best fit for the Ashkenazi data (n = 500 genotypes, ℓ = 750 k sites): EFE (expansion, founder event, expansion) model with ~2,300 individuals 200 generations ago ↗ ~45 k (34 generations ago) ↘ ~270 ↗ ~4.3 M today

Palamara et al., American Journal of Human Genetics 91:809 (2012)

Page 22

Demography of the Maasai

seminomadic population. Although little demographic evidence has been reported, the MKK population is in fact believed to have a slow but steady annual population growth.47 We hypothesized that a high level of migration across small-sized MKK villages (Manyatta) provides a potential explanation for the observed IBD patterns in this population. In such a model, a small genetic pool for recent generations gradually becomes larger as a result of migration across villages as one moves back into the past. To validate the plausibility of this hypothesis, we simulated a demographic scenario in which multiple small villages interact through high migration rates. This setting is similar to Wright's island model,48 and we shall refer to it as the village model in this case (Figure S7). We extracted IBD information for one of the simulated villages and attempted to infer its demographic history by using a single-population model of exponential expansion and contraction (Figure 1). Indeed, the single-population model provides a good fit for this synthetic sample, and the severity of the gradual contraction of the population was observed to be proportional to the simulated migration rate. We thus used the village model to analyze the MKK demography and relied on coalescent simulations to retrieve its parameters: migration rate, size, and number of villages that provide a good fit for the empirical distribution of IBD segments. We observed a compatible fit for this model, in which 44 villages of 485 individuals each intermix with a migration rate of 0.13 individuals per generation (Figure 6).

Note that, although our simulations involved several villages of constant size, adequate choices of migration rates would result in the signature of a drastic contraction even among expanding villages (and, therefore, overall expanding population). From a methodological point of view, we further note that LD might also provide information for inferring such a "village effect." However, although current strategies for IBD detection allow finding shared haplotypes in the presence of computational phasing errors, LD analysis over long genomic intervals is substantially affected by noisy phase information (Figure S8).
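The intuition behind the village model, a small local gene pool that widens as one moves back in time, can be sketched with a two-state Markov chain. The code below is an illustration, not the coalescent simulation used by the authors. It assumes that each ancestral lineage independently moves, with probability m per generation, to a village chosen uniformly among the others, and it ignores coalescence; reading the reported 0.13 as such a per-lineage probability is also an assumption.

```python
def same_village_probability(n_villages: int, m: float, generations: int) -> list[float]:
    """P(two lineages sampled in one village are in the same village g generations back).

    Assumes each lineage independently moves, with probability m per generation,
    to a village chosen uniformly among the other n_villages - 1 (coalescence ignored).
    """
    if n_villages < 2 or not 0.0 <= m <= 1.0:
        raise ValueError("need at least 2 villages and 0 <= m <= 1")
    d = n_villages
    # Same village -> same village: neither moves, or both move to the same destination.
    stay_together = (1 - m) ** 2 + m ** 2 / (d - 1)
    # Different villages -> same village: one joins the other, or both meet in a third village.
    come_together = 2 * m * (1 - m) / (d - 1) + m ** 2 * (d - 2) / (d - 1) ** 2
    probs = [1.0]
    for _ in range(generations):
        p = probs[-1]
        probs.append(p * stay_together + (1 - p) * come_together)
    return probs

probs = same_village_probability(n_villages=44, m=0.13, generations=50)
for g in (0, 1, 5, 10, 25, 50):
    print(f"{g:3d} generations back: P(same village) = {probs[g]:.4f}")
```

In this toy setting the probability decays from 1 toward 1/44, so the ancestry of a single village is drawn from an increasingly large pool the further back one looks, which is the qualitative effect that a single-population fit reads as a contraction.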

Discussion

Recent availability of high-density genetic data has enabled the investigation of human diversity at increasingly high levels of detail. Although the vast majority of human genetic variation arose in the panhuman ancestral population and is therefore shared across continents, substantial local differentiation between populations occurred as a consequence of fine-scale demographic events of more recent history.49 The intricate structure of these events is most visible through population-specific allele frequencies that models of panmictic admixture fail to adequately explain.18 As sequencing technologies provide new insights into recent genetic variation, our ability to understand these demographic patterns becomes essential.

In this paper, we developed a formal relationship between demographic history and the distribution of IBD-shared haplotypes between purportedly unrelated individuals. This allowed us to provide an inference procedure for demographic events that occurred in recent millennia. The proposed approach can take into account subtle correlation structures induced by long-range haplotypes, a distinguishing advantage compared to existing methods. Specifically, methods that assume independence of markers (e.g., allele frequency spectrum) ignore this correlation, whereas methods that focus on stronger forms of local correlation (e.g., LD) fail to capture this source of information. It is the ability of our approach to account for long-range correlations across individual pairs that translates into higher resolution when reconstructing recent historical events.

With the maturation of population-scale sequencing technologies, direct observation of rare variants will pave new ways for investigating recent demography. Accounting for

Figure 6. MKK Demography
IBD sharing is high across MKK samples, particularly for long haplotypes. Our analysis of the observed distribution of haplotype sharing (red) with the use of a single-population model (blue) suggests occurrence of a severe population contraction in recent generations (~23,500 ancestral individuals decreasing to ~500 current individuals during 23 generations at a high exponential rate r ≈ −0.17). An alternative demographic model containing several small demes that interact through high migration rates creates the same effect as a recent severe population bottleneck and provides an alternative justification for the abundance and distribution of IBD sharing. In particular, we reconstructed a plausible scenario (dashed CI obtained through random resampling of 200 synthetic data sets) in which 44 villages of 485 individuals each intermix with a migration rate of 0.13 per individual per generation.


Best fit for the Maasai data (n = 78 genotypes, ℓ = 1.5 M sites): exponential contraction from ~23,500 individuals (23 generations ago) ↘ ~500 today

Explanation: villages with low-rate migration between them (⇒ common ancestors are more ancient)

Palamara et al., American Journal of Human Genetics 91:809 (2012)

Page 23

Population of the Americas

does not necessarily indicate a single, punctual event, but probable contact between an admixed population and Native American individuals during that period. By contrast, we find no evidence for continuing African gene flow in CLM.

Identity by descent analysis

We used germline [55] and the trio-phased OMNI data above to identify segments identical-by-descent (IBD) within and across populations (see Text S1). Not surprisingly, we found more IBD

Figure 2. (a) Individual ancestry proportions in the 1000 Genomes CLM, MXL, and PUR populations according to ADMIXTURE. (b) Map showing the sampling locations for the populations most closely related to the Native components of the 1000 Genomes populations. (c) Principal component analysis restricted to genomic segments inferred to be of Native Ancestry in these populations, compared to a reference panel of Native American groups from [40], pooled according to country of origin as a proxy for geography. Populations sampled across many locations are labeled according to the country of the centroid of locations. (d) Zoomed version of the PCA plot, showing specific Native American population labels, colored according to country of origin. doi:10.1371/journal.pgen.1004023.g002

Native American Migrations from Sequence Data

PLOS Genetics | www.plosgenetics.org 4 December 2013 | Volume 9 | Issue 12 | e1004023

Gravel et al., PLoS Genetics 9:e1004023 (2013)

Page 24

Founders and diversion

segments within populations (23,936) compared to across populations (1,440), and within-population segments were longer (Figure S3).

The MXL population exhibits significantly less within-population IBD compared to the other two panels (Figure 4). The amount of IBD among unrelated individuals can be used to infer the underlying population size under a panmictic assumption: the larger a population, the more distant the expected relationship between any two individuals [56]. Using IBD segments longer than 4 cM, we infer effective population sizes of 140,000 in MXL, 15,000 in CLM, and 10,000 in PUR. As we will show, these largely reflect post-admixture population sizes.
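To make the link between IBD sharing and population size concrete, here is a hedged sketch rather than the estimator used in the paper. It assumes a panmictic population of constant diploid size N, for which a standard coalescent argument gives an expected genome fraction shared in segments of at least u Morgans of (1 + 8uN) / (1 + 4uN)^2, and it inverts that expression by bisection. The function names are illustrative.

```python
def ibd_fraction(n_diploid: float, min_len_morgans: float) -> float:
    """Expected genome fraction shared in IBD segments of length >= min_len_morgans."""
    x = 4.0 * min_len_morgans * n_diploid
    return (1.0 + 2.0 * x) / (1.0 + x) ** 2

def effective_size_from_ibd(fraction: float, min_len_morgans: float,
                            lo: float = 1.0, hi: float = 1e9) -> float:
    """Invert ibd_fraction for n_diploid by bisection (the fraction decreases with size)."""
    if not ibd_fraction(hi, min_len_morgans) < fraction < ibd_fraction(lo, min_len_morgans):
        raise ValueError("fraction is outside the range covered by [lo, hi]")
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if ibd_fraction(mid, min_len_morgans) > fraction:
            lo = mid  # too much predicted sharing: the population must be larger
        else:
            hi = mid
    return (lo + hi) / 2.0

threshold = 0.04  # 4 cM expressed in Morgans
for n_true in (10_000, 15_000, 140_000):
    f = ibd_fraction(n_true, threshold)
    n_hat = effective_size_from_ibd(f, threshold)
    print(f"N = {n_true:>7}: expected fraction = {f:.2e}, recovered N = {n_hat:.0f}")
```

The loop is only a round trip on the three sizes quoted in the text; it shows that less sharing maps to a larger inferred size, not that these formulas reproduce the published estimates.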

We expect long IBD segments to be inherited from a recent common ancestor, and therefore to have identical continental ancestry. Comparing the RFMIX ancestry assignments on chromosomes that have been identified as IBD by germline thus provides a measure of the consistency of the two methods (see [57] for a related metric). Rates of IBD-Ancestry mismatch ranged from 2.6% in segments of 5 Mb to less than 0.2% for segments longer than 40 Mb (Figure S4).

Patterns of ancestry in IBD segments within a population differ markedly from those across populations (Figure 5): IBD segments within populations contain many ancestry switches. This indicates that many common ancestors lived after contact, and that the effective population sizes estimated using IBD largely reflect post-contact demography. The IBD patterns in cross-population IBD segments exhibited fewer ancestry switches than a random control (Figure S5), as may be expected if common ancestors often predate the onset of admixture. Cross-population IBD segments were also found to be overwhelmingly of European origin: among the 120 longest cross-population IBD segments, 117 are in European-inferred segments, two are among Native segments, and one is among African segments. This is not due to overall ancestry proportions, as can be observed by considering the alternate (non-IBD) haplotypes at the same positions (Figure S5). This is likely a result of the colonization history, in which European colonists rapidly spread from a relatively specific region over a large continent. This interpretation is supported by the ADMIXTURE analysis (Figure S6), showing a common cluster of ancestry for the European component dominant in PUR, CLM, MXL, and Andean populations, but not in CEU, Eskimo-Aleut, and Na-Dene. Finally, we were interested in testing whether the relationship between IBD and ancestry can be used to date

Figure 3. Ancestry tract length distribution in PUR (a) and CLM (b) compared to the predictions of the best-fitting migration model. Solid lines represent model predictions and shaded areas are one standard deviation confidence regions surrounding the predictions, assuming a Poisson distribution of counts per bin. The best-fitting models are displayed under each graph. Pie chart sizes indicate the proportion of migrants at each generation, and the pie parts represent the fraction of migrants of each origin at a given generation. Migrants are taken to have uniform continental ancestry. 'Single-pulse' admixture events occurring at non-integer time in generations are distributed among neighboring generations: in the CLM, the inferred onset was 13.02 generations ago (ga). The model involves founding 14 ga, but almost complete replacement 13 ga. At 30 years per generation [68], 14.9 ga corresponds to c. 1566, and 13 to c. 1623. Model parameters and confidence intervals are displayed in Table S1 in the Text S1 file. doi:10.1371/journal.pgen.1004023.g003
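The conversion from generations to calendar years in this caption is simple arithmetic. The sketch below assumes a reference year of 2013 (the publication year, not stated in the caption) together with the 30 years per generation given there; the function name is illustrative.

```python
def calendar_year(generations_ago: float, years_per_generation: float = 30.0,
                  reference_year: int = 2013) -> int:
    """Approximate calendar year for an event dated in generations before the reference year."""
    return round(reference_year - generations_ago * years_per_generation)

print(calendar_year(14.9))  # 1566
print(calendar_year(13))    # 1623
```

With that assumed reference year the function returns the two dates quoted in the caption.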

Figure 4. Number of IBD tracts by length bin in the three panel populations (independent of ancestry estimations), normalized by the number of individual pairs. The lower level of IBD in the MXL population indicates a much larger effective population size. doi:10.1371/journal.pgen.1004023.g004


Gravel et al., PLoS Genetics 9:e1004023 (2013)

Page 25

Population of the Americas 2

By calibrating our results using $T_A = 16$ kya, towards the most recent end of the range of plausible values for the peopling of the Americas (see e.g., [6] and references therein), we find a mutation rate of $1.44 \times 10^{-8}\,\mathrm{bp}^{-1}\,\mathrm{gen}^{-1}$ (bootstrap 95% CI: $1.32$–$1.53 \times 10^{-8}\,\mathrm{bp}^{-1}\,\mathrm{gen}^{-1}$), within the range of recently published human mutation rates [63]. The narrowest confidence interval reported in [63] was $1.05$–$1.5 \times 10^{-8}\,\mathrm{bp}^{-1}\,\mathrm{gen}^{-1}$, obtained from a de novo exome sequencing study [64]. Our sampling confidence interval is narrower than this value, but the main source of uncertainty here is the degree to which the bottleneck in our model reflects the bottleneck at the founding of the Americas, or the earlier split with the ancestors to the Chinese (CHB) and Japanese (JPT) sample, as well as uncertainty with respect to the timing of these two events (see Figure 7). The effect of changing the founding time or mutation rate assumptions would be to scale all parameters and confidence intervals according to $T \propto N \propto 1/\mu$. Thus the absolute uncertainty on individual parameters is larger than the sampling uncertainty suggests.
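The scaling of times and sizes with the inverse of the mutation rate can be illustrated numerically. This is a minimal sketch with illustrative names; the population size of 1,000 is a hypothetical placeholder and not a value from the paper.

```python
def rescale_for_mutation_rate(time_kya: float, pop_size: float,
                              mu_assumed: float, mu_new: float) -> tuple[float, float]:
    """Rescale a time and a population size when the assumed mutation rate changes.

    Times and sizes both scale as 1/mu, so each is multiplied by mu_assumed / mu_new.
    """
    if mu_assumed <= 0 or mu_new <= 0:
        raise ValueError("mutation rates must be positive")
    factor = mu_assumed / mu_new
    return time_kya * factor, pop_size * factor

# Calibration quoted in the text: founding time 16 kya with mu = 1.44e-8 per bp per generation.
# pop_size = 1000.0 is a hypothetical placeholder, not a value from the paper.
t_new, n_new = rescale_for_mutation_rate(16.0, 1000.0, 1.44e-8, 0.97e-8)
print(f"founding time under mu = 0.97e-8: {t_new:.1f} kya")
```

Lowering the mutation rate to 0.97e-8 pushes the founding time to roughly 24 kya, which illustrates why the absolute uncertainty on each parameter exceeds the sampling uncertainty alone.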

Estimating Native American allele frequencies

There is little publicly available genome-wide data about Native American genomic diversity. The 1000 Genomes dataset offers the opportunity to provide a diversity resource for Native American genomics by reconstructing the genetic makeup of Native American populations ancestral to the PUR, CLM, and MXL. This is particularly interesting in the case of the Puerto Rican population, where such reconstruction may be the only way to understand the genetic make-up of the pre-Columbian inhabitants of the Islands. Using the expectation maximization method presented in the Methods section, we estimated the allele frequencies in the Native-American-inferred part of the genomes of the sequenced individuals. These estimates are available at http://genomes.uprm.edu/cgi-bin/gb2/gbrowse/.

Figure 8 shows the distribution of the number of Native American haplotypes per site and the resulting confidence intervals for allele frequency in each population for exome capture target regions. Absolute confidence intervals are narrow for rare variants, and reach a maximum for SNPs at intermediate

Figure 7. Plausible parameter range for the human mutation rate and the founding time of the Native American populations. The shaded blue area is the 95% confidence interval from the current analysis. The horizontal line shows the lowest mutation rate estimate from [63], and the vertical line shows the lowest plausible date for the founding of the ancestral Native American populations according to [6]. The plausible region, given by the overlap of the three areas, would correspond to a mutation rate of $0.97$–$1.6 \times 10^{-8}\,\mathrm{bp}^{-1}\,\mathrm{gen}^{-1}$ and a Native American founding time of 15–24 kya. doi:10.1371/journal.pgen.1004023.g007

Figure 8. (a) Number of inferred Native American haplotypes per site, out of 120 CLM, 132 MXL, and 110 PUR haplotypes. (b) Distribution of confidence interval widths for allele frequency estimations among the exomic Native American segments of the three panels. doi:10.1371/journal.pgen.1004023.g008


Arrival: 16,000 years ago; MXL 12,200 years; PUR/CLM 11,700 years; migrations

Gravel et al., PLoS Genetics 9:e1004023 (2013)