STRUCTURE
Riccardo Negrini (riccardo.negrini@unicatt.it)
A model-based clustering method that uses molecular markers to:
Infer the properties of populations starting from single individuals
Classify individuals of unknown origin
Detect “cryptic” population structure
Identify immigrants
Identify admixed individuals
Demonstrate the presence of population structure
Distance-based methods
Easy to apply and visually appealing
but
The clusters identified depend heavily on the distance measure and on the graphical representation chosen
Difficult to assess the level of confidence of the clusters obtained
Difficult to incorporate additional information
More suited to exploratory data analysis than to fine statistical inference
[Figure: Dice similarity and multivariate analysis; plotted breeds: Marchigiana, Italian Limousine, Romagnola]
[Figure: distribution of Dice similarity between breeds (dotted lines) and within breed, x axis from 0.64 to 0.82. Series: ROM/FRI, ROM/CHI, ROM/MCG, ROM/LMI, ROM/ROM]
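For dominant markers, the Dice similarity between two individuals is commonly computed from band presence/absence profiles as 2a / (2a + b + c), where a counts the shared bands and b and c count the bands present in only one of the two profiles. A minimal sketch, assuming profiles coded 1 (band present) and 0 (band absent); the function name and the example profiles are illustrative and not taken from the slides:

```python
def dice_similarity(profile_a, profile_b):
    """Dice similarity between two band presence/absence profiles (1/0)."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must score the same markers")
    shared = sum(1 for x, y in zip(profile_a, profile_b) if x == 1 and y == 1)
    bands_a = sum(profile_a)          # a + b
    bands_b = sum(profile_b)          # a + c
    if bands_a + bands_b == 0:
        return 0.0                    # no bands at all: undefined, reported as 0
    return 2 * shared / (bands_a + bands_b)


# Hypothetical profiles of two animals scored at six markers
animal_1 = [1, 1, 0, 1, 0, 1]
animal_2 = [1, 0, 0, 1, 1, 1]
similarity = dice_similarity(animal_1, animal_2)
```

Within-breed and between-breed distributions like those in the figure would be obtained by applying such a function to all pairs of individuals.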
STRUCTURE main assumptions:
Hardy-Weinberg (H-W) equilibrium within populations
Linkage equilibrium (LE) between loci within populations
STRUCTURE accounts for the presence of H-W disequilibrium and linkage disequilibrium (LD) by introducing population structure, and attempts to find population groupings that (as far as possible) are not in disequilibrium
STRUCTURE does not assume a particular mutation process, so it can be used with the most common molecular markers (STR, RFLP, SNP, AFLP). Sequence data, Y chromosome or mtDNA haplotypes have to be recoded as a single locus with many alleles
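As a reminder of what the first assumption implies: under H-W equilibrium, a biallelic locus with allele frequencies p and q = 1 - p has expected genotype frequencies p², 2pq and q². The sketch below compares observed genotype counts with these expectations through a chi-square statistic. The function name and the counts are hypothetical, and this check is not part of STRUCTURE itself.

```python
def hardy_weinberg_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square statistic of observed genotype counts against H-W expectations."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)        # frequency of allele A
    q = 1.0 - p
    expected = [n * p * p, n * 2 * p * q, n * q * q]
    observed = [n_AA, n_Aa, n_aa]
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return p, expected, chi_square


# Hypothetical sample of 100 individuals that matches H-W proportions
p, expected, chi_square = hardy_weinberg_chi_square(n_AA=36, n_Aa=48, n_aa=16)

# A pooled sample of two differentiated populations shows a heterozygote deficit
_, _, chi_square_mixed = hardy_weinberg_chi_square(n_AA=50, n_Aa=0, n_aa=50)
```

The second call illustrates why unrecognized population structure shows up as a departure from H-W.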
STRUCTURE adopts a BAYESIAN approach:
Let X denote the genotypes of the sampled individuals
Let Z denote the unknown populations of origin of the individuals
Let P denote the unknown allele frequencies in all populations
Under H-W and LE, each allele at each locus in each genotype is an independent draw from the appropriate frequency distribution
Having observed X, the knowledge about Z and P is given by the posterior distribution, which by Bayes' theorem is proportional to the priors times the likelihood:
Pr(Z, P|X) ∝ Pr(Z) Pr(P) Pr(X|Z, P)
It is not possible to compute this distribution exactly, but it is possible to obtain approximate samples of Z and P using MCMC and then make inferences based on summary statistics of these samples
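The MCMC idea can be sketched for the simplest case (no admixture, no missing data) as a Gibbs sampler that alternates between sampling P given X and Z, and sampling Z given X and P. This is a teaching sketch under stated assumptions, not the STRUCTURE implementation: the function names, the Dirichlet(λ) prior with λ = 1, the toy data and the run lengths are all illustrative.

```python
import math
import random


def sample_dirichlet(alphas, rng):
    """Draw one vector from a Dirichlet distribution using gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def gibbs_no_admixture(X, K, n_alleles, n_iter=200, burn_in=100, lam=1.0, seed=1):
    """X[i][l] is the pair of allele codes (0..n_alleles-1) of individual i at locus l.

    Returns, for each individual, the fraction of post-burn-in iterations
    spent in each of the K clusters (an estimate of Pr(z_i = k | X)).
    """
    rng = random.Random(seed)
    n_ind, n_loci = len(X), len(X[0])
    Z = [rng.randrange(K) for _ in range(n_ind)]            # random starting labels
    visits = [[0] * K for _ in range(n_ind)]
    for it in range(n_iter):
        # Step 1: sample P | X, Z from Dirichlet(lam + allele counts)
        P = []
        for k in range(K):
            P_k = []
            for l in range(n_loci):
                counts = [0] * n_alleles
                for i in range(n_ind):
                    if Z[i] == k:
                        for allele in X[i][l]:
                            counts[allele] += 1
                P_k.append(sample_dirichlet([lam + c for c in counts], rng))
            P.append(P_k)
        # Step 2: sample Z | X, P, with a uniform prior 1/K over populations
        for i in range(n_ind):
            log_w = [
                sum(math.log(P[k][l][a]) for l in range(n_loci) for a in X[i][l])
                for k in range(K)
            ]
            top = max(log_w)
            weights = [math.exp(w - top) for w in log_w]
            Z[i] = rng.choices(range(K), weights=weights)[0]
            if it >= burn_in:
                visits[i][Z[i]] += 1
    kept = n_iter - burn_in
    return [[v / kept for v in row] for row in visits]


# Toy data (hypothetical): 4 individuals, 3 loci, 2 alleles per locus
X = [
    [(0, 0), (0, 0), (0, 1)],
    [(0, 0), (0, 1), (0, 0)],
    [(1, 1), (1, 1), (1, 0)],
    [(1, 1), (1, 1), (1, 1)],
]
membership = gibbs_no_admixture(X, K=2, n_alleles=2)
```

Cluster labels are arbitrary, so cluster 0 in one run may correspond to cluster 1 in another.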
Bayesian inference: basic principles
No logical distinction between parameters and data. Both are random variables: data are “observed” and parameters are “unobserved”
The PRIOR encapsulates information about the values of the parameters before observing the data
The LIKELIHOOD is a conditional distribution that specifies the probability of the data at any particular value of the parameters
The aim of Bayesian inference is to calculate the POSTERIOR distribution of the parameters (the conditional distribution of the parameters given the data)
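These three ingredients can be made concrete with a single allele frequency. In the sketch below the prior on the frequency p of an allele is Beta(a, b), the likelihood is binomial in the observed allele counts, and the posterior is again a Beta distribution (a standard conjugate result). The prior parameters and the counts are hypothetical.

```python
def beta_posterior(prior_a, prior_b, n_allele, n_other):
    """Posterior Beta parameters for an allele frequency after observing counts.

    Prior:      p ~ Beta(prior_a, prior_b)
    Likelihood: binomial in the counts
    Posterior:  p | data ~ Beta(prior_a + n_allele, prior_b + n_other)
    """
    post_a = prior_a + n_allele
    post_b = prior_b + n_other
    posterior_mean = post_a / (post_a + post_b)
    return post_a, post_b, posterior_mean


# Flat prior Beta(1, 1); 14 copies of the allele among 20 sampled allele copies
post_a, post_b, posterior_mean = beta_posterior(1, 1, n_allele=14, n_other=6)
```

The posterior mean (15/22) lies between the prior mean (0.5) and the observed frequency (0.7), and moves toward the data as the sample grows.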
FORMAT OF THE DATA FILE:
Label  Pop  Flag  Locus 1  Locus 2  Locus 3  …  Locus n
Chi1   1    1     145      92       113      …
Chi1   1    1     145      98       115      …
Chi2   1    1     143      90       115      …
Chi2   1    1     -9       90       119      …
Chi3   1    0     151      155      117      …
Chi3   1    0     145      92       119      …
Rom1   2    0     145      98       121      …
Rom1   2    0     143      90       125      …
Rom2   2    0     -9       90       125      …
Rom2   2    0     -9       94       123      …
Flag: indicates the learning samples
Alleles in rows: each individual takes two rows, one allele per row, coded by size
-9: missing data
Plain-text (.txt) file with tab-separated columns
Dominant data: code band presence (AA or Aa) as 1 and band absence (aa) as 2, and enter the second allele as missing data (-9)
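A file laid out as above can be read with a few lines of Python. The sketch assumes exactly the layout shown on the slide (one header row, then Label, Pop, Flag and one column per locus, two rows per individual, tab-separated); the function name and the returned dictionary structure are illustrative choices.

```python
import csv
import io

MISSING = "-9"


def read_structure_file(handle):
    """Parse a tab-delimited STRUCTURE-style file with two rows per individual.

    Returns {label: {"pop": str, "flag": str, "genotypes": [(a1, a2), ...]}},
    with missing alleles (-9) stored as None.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader)                                    # skip the header row
    individuals = {}
    for row in reader:
        if not row:
            continue                                # ignore blank lines
        label, pop, flag = row[0], row[1], row[2]
        alleles = [None if a == MISSING else int(a) for a in row[3:]]
        record = individuals.setdefault(label, {"pop": pop, "flag": flag, "rows": []})
        record["rows"].append(alleles)
    for record in individuals.values():
        first, second = record.pop("rows")          # exactly two rows are expected
        record["genotypes"] = list(zip(first, second))
    return individuals


# Four rows taken from the example table above
example = (
    "Label\tPop\tFlag\tLocus1\tLocus2\tLocus3\n"
    "Chi2\t1\t1\t143\t90\t115\n"
    "Chi2\t1\t1\t-9\t90\t119\n"
    "Rom1\t2\t0\t145\t98\t121\n"
    "Rom1\t2\t0\t143\t90\t125\n"
)
data = read_structure_file(io.StringIO(example))
```

An individual with one or three rows makes the unpacking fail, which is a cheap way to catch a malformed file early.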
BUILDING A PROJECT:
[Screenshots: Step 1, Step 2, Step 3, Step 4]
… if everything goes well:
MODELLING DECISIONS:
Ancestry model:
No admixture model: each individual comes purely from one of the K populations. The output is the posterior probability that individual i comes from population k. The prior probability for each population is 1/K. Appropriate for discrete populations and for dominant data
Admixture model: individuals may have mixed ancestry, i.e. individual i has inherited some fraction of its genome from ancestors in population k. The output is the posterior mean estimate of these proportions
Linkage model: if, t generations in the past, there was an admixture event that mixed the K populations, any individual chromosome is composed of “chunks” inherited as discrete units from ancestors at the time of admixture
Using prior population information: by default STRUCTURE clusters individuals on the genetic data alone; using prior population information is an option, and it is not recommended in the exploratory preliminary analysis of the data. POPFLAG specifies which samples are used as learning samples to assist clustering
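The difference between the no admixture and admixture models can be seen in how the probability of a genotype is built. In the sketch below, q holds the ancestry proportions of one individual: under admixture each allele copy has probability Σ_k q_k p_k(allele), while q = (1, 0, …) recovers the no admixture case. Allele frequencies are treated as known here, whereas STRUCTURE estimates them jointly; all names and values are hypothetical.

```python
import math


def admixture_log_likelihood(genotype, q, freqs):
    """Log probability of one genotype given ancestry proportions q.

    genotype: list of (allele, allele) pairs, one pair per locus.
    q[k]: fraction of the genome inherited from population k (sums to 1).
    freqs[k][l][allele]: frequency of `allele` at locus l in population k.
    """
    log_lik = 0.0
    for l, pair in enumerate(genotype):
        for allele in pair:
            copy_prob = sum(q[k] * freqs[k][l][allele] for k in range(len(q)))
            log_lik += math.log(copy_prob)
    return log_lik


# Hypothetical frequencies: K=2 populations, one locus, alleles "A" and "B"
freqs = [
    [{"A": 0.9, "B": 0.1}],    # population 1
    [{"A": 0.1, "B": 0.9}],    # population 2
]
heterozygote = [("A", "B")]
pure = admixture_log_likelihood(heterozygote, q=[1.0, 0.0], freqs=freqs)
mixed = admixture_log_likelihood(heterozygote, q=[0.5, 0.5], freqs=freqs)
```

For this heterozygote the mixed-ancestry explanation is more likely than pure ancestry in population 1, which is the kind of signal the admixture model picks up.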
Frequency model:
Allele frequencies correlated: assumes that allele frequencies in the different populations are likely to be similar, probably due to migration or shared ancestry. The K populations represented in the dataset have each undergone independent drift away from the ancestral allele frequencies
Allele frequencies independent: assumes that the allele frequencies in each population are independent draws from a distribution specified by a parameter λ. The prior says that we expect the allele frequencies in each population to be reasonably different from each other
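One way to picture the correlated model: in the formulation of Falush et al. (2003), the allele frequencies of population k at a locus are drawn from a Dirichlet distribution centered on the ancestral frequencies, with parameters p_A(j) (1 - F_k) / F_k, so that a small F_k (little drift) keeps the population close to the ancestor. A minimal sketch of such a draw; the ancestral frequencies, F values and function name are hypothetical.

```python
import random


def drifted_frequencies(ancestral, F, rng):
    """Draw population allele frequencies that have drifted from `ancestral`.

    Dirichlet parameters are ancestral[j] * (1 - F) / F; small F means little drift.
    The Dirichlet draw is obtained by normalizing independent gamma variates.
    """
    scale = (1.0 - F) / F
    draws = [rng.gammavariate(p * scale, 1.0) for p in ancestral]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(42)
ancestral = [0.5, 0.3, 0.2]
close_pop = drifted_frequencies(ancestral, F=0.01, rng=rng)    # little drift
distant_pop = drifted_frequencies(ancestral, F=0.5, rng=rng)   # strong drift
```

Repeating the draws shows that populations with F = 0.01 stay near (0.5, 0.3, 0.2), while those with F = 0.5 scatter widely.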
How long to run the program?
Length of burn-in period: number of MCMC iterations needed to reach the “stationary distribution”: from then on, the states the chain visits follow the probability distribution of interest (e.g. Pr(Z, P|X)), which no longer depends on the number of iterations or on the initial state of the variables.
Number of MCMC iterations after burn-in: number of iterations run after the burn-in to obtain accurate parameter estimates
Loosely speaking: a burn-in of 10,000 to 100,000 iterations is usually adequate. Good estimates of the parameters P and Q can be obtained with fairly short runs (100,000). Accurate estimation of Pr(X|K) needs quite long runs (10^6)
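The role of the burn-in can be illustrated on a toy trace: estimates are averages over the sampled values, and the iterations before the chain reaches its stationary region are discarded so that the starting point does not bias them. The trace below is invented for illustration and is far shorter than any real run.

```python
def posterior_mean_after_burn_in(chain, burn_in):
    """Average the sampled values of a parameter, discarding the burn-in."""
    kept = chain[burn_in:]
    if not kept:
        raise ValueError("no iterations left after burn-in")
    return sum(kept) / len(kept)


# Hypothetical trace of one ancestry coefficient: the chain starts far from
# the stationary region and then fluctuates around 0.8
trace = [0.10, 0.35, 0.60, 0.78, 0.82, 0.79, 0.81, 0.80]
estimate = posterior_mean_after_burn_in(trace, burn_in=3)
naive = sum(trace) / len(trace)     # biased toward the arbitrary starting value
```

Plotting such traces for the run-time summary statistics is a simple way to judge whether the chosen burn-in was long enough.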
How to choose K (number of populations)?
There is no fixed rule, only an iterative approach: try different values of K, different burn-in lengths and different numbers of MCMC iterations after burn-in.
Be careful:
Run several independent runs for each K to verify the consistency of the estimates across runs
Population structure leads to LD among unlinked loci and to departures from H-W. These are the signals used by STRUCTURE, but inbreeding, genotyping errors or null alleles can produce the same effect.
To fully resolve all the groups in your dataset, test increasing values of K until the highest likelihood values are reached
To determine only the rough relationships, use a low K
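The STRUCTURE documentation describes an ad hoc way to compare values of K: with a uniform prior on K, Pr(K|X) is proportional to Pr(X|K), so the estimated ln Pr(X|K) values can be normalized into approximate posterior probabilities. The sketch below does this after averaging replicate runs and also reports the spread between runs. The ln Pr(X|K) values are invented, and averaging over runs is an illustrative choice rather than a prescribed procedure.

```python
import math


def posterior_over_K(ln_prob_by_K):
    """Approximate Pr(K | X) from estimated ln Pr(X | K), uniform prior on K."""
    top = max(ln_prob_by_K.values())                 # subtract the max for stability
    weights = {K: math.exp(v - top) for K, v in ln_prob_by_K.items()}
    total = sum(weights.values())
    return {K: w / total for K, w in weights.items()}


# Hypothetical estimates of ln Pr(X|K) from three independent runs per K
runs = {
    1: [-4356.0, -4355.8, -4356.3],
    2: [-3983.1, -3984.0, -3983.5],
    3: [-3982.2, -3981.7, -3982.4],
    4: [-3983.9, -4010.2, -3990.6],
}
spread = {K: max(v) - min(v) for K, v in runs.items()}    # consistency across runs
mean_ln = {K: sum(v) / len(v) for K, v in runs.items()}
posterior = posterior_over_K(mean_ln)
best_K = max(posterior, key=posterior.get)
```

A large spread, as for K=4 in this invented example, suggests that the runs have not converged to the same solution and should be lengthened or repeated before K is interpreted.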
INTERPRETING THE OUTPUT:
The screen during the run
Number of MCMC iterations
Divergence between populations, calculated as Fst
Log likelihood of the data given the current values of P and Q
Current estimate of ln Pr(X|K), averaged over all iterations since the end of the burn-in period
The output file
Current estimate of ln Pr(X|K), averaged over all iterations since the end of the burn-in period
Q output without using prior information
Estimated membership in the clusters (K=3) and 90% probability intervals (ANCESTDIST turned on)
Q output using prior information
Posterior probability of belonging to the presumed population
Estimated probability of belonging to the second population, or of having a parent or grandparent that belongs to the second population
PLOT THE RESULTS
• color = cluster
• several colors in one line: genetic components of the individual
• one vertical line per individual
INFERRING POPULATION STRUCTURE
RESGEN PROJECT: Towards a strategy for the conservation of the genetic diversity of European cattle
THE DATASET
More than 60 cattle breeds from Europe
5 African Bos indicus breeds
20 individuals per breed
30 microsatellites
Structure parameters:
Admixture model
Allele frequencies correlated
No prior information
Model-based clustering: European cattle
[Figure: STRUCTURE plot at k=2, Europe (EUR) vs Africa (AFR). Breeds shown:]
Swedish Red Polled, Bohemian Red, Polish Red, Red Danish, Angeln, MRY, Red HF dual, Red HF dairy, Groningen WH,
Swiss HF, British HF, Jutland 1950, Dutch Belted, German BP-W, Friesian-Holland, Belgian Blue, Germ. Shorthorn, Maine-Anjou, Normande,
Bretonne BP, Charolais, Ayrshire, Highland, Hereford, Dexter, Aberdeen Angus, Jersey, Guernsey, Betizu A,
Betizu B, Pirenaica, Blonde d'Aquitaine, Limousin, Bazadais, Gasconne, Aubrac, Salers, Montbéliard, Pezzata Rossa Ital.,
Germ. Simmental, Simmental, Hinterwaelder, German Yellow, Evolene, Eringer, Piemontese, Grigio Alpina, Rendena, Cabannina,
Swiss Brown, Germ. Br. Württemberg, Germ. Br. Bavaria, Germ. Br. Orig, Bruna Pirineds, Menorquina, Mallorquina, Retinta, Morucha, Avilena,
Sayaguesa, Alistano, Rubia Gallega, Asturiana Valles, Asturiana Montana, Tudanca, Tora de Lidia, Casta Navarra, Hungarian Grey, Istrian,
Podolica, Romagnola, Chianina, N'Dama, Somba, Lagunaire, Borgou, Zebu Peul
[Enlarged panel: Zebu Peul, Hungarian Grey, Istrian, Podolica, Romagnola, Chianina, N'Dama, Somba, Lagunaire, Borgou]
Zebu influence in Podolian breeds
Model-based clustering: European cattle
[Figure: STRUCTURE plots of the same breeds at k=2, k=5, k=7 and k=9. Cluster labels: Podolian, Iberian, Alpine Brown, Alpine Intermediates, Alpine Spotted, French Brown, British, Lowland Pied, Baltic Red, Nordic, North-West Intermediates]
9 homogeneous clusters + 2 intermediate zones.
Courtesy of Dr. J. A. Lenstra, Dr. I. Nijman and the Resgen Consortium
INTRABIODIV: Tracking surrogates for intraspecific biodiversity: towards efficient selection strategies for the conservation of natural genetic resources using comparative mapping and modelling approaches
Phylogeography of Geum reptans
• 59 localities
• 177 samples
• ≈80 polymorphic AFLP markers
[Figure, shown on two slides: Geum reptans; legend: High diversity, Low diversity]
Phylogeography of Ligusticum mutellinoides
• 127 localities
• 381 samples
• 123 polymorphic AFLP markers
[Figure: Ligusticum mutellinoides; legend: High diversity, Low diversity]
Courtesy of Dr. P. Taberlet and the Intrabiodiv Consortium
PERFORMING AN ASSIGNMENT TEST
THE REFERENCE DATASET
[Map of the breeds: Piemontese, Cabannina, Chianina, Calvana, Mucca Pisana, Maremmana, Romagnola, Limousine, Marchigiana, Frisona, Rendena, Pezzata Rossa It., Podolica, Bruna, Grigio Alpina, Valdostana Pezzata Rossa]
16 breeds reared in Italy
416 individuals
3 AFLP primer combinations
132 polymorphisms
Information on origins
Checking the reference dataset
98% of individuals correctly assigned with p>90% (91% with p>99%)
100% of Romagnola individuals from the genetic center assigned with p>99%
20,000 burn-in + 50,000 routine MCMC iterations; 8 independent runs
[Figure: two plots of assignment probability (y axis: Probability, 0 to 1) by breed, each with a 90% threshold line. First plot: LMI, BRU, MCG, MMA, FRI, CHI, MUP, ROM. Second plot: CAL, VPR, GAL, PIM, PRI, POD, CAB, REN]
THE BLIND TEST
44 Romagnola individuals, randomly selected
3 AFLP primer combinations; 132 polymorphisms
No prior information
THE RESULTS
[Figure: assignment probability (0 to 1) of the ROMAGNOLA blind-test individuals to the different breeds of the reference dataset: BRU, MCG, LIM, GAL, VPR, MUP, CAL, FRI, CHI, MMA, PRI, PIM, POD, CAB, REN]
36 Romagnola cattle assigned with p>99%
4 Romagnola cattle assigned with 90%<p<99%
4 Romagnola cattle not assigned
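The three categories used to summarize the blind test can be expressed as a simple thresholding rule on the assignment probability to the expected breed. The sketch below uses the 90% and 99% thresholds from the slides; the function name and the example probabilities are hypothetical.

```python
def classify_assignment(probabilities, expected_breed):
    """Label one individual as p>99%, 90%<p<99% or not assigned.

    probabilities: {breed code: assignment probability} for one individual.
    """
    p = probabilities.get(expected_breed, 0.0)
    if p > 0.99:
        return "assigned, p>99%"
    if p > 0.90:
        return "assigned, 90%<p<99%"
    return "not assigned"


# Hypothetical assignment probabilities for three blind-test animals
animals = [
    {"ROM": 0.997, "MCG": 0.002, "CHI": 0.001},
    {"ROM": 0.950, "MCG": 0.040, "CHI": 0.010},
    {"ROM": 0.550, "MCG": 0.300, "CHI": 0.150},
]
labels = [classify_assignment(a, "ROM") for a in animals]
```

Counting the labels over all tested animals gives a summary of the same form as the one reported above.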
For those who are very interested:
• Yang BZ, Zhao H, Kranzler HR, Gelernter J. Practical population group assignment with selected informative markers: characteristics and properties of Bayesian clustering via STRUCTURE. Genet Epidemiol. 2005 May;28(4):302-12.
• Sullivan PF, Walsh D, O'Neill FA, Kendler KS. Evaluation of genetic substructure in the Irish Study of High-Density Schizophrenia Families. Psychiatr Genet. 2004 Dec;14(4):187-9.
• Lucchini V, Galov A, Randi E. Evidence of genetic distinction and long-term population decline in wolves (Canis lupus) in the Italian Apennines. Mol Ecol. 2004 Mar;13(3):523-36.
• Peever TL, Salimath SS, Su G, Kaiser WJ, Muehlbauer FJ. Historical and contemporary multilocus population structure of Ascochyta rabiei (teleomorph: Didymella rabiei) in the Pacific Northwest of the United States. Mol Ecol. 2004 Feb;13(2):291-309.
• Falush D, Stephens M, Pritchard JK. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics. 2003 Aug;164(4):1567-87.
• Bamshad MJ, Wooding S, Watkins WS, Ostler CT, Batzer MA, Jorde LB. Human population genetic structure and inference of group membership. Am J Hum Genet. 2003 Mar;72(3):578-89. Epub 2003 Jan 28.
• Koskinen MT. Individual assignment using microsatellite DNA reveals unambiguous breed identification in the domestic dog. Anim Genet. 2003 Aug;34(4):297-301.
• Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW. Genetic structure of human populations. Science. 2002 Dec 20;298(5602):2381-5.
• Rosenberg NA, Burke T, Elo K, Feldman MW, Freidlin PJ, Groenen MA, Hillel J, Maki-Tanila A, Tixier-Boichard M, Vignal A, Wimmers K, Weigend S. Empirical evaluation of genetic clustering methods using multilocus genotypes from 20 chicken breeds. Genetics. 2001 Oct;159(2):699-713.
• Randi E, Pierpaoli M, Beaumont M, Ragni B, Sforzi A. Genetic identification of wild and domestic cats (Felis silvestris) and their hybrids using Bayesian clustering methods. Mol Biol Evol. 2001 Sep;18(9):1679-93.