Organizational Heterogeneity of Human Genome: Significant variation of recombination rate of 100 kbp sequences within GC ranges

Svetlana Frenkel, Valery Kirzhner, Abraham Korol
Department of Evolutionary and Environmental Biology, Institute of Evolution, University of Haifa

Upload: svetlana-frenkel

Post on 15-Apr-2017


Page 1: Organizational Heterogeneity of Human Genome

Organizational Heterogeneity of Human Genome:

Significant variation of recombination rate of 100 kbp sequences within GC ranges

Svetlana Frenkel, Valery Kirzhner, Abraham Korol

Department of Evolutionary and Environmental Biology, Institute of Evolution

University of Haifa

Page 2: Organizational Heterogeneity of Human Genome

Some aspects of intra-genome heterogeneity

• Varying gene density
• Clusters of tissue-specific and housekeeping genes
• Linkage disequilibrium (LD) blocks
• Mutation and recombination rates
• Conserved and ultraconserved segments
• Localization of inversions, deletions, insertions and duplications

Page 3: Organizational Heterogeneity of Human Genome

Genome Heterogeneity: GC content

From: Costantini, M., Clay, O., Auletta, F., Bernardi, G. (2006) An isochore map of human chromosomes. Genome Res., 16, 536-541.

From: UHN Microarray Centre's CpG Island Database http://data.microarrays.ca/cpg/index.htm

The level of redness denotes the relative number of CpG islands that can be located on the chromosome in that region

Page 4: Organizational Heterogeneity of Human Genome

Genome Signature (Samuel Karlin et al., 1997)

Local:
• preliminary searches of candidates for gene alignment
• detecting candidate regulatory signals
• detecting promoter regions
• detecting repetitive elements
• duplications of genomic
• horizontal gene transfer

Genome-wide:
• phylogenetic analysis
• species recognition
• whole-genome sequence comparisons

Page 5: Organizational Heterogeneity of Human Genome

Linguistic-like methods

• Detecting all “words” with a certain maximal length
• Characterizing the sequence “vocabulary”
• Scoring the occurrences of fixed-length “words” from a predefined “vocabulary”
• Comparison of “word” frequencies obtained from different sequences
• Comparison of the “vocabularies” of different sequences

Compositional Spectra Analysis

Page 6: Organizational Heterogeneity of Human Genome

Compositional Spectra

A linguistic-like method of genome analysis based on occurrences of “words” in the A, C, G, T alphabet.

The compositional spectrum (CS) is measured as a histogram of imperfect word occurrences.

From: V. Kirzhner et al., 2002-2005
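A minimal sketch of how a histogram of imperfect word occurrences could be computed. It assumes that an "imperfect occurrence" is a window of the sequence that differs from the word in at most a fixed number of positions. The function names, the toy word set and the mismatch threshold are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def within_mismatches(window: str, word: str, max_mismatches: int) -> bool:
    """True if two equal-length strings differ in at most max_mismatches positions."""
    mismatches = 0
    for a, b in zip(window, word):
        if a != b:
            mismatches += 1
            if mismatches > max_mismatches:
                return False
    return True


def compositional_spectrum(sequence: str, words: Sequence[str],
                           max_mismatches: int = 1) -> List[int]:
    """Count, for every word, the sequence windows that match it imperfectly.

    The result is one histogram bin per word (assumed definition of the CS).
    """
    counts = [0] * len(words)
    for k, word in enumerate(words):
        size = len(word)
        for start in range(len(sequence) - size + 1):
            if within_mismatches(sequence[start:start + size], word, max_mismatches):
                counts[k] += 1
    return counts


if __name__ == "__main__":
    toy_sequence = "ACGTACGAAC"   # hypothetical toy data
    toy_words = ["ACG", "AAC"]    # hypothetical word set W
    print(compositional_spectrum(toy_sequence, toy_words, max_mismatches=0))
    print(compositional_spectrum(toy_sequence, toy_words, max_mismatches=1))
```

The scan is quadratic in the worst case and is meant only to make the definition concrete; a real analysis of 100 kbp segments would need an indexed or vectorized approach.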

Page 7: Organizational Heterogeneity of Human Genome

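Methods: calculating distances

[Figure: compositional spectra F(Si, W), F(S’i, W), F(Sj, W), F(S’j, W), F(Si, W’) and F(Sj, W’), with 5’ and 3’ strand labels and the pairwise distances d1, d’1, d2, d’2]

Distance measures:
• Manhattan (city block) distance
• Spearman rank correlation ρ (d = 1 − ρ)
• Kendall distance τ

d = min(di, d’i, dj, d’j)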
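The distance measures named on this slide can be sketched as follows. This is an illustration under assumptions: spectra are plain lists of counts, ties are given average ranks, and `combined_distance` is a hypothetical helper that only takes the minimum over whatever candidate distances are supplied, since the slide does not spell out which spectrum pairs define di, d’i, dj and d’j.

```python
from typing import Iterable, List, Sequence


def manhattan(u: Sequence[float], v: Sequence[float]) -> float:
    """City-block distance between two spectra of equal length."""
    return float(sum(abs(a - b) for a, b in zip(u, v)))


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """d = 1 - rho, with rho computed as the Pearson correlation of the ranks."""
    ru, rv = average_ranks(u), average_ranks(v)
    n = len(ru)
    mean_u, mean_v = sum(ru) / n, sum(rv) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(ru, rv))
    var_u = sum((a - mean_u) ** 2 for a in ru)
    var_v = sum((b - mean_v) ** 2 for b in rv)
    if var_u == 0 or var_v == 0:
        raise ValueError("rho is undefined for a constant spectrum")
    return 1.0 - cov / (var_u * var_v) ** 0.5


def combined_distance(candidates: Iterable[float]) -> float:
    """Minimum over candidate distances, as in d = min(di, d'i, dj, d'j)."""
    return min(candidates)


if __name__ == "__main__":
    f_i, f_j = [5, 1, 3, 0], [4, 2, 3, 1]   # hypothetical toy spectra
    print(manhattan(f_i, f_j), spearman_distance(f_i, f_j))
```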

Page 8: Organizational Heterogeneity of Human Genome

Methods: Detection of Organizational Pattern groups of segments

[Figure: clustering tree of genome segments. Labels: genome segment number; color scale from low to high; relative distance between two clusters; maximal distance between segments]

Neighbor-joining clustering with an “adaptive cutoff”

Page 9: Organizational Heterogeneity of Human Genome

Analysis of Organizational Pattern groups of segments


Page 10: Organizational Heterogeneity of Human Genome

Significant variation of evolutionary features of 100 kbp sequences within GC ranges

Testing for potential association between the genome-wide distribution of organizational patterns (OPs) and various evolutionary and structural features reveals inter-OP heterogeneity in SNP and indel frequency, recombination rate, number of segmental duplications, size of linkage disequilibrium blocks, and proportion of evolutionarily conserved sequence.

Page 11: Organizational Heterogeneity of Human Genome

Estimation of heterogeneity between OP groups

[Figure; axis labels: GC, Recombination Rate]

Page 12: Organizational Heterogeneity of Human Genome

Estimation of heterogeneity between OP groups

[Figure; axis labels: GC, −log(FDR-corrected p-value)]

Values shown on the figure, in slide order:
0.22, 8.8×10^-5, 8.8×10^-5, 8.8×10^-5, 8.8×10^-5, 8.8×10^-5, 8.8×10^-5, 8.8×10^-5, 8.8×10^-5, 0.03, 1.9×10^-3, 0.01, 0.11, 3.9×10^-3
2.3, 5.1, 86.1, 48.6, 81.9, 35.7, 21.0, 26.0, 46.7, 36.6, 13.6, 15.7, 15.5, 16.9

• Kruskal–Wallis non-parametric rank test
• 10,000 segment reshuffles to estimate the test critical value
• FDR correction for multiple comparisons
• Reshuffled sequences within every segment as control
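The testing steps listed on this slide can be sketched as below. Two points are assumptions, not taken from the slide: the reshuffles are implemented as random reassignment of segments to groups of the original sizes, and the FDR correction is the Benjamini-Hochberg procedure. The H statistic is computed without a tie correction, which does not change the permutation p-value because that correction depends only on the pooled values.

```python
import random
from typing import List, Sequence


def _average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def kruskal_wallis_h(groups: Sequence[Sequence[float]]) -> float:
    """Kruskal-Wallis H statistic (no tie correction)."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = _average_ranks(pooled)
    total, pos = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[pos:pos + len(g)])
        total += rank_sum ** 2 / len(g)
        pos += len(g)
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)


def permutation_p_value(groups: Sequence[Sequence[float]],
                        n_reshuffles: int = 10000, seed: int = 0) -> float:
    """Share of reshuffles whose H is at least the observed H (with +1 smoothing)."""
    rng = random.Random(seed)
    observed = kruskal_wallis_h(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_reshuffles):
        rng.shuffle(pooled)
        shuffled, pos = [], 0
        for size in sizes:
            shuffled.append(pooled[pos:pos + size])
            pos += size
        if kruskal_wallis_h(shuffled) >= observed - 1e-9:
            hits += 1
    return (hits + 1) / (n_reshuffles + 1)


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """FDR-adjusted p-values (Benjamini-Hochberg step-up; assumed procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    toy_groups = [[0.8, 0.9, 1.1], [1.5, 1.7, 2.0]]   # hypothetical RR values per OP group
    print(kruskal_wallis_h(toy_groups), permutation_p_value(toy_groups, 1000))
```

In this reading, one permutation p-value would be obtained per GC range and the resulting list passed to `benjamini_hochberg`.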

Page 13: Organizational Heterogeneity of Human Genome

Detecting the words related to recombination rate

GC, %   Average RR in the compared OPGs   Proportion of correct classifications of segments to OP groups, %
        low RR    high RR                 all words   set of 47 words   set of 8 words
35      0.82      0.93                    98.60       98.62             76.03
36      0.62      1.16                    98.40       96.56             82.34
37      0.83      1.28                    94.10       93.88             80.47
38      0.80      1.46                    99.58       99.17             98.33
39      0.91      1.59                    97.32       97.32             96.55
40      0.96      1.50                    100.0       100.0             100.0
41      1.13      1.81                    98.80       98.50             98.50
42      1.05      1.80                    100.0       100.0             99.62
43      1.29      1.99                    97.48       96.98             95.46
44      1.44      1.83                    99.01       99.21             98.81
45      1.35      2.06                    100         98.93             98.22
46      1.30      1.88                    98.53       98.53             97.35
47      1.15      1.74                    94.62       94.61             91.48
48      1.33      2.04                    98.78       98.77             97.55

Page 14: Organizational Heterogeneity of Human Genome

Oligonucleotides that showed high importance in more than half of the OPG comparisons when classifying 100 kbp segments by high versus low recombination rate

Columns: oligonucleotide; GC, %; appeared in the list of 10 most important variables (times); appeared as the most important variable (times); previously described pattern, shown aligned above the oligonucleotide; reference.

Oligonucleotide  GC, %  Top 10 (times)  Top 1 (times)  Previously described pattern  Reference
CAGCCAGGTT       60     11              4              -CCNCCNTNNCCNC-               Myers et al. 2008
                                                       -CAGCCAGGTT----
GACCGGACTG       70     10              1              ---CCTCCCT--                  Myers et al. 2005
                                                       -GACCGGACTG-
                                                       -CCNCCNTNNCCNC-               Myers et al. 2008
                                                       ---GACCGGACTG--
CGCCGGGACT       80     10              3              -CCNCCNTNNCCNC-               Myers et al. 2008
                                                       --CGCCGGGACT---
GCGTAGGCTA       60     9               0              -CCNCCNTNNCCNC-               Myers et al. 2008
                                                       ---GCGTAGGCTA--
TGGGCCCGGC       90     8               4              n/a
GGCGTGCGCG       90     8               1              -GGNGGNAGGGG-                 Zheng et al. 2010
                                                       -GGCGTGCGCG--
                                                       -CCNCCNTNNCCNC-               Myers et al. 2008
                                                       ---GGCGTGCGCG--
CCCGGTATCG       70     8               0              -CCNCCNTNNCCNC-               Myers et al. 2008
                                                       --CCCGGTATCG---
GCCCTTTCCT       60     7               0              ---CCTCCCT--                  Myers et al. 2005
                                                       -GCCCTTTCCT-
                                                       -CCNCCNTNNCCNC-               Myers et al. 2008
                                                       ---GCCCTTTCCT--
                                                       -CCTCCCTNNCCAC-               Myers et al. 2008
                                                       ---GCCCTTTCCT--

Page 15: Organizational Heterogeneity of Human Genome

Functionally related genes tend to reside in organizationally similar genomic regions

Genes underlying the GO enrichment in the four organizational pattern clusters with the most significant GO enrichments:

• The L2-a cluster is enriched for the “mitochondrion”, “intracellular non-membrane-bounded organelle”, “nuclear envelope” and “ribonucleoprotein complex” GO terms.
• The L2-h cluster is enriched for the “G-protein-coupled receptor protein signaling pathway” and “sensory perception of smell” GO terms.
• The H1-i cluster is enriched for the “epithelial cell differentiation” and “epithelium development” GO terms.
• The H2-a cluster is enriched for the “skeletal system development” GO term.

Paz A, Frenkel S, Snir S, Kirzhner V, Korol A. 2014. BMC Genomics 15:252.

Page 16: Organizational Heterogeneity of Human Genome

Thank you for your attention

Acknowledgments

• Dr. Valery Kirzhner
• Prof. Abraham Korol
• Prof. Edward Trifonov
• Dr. Arnon Paz and Dr. Zeev Frenkel

This work was supported by:
• The Israeli Ministry of Immigrant Absorption
• The Israel Council for Higher Education

Page 17: Organizational Heterogeneity of Human Genome

Calculating compositional spectra

…AGTAGTTACACTACTATAGTGACGACTCCATCGTCGTCGAGAACGTACCTTCTATATCCAAGGTACTACACTCGCGACCG

3676    CTACTATAGT

…
CTACTATAGT
CTACTAAAGT
CTAGTAAAGT
CTAGTAAAGT
CTAGTAACGT
CGCCTAAAGT
CCACTAAGGT

256 × 3676 = 941056    86.7%

Additional slide
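The slide does not state how the figure 3676 is obtained. One reading that is consistent with the listed variants of CTACTATAGT (each differs from it in at most three positions) is that 3676 is the number of 10-letter words within three mismatches of a given word. The sketch below checks that arithmetic; the function name and the interpretation itself are assumptions.

```python
from math import comb


def neighborhood_size(word_length: int, max_mismatches: int, alphabet_size: int = 4) -> int:
    """Number of words that differ from a given word in at most max_mismatches positions.

    For exactly k mismatches: choose the k positions, then one of the
    (alphabet_size - 1) other letters at each of them.
    """
    return sum(comb(word_length, k) * (alphabet_size - 1) ** k
               for k in range(max_mismatches + 1))


def mismatches(a: str, b: str) -> int:
    """Hamming distance between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    print(neighborhood_size(10, 3))          # 1 + 30 + 405 + 3240 = 3676
    print(256 * neighborhood_size(10, 3))    # 941056, the product shown on the slide
    variants = ["CTACTAAAGT", "CTAGTAAAGT", "CTAGTAACGT", "CGCCTAAAGT", "CCACTAAGGT"]
    print([mismatches(v, "CTACTATAGT") for v in variants])   # [1, 2, 3, 3, 3]
```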

Page 18: Organizational Heterogeneity of Human Genome

Spearman's rank correlation coefficient rho

Spearman's rank correlation coefficient is a non-parametric measure of correlation. ρ is given by:

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} D_i^2}{n(n^2 - 1)}$$

where:
• Di = xi − yi = the difference between the ranks of corresponding values Xi and Yi, and
• n = the number of values in each data set (same for both sets).

Additional slide

Page 19: Organizational Heterogeneity of Human Genome

The Kendall tau distance

The Kendall tau distance is a metric that counts the number of pairwise disagreements between two lists. The larger the distance, the more dissimilar the two lists are.

The Kendall tau distance between two lists τ1 and τ2 is

$$K(\tau_1, \tau_2) = \left|\{(i, j) : i < j,\ (\tau_1(i) < \tau_1(j) \wedge \tau_2(i) > \tau_2(j)) \vee (\tau_1(i) > \tau_1(j) \wedge \tau_2(i) < \tau_2(j))\}\right|$$

where τ1(i) and τ2(i) are the ranks of element i in the two lists.

K(τ1,τ2) equals 0 if the two lists are identical and n(n − 1) / 2 (where n is the list size) if one list is the reverse of the other. The Kendall tau distance is often normalized by dividing by n(n − 1) / 2, so that a value of 1 indicates maximum disagreement. The normalized Kendall tau distance therefore lies in the interval [0,1].
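A short sketch of the pairwise-disagreement count described above, assuming each list is given as the ranks of the same n elements. The function name and the `normalize` flag are illustrative choices.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_distance(ranks1: Sequence[float], ranks2: Sequence[float],
                         normalize: bool = False) -> float:
    """Number of element pairs ordered differently by the two rankings.

    With normalize=True the count is divided by n(n - 1) / 2, giving a value in [0, 1].
    """
    if len(ranks1) != len(ranks2):
        raise ValueError("both rankings must cover the same elements")
    n = len(ranks1)
    disagreements = 0
    for i, j in combinations(range(n), 2):
        # A pair disagrees when the two rankings order elements i and j in opposite ways.
        if (ranks1[i] - ranks1[j]) * (ranks2[i] - ranks2[j]) < 0:
            disagreements += 1
    if normalize:
        if n < 2:
            raise ValueError("normalization needs at least two elements")
        return disagreements / (n * (n - 1) / 2)
    return disagreements


if __name__ == "__main__":
    print(kendall_tau_distance([1, 2, 3, 4, 5], [3, 4, 1, 2, 5]))          # 4
    print(kendall_tau_distance([1, 2, 3, 4], [4, 3, 2, 1], normalize=True))  # 1.0
```

The double loop takes O(n²) time, which is adequate for short lists; a merge-sort based inversion count would be the usual choice for long ones.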

Additional slide