Presented by Guohui Ding, R&D, SIBS, CAS

Posted on 13-Feb-2016



A Temporal Profile for Animal Transmembrane Gene Duplication (Insights into the Coupling of Duplication and Macroevolution)

Presented by Guohui Ding

R&D, SIBS, CAS

Background
• Gene Duplication[1,2]
– The predominant mechanism by which genes with new functions and associated phenotypic novelties arise
– Several models try to explain the process of gene duplication
– Positive selection plays a key role in neo/subfunctionalization (?). It gives the chance to study the interplay of physical and biotic factors.
• Macroevolution[3]
– The dynamics of evolution above the species level
– Biogeographic, geochemical, palaeontological and ecological data (e.g. fossil data, ocean chemistry data)
• TM proteins[4]
– At least one transmembrane helix
– Functions such as active transport, ion flows, energy transduction and signal transduction
– Information exchange between the cell and the environment

Gene Duplication

Accumulation of mutations

Environment/Genetic background selection

Genetics. 1999 Apr;151(4):1531-45.

Macroevolution: Evidence/Data

Science. 2002 Aug 16;297(5584):1137-42. Review. Nature. 2005 Mar 10;434(7030):208-10. Nature. 2000 Mar 9;404(6774):177-80. Science. 2000 Dec 1;290(5497):1758-61.

TM Proteins

• ……
• Old and from long, long ago
• Not a good choice for studying evolutionary theory, but maybe a suitable model for illustrating the interaction between environment and life

Nature. 2004 Oct 21;431(7011):913.

The Question and The Logic
• What will the temporal profile of animal TM gene duplications look like? Is it a uniform distribution? If not, what scenarios can explain the distribution? (Null hypothesis: neutral theory)
• Do the large-scale cycles and patterns found in the Phanerozoic fossil record leave some imprint on the temporal profile of TM gene duplication, if God adjusts macroevolution through microevolution or genes? How important is gene duplication to speciation?[1,5] (Just a little extrapolation.) Do duplication events synchronize with speciation or with origination/extinction? When they are asynchronous, what does that tell us?
• Can this logic/method be applied to understanding macroevolution? In general, sequence data are far more readily attainable than fossil data. It also offers a second way.

The logic: duplicates are selected by the environment, so more duplicates at a given time imply more “diversity” in the environment at that time.

Methods

• TM protein prediction
• Family construction
• Estimation of molecular time scale
• Duplication events detection
• Data processing

TM protein prediction
• Data
– NCBI Reference Sequence (RefSeq) Database (Release 7, September 12, 2004)
– 13 eukaryotic genomes (61 bacterial genomes, 11 archaebacterial genomes)
• Prediction of transmembrane topology models
– ConPred II
• Identification of TM proteins
– At least one transmembrane helix

Nucleic Acids Res. 2004 Jul 1;32:W390-3.

Family construction
• Detection and masking of widespread, typically repetitive domains.
• Filtering by SEG and all-against-all comparison of protein sequences using the gapped BLAST program with default settings.
• E value determination based on the overall distribution of E values over the entire protein space.
• Detection of transitive best hits.
• Single-linkage clustering on the best hits to obtain the symmetrical best hits.
• Removal of fragment sequences.
• Single-linkage clustering again.
• Detection of clusters that have no cut-edges (bridges).
• Detection of clusters with at least one triangle of mutually consistent, genome-specific best hits (BeTs).
• Iterative multiple alignment.
• Detection of triangles of mutually consistent, genome-specific best hits.
• Case-by-case analysis of each candidate family.
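The two clustering steps in this list can be illustrated with a minimal sketch. It assumes each protein has already been assigned a single best BLAST hit; the function names and the toy identifiers (`p1` to `p5`) are hypothetical, and the slides do not say how the clustering was actually implemented.

```python
from collections import defaultdict

def symmetrical_best_hits(best_hit):
    """Return sorted pairs (a, b) where a's best hit is b and b's best hit is a."""
    pairs = set()
    for a, b in best_hit.items():
        if a != b and best_hit.get(b) == a:
            pairs.add(tuple(sorted((a, b))))
    return pairs

def single_linkage(nodes, edges):
    """Union-find: nodes joined by any chain of edges end up in one cluster."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for n in nodes:
        clusters[find(n)].add(n)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical toy input: protein -> its best BLAST hit.
best_hit = {"p1": "p2", "p2": "p1", "p3": "p2", "p4": "p5", "p5": "p4"}
pairs = symmetrical_best_hits(best_hit)
families = single_linkage(list(best_hit), pairs)
```

Here `p3` points at `p2` but is not `p2`'s best hit, so it stays in its own cluster; single linkage on one-directional best hits would instead pull it into the `p1`/`p2` family.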

About E value
• Based on the overall distribution of expectation values over the entire protein space.
• The distribution shown may be thought of as the average distribution of E values for a ‘typical’ protein sequence used as a query.
• The steep slope at high E values indicates a rapid growth in the number of sequences that are unrelated to the query sequence.
• Every sequence has its own distribution; only the threshold derived from the averaged distribution is reliable.
• The deviation from a straight line starts around 1e-5 in my work.

Proteins. 1999 Nov 15;37(3):360-78.

About transitive/ symmetrical best hit

• A threshold of 1e-5 for the E value of HSPs.
• HSPs that are not compatible with a global alignment are discarded.
• The remaining HSPs cover at least 80% of the protein's length.
• Their similarity is greater than or equal to 50%.
• Both sequences are complete.

Genome Res. 2000 Mar;10(3):379-85.
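The criteria on this slide amount to a simple filter over candidate hits. The sketch below assumes the per-pair E value, coverage and similarity have already been computed; the `Hit` record and its field names are hypothetical, and treating the thresholds as inclusive is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    evalue: float      # E value of the HSPs
    coverage: float    # fraction of protein length covered by the remaining HSPs
    similarity: float  # fraction of similar residues
    complete: bool     # True if both sequences are complete (not fragments)

def passes(hit, max_e=1e-5, min_cov=0.80, min_sim=0.50):
    """Apply the slide's thresholds; inclusive comparisons are an assumption."""
    return (hit.evalue <= max_e
            and hit.coverage >= min_cov
            and hit.similarity >= min_sim
            and hit.complete)

# Hypothetical examples.
good = Hit(evalue=1e-20, coverage=0.92, similarity=0.61, complete=True)
short = Hit(evalue=1e-20, coverage=0.55, similarity=0.61, complete=True)
```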

About cluster that has no cut-edges

• It detects densely connected regions in large protein-protein similarity networks.

• Splitting the large family

Cut edges
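A cut edge (bridge) is an edge whose removal disconnects the graph, so removing bridges splits a large family into its densely connected parts. The following is a minimal sketch of bridge detection by depth-first search on a similarity network; it assumes a simple undirected graph without duplicate edges, and it is not necessarily the procedure used for these slides.

```python
from collections import defaultdict

def find_bridges(edges):
    """Return the bridges of a simple undirected graph (no duplicate edges)."""
    graph = defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
        graph[v].append(u)

    disc, low, bridges = {}, {}, []
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        for v in graph[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge: u can reach an earlier-discovered node.
                low[u] = min(low[u], disc[v])
            else:
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:
                    # Nothing below v reaches u or above except via (u, v).
                    bridges.append(tuple(sorted((u, v))))

    for node in list(graph):
        if node not in disc:
            dfs(node, None)
    return sorted(bridges)

# Hypothetical network: two triangles joined by the single edge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]
bridges = find_bridges(edges)
```

The recursive search is fine for small families; a very large network would need an iterative version to stay within Python's recursion limit.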

About triangle of mutually consistent, genome-specific best hits (BeTs)

• Triangle
• Mutually consistent
• Genome-specific

Science. 1997 Oct 24;278(5338):631-7. Review

Iterative multiple alignment
• Multiple sequence alignment with CLUSTAL W (1.83) using default values
• Bootstrapping with 500 replicates
• If a tree branch's bootstrap value is less than 50%, break the branch to get two subfamilies.
• Repeat the multiple sequence alignment on each subfamily's members until no branch in the family has a bootstrap value below 50% (11 iterations to the end).

Estimation of molecular time scale

• Inference of phylogenetic tree
• Calibration time
• Maximum likelihood estimation of protein divergence times

Inference of phylogenetic tree

• Neighbor-Joining method with Poisson distance

• Prokaryotic or other non-animal sequences serve as the outgroup to find the root. In the absence of an outgroup sequence, the root is placed at the midpoint of the longest route connecting two proteins (midpoint rooting).

• Software: LINTREE by N. Takezaki

Mol Biol Evol. 1987 Jul;4(4):406-25.

Calibration time
• Several calibration points
– Mouse-rat: 41 Mya
– Primate-rodent: 91 Mya
– Mammal-bird: 310 Mya
– Vertebrate-Drosophila: 993 Mya
– Vertebrate-nematodes: 1177 Mya
– Animal-plant-fungi: 1576 Mya
• Mapping them onto the phylogenetic tree manually
• Mark each orthologous group with an evolutionary rate group

1399_1.trees

To: ppt 21

Nat Rev Genet. 2002 Nov;3(11):838-49. Review. Trends Genet. 2003 Apr;19(4):200-6.

Maximum likelihood estimation of protein divergence times
• Specify an empirical model: mtREV24.dat.
• The gamma shape parameter is estimated by the software itself.
• Both the global clock and the local clock are used (as a robustness test).
• Software: PAML 3.14 by Ziheng Yang

Syst. Biol. 52(5):705-716, 2003

Global clock vs. Local clock

Global clock Local clock

Global clock vs. Local clock

The Pearson correlation coefficient is 0.7439441 (p < 2.2e-16). Lines in the plot: y = x, and the regression line for local clock vs. global clock.

Duplication events detection
• Outparalogs[6]: paralogs in the given lineage that evolved by gene duplications that happened before the radiation (speciation) event.
• The orthologs associated with each duplication event have at least two paralogs from different species.
• Exclude gene families that were sharply in conflict with the uncontested animal phylogeny.
• We identified 1651 duplication events in the final data set of 786 gene families. Every duplication event was annotated with the time point at which it happened.
• Because 31 duplication events were dated to more than 4.5 Gya, we keep only the time points of the other 1620 duplication events.

See: ppt 17

Distribution of the Taxonomy
• 100% mouse
• 97% rat
• 92% human
• 60% chicken
• 27% fly
• 23% worm
• 10% cress
• 6% fission yeast
• 6% baker's yeast

Data processing

• Overall distribution
• Duplication and the extinction/origination
• Periodogram analysis (FFT)

Overall distribution

Control

Kernel density estimates with a Gaussian kernel.
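For readers unfamiliar with the method, a Gaussian kernel density estimate places a small normal curve on every dated duplication event and sums them. The sketch below is a plain-Python illustration; the duplication times and the bandwidth are hypothetical values, since the slides do not report the bandwidth that was used.

```python
import math

def gaussian_kde(samples, bandwidth, grid):
    """Evaluate a Gaussian kernel density estimate at each point of grid."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]

# Hypothetical duplication times (Gya) and bandwidth, for illustration only.
times = [0.10, 0.12, 0.45, 0.50, 0.52, 0.55, 1.20, 2.10]
grid = [i * 0.01 for i in range(-100, 401)]   # -1.0 to 4.0 Gya
density = gaussian_kde(times, bandwidth=0.1, grid=grid)

# Riemann-sum check that the estimate integrates to about 1 over the grid.
area = sum(density) * 0.01
```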

Result …
• About the control
– Randomly sample 1620 time points, without replacement, from all the nodes marked with a time point to generate a distribution.
– Repeat 10,000 times to get 10,000 randomly generated profiles.
– Compute an average distribution from the generated distributions by taking the mean of every bin. (The red line in the graph is the average random distribution.)
– The distance/correlation between each randomly generated or observed distribution and the average distribution is calculated. From the distribution of these distances/correlations, p << 0.00001.
– At ~2.75 Gya, the observed distribution deviates from the control. Only the data after 2.75 Gya are used in what follows.
• Strikingly, the overall distribution of duplications after 2.75 Gya is not uniform. (D = 0.5318, p < 2.2e-16, Kolmogorov-Smirnov test)
• The distribution of the data conforms to a random walk.
– A random walk is a model of the form $y_{i+1} = y_i + \varepsilon_{i+1}$.
– The sequence of ε is obtained and a KS test for uniformity is applied to it. As D = 0.9982, p = 0.2730, we can’t reject the null hypothesis. (Note: this statistic is wrong. The test actually run at the time was ks.test(x, max(x), min(x)). The proper test should be ks.test(x, ‘punif’, max(x), min(x)), but the data do not pass it. Alternatively, use Box.test() to test for white noise.)
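The control described on this slide can be sketched as a resampling loop. This is a minimal illustration under stated assumptions: the node times are simulated placeholders, the bin count is arbitrary, and Euclidean distance stands in for the unspecified "distance/correlation" measure. The function names are hypothetical.

```python
import random

def histogram(times, n_bins, t_max):
    """Count time points in n_bins equal-width bins over [0, t_max]."""
    counts = [0] * n_bins
    width = t_max / n_bins
    for t in times:
        counts[min(int(t / width), n_bins - 1)] += 1
    return counts

def control_profile(all_node_times, sample_size, n_bins, t_max, n_rep, rng):
    """Sample without replacement n_rep times; return the bin-wise mean and all profiles."""
    totals = [0.0] * n_bins
    profiles = []
    for _ in range(n_rep):
        sample = rng.sample(all_node_times, sample_size)  # without replacement
        h = histogram(sample, n_bins, t_max)
        profiles.append(h)
        for i, c in enumerate(h):
            totals[i] += c
    return [t / n_rep for t in totals], profiles

def distance(p, q):
    """Euclidean distance between two binned profiles (an assumed choice)."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def empirical_p(observed, mean, profiles):
    """Share of random profiles at least as far from the mean as the observed one."""
    d_obs = distance(observed, mean)
    exceed = sum(1 for p in profiles if distance(p, mean) >= d_obs)
    return (exceed + 1) / (len(profiles) + 1)

# Placeholder data: 500 simulated node times, 100 sampled per replicate.
rng = random.Random(0)
nodes = [rng.uniform(0.0, 4.5) for _ in range(500)]
mean, profiles = control_profile(nodes, 100, 9, 4.5, 200, rng)
observed = [100] + [0] * 8          # an extreme, clearly non-random profile
p_value = empirical_p(observed, mean, profiles)
```

With 10,000 replicates, as on the slide, the smallest p-value this estimator can report is 1/10001.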

Discussion …
• ~2.75 Gya is a very important time point in the rise of atmospheric oxygen. There are two scenarios surrounding this question[7]. Our data show that something changed at ~2.75 Gya, consistent with the evolution of oxygenic photosynthesis by 2.7 Gya, which is supported by organic biomarker and carbon stable isotope evidence. In this scenario, we can see that TM gene duplication increased when the oxygen content of the air changed (e.g., flowering plants (~0.146 Gya), plastids (~1.58 Gya), mitochondria (~1.8 Gya), etc.).
– Two Great Oxidation Events[8]: 2.0~2.4 Gya; 0.55~0.8 Gya
– Snowball Earth[9]: 0.58~0.75 Gya
• The emergence of the plastid/mitochondrion may have played an important role in TM protein evolution, since organelles bring more membrane structure. The rise of complex multicellular life (1~1.5 Gya) is also a cause[10].
• The rate of TM protein duplication is non-uniform. This is consistent with the finding that both large- and small-scale duplications occur in evolution.
• The random-walk model of the distribution suggests that either these variables were correlated with environmental variables that follow a random walk, or so many mechanisms were affecting these variables, in different ways, that the resultant trends appear random.[11]

Duplication and the extinction/origination

?

?

Result …
• About the extinctions
– Early Cambrian (512 Mya)
– End-Ordovician (439 Mya)
– Frasnian-Famennian (376 Mya)
– End-Permian (251 Mya)
– End-Triassic (206 Mya)
– Cretaceous-Tertiary (65 Mya)
• Almost all the major mass extinctions correspond to a duplication peak, but two peaks have no corresponding extinction record.
• Based on the fossil data for marine animals, origination/extinction rates were computed by linear interpolation for the appropriate times. The correlations between origination/extinction rates and duplication numbers were calculated.
– Extinction rates display a positive correlation with the duplication profile, but it is not significant. (r = 0.0259369, p = 0.5483) (r = 0.07933089, p = 0.4144)
– Origination rates show a significant negative correlation with the duplication profile. (r = -0.1546602, p = 0.0003174) (r = -0.1230396, p = 0.2046)
– For diversity (r = -0.3018349, p = 0.0015)
• Kernel density estimates. (Genetics 147:1965-1975)
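The interpolation-then-correlation step can be sketched as follows. The fossil-bin times, rates and duplication counts are hypothetical placeholders, and clamping to the end values outside the fossil range is an assumption. Only the general technique (linear interpolation followed by Pearson's r) comes from the slide.

```python
def interpolate(x, xs, ys):
    """Linear interpolation of (xs, ys) at x; xs must be strictly increasing.
    Values outside the range are clamped to the end points (an assumption)."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            frac = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + frac * (ys[i] - ys[i - 1])

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

# Hypothetical fossil bins (Mya) with extinction rates, and duplication counts
# observed at other time points.
fossil_times = [50.0, 150.0, 250.0, 350.0, 450.0]
extinction_rate = [0.10, 0.05, 0.40, 0.20, 0.30]
dup_times = [100.0, 200.0, 300.0, 400.0]
dup_counts = [12, 30, 41, 25]

rates_at_dups = [interpolate(t, fossil_times, extinction_rate) for t in dup_times]
r = pearson_r(rates_at_dups, dup_counts)
```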

Discussion …
• An amusing but plausible model (creation by extinction) (divergent resolution?)
– When the environment changes dramatically, the populations of most species become smaller or even go extinct (extinction). In the gene duplication model, sudden and varied positive selection will fix more new duplicates through neo/subfunctionalization. On the other hand, a change that is deleterious to a gene’s function readily escapes purifying selection in a small population[12]. In the population, redundancy and robustness both increase[13]. So the genome structure is not an optimized one, but it is good for survival (note: TM proteins mostly belong to dosage-sensitive genes). When the environment levels off, the population must increase and migrate. A redundant genome will then subfunctionalize some duplicates. At this time, most new species will emerge. (Is this one possible logic linking duplication, extinction and origination?)
– The correlation analysis between origination/extinction rates and the duplication profile may need more data, but the results are already suggestive.
• Life is not only a passive process, especially the ecosystem. (About the two conflicts in the figure) (consistent with the evolution of oxygenic photosynthesis)
– ~0.3 Gya, gymnosperms begin to diversify widely.
– ~0.13 Gya, angiosperms evolve flowers, structures that attract insects and other animals to spread pollen. The evolution of the angiosperms causes a major burst of animal evolution.

Nature, vol 400, 58~ 61 ( For flower plant)

Periodogram analysis (FFT)

$\mathrm{fit} = a \cdot \mathrm{time}^3 + b \cdot \mathrm{time}^2 + c \cdot \mathrm{time} + d$

a=56.72948; b=-51.95965; c=12.79374; d=0.07294

FFT …

$\mathrm{fit} = \mathrm{Amp} \cdot \sin\left(\frac{2\pi}{\mathrm{Period}} \cdot \mathrm{time} + \mathrm{Phase}\right)$

Amp = 0.15357741; Period = 0.06230346 Gya; Phase = 1.09601472 (radians)

Account for 8.5% of the variance.(>5%)
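The fitted parameters on these slides can be plugged into a short sketch. It reads the garbled formulas as a cubic trend plus a single sine wave, which is an assumption about the slide, and it takes time in Gya because the period is given in Gya. The `variance_explained` helper is a hypothetical illustration of how a figure like 8.5% could be computed, not the author's code.

```python
import math

# Parameters as printed on the slides.
A, B, C, D = 56.72948, -51.95965, 12.79374, 0.07294
AMP, PERIOD, PHASE = 0.15357741, 0.06230346, 1.09601472

def trend(t):
    """Cubic trend a*t^3 + b*t^2 + c*t + d (assumed form; t in Gya)."""
    return A * t ** 3 + B * t ** 2 + C * t + D

def cycle(t):
    """Sine wave Amp * sin(2*pi*t/Period + Phase) (assumed form; t in Gya)."""
    return AMP * math.sin(2.0 * math.pi * t / PERIOD + PHASE)

def variance_explained(residuals, fitted):
    """Fraction of the residual variance removed by subtracting the fitted cycle."""
    def var(v):
        m = sum(v) / len(v)
        return sum((x - m) ** 2 for x in v) / len(v)
    rest = [r - f for r, f in zip(residuals, fitted)]
    return 1.0 - var(rest) / var(residuals)

# Sanity check on synthetic data: a pure cycle is fully explained by itself.
ts = [i * 0.005 for i in range(110)]          # 0 to 0.545 Gya
pure = [cycle(t) for t in ts]
share = variance_explained(pure, pure)
```

In a real analysis the residuals would be the observed duplication profile minus `trend(t)`, and the share would then be compared with the 8.5% reported on the slide.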

FFT … Model R/W Monte Carlo simulation

[Figure: Monte Carlo simulation under the R model (P = 0.1294) and the W model (P = 0.0138), each with the α = 0.05 level marked; curves: control and observation]

Result …
• A 0.062 Gya cycle is evident in the Fourier spectrum for the Phanerozoic, but the random-walk null hypothesis can’t be rejected. (R: p = 0.1294; W: p = 0.0138; V: 8.52%)
– Several others: 0.0912 Gya (0.2956/0.0039, 10.63%); 0.0275 Gya (0.0071/0.1037, 4.22%); 0.0162 (1e-4/0.1047).
– Ten thousand Monte Carlo simulations were done.
• Overall periodogram after ~2.75 Gya
– Not a good question. It is difficult to choose an appropriate trend function to detrend the data.
• The phase of the 62-Myr cycle differs between fossil diversity and duplication.
– 5.21 (radians) - 1.1 (radians) = 4.1 (radians) = 1.305π

Nature. 2005 Mar 10;434(7030):208-10
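A Monte Carlo test of a spectral peak against a random-walk null can be sketched generically. The slides do not define the R and W models, so the surrogate used here (a walk rebuilt from the shuffled steps of the observed series) is only one plausible construction and should be read as an assumption. The function names are hypothetical, and the surrogates are not detrended, unlike the analysis described above.

```python
import cmath
import math
import random

def power_at(series, k):
    """Periodogram value at DFT index k for the mean-removed series."""
    n = len(series)
    m = sum(series) / n
    coef = sum((x - m) * cmath.exp(-2j * math.pi * k * i / n)
               for i, x in enumerate(series))
    return abs(coef) ** 2 / n

def shuffled_walk(series, rng):
    """Surrogate walk: same start and same steps as the data, in random order."""
    steps = [b - a for a, b in zip(series, series[1:])]
    rng.shuffle(steps)
    walk = [series[0]]
    for s in steps:
        walk.append(walk[-1] + s)
    return walk

def peak_p_value(series, k, n_sim, rng):
    """Share of surrogates whose power at index k reaches the observed power."""
    observed = power_at(series, k)
    hits = sum(1 for _ in range(n_sim)
               if power_at(shuffled_walk(series, rng), k) >= observed)
    return (hits + 1) / (n_sim + 1)

# Synthetic example: a sine wave with exactly 4 cycles over 64 points.
wave = [math.sin(2.0 * math.pi * 4 * i / 64) for i in range(64)]
peak = power_at(wave, 4)
p = peak_p_value(wave, 4, 50, random.Random(1))
```

The direct DFT costs O(n) per frequency, which is adequate for a single peak; scanning a whole spectrum over ten thousand surrogates would call for an FFT library.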

Discussion …
• The 62-million-year wave is surprisingly strong and, so far, there is no good explanation for it (the wave from GOD ^_^). We have detected it in an independent data set by applying the same trend functions. Is it a chicken-and-egg question?
– It raises some essential questions about life and the environment. What causes it?
– We give a second way to discuss this question.
• About the phase shift
– 1.305π ≠ π. In my story it should be 1.5π, but that is not the case.
– The phase shift indicates asynchrony between the duplication profile and genus diversity.

Nature. 2005 Mar 10;434(7030):208-10

Some references
• [1] Jianzhi Zhang. Evolution by gene duplication: an update. TRENDS in Ecology and Evolution 18, 292-298 (2003).
• [2] Michael Lynch & Vaishali Katju. The altered evolutionary trajectories of gene duplicates. TRENDS in Genetics 20, 544-549 (2004).
• [3] David Jablonski. The interplay of physical and biotic factors in macroevolution. Evolution Planet Earth (book).
• [4] U Lehnert, Y Xia et al. Computational analysis of membrane proteins: genomic occurrence, structure prediction and helix interactions. Quarterly Reviews of Biophysics (in press).
• [5] Lynch M & Conery JS. The evolutionary fate and consequences of duplicate genes. Science 290(5494), 1151-5 (2000).
• [6] Sonnhammer EL, Koonin EV. Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet 18(12), 619-20 (2002).
• [7] Canfield DE, Habicht KS, Thamdrup B. The Archean sulfur cycle and the early history of atmospheric oxygen. Science. 2000 Apr 28;288(5466):658-61.
• [8] Hayes JM. Biogeochemistry: a lowdown on oxygen. Nature. 2002 May 9;417(6885):127-8.
• [9] Hoffman PF, Kaufman AJ, Halverson GP, Schrag DP. A neoproterozoic snowball earth. Science. 1998 Aug 28;281(5381):1342-6.
• [10] Hedges SB, Blair JE, Venturi ML, Shoe JL. A molecular timescale of eukaryote evolution and the rise of complex multicellular life. BMC Evol Biol. 2004 Jan 28;4(1):2.
• [11] Cornette JL, Lieberman BS. Random walks in the history of life. Proc Natl Acad Sci U S A. 2004 Jan 6;101(1):187-91.
• [12] Sidow A. Gen(om)e duplications in the evolution of early vertebrates. Curr Opin Genet Dev. 1996 Dec;6(6):715-22.
• [13] Gu Z, Steinmetz LM, Gu X, Scharfe C, Davis RW, Li WH. Role of duplicate genes in genetic robustness against null mutations. Nature. 2003 Jan 2;421(6918):63-6.

...
• Function clustering
• The methodology discussion
• ……

Acknowledgements
• Dr Qi Wang
• Prof Yixue Li
• Dr Qi Liu
• Prof Gang Pei
• Ziliang Qian
• Prof Tieliu Shi
• Yongzhang Zhu
• Guang Li
• PeiLin Jia
• Changzheng Dong
• Fudong Yu
• ……
