gene family

Upload: arun-sendhil

Post on 04-Nov-2015


DESCRIPTION

gene

TRANSCRIPT

  • GENOME EVOLUTION AND GENE DUPLICATIONS IN EUKARYOTES

    Shin-Han Shiu, Plant Biology / QBMI, Michigan State University

  • Genomes and gene contents

    Figure labels (gene counts): 30,000; 25,000; 10,000; 6,000; 45,000; 17,000

  • Duplicate genes in the genome

    Arabidopsis gene families*

    *: Clusters from Markov clustering using all-against-all BLAST E values as distance measures

  • Gene function and duplication

    What's the consequence?

  • Focus I: Duplication Mechanism and Loss Rate

    Gene duplications: mechanisms, consequences, preferential retention

  • Duplication mechanisms

    Whole genome duplication

    Tandem duplication

    Segmental duplication

    Replicative transposition

  • Lineage-specific gains in plants and animals

    Organism      Lineage-specific gains   Normalized gain*   # of genes in families analyzed   % total
    Rice          10,115                   6,743              28,467                            35.5 (23.7)**
    Arabidopsis   5,984                    3,990              21,936                            27.3 (18.2)**
    Human         811                      811                21,954                            3.7
    Mouse         1,265                    1,265              24,041                            5.3

    *: The gain counts are normalized against the ratio between the Arabidopsis-rice and human-mouse divergence times (150 and 100 Mya, respectively).

    **: Numbers in parentheses are the percentages of the total based on normalized gains.

    Substantially more recent duplicates in plants than in animals

    Mostly due to frequent whole genome duplications in plants
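    The normalization in the footnote can be sketched as below. The formula (scaling plant gains by 100/150) is an assumption inferred from the footnote; it reproduces the rice value and lands within one gene of the Arabidopsis value on the slide. The function names are illustrative only.

```python
# Divergence times (Mya) from the slide footnote.
PLANT_DIVERGENCE_MYA = 150   # Arabidopsis-rice
ANIMAL_DIVERGENCE_MYA = 100  # human-mouse

# organism -> (lineage-specific gains, genes in families analyzed, is_plant)
TABLE = {
    "Rice": (10115, 28467, True),
    "Arabidopsis": (5984, 21936, True),
    "Human": (811, 21954, False),
    "Mouse": (1265, 24041, False),
}


def normalized_gain(gains, is_plant):
    """Scale plant gains to the shorter animal divergence time (assumed formula)."""
    if is_plant:
        return gains * ANIMAL_DIVERGENCE_MYA / PLANT_DIVERGENCE_MYA
    return float(gains)


def percent_of_total(count, genes_analyzed):
    """Express a gain count as a percentage of the genes analyzed."""
    return 100.0 * count / genes_analyzed


for organism, (gains, genes, is_plant) in TABLE.items():
    norm = normalized_gain(gains, is_plant)
    print(f"{organism:12s} raw {percent_of_total(gains, genes):5.1f}%  "
          f"normalized {percent_of_total(norm, genes):5.1f}%")
```

    Even after normalization, the plant percentages remain several times higher than the animal ones, which is the point the slide makes.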

  • Gain vs. Loss

    3 rounds of whole-genome duplications in the Arabidopsis lineage

    ~82% of duplicates from the last round were lost in the past 40 million years

    15,000* → 30,000 → 60,000 → 120,000

    Arabidopsis gene content: 21,000**

    *: Number of orthologous groups in shared families between Arabidopsis and rice.

    **: Number of genes in shared families.

    Genome duplications + tandem duplications − gene losses =
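    A back-of-the-envelope sketch of the doubling series above, assuming every gene is retained after each round and ignoring tandem duplications. The overall shortfall it computes is of the same magnitude as the ~82% loss quoted on the slide, but the slide's own figure refers to the last round only and its derivation is not shown, so this is an illustration rather than a reconstruction.

```python
ANCESTRAL_GROUPS = 15_000   # orthologous groups in shared families (slide)
WGD_ROUNDS = 3              # rounds of whole-genome duplication (slide)
CURRENT_GENES = 21_000      # Arabidopsis genes in shared families (slide)


def expected_after_wgd(ancestral, rounds):
    """Gene count if every round doubles the genome and nothing is lost."""
    return ancestral * 2 ** rounds


def fraction_lost(expected, observed):
    """Share of the no-loss expectation that is missing today."""
    return 1.0 - observed / expected


expected = expected_after_wgd(ANCESTRAL_GROUPS, WGD_ROUNDS)
print(expected)                                          # 120000
print(f"{fraction_lost(expected, CURRENT_GENES):.1%}")   # 82.5%
```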

  • Age distribution of animal duplicates (Shiu et al., 2006)

    Steady decay in the number of duplicates

    Frequent tandem duplication (TD), segmental duplication (SD), and replicative transposition (RT)

    Ks: rate of nucleotide substitutions in codon sites that do not affect amino acid identity

  • Plant duplicate age distribution (Shiu et al., 2004)

    Apparent peak at Ks of ~0.18 instead of zero

    Frequent whole genome duplication (WGD), TD, SD (maybe), and RT (in some plants)

  • Genome remodeling in polyploids

    Natural and synthetic polyploids

    Figure labels: ~348 Mb, ~203 Mb, ~314 Mb, ~257 Mb; 20,000 yr

  • Experimental approaches

    Genome-wide polymorphism monitored by tiling array

    Figure labels: genome, tiled probes, gap, resolution, array; 20,000 yr; ~6 million features

  • Genome-wide Single Feature Polymorphism (SFP)

    Mid-parent (MP) vs. Arabidopsis suecica (As)

    Polyploid    SFP
    Natural      58,517
    Synthetic    503

  • Genome-wide Single Feature Polymorphism

    Genome-wide polymorphism monitored by tiling array

    Feature types: gene, pseudogene, transposon

  • Genome-wide Single Feature Polymorphism

    Duplication or deletion: MP duplication or As deletion

  • Genome Survey Sequencing

    Sequence ~40-60 Mb of the Arabidopsis suecica genome (0.15-0.2X coverage); will be done next week!

    Ultra-high throughput sequencer (GS20) funded by the Strategic Partnership Grant

    Ultra-high throughput: 20-30 Mb per run, each run 5 hours; will be 100 Mb per run in early 2007

    Cost efficient: ~$0.3/kb

    Read length rather limited: ~100 bp per read now; will be ~200 bp in early 2007

    For more information contact: Andreas Weber, David DeWitt, or Shin-Han Shiu

    Seminar on instrumentation: 9/29, Friday, 1pm, 1415 BPS
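    The survey-sequencing figures above imply some simple arithmetic, sketched here. The unit conversions (1 Mb = 1,000,000 bp = 1,000 kb) and the helper names are assumptions for illustration; the throughput, cost, and read-length values are those quoted on the slide.

```python
import math


def reads_needed(target_mb, read_length_bp):
    """Number of reads required to produce target_mb of raw sequence."""
    return int(target_mb * 1_000_000 / read_length_bp)


def runs_needed(target_mb, mb_per_run):
    """Number of whole instrument runs required."""
    return math.ceil(target_mb / mb_per_run)


def cost_usd(target_mb, usd_per_kb):
    """Sequencing cost at a flat per-kb rate."""
    return target_mb * 1000 * usd_per_kb


for target_mb in (40, 60):
    print(f"{target_mb} Mb: {reads_needed(target_mb, 100):,} reads of 100 bp, "
          f"{runs_needed(target_mb, 20)} runs at 20 Mb/run, "
          f"about ${cost_usd(target_mb, 0.3):,.0f}")
```

    Under these assumptions, the low end of the target (40 Mb) corresponds to 400,000 reads, two runs at 20 Mb per run, and roughly $12,000.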

  • Summary: Gene duplication and polyploidy

    Gene duplication occurred frequently in eukaryotes, but most duplicates are lost.

    In plants, whole genome duplication is common, but gene loss also occurred frequently.

    After 4 generations, only a very small number of SFPs are identified in synthetic polyploids.

    After 20,000 generations, most coding genes do not have clustered sequence polymorphisms that are indicative of deletion.

    Clustered polymorphisms are mostly located in pseudogenes and transposons.

    Survey sequencing is necessary to determine if some coding genes have become pseudogenes without being deleted.

  • Focus II: Differential Retention of Duplicates

    Gene duplications: mechanisms, consequences, preferential retention

  • Duplicate genes in the genome

    Arabidopsis gene families*

    *: Clusters from Markov clustering using all-against-all BLAST E values as distance measures

  • Large gene families in plants

    One of the largest gene families

  • Normalized gain: % expanded orthologous groups (OGs)

    Large family sizes do not necessarily indicate higher expansion rates

  • Ancestral family sizes and gene gains

    Large ancestral families tend to have more lineage-specific gains, but with many exceptions

  • Differential expansion of functional categories

    GO: Gene Ontology

    Categories: protein ubiquitination, polysaccharide biosynthesis, cell wall modification, transcriptional regulation, biotic stress response, secondary metabolism

  • Differences in Duplicability

    Duplicability: the propensity for the retention of a duplicate gene

    Computational analysis of genome-wide trends

    Table columns: Category, Arabidopsis, Human

    Categories: defense response, proteolysis, transport, ion channel activity, metabolism, development, protein kinase activity, transcription factor activity

  • Kinase superfamily sizes among eukaryotes (Shiu & Bleecker, 2003)

    Organism                          Number of genes   Kinase superfamily   Percent total genes
    Arabidopsis thaliana              25,814            1,041                4.0
    Oryza sativa subsp. indica        ~35,000           1,607                3.6
    Chlamydomonas reinhardtii         ~12,200           414                  3.4
    Plasmodium falciparum             5,334             94                   1.8
    Plasmodium yoelii                 7,681             70                   0.9
    Caenorhabditis elegans            19,484            417                  2.1
    Drosophila melanogaster           13,808            262                  1.9
    Anopheles gambiae                 15,088            216                  1.4
    Ciona intestinalis                15,852            316                  2.0
    Fugu rubripes                     33,609            632                  1.9
    Mus musculus                      22,444            495                  2.2
    Homo sapiens                      22,980            472                  2.1
    Saccharomyces cerevisiae          6,449             113                  1.8
    Candida albicans                  6,164             95                   1.5
    Neurospora crassa                 10,082            104                  1.9
    Schizosaccharomyces pombe         4,945             109                  2.2

  • Kinase families in rice and Arabidopsis (Shiu et al., 2004)

    Gene count differences among families indicate differential expansion

  • Estimation of ancestral RLK family size (Shiu et al., 2004)

    Panels A and B: kinase phylogeny of Arabidopsis and rice RLKs; 440 speciation points (rice, Arabidopsis)

  • Development vs. resistance/defense RLKs (Shiu et al., 2004)

  • Contradiction

    Plant genes involved in development tend to have high duplicability

  • Selection for expansion

    Depends on the level of variation of the signals

    OR

  • Summary: differential retention

    Longevity and duplicability of plant genes

    Duplicability   Longevity   Examples
    High            High        Transcription factors
    High            Low         Resistance genes
    Low             High        Enzymes in central metabolic pathways
    Low             Low         ??

  • Focus III: Functional Consequences

    Gene duplications: mechanisms, consequences, preferential retention

  • Functional Consequences of Duplication

    Functional divergence and conservation

    Is it because of changes in cis-regulatory elements or coding sequences?

    How are duplicates retained: subfunctionalization or neofunctionalization?

  • Divergence in gene expression

    Develop pipelines for cis-element prediction and

  • Divergence in post-translational modification

    Conservation of phosphorylation sites across species

    SACE: budding yeast; CAGL: Candida glabrata; CAAL: Candida albicans; CATR: Candida tropicalis; NECR: Neurospora crassa; DEHA: Debaryomyces hansenii

  • Detailed Functional Studies of Duplicate Genes

    Functional analyses of DDF1 and DDF2 transcription factors

    Derived from recent whole genome duplication in Arabidopsis

    Related to the well-known CBF factors involved in cold and drought stress

    DDFs in Arabidopsis thaliana and Arabidopsis lyrata: promoter GFP, knockouts, over-expression studies, interacting proteins, binding targets

  • Focus IV: Protein space

  • Tiling array analysis of transcriptome

    Human Chr 21, 22 (Kapranov et al., 2002)

  • Posterior probability p(F|coding)

  • Performance of the CI measure

    Known Arabidopsis exons and introns, 90-300 bp

    Arabidopsis small proteins that are not annotated: correctly predicts 19 out of 20 (95%).

    Yeast sORFs with translation evidence: correctly predicts 98 out of 114 (86%)

    In intergenic sequences of the Arabidopsis genome: 3,274 sORFs identified

  • Coupling with tiling array expression

    Hybridization intensities for feature types

  • Summary: Novel coding genes

    Many unannotated regions in the genomes are expressed.

    Using the CI measure, many proteins that were not annotated but with evidence of expression from yeast and Arabidopsis are identified correctly.

    Using the CI measure, we estimated that ~3000 novel coding regions are present in the unannotated regions of the Arabidopsis thaliana genome.

    Using tiling array data, we found that many of these novel coding regions are expressed.

  • Acknowledgement

    Lab members: Kousuke Hanada, Melissa Lehti-Shiu, Cheng Zou, Emily Eckenrode

    University of Chicago: Justin Borevitz, Xu Zhang

    University of Wisconsin: Sara Patterson, Rick Vierstra

    University of Missouri: Scott Peck

    Michigan State University: Many; Rong Jin, Comp Sci & Eng; Yue-Hua Cui, Stat & Prob; Startup fund

  • Recent completion

  • Genome remodeling in polyploids

    Genome duplication occurs frequently in plants

    What is the fate of duplicates?

    How fast do gene losses occur?

    Is there any preference in genes retained?

    Diagram: genes A, B, C, D, E are duplicated into copies A1 to E1 and A2 to E2, which are followed over time points t1 and t2; Ng = 5, 10, 8, 5

  • Comparing degrees of expansion

    Combined set: Arabidopsis, ~25,000 proteins; rice prediction, ~66,000 genes

    Gene/domain families: shared and unique

    Pairwise distance, then putative orthologous groups

    Example for GO:0001: ui = 1, ei = 4

    All orthologous groups: total unexpanded = sum of ui; total expanded = sum of ei
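    The tally on this slide can be sketched as follows. The input list is hypothetical toy data, and the rule "more than one lineage-specific copy means expanded" is an assumption made for illustration; the slide only names the per-category counts ui (unexpanded groups) and ei (expanded groups) and their totals.

```python
from collections import defaultdict

# Hypothetical orthologous groups: (GO category, copies in the lineage).
orthologous_groups = [
    ("GO:0001", 1), ("GO:0001", 3), ("GO:0001", 2), ("GO:0001", 5), ("GO:0001", 4),
    ("GO:0002", 1), ("GO:0002", 1), ("GO:0002", 2),
]


def tally_expansion(groups):
    """Count unexpanded (u) and expanded (e) orthologous groups per category."""
    counts = defaultdict(lambda: {"u": 0, "e": 0})
    for category, copies in groups:
        key = "e" if copies > 1 else "u"   # assumed definition of "expanded"
        counts[category][key] += 1
    return dict(counts)


def percent_expanded(u, e):
    """Percentage of orthologous groups in a category that are expanded."""
    return 100.0 * e / (u + e)


tally = tally_expansion(orthologous_groups)
total_u = sum(c["u"] for c in tally.values())
total_e = sum(c["e"] for c in tally.values())

for category, c in tally.items():
    print(category, c, f"{percent_expanded(c['u'], c['e']):.1f}% expanded")
print("all groups:", f"{percent_expanded(total_u, total_e):.1f}% expanded")
```

    Comparing each category's percentage with the figure for all groups is one simple way to see which functional categories expanded more than the background; a formal test of enrichment would be a separate step.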

  • Major questions on gene duplication

    When: timing of gene duplications, e.g. N = 10

  • Domain gains in rice and Arabidopsis

    Gain in one lineage does not necessarily predict gain in the other

  • Identify novel small coding genes

    Determine base composition probabilities from coding sequences (CDS parameters) and non-coding sequences (NCDS parameters), e.g. Pc(AAA) = and Pc(T|AAA) =

    Calculate posterior probability

    Diagram labels: c1 c2 c3 c4 c5 c6; feature tables; n

  • Setting up the Bayes

    Priors

    S = ATG TTC TAC TTT G
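    A minimal sketch of how such a posterior could be computed, assuming a third-order Markov model in the spirit of the Pc(AAA) and Pc(T|AAA) notation. The training sequences, the pseudocount, the 0.5 prior, and all function names are hypothetical; reading-frame-specific parameters (possibly what c1 to c6 denote) are not modeled, so this is an illustration rather than the method used in the talk.

```python
import math
from collections import Counter
from itertools import product

BASES = "ACGT"
K = 3  # context length, matching the P(T|AAA) notation


def train(sequences, pseudocount=1.0):
    """Estimate P(k-mer) and P(base | k-mer) with additive smoothing."""
    start, trans = Counter(), Counter()
    for seq in sequences:
        for i in range(len(seq) - K + 1):
            start[seq[i:i + K]] += 1
        for i in range(len(seq) - K):
            trans[(seq[i:i + K], seq[i + K])] += 1
    kmers = ["".join(p) for p in product(BASES, repeat=K)]
    total = sum(start.values()) + pseudocount * len(kmers)
    p_start = {k: (start[k] + pseudocount) / total for k in kmers}
    p_next = {}
    for k in kmers:
        ctx_total = sum(trans[(k, b)] for b in BASES) + pseudocount * len(BASES)
        for b in BASES:
            p_next[(k, b)] = (trans[(k, b)] + pseudocount) / ctx_total
    return p_start, p_next


def log_likelihood(seq, model):
    """log P(seq | model): first k-mer, then each base given the previous k."""
    p_start, p_next = model
    ll = math.log(p_start[seq[:K]])
    for i in range(len(seq) - K):
        ll += math.log(p_next[(seq[i:i + K], seq[i + K])])
    return ll


def posterior_coding(seq, cds_model, ncds_model, prior_coding=0.5):
    """Bayes' rule: P(coding | seq), computed in log space for stability."""
    log_c = log_likelihood(seq, cds_model) + math.log(prior_coding)
    log_n = log_likelihood(seq, ncds_model) + math.log(1.0 - prior_coding)
    m = max(log_c, log_n)
    return math.exp(log_c - m) / (math.exp(log_c - m) + math.exp(log_n - m))


# Toy training data (hypothetical); real use would need many CDS and intron sequences.
cds_model = train(["ATGTTCTACTTTGGCGCCGAGATGTTCTACTTTGAA"])
ncds_model = train(["TTTTATATATTTAAATATTTTTATTAATATATTTTA"])

S = "ATGTTCTACTTTG"  # the example sequence from the slide
print(round(posterior_coding(S, cds_model, ncds_model), 4))
```

    Working in log space avoids underflow for long sequences, and the prior is exposed as a parameter because the slide treats the priors as a separate step.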

  • Coding Likelihood (CL)

    Sliding windows of a sequence

    Simulation based on NCDS (introns)

    Window indices: 1 2 3 4 n
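    The two ideas on this slide, scoring sliding windows and calibrating against sequences simulated from the non-coding model, can be sketched as below. The base frequencies are invented, the order-0 log-likelihood ratio is only a stand-in for the CL score, and the 95th-percentile cutoff is an assumed choice; none of these come from the talk.

```python
import math
import random

# Hypothetical base frequencies; stand-ins for CDS and NCDS (intron) parameters.
CDS_FREQ = {"A": 0.25, "C": 0.25, "G": 0.30, "T": 0.20}
NCDS_FREQ = {"A": 0.32, "C": 0.17, "G": 0.17, "T": 0.34}


def windows(seq, size, step):
    """Yield (start, subsequence) for each full-length sliding window."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


def coding_score(seq):
    """Log-likelihood ratio of coding vs. non-coding (order-0 stand-in for CL)."""
    return sum(math.log(CDS_FREQ[b] / NCDS_FREQ[b]) for b in seq)


def null_threshold(size, n_sim=1000, quantile=0.95, seed=0):
    """Score cutoff taken from sequences simulated under the non-coding model."""
    rng = random.Random(seed)
    bases, weights = zip(*NCDS_FREQ.items())
    scores = sorted(
        coding_score("".join(rng.choices(bases, weights=weights, k=size)))
        for _ in range(n_sim)
    )
    return scores[int(quantile * (n_sim - 1))]


def scan(seq, size=30, step=3):
    """Return (start, score) for windows scoring above the simulated cutoff."""
    cutoff = null_threshold(size)
    hits = []
    for start, window in windows(seq, size, step):
        score = coding_score(window)
        if score > cutoff:
            hits.append((start, score))
    return hits


print(len(scan("G" * 60)), len(scan("T" * 60)))
```

    A step of 3 keeps the windows in a single codon phase, which is a design choice of this sketch rather than something stated on the slide.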

  • Divergence in post-translational modification

    Conservation of phosphorylation sites across species