identifying novel genes and gene refinements in … · 2016. 4. 26. · be eliminated in tte. •...

23
Identifying Novel Genes and Gene Refinements in Thermoanaerobacter tengcongensis Presented by Kun Zhang 1 Institute of Computing Technology, CAS, Beijing, China http://pfind.ict.ac.cn 20140620

Upload: others

Post on 01-Feb-2021

0 views

Category:

Documents


0 download

TRANSCRIPT

  • Identifying Novel Genes and Gene Refinements in Thermoanaerobacter

    tengcongensisPresented by Kun Zhang

    1

    Institute of Computing Technology, CAS, Beijing, Chinahttp://pfind.ict.ac.cn

    2014‐06‐20

  • Thermoanaerobacter tengcongensis

    • Thermoanaerobacter tengcongensis (TTE)– Discovered in Tengchong, Yunnan, China

    – Complete genome sequencing in 20021, 2.7Mb

    • Genome annotation2– Computational: ab initial, comparative

    – Experimental: RNA‐seq, EST, etc. (cDNA‐seq related)

    – TTE: 2588 coding genes (through ab initial, comparative)

    1. Bao Q, et al. A complete sequence of the T. tengcongensis genome. Genome Res 12, 689‐700 (2002).

    2. 1. Brent MR. Steady progress and recent breakthroughs in the accuracy of automated genome annotation. Nat Rev Genet 9, 62‐73 (2008). 2

  • Background

    • Deficiencies of the – High error rate in the genes predicted

    – Discriminate coding genes from non‐coding ones

    – Explain the translational event, such as alternative translational 

    initiate site, translational UTR (untranslational region)

    1. Andrews SJ, Rothnagel JA. Emerging evidence for functional peptides encoded by short open reading frames. Nat Rev Genet 15, 193‐204 (2014). 3

  • Background

    • Another experimental evidence: proteomics 

    data (LC‐MS/MS based)

    • For genome annotation: Proteogenomics– Direct evidence of ORF translation

    – Find novel genes and refine gene models that the traditional 

    approaches could not reach

    1. Ahrens CH, Brunner E, Qeli E, Basler K, Aebersold R. Generating and navigating proteome maps using mass spectrometry. Nat Rev Mol Cell Biol 11, 789‐801 (2010).

    2. Andrews SJ, Rothnagel JA. Emerging evidence for functional peptides encoded by short open reading frames. Nat Rev Genet 15, 193‐204 (2014). 4

    Utilize proteomics data to find novel genes and gene refinements in TTE.

  • LC‐MS/MS based proteomics• LC‐MS/MS based proteomics1

    1. Steen H, Mann M. The ABC's (and XYZ's) of peptide sequencing. Nat Rev MolCell Biol 5, 699‐711 (2004). 5

    PP EE PP TT II DD EE

    PP EE PP TT II DD EE

    PP EE PP TT II DD EE++

    + +

    + +

    B ions Y ions

  • Searching the protein database• Database search1

    1. Nesvizhskii AI, Vitek O, Aebersold R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat Methods 4, 787‐797 (2007). 6

    Reference proteomes of experientially verified or predicted protein sequences

  • 7Discipline

    Algorithm

    Software

    Application

    Aim Deep Analysis of Big Data for MS‐based Proteomics

    • Nature Methods 2012 : Identification of cross‐linked peptides from complex samples• PNAS 2012: Nematode sperm maturation involves serine protease inhibitor (Serpin)•MCP 2009: A strategy for large‐scale identification of core fucosylated glycoproteins• Proteomics 2012: The ovarian cancer‐derived proteome: A repertoire of tumor markers•Mol BioSys 2012: Improved pipeline for LC‐ETD‐MS/MS using charge enhancing methods• 2009 –2013: Protein identification service at the Institute of Zoology, CAS 

    Computational Proteomics

  • Proteogenomic analysis

    ORF Scoring & TDA Filtering

    A

    B

    C

    Peptide Spectrum Matching

    Proteogenimc Database MS Data

    .fasta MS2

    .fna TTE Genome

    .gtf

    ReferenceDatabase

    ORF0201ORF0013ORF0387ORF1029ORF1101ORF2033ORF1587ORF1323ORF1977REV_ORF0021

    2.032.102.112.162.222.232.252.272.282.31

    Ranked ORF Score

    Reliability

    D

    E

    chrRefined ORF Novel ORF Annotated ORF

    Genome search specific peptides

    Annotated ORF Homologous ORF

    Peptide from annotated ORF

    Annotated ORFchr

    Re‐annotating

    6‐ORF Translating

    8

  • Novel genes and gene refinements

    Reference chromosomeRefinedORF Novel ORF Annotated ORF

    Genome search specific peptides

    Annotated ORF Homologous ORF

    Peptide from annotated ORF

    Annotated ORFchromosome

    Re‐annotating

    Translated genome

    Database

    peptides

    peptides

    9

  • Data and database• Database: translated TTE genome• Data

    – Mass spectrometry: Q‐Exactive (HCD‐FT)– Tandem mass spectra: ~5 million

    • Search parameter– Engine: pFind v3.0– Default params for HCD‐FT mode

    • Quality control: ORF level FDR 

  • All ORFs identified

    Category Number Ratio

    Annotated ORF 2002 91%

    Refined ORF 112 5%

    Novel ORF 49 2%

    Homologous ORF 42 2%

    11

    • Validation of novel ORFs– Computational: the local FDR(posterior error probability) measures 

    the reliability of each ORF was calculated

    – Transcriptional evidence: the strand specific RNA‐seq data was also 

    used as a supporting evidence

    Table. All ORFs identified in this study.

  • Distribution of novel and reference ORFs

    0 500 1000 1500 20000

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    Ranked ORF

    Cum

    ulat

    ive

    perc

    enta

    ge o

    f OR

    F 2588 genenovel ORFlocal FDR

    FDR

  • Novel ORFs in length

    0 0.5 1 1.5 2 2.5 3 3.5 40

    5

    10

    15

    20

    25

    30

    35Novel ORFAnnotated ORF

    Perc

    enta

    ge (%

    )

    log (length)10

    ~150bp (50aa)

    13

    The length (bp) and RNA expression of the novel ORFs

    Novel ORFAnnotated ORF

    log (FPKM)10

    0

    5

    10

    15

    20

    25

    30

    35

    Perc

    enta

    ge (%

    )

    0 5-5

  • Genomic location

    • Novel ORFs were shorter in length (sORF)

    • Classification based on the distance (1000kb) to 

    annotated ORFs

    – UTR translation or operon structure (46)

    – Intragenic regions (1)

    – Overlapping with annotated ORFs (2)

    14

  • • Examples of identified novel ORFs– UTR translation

    – Overlapping ORF

    – Non‐canonical start codon

    15

  • UTR translationA

    B

    gi|20808287|ref|NP_623458.1| gi|20808288|ref|NP_623459.1| gi|20808289|ref|NP_623460.1|

    pep_EEYALR

    pep_LVPAVALPHPLGNPSLGPK

    pep_VALPHPLGNPSLGPK

    Novel ORF

    expression range [0 - 500]

    expression range [0 - 500]

    RNA-seq expression level on positive strand

    RNA-seq expression level on negative strand

    1,820,000 bp 1,821,000 bp 1,822,000 bp 1,823,000 bp 1,824,000 bp 1,825,000 bp

    ORF

    Identified peptides

    pep_AGLPTVHVCTLVPLSKpep_LVPAVALPHPLGNPSLGPKEEYALR

    Rela

    tive

    Int

    ensi

    ty(%

    )

    m/z200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400

    y1+

    147.

    11

    a2+

    185.

    16

    b2+

    213.

    16

    y2+

    244.

    16

    a3+

    282.

    18 y3+

    301.

    19

    a4+

    353.

    22

    b4+

    381.

    25

    y4+

    414.

    27

    a5+

    452.

    29

    y10+

    +49

    0.28

    y11+

    +55

    8.81

    y6+

    598.

    35y1

    2++

    607.

    34

    y13+

    +66

    3.88

    y14+

    +69

    9.40

    y7+

    712.

    40

    y15+

    +74

    8.93

    y8+

    769.

    42y1

    6++

    784.

    45

    y17+

    +83

    2.97

    y9+

    882.

    50

    y10+

    979.

    55

    y11+

    1116

    .61

    y12+

    1213

    .66

    y13+

    1326

    .75

    y14+

    1397

    .79

    y1

    Ky2

    Py3

    Gy4

    LSy6

    Py7

    Ny8

    Gy9

    Ly10

    Pb8

    y11

    Hb7

    y12

    Py13

    Lb5

    y14

    Ab4

    y15

    Vb3

    y16

    Ab2

    y17

    Py18

    VL3+100

    90

    80

    70

    60

    50

    40

    30

    20

    10

    0

    +20

    -20Erro

    r (pp

    m)

    0

    chrom

    16

  • Overlapping ORF

    gi|20808787|ref|NP_623958.1| gi|20808788|ref|NP_623959.1|

    pep_LLEESLLKR

    pep_LLEESLLK

    expression range [0 - 110]

    expression range [0 - 52]

    2,317,400 bp 2,317,600 bp 2,317,800 bp 2,318,000 bp 2,318,200 bp 2,318,400 bp 2,318,600 bp 2,318,800 bp

    Identified peptides

    RNA-seq expression level on positive strand

    RNA-seq expression level on negative strand

    ORF

    Rela

    tive

    Int

    ensi

    ty(%

    )

    m/z200 300 400 500 600 700 800 900 1000

    y1+

    175.

    12

    a2+

    199.

    18

    b2+

    227.

    18

    b4++

    243.

    13

    y3+

    416.

    29 y8++

    494.

    29

    [M]+

    +55

    0.84 y5

    +61

    6.41

    y6+

    745.

    46

    y7-N

    H3+

    857.

    48

    y7+

    874.

    50

    y8+

    987.

    58

    y1

    RKy3

    LLb4

    y5

    Sy6

    Eb2

    y7

    Ey8

    LL2+

    100

    90

    80

    70

    60

    50

    40

    30

    20

    10

    0

    +20

    -20Err

    or (p

    pm)

    0

    Novel ORF

    chrom

    17

    A

    B

  • Non‐canonical start codon

    18

    V G * L P * E V I A D P A S K I K N Y A L N F N E R I K R E I E E A I E K N E K L F Q K A N E R G P G T K H R * L L V L F R K EY G R V P A I R S Y C * S S I Q D * Q L C S Q F K G Q N E E * D R R R D G * K R K F V A K C K R K R A W N * T K I P T G L V A K R *L G K C P S N * * L M L L Q N S R I T L L I S I K G S K G R L R K Q S R R I K K * F S S Q M K E E Q G L K I D K Y S Y W S G S K E

    gi|20807821|ref|NP_622992.1|

    pep_LRENFNLAYNK pep_QFLKENK

    pep_ENFNLAYNK pep_NKELAEELERK

    pep_LRENFNLAY

    pep_ELAEELERK

    pep_QFLKENKELAEELER

    pep_ELAEELER

    expression range [0 - 944]

    expression range [0 - 774]

    1,356,680 bp 1,356,700 bp 1,356,720 bp 1,356,740 bp 1,356,760 bp 1,356,780 bp 1,356,800 bp 1,356,820 bp 1,356,840 bp 1,356,860 bp

    RNA-seq expression level on positive strand

    RNA-seq expression level on negative strand

    chrom

    Rela

    tive

    Int

    ensi

    ty(%

    )

    m/z100 200 300 400 500 600 700 800 900 1000 1100 1200

    y1-H

    2O+

    129.

    10

    y1+

    147.

    11

    y2-H

    2O+

    243.

    15

    y2+

    261.

    15b2

    +27

    0.19

    b6++

    387.

    70

    y3-N

    H3+

    407.

    19 y3+

    424.

    22

    b7++

    444.

    24

    y4-N

    H3+

    478.

    23y4

    +49

    5.26

    b4+

    513.

    28

    b9++

    561.

    29

    y5-N

    H3+

    591.

    31y5

    +60

    8.34

    a5+

    632.

    35

    b5+

    660.

    34[M

    ]-NH

    3-N

    H3+

    +67

    4.34

    [M]-N

    H3++

    682.

    85

    [M]+

    +69

    1.36

    b6+

    774.

    39

    a7+

    859.

    48

    b7+

    887.

    47

    a8+

    930.

    51

    b8+

    958.

    51

    a9+

    1093

    .58

    b9+

    1121

    .57

    a10+

    1207

    .60

    b10+

    1235

    .62

    b10

    y1

    Kb9

    y2

    Nb8

    y3

    Yb7

    y4

    Ab6

    y5

    Lb5N

    b4F

    y8

    Nb2ERL

    2+100

    90

    80

    70

    60

    50

    40

    30

    20

    10

    0

    +20

    -20Err

    or (p

    pm)

    0

    A

    B

  • 19

    Category Sensitivity Accuracy

    Transcriptomic ★★★ ★★

    Computational ★★ ★☆

    Proteomic ★★☆ ★★★

  • Conclusion• 2% ORFs are missing, short ORFs are tend to be eliminated in TTE.

    • Most of the novel ORFs identified in this study come from UTR region, operon, non‐canonical start codons.

    • Proteogenomics is a good source for genome refinement and novel gene discovery, which may beyond the reach of genomic and transcriptomic approaches.

    20

  • Acknowledgement

    21

    Associate ProfessorRui‐Xiang Sun

    Assistant ResearcherHao Chi

    M.D.Long Wu

    Associate ProfessorQuan‐Hui Wang

    ProfessorSi‐Min He

    ProfessorSi‐Qi Liu

    ProfessorXiao‐Hong Qian

    Associate ProfessorYi Zhao

    ProfessorLu Xie

    Ph.D. candidateDe‐Chao Bu

  • And, many thanks to our pFinders

    22

  • Q & A

    23