Predicting ncRNA genes in Zebrafish genome: a machine learning approach

Upload: orfeas-aidonopoulos

Post on 03-Jun-2018


  • 8/12/2019 Predicting ncRNA genes in Zebrafish genome: a maching learning approach

    Machine Learning Methods in Computational Biology

    Instructor: George S. Vernikos, PhD

    Predicting ncRNA genes in Zebrafish genome

    Relevance Vector Machine

    Aidonopoulos Orfeas

    MSc Student in Bioinformatics, May 2013

    1. Introduction

    Nowadays, one of the most challenging problems in computational biology is to transform the huge volume of data provided by newly developed technologies into knowledge. Machine learning has become an important tool for this task (1). Several techniques and methods have been developed to build models that can be trained and then make crucial decisions. Bayesian classifiers, logistic regression, discriminant analysis, classification trees, random forests, nearest neighbour, neural networks, support vector machines, ensembles of classifiers, partitional clustering, hierarchical clustering, mixture models, hidden Markov models, Bayesian networks and Gaussian networks are some of these methods.

    In our project the aim was to develop generalized linear models using the Relevance Vector Machine technique in order to predict ncRNA genes in genomic (DNA) sequences of the zebrafish genome. Non-coding RNAs (ncRNAs) are RNAs that are transcribed but not translated into protein. There are two kinds of ncRNA: short and long non-coding RNAs. Together they include the well-characterized transfer RNAs and ribosomal RNAs, snRNAs, snoRNAs and miRNAs, as well as a plethora of new ncRNAs that have been shown to play major roles in the cellular processes of all living organisms (2)(3). In addition, the functional role of long non-coding RNA in human carcinomas has been studied (4)(5).

    The relevance vector machine (RVM) is a machine learning method that exploits a probabilistic Bayesian learning framework and has an identical functional form to the well-known support vector machine (SVM) (6). An RVM can construct accurate prediction models that use dramatically fewer basis functions than an SVM while offering several additional advantages. The innovative aspect of an RVM is the probabilistic predictions it produces: it does not decide whether a data point belongs to a class or not, but assigns it a probability of belonging to that class. RVMs can be used for both classification and regression problems.
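    To make the idea of a probabilistic prediction concrete, the following minimal Python sketch shows how a linear model can return a class probability instead of a hard label. It assumes a logistic (sigmoid) link, the usual choice for RVM classification; the function name and the feature and weight values are illustrative and are not taken from the project.

```python
import math

def class_probability(features, weights, bias=0.0):
    """Probability that a sample belongs to the positive class.

    Assumption: a linear score passed through a logistic (sigmoid) link.
    """
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical values: two features and two weights.
p = class_probability([0.4, 0.3], [2.0, -1.0])
print(f"P(positive class) = {p:.3f}")  # prints P(positive class) = 0.622
```

    A sample is then reported with its probability (here about 0.62) rather than being forced into one class.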

    2. Previous works and our project

    Having searched for previous work on the prediction of non-coding RNA genes, we found only one paper that is related to our project and uses the RVM method. In (7), Down and Hubbard tried to gain important information from non-coding regions of similarity between genomes. Specifically, their aim was to extract the strongest signal from a set of non-coding conserved sequences using an RVM. This work showed that the predictions of the model were close to the start of annotated genes, as can be seen in the accompanying figure. They also verified that the promoter signal is the strongest single motif-based signal in the non-coding functional fraction of the genome, while subsets of these promoter regions have an abundance of CpG dinucleotides.

    Apart from the work of Down and Hubbard, several other approaches have been developed for the prediction of non-coding RNA genes or regions. In (8) the purpose was the classification of RNA sequence alignments based on the SCI and a z-score using a support vector machine. The SCI is a measure of RNA secondary structure conservation, while the z-score is a measure of the thermodynamic stability of alignments (normalized with respect to sequence length and base composition). In the corresponding figure, the green circles are the positive examples of the training set (native alignments) and the red crosses the negative ones (shuffled positions of random alignments). The background color, ranging from red to green, indicates the RNA class probability for different regions of the z-score/SCI plane.

    The annotation of non-coding RNA genes remains a major bottleneck in genome sequencing projects. Most genome sequences released today still come with sets of tRNAs and rRNAs as the only annotated RNA elements, ignoring hundreds of other RNA families. Several online tools have been created for this purpose. RNAspace.org (9) is one of the most recent tools for the prediction, annotation and analysis of lncRNA genes. ncRNA.org is another tool for finding long non-coding RNA genes in RNA sequences, which also gives information about the secondary structure of the results (Figure: ncRNA.org) (10). Several other tools and databases are listed in Table 2 of Gibb's publication (5) (Figure: long non-coding RNA online databases and tools).

    Figure: ncRNA.org

    Figure: long non-coding RNA online databases and tools

    3. Making the dataset

    As mentioned above, the aim of the present work was to predict ncRNA genes in genomic (DNA) sequences of the zebrafish genome using the 'relevance vector machine' method. Our training set was created as follows. The positive examples consisted of known ncRNA genes taken from http://www.ensembl.org/index.html. For the negative set, we downloaded all the protein-coding genes of zebrafish from the Ensembl website and randomly selected as many of them as there were ncRNA genes (about 4500 sequences). We did this in order to balance the number of positive and negative examples.
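    The balancing step can be sketched in Python as follows. This is only an illustration of drawing a random subset of negatives equal in size to the positive set; the function name, the fixed seed and the placeholder gene identifiers are assumptions, not the project's Perl code.

```python
import random

def sample_balanced_negatives(negatives, n_positives, seed=42):
    """Randomly pick as many negative examples as there are positives."""
    if n_positives > len(negatives):
        raise ValueError("not enough negative examples to match the positives")
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return rng.sample(negatives, n_positives)  # sampling without replacement

# Hypothetical identifiers standing in for protein-coding gene sequences.
negatives = [f"gene_{i}" for i in range(20)]
chosen = sample_balanced_negatives(negatives, n_positives=5)
```

    Sampling without replacement guarantees that no protein-coding gene appears twice in the negative set.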

    As we know, a GLM is a commonly used form of model for both classification and regression problems. It takes the following form:

    $$y(x; w) = w_0 + \sum_{i=1}^{M} w_i \, \phi_i(x)$$

    where $\phi_1, \dots, \phi_M$ is a set of basis functions (which can be arbitrary real-valued functions) and $w$ is a vector of weights. In other words, this is equal to:

    $$h(x) = w_0 + \mathrm{feature}_1 \times L(1) + \mathrm{feature}_2 \times L(2) + \dots \qquad \text{(Equation 1)}$$
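    As a small illustration of a generalized linear model built from arbitrary real-valued basis functions, the Python sketch below evaluates a weighted sum of basis-function values plus a constant term. The two basis functions, the weights and the constant are hypothetical choices made for the example.

```python
def glm_output(x, basis_functions, weights, w0=0.0):
    """Weighted sum of basis-function values plus a constant term."""
    if len(basis_functions) != len(weights):
        raise ValueError("one weight is needed per basis function")
    return w0 + sum(w * phi(x) for w, phi in zip(weights, basis_functions))

# Two illustrative basis functions computed on a DNA string.
def gc_fraction(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)

def a_fraction(seq):
    return seq.count("A") / len(seq)

# 0.5 + 2.0 * 0.5 - 1.0 * 0.5 = 1.0
value = glm_output("AACG", [gc_fraction, a_fraction], [2.0, -1.0], w0=0.5)
```

    Each feature of a sequence plays the role of one basis function, so adding a feature only means adding one callable and one weight.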

    In our project 40 features were used:

    1. Strand: 1, and 0 for the complementary strand
    2. Position in chromosome: percentage with respect to the length of the chromosome to which the sequence belongs
    3. Composition frequencies:
       a. Monomers (A, C, G, T)
       b. Dimers (AA, AG, …)
       c. Trimers (AAA, CCC, GGG and TTT)
    4. GC content
    5. Two base-composition ratios
    6. 11 motifs: to find motifs we visited http://www.motifsearch.com/, where a de novo DNA motif search can be run. We gave as input the DNA sequences of our positive set (ncRNA sequences) in FASTA format, and the output returned was 11 motifs:

    "AGCTAAGCA", "AAGCTAAGC", "GAAGCTAAG", "CATGGGAAA", "GCAGGGCTG", "CACTGAAGC", "ATGGGAGAC", "TGAAGCTAA", "GATGGGAGA", "GCTAAGCAG", "ACATGGGAA".
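    Part of the feature list above can be sketched in Python as follows. The sketch computes monomer and dimer frequencies, GC content and one value per motif. It assumes that frequencies are expressed as fractions of the sequence and that a motif is scored by simple presence or absence; the text does not state how the project scored them, and the function and key names are illustrative.

```python
from itertools import product

def composition_features(seq, motifs=()):
    """Monomer and dimer frequencies, GC content and motif presence flags."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    feats = {base: seq.count(base) / n for base in "ACGT"}
    for a, b in product("ACGT", repeat=2):
        dimer = a + b
        # Sliding window so that overlapping occurrences are counted.
        hits = sum(1 for i in range(n - 1) if seq[i:i + 2] == dimer)
        feats[dimer] = hits / (n - 1) if n > 1 else 0.0
    feats["GC_content"] = feats["G"] + feats["C"]
    for motif in motifs:
        feats[motif] = 1.0 if motif in seq else 0.0  # presence/absence (assumption)
    return feats

f = composition_features("AACG", motifs=["ACG", "TTT"])
```

    For the toy sequence "AACG" this gives an A frequency of 0.5, a GC content of 0.5, and a value of 1.0 for the motif "ACG".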

    The whole construction of our training set was implemented in Perl. The scripts, with their inputs and outputs, are in the folder 'TrainSet'. This folder contains two subfolders, 'NegSet' and 'PosSet', one for each data set (negative and positive examples). The final training set is the file named TrainingSet, which is the concatenation of the files FeaturesNegSet.bed and FeaturesPosSet.bed (see Figure).
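    The concatenation step could look like the following Python sketch (the project itself used Perl). The helper name is hypothetical, and the demonstration writes two tiny stand-in files to a temporary directory rather than touching the real FeaturesNegSet.bed and FeaturesPosSet.bed.

```python
import tempfile
from pathlib import Path

def concatenate_files(input_paths, output_path):
    """Write the contents of the input files, in order, into one output file."""
    with open(output_path, "w") as out:
        for path in input_paths:
            text = Path(path).read_text()
            out.write(text)
            if text and not text.endswith("\n"):
                out.write("\n")  # keep rows from different files on separate lines

# Demonstration with stand-in files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    neg = Path(tmp) / "neg.bed"
    pos = Path(tmp) / "pos.bed"
    neg.write_text("neg1\nneg2")
    pos.write_text("pos1\n")
    target = Path(tmp) / "TrainingSet"
    concatenate_files([neg, pos], target)
    merged = target.read_text()
```

    The newline check matters when the first file does not end with a line break; without it the last negative row and the first positive row would be fused.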


    If we want to test the performance of our model, we specify our test file at line 5 of the configuration file, and the last line lists 2 files. One is the output with the posterior ("post") probabilities and the other is the weights file (from the training step) on which the test run was based. The following figure shows the NetBeans environment in which the RVM runs every time.

    Figure: Configuration File of RVM

    Figure: Post Probabilities

    Figure: Weights of basis functions

    Figure: NetBeans IDE and ClassificationAccuracy

    2nd Step & Feature extraction and their importance

    First of all, it is important to say that for each training set we created a pointer table, so that we can see which feature each bar represents in the charts that follow. Below is the table with the pointers and their features:

    Pointer  Feature
    1        Strand
    2        Position
    3        A
    4        G
    5-6      C, T
    7        AA
    8        AG
    9-10     AC, AT
    11       GA
    12       GG
    13-14    GC, GT
    15-22    CA, CG, CC, CT, TA, TG, TC, TT
    23       AAA
    24       GGG
    25-26    CCC, TTT
    27       GC content
    28-29    Ratios
    30       "AGCTAAGCA"
    31       "AAGCTAAGC"
    32       "GAAGCTAAG"
    33       "CATGGGAAA"
    34       "GCAGGGCTG"
    35       "CACTGAAGC"
    36       "ATGGGAGAC"
    37       "TGAAGCTAA"
    38       "GATGGGAGA"
    39       "GCTAAGCAG"
    40       "ACATGGGAA"

    The following subchapters describe the models we created for each training step, how they were constructed, the features that were extracted, their importance and their possible role in the problem of predicting non-coding RNA genes from genomic sequences.

    Training with All Features

    First, as we said, all features were included in our training set. The RVM model excluded 9 of the 40 features: the Strand, two nucleotides (G and C) and six of the eleven motifs (AAGCTAAGC, GAAGCTAAG, TGAAGCTAA, GATGGGAGA, GCTAAGCAG, ACATGGGAA). The majority of the remaining features had a negative weight. In particular, as one can see in the Figure, only 9 features had a positive contribution to our model (positive weights): the Position of the sequence, the A and T nucleotides, the GC content and 5 motifs (AGCTAAGCA, CATGGGAAA, GCAGGGCTG, CACTGAAGC, ATGGGAGAC).
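    The grouping of features by the sign of their weight can be sketched in Python as below. The sketch assumes that a feature excluded by the RVM is reported with a weight of exactly zero; the function name and the example weights are hypothetical.

```python
def split_by_weight(weights):
    """Group feature names into excluded, positive and negative sets."""
    groups = {"excluded": [], "positive": [], "negative": []}
    for name, w in weights.items():
        if w == 0:
            groups["excluded"].append(name)  # assumption: excluded means weight 0
        elif w > 0:
            groups["positive"].append(name)
        else:
            groups["negative"].append(name)
    return groups

# Hypothetical weights, for illustration only.
example = {"Strand": 0.0, "Position": 0.001, "AAA": -0.33, "GC_content": 8.3}
groups = split_by_weight(example)
```

    The three lists can then be used to build reduced training sets containing only the excluded, only the positive or only the negative features.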

    We have therefore constructed our 1st generalized linear model. We cannot write the model out in full here because of the large number of basis functions (40 features). Instead, we provide a table with the weight of each basis function as extracted from the RVM:

    Feature                                    Weight
    Position                                   9.91e-04
    A                                          4.57569286
    T                                          8.69332563
    AA                                         -1.0602642
    AG                                         -3.8995319
    AC, AT                                     -3.69413, -4.0302605
    GA                                         -2.5201484
    GG                                         -2.144286
    Other ten dimers (GC, GT and those         -3.8467395, -4.1560565, -2.636189, -4.2876665, -2.392099,
    starting with C or T)                      -4.3660257, -2.5009519, -4.3658991, -4.1736209, -2.0366615
    AAA                                        -0.3333983
    GGG                                        -0.679385
    CCC, TTT                                   -0.5227896, -1.2794482
    GC content                                 8.28081651
    Ratios (two)                               -0.0465274, -8.08e-05
    "AGCTAAGCA"                                0.19184617
    "CATGGGAAA"                                0.37633093
    "GCAGGGCTG"                                0.20320254
    "CACTGAAGC"                                0.3232945
    "ATGGGAGAC"                                0.32747284

    Figure: Weights of all features

    The contribution of each structural feature to our model is evaluated through a function (R) that quantifies the relative feature importance, rather than the actual feature weight (W). Briefly, the importance R of each feature is expressed as the product of the corresponding weight and the corresponding standard deviation (SD) of the feature values in the training set. We prefer to assess the feature contribution to the model through R rather than W, because R takes into account the variability of the data set by scaling the weight with the corresponding SD (11). Moreover, it is easy to fall into the trap of assuming that a feature with a large positive weight necessarily enhances the classifier's separating ability. There are many cases in which a feature with a positive weight has a small R. This happens because of the feature's dispersion: a feature with a large positive weight but small dispersion (and consequently a small SD) will have a small R, since R = W × SD.
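    A short Python sketch of the importance calculation, assuming R is the weight multiplied by the standard deviation of the feature's values in the training set. Whether the project used the population or the sample standard deviation is not stated; the population form is used here as an assumption, and all names and numbers are illustrative.

```python
import statistics

def feature_importance(weights, feature_columns):
    """R = W * SD for every feature present in both inputs."""
    return {
        name: weights[name] * statistics.pstdev(values)
        for name, values in feature_columns.items()
    }

# Hypothetical data: "a" has a large weight but hardly varies,
# "b" has a smaller weight but varies a lot.
weights = {"a": 10.0, "b": 2.0}
columns = {"a": [0.50, 0.51, 0.49], "b": [0.0, 1.0, 0.0, 1.0]}
importance = feature_importance(weights, columns)
```

    Here feature "a" ends up with a much smaller R than feature "b" even though its weight is five times larger, which is exactly the trap described above.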

    So, by saving the NetBeans output to a file (dir: RVMRuns-PerlFiles/outputs/NetbeansCommandLineOuts), we can parse the SD of each feature and then calculate its significance R. This is implemented in a Perl script.

    The importance of each feature is depicted in the Figure. The chart confirms that the percentages of adenines and thymines in a sequence play an important role in deciding whether a DNA sequence contains an ncRNA gene (I(A) = 19.3669 and I(T) = 32.7061). But the most significant feature for making a decision is the percentage of GC content in the sequence (I(GC content) = 64.6421).

    Figure: Features Importance

    Training Set with InFeatures

    In order to verify our previous results, we constructed a dataset containing only the features to which the RVM gave a non-zero weight. The results for weights and importance are given below. As we can see from the Figure, once again most of the features had a negative weight. Only 8 of the 31 features were positively weighted. This verifies that most of them are negatively correlated with each other.

    Pointer  Feature
    1        Position
    2        A
    3        T
    4        AA
    5        AG
    6-7      AC, AT
    8        GA
    9        GG
    10-11    GC, GT
    12-19    CA, CG, CC, CT, TA, TG, TC, TT
    20       AAA
    21       GGG
    22-23    CCC, TTT
    24       GC content
    25-26    Ratios
    27       "AGCTAAGCA"
    28       "CATGGGAAA"
    29       "GCAGGGCTG"
    30       "CACTGAAGC"
    31       "ATGGGAGAC"

    Figure: Weights for the InFeature dataset

    As far as feature importance is concerned, it is again verified that GC content is the most significant feature (I = 68.0000). The quantities of thymines and adenines in a sequence come second and third on the list.

    Figure: InFeatures importance

    Training Set with OutFeatures

    In order to draw some conclusions on unseen data, we created a training set with the features that had a zero weight ("OutFeatures"). These features were the following:

    Pointer  Feature
    1        Strand
    2        G
    3        C
    4        "AAGCTAAGC"
    5        "GAAGCTAAG"
    6        "TGAAGCTAA"
    7        "GATGGGAGA"
    8        "GCTAAGCAG"
    9        "ACATGGGAA"

    After calculating the corresponding weights and then the importance, we can easily see that motifs 5-9 are strongly correlated. This is fairly logical, as these motifs have a very small presence in the sequences. Also, the motif "AAGCTAAGC" is highly non-significant. Strand and G do have some relationship, as we can see from the Figure, but they are verified as features that affect our model negatively (Figure).

    Figure: Weights of OutFeatures

    Figure: OutFeatures importance

    Training Set with Negative Features

    It is important to repeat that we prefer to interpret the weight W as a measure of how much features affect one another, rather than drawing conclusions from it about a feature's importance as a predictor. Nevertheless, in our project we chose not to ignore the meaning of negatively weighted features, as they indicate that there is a negative correlation between these basis functions. Consequently, we would like to observe how the negatively weighted features behave. The features to which the RVM gave a negative weight were the following:

    Pointer  Feature
    1        AA
    2        AG
    3-4      AC, AT
    5        GA
    6        GG
    7-8      GC, GT
    9-16     CA, CG, CC, CT, TA, TG, TC, TT
    17       AAA
    18       GGG
    19-20    CCC, TTT
    21-22    Ratios

    One thing that shows that most of the above basis functions affect our model in a "negative" way is the following chart (Figure). We can readily observe that most of the features continue to behave in the same way: most of them have a negative weight. Only two features are related to each other. This can be logically explained due to the …

    Figure: Importance of NegFeatures

    Training Set with Positive Features

    Lastly, we created a training set including the features to which the first RVM run gave a positive weight. Training on these features alone, we could conclude that the position, adenine, thymine, GC content and motifs 6 and 8 are strongly correlated (Figure). Not only is their relationship stable, but so is their importance. The Figure shows that their importance in a model that includes only these basis functions is steady. Last but not least, there are not many differences in the magnitude of importance (for most features the importance measure ranges from 0.15 to 0.55). So we cannot say, for example, that the motif "AGCTAAGCA" has a more important role than GC content.

    Pointer  Feature
    1        Position
    2        A
    3        T
    4        GC content
    5        "AGCTAAGCA"
    6        "CATGGGAAA"
    7        "GCAGGGCTG"
    8        "CACTGAAGC"
    9        "ATGGGAGAC"

    Figure: Weights of PosFeatures

    Figure: PosFeatures importance

    General results & Biological interpretation

    Taking all trainings into account, we can conclude that GC content is a measure that we ought to observe and study more closely. GC content stood out among all the features. Previous works have shown that GC-rich isochores include many protein-coding genes; thus determining the ratio of these specific regions contributes to mapping gene-rich regions of the genome. For example, as we said in the beginning, it has been shown that human genes associated with CpG islands increase in number as their percentage of guanine + cytosine increases, and that most genes associated with CpG islands are located in the GC-richest compartment of the human genome. For this reason, we created 2 distributions in order to see the differences between positive (known ncRNA genes) and negative (protein-coding genes) examples.

    From the above figures we can see that in the zebrafish genome protein-coding genes are a little richer in GC content than known ncRNAs, as in the human genome. The G+C content is greater than 40% in about 3500 ncRNA sequences and in about 3900 protein-coding genes.
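    Counts like these can be obtained with a few lines of Python. The sketch below computes the G+C percentage of each sequence and counts how many exceed a threshold; the function names and the four toy sequences are illustrative and are not the project's data.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def count_above_threshold(sequences, threshold=40.0):
    """Number of sequences whose G+C percentage exceeds the threshold."""
    return sum(1 for s in sequences if gc_percent(s) > threshold)

# Toy sequences with G+C percentages of about 66.7, 0, 100 and 33.3.
examples = ["GGCCAT", "ATATAT", "GCGCGC", "AATTGC"]
n_rich = count_above_threshold(examples)
```

    Running the same count separately on the positive and the negative set gives the two numbers that are compared in the text.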

    The percentages of adenine and thymine in the sequences were also extracted and verified as significant predictors. This is quite logical, since we showed that GC content plays an important role in our model: the remaining bases, A and T, make up the rest of each sequence and may therefore also play some role. The following distributions show the percentage of adenines and thymines (with respect to the length of each sequence) for both positive and negative examples.

    From the above charts we see, first of all, that the maximum content of both adenines and thymines in a sequence does not exceed 40%. The sequences that encode a protein are a little richer in adenines than the ones that give RNA genes, as in the case of GC content. On the other hand, protein-coding and ncRNA genes have the same thymine content. We cannot identify any major difference between our examples, except for the shape of the distribution of adenines in the two datasets: the distribution in protein-coding genes is more balanced than in ncRNA ones.

    Therefore, GC content is a significant predictor for deciding whether a DNA sequence will be translated into a protein or not.

    References

    1. Larrañaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, et al. Machine learning in bioinformatics. Brief. Bioinform. 2006 Mar 1;7(1):86-112.

    […] 2011 Oct 3;6(10):e25915.

    5. Gibb EA, Brown CJ, Lam WL. The functional role of long non-coding RNA in human carcinomas. Mol. Cancer. 2011 Apr 13;10(1):38.

    6. Tipping ME. Sparse Bayesian learning and the relevance vector machine. J Mach Learn Res. 2001 Sep;1:211-244.

    […] 2004 Sep 15;5(1):131.

    8. Washietl S, Hofacker IL, Stadler PF. Fast and reliable prediction of noncoding RNAs. Proc. Natl. Acad. Sci. U. S. A. 2005 Feb 15;102(7):2454-2459.

    […] 2008 Jul 1;36(suppl 2):W75.

    […] 2008 Feb 1;18(2):331.

    ;. Bashiet" S, 6ofacker IC, Stad"er P9. 9ast and re"iab"e #rediction of noncoding RNAs.Proc. Nat". Acad. Sci. H. S. A. 0>>3 9eb '3K'>0&7(:0232>; Qu" 'K14&su##" 0(:B73>; 9eb 'K';&0(:11'