
Entropy and Information

Eduardo Eyras Computational Genomics

Pompeu Fabra University - ICREA Barcelona, Spain

Master in Bioinformatics UPF 2017-2018

What are the best variables to describe our model?

Feature/attribute selection

Can we compare two probabilistic models?

Is our model informative? (different from random)

Which model is more informative?

E.g. see “The Information”, by James Gleick

We can use information

Messages (e.g. nucleotide sequences) can be encoded in different ways to be transmitted. E.g. binary encoding: the bit (binary digit) is a variable that can assume the value 0 or 1. Consider a 2-bit encoding of the nucleotides:

A = 00 , C = 01 , G = 10 , T = 11

Any string of nucleotides can then be expressed as a string of 0s and 1s, but always at 2 bits per symbol. E.g. consider the following sequence:

ACATGAAC = 00 01 00 11 10 00 00 01 = 0001001110000001

We have used 16 binary digits to encode 8 symbols, thus 2 bits per symbol. That is the expected number of bits per symbol using this encoding.

However, this assumes that all nucleotides are equally probable. This would not be an optimal encoding if in our sequences one of the symbols, e.g. A, appears more frequently than the others.
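A minimal Python sketch (added here for illustration, not part of the original slides) of the fixed 2-bit encoding above; the names CODE_2BIT and encode are illustrative only:

```python
# Fixed-length (2 bits per symbol) encoding of a nucleotide string.
CODE_2BIT = {"A": "00", "C": "01", "G": "10", "T": "11"}

def encode(seq, code):
    """Concatenate the code words of each symbol in the sequence."""
    return "".join(code[s] for s in seq)

seq = "ACATGAAC"
bits = encode(seq, CODE_2BIT)
print(bits)                  # 0001001110000001
print(len(bits) / len(seq))  # 2.0 bits per symbol
```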

Information

Consider a discrete random variable X with possible values {x1, …, xn}. The Shannon self-information of an outcome is defined as:

I(x_i) = − log2 P(x_i)

It is measured in bits. It is also called the surprisal of X for a given value x_i. If P(x_i) is very low, I(x_i) is very high (we are highly surprised to see x_i). If P(x_i) is close to 1, I(x_i) is almost zero (we are not surprised to see x_i at all). I(x_i) is the optimal code length assigned to a symbol x_i of probability P(x_i).

(Figure: plot of f(x) = − log2 x.)

Information

Consider a sequence where the nucleotides appear with the following probabilities:

P(A) = 1/2 , P(C) = 1/4 , P(G) = 1/8 , P(T) = 1/8

According to information theory, the optimal-length encoding is:

− log2 P(A) = 1 bit , − log2 P(C) = 2 bits , − log2 P(G) = 3 bits , − log2 P(T) = 3 bits

Considering the corresponding recoding

A = 1 , C = 01 , G = 000 , T = 001

now we have ACATGAAC = 1 01 1 001 000 1 1 01 = 10110010001101. We have used 14 binary digits to encode 8 symbols, hence 14/8 = 1.75 bits per symbol. We obtain a lower expected number of bits per symbol.

We should use an encoding such that the more frequent the symbol, the fewer bits we use for it. The average encoding length is then minimized.
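A short sketch (illustrative, not from the slides) checking that the variable-length code above uses fewer bits for the same sequence, and that its expected length is Σ P(x)·len(code(x)):

```python
# Variable-length (prefix) code matched to the symbol probabilities.
CODE_OPT = {"A": "1", "C": "01", "G": "000", "T": "001"}
PROB = {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125}

seq = "ACATGAAC"
bits = "".join(CODE_OPT[s] for s in seq)
print(bits, len(bits) / len(seq))   # 10110010001101 1.75

# Expected number of bits per symbol under the probabilities above.
expected = sum(p * len(CODE_OPT[x]) for x, p in PROB.items())
print(expected)                     # 1.75
```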

Information

Expected values

The expected value of a random variable X that takes on numerical values x_i is defined as:

E[X] = Σ_{i=1..n} x_i P(x_i)

which is the same thing as the mean. We can also calculate the expected value of a function of a random variable:

E[g(X)] = Σ_{i=1..n} g(x_i) P(x_i)

Entropy

Consider a string of values taken from {x1, …, xN} such that each value x_i appears n_i times, and

M = Σ_{i=1..N} n_i

The average number of bits per symbol needed to encode the message is written as:

(1/M) Σ_{i=1..N} n_i I(x_i) , where I(x_i) = − log2 P(x_i)

(1/M) Σ_{i=1..N} n_i I(x_i) = Σ_{i=1..N} (n_i / M) I(x_i)  →(M → ∞)→  Σ_{i=1..N} P(x_i) I(x_i) = − Σ_{i=1..N} P(x_i) log2 P(x_i)

The average number of bits per symbol converges to the expected value of the surprisal for a large number of observations.

Entropy

H(X) = − Σ_{i=1..N} P(x_i) log2 P(x_i)

Thus the entropy is defined as the expected (average) number of bits per symbol needed to encode a string of symbols x_i drawn from a set of possible ones {x1, …, xN}.
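The entropy can be computed directly from this definition; a small sketch (added for illustration):

```python
from math import log2

def entropy(probs):
    """H(X) = -sum_i P(x_i) * log2 P(x_i), skipping zero-probability outcomes."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Nucleotide distribution used earlier: P(A)=1/2, P(C)=1/4, P(G)=P(T)=1/8
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits, matching the optimal code length
print(entropy([0.25] * 4))                 # 2.0 bits for the uniform case
```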

Entropy

The log is taken in base 2 and the entropy is measured in bits. Entropy is also a measure of the uncertainty associated with a variable.

The entropy is minimal if we are absolutely certain about the outcome of a value from the distribution, for instance P(x_j) = 1 for a specific j, and P(x_i) = 0 for x_i ≠ x_j:

H(X) = − Σ_{i=1..N} P(x_i) log2 P(x_i) = − log2 P(x_j) = 0   (since P(x_j) = 1 and lim_{x→0} x log x = 0)

Entropy and uncertainty

Intuitively: we do not need any information to know the transmitted message.

Entropy achieves its unique maximum for the uniform distribution. That is, the entropy is maximal when all outcomes are equally probable, or equivalently, when we are maximally uncertain about the outcome of a random variable. For P(x_i) = 1/N:

H(X) = − Σ_{i=1..N} P(x_i) log2 P(x_i) = − N (1/N) log2(1/N) = log2 N

Intuitively: under equiprobability, the Shannon information for a character in an alphabet of size N is I = log2(N), i.e. we must wait for all binary bits to know the transmitted message.

The maximal entropy depends only on the number of symbols N.

Entropy and uncertainty

For N = 2, with 0 < p, q < 1 and p + q = 1:

H(X) = − p log2 p − q log2 q = − p log2 p − (1 − p) log2(1 − p)

(Figure: plot of the binary entropy H(X) as a function of p+: it is 0 at p = 0 and p = 1, and maximal, 1 bit, at p = 0.5.)

Entropy and uncertainty

Information content

If you are told the outcome of an event, the uncertainty is reduced from H to zero. A reduction of uncertainty is equivalent to an increase of information (an increase of the entropy always implies a loss of information). We define the information content as the reduction of the uncertainty after some message has been received, that is, the change in entropy:

Ic(X) = H_before − H_after

If we start from maximal uncertainty, H_before = log2 N and H_after = H(X) = − Σ_{i=1..N} P(x_i) log2 P(x_i), so:

Ic(X) = H_before − H_after = log2 N + Σ_{i=1..N} P(x_i) log2 P(x_i)

Note that the uncertainty is not necessarily reduced to zero.

What is the maximal reduction of entropy (maximum information content)? With H_before = log2 N and H_after = 0:

Ic(X) = H_before − H_after = log2 N   The maximum information content is log2 N.

Information content

(Figure: gene structure with start codon ATG, donor sites GT, acceptor sites AG, stop codon TGA, exons and introns; positions 1 2 3 4 5 6 7 … around the donor site.)

Stop codons: TGA 50%, TAA 25%, TAG 25%

Pos  P(n)  Ic
1  P(n) = 0.25 for all n
5  P(T) = 1
6  P(G) = P(A) = 0.5
7  P(A) = 0.75, P(G) = 0.25

Information content

Ic(X) = H_before − H_after = log2 N + Σ_{i=1..N} P(x_i) log2 P(x_i)

The more conserved the position, the higher the information content.

(Figure: the same gene structure, with positions 1 2 3 4 5 6 7 … around the donor site.)

Pos  P(n)  Ic
1  P(n) = 0.25 for all n  0
5  P(T) = 1  2
6  P(G) = P(A) = 0.5  1
7  P(A) = 0.75, P(G) = 0.25  1.19
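A sketch (added for illustration) computing Ic = log2 N + Σ P log2 P for the positions in the table above; note that position 7 gives ≈ 1.19 bits:

```python
from math import log2

def information_content(probs, n_symbols=4):
    """Ic = log2(N) + sum_i P(x_i) * log2 P(x_i): the reduction from maximal uncertainty."""
    return log2(n_symbols) + sum(p * log2(p) for p in probs if p > 0)

print(information_content([0.25, 0.25, 0.25, 0.25]))  # position 1: 0.0
print(information_content([1.0]))                     # position 5: 2.0
print(information_content([0.5, 0.5]))                # position 6: 1.0
print(information_content([0.75, 0.25]))              # position 7: ~1.19
```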

Information content

The information content of each position can be represented graphically using sequence logos: http://weblogo.berkeley.edu/

Frequency plot: the height of each letter is proportional to its frequency at that position.

Information content plot: the height of each letter is proportional to the information content at that position.

(Figure: example logos, with positions ranging from invariant and informative to almost random.)

Information content

Exercise: Consider the following 5 positions (columns 1-5) in a set of sequences:

12345
CTGAG
GTAGA
TTGAC
ATAGT
GTGAG
CTAAA
TTGAC
ATAAT

Each position i = 1, 2, 3, 4, 5 can be considered to correspond to a probabilistic model Pi(X) on the nucleotides. Calculate the entropy for each one of the positions. What is the maximum possible value of the entropy? Can you extract any information for each position from the entropy values?

Entropy-based measures

Joint entropy

Given variables X, Y, taking N possible values, their joint entropy is defined as:

H(X,Y) = − Σ_x Σ_y P(x,y) log2 P(x,y)

Entropy is additive for independent variables: if P(X,Y) = P(X)P(Y), then H(X,Y) = H(X) + H(Y).

Proof:

H(X,Y) = − Σ_x Σ_y P(x,y) log2 P(x,y)
       = − Σ_x Σ_y P(x)P(y) ( log2 P(x) + log2 P(y) )
       = − Σ_y P(y) Σ_x P(x) log2 P(x) − Σ_x P(x) Σ_y P(y) log2 P(y)
       = − Σ_x P(x) log2 P(x) − Σ_y P(y) log2 P(y)
       = H(X) + H(Y)
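A small numerical check of the additivity property (illustrative code, not from the slides); the joint distribution is built as a product, so it is independent by construction:

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def joint_entropy(pxy):
    """H(X,Y) = -sum_{x,y} P(x,y) log2 P(x,y); pxy maps (x, y) -> probability."""
    return -sum(p * log2(p) for p in pxy.values() if p > 0)

# Independent X and Y: P(x,y) = P(x) * P(y)
px = {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125}
py = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pxy = {(x, y): px[x] * py[y] for x in px for y in py}

print(joint_entropy(pxy))                           # 3.75
print(entropy(px.values()) + entropy(py.values()))  # 3.75 -> H(X,Y) = H(X) + H(Y)
```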

Conditional Entropy

The entropy of Y conditioned on X is defined as:

H(Y|X) = − Σ_x Σ_y P(x,y) log2 [ P(x,y) / P(x) ]

It quantifies the amount of information (remaining uncertainty) related to Y given that X is known. Similarly, the entropy of X conditioned on Y is:

H(X|Y) = − Σ_x Σ_y P(x,y) log2 [ P(x,y) / P(y) ]

1) If the value of Y is completely determined by the value of X ⇒ H(Y|X) = 0
2) If Y and X are independent ⇒ H(Y|X) = H(Y)

Exercise:

Using the definition of conditional entropy, show the chain rule:

H(X,Y) − H(Y) = H(X|Y)   and   H(X,Y) − H(X) = H(Y|X)

(a special case of this is H(X,Y) = H(X) + H(Y) when X and Y are independent)

Proof (the relation between joint and conditional entropy):

H(Y|X) = − Σ_x Σ_y P(x,y) log2 [ P(x,y) / P(x) ]
       = − Σ_x Σ_y P(x,y) log2 P(x,y) + Σ_x Σ_y P(x,y) log2 P(x)
       = H(X,Y) + Σ_x P(x) log2 P(x)      (definition of joint entropy, and P(x) = Σ_y P(x,y))
       = H(X,Y) − H(X)
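A quick numerical verification of the chain rule on a small, dependent joint distribution (illustrative values, added here for clarity):

```python
from math import log2

# A small (dependent) joint distribution P(x, y) over two binary variables.
pxy = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.2, ("B", "B"): 0.3}

px = {x: sum(p for (a, _), p in pxy.items() if a == x) for x in ("A", "B")}

H_XY = -sum(p * log2(p) for p in pxy.values())
H_X = -sum(p * log2(p) for p in px.values())
# Conditional entropy from its definition: H(Y|X) = -sum P(x,y) log2 [P(x,y)/P(x)]
H_Y_given_X = -sum(p * log2(p / px[x]) for (x, _), p in pxy.items())

print(round(H_Y_given_X, 6) == round(H_XY - H_X, 6))  # True: H(Y|X) = H(X,Y) - H(X)
```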

Other properties:

The joint entropy is always larger than (or equal to) the individual entropies, and it is always smaller than (or equal to) the sum of the individual entropies:

max{ H(X), H(Y) } ≤ H(X,Y) ≤ H(X) + H(Y)

The same holds for any number of variables:

max{ H(X1), …, H(XN) } ≤ H(X1, …, XN) ≤ H(X1) + … + H(XN)

The equality H(X,Y) = H(X) + H(Y) only holds when the variables are independent. So the difference is due to the dependencies between the variables, which can be measured with the mutual information.

Mutual information

MI(X,Y) = Σ_x Σ_y P(x,y) log2 [ P(x,y) / ( P(x) P(y) ) ]

The mutual information describes the difference between the individual entropies of two variables and their joint entropy:

MI(X,Y) = H(X) + H(Y) − H(X,Y)
        = − Σ_x P(x) log2 P(x) − Σ_y P(y) log2 P(y) + Σ_x Σ_y P(x,y) log2 P(x,y)
        = − Σ_x Σ_y P(x,y) log2 P(x) − Σ_x Σ_y P(x,y) log2 P(y) + Σ_x Σ_y P(x,y) log2 P(x,y)
        = Σ_x Σ_y P(x,y) log2 [ P(x,y) / ( P(x) P(y) ) ]

Mutual information

MI(X,Y) = Σ_x Σ_y P(x,y) log2 [ P(x,y) / ( P(x) P(y) ) ]

Mutual information measures the dependencies between two variables:

MI(X,Y) measures the information in X that is shared with Y.
If X and Y are independent, H(X) + H(Y) = H(X,Y) and MI takes the value zero (knowing one does not help in knowing the other).
MI is symmetric: MI(X,Y) = MI(Y,X).
If the two variables are identical, knowing one does not add to the other, hence MI is equal to the entropy of a single variable.
E.g. X and Y take as values the nucleotides in two different positions, and the sums are carried out over the alphabet of nucleotides. Positions X and Y do not need to be contiguous. MI is not easy to extend to more than 2 positions.

Using the relation between joint and conditional entropy, H(X,Y) − H(Y) = H(X|Y) or H(X,Y) − H(X) = H(Y|X), we can rewrite the mutual information MI(X,Y) = H(X) + H(Y) − H(X,Y) in terms of the conditional entropy:

MI(X,Y) = H(X) − H(X|Y)
MI(X,Y) = H(Y) − H(Y|X)

Mutual information

The joint entropy can be decomposed as the contribution from both variables together:

H(X,Y) = MI(X,Y) + H(X|Y) + H(Y|X) , equivalently MI(X,Y) = H(X,Y) − H(X|Y) − H(Y|X)

H(X,Y) = H(X|Y) + H(Y) = H(Y|X) + H(X)

(Figure: Venn diagram of H(X) and H(Y); H(X|Y) is the entropy remaining in X once we know Y, H(Y|X) the entropy remaining in Y once we know X, and MI(X,Y) is the overlap.)
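A small check (illustrative code, not from the slides) that the definition of MI and the entropy decomposition agree on a concrete joint table:

```python
from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Joint distribution of two dependent binary variables (illustrative values).
pxy = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.2, ("B", "B"): 0.3}
px = {x: sum(p for (a, _), p in pxy.items() if a == x) for x in ("A", "B")}
py = {y: sum(p for (_, b), p in pxy.items() if b == y) for y in ("A", "B")}

# MI from its definition...
mi_def = sum(p * log2(p / (px[x] * py[y])) for (x, y), p in pxy.items() if p > 0)
# ...and from the entropy decomposition MI = H(X) + H(Y) - H(X,Y)
mi_ent = H(px.values()) + H(py.values()) - H(pxy.values())

print(round(mi_def, 4), round(mi_ent, 4))  # both ~0.12 bits for this table
```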

Exercise:

Consider the following multiple alignment (with just two symbols A and B). Consider the positions X and Y:

X Z Y W
A B A B
A A A B
A B B A
A A B B
B B B A
B A B A

Calculate: (a) H(X), H(Y) (b) H(X|Y), H(Y|X) (c) H(X,Y) (d) H(Y) − H(Y|X) (e) MI(X,Y)

Recall:

Entropy: H(X) = − Σ_x P(x) log2 P(x)
Joint entropy: H(X,Y) = − Σ_x Σ_y P(x,y) log2 P(x,y)
Conditional entropy: H(Y|X) = − Σ_x Σ_y P(x,y) log2 [ P(x,y) / P(x) ]
Mutual information: MI(X,Y) = Σ_x Σ_y P(x,y) log2 [ P(x,y) / ( P(x) P(y) ) ]

Exercise (continued):

Consider the same multiple alignment and the positions X and Y:

X Z Y W
A B A B
A A A B
A B B A
A A B B
B B B A
B A B A

For example, P(X=A) = 2/3 and P(X=A, Y=B) = 1/3.

H(X,Y) = − Σ_{x∈{A,B}} Σ_{y∈{A,B}} P(x,y) log2 P(x,y)
       = − P(A,A) log2 P(A,A) − P(A,B) log2 P(A,B) − P(B,A) log2 P(B,A) − P(B,B) log2 P(B,B)

Here P(B,A) means P(X=B, Y=A), etc.

Kullback-Leibler divergence of two distributions

Also called the relative entropy, it is the expected value of the log-ratio of two distributions:

log2 L = log2 [ P(x) / Q(x) ]

D_KL(P||Q) = E[ log2 L ] = Σ_{i=1..n} P(x_i) log2 [ P(x_i) / Q(x_i) ]

The relative entropy is defined for two probability distributions that take values over the same alphabet (same symbols):

D_KL(P||Q) = Σ_x P(x) log2 [ P(x) / Q(x) ]

The relative entropy is not a distance, but measures how different two distributions are. The value is never negative, and it is zero when the two distributions are identical:

D_KL(P||Q) ≥ 0 , with equality for P = Q

It is not symmetric: D_KL(P||Q) ≠ D_KL(Q||P)

The relative entropy provides a measure of the information content gained with the distribution P with respect to the distribution Q. Its applications are similar to those of the information content.
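A minimal sketch of the relative entropy (illustrative, not from the slides); the chosen P and Q are arbitrary example distributions:

```python
from math import log2

def kl_divergence(p, q):
    """D_KL(P||Q) = sum_i P(x_i) log2( P(x_i) / Q(x_i) ); assumes Q > 0 wherever P > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.1, 0.1, 0.1]
q = [0.25, 0.25, 0.25, 0.25]   # uniform background

print(kl_divergence(p, q))  # ~0.643 = log2(4) - H(P): information content of P vs uniform Q
print(kl_divergence(q, p))  # ~0.620: a different value, D_KL is not symmetric
print(kl_divergence(p, p))  # 0.0 when the two distributions are identical
```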

Exercise (exam 2013):

Consider two discrete probability distributions P and Q over the same alphabet, such that Σ_i P(x_i) = 1 and Σ_i Q(x_i) = 1. Show that the relative entropy D_KL(P||Q) is equivalent to the information content of P when the distribution Q is uniform.

Jensen-Shannon divergence

The Jensen-Shannon divergence provides another way of measuring the similarity of two probability distributions:

JS(P,Q) = (1/2) D_KL(P||M) + (1/2) D_KL(Q||M) , with M = (1/2)(P + Q)

Jensen-Shannon divergence

Starting from JS(P,Q) = (1/2) D_KL(P||M) + (1/2) D_KL(Q||M), with M = (1/2)(P + Q) and D_KL(P||Q) = Σ_x P(x) log2 [ P(x) / Q(x) ]:

JS(P,Q) = (1/2) Σ_x P(x) log2 [ 2 P(x) / ( P(x) + Q(x) ) ] + (1/2) Σ_x Q(x) log2 [ 2 Q(x) / ( P(x) + Q(x) ) ]
        = − Σ_x [ ( P(x) + Q(x) ) / 2 ] log2 [ ( P(x) + Q(x) ) / 2 ] + (1/2) Σ_x P(x) log2 P(x) + (1/2) Σ_x Q(x) log2 Q(x)
        = H( (P + Q) / 2 ) − (1/2) H(P) − (1/2) H(Q)

You can generalize this to N variables (distributions):

JS(X1, …, XN) = H( (X1 + … + XN) / N ) − (1/N) ( H(X1) + … + H(XN) )

Jensen-Shannon divergence

JS(P,Q) = (1/2) D_KL(P||M) + (1/2) D_KL(Q||M)

JS(X,Y) is symmetric: JS(X,Y) = JS(Y,X)
It is non-negative: JS(X,Y) ≥ 0 , with JS(X,Y) = 0 ⇔ X = Y

The square root, d(X,Y) = sqrt( JS(X,Y) ), is a metric (= a distance) and distributes normally.

Properties of a metric:
d(X,Y) ≥ 0
d(X,Y) = 0 ⇔ X = Y
d(X,Y) = d(Y,X)
d(X,Y) ≤ d(X,Z) + d(Z,Y)

With base-2 logarithms, 0 ≤ d(X,Y) ≤ 1, so a similarity can be defined as s(X,Y) = 1 − d(X,Y).
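A compact sketch of the JS divergence and its square-root distance (illustrative function names; the example distributions are arbitrary):

```python
from math import log2, sqrt

def entropy(p):
    return -sum(x * log2(x) for x in p if x > 0)

def js_divergence(p, q):
    """JS(P,Q) = H((P+Q)/2) - H(P)/2 - H(Q)/2, with base-2 logs (bounded by 1)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return entropy(m) - entropy(p) / 2 - entropy(q) / 2

def js_distance(p, q):
    """Square root of the JS divergence, which satisfies the metric axioms."""
    return sqrt(max(js_divergence(p, q), 0.0))  # clamp tiny negative rounding errors

p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.25, 0.25, 0.25]
print(js_distance(p, q), js_distance(q, p))  # symmetric
print(js_distance(p, p))                     # 0.0 for identical distributions
```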

Example:

JS divergence can be used to compute a dissimilarity between distributions. Consider the expression of 5 genes in 4 samples:

          Gene 1  Gene 2  Gene 3  Gene 4  Gene 5
Sample 1   4.00    3.00    2.00    1.00    0.10
Sample 2   0.10    1.00    2.00    3.00    4.00
Sample 3   5.00    2.00    5.00    1.00    3.00
Sample 4   2.00    2.00    2.00    2.00    2.00

Normalize gene expression per sample: P(sample = s, gene = g) = e(s,g) / Σ_{g'} e(s,g')

          Gene 1  Gene 2  Gene 3  Gene 4  Gene 5
Sample 1   0.40    0.30    0.20    0.10    0.01     (e.g. 0.40 ≈ 4.00 / 10.10)
Sample 2   0.01    0.10    0.20    0.30    0.40
Sample 3   0.31    0.13    0.31    0.06    0.19
Sample 4   0.20    0.20    0.20    0.20    0.20

See: Berretta R, Moscato P. Cancer biomarker discovery: the entropic hallmark. PLoS One. 2010 Aug 18;5(8):e12262.

Which samples are most similar? And which are most different?

Example:

The entropy of each sample's expression profile, H(s) = − Σ_g P(g) log2 P(g):

           H      H / log2(5)  (normalized to 1)
Sample 1   1.91   0.82
Sample 2   1.91   0.82
Sample 3   2.13   0.92
Sample 4   2.32   1.00

Samples 1 and 2 have the same entropy but different gene expression profiles. Entropy describes how expression is distributed, but it is not a good measure of distance/similarity.

Example:

Compute the pairwise Jensen-Shannon divergences, e.g.

JS(1,2) = H( (P1 + P2) / 2 ) − (1/2) H(P1) − (1/2) H(P2)

           Sample 1  Sample 2  Sample 3  Sample 4
Sample 1     0        0.28      0.07      0.82
Sample 2     0.28     0         0.28      0.82
Sample 3     0.07     0.28      0         0.03
Sample 4     0.82     0.82      0.03      0

The closest expression profiles are sample 3 and sample 4. The most distant ones are sample 1 (or 2) and sample 4.

Examples of application of JSD in tissue-specific expression

The JS divergence (square root) has been used to establish the similarities between expression patterns in tissues from different species:

Merkin J, Russell C, Chen P, Burge CB. Evolutionary dynamics of gene and isoform regulation in Mammalian tissues. Science. 2012 Dec 21;338(6114):1593-9.

Cabili MN, Trapnell C, Goff L, Koziol M, Tazon-Vega B, Regev A, Rinn JL. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev. 2011 Sep 15;25(18):1915-27.

Entropy and Classification

Entropy can be interpreted as a measure of the homogeneity of examples according to the classification values.

Consider a sample S of training examples in a binary classification problem, with p+ the proportion of positive cases and p− the proportion of negative cases:

H(S) = − p+ log2 p+ − p− log2 p−

Entropy and classification

(Figure: plot of Entropy(S) as a function of p+, ranging from 0 at p+ = 0 or 1 to a maximum of 1 bit at p+ = 0.5.)

Entropy measures the impurity of the set S: H ≈ 0 means mostly one class; H ≈ Hmax means a random mixture of classes.

Consider a collection of 14 examples (with 4 attributes) for a boolean classification (PlayTennis = yes/no).

Entropy and classification

The entropy of this classification is (yes = 9, no = 5):

H(S) = − (9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

Recall: the entropy is minimal (zero) if all members belong to the same class, and maximal (= log2 N) if there is an equal number of members in each class.

Generally, if the classification can take N different values:

H(S) = − Σ_{i=1..N} P(s_i) log2 P(s_i)

where P(s_i) is the proportion of cases in S belonging to the class s_i.

Entropy and classification

We can measure the effectiveness of an attribute in classifying the training data as the "information gain". The information gain of an attribute A relative to a collection S is defined as the mutual information of the collection and the attribute:

IG(S,A) = MI(S,A) = H(S) − H(S|A)

IG(S,A) measures how much information we gain in the classification by knowing the value of attribute A. IG(S,A) is the expected reduction in entropy caused by partitioning the examples according to one attribute.

Entropy and classification

Information gain: IG(S,A) = MI(S,A) = H(S) − H(S|A)

H(S) = − Σ_{s∈{classes}} P(s) log2 P(s)   is the total entropy of the system according to the classes.

H(S|A) = − Σ_{a∈{values}} Σ_{s∈{classes}} P(s,a) log2 [ P(s,a) / P(a) ] = − Σ_{a∈{values}} P(a) Σ_{s∈{classes}} P(s|a) log2 P(s|a)

where P(a) is the proportion of examples for each value of attribute A, and − Σ_s P(s|a) log2 P(s|a) is the entropy according to the classes restricted to a specific value of attribute A.

Using Sa = {s ∈ S | A(s) = a}, the subset of the collection S for which attribute A has value a, we can rewrite it as:

MI(S,A) = − Σ_{s∈{classes}} P(s) log2 P(s) + Σ_{a∈{values}} P(a) Σ_{s∈{classes}} P(s|a) log2 P(s|a) = H(S) − Σ_{a∈{values}} (|Sa| / |S|) H(Sa)

IG(S,A) = H(S) − Σ_{a∈Values(A)} (|Sa| / |S|) H(Sa)

Entropy and classification

IG(S,A) = MI(S,A) = H(S) − H(S|A) = H(S) − Σ_{a∈Values(A)} (|Sa| / |S|) H(Sa)

H(S) is the total entropy of the system. Values(A) is the set of all possible values for attribute A (e.g. Outlook = {rain, overcast, sunny}). Sa is the subset of the collection S for which attribute A has value a: Sa = {s ∈ S | A(s) = a}. |Sa| / |S| is the fraction of the collection for which attribute A has value a.

Entropy and classification

IG(S,A) = MI(S,A) = H(S) − H(S|A) = H(S) − Σ_{a∈Values(A)} (|Sa| / |S|) H(Sa)

The second term contains the entropy of the elements with a given value of attribute A:

H(Sa) = − Σ_{i=1..N} P(s_i | a) log2 P(s_i | a)

where P(s|a) is the proportion of cases with value A = a that are classified in class s. The second term is thus the sum of the entropies of each subset Sa weighted by the fraction of cases. IG(S,A) is the information (reduction in entropy) provided by knowing the value of an attribute (weighted by the proportions of the attribute values).

Information Gain (IG) is defined as the mutual information between the group labels of the training set S and the values of a feature (or attribute) A:

IG(S,A) = MI(S,A) = H(S) − H(S|A)

Gain Ratio (GR) is the mutual information of the group labels and the attribute, normalized by the entropy contribution from the proportions of the samples according to the partitioning by the attribute:

GR(S,A) = MI(S,A) / H(A)

Symmetrical Uncertainty (SU) provides a symmetric measurement of feature correlation with the labels and compensates possible biases of the other two measures:

SU(S,A) = 2 MI(S,A) / ( H(S) + H(A) )

See: Hall M. Correlation-based Feature Selection for Discrete and Numeric Class Machine Learning. ICML'00 Proceedings of the Seventeenth International Conference on Machine Learning, 2000, pages 359-366.

Entropy and classification

Example: consider a collection of 14 examples S: [9+, 5−] (9 Yes, 5 No).

Values(Wind) = {strong, weak} , S = [9+, 5−] , S(weak) = [6+, 2−] , S(strong) = [3+, 3−]

IG(S, wind) = H(S) − Σ_{v∈{weak,strong}} (|Sv| / |S|) H(Sv)
            = H(S) − (8/14) H(S_weak) − (6/14) H(S_strong)
            = 0.940 − (8/14)(0.811) − (6/14)(1.00) = 0.048

(Recall H(S) = − Σ_{i=1..N} P(s_i) log2 P(s_i).)

IG(S, humidity) = H(S) − Σ_{v∈{high,normal}} (|Sv| / |S|) H(Sv)
                = H(S) − (7/14) H(S_high) − (7/14) H(S_normal)
                = 0.940 − (7/14)(0.985) − (7/14)(0.592)
                = 0.151

What are the implications of this?

IG(S, humidity) > IG(S, wind)
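A short sketch (illustrative, not from the slides) reproducing these information gains from class counts; the counts used here are the PlayTennis splits quoted above (wind: weak [6+,2−], strong [3+,3−]; humidity: normal [6+,1−], with high [3+,4−] inferred from the totals and consistent with the entropies 0.592 and 0.985):

```python
from math import log2

def H(counts):
    """Entropy of a class-count vector, e.g. [9, 5] for 9 'yes' and 5 'no'."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(total_counts, partition):
    """IG = H(S) - sum_a |Sa|/|S| * H(Sa), where partition maps value -> class counts."""
    n = sum(total_counts)
    return H(total_counts) - sum(sum(c) / n * H(c) for c in partition.values())

S = [9, 5]                                                  # PlayTennis: 9 yes, 5 no
print(info_gain(S, {"weak": [6, 2], "strong": [3, 3]}))     # IG(S, wind)     ~0.048
print(info_gain(S, {"high": [3, 4], "normal": [6, 1]}))     # IG(S, humidity) ~0.151
```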

Entropy and classification

(Figure: the 14 examples split by Humidity, with subset entropies 0.985 for high and 0.592 for normal [6+,1−], and by Wind, with subset entropies 0.811 for weak and 1 for strong.)

"Humidity" provides greater information gain than "Wind" relative to the target classification (yes/no). The attribute "Humidity" is a better classifier: if we only use "Humidity" to classify, we are closer to the target classification (yes/no).

Which attribute would be the best classifier if tested alone?

IG(S, outlook) = 0.246
IG(S, humidity) = 0.151
IG(S, wind) = 0.048
IG(S, temperature) = 0.029

Outlook performs the best prediction of the target value "play tennis".

Entropy and classification

Every example with outlook = "overcast" is labeled as yes -> leaf node with classification "yes".

The other descendants (sunny and rain) still have non-zero entropy -> continue down these nodes.

Which attribute should be tested here? Consider Ssunny = {D1, D2, D8, D9, D11}:

IG(Ssunny, humidity) = 0.970 − (3/5)(0.0) − (2/5)(0.0) = 0.970
IG(Ssunny, temperature) = 0.970 − (2/5)(0.0) − (2/5)(1.0) = 0.570
IG(Ssunny, wind) = 0.970 − (2/5)(1.0) − (3/5)(0.918) = 0.019

Incorporating continuous-valued attributes

So far we have used attributes with discrete values (e.g. Wind = weak, strong). We can dynamically define discrete-valued attributes by partitioning the continuous attribute values A into a discrete set of intervals: we define a new boolean attribute Ac that is true if A < c and false otherwise. Consider:

Temperature (°C):  5   10   15   20   25   30
PlayTennis:        No  No   Yes  Yes  Yes  No

Pick a threshold that produces the largest information gain: sort the examples according to the attribute, test only boundaries between adjacent examples with different target classification, and choose the boundary with the largest information gain.

Incorporating continuous-valued attributes

Temperature (°C):  5   10   15   20   25   30
PlayTennis:        No  No   Yes  Yes  Yes  No

The candidate boundaries lie between adjacent examples with different target classification:

(10 + 15) / 2 = 12.5 ⇒ temperature > 12.5
(25 + 30) / 2 = 27.5 ⇒ temperature > 27.5

Choosing the boundary with the largest information gain dynamically creates a boolean attribute, here temperature > 12.5. An alternative is to use multiple (discrete) intervals.

Incorporating continuous-valued attributes

Temperature (°C):  5   10   15   20   25   30
PlayTennis:        No  No   Yes  Yes  Yes  No

Equivalently, pick the point a0 that produces the minimum entropy after separating the attribute values by this threshold:

(|S_{a<a0}| / |S|) H(S_{a<a0}) + (|S_{a>a0}| / |S|) H(S_{a>a0})

That is, we minimize the second term on the right side of the IG definition:

IG(S,A) = H(S) − Σ_{a∈Values(A)} (|Sa| / |S|) H(Sa)

See e.g. Fayyad U, Irani K. (1993) Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pages 1022-1027. Morgan Kaufmann.
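A small sketch (illustrative; function names are assumptions) of this threshold selection for the temperature example above, minimizing the weighted entropy over candidate midpoints:

```python
from math import log2

def H(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))

def best_threshold(values, labels):
    """Try midpoints between adjacent examples with different labels; keep the one
    that minimizes the weighted entropy of the two resulting subsets."""
    pairs = sorted(zip(values, labels))
    best = None
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 == l2:
            continue
        c = (v1 + v2) / 2
        left = [l for v, l in pairs if v < c]
        right = [l for v, l in pairs if v >= c]
        weighted = (len(left) * H(left) + len(right) * H(right)) / len(pairs)
        if best is None or weighted < best[1]:
            best = (c, weighted)
    return best

temps = [5, 10, 15, 20, 25, 30]
play = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temps, play))  # the 12.5 boundary wins over 27.5 for these data
```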

(Figure: the resulting PlayTennis decision tree. Outlook? Sunny -> Humidity? (High -> No, Normal -> Yes); Overcast -> Yes; Rain -> Wind? (Strong -> No, Weak -> Yes).)

Entropy and classification

Information gain allows you to find the attributes (variables) that are most informative to test/measure for the classification problem.

Repeating this process allows you to build a tree. Any attribute can appear only once along any path in the tree. This process continues until either:
1) every attribute has already been included along this path, or
2) all training examples have the same target value (entropy is zero).

Entropy and classification

(Figure: the PlayTennis decision tree shown above.)

Decision trees

Decision nodes: each specifies a test on a single attribute, with one branch for each outcome of the test.

Leaf nodes: the value of the target (for classification, in general the most probable class).

A decision tree is used to classify a new instance (example) by starting from the root of the tree and moving down through it until a leaf node is reached

Decision trees

1) Decision trees are best suited for instances that are represented by attribute-value pairs, e.g. attribute = temp, Values(temp) = {hot, mild, cold}.
2) The target classification should take discrete values (2 or more), e.g. "yes"/"no", although this can be extended to real-valued outputs.
3) Decision trees naturally represent disjunctions of conjunctions of descriptions. E.g. I play tennis if:

(outlook=sunny AND humidity=normal) OR (outlook=overcast) OR (outlook=rain AND wind=weak)

4) Decision trees are robust to errors in the training data.
5) Decision trees can be used even when there is some missing data.

Overfitting

We build the tree deep enough to perfectly classify the training examples. Too few (or too noisy) examples may cause overfitting. Overfitting occurs when the model reproduces the training data perfectly, but at the cost of performing worse, or not being valid, for new cases. This can be detected using cross-validation.

To avoid overfitting:
1) Stop growing the tree before it classifies the training data perfectly, or
2) Fully grow the tree, and then post-prune some branches.

Consider one extra (noisy) example: (outlook=sunny, temperature=hot, humidity=normal, wind=strong, play tennis=no) How does it affect our earlier tree?

Overfitting

(Figure: the earlier PlayTennis decision tree.)

The new tree would fit the training data perfectly, but the earlier tree will perform better in general on new examples.

ID3 Algorithm

Consider a classification of examples with two class values: "+" or "−".

ID3 (Examples S, Target_labels (classes), Attributes)
  Create a root node for the tree
  If all examples are positive, return a single-node tree with label "+"
  Else if all examples are negative, return a single-node tree with label "−"
  Else if Attributes is empty, return a single-node tree with the most common label in Examples
  Else
    pick the attribute A that best classifies Examples (maximizes Gain(S,A))
    assign attribute A to the root of the tree
    For each value a of A
      add a new tree branch below the root corresponding to the test A = a
      let Examples(a) be the subset of Examples that have value a for A
      If Examples(a) is empty
        add a leaf node with the most common label in Examples
      Else
        add the subtree ID3(Examples(a), Target_labels, Attributes − {A})
    End
  Return tree

Notes:
First we deal with the extreme cases (all positive, all negative, no attributes left).
The "most common label in Examples" corresponds to the prior probability of the classification.
If we run out of attributes and the entropy is non-zero, we choose the most common target label of this subset of examples.
If one of the attribute values does not appear in the subpopulation, we choose a default, which is the most common target label over the entire tree (the most probable label).
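A compact Python sketch of ID3 (a simplified reading of the pseudocode above, not the original course code): examples are dicts of attribute values plus a label, and branches are only created for values observed in the current subset.

```python
from math import log2
from collections import Counter

def entropy(examples, target):
    counts = Counter(e[target] for e in examples)
    n = len(examples)
    return -sum(c / n * log2(c / n) for c in counts.values())

def info_gain(examples, attr, target):
    n = len(examples)
    rest = 0.0
    for value in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == value]
        rest += len(subset) / n * entropy(subset, target)
    return entropy(examples, target) - rest

def id3(examples, attributes, target):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                       # all examples share one label
        return labels[0]
    if not attributes:                              # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, a, target))
    tree = {best: {}}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = id3(subset, [a for a in attributes if a != best], target)
    return tree

# Tiny illustrative training set (not the full 14-example PlayTennis table):
data = [
    {"outlook": "sunny", "wind": "weak", "play": "no"},
    {"outlook": "sunny", "wind": "strong", "play": "no"},
    {"outlook": "overcast", "wind": "weak", "play": "yes"},
    {"outlook": "rain", "wind": "weak", "play": "yes"},
    {"outlook": "rain", "wind": "strong", "play": "no"},
]
print(id3(data, ["outlook", "wind"], "play"))
```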

Decision trees

Example: decision tree to predict protein-protein interactions

Each data point is a gene pair (e.g. A-B) associated with some attributes. Some attributes take on real values (e.g. genomic distance); other attributes take on discrete values (e.g. shared localization?). The target value of the classification is "yes" (they interact) or "no" (they do not interact).

Decision trees

Binary classification on continuous values is based on a threshold. (Figure: the tree, with the proportion of each attribute value in the training set shown at each node.)

New examples are predicted to interact if they arrive at a leaf with a higher proportion of interacting training pairs (green), or to not interact if they arrive at a predominantly non-interacting (red) leaf.

Exercise (exam from 2014) We would like to build a decision tree model to predict cell proliferation based on the gene expression of two genes: NUMB and SRSF1. Our experiments have been recorded in the following table:

Which of the attributes will you test first in the decision tree? Explain why. Help: you can use log2 3 ≈ 1.6.

References

Machine Learning. Tom Mitchell. McGraw Hill, 1997. http://www.cs.cmu.edu/~tom/mlbook.html

Computational Molecular Biology: An Introduction. Peter Clote and Rolf Backofen. Wiley, 2000.

What are decision trees? Kingsford C, Salzberg SL. Nat Biotechnol. 2008 Sep;26(9):1011-3. Review.

Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. Cambridge University Press, 1999.

Problems and Solutions in Biological Sequence Analysis. Mark Borodovsky, Svetlana Ekisheva. Cambridge University Press, 2006.
