molecular biomedical informatics machine learning and bioinformatics machine learning &...


Upload: jayden-tilden

Post on 29-Mar-2015


TRANSCRIPT

Page 1: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Molecular Biomedical Informatics Laboratory

Machine Learning and Bioinformatics

Page 2: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Statistics

Page 3: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Statistical test

In statistics, a result is called statistically significant if it is unlikely to have occurred by chance

A statistical test determines what outcomes of an experiment would lead to a rejection of the null hypothesis, helping to decide whether experimental results contain enough information to cast doubt on conventional wisdom

Answers
– assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the actually observed one?
– that probability is known as the P-value

Page 4: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Similar to a criminal trial

A defendant is considered not guilty until his guilt is proven
– the prosecutor tries to prove the guilt of the defendant; only when there is enough incriminating evidence is the defendant convicted

At the start of the procedure, there are two hypotheses
– H0: “the defendant is not guilty”
– H1: “the defendant is guilty”

The first one is called the null hypothesis, and is accepted for the time being

The second one is called the alternative (hypothesis), which is the hypothesis one hopes to support

Page 5: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

The hypothesis of innocence is only rejected when an error is very unlikely, because one doesn’t want to convict an innocent defendant

Such an error is called an error of the first kind (i.e. the conviction of an innocent person), and the occurrence of this error is controlled to be rare

As a consequence of this asymmetric behavior, the error of the second kind (acquitting a person who committed the crime) is often rather large

                                      H0 is true (truly not guilty)    H1 is true (truly guilty)
Accept null hypothesis (acquittal)    Right decision                   Wrong decision (Type II error)
Reject null hypothesis (conviction)   Wrong decision (Type I error)    Right decision

Page 6: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Philosopher’s beans

Few beans of this handful are white.

Most beans in this bag are white.

Therefore, probably, these beans were taken from another bag.
– this is a hypothetical inference

Terminology
– the beans in the bag are the population
– the handful are the sample
– the null hypothesis is that the sample originated from the population

Page 7: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

The criterion for rejecting the null hypothesis is the “obvious” difference in appearance (an informal difference in the mean)

Again, assuming that the null hypothesis is true, what is the probability of observing a difference that is at least as extreme as the actually observed one?

To be a real statistical hypothesis test, this example requires the formalities of a probability calculation and a comparison of that probability to a standard

Page 8: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Clairvoyant card game

A person (the subject) is tested for clairvoyance. He is shown the reverse of a randomly chosen playing card 25 times and asked which of the four suits it belongs to.

The number of hits, or correct answers, is called X.

As we try to find evidence of his clairvoyance
– the null hypothesis is that the person is not clairvoyant
– the alternative is, of course, that the person is (more or less) clairvoyant

null hypothesis?

Page 9: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

If the null hypothesis is valid, the only thing the test person can do is guess
– for every card, the probability (relative frequency) of any single suit appearing is ¼

If the alternative is valid, the test subject will predict the suit correctly with probability greater than ¼

Suppose that the probability of guessing correctly is p; the hypotheses are then
– null hypothesis (H0): p = ¼ (just guessing)
– alternative hypothesis (H1): p > ¼ (true clairvoyant)

Page 10: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

What’s the decision?

When the test subject correctly predicts all 25 cards, we will consider him clairvoyant, and reject the null hypothesis. The same holds for 24 or 23 hits.

With only 5 or 6 hits, on the other hand, there is no cause to consider him so.

But what about 12 hits, or 17 hits?
– what is the critical number, c, of hits, at which point we consider the subject to be clairvoyant?
– how do we determine the critical value c?

It is obvious that with the choice c=25 we’re more critical than with c=10

Page 11: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

In practice, one decides how critical one will be
– one decides how often one accepts an error of the first kind (false positive or Type I error)

With c=25 the probability of such an error is very small: P(X = 25 | p = ¼) = (¼)^25 ≈ 10^−15

Being less critical, with c=10, yields a much greater probability of a false positive: P(X ≥ 10 | p = ¼) ≈ 0.07

These are p-values

Page 12: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

The probability of Type I error

Before the test is actually performed, the maximum acceptable probability of a Type I error (α) is determined

Depending on this Type I error rate, the critical value c is calculated. For example, if we select an error rate of 1%, c must satisfy P(X ≥ c | p = ¼) ≤ 0.01
– from all the numbers c with this property we choose the smallest, in order to minimize the probability of a Type II error (false negative)
– for the above example, we select c=13
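The critical value for the card game can be checked numerically. The following is a minimal sketch, assuming X follows a Binomial(25, ¼) distribution under the null hypothesis; the function names `binomial_tail` and `critical_value` are illustrative and do not come from the slides.

```python
from math import comb


def binomial_tail(n, p, c):
    """Return P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))


def critical_value(n, p, alpha):
    """Return the smallest c such that P(X >= c) <= alpha."""
    # c = n + 1 gives an empty tail (probability 0), so the loop always returns.
    for c in range(n + 2):
        if binomial_tail(n, p, c) <= alpha:
            return c


N_CARDS = 25    # number of cards shown to the subject
P_GUESS = 0.25  # chance of a hit under H0 (pure guessing among four suits)

print("Type I error rate with c=25:", binomial_tail(N_CARDS, P_GUESS, 25))
print("Type I error rate with c=10:", binomial_tail(N_CARDS, P_GUESS, 10))
print("critical value at alpha=0.01:", critical_value(N_CARDS, P_GUESS, 0.01))
```

Choosing the smallest admissible c keeps the Type I error rate below α while rejecting as often as possible, which is what keeps the Type II error rate low.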

Page 13: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

P-value vs. α

Page 14: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Any questions about the figure in the last slide?

Page 15: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Where does the distribution (the blue curve) come from?

Page 16: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

You have to choose the right one

The hardest part for many people

But please understand the basics, rather than just the practice

Page 17: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Normal distribution

A continuous probability distribution, defined on the entire real line, that has a bell-shaped probability density function

The density is known as the Gaussian function: f(x) = 1 / (σ√(2π)) · exp(−(x − μ)² / (2σ²))

μ is the mean or expectation (location of the peak); σ² is the variance; σ is known as the standard deviation

The distribution with μ=0 and σ²=1 is called the standard normal distribution or the unit normal distribution

Source: Normal distribution - Wikipedia, the free encyclopedia

Page 18: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

The normal distribution is considered the most prominent probability distribution in statistics

The normal distribution arises from the central limit theorem
– under mild conditions, the mean of a large number of random variables independently drawn from the same distribution is distributed approximately normally, irrespective of the form of the original distribution

Very tractable analytically, that is, a large number of results involving this distribution can be derived in explicit form

For these reasons, the normal distribution is commonly encountered in practice
– for example, the observational error in an experiment is usually assumed to follow a normal distribution

Page 21: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Z-test

Source: Z-test - Wikipedia, the free encyclopedia

For any test statistic of which the distribution under the null hypothesis can be approximated by a normal distribution

Because of the central limit theorem, many test statistics are approximately normally distributed for large samples

Many statistical tests can be conveniently performed as approximate Z-tests if the sample size is large or the population variance is known
– if the population variance is unknown (and therefore has to be estimated from the sample itself) and the sample size is not large (n < 30), the Student t-test may be more appropriate

Page 22: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

If T is a statistic that is approximately normally distributed under the null hypothesis
– estimate the expected value θ of T under the null hypothesis
– obtain an estimate s of the standard deviation of T
– calculate the standard score Z = (T − θ) / s
– one-tailed and two-tailed p-values can be calculated as Φ(−|Z|) and 2Φ(−|Z|), respectively
– Φ is the standard normal cumulative distribution function

Page 23: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Z-test example

Suppose that in a particular geographic region, the mean and standard deviation of scores on a reading test are 100 and 12 points, respectively.

Our interest is in the scores of 55 students in a particular school who received a mean score of 96

We can ask whether this mean score is significantly lower than the regional mean
– are the students in this school comparable to a simple random sample of 55 students from the region as a whole?
– or are their scores surprisingly low?

Page 24: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

The standard error: SE = σ / √n = 12 / √55 ≈ 1.62

The z-score, which is the distance from the sample mean to the population mean in units of the standard error: z = (M − μ) / SE = (96 − 100) / 1.62 ≈ −2.47

Looking up the table of the standard normal distribution, the probability of observing a standard normal value ≤ −2.47 is about 0.0068
– with 99.32% confidence we reject the null hypothesis

If instead of a single school, we considered a sub-region containing 900 students whose mean score was 99, nearly the same z-score and p-value would be observed
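The reading-test example can be reproduced in a few lines. This is a minimal sketch that follows the Z = (T − θ) / s recipe, assuming the population standard deviation is known; `standard_normal_cdf` and `z_test` are illustrative helper names, and Φ is computed from the standard library's `math.erf`.

```python
from math import erf, sqrt


def standard_normal_cdf(z):
    """Phi(z), the standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def z_test(sample_mean, population_mean, population_sd, n):
    """Return (z, one-tailed p-value, two-tailed p-value)."""
    standard_error = population_sd / sqrt(n)
    z = (sample_mean - population_mean) / standard_error
    one_tailed = standard_normal_cdf(-abs(z))
    return z, one_tailed, 2.0 * one_tailed


# School: 55 students with mean 96; region: mean 100, standard deviation 12.
z_school, p_school, _ = z_test(96, 100, 12, 55)
print(f"school:     z = {z_school:.2f}, one-tailed p = {p_school:.4f}")

# Sub-region: 900 students with mean 99 give almost the same z-score.
z_region, p_region, _ = z_test(99, 100, 12, 900)
print(f"sub-region: z = {z_region:.2f}, one-tailed p = {p_region:.4f}")
```

The second call illustrates the closing point of the slide: a much smaller difference in means can be just as significant when the sample is large, because the standard error shrinks with √n.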

Page 25: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Hyper-geometric distribution

A discrete probability distribution that describes the probability of k successes in n draws from a finite population of size N containing m successes without replacement

A random variable X follows the hyper-geometric distribution if its probability mass function is given by P(X = k) = C(m, k) · C(N − m, n − k) / C(N, n), where C(x, y) is the binomial coefficient
– N is the population size; m is the number of success states in the population; n is the number of draws; k is the number of successes

Source: Hypergeometric distribution - Wikipedia, the free encyclopedia

Page 27: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Fisher’s exact test

Used in the analysis of contingency tables

Although in practice it is employed when sample sizes are small, it is valid for all sample sizes

It is called exact because the significance of the deviation from a null hypothesis can be calculated exactly, rather than relying on an approximation that becomes exact in the limit as the sample size grows to infinity

Fisher devised the test in response to a boast
– try searching for ‘lady tasting tea’

Source: Fisher's exact test - Wikipedia, the free encyclopedia

Page 28: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

The test is useful for categorical data that result from classifying objects in two different ways

It is used to examine the significance of the association (contingency) between the two kinds of classification

The numbers in the cells of the table form a hyper-geometric distribution under the null hypothesis of independence

              Men   Women   Total
Dieting         1       9      10
Non-dieting    11       3      14
Total          12      12      24

Page 29: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Fisher’s exact test example

A sample of teenagers might be divided into
– male and female
– and those that are and are not currently dieting

Test whether the observed difference of proportions is significant
– what is the probability that the 10 dieters would be so unevenly distributed between the women and the men?
– if we were to choose 10 of the teenagers at random, what is the probability that 9 of them would be among the 12 women, and only 1 from among the 12 men?

              Men   Women   Total
Dieting         1       9      10
Non-dieting    11       3      14
Total          12      12      24

Page 30: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

The probability follows the hyper-geometric distribution: p = C(a+b, a) · C(c+d, c) / C(n, a+c) = (a+b)! (c+d)! (a+c)! (b+d)! / (a! b! c! d! n!)
– the exact probability of this particular arrangement of the data
– on the null hypothesis of independence that men and women are equally likely to be dieters
– assuming the given marginal totals

We can calculate the exact probability of any arrangement

Fisher showed that to generate a significance level, we need consider only the more extreme cases with the same marginal totals

              Men   Women   Total
Dieting         a       b     a+b
Non-dieting     c       d     c+d
Total         a+c     b+d       n
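As an illustration, the dieting table (a=1, b=9, c=11, d=3) can be evaluated directly. This is a minimal sketch, assuming a one-sided test in the direction of the observed deviation (tables with the same marginal totals and at most a men among the dieters); the function names are illustrative only.

```python
from math import comb


def table_probability(a, b, c, d):
    """Hyper-geometric probability of one 2x2 table with fixed margins."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def fisher_one_sided(a, b, c, d):
    """Sum the probabilities of the observed table and of every table with
    the same margins whose top-left cell is smaller than the observed one."""
    row1, col1, n = a + b, a + c, a + b + c + d
    lowest = max(0, row1 + col1 - n)  # smallest top-left cell the margins allow
    total = 0.0
    for x in range(lowest, a + 1):
        # The other three cells are determined by x and the margins.
        total += table_probability(x, row1 - x, col1 - x, n - row1 - col1 + x)
    return total


p_table = table_probability(1, 9, 11, 3)
p_one_sided = fisher_one_sided(1, 9, 11, 3)
print(f"probability of the observed table: {p_table:.6f}")
print(f"one-sided p-value:                 {p_one_sided:.6f}")
```

Only one table is more extreme here (no men among the dieters), so the one-sided p-value is barely larger than the probability of the observed table.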

Page 31: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Any questions so far?

Page 32: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

How do you choose the test, or do you know the distribution?

Page 33: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Distribution is “assumed”

Different tests may use the same distribution

One test statistic could be tested under different assumptions

Page 34: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Overlap significance

Determine the degree of the overlap
– (three overlap statistics, shown as formulas on the slide)

The above statistics answer the degree but not the confidence of overlap

Consider the area outside the two leaves

Can you formulate a statistical test based on the hyper-geometric distribution?

Page 35: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Suppose that we are drawing an area as large as the first leaf

What’s the probability of obtaining, by chance, an area with a larger overlap with the second leaf?
– N is the size of the entire area

Notice that the p-value answers the confidence when we claim that these two leaves overlap, but not the degree of the overlap
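One way to formulate the overlap test is as an upper tail of the hyper-geometric distribution. The sketch below assumes the area is counted in discrete units (pixels, or genes in an enrichment analysis), that m units belong to the second leaf, that n units are drawn for the first leaf, and that "larger" means "at least as large as the observed overlap k". The names and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_pmf(k, N, m, n):
    """P(X = k): k successes in n draws without replacement from a
    population of size N that contains m successes."""
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)


def overlap_p_value(N, m, n, k):
    """P(X >= k): chance that a random area of size n overlaps a fixed
    area of size m in at least k units, inside a total area of size N."""
    upper = min(m, n)  # the overlap cannot exceed either area
    return sum(hypergeom_pmf(i, N, m, n) for i in range(k, upper + 1))


# Hypothetical numbers: total area 10, both leaves cover 5 units, overlap 4.
print(overlap_p_value(N=10, m=5, n=5, k=4))
```

The same function covers gene set enrichment if N is the number of annotated genes, m the size of the category, n the size of the gene list, and k the number of list genes in the category.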

Page 36: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

http://www.nature.com/nrc/journal/v7/n1/images/nrc2036-f1.jpg

Gene Ontology Enrichment Analysis

Page 37: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Student’s t-test

The test statistic follows a Student’s t distribution if the null hypothesis is true

The Z-test is commonly applied when the test statistic follows a normal distribution and the value of a scaling term is known

When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic follows a Student’s t distribution

The t-statistic was introduced in 1908 by William Sealy Gosset (“Student” was his pen name)

Source: Student’s t-test - Wikipedia, the free encyclopedia

Page 38: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Compared to normal distribution

The probability of seeing a normally distributed value far (i.e. more than a few standard deviations) from the mean drops off extremely rapidly
– thus, the normal distribution is not robust to the presence of outliers (data that are unexpectedly far from the mean, due to exceptional circumstances, observational error, etc.)
– data with outliers may be better described using a heavy-tailed distribution such as the Student’s t-distribution

If X1, X2, …, Xn are independent normally distributed random variables with mean μ and variance σ²

Page 39: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

The sample mean follows a normal distribution

The ratio of the deviation of the sample mean from μ to its estimated standard error, t = (X̄ − μ) / (S / √n), where S is the sample standard deviation, follows the Student’s t-distribution with n−1 degrees of freedom
– this is useful to compare two sets of numerical data

The sum of the squares of the standardized variables (Xi − μ) / σ has the chi-squared distribution with n degrees of freedom
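The t statistic can be computed with the standard library alone. This is a minimal sketch of the one-sample case, assuming the data are independent draws from a normal distribution; `t_statistic` and the sample values are hypothetical, and the p-value is left out because the Student's t cumulative distribution function is not part of Python's standard library.

```python
from math import sqrt
from statistics import mean, stdev


def t_statistic(sample, mu0):
    """Return (t, degrees of freedom) for H0: the population mean is mu0."""
    n = len(sample)
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    standard_error = stdev(sample) / sqrt(n)
    return (mean(sample) - mu0) / standard_error, n - 1


sample = [1, 2, 3, 4, 5]  # hypothetical measurements
t, df = t_statistic(sample, mu0=2)
print(f"t = {t:.4f} with {df} degrees of freedom")
```

The resulting t would then be compared against a Student's t distribution with n−1 degrees of freedom, for instance from a printed table.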

Page 40: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

How many tests do you remember?

Page 42: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Do not use them unless you understand the concepts introduced in these slides

Page 43: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Chi-squared distribution

The chi-squared distribution (also chi-square or χ²-distribution) with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables

Used in chi-squared tests for
– goodness of fit of an observed distribution to a theoretical one
– the independence of two criteria of classification
– confidence interval estimation for a population standard deviation of a normal distribution from a sample standard deviation
– many other statistical tests also use this distribution, like Friedman’s analysis of variance by ranks

Page 44: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

A special case of the gamma distribution

If Z1, …, Zk are independent, standard normal random variables, then the sum of their squares Q = Z1² + … + Zk² is distributed according to the chi-squared distribution with k degrees of freedom

This is usually denoted as Q ~ χ²(k) or Q ~ χ²k

Source: Chi-squared distribution - Wikipedia, the free encyclopedia

Page 45: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Chi-squared tests

Also known as chi-square test or χ² test

Note the distinction between the test statistic and its distribution

The distribution is a chi-squared distribution when the null hypothesis is true, or asymptotically true
– the sampling distribution can be approximated to a chi-squared distribution as closely as desired by enlarging the sample size

Often the shorthand for Pearson’s chi-squared test, also known as
– the chi-squared goodness-of-fit test
– the chi-squared test for independence

Page 46: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Pearson’s chi-squared test

Source: Pearson’s chi-squared test - Wikipedia

The best-known of several chi-squared tests

Tests the frequency distributions of events
– the considered events must be mutually exclusive and have total probability 1
– e.g., tests the “fairness” of a die

Used to assess two types of comparison
– test of goodness of fit answers if an observed frequency distribution differs from a theoretical one
– test of independence answers if paired observations on two variables, expressed in a contingency table, are independent

Page 47: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Steps

Calculate the chi-squared test statistic, χ², which resembles a normalized sum of squared deviations between observed and theoretical frequencies

Determine the degrees of freedom, d, of that statistic, which is essentially the number of frequencies reduced by the number of parameters of the fitted distribution

χ² is then compared to the critical value in the distribution to obtain a p-value

A test that does not rely on the approximation of χ² is Fisher’s exact test, which is more accurate in obtaining a significance level, especially with few observations

Page 48: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Test for fit of a distribution

Suppose that there are N observations divided among n cells

A simple application is to test the hypothesis that, in the general population, values would occur in each cell with equal frequency
– the “theoretical frequency” for any cell (under the null hypothesis of a discrete uniform distribution) is Ei = N / n
– the reduction in the degrees of freedom is p=1, notionally because the observed frequencies Oi are constrained to sum to N
– the number of degrees of freedom is n−1

The value of the test statistic is X² = Σ (Oi − Ei)² / Ei, where X² is Pearson’s cumulative test statistic, which asymptotically approaches a χ² distribution

Page 49: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

When testing whether observations are random variables whose distribution belongs to a given family of distributions, the “theoretical frequencies” are calculated using a distribution from that family
– the reduction in the degrees of freedom is calculated as p=s+1, where s is the number of co-variates used in fitting the distribution
– for instance, when checking a normal distribution (where the parameters are mean and standard deviation), p=3
– the number of degrees of freedom is n−p

It should be noted that the degrees of freedom are not based on the number of observations as with a Student’s t distribution
– if testing for a fair, six-sided die, there would be five degrees of freedom because there are six categories
– the number of times the die is rolled will have absolutely no effect on the number of degrees of freedom
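The die example can be sketched as a goodness-of-fit test against the discrete uniform distribution. The counts below are hypothetical, the function name is illustrative, and the critical value 11.07 (five degrees of freedom, α = 0.05) is taken from a standard chi-squared table rather than computed.

```python
def chi_squared_uniform(observed):
    """Pearson's X^2 against equal expected frequencies.

    Returns (statistic, degrees of freedom)."""
    total = sum(observed)              # N observations
    expected = total / len(observed)   # E_i = N / n for every cell
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    return statistic, len(observed) - 1


CRITICAL_5DF_ALPHA_005 = 11.07  # tabulated value, assumed here

counts = [5, 8, 9, 8, 10, 20]  # hypothetical results of 60 rolls, faces 1..6
x2, df = chi_squared_uniform(counts)
print(f"X^2 = {x2:.2f}, df = {df}")
print("reject fairness at alpha=0.05:", x2 > CRITICAL_5DF_ALPHA_005)
```

Rolling the die more often changes the counts and therefore X², but the degrees of freedom stay at five because the number of categories is unchanged.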

Page 50: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Test of independence

An “observation” consists of the values of two outcomes and the null hypothesis is that the occurrence of these outcomes is statistically independent

Each observation is allocated to one cell of a two-dimensional array of cells (called a table) according to the values of the two outcomes

If there are r rows and c columns in the table, the value of the test statistic is X² = Σi Σj (Oi,j − Ei,j)² / Ei,j, where Ei,j = (total of row i) × (total of column j) / N

Fitting the model of “independence” reduces the number of degrees of freedom by p=r+c−1

The number of degrees of freedom is equal to the number of cells r×c, minus the reduction in degrees of freedom, p, which reduces to (r − 1)(c − 1).

              Men          Women        Total
Dieting       O1,1         O1,2         O1,1+O1,2
Non-dieting   O2,1         O2,2         O2,1+O2,2
Total         O1,1+O2,1    O1,2+O2,2    N
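A minimal sketch of the test of independence, applied to the dieting counts used in the Fisher's exact test example (1 and 9 dieters, 11 and 3 non-dieters). The function name is illustrative, and the expected counts follow the usual row total × column total / N rule.

```python
def chi_squared_independence(table):
    """Pearson's X^2 for an r x c contingency table.

    Returns (statistic, degrees of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two classifications.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


dieting = [[1, 9],    # dieting:     men, women
           [11, 3]]   # non-dieting: men, women
x2_ind, df_ind = chi_squared_independence(dieting)
print(f"X^2 = {x2_ind:.4f}, df = {df_ind}")
```

With counts this small the chi-squared approximation is rough, which is why the slides recommend Fisher's exact test when there are few observations.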

Page 51: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Summary

Statistical test
– criminal trial
– philosopher’s beans
– clairvoyant card game

P-value vs. α

You have to choose the right distribution
– normal distribution (z-test)
– hyper-geometric distribution (Fisher’s exact test)

Distinguish between distributions and tests
– different tests with the same distribution
  • overlap significance
  • enrichment analysis
– different distributions for the same test statistic
  • Student’s t-test

Chi-squared tests
– goodness of fit
– test of independence

Page 52: Molecular Biomedical Informatics Machine Learning and Bioinformatics Machine Learning & Bioinformatics 1

Feature selection

Test whether the selected features are significantly better than the others. Upload and test them in our simulation system. Finally, commit your best version and send TA Jang a report before 23:59 1/8 (Tue).