assignment 3


Upload: pramod-soni

Post on 05-Feb-2016


IME602 course assignment, IIT Kanpur


Page 1: Assignment 3

Assignment-3
IME-602 (Prof. R.N. Sengupta)

Pramod Soni
Roll No. : 12103173
PhD (Civil)
Date : 6/9/2012

Question # 1

Using the data set, DJIA.txt, given, draw the following: (i) time series plot (i.e., time along X-axis and the value along Y-axis), (ii) histogram plot to show the probability of occurrences considering appropriate intervals as required (you should consider different intervals to make your analysis of the histograms more accurate). Can you comment on the distributions which you obtain?

Solution # 1 :

The given data set has 26612 data points covering about 93 years, with values ranging from -9 to 3554.82 units. The time series plot for the data is shown in Figure 1.

Figure 1: Time series plot of the given data (x-axis: Time (Years); y-axis: Values)

From Figure 1 we can conclude that the values of the variable show no marked increase for the first 50 years, and that after about 50 years there is a large, sudden increase in the values. The data also show a lot of fluctuation.

In the second part of the problem, histograms have to be plotted using different intervals (bin sizes).
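The binning step behind such probability histograms can be sketched in a few lines. This is a minimal illustration, not the code used for the figures: the function name `probability_histogram` and the sample values are assumptions, and bin k is taken to cover [k * bin_size, (k + 1) * bin_size), the same floor-based rule as in the MATLAB appendix.

```python
import math
from collections import Counter


def probability_histogram(values, bin_size):
    """Return {bin lower edge: relative frequency} for fixed-width bins.

    Bin k covers [k * bin_size, (k + 1) * bin_size).
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    # Count how many values fall in each bin index.
    counts = Counter(math.floor(v / bin_size) for v in values)
    total = len(values)
    # Divide by the number of points so the bars sum to 1.
    return {k * bin_size: c / total for k, c in sorted(counts.items())}


# Hypothetical sample, not the DJIA series.
sample = [-9, 12, 45, 80, 95, 130, 260, 410, 980, 3554.82]
hist = probability_histogram(sample, 100)
```

Changing `bin_size` changes how much detail the histogram shows, which is why several bin sizes are compared.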

Figure 2: Histogram plot for DJIA data (bin size: 10 units); x-axis: Values, y-axis: Probability f(x)

Figure 3: Histogram plot for DJIA data (bin size: 30 units); x-axis: Values, y-axis: Probability f(x)

Looking at the different histograms, we can conclude that values less than 500 have the maximum probability compared to large values. Figure 4 shows that the bin from 0 to 100 has the maximum probability among all the bins. Large values do not have high probability. The shape of the distribution is positively skewed (a long tail towards large values).
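The visual impression of positive skew can be backed by a number. The sketch below computes the moment coefficient of skewness, which is positive for a long right tail and zero for a symmetric sample. The function name `sample_skewness` and the example lists are illustrative assumptions, not part of the assignment data.

```python
def sample_skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


# Hypothetical samples: one with a long right tail, one symmetric.
right_tailed = [1, 1, 1, 2, 10]
symmetric = [-2, -1, 0, 1, 2]
skew_right = sample_skewness(right_tailed)
skew_sym = sample_skewness(symmetric)
```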


Figure 4: Histogram plot for DJIA data (bin size: 100 units); x-axis: Values, y-axis: Probability f(x)

Figure 5: Histogram plot for DJIA data (bin size: 500 units); x-axis: Values, y-axis: Probability f(x)


Question # 2

Using the data set, pollen.txt, given, draw the histogram plots to show the probability of occurrences considering appropriate intervals as required (you should consider different intervals to make your analysis of the histograms more accurate) for all the variables given. Comment on the distributions which you obtain for all the variables.

Solution # 2 :

There are 5 variables in the given data set, so we plot the histograms for each variable individually and analyse them in separate sections.

1. RIDGE
For the given variable, values lie between -23.28 and 21.40. Histograms with different bin sizes have been plotted for this case (Figures 6 to 8).

Figure 6: Histogram for RIDGE data (bin size: 1 unit); x-axis: Values, y-axis: Probability f(x)

From these figures we can see that the RIDGE data are normally distributed about 0. Values near 0 have the highest probability, whereas values far from 0 have a low probability of occurrence. Hence the shape of the distribution is Gaussian.


Figure 7: Histogram for RIDGE data (bin size: 3 units); x-axis: Values, y-axis: Probability f(x)

Figure 8: Histogram for RIDGE data (bin size: 5 units); x-axis: Values, y-axis: Probability f(x)


2. NUB
For the given variable, values lie between -16.39 and 17.25. Histograms with different bin sizes have been plotted for this case.

Figure 9: Histogram for NUB data (bin size: 1 unit); x-axis: Values, y-axis: Probability f(x)

Figure 10: Histogram for NUB data (bin size: 2 units); x-axis: Values, y-axis: Probability f(x)

Looking at the different histograms, we can see that the NUB data are also normally distributed around 0.


Figure 11: Histogram for NUB data (bin size: 3 units); x-axis: Values, y-axis: Probability f(x)

Figure 12: Histogram for NUB data (bin size: 5 units); x-axis: Values, y-axis: Probability f(x)


3. CRACK
CRACK data vary from -31.41 to 30.31 units. Histograms with different bin sizes have been plotted below.

Figure 13: Histogram for CRACK data (bin size: 1 unit); x-axis: Values, y-axis: Probability f(x)

Figure 14: Histogram for CRACK data (bin size: 2 units); x-axis: Values, y-axis: Probability f(x)

The CRACK data also show a normal distribution. The probability of negative values near zero is the highest, compared to values far from zero.


Figure 15: Histogram for CRACK data (bin size: 3 units); x-axis: Values, y-axis: Probability f(x)

Figure 16: Histogram for CRACK data (bin size: 5 units); x-axis: Values, y-axis: Probability f(x)


4. WEIGHT
WEIGHT data vary from -34.03 to 35.8 units. Histograms with different bin sizes have been plotted below.

Figure 17: Histogram for WEIGHT data (bin size: 1 unit); x-axis: Values, y-axis: Probability f(x)

Figure 18: Histogram for WEIGHT data (bin size: 2 units); x-axis: Values, y-axis: Probability f(x)

The WEIGHT data are also normally distributed around zero. Values between -10 and 10 have the highest probability.


Figure 19: Histogram for WEIGHT data (bin size: 3 units); x-axis: Values, y-axis: Probability f(x)

Figure 20: Histogram for WEIGHT data (bin size: 5 units); x-axis: Values, y-axis: Probability f(x)


5. DENSITY
DENSITY data vary from -12.03 to 10.86 units. Histograms with different bin sizes have been plotted below.

Figure 21: Histogram for DENSITY data (bin size: 0.5 units); x-axis: Values, y-axis: Probability f(x)

Figure 22: Histogram for DENSITY data (bin size: 1 unit); x-axis: Values, y-axis: Probability f(x)

The DENSITY data are also normally distributed around zero. Values between -3 and +3 have the highest probability (approximately 0.7), whereas other values have lower probability.


Figure 23: Histogram for DENSITY data (bin size: 2 units); x-axis: Values, y-axis: Probability f(x)

Figure 24: Histogram for DENSITY data (bin size: 3 units); x-axis: Values, y-axis: Probability f(x)


Question # 3

A certain mathematician carries two match boxes in his pocket. Each time he wants to use a match, he selects one of the boxes at random. Find the probability that when the mathematician discovers that one box is empty, the other box contains r matches (r = 0, 1, 2, ..., n), n being the number of matches initially contained in each box.

Solution # 3 :

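Let us start with the case in which the empty matchbox is in the left pocket. Denote choosing the left pocket as a "success" and choosing the right pocket as a "failure". In this solution $R$ is the number of matches initially in each box (n in the question) and $N$ is the number of matches left in the other box (r in the question). We then want the probability that there were exactly $R-N$ failures before the $(R+1)$st success.

Let us consider the negative binomial distribution. It is close in interpretation to the geometric distribution: we count the number of trials until the $r$th success occurs (in contrast to the 1st success in the geometric distribution). In this general derivation $n$ and $r$ denote numbers of trials and successes, not the quantities in the question. Let $T_r$ be the random variable representing this number of trials. Let us denote the following events.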

• $A = \{T_r = n\}$

• $B$ = {exactly $(r-1)$ successes occur in $n-1$ trials}

• $C$ = {the $n$th trial results in a success}

We have that $A = B \cap C$, and $B$ and $C$ are independent, giving $P(A) = P(B)P(C)$. Consider a particular sequence of $n-1$ trials with $r-1$ successes and $n-1-(r-1) = n-r$ failures. The probability associated with each such sequence is $p^{r-1}(1-p)^{n-r}$ and there are $\binom{n-1}{r-1}$ such sequences. Therefore

$$P(B) = \binom{n-1}{r-1} p^{r-1} (1-p)^{n-r}$$

Since $P(C) = p$ we have

$$P(T_r = n) = P(A) = \binom{n-1}{r-1} p^{r} (1-p)^{n-r}, \quad n = r, r+1, r+2, \ldots$$

In our case we want to calculate how many matches were removed from the other pocket. We want to calculate the number of failures until the $r$th success occurs. This is the modified negative binomial distribution describing the number of failures until the $r$th success occurs. The probability distribution is

$$P_Z(n) = \binom{n+r-1}{r-1} p^{r} (1-p)^{n}, \quad n \ge 0$$
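As a quick sanity check on the distribution of the number of failures before the $r$th success, the pmf can be evaluated numerically. The sketch below is illustrative only and the function name `failures_before_rth_success` is an assumption. It confirms that $r = 1$ reduces to the geometric form $p(1-p)^n$ and that the probabilities sum to 1 up to truncation.

```python
from math import comb


def failures_before_rth_success(n, r, p):
    """P(exactly n failures occur before the r-th success)."""
    # C(n + r - 1, r - 1) orderings of the first n + r - 1 trials,
    # each with probability p**(r - 1) * (1 - p)**n, times p for the
    # final success.
    return comb(n + r - 1, r - 1) * p**r * (1 - p) ** n


# r = 1 gives the modified geometric distribution p * (1 - p)**n.
geometric_case = failures_before_rth_success(3, 1, 0.5)

# Truncated sum over n; the tail beyond n = 200 is negligible here.
total = sum(failures_before_rth_success(n, 3, 0.5) for n in range(201))
```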


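(For $r = 1$ we obtain the modified geometric distribution.) We apply the modified negative binomial distribution with $p = 1/2$, $r = R+1$ successes and $n = R-N$ failures to get the probability

$$P_{\text{left}} = \binom{R-N+R}{R} \left(\frac{1}{2}\right)^{R+1} \left(\frac{1}{2}\right)^{R-N}$$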

The symmetric event (when the matchbox in the right pocket becomes empty) is disjoint, thus the probability of finishing one matchbox when having exactly $N$, $0 \le N \le R$, matches in the other one is

$$P(N) = 2P_{\text{left}} = \binom{2R-N}{R} \left(\frac{1}{2}\right)^{2R-N}$$

Ans
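The final formula can be checked exactly with rational arithmetic: over $N = 0, 1, \ldots, R$ the probabilities must add up to 1. The sketch below is a minimal check, with the function name `matchbox_probability` chosen here for illustration.

```python
from fractions import Fraction
from math import comb


def matchbox_probability(R, N):
    """P(other box holds N matches when one box is found empty).

    R is the number of matches initially in each box.
    """
    if not 0 <= N <= R:
        raise ValueError("N must satisfy 0 <= N <= R")
    return Fraction(comb(2 * R - N, R), 2 ** (2 * R - N))


# Example with R = 3 matches per box.
probs = [matchbox_probability(3, N) for N in range(4)]
total = sum(probs)
```

For R = 3 the four values are 5/16, 5/16, 1/4 and 1/8, which sum to exactly 1.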


Question # 4

An urn contains a white and b black balls. After a ball is drawn, it is returned to the urn if it is white, but if it is black, it is to be replaced by a white ball from another urn. If $\mu_n$ denotes the expected number of white balls in the urn after n operations, then show that $\mu_n = (a+b) - b\left(1 - \frac{1}{a+b}\right)^n$. Hence obtain the probability of drawing a white ball after the operation has been repeated n times.

Solution # 4 :

Let $A_r$ denote the expected number of white balls after r operations. Then $A_0 = a$ (the expected number of white balls is a, as no ball has been drawn yet). So,

$$A_1 = a + 1\cdot\left(1 - \frac{a}{a+b}\right) + 0\cdot\left(1 - \frac{b}{a+b}\right)$$

because if the drawn ball is black, the number of white balls increases by 1; otherwise there is no increment in white balls. The above expression can also be written as

$$A_1 = A_0 + 1\cdot\left(1 - \frac{A_0}{a+b}\right)$$

or,

$$A_1 = A_0\left(1 - \frac{1}{a+b}\right) + 1 \qquad (1)$$

Similarly, $A_2$ will be one more than $A_1$ if the drawn ball is black; otherwise there is no increment. So,

$$A_2 = A_1\left(1 - \frac{1}{a+b}\right) + 1 \qquad (2)$$

Hence we can generalize this formula for $A_{r+1}$ as

$$A_{r+1} = A_r\left(1 - \frac{1}{a+b}\right) + 1$$

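Adding and subtracting $(a+b)$ in Equation (1):

$$\begin{aligned}
A_1 &= a\left(1 - \frac{1}{a+b}\right) + 1 + (a+b) - (a+b) \\
&= (a+b) - (a+b) + 1 + a\left(1 - \frac{1}{a+b}\right) \\
&= (a+b) - (a+b)\left(1 - \frac{1}{a+b}\right) + a\left(1 - \frac{1}{a+b}\right) \\
&= (a+b) - \left(1 - \frac{1}{a+b}\right)(a+b-a)
\end{aligned}$$

$$A_1 = (a+b) - b\left(1 - \frac{1}{a+b}\right)^{1}$$

Similarly, adding and subtracting $(a+b)$ in Equation (2):

$$\begin{aligned}
A_2 &= A_1\left(1 - \frac{1}{a+b}\right) + 1 + (a+b) - (a+b) \\
&= (a+b) - (a+b) + 1 + A_1\left(1 - \frac{1}{a+b}\right) \\
&= (a+b) - (a+b)\left(1 - \frac{1}{a+b}\right) + A_1\left(1 - \frac{1}{a+b}\right) \\
&= (a+b) - \left(1 - \frac{1}{a+b}\right)(a+b-A_1)
\end{aligned}$$

Putting in the value of $A_1$ we get,

$$A_2 = (a+b) - b\left(1 - \frac{1}{a+b}\right)^{2}$$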

So the formula can be generalised for $A_r$:

$$A_r = (a+b) - b\left(1 - \frac{1}{a+b}\right)^{r}$$

or

$$\mu_n = (a+b) - b\left(1 - \frac{1}{a+b}\right)^{n}$$

Hence Proved
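The closed form can be compared against the recurrence $A_{r+1} = A_r\left(1 - \frac{1}{a+b}\right) + 1$ with $A_0 = a$ using exact fractions. This is a small verification sketch; the function names and the values of a, b and n are illustrative assumptions.

```python
from fractions import Fraction


def expected_white_recurrence(a, b, n):
    """Iterate A_{r+1} = A_r * (1 - 1/(a+b)) + 1 starting from A_0 = a."""
    total = a + b
    mu = Fraction(a)
    for _ in range(n):
        mu = mu * (1 - Fraction(1, total)) + 1
    return mu


def expected_white_closed_form(a, b, n):
    """Closed form (a+b) - b * (1 - 1/(a+b))**n."""
    total = a + b
    return total - b * (1 - Fraction(1, total)) ** n


# Example: a = 3 white and b = 5 black balls, one operation.
mu_1 = expected_white_recurrence(3, 5, 1)
```

For a = 3, b = 5 and one operation both expressions give 29/8.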

Also, the probability of obtaining a white ball after n operations is given by

$$P_n = \frac{\mu_n}{a+b}$$

Ans
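The relation $P_n = \mu_n/(a+b)$ can also be confirmed without using expectations at all, by tracking the exact distribution of the number of white balls after each operation. The sketch below does this with fractions; the function names are assumptions made for illustration.

```python
from fractions import Fraction


def white_count_distribution(a, b, n):
    """Exact distribution {white count: probability} after n operations."""
    total = a + b
    dist = {a: Fraction(1)}
    for _ in range(n):
        nxt = {}
        for w, pr in dist.items():
            p_white = Fraction(w, total)
            # White drawn: ball is returned, count unchanged.
            nxt[w] = nxt.get(w, 0) + pr * p_white
            # Black drawn: replaced by a white ball, count goes up by one.
            if w < total:
                nxt[w + 1] = nxt.get(w + 1, 0) + pr * (1 - p_white)
        dist = nxt
    return dist


def prob_white_next(a, b, n):
    """P(white ball is drawn) once n operations have been carried out."""
    total = a + b
    dist = white_count_distribution(a, b, n)
    return sum(pr * Fraction(w, total) for w, pr in dist.items())


# Example: a = 2, b = 1, one operation.
p_example = prob_white_next(2, 1, 1)
```

With a = 2, b = 1 and n = 1 the result is 7/9, which equals $\mu_1/(a+b) = (7/3)/3$.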


Question # 5

The probability mass function, pmf, is given below

$$f(X) = \begin{cases} -\dfrac{q^{X}}{X \log_e p} & \text{if } X = 1, 2, 3, \ldots \\ 0 & \text{otherwise} \end{cases}$$

Show that for this pmf the mean and variance are: (i) $-\dfrac{q}{p \log_e p}$ and (ii) $\dfrac{-q(q + \log_e p)}{(p \log_e p)^2}$ respectively. Also draw the pmf and F(x) very clearly on the same graph.

Solution # 5 :

For the given logarithmic probability distribution we have to find the mean and variance. For discrete distributions the mean is given by the formula

$$\text{Mean} = \mu_x = E(X) = \sum x\, f(x)$$

and the variance is given by

$$\text{Variance} = V(X) = E\left((X - \mu_x)^2\right) = E(X^2) - (E(X))^2$$

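Now $E(X)$ is given by

$$\begin{aligned}
E(X) &= \sum_{X=1}^{\infty} X\, f(X) \\
&= \sum_{X=1}^{\infty} X \left(-\frac{q^{X}}{X \log_e p}\right) \\
&= -\frac{1}{\log_e p} \sum_{X=1}^{\infty} q^{X} \\
&= -\frac{1}{\log_e p} \left(q + q^2 + q^3 + \cdots\right) \\
&= -\frac{1}{\log_e p} \left(\frac{q}{1-q}\right)
\end{aligned}$$

Since $p = 1 - q$,

$$E(X) = -\frac{q}{p \log_e p} \qquad (a)$$

Hence Proved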
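The result for the mean can be checked numerically by summing the series directly. The sketch below is illustrative: the helper names, the choice p = 0.3 and the truncation at 2000 terms are assumptions, and it takes q = 1 - p as in the derivation.

```python
import math


def log_series_pmf(x, p):
    """pmf f(x) = -q**x / (x * ln p) with q = 1 - p, for x = 1, 2, 3, ..."""
    q = 1 - p
    return -(q**x) / (x * math.log(p))


def partial_moment(p, power, terms=2000):
    """Truncated sum of x**power * f(x) over x = 1..terms."""
    return sum(x**power * log_series_pmf(x, p) for x in range(1, terms + 1))


p = 0.3
total_probability = partial_moment(p, 0)   # should be close to 1
numeric_mean = partial_moment(p, 1)        # should match -q / (p ln p)
closed_form_mean = -(1 - p) / (p * math.log(p))
```

The zeroth moment confirms that the pmf sums to 1, which follows from the series $\sum_{x \ge 1} q^x/x = -\log_e(1-q)$.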

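Now $E(X^2)$ is given by

$$\begin{aligned}
E(X^2) &= \sum_{X=1}^{\infty} X^2 f(X) \\
&= \sum_{X=1}^{\infty} X^2 \left(-\frac{q^{X}}{X \log_e p}\right) \\
&= -\frac{1}{\log_e p} \sum_{X=1}^{\infty} X q^{X}
\end{aligned}$$

$$E(X^2) = -\frac{1}{\log_e p} \left(q + 2q^2 + 3q^3 + \cdots\right) \qquad (b)$$

Also

$$q\,E(X^2) = -\frac{1}{\log_e p} \left(q^2 + 2q^3 + 3q^4 + \cdots\right) \qquad (c)$$

Subtracting Equation (c) from Equation (b)

$$(1-q)\,E(X^2) = -\frac{1}{\log_e p} \left(q + q^2 + q^3 + q^4 + \cdots\right) = -\frac{1}{\log_e p} \left(\frac{q}{1-q}\right)$$

Hence, with $p = 1 - q$,

$$E(X^2) = -\frac{q}{p^2 \log_e p} \qquad (d)$$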

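From Equations (a) and (d) we can find the variance

$$\begin{aligned}
V(X) &= E(X^2) - (E(X))^2 \\
&= -\frac{q}{p^2 \log_e p} - \left(-\frac{q}{p \log_e p}\right)^2
\end{aligned}$$

$$\text{Variance} = V(X) = \frac{-q\,(q + \log_e p)}{(p \log_e p)^2}$$

Hence Proved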
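The variance formula can be checked in the same numerical way, by computing $E(X^2) - (E(X))^2$ from truncated sums and comparing with the closed form for the values of p used in the plots. This is a hedged sketch: the function names and the truncation at 5000 terms are assumptions.

```python
import math


def log_series_variance_closed_form(p):
    """Closed form -q (q + ln p) / (p ln p)**2 with q = 1 - p."""
    q = 1 - p
    lp = math.log(p)
    return -q * (q + lp) / (p * lp) ** 2


def log_series_variance_numeric(p, terms=5000):
    """E(X^2) - E(X)^2 from truncated sums over x = 1..terms."""
    q = 1 - p
    lp = math.log(p)
    pmf = [-(q**x) / (x * lp) for x in range(1, terms + 1)]
    mean = sum(x * f for x, f in enumerate(pmf, start=1))
    second = sum(x * x * f for x, f in enumerate(pmf, start=1))
    return second - mean**2


# Largest gap between the two over the p values used in Figures 25 to 28.
max_gap = max(
    abs(log_series_variance_numeric(p) - log_series_variance_closed_form(p))
    for p in (0.1, 0.2, 0.3, 0.5)
)
```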


Figure 25: pmf and cdf plots for given function (p=0.1); plot title: pmf and cdf for f(x)=(-(1-p).^X)./(X.*log(p)); x-axis: X, y-axis: probability f(x); legend: cdf, pmf

Figure 26: pmf and cdf plots for given function (p=0.2); plot title: pmf and cdf for f(x)=(-(1-p).^X)./(X.*log(p)); x-axis: X, y-axis: probability f(x); legend: cdf, pmf


Figure 27: pmf and cdf plots for given function (p=0.3); plot title: pmf and cdf for f(x)=(-(1-p).^X)./(X.*log(p)); x-axis: X, y-axis: probability f(x); legend: cdf, pmf

Figure 28: pmf and cdf plots for given function (p=0.5); plot title: pmf and cdf for f(x)=(-(1-p).^X)./(X.*log(p)); x-axis: X, y-axis: probability f(x); legend: cdf, pmf


MATLAB codes used

1. Code for Timeseries plot

    % Function to create a time series plot of the data
    % Input : Y
    % Y is a matrix of dimensions n x 2, where n is the number of time steps.
    % 1st column contains time steps and 2nd column contains values
    % at those particular time steps.
    function TS(Y)
        plot(Y(:,1), Y(:,2), 'r')   % Plotting time series data
        xlabel('Time (Years)')
        ylabel('Values')

        % Relabel the tick positions of the time column as years
        set(gca, 'XTick', [100000,200000,300000,400000,500000,600000,700000, ...
            800000,900000], 'XTickLabel', {'10','20','30','40','50','60','70', ...
            '80','90'})
        print -dpsc2 timeseries.eps
    end

2. Code for Probability histograms

    % Function to plot probability histograms for any given data.
    % Inputs : Y, n, s1
    % Y  : Vector whose probability histogram has to be plotted.
    % n  : Bin size to be used for the histogram. All data points are
    %      divided into segments, each having a width of n.
    % s1 : String passed as an argument for naming the histogram plot
    %      according to the data provided.
    function Range = probabilityPlot(Y, n, s1)
        Y = sort(Y);   % Sorting the data from lower to higher values.
        first = floor(min(Y)/n);   % index of the lowest bin
        last = floor(max(Y)/n);    % index of the highest bin
        % Creating a NaN matrix with exactly one row per bin, so that the
        % row count is an integer and no NaN rows are left in the sum
        Range = nan(last - first + 1, 3);
        % Finding the frequency of each bin with a for loop
        for i = first:last
            k = i - first + 1;       % row of Range for this bin
            Range(k,1) = i*n;        % lower value of bin segment
            Range(k,2) = (i+1)*n;    % higher value of bin segment
            % Finding all the values lying between the lower (inclusive)
            % and upper (exclusive) value of the bin segment in Temp
            Temp = Y(Y >= Range(k,1) & Y < Range(k,2));
            % Storing frequency of the bin in Range
            Range(k,3) = numel(Temp);
        end
        % Plotting bar chart of relative frequencies at the bin centres
        bar(mean(Range(:,1:2), 2), Range(:,3)/sum(Range(:,3)), 'r')
        xlabel('Values', 'FontSize', 14)
        ylabel('Probability (f(x))', 'FontSize', 14)
        string = strcat(s1, 'prob', num2str(n), '.eps');
        title(['Histogram Plot for ', s1, ' data (Bin Size : ', ...
            num2str(n), ' units)'], 'FontSize', 15)
        print('-dpsc2', string);
    end

3. Code for pmf and cdf

    % Function to plot pmf and cdf plots for the given function for
    % different values of probability (p).
    % Input : p, n
    % p : probability
    % n : range up to which data is to be plotted
    function pmfplots(p, n)
        X = 1:1:n;
        Y = (-(1-p).^X) ./ (X.*log(p));   % pmf values for X = 1..n
        string = strcat('logplotprob', num2str(p), '.eps');
        bar(X, cumsum(Y), 'LineWidth', 0.25, 'FaceColor', 'g')   % cdf
        hold on
        bar(X, Y, 0.4, 'r')                                      % pmf
        legend('cdf', 'pmf', 'Location', 'NorthWest')
        ylim([0 1.3])
        title('pmf and cdf for f(x)=(-(1-p).^X)./(X.*log(p))')
        xlabel('X')
        ylabel('probability f(x)')
        line([0 35], [1 1], 'LineStyle', ':')   % reference line at 1
        print('-dpsc2', string)
    end
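For readers without MATLAB, the numbers behind the pmf and cdf bars can be reproduced with a short Python sketch. It only computes the values and does not plot them; the function name `pmf_and_cdf` is an assumption, and the cdf is taken as the running sum of the pmf, as in the `cumsum` call above.

```python
import math
from itertools import accumulate


def pmf_and_cdf(p, n):
    """Return (pmf, cdf) lists of f(x) = -(1-p)**x / (x ln p) for x = 1..n."""
    q = 1 - p
    pmf = [-(q**x) / (x * math.log(p)) for x in range(1, n + 1)]
    cdf = list(accumulate(pmf))  # running sum, the discrete F(x)
    return pmf, cdf


# Same range as the plots: x = 1..35, here for p = 0.5.
pmf, cdf = pmf_and_cdf(0.5, 35)
```

The last cdf value approaches 1 as n grows, which is what the dotted reference line in the plots is meant to show.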
