time series project on daily temperatures


Time Series Student Project

December 26, 2011

Introduction

This project performs a time series analysis of the daily temperature of Allentown, PA. The project attempts to fit an ARIMA model to the data by using the correlogram, regression analysis, testing of residuals, and visual inspection of predicted values versus actual data.

I chose to study the temperatures of Allentown because ski season is around the corner. Although I am not an avid skier myself, my family and friends are, and they always like to “predict the weather” in the winter. A few of their favorite ski slopes are very close to Allentown. I chose to analyze the daily high temperature because it is the temperature we feel most: highs are recorded during the hours when we are most active.

Source Data

I use the data set that NEAS provides on its website, specifically the station 360106 data. The data spans 5/1/1948 to 12/31/2005. However, not all data points are valid. First, I removed all the invalid data points, namely the records with -999 degrees. I then noted a “gap” in the data, where a month’s worth of data is missing for June 1992. Two further days are missing (2/9/1954 and 2/10/1954). A two-day gap is immaterial to the analysis, and I used linear interpolation to fill in the temperatures of these two days. I decided to cut off the data at 5/31/1992 and use only the 6/1/1948 to 5/31/1992 data points. This selected period has 44 years of data with 16071 data points, which is sufficient for a robust analysis. The selected dataset, after the removal of invalid points, linear interpolation, and the cut-off at 5/31/1992, can be found on the “analysis data” tab of the Excel file (see footnote 1).
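The cleaning steps described above can be sketched in a few lines of Python. This is only an illustration, not the spreadsheet's actual formulas: the function names and the sample temperatures are hypothetical, and only the -999 sentinel and the use of linear interpolation come from the text.

```python
MISSING = -999  # sentinel for invalid readings, as described in the text


def drop_invalid(readings):
    """Keep only the (date, temperature) pairs that are not the sentinel."""
    return [(day, temp) for day, temp in readings if temp != MISSING]


def fill_gap_linear(before, after, n_missing):
    """Linearly interpolate n_missing values between two known neighbours."""
    step = (after - before) / (n_missing + 1)
    return [before + step * (i + 1) for i in range(n_missing)]


# Hypothetical highs on 2/8/1954 and 2/11/1954 (NOT taken from the data set).
filled = fill_gap_linear(30.0, 36.0, 2)
print(filled)  # [32.0, 34.0]
```

With two missing days, each interpolated value moves one third of the way from the last known temperature toward the next known one.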

Since 44 years of data are too much to include in one graph, I separated the data into two graphs. Figure 1a and Figure 1b below graph all the data points for the periods 6/1/1948 to 5/31/1970 and 6/1/1970 to 5/31/1992 respectively.

1 Please note that, to save storage space and computer processing power, I copied and pasted values down the columns of certain formulas (particularly vlookup, sumproduct, sumif, countif) in the Excel file. The first row still contains the formula so that the grader can follow my calculation, but the rows below contain only pasted values.

NEAS Seminar - Katy Siu



Figure 1a

Figure 1b

The graphs clearly show seasonality, as common knowledge would suggest: temperatures are high in the summer months and low in the winter months. We also see that the data is quite volatile. Hence, smoothing and de-seasonalizing of the data are required.

Smoothing and De-seasonalizing the Data

To smooth the data, I start with a 3 day centered moving average. The temperature for each calendar day is recalculated as the 44 year average of that day, the day before, and the day after. The smoothed data can be found in column P of the “Smoothed and Adj’d Data” tab of the Excel file. Figure 2a shows the temperature smoothed by 3 day centered moving averages for the period 6/1/1948 to 5/31/1949.



Figure 2a

The graph still shows more volatility than I would like. I then tried to smooth the data with a 5 day centered moving average (see column Q of “Seasonally Adj’d Data” tab); Figure 2b graphs the smoothed data for the same period.

Figure 2b

The 5 day centered moving average gives a smoother curve than the 3 day centered moving average. I next tried a 7 day centered moving average (column R in the same tab of the Excel file); Figure 2c graphs the smoothed data for the same period.



Figure 2c

As shown in Figure 2c, I believe the data looks sufficiently smoothed, without the undesirable flattening that warns of over-smoothing. Intuitively, a 7 day centered moving average also seems sufficient for this data set. As an example, the smoothed temperature for July 4th, 1948 is calculated as the average of the 308 data points (7 days × 44 years) of the high temperatures on July 1st through 7th over the 44 years.
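One way to reproduce this kind of smoothing is to average each calendar day across the years first, and then run a centered moving average across neighbouring calendar days. The sketch below shows only the second step. Wrapping around the end of the year is an assumption of this sketch, as is the function name; the text does not say how the spreadsheet treats the year boundary or February 29.

```python
def centered_moving_average(daily_means, window):
    """Circular centered moving average over per-calendar-day means.

    daily_means: one multi-year average per calendar day.
    window: odd number of days (3, 5 or 7 in the report).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that it can be centered")
    half = window // 2
    n = len(daily_means)
    return [
        sum(daily_means[(i + k) % n] for k in range(-half, half + 1)) / window
        for i in range(n)
    ]


# Toy example with seven "calendar days" instead of a full year.
toy = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
smoothed = centered_moving_average(toy, 3)
print(smoothed[1])  # 2.0
```

If every calendar day has the same number of yearly observations, averaging the daily means over a 7 day window gives the same result as pooling all 7 × 44 raw values.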

Next, I de-seasonalize the data in two steps. First, I calculate a seasonal index by dividing the smoothed temperature by the average temperature for the complete dataset (see column S for the seasonal index). I then calculate the seasonally adjusted temperature by dividing each raw data point by the seasonal index (see column T for seasonally adjusted temperatures). As an example, the seasonal index for July 4th is 83.718/61.172 = 1.369, where 83.718 is the smoothed temperature for July 4th and 61.172 is the overall average of the dataset. The seasonally adjusted temperature for July 4th, 1948 is calculated as 90/1.369 = 65.736, where 90 is the raw high temperature of July 4th, 1948 and 1.369 is the seasonal index for July 4th.
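The July 4th example can be restated as a short Python sketch. The function names are illustrative; the three input numbers are the ones quoted in the text. Because those quoted inputs are rounded, the adjusted temperature computed here agrees with the report's 65.736 only to within a few hundredths of a degree.

```python
def seasonal_index(smoothed_temp, overall_mean):
    """Seasonal index for a calendar day: smoothed value over the grand mean."""
    return smoothed_temp / overall_mean


def seasonally_adjust(raw_temp, index):
    """Remove the seasonal effect by dividing the raw reading by the index."""
    return raw_temp / index


july4_index = seasonal_index(83.718, 61.172)
july4_1948_adjusted = seasonally_adjust(90.0, july4_index)
print(round(july4_index, 3))  # 1.369
```

An index above 1 marks a day that is typically warmer than the overall average, so dividing by it pulls summer readings down toward the overall mean.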

The following three graphs show the effects of the smoothing and de-seasonalizing of the

data for the year 1991.



Figure 3a

Figure 3b



Figure 3c

We see that the raw data in Figure 3a displays significant volatility. Figure 3b shows data that is smoothed sufficiently to work with. Figure 3c then shows the flattening of the curve in Figure 3b, in which temperatures are higher in the middle of the year than at either end. This pattern is removed by de-seasonalizing the data.

Figure 4a and Figure 4b plot all the seasonally adjusted data points for the periods 6/1/1948 to 5/31/1970 and 6/1/1970 to 5/31/1992 respectively.

Figure 4a



Figure 4b

The above two graphs show the seasonally adjusted temperature (column T in the Excel file). Some spikes are noted in the graphs. These can be explained by days when the temperature was uncharacteristically high or low compared to the smoothed average for that calendar day. Note also that the sine-wave-like pattern seen in Figure 1a and Figure 1b is now removed. There is no observable trend in these two graphs, which suggests that the data is stationary. We will next look at the correlogram of the data to further confirm whether the data is stationary.

Sample Autocorrelation and Correlogram

Sample autocorrelations are then calculated (columns V through Y in the Excel file). The sample autocorrelations for the first 200 lags are plotted in Figure 5 below:

Figure 5

We see that the correlogram declines fairly rapidly but does not drop to zero immediately, and then it hovers around zero as the lag increases, which indicates that

(1) the data is stationary

(2) the data follows an autoregressive process but not a moving average process (which will have a correlogram that drops suddenly) or an ARMA process (which will have a correlogram that declines less quickly, in a geometric fashion)
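For reference, the sample autocorrelations behind a correlogram like this one can be computed with the standard textbook estimator, sketched below in Python. The exact spreadsheet formula in columns V through Y is not shown in the text, so this is an assumed equivalent rather than the report's own calculation, and the function names are illustrative.

```python
def sample_autocorrelation(series, lag):
    """Textbook sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((y - mean) ** 2 for y in series)
    numer = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return numer / denom


def correlogram(series, max_lag):
    """Sample autocorrelations for lags 1..max_lag."""
    return [sample_autocorrelation(series, k) for k in range(1, max_lag + 1)]


toy = [1.0, 2.0, 3.0, 4.0, 5.0]
print(sample_autocorrelation(toy, 1))  # 0.4
```

Calling `correlogram(adjusted_series, 200)` on the seasonally adjusted temperatures would give the 200 values of the kind plotted in Figure 5.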



I then calculate the standard deviation of the sample autocorrelations under a white noise process, which is 1/sqrt(T). Since there are 16071 data points, the standard deviation is 1/sqrt(16071) = 0.789%. I use a 95% confidence interval, which implies plus or minus 1.96 standard deviations. I also choose to double this width, which is more suitable for volatile data like daily temperatures. The bounds are therefore 2 * 1.96 * 0.789% = 3.092% (up or down). We see that the sample autocorrelations of the first few lags are significantly greater than zero, which shows that the time series is not generated by a white noise process.
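The band arithmetic above can be checked with a few lines of Python. The variable names are illustrative; the sample size, the 1.96 multiplier, the doubling, and the three autocorrelation values are those stated in the text.

```python
import math

T = 16071                      # number of data points
std_error = 1 / math.sqrt(T)   # about 0.00789, i.e. 0.789%
band = 2 * 1.96 * std_error    # doubled 95% band, about 3.092%


def outside_band(r, width=band):
    """True if a sample autocorrelation lies outside the white-noise band."""
    return abs(r) > width


first_lags = [0.61, 0.33, 0.22]  # first three sample autocorrelations
print(all(outside_band(r) for r in first_lags))  # True
```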

From the correlogram, we also see that the first three sample autocorrelations, 0.61, 0.33, and 0.22, are large, which suggests that an AR(1), AR(2), or AR(3) model would fit the data well.

Model Specification

Regression Analysis

I then perform regression analysis using Excel’s regression add-in. I ran three regressions, for the AR(1), AR(2), and AR(3) models respectively (see columns Z to AH in the Excel file for the underlying known x’s and y’s used in the regressions). The results of the three regressions are summarized below:

AR(1)

Regression Statistics
Multiple R           0.613238661
R Square             0.376061656
Adjusted R Square    0.376022825
Standard Error       8.390556064
Observations         16070

ANOVA
              df       SS             MS          F           Significance F
Regression    1        681805.7952    681805.8    9684.545    0
Residual      16068    1131210.194    70.40143
Total         16069    1813015.989

              Coefficients    Standard Error    t Stat
Intercept     23.65620294     0.386863094       61.14877
x1            0.613240155     0.006231477       98.41008



AR(2)

SUMMARY OUTPUT

Regression Statistics
R Square             0.379089151
Observations         16069

ANOVA
              df       SS             MS          F           Significance F
Regression    2        687279.0176    343639.5    4904.445    0
Residual      16066    1125695.622    70.06695
Total         16068    1812974.64

              Coefficients    Standard Error    t Stat
Intercept     25.30656498     0.428518276       59.05598
x1            0.655996371     0.007870202       83.35191
x2            -0.0697402      0.007870495       -8.86097

AR(3)

SUMMARY OUTPUT

Regression Statistics
R Square             0.381864307
Observations         16068

ANOVA
              df       SS             MS          F           Significance F
Regression    3        692305.5828    230768.5    3307.94     0
Residual      16064    1120656.691    69.762
Total         16067    1812962.274

              Coefficients    Standard Error    t Stat
Intercept     23.62139042     0.471722223       50.07479
x1            0.660661578     0.007872278       83.92254
x2            -0.1133908      0.009399489       -12.0635
x3            0.066543024     0.007872735       8.452338
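As a quick consistency check on the regression output, the R Square, F, and t Stat figures can be recomputed from the sums of squares and coefficients reported for AR(1). The sketch below uses only numbers from the AR(1) output and standard regression identities; the variable names are illustrative.

```python
# Figures copied from the AR(1) regression output.
ss_regression = 681805.7952
ss_residual = 1131210.194
df_regression = 1
df_residual = 16068
slope = 0.613240155
slope_std_error = 0.006231477

# R^2 is the share of total variation explained by the regression.
r_squared = ss_regression / (ss_regression + ss_residual)

# F is the ratio of the regression and residual mean squares.
f_statistic = (ss_regression / df_regression) / (ss_residual / df_residual)

# The t statistic is the coefficient divided by its standard error.
t_statistic = slope / slope_std_error

print(round(r_squared, 4))  # 0.3761
```

With a single regressor the F statistic equals the square of the slope's t statistic, which is one reason the F statistic falls mechanically as further lags are added.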



We see that AR(1) has a very high t-statistic for its x1 variable, and that both AR(2) and AR(3) have high t-statistics for the x1 variable as well. We can also note that introducing x2 in AR(2), and x2 and x3 in AR(3), increases R^2, but not by a significant amount. AR(1) has the highest F statistic overall. It is important to note also that AR(3) has an F statistic that is significantly lower than the F statistics of the other two models. I therefore conclude at this point that AR(3) is not the most appropriate model for this data, and further testing of the AR(3) model is deemed no longer necessary.
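The regressions in this section regress each observation on its own lagged values. For the AR(1) case this reduces to simple least squares of y_t on y_{t-1}, which can be sketched in plain Python as below. This is an assumed equivalent of what Excel's regression add-in computes, not a copy of the workbook; the function name and the toy series are hypothetical. One observation is lost to the lag, which is consistent with 16071 data points yielding 16070 observations.

```python
def fit_ar1(series):
    """Least squares fit of y_t = intercept + slope * y_{t-1}."""
    x = series[:-1]  # lagged values y_{t-1}
    y = series[1:]   # current values y_t
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Noise-free toy series generated by y_t = 10 + 0.5 * y_{t-1}.
toy = [0.0]
for _ in range(10):
    toy.append(10.0 + 0.5 * toy[-1])

intercept, slope = fit_ar1(toy)
```

On the noise-free toy series the fit recovers the generating intercept and slope up to floating-point error; on real data the residuals from this fit are what the tests in the following sections examine.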

Durbin-Watson Test

For both the AR(1) and AR(2) models, I then proceed to test the residuals for serial correlation by performing the Durbin-Watson test. The null hypothesis is that there is no serial correlation in the residuals. A Durbin-Watson statistic of 2 indicates no serial correlation, whereas a statistic substantially less than 2 indicates positive serial correlation.

For AR(1), the Durbin-Watson statistic calculated is 1.914 (cell G23 of “AR(1)” tab in the Excel file), which is close to 2 and suggests no serial correlation. I also used Excel’s built-in correlation function, and the correlation calculated is 0.04276 (cell C16096 in “AR(1)” tab in the Excel file). The autocorrelation calculated by formula is 0.04275 (cell C16097 in “AR(1)” tab in the Excel file). Both of these correlation measures are close to zero.

For AR(2), the Durbin-Watson statistic calculated is 1.991 (cell G23 of “AR(2)” tab in the Excel file), which is very close to 2 and suggests no serial correlation.

We can then conclude that the residuals are not serially correlated for either the AR(1) or the AR(2) model.
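The Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below implements that standard definition in Python and also checks the common approximation DW ≈ 2(1 - r), where r is the lag-1 autocorrelation of the residuals. The function name and toy residuals are illustrative; the value 0.04275 is the AR(1) residual autocorrelation quoted in the text.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a list of regression residuals."""
    numer = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denom = sum(e ** 2 for e in residuals)
    return numer / denom


# Perfectly alternating residuals (strong negative correlation) push DW above 2.
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # 3.0

# Approximation DW ~ 2 * (1 - r) using the AR(1) residual autocorrelation.
approx_dw_ar1 = 2 * (1 - 0.04275)
```

The approximation gives about 1.9145, in line with the 1.914 reported for AR(1).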

Box-Pierce Test

Next, I test whether the residuals resemble a white noise process by performing the Box-Pierce test. I chose to use K = 100. The study notes recommend using a K of 15 to 40, but since this data set is much larger than the illustrative worksheet, I decided to use K = 100.

For AR(1), the critical value at 90% for degrees of freedom = 100 - 1 = 99 is 117.4 (cell L23 of “AR(1)” tab in the Excel file). The Q-statistic calculated is 0.0216 (cell K23), which is much less than the critical value. We thus do not reject the null hypothesis that the residuals resemble a white noise process.

For AR(2), the critical value at 90% for degrees of freedom = 100 - 2 = 98 is 116.3 (cell L23 of “AR(2)” tab in the Excel file). The Q-statistic calculated is 0.0296 (cell K23), which is much less than the critical value. We do not reject the null hypothesis that the residuals resemble a white noise process.
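For readers who want to reproduce the test outside Excel, the sketch below shows the textbook Box-Pierce statistic, Q = T × (sum of the squared residual autocorrelations for lags 1 to K), together with an approximate chi-square critical value. The critical value uses the Wilson-Hilferty approximation, which is a choice made for this sketch to stay within the Python standard library; the spreadsheet presumably uses an exact chi-square function. Whether cell K23 follows exactly this form of Q is not stated in the text, and the function names are illustrative.

```python
from statistics import NormalDist


def box_pierce_q(autocorrelations, n_obs):
    """Textbook Box-Pierce Q: n_obs times the sum of squared autocorrelations."""
    return n_obs * sum(r * r for r in autocorrelations)


def chi_square_critical(df, prob):
    """Approximate chi-square quantile (Wilson-Hilferty approximation)."""
    z = NormalDist().inv_cdf(prob)
    a = 2.0 / (9.0 * df)
    return df * (1.0 - a + z * a ** 0.5) ** 3


critical_ar1 = chi_square_critical(100 - 1, 0.90)  # close to 117.4
critical_ar2 = chi_square_critical(100 - 2, 0.90)  # close to 116.3
```

The degrees of freedom are K minus the number of fitted autoregressive parameters, matching the 99 and 98 used above.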



Selection

Both the Durbin-Watson and Box-Pierce tests show that AR(1) and AR(2) are acceptable models. I choose AR(1), following the principle of parsimony.

Conclusion

The AR(1) model is selected; specifically, it is:

Y_t = 0.613 Y_{t-1} + 23.656

Intuitively, this makes the most sense because we know that the temperature of a day

depends on the temperature of the previous days.
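A one-step forecast with the selected model is a single line of arithmetic, sketched below in Python. The sketch assumes the regression was run on the seasonally adjusted series, so a forecast would need to be multiplied by the relevant day's seasonal index to return to actual degrees; the function and constant names are illustrative.

```python
INTERCEPT = 23.656  # constant term of the selected AR(1) model
SLOPE = 0.613       # coefficient on the previous day's value


def predict_next(previous_adjusted_temp):
    """One-step-ahead forecast of the seasonally adjusted temperature."""
    return INTERCEPT + SLOPE * previous_adjusted_temp


# Long-run mean implied by the model: intercept / (1 - slope).
long_run_mean = INTERCEPT / (1 - SLOPE)
print(round(long_run_mean, 1))  # 61.1
```

The implied long-run mean of about 61.1 is close to the overall dataset average of 61.172 quoted earlier, which is what one would expect from a stationary AR(1) fit.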

Finally, I plotted a graph for the year 1991 to visually inspect actual versus predicted values per the model.

Figure 6

As shown in Figure 6, the selected model predicts the temperature fairly well. The AR(1) model is simple, yet intuitively it is the model that makes the most sense, at least for this part of the United States. Overall, the selected model appears satisfactory.
