time series project on daily temperatures
8/3/2019 Time Series Project on Daily Temperatures
http://slidepdf.com/reader/full/time-series-project-on-daily-temperatures 1/12
Time Series Student Project
December 26, 2011
Introduction
This project performs a time series analysis of the daily temperature of Allentown, PA. The project attempts to fit an ARIMA model to the data by using the correlogram, regression analysis, tests of the residuals, and a visual comparison of predicted values against actual data.
I chose to study the temperatures of Allentown because ski season is around the corner. Although I am not an avid skier myself, my family and friends are, and they always like to “predict the weather” in the winter time. A few of their favorite ski slopes are very close to Allentown. I chose to analyze the daily high temperature because it is the temperature we feel the most: high temperatures are recorded in the hours when we are most active in the day.
Source Data
I use the data set that NEAS provides on its website, specifically the station 360106 data. I noted that the data spans from 5/1/1948 to 12/31/2005. However, not all data points are valid. First, I removed all the invalid data points, which are the records with -999 degrees. I then noted a notable “gap” in the data where a month’s worth of data is missing, for the month of June in 1992. There are also two days missing in the data (2/9/1954 and 2/10/1954). A two-day gap is immaterial to the analysis, and I used linear interpolation to fill in the temperatures of these two days. I decided to cut off the data at 5/31/1992 and use only the 6/1/1948 to 5/31/1992 data points. This selected period has 44 years of data with 16071 data points, which is sufficient for a robust analysis. The selected dataset, after the removal of invalid points, the linear interpolation and the cut-off at 5/31/1992, can be found on the “analysis data” tab of the Excel file (see footnote 1).
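The cleaning steps described above can be sketched in Python. This is only an illustration, not the workbook's formulas: the function name `clean_series`, the `max_gap` parameter and the sample readings are hypothetical, and the sketch assumes that missing days are present in the series as -999 placeholders.

```python
MISSING = -999  # sentinel for invalid readings, as described in the text


def clean_series(temps, max_gap=2):
    """Fill short runs of MISSING values by linear interpolation.

    Runs longer than `max_gap`, or runs touching either end of the list,
    are left as None so the caller can decide where to cut the data off.
    """
    values = [None if t == MISSING else float(t) for t in temps]
    i = 0
    while i < len(values):
        if values[i] is not None:
            i += 1
            continue
        start = i
        while i < len(values) and values[i] is None:
            i += 1
        gap = i - start
        # interpolate only interior gaps that are short enough
        if start > 0 and i < len(values) and gap <= max_gap:
            left, right = values[start - 1], values[i]
            step = (right - left) / (gap + 1)
            for k in range(gap):
                values[start + k] = left + step * (k + 1)
    return values


# hypothetical readings: a two-day gap between 30 and 36 degrees
print(clean_series([28, 30, MISSING, MISSING, 36, 35]))
# [28.0, 30.0, 32.0, 34.0, 36.0, 35.0]
```

A month-long gap such as June 1992 would stay as `None` under this sketch, which corresponds to the decision to truncate the series rather than interpolate.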
Since 44 years are too much to include in one graph, I separated the data into two graphs. Figure 1a and Figure 1b below graph all the data points for the periods 6/1/1948 to 5/31/1970 and 6/1/1970 to 5/31/1992 respectively.
1 Please note that in order to save storage space and computer processing power, I copied and pasted values down the columns of certain formulas (particularly vlookup, sumproduct, sumif, countif) in the Excel file. The first row still contains the formula so that the grader can follow my calculation, but the rows below contain only pasted values.
NEAS Seminar - Katy Siu
Figure 1a
Figure 1b
Clearly, the graphs show the seasonality we would expect from common knowledge: temperatures are high in the summer months and low in the winter months. We also see that the data is quite volatile. Hence, smoothing and de-seasonalizing of the data are required.
Smoothing and De-seasonalizing the Data
To smooth the data, I start with a 3-day centered moving average. The temperature for each calendar day is recalculated as the 44-year average over that day, the day before and the day after. The smoothed data can be found in column P of the “Smoothed and Adj’d Data” tab of the Excel file. Figure 2a shows the temperature smoothed by 3-day centered moving averages for the period 6/1/1948 to 5/31/1949.
Figure 2a
The graph still shows more volatility than I would like. I then tried to smooth the data with a 5-day centered moving average (see column Q of the “Seasonally Adj’d Data” tab). Figure 2b graphs the smoothed data for the same period.
Figure 2b
The 5-day centered moving average gives a smoother curve than the 3-day centered moving average. I next tried a 7-day centered moving average (column R in the same tab of the Excel file). Figure 2c graphs the smoothed data for the same period.
Figure 2c
As shown in Figure 2c, the data looks sufficiently smooth, without the undesirable flattening that would warn of over-smoothing. Intuitively, a 7-day centered moving average also seems sufficient for this data set. As an example, the smoothed temperature for July 4th, 1948 is calculated as the average of 308 data points (7 days × 44 years): the high temperatures on July 1st through July 7th over the 44 years.
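The pooled moving average described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the spreadsheet's method: the function name `smoothed_climatology` is hypothetical, every year is assumed to have the same number of days (leap days are ignored), and the window wraps around within the same year at the ends.

```python
def smoothed_climatology(years, window=7):
    """Centered moving average pooled across years.

    `years` is a list of equal-length lists, one per year, holding the
    daily high for each calendar day. For each calendar day the result is
    the mean of every reading within `window // 2` days on either side,
    taken over all years (wrapping around the end of the year).
    """
    half = window // 2
    n_days = len(years[0])
    smoothed = []
    for d in range(n_days):
        pool = [
            year[(d + offset) % n_days]
            for year in years
            for offset in range(-half, half + 1)
        ]
        smoothed.append(sum(pool) / len(pool))
    return smoothed


# two hypothetical 5-day "years", smoothed with a 3-day window
demo = [[1.0, 2.0, 3.0, 4.0, 5.0], [3.0, 4.0, 5.0, 6.0, 7.0]]
print(smoothed_climatology(demo, window=3)[2])  # 4.0
```

With 44 years and `window=7`, each pool holds 7 × 44 = 308 readings, which matches the count quoted in the July 4th example.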
Next, I de-seasonalize the data in two steps. First, I calculate a seasonal index by dividing the smoothed temperature by the average temperature for the complete dataset (see column S for the seasonal index). I then calculate the seasonally adjusted temperature by dividing the individual raw data points by the seasonal index (see column T for the seasonally adjusted temperatures). As an example, the seasonal index for July 4th is 83.718/61.172 = 1.369, where 83.718 is the smoothed temperature for July 4th and 61.172 is the overall average of the dataset. The seasonally adjusted temperature for July 4th, 1948 is calculated as 90/1.369 = 65.736, where 90 is the raw high temperature of July 4th, 1948 and 1.369 is the seasonal index for July 4th.
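The two de-seasonalizing steps can be written out as a short Python sketch using the July 4th figures quoted above. The function names are hypothetical. Because the inputs here are the rounded values from the text, the adjusted temperature comes out slightly different from the 65.736 quoted above, presumably because the workbook carries more decimal places.

```python
def seasonal_index(smoothed_temp, overall_mean):
    """Ratio of the smoothed temperature for a calendar day to the overall mean."""
    return smoothed_temp / overall_mean


def seasonally_adjust(raw_temp, index):
    """Divide a raw reading by the seasonal index of its calendar day."""
    return raw_temp / index


index = seasonal_index(83.718, 61.172)  # smoothed July 4th value / overall average
adjusted = seasonally_adjust(90, index)  # raw high on July 4th, 1948

print(round(index, 3))     # 1.369
print(round(adjusted, 2))  # 65.76 with these rounded inputs
```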
The following three graphs show the effects of the smoothing and de-seasonalizing of the
data for the year 1991.
Figure 3a
Figure 3b
Figure 3c
We see that the raw data in Figure 3a displays significant volatility. Figure 3b shows data that is smoothed sufficiently for us to work with. Figure 3c then shows the flattening of the curve in Figure 3b, where temperatures are higher in the middle of the year than at both ends. This pattern is removed by de-seasonalizing the data.
Figure 4a and Figure 4b plot all the seasonally adjusted data points for the periods 6/1/1948 to 5/31/1970 and 6/1/1970 to 5/31/1992 respectively.
Figure 4a
Figure 4b
The above two graphs show the seasonally adjusted temperature (column T in the Excel file). Some spikes are noted in the graphs. These can be explained by days when the temperature was uncharacteristically high or low compared to the smoothed average for that calendar day. The pattern resembling a sine wave seen in Figure 1a and Figure 1b is now removed. There is also no observable trend in these two graphs, which suggests that the data is stationary. We will next look at the correlogram of the data to check further whether the data is stationary.
Sample Autocorrelation and Correlogram
Sample autocorrelations are then calculated (columns V through Y in the Excel file). The sample autocorrelations for the first 200 lags are plotted in Figure 5 below:
Figure 5
We see that the correlogram declines fairly rapidly but does not drop to zero immediately, and then hovers around zero as the lag increases, which indicates that
(1) the data is stationary
(2) the data follows an autoregressive process rather than a moving average process (which would have a correlogram that drops suddenly) or an ARMA process (which would have a correlogram that declines less quickly, in a geometric fashion)
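The sample autocorrelations behind a correlogram like Figure 5 can be computed with the usual estimator, sketched below. The text does not state which exact formula the workbook uses, so the estimator and the function names are assumptions.

```python
def sample_autocorrelation(series, lag):
    """Lag-k sample autocorrelation: autocovariance at `lag` over the variance."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return numer / denom


def correlogram(series, max_lag):
    """Sample autocorrelations for lags 1..max_lag."""
    return [sample_autocorrelation(series, k) for k in range(1, max_lag + 1)]


# tiny hypothetical series, just to show the call
print(correlogram([1, 2, 3, 4, 5], 2))
```

For the project's data the call would be `correlogram(adjusted_temps, 200)`, where `adjusted_temps` stands for the seasonally adjusted series.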
I then calculate the standard deviation of the sample autocorrelations under a white noise process, which is 1/sqrt(T). Since there are 16071 data points, the standard deviation is 1/sqrt(16071) = 0.789%. I use a 95% confidence interval, which implies + or - 1.96 standard deviations. I also choose to double this width (2 * 1.96 standard deviations), which is more suitable for volatile data like daily temperatures. The bounds are therefore 2 * 1.96 * 0.789% = 3.092% (up or down). We see that the sample autocorrelations of the first few lags are significantly greater than zero, which shows that the time series is not generated by a white noise process.
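The arithmetic for the white noise bounds can be reproduced with a few lines of Python. The function name `white_noise_band` is hypothetical, and the doubling of the 1.96 multiplier follows the choice described in the text rather than a standard convention.

```python
import math


def white_noise_band(n_obs, z=1.96, multiplier=2.0):
    """Return (standard deviation, band half-width) for sample autocorrelations.

    Under white noise the sample autocorrelations have a standard
    deviation of roughly 1/sqrt(T); the band is multiplier * z * sd.
    """
    sd = 1.0 / math.sqrt(n_obs)
    return sd, multiplier * z * sd


sd, band = white_noise_band(16071)
print(f"{sd:.3%}")    # 0.789%
print(f"{band:.3%}")  # 3.092%

# the first three autocorrelations quoted in the text all lie outside the band
print([abs(r) > band for r in (0.61, 0.33, 0.22)])  # [True, True, True]
```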
From the correlogram, we also see that the first three values, 0.61, 0.33 and 0.22, are large, which suggests that an AR(1), AR(2) or AR(3) model would fit the data well.
Model Specification
Regression Analysis
I then perform regression analysis using Excel’s regression add-in. I ran three regressions, for the AR(1), AR(2) and AR(3) models respectively (see columns Z to AH in the Excel file for the underlying x and y values used in the regressions). The results of the three regressions are summarized below:
AR(1)

Regression Statistics
Multiple R          0.613238661
R Square            0.376061656
Adjusted R Square   0.376022825
Standard Error      8.390556064
Observations        16070

ANOVA
             df      SS            MS         F          Significance F
Regression   1       681805.7952   681805.8   9684.545   0
Residual     16068   1131210.194   70.40143
Total        16069   1813015.989

             Coefficients   Standard Error   t Stat
Intercept    23.65620294    0.386863094      61.14877
x1           0.613240155    0.006231477      98.41008
AR(2)
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.615702161
R Square            0.379089151
Adjusted R Square   0.379011856
Standard Error      8.370600349
Observations        16069

ANOVA
             df      SS            MS         F          Significance F
Regression   2       687279.0176   343639.5   4904.445   0
Residual     16066   1125695.622   70.06695
Total        16068   1812974.64

             Coefficients   Standard Error   t Stat
Intercept    25.30656498    0.428518276      59.05598
x1           0.655996371    0.007870202      83.35191
x2           -0.0697402     0.007870495      -8.86097
AR(3)
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.617951703
R Square            0.381864307
Adjusted R Square   0.381748869
Standard Error      8.352364649
Observations        16068

ANOVA
             df      SS            MS         F         Significance F
Regression   3       692305.5828   230768.5   3307.94   0
Residual     16064   1120656.691   69.762
Total        16067   1812962.274

             Coefficients   Standard Error   t Stat
Intercept    23.62139042    0.471722223      50.07479
x1           0.660661578    0.007872278      83.92254
x2           -0.1133908     0.009399489      -12.0635
x3           0.066543024    0.007872735      8.452338
We see that AR(1) has a very high t-statistic for its x1 variable, and that both AR(2) and AR(3) have high t-statistics for x1 as well. Introducing x2 in AR(2), and x2 and x3 in AR(3), increases R^2, but not by a significant amount (from 0.3761 to 0.3791 and 0.3819 respectively). AR(1) has the highest F statistic overall. It is also important to note that AR(3) has an F statistic that is significantly lower than those of the other two models. I therefore conclude at this point that AR(3) is not the most appropriate model for this data, and further testing of the AR(3) model is no longer necessary.
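For readers without the workbook, the AR(1) regression can be sketched in plain Python as an ordinary least squares fit of each value on the previous one. This is a minimal sketch and not Excel's regression add-in: the function name `fit_ar1` and the toy series are hypothetical.

```python
def fit_ar1(series):
    """OLS fit of y_t = intercept + slope * y_{t-1}.

    A series of T points yields T - 1 observations, which is why the
    AR(1) output reports 16070 observations for 16071 data points.
    """
    x = series[:-1]  # lagged values y_{t-1}
    y = series[1:]   # current values y_t
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(e ** 2 for e in residuals)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return {
        "intercept": intercept,
        "slope": slope,
        "r_squared": 1.0 - ss_res / ss_tot,
        "residuals": residuals,
    }


# toy series generated exactly by y_t = 10 + 0.5 * y_{t-1}
toy = [0.0]
for _ in range(20):
    toy.append(10.0 + 0.5 * toy[-1])
fit = fit_ar1(toy)
print(round(fit["slope"], 6), round(fit["intercept"], 6))  # 0.5 10.0
```

The `residuals` list returned here is the kind of input that residual diagnostics such as the Durbin-Watson and Box-Pierce tests operate on.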
Durbin-Watson Test
For both the AR(1) and AR(2) models, I then test the residuals for serial correlation by performing the Durbin-Watson test. The null hypothesis is that there is no serial correlation in the residuals. A Durbin-Watson statistic of 2 indicates no correlation. A Durbin-Watson statistic substantially less than 2 indicates positive serial correlation.
For AR(1), the calculated Durbin-Watson statistic is 1.914 (cell G23 of the “AR(1)” tab in the Excel file), which is close to 2 and suggests no serial correlation. I also used Excel’s built-in correlation function, and the calculated correlation is 0.04276 (cell C16096 in the “AR(1)” tab in the Excel file). The autocorrelation calculated by formula is 0.04275 (cell C16097 in the “AR(1)” tab in the Excel file). Neither of these correlation measures is significantly different from zero.
For AR(2), the calculated Durbin-Watson statistic is 1.991 (cell G23 of the “AR(2)” tab in the Excel file), which is very close to 2 and suggests no serial correlation.
We can then conclude that the residuals are not serially correlated for either the AR(1) or the AR(2) model.
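The Durbin-Watson statistic can be sketched in a few lines of Python. The function names are hypothetical. The second helper uses the standard approximation DW ≈ 2(1 - r), where r is the lag-1 autocorrelation of the residuals; plugging in the 0.04276 quoted above for AR(1) gives about 1.914, in line with the reported statistic.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numer = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denom = sum(e ** 2 for e in residuals)
    return numer / denom


def dw_from_lag1(r):
    """Approximate Durbin-Watson statistic from the lag-1 autocorrelation."""
    return 2.0 * (1.0 - r)


print(durbin_watson([1, -1, 1, -1]))     # 3.0 (alternating signs: negative correlation)
print(round(dw_from_lag1(0.04276), 3))   # 1.914
```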
Box-Pierce Test
Next, I test whether the residuals resemble a white noise process by performing the Box-Pierce test. I chose to use K = 100. The study notes recommend using a K of 15 to 40, but since this data is much higher in volume than the illustrative worksheet, I decided to use K = 100.
For AR(1), the critical value at 90% for 100 - 1 = 99 degrees of freedom is 117.4 (cell L23 of the “AR(1)” tab in the Excel file). The calculated Q-statistic is 0.0216 (cell K23), which is much less than the critical value. We thus cannot reject the null hypothesis that the residuals resemble a white noise process.
For AR(2), the critical value at 90% for 100 - 2 = 98 degrees of freedom is 116.3 (cell L23 of the “AR(2)” tab in the Excel file). The calculated Q-statistic is 0.0296 (cell K23), which is much less than the critical value. We again cannot reject the null hypothesis that the residuals resemble a white noise process.
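As a point of reference, the textbook form of the Box-Pierce statistic is Q = T × (sum of the squared residual autocorrelations for lags 1 to K), compared against a chi-square critical value. The sketch below implements that textbook form; whether the workbook's cells follow exactly this definition is not stated in the text, and the function names are hypothetical. The critical value is passed in (for example 117.4 or 116.3 from above) rather than computed.

```python
def box_pierce(residuals, max_lag):
    """Box-Pierce Q: number of residuals times the sum of squared autocorrelations."""
    n = len(residuals)
    mean = sum(residuals) / n
    denom = sum((e - mean) ** 2 for e in residuals)
    total = 0.0
    for k in range(1, max_lag + 1):
        numer = sum(
            (residuals[t] - mean) * (residuals[t - k] - mean) for t in range(k, n)
        )
        total += (numer / denom) ** 2
    return n * total


def looks_like_white_noise(q_stat, critical_value):
    """True when Q is below the chi-square critical value (null not rejected)."""
    return q_stat < critical_value


# tiny hypothetical residual series, just to show the call
q = box_pierce([1, 2, 3, 4, 5], 1)
print(round(q, 3))  # 0.8
```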
Selection
Both the Durbin-Watson and Box-Pierce tests show that AR(1) and AR(2) are acceptable models. I choose AR(1), following the principle of parsimony.
Conclusion
The AR(1) model is selected. Specifically, it is:
Y_t = 0.613 Y_{t-1} + 23.656
Intuitively, this makes sense because we know that the temperature of a day depends on the temperature of the previous day.
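A one-step prediction with the selected model can be sketched as below. The coefficients are those of the fitted AR(1) model; the function names are hypothetical, and the sketch assumes the model is applied to the seasonally adjusted series, so a prediction has to be multiplied by the seasonal index of the target day to get back to a temperature.

```python
SLOPE = 0.613       # coefficient on the previous day's value
INTERCEPT = 23.656  # regression intercept


def predict_next(previous_adjusted):
    """One-step AR(1) prediction on the seasonally adjusted scale."""
    return INTERCEPT + SLOPE * previous_adjusted


def long_run_mean():
    """Level the AR(1) process reverts to: intercept / (1 - slope)."""
    return INTERCEPT / (1.0 - SLOPE)


def to_temperature(adjusted, seasonal_index):
    """Undo the seasonal adjustment (assumes division by the index was used)."""
    return adjusted * seasonal_index


print(round(predict_next(65.736), 3))  # 63.952
print(round(long_run_mean(), 2))       # 61.13
```

The implied long-run mean of about 61.13 is close to the overall dataset average of 61.172 used for the seasonal index, which is a reassuring consistency check on the fitted coefficients.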
Finally, I plotted a graph for the year 1991 to visually inspect the actual values against the values predicted by the model.
Figure 6
As shown in Figure 6, the selected model predicts the temperature fairly well. The AR(1) model is simple, yet intuitively it is the model that makes the most sense, at least for this part of the United States. Overall, the selected model appears satisfactory.