Time Series Forecasting in Python and R

A friend recently made a prediction about the price of oil for the next three months. I thought I would perform some time series forecasting on the West Texas Intermediate prices and see if his numbers were reasonable from a dumb-numbers canned-forecasting perspective. I’m not making the claim that one can reasonably and accurately forecast oil prices with traditional time series techniques. (That’s bogus.) I’m simply doing this to learn more about forecasting.

Monthly petroleum prices can be found at the Energy Information Administration. Ever relevant, Wikipedia has a great write-up on recent trends in oil prices. Also, there is this Times article on the spike and drop in 2008 which had this apt summary,

[Oil prices are] the product of an extremely volatile mixture of speculation, oil production, weather, government policies, the global economy, the number of miles the average American is driving in any given week and so on. But the daily price is actually set — or discovered, in economic parlance — on the futures exchange.

Data

If you go to the EIA site, you can download some monthly and weekly prices in an Excel spreadsheet. The WTI and Brent oil prices are on the second sheet of the Excel file. We can use Python to extract this data from Excel and put it into a Pandas DataFrame for easier processing.

import pandas, seaborn

# create an Excel file object
excel = pandas.ExcelFile( '/Users/connorjohnson/Downloads/PET_PRI_SPT_S1_W.xls' )

# parse the first sheet
df = excel.parse( excel.sheet_names[1] )

# rename the columns
df = df.rename( columns=dict( zip( df.columns, ['Date','WTI','Brent'] ) ) )

# cut off the first 18 rows because these rows
# contain NaN values for the Brent prices
df = df[18:]

# index the data set by date
df.index = df['Date']

# remove the date column after re-indexing
df = df[['WTI','Brent']]

Visualization

We can then visualize the data with seaborn. So, the takeaway form the big picture here is that oil prices have gone totally bonkers over the last ten years.

df[['WTI','Brent']][::4].plot()
title('Crude Oil Prices')
xlabel('Year')
ylabel('Price [USD]')
savefig("wti_and_brent_all_data.png",dpi=200)

If we focus on the last three years, we get the following picture. The interesting thing about this perspective, apart from the recent marked drop, is that the WTI and Brent prices seem to start getting closer in value over time.

We can drop the last three years of weekly data, the smoothed data, and the de-trended data with the following line,

df[-36*4:].to_csv('prices_last_three_years.csv')

Seasonality

Rob Hyndman is the creator of the R forecast package, and you should really check out his blog. I found this post on R-bloggers where Hyndman proposed an interesting technique for detecting seasonality. Essentially, you fit one model with a seasonal component, and another model without a seasonal component, and then you compare their log-likelihood and their degrees of freedom using a chi-squared distribution.

For this data, I got errors attempting the decompose() and seasonal HoltWinters() functions, suggesting that there is not a strong seasonal component.

data <- read.csv('prices_last_three_years.csv')
decompose( data$WTI )
Error in decompose(data$WTI) : time series has no or less than 2 periods
HoltWinters( data$WTI )
Error in decompose(ts(x[1L:wind], start = start(x), frequency = f), seasonal) : 
  time series has no or less than 2 periods

Holt-Winters Forecasting

I used this resource for learning about Holt-Winters forecasting. We can fit a model and make predictions using the Holt-Winters method with the following lines of code:

# load Hyndman's forecast package
library(forecast)

# create a model with a trend
WTI.hwm <- HoltWinters( data$WTI, gamma=FALSE )

# predict the values for the next 4 weeks #
WTI.hwf <- forecast.HoltWinters( WTI.hwm, h=4 )

# print a summary of the forecast
summary( WTI.hwf )
Forecast method: HoltWinters

Model Information:
Holt-Winters exponential smoothing with trend and without seasonal component.

Call:
HoltWinters(x = data$WTI, gamma = FALSE)

Smoothing parameters:
 alpha: 1
 beta : 0.3259544
 gamma: FALSE

Coefficients:
       [,1]
a 76.500000
b -2.011559

Error measures:
                     ME     RMSE      MAE        MPE     MAPE     MASE     ACF1
Training set -0.1612072 2.271556 1.785524 -0.1085958 1.855251 1.068586 0.104531

Forecasts:
    Pt Forecast Lo 80 Hi 80 Lo 95 Hi 95
145       74.48 71.57 77.40 70.03 78.94
146       72.47 67.63 77.31 65.07 79.87
147       70.46 63.63 77.29 60.02 80.90
148       68.45 59.51 77.38 54.79 82.11

For reference, I’ll replace the forecasts with dates. You’ll note that today is 11/23/2014, and that the first predicted data point is for 11/21/2014, but there’s a lag between production and reporting, so the 11/21/2014 numbers won’t be published for a few days yet.

        Pt Forecast Lo 80 Hi 80 Lo 95 Hi 95
11/21/2014    74.48 71.57 77.40 70.03 78.94
11/28/2014    72.47 67.63 77.31 65.07 79.87
12/05/2014    70.46 63.63 77.29 60.02 80.90
12/12/2014    68.45 59.51 77.38 54.79 82.11

We can then plot the forecast as follows:

plot.forecast( WTI.hwf )

ARIMA Forecasting

I prefer the outlook of the Holt-Winters forecast, but I decided to do an ARIMA(1,0,0) and ARIMA(2,0,0) forecast also, for completeness. I’m still trying to understand how to interpret the ACF and PACF plots of time series. The canonical data sets make a ton of sense, but things get murkier with more typical data sets. You end up scratching your head and saying, “Hell, let’s just try an AR(1) process.” (This is where the forecast::auto.arima() function comes in handy.)

WTI.arima <- arima( data$WTI, order=c(1,0,0) )
WTI.arimaf <- forecast.Arima( WTI.arima, h=4 )
summary( WTI.arimaf )
Forecast method: ARIMA(1,0,0) with non-zero mean

Model Information:
Series: data$WTI 
ARIMA(1,0,0) with non-zero mean 

Coefficients:
         ar1  intercept
      0.9700    94.0842
s.e.  0.0218     5.1415

sigma^2 estimated as 4.727:  log likelihood=-317.58
AIC=641.16   AICc=641.34   BIC=650.07

Error measures:
                      ME     RMSE      MAE        MPE     MAPE     MASE     ACF1
Training set -0.09167768 2.174233 1.676698 -0.1544393 1.771276 0.997412 0.298963

Forecasts:
      Pt Forecast Lo 80 Hi 80 Lo 95 Hi 95
11/21       77.02 74.24 79.81 72.76 81.28
11/28       77.54 73.65 81.42 71.60 83.47
12/05       78.03 73.35 82.72 70.87 85.20
12/12       78.52 73.19 83.84 70.36 86.67

And for the AR(2) process, we have the following:

Forecast method: ARIMA(2,0,0) with non-zero mean

Model Information:
Series: data$WTI 
ARIMA(2,0,0) with non-zero mean 

Coefficients:
         ar1      ar2  intercept
      1.2725  -0.3220    94.9581
s.e.  0.0806   0.0829     3.2659

sigma^2 estimated as 4.278:  log likelihood=-310.4
AIC=628.79   AICc=629.08   BIC=640.67

Error measures:
                      ME     RMSE      MAE         MPE     MAPE      MASE
Training set -0.03991496 2.068382 1.599582 -0.09327853 1.687477 0.9515379
                    ACF1
Training set -0.01486163

Forecasts:
      Pt Forecast Lo 80 Hi 80 Lo 95 Hi 95
11/21       76.85 74.20 79.50 72.79 80.90
11/28       77.86 73.57 82.15 71.30 84.42
12/05       79.03 73.53 84.53 70.62 87.44
12/12       80.20 73.79 86.60 70.40 89.99

Conclusions

I will revisit this post in two weeks with some more details on forecasting, and an update on these forecasts. For future reference, the ets() and auto.arima() functions from the forecast package are a great starting point for forecasting, they will automatically fit a model and return model performance metrics.

One thought on “Time Series Forecasting in Python and R”

Comments are closed.