12.9 Dealing with missing values and outliers

Real data often contains missing values, outlying observations, and other messy features. Dealing with them can sometimes be troublesome.

Missing values

Missing data can arise for many reasons, and it is worth considering whether the missingness will induce bias in the forecasting model. For example, suppose we are studying sales data for a store, and missing values occur on public holidays when the store is closed. The following day may have increased sales as a result. If we fail to allow for this in our forecasting model, we will most likely under-estimate sales on the first day after the public holiday, but over-estimate sales on the days after that. One way to deal with this kind of situation is to use a dynamic regression model, with dummy variables indicating if the day is a public holiday or the day after a public holiday. No automated method can handle such effects as they depend on the specific forecasting context.
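The dummy-variable idea can be sketched on simulated data. Everything below is hypothetical: the holiday positions, the size of the post-holiday uplift, and the AR(1) error structure are assumptions made for illustration. Base R's arima() is used here as one way to fit a regression with ARMA errors. Because the closed days are left as NA, only the day-after dummy can be estimated in this sketch.

```r
# Hypothetical daily sales: the store is closed (sales recorded as NA) on
# assumed public holidays, and sales jump on the following day.
set.seed(1)
n <- 120
holiday <- rep(0, n)
holiday[c(20, 45, 70, 95, 120)] <- 1
day_after <- c(0, head(holiday, -1))      # 1 on the day after a holiday
sales <- 100 + 30 * day_after + rnorm(n, sd = 5)
sales[holiday == 1] <- NA

# Regression with AR(1) errors; stats::arima() accepts NA in the response.
xreg <- cbind(day_after = day_after)
fit <- arima(sales, order = c(1, 0, 0), xreg = xreg, method = "ML")
coef(fit)   # the day_after coefficient estimates the post-holiday uplift

# Day 120 was a holiday, so day 121 is a "day after" and day 122 is not.
fc <- predict(fit, n.ahead = 2, newxreg = cbind(day_after = c(1, 0)))
fc$pred
```

The forecast for day 121 includes the estimated uplift, while the forecast for day 122 does not, which is the behaviour a model without the dummy variable would miss.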

In other situations, the missingness may be essentially random. For example, someone may have forgotten to record the sales figures, or the data recording device may have malfunctioned. If the timing of the missing data is not informative for the forecasting problem, then the missing values can be handled more easily.

Some methods allow for missing values without any problem. For example, the naive forecasting method continues to work, with the most recent non-missing value providing the forecast for the future time periods. Similarly, the other benchmark methods introduced in Section 3.1 will all produce forecasts when there are missing values present in the historical data. The R functions for ARIMA models, dynamic regression models and NNAR models will also work correctly without causing errors. However, other modelling functions, including ets(), stlf(), and tbats(), do not handle missing values.
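The naive idea described above can be illustrated in a few lines of base R. This is a sketch on a made-up vector showing the principle only; it is not the code used inside the forecasting functions.

```r
# Hypothetical series with missing values, including the final observation.
y <- c(101, 104, NA, 107, NA)

# The naive forecast repeats the most recent non-missing value.
last_obs <- tail(y[!is.na(y)], 1)
naive_fc <- rep(last_obs, 3)   # forecasts for the next three periods
naive_fc
```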

When missing values cause errors, there are at least two ways to handle the problem. First, we could just take the section of data after the last missing value, assuming there is a long enough series of observations to produce meaningful forecasts. Alternatively, we could replace the missing values with estimates. The na.interp() function is designed for this purpose.
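The first option, keeping only the data after the last missing value, might look like the following sketch. The series y is made up for illustration, and the code assumes the final observation is not missing.

```r
# Hypothetical series with two missing values.
y <- ts(c(5, 7, NA, 6, 8, NA, 9, 10, 11, 12), start = 1)

# Position of the first observation after the last missing value
# (1 if nothing is missing).
na_pos <- which(is.na(y))
start_idx <- if (length(na_pos) > 0) max(na_pos) + 1 else 1

# Keep only the complete stretch at the end of the series.
y_tail <- window(y, start = time(y)[start_idx])
y_tail
```

Whether the remaining stretch is long enough to produce meaningful forecasts still has to be judged case by case.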

The gold data contains daily morning gold prices from 1 January 1985 to 31 March 1989. This series was provided to us as part of a consulting project; it contains 34 missing values as well as one apparently incorrect value. We can estimate the missing observations like this.

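library(forecast)  # provides gold, na.interp(), autoplot() and autolayer() for ts objects
library(ggplot2)   # provides scale_color_manual()
gold2 <- na.interp(gold)  # fill the missing values by interpolation
autoplot(gold2, series="Interpolated") +
  autolayer(gold, series="Original") +
  scale_color_manual(values=c(`Interpolated`="red",`Original`="gray"))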

Figure 12.1: Daily morning gold prices for 1108 consecutive trading days beginning on 1 January 1985 and ending on 31 March 1989.

For non-seasonal data like this, simple linear interpolation is used to fill in the missing sections. For seasonal data, an STL decomposition is used to estimate the seasonal component, and the seasonally adjusted series is linearly interpolated. More sophisticated missing value interpolation is provided in the imputeTS package.
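To make the non-seasonal case concrete, here is a sketch of linear interpolation on a made-up vector using base R's approx(). It illustrates the idea only and is not claimed to be the internal code of na.interp().

```r
# Hypothetical non-seasonal series with a gap of two missing values.
y <- c(10, 12, NA, NA, 18, 20)
idx <- seq_along(y)
obs <- !is.na(y)

# Draw a straight line between the observations on either side of the gap.
filled <- y
filled[!obs] <- approx(idx[obs], y[obs], xout = idx[!obs])$y
filled
#> [1] 10 12 14 16 18 20
```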

Outliers

Outliers are observations that are very different from the majority of the observations in the time series. They may be errors, or they may simply be very unusual. None of the methods we have considered in this book will work well if there are extreme outliers in the data. In this case, we may wish to replace them with missing values, or with an estimate that is more consistent with the majority of the data.

Simply replacing outliers without thinking about why they have occurred is a dangerous practice. They may provide useful information about the process that produced the data, and that information should be taken into account when forecasting.

However, if we are willing to assume that the outliers are genuinely errors, or that they won’t occur in the forecasting period, then replacing them can make the forecasting task easier.

The tsoutliers() function is designed to identify outliers, and to suggest potential replacement values. In the gold data shown in Figure 12.1, there is an apparent outlier on day 770:

tsoutliers(gold)
#> $index
#> [1] 770
#> 
#> $replacements
#> [1] 495

Closer inspection reveals that the neighbouring observations are very close to $100 less than the apparent outlier.

gold[768:772]
#> [1] 495.00 502.75 593.70 487.05 487.75

Most likely, this was a transcription error, and the correct value should have been $493.70.
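The arithmetic behind this guess can be checked directly. The sketch below retypes the five values printed above, so it does not depend on the gold data being loaded; the variable names are illustrative.

```r
# The values of gold[768:772] as printed above.
x <- c(495.00, 502.75, 593.70, 487.05, 487.75)

# Subtract 100 from the suspect middle value.
corrected <- x
corrected[3] <- x[3] - 100
corrected[3]

# Day-to-day changes before and after the correction.
diff(x)
diff(corrected)
```

After the correction, every day-to-day change in this window is less than $10 in absolute value, whereas the uncorrected series jumps by more than $90 in each direction around day 770.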

Another useful function is tsclean(), which identifies and replaces outliers, and also replaces missing values. Obviously this should be used with some caution, but it does allow us to use forecasting models that are sensitive to outliers, or which do not handle missing values. For example, we could use the ets() function on the gold series, after applying tsclean().

gold %>% 
  tsclean() %>%
  ets() %>%
  forecast(h=50) %>%
  autoplot()

Notice that the outlier and missing values have been replaced with estimates.
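The replacement can be confirmed with a quick check. This sketch assumes the forecast package, which supplies both the gold data and tsclean(), is installed; gold_clean is just an illustrative name.

```r
library(forecast)

gold_clean <- tsclean(gold)

sum(is.na(gold))        # number of missing values in the original series
sum(is.na(gold_clean))  # should be zero after cleaning
gold[770]               # the apparent outlier
gold_clean[770]         # its replacement
```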