Publication date: 27 March 2020
University: Universiteit Maastricht
ISBN: 978-94-6380-745-6

High-Dimensional Time Series Analysis

Summary

We are currently in a new era of data analysis, characterized by the availability of large, unstructured datasets. These include data collected by major tech companies like Google and Facebook, as well as data gathered via supermarket loyalty cards and debit card transactions. Since traditional statistical models typically perform best when accounting for the effects of only a few variables, many new statistical methods better suited to large datasets have been developed in recent years; collectively, these methods are known as high-dimensional statistics. Within the economic and financial sectors, however, one primarily works with time series, such as Dutch unemployment figures or gross domestic product. Time series often exhibit distinctive properties, such as trending behavior in which future values depend strongly on the past, and these properties are known to strongly affect the outcomes of traditional statistical methods. It is therefore unwise to apply high-dimensional statistics to large collections of time series without theoretical verification or practical adaptations. This topic is central to my dissertation.

In this dissertation, we focus on statistical methods that fall into three general categories: (1) factor models, (2) regularized regression, and (3) hybrid methods. The idea behind factor models is that all observed variables are driven by a small number of latent (unobserved) variables. For example, we can observe unemployment in different industries or interest rates at different maturities, but all of these variables may be (partially) explained by the underlying business cycle. Factor models attempt to estimate these latent variables, the factors, and thereby summarize the data with minimal loss of information. In this way, a complex model with hundreds of observed variables does not need to be estimated. An alternative approach is not to summarize the data, but to assume that many variables are simply irrelevant for explaining the dependent variable of interest. For example, it is plausible that tea prices affect coffee sales, but that ketchup prices explain very little. For these types of applications, regularized regression is highly suitable. This form of regression estimates a linear model and automatically shrinks the estimated contributions of irrelevant variables toward zero. Some forms of regularized regression, such as the Least Absolute Shrinkage and Selection Operator (LASSO), which plays an important role in this dissertation, have the desirable property that they can remove irrelevant variables from the estimated model fully automatically. Finally, hybrid methods combine both ideas: they remove irrelevant variables and summarize the remaining relevant ones through factor estimation.
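To make the first two ideas concrete, the sketch below simulates a small dataset and applies both techniques using generic scikit-learn tools; the data, dimensions, and penalty level are purely illustrative and not taken from the dissertation.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)

    # Simulate 200 observations of 100 variables driven by 2 latent factors.
    T, N, k = 200, 100, 2
    factors = rng.standard_normal((T, k))                  # latent factors
    loadings = rng.standard_normal((k, N))                 # factor loadings
    X = factors @ loadings + rng.standard_normal((T, N))   # observed data plus noise

    # (1) Factor model: summarize the 100 variables with 2 principal components.
    estimated_factors = PCA(n_components=k).fit_transform(X)

    # (2) Regularized regression: y depends on only 3 of the 100 variables;
    # the LASSO penalty shrinks the irrelevant coefficients exactly to zero.
    beta = np.zeros(N)
    beta[:3] = [1.0, -0.5, 0.8]
    y = X @ beta + rng.standard_normal(T)
    lasso = Lasso(alpha=0.1).fit(X, y)
    print("variables kept by LASSO:", np.flatnonzero(lasso.coef_))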

In Chapter 2, we compare the forecasting performance of statistical methods from the three categories above. Through controlled simulations, in which we deliberately fix certain properties of the data, we find that factor models and regularized regression perform well in their intended settings. However, we also find that regularized regression can forecast better when the factors in the data are obscured by a large amount of noise. In an empirical application, we find that regularized regression forecasts some American economic indicators more accurately than factor models, despite the likely presence of factors in a macroeconomic context.

Motivated by the favorable performance of regularized regression, in Chapter 3 we develop the Single-equation Penalized Error-Correction Selector (SPECS). SPECS is a specialized method for estimating regularized linear models that accounts for the trend behavior of the variables under consideration. In economic applications, it frequently happens that individual variables contain a stochastic (random) trend, yet this trend disappears after taking a particular linear combination of them. This well-known phenomenon is called cointegration and has a significant impact on the behavior of statistical methods. We derive theoretical (asymptotic) results showing that our method behaves desirably as the sample size grows. To demonstrate the applicability of SPECS, we use our new method to forecast unemployment in the Netherlands from the popularity of 100 different Google search terms, such as "unemployment benefit" and "applying for a job." As expected, SPECS outperforms high-dimensional methods that ignore cointegration.
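The stylized sketch below illustrates the single-equation error-correction setup that SPECS builds on, using simulated data and scikit-learn's generic LASSO as a stand-in for the dissertation's dedicated estimator; it is a rough illustration of the idea, not the actual SPECS procedure.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(1)

    # Two series sharing one stochastic trend (so they cointegrate),
    # plus a trending but irrelevant third series.
    T = 250
    trend = np.cumsum(rng.standard_normal(T))
    x1 = trend + rng.standard_normal(T)
    x2 = np.cumsum(rng.standard_normal(T))     # independent random walk
    y = trend + rng.standard_normal(T)         # y - x1 is stationary

    # Single-equation error-correction model: regress the change in y on the
    # lagged levels (the potential cointegrating relationship) and the
    # contemporaneous changes in the other variables.
    dy = np.diff(y)                                        # Δy_t
    dX = np.diff(np.column_stack([x1, x2]), axis=0)        # Δx_t
    lagged_levels = np.column_stack([y, x1, x2])[:-1]      # (y, x1, x2) at t-1

    # An L1 penalty on all coefficients mimics the automatic selection in SPECS.
    Z = np.column_stack([lagged_levels, dX])
    fit = Lasso(alpha=0.1).fit(Z, dy)
    print("estimated coefficients:", np.round(fit.coef_, 2))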

In Chapter 4, we derive similar theoretical results under less restrictive assumptions. For example, we allow the number of variables in the model to increase as the sample size increases. This is important for providing clear insight into the behavior of SPECS when applied to datasets with a large number of variables.

Finally, in Chapter 5, we compare (1) statistical tests that classify the trend behavior of time series and (2) a selection of high-dimensional forecasting methods that do or do not account for cointegration. Simulations show that it is extremely important to classify the trend in the dependent variable correctly, as the accuracy of the forecast depends heavily on this classification. In a macroeconomic application to an American dataset, we find that no single model is consistently the most accurate, and there is no definitive answer as to whether cointegration matters for forecasting. Given that there are cases in which SPECS outperforms the other methods in the comparison, we conclude that our method has both theoretical and applied value. However, the choice of the optimal method will always depend on the specific application.
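As an illustration of what such a classification test does, the snippet below applies the augmented Dickey-Fuller test, a standard unit-root test available in statsmodels, to two simulated series; this is one common example of this family of tests, not necessarily the specific tests compared in Chapter 5.

    import numpy as np
    from statsmodels.tsa.stattools import adfuller

    rng = np.random.default_rng(2)
    random_walk = np.cumsum(rng.standard_normal(300))   # contains a stochastic trend
    stationary = rng.standard_normal(300)               # no trend

    for name, series in [("random walk", random_walk), ("stationary", stationary)]:
        stat, pvalue = adfuller(series)[:2]
        # A small p-value rejects the unit root, classifying the series as trend-free.
        print(f"{name}: ADF p-value = {pvalue:.3f}")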
