Publication date: 20 September 2022
University: TU Eindhoven
ISBN: 978-90-386-5535-2

A few statistical contributions for the analysis of high-dimensional data

Summary

Chapter 9

9.1 A few statistical contributions for the analysis of high-dimensional data

Many scientific disciplines and industrial settings have witnessed an explosion of high-dimensional datasets, where a large number of measurements (P) is performed on a comparatively small number of experimental units (N). The rapid growth of high-dimensional data has raised questions about the statistical validity of the analytical techniques. With the belief that theory has implications for the use of methods in practice, this thesis has interwoven statistical theory and methodology with real high-dimensional applications. In particular, we have presented two challenges related to what Richard Bellman first defined as the "curse of dimensionality" [364], namely data sparsity and data heterogeneity, which were defined in Chapter 1 and illustrated throughout this thesis with a few practical examples. We have discussed how the various research goals (i.e., classification, outlier detection, clustering, design of experiments, missing data imputation) presented in this thesis may be affected by these statistical challenges. We have often juxtaposed a systematic review of the most popular analytical techniques with the proposal of a few novel approaches to analyse high-dimensional data. In this chapter, the main research findings are summarized and a few proposals for future research are discussed. Given the emergence of increasingly large datasets in science, we expect that the methodologies and arguments proposed in this thesis will be widely applicable and useful in various applied fields.

9.2 Summary of the results

This work mainly focused on the exploration of two very popular analytical approaches for high-dimensional data, organized in Part 1 and Part 2 of this thesis, respectively. The two methodologies certainly rest on different modelling frameworks, but they increasingly share common areas of application and research tasks. For instance, in metabolomics and transcriptomics, the covariance-based approaches discussed in Part 1 are the most popular analytical techniques and are often considered the gold standard, but the linear mixed model (LMM) strategies discussed in Part 2 are also becoming extremely popular for evaluating environmental differences and variation in genetic data [365, 366].

In Part 1 we explored Principal Component Analysis (PCA), a traditional analytical tool in data science for feature reduction (also known as feature extraction) that decomposes the covariance matrix of the original data into eigenvalues and eigenvectors, identifying the directions of maximal variance in the data. Notably, this tool has served as the groundwork for many research tasks, such as one-class modelling, outlier detection, clustering and many others. Similarly, the Mahalanobis distance (MHD) has been used in practice as a distance measure for multivariate data, exploiting the (inverse of the) covariance matrix, for many research purposes. Both procedures are based on the sample estimator of the covariance matrix, which may break down in high-dimensional data.
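As a minimal illustration of the two covariance-based tools described above, the following NumPy sketch computes PCA via the eigendecomposition of a sample covariance matrix and the classical Mahalanobis distance from its inverse, and then shows why both break down when P exceeds N: the sample covariance becomes rank-deficient, so its inverse no longer exists. The data and dimensions are illustrative and not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Well-conditioned case: N = 100 samples, P = 5 variables.
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)                # column-centred data
S = np.cov(Xc, rowvar=False)           # P x P sample covariance matrix

# PCA: eigendecomposition of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]      # reorder by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
scores = Xc @ eigvecs                  # projections onto the principal components

# Mahalanobis distance of each sample, via the inverse covariance.
S_inv = np.linalg.inv(S)
mhd = np.sqrt(np.einsum("ij,jk,ik->i", Xc, S_inv, Xc))

# High-dimensional case: with P > N the sample covariance is singular
# (rank at most N - 1), so the classical MHD is no longer defined.
X_hd = rng.normal(size=(20, 50))       # N = 20 < P = 50
S_hd = np.cov(X_hd, rowvar=False)
rank = np.linalg.matrix_rank(S_hd)     # < P: no inverse exists
```

In the high-dimensional setting, practitioners therefore replace the sample estimator with regularized or robust alternatives, or work in the reduced PCA score space, which is precisely the territory explored in Part 1.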

In Chapter 2, we discussed the issues related to (SIMCA) one-class classification in the presence of high-dimensional transcriptomics data. On the one hand, we showed that the high dimensionality of transcriptomics data generates substantial sample heterogeneity. On the other hand, transcriptomics data present several distributional challenges, i.e., zero-inflation (sparsity) and an overdispersed count data distribution. In the chemometric field, SIMCA has recently been proposed for the analysis of high-dimensional transcriptomics data. The evaluation of SIMCA in these settings revealed sub-optimal performance, with poor control of the power to identify outliers (in the presence of a heterogeneous set of
