Summary and future perspectives
9.1 A few statistical contributions for the analysis of high-dimensional data
Many scientific disciplines and industrial settings have witnessed an explosion of high-dimensional datasets, where large numbers of measurements (P) are performed on a comparatively small number of experimental units (N). The rapid emergence of high-dimensional data has raised questions about the statistical validity of the analytical technologies. In the belief that theory has implications for the use of methods in practice, this thesis has interwoven statistical theory and methodology with real high-dimensional applications. In particular, we have presented 2 challenges related to what Richard Bellman first defined as the "curse of dimensionality" [364], namely data sparsity and data heterogeneity, which were defined in Chapter 1 and illustrated throughout this thesis with a few practical examples. We have discussed how the various research goals (i.e., classification, outlier detection, clustering, design of experiments, missing data imputation) presented in this thesis may be affected by these statistical challenges. We have often juxtaposed a systematic review of the most popular analytical techniques with the proposal of a few novel approaches for analysing high-dimensional data. In this chapter, the main research findings are summarized and a few proposals for future research are discussed. Given the emergence of increasingly large datasets in science, we expect that the methodologies and arguments proposed in this thesis will be widely applicable and useful in various applied fields.
9.2 Summary of the results
Our work mainly focused on the exploration of 2 very popular analytical approaches to high-dimensional data, organized in Part 1 and Part 2 of this thesis, respectively. The 2 methodologies certainly rest on different modelling frameworks, but they increasingly share common areas of application and research tasks. For instance, in metabolomics and transcriptomics, the covariance-based approaches discussed in Part 1 are the most popular analytical techniques and are often considered the gold standard, but the LMM strategies discussed in Part 2 are becoming extremely popular as well for evaluating environmental differences and variation in genetic data [365, 366].
In Part 1 we explored Principal Component Analysis (PCA), a traditional tool in data science for feature reduction (also known as feature extraction) that decomposes the covariance matrix of the original data into eigenvalues and eigenvectors, identifying the directions of maximal variance in the data. Notably, this tool has served as a foundation for many research tasks, such as one-class modelling, outlier detection, clustering and many others. Similarly, the Mahalanobis distance (MHD), which exploits the (inverse of the) covariance matrix, has been used in practice as a distance measure for multivariate data for many research purposes. Both procedures are based on the sample estimator of the covariance matrix, which may break down in high-dimensional data.
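Both building blocks can be written down in a few lines. The following is a minimal numpy sketch, using hypothetical toy data with N > P so that the sample covariance matrix is invertible, of PCA via the eigendecomposition of the sample covariance matrix and of the MHD based on its inverse:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 10))       # N = 50 samples, P = 10 variables
    Xc = X - X.mean(axis=0)             # centre the data
    S = Xc.T @ Xc / (X.shape[0] - 1)    # sample covariance matrix

    # PCA: eigenvectors of S are the loadings, eigenvalues are the
    # variances explained along each principal direction
    eigvals, loadings = np.linalg.eigh(S)
    eigvals, loadings = eigvals[::-1], loadings[:, ::-1]  # largest first
    scores = Xc @ loadings              # PC scores of the samples

    # Mahalanobis distance of each sample to the centre, via the inverse of S
    mhd = np.sqrt(np.einsum('ij,jk,ik->i', Xc, np.linalg.inv(S), Xc))

When N <= P, S is singular: np.linalg.inv(S) fails and the smallest eigenvalues collapse to 0, which is precisely the breakdown referred to above.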
In Chapter 2, we have discussed the issues related to (SIMCA) one-class classification in the presence of high-dimensional transcriptomics data. On the one hand, we have shown that the high dimensionality of transcriptomics data generates substantial sample heterogeneity. On the other hand, transcriptomics data present several distributional challenges, i.e., zero-inflation (sparsity) and overdispersed count distributions. In the chemometric field, SIMCA has recently been proposed for the analysis of high-dimensional transcriptomics data. Our evaluation of SIMCA in these settings revealed a sub-optimal performance, with poor control of the power to identify outliers (in the presence of a heterogeneous set of reference observations). Our analyses confirmed previous findings in the literature about the controversial reliability of PCA in high-dimensional settings [367, 368, 80].
In Chapter 3, we have reviewed the most popular distance measures (based on PCA or on the MHD) that serve as a basis for the construction of rejection areas for outlier detection (but also for one-class classification, as in Chapter 2). The statistical validity of these tools was established in the literature for experimental settings with many observations (N >> P). Indeed, our simulations demonstrated that the type I error is reasonably well controlled under these settings, whereas we have shown that all methods perform poorly in high-dimensional settings, especially when N is small relative to P. These results were also confirmed by 3 case studies used to illustrate the methodologies in practice. Interestingly, the evaluations conducted in this chapter might partially explain the relatively poor performance of the SIMCA methods in Chapter 2, and they further suggest that, next to PCA, care should also be taken in the choice of the critical limits, especially in high-dimensional data.
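One classical example of such a rejection area is sketched below under the multivariate normality assumption, with hypothetical toy data in the comfortable N >> P regime: the squared MHD is compared against a chi-squared critical limit, and the observed type I error stays close to its nominal level. It is exactly this kind of control that our simulations showed to degrade when N shrinks relative to P.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    N, P, alpha = 200, 10, 0.05            # many observations: N >> P
    X = rng.multivariate_normal(np.zeros(P), np.eye(P), size=N)
    Xc = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum('ij,jk,ik->i', Xc, S_inv, Xc)   # squared MHD per sample

    # Classical rejection area: under multivariate normality the squared
    # MHD approximately follows a chi-squared distribution with P dof
    crit = stats.chi2.ppf(1 - alpha, df=P)
    print((d2 > crit).mean())              # observed type I error, near alpha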
In Chapter 4, we have proposed a novel distance measure for one-class modelling and outlier detection based on eigenvalue shrinkage in high-dimensional data. Our approach compared favourably to standard SIMCA one-class classification and to the PCA-based outlier detection tools, and it proved to be a valuable tool to cope with the misspecification of the covariance matrix in high-dimensional data and the overestimation of the type I error of outlier detection tools, as described in Chapters 2 and 3.
In Part 2, we have explored the potential of statistical modelling for different types of high-dimensional experiments and research tasks. Contrary to the more flexible covariance-based approaches discussed in Part 1, the applicability of the statistical tools discussed in Part 2 is strictly tied to the corresponding model assumptions, yet under those assumptions they ensure optimal performance.
In particular, in Chapter 5 we have proposed a multivariate model for high-dimensional and overdispersed count data. Our model improves upon the current literature in 2 main respects: the parsimonious number of parameters that characterize the multivariate distribution (i.e., 2P + 1) and the (computationally) fast and direct estimation method. We have shown with a real-world example from a sparse and high-dimensional cross-sectional RNA-Seq experiment that the proposed model can also be used in practice for clustering discrete variables (i.e., gene transcripts).
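The exact parametrization is given in Chapter 5; purely as an illustrative sketch of how a budget of 2P + 1 parameters can arise, consider a moment structure in which each of the P count variables carries its own mean and dispersion, while a single parameter governs the association between variables:

    \mathbb{E}[X_j] = \mu_j, \qquad
    \operatorname{Var}(X_j) = \mu_j + \phi_j\,\mu_j^2, \qquad
    \operatorname{Corr}(X_j, X_k) = \rho \quad (j \neq k),

for j, k = 1, ..., P: this amounts to P means, P dispersions and 1 association parameter. The model of Chapter 5 may parametrize the association differently, but the same counting argument applies.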
In Chapter 6, we have discussed an extension of the standard Linear Mixed Model (LMM) that addresses subject-to-subject heterogeneity (the tLMM), investigating the different model formulations and model identifiability. The tLMM addresses both variability in location between subjects and heterogeneity in the within-subject variability. It thus provides a much heavier distributional tail, accommodating observations that would appear as outliers if only normality were assumed for the longitudinal data. We have demonstrated the effectiveness of the tLMM (over the LMM) in a longitudinal sleep monitoring trial.
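For concreteness, a minimal sketch of one common t-based formulation (the precise formulation and the identifiability constraints are developed in Chapter 6): the Gaussian assumptions of the standard LMM for subject i are replaced by multivariate t-distributions, whose degrees of freedom \nu govern the tail thickness,

    \mathbf{y}_i = X_i \boldsymbol{\beta} + Z_i \mathbf{b}_i + \boldsymbol{\varepsilon}_i,
    \qquad \mathbf{b}_i \sim t_{\nu}(\mathbf{0}, D),
    \qquad \boldsymbol{\varepsilon}_i \sim t_{\nu}(\mathbf{0}, \sigma^2 I_{n_i}).

Equivalently, such a model can be written as a normal LMM whose subject-specific covariance is rescaled by a latent Gamma-distributed factor, which makes both the heavier tails and the robustness to outlying subjects explicit.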
Chapter 7 presented an efficient modelling of the day-to-day variability (i.e., heterogeneity) in an experimental Box-Behnken Design (BBD). The BBD explored a high-dimensional design matrix, with LMMs analysing a set of features (output variables) extracted from a complex chemical reaction, i.e., the catalytic depolymerization of lignin.
Finally, in Chapter 8 we have provided an overview of the state of the art in multiple imputation of missing data (MD) for high-dimensional experiments, with a comparison of the most recent covariance-based (i.e., PCA), model-based (i.e., sequential and penalized regressions) and machine learning-based (i.e., regression trees) imputation techniques. We have shown, using high-dimensional multivariate longitudinal data from health monitoring, how MD affect the significance of the estimated LMM parameters. Our systematic evaluations demonstrated that the sequential regression approaches failed badly, whereas the machine learning approaches are seemingly superior in terms of full data reconstruction, though this comes at the expense of a much higher computational cost. The PCA-based approach using bootstrapping is computationally very fast and performs not far below the machine learning techniques.
9.3 Discussion and future perspectives
The literature reviews presented in several chapters of this thesis have highlighted that many popular statistical methodologies struggle to find efficient solutions for high-dimensional data. It is clear that, in high dimensions, with potentially many structurally irrelevant variables and only small sample sizes, feature selection is essential for discerning the fundamental data structure, i.e., a potential underlying grouping tendency that may be 'buried' in a much lower-dimensional subspace. Part 1 of this thesis has investigated PCA, which is a very powerful tool for dimension reduction and is computationally very fast. Yet, it is limited by a few fundamental challenges in high-dimensional data.
First, it is well known that we cannot estimate a full-rank covariance matrix Σ using a number of observations N smaller than or equal to the number of descriptors P. In such a case, the eigenvalue decomposition of the sample estimator S of Σ used in PCA produces N − 1 positive eigenvalues (of which the largest are biased upwards) and P − (N − 1) eigenvalues equal to 0, distorting the explained variation of the data (a small numerical illustration is given after this paragraph). In recent years, a wide variety of methods for improving (in some relevant sense) the sample covariance estimator (or the precision matrix estimator) have been proposed in the literature [80]. In this thesis, we have explored in Chapter 4 the potential of eigenvalue-shrinkage estimators, which ensure positive definite and more accurate estimators of the population covariance matrix in moderately high-dimensional data. It would also be useful to explore the scalability of eigenvalue shrinkage to larger high-dimensional covariance matrices (on the order of tens to hundreds of thousands of variables). In any case, we believe that eigenvalue shrinkage could be applied (as a preprocessing step) to any PCA-based type of analysis, to provide a more reliable characterization of the true population (co)variance. Note that the literature offers other forms of data regularization in high-dimensional data [80, 152] that lead to stable representations of the associated high-dimensional covariance matrix, even if many of them are still computationally expensive. Notably, many (reference) datasets are contaminated with small to large proportions of outliers. Initial investigations (not shown in this thesis) have indicated that our eigenvalue-shrinkage class modelling approach (ES-CM) may have great potential in these settings, after a few methodological adaptations. For instance, we are working on the robustification of our ES-CM protocol along 3 main lines: robust data autoscaling, alternative measures to Pearson's correlations (i.e., Kendall's and Spearman's correlations) and robust parametrizations of the critical limits used for outlier detection and one-class classification.
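Both the rank deficiency and the upward bias of the leading eigenvalues are easy to reproduce. The numpy sketch below does so on hypothetical data, then applies a deliberately simplified linear shrinkage of the eigenvalues towards their grand mean with an arbitrary intensity rho; the actual estimator, including the data-driven choice of the shrinkage intensity, is the one developed in Chapter 4:

    import numpy as np

    rng = np.random.default_rng(1)
    N, P = 20, 100                          # N <= P: high-dimensional setting
    X = rng.normal(size=(N, P))             # true covariance is the identity
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / (N - 1)

    eigvals, V = np.linalg.eigh(S)
    eigvals, V = eigvals[::-1], V[:, ::-1]  # sort largest first
    print((eigvals > 1e-10).sum())          # at most N - 1 = 19 non-zero values
    print(eigvals[0])                       # largest one biased upwards (>> 1)

    # Simplified linear shrinkage towards the grand mean: zero eigenvalues
    # are pulled up and inflated ones pulled down, so the reconstructed
    # covariance estimate is always positive definite
    rho = 0.5                               # shrinkage intensity, arbitrary here
    lam = (1 - rho) * eigvals + rho * eigvals.mean()
    S_shrunk = V @ np.diag(lam) @ V.T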
Second, PCA was originally proposed for multivariate normal data, which guarantees that the PCs are uncorrelated and therefore also independent, but most importantly ensures a proper modelling of the linear correlation structure so derived. In this setting it makes perfect sense for PCA to operate on the Pearson correlation matrix. The normal distribution is also symmetric, in which case the variance is a good measure of "spread". These assumptions become quite unrealistic in high-dimensional data: non-linearities are often a significant aspect of the data, and multivariate normality is often violated even after suitable data transformations (so that Pearson's correlations also become quite artificial).
Third, data heterogeneity might result in many PCs whose associated eigenvalues are of comparable size. As a consequence, the directions of the associated PC axes might not be unique, and PC scores calculated from non-distinct PCs may show standard errors so large that they can hardly be used for any practical interpretation.
Fourth, data sparsity, as a consequence of high dimensionality, might also undermine the interpretation of the PC linear functions, as these functions would likely have non-zero coefficients, albeit very close to 0, on all P variables. Interestingly, in PCA there is a long-standing tradition of looking for sparse representations where the variables are associated with only 1 or a few components [369]. Research efforts have focused on reformulations of PCA where the component loadings or the component weights have as many 0 elements as possible. In fact, it is a common belief that a sparse structure facilitates data interpretation. In addition, sparse representations have been employed not only for interpretational reasons, but also to deal with the inconsistency of the estimated component loadings/weights in high-dimensional settings [370], even if their statistical properties are controversial [94, 95]. Sparse PCA seems promising for high-dimensional data, as we also see in the large number of software packages that are available nowadays (see Table 2.12). However, certain challenges need to be resolved before these methods can be efficiently applied in high-dimensional settings. A crucial step for sparse PCA is the choice of the sparsity parameter(s). None of the existing procedures offers an automated calculation of the optimal amount of shrinkage for a given dataset (see the sketch after this paragraph). In particular, they are computationally burdensome, and the optimization of the metaparameters, i.e., the number of PCs and the shrinkage intensity, largely increases the computational complexity. Thus, they can hardly be employed for systematic comparisons in simulations, for instance. In addition, initial investigations have shown that most of the available algorithms gave memory allocation and convergence issues when applied to critical high-dimensional data, like our case study in Chapter 2. Finally, in standard PCA monitoring, the PCs are independent of each other, which eases the use of diagnostic tools like the SD, the OD or the MHD for the detection of aberrant samples. Technically, the introduced sparsity drastically changes the distributional properties of these diagnostic statistics. It is worth re-evaluating in the near future the critical limits for outlier detection discussed in Chapter 3 using sparse PCA. Ideally, the sparse representation would also help rebalance the critical P/N ratio in high-dimensional data (as the actual number of retained variables would be strongly reduced), which caused the inflation of the type I error for most of the outlier detection methods (as seen in Chapter 3).
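To make the sparsity-parameter issue concrete, below is a minimal sketch on hypothetical data using scikit-learn's SparsePCA, one of the many implementations available; the penalty parameter alpha plays the role of the sparsity parameter discussed above and must be set by hand, since no automated choice is provided:

    import numpy as np
    from sklearn.decomposition import SparsePCA

    rng = np.random.default_rng(2)
    X = rng.normal(size=(50, 200))         # N = 50 samples, P = 200 variables
    Xc = X - X.mean(axis=0)

    # alpha controls the L1 penalty on the loadings: larger values force
    # more coefficients exactly to 0, at the cost of explained variance
    spca = SparsePCA(n_components=5, alpha=2.0, random_state=0)
    scores = spca.fit_transform(Xc)
    print((spca.components_ == 0).mean())  # fraction of exactly-zero loadings

Every change of alpha (or of the number of components) requires a full refit, which is what makes the metaparameter optimization so burdensome in systematic simulations.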
Lastly, the sample size is also a fundamental issue for the stability of a PCA analysis. Johnson [371] argues that large sample sizes are necessary to adequately capture variation and to overcome difficulties arising from violations of the assumptions of multivariate methods. Grossman et al. [372] demonstrated, using published data, that statistical differences among the eigenvalues of the PCs are reliable when the ratio N/P is 3 or greater. In addition, the literature has also shown that PCA likely overfits the noise in high-dimensional data, i.e., PCA inconsistently estimates the subspace of maximal variance [79, 373, 374, 80, 42]. It is well known that, when solving linear systems of equations, the number of equations must be at least as large as the number of unknowns. Since PCA can be regarded as a particular regression problem, it follows that PCA may run into problems in properly separating the signal from the noise in high-dimensional data.
New feature selection methods, beyond classical PCA, have steadily appeared over the past few decades [368, 375]. However, the proliferation of feature selection algorithms, often devised for very specific applications, has not brought about a general methodology that allows for automated and intelligent selection among existing algorithms [376]. Thus, the generalizability of such methods to other applications is rather questionable. We also lacked a generic framework for the signal-driven features presented in Chapter 2. Yet, we have provided a very simple and direct approach showing how major challenges in high-dimensional data, like sample heterogeneity and data sparsity, can be tackled (but also replicated for simulation purposes) in practice. We showed that, for our case study, the features could group heterogeneous samples in the reference class together as one homogeneous set. The selection procedure is inspired by the characteristics of the RNA-Seq data of Chapter 2. The extraction procedure is data-driven, even if the features could be extracted from any RNA-Seq (i.e., count) type of data. Such a reduced variable subspace proved an optimal playground for statistical modelling in a low-dimensional multivariate space, for which tractable multivariate (Gaussian) distributions could easily be specified for the extracted features. Eventually, the features could also be used for one-class classification or outlier detection. Note that, for the sake of comparison, we have chosen classification strategies similar to those considered for the raw data (i.e., PCA-based classification protocols), but we acknowledge that general strategies exploring the entire multivariate space of the features, as in Chapter 4 (like the MHD), could have been used as well. Similarly, in Chapter 7 we also adopted a feature selection strategy, as the 3 output variables considered in the experimental analysis were the result of an underlying feature selection process from high-dimensional and high-frequency spectra, in an attempt to identify the key aspects of a complex chemical reaction. Admittedly, we have not modelled the correlations among the 3 output variables, but the literature offers extensions of the chosen LMM to multivariate settings [377]. In that case, however, the backward elimination strategy adopted in the paper could result in a quite challenging iteration process.
Alternatively, when more knowledge of the specific data distribution is available, statistical modelling can also suggest potential (parsimonious) feature parameter sets. For instance, the features obtained from the multivariate NB model described in Chapter 5 also constitute a feature selection approach, generating a low-dimensional feature space that could be used for clustering or even outlier detection. In Chapter 5, a new clustering algorithm based on the proposed set of features was introduced and tested on high-dimensional RNA-Seq data. We believe that the proposed clustering algorithm could be appealing for many practitioners. For instance, in biological data such as omics, the method could on the one hand support and enrich the current gene ontologies and, on the other hand, help the exploration of the large unknown regions of the human transcriptome.
All in all, we have raised in this thesis the dualism between multivariate covariance-oriented techniques and statistical modelling techniques. Despite having outlined a few limitations of the first class of methods and having shown the potential of statistical modelling in Part 2, albeit limited by strict model and distributional assumptions, we believe that there are promising routes ahead for joining forces and developing a more solid and dedicated theoretical underpinning for the analysis of high-dimensional data. In particular, we see in Generalized Estimating Equations (GEE) [345, 255, 256] a powerful and promising intersection. GEEs allow users to specify general association structures to describe the relationships among the variables. This can be used to accommodate the many different experimental settings explored in this thesis, ranging from the high-dimensional cross-sectional experiments of Part 1 to the longitudinal settings explored in Part 2 (or even spatial data). Clearly, GEEs would benefit greatly from the improved characterizations of the covariance matrix discussed in Chapter 4, or from other forms of (co)variance regularization. Secondly, GEEs can easily model normal as well as non-normal responses. In fact, as a "hybrid" of the Generalized Linear (Mixed) Model (GLMM), GEEs can address (overdispersed) count data distributions (Poisson, Binomial, etc.), as well as binary data, requiring only the specification of the first 2 (or 4, for GEE2 [345]) distributional moments. Notably, the data distributions discussed in this thesis, comprising the zero-inflated and overdispersed count data of Chapters 2 and 5 and the heavy-tailed Gaussian data of Chapter 6, would fit in this extended framework. Interestingly, GEEs would also inherit the intrinsic feature reduction approach presented in these chapters. Finally, GEEs are also computationally simpler than (G)LMMs, making use of quasi-likelihood estimation procedures (similar to those presented in Chapter 5), and are currently gaining increasing attention in statistical software, e.g., the "geepack" R package or the GENMOD procedure in SAS. Clearly, a few methodological problems would remain open in this generic framework, such as some of the research tasks presented in this thesis, i.e., outlier detection and clustering, but we believe that this could be a promising direction to pursue in the near future.
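As an indication of how accessible this framework already is, the following minimal sketch fits a Poisson GEE with an exchangeable working correlation on hypothetical toy longitudinal counts, using the Python statsmodels package; the "geepack" R package and the GENMOD procedure in SAS offer equivalent functionality:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical toy data: counts for 50 subjects over 4 visits each
    rng = np.random.default_rng(3)
    n_subj, n_visit = 50, 4
    df = pd.DataFrame({
        "subject": np.repeat(np.arange(n_subj), n_visit),
        "time": np.tile(np.arange(n_visit), n_subj),
    })
    df["y"] = rng.poisson(np.exp(0.2 + 0.1 * df["time"]))

    # Only the first 2 moments are specified: a Poisson mean model plus an
    # exchangeable working correlation among the repeated measurements
    model = sm.GEE.from_formula(
        "y ~ time", groups="subject", data=df,
        family=sm.families.Poisson(),
        cov_struct=sm.cov_struct.Exchangeable(),
    )
    print(model.fit().summary())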
Bibliography
[1] V. Bell. Writing to the general, and other aesthetic strategies of critique: The art of León Ferrari as a practice of freedom. Journal of Latin American Cultural Studies, 21(2):253–285, 2012.
[2] I. M. Johnstone and D. M. Titterington. Statistical challenges of high-dimensional data. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 367(1906):4237–4253, 2009.
[3] J. Fan and J. Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–911, 2008.
[4] P. Naik, M. Wedel, L. Bacon, A. Bodapati, E. Bradlow, W. Kamakura, J. Kreulen, P. Lenk, D. M. Madigan, and A. Montgomery. Challenges and opportunities in high-dimensional choice data analyses. Marketing Letters, 19(3/4):201–213, may 2008.
[5] A. Belloni, V. Chernozhukov, and C. Hansen. On structural and treatment effects. Journal of economics perspectives, 28(2):29–50, 2014.
[6] N. Hao and H. H. Zhang. Interaction screening for ultrahigh-dimensional data. Journal of the American Statistical Association, 109(507):1285–1301, may 2014.
[7] H. Hosseinmardi, H. Kao, K. Lerman, and E. Ferrara. Discovering hidden structure in high dimensional human behavioral data via tensor factorization. Johns Hopkins University, Dept. of Biostatistics Working Papers, pages 0–4, 2019.
[8] Z. Zipunnikov, S. Greven, B. Caffo, D. Reich, and C. Crainiceanu. Longitudinal high-dimensional data analysis. pages 1–30, 2011.
[9] Y. Luan and H. Li. Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics, 19(4):474–482, mar 2003.
[10] S. Chen, E. Grant, T. Wu, and F. D. Bowman. Statistical learning methods for longitudinal high-dimensional data. Wiley interdisciplinary reviews. Computational statistics, 6(1):10–18, jan 2014.
[11] A. Crane-Droesch. Machine learning methods for crop yield prediction and climate change impact assessment in agriculture. Environmental Research Letters, 13(11):114003, 2018.
[12] S. Kang and S. K. Kim. Outlier behavior detection for indoor environment based on t-SNE clustering. Computers, Materials & Continua, 68(3):3725–3736, 2021.
[13] A. Christmann. An approach to model complex high-dimensional insurance data. Allgemeines Statistisches Archiv, 88(4):375–396, 2004.
[14] P. Bühlmann and S. van de Geer. Statistics for high-dimensional data. Springer, Heidelberg, 2011.
[15] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour. Microarray data analysis: from disarray to consolidation and consensus. Nature Reviews Genetics, 7(1):55–65, 2006.
[16] Y. Ye, Q. Wu, J. Zhexue Huang, M. K. Ng, and X. Li. Stratified sampling for feature subspace selection in random forests for high-dimensional data. Pattern Recognition, 46(3):769–787, 2013.
[17] C. Shen. High-dimensional independence testing and maximum marginal correlation. 2020.
[18] G. Mao. A new test of independence for high-dimensional data. Statistics and Probability Letters, 93:14–18, 2014.
[19] Y. Song and X. Zhao. Normality testing of high-dimensional data based on principal component and Jarque-Bera statistics. Stats, 4(1):216–227, 2021.
[20] H. Chen and Y. Xia. A nonparametric normality test for high-dimensional data. pages 1–38, 2019.
[21] M. J. Wainwright. Introduction. In Martin J Wainwright, editor, High-dimensional Statistics: A non-asymptotic viewpoint, Cambridge Series in Statistical and Probabilistic Mathematics, pages 1–20. Cambridge University Press, Cambridge, 2019.
[22] P. McCullagh and N. G. Polson. Statistical sparsity. Biometrika, 105(4):797–814, dec 2018.
[23] M. Steinbach, L. Ertöz, and V. Kumar. The challenges of clustering high dimensional data. In New Directions in Statistical Physics: Econophysics, Bioinformatics, and Pattern Recognition, pages 273–309. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004.
[24] J. Ye and J. Liu. Sparse methods for biomedical data. SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery Data Mining, 14(1):4–15, jun 2012.
[25] B. Allison, D. Guthrie, and L. Guthrie. Another Look at the Data Sparsity Problem. pages 327–334, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.
[26] E. W. Steyerberg, D. Nieboer, T. P. A. Debray, and H. C. van Houwelingen. Assessment of heterogeneity in an individual participant data meta-analysis of prediction models: An overview and illustration. Statistics in Medicine, 38(22):4290–4309, sep 2019.
[27] M. Regis. Heterogeneity in subject-specific statistical models for the analysis of longitudinal data. PhD thesis, Eindhoven University of Technology, may 2019.
[28] A. Zimek, E. Schubert, and H.P. Kriegel. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining: The ASA Data Science Journal, 5(5):363–387, 2012.
[29] R. De Maesschalck, D. Jouan-Rimbaud, and D. L. Massart. The Mahalanobis distance. Chemometrics and Intelligent Laboratory Systems, 50(1):1–18, 2000.
[30] S. Wold, K. Esbensen, and P. Geladi. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1):37–52, 1987.
[31] W.-Y. Loh. Classification and regression trees. WIREs Data Mining and Knowledge Discovery, 1(1):14–23, jan 2011.
[32] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[33] A. K. Jain, J. Mao, and K. M. Mohiuddin. Artificial neural networks: a tutorial. Computer, 29(3):31–44, 1996.
[34] S. Wold and M. Sjöström. SIMCA: a method for analyzing chemical data in terms of similarity and analogy. jul 1977.
[35] P. J. Rousseeuw, M. Debruyne, S. Engelen, and M. Hubert. Robustness and Outlier Detection in Chemometrics. Critical Reviews in Analytical Chemistry, 36(3-4):221–242, 2006.
[36] P. Filzmoser and V. Todorov. Review of robust multivariate statistical methods in high dimension. Analytica Chimica Acta, 705(1):2–14, 2011.
[37] L. Zhu, H. Chang, Q. Zhou, and Z. Wang. Improving the classification accuracy for near-infrared spectroscopy of chinese salvia miltiorrhiza using local variable selection. Journal of Analytical Methods in Chemistry, 2018:5237308, 2018.
[38] E. Kok, J. van Dijk, M. Voorhuijzen, M. Staats, M. Slot, A. Lommen, D. Venema, M. Pla, M. Corujo, E. Barros, R. Hutten, J. Jansen, and H. van der Voet. Omics analyses of potato plant materials using an improved one-class classification tool to identify aberrant compositional profiles in risk assessment procedures. Food Chemistry, 292:350–358, 2019.
[39] A. L. Pomerantsev and O. Y. Rodionova. Popular decision rules in SIMCA: critical review. Journal of Chemometrics, 34(8):e3250, 2020.
[40] H. Tsugawa, Y. Tsujimoto, M. Arita, T. Bamba, and E. Fukusaki. GC/MS-based metabolomics: development of a data mining system for metabolite identification by using soft independent modeling of class analogy (SIMCA). BMC Bioinformatics, 12(1):131, 2011.
[41] S. J. Mazivila, R. N. M. J. Páscoa, R. C. Castro, D. S. M. Ribeiro, and J. L. M. Santos. Detection of melamine and sucrose as adulterants in milk powder using near-infrared spectroscopy with DD-SIMCA as one-class classifier and MCR-ALS as a mean to provide pure profiles of milk and of both adulterants with forensic evidence: A short communication. Talanta, 216:120937, 2020.
[42] A. Brini, V. Avagyan, R. C. H. de Vos, J. H. Vossen, E. R. van den Heuvel, and J. Engel. Improved one-class modeling of high-dimensional metabolomics data via eigenvalue-shrinkage, 2021.
[43] M. Corujo, M. Pla, J. van Dijk, M. Voorhuijzen, M. Staats, M. Slot, A. Lommen, E. Barros, A. Nadal, P. Puigdomenech, J. L. La Paz, H. van der Voet, and E. J. Kok. Use of omics analytical methods in the study of genetically modified maize varieties tested in 90 days feeding trials. Food Chemistry, 2018.
[44] EFSA. Guidance for risk assessment of food and feed from genetically modified plants. EFSA Journal, 9(5):2150, 2011.
[45] E. Barros, S. Lezar, M. J. Anttonen, J. P. van Dijk, R. M. Rohlig, E. J. Kok, and K. H. Engel. Comparison of two GM maize varieties with a near-isogenic non-GM variety using transcriptomics, proteomics and metabolomics. Plant Biotechnology Journal, 8(4):436–451, 2010.
[46] S. Z. Agapito-Tenfen, M. P. Guerra, O. G. Wikmark, and R. O. Nodari. Comparative proteomic analysis of genetically modified maize grown under different agroecosystems conditions in Brazil. Proteome Science, 11(1), 2013.
[47] J. P. van Dijk, C. S. de Mello, M.M. Voorhuijzen, R. C. B. Hutten, A. C. M. Arisi, J. Jansen, L. M. C. Buydens, H. van der Voet, and E. J. Kok. Safety assessment of plant varieties using transcriptomics profiling and a one-class classifier. Regulatory Toxicology and Pharmacology, 70(1):297–303, 2014.




