Cross-Validation for Lower Rank Matrices Containing Outliers

Arciniegas Alarcón, Sergio; Arciniegas Alarcón, Marisol; Krzanowski, Wojtek J.

Item links

URI: http://hdl.handle.net/10818/58677
Visit link: https://www.mdpi.com/2571-5577 ...

DOI: 10.3390/asi5040069

Date

2022

Abstract

Several statistical techniques for analyzing data matrices use lower rank approximations to these matrices, for which, in general, the appropriate rank must first be estimated depending on the objective of the study. The estimation can be conducted by cross-validation (CV), but most methods are not designed to cope with the presence of outliers, a very common problem in data matrices. The literature suggests one option to circumvent the problem, namely, the elimination of the outliers, but such information removal should only be performed when it is possible to verify that an outlier effectively corresponds to a collection or typing error. This paper proposes a methodology that combines the robust singular value decomposition (rSVD) with a CV scheme, which allows outliers to be taken into account without eliminating them. For this, three possible rSVDs are considered and six resistant criteria are proposed for the choice of the rank, based on three classic statistics used in multivariate analysis. To test the performance of the various methods, a simulation study and an analysis of real data are described, using an exclusively numerical evaluation through Procrustes statistics and critical angles between subspaces of principal components. We conclude that, when data matrices are contaminated with outliers, the best estimation of rank is the one that uses a CV scheme over a robust lower rank approximation (RLRA) containing as many components as possible. In our experiments, the best results were obtained when this RLRA was calculated using an rSVD that minimizes the L2 norm.
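
As a rough illustration of the general idea of choosing a rank by cross-validation with a resistant criterion, the sketch below holds out random cells of the matrix, predicts them from an iterated low-rank SVD fit, and scores each candidate rank by the median absolute prediction error. It is not the authors' procedure: the ordinary `numpy.linalg.svd` stands in for an rSVD, the median absolute error stands in for the six criteria the abstract mentions, and the function names `impute_low_rank` and `cv_rank`, the fold scheme, and all parameter defaults are assumptions made only for this example.

```python
import numpy as np


def impute_low_rank(X, mask, rank, n_iter=100):
    """Predict the cells where mask is True from an iterated rank-`rank` SVD.

    Assumption: a plain SVD is used here as a stand-in for a robust SVD.
    """
    filled = np.array(X, dtype=float)
    # Initialise held-out cells with the column medians of the observed cells.
    col_med = np.nanmedian(np.where(mask, np.nan, filled), axis=0)
    filled[mask] = np.take(col_med, np.nonzero(mask)[1])
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Only the held-out cells are updated; observed cells stay fixed.
        filled[mask] = approx[mask]
    return filled


def cv_rank(X, max_rank, n_folds=5, seed=0):
    """Return (best_rank, scores) using element-wise k-fold cross-validation.

    scores[r] is the median absolute error over all held-out cells, a
    resistant criterion chosen for this sketch (hypothetical, not from the paper).
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, size=X.shape)  # fold label for every cell
    scores = {}
    for rank in range(1, max_rank + 1):
        errors = []
        for k in range(n_folds):
            mask = folds == k
            pred = impute_low_rank(X, mask, rank)
            errors.append(np.abs(pred[mask] - X[mask]))
        scores[rank] = float(np.median(np.concatenate(errors)))
    best = min(scores, key=scores.get)
    return best, scores
```

Calling `cv_rank(X, max_rank=3)` on a numeric matrix `X` returns the candidate rank with the smallest criterion value together with the per-rank scores, so the whole error profile can be inspected rather than only the minimiser.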