Treatment and evaluation of incomplete data using the Gamma Pydra classifier and missing values accuracy (MVA)

doi:10.17562/4126

Juan Ignacio Vilchis

Abstract

Incomplete data are a real, current, and sometimes unpredictable phenomenon that arises in many areas of human knowledge. In data analysis studies, failing to treat missing values properly tends to introduce significant problems in the final results for the phenomenon under study. This article proposes, first, a method with eight variants for treating missing values, based on the Gamma associative classifier; and second, a performance measure that supports a better understanding of how classification with missing values is evaluated. The proposals are compared against two existing state-of-the-art methods for treating missing values, both belonging to the imputation approach. The experimental phase used ten datasets without missing values. In addition, a module was implemented to generate missing values in the complete datasets, so that the number of induced missing values could be controlled and results could be compared across different percentages of missing data. According to the proposed measure, which allows classification behavior to be analyzed with a focus on the missing values, the proposals perform competitively against the state of the art.
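The abstract mentions a module that induces a controlled number of missing values in complete datasets. The sketch below illustrates one way such a step could look; it is not the authors' module. It assumes values are removed completely at random, that the input holds only feature columns (class labels kept separately), and that a missing cell is marked with `None`. The function name `induce_missing` and its parameters are hypothetical.

```python
import random


def induce_missing(rows, fraction, seed=0, missing=None):
    """Return a copy of `rows` with a fixed share of cells replaced by `missing`.

    Assumption (not from the article): cells are chosen uniformly at random,
    and `rows` contains feature values only, not class labels.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")

    n_cols = len(rows[0]) if rows else 0
    cells = [(i, j) for i in range(len(rows)) for j in range(n_cols)]

    # Exact count of cells to blank out, so the loss percentage is controlled.
    n_missing = round(fraction * len(cells))

    rng = random.Random(seed)  # seeded for reproducible experiments
    chosen = rng.sample(cells, n_missing)

    out = [list(row) for row in rows]  # leave the complete dataset untouched
    for i, j in chosen:
        out[i][j] = missing
    return out


if __name__ == "__main__":
    complete = [[1.0, 2.0, 3.0, 4.0] for _ in range(10)]
    for pct in (0.05, 0.25, 0.50):
        damaged = induce_missing(complete, pct, seed=1)
        count = sum(v is None for row in damaged for v in row)
        print(f"{pct:.0%} requested, {count} of 40 cells removed")
```

Fixing the seed and computing an exact cell count makes it possible to rerun the same classifier on identical damaged copies at each loss percentage, which is what a comparison across percentages requires.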



POLIBITS