Exhaustive automatic topic classification of the Reuters-21578 corpus with supervised machine learning
Abstract
Automatic text classification has established itself as a research discipline that combines natural language processing (NLP) techniques with machine learning algorithms, making it possible to categorize large volumes of textual documents efficiently. This work proposes an approach that integrates current preprocessing techniques with classical supervised learning algorithms to improve classification accuracy on the Reuters-21578 corpus. It comprises a literature review, the implementation of preprocessing techniques (tokenization, lemmatization, stopword removal, lowercase conversion, and special character removal), and an exploration of supervised learning algorithms (Logistic Regression, Support Vector Machines, Naïve Bayes, Random Forest, and k-nearest neighbors). Experiments were conducted with various configurations that combine the preprocessing techniques, feature extraction methods such as TF-IDF, and the aforementioned algorithms. In the scenarios tested, integrating these techniques and algorithms significantly improves text classification accuracy, yielding a configuration suited to the Reuters-21578 corpus that reaches an accuracy of up to 98.6%. The result is a rigorous and efficient empirical methodology that can be applied to other corpora of text documents.
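As a rough illustration of the preprocessing and TF-IDF steps named above, the following minimal sketch uses only the Python standard library. It is not the authors' implementation: the stopword list, the function names `preprocess` and `tfidf`, the sample sentences, and the particular TF-IDF variant (term frequency times `ln(N/df) + 1`) are all assumptions made for the example, and lemmatization is omitted because it would need an external resource.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; the abstract does not specify one).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "for"}


def preprocess(text):
    """Lowercase, replace special characters with spaces, tokenize, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: (count / doc length) * (ln(N / df) + 1).
    """
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        vectors.append(
            {t: (c / total) * (math.log(n_docs / df[t]) + 1.0) for t, c in counts.items()}
        )
    return vectors


# Hypothetical newswire-style sentences, not taken from Reuters-21578.
docs = [
    "Oil prices rise on OPEC cut.",
    "Wheat and corn exports rise.",
    "OPEC ministers meet on oil output.",
]
vectors = tfidf(docs)
```

In this sketch, terms that occur in fewer documents receive higher weights than terms shared across documents, which is the property a downstream classifier such as Logistic Regression or a Support Vector Machine would exploit.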
