Design of a Hybrid Ensemble Feature Selection Framework for Big Data Text Mining
Abstract
The growing volume of textual data often exceeds the capacity of available computing resources, and conventional machine learning algorithms struggle to scale. Data quality now matters more than raw quantity, so massive data must be transformed into intelligent data through appropriate pre-processing, in which feature selection plays a key role. In this work, we propose the design of a hybrid ensemble-based feature selection framework for processing large-scale textual data. The approach combines the MFD-AFSA algorithm with several feature evaluation functions and applies them to multiple data subsets. To improve scalability, we also outline a distributed strategy for an Apache Spark environment, based on the Random Sample Partitioning model. Finally, we introduce an automatic approximation mechanism, which we call auto-approximation, that builds selection sets dynamically via an approximation technique. This work is a methodological design contribution; experimental validation and practical evaluation are left to future work.
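To make the ensemble idea concrete, the following is a minimal single-machine sketch under explicit assumptions, not the framework proposed here. It does not implement MFD-AFSA, the Spark strategy, or auto-approximation: the search step is replaced by a plain top-m ranking, the two evaluation functions (chi-square and a per-class document-frequency difference) are stand-ins chosen for illustration, and all function names are hypothetical. It only shows the general pattern of scoring features on several random data blocks with several evaluation functions and keeping the features that win a majority of votes.

```python
import random
from collections import Counter

def random_sample_partition(n_docs, n_blocks, seed=0):
    """Shuffle document indices and deal them into n_blocks disjoint blocks."""
    idx = list(range(n_docs))
    random.Random(seed).shuffle(idx)
    return [idx[b::n_blocks] for b in range(n_blocks)]

def term_class_counts(docs, labels, block):
    """Per term, count the positive and negative documents in the block that contain it."""
    pos, neg = Counter(), Counter()
    n_pos = n_neg = 0
    for i in block:
        terms = set(docs[i])
        if labels[i] == 1:
            n_pos += 1
            pos.update(terms)
        else:
            n_neg += 1
            neg.update(terms)
    return pos, neg, n_pos, n_neg

def chi_square(a, b, n_pos, n_neg):
    """Chi-square statistic of the 2x2 term-presence / class table."""
    c, d = n_pos - a, n_neg - b
    n = n_pos + n_neg
    denom = (a + b) * (c + d) * n_pos * n_neg
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def rate_difference(a, b, n_pos, n_neg):
    """Absolute difference between the per-class document frequencies of a term."""
    if n_pos == 0 or n_neg == 0:
        return 0.0
    return abs(a / n_pos - b / n_neg)

def ensemble_select(docs, labels, n_blocks=3, top_m=2, min_votes=None, seed=0):
    """Vote over (block, scorer) pairs; keep terms with at least min_votes votes."""
    scorers = [chi_square, rate_difference]  # stand-in evaluation functions
    if min_votes is None:
        min_votes = (n_blocks * len(scorers)) // 2 + 1  # strict majority
    votes = Counter()
    for block in random_sample_partition(len(docs), n_blocks, seed):
        pos, neg, n_pos, n_neg = term_class_counts(docs, labels, block)
        vocab = sorted(set(pos) | set(neg))
        for scorer in scorers:
            # Highest score first; ties broken alphabetically for determinism.
            ranked = sorted(
                vocab, key=lambda t: (-scorer(pos[t], neg[t], n_pos, n_neg), t)
            )
            votes.update(ranked[:top_m])
    return sorted(t for t, v in votes.items() if v >= min_votes)

# Toy corpus: "good" marks class 1, "bad" marks class 0, "the" is uninformative.
docs = [["good", "the", f"a{i}"] for i in range(4)] + \
       [["bad", "the", f"b{i}"] for i in range(4)]
labels = [1] * 4 + [0] * 4
selected = ensemble_select(docs, labels, n_blocks=1, top_m=2)
```

With more than one block, the result depends on how the shuffle distributes the classes, so small toy corpora can yield unstable selections; the voting threshold `min_votes` is the knob that trades stability against recall in this sketch.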
Copyright (c) 2026 ITEGAM-JETIA

This work is licensed under a Creative Commons Attribution 4.0 International License.