Design of a Hybrid Ensemble Feature Selection Framework for Big Data Text Mining

Smah Smari; Barigou Fatiha; Belalem Ghalem

doi:10.5935/jetia.v12i58.3192

Smah Smari, Computer Science Department, Laboratory of Computer science of Oran (LIO), Oran 1 University, Ahmed Ben Bella, Oran, Algeria. https://orcid.org/0000-0002-5787-5380
Barigou Fatiha, Computer Science Department, Laboratory of Computer science of Oran (LIO), Oran 1 University, Ahmed Ben Bella, Oran, Algeria. https://orcid.org/0000-0001-5444-4000
Belalem Ghalem, Computer Science Department, Laboratory of Computer science of Oran (LIO), Oran 1 University, Ahmed Ben Bella, Oran, Algeria. https://orcid.org/0000-0002-9694-7586

Abstract

The growing volume of textual data often exceeds the capacity of available computing resources, and conventional machine learning algorithms struggle to scale. Data quality is becoming more critical than raw quantity, so it is essential to transform massive data into intelligent data through appropriate pre-processing steps. Feature selection plays a key role in this process. In this work, we propose the design of a hybrid ensemble-based feature selection framework for processing large-scale textual data. The approach combines the MFD-AFSA algorithm with several feature evaluation functions, each applied to multiple data subsets. To improve scalability, we also outline a distributed strategy for an Apache Spark environment, based on the Random Sample Partitioning model. Finally, we introduce an automatic approximation mechanism, which we call auto-approximation, that builds selection sets dynamically via an approximation technique. This paper presents a methodological design only; experimental validation and practical evaluation are left to future work.
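
The general idea of ensemble feature selection over random data subsets can be illustrated with a minimal, single-machine sketch. It is not the authors' method: the MFD-AFSA search, the Spark distribution, and the auto-approximation step are not implemented here. The function names (rsp_blocks, ensemble_select, and so on) are hypothetical, the two scoring functions (chi-square and a class-conditional rate difference) are stand-ins for the unspecified feature evaluation functions, the shuffle-and-split partitioning is only a rough analogue of Random Sample Partitioning, and the vote-based aggregation is an assumption.

```python
import random
from collections import Counter


def rsp_blocks(n_docs, n_blocks, seed=0):
    """Shuffle document indices and deal them into disjoint blocks
    (a rough, assumed analogue of random sample partitioning)."""
    idx = list(range(n_docs))
    random.Random(seed).shuffle(idx)
    return [idx[b::n_blocks] for b in range(n_blocks)]


def contingency(docs, labels, block, term):
    """Counts (a, b, c, d): term present/absent crossed with class 1/0."""
    a = b = c = d = 0
    for i in block:
        present = term in docs[i]
        if present and labels[i] == 1:
            a += 1
        elif present:
            b += 1
        elif labels[i] == 1:
            c += 1
        else:
            d += 1
    return a, b, c, d


def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 term/class table (0 if degenerate)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def rate_difference(a, b, c, d):
    """Absolute gap between the term's frequency in each class."""
    pos, neg = a + c, b + d
    if pos == 0 or neg == 0:
        return 0.0
    return abs(a / pos - b / neg)


def ensemble_select(docs, labels, n_blocks=2, top_k=2, seed=0):
    """Each (block, scorer) pair votes for its top_k terms with a
    positive score; the top_k most-voted terms are returned."""
    vocab = sorted(set().union(*docs))
    votes = Counter()
    for block in rsp_blocks(len(docs), n_blocks, seed):
        for scorer in (chi_square, rate_difference):
            scores = {t: scorer(*contingency(docs, labels, block, t))
                      for t in vocab}
            ranked = sorted(vocab, key=lambda t: (-scores[t], t))
            votes.update(t for t in ranked[:top_k] if scores[t] > 0)
    return sorted(votes, key=lambda t: (-votes[t], t))[:top_k]
```

Each document is represented as a set of terms and each label as 0 or 1. Voting across blocks and scorers is one simple way to combine partial selections; a rank-averaging or weighted scheme would be an equally plausible choice.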
