The rapid growth of Android applications has led to a significant increase in malware threats, making accurate and robust detection mechanisms essential for mobile security. However, challenges such as class imbalance and high-dimensional feature spaces limit the effectiveness of traditional machine learning approaches.
This work proposes a robust machine learning pipeline for Android malware detection that integrates generative data augmentation and deep feature extraction with classical classification models. We employ a Conditional Tabular Generative Adversarial Network (CTGAN) to synthetically balance TUANDROMD, a permission- and API-based feature dataset developed at Tezpur University from real benign and malicious Android applications. An autoencoder is then used to learn compact, discriminative latent representations from the original 241 numerical features, reducing dimensionality and redundancy. The extracted features are used to train multiple machine learning classifiers, including Logistic Regression, Random Forest, and XGBoost, enabling a comparative evaluation of model performance.
The models are assessed using accuracy, precision, recall, and F1-score under stratified validation and holdout testing. Four experimental configurations are investigated: (i) baseline classification using raw features, (ii) CTGAN-based data augmentation, (iii) autoencoder-based feature extraction, and (iv) CTGAN-based augmentation followed by autoencoder-driven feature extraction. Experimental results demonstrate that the combined CTGAN and autoencoder pipeline significantly improves minority-class detection while maintaining high overall accuracy. These findings highlight that integrating generative augmentation with learned feature representations is an effective strategy for handling high-dimensional, imbalanced Android malware datasets.
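To illustrate why minority-class metrics are reported alongside overall accuracy, the following minimal sketch computes accuracy and per-class precision, recall, and F1-score in plain Python. The label encoding, the 90/10 class ratio, and the predictions are hypothetical toy values chosen for illustration; they are not results from this work, and the helper names are assumptions rather than part of the described pipeline.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1-score treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data: 90 majority-class samples (label 1)
# and 10 minority-class samples (label 0).
y_true = [1] * 90 + [0] * 10
# A classifier biased toward the majority class: it recovers
# only 2 of the 10 minority samples and labels the rest as majority.
y_pred = [1] * 90 + [0] * 2 + [1] * 8

acc = accuracy(y_true, y_pred)
minority_precision, minority_recall, minority_f1 = per_class_metrics(
    y_true, y_pred, positive=0
)
print(f"accuracy={acc:.2f}")
print(f"minority precision={minority_precision:.2f} "
      f"recall={minority_recall:.2f} f1={minority_f1:.2f}")
```

In this toy case overall accuracy is high even though most minority-class samples are missed, which is the failure mode that per-class recall and F1-score expose on imbalanced data.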
Keywords: Android Malware Detection; CTGAN; Autoencoder; Data Augmentation; Feature Extraction; Ensemble Learning; Imbalanced Data