Intrusion detection in Internet of Things (IoT) environments is essential to network security. Machine learning (ML) models are widely used to build effective detection systems. However, IoT-related datasets are typically very large, and as intrusion detection data grows in size and complexity, analyzing it with ML models becomes increasingly demanding in terms of computational resources.
This study investigates the impact of dataset reduction techniques on the performance and efficiency of ML-based Intrusion Detection Systems (IDS). We propose a two-stage framework that combines deep autoencoder-based feature reduction with stratified sampling to reduce both the dimensionality and the size of six publicly available IDS datasets, including BoT-IoT and CSE-CIC-IDS2018. Multiple ML models (Random Forest, XGBoost, K-Nearest Neighbors, SVM, and AdaBoost) were evaluated using default parameters. Our results show that dataset reduction can decrease training time by up to 99% with minimal loss in F1-score, typically less than 1%. Excessive size reduction can compromise detection accuracy for minority attack classes, but stratified sampling effectively maintains class distributions in the reduced data. The study also highlights significant feature redundancy, particularly high correlation among features, across multiple IoT security-related datasets, which motivates the use of dimensionality reduction. These findings support the feasibility of efficient, scalable IDS implementations for real-world environments, especially in resource-constrained or real-time settings.
Key words: Dimensionality reduction; Data reduction; Autoencoders; Stratified sampling; Machine learning
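As an illustration of the stratified sampling stage mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. The function name `stratified_sample`, its parameters, and the toy label column are assumptions made for this example; the idea shown is simply that each class is subsampled at the same rate, so minority attack classes keep their share of the reduced dataset.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(labels, fraction, seed=0, min_per_class=1):
    """Return sorted row indices keeping roughly `fraction` of every class.

    Hypothetical helper for illustration: `min_per_class` guards against a
    rare class being dropped entirely when the fraction is very small.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    kept = []
    for indices in by_class.values():
        # Same sampling rate for every class, capped at the class size.
        k = max(min_per_class, round(len(indices) * fraction))
        k = min(k, len(indices))
        kept.extend(rng.sample(indices, k))
    return sorted(kept)


# Made-up, heavily imbalanced label column (not taken from any real dataset).
labels = ["benign"] * 9000 + ["ddos"] * 900 + ["theft"] * 100
subset = stratified_sample(labels, fraction=0.1)
counts = Counter(labels[i] for i in subset)
print(len(subset), dict(counts))
```

In this toy case the reduced set is one tenth of the original size, and the three classes keep the same 90:9:1 ratio they had before sampling. A uniform random sample of the same size would not guarantee this for the rarest class.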