Phishing remains a critical cybersecurity challenge because the technology used to carry out attacks is developing rapidly. Detecting phishing attacks is difficult because the methods used to execute them keep evolving. Although several techniques have been deployed against these attacks, no single solution is perfect. Machine learning is now widely accepted among researchers as an effective countermeasure against phishing attacks on networks. The approach comprises several steps, of which feature selection is a crucial one: the quality of the features used to build a machine learning model plays a significant role in its performance. The two general feature selection approaches have shortcomings, such as the difficulty of choosing a cutoff point and high computational cost. To address the cutoff-point issue, this study applied a novel ensemble feature selection strategy that identifies relevant features and discards correlated ones. The study used the Borda count algorithm as the aggregator to improve the selection performance of the individual filter-based measures. In the first phase of the feature selection framework, three individual filter-based measures (gain ratio, chi-square, and correlation) were applied to produce features according to their respective principles. In the second phase, the proposed HDEFS was applied to the primary information features. HDEFS produced baseline webpage features that differ from the features previously used for phishing detection, such as IpAddress, AtSymbol, QueryLength, MissingTitle, and NumQueryComponents. The results showed that phishing detection models built on the proposed HDEFS baseline features improved on those built with the individual filter-based measures. The prediction accuracy of the models increased when they used the features selected by the proposed feature selection framework.
The bagged SVM model outperformed the other ensemble and classical models, achieving 0.974 (97.4%), followed by bagged LR (0.94).
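To illustrate the aggregation step, the following is a minimal sketch of how a Borda count can merge several best-first feature rankings into one consensus ranking. It is a generic Borda count, not the HDEFS implementation, whose details are not given here. The function name `borda_aggregate` and the three example rankings are hypothetical and chosen only for illustration; the feature names are borrowed from the list above.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Merge best-first rankings into a consensus list of (feature, points).

    Each ranking of n features awards n - 1 points to its top feature,
    n - 2 to the next, and 0 to the last. Ties are broken alphabetically.
    """
    points = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, feature in enumerate(ranking):
            points[feature] += n - 1 - position
    return sorted(points.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rankings standing in for the three filter measures
# (illustrative values only, not results from the study).
gain_ratio_rank = ["QueryLength", "IpAddress", "AtSymbol", "MissingTitle"]
chi_square_rank = ["IpAddress", "QueryLength", "MissingTitle", "AtSymbol"]
correlation_rank = ["QueryLength", "AtSymbol", "IpAddress", "MissingTitle"]

consensus = borda_aggregate([gain_ratio_rank, chi_square_rank, correlation_rank])
for feature, score in consensus:
    print(f"{feature}: {score}")
```

Because the Borda count works on rank positions rather than raw scores, measures with different scales (such as gain ratio and chi-square) can be combined without normalizing them first.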
Keywords: Phishing Detection, Cybersecurity, Machine Learning, Feature Selection, Ensemble, Malicious, Email