Abstract
Securing network infrastructure requires accurate detection and categorization of malware. Although existing methods report high accuracy, their effectiveness has mostly been validated on a limited subset of malware families or samples. These analyses often focus on the families with the most samples, which can lead to biased and unrepresentative classification results. To address this gap, we aim to improve the accuracy and robustness of malware detection and categorization systems by investigating how dataset size, class balance, and data augmentation affect classifier performance. We demonstrate our approach on a comparatively large dataset, the Blue Hexagon Open Dataset for Malware AnalysiS, comprising 134k samples. Our analysis, which uses 85 malware families with at least 50 samples each, achieves a highest accuracy of 92.28% with a Random Forest classifier on the original imbalanced dataset. When Generative Adversarial Networks are used to generate synthetic samples and balance the class distribution, the classifier's accuracy improves to 99.35%.