A comparison of adversarial malware generators
Autoři
Louthánová, P.; Kozák, M.; Jureček, M.; Stamp, M.; Di Troia, F.
Rok
2024
Publikováno
Journal of Computer Virology and Hacking Techniques. 2024, 20(4), 623-639. ISSN 2263-8733.
Typ
Článek
Pracoviště
Anotace
Machine learning has proven to be a valuable tool for automated malware detection, but machine learning systems have also been shown to be subject to adversarial attacks. This paper summarizes and compares related work on generating adversarial malware samples, specifically malicious Windows Portable Executable files. In contrast with previous research, we not only compare generators of adversarial malware examples theoretically, but we also provide an experimental comparison and evaluation for practical usability. We use gradient-based, evolutionary-based, and reinforcement-based approaches to create adversarial samples, which we test against selected antivirus products. The results show that applying optimized modifications to previously detected malware can lead to incorrect classification of the file as benign. Moreover, generated malicious samples can be effectively employed against detection models other than those used to produce them, and combinations of methods can construct new instances that avoid detection. Based on our findings, the Gym-malware generator, which uses reinforcement learning, has the greatest practical potential. This generator has the fastest average sample production time of 5.73 s and the highest average evasion rate of 44.11%. Using the Gym-malware generator in combination with itself further improved the evasion rate to 58.35%. However, other tested methods scored significantly lower in our experiments than reported in the original publications, highlighting the importance of a standardized evaluation environment.
Classification and online clustering of zero-day malware
Autoři
Rok
2024
Publikováno
Journal of Computer Virology and Hacking Techniques. 2024, 20(4), 579-592. ISSN 2263-8733.
Typ
Článek
Pracoviště
Anotace
A large amount of new malware is constantly being generated, which must not only be distinguished from benign samples, but also classified into malware families. For this purpose, investigating how existing malware families are developed and examining emerging families need to be explored. This paper focuses on the online processing of incoming malicious samples to assign them to existing families or, in the case of samples from new families, to cluster them. We experimented with seven prevalent malware families from the EMBER dataset, four in the training set and three additional new families in the test set. The features were extracted by static analysis of portable executable files for the Windows operating system. Based on the classification score of the multilayer perceptron, we determined which samples would be classified and which would be clustered into new malware families. We classified 97.21% of streaming data with a balanced accuracy of 95.33%. Then, we clustered the remaining data using a self-organizing map, achieving a purity from 47.61% for four clusters to 77.68% for ten clusters. These results indicate that our approach has the potential to be applied to the classification and clustering of zero-day malware into malware families.
Creating valid adversarial examples of malware
Autoři
Kozák, M.; Jureček, M.; Stamp, M.; Di Troia, F.
Rok
2024
Publikováno
Journal of Computer Virology and Hacking Techniques. 2024, 20(4), 607-621. ISSN 2263-8733.
Typ
Článek
Pracoviště
Anotace
Because of its world-class results, machine learning (ML) is becoming increasingly popular as a go-to solution for many tasks. As a result, antivirus developers are incorporating ML models into their toolchains. While these models improve malware detection capabilities, they also carry the disadvantage of being susceptible to adversarial attacks. Although this vulnerability has been demonstrated for many models in white-box settings, a black-box scenario is more applicable in practice for the domain of malware detection. We present a method of creating adversarial malware examples using reinforcement learning algorithms. The reinforcement learning agents utilize a set of functionality-preserving modifications, thus creating valid adversarial examples. Using the proximal policy optimization (PPO) algorithm, we achieved an evasion rate of 53.84% against the gradient-boosted decision tree (GBDT) detector. The PPO agent previously trained against the GBDT classifier scored an evasion rate of 11.41% against the neural network-based classifier MalConv and an average evasion rate of 2.31% against top antivirus programs. Furthermore, we discovered that random application of our functionality-preserving portable executable modifications successfully evades leading antivirus engines, with an average evasion rate of 11.65%. These findings indicate that ML-based models used in malware detection systems are sensitive to adversarial attacks and that better safeguards need to be taken to protect these systems.
Reducing overdefined systems of polynomial equations derived from small scale variants of the AES via data mining methods
Autoři
Berušková, J.; Jureček, M.; Jurečková, O.
Rok
2024
Publikováno
Journal of Computer Virology and Hacking Techniques. 2024, 20(4), 885-900. ISSN 2263-8733.
Typ
Článek
Pracoviště
Anotace
This paper deals with reducing the secret key computation time of small scale variants of the AES cipher using algebraic cryptanalysis, which is accelerated by data mining methods. This work is based on the known plaintext attack and aims to speed up the calculation of the secret key by processing the polynomial equations extracted from plaintext-ciphertext pairs. Specifically, we propose to transform the overdefined system of polynomial equations over GF(2) into a new system so that the computation of the Gröbner basis using the F4 algorithm takes less time than in the case of the original system. The main idea is to group similar polynomials into clusters, and for each cluster, sum the two most similar polynomials, resulting in simpler polynomials. We compare different data mining techniques for finding similar polynomials, such as clustering or locality-sensitive hashing (LSH). Experimental results show that using the LSH technique, we get a system of equations for which we can calculate the Gröbner basis the fastest compared to the other methods that we consider in this work. Experimental results also show that the time to calculate the Gröbner basis for the transformed system of equations is significantly reduced compared to the case when the Gröbner basis was calculated from the original non-transformed system. This paper demonstrates that reducing an overdefined system of equations reduces the computation time for finding a secret key.
Social media bot detection using Dropout-GAN
Autoři
Shukla, A.; Jureček, M.; Stamp, M.
Rok
2024
Publikováno
Journal of Computer Virology and Hacking Techniques. 2024, 20(4), 669-680. ISSN 2263-8733.
Typ
Článek
Pracoviště
Anotace
Bot activity on social media platforms is a pervasive problem, undermining the credibility of online discourse and potentially leading to cybercrime. We propose an approach to bot detection using Generative Adversarial Networks (GAN). We discuss how we overcome the issue of mode collapse by utilizing multiple discriminators to train against one generator, while decoupling the discriminator to perform social media bot detection and utilizing the generator for data augmentation. In terms of classification accuracy, our approach outperforms the state-of-the-art techniques in this field. We also show how the generator in the GAN can be used to evade such a classification technique.
Combining Generators of Adversarial Malware Examples to Increase Evasion Rate
Autoři
Kozák, M.; Jureček, M.
Rok
2023
Publikováno
Proceedings of the 20th International Conference on Security and Cryptography. Porto: SciTePress - Science and Technology Publications, 2023. p. 778-786. ISSN 2184-7711. ISBN 978-989-758-666-8.
Typ
Stať ve sborníku
Pracoviště
Anotace
Antivirus developers are increasingly embracing machine learning as a key component of malware defense. While machine learning achieves cutting-edge outcomes in many fields, it also has weaknesses that are exploited by several adversarial attack techniques. Many authors have presented both white-box and black-box generators of adversarial malware examples capable of bypassing malware detectors with varying success. We propose to combine contemporary generators in order to increase their potential. Combining different generators can create more sophisticated adversarial examples that are more likely to evade anti-malware tools. We demonstrated this technique on five well-known generators and recorded promising results. The best-performing combination of AMG-random and MAB-Malware generators achieved an average evasion rate of 15.9% against top-tier antivirus products. This represents an average improvement of more than 36% and 627% over using only the AMG-random and MAB-Malware generators, respectively. The generator that benefited the most from having another generator follow its procedure was the FGSM injection attack, which improved the evasion rate on average between 91.97% and 1,304.73%, depending on the second generator used. These results demonstrate that combining different generators can significantly improve their effectiveness against leading antivirus programs.
Generation of Adversarial Malware and Benign Examples Using Reinforcement Learning
Autoři
Kozák, M.; Jureček, M.; Lórencz, R.
Rok
2022
Publikováno
Cybersecurity for Artificial Intelligence. Springer, Cham, 2022. p. 3-25. Advances in Information Security. vol. 54. ISSN 1568-2633. ISBN 978-3-030-97086-4.
Typ
Kapitola v knize
Pracoviště
Anotace
Machine learning is becoming increasingly popular among antivirus
developers as a key factor in defence against malware. While machine learning is
achieving state-of-the-art results in many areas, it also has drawbacks exploited by
many with white-box attacks. Although the white-box scenario is possible in mal-
ware detection, the detailed structure of antivirus is often unknown. Consequently,
we focused on a pure black-box setup where no information apart from the predicted
label is known to the attacker, not even the feature space or predicted score.
We implemented our exploratory integrity attack using a reinforcement learning
approach on a dataset of portable executable binaries. We tested multiple agent
configurations while targeting LightGBM and MalConv classifiers. We achieved an
evasion rate of 68.64% and 13.32% against LightGBM and MalConv classifiers,
respectively. Besides traditional modelling of malware adversarial samples, we
present a setup for creating benign files that can increase the targeted classifier’s
false positive rate. This problem was considerably more challenging for our
reinforcement learning agents, with an evasion rate of 3.45% and 36.62% against
LightGBM and MalConv classifier, respectively. To understand how these attacks
transfer from classifiers based purely on machine learning to real-world anti-
malware software, we tested the same modified files against seven well-known
antiviruses. We achieved an evasion rate of up to 47.09% in malware and 14.29% in
benign adversarial attacks.
Interpretability of Machine Learning-Based Results of Malware Detection Using a Set of Rules
Autoři
Dolejš, J.; Jureček, M.
Rok
2022
Publikováno
Cybersecurity for Artificial Intelligence. Springer, Cham, 2022. p. 107-136. Advances in Information Security. vol. 54. ISSN 1568-2633. ISBN 978-3-030-97086-4.
Typ
Kapitola v knize
Pracoviště
Anotace
Machine learning plays an indispensable role in modern malware detection; it provides malware researchers with quick and reliable results. On the other
hand, the results can be hard to understand as to why a model classified a given
file as malicious or benign. This paper focuses on the interpretability of machine
learning models’ results using decision lists generated by two rule-based classifiers,
I-REP and RIPPER. We use the EMBER dataset, which contains features extracted
through static analysis from Portable Executable files, to train various machine
learning models. We extract decision lists from the machine learning models’
results using our implementation of I-REP and RIPPER. By taking into account
accuracies, true positive and false positive rates of the decision lists, we reason
whether the generated decision lists make a good representation of the results. To
comprehend the interpretability of the machine learning models, we define Human
Most Understandable Model and Interpretability Entropy. This allows us to measure
and compare the interpretability among the models. The most interpretable machine
learning model by RIPPER was Gaussian Naïve Bayes. Results show that RIPPER
is relatively successful at interpreting other machine learning models; however, it
needs some improvements to increase true positive rate.
Parallel Instance Filtering for Malware Detection
Autoři
Rok
2022
Publikováno
Proceedings of 2022 48th Euromicro Conference on Software Engineering and Advanced Applications. Los Alamitos: IEEE Computer Society, 2022. p. 13-20. ISBN 978-1-6654-6152-8.
Typ
Stať ve sborníku
Pracoviště
Anotace
Machine learning algorithms are widely used in the area of malware detection. With the growth of sample amounts, training of classification algorithms becomes more and more expensive. In addition, training data sets may contain redundant or noisy instances. The problem to be solved is how to select representative instances from large training data sets without reducing the accuracy. This work presents a new parallel instance selection algorithm called Parallel Instance Filtering (PIF). The main idea of the algorithm is to split the data set into non-overlapping subsets of instances covering the whole data set and apply a filtering process for each subset. Each subset consists of instances that have the same nearest enemy. As a result, the PIF algorithm is fast since subsets are processed independently of each other using parallel computation.
We compare the PIF algorithm with several state-of-the-art instance selection algorithms on a large data set of 500,000 malicious and benign samples. The feature set was extracted using static analysis, and it includes metadata from the portable executable file format. Our experimental results demonstrate that the proposed instance selection algorithm reduces the size of a training data set significantly with the only slightly decreased accuracy. The PIF algorithm outperforms existing instance selection methods used in the experiments in terms of the ratio between average classification accuracy and storage percentage.
Yet Another Algebraic Cryptanalysis of Small Scale Variants of AES
Autoři
Rok
2022
Publikováno
Proceedings of the 19th International Conference on Security and Cryptography. Madeira: SciTePress, 2022. p. 415-427. ISSN 2184-7711. ISBN 978-989-758-590-6.
Typ
Stať ve sborníku
Pracoviště
Anotace
This work presents new advances in algebraic cryptanalysis of small scale derivatives of AES. We model the cipher as a system of polynomial equations over GF(2), which involves only the variables of the initial key, and we subsequently attempt to solve this system using Gröbner bases. We show, for example, that one of the attacks can recover the secret key for one round of AES-128 under one minute on a contemporary CPU. This attack requires only two known plaintexts and their corresponding ciphertexts. We also compare the performance of Gröbner bases to a SAT solver, and provide an insight into the propagation of diffusion within the cipher.
Application of Distance Metric Learning to Automated Malware Detection
Autoři
Rok
2021
Publikováno
IEEE Access. 2021, 2021(9), 96151-96165. ISSN 2169-3536.
Typ
Článek
Pracoviště
Anotace
Distance metric learning aims to find the most appropriate distance metric parameters to improve similarity-based models such as k -Nearest Neighbors or k -Means. In this paper, we apply distance metric learning to the problem of malware detection. We focus on two tasks: (1) to classify malware and benign files with a minimal error rate, (2) to detect as much malware as possible while maintaining a low false positive rate. We propose a malware detection system using Particle Swarm Optimization that finds the feature weights to optimize the similarity measure. We compare the performance of the approach with three state-of-the-art distance metric learning techniques. We find that metrics trained in this way lead to significant improvements in the k -Nearest Neighbors classification. We conducted and evaluated experiments with more than 150,000 Windows-based malware and benign samples. Features consisted of metadata contained in the headers of executable files in the portable executable file format. Our experimental results show that our malware detection system based on distance metric learning achieves a 1.09 % error rate at 0.74 % false positive rate (FPR) and outperforms all machine learning algorithms considered in the experiment. Considering the second task related to keeping minimal FPR, we achieved a 1.15 % error rate at only 0.13 % FPR.
Improving Classification of Malware Families using Learning a Distance Metric
Autoři
Rok
2021
Publikováno
Proceedings of the 7th International Conference on Information Systems Security and Privacy. Madeira: SciTePress, 2021. p. 643-652. ISSN 2184-4356. ISBN 978-989-758-491-6.
Typ
Stať ve sborníku
Pracoviště
Anotace
The objective of malware family classification is to assign a tested sample to the correct malware family. This paper concerns the application of selected state-of-the-art distance metric learning techniques to malware families classification. The goal of distance metric learning algorithms is to find the most appropriate distance metric parameters concerning some optimization criteria. The distance metric learning algorithms considered in our research learn from metadata, mostly contained in the headers of executable files in the PE file format. Several experiments have been conducted on the dataset with 14,000 samples consisting of six prevalent malware families and benign files. The experimental results showed that the average precision and recall of the k-Nearest Neighbors algorithm using the distance learned on training data were improved significantly comparing when the non-learned distance was used. The k-Nearest Neighbors classifier using the Mahalanobis distance metric learned by the Metric Learning for Kernel Regression method achieved average precision and recall, both of 97.04% compared to Random Forest with a 96.44% of average precision and 96.41% of average recall, which achieved the best classification results among the state-of-the-art ML algorithms considered in our experiments.
Representation of PE Files using LSTM Networks
Autoři
Jureček, M.; Kozák, M.
Rok
2021
Publikováno
Proceedings of the 7th International Conference on Information Systems Security and Privacy. Madeira: SciTePress, 2021. p. 516-525. ISSN 2184-4356. ISBN 978-989-758-491-6.
Typ
Stať ve sborníku
Anotace
An ever-growing number of malicious attacks on IT infrastructures calls for new and efficient methods of protection. In this paper, we focus on malware detection using the Long Short-Term Memory (LSTM) as a preprocessing tool to increase the classification accuracy of machine learning algorithms. To represent the malicious and benign programs, we used features extracted from files in the PE file format. We created a large dataset on which we performed common feature preparation and feature selection techniques. With the help of various LSTM and Bidirectional LSTM (BLSTM) network architectures, we further transformed the collected features and trained other supervised ML algorithms on both transformed and vanilla datasets. Transformation by deep (4 hidden layers) versions of LSTM and BLSTM networks performed well and decreased the error rate of several state-of-the-art machine learning algorithms significantly. For each machine learning algorithm considered in our experiments, the LSTM-based transformation of the feature space results in decreasing the corresponding error rate by more than 58.60 %, in comparison when the feature space was not transformed using LSTM network.
Distance Metric Learning using Particle Swarm Optimization to Improve Static Malware Detection
Autoři
Rok
2020
Publikováno
Proceedings of the 6th International Conference on Information Systems Security and Privacy. Madeira: SciTePress, 2020. p. 725-732. ISSN 2184-4356. ISBN 978-989-758-399-5.
Typ
Stať ve sborníku
Pracoviště
Anotace
Distance metric learning is concerned with finding appropriate parameters of distance function with respect to a particular task. In this work, we present a malware detection system based on static analysis. We use k-nearest neighbors (KNN) classifier with weighted heterogeneous distance function that can handle nominal and numeric features extracted from portable executable file format. Our proposed approach attempts to specify the weights of the features using particle swarm optimization algorithm. The experimental results indicate that KNN with the weighted distance function improves classification accuracy significantly.
Automatická detekcia škodlivého kódu
Autoři
Rok
2019
Publikováno
Sborník příspěvků PAD 2019 – elektronická verze. Praha: AMCA spol. s r.o., 2019. p. 35-38. ISBN 978-80-88214-20-5.
Typ
Stať ve sborníku
Pracoviště
Anotace
Problém automatickej detekcie malware predstavuje
v súčasnosti pre antivírové spoločnosti veľké výzvy. Každý deň
sa vygeneruje čoraz väčšie množstvo nových neznámych vzoriek
škodlivého kódu, ktoré je potrebné včas detekovať. Pretože
manuálna analýza vzoriek z dôvodu ohromného množstva nie je
možná, je nutné využiť automatickú detekciu malware. Algoritmy
strojového učenia sa ukazujú byť vhodným nástrojom, ktorý je do
určitej miery schopný adaptovať sa a detekovať aj nové, doteraz
neznáme vzorky.
Side-Channel Attack on the A5/1 Stream Cipher
Autoři
Rok
2019
Publikováno
Proceedings of the 22nd Euromicro Conference on Digital Systems Design. Los Alamitos, CA: IEEE Computer Soc., 2019. p. 633-638. ISBN 978-1-7281-2862-7.
Typ
Stať ve sborníku
Pracoviště
Anotace
In this paper we present cryptanalysis of the A5/1
stream cipher used in GSM mobile phones. Our attack is based
on power analysis where we assume that the power consumption
while clocking 3 LFSRs is different than when clocking 2 LFSRs.
We demonstrate a simple power analysis (SPA) attack and discuss
existing differential power analysis (DPA). We present the attack
for recovering secret key based on the information on clocking
bits of LFSRs that was deduced from power analysis.
The attack has a 100% success rate, requires minimal storage
and it does not requires any single bit of a keystream. An average
time complexity of our attack based on SPA is around 233 where
the computation unit is a resolution of system of linear equations
over the Z2. Recovering the secret key using information from
the DPA has a constant complexity.
Malware Detection Using a Heterogeneous Distance Function
Autoři
Rok
2018
Publikováno
Computing and Informatics. 2018, 37(3), 759-780. ISSN 1335-9150.
Typ
Článek
Pracoviště
Anotace
Classication of automatically generated malware is an active research
area. The amount of new malware is growing exponentially and since manual in-
vestigation is not possible, automated malware classication is necessary. In this
paper, we present a static malware detection system for the detection of unknown
malicious programs which is based on combination of the weighted k-nearest neigh-
bors classier and the statistical scoring technique from. We have extracted the
most relevant features from portable executable (PE) le format using gain ratio
and have designed a heterogeneous distance function that can handle both linear
and nominal features. Our proposed detection method was evaluated on a dataset
with tens of thousands of malicious and benign samples and the experimental re-
sults show that the accuracy of our classier is 98.80%. In addition, preliminary
results indicate that the proposed similarity metric on our feature space could be
used for clustering malware into families.