Ing. Dominik Soukup

Publications

Analysis of Statistical Distribution Changes of Input Features in Network Traffic Classification Domain

Authors
Jančička, L.; Koumar, J.; Soukup, D.; Čejka, T.
Year
2024
Published in
NOMS 2024-2024 IEEE Network Operations and Management Symposium. Seoul: IEEE CLEO/Pacific Rim, 2024. ISSN 2374-9709. ISBN 979-8-3503-2793-9.
Type
Proceedings paper
Abstract
This study investigates the evolving landscape of network traffic monitoring, which is crucial for maintaining computer network services and security. Traditional methods like Deep Packet Inspection (DPI) face challenges due to increased privacy protection through encryption, prompting a shift towards statistics-based detection using Machine Learning (ML). However, ML struggles with long-term evaluation due to various distribution changes. This study focuses on the CESNET-TLS-Year22 dataset, derived from one year of TLS network traffic on the CESNET2 backbone. The described research explores the behavior of modern protocols in real-world scenarios and their impact on dataset quality. The main result of our analysis is the identification of the Weekend phenomenon in network traffic classification, which is generally overlooked during ML model training.

Analysis of Statistical Distribution Changes of Input Features in Network Traffic Classification Domain

Authors
Jančička, L.; Koumar, J.; Soukup, D.; Čejka, T.
Year
2024
Published in
Proceedings of the 12th Prague Embedded Systems Workshop. Praha: CTU. Faculty of Information Technology, 2024. ISBN 978-80-01-07303-2.
Type
Proceedings paper
Abstract
This study investigates the evolving landscape of network traffic monitoring, which is crucial for maintaining computer network services and security. Traditional methods like Deep Packet Inspection (DPI) face challenges due to increased privacy protection through encryption, prompting a shift towards statistics-based detection using Machine Learning (ML). However, ML struggles with long-term evaluation due to various distribution changes. This study focuses on the CESNET-TLS-Year22 dataset, derived from one year of TLS network traffic on the CESNET2 backbone. The described research explores the behavior of modern protocols in real-world scenarios and their impact on dataset quality. The main result of our analysis is the identification of the Weekend phenomenon in network traffic classification, which is generally overlooked during ML model training.

Machine Learning Metrics for Network Datasets Evaluation

Authors
Soukup, D.; Uhříček, D.; Vašata, D.; Čejka, T.
Year
2024
Published in
ICT Systems Security and Privacy Protection. Cham: Springer, 2024. p. 307-320. vol. 679. ISSN 1868-422X. ISBN 978-3-031-56326-3.
Type
Proceedings paper
Abstract
High-quality datasets are an essential requirement for leveraging machine learning (ML) in data processing and recently in network security as well. However, the quality of datasets is overlooked or underestimated very often. Having reliable metrics to measure and describe the input dataset enables the feasibility assessment of a dataset. Imperfect datasets may require optimization or updating, e.g., by including more data and merging class labels. Applying ML algorithms will not bring practical value if a dataset does not contain enough information. This work addresses the neglected topics of dataset evaluation and missing metrics. We propose three novel metrics to estimate the quality of an input dataset and help with its improvement or building a new dataset. This paper describes experiments performed on public datasets to show the benefits of the proposed metrics and theoretical definitions for more straightforward interpretation. Additionally, we have implemented and published Python code so that the metrics can be adopted by the worldwide scientific community.

MFWDD: Model-based Feature Weight Drift Detection Showcased on TLS and QUIC Traffic

Authors
Jančička, L.; Soukup, D.; Koumar, J.; Němec, F.; Čejka, T.
Year
2024
Published in
2024 20th International Conference on Network and Service Management (CNSM). New York: IEEE, 2024. ISSN 2165-963X. ISBN 978-3-903176-66-9.
Type
Proceedings paper
Abstract
Machine learning (ML) represents an efficient and popular approach for network traffic classification. However, network traffic inspection is a challenging domain and trained models may degrade soon after deployment. Besides biases present during data captures and model creation, data drifts contribute significantly to ML model degradation. This paper proposes a novel method called Model-based Feature Weight Drift Detection (MFWDD) for concept drift detection. It is a part of a public software framework suited for dataset drift analysis tailored to the domain of network traffic. This work addresses TLS and QUIC service classification problems, examines a variety of experiments analyzing the evolution of the respective distributions, and observes their degradation over time on different ML features. The MFWDD framework guided TLS and QUIC services classification models retraining throughout an extensive period and not only prevented model degradation but also improved its performance and consistency over time.

TCI: A system for distributed network monitoring, troubleshooting and dataset creation

Authors
Soukup, D.; Pešek, J.; Hejcman, L.; Beneš, D.; Čejka, T.
Year
2024
Published in
Proceedings of the 12th Prague Embedded Systems Workshop. Praha: CTU. Faculty of Information Technology, 2024. ISBN 978-80-01-07303-2.
Type
Proceedings paper

TCI: A system for distributed network monitoring, troubleshooting and dataset creation

Authors
Soukup, D.; Pešek, J.; Hejcman, L.; Beneš, D.; Čejka, T.
Year
2024
Published in
NOMS 2024-2024 IEEE Network Operations and Management Symposium. Seoul: IEEE CLEO/Pacific Rim, 2024. ISSN 2374-9709. ISBN 979-8-3503-2793-9.
Type
Proceedings paper
Abstract
Network traffic monitoring is a very complex task that requires a combination of multiple tools and teams. Very often, detected events must be validated and confirmed, or ongoing detection needs additional detailed data from full packets. All these activities must be done automatically and with regard to data privacy. This is why we propose a solution in the form of Traffic Capture Infrastructure (TCI), a single system for network traffic capture, investigation, and dataset creation, even in high-speed provider networks. Our system supports extensive user management features to ensure dataset privacy, system integrity, and unified control over many network probes. This paper presents the architecture, main functions, recommendations, and lessons learnt from full packet monitoring in today’s networks. Lastly, we demonstrate the value of this system with several publications that have used it to create their underlying datasets and to investigate network traffic.

Active Learning Framework For Long-term Network Traffic Classification

Authors
Pešek, J.; Soukup, D.; Čejka, T.
Year
2023
Published in
IEEE Annual Computing and Communication Workshop and Conference (CCWC). New Jersey: IEEE, 2023. p. 893-899. ISBN 979-8-3503-3286-5.
Type
Invited or awarded proceedings paper
Abstract
Recent network traffic classification methods benefit from machine learning (ML) technology. However, the use of ML brings many challenges, such as a lack of high-quality annotated datasets, data drift and other effects that cause datasets and ML models to age, and high volumes of network traffic. This paper presents the benefits of augmenting traditional ML training and deployment workflows and of adapting the Active Learning (AL) concept to network traffic analysis. The paper proposes a novel Active Learning Framework (ALF) to address this topic. ALF provides prepared software components that can be used to deploy an AL loop and maintain an ALF instance that continuously and automatically evolves a dataset and an ML model. Moreover, ALF includes monitoring, dataset quality evaluation, and optimization capabilities that enhance the current state of the art in the AL domain. The resulting solution is deployable for IP flow-based analysis of high-speed (100 Gb/s) networks, where it was evaluated for more than eight months. Additional use cases were evaluated on publicly available datasets.

Evaluation of the Limit of Detection in Network Dataset Quality Assessment with PerQoDA

Authors
Wasielewska, K.; Soukup, D.; Čejka, T.; Camacho, J.
Year
2023
Published in
ECML PKDD 2022: Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Cham: Springer, 2023. p. 170-185. ISSN 1865-0929. ISBN 978-3-031-23632-7.
Type
Proceedings paper
Abstract
Machine learning is recognised as a relevant approach to detect attacks and other anomalies in network traffic. However, there are still no suitable network datasets that would enable effective detection. On the other hand, the preparation of a network dataset is not easy, both for privacy reasons and because of the lack of tools for assessing dataset quality. In a previous paper, we proposed a new method for data quality assessment based on permutation testing. This paper presents a parallel study on the limits of detection of such an approach. We focus on the problem of network flow classification and use well-known machine learning techniques. The experiments were performed using publicly available network datasets.

Automated Annotation of Network Traffic with Data from Web Browser

Authors
Kala, J.; Soukup, D.
Year
2022
Published in
Proceedings of the 10th Prague Embedded Systems Workshop. Praha: CTU. Faculty of Information Technology, 2022. p. 59-67. ISBN 978-80-01-07015-4.
Type
Proceedings paper
Abstract
Encrypted traffic classification requires Machine Learning (ML) algorithms and a large amount of data to learn patterns and classify network communication without decrypting it. For the learning stage of ML models, we need a reliable and trusted dataset that delivers the ground truth for the whole classification. However, building a dataset is a very complicated and time-consuming task that prevents ML from being used in the production environment of target networks. The aim of this work is to address this topic for network flow annotation using web traffic data. This paper introduces the problem of network IP flow monitoring, analysis and classification. This problem is demonstrated on the HTTP and HTTPS protocols. Moreover, this work describes a technique for collecting data from web browsers and pairing it with traffic flows to create a reliable annotated dataset automatically.

Vision of Active Learning Framework Approach to Network Traffic Analysis Research

Authors
Pešek, J.; Soukup, D.; Čejka, T.
Year
2022
Published in
Proceedings of the 10th Prague Embedded Systems Workshop. Praha: CTU. Faculty of Information Technology, 2022. p. 68-72. ISBN 978-80-01-07015-4.
Type
Proceedings paper
Abstract
Current research in the network security domain intensively uses machine learning (ML) and artificial intelligence to automate processes and reveal hidden patterns in data. These technologies, however, require large amounts of training data, ideally of high quality. Additionally, network infrastructures continuously evolve, and thus network traffic changes dynamically over time as well. There is an urgent need to adapt machine learning models, update datasets with the latest samples of annotated network traffic, and retrain the models regularly to sustain feasible performance. Active Learning Framework (ALF) directly targets these demands and aims to provide a modular platform for scientific experiments and deployment in practice, as well as to support research activities regarding the quality of datasets. This paper describes the ALF software in particular and proposes possible use cases in the research and practice domains.

Towards Evaluating Quality of Datasets for Network Traffic Domain

Authors
Soukup, D.; Tisovčík, P.; Hynek, K.; Čejka, T.
Year
2021
Published in
Proceedings of the 2021 17th International Conference on Network and Service Management. New York: IEEE, 2021. p. 264-268. ISSN 2165-963X. ISBN 978-3-903176-36-2.
Type
Proceedings paper
Abstract
This paper deals with the quality of network traffic datasets created to train and validate machine learning classification and detection methods. Naturally, there is a long history of research targeted at data quality; however, it is focused mainly on data consistency, validity, precision, and other metrics, which are insufficient for network traffic use cases. The rise of machine learning usage in network monitoring applications requires a new methodology for evaluating datasets. There is a need to evaluate and compare traffic samples captured under different conditions and to decide on the usability of already captured and annotated data. This paper aims to explain a use case of dataset creation, propose definitions regarding the quality of network traffic datasets, and finally, describe a framework for dataset analysis.

Behavior Anomaly Detection in IoT Networks

Year
2020
Published in
Proceeding of the International Conference on Computer Networks, Big Data and IoT (ICCBI - 2019). Cham: Springer International Publishing, 2020. p. 465-473. Lecture Notes on Data Engineering and Communications Technologies. vol. 49. ISSN 2367-4520. ISBN 978-3-030-43192-1.
Type
Book chapter
Abstract
Data encryption makes deep packet inspection less suitable nowadays, and the need to analyze encrypted traffic is growing. Machine learning brings new options for recognizing the type of communication, despite the heterogeneity of encrypted IoT traffic, right at the network edge. We propose the design of a scalable architecture and a method for behavior anomaly detection in IoT networks. The combination of two existing semi-supervised techniques that we used ensures higher reliability of anomaly detection and improves on the results achieved by a single method. We describe the classification and anomaly detection experiments we conducted, made possible by existing datasets and our own training datasets. The satisfying results provide a basis for further work and allow us to elaborate on this idea.

QoD: Ideas about Evaluating Quality of Datasets

Year
2020
Published in
Proceedings of the 8th Prague Embedded Systems Workshop. Praha: Czech Technical University in Prague, 2020. p. 8-9. ISBN 978-80-01-06772-7.
Type
Proceedings paper
Abstract
The importance of computer networks is rising every year. The reason is that we are connecting more and more devices and applications, and our daily routines depend on connectivity. On the other hand, this creates great potential for attackers. They can hide their activities in a complex network environment and steal valuable data. Without a solid dataset, the evaluation score misrepresents the real score in a production environment; therefore, proper datasets play an essential role in the research and development of any ML-based classifier or detector. The main motivation for this paper is to find a way to evaluate the quality of any dataset in order to estimate whether it is good enough for ML experiments. To the best of our knowledge, there are only a few studies focused on quality evaluation of datasets with network traffic. For experiments, we selected datasets about DNS over HTTPS (DoH) detection and URL classification problems that are already being elaborated. All metrics are calculated at the dataset level. The impact of these metrics is evaluated on a Random Forest (RF) model. We show the results we have discovered in our datasets and ML detection modules. Finally, we discuss possible next steps in this research.

Security Framework for IoT and Fog Computing Networks

Authors
Soukup, D.; Hujňák, O.; Štefunko, S.; Krejčí, R.; Grešák, E.
Year
2019
Published in
3rd International conference on I-SMAC. Piscataway, NJ: IEEE, 2019. p. 87-92. ISBN 978-1-7281-4365-1.
Type
Invited or awarded proceedings paper
Abstract
Our environment is becoming more and more interconnected. Various devices like refrigerators, doors or light bulbs communicate over different networks and provide information for applications that are supposed to make our lives easier and more comfortable. However, such data reveal sensitive information about our presence or habits and become attractive to network attackers. It is very challenging to detect incidents in heterogeneous IoT networks where different devices come and go or change their network profiles quite frequently. We propose a security framework for IoT and fog computing networks to address these challenges. Our framework is very flexible and designed even for devices with limited computational power. All components can be deployed on one network node or distributed among many, which also allows easy scalability. Part of our solution is a software IoT gateway that provides the capability to analyse traffic from non-IP IoT sensors. This project covers a full-stack security solution because it contains collectors, detectors and management tools. The framework has only software components with no relation to any specific hardware device. It is developed as an open-source project and is publicly available to the worldwide community. The currently developed detectors detect identified vulnerabilities for Z-Wave, Long Range Wide Area Network (LoRaWAN), BLE and IP-based IoT protocols.