Datasets are essential for training and testing machine learning and artificial intelligence algorithms that identify security indicators and anticipate incidents.
Datasets for Security Research in the areas of Security Data Mining & Security Analytics
KDDCup-99 is an older dataset, released in 1999 for the Third International Knowledge Discovery and Data Mining Tools Competition. The competition task was to build a predictive network intrusion detection model capable of distinguishing attacks from normal network traffic. The dataset provides a standard set of data to be audited, including a wide variety of intrusions simulated in a military network environment.
The CIC IDS 2017 dataset was created by the Canadian Institute for Cybersecurity. It contains benign traffic and common attacks that were up to date at the time of its release. According to the authors, the network traffic was analyzed with CICFlowMeter, and the resulting flows are published as CSV files labeled by timestamp, source and destination IPs, source and destination ports, protocol, and attack. Generating realistic background traffic was a priority: the authors used their B-Profile system to profile the abstract behaviour of human interactions and generate naturalistic benign background traffic. The dataset is built upon the abstract behaviour of 25 users based on the HTTP, HTTPS, FTP, SSH, and email protocols. CIC IDS 2017 has over 2.83M examples (2.27M benign and 557,646 malicious). By comparison, NSL-KDD, the refined version of KDDCup-99, has 148,517 records (77,054 benign and 71,463 malicious).
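The counts quoted above imply very different class balances, which matters when choosing evaluation metrics. The following is a minimal sketch that only redoes the arithmetic from the figures in the text; the function name benign_share is illustrative, and 2.27M is treated as the rounded value 2,270,000.

```python
def benign_share(benign: int, malicious: int) -> float:
    """Return the fraction of examples that are labeled benign."""
    return benign / (benign + malicious)


# Figures as quoted in the text; 2.27M is a rounded value (assumption).
cic_share = benign_share(2_270_000, 557_646)

# Figures for the 148,517-record comparison dataset quoted in the text.
kdd_share = benign_share(77_054, 71_463)

print(f"CIC IDS 2017 benign share: {cic_share:.1%}")
print(f"Comparison dataset benign share: {kdd_share:.1%}")
```

Roughly four out of five CIC IDS 2017 examples are benign, while the comparison dataset is close to balanced, so plain accuracy is a weaker signal on the former.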
UNSW-NB15 was generated by configuring a testbed environment to simulate current representations of normal and abnormal network traffic. The generated data is a hybrid of real modern normal activities and synthetic contemporary attack behaviours. The tcpdump tool was used to capture 100 GB of raw traffic. The dataset covers nine types of attacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. It contains more than 2 million observations and 49 features, of which 47 are input features and 2 are labels. The two labels support both binary and multiclass classification.
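To illustrate how the two labels serve binary and multiclass classification, here is a minimal sketch that separates input features from labels. It assumes the label columns are named attack_cat (attack category) and label (0 for normal, 1 for attack); the sample rows are made up for illustration and carry only three of the 47 input features.

```python
import csv
import io

# Made-up sample rows for illustration only; real records have 47 input features.
SAMPLE = """dur,proto,sbytes,attack_cat,label
0.12,tcp,496,Normal,0
0.03,udp,114,Fuzzers,1
1.40,tcp,8928,DoS,1
"""

# Assumed names of the two label columns.
LABEL_COLUMNS = ("attack_cat", "label")


def split_row(row):
    """Split one record into (input features, binary label, attack category)."""
    features = {k: v for k, v in row.items() if k not in LABEL_COLUMNS}
    return features, int(row["label"]), row["attack_cat"]


features, binary_targets, multiclass_targets = [], [], []
for row in csv.DictReader(io.StringIO(SAMPLE)):
    x, y_binary, y_multi = split_row(row)
    features.append(x)
    binary_targets.append(y_binary)
    multiclass_targets.append(y_multi)

print(binary_targets)
print(multiclass_targets)
```

The same feature rows can then feed either a binary detector (using binary_targets) or an attack-type classifier (using multiclass_targets).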