GitHub - eeechun/CDMC2019

Task Description

This IoT malware classification 2019 dataset is provided by the Taiwan Information Security Center (TWISC).

The aim of this task is to classify IoT malware. The features provided for classification are the sequences of system calls captured while the malware runs in a sandbox environment. The dataset contains two parts:

• TRAINING: 4167 formatted sequences of system calls, labeled by the type of the malware.
• TESTING: 4275 files without known class labels.

NOTE the following difference between the training and test sets. For the training set, the label of each sample (the sample file format is described below) is provided in the label file, whilst the TEST.label file used for competition evaluation is withheld for future use.

This dataset consists of 8442 samples generated by the following procedure. First, a collection of potentially malicious Linux programs in ELF format is gathered from various sources. Then, each of these programs is executed in a sandboxed environment hosted by an emulator that provides the runtime environment the program requires. During execution, the strace command is used to monitor and record the interactions between the processes started by the program and the Linux kernel. This yields a log file that contains lines of system calls. On each line, strace records the timestamp and the invoked system call, as well as the parameters and results of the call.

These log files are parsed and rewritten in the simplified format used by the .seq files. The name of a .seq file indicates the index of the sample (i.e., a malicious program) in the dataset. A .seq file may contain multiple lines, each line being the sequence of system calls invoked by one process started by the malware. The system calls on each line are listed in ascending order of call time, and the processes are listed in ascending order of creation time.
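The file layout described above (one line per process, system calls in call order) can be read with a few lines of Python. This is only a minimal sketch: the description does not state how system calls are separated within a line, so whitespace separation is an assumption here, and `parse_seq`, `syscall_counts`, and the sample text are hypothetical names and data invented for illustration.

```python
from collections import Counter


def parse_seq(text, sep=None):
    """Split the contents of a .seq file into one list of calls per process.

    ASSUMPTION: calls on a line are separated by whitespace (sep=None).
    Pass a different `sep` if the real files use another delimiter.
    """
    processes = []
    for line in text.splitlines():
        calls = [token.strip() for token in line.split(sep)]
        calls = [token for token in calls if token]
        if calls:  # ignore blank lines
            processes.append(calls)
    return processes


def syscall_counts(processes):
    """Count how often each system call occurs across all processes."""
    counts = Counter()
    for calls in processes:
        counts.update(calls)
    return counts


# Hypothetical sample content, not taken from the dataset.
sample = "execve brk open read close\nsocket connect write close\n"
procs = parse_seq(sample)
counts = syscall_counts(procs)
```

Counting calls discards their order; it is shown only as one simple way to turn variable-length sequences into fixed-size features, not as the method used in this repository.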

All the .seq files used for training a prediction model can be found in the "TRAIN" folder. All the .seq files used for evaluating a prediction model can be found in the "TEST" folder. Along with the .seq files, a TRAIN.label file is provided. TRAIN.label is a comma-separated values file: the first column is the index of the sample in the training set, and the second column is the class label of that sample. For instance, "1111,5" indicates that sample number 1111 (i.e., file 1111.seq) in the TRAIN folder belongs to class 5. The lexical meaning of the labels is withheld for reasons of fairness.
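The label format above maps directly onto Python's `csv` module. The following is a sketch under stated assumptions: the file is assumed to have no header row, labels are kept as strings because their meaning is not disclosed, and `load_labels` and `seq_filename` are hypothetical helper names.

```python
import csv
import io


def load_labels(lines):
    """Read "index,label" rows into a dict mapping sample index to label.

    ASSUMPTION: no header row; rows with fewer than two columns are skipped.
    """
    labels = {}
    for row in csv.reader(lines):
        if len(row) < 2:
            continue
        labels[int(row[0])] = row[1].strip()
    return labels


def seq_filename(index):
    """Name of the .seq file for a sample index, e.g. 1111 -> "1111.seq"."""
    return f"{index}.seq"


# The example row quoted in the description; with a real file, pass an open
# file object instead, e.g. open("TRAIN.label", newline="").
example = io.StringIO("1111,5\n")
labels = load_labels(example)
```

Keying the dictionary by sample index makes it straightforward to pair each label with its file in the TRAIN folder via `seq_filename`.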

Repository files: CDMC2019.py, CDMC2019Task2Train.csv, CDMC2019Task2Train.label, README.md
