cesnet-datazoo 0.1.0__tar.gz → 0.1.2__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (35)
  1. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/PKG-INFO +3 -6
  2. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/README.md +2 -2
  3. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/config.py +30 -25
  4. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/constants.py +1 -0
  5. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/datasets/cesnet_dataset.py +10 -10
  6. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/datasets/metadata/metadata.csv +1 -1
  7. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/pytables_data/data_scalers.py +3 -37
  8. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/pytables_data/indices_setup.py +4 -1
  9. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/pytables_data/pytables_dataset.py +9 -5
  10. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/utils/class_info.py +0 -2
  11. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo.egg-info/PKG-INFO +3 -6
  12. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo.egg-info/requires.txt +0 -3
  13. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/pyproject.toml +1 -4
  14. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/LICENCE +0 -0
  15. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/__init__.py +0 -0
  16. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/datasets/__init__.py +0 -0
  17. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/datasets/datasets.py +0 -0
  18. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/datasets/datasets_constants.py +0 -0
  19. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/datasets/loaders.py +0 -0
  20. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/datasets/metadata/__init__.py +0 -0
  21. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/datasets/metadata/dataset_metadata.py +0 -0
  22. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/datasets/statistics.py +0 -0
  23. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/metrics/__init__.py +0 -0
  24. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/metrics/classification_report.py +0 -0
  25. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/metrics/provider_metrics.py +0 -0
  26. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/pytables_data/__init__.py +0 -0
  27. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/pytables_data/apps_split.py +0 -0
  28. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/utils/__init__.py +0 -0
  29. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/utils/download.py +0 -0
  30. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/utils/fileutils.py +0 -0
  31. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo/utils/random.py +0 -0
  32. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo.egg-info/SOURCES.txt +0 -0
  33. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo.egg-info/dependency_links.txt +0 -0
  34. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/cesnet_datazoo.egg-info/top_level.txt +0 -0
  35. {cesnet-datazoo-0.1.0 → cesnet-datazoo-0.1.2}/setup.cfg +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: cesnet-datazoo
- Version: 0.1.0
+ Version: 0.1.2
  Summary: A toolkit for large network traffic datasets
  Author-email: Jan Luxemburk <luxemburk@cesnet.cz>, Karel Hynek <hynekkar@cesnet.cz>
  Maintainer-email: Jan Luxemburk <luxemburk@cesnet.cz>, Karel Hynek <hynekkar@cesnet.cz>
@@ -29,10 +29,7 @@ Requires-Dist: tables>=3.8.0
  Requires-Dist: torch>=1.10
  Requires-Dist: tqdm
  Provides-Extra: dev
- Requires-Dist: black; extra == "dev"
  Requires-Dist: build; extra == "dev"
- Requires-Dist: jupyterlab; extra == "dev"
- Requires-Dist: lightgbm; extra == "dev"
  Requires-Dist: mkdocs-autorefs; extra == "dev"
  Requires-Dist: mkdocs-material-extensions; extra == "dev"
  Requires-Dist: mkdocs-material; extra == "dev"
@@ -57,7 +54,7 @@ The goal of this project is to provide tools for working with large network traf
  - Extensive configuration options for:
      - Selection of train, validation, and test periods.
      - Selection of application classes and splitting classes between *known* and *unknown*.
-     - Feature scaling.
+     - Data transformations, such as feature scaling.
  - Built on suitable data structures for experiments with large datasets. There are several caching mechanisms to make repeated runs faster, for example, when searching for the best model configuration.
  - Datasets are offered in multiple sizes to give users an option to start the experiments at a smaller scale (also faster dataset download, disk space, etc.). The default is the `S` size containing 25 million samples.

@@ -72,7 +69,7 @@ The package is able to handle the following datasets:
  | _Collection duration_ | 2 weeks | 4 weeks | 1 year |
  | _Collection period_ | 4.10.2021 - 17.10.2021 | 31.10.2022 - 27.11.2022 | 1.1.2022 - 31.12.2022 |
  | _Application count_ | 191 | 102 | 180 |
- | _Available samples_ | 141720670 | 153226273 | 507739073 |
+ | _Available samples_ | 141392195 | 153226273 | 507739073 |
  | _Available dataset sizes_ | XS, S, M, L | XS, S, M, L | XS, S, M, L |
  | _Cite_ | [https://doi.org/10.1016/j.comnet.2022.109467](https://doi.org/10.1016/j.comnet.2022.109467) | [https://doi.org/10.1016/j.dib.2023.108888](https://doi.org/10.1016/j.dib.2023.108888) | |
  | _Zenodo URL_ | [https://zenodo.org/record/7965515](https://zenodo.org/record/7965515) | [https://zenodo.org/record/7963302](https://zenodo.org/record/7963302) | |
@@ -14,7 +14,7 @@ The goal of this project is to provide tools for working with large network traf
  - Extensive configuration options for:
      - Selection of train, validation, and test periods.
      - Selection of application classes and splitting classes between *known* and *unknown*.
-     - Feature scaling.
+     - Data transformations, such as feature scaling.
  - Built on suitable data structures for experiments with large datasets. There are several caching mechanisms to make repeated runs faster, for example, when searching for the best model configuration.
  - Datasets are offered in multiple sizes to give users an option to start the experiments at a smaller scale (also faster dataset download, disk space, etc.). The default is the `S` size containing 25 million samples.

@@ -29,7 +29,7 @@ The package is able to handle the following datasets:
  | _Collection duration_ | 2 weeks | 4 weeks | 1 year |
  | _Collection period_ | 4.10.2021 - 17.10.2021 | 31.10.2022 - 27.11.2022 | 1.1.2022 - 31.12.2022 |
  | _Application count_ | 191 | 102 | 180 |
- | _Available samples_ | 141720670 | 153226273 | 507739073 |
+ | _Available samples_ | 141392195 | 153226273 | 507739073 |
  | _Available dataset sizes_ | XS, S, M, L | XS, S, M, L | XS, S, M, L |
  | _Cite_ | [https://doi.org/10.1016/j.comnet.2022.109467](https://doi.org/10.1016/j.comnet.2022.109467) | [https://doi.org/10.1016/j.dib.2023.108888](https://doi.org/10.1016/j.dib.2023.108888) | |
  | _Zenodo URL_ | [https://zenodo.org/record/7965515](https://zenodo.org/record/7965515) | [https://zenodo.org/record/7963302](https://zenodo.org/record/7963302) | |
@@ -113,7 +113,7 @@ class DatasetConfig():

  - Train, validation, test sets (dates, sizes, validation approach).
  - Application selection — either the standard closed-world setting (only *known* classes) or the open-world setting (*known* and *unknown* classes).
- - Feature scaling. See the [data features][features] page for more information. DOCS_TODO
+ - Data transformations. See the [transforms][transforms] page for more information.
  - Dataloader options like batch sizes, order of loading, or number of workers.

  When initializing this class, pass a [`CesnetDataset`][datasets.cesnet_dataset.CesnetDataset] instance to be configured and the desired configuration. Available options are [here][config.DatasetConfig--configuration-options].
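For orientation, a minimal sketch of the initialization flow this docstring describes, assuming the `CESNET_QUIC22` class from `cesnet_datazoo.datasets`, a hypothetical data root, and an assumed initialization hook; exact signatures may differ from the package API:

```python
from cesnet_datazoo.config import DatasetConfig
from cesnet_datazoo.datasets import CESNET_QUIC22

# Hypothetical data root; size picks one of the XS/S/M/L dataset variants.
dataset = CESNET_QUIC22("/data/CESNET-QUIC22", size="XS")

# Configure the dataset; unset options fall back to the defaults listed below.
config = DatasetConfig(dataset=dataset)
dataset.set_dataset_config_and_initialize(config)  # assumed initialization hook
```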
@@ -133,7 +133,7 @@ class DatasetConfig():

  Attributes:
      need_train_set: Use to disable the train set. `Default: True`
-     need_val_set: Use to disable the validation set. When `need_train_set` is false, the validation set will also be disabled. `Default: True`
+     need_val_set: Use to disable the validation set. `Default: True`
      need_test_set: Use to disable the test set. `Default: True`
      train_period_name: Name of the train period. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets].
      train_dates: Dates used for creating a train set.
@@ -161,7 +161,7 @@ class DatasetConfig():
      val_workers: Number of workers for loading validation data. `0` means that the data will be loaded in the main process. `Default: 1`
      batch_size: Number of samples per batch. `Default: 192`
      test_batch_size: Number of samples per batch for loading validation and test data. `Default: 2048`
-     preload_val: Whether to dump the validation set with `numpy.savez_compressed` and preload it in future runs. Useful when running a lot of experiments with the same dataset configuration. `Default: True`
+     preload_val: Whether to dump the validation set with `numpy.savez_compressed` and preload it in future runs. Useful when running a lot of experiments with the same dataset configuration. `Default: False`
      preload_test: Whether to dump the test set with `numpy.savez_compressed` and preload it in future runs. `Default: False`
      train_size: Size of the train set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`
      val_known_size: Size of the validation set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`
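The `preload_val`/`preload_test` mechanism described here is a `numpy.savez_compressed` round-trip; a minimal sketch, with an illustrative file name and array keys rather than the package's internal layout:

```python
import numpy as np

val_data = np.random.rand(100, 8).astype(np.float32)  # stand-in for the assembled set
val_labels = np.random.randint(0, 10, size=100)

# First run: dump the validation set to a compressed archive.
np.savez_compressed("val-preload.npz", data=val_data, labels=val_labels)

# Later runs: load the archive instead of re-reading the PyTables database.
cached = np.load("val-preload.npz")
val_data, val_labels = cached["data"], cached["labels"]
```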
@@ -176,10 +176,10 @@ class DatasetConfig():
      use_packet_histograms: Whether to use packet histogram features, if available in the dataset. `Default: True`
      use_tcp_features: Whether to use TCP features, if available in the dataset. `Default: True`
      use_push_flags: Whether to use push flags in packet sequences, if available in the dataset. `Default: False`
-     fit_scalers_samples: Fraction of train samples used for fitting feature scalers, if float. The absolute number of samples otherwise. `Default: 0.25` DOCS_TODO
-     ppi_transform: Transform function for PPI sequences. `Default: None` DOCS_TODO
-     flowstats_transform: Transform function for flow statistics. `Default: None`
-     flowstats_phist_transform: Transform function for packet histograms. `Default: None`
+     fit_scalers_samples: Used when a scaling transformation is configured and requires fitting. Fraction of train samples used for fitting, if float. The absolute number of samples otherwise. `Default: 0.25`
+     ppi_transform: Transform function for PPI sequences. See the [transforms][transforms] page for more information. `Default: None`
+     flowstats_transform: Transform function for flow statistics. See the [transforms][transforms] page for more information. `Default: None`
+     flowstats_phist_transform: Transform function for packet histograms. See the [transforms][transforms] page for more information. `Default: None`

  # How to configure train, validation, and test sets
  There are three options for how to define train/validation/test dates.
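A hedged sketch of wiring scaling transforms into these fields; the `cesnet_datazoo.transforms` import path and the no-argument constructors are assumptions inferred from the `ClipAndScalePPI` and `ClipAndScaleFlowstats` names appearing later in this diff:

```python
from cesnet_datazoo.config import DatasetConfig
from cesnet_datazoo.transforms import ClipAndScalePPI, ClipAndScaleFlowstats  # assumed path

config = DatasetConfig(
    dataset=dataset,                               # a CesnetDataset instance
    ppi_transform=ClipAndScalePPI(),               # scales packet sizes and inter-packet times
    flowstats_transform=ClipAndScaleFlowstats(),   # clips at quantiles, then scales
    fit_scalers_samples=0.25,                      # fraction of train samples used for fitting
)
```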
@@ -238,7 +238,7 @@ class DatasetConfig():
      val_workers: int = 1
      batch_size: int = 192
      test_batch_size: int = 2048
-     preload_val: bool = True
+     preload_val: bool = False
      preload_test: bool = False
      train_size: int | Literal["all"] = "all"
      val_known_size: int | Literal["all"] = "all"
@@ -268,7 +268,6 @@ class DatasetConfig():
      self.database_path = dataset.database_path

      if not self.need_train_set:
-         self.need_val_set = False
          if self.apps_selection != AppSelection.FIXED:
              raise ValueError("Application selection has to be fixed when need_train_set is false")
          if (len(self.train_dates) > 0 or self.train_period_name != ""):
@@ -281,7 +280,7 @@ class DatasetConfig():
          if self.train_period_name not in dataset.time_periods:
              raise ValueError(f"Unknown train_period_name {self.train_period_name}. Use time period available in dataset.time_periods")
          self.train_dates = dataset.time_periods[self.train_period_name]
-     if len(self.train_dates) == 0 and self.test_period_name == "":
+     if len(self.train_dates) == 0 and self.train_period_name == "":
          self.train_period_name = dataset.default_train_period_name
          self.train_dates = dataset.time_periods[dataset.default_train_period_name]
      # Configure test dates
@@ -299,17 +298,24 @@ class DatasetConfig():
          self.test_period_name = dataset.default_test_period_name
          self.test_dates = dataset.time_periods[dataset.default_test_period_name]
      # Configure val dates
-     if (not self.need_val_set or self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN) and (len(self.val_dates) > 0 or self.val_period_name != ""):
-         raise ValueError("val_dates and val_period_name cannot be specified when need_val_set is false or the validation approach is split-from-train")
-     if self.val_approach == ValidationApproach.VALIDATION_DATES:
-         if len(self.val_dates) > 0 and self.val_period_name == "":
-             raise ValueError("val_period_name has to be specified when val_dates are set")
-         if len(self.val_dates) == 0 and self.val_period_name != "":
-             if self.val_period_name not in dataset.time_periods:
-                 raise ValueError(f"Unknown val_period_name {self.val_period_name}. Use time period available in dataset.time_periods")
-             self.val_dates = dataset.time_periods[self.val_period_name]
-         if len(self.val_dates) == 0 and self.val_period_name == "":
-             raise ValueError("val_period_name and val_dates (or val_period_name from dataset.time_periods) have to be specified when the validation approach is validation-dates")
+     if not self.need_val_set:
+         if len(self.val_dates) > 0 or self.val_period_name != "" or self.val_approach != ValidationApproach.SPLIT_FROM_TRAIN:
+             raise ValueError("val_dates, val_period_name, and val_approach cannot be specified when need_val_set is false")
+     else:
+         if self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:
+             if len(self.val_dates) > 0 or self.val_period_name != "":
+                 raise ValueError("val_dates and val_period_name cannot be specified when the validation approach is split-from-train")
+             if not self.need_train_set:
+                 raise ValueError("Cannot use the split-from-train validation approach when need_train_set is false. Either use the validation-dates approach or set need_val_set to false.")
+         elif self.val_approach == ValidationApproach.VALIDATION_DATES:
+             if len(self.val_dates) > 0 and self.val_period_name == "":
+                 raise ValueError("val_period_name has to be specified when val_dates are set")
+             if len(self.val_dates) == 0 and self.val_period_name != "":
+                 if self.val_period_name not in dataset.time_periods:
+                     raise ValueError(f"Unknown val_period_name {self.val_period_name}. Use time period available in dataset.time_periods")
+                 self.val_dates = dataset.time_periods[self.val_period_name]
+             if len(self.val_dates) == 0 and self.val_period_name == "":
+                 raise ValueError("val_period_name and val_dates (or val_period_name from dataset.time_periods) have to be specified when the validation approach is validation-dates")
      # Check if train, val, and test dates are available in the dataset
      bad_train_dates = [t for t in self.train_dates if t not in dataset.available_dates]
      bad_val_dates = [t for t in self.val_dates if t not in dataset.available_dates]
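Two hedged configuration sketches matching the branches above; the field names come straight from this diff, while the period name is hypothetical:

```python
from cesnet_datazoo.config import DatasetConfig, ValidationApproach

# a) Carve the validation set out of the train data; val_dates and
#    val_period_name must stay unset, per the checks above.
config_split = DatasetConfig(
    dataset=dataset,
    val_approach=ValidationApproach.SPLIT_FROM_TRAIN,
)

# b) Use dedicated validation dates; val_period_name must name a period
#    available in dataset.time_periods, exactly as the checks above enforce.
config_dates = DatasetConfig(
    dataset=dataset,
    val_approach=ValidationApproach.VALIDATION_DATES,
    val_period_name="W-2022-47",  # hypothetical period name
)
```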
@@ -326,12 +332,11 @@ class DatasetConfig():
      # Check time order of train, val, and test periods
      train_dates = [datetime.strptime(date_str, "%Y%m%d").date() for date_str in self.train_dates]
      test_dates = [datetime.strptime(date_str, "%Y%m%d").date() for date_str in self.test_dates]
-     if len(train_dates) > 0 and len(test_dates) > 0 and min(test_dates) <= max(train_dates):
+     if len(train_dates) > 0 and len(test_dates) > 0 and min(test_dates) <= max(train_dates):
          warnings.warn(f"Some test dates ({min(test_dates).strftime('%Y%m%d')}) are before or equal to the last train date ({max(train_dates).strftime('%Y%m%d')}). This might lead to improper evaluation and should be avoided.")
      if self.val_approach == ValidationApproach.VALIDATION_DATES:
-         # Train dates are guaranteed to be set
          val_dates = [datetime.strptime(date_str, "%Y%m%d").date() for date_str in self.val_dates]
-         if min(val_dates) <= max(train_dates):
+         if len(train_dates) > 0 and min(val_dates) <= max(train_dates):
              warnings.warn(f"Some validation dates ({min(val_dates).strftime('%Y%m%d')}) are before or equal to the last train date ({max(train_dates).strftime('%Y%m%d')}). This might lead to improper evaluation and should be avoided.")
          if len(test_dates) > 0 and min(test_dates) <= max(val_dates):
              warnings.warn(f"Some test dates ({min(test_dates).strftime('%Y%m%d')}) are before or equal to the last validation date ({max(val_dates).strftime('%Y%m%d')}). This might lead to improper evaluation and should be avoided.")
@@ -475,7 +480,7 @@ class DatasetConfig():

  def _get_val_tables_paths(self) -> list[str]:
      if self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:
-         return list(map(lambda t: f"/flows/D{t}", self.train_dates))
+         return self._get_train_tables_paths()
      return list(map(lambda t: f"/flows/D{t}", self.val_dates))

  def _get_test_tables_paths(self) -> list[str]:
@@ -21,6 +21,7 @@ SELECTED_TCP_FLAGS = ["FLAG_CWR", "FLAG_CWR_REV", "FLAG_ECE", "FLAG_ECE_REV", "F
  PHIST_BIN_COUNT = 8

  # Column names
+ ID_COLUMN = "ID"
  APP_COLUMN = "APP"
  CATEGORY_COLUMN = "CATEGORY"
  PPI_COLUMN = "PPI"
@@ -28,7 +28,7 @@ from cesnet_datazoo.pytables_data.indices_setup import (IndicesTuple, compute_kn
                                                           date_weight_sample_train_indices,
                                                           init_or_load_test_indices,
                                                           init_or_load_train_indices,
-                                                          init_or_load_val_indices,
+                                                          init_or_load_val_indices, no_indices,
                                                           subset_and_sort_indices)
  from cesnet_datazoo.pytables_data.pytables_dataset import PyTablesDataset, worker_init_fn
  from cesnet_datazoo.utils.class_info import ClassInfo, create_class_info
@@ -537,10 +537,10 @@ class CesnetDataset():
              raise ValueError(f"Requested number of samples for weight sampling ({num_samples}) is larger than the number of available train samples ({len(train_indices)})")
          train_indices = date_weight_sample_train_indices(dataset_config=dataset_config, train_indices=train_indices, num_samples=num_samples)
      elif dataset_config.apps_selection == AppSelection.FIXED:
-         known_apps = dataset_config.apps_selection_fixed_known
-         unknown_apps = dataset_config.apps_selection_fixed_unknown
-         train_indices = np.zeros((0,3), dtype=np.int64)
-         train_unknown_indices = np.zeros((0,3), dtype=np.int64)
+         known_apps = sorted(dataset_config.apps_selection_fixed_known)
+         unknown_apps = sorted(dataset_config.apps_selection_fixed_unknown)
+         train_indices = no_indices()
+         train_unknown_indices = no_indices()
      else:
          raise ValueError("Either need train set or the fixed application selection")
      # Initialize validation set
@@ -577,8 +577,8 @@ class CesnetDataset():
              test_size=dataset_config.val_known_size if dataset_config.val_known_size != "all" else None,
              stratify=train_labels, shuffle=True, random_state=train_val_rng)
      else:
-         val_known_indices = np.zeros((0,3), dtype=np.int64)
-         val_unknown_indices = np.zeros((0,3), dtype=np.int64)
+         val_known_indices = no_indices()
+         val_unknown_indices = no_indices()
          val_data_path = None
      # Initialize test set
      if dataset_config.need_test_set:
@@ -588,8 +588,8 @@ class CesnetDataset():
              tables_app_enum=self._tables_app_enum,
              disable_indices_cache=disable_indices_cache,)
      else:
-         test_known_indices = np.zeros((0,3), dtype=np.int64)
-         test_unknown_indices = np.zeros((0,3), dtype=np.int64)
+         test_known_indices = no_indices()
+         test_unknown_indices = no_indices()
          test_data_path = None
      # Fit scalers if needed
      if (dataset_config.ppi_transform is not None and dataset_config.ppi_transform.needs_fitting or
@@ -636,7 +636,7 @@ class CesnetDataset():
      assert val_data_path is not None
      val_dataset = PyTablesDataset(
          database_path=dataset_config.database_path,
-         tables_paths=dataset_config._get_train_tables_paths(),
+         tables_paths=dataset_config._get_val_tables_paths(),
          indices=dataset_indices.val_known_indices,
          tables_app_enum=self._tables_app_enum,
          tables_cat_enum=self._tables_cat_enum,
@@ -1,4 +1,4 @@
  Name,Protocol,Published in,Collected in,Collection duration,Available samples,Available dataset sizes,Collection period,Missing dates in collection period,Application count,Background traffic classes,PPI features,Flowstats features,Flowstats features boolean,Packet histograms,TCP features,Other fields,Cite,Zenodo URL,Related papers
- CESNET-TLS22,TLS,2022,2021,2 weeks,141720670,"XS, S, M, L",4.10.2021 - 17.10.2021,,191,,"IPT, DIR, SIZE","BYTES, BYTES_REV, PACKETS, PACKETS_REV, DURATION, PPI_LEN, PPI_ROUNDTRIPS, PPI_DURATION",,,"FLAG_CWR, FLAG_CWR_REV, FLAG_ECE, FLAG_ECE_REV, FLAG_URG, FLAG_URG_REV, FLAG_ACK, FLAG_ACK_REV, FLAG_PSH, FLAG_PSH_REV, FLAG_RST, FLAG_RST_REV, FLAG_SYN, FLAG_SYN_REV, FLAG_FIN, FLAG_FIN_REV",ID,https://doi.org/10.1016/j.comnet.2022.109467,https://zenodo.org/record/7965515,
+ CESNET-TLS22,TLS,2022,2021,2 weeks,141392195,"XS, S, M, L",4.10.2021 - 17.10.2021,,191,,"IPT, DIR, SIZE","BYTES, BYTES_REV, PACKETS, PACKETS_REV, DURATION, PPI_LEN, PPI_ROUNDTRIPS, PPI_DURATION",,,"FLAG_CWR, FLAG_CWR_REV, FLAG_ECE, FLAG_ECE_REV, FLAG_URG, FLAG_URG_REV, FLAG_ACK, FLAG_ACK_REV, FLAG_PSH, FLAG_PSH_REV, FLAG_RST, FLAG_RST_REV, FLAG_SYN, FLAG_SYN_REV, FLAG_FIN, FLAG_FIN_REV",ID,https://doi.org/10.1016/j.comnet.2022.109467,https://zenodo.org/record/7965515,
  CESNET-QUIC22,QUIC,2023,2022,4 weeks,153226273,"XS, S, M, L",31.10.2022 - 27.11.2022,,102,"default-background, google-background, facebook-background","IPT, DIR, SIZE","BYTES, BYTES_REV, PACKETS, PACKETS_REV, DURATION, PPI_LEN, PPI_ROUNDTRIPS, PPI_DURATION","FLOW_ENDREASON_IDLE, FLOW_ENDREASON_ACTIVE, FLOW_ENDREASON_OTHER","PHIST_SRC_SIZES, PHIST_DST_SIZES, PHIST_SRC_IPT, PHIST_DST_IPT",,"ID, SRC_IP, DST_IP, DST_ASN, SRC_PORT, DST_PORT, PROTOCOL, QUIC_VERSION, QUIC_SNI, QUIC_USERAGENT, TIME_FIRST, TIME_LAST",https://doi.org/10.1016/j.dib.2023.108888,https://zenodo.org/record/7963302,https://doi.org/10.23919/TMA58422.2023.10199052
  CESNET-TLS-Year22,TLS,2023,2022,1 year,507739073,"XS, S, M, L",1.1.2022 - 31.12.2022,"20220128, 20220129, 20220130, 20221212, 20221213, 20221229, 20221230, 20221231",180,,"IPT, DIR, SIZE, PUSH_FLAG","BYTES, BYTES_REV, PACKETS, PACKETS_REV, DURATION, PPI_LEN, PPI_ROUNDTRIPS, PPI_DURATION","FLOW_ENDREASON_IDLE, FLOW_ENDREASON_ACTIVE, FLOW_ENDREASON_END, FLOW_ENDREASON_OTHER","PHIST_SRC_SIZES, PHIST_DST_SIZES, PHIST_SRC_IPT, PHIST_DST_IPT","FLAG_CWR, FLAG_CWR_REV, FLAG_ECE, FLAG_ECE_REV, FLAG_URG, FLAG_URG_REV, FLAG_ACK, FLAG_ACK_REV, FLAG_PSH, FLAG_PSH_REV, FLAG_RST, FLAG_RST_REV, FLAG_SYN, FLAG_SYN_REV, FLAG_FIN, FLAG_FIN_REV","ID, SRC_IP, DST_IP, DST_ASN, DST_PORT, PROTOCOL, TLS_SNI, TLS_JA3, TIME_FIRST, TIME_LAST",,,
@@ -17,18 +17,6 @@ from cesnet_datazoo.utils.random import RandomizedSection, get_fresh_random_gene
  log = logging.getLogger(__name__)


- def get_scaler_attrs(scaler: StandardScaler | RobustScaler | MinMaxScaler) -> dict[str, list[float]]:
-     if isinstance(scaler, StandardScaler):
-         assert hasattr(scaler, "mean_") and scaler.mean_ is not None and hasattr(scaler, "scale_") and scaler.scale_ is not None
-         scaler_attrs = {"mean_": scaler.mean_.tolist(), "scale_": scaler.scale_.tolist()}
-     elif isinstance(scaler, RobustScaler):
-         assert hasattr(scaler, "center_") and hasattr(scaler, "scale_")
-         scaler_attrs = {"center_": scaler.center_.tolist(), "scale_": scaler.scale_.tolist()}
-     elif isinstance(scaler, MinMaxScaler):
-         assert hasattr(scaler, "min_") and hasattr(scaler, "scale_")
-         scaler_attrs = {"min_": scaler.min_.tolist(), "scale_": scaler.scale_.tolist()}
-     return scaler_attrs
-
  def fit_scalers(dataset_config: DatasetConfig, train_indices: np.ndarray) -> None:
      # Define indices for fitting scalers
      if isinstance(dataset_config.fit_scalers_samples, int) and dataset_config.fit_scalers_samples > len(train_indices):
@@ -48,6 +36,7 @@ def fit_scalers(dataset_config: DatasetConfig, train_indices: np.ndarray) -> Non

      clip_and_scale_ppi_transform = dataset_config.ppi_transform # TODO Fix after transforms composing is implemented
      clip_and_scale_flowstats_transform = dataset_config.flowstats_transform
+     train_data_path = dataset_config._get_train_data_path()

      # Fit the ClipAndScalePPI transform
      if clip_and_scale_ppi_transform is not None and clip_and_scale_ppi_transform.needs_fitting:
@@ -70,6 +59,7 @@ def fit_scalers(dataset_config: DatasetConfig, train_indices: np.ndarray) -> Non
          train_psizes = np.concatenate((train_psizes, [0]))
          clip_and_scale_ppi_transform.psizes_scaler.fit(train_psizes.reshape(-1, 1))
          clip_and_scale_ppi_transform.needs_fitting = False
+         json.dump(clip_and_scale_ppi_transform.to_dict(), open(os.path.join(train_data_path, "transforms", "ppi-transform.json"), "w"), indent=4)

      # Fit the ClipAndScaleFlowstats transform
      if clip_and_scale_flowstats_transform is not None and clip_and_scale_flowstats_transform.needs_fitting:
@@ -82,29 +72,5 @@ def fit_scalers(dataset_config: DatasetConfig, train_indices: np.ndarray) -> Non
          clip_and_scale_flowstats_transform.flowstats_scaler.fit(train_flowstats)
          clip_and_scale_flowstats_transform.flowstats_quantiles = flowstats_quantiles.tolist()
          clip_and_scale_flowstats_transform.needs_fitting = False
-
+         json.dump(clip_and_scale_flowstats_transform.to_dict(), open(os.path.join(train_data_path, "transforms", "flowstats-transform.json"), "w"), indent=4)
      log.info(f"Reading data and fitting scalers took {time.time() - start_time:.2f} seconds")
-     train_data_path = dataset_config._get_train_data_path()
-     if clip_and_scale_ppi_transform is not None:
-         ppi_transform_path = os.path.join(train_data_path, "transforms", "ppi-transform.json")
-         ppi_transform_dict = {
-             "psizes_scaler_enum": str(clip_and_scale_ppi_transform._psizes_scaler_enum),
-             "psizes_scaler_attrs": get_scaler_attrs(clip_and_scale_ppi_transform.psizes_scaler),
-             "pszies_min": clip_and_scale_ppi_transform.pszies_min,
-             "psizes_max": clip_and_scale_ppi_transform.psizes_max,
-             "ipt_scaler_enum": str(clip_and_scale_ppi_transform._ipt_scaler_enum),
-             "ipt_scaler_attrs": get_scaler_attrs(clip_and_scale_ppi_transform.ipt_scaler),
-             "ipt_min": clip_and_scale_ppi_transform.ipt_min,
-             "ipt_max": clip_and_scale_ppi_transform.ipt_max,
-         }
-         json.dump(ppi_transform_dict, open(ppi_transform_path, "w"), indent=4)
-     if clip_and_scale_flowstats_transform is not None:
-         assert clip_and_scale_flowstats_transform.flowstats_quantiles is not None
-         flowstats_transform_path = os.path.join(train_data_path, "transforms", "flowstats-transform.json")
-         flowstats_transform_dict = {
-             "flowstats_scaler_enum": str(clip_and_scale_flowstats_transform._flowstats_scaler_enum),
-             "flowstats_scaler_attrs": get_scaler_attrs(clip_and_scale_flowstats_transform.flowstats_scaler),
-             "flowstats_quantiles": clip_and_scale_flowstats_transform.flowstats_quantiles,
-             "quantile_clip": clip_and_scale_flowstats_transform.quantile_clip,
-         }
-         json.dump(flowstats_transform_dict, open(flowstats_transform_path, "w"), indent=4)
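The new serialization path dumps `transform.to_dict()` straight to JSON. A sketch of the same write with an explicit context manager (the diff's bare `open(...)` relies on garbage collection to close the file); `train_data_path` and the fitted transform come from the surrounding function:

```python
import json
import os

path = os.path.join(train_data_path, "transforms", "ppi-transform.json")
with open(path, "w") as f:
    # to_dict() is the method the 0.1.2 code relies on per the + lines above.
    json.dump(clip_and_scale_ppi_transform.to_dict(), f, indent=4)
```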
@@ -132,7 +132,7 @@ def init_or_load_val_indices(dataset_config: DatasetConfig, known_apps: list[str
          np.save(os.path.join(val_data_path, "val_known_indices.npy"), val_known_indices)
          np.save(os.path.join(val_data_path, "val_unknown_indices.npy"), val_unknown_indices)
      else:
-         val_known_indices = np.load(os.path.join(val_data_path, "val_known_indices.npu"))
+         val_known_indices = np.load(os.path.join(val_data_path, "val_known_indices.npy"))
          val_unknown_indices = np.load(os.path.join(val_data_path, "val_unknown_indices.npy"))
      return val_known_indices, val_unknown_indices, val_data_path
@@ -162,3 +162,6 @@ def init_train_data(train_data_path: str):
  def init_test_data(test_data_path: str):
      os.makedirs(test_data_path, exist_ok=True)
      os.makedirs(os.path.join(test_data_path, "preload"), exist_ok=True)
+
+ def no_indices() -> np.ndarray:
+     return np.zeros((0,3), dtype=np.int64)
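The new `no_indices()` helper replaces the repeated `np.zeros((0,3), dtype=np.int64)` literals seen earlier in this diff. A self-contained sketch of its shape contract; reading the three columns as table/row bookkeeping follows from the `INDICES_TABLE_POS`/`INDICES_INDEX_POS` constants imported elsewhere and is our interpretation, not documented behavior:

```python
import numpy as np

def no_indices() -> np.ndarray:
    # Empty placeholder with zero rows and the three int64 columns that
    # real index arrays carry, so it concatenates cleanly with them.
    return np.zeros((0, 3), dtype=np.int64)

# Concatenating with a populated (N, 3) index array preserves the shape.
assert np.concatenate([no_indices(), np.ones((4, 3), dtype=np.int64)]).shape == (4, 3)
```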
@@ -16,7 +16,8 @@ from typing_extensions import assert_never

  from cesnet_datazoo.config import (AppSelection, MinTrainSamplesCheck, TestDataParams,
                                     TrainDataParams)
- from cesnet_datazoo.constants import APP_COLUMN, INDICES_INDEX_POS, INDICES_TABLE_POS, PPI_COLUMN
+ from cesnet_datazoo.constants import (APP_COLUMN, INDICES_INDEX_POS, INDICES_TABLE_POS, PPI_COLUMN,
+                                       QUIC_SNI_COLUMN, TLS_SNI_COLUMN)
  from cesnet_datazoo.pytables_data.apps_split import (is_background_app,
                                                       split_apps_topx_with_provider_groups)
@@ -66,6 +67,7 @@ class PyTablesDataset(Dataset):
      self.target_transform = target_transform
      self.return_tensors = return_tensors
      self.return_all_fields = return_all_fields
+     self.sni_column = TLS_SNI_COLUMN if TLS_SNI_COLUMN in self.other_fields else QUIC_SNI_COLUMN if QUIC_SNI_COLUMN in self.other_fields else None

      self.preload = preload
      self.preload_blob = preload_blob
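The added `sni_column` line is a chained conditional expression; a self-contained long form of the same selection logic, with the column values taken from the metadata table earlier in this diff:

```python
from typing import Optional

TLS_SNI_COLUMN = "TLS_SNI"    # values per the Other fields column in metadata.csv
QUIC_SNI_COLUMN = "QUIC_SNI"

def pick_sni_column(other_fields: list[str]) -> Optional[str]:
    if TLS_SNI_COLUMN in other_fields:
        return TLS_SNI_COLUMN       # TLS datasets expose TLS_SNI
    if QUIC_SNI_COLUMN in other_fields:
        return QUIC_SNI_COLUMN      # QUIC datasets expose QUIC_SNI
    return None                     # dataset has no SNI field

assert pick_sni_column(["ID", "QUIC_SNI", "QUIC_USERAGENT"]) == "QUIC_SNI"
```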
@@ -179,7 +181,7 @@ def init_train_indices(train_data_params: TrainDataParams, database_path: str, t
      start_time = time.time()
      for i, table_path in enumerate(train_data_params.train_tables_paths):
          all_app_labels[i] = train_tables[i].read(field=APP_COLUMN)
-         log.info(f"Reading app column for train table {table_path} took {time.time() - start_time:.2f} seconds"); start_time = time.time()
+         log.info(f"Reading app column for table {table_path} took {time.time() - start_time:.2f} seconds"); start_time = time.time()
          app_counts = app_counts.add(pd.Series(all_app_labels[i]).value_counts(), fill_value=0)
      database.close()
      # Handle disabled apps and apps with less than min_samples_per_app samples
@@ -223,13 +225,15 @@ def init_train_indices(train_data_params: TrainDataParams, database_path: str, t
      else:
          known_apps = train_data_params.apps_selection_fixed_known
          unknown_apps = train_data_params.apps_selection_fixed_unknown
+     known_apps = sorted(known_apps)
+     unknown_apps = sorted(unknown_apps)
      known_apps_ids = [inverted_tables_app_enum[app] for app in known_apps]
      unknown_apps_ids = [inverted_tables_app_enum[app] for app in unknown_apps]

      train_known_indices, train_unknown_indices = convert_dict_indices(base_indices=base_indices, base_labels=base_labels, known_apps_ids=known_apps_ids, unknown_apps_ids=unknown_apps_ids)
      rng.shuffle(train_known_indices)
      rng.shuffle(train_unknown_indices)
-     log.info(f"Processing train indices took {time.time() - start_time:.2f} seconds"); start_time = time.time()
+     log.info(f"Processing indices took {time.time() - start_time:.2f} seconds"); start_time = time.time()
      return train_known_indices, train_unknown_indices, known_apps, unknown_apps

  def init_test_indices(test_data_params: TestDataParams, database_path: str, tables_app_enum: dict[int, str], rng: np.random.RandomState) -> tuple[np.ndarray, np.ndarray]:
@@ -240,7 +244,7 @@ def init_test_indices(test_data_params: TestDataParams, database_path: str, tabl
      start_time = time.time()
      for i, table_path in enumerate(test_data_params.test_tables_paths):
          base_labels[i] = test_tables[i].read(field=APP_COLUMN)
-         log.info(f"Reading app column for test table {table_path} took {time.time() - start_time:.2f} seconds"); start_time = time.time()
+         log.info(f"Reading app column for table {table_path} took {time.time() - start_time:.2f} seconds"); start_time = time.time()
          base_indices[i] = np.arange(len(test_tables[i]))
      database.close()
      known_apps_ids = [inverted_tables_app_enum[app] for app in test_data_params.known_apps]
@@ -248,7 +252,7 @@ def init_test_indices(test_data_params: TestDataParams, database_path: str, tabl
      test_known_indices, test_unknown_indices = convert_dict_indices(base_indices=base_indices, base_labels=base_labels, known_apps_ids=known_apps_ids, unknown_apps_ids=unknown_apps_ids)
      rng.shuffle(test_known_indices)
      rng.shuffle(test_unknown_indices)
-     log.info(f"Processing test indices took {time.time() - start_time:.2f} seconds"); start_time = time.time()
+     log.info(f"Processing indices took {time.time() - start_time:.2f} seconds"); start_time = time.time()
      return test_known_indices, test_unknown_indices

  def load_database(database_path: str, tables_paths: Optional[list[str]] = None, mode: str = "r") -> tuple[tb.File, dict[int, Any]]: # dict[int, tb.Table]
@@ -23,8 +23,6 @@ class ClassInfo:
      categories_mapping: dict[str, Optional[str]]

  def create_class_info(servicemap: Any, encoder: LabelEncoder, known_apps: list[str], unknown_apps: list[str]) -> ClassInfo:
-     known_apps = sorted(known_apps)
-     unknown_apps = sorted(unknown_apps)
      target_names_arr = encoder.classes_
      assert known_apps == list(target_names_arr[:-1])
      group_matrix = np.array([[a == b or
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: cesnet-datazoo
- Version: 0.1.0
+ Version: 0.1.2
  Summary: A toolkit for large network traffic datasets
  Author-email: Jan Luxemburk <luxemburk@cesnet.cz>, Karel Hynek <hynekkar@cesnet.cz>
  Maintainer-email: Jan Luxemburk <luxemburk@cesnet.cz>, Karel Hynek <hynekkar@cesnet.cz>
@@ -29,10 +29,7 @@ Requires-Dist: tables>=3.8.0
  Requires-Dist: torch>=1.10
  Requires-Dist: tqdm
  Provides-Extra: dev
- Requires-Dist: black; extra == "dev"
  Requires-Dist: build; extra == "dev"
- Requires-Dist: jupyterlab; extra == "dev"
- Requires-Dist: lightgbm; extra == "dev"
  Requires-Dist: mkdocs-autorefs; extra == "dev"
  Requires-Dist: mkdocs-material-extensions; extra == "dev"
  Requires-Dist: mkdocs-material; extra == "dev"
@@ -57,7 +54,7 @@ The goal of this project is to provide tools for working with large network traf
  - Extensive configuration options for:
      - Selection of train, validation, and test periods.
      - Selection of application classes and splitting classes between *known* and *unknown*.
-     - Feature scaling.
+     - Data transformations, such as feature scaling.
  - Built on suitable data structures for experiments with large datasets. There are several caching mechanisms to make repeated runs faster, for example, when searching for the best model configuration.
  - Datasets are offered in multiple sizes to give users an option to start the experiments at a smaller scale (also faster dataset download, disk space, etc.). The default is the `S` size containing 25 million samples.

@@ -72,7 +69,7 @@ The package is able to handle the following datasets:
  | _Collection duration_ | 2 weeks | 4 weeks | 1 year |
  | _Collection period_ | 4.10.2021 - 17.10.2021 | 31.10.2022 - 27.11.2022 | 1.1.2022 - 31.12.2022 |
  | _Application count_ | 191 | 102 | 180 |
- | _Available samples_ | 141720670 | 153226273 | 507739073 |
+ | _Available samples_ | 141392195 | 153226273 | 507739073 |
  | _Available dataset sizes_ | XS, S, M, L | XS, S, M, L | XS, S, M, L |
  | _Cite_ | [https://doi.org/10.1016/j.comnet.2022.109467](https://doi.org/10.1016/j.comnet.2022.109467) | [https://doi.org/10.1016/j.dib.2023.108888](https://doi.org/10.1016/j.dib.2023.108888) | |
  | _Zenodo URL_ | [https://zenodo.org/record/7965515](https://zenodo.org/record/7965515) | [https://zenodo.org/record/7963302](https://zenodo.org/record/7963302) | |
@@ -12,10 +12,7 @@ torch>=1.10
  tqdm

  [dev]
- black
  build
- jupyterlab
- lightgbm
  mkdocs-autorefs
  mkdocs-material-extensions
  mkdocs-material
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

  [project]
  name = "cesnet-datazoo"
- version = "0.1.0"
+ version = "0.1.2"
  authors = [
      {name = "Jan Luxemburk", email = "luxemburk@cesnet.cz"},
      {name = "Karel Hynek", email = "hynekkar@cesnet.cz"},
@@ -45,10 +45,7 @@ dependencies = [

  [project.optional-dependencies]
  dev = [
-     "black",
      "build",
-     "jupyterlab",
-     "lightgbm",
      "mkdocs-autorefs",
      "mkdocs-material-extensions",
      "mkdocs-material",