cesnet-datazoo 0.1.1__tar.gz → 0.1.3__tar.gz

This diff shows the content changes between two publicly released versions of the package, as they appear in their public registry. It is provided for informational purposes only.
Files changed (35)
  1. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/PKG-INFO +9 -6
  2. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/README.md +8 -5
  3. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/config.py +24 -19
  4. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/constants.py +1 -0
  5. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/datasets/cesnet_dataset.py +10 -10
  6. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/metrics/classification_report.py +2 -2
  7. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/pytables_data/indices_setup.py +4 -1
  8. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/pytables_data/pytables_dataset.py +9 -5
  9. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/utils/class_info.py +0 -2
  10. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo.egg-info/PKG-INFO +9 -6
  11. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/pyproject.toml +1 -1
  12. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/LICENCE +0 -0
  13. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/__init__.py +0 -0
  14. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/datasets/__init__.py +0 -0
  15. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/datasets/datasets.py +0 -0
  16. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/datasets/datasets_constants.py +0 -0
  17. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/datasets/loaders.py +0 -0
  18. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/datasets/metadata/__init__.py +0 -0
  19. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/datasets/metadata/dataset_metadata.py +0 -0
  20. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/datasets/metadata/metadata.csv +0 -0
  21. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/datasets/statistics.py +0 -0
  22. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/metrics/__init__.py +0 -0
  23. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/metrics/provider_metrics.py +0 -0
  24. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/pytables_data/__init__.py +0 -0
  25. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/pytables_data/apps_split.py +0 -0
  26. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/pytables_data/data_scalers.py +0 -0
  27. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/utils/__init__.py +0 -0
  28. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/utils/download.py +0 -0
  29. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/utils/fileutils.py +0 -0
  30. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo/utils/random.py +0 -0
  31. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo.egg-info/SOURCES.txt +0 -0
  32. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo.egg-info/dependency_links.txt +0 -0
  33. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo.egg-info/requires.txt +0 -0
  34. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/cesnet_datazoo.egg-info/top_level.txt +0 -0
  35. {cesnet-datazoo-0.1.1 → cesnet-datazoo-0.1.3}/setup.cfg +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: cesnet-datazoo
- Version: 0.1.1
+ Version: 0.1.3
  Summary: A toolkit for large network traffic datasets
  Author-email: Jan Luxemburk <luxemburk@cesnet.cz>, Karel Hynek <hynekkar@cesnet.cz>
  Maintainer-email: Jan Luxemburk <luxemburk@cesnet.cz>, Karel Hynek <hynekkar@cesnet.cz>
@@ -43,7 +43,7 @@ Requires-Dist: twine; extra == "dev"
  </p>

  [![](https://img.shields.io/badge/license-BSD-blue.svg)](https://github.com/CESNET/cesnet-datazoo/blob/main/LICENCE)
- [![](https://img.shields.io/badge/docs-mkdocs_material-blue.svg)](https://cesnet.github.io/cesnet-datazoo/)
+ [![](https://img.shields.io/badge/docs-cesnet--datazoo-blue.svg)](https://cesnet.github.io/cesnet-datazoo/)
  [![](https://img.shields.io/badge/python->=3.10-blue.svg)](https://pypi.org/project/cesnet-datazoo/)
  [![](https://img.shields.io/pypi/v/cesnet-datazoo)](https://pypi.org/project/cesnet-datazoo/)

@@ -58,9 +58,12 @@ The goal of this project is to provide tools for working with large network traf
  - Built on suitable data structures for experiments with large datasets. There are several caching mechanisms to make repeated runs faster, for example, when searching for the best model configuration.
  - Datasets are offered in multiple sizes to give users an option to start the experiments at a smaller scale (also faster dataset download, disk space, etc.). The default is the `S` size containing 25 million samples.

- ## Datasets
+ :brain: :brain: See a related project [CESNET Models](https://github.com/CESNET/cesnet-models) providing pre-trained neural networks for traffic classification. :brain: :brain:
+
+ :notebook: :notebook: Example Jupyter notebooks are included in a separate [CESNET Traffic Classification Examples](https://github.com/CESNET/cesnet-tcexamples) repo. :notebook: :notebook:

- The package is able to handle the following datasets:
+ ## Datasets
+ The following datasets are available in the `cesnet-datazoo` package:

  | Name | CESNET-TLS22 | CESNET-QUIC22 | CESNET-TLS-Year22 |
  | ---- | ------------ | ------------- | ----------------- |
@@ -120,6 +123,6 @@ See more examples in the [documentation](https://cesnet.github.io/cesnet-datazoo
  Jan Luxemburk and Karel Hynek <br>
  CoNEXT Workshop on Explainable and Safety Bounded, Fidelitous, Machine Learning for Networking (SAFE), 2023

- ### Acknowledgements
+ ## Acknowledgments

- This project was supported by the Ministry of the Interior of the Czech Republic, grant No. VJ02010024: Flow-Based Encrypted Traffic Analysis.
+ This project was supported by the Ministry of the Interior of the Czech Republic, grant No. VJ02010024: Flow-Based Encrypted Traffic Analysis.
@@ -3,7 +3,7 @@
  </p>

  [![](https://img.shields.io/badge/license-BSD-blue.svg)](https://github.com/CESNET/cesnet-datazoo/blob/main/LICENCE)
- [![](https://img.shields.io/badge/docs-mkdocs_material-blue.svg)](https://cesnet.github.io/cesnet-datazoo/)
+ [![](https://img.shields.io/badge/docs-cesnet--datazoo-blue.svg)](https://cesnet.github.io/cesnet-datazoo/)
  [![](https://img.shields.io/badge/python->=3.10-blue.svg)](https://pypi.org/project/cesnet-datazoo/)
  [![](https://img.shields.io/pypi/v/cesnet-datazoo)](https://pypi.org/project/cesnet-datazoo/)

@@ -18,9 +18,12 @@ The goal of this project is to provide tools for working with large network traf
  - Built on suitable data structures for experiments with large datasets. There are several caching mechanisms to make repeated runs faster, for example, when searching for the best model configuration.
  - Datasets are offered in multiple sizes to give users an option to start the experiments at a smaller scale (also faster dataset download, disk space, etc.). The default is the `S` size containing 25 million samples.

- ## Datasets
+ :brain: :brain: See a related project [CESNET Models](https://github.com/CESNET/cesnet-models) providing pre-trained neural networks for traffic classification. :brain: :brain:
+
+ :notebook: :notebook: Example Jupyter notebooks are included in a separate [CESNET Traffic Classification Examples](https://github.com/CESNET/cesnet-tcexamples) repo. :notebook: :notebook:

- The package is able to handle the following datasets:
+ ## Datasets
+ The following datasets are available in the `cesnet-datazoo` package:

  | Name | CESNET-TLS22 | CESNET-QUIC22 | CESNET-TLS-Year22 |
  | ---- | ------------ | ------------- | ----------------- |
@@ -80,6 +83,6 @@ See more examples in the [documentation](https://cesnet.github.io/cesnet-datazoo
  Jan Luxemburk and Karel Hynek <br>
  CoNEXT Workshop on Explainable and Safety Bounded, Fidelitous, Machine Learning for Networking (SAFE), 2023

- ### Acknowledgements
+ ## Acknowledgments

- This project was supported by the Ministry of the Interior of the Czech Republic, grant No. VJ02010024: Flow-Based Encrypted Traffic Analysis.
+ This project was supported by the Ministry of the Interior of the Czech Republic, grant No. VJ02010024: Flow-Based Encrypted Traffic Analysis.
@@ -133,7 +133,7 @@ class DatasetConfig():

      Attributes:
          need_train_set: Use to disable the train set. `Default: True`
-         need_val_set: Use to disable the validation set. When `need_train_set` is false, the validation set will also be disabled. `Default: True`
+         need_val_set: Use to disable the validation set. `Default: True`
          need_test_set: Use to disable the test set. `Default: True`
          train_period_name: Name of the train period. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets].
          train_dates: Dates used for creating a train set.
@@ -161,7 +161,7 @@ class DatasetConfig():
          val_workers: Number of workers for loading validation data. `0` means that the data will be loaded in the main process. `Default: 1`
          batch_size: Number of samples per batch. `Default: 192`
          test_batch_size: Number of samples per batch for loading validation and test data. `Default: 2048`
-         preload_val: Whether to dump the validation set with `numpy.savez_compressed` and preload it in future runs. Useful when running a lot of experiments with the same dataset configuration. `Default: True`
+         preload_val: Whether to dump the validation set with `numpy.savez_compressed` and preload it in future runs. Useful when running a lot of experiments with the same dataset configuration. `Default: False`
          preload_test: Whether to dump the test set with `numpy.savez_compressed` and preload it in future runs. `Default: False`
          train_size: Size of the train set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`
          val_known_size: Size of the validation set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`
@@ -238,7 +238,7 @@ class DatasetConfig():
      val_workers: int = 1
      batch_size: int = 192
      test_batch_size: int = 2048
-     preload_val: bool = True
+     preload_val: bool = False
      preload_test: bool = False
      train_size: int | Literal["all"] = "all"
      val_known_size: int | Literal["all"] = "all"
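Note on the two `config.py` hunks above: the `preload_val` default flips from `True` to `False`, so caching the validation set to disk is now opt-in. A minimal sketch of opting back in — the dataset class, its constructor arguments, and the `set_dataset_config_and_initialize` call follow the package's documented usage, but treat the exact names here as assumptions rather than guarantees:

```python
from cesnet_datazoo.config import DatasetConfig
from cesnet_datazoo.datasets import CESNET_QUIC22

# Initialize a dataset (the path and size are illustrative)
dataset = CESNET_QUIC22("/datasets/CESNET-QUIC22/", size="XS")

config = DatasetConfig(
    dataset=dataset,
    preload_val=True,  # the 0.1.1 behavior; since 0.1.3 the default is False
)
dataset.set_dataset_config_and_initialize(config)
```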
@@ -268,7 +268,6 @@ class DatasetConfig():
          self.database_path = dataset.database_path

          if not self.need_train_set:
-             self.need_val_set = False
              if self.apps_selection != AppSelection.FIXED:
                  raise ValueError("Application selection has to be fixed when need_train_set is false")
              if (len(self.train_dates) > 0 or self.train_period_name != ""):
@@ -299,17 +298,24 @@ class DatasetConfig():
          self.test_period_name = dataset.default_test_period_name
          self.test_dates = dataset.time_periods[dataset.default_test_period_name]
          # Configure val dates
-         if (not self.need_val_set or self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN) and (len(self.val_dates) > 0 or self.val_period_name != ""):
-             raise ValueError("val_dates and val_period_name cannot be specified when need_val_set is false or the validation approach is split-from-train")
-         if self.val_approach == ValidationApproach.VALIDATION_DATES:
-             if len(self.val_dates) > 0 and self.val_period_name == "":
-                 raise ValueError("val_period_name has to be specified when val_dates are set")
-             if len(self.val_dates) == 0 and self.val_period_name != "":
-                 if self.val_period_name not in dataset.time_periods:
-                     raise ValueError(f"Unknown val_period_name {self.val_period_name}. Use time period available in dataset.time_periods")
-                 self.val_dates = dataset.time_periods[self.val_period_name]
-             if len(self.val_dates) == 0 and self.val_period_name == "":
-                 raise ValueError("val_period_name and val_dates (or val_period_name from dataset.time_periods) have to be specified when the validation approach is validation-dates")
+         if not self.need_val_set:
+             if len(self.val_dates) > 0 or self.val_period_name != "" or self.val_approach != ValidationApproach.SPLIT_FROM_TRAIN:
+                 raise ValueError("val_dates, val_period_name, and val_approach cannot be specified when need_val_set is false")
+         else:
+             if self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:
+                 if len(self.val_dates) > 0 or self.val_period_name != "":
+                     raise ValueError("val_dates and val_period_name cannot be specified when the validation approach is split-from-train")
+                 if not self.need_train_set:
+                     raise ValueError("Cannot use the split-from-train validation approach when need_train_set is false. Either use the validation-dates approach or set need_val_set to false.")
+             elif self.val_approach == ValidationApproach.VALIDATION_DATES:
+                 if len(self.val_dates) > 0 and self.val_period_name == "":
+                     raise ValueError("val_period_name has to be specified when val_dates are set")
+                 if len(self.val_dates) == 0 and self.val_period_name != "":
+                     if self.val_period_name not in dataset.time_periods:
+                         raise ValueError(f"Unknown val_period_name {self.val_period_name}. Use time period available in dataset.time_periods")
+                     self.val_dates = dataset.time_periods[self.val_period_name]
+                 if len(self.val_dates) == 0 and self.val_period_name == "":
+                     raise ValueError("val_period_name and val_dates (or val_period_name from dataset.time_periods) have to be specified when the validation approach is validation-dates")
          # Check if train, val, and test dates are available in the dataset
          bad_train_dates = [t for t in self.train_dates if t not in dataset.available_dates]
          bad_val_dates = [t for t in self.val_dates if t not in dataset.available_dates]
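The rewritten block above separates the `need_val_set` check from the choice of validation approach and adds an explicit error for combining split-from-train with a disabled train set. A hedged sketch of the three configurations the new logic accepts, reusing the `dataset` instance from the earlier sketch (enum members come from the hunk; the remaining constructor arguments and the example period name are assumptions):

```python
from cesnet_datazoo.config import DatasetConfig, ValidationApproach

# 1) Split validation data off the train set (the default approach);
#    val_dates and val_period_name must stay unset, and a train set is required.
cfg1 = DatasetConfig(dataset=dataset, val_approach=ValidationApproach.SPLIT_FROM_TRAIN)

# 2) Use dedicated validation dates; a val_period_name known to the dataset
#    fills val_dates from dataset.time_periods (period name is hypothetical).
cfg2 = DatasetConfig(dataset=dataset,
                     val_approach=ValidationApproach.VALIDATION_DATES,
                     val_period_name="W-2022-44")

# 3) No validation set at all; leave every val_* option at its default.
cfg3 = DatasetConfig(dataset=dataset, need_val_set=False)
```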
@@ -326,12 +332,11 @@ class DatasetConfig():
          # Check time order of train, val, and test periods
          train_dates = [datetime.strptime(date_str, "%Y%m%d").date() for date_str in self.train_dates]
          test_dates = [datetime.strptime(date_str, "%Y%m%d").date() for date_str in self.test_dates]
-         if len(train_dates) > 0 and len(test_dates) > 0 and min(test_dates) <= max(train_dates):
+         if len(train_dates) > 0 and len(test_dates) > 0 and min(test_dates) <= max(train_dates):
              warnings.warn(f"Some test dates ({min(test_dates).strftime('%Y%m%d')}) are before or equal to the last train date ({max(train_dates).strftime('%Y%m%d')}). This might lead to improper evaluation and should be avoided.")
          if self.val_approach == ValidationApproach.VALIDATION_DATES:
-             # Train dates are guaranteed to be set
              val_dates = [datetime.strptime(date_str, "%Y%m%d").date() for date_str in self.val_dates]
-             if min(val_dates) <= max(train_dates):
+             if len(train_dates) > 0 and min(val_dates) <= max(train_dates):
                  warnings.warn(f"Some validation dates ({min(val_dates).strftime('%Y%m%d')}) are before or equal to the last train date ({max(train_dates).strftime('%Y%m%d')}). This might lead to improper evaluation and should be avoided.")
              if len(test_dates) > 0 and min(test_dates) <= max(val_dates):
                  warnings.warn(f"Some test dates ({min(test_dates).strftime('%Y%m%d')}) are before or equal to the last validation date ({max(val_dates).strftime('%Y%m%d')}). This might lead to improper evaluation and should be avoided.")
@@ -475,7 +480,7 @@ class DatasetConfig():

      def _get_val_tables_paths(self) -> list[str]:
          if self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:
-             return list(map(lambda t: f"/flows/D{t}", self.train_dates))
+             return self._get_train_tables_paths()
          return list(map(lambda t: f"/flows/D{t}", self.val_dates))

      def _get_test_tables_paths(self) -> list[str]:
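In 0.1.1, `_get_val_tables_paths` rebuilt the train table paths inline; 0.1.3 delegates to `_get_train_tables_paths()`, so split-from-train validation always reads from exactly the tables the train set uses. The mapping itself is simple — a standalone illustration with hypothetical dates:

```python
# Dates are stored as YYYYMMDD strings; each maps to a PyTables table path.
train_dates = ["20220418", "20220419"]
paths = [f"/flows/D{d}" for d in train_dates]
assert paths == ["/flows/D20220418", "/flows/D20220419"]
```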
@@ -21,6 +21,7 @@ SELECTED_TCP_FLAGS = ["FLAG_CWR", "FLAG_CWR_REV", "FLAG_ECE", "FLAG_ECE_REV", "F
  PHIST_BIN_COUNT = 8

  # Column names
+ ID_COLUMN = "ID"
  APP_COLUMN = "APP"
  CATEGORY_COLUMN = "CATEGORY"
  PPI_COLUMN = "PPI"
@@ -28,7 +28,7 @@ from cesnet_datazoo.pytables_data.indices_setup import (IndicesTuple, compute_kn
                                                       date_weight_sample_train_indices,
                                                       init_or_load_test_indices,
                                                       init_or_load_train_indices,
-                                                      init_or_load_val_indices,
+                                                      init_or_load_val_indices, no_indices,
                                                       subset_and_sort_indices)
  from cesnet_datazoo.pytables_data.pytables_dataset import PyTablesDataset, worker_init_fn
  from cesnet_datazoo.utils.class_info import ClassInfo, create_class_info
@@ -537,10 +537,10 @@ class CesnetDataset():
              raise ValueError(f"Requested number of samples for weight sampling ({num_samples}) is larger than the number of available train samples ({len(train_indices)})")
          train_indices = date_weight_sample_train_indices(dataset_config=dataset_config, train_indices=train_indices, num_samples=num_samples)
      elif dataset_config.apps_selection == AppSelection.FIXED:
-         known_apps = dataset_config.apps_selection_fixed_known
-         unknown_apps = dataset_config.apps_selection_fixed_unknown
-         train_indices = np.zeros((0,3), dtype=np.int64)
-         train_unknown_indices = np.zeros((0,3), dtype=np.int64)
+         known_apps = sorted(dataset_config.apps_selection_fixed_known)
+         unknown_apps = sorted(dataset_config.apps_selection_fixed_unknown)
+         train_indices = no_indices()
+         train_unknown_indices = no_indices()
      else:
          raise ValueError("Either need train set or the fixed application selection")
      # Initialize validation set
@@ -577,8 +577,8 @@ class CesnetDataset():
                  test_size=dataset_config.val_known_size if dataset_config.val_known_size != "all" else None,
                  stratify=train_labels, shuffle=True, random_state=train_val_rng)
      else:
-         val_known_indices = np.zeros((0,3), dtype=np.int64)
-         val_unknown_indices = np.zeros((0,3), dtype=np.int64)
+         val_known_indices = no_indices()
+         val_unknown_indices = no_indices()
          val_data_path = None
      # Initialize test set
      if dataset_config.need_test_set:
@@ -588,8 +588,8 @@ class CesnetDataset():
              tables_app_enum=self._tables_app_enum,
              disable_indices_cache=disable_indices_cache,)
      else:
-         test_known_indices = np.zeros((0,3), dtype=np.int64)
-         test_unknown_indices = np.zeros((0,3), dtype=np.int64)
+         test_known_indices = no_indices()
+         test_unknown_indices = no_indices()
          test_data_path = None
      # Fit scalers if needed
      if (dataset_config.ppi_transform is not None and dataset_config.ppi_transform.needs_fitting or
@@ -636,7 +636,7 @@ class CesnetDataset():
      assert val_data_path is not None
      val_dataset = PyTablesDataset(
          database_path=dataset_config.database_path,
-         tables_paths=dataset_config._get_train_tables_paths(),
+         tables_paths=dataset_config._get_val_tables_paths(),
          indices=dataset_indices.val_known_indices,
          tables_app_enum=self._tables_app_enum,
          tables_cat_enum=self._tables_cat_enum,
@@ -8,8 +8,8 @@ from cesnet_datazoo.utils.class_info import ClassInfo

  def better_classification_report(y_true: np.ndarray, y_pred: np.ndarray, cm: np.ndarray, labels: list[int], class_info: ClassInfo, digits: int = 2, zero_division: int = 0) -> tuple[str, dict[str, float]]:
      p, r, f1, s = precision_recall_fscore_support(y_true, y_pred,
-                                                   labels=labels,
-                                                   zero_division=zero_division)
+                                                  labels=labels,
+                                                  zero_division=zero_division)
      sc_p, sc_r, sc_f1 = per_app_provider_metrics(cm, class_info=class_info)
      predicted_unknown = cm[:, -1]
      with np.errstate(divide="ignore", invalid="ignore"):
@@ -132,7 +132,7 @@ def init_or_load_val_indices(dataset_config: DatasetConfig, known_apps: list[str
          np.save(os.path.join(val_data_path, "val_known_indices.npy"), val_known_indices)
          np.save(os.path.join(val_data_path, "val_unknown_indices.npy"), val_unknown_indices)
      else:
-         val_known_indices = np.load(os.path.join(val_data_path, "val_known_indices.npu"))
+         val_known_indices = np.load(os.path.join(val_data_path, "val_known_indices.npy"))
          val_unknown_indices = np.load(os.path.join(val_data_path, "val_unknown_indices.npy"))
      return val_known_indices, val_unknown_indices, val_data_path

@@ -162,3 +162,6 @@ def init_train_data(train_data_path: str):
  def init_test_data(test_data_path: str):
      os.makedirs(test_data_path, exist_ok=True)
      os.makedirs(os.path.join(test_data_path, "preload"), exist_ok=True)
+
+ def no_indices() -> np.ndarray:
+     return np.zeros((0,3), dtype=np.int64)
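The new `no_indices` helper gives the `(0, 3)` int64 placeholder used throughout `cesnet_dataset.py` (see the hunks above) a single definition and a name. A small sketch of why the shape matters: the empty array keeps the same column layout as populated index arrays, so downstream code needs no special cases (what the three columns mean beyond `INDICES_TABLE_POS` and `INDICES_INDEX_POS` from `constants.py` is not spelled out in this diff):

```python
import numpy as np

def no_indices() -> np.ndarray:
    return np.zeros((0, 3), dtype=np.int64)

empty = no_indices()
assert empty.shape == (0, 3) and empty.dtype == np.int64
# Concatenating with real (n, 3) index arrays works without branching:
real = np.array([[0, 42, 7]], dtype=np.int64)  # hypothetical single index row
assert np.concatenate([empty, real]).shape == (1, 3)
```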
@@ -16,7 +16,8 @@ from typing_extensions import assert_never

  from cesnet_datazoo.config import (AppSelection, MinTrainSamplesCheck, TestDataParams,
                                     TrainDataParams)
- from cesnet_datazoo.constants import APP_COLUMN, INDICES_INDEX_POS, INDICES_TABLE_POS, PPI_COLUMN
+ from cesnet_datazoo.constants import (APP_COLUMN, INDICES_INDEX_POS, INDICES_TABLE_POS, PPI_COLUMN,
+                                       QUIC_SNI_COLUMN, TLS_SNI_COLUMN)
  from cesnet_datazoo.pytables_data.apps_split import (is_background_app,
                                                       split_apps_topx_with_provider_groups)

@@ -66,6 +67,7 @@ class PyTablesDataset(Dataset):
          self.target_transform = target_transform
          self.return_tensors = return_tensors
          self.return_all_fields = return_all_fields
+         self.sni_column = TLS_SNI_COLUMN if TLS_SNI_COLUMN in self.other_fields else QUIC_SNI_COLUMN if QUIC_SNI_COLUMN in self.other_fields else None

          self.preload = preload
          self.preload_blob = preload_blob
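The added `sni_column` line is a chained conditional expression: prefer the TLS SNI field when it is among the requested `other_fields`, fall back to the QUIC SNI field, otherwise `None`. The same logic written as a standalone function — the string values of the two constants here are assumptions; the real values live in `cesnet_datazoo.constants`:

```python
from typing import Optional

TLS_SNI_COLUMN = "TLS_SNI"    # assumed value
QUIC_SNI_COLUMN = "QUIC_SNI"  # assumed value

def pick_sni_column(other_fields: list[str]) -> Optional[str]:
    # First matching SNI field wins; None means no SNI column was requested.
    if TLS_SNI_COLUMN in other_fields:
        return TLS_SNI_COLUMN
    if QUIC_SNI_COLUMN in other_fields:
        return QUIC_SNI_COLUMN
    return None

assert pick_sni_column(["QUIC_SNI", "DURATION"]) == "QUIC_SNI"
assert pick_sni_column(["DURATION"]) is None
```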
@@ -179,7 +181,7 @@ def init_train_indices(train_data_params: TrainDataParams, database_path: str, t
      start_time = time.time()
      for i, table_path in enumerate(train_data_params.train_tables_paths):
          all_app_labels[i] = train_tables[i].read(field=APP_COLUMN)
-         log.info(f"Reading app column for train table {table_path} took {time.time() - start_time:.2f} seconds"); start_time = time.time()
+         log.info(f"Reading app column for table {table_path} took {time.time() - start_time:.2f} seconds"); start_time = time.time()
          app_counts = app_counts.add(pd.Series(all_app_labels[i]).value_counts(), fill_value=0)
      database.close()
      # Handle disabled apps and apps with less than min_samples_per_app samples
@@ -223,13 +225,15 @@ def init_train_indices(train_data_params: TrainDataParams, database_path: str, t
      else:
          known_apps = train_data_params.apps_selection_fixed_known
          unknown_apps = train_data_params.apps_selection_fixed_unknown
+     known_apps = sorted(known_apps)
+     unknown_apps = sorted(unknown_apps)
      known_apps_ids = [inverted_tables_app_enum[app] for app in known_apps]
      unknown_apps_ids = [inverted_tables_app_enum[app] for app in unknown_apps]

      train_known_indices, train_unknown_indices = convert_dict_indices(base_indices=base_indices, base_labels=base_labels, known_apps_ids=known_apps_ids, unknown_apps_ids=unknown_apps_ids)
      rng.shuffle(train_known_indices)
      rng.shuffle(train_unknown_indices)
-     log.info(f"Processing train indices took {time.time() - start_time:.2f} seconds"); start_time = time.time()
+     log.info(f"Processing indices took {time.time() - start_time:.2f} seconds"); start_time = time.time()
      return train_known_indices, train_unknown_indices, known_apps, unknown_apps

  def init_test_indices(test_data_params: TestDataParams, database_path: str, tables_app_enum: dict[int, str], rng: np.random.RandomState) -> tuple[np.ndarray, np.ndarray]:
@@ -240,7 +244,7 @@ def init_test_indices(test_data_params: TestDataParams, database_path: str, tabl
      start_time = time.time()
      for i, table_path in enumerate(test_data_params.test_tables_paths):
          base_labels[i] = test_tables[i].read(field=APP_COLUMN)
-         log.info(f"Reading app column for test table {table_path} took {time.time() - start_time:.2f} seconds"); start_time = time.time()
+         log.info(f"Reading app column for table {table_path} took {time.time() - start_time:.2f} seconds"); start_time = time.time()
          base_indices[i] = np.arange(len(test_tables[i]))
      database.close()
      known_apps_ids = [inverted_tables_app_enum[app] for app in test_data_params.known_apps]
@@ -248,7 +252,7 @@ def init_test_indices(test_data_params: TestDataParams, database_path: str, tabl
      test_known_indices, test_unknown_indices = convert_dict_indices(base_indices=base_indices, base_labels=base_labels, known_apps_ids=known_apps_ids, unknown_apps_ids=unknown_apps_ids)
      rng.shuffle(test_known_indices)
      rng.shuffle(test_unknown_indices)
-     log.info(f"Processing test indices took {time.time() - start_time:.2f} seconds"); start_time = time.time()
+     log.info(f"Processing indices took {time.time() - start_time:.2f} seconds"); start_time = time.time()
      return test_known_indices, test_unknown_indices

  def load_database(database_path: str, tables_paths: Optional[list[str]] = None, mode: str = "r") -> tuple[tb.File, dict[int, Any]]: # dict[int, tb.Table]
@@ -23,8 +23,6 @@ class ClassInfo:
      categories_mapping: dict[str, Optional[str]]

  def create_class_info(servicemap: Any, encoder: LabelEncoder, known_apps: list[str], unknown_apps: list[str]) -> ClassInfo:
-     known_apps = sorted(known_apps)
-     unknown_apps = sorted(unknown_apps)
      target_names_arr = encoder.classes_
      assert known_apps == list(target_names_arr[:-1])
      group_matrix = np.array([[a == b or
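This removal pairs with the `sorted()` calls added in `cesnet_dataset.py` and `pytables_data/pytables_dataset.py` above: the app lists are now sorted once at selection time, and `create_class_info` relies on its inputs arriving pre-sorted — the assert ties `known_apps` to the order stored in the label encoder. A sketch of that invariant with hypothetical app names (the encoder here is built by hand; the package's own encoder construction is not shown in this diff):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

known_apps = sorted(["spotify", "discord", "google-www"])
encoder = LabelEncoder()
# Known apps in sorted order, with a trailing class for unknown traffic:
encoder.classes_ = np.array(known_apps + ["unknown"])
assert known_apps == list(encoder.classes_[:-1])  # the invariant asserted above
```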
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: cesnet-datazoo
- Version: 0.1.1
+ Version: 0.1.3
  Summary: A toolkit for large network traffic datasets
  Author-email: Jan Luxemburk <luxemburk@cesnet.cz>, Karel Hynek <hynekkar@cesnet.cz>
  Maintainer-email: Jan Luxemburk <luxemburk@cesnet.cz>, Karel Hynek <hynekkar@cesnet.cz>
@@ -43,7 +43,7 @@ Requires-Dist: twine; extra == "dev"
  </p>

  [![](https://img.shields.io/badge/license-BSD-blue.svg)](https://github.com/CESNET/cesnet-datazoo/blob/main/LICENCE)
- [![](https://img.shields.io/badge/docs-mkdocs_material-blue.svg)](https://cesnet.github.io/cesnet-datazoo/)
+ [![](https://img.shields.io/badge/docs-cesnet--datazoo-blue.svg)](https://cesnet.github.io/cesnet-datazoo/)
  [![](https://img.shields.io/badge/python->=3.10-blue.svg)](https://pypi.org/project/cesnet-datazoo/)
  [![](https://img.shields.io/pypi/v/cesnet-datazoo)](https://pypi.org/project/cesnet-datazoo/)

@@ -58,9 +58,12 @@ The goal of this project is to provide tools for working with large network traf
  - Built on suitable data structures for experiments with large datasets. There are several caching mechanisms to make repeated runs faster, for example, when searching for the best model configuration.
  - Datasets are offered in multiple sizes to give users an option to start the experiments at a smaller scale (also faster dataset download, disk space, etc.). The default is the `S` size containing 25 million samples.

- ## Datasets
+ :brain: :brain: See a related project [CESNET Models](https://github.com/CESNET/cesnet-models) providing pre-trained neural networks for traffic classification. :brain: :brain:
+
+ :notebook: :notebook: Example Jupyter notebooks are included in a separate [CESNET Traffic Classification Examples](https://github.com/CESNET/cesnet-tcexamples) repo. :notebook: :notebook:

- The package is able to handle the following datasets:
+ ## Datasets
+ The following datasets are available in the `cesnet-datazoo` package:

  | Name | CESNET-TLS22 | CESNET-QUIC22 | CESNET-TLS-Year22 |
  | ---- | ------------ | ------------- | ----------------- |
@@ -120,6 +123,6 @@ See more examples in the [documentation](https://cesnet.github.io/cesnet-datazoo
  Jan Luxemburk and Karel Hynek <br>
  CoNEXT Workshop on Explainable and Safety Bounded, Fidelitous, Machine Learning for Networking (SAFE), 2023

- ### Acknowledgements
+ ## Acknowledgments

- This project was supported by the Ministry of the Interior of the Czech Republic, grant No. VJ02010024: Flow-Based Encrypted Traffic Analysis.
+ This project was supported by the Ministry of the Interior of the Czech Republic, grant No. VJ02010024: Flow-Based Encrypted Traffic Analysis.
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

  [project]
  name = "cesnet-datazoo"
- version = "0.1.1"
+ version = "0.1.3"
  authors = [
      {name = "Jan Luxemburk", email = "luxemburk@cesnet.cz"},
      {name = "Karel Hynek", email = "hynekkar@cesnet.cz"},