learning-loop-node 0.10.6__tar.gz → 0.10.7__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release: this version of learning-loop-node might be problematic.
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/PKG-INFO +23 -1
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/README.md +22 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/data_exchanger.py +9 -4
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/detector_node.py +4 -2
- learning_loop_node-0.10.7/learning_loop_node/detector/outbox.py +185 -0
- learning_loop_node-0.10.7/learning_loop_node/detector/rest/outbox_mode.py +35 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/rest/upload.py +6 -2
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/tests/test_client_communication.py +16 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/tests/test_outbox.py +23 -5
- learning_loop_node-0.10.7/learning_loop_node/trainer/exceptions.py +2 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/tests/states/test_state_detecting.py +1 -1
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/tests/states/test_state_upload_model.py +4 -5
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/tests/testing_trainer_logic.py +1 -1
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/trainer_logic_generic.py +53 -32
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/pyproject.toml +1 -1
- learning_loop_node-0.10.6/learning_loop_node/detector/outbox.py +0 -117
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/__init__.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/annotation/__init__.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/annotation/annotator_logic.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/annotation/annotator_node.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/annotation/tests/test_annotator_node.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/conftest.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/data_classes/__init__.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/data_classes/annotations.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/data_classes/detections.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/data_classes/general.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/data_classes/socket_response.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/data_classes/training.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/__init__.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/detector_logic.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/inbox_filter/__init__.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/inbox_filter/cam_observation_history.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/inbox_filter/relevance_filter.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/inbox_filter/tests/test_observation.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/inbox_filter/tests/test_relevance_group.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/inbox_filter/tests/test_unexpected_observations_count.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/rest/__init__.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/rest/about.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/rest/backdoor_controls.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/rest/detect.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/rest/operation_mode.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/tests/__init__.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/tests/conftest.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/tests/test.jpg +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/tests/test_relevance_filter.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/tests/testing_detector.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/examples/novelty_score_updater.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/globals.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/helpers/__init__.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/helpers/environment_reader.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/helpers/gdrive_downloader.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/helpers/log_conf.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/helpers/misc.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/loop_communication.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/node.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/py.typed +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/pytest.ini +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/tests/__init__.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/tests/conftest.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/tests/test_data/file_1.txt +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/tests/test_data/file_2.txt +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/tests/test_data/model.json +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/tests/test_data_classes.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/tests/test_downloader.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/tests/test_executor.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/tests/test_helper.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/tests/test_learning_loop_node.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/__init__.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/downloader.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/executor.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/io_helpers.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/rest/__init__.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/rest/backdoor_controls.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/rest/controls.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/tests/__init__.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/tests/conftest.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/tests/state_helper.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/tests/states/__init__.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/tests/states/test_state_cleanup.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/tests/states/test_state_download_train_model.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/tests/states/test_state_prepare.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/tests/states/test_state_sync_confusion_matrix.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/tests/states/test_state_train.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/tests/states/test_state_upload_detections.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/tests/test_errors.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/tests/test_trainer_states.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/trainer_logic.py +0 -0
- {learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/trainer_node.py +0 -0
{learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/PKG-INFO
RENAMED
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: learning-loop-node
-Version: 0.10.6
+Version: 0.10.7
 Summary: Python Library for Nodes which connect to the Zauberzeug Learning Loop
 Home-page: https://github.com/zauberzeug/learning_loop_node
 License: MIT
@@ -81,6 +81,8 @@ from learning_loop_node/learning_loop_node

 Detector Nodes are normally deployed on edge devices like robots or machinery but can also run in the cloud to provide backend services for an app or similar. These nodes register themself at the Learning Loop. They provide REST and Socket.io APIs to run inference on images. The processed images can automatically be used for active learning: e.g. uncertain predictions will be send to the Learning Loop.

+### Running Inference
+
 Images can be send to the detector node via socketio or rest.
 The later approach can be used via curl,

@@ -102,6 +104,26 @@ The detector also has a sio **upload endpoint** that can be used to upload image

 The endpoint returns None if the upload was successful and an error message otherwise.

+### Changing the outbox mode
+
+If the autoupload is set to `all` or `filtered` (selected) images and the corresponding detections are saved on HDD (the outbox). A background thread will upload the images and detections to the Learning Loop. The outbox is located in the `outbox` folder in the root directory of the node. The outbox can be cleared by deleting the files in the folder.
+
+The continuous upload can be stopped/started via a REST enpoint:
+
+Example Usage:
+
+- Enable upload: `curl -X PUT -d "continuous_upload" http://localhost/outbox_mode`
+- Disable upload: `curl -X PUT -d "stopped" http://localhost/outbox_mode`
+
+The current state can be queried via a GET request:
+`curl http://localhost/outbox_mode`
+
+### Explicit upload
+
+The detector has a REST endpoint to upload images (and detections) to the Learning Loop. The endpoint takes a POST request with the image and optionally the detections. The image is expected to be in jpg format. The detections are expected to be a json dictionary. Example:
+
+`curl -X POST -F 'files=@test.jpg' "http://localhost:/upload"`
+
 ## Trainer Node

 Trainers fetch the images and anntoations from the Learning Loop to train new models.
{learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/README.md
RENAMED
@@ -41,6 +41,8 @@ from learning_loop_node/learning_loop_node

 Detector Nodes are normally deployed on edge devices like robots or machinery but can also run in the cloud to provide backend services for an app or similar. These nodes register themself at the Learning Loop. They provide REST and Socket.io APIs to run inference on images. The processed images can automatically be used for active learning: e.g. uncertain predictions will be send to the Learning Loop.

+### Running Inference
+
 Images can be send to the detector node via socketio or rest.
 The later approach can be used via curl,

@@ -62,6 +64,26 @@ The detector also has a sio **upload endpoint** that can be used to upload image

 The endpoint returns None if the upload was successful and an error message otherwise.

+### Changing the outbox mode
+
+If the autoupload is set to `all` or `filtered` (selected) images and the corresponding detections are saved on HDD (the outbox). A background thread will upload the images and detections to the Learning Loop. The outbox is located in the `outbox` folder in the root directory of the node. The outbox can be cleared by deleting the files in the folder.
+
+The continuous upload can be stopped/started via a REST enpoint:
+
+Example Usage:
+
+- Enable upload: `curl -X PUT -d "continuous_upload" http://localhost/outbox_mode`
+- Disable upload: `curl -X PUT -d "stopped" http://localhost/outbox_mode`
+
+The current state can be queried via a GET request:
+`curl http://localhost/outbox_mode`
+
+### Explicit upload
+
+The detector has a REST endpoint to upload images (and detections) to the Learning Loop. The endpoint takes a POST request with the image and optionally the detections. The image is expected to be in jpg format. The detections are expected to be a json dictionary. Example:
+
+`curl -X POST -F 'files=@test.jpg' "http://localhost:/upload"`
+
 ## Trainer Node

 Trainers fetch the images and anntoations from the Learning Loop to train new models.
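For readers of this diff, a minimal client-side sketch of the endpoints documented above (not part of the package; host, port and file name are placeholders to adapt to your detector node):

import requests

DETECTOR = 'http://localhost'  # adjust to the address of your detector node

# pause and resume the background upload of the outbox
requests.put(f'{DETECTOR}/outbox_mode', data='stopped', timeout=10)
requests.put(f'{DETECTOR}/outbox_mode', data='continuous_upload', timeout=10)

# query the current outbox mode ('continuous_upload' or 'stopped')
print(requests.get(f'{DETECTOR}/outbox_mode', timeout=10).text)

# explicitly upload an image so the node forwards it to the Learning Loop
with open('test.jpg', 'rb') as f:
    requests.post(f'{DETECTOR}/upload', files={'files': f}, timeout=30)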
{learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/data_exchanger.py
RENAMED
@@ -14,6 +14,7 @@ import aiofiles  # type: ignore
 from .data_classes import Context
 from .helpers.misc import create_resource_paths, create_task, is_valid_image
 from .loop_communication import LoopCommunicator
+from .trainer.exceptions import CriticalError


 class DownloadError(Exception):
@@ -159,13 +160,17 @@ class DataExchanger():
         logging.info(f'Downloaded model {model_uuid}({model_format}) to {target_folder}.')
         return created_files

-    async def upload_model_get_uuid(self, context: Context, files: List[str], training_number: Optional[int], mformat: str) ->
-        """Used by the trainers. Function returns the new model uuid to use for detection.
+    async def upload_model_get_uuid(self, context: Context, files: List[str], training_number: Optional[int], mformat: str) -> str:
+        """Used by the trainers. Function returns the new model uuid to use for detection.
+
+        :return: The new model uuid.
+        :raise CriticalError: If the upload does not return status code 200.
+        """
         response = await self.loop_communicator.put(f'/{context.organization}/projects/{context.project}/trainings/{training_number}/models/latest/{mformat}/file', files=files)
         if response.status_code != 200:
             logging.error(f'Could not upload model for training {training_number}, format {mformat}: {response.text}')
-
-
+            raise CriticalError(
+                f'Could not upload model for training {training_number}, format {mformat}: {response.text}')

         uploaded_model = response.json()
         logging.info(f'Uploaded model for training {training_number}, format {mformat}. Response is: {uploaded_model}')
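The behavioural change here: a failed upload no longer falls through but raises CriticalError (imported from the new learning_loop_node/trainer/exceptions.py). A hedged sketch of how calling code might react; the wrapper below is illustrative and not part of the package:

from learning_loop_node.trainer.exceptions import CriticalError

async def upload_or_none(data_exchanger, context, files, training_number, mformat):
    """Illustrative wrapper: translate the new CriticalError back into an old-style None result."""
    try:
        return await data_exchanger.upload_model_get_uuid(context, files, training_number, mformat)
    except CriticalError:
        # the trainer treats this as non-retryable and moves the training to ReadyForCleanup
        return None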
{learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/detector_node.py
RENAMED
@@ -27,6 +27,7 @@ from .rest import about as rest_about
 from .rest import backdoor_controls
 from .rest import detect as rest_detect
 from .rest import operation_mode as rest_mode
+from .rest import outbox_mode as rest_outbox_mode
 from .rest import upload as rest_upload
 from .rest.operation_mode import OperationMode

@@ -57,6 +58,7 @@ class DetectorNode(Node):
         self.include_router(rest_upload.router, prefix="")
         self.include_router(rest_mode.router, tags=["operation_mode"])
         self.include_router(rest_about.router, tags=["about"])
+        self.include_router(rest_outbox_mode.router, tags=["outbox_mode"])

         if use_backdoor_controls:
             self.include_router(backdoor_controls.router)
@@ -89,7 +91,7 @@

     async def on_startup(self) -> None:
         try:
-            self.outbox.start_continuous_upload()
+            self.outbox.ensure_continuous_upload()
             self.detector_logic.load_model()
         except Exception:
             self.log.exception("error during 'startup'")
@@ -97,7 +99,7 @@

     async def on_shutdown(self) -> None:
         try:
-            self.outbox.stop_continuous_upload()
+            self.outbox.ensure_continuous_upload_stopped()
             for sid in self.connected_clients:
                 # pylint: disable=no-member
                 await self.sio.disconnect(sid)  # type:ignore
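The startup/shutdown hooks now call the idempotent ensure_* methods of the new Outbox. A generic, self-contained sketch of that start/stop pattern (illustrative names, not the package API):

import threading
import time


class BackgroundWorker:
    """Idempotent start/stop around a polling thread, similar in spirit to the outbox upload thread."""

    def __init__(self) -> None:
        self.stop_event = threading.Event()
        self.thread: threading.Thread | None = None

    def ensure_started(self, work, interval: float = 5.0) -> None:
        if self.thread and self.thread.is_alive():
            return  # already running; calling this twice is safe
        self.stop_event.clear()
        self.thread = threading.Thread(target=self._loop, args=(work, interval), daemon=True)
        self.thread.start()

    def _loop(self, work, interval: float) -> None:
        while not self.stop_event.is_set():
            work()
            time.sleep(interval)

    def ensure_stopped(self, timeout: float = 31.0) -> bool:
        if not (self.thread and self.thread.is_alive()):
            return True  # already stopped
        self.stop_event.set()
        self.thread.join(timeout)
        return not self.thread.is_alive()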
learning_loop_node-0.10.7/learning_loop_node/detector/outbox.py
@@ -0,0 +1,185 @@
+import json
+import logging
+import os
+import shutil
+import time
+from dataclasses import asdict
+from datetime import datetime
+from enum import Enum
+from glob import glob
+from io import BufferedReader, TextIOWrapper
+from multiprocessing import Event
+from multiprocessing.synchronize import Event as SyncEvent
+from threading import Thread
+from typing import List, Optional
+
+import requests
+from fastapi.encoders import jsonable_encoder
+
+from ..data_classes import Detections
+from ..globals import GLOBALS
+from ..helpers import environment_reader
+
+
+class OutboxMode(Enum):
+    CONTINUOUS_UPLOAD = 'continuous_upload'
+    STOPPED = 'stopped'
+
+
+class Outbox():
+    def __init__(self) -> None:
+        self.log = logging.getLogger()
+        self.path = f'{GLOBALS.data_folder}/outbox'
+        os.makedirs(self.path, exist_ok=True)
+
+        self.log = logging.getLogger()
+        host = environment_reader.host()
+        o = environment_reader.organization()
+        p = environment_reader.project()
+
+        assert o and p, 'Outbox needs an organization and a project '
+        base_url = f'http{"s" if "learning-loop.ai" in host else ""}://{host}/api'
+        base: str = base_url
+        self.target_uri = f'{base}/{o}/projects/{p}/images'
+        self.log.info('Outbox initialized with target_uri: %s', self.target_uri)
+
+        self.BATCH_SIZE = 20
+        self.UPLOAD_TIMEOUT_S = 30
+
+        self.shutdown_event: SyncEvent = Event()
+        self.upload_process: Optional[Thread] = None
+
+    def save(self, image: bytes, detections: Optional[Detections] = None, tags: Optional[List[str]] = None) -> None:
+        if detections is None:
+            detections = Detections()
+        if not tags:
+            tags = []
+        identifier = datetime.now().isoformat(sep='_', timespec='milliseconds')
+        tmp = f'{GLOBALS.data_folder}/tmp/{identifier}'
+        detections.tags = tags
+        detections.date = identifier
+        os.makedirs(tmp, exist_ok=True)
+
+        with open(tmp + '/image.json', 'w') as f:
+            json.dump(jsonable_encoder(asdict(detections)), f)
+
+        with open(tmp + '/image.jpg', 'wb') as f:
+            f.write(image)
+
+        if os.path.exists(tmp):
+            os.rename(tmp, self.path + '/' + identifier)  # NOTE rename is atomic so upload can run in parallel
+        else:
+            self.log.error('Could not rename %s to %s', tmp, self.path + '/' + identifier)
+
+    def get_data_files(self):
+        return glob(f'{self.path}/*')
+
+    def ensure_continuous_upload(self):
+        self.log.debug('start_continuous_upload')
+        if self._upload_process_alive():
+            self.log.debug('Upload thread already running')
+            return
+
+        self.shutdown_event.clear()
+        self.upload_process = Thread(target=self._continuous_upload, name='OutboxUpload')
+        self.upload_process.start()
+
+    def _continuous_upload(self):
+        self.log.info('continuous upload started')
+        assert self.shutdown_event is not None
+        while not self.shutdown_event.is_set():
+            self.upload()
+            time.sleep(5)
+        self.log.info('continuous upload ended')
+
+    def upload(self):
+        items = self.get_data_files()
+        if items:
+            self.log.info('Found %s images to upload', len(items))
+            for i in range(0, len(items), self.BATCH_SIZE):
+                batch_items = items[i:i+self.BATCH_SIZE]
+                if self.shutdown_event.is_set():
+                    break
+                try:
+                    self._upload_batch(batch_items)
+                except Exception:
+                    self.log.exception('Could not upload files')
+        else:
+            self.log.info('No images found to upload')
+
+    def _upload_batch(self, items: List[str]):
+        data: List[tuple[str, TextIOWrapper | BufferedReader]] = []
+        data = [('files', open(f'{item}/image.json', 'r')) for item in items]
+        data += [('files', open(f'{item}/image.jpg', 'rb')) for item in items]
+
+        response = requests.post(self.target_uri, files=data, timeout=self.UPLOAD_TIMEOUT_S)
+        if response.status_code == 200:
+            for item in items:
+                shutil.rmtree(item, ignore_errors=True)
+            self.log.info('Uploaded %s images successfully', len(items))
+        elif response.status_code == 422:
+            if len(items) == 1:
+                self.log.error('Broken content in image: %s\n Skipping.', items[0])
+                shutil.rmtree(items[0], ignore_errors=True)
+                return
+
+            self.log.exception('Broken content in batch. Splitting and retrying')
+            self._upload_batch(items[:len(items)//2])
+            self._upload_batch(items[len(items)//2:])
+        else:
+            self.log.error('Could not upload images: %s', response.content)
+
+    def ensure_continuous_upload_stopped(self) -> bool:
+        self.log.debug('Outbox: Ensuring continuous upload')
+        if not self._upload_process_alive():
+            self.log.debug('Upload thread already stopped')
+            return True
+        proc = self.upload_process
+        if not proc:
+            return True
+
+        try:
+            assert self.shutdown_event is not None
+            self.shutdown_event.set()
+            assert proc is not None
+            proc.join(self.UPLOAD_TIMEOUT_S + 1)
+        except Exception:
+            self.log.exception('Error while shutting down upload thread: ')
+
+        if proc.is_alive():
+            self.log.error('Upload thread did not terminate')
+            return False
+
+        self.log.info('Upload thread terminated')
+        return True
+
+    def _upload_process_alive(self) -> bool:
+        return bool(self.upload_process and self.upload_process.is_alive())
+
+    def get_mode(self) -> OutboxMode:
+        ''':return: current mode ('continuous_upload' or 'stopped')'''
+        if self.upload_process and self.upload_process.is_alive():
+            current_mode = OutboxMode.CONTINUOUS_UPLOAD
+        else:
+            current_mode = OutboxMode.STOPPED
+
+        self.log.debug('Outbox: Current mode is %s', current_mode)
+        return current_mode
+
+    def set_mode(self, mode: OutboxMode | str):
+        ''':param mode: 'continuous_upload' or 'stopped'
+        :raises ValueError: if mode is not a valid OutboxMode
+        :raises TimeoutError: if the upload thread does not terminate within 31 seconds with mode='stopped'
+        '''
+        if isinstance(mode, str):
+            mode = OutboxMode(mode)
+
+        if mode == OutboxMode.CONTINUOUS_UPLOAD:
+            self.ensure_continuous_upload()
+        elif mode == OutboxMode.STOPPED:
+            try:
+                self.ensure_continuous_upload_stopped()
+            except TimeoutError as e:
+                raise TimeoutError(f'Upload thread did not terminate within {self.UPLOAD_TIMEOUT_S} seconds.') from e
+
+        self.log.debug('set outbox mode to %s', mode)
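A standalone illustration of the 422 handling in _upload_batch above: a rejected batch is bisected until the single broken item is isolated and dropped, so one bad image cannot block the whole outbox. The post_batch callable is a stand-in for the HTTP call; this sketch is not part of the package:

from typing import Callable, List


def upload_with_bisection(items: List[str], post_batch: Callable[[List[str]], int]) -> None:
    """Recursively split a rejected batch; drop only the single item the server cannot process."""
    if not items:
        return
    status = post_batch(items)
    if status == 200:
        return  # whole batch accepted
    if status == 422:
        if len(items) == 1:
            print(f'dropping broken item {items[0]}')
            return
        mid = len(items) // 2
        upload_with_bisection(items[:mid], post_batch)
        upload_with_bisection(items[mid:], post_batch)
    else:
        print(f'upload failed with status {status}; keeping items for a later retry')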
learning_loop_node-0.10.7/learning_loop_node/detector/rest/outbox_mode.py
@@ -0,0 +1,35 @@
+from fastapi import APIRouter, HTTPException, Request
+from fastapi.responses import PlainTextResponse
+
+from ..outbox import Outbox
+
+router = APIRouter()
+
+
+@router.get("/outbox_mode")
+async def get_outbox_mode(request: Request):
+    '''
+    Example Usage
+        curl http://localhost/outbox_mode
+    '''
+    outbox: Outbox = request.app.outbox
+    return PlainTextResponse(outbox.get_mode().value)
+
+
+@router.put("/outbox_mode")
+async def put_outbox_mode(request: Request):
+    '''
+    Example Usage
+        curl -X PUT -d "continuous_upload" http://localhost/outbox_mode
+        curl -X PUT -d "stopped" http://localhost/outbox_mode
+    '''
+    outbox: Outbox = request.app.outbox
+    content = str(await request.body(), 'utf-8')
+    try:
+        outbox.set_mode(content)
+    except TimeoutError as e:
+        raise HTTPException(202, 'Setting has not completed, yet: ' + str(e)) from e
+    except ValueError as e:
+        raise HTTPException(422, 'Could not set outbox mode: ' + str(e)) from e
+
+    return "OK"
{learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/rest/upload.py
RENAMED
@@ -1,7 +1,10 @@
-from typing import List
+from typing import TYPE_CHECKING, List

 from fastapi import APIRouter, File, Request, UploadFile

+if TYPE_CHECKING:
+    from ..detector_node import DetectorNode
+
 router = APIRouter()


@@ -13,5 +16,6 @@ async def upload_image(request: Request, files: List[UploadFile] = File(...)):
     curl -X POST -F 'files=@test.jpg' "http://localhost:/upload"
     """
     raw_files = [await file.read() for file in files]
-
+    node: DetectorNode = request.app
+    await node.upload_images(raw_files)
     return 200, "OK"
{learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/tests/test_client_communication.py
RENAMED
@@ -102,3 +102,19 @@ async def test_about_endpoint(test_detector_node: DetectorNode):
     assert response_dict['state'] == 'online'
     assert response_dict['target_model'] == '1.1'
     assert any(c.name == 'purple point' for c in model_information.categories)
+
+
+async def test_rest_outbox_mode(test_detector_node: DetectorNode):
+    await asyncio.sleep(3)
+
+    def check_switch_to_mode(mode: str):
+        response = requests.put(f'http://localhost:{GLOBALS.detector_port}/outbox_mode',
+                                data=mode, timeout=30)
+        assert response.status_code == 200
+        response = requests.get(f'http://localhost:{GLOBALS.detector_port}/outbox_mode', timeout=30)
+        assert response.status_code == 200
+        assert response.content == mode.encode()
+
+    check_switch_to_mode('stopped')
+    check_switch_to_mode('continuous_upload')
+    check_switch_to_mode('stopped')
{learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/detector/tests/test_outbox.py
RENAMED
@@ -1,5 +1,6 @@
 import os
 import shutil
+from time import sleep

 import numpy as np
 import pytest
@@ -21,6 +22,7 @@ def test_outbox():
     os.mkdir(test_outbox.path)

     yield test_outbox
+    test_outbox.set_mode('stopped')
     shutil.rmtree(test_outbox.path, ignore_errors=True)


@@ -52,11 +54,7 @@ def test_saving_opencv_image(test_outbox: Outbox):

 def test_saving_binary(test_outbox: Outbox):
     assert len(test_outbox.get_data_files()) == 0
-
-    img.save('/tmp/image.jpg')
-    with open('/tmp/image.jpg', 'rb') as f:
-        data = f.read()
-    test_outbox.save(data)
+    save_test_image_to_outbox(test_outbox)
     assert len(test_outbox.get_data_files()) == 1


@@ -66,3 +64,23 @@ async def test_files_are_automatically_uploaded(test_detector_node: DetectorNode
     assert len(test_detector_node.outbox.get_data_files()) == 1

     assert len(test_detector_node.outbox.get_data_files()) == 1
+
+
+def test_set_outbox_mode(test_outbox: Outbox):
+    test_outbox.set_mode('stopped')
+    save_test_image_to_outbox(outbox=test_outbox)
+    sleep(6)
+    assert len(test_outbox.get_data_files()) == 1, 'File was cleared even though outbox should be stopped'
+    test_outbox.set_mode('continuous_upload')
+    sleep(6)
+    assert len(test_outbox.get_data_files()) == 0, 'File was not cleared even though outbox should be in continuous_upload'
+
+### Helper functions ###
+
+
+def save_test_image_to_outbox(outbox: Outbox):
+    img = Image.new('RGB', (60, 30), color=(73, 109, 137))
+    img.save('/tmp/image.jpg')
+    with open('/tmp/image.jpg', 'rb') as f:
+        data = f.read()
+    outbox.save(data)
{learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/tests/states/test_state_detecting.py
RENAMED
@@ -20,7 +20,7 @@ async def test_successful_detecting(test_initialized_trainer: TestingTrainerLogic):
         model_uuid_for_detecting='00000000-0000-0000-0000-000000000011')  # NOTE: this is the hard coded model uuid for zauberzeug/demo (model version 1.1)

     _ = asyncio.get_running_loop().create_task(
-        trainer._perform_state('
+        trainer._perform_state('detecting', TrainerState.Detecting, TrainerState.Detected, trainer._do_detections))

     await assert_training_state(trainer.training, TrainerState.Detecting, timeout=1, interval=0.001)
     await assert_training_state(trainer.training, TrainerState.Detected, timeout=10, interval=0.001)
|
|
|
54
54
|
async def test_bad_server_response_content(test_initialized_trainer: TestingTrainerLogic):
|
|
55
55
|
"""Set the training state to confusion_matrix_synced and try to upload the model.
|
|
56
56
|
This should fail because the server response is not a valid model id.
|
|
57
|
-
The training should be aborted and the training state should be set to
|
|
57
|
+
The training should be aborted and the training state should be set to ready_for_cleanup."""
|
|
58
58
|
trainer = test_initialized_trainer
|
|
59
59
|
|
|
60
60
|
create_active_training_file(trainer, training_state=TrainerState.ConfusionMatrixSynced)
|
|
@@ -64,10 +64,10 @@ async def test_bad_server_response_content(test_initialized_trainer: TestingTrai
|
|
|
64
64
|
|
|
65
65
|
await assert_training_state(trainer.training, TrainerState.TrainModelUploading, timeout=1, interval=0.001)
|
|
66
66
|
# TODO goes to finished because of the error
|
|
67
|
-
await assert_training_state(trainer.training, TrainerState.
|
|
67
|
+
await assert_training_state(trainer.training, TrainerState.ReadyForCleanup, timeout=2, interval=0.001)
|
|
68
68
|
|
|
69
69
|
assert trainer_has_error(trainer)
|
|
70
|
-
assert trainer.training.training_state == TrainerState.
|
|
70
|
+
assert trainer.training.training_state == TrainerState.ReadyForCleanup
|
|
71
71
|
assert trainer.training.model_uuid_for_detecting is None
|
|
72
72
|
assert trainer.node.last_training_io.load() == trainer.training
|
|
73
73
|
|
|
@@ -81,8 +81,7 @@ async def test_mock_loop_response_example(mocker: MockerFixture, test_initialize
|
|
|
81
81
|
trainer._init_from_last_training()
|
|
82
82
|
|
|
83
83
|
# pylint: disable=protected-access
|
|
84
|
-
|
|
85
|
-
assert result is not None
|
|
84
|
+
await trainer._upload_model_return_new_model_uuid(Context(organization='zauberzeug', project='demo'))
|
|
86
85
|
|
|
87
86
|
|
|
88
87
|
def mock_upload_model_for_training(mocker, return_value):
|
|
{learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/tests/testing_trainer_logic.py
RENAMED
@@ -59,7 +59,7 @@ class TestingTrainerLogic(TrainerLogic):
         await super()._upload_model()
         await asyncio.sleep(0.1)  # give tests a bit time to to check for the state

-    async def _upload_model_return_new_model_uuid(self, context: Context) ->
+    async def _upload_model_return_new_model_uuid(self, context: Context) -> str:
         await asyncio.sleep(0.1)  # give tests a bit time to to check for the state
         result = await super()._upload_model_return_new_model_uuid(context)
         await asyncio.sleep(0.1)  # give tests a bit time to to check for the state
{learning_loop_node-0.10.6 → learning_loop_node-0.10.7}/learning_loop_node/trainer/trainer_logic_generic.py
RENAMED
@@ -14,11 +14,14 @@ from ..data_classes import (Context, Errors, Hyperparameter, PretrainedModel, TrainerState,
                             TrainingOut, TrainingStateData)
 from ..helpers.misc import create_project_folder, delete_all_training_folders, generate_training, is_valid_uuid4
 from .downloader import TrainingsDownloader
+from .exceptions import CriticalError
 from .io_helpers import ActiveTrainingIO, EnvironmentVars, LastTrainingIO

 if TYPE_CHECKING:
     from .trainer_node import TrainerNode

+logger = logging.getLogger('learning_loop_node.trainer_logic_generic')
+

 class TrainerLogicGeneric(ABC):

@@ -175,7 +178,7 @@ class TrainerLogicGeneric(ABC):
         """
         if not self.training_active and self.last_training_io.exists():
             self._init_from_last_training()
-
+            logger.info('found incomplete training, continuing now.')
             asyncio.get_event_loop().create_task(self._run())
             return True
         return False
@@ -207,7 +210,7 @@ class TrainerLogicGeneric(ABC):

         self._active_training_io = ActiveTrainingIO(
             self._training.training_folder, self.node.loop_communicator, context)
-
+        logger.info(f'new training initialized: {self._training}')

     async def _run(self) -> None:
         """Called on `begin_training` event from the Learning Loop.
@@ -219,18 +222,21 @@ class TrainerLogicGeneric(ABC):
             await self.training_task  # NOTE: Task object is used to potentially cancel the task
         except asyncio.CancelledError:
             if not self.shutdown_event.is_set():
-
+                logger.info('CancelledError in _run - training task was cancelled but not by shutdown event')
                 self.training.training_state = TrainerState.ReadyForCleanup
                 self.last_training_io.save(self.training)
                 await self._clear_training()
+                self._may_restart()
+            else:
+                logger.info('CancelledError in _run - shutting down')
         except Exception as e:
-
+            logger.exception(f'Error in train: {e}')

     # ---------------------------------------- TRAINING STATES ----------------------------------------

     async def _training_loop(self) -> None:
         """Cycle through the training states until the training is finished or
-
+        a critical error occurs (asyncio.CancelledError or CriticalError).
         """
         assert self.training_active

@@ -252,13 +258,20 @@ class TrainerLogicGeneric(ABC):
             await self._perform_state('detecting', TrainerState.Detecting, TrainerState.Detected, self._do_detections)
         elif tstate == TrainerState.Detected:  # -> DetectionUploading -> ReadyForCleanup
             await self._perform_state('upload_detections', TrainerState.DetectionUploading, TrainerState.ReadyForCleanup, self.active_training_io.upload_detetions)
-        elif tstate == TrainerState.ReadyForCleanup:  # -> RESTART or
+        elif tstate == TrainerState.ReadyForCleanup:  # -> Idle (RESTART or _training = None)
             await self._clear_training()
             self._may_restart()

     async def _perform_state(self, error_key: str, state_during: TrainerState, state_after: TrainerState, action: Callable[[], Coroutine], reset_early=False):
+        '''
+        Perform a training state and handle errors.
+        - If the loop sends a StopTraining event, this will raise a CancelledError.
+        - States can raise a CriticalError indicating that there is no point in retrying the state.
+        - If any other error occurs, the error is stored in the errors object and the state is reset to the previous state.
+        '''
+
         await asyncio.sleep(0.1)
-
+        logger.info(f'Performing state: {state_during}')
         previous_state = self.training.training_state
         self.training.training_state = state_during
         await asyncio.sleep(0.1)
@@ -266,21 +279,30 @@ class TrainerLogicGeneric(ABC):
         self.errors.reset(error_key)

         try:
-
-
-            state_after = TrainerState.ReadyForCleanup
+            await action()
+
         except asyncio.CancelledError:
-
-
+            if self.shutdown_event.is_set():
+                logger.info(f'CancelledError in {state_during} - shutdown event set')
+                raise
+            logger.info(f'CancelledError in {state_during} - cleaning up')
+            self.training.training_state = TrainerState.ReadyForCleanup
+        except CriticalError as e:
+            logger.error(f'CriticalError in {state_during} - Exception: {e}')
+            self.errors.set(error_key, str(e))
+            self.training.training_state = TrainerState.ReadyForCleanup
         except Exception as e:
             self.errors.set(error_key, str(e))
-
+            logger.exception('Error in %s - Exception: %s', state_during, e)
             self.training.training_state = previous_state
+            return
         else:
+            logger.info(f'Successfully finished state: {state_during}')
             if not reset_early:
                 self.errors.reset(error_key)
             self.training.training_state = state_after
-
+
+        self.last_training_io.save(self.training)

     async def _prepare(self) -> None:
         """Downloads images to the images_folder and saves annotations to training.data.image_data.
@@ -300,11 +322,11 @@ class TrainerLogicGeneric(ABC):

         # TODO this checks if we continue a training -> make more explicit
         if not base_model_uuid or not is_valid_uuid4(base_model_uuid):
-
+            logger.info(f'skipping model download. No base model provided (in form of uuid): {base_model_uuid}')
             return

-
-
+        logger.info('loading model from Learning Loop')
+        logger.info(f'downloading model {base_model_uuid} as {self.model_format}')
         await self.node.data_exchanger.download_model(self.training.training_folder, self.training.context, base_model_uuid, self.model_format)
         shutil.move(f'{self.training.training_folder}/model.json',
                     f'{self.training.training_folder}/base_model.json')
@@ -327,12 +349,12 @@ class TrainerLogicGeneric(ABC):
             result = await self.node.sio_client.call('update_training', (
                 self.training.context.organization, self.training.context.project, jsonable_encoder(new_training)))
             if isinstance(result, dict) and result['success']:
-
+                logger.info(f'successfully updated training {asdict(new_training)}')
                 self._on_metrics_published(new_best_model)
             else:
                 raise Exception(f'Error for update_training: Response from loop was : {result}')
         except Exception as e:
-
+            logger.exception('Error during confusion matrix syncronization')
             self.errors.set(error_key, str(e))
             raise
         self.errors.reset(error_key)
@@ -341,21 +363,22 @@ class TrainerLogicGeneric(ABC):
         """Uploads the latest model to the Learning Loop.
         """
         new_model_uuid = await self._upload_model_return_new_model_uuid(self.training.context)
-
-            self.training.training_state = TrainerState.ReadyForCleanup
-            logging.error('could not upload model - maybe training failed.. cleaning up')
-        logging.info(f'Successfully uploaded model and received new model id: {new_model_uuid}')
+        logger.info(f'Successfully uploaded model and received new model id: {new_model_uuid}')
         self.training.model_uuid_for_detecting = new_model_uuid

-    async def _upload_model_return_new_model_uuid(self, context: Context) ->
+    async def _upload_model_return_new_model_uuid(self, context: Context) -> str:
         """Upload model files, usually pytorch model (.pt) hyp.yaml and the converted .wts file.
         Note that with the latest trainers the conversion to (.wts) is done by the trainer.
         The conversion from .wts to .engine is done by the detector (needs to be done on target hardware).
-        Note that trainer may train with different classes, which is why we send an initial model.json file.
+        Note that trainer may train with different classes, which is why we send an initial model.json file.
+
+        :return: The new model UUID.
+        :raise CriticalError: If the latest model files cannot be obtained.
+        """

         files = await self._get_latest_model_files()
         if files is None:
-
+            raise CriticalError('Could not get latest model files. Training might have failed.')

         if isinstance(files, List):
             files = {self.model_format: files}
@@ -369,8 +392,6 @@ class TrainerLogicGeneric(ABC):
         assert len([f for f in _files if 'model.json' in f]) == 1, "model.json must be included exactly once"

         model_uuid = await self.node.data_exchanger.upload_model_get_uuid(context, _files, self.training.training_number, file_format)
-        if model_uuid is None:
-            return None

         already_uploaded_formats.append(file_format)
         self.active_training_io.save_model_upload_progress(already_uploaded_formats)
@@ -411,23 +432,23 @@ class TrainerLogicGeneric(ABC):
         if not self.training_active:
             return
         if self.training_task:
-
+            logger.info('cancelling training task')
            if self.training_task.cancel():
                try:
                    await self.training_task
                except asyncio.CancelledError:
                    pass
-
+                logger.info('cancelled training task')
                self._may_restart()

     def _may_restart(self) -> None:
         """If the environment variable RESTART_AFTER_TRAINING is set, the trainer will restart after a training.
         """
         if self._environment_vars.restart_after_training:
-
+            logger.info('restarting')
             sys.exit(0)
         else:
-
+            logger.info('not restarting')
     # ---------------------------------------- ABSTRACT METHODS ----------------------------------------

     @abstractmethod
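A simplified, generic sketch of the error handling documented in the _perform_state docstring above (names, the plain string states and the dict of errors are illustrative, not the package implementation):

import asyncio


class CriticalError(Exception):
    """Raised by a state when retrying makes no sense."""


async def perform_state(training, errors: dict, key: str, state_during: str, state_after: str, action, shutting_down) -> None:
    previous_state = training.state
    training.state = state_during
    try:
        await action()
    except asyncio.CancelledError:
        if shutting_down():
            raise                              # node is going down: propagate the cancellation
        training.state = 'ready_for_cleanup'   # stop was requested: clean up instead of retrying
    except CriticalError as e:
        errors[key] = str(e)
        training.state = 'ready_for_cleanup'   # unrecoverable: skip straight to cleanup
    except Exception as e:
        errors[key] = str(e)
        training.state = previous_state        # transient error: roll back so the state is retried
    else:
        errors.pop(key, None)
        training.state = state_after           # success: advance to the next state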
learning_loop_node-0.10.6/learning_loop_node/detector/outbox.py
@@ -1,117 +0,0 @@
-import json
-import logging
-import os
-import shutil
-import time
-from dataclasses import asdict
-from datetime import datetime
-from glob import glob
-from multiprocessing import Event
-from multiprocessing.synchronize import Event as SyncEvent
-from threading import Thread
-from typing import List, Optional
-
-import requests
-from fastapi.encoders import jsonable_encoder
-
-from ..data_classes import Detections
-from ..globals import GLOBALS
-from ..helpers import environment_reader
-
-
-class Outbox():
-
-    def __init__(self) -> None:
-        self.log = logging.getLogger()
-        self.path = f'{GLOBALS.data_folder}/outbox'
-        os.makedirs(self.path, exist_ok=True)
-
-        host = environment_reader.host()
-        o = environment_reader.organization()
-        p = environment_reader.project()
-
-        assert o and p, 'Outbox needs an organization and a project '
-        base_url = f'http{"s" if "learning-loop.ai" in host else ""}://{host}/api'
-        base: str = base_url
-        self.target_uri = f'{base}/{o}/projects/{p}/images'
-        self.log.info(f'Outbox initialized with target_uri: {self.target_uri}')
-
-        self.shutdown_event: Optional[SyncEvent] = None
-        self.upload_process: Optional[Thread] = None
-
-    def save(self, image: bytes, detections: Optional[Detections] = None, tags: Optional[List[str]] = None) -> None:
-        if detections is None:
-            detections = Detections()
-        if not tags:
-            tags = []
-        identifier = datetime.now().isoformat(sep='_', timespec='milliseconds')
-        tmp = f'{GLOBALS.data_folder}/tmp/{identifier}'
-        detections.tags = tags
-        detections.date = identifier
-        os.makedirs(tmp, exist_ok=True)
-
-        with open(tmp + '/image.json', 'w') as f:
-            json.dump(jsonable_encoder(asdict(detections)), f)
-
-        with open(tmp + '/image.jpg', 'wb') as f:
-            f.write(image)
-
-        if os.path.exists(tmp):
-            os.rename(tmp, self.path + '/' + identifier)  # NOTE rename is atomic so upload can run in parallel
-        else:
-            self.log.error(f'Could not rename {tmp} to {self.path}/{identifier}')
-
-    def get_data_files(self):
-        return glob(f'{self.path}/*')
-
-    def start_continuous_upload(self):
-        self.shutdown_event = Event()
-        self.upload_process = Thread(target=self._continuous_upload)
-        self.upload_process.start()
-
-    def _continuous_upload(self):
-        self.log.info('start continuous upload')
-        assert self.shutdown_event is not None
-        while not self.shutdown_event.is_set():
-            self.upload()
-            time.sleep(1)
-        self.log.info('stop continuous upload')
-
-    def upload(self):
-        items = self.get_data_files()
-        if items:
-            self.log.info(f'Found {len(items)} images to upload')
-        for item in items:
-            if self.shutdown_event and self.shutdown_event.is_set():
-                break
-            try:
-                data = [('files', open(f'{item}/image.json', 'r')),
-                        ('files', open(f'{item}/image.jpg', 'rb'))]
-
-                response = requests.post(self.target_uri, files=data, timeout=30)
-                if response.status_code == 200:
-                    shutil.rmtree(item)
-                    self.log.info(f'uploaded {item} successfully')
-                elif response.status_code == 422:
-                    self.log.error(f'Broken content in {item}: dropping this data')
-                    shutil.rmtree(item)
-                else:
-                    self.log.error(f'Could not upload {item}: {response.status_code}')
-            except Exception:
-                self.log.exception('could not upload files')
-
-    def stop_continuous_upload(self, timeout=5):
-        proc = self.upload_process
-        if not proc:
-            return
-
-        try:
-            assert self.shutdown_event is not None
-            self.shutdown_event.set()
-            assert proc is not None
-            proc.join(timeout)
-        except Exception:
-            logging.exception('error while shutting down upload thread')
-
-        if proc.is_alive():
-            self.log.error('upload thread did not terminate')
All remaining files are renamed from learning_loop_node-0.10.6/… to learning_loop_node-0.10.7/… without content changes (see the file list above, entries with +0 -0).