labelr 0.10.0__tar.gz → 0.11.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (48)
  1. labelr-0.11.0/PKG-INFO +230 -0
  2. labelr-0.11.0/README.md +200 -0
  3. {labelr-0.10.0 → labelr-0.11.0}/pyproject.toml +3 -2
  4. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/apps/datasets.py +140 -9
  5. labelr-0.11.0/src/labelr/apps/directus.py +212 -0
  6. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/apps/google_batch.py +38 -0
  7. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/apps/label_studio.py +260 -63
  8. labelr-0.11.0/src/labelr/apps/typer_description.py +2 -0
  9. labelr-0.11.0/src/labelr/check.py +147 -0
  10. labelr-0.11.0/src/labelr/config.py +57 -0
  11. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/export/object_detection.py +96 -18
  12. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/main.py +16 -0
  13. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/sample/object_detection.py +42 -13
  14. labelr-0.11.0/src/labelr.egg-info/PKG-INFO +230 -0
  15. {labelr-0.10.0 → labelr-0.11.0}/src/labelr.egg-info/SOURCES.txt +2 -0
  16. {labelr-0.10.0 → labelr-0.11.0}/src/labelr.egg-info/requires.txt +2 -1
  17. labelr-0.10.0/PKG-INFO +0 -158
  18. labelr-0.10.0/README.md +0 -129
  19. labelr-0.10.0/src/labelr/check.py +0 -86
  20. labelr-0.10.0/src/labelr/config.py +0 -1
  21. labelr-0.10.0/src/labelr.egg-info/PKG-INFO +0 -158
  22. {labelr-0.10.0 → labelr-0.11.0}/LICENSE +0 -0
  23. {labelr-0.10.0 → labelr-0.11.0}/setup.cfg +0 -0
  24. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/__init__.py +0 -0
  25. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/__main__.py +0 -0
  26. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/annotate.py +0 -0
  27. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/apps/__init__.py +0 -0
  28. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/apps/evaluate.py +0 -0
  29. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/apps/hugging_face.py +0 -0
  30. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/apps/train.py +0 -0
  31. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/dataset_features.py +0 -0
  32. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/evaluate/__init__.py +0 -0
  33. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/evaluate/object_detection.py +0 -0
  34. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/export/__init__.py +0 -0
  35. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/export/classification.py +0 -0
  36. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/export/common.py +0 -0
  37. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/export/llm.py +0 -0
  38. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/google_genai.py +0 -0
  39. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/project_config.py +0 -0
  40. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/sample/__init__.py +0 -0
  41. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/sample/classification.py +0 -0
  42. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/sample/common.py +0 -0
  43. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/sample/llm.py +0 -0
  44. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/types.py +0 -0
  45. {labelr-0.10.0 → labelr-0.11.0}/src/labelr/utils.py +0 -0
  46. {labelr-0.10.0 → labelr-0.11.0}/src/labelr.egg-info/dependency_links.txt +0 -0
  47. {labelr-0.10.0 → labelr-0.11.0}/src/labelr.egg-info/entry_points.txt +0 -0
  48. {labelr-0.10.0 → labelr-0.11.0}/src/labelr.egg-info/top_level.txt +0 -0
labelr-0.11.0/PKG-INFO ADDED
@@ -0,0 +1,230 @@
+ Metadata-Version: 2.4
+ Name: labelr
+ Version: 0.11.0
+ Summary: A command-line tool to manage labeling tasks with Label Studio.
+ Requires-Python: >=3.10
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: datasets>=3.2.0
+ Requires-Dist: imagehash>=4.3.1
+ Requires-Dist: label-studio-sdk>=1.0.8
+ Requires-Dist: more-itertools>=10.5.0
+ Requires-Dist: openfoodfacts>=2.9.0
+ Requires-Dist: typer>=0.15.1
+ Requires-Dist: google-cloud-batch==0.18.0
+ Requires-Dist: huggingface-hub
+ Requires-Dist: deepdiff>=8.6.1
+ Requires-Dist: rapidfuzz>=3.14.3
+ Requires-Dist: aiohttp
+ Requires-Dist: aiofiles
+ Requires-Dist: orjson
+ Requires-Dist: google-cloud-storage
+ Requires-Dist: gcloud-aio-storage
+ Requires-Dist: google-genai>=1.56.0
+ Requires-Dist: diskcache>=5.6.3
+ Provides-Extra: ultralytics
+ Requires-Dist: ultralytics==8.4.8; extra == "ultralytics"
+ Provides-Extra: fiftyone
+ Requires-Dist: fiftyone~=1.10.0; extra == "fiftyone"
+ Dynamic: license-file
+
+ # Labelr
+
+ Labelr is a command-line interface that provides a set of tools to help data scientists and machine learning engineers deal with ML data annotation, data preprocessing and format conversion.
+
+ This project started as a way to automate some of the tasks we do at Open Food Facts to manage data at different stages of the machine learning pipeline.
+
+ The CLI is currently integrated with Label Studio (for data annotation), Ultralytics (for object detection), Google Cloud Batch (for training) and Hugging Face (for model and dataset storage). It only works with a few specific tasks for now (object detection, image classification and image extraction using an LVLM), but it's meant to be extended to other tasks in the future.
+
+ For object detection and image classification models, it currently allows you to:
+
+ - create Label Studio projects
+ - upload images to Label Studio
+ - pre-annotate the tasks either with an existing object detection model or with a zero-shot model (Yolo-World or SAM), using Ultralytics
+ - perform data quality checks on Label Studio datasets
+ - export the data to Hugging Face or to local disk
+ - train the model on Google Batch (for object detection only)
+ - visualize the model predictions and compare them with the ground truth, using [Fiftyone](https://docs.voxel51.com/user_guide/index.html).
+
+ Labelr also supports managing datasets for fine-tuning large visual language models. It currently supports only a single task: structured extraction (JSON) from a single image.
+ The following features are supported:
+
+ - creating training datasets using Google Gemini Batch, from a list of images, textual instructions and a JSON schema
+ - uploading the dataset to Hugging Face
+ - manually or automatically fixing the model output using [Directus](https://directus.io/), a headless CMS used to manage the structured output
+ - exporting the dataset to Hugging Face
+
+ In addition, Labelr comes with two scripts that can be used to train ML models:
+
+ - in `packages/train-yolo`: the `main.py` script can be used to train an object detection model using Ultralytics. The training can be fully automated on Google Batch, and Labelr provides a CLI to launch Google Batch jobs.
+ - in `packages/train-unsloth`: the `main.py` script can be used to train a visual language model using Unsloth. The training is not yet automated on Google Batch, but the script can be used to train the model locally.
+
+ ## Installation
+
+ Python 3.10 or higher is required to run this CLI.
+
+ To install the CLI, simply run:
+
+ ```bash
+ pip install labelr
+ ```
+
+ We recommend installing the CLI in a virtual environment. You can use either pip or conda for that.
+
+ There are two optional dependencies that you can install to use the CLI:
+
+ - `ultralytics`: pre-annotate object detection datasets with an Ultralytics model (YOLO, Yolo-World)
+ - `fiftyone`: visualize the model predictions and compare them with the ground truth, using FiftyOne.
+
+ To install the ultralytics optional dependency, you can run:
+
+ ```bash
+ pip install labelr[ultralytics]
+ ```
+
+ ## Usage
+
+ ### Label Studio integration
+
+ To create a Label Studio project, you need to have a Label Studio instance running. Launching a Label Studio instance is out of the scope of this project, but you can follow the instructions in the [Label Studio documentation](https://labelstud.io/guide/install.html).
+
+ By default, the CLI assumes you're running Label Studio locally (URL: http://127.0.0.1:8080). You can change the URL by setting the `--label-studio-url` CLI option or by updating the configuration (see the [Configuration](#configuration) section below for more information).
+
+ For all the commands that interact with Label Studio, you need to provide an API key using the `--api-key` option, or through the configuration.
+
+ #### Create a project
+
+ Once you have a Label Studio instance running, you can create a project easily. First, you need to create a configuration file for the project. The configuration file is an XML file that defines the labeling interface and the labels to use for the project. You can find an example of a configuration file in the [Label Studio documentation](https://labelstud.io/guide/setup).
+
+ For an object detection task, a command allows you to create the configuration file automatically:
+
+ ```bash
+ labelr ls create-config-file --labels 'label1' --labels 'label2' --output-file label_config.xml
+ ```
+
+ where `label1` and `label2` are the labels you want to use for the object detection task, and `label_config.xml` is the output file that will contain the configuration.
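For illustration, the generated `label_config.xml` should look roughly like the following sketch, based on the generic Label Studio object detection template (the exact tag and attribute names emitted by `create-config-file` are not shown in this diff, so treat them as assumptions):

```xml
<View>
  <!-- The image to annotate; the value is read from the task data -->
  <Image name="image" value="$image_url"/>
  <!-- One bounding-box label per --labels option -->
  <RectangleLabels name="label" toName="image">
    <Label value="label1"/>
    <Label value="label2"/>
  </RectangleLabels>
</View>
```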
+
+ Then, you can create a project on Label Studio with the following command:
+
+ ```bash
+ labelr ls create --title my_project --api-key API_KEY --config-file label_config.xml
+ ```
+
+ where `API_KEY` is the API key of the Label Studio instance (the API key is available on the Account page), and `label_config.xml` is the configuration file of the project.
+
+ `ls` stands for Label Studio in the CLI.
+
+ #### Create a dataset file
+
+ For an object detection task, if you have a list of images, you can quickly create a dataset file with the following command:
+
+ ```bash
+ labelr ls create-dataset-file --input-file image_urls.txt --output-file dataset.json
+ ```
+
+ where `image_urls.txt` is a file containing the URLs of the images, one per line, and `dataset.json` is the output file.
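For example, `image_urls.txt` could look like this (placeholder URLs):

```text
https://example.com/images/product_001.jpg
https://example.com/images/product_002.jpg
https://example.com/images/product_003.jpg
```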
+
+ #### Import data
+
+ Next, import the generated data into a project with the following command:
+
+ ```bash
+ labelr ls import-data --project-id PROJECT_ID --dataset-path dataset.json
+ ```
+
+ where `PROJECT_ID` is the ID of the project you created.
+
+ #### Pre-annotate the data
+
+ To accelerate annotation, you can pre-annotate the images with an object detection model. We support three pre-annotation backends:
+
+ - `ultralytics`: use your own model or [Yolo-World](https://docs.ultralytics.com/models/yolo-world/), a zero-shot model that can detect any object using a text description of the object. You can specify the path or the name of the model with the `--model-name` option. If no model name is provided, the `yolov8x-worldv2.pt` model (Yolo-World) is used.
+ - `ultralytics_sam3`: use [SAM3](https://docs.ultralytics.com/models/sam-3/), another zero-shot model. We advise using this backend, as it's the most accurate. The `--model-name` option is ignored when this backend is used.
+ - `robotoff`: the ML backend of Open Food Facts (specific to Open Food Facts projects).
+
+ When using `ultralytics` or `ultralytics_sam3`, make sure you installed the labelr package with the `ultralytics` extra.
+
+ To pre-annotate the data with Ultralytics, use the following command:
+
+ ```bash
+ labelr ls add-prediction --project-id PROJECT_ID --backend ultralytics_sam3 --labels 'product' --labels 'price tag' --label-mapping '{"price tag": "price-tag"}'
+ ```
+
+ The SAM3 model will be automatically downloaded from Hugging Face. [SAM3](https://huggingface.co/facebook/sam3) is a gated model: you need to request permission to access it. Make sure you have been granted access before launching the command.
+
+ In the command above, `labels` is the list of labels to use for the object detection task (you can add as many labels as you want). You can also provide a `--label-mapping` option in case the label names of the model you use for pre-annotation differ from the names configured in your Label Studio project.
+
+ #### Add `train` and `val` splits
+
+ In most machine learning projects, you need to split your data into a training and a validation set. Assigning each sample to a split is required before exporting the dataset. To do so, you can use the following command:
+
+ ```bash
+ labelr ls add-split --train-split 0.8 --project-id PROJECT_ID
+ ```
+
+ For each task in the dataset, it randomly assigns 80% of the samples to the `train` split and 20% to the `val` split. The split is saved in the `split` field of the task `data`.
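For instance, a task's `data` payload might then look like the following (a hypothetical example; the `image_url` field name is illustrative, only the `split` field is documented here):

```json
{
  "image_url": "https://example.com/images/product_001.jpg",
  "split": "train"
}
```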
+
+ You can change the train/val ratio with the `--train-split` option. You can also assign specific samples to a split. For example, you can assign the `train` split to specific tasks by storing the task IDs in a file `task_ids.txt` and running the following command:
+
+ ```bash
+ labelr ls add-split --split-name train --task-id-file task_ids.txt --project-id PROJECT_ID
+ ```
+
+ #### Performing sanity checks on the dataset
+
+ Labelr can automatically detect some common data quality issues:
+
+ - broken image URLs
+ - duplicate tasks (based on the image hash)
+ - tasks with multiple annotations
+
+ To perform a check, run:
+
+ ```bash
+ labelr ls check-dataset --project-id PROJECT_ID
+ ```
+
+ The command will report the issues found. It is non-destructive by default, but you can use the `--delete-missing-images` and `--delete-duplicate-images` options to delete the tasks with missing images or duplicate images, respectively.
+
+ #### Export the data
+
+ Once the data is annotated, you can export it to a Hugging Face dataset or to local disk (Ultralytics format). To export it to disk, use the following command:
+
+ ```bash
+ labelr datasets export --project-id PROJECT_ID --from ls --to ultralytics --output-dir output --label-names 'product,price-tag'
+ ```
+
+ where `output` is the directory the data will be exported to. Currently, label names must be provided, as the CLI does not support fetching label names from Label Studio yet.
+
+ To export the data to a Hugging Face dataset, use the following command:
+
+ ```bash
+ labelr datasets export --project-id PROJECT_ID --from ls --to huggingface --repo-id REPO_ID --label-names 'product,price-tag'
+ ```
+
+ where `REPO_ID` is the ID of the Hugging Face repository where the dataset will be uploaded (e.g. `openfoodfacts/food-detection`).
+
+ ### Launch training jobs
+
+ You can also launch training jobs for YOLO object detection models using datasets hosted on Hugging Face. Please refer to the [train-yolo package README](packages/train-yolo/README.md) for more details on how to use this feature.
+
+ ## Configuration
+
+ Some Labelr settings can be configured using a configuration file or through environment variables. The configuration file is located at `~/.config/labelr/config.json`.
+
+ In order of precedence, the configuration is loaded from:
+
+ - CLI command options
+ - environment variables
+ - the configuration file
+
+ The following variables are currently supported:
+
+ - `label_studio_url`: URL of the Label Studio server. Can also be set with the `LABELR_LABEL_STUDIO_URL` environment variable.
+ - `label_studio_api_key`: API key for Label Studio. Can also be set with the `LABELR_LABEL_STUDIO_API_KEY` environment variable.
+
+ Labelr supports setting configuration values in the config file through the `config` command. For example, to set the Label Studio URL, you can run:
+
+ ```bash
+ labelr config label_studio_url http://127.0.0.1:8080
+ ```
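Equivalently, the same settings can be provided through the environment variables listed above, for example:

```bash
# Both variable names are documented in the Configuration section above
export LABELR_LABEL_STUDIO_URL=http://127.0.0.1:8080
export LABELR_LABEL_STUDIO_API_KEY=API_KEY
labelr ls check-dataset --project-id PROJECT_ID
```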
labelr-0.11.0/README.md ADDED
@@ -0,0 +1,200 @@
+ # Labelr
+
+ Labelr is a command-line interface that provides a set of tools to help data scientists and machine learning engineers deal with ML data annotation, data preprocessing and format conversion.
+
+ This project started as a way to automate some of the tasks we do at Open Food Facts to manage data at different stages of the machine learning pipeline.
+
+ The CLI is currently integrated with Label Studio (for data annotation), Ultralytics (for object detection), Google Cloud Batch (for training) and Hugging Face (for model and dataset storage). It only works with a few specific tasks for now (object detection, image classification and image extraction using an LVLM), but it's meant to be extended to other tasks in the future.
+
+ For object detection and image classification models, it currently allows you to:
+
+ - create Label Studio projects
+ - upload images to Label Studio
+ - pre-annotate the tasks either with an existing object detection model or with a zero-shot model (Yolo-World or SAM), using Ultralytics
+ - perform data quality checks on Label Studio datasets
+ - export the data to Hugging Face or to local disk
+ - train the model on Google Batch (for object detection only)
+ - visualize the model predictions and compare them with the ground truth, using [Fiftyone](https://docs.voxel51.com/user_guide/index.html).
+
+ Labelr also supports managing datasets for fine-tuning large visual language models. It currently supports only a single task: structured extraction (JSON) from a single image.
+ The following features are supported:
+
+ - creating training datasets using Google Gemini Batch, from a list of images, textual instructions and a JSON schema
+ - uploading the dataset to Hugging Face
+ - manually or automatically fixing the model output using [Directus](https://directus.io/), a headless CMS used to manage the structured output
+ - exporting the dataset to Hugging Face
+
+ In addition, Labelr comes with two scripts that can be used to train ML models:
+
+ - in `packages/train-yolo`: the `main.py` script can be used to train an object detection model using Ultralytics. The training can be fully automated on Google Batch, and Labelr provides a CLI to launch Google Batch jobs.
+ - in `packages/train-unsloth`: the `main.py` script can be used to train a visual language model using Unsloth. The training is not yet automated on Google Batch, but the script can be used to train the model locally.
+
+ ## Installation
+
+ Python 3.10 or higher is required to run this CLI.
+
+ To install the CLI, simply run:
+
+ ```bash
+ pip install labelr
+ ```
+
+ We recommend installing the CLI in a virtual environment. You can use either pip or conda for that.
+
+ There are two optional dependencies that you can install to use the CLI:
+
+ - `ultralytics`: pre-annotate object detection datasets with an Ultralytics model (YOLO, Yolo-World)
+ - `fiftyone`: visualize the model predictions and compare them with the ground truth, using FiftyOne.
+
+ To install the ultralytics optional dependency, you can run:
+
+ ```bash
+ pip install labelr[ultralytics]
+ ```
+
+ ## Usage
+
+ ### Label Studio integration
+
+ To create a Label Studio project, you need to have a Label Studio instance running. Launching a Label Studio instance is out of the scope of this project, but you can follow the instructions in the [Label Studio documentation](https://labelstud.io/guide/install.html).
+
+ By default, the CLI assumes you're running Label Studio locally (URL: http://127.0.0.1:8080). You can change the URL by setting the `--label-studio-url` CLI option or by updating the configuration (see the [Configuration](#configuration) section below for more information).
+
+ For all the commands that interact with Label Studio, you need to provide an API key using the `--api-key` option, or through the configuration.
+
+ #### Create a project
+
+ Once you have a Label Studio instance running, you can create a project easily. First, you need to create a configuration file for the project. The configuration file is an XML file that defines the labeling interface and the labels to use for the project. You can find an example of a configuration file in the [Label Studio documentation](https://labelstud.io/guide/setup).
+
+ For an object detection task, a command allows you to create the configuration file automatically:
+
+ ```bash
+ labelr ls create-config-file --labels 'label1' --labels 'label2' --output-file label_config.xml
+ ```
+
+ where `label1` and `label2` are the labels you want to use for the object detection task, and `label_config.xml` is the output file that will contain the configuration.
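As above, the generated `label_config.xml` should look roughly like this sketch, based on the generic Label Studio object detection template (the exact tags and attributes produced by `create-config-file` are not shown in this diff, so treat them as assumptions):

```xml
<View>
  <!-- The image to annotate; the value is read from the task data -->
  <Image name="image" value="$image_url"/>
  <!-- One bounding-box label per --labels option -->
  <RectangleLabels name="label" toName="image">
    <Label value="label1"/>
    <Label value="label2"/>
  </RectangleLabels>
</View>
```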
+
+ Then, you can create a project on Label Studio with the following command:
+
+ ```bash
+ labelr ls create --title my_project --api-key API_KEY --config-file label_config.xml
+ ```
+
+ where `API_KEY` is the API key of the Label Studio instance (the API key is available on the Account page), and `label_config.xml` is the configuration file of the project.
+
+ `ls` stands for Label Studio in the CLI.
+
+ #### Create a dataset file
+
+ For an object detection task, if you have a list of images, you can quickly create a dataset file with the following command:
+
+ ```bash
+ labelr ls create-dataset-file --input-file image_urls.txt --output-file dataset.json
+ ```
+
+ where `image_urls.txt` is a file containing the URLs of the images, one per line, and `dataset.json` is the output file.
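A minimal `image_urls.txt` might contain (placeholder URLs):

```text
https://example.com/images/product_001.jpg
https://example.com/images/product_002.jpg
https://example.com/images/product_003.jpg
```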
+
+ #### Import data
+
+ Next, import the generated data into a project with the following command:
+
+ ```bash
+ labelr ls import-data --project-id PROJECT_ID --dataset-path dataset.json
+ ```
+
+ where `PROJECT_ID` is the ID of the project you created.
+
+ #### Pre-annotate the data
+
+ To accelerate annotation, you can pre-annotate the images with an object detection model. We support three pre-annotation backends:
+
+ - `ultralytics`: use your own model or [Yolo-World](https://docs.ultralytics.com/models/yolo-world/), a zero-shot model that can detect any object using a text description of the object. You can specify the path or the name of the model with the `--model-name` option. If no model name is provided, the `yolov8x-worldv2.pt` model (Yolo-World) is used.
+ - `ultralytics_sam3`: use [SAM3](https://docs.ultralytics.com/models/sam-3/), another zero-shot model. We advise using this backend, as it's the most accurate. The `--model-name` option is ignored when this backend is used.
+ - `robotoff`: the ML backend of Open Food Facts (specific to Open Food Facts projects).
+
+ When using `ultralytics` or `ultralytics_sam3`, make sure you installed the labelr package with the `ultralytics` extra.
+
+ To pre-annotate the data with Ultralytics, use the following command:
+
+ ```bash
+ labelr ls add-prediction --project-id PROJECT_ID --backend ultralytics_sam3 --labels 'product' --labels 'price tag' --label-mapping '{"price tag": "price-tag"}'
+ ```
+
+ The SAM3 model will be automatically downloaded from Hugging Face. [SAM3](https://huggingface.co/facebook/sam3) is a gated model: you need to request permission to access it. Make sure you have been granted access before launching the command.
+
+ In the command above, `labels` is the list of labels to use for the object detection task (you can add as many labels as you want). You can also provide a `--label-mapping` option in case the label names of the model you use for pre-annotation differ from the names configured in your Label Studio project.
+
+ #### Add `train` and `val` splits
+
+ In most machine learning projects, you need to split your data into a training and a validation set. Assigning each sample to a split is required before exporting the dataset. To do so, you can use the following command:
+
+ ```bash
+ labelr ls add-split --train-split 0.8 --project-id PROJECT_ID
+ ```
+
+ For each task in the dataset, it randomly assigns 80% of the samples to the `train` split and 20% to the `val` split. The split is saved in the `split` field of the task `data`.
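After the command runs, a task's `data` payload might then look like this (hypothetical; the `image_url` field name is illustrative, only the `split` field is documented here):

```json
{
  "image_url": "https://example.com/images/product_001.jpg",
  "split": "train"
}
```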
+
+ You can change the train/val ratio with the `--train-split` option. You can also assign specific samples to a split. For example, you can assign the `train` split to specific tasks by storing the task IDs in a file `task_ids.txt` and running the following command:
+
+ ```bash
+ labelr ls add-split --split-name train --task-id-file task_ids.txt --project-id PROJECT_ID
+ ```
+
+ #### Performing sanity checks on the dataset
+
+ Labelr can automatically detect some common data quality issues:
+
+ - broken image URLs
+ - duplicate tasks (based on the image hash)
+ - tasks with multiple annotations
+
+ To perform a check, run:
+
+ ```bash
+ labelr ls check-dataset --project-id PROJECT_ID
+ ```
+
+ The command will report the issues found. It is non-destructive by default, but you can use the `--delete-missing-images` and `--delete-duplicate-images` options to delete the tasks with missing images or duplicate images, respectively.
+
+ #### Export the data
+
+ Once the data is annotated, you can export it to a Hugging Face dataset or to local disk (Ultralytics format). To export it to disk, use the following command:
+
+ ```bash
+ labelr datasets export --project-id PROJECT_ID --from ls --to ultralytics --output-dir output --label-names 'product,price-tag'
+ ```
+
+ where `output` is the directory the data will be exported to. Currently, label names must be provided, as the CLI does not support fetching label names from Label Studio yet.
+
+ To export the data to a Hugging Face dataset, use the following command:
+
+ ```bash
+ labelr datasets export --project-id PROJECT_ID --from ls --to huggingface --repo-id REPO_ID --label-names 'product,price-tag'
+ ```
+
+ where `REPO_ID` is the ID of the Hugging Face repository where the dataset will be uploaded (e.g. `openfoodfacts/food-detection`).
+
+ ### Launch training jobs
+
+ You can also launch training jobs for YOLO object detection models using datasets hosted on Hugging Face. Please refer to the [train-yolo package README](packages/train-yolo/README.md) for more details on how to use this feature.
+
+ ## Configuration
+
+ Some Labelr settings can be configured using a configuration file or through environment variables. The configuration file is located at `~/.config/labelr/config.json`.
+
+ In order of precedence, the configuration is loaded from:
+
+ - CLI command options
+ - environment variables
+ - the configuration file
+
+ The following variables are currently supported:
+
+ - `label_studio_url`: URL of the Label Studio server. Can also be set with the `LABELR_LABEL_STUDIO_URL` environment variable.
+ - `label_studio_api_key`: API key for Label Studio. Can also be set with the `LABELR_LABEL_STUDIO_API_KEY` environment variable.
+
+ Labelr supports setting configuration values in the config file through the `config` command. For example, to set the Label Studio URL, you can run:
+
+ ```bash
+ labelr config label_studio_url http://127.0.0.1:8080
+ ```
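The same settings can also be supplied via the environment variables listed above, for example:

```bash
# Both variable names are documented in the Configuration section above
export LABELR_LABEL_STUDIO_URL=http://127.0.0.1:8080
export LABELR_LABEL_STUDIO_API_KEY=API_KEY
labelr ls check-dataset --project-id PROJECT_ID
```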
{labelr-0.10.0 → labelr-0.11.0}/pyproject.toml RENAMED
@@ -1,6 +1,6 @@
  [project]
  name = "labelr"
- version = "0.10.0"
+ version = "0.11.0"
  description = "A command-line tool to manage labeling tasks with Label Studio."
  readme = "README.md"
  requires-python = ">=3.10"
@@ -21,6 +21,7 @@ dependencies = [
      "google-cloud-storage",
      "gcloud-aio-storage",
      "google-genai >= 1.56.0",
+     "diskcache>=5.6.3",
  ]
 
  [project.scripts]
@@ -28,7 +29,7 @@ labelr = "labelr.main:app"
 
  [project.optional-dependencies]
  ultralytics = [
-     "ultralytics==8.3.223",
+     "ultralytics==8.4.8",
  ]
  fiftyone = [
      "fiftyone~=1.10.0"
{labelr-0.10.0 → labelr-0.11.0}/src/labelr/apps/datasets.py RENAMED
@@ -18,7 +18,8 @@ from labelr.export.object_detection import (
      export_from_ls_to_ultralytics_object_detection,
  )
 
- from ..config import LABEL_STUDIO_DEFAULT_URL
+ from . import typer_description
+ from ..config import config
  from ..types import ExportDestination, ExportSource, TaskType
 
  app = typer.Typer()
@@ -125,7 +126,9 @@ def convert_object_detection_dataset(
  def export(
      from_: Annotated[ExportSource, typer.Option("--from", help="Input source to use")],
      to: Annotated[ExportDestination, typer.Option(help="Where to export the data")],
-     api_key: Annotated[Optional[str], typer.Option(envvar="LABEL_STUDIO_API_KEY")],
+     api_key: Annotated[
+         str | None, typer.Option(help=typer_description.LABEL_STUDIO_API_KEY)
+     ] = config.label_studio_api_key,
      task_type: Annotated[
          TaskType, typer.Option(help="Type of task to export")
      ] = TaskType.object_detection,
@@ -142,7 +145,16 @@ def export(
      project_id: Annotated[
          Optional[int], typer.Option(help="Label Studio Project ID")
      ] = None,
-     label_studio_url: Optional[str] = LABEL_STUDIO_DEFAULT_URL,
+     view_id: Annotated[
+         int | None,
+         typer.Option(
+             help="ID of the Label Studio view, if any. This option is useful "
+             "to filter the tasks to export."
+         ),
+     ] = None,
+     label_studio_url: Annotated[
+         str, typer.Option(help=typer_description.LABEL_STUDIO_URL)
+     ] = config.label_studio_url,
      output_dir: Annotated[
          Optional[Path],
          typer.Option(
@@ -163,11 +175,15 @@ def export(
      is_openfoodfacts_dataset: Annotated[
          bool,
          typer.Option(
-             help="Whether the Ultralytics dataset is an OpenFoodFacts dataset, only "
-             "for Ultralytics source. This is used to generate the correct image URLs "
-             "each image name."
+             help="Whether the Ultralytics dataset is an Open Food Facts dataset, only "
+             "for Ultralytics source. This is used:\n"
+             "- to generate the correct image URLs from each image name, when exporting "
+             "from Ultralytics to Hugging Face Datasets.\n"
+             "- to include additional metadata fields specific to Open Food Facts "
+             "(`barcode` and `off_image_id`) when exporting from Label Studio to "
+             "Hugging Face Datasets."
          ),
-     ] = True,
+     ] = False,
      openfoodfacts_flavor: Annotated[
          Flavor,
          typer.Option(
@@ -181,9 +197,18 @@ def export(
          float,
          typer.Option(
              help="Train ratio for splitting the dataset, if the split name is not "
-             "provided (typically, if the source is Label Studio)"
+             "provided. Only used if the source is Label Studio and the destination "
+             "is Ultralytics."
          ),
      ] = 0.8,
+     image_max_size: Annotated[
+         int | None,
+         typer.Option(
+             help="Maximum size (in pixels) for the images. If None, no resizing is "
+             "performed. Otherwise, the longest side of the image will be resized to "
+             "this value, keeping the aspect ratio."
+         ),
+     ] = None,
      error_raise: Annotated[
          bool,
          typer.Option(
@@ -260,9 +285,12 @@ def export(
              repo_id=repo_id,
              label_names=typing.cast(list[str], label_names_list),
              project_id=typing.cast(int, project_id),
+             is_openfoodfacts_dataset=is_openfoodfacts_dataset,
              merge_labels=merge_labels,
              use_aws_cache=use_aws_cache,
              revision=revision,
+             view_id=view_id,
+             image_max_size=image_max_size,
          )
      elif to == ExportDestination.ultralytics:
          export_from_ls_to_ultralytics_object_detection(
@@ -274,6 +302,8 @@ def export(
              error_raise=error_raise,
              merge_labels=merge_labels,
              use_aws_cache=use_aws_cache,
+             view_id=view_id,
+             image_max_size=image_max_size,
          )
 
      elif from_ == ExportSource.hf:
@@ -289,6 +319,7 @@ def export(
              error_raise=error_raise,
              use_aws_cache=use_aws_cache,
              revision=revision,
+             image_max_size=image_max_size,
          )
      else:
          raise typer.BadParameter("Unsupported export format")
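As a usage sketch, the two options added to `export` in the hunks above would be passed on the command line as follows (Typer derives `--view-id` and `--image-max-size` from the parameter names; the ID values are placeholders):

```bash
labelr datasets export --project-id 42 --from ls --to ultralytics \
  --output-dir output --label-names 'product,price-tag' \
  --view-id 7 --image-max-size 1024
```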
@@ -327,7 +358,8 @@ def export_llm_ds(
      tmp_dir: Annotated[
          Path | None,
          typer.Option(
-             help="Path to a temporary directory to use for image processing",
+             help="Path to the temporary directory used to store intermediate sample files "
+             "created when building the HF dataset.",
          ),
      ] = None,
      image_max_size: Annotated[
@@ -354,3 +386,102 @@ def export_llm_ds(
          tmp_dir=tmp_dir,
          image_max_size=image_max_size,
      )
+
+
+ @app.command()
+ def update_llm_ds(
+     dataset_path: Annotated[
+         Path, typer.Option(help="Path to the JSONL containing the updates.")
+     ],
+     repo_id: Annotated[
+         str, typer.Option(help="Hugging Face Datasets repository ID to update")
+     ],
+     split: Annotated[str, typer.Option(help="Dataset split to use")],
+     revision: Annotated[
+         str,
+         typer.Option(
+             help="Revision (branch, tag or commit) to use when pushing the new version "
+             "of the Hugging Face Dataset."
+         ),
+     ] = "main",
+     tmp_dir: Annotated[
+         Path | None,
+         typer.Option(
+             help="Path to a temporary directory to use for image processing",
+         ),
+     ] = None,
+     show_diff: Annotated[
+         bool,
+         typer.Option(
+             help="Show the differences between the original sample and the update. If "
+             "True, the updated dataset is not pushed to the Hub. Useful to review the "
+             "updates before applying them.",
+         ),
+     ] = False,
+ ):
+     """Update an existing LLM image extraction dataset, by updating the
+     `output` field of each sample in the dataset.
+
+     The `--dataset-path` JSONL file should contain items with two fields:
+
+     - `image_id`: The image ID of the sample to update in the Hugging Face
+       dataset.
+     - `output`: The new output data to set for the sample.
+     """
+     import sys
+     from difflib import Differ
+
+     import orjson
+     from datasets import load_dataset
+     from diskcache import Cache
+
+     dataset = load_dataset(repo_id, split=split)
+
+     # Populate cache with the updates
+     cache = Cache(directory=tmp_dir or None)
+     with dataset_path.open("r") as f:
+         for line in map(orjson.loads, f):
+             if "image_id" not in line or "output" not in line:
+                 raise ValueError(
+                     "Each item in the update JSONL file must contain `image_id` and `output` fields"
+                 )
+             image_id = line["image_id"]
+             output = line["output"]
+
+             if not isinstance(output, str):
+                 output = orjson.dumps(output).decode("utf-8")
+
+             cache[image_id] = output
+
+     def apply_updates(sample):
+         image_id = sample["image_id"]
+         if image_id in cache:
+             cached_item = cache[image_id]
+             sample["output"] = cached_item
+         return sample
+
+     if show_diff:
+         differ = Differ()
+         for sample in dataset:
+             image_id = sample["image_id"]
+             if image_id in cache:
+                 cached_item = orjson.loads(cache[image_id])
+                 original_item = orjson.loads(sample["output"])
+                 cached_item_str = orjson.dumps(
+                     cached_item, option=orjson.OPT_INDENT_2
+                 ).decode("utf8")
+                 original_item_str = orjson.dumps(
+                     original_item, option=orjson.OPT_INDENT_2
+                 ).decode("utf8")
+                 diff = list(
+                     differ.compare(
+                         original_item_str.splitlines(keepends=True),
+                         cached_item_str.splitlines(keepends=True),
+                     )
+                 )
+                 sys.stdout.writelines(diff)
+                 sys.stdout.write("\n" + "-" * 30 + "\n")
+
+     else:
+         updated_dataset = dataset.map(apply_updates, batched=False)
+         updated_dataset.push_to_hub(repo_id, split=split, revision=revision)
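For reference, an update file passed via `--dataset-path` contains one JSON object per line with the `image_id` and `output` fields the code above enforces; the values below are made up, and `output` may be either a JSON object or an already-serialized string (both are accepted, as shown by the `isinstance` check):

```json
{"image_id": "1234567890123_1", "output": {"product_name": "Organic oat milk", "quantity": "1 L"}}
{"image_id": "1234567890123_2", "output": "{\"product_name\": \"Dark chocolate\"}"}
```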