caption-flow 0.2.3__py3-none-any.whl → 0.2.4__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: caption-flow
- Version: 0.2.3
+ Version: 0.2.4
  Summary: Self-contained distributed community captioning system
  Author-email: bghira <bghira@users.github.com>
  License: MIT
@@ -33,6 +33,8 @@ Requires-Dist: arrow<2.0.0,>=1.3.0
  Requires-Dist: datasets<5.0.0,>=4.0.0
  Requires-Dist: boto3<2.0.0,>=1.40.11
  Requires-Dist: torchdata<0.12.0,>=0.11.0
+ Requires-Dist: textual<6.0.0,>=5.3.0
+ Requires-Dist: urwid<4.0.0,>=3.0.2
  Provides-Extra: dev
  Requires-Dist: pytest>=7.4.0; extra == "dev"
  Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
@@ -44,12 +46,13 @@ Dynamic: license-file

  # CaptionFlow

- scalable, fault-tolerant **vLLM-powered image captioning**. this "first round" focuses on a fast websocket orchestrator plus lightweight gpu workers that batch requests through vLLM.
+ scalable, fault-tolerant **vLLM-powered image captioning**.
+
+ a fast websocket-based orchestrator paired with lightweight gpu workers that batch requests through vLLM for high throughput.

  * **orchestrator**: hands out work in chunked shards, collects captions, checkpoints progress, and keeps simple stats.
  * **workers (vLLM)**: connect to the orchestrator, stream in image samples, batch them, and generate 1..N captions per image using prompts supplied by the orchestrator.
  * **config-driven**: all components read YAML config; flags can override.
- * **tui monitor (optional)**: a monitor client is wired into the CLI; ship a `monitor` module to enable it.

  > no conda. just `venv` + `pip`.

@@ -103,6 +106,25 @@ caption-flow worker --config my-worker.yaml --server ws://your.hostname.address:
  caption-flow monitor --config my-monitor.yaml
  ```

+ 5. export the data
+
+ ```bash
+ % caption-flow export --help
+ Usage: caption-flow export [OPTIONS]
+
+ Export caption data to various formats.
+
+ Options:
+ --format [jsonl|json|csv|txt|huggingface_hub|all]  Export format (default: jsonl)
+ ```
+
+ * **jsonl**: creates a JSON Lines file at the specified `--output` path
+ * **csv**: exports CSV-compatible data columns to the `--output` path; metadata in this format is incomplete
+ * **json**: creates a `.json` file for each sample inside the `--output` subdirectory, containing **complete** metadata; useful for webdatasets
+ * **txt**: creates a `.txt` file for each sample inside the `--output` subdirectory, containing ONLY captions
+ * **huggingface_hub**: creates a dataset on the Hugging Face Hub; pass `--private` and `--nsfw` where necessary
+ * **all**: creates all export formats in the specified `--output` directory
+
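A usage sketch for the options listed above; only the `--format` values and the `--output` flag shown there are assumed, and the output paths are placeholders:

```bash
# one JSON Lines record per caption
caption-flow export --format jsonl --output ./exports/captions.jsonl

# write every supported format into a single directory
caption-flow export --format all --output ./exports/
```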
  ---

  ## how it’s wired
@@ -111,20 +133,11 @@ caption-flow monitor --config my-monitor.yaml
111
133
 
112
134
  * **websocket server** (default `0.0.0.0:8765`) with three client roles: workers, data-feeders, and admin.
113
135
  * **dataset control**: the orchestrator centrally defines the dataset (`huggingface` or `local`) and version/name. it chunk-slices shards and assigns work.
136
+ * **data serving to remote workers**: local files can be captioned by remote workers that don't have access to the same files, automatically.
114
137
  * **vLLM config broadcast**: model, tp size, dtype, max seq len, memory targets, batching, sampling params, and **inference prompts** are all pushed to workers; workers can apply many changes without a model reload.
115
138
  * **storage + checkpoints**: captions buffer to disk with periodic checkpoints. chunk state is tracked so restarts don’t double-work.
116
139
  * **auth**: token lists for `worker`, `monitor`, and `admin` roles.
117
140
 
118
- start flags you’ll likely use:
119
-
120
- ```text
121
- --config PATH # yaml config for the orchestrator
122
- --port INT, --host STR # bind controls
123
- --data-dir PATH # overrides storage.data_dir
124
- --cert PATH, --key PATH # enable TLS (or use --no-ssl for ws:// in dev)
125
- --vllm # use the vLLM-style orchestrator (webdataset/hf)
126
- ```
127
-
128
141
  ### vLLM worker
129
142
 
130
143
  * **one process per gpu**. select the device with `--gpu-id` (or `worker.gpu_id` in YAML).
@@ -132,27 +145,15 @@ start flags you’ll likely use:
  * **resilient**: detects disconnects, abandons the current chunk cleanly, clears queues, reconnects, and resumes.
  * **batched generate()**: images are resized down for consistent batching; each image can get multiple captions (one per prompt).

- start flags you’ll likely use:
-
- ```text
- --config PATH # yaml for the worker
- --server URL # ws(s)://host:port
- --token STR # must match an allowed worker token on the orchestrator
- --name STR # display name
- --batch-size INT # override vLLM batch size
- --vllm # use the vLLM worker implementation
- --gpu-id INT # which gpu to use
- --precision STR, --model STR # optional overrides for dtype/model
- --no-verify-ssl # accept self-signed certs in dev
- ```
-
- ### (optional) monitor
+ ---

- * a CLI entry exists for a TUI monitor; wire in a `monitor` module to enable it. config lives in `monitor.yaml` or inside `orchestrator.yaml` under `monitor:`.
+ ## dataset formats

- ---
+ * URL-list datasets, hosted on the huggingface hub or locally, that are compatible with the `datasets` library
+ * webdataset shards containing full image data; these can also be hosted on the hub
+ * a local folder of images; the orchestrator will serve the data to workers
 
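As a rough sketch, the orchestrator-side `dataset:` block can point at any of these sources. The keys mirror the example `orchestrator.yaml` from the 0.2.3 README; the `local` type value is an assumption inferred from the wiring notes above, so the exact schema may differ:

```yaml
dataset:
  type: huggingface                # assumption: "local" for a folder of images served by the orchestrator
  path: <hf-dataset-or-local-path>
  name: <logical-name>
  version: "1.0"
```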
- ## configuration
+ ## configuration path

  ### config discovery order

@@ -166,98 +167,6 @@ for any component, the CLI looks for config in this order (first match wins):
  6. any `$XDG_CONFIG_DIRS` entries under `caption-flow/`
  7. `./examples/<component>.yaml` (fallback)

- ### orchestrator.yaml (highlights)
-
- ```yaml
- orchestrator:
- host: 0.0.0.0
- port: 8765
- # ssl:
- # cert: /path/fullchain.pem
- # key: /path/privkey.pem
-
- dataset:
- type: huggingface
- path: <hf-dataset-or-local-path>
- name: <logical-name>
- version: "1.0"
-
- vllm:
- model: Qwen/Qwen2.5-VL-3B-Instruct
- tensor_parallel_size: 1
- max_model_len: 16384
- dtype: float16
- gpu_memory_utilization: 0.92
- enforce_eager: true
- disable_mm_preprocessor_cache: true
- limit_mm_per_prompt: { image: 1 }
-
- batch_size: 8
-
- sampling:
- temperature: 0.7
- top_p: 0.95
- max_tokens: 256
- repetition_penalty: 1.05
- skip_special_tokens: true
- stop: ["<|end|>", "<|endoftext|>", "<|im_end|>"]
-
- inference_prompts:
- - "describe this image in detail"
- - "provide a comprehensive description of the visual content"
- - "what are the key elements in this image?"
-
- storage:
- data_dir: ./caption_data
- checkpoint_dir: ./checkpoints
- caption_buffer_size: 100
- checkpoint_interval: 1000
-
- # chunking/queueing
- chunk_size: 1000
- chunks_per_request: 2
- chunk_buffer_multiplier: 3
- min_chunk_buffer: 10
-
- auth:
- worker_tokens:
- - { token: "example-worker-token", name: "Example Worker" }
- monitor_tokens:
- - { token: "letmein", name: "Default monitor" }
- admin_tokens:
- - { token: "admin-secret-2024", name: "Admin" }
- ```
-
- ### worker.yaml (highlights)
-
- ```yaml
- worker:
- server: ws://localhost:8765 # use wss:// in prod
- token: example-worker-token
- name: local-gpu
- gpu_id: 0
- vllm: true
-
- # local queues
- readahead_size: 256
- inference_queue_size: 128
- ```
-
- ### monitor.yaml (optional)
-
- ```yaml
- monitor:
- server: ws://localhost:8765
- token: letmein
- refresh_rate: 1.0
- show_contributors: true
- show_quality_metrics: true
- max_activity_items: 20
- show_chunk_progress: true
- show_worker_queues: true
- show_throughput_graph: true
- ```
-

  ---

  ## tls / certificates
@@ -300,66 +209,24 @@ PRs welcome. keep it simple and fast.
  ```
  ┌─────────────┐ WebSocket ┌─────────────┐
  │ Worker │◄──────────────────►│ │
- └─────────────┘ │ │ ┌──────────────┐
- Orchestrator│────►│Arrow/Parquet │
- ┌─────────────┐ │ │ Storage │
- Worker │◄──────────────────►│ │ └──────────────┘
- └─────────────┘ └─────────────┘
+ │ │ │ │ ┌──────────────┐
+ │◄───────────────────│ │────►│Arrow/Parquet │
+ └─────────────┘ HTTP (img data) Orchestrator│ │ Storage │
+ │ │ └──────────────┘
+ ┌─────────────┐ │ │
+ │ Worker │◄──────────────────►│ │
+ │ │ │ │
+ │ │◄───────────────────│ │
+ └─────────────┘ HTTP (img data) └─────────────┘

  ┌─────────────┐ │
  │ Monitor │◄──────────────────────────┘
  └─────────────┘
  ```

- ## Storage Schema
-
- ### captions.parquet
-
- - `job_id`: Unique job identifier
- * `dataset`: Dataset name
- * `shard`: Shard identifier
- * `item_key`: Item within shard
- * `caption`: Generated caption text
- * `contributor_id`: Worker who generated it
- * `timestamp`: Generation time
- * `quality_score`: Optional quality metric
-
- ### jobs.parquet
-
- - `job_id`: Unique identifier
- * `dataset`: Dataset name
- * `shard`: Shard identifier
- * `status`: pending/processing/completed/failed
- * `assigned_to`: Worker ID
- * `timestamp`: Status change time
-
- ### contributors.parquet
-
- - `contributor_id`: Unique identifier
- * `name`: Display name
- * `total_captions`: Lifetime count
- * `trust_level`: Quality tier (0-5)
-
- ## Development
-
- ```bash
- # Install with dev dependencies
- pip install -e ".[dev]"
-
- # Run tests
- pytest
-
- # Format code
- black src/
- ruff --fix src/
-
- # Type checking
- mypy src/
- ```
-
- ## Community Contribution
+ ## Community Clusters

- To contribute compute:
+ To contribute compute to a cluster:

  1. Install caption-flow: `pip install caption-flow`
  2. Get a worker token from the project maintainer
@@ -369,4 +236,4 @@ Your contributions will be tracked and attributed in the final dataset!

  ## License

- MIT
+ AGPLv3
@@ -1,14 +1,17 @@
  caption_flow/__init__.py,sha256=NLPJ25lRN7xHqncXweINDNwbt0q8lgjZ30G21zlPdRs,303
- caption_flow/cli.py,sha256=qEueeJhf3DvxSBxnOp5t32p6gAnZskvIDe6cwtPA0-Y,28892
- caption_flow/models.py,sha256=bpr7yMy3vPErZCQwmgOYIix489rRGbT6lVw8wxxwTkc,4931
+ caption_flow/cli.py,sha256=t_cYCxJE7f5UtB3br2Es51JjO5KPsWM1JTdDXAxM_Lw,41371
+ caption_flow/models.py,sha256=2n6iphTEL62xK2FFcJM6axMsaE8KwsUv5Ak_cCF-TdQ,5652
  caption_flow/monitor.py,sha256=bAt9EJqfPgT_KdbknGdCxwBRH002pRDgyUmYIj6Dyso,7885
  caption_flow/orchestrator.py,sha256=ciqWghxUxk-5s6u7W3JwD7_JLSFYV57NgOwiMkxME-I,36133
- caption_flow/storage.py,sha256=Wqgtsk6yZ9Kf-izeUKHLwSvPUH3xFqIbzox20QHbc64,43370
+ caption_flow/viewer.py,sha256=HxO98eHR1xtivG0dEdYC2U9T_RgeRfJqqTK-37u9bNM,20471
  caption_flow/processors/__init__.py,sha256=hvq-OuAJWQe6hFglKe7QmkS8473k20FmxZDSxfXpCrg,423
  caption_flow/processors/base.py,sha256=JlTqCHo5HRXrXMVzgle_6pNwh4HGHsF7jLF6PeSnWr0,6783
  caption_flow/processors/huggingface.py,sha256=MNz9vDMtrrTOSXe9Q_kbBrQ7XBv69X6x5xD_QP9icdg,33765
  caption_flow/processors/local_filesystem.py,sha256=EYmsImbkqsIU7UZL2FijL0hotKLtPOtkzfwernQDSxA,27860
  caption_flow/processors/webdataset.py,sha256=xsrYx7_5FCqez30dc4hSDYfyA9A0oKqHqwt7CRc1J0c,33812
+ caption_flow/storage/__init__.py,sha256=IVnzcSCPpPuyp-QLlgJirRZ9Sb3tR0F4sfuF5u2cNMk,36
+ caption_flow/storage/exporter.py,sha256=mFJqMDQ61cP-qcXe118_-oL1TUqULdQZ8LdjSTym44I,19697
+ caption_flow/storage/manager.py,sha256=sNslCw0uhLHihYhbo_IQg9ycDPW2IEQt9nZVMKGPVyw,46265
  caption_flow/utils/__init__.py,sha256=F1BChVoCsj9zn1GJRBOLHET1kLW6xrAmsbzcR7hHy6Y,202
  caption_flow/utils/auth.py,sha256=UrxX2n8OEEcfMD1Ey27TxGfrJFmUCpC59x-SCrQJoVE,2253
  caption_flow/utils/caption_utils.py,sha256=esUMAdcCkNjRroZ0Bhxv0_yKlLtMf0XeDCTt-5k6bik,5309
@@ -27,9 +30,9 @@ caption_flow/utils/vllm_config.py,sha256=TC7Rmjk0zRKbBXbWUXrFL4Z58hzax_-4L0pXZn0
  caption_flow/workers/base.py,sha256=2AGWERC5hbmO-0V_A1MUbgRVvRNN3blqGPyDokvvzmM,7575
  caption_flow/workers/caption.py,sha256=_uvpdoBzym1TKWKXtky7hBfj8YnG1EaJz-NRwaH2X1A,36722
  caption_flow/workers/data.py,sha256=0Tg8NE0wdONeMlivYQ4nvbcfWdLuU51O7vR8_YSnJgo,14813
- caption_flow-0.2.3.dist-info/licenses/LICENSE,sha256=hIahDEOTzuHCU5J2nd07LWwkLW7Hko4UFO__ffsvB-8,34523
- caption_flow-0.2.3.dist-info/METADATA,sha256=bk5Gk3eWuDH_UWXPEDKulksPc3hVHvnzm3sstLbuU-0,11914
- caption_flow-0.2.3.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
- caption_flow-0.2.3.dist-info/entry_points.txt,sha256=KnVlyrGKZj6p2zNyuEnCx4Y6jvJ4V-mcfN0lddPKTlQ,55
- caption_flow-0.2.3.dist-info/top_level.txt,sha256=_bXpKRutqded0FQ80dCChIz26ETV7tL4d4e2E_Y1FXs,13
- caption_flow-0.2.3.dist-info/RECORD,,
+ caption_flow-0.2.4.dist-info/licenses/LICENSE,sha256=hIahDEOTzuHCU5J2nd07LWwkLW7Hko4UFO__ffsvB-8,34523
+ caption_flow-0.2.4.dist-info/METADATA,sha256=k6N1nH8rt-dpIxso4fed-H58IZdmomPw22X22G60z9k,9670
+ caption_flow-0.2.4.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
+ caption_flow-0.2.4.dist-info/entry_points.txt,sha256=KnVlyrGKZj6p2zNyuEnCx4Y6jvJ4V-mcfN0lddPKTlQ,55
+ caption_flow-0.2.4.dist-info/top_level.txt,sha256=_bXpKRutqded0FQ80dCChIz26ETV7tL4d4e2E_Y1FXs,13
+ caption_flow-0.2.4.dist-info/RECORD,,