caption-flow 0.1.0-py3-none-any.whl → 0.2.0-py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,427 +0,0 @@
- Metadata-Version: 2.4
- Name: caption-flow
- Version: 0.1.0
- Summary: Self-contained distributed community captioning system
- Author-email: bghira <bghira@users.github.com>
- License: MIT
- Keywords: captioning,distributed,vllm,dataset,community
- Classifier: Development Status :: 4 - Beta
- Classifier: Intended Audience :: Developers
- Classifier: License :: OSI Approved :: MIT License
- Classifier: Programming Language :: Python :: 3
- Classifier: Programming Language :: Python :: 3.10
- Classifier: Programming Language :: Python :: 3.11
- Classifier: Programming Language :: Python :: 3.12
- Requires-Python: <3.13,>=3.10
- Description-Content-Type: text/markdown
- License-File: LICENSE
- Requires-Dist: websockets>=12.0
- Requires-Dist: pyarrow>=14.0.0
- Requires-Dist: click>=8.1.0
- Requires-Dist: pydantic>=2.0.0
- Requires-Dist: aiofiles>=23.0.0
- Requires-Dist: rich>=13.0.0
- Requires-Dist: cryptography>=41.0.0
- Requires-Dist: pyyaml>=6.0
- Requires-Dist: certbot>=2.0.0
- Requires-Dist: numpy>=1.24.0
- Requires-Dist: pillow>=10.0.0
- Requires-Dist: vllm<0.11.0,>=0.10.0
- Requires-Dist: webdataset<2.0.0,>=1.0.2
- Requires-Dist: pandas<3.0.0,>=2.3.1
- Requires-Dist: arrow<2.0.0,>=1.3.0
- Provides-Extra: dev
- Requires-Dist: pytest>=7.4.0; extra == "dev"
- Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
- Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
- Requires-Dist: black>=23.0.0; extra == "dev"
- Requires-Dist: ruff>=0.1.0; extra == "dev"
- Requires-Dist: mypy>=1.5.0; extra == "dev"
- Dynamic: license-file
-
- # captionflow
- <img width="1024" height="768" alt="image" src="https://github.com/user-attachments/assets/67eae1b1-7545-4ade-a0b1-31484ba57af9" />
-
- ```bash
- $ pip install caption-flow
- $ caption-flow orchestrator|worker|monitor
- ```
-
- scalable, fault-tolerant **vllm-powered image captioning**. this "first round" focuses on a blazing-fast websocket orchestrator plus lightweight gpu workers that batch requests through vllm.
-
- **performance**: consumer 4090s often outpace h100s on smaller models (3b-7b) thanks to higher clock speeds and lower per-request overhead. we've seen 8-15 images/sec on a single 4090 with qwen2.5-vl-3b.
-
- * **orchestrator**: hands out work in chunked shards, collects captions, checkpoints progress, and keeps simple stats. handles 10k+ chunks/sec on commodity hardware.
- * **workers (vllm)**: connect to the orchestrator, stream in image samples, batch them, and generate 1..n captions per image using prompts supplied by the orchestrator.
- * **dataworkers** (coming soon): separate non-gpu clients that fetch/preprocess images and feed them to the orchestrator, freeing gpu workers to focus purely on inference.
- * **config-driven**: all components read yaml config; flags can override.
- * **tui monitor (optional)**: a monitor client is wired into the cli; ship a `monitor` module to enable it.
-
- > no conda. just `venv` + `pip`.
-
- ---
-
- ## install
-
- ```bash
- python -m venv .venv
- source .venv/bin/activate  # windows: .venv\Scripts\activate
- pip install --upgrade pip
- pip install -e .  # installs the `caption-flow` command
- # or: pip install -e git+ssh://git@github.com/bghira/captionflow
- ```
-
- ## quickstart (single box)
-
- 1. copy + edit the sample configs
-
- ```bash
- cp examples/orchestrator.yaml config/orchestrator.yaml
- cp examples/worker.yaml config/worker.yaml
- cp examples/monitor.yaml config/monitor.yaml
- ```
-
- set a unique shared token in both `config/orchestrator.yaml` and `config/worker.yaml` (see `auth.worker_tokens` in the orchestrator config and `worker.token` in the worker config). if you use private hugging face datasets/models, export `HUGGINGFACE_HUB_TOKEN` or run `hf auth login` (old style: `huggingface-cli login`) before starting workers.
-
- 2. start the orchestrator
-
- ```bash
- caption-flow orchestrator
- ```
-
- 3. start one or more vllm workers
-
- ```bash
- # gpu 0 on the same host
- caption-flow worker --gpu-id 0
-
- # your second gpu
- caption-flow worker --gpu-id 1
- ```
-
- 4. (optional) start the monitor to check on status
-
- ```bash
- caption-flow monitor
- ```
-
- 5. (optional) scan/fix chunks on disk if you had crashes or want to verify that all captions were actually received
-
- ```bash
- caption-flow scan_chunks --data-dir ./caption_data --checkpoint-dir ./checkpoints --fix
- ```
-
- ---
-
- ## how it's wired
-
- ### orchestrator
-
- * **websocket server** (default `0.0.0.0:8765`) with four client roles: workers, dataworkers, monitors, and admin.
- * **blazing fast**: handles 10,000+ chunks/sec and 10,000+ concurrent connections. the bottleneck is always gpu inference, never the orchestrator.
- * **dataset control**: the orchestrator centrally defines the dataset (`huggingface` or `local`) and version/name. it chunk-slices shards and assigns work.
- * **vllm config broadcast**: model, tp size, dtype, max seq len, memory targets, batching, sampling params, and **inference prompts** are all pushed to workers; workers can apply many changes without a model reload.
- * **storage + checkpoints**: captions buffer to disk with periodic checkpoints. chunk state is tracked so restarts don't double-work.
- * **auth**: token lists for `worker`, `dataworker`, `monitor`, and `admin` roles.
-
- start flags you'll likely use:
-
- ```text
- --config path             # yaml config for the orchestrator
- --port int, --host str    # bind controls
- --data-dir path           # overrides storage.data_dir
- --cert path, --key path   # enable tls (or use --no-ssl for ws:// in dev)
- --vllm                    # use the vllm-style orchestrator (webdataset/hf)
- ```
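-
- as a concrete illustration of the role/token model, a minimal monitor-style client could look like the sketch below. the message shape (`role`/`token` fields) is hypothetical and only illustrates the handshake; the real wire protocol lives in the orchestrator source.
-
- ```python
- # hypothetical monitor client; the auth message shape is illustrative only
- import asyncio, json
- import websockets  # already a caption-flow dependency
-
- async def watch(uri: str = "ws://localhost:8765", token: str = "letmein"):
-     async with websockets.connect(uri) as ws:
-         # announce role + token so the orchestrator can authorize this client
-         await ws.send(json.dumps({"role": "monitor", "token": token}))
-         async for message in ws:
-             print(json.loads(message))  # stats pushed by the orchestrator
-
- asyncio.run(watch())
- ```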
-
- ### vllm worker
-
- * **one process per gpu**. select the device with `--gpu-id` (or `worker.gpu_id` in yaml).
- * **gets its marching orders** from the orchestrator: dataset info, model, prompts, batch size, and sampling.
- * **resilient**: detects disconnects, abandons the current chunk cleanly, clears queues, reconnects, and resumes (see the reconnect sketch below).
- * **batched generate()**: images are resized down for consistent batching; each image can get multiple captions (one per prompt). a batching sketch follows the flags below.
- * **optimized for consumer gpus**: 4090s often beat h100s on 3b-7b models. higher boost clocks + lower kernel overhead = faster tokens/sec.
-
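- the reconnect behavior is the standard back-off-and-retry pattern; a minimal sketch (not the actual worker code, which also abandons its current chunk and clears local queues first):
-
- ```python
- # sketch of the worker's disconnect/reconnect loop
- import asyncio
- import websockets
-
- async def run_forever(uri: str):
-     while True:
-         try:
-             async with websockets.connect(uri) as ws:
-                 async for message in ws:
-                     ...  # process work assignments
-         except (websockets.ConnectionClosed, OSError):
-             pass  # orchestrator went away or the network hiccuped
-         await asyncio.sleep(5)  # back off, then reconnect and resume
- ```
-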
- start flags you'll likely use:
-
- ```text
- --config path                 # yaml for the worker
- --server url                  # ws(s)://host:port
- --token str                   # must match an allowed worker token on the orchestrator
- --batch-size int              # override vllm batch size
- --vllm                        # use the vllm worker implementation
- --gpu-id int                  # which gpu to use
- --precision str, --model str  # optional overrides for dtype/model
- --no-verify-ssl               # accept self-signed certs in dev
- ```
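-
- to make the batching bullet concrete, here's a rough sketch using vllm's public multimodal api. the model name and sampling values come from the orchestrator config below; the prompt template is the qwen2.5-vl chat format, and the real worker adds queuing, resize policy, and error handling on top:
-
- ```python
- # illustrative batching sketch: one request per (image, prompt) pair,
- # so each image yields one caption per configured prompt.
- from PIL import Image
- from vllm import LLM, SamplingParams
-
- PROMPT = (
-     "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
-     "{instruction}<|im_end|>\n<|im_start|>assistant\n"
- )
-
- llm = LLM(model="qwen/qwen2.5-vl-3b-instruct", max_model_len=16384)
- params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)
-
- prompts = ["describe this image in detail"]  # supplied by the orchestrator
- images = [Image.open(p).convert("RGB") for p in ("a.jpg", "b.jpg")]
- for img in images:
-     img.thumbnail((1024, 1024))  # resize down for consistent batching
-
- requests = [
-     {"prompt": PROMPT.format(instruction=p), "multi_modal_data": {"image": img}}
-     for img in images
-     for p in prompts
- ]
- outputs = llm.generate(requests, params)
- # keep only non-empty generations, as the worker does
- captions = [o.outputs[0].text.strip() for o in outputs if o.outputs[0].text.strip()]
- ```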
-
- ### dataworker (coming soon)
-
- * **cpu-only image fetching**: separate clients that handle dataset i/o, image loading, and preprocessing
- * **frees gpu workers**: gpu workers receive pre-loaded images, spending 100% of their time on inference
- * **scales horizontally**: spin up dozens of dataworkers on cpu nodes to saturate gpu throughput
- * **smart prefetching**: predictive loading keeps gpu workers fed with zero wait time
-
- ### (optional) monitor
-
- * a cli entry exists for a tui monitor; wire in a `monitor` module to enable it. config lives in `monitor.yaml` or inside `orchestrator.yaml` under `monitor:`.
-
- ---
-
- ## configuration
-
- ### config discovery order
-
- for any component, the cli looks for config in this order (first match wins; a sketch of the search follows the list):
-
- 1. `--config /path/to/file.yaml`
- 2. `./<component>.yaml` (current directory)
- 3. `./config/<component>.yaml` (config subdirectory)
- 4. `~/.caption-flow/<component>.yaml`
- 5. `$XDG_CONFIG_HOME/caption-flow/<component>.yaml`
- 6. `/etc/caption-flow/<component>.yaml`
- 7. any `$XDG_CONFIG_DIRS` entries under `caption-flow/`
- 8. `./examples/<component>.yaml` (fallback)
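-
- a minimal sketch of that search (the real cli may differ in details):
-
- ```python
- # sketch of the config discovery order above; first existing file wins
- import os
- from pathlib import Path
-
- def find_config(component: str, explicit: str | None = None) -> Path | None:
-     candidates = [Path(explicit)] if explicit else []          # 1. --config
-     candidates += [
-         Path(f"{component}.yaml"),                             # 2. cwd
-         Path("config") / f"{component}.yaml",                  # 3. ./config/
-         Path.home() / ".caption-flow" / f"{component}.yaml",   # 4. home dir
-     ]
-     if xdg := os.environ.get("XDG_CONFIG_HOME"):               # 5. xdg home
-         candidates.append(Path(xdg) / "caption-flow" / f"{component}.yaml")
-     candidates.append(Path("/etc/caption-flow") / f"{component}.yaml")      # 6.
-     for d in filter(None, os.environ.get("XDG_CONFIG_DIRS", "").split(":")):
-         candidates.append(Path(d) / "caption-flow" / f"{component}.yaml")   # 7.
-     candidates.append(Path("examples") / f"{component}.yaml")  # 8. fallback
-     return next((p for p in candidates if p.is_file()), None)
- ```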
-
- ### orchestrator.yaml (highlights)
-
- ```yaml
- orchestrator:
-   host: 0.0.0.0
-   port: 8765
-   # ssl:
-   #   cert: /path/fullchain.pem
-   #   key: /path/privkey.pem
-
- dataset:
-   type: huggingface # or "local"
-   path: <hf-dataset-or-local-path>
-   name: <logical-name>
-   version: "1.0"
-
- vllm:
-   model: qwen/qwen2.5-vl-3b-instruct
-   tensor_parallel_size: 1
-   max_model_len: 16384
-   dtype: float16
-   gpu_memory_utilization: 0.92
-   enforce_eager: true
-   disable_mm_preprocessor_cache: true
-   limit_mm_per_prompt: { image: 1 }
-
-   batch_size: 8
-
-   sampling:
-     temperature: 0.7
-     top_p: 0.95
-     max_tokens: 256
-     repetition_penalty: 1.05
-     skip_special_tokens: true
-     stop: ["<|end|>", "<|endoftext|>", "<|im_end|>"]
-
-   inference_prompts:
-     - "describe this image in detail"
-     - "provide a comprehensive description of the visual content"
-     - "what are the key elements in this image?"
-
- storage:
-   data_dir: ./caption_data
-   checkpoint_dir: ./checkpoints
-   caption_buffer_size: 100
-   checkpoint_interval: 1000
-
- # chunking/queueing
- chunk_size: 1000
- chunks_per_request: 2
- chunk_buffer_multiplier: 3
- min_chunk_buffer: 10
-
- auth:
-   worker_tokens:
-     - { token: "example-worker-token", name: "example worker" }
-   dataworker_tokens:
-     - { token: "dataworker-token", name: "data feeder 1" }
-   monitor_tokens:
-     - { token: "letmein", name: "default monitor" }
-   admin_tokens:
-     - { token: "admin-secret-2024", name: "admin" }
- ```
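-
- the chunking knobs interact roughly as follows; the exact buffer formula is an assumption (check the orchestrator source), but the arithmetic shows the intent:
-
- ```python
- # assumed relationship between the chunking knobs above; illustrative only
- chunk_size = 1000            # samples per chunk
- chunks_per_request = 2       # chunks handed out per work request
- chunk_buffer_multiplier = 3
- min_chunk_buffer = 10
- active_workers = 4
-
- # each work request hands a worker 2 * 1000 = 2000 samples
- samples_per_request = chunks_per_request * chunk_size
-
- # assumed: keep enough chunks pre-sliced that no worker ever waits
- target_buffer = max(min_chunk_buffer,
-                     active_workers * chunks_per_request * chunk_buffer_multiplier)
- print(samples_per_request, target_buffer)  # 2000 24
- ```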
-
- ### worker.yaml (highlights)
-
- ```yaml
- worker:
-   server: ws://localhost:8765 # use wss:// in prod
-   token: example-worker-token
-   name: local-gpu
-   gpu_id: 0
-   vllm: true
-
-   # local queues
-   readahead_size: 256
-   inference_queue_size: 128
- ```
-
- ### monitor.yaml (optional)
-
- ```yaml
- monitor:
-   server: ws://localhost:8765
-   token: letmein
-   refresh_rate: 1.0
-   show_contributors: true
-   show_quality_metrics: true
-   max_activity_items: 20
-   show_chunk_progress: true
-   show_worker_queues: true
-   show_throughput_graph: true
- ```
-
- ---
-
- ## performance notes
-
- **consumer gpus shine on smaller models, where cpu-side bottlenecks dominate**:
- - 4090 @ 3b model: 8-15 images/sec
- - 4090 @ 7b model: 8-12 images/sec
- - h100 @ 3b model: 2-10 images/sec (lower cpu clocks)
- - h100 @ 70b model: 2-10 images/sec (where the h100 belongs)
-
- **orchestrator throughput**:
- - 10,000+ chunks/sec on a typical ryzen / intel virtual machine
- - 10,000+ concurrent websocket connections
- - sub-millisecond chunk assignment latency
- - bottleneck is always gpu inference, never the orchestrator
-
- **scaling tips**:
- - use smaller models (3b-7b) for first-pass captioning
- - consumer gpus (4090/4080) offer the best perf/$ on these models
- - add dataworkers to prefetch and saturate gpu throughput
- - run multiple workers per node (one per gpu; see the launcher sketch below)
- - for b200, rtx 6000 pro, and other very fast gpus, running two worker processes per gpu (two tokens required) can squeeze out extra gpu utilization
-
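- a tiny hypothetical launcher for the one-worker-per-gpu pattern (the cli flags are real; the script itself is just an illustration):
-
- ```python
- # hypothetical launcher: one caption-flow worker process per gpu
- import subprocess
-
- NUM_GPUS = 2
- procs = [
-     subprocess.Popen(["caption-flow", "worker", "--gpu-id", str(gpu)])
-     for gpu in range(NUM_GPUS)
- ]
- for p in procs:
-     p.wait()  # keep the launcher alive while workers run
- ```
-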
-
- ---
-
- ## tls / certificates
-
- use the built-in helpers during development:
-
- ```bash
- # self-signed certs for quick local testing
- caption-flow generate_cert --self-signed --domain localhost --output-dir ./certs
-
- # inspect any certificate file
- caption-flow inspect_cert ./certs/fullchain.pem
- ```
-
- then point the orchestrator at the resulting cert/key (or run `--no-ssl` for dev-only ws://).
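-
- client-side, skipping verification for self-signed dev certs boils down to the standard python pattern below (a sketch of the concept behind `--no-verify-ssl`, not caption-flow's exact code):
-
- ```python
- # disable certificate verification -- development only, never production
- import ssl
-
- ctx = ssl.create_default_context()
- ctx.check_hostname = False
- ctx.verify_mode = ssl.CERT_NONE
- # e.g. websockets.connect("wss://localhost:8765", ssl=ctx)
- ```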
-
- ---
-
- ## tips & notes
-
- * **multi-gpu**: start one worker process per gpu (set `--gpu-id` or `worker.gpu_id`).
- * **throughput**: tune `vllm.batch_size` in the orchestrator config (or override with `--batch-size` at worker start). higher isn't always better; watch vram.
- * **prompts**: add more strings under `vllm.inference_prompts` to get multiple captions per image; the worker returns only non-empty generations.
- * **private hf**: if your dataset/model needs auth, export `HUGGINGFACE_HUB_TOKEN` before `caption-flow worker ...`.
- * **self-signed ssl**: pass `--no-verify-ssl` to workers/monitors in dev.
- * **recovery**: if you hard-crash mid-run and want to verify your database, `caption-flow scan_chunks --fix` can help but is basically never needed.
-
-
- ---
-
- ## architecture
-
- ```
-                           ┌──────────────┐
-                           │              │
- ┌─────────────┐ websocket │              │      ┌───────────────┐
- │ gpu worker  │◄──────────┤              ├─────►│ arrow/parquet │
- └─────────────┘           │              │      │    storage    │
-                           │ orchestrator │      └───────────────┘
- ┌─────────────┐           │              │
- │ gpu worker  │◄──────────┤    10k+      │      ┌───────────────┐
- └─────────────┘           │  chunks/sec  ├─────►│  checkpoints  │
-                           │              │      └───────────────┘
- ┌─────────────┐           │              │
- │ dataworker  │◄──────────┤              │
- └─────────────┘           │              │
-                           │              │
- ┌─────────────┐           │              │
- │ monitor     │◄──────────┤              │
- └─────────────┘           └──────────────┘
- ```
-
- ## storage schema
-
- ### captions.parquet
- - `job_id`: unique job identifier
- - `dataset`: dataset name
- - `shard`: shard identifier
- - `item_key`: item within shard
- - `caption`: generated caption text
- - `contributor_id`: worker who generated it
- - `timestamp`: generation time
- - `quality_score`: optional quality metric
-
- ### jobs.parquet
- - `job_id`: unique identifier
- - `dataset`: dataset name
- - `shard`: shard identifier
- - `status`: pending/processing/completed/failed
- - `assigned_to`: worker id
- - `timestamp`: status change time
-
- ### contributors.parquet
- - `contributor_id`: unique identifier
- - `name`: display name
- - `total_captions`: lifetime count
- - `trust_level`: quality tier (0-5)
-
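- the parquet files are plain arrow tables, so they're easy to inspect. a quick example using the documented columns (the filename/layout under `storage.data_dir` is an assumption):
-
- ```python
- # inspect captions.parquet with pyarrow + pandas (both caption-flow deps)
- import pyarrow.parquet as pq
-
- df = pq.read_table("./caption_data/captions.parquet").to_pandas()
-
- # captions per contributor, using the documented schema
- print(df.groupby("contributor_id")["caption"].count().sort_values(ascending=False))
- ```
-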
-
- ## development
-
- ```bash
- # install with dev dependencies
- pip install -e ".[dev]"
-
- # run tests
- pytest
-
- # format code
- black src/
- ruff check --fix src/
-
- # type checking
- mypy src/
- ```
-
- ## community contribution
-
- to contribute compute:
-
- 1. install caption-flow: `pip install caption-flow`
- 2. get a worker token from the project maintainer
- 3. run: `caption-flow worker --server wss://project.domain.com:8765 --token YOUR_TOKEN`
-
- your contributions will be tracked and attributed in the final dataset!
-
- ## roadmap
-
- in no particular order:
-
- * dataworker implementation for cpu-based data feeding
- * video captioning
- * hot config reload via the admin websocket path
- * richer monitor tui with real-time graphs
- * web interface
- * automatic continuous dataset exports to the huggingface hub
- * sequence-parallel inference for large vision models
- * discord interface
- * deeper integration for non-wds datasets
- * workflow chaining (2nd/3rd passes after an initial tag model, etc.)
- * distributed orchestrator clustering for planet-scale captioning
-
- prs welcome. keep it simple and fast.
-
- ## license
-
- AGPLv3
@@ -1,25 +0,0 @@
- caption_flow/__init__.py,sha256=tjEhGMENu9lIXtE-6PILOYSNd_cOtABnX4d121bA42Y,229
- caption_flow/cli.py,sha256=0Hpg-JGwqwCey6V2_va8Yu88b0qwYm2IK1VZpMp6fZs,28084
- caption_flow/models.py,sha256=s2ZSjfj8oTvgIxSE2mHxg7qYBvphW3fAu0dYRbFT3hs,1981
- caption_flow/monitor.py,sha256=W67yNhqYIEbejSBn0BWnweLV9JberfK19hGHDAlcKhY,7415
- caption_flow/orchestrator.py,sha256=146QFoMV23wU-LUxO9Mwx6wrgR1X-ZNb75CbYP8j_As,53762
- caption_flow/storage.py,sha256=R0CAFe-jZBu5j4zlrkE3fyqBZ-R3KtDAZlI7JBZIB0w,25987
- caption_flow/worker.py,sha256=V-vPtiDgX7it7Nchp-FbjIUXYFpj9br8hcfqD0agLZw,11183
- caption_flow/worker_data.py,sha256=e1JajMj0STJiPuIi4cbh8viAT4u6_hn6-VTRBuViWew,17497
- caption_flow/worker_vllm.py,sha256=nx6A5vowr8NUecb53JhFQ0i6aeetMNJJm_YzxI1JA_g,40023
- caption_flow/utils/__init__.py,sha256=i6qnzaSztaVI7Xx3y9fmol0i_cnhWZAphlC5L6CJLGM,136
- caption_flow/utils/auth.py,sha256=OGddz_NVKxH87IY1ny_X4PIAf1paNqQ0NG1li-RNK3s,2239
- caption_flow/utils/caption_utils.py,sha256=esUMAdcCkNjRroZ0Bhxv0_yKlLtMf0XeDCTt-5k6bik,5309
- caption_flow/utils/certificates.py,sha256=eu4blQZEkL9NRaY1ynQWg1asvDorRYhGRZea7STonJE,4635
- caption_flow/utils/chunk_tracker.py,sha256=wWl39O2UjiuZ8SCZkXbChEDKwTQICxBreVOG7JlJrS4,14869
- caption_flow/utils/dataset_loader.py,sha256=AfjYX1inR5-qH0hDaHaP7NDECQvgwFOCrH9dvtkip1A,6854
- caption_flow/utils/image_processor.py,sha256=ORIdATUFSDiUiSJPV--kDDlRxf9w9keScSfdFc54QO8,1366
- caption_flow/utils/job_queue.py,sha256=itdfXcrkvGjmXn4qtpgMF63k1ufRBaejDe4V6WcxzgU,1104
- caption_flow/utils/json_utils.py,sha256=IiZYn8uCM-3pYmyIbX2fmaOIyutArn67SqAyp0ggNpU,5396
- caption_flow/utils/vllm_config.py,sha256=TC7Rmjk0zRKbBXbWUXrFL4Z58hzax_-4L0pXZn09hdM,6019
- caption_flow-0.1.0.dist-info/licenses/LICENSE,sha256=hIahDEOTzuHCU5J2nd07LWwkLW7Hko4UFO__ffsvB-8,34523
- caption_flow-0.1.0.dist-info/METADATA,sha256=-zZDqdeec7tcyLe4iPLJd14eUyDbo_F4VJRXhqaQpXw,15304
- caption_flow-0.1.0.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
- caption_flow-0.1.0.dist-info/entry_points.txt,sha256=KnVlyrGKZj6p2zNyuEnCx4Y6jvJ4V-mcfN0lddPKTlQ,55
- caption_flow-0.1.0.dist-info/top_level.txt,sha256=_bXpKRutqded0FQ80dCChIz26ETV7tL4d4e2E_Y1FXs,13
- caption_flow-0.1.0.dist-info/RECORD,,