@kolbo/kolbo-code-linux-arm64-musl 2.1.5 → 2.1.7

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/bin/kolbo CHANGED
Binary file
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@kolbo/kolbo-code-linux-arm64-musl",
-   "version": "2.1.5",
+   "version": "2.1.7",
    "os": [
      "linux"
    ],
@@ -66,11 +66,11 @@ You have direct access to the Kolbo AI creative platform via MCP tools (auto-con
  | `get_moodboard` | Fetch a moodboard's master_prompt, style_guide, and images. |
  | `list_presets` | Browse generation presets (image/video/music templates with bundled style direction). |

- ### Chat
+ ### Chat & Vision

  | Tool | Description |
  |------|-------------|
- | `chat_send_message` | Send a message to Kolbo AI chat. Supports web search and deep think modes. |
+ | `chat_send_message` | Send a message to Kolbo AI chat. Pass `media_urls` (array of public URLs) to analyze images, videos, or audio — Smart Select auto-routes to Gemini vision when media is detected. Omit `model` for automatic routing. Supports web search and deep think modes. |
  | `chat_list_conversations` | List your SDK chat conversations. |
  | `chat_get_messages` | Fetch messages in a conversation (with media URLs). |

@@ -141,13 +141,15 @@ When making multiple generation calls:

  ## Transcription & Audio/Video Analysis

- Use `transcribe_audio` whenever the user provides an audio or video file and wants:
+ Use `transcribe_audio` ONLY when the user explicitly asks for:
  - A text transcript
  - Subtitles (SRT format)
  - Word-by-word timed subtitles (for karaoke, motion graphics, Remotion captions, video editing)
- - Content analysis or summary of spoken content
+ - Summary of what was **spoken/said** in the video
  - Dialogue extraction from video

+ **Do NOT use `transcribe_audio` to "analyze" a video visually.** For visual analysis (what's on screen, what's shown, what prompts appear, etc.) use `upload_media` → `chat_send_message` with `media_urls`.
+
  ### Workflow
  1. Call `transcribe_audio` with the `source` (URL or absolute local file path)
  2. The tool returns:
@@ -182,24 +184,34 @@ The regular `srt_url` groups words into readable subtitle lines (default 12 word
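The hunk context above notes that the regular `srt_url` groups words into readable subtitle lines (default 12 words). For intuition, here is a minimal sketch of that grouping step in Python. It is illustrative only: the service builds `srt_url` server-side, and the word-timestamp shape used here is an assumption, not Kolbo's documented schema.

```python
# Illustrative only: group word-level timestamps into ~12-word SRT cues.
# Input shape [{'text', 'start', 'end'}] is assumed, not Kolbo's schema.
def words_to_srt(words: list[dict], per_line: int = 12) -> str:
    def ts(t: float) -> str:  # SRT timestamp: HH:MM:SS,mmm
        h, rem = divmod(int(t * 1000), 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1_000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    cues = []
    for i in range(0, len(words), per_line):
        chunk = words[i:i + per_line]
        text = " ".join(w["text"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n{ts(chunk[0]['start'])} --> {ts(chunk[-1]['end'])}\n{text}\n")
    return "\n".join(cues)
```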
  ### Long Content
  Transcription supports files up to 30 minutes. For longer content, split the file first or provide segments.

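Since the skill already assumes FFmpeg is installed locally, one way to honor the 30-minute cap is a stream-copy split. A minimal sketch, assuming `ffmpeg` is on PATH (the 1740-second default keeps each chunk safely under 30 minutes; `-c copy` avoids re-encoding, so cuts land on keyframes):

```python
# Sketch: split a long recording into <30-minute chunks before transcription.
# Assumes ffmpeg is on PATH; stream copy only, so this is fast and lossless.
import subprocess
from pathlib import Path

def split_for_transcription(source: str, chunk_seconds: int = 1740) -> list[Path]:
    src = Path(source)
    pattern = src.with_name(f"{src.stem}_%03d{src.suffix}")
    subprocess.run(
        ["ffmpeg", "-i", str(src),
         "-f", "segment", "-segment_time", str(chunk_seconds),
         "-c", "copy", "-reset_timestamps", "1",  # each chunk restarts at t=0
         str(pattern)],
        check=True,
    )
    return sorted(src.parent.glob(f"{src.stem}_[0-9][0-9][0-9]{src.suffix}"))
```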
- ### Visual Video/Audio Analysis (what's happening, not just what's said)
- `transcribe_audio` only extracts **speech**. If the user wants to understand **what's visually happening** in a video (scenes, actions, objects, on-screen text) or needs a multimodal AI to reason about the content, use `chat_send_message` with a video-capable model instead.
+ ### Visual Video/Audio/Image Analysis
+
+ **The agent has built-in vision — use the right tool for the media type:**
+
+ | Media type | How to analyze |
+ |------------|----------------|
+ | **Image** (jpg, png, webp, etc.) | Read it directly with the `Read` tool — the agent sees images natively. No upload needed. |
+ | **Video / Audio** | `upload_media` → `chat_send_message` with `media_urls` (Gemini handles video/audio) |
+ | **Transcription** | `transcribe_audio` — ONLY when user explicitly says "transcribe", "subtitles", "SRT", or "what's being said" |

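The table above is effectively a small routing rule keyed on file type and the user's wording. A sketch of that logic, with `pick_tool` as a hypothetical helper (not a real tool in this package):

```python
# Hypothetical helper expressing the routing table above; illustration only.
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".gif", ".bmp"}
TRANSCRIBE_CUES = ("transcribe", "subtitles", "srt", "what's being said")

def pick_tool(file_path: str, request: str) -> str:
    if Path(file_path).suffix.lower() in IMAGE_EXTS:
        return "Read"                                 # built-in vision, no upload
    if any(cue in request.lower() for cue in TRANSCRIBE_CUES):
        return "transcribe_audio"                     # explicit transcription ask
    return "upload_media -> chat_send_message"        # default: visual analysis
```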
- **Video-capable models**: `gemini-2.5-pro`, `gemini-2.5-flash` — these can watch video and analyze visual content.
+ **NEVER use ffmpeg or frame extraction for analysis. NEVER ask the user — just pick the right path above.**

- **Workflow for visual analysis:**
- 1. Upload the video with `upload_media` to get a stable CDN URL
- 2. Call `chat_send_message` with the video URL in the message and a video-capable model (e.g. `gemini-2.5-pro`)
- 3. Ask your analysis question: "Describe what happens in this video", "What products are shown?", "Summarize the key scenes"
+ **Video/Audio analysis workflow — Step 1 is NOT optional:**
+ 1. `upload_media({ source: "/absolute/local/path/to/file.mp4" })` returns `{ url, thumbnail_url, ... }`
+    - **Use `url`** — the actual CDN URL. Ignore `thumbnail_url` (preview JPG only).
+ 2. `chat_send_message({ message: "<your question>", media_urls: [result.url] })`
+    - **`media_urls` is mandatory** — the model only sees the video if you pass the CDN URL here.
+    - Always an **array**: `media_urls: ["https://cdn.kolbo.ai/..."]`
+    - **Omit `model`** — Smart Select auto-routes to Gemini when media is detected
+    - **Sessions do NOT remember media between messages.** On retry: reuse the same CDN `url` (no re-upload) but always pass `media_urls` again.
+    - **Batch / many videos**: pass `model: "gemini-3.1-flash-lite-preview"` explicitly for cheaper bulk runs
 
- **When to use which:**
+ **❌ Never do this:**
+ - Pass a local file path in `media_urls` — it won't work; only CDN URLs work
+ - Use the `.txt` URL from a transcription result as the video URL — that's text, not video
+ - Skip `upload_media` and try to construct a URL yourself
 
- | User intent | Tool |
- |-------------|------|
- | "Transcribe this" / "What's being said?" | `transcribe_audio` |
- | "Generate subtitles" / "Word-by-word timing" | `transcribe_audio` |
- | "What's happening in this video?" / "Describe the scenes" | `chat_send_message` + Gemini |
- | "Analyze this video and transcribe it" | Both — `transcribe_audio` for text + `chat_send_message` for visual |
+ When in doubt, do visual analysis. Do not stop to ask.
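To make the two-step contract and the retry rule concrete, a sketch in which `upload_media` and `chat_send_message` are hypothetical Python callables standing in for the MCP tools (the real tools are invoked by the agent over MCP, not from user code):

```python
# Hypothetical bindings for the MCP tools; illustrates the contract only.
def analyze_video(upload_media, chat_send_message, path: str, question: str) -> str:
    media = upload_media(source=path)      # step 1: local path -> CDN URL
    cdn_url = media["url"]                 # never media["thumbnail_url"]
    reply = chat_send_message(
        message=question,
        media_urls=[cdn_url],              # mandatory, always an array
        # model omitted: Smart Select routes to Gemini when media is present
    )
    # On retry: reuse cdn_url (no re-upload) but pass media_urls again.
    return reply["content"]
```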

  ---

@@ -429,6 +441,20 @@ When the user shares an image and asks about it:

  ---

+ ## Sharing HTML Artifacts
+
+ When you generate an HTML, SVG, or Mermaid artifact in the chat, a **Share** button appears in the artifact preview toolbar (next to Desktop / Mobile). Clicking it:
+
+ 1. Uploads the artifact to Kolbo's hosting platform
+ 2. Copies a permanent public URL to the clipboard (e.g. `https://api.kolbo.ai/api/shared-artifact-raw/<token>`)
+ 3. Shows a toast confirming the link was copied
+
+ Anyone with the URL can view the rendered page — no login required.
+
+ **Requirements:** You must be logged in (`kolbo auth login`). The share button shows an error toast if you are not authenticated.
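To sanity-check a copied link, a small sketch (the `<token>` is a placeholder for the value on your clipboard, and the `requests` package is assumed installed):

```python
# Sketch: confirm a shared artifact link is publicly reachable.
import requests

url = "https://api.kolbo.ai/api/shared-artifact-raw/<token>"  # placeholder token
resp = requests.get(url, timeout=30)
resp.raise_for_status()          # public link: no auth header needed
print(resp.headers.get("Content-Type"), len(resp.text))
```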
+
+ ---
+
  ## Kolbo Code Documentation
  Full public documentation for Kolbo Code (the CLI you are running inside) lives at **[docs.kolbo.ai/docs/kolbo-code](https://docs.kolbo.ai/docs/kolbo-code)**. If the user asks about installation, authentication, voice input, supported languages, commands, or how to uninstall, point them to the matching page below rather than guessing:
@@ -493,6 +519,8 @@ Natural-language triggers that should prompt this skill + a tool call:
  - "Transcribe this podcast episode" → `transcribe_audio`
  - "What's being said in this video?" → `transcribe_audio` → analyze the text
  - "Generate word-by-word subtitles for this audio" → `transcribe_audio` → share `word_by_word_srt_url`
+ - "Analyze this video" / "What do you see?" / "What's in this?" (with video file) → `upload_media` → `chat_send_message` with `media_urls` (omit model — auto-routes to Gemini)
+ - "What prompts are shown in this video?" → `upload_media` → `chat_send_message` with `media_urls` (omit model — auto-routes to Gemini)
  - "Keep the same character across all these images" → `create_visual_dna` → `generate_image` with `visual_dna_ids`
  - "Upload this file to my media library" → `upload_media`
  - "What video models are available?" → `list_models` (video)
@@ -1,10 +1,10 @@
  ---
  name: photo-studio
  description: >
-   Local AI photo generation and editing using FLUX.2 Klein 4B, Z-Image Turbo, and Ollama Gemma4.
+   Local AI photo generation and editing using FLUX.2 Klein 4B and Z-Image Turbo.
    Use when the user wants to generate or edit images locally (no API cost, no rate limits).
    Models at I:/AI-Models/. Script at G:/Projects/Kolbo.AI/github/training-loras/scripts/photo-studio.py
-   Keywords: generate image, edit image, flux klein, z-image turbo, local diffusion, photo studio, gemma4
+   Keywords: generate image, edit image, flux klein, z-image turbo, local diffusion, photo studio
  ---

  # Photo Studio — Local AI Image Generation & Editing
@@ -18,11 +18,11 @@ description: >
  | FLUX.2 Klein 4B | `I:/AI-Models/flux2-klein-4b/` |
  | Z-Image Turbo | `I:/AI-Models/z-image-turbo/` |
  | Z-Image Adapter | `I:/AI-Models/z-image-turbo-adapter/zimage_turbo_training_adapter_v2.safetensors` |
- | LLM / Vision | Ollama Gemma4 (runs locally at `http://localhost:11434`) |
+ | Vision / LLM | Agent's built-in vision (`Read` tool) for images. For video analysis load the `video-production` skill. |

  ## How to Run

- Always use the ai-toolkit venv (has Flux2KleinPipeline + ZImagePipeline + ollama package):
+ Always use the ai-toolkit venv (has Flux2KleinPipeline + ZImagePipeline):

  ```bash
  "G:/Projects/Kolbo.AI/github/ai-toolkit/venv/Scripts/python.exe" \
@@ -43,8 +43,6 @@ Always use the ai-toolkit venv (has Flux2KleinPipeline + ZImagePipeline + ollama
  | `--steps N` | 20 | Inference steps |
  | `--cfg N` | 3.5 | Guidance scale |
  | `--seed N` | random | Deterministic seed |
- | `--analyze` | off | Analyze `--image` with Gemma4, use as base description |
- | `--enhance` | off | Enhance `--prompt` with Gemma4 before generating |
  | `--adapter` | off | Load Z-Image Turbo adapter (zimage only) |

  ## Common Recipes
@@ -64,18 +62,29 @@ python photo-studio.py \
    --model flux --width 1152 --height 2048
  ```

- ### Analyze image and generate variation
- ```bash
+ ### Analyze image then generate variation
+ ```
+ # Step 1: Read the image — the agent sees it natively (built-in vision)
+ Read("/abs/path/to/char.jpg")
+ → Agent describes the person: clothing, pose, features, style
+
+ # Step 2: Use the description as the prompt
  python photo-studio.py \
-   --image char.jpg --analyze \
-   --prompt "standing upright, full body" \
+   --prompt "<description from agent vision> standing upright, full body" \
    --model flux
  ```

- ### Auto-enhance a short prompt then generate
- ```bash
+ ### Enhance a short prompt then generate
+ ```
+ # Step 1: Enhance the prompt with Kolbo MCP
+ chat_send_message({
+   message: "Expand this into a detailed image generation prompt for a photorealistic portrait: 'street fashion guy'",
+ })
+ → { content: "A young man in his mid-20s wearing..." }
+
+ # Step 2: Generate with the enhanced prompt
  python photo-studio.py \
-   --prompt "street fashion guy" --enhance \
+   --prompt "<enhanced prompt from Kolbo>" \
    --model zimage --width 1152 --height 2048
  ```

@@ -101,11 +110,10 @@ python photo-studio.py \
  - Default: `--steps 8 --cfg 0.0 --width 1152 --height 2048` (~30s per image)
  - Add `--adapter` to load the v2 training adapter

- ### Ollama Gemma4 (vision + LLM)
- - Used for `--analyze` (describe input image) and `--enhance` (expand prompt)
- - Runs locally, no API key, no rate limits
- - Model: `gemma4` (9.6GB, multimodal)
- - Ollama auto-starts on Windows boot
+ ### Vision & Prompt Enhancement
+ - For image analysis: use the agent's built-in vision — `Read` the image file directly, no MCP needed
+ - For prompt enhancement: `chat_send_message` asking Kolbo to expand a short prompt (text-only, no vision)
+ - Do NOT use the `--analyze` or `--enhance` flags (they call a local model that is no longer used)

  ## When to use which model
 
@@ -117,6 +125,6 @@ python photo-studio.py \
  | Quick text-to-image | `flux` or `zimage` |
  | Portrait + face reference | `flux --image face.jpg` |

- ## Prompt Tips (Gemma4 is the default prompter)
+ ## Prompt Tips

- When the user gives a short/vague prompt, always use `--enhance` to let Gemma4 expand it. For image editing, pair `--analyze --enhance` to get the best context from the source image.
+ When the user gives a short/vague prompt, use `chat_send_message` to let Kolbo AI expand it before passing to the script. For image editing, first analyze the source image with the agent's built-in vision (`Read` the image), then use the description as the base prompt.
@@ -1,13 +1,16 @@
  ---
  name: video-production
  description: >
-   Full-stack video production assistant. Analyzes talking head footage,
+   Full-stack video production assistant. Analyzes video content visually (Gemini),
    generates transcriptions/SRT subtitles, plans and creates motion graphics (Remotion),
    generates B-roll images/videos, produces timeline XMLs for Premiere/DaVinci.
-   Use for: video analysis, transcription, subtitles, motion graphics, B-roll, shorts,
-   timeline XML, clip cutting, silence removal, After Effects, Premiere Pro, DaVinci Resolve.
+   Downloads YouTube videos with yt-dlp.
+   Use for: video analysis, visual analysis, describe video, what's in this video,
+   transcription, subtitles, motion graphics, B-roll, shorts, timeline XML, clip cutting,
+   silence removal, After Effects, Premiere Pro, DaVinci Resolve, YouTube download.
    Keywords: video edit, ffmpeg, remotion, after effects, premiere, davinci, shorts, subtitles,
-   motion graphics, clip, render, transcribe, xml, timeline, b-roll, talking head, analyze
+   motion graphics, clip, render, transcribe, xml, timeline, b-roll, talking head, analyze,
+   yt-dlp, youtube, download, gemini, vision
  allowed-tools:
    - Read
    - Write
@@ -24,29 +27,122 @@ allowed-tools:

  # Video Production — Strategy Map

+ ## ⚠️ DEFAULT RULE: Video Analysis = Visual Analysis (NOT Transcription)
+
+ **The agent has built-in vision for images. For videos, always use Gemini via Kolbo MCP.**
+
+ | Media type | Action |
+ |------------|--------|
+ | **Image** (jpg, png, etc.) | Agent reads it directly — no upload needed |
+ | **Video** — "analyze", "describe", "what's in this?", "what prompts?", file path with no instruction | `upload_media` → `chat_send_message` + Gemini |
+ | **Transcription** — "transcribe", "subtitles", "SRT", "what's being said", "captions" | `transcribe_audio` only |
+ | Both visual + transcript | Run both |
+
+ **Never use ffmpeg to extract frames for analysis. Never use local Ollama/vision models. Commit to the right action — do not ask the user. Wait for `chat_send_message` to return before proceeding — it polls until done (up to 2 min). Do NOT fall back to ffmpeg or any other approach if it takes time.**
+
+ ---
+
+ ## Kolbo MCP Tools (Active When `kolbo auth login` Is Done)
+
+ These are available as MCP tools — use them directly without any Python/API key setup:
+
+ | Tool | Use |
+ |------|-----|
+ | `upload_media` | Upload local file to Kolbo CDN → get stable public URL |
+ | `chat_send_message` | Send message + `media_urls` array to Gemini for visual analysis |
+ | `transcribe_audio` | Transcribe audio/video to text + SRT (ElevenLabs Scribe) |
+ | `generate_image` | Generate B-roll images |
+ | `generate_video` | Generate B-roll videos |
+ | `generate_video_from_image` | Animate a still into video |
+ | `generate_music` | Generate background music |
+ | `generate_speech` | TTS for voiceover |
+ | `generate_sound` | Sound effects |
+ | `list_models` | Browse available models by type |
+ | `check_credits` | Check remaining Kolbo credit balance |
+
+ ### Visual Analysis Workflow — MANDATORY for all video analysis
+
+ **Step 1 is NOT optional. You cannot skip `upload_media` or construct the URL yourself.**
+
+ ```
+ Step 1: upload_media({ source: "/absolute/path/to/video.mp4" })
+   → Returns: { url, thumbnail_url, ... }
+   → Save the "url" field — this is the CDN URL you will pass to Gemini
+   → NEVER use thumbnail_url (it's a JPG preview, not the video)
+
+ Step 2: chat_send_message({
+   message: "Describe this video in detail. What is shown?",
+   media_urls: ["<url from step 1>"]   ← must be an array, must be the "url" field
+ })
+   → returns: { content: "..." }
+ ```
+
+ **❌ Common mistakes that break video analysis:**
+ - Skipping `upload_media` and passing a local file path to `chat_send_message` — local paths don't work
+ - Using the transcription `.txt` URL as the `media_urls` value — Gemini needs the actual video CDN URL
+ - Using `thumbnail_url` instead of `url` from the `upload_media` response
+ - Calling `transcribe_audio` first then passing its output URL as the video — transcription gives text, not video
+
+ **Omit `model`** — Smart Select detects video/audio and auto-routes to Gemini.
+ **Sessions do NOT remember media between messages.** On retry: reuse the same CDN `url` from step 1 (no re-upload needed) but always pass `media_urls` again.
+
+ **Batch analysis (many videos)**: Pass `model: "gemini-3.1-flash-lite-preview"` explicitly for cheaper bulk runs.
+
+ For YouTube videos — download first with yt-dlp (see below), then follow steps 1–2 above.
+
+ ---
+
  ## Pipeline

  ```
- Input video → Transcribe → Analyze → Plan segments
- → Generate: Remotion compositions | B-roll | SRT subtitles
- Output: Premiere XML / DaVinci EDL / individual MP4s / SRT
+ Input: local video / YouTube URL / uploaded file
+
+ → [DEFAULT] Visual Analysis: upload_media → chat_send_message (Gemini)
+ → [EXPLICIT REQUEST] Transcription: transcribe_audio → SRT / text
+ → [EDITING] FFmpeg: cut, silence removal, 9:16 conversion
+ → [MOTION GRAPHICS] Remotion: compositions, captions, B-roll
+ → Output: Premiere XML / DaVinci EDL / MP4s / SRT
  ```

  ## APIs & Capabilities

  | Service | Use |
  |---------|-----|
- | ElevenLabs Scribe | Primary transcription — word-level SRT, multilingual |
- | Claude | Content analysis, edit planning |
- | Google Gemini | Video understanding, visual analysis |
- | fal.ai (MCP) | Image & video B-roll generation |
- | Runway | Image-to-video, video-to-video |
- | FLUX / BFL | High quality still image generation |
- | ElevenLabs | TTS, voice cloning, SFX |
- | Suno | Background music generation |
+ | Kolbo MCP (`upload_media` + `chat_send_message`) | **Primary** — visual video/image analysis via Gemini |
+ | Kolbo MCP (`transcribe_audio`) | **Primary** — transcription, word-level SRT, multilingual |
+ | yt-dlp | Download YouTube/social media videos |
+ | FFmpeg | Local video editing, cutting, silence removal, format conversion |
  | Remotion Lambda | Cloud render motion graphics |
+ | fal.ai (MCP) | Image & video B-roll generation |
+ | ElevenLabs | TTS, voice cloning, SFX (via Kolbo MCP `generate_speech`) |
+ | Suno | Background music (via Kolbo MCP `generate_music`) |
+
+ > Kolbo MCP tools need no API keys — auth is handled by `kolbo auth login`.
+ > FFmpeg/yt-dlp need to be installed locally on the machine.

- > Load API keys from the project's `.env` file or environment variables.
+ ## YouTube / Social Media Download (yt-dlp)
+
+ Download video from YouTube, TikTok, Instagram, Twitter, etc.:
+
+ ```bash
+ # Best quality MP4
+ yt-dlp -f "bestvideo[height<=1080][ext=mp4]+bestaudio/best" \
+   --merge-output-format mp4 \
+   -o "%(id)s.%(ext)s" <url>
+
+ # With subtitles
+ yt-dlp -f "bestvideo[height<=1080][ext=mp4]+bestaudio/best" \
+   --write-auto-sub --sub-lang en --convert-subs srt \
+   --merge-output-format mp4 \
+   -o "%(id)s.%(ext)s" <url>
+
+ # Audio only (for transcription)
+ yt-dlp -f "bestaudio" --extract-audio --audio-format mp3 -o "%(id)s.%(ext)s" <url>
+ ```
+
+ After download → upload to Kolbo CDN with `upload_media` → analyze visually with `chat_send_message`.
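If you prefer driving it from Python, the `yt-dlp` package exposes the same options programmatically. A sketch mirroring the first command above (assumes `pip install yt-dlp`; `<url>` stays a placeholder):

```python
# Sketch: the "best quality MP4" download via the yt_dlp Python API.
import yt_dlp

opts = {
    "format": "bestvideo[height<=1080][ext=mp4]+bestaudio/best",
    "merge_output_format": "mp4",
    "outtmpl": "%(id)s.%(ext)s",  # same output template as the CLI -o flag
}
with yt_dlp.YoutubeDL(opts) as ydl:
    ydl.download(["<url>"])       # placeholder; pass the real video URL
```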
+
+ ---

  ## Key Rules

 
@@ -1,105 +0,0 @@
1
- ---
2
- name: ollama-vision
3
- description: >
4
- Batch image analysis using local Ollama + gemma4 (multimodal).
5
- Use when the user needs to analyze, caption, classify, or extract text from images locally —
6
- free, offline, no rate limits, no API key needed.
7
- Keywords: image analysis, batch images, captions, OCR, vision, gemma4, ollama, local AI
8
- ---
9
-
10
- # Ollama Vision — Batch Image Analysis with gemma4
11
-
12
- ## Setup (already done on this machine)
13
-
14
- - Ollama installed and running (auto-starts on Windows boot)
15
- - Model: `gemma4` (9.6 GB, multimodal)
16
- - Python package: `ollama` v0.6.1 installed (pip, Python 3.10)
17
- - REST API available at `http://localhost:11434`
18
-
19
- ## Core Pattern
20
-
21
- ```python
22
- import ollama
23
-
24
- response = ollama.chat(model='gemma4', messages=[{
25
- 'role': 'user',
26
- 'content': 'Your prompt here',
27
- 'images': ['path/to/image.jpg'] # omit for text-only
28
- }])
29
- print(response['message']['content'])
30
- ```
31
-
32
- ## Batch Image Captioning Script
33
-
34
- ```python
35
- import ollama
36
- from pathlib import Path
37
- import csv
38
-
39
- def caption_images(folder: str, prompt: str = "Write a short caption for this image.", output_csv: str = "captions.csv"):
40
- images_dir = Path(folder)
41
- extensions = {'.jpg', '.jpeg', '.png', '.webp', '.gif', '.bmp'}
42
- image_files = [f for f in images_dir.iterdir() if f.suffix.lower() in extensions]
43
-
44
- results = []
45
- for i, img_path in enumerate(image_files, 1):
46
- print(f"[{i}/{len(image_files)}] Processing {img_path.name}...")
47
- try:
48
- response = ollama.chat(model='gemma4', messages=[{
49
- 'role': 'user',
50
- 'content': prompt,
51
- 'images': [str(img_path)]
52
- }])
53
- caption = response['message']['content'].strip()
54
- results.append({'file': img_path.name, 'caption': caption})
55
- print(f" → {caption[:80]}...")
56
- except Exception as e:
57
- print(f" ERROR: {e}")
58
- results.append({'file': img_path.name, 'caption': f'ERROR: {e}'})
59
-
60
- with open(output_csv, 'w', newline='', encoding='utf-8') as f:
61
- writer = csv.DictWriter(f, fieldnames=['file', 'caption'])
62
- writer.writeheader()
63
- writer.writerows(results)
64
-
65
- print(f"\nDone! Saved {len(results)} captions to {output_csv}")
66
-
67
- # Usage
68
- caption_images("./images", prompt="Describe this image in one sentence.")
69
- ```
70
-
71
- ## Common Prompts
72
-
73
- | Task | Prompt |
74
- |------|--------|
75
- | Caption | `"Write a short, descriptive caption for this image."` |
76
- | Alt text | `"Write alt text for this image for accessibility."` |
77
- | Classification | `"What category does this image belong to? Reply with one word."` |
78
- | OCR | `"Extract all text visible in this image."` |
79
- | Product description | `"Write a product description for the item shown in this image."` |
80
- | Social media | `"Write a catchy Instagram caption for this image."` |
81
-
82
- ## REST API Alternative (no Python package needed)
83
-
84
- ```python
85
- import requests, base64
86
-
87
- def analyze_image(image_path: str, prompt: str) -> str:
88
- with open(image_path, "rb") as f:
89
- img_b64 = base64.b64encode(f.read()).decode()
90
-
91
- response = requests.post("http://localhost:11434/api/generate", json={
92
- "model": "gemma4",
93
- "prompt": prompt,
94
- "images": [img_b64],
95
- "stream": False
96
- })
97
- return response.json()["response"]
98
- ```
99
-
100
- ## Tips
101
-
102
- - gemma4 handles JPG, PNG, WEBP, GIF, BMP
103
- - For large batches, add `time.sleep(0.5)` between requests to avoid overloading
104
- - Results are best when prompts are specific ("describe the main subject" vs "describe this")
105
- - Ollama must be running — check with `ollama list` in terminal