@framers/agentos-skills-registry 0.13.0 → 0.15.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@framers/agentos-skills-registry",
-   "version": "0.13.0",
+   "version": "0.15.0",
    "files": [
      "dist",
      "registry",
@@ -0,0 +1,225 @@
---
name: audio-generation
version: '1.0.0'
description: Music and sound effects generation — 8 providers with fallback chains, user-configurable preferences, local and cloud options.
author: Wunderland
namespace: wunderland
category: media
tags: [audio, music, sound-effects, sfx, generation, suno, elevenlabs, stable-audio, musicgen]
requires_secrets: []
requires_tools: []
metadata:
  agentos:
    emoji: "\U0001F3B5"
---

# Audio Generation (Music & Sound Effects)

Use this skill when the user wants to generate music compositions or sound effects from text descriptions. The system supports 8 provider backends with automatic fallback chains and user-configurable provider preferences.

This skill covers two complementary APIs:

1. **generateMusic()** — Full-length musical compositions from text prompts
2. **generateSFX()** — Short sound effects from text descriptions

## Music Generation

### Basic Usage

Generate music from a text prompt. The system auto-detects the best available provider from environment variables in priority order: `SUNO_API_KEY` (highest quality) -> `UDIO_API_KEY` -> `STABILITY_API_KEY` -> `REPLICATE_API_TOKEN` -> `FAL_API_KEY` -> local MusicGen (no key required).

```typescript
import { generateMusic } from 'agentos';

const result = await generateMusic({
  prompt: 'Upbeat lo-fi hip hop beat with vinyl crackle and mellow piano',
  durationSec: 60,
});
console.log(result.audio[0].url);
```
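
The priority order above amounts to a first-match scan over the environment. The sketch below is only an illustration of that chain, not the actual agentos resolver; `detectMusicProvider` and `MUSIC_PRIORITY` are hypothetical names.

```typescript
// Hypothetical sketch of the music provider detection order described above.
// The real agentos resolver may differ; this only illustrates the priority chain.
const MUSIC_PRIORITY: Array<[envVar: string, providerId: string]> = [
  ['SUNO_API_KEY', 'suno'],
  ['UDIO_API_KEY', 'udio'],
  ['STABILITY_API_KEY', 'stable-audio'],
  ['REPLICATE_API_TOKEN', 'replicate-audio'],
  ['FAL_API_KEY', 'fal-audio'],
];

function detectMusicProvider(env: Record<string, string | undefined>): string {
  for (const [envVar, providerId] of MUSIC_PRIORITY) {
    if (env[envVar]) return providerId; // first configured key wins
  }
  return 'musicgen-local'; // local fallback needs no key
}

detectMusicProvider({ REPLICATE_API_TOKEN: 'tok' }); // -> 'replicate-audio'
```

The same pattern applies to the SFX chain below, just with a different priority list.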

### Prompt Tips for Music

- **Specify genre and mood first**: "melancholic jazz ballad", "aggressive drum and bass", "peaceful ambient soundscape"
- **Include instrumentation**: "with acoustic guitar, soft brushed drums, and upright bass"
- **Mention tempo and energy**: "slow tempo, 70 BPM", "high energy, driving rhythm"
- **Add texture and production**: "lo-fi with vinyl crackle", "clean studio recording", "reverb-heavy shoegaze"
- **Reference eras or styles**: "1970s progressive rock", "modern trap production", "classical baroque harpsichord"
- **Use negative prompts** where supported: `negativePrompt: 'vocals, singing, lyrics'`

### Music Options

| Option | Default | Description |
|--------|---------|-------------|
| `prompt` | (required) | Text description of the desired music |
| `provider` | auto-detect | Provider ID (`"suno"`, `"udio"`, `"stable-audio"`, etc.) |
| `model` | provider default | Model identifier within the provider |
| `durationSec` | provider default | Output duration in seconds (Suno: up to ~240s; Stable Audio: ~47s) |
| `negativePrompt` | — | Musical elements to avoid (not all providers support this) |
| `outputFormat` | `"mp3"` | Output format: `"mp3"`, `"wav"`, `"flac"`, `"ogg"`, `"aac"` |
| `seed` | random | Seed for reproducible output (provider-dependent) |
| `n` | `1` | Number of clips to generate |

## Sound Effect Generation

### Basic Usage

Generate a sound effect from a text description. The SFX detection order is: `ELEVENLABS_API_KEY` (highest quality) -> `STABILITY_API_KEY` -> `REPLICATE_API_TOKEN` -> `FAL_API_KEY` -> local AudioGen (no key required).

```typescript
import { generateSFX } from 'agentos';

const result = await generateSFX({
  prompt: 'Thunder crack followed by heavy rain on a tin roof',
  durationSec: 5,
});
console.log(result.audio[0].url);
```

### Prompt Tips for Sound Effects

- **Be specific about the sound**: "glass bottle shattering on concrete floor" rather than just "glass breaking"
- **Describe the environment**: "footsteps on gravel in an empty parking garage with echo"
- **Layer multiple sounds**: "busy city intersection with car horns, distant sirens, and pedestrian chatter"
- **Specify duration context**: short stingers (1-3s) vs. ambient loops (10-15s)
- **Include physical properties**: "heavy wooden door creaking open slowly", "small metallic click of a light switch"

### SFX Options

| Option | Default | Description |
|--------|---------|-------------|
| `prompt` | (required) | Text description of the desired sound effect |
| `provider` | auto-detect | Provider ID (`"elevenlabs-sfx"`, `"stable-audio"`, etc.) |
| `model` | provider default | Model identifier within the provider |
| `durationSec` | provider default | Output duration in seconds (SFX is typically 1-15s) |
| `outputFormat` | `"mp3"` | Output format: `"mp3"`, `"wav"`, `"flac"`, `"ogg"`, `"aac"` |
| `seed` | random | Seed for reproducible output (provider-dependent) |
| `n` | `1` | Number of clips to generate |

## Provider Selection Guide

### Music Providers

| Provider | ID | Best For | Env Var | Key Required |
|----------|-----|----------|---------|-------------|
| **Suno** | `suno` | Highest-quality vocals + instrumentals, full songs | `SUNO_API_KEY` | Yes |
| **Udio** | `udio` | High-quality music, alternative to Suno | `UDIO_API_KEY` | Yes |
| **Stable Audio** | `stable-audio` | Instrumentals, loops, ambient, fast generation | `STABILITY_API_KEY` | Yes |
| **Replicate** | `replicate-audio` | Open-source models (MusicGen), pay-per-use | `REPLICATE_API_TOKEN` | Yes |
| **Fal** | `fal-audio` | Fast serverless GPU, cost-effective | `FAL_API_KEY` | Yes |
| **MusicGen Local** | `musicgen-local` | Offline generation, no API key needed, privacy | — | No |

### SFX Providers

| Provider | ID | Best For | Env Var | Key Required |
|----------|-----|----------|---------|-------------|
| **ElevenLabs** | `elevenlabs-sfx` | Highest-quality SFX, fast turnaround | `ELEVENLABS_API_KEY` | Yes |
| **Stable Audio** | `stable-audio` | Good SFX + music in one provider | `STABILITY_API_KEY` | Yes |
| **Replicate** | `replicate-audio` | Open-source AudioGen model, pay-per-use | `REPLICATE_API_TOKEN` | Yes |
| **Fal** | `fal-audio` | Fast serverless GPU | `FAL_API_KEY` | Yes |
| **AudioGen Local** | `audiogen-local` | Offline SFX generation, no API key needed | — | No |

### Forcing a Specific Provider

```typescript
const result = await generateMusic({
  prompt: 'Chill synthwave with arpeggiated synths',
  provider: 'stable-audio',
  apiKey: 'your-stability-key',
  durationSec: 30,
});
```

## Provider Preferences

Use `providerPreferences` to control fallback-chain ordering and filtering without hardcoding a single provider. This is useful for load balancing, cost optimization, or respecting user preferences.

```typescript
import { generateMusic } from 'agentos';

// Prefer Suno, fall back to Stable Audio, never use Udio
const result = await generateMusic({
  prompt: 'Orchestral film score with dramatic strings',
  providerPreferences: {
    preferred: ['suno', 'stable-audio'],
    blocked: ['udio'],
  },
});
```

### Preference Fields

| Field | Description |
|-------|-------------|
| `preferred` | Ordered list of provider IDs to try first. Providers not in this list are excluded. |
| `blocked` | Provider IDs to unconditionally exclude from the chain. |
| `weights` | Weight map for weighted random selection (useful for A/B testing or load balancing). |

Provider preferences work identically across `generateMusic()`, `generateSFX()`, `generateImage()`, and `generateVideo()`.
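
A minimal sketch of how those three fields could shape a fallback chain, assuming the semantics in the table above. `buildChain` is a hypothetical helper, not the agentos implementation, and it sorts by weight deterministically where a real implementation might sample randomly.

```typescript
// Hypothetical sketch of applying providerPreferences to a provider chain.
interface ProviderPreferences {
  preferred?: string[];             // ordered allow-list
  blocked?: string[];               // unconditional exclusions
  weights?: Record<string, number>; // weighted selection
}

function buildChain(available: string[], prefs: ProviderPreferences): string[] {
  // Drop blocked providers first
  let chain = available.filter((id) => !prefs.blocked?.includes(id));
  if (prefs.preferred?.length) {
    // Keep only preferred providers, in the stated order
    chain = prefs.preferred.filter((id) => chain.includes(id));
  }
  if (prefs.weights) {
    // Heavier weights try earlier (a real implementation might sample randomly)
    chain = [...chain].sort((a, b) => (prefs.weights![b] ?? 0) - (prefs.weights![a] ?? 0));
  }
  return chain;
}

buildChain(['suno', 'udio', 'stable-audio'], {
  preferred: ['suno', 'stable-audio'],
  blocked: ['udio'],
}); // -> ['suno', 'stable-audio']
```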

## When to Use Music vs SFX vs TTS

| Need | API | Why |
|------|-----|-----|
| Background music, songs, jingles | `generateMusic()` | Optimized for musical compositions with melody, harmony, rhythm |
| Sound effects, foley, ambient sounds | `generateSFX()` | Optimized for short, non-musical audio (impacts, nature, UI sounds) |
| Speech, narration, voice cloning | TTS (speech subsystem) | Use the speech/TTS APIs instead — audio generation is for non-speech |
| Podcast intros with music + voice | Combine both | Generate music with `generateMusic()`, speech with TTS, mix externally |

## Combining Audio

The audio generation APIs return URLs or base64 data that can be combined in downstream workflows:

1. **Generate background music**: `generateMusic({ prompt: 'Gentle ambient pad' })`
2. **Generate SFX stingers**: `generateSFX({ prompt: 'Notification chime' })`
3. **Generate speech**: Use the TTS subsystem for narration
4. **Mix**: Use ffmpeg or a Web Audio API pipeline to layer the tracks

```typescript
import { generateMusic, generateSFX } from 'agentos';

// Generate assets in parallel
const [music, sfx] = await Promise.all([
  generateMusic({ prompt: 'Calm podcast background music', durationSec: 120 }),
  generateSFX({ prompt: 'Soft transition whoosh', durationSec: 2 }),
]);

// Use the URLs/base64 data in your mixing pipeline
console.log('Music:', music.audio[0].url);
console.log('SFX:', sfx.audio[0].url);
```
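
For the mixing step, one option is to shell out to ffmpeg with its standard `amix` filter. The helper below only builds the argument list and is illustrative; the downloaded file names are assumptions.

```typescript
// Sketch: build ffmpeg arguments that layer downloaded music and SFX tracks.
// `amix` is a standard ffmpeg audio filter; the helper itself is hypothetical.
function buildMixCommand(inputs: string[], output: string): string[] {
  const args = inputs.flatMap((file) => ['-i', file]);
  args.push(
    '-filter_complex', `amix=inputs=${inputs.length}:duration=longest`,
    output,
  );
  return args;
}

buildMixCommand(['music.mp3', 'sfx.mp3'], 'mixed.mp3');
// -> ['-i', 'music.mp3', '-i', 'sfx.mp3',
//     '-filter_complex', 'amix=inputs=2:duration=longest', 'mixed.mp3']
```

Pass the result to `child_process.execFile('ffmpeg', args)` or an equivalent runner.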

## Local Providers (No API Key)

Both MusicGen and AudioGen can run locally without any API keys using HuggingFace Transformers.js. The models are downloaded on first use and cached locally.

**Requirements:**

- `@huggingface/transformers` must be installed as a peer dependency
- Sufficient RAM for model inference (MusicGen Small ~1GB, AudioGen Medium ~2GB)

```typescript
// Explicitly use local generation
const result = await generateMusic({
  prompt: 'Simple piano melody',
  provider: 'musicgen-local',
});
```

Local providers are automatically used as the last fallback when no cloud API keys are configured.

## Prerequisites

- At least one audio provider API key for cloud generation, OR `@huggingface/transformers` for local generation
- For music: `SUNO_API_KEY`, `UDIO_API_KEY`, `STABILITY_API_KEY`, `REPLICATE_API_TOKEN`, or `FAL_API_KEY`
- For SFX: `ELEVENLABS_API_KEY`, `STABILITY_API_KEY`, `REPLICATE_API_TOKEN`, or `FAL_API_KEY`

## Examples

- "Generate a 60-second lo-fi hip hop beat for a study playlist"
- "Create a thunder and rain sound effect for my podcast intro"
- "Make upbeat electronic music for a product demo video"
- "Generate a notification chime sound effect"
- "Create ambient forest sounds with birds and a gentle stream"
- "Generate a dramatic orchestral score for a trailer"
- "Make a retro 8-bit video game soundtrack"
- "Create footstep sounds on different surfaces — wood, gravel, snow"
@@ -0,0 +1,210 @@
---
name: video-generation
version: '1.0.0'
description: Video generation, analysis, and scene detection — text-to-video, image-to-video, structured scene descriptions with RAG indexing, and general-purpose visual change detection.
author: Wunderland
namespace: wunderland
category: media
tags: [video, generation, analysis, scene-detection, RAG, multimodal, runway, replicate, fal]
requires_secrets: []
requires_tools: []
metadata:
  agentos:
    emoji: "\U0001F3AC"
---

# Video Generation, Analysis & Scene Detection

Use this skill when the user wants to create AI-generated videos, analyse existing video content for structured scene descriptions, or detect visual changes in live/recorded frame streams.

This skill covers three complementary APIs:

1. **generateVideo()** — Text-to-video and image-to-video generation
2. **analyzeVideo()** — Structured video analysis with scene descriptions, transcription, and optional RAG indexing
3. **detectScenes()** — Real-time or batch scene-boundary detection from frame streams

## Video Generation

### Text-to-Video

Generate a video from a text prompt. The system auto-detects the best available provider from environment variables in priority order: `RUNWAY_API_KEY` (highest quality), `REPLICATE_API_TOKEN` (widest model variety), `FAL_API_KEY` (fast serverless GPU).

```typescript
import { generateVideo } from 'agentos';

const result = await generateVideo({
  prompt: 'A drone flying over a misty forest at sunrise, cinematic 4K',
  durationSec: 5,
  aspectRatio: '16:9',
});
console.log(result.videos[0].url);
```

### Image-to-Video

Animate a still image by providing it as a Buffer via `opts.image`. The prompt describes the desired motion rather than the scene itself.

```typescript
import { generateVideo } from 'agentos';
import { readFileSync } from 'fs';

const result = await generateVideo({
  prompt: 'Camera slowly zooms out, gentle wind moves the leaves',
  image: readFileSync('landscape.png'),
  provider: 'runway',
});
```

### Provider Selection

| Provider | Best For | Env Var |
|----------|----------|---------|
| **Runway** | Highest quality, cinematic output, image-to-video | `RUNWAY_API_KEY` |
| **Replicate** | Widest model variety (Kling, HunyuanVideo, MiniMax), open-source models | `REPLICATE_API_TOKEN` |
| **Fal** | Fast serverless GPU, cost-effective, Kling/CogVideo | `FAL_API_KEY` |

When multiple provider API keys are set, the system wraps the primary in a `FallbackVideoProxy` so that a transient failure on one provider automatically retries on the next.
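
The retry behaviour amounts to walking an ordered provider list until one call succeeds. The sketch below illustrates that idea only; it is not the real `FallbackVideoProxy`, and the `VideoProvider` type is a stand-in for the actual provider interface.

```typescript
// Hypothetical sketch of provider fallback: try each provider in order and
// retry on the next after a failure. Not the real FallbackVideoProxy.
type VideoProvider = (prompt: string) => Promise<string>;

async function generateWithFallback(
  providers: VideoProvider[],
  prompt: string,
): Promise<string> {
  let lastError: unknown;
  for (const provider of providers) {
    try {
      return await provider(prompt); // first success wins
    } catch (err) {
      lastError = err; // transient failure: fall through to the next provider
    }
  }
  throw lastError; // every provider failed
}
```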

To force a specific provider:

```typescript
const result = await generateVideo({
  prompt: 'A cat playing piano',
  provider: 'replicate',
  model: 'klingai/kling-v1',
  apiKey: 'your-replicate-token',
});
```

### Prompt Tips for Video

- **Be specific about motion**: "camera pans left to right", "person walks toward camera", "time-lapse of clouds moving"
- **Specify style early**: "cinematic 4K", "hand-drawn animation", "vintage film grain"
- **Keep prompts concise**: Video models respond best to clear, focused descriptions (1-3 sentences)
- **Use negative prompts** to avoid unwanted artifacts: `negativePrompt: 'blurry, distorted faces, watermark'`

### Image-to-Video Motion Strength

When doing image-to-video, the prompt controls how much the image changes:

- **Gentle motion**: "subtle camera drift", "soft wind blowing through hair" — minimal departure from the source
- **Moderate motion**: "person turns head and smiles", "camera orbits subject" — clear movement while preserving the subject
- **Strong motion**: "explosion of confetti", "character runs toward camera" — significant scene change

Motion-strength interpretation varies by provider. Runway tends to be conservative (good for preserving the source image), while Replicate/Fal models may be more aggressive. Start with gentle prompts and increase intensity as needed.

## Video Analysis

### Structured Scene Analysis

Analyse a video to extract structured scene descriptions, detected objects, on-screen text, and optional audio transcription.

```typescript
import { analyzeVideo } from 'agentos';

const analysis = await analyzeVideo({
  videoUrl: 'https://example.com/product-demo.mp4',
  prompt: 'Identify all products shown and their key features',
  transcribeAudio: true,
  descriptionDetail: 'detailed',
});

console.log(analysis.description);
for (const scene of analysis.scenes ?? []) {
  console.log(`[${scene.startSec}s - ${scene.endSec}s] ${scene.description}`);
}
```

### RAG Integration

Enable `indexForRAG: true` to automatically index scene descriptions and transcripts into the vector store for later retrieval. This is especially useful for building searchable video libraries.

```typescript
const analysis = await analyzeVideo({
  videoBuffer: videoData,
  indexForRAG: true,
  descriptionDetail: 'detailed',
  transcribeAudio: true,
});

// Scene descriptions and transcripts are now searchable via RAG
console.log(`Indexed ${analysis.ragChunkIds?.length ?? 0} chunks`);
```

Each scene description becomes a separate vector chunk with metadata including timestamps, scene index, and cut type. This enables queries like "find the part where the presenter shows the pricing slide" to return precise timestamp ranges.
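
Conceptually, the per-scene chunks could look like the sketch below. The `Scene` shape and `toRagChunks` helper are assumptions for illustration; the actual chunk schema is internal to agentos.

```typescript
// Hypothetical sketch of turning analysis scenes into per-scene vector chunks
// carrying the metadata described above (timestamps, scene index, cut type).
interface Scene {
  startSec: number;
  endSec: number;
  description: string;
  cutType?: string;
}

function toRagChunks(videoId: string, scenes: Scene[]) {
  return scenes.map((scene, index) => ({
    id: `${videoId}:scene:${index}`,
    text: scene.description, // the text that gets embedded
    metadata: {
      videoId,
      sceneIndex: index,
      startSec: scene.startSec,
      endSec: scene.endSec,
      cutType: scene.cutType ?? 'hard-cut',
    },
  }));
}
```

Because each chunk keeps its timestamp range, a semantic hit maps straight back to a playable segment of the video.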

### Analysis Options

| Option | Default | Description |
|--------|---------|-------------|
| `sceneThreshold` | `0.3` | Scene-change sensitivity (0-1, lower = more scenes) |
| `transcribeAudio` | `true` | Transcribe audio via the configured STT provider |
| `descriptionDetail` | `'detailed'` | `'brief'`, `'detailed'`, or `'exhaustive'` |
| `maxScenes` | `100` | Cap on detected scenes (prevents runaway on long videos) |
| `indexForRAG` | `false` | Index results into the RAG vector store |

## Scene Detection

### Live Stream / Batch Detection

Use `detectScenes()` for real-time visual change detection on frame streams. It returns an AsyncGenerator that yields `SceneBoundary` objects as visual discontinuities are detected.

```typescript
import { detectScenes } from 'agentos';

// From a pre-recorded video (frames extracted via ffmpeg)
for await (const boundary of detectScenes({ frames: extractedFrameStream })) {
  console.log(`Scene ${boundary.index} at ${boundary.startTimeSec}s`);
  console.log(`  Type: ${boundary.cutType}, Confidence: ${boundary.confidence}`);
}
```

### Use Cases

- **Webcam / security camera**: Detect motion or scene changes in real-time surveillance feeds
- **Screen recording**: Identify slide transitions in presentations, page changes in demos
- **Video editing**: Automatically segment raw footage at cut points
- **Content moderation**: Flag rapid scene changes that may indicate problematic content

### Configuration

```typescript
for await (const boundary of detectScenes({
  frames: webcamStream,
  hardCutThreshold: 0.4,    // less sensitive to hard cuts
  gradualThreshold: 0.15,   // standard sensitivity for dissolves/fades
  minSceneDurationSec: 2.0, // suppress very short scenes
  methods: ['histogram'],   // fast histogram-only detection
})) {
  handleSceneChange(boundary);
}
```

### Cut Type Classification

The detector classifies each scene boundary:

| Cut Type | Description |
|----------|-------------|
| `hard-cut` | Abrupt frame-to-frame change (most common) |
| `dissolve` | Cross-dissolve / superimposition transition |
| `fade` | Fade from/to black or white |
| `gradual` | Other gradual visual change |
+
194
+ ## Prerequisites
195
+
196
+ - At least one video provider API key for generation (`RUNWAY_API_KEY`, `REPLICATE_API_TOKEN`, or `FAL_API_KEY`)
197
+ - **ffmpeg** on PATH for video analysis (frame extraction and audio demuxing)
198
+ - A vision-capable LLM (`OPENAI_API_KEY` or equivalent) for scene description
199
+ - An STT provider for audio transcription (when `transcribeAudio` is enabled)
200
+
201
+ Scene detection (`detectScenes()`) has zero external dependencies — it works purely on RGB pixel buffers.
202
+
203
+ ## Examples
204
+
205
+ - "Generate a 5-second cinematic video of a sunset over the ocean"
206
+ - "Turn this product photo into a video with a slow camera orbit"
207
+ - "Analyse this tutorial video and index it for search"
208
+ - "Detect scene changes in this security camera feed"
209
+ - "Extract structured scenes from this presentation recording"
210
+ - "Create a video from this image with gentle parallax motion"
@@ -1,22 +1,82 @@
  ---
  name: vision-ocr
- description: Extract text from images using OCR and vision AI
- version: 1.0.0
+ version: '1.1.0'
+ description: Extract text from images using OCR and vision AI with the performOCR() high-level API or the full VisionPipeline.
+ author: Wunderland
+ namespace: wunderland
+ category: vision
  tags: [vision, ocr, text-extraction, document, handwriting]
- tools_required: [vision-pipeline]
+ requires_secrets: []
+ requires_tools: [vision-pipeline]
  ---
  
  # Vision & OCR
  
- Extract text from images, documents, and handwritten notes using a progressive 3-tier pipeline: local OCR (PaddleOCR) -> local vision models (TrOCR, Florence-2) -> cloud vision (GPT-4o, Claude).
+ Extract text from images, documents, and handwritten notes using a progressive 3-tier pipeline: local OCR (PaddleOCR / Tesseract) -> local vision models (TrOCR, Florence-2) -> cloud vision LLM (GPT-4o, Claude, Gemini).
+ 
+ ## High-Level API: `performOCR()`
+ 
+ For one-shot text extraction, use the top-level `performOCR()` function. It handles input resolution, pipeline lifecycle, and cleanup automatically.
+ 
+ ```typescript
+ import { performOCR } from '@framers/agentos';
+ 
+ const result = await performOCR({
+   image: '/path/to/receipt.png',  // file path, URL, base64, or Buffer
+   strategy: 'progressive',        // 'progressive' | 'local-only' | 'cloud-only'
+   confidenceThreshold: 0.7,       // min confidence before escalating tier
+ });
+ 
+ console.log(result.text);       // extracted text
+ console.log(result.confidence); // 0-1 score
+ console.log(result.tier);       // 'ocr' | 'handwriting' | 'document-ai' | 'cloud-vision'
+ console.log(result.provider);   // 'paddle' | 'tesseract' | 'openai' | etc.
+ console.log(result.regions);    // bounding boxes (when available)
+ ```
+ 
+ ## When to Use `performOCR()` vs `VisionPipeline`
+ 
+ | Use case | Recommendation |
+ |----------|---------------|
+ | One-shot text extraction from a single image | `performOCR()` — simplest API |
+ | Batch processing many images | `VisionPipeline` — create once, reuse, dispose when done |
+ | Need CLIP embeddings or document layout | `VisionPipeline` — richer result shape |
+ | Quick scripts and integrations | `performOCR()` — zero boilerplate |
+ 
+ ## Progressive Tier System
+ 
+ The pipeline tries the cheapest/fastest tier first and escalates only when confidence is below the threshold:
+ 
+ 1. **Tier 1 — Local OCR** (PaddleOCR or Tesseract.js): Fast, free, offline. Handles printed text in documents, receipts, screenshots.
+ 2. **Tier 2 — Local Vision Models** (TrOCR / Florence-2): Still offline. Handles handwritten notes and complex document layouts with tables and figures.
+ 3. **Tier 3 — Cloud Vision LLM** (GPT-4o / Claude / Gemini): Best quality. Handles photographs, diagrams, mixed content — anything the local tiers can't confidently read.
+ 
+ ## Strategy Selection
+ 
+ - **`'progressive'`** (default): Start local, escalate only if needed. Best cost/quality balance for most use cases.
+ - **`'local-only'`**: Never call cloud APIs. Use for air-gapped environments, privacy-sensitive data (medical records, financial docs), or when no API keys are available.
+ - **`'cloud-only'`**: Skip the local tiers entirely and send straight to a cloud vision LLM. Use when you need the highest-quality output and cost is not a concern.
+ 
+ ## Input Formats
+ 
+ `performOCR()` accepts four input types:
+ 
+ - **File path**: `'/tmp/scan.png'` — reads from disk
+ - **URL**: `'https://example.com/receipt.jpg'` — fetches via HTTP
+ - **Base64 string**: Raw base64 or `data:image/png;base64,...` data URIs — decoded in-memory
+ - **Buffer**: Raw image bytes — passed directly to the pipeline
  
  ## Capabilities
- - **Printed text OCR**: Extract text from documents, receipts, screenshots
+ 
+ - **Printed text OCR**: Extract text from documents, receipts, screenshots, PDFs
  - **Handwriting recognition**: Read handwritten notes and forms via TrOCR
- - **Document layout**: Understand tables, figures, headings via Florence-2
- - **Image embeddings**: Generate CLIP vectors for semantic image search
+ - **Document layout understanding**: Parse tables, figures, headings via Florence-2
+ - **Bounding box regions**: Spatial text locations for overlay rendering
+ - **Image embeddings**: Generate CLIP vectors for semantic image search (via `VisionPipeline` only)
+ 
+ ## Examples
  
- ## Example
- "Read the text from this receipt"
- "What does this handwritten note say?"
- "Extract the table data from this PDF page"
+ - "Read the text from this receipt"
+ - "What does this handwritten note say?"
+ - "Extract the table data from this PDF page"
+ - "OCR this screenshot and return the error message"