npm - @miketromba/ploof - Versions diffs - 0.2.0 → 0.3.0 - Mend

@miketromba/ploof 0.2.0 → 0.3.0

Files changed (5)

package/README.md +86 -4
package/SPEC.md +106 -1
package/dist/ploof.js +203 -202
package/package.json +2 -2
package/skills/asset-generation/SKILL.md +1 -1

package/README.md CHANGED

@@ -9,7 +9,7 @@
   <img src="https://img.shields.io/badge/node-%3E%3D18-brightgreen" alt="node version" />
 </p>
-Ploof is a CLI for generating and editing creative assets with AI providers. It supports OpenAI image generation/editing and OpenAI video generation/editing today, plus the legacy OpenAI image variations endpoint when the authenticated project has access. The provider registry is designed for audio and broader model marketplaces over time.
+Ploof is a CLI for generating and editing creative assets with AI providers. It supports OpenAI image, video, and audio generation/processing today, plus the legacy OpenAI image variations endpoint when the authenticated project has access. The provider registry is designed for broader model marketplaces over time.
 It is built for both developers and AI agents: predictable commands, parseable output, local auth profiles, YAML manifests, parallel execution, and a companion skill.
@@ -24,13 +24,15 @@ It is built for both developers and AI agents: predictable commands, parseable o
 | OpenAI video generation | Supported |
 | OpenAI video editing/extensions | Supported |
 | OpenAI video downloads/library/characters | Supported |
+| OpenAI audio generation / TTS | Supported |
+| OpenAI audio transcription | Supported |
+| OpenAI audio translation | Supported |
 | Context images and masks | Supported |
-| Video references and source videos | Supported |
+| Image, video, and audio input assets | Supported |
 | YAML/JSON batch manifests | Supported |
 | Dependency-aware parallel runs | Supported |
 | Agent instructions via `ploof learn` | Supported |
 | Additional providers | Planned |
-| Audio generation | Planned |
 ## Install
@@ -80,6 +82,16 @@ ploof video generate \
   --seconds 4 \
   --out assets/clip.mp4
+# Generate and transcribe speech
+ploof audio generate \
+  --text "Ploof can generate speech and process audio." \
+  --voice alloy \
+  --out assets/speech.mp3
+ploof audio transcribe \
+  --audio assets/speech.mp3 \
+  --out assets/transcript.json
 # Run a manifest
 ploof run assets.yaml --parallel 4
 ```
@@ -247,6 +259,55 @@ ploof video character create --name Mossy --video character.mp4 --output json
 ploof video character get char_abc123 --output json
 ```
+## Audio Generation And Processing
+OpenAI audio generation defaults to `gpt-4o-mini-tts`, `alloy`, and `mp3` when model, voice, and format are omitted.
+```bash
+ploof audio generate \
+  --provider openai \
+  --text "A concise product narration for the demo reel." \
+  --model gpt-4o-mini-tts \
+  --voice alloy \
+  --format mp3 \
+  --out assets/narration.mp3 \
+  --output json
+```
+Useful generation flags:
+| Flag | Description |
+| --- | --- |
+| `--model <model>` | TTS model, for example `gpt-4o-mini-tts` |
+| `--voice <voice>` | Built-in voice such as `alloy`, `coral`, `nova`, or `shimmer` |
+| `--voice-id <id>` | Custom voice id |
+| `--instructions <text>` | Voice/style instructions for supported models |
+| `--format <format>` | `mp3`, `opus`, `aac`, `flac`, `wav`, or `pcm` |
+| `--speed <number>` | Speech speed |
+| `--param key=value` | Provider-specific pass-through parameter |
+| `--json '{...}'` | Provider-specific JSON object |
+Transcription and translation:
+```bash
+ploof audio transcribe \
+  --audio assets/narration.mp3 \
+  --model gpt-4o-mini-transcribe \
+  --out assets/transcript.json \
+  --output json
+ploof audio translate \
+  --audio assets/spanish.mp3 \
+  --model whisper-1 \
+  --format text \
+  --out assets/translation.txt \
+  --output json
+```
+Transcription supports `--language`, `--prompt`, `--format`, `--temperature`, `--include`, `--timestamp-granularity`, `--chunking-strategy`, `--known-speaker-name`, and `--known-speaker-reference`. Translation supports `--prompt`, `--format`, and `--temperature`.
+Ploof writes complete static assets to disk. Streaming transport settings such as OpenAI `stream=true` for transcription or `stream_format=sse` for speech are rejected because they do not produce a finished asset file directly.
 ## Batch Manifests
 ```yaml
@@ -294,6 +355,27 @@ tasks:
     wait: true
     download: true
     output: assets/clip.mp4
+  - id: narration
+    kind: audio.generate
+    provider: openai
+    text: "Short narration for the generated clip."
+    params:
+      model: gpt-4o-mini-tts
+      voice: alloy
+      response_format: mp3
+    output: assets/narration.mp3
+  - id: transcript
+    kind: audio.transcribe
+    provider: openai
+    needs: [narration]
+    inputs:
+      audio:
+        task: narration
+    params:
+      model: gpt-4o-mini-transcribe
+    output: assets/transcript.json
 ```
 Run it:
@@ -372,7 +454,7 @@ bun run build
 npm pack --dry-run
 ```
-The default test suite includes mocked OpenAI end-to-end tests. Those tests run real `ploof` CLI commands against a local mock OpenAI server and verify generated files, edit uploads, video job polling/downloads, sidecar metadata, and dependency-aware manifests without spending API credits.
+The default test suite includes mocked OpenAI end-to-end tests. Those tests run real `ploof` CLI commands against a local mock OpenAI server and verify generated files, edit uploads, video job polling/downloads, audio generation/processing, sidecar metadata, and dependency-aware manifests without spending API credits.
 Live OpenAI tests are opt-in only:
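
The audio defaults and the streaming restriction described in the README hunks above can be sketched roughly as follows. This is a minimal illustration, not code from `package/dist/ploof.js`: the names `AudioParams`, `SPEECH_DEFAULTS`, `assertStaticOutput`, and `resolveSpeechParams` are hypothetical, and the only details taken from the diff are the default values (`gpt-4o-mini-tts`, `alloy`, `mp3`), the `response_format` parameter name used in the manifest, and the two rejected settings (`stream=true` for transcription, `stream_format=sse` for speech).

```typescript
// Hypothetical parameter shape; the real Ploof types are not shown in this diff.
type AudioParams = {
  model?: string
  voice?: string
  response_format?: string
  stream?: boolean
  stream_format?: string
  [key: string]: unknown
}

// Defaults the README states for OpenAI audio generation.
const SPEECH_DEFAULTS = {
  model: "gpt-4o-mini-tts",
  voice: "alloy",
  response_format: "mp3",
} as const

// Reject streaming transport settings, which do not yield a finished asset file.
function assertStaticOutput(
  kind: "audio.generate" | "audio.transcribe",
  params: AudioParams
): void {
  if (kind === "audio.generate" && params.stream_format === "sse") {
    throw new Error("stream_format=sse is rejected: speech must be written as a complete file")
  }
  if (kind === "audio.transcribe" && params.stream === true) {
    throw new Error("stream=true is rejected: transcripts must be written as a complete file")
  }
}

// Fill in model, voice, and format only when the caller omitted them.
function resolveSpeechParams(params: AudioParams = {}): AudioParams {
  assertStaticOutput("audio.generate", params)
  return {
    ...params,
    model: params.model ?? SPEECH_DEFAULTS.model,
    voice: params.voice ?? SPEECH_DEFAULTS.voice,
    response_format: params.response_format ?? SPEECH_DEFAULTS.response_format,
  }
}
```

Under these assumptions, `resolveSpeechParams({ voice: "coral" })` keeps the chosen voice and fills in the default model and format, while passing `stream_format: "sse"` throws before any request would be made.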

package/SPEC.md CHANGED

@@ -2,7 +2,7 @@
 ## Summary
-Ploof is an npm-published CLI for generating and editing assets through AI generation providers. It starts with OpenAI image and video generation/editing, but the architecture must support multiple authenticated providers, multiple asset modalities, provider-specific settings, and parallel execution across mixed jobs.
+Ploof is an npm-published CLI for generating, editing, and processing assets through AI generation providers. It starts with OpenAI image, video, and audio generation/processing, but the architecture must support multiple authenticated providers, multiple asset modalities, provider-specific settings, and parallel execution across mixed jobs.
 The product should feel like a small, sharp developer tool: easy to run manually, predictable in scripts, and optimized for AI agents.
@@ -97,6 +97,9 @@ Initial capabilities:
 - `video.delete`
 - `video.character.create`
 - `video.character.get`
+- `audio.generate`
+- `audio.transcribe`
+- `audio.translate`
 Future providers should be added through the provider registry without changing the manifest model.
@@ -303,6 +306,80 @@ project is eligible for that workflow. Extensions accept a source video id or
 upload, plus a prompt and `--seconds`. `video remix` is supported for the SDK's
 legacy remix endpoint, but new integrations should prefer `video edit`.
+### Audio Generation And Processing
+OpenAI audio generation uses the speech API and defaults to
+`gpt-4o-mini-tts`, `alloy`, and `mp3` when model, voice, and output format are
+omitted.
+```bash
+ploof audio generate \
+  --provider openai \
+  --text "Short narration for the generated asset." \
+  --model gpt-4o-mini-tts \
+  --voice alloy \
+  --format mp3 \
+  --out assets/narration.mp3 \
+  --output json
+```
+First-class OpenAI audio generation flags:
+- `--model <model>`
+- `--voice <voice>`
+- `--voice-id <id>`
+- `--instructions <text>`
+- `--format <format>` / `--response-format <format>`
+- `--speed <number>`
+- `--param key=value`
+- `--json '{...}'`
+Audio processing supports transcription and English translation:
+```bash
+ploof audio transcribe \
+  --audio assets/narration.mp3 \
+  --model gpt-4o-mini-transcribe \
+  --out assets/transcript.json \
+  --output json
+ploof audio translate \
+  --audio assets/spanish.mp3 \
+  --model whisper-1 \
+  --format text \
+  --out assets/translation.txt \
+  --output json
+```
+Transcription first-class flags:
+- `--model <model>`
+- `--language <code>`
+- `--prompt <prompt>`
+- `--format <format>` / `--response-format <format>`
+- `--temperature <number>`
+- `--include <value>`
+- `--timestamp-granularity word|segment`
+- `--chunking-strategy auto|{...}`
+- `--known-speaker-name <name>`
+- `--known-speaker-reference <data-url>`
+- `--param key=value`
+- `--json '{...}'`
+Translation first-class flags:
+- `--model <model>`
+- `--prompt <prompt>`
+- `--format <format>` / `--response-format <format>`
+- `--temperature <number>`
+- `--param key=value`
+- `--json '{...}'`
+Ploof is a static asset generation CLI. Audio commands request complete outputs
+and write them to disk. Streaming transport settings such as OpenAI
+`stream=true` for transcription or `stream_format=sse` for speech are rejected
+because they do not directly produce finished asset files.
 ### Batch Run
 ```bash
@@ -356,6 +433,27 @@ tasks:
     wait: true
     download: true
     output: assets/clip.mp4
+  - id: narration
+    kind: audio.generate
+    provider: openai
+    text: "Short narration for the generated clip."
+    params:
+      model: gpt-4o-mini-tts
+      voice: alloy
+      response_format: mp3
+    output: assets/narration.mp3
+  - id: transcript
+    kind: audio.transcribe
+    provider: openai
+    needs: [narration]
+    inputs:
+      audio:
+        task: narration
+    params:
+      model: gpt-4o-mini-transcribe
+    output: assets/transcript.json
 ```
 ## Asset Input Model
@@ -388,6 +486,10 @@ OpenAI video generation/editing maps:
 - `role=reference` to `input_reference` for image-guided video generation.
 - `role=video` to source video uploads for eligible edit/extension workflows.
+OpenAI audio processing maps:
+- `role=audio` to the uploaded audio file for transcription or translation.
 Future providers can map roles such as `reference`, `style`, `init-image`, `audio`, or `video` differently.
 ## Provider Architecture
@@ -411,6 +513,9 @@ type Provider = {
   runVideoDelete(job, context): Promise<ProviderResult>
   runVideoCharacterCreate(job, context): Promise<ProviderResult>
   runVideoCharacterGet(job, context): Promise<ProviderResult>
+  runAudioGenerate(job, context): Promise<ProviderResult>
+  runAudioTranscribe(job, context): Promise<ProviderResult>
+  runAudioTranslate(job, context): Promise<ProviderResult>
 }
 ```
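
One way the three new capability names (`audio.generate`, `audio.transcribe`, `audio.translate`) could be routed to the three new `Provider` methods is sketched below. This is an assumption-laden illustration rather than Ploof's actual dispatch code: the `Job`, `Context`, and `ProviderResult` shapes, the `AUDIO_HANDLERS` table, and the `dispatchAudio` and `mockProvider` names are all hypothetical, because the diff shows only the method signatures.

```typescript
// Hypothetical shapes: the diff does not define job, context, or ProviderResult.
type Job = { id: string; kind: string; params?: Record<string, unknown> }
type Context = { dryRun?: boolean }
type ProviderResult = { jobId: string; outputs: string[] }

// Only the audio subset of the Provider type shown in SPEC.md.
type AudioProvider = {
  runAudioGenerate(job: Job, context: Context): Promise<ProviderResult>
  runAudioTranscribe(job: Job, context: Context): Promise<ProviderResult>
  runAudioTranslate(job: Job, context: Context): Promise<ProviderResult>
}

// Capability name (manifest `kind`) to provider method name.
const AUDIO_HANDLERS = {
  "audio.generate": "runAudioGenerate",
  "audio.transcribe": "runAudioTranscribe",
  "audio.translate": "runAudioTranslate",
} as const

type AudioKind = keyof typeof AUDIO_HANDLERS

// Type guard so unknown kinds fail loudly instead of indexing with a bad key.
function isAudioKind(kind: string): kind is AudioKind {
  return Object.prototype.hasOwnProperty.call(AUDIO_HANDLERS, kind)
}

async function dispatchAudio(
  provider: AudioProvider,
  job: Job,
  context: Context
): Promise<ProviderResult> {
  if (!isAudioKind(job.kind)) {
    throw new Error(`Unsupported audio kind: ${job.kind}`)
  }
  return provider[AUDIO_HANDLERS[job.kind]](job, context)
}

// In-memory stand-in provider; it performs no network calls.
const mockProvider: AudioProvider = {
  async runAudioGenerate(job) {
    return { jobId: job.id, outputs: ["assets/narration.mp3"] }
  },
  async runAudioTranscribe(job) {
    return { jobId: job.id, outputs: ["assets/transcript.json"] }
  },
  async runAudioTranslate(job) {
    return { jobId: job.id, outputs: ["assets/translation.txt"] }
  },
}
```

A lookup table keeps the manifest `kind` strings and the method names in one place, so adding a capability in this sketch means adding one table entry and one method.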