npm - oomi-ai - Versions diffs - 0.2.20 → 0.2.23 - Mend

oomi-ai 0.2.20 → 0.2.23

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (8)

package/README.md CHANGED

@@ -4,7 +4,7 @@ OpenClaw channel plugin and bridge tooling for Oomi managed chat and voice.
 ## Current Focus
-`0.2.19` adds the first live persona automation lane:
+`0.2.21` adds the first live persona automation lane:
 - WebSpatial-based persona scaffolding for generated Oomi apps
 - a high-level `oomi personas create-managed` command for agent-driven persona creation
 - device-authenticated persona runtime registration and job callbacks
@@ -141,8 +141,12 @@ That bridge:
 This is the part of the package most likely to matter when debugging voice turn failures.
-For managed voice replies, the extension also preserves an explicit hidden `metadata.spoken` sidecar when upstream provides one.
-If upstream does not provide one, the extension now synthesizes a bounded hidden fallback from the visible assistant text so backend TTS can speak a cleaner and more varied version without changing user-visible chat.
+For managed cloned-voice replies, the canonical contract is:
+- visible assistant `content` stays user-facing
+- hidden `metadata.spoken` carries the backend TTS payload
+- the shared helper in `lib/spokenMetadata.js` is used by both the extension and the local bridge to preserve or normalize that sidecar before it reaches the backend
+The backend cloned-voice path is intentionally strict. If `metadata.spoken` does not reach Oomi, backend TTS fails instead of speaking a flat fallback voice.
 ## Persona Scaffolding
@@ -242,7 +246,7 @@ If you are inspecting this package on npm, the main architectural points are:
   - `idempotencyKey` handling
   - bridge status that does not report `connected` before managed subscription is ready
   - runtime fault isolation so local session failures are less likely to crash the whole provider
-  - hidden managed-voice speech metadata forwarding, with a synthesized fallback when upstream does not provide `metadata.spoken`
+  - one shared hidden managed-voice speech metadata helper used by both the extension and the local bridge
 If you are developing the plugin, test the packaged surface with:

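The cloned-voice contract described in the README hunk above can be sketched roughly as follows. This is an illustration only, not the Oomi backend: `pickTtsPayload` and the sample `message` are hypothetical names introduced here. Only the split between visible `content` and hidden `metadata.spoken`, and the fail-instead-of-fallback behavior, come from the README text.

```javascript
// Hypothetical sketch of the strict cloned-voice contract from the README.
// `pickTtsPayload` is an illustrative name, not part of oomi-ai.
function pickTtsPayload(message) {
  const spoken = message && message.metadata ? message.metadata.spoken : undefined;
  const text = spoken && typeof spoken.text === 'string' ? spoken.text.trim() : '';
  if (!text) {
    // Strict path: no flat fallback is built from the visible content.
    throw new Error('metadata.spoken did not reach the backend');
  }
  return { text, language: spoken.language };
}

const message = {
  role: 'assistant',
  content: 'Sure, **done**!', // stays user-facing
  metadata: { spoken: { text: 'Sure, done!', language: 'English' } }, // hidden TTS payload
};

console.log(pickTtsPayload(message).text);
```

A message without `metadata.spoken` makes this sketch throw, which mirrors the "backend TTS fails" wording rather than any confirmed error type or message.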
package/agent_instructions.md CHANGED

@@ -160,20 +160,20 @@ When the runtime supports it, voice turns may include a hidden speech sidecar on
 }
 ```
-Rules:
-- visible `content` remains the source of truth for Oomi chat rendering
-- for managed voice replies, include `metadata.spoken` when delivery benefits from cleaner phrasing or explicit speaking guidance
-- `metadata.spoken.text` is for backend TTS only
-- `metadata.spoken.language` should be one of the supported Qwen language values such as `English`
-- `metadata.spoken.segments` can carry bounded per-segment prosody for pace, pitch, volume, and pause timing
-- `metadata.spoken.instructions` should be natural-language guidance, not raw bracket tags
-- `metadata.spoken.style` is optional metadata for debugging/future mapping
-- if no hidden speech sidecar exists, Oomi falls back to speaking the visible assistant text
-Current plugin behavior:
-- if you provide `metadata.spoken`, the plugin preserves it unchanged
-- if you do not provide `metadata.spoken`, the plugin now synthesizes a bounded hidden fallback from visible assistant text for backend TTS
-- visible chat text is still never rewritten by the plugin
+Rules:
+- visible `content` remains the source of truth for Oomi chat rendering
+- for managed cloned-voice replies, include `metadata.spoken` whenever backend TTS should speak the turn
+- `metadata.spoken.text` is for backend TTS only
+- `metadata.spoken.language` should be one of the supported Qwen language values such as `English`
+- `metadata.spoken.segments` can carry bounded per-segment prosody for pace, pitch, volume, and pause timing
+- `metadata.spoken.instructions` should be natural-language guidance, not raw bracket tags
+- `metadata.spoken.style` is optional metadata for debugging/future mapping
+Current package behavior:
+- if you provide `metadata.spoken`, the package preserves it unchanged
+- if you omit `metadata.spoken`, the shared package helper may synthesize it as a compatibility guardrail before backend TTS
+- visible chat text is never rewritten by the package
+- backend cloned voice is strict: if `metadata.spoken` does not reach Oomi, playback fails instead of falling back to flat speech
 ## Avatar Commands

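As a rough illustration of the rules above, the sketch below builds a `metadata.spoken` payload with per-segment prosody and checks it against the bounded value sets that appear in `lib/spokenMetadata.js` later in this diff. `findUnboundedFields` is a hypothetical helper written for this example. The package helper itself drops unknown values and clamps long pauses instead of reporting them.

```javascript
// Value sets copied from lib/spokenMetadata.js in this diff.
const PACE = new Set(['very_slow', 'slow', 'medium', 'medium_fast', 'fast']);
const PITCH = new Set(['low', 'slightly_low', 'neutral', 'slightly_high', 'high']);
const VOLUME = new Set(['soft', 'normal', 'projected']);

// Hypothetical helper: lists the fields of one segment that fall outside those bounds.
function findUnboundedFields(segment) {
  const problems = [];
  if (typeof segment.text !== 'string' || !segment.text.trim()) problems.push('text');
  if (segment.pace !== undefined && !PACE.has(segment.pace)) problems.push('pace');
  if (segment.pitch !== undefined && !PITCH.has(segment.pitch)) problems.push('pitch');
  if (segment.volume !== undefined && !VOLUME.has(segment.volume)) problems.push('volume');
  const pause = segment.pause_after_ms;
  // The diff clamps pauses to the 0..1200 ms range.
  if (pause !== undefined && !(Number.isFinite(pause) && pause >= 0 && pause <= 1200)) {
    problems.push('pause_after_ms');
  }
  return problems;
}

const spoken = {
  text: 'Hey. Want to keep going?',
  language: 'English',
  instructions: 'Speak warmly, with a short pause after the greeting.',
  segments: [
    { text: 'Hey.', pace: 'medium', pitch: 'neutral', volume: 'normal', pause_after_ms: 180 },
    { text: 'Want to keep going?', pace: 'medium', pitch: 'slightly_high', volume: 'normal', pause_after_ms: 0 },
  ],
};

console.log(spoken.segments.map(findUnboundedFields));
```

Both sample segments stay inside the bounds, so each one yields an empty list of problems.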
package/bin/oomi-ai.js CHANGED

@@ -1673,6 +1673,31 @@ function extractTextFromGatewayMessage(message) {
     .join(' ');
 }
+function summarizeVoiceFrameContract(frameText) {
+  const frame = parseJsonPayload(frameText);
+  if (!frame || typeof frame !== 'object') {
+    return { parseable: false };
+  }
+  const payload = frame.payload && typeof frame.payload === 'object' ? frame.payload : {};
+  const message = payload.message && typeof payload.message === 'object' ? payload.message : {};
+  const metadata = message.metadata && typeof message.metadata === 'object' ? message.metadata : {};
+  const spokenRaw = Object.prototype.hasOwnProperty.call(metadata, 'spoken') ? metadata.spoken : undefined;
+  const spokenNormalized = normalizeSpokenMetadata(spokenRaw);
+  const text = extractTextFromGatewayMessage(message);
+  return {
+    parseable: true,
+    event: typeof frame.event === 'string' ? frame.event : '',
+    state: typeof payload.state === 'string' ? payload.state : '',
+    role: typeof message.role === 'string' ? message.role : '',
+    contentLength: text.length,
+    hasMetadata: Object.keys(metadata).length > 0,
+    hasSpokenKey: Object.prototype.hasOwnProperty.call(metadata, 'spoken'),
+    spokenRawType: spokenRaw === undefined ? 'missing' : Array.isArray(spokenRaw) ? 'array' : typeof spokenRaw,
+    spokenNormalized: Boolean(spokenNormalized),
+    spokenSegmentCount: Array.isArray(spokenNormalized?.segments) ? spokenNormalized.segments.length : 0,
+  };
+}
 function ensureVoiceAssistantSpokenMetadata(frameText) {
   const frame = parseJsonPayload(frameText);
   if (!frame || typeof frame !== 'object') {
@@ -1688,7 +1713,12 @@ function ensureVoiceAssistantSpokenMetadata(frameText) {
   }
   const message = payload.message && typeof payload.message === 'object' ? payload.message : null;
-  if (!message || message.role !== 'assistant') {
+  if (!message) {
+    return { frameText, changed: false, reason: '' };
+  }
+  const messageRole = typeof message.role === 'string' ? message.role.trim() : '';
+  if (messageRole && messageRole !== 'assistant') {
     return { frameText, changed: false, reason: '' };
   }
@@ -1697,10 +1727,10 @@ function ensureVoiceAssistantSpokenMetadata(frameText) {
       ? message.metadata
       : {};
   const metadata = { ...originalMetadata };
-  const explicitSpokenPresent = Object.prototype.hasOwnProperty.call(originalMetadata, 'spoken');
+  const normalizedExplicitSpoken = normalizeSpokenMetadata(originalMetadata.spoken);
   const spoken =
-    normalizeSpokenMetadata(originalMetadata.spoken) ||
-    (!explicitSpokenPresent ? inferSpokenMetadataFromContent(extractTextFromGatewayMessage(message)) : null);
+    normalizedExplicitSpoken ||
+    inferSpokenMetadataFromContent(extractTextFromGatewayMessage(message));
   if (!spoken) {
     return { frameText, changed: false, reason: '' };
   }
@@ -1720,7 +1750,7 @@ function ensureVoiceAssistantSpokenMetadata(frameText) {
   return {
     frameText: nextFrame,
     changed: nextFrame !== frameText,
-    reason: explicitSpokenPresent ? 'normalized' : 'synthesized',
+    reason: normalizedExplicitSpoken ? 'normalized' : (messageRole ? 'synthesized' : 'synthesized_missing_role'),
   };
 }
@@ -2953,10 +2983,16 @@ async function startOpenclawBridge(flags) {
       gatewaySocket.on('message', runBridgeCallbackSafely((gatewayRaw) => {
         let frame = typeof gatewayRaw === 'string' ? gatewayRaw : gatewayRaw.toString();
         if (classifyBridgeSessionScope(sessionId) === 'voice') {
+          const beforeSummary = summarizeVoiceFrameContract(frame);
           const spokenNormalized = ensureVoiceAssistantSpokenMetadata(frame);
           if (spokenNormalized.changed) {
             frame = spokenNormalized.frameText;
-            console.log(`[bridge] voice.spoken_metadata.${spokenNormalized.reason} ${sessionId}`);
+            console.log(`[bridge] voice.spoken_metadata.${spokenNormalized.reason} ${sessionId} ${JSON.stringify({
+              before: beforeSummary,
+              after: summarizeVoiceFrameContract(frame),
+            })}`);
+          } else if (beforeSummary.event === 'chat' && beforeSummary.state === 'final') {
+            console.log(`[bridge] voice.chat.final ${sessionId} ${JSON.stringify(beforeSummary)}`);
           }
         }
         const gatewayPayload = parseJsonPayload(frame);
@@ -3313,13 +3349,17 @@ async function startOpenclawBridge(flags) {
         return;
       }
-      if (payload.type === 'client.frame') {
-        const sessionId = String(payload.sessionId || '').trim();
-        const frame = typeof payload.frame === 'string' ? payload.frame : '';
-        if (!sessionId || !frame) return;
-        console.log(`[bridge] client.frame ${sessionId}`);
-        const sessionBridge = getOrCreateGatewaySession(sessionId);
-        if (!sessionBridge) return;
+      if (payload.type === 'client.frame') {
+        const sessionId = String(payload.sessionId || '').trim();
+        const frame = typeof payload.frame === 'string' ? payload.frame : '';
+        if (!sessionId || !frame) return;
+        if (classifyBridgeSessionScope(sessionId) === 'voice') {
+          console.log(`[bridge] client.frame ${sessionId} ${JSON.stringify(summarizeVoiceFrameContract(frame))}`);
+        } else {
+          console.log(`[bridge] client.frame ${sessionId}`);
+        }
+        const sessionBridge = getOrCreateGatewaySession(sessionId);
+        if (!sessionBridge) return;
         const requestMeta = extractGatewayRequestMeta(frame);
         if (requestMeta) {
           if (!(sessionBridge.pendingRequests instanceof Map)) {

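The branch changes to `ensureVoiceAssistantSpokenMetadata` in the hunks above can be restated as a small decision function. This is a simplified sketch under stated assumptions. `classifySpokenPath` is a hypothetical name, `content` is assumed to be a plain string (the real code goes through `extractTextFromGatewayMessage`), and the frame-level checks that the diff elides are not modeled.

```javascript
// Simplified, hypothetical restatement of the 0.2.23 branch logic.
// It only reports which path a message would take; it does not rewrite frames.
function classifySpokenPath(message) {
  const role = typeof message.role === 'string' ? message.role.trim() : '';
  // 0.2.23 skips only when a role is present and is not 'assistant'.
  if (role && role !== 'assistant') return 'skipped';

  const spoken = message.metadata ? message.metadata.spoken : undefined;
  const explicitUsable =
    Boolean(spoken) &&
    typeof spoken === 'object' &&
    !Array.isArray(spoken) &&
    typeof spoken.text === 'string' &&
    spoken.text.trim() !== '';
  if (explicitUsable) return 'normalized';

  // An unusable explicit value no longer blocks synthesis from visible text.
  const hasVisibleText = typeof message.content === 'string' && message.content.trim() !== '';
  if (!hasVisibleText) return 'skipped';
  return role ? 'synthesized' : 'synthesized_missing_role';
}

console.log(classifySpokenPath({ role: 'assistant', content: 'hi', metadata: { spoken: 'oops' } }));
console.log(classifySpokenPath({ content: 'hi' }));
```

The first call shows the behavior change: in 0.2.20 the presence of a `spoken` key, even an unusable one, suppressed inference, while in 0.2.23 inference runs whenever normalization yields nothing.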
package/lib/spokenMetadata.js CHANGED

@@ -1,273 +1,273 @@
-function trimString(value, fallback = '') {
-  return typeof value === 'string' && value.trim() ? value.trim() : fallback;
-}
-function clampInteger(value, fallback, { min = 1, max = Number.MAX_SAFE_INTEGER } = {}) {
-  if (typeof value !== 'number' || !Number.isFinite(value)) return fallback;
-  const normalized = Math.floor(value);
-  if (normalized < min) return fallback;
-  if (normalized > max) return max;
-  return normalized;
-}
-const BOUNDED_LANGUAGE_TYPES = new Set([
-  'Auto',
-  'Chinese',
-  'English',
-  'German',
-  'Italian',
-  'Portuguese',
-  'Spanish',
-  'Japanese',
-  'Korean',
-  'French',
-  'Russian',
-]);
-const BOUNDED_PACE_VALUES = new Set(['very_slow', 'slow', 'medium', 'medium_fast', 'fast']);
-const BOUNDED_PITCH_VALUES = new Set(['low', 'slightly_low', 'neutral', 'slightly_high', 'high']);
-const BOUNDED_ENERGY_VALUES = new Set(['soft', 'calm', 'warm', 'bright', 'intense']);
-const BOUNDED_VOLUME_VALUES = new Set(['soft', 'normal', 'projected']);
-function inferSpokenLanguage(text) {
-  const normalized = trimString(text);
-  if (!normalized) return 'English';
-  return 'English';
-}
-function normalizeSpokenSegment(segment) {
-  if (!segment || typeof segment !== 'object' || Array.isArray(segment)) return null;
-  const text = trimString(segment.text);
-  if (!text) return null;
-  const normalized = { text };
-  const pace = trimString(segment.pace);
-  const pitch = trimString(segment.pitch);
-  const energy = trimString(segment.energy);
-  const volume = trimString(segment.volume);
-  const pauseAfterMs = clampInteger(segment.pause_after_ms, 0, { min: 0, max: 1200 });
-  if (BOUNDED_PACE_VALUES.has(pace)) normalized.pace = pace;
-  if (BOUNDED_PITCH_VALUES.has(pitch)) normalized.pitch = pitch;
-  if (BOUNDED_ENERGY_VALUES.has(energy)) normalized.energy = energy;
-  if (BOUNDED_VOLUME_VALUES.has(volume)) normalized.volume = volume;
-  normalized.pause_after_ms = pauseAfterMs;
-  return normalized;
-}
-function stripEmoji(text) {
-  return text.replace(/[\uFE0E\uFE0F]/g, '').replace(/\p{Extended_Pictographic}|\p{Emoji_Presentation}/gu, '');
-}
-function normalizeSpeechText(text) {
-  return stripEmoji(text)
-    .replace(/\*\*(.*?)\*\*/g, '$1')
-    .replace(/__(.*?)__/g, '$1')
-    .replace(/`([^`]+)`/g, '$1')
-    .replace(/[â€“â€”]/g, ', ')
-    .replace(/â€¦/g, '...')
-    .replace(/\s+/g, ' ')
-    .replace(/\s+([,.;!?])/g, '$1')
-    .replace(/([,.;!?])(?=[^\s])/g, '$1 ')
-    .replace(/,\s*,+/g, ', ')
-    .replace(/\s+/g, ' ')
-    .trim();
-}
-function splitSpeechSegments(text) {
-  const normalized = normalizeSpeechText(text);
-  if (!normalized) return [];
-  const baseSegments = normalized
-    .split(/(?<=[.!?])\s+/)
-    .map((segment) => segment.trim())
-    .filter(Boolean);
-  const segments = [];
-  for (const segment of baseSegments) {
-    if (segment.length <= 96) {
-      segments.push(segment);
-      continue;
-    }
-    const clauseParts = segment
-      .split(/,\s+/)
-      .map((part) => part.trim())
-      .filter(Boolean);
-    if (clauseParts.length > 1) {
-      for (let index = 0; index < clauseParts.length; index += 1) {
-        const part = clauseParts[index];
-        const needsComma = index < clauseParts.length - 1 && !/[.!?]$/.test(part);
-        segments.push(needsComma ? `${part},` : part);
-      }
-      continue;
-    }
-    segments.push(segment);
-  }
-  if (segments.length <= 5) return segments;
-  return [...segments.slice(0, 4), segments.slice(4).join(' ').trim()];
-}
-function inferSegmentStyle(segmentText, index, totalSegments) {
-  const normalized = segmentText.toLowerCase();
-  const exclamatory = /!/.test(segmentText) || /\b(hell yeah|awesome|amazing|stoked|love|perfect|great)\b/.test(normalized);
-  const curious = /\?/.test(segmentText);
-  const reflective =
-    /\b(i think|i'm|i am|i've|i have|lately|right now|before this|each time|understand|it feels like)\b/.test(normalized) ||
-    segmentText.length > 60;
-  if (curious) {
-    return {
-      pace: 'medium',
-      pitch: 'slightly_high',
-      energy: 'warm',
-      volume: 'normal',
-      pause_after_ms: 0,
-    };
-  }
-  if (exclamatory) {
-    return {
-      pace: 'medium_fast',
-      pitch: 'slightly_high',
-      energy: 'bright',
-      volume: 'normal',
-      pause_after_ms: index < totalSegments - 1 ? 220 : 0,
-    };
-  }
-  if (reflective) {
-    return {
-      pace: 'medium',
-      pitch: 'neutral',
-      energy: 'warm',
-      volume: 'normal',
-      pause_after_ms: index < totalSegments - 1 ? 260 : 0,
-    };
-  }
-  return {
-    pace: 'medium',
-    pitch: 'neutral',
-    energy: 'warm',
-    volume: 'normal',
-    pause_after_ms: index < totalSegments - 1 ? 180 : 0,
-  };
-}
-function synthesizeSpokenSegments(text) {
-  const language = inferSpokenLanguage(text);
-  const rawSegments = splitSpeechSegments(text);
-  if (rawSegments.length === 0) return null;
-  const segments = rawSegments.map((segmentText, index) => ({
-    text: segmentText,
-    ...inferSegmentStyle(segmentText, index, rawSegments.length),
-  }));
-  return {
-    language,
-    segments,
-  };
-}
-function normalizeSpokenMetadata(spoken) {
-  if (!spoken || typeof spoken !== 'object' || Array.isArray(spoken)) return null;
-  const text = trimString(spoken.text);
-  if (!text) return null;
-  const normalized = { text };
-  const language = trimString(spoken.language);
-  if (BOUNDED_LANGUAGE_TYPES.has(language)) {
-    normalized.language = language;
-  }
-  const explicitSegments =
-    Array.isArray(spoken.segments)
-      ? spoken.segments.map((segment) => normalizeSpokenSegment(segment)).filter(Boolean)
-      : [];
-  if (explicitSegments.length > 0) {
-    normalized.segments = explicitSegments;
-  }
-  const instructions = trimString(spoken.instructions);
-  if (instructions) normalized.instructions = instructions;
-  if (spoken.style && typeof spoken.style === 'object' && !Array.isArray(spoken.style)) {
-    normalized.style = spoken.style;
-  }
-  const fallbackSegments = synthesizeSpokenSegments(text);
-  if (!normalized.language && fallbackSegments?.language) {
-    normalized.language = fallbackSegments.language;
-  }
-  if (!normalized.segments && fallbackSegments?.segments?.length) {
-    normalized.segments = fallbackSegments.segments;
-  }
-  return normalized;
-}
-function inferSpokenMetadataFromContent(content) {
-  const text = normalizeSpeechText(trimString(content));
-  if (!text) return null;
-  const synthesized = synthesizeSpokenSegments(text);
-  const normalized = text.toLowerCase();
-  const upbeat =
-    /!/.test(text) ||
-    /\b(hell yeah|awesome|amazing|great|stoked|love|glad|perfect|nice|cool)\b/.test(normalized);
-  const gentle =
-    /\b(sorry|gentle|softly|careful|reassuring|calm|okay|it'?s okay|i know)\b/.test(normalized);
-  const curious = /\?/.test(text);
-  if (upbeat) {
-    return {
-      text,
-      language: synthesized?.language || 'English',
-      segments: synthesized?.segments,
-      instructions: 'Speak with warm, upbeat conversational energy and natural pacing.',
-      style: { emotion: 'upbeat', energy: 'medium' },
-    };
-  }
-  if (gentle) {
-    return {
-      text,
-      language: synthesized?.language || 'English',
-      segments: synthesized?.segments,
-      instructions: 'Speak gently and reassuringly, with a calm pace and soft emphasis.',
-      style: { emotion: 'gentle', energy: 'low' },
-    };
-  }
-  if (curious) {
-    return {
-      text,
-      language: synthesized?.language || 'English',
-      segments: synthesized?.segments,
-      instructions: 'Speak naturally with curious, engaged intonation and a conversational pace.',
-      style: { emotion: 'curious', energy: 'medium' },
-    };
-  }
-  return {
-    text,
-    language: synthesized?.language || 'English',
-    segments: synthesized?.segments,
-    instructions: 'Speak naturally with light warmth and conversational pacing.',
-    style: { emotion: 'neutral', energy: 'medium' },
-  };
-}
-export {
-  inferSpokenMetadataFromContent,
-  normalizeSpokenMetadata,
-  normalizeSpeechText,
-};
+function trimString(value, fallback = '') {
+  return typeof value === 'string' && value.trim() ? value.trim() : fallback;
+}
+function clampInteger(value, fallback, { min = 1, max = Number.MAX_SAFE_INTEGER } = {}) {
+  if (typeof value !== 'number' || !Number.isFinite(value)) return fallback;
+  const normalized = Math.floor(value);
+  if (normalized < min) return fallback;
+  if (normalized > max) return max;
+  return normalized;
+}
+const BOUNDED_LANGUAGE_TYPES = new Set([
+  'Auto',
+  'Chinese',
+  'English',
+  'German',
+  'Italian',
+  'Portuguese',
+  'Spanish',
+  'Japanese',
+  'Korean',
+  'French',
+  'Russian',
+]);
+const BOUNDED_PACE_VALUES = new Set(['very_slow', 'slow', 'medium', 'medium_fast', 'fast']);
+const BOUNDED_PITCH_VALUES = new Set(['low', 'slightly_low', 'neutral', 'slightly_high', 'high']);
+const BOUNDED_ENERGY_VALUES = new Set(['soft', 'calm', 'warm', 'bright', 'intense']);
+const BOUNDED_VOLUME_VALUES = new Set(['soft', 'normal', 'projected']);
+function inferSpokenLanguage(text) {
+  const normalized = trimString(text);
+  if (!normalized) return 'English';
+  return 'English';
+}
+function normalizeSpokenSegment(segment) {
+  if (!segment || typeof segment !== 'object' || Array.isArray(segment)) return null;
+  const text = trimString(segment.text);
+  if (!text) return null;
+  const normalized = { text };
+  const pace = trimString(segment.pace);
+  const pitch = trimString(segment.pitch);
+  const energy = trimString(segment.energy);
+  const volume = trimString(segment.volume);
+  const pauseAfterMs = clampInteger(segment.pause_after_ms, 0, { min: 0, max: 1200 });
+  if (BOUNDED_PACE_VALUES.has(pace)) normalized.pace = pace;
+  if (BOUNDED_PITCH_VALUES.has(pitch)) normalized.pitch = pitch;
+  if (BOUNDED_ENERGY_VALUES.has(energy)) normalized.energy = energy;
+  if (BOUNDED_VOLUME_VALUES.has(volume)) normalized.volume = volume;
+  normalized.pause_after_ms = pauseAfterMs;
+  return normalized;
+}
+function stripEmoji(text) {
+  return text.replace(/[\uFE0E\uFE0F]/g, '').replace(/\p{Extended_Pictographic}|\p{Emoji_Presentation}/gu, '');
+}
+function normalizeSpeechText(text) {
+  return stripEmoji(text)
+    .replace(/\*\*(.*?)\*\*/g, '$1')
+    .replace(/__(.*?)__/g, '$1')
+    .replace(/`([^`]+)`/g, '$1')
+    .replace(/[\u2013\u2014]/g, ', ')
+    .replace(/\u2026/g, '...')
+    .replace(/\s+/g, ' ')
+    .replace(/\s+([,.;!?])/g, '$1')
+    .replace(/([,.;!?])(?=[^\s])/g, '$1 ')
+    .replace(/,\s*,+/g, ', ')
+    .replace(/\s+/g, ' ')
+    .trim();
+}
+function splitSpeechSegments(text) {
+  const normalized = normalizeSpeechText(text);
+  if (!normalized) return [];
+  const baseSegments = normalized
+    .split(/(?<=[.!?])\s+/)
+    .map((segment) => segment.trim())
+    .filter(Boolean);
+  const segments = [];
+  for (const segment of baseSegments) {
+    if (segment.length <= 96) {
+      segments.push(segment);
+      continue;
+    }
+    const clauseParts = segment
+      .split(/,\s+/)
+      .map((part) => part.trim())
+      .filter(Boolean);
+    if (clauseParts.length > 1) {
+      for (let index = 0; index < clauseParts.length; index += 1) {
+        const part = clauseParts[index];
+        const needsComma = index < clauseParts.length - 1 && !/[.!?]$/.test(part);
+        segments.push(needsComma ? `${part},` : part);
+      }
+      continue;
+    }
+    segments.push(segment);
+  }
+  if (segments.length <= 5) return segments;
+  return [...segments.slice(0, 4), segments.slice(4).join(' ').trim()];
+}
+function inferSegmentStyle(segmentText, index, totalSegments) {
+  const normalized = segmentText.toLowerCase();
+  const exclamatory = /!/.test(segmentText) || /\b(hell yeah|awesome|amazing|stoked|love|perfect|great)\b/.test(normalized);
+  const curious = /\?/.test(segmentText);
+  const reflective =
+    /\b(i think|i'm|i am|i've|i have|lately|right now|before this|each time|understand|it feels like)\b/.test(normalized) ||
+    segmentText.length > 60;
+  if (curious) {
+    return {
+      pace: 'medium',
+      pitch: 'slightly_high',
+      energy: 'warm',
+      volume: 'normal',
+      pause_after_ms: 0,
+    };
+  }
+  if (exclamatory) {
+    return {
+      pace: 'medium_fast',
+      pitch: 'slightly_high',
+      energy: 'bright',
+      volume: 'normal',
+      pause_after_ms: index < totalSegments - 1 ? 220 : 0,
+    };
+  }
+  if (reflective) {
+    return {
+      pace: 'medium',
+      pitch: 'neutral',
+      energy: 'warm',
+      volume: 'normal',
+      pause_after_ms: index < totalSegments - 1 ? 260 : 0,
+    };
+  }
+  return {
+    pace: 'medium',
+    pitch: 'neutral',
+    energy: 'warm',
+    volume: 'normal',
+    pause_after_ms: index < totalSegments - 1 ? 180 : 0,
+  };
+}
+function synthesizeSpokenSegments(text) {
+  const language = inferSpokenLanguage(text);
+  const rawSegments = splitSpeechSegments(text);
+  if (rawSegments.length === 0) return null;
+  const segments = rawSegments.map((segmentText, index) => ({
+    text: segmentText,
+    ...inferSegmentStyle(segmentText, index, rawSegments.length),
+  }));
+  return {
+    language,
+    segments,
+  };
+}
+function normalizeSpokenMetadata(spoken) {
+  if (!spoken || typeof spoken !== 'object' || Array.isArray(spoken)) return null;
+  const text = trimString(spoken.text);
+  if (!text) return null;
+  const normalized = { text };
+  const language = trimString(spoken.language);
+  if (BOUNDED_LANGUAGE_TYPES.has(language)) {
+    normalized.language = language;
+  }
+  const explicitSegments =
+    Array.isArray(spoken.segments)
+      ? spoken.segments.map((segment) => normalizeSpokenSegment(segment)).filter(Boolean)
+      : [];
+  if (explicitSegments.length > 0) {
+    normalized.segments = explicitSegments;
+  }
+  const instructions = trimString(spoken.instructions);
+  if (instructions) normalized.instructions = instructions;
+  if (spoken.style && typeof spoken.style === 'object' && !Array.isArray(spoken.style)) {
+    normalized.style = spoken.style;
+  }
+  const fallbackSegments = synthesizeSpokenSegments(text);
+  if (!normalized.language && fallbackSegments?.language) {
+    normalized.language = fallbackSegments.language;
+  }
+  if (!normalized.segments && fallbackSegments?.segments?.length) {
+    normalized.segments = fallbackSegments.segments;
+  }
+  return normalized;
+}
+function inferSpokenMetadataFromContent(content) {
+  const text = normalizeSpeechText(trimString(content));
+  if (!text) return null;
+  const synthesized = synthesizeSpokenSegments(text);
+  const normalized = text.toLowerCase();
+  const upbeat =
+    /!/.test(text) ||
+    /\b(hell yeah|awesome|amazing|great|stoked|love|glad|perfect|nice|cool)\b/.test(normalized);
+  const gentle =
+    /\b(sorry|gentle|softly|careful|reassuring|calm|okay|it'?s okay|i know)\b/.test(normalized);
+  const curious = /\?/.test(text);
+  if (upbeat) {
+    return {
+      text,
+      language: synthesized?.language || 'English',
+      segments: synthesized?.segments,
+      instructions: 'Speak with warm, upbeat conversational energy and natural pacing.',
+      style: { emotion: 'upbeat', energy: 'medium' },
+    };
+  }
+  if (gentle) {
+    return {
+      text,
+      language: synthesized?.language || 'English',
+      segments: synthesized?.segments,
+      instructions: 'Speak gently and reassuringly, with a calm pace and soft emphasis.',
+      style: { emotion: 'gentle', energy: 'low' },
+    };
+  }
+  if (curious) {
+    return {
+      text,
+      language: synthesized?.language || 'English',
+      segments: synthesized?.segments,
+      instructions: 'Speak naturally with curious, engaged intonation and a conversational pace.',
+      style: { emotion: 'curious', energy: 'medium' },
+    };
+  }
+  return {
+    text,
+    language: synthesized?.language || 'English',
+    segments: synthesized?.segments,
+    instructions: 'Speak naturally with light warmth and conversational pacing.',
+    style: { emotion: 'neutral', energy: 'medium' },
+  };
+}
+export {
+  inferSpokenMetadataFromContent,
+  normalizeSpokenMetadata,
+  normalizeSpeechText,
+};

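The visible content change in this file appears to be the dash and ellipsis patterns in `normalizeSpeechText`, which move from mis-encoded literal characters to `\u2013`, `\u2014`, and `\u2026` escapes. The sketch below copies the two 0.2.23 functions from the diff so it runs on its own. The sample input is an assumption chosen for illustration.

```javascript
// Copied from the 0.2.23 side of the diff so the example is self-contained.
function stripEmoji(text) {
  return text.replace(/[\uFE0E\uFE0F]/g, '').replace(/\p{Extended_Pictographic}|\p{Emoji_Presentation}/gu, '');
}

function normalizeSpeechText(text) {
  return stripEmoji(text)
    .replace(/\*\*(.*?)\*\*/g, '$1') // drop markdown bold markers
    .replace(/__(.*?)__/g, '$1')
    .replace(/`([^`]+)`/g, '$1')
    .replace(/[\u2013\u2014]/g, ', ') // en dash and em dash become a comma pause
    .replace(/\u2026/g, '...')
    .replace(/\s+/g, ' ')
    .replace(/\s+([,.;!?])/g, '$1')
    .replace(/([,.;!?])(?=[^\s])/g, '$1 ')
    .replace(/,\s*,+/g, ', ')
    .replace(/\s+/g, ' ')
    .trim();
}

// The escaped character class matches a real em dash.
console.log(/[\u2013\u2014]/.test('\u2014')); // prints true
console.log(normalizeSpeechText('Sure \u2014 **done** !')); // prints "Sure, done!"
```

Escapes keep the source file ASCII-only, so the patterns survive a re-encoding of the file that would corrupt literal dash characters.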
package/openclaw.extension.js CHANGED

@@ -1,4 +1,6 @@
-const CHANNEL_ID = 'oomi';
+import { inferSpokenMetadataFromContent, normalizeSpokenMetadata } from './lib/spokenMetadata.js';
+const CHANNEL_ID = 'oomi';
 const DEFAULT_SESSION_KEY = 'agent:main:webchat:channel:oomi';
 const DEFAULT_TIMEOUT_MS = 15000;
@@ -178,272 +180,15 @@ function extractCorrelationId(payload) {
   return '';
 }
-const BOUNDED_LANGUAGE_TYPES = new Set([
-  'Auto',
-  'Chinese',
-  'English',
-  'German',
-  'Italian',
-  'Portuguese',
-  'Spanish',
-  'Japanese',
-  'Korean',
-  'French',
-  'Russian',
-]);
-const BOUNDED_PACE_VALUES = new Set(['very_slow', 'slow', 'medium', 'medium_fast', 'fast']);
-const BOUNDED_PITCH_VALUES = new Set(['low', 'slightly_low', 'neutral', 'slightly_high', 'high']);
-const BOUNDED_ENERGY_VALUES = new Set(['soft', 'calm', 'warm', 'bright', 'intense']);
-const BOUNDED_VOLUME_VALUES = new Set(['soft', 'normal', 'projected']);
-function inferSpokenLanguage(text) {
-  const normalized = toString(text);
-  if (!normalized) return 'English';
-  return 'English';
-}
-function normalizeSpokenSegment(segment) {
-  if (!segment || typeof segment !== 'object' || Array.isArray(segment)) return null;
-  const text = toString(segment.text);
-  if (!text) return null;
-  const normalized = { text };
-  const pace = toString(segment.pace);
-  const pitch = toString(segment.pitch);
-  const energy = toString(segment.energy);
-  const volume = toString(segment.volume);
-  const pauseAfterMs = toNumber(segment.pause_after_ms, 0, { min: 0, max: 1200 });
-  if (BOUNDED_PACE_VALUES.has(pace)) normalized.pace = pace;
-  if (BOUNDED_PITCH_VALUES.has(pitch)) normalized.pitch = pitch;
-  if (BOUNDED_ENERGY_VALUES.has(energy)) normalized.energy = energy;
-  if (BOUNDED_VOLUME_VALUES.has(volume)) normalized.volume = volume;
-  normalized.pause_after_ms = pauseAfterMs;
-  return normalized;
-}
-function splitSpeechSegments(text) {
-  const normalized = normalizeSpeechText(text);
-  if (!normalized) return [];
-  const baseSegments = normalized
-    .split(/(?<=[.!?])\s+/)
-    .map((segment) => segment.trim())
-    .filter(Boolean);
-  const segments = [];
-  for (const segment of baseSegments) {
-    if (segment.length <= 96) {
-      segments.push(segment);
-      continue;
-    }
-    const clauseParts = segment
-      .split(/,\s+/)
-      .map((part) => part.trim())
-      .filter(Boolean);
-    if (clauseParts.length > 1) {
-      for (let index = 0; index < clauseParts.length; index += 1) {
-        const part = clauseParts[index];
-        const needsComma = index < clauseParts.length - 1 && !/[.!?]$/.test(part);
-        segments.push(needsComma ? `${part},` : part);
-      }
-      continue;
-    }
-    segments.push(segment);
-  }
-  if (segments.length <= 5) return segments;
-  return [...segments.slice(0, 4), segments.slice(4).join(' ').trim()];
-}
-function inferSegmentStyle(segmentText, index, totalSegments) {
-  const normalized = segmentText.toLowerCase();
-  const exclamatory = /!/.test(segmentText) || /\b(hell yeah|awesome|amazing|stoked|love|perfect|great)\b/.test(normalized);
-  const curious = /\?/.test(segmentText);
-  const reflective =
-    /\b(i think|i'm|i am|i've|i have|lately|right now|before this|each time|understand|it feels like)\b/.test(normalized) ||
-    segmentText.length > 60;
-  if (curious) {
-    return {
-      pace: 'medium',
-      pitch: 'slightly_high',
-      energy: 'warm',
-      volume: 'normal',
-      pause_after_ms: 0,
-    };
-  }
-  if (exclamatory) {
-    return {
-      pace: 'medium_fast',
-      pitch: 'slightly_high',
-      energy: 'bright',
-      volume: 'normal',
-      pause_after_ms: index < totalSegments - 1 ? 220 : 0,
-    };
-  }
-  if (reflective) {
-    return {
-      pace: 'medium',
-      pitch: 'neutral',
-      energy: 'warm',
-      volume: 'normal',
-      pause_after_ms: index < totalSegments - 1 ? 260 : 0,
-    };
-  }
-  return {
-    pace: 'medium',
-    pitch: 'neutral',
-    energy: 'warm',
-    volume: 'normal',
-    pause_after_ms: index < totalSegments - 1 ? 180 : 0,
-  };
-}
-function synthesizeSpokenSegments(text) {
-  const language = inferSpokenLanguage(text);
-  const rawSegments = splitSpeechSegments(text);
-  if (rawSegments.length === 0) return null;
-  const segments = rawSegments.map((segmentText, index) => ({
-    text: segmentText,
-    ...inferSegmentStyle(segmentText, index, rawSegments.length),
-  }));
-  return {
-    language,
-    segments,
-  };
-}
-function normalizeSpokenMetadata(spoken) {
-  if (!spoken || typeof spoken !== 'object' || Array.isArray(spoken)) return null;
-  const text = toString(spoken.text);
-  if (!text) return null;
-  const normalized = { text };
-  const language = toString(spoken.language);
-  if (BOUNDED_LANGUAGE_TYPES.has(language)) {
-    normalized.language = language;
-  }
-  const explicitSegments =
-    Array.isArray(spoken.segments)
-      ? spoken.segments.map((segment) => normalizeSpokenSegment(segment)).filter(Boolean)
-      : [];
-  if (explicitSegments.length > 0) {
-    normalized.segments = explicitSegments;
-  }
-  const instructions = toString(spoken.instructions);
-  if (instructions) normalized.instructions = instructions;
-  if (spoken.style && typeof spoken.style === 'object' && !Array.isArray(spoken.style)) {
-    normalized.style = spoken.style;
-  }
-  const fallbackSegments = synthesizeSpokenSegments(text);
-  if (!normalized.language && fallbackSegments?.language) {
-    normalized.language = fallbackSegments.language;
-  }
-  if (!normalized.segments && fallbackSegments?.segments?.length) {
-    normalized.segments = fallbackSegments.segments;
-  }
-  return normalized;
-}
-function stripEmoji(text) {
-  return text.replace(/[\uFE0E\uFE0F]/g, '').replace(/\p{Extended_Pictographic}|\p{Emoji_Presentation}/gu, '');
-}
-function normalizeSpeechText(text) {
-  return stripEmoji(text)
-    .replace(/\*\*(.*?)\*\*/g, '$1')
-    .replace(/__(.*?)__/g, '$1')
-    .replace(/`([^`]+)`/g, '$1')
-    .replace(/[–—]/g, ', ')
-    .replace(/…/g, '...')
-    .replace(/\s+/g, ' ')
-    .replace(/\s+([,.;!?])/g, '$1')
-    .replace(/([,.;!?])(?=[^\s])/g, '$1 ')
-    .replace(/,\s*,+/g, ', ')
-    .replace(/\s+/g, ' ')
-    .trim();
-}
-function inferSpokenMetadataFromContent(content) {
-  const text = normalizeSpeechText(toString(content));
-  if (!text) return null;
-  const synthesized = synthesizeSpokenSegments(text);
-  const normalized = text.toLowerCase();
-  const upbeat =
-    /!/.test(text) ||
-    /\b(hell yeah|awesome|amazing|great|stoked|love|glad|perfect|nice|cool)\b/.test(normalized);
-  const gentle =
-    /\b(sorry|gentle|softly|careful|reassuring|calm|okay|it'?s okay|i know)\b/.test(normalized);
-  const curious = /\?/.test(text);
-  if (upbeat) {
-    return {
-      text,
-      language: synthesized?.language || 'English',
-      segments: synthesized?.segments,
-      instructions: 'Speak with warm, upbeat conversational energy and natural pacing.',
-      style: { emotion: 'upbeat', energy: 'medium' },
-    };
-  }
-  if (gentle) {
-    return {
-      text,
-      language: synthesized?.language || 'English',
-      segments: synthesized?.segments,
-      instructions: 'Speak gently and reassuringly, with a calm pace and soft emphasis.',
-      style: { emotion: 'gentle', energy: 'low' },
-    };
-  }
-  if (curious) {
-    return {
-      text,
-      language: synthesized?.language || 'English',
-      segments: synthesized?.segments,
-      instructions: 'Speak naturally with curious, engaged intonation and a conversational pace.',
-      style: { emotion: 'curious', energy: 'medium' },
-    };
-  }
-  return {
-    text,
-    language: synthesized?.language || 'English',
-    segments: synthesized?.segments,
-    instructions: 'Speak naturally with light warmth and conversational pacing.',
-    style: { emotion: 'neutral', energy: 'medium' },
-  };
-}
 function normalizeOutgoingMetadata(payloadMetadata, { accountId, correlationId, content }) {
   const metadata =
     payloadMetadata && typeof payloadMetadata === 'object' && !Array.isArray(payloadMetadata)
       ? { ...payloadMetadata }
       : {};
-  const explicitSpokenPresent = Object.prototype.hasOwnProperty.call(metadata, 'spoken');
-  const spoken =
-    normalizeSpokenMetadata(metadata.spoken) ||
-    (!explicitSpokenPresent ? inferSpokenMetadataFromContent(content) : null);
+  const spoken =
+    normalizeSpokenMetadata(metadata.spoken) ||
+    inferSpokenMetadataFromContent(content);
   if (spoken) {
     metadata.spoken = spoken;
   } else {

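Reviewer note: the hunk above changes when the fallback runs. The removed lines inferred `metadata.spoken` from `content` only when the `spoken` key was absent. The added lines infer it whenever normalization returns nothing, so an explicit but unusable `spoken` value no longer suppresses the fallback. The sketch below illustrates that difference under stated assumptions: `resolveSpokenOld` and `resolveSpokenNew` are hypothetical names, and the two helpers are simplified stand-ins rather than the package's actual implementations.

```javascript
// Simplified stand-ins (assumption): the real helpers also handle language,
// segments, instructions, and style. Only `text` is modeled here.
function normalizeSpokenMetadata(spoken) {
  if (!spoken || typeof spoken !== 'object' || Array.isArray(spoken)) return null;
  const text = typeof spoken.text === 'string' ? spoken.text.trim() : '';
  return text ? { text } : null;
}

function inferSpokenMetadataFromContent(content) {
  const text = typeof content === 'string' ? content.trim() : '';
  return text ? { text } : null;
}

// Logic of the removed lines: infer only when the `spoken` key is absent.
function resolveSpokenOld(metadata, content) {
  const explicitSpokenPresent = Object.prototype.hasOwnProperty.call(metadata, 'spoken');
  return (
    normalizeSpokenMetadata(metadata.spoken) ||
    (!explicitSpokenPresent ? inferSpokenMetadataFromContent(content) : null)
  );
}

// Logic of the added lines: infer whenever normalization yields nothing.
function resolveSpokenNew(metadata, content) {
  return normalizeSpokenMetadata(metadata.spoken) || inferSpokenMetadataFromContent(content);
}

const content = 'Sounds good!';
console.log(resolveSpokenOld({ spoken: null }, content)); // null
console.log(resolveSpokenNew({ spoken: null }, content)); // { text: 'Sounds good!' }
```

When the `spoken` key is missing entirely, or carries a usable `text`, both variants return the same result. They differ only for an explicit value that fails normalization.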
package/openclaw.plugin.json CHANGED

@@ -2,7 +2,7 @@
   "id": "oomi-ai",
   "name": "Oomi Channel Plugin",
   "description": "Managed Oomi channel integration for OpenClaw.",
-  "version": "0.2.20",
+  "version": "0.2.23",
   "author": "Oomi",
   "license": "MIT",
   "openclawVersion": ">=0.5.0",

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "oomi-ai",
-  "version": "0.2.20",
+  "version": "0.2.23",
   "description": "Oomi OpenClaw channel plugin and bridge tooling",
   "bin": {
     "oomi": "bin/oomi-ai.js"

package/skills/oomi/SKILL.md CHANGED

@@ -168,16 +168,16 @@ Use this shape when a voice turn needs more natural delivery without changing vi
 }
 ```
-Rules:
-- keep visible assistant `content` clean and user-facing
-- do not place raw intonation tags in visible chat
-- for managed voice replies, include `metadata.spoken` when delivery benefits from cleaner phrasing or explicit speaking guidance
-- `metadata.spoken.text` is backend TTS input only
-- `metadata.spoken.language` should be one of the supported Qwen language values such as `English`
-- `metadata.spoken.segments` can carry bounded per-segment prosody for pace, pitch, volume, and pause timing
-- `metadata.spoken.instructions` should use natural-language speaking guidance
-- if the speech sidecar is absent, Oomi speaks the visible assistant text
-- if you omit `metadata.spoken`, the plugin synthesizes a bounded hidden fallback from visible assistant text
+Rules:
+- keep visible assistant `content` clean and user-facing
+- do not place raw intonation tags in visible chat
+- for managed cloned-voice replies, include `metadata.spoken` when backend TTS should speak the turn
+- `metadata.spoken.text` is backend TTS input only
+- `metadata.spoken.language` should be one of the supported Qwen language values such as `English`
+- `metadata.spoken.segments` can carry bounded per-segment prosody for pace, pitch, volume, and pause timing
+- `metadata.spoken.instructions` should use natural-language speaking guidance
+- if you omit `metadata.spoken`, the shared package helper may synthesize it as a compatibility guardrail before backend TTS
+- backend cloned voice is strict: if `metadata.spoken` does not reach Oomi, playback fails instead of falling back to flat speech
 ## Avatar Control