clipwright-wrap 0.1.0__tar.gz

@@ -0,0 +1,93 @@
1
+ Metadata-Version: 2.3
2
+ Name: clipwright-wrap
3
+ Version: 0.1.0
4
+ Summary: MCP tool to format subtitle file (SRT/VTT) text at phrase boundaries with line wrapping using BudouX.
5
+ Author: satoh-y-0323
6
+ Author-email: satoh-y-0323 <shoma.papa.0323@gmail.com>
7
+ License: MIT
8
+ Requires-Dist: budoux
9
+ Requires-Dist: clipwright>=0.1.0
10
+ Requires-Dist: mcp[cli]>=1.27.2
11
+ Requires-Dist: pydantic>=2
12
+ Requires-Python: >=3.11
13
+ Description-Content-Type: text/markdown
14
+
15
+ # clipwright-wrap
16
+
17
+ MCP tool that wraps subtitle text (SRT/VTT) at phrase boundaries using BudouX.
18
+
19
+ ## Overview
20
+
21
+ `clipwright-wrap` takes an SRT or VTT subtitle file as input, splits each cue's text into phrase units with BudouX, inserts line breaks to fit the specified character and line limits, and writes the subtitle file in the same format. It is a pure text-formatting tool with no FFmpeg or Whisper dependencies.
22
+
23
+ ## Input/Output
24
+
25
+ - **Input**: SRT file (`.srt`) or VTT file (`.vtt`)
26
+ - **Output**: Subtitle file in same format as input (with phrase boundary line breaks inserted)
27
+ - **Timecodes**: Unchanged (no retiming)
28
+
29
+ ## MCP Tool
30
+
31
+ `clipwright_wrap_captions`
32
+
33
+ ### Parameters
34
+
35
+ | Name | Type | Default | Description |
36
+ |------|------|---------|-------------|
37
+ | `input` | `string` | required | Input subtitle file path (`.srt` / `.vtt`) |
38
+ | `output` | `string` | required | Output subtitle file path (same extension as input) |
39
+ | `language` | `string` | `"ja"` | Phrase splitting language (`ja` / `zh-hans` / `zh-hant` / `th`) |
40
+ | `max_chars` | `int` | `16` | Max characters per line (full-width and half-width both count as 1 character). Positive integer. |
41
+ | `max_lines` | `int` | `2` | Max lines per cue. Cues that exceed the limit are recorded in warnings, not truncated. Positive integer. |
42
+
43
+ ### Character Count Specification
44
+
45
+ For `max_chars`, **every character counts as 1**, whether full-width or half-width (the line length is measured with `len()`). Full-width normalization is a future extension (requirement §8).
46
+
47
+ ## Phrase Wrapping Mechanism
48
+
49
+ 1. Each cue's text is split into phrases by BudouX (multi-line text is first joined by removing its line breaks)
50
+ 2. The phrase sequence is packed greedily: a phrase is appended to the current line while the line stays within `max_chars`; otherwise it starts a new line
51
+ 3. The formatted text (lines separated by `\n`) is written back to the cue
52
+
53
+ If a single phrase alone exceeds `max_chars`, it is placed on its own line and is never split mid-phrase.
54
+
55
+ ## Supported Languages
56
+
57
+ The following languages, for which BudouX provides phrase splitting, are supported:
58
+
59
+ | `language` Value | Language |
60
+ |---|---|
61
+ | `ja` | Japanese |
62
+ | `zh-hans` | Chinese (Simplified) |
63
+ | `zh-hant` | Chinese (Traditional) |
64
+ | `th` | Thai |
65
+
66
+ ## Dependencies
67
+
68
+ | Package | Purpose |
69
+ |---------|---------|
70
+ | `budoux` | Phrase boundary splitting (standard dependency, lightweight model bundled) |
71
+ | `clipwright` | Shared types, envelope, errors |
72
+ | `mcp[cli]` | MCP server |
73
+ | `pydantic` | Parameter validation |
74
+
75
+ **No FFmpeg / Whisper dependencies** (pure text formatting). `budoux` is a regular, non-optional dependency that is installed with the package, so e2e tests can always run without environment-variable gating.
76
+
77
+ ## Installation and Startup
78
+
79
+ ```bash
80
+ uv add clipwright-wrap
81
+ clipwright-wrap
82
+ ```
83
+
84
+ Or within a uv workspace:
85
+
86
+ ```bash
87
+ uv run --package clipwright-wrap clipwright-wrap
88
+ ```
89
+
90
+ ## Prerequisites
91
+
92
+ - Python 3.11 or later
93
+ - FFmpeg not required (text formatting only)
@@ -0,0 +1,79 @@
1
+ # clipwright-wrap
2
+
3
+ MCP tool that wraps subtitle text (SRT/VTT) at phrase boundaries using BudouX.
4
+
5
+ ## Overview
6
+
7
+ `clipwright-wrap` takes an SRT or VTT subtitle file as input, splits each cue's text into phrase units with BudouX, inserts line breaks to fit the specified character and line limits, and writes the subtitle file in the same format. It is a pure text-formatting tool with no FFmpeg or Whisper dependencies.
8
+
9
+ ## Input/Output
10
+
11
+ - **Input**: SRT file (`.srt`) or VTT file (`.vtt`)
12
+ - **Output**: Subtitle file in same format as input (with phrase boundary line breaks inserted)
13
+ - **Timecodes**: Unchanged (no retiming)
14
+
15
+ ## MCP Tool
16
+
17
+ `clipwright_wrap_captions`
18
+
19
+ ### Parameters
20
+
21
+ | Name | Type | Default | Description |
22
+ |------|------|---------|-------------|
23
+ | `input` | `string` | required | Input subtitle file path (`.srt` / `.vtt`) |
24
+ | `output` | `string` | required | Output subtitle file path (same extension as input) |
25
+ | `language` | `string` | `"ja"` | Phrase splitting language (`ja` / `zh-hans` / `zh-hant` / `th`) |
26
+ | `max_chars` | `int` | `16` | Max characters per line (full-width and half-width both count as 1 character). Positive integer. |
27
+ | `max_lines` | `int` | `2` | Max lines per cue. Cues that exceed the limit are recorded in warnings, not truncated. Positive integer. |
28
+
29
+ ### Character Count Specification
30
+
31
+ For `max_chars`, **every character counts as 1**, whether full-width or half-width (the line length is measured with `len()`). Full-width normalization is a future extension (requirement §8).
32
+
33
+ ## Phrase Wrapping Mechanism
34
+
35
+ 1. Each cue's text is split into phrases by BudouX (multi-line text is first joined by removing its line breaks)
36
+ 2. The phrase sequence is packed greedily: a phrase is appended to the current line while the line stays within `max_chars`; otherwise it starts a new line
37
+ 3. The formatted text (lines separated by `\n`) is written back to the cue
38
+
39
+ If a single phrase alone exceeds `max_chars`, it is placed on its own line and is never split mid-phrase.
40
+
41
+ ## Supported Languages
42
+
43
+ The following languages, for which BudouX provides phrase splitting, are supported:
44
+
45
+ | `language` Value | Language |
46
+ |---|---|
47
+ | `ja` | Japanese |
48
+ | `zh-hans` | Chinese (Simplified) |
49
+ | `zh-hant` | Chinese (Traditional) |
50
+ | `th` | Thai |
51
+
52
+ ## Dependencies
53
+
54
+ | Package | Purpose |
55
+ |---------|---------|
56
+ | `budoux` | Phrase boundary splitting (standard dependency, lightweight model bundled) |
57
+ | `clipwright` | Shared types, envelope, errors |
58
+ | `mcp[cli]` | MCP server |
59
+ | `pydantic` | Parameter validation |
60
+
61
+ **No FFmpeg / Whisper dependencies** (pure text formatting). `budoux` is a regular, non-optional dependency that is installed with the package, so e2e tests can always run without environment-variable gating.
62
+
63
+ ## Installation and Startup
64
+
65
+ ```bash
66
+ uv add clipwright-wrap
67
+ clipwright-wrap
68
+ ```
69
+
70
+ Or within a uv workspace:
71
+
72
+ ```bash
73
+ uv run --package clipwright-wrap clipwright-wrap
74
+ ```
75
+
76
+ ## Prerequisites
77
+
78
+ - Python 3.11 or later
79
+ - FFmpeg not required (text formatting only)
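
The README's "Phrase Wrapping Mechanism" steps can be illustrated with a small, self-contained sketch. This is not the package's code: `greedy_wrap` is a hypothetical helper name, and the phrase list is hand-written for illustration (in the real tool the phrase boundaries come from BudouX). It assumes the uniform `len()` character count described above.

```python
def greedy_wrap(phrases: list[str], max_chars: int) -> list[str]:
    """Pack phrases into lines of at most max_chars characters (len() count)."""
    lines: list[str] = []
    current = ""
    for phrase in phrases:
        if not current:
            # A phrase always starts a line, even if it is too long by itself;
            # phrases are never split in the middle.
            current = phrase
        elif len(current) + len(phrase) <= max_chars:
            # Still within the limit: append with no delimiter.
            current += phrase
        else:
            # The limit would be exceeded: close the line and start a new one.
            lines.append(current)
            current = phrase
    if current:
        lines.append(current)
    return lines


# Hypothetical phrase split (assumption); real boundaries come from BudouX.
phrases = ["今日は", "天気が", "いいので", "散歩に", "行きます。"]
max_chars, max_lines = 8, 2

wrapped = greedy_wrap(phrases, max_chars)
# wrapped == ["今日は天気が", "いいので散歩に", "行きます。"]

# The two overflow conditions: too many lines, or a line that is too wide.
too_many_lines = len(wrapped) > max_lines
too_wide = any(len(line) > max_chars for line in wrapped)
print("\n".join(wrapped), too_many_lines, too_wide)
```

With these inputs the cue wraps to three lines, so it exceeds `max_lines=2`. According to the README, such a cue is kept intact and reported in warnings rather than truncated. Joining the lines without a separator restores the original text.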
@@ -0,0 +1,86 @@
1
+ [project]
2
+ name = "clipwright-wrap"
3
+ version = "0.1.0"
4
+ description = "MCP tool to format subtitle file (SRT/VTT) text at phrase boundaries with line wrapping using BudouX."
5
+ readme = "README.md"
6
+ license = { text = "MIT" }
7
+ authors = [
8
+ { name = "satoh-y-0323", email = "shoma.papa.0323@gmail.com" }
9
+ ]
10
+ requires-python = ">=3.11"
11
+ dependencies = [
12
+ "budoux",
13
+ "clipwright>=0.1.0",
14
+ "mcp[cli]>=1.27.2",
15
+ "pydantic>=2",
16
+ ]
17
+
18
+ [project.scripts]
19
+ clipwright-wrap = "clipwright_wrap.server:main"
20
+
21
+ [build-system]
22
+ requires = ["uv_build>=0.11.19,<0.12.0"]
23
+ build-backend = "uv_build"
24
+
25
+ [dependency-groups]
26
+ dev = [
27
+ "clipwright-transcribe",
28
+ "mypy>=2.1.0",
29
+ "pytest>=9.0.3",
30
+ "pytest-cov>=7.1.0",
31
+ "pytest-mock>=3.15.1",
32
+ "ruff>=0.15.16",
33
+ ]
34
+
35
+ # Resolve clipwright (core) and clipwright-transcribe within workspace by path reference
36
+ [tool.uv.sources]
37
+ clipwright = { workspace = true }
38
+ clipwright-transcribe = { workspace = true }
39
+
40
+ # --- Ruff ---
41
+ [tool.ruff]
42
+ target-version = "py311"
43
+ line-length = 88
44
+
45
+ [tool.ruff.lint]
46
+ select = ["E", "F", "W", "I", "UP", "B", "C4", "SIM"]
47
+ ignore = []
48
+
49
+ [tool.ruff.lint.per-file-ignores]
50
+ # Allow E501 for English docstrings/comments in test files
51
+ "tests/*.py" = ["E501"]
52
+
53
+ [tool.ruff.format]
54
+ # The default ruff formatter settings are used as-is
55
+
56
+ # --- mypy ---
57
+ [tool.mypy]
58
+ python_version = "3.11"
59
+ strict = true
60
+ warn_return_any = true
61
+ warn_unused_configs = true
62
+ disallow_untyped_defs = true
63
+ disallow_any_generics = true
64
+
65
+ # budoux ships no type stubs, so missing imports are ignored under mypy strict
66
+ [[tool.mypy.overrides]]
67
+ module = "budoux.*"
68
+ ignore_missing_imports = true
69
+
70
+ # --- pytest ---
71
+ [tool.pytest.ini_options]
72
+ testpaths = ["tests"]
73
+ addopts = "--strict-markers -q"
74
+ markers = [
75
+ "integration: integration test requiring actual ffmpeg/ffprobe binaries",
76
+ "slow: test with long execution time",
77
+ ]
78
+
79
+ # --- coverage ---
80
+ [tool.coverage.run]
81
+ source = ["clipwright_wrap"]
82
+ omit = ["tests/*"]
83
+
84
+ [tool.coverage.report]
85
+ show_missing = true
86
+ skip_covered = false
@@ -0,0 +1 @@
1
+ __version__ = "0.1.0"
@@ -0,0 +1,383 @@
1
+ """captions.py — clipwright-wrap pure-logic layer.
2
+
3
+ Handles SRT/VTT parsing, greedy line-filling of phrase-boundary token sequences
4
+ with max_chars, SRT/VTT re-serialisation, and overflow detection.
5
+ Pure functions with no budoux import (contract coverage target: ~100%).
6
+
7
+ Design decisions:
8
+ - Timecode strings are preserved as-is without float conversion (WR-AD-06).
9
+ - SRT/VTT byte structure conforms to the WR-AD-12 specification.
10
+ - No delimiter is inserted when joining phrase-boundary tokens (WR-AD-14).
11
+ - Overflow detection covers both line-count excess (a) and line-width excess (b)
12
+ (WR-AD-15(1)).
13
+ """
14
+
15
+ from __future__ import annotations
16
+
17
+ import re
18
+ from dataclasses import dataclass
19
+ from typing import TypedDict
20
+
21
+ from clipwright.errors import ClipwrightError, ErrorCode
22
+
23
+ # Regex matching a VTT timeline line: "HH:MM:SS.mmm --> HH:MM:SS.mmm [settings]"
24
+ _VTT_TIMELINE_RE = re.compile(
25
+ r"^(\d{2}:\d{2}:\d{2}\.\d{3})\s+-->\s+(\d{2}:\d{2}:\d{2}\.\d{3})(.*)"
26
+ )
27
+
28
+ # Regex matching an SRT timeline line: "HH:MM:SS,mmm --> HH:MM:SS,mmm"
29
+ # Fixed-width HH:MM:SS,mmm digits; conforms to WR-AD-12
30
+ # (transcribe to_srt guarantees fixed width)
31
+ _SRT_TIMELINE_RE = re.compile(
32
+ r"^(\d{2}:\d{2}:\d{2},\d{3})\s+-->\s+(\d{2}:\d{2}:\d{2},\d{3})\s*$"
33
+ )
34
+
35
+ # Detection of VTT inline tags (<c>, <b>, <i>, <v>, <ruby>, etc.)
36
+ # The [^>]{0,200} upper bound mitigates ReDoS (CWE-1333)
37
+ _VTT_INLINE_TAG_RE = re.compile(r"<[a-zA-Z/][^>]{0,200}>")
38
+
39
+
40
+ @dataclass
41
+ class Cue:
42
+ """Normalised representation of a single subtitle cue.
43
+
44
+ index is the sequence number (1-based).
45
+ start / end are timecode strings (not converted to float; WR-AD-06).
46
+ text is the cue body text (line breaks represented as '\\n').
47
+ VTT cue settings are appended to the end field
48
+ (e.g. "00:00:01.000 line:90% position:50%").
49
+ """
50
+
51
+ index: int
52
+ start: str
53
+ end: str
54
+ text: str
55
+
56
+
57
+ class _OverflowResult(TypedDict):
58
+ """Return type of check_overflow."""
59
+
60
+ line_count_overflow: bool
61
+ line_width_overflow: bool
62
+
63
+
64
+ def _parse_srt(text: str) -> list[Cue]:
65
+ """Convert an SRT text string into a list of Cues.
66
+
67
+ Conforms to the WR-AD-12(1)(2) byte-structure specification:
68
+ - Blank-line delimited (robust to consecutive / trailing blank lines)
69
+ - Does not miss the last cue when the trailing cue has no blank line
70
+ (single newline EOF)
71
+ - 0 entries (empty string or newlines only) → []
72
+ - Multi-line text within a cue is joined without a delimiter
73
+ (no space inserted; WR-AD-14)
74
+ - Invalid timecode line → raises ValueError
75
+ (caller wrap.py converts this to INVALID_INPUT)
76
+ """
77
+ if not text.strip():
78
+ return []
79
+
80
+ # Normalise consecutive blank lines to a single blank line before splitting
81
+ normalized = re.sub(r"\n{2,}", "\n\n", text.strip())
82
+ blocks = normalized.split("\n\n")
83
+
84
+ cues: list[Cue] = []
85
+ for block in blocks:
86
+ lines = block.strip().splitlines()
87
+ if not lines: # pragma: no cover
88
+ # Unreachable: normalisation never produces an empty block (defensive guard)
89
+ continue
90
+
91
+ # Line 1: index number
92
+ try:
93
+ index = int(lines[0].strip())
94
+ except ValueError:
95
+ # Block does not start with an index line; skip (empty block, etc.)
96
+ continue
97
+
98
+ if len(lines) < 2:
99
+ continue
100
+
101
+ # Line 2: timeline line
102
+ timeline_line = lines[1].strip()
103
+ m = _SRT_TIMELINE_RE.match(timeline_line)
104
+ if m is None:
105
+ # Invalid timecode line: raise ValueError (test contract WR-AD-09)
106
+ raise ValueError(
107
+ f"Invalid SRT timecode line: {timeline_line!r}"
108
+ f" (expected format: 'HH:MM:SS,mmm --> HH:MM:SS,mmm')"
109
+ )
110
+
111
+ start = m.group(1)
112
+ end = m.group(2)
113
+
114
+ # Value range check: MM/SS must be within 0–59 (WR-AD-12 / SRT spec)
115
+ for tc in (start, end):
116
+ mm, ss = int(tc[3:5]), int(tc[6:8])
117
+ if mm > 59 or ss > 59:
118
+ raise ValueError(
119
+ f"SRT timecode value out of range: {tc!r}"
120
+ f" (minutes and seconds must be within 0–59)"
121
+ )
122
+
123
+ # Line 3 onwards: text (multiple lines joined without delimiter; no space)
124
+ text_lines = lines[2:] if len(lines) > 2 else []
125
+ joined_text = "".join(text_lines)
126
+
127
+ cues.append(Cue(index=index, start=start, end=end, text=joined_text))
128
+
129
+ return cues
130
+
131
+
132
+ def _parse_vtt(text: str) -> list[Cue]:
133
+ """Convert a VTT text string into a list of Cues.
134
+
135
+ Conforms to the WR-AD-12(1)(2)(3) byte-structure specification
136
+ and all 5 VTT edge-case behaviours:
137
+ - Skip the blank line immediately after the WEBVTT header
138
+ - 0 entries ("WEBVTT\\n" only) → []
139
+ - NOTE/STYLE blocks: preserved as-is (not treated as cues)
140
+ - cue id line: preserved; only the text lines are formatting targets
141
+ - cue settings (trailing part of the timeline line): appended to the end field
142
+ - cues containing inline tags: text preserved as-is (tags included, single line)
143
+ - Multi-line text within a cue is joined without a delimiter (WR-AD-14)
144
+ """
145
+ lines = text.splitlines()
146
+
147
+ # Verify and skip the WEBVTT header
148
+ if not lines or not lines[0].startswith("WEBVTT"):
149
+ return []
150
+
151
+ # Process lines after the header
152
+ pos = 1
153
+ total = len(lines)
154
+
155
+ # Skip blank lines immediately after the header
156
+ while pos < total and lines[pos].strip() == "":
157
+ pos += 1
158
+
159
+ cues: list[Cue] = []
160
+ cue_index = 1
161
+
162
+ while pos < total:
163
+ # Skip blank lines (cue separator)
164
+ if lines[pos].strip() == "":
165
+ pos += 1
166
+ continue
167
+
168
+ # NOTE block: skip until the next blank line or EOF
169
+ if lines[pos].startswith("NOTE"):
170
+ pos += 1
171
+ while pos < total and lines[pos].strip() != "":
172
+ pos += 1
173
+ continue
174
+
175
+ # STYLE block: skip until the next blank line or EOF
176
+ if lines[pos].startswith("STYLE"):
177
+ pos += 1
178
+ while pos < total and lines[pos].strip() != "":
179
+ pos += 1
180
+ continue
181
+
182
+ # Check for a cue id line (non-empty line that is not a timeline line)
183
+ if not _VTT_TIMELINE_RE.match(lines[pos]):
184
+ # cue id line: identifier before the timeline — skip (preserved implicitly)
185
+ pos += 1
186
+ if pos >= total:
187
+ break
188
+
189
+ # Timeline line
190
+ if pos >= total or lines[pos].strip() == "":
191
+ pos += 1
192
+ continue
193
+
194
+ m = _VTT_TIMELINE_RE.match(lines[pos])
195
+ if m is None: # pragma: no cover
196
+ # Unreachable for well-formed VTT input (fallback defensive guard)
197
+ pos += 1
198
+ continue
199
+
200
+ start = m.group(1)
201
+ # Append settings to end field for preservation (WR-AD-12(3)(d))
202
+ end_raw = m.group(2)
203
+ settings = m.group(3).strip()
204
+ end = f"{end_raw} {settings}" if settings else end_raw
205
+
206
+ pos += 1
207
+
208
+ # Collect text lines until the next blank line or EOF
209
+ text_lines: list[str] = []
210
+ while pos < total and lines[pos].strip() != "":
211
+ text_lines.append(lines[pos])
212
+ pos += 1
213
+
214
+ # Join text without a delimiter (no space inserted; WR-AD-14)
215
+ joined_text = "".join(text_lines)
216
+
217
+ cues.append(Cue(index=cue_index, start=start, end=end, text=joined_text))
218
+ cue_index += 1
219
+
220
+ return cues
221
+
222
+
223
+ def parse_captions(text: str, fmt: str) -> list[Cue]:
224
+ """Convert an SRT or VTT text string into a list of Cues.
225
+
226
+ fmt must be "srt" or "vtt".
227
+ Timecode strings are preserved as-is (WR-AD-06).
228
+ Multi-line text within a cue is joined without a delimiter (WR-AD-14).
229
+ An invalid timecode line causes _parse_srt to raise ValueError,
230
+ which wrap.py converts to ClipwrightError(INVALID_INPUT) (WR-AD-09).
231
+
232
+ Args:
233
+ text: SRT or VTT format string.
234
+ fmt: "srt" or "vtt".
235
+
236
+ Returns:
237
+ List of Cues. Returns an empty list when there are 0 entries.
238
+ """
239
+ if fmt == "srt":
240
+ return _parse_srt(text)
241
+ elif fmt == "vtt":
242
+ return _parse_vtt(text)
243
+ else:
244
+ raise ClipwrightError(
245
+ code=ErrorCode.INVALID_INPUT,
246
+ message=f"Unsupported subtitle format: {fmt!r}",
247
+ hint="Specify 'srt' or 'vtt' for fmt.",
248
+ )
249
+
250
+
251
+ def wrap_cue_lines(segments: list[str], max_chars: int) -> list[str]:
252
+ """Return lines formed by greedily packing phrase-boundary tokens up to max_chars.
253
+
254
+ Conforms to WR-AD-04/WR-AD-14:
255
+ - Segments are appended to a line; a line break is inserted just before the
256
+ limit is exceeded (greedy fill).
257
+ - If a single segment exceeds max_chars on its own, it is placed on its own
258
+ line without splitting.
259
+ - No delimiter is inserted between segments
260
+ (WR-AD-14(i); joining lines restores the original text).
261
+ - '\\n' is not included in len() of each line (WR-AD-14(ii)).
262
+ - Full-width and half-width characters are each counted as 1
263
+ (WR-AD-14(iii); uniform len() check).
264
+
265
+ Args:
266
+ segments: List of phrase-boundary tokens.
267
+ max_chars: Maximum number of characters per line (gt=0).
268
+
269
+ Returns:
270
+ List of lines (no '\\n' within any line). Returns [] for empty segments.
271
+ """
272
+ if not segments:
273
+ return []
274
+
275
+ lines: list[str] = []
276
+ current_line = ""
277
+
278
+ for seg in segments:
279
+ if not current_line:
280
+ # Start of a line: place segment even if exceeds max_chars (no splitting)
281
+ current_line = seg
282
+ elif len(current_line) + len(seg) <= max_chars:
283
+ # Adding the segment stays within max_chars → append to the same line
284
+ current_line += seg
285
+ else:
286
+ # Would exceed the limit → insert a line break
287
+ lines.append(current_line)
288
+ current_line = seg
289
+
290
+ if current_line:
291
+ lines.append(current_line)
292
+
293
+ return lines
294
+
295
+
296
+ def _serialize_srt(cues: list[Cue]) -> str:
297
+ """Convert a list of Cues into an SRT string.
298
+
299
+ Byte-structure specification (WR-AD-12(1)):
300
+ - Each block = "index\\nstart --> end\\ntext\\n"
301
+ - One blank line between cues (trailing \\n of the block + the join \\n)
302
+ - Single newline after the last cue (no trailing blank line)
303
+ - 0 entries → ""
304
+ """
305
+ if not cues:
306
+ return ""
307
+
308
+ blocks: list[str] = []
309
+ for cue in cues:
310
+ blocks.append(f"{cue.index}\n{cue.start} --> {cue.end}\n{cue.text}\n")
311
+
312
+ return "\n".join(blocks)
313
+
314
+
315
+ def _serialize_vtt(cues: list[Cue]) -> str:
316
+ """Convert a list of Cues into a VTT string.
317
+
318
+ Byte-structure specification (WR-AD-12(1)):
319
+ - "WEBVTT\\n" + "\\n" + cue1 + "\\n" + cue2 + ...
320
+ - Each cue block = "start --> end\\ntext\\n"
321
+ - One blank line between cues; single newline after the last cue
322
+ (no trailing blank line)
323
+ - 0 entries → "WEBVTT\\n"
324
+ """
325
+ if not cues:
326
+ return "WEBVTT\n"
327
+
328
+ blocks: list[str] = ["WEBVTT\n"]
329
+ for cue in cues:
330
+ blocks.append(f"{cue.start} --> {cue.end}\n{cue.text}\n")
331
+
332
+ return "\n".join(blocks)
333
+
334
+
335
+ def serialize_captions(cues: list[Cue], fmt: str) -> str:
336
+ """Convert a list of Cues into an SRT or VTT string.
337
+
338
+ Timecode strings are written back unchanged (WR-AD-06).
339
+ For 0 entries: SRT → "" / VTT → "WEBVTT\\n" (round-trip identity; WR-AD-12(2)).
340
+
341
+ Args:
342
+ cues: List of Cues.
343
+ fmt: "srt" or "vtt".
344
+
345
+ Returns:
346
+ SRT or VTT format string.
347
+ """
348
+ if fmt == "srt":
349
+ return _serialize_srt(cues)
350
+ elif fmt == "vtt":
351
+ return _serialize_vtt(cues)
352
+ else:
353
+ raise ClipwrightError(
354
+ code=ErrorCode.INVALID_INPUT,
355
+ message=f"Unsupported subtitle format: {fmt!r}",
356
+ hint="Specify 'srt' or 'vtt' for fmt.",
357
+ )
358
+
359
+
360
+ def check_overflow(lines: list[str], max_chars: int, max_lines: int) -> _OverflowResult:
361
+ """Detect overflow (line-count excess and line-width excess) in a list of lines.
362
+
363
+ Overflow detection specification (WR-AD-15(1)):
364
+ - (a) len(lines) > max_lines → line_count_overflow: True
365
+ - (b) any line's len() > max_chars → line_width_overflow: True
366
+ A single oversized segment (1 line, width excess) is also covered by (b).
367
+ lines is not modified (avoids information loss).
368
+
369
+ Args:
370
+ lines: List of lines to inspect (each line must not contain '\\n').
371
+ max_chars: Maximum number of characters per line.
372
+ max_lines: Maximum number of lines.
373
+
374
+ Returns:
375
+ Dict with line_count_overflow and line_width_overflow keys.
376
+ """
377
+ line_count_overflow = len(lines) > max_lines
378
+ line_width_overflow = any(len(line) > max_chars for line in lines)
379
+
380
+ return {
381
+ "line_count_overflow": line_count_overflow,
382
+ "line_width_overflow": line_width_overflow,
383
+ }
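
The timeline handling in `captions.py` can be tried in isolation. The sketch below copies the two regular expressions from the module and wraps them in two hypothetical helpers (`split_srt_timeline` and `split_vtt_timeline` are names introduced here, not part of the package). It shows that timecodes stay strings, that SRT minutes and seconds are range-checked, and that VTT cue settings are carried along in the `end` value.

```python
import re

# Patterns copied from captions.py.
SRT_TIMELINE = re.compile(
    r"^(\d{2}:\d{2}:\d{2},\d{3})\s+-->\s+(\d{2}:\d{2}:\d{2},\d{3})\s*$"
)
VTT_TIMELINE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d{3})\s+-->\s+(\d{2}:\d{2}:\d{2}\.\d{3})(.*)"
)


def split_srt_timeline(line: str) -> tuple[str, str]:
    """Return (start, end) as strings, or raise ValueError for a bad line."""
    m = SRT_TIMELINE.match(line.strip())
    if m is None:
        raise ValueError(f"Invalid SRT timecode line: {line!r}")
    start, end = m.group(1), m.group(2)
    for tc in (start, end):
        # Fixed-width "HH:MM:SS,mmm": minutes at [3:5], seconds at [6:8].
        minutes, seconds = int(tc[3:5]), int(tc[6:8])
        if minutes > 59 or seconds > 59:
            raise ValueError(f"SRT timecode value out of range: {tc!r}")
    return start, end


def split_vtt_timeline(line: str) -> tuple[str, str]:
    """Return (start, end); trailing cue settings are appended to end."""
    m = VTT_TIMELINE.match(line)
    if m is None:
        raise ValueError(f"Invalid VTT timecode line: {line!r}")
    settings = m.group(3).strip()
    end = f"{m.group(2)} {settings}" if settings else m.group(2)
    return m.group(1), end


srt = split_srt_timeline("00:00:01,000 --> 00:00:02,500")
vtt = split_vtt_timeline("00:00:01.000 --> 00:00:02.500 line:90% position:50%")
print(srt)
print(vtt)
```

Because the settings ride along inside `end`, a serializer that writes `start --> end` reproduces the original timeline line without needing a separate settings field.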
File without changes
@@ -0,0 +1,71 @@
1
+ """schemas.py — clipwright-wrap-specific Pydantic schemas.
2
+
3
+ Common types (MediaRef / Artifact / ToolResult, etc.) are defined
4
+ centrally in clipwright.schemas; this module does not redefine them.
5
+ """
6
+
7
+ from __future__ import annotations
8
+
9
+ from typing import Annotated
10
+
11
+ from pydantic import BaseModel, Field
12
+
13
+
14
+ class WrapCaptionsOptions(BaseModel):
15
+ """Options for clipwright_wrap_captions (WR-AD-05).
16
+
17
+ language selects the budoux parser.
18
+ All 4 languages confirmed loadable in spike-budoux (DC-AM-005).
19
+ max_chars is the maximum number of characters per line
20
+ (each character counts as 1; len() check).
21
+ max_lines is the maximum number of lines per cue
22
+ (excess is subject to warnings; WR-AD-15(1)).
23
+ """
24
+
25
+ language: Annotated[
26
+ str,
27
+ Field(
28
+ default="ja",
29
+ max_length=7,
30
+ pattern=r"^(ja|zh-hans|zh-hant|th)$",
31
+ description=(
32
+ "Language code to select the budoux phrase-boundary parser. "
33
+ "Supported languages: ja / zh-hans / zh-hant / th. "
34
+ "All 4 languages confirmed in spike-budoux (DC-AM-005). "
35
+ "Unsupported values are rejected with INVALID_INPUT."
36
+ ),
37
+ ),
38
+ ] = "ja"
39
+
40
+ max_chars: Annotated[
41
+ int,
42
+ Field(
43
+ default=16,
44
+ gt=0,
45
+ description=(
46
+ "Maximum number of characters per line"
47
+ " (each character counts as 1; len() check). "
48
+ "Default is ~16 full-width characters, following"
49
+ " Japanese subtitle conventions (WR-AD-05). "
50
+ "A line break is inserted just before the limit is exceeded"
51
+ " (greedy fill; WR-AD-04). "
52
+ "If a single phrase segment exceeds the limit on its own,"
53
+ " it is placed on one line without splitting. "
54
+ "gt=0 constraint: 0 or below is rejected with INVALID_INPUT."
55
+ ),
56
+ ),
57
+ ] = 16
58
+
59
+ max_lines: Annotated[
60
+ int,
61
+ Field(
62
+ default=2,
63
+ gt=0,
64
+ description=(
65
+ "Maximum number of lines per cue. Excess is recorded in warnings. "
66
+ "The original text is preserved without truncation"
67
+ " (WR-AD-15(1); requirement §2). "
68
+ "gt=0 constraint: 0 or below is rejected with INVALID_INPUT."
69
+ ),
70
+ ),
71
+ ] = 2
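
The constraints declared in `WrapCaptionsOptions` (a fixed set of language codes and strictly positive limits) can be approximated with the standard library alone. The sketch below is an assumption-labelled stand-in, not the package's Pydantic model: `WrapOptionsSketch` is a hypothetical name, and it raises a plain `ValueError` where the real tool reports `INVALID_INPUT`.

```python
import re
from dataclasses import dataclass

# Same alternatives as the Field(pattern=...) in schemas.py.
_LANGUAGES = re.compile(r"ja|zh-hans|zh-hant|th")


@dataclass(frozen=True)
class WrapOptionsSketch:
    """Stdlib approximation of the option constraints (hypothetical)."""

    language: str = "ja"
    max_chars: int = 16
    max_lines: int = 2

    def __post_init__(self) -> None:
        # fullmatch: the whole string must be one of the supported codes.
        if _LANGUAGES.fullmatch(self.language) is None:
            raise ValueError(f"unsupported language: {self.language!r}")
        # gt=0 in the schema: zero and negative values are rejected.
        if self.max_chars <= 0:
            raise ValueError("max_chars must be a positive integer")
        if self.max_lines <= 0:
            raise ValueError("max_lines must be a positive integer")


defaults = WrapOptionsSketch()
thai = WrapOptionsSketch(language="th", max_chars=20)
print(defaults, thai)
```

Validating at construction time means the wrapping logic never has to handle a zero-width line or an unknown language code.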
@@ -0,0 +1,90 @@
1
+ """server.py — clipwright-wrap MCP server + CLI entry point.
2
+
3
+ A thin wrapper that delegates business logic to wrap.py.
4
+ ClipwrightError conversion and language validation are handled by wrap.py / schemas.py,
5
+ so this module does not perform double conversion (DC-GP-001).
6
+
7
+ Transport defaults to stdio (mcp.run(transport="stdio")).
8
+ """
9
+
10
+ from __future__ import annotations
11
+
12
+ from typing import Annotated, Any
13
+
14
+ from mcp.server.fastmcp import FastMCP
15
+ from mcp.types import ToolAnnotations
16
+ from pydantic import Field
17
+
18
+ from clipwright_wrap.schemas import WrapCaptionsOptions
19
+ from clipwright_wrap.wrap import wrap_captions
20
+
21
+ # FastMCP instance (server name)
22
+ mcp = FastMCP("clipwright-wrap")
23
+
24
+
25
+ # ===========================================================================
26
+ # clipwright_wrap_captions MCP tool
27
+ # ===========================================================================
28
+
29
+
30
+ @mcp.tool(
31
+ annotations=ToolAnnotations(
32
+ readOnlyHint=True,
33
+ destructiveHint=False,
34
+ idempotentHint=True,
35
+ openWorldHint=False,
36
+ )
37
+ )
38
+ def clipwright_wrap_captions(
39
+ input: Annotated[
40
+ str,
41
+ Field(description="Input subtitle file path (.srt or .vtt)."),
42
+ ],
43
+ output: Annotated[
44
+ str,
45
+ Field(description="Output subtitle file path (same extension as input)."),
46
+ ],
47
+ options: Annotated[
48
+ WrapCaptionsOptions | None,
49
+ Field(
50
+ description=(
51
+ "Phrase-boundary line-break options"
52
+ " (language / max_chars / max_lines). "
53
+ "When omitted, all defaults are used"
54
+ " (language='ja' / max_chars=16 / max_lines=2)."
55
+ )
56
+ ),
57
+ ] = None,
58
+ ) -> dict[str, Any]:
59
+ """MCP tool: insert phrase-boundary line breaks into a subtitle file.
60
+
61
+ The input subtitle file is never modified (non-destructive; readOnly).
62
+ The output is the path of the newly generated SRT/VTT, returned in artifacts.
63
+
64
+ Business logic is delegated to wrap.wrap_captions.
65
+ When options is None, the default WrapCaptionsOptions() is used.
66
+ """
67
+ resolved_options = options if options is not None else WrapCaptionsOptions()
68
+ return wrap_captions(
69
+ input=input,
70
+ output=output,
71
+ options=resolved_options,
72
+ )
73
+
74
+
75
+ # ===========================================================================
76
+ # Entry point (MCP stdio launch)
77
+ # ===========================================================================
78
+
79
+
80
+ def main() -> None:
81
+ """CLI entry point. Launches the MCP server over stdio.
82
+
83
+ Registered in pyproject.toml [project.scripts] as
84
+ clipwright-wrap = "clipwright_wrap.server:main".
85
+ """
86
+ mcp.run(transport="stdio")
87
+
88
+
89
+ if __name__ == "__main__": # pragma: no cover
90
+ main()
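
Note that the `clipwright_wrap_captions` signature above takes `language`, `max_chars` and `max_lines` inside a nested `options` object, not as top-level parameters. A minimal sketch of the arguments a client might send follows. The file paths are hypothetical, and the exact request envelope depends on the MCP client, so only the argument shape is shown.

```python
import json

# Hypothetical paths (assumption); the keys mirror the tool's signature.
arguments = {
    "input": "captions/episode01.srt",
    # Same extension as the input, and a different path from it.
    "output": "captions/episode01.wrapped.srt",
    # Optional: omit "options" entirely to use the defaults shown here.
    "options": {"language": "ja", "max_chars": 16, "max_lines": 2},
}

payload = json.dumps(arguments, ensure_ascii=False)
print(payload)
```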
@@ -0,0 +1,317 @@
1
+ """wrap.py — clipwright-wrap orchestration layer.
2
+
3
+ Output validation → input existence check → subtitle parsing →
4
+ wrap_cli launch (phrase segmentation) →
5
+ greedy line-filling and re-serialisation via captions → output write → envelope return.
6
+
7
+ Design decisions:
8
+ - wrap_cli is launched as sys.executable -m clipwright_wrap.wrap_cli (WR-AD-01).
9
+ - wrap_cli error detection is based on the "error" key in stdout JSON (DC-AS-007).
10
+ - subprocess failure/timeout uses the sanitised message in _SUBPROCESS_SAFE_MESSAGE.
11
+ - FILE_NOT_FOUND message contains only the basename (no full path exposure; WR-AD-09).
12
+ - Overflow detection covers both line-count excess (a) and line-width excess (b)
13
+ (WR-AD-15(1)).
14
+ - Warnings use a single aggregated sentence + index arrays in data
15
+ (WR-AD-13(2); DC-AM-002).
16
+ - artifacts are dicts (Artifact model not instantiated; DC-AS-005).
17
+ - OTIO is neither generated nor used (WR-AD-06).
18
+ """
19
+
20
+ from __future__ import annotations
21
+
22
+ import json
23
+ import math
24
+ import subprocess
25
+ import sys
26
+ from pathlib import Path
27
+ from typing import Any
28
+
29
+ from clipwright.envelope import error_result, ok_result
30
+ from clipwright.errors import ClipwrightError, ErrorCode
31
+
32
+ from clipwright_wrap.captions import (
33
+ check_overflow,
34
+ parse_captions,
35
+ serialize_captions,
36
+ wrap_cue_lines,
37
+ )
38
+ from clipwright_wrap.schemas import WrapCaptionsOptions
39
+
40
+ # Sanitised message for subprocess failure/timeout (prevents stderr path leakage)
41
+ _SUBPROCESS_SAFE_MESSAGE = "internal subprocess failed"
42
+
43
+ # Timeout coefficient proportional to cue count (WR-AD-11/WR-AD-15(2))
44
+ _TIMEOUT_COEFFICIENT = 0.05
45
+ _TIMEOUT_MIN = 30
46
+
47
+
48
+ def _compute_timeout(cue_count: int) -> float:
49
+ """Calculate the cue-count-proportional timeout.
50
+
51
+ Returns max(30, ceil(cue_count * 0.05)).
52
+ """
53
+ return float(max(_TIMEOUT_MIN, math.ceil(cue_count * _TIMEOUT_COEFFICIENT)))
54
+
55
+
56
+ def wrap_captions(
57
+ input: str,
58
+ output: str,
59
+ options: WrapCaptionsOptions,
60
+ ) -> dict[str, Any]:
61
+ """Insert phrase-boundary line breaks into a subtitle file (WR-AD-04).
62
+
63
+ Non-destructive: the input subtitle file is never modified.
64
+ The output is the path of the newly generated SRT/VTT, returned in artifacts.
65
+
66
+ Args:
67
+ input: Input subtitle file path (.srt or .vtt).
68
+ output: Output subtitle file path (same extension as input).
69
+ options: WrapCaptionsOptions (language/max_chars/max_lines).
70
+
71
+ Returns:
72
+ Envelope dict as ok_result or error_result.
73
+ """
74
+ try:
75
+ return _wrap_inner(input, output, options)
76
+ except ClipwrightError as exc:
77
+ return error_result(exc.code, exc.message, exc.hint)
78
+
79
+
80
+ def _wrap_inner(
81
+ input: str,
82
+ output: str,
83
+ options: WrapCaptionsOptions,
84
+ ) -> dict[str, Any]:
85
+ """Internal implementation of wrap_captions. Raises ClipwrightError directly."""
86
+ input_path = Path(input)
87
+ output_path = Path(output)
88
+
89
+ # --- 1. Output validation (WR-AD-07/08) ---
90
+
91
+ # Verify that extensions are srt/vtt
92
+ input_ext = input_path.suffix.lower()
93
+ output_ext = output_path.suffix.lower()
94
+
95
+ if input_ext not in (".srt", ".vtt"):
96
+ raise ClipwrightError(
97
+ code=ErrorCode.INVALID_INPUT,
98
+ message=f"Unsupported subtitle format: {input_ext!r}",
99
+ hint="Set the input file extension to .srt or .vtt.",
100
+ )
101
+
102
+ if output_ext not in (".srt", ".vtt"):
103
+ raise ClipwrightError(
104
+ code=ErrorCode.INVALID_INPUT,
105
+ message=f"Unsupported output extension: {output_ext!r}",
106
+ hint="Set the output file extension to .srt or .vtt.",
107
+ )
108
+
109
+ # Verify extensions match (SRT↔VTT cross-conversion is out of scope)
110
+ if input_ext != output_ext:
111
+ raise ClipwrightError(
112
+ code=ErrorCode.INVALID_INPUT,
113
+ message=(
114
+ f"Input and output extensions do not match"
115
+ f" (input: {input_ext!r} / output: {output_ext!r})."
116
+ ),
117
+ hint="Specify an output path with the same extension as the input.",
118
+ )
119
+
120
+ # Verify that the output parent directory exists
121
+ if not output_path.parent.exists():
122
+ raise ClipwrightError(
123
+ code=ErrorCode.INVALID_INPUT,
124
+ message="Output directory does not exist.",
125
+ hint="Create the output directory first, then run again.",
126
+ )
127
+
128
+ # Prohibit output == input
129
+ try:
130
+ if output_path.resolve() == input_path.resolve():
131
+ raise ClipwrightError(
132
+ code=ErrorCode.INVALID_INPUT,
133
+ message="Output path and input path are the same.",
134
+ hint="Change the output file path to a path different from the input.",
135
+ )
136
+ except OSError: # pragma: no cover
137
+ if str(output_path) == str(input_path):
138
+ raise ClipwrightError(
139
+ code=ErrorCode.INVALID_INPUT,
140
+ message="Output path and input path are the same.",
141
+ hint="Change the output file path to a path different from the input.",
142
+ ) from None
143
+
144
+ # --- 2. Input existence check (WR-AD-09; FILE_NOT_FOUND uses basename only) ---
145
+
146
+ if not input_path.exists():
147
+ raise ClipwrightError(
148
+ code=ErrorCode.FILE_NOT_FOUND,
149
+ message=f"File not found: {input_path.name}",
150
+ hint="Check that the input file path is correct.",
151
+ )
152
+
153
+ # --- 3. Read input ---
154
+
155
+ raw_text = input_path.read_text(encoding="utf-8")
156
+ fmt = input_ext.lstrip(".") # "srt" or "vtt"
157
+
158
+ # --- 4. captions.parse_captions (invalid timecode → INVALID_INPUT + hint) ---
159
+
160
+ try:
161
+ cues = parse_captions(raw_text, fmt)
162
+ except ValueError:
163
+ # Convert ValueError to INVALID_INPUT with a fixed message rather than str(exc), so parser internals are not exposed (CWE-209)
164
+ raise ClipwrightError(
165
+ code=ErrorCode.INVALID_INPUT,
166
+ message="Failed to parse subtitle file (timecode format error).",
167
+ hint=(
168
+ "Check the format of the timecode line"
169
+ " (e.g. 00:00:00,000 --> 00:00:01,000)."
170
+ ),
171
+ ) from None
172
+
173
+ # --- 5. Launch wrap_cli (WR-AD-02; DC-AS-007) ---
174
+
175
+ cue_count = len(cues)
176
+ # With zero cues, skip wrap_cli and serialize directly
177
+ if cue_count > 0:
178
+ stdin_payload = json.dumps(
179
+ {
180
+ "language": options.language,
181
+ "texts": [cue.text for cue in cues],
182
+ },
183
+ ensure_ascii=False,
184
+ )
185
+ timeout = _compute_timeout(cue_count)
186
+
187
+ try:
188
+ proc = subprocess.run(
189
+ [sys.executable, "-m", "clipwright_wrap.wrap_cli"],
190
+ input=stdin_payload,
191
+ capture_output=True,
192
+ text=True,
193
+ encoding="utf-8",
194
+ timeout=timeout,
195
+ )
196
+ except subprocess.TimeoutExpired:
197
+ raise ClipwrightError(
198
+ code=ErrorCode.SUBPROCESS_TIMEOUT,
199
+ message=f"{_SUBPROCESS_SAFE_MESSAGE} (timeout)",
200
+ hint=(
201
+ "The subtitle file may contain too many cues. "
202
+ "Try again or reduce the number of cues."
203
+ ),
204
+ ) from None
205
+ except OSError:
206
+ raise ClipwrightError(
207
+ code=ErrorCode.SUBPROCESS_FAILED,
208
+ message=_SUBPROCESS_SAFE_MESSAGE,
209
+ hint=(
210
+ "Failed to launch wrap_cli. "
211
+ "Check that clipwright-wrap is correctly installed."
212
+ ),
213
+ ) from None
214
+
215
+ # wrap_cli always exits with 0; errors are reported via the "error" key in the stdout JSON (DC-AS-007)
216
+ try:
217
+ parsed: dict[str, Any] = json.loads(proc.stdout)
218
+ except (json.JSONDecodeError, ValueError):
219
+ raise ClipwrightError(
220
+ code=ErrorCode.SUBPROCESS_FAILED,
221
+ message=_SUBPROCESS_SAFE_MESSAGE,
222
+ hint="Failed to parse wrap_cli output JSON. Please run again.",
223
+ ) from None
224
+
225
+ if "error" in parsed:
226
+ err = parsed["error"]
227
+ code_str: str = err.get("code", str(ErrorCode.INTERNAL))
228
+ msg: str = err.get("message", "An error occurred in wrap_cli")
229
+ hint: str = err.get("hint", "Please report with reproduction steps.")
230
+ # Convert to ErrorCode (DEPENDENCY_MISSING propagated as-is)
231
+ try:
232
+ code = ErrorCode(code_str)
233
+ except ValueError:
234
+ code = ErrorCode.INTERNAL
235
+ raise ClipwrightError(code=code, message=msg, hint=hint)
236
+
237
+ segments: list[list[str]] = parsed.get("segments", [])
238
+ else:
239
+ segments = []
240
+
241
+ # --- 6. Apply wrap_cue_lines to each cue → overflow detection ---
242
+
243
+ overflow_cue_indices: list[int] = []
244
+ overflow_width_cue_indices: list[int] = []
245
+ wrapped_count = 0
246
+
247
+ for i, cue in enumerate(cues):
248
+ seg = segments[i] if i < len(segments) else [cue.text]
249
+ lines = wrap_cue_lines(seg, options.max_chars)
250
+
251
+ # Increment wrapped_count when the text has changed (line break inserted)
252
+ new_text = "\n".join(lines)
253
+ if new_text != cue.text:
254
+ wrapped_count += 1
255
+
256
+ # Overflow detection (WR-AD-15(1))
257
+ overflow = check_overflow(lines, options.max_chars, options.max_lines)
258
+ if overflow["line_count_overflow"]:
259
+ overflow_cue_indices.append(i)
260
+ if overflow["line_width_overflow"]:
261
+ overflow_width_cue_indices.append(i)
262
+
263
+ # Update cue.text to the formatted text (no truncation; full text preserved)
264
+ cue.text = new_text
265
+
266
+ # --- 7. captions.serialize_captions → write output ---
267
+
268
+ serialized = serialize_captions(cues, fmt)
269
+ output_path.write_text(serialized, encoding="utf-8")
270
+
271
+ # --- 8. Build envelope (WR-AD-13) ---
272
+
273
+ warnings: list[str] = []
274
+
275
+ # Line-count overflow warnings (aggregated; omitted when 0 entries; DC-AM-002)
276
+ if overflow_cue_indices:
277
+ warnings.append(
278
+ f"{len(overflow_cue_indices)} cue(s) exceeded max_lines"
279
+ f" ({options.max_lines})"
280
+ " (see data.overflow_cue_indices for indices)."
281
+ " Output without truncation to avoid information loss."
282
+ )
283
+
284
+ # Line-width overflow warnings (single aggregated sentence; omitted when 0 entries)
285
+ if overflow_width_cue_indices:
286
+ warnings.append(
287
+ f"{len(overflow_width_cue_indices)} cue(s) exceeded max_chars"
288
+ f" ({options.max_chars})"
289
+ " (see data.overflow_width_cue_indices for indices)."
290
+ " Output without truncation to avoid information loss."
291
+ )
292
+
293
+ total_overflow = len(set(overflow_cue_indices) | set(overflow_width_cue_indices))
294
+ summary = (
295
+ f"Phrase-boundary line breaks applied to {cue_count} cue(s)"
296
+ f" ({wrapped_count} cue(s) had line breaks inserted;"
297
+ f" {total_overflow} cue(s) exceeded limits)."
298
+ f" Language: {options.language}."
299
+ f" Generated {output_path.name}."
300
+ )
301
+
302
+ artifacts = [
303
+ {"role": "captions", "path": str(output_path), "format": fmt},
304
+ ]
305
+
306
+ return ok_result(
307
+ summary,
308
+ data={
309
+ "cue_count": cue_count,
310
+ "wrapped_count": wrapped_count,
311
+ "overflow_cue_indices": overflow_cue_indices,
312
+ "overflow_width_cue_indices": overflow_width_cue_indices,
313
+ "language": options.language,
314
+ },
315
+ artifacts=artifacts,
316
+ warnings=warnings,
317
+ )
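Step 6 above relies on `wrap_cue_lines` and `check_overflow`, which are defined elsewhere in the package. As a rough illustration of the idea only, the sketch below packs phrase segments greedily into lines and then reports the two overflow flags that `_wrap_inner` reads. The names `greedy_wrap` and `find_overflow` are hypothetical stand-ins, and counting width with `len()` is an assumption; the real helpers may measure width or break lines differently.

```python
def greedy_wrap(segments: list[str], max_chars: int) -> list[str]:
    """Pack phrase segments into lines without ever splitting a segment.

    Hypothetical stand-in for wrap_cue_lines; width is counted with len().
    """
    lines: list[str] = []
    current = ""
    for seg in segments:
        # Start a new line only when adding this segment would exceed the limit.
        if current and len(current) + len(seg) > max_chars:
            lines.append(current)
            current = seg
        else:
            current += seg
    if current:
        lines.append(current)
    return lines


def find_overflow(lines: list[str], max_chars: int, max_lines: int) -> dict[str, bool]:
    """Report the two flags read in step 6 (hypothetical stand-in for check_overflow)."""
    return {
        "line_count_overflow": len(lines) > max_lines,
        "line_width_overflow": any(len(line) > max_chars for line in lines),
    }


lines = greedy_wrap(["今日は", "天気が", "いいですね"], max_chars=6)
flags = find_overflow(lines, max_chars=6, max_lines=1)
```

A segment longer than `max_chars` is kept whole in this sketch, which matches the "no truncation" behaviour described in the warnings above: the cue is flagged rather than cut.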
@@ -0,0 +1,180 @@
1
+ """wrap_cli.py — Small CLI for BudouX phrase-boundary segmentation (separate process).
2
+
3
+ Not imported by the MCP server process (§2.4 subprocess loose coupling).
4
+ wrap.py launches this as sys.executable -m clipwright_wrap.wrap_cli in a subprocess.
5
+
6
+ CLI contract (WR-AD-02):
7
+ - stdin: JSON {"language": "ja", "texts": ["cue1", ...]}
8
+ - stdout: JSON {"segments": [["segment1", "segment2", ...], ...]}
9
+ - On error stdout: {"error": {"code": str, "message": str, "hint": str}}
10
+ - main() catches all exceptions at the top level, always outputs JSON to stdout,
11
+ and returns 0.
12
+ - stdout contains JSON only. Logs and progress go to stderr.
13
+ """
14
+
15
+ from __future__ import annotations
16
+
17
+ import json
18
+ import sys
19
+ import traceback
20
+ from typing import Any
21
+
22
+ from clipwright.errors import ErrorCode
23
+
24
+ # pip install hint string
25
+ _WRAP_INSTALL_HINT = "Install clipwright-wrap with `pip install clipwright-wrap`."
26
+
27
+ # Mapping of language → parser load function (DC-AS-002: target for test monkeypatching)
28
+ # budoux is imported at module top level. Because this CLI runs in a separate process,
29
+ # there is no risk of leaking into the server process, and _PARSER_LOADERS must be
30
+ # exposed as a module constant (tests reference it directly).
31
+ # If budoux is not installed, the dict stays empty (main() returns DEPENDENCY_MISSING).
32
+ try:
33
+ import budoux as _budoux
34
+
35
+ _PARSER_LOADERS: dict[str, Any] = {
36
+ "ja": _budoux.load_default_japanese_parser,
37
+ "zh-hans": _budoux.load_default_simplified_chinese_parser,
38
+ "zh-hant": _budoux.load_default_traditional_chinese_parser,
39
+ "th": _budoux.load_default_thai_parser,
40
+ }
41
+ except ImportError:
42
+ _PARSER_LOADERS = {}
43
+
44
+
45
+ def _error_output(code: str, message: str, hint: str) -> None:
46
+ """Output an error JSON to stdout.
47
+
48
+ The caller must sanitise any path information before passing it here.
49
+ """
50
+ result: dict[str, Any] = {
51
+ "error": {
52
+ "code": code,
53
+ "message": message,
54
+ "hint": hint,
55
+ }
56
+ }
57
+ print(json.dumps(result, ensure_ascii=False), file=sys.stdout)
58
+
59
+
60
+ def main(argv: list[str] | None = None) -> int: # noqa: ARG001
61
+ """Entry point for wrap_cli.
62
+
63
+ Catches all exceptions at the top level, outputs JSON to stdout,
64
+ and returns 0 (WR-AD-02).
65
+
66
+ Args:
67
+ argv: Command-line argument list (unused in the current version).
68
+
69
+ Returns:
70
+ Exit code (always 0).
71
+ """
72
+ try:
73
+ # --- Read JSON from stdin ---
74
+ try:
75
+ raw = sys.stdin.read()
76
+ payload: dict[str, Any] = json.loads(raw)
77
+ except (json.JSONDecodeError, ValueError):
78
+ _error_output(
79
+ code=str(ErrorCode.INVALID_INPUT),
80
+ message="Failed to parse JSON from stdin",
81
+ hint="Pass a valid JSON object to stdin.",
82
+ )
83
+ return 0
84
+
85
+ # --- Input validation ---
86
+ if "language" not in payload:
87
+ _error_output(
88
+ code=str(ErrorCode.INVALID_INPUT),
89
+ message="Missing 'language' key",
90
+ hint="Include a 'language' key in the stdin JSON.",
91
+ )
92
+ return 0
93
+
94
+ if "texts" not in payload:
95
+ _error_output(
96
+ code=str(ErrorCode.INVALID_INPUT),
97
+ message="Missing 'texts' key",
98
+ hint="Include a 'texts' key in the stdin JSON.",
99
+ )
100
+ return 0
101
+
102
+ language: str = payload["language"]
103
+ texts = payload["texts"]
104
+
105
+ if not isinstance(texts, list):
106
+ _error_output(
107
+ code=str(ErrorCode.INVALID_INPUT),
108
+ message="'texts' must be a list",
109
+ hint="Set 'texts' in the stdin JSON to a list of strings.",
110
+ )
111
+ return 0
112
+
113
+ if not all(isinstance(t, str) for t in texts):
114
+ _error_output(
115
+ code=str(ErrorCode.INVALID_INPUT),
116
+ message="Each element of 'texts' must be a string",
117
+ hint="Set 'texts' in the stdin JSON to a list of strings.",
118
+ )
119
+ return 0
120
+
121
+ # --- Get the parser loader (DC-AS-002: loaded once, outside the texts loop) ---
122
+ # If budoux is missing (_PARSER_LOADERS empty), return DEPENDENCY_MISSING
123
+ # CR L-2: return DEPENDENCY_MISSING + install hint instead of INVALID_INPUT
124
+ if not _PARSER_LOADERS:
125
+ _error_output(
126
+ code=str(ErrorCode.DEPENDENCY_MISSING),
127
+ message="budoux is not installed",
128
+ hint=_WRAP_INSTALL_HINT,
129
+ )
130
+ return 0
131
+
132
+ if language not in _PARSER_LOADERS:
133
+ _error_output(
134
+ code=str(ErrorCode.INVALID_INPUT),
135
+ message="Unsupported language specified",
136
+ hint=(
137
+ "Specify one of the following for language:"
138
+ " ja / zh-hans / zh-hant / th."
139
+ ),
140
+ )
141
+ return 0
142
+
143
+ # Load the parser once outside the texts loop (DC-AS-002)
144
+ # An ImportError raised while calling the loader is reported as DEPENDENCY_MISSING
145
+ try:
146
+ parser = _PARSER_LOADERS[language]()
147
+ except ImportError:
148
+ # SR L-2: str(exc) may contain internal paths; use a fixed message instead
149
+ _error_output(
150
+ code=str(ErrorCode.DEPENDENCY_MISSING),
151
+ message="Failed to import budoux",
152
+ hint=_WRAP_INSTALL_HINT,
153
+ )
154
+ return 0
155
+
156
+ # --- Segment each cue text into phrase-boundary tokens ---
157
+ segments: list[list[str]] = []
158
+ for text in texts:
159
+ seg: list[str] = parser.parse(text)
160
+ segments.append(seg)
161
+
162
+ result: dict[str, Any] = {"segments": segments}
163
+ print(json.dumps(result, ensure_ascii=False), file=sys.stdout)
164
+ return 0
165
+
166
+ except Exception:
167
+ # Catch all unexpected exceptions and return an error JSON (WR-AD-02)
168
+ # SR NF-L-1: str(exc) may contain internal paths; use a fixed message instead.
169
+ # Debug details go to stderr only; must not leak into stdout JSON.
170
+ traceback.print_exc(file=sys.stderr)
171
+ _error_output(
172
+ code=str(ErrorCode.INTERNAL),
173
+ message="An unexpected error occurred in wrap_cli",
174
+ hint="Please report with reproduction steps.",
175
+ )
176
+ return 0
177
+
178
+
179
+ if __name__ == "__main__": # pragma: no cover
180
+ sys.exit(main())
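The CLI contract in the module docstring (JSON on stdin, JSON on stdout, exit code always 0, errors carried in an `"error"` object) can be exercised from a parent process roughly as follows. This is a minimal sketch, not the package's own caller: to stay runnable without BudouX it launches a hypothetical stand-in child (`_CHILD_SOURCE`) that splits on whitespace, and `call_segmenter` is an illustrative name. Setting `PYTHONIOENCODING` for the child is an extra precaution assumed here, not something the package is shown to do.

```python
import json
import os
import subprocess
import sys

# Hypothetical stand-in child: same stdin/stdout JSON contract as wrap_cli,
# but it splits on whitespace instead of calling BudouX.
_CHILD_SOURCE = r"""
import json, sys
try:
    payload = json.loads(sys.stdin.read())
    out = {"segments": [text.split() for text in payload["texts"]]}
except Exception:
    out = {"error": {"code": "INTERNAL", "message": "failed", "hint": "retry"}}
print(json.dumps(out))
"""


def call_segmenter(language: str, texts: list[str], timeout: float = 10.0) -> list[list[str]]:
    """Send one JSON request to the child and decode its JSON reply."""
    payload = json.dumps({"language": language, "texts": texts}, ensure_ascii=False)
    proc = subprocess.run(
        [sys.executable, "-c", _CHILD_SOURCE],
        input=payload,
        capture_output=True,
        text=True,
        encoding="utf-8",
        timeout=timeout,
        # Force UTF-8 on the child's stdin/stdout regardless of locale.
        env={**os.environ, "PYTHONIOENCODING": "utf-8"},
    )
    # The exit code carries no information under this contract; inspect the JSON.
    parsed = json.loads(proc.stdout)
    if not isinstance(parsed, dict):
        raise RuntimeError("child returned a non-object JSON value")
    if "error" in parsed:
        err = parsed["error"]
        raise RuntimeError(f"{err.get('code')}: {err.get('message')}")
    return parsed.get("segments", [])


segments = call_segmenter("ja", ["hello world", "one"])
```

Because the child always exits with 0, the parent has to treat stdout as the only source of truth, which is why the sketch checks the decoded value's type before looking for the `"error"` key.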