youtube-transcript-rb 0.1.0

data/PLAN.md ADDED
@@ -0,0 +1,422 @@
1
+ # YouTube Transcript Ruby - Porting Plan
2
+
3
+ This document outlines the plan for porting the Python [youtube-transcript-api](https://github.com/jdepoix/youtube-transcript-api) library to Ruby.
4
+
5
+ ## Overview
6
+
7
+ The Python library provides functionality to:
8
+ - Fetch transcripts/subtitles from YouTube videos
9
+ - Support for auto-generated and manually created captions
10
+ - Language preference selection
11
+ - Translation support
12
+ - Multiple output formatters
13
+ - Proxy support for IP ban workarounds
14
+
15
+ ## Source Analysis
16
+
17
+ ### Python Library Structure
18
+
19
+ ```
20
+ youtube_transcript_api/
21
+ ├── __init__.py
22
+ ├── _api.py # YouTubeTranscriptApi class
23
+ ├── _transcripts.py # Transcript, TranscriptList, FetchedTranscript, TranscriptListFetcher
24
+ ├── _errors.py # Exception classes
25
+ ├── _settings.py # Constants (URLs, API settings)
26
+ ├── formatters.py # Output formatters (JSON, SRT, WebVTT, Text, PrettyPrint)
27
+ └── proxies.py # Proxy configuration classes
28
+ ```
29
+
30
+ ### Target Ruby Structure
31
+
32
+ ```
33
+ lib/youtube/transcript/rb/
34
+ ├── version.rb # ✅ Already exists
35
+ ├── errors.rb # ✅ Completed (Phase 1)
36
+ ├── settings.rb # ✅ Completed (Phase 1)
37
+ ├── transcript.rb # ✅ Completed (Phase 1)
38
+ ├── transcript_parser.rb # ✅ Completed (Phase 1)
39
+ ├── transcript_list.rb # ✅ Completed (Phase 2)
40
+ ├── transcript_list_fetcher.rb # ✅ Completed (Phase 2)
41
+ ├── api.rb # ✅ Completed (Phase 3)
42
+ ├── formatters.rb # ✅ Completed (Phase 4)
43
+ └── proxies.rb # ❌ To be created (optional, Phase 5)
44
+ ```
45
+
46
+ ---
47
+
48
+ ## Phase 1: Core Infrastructure ✅ COMPLETED
49
+
50
+ ### Task 1.1: Create Error Classes (`errors.rb`) ✅
51
+
52
+ **Status:** Completed
53
+ **Commit:** Phase 1 commit
54
+
55
+ Created comprehensive exception hierarchy (15+ classes) mirroring Python's `_errors.py`:
56
+ - `Error` - Base error class
57
+ - `CouldNotRetrieveTranscript` - Base class for transcript retrieval errors
58
+ - `VideoUnavailable`, `TranscriptsDisabled`, `NoTranscriptFound`, etc.
59
+ - Each error class includes appropriate attributes and formatted error messages
60
+
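The hierarchy described above can be sketched as follows. This is a minimal illustration, not the gem's actual `errors.rb`: the `TranscriptErrorsSketch` namespace, the `CAUSE` constants, and the message wording are assumptions made for the example; only the class names come from the plan.

```ruby
# Hypothetical sketch of the exception hierarchy; message wording is illustrative.
module TranscriptErrorsSketch
  # Base class for every error raised by the library.
  class Error < StandardError; end

  # Base class for failures tied to a specific video.
  class CouldNotRetrieveTranscript < Error
    CAUSE = "The transcript could not be retrieved."

    attr_reader :video_id

    def initialize(video_id)
      @video_id = video_id
      # self.class::CAUSE picks up the subclass constant when one is defined.
      super("Could not retrieve a transcript for video #{video_id}: #{self.class::CAUSE}")
    end
  end

  class VideoUnavailable < CouldNotRetrieveTranscript
    CAUSE = "The video is no longer available."
  end

  class TranscriptsDisabled < CouldNotRetrieveTranscript
    CAUSE = "Subtitles are disabled for this video."
  end
end

begin
  raise TranscriptErrorsSketch::TranscriptsDisabled, "abc123"
rescue TranscriptErrorsSketch::CouldNotRetrieveTranscript => e
  puts e.video_id
  puts e.message
end
```

Rescuing the shared base class lets callers handle every retrieval failure in one place while still reading the `video_id` attribute.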
61
+ ### Task 1.2: Create Settings/Constants (`settings.rb`) ✅
62
+
63
+ **Status:** Completed
64
+
65
+ Constants defined:
66
+ - `WATCH_URL` - YouTube watch page URL template
67
+ - `INNERTUBE_API_URL` - Innertube API endpoint template
68
+ - `INNERTUBE_CONTEXT` - Android client context for API requests
69
+
70
+ ### Task 1.3: Create Transcript Data Classes (`transcript.rb`) ✅
71
+
72
+ **Status:** Completed
73
+
74
+ Classes implemented:
75
+ - `TranslationLanguage` - Language code and name pair
76
+ - `TranscriptSnippet` - Individual transcript segment (text, start, duration)
77
+ - `FetchedTranscript` - Collection of snippets with Enumerable support
78
+ - `Transcript` - Metadata with `fetch` and `translate` methods
79
+
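As a rough sketch of the data classes listed above, a snippet can be a `Struct` and the fetched transcript a thin wrapper that includes `Enumerable`. The `Sketch` suffixes and the exact constructor keywords are assumptions for illustration; the real classes may carry more metadata.

```ruby
# Hypothetical sketch; only text/start/duration are taken from the plan.
TranscriptSnippetSketch = Struct.new(:text, :start, :duration, keyword_init: true)

class FetchedTranscriptSketch
  include Enumerable

  attr_reader :video_id, :language_code, :snippets

  def initialize(video_id:, language_code:, snippets:)
    @video_id = video_id
    @language_code = language_code
    @snippets = snippets
  end

  # Enumerable only needs #each; map, select, first and friends build on it.
  def each(&block)
    return enum_for(:each) unless block

    @snippets.each(&block)
    self
  end

  def [](index)
    @snippets[index]
  end

  def length
    @snippets.length
  end
end

transcript = FetchedTranscriptSketch.new(
  video_id: "abc123",
  language_code: "en",
  snippets: [
    TranscriptSnippetSketch.new(text: "Hello", start: 0.0, duration: 1.5),
    TranscriptSnippetSketch.new(text: "world", start: 1.5, duration: 2.0)
  ]
)

puts transcript.map(&:text).join(" ")
puts transcript[1].start
```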
80
+ ### Task 1.4: Create TranscriptParser (`transcript_parser.rb`) ✅
81
+
82
+ **Status:** Completed
83
+
84
+ Features:
85
+ - XML parsing with Nokogiri
86
+ - HTML entity unescaping with CGI
87
+ - `preserve_formatting` option
88
+ - Formatting tag handling (strong, em, b, i, etc.)
89
+
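The text-cleanup half of the parser can be illustrated with the standard library alone. The sketch below assumes a hypothetical helper name (`clean_snippet_text`) and limits the formatting tags to the four named above; the real parser walks the XML with Nokogiri, which is left out here.

```ruby
require "cgi"

# Hypothetical helper; covers entity unescaping and tag handling only.
FORMATTING_TAGS = %w[strong em b i].freeze

def clean_snippet_text(raw, preserve_formatting: false)
  # Unescape first so entity-encoded text is handled like literal text.
  text = CGI.unescapeHTML(raw)

  if preserve_formatting
    # Remove every tag except opening/closing forms of the formatting tags.
    keep = FORMATTING_TAGS.join("|")
    text.gsub(%r{<(?!/?(?:#{keep})\b)[^>]*>}i, "")
  else
    # Remove all tags.
    text.gsub(/<[^>]*>/, "")
  end
end

raw = "I&#39;m <b>bold</b> &amp; <span>plain</span>"
puts clean_snippet_text(raw)
puts clean_snippet_text(raw, preserve_formatting: true)
```

The negative lookahead is what keeps `<b>` and `</b>` while still dropping look-alikes such as `<br>`.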
90
+ ### Phase 1 Test Results
91
+ - **149 examples, 0 failures**
92
+ - Test files: `errors_spec.rb`, `transcript_spec.rb`, `transcript_parser_spec.rb`, `settings_spec.rb`
93
+
94
+ ---
95
+
96
+ ## Phase 2: Transcript Fetching ✅ COMPLETED
97
+
98
+ ### Task 2.1: Create TranscriptListFetcher (`transcript_list_fetcher.rb`) ✅
99
+
100
+ **Status:** Completed
101
+ **Commit:** `ccae0eb Phase 2: Implement TranscriptList and TranscriptListFetcher`
102
+
103
+ This is the most complex component. Implemented features:
104
+ 1. Fetch video HTML page with `Accept-Language: en-US` header
105
+ 2. Extract `INNERTUBE_API_KEY` from HTML using regex
106
+ 3. Make POST request to Innertube API with Android client context
107
+ 4. Parse captions JSON from response
108
+ 5. Handle consent cookies for EU/GDPR compliance
109
+ 6. Handle various error conditions
110
+
111
+ Key methods implemented:
112
+ - `fetch(video_id)` → `TranscriptList`
113
+ - `fetch_video_html(video_id)` - Fetches and unescapes HTML
114
+ - `extract_innertube_api_key(html, video_id)` - Regex extraction
115
+ - `fetch_innertube_data(video_id, api_key)` - POST to Innertube API
116
+ - `extract_captions_json(innertube_data, video_id)` - JSON extraction
117
+ - `assert_playability(status_data, video_id)` - Playability validation
118
+
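The regex extraction step can be sketched as below. The pattern, the CAPTCHA marker string, and the two `Sketch` error classes are assumptions for illustration (the plan only states that the key is pulled from the HTML with a regex and that a CAPTCHA maps to `IpBlocked`).

```ruby
# Assumed pattern: the key is embedded in the page as "INNERTUBE_API_KEY": "<key>".
INNERTUBE_API_KEY_PATTERN = /"INNERTUBE_API_KEY":\s*"([a-zA-Z0-9_-]+)"/

# Stand-ins for the library's error classes; names are hypothetical.
class IpBlockedSketch < StandardError; end
class UnparsablePageSketch < StandardError; end

def extract_innertube_api_key(html, video_id)
  match = INNERTUBE_API_KEY_PATTERN.match(html)
  return match[1] if match

  # Assumed marker: a reCAPTCHA widget in place of the watch page.
  raise IpBlockedSketch, video_id if html.include?('class="g-recaptcha"')

  raise UnparsablePageSketch, video_id
end

html = '<script>var cfg = {"INNERTUBE_API_KEY": "AIza_Sy-123"};</script>'
puts extract_innertube_api_key(html, "abc123")
```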
119
+ Additional modules:
120
+ - `PlayabilityStatus` - OK, ERROR, LOGIN_REQUIRED constants
121
+ - `PlayabilityFailedReason` - BOT_DETECTED, AGE_RESTRICTED, VIDEO_UNAVAILABLE
122
+
123
+ Error handling:
124
+ - HTTP 429 → `IpBlocked`
125
+ - CAPTCHA detected → `IpBlocked`
126
+ - Bot detection → `RequestBlocked`
127
+ - Age restriction → `AgeRestricted`
128
+ - Video unavailable → `VideoUnavailable` or `InvalidVideoId`
129
+ - No captions → `TranscriptsDisabled`
130
+ - Consent issues → `FailedToCreateConsentCookie`
131
+
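One way the playability checks in this mapping might fit together is sketched here. The constant values, the shape of `status_data`, the URL check behind `InvalidVideoId`, and the `VideoUnplayable` fallback are all assumptions; only the module, constant, and error names come from the plan.

```ruby
# Hypothetical sketch: constant values and the fallback error are assumptions.
module PlayabilitySketch
  module PlayabilityStatus
    OK = "OK"
    ERROR = "ERROR"
    LOGIN_REQUIRED = "LOGIN_REQUIRED"
  end

  module PlayabilityFailedReason
    BOT_DETECTED = "Sign in to confirm you're not a bot"
    AGE_RESTRICTED = "Sign in to confirm your age"
    VIDEO_UNAVAILABLE = "Video unavailable"
  end

  class RequestBlocked < StandardError; end
  class AgeRestricted < StandardError; end
  class VideoUnavailable < StandardError; end
  class InvalidVideoId < StandardError; end
  class VideoUnplayable < StandardError; end

  def self.assert_playability(status_data, video_id)
    status = status_data["status"]
    return if status.nil? || status == PlayabilityStatus::OK

    reason = status_data["reason"].to_s
    case status
    when PlayabilityStatus::LOGIN_REQUIRED
      raise RequestBlocked, video_id if reason == PlayabilityFailedReason::BOT_DETECTED
      raise AgeRestricted, video_id if reason == PlayabilityFailedReason::AGE_RESTRICTED
    when PlayabilityStatus::ERROR
      if reason == PlayabilityFailedReason::VIDEO_UNAVAILABLE
        # Assumed heuristic: a full URL passed as the ID gets a clearer error.
        raise InvalidVideoId, video_id if video_id.start_with?("http://", "https://")

        raise VideoUnavailable, video_id
      end
    end

    # Anything else is reported with the raw reason.
    raise VideoUnplayable, "#{video_id}: #{reason}"
  end
end

begin
  PlayabilitySketch.assert_playability(
    { "status" => "ERROR", "reason" => "Video unavailable" }, "abc123"
  )
rescue PlayabilitySketch::VideoUnavailable => e
  puts "unavailable: #{e.message}"
end
```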
132
+ Retry support with proxy configuration.
133
+
134
+ ### Task 2.2: Create TranscriptList (`transcript_list.rb`) ✅
135
+
136
+ **Status:** Completed
137
+
138
+ Factory and container for transcripts:
139
+ - `TranscriptList.build(http_client:, video_id:, captions_json:)` - Factory method
140
+ - Separates manually created and generated transcripts
141
+ - `find_transcript(language_codes)` - Finds by priority, prefers manual
142
+ - `find_manually_created_transcript(language_codes)`
143
+ - `find_generated_transcript(language_codes)`
144
+ - `Enumerable` included for iteration
145
+ - `to_s` for human-readable output
146
+
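The lookup order ("by priority, prefers manual") can be sketched as a nested search: language codes are tried in the order given, and for each code the manually created transcripts are checked before the generated ones. The class and struct names below are hypothetical, and the real `TranscriptList` is built from captions JSON rather than from plain arrays.

```ruby
# Hypothetical sketch of the lookup order; not the gem's TranscriptList.
TranscriptEntrySketch = Struct.new(:language_code, :is_generated, keyword_init: true)

class NoTranscriptFoundSketch < StandardError; end

class TranscriptListSketch
  def initialize(manual:, generated:)
    # Index each group by language code for direct lookup.
    @manual = manual.to_h { |t| [t.language_code, t] }
    @generated = generated.to_h { |t| [t.language_code, t] }
  end

  def find_transcript(language_codes)
    find_in(language_codes, [@manual, @generated])
  end

  def find_manually_created_transcript(language_codes)
    find_in(language_codes, [@manual])
  end

  def find_generated_transcript(language_codes)
    find_in(language_codes, [@generated])
  end

  private

  # Language priority wins; within one language, earlier sources win.
  def find_in(language_codes, sources)
    language_codes.each do |code|
      sources.each do |source|
        return source[code] if source.key?(code)
      end
    end
    raise NoTranscriptFoundSketch, "no transcript for #{language_codes.join(', ')}"
  end
end

list = TranscriptListSketch.new(
  manual: [TranscriptEntrySketch.new(language_code: "de", is_generated: false)],
  generated: [
    TranscriptEntrySketch.new(language_code: "en", is_generated: true),
    TranscriptEntrySketch.new(language_code: "de", is_generated: true)
  ]
)

puts list.find_transcript(%w[de en]).is_generated
puts list.find_transcript(%w[en de]).language_code
```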
147
+ ### Phase 2 Test Results
148
+ - **70 new examples** (38 for TranscriptList, 32 for TranscriptListFetcher)
149
+ - **219 total examples, 0 failures**
150
+ - Test files: `transcript_list_spec.rb`, `transcript_list_fetcher_spec.rb`
151
+ - Comprehensive HTTP mocking with WebMock
152
+
153
+ ---
154
+
155
+ ## Phase 3: Main API ✅ COMPLETED
156
+
157
+ ### Task 3.1: Create YouTubeTranscriptApi (`api.rb`) ✅
158
+
159
+ **Status:** Completed
160
+ **Commit:** `290c9d2 Phase 3: Implement YouTubeTranscriptApi main entry point`
161
+
162
+ Main entry point class implemented:
163
+ - `YouTubeTranscriptApi.new(http_client:, proxy_config:)` - Constructor with optional config
164
+ - `fetch(video_id, languages:, preserve_formatting:)` - Fetch single transcript
165
+ - `list(video_id)` - List all available transcripts for a video
166
+ - `fetch_all(video_ids, ...)` - Batch fetch with error handling and yield support
167
+
168
+ Features:
169
+ - Default Faraday HTTP client with 30s timeout
170
+ - Configurable HTTP client for custom setups
171
+ - Proxy configuration support
172
+ - `continue_on_error` option for batch processing
173
+ - Block yielding for progress tracking
174
+
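The batch behaviour (`continue_on_error` plus block yielding) could look roughly like the sketch below. The `fetcher:` keyword is a hypothetical stand-in for the real per-video fetch, and what the block receives (the video ID and either a transcript or the error) is an assumption about the interface, not a description of `fetch_all` itself.

```ruby
# Hypothetical sketch: `fetcher` and the yielded values are assumptions.
def fetch_all_sketch(video_ids, fetcher:, continue_on_error: false)
  results = {}
  video_ids.each do |video_id|
    outcome =
      begin
        fetcher.call(video_id)
      rescue StandardError => e
        # Without continue_on_error the first failure aborts the batch.
        raise unless continue_on_error

        e
      end

    results[video_id] = outcome unless outcome.is_a?(StandardError)
    # Yield outside the rescue so errors raised by the block are not swallowed.
    yield video_id, outcome if block_given?
  end
  results
end

fake_fetcher = lambda do |video_id|
  raise ArgumentError, "no transcript for #{video_id}" if video_id == "bad"

  "transcript of #{video_id}"
end

results = fetch_all_sketch(%w[a bad b], fetcher: fake_fetcher, continue_on_error: true) do |id, outcome|
  status = outcome.is_a?(StandardError) ? "failed" : "ok"
  puts "#{id}: #{status}"
end
```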
175
+ ### Task 3.2: Update Main Module (`rb.rb`) ✅
176
+
177
+ **Status:** Completed
178
+
179
+ Convenience methods already existed and now work:
180
+ - `Youtube::Transcript::Rb.fetch(video_id, languages:, preserve_formatting:)`
181
+ - `Youtube::Transcript::Rb.list(video_id)`
182
+
183
+ ### Phase 3 Test Results
184
+ - **33 new examples** for YouTubeTranscriptApi
185
+ - **252 total examples, 0 failures**
186
+ - Test file: `spec/api_spec.rb`
187
+
188
+ ---
189
+
190
+ ## Phase 4: Formatters ✅ COMPLETED
191
+
192
+ ### Task 4.1: Create Formatter Classes (`formatters.rb`) ✅
193
+
194
+ **Status:** Completed
195
+ **Commit:** `ec9c985 Phase 4: Implement Formatters for transcript output`
196
+
197
+ Formatter hierarchy implemented:
198
+ - `Formatter` - Abstract base class
199
+ - `JSONFormatter` - JSON output with configurable options
200
+ - `TextFormatter` - Plain text (text only, no timestamps)
201
+ - `PrettyPrintFormatter` - Ruby pretty-printed output
202
+ - `TextBasedFormatter` - Base for timestamp-based formatters
203
+ - `SRTFormatter` - SubRip format (`HH:MM:SS,mmm`)
204
+ - `WebVTTFormatter` - Web Video Text Tracks (`HH:MM:SS.mmm`)
205
+ - `FormatterLoader` - Utility to load formatters by name
206
+
207
+ Features:
208
+ - `format_transcript(transcript)` - Format single transcript
209
+ - `format_transcripts(transcripts)` - Format multiple transcripts
210
+ - Proper timestamp handling with hours/mins/secs/ms
211
+ - Overlapping timestamp correction
212
+ - SRT includes sequence numbers
213
+ - WebVTT includes WEBVTT header
214
+
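The timestamp handling and overlap correction can be sketched as follows. The helper names and the hash-based snippet shape are assumptions; the sketch only shows the arithmetic behind `HH:MM:SS,mmm` (SRT) and `HH:MM:SS.mmm` (WebVTT) and one plausible reading of "overlapping timestamp correction", namely clamping a cue's end to the next cue's start.

```ruby
# Hypothetical helpers; snippets are plain hashes with :text, :start, :duration.
def format_timestamp(seconds, ms_separator)
  # Work in integer milliseconds to avoid float rounding drift.
  total_ms = (seconds * 1000).round
  hours, rest = total_ms.divmod(3_600_000)
  mins, rest = rest.divmod(60_000)
  secs, ms = rest.divmod(1000)
  format("%02d:%02d:%02d%s%03d", hours, mins, secs, ms_separator, ms)
end

def srt_timestamp(seconds)
  format_timestamp(seconds, ",")
end

def webvtt_timestamp(seconds)
  format_timestamp(seconds, ".")
end

def srt_cues(snippets)
  cues = snippets.each_with_index.map do |snippet, index|
    finish = snippet[:start] + snippet[:duration]
    following = snippets[index + 1]
    # Clamp the end time so a cue never overlaps the one after it.
    finish = following[:start] if following && following[:start] < finish
    "#{index + 1}\n#{srt_timestamp(snippet[:start])} --> #{srt_timestamp(finish)}\n#{snippet[:text]}"
  end
  cues.join("\n\n")
end

snippets = [
  { text: "Hi", start: 0.0, duration: 2.5 },
  { text: "there", start: 2.0, duration: 1.0 }
]
puts srt_cues(snippets)
puts webvtt_timestamp(3661.5)
```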
215
+ ### Phase 4 Test Results
216
+ - **53 new examples** for Formatters
217
+ - **305 total examples, 0 failures**
218
+ - Test file: `spec/formatters_spec.rb`
219
+
220
+ ---
221
+
222
+ ## Phase 5: Proxy Support ⏳ (Optional)
223
+
224
+ ### Task 5.1: Create Proxy Configuration (`proxies.rb`)
225
+
226
+ **Priority:** Low
227
+ **Estimated Effort:** 1.5 hours
228
+
229
+ ```ruby
230
+ class ProxyConfig
231
+ def to_faraday_options; raise NotImplementedError; end
232
+ end
233
+
234
+ class GenericProxyConfig < ProxyConfig
235
+ def initialize(http_url: nil, https_url: nil); end
236
+ end
237
+
238
+ class WebshareProxyConfig < GenericProxyConfig
239
+ def initialize(proxy_username:, proxy_password:, **options); end
240
+ end
241
+ ```
242
+
243
+ ---
244
+
245
+ ## Phase 6: Testing ✅ COMPLETED
246
+
247
+ ### Task 6.1: Unit Tests for Each Component ✅
248
+
249
+ **Status:** Completed (All phases)
250
+
251
+ Test files:
252
+ - `spec/errors_spec.rb` - 15+ error classes tested
253
+ - `spec/settings_spec.rb` - Constants and settings
254
+ - `spec/transcript_spec.rb` - Transcript data classes
255
+ - `spec/transcript_parser_spec.rb` - XML parsing
256
+ - `spec/transcript_list_spec.rb` - TranscriptList operations
257
+ - `spec/transcript_list_fetcher_spec.rb` - HTTP fetching with WebMock
258
+ - `spec/api_spec.rb` - YouTubeTranscriptApi main entry point
259
+ - `spec/formatters_spec.rb` - All formatter classes
260
+
261
+ ### Task 6.2: Integration Tests ✅
262
+
263
+ **Status:** Completed
264
+ **File:** `spec/integration_spec.rb`
265
+
266
+ Integration tests with real YouTube video IDs (skipped by default):
267
+ ```bash
268
+ # Run integration tests:
269
+ INTEGRATION=1 bundle exec rspec spec/integration_spec.rb
270
+ ```
271
+
272
+ Test coverage:
273
+ - `YouTubeTranscriptApi#list` - Fetches real transcript list
274
+ - `YouTubeTranscriptApi#fetch` - Fetches real transcripts with language options
275
+ - `YouTubeTranscriptApi#fetch_all` - Batch fetching
276
+ - Convenience methods (`Youtube::Transcript::Rb.fetch`, `.list`)
277
+ - Transcript translation
278
+ - All formatters with real data (JSON, Text, SRT, WebVTT, PrettyPrint)
279
+ - Error handling (NoTranscriptFound, invalid video ID)
280
+ - FetchedTranscript interface (Enumerable, indexable, metadata)
281
+ - TranscriptList interface (Enumerable, find methods)
282
+ - Transcript object properties and methods
283
+
284
+ **Integration Test Results:** 31 examples, 0 failures, 2 pending (expected)
285
+
286
+ ---
287
+
288
+ ## Implementation Progress
289
+
290
+ ### Completed Phases
291
+
292
+ | Phase | Status | Files | Tests |
293
+ |-------|--------|-------|-------|
294
+ | Phase 1: Core Infrastructure | ✅ Completed | `errors.rb`, `settings.rb`, `transcript.rb`, `transcript_parser.rb` | 149 examples |
295
+ | Phase 2: Transcript Fetching | ✅ Completed | `transcript_list.rb`, `transcript_list_fetcher.rb` | 70 examples |
296
+ | Phase 3: Main API | ✅ Completed | `api.rb` | 33 examples |
297
+ | Phase 4: Formatters | ✅ Completed | `formatters.rb` | 53 examples |
298
+ | Phase 6: Integration Tests | ✅ Completed | `integration_spec.rb` | 31 examples |
299
+
300
+ ### Remaining Phases
301
+
302
+ | Phase | Status | Files | Estimated Effort |
303
+ |-------|--------|-------|------------------|
304
+ | Phase 5: Proxy Support | ❌ Optional | `proxies.rb` | 1.5 hours |
305
+
306
+ ### Git Commits
307
+
308
+ | Commit | Description |
309
+ |--------|-------------|
310
+ | Phase 1 | Core infrastructure - errors, settings, transcript classes, parser |
311
+ | `ccae0eb` | Phase 2 - TranscriptList and TranscriptListFetcher |
312
+ | `290c9d2` | Phase 3 - YouTubeTranscriptApi main entry point |
313
+ | `ec9c985` | Phase 4 - Formatters for transcript output |
314
+ | Phase 6 | Integration tests with real YouTube videos |
315
+
316
+ ### Current Test Summary
317
+
318
+ **Unit Tests:**
319
+ ```
320
+ 305 examples, 0 failures
321
+ ```
322
+
323
+ **Integration Tests (run with INTEGRATION=1):**
324
+ ```
325
+ 31 examples, 0 failures, 2 pending
326
+ ```
327
+
328
+ Test files:
329
+ - `spec/errors_spec.rb`
330
+ - `spec/settings_spec.rb`
331
+ - `spec/transcript_spec.rb`
332
+ - `spec/transcript_parser_spec.rb`
333
+ - `spec/transcript_list_spec.rb`
334
+ - `spec/transcript_list_fetcher_spec.rb`
335
+ - `spec/api_spec.rb`
336
+ - `spec/formatters_spec.rb`
337
+ - `spec/integration_spec.rb` (requires INTEGRATION=1)
338
+
339
+ ---
340
+
341
+ ## Technical Decisions
342
+
343
+ ### Ruby Idioms
344
+
345
+ | Python | Ruby |
346
+ |--------|------|
347
+ | `dataclass` | Plain class with `attr_reader` or `Struct` |
348
+ | `typing.Optional` | Sorbet/RBS types (optional) or YARD docs |
349
+ | `List[T]` | `Array` |
350
+ | `Dict[K, V]` | `Hash` |
351
+ | `requests.Session` | `Faraday::Connection` |
352
+ | `defusedxml.ElementTree` | `Nokogiri::XML` |
353
+ | `html.unescape` | `CGI.unescapeHTML` or Nokogiri |
354
+ | `re.compile` | `Regexp.new` |
355
+ | `json.dumps` | `JSON.generate` or `.to_json` |
356
+
357
+ ### Error Message Compatibility
358
+
359
+ Keep error messages close to those of the Python version, for easier debugging and user familiarity.
360
+
361
+ ### HTTP Client
362
+
363
+ Use Faraday with:
364
+ - `faraday-follow_redirects` for redirect handling
365
+ - Connection pooling for multiple requests
366
+ - Configurable timeout
367
+
368
+ ---
369
+
370
+ ## Dependencies
371
+
372
+ Already in gemspec:
373
+ - `faraday` (~> 2.0) - HTTP client
374
+ - `faraday-follow_redirects` (~> 0.3) - Redirect handling
375
+ - `nokogiri` (~> 1.15) - XML/HTML parsing
376
+
377
+ Standard library (no additional gems needed):
378
+ - `json` - JSON parsing/generation
379
+ - `cgi` - HTML unescaping
380
+ - `uri` - URL handling
381
+
382
+ ---
383
+
384
+ ## API Compatibility Goals
385
+
386
+ The Ruby API should feel natural to Ruby developers while maintaining conceptual compatibility with the Python version:
387
+
388
+ ```ruby
389
+ # Python
390
+ api = YouTubeTranscriptApi()
391
+ transcript = api.fetch("video_id", languages=["en"])
392
+
393
+ # Ruby (target)
394
+ api = Youtube::Transcript::Rb::YouTubeTranscriptApi.new
395
+ transcript = api.fetch("video_id", languages: ["en"])
396
+
397
+ # Ruby (convenience)
398
+ transcript = Youtube::Transcript::Rb.fetch("video_id", languages: ["en"])
399
+ ```
400
+
401
+ ---
402
+
403
+ ## Known Limitations
404
+
405
+ 1. **PO Token Requirement** - As of January 2026, YouTube requires PO tokens for some videos. This affects all transcript libraries.
406
+
407
+ 2. **Cookie Authentication** - Currently disabled in the Python version due to YouTube API changes. The Ruby port mirrors this limitation.
408
+
409
+ 3. **Rate Limiting** - YouTube rate-limits requests aggressively. Proxy support is essential for production use.
410
+
411
+ ---
412
+
413
+ ## Success Criteria
414
+
415
+ - [x] All tests pass (`bundle exec rspec`) - **305 examples, 0 failures**
416
+ - [x] Can fetch transcripts for public videos (YouTubeTranscriptApi implemented)
417
+ - [x] Language selection works correctly (TranscriptList.find_transcript)
418
+ - [x] Translation feature works (Transcript.translate)
419
+ - [x] All formatters produce correct output (JSON, Text, SRT, WebVTT, PrettyPrint)
420
+ - [x] Error handling matches expected behavior (15+ error classes)
421
+ - [x] README examples work as documented (README updated)
422
+ - [x] Integration tests pass with real YouTube videos (**31 examples, 0 failures**)