RubyGems - youtube-transcript-rb - Versions diffs - 0.1.0 - Mend

youtube-transcript-rb 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (35)

checksums.yaml +7 -0
data/.rspec +1 -0
data/.serena/.gitignore +1 -0
data/.serena/memories/code_style_and_conventions.md +35 -0
data/.serena/memories/project_overview.md +40 -0
data/.serena/memories/suggested_commands.md +50 -0
data/.serena/memories/task_completion_checklist.md +25 -0
data/.serena/memories/tech_stack.md +20 -0
data/.serena/project.yml +84 -0
data/LICENSE +21 -0
data/PLAN.md +422 -0
data/README.md +496 -0
data/Rakefile +4 -0
data/lib/youtube/transcript/rb/api.rb +150 -0
data/lib/youtube/transcript/rb/errors.rb +217 -0
data/lib/youtube/transcript/rb/formatters.rb +269 -0
data/lib/youtube/transcript/rb/settings.rb +28 -0
data/lib/youtube/transcript/rb/transcript.rb +239 -0
data/lib/youtube/transcript/rb/transcript_list.rb +170 -0
data/lib/youtube/transcript/rb/transcript_list_fetcher.rb +225 -0
data/lib/youtube/transcript/rb/transcript_parser.rb +83 -0
data/lib/youtube/transcript/rb/version.rb +9 -0
data/lib/youtube/transcript/rb.rb +37 -0
data/sig/youtube/transcript/rb.rbs +8 -0
data/spec/api_spec.rb +397 -0
data/spec/errors_spec.rb +240 -0
data/spec/formatters_spec.rb +436 -0
data/spec/integration_spec.rb +363 -0
data/spec/settings_spec.rb +67 -0
data/spec/spec_helper.rb +109 -0
data/spec/transcript_list_fetcher_spec.rb +520 -0
data/spec/transcript_list_spec.rb +380 -0
data/spec/transcript_parser_spec.rb +355 -0
data/spec/transcript_spec.rb +435 -0
metadata +118 -0

data/README.md ADDED

@@ -0,0 +1,496 @@
+<h1 align="center">
+  ✨ YouTube Transcript API (Ruby) ✨
+</h1>
+<p align="center">
+  <a href="http://opensource.org/licenses/MIT">
+    <img src="http://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat" alt="MIT license">
+  </a>
+  <a href="https://rubygems.org/gems/youtube-transcript-rb">
+    <img src="https://img.shields.io/gem/v/youtube-transcript-rb.svg" alt="Gem Version">
+  </a>
+  <a href="https://rubygems.org/gems/youtube-transcript-rb">
+    <img src="https://img.shields.io/badge/ruby-%3E%3D%203.2.0-ruby.svg" alt="Ruby Version">
+  </a>
+</p>
+<p align="center">
+  <b>This is a Ruby gem that retrieves the transcript/subtitles for a given YouTube video. It also works with automatically generated subtitles, supports translating subtitles, and does not require a headless browser, unlike Selenium-based solutions!</b>
+</p>
+<p align="center">
+  This is a Ruby port of the Python <a href="https://github.com/jdepoix/youtube-transcript-api">youtube-transcript-api</a> by jdepoix.
+</p>
+## Install
+Add this line to your application's Gemfile:
+```ruby
+gem 'youtube-transcript-rb'
+```
+And then execute:
+```
+bundle install
+```
+Or install it yourself as:
+```
+gem install youtube-transcript-rb
+```
+## API
+The easiest way to get a transcript for a given video is to execute:
+```ruby
+require 'youtube/transcript/rb'
+api = Youtube::Transcript::Rb::YouTubeTranscriptApi.new
+api.fetch(video_id)
+```
+> **Note:** By default, this will try to access the English transcript of the video. If your video has a different
+> language, or you are interested in fetching a transcript in a different language, please read the section below.
+> **Note:** Pass in the video ID, NOT the video URL. For a video with the URL `https://www.youtube.com/watch?v=12345`
+> the ID is `12345`.
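Since `fetch` expects an ID rather than a URL, user input can be normalised first. The sketch below is not part of the gem: `extract_video_id` is a hypothetical helper name, and it only handles the `watch?v=` and `youtu.be/` URL shapes, passing anything else through unchanged.

```ruby
require "uri"

# Hypothetical helper (not provided by youtube-transcript-rb).
# Returns the video ID for a watch URL or youtu.be short link,
# and returns the input unchanged when it does not look like a URL.
def extract_video_id(input)
  uri = URI.parse(input)
  return input unless uri.host                              # already a bare ID
  return uri.path.delete_prefix("/") if uri.host == "youtu.be"

  URI.decode_www_form(uri.query.to_s).to_h["v"]             # value of ?v=...
rescue URI::InvalidURIError
  input
end

puts extract_video_id("https://www.youtube.com/watch?v=12345") # prints 12345
puts extract_video_id("12345")                                 # prints 12345
```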
+The `fetch` call returns a `FetchedTranscript` object that looks something like this:
+```ruby
+#<Youtube::Transcript::Rb::FetchedTranscript
+  @video_id="12345",
+  @language="English",
+  @language_code="en",
+  @is_generated=false,
+  @snippets=[
+    #<Youtube::Transcript::Rb::TranscriptSnippet @text="Hey there", @start=0.0, @duration=1.54>,
+    #<Youtube::Transcript::Rb::TranscriptSnippet @text="how are you", @start=1.54, @duration=4.16>,
+    # ...
+  ]
+>
+```
+This object implements `Enumerable`, so you can iterate over it:
+```ruby
+api = Youtube::Transcript::Rb::YouTubeTranscriptApi.new
+fetched_transcript = api.fetch(video_id)
+# is iterable
+fetched_transcript.each do |snippet|
+  puts snippet.text
+end
+# indexable
+last_snippet = fetched_transcript[-1]
+# provides a length
+snippet_count = fetched_transcript.length
+```
+If you prefer to handle the raw transcript data you can call `fetched_transcript.to_raw_data`, which will return
+an array of hashes:
+```ruby
+[
+  {
+    'text' => 'Hey there',
+    'start' => 0.0,
+    'duration' => 1.54
+  },
+  {
+    'text' => 'how are you',
+    'start' => 1.54,
+    'duration' => 4.16
+  },
+  # ...
+]
+```
+### Convenience Methods
+You can also use the convenience methods on the module directly:
+```ruby
+require 'youtube/transcript/rb'
+# Fetch a transcript
+transcript = Youtube::Transcript::Rb.fetch(video_id)
+# List available transcripts
+transcript_list = Youtube::Transcript::Rb.list(video_id)
+```
+### Retrieve different languages
+You can add the `languages` parameter if you want to make sure the transcripts are retrieved in your desired language
+(it defaults to English).
+```ruby
+Youtube::Transcript::Rb::YouTubeTranscriptApi.new.fetch(video_id, languages: ['de', 'en'])
+```
+It's an array of language codes in descending priority. In this example it will first try to fetch the German
+transcript (`'de'`) and then fall back to the English transcript (`'en'`) if that fails. If you want to find out
+which languages are available first, [have a look at `list`](#list-available-transcripts).
+If you only want one language, you still need to pass the `languages` argument as an array:
+```ruby
+Youtube::Transcript::Rb::YouTubeTranscriptApi.new.fetch(video_id, languages: ['de'])
+```
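The priority rule described above can be illustrated without touching the network. This is a minimal sketch of the selection logic only, assuming the available language codes are already known; `pick_language` is a hypothetical name and not the gem's implementation.

```ruby
# Hypothetical illustration of "descending priority" language selection.
# Returns the first preferred code that is available, or nil if none match.
def pick_language(available_codes, preferred_codes)
  preferred_codes.find { |code| available_codes.include?(code) }
end

available = %w[en fr]
puts pick_language(available, %w[de en]).inspect # prints "en" (no German, falls back to English)
puts pick_language(available, %w[ja]).inspect    # prints nil (nothing matches)
```

In the gem itself, the no-match case is reported by raising `NoTranscriptFound` rather than returning `nil`, according to the error handling section of this README.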
+### Preserve formatting
+You can also add `preserve_formatting: true` if you'd like to keep HTML formatting elements such as `<i>` (italics)
+and `<b>` (bold).
+```ruby
+Youtube::Transcript::Rb::YouTubeTranscriptApi.new.fetch(video_id, languages: ['de', 'en'], preserve_formatting: true)
+```
+### List available transcripts
+If you want to list all transcripts which are available for a given video you can call:
+```ruby
+api = Youtube::Transcript::Rb::YouTubeTranscriptApi.new
+transcript_list = api.list(video_id)
+```
+This will return a `TranscriptList` object which is iterable and provides methods to filter the list of transcripts for
+specific languages and types, like:
+```ruby
+transcript = transcript_list.find_transcript(['de', 'en'])
+```
+By default, if a transcript in the requested language exists in both a manually created and an automatically generated
+version, this module chooses the manually created one. The `TranscriptList` allows you to bypass this
+default behaviour by searching for specific transcript types:
+```ruby
+# filter for manually created transcripts
+transcript = transcript_list.find_manually_created_transcript(['de', 'en'])
+# or automatically generated ones
+transcript = transcript_list.find_generated_transcript(['de', 'en'])
+```
+The methods `find_generated_transcript`, `find_manually_created_transcript`, `find_transcript` return `Transcript`
+objects. They contain metadata regarding the transcript:
+```ruby
+puts transcript.video_id
+puts transcript.language
+puts transcript.language_code
+# whether it has been manually created or generated by YouTube
+puts transcript.is_generated
+# whether this transcript can be translated or not
+puts transcript.translatable?
+# a list of languages the transcript can be translated to
+puts transcript.translation_languages
+```
+and provide a `fetch` method, which retrieves the actual transcript data:
+```ruby
+transcript.fetch
+```
+This returns a `FetchedTranscript` object, just like `YouTubeTranscriptApi.new.fetch` does.
+### Translate transcript
+YouTube has a feature which allows you to automatically translate subtitles. This module also makes it possible to
+access this feature. To do so `Transcript` objects provide a `translate` method, which returns a new translated
+`Transcript` object:
+```ruby
+transcript = transcript_list.find_transcript(['en'])
+translated_transcript = transcript.translate('de')
+puts translated_transcript.fetch
+```
+### By example
+```ruby
+require 'youtube/transcript/rb'
+api = Youtube::Transcript::Rb::YouTubeTranscriptApi.new
+# retrieve the available transcripts
+transcript_list = api.list('video_id')
+# iterate over all available transcripts
+transcript_list.each do |transcript|
+  # the Transcript object provides metadata properties
+  puts transcript.video_id
+  puts transcript.language
+  puts transcript.language_code
+  # whether it has been manually created or generated by YouTube
+  puts transcript.is_generated
+  # whether this transcript can be translated or not
+  puts transcript.translatable?
+  # a list of languages the transcript can be translated to
+  puts transcript.translation_languages
+  # fetch the actual transcript data
+  puts transcript.fetch
+  # translating the transcript will return another transcript object
+  puts transcript.translate('en').fetch if transcript.translatable?
+end
+# you can also directly filter for the language you are looking for, using the transcript list
+transcript = transcript_list.find_transcript(['de', 'en'])
+# or just filter for manually created transcripts
+transcript = transcript_list.find_manually_created_transcript(['de', 'en'])
+# or automatically generated ones
+transcript = transcript_list.find_generated_transcript(['de', 'en'])
+```
+### Fetch multiple videos
+You can fetch transcripts for multiple videos at once:
+```ruby
+api = Youtube::Transcript::Rb::YouTubeTranscriptApi.new
+# Fetch multiple videos
+transcripts = api.fetch_all(['video1', 'video2', 'video3'])
+transcripts.each do |video_id, transcript|
+  puts "#{video_id}: #{transcript.length} snippets"
+end
+# With error handling - continue even if some videos fail
+api.fetch_all(['video1', 'video2'], continue_on_error: true) do |video_id, result|
+  if result.is_a?(StandardError)
+    puts "Error for #{video_id}: #{result.message}"
+  else
+    puts "Got #{result.length} snippets for #{video_id}"
+  end
+end
+```
+## Using Formatters
+Formatters provide an additional layer of processing for the transcript you pass to them. The goal is to convert a
+`FetchedTranscript` object into a consistent string in a given format, such as basic text (`.txt`) or formats
+that have a defined specification, such as JSON (`.json`), WebVTT (`.vtt`), or SRT (`.srt`).
+The `Formatters` module provides a few basic formatters:
+- `JSONFormatter`
+- `PrettyPrintFormatter`
+- `TextFormatter`
+- `WebVTTFormatter`
+- `SRTFormatter`
+Here is how to import from the `Formatters` module:
+```ruby
+require 'youtube/transcript/rb'
+# Some provided formatter classes, each outputs a different string format.
+Youtube::Transcript::Rb::Formatters::JSONFormatter
+Youtube::Transcript::Rb::Formatters::TextFormatter
+Youtube::Transcript::Rb::Formatters::PrettyPrintFormatter
+Youtube::Transcript::Rb::Formatters::WebVTTFormatter
+Youtube::Transcript::Rb::Formatters::SRTFormatter
+```
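To make the idea of a format with a defined specification concrete, here is a rough sketch of how one SRT cue could be built from a raw snippet hash (the shape returned by `to_raw_data`). The helper names `srt_timestamp` and `srt_cue` are hypothetical, and the gem's own `SRTFormatter` may compute end times differently, for example when snippets overlap.

```ruby
# Hypothetical helpers, not the gem's SRTFormatter.
# Converts seconds (Float) into the SRT timestamp form HH:MM:SS,mmm.
def srt_timestamp(seconds)
  total_ms = (seconds * 1000).round
  hours, rest = total_ms.divmod(3_600_000)
  minutes, rest = rest.divmod(60_000)
  secs, millis = rest.divmod(1000)
  format("%02d:%02d:%02d,%03d", hours, minutes, secs, millis)
end

# Builds one SRT cue: index line, time range line, text line.
def srt_cue(index, snippet)
  finish = snippet["start"] + snippet["duration"]
  "#{index}\n#{srt_timestamp(snippet['start'])} --> #{srt_timestamp(finish)}\n#{snippet['text']}\n"
end

puts srt_cue(1, { "text" => "Hey there", "start" => 0.0, "duration" => 1.54 })
# prints:
# 1
# 00:00:00,000 --> 00:00:01,540
# Hey there
```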
+### Formatter Example
+Let's say we wanted to retrieve a transcript and store it to a JSON file. That would look something like this:
+```ruby
+require 'youtube/transcript/rb'
+api = Youtube::Transcript::Rb::YouTubeTranscriptApi.new
+transcript = api.fetch(video_id)
+formatter = Youtube::Transcript::Rb::Formatters::JSONFormatter.new
+# .format_transcript(transcript) turns the transcript into a JSON string.
+json_formatted = formatter.format_transcript(transcript)
+# Now we can write it out to a file.
+File.write('your_filename.json', json_formatted)
+# You should now have a new JSON file that you can easily read back into Ruby.
+```
+**Passing extra keyword arguments**
+Since `JSONFormatter` leverages `JSON.generate` you can also forward keyword arguments into
+`.format_transcript(transcript)` such as making your file output prettier:
+```ruby
+json_formatted = Youtube::Transcript::Rb::Formatters::JSONFormatter.new.format_transcript(
+  transcript,
+  indent: '  ',
+  space: ' '
+)
+```
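For reference, this is what those options do when passed to `JSON.generate` directly, using plain hashes in place of a transcript. It is only a sketch of the standard library call; whether the formatter forwards every option unchanged is taken from the README above rather than verified here. Note that `indent` is most useful together with `object_nl` and `array_nl`, which supply the line breaks.

```ruby
require "json"

# Plain hashes standing in for transcript snippets (same shape as to_raw_data).
snippets = [
  { "text" => "Hey there", "start" => 0.0, "duration" => 1.54 }
]

# Standard library generator options: indentation, space after ':',
# and newlines after object members and array elements.
pretty = JSON.generate(
  snippets,
  indent: "  ",
  space: " ",
  object_nl: "\n",
  array_nl: "\n"
)

puts pretty
```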
+### Using FormatterLoader
+You can also use the `FormatterLoader` to dynamically load formatters by name:
+```ruby
+require 'youtube/transcript/rb'
+loader = Youtube::Transcript::Rb::Formatters::FormatterLoader.new
+# Load by type name: "json", "pretty", "text", "webvtt", "srt"
+formatter = loader.load("json")
+output = formatter.format_transcript(transcript)
+formatter = loader.load("srt")
+File.write('transcript.srt', formatter.format_transcript(transcript))
+```
+### Custom Formatter Example
+You can implement your own formatter class. Just inherit from the `Formatter` base class and ensure you implement the
+`format_transcript` and `format_transcripts` methods, which should ultimately return a string:
+```ruby
+class MyCustomFormatter < Youtube::Transcript::Rb::Formatters::Formatter
+  def format_transcript(transcript, **options)
+    # Do your custom work in here, but return a string.
+    'your processed output data as a string.'
+  end
+  def format_transcripts(transcripts, **options)
+    # Do your custom work in here to format an array of transcripts, but return a string.
+    'your processed output data as a string.'
+  end
+end
+```
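As a more concrete variant of the skeleton above, the following sketch prefixes each line of text with a `[MM:SS]` timestamp. To stay runnable without the gem it is a standalone class that accepts arrays of raw snippet hashes (the `to_raw_data` shape); `TimestampedTextFormatter` is a hypothetical name. In a real project it would presumably inherit from `Youtube::Transcript::Rb::Formatters::Formatter` and call `transcript.to_raw_data` first.

```ruby
# Hypothetical formatter sketch; standalone so it runs without the gem.
class TimestampedTextFormatter
  # snippets: Array of hashes with 'text', 'start', 'duration' keys.
  def format_transcript(snippets, **_options)
    snippets.map { |s| "[#{clock(s['start'])}] #{s['text']}" }.join("\n")
  end

  # Formats several transcripts, separated by a blank line.
  def format_transcripts(transcripts, **options)
    transcripts.map { |t| format_transcript(t, **options) }.join("\n\n")
  end

  private

  # Converts seconds into MM:SS, dropping fractional seconds.
  def clock(seconds)
    minutes, secs = seconds.floor.divmod(60)
    format("%02d:%02d", minutes, secs)
  end
end

raw = [
  { "text" => "Hey there", "start" => 0.0, "duration" => 1.54 },
  { "text" => "how are you", "start" => 61.54, "duration" => 4.16 }
]
puts TimestampedTextFormatter.new.format_transcript(raw)
# prints:
# [00:00] Hey there
# [01:01] how are you
```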
+## Error Handling
+The library provides a comprehensive set of exceptions for different error scenarios:
+```ruby
+require 'youtube/transcript/rb'
+begin
+  transcript = Youtube::Transcript::Rb.fetch(video_id)
+rescue Youtube::Transcript::Rb::TranscriptsDisabled => e
+  puts "Subtitles are disabled for this video"
+rescue Youtube::Transcript::Rb::NoTranscriptFound => e
+  puts "No transcript found for the requested languages"
+  puts e.requested_language_codes
+rescue Youtube::Transcript::Rb::NoTranscriptAvailable => e
+  puts "No transcripts are available for this video"
+rescue Youtube::Transcript::Rb::VideoUnavailable => e
+  puts "The video is no longer available"
+rescue Youtube::Transcript::Rb::TooManyRequests => e
+  puts "Rate limited by YouTube"
+rescue Youtube::Transcript::Rb::RequestBlocked => e
+  puts "Request blocked by YouTube"
+rescue Youtube::Transcript::Rb::IpBlocked => e
+  puts "Your IP has been blocked by YouTube"
+rescue Youtube::Transcript::Rb::PoTokenRequired => e
+  puts "PO token required - this is a YouTube limitation"
+rescue Youtube::Transcript::Rb::CouldNotRetrieveTranscript => e
+  puts "Could not retrieve transcript: #{e.message}"
+end
+```
+### Available Exceptions
+| Exception | Description |
+|-----------|-------------|
+| `Error` | Base error class |
+| `CouldNotRetrieveTranscript` | Base class for transcript retrieval errors |
+| `YouTubeDataUnparsable` | YouTube data cannot be parsed |
+| `YouTubeRequestFailed` | HTTP request to YouTube failed |
+| `VideoUnplayable` | Video cannot be played |
+| `VideoUnavailable` | Video is no longer available |
+| `InvalidVideoId` | Invalid video ID provided |
+| `RequestBlocked` | YouTube is blocking requests |
+| `IpBlocked` | IP has been blocked by YouTube |
+| `TooManyRequests` | Rate limited (HTTP 429) |
+| `TranscriptsDisabled` | Subtitles are disabled for the video |
+| `AgeRestricted` | Video is age-restricted |
+| `NotTranslatable` | Transcript cannot be translated |
+| `TranslationLanguageNotAvailable` | Requested translation language not available |
+| `FailedToCreateConsentCookie` | Failed to create consent cookie |
+| `NoTranscriptFound` | No transcript found for requested languages |
+| `NoTranscriptAvailable` | No transcripts available for the video |
+| `PoTokenRequired` | PO token required to fetch transcript |
+## Working around IP bans (`RequestBlocked` or `IpBlocked` exception)
+Unfortunately, YouTube has started blocking most IPs that are known to belong to cloud providers (like AWS, Google Cloud
+Platform, Azure, etc.), which means you will most likely run into `RequestBlocked` or `IpBlocked` exceptions when
+deploying your code to any cloud platform. The same can happen to the IP of your self-hosted solution if you are making
+too many requests. You can work around these IP bans using proxies.
+> **Note:** Proxy support is planned for a future release.
+## Overwriting request defaults
+When initializing a `YouTubeTranscriptApi` object, it will create a Faraday HTTP client which will be used for all
+HTTP(S) requests. However, you can optionally pass a custom Faraday connection into its constructor:
+```ruby
+require 'faraday'
+http_client = Faraday.new do |conn|
+  conn.options.timeout = 60
+  conn.headers['Accept-Encoding'] = 'gzip, deflate'
+  conn.ssl.verify = true
+  conn.ssl.ca_file = '/path/to/certfile'
+  conn.adapter Faraday.default_adapter
+end
+api = Youtube::Transcript::Rb::YouTubeTranscriptApi.new(http_client: http_client)
+api.fetch(video_id)
+# Share same connection between two instances
+api_2 = Youtube::Transcript::Rb::YouTubeTranscriptApi.new(http_client: http_client)
+api_2.fetch(video_id)
+```
+## Warning
+This code uses an undocumented part of the YouTube API, which is called by the YouTube web client. So there is no
+guarantee that it won't stop working tomorrow if they change how things work. I will, however, do my best to get things
+working again as soon as possible if that happens. So if it stops working, let me know!
+## Development
+After checking out the repo, run `bin/setup` to install dependencies. Then, run `bundle exec rspec` to run the tests.
+You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+To install this gem onto your local machine, run `bundle exec rake install`.
+### Running Tests
+```
+bundle exec rspec
+```
+## Contributing
+Bug reports and pull requests are welcome on GitHub at https://github.com/stadia/youtube-transcript-rb.
+## License
+This project is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
+## Credits
+This is a Ruby port of [youtube-transcript-api](https://github.com/jdepoix/youtube-transcript-api) by jdepoix.

data/Rakefile ADDED

@@ -0,0 +1,4 @@
+# frozen_string_literal: true
+require "bundler/gem_tasks"
+task default: %i[]

data/lib/youtube/transcript/rb/api.rb ADDED

@@ -0,0 +1,150 @@
+# frozen_string_literal: true
+require "faraday"
+require "faraday/follow_redirects"
+module Youtube
+  module Transcript
+    module Rb
+      # Main entry point for fetching YouTube transcripts.
+      # This class provides a simple API for retrieving transcripts from YouTube videos.
+      #
+      # @example Basic usage
+      #   api = Youtube::Transcript::Rb::YouTubeTranscriptApi.new
+      #   transcript = api.fetch("dQw4w9WgXcQ")
+      #   transcript.each { |snippet| puts snippet.text }
+      #
+      # @example With language preference
+      #   api = Youtube::Transcript::Rb::YouTubeTranscriptApi.new
+      #   transcript = api.fetch("dQw4w9WgXcQ", languages: ["es", "en"])
+      #
+      # @example Listing available transcripts
+      #   api = Youtube::Transcript::Rb::YouTubeTranscriptApi.new
+      #   transcript_list = api.list("dQw4w9WgXcQ")
+      #   transcript_list.each { |t| puts t }
+      #
+      class YouTubeTranscriptApi
+        # Default timeout for HTTP requests in seconds
+        DEFAULT_TIMEOUT = 30
+        # @param http_client [Faraday::Connection, nil] Custom HTTP client (optional)
+        # @param proxy_config [Object, nil] Proxy configuration (optional)
+        def initialize(http_client: nil, proxy_config: nil)
+          @http_client = http_client || build_default_http_client
+          @proxy_config = proxy_config
+          @fetcher = TranscriptListFetcher.new(
+            http_client: @http_client,
+            proxy_config: @proxy_config
+          )
+        end
+        # Fetch a transcript for a video.
+        # This is a convenience method that combines `list` and `find_transcript`.
+        #
+        # @param video_id [String] The YouTube video ID
+        # @param languages [Array<String>] Language codes in order of preference (default: ["en"])
+        # @param preserve_formatting [Boolean] Whether to preserve HTML formatting (default: false)
+        # @return [FetchedTranscript] The fetched transcript
+        # @raise [NoTranscriptFound] If no transcript matches the requested languages
+        # @raise [TranscriptsDisabled] If transcripts are disabled for the video
+        # @raise [VideoUnavailable] If the video is not available
+        #
+        # @example
+        #   api = YouTubeTranscriptApi.new
+        #   transcript = api.fetch("dQw4w9WgXcQ", languages: ["en", "es"])
+        #   puts transcript.first.text
+        #
+        def fetch(video_id, languages: ["en"], preserve_formatting: false)
+          list(video_id)
+            .find_transcript(languages)
+            .fetch(preserve_formatting: preserve_formatting)
+        end
+        # List all available transcripts for a video.
+        #
+        # @param video_id [String] The YouTube video ID
+        # @return [TranscriptList] A list of available transcripts
+        # @raise [TranscriptsDisabled] If transcripts are disabled for the video
+        # @raise [VideoUnavailable] If the video is not available
+        #
+        # @example
+        #   api = YouTubeTranscriptApi.new
+        #   transcript_list = api.list("dQw4w9WgXcQ")
+        #
+        #   # Find a specific transcript
+        #   transcript = transcript_list.find_transcript(["en"])
+        #
+        #   # Or iterate over all available transcripts
+        #   transcript_list.each do |transcript|
+        #     puts "#{transcript.language_code}: #{transcript.language}"
+        #   end
+        #
+        def list(video_id)
+          @fetcher.fetch(video_id)
+        end
+        # Fetch transcripts for multiple videos.
+        #
+        # @param video_ids [Array<String>] Array of YouTube video IDs
+        # @param languages [Array<String>] Language codes in order of preference (default: ["en"])
+        # @param preserve_formatting [Boolean] Whether to preserve HTML formatting (default: false)
+        # @param continue_on_error [Boolean] Whether to continue if a video fails (default: false)
+        # @yield [video_id, result] Block called for each video with either transcript or error
+        # @yieldparam video_id [String] The video ID being processed
+        # @yieldparam result [FetchedTranscript, StandardError] The transcript or error
+        # @return [Hash<String, FetchedTranscript>] Hash mapping video IDs to transcripts
+        # @raise [CouldNotRetrieveTranscript] If any video fails and continue_on_error is false
+        #
+        # @example Fetch multiple videos
+        #   api = YouTubeTranscriptApi.new
+        #   transcripts = api.fetch_all(["video1", "video2", "video3"])
+        #   transcripts.each { |id, t| puts "#{id}: #{t.length} snippets" }
+        #
+        # @example With error handling
+        #   api = YouTubeTranscriptApi.new
+        #   api.fetch_all(["video1", "video2"], continue_on_error: true) do |video_id, result|
+        #     if result.is_a?(StandardError)
+        #       puts "Error for #{video_id}: #{result.message}"
+        #     else
+        #       puts "Got #{result.length} snippets for #{video_id}"
+        #     end
+        #   end
+        #
+        def fetch_all(video_ids, languages: ["en"], preserve_formatting: false, continue_on_error: false)
+          results = {}
+          video_ids.each do |video_id|
+            begin
+              transcript = fetch(video_id, languages: languages, preserve_formatting: preserve_formatting)
+              results[video_id] = transcript
+              yield(video_id, transcript) if block_given?
+            rescue CouldNotRetrieveTranscript => e
+              if continue_on_error
+                yield(video_id, e) if block_given?
+              else
+                raise
+              end
+            end
+          end
+          results
+        end
+        private
+        # Build the default Faraday HTTP client
+        #
+        # @return [Faraday::Connection] The configured HTTP client
+        def build_default_http_client
+          Faraday.new do |conn|
+            conn.options.timeout = DEFAULT_TIMEOUT
+            conn.options.open_timeout = DEFAULT_TIMEOUT
+            conn.request :url_encoded
+            conn.response :follow_redirects
+            conn.adapter Faraday.default_adapter
+          end
+        end
+      end
+    end
+  end
+end
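
One consequence of `fetch_all` as written above: with `continue_on_error: true`, failed videos are left out of the returned hash and their errors are only passed to the block, so a caller that gives no block never sees them. A small wrapper can collect both sides. This is a sketch; `fetch_with_errors` is a hypothetical name, and `api` is assumed to be any object responding to `fetch_all` with the signature shown in the diff.

```ruby
# Hypothetical wrapper around an object that responds to fetch_all
# (for example a YouTubeTranscriptApi instance).
# Returns [transcripts_by_id, errors_by_id].
def fetch_with_errors(api, video_ids, **options)
  errors = {}
  transcripts = api.fetch_all(video_ids, continue_on_error: true, **options) do |video_id, result|
    # The block receives either a transcript or the rescued error.
    errors[video_id] = result if result.is_a?(StandardError)
  end
  [transcripts, errors]
end
```

Usage would look like `transcripts, errors = fetch_with_errors(api, %w[video1 video2])`, after which `errors` maps each failed video ID to the exception raised for it.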