omniai-google 2.6.4 → 2.7.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 9b4a06d7346381cd281088f6e95cb85cf0142f286ff16a26e6681447ba5a0401
-   data.tar.gz: 5ada4e7eb1e51bbbb01e1981917ccc3331500a847e45586b4a3618b51d26992a
+   metadata.gz: a59cb156c5493eb277616a5d536e9fa4c71dd80d256b8b8267d5402605455770
+   data.tar.gz: 6d985ccfce9f354358ae4206cd840efe77d0a9e3c154404d1d63e5e0b511b0c9
  SHA512:
-   metadata.gz: 9d929c63f5008bc8c23261d2cdde88f56aca8b17ce1524443915b4181d590fa6b6778b02719ed03f6342eec94c406e64c5b76b411900884e3dae5349a0eb1edf
-   data.tar.gz: 1f68746ee88d7c17804a125b4d48c49b0309438408e555172889e28b7e408b9823bd63226d091d2c5159046a973a8d8e932f1ae4dcc64a3a62ebe68c979c2834
+   metadata.gz: f7e6377dce35f560a64a792d22f804335d7a793800d0ca915ee3937689eb3c8f9f4877a9f202706233a82cc4fee5c32ec868d155d2a50b611e6b667545975ae6
+   data.tar.gz: '0154897afa7d80ae011b95248a1c8a282bf402329fa6b3e7eb0d6dc1b4e7400bd9d819b56a7bfb3f2494a084db7f97ab6d9d4e141fe6ff5233760bd738a8d8a9'
data/README.md CHANGED
@@ -58,6 +58,8 @@ OmniAI::Google.configure do |config|
  end
  ```
 
+ **Note for Transcription**: When using transcription features, ensure your service account has the necessary permissions for the Google Cloud Speech-to-Text API and Google Cloud Storage (for automatic file uploads). See the [GCS Setup](#gcs-setup-for-transcription) section below for detailed configuration.
+
  Credentials may be configured using:
 
  1. A `File` / `String` / `Pathname`.
@@ -143,6 +145,204 @@ end
 
  [Google API Reference `stream`](https://ai.google.dev/gemini-api/docs/api-overview#stream)
 
+ ### Transcribe
+
+ Audio files can be transcribed using Google's Speech-to-Text API. The implementation automatically chooses between synchronous and asynchronous recognition based on file size and model type.
+
+ #### Basic Usage
+
+ ```ruby
+ # Transcribe a local audio file
+ result = client.transcribe("path/to/audio.mp3")
+ result.text # "Hello, this is the transcribed text..."
+
+ # Transcribe with a specific model
+ result = client.transcribe("path/to/audio.mp3", model: "latest_long")
+ result.text # "Hello, this is the transcribed text..."
+ ```
+
+ #### Multi-Language Detection
+
+ Transcription automatically detects multiple languages when no specific language is provided:
+
+ ```ruby
+ # Auto-detect English and Spanish
+ result = client.transcribe("bilingual_audio.mp3", model: "latest_long")
+ result.text # "Hello, how are you? Hola, ¿cómo estás?"
+
+ # Specify expected languages explicitly
+ result = client.transcribe("audio.mp3", language: ["en-US", "es-US"], model: "latest_long")
+ ```
+
+ #### Detailed Transcription with Timestamps
+
+ Use the `VERBOSE_JSON` format to get per-segment timing and confidence information in addition to the full transcript:
+
+ ```ruby
+ result = client.transcribe("audio.mp3",
+   model: "latest_long",
+   format: OmniAI::Transcribe::Format::VERBOSE_JSON
+ )
+
+ # Access the full transcript
+ result.text # "Complete transcribed text..."
+
+ # Access detailed segment information (id, text, start, end, confidence)
+ result.segments.each do |segment|
+   puts "Segment #{segment[:id]}: #{segment[:text]}"
+   puts "Confidence: #{segment[:confidence]}"
+   puts "Start time: #{segment[:start]}"
+   puts "End time: #{segment[:end]}"
+ end
+
+ # Total audio duration (in seconds)
+ puts "Total duration: #{result.duration}"
+ ```
+
+ #### Models
+
+ Transcription supports various models optimized for different use cases:
+
+ ```ruby
+ # For short audio (< 60 seconds)
+ client.transcribe("short_audio.mp3", model: OmniAI::Google::Transcribe::Model::LATEST_SHORT)
+
+ # For long-form audio (> 60 seconds) - automatically uses async processing
+ client.transcribe("long_audio.mp3", model: OmniAI::Google::Transcribe::Model::LATEST_LONG)
+
+ # For phone/telephony audio
+ client.transcribe("phone_call.mp3", model: OmniAI::Google::Transcribe::Model::TELEPHONY_LONG)
+
+ # For medical conversations
+ client.transcribe("medical_interview.mp3", model: OmniAI::Google::Transcribe::Model::MEDICAL_CONVERSATION)
+
+ # Other available models
+ client.transcribe("audio.mp3", model: OmniAI::Google::Transcribe::Model::CHIRP_2) # Enhanced model
+ client.transcribe("audio.mp3", model: OmniAI::Google::Transcribe::Model::CHIRP) # Universal model
+ ```
+
+ **Available Model Constants:**
+ - `OmniAI::Google::Transcribe::Model::LATEST_SHORT` - Optimized for audio < 60 seconds
+ - `OmniAI::Google::Transcribe::Model::LATEST_LONG` - Optimized for long-form audio
+ - `OmniAI::Google::Transcribe::Model::TELEPHONY_SHORT` - For short phone calls
+ - `OmniAI::Google::Transcribe::Model::TELEPHONY_LONG` - For long phone calls
+ - `OmniAI::Google::Transcribe::Model::MEDICAL_CONVERSATION` - For medical conversations
+ - `OmniAI::Google::Transcribe::Model::MEDICAL_DICTATION` - For medical dictation
+ - `OmniAI::Google::Transcribe::Model::CHIRP_2` - Enhanced universal model
+ - `OmniAI::Google::Transcribe::Model::CHIRP` - Universal model
+
+ #### Supported Formats
+
+ - **Input**: MP3, WAV, FLAC, and other common audio formats
+ - **GCS URIs**: Direct transcription from Google Cloud Storage (see the sketch below)
+ - **File uploads**: Automatic upload to GCS for files > 10MB or when using long-form models
+
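+ A minimal sketch of these input types (the bucket and file names here are placeholders):
+
+ ```ruby
+ # Transcribe directly from an existing GCS object - no upload is performed
+ result = client.transcribe("gs://your-project-id-speech-audio/meeting.mp3")
+
+ # File objects are accepted too; large files are uploaded to GCS automatically
+ File.open("meeting.mp3", "rb") do |file|
+   result = client.transcribe(file, model: "latest_long")
+ end
+ ```
+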
+ #### Advanced Features
+
+ **Automatic Processing Selection:**
+ - Audio shorter than 60 seconds: uses synchronous recognition
+ - Audio longer than 60 seconds, or long-form models: uses asynchronous batch recognition
+ - Large files: automatically uploaded to Google Cloud Storage
+
+ **GCS Integration:**
+ - Automatic file upload and cleanup
+ - Support for existing GCS URIs
+ - Configurable bucket names
+
+ **Error Handling** (see the sketch below):
+ - Automatic retry logic for temporary failures
+ - Clear error messages for common issues
+ - Graceful handling of network timeouts
+
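+ A minimal sketch of handling transcription failures (the error classes come from this gem's source; the rescue granularity is an assumption):
+
+ ```ruby
+ begin
+   result = client.transcribe("audio.mp3", model: "latest_long")
+ rescue OmniAI::Google::Bucket::UploadError => e
+   # Raised when the GCS bucket is missing or an upload fails
+   warn "Upload failed: #{e.message}"
+ rescue OmniAI::HTTPError => e
+   # Raised for API errors and for batch jobs that time out while polling
+   warn "Transcription failed: #{e.message}"
+ end
+ ```
+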
+ [Google Speech-to-Text API Reference](https://cloud.google.com/speech-to-text/docs)
+
+ #### GCS Setup for Transcription
+
+ For transcription to work properly with automatic file uploads, you need to set up Google Cloud Storage and configure the appropriate permissions.
+
+ ##### 1. Create a GCS Bucket
+
+ You must create a bucket named `{project_id}-speech-audio` manually before using transcription features:
+
+ ```bash
+ # Using the gsutil CLI
+ gsutil mb gs://your-project-id-speech-audio
+
+ # Or create via Google Cloud Console
+ # Navigate to Cloud Storage > Browser > Create Bucket
+ ```
+
+ ##### 2. Service Account Permissions
+
+ Your service account needs the following IAM roles for transcription to work:
+
+ **Required Roles:**
+ - **Cloud Speech Editor** - Grants access to edit resources in Speech-to-Text
+ - **Storage Bucket Viewer** - Grants permission to view buckets and their metadata, excluding IAM policies
+ - **Storage Object Admin** - Grants full control over objects, including listing, creating, viewing, and deleting objects
+
+ **To assign roles via gcloud CLI:**
+
+ ```bash
+ # Replace YOUR_SERVICE_ACCOUNT_EMAIL and YOUR_PROJECT_ID with actual values
+ SERVICE_ACCOUNT="your-service-account@your-project-id.iam.gserviceaccount.com"
+ PROJECT_ID="your-project-id"
+
+ # Grant Speech-to-Text permissions
+ gcloud projects add-iam-policy-binding $PROJECT_ID \
+   --member="serviceAccount:$SERVICE_ACCOUNT" \
+   --role="roles/speech.editor"
+
+ # Grant Storage permissions
+ gcloud projects add-iam-policy-binding $PROJECT_ID \
+   --member="serviceAccount:$SERVICE_ACCOUNT" \
+   --role="roles/storage.objectAdmin"
+
+ gcloud projects add-iam-policy-binding $PROJECT_ID \
+   --member="serviceAccount:$SERVICE_ACCOUNT" \
+   --role="roles/storage.legacyBucketReader"
+ ```
+
+ **Or via Google Cloud Console:**
+ 1. Go to IAM & Admin > IAM
+ 2. Find your service account
+ 3. Click "Edit Principal"
+ 4. Add the required roles listed above
+
+ ##### 3. Enable Required APIs
+
+ Ensure the following APIs are enabled in your Google Cloud project:
+
+ ```bash
+ # Enable Speech-to-Text API
+ gcloud services enable speech.googleapis.com
+
+ # Enable Cloud Storage API
+ gcloud services enable storage.googleapis.com
+ ```
+
+ ##### 4. Bucket Configuration (Optional)
+
+ You can customize the bucket name by configuring it in your application:
+
+ ```ruby
+ # Custom bucket name in your transcription calls
+ # The bucket must exist and your service account must have access
+ client.transcribe("audio.mp3", bucket_name: "my-custom-audio-bucket")
+ ```
+
+ **Important Notes:**
+ - The default bucket name follows the pattern `{project_id}-speech-audio`
+ - You must create the bucket manually before using transcription features
+ - Choose an appropriate region for your bucket based on your location and compliance requirements
+ - Audio files are automatically deleted after successful transcription
+ - If transcription fails, temporary files may remain and should be cleaned up manually (see the sketch below)
+
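+ A minimal cleanup sketch for leftover uploads (the `audio_` prefix matches the naming used by `OmniAI::Google::Bucket`; the bucket name is a placeholder):
+
+ ```ruby
+ require "google/cloud/storage"
+
+ storage = Google::Cloud::Storage.new(project_id: "your-project-id")
+ bucket = storage.bucket("your-project-id-speech-audio")
+
+ # Temporary uploads are named audio_<timestamp>_<suffix>.<extension>
+ bucket.files(prefix: "audio_").each(&:delete)
+ ```
+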
  ### Embed
 
  Text can be converted into a vector embedding for similarity comparison usage via:
@@ -0,0 +1,115 @@
+ # frozen_string_literal: true
+
+ require "google/cloud/storage"
+
+ module OmniAI
+   module Google
+     # Uploads audio files to Google Cloud Storage for transcription.
+     class Bucket
+       class UploadError < StandardError; end
+
+       # @param client [Client]
+       # @param io [File, IO, StringIO, String]
+       # @param bucket_name [String] optional - bucket name (defaults to project_id-speech-audio)
+       def self.process!(client:, io:, bucket_name: nil)
+         new(client:, io:, bucket_name:).process!
+       end
+
+       # @param client [Client]
+       # @param io [File, IO, StringIO, String]
+       # @param bucket_name [String] optional - bucket name
+       def initialize(client:, io:, bucket_name: nil)
+         @client = client
+         @io = io
+         @bucket_name = bucket_name || default_bucket_name
+       end
+
+       # @raise [UploadError]
+       #
+       # @return [String] GCS URI (gs://bucket/object)
+       def process!
+         # Create a storage client with the same credentials as the main client
+         credentials = @client.instance_variable_get(:@credentials)
+         storage = ::Google::Cloud::Storage.new(
+           project_id:,
+           credentials:
+         )
+
+         # Get the bucket (don't auto-create if it doesn't exist)
+         bucket = storage.bucket(@bucket_name)
+         unless bucket
+           raise UploadError, "Bucket '#{@bucket_name}' not found. " \
+                              "Please create it manually or ensure the service account has access."
+         end
+
+         # Generate a unique filename
+         timestamp = Time.now.strftime("%Y%m%d_%H%M%S")
+         random_suffix = SecureRandom.hex(4)
+         filename = "audio_#{timestamp}_#{random_suffix}.#{file_extension}"
+
+         # Upload the file - create a StringIO for binary content
+         content = file_content
+         if content.is_a?(String) && content.include?("\0")
+           # Binary content - wrap in StringIO
+           require "stringio"
+           content = StringIO.new(content)
+         end
+
+         bucket.create_file(content, filename)
+
+         # Return the GCS URI
+         "gs://#{@bucket_name}/#{filename}"
+       rescue ::Google::Cloud::Error => e
+         raise UploadError, "Failed to upload to GCS: #{e.message}"
+       end
+
+       private
+
+       # @return [String]
+       def project_id
+         @client.instance_variable_get(:@project_id) ||
+           raise(ArgumentError, "project_id is required for GCS upload")
+       end
+
+       # @return [String]
+       def location_id
+         @client.instance_variable_get(:@location_id) || "global"
+       end
+
+       # @return [String]
+       def default_bucket_name
+         "#{project_id}-speech-audio"
+       end
+
+       # @return [String]
+       def file_content
+         case @io
+         when String
+           # Check if it's a file path or binary content
+           if @io.include?("\0") || !File.exist?(@io)
+             # It's binary content, return as-is
+             @io
+           else
+             # It's a file path, read the file
+             File.read(@io)
+           end
+         when File, IO, StringIO
+           @io.rewind if @io.respond_to?(:rewind)
+           @io.read
+         else
+           raise ArgumentError, "Unsupported input type: #{@io.class}"
+         end
+       end
+
+       # @return [String]
+       def file_extension
+         case @io
+         when String
+           File.extname(@io)[1..] || "wav"
+         else
+           "wav" # Default extension
+         end
+       end
+     end
+   end
+ end
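
A minimal usage sketch for the uploader above (assuming `client` is a configured `OmniAI::Google::Client` with a `project_id` and credentials, and that the default bucket exists):

```ruby
uri = OmniAI::Google::Bucket.process!(client:, io: File.open("audio.mp3", "rb"))
uri # => e.g. "gs://your-project-id-speech-audio/audio_20250615_103000_ab12cd34.mp3"
```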
@@ -54,6 +54,8 @@ module OmniAI
  #
  # @param candidate [Hash]
  def process_candidate!(candidate:, index:, &block)
+   return unless candidate["content"]
+
    candidate["content"]["parts"].each do |part|
      block&.call(OmniAI::Chat::Delta.new(text: part["text"])) if part["text"]
    end
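
The new guard tolerates candidates that arrive without a `content` key. A sketch of the case being handled (hypothetical payload; Gemini can return such candidates when generation stops early, e.g. on safety filters):

```ruby
candidate = { "finishReason" => "SAFETY" } # no "content" key
# Previously candidate["content"]["parts"] raised NoMethodError (undefined
# method `[]' for nil); with the guard, process_candidate! returns early.
```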
@@ -15,7 +15,7 @@ module OmniAI
  module Model
    GEMINI_1_0_PRO = "gemini-1.0-pro"
    GEMINI_1_5_PRO = "gemini-1.5-pro"
-   GEMINI_2_5_PRO = "gemini-2.5-pro-exp-03-25"
+   GEMINI_2_5_PRO = "gemini-2.5-pro-preview-06-05"
    GEMINI_1_5_FLASH = "gemini-1.5-flash"
    GEMINI_2_0_FLASH = "gemini-2.0-flash"
    GEMINI_2_5_FLASH = "gemini-2.5-flash-preview-04-17"
@@ -88,6 +88,16 @@ module OmniAI
    Embed.process!(input, model:, client: self)
  end
 
+ # @raise [OmniAI::Error]
+ #
+ # @param input [String, File, IO] required - audio file path, file object, or GCS URI
+ # @param model [String] optional
+ # @param language [String, Array<String>] optional - language codes for transcription
+ # @param format [Symbol] optional - :json or :verbose_json
+ def transcribe(input, model: Transcribe::DEFAULT_MODEL, language: nil, format: nil)
+   Transcribe.process!(input, model:, language:, format:, client: self)
+ end
+
  # @return [String]
  def path
    if @project_id && @location_id
@@ -55,6 +55,14 @@ module OmniAI
      def credentials=(value)
        @credentials = Credentials.parse(value)
      end
+
+     # @return [Hash]
+     def transcribe_options
+       @transcribe_options ||= {}
+     end
+
+     # @param value [Hash]
+     attr_writer :transcribe_options
    end
  end
 end
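
A sketch of the new configuration hook (the nested key below is illustrative; per `build_config` in the transcribe helpers, whatever hash you set is merged into the Speech-to-Text v2 recognition config):

```ruby
OmniAI::Google.configure do |config|
  # Merged into the recognition config for every transcription request.
  config.transcribe_options = {
    transcriptNormalization: { entries: [{ search: "hte", replace: "the" }] },
  }
end
```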
@@ -0,0 +1,143 @@
+ # frozen_string_literal: true
+
+ module OmniAI
+   module Google
+     # A Google transcribe implementation.
+     #
+     # Usage:
+     #
+     #   OmniAI::Google::Transcribe.process!(io, client:, model:, language:, format:)
+     class Transcribe < OmniAI::Transcribe
+       include TranscribeHelpers
+
+       module Model
+         CHIRP_2 = "chirp_2"
+         CHIRP = "chirp"
+         LATEST_LONG = "latest_long"
+         LATEST_SHORT = "latest_short"
+         TELEPHONY_LONG = "telephony_long"
+         TELEPHONY_SHORT = "telephony_short"
+         MEDICAL_CONVERSATION = "medical_conversation"
+         MEDICAL_DICTATION = "medical_dictation"
+       end
+
+       DEFAULT_MODEL = Model::LATEST_SHORT
+       DEFAULT_RECOGNIZER = "_"
+
+       # @return [Context]
+       CONTEXT = Context.build do |context|
+         # No custom deserializers needed - let base class handle parsing
+       end
+
+       # @raise [HTTPError]
+       #
+       # @return [OmniAI::Transcribe::Transcription]
+       def process!
+         if needs_async_recognition?
+           process_async!
+         else
+           process_sync!
+         end
+       end
+
+       private
+
+       # @return [Boolean]
+       def needs_async_recognition?
+         # Use async for long-form models or when GCS is needed
+         needs_long_form_recognition? || needs_gcs_upload?
+       end
+
+       # @raise [HTTPError]
+       #
+       # @return [OmniAI::Transcribe::Transcription]
+       def process_sync!
+         response = request!
+         handle_sync_response_errors(response)
+
+         data = response.parse
+         transcript = data.dig("results", 0, "alternatives", 0, "transcript") || ""
+
+         transformed_data = build_sync_response_data(data, transcript)
+         Transcription.parse(model: @model, format: @format, data: transformed_data)
+       end
+
+       # @raise [HTTPError]
+       #
+       # @return [OmniAI::Transcribe::Transcription]
+       def process_async!
+         # Track if we uploaded the file for cleanup
+         uploaded_gcs_uri = nil
+
+         # Start the batch recognition job
+         response = request_batch!
+
+         raise HTTPError, response unless response.status.ok?
+
+         operation_data = response.parse
+         operation_name = operation_data["name"]
+
+         raise HTTPError, "No operation name returned from batch recognition request" unless operation_name
+
+         # Extract the GCS URI for cleanup if we uploaded it
+         if operation_data.dig("metadata", "batchRecognizeRequest", "files")
+           file_uri = operation_data.dig("metadata", "batchRecognizeRequest", "files", 0, "uri")
+           # Only mark for cleanup if it's not a user-provided GCS URI
+           uploaded_gcs_uri = file_uri unless @io.is_a?(String) && @io.start_with?("gs://")
+         end
+
+         # Poll for completion
+         result = poll_operation!(operation_name)
+
+         # Extract the transcript from the completed operation
+         transcript_data = extract_batch_transcript(result)
+
+         # Clean up the uploaded file if we created it
+         cleanup_gcs_file(uploaded_gcs_uri) if uploaded_gcs_uri
+
+         Transcription.parse(model: @model, format: @format, data: transcript_data)
+       end
+
+       protected
+
+       # @return [Context]
+       def context
+         CONTEXT
+       end
+
+       # @return [HTTP::Response]
+       def request!
+         # Speech-to-Text API uses different endpoints for regional vs global
+         endpoint = speech_endpoint
+         speech_connection = HTTP.persistent(endpoint)
+           .timeout(connect: @client.timeout, write: @client.timeout, read: @client.timeout)
+           .accept(:json)
+
+         # Add authentication if using credentials
+         speech_connection = speech_connection.auth("Bearer #{@client.send(:auth).split.last}") if @client.credentials?
+
+         speech_connection.post(path, params:, json: payload)
+       end
+
+       # @return [Hash]
+       def payload
+         config = build_config
+         payload_data = { config: }
+         add_audio_data(payload_data)
+         payload_data
+       end
+
+       # @return [String]
+       def path
+         # Always use Speech-to-Text API v2 with recognizers
+         recognizer_path = "projects/#{project_id}/locations/#{location_id}/recognizers/#{recognizer_name}"
+         "/v2/#{recognizer_path}:recognize"
+       end
+
+       # @return [Hash]
+       def params
+         { key: (@client.api_key unless @client.credentials?) }.compact
+       end
+     end
+   end
+ end
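
The class-level entry point is equivalent to `client.transcribe(...)`; the signature below mirrors the call in `Client#transcribe` shown earlier:

```ruby
transcription = OmniAI::Google::Transcribe.process!(
  "audio.mp3",
  model: OmniAI::Google::Transcribe::Model::LATEST_LONG,
  language: "en-US",
  format: nil,
  client: client # assumes a configured OmniAI::Google::Client
)
transcription.text
```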
@@ -0,0 +1,456 @@
+ # frozen_string_literal: true
+
+ module OmniAI
+   module Google
+     # Helper methods for transcription functionality
+     module TranscribeHelpers # rubocop:disable Metrics/ModuleLength
+       private
+
+       # @return [String]
+       def project_id
+         @client.instance_variable_get(:@project_id) ||
+           raise(ArgumentError, "project_id is required for transcription")
+       end
+
+       # @return [String]
+       def location_id
+         @client.instance_variable_get(:@location_id) || "global"
+       end
+
+       # @return [String]
+       def speech_endpoint
+         location_id == "global" ? "https://speech.googleapis.com" : "https://#{location_id}-speech.googleapis.com"
+       end
+
+       # @return [Array<String>, nil]
+       def language_codes
+         case @language
+         when String
+           [@language] unless @language.strip.empty?
+         when Array
+           cleaned = @language.compact.reject(&:empty?)
+           cleaned if cleaned.any?
+         when nil, ""
+           nil # Auto-detect language when not specified
+         else
+           ["en-US"] # Default to English (multi-language only supported in global/us/eu locations)
+         end
+       end
+
+       # @param input [String, Pathname, File, IO]
+       # @return [String] Base64 encoded audio content
+       def encode_audio(input)
+         case input
+         when String
+           if File.exist?(input)
+             Base64.strict_encode64(File.read(input))
+           else
+             input # Assume it's already base64 encoded
+           end
+         when Pathname, File, IO, StringIO
+           Base64.strict_encode64(input.read)
+         else
+           raise ArgumentError, "Unsupported input type: #{input.class}"
+         end
+       end
+
+       # @return [Boolean]
+       def needs_gcs_upload?
+         return false if @io.is_a?(String) && @io.start_with?("gs://")
+
+         file_size = calculate_file_size
+         # Force GCS upload for files > 10MB or if using long models for longer audio
+         file_size > 10_000_000 || needs_long_form_recognition?
+       end
+
+       # @return [Boolean]
+       def needs_long_form_recognition?
+         # Use long-form models for potentially longer audio files
+         return true if @model&.include?("long")
+
+         # Chirp models process speech in larger chunks and prefer BatchRecognize
+         return true if @model&.include?("chirp")
+
+         # For large files, assume they might be longer than 60 seconds
+         # Approximate: files larger than 1MB might be longer than 60 seconds
+         calculate_file_size > 1_000_000
+       end
+
+       # @return [Integer]
+       def calculate_file_size
+         case @io
+         when String
+           File.exist?(@io) ? File.size(@io) : 0
+         when File, IO, StringIO
+           @io.respond_to?(:size) ? @io.size : 0
+         else
+           0
+         end
+       end
+
+       # @return [Hash]
+       def build_config
+         config = {
+           model: @model,
+           autoDecodingConfig: {},
+         }
+
+         # Only include languageCodes if specified and non-empty (omit for auto-detection)
+         lang_codes = language_codes
+         config[:languageCodes] = if lang_codes&.any?
+                                    lang_codes
+                                  else
+                                    # Handle language detection based on model capabilities
+                                    default_language_codes
+                                  end
+
+         features = build_features
+         config[:features] = features unless features.empty?
+
+         if OmniAI::Google.config.respond_to?(:transcribe_options)
+           config.merge!(OmniAI::Google.config.transcribe_options)
+         end
+
+         config
+       end
+
+       # @return [Array<String>] Default language codes based on model
+       def default_language_codes
+         if @model&.include?("chirp")
+           # Chirp models use "auto" for automatic language detection
+           ["auto"]
+         else
+           # Other models use multiple languages for auto-detection
+           %w[en-US es-US]
+         end
+       end
+
+       # @return [Hash]
+       def build_features
+         case @format
+         when "verbose_json"
+           {
+             enableAutomaticPunctuation: true,
+             enableWordTimeOffsets: true,
+             enableWordConfidence: true,
+           }
+         when "json"
+           { enableAutomaticPunctuation: true }
+         else
+           {}
+         end
+       end
+
+       # @param payload_data [Hash]
+       def add_audio_data(payload_data)
+         if @io.is_a?(String) && @io.start_with?("gs://")
+           payload_data[:uri] = @io
+         elsif needs_gcs_upload?
+           gcs_uri = Bucket.process!(client: @client, io: @io)
+           payload_data[:uri] = gcs_uri
+         else
+           payload_data[:content] = encode_audio(@io)
+         end
+       end
+
+       # @return [Hash] Payload for batch recognition
+       def batch_payload
+         config = build_config
+
+         # Get the audio URI for batch processing
+         audio_uri = if @io.is_a?(String) && @io.start_with?("gs://")
+                       @io
+                     else
+                       # Force GCS upload for batch recognition
+                       Bucket.process!(client: @client, io: @io)
+                     end
+
+         {
+           config:,
+           files: [{ uri: audio_uri }],
+           recognitionOutputConfig: {
+             inlineResponseConfig: {},
+           },
+         }
+       end
+
+       # @param operation_name [String]
+       # @raise [HTTPError]
+       #
+       # @return [Hash]
+       def poll_operation!(operation_name)
+         endpoint = speech_endpoint
+         connection = HTTP.persistent(endpoint)
+           .timeout(connect: @client.timeout, write: @client.timeout, read: @client.timeout)
+           .accept(:json)
+
+         # Add authentication if using credentials
+         connection = connection.auth("Bearer #{@client.send(:auth).split.last}") if @client.credentials?
+
+         max_attempts = 60 # Maximum 15 minutes (15 second intervals)
+         attempt = 0
+
+         loop do
+           attempt += 1
+
+           raise HTTPError, "Operation timed out after #{max_attempts * 15} seconds" if attempt > max_attempts
+
+           operation_response = connection.get("/v2/#{operation_name}", params: operation_params)
+
+           raise HTTPError, operation_response unless operation_response.status.ok?
+
+           operation_data = operation_response.parse
+
+           # Check for errors
+           if operation_data["error"]
+             error_message = operation_data.dig("error", "message") || "Unknown error"
+             raise HTTPError, "Operation failed: #{error_message}"
+           end
+
+           # Check if done
+           return operation_data if operation_data["done"]
+
+           # Wait before polling again
+           sleep(15)
+         end
+       end
+
+       # @return [HTTP::Response]
+       def request_batch!
+         endpoint = speech_endpoint
+         connection = HTTP.persistent(endpoint)
+           .timeout(connect: @client.timeout, write: @client.timeout, read: @client.timeout)
+           .accept(:json)
+
+         # Add authentication if using credentials
+         connection = connection.auth("Bearer #{@client.send(:auth).split.last}") if @client.credentials?
+
+         connection.post(batch_path, params: operation_params, json: batch_payload)
+       end
+
+       # @return [String]
+       def batch_path
+         # Use the batchRecognize endpoint for async recognition
+         recognizer_path = "projects/#{project_id}/locations/#{location_id}/recognizers/#{recognizer_name}"
+         "/v2/#{recognizer_path}:batchRecognize"
+       end
+
+       # @return [Hash]
+       def operation_params
+         { key: (@client.api_key unless @client.credentials?) }.compact
+       end
+
+       # @return [String]
+       def recognizer_name
+         # Always use the default recognizer - the model is specified in the config
+         "_"
+       end
+
+       # @param result [Hash] Operation result from batch recognition
+       # @return [Hash] Data formatted for OmniAI::Transcribe::Transcription.parse
+       def extract_batch_transcript(result)
+         batch_results = result.dig("response", "results")
+         return empty_transcript_data unless batch_results
+
+         file_result = batch_results.values.first
+         return empty_transcript_data unless file_result
+
+         transcript_segments = file_result.dig("transcript", "results")
+         return empty_transcript_data unless transcript_segments&.any?
+
+         build_transcript_data(transcript_segments, file_result)
+       end
+
+       # @return [Hash]
+       def empty_transcript_data
+         { "text" => "" }
+       end
+
+       # @param transcript_segments [Array]
+       # @param file_result [Hash]
+       # @return [Hash]
+       def build_transcript_data(transcript_segments, file_result)
+         transcript_text = extract_transcript_text(transcript_segments)
+         result_data = { "text" => transcript_text }
+
+         add_duration_if_available(result_data, file_result)
+         add_segments_if_verbose(result_data, transcript_segments)
+
+         result_data
+       end
+
+       # @param transcript_segments [Array]
+       # @return [String]
+       def extract_transcript_text(transcript_segments)
+         text_segments = transcript_segments.map do |segment|
+           segment.dig("alternatives", 0, "transcript")
+         end.compact
+
+         text_segments.join(" ")
+       end
+
+       # @param result_data [Hash]
+       # @param file_result [Hash]
+       def add_duration_if_available(result_data, file_result)
+         duration = file_result.dig("metadata", "totalBilledDuration")
+         result_data["duration"] = parse_duration(duration) if duration
+       end
+
+       # @param result_data [Hash]
+       # @param transcript_segments [Array]
+       def add_segments_if_verbose(result_data, transcript_segments)
+         result_data["segments"] = build_segments(transcript_segments) if @format == "verbose_json"
+       end
+
+       # @param duration_string [String] Duration in Google's format (e.g., "123.456s")
+       # @return [Float] Duration in seconds
+       def parse_duration(duration_string)
+         return nil unless duration_string
+
+         duration_string.to_s.sub(/s$/, "").to_f
+       end
+
+       # @param segments [Array] Transcript segments from Google API
+       # @return [Array<Hash>] Segments formatted for base class
+       def build_segments(segments)
+         segments.map.with_index do |segment, index|
+           alternative = segment.dig("alternatives", 0)
+           next unless alternative
+
+           segment_data = {
+             "id" => index,
+             "text" => alternative["transcript"],
+             "start" => calculate_segment_start(segments, index),
+             "end" => parse_duration(segment["resultEndOffset"]),
+             "confidence" => alternative["confidence"],
+           }
+
+           # Words removed - segments provide sufficient granularity for most use cases
+
+           segment_data
+         end.compact
+       end
+
+       # @param segments [Array] All segments
+       # @param index [Integer] Current segment index
+       # @return [Float] Start time estimated from previous segment end
+       def calculate_segment_start(segments, index)
+         return 0.0 if index.zero?
+
+         prev_segment = segments[index - 1]
+         parse_duration(prev_segment["resultEndOffset"]) || 0.0
+       end
+
+       # @param response [HTTP::Response]
+       # @raise [HTTPError]
+       def handle_sync_response_errors(response)
+         return if response.status.ok?
+
+         error_data = parse_error_data(response)
+         raise_timeout_error(response) if timeout_error?(error_data)
+         raise HTTPError, response
+       end
+
+       # @param response [HTTP::Response]
+       # @return [Hash]
+       def parse_error_data(response)
+         response.parse
+       rescue StandardError
+         {}
+       end
+
+       # @param error_data [Hash]
+       # @return [Boolean]
+       def timeout_error?(error_data)
+         error_data.dig("error", "message")&.include?("60 seconds")
+       end
+
+       # @param response [HTTP::Response]
+       # @raise [HTTPError]
+       def raise_timeout_error(response)
+         raise HTTPError, (response.tap do |r|
+           r.instance_variable_set(:@body, "Audio file exceeds 60-second limit for direct upload. " \
+                                           "Use a long-form model (e.g., 'latest_long') or upload to GCS first. " \
+                                           "Original error: #{response.flush}")
+         end)
+       end
+
+       # @param data [Hash]
+       # @param transcript [String]
+       # @return [Hash]
+       def build_sync_response_data(data, transcript)
+         return { "text" => transcript } unless verbose_json_format?(data)
+
+         build_verbose_sync_data(data, transcript)
+       end
+
+       # @param data [Hash]
+       # @return [Boolean]
+       def verbose_json_format?(data)
+         @format == "verbose_json" &&
+           data["results"]&.any? &&
+           data["results"][0]["alternatives"]&.any?
+       end
+
+       # @param data [Hash]
+       # @param transcript [String]
+       # @return [Hash]
+       def build_verbose_sync_data(data, transcript)
+         alternative = data["results"][0]["alternatives"][0]
+         {
+           "text" => transcript,
+           "segments" => [{
+             "id" => 0,
+             "text" => transcript,
+             "start" => 0.0,
+             "end" => nil,
+             "confidence" => alternative["confidence"],
+           }],
+         }
+       end
+
+       # @param gcs_uri [String] GCS URI to delete (e.g., "gs://bucket/file.mp3")
+       def cleanup_gcs_file(gcs_uri)
+         return unless valid_gcs_uri?(gcs_uri)
+
+         bucket_name, object_name = parse_gcs_uri(gcs_uri)
+         return unless bucket_name && object_name
+
+         delete_gcs_object(bucket_name, object_name, gcs_uri)
+       end
+
+       # @param gcs_uri [String]
+       # @return [Boolean]
+       def valid_gcs_uri?(gcs_uri)
+         gcs_uri&.start_with?("gs://")
+       end
+
+       # @param gcs_uri [String]
+       # @return [Array<String>] [bucket_name, object_name]
+       def parse_gcs_uri(gcs_uri)
+         uri_parts = gcs_uri.sub("gs://", "").split("/", 2)
+         [uri_parts[0], uri_parts[1]]
+       end
+
+       # @param bucket_name [String]
+       # @param object_name [String]
+       # @param gcs_uri [String]
+       def delete_gcs_object(bucket_name, object_name, gcs_uri)
+         storage = create_storage_client
+         bucket = storage.bucket(bucket_name)
+         return unless bucket
+
+         file = bucket.file(object_name)
+         file&.delete
+       rescue ::Google::Cloud::Error => e
+         @client.logger&.warn("Failed to cleanup GCS file #{gcs_uri}: #{e.message}")
+       end
+
+       # @return [Google::Cloud::Storage]
+       def create_storage_client
+         credentials = @client.instance_variable_get(:@credentials)
+         ::Google::Cloud::Storage.new(project_id:, credentials:)
+       end
+     end
+   end
+ end
@@ -2,6 +2,6 @@
 
  module OmniAI
    module Google
-     VERSION = "2.6.4"
+     VERSION = "2.7.6"
    end
  end
metadata CHANGED
@@ -1,13 +1,13 @@
  --- !ruby/object:Gem::Specification
  name: omniai-google
  version: !ruby/object:Gem::Version
-   version: 2.6.4
+   version: 2.7.6
  platform: ruby
  authors:
  - Kevin Sylvestre
  bindir: exe
  cert_chain: []
- date: 2025-05-13 00:00:00.000000000 Z
+ date: 2025-06-15 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: event_stream_parser
@@ -37,6 +37,20 @@ dependencies:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
+ - !ruby/object:Gem::Dependency
+   name: google-cloud-storage
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
  - !ruby/object:Gem::Dependency
    name: omniai
    requirement: !ruby/object:Gem::Requirement
@@ -75,6 +89,7 @@ files:
  - Gemfile
  - README.md
  - lib/omniai/google.rb
+ - lib/omniai/google/bucket.rb
  - lib/omniai/google/chat.rb
  - lib/omniai/google/chat/choice_serializer.rb
  - lib/omniai/google/chat/content_serializer.rb
@@ -92,6 +107,8 @@ files:
  - lib/omniai/google/config.rb
  - lib/omniai/google/credentials.rb
  - lib/omniai/google/embed.rb
+ - lib/omniai/google/transcribe.rb
+ - lib/omniai/google/transcribe_helpers.rb
  - lib/omniai/google/upload.rb
  - lib/omniai/google/upload/file.rb
  - lib/omniai/google/version.rb