omniai-google 2.6.5 → 2.7.7

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: a1f3c06628af28183c2e2224b794ed0f105b136208dc06231ad9fc43e77254e5
- data.tar.gz: f267290388bc2a067f202d0da53581c0142363b69d99d47c6a4e1d5ef722f720
+ metadata.gz: eb0ebf5227f6a88f24418703712cccfad44d0fa41251f7e84e279211e45bf8aa
+ data.tar.gz: 465f216908801d1c440bf6bb22f1d15df8a39558d3cb2ec91bae290ff959f48d
  SHA512:
- metadata.gz: 4f013c574fc90c3fe31dccdc4ac424020908866fb8efd0b027c3d92b6e901b2a7656bfba527a37af80af17be9b33e42f1b6caa3b9a784f80e678558e1bbeee53
- data.tar.gz: '08dea5482ac8679c3d382ecf1830f634e85e652a94f470654ebcb5468898ec30473b7be6047535a056475542d30e1ef0f29dbba851fc1f252050ececfeb3af1a'
+ metadata.gz: 7c2dacf677dc673f2b3c2b8ccf439f59be0cd901b23c886d40b2f8f04e11edd1574616ca2a07a59f4c4406d5bb5e06d5aca0285ba45481b22aafb552efdc5741
+ data.tar.gz: 704d882c87af6d51800d86eaef592b1bf84e749686ad0129d7b837a7e8b800417ec1c2757c7d3728f796d0c53b640175b2eb13e369d8a179028f6f9e48fa064b
data/README.md CHANGED
@@ -58,6 +58,8 @@ OmniAI::Google.configure do |config|
  end
  ```

+ **Note for Transcription**: When using transcription features, ensure your service account has the necessary permissions for the Google Cloud Speech-to-Text API and for Google Cloud Storage (used for automatic file uploads). See the [GCS Setup](#gcs-setup-for-transcription) section below for detailed configuration.
+
  Credentials may be configured using:

  1. A `File` / `String` / `Pathname`.
@@ -143,6 +145,204 @@ end

  [Google API Reference `stream`](https://ai.google.dev/gemini-api/docs/api-overview#stream)

+ ### Transcribe
+
+ Audio files can be transcribed using Google's Speech-to-Text API. The implementation automatically selects synchronous or asynchronous recognition based on file size and model type.
+
+ #### Basic Usage
+
+ ```ruby
+ # Transcribe a local audio file
+ result = client.transcribe("path/to/audio.mp3")
+ result.text # "Hello, this is the transcribed text..."
+
+ # Transcribe with a specific model
+ result = client.transcribe("path/to/audio.mp3", model: "latest_long")
+ result.text # "Hello, this is the transcribed text..."
+ ```
+
+ #### Multi-Language Detection
+
+ Transcription automatically detects multiple languages when no language is specified:
+
+ ```ruby
+ # Auto-detect English and Spanish
+ result = client.transcribe("bilingual_audio.mp3", model: "latest_long")
+ result.text # "Hello, how are you? Hola, ¿cómo estás?"
+
+ # Specify expected languages explicitly
+ result = client.transcribe("audio.mp3", language: ["en-US", "es-US"], model: "latest_long")
+ ```
+
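+ Chirp models handle language detection natively: when no language is specified, the request defaults to `"auto"` rather than a fixed language list. A minimal sketch:
+
+ ```ruby
+ # Chirp resolves the spoken language automatically
+ result = client.transcribe("audio.mp3", model: OmniAI::Google::Transcribe::Model::CHIRP_2)
+ result.text
+ ```
+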
+ #### Detailed Transcription with Timestamps
+
+ Use the `VERBOSE_JSON` format to get per-segment timing, confidence scores, and language detection:
+
+ ```ruby
+ result = client.transcribe("audio.mp3",
+   model: "latest_long",
+   format: OmniAI::Transcribe::Format::VERBOSE_JSON
+ )
+
+ # Access the full transcript
+ result.text # "Complete transcribed text..."
+
+ # Access detailed segment information
+ result.segments.each do |segment|
+   puts "Segment #{segment[:segment_id]}: #{segment[:text]}"
+   puts "Language: #{segment[:language_code]}"
+   puts "Confidence: #{segment[:confidence]}"
+   puts "End time: #{segment[:end_time]}"
+
+   # Word-level timing (if available)
+   segment[:words].each do |word|
+     puts "  #{word[:word]} (#{word[:start_time]} - #{word[:end_time]})"
+   end
+ end
+
+ # Total audio duration
+ puts "Total duration: #{result.total_duration}"
+ ```
+
+ #### Models
+
+ Several models are available, each optimized for a different use case:
+
+ ```ruby
+ # For short audio (< 60 seconds)
+ client.transcribe("short_audio.mp3", model: OmniAI::Google::Transcribe::Model::LATEST_SHORT)
+
+ # For long-form audio (> 60 seconds) - automatically uses async processing
+ client.transcribe("long_audio.mp3", model: OmniAI::Google::Transcribe::Model::LATEST_LONG)
+
+ # For phone/telephony audio
+ client.transcribe("phone_call.mp3", model: OmniAI::Google::Transcribe::Model::TELEPHONY_LONG)
+
+ # For medical conversations
+ client.transcribe("medical_interview.mp3", model: OmniAI::Google::Transcribe::Model::MEDICAL_CONVERSATION)
+
+ # Other available models
+ client.transcribe("audio.mp3", model: OmniAI::Google::Transcribe::Model::CHIRP_2) # Enhanced model
+ client.transcribe("audio.mp3", model: OmniAI::Google::Transcribe::Model::CHIRP) # Universal model
+ ```
+
+ **Available Model Constants:**
+ - `OmniAI::Google::Transcribe::Model::LATEST_SHORT` - Optimized for audio < 60 seconds
+ - `OmniAI::Google::Transcribe::Model::LATEST_LONG` - Optimized for long-form audio
+ - `OmniAI::Google::Transcribe::Model::TELEPHONY_SHORT` - For short phone calls
+ - `OmniAI::Google::Transcribe::Model::TELEPHONY_LONG` - For long phone calls
+ - `OmniAI::Google::Transcribe::Model::MEDICAL_CONVERSATION` - For medical conversations
+ - `OmniAI::Google::Transcribe::Model::MEDICAL_DICTATION` - For medical dictation
+ - `OmniAI::Google::Transcribe::Model::CHIRP_2` - Enhanced universal model
+ - `OmniAI::Google::Transcribe::Model::CHIRP` - Universal model
+
+ #### Supported Formats
+
+ - **Input**: MP3, WAV, FLAC, and other common audio formats
+ - **GCS URIs**: Direct transcription from Google Cloud Storage (see the example below)
+ - **File uploads**: Automatic upload to GCS for files > 10MB or long-form models
+
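+ Audio already stored in Google Cloud Storage can be transcribed in place by passing the `gs://` URI directly; no upload is performed (the bucket and object below are placeholders):
+
+ ```ruby
+ # Transcribe straight from GCS - the file is used where it lives
+ result = client.transcribe("gs://my-bucket/recordings/meeting.mp3", model: "latest_long")
+ result.text
+ ```
+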
+ #### Advanced Features
+
+ **Automatic Processing Selection:**
+ - Audio under 60 seconds: uses synchronous recognition
+ - Audio over 60 seconds or long-form models: uses asynchronous batch recognition
+ - Large files: automatically uploaded to Google Cloud Storage
+
+ **GCS Integration:**
+ - Automatic file upload and cleanup
+ - Support for existing GCS URIs
+ - Configurable bucket names
+
+ **Error Handling:**
+ - Automatic retry logic for temporary failures
+ - Clear error messages for common issues
+ - Graceful handling of network timeouts
+
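+ Recognition settings can also be applied globally via the `transcribe_options` config accessor added in this version. A sketch: options are merged verbatim into the recognition config, so keys use the API's camelCase field names:
+
+ ```ruby
+ OmniAI::Google.configure do |config|
+   # Merged into every recognition config built by this library
+   config.transcribe_options = { languageCodes: %w[en-GB] }
+ end
+ ```
+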
263
+
264
+ #### GCS Setup for Transcription
265
+
266
+ For transcription to work properly with automatic file uploads, you need to set up Google Cloud Storage and configure the appropriate permissions.
267
+
268
+ ##### 1. Create a GCS Bucket
269
+
270
+ You must create a bucket named `{project_id}-speech-audio` manually before using transcription features:
271
+
272
+ ```bash
273
+ # Using gcloud CLI
274
+ gsutil mb gs://your-project-id-speech-audio
275
+
276
+ # Or create via Google Cloud Console
277
+ # Navigate to Cloud Storage > Browser > Create Bucket
278
+ ```
279
+
280
+ ##### 2. Service Account Permissions
281
+
282
+ Your service account needs the following IAM roles for transcription to work:
283
+
284
+ **Required Roles:**
285
+ - **Cloud Speech Editor** - Grants access to edit resources in Speech-to-Text
286
+ - **Storage Bucket Viewer** - Grants permission to view buckets and their metadata, excluding IAM policies
287
+ - **Storage Object Admin** - Grants full control over objects, including listing, creating, viewing, and deleting objects
288
+
289
+ **To assign roles via gcloud CLI:**
290
+
291
+ ```bash
292
+ # Replace YOUR_SERVICE_ACCOUNT_EMAIL and YOUR_PROJECT_ID with actual values
293
+ SERVICE_ACCOUNT="your-service-account@your-project-id.iam.gserviceaccount.com"
294
+ PROJECT_ID="your-project-id"
295
+
296
+ # Grant Speech-to-Text permissions
297
+ gcloud projects add-iam-policy-binding $PROJECT_ID \
298
+ --member="serviceAccount:$SERVICE_ACCOUNT" \
299
+ --role="roles/speech.editor"
300
+
301
+ # Grant Storage permissions
302
+ gcloud projects add-iam-policy-binding $PROJECT_ID \
303
+ --member="serviceAccount:$SERVICE_ACCOUNT" \
304
+ --role="roles/storage.objectAdmin"
305
+
306
+ gcloud projects add-iam-policy-binding $PROJECT_ID \
307
+ --member="serviceAccount:$SERVICE_ACCOUNT" \
308
+ --role="roles/storage.legacyBucketReader"
309
+ ```
310
+
311
+ **Or via Google Cloud Console:**
312
+ 1. Go to IAM & Admin > IAM
313
+ 2. Find your service account
314
+ 3. Click "Edit Principal"
315
+ 4. Add the required roles listed above
316
+
317
+ ##### 3. Enable Required APIs
318
+
319
+ Ensure the following APIs are enabled in your Google Cloud Project:
320
+
321
+ ```bash
322
+ # Enable Speech-to-Text API
323
+ gcloud services enable speech.googleapis.com
324
+
325
+ # Enable Cloud Storage API
326
+ gcloud services enable storage.googleapis.com
327
+ ```
328
+
329
+ ##### 4. Bucket Configuration (Optional)
330
+
331
+ You can customize the bucket name by configuring it in your application:
332
+
333
+ ```ruby
334
+ # Custom bucket name in your transcription calls
335
+ # The bucket must exist and your service account must have access
336
+ client.transcribe("audio.mp3", bucket_name: "my-custom-audio-bucket")
337
+ ```
338
+
339
+ **Important Notes:**
340
+ - The default bucket name follows the pattern: `{project_id}-speech-audio`
341
+ - You must create the bucket manually before using transcription features
342
+ - Choose an appropriate region for your bucket based on your location and compliance requirements
343
+ - Audio files are automatically deleted after successful transcription
344
+ - If transcription fails, temporary files may remain and should be cleaned up manually
345
+
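+ A cleanup sketch using the `google-cloud-storage` gem; the bucket name and `audio_` prefix follow this library's upload defaults, and the project ID is a placeholder:
+
+ ```ruby
+ require "google/cloud/storage"
+
+ storage = Google::Cloud::Storage.new(project_id: "your-project-id")
+ bucket  = storage.bucket("your-project-id-speech-audio")
+
+ # Uploaded audio is named audio_<timestamp>_<suffix>.<ext>; only run this
+ # when no transcriptions are in flight.
+ bucket.files(prefix: "audio_").all.each(&:delete)
+ ```
+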
  ### Embed

  Text can be converted into a vector embedding for similarity comparison usage via:
data/lib/omniai/google/bucket.rb ADDED
@@ -0,0 +1,115 @@
+ # frozen_string_literal: true
+
+ require "google/cloud/storage"
+ require "securerandom"
+
+ module OmniAI
+   module Google
+     # Uploads audio files to Google Cloud Storage for transcription.
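+     #
+     # Usage (a sketch - the path and the returned URI are illustrative):
+     #
+     #   uri = OmniAI::Google::Bucket.process!(client: client, io: "path/to/audio.mp3")
+     #   uri # => "gs://my-project-speech-audio/audio_20250616_120000_ab12cd34.mp3"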
+     class Bucket
+       class UploadError < StandardError; end
+
+       # @param client [Client]
+       # @param io [IO, String]
+       # @param bucket_name [String] optional - bucket name (defaults to project_id-speech-audio)
+       def self.process!(client:, io:, bucket_name: nil)
+         new(client:, io:, bucket_name:).process!
+       end
+
+       # @param client [Client]
+       # @param io [File, String]
+       # @param bucket_name [String] optional - bucket name
+       def initialize(client:, io:, bucket_name: nil)
+         @client = client
+         @io = io
+         @bucket_name = bucket_name || default_bucket_name
+       end
+
+       # @raise [UploadError]
+       #
+       # @return [String] GCS URI (gs://bucket/object)
+       def process!
+         # Create storage client with same credentials as main client
+         credentials = @client.instance_variable_get(:@credentials)
+         storage = ::Google::Cloud::Storage.new(
+           project_id:,
+           credentials:
+         )
+
+         # Get bucket (don't auto-create if it doesn't exist)
+         bucket = storage.bucket(@bucket_name)
+         unless bucket
+           raise UploadError, "Bucket '#{@bucket_name}' not found. " \
+                              "Please create it manually or ensure the service account has access."
+         end
+
+         # Generate unique filename
+         timestamp = Time.now.strftime("%Y%m%d_%H%M%S")
+         random_suffix = SecureRandom.hex(4)
+         filename = "audio_#{timestamp}_#{random_suffix}.#{file_extension}"
+
+         # Upload file - create StringIO for binary content
+         content = file_content
+         if content.is_a?(String) && content.include?("\0")
+           # Binary content - wrap in StringIO
+           require "stringio"
+           content = StringIO.new(content)
+         end
+
+         bucket.create_file(content, filename)
+
+         # Return GCS URI
+         "gs://#{@bucket_name}/#{filename}"
+       rescue ::Google::Cloud::Error => e
+         raise UploadError, "Failed to upload to GCS: #{e.message}"
+       end
+
+       private
+
+       # @return [String]
+       def project_id
+         @client.instance_variable_get(:@project_id) ||
+           raise(ArgumentError, "project_id is required for GCS upload")
+       end
+
+       # @return [String]
+       def location_id
+         @client.instance_variable_get(:@location_id) || "global"
+       end
+
+       # @return [String]
+       def default_bucket_name
+         "#{project_id}-speech-audio"
+       end
+
+       # @return [String]
+       def file_content
+         case @io
+         when String
+           # Check if it's a file path or binary content
+           if @io.include?("\0") || !File.exist?(@io)
+             # It's binary content, return as-is
+             @io
+           else
+             # It's a file path, read the file
+             File.read(@io)
+           end
+         when File, IO, StringIO
+           @io.rewind if @io.respond_to?(:rewind)
+           @io.read
+         else
+           raise ArgumentError, "Unsupported input type: #{@io.class}"
+         end
+       end
+
+       # @return [String]
+       def file_extension
+         case @io
+         when String
+           File.extname(@io)[1..] || "wav"
+         else
+           "wav" # Default extension
+         end
+       end
+     end
+   end
+ end
data/lib/omniai/google/chat.rb CHANGED
@@ -15,7 +15,7 @@ module OmniAI
  module Model
    GEMINI_1_0_PRO = "gemini-1.0-pro"
    GEMINI_1_5_PRO = "gemini-1.5-pro"
-   GEMINI_2_5_PRO = "gemini-2.5-pro-preview-05-06"
+   GEMINI_2_5_PRO = "gemini-2.5-pro-preview-06-05"
    GEMINI_1_5_FLASH = "gemini-1.5-flash"
    GEMINI_2_0_FLASH = "gemini-2.0-flash"
    GEMINI_2_5_FLASH = "gemini-2.5-flash-preview-04-17"
data/lib/omniai/google/client.rb CHANGED
@@ -88,6 +88,16 @@ module OmniAI
    Embed.process!(input, model:, client: self)
  end

+ # @raise [OmniAI::Error]
+ #
+ # @param input [String, File, IO] required - audio file path, file object, or GCS URI
+ # @param model [String] optional
+ # @param language [String, Array<String>] optional - language codes for transcription
+ # @param format [Symbol] optional - :json or :verbose_json
+ def transcribe(input, model: Transcribe::DEFAULT_MODEL, language: nil, format: nil)
+   Transcribe.process!(input, model:, language:, format:, client: self)
+ end
+
  # @return [String]
  def path
    if @project_id && @location_id
data/lib/omniai/google/config.rb CHANGED
@@ -55,6 +55,14 @@ module OmniAI
    def credentials=(value)
      @credentials = Credentials.parse(value)
    end
+
+   # @return [Hash]
+   def transcribe_options
+     @transcribe_options ||= {}
+   end
+
+   # @param value [Hash]
+   attr_writer :transcribe_options
  end
  end
  end
data/lib/omniai/google/transcribe.rb ADDED
@@ -0,0 +1,143 @@
+ # frozen_string_literal: true
+
+ module OmniAI
+   module Google
+     # A Google transcribe implementation.
+     #
+     # Usage:
+     #
+     #   transcribe = OmniAI::Google::Transcribe.new(client: client)
+     #   transcribe.process!(audio_file)
+     class Transcribe < OmniAI::Transcribe
+       include TranscribeHelpers
+
+       module Model
+         CHIRP_2 = "chirp_2"
+         CHIRP = "chirp"
+         LATEST_LONG = "latest_long"
+         LATEST_SHORT = "latest_short"
+         TELEPHONY_LONG = "telephony_long"
+         TELEPHONY_SHORT = "telephony_short"
+         MEDICAL_CONVERSATION = "medical_conversation"
+         MEDICAL_DICTATION = "medical_dictation"
+       end
+
+       DEFAULT_MODEL = Model::LATEST_SHORT
+       DEFAULT_RECOGNIZER = "_"
+
+       # @return [Context]
+       CONTEXT = Context.build do |context|
+         # No custom deserializers needed - let base class handle parsing
+       end
+
+       # @raise [HTTPError]
+       #
+       # @return [OmniAI::Transcribe::Transcription]
+       def process!
+         if needs_async_recognition?
+           process_async!
+         else
+           process_sync!
+         end
+       end
+
+       private
+
+       # @return [Boolean]
+       def needs_async_recognition?
+         # Use async for long-form models or when GCS is needed
+         needs_long_form_recognition? || needs_gcs_upload?
+       end
+
+       # @raise [HTTPError]
+       #
+       # @return [OmniAI::Transcribe::Transcription]
+       def process_sync!
+         response = request!
+         handle_sync_response_errors(response)
+
+         data = response.parse
+         transcript = data.dig("results", 0, "alternatives", 0, "transcript") || ""
+
+         transformed_data = build_sync_response_data(data, transcript)
+         Transcription.parse(model: @model, format: @format, data: transformed_data)
+       end
+
+       # @raise [HTTPError]
+       #
+       # @return [OmniAI::Transcribe::Transcription]
+       def process_async!
+         # Track if we uploaded the file for cleanup
+         uploaded_gcs_uri = nil
+
+         # Start the batch recognition job
+         response = request_batch!
+
+         raise HTTPError, response unless response.status.ok?
+
+         operation_data = response.parse
+         operation_name = operation_data["name"]
+
+         raise HTTPError, "No operation name returned from batch recognition request" unless operation_name
+
+         # Extract GCS URI for cleanup if we uploaded it
+         if operation_data.dig("metadata", "batchRecognizeRequest", "files")
+           file_uri = operation_data.dig("metadata", "batchRecognizeRequest", "files", 0, "uri")
+           # Only mark for cleanup if it's not a user-provided GCS URI
+           uploaded_gcs_uri = file_uri unless @io.is_a?(String) && @io.start_with?("gs://")
+         end
+
+         # Poll for completion
+         result = poll_operation!(operation_name)
+
+         # Extract transcript from completed operation
+         transcript_data = extract_batch_transcript(result)
+
+         # Clean up uploaded file if we created it
+         cleanup_gcs_file(uploaded_gcs_uri) if uploaded_gcs_uri
+
+         Transcription.parse(model: @model, format: @format, data: transcript_data)
+       end
+
+       protected
+
+       # @return [Context]
+       def context
+         CONTEXT
+       end
+
+       # @return [HTTP::Response]
+       def request!
+         # Speech-to-Text API uses different endpoints for regional vs global
+         endpoint = speech_endpoint
+         speech_connection = HTTP.persistent(endpoint)
+           .timeout(connect: @client.timeout, write: @client.timeout, read: @client.timeout)
+           .accept(:json)
+
+         # Add authentication if using credentials
+         speech_connection = speech_connection.auth("Bearer #{@client.send(:auth).split.last}") if @client.credentials?
+
+         speech_connection.post(path, params:, json: payload)
+       end
+
+       # @return [Hash]
+       def payload
+         config = build_config
+         payload_data = { config: }
+         add_audio_data(payload_data)
+         payload_data
+       end
+
+       # @return [String]
+       def path
+         # Always use Speech-to-Text API v2 with recognizers
+         recognizer_path = "projects/#{project_id}/locations/#{location_id}/recognizers/#{recognizer_name}"
+         "/v2/#{recognizer_path}:recognize"
+       end
+
+       # @return [Hash]
+       def params
+         { key: (@client.api_key unless @client.credentials?) }.compact
+       end
+     end
+   end
+ end
data/lib/omniai/google/transcribe_helpers.rb ADDED
@@ -0,0 +1,461 @@
+ # frozen_string_literal: true
+
+ module OmniAI
+   module Google
+     # Helper methods for transcription functionality
+     module TranscribeHelpers # rubocop:disable Metrics/ModuleLength
+       private
+
+       # @return [String]
+       def project_id
+         @client.instance_variable_get(:@project_id) ||
+           raise(ArgumentError, "project_id is required for transcription")
+       end
+
+       # @return [String]
+       def location_id
+         case @model
+         when "chirp_2"
+           "us-central1"
+         else
+           @client.instance_variable_get(:@location_id) || "global"
+         end
+       end
+
+       # @return [String]
+       def speech_endpoint
+         location_id == "global" ? "https://speech.googleapis.com" : "https://#{location_id}-speech.googleapis.com"
+       end
+
+       # @return [Array<String>, nil]
+       def language_codes
+         case @language
+         when String
+           [@language] unless @language.strip.empty?
+         when Array
+           cleaned = @language.compact.reject(&:empty?)
+           cleaned if cleaned.any?
+         when nil, ""
+           nil # Auto-detect language when not specified
+         else
+           ["en-US"] # Default to English (multi-language only supported in global/us/eu locations)
+         end
+       end
+
+       # @param input [String, Pathname, File, IO]
+       # @return [String] Base64 encoded audio content
+       def encode_audio(input)
+         case input
+         when String
+           if File.exist?(input)
+             Base64.strict_encode64(File.read(input))
+           else
+             input # Assume it's already base64 encoded
+           end
+         when Pathname, File, IO, StringIO
+           Base64.strict_encode64(input.read)
+         else
+           raise ArgumentError, "Unsupported input type: #{input.class}"
+         end
+       end
+
+       # @return [Boolean]
+       def needs_gcs_upload?
+         return false if @io.is_a?(String) && @io.start_with?("gs://")
+
+         file_size = calculate_file_size
+         # Force GCS upload for files > 10MB or if using long models for longer audio
+         file_size > 10_000_000 || needs_long_form_recognition?
+       end
+
+       # @return [Boolean]
+       def needs_long_form_recognition?
+         # Use long-form models for potentially longer audio files
+         return true if @model&.include?("long")
+
+         # Chirp models process speech in larger chunks and prefer BatchRecognize
+         return true if @model&.include?("chirp")
+
+         # For large files, assume they might be longer than 60 seconds
+         # Approximate: files larger than 1MB might be longer than 60 seconds
+         calculate_file_size > 1_000_000
+       end
+
+       # @return [Integer]
+       def calculate_file_size
+         case @io
+         when String
+           File.exist?(@io) ? File.size(@io) : 0
+         when File, IO, StringIO
+           @io.respond_to?(:size) ? @io.size : 0
+         else
+           0
+         end
+       end
+
+       # @return [Hash]
+       def build_config
+         config = {
+           model: @model,
+           autoDecodingConfig: {},
+         }
+
+         # Only include languageCodes if specified and non-empty (omit for auto-detection)
+         lang_codes = language_codes
+         config[:languageCodes] = if lang_codes&.any?
+           lang_codes
+         else
+           # Handle language detection based on model capabilities
+           default_language_codes
+         end
+
+         features = build_features
+         config[:features] = features unless features.empty?
+
+         if OmniAI::Google.config.respond_to?(:transcribe_options)
+           config.merge!(OmniAI::Google.config.transcribe_options)
+         end
+
+         config
+       end
+
+       # @return [Array<String>] Default language codes based on model
+       def default_language_codes
+         if @model&.include?("chirp")
+           # Chirp models use "auto" for automatic language detection
+           ["auto"]
+         else
+           # Other models use multiple languages for auto-detection
+           %w[en-US es-US]
+         end
+       end
+
+       # @return [Hash]
+       def build_features
+         case @format
+         when "verbose_json"
+           {
+             enableAutomaticPunctuation: true,
+             enableWordTimeOffsets: true,
+             enableWordConfidence: true,
+           }
+         when "json"
+           { enableAutomaticPunctuation: true }
+         else
+           {}
+         end
+       end
+
+       # @param payload_data [Hash]
+       def add_audio_data(payload_data)
+         if @io.is_a?(String) && @io.start_with?("gs://")
+           payload_data[:uri] = @io
+         elsif needs_gcs_upload?
+           gcs_uri = Bucket.process!(client: @client, io: @io)
+           payload_data[:uri] = gcs_uri
+         else
+           payload_data[:content] = encode_audio(@io)
+         end
+       end
+
+       # @return [Hash] Payload for batch recognition
+       def batch_payload
+         config = build_config
+
+         # Get audio URI for batch processing
+         audio_uri = if @io.is_a?(String) && @io.start_with?("gs://")
+           @io
+         else
+           # Force GCS upload for batch recognition
+           Bucket.process!(client: @client, io: @io)
+         end
+
+         {
+           config:,
+           files: [{ uri: audio_uri }],
+           recognitionOutputConfig: {
+             inlineResponseConfig: {},
+           },
+         }
+       end
+
+       # @param operation_name [String]
+       # @raise [HTTPError]
+       #
+       # @return [Hash]
+       def poll_operation!(operation_name)
+         endpoint = speech_endpoint
+         connection = HTTP.persistent(endpoint)
+           .timeout(connect: @client.timeout, write: @client.timeout, read: @client.timeout)
+           .accept(:json)
+
+         # Add authentication if using credentials
+         connection = connection.auth("Bearer #{@client.send(:auth).split.last}") if @client.credentials?
+
+         max_attempts = 60 # Maximum 15 minutes (15 second intervals)
+         attempt = 0
+
+         loop do
+           attempt += 1
+
+           raise HTTPError, "Operation timed out after #{max_attempts * 15} seconds" if attempt > max_attempts
+
+           operation_response = connection.get("/v2/#{operation_name}", params: operation_params)
+
+           raise HTTPError, operation_response unless operation_response.status.ok?
+
+           operation_data = operation_response.parse
+
+           # Check for errors
+           if operation_data["error"]
+             error_message = operation_data.dig("error", "message") || "Unknown error"
+             raise HTTPError, "Operation failed: #{error_message}"
+           end
+
+           # Check if done
+           return operation_data if operation_data["done"]
+
+           # Wait before polling again
+           sleep(15)
+         end
+       end
+
+       # @return [HTTP::Response]
+       def request_batch!
+         endpoint = speech_endpoint
+         connection = HTTP.persistent(endpoint)
+           .timeout(connect: @client.timeout, write: @client.timeout, read: @client.timeout)
+           .accept(:json)
+
+         # Add authentication if using credentials
+         connection = connection.auth("Bearer #{@client.send(:auth).split.last}") if @client.credentials?
+
+         connection.post(batch_path, params: operation_params, json: batch_payload)
+       end
+
+       # @return [String]
+       def batch_path
+         # Use batchRecognize endpoint for async recognition
+         recognizer_path = "projects/#{project_id}/locations/#{location_id}/recognizers/#{recognizer_name}"
+         "/v2/#{recognizer_path}:batchRecognize"
+       end
+
+       # @return [Hash]
+       def operation_params
+         { key: (@client.api_key unless @client.credentials?) }.compact
+       end
+
+       # @return [String]
+       def recognizer_name
+         # Always use the default recognizer - the model is specified in the config
+         "_"
+       end
+
+       # @param result [Hash] Operation result from batch recognition
+       # @return [Hash] Data formatted for OmniAI::Transcribe::Transcription.parse
+       def extract_batch_transcript(result)
+         batch_results = result.dig("response", "results")
+         return empty_transcript_data unless batch_results
+
+         file_result = batch_results.values.first
+         return empty_transcript_data unless file_result
+
+         transcript_segments = file_result.dig("transcript", "results")
+         return empty_transcript_data unless transcript_segments&.any?
+
+         build_transcript_data(transcript_segments, file_result)
+       end
+
+       # @return [Hash]
+       def empty_transcript_data
+         { "text" => "" }
+       end
+
+       # @param transcript_segments [Array]
+       # @param file_result [Hash]
+       # @return [Hash]
+       def build_transcript_data(transcript_segments, file_result)
+         transcript_text = extract_transcript_text(transcript_segments)
+         result_data = { "text" => transcript_text }
+
+         add_duration_if_available(result_data, file_result)
+         add_segments_if_verbose(result_data, transcript_segments)
+
+         result_data
+       end
+
+       # @param transcript_segments [Array]
+       # @return [String]
+       def extract_transcript_text(transcript_segments)
+         text_segments = transcript_segments.map do |segment|
+           segment.dig("alternatives", 0, "transcript")
+         end.compact
+
+         text_segments.join(" ")
+       end
+
+       # @param result_data [Hash]
+       # @param file_result [Hash]
+       def add_duration_if_available(result_data, file_result)
+         duration = file_result.dig("metadata", "totalBilledDuration")
+         result_data["duration"] = parse_duration(duration) if duration
+       end
+
+       # @param result_data [Hash]
+       # @param transcript_segments [Array]
+       def add_segments_if_verbose(result_data, transcript_segments)
+         result_data["segments"] = build_segments(transcript_segments) if @format == "verbose_json"
+       end
+
+       # @param duration_string [String] Duration in Google's format (e.g., "123.456s")
+       # @return [Float] Duration in seconds
+       def parse_duration(duration_string)
+         return nil unless duration_string
+
+         duration_string.to_s.sub(/s$/, "").to_f
+       end
+
+       # @param segments [Array] Transcript segments from Google API
+       # @return [Array<Hash>] Segments formatted for base class
+       def build_segments(segments)
+         segments.map.with_index do |segment, index|
+           alternative = segment.dig("alternatives", 0)
+           next unless alternative
+
+           segment_data = {
+             "id" => index,
+             "text" => alternative["transcript"],
+             "start" => calculate_segment_start(segments, index),
+             "end" => parse_duration(segment["resultEndOffset"]),
+             "confidence" => alternative["confidence"],
+           }
+
+           # Words removed - segments provide sufficient granularity for most use cases
+
+           segment_data
+         end.compact
+       end
+
+       # @param segments [Array] All segments
+       # @param index [Integer] Current segment index
+       # @return [Float] Start time estimated from previous segment end
+       def calculate_segment_start(segments, index)
+         return 0.0 if index.zero?
+
+         prev_segment = segments[index - 1]
+         parse_duration(prev_segment["resultEndOffset"]) || 0.0
+       end
+
+       # @param response [HTTP::Response]
+       # @raise [HTTPError]
+       def handle_sync_response_errors(response)
+         return if response.status.ok?
+
+         error_data = parse_error_data(response)
+         raise_timeout_error(response) if timeout_error?(error_data)
+         raise HTTPError, response
+       end
+
+       # @param response [HTTP::Response]
+       # @return [Hash]
+       def parse_error_data(response)
+         response.parse
+       rescue StandardError
+         {}
+       end
+
+       # @param error_data [Hash]
+       # @return [Boolean]
+       def timeout_error?(error_data)
+         error_data.dig("error", "message")&.include?("60 seconds")
+       end
+
+       # @param response [HTTP::Response]
+       # @raise [HTTPError]
+       def raise_timeout_error(response)
+         raise HTTPError, (response.tap do |r|
+           r.instance_variable_set(:@body, "Audio file exceeds 60-second limit for direct upload. " \
+             "Use a long-form model (e.g., 'latest_long') or upload to GCS first. " \
+             "Original error: #{response.flush}")
+         end)
+       end
+
+       # @param data [Hash]
+       # @param transcript [String]
+       # @return [Hash]
+       def build_sync_response_data(data, transcript)
+         return { "text" => transcript } unless verbose_json_format?(data)
+
+         build_verbose_sync_data(data, transcript)
+       end
+
+       # @param data [Hash]
+       # @return [Boolean]
+       def verbose_json_format?(data)
+         @format == "verbose_json" &&
+           data["results"]&.any? &&
+           data["results"][0]["alternatives"]&.any?
+       end
+
+       # @param data [Hash]
+       # @param transcript [String]
+       # @return [Hash]
+       def build_verbose_sync_data(data, transcript)
+         alternative = data["results"][0]["alternatives"][0]
+         {
+           "text" => transcript,
+           "segments" => [{
+             "id" => 0,
+             "text" => transcript,
+             "start" => 0.0,
+             "end" => nil,
+             "confidence" => alternative["confidence"],
+           }],
+         }
+       end
+
+       # @param gcs_uri [String] GCS URI to delete (e.g., "gs://bucket/file.mp3")
+       def cleanup_gcs_file(gcs_uri)
+         return unless valid_gcs_uri?(gcs_uri)
+
+         bucket_name, object_name = parse_gcs_uri(gcs_uri)
+         return unless bucket_name && object_name
+
+         delete_gcs_object(bucket_name, object_name, gcs_uri)
+       end
+
+       # @param gcs_uri [String]
+       # @return [Boolean]
+       def valid_gcs_uri?(gcs_uri)
+         gcs_uri&.start_with?("gs://")
+       end
+
+       # @param gcs_uri [String]
+       # @return [Array<String>] [bucket_name, object_name]
+       def parse_gcs_uri(gcs_uri)
+         uri_parts = gcs_uri.sub("gs://", "").split("/", 2)
+         [uri_parts[0], uri_parts[1]]
+       end
+
+       # @param bucket_name [String]
+       # @param object_name [String]
+       # @param gcs_uri [String]
+       def delete_gcs_object(bucket_name, object_name, gcs_uri)
+         storage = create_storage_client
+         bucket = storage.bucket(bucket_name)
+         return unless bucket
+
+         file = bucket.file(object_name)
+         file&.delete
+       rescue ::Google::Cloud::Error => e
+         @client.logger&.warn("Failed to cleanup GCS file #{gcs_uri}: #{e.message}")
+       end
+
+       # @return [Google::Cloud::Storage]
+       def create_storage_client
+         credentials = @client.instance_variable_get(:@credentials)
+         ::Google::Cloud::Storage.new(project_id:, credentials:)
+       end
+     end
+   end
+ end
data/lib/omniai/google/version.rb CHANGED
@@ -2,6 +2,6 @@

  module OmniAI
    module Google
-     VERSION = "2.6.5"
+     VERSION = "2.7.7"
    end
  end
metadata CHANGED
@@ -1,13 +1,13 @@
  --- !ruby/object:Gem::Specification
  name: omniai-google
  version: !ruby/object:Gem::Version
-   version: 2.6.5
+   version: 2.7.7
  platform: ruby
  authors:
  - Kevin Sylvestre
  bindir: exe
  cert_chain: []
- date: 2025-05-14 00:00:00.000000000 Z
+ date: 2025-06-16 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: event_stream_parser
@@ -37,6 +37,20 @@ dependencies:
      - - ">="
        - !ruby/object:Gem::Version
          version: '0'
+ - !ruby/object:Gem::Dependency
+   name: google-cloud-storage
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
  - !ruby/object:Gem::Dependency
    name: omniai
    requirement: !ruby/object:Gem::Requirement
@@ -75,6 +89,7 @@ files:
  - Gemfile
  - README.md
  - lib/omniai/google.rb
+ - lib/omniai/google/bucket.rb
  - lib/omniai/google/chat.rb
  - lib/omniai/google/chat/choice_serializer.rb
  - lib/omniai/google/chat/content_serializer.rb
@@ -92,6 +107,8 @@ files:
  - lib/omniai/google/config.rb
  - lib/omniai/google/credentials.rb
  - lib/omniai/google/embed.rb
+ - lib/omniai/google/transcribe.rb
+ - lib/omniai/google/transcribe_helpers.rb
  - lib/omniai/google/upload.rb
  - lib/omniai/google/upload/file.rb
  - lib/omniai/google/version.rb