cartesia 1.0.6.tar.gz → 1.0.8.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: cartesia
- Version: 1.0.6
+ Version: 1.0.8
  Summary: The official Python library for the Cartesia API.
  Home-page:
  Author: Cartesia, Inc.
@@ -25,6 +25,22 @@ The official Cartesia Python library which provides convenient access to the Car
  > [!IMPORTANT]
  > The client library introduces breaking changes in v1.0.0, which was released on June 24th 2024. See the [release notes](https://github.com/cartesia-ai/cartesia-python/releases/tag/v1.0.0) and [migration guide](https://github.com/cartesia-ai/cartesia-python/discussions/44). Reach out to us on [Discord](https://discord.gg/ZVxavqHB9X) for any support requests!

+ - [Cartesia Python API Library](#cartesia-python-api-library)
+ - [Documentation](#documentation)
+ - [Installation](#installation)
+ - [Voices](#voices)
+ - [Text-to-Speech](#text-to-speech)
+ - [Server-Sent Events (SSE)](#server-sent-events-sse)
+ - [WebSocket](#websocket)
+ - [Conditioning speech on previous generations using WebSocket](#conditioning-speech-on-previous-generations-using-websocket)
+ - [Generating timestamps using WebSocket](#generating-timestamps-using-websocket)
+ - [Multilingual Text-to-Speech \[Alpha\]](#multilingual-text-to-speech-alpha)
+ - [Speed and Emotion Control \[Experimental\]](#speed-and-emotion-control-experimental)
+ - [Jupyter Notebook Usage](#jupyter-notebook-usage)
+ - [Utility methods](#utility-methods)
+ - [Output Formats](#output-formats)
+
+
  ## Documentation

  Our complete API documentation can be found [on docs.cartesia.ai](https://docs.cartesia.ai).
@@ -250,7 +266,7 @@ async def send_transcripts(ctx):

  # You can check out our models at https://docs.cartesia.ai/getting-started/available-models
  model_id = "sonic-english"
-
+
  # You can find the supported `output_format`s at https://docs.cartesia.ai/api-reference/endpoints/stream-speech-server-sent-events
  output_format = {
  "container": "raw",
@@ -266,7 +282,7 @@ async def send_transcripts(ctx):
  "As they near Eggman's lair, our heroes charge their abilities for an epic boss battle. ",
  "Get ready to spin, jump, and sound-blast your way to victory in this high-octane crossover!"
  ]
-
+
  for transcript in transcripts:
  # Send text inputs as they become available
  await ctx.send(
@@ -278,7 +294,7 @@ async def send_transcripts(ctx):
  )

  # Indicate that no more inputs will be sent. Otherwise, the context will close after 5 seconds of inactivity.
- await ctx.no_more_inputs()
+ await ctx.no_more_inputs()

  async def receive_and_play_audio(ctx):
  p = pyaudio.PyAudio()
@@ -384,7 +400,7 @@ output_stream = ctx.send(
  voice_id=voice_id,
  output_format=output_format,
  )
-
+
  for output in output_stream:
  buffer = output["audio"]

@@ -401,6 +417,34 @@ p.terminate()
  ws.close() # Close the websocket connection
  ```

+ ### Generating timestamps using WebSocket
+
+ The WebSocket endpoint supports timestamps, allowing you to get detailed timing information for each word in the transcript. To enable this feature, pass an `add_timestamps` boolean flag to the `send` method. The results are returned in the `word_timestamps` object, which contains three keys:
+ - words (list): The individual words in the transcript.
+ - start (list): The starting timestamp for each word (in seconds).
+ - end (list): The ending timestamp for each word (in seconds).
+
+ ```python
+ response = ws.send(
+ model_id=model_id,
+ transcript=transcript,
+ voice_id=voice_id,
+ output_format=output_format,
+ stream=False,
+ add_timestamps=True
+ )
+
+ # Accessing the word_timestamps object
+ word_timestamps = response['word_timestamps']
+
+ words = word_timestamps['words']
+ start_times = word_timestamps['start']
+ end_times = word_timestamps['end']
+
+ for word, start, end in zip(words, start_times, end_times):
+ print(f"Word: {word}, Start: {start}, End: {end}")
+ ```
+
  ### Multilingual Text-to-Speech [Alpha]

  You can use our `sonic-multilingual` model to generate audio in multiple languages. The languages supported are available at [docs.cartesia.ai](https://docs.cartesia.ai/getting-started/available-models).
@@ -454,6 +498,31 @@ stream.close()
  p.terminate()
  ```

+ ### Speed and Emotion Control [Experimental]
+
+ You can enhance the voice output by adjusting the `speed` and `emotion` parameters. To do this, pass a `_experimental_voice_controls` dictionary with the desired `speed` and `emotion` values to any `send` method.
+
+ Speed Options:
+ - `slowest`, `slow`, `normal`, `fast`, `fastest`
+
+ Emotion Options:
+ Use a list of tags in the format `emotion_name:level` where:
+ - Emotion Names: `anger`, `positivity`, `surprise`, `sadness`, `curiosity`
+ - Levels: `lowest`, `low`, (omit for medium level), `high`, `highest`
+ The emotion tag levels add the specified emotion to the voice at the indicated intensity, with the omission of a level tag resulting in a medium intensity.
+
+ ```python
+ ws.send(
+ model_id=model_id,
+ transcript=transcript,
+ voice_id=voice_id,
+ output_format=output_format,
+ _experimental_voice_controls={"speed": "fast", "emotion": ["positivity:high"]},
+ )
+ ```
+
+ ### Jupyter Notebook Usage
+
  If you are using Jupyter Notebook or JupyterLab, you can use IPython.display.Audio to play the generated audio directly in the notebook.
  Additionally, in these notebook examples we show how to use the client as a context manager (though this is not required).

@@ -8,6 +8,22 @@ The official Cartesia Python library which provides convenient access to the Car
  > [!IMPORTANT]
  > The client library introduces breaking changes in v1.0.0, which was released on June 24th 2024. See the [release notes](https://github.com/cartesia-ai/cartesia-python/releases/tag/v1.0.0) and [migration guide](https://github.com/cartesia-ai/cartesia-python/discussions/44). Reach out to us on [Discord](https://discord.gg/ZVxavqHB9X) for any support requests!

+ - [Cartesia Python API Library](#cartesia-python-api-library)
+ - [Documentation](#documentation)
+ - [Installation](#installation)
+ - [Voices](#voices)
+ - [Text-to-Speech](#text-to-speech)
+ - [Server-Sent Events (SSE)](#server-sent-events-sse)
+ - [WebSocket](#websocket)
+ - [Conditioning speech on previous generations using WebSocket](#conditioning-speech-on-previous-generations-using-websocket)
+ - [Generating timestamps using WebSocket](#generating-timestamps-using-websocket)
+ - [Multilingual Text-to-Speech \[Alpha\]](#multilingual-text-to-speech-alpha)
+ - [Speed and Emotion Control \[Experimental\]](#speed-and-emotion-control-experimental)
+ - [Jupyter Notebook Usage](#jupyter-notebook-usage)
+ - [Utility methods](#utility-methods)
+ - [Output Formats](#output-formats)
+
+
  ## Documentation

  Our complete API documentation can be found [on docs.cartesia.ai](https://docs.cartesia.ai).
@@ -233,7 +249,7 @@ async def send_transcripts(ctx):

  # You can check out our models at https://docs.cartesia.ai/getting-started/available-models
  model_id = "sonic-english"
-
+
  # You can find the supported `output_format`s at https://docs.cartesia.ai/api-reference/endpoints/stream-speech-server-sent-events
  output_format = {
  "container": "raw",
@@ -249,7 +265,7 @@ async def send_transcripts(ctx):
  "As they near Eggman's lair, our heroes charge their abilities for an epic boss battle. ",
  "Get ready to spin, jump, and sound-blast your way to victory in this high-octane crossover!"
  ]
-
+
  for transcript in transcripts:
  # Send text inputs as they become available
  await ctx.send(
@@ -261,7 +277,7 @@ async def send_transcripts(ctx):
  )

  # Indicate that no more inputs will be sent. Otherwise, the context will close after 5 seconds of inactivity.
- await ctx.no_more_inputs()
+ await ctx.no_more_inputs()

  async def receive_and_play_audio(ctx):
  p = pyaudio.PyAudio()
@@ -367,7 +383,7 @@ output_stream = ctx.send(
  voice_id=voice_id,
  output_format=output_format,
  )
-
+
  for output in output_stream:
  buffer = output["audio"]

@@ -384,6 +400,34 @@ p.terminate()
  ws.close() # Close the websocket connection
  ```

+ ### Generating timestamps using WebSocket
+
+ The WebSocket endpoint supports timestamps, allowing you to get detailed timing information for each word in the transcript. To enable this feature, pass an `add_timestamps` boolean flag to the `send` method. The results are returned in the `word_timestamps` object, which contains three keys:
+ - words (list): The individual words in the transcript.
+ - start (list): The starting timestamp for each word (in seconds).
+ - end (list): The ending timestamp for each word (in seconds).
+
+ ```python
+ response = ws.send(
+ model_id=model_id,
+ transcript=transcript,
+ voice_id=voice_id,
+ output_format=output_format,
+ stream=False,
+ add_timestamps=True
+ )
+
+ # Accessing the word_timestamps object
+ word_timestamps = response['word_timestamps']
+
+ words = word_timestamps['words']
+ start_times = word_timestamps['start']
+ end_times = word_timestamps['end']
+
+ for word, start, end in zip(words, start_times, end_times):
+ print(f"Word: {word}, Start: {start}, End: {end}")
+ ```
+
  ### Multilingual Text-to-Speech [Alpha]

  You can use our `sonic-multilingual` model to generate audio in multiple languages. The languages supported are available at [docs.cartesia.ai](https://docs.cartesia.ai/getting-started/available-models).
@@ -437,6 +481,31 @@ stream.close()
  p.terminate()
  ```

+ ### Speed and Emotion Control [Experimental]
+
+ You can enhance the voice output by adjusting the `speed` and `emotion` parameters. To do this, pass a `_experimental_voice_controls` dictionary with the desired `speed` and `emotion` values to any `send` method.
+
+ Speed Options:
+ - `slowest`, `slow`, `normal`, `fast`, `fastest`
+
+ Emotion Options:
+ Use a list of tags in the format `emotion_name:level` where:
+ - Emotion Names: `anger`, `positivity`, `surprise`, `sadness`, `curiosity`
+ - Levels: `lowest`, `low`, (omit for medium level), `high`, `highest`
+ The emotion tag levels add the specified emotion to the voice at the indicated intensity, with the omission of a level tag resulting in a medium intensity.
+
+ ```python
+ ws.send(
+ model_id=model_id,
+ transcript=transcript,
+ voice_id=voice_id,
+ output_format=output_format,
+ _experimental_voice_controls={"speed": "fast", "emotion": ["positivity:high"]},
+ )
+ ```
+
+ ### Jupyter Notebook Usage
+
  If you are using Jupyter Notebook or JupyterLab, you can use IPython.display.Audio to play the generated audio directly in the notebook.
  Additionally, in these notebook examples we show how to use the client as a context manager (though this is not required).

@@ -45,7 +45,7 @@ class DeprecatedOutputFormatMapping:
  "mulaw_8000": {"container": "raw", "encoding": "pcm_mulaw", "sample_rate": 8000},
  "alaw_8000": {"container": "raw", "encoding": "pcm_alaw", "sample_rate": 8000},
  }
-
+
  @classmethod
  @deprecated(
  vdeprecated="1.0.1",
@@ -74,18 +74,19 @@ class VoiceControls(TypedDict):
  """Defines different voice control parameters for voice synthesis.

  For a complete list of supported parameters, refer to the Cartesia API documentation.
- https://docs.cartesia.ai/getting-started/welcome
+ https://docs.cartesia.ai/api-reference

  Examples:
  >>> {"speed": "fastest"}
- >>> {"speed": "slow", "emotion": "anger:high, positivity:low"}
- >>> {"emotion": "surprise:high, positivity:high"}
+ >>> {"speed": "slow", "emotion": ["sadness:high"]}
+ >>> {"emotion": ["surprise:highest", "curiosity"]}

  Note:
  This is an experimental class and is subject to rapid change in future versions.
  """
+
  speed: str = ""
- emotion: str = ""
+ emotion: List[str] = []


  class OutputFormat(TypedDict):
@@ -23,7 +23,12 @@ import aiohttp
  import httpx
  import logging
  import requests
- from websockets.sync.client import connect
+ try:
+ from websockets.sync.client import connect
+ IS_WEBSOCKET_SYNC_AVAILABLE = True
+ except ImportError:
+ IS_WEBSOCKET_SYNC_AVAILABLE = False
+
  from iterators import TimeoutIterator

  from cartesia.utils.retry import retry_on_connection_error, retry_on_connection_error_async
@@ -208,37 +213,25 @@ class Voices(Resource):
  return response.json()

  def clone(self, filepath: Optional[str] = None, link: Optional[str] = None) -> List[float]:
- """Clone a voice from a clip or a URL.
+ """Clone a voice from a clip.

  Args:
  filepath: The path to the clip file.
- link: The URL to the clip

  Returns:
  The embedding of the cloned voice as a list of floats.
  """
  # TODO: Python has a bytes object, use that instead of a filepath
- if not filepath and not link:
- raise ValueError("At least one of 'filepath' or 'link' must be specified.")
- if filepath and link:
- raise ValueError("Only one of 'filepath' or 'link' should be specified.")
- if filepath:
- url = f"{self._http_url()}/voices/clone/clip"
- with open(filepath, "rb") as file:
- files = {"clip": file}
- headers = self.headers.copy()
- headers.pop("Content-Type", None)
- response = httpx.post(url, headers=headers, files=files, timeout=self.timeout)
- if not response.is_success:
- raise ValueError(f"Failed to clone voice from clip. Error: {response.text}")
- elif link:
- url = f"{self._http_url()}/voices/clone/url"
- params = {"link": link}
+ if not filepath:
+ raise ValueError("Filepath must be specified.")
+ url = f"{self._http_url()}/voices/clone/clip"
+ with open(filepath, "rb") as file:
+ files = {"clip": file}
  headers = self.headers.copy()
- headers.pop("Content-Type") # The content type header is not required for URLs
- response = httpx.post(url, headers=self.headers, params=params, timeout=self.timeout)
+ headers.pop("Content-Type", None)
+ response = httpx.post(url, headers=headers, files=files, timeout=self.timeout)
  if not response.is_success:
- raise ValueError(f"Failed to clone voice from URL. Error: {response.text}")
+ raise ValueError(f"Failed to clone voice from clip. Error: {response.text}")

  return response.json()["embedding"]

@@ -328,7 +321,11 @@ class _TTSContext:

  self._websocket.connect()

- voice = _validate_and_construct_voice(voice_id, voice_embedding=voice_embedding, experimental_voice_controls = _experimental_voice_controls)
+ voice = TTS._validate_and_construct_voice(
+ voice_id,
+ voice_embedding=voice_embedding,
+ experimental_voice_controls=_experimental_voice_controls,
+ )

  # Create the initial request body
  request_body = {
@@ -465,6 +462,10 @@ class _WebSocket:
  Raises:
  RuntimeError: If the connection to the WebSocket fails.
  """
+ if not IS_WEBSOCKET_SYNC_AVAILABLE:
+ raise ImportError(
+ "The synchronous WebSocket client is not available. Please ensure that you have 'websockets>=12.0' or compatible version installed."
+ )
  if self.websocket is None or self._is_websocket_closed():
  route = "tts/websocket"
  try:
@@ -493,7 +494,7 @@ class _WebSocket:
  out["audio"] = base64.b64decode(response["data"])
  elif response["type"] == EventType.TIMESTAMPS:
  out["word_timestamps"] = response["word_timestamps"]
-
+
  if include_context_id:
  out["context_id"] = response["context_id"]

@@ -541,7 +542,11 @@ class _WebSocket:
  if context_id is None:
  context_id = str(uuid.uuid4())

- voice = _validate_and_construct_voice(voice_id, voice_embedding=voice_embedding, experimental_voice_controls = _experimental_voice_controls)
+ voice = TTS._validate_and_construct_voice(
+ voice_id,
+ voice_embedding=voice_embedding,
+ experimental_voice_controls=_experimental_voice_controls,
+ )

  request_body = {
  "model_id": model_id,
@@ -681,7 +686,11 @@ class _SSE:
  Both the generator and the dictionary contain the following key(s):
  - audio: The audio as bytes.
  """
- voice = _validate_and_construct_voice(voice_id, voice_embedding=voice_embedding, experimental_voice_controls=_experimental_voice_controls)
+ voice = TTS._validate_and_construct_voice(
+ voice_id,
+ voice_embedding=voice_embedding,
+ experimental_voice_controls=_experimental_voice_controls,
+ )
  request_body = {
  "model_id": model_id,
  "transcript": transcript,
@@ -795,6 +804,7 @@ class TTS(Resource):
  sample_rate=output_format_obj["sample_rate"],
  )

+ @staticmethod
  def get_sample_rate(self, output_format_name: str) -> int:
  """Convenience method to get the sample rate for a given output format.

@@ -818,6 +828,40 @@ class TTS(Resource):

  return output_format_obj["sample_rate"]

+ @staticmethod
+ def _validate_and_construct_voice(
+ voice_id: Optional[str] = None,
+ voice_embedding: Optional[List[float]] = None,
+ experimental_voice_controls: Optional[VoiceControls] = None,
+ ) -> dict:
+ """Validate and construct the voice dictionary for the request.
+
+ Args:
+ voice_id: The ID of the voice to use for generating audio.
+ voice_embedding: The embedding of the voice to use for generating audio.
+ experimental_voice_controls: Voice controls for emotion and speed.
+ Note: This is an experimental feature and may rapidly change in the future.
+
+ Returns:
+ A dictionary representing the voice configuration.
+
+ Raises:
+ ValueError: If neither or both voice_id and voice_embedding are specified.
+ """
+ if voice_id is None and voice_embedding is None:
+ raise ValueError("Either voice_id or voice_embedding must be specified.")
+
+ if voice_id is not None and voice_embedding is not None:
+ raise ValueError("Only one of voice_id or voice_embedding should be specified.")
+
+ if voice_id:
+ voice = {"mode": "id", "id": voice_id}
+ else:
+ voice = {"mode": "embedding", "embedding": voice_embedding}
+ if experimental_voice_controls is not None:
+ voice["__experimental_controls"] = experimental_voice_controls
+ return voice
+

  class AsyncCartesia(Cartesia):
  """The asynchronous version of the Cartesia client."""
@@ -917,7 +961,11 @@ class _AsyncSSE(_SSE):
  stream: bool = True,
  _experimental_voice_controls: Optional[VoiceControls] = None,
  ) -> Union[bytes, AsyncGenerator[bytes, None]]:
- voice = _validate_and_construct_voice(voice_id, voice_embedding=voice_embedding,experimental_voice_controls=_experimental_voice_controls)
+ voice = TTS._validate_and_construct_voice(
+ voice_id,
+ voice_embedding=voice_embedding,
+ experimental_voice_controls=_experimental_voice_controls,
+ )

  request_body = {
  "model_id": model_id,
@@ -1042,7 +1090,9 @@ class _AsyncTTSContext:

  await self._websocket.connect()

- voice = _validate_and_construct_voice(voice_id, voice_embedding, experimental_voice_controls=_experimental_voice_controls)
+ voice = TTS._validate_and_construct_voice(
+ voice_id, voice_embedding, experimental_voice_controls=_experimental_voice_controls
+ )

  request_body = {
  "model_id": model_id,
@@ -1229,7 +1279,7 @@ class _AsyncWebSocket(_WebSocket):
  duration=duration,
  language=language,
  continue_=False,
- add_timestamps = add_timestamps,
+ add_timestamps=add_timestamps,
  _experimental_voice_controls=_experimental_voice_controls,
  )

@@ -1299,35 +1349,3 @@ class AsyncTTS(TTS):
  )
  await ws.connect()
  return ws
-
-
- def _validate_and_construct_voice(
- voice_id: Optional[str] = None, voice_embedding: Optional[List[float]] = None, experimental_voice_controls: Optional[VoiceControls] = None
- ) -> dict:
- """Validate and construct the voice dictionary for the request.
-
- Args:
- voice_id: The ID of the voice to use for generating audio.
- voice_embedding: The embedding of the voice to use for generating audio.
- experimental_voice_controls: Voice controls for emotion and speed.
- Note: This is an experimental feature and may rapidly change in the future.
-
- Returns:
- A dictionary representing the voice configuration.
-
- Raises:
- ValueError: If neither or both voice_id and voice_embedding are specified.
- """
- if voice_id is None and voice_embedding is None:
- raise ValueError("Either voice_id or voice_embedding must be specified.")
-
- if voice_id is not None and voice_embedding is not None:
- raise ValueError("Only one of voice_id or voice_embedding should be specified.")
-
- if voice_id:
- voice = {"mode": "id", "id": voice_id}
- else:
- voice = {"mode": "embedding", "embedding": voice_embedding}
- if experimental_voice_controls is not None:
- voice["__experimental_controls"] = experimental_voice_controls
- return voice
@@ -0,0 +1 @@
+ __version__ = "1.0.8"
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: cartesia
- Version: 1.0.6
+ Version: 1.0.8
  Summary: The official Python library for the Cartesia API.
  Home-page:
  Author: Cartesia, Inc.
@@ -25,6 +25,22 @@ The official Cartesia Python library which provides convenient access to the Car
  > [!IMPORTANT]
  > The client library introduces breaking changes in v1.0.0, which was released on June 24th 2024. See the [release notes](https://github.com/cartesia-ai/cartesia-python/releases/tag/v1.0.0) and [migration guide](https://github.com/cartesia-ai/cartesia-python/discussions/44). Reach out to us on [Discord](https://discord.gg/ZVxavqHB9X) for any support requests!

+ - [Cartesia Python API Library](#cartesia-python-api-library)
+ - [Documentation](#documentation)
+ - [Installation](#installation)
+ - [Voices](#voices)
+ - [Text-to-Speech](#text-to-speech)
+ - [Server-Sent Events (SSE)](#server-sent-events-sse)
+ - [WebSocket](#websocket)
+ - [Conditioning speech on previous generations using WebSocket](#conditioning-speech-on-previous-generations-using-websocket)
+ - [Generating timestamps using WebSocket](#generating-timestamps-using-websocket)
+ - [Multilingual Text-to-Speech \[Alpha\]](#multilingual-text-to-speech-alpha)
+ - [Speed and Emotion Control \[Experimental\]](#speed-and-emotion-control-experimental)
+ - [Jupyter Notebook Usage](#jupyter-notebook-usage)
+ - [Utility methods](#utility-methods)
+ - [Output Formats](#output-formats)
+
+
  ## Documentation

  Our complete API documentation can be found [on docs.cartesia.ai](https://docs.cartesia.ai).
@@ -250,7 +266,7 @@ async def send_transcripts(ctx):

  # You can check out our models at https://docs.cartesia.ai/getting-started/available-models
  model_id = "sonic-english"
-
+
  # You can find the supported `output_format`s at https://docs.cartesia.ai/api-reference/endpoints/stream-speech-server-sent-events
  output_format = {
  "container": "raw",
@@ -266,7 +282,7 @@ async def send_transcripts(ctx):
  "As they near Eggman's lair, our heroes charge their abilities for an epic boss battle. ",
  "Get ready to spin, jump, and sound-blast your way to victory in this high-octane crossover!"
  ]
-
+
  for transcript in transcripts:
  # Send text inputs as they become available
  await ctx.send(
@@ -278,7 +294,7 @@ async def send_transcripts(ctx):
  )

  # Indicate that no more inputs will be sent. Otherwise, the context will close after 5 seconds of inactivity.
- await ctx.no_more_inputs()
+ await ctx.no_more_inputs()

  async def receive_and_play_audio(ctx):
  p = pyaudio.PyAudio()
@@ -384,7 +400,7 @@ output_stream = ctx.send(
  voice_id=voice_id,
  output_format=output_format,
  )
-
+
  for output in output_stream:
  buffer = output["audio"]

@@ -401,6 +417,34 @@ p.terminate()
  ws.close() # Close the websocket connection
  ```

+ ### Generating timestamps using WebSocket
+
+ The WebSocket endpoint supports timestamps, allowing you to get detailed timing information for each word in the transcript. To enable this feature, pass an `add_timestamps` boolean flag to the `send` method. The results are returned in the `word_timestamps` object, which contains three keys:
+ - words (list): The individual words in the transcript.
+ - start (list): The starting timestamp for each word (in seconds).
+ - end (list): The ending timestamp for each word (in seconds).
+
+ ```python
+ response = ws.send(
+ model_id=model_id,
+ transcript=transcript,
+ voice_id=voice_id,
+ output_format=output_format,
+ stream=False,
+ add_timestamps=True
+ )
+
+ # Accessing the word_timestamps object
+ word_timestamps = response['word_timestamps']
+
+ words = word_timestamps['words']
+ start_times = word_timestamps['start']
+ end_times = word_timestamps['end']
+
+ for word, start, end in zip(words, start_times, end_times):
+ print(f"Word: {word}, Start: {start}, End: {end}")
+ ```
+
  ### Multilingual Text-to-Speech [Alpha]

  You can use our `sonic-multilingual` model to generate audio in multiple languages. The languages supported are available at [docs.cartesia.ai](https://docs.cartesia.ai/getting-started/available-models).
@@ -454,6 +498,31 @@ stream.close()
  p.terminate()
  ```

+ ### Speed and Emotion Control [Experimental]
+
+ You can enhance the voice output by adjusting the `speed` and `emotion` parameters. To do this, pass a `_experimental_voice_controls` dictionary with the desired `speed` and `emotion` values to any `send` method.
+
+ Speed Options:
+ - `slowest`, `slow`, `normal`, `fast`, `fastest`
+
+ Emotion Options:
+ Use a list of tags in the format `emotion_name:level` where:
+ - Emotion Names: `anger`, `positivity`, `surprise`, `sadness`, `curiosity`
+ - Levels: `lowest`, `low`, (omit for medium level), `high`, `highest`
+ The emotion tag levels add the specified emotion to the voice at the indicated intensity, with the omission of a level tag resulting in a medium intensity.
+
+ ```python
+ ws.send(
+ model_id=model_id,
+ transcript=transcript,
+ voice_id=voice_id,
+ output_format=output_format,
+ _experimental_voice_controls={"speed": "fast", "emotion": ["positivity:high"]},
+ )
+ ```
+
+ ### Jupyter Notebook Usage
+
  If you are using Jupyter Notebook or JupyterLab, you can use IPython.display.Audio to play the generated audio directly in the notebook.
  Additionally, in these notebook examples we show how to use the client as a context manager (though this is not required).

@@ -79,14 +79,6 @@ def test_get_voice_from_id(client: Cartesia):
  voices = client.voices.list()
  assert voice in voices

- # Does not work currently, LB issue
- # def test_clone_voice_with_link(client: Cartesia):
- # url = "https://youtu.be/g2Z7Ddd573M?si=P8BM_hBqt5P8Ft6I&t=69"
- # logger.info("Testing voices.clone with link")
- # cloned_voice_embedding = client.voices.clone(link=url)
- # assert isinstance(cloned_voice_embedding, list)
- # assert len(cloned_voice_embedding) == 192
-
  def test_clone_voice_with_file(client: Cartesia):
  logger.info("Testing voices.clone with file")
  output = client.voices.clone(filepath=os.path.join(RESOURCES_DIR, "sample-speech-4s.wav"))
@@ -848,4 +840,4 @@ def _validate_schema(out):
  assert word_timestamps.keys() == {"words", "start", "end"}
  assert isinstance(word_timestamps["words"], list) and all(isinstance(word, str) for word in word_timestamps["words"])
  assert isinstance(word_timestamps["start"], list) and all(isinstance(start, (int, float)) for start in word_timestamps["start"])
- assert isinstance(word_timestamps["end"], list) and all(isinstance(end, (int, float)) for end in word_timestamps["end"])
+ assert isinstance(word_timestamps["end"], list) and all(isinstance(end, (int, float)) for end in word_timestamps["end"])
@@ -1 +0,0 @@
- __version__ = "1.0.6"
5 files without changes