cartesia 1.0.6.tar.gz → 1.0.8.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: cartesia
- Version: 1.0.6
+ Version: 1.0.8
  Summary: The official Python library for the Cartesia API.
  Home-page:
  Author: Cartesia, Inc.
@@ -25,6 +25,22 @@ The official Cartesia Python library which provides convenient access to the Car
  > [!IMPORTANT]
  > The client library introduces breaking changes in v1.0.0, which was released on June 24th 2024. See the [release notes](https://github.com/cartesia-ai/cartesia-python/releases/tag/v1.0.0) and [migration guide](https://github.com/cartesia-ai/cartesia-python/discussions/44). Reach out to us on [Discord](https://discord.gg/ZVxavqHB9X) for any support requests!

+ - [Cartesia Python API Library](#cartesia-python-api-library)
+ - [Documentation](#documentation)
+ - [Installation](#installation)
+ - [Voices](#voices)
+ - [Text-to-Speech](#text-to-speech)
+ - [Server-Sent Events (SSE)](#server-sent-events-sse)
+ - [WebSocket](#websocket)
+ - [Conditioning speech on previous generations using WebSocket](#conditioning-speech-on-previous-generations-using-websocket)
+ - [Generating timestamps using WebSocket](#generating-timestamps-using-websocket)
+ - [Multilingual Text-to-Speech \[Alpha\]](#multilingual-text-to-speech-alpha)
+ - [Speed and Emotion Control \[Experimental\]](#speed-and-emotion-control-experimental)
+ - [Jupyter Notebook Usage](#jupyter-notebook-usage)
+ - [Utility methods](#utility-methods)
+ - [Output Formats](#output-formats)
+
+
  ## Documentation

  Our complete API documentation can be found [on docs.cartesia.ai](https://docs.cartesia.ai).
@@ -250,7 +266,7 @@ async def send_transcripts(ctx):

  # You can check out our models at https://docs.cartesia.ai/getting-started/available-models
  model_id = "sonic-english"
-
+
  # You can find the supported `output_format`s at https://docs.cartesia.ai/api-reference/endpoints/stream-speech-server-sent-events
  output_format = {
  "container": "raw",
@@ -266,7 +282,7 @@ async def send_transcripts(ctx):
  "As they near Eggman's lair, our heroes charge their abilities for an epic boss battle. ",
  "Get ready to spin, jump, and sound-blast your way to victory in this high-octane crossover!"
  ]
-
+
  for transcript in transcripts:
  # Send text inputs as they become available
  await ctx.send(
@@ -278,7 +294,7 @@ async def send_transcripts(ctx):
  )

  # Indicate that no more inputs will be sent. Otherwise, the context will close after 5 seconds of inactivity.
- await ctx.no_more_inputs()
+ await ctx.no_more_inputs()

  async def receive_and_play_audio(ctx):
  p = pyaudio.PyAudio()
@@ -384,7 +400,7 @@ output_stream = ctx.send(
  voice_id=voice_id,
  output_format=output_format,
  )
-
+
  for output in output_stream:
  buffer = output["audio"]

@@ -401,6 +417,34 @@ p.terminate()
  ws.close() # Close the websocket connection
  ```

+ ### Generating timestamps using WebSocket
+
+ The WebSocket endpoint supports timestamps, allowing you to get detailed timing information for each word in the transcript. To enable this feature, pass an `add_timestamps` boolean flag to the `send` method. The results are returned in the `word_timestamps` object, which contains three keys:
+ - words (list): The individual words in the transcript.
+ - start (list): The starting timestamp for each word (in seconds).
+ - end (list): The ending timestamp for each word (in seconds).
+
+ ```python
+ response = ws.send(
+ model_id=model_id,
+ transcript=transcript,
+ voice_id=voice_id,
+ output_format=output_format,
+ stream=False,
+ add_timestamps=True
+ )
+
+ # Accessing the word_timestamps object
+ word_timestamps = response['word_timestamps']
+
+ words = word_timestamps['words']
+ start_times = word_timestamps['start']
+ end_times = word_timestamps['end']
+
+ for word, start, end in zip(words, start_times, end_times):
+ print(f"Word: {word}, Start: {start}, End: {end}")
+ ```
+
  ### Multilingual Text-to-Speech [Alpha]

  You can use our `sonic-multilingual` model to generate audio in multiple languages. The languages supported are available at [docs.cartesia.ai](https://docs.cartesia.ai/getting-started/available-models).
@@ -454,6 +498,31 @@ stream.close()
  p.terminate()
  ```

+ ### Speed and Emotion Control [Experimental]
+
+ You can enhance the voice output by adjusting the `speed` and `emotion` parameters. To do this, pass a `_experimental_voice_controls` dictionary with the desired `speed` and `emotion` values to any `send` method.
+
+ Speed Options:
+ - `slowest`, `slow`, `normal`, `fast`, `fastest`
+
+ Emotion Options:
+ Use a list of tags in the format `emotion_name:level` where:
+ - Emotion Names: `anger`, `positivity`, `surprise`, `sadness`, `curiosity`
+ - Levels: `lowest`, `low`, (omit for medium level), `high`, `highest`
+ The emotion tag levels add the specified emotion to the voice at the indicated intensity, with the omission of a level tag resulting in a medium intensity.
+
+ ```python
+ ws.send(
+ model_id=model_id,
+ transcript=transcript,
+ voice_id=voice_id,
+ output_format=output_format,
+ _experimental_voice_controls={"speed": "fast", "emotion": ["positivity:high"]},
+ )
+ ```
+
+ ### Jupyter Notebook Usage
+
  If you are using Jupyter Notebook or JupyterLab, you can use IPython.display.Audio to play the generated audio directly in the notebook.
  Additionally, in these notebook examples we show how to use the client as a context manager (though this is not required).

@@ -8,6 +8,22 @@ The official Cartesia Python library which provides convenient access to the Car
  > [!IMPORTANT]
  > The client library introduces breaking changes in v1.0.0, which was released on June 24th 2024. See the [release notes](https://github.com/cartesia-ai/cartesia-python/releases/tag/v1.0.0) and [migration guide](https://github.com/cartesia-ai/cartesia-python/discussions/44). Reach out to us on [Discord](https://discord.gg/ZVxavqHB9X) for any support requests!

+ - [Cartesia Python API Library](#cartesia-python-api-library)
+ - [Documentation](#documentation)
+ - [Installation](#installation)
+ - [Voices](#voices)
+ - [Text-to-Speech](#text-to-speech)
+ - [Server-Sent Events (SSE)](#server-sent-events-sse)
+ - [WebSocket](#websocket)
+ - [Conditioning speech on previous generations using WebSocket](#conditioning-speech-on-previous-generations-using-websocket)
+ - [Generating timestamps using WebSocket](#generating-timestamps-using-websocket)
+ - [Multilingual Text-to-Speech \[Alpha\]](#multilingual-text-to-speech-alpha)
+ - [Speed and Emotion Control \[Experimental\]](#speed-and-emotion-control-experimental)
+ - [Jupyter Notebook Usage](#jupyter-notebook-usage)
+ - [Utility methods](#utility-methods)
+ - [Output Formats](#output-formats)
+
+
  ## Documentation

  Our complete API documentation can be found [on docs.cartesia.ai](https://docs.cartesia.ai).
@@ -233,7 +249,7 @@ async def send_transcripts(ctx):

  # You can check out our models at https://docs.cartesia.ai/getting-started/available-models
  model_id = "sonic-english"
-
+
  # You can find the supported `output_format`s at https://docs.cartesia.ai/api-reference/endpoints/stream-speech-server-sent-events
  output_format = {
  "container": "raw",
@@ -249,7 +265,7 @@ async def send_transcripts(ctx):
  "As they near Eggman's lair, our heroes charge their abilities for an epic boss battle. ",
  "Get ready to spin, jump, and sound-blast your way to victory in this high-octane crossover!"
  ]
-
+
  for transcript in transcripts:
  # Send text inputs as they become available
  await ctx.send(
@@ -261,7 +277,7 @@ async def send_transcripts(ctx):
  )

  # Indicate that no more inputs will be sent. Otherwise, the context will close after 5 seconds of inactivity.
- await ctx.no_more_inputs()
+ await ctx.no_more_inputs()

  async def receive_and_play_audio(ctx):
  p = pyaudio.PyAudio()
@@ -367,7 +383,7 @@ output_stream = ctx.send(
  voice_id=voice_id,
  output_format=output_format,
  )
-
+
  for output in output_stream:
  buffer = output["audio"]

@@ -384,6 +400,34 @@ p.terminate()
  ws.close() # Close the websocket connection
  ```

+ ### Generating timestamps using WebSocket
+
+ The WebSocket endpoint supports timestamps, allowing you to get detailed timing information for each word in the transcript. To enable this feature, pass an `add_timestamps` boolean flag to the `send` method. The results are returned in the `word_timestamps` object, which contains three keys:
+ - words (list): The individual words in the transcript.
+ - start (list): The starting timestamp for each word (in seconds).
+ - end (list): The ending timestamp for each word (in seconds).
+
+ ```python
+ response = ws.send(
+ model_id=model_id,
+ transcript=transcript,
+ voice_id=voice_id,
+ output_format=output_format,
+ stream=False,
+ add_timestamps=True
+ )
+
+ # Accessing the word_timestamps object
+ word_timestamps = response['word_timestamps']
+
+ words = word_timestamps['words']
+ start_times = word_timestamps['start']
+ end_times = word_timestamps['end']
+
+ for word, start, end in zip(words, start_times, end_times):
+ print(f"Word: {word}, Start: {start}, End: {end}")
+ ```
+
  ### Multilingual Text-to-Speech [Alpha]

  You can use our `sonic-multilingual` model to generate audio in multiple languages. The languages supported are available at [docs.cartesia.ai](https://docs.cartesia.ai/getting-started/available-models).
@@ -437,6 +481,31 @@ stream.close()
  p.terminate()
  ```

+ ### Speed and Emotion Control [Experimental]
+
+ You can enhance the voice output by adjusting the `speed` and `emotion` parameters. To do this, pass a `_experimental_voice_controls` dictionary with the desired `speed` and `emotion` values to any `send` method.
+
+ Speed Options:
+ - `slowest`, `slow`, `normal`, `fast`, `fastest`
+
+ Emotion Options:
+ Use a list of tags in the format `emotion_name:level` where:
+ - Emotion Names: `anger`, `positivity`, `surprise`, `sadness`, `curiosity`
+ - Levels: `lowest`, `low`, (omit for medium level), `high`, `highest`
+ The emotion tag levels add the specified emotion to the voice at the indicated intensity, with the omission of a level tag resulting in a medium intensity.
+
+ ```python
+ ws.send(
+ model_id=model_id,
+ transcript=transcript,
+ voice_id=voice_id,
+ output_format=output_format,
+ _experimental_voice_controls={"speed": "fast", "emotion": ["positivity:high"]},
+ )
+ ```
+
+ ### Jupyter Notebook Usage
+
  If you are using Jupyter Notebook or JupyterLab, you can use IPython.display.Audio to play the generated audio directly in the notebook.
  Additionally, in these notebook examples we show how to use the client as a context manager (though this is not required).

@@ -45,7 +45,7 @@ class DeprecatedOutputFormatMapping:
  "mulaw_8000": {"container": "raw", "encoding": "pcm_mulaw", "sample_rate": 8000},
  "alaw_8000": {"container": "raw", "encoding": "pcm_alaw", "sample_rate": 8000},
  }
-
+
  @classmethod
  @deprecated(
  vdeprecated="1.0.1",
@@ -74,18 +74,19 @@ class VoiceControls(TypedDict):
  """Defines different voice control parameters for voice synthesis.

  For a complete list of supported parameters, refer to the Cartesia API documentation.
- https://docs.cartesia.ai/getting-started/welcome
+ https://docs.cartesia.ai/api-reference

  Examples:
  >>> {"speed": "fastest"}
- >>> {"speed": "slow", "emotion": "anger:high, positivity:low"}
- >>> {"emotion": "surprise:high, positivity:high"}
+ >>> {"speed": "slow", "emotion": ["sadness:high"]}
+ >>> {"emotion": ["surprise:highest", "curiosity"]}

  Note:
  This is an experimental class and is subject to rapid change in future versions.
  """
+
  speed: str = ""
- emotion: str = ""
+ emotion: List[str] = []


  class OutputFormat(TypedDict):
@@ -23,7 +23,12 @@ import aiohttp
  import httpx
  import logging
  import requests
- from websockets.sync.client import connect
+ try:
+ from websockets.sync.client import connect
+ IS_WEBSOCKET_SYNC_AVAILABLE = True
+ except ImportError:
+ IS_WEBSOCKET_SYNC_AVAILABLE = False
+
  from iterators import TimeoutIterator

  from cartesia.utils.retry import retry_on_connection_error, retry_on_connection_error_async
@@ -208,37 +213,25 @@ class Voices(Resource):
  return response.json()

  def clone(self, filepath: Optional[str] = None, link: Optional[str] = None) -> List[float]:
- """Clone a voice from a clip or a URL.
+ """Clone a voice from a clip.

  Args:
  filepath: The path to the clip file.
- link: The URL to the clip

  Returns:
  The embedding of the cloned voice as a list of floats.
  """
  # TODO: Python has a bytes object, use that instead of a filepath
- if not filepath and not link:
- raise ValueError("At least one of 'filepath' or 'link' must be specified.")
- if filepath and link:
- raise ValueError("Only one of 'filepath' or 'link' should be specified.")
- if filepath:
- url = f"{self._http_url()}/voices/clone/clip"
- with open(filepath, "rb") as file:
- files = {"clip": file}
- headers = self.headers.copy()
- headers.pop("Content-Type", None)
- response = httpx.post(url, headers=headers, files=files, timeout=self.timeout)
- if not response.is_success:
- raise ValueError(f"Failed to clone voice from clip. Error: {response.text}")
- elif link:
- url = f"{self._http_url()}/voices/clone/url"
- params = {"link": link}
+ if not filepath:
+ raise ValueError("Filepath must be specified.")
+ url = f"{self._http_url()}/voices/clone/clip"
+ with open(filepath, "rb") as file:
+ files = {"clip": file}
  headers = self.headers.copy()
- headers.pop("Content-Type") # The content type header is not required for URLs
- response = httpx.post(url, headers=self.headers, params=params, timeout=self.timeout)
+ headers.pop("Content-Type", None)
+ response = httpx.post(url, headers=headers, files=files, timeout=self.timeout)
  if not response.is_success:
- raise ValueError(f"Failed to clone voice from URL. Error: {response.text}")
+ raise ValueError(f"Failed to clone voice from clip. Error: {response.text}")

  return response.json()["embedding"]

@@ -328,7 +321,11 @@ class _TTSContext:

  self._websocket.connect()

- voice = _validate_and_construct_voice(voice_id, voice_embedding=voice_embedding, experimental_voice_controls = _experimental_voice_controls)
+ voice = TTS._validate_and_construct_voice(
+ voice_id,
+ voice_embedding=voice_embedding,
+ experimental_voice_controls=_experimental_voice_controls,
+ )

  # Create the initial request body
  request_body = {
@@ -465,6 +462,10 @@ class _WebSocket:
  Raises:
  RuntimeError: If the connection to the WebSocket fails.
  """
+ if not IS_WEBSOCKET_SYNC_AVAILABLE:
+ raise ImportError(
+ "The synchronous WebSocket client is not available. Please ensure that you have 'websockets>=12.0' or compatible version installed."
+ )
  if self.websocket is None or self._is_websocket_closed():
  route = "tts/websocket"
  try:
@@ -493,7 +494,7 @@ class _WebSocket:
  out["audio"] = base64.b64decode(response["data"])
  elif response["type"] == EventType.TIMESTAMPS:
  out["word_timestamps"] = response["word_timestamps"]
-
+
  if include_context_id:
  out["context_id"] = response["context_id"]

@@ -541,7 +542,11 @@ class _WebSocket:
  if context_id is None:
  context_id = str(uuid.uuid4())

- voice = _validate_and_construct_voice(voice_id, voice_embedding=voice_embedding, experimental_voice_controls = _experimental_voice_controls)
+ voice = TTS._validate_and_construct_voice(
+ voice_id,
+ voice_embedding=voice_embedding,
+ experimental_voice_controls=_experimental_voice_controls,
+ )

  request_body = {
  "model_id": model_id,
@@ -681,7 +686,11 @@ class _SSE:
  Both the generator and the dictionary contain the following key(s):
  - audio: The audio as bytes.
  """
- voice = _validate_and_construct_voice(voice_id, voice_embedding=voice_embedding, experimental_voice_controls=_experimental_voice_controls)
+ voice = TTS._validate_and_construct_voice(
+ voice_id,
+ voice_embedding=voice_embedding,
+ experimental_voice_controls=_experimental_voice_controls,
+ )
  request_body = {
  "model_id": model_id,
  "transcript": transcript,
@@ -795,6 +804,7 @@ class TTS(Resource):
  sample_rate=output_format_obj["sample_rate"],
  )

+ @staticmethod
  def get_sample_rate(self, output_format_name: str) -> int:
  """Convenience method to get the sample rate for a given output format.

@@ -818,6 +828,40 @@ class TTS(Resource):

  return output_format_obj["sample_rate"]

+ @staticmethod
+ def _validate_and_construct_voice(
+ voice_id: Optional[str] = None,
+ voice_embedding: Optional[List[float]] = None,
+ experimental_voice_controls: Optional[VoiceControls] = None,
+ ) -> dict:
+ """Validate and construct the voice dictionary for the request.
+
+ Args:
+ voice_id: The ID of the voice to use for generating audio.
+ voice_embedding: The embedding of the voice to use for generating audio.
+ experimental_voice_controls: Voice controls for emotion and speed.
+ Note: This is an experimental feature and may rapidly change in the future.
+
+ Returns:
+ A dictionary representing the voice configuration.
+
+ Raises:
+ ValueError: If neither or both voice_id and voice_embedding are specified.
+ """
+ if voice_id is None and voice_embedding is None:
+ raise ValueError("Either voice_id or voice_embedding must be specified.")
+
+ if voice_id is not None and voice_embedding is not None:
+ raise ValueError("Only one of voice_id or voice_embedding should be specified.")
+
+ if voice_id:
+ voice = {"mode": "id", "id": voice_id}
+ else:
+ voice = {"mode": "embedding", "embedding": voice_embedding}
+ if experimental_voice_controls is not None:
+ voice["__experimental_controls"] = experimental_voice_controls
+ return voice
+

  class AsyncCartesia(Cartesia):
  """The asynchronous version of the Cartesia client."""
@@ -917,7 +961,11 @@ class _AsyncSSE(_SSE):
  stream: bool = True,
  _experimental_voice_controls: Optional[VoiceControls] = None,
  ) -> Union[bytes, AsyncGenerator[bytes, None]]:
- voice = _validate_and_construct_voice(voice_id, voice_embedding=voice_embedding,experimental_voice_controls=_experimental_voice_controls)
+ voice = TTS._validate_and_construct_voice(
+ voice_id,
+ voice_embedding=voice_embedding,
+ experimental_voice_controls=_experimental_voice_controls,
+ )

  request_body = {
  "model_id": model_id,
@@ -1042,7 +1090,9 @@ class _AsyncTTSContext:

  await self._websocket.connect()

- voice = _validate_and_construct_voice(voice_id, voice_embedding, experimental_voice_controls=_experimental_voice_controls)
+ voice = TTS._validate_and_construct_voice(
+ voice_id, voice_embedding, experimental_voice_controls=_experimental_voice_controls
+ )

  request_body = {
  "model_id": model_id,
@@ -1229,7 +1279,7 @@ class _AsyncWebSocket(_WebSocket):
  duration=duration,
  language=language,
  continue_=False,
- add_timestamps = add_timestamps,
+ add_timestamps=add_timestamps,
  _experimental_voice_controls=_experimental_voice_controls,
  )

@@ -1299,35 +1349,3 @@ class AsyncTTS(TTS):
  )
  await ws.connect()
  return ws
-
-
- def _validate_and_construct_voice(
- voice_id: Optional[str] = None, voice_embedding: Optional[List[float]] = None, experimental_voice_controls: Optional[VoiceControls] = None
- ) -> dict:
- """Validate and construct the voice dictionary for the request.
-
- Args:
- voice_id: The ID of the voice to use for generating audio.
- voice_embedding: The embedding of the voice to use for generating audio.
- experimental_voice_controls: Voice controls for emotion and speed.
- Note: This is an experimental feature and may rapidly change in the future.
-
- Returns:
- A dictionary representing the voice configuration.
-
- Raises:
- ValueError: If neither or both voice_id and voice_embedding are specified.
- """
- if voice_id is None and voice_embedding is None:
- raise ValueError("Either voice_id or voice_embedding must be specified.")
-
- if voice_id is not None and voice_embedding is not None:
- raise ValueError("Only one of voice_id or voice_embedding should be specified.")
-
- if voice_id:
- voice = {"mode": "id", "id": voice_id}
- else:
- voice = {"mode": "embedding", "embedding": voice_embedding}
- if experimental_voice_controls is not None:
- voice["__experimental_controls"] = experimental_voice_controls
- return voice
@@ -0,0 +1 @@
+ __version__ = "1.0.8"
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: cartesia
- Version: 1.0.6
+ Version: 1.0.8
  Summary: The official Python library for the Cartesia API.
  Home-page:
  Author: Cartesia, Inc.
@@ -25,6 +25,22 @@ The official Cartesia Python library which provides convenient access to the Car
  > [!IMPORTANT]
  > The client library introduces breaking changes in v1.0.0, which was released on June 24th 2024. See the [release notes](https://github.com/cartesia-ai/cartesia-python/releases/tag/v1.0.0) and [migration guide](https://github.com/cartesia-ai/cartesia-python/discussions/44). Reach out to us on [Discord](https://discord.gg/ZVxavqHB9X) for any support requests!

+ - [Cartesia Python API Library](#cartesia-python-api-library)
+ - [Documentation](#documentation)
+ - [Installation](#installation)
+ - [Voices](#voices)
+ - [Text-to-Speech](#text-to-speech)
+ - [Server-Sent Events (SSE)](#server-sent-events-sse)
+ - [WebSocket](#websocket)
+ - [Conditioning speech on previous generations using WebSocket](#conditioning-speech-on-previous-generations-using-websocket)
+ - [Generating timestamps using WebSocket](#generating-timestamps-using-websocket)
+ - [Multilingual Text-to-Speech \[Alpha\]](#multilingual-text-to-speech-alpha)
+ - [Speed and Emotion Control \[Experimental\]](#speed-and-emotion-control-experimental)
+ - [Jupyter Notebook Usage](#jupyter-notebook-usage)
+ - [Utility methods](#utility-methods)
+ - [Output Formats](#output-formats)
+
+
  ## Documentation

  Our complete API documentation can be found [on docs.cartesia.ai](https://docs.cartesia.ai).
@@ -250,7 +266,7 @@ async def send_transcripts(ctx):

  # You can check out our models at https://docs.cartesia.ai/getting-started/available-models
  model_id = "sonic-english"
-
+
  # You can find the supported `output_format`s at https://docs.cartesia.ai/api-reference/endpoints/stream-speech-server-sent-events
  output_format = {
  "container": "raw",
@@ -266,7 +282,7 @@ async def send_transcripts(ctx):
  "As they near Eggman's lair, our heroes charge their abilities for an epic boss battle. ",
  "Get ready to spin, jump, and sound-blast your way to victory in this high-octane crossover!"
  ]
-
+
  for transcript in transcripts:
  # Send text inputs as they become available
  await ctx.send(
@@ -278,7 +294,7 @@ async def send_transcripts(ctx):
  )

  # Indicate that no more inputs will be sent. Otherwise, the context will close after 5 seconds of inactivity.
- await ctx.no_more_inputs()
+ await ctx.no_more_inputs()

  async def receive_and_play_audio(ctx):
  p = pyaudio.PyAudio()
@@ -384,7 +400,7 @@ output_stream = ctx.send(
  voice_id=voice_id,
  output_format=output_format,
  )
-
+
  for output in output_stream:
  buffer = output["audio"]

@@ -401,6 +417,34 @@ p.terminate()
  ws.close() # Close the websocket connection
  ```

+ ### Generating timestamps using WebSocket
+
+ The WebSocket endpoint supports timestamps, allowing you to get detailed timing information for each word in the transcript. To enable this feature, pass an `add_timestamps` boolean flag to the `send` method. The results are returned in the `word_timestamps` object, which contains three keys:
+ - words (list): The individual words in the transcript.
+ - start (list): The starting timestamp for each word (in seconds).
+ - end (list): The ending timestamp for each word (in seconds).
+
+ ```python
+ response = ws.send(
+ model_id=model_id,
+ transcript=transcript,
+ voice_id=voice_id,
+ output_format=output_format,
+ stream=False,
+ add_timestamps=True
+ )
+
+ # Accessing the word_timestamps object
+ word_timestamps = response['word_timestamps']
+
+ words = word_timestamps['words']
+ start_times = word_timestamps['start']
+ end_times = word_timestamps['end']
+
+ for word, start, end in zip(words, start_times, end_times):
+ print(f"Word: {word}, Start: {start}, End: {end}")
+ ```
+
  ### Multilingual Text-to-Speech [Alpha]

  You can use our `sonic-multilingual` model to generate audio in multiple languages. The languages supported are available at [docs.cartesia.ai](https://docs.cartesia.ai/getting-started/available-models).
@@ -454,6 +498,31 @@ stream.close()
  p.terminate()
  ```

+ ### Speed and Emotion Control [Experimental]
+
+ You can enhance the voice output by adjusting the `speed` and `emotion` parameters. To do this, pass a `_experimental_voice_controls` dictionary with the desired `speed` and `emotion` values to any `send` method.
+
+ Speed Options:
+ - `slowest`, `slow`, `normal`, `fast`, `fastest`
+
+ Emotion Options:
+ Use a list of tags in the format `emotion_name:level` where:
+ - Emotion Names: `anger`, `positivity`, `surprise`, `sadness`, `curiosity`
+ - Levels: `lowest`, `low`, (omit for medium level), `high`, `highest`
+ The emotion tag levels add the specified emotion to the voice at the indicated intensity, with the omission of a level tag resulting in a medium intensity.
+
+ ```python
+ ws.send(
+ model_id=model_id,
+ transcript=transcript,
+ voice_id=voice_id,
+ output_format=output_format,
+ _experimental_voice_controls={"speed": "fast", "emotion": ["positivity:high"]},
+ )
+ ```
+
+ ### Jupyter Notebook Usage
+
  If you are using Jupyter Notebook or JupyterLab, you can use IPython.display.Audio to play the generated audio directly in the notebook.
  Additionally, in these notebook examples we show how to use the client as a context manager (though this is not required).

@@ -79,14 +79,6 @@ def test_get_voice_from_id(client: Cartesia):
  voices = client.voices.list()
  assert voice in voices

- # Does not work currently, LB issue
- # def test_clone_voice_with_link(client: Cartesia):
- # url = "https://youtu.be/g2Z7Ddd573M?si=P8BM_hBqt5P8Ft6I&t=69"
- # logger.info("Testing voices.clone with link")
- # cloned_voice_embedding = client.voices.clone(link=url)
- # assert isinstance(cloned_voice_embedding, list)
- # assert len(cloned_voice_embedding) == 192
-
  def test_clone_voice_with_file(client: Cartesia):
  logger.info("Testing voices.clone with file")
  output = client.voices.clone(filepath=os.path.join(RESOURCES_DIR, "sample-speech-4s.wav"))
@@ -848,4 +840,4 @@ def _validate_schema(out):
  assert word_timestamps.keys() == {"words", "start", "end"}
  assert isinstance(word_timestamps["words"], list) and all(isinstance(word, str) for word in word_timestamps["words"])
  assert isinstance(word_timestamps["start"], list) and all(isinstance(start, (int, float)) for start in word_timestamps["start"])
- assert isinstance(word_timestamps["end"], list) and all(isinstance(end, (int, float)) for end in word_timestamps["end"])
+ assert isinstance(word_timestamps["end"], list) and all(isinstance(end, (int, float)) for end in word_timestamps["end"])
@@ -1 +0,0 @@
- __version__ = "1.0.6"
5 files without changes