lattifai 1.2.0__py3-none-any.whl → 1.2.2__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- lattifai/__init__.py +0 -24
- lattifai/alignment/__init__.py +10 -1
- lattifai/alignment/lattice1_aligner.py +66 -58
- lattifai/alignment/lattice1_worker.py +1 -6
- lattifai/alignment/punctuation.py +38 -0
- lattifai/alignment/segmenter.py +1 -1
- lattifai/alignment/sentence_splitter.py +350 -0
- lattifai/alignment/text_align.py +440 -0
- lattifai/alignment/tokenizer.py +91 -220
- lattifai/caption/__init__.py +82 -6
- lattifai/caption/caption.py +335 -1143
- lattifai/caption/formats/__init__.py +199 -0
- lattifai/caption/formats/base.py +211 -0
- lattifai/caption/formats/gemini.py +722 -0
- lattifai/caption/formats/json.py +194 -0
- lattifai/caption/formats/lrc.py +309 -0
- lattifai/caption/formats/nle/__init__.py +9 -0
- lattifai/caption/formats/nle/audition.py +561 -0
- lattifai/caption/formats/nle/avid.py +423 -0
- lattifai/caption/formats/nle/fcpxml.py +549 -0
- lattifai/caption/formats/nle/premiere.py +589 -0
- lattifai/caption/formats/pysubs2.py +642 -0
- lattifai/caption/formats/sbv.py +147 -0
- lattifai/caption/formats/tabular.py +338 -0
- lattifai/caption/formats/textgrid.py +193 -0
- lattifai/caption/formats/ttml.py +652 -0
- lattifai/caption/formats/vtt.py +469 -0
- lattifai/caption/parsers/__init__.py +9 -0
- lattifai/caption/{text_parser.py → parsers/text_parser.py} +4 -2
- lattifai/caption/standardize.py +636 -0
- lattifai/caption/utils.py +474 -0
- lattifai/cli/__init__.py +2 -1
- lattifai/cli/caption.py +108 -1
- lattifai/cli/transcribe.py +4 -9
- lattifai/cli/youtube.py +4 -1
- lattifai/client.py +48 -84
- lattifai/config/__init__.py +11 -1
- lattifai/config/alignment.py +9 -2
- lattifai/config/caption.py +267 -23
- lattifai/config/media.py +20 -0
- lattifai/diarization/__init__.py +41 -1
- lattifai/mixin.py +36 -18
- lattifai/transcription/base.py +6 -1
- lattifai/transcription/lattifai.py +19 -54
- lattifai/utils.py +81 -13
- lattifai/workflow/__init__.py +28 -4
- lattifai/workflow/file_manager.py +2 -5
- lattifai/youtube/__init__.py +43 -0
- lattifai/youtube/client.py +1170 -0
- lattifai/youtube/types.py +23 -0
- lattifai-1.2.2.dist-info/METADATA +615 -0
- lattifai-1.2.2.dist-info/RECORD +76 -0
- {lattifai-1.2.0.dist-info → lattifai-1.2.2.dist-info}/entry_points.txt +1 -2
- lattifai/caption/gemini_reader.py +0 -371
- lattifai/caption/gemini_writer.py +0 -173
- lattifai/cli/app_installer.py +0 -142
- lattifai/cli/server.py +0 -44
- lattifai/server/app.py +0 -427
- lattifai/workflow/youtube.py +0 -577
- lattifai-1.2.0.dist-info/METADATA +0 -1133
- lattifai-1.2.0.dist-info/RECORD +0 -57
- {lattifai-1.2.0.dist-info → lattifai-1.2.2.dist-info}/WHEEL +0 -0
- {lattifai-1.2.0.dist-info → lattifai-1.2.2.dist-info}/licenses/LICENSE +0 -0
- {lattifai-1.2.0.dist-info → lattifai-1.2.2.dist-info}/top_level.txt +0 -0
@@ -1,1133 +0,0 @@

Metadata-Version: 2.4
Name: lattifai
Version: 1.2.0
Summary: Lattifai Python SDK: Seamless Integration with Lattifai's Speech and Video AI Services
Author-email: Lattifai Technologies <tech@lattifai.com>
Maintainer-email: Lattice <tech@lattifai.com>
License: MIT License

Copyright (c) 2025 LattifAI.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Project-URL: Homepage, https://github.com/lattifai/lattifai-python
Project-URL: Documentation, https://github.com/lattifai/lattifai-python/blob/main/README.md
Project-URL: Bug Tracker, https://github.com/lattifai/lattifai-python/issues
Project-URL: Discussions, https://github.com/lattifai/lattifai-python/discussions
Project-URL: Changelog, https://github.com/lattifai/lattifai-python/blob/main/CHANGELOG.md
Keywords: lattifai,speech recognition,video analysis,ai,sdk,api client
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: Microsoft :: Windows
Classifier: Topic :: Multimedia :: Sound/Audio
Classifier: Topic :: Multimedia :: Video
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: <3.15,>=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: k2py>=0.2.1
Requires-Dist: lattifai-core>=0.6.0
Requires-Dist: lattifai-run>=1.0.1
Requires-Dist: python-dotenv
Requires-Dist: lhotse>=1.26.0
Requires-Dist: colorful>=0.5.6
Requires-Dist: pysubs2
Requires-Dist: praatio
Requires-Dist: tgt
Requires-Dist: onnx>=1.16.0
Requires-Dist: onnxruntime
Requires-Dist: msgpack
Requires-Dist: scipy!=1.16.3
Requires-Dist: g2p-phonemizer>=0.4.0
Requires-Dist: av
Requires-Dist: wtpsplit>=2.1.7
Requires-Dist: modelscope==1.33.0
Requires-Dist: OmniSenseVoice>=0.4.2
Requires-Dist: nemo_toolkit_asr[asr]>=2.7.0rc4
Requires-Dist: pyannote-audio-notorchdeps>=4.0.2
Requires-Dist: questionary>=2.0
Requires-Dist: yt-dlp
Requires-Dist: pycryptodome
Requires-Dist: google-genai>=1.22.0
Requires-Dist: fastapi>=0.111.0
Requires-Dist: uvicorn>=0.30.0
Requires-Dist: python-multipart>=0.0.9
Requires-Dist: jinja2>=3.1.4
Provides-Extra: numpy
Requires-Dist: numpy; extra == "numpy"
Provides-Extra: diarization
Requires-Dist: torch-audiomentations==0.12.0; extra == "diarization"
Requires-Dist: pyannote.audio>=4.0.2; extra == "diarization"
Provides-Extra: transcription
Requires-Dist: OmniSenseVoice>=0.4.0; extra == "transcription"
Requires-Dist: nemo_toolkit_asr[asr]>=2.7.0rc3; extra == "transcription"
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Requires-Dist: pytest-cov; extra == "test"
Requires-Dist: pytest-asyncio; extra == "test"
Requires-Dist: numpy; extra == "test"
Provides-Extra: all
Requires-Dist: numpy; extra == "all"
Requires-Dist: pytest; extra == "all"
Requires-Dist: pytest-cov; extra == "all"
Requires-Dist: pytest-asyncio; extra == "all"
Requires-Dist: pyannote.audio>=4.0.2; extra == "all"
Dynamic: license-file

<div align="center">
<img src="https://raw.githubusercontent.com/lattifai/lattifai-python/main/assets/logo.png" width=256>

[![PyPI Version](https://badge.fury.io/py/lattifai.svg)](https://badge.fury.io/py/lattifai)
[![Python Versions](https://img.shields.io/pypi/pyversions/lattifai)](https://pypi.org/project/lattifai)
[![Downloads](https://static.pepy.tech/badge/lattifai)](https://pepy.tech/project/lattifai)
</div>

<p align="center">
🌐 <a href="https://lattifai.com"><b>Official Website</b></a> &nbsp;&nbsp;|&nbsp;&nbsp; 🖥️ <a href="https://github.com/lattifai/lattifai-python">GitHub</a> &nbsp;&nbsp;|&nbsp;&nbsp; 🤗 <a href="https://huggingface.co/Lattifai/Lattice-1">Model</a> &nbsp;&nbsp;|&nbsp;&nbsp; 📑 <a href="https://lattifai.com/blogs">Blog</a> &nbsp;&nbsp;|&nbsp;&nbsp; <a href="https://discord.gg/kvF4WsBRK8"><img src="https://img.shields.io/badge/Discord-Join-5865F2?logo=discord&logoColor=white" alt="Discord" style="vertical-align: middle;"></a>
</p>


# LattifAI: Precision Alignment, Infinite Possibilities

Advanced forced alignment and subtitle generation powered by the [🤗 Lattice-1](https://huggingface.co/Lattifai/Lattice-1) model.

## Table of Contents

- [Core Capabilities](#core-capabilities)
- [Installation](#installation)
- [Quick Start](#quick-start)
  - [Command Line Interface](#command-line-interface)
  - [Python SDK (5 Lines of Code)](#python-sdk-5-lines-of-code)
  - [Web Interface](#web-interface)
- [CLI Reference](#cli-reference)
  - [lai alignment align](#lai-alignment-align)
  - [lai alignment youtube](#lai-alignment-youtube)
  - [lai transcribe run](#lai-transcribe-run)
  - [lai caption convert](#lai-caption-convert)
  - [lai caption shift](#lai-caption-shift)
- [Python SDK Reference](#python-sdk-reference)
  - [Basic Alignment](#basic-alignment)
  - [YouTube Processing](#youtube-processing)
  - [Configuration Objects](#configuration-objects)
- [Advanced Features](#advanced-features)
  - [Audio Preprocessing](#audio-preprocessing)
  - [Long-Form Audio Support](#long-form-audio-support)
  - [Word-Level Alignment](#word-level-alignment)
  - [Smart Sentence Splitting](#smart-sentence-splitting)
  - [Speaker Diarization](#speaker-diarization)
  - [YAML Configuration Files](#yaml-configuration-files)
- [Architecture Overview](#architecture-overview)
- [Performance & Optimization](#performance--optimization)
- [Supported Formats](#supported-formats)
- [Supported Languages](#supported-languages)
- [Roadmap](#roadmap)
- [Development](#development)

---

## Core Capabilities

LattifAI provides comprehensive audio-text alignment powered by the Lattice-1 model:

| Feature | Description | Status |
|---------|-------------|--------|
| **Forced Alignment** | Precise word-level and segment-level synchronization with audio | ✅ Production |
| **Multi-Model Transcription** | Gemini (100+ languages), Parakeet (24 languages), SenseVoice (5 languages) | ✅ Production |
| **Speaker Diarization** | Automatic multi-speaker identification with label preservation | ✅ Production |
| **Audio Preprocessing** | Multi-channel selection, device optimization (CPU/CUDA/MPS) | ✅ Production |
| **Streaming Mode** | Process audio up to 20 hours with minimal memory footprint | ✅ Production |
| **Smart Text Processing** | Intelligent sentence splitting and non-speech element separation | ✅ Production |
| **Universal Format Support** | 30+ caption/subtitle formats with text normalization | ✅ Production |
| **Configuration System** | YAML-based configs for reproducible workflows | ✅ Production |

**Key Highlights:**
- 🎯 **Accuracy**: State-of-the-art alignment precision with the Lattice-1 model
- 🌍 **Multilingual**: Support for 100+ languages via multiple transcription models
- 🚀 **Performance**: Hardware-accelerated processing with streaming support
- 🔧 **Flexible**: CLI, Python SDK, and Web UI interfaces
- 📦 **Production-Ready**: Battle-tested on diverse audio/video content

---

## Installation

### Step 1: Install SDK

**Using pip:**
```bash
pip install lattifai
```

**Using uv (Recommended - 10-100x faster):**
```bash
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a new project with uv
uv init my-project
cd my-project
source .venv/bin/activate

# Install LattifAI
uv pip install lattifai
```

### Step 2: Get Your API Key

**LattifAI API Key (Required)**

Get your **free API key** at [https://lattifai.com/dashboard/api-keys](https://lattifai.com/dashboard/api-keys)

**Option A: Environment variable (recommended)**
```bash
export LATTIFAI_API_KEY="lf_your_api_key_here"
```

**Option B: `.env` file**
```bash
# .env
LATTIFAI_API_KEY=lf_your_api_key_here
```
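
If you keep the key in a `.env` file, the snippet below is a minimal sketch that loads it explicitly with `python-dotenv` (a declared dependency of this package); whether `LattifAI()` auto-loads `.env` is not documented here, so loading it yourself is the safe pattern.

```python
# Minimal sketch: load .env explicitly before creating the client.
# Assumes python-dotenv (a declared dependency) and that LattifAI()
# reads LATTIFAI_API_KEY from the environment, as documented below.
from dotenv import load_dotenv

from lattifai import LattifAI

load_dotenv()  # populates LATTIFAI_API_KEY (and GEMINI_API_KEY) from .env
client = LattifAI()
```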

**Gemini API Key (Optional - for transcription)**

If you want to use Gemini models for transcription (e.g., `gemini-2.5-pro`), get your **free Gemini API key** at [https://aistudio.google.com/apikey](https://aistudio.google.com/apikey)

```bash
# Add to environment variable
export GEMINI_API_KEY="your_gemini_api_key_here"

# Or add to .env file
GEMINI_API_KEY=your_gemini_api_key_here  # AIzaSyxxxx
```

> **Note**: The Gemini API key is only required if you use Gemini models for transcription. It's not needed for alignment or when using other transcription models.

---

## Quick Start

### Command Line Interface

![CLI Demo](https://raw.githubusercontent.com/lattifai/lattifai-python/main/assets/lai_alignment.gif)

```bash
# Align local audio with subtitle
lai alignment align audio.wav subtitle.srt output.srt

# Download and align YouTube video
lai alignment youtube "https://youtube.com/watch?v=VIDEO_ID"
```

### Python SDK (5 Lines of Code)

```python
from lattifai import LattifAI

client = LattifAI()
caption = client.alignment(
    input_media="audio.wav",
    input_caption="subtitle.srt",
    output_caption_path="aligned.srt",
)
```

That's it! Your aligned subtitles are saved to `aligned.srt`.

### 🚧 Web Interface

![Web Interface](https://raw.githubusercontent.com/lattifai/lattifai-python/main/assets/lattifai_app.jpg)

1. **Install the web application (one-time setup):**
```bash
lai-app-install
```

This command will:
- Check if Node.js/npm is installed (and install if needed)
- Install frontend dependencies
- Build the application
- Set up the `lai-app` command globally

2. **Start the backend server:**
```bash
lai-server

# Custom port (default: 8001)
lai-server --port 9000

# Custom host
lai-server --host 127.0.0.1 --port 9000

# Production mode (disable auto-reload)
lai-server --no-reload
```

**Backend Server Options:**
- `-p, --port` - Server port (default: 8001)
- `--host` - Host address (default: 0.0.0.0)
- `--no-reload` - Disable auto-reload for production
- `-h, --help` - Show help message

3. **Start the frontend application:**
```bash
lai-app

# Custom port (default: 5173)
lai-app --port 8080

# Custom backend URL
lai-app --backend http://localhost:9000

# Don't auto-open browser
lai-app --no-open
```

**Frontend Application Options:**
- `-p, --port` - Frontend server port (default: 5173)
- `--backend` - Backend API URL (default: http://localhost:8001)
- `--no-open` - Don't automatically open browser
- `-h, --help` - Show help message

The web interface will automatically open in your browser at `http://localhost:5173`.

**Features:**
- ✅ **Drag-and-Drop Upload**: Visual file upload for audio/video and captions
- ✅ **Real-Time Progress**: Live alignment progress with detailed status
- ✅ **Multiple Transcription Models**: Gemini, Parakeet, SenseVoice selection

---

## CLI Reference

### Command Overview

| Command | Description |
|---------|-------------|
| `lai alignment align` | Align local audio/video with caption |
| `lai alignment youtube` | Download & align YouTube content |
| `lai transcribe run` | Transcribe audio/video or YouTube URL to caption |
| `lai transcribe align` | Transcribe audio/video and align with generated transcript |
| `lai caption convert` | Convert between caption formats |
| `lai caption normalize` | Clean and normalize caption text |
| `lai caption shift` | Shift caption timestamps |

### lai alignment align

```bash
# Basic usage
lai alignment align <audio> <caption> <output>

# Examples
lai alignment align audio.wav caption.srt output.srt
lai alignment align video.mp4 caption.vtt output.srt alignment.device=cuda
lai alignment align audio.wav caption.srt output.json \
    caption.split_sentence=true \
    caption.word_level=true
```

### lai alignment youtube

```bash
# Basic usage
lai alignment youtube <url>

# Examples
lai alignment youtube "https://youtube.com/watch?v=VIDEO_ID"
lai alignment youtube "https://youtube.com/watch?v=VIDEO_ID" \
    media.output_dir=~/Downloads \
    caption.output_path=aligned.srt \
    caption.split_sentence=true
```

### lai transcribe run

Perform automatic speech recognition (ASR) on audio/video files or YouTube URLs to generate timestamped transcriptions.

```bash
# Basic usage - local file
lai transcribe run <input> <output>

# Basic usage - YouTube URL
lai transcribe run <url> <output_dir>

# Examples - Local files
lai transcribe run audio.wav output.srt
lai transcribe run audio.mp4 output.ass \
    transcription.model_name=nvidia/parakeet-tdt-0.6b-v3

# Examples - YouTube URLs
lai transcribe run "https://youtube.com/watch?v=VIDEO_ID" output_dir=./output
lai transcribe run "https://youtube.com/watch?v=VIDEO_ID" output.ass output_dir=./output \
    transcription.model_name=gemini-2.5-pro \
    transcription.gemini_api_key=YOUR_GEMINI_API_KEY

# Full configuration with keyword arguments
lai transcribe run \
    input=audio.wav \
    output_caption=output.srt \
    channel_selector=average \
    transcription.device=cuda \
    transcription.model_name=iic/SenseVoiceSmall
```

**Parameters:**
- `input`: Path to audio/video file or YouTube URL (required)
- `output_caption`: Path for output caption file (for local files)
- `output_dir`: Directory for output files (for YouTube URLs, defaults to current directory)
- `media_format`: Media format for YouTube downloads (default: mp3)
- `channel_selector`: Audio channel selection - "average", "left", "right", or channel index (default: "average")
  - Note: Ignored when transcribing YouTube URLs with Gemini models
- `transcription`: Transcription configuration (model_name, device, language, gemini_api_key)

**Supported Transcription Models (More Coming Soon):**
- `gemini-2.5-pro` - Google Gemini API (requires API key)
  - Languages: 100+ languages including English, Chinese, Spanish, French, German, Japanese, Korean, Arabic, and more
- `gemini-3-pro-preview` - Google Gemini API (requires API key)
  - Languages: 100+ languages (same as gemini-2.5-pro)
- `nvidia/parakeet-tdt-0.6b-v3` - NVIDIA Parakeet model
  - Languages: Bulgarian (bg), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Finnish (fi), French (fr), German (de), Greek (el), Hungarian (hu), Italian (it), Latvian (lv), Lithuanian (lt), Maltese (mt), Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk), Slovenian (sl), Spanish (es), Swedish (sv), Russian (ru), Ukrainian (uk)
- `iic/SenseVoiceSmall` - Alibaba SenseVoice model
  - Languages: Chinese/Mandarin (zh), English (en), Japanese (ja), Korean (ko), Cantonese (yue)
- More models will be integrated in future releases

**Note:** For transcription with alignment on local files, use `lai transcribe align` instead.

### lai transcribe align

Transcribe an audio/video file and automatically align the generated transcript with the audio.

This command combines transcription and alignment in a single step, producing precisely aligned captions.

```bash
# Basic usage
lai transcribe align <input_media> <output_caption>

# Examples
lai transcribe align audio.wav output.srt
lai transcribe align audio.mp4 output.ass \
    transcription.model_name=nvidia/parakeet-tdt-0.6b-v3 \
    alignment.device=cuda

# Using Gemini transcription with alignment
lai transcribe align audio.wav output.srt \
    transcription.model_name=gemini-2.5-pro \
    transcription.gemini_api_key=YOUR_KEY \
    caption.split_sentence=true

# Full configuration
lai transcribe align \
    input_media=audio.wav \
    output_caption=output.srt \
    transcription.device=mps \
    transcription.model_name=iic/SenseVoiceSmall \
    alignment.device=cuda \
    caption.word_level=true
```

**Parameters:**
- `input_media`: Path to input audio/video file (required)
- `output_caption`: Path for output aligned caption file (required)
- `transcription`: Transcription configuration (model_name, device, language, gemini_api_key)
- `alignment`: Alignment configuration (model_name, device)
- `caption`: Caption formatting options (split_sentence, word_level, etc.)

### lai caption convert

```bash
lai caption convert input.srt output.vtt
lai caption convert input.srt output.json
# Enable normalization to clean HTML entities and special characters:
lai caption convert input.srt output.json normalize_text=true
```

### lai caption shift

```bash
lai caption shift input.srt output.srt 2.0   # Delay by 2 seconds
lai caption shift input.srt output.srt -1.5  # Advance by 1.5 seconds
```

---

## Python SDK Reference

### Basic Alignment

```python
from lattifai import LattifAI

# Initialize client (uses LATTIFAI_API_KEY from environment)
client = LattifAI()

# Align audio/video with subtitle
caption = client.alignment(
    input_media="audio.wav",           # Audio or video file
    input_caption="subtitle.srt",      # Input subtitle file
    output_caption_path="output.srt",  # Output aligned subtitle
    split_sentence=True,               # Enable smart sentence splitting
)

# Access alignment results
for segment in caption.supervisions:
    print(f"{segment.start:.2f}s - {segment.end:.2f}s: {segment.text}")
```

### YouTube Processing

```python
from lattifai import LattifAI

client = LattifAI()

# Download YouTube video and align with auto-downloaded subtitles
caption = client.youtube(
    url="https://youtube.com/watch?v=VIDEO_ID",
    output_dir="./downloads",
    output_caption_path="aligned.srt",
    split_sentence=True,
)
```

### Configuration Objects

LattifAI uses a config-driven architecture for fine-grained control:

#### ClientConfig - API Settings

```python
from lattifai import LattifAI, ClientConfig

client = LattifAI(
    client_config=ClientConfig(
        api_key="lf_your_api_key",  # Or use LATTIFAI_API_KEY env var
        timeout=30.0,
        max_retries=3,
    )
)
```

#### AlignmentConfig - Model Settings

```python
from lattifai import LattifAI, AlignmentConfig

client = LattifAI(
    alignment_config=AlignmentConfig(
        model_name="Lattifai/Lattice-1",
        device="cuda",  # "cpu", "cuda", "cuda:0", "mps"
    )
)
```

#### CaptionConfig - Subtitle Settings

```python
from lattifai import LattifAI, CaptionConfig

client = LattifAI(
    caption_config=CaptionConfig(
        split_sentence=True,            # Smart sentence splitting (default: False)
        word_level=True,                # Word-level timestamps (default: False)
        normalize_text=True,            # Clean HTML entities (default: True)
        include_speaker_in_text=False,  # Include speaker labels (default: True)
    )
)
```

#### Complete Configuration Example

```python
from lattifai import (
    LattifAI,
    ClientConfig,
    AlignmentConfig,
    CaptionConfig
)

client = LattifAI(
    client_config=ClientConfig(
        api_key="lf_your_api_key",
        timeout=60.0,
    ),
    alignment_config=AlignmentConfig(
        model_name="Lattifai/Lattice-1",
        device="cuda",
    ),
    caption_config=CaptionConfig(
        split_sentence=True,
        word_level=True,
        output_format="json",
    ),
)

caption = client.alignment(
    input_media="audio.wav",
    input_caption="subtitle.srt",
    output_caption_path="output.json",
)
```

### Available Exports

```python
from lattifai import (
    # Client classes
    LattifAI,
    # AsyncLattifAI,  # For async support

    # Config classes
    ClientConfig,
    AlignmentConfig,
    CaptionConfig,
    DiarizationConfig,
    MediaConfig,

    # I/O classes
    Caption,
)
```

---

## Advanced Features

### Audio Preprocessing

LattifAI provides powerful audio preprocessing capabilities for optimal alignment:

**Channel Selection**

Control which audio channel to process for stereo/multi-channel files:

```python
from lattifai import LattifAI

client = LattifAI()

# Use left channel only
caption = client.alignment(
    input_media="stereo.wav",
    input_caption="subtitle.srt",
    channel_selector="left",  # Options: "left", "right", "average", or channel index (0, 1, 2, ...)
)

# Average all channels (default)
caption = client.alignment(
    input_media="stereo.wav",
    input_caption="subtitle.srt",
    channel_selector="average",
)
```

**CLI Usage:**
```bash
# Use right channel
lai alignment align audio.wav subtitle.srt output.srt \
    media.channel_selector=right

# Use specific channel index
lai alignment align audio.wav subtitle.srt output.srt \
    media.channel_selector=1
```

**Device Management**

Optimize processing for your hardware:

```python
from lattifai import LattifAI, AlignmentConfig

# Use CUDA GPU
client = LattifAI(
    alignment_config=AlignmentConfig(device="cuda")
)

# Use specific GPU
client = LattifAI(
    alignment_config=AlignmentConfig(device="cuda:0")
)

# Use Apple Silicon MPS
client = LattifAI(
    alignment_config=AlignmentConfig(device="mps")
)

# Use CPU
client = LattifAI(
    alignment_config=AlignmentConfig(device="cpu")
)
```

**Supported Formats**
- **Audio**: WAV, MP3, M4A, AAC, FLAC, OGG, OPUS, AIFF, and more
- **Video**: MP4, MKV, MOV, WEBM, AVI, and more
- All formats supported by FFmpeg are compatible

### Long-Form Audio Support

LattifAI now supports processing long audio files (up to 20 hours) through streaming mode. Enable streaming by setting the `streaming_chunk_secs` parameter:

**Python SDK:**
```python
from lattifai import LattifAI

client = LattifAI()

# Enable streaming for long audio files
caption = client.alignment(
    input_media="long_audio.wav",
    input_caption="subtitle.srt",
    output_caption_path="output.srt",
    streaming_chunk_secs=600.0,  # Process in 10-minute chunks
)
```

**CLI:**
```bash
# Enable streaming with chunk size
lai alignment align long_audio.wav subtitle.srt output.srt \
    media.streaming_chunk_secs=300.0

# For YouTube videos
lai alignment youtube "https://youtube.com/watch?v=VIDEO_ID" \
    media.streaming_chunk_secs=300.0
```

**MediaConfig:**
```python
from lattifai import LattifAI, MediaConfig

client = LattifAI(
    media_config=MediaConfig(
        streaming_chunk_secs=600.0,  # Chunk duration in seconds (1-1800), default: 600 (10 minutes)
    )
)
```

**Technical Details:**

| Parameter | Description | Recommendation |
|-----------|-------------|----------------|
| **Default Value** | 600 seconds (10 minutes) | Good for most use cases |
| **Memory Impact** | Smaller chunks = less RAM usage | Adjust based on available RAM |
| **Accuracy Impact** | Virtually zero degradation | Our precise implementation preserves quality |

**Performance Characteristics:**
- ✅ **Near-Perfect Accuracy**: Streaming implementation maintains alignment precision
- 🚧 **Memory Efficient**: Process 20-hour audio with <10GB RAM (600-sec chunks)

### Word-Level Alignment

Enable `word_level=True` to get precise timestamps for each word:

```python
from lattifai import LattifAI, CaptionConfig

client = LattifAI(
    caption_config=CaptionConfig(word_level=True)
)

caption = client.alignment(
    input_media="audio.wav",
    input_caption="subtitle.srt",
    output_caption_path="output.json",  # JSON preserves word-level data
)

# Access word-level alignments
for segment in caption.alignments:
    if segment.alignment and "word" in segment.alignment:
        for word_item in segment.alignment["word"]:
            print(f"{word_item.start:.2f}s: {word_item.symbol} (confidence: {word_item.score:.2f})")
```

### Smart Sentence Splitting

The `split_sentence` option intelligently separates:
- Non-speech elements (`[APPLAUSE]`, `[MUSIC]`) from dialogue
- Multiple sentences within a single subtitle
- Speaker labels from content

```python
caption = client.alignment(
    input_media="audio.wav",
    input_caption="subtitle.srt",
    split_sentence=True,
)
```

### Speaker Diarization

Speaker diarization automatically identifies and labels different speakers in audio using state-of-the-art models.

**Core Capabilities:**
- 🎤 **Multi-Speaker Detection**: Automatically detect speaker changes in audio
- 🏷️ **Smart Labeling**: Assign speaker labels (SPEAKER_00, SPEAKER_01, etc.)
- 🔄 **Label Preservation**: Maintain existing speaker names from input captions
- 🤖 **Gemini Integration**: Extract speaker names intelligently during transcription

**How It Works:**

1. **Without Existing Labels**: System assigns generic labels (SPEAKER_00, SPEAKER_01)
2. **With Existing Labels**: System preserves your speaker names during alignment
   - Formats: `[Alice]`, `>> Bob:`, `SPEAKER_01:`, `Alice:` are all recognized (see the example below)
3. **Gemini Transcription**: When using Gemini models, speaker names are extracted from context
   - Example: "Hi, I'm Alice" → System labels as `Alice` instead of `SPEAKER_00`
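
For instance, an input SRT like the following illustrative snippet (the cues are made up) keeps its existing `Alice:` and `>> Bob:` labels after alignment instead of being relabeled `SPEAKER_00`/`SPEAKER_01`:

```srt
1
00:00:01,000 --> 00:00:03,500
Alice: Hi, I'm Alice. Welcome to the show.

2
00:00:03,600 --> 00:00:05,200
>> Bob: Thanks for having me.
```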

**Speaker Label Integration:**

The diarization engine intelligently matches detected speakers with existing labels:
- If input captions have speaker names → **Preserved during alignment**
- If Gemini transcription provides names → **Used for labeling**
- Otherwise → **Generic labels (SPEAKER_00, etc.) assigned**

🚧 **Future Enhancement:**
- **AI-Powered Speaker Name Inference**: An upcoming feature will use large language models combined with metadata (video title, description, context) to intelligently infer speaker names, making transcripts more human-readable and contextually accurate

**CLI:**
```bash
# Enable speaker diarization during alignment
lai alignment align audio.wav subtitle.srt output.srt \
    diarization.enabled=true

# With additional diarization settings
lai alignment align audio.wav subtitle.srt output.srt \
    diarization.enabled=true \
    diarization.device=cuda \
    diarization.min_speakers=2 \
    diarization.max_speakers=4

# For YouTube videos with diarization
lai alignment youtube "https://youtube.com/watch?v=VIDEO_ID" \
    diarization.enabled=true
```

**Python SDK:**
```python
from lattifai import LattifAI, DiarizationConfig

client = LattifAI(
    diarization_config=DiarizationConfig(enabled=True)
)

caption = client.alignment(
    input_media="audio.wav",
    input_caption="subtitle.srt",
    output_caption_path="output.srt",
)

# Access speaker information
for segment in caption.supervisions:
    print(f"[{segment.speaker}] {segment.text}")
```
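
The CLI section above also exposes `diarization.device`, `diarization.min_speakers`, and `diarization.max_speakers`. A sketch assuming `DiarizationConfig` accepts the same keys as constructor arguments (verify against your installed version):

```python
# Sketch only: assumes DiarizationConfig mirrors the CLI keys shown above
# (device, min_speakers, max_speakers) as constructor arguments.
from lattifai import LattifAI, DiarizationConfig

client = LattifAI(
    diarization_config=DiarizationConfig(
        enabled=True,
        device="cuda",    # assumed to mirror diarization.device
        min_speakers=2,   # assumed to mirror diarization.min_speakers
        max_speakers=4,   # assumed to mirror diarization.max_speakers
    )
)
```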

### YAML Configuration Files

🚧 **Under development**

Create reusable configuration files:

```yaml
# config/alignment.yaml
model_name: "Lattifai/Lattice-1"
device: "cuda"
batch_size: 1
```

```bash
lai alignment align audio.wav subtitle.srt output.srt \
    alignment=config/alignment.yaml
```

---

## Architecture Overview

LattifAI uses a modular, config-driven architecture for maximum flexibility:

```
┌──────────────────────────────────────────────────────────┐
│                      LattifAI Client                     │
├──────────────────────────────────────────────────────────┤
│ Configuration Layer (Config-Driven)                      │
│ ├── ClientConfig (API settings)                          │
│ ├── AlignmentConfig (Model & device)                     │
│ ├── CaptionConfig (I/O formats)                          │
│ ├── TranscriptionConfig (ASR models)                     │
│ └── DiarizationConfig (Speaker detection)                │
├──────────────────────────────────────────────────────────┤
│ Core Components                                          │
│ ├── AudioLoader   → Load & preprocess audio              │
│ ├── Aligner       → Lattice-1 forced alignment           │
│ ├── Transcriber   → Multi-model ASR                      │
│ ├── Diarizer      → Speaker identification               │
│ └── Tokenizer     → Intelligent text segmentation        │
├──────────────────────────────────────────────────────────┤
│ Data Flow                                                │
│ Input → AudioLoader → Aligner → Diarizer → Caption       │
│                         ↓                                │
│                  Transcriber (optional)                  │
└──────────────────────────────────────────────────────────┘
```

**Component Responsibilities:**

| Component | Purpose | Configuration |
|-----------|---------|---------------|
| **AudioLoader** | Load audio/video, channel selection, format conversion | `MediaConfig` |
| **Aligner** | Forced alignment using Lattice-1 model | `AlignmentConfig` |
| **Transcriber** | ASR with Gemini/Parakeet/SenseVoice | `TranscriptionConfig` |
| **Diarizer** | Speaker diarization with pyannote.audio | `DiarizationConfig` |
| **Tokenizer** | Sentence splitting and text normalization | `CaptionConfig` |
| **Caption** | Unified data structure for alignments | `CaptionConfig` |

**Data Flow:**

1. **Audio Loading**: `AudioLoader` loads media, applies channel selection, converts to numpy array
2. **Transcription** (optional): `Transcriber` generates transcript if no caption provided
3. **Text Preprocessing**: `Tokenizer` splits sentences and normalizes text
4. **Alignment**: `Aligner` uses Lattice-1 to compute word-level timestamps
5. **Diarization** (optional): `Diarizer` identifies speakers and assigns labels
6. **Output**: `Caption` object contains all results, exported to the desired format (see the sketch below)
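
The sketch below mirrors that flow end to end using only the configuration objects documented in this README; the config values are illustrative.

```python
# End-to-end sketch of the data flow above, composed from documented configs.
from lattifai import (
    AlignmentConfig,
    CaptionConfig,
    DiarizationConfig,
    LattifAI,
)

client = LattifAI(
    caption_config=CaptionConfig(split_sentence=True),   # step 3: Tokenizer
    alignment_config=AlignmentConfig(device="cuda"),     # step 4: Aligner
    diarization_config=DiarizationConfig(enabled=True),  # step 5: Diarizer
)

# Steps 1-6: load audio, preprocess text, align, diarize, export.
caption = client.alignment(
    input_media="audio.wav",
    input_caption="subtitle.srt",
    output_caption_path="aligned.srt",
)
```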

**Configuration Philosophy:**
- ✅ **Declarative**: Describe what you want, not how to do it
- ✅ **Composable**: Mix and match configurations
- ✅ **Reproducible**: Save configs to YAML for consistent results
- ✅ **Flexible**: Override configs per-method or globally

---

## Performance & Optimization

### Device Selection

Choose the optimal device for your hardware:

```python
from lattifai import LattifAI, AlignmentConfig

# NVIDIA GPU (recommended for speed)
client = LattifAI(
    alignment_config=AlignmentConfig(device="cuda")
)

# Apple Silicon GPU
client = LattifAI(
    alignment_config=AlignmentConfig(device="mps")
)

# CPU (maximum compatibility)
client = LattifAI(
    alignment_config=AlignmentConfig(device="cpu")
)
```

**Performance Comparison** (30-minute audio):

| Device | Time |
|--------|------|
| CUDA (RTX 4090) | ~18 sec |
| MPS (M4) | ~26 sec |

### Memory Management

**Streaming Mode** for long audio:

```python
# Process 20-hour audio with <10GB RAM
caption = client.alignment(
    input_media="long_audio.wav",
    input_caption="subtitle.srt",
    streaming_chunk_secs=600.0,  # 10-minute chunks
)
```

**Memory Usage** (approximate):

| Chunk Size | Peak RAM | Suitable For |
|------------|----------|--------------|
| 600 sec | ~5 GB | Recommended |
| No streaming | ~10 GB+ | Short audio only |

### Optimization Tips

1. **Use GPU when available**: 10x faster than CPU
2. **WIP: Enable streaming for long audio**: Process 20+ hour files without OOM
3. **Choose appropriate chunk size**: Balance memory vs. performance
4. **Batch processing**: Process multiple files in sequence (coming soon)
5. **Profile alignment**: Set `client.profile=True` to identify bottlenecks (sketched below)
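
A tiny sketch for tip 5; the `profile` attribute is taken from the tip above, and nothing beyond it is assumed:

```python
from lattifai import LattifAI

client = LattifAI()
client.profile = True  # per tip 5; verify against your installed version

caption = client.alignment(
    input_media="audio.wav",
    input_caption="subtitle.srt",
    output_caption_path="aligned.srt",
)
```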

---

## Supported Formats

LattifAI supports virtually all common media and subtitle formats:

| Type | Formats |
|------|---------|
| **Audio** | WAV, MP3, M4A, AAC, FLAC, OGG, OPUS, AIFF, and more |
| **Video** | MP4, MKV, MOV, WEBM, AVI, and more |
| **Caption/Subtitle Input** | SRT, VTT, ASS, SSA, SUB, SBV, TXT, Gemini, and more |
| **Caption/Subtitle Output** | All input formats + TextGrid (Praat) |

**Tabular Formats:**
- **TSV**: Tab-separated values with optional speaker column
- **CSV**: Comma-separated values with optional speaker column
- **AUD**: Audacity labels format with `[[speaker]]` notation (illustrated below)
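
An illustrative Audacity label file using the `[[speaker]]` notation (tab-separated start/end times in seconds; the content is made up):

```
0.000000	2.500000	[[Alice]] Hello and welcome.
2.600000	4.800000	[[Bob]] Glad to be here.
```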

> **Note**: If a format is not listed above but commonly used, it's likely supported. Feel free to try it or reach out if you encounter any issues.

---

## Supported Languages

LattifAI supports multiple transcription models with different language capabilities:

### Gemini Models (100+ Languages)

**Models**: `gemini-2.5-pro`, `gemini-3-pro-preview`, `gemini-3-flash-preview`

**Supported Languages**: English, Chinese (Mandarin & Cantonese), Spanish, French, German, Italian, Portuguese, Japanese, Korean, Arabic, Russian, Hindi, Bengali, Turkish, Dutch, Polish, Swedish, Danish, Norwegian, Finnish, Greek, Hebrew, Thai, Vietnamese, Indonesian, Malay, Filipino, Ukrainian, Czech, Romanian, Hungarian, Swahili, Tamil, Telugu, Marathi, Gujarati, Kannada, and 70+ more languages.

> **Note**: Requires a Gemini API key from [Google AI Studio](https://aistudio.google.com/apikey)

### NVIDIA Parakeet (24 European Languages)

**Model**: `nvidia/parakeet-tdt-0.6b-v3`

**Supported Languages**:
- **Western Europe**: English (en), French (fr), German (de), Spanish (es), Italian (it), Portuguese (pt), Dutch (nl)
- **Nordic**: Danish (da), Swedish (sv), Norwegian (no), Finnish (fi)
- **Eastern Europe**: Polish (pl), Czech (cs), Slovak (sk), Hungarian (hu), Romanian (ro), Bulgarian (bg), Ukrainian (uk), Russian (ru)
- **Others**: Croatian (hr), Estonian (et), Latvian (lv), Lithuanian (lt), Slovenian (sl), Maltese (mt), Greek (el)

### Alibaba SenseVoice (5 Asian Languages)

**Model**: `iic/SenseVoiceSmall`

**Supported Languages**:
- Chinese/Mandarin (zh)
- English (en)
- Japanese (ja)
- Korean (ko)
- Cantonese (yue)

### Language Selection

```python
from lattifai import LattifAI, TranscriptionConfig

# Specify language for transcription
client = LattifAI(
    transcription_config=TranscriptionConfig(
        model_name="nvidia/parakeet-tdt-0.6b-v3",
        language="de",  # German
    )
)
```

**CLI Usage:**
```bash
lai transcribe run audio.wav output.srt \
    transcription.model_name=nvidia/parakeet-tdt-0.6b-v3 \
    transcription.language=de
```

> **Tip**: Use Gemini models for maximum language coverage, Parakeet for European languages, and SenseVoice for Asian languages.

---

## Roadmap

Visit our [LattifAI roadmap](https://lattifai.com/roadmap) for the latest updates.

| Date | Model Release | Features |
|------|---------------|----------|
| **Oct 2025** | **Lattice-1-Alpha** | ✅ English forced alignment<br>✅ Multi-format support<br>✅ CPU/GPU optimization |
| **Nov 2025** | **Lattice-1** | ✅ English + Chinese + German<br>✅ Mixed languages alignment<br>✅ Speaker Diarization<br>✅ Multi-model transcription (Gemini, Parakeet, SenseVoice)<br>✅ Web interface with React<br>🚧 Advanced segmentation strategies (entire/transcription/hybrid)<br>🚧 Audio event detection ([MUSIC], [APPLAUSE], etc.) |
| **Q1 2026** | **Lattice-2** | ✅ Streaming mode for long audio<br>🔮 40+ languages support<br>🔮 Real-time alignment |

**Legend**: ✅ Released | 🚧 In Development | 📋 Planned | 🔮 Future

---

## Development

### Setup

```bash
git clone https://github.com/lattifai/lattifai-python.git
cd lattifai-python

# Using uv (recommended)
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
source .venv/bin/activate

# Or using pip
pip install -e ".[test]"

pre-commit install
```

### Testing

```bash
pytest                      # Run all tests
pytest --cov=src            # With coverage
pytest tests/test_basic.py  # Specific test
```

---

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make changes and add tests
4. Run `pytest` and `pre-commit run`
5. Submit a pull request

## License

Apache License 2.0

## Support

- **Issues**: [GitHub Issues](https://github.com/lattifai/lattifai-python/issues)
- **Discussions**: [GitHub Discussions](https://github.com/lattifai/lattifai-python/discussions)
- **Discord**: [Join our community](https://discord.gg/kvF4WsBRK8)