phrasebook-fr-to-en 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- phrasebook_fr_to_en-0.1.0/LICENSE +21 -0
- phrasebook_fr_to_en-0.1.0/PKG-INFO +244 -0
- phrasebook_fr_to_en-0.1.0/README.md +223 -0
- phrasebook_fr_to_en-0.1.0/pyproject.toml +66 -0
- phrasebook_fr_to_en-0.1.0/src/phrasebook_fr_to_en/.ruff_cache/.gitignore +2 -0
- phrasebook_fr_to_en-0.1.0/src/phrasebook_fr_to_en/.ruff_cache/0.14.7/4335670677593933877 +0 -0
- phrasebook_fr_to_en-0.1.0/src/phrasebook_fr_to_en/.ruff_cache/CACHEDIR.TAG +1 -0
- phrasebook_fr_to_en-0.1.0/src/phrasebook_fr_to_en/__init__.py +1 -0
- phrasebook_fr_to_en-0.1.0/src/phrasebook_fr_to_en/__main__.py +3 -0
- phrasebook_fr_to_en-0.1.0/src/phrasebook_fr_to_en/cli.py +578 -0
- phrasebook_fr_to_en-0.1.0/src/phrasebook_fr_to_en/py.typed +0 -0
phrasebook_fr_to_en-0.1.0/LICENSE
@@ -0,0 +1,21 @@
The MIT License (MIT)

Copyright (c) 2026 Tony Aldon

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
phrasebook_fr_to_en-0.1.0/PKG-INFO
@@ -0,0 +1,244 @@
Metadata-Version: 2.4
Name: phrasebook-fr-to-en
Version: 0.1.0
Summary: Enrich French to English phrasebooks with OpenAI API.
Author: Tony Aldon
Author-email: Tony Aldon <tony@tonyaldon.com>
License-Expression: MIT
License-File: LICENSE
Requires-Dist: pandas>=2.3.3
Requires-Dist: numpy!=2.4.0
Requires-Dist: typer>=0.21.0
Requires-Dist: watchdog>=6.0.0
Requires-Dist: openai>=2.15.0
Requires-Dist: pydantic>=2.12.5
Requires-Dist: pytest ; extra == 'test'
Requires-Python: >=3.12
Project-URL: Homepage, https://github.com/tonyaldon/phrasebook-fr-to-en
Project-URL: Repository, https://github.com/tonyaldon/phrasebook-fr-to-en
Provides-Extra: test
Description-Content-Type: text/markdown

# phrasebook-fr-to-en

`phrasebook-fr-to-en` is a CLI that uses the OpenAI API to enrich
French to English phrasebooks with AI generated translations, audios,
and images.

## Installing and running

It's a Python program. You can install it as an `uv` tool like this:

```
uv tool install phrasebook-fr-to-en
```

After you generate an OpenAI API key, and with your original French to
English translations in the file `my-phrasebook.tsv`, you can generate
new ones, along with audios and images, by running this:

```
export OPENAI_API_KEY=<your-api-key>
phrasebook-fr-to-en my-phrasebook.tsv
```

This creates the file `enriched_phrasebook.tsv` with all translations.
It also saves the audios and images in the directory `media`.
Both `enriched_phrasebook.tsv` and `media` sit next to your phrasebook
file.

Your original phrasebook is left unchanged.

Run the following to list the options and their documentation:

```
phrasebook-fr-to-en --help
```

## OpenAI API [IMPORTANT]

This program uses the OpenAI API with the following models:

- https://platform.openai.com/docs/models/gpt-5.2
- https://platform.openai.com/docs/models/gpt-4o-mini-tts
- https://platform.openai.com/docs/models/gpt-image-1.5

To use it, you need to register with OpenAI. You also need to be
verified as an organization (required for the image model). Then
create an API key: https://platform.openai.com.

Once you've done this, set `OPENAI_API_KEY` as an environment variable
before you run the program, like this:

```
export OPENAI_API_KEY=<your-api-key>
```

## Format of your original phrasebook

Your original phrasebook must be a TSV file (TAB separation) with the
columns `date`, `french`, and `english`, like in this example:

```
date french english
2025-11-15 Montez les escaliers. Climb the stairs.
2025-11-16 Il est beau. He is handsome.
2025-01-17 Il est moche. He is ugly.
```
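
A quick way to check that a file has this shape before spending API credits is to load it the same way the CLI does (see `read_phrasebook` in `cli.py`): read it as TAB-separated and compare the header against `["date", "french", "english"]`. The helper below is only an illustrative sketch; the column check is the part that mirrors what the CLI itself enforces.

```
# check_phrasebook.py -- illustrative pre-flight check for a phrasebook TSV.
import sys

import pandas as pd

EXPECTED = ["date", "french", "english"]


def check(path: str) -> None:
    # Same loading strategy as the CLI: TAB-separated, string dtype.
    df = pd.read_csv(path, sep="\t", dtype="string")
    if list(df.columns) != EXPECTED:
        sys.exit(f"Invalid header in {path}: expected {EXPECTED}, got {list(df.columns)}")
    print(f"{path}: {len(df)} records look fine")


if __name__ == "__main__":
    check(sys.argv[1])
```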

## Full example

`phrasebook-fr-to-en` takes a TSV (TAB separation) file as input.
Each row is a French to English translation. It uses the following
columns: `date`, `french`, `english`.

For each translation (each row), two new related translations are
generated. The goal is to show:

- An English grammar point, or
- Useful nouns, verbs, or
- Alternative phrasing (formality, slang, etc.).

These new translations, along with the original, are saved in the file
`enriched_phrasebook.tsv`. It sits next to your phrasebook file.
Records in your original phrasebook whose english field matches a
record in the enriched phrasebook are skipped.

Your original phrasebook is left unchanged.

For all translations (original and AI generated), an English audio and
an image are generated. They are saved in a `media` directory next to
the original phrasebook file.

For instance, if `my-phrasebook.tsv` contains the following record
(columns separated by tabs)

```
date french english
2025-11-15 Montez les escaliers. Climb the stairs.
```

and you run the following commands:

```
export OPENAI_API_KEY=<your-api-key>
phrasebook-fr-to-en my-phrasebook.tsv
```

This produces the file `enriched_phrasebook.tsv` with AI generated
translations. It has the following columns: `french`, `english`,
`anki_audio`, `anki_img`, `generated_from`, `id`, `audio_filename`,
`img_filename`, `date`.

```
french english anki_audio anki_img generated_from id audio_filename img_filename date
Montez les escaliers. Climb the stairs. [sound:phrasebook-fr-to-en-1.mp3] "<img src=""phrasebook-fr-to-en-1.png"">" 1 phrasebook-fr-to-en-1.mp3 phrasebook-fr-to-en-1.png 2025-11-15
Montez les escaliers jusqu’au premier étage. Climb the stairs up to the first floor. [sound:phrasebook-fr-to-en-2.mp3] "<img src=""phrasebook-fr-to-en-2.png"">" 1 2 phrasebook-fr-to-en-2.mp3 phrasebook-fr-to-en-2.png 2025-11-15
Prenez les escaliers, c’est juste à gauche. Take the stairs; it’s just on the left. [sound:phrasebook-fr-to-en-3.mp3] "<img src=""phrasebook-fr-to-en-3.png"">" 1 3 phrasebook-fr-to-en-3.mp3 phrasebook-fr-to-en-3.png 2025-11-15
```
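
If you want to post-process the enriched file yourself, the `generated_from` column links each AI generated row back to the `id` of the original record it was derived from. A minimal sketch (the grouping logic is illustrative; the columns are exactly the ones listed above):

```
import pandas as pd

df = pd.read_csv("enriched_phrasebook.tsv", sep="\t", dtype="string")
df["id"] = df["id"].astype("Int64")
df["generated_from"] = df["generated_from"].astype("Int64")

# Original records have an empty generated_from field.
originals = df[df["generated_from"].isna()]
for _, original in originals.iterrows():
    related = df[df["generated_from"] == original["id"]]
    print(original["english"])
    for _, row in related.iterrows():
        print("  related:", row["english"])
```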

This also generates 3 audios and 3 images.

Your directory then looks like this:

```
.
├── enriched_phrasebook.tsv
├── my-phrasebook.tsv
└── media
    ├── phrasebook-fr-to-en-1.mp3
    ├── phrasebook-fr-to-en-1.png
    ├── phrasebook-fr-to-en-2.mp3
    ├── phrasebook-fr-to-en-2.png
    ├── phrasebook-fr-to-en-3.mp3
    └── phrasebook-fr-to-en-3.png
```

## For Anki users

In the previous example, did you notice the columns `anki_audio` and
`anki_img`? They contain formatted fields for audio and image that
you can use directly in your [Anki](https://apps.ankiweb.net/) decks:

```
[sound:phrasebook-fr-to-en-1.mp3]
<img src="phrasebook-fr-to-en-1.png">
```

This way you can import `enriched_phrasebook.tsv` directly into Anki.
No changes are needed to get audio played and images displayed.

Note that this only works:

1. If you enable the "Allow HTML" option when importing the enriched file,
2. If you copy the audios and images from the `media` directory to
   your Anki `collection.media` directory.

See Anki docs:

- https://docs.ankiweb.net/importing/text-files.html#importing-media
- https://docs.ankiweb.net/files.html

## Dev

### Installing and running from source

To run `phrasebook-fr-to-en` from the source, run this:

```
uv run src/phrasebook_fr_to_en/cli.py phrasebook.tsv
```

Alternatively, `phrasebook-fr-to-en` can be installed as an `uv` tool
from the source like this:

```
uv tool install .
```

Then run it like this:

```
phrasebook-fr-to-en my-phrasebook.tsv
```

### Running the tests

Run the tests like this:

```
uv run pytest
```

To run the tests with real calls to the OpenAI API, run this:

```
OPENAI_LIVE=1 uv run pytest
```

For this to work, the `OPENAI_API_KEY` environment variable must be
set to an OpenAI API key. This variable can also be declared in a
`.env` file.

### Test coverage

This software has 100% test coverage.

To check this, you can run the following commands:

```
OPENAI_LIVE=1 uv run coverage run -m pytest
uv run coverage report
```

As mentioned above, `OPENAI_API_KEY` must be set prior to running the
tests with coverage.

### Linter + formatting

With `ty` and `ruff` installed, run:

```
ty check
ruff format
```
phrasebook_fr_to_en-0.1.0/README.md
@@ -0,0 +1,223 @@
# phrasebook-fr-to-en

`phrasebook-fr-to-en` is a CLI that uses the OpenAI API to enrich
French to English phrasebooks with AI generated translations, audios,
and images.

## Installing and running

It's a Python program. You can install it as an `uv` tool like this:

```
uv tool install phrasebook-fr-to-en
```

After you generate an OpenAI API key, and with your original French to
English translations in the file `my-phrasebook.tsv`, you can generate
new ones, along with audios and images, by running this:

```
export OPENAI_API_KEY=<your-api-key>
phrasebook-fr-to-en my-phrasebook.tsv
```

This creates the file `enriched_phrasebook.tsv` with all translations.
It also saves the audios and images in the directory `media`.
Both `enriched_phrasebook.tsv` and `media` sit next to your phrasebook
file.

Your original phrasebook is left unchanged.

Run the following to list the options and their documentation:

```
phrasebook-fr-to-en --help
```

## OpenAI API [IMPORTANT]

This program uses the OpenAI API with the following models:

- https://platform.openai.com/docs/models/gpt-5.2
- https://platform.openai.com/docs/models/gpt-4o-mini-tts
- https://platform.openai.com/docs/models/gpt-image-1.5

To use it, you need to register with OpenAI. You also need to be
verified as an organization (required for the image model). Then
create an API key: https://platform.openai.com.

Once you've done this, set `OPENAI_API_KEY` as an environment variable
before you run the program, like this:

```
export OPENAI_API_KEY=<your-api-key>
```

## Format of your original phrasebook

Your original phrasebook must be a TSV file (TAB separation) with the
columns `date`, `french`, and `english`, like in this example:

```
date french english
2025-11-15 Montez les escaliers. Climb the stairs.
2025-11-16 Il est beau. He is handsome.
2025-01-17 Il est moche. He is ugly.
```

## Full example

`phrasebook-fr-to-en` takes a TSV (TAB separation) file as input.
Each row is a French to English translation. It uses the following
columns: `date`, `french`, `english`.

For each translation (each row), two new related translations are
generated. The goal is to show:

- An English grammar point, or
- Useful nouns, verbs, or
- Alternative phrasing (formality, slang, etc.).

These new translations, along with the original, are saved in the file
`enriched_phrasebook.tsv`. It sits next to your phrasebook file.
Records in your original phrasebook whose english field matches a
record in the enriched phrasebook are skipped.

Your original phrasebook is left unchanged.

For all translations (original and AI generated), an English audio and
an image are generated. They are saved in a `media` directory next to
the original phrasebook file.

For instance, if `my-phrasebook.tsv` contains the following record
(columns separated by tabs)

```
date french english
2025-11-15 Montez les escaliers. Climb the stairs.
```

and you run the following commands:

```
export OPENAI_API_KEY=<your-api-key>
phrasebook-fr-to-en my-phrasebook.tsv
```

This produces the file `enriched_phrasebook.tsv` with AI generated
translations. It has the following columns: `french`, `english`,
`anki_audio`, `anki_img`, `generated_from`, `id`, `audio_filename`,
`img_filename`, `date`.

```
french english anki_audio anki_img generated_from id audio_filename img_filename date
Montez les escaliers. Climb the stairs. [sound:phrasebook-fr-to-en-1.mp3] "<img src=""phrasebook-fr-to-en-1.png"">" 1 phrasebook-fr-to-en-1.mp3 phrasebook-fr-to-en-1.png 2025-11-15
Montez les escaliers jusqu’au premier étage. Climb the stairs up to the first floor. [sound:phrasebook-fr-to-en-2.mp3] "<img src=""phrasebook-fr-to-en-2.png"">" 1 2 phrasebook-fr-to-en-2.mp3 phrasebook-fr-to-en-2.png 2025-11-15
Prenez les escaliers, c’est juste à gauche. Take the stairs; it’s just on the left. [sound:phrasebook-fr-to-en-3.mp3] "<img src=""phrasebook-fr-to-en-3.png"">" 1 3 phrasebook-fr-to-en-3.mp3 phrasebook-fr-to-en-3.png 2025-11-15
```

This also generates 3 audios and 3 images.

Your directory then looks like this:

```
.
├── enriched_phrasebook.tsv
├── my-phrasebook.tsv
└── media
    ├── phrasebook-fr-to-en-1.mp3
    ├── phrasebook-fr-to-en-1.png
    ├── phrasebook-fr-to-en-2.mp3
    ├── phrasebook-fr-to-en-2.png
    ├── phrasebook-fr-to-en-3.mp3
    └── phrasebook-fr-to-en-3.png
```
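
If you want to double-check that every media file referenced in `enriched_phrasebook.tsv` actually made it into `media`, a small script like the following does it. It is only an illustrative sketch built on the columns documented above, not part of the package:

```
from pathlib import Path

import pandas as pd

df = pd.read_csv("enriched_phrasebook.tsv", sep="\t", dtype="string")
media = Path("media")

for column in ("audio_filename", "img_filename"):
    for filename in df[column].dropna():
        if not (media / filename).exists():
            print(f"missing: {media / filename}")
```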

## For Anki users

In the previous example, did you notice the columns `anki_audio` and
`anki_img`? They contain formatted fields for audio and image that
you can use directly in your [Anki](https://apps.ankiweb.net/) decks:

```
[sound:phrasebook-fr-to-en-1.mp3]
<img src="phrasebook-fr-to-en-1.png">
```

This way you can import `enriched_phrasebook.tsv` directly into Anki.
No changes are needed to get audio played and images displayed.

Note that this only works:

1. If you enable the "Allow HTML" option when importing the enriched file,
2. If you copy the audios and images from the `media` directory to
   your Anki `collection.media` directory (see the sketch below).

See Anki docs:

- https://docs.ankiweb.net/importing/text-files.html#importing-media
- https://docs.ankiweb.net/files.html
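
A minimal way to do step 2 from Python is sketched below. The `collection.media` location is an assumption: it depends on your OS and Anki profile, so adjust the path using the Anki docs linked above.

```
import shutil
from pathlib import Path

media = Path("media")
# Example location only: adjust to your own OS and Anki profile.
collection_media = Path.home() / ".local/share/Anki2/User 1/collection.media"

for file in media.iterdir():
    shutil.copy2(file, collection_media / file.name)
```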

## Dev

### Installing and running from source

To run `phrasebook-fr-to-en` from the source, run this:

```
uv run src/phrasebook_fr_to_en/cli.py phrasebook.tsv
```

Alternatively, `phrasebook-fr-to-en` can be installed as an `uv` tool
from the source like this:

```
uv tool install .
```

Then run it like this:

```
phrasebook-fr-to-en my-phrasebook.tsv
```

### Running the tests

Run the tests like this:

```
uv run pytest
```

To run the tests with real calls to the OpenAI API, run this:

```
OPENAI_LIVE=1 uv run pytest
```

For this to work, the `OPENAI_API_KEY` environment variable must be
set to an OpenAI API key. This variable can also be declared in a
`.env` file.

### Test coverage

This software has 100% test coverage.

To check this, you can run the following commands:

```
OPENAI_LIVE=1 uv run coverage run -m pytest
uv run coverage report
```

As mentioned above, `OPENAI_API_KEY` must be set prior to running the
tests with coverage.

### Linter + formatting

With `ty` and `ruff` installed, run:

```
ty check
ruff format
```
phrasebook_fr_to_en-0.1.0/pyproject.toml
@@ -0,0 +1,66 @@
[project]
name = "phrasebook-fr-to-en"
version = "0.1.0" # When you update here, also update __version__ in `cli.py`
description = "Enrich French to English phrasebooks with OpenAI API."
readme = "README.md"
authors = [
    { name = "Tony Aldon", email = "tony@tonyaldon.com" }
]
license = "MIT"
license-files = ["LICENSE"]
requires-python = ">=3.12"
dependencies = [
    "pandas>=2.3.3",
    "numpy!=2.4.0",
    "typer>=0.21.0",
    "watchdog>=6.0.0",
    "openai>=2.15.0",
    "pydantic>=2.12.5",
]

[project.urls]
Homepage = "https://github.com/tonyaldon/phrasebook-fr-to-en"
Repository = "https://github.com/tonyaldon/phrasebook-fr-to-en"

[project.scripts]
phrasebook-fr-to-en = "phrasebook_fr_to_en.cli:main"

[build-system]
requires = ["uv_build>=0.9.18,<0.10.0"]
build-backend = "uv_build"

[project.optional-dependencies]
test = [
    "pytest",
]

[tool.pytest.ini_options]
testpaths = ["tests"]
pythonpath = ["src"]

[dependency-groups]
dev = [
    "coverage>=7.13.1",
    "mutagen>=1.47.0",
    "pillow>=12.1.0",
    "pytest>=9.0.2",
    "python-dotenv>=1.2.1",
    "respx>=0.22.0",
]

[tool.coverage.run]
branch = true
source = ["src"]
parallel = false
omit = [
    "src/phrasebook_fr_to_en/__main__.py",
]


[tool.coverage.report]
show_missing = true
skip_covered = true
fail_under = 80

[tool.coverage.html]
directory = "htmlcov"
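
The `[project.scripts]` entry maps the `phrasebook-fr-to-en` command to `phrasebook_fr_to_en.cli:main`, which simply runs the Typer app defined in `cli.py`. As a rough sketch of how that wiring can be exercised without installing the script (assuming the package is importable, e.g. via `pythonpath = ["src"]` above), Typer's test runner can invoke the app directly:

```
from typer.testing import CliRunner

from phrasebook_fr_to_en.cli import app

runner = CliRunner()
result = runner.invoke(app, ["--version"])
print(result.output)  # e.g. "phrasebook-fr-to-en 0.1.0"
```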
phrasebook_fr_to_en-0.1.0/src/phrasebook_fr_to_en/.ruff_cache/0.14.7/4335670677593933877
Binary file

phrasebook_fr_to_en-0.1.0/src/phrasebook_fr_to_en/.ruff_cache/CACHEDIR.TAG
@@ -0,0 +1 @@
Signature: 8a477f597d28d172789f06886806bc55

phrasebook_fr_to_en-0.1.0/src/phrasebook_fr_to_en/__init__.py
@@ -0,0 +1 @@
phrasebook_fr_to_en-0.1.0/src/phrasebook_fr_to_en/cli.py
@@ -0,0 +1,578 @@
from __future__ import annotations
import os
import base64
import logging
import time
from pathlib import Path
from typing import TYPE_CHECKING, Any, Annotated, Iterator
import contextlib
import typer
from watchdog.events import FileSystemEvent, FileSystemEventHandler
from watchdog.observers import Observer


__version__ = "0.1.0"

if TYPE_CHECKING:
    from pydantic import BaseModel
    import pandas as pd
    from openai import OpenAI

logger = logging.getLogger(__name__)

MEDIA_PREFIX = "phrasebook-fr-to-en-"

# See https://docs.ankiweb.net/importing/text-files.html#importing-media
ANKI_AUDIO_TEMPLATE = "[sound:{}]"
ANKI_IMG_TEMPLATE = '<img src="{}">'

ENRICHED_COLUMNS: list[str] = [
    "french",
    "english",
    "anki_audio",
    "anki_img",
    "generated_from",
    "id",
    "audio_filename",
    "img_filename",
    "date",
]

PHRASEBOOK_COLUMNS: list[str] = ["date", "french", "english"]


app = typer.Typer(pretty_exceptions_enable=False)


def enriched_path_func(phrasebook_path: Path) -> Path:
    return (phrasebook_path.parent / "enriched_phrasebook.tsv").absolute()


def phrasebook_dir_func(phrasebook_path: Path) -> Path:
    return phrasebook_path.parent.absolute()


def media_dir_func(phrasebook_path: Path) -> Path:
    return phrasebook_path.parent.absolute() / "media"


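# read_phrasebook() and read_enriched() both load TAB-separated files with
# pandas and reject any file whose header does not match PHRASEBOOK_COLUMNS /
# ENRICHED_COLUMNS exactly; read_enriched() returns an empty frame when the
# enriched file does not exist yet.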
def read_phrasebook(phrasebook_path: Path) -> pd.DataFrame:
    import pandas as pd

    try:
        df = pd.read_csv(phrasebook_path, sep="\t", dtype="string")
    except FileNotFoundError as err:
        raise err
    except Exception as err:
        raise ValueError(f"Invalid file {phrasebook_path}: {err}")

    if list(df.columns) != PHRASEBOOK_COLUMNS:
        raise ValueError(
            f"Invalid header in {phrasebook_path}. Expected {PHRASEBOOK_COLUMNS}, got {list(df.columns)}"
        )

    return df


def read_enriched(enriched_path: Path) -> pd.DataFrame:
    import pandas as pd

    if not enriched_path.exists():
        return pd.DataFrame(columns=pd.Index(ENRICHED_COLUMNS), dtype="string")

    try:
        df = pd.read_csv(enriched_path, sep="\t", dtype="string")
    except Exception as err:
        raise ValueError(f"Invalid file {enriched_path}: {err}")

    if list(df.columns) != ENRICHED_COLUMNS:
        raise ValueError(
            f"Invalid header in {enriched_path}. Expected {ENRICHED_COLUMNS}, got {list(df.columns)}"
        )

    df["id"] = df["id"].astype("Int64")
    df["generated_from"] = df["generated_from"].astype("Int64")

    return df


@contextlib.contextmanager
def log_request_info_when_api_error_raised() -> Iterator[None]:
    from openai import APIError

    try:
        yield
    except APIError as exc:
        logger.error(f"{exc.request!r}")
        # It's safe to log httpx headers because their repr
        # sets 'authorization' to '[secure]', hiding our API key.
        logger.error(f"Request headers - {exc.request.headers!r}")
        logger.error(f"Request body - {exc.request.content.decode()}")
        raise exc


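# generate_translations() asks gpt-5.2 for exactly two related sentence pairs
# using Structured Outputs (client.responses.parse with a Pydantic schema) and
# fails loudly if the parsed output is missing or does not contain exactly 2
# translations.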
def generate_translations(
    record_original: tuple[str, str, str], client: OpenAI
) -> list[tuple[str, str]]:
    from pydantic import BaseModel

    class Translation(BaseModel):
        french: str
        english: str

    class Translations(BaseModel):
        # DON'T USE: conlist(tuple[str, str], min_length=2, max_length=2)
        # This broke the OpenAI API, which generated outputs with 128,000 tokens.
        # Mostly whitespaces and newlines.
        translations: list[Translation]

    _, french, english = record_original
    model = "gpt-5.2"
    instructions = """# Role and Objective
You are a bilingual (French/English) teacher specializing in practical language learning. Your task is to help expand a French-to-English phrasebook by creating relevant sentence pairs and highlighting key language aspects.

# Instructions
- For each prompt, you will get a French sentence and its English translation.
- Your tasks:
  1. Generate two related English sentences, each with its French translation.
  2. Use these to show:
     - An English grammar point, or
     - Useful nouns, verbs, or
     - Alternative phrasing (formality, slang, etc.).
  3. Ensure all English examples are natural and suitable for daily use.

# Context
- The learner is a native French speaker advancing in English.
- The goal is to create a learner-friendly, practical phrasebook."""
    input_msg = f"{french} -> {english}"

    logger.info(f"Generating translations for record {record_original}")

    with log_request_info_when_api_error_raised():
        response = client.responses.parse(
            model=model,
            instructions=instructions,
            input=input_msg,
            text_format=Translations,
            max_output_tokens=256,
        )

    # If we decided to use gpt-5-nano with the same low max_output_tokens,
    # tokens would be consumed by the reasoning, we would get
    # no text output, and this would result in output_parsed being None.
    if not response.output_parsed:
        raise ValueError(
            f"No translations were returned by the model.\nResponse: {response.to_json()}"
        )

    translations = response.output_parsed.translations

    if (tlen := len(translations)) != 2:
        raise ValueError(
            (
                f"Wrong number of translations: {tlen}. 2 were expected.\n"
                f"Response: {response.to_json()}"
            )
        )

    logger.info(
        f"Translations generated for record {record_original} using model {model} and input '{input_msg}'"
    )

    return [(t.french, t.english) for t in translations]


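# generate_audio() streams a gpt-4o-mini-tts narration of the record's English
# sentence into media/<audio_filename>; generate_img() below asks gpt-image-1.5
# for a flat vector style illustration of the same sentence and writes the
# base64-decoded PNG next to it.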
def generate_audio(record: dict[str, Any], media_dir: Path, client: OpenAI) -> None:
    id_record = record["id"]
    input_msg = record["english"]
    audio_path = media_dir / record["audio_filename"]

    media_dir.mkdir(parents=True, exist_ok=True)

    logger.info(f"Generating audio '{input_msg}' for record {id_record}")
    with log_request_info_when_api_error_raised():
        with client.audio.speech.with_streaming_response.create(
            model="gpt-4o-mini-tts",
            voice="cedar",
            input=input_msg,
            instructions="Speak in a neutral General American accent at a natural conversational pace. Use clear, natural intonation with a neutral tone, and avoid emotional coloring, character impressions, and whispering.",
        ) as response:
            response.stream_to_file(audio_path)

    logger.info(f"Audio has been generated: {audio_path}.")

    return None


def generate_img(record: dict[str, Any], media_dir: Path, client: OpenAI) -> None:
    id_record = record["id"]
    english = record["english"]
    img_path = media_dir / record["img_filename"]

    media_dir.mkdir(parents=True, exist_ok=True)

    logger.info(f"Generating image '{english}' for record {id_record}")

    prompt = (
        f'Create a clean, minimal flat vector illustration representing: "{english}".\n'
        "Use 2-4 simple objects/characters maximum, solid colors, white background.\n"
        "No text, no letters, no numbers, no icons that resemble writing.\n"
    )

    with log_request_info_when_api_error_raised():
        response = client.images.generate(
            model="gpt-image-1.5",
            prompt=prompt,
            size="1024x1024",
            quality="low",
            output_format="png",
        )

    image_base64 = response.data[0].b64_json
    image_bytes = base64.b64decode(image_base64)
    with open(img_path, "wb") as f:
        f.write(image_bytes)

    logger.info(f"Image has been generated: {img_path}.")

    return None


def next_id(enriched_df: pd.DataFrame) -> int:
    if enriched_df.empty:
        return 1

    max_id = int(enriched_df["id"].max())
    return max_id + 1


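# build_record() derives everything from the numeric id allocated by next_id():
# the media filenames ("phrasebook-fr-to-en-<id>.mp3" / ".png") and the Anki
# sound/img fields built from ANKI_AUDIO_TEMPLATE and ANKI_IMG_TEMPLATE.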
def build_record(
    record_id: int,
    generated_from: Any,
    date: str,
    french: str = "",
    english: str = "",
) -> dict[str, Any]:
    audio_filename = f"{MEDIA_PREFIX}{record_id}.mp3"
    img_filename = f"{MEDIA_PREFIX}{record_id}.png"
    return {
        "id": record_id,
        "french": french,
        "english": english,
        "anki_audio": ANKI_AUDIO_TEMPLATE.format(audio_filename),
        "anki_img": ANKI_IMG_TEMPLATE.format(img_filename),
        "generated_from": generated_from,
        "audio_filename": audio_filename,
        "img_filename": img_filename,
        "date": date,
    }


def enrich_record(
    record_original: tuple[str, str, str],
    next_id: int,
    media_dir: Path,
    client: OpenAI,
) -> list[dict[str, Any]]:
    import pandas as pd

    date, french, english = record_original

    try:
        translations = generate_translations(record_original, client)
    except Exception:
        logger.exception(
            f"Failed to generate translations while processing record {record_original}"
        )
        return []

    id_record_original = next_id
    new_records: list[dict[str, Any]] = [
        build_record(
            record_id=id_record_original,
            french=french,
            english=english,
            generated_from=pd.NA,
            date=date,
        )
    ]

    for i in range(len(translations)):
        new_records.append(
            build_record(
                record_id=id_record_original + i + 1,
                french=translations[i][0],
                english=translations[i][1],
                generated_from=id_record_original,
                date=date,
            )
        )

    try:
        for record in new_records:
            generate_audio(record, media_dir, client)
    except Exception:
        logger.exception(
            f"Failed to generate audios while processing record {record_original}"
        )
        return []

    try:
        for record in new_records:
            generate_img(record, media_dir, client)
    except Exception:
        logger.exception(
            f"Failed to generate images while processing record {record_original}"
        )
        return []

    return new_records


def save_new_records(
    new_records: list[dict[str, Any]], enriched_df: pd.DataFrame, enriched_path: Path
) -> pd.DataFrame:
    import pandas as pd

    new_df = pd.DataFrame(
        new_records, columns=pd.Index(ENRICHED_COLUMNS), dtype="string"
    )
    new_df["id"] = new_df["id"].astype("Int64")
    new_df["generated_from"] = new_df["generated_from"].astype("Int64")

    updated = (
        pd.concat([enriched_df, new_df], ignore_index=True)
        if not enriched_df.empty
        else new_df
    )
    updated.to_csv(enriched_path, sep="\t", index=False)
    return updated


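# enrich_phrasebook() is the main loop: it re-reads both files, skips any
# phrasebook record whose english value is already present in the enriched
# file, and writes the enriched file back after each successfully processed
# record, so an interrupted run loses at most the record in progress.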
def enrich_phrasebook(phrasebook_path: Path, client: OpenAI) -> bool:
    media_dir = media_dir_func(phrasebook_path)
    enriched_path = enriched_path_func(phrasebook_path)
    try:
        phrasebook_df = read_phrasebook(phrasebook_path)
        enriched_df = read_enriched(enriched_path)
    except Exception as err:
        logger.exception(err)
        return False

    existing_english: set[str] = (
        set(enriched_df["english"].dropna().to_list())
        if not enriched_df.empty
        else set()
    )

    for record_original in phrasebook_df.itertuples(index=False, name=None):
        _, _, english = record_original

        if english in existing_english:
            logger.info(f"Skip existing record: {record_original}")
            continue

        new_records = enrich_record(
            record_original, next_id(enriched_df), media_dir, client
        )
        if not new_records:
            return False

        try:
            enriched_df = save_new_records(new_records, enriched_df, enriched_path)
        except Exception:
            logger.exception(
                f"Failed to save enriched records from record {record_original} in file {enriched_path}"
            )
            return False

        existing_english.add(english)
        logger.info(f"Record has been enriched: {record_original} -> {enriched_path}")

    return True


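# watch_phrasebook() uses a watchdog Observer on the phrasebook's directory and
# re-runs enrich_phrasebook() every time the watched file itself is modified;
# the --watch option of the run() command below enables it.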
def watch_phrasebook(phrasebook_path: Path, client: OpenAI) -> None:
    class Handler(FileSystemEventHandler):
        def on_modified(self, event: FileSystemEvent) -> None:
            if event.src_path == str(phrasebook_path):
                enrich_phrasebook(phrasebook_path, client)

    observer = Observer()
    observer.schedule(
        Handler(), str(phrasebook_dir_func(phrasebook_path)), recursive=False
    )
    observer.start()
    logger.info(f"Start watching file {phrasebook_path}")

    try:
        while True:
            time.sleep(1)
    finally:  # pragma: no cover
        observer.stop()
        observer.join()


def version_callback(version: bool):
    if version:
        print(f"phrasebook-fr-to-en {__version__}")
        raise typer.Exit()


def setup_logging(log_file: Path | None = None):
    log_format = "%(asctime)s %(levelname)s %(name)s %(message)s"
    log_datefmt = "%Y-%m-%d %H:%M:%S"
    if log_file:
        log_file.parent.mkdir(parents=True, exist_ok=True)
        logging.basicConfig(
            format=log_format, datefmt=log_datefmt, filename=log_file.absolute()
        )
    else:
        logging.basicConfig(format=log_format, datefmt=log_datefmt)

    logger.setLevel(logging.INFO)


@app.command()
def run(
    file: Annotated[
        Path,
        typer.Argument(
            help=(
                "Filename of the phrasebook to be enriched. It must be a TSV format file (TAB separation) with the columns: date, french, english. For instance:\n\n\n\n"
                "date french english\n\n"
                "2025-12-15 J'aime l'eau. I like water.\n\n"
                "2025-12-16 Il fait froid. It is cold.\n\n"
            )
        ),
    ],
    watch: Annotated[
        bool,
        typer.Option(
            "--watch",
            help="Watch the original phrasebook file for changes, and enrich any new record added to it.",
        ),
    ] = False,
    log_file: Annotated[
        Path | None,
        typer.Option(
            help="Log to this file if provided. Default is stderr.",
        ),
    ] = None,
    version: Annotated[
        bool,
        typer.Option("--version", callback=version_callback, is_eager=True),
    ] = False,
) -> None:
    # We escape \[sound:...] because this is a reserved syntax for Rich.
    # If we don't, [sound:...] is removed from the help message.
    # But when we do this we get a SyntaxWarning, so we use a raw string
    # r"""...""".
    r"""
    Enrich French to English phrasebooks with OpenAI API.

    -----

    [IMPORTANT] This program uses the OpenAI API with the following models:

    - https://platform.openai.com/docs/models/gpt-5.2
    - https://platform.openai.com/docs/models/gpt-4o-mini-tts
    - https://platform.openai.com/docs/models/gpt-image-1.5

    To use it, you need to register with OpenAI, be verified as an organization (required for the image model), and create an API key: https://platform.openai.com.

    Once you've done this, set OPENAI_API_KEY as an environment variable before you run the program, like this:

    $ export OPENAI_API_KEY=<your-api-key>

    -----

    This program takes a TSV (TAB separation) file as input. Each row is a French to English translation. It uses the following columns: date, french, english.

    For each translation (each row), two new related translations are generated. The goal is to show:

    - An English grammar point, or
    - Useful nouns, verbs, or
    - Alternative phrasing (formality, slang, etc.).

    These new translations, along with the original, are saved in the file "enriched_phrasebook.tsv". It sits next to your phrasebook file. Records in your original phrasebook whose english field matches a record in the enriched phrasebook are skipped.

    Your original phrasebook is left unchanged.

    For all translations (original and AI generated), an English audio and an image are generated. They are saved in a "media" directory next to the original phrasebook file.

    For instance, if "my-phrasebook.tsv" contains the following record
    (columns separated by tabs)

    date french english
    2025-12-15 Montez les escaliers. Climb the stairs.

    and you run the following commands:

    $ export OPENAI_API_KEY=<your-api-key>
    $ phrasebook-fr-to-en my-phrasebook.tsv

    This will produce the file "enriched_phrasebook.tsv" with AI generated translations. It has the following columns: french, english, anki_audio, anki_img, generated_from, id, audio_filename, img_filename, date.

    french english anki_audio anki_img generated_from id audio_filename img_filename date
    Montez les escaliers. Climb the stairs. \[sound:phrasebook-fr-to-en-1.mp3] "<img src=""phrasebook-fr-to-en-1.png"">" 1 phrasebook-fr-to-en-1.mp3 phrasebook-fr-to-en-1.png 2025-11-15
    Prenez les escaliers, s'il vous plaît. Please take the stairs. \[sound:phrasebook-fr-to-en-2.mp3] "<img src=""phrasebook-fr-to-en-2.png"">" 1 2 phrasebook-fr-to-en-2.mp3 phrasebook-fr-to-en-2.png 2025-11-15
    Montez deux étages et tournez à gauche. Go up two floors and turn left. \[sound:phrasebook-fr-to-en-3.mp3] "<img src=""phrasebook-fr-to-en-3.png"">" 1 3 phrasebook-fr-to-en-3.mp3 phrasebook-fr-to-en-3.png 2025-11-15

    This also generates 3 audios and 3 images.

    Your directory then looks like this:

    .
    ├── enriched_phrasebook.tsv
    ├── my-phrasebook.tsv
    └── media
        ├── phrasebook-fr-to-en-1.mp3
        ├── phrasebook-fr-to-en-1.png
        ├── phrasebook-fr-to-en-2.mp3
        ├── phrasebook-fr-to-en-2.png
        ├── phrasebook-fr-to-en-3.mp3
        └── phrasebook-fr-to-en-3.png

    For Anki users. Did you notice the columns "anki_audio" and "anki_img"? They contain formatted fields for audio and image that you can use directly in your Anki decks:

    \[sound:phrasebook-fr-to-en-1.mp3]
    <img src="phrasebook-fr-to-en-1.png">

    This way you can import "enriched_phrasebook.tsv" directly into Anki. No changes are needed to get audio played and images displayed.

    Note that this only works:

    1) If you enable the "Allow HTML" option when importing the enriched file,
    2) If you copy the audios and images from the "media" directory to your Anki "collection.media" directory.

    See Anki docs:

    - https://docs.ankiweb.net/importing/text-files.html#importing-media
    - https://docs.ankiweb.net/files.html
    """
    from openai import OpenAI, APIError

    setup_logging(log_file)

    if not os.getenv("OPENAI_API_KEY"):
        logger.error("Set OPENAI_API_KEY environment variable to run the app.")
        raise typer.Exit(code=1)

    client = OpenAI()
    phrasebook_path = file.absolute()

    if not enrich_phrasebook(phrasebook_path, client):
        raise typer.Exit(code=1)

    if watch:  # pragma: no cover
        watch_phrasebook(phrasebook_path, client)
        return


def main() -> None:  # pragma: no cover
    app()


if __name__ == "__main__":  # pragma: no cover
    main()

phrasebook_fr_to_en-0.1.0/src/phrasebook_fr_to_en/py.typed
File without changes