epub-translator 0.1.6__tar.gz → 0.1.8__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (74)
  1. {epub_translator-0.1.6 → epub_translator-0.1.8}/PKG-INFO +108 -8
  2. {epub_translator-0.1.6 → epub_translator-0.1.8}/README.md +106 -7
  3. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/data/translate.jinja +3 -0
  4. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/llm/core.py +19 -1
  5. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/llm/executor.py +5 -0
  6. epub_translator-0.1.8/epub_translator/llm/statistics.py +25 -0
  7. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/segment/__init__.py +1 -0
  8. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/segment/text_segment.py +10 -6
  9. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/segment/utils.py +0 -16
  10. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/translation/xml_interrupter.py +54 -27
  11. epub_translator-0.1.8/epub_translator/xml/const.py +2 -0
  12. epub_translator-0.1.8/epub_translator/xml/inline.py +120 -0
  13. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/xml_translator/submitter.py +5 -5
  14. {epub_translator-0.1.6 → epub_translator-0.1.8}/pyproject.toml +2 -1
  15. epub_translator-0.1.6/epub_translator/data/mmltex/README.md +0 -67
  16. epub_translator-0.1.6/epub_translator/data/mmltex/cmarkup.xsl +0 -1106
  17. epub_translator-0.1.6/epub_translator/data/mmltex/entities.xsl +0 -459
  18. epub_translator-0.1.6/epub_translator/data/mmltex/glayout.xsl +0 -222
  19. epub_translator-0.1.6/epub_translator/data/mmltex/mmltex.xsl +0 -36
  20. epub_translator-0.1.6/epub_translator/data/mmltex/scripts.xsl +0 -375
  21. epub_translator-0.1.6/epub_translator/data/mmltex/tables.xsl +0 -130
  22. epub_translator-0.1.6/epub_translator/data/mmltex/tokens.xsl +0 -328
  23. epub_translator-0.1.6/epub_translator/xml/const.py +0 -1
  24. epub_translator-0.1.6/epub_translator/xml/inline.py +0 -67
  25. {epub_translator-0.1.6 → epub_translator-0.1.8}/LICENSE +0 -0
  26. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/__init__.py +0 -0
  27. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/data/fill.jinja +0 -0
  28. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/epub/__init__.py +0 -0
  29. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/epub/common.py +0 -0
  30. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/epub/math.py +0 -0
  31. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/epub/metadata.py +0 -0
  32. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/epub/spines.py +0 -0
  33. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/epub/toc.py +0 -0
  34. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/epub/zip.py +0 -0
  35. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/llm/__init__.py +0 -0
  36. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/llm/context.py +0 -0
  37. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/llm/error.py +0 -0
  38. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/llm/increasable.py +0 -0
  39. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/llm/types.py +0 -0
  40. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/segment/block_segment.py +0 -0
  41. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/segment/common.py +0 -0
  42. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/segment/inline_segment.py +0 -0
  43. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/serial/__init__.py +0 -0
  44. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/serial/chunk.py +0 -0
  45. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/serial/segment.py +0 -0
  46. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/serial/splitter.py +0 -0
  47. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/template.py +0 -0
  48. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/translation/__init__.py +0 -0
  49. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/translation/epub_transcode.py +0 -0
  50. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/translation/language.py +0 -0
  51. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/translation/punctuation.py +0 -0
  52. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/translation/translator.py +0 -0
  53. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/utils.py +0 -0
  54. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/xml/__init__.py +0 -0
  55. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/xml/deduplication.py +0 -0
  56. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/xml/friendly/__init__.py +0 -0
  57. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/xml/friendly/decoder.py +0 -0
  58. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/xml/friendly/encoder.py +0 -0
  59. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/xml/friendly/parser.py +0 -0
  60. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/xml/friendly/tag.py +0 -0
  61. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/xml/friendly/transform.py +0 -0
  62. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/xml/self_closing.py +0 -0
  63. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/xml/utils.py +0 -0
  64. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/xml/xml.py +0 -0
  65. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/xml/xml_like.py +0 -0
  66. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/xml_translator/__init__.py +0 -0
  67. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/xml_translator/callbacks.py +0 -0
  68. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/xml_translator/common.py +0 -0
  69. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/xml_translator/concurrency.py +0 -0
  70. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/xml_translator/hill_climbing.py +0 -0
  71. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/xml_translator/score.py +0 -0
  72. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/xml_translator/stream_mapper.py +0 -0
  73. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/xml_translator/translator.py +0 -0
  74. {epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/xml_translator/validation.py +0 -0

{epub_translator-0.1.6 → epub_translator-0.1.8}/PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.3
  Name: epub-translator
- Version: 0.1.6
+ Version: 0.1.8
  Summary: Translate the epub book using LLM. The translated book will retain the original text and list the translated text side by side with the original text.
  License: MIT
  Keywords: epub,llm,translation,translator
@@ -24,6 +24,7 @@ Classifier: Topic :: Software Development :: Localization
  Classifier: Topic :: Text Processing :: Markup
  Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
  Requires-Dist: jinja2 (>=3.1.6,<4.0.0)
+ Requires-Dist: mathml2latex (>=0.2.12,<0.3.0)
  Requires-Dist: openai (>=2.14.0,<3.0.0)
  Requires-Dist: resource-segmentation (>=0.0.7,<0.1.0)
  Requires-Dist: tiktoken (>=0.12.0,<1.0.0)
@@ -59,6 +60,13 @@ Translate EPUB books using Large Language Models while preserving the original t
  - **Flexible LLM Support**: Works with any OpenAI-compatible API endpoint
  - **Caching**: Built-in caching for progress recovery when translation fails

+ ## Use Cases
+
+ - **Language Learning**: Read books in their original language with side-by-side translations
+ - **Academic Research**: Access foreign literature with bilingual references
+ - **Content Localization**: Prepare books for international audiences
+ - **Cross-Cultural Reading**: Enjoy literature while understanding cultural nuances
+
  ## Installation

  ```bash
@@ -357,13 +365,6 @@ llm = LLM(
  )
  ```

- ## Use Cases
-
- - **Language Learning**: Read books in their original language with side-by-side translations
- - **Academic Research**: Access foreign literature with bilingual references
- - **Content Localization**: Prepare books for international audiences
- - **Cross-Cultural Reading**: Enjoy literature while understanding cultural nuances
-
  ## Advanced Features

  ### Custom Translation Prompts
@@ -421,6 +422,105 @@ translate(

  When using `concurrency > 1`, ensure that any custom callback functions (`on_progress`, `on_fill_failed`) are thread-safe. Built-in callbacks are thread-safe by default.

+ ### Token Usage Monitoring
+
+ Track token consumption during translation to monitor API costs and usage:
+
+ ```python
+ from epub_translator import LLM, translate, language, SubmitKind
+
+ llm = LLM(
+   key="your-api-key",
+   url="https://api.openai.com/v1",
+   model="gpt-4",
+   token_encoding="o200k_base",
+ )
+
+ translate(
+   source_path="source.epub",
+   target_path="translated.epub",
+   target_language=language.ENGLISH,
+   submit=SubmitKind.APPEND_BLOCK,
+   llm=llm,
+ )
+
+ # Access token statistics after translation
+ print(f"Total tokens: {llm.total_tokens}")
+ print(f"Input tokens: {llm.input_tokens}")
+ print(f"Input cache tokens: {llm.input_cache_tokens}")
+ print(f"Output tokens: {llm.output_tokens}")
+ ```
+
+ **Available Statistics:**
+
+ - `total_tokens` - Total number of tokens used (input + output)
+ - `input_tokens` - Number of prompt/input tokens
+ - `input_cache_tokens` - Number of cached input tokens (when using prompt caching)
+ - `output_tokens` - Number of generated/completion tokens
+
+ **Real-time Monitoring:**
+
+ You can also monitor token usage in real-time during translation:
+
+ ```python
+ from tqdm import tqdm
+ import time
+
+ with tqdm(total=100, desc="Translating", unit="%") as pbar:
+   last_progress = 0.0
+   start_time = time.time()
+
+   def on_progress(progress: float):
+     nonlocal last_progress
+     increment = (progress - last_progress) * 100
+     pbar.update(increment)
+     last_progress = progress
+
+     # Update token stats in progress bar
+     pbar.set_postfix({
+       'tokens': llm.total_tokens,
+       'cost_est': f'${llm.total_tokens * 0.00001:.4f}'  # Estimate based on your pricing
+     })
+
+   translate(
+     source_path="source.epub",
+     target_path="translated.epub",
+     target_language=language.ENGLISH,
+     submit=SubmitKind.APPEND_BLOCK,
+     llm=llm,
+     on_progress=on_progress,
+   )
+
+   elapsed = time.time() - start_time
+   print(f"\nTranslation completed in {elapsed:.1f}s")
+   print(f"Total tokens used: {llm.total_tokens:,}")
+   print(f"Average tokens/second: {llm.total_tokens/elapsed:.1f}")
+ ```
+
+ **Dual-LLM Token Tracking:**
+
+ When using separate LLMs for translation and filling, each LLM tracks its own statistics:
+
+ ```python
+ translation_llm = LLM(key="...", url="...", model="gpt-4", token_encoding="o200k_base")
+ fill_llm = LLM(key="...", url="...", model="gpt-4", token_encoding="o200k_base")
+
+ translate(
+   source_path="source.epub",
+   target_path="translated.epub",
+   target_language=language.ENGLISH,
+   submit=SubmitKind.APPEND_BLOCK,
+   translation_llm=translation_llm,
+   fill_llm=fill_llm,
+ )
+
+ print(f"Translation tokens: {translation_llm.total_tokens}")
+ print(f"Fill tokens: {fill_llm.total_tokens}")
+ print(f"Combined total: {translation_llm.total_tokens + fill_llm.total_tokens}")
+ ```
+
+ **Note:** Token statistics are cumulative across all API calls made by the LLM instance. The counts only increase and are thread-safe when using concurrent translation.
+
  ## Related Projects

  ### PDF Craft

{epub_translator-0.1.6 → epub_translator-0.1.8}/README.md

@@ -26,6 +26,13 @@ Translate EPUB books using Large Language Models while preserving the original t
  - **Flexible LLM Support**: Works with any OpenAI-compatible API endpoint
  - **Caching**: Built-in caching for progress recovery when translation fails

+ ## Use Cases
+
+ - **Language Learning**: Read books in their original language with side-by-side translations
+ - **Academic Research**: Access foreign literature with bilingual references
+ - **Content Localization**: Prepare books for international audiences
+ - **Cross-Cultural Reading**: Enjoy literature while understanding cultural nuances
+
  ## Installation

  ```bash
@@ -324,13 +331,6 @@ llm = LLM(
  )
  ```

- ## Use Cases
-
- - **Language Learning**: Read books in their original language with side-by-side translations
- - **Academic Research**: Access foreign literature with bilingual references
- - **Content Localization**: Prepare books for international audiences
- - **Cross-Cultural Reading**: Enjoy literature while understanding cultural nuances
-
  ## Advanced Features

  ### Custom Translation Prompts
@@ -388,6 +388,105 @@ translate(

  When using `concurrency > 1`, ensure that any custom callback functions (`on_progress`, `on_fill_failed`) are thread-safe. Built-in callbacks are thread-safe by default.

+ ### Token Usage Monitoring
+
+ Track token consumption during translation to monitor API costs and usage:
+
+ ```python
+ from epub_translator import LLM, translate, language, SubmitKind
+
+ llm = LLM(
+   key="your-api-key",
+   url="https://api.openai.com/v1",
+   model="gpt-4",
+   token_encoding="o200k_base",
+ )
+
+ translate(
+   source_path="source.epub",
+   target_path="translated.epub",
+   target_language=language.ENGLISH,
+   submit=SubmitKind.APPEND_BLOCK,
+   llm=llm,
+ )
+
+ # Access token statistics after translation
+ print(f"Total tokens: {llm.total_tokens}")
+ print(f"Input tokens: {llm.input_tokens}")
+ print(f"Input cache tokens: {llm.input_cache_tokens}")
+ print(f"Output tokens: {llm.output_tokens}")
+ ```
+
+ **Available Statistics:**
+
+ - `total_tokens` - Total number of tokens used (input + output)
+ - `input_tokens` - Number of prompt/input tokens
+ - `input_cache_tokens` - Number of cached input tokens (when using prompt caching)
+ - `output_tokens` - Number of generated/completion tokens
+
+ **Real-time Monitoring:**
+
+ You can also monitor token usage in real-time during translation:
+
+ ```python
+ from tqdm import tqdm
+ import time
+
+ with tqdm(total=100, desc="Translating", unit="%") as pbar:
+   last_progress = 0.0
+   start_time = time.time()
+
+   def on_progress(progress: float):
+     nonlocal last_progress
+     increment = (progress - last_progress) * 100
+     pbar.update(increment)
+     last_progress = progress
+
+     # Update token stats in progress bar
+     pbar.set_postfix({
+       'tokens': llm.total_tokens,
+       'cost_est': f'${llm.total_tokens * 0.00001:.4f}'  # Estimate based on your pricing
+     })
+
+   translate(
+     source_path="source.epub",
+     target_path="translated.epub",
+     target_language=language.ENGLISH,
+     submit=SubmitKind.APPEND_BLOCK,
+     llm=llm,
+     on_progress=on_progress,
+   )
+
+   elapsed = time.time() - start_time
+   print(f"\nTranslation completed in {elapsed:.1f}s")
+   print(f"Total tokens used: {llm.total_tokens:,}")
+   print(f"Average tokens/second: {llm.total_tokens/elapsed:.1f}")
+ ```
+
+ **Dual-LLM Token Tracking:**
+
+ When using separate LLMs for translation and filling, each LLM tracks its own statistics:
+
+ ```python
+ translation_llm = LLM(key="...", url="...", model="gpt-4", token_encoding="o200k_base")
+ fill_llm = LLM(key="...", url="...", model="gpt-4", token_encoding="o200k_base")
+
+ translate(
+   source_path="source.epub",
+   target_path="translated.epub",
+   target_language=language.ENGLISH,
+   submit=SubmitKind.APPEND_BLOCK,
+   translation_llm=translation_llm,
+   fill_llm=fill_llm,
+ )
+
+ print(f"Translation tokens: {translation_llm.total_tokens}")
+ print(f"Fill tokens: {fill_llm.total_tokens}")
+ print(f"Combined total: {translation_llm.total_tokens + fill_llm.total_tokens}")
+ ```
+
+ **Note:** Token statistics are cumulative across all API calls made by the LLM instance. The counts only increase and are thread-safe when using concurrent translation.
+
  ## Related Projects

  ### PDF Craft

{epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/data/translate.jinja

@@ -13,6 +13,9 @@ Translation rules:
  {% if user_prompt -%}
  User may provide additional requirements in <rules> tags before the source text. Follow them, but prioritize the rules above if conflicts arise.

+ <rules>
+ {{ user_prompt }}
+ </rules>
  {% endif -%}

  Output only the translated text, nothing else.
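
The template change above is a bug fix: the surrounding text already told the model that extra requirements may arrive in `<rules>` tags, but the template never actually interpolated `user_prompt`, so custom prompts were silently dropped before 0.1.8. A minimal standalone sketch (not the package's actual rendering path) of what the fixed fragment now produces:

```python
# Hypothetical, self-contained illustration of the translate.jinja fix:
# render just the patched fragment with jinja2 and a sample user prompt.
from jinja2 import Template

fragment = Template(
  "{% if user_prompt -%}\n"
  "<rules>\n"
  "{{ user_prompt }}\n"
  "</rules>\n"
  "{% endif -%}"
)

print(fragment.render(user_prompt="Keep personal names untranslated."))
# <rules>
# Keep personal names untranslated.
# </rules>
```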

{epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/llm/core.py

@@ -13,6 +13,7 @@ from ..template import create_env
  from .context import LLMContext
  from .executor import LLMExecutor
  from .increasable import Increasable
+ from .statistics import Statistics
  from .types import Message

  # Global state for logger filename generation
@@ -44,7 +45,7 @@ class LLM:
      self._temperature: Increasable = Increasable(temperature)
      self._cache_path: Path | None = self._ensure_dir_path(cache_path)
      self._logger_save_path: Path | None = self._ensure_dir_path(log_dir_path)
-
+     self._statistics = Statistics()
      self._executor = LLMExecutor(
        url=url,
        model=model,
@@ -53,12 +54,29 @@
        retry_times=retry_times,
        retry_interval_seconds=retry_interval_seconds,
        create_logger=self._create_logger,
+       statistics=self._statistics,
      )

    @property
    def encoding(self) -> Encoding:
      return self._encoding

+   @property
+   def total_tokens(self) -> int:
+     return self._statistics.total_tokens
+
+   @property
+   def input_tokens(self) -> int:
+     return self._statistics.input_tokens
+
+   @property
+   def input_cache_tokens(self) -> int:
+     return self._statistics.input_cache_tokens
+
+   @property
+   def output_tokens(self) -> int:
+     return self._statistics.output_tokens
+
    def context(self, cache_seed_content: str | None = None) -> LLMContext:
      return LLMContext(
        executor=self._executor,

{epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/llm/executor.py

@@ -7,6 +7,7 @@ from openai import OpenAI
  from openai.types.chat import ChatCompletionMessageParam

  from .error import is_retry_error
+ from .statistics import Statistics
  from .types import Message, MessageRole


@@ -20,12 +21,14 @@ class LLMExecutor:
      retry_times: int,
      retry_interval_seconds: float,
      create_logger: Callable[[], Logger | None],
+     statistics: Statistics,
    ) -> None:
      self._model_name: str = model
      self._timeout: float | None = timeout
      self._retry_times: int = retry_times
      self._retry_interval_seconds: float = retry_interval_seconds
      self._create_logger: Callable[[], Logger | None] = create_logger
+     self._statistics = statistics
      self._client = OpenAI(
        api_key=api_key,
        base_url=url,
@@ -156,6 +159,7 @@
        model=self._model_name,
        messages=messages,
        stream=True,
+       stream_options={"include_usage": True},
        top_p=top_p,
        temperature=temperature,
        max_tokens=max_tokens,
@@ -164,4 +168,5 @@
      for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
          buffer.write(chunk.choices[0].delta.content)
+       self._statistics.submit_usage(chunk.usage)
      return buffer.getvalue()
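
For context on the two executor changes above: with the OpenAI streaming API, `stream_options={"include_usage": True}` requests one extra final chunk whose `choices` list is empty and whose `usage` field carries the token counts for the whole request; on every earlier chunk `chunk.usage` is `None`, which is why `submit_usage` must tolerate `None`. A hedged sketch of that interplay, assuming a reachable OpenAI-compatible endpoint and a placeholder model name:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
stream = client.chat.completions.create(
  model="gpt-4o-mini",  # placeholder model
  messages=[{"role": "user", "content": "Say hi."}],
  stream=True,
  stream_options={"include_usage": True},
)
for chunk in stream:
  if chunk.choices and chunk.choices[0].delta.content:
    print(chunk.choices[0].delta.content, end="")
  if chunk.usage is not None:
    # Only the final chunk carries usage; its choices list is empty.
    print(f"\nprompt={chunk.usage.prompt_tokens}, completion={chunk.usage.completion_tokens}")
```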

epub_translator-0.1.8/epub_translator/llm/statistics.py

@@ -0,0 +1,25 @@
+ from threading import Lock
+
+ from openai.types import CompletionUsage
+
+
+ class Statistics:
+   def __init__(self) -> None:
+     self._lock = Lock()
+     self.total_tokens = 0
+     self.input_tokens = 0
+     self.input_cache_tokens = 0
+     self.output_tokens = 0
+
+   def submit_usage(self, usage: CompletionUsage | None) -> None:
+     if usage is None:
+       return
+     with self._lock:
+       if usage.total_tokens:
+         self.total_tokens += usage.total_tokens
+       if usage.prompt_tokens:
+         self.input_tokens += usage.prompt_tokens
+       if usage.prompt_tokens_details and usage.prompt_tokens_details.cached_tokens:
+         self.input_cache_tokens += usage.prompt_tokens_details.cached_tokens
+       if usage.completion_tokens:
+         self.output_tokens += usage.completion_tokens
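
Because `submit_usage` is called from streaming workers when `concurrency > 1`, the counters are guarded by a lock. A small illustration of the thread-safe accumulation (the `CompletionUsage` values here are hand-built and purely hypothetical):

```python
# Illustrative only: hammer Statistics from several threads and check the totals.
from concurrent.futures import ThreadPoolExecutor

from openai.types import CompletionUsage

from epub_translator.llm.statistics import Statistics  # module added in 0.1.8

stats = Statistics()
usage = CompletionUsage(prompt_tokens=100, completion_tokens=20, total_tokens=120)

with ThreadPoolExecutor(max_workers=8) as pool:
  for _ in range(1000):
    pool.submit(stats.submit_usage, usage)
# the with-block waits for all submitted tasks to finish

assert stats.input_tokens == 100_000
assert stats.output_tokens == 20_000
assert stats.total_tokens == 120_000
```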

{epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/segment/__init__.py

@@ -21,6 +21,7 @@ from .text_segment import (
    TextPosition,
    TextSegment,
    combine_text_segments,
+   find_block_depth,
    incision_between,
    search_text_segments,
  )

{epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/segment/text_segment.py

@@ -4,7 +4,12 @@ from enum import Enum, auto
  from typing import Self
  from xml.etree.ElementTree import Element

- from ..xml import expand_left_element_texts, expand_right_element_texts, is_inline_tag, normalize_text_in_element
+ from ..xml import (
+   expand_left_element_texts,
+   expand_right_element_texts,
+   is_inline_element,
+   normalize_text_in_element,
+ )


  class TextPosition(Enum):
@@ -100,7 +105,7 @@ def search_text_segments(root: Element) -> Generator[TextSegment, None, None]:
  def _search_text_segments(stack: list[Element], element: Element) -> Generator[TextSegment, None, None]:
    text = normalize_text_in_element(element.text)
    next_stack = stack + [element]
-   next_block_depth = _find_block_depth(next_stack)
+   next_block_depth = find_block_depth(next_stack)

    if text is not None:
      yield TextSegment(
@@ -125,12 +130,11 @@
      )


- def _find_block_depth(parent_stack: list[Element]) -> int:
+ def find_block_depth(parent_stack: list[Element]) -> int:
    index: int = 0
-   for i in range(len(parent_stack) - 1, -1, -1):
-     if not is_inline_tag(parent_stack[i].tag):
+   for i in range(len(parent_stack)):
+     if not is_inline_element(parent_stack[i]):
        index = i
-       break
    return index + 1  # depth is a count not index

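Both the old and the new loop return the index of the deepest non-inline ancestor plus one; the visible change is that the scan now runs forward without an early `break`, and the inline test moved from the tag name (`is_inline_tag`) to the whole element (`is_inline_element`), so it can consider attributes and not just tags. A toy illustration of the contract, with a stand-in predicate:

```python
# Stand-in is_inline_element; the package's real predicate is richer.
from xml.etree.ElementTree import Element

INLINE_TAGS = {"span", "em", "strong"}

def is_inline_element(element: Element) -> bool:
  return element.tag in INLINE_TAGS

def find_block_depth(parent_stack: list[Element]) -> int:
  index = 0
  for i in range(len(parent_stack)):
    if not is_inline_element(parent_stack[i]):
      index = i  # keep the deepest non-inline ancestor
  return index + 1  # depth is a count, not an index

stack = [Element("div"), Element("p"), Element("span"), Element("em")]
print(find_block_depth(stack))  # 2 — the <p> at index 1 is the deepest block
```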

{epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/segment/utils.py

@@ -8,22 +8,6 @@ def element_fingerprint(element: Element) -> str:
    return f"<{element.tag} {' '.join(attrs)}/>"


- def unwrap_parents(element: Element) -> tuple[Element, list[Element]]:
-   parents: list[Element] = []
-   while True:
-     if len(element) != 1:
-       break
-     child = element[0]
-     if not element.text:
-       break
-     if not child.tail:
-       break
-     parents.append(element)
-     element = child
-   element.tail = None
-   return element, parents
-
-
  def id_in_element(element: Element) -> int | None:
    id_str = element.get(ID_KEY, None)
    if id_str is None:

{epub_translator-0.1.6 → epub_translator-0.1.8}/epub_translator/translation/xml_interrupter.py

@@ -1,9 +1,13 @@
  from collections.abc import Generator, Iterable
  from typing import cast
- from xml.etree.ElementTree import Element
+ from xml.etree.ElementTree import Element, tostring

- from ..segment import TextSegment
+ from bs4 import BeautifulSoup
+ from mathml2latex.mathml import process_mathml
+
+ from ..segment import TextSegment, combine_text_segments, find_block_depth
  from ..utils import ensure_list, normalize_whitespace
+ from ..xml import DISPLAY_ATTRIBUTE, clone_element, is_inline_element

  _ID_KEY = "__XML_INTERRUPTER_ID"
  _MATH_TAG = "math"
@@ -37,8 +41,10 @@ class XMLInterrupter:
    def interrupt_block_element(self, element: Element) -> Element:
      interrupted_element = self._placeholder2interrupted.pop(id(element), None)
      if interrupted_element is None:
+       element.attrib.pop(_ID_KEY, None)
        return element
      else:
+       interrupted_element.attrib.pop(_ID_KEY, None)
        return interrupted_element

    def _expand_source_text_segment(self, text_segment: TextSegment):
@@ -81,14 +87,18 @@ class XMLInterrupter:
          _ID_KEY: cast(str, interrupted_element.get(_ID_KEY)),
        },
      )
+     interrupted_display = interrupted_element.get(DISPLAY_ATTRIBUTE, None)
+     if interrupted_display is not None:
+       placeholder_element.set(DISPLAY_ATTRIBUTE, interrupted_display)
+
      raw_parent_stack = text_segment.parent_stack[:interrupted_index]
      parent_stack = raw_parent_stack + [placeholder_element]
      merged_text_segment = TextSegment(
-       text="".join(t.text for t in text_segments),
+       text=self._render_latex(text_segments),
        parent_stack=parent_stack,
        left_common_depth=text_segments[0].left_common_depth,
        right_common_depth=text_segments[-1].right_common_depth,
-       block_depth=len(parent_stack),
+       block_depth=find_block_depth(parent_stack),
        position=text_segments[0].position,
      )
      self._placeholder2interrupted[id(placeholder_element)] = interrupted_element
@@ -116,8 +126,8 @@
      # The original stack has been fully unwound; only the stack relative to the interrupted element remains, matching the format expected for translated segments
      text_segment.left_common_depth = max(0, text_segment.left_common_depth - interrupted_index)
      text_segment.right_common_depth = max(0, text_segment.right_common_depth - interrupted_index)
-     text_segment.block_depth = 1
      text_segment.parent_stack = text_segment.parent_stack[interrupted_index:]
+     text_segment.block_depth = find_block_depth(text_segment.parent_stack)

      return merged_text_segment
@@ -129,37 +139,54 @@
          break
      return interrupted_index

+   def _render_latex(self, text_segments: list[TextSegment]) -> str:
+     math_element, _ = next(combine_text_segments(text_segments))
+     while math_element.tag != _MATH_TAG:
+       if len(math_element) == 0:
+         return ""
+       math_element = math_element[0]
+
+     math_element = clone_element(math_element)
+     math_element.attrib.pop(_ID_KEY, None)
+     math_element.tail = None
+     latex: str | None = None
+     try:
+       mathml_str = tostring(math_element, encoding="unicode")
+       soup = BeautifulSoup(mathml_str, "html.parser")
+       latex = process_mathml(soup)
+     except Exception:
+       pass
+
+     if latex is None:
+       latex = "".join(t.text for t in text_segments)
+       latex = normalize_whitespace(latex).strip()
+     else:
+       latex = normalize_whitespace(latex).strip()
+       if is_inline_element(math_element):
+         latex = f"${latex}$"
+       else:
+         latex = f"$${latex}$$"
+
+     return f" {latex} "
+
    def _expand_translated_text_segment(self, text_segment: TextSegment):
-     interrupted_id = text_segment.block_parent.attrib.pop(_ID_KEY, None)
+     parent_element = text_segment.parent_stack[-1]
+     interrupted_id = parent_element.attrib.pop(_ID_KEY, None)
      if interrupted_id is None:
        yield text_segment
        return

-     raw_text_segments = self._raw_text_segments.pop(interrupted_id, None)
-     if not raw_text_segments:
+     if parent_element is text_segment.block_parent:
+       # Block-level math, need to be hidden
        return

-     raw_block = raw_text_segments[0].parent_stack[0]
-     if not self._is_inline_math(raw_block):
+     raw_text_segments = self._raw_text_segments.pop(interrupted_id, None)
+     if not raw_text_segments:
+       yield text_segment
        return

      for raw_text_segment in raw_text_segments:
+       text_basic_parent_stack = text_segment.parent_stack[:-1]
        raw_text_segment.block_parent.attrib.pop(_ID_KEY, None)
+       raw_text_segment.parent_stack = text_basic_parent_stack + raw_text_segment.parent_stack
        yield raw_text_segment
-
-   def _has_no_math_texts(self, element: Element):
-     if element.tag == _MATH_TAG:
-       return True
-     if element.text and normalize_whitespace(element.text).strip():
-       return False
-     for child_element in element:
-       if not self._has_no_math_texts(child_element):
-         return False
-       if child_element.tail and normalize_whitespace(child_element.tail).strip():
-         return False
-     return True
-
-   def _is_inline_math(self, element: Element) -> bool:
-     if element.tag != _MATH_TAG:
-       return False
-     return element.get("display", "").lower() != "block"
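
The new `_render_latex` converts an interrupted `<math>` island to LaTeX via the newly added mathml2latex dependency, wraps the result in `$…$` or `$$…$$` depending on whether the element is inline, and falls back to the segment's plain text when conversion fails. A sketch isolating just the conversion path, mirroring the calls shown in the diff (the sample MathML string is made up):

```python
from bs4 import BeautifulSoup
from mathml2latex.mathml import process_mathml

mathml = (
  '<math xmlns="http://www.w3.org/1998/Math/MathML">'
  "<mi>x</mi><mo>+</mo><mn>1</mn></math>"
)

latex = None
try:
  latex = process_mathml(BeautifulSoup(mathml, "html.parser"))
except Exception:
  pass  # the real code falls back to the segment's plain text

if latex is not None:
  print(f"${latex.strip()}$")  # inline form; display="block" math gets $$ … $$
```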

epub_translator-0.1.8/epub_translator/xml/const.py

@@ -0,0 +1,2 @@
+ ID_KEY: str = "id"
+ DISPLAY_ATTRIBUTE: str = "display"