PyPI - markitai - Versions diffs - 0.3.0__py3-none-any.whl - Mend

markitai 0.3.0__py3-none-any.whl


Files changed (48)

markitai/__init__.py +3 -0
markitai/batch.py +1316 -0
markitai/cli.py +3979 -0
markitai/config.py +602 -0
markitai/config.schema.json +748 -0
markitai/constants.py +222 -0
markitai/converter/__init__.py +49 -0
markitai/converter/_patches.py +98 -0
markitai/converter/base.py +164 -0
markitai/converter/image.py +181 -0
markitai/converter/legacy.py +606 -0
markitai/converter/office.py +526 -0
markitai/converter/pdf.py +679 -0
markitai/converter/text.py +63 -0
markitai/fetch.py +1725 -0
markitai/image.py +1335 -0
markitai/json_order.py +550 -0
markitai/llm.py +4339 -0
markitai/ocr.py +347 -0
markitai/prompts/__init__.py +159 -0
markitai/prompts/cleaner.md +93 -0
markitai/prompts/document_enhance.md +77 -0
markitai/prompts/document_enhance_complete.md +65 -0
markitai/prompts/document_process.md +60 -0
markitai/prompts/frontmatter.md +28 -0
markitai/prompts/image_analysis.md +21 -0
markitai/prompts/image_caption.md +8 -0
markitai/prompts/image_description.md +13 -0
markitai/prompts/page_content.md +17 -0
markitai/prompts/url_enhance.md +78 -0
markitai/security.py +286 -0
markitai/types.py +30 -0
markitai/urls.py +187 -0
markitai/utils/__init__.py +33 -0
markitai/utils/executor.py +69 -0
markitai/utils/mime.py +85 -0
markitai/utils/office.py +262 -0
markitai/utils/output.py +53 -0
markitai/utils/paths.py +81 -0
markitai/utils/text.py +359 -0
markitai/workflow/__init__.py +37 -0
markitai/workflow/core.py +760 -0
markitai/workflow/helpers.py +509 -0
markitai/workflow/single.py +369 -0
markitai-0.3.0.dist-info/METADATA +159 -0
markitai-0.3.0.dist-info/RECORD +48 -0
markitai-0.3.0.dist-info/WHEEL +4 -0
markitai-0.3.0.dist-info/entry_points.txt +2 -0

markitai/prompts/document_enhance_complete.md ADDED

@@ -0,0 +1,65 @@
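+You are a document formatting cleanup expert. Your task is to clean up formatting problems in extracted text while keeping the content intact.
+You will receive:
+1. **Extracted text**: content extracted programmatically (the text is exact and includes links, tables, and image references)
+2. **Page images**: a visual reference for layout and structure
+## Core principles
+- **No translation**: keep the source in whatever language it is written in; do not translate Chinese into English or vice versa
+- **No rewriting**: keep the original wording and phrasing; adjust formatting only
+## Task 1: Format cleanup
+【Remove residue】
+- Remove isolated number lines left over from chart extraction (e.g. a line containing only "12", "10", "8")
+- Remove PPT headers and footers (short repeated text + page number)
+- Remove meaningless duplicate headings
+【Format fixes】
+- Fix heading levels (##, ###, etc.) using the page images as a reference
+- Fix list formatting (indentation, markers)
+- Fix table structure
+- Add short alt text to `![](assets/...)` images
+【Blank-line rules】
+- Keep one blank line before and after each heading (#)
+- Keep one blank line before and after each list block/table
+- Keep one blank line between paragraphs; remove extra blank lines
+## Prohibited
+- **Do not translate any content** - keep the source in whatever language it is written in
+- **Do not delete any paragraph or content** - remove only obvious residue/garbage
+- **Do not move content** - keep the original order
+- **Do not rewrite or paraphrase content** - keep the original text
+- **Do not add new content** - clean up only
+- **Do not wrap the output in a code block** - output plain Markdown directly, without a \`\`\`markdown wrapper
+- **All links must be preserved** - keep `[text](url)` as-is; URLs must not be modified
+- **All image reference positions must be preserved** - `![...](assets/...)` stays in place; URLs must not be modified
+- **Do not modify any URL** - image and hyperlink URLs must match the source exactly
+- **Do not fabricate URLs** - never guess, infer, or generate a URL that is not in the source
+- **Slide markers must be preserved** - keep `<!-- Slide number: X -->` as-is at the start of each slide's content, in the same position; do not add new slide comments
+- **Page number markers must be preserved** - keep `<!-- Page number: X -->` as-is at the start of each page's content, in the same position; do not add new page number comments
+- **All `__MARKITAI_*__` placeholders must be preserved verbatim** (e.g. `__MARKITAI_IMG_0__`, `__MARKITAI_SLIDENUM_0__`, `__MARKITAI_PAGENUM_0__`, `__MARKITAI_PAGE_0__`) - these are system-internal markers; neither their position nor their content may change
+- **Do not output page/image markers** - do not output system-internal markers such as `## Page X Image:`, `__MARKITAI_PAGE_LABEL_X__`, `__MARKITAI_IMG_LABEL_X__`
+## Image syntax rules
+Image references must strictly follow Markdown syntax; **do not add extra parentheses**:
+- Correct: `![alt text](assets/image.jpg)`
+- Wrong: `![alt text](assets/image.jpg))` (extra closing parenthesis)
+- Wrong: `![alt text](assets/image.jpg)))` (extra closing parentheses)
+## Task 2: Metadata generation
+Generate the following metadata from the document content:
+- title: document title (extracted from the content, concise and accurate)
+- description: summary of the full text (within 100 characters)
+- tags: array of relevant tags (3-5)
+**The output language must match the source document**
+---
+Source file: {source}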
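
The prompt above requires every `__MARKITAI_*__` placeholder to survive the model's cleanup unchanged. A minimal sketch of how a caller might verify that afterwards is shown below. The function name `missing_placeholders` and the regular expression are assumptions made for illustration; the diff shown here does not reveal how markitai itself performs this check.

```python
import re
from collections import Counter

# Assumed token shape: __MARKITAI_<NAME>__ with uppercase letters, digits, underscores.
PLACEHOLDER_RE = re.compile(r"__MARKITAI_[A-Z0-9_]+?__")


def missing_placeholders(original: str, cleaned: str) -> list[str]:
    """Return placeholders that occur fewer times in `cleaned` than in `original`."""
    before = Counter(PLACEHOLDER_RE.findall(original))
    after = Counter(PLACEHOLDER_RE.findall(cleaned))
    # Counter subtraction keeps only positive differences, i.e. lost tokens.
    return sorted((before - after).elements())


original = "__MARKITAI_PAGENUM_0__\n# Title\n\n__MARKITAI_IMG_0__\n\ntext __MARKITAI_IMG_1__"
cleaned = "__MARKITAI_PAGENUM_0__\n\n# Title\n\n__MARKITAI_IMG_0__\n\ntext"
lost = missing_placeholders(original, cleaned)
print(lost)
```

Comparing counts rather than sets also catches the case where a placeholder that appeared twice comes back only once.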

markitai/prompts/document_process.md ADDED

@@ -0,0 +1,60 @@
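+Process the following Markdown document and complete two tasks:
+## Task 1: Format optimization
+【Core principles】
+- **No translation**: keep the source in whatever language it is written in; do not translate Chinese into English or vice versa
+- **No rewriting**: keep the original wording and phrasing; adjust formatting only
+【Cleanup rules】
+- Keep slide markers (e.g. `<!-- Slide number: X -->`), do not add new slide comments, and remove all other HTML comments
+- Remove PPT headers and footers (usually short repeated text + page number)
+- Remove isolated number lines left over from charts
+- Remove meaningless duplicate headings
+【Blank-line rules】
+- Keep one blank line before and after each heading (#)
+- Keep one blank line before and after each code block (```)
+- Keep one blank line before and after each list block
+- Keep one blank line before and after each table
+- Keep one blank line between paragraphs; remove extra blank lines
+【Punctuation and emphasis】
+- Use English punctuation for English content and Chinese punctuation for Chinese content
+- Place closing emphasis markers inside the punctuation
+- Merge consecutive emphasis markers
+- Do not put a space between bold/italic markers and Chinese text
+【List rules】
+- Use - for all unordered lists
+- Use the 1. 2. 3. format for ordered lists
+- Indent nested lists by 2 spaces
+【Paragraph rules】
+- Merge paragraphs that were broken incorrectly (one sentence split across lines)
+- Keep meaningful line breaks (e.g. poetry, addresses, quotations)
+【Table rules】
+- If column headers are empty and the meaning of the content is clear, headers may be filled in from that meaning (in the same language as the table data)
+- If the first column is a plain numeric row index with no header, use the same language as the table data when adding the header
+【Must preserve】
+- Code block contents (verbatim)
+- Table row/column structure (except for added column headers)
+- Link and image syntax
+- Original content (no translation or rewriting)
+- **All `__MARKITAI_*__` placeholders must be preserved verbatim**
+## Task 2: Metadata generation
+Generate the following metadata from the document content (use {language}):
+- title: document title (extracted from the content, concise and accurate)
+- description: summary of the full text (within 100 characters)
+- tags: array of relevant tags (3-5)
+---
+Source file: {source}
+Document content:
+{content}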
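
This template carries three fill-in tokens: `{language}`, `{source}`, and `{content}`. The sketch below shows one way such tokens could be substituted in a single pass, so that brace sequences inside the document content are never re-expanded. The helper name `render_prompt` is hypothetical, and the loader in `markitai/prompts/__init__.py` may work differently.

```python
import re


def render_prompt(template: str, **values: str) -> str:
    """Replace each `{name}` token for which a value was supplied; leave other text alone."""
    if not values:
        return template
    # One alternation over the known names, e.g. \{(language|source|content)\}
    pattern = re.compile(r"\{(" + "|".join(map(re.escape, values)) + r")\}")
    # re.sub scans the template once, so substituted text is not rescanned.
    return pattern.sub(lambda m: values[m.group(1)], template)


template = "Metadata language: {language}\nSource file: {source}\nDocument content:\n{content}"
rendered = render_prompt(
    template,
    language="English",
    source="deck.pptx",
    content='config = {"a": 1}  # mentions {source} literally',
)
print(rendered)
```

Unlike `str.format`, this approach does not raise on literal braces that may appear in a template, and unknown `{tokens}` pass through untouched.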

markitai/prompts/frontmatter.md ADDED

@@ -0,0 +1,28 @@
+**⚠️ CRITICAL LANGUAGE RULE: Output language = {language}**
+If English → title/description/tags must be in English.
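+If Chinese → title/description/tags must be in Chinese.
+---
+Generate YAML frontmatter metadata from the Markdown content below.
+【Required fields】
+- title: document title (extracted from the content, concise and accurate)
+- source: {source}
+- description: summary of the full text (within 100 characters)
+- tags: array of relevant tags (3-5)
+【Output requirements】
+- Output plain YAML directly; do not wrap it in a code block
+- Do not add ```yaml or ``` markers
+- Do not add --- separators
+- Do not add any explanation or commentary
+**⚠️ Key point: the output language must match the source document**
+- If the source document is in **English**, title/description/tags must be in **English**
+- If the source document is in **Chinese**, title/description/tags must be in **Chinese**
+- Example: English document → `title: Data Overview`, `tags: [data, excel]`
+- Example: Chinese document → `title: 数据概览`, `tags: [数据, 表格]`
+Content: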
+{content}
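
The output requirements above ask for bare YAML with no code fence and no `---` separators. Models do not always comply, so a caller may want a defensive cleanup step. The following is a minimal sketch of such a step. The name `strip_yaml_wrappers` is hypothetical and is not taken from the package.

```python
def strip_yaml_wrappers(text: str) -> str:
    """Remove a surrounding code fence and leading/trailing `---` lines, if present."""
    lines = text.strip().splitlines()
    # Drop an opening fence such as ```yaml or ``` and the matching closing fence.
    if lines and lines[0].strip().startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].strip() == "```":
        lines = lines[:-1]
    # Drop frontmatter separators that the prompt asked the model not to emit.
    if lines and lines[0].strip() == "---":
        lines = lines[1:]
    if lines and lines[-1].strip() == "---":
        lines = lines[:-1]
    return "\n".join(lines).strip()


raw = "```yaml\n---\ntitle: Data Overview\ntags: [data, excel]\n---\n```"
clean = strip_yaml_wrappers(raw)
print(clean)
```

Output that already follows the rules passes through unchanged, so the step is safe to apply unconditionally.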

markitai/prompts/image_analysis.md ADDED

@@ -0,0 +1,21 @@
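+Analyze this image and generate three parts:
+## 1. Caption (short description)
+- Length: 10-30 characters
+- Used as the alt text of the Markdown image
+- Concisely summarize the main content of the image
+## 2. Description (detailed description)
+- Describe the main elements and scene in the image
+- If it is a chart, interpret what the data means
+- If it is a screenshot, describe the interface content
+- Use Markdown, organizing the content with ## and ###
+- **Keep one blank line before and after each heading (#)** (a blank line is required between a heading and the text above and below it)
+## 3. Extracted Text
+- If the image contains text, extract it in full
+- **Preserve the text layout of the original image** (line breaks, indentation, alignment, etc.)
+- If it is a table, use Markdown table format
+- If the image contains no text, output null
+**The output language must match the source document** - English for English documents, Chinese for Chinese documents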

markitai/prompts/image_caption.md ADDED

@@ -0,0 +1,8 @@
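+Generate a short description of this image to use as the alt text of a Markdown image.
+Requirements:
+- Length: 10-30 characters
+- Describe the main content of the image
+- **The output language must match the source document** - English for English documents, Chinese for Chinese documents
+Output the description text directly, without any explanation.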

markitai/prompts/image_description.md ADDED

@@ -0,0 +1,13 @@
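+Describe the content of this image in detail.
+Requirements:
+1. Describe the main elements and scene in the image
+2. If there is text, recognize and list it
+3. If it is a chart, interpret what the data means
+4. If it is a screenshot, describe the interface content
+Output format requirements:
+- Use Markdown
+- Organize the content with second-level (##) and third-level (###) headings
+- Do not use first-level headings (#)
+- Output the content directly, without an introductory note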

markitai/prompts/page_content.md ADDED

@@ -0,0 +1,17 @@
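+You are a document content extraction expert. Your task is to convert document page images into clearly structured Markdown text.
+Requirements:
+1. Extract all text content from the page image
+2. Preserve the document structure (headings, paragraphs, lists, tables)
+3. Convert tables to Markdown table format
+4. Describe charts/graphics with Markdown image syntax: `![Chart: brief description]()`
+5. Describe embedded images with Markdown image syntax: `![Image: brief description]()`
+6. Ignore page numbers, headers and footers, and decorative elements
+Output format:
+- Use correct Markdown heading levels (##, ###, etc.)
+- Do not use first-level headings (#)
+- Use correct list formatting (- or 1. 2. 3.)
+- Use Markdown table syntax for tables
+- **The output language must match the source document** - extract and describe in the language of the original
+- Output only the extracted content; do not add explanations or meta-commentary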

markitai/prompts/url_enhance.md ADDED

@@ -0,0 +1,78 @@
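+You are a web content cleanup expert. Your task is to clean up content fetched from a web page, removing noise and keeping the core content.
+You will receive:
+1. **Fetched text**: Markdown content fetched programmatically from the web page
+2. **Page screenshot**: a visual reference for the web page (if available)
+## Core principles - must be followed strictly
+- **No translation (CRITICAL - DO NOT TRANSLATE)**:
+  - English input → English output (English in → English out)
+  - Chinese input → Chinese output (中文输入 → 中文输出)
+  - Never translate any language into another language
+  - Violating this rule makes the output invalid
+- **No rewriting**: keep the original wording and phrasing; adjust formatting only
+## Task 1: Content cleanup
+【Remove web page noise】
+- Remove navigation menus and sidebar content
+- Remove headers and footers (e.g. copyright notices, site links, "Powered by")
+- Remove cookie notices and pop-up prompt text
+- Remove advertising content
+- Remove social sharing button text (e.g. "分享到 Twitter" (share to Twitter), "Like", "Share")
+- Remove comment sections (unless they are the core content of the article)
+- Remove recommendation links such as "相关文章" (related articles), "推荐阅读" (recommended reading), "You might also enjoy"
+- Remove subscription prompts, newsletter sign-ups, "Sign up" prompts, etc.
+- Remove site footer information: copyright notices, theme information, visit statistics, "TOP" back-to-top links
+- Remove legal links such as Terms of Service and Privacy Policy
+【Special handling for social media】
+- Twitter/X: remove duplicated tweet content (the same tweet may be fetched several times), keeping only one complete copy
+- Remove platform prompts such as "Don't miss what's happening" and "New to X?"
+- Remove engagement statistics text (e.g. "56 replies, 28 reposts, 319 likes")
+【Format fixes】
+- Fix heading levels (##, ###, etc.) using the page screenshot as a reference
+- Fix list formatting (indentation, markers)
+- Fix table structure
+- Add short alt text to `![](...)` images (based on the screenshot context)
+- Repair links broken across lines: merge `[text\n\ndescription](url)` into `[text](url)`
+【Blank-line rules】
+- Keep one blank line before and after each heading (#)
+- Keep one blank line before and after each list block/table
+- Keep one blank line between paragraphs; remove extra blank lines
+## Prohibited
+- **Do not translate any content** - keep the source in whatever language it is written in
+- **Do not delete article body content** - remove only obvious web page noise
+- **Do not move body content** - keep the original order
+- **Do not rewrite or paraphrase content** - keep the original text
+- **Do not add new content** - clean up only
+- **Do not wrap the output in a code block** - output plain Markdown directly, without a \`\`\`markdown wrapper
+- **All links must be preserved** - keep `[text](url)` as-is; URLs must not be modified
+- **All image references must be preserved** - `![...](...)` stays in place; URLs must not be modified
+## URL protection - CRITICAL
+- **Do not modify any URL** - image and hyperlink URLs must match the source exactly
+- **Do not fabricate URLs** - never guess, infer, or generate a URL that is not in the source
+- **Do not replace URLs** - even if a URL looks "incorrect" or "outdated", it must be kept as-is
+- Example: source `![](https://old-cdn.com/image.jpg)` → the output must be `![](https://old-cdn.com/image.jpg)`
+- Do not "infer" a more plausible URL from the page context
+- Violating this rule makes the output invalid
+## Task 2: Metadata generation
+Generate the following metadata from the web page content:
+- title: article/page title (extracted from the content, concise and accurate)
+- description: content summary (within 100 characters)
+- tags: array of relevant tags (3-5)
+**The output language must match the source content**
+---
+Source URL: {source}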
markitai/security.py ADDED

@@ -0,0 +1,286 @@
+"""Security utilities for Markitai."""
+from __future__ import annotations
+import json
+import os
+import re
+import tempfile
+from collections.abc import Callable
+from pathlib import Path
+from typing import Any
+from loguru import logger
+from markitai.constants import DEFAULT_JSON_INDENT
+def atomic_write_text(
+    path: Path,
+    content: str,
+    encoding: str = "utf-8",
+) -> None:
+    """Write text to file atomically using temp file + rename.
+    This prevents partial writes and ensures file integrity even if
+    the process is interrupted during write.
+    Args:
+        path: Target file path
+        content: Text content to write
+        encoding: Text encoding (default: utf-8)
+    """
+    path = Path(path)
+    parent = path.parent
+    parent.mkdir(parents=True, exist_ok=True)
+    # Create temp file in same directory (ensures same filesystem for rename)
+    fd, tmp_path = tempfile.mkstemp(
+        suffix=".tmp",
+        prefix=f".{path.name}.",
+        dir=parent,
+    )
+    try:
+        with os.fdopen(fd, "w", encoding=encoding) as f:
+            f.write(content)
+        # Atomic rename (POSIX guarantees atomicity on same filesystem)
+        os.replace(tmp_path, path)
+    except Exception:
+        # Clean up temp file on error
+        try:
+            os.unlink(tmp_path)
+        except OSError:
+            pass
+        raise
+def atomic_write_json(
+    path: Path,
+    obj: Any,
+    indent: int = DEFAULT_JSON_INDENT,
+    ensure_ascii: bool = False,
+    order_func: Callable[[dict[str, Any]], dict[str, Any]] | None = None,
+) -> None:
+    """Write JSON to file atomically.
+    Args:
+        path: Target file path
+        obj: Object to serialize as JSON
+        indent: JSON indentation (default: 2)
+        ensure_ascii: If True, escape non-ASCII characters (default: False)
+        order_func: Optional function to order/transform dict before serialization
+    """
+    if order_func is not None and isinstance(obj, dict):
+        obj = order_func(obj)
+    content = json.dumps(obj, indent=indent, ensure_ascii=ensure_ascii)
+    atomic_write_text(path, content, encoding="utf-8")
+async def atomic_write_text_async(
+    path: Path,
+    content: str,
+    encoding: str = "utf-8",
+) -> None:
+    """Write text to file atomically using temp file + rename (async version).
+    This prevents partial writes and ensures file integrity even if
+    the process is interrupted during write.
+    Args:
+        path: Target file path
+        content: Text content to write
+        encoding: Text encoding (default: utf-8)
+    """
+    import aiofiles
+    import aiofiles.os
+    path = Path(path)
+    parent = path.parent
+    parent.mkdir(parents=True, exist_ok=True)
+    # Create temp file in same directory (ensures same filesystem for rename)
+    fd, tmp_path = tempfile.mkstemp(
+        suffix=".tmp",
+        prefix=f".{path.name}.",
+        dir=parent,
+    )
+    try:
+        # Close fd and use aiofiles for async write
+        os.close(fd)
+        async with aiofiles.open(tmp_path, "w", encoding=encoding) as f:
+            await f.write(content)
+        # Atomic rename (POSIX guarantees atomicity on same filesystem)
+        await aiofiles.os.replace(tmp_path, path)
+    except Exception:
+        # Clean up temp file on error
+        try:
+            await aiofiles.os.remove(tmp_path)
+        except OSError:
+            pass
+        raise
+async def atomic_write_json_async(
+    path: Path,
+    obj: Any,
+    indent: int = DEFAULT_JSON_INDENT,
+    ensure_ascii: bool = False,
+) -> None:
+    """Write JSON to file atomically (async version).
+    Args:
+        path: Target file path
+        obj: Object to serialize as JSON
+        indent: JSON indentation (default: 2)
+        ensure_ascii: If True, escape non-ASCII characters (default: False)
+    """
+    content = json.dumps(obj, indent=indent, ensure_ascii=ensure_ascii)
+    await atomic_write_text_async(path, content, encoding="utf-8")
+async def write_bytes_async(path: Path, data: bytes) -> None:
+    """Write bytes to file asynchronously.
+    Args:
+        path: Target file path
+        data: Bytes to write
+    """
+    import aiofiles
+    path = Path(path)
+    path.parent.mkdir(parents=True, exist_ok=True)
+    async with aiofiles.open(path, "wb") as f:
+        await f.write(data)
+def escape_glob_pattern(s: str) -> str:
+    """Escape special glob characters in a string.
+    Args:
+        s: String that may contain glob special characters
+    Returns:
+        Escaped string safe for use in glob patterns
+    """
+    # Escape glob special characters: [ ] * ?
+    return s.translate(
+        str.maketrans(
+            {
+                "[": "[[]",
+                "]": "[]]",
+                "*": "[*]",
+                "?": "[?]",
+            }
+        )
+    )
+def validate_path_within_base(path: Path, base_dir: Path) -> Path:
+    """Validate that a path is within the base directory.
+    Args:
+        path: Path to validate
+        base_dir: Base directory that path must be within
+    Returns:
+        Resolved absolute path
+    Raises:
+        ValueError: If path is outside base directory
+    """
+    resolved = path.resolve()
+    base_resolved = base_dir.resolve()
+    try:
+        resolved.relative_to(base_resolved)
+    except ValueError:
+        raise ValueError(
+            f"Path traversal detected: {path} is outside {base_dir}"
+        ) from None
+    return resolved
+def check_symlink_safety(path: Path, allow_symlinks: bool = False) -> None:
+    """Check if a path involves symlinks at any level.
+    This function checks not just the final path, but all parent directories
+    to detect nested symlinks that could be used for path traversal.
+    Args:
+        path: Path to check
+        allow_symlinks: If False, raises error on symlinks
+    Raises:
+        ValueError: If symlinks are not allowed and any path component is a symlink
+    """
+    # Check the path itself
+    if path.is_symlink():
+        if not allow_symlinks:
+            target = path.readlink()
+            raise ValueError(f"Symlink not allowed: {path} -> {target}")
+        else:
+            logger.warning(f"Symlink detected: {path} -> {path.readlink()}")
+            return  # If symlinks allowed, no need to check further
+    # Check all parent directories for nested symlinks
+    if not allow_symlinks:
+        checked_parts: list[Path] = []
+        for part in path.parts:
+            checked_parts.append(
+                Path(part) if not checked_parts else checked_parts[-1] / part
+            )
+            current_path = checked_parts[-1]
+            # Only check if path exists and is absolute enough to be meaningful
+            if (
+                len(checked_parts) > 1
+                and current_path.exists()
+                and current_path.is_symlink()
+            ):
+                target = current_path.readlink()
+                raise ValueError(
+                    f"Nested symlink not allowed: {current_path} -> {target} (in path {path})"
+                )
+def sanitize_error_message(error: Exception) -> str:
+    """Sanitize error message to remove sensitive information.
+    Args:
+        error: Exception to sanitize
+    Returns:
+        Sanitized error message
+    """
+    msg = str(error)
+    # Remove absolute paths (Unix style)
+    msg = re.sub(r"/[a-zA-Z0-9_\-./]+", "[PATH]", msg)
+    # Remove absolute paths (Windows style)
+    msg = re.sub(r"[A-Za-z]:\\[a-zA-Z0-9_\-\\. ]+", "[PATH]", msg)
+    # Remove potential usernames in paths
+    msg = re.sub(r"/home/[^/\s]+/", "/home/[USER]/", msg)
+    msg = re.sub(r"C:\\Users\\[^\\]+\\", r"C:\\Users\\[USER]\\", msg)
+    return msg
+def validate_file_size(path: Path, max_size_bytes: int) -> None:
+    """Validate that a file is within size limits.
+    Args:
+        path: Path to file
+        max_size_bytes: Maximum allowed size in bytes
+    Raises:
+        ValueError: If file exceeds size limit
+    """
+    if not path.exists():
+        return
+    size = path.stat().st_size
+    if size > max_size_bytes:
+        raise ValueError(
+            f"File too large: {path.name} is {size} bytes (max: {max_size_bytes} bytes)"
+        )
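
Two of the helpers above are easy to exercise in isolation. The sketch below copies `escape_glob_pattern` and `validate_path_within_base` from the diff, condensed, so that it runs without installing the package. The file names are made up for the demonstration. The standard library's `glob.escape` is a related alternative, though it escapes only `*`, `?`, and `[`.

```python
import fnmatch
import tempfile
from pathlib import Path


def escape_glob_pattern(s: str) -> str:
    """Condensed copy of the helper in the diff: wrap each glob metacharacter in [...]."""
    return s.translate(str.maketrans({"[": "[[]", "]": "[]]", "*": "[*]", "?": "[?]"}))


def validate_path_within_base(path: Path, base_dir: Path) -> Path:
    """Condensed copy of the helper in the diff: reject paths resolving outside base_dir."""
    resolved = path.resolve()
    try:
        resolved.relative_to(base_dir.resolve())
    except ValueError:
        raise ValueError(f"Path traversal detected: {path} is outside {base_dir}") from None
    return resolved


# A file name containing glob metacharacters matches itself only once escaped.
name = "report[1]*.pdf"
escaped = escape_glob_pattern(name)
matches_when_escaped = fnmatch.fnmatchcase(name, escaped)
matches_when_raw = fnmatch.fnmatchcase(name, name)

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    # `sub/..` collapses back into the base directory, so this path is accepted.
    inside = validate_path_within_base(base / "sub" / ".." / "out.md", base)
    inside_ok = inside == base.resolve() / "out.md"
    # A leading `..` climbs out of the base directory and is rejected.
    try:
        validate_path_within_base(base / ".." / "escape.md", base)
        traversal_rejected = False
    except ValueError:
        traversal_rejected = True

print(escaped, matches_when_escaped, matches_when_raw, inside_ok, traversal_rejected)
```

Because the check runs on resolved paths, it catches `..` segments and symlinked components alike, which is why both the candidate and the base are resolved before comparison.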

markitai/types.py ADDED

@@ -0,0 +1,30 @@
+"""Common type definitions for Markitai."""
+from __future__ import annotations
+from typing import TypedDict
+class ModelUsageStats(TypedDict):
+    """Statistics for a single LLM model's usage."""
+    requests: int
+    input_tokens: int
+    output_tokens: int
+    cost_usd: float
+# Type alias for LLM usage by model
+# Format: {"model_name": {"requests": N, "input_tokens": N, "output_tokens": N, "cost_usd": F}}
+LLMUsageByModel = dict[str, ModelUsageStats]
+class AssetDescription(TypedDict, total=False):
+    """Description of an extracted asset (image)."""
+    asset: str  # Asset file path
+    alt: str  # Short alt text
+    desc: str  # Detailed description
+    text: str | None  # Extracted text (optional)
+    llm_usage: LLMUsageByModel  # LLM usage for this asset (optional)
+    created: str  # Creation timestamp (optional)
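
`LLMUsageByModel` maps a model name to its counters, which makes per-file usage records straightforward to combine. The sketch below sums several such mappings. `merge_usage` is a hypothetical helper rather than part of the diff shown here, and the model names are placeholders.

```python
from typing import TypedDict


class ModelUsageStats(TypedDict):
    """Same fields as the TypedDict in markitai/types.py."""

    requests: int
    input_tokens: int
    output_tokens: int
    cost_usd: float


LLMUsageByModel = dict[str, ModelUsageStats]


def merge_usage(*usages: LLMUsageByModel) -> LLMUsageByModel:
    """Sum per-model counters across any number of usage mappings."""
    total: LLMUsageByModel = {}
    for usage in usages:
        for model, stats in usage.items():
            acc = total.setdefault(
                model,
                ModelUsageStats(requests=0, input_tokens=0, output_tokens=0, cost_usd=0.0),
            )
            acc["requests"] += stats["requests"]
            acc["input_tokens"] += stats["input_tokens"]
            acc["output_tokens"] += stats["output_tokens"]
            acc["cost_usd"] += stats["cost_usd"]
    return total


first: LLMUsageByModel = {
    "model-a": {"requests": 1, "input_tokens": 100, "output_tokens": 20, "cost_usd": 0.5},
}
second: LLMUsageByModel = {
    "model-a": {"requests": 2, "input_tokens": 50, "output_tokens": 10, "cost_usd": 0.25},
    "model-b": {"requests": 1, "input_tokens": 10, "output_tokens": 5, "cost_usd": 0.125},
}
merged = merge_usage(first, second)
print(merged)
```

The accumulator is a fresh dictionary per model, so the input mappings are left unmodified.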