PyPI - python-hwpx - Versions diffs - 2.5__tar.gz → 2.7__tar.gz - Mend

python-hwpx 2.5tar.gz → 2.7tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (66)

{python_hwpx-2.5 → python_hwpx-2.7}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: python-hwpx
-Version: 2.5
+Version: 2.7
 Summary: A collection of Python utilities for loading and editing Hancom HWPX packages
 Author: python-hwpx Maintainers
 License: Non-Commercial License
@@ -165,7 +165,8 @@ doc.save_to_path("결과물.hwpx")
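 | 🔎 **Object search** | Tag/attribute/XPath | Locate specific elements, annotation iterators |
 | 🎨 **Style replacement** | Formatting-based filters | Find and replace Runs by color/underline/charPrIDRef |
 | 📤 **Export** | Text/HTML/Markdown | Document conversion output |
-| ✅ **Validation** | XSD schema | CLI (`hwpx-validate`) and API |
+| ✅ **Validation** | XSD + package structure | CLI (`hwpx-validate`, `hwpx-validate-package`) and API |
+| 🧰 **Workflow tools** | unpack/pack/template analyze/page guard | Helpers for template-preserving, XML-first workflows |
 | 🏗️ **Low-level XML** | Dataclass mapping | Direct manipulation of OWPML schema ↔ Python objects |
 | 🔄 **Namespace compatibility** | Automatic normalization | Automatic HWPML 2016 → 2011 conversion |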
@@ -262,10 +263,15 @@ python-hwpx
 │   ├── body.py          #   Typed body model
 │   └── common.py        #   Generic XML ↔ dataclass
 ├── hwpx.tools
+│   ├── archive_cli      #   unpack/pack CLI and repacking metadata
 │   ├── text_extractor   #   Text extraction pipeline
+│   ├── text_extract_cli #   Text extraction CLI
 │   ├── object_finder    #   Object search utilities
 │   ├── exporter         #   Text/HTML/Markdown export
-│   └── validator        #   Schema validation (hwpx-validate CLI)
+│   ├── validator        #   Schema validation (hwpx-validate CLI)
+│   ├── package_validator#   ZIP/OPC/HWPX structure checks
+│   ├── page_guard       #   layout-drift proxy
+│   └── template_analyzer#   Reference document analysis/extraction
 └── hwpx.templates       # Built-in blank document template
 ```
@@ -274,8 +280,26 @@ python-hwpx
 ```bash
 # Validate an HWPX document against the schema
 hwpx-validate 문서.hwpx
+# Check ZIP/OPC/HWPX package structure
+hwpx-validate-package 문서.hwpx
+# Unpack / repack HWPX
+hwpx-unpack 문서.hwpx ./unpacked
+hwpx-pack ./unpacked ./repacked.hwpx
+# Analyze a reference template and extract its parts
+hwpx-analyze-template 문서.hwpx --extract-dir ./template-parts --json
+# Extract plain / markdown text
+hwpx-text-extract 문서.hwpx --format markdown --output 문서.md
+# Compare layout-drift proxy metrics
+hwpx-page-guard --reference 원본.hwpx --output 결과.hwpx
 ```
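+`hwpx-page-guard` does not compute the actual rendered page count. Instead, it is a proxy tool that detects layout-drift risk by comparing paragraph counts, table counts, shape/control counts, explicit page/column breaks, and text-length statistics.
 ## Documentation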
 | | |

{python_hwpx-2.5 → python_hwpx-2.7}/README.md RENAMED

@@ -98,7 +98,8 @@ doc.save_to_path("결과물.hwpx")
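 | 🔎 **Object search** | Tag/attribute/XPath | Locate specific elements, annotation iterators |
 | 🎨 **Style replacement** | Formatting-based filters | Find and replace Runs by color/underline/charPrIDRef |
 | 📤 **Export** | Text/HTML/Markdown | Document conversion output |
-| ✅ **Validation** | XSD schema | CLI (`hwpx-validate`) and API |
+| ✅ **Validation** | XSD + package structure | CLI (`hwpx-validate`, `hwpx-validate-package`) and API |
+| 🧰 **Workflow tools** | unpack/pack/template analyze/page guard | Helpers for template-preserving, XML-first workflows |
 | 🏗️ **Low-level XML** | Dataclass mapping | Direct manipulation of OWPML schema ↔ Python objects |
 | 🔄 **Namespace compatibility** | Automatic normalization | Automatic HWPML 2016 → 2011 conversion |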
@@ -195,10 +196,15 @@ python-hwpx
 │   ├── body.py          #   Typed body model
 │   └── common.py        #   Generic XML ↔ dataclass
 ├── hwpx.tools
+│   ├── archive_cli      #   unpack/pack CLI and repacking metadata
 │   ├── text_extractor   #   Text extraction pipeline
+│   ├── text_extract_cli #   Text extraction CLI
 │   ├── object_finder    #   Object search utilities
 │   ├── exporter         #   Text/HTML/Markdown export
-│   └── validator        #   Schema validation (hwpx-validate CLI)
+│   ├── validator        #   Schema validation (hwpx-validate CLI)
+│   ├── package_validator#   ZIP/OPC/HWPX structure checks
+│   ├── page_guard       #   layout-drift proxy
+│   └── template_analyzer#   Reference document analysis/extraction
 └── hwpx.templates       # Built-in blank document template
 ```
@@ -207,8 +213,26 @@ python-hwpx
 ```bash
 # Validate an HWPX document against the schema
 hwpx-validate 문서.hwpx
+# Check ZIP/OPC/HWPX package structure
+hwpx-validate-package 문서.hwpx
+# Unpack / repack HWPX
+hwpx-unpack 문서.hwpx ./unpacked
+hwpx-pack ./unpacked ./repacked.hwpx
+# Analyze a reference template and extract its parts
+hwpx-analyze-template 문서.hwpx --extract-dir ./template-parts --json
+# Extract plain / markdown text
+hwpx-text-extract 문서.hwpx --format markdown --output 문서.md
+# Compare layout-drift proxy metrics
+hwpx-page-guard --reference 원본.hwpx --output 결과.hwpx
 ```
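+`hwpx-page-guard` does not compute the actual rendered page count. Instead, it is a proxy tool that detects layout-drift risk by comparing paragraph counts, table counts, shape/control counts, explicit page/column breaks, and text-length statistics.
 ## Documentation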
 | | |
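
The README hunk above describes `hwpx-page-guard` as a proxy that compares structural counts instead of rendered pages. The sketch below illustrates that idea only; it is not the package's `page_guard` implementation, whose source is not part of this excerpt. `ToyMetrics`, `collect_toy_metrics`, `compare_toy_metrics`, the `p`/`tbl` element names, and the 10% text-length threshold are all assumptions made for illustration.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyMetrics:
    """Hypothetical subset of the statistics a layout-drift proxy might track."""

    paragraphs: int
    tables: int
    text_length: int


def _local_name(tag: str) -> str:
    # ElementTree spells namespaced tags as "{uri}name"; keep only "name".
    return tag.rsplit("}", 1)[-1]


def collect_toy_metrics(xml_text: str) -> ToyMetrics:
    """Count elements named 'p' and 'tbl' (assumed names) and total text length."""
    root = ET.fromstring(xml_text)
    paragraphs = 0
    tables = 0
    for element in root.iter():
        name = _local_name(element.tag)
        if name == "p":
            paragraphs += 1
        elif name == "tbl":
            tables += 1
    text_length = sum(len(chunk) for chunk in root.itertext())
    return ToyMetrics(paragraphs=paragraphs, tables=tables, text_length=text_length)


def compare_toy_metrics(
    reference: ToyMetrics,
    output: ToyMetrics,
    *,
    max_text_delta_ratio: float = 0.10,  # assumed threshold, not from the source
) -> list[str]:
    """Return human-readable warnings; an empty list means no drift was flagged."""
    warnings: list[str] = []
    if reference.paragraphs != output.paragraphs:
        warnings.append(
            f"paragraph count changed: {reference.paragraphs} -> {output.paragraphs}"
        )
    if reference.tables != output.tables:
        warnings.append(f"table count changed: {reference.tables} -> {output.tables}")
    baseline = max(reference.text_length, 1)
    ratio = abs(output.text_length - reference.text_length) / baseline
    if ratio > max_text_delta_ratio:
        warnings.append(f"text length changed by {ratio:.0%}")
    return warnings
```

A check like this can only flag risk: two documents with identical counts may still paginate differently, which is why the README calls the tool a proxy.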

{python_hwpx-2.5 → python_hwpx-2.7}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "python-hwpx"
-version = "2.5"
+version = "2.7"
 description = "A collection of Python utilities for loading and editing Hancom HWPX packages"
 readme = { file = "README.md", content-type = "text/markdown" }
 license = { file = "LICENSE" }
@@ -49,7 +49,13 @@ Documentation = "https://github.com/airmang/python-hwpx/tree/main/docs"
 Issues = "https://github.com/airmang/python-hwpx/issues"
 [project.scripts]
+hwpx-unpack = "hwpx.tools.archive_cli:unpack_main"
+hwpx-pack = "hwpx.tools.archive_cli:pack_main"
 hwpx-validate = "hwpx.tools.validator:main"
+hwpx-validate-package = "hwpx.tools.package_validator:main"
+hwpx-page-guard = "hwpx.tools.page_guard:main"
+hwpx-analyze-template = "hwpx.tools.template_analyzer:main"
+hwpx-text-extract = "hwpx.tools.text_extract_cli:main"
 [tool.setuptools]
 package-dir = { "" = "src" }

{python_hwpx-2.5 → python_hwpx-2.7}/src/hwpx/document.py RENAMED

@@ -1280,7 +1280,7 @@ class HwpxDocument:
         """
         from .tools.validator import validate_document
-        return validate_document(self._to_bytes_raw())
+        return validate_document(self._to_bytes_raw(reset_dirty=False))
     def _run_pre_save_validation(self) -> None:
         """Raise if validate_on_save is enabled and the document is invalid."""
@@ -1318,11 +1318,16 @@ class HwpxDocument:
         self._run_pre_save_validation()
         return self._to_bytes_raw()
-    def _to_bytes_raw(self) -> bytes:
-        """Serialize without validation (used by :meth:`validate`)."""
+    def _to_bytes_raw(self, *, reset_dirty: bool = True) -> bytes:
+        """Serialize without validation.
+        When ``reset_dirty`` is ``False``, the document remains marked as
+        modified after the archive snapshot is generated.
+        """
         updates = self._root.serialize()
         result = self._package.save(None, updates)
-        self._root.reset_dirty()
+        if reset_dirty:
+            self._root.reset_dirty()
         if isinstance(result, bytes):
             return result
         raise TypeError("package.save(None) must return bytes")
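
The `reset_dirty` change above keeps `validate()` from clearing the modified flag as a side effect of serializing a snapshot. The following is a minimal sketch of that behavior using a hypothetical `ToyDocument` class; it is not the real `HwpxDocument`, and the `set_text` method and `dirty` property exist only for this illustration.

```python
class ToyDocument:
    """Hypothetical stand-in that mirrors only the dirty-flag logic of the hunk."""

    def __init__(self) -> None:
        self._text = ""
        self._dirty = False

    @property
    def dirty(self) -> bool:
        return self._dirty

    def set_text(self, text: str) -> None:
        # Any edit marks the document as modified.
        self._text = text
        self._dirty = True

    def _to_bytes_raw(self, *, reset_dirty: bool = True) -> bytes:
        payload = self._text.encode("utf-8")
        if reset_dirty:
            # Only a real save path should clear the modified flag.
            self._dirty = False
        return payload

    def validate(self) -> bool:
        # Validation needs a snapshot but must not pretend the document was saved.
        snapshot = self._to_bytes_raw(reset_dirty=False)
        return len(snapshot) > 0

    def to_bytes(self) -> bytes:
        return self._to_bytes_raw()


doc = ToyDocument()
doc.set_text("draft")
doc.validate()
still_dirty_after_validate = doc.dirty  # True: validation left the flag alone
doc.to_bytes()
dirty_after_save = doc.dirty  # False: serializing for output cleared it
```

With the 2.5 behavior, the equivalent of `validate()` would have reset the flag, so a later check for unsaved changes could wrongly report a clean document.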

{python_hwpx-2.5 → python_hwpx-2.7}/src/hwpx/tools/__init__.py RENAMED

@@ -6,6 +6,16 @@ from .exporter import (
     export_text,
 )
 from .object_finder import FoundElement, ObjectFinder
+from .package_validator import (
+    PackageValidationIssue,
+    PackageValidationReport,
+    validate_package,
+)
+from .page_guard import (
+    DocumentMetrics,
+    collect_metrics,
+    compare_metrics,
+)
 from .text_extractor import (
     DEFAULT_NAMESPACES,
     ParagraphInfo,
@@ -33,6 +43,12 @@ __all__ = [
     "strip_namespace",
     "FoundElement",
     "ObjectFinder",
+    "PackageValidationIssue",
+    "PackageValidationReport",
+    "validate_package",
+    "DocumentMetrics",
+    "collect_metrics",
+    "compare_metrics",
     "DocumentSchemas",
     "ValidationIssue",
     "ValidationReport",

python_hwpx-2.7/src/hwpx/tools/archive_cli.py ADDED

@@ -0,0 +1,337 @@
+from __future__ import annotations
+import argparse
+import json
+import os
+import shutil
+import tempfile
+from dataclasses import asdict, dataclass
+from pathlib import Path
+from typing import Sequence
+from zipfile import ZIP_DEFLATED, ZIP_STORED, ZipFile
+from lxml import etree
+from .package_validator import validate_package
+_XML_SUFFIXES = (".xml", ".hpf")
+_PACK_METADATA_NAME = ".hwpx-pack-metadata.json"
+__all__ = [
+    "ArchiveEntryInfo",
+    "UnpackResult",
+    "PackResult",
+    "pack_hwpx",
+    "unpack_hwpx",
+    "pack_main",
+    "unpack_main",
+    "main",
+]
+@dataclass(frozen=True)
+class ArchiveEntryInfo:
+    path: str
+    compress_type: int
+@dataclass(frozen=True)
+class UnpackResult:
+    output_dir: Path
+    metadata_path: Path
+    entries: tuple[ArchiveEntryInfo, ...]
+@dataclass(frozen=True)
+class PackResult:
+    output_path: Path
+    entries: tuple[str, ...]
+def _guard_destructive_target(path: Path) -> None:
+    resolved = path.resolve()
+    if resolved == Path(resolved.anchor):
+        raise ValueError(f"refusing to overwrite filesystem root: {resolved}")
+    if resolved == Path.cwd().resolve():
+        raise ValueError(f"refusing to overwrite current working directory: {resolved}")
+def _prepare_output_dir(output_dir: Path, *, overwrite: bool) -> None:
+    if output_dir.exists() and not output_dir.is_dir():
+        raise NotADirectoryError(f"output exists and is not a directory: {output_dir}")
+    if output_dir.exists():
+        if any(output_dir.iterdir()):
+            if not overwrite:
+                raise FileExistsError(f"output directory is not empty: {output_dir}")
+            _guard_destructive_target(output_dir)
+            shutil.rmtree(output_dir)
+        else:
+            output_dir.rmdir()
+    output_dir.mkdir(parents=True, exist_ok=True)
+def _prepare_output_path(output_path: Path, *, overwrite: bool) -> None:
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    if output_path.exists() and not overwrite:
+        raise FileExistsError(f"output file already exists: {output_path}")
+def _format_xml_bytes(payload: bytes) -> bytes:
+    try:
+        element = etree.fromstring(payload)
+    except etree.XMLSyntaxError:
+        return payload
+    etree.indent(element, space="  ")
+    return etree.tostring(
+        element,
+        pretty_print=True,
+        xml_declaration=True,
+        encoding="UTF-8",
+    )
+def _iter_file_entries(archive: ZipFile) -> tuple[ArchiveEntryInfo, ...]:
+    entries: list[ArchiveEntryInfo] = []
+    for info in archive.infolist():
+        if info.is_dir():
+            continue
+        entries.append(ArchiveEntryInfo(path=info.filename, compress_type=info.compress_type))
+    return tuple(entries)
+def _metadata_path(root: Path) -> Path:
+    return root / _PACK_METADATA_NAME
+def _write_pack_metadata(root: Path, entries: tuple[ArchiveEntryInfo, ...]) -> Path:
+    destination = _metadata_path(root)
+    payload = {
+        "format_version": 1,
+        "entries": [asdict(entry) for entry in entries],
+    }
+    destination.write_text(json.dumps(payload, indent=2), encoding="utf-8")
+    return destination
+def _read_pack_metadata(root: Path) -> tuple[ArchiveEntryInfo, ...]:
+    metadata_file = _metadata_path(root)
+    if not metadata_file.is_file():
+        return ()
+    data = json.loads(metadata_file.read_text(encoding="utf-8"))
+    entries: list[ArchiveEntryInfo] = []
+    for entry in data.get("entries", []):
+        path = str(entry.get("path", "")).strip()
+        if not path:
+            continue
+        entries.append(
+            ArchiveEntryInfo(
+                path=path.replace("\\", "/"),
+                compress_type=int(entry.get("compress_type", ZIP_DEFLATED)),
+            )
+        )
+    return tuple(entries)
+def _discover_files(root: Path) -> set[str]:
+    paths: set[str] = set()
+    for path in root.rglob("*"):
+        if not path.is_file():
+            continue
+        rel_path = path.relative_to(root).as_posix()
+        if rel_path == _PACK_METADATA_NAME:
+            continue
+        paths.add(rel_path)
+    return paths
+def _resolve_write_order(paths: set[str], metadata: tuple[ArchiveEntryInfo, ...]) -> tuple[str, ...]:
+    ordered: list[str] = []
+    seen: set[str] = set()
+    if "mimetype" in paths:
+        ordered.append("mimetype")
+        seen.add("mimetype")
+    for entry in metadata:
+        if entry.path in paths and entry.path not in seen:
+            ordered.append(entry.path)
+            seen.add(entry.path)
+    for path in sorted(paths):
+        if path in seen:
+            continue
+        ordered.append(path)
+        seen.add(path)
+    return tuple(ordered)
+def _summarize_pack_validation(output_path: Path) -> None:
+    report = validate_package(output_path)
+    if report.ok:
+        return
+    summary = "\n".join(f"- {issue}" for issue in report.issues[:10])
+    raise ValueError(f"packed archive failed validation:\n{summary}")
+def unpack_hwpx(
+    source: str | Path,
+    output_dir: str | Path,
+    *,
+    overwrite: bool = False,
+    pretty_xml: bool = True,
+) -> UnpackResult:
+    source_path = Path(source)
+    if not source_path.is_file():
+        raise FileNotFoundError(f"input file not found: {source_path}")
+    destination = Path(output_dir)
+    _prepare_output_dir(destination, overwrite=overwrite)
+    with ZipFile(source_path, "r") as archive:
+        entries = _iter_file_entries(archive)
+        for entry in entries:
+            data = archive.read(entry.path)
+            if pretty_xml and entry.path.endswith(_XML_SUFFIXES):
+                data = _format_xml_bytes(data)
+            target = destination / entry.path
+            target.parent.mkdir(parents=True, exist_ok=True)
+            target.write_bytes(data)
+    metadata_path = _write_pack_metadata(destination, entries)
+    return UnpackResult(output_dir=destination, metadata_path=metadata_path, entries=entries)
+def pack_hwpx(
+    input_dir: str | Path,
+    output_path: str | Path,
+    *,
+    overwrite: bool = False,
+) -> PackResult:
+    root = Path(input_dir)
+    if not root.is_dir():
+        raise FileNotFoundError(f"input directory not found: {root}")
+    destination = Path(output_path)
+    _prepare_output_path(destination, overwrite=overwrite)
+    files = _discover_files(root)
+    if "mimetype" not in files:
+        raise FileNotFoundError(f"missing required 'mimetype' file in {root}")
+    metadata = _read_pack_metadata(root)
+    compress_types = {entry.path: entry.compress_type for entry in metadata}
+    ordered_paths = _resolve_write_order(files, metadata)
+    fd, tmp_name = tempfile.mkstemp(dir=str(destination.parent), suffix=".hwpx.tmp")
+    os.close(fd)
+    tmp_path = Path(tmp_name)
+    try:
+        with ZipFile(tmp_path, "w", ZIP_DEFLATED) as archive:
+            archive.write(root / "mimetype", "mimetype", compress_type=ZIP_STORED)
+            for rel_path in ordered_paths:
+                if rel_path == "mimetype":
+                    continue
+                compress_type = compress_types.get(rel_path, ZIP_DEFLATED)
+                if compress_type != ZIP_STORED:
+                    compress_type = ZIP_DEFLATED
+                archive.write(root / rel_path, rel_path, compress_type=compress_type)
+        _summarize_pack_validation(tmp_path)
+        os.replace(tmp_path, destination)
+    except BaseException:
+        try:
+            tmp_path.unlink(missing_ok=True)
+        except OSError:
+            pass
+        raise
+    return PackResult(output_path=destination, entries=ordered_paths)
+def unpack_main(argv: Sequence[str] | None = None) -> int:
+    parser = argparse.ArgumentParser(description="Unpack an HWPX file into a directory")
+    parser.add_argument("input", help="Input .hwpx path")
+    parser.add_argument("output", help="Output directory")
+    parser.add_argument(
+        "--force",
+        action="store_true",
+        help="Allow deleting an existing non-empty output directory",
+    )
+    parser.add_argument(
+        "--no-pretty-xml",
+        action="store_true",
+        help="Keep XML payloads in their original byte formatting",
+    )
+    args = parser.parse_args(argv)
+    try:
+        result = unpack_hwpx(
+            args.input,
+            args.output,
+            overwrite=args.force,
+            pretty_xml=not args.no_pretty_xml,
+        )
+    except Exception as exc:
+        print(f"ERROR: {exc}")
+        return 1
+    print(f"Unpacked {args.input} -> {result.output_dir}")
+    print(f"Recorded archive metadata at {result.metadata_path}")
+    return 0
+def pack_main(argv: Sequence[str] | None = None) -> int:
+    parser = argparse.ArgumentParser(description="Pack a directory into an HWPX archive")
+    parser.add_argument("input", help="Input directory")
+    parser.add_argument("output", help="Output .hwpx path")
+    parser.add_argument(
+        "--force",
+        action="store_true",
+        help="Allow replacing an existing output file",
+    )
+    args = parser.parse_args(argv)
+    try:
+        result = pack_hwpx(args.input, args.output, overwrite=args.force)
+    except Exception as exc:
+        print(f"ERROR: {exc}")
+        return 1
+    print(f"Packed {args.input} -> {result.output_path}")
+    return 0
+def main(argv: Sequence[str] | None = None) -> int:
+    parser = argparse.ArgumentParser(description="HWPX archive utility helpers")
+    subparsers = parser.add_subparsers(dest="command", required=True)
+    unpack_parser = subparsers.add_parser("unpack", help="Unpack an HWPX file")
+    unpack_parser.add_argument("input")
+    unpack_parser.add_argument("output")
+    unpack_parser.add_argument("--force", action="store_true")
+    unpack_parser.add_argument("--no-pretty-xml", action="store_true")
+    pack_parser = subparsers.add_parser("pack", help="Pack a directory into HWPX")
+    pack_parser.add_argument("input")
+    pack_parser.add_argument("output")
+    pack_parser.add_argument("--force", action="store_true")
+    args = parser.parse_args(argv)
+    if args.command == "unpack":
+        forward = [args.input, args.output]
+        if args.force:
+            forward.append("--force")
+        if args.no_pretty_xml:
+            forward.append("--no-pretty-xml")
+        return unpack_main(forward)
+    forward = [args.input, args.output]
+    if args.force:
+        forward.append("--force")
+    return pack_main(forward)
+if __name__ == "__main__":  # pragma: no cover - CLI convenience
+    raise SystemExit(main())
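
Two patterns in `pack_hwpx` above are worth isolating: the `mimetype` entry is written first and uncompressed, and the archive is built in a temporary file that is moved into place with `os.replace` only after it is complete. The sketch below reproduces just those two steps with the standard library. `pack_directory` is a hypothetical helper, not part of python-hwpx, and it omits the metadata-driven ordering and the `validate_package` check that the real function performs.

```python
from __future__ import annotations

import os
import tempfile
from pathlib import Path
from zipfile import ZIP_DEFLATED, ZIP_STORED, ZipFile


def pack_directory(root: Path, destination: Path) -> list[str]:
    """Zip ``root`` into ``destination`` with 'mimetype' first and stored."""
    files = sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )
    if "mimetype" not in files:
        raise FileNotFoundError(f"missing required 'mimetype' file in {root}")
    ordered = ["mimetype"] + [name for name in files if name != "mimetype"]

    # Create the temporary file next to the destination so that the final
    # os.replace() does not have to cross filesystems.
    destination.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(destination.parent), suffix=".tmp")
    os.close(fd)
    tmp_path = Path(tmp_name)
    try:
        with ZipFile(tmp_path, "w", ZIP_DEFLATED) as archive:
            for name in ordered:
                # 'mimetype' is stored without compression; everything else
                # is deflated.
                method = ZIP_STORED if name == "mimetype" else ZIP_DEFLATED
                archive.write(root / name, name, compress_type=method)
        os.replace(tmp_path, destination)
    except BaseException:
        # Never leave a half-written archive behind.
        tmp_path.unlink(missing_ok=True)
        raise
    return ordered
```

Because the destination path is only touched by `os.replace`, a failure while zipping leaves any existing output file unchanged.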
