PyPI - hwp2md - Versions diffs - 1.0.0__tar.gz - Mend

hwp2md 1.0.0__tar.gz

Files changed (22)

hwp2md-1.0.0/.gitignore +55 -0
hwp2md-1.0.0/.python-version +1 -0
hwp2md-1.0.0/LICENSE.md +7 -0
hwp2md-1.0.0/PKG-INFO +98 -0
hwp2md-1.0.0/README.md +74 -0
hwp2md-1.0.0/pyproject.toml +38 -0
hwp2md-1.0.0/src/hwp2md/__init__.py +7 -0
hwp2md-1.0.0/src/hwp2md/cli.py +101 -0
hwp2md-1.0.0/src/hwp2md/converter.py +97 -0
hwp2md-1.0.0/src/hwp2md/hwpx_parser.py +381 -0
hwp2md-1.0.0/src/hwp2md/paragraph.py +278 -0
hwp2md-1.0.0/src/hwp2md/parser.py +191 -0
hwp2md-1.0.0/src/hwp2md/record.py +157 -0
hwp2md-1.0.0/src/hwp2md/table.py +302 -0
hwp2md-1.0.0/tests/__init__.py +0 -0
hwp2md-1.0.0/tests/conftest.py +153 -0
hwp2md-1.0.0/tests/test_hwpx_cli.py +75 -0
hwp2md-1.0.0/tests/test_hwpx_converter.py +62 -0
hwp2md-1.0.0/tests/test_hwpx_cross_format.py +68 -0
hwp2md-1.0.0/tests/test_hwpx_edge_cases.py +148 -0
hwp2md-1.0.0/tests/test_hwpx_parser.py +468 -0
hwp2md-1.0.0/uv.lock +48 -0

hwp2md-1.0.0/.gitignore ADDED

@@ -0,0 +1,55 @@
+# Dependencies
+node_modules/
+**/node_modules/
+# Build output
+dist/
+**/dist/
+# Test coverage
+coverage/
+**/coverage/
+# Logs
+*.log
+npm-debug.log*
+pnpm-debug.log*
+yarn-debug.log*
+yarn-error.log*
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+# OS
+.DS_Store
+Thumbs.db
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+.pytest_cache/
+.mypy_cache/
+*.egg-info/
+.eggs/
+build/
+*.so
+.venv/
+.coverage
+htmlcov/
+# Output artifacts
+output.md
+# Temporary files
+*.tmp
+.cache/
+debug.log
+# Claude Code and Serena
+.claude/
+.serena/

hwp2md-1.0.0/.python-version ADDED

@@ -0,0 +1 @@
+3.14.0t

hwp2md-1.0.0/LICENSE.md ADDED

@@ -0,0 +1,7 @@
+Copyright 2025 Jaechan Kim<kjc0210@gmail.com>
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

hwp2md-1.0.0/PKG-INFO ADDED

@@ -0,0 +1,98 @@
+Metadata-Version: 2.4
+Name: hwp2md
+Version: 1.0.0
+Summary: HWP to Markdown converter
+Project-URL: Homepage, https://github.com/jc-kim/hwp2md
+Project-URL: Repository, https://github.com/jc-kim/hwp2md
+Project-URL: Issues, https://github.com/jc-kim/hwp2md/issues
+Author-email: Kim JaeChan <kjc0210@gmail.com>
+License-Expression: MIT
+License-File: LICENSE.md
+Keywords: converter,document,hwp,korean,markdown,parser
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3.14
+Classifier: Topic :: Text Processing :: Markup :: Markdown
+Requires-Python: >=3.12
+Requires-Dist: click>=8.3.0
+Requires-Dist: olefile>=0.47
+Description-Content-Type: text/markdown
+# hwp2md
+HWP to Markdown converter for Python.
+Converts Hangul (HWP) 5.0 files into clean Markdown optimized for LLMs.
+## Features
+- ✅ Extract text from HWP 5.0 files
+- ✅ Convert tables to Markdown format
+- ✅ Preserve document structure
+- ✅ Handle merged cells (content is repeated for LLM optimization)
+- ✅ Fast conversion with the CLI tool
+## Installation
+```bash
+# Using pip
+pip install hwp2md
+# Using uv
+uv add hwp2md
+```
+## Usage
+### CLI
+```bash
+# Convert to Markdown (prints to stdout)
+hwp2md document.hwp
+# Save to a file
+hwp2md document.hwp -o output.md
+# Run without installing, via uvx
+uvx hwp2md document.hwp
+```
+### Python API
+```python
+from hwp2md import convert
+# Convert from a file path
+markdown = convert("document.hwp")
+print(markdown)
+```
+## Limitations
+- **HWP 5.0 only** - Earlier HWP formats (HWP 3.0, HWP 97, HWP 2002, etc.) are not supported
+  - Legacy HWP files can be converted to HWP 5.0 in Hancom Office
+  - An error is raised when a legacy format is detected
+- **Text & tables** - Currently only text and tables are extracted; images and complex objects are skipped
+- **Korean-focused** - Optimized for Korean documents (UTF-16LE encoding)
+- **Basic formatting only** - Fonts, colors, and advanced styles are not preserved
+## Development
+```bash
+# Install dependencies
+uv sync
+# Run the CLI
+uv run hwp2md document.hwp
+# Run tests
+uv run pytest
+```
+## License
+MIT

hwp2md-1.0.0/README.md ADDED

@@ -0,0 +1,74 @@
+# hwp2md
+HWP to Markdown converter for Python.
+Converts Hangul (HWP) 5.0 files into clean Markdown optimized for LLMs.
+## Features
+- ✅ Extract text from HWP 5.0 files
+- ✅ Convert tables to Markdown format
+- ✅ Preserve document structure
+- ✅ Handle merged cells (content is repeated for LLM optimization)
+- ✅ Fast conversion with the CLI tool
+## Installation
+```bash
+# Using pip
+pip install hwp2md
+# Using uv
+uv add hwp2md
+```
+## Usage
+### CLI
+```bash
+# Convert to Markdown (prints to stdout)
+hwp2md document.hwp
+# Save to a file
+hwp2md document.hwp -o output.md
+# Run without installing, via uvx
+uvx hwp2md document.hwp
+```
+### Python API
+```python
+from hwp2md import convert
+# Convert from a file path
+markdown = convert("document.hwp")
+print(markdown)
+```
+## Limitations
+- **HWP 5.0 only** - Earlier HWP formats (HWP 3.0, HWP 97, HWP 2002, etc.) are not supported
+  - Legacy HWP files can be converted to HWP 5.0 in Hancom Office
+  - An error is raised when a legacy format is detected
+- **Text & tables** - Currently only text and tables are extracted; images and complex objects are skipped
+- **Korean-focused** - Optimized for Korean documents (UTF-16LE encoding)
+- **Basic formatting only** - Fonts, colors, and advanced styles are not preserved
+## Development
+```bash
+# Install dependencies
+uv sync
+# Run the CLI
+uv run hwp2md document.hwp
+# Run tests
+uv run pytest
+```
+## License
+MIT
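
Reviewer note: the README lists merged-cell handling in which content is repeated for LLM consumption. The package's `table.py` is not shown in this section, so the following is only a minimal sketch of that idea and not the package's implementation. The `Cell` dataclass and the functions `expand_merged_cells` and `grid_to_markdown` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical cell model: position plus how many rows/columns it spans."""
    row: int
    col: int
    text: str
    row_span: int = 1
    col_span: int = 1


def expand_merged_cells(cells: list[Cell], n_rows: int, n_cols: int) -> list[list[str]]:
    """Fill a dense grid, copying a merged cell's text into every slot it covers."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        for r in range(cell.row, min(cell.row + cell.row_span, n_rows)):
            for c in range(cell.col, min(cell.col + cell.col_span, n_cols)):
                grid[r][c] = cell.text
    return grid


def grid_to_markdown(grid: list[list[str]]) -> str:
    """Render the grid as a Markdown table, treating the first row as the header."""
    header, *body = grid
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


cells = [
    Cell(0, 0, "Region"), Cell(0, 1, "City"),
    Cell(1, 0, "Gyeonggi", row_span=2),  # merged vertically over rows 1 and 2
    Cell(1, 1, "Suwon"),
    Cell(2, 1, "Seongnam"),
]
table = grid_to_markdown(expand_merged_cells(cells, n_rows=3, n_cols=2))
print(table)
```

Repeating the merged value in every covered row keeps each Markdown row self-describing, which is presumably why the README calls this LLM-optimized; Markdown tables have no native row or column spans.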

hwp2md-1.0.0/pyproject.toml ADDED

@@ -0,0 +1,38 @@
+[project]
+name = "hwp2md"
+version = "1.0.0"
+description = "HWP to Markdown converter"
+readme = "README.md"
+license = "MIT"
+license-files = ["LICENSE.md"]
+requires-python = ">=3.12"
+authors = [
+    { name = "Kim JaeChan", email = "kjc0210@gmail.com" },
+]
+keywords = ["hwp", "markdown", "converter", "korean", "document", "parser"]
+classifiers = [
+    "Development Status :: 3 - Alpha",
+    "Intended Audience :: Developers",
+    "License :: OSI Approved :: MIT License",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.13",
+    "Programming Language :: Python :: 3.14",
+    "Topic :: Text Processing :: Markup :: Markdown",
+]
+dependencies = [
+    "click>=8.3.0",
+    "olefile>=0.47",
+]
+[project.urls]
+Homepage = "https://github.com/jc-kim/hwp2md"
+Repository = "https://github.com/jc-kim/hwp2md"
+Issues = "https://github.com/jc-kim/hwp2md/issues"
+[project.scripts]
+hwp2md = "hwp2md.cli:main"
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"

hwp2md-1.0.0/src/hwp2md/__init__.py ADDED

@@ -0,0 +1,7 @@
+"""HWP/HWPX to Markdown Converter"""
+from hwp2md.parser import HWPFile
+from hwp2md.hwpx_parser import HWPXFile
+__version__ = "0.1.0"
+__all__ = ["HWPFile", "HWPXFile"]

hwp2md-1.0.0/src/hwp2md/cli.py ADDED

@@ -0,0 +1,101 @@
+"""CLI for HWP/HWPX to Markdown converter"""
+import sys
+from pathlib import Path
+import click
+from hwp2md.parser import HWPFile
+@click.group()
+@click.version_option()
+def cli():
+    """HWP/HWPX to Markdown converter"""
+    pass
+def _is_hwpx(filepath: Path) -> bool:
+    """Check if file is HWPX format based on extension"""
+    return filepath.suffix.lower() == '.hwpx'
+@cli.command()
+@click.argument('input_file', type=click.Path(exists=True, path_type=Path))
+def info(input_file: Path):
+    """Display HWP/HWPX file information"""
+    try:
+        if _is_hwpx(input_file):
+            from hwp2md.hwpx_parser import HWPXFile
+            with HWPXFile(input_file) as hwpx:
+                info = hwpx.get_file_info()
+                click.echo(f"File: {info['filepath']}")
+                click.echo(f"Format: {info['format']}")
+                click.echo(f"Version: {info['version']}")
+                click.echo(f"Sections: {info['section_count']}")
+                click.echo(f"\nContents:")
+                for entry in info['contents']:
+                    click.echo(f"  {entry}")
+        else:
+            with HWPFile(input_file) as hwp:
+                info = hwp.get_file_info()
+                click.echo(f"File: {info['filepath']}")
+                click.echo(f"Signature: {info['signature']}")
+                click.echo(f"Version: {info['version']}")
+                click.echo(f"Compressed: {info['compressed']}")
+                click.echo(f"Encrypted: {info['encrypted']}")
+                click.echo(f"\nStreams:")
+                for stream in info['streams']:
+                    stream_path = '/'.join(stream)
+                    click.echo(f"  {stream_path}")
+    except Exception as e:
+        click.echo(f"Error: {e}", err=True)
+        sys.exit(1)
+@cli.command()
+@click.argument('input_file', type=click.Path(exists=True, path_type=Path))
+@click.argument('output_file', type=click.Path(path_type=Path), required=False)
+@click.option('--table-line-breaks', type=click.Choice(['space', 'br']), default='space',
+              help='How to handle line breaks in table cells (default: space for LLM optimization)')
+def convert(input_file: Path, output_file: Path | None, table_line_breaks: str):
+    """Convert HWP/HWPX file to Markdown"""
+    if output_file is None:
+        output_file = input_file.with_suffix('.md')
+    try:
+        click.echo(f"Converting {input_file} to {output_file}...")
+        if _is_hwpx(input_file):
+            from hwp2md.converter import convert_hwpx_to_markdown
+            from hwp2md.hwpx_parser import HWPXFile
+            with HWPXFile(input_file) as hwpx:
+                markdown = convert_hwpx_to_markdown(hwpx, table_line_break_style=table_line_breaks)
+        else:
+            from hwp2md.converter import convert_hwp_to_markdown
+            with HWPFile(input_file) as hwp:
+                markdown = convert_hwp_to_markdown(hwp, table_line_break_style=table_line_breaks)
+        # Write to file
+        output_file.write_text(markdown, encoding='utf-8')
+        click.echo(f"✓ Conversion completed: {output_file}")
+    except Exception as e:
+        click.echo(f"Error: {e}", err=True)
+        sys.exit(1)
+def main():
+    """Main entry point"""
+    cli()
+if __name__ == "__main__":
+    main()
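
Reviewer note: as written above, `cli` is a click group with `info` and `convert` subcommands, and `convert` accepts `--table-line-breaks` with the choices `space` and `br`. The help text and the converter docstrings describe `space` as joining with a space and `br` as using `<br>` tags. The code that applies the option lives in files not shown here, so this is a sketch of the described behaviour under that assumption; `join_cell_lines` is a hypothetical helper, not a function from the package.

```python
def join_cell_lines(lines: list[str], style: str = "space") -> str:
    """Collapse a multi-line table cell into a single line.

    'space' joins fragments with a space; 'br' joins them with <br> tags.
    """
    if style not in ("space", "br"):
        raise ValueError(f"unknown style: {style!r}")
    separator = " " if style == "space" else "<br>"
    # Drop empty fragments so blank lines do not produce doubled separators.
    return separator.join(part.strip() for part in lines if part.strip())


cell = ["Total", "(KRW)"]
print(join_cell_lines(cell, "space"))
print(join_cell_lines(cell, "br"))
```

A Markdown table row must stay on one physical line, so some replacement for in-cell newlines is needed either way; the two styles trade visual fidelity (`br`) against plain text that is easier for a model to read (`space`).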

hwp2md-1.0.0/src/hwp2md/converter.py ADDED

@@ -0,0 +1,97 @@
+"""HWP/HWPX to Markdown Converter"""
+from pathlib import Path
+from hwp2md.paragraph import Paragraph, ParagraphParser
+from hwp2md.parser import HWPFile
+from hwp2md.record import RecordReader
+def paragraphs_to_markdown(paragraphs: list[Paragraph]) -> str:
+    """
+    Convert paragraphs to Markdown
+    Args:
+        paragraphs: List of paragraphs
+    Returns:
+        str: Markdown text
+    """
+    lines = []
+    for para in paragraphs:
+        text = para.text.strip()
+        if not text:
+            continue
+        # Skip paragraphs that are likely garbled control characters
+        # These appear as short mysterious character sequences before tables
+        if len(text) < 5 and not text.startswith('|') and all(ord(c) > 127 for c in text):
+            continue
+        lines.append(text)
+    return '\n\n'.join(lines)
+def convert_hwp_to_markdown(hwp: HWPFile, table_line_break_style: str = 'space') -> str:
+    """
+    Convert HWP file to Markdown
+    Args:
+        hwp: Opened HWP file
+        table_line_break_style: How to handle line breaks in table cells
+            - 'space': Join with space (LLM-optimized, default)
+            - 'br': Use <br> tags (human-readable)
+    Returns:
+        str: Markdown content
+    """
+    all_paragraphs = []
+    # Process each section
+    section_count = hwp.get_section_count()
+    for i in range(section_count):
+        # Read section data
+        section_data = hwp.read_section(i)
+        if not section_data:
+            continue
+        # Parse records
+        reader = RecordReader(section_data)
+        parser = ParagraphParser(reader, table_line_break_style)
+        # Parse paragraphs
+        paragraphs = parser.parse_all_paragraphs()
+        all_paragraphs.extend(paragraphs)
+    # Convert to Markdown
+    markdown = paragraphs_to_markdown(all_paragraphs)
+    return markdown
+def convert_hwpx_to_markdown(hwpx, table_line_break_style: str = 'space') -> str:
+    """
+    Convert HWPX file to Markdown
+    Args:
+        hwpx: Opened HWPXFile instance
+        table_line_break_style: How to handle line breaks in table cells
+            - 'space': Join with space (LLM-optimized, default)
+            - 'br': Use <br> tags (human-readable)
+    Returns:
+        str: Markdown content
+    """
+    from hwp2md.hwpx_parser import parse_hwpx_section
+    all_paragraphs = []
+    for i in range(hwpx.get_section_count()):
+        section_root = hwpx.get_section_xml(i)
+        paragraphs = parse_hwpx_section(section_root, table_line_break_style)
+        all_paragraphs.extend(paragraphs)
+    return paragraphs_to_markdown(all_paragraphs)
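
Reviewer note: `paragraphs_to_markdown` skips any paragraph shorter than five characters that does not start with `|` and consists entirely of code points above 127. The sketch below copies that condition into a standalone function (`is_probably_garbled` is a name chosen here, not one from the package) to make it easier to see which inputs it would drop.

```python
def is_probably_garbled(text: str) -> bool:
    """Same condition as the skip check in paragraphs_to_markdown above."""
    return (
        len(text) < 5
        and not text.startswith("|")
        and all(ord(c) > 127 for c in text)
    )


# Stripped, non-empty sample paragraphs; the original filters empty text earlier.
samples = ["\u4e00\u6f20", "결론", "1장", "Hello", "| a |"]
for sample in samples:
    print(repr(sample), is_probably_garbled(sample))
```

Under this condition a short paragraph made up only of Hangul, such as a two-syllable heading, is treated the same as stray control-character residue, while anything containing an ASCII digit, letter, or space is kept. Whether that trade-off matters likely depends on the documents being converted.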