hwp2md 1.0.0__tar.gz

@@ -0,0 +1,55 @@
+ # Dependencies
+ node_modules/
+ **/node_modules/
+
+ # Build output
+ dist/
+ **/dist/
+
+ # Test coverage
+ coverage/
+ **/coverage/
+
+ # Logs
+ *.log
+ npm-debug.log*
+ pnpm-debug.log*
+ yarn-debug.log*
+ yarn-error.log*
+
+ # IDE
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+ *~
+
+ # OS
+ .DS_Store
+ Thumbs.db
+
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ .pytest_cache/
+ .mypy_cache/
+ *.egg-info/
+ .eggs/
+ build/
+ *.so
+ .venv/
+ .coverage
+ htmlcov/
+
+ # Output artifacts
+ output.md
+
+ # Temporary files
+ *.tmp
+ .cache/
+ debug.log
+
+ # Claude Code and Serena
+ .claude/
+ .serena/
@@ -0,0 +1 @@
+ 3.14.0t
@@ -0,0 +1,7 @@
+ Copyright 2025 Jaechan Kim<kjc0210@gmail.com>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
hwp2md-1.0.0/PKG-INFO ADDED
@@ -0,0 +1,98 @@
+ Metadata-Version: 2.4
+ Name: hwp2md
+ Version: 1.0.0
+ Summary: HWP to Markdown converter
+ Project-URL: Homepage, https://github.com/jc-kim/hwp2md
+ Project-URL: Repository, https://github.com/jc-kim/hwp2md
+ Project-URL: Issues, https://github.com/jc-kim/hwp2md/issues
+ Author-email: Kim JaeChan <kjc0210@gmail.com>
+ License-Expression: MIT
+ License-File: LICENSE.md
+ Keywords: converter,document,hwp,korean,markdown,parser
+ Classifier: Development Status :: 3 - Alpha
+ Classifier: Intended Audience :: Developers
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Programming Language :: Python :: 3.13
+ Classifier: Programming Language :: Python :: 3.14
+ Classifier: Topic :: Text Processing :: Markup :: Markdown
+ Requires-Python: >=3.12
+ Requires-Dist: click>=8.3.0
+ Requires-Dist: olefile>=0.47
+ Description-Content-Type: text/markdown
+
+ # hwp2md
+
+ An HWP to Markdown converter for Python.
+
+ Converts Hangul (HWP) 5.0 files into clean Markdown optimized for LLMs.
+
+ ## Key Features
+
+ - ✅ Extract text from HWP 5.0 files
+ - ✅ Convert tables to Markdown format
+ - ✅ Preserve document structure
+ - ✅ Handle merged cells (content is repeated for LLM optimization)
+ - ✅ Fast conversion with the CLI tool
+
+ ## Installation
+
+ ```bash
+ # Using pip
+ pip install hwp2md
+
+ # Using uv
+ uv add hwp2md
+ ```
+
+ ## Usage
+
+ ### CLI
+
+ ```bash
+ # Convert to Markdown (printed to stdout)
+ hwp2md document.hwp
+
+ # Save to a file
+ hwp2md document.hwp -o output.md
+
+ # Run with uvx without installing
+ uvx hwp2md document.hwp
+ ```
+
+ ### Python API
+
+ ```python
+ from hwp2md import convert
+
+ # Convert from a file path
+ markdown = convert("document.hwp")
+ print(markdown)
+ ```
+
+ ## Limitations
+
+ - **HWP 5.0 only** - Earlier HWP formats (HWP 3.0, HWP 97, HWP 2002, etc.) are not supported
+   - Legacy HWP files can be converted to HWP 5.0 in Hancom Office
+   - An error is raised when a legacy format is detected
+ - **Text & tables** - Currently only text and tables are extracted; images and complex objects are skipped
+ - **Korean-focused** - Optimized for Korean documents (UTF-16LE encoding)
+ - **Basic formatting only** - Fonts, colors, and advanced styles are not preserved
+
+ ## Development
+
+ ```bash
+ # Install dependencies
+ uv sync
+
+ # Run the CLI
+ uv run hwp2md document.hwp
+
+ # Run tests
+ uv run pytest
+ ```
+
+ ## License
+
+ MIT
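
The feature list mentions that merged cells are handled by repeating their content. The table code is not part of this diff, so the following is only a minimal sketch of that idea. The `Cell` record and both helper functions are hypothetical names for illustration, not hwp2md types.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical merged-cell record; not a hwp2md type."""
    row: int
    col: int
    text: str
    rowspan: int = 1
    colspan: int = 1


def expand_merged_cells(cells: list[Cell], n_rows: int, n_cols: int) -> list[list[str]]:
    """Copy each cell's text into every grid position it spans."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        for r in range(cell.row, cell.row + cell.rowspan):
            for c in range(cell.col, cell.col + cell.colspan):
                grid[r][c] = cell.text
    return grid


def grid_to_markdown(grid: list[list[str]]) -> str:
    """Render a rectangular grid as a Markdown table, using the first row as header."""
    header, *body = grid
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


cells = [
    Cell(0, 0, "Region"),
    Cell(0, 1, "Sales"),
    Cell(1, 0, "Seoul", rowspan=2),  # one cell spanning two rows
    Cell(1, 1, "10"),
    Cell(2, 1, "20"),
]
table = grid_to_markdown(expand_merged_cells(cells, n_rows=3, n_cols=2))
print(table)
```

Repeating "Seoul" on both rows keeps every row self-describing, which is presumably why the README frames it as an LLM-oriented choice; a human-facing renderer might prefer to leave the second cell empty.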
hwp2md-1.0.0/README.md ADDED
@@ -0,0 +1,74 @@
+ # hwp2md
+
+ An HWP to Markdown converter for Python.
+
+ Converts Hangul (HWP) 5.0 files into clean Markdown optimized for LLMs.
+
+ ## Key Features
+
+ - ✅ Extract text from HWP 5.0 files
+ - ✅ Convert tables to Markdown format
+ - ✅ Preserve document structure
+ - ✅ Handle merged cells (content is repeated for LLM optimization)
+ - ✅ Fast conversion with the CLI tool
+
+ ## Installation
+
+ ```bash
+ # Using pip
+ pip install hwp2md
+
+ # Using uv
+ uv add hwp2md
+ ```
+
+ ## Usage
+
+ ### CLI
+
+ ```bash
+ # Convert to Markdown (printed to stdout)
+ hwp2md document.hwp
+
+ # Save to a file
+ hwp2md document.hwp -o output.md
+
+ # Run with uvx without installing
+ uvx hwp2md document.hwp
+ ```
+
+ ### Python API
+
+ ```python
+ from hwp2md import convert
+
+ # Convert from a file path
+ markdown = convert("document.hwp")
+ print(markdown)
+ ```
+
+ ## Limitations
+
+ - **HWP 5.0 only** - Earlier HWP formats (HWP 3.0, HWP 97, HWP 2002, etc.) are not supported
+   - Legacy HWP files can be converted to HWP 5.0 in Hancom Office
+   - An error is raised when a legacy format is detected
+ - **Text & tables** - Currently only text and tables are extracted; images and complex objects are skipped
+ - **Korean-focused** - Optimized for Korean documents (UTF-16LE encoding)
+ - **Basic formatting only** - Fonts, colors, and advanced styles are not preserved
+
+ ## Development
+
+ ```bash
+ # Install dependencies
+ uv sync
+
+ # Run the CLI
+ uv run hwp2md document.hwp
+
+ # Run tests
+ uv run pytest
+ ```
+
+ ## License
+
+ MIT
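
The `__init__.py` in this diff exports `HWPFile` and `HWPXFile` rather than a top-level `convert`, so the README's `from hwp2md import convert` snippet may not match this release. The sketch below uses only names that appear in `cli.py` and `converter.py` in this diff. `converter_name_for` and `to_markdown` are hypothetical helpers, and the hwp2md imports are deferred so the module loads even where the package is not installed.

```python
from pathlib import Path


def converter_name_for(path: Path) -> str:
    """Pick the converter function name by extension, mirroring cli.py's _is_hwpx check."""
    if path.suffix.lower() == ".hwpx":
        return "convert_hwpx_to_markdown"
    return "convert_hwp_to_markdown"


def to_markdown(path: Path, table_line_break_style: str = "space") -> str:
    """Hypothetical wrapper around the converter functions shown in this diff."""
    # Deferred imports: hwp2md is only needed when a conversion actually runs.
    from hwp2md import converter

    convert = getattr(converter, converter_name_for(path))
    if path.suffix.lower() == ".hwpx":
        from hwp2md.hwpx_parser import HWPXFile as opener
    else:
        from hwp2md.parser import HWPFile as opener

    # Both file classes are used as context managers in cli.py.
    with opener(path) as doc:
        return convert(doc, table_line_break_style=table_line_break_style)
```

With hwp2md installed, `print(to_markdown(Path("document.hwp")))` would be the rough equivalent of the README example.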
@@ -0,0 +1,38 @@
+ [project]
+ name = "hwp2md"
+ version = "1.0.0"
+ description = "HWP to Markdown converter"
+ readme = "README.md"
+ license = "MIT"
+ license-files = ["LICENSE.md"]
+ requires-python = ">=3.12"
+ authors = [
+     { name = "Kim JaeChan", email = "kjc0210@gmail.com" },
+ ]
+ keywords = ["hwp", "markdown", "converter", "korean", "document", "parser"]
+ classifiers = [
+     "Development Status :: 3 - Alpha",
+     "Intended Audience :: Developers",
+     "License :: OSI Approved :: MIT License",
+     "Programming Language :: Python :: 3",
+     "Programming Language :: Python :: 3.12",
+     "Programming Language :: Python :: 3.13",
+     "Programming Language :: Python :: 3.14",
+     "Topic :: Text Processing :: Markup :: Markdown",
+ ]
+ dependencies = [
+     "click>=8.3.0",
+     "olefile>=0.47",
+ ]
+
+ [project.urls]
+ Homepage = "https://github.com/jc-kim/hwp2md"
+ Repository = "https://github.com/jc-kim/hwp2md"
+ Issues = "https://github.com/jc-kim/hwp2md/issues"
+
+ [project.scripts]
+ hwp2md = "hwp2md.cli:main"
+
+ [build-system]
+ requires = ["hatchling"]
+ build-backend = "hatchling.build"
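
The `[project.scripts]` entry `hwp2md = "hwp2md.cli:main"` names a module and an attribute separated by a colon. As a rough illustration of how such a string maps to a callable, the sketch below resolves one by hand. `resolve_entry_point` is a hypothetical helper, and a standard-library target stands in for `hwp2md.cli:main` so the example runs without the package installed.

```python
import importlib


def resolve_entry_point(spec: str):
    """Resolve a 'package.module:attribute' string to the object it names."""
    module_name, _, attr_path = spec.partition(":")
    obj = importlib.import_module(module_name)
    if attr_path:
        # Attributes may be dotted, e.g. "pkg.mod:Class.method".
        for name in attr_path.split("."):
            obj = getattr(obj, name)
    return obj


# "hwp2md.cli:main" needs hwp2md installed, so a stdlib target stands in here.
dumps = resolve_entry_point("json:dumps")
print(dumps({"ok": True}))
```

Installers generate a small launcher script that performs an equivalent lookup and then calls the resolved function, which here would be `main()` in `cli.py`.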
@@ -0,0 +1,7 @@
+ """HWP/HWPX to Markdown Converter"""
+
+ from hwp2md.parser import HWPFile
+ from hwp2md.hwpx_parser import HWPXFile
+
+ __version__ = "0.1.0"
+ __all__ = ["HWPFile", "HWPXFile"]
@@ -0,0 +1,101 @@
+ """CLI for HWP/HWPX to Markdown converter"""
+
+ import sys
+ from pathlib import Path
+
+ import click
+
+ from hwp2md.parser import HWPFile
+
+
+ @click.group()
+ @click.version_option()
+ def cli():
+     """HWP/HWPX to Markdown converter"""
+     pass
+
+
+ def _is_hwpx(filepath: Path) -> bool:
+     """Check if file is HWPX format based on extension"""
+     return filepath.suffix.lower() == '.hwpx'
+
+
+ @cli.command()
+ @click.argument('input_file', type=click.Path(exists=True, path_type=Path))
+ def info(input_file: Path):
+     """Display HWP/HWPX file information"""
+     try:
+         if _is_hwpx(input_file):
+             from hwp2md.hwpx_parser import HWPXFile
+
+             with HWPXFile(input_file) as hwpx:
+                 info = hwpx.get_file_info()
+
+             click.echo(f"File: {info['filepath']}")
+             click.echo(f"Format: {info['format']}")
+             click.echo(f"Version: {info['version']}")
+             click.echo(f"Sections: {info['section_count']}")
+             click.echo(f"\nContents:")
+             for entry in info['contents']:
+                 click.echo(f" {entry}")
+         else:
+             with HWPFile(input_file) as hwp:
+                 info = hwp.get_file_info()
+
+             click.echo(f"File: {info['filepath']}")
+             click.echo(f"Signature: {info['signature']}")
+             click.echo(f"Version: {info['version']}")
+             click.echo(f"Compressed: {info['compressed']}")
+             click.echo(f"Encrypted: {info['encrypted']}")
+             click.echo(f"\nStreams:")
+             for stream in info['streams']:
+                 stream_path = '/'.join(stream)
+                 click.echo(f" {stream_path}")
+
+     except Exception as e:
+         click.echo(f"Error: {e}", err=True)
+         sys.exit(1)
+
+
+ @cli.command()
+ @click.argument('input_file', type=click.Path(exists=True, path_type=Path))
+ @click.argument('output_file', type=click.Path(path_type=Path), required=False)
+ @click.option('--table-line-breaks', type=click.Choice(['space', 'br']), default='space',
+               help='How to handle line breaks in table cells (default: space for LLM optimization)')
+ def convert(input_file: Path, output_file: Path | None, table_line_breaks: str):
+     """Convert HWP/HWPX file to Markdown"""
+     if output_file is None:
+         output_file = input_file.with_suffix('.md')
+
+     try:
+         click.echo(f"Converting {input_file} to {output_file}...")
+
+         if _is_hwpx(input_file):
+             from hwp2md.converter import convert_hwpx_to_markdown
+             from hwp2md.hwpx_parser import HWPXFile
+
+             with HWPXFile(input_file) as hwpx:
+                 markdown = convert_hwpx_to_markdown(hwpx, table_line_break_style=table_line_breaks)
+         else:
+             from hwp2md.converter import convert_hwp_to_markdown
+
+             with HWPFile(input_file) as hwp:
+                 markdown = convert_hwp_to_markdown(hwp, table_line_break_style=table_line_breaks)
+
+         # Write to file
+         output_file.write_text(markdown, encoding='utf-8')
+
+         click.echo(f"✓ Conversion completed: {output_file}")
+
+     except Exception as e:
+         click.echo(f"Error: {e}", err=True)
+         sys.exit(1)
+
+
+ def main():
+     """Main entry point"""
+     cli()
+
+
+ if __name__ == "__main__":
+     main()
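
Because `cli` is a `click.group()` with `info` and `convert` commands, the invocation implied by this file is `hwp2md convert INPUT_FILE [OUTPUT_FILE]`, with the Markdown written to a file rather than to stdout. When `OUTPUT_FILE` is omitted, the path is derived with `Path.with_suffix`. A small standard-library sketch of that default follows; `default_output_path` is a hypothetical name for the inline expression in `convert`.

```python
from pathlib import Path


def default_output_path(input_file: Path) -> Path:
    """Mirror the CLI default: swap the final suffix for .md, keeping the directory."""
    return input_file.with_suffix(".md")


# Only the last suffix is replaced, so dotted stems survive.
print(default_output_path(Path("docs/report.v2.hwp")))
# A name with no suffix simply gains one.
print(default_output_path(Path("notes")))
```

One consequence worth keeping in mind is that `report.hwp` and `report.hwpx` in the same directory map to the same `report.md`, so converting both without an explicit output path overwrites the first result.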
@@ -0,0 +1,97 @@
+ """HWP/HWPX to Markdown Converter"""
+
+ from pathlib import Path
+
+ from hwp2md.paragraph import Paragraph, ParagraphParser
+ from hwp2md.parser import HWPFile
+ from hwp2md.record import RecordReader
+
+
+ def paragraphs_to_markdown(paragraphs: list[Paragraph]) -> str:
+     """
+     Convert paragraphs to Markdown
+
+     Args:
+         paragraphs: List of paragraphs
+
+     Returns:
+         str: Markdown text
+     """
+     lines = []
+
+     for para in paragraphs:
+         text = para.text.strip()
+         if not text:
+             continue
+
+         # Skip paragraphs that are likely garbled control characters
+         # These appear as short mysterious character sequences before tables
+         if len(text) < 5 and not text.startswith('|') and all(ord(c) > 127 for c in text):
+             continue
+
+         lines.append(text)
+
+     return '\n\n'.join(lines)
+
+
+ def convert_hwp_to_markdown(hwp: HWPFile, table_line_break_style: str = 'space') -> str:
+     """
+     Convert HWP file to Markdown
+
+     Args:
+         hwp: Opened HWP file
+         table_line_break_style: How to handle line breaks in table cells
+             - 'space': Join with space (LLM-optimized, default)
+             - 'br': Use <br> tags (human-readable)
+
+     Returns:
+         str: Markdown content
+     """
+     all_paragraphs = []
+
+     # Process each section
+     section_count = hwp.get_section_count()
+
+     for i in range(section_count):
+         # Read section data
+         section_data = hwp.read_section(i)
+         if not section_data:
+             continue
+
+         # Parse records
+         reader = RecordReader(section_data)
+         parser = ParagraphParser(reader, table_line_break_style)
+
+         # Parse paragraphs
+         paragraphs = parser.parse_all_paragraphs()
+         all_paragraphs.extend(paragraphs)
+
+     # Convert to Markdown
+     markdown = paragraphs_to_markdown(all_paragraphs)
+
+     return markdown
+
+
+ def convert_hwpx_to_markdown(hwpx, table_line_break_style: str = 'space') -> str:
+     """
+     Convert HWPX file to Markdown
+
+     Args:
+         hwpx: Opened HWPXFile instance
+         table_line_break_style: How to handle line breaks in table cells
+             - 'space': Join with space (LLM-optimized, default)
+             - 'br': Use <br> tags (human-readable)
+
+     Returns:
+         str: Markdown content
+     """
+     from hwp2md.hwpx_parser import parse_hwpx_section
+
+     all_paragraphs = []
+
+     for i in range(hwpx.get_section_count()):
+         section_root = hwpx.get_section_xml(i)
+         paragraphs = parse_hwpx_section(section_root, table_line_break_style)
+         all_paragraphs.extend(paragraphs)
+
+     return paragraphs_to_markdown(all_paragraphs)
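
The garbled-text heuristic in `paragraphs_to_markdown` can be exercised in isolation. The sketch below copies the same condition into a standalone script. The `Paragraph` dataclass is a stand-in that assumes only a `.text` attribute, and `is_probably_garbled` is a hypothetical name for the inline check.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    """Stand-in for hwp2md.paragraph.Paragraph; only `.text` is assumed."""
    text: str


def is_probably_garbled(text: str) -> bool:
    """Same test as the diff: short, not a table row, and entirely non-ASCII."""
    return len(text) < 5 and not text.startswith("|") and all(ord(c) > 127 for c in text)


def paragraphs_to_markdown(paragraphs: list[Paragraph]) -> str:
    """Join non-empty, non-garbled paragraphs with blank lines."""
    lines = []
    for para in paragraphs:
        text = para.text.strip()
        if not text or is_probably_garbled(text):
            continue
        lines.append(text)
    return "\n\n".join(lines)


sample = [
    Paragraph("  Introduction  "),  # surrounding whitespace is stripped
    Paragraph(""),                  # empty: skipped
    Paragraph("\u4e00\u6f22"),      # two non-ASCII characters: dropped
    Paragraph("| a | b |"),         # table row: kept
    Paragraph("개요"),               # short Korean heading: also dropped
    Paragraph("개요 1"),             # contains ASCII characters: kept
]
print(paragraphs_to_markdown(sample))
```

As the sample shows, the condition cannot tell stray control-character residue from a legitimate short Korean paragraph such as a two-syllable heading, so documents with terse headings may lose them. Whether that matters in practice depends on how `ParagraphParser` emits headings, which this diff does not show.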