md-spreadsheet-parser 1.0.1__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- md_spreadsheet_parser/__init__.py +66 -0
- md_spreadsheet_parser/cli.py +128 -0
- md_spreadsheet_parser/converters.py +126 -0
- md_spreadsheet_parser/excel.py +183 -0
- md_spreadsheet_parser/generator.py +170 -0
- md_spreadsheet_parser/loader.py +183 -0
- md_spreadsheet_parser/models.py +491 -0
- md_spreadsheet_parser/parsing.py +590 -0
- md_spreadsheet_parser/py.typed +0 -0
- md_spreadsheet_parser/pydantic_adapter.py +130 -0
- md_spreadsheet_parser/schemas.py +108 -0
- md_spreadsheet_parser/utils.py +6 -0
- md_spreadsheet_parser/validation.py +348 -0
- md_spreadsheet_parser-1.0.1.dist-info/METADATA +922 -0
- md_spreadsheet_parser-1.0.1.dist-info/RECORD +18 -0
- md_spreadsheet_parser-1.0.1.dist-info/WHEEL +4 -0
- md_spreadsheet_parser-1.0.1.dist-info/entry_points.txt +2 -0
- md_spreadsheet_parser-1.0.1.dist-info/licenses/LICENSE +21 -0

@@ -0,0 +1,922 @@
Metadata-Version: 2.4
Name: md-spreadsheet-parser
Version: 1.0.1
Summary: A robust, zero-dependency Python library for parsing, validating, and manipulating Markdown tables, including conversion from Excel to Markdown.
Project-URL: Homepage, https://f-y.github.io/md-spreadsheet-parser/
Project-URL: Repository, https://github.com/f-y/md-spreadsheet-parser
Project-URL: Issues, https://github.com/f-y/md-spreadsheet-parser/issues
Project-URL: Changelog, https://github.com/f-y/md-spreadsheet-parser/releases
Author: f-y
License: MIT License

Copyright (c) 2025 f-y

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
License-File: LICENSE
Keywords: cli,conversion,csv,dataframe,excel,markdown,pandas,parser,spreadsheet,table,zero-dependency
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Markup :: Markdown
Requires-Python: >=3.10
Description-Content-Type: text/markdown

# Markdown Spreadsheet Parser

<p align="center">
  <a href="https://github.com/f-y/md-spreadsheet-parser/blob/main/LICENSE">
    <img src="https://img.shields.io/badge/license-MIT-blue.svg" alt="License" />
  </a>
  <a href="https://pypi.org/project/md-spreadsheet-parser/">
    <img src="https://img.shields.io/badge/pypi-v1.0.1-blue" alt="PyPI" />
  </a>
  <a href="https://pepy.tech/projects/md-spreadsheet-parser"><img src="https://static.pepy.tech/personalized-badge/md-spreadsheet-parser?period=total&units=INTERNATIONAL_SYSTEM&left_color=GREY&right_color=BLUE&left_text=downloads" alt="PyPI Downloads"></a>
  <a href="https://github.com/f-y/md-spreadsheet-parser">
    <img src="https://img.shields.io/badge/repository-github-blue.svg" alt="Repository" />
  </a>
  <a href="https://github.com/f-y/md-spreadsheet-parser/actions?query=workflow%3ATests">
    <img src="https://github.com/f-y/md-spreadsheet-parser/workflows/Tests/badge.svg" alt="Build Status" />
  </a>
</p>

<p align="center">
  <strong>A robust, zero-dependency Python library for converting Excel to Markdown, parsing tables, and type-safe validation.</strong>
</p>

---

**md-spreadsheet-parser** elevates Markdown tables from simple text to first-class data structures. It offers a precise, zero-dependency engine to parse, validate, and manipulate tables with the ease of a spreadsheet and the power of Python.

> [!IMPORTANT]
> **🎉 Official GUI Editor Released: [PengSheets](https://marketplace.visualstudio.com/items?itemName=f-y.peng-sheets)**
>
> We have transformed this library into an Excel-like interface for VS Code. Edit Markdown tables with sort, filter, and easy navigation directly in your editor.
>
> [](https://marketplace.visualstudio.com/items?itemName=f-y.peng-sheets)

🚀 **Need a quick solution?** Check out the [Cookbook](https://github.com/f-y/md-spreadsheet-parser/blob/main/COOKBOOK.md) for copy-pasteable recipes (Excel conversion, Pandas integration, Markdown table manipulation, and more).

Read in Japanese: 日本語版はこちら (<a href="https://github.com/f-y/md-spreadsheet-parser/blob/main/README.ja.md">README</a>, <a href="https://github.com/f-y/md-spreadsheet-parser/blob/main/COOKBOOK.ja.md">Cookbook</a>)

## Table of Contents

- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
  - [1. Basic Parsing](#1-basic-parsing)
  - [2. Type-Safe Validation](#2-type-safe-validation-recommended)
    - [Pydantic Integration](#pydantic-integration)
  - [3. JSON & Dictionary Conversion](#3-json--dictionary-conversion)
  - [4. Pandas Integration & Export](#4-pandas-integration--export)
  - [5. Excel Import](#5-excel-import)
  - [6. Markdown Generation (Round-Trip)](#6-markdown-generation-round-trip)
  - [7. Advanced Features](#7-advanced-features)
  - [8. Advanced Type Conversion](#8-advanced-type-conversion)
  - [9. Robustness](#9-robustness-handling-malformed-tables)
  - [10. In-Cell Line Break Support](#10-in-cell-line-break-support)
  - [11. Performance & Scalability (Streaming API)](#11-performance--scalability-streaming-api)
  - [12. Programmatic Manipulation](#12-programmatic-manipulation)
  - [13. Visual Metadata Persistence](#13-visual-metadata-persistence)
- [Command Line Interface (CLI)](#command-line-interface-cli)
- [Configuration](#configuration)
- [Future Roadmap](#future-roadmap)
- [License](#license)

## Features

- **Pure Python & Zero Dependencies**: Lightweight and portable. Perfect for **AWS Lambda Layers** and constrained environments. Runs anywhere Python runs, including **WebAssembly (Pyodide)**.
- **Type-Safe Validation**: Convert loose Markdown tables into strongly-typed Python `dataclasses` with automatic type conversion, including customizable boolean logic (I18N) and custom type converters.
- **Markdown as a Database**: Treat your Markdown files as Git-managed configuration or master data. Validate schema and types automatically, preventing human error in handwritten tables.
- **Round-Trip Support**: Parse to objects, modify data, and generate Markdown back. Perfect for editors.
- **GFM Compliance**: Supports GitHub Flavored Markdown (GFM) specifications, including column alignment (`:--`, `:--:`, `--:`) and correct handling of pipes within inline code (`` `|` ``).
- **Robust Parsing**: Gracefully handles malformed tables (missing/extra columns) and escaped characters.
- **Multi-Table Workbooks**: Support for parsing multiple sheets and tables from a single file, including metadata.
- **JSON & Dict Support**: Column-level JSON parsing and direct conversion to `dict`/`TypedDict`.
- **Pandas Integration**: Seamlessly create DataFrames from Markdown tables.
- **Excel Import & Data Cleaning**: Convert Excel/TSV/CSV to Markdown with intelligent merged cell handling. Automatically flattens hierarchical headers and fills gaps, turning "dirty" spreadsheets into clean, structured data.
- **JSON-Friendly**: Easy export to dictionaries/JSON for integration with other tools.

## Installation

```bash
pip install md-spreadsheet-parser
```

## Usage

### 1. Basic Parsing

**Single Table**
Parse a standard Markdown table into a structured object.

```python
from md_spreadsheet_parser import parse_table

markdown = """
| Name | Age |
| --- | --- |
| Alice | 30 |
| Bob | 25 |
"""

result = parse_table(markdown)

print(result.headers)
# ['Name', 'Age']

print(result.rows)
# [['Alice', '30'], ['Bob', '25']]
```

**Multiple Tables (Workbook)**
Parse a file containing multiple sheets (sections). By default, it looks for `# Tables` as the root marker and `## Sheet Name` for sheets.

```python
from md_spreadsheet_parser import parse_workbook, MultiTableParsingSchema

markdown = """
# Tables

## Users
| ID | Name |
| -- | ---- |
| 1 | Alice|

## Products
| ID | Item |
| -- | ---- |
| A | Apple|
"""

# Use default schema
schema = MultiTableParsingSchema()
workbook = parse_workbook(markdown, schema)

for sheet in workbook.sheets:
    print(f"Sheet: {sheet.name}")
    for table in sheet.tables:
        print(table.rows)
```

**Lookup API & Metadata**
Retrieve sheets and tables directly by name, and access parsed metadata like descriptions.

```python
from md_spreadsheet_parser import parse_workbook

markdown = """
# Tables

## Sales Data

### Q1 Results
Financial performance for the first quarter.

| Year | Revenue |
| ---- | ------- |
| 2023 | 1000 |
"""

workbook = parse_workbook(markdown)

# Access by name
sheet = workbook.get_sheet("Sales Data")
if sheet:
    # Retrieve table by name (from ### Header)
    table = sheet.get_table("Q1 Results")

    print(table.description)
    # "Financial performance for the first quarter."

    print(table.rows)
    # [['2023', '1000']]
```

**Simple Scan Interface**
If you want to extract *all* tables from a document regardless of its structure (ignoring sheets and headers), use `scan_tables`.

```python
from md_spreadsheet_parser import scan_tables

markdown = """
| ID | Name |
| -- | ---- |
| 1 | Alice|

... text ...

| ID | Item |
| -- | ---- |
| A | Apple|
"""

# Returns a flat list of all tables found
tables = scan_tables(markdown)
print(len(tables))  # 2
```

**File Loading Helpers**

For convenience, you can parse directly from a file path (`str` or `Path`) or file-like object using the `_from_file` variants:

```python
from md_spreadsheet_parser import parse_workbook_from_file

# Clean and easy
workbook = parse_workbook_from_file("data.md")
```

Available helpers:
- `parse_table_from_file(path_or_file)`
- `parse_workbook_from_file(path_or_file)`
- `scan_tables_from_file(path_or_file)`

### GFM Feature Support

The parser strictly adheres to GitHub Flavored Markdown (GFM) specifications for tables.

**Column Alignment**
Alignment markers in the separator row are parsed and preserved.

```python
markdown = """
| Left | Center | Right |
| :--- | :----: | ----: |
| 1 | 2 | 3 |
"""
table = parse_table(markdown)
print(table.alignments)
# ["left", "center", "right"]
```

**Pipes in Code & Escaping**
Pipes `|` inside inline code blocks (backticks) or escaped with `\` are correctly treated as content, not column separators.

```python
markdown = """
| Code | Escaped |
| ----- | ------- |
| `a|b` | \| |
"""
table = parse_table(markdown)
# table.rows[0] == ["`a|b`", "|"]
```

### 2. Type-Safe Validation (Recommended)

The most powerful feature of this library is converting loose markdown tables into strongly-typed Python objects using `dataclasses`. This ensures your data is valid and easy to work with.

```python
from dataclasses import dataclass
from md_spreadsheet_parser import parse_table, TableValidationError

@dataclass
class User:
    name: str
    age: int
    is_active: bool = True

markdown = """
| Name | Age | Is Active |
|---|---|---|
| Alice | 30 | yes |
| Bob | 25 | no |
"""

try:
    # Parse and validate in one step
    users = parse_table(markdown).to_models(User)

    for user in users:
        print(f"{user.name} is {user.age} years old.")
        # Alice is 30 years old.
        # Bob is 25 years old.

except TableValidationError as e:
    print(e)
```

**Features:**
* **Type Conversion**: Automatically converts strings to `int`, `float`, `bool` using standard rules.
* **Boolean Handling (Default)**: Supports standard pairs out-of-the-box: `true/false`, `yes/no`, `on/off`, `1/0`. (See [Advanced Type Conversion](#8-advanced-type-conversion) for customization.)
* **Optional Fields**: Handles `Optional[T]` by converting empty strings to `None`.
* **Validation**: Raises detailed errors if data doesn't match the schema.

### Pydantic Integration

For more advanced validation (email format, ranges, regex), you can use [Pydantic](https://docs.pydantic.dev/) models instead of dataclasses. This feature is enabled automatically if `pydantic` is installed.

```python
from pydantic import BaseModel, Field, EmailStr

class User(BaseModel):
    name: str = Field(alias="User Name")
    age: int = Field(gt=0)
    email: EmailStr

# Automatically detects Pydantic model and uses it for validation
users = parse_table(markdown).to_models(User)
```

The parser respects Pydantic's `alias` and `Field` constraints.

### 3. JSON & Dictionary Conversion

Sometimes you don't want to define a full Dataclass or Pydantic model, or you have columns containing JSON strings.

**Simple Dictionary Output**
Convert tables directly to a list of dictionaries. Keys are derived from headers.

```python
# Returns list[dict[str, Any]] (Values are raw strings)
rows = parse_table(markdown).to_models(dict)
print(rows[0])
# {'Name': 'Alice', 'Age': '30'}
```

**TypedDict Support**
Use `TypedDict` for lightweight type safety. The parser uses the type annotations to convert values automatically.

```python
from typing import TypedDict

class User(TypedDict):
    name: str
    age: int
    active: bool

rows = parse_table(markdown).to_models(User)
print(rows[0])
# {'name': 'Alice', 'age': 30, 'active': True}
```

**Column-Level JSON Parsing**
If a field is typed as `dict` or `list` (in a Dataclass or Pydantic model), the parser **automatically parses the cell value as JSON**.

```python
@dataclass
class Config:
    id: int
    metadata: dict  # Cell: '{"debug": true}' -> Parsed to dict
    tags: list  # Cell: '["a", "b"]' -> Parsed to list

# Pydantic models also work without Json[] wrapper
class ConfigModel(BaseModel):
    metadata: dict
```

**Limitations:**
* **JSON Syntax**: The cell content must be valid JSON (e.g. double quotes `{"a": 1}`). Malformed JSON raises a `ValueError`.
* **Simple Dict Parsing**: `to_models(dict)` does *not* automatically parse inner JSON strings unless you use a custom schema. It only creates a shallow dictionary of strings.
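
If you stay with plain `to_models(dict)` and still need the JSON columns decoded, you can post-process the raw strings yourself with the standard library. A stdlib-only sketch (the row data below stands in for what `to_models(dict)` would return):

```python
import json

# Raw rows as to_models(dict) would return them: every value is a string.
rows = [{"id": "1", "metadata": '{"debug": true}', "tags": '["a", "b"]'}]

# Decode the JSON-bearing columns in place.
for row in rows:
    row["metadata"] = json.loads(row["metadata"])
    row["tags"] = json.loads(row["tags"])

print(rows[0]["metadata"])  # {'debug': True}
```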

### 4. Pandas Integration & Export

This library is designed to be a bridge between Markdown and Data Science tools like **Pandas**.

**Convert to DataFrame (Easiest Way)**
The cleanest way to create a DataFrame is using `to_models(dict)`. This returns a list of dictionaries that Pandas can ingest directly.

```python
import pandas as pd
from md_spreadsheet_parser import parse_table

markdown = """
| Date | Sales | Region |
|------------|-------|--------|
| 2023-01-01 | 100 | US |
| 2023-01-02 | 150 | EU |
"""

table = parse_table(markdown)

# 1. Convert to list of dicts
data = table.to_models(dict)

# 2. Create DataFrame
df = pd.DataFrame(data)

# 3. Post-Process: Convert types (Pandas usually infers strings initially)
df["Sales"] = pd.to_numeric(df["Sales"])
df["Date"] = pd.to_datetime(df["Date"])

print(df.dtypes)
# Date      datetime64[ns]
# Sales              int64
# Region            object
```

**Convert from Type-Safe Objects**
If you want to validate data **before** creating a DataFrame (e.g., ensuring "Sales" is an integer during parsing), use a `dataclass` and then convert to Pandas.

```python
from dataclasses import dataclass, asdict

@dataclass
class SalesRecord:
    date: str
    amount: int
    region: str

# 1. Parse and Validate (Raises TableValidationError if invalid)
records = parse_table(markdown).to_models(SalesRecord)

# 2. Convert to DataFrame using asdict()
df = pd.DataFrame([asdict(r) for r in records])

# The 'amount' column is already int64 because validation handled conversion
print(df["amount"].dtype)  # int64
```

**JSON Export**
All result objects (`Workbook`, `Sheet`, `Table`) have a `.json` property that returns a dictionary structure suitable for serialization.

```python
import json

# Export entire workbook structure
print(json.dumps(workbook.json, indent=2))
```

### 5. Excel Import

Import Excel data (via TSV/CSV or `openpyxl`) with intelligent handling of merged cells and hierarchical headers.

> [!NOTE]
> Importing from TSV/CSV text works with **zero dependencies**. Direct `.xlsx` file loading requires `openpyxl` (a user-managed optional dependency).

**Basic Usage**

🚀 **See the [Cookbook](https://github.com/f-y/md-spreadsheet-parser/blob/main/COOKBOOK.md) for more comprehensive recipes.**

```python
from md_spreadsheet_parser import parse_excel

# From TSV/CSV (Zero Dependency)
table = parse_excel("Name\tAge\nAlice\t30")

# From .xlsx (requires openpyxl)
import openpyxl
wb = openpyxl.load_workbook("data.xlsx")
table = parse_excel(wb.active)
```

**Merged Header Handling**

When Excel exports merged cells, they appear as empty cells. The parser automatically forward-fills these gaps:

```text
Excel (merged headers):
┌─────────────────────────────┬────────┐
│ Category (3 cols)           │ Info   │
├─────────┬─────────┬─────────┼────────┤
│ A       │ B       │ C       │ D      │
└─────────┴─────────┴─────────┴────────┘

  ↓ parse_excel()

Markdown:
| Category | Category | Category | Info |
|----------|----------|----------|------|
| A        | B        | C        | D    |
```
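
The forward-fill behavior is simple to state on its own. A stdlib-only sketch of the idea (not the library's internal code):

```python
def forward_fill(headers: list[str]) -> list[str]:
    """Replace empty cells with the most recent non-empty value to their left."""
    filled, last = [], ""
    for cell in headers:
        if cell:
            last = cell
        filled.append(last)
    return filled

print(forward_fill(["Category", "", "", "Info"]))
# ['Category', 'Category', 'Category', 'Info']
```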

**2-Row Hierarchical Headers**

For complex headers with parent-child relationships, use `ExcelParsingSchema(header_rows=2)`:

```text
Excel (2-row header):
┌───────────────────┬───────────────────┐
│ Info              │ Metrics           │  ← Row 1 (Parent)
├─────────┬─────────┼─────────┬─────────┤
│ Name    │ ID      │ Score   │ Rank    │  ← Row 2 (Child)
├─────────┼─────────┼─────────┼─────────┤
│ Alice   │ 001     │ 95      │ 1       │
└─────────┴─────────┴─────────┴─────────┘

  ↓ parse_excel(schema=ExcelParsingSchema(header_rows=2))

Markdown:
| Info - Name | Info - ID | Metrics - Score | Metrics - Rank |
|-------------|-----------|-----------------|----------------|
| Alice       | 001       | 95              | 1              |
```

> **Note:** Currently supports up to 2 header rows. For deeper hierarchies, pre-process your data before parsing.
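
The flattening step can likewise be sketched in plain Python (illustrative only; the parent row is forward-filled first, then joined to the child row with the configured separator):

```python
def flatten_headers(parents: list[str], children: list[str], sep: str = " - ") -> list[str]:
    # Forward-fill the parent row (merged cells export as empty strings).
    filled, last = [], ""
    for cell in parents:
        if cell:
            last = cell
        filled.append(last)
    # Join each parent label to its child label.
    return [f"{p}{sep}{c}" for p, c in zip(filled, children)]

print(flatten_headers(["Info", "", "Metrics", ""], ["Name", "ID", "Score", "Rank"]))
# ['Info - Name', 'Info - ID', 'Metrics - Score', 'Metrics - Rank']
```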

**Excel to Structured Objects (The "Killer" Feature)**

Don't just convert to text: convert Excel directly to valid, type-safe Python objects in one step.

```python
@dataclass
class SalesRecord:
    category: str
    item: str
    amount: int  # Automatic string-to-int conversion

# 1. Parse Excel (handles merged cells automatically)
# 2. Validate & Convert to objects
# (ws is an openpyxl worksheet loaded as shown above)
records = parse_excel(ws).to_models(SalesRecord)

# Now you have clean, typed data
assert records[0].amount == 1000
```

**Configuration**

Use `ExcelParsingSchema` to customize parsing behavior:

```python
from md_spreadsheet_parser import parse_excel, ExcelParsingSchema

schema = ExcelParsingSchema(
    header_rows=2,
    fill_merged_headers=True,
    header_separator=" / "
)

table = parse_excel(source, schema)
```

| Option | Default | Description |
|--------|---------|-------------|
| `header_rows` | `1` | Number of header rows (1 or 2). |
| `fill_merged_headers` | `True` | Forward-fill empty header cells. |
| `header_separator` | `" - "` | Separator for flattened 2-row headers. |
| `delimiter` | `"\t"` | Column separator for TSV/CSV. |

### 6. Markdown Generation (Round-Trip)

You can modify parsed objects and convert them back to Markdown strings using `to_markdown()`. This enables a complete "Parse -> Modify -> Generate" workflow.

```python
from md_spreadsheet_parser import parse_table, ParsingSchema

markdown = "| A | B |\n|---|---|\n| 1 | 2 |"
table = parse_table(markdown)

# Modify data
table.rows.append(["3", "4"])

# Generate Markdown
# You can customize the output format using a schema
schema = ParsingSchema(require_outer_pipes=True)
print(table.to_markdown(schema))
# | A | B |
# | --- | --- |
# | 1 | 2 |
# | 3 | 4 |
```
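
The emitted format is plain GFM, so the same layout can be produced by a stand-alone sketch, independent of the library:

```python
def to_gfm(headers: list[str], rows: list[list[str]]) -> str:
    """Render headers and rows as a GitHub Flavored Markdown pipe table."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)

print(to_gfm(["A", "B"], [["1", "2"], ["3", "4"]]))
# | A | B |
# | --- | --- |
# | 1 | 2 |
# | 3 | 4 |
```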

### 7. Advanced Features

**Metadata Extraction Configuration**
By default, the parser captures table names (level 3 headers) and descriptions. You can customize this behavior with `MultiTableParsingSchema`.

```python
from md_spreadsheet_parser import MultiTableParsingSchema

schema = MultiTableParsingSchema(
    table_header_level=3,  # Treat ### Header as table name
    capture_description=True  # Capture text between header and table
)
# Pass schema to parse_workbook...
```

### 7. Advanced Type Conversion
|
|
612
|
+
|
|
613
|
+
You can customize how string values are converted to Python objects by passing a `ConversionSchema` to `to_models()`. This is useful for internationalization (I18N) and handling custom types.
|
|
614
|
+
|
|
615
|
+
**Internationalization (I18N): Custom Boolean Pairs**
|
|
616
|
+
|
|
617
|
+
Configure which string pairs map to `True`/`False` (case-insensitive).
|
|
618
|
+
|
|
619
|
+
```python
|
|
620
|
+
from md_spreadsheet_parser import parse_table, ConversionSchema
|
|
621
|
+
|
|
622
|
+
markdown = """
|
|
623
|
+
| User | Active? |
|
|
624
|
+
| --- | --- |
|
|
625
|
+
| Tanaka | はい |
|
|
626
|
+
| Suzuki | いいえ |
|
|
627
|
+
"""
|
|
628
|
+
|
|
629
|
+
# Configure "はい" -> True, "いいえ" -> False
|
|
630
|
+
schema = ConversionSchema(
|
|
631
|
+
boolean_pairs=(("はい", "いいえ"),)
|
|
632
|
+
)
|
|
633
|
+
|
|
634
|
+
users = parse_table(markdown).to_models(User, conversion_schema=schema)
|
|
635
|
+
# Tanaka.active is True
|
|
636
|
+
```
|
|
637
|
+
|
|
638
|
+
**Custom Type Converters**
|
|
639
|
+
|
|
640
|
+
Register custom conversion functions for specific types. You can use **ANY Python type** as a key, including:
|
|
641
|
+
|
|
642
|
+
- **Built-ins**: `int`, `float`, `bool` (to override default behavior)
|
|
643
|
+
- **Standard Library**: `Decimal`, `datetime`, `date`, `ZoneInfo`, `UUID`
|
|
644
|
+
- **Custom Classes**: Your own data classes or objects
|
|
645
|
+
|
|
646
|
+
Example using standard library types and a custom class:
|
|
647
|
+
|
|
648
|
+
```python
|
|
649
|
+
from dataclasses import dataclass
|
|
650
|
+
from uuid import UUID
|
|
651
|
+
from zoneinfo import ZoneInfo
|
|
652
|
+
from md_spreadsheet_parser import ConversionSchema, parse_table
|
|
653
|
+
|
|
654
|
+
@dataclass
|
|
655
|
+
class Color:
|
|
656
|
+
r: int
|
|
657
|
+
g: int
|
|
658
|
+
b: int
|
|
659
|
+
|
|
660
|
+
@dataclass
|
|
661
|
+
class Config:
|
|
662
|
+
timezone: ZoneInfo
|
|
663
|
+
session_id: UUID
|
|
664
|
+
theme_color: Color
|
|
665
|
+
|
|
666
|
+
markdown = """
|
|
667
|
+
| Timezone | Session ID | Theme Color |
|
|
668
|
+
| --- | --- | --- |
|
|
669
|
+
| Asia/Tokyo | 12345678-1234-5678-1234-567812345678 | 255,0,0 |
|
|
670
|
+
"""
|
|
671
|
+
|
|
672
|
+
schema = ConversionSchema(
|
|
673
|
+
custom_converters={
|
|
674
|
+
# Standard Library Types
|
|
675
|
+
ZoneInfo: lambda v: ZoneInfo(v),
|
|
676
|
+
UUID: lambda v: UUID(v),
|
|
677
|
+
# Custom Class
|
|
678
|
+
Color: lambda v: Color(*map(int, v.split(",")))
|
|
679
|
+
}
|
|
680
|
+
)
|
|
681
|
+
|
|
682
|
+
data = parse_table(markdown).to_models(Config, conversion_schema=schema)
|
|
683
|
+
# data[0].timezone is ZoneInfo("Asia/Tokyo")
|
|
684
|
+
# data[0].theme_color is Color(255, 0, 0)
|
|
685
|
+
```
|
|
686
|
+
|
|
687
|
+
**Field-Specific Converters**
|
|
688
|
+
|
|
689
|
+
For granular control, you can define converters for specific field names, which take precedence over type-based converters.
|
|
690
|
+
|
|
691
|
+
```python
|
|
692
|
+
def parse_usd(val): ...
|
|
693
|
+
def parse_jpy(val): ...
|
|
694
|
+
|
|
695
|
+
schema = ConversionSchema(
|
|
696
|
+
# Type-based defaults (Low priority)
|
|
697
|
+
custom_converters={
|
|
698
|
+
Decimal: parse_usd
|
|
699
|
+
},
|
|
700
|
+
# Field-name overrides (High priority)
|
|
701
|
+
field_converters={
|
|
702
|
+
"price_jpy": parse_jpy,
|
|
703
|
+
"created_at": lambda x: datetime.strptime(x, "%Y/%m/%d")
|
|
704
|
+
}
|
|
705
|
+
)
|
|
706
|
+
|
|
707
|
+
# price_usd (no override) -> custom_converters (parse_usd)
|
|
708
|
+
# price_jpy (override) -> field_converters (parse_jpy)
|
|
709
|
+
data = parse_table(markdown).to_models(Product, conversion_schema=schema)
|
|
710
|
+
```

**Standard Converters Library**

For common patterns (currencies, lists), you can use the built-in helper functions in `md_spreadsheet_parser.converters` instead of writing your own.

```python
from md_spreadsheet_parser.converters import (
    to_decimal_clean,         # Handles "$1,000", "¥500" -> Decimal
    make_datetime_converter,  # Factory for parse/TZ logic
    make_list_converter,      # "a,b,c" -> ["a", "b", "c"]
    make_bool_converter       # Custom strict boolean sets
)

schema = ConversionSchema(
    custom_converters={
        # Currency: removes $, ¥, €, £, commas, spaces
        Decimal: to_decimal_clean,
        # DateTime: ISO format by default, attaches Tokyo TZ if naive
        datetime: make_datetime_converter(tz=ZoneInfo("Asia/Tokyo")),
        # Lists: split by comma, strip whitespace
        list: make_list_converter(separator=",")
    },
    field_converters={
        # Custom boolean for a specific field
        "is_valid": make_bool_converter(true_values=["OK"], false_values=["NG"])
    }
)
```
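Conceptually, a currency cleaner like `to_decimal_clean` amounts to stripping symbols before conversion. A sketch of that behavior (a plain-Python approximation, not the library's actual code):

```python
from decimal import Decimal

def clean_decimal(value: str) -> Decimal:
    """Strip common currency symbols, commas, and spaces before converting."""
    for ch in "$¥€£, ":
        value = value.replace(ch, "")
    return Decimal(value)

print(clean_decimal("$1,000"))  # 1000
print(clean_decimal("¥500"))    # 500
```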

### 8. Robustness (Handling Malformed Tables)

The parser is designed to handle imperfect Markdown tables gracefully.

* **Missing Columns**: Rows with fewer columns than the header are automatically **padded** with empty strings.
* **Extra Columns**: Rows with more columns than the header are automatically **truncated**.

```python
from md_spreadsheet_parser import parse_table

markdown = """
| A | B |
|---|---|
| 1 |            <-- Missing column
| 1 | 2 | 3      <-- Extra column
"""

table = parse_table(markdown)

print(table.rows)
# [['1', ''], ['1', '2']]
```

This ensures that `table.rows` always matches the structure of `table.headers`, preventing crashes during iteration or validation.
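The pad/truncate guarantee can be expressed in a couple of lines. This is a sketch of the invariant, not the parser's actual implementation:

```python
def normalize_row(row: list[str], width: int) -> list[str]:
    """Pad short rows with empty strings; truncate long rows to the header width."""
    return (row + [""] * width)[:width]

headers = ["A", "B"]
rows = [["1"], ["1", "2", "3"]]
print([normalize_row(r, len(headers)) for r in rows])
# [['1', ''], ['1', '2']]
```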

### 9. In-Cell Line Break Support

The parser automatically converts HTML line breaks to Python newlines (`\n`). This enables handling multiline cells naturally.

**Supported Tags (Case-Insensitive):**
- `<br>`
- `<br/>`
- `<br />`

```python
markdown = "| Line1<br>Line2 |"
table = parse_table(markdown)
# table.rows[0][0] == "Line1\nLine2"
```

**Round-Trip Support:**
When generating Markdown (e.g., `table.to_markdown()`), Python newlines (`\n`) are automatically converted back to `<br>` tags to preserve the table structure.

To disable this, set `convert_br_to_newline=False` in `ParsingSchema`.
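The three supported tags collapse into one case-insensitive regex. A sketch of the round trip (illustrative only, not the parser's code):

```python
import re

# Matches <br>, <br/>, and <br /> in any letter case
BR_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)

def br_to_newline(cell: str) -> str:
    return BR_RE.sub("\n", cell)

def newline_to_br(cell: str) -> str:
    # Round trip: restore <br> so the Markdown row stays on one physical line
    return cell.replace("\n", "<br>")

print(br_to_newline("Line1<BR />Line2"))
```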

### 10. Performance & Scalability (Streaming API)

**Do you really have a 10GB Markdown file?**

Probably not. We sincerely hope you don't. Markdown wasn't built for that.

But *if you do*—perhaps you're generating extensive logs or auditing huge data exports—this library has your back. While Excel gives up after 1,048,576 rows, `md-spreadsheet-parser` supports streaming processing for files of **unlimited size**, keeping memory usage constant.

**scan_tables_iter**:
This function reads the file line by line and yields `Table` objects as they are found. It does **not** load the entire file into memory.

```python
from md_spreadsheet_parser import scan_tables_iter

# Process a massive log file (e.g., 10GB)
# Memory usage remains low (only the size of a single table block)
for table in scan_tables_iter("huge_server_log.md"):
    print(f"Found table with {len(table.rows)} rows")

    # Process rows...
    for row in table.rows:
        pass
```

This is ideal for data pipelines, log analysis, and processing exports that are too large to open in standard spreadsheet editors.
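The streaming idea itself is straightforward: buffer only the lines of the current table block and yield each block as it completes. A simplified generator in that spirit (`iter_table_blocks` is a hypothetical sketch; the real `scan_tables_iter` handles far more):

```python
import io

def iter_table_blocks(lines):
    """Yield lists of raw table lines, holding only one block in memory."""
    block = []
    for line in lines:
        if line.strip().startswith("|"):
            block.append(line.rstrip("\n"))
        elif block:
            yield block
            block = []
    if block:  # flush a table that runs to end-of-file
        yield block

text = "prose\n| A | B |\n|---|---|\n| 1 | 2 |\nmore prose\n"
for block in iter_table_blocks(io.StringIO(text)):
    print(len(block), "lines")  # 3 lines
```

Because the generator consumes any iterable of lines, the same code works on an open file handle without reading it fully.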

### 11. Programmatic Manipulation

The library provides immutable methods to modify the data structure. These methods return a **new instance** of the object with the changes applied, keeping the original object unchanged.

**Workbook Operations**
```python
# Add a new sheet (creates a default table with headers A, B, C)
new_wb = workbook.add_sheet("New Sheet")

# Rename a sheet
new_wb = workbook.rename_sheet(sheet_index=0, new_name="Budget 2024")

# Delete a sheet
new_wb = workbook.delete_sheet(sheet_index=1)
```

**Sheet Operations**
```python
# Rename sheet (direct method)
new_sheet = sheet.rename("Q1 Data")

# Update table metadata
new_sheet = sheet.update_table_metadata(
    table_index=0,
    name="Expenses",
    description="Monthly expense report"
)
```

**Table Operations**
```python
# Update a cell (automatically expands the table if the index is out of bounds)
new_table = table.update_cell(row_idx=5, col_idx=2, value="Updated")

# Delete a row (structural delete)
new_table = table.delete_row(row_idx=2)

# Clear column data (keeps headers and row structure, empties cells)
new_table = table.clear_column_data(col_idx=3)
```
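This copy-on-write style is easy to picture with a frozen dataclass. A sketch of the pattern (`MiniTable` is a toy stand-in, not the library's model classes):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class MiniTable:
    headers: tuple
    rows: tuple

    def update_cell(self, row_idx: int, col_idx: int, value: str) -> "MiniTable":
        # Copy, mutate the copy, and return a brand-new frozen instance
        rows = [list(r) for r in self.rows]
        rows[row_idx][col_idx] = value
        return replace(self, rows=tuple(tuple(r) for r in rows))

t1 = MiniTable(headers=("A", "B"), rows=(("1", "2"),))
t2 = t1.update_cell(0, 1, "X")
print(t1.rows, t2.rows)  # the original is unchanged
```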

### 12. Visual Metadata Persistence

The library supports persisting visual state (like column widths and filter settings) without altering the Markdown table structure itself. This is achieved via a hidden HTML comment appended after the table.

```markdown
| A | B |
|---|---|
| 1 | 2 |

<!-- md-spreadsheet-table-metadata: {"columnWidths": [100, 200]} -->
```

This ensures that:
1. **Clean Data**: The table remains standard Markdown, readable by any renderer.
2. **Rich State**: Compatible tools (like our VS Code Extension) can read the comment to restore UI state (column widths, hidden columns, etc.).
3. **Robustness**: The parser automatically associates this metadata with the preceding table, even if separated by blank lines.
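Reading such a comment back takes only a regex and `json.loads`. A sketch of the extraction step (illustrative; `extract_metadata` is a hypothetical helper, not the library's API):

```python
import json
import re

# Matches the hidden comment and captures its JSON payload
META_RE = re.compile(r"<!--\s*md-spreadsheet-table-metadata:\s*(\{.*?\})\s*-->")

def extract_metadata(markdown: str):
    match = META_RE.search(markdown)
    return json.loads(match.group(1)) if match else None

doc = '| A | B |\n|---|---|\n| 1 | 2 |\n\n<!-- md-spreadsheet-table-metadata: {"columnWidths": [100, 200]} -->'
print(extract_metadata(doc))  # {'columnWidths': [100, 200]}
```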

### Command Line Interface (CLI)

You can use the `md-spreadsheet-parser` command to parse Markdown files and output JSON. This is useful for piping data to other tools.

```bash
# Read from file
md-spreadsheet-parser input.md

# Read from stdin (pipe)
cat input.md | md-spreadsheet-parser
```

**Options:**
- `--scan`: Scan for all tables, ignoring workbook structure (returns a list of tables).
- `--root-marker`: Set the root marker (default: `# Tables`).
- `--sheet-header-level`: Set the sheet header level (default: 2).
- `--table-header-level`: Set the table header level (default: 3).
- `--capture-description`: Capture table descriptions (default: True).
- `--column-separator`: Character used to separate columns (default: `|`).
- `--header-separator-char`: Character used in the separator row (default: `-`).
- `--no-outer-pipes`: Allow tables without outer pipes (default: False).
- `--no-strip-whitespace`: Do not strip whitespace from cell values (default: False).
- `--no-br-conversion`: Disable automatic conversion of `<br>` tags to newlines (default: False).

## Configuration

Customize parsing behavior using `ParsingSchema` and `MultiTableParsingSchema`.

| Option | Default | Description |
| :--- | :--- | :--- |
| `column_separator` | `\|` | Character used to separate columns. |
| `header_separator_char` | `-` | Character used in the separator row. |
| `require_outer_pipes` | `True` | If `True`, generated Markdown tables include outer pipes. |
| `strip_whitespace` | `True` | If `True`, whitespace is stripped from cell values. |
| `convert_br_to_newline` | `True` | If `True`, `<br>` tags are converted to `\n` (and back). |
| `root_marker` | `# Tables` | (MultiTable) Marker indicating the start of the data section. |
| `sheet_header_level` | `2` | (MultiTable) Header level for sheets. |
| `table_header_level` | `3` | (MultiTable) Header level for tables. |
| `capture_description` | `True` | (MultiTable) Capture text between the header and the table. |

## Ecosystem

This parser is the core foundation of a new ecosystem: **Text-Based Spreadsheet Management**.

It powers **[PengSheets](https://marketplace.visualstudio.com/items?itemName=f-y.peng-sheets)**, a rich VS Code extension that provides a full GUI spreadsheet editor for Markdown files.

**The Vision: "Excel-like UX, Git-native Data"**
By combining a high-performance editor with this robust parser, we aim to solve the long-standing problem of managing binary spreadsheet files in software projects.
* **For Humans**: Edit data with a comfortable, familiar UI (cell formatting, improved navigation, visual feedback).
* **For Machines**: Data is saved as clean, diff-able Markdown that this library can parse, validate, and convert into Python objects instantaneously.

## License

This project is licensed under the [MIT License](https://github.com/f-y/md-spreadsheet-parser/blob/main/LICENSE).