xl2jsonl 0.1.0__tar.gz

@@ -0,0 +1,22 @@
1
+ .venv
2
+ .pytest_cache
3
+ __pycache__
4
+ *.pyc
5
+ *.pyo
6
+ *.pyd
7
+ *.db
8
+ *.sqlite3
9
+ *.log
10
+ .DS_Store
11
+ .idea/
12
+ .vscode/
13
+ *.egg-info/
14
+ dist/
15
+ build/
16
+ *.egg
17
+ *.tar.gz
18
+ *.whl
19
+ .env
20
+ .env.*
21
+ *.env
22
+ .claude
@@ -0,0 +1,61 @@
1
+ # xl2jsonl
2
+
3
+ Convert Excel (.xlsx, .xls, .xlsb) and CSV (.csv, .tsv) files to JSONL.
4
+
5
+ ## Quick Reference
6
+
7
+ - **Language**: Python 3.10+
8
+ - **Package manager**: uv
9
+ - **Entry point**: `src/xl2jsonl/cli.py` (Click CLI), `src/xl2jsonl/__init__.py` (Python API)
10
+ - **Test runner**: `uv run python -m pytest` (pytest not directly on PATH, use `python -m`)
11
+ - **Lint**: `uv run ruff check src/ tests/`
12
+
13
+ ## Project Structure
14
+
15
+ ```
16
+ src/xl2jsonl/
17
+ __init__.py - Public API: convert(), iter_records()
18
+ cli.py - Click CLI entry point
19
+ loader.py - Multi-format loader: Excel (calamine + openpyxl), CSV/TSV (stdlib csv)
20
+ chunker.py - Header detection, row-to-dict conversion, cell normalization
21
+ writer.py - JSONL streaming output via orjson
22
+ models.py - Data classes: SheetData, RowRecord, Metadata, CellValue
23
+ exceptions.py - Xl2JsonlError, LoaderError, EmptySheetError, NoHeaderError
24
+ tests/
25
+ conftest.py - Programmatic xlsx fixture generators (openpyxl)
26
+ test_*.py - Unit and integration tests
27
+ ```
28
+
29
+ ## Architecture
30
+
31
+ 1. **Loader** (`loader.py`): Routes by file extension. Excel (.xlsx/.xls/.xlsb) read via calamine; .xlsx gets openpyxl merge detection in `auto` mode. CSV/TSV read via stdlib `csv` module with basic type inference (int, float, bool).
32
+ 2. **Chunker** (`chunker.py`): Detects table regions, auto-detects headers, converts rows to key-value dicts.
33
+ - **Table region detection**: For sheets with multiple column groups (side-by-side tables separated by empty columns), splits into independent regions. Each column group is further split into row blocks by empty row gaps. Single-column-group sheets use the whole sheet as one region (backward compatible).
34
+ - **Header detection**: First row where 50%+ non-empty cells are strings AND values are sufficiently distinct (skips merged title rows).
35
+ 3. **Writer** (`writer.py`): Streams `{content, metadata}` JSONL records via orjson.
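
The header rule described above (a string majority plus sufficiently distinct values) could be sketched roughly as follows. This is a minimal illustration, not the code in `chunker.py`: the function names and the `min_distinct_ratio` threshold are assumptions, since the notes only say values must be "sufficiently distinct".

```python
from typing import Any, Optional, Sequence


def is_header_candidate(
    row: Sequence[Any],
    min_string_ratio: float = 0.5,    # "50%+ non-empty cells are strings"
    min_distinct_ratio: float = 0.5,  # assumed threshold for "sufficiently distinct"
) -> bool:
    """Return True if a row looks like a header under the stated heuristic."""
    cells = [c for c in row if c is not None and c != ""]
    if not cells:
        return False  # empty leading rows are never headers
    string_ratio = sum(isinstance(c, str) for c in cells) / len(cells)
    if string_ratio < min_string_ratio:
        return False
    # A resolved merged title row repeats one value across many cells,
    # so its ratio of distinct values is low.
    distinct_ratio = len(set(cells)) / len(cells)
    return distinct_ratio >= min_distinct_ratio


def detect_header_row(rows: Sequence[Sequence[Any]]) -> Optional[int]:
    """Return the 0-based index of the first header candidate, or None."""
    for index, row in enumerate(rows):
        if is_header_candidate(row):
            return index
    return None
```

In this sketch a row of three identical title cells is rejected because only one third of its values are distinct, while a row such as `Name, Age, Dept` is accepted.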
36
+
37
+ ## Key Design Decisions
38
+
39
+ - **Multi-table sheets**: Column groups are detected using count-based occupancy (a column needs data in 10%+ of rows to count as occupied, filtering out noise from titles/notes that span across table boundaries). Groups are split by 1+ empty columns. Row blocks within each group are split by 1+ empty rows.
40
+ - **Merged title detection**: Header auto-detection skips rows where most non-empty cells share the same value (resolved merged title/description rows).
41
+ - **Multi-row headers**: When a header candidate has duplicate values (merged category row), scans subsequent rows for one with better column coverage.
42
+ - **Format support**: Merge resolution only works for .xlsx (openpyxl limitation). For .xls/.xlsb, calamine-only (no merge detection). CSV/TSV loaded via stdlib csv with type inference.
43
+ - `--header-row` (0-based) CLI flag overrides auto-detection when heuristics fail.
44
+ - Duplicate headers from merged columns get `_2`, `_3` suffixes.
45
+ - Empty leading rows are skipped during header detection.
46
+ - Regions with no detected header are silently skipped (logged at DEBUG level).
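
As a rough illustration of the count-based occupancy rule for multi-table sheets, the sketch below marks a column as occupied when at least 10% of rows have data in it, then groups adjacent occupied columns. The function name, the return shape (inclusive 0-based column index pairs), and the treatment of `""` as empty are assumptions made for the example.

```python
from typing import Any, List, Sequence, Tuple


def column_groups(
    rows: Sequence[Sequence[Any]], min_occupancy: float = 0.10
) -> List[Tuple[int, int]]:
    """Group adjacent occupied columns into (first_col, last_col) pairs."""
    n_rows = len(rows)
    n_cols = max((len(row) for row in rows), default=0)
    if n_rows == 0 or n_cols == 0:
        return []

    # Count non-empty cells per column.
    counts = [0] * n_cols
    for row in rows:
        for col, value in enumerate(row):
            if value is not None and value != "":
                counts[col] += 1

    # A column is occupied only if enough rows have data in it, so a stray
    # title or note cell does not bridge two tables.
    occupied = [count / n_rows >= min_occupancy for count in counts]

    groups: List[Tuple[int, int]] = []
    start = None
    for col, is_occupied in enumerate(occupied):
        if is_occupied and start is None:
            start = col
        elif not is_occupied and start is not None:
            groups.append((start, col - 1))
            start = None
    if start is not None:
        groups.append((start, n_cols - 1))
    return groups
```

With 20 rows, a single note cell in an otherwise empty separator column gives 5% occupancy, so the column still counts as a gap between two groups.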
47
+
48
+ ## Running
49
+
50
+ ```bash
51
+ uv run xl2jsonl input.xlsx -o output.jsonl # CLI
52
+ uv run python -m pytest -v # Tests (all 39)
53
+ uv run python -m pytest tests/test_chunker.py # Single file
54
+ uv run ruff check src/ tests/ # Lint
55
+ ```
56
+
57
+ ## Testing Conventions
58
+
59
+ - Fixtures in `conftest.py` generate xlsx files programmatically using openpyxl
60
+ - No test data files checked in — everything is generated at test time
61
+ - Tests cover: loader engines, merge resolution, header detection, cell normalization, CLI flags, end-to-end xlsx→jsonl
xl2jsonl-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Abhijit Choudhury
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,290 @@
1
+ Metadata-Version: 2.4
2
+ Name: xl2jsonl
3
+ Version: 0.1.0
4
+ Summary: Convert Excel files to JSONL with robust merged-cell handling
5
+ Project-URL: Homepage, https://github.com/abhijit-choudhury/xl2jsonl
6
+ Project-URL: Repository, https://github.com/abhijit-choudhury/xl2jsonl
7
+ Project-URL: Issues, https://github.com/abhijit-choudhury/xl2jsonl/issues
8
+ Author: Abhijit Choudhury
9
+ License-Expression: MIT
10
+ License-File: LICENSE
11
+ Classifier: Development Status :: 4 - Beta
12
+ Classifier: Intended Audience :: Developers
13
+ Classifier: License :: OSI Approved :: MIT License
14
+ Classifier: Programming Language :: Python :: 3
15
+ Classifier: Programming Language :: Python :: 3.10
16
+ Classifier: Programming Language :: Python :: 3.11
17
+ Classifier: Programming Language :: Python :: 3.12
18
+ Classifier: Programming Language :: Python :: 3.13
19
+ Classifier: Topic :: File Formats
20
+ Classifier: Topic :: Utilities
21
+ Requires-Python: >=3.10
22
+ Requires-Dist: click>=8.0
23
+ Requires-Dist: openpyxl>=3.1.0
24
+ Requires-Dist: orjson>=3.9
25
+ Requires-Dist: python-calamine>=0.2.0
26
+ Description-Content-Type: text/markdown
27
+
28
+ # xl2jsonl
29
+
30
+ Convert Excel (`.xlsx`, `.xls`, `.xlsb`) and CSV (`.csv`, `.tsv`) files to JSONL with robust handling of merged cells, multiple sheets, and mixed data types. Each output line is a JSON object containing row data as key-value pairs plus metadata for traceability.
31
+
32
+ ## Why xl2jsonl?
33
+
34
+ Excel spreadsheets in the wild are messy — merged headers spanning multiple columns, merged row labels, empty leading rows, inconsistent types. Most conversion tools break on these. xl2jsonl handles them correctly by using a dual-engine approach: **calamine** (Rust-backed) for speed on clean sheets, with automatic fallback to **openpyxl** when merged cells are detected.
35
+
36
+ ## Output Format
37
+
38
+ Each line in the output JSONL file is a self-contained JSON object:
39
+
40
+ ```json
41
+ {"content": {"Name": "Alice", "Age": 30, "Department": "Engineering"}, "metadata": {"filename": "staff.xlsx", "sheet_name": "Employees", "sheet_number": 1, "row_number": 2}}
42
+ {"content": {"Name": "Bob", "Age": 25, "Department": "Design"}, "metadata": {"filename": "staff.xlsx", "sheet_name": "Employees", "sheet_number": 1, "row_number": 3}}
43
+ ```
44
+
45
+ | Field | Description |
46
+ |---|---|
47
+ | `content` | Key-value pairs mapping column headers to cell values for that row |
48
+ | `metadata.filename` | Name of the input file (Excel or CSV/TSV) |
49
+ | `metadata.sheet_name` | Name of the worksheet |
50
+ | `metadata.sheet_number` | 1-based sheet index |
51
+ | `metadata.row_number` | 1-based row number in the original input file |
52
+
53
+ ## Features
54
+
55
+ - **Merged cell resolution** — Merged columns, rows, and blocks are filled so every cell in a merged range gets the top-left cell's value
56
+ - **Duplicate header handling** — Merged headers that produce duplicates get `_2`, `_3` suffixes (e.g., `Revenue`, `Revenue_2`, `Revenue_3`)
57
+ - **Auto header detection** — Skips empty leading rows, merged title rows, and finds the first row where the majority of non-empty cells are strings with sufficient diversity
58
+ - **Multi-table sheet support** — Automatically detects side-by-side tables (separated by empty columns) and stacked tables (separated by empty rows) within a single sheet, processing each independently
59
+ - **Multi-sheet support** — Processes all sheets by default, or filter by name/index
60
+ - **Mixed type handling** — Dates and datetimes convert to ISO 8601 strings, strings are stripped, numbers and booleans pass through, `None` stays `None`
61
+ - **Empty row skipping** — Empty rows are skipped by default (configurable)
62
+ - **Short row padding** — Rows shorter than the header are padded with `None`
63
+ - **Streaming output** — JSONL is written one record at a time via `orjson` for low memory overhead
64
+ - **Dual engine** — calamine for speed on clean sheets, openpyxl fallback for merged cells (configurable)
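
The type handling and short-row padding listed above might look something like the sketch below. The helper names `normalize_cell` and `row_to_dict` are hypothetical and chosen for illustration; only the behavior (ISO 8601 dates, stripped strings, pass-through for numbers, booleans, and `None`, padding with `None`) comes from the feature list.

```python
from datetime import date, datetime
from typing import Any, Dict, Sequence


def normalize_cell(value: Any) -> Any:
    """Normalize one cell value for JSON output."""
    # datetime is a subclass of date, so one check covers both.
    if isinstance(value, date):
        return value.isoformat()
    if isinstance(value, str):
        return value.strip()
    # Numbers, booleans, and None pass through unchanged.
    return value


def row_to_dict(headers: Sequence[str], row: Sequence[Any]) -> Dict[str, Any]:
    """Map headers to normalized values, padding short rows with None."""
    padded = list(row) + [None] * (len(headers) - len(row))
    return {header: normalize_cell(value) for header, value in zip(headers, padded)}
```

For example, `row_to_dict(["Name", "Age", "Dept"], [" Bob ", 25])` yields `{"Name": "Bob", "Age": 25, "Dept": None}`.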
65
+
66
+ ## Dependencies
67
+
68
+ | Package | Purpose |
69
+ |---|---|
70
+ | [python-calamine](https://pypi.org/project/python-calamine/) | Fast Rust-backed Excel reader — supports .xlsx, .xls, .xlsb |
71
+ | [openpyxl](https://pypi.org/project/openpyxl/) | Merged cell detection and resolution (.xlsx only) |
72
+ | [click](https://pypi.org/project/click/) | CLI framework |
73
+ | [orjson](https://pypi.org/project/orjson/) | Fast JSON serialization with native date handling |
74
+
75
+ Dev dependencies: `pytest`, `pytest-cov`, `ruff`
76
+
77
+ ## Setup
78
+
79
+ Requires **Python 3.10+** and **[uv](https://docs.astral.sh/uv/)**.
80
+
81
+ ```bash
82
+ # Clone and install
83
+ git clone <repo-url>
84
+ cd excel-chunker
85
+ uv sync
86
+ ```
87
+
88
+ This creates a virtual environment in `.venv/` and installs all dependencies including dev tools.
89
+
90
+ ## Usage
91
+
92
+ ### CLI
93
+
94
+ ```bash
95
+ # Convert all sheets (output defaults to <input>.jsonl)
96
+ uv run xl2jsonl data.xlsx
97
+
98
+ # CSV and TSV files
99
+ uv run xl2jsonl report.csv -o output.jsonl
100
+ uv run xl2jsonl data.tsv -o output.jsonl
101
+
102
+ # Legacy Excel format
103
+ uv run xl2jsonl legacy.xls -o output.jsonl
104
+
105
+ # Specify output file
106
+ uv run xl2jsonl data.xlsx -o output.jsonl
107
+
108
+ # Select specific sheets (by name or 0-based index, repeatable)
109
+ uv run xl2jsonl data.xlsx -s "Revenue" -s "Costs"
110
+ uv run xl2jsonl data.xlsx -s 0 -s 2
111
+
112
+ # Force a specific engine
113
+ uv run xl2jsonl data.xlsx --engine openpyxl
114
+ uv run xl2jsonl data.xlsx --engine calamine
115
+
116
+ # Set header row explicitly (0-based index)
117
+ uv run xl2jsonl data.xlsx --header-row 2
118
+
119
+ # Keep empty rows in output
120
+ uv run xl2jsonl data.xlsx --keep-empty
121
+
122
+ # Verbose logging (-v for INFO, -vv for DEBUG)
123
+ uv run xl2jsonl data.xlsx -vv
124
+ ```
125
+
126
+ Full CLI help:
127
+
128
+ ```
129
+ Usage: xl2jsonl [OPTIONS] INPUT_FILE
130
+
131
+ Convert an Excel (.xlsx, .xls, .xlsb) or CSV (.csv, .tsv) file to JSONL.
132
+
133
+ Options:
134
+ -o, --output PATH Output JSONL file. Defaults to <input>.jsonl
135
+ -s, --sheet TEXT Sheet name or 0-based index. Repeatable.
136
+ --engine [auto|calamine|openpyxl]
137
+ Reading engine (default: auto)
138
+ --header-row INTEGER 0-based header row index. Default: auto-detect.
139
+ --skip-empty / --keep-empty Skip empty rows (default: skip)
140
+ -v, --verbose -v for INFO, -vv for DEBUG
141
+ --help Show this message and exit.
142
+ ```
143
+
144
+ ### Python API
145
+
146
+ ```python
147
+ import xl2jsonl
148
+
149
+ # Convert to JSONL file — returns row count
150
+ count = xl2jsonl.convert("data.xlsx", "output.jsonl")
151
+ print(f"Wrote {count} records")
152
+
153
+ # Convert and get results in memory — returns list of dicts
154
+ records = xl2jsonl.convert("data.xlsx")
155
+ for r in records:
156
+ print(r["content"], r["metadata"])
157
+
158
+ # Lazy iterator for large files
159
+ for record in xl2jsonl.iter_records("data.xlsx", sheets=["Sheet1"]):
160
+ print(record.data) # dict of header -> value
161
+ print(record.metadata) # Metadata(filename, sheet_name, sheet_number, row_number)
162
+
163
+ # All options
164
+ count = xl2jsonl.convert(
165
+ "data.xlsx",
166
+ "output.jsonl",
167
+ sheets=["Revenue", 2], # filter by name or index
168
+ engine="auto", # "auto" | "calamine" | "openpyxl"
169
+ header_row=None, # None for auto-detect, or 0-based int
170
+ skip_empty_rows=True, # skip rows where all cells are empty
171
+ )
172
+ ```
173
+
174
+ ## How Merged Cells Are Handled
175
+
176
+ ### Merged column headers
177
+
178
+ If a header cell spans columns B through D with value "Revenue":
179
+
180
+ | | A | B | C | D | E |
181
+ |---|---|---|---|---|---|
182
+ | **Header** | ID | Revenue (merged B:D) | | | Profit |
183
+ | **Row 1** | 1 | 100 | 200 | 300 | 50 |
184
+
185
+ After resolution, the headers become `["ID", "Revenue", "Revenue_2", "Revenue_3", "Profit"]`, and each data cell maps to its respective deduplicated header.
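
A minimal sketch of the suffixing scheme, assuming a simple per-name counter (the function name `dedupe_headers` is hypothetical and this is not necessarily how the package implements it):

```python
from typing import Dict, List, Sequence


def dedupe_headers(headers: Sequence[str]) -> List[str]:
    """Append _2, _3, ... to the second and later occurrences of a header."""
    seen: Dict[str, int] = {}
    result: List[str] = []
    for header in headers:
        seen[header] = seen.get(header, 0) + 1
        count = seen[header]
        result.append(header if count == 1 else f"{header}_{count}")
    return result


print(dedupe_headers(["ID", "Revenue", "Revenue", "Revenue", "Profit"]))
# ['ID', 'Revenue', 'Revenue_2', 'Revenue_3', 'Profit']
```

This sketch does not guard against a sheet that already contains a literal `Revenue_2` column; a fuller version would need to check generated names against existing ones.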
186
+
187
+ ### Merged row labels
188
+
189
+ If a cell in column A spans rows 4-5 with value "Total":
190
+
191
+ | | A | B |
192
+ |---|---|---|
193
+ | **Row 4** | Total (merged A4:A5) | 250 |
194
+ | **Row 5** | | 300 |
195
+
196
+ After resolution, both rows get `"Total"` as their value for that column.
197
+
198
+ ### Merged blocks
199
+
200
+ A 2x3 merged region is filled entirely with the top-left cell's value.
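
The fill rule for all three cases (columns, rows, blocks) can be sketched on a plain list-of-lists grid. The example assumes the merged ranges have already been extracted as 0-based, inclusive `(min_row, min_col, max_row, max_col)` tuples; that tuple layout and the function name are assumptions for illustration.

```python
from typing import Any, List, Sequence, Tuple

MergeRange = Tuple[int, int, int, int]  # (min_row, min_col, max_row, max_col)


def fill_merged(
    grid: Sequence[Sequence[Any]], merged_ranges: Sequence[MergeRange]
) -> List[List[Any]]:
    """Return a copy of grid where each merged range holds its top-left value."""
    filled = [list(row) for row in grid]  # copy so the input is not mutated
    for min_row, min_col, max_row, max_col in merged_ranges:
        top_left = filled[min_row][min_col]
        for r in range(min_row, max_row + 1):
            for c in range(min_col, max_col + 1):
                filled[r][c] = top_left
    return filled
```

With openpyxl, ranges of this kind are available from `worksheet.merged_cells.ranges`; their bounds are 1-based, so they would need converting to the 0-based tuples assumed here.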
201
+
202
+ ### Multi-table sheets
203
+
204
+ Sheets with multiple tables laid out side-by-side or stacked vertically are automatically detected and split into independent regions:
205
+
206
+ | | A | B | C | | E | F | G |
207
+ |---|---|---|---|---|---|---|---|
208
+ | **Row 1** | Account: | | ACME Corp | | *Note: ...* | | |
209
+ | | | | | | | | |
210
+ | **Row 3** | **Name** | **Score** | **Grade** | | **Region** | **Revenue** | **Growth** |
211
+ | **Row 4** | Alice | 95 | A | | EMEA | 1.2M | 5% |
212
+ | **Row 5** | Bob | 88 | B | | APAC | 800K | 12% |
213
+
214
+ This produces two separate tables: one with headers `Name, Score, Grade` and another with `Region, Revenue, Growth`. The metadata row and note are detected as separate regions and skipped (no valid header found).
215
+
216
+ **How it works:**
217
+ 1. Columns are grouped by occupancy — a column needs data in a meaningful number of rows (10%+ of the busiest column) to count, filtering out noise from titles/notes that span across table boundaries
218
+ 2. Groups are split where 1+ empty columns separate them
219
+ 3. Within each column group, row blocks are split by empty row gaps
220
+ 4. Each block gets independent header detection and processing
221
+ 5. Blocks with no detected header are silently skipped
222
+
223
+ ### Engine selection (`auto` mode)
224
+
225
+ 1. All sheets are read with calamine (fast)
226
+ 2. Each sheet is probed for merged cell ranges via openpyxl
227
+ 3. Sheets with merges are re-read through openpyxl with full merge resolution
228
+ 4. Sheets without merges use the calamine data as-is
229
+
230
+ This gives you the speed of calamine on clean sheets while correctly handling merges where they exist.
231
+
232
+ ## Project Structure
233
+
234
+ ```
235
+ excel-chunker/
236
+ ├── pyproject.toml # Package config, dependencies, CLI entry point
237
+ ├── src/xl2jsonl/
238
+ │ ├── __init__.py # Public API: convert(), iter_records()
239
+ │ ├── models.py # Data classes: SheetData, RowRecord, Metadata
240
+ │ ├── exceptions.py # Xl2JsonlError, LoaderError, EmptySheetError, NoHeaderError
241
+ │ ├── loader.py # Dual-engine Excel reader + merge resolution
242
+ │ ├── chunker.py # Header detection, row-to-dict, cell normalization
243
+ │ ├── writer.py # JSONL streaming output via orjson
244
+ │ └── cli.py # Click CLI entry point
245
+ └── tests/
246
+ ├── conftest.py # Programmatic xlsx fixture generators
247
+ ├── test_loader.py # Loader unit tests (engines, merges, sheet selection)
248
+ ├── test_chunker.py # Chunker unit tests (headers, normalization, edge cases)
249
+ ├── test_writer.py # Writer unit tests (JSONL format, streaming)
250
+ ├── test_cli.py # CLI tests via Click CliRunner
251
+ └── test_integration.py # End-to-end tests (xlsx -> jsonl -> verify)
252
+ ```
253
+
254
+ ## Running Tests
255
+
256
+ ```bash
257
+ # Run all tests
258
+ uv run pytest
259
+
260
+ # Verbose output
261
+ uv run pytest -v
262
+
263
+ # With coverage
264
+ uv run pytest --cov=xl2jsonl
265
+
266
+ # Run a specific test file
267
+ uv run pytest tests/test_loader.py
268
+ ```
269
+
270
+ ## Supported Formats
271
+
272
+ | Format | Extension | Merge Resolution | Notes |
273
+ |---|---|---|---|
274
+ | Excel (modern) | `.xlsx` | Full (openpyxl) | Default engine: calamine with openpyxl fallback for merges |
275
+ | Excel (legacy) | `.xls` | No | Read via calamine only |
276
+ | Excel Binary | `.xlsb` | No | Read via calamine only |
277
+ | CSV | `.csv` | N/A | Basic type inference (int, float, bool) |
278
+ | TSV | `.tsv` | N/A | Tab-delimited, same type inference |
279
+
280
+ ## Limitations
281
+
282
+ - Merged cell resolution is only available for `.xlsx` files (openpyxl limitation) — `.xls` and `.xlsb` are read via calamine without merge detection
283
+ - The entire sheet grid is materialized in memory during loading (Excel files are not row-streamable); the chunker and writer stages stream lazily
284
+ - Header auto-detection uses heuristics (string majority, value diversity, column coverage) — use `--header-row` to override if it guesses wrong
285
+ - Formulas are not recalculated: openpyxl reads their cached values (`data_only=True`), so if a file was never opened and saved in Excel after formula changes, values may be stale or missing
286
+ - CSV type inference is best-effort: numeric strings become int/float, "true"/"false" become booleans, everything else stays as strings
287
+
288
+ ## License
289
+
290
+ MIT
@@ -0,0 +1,263 @@
1
+ # xl2jsonl
2
+
3
+ Convert Excel (`.xlsx`, `.xls`, `.xlsb`) and CSV (`.csv`, `.tsv`) files to JSONL with robust handling of merged cells, multiple sheets, and mixed data types. Each output line is a JSON object containing row data as key-value pairs plus metadata for traceability.
4
+
5
+ ## Why xl2jsonl?
6
+
7
+ Excel spreadsheets in the wild are messy — merged headers spanning multiple columns, merged row labels, empty leading rows, inconsistent types. Most conversion tools break on these. xl2jsonl handles them correctly by using a dual-engine approach: **calamine** (Rust-backed) for speed on clean sheets, with automatic fallback to **openpyxl** when merged cells are detected.
8
+
9
+ ## Output Format
10
+
11
+ Each line in the output JSONL file is a self-contained JSON object:
12
+
13
+ ```json
14
+ {"content": {"Name": "Alice", "Age": 30, "Department": "Engineering"}, "metadata": {"filename": "staff.xlsx", "sheet_name": "Employees", "sheet_number": 1, "row_number": 2}}
15
+ {"content": {"Name": "Bob", "Age": 25, "Department": "Design"}, "metadata": {"filename": "staff.xlsx", "sheet_name": "Employees", "sheet_number": 1, "row_number": 3}}
16
+ ```
17
+
18
+ | Field | Description |
19
+ |---|---|
20
+ | `content` | Key-value pairs mapping column headers to cell values for that row |
21
+ | `metadata.filename` | Name of the input file (Excel or CSV/TSV) |
22
+ | `metadata.sheet_name` | Name of the worksheet |
23
+ | `metadata.sheet_number` | 1-based sheet index |
24
+ | `metadata.row_number` | 1-based row number in the original input file |
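
Because every line is a standalone JSON object, the output can be consumed with the standard library alone. The sketch below parses the two sample lines shown earlier; `read_records` is a hypothetical helper, not part of the xl2jsonl API.

```python
import json
from typing import Any, Dict, Iterable, Iterator

SAMPLE = [
    '{"content": {"Name": "Alice", "Age": 30, "Department": "Engineering"}, '
    '"metadata": {"filename": "staff.xlsx", "sheet_name": "Employees", '
    '"sheet_number": 1, "row_number": 2}}',
    '{"content": {"Name": "Bob", "Age": 25, "Department": "Design"}, '
    '"metadata": {"filename": "staff.xlsx", "sheet_name": "Employees", '
    '"sheet_number": 1, "row_number": 3}}',
]


def read_records(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-blank JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


for record in read_records(SAMPLE):
    meta = record["metadata"]
    # Trace each record back to its source sheet and row.
    print(meta["sheet_name"], meta["row_number"], record["content"]["Name"])
```

The same function works on an open file handle, since iterating a text file yields one line at a time.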
25
+
26
+ ## Features
27
+
28
+ - **Merged cell resolution** — Merged columns, rows, and blocks are filled so every cell in a merged range gets the top-left cell's value
29
+ - **Duplicate header handling** — Merged headers that produce duplicates get `_2`, `_3` suffixes (e.g., `Revenue`, `Revenue_2`, `Revenue_3`)
30
+ - **Auto header detection** — Skips empty leading rows, merged title rows, and finds the first row where the majority of non-empty cells are strings with sufficient diversity
31
+ - **Multi-table sheet support** — Automatically detects side-by-side tables (separated by empty columns) and stacked tables (separated by empty rows) within a single sheet, processing each independently
32
+ - **Multi-sheet support** — Processes all sheets by default, or filter by name/index
33
+ - **Mixed type handling** — Dates and datetimes convert to ISO 8601 strings, strings are stripped, numbers and booleans pass through, `None` stays `None`
34
+ - **Empty row skipping** — Empty rows are skipped by default (configurable)
35
+ - **Short row padding** — Rows shorter than the header are padded with `None`
36
+ - **Streaming output** — JSONL is written one record at a time via `orjson` for low memory overhead
37
+ - **Dual engine** — calamine for speed on clean sheets, openpyxl fallback for merged cells (configurable)
38
+
39
+ ## Dependencies
40
+
41
+ | Package | Purpose |
42
+ |---|---|
43
+ | [python-calamine](https://pypi.org/project/python-calamine/) | Fast Rust-backed Excel reader — supports .xlsx, .xls, .xlsb |
44
+ | [openpyxl](https://pypi.org/project/openpyxl/) | Merged cell detection and resolution (.xlsx only) |
45
+ | [click](https://pypi.org/project/click/) | CLI framework |
46
+ | [orjson](https://pypi.org/project/orjson/) | Fast JSON serialization with native date handling |
47
+
48
+ Dev dependencies: `pytest`, `pytest-cov`, `ruff`
49
+
50
+ ## Setup
51
+
52
+ Requires **Python 3.10+** and **[uv](https://docs.astral.sh/uv/)**.
53
+
54
+ ```bash
55
+ # Clone and install
56
+ git clone <repo-url>
57
+ cd excel-chunker
58
+ uv sync
59
+ ```
60
+
61
+ This creates a virtual environment in `.venv/` and installs all dependencies including dev tools.
62
+
63
+ ## Usage
64
+
65
+ ### CLI
66
+
67
+ ```bash
68
+ # Convert all sheets (output defaults to <input>.jsonl)
69
+ uv run xl2jsonl data.xlsx
70
+
71
+ # CSV and TSV files
72
+ uv run xl2jsonl report.csv -o output.jsonl
73
+ uv run xl2jsonl data.tsv -o output.jsonl
74
+
75
+ # Legacy Excel format
76
+ uv run xl2jsonl legacy.xls -o output.jsonl
77
+
78
+ # Specify output file
79
+ uv run xl2jsonl data.xlsx -o output.jsonl
80
+
81
+ # Select specific sheets (by name or 0-based index, repeatable)
82
+ uv run xl2jsonl data.xlsx -s "Revenue" -s "Costs"
83
+ uv run xl2jsonl data.xlsx -s 0 -s 2
84
+
85
+ # Force a specific engine
86
+ uv run xl2jsonl data.xlsx --engine openpyxl
87
+ uv run xl2jsonl data.xlsx --engine calamine
88
+
89
+ # Set header row explicitly (0-based index)
90
+ uv run xl2jsonl data.xlsx --header-row 2
91
+
92
+ # Keep empty rows in output
93
+ uv run xl2jsonl data.xlsx --keep-empty
94
+
95
+ # Verbose logging (-v for INFO, -vv for DEBUG)
96
+ uv run xl2jsonl data.xlsx -vv
97
+ ```
98
+
99
+ Full CLI help:
100
+
101
+ ```
102
+ Usage: xl2jsonl [OPTIONS] INPUT_FILE
103
+
104
+ Convert an Excel (.xlsx, .xls, .xlsb) or CSV (.csv, .tsv) file to JSONL.
105
+
106
+ Options:
107
+ -o, --output PATH Output JSONL file. Defaults to <input>.jsonl
108
+ -s, --sheet TEXT Sheet name or 0-based index. Repeatable.
109
+ --engine [auto|calamine|openpyxl]
110
+ Reading engine (default: auto)
111
+ --header-row INTEGER 0-based header row index. Default: auto-detect.
112
+ --skip-empty / --keep-empty Skip empty rows (default: skip)
113
+ -v, --verbose -v for INFO, -vv for DEBUG
114
+ --help Show this message and exit.
115
+ ```
116
+
117
+ ### Python API
118
+
119
+ ```python
120
+ import xl2jsonl
121
+
122
+ # Convert to JSONL file — returns row count
123
+ count = xl2jsonl.convert("data.xlsx", "output.jsonl")
124
+ print(f"Wrote {count} records")
125
+
126
+ # Convert and get results in memory — returns list of dicts
127
+ records = xl2jsonl.convert("data.xlsx")
128
+ for r in records:
129
+ print(r["content"], r["metadata"])
130
+
131
+ # Lazy iterator for large files
132
+ for record in xl2jsonl.iter_records("data.xlsx", sheets=["Sheet1"]):
133
+ print(record.data) # dict of header -> value
134
+ print(record.metadata) # Metadata(filename, sheet_name, sheet_number, row_number)
135
+
136
+ # All options
137
+ count = xl2jsonl.convert(
138
+ "data.xlsx",
139
+ "output.jsonl",
140
+ sheets=["Revenue", 2], # filter by name or index
141
+ engine="auto", # "auto" | "calamine" | "openpyxl"
142
+ header_row=None, # None for auto-detect, or 0-based int
143
+ skip_empty_rows=True, # skip rows where all cells are empty
144
+ )
145
+ ```
146
+
147
+ ## How Merged Cells Are Handled
148
+
149
+ ### Merged column headers
150
+
151
+ If a header cell spans columns B through D with value "Revenue":
152
+
153
+ | | A | B | C | D | E |
154
+ |---|---|---|---|---|---|
155
+ | **Header** | ID | Revenue (merged B:D) | | | Profit |
156
+ | **Row 1** | 1 | 100 | 200 | 300 | 50 |
157
+
158
+ After resolution, the headers become `["ID", "Revenue", "Revenue_2", "Revenue_3", "Profit"]`, and each data cell maps to its respective deduplicated header.
159
+
160
+ ### Merged row labels
161
+
162
+ If a cell in column A spans rows 4-5 with value "Total":
163
+
164
+ | | A | B |
165
+ |---|---|---|
166
+ | **Row 4** | Total (merged A4:A5) | 250 |
167
+ | **Row 5** | | 300 |
168
+
169
+ After resolution, both rows get `"Total"` as their value for that column.
170
+
171
+ ### Merged blocks
172
+
173
+ A 2x3 merged region is filled entirely with the top-left cell's value.
174
+
175
+ ### Multi-table sheets
176
+
177
+ Sheets with multiple tables laid out side-by-side or stacked vertically are automatically detected and split into independent regions:
178
+
179
+ | | A | B | C | | E | F | G |
180
+ |---|---|---|---|---|---|---|---|
181
+ | **Row 1** | Account: | | ACME Corp | | *Note: ...* | | |
182
+ | | | | | | | | |
183
+ | **Row 3** | **Name** | **Score** | **Grade** | | **Region** | **Revenue** | **Growth** |
184
+ | **Row 4** | Alice | 95 | A | | EMEA | 1.2M | 5% |
185
+ | **Row 5** | Bob | 88 | B | | APAC | 800K | 12% |
186
+
187
+ This produces two separate tables: one with headers `Name, Score, Grade` and another with `Region, Revenue, Growth`. The metadata row and note are detected as separate regions and skipped (no valid header found).
188
+
189
+ **How it works:**
190
+ 1. Columns are grouped by occupancy — a column needs data in a meaningful number of rows (10%+ of the busiest column) to count, filtering out noise from titles/notes that span across table boundaries
191
+ 2. Groups are split where 1+ empty columns separate them
192
+ 3. Within each column group, row blocks are split by empty row gaps
193
+ 4. Each block gets independent header detection and processing
194
+ 5. Blocks with no detected header are silently skipped
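
Step 3 (splitting a column group into row blocks at empty rows) could be sketched as below. The function name and the inclusive 0-based `(first_row, last_row)` return shape are assumptions for illustration.

```python
from typing import Any, List, Sequence, Tuple


def split_row_blocks(rows: Sequence[Sequence[Any]]) -> List[Tuple[int, int]]:
    """Return (first_row, last_row) pairs for runs of non-empty rows."""
    blocks: List[Tuple[int, int]] = []
    start = None
    for index, row in enumerate(rows):
        is_empty = all(value is None or value == "" for value in row)
        if not is_empty and start is None:
            start = index  # a new block begins
        elif is_empty and start is not None:
            blocks.append((start, index - 1))  # an empty row closes the block
            start = None
    if start is not None:
        blocks.append((start, len(rows) - 1))
    return blocks


left_group = [
    ["Account:", None, "ACME Corp"],
    [None, None, None],
    ["Name", "Score", "Grade"],
    ["Alice", 95, "A"],
    ["Bob", 88, "B"],
]
print(split_row_blocks(left_group))  # [(0, 0), (2, 4)]
```

Applied to the left-hand column group of the example sheet, this yields one single-row block for the `Account:` line and one block for the actual table; header detection then runs on each block separately.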
195
+
196
+ ### Engine selection (`auto` mode)
197
+
198
+ 1. All sheets are read with calamine (fast)
199
+ 2. Each sheet is probed for merged cell ranges via openpyxl
200
+ 3. Sheets with merges are re-read through openpyxl with full merge resolution
201
+ 4. Sheets without merges use the calamine data as-is
202
+
203
+ This gives you the speed of calamine on clean sheets while correctly handling merges where they exist.
204
+
205
+ ## Project Structure
206
+
207
+ ```
208
+ excel-chunker/
209
+ ├── pyproject.toml # Package config, dependencies, CLI entry point
210
+ ├── src/xl2jsonl/
211
+ │ ├── __init__.py # Public API: convert(), iter_records()
212
+ │ ├── models.py # Data classes: SheetData, RowRecord, Metadata
213
+ │ ├── exceptions.py # Xl2JsonlError, LoaderError, EmptySheetError, NoHeaderError
214
+ │ ├── loader.py # Dual-engine Excel reader + merge resolution
215
+ │ ├── chunker.py # Header detection, row-to-dict, cell normalization
216
+ │ ├── writer.py # JSONL streaming output via orjson
217
+ │ └── cli.py # Click CLI entry point
218
+ └── tests/
219
+ ├── conftest.py # Programmatic xlsx fixture generators
220
+ ├── test_loader.py # Loader unit tests (engines, merges, sheet selection)
221
+ ├── test_chunker.py # Chunker unit tests (headers, normalization, edge cases)
222
+ ├── test_writer.py # Writer unit tests (JSONL format, streaming)
223
+ ├── test_cli.py # CLI tests via Click CliRunner
224
+ └── test_integration.py # End-to-end tests (xlsx -> jsonl -> verify)
225
+ ```
226
+
227
+ ## Running Tests
228
+
229
+ ```bash
230
+ # Run all tests
231
+ uv run pytest
232
+
233
+ # Verbose output
234
+ uv run pytest -v
235
+
236
+ # With coverage
237
+ uv run pytest --cov=xl2jsonl
238
+
239
+ # Run a specific test file
240
+ uv run pytest tests/test_loader.py
241
+ ```
242
+
243
+ ## Supported Formats
244
+
245
+ | Format | Extension | Merge Resolution | Notes |
246
+ |---|---|---|---|
247
+ | Excel (modern) | `.xlsx` | Full (openpyxl) | Default engine: calamine with openpyxl fallback for merges |
248
+ | Excel (legacy) | `.xls` | No | Read via calamine only |
249
+ | Excel Binary | `.xlsb` | No | Read via calamine only |
250
+ | CSV | `.csv` | N/A | Basic type inference (int, float, bool) |
251
+ | TSV | `.tsv` | N/A | Tab-delimited, same type inference |
252
+
253
+ ## Limitations
254
+
255
+ - Merged cell resolution is only available for `.xlsx` files (openpyxl limitation) — `.xls` and `.xlsb` are read via calamine without merge detection
256
+ - The entire sheet grid is materialized in memory during loading (Excel files are not row-streamable); the chunker and writer stages stream lazily
257
+ - Header auto-detection uses heuristics (string majority, value diversity, column coverage) — use `--header-row` to override if it guesses wrong
258
+ - Formulas are not recalculated: openpyxl reads their cached values (`data_only=True`), so if a file was never opened and saved in Excel after formula changes, values may be stale or missing
259
+ - CSV type inference is best-effort: numeric strings become int/float, "true"/"false" become booleans, everything else stays as strings
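
To make the best-effort CSV inference concrete, here is one way such a rule could be written. It is a sketch under assumptions: the function name is hypothetical, and the case-insensitive boolean match and the order of checks are guesses rather than documented behavior.

```python
from typing import Union


def infer_type(text: str) -> Union[bool, int, float, str]:
    """Convert a CSV field to bool, int, or float when it parses cleanly."""
    stripped = text.strip()
    lowered = stripped.lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    try:
        return int(stripped)
    except ValueError:
        pass
    try:
        return float(stripped)
    except ValueError:
        return text  # everything else stays a string
```

A rule like this turns `"007"` into the integer `7`, so identifier-like columns (ZIP codes, account numbers) lose leading zeros; that is the kind of case the "best-effort" caveat covers.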
260
+
261
+ ## License
262
+
263
+ MIT
Binary file
Binary file
@@ -0,0 +1 @@
1
+ {"content":{"Version":14.0,"Risk Module":"01. Schedule & Delay","Category":"Schedule & Delay","Risk Question":"(S&D01) Bid Subject To Prior Sale?","Requirements Response":"","Requirements Excerpt":"","Requirement Response Analysis":"","Requirement Excerpt Analysis":"","Author":"","SME Correct Respones":"","SME Correct Context":"","Result":"","Failure Reason/ Comments":""},"metadata":{"filename":"data.xlsx","sheet_name":"Sample 1","sheet_number":1,"row_number":3}}