PyPI - grepxcel - Versions diffs - 0.1.0__tar.gz - Mend

grepxcel 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (45) hide show

grepxcel-0.1.0/LICENSE +21 -0
grepxcel-0.1.0/PKG-INFO +672 -0
grepxcel-0.1.0/README.md +625 -0
grepxcel-0.1.0/grepxcel/__init__.py +47 -0
grepxcel-0.1.0/grepxcel/__main__.py +3 -0
grepxcel-0.1.0/grepxcel/cli.py +1068 -0
grepxcel-0.1.0/grepxcel/color.py +59 -0
grepxcel-0.1.0/grepxcel/docs_generator.py +245 -0
grepxcel-0.1.0/grepxcel/doctor.py +289 -0
grepxcel-0.1.0/grepxcel/drafter.py +1368 -0
grepxcel-0.1.0/grepxcel/engine.py +1005 -0
grepxcel-0.1.0/grepxcel/examples/01_simple_invoice/data.xlsx +0 -0
grepxcel-0.1.0/grepxcel/examples/01_simple_invoice/pattern.xlsx +0 -0
grepxcel-0.1.0/grepxcel/examples/02_product_catalog/data.xlsx +0 -0
grepxcel-0.1.0/grepxcel/examples/02_product_catalog/pattern.xlsx +0 -0
grepxcel-0.1.0/grepxcel/examples/03_expense_report/data.xlsx +0 -0
grepxcel-0.1.0/grepxcel/examples/03_expense_report/pattern.xlsx +0 -0
grepxcel-0.1.0/grepxcel/examples/04_loan_schedule/data.xlsx +0 -0
grepxcel-0.1.0/grepxcel/examples/04_loan_schedule/pattern.xlsx +0 -0
grepxcel-0.1.0/grepxcel/examples/__init__.py +0 -0
grepxcel-0.1.0/grepxcel/examples_generator.py +123 -0
grepxcel-0.1.0/grepxcel/lint.py +267 -0
grepxcel-0.1.0/grepxcel/logger.py +799 -0
grepxcel-0.1.0/grepxcel/mcp_config.py +76 -0
grepxcel-0.1.0/grepxcel/mcp_server.py +226 -0
grepxcel-0.1.0/grepxcel/model_manager.py +552 -0
grepxcel-0.1.0/grepxcel/models.py +62 -0
grepxcel-0.1.0/grepxcel/pattern_check.py +194 -0
grepxcel-0.1.0/grepxcel/pattern_colors.py +118 -0
grepxcel-0.1.0/grepxcel/pattern_parser.py +687 -0
grepxcel-0.1.0/grepxcel/proxy_support.py +195 -0
grepxcel-0.1.0/grepxcel/sbom.py +214 -0
grepxcel-0.1.0/grepxcel/schema.py +253 -0
grepxcel-0.1.0/grepxcel/security.py +316 -0
grepxcel-0.1.0/grepxcel/skill.py +126 -0
grepxcel-0.1.0/grepxcel/suggester.py +11 -0
grepxcel-0.1.0/grepxcel/utils.py +156 -0
grepxcel-0.1.0/grepxcel.egg-info/PKG-INFO +672 -0
grepxcel-0.1.0/grepxcel.egg-info/SOURCES.txt +43 -0
grepxcel-0.1.0/grepxcel.egg-info/dependency_links.txt +1 -0
grepxcel-0.1.0/grepxcel.egg-info/entry_points.txt +2 -0
grepxcel-0.1.0/grepxcel.egg-info/requires.txt +23 -0
grepxcel-0.1.0/grepxcel.egg-info/top_level.txt +1 -0
grepxcel-0.1.0/pyproject.toml +82 -0
grepxcel-0.1.0/setup.cfg +4 -0

grepxcel-0.1.0/LICENSE ADDED Viewed

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 SCPG
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

grepxcel-0.1.0/PKG-INFO ADDED Viewed

@@ -0,0 +1,672 @@
+Metadata-Version: 2.4
+Name: grepxcel
+Version: 0.1.0
+Summary: Pattern-based data extraction for Excel
+Author: scpg
+License-Expression: MIT
+Project-URL: Homepage, https://github.com/scpg/grepxcel
+Project-URL: Repository, https://github.com/scpg/grepxcel
+Project-URL: Issues, https://github.com/scpg/grepxcel/issues
+Project-URL: Changelog, https://github.com/scpg/grepxcel/blob/main/CHANGELOG.md
+Project-URL: Funding, https://buymeacoffee.com/scpg.dev
+Keywords: excel,xlsx,spreadsheet,data-extraction,pattern,etl,openpyxl
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: End Users/Desktop
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3.14
+Classifier: Topic :: Office/Business :: Financial :: Spreadsheet
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.11
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: openpyxl>=3.1
+Requires-Dist: defusedxml>=0.7
+Requires-Dist: regex>=2024.11.6
+Requires-Dist: structlog>=24.1
+Provides-Extra: suggest
+Requires-Dist: llama-cpp-python>=0.2.90; extra == "suggest"
+Requires-Dist: huggingface_hub>=0.23; extra == "suggest"
+Requires-Dist: platformdirs>=4.0; extra == "suggest"
+Requires-Dist: truststore>=0.9; extra == "suggest"
+Provides-Extra: draft-cloud
+Requires-Dist: anthropic>=0.40; extra == "draft-cloud"
+Requires-Dist: openai>=1.40; extra == "draft-cloud"
+Requires-Dist: google-genai>=0.3; extra == "draft-cloud"
+Requires-Dist: truststore>=0.9; extra == "draft-cloud"
+Provides-Extra: mcp
+Requires-Dist: mcp<2,>=1.23; extra == "mcp"
+Provides-Extra: dev
+Requires-Dist: pytest>=8.0; extra == "dev"
+Requires-Dist: jsonschema>=4.20; extra == "dev"
+Dynamic: license-file
+# grepxcel
+Extract structured data from Excel files — reliably, without writing custom code for every template.
+If you have ever written Python to parse an Excel file from a supplier, a client, or another department — and had it silently produce wrong output the moment they moved a column or renamed a header — grepxcel is built for exactly trying to easily get around this kind of problem.
+You describe the layout of the file once in a **pattern file** (an `.xlsx` workbook or a plain `.csv`). The engine reads any conforming data file and extracts cells and tables into clean, hierarchical JSON. When a file does not match the pattern, grepxcel fails loudly with a clear error — it is never silently wrong.
+Think of it as *[grep](https://en.wikipedia.org/wiki/Grep) for [Excel](https://en.wikipedia.org/wiki/Microsoft_Excel)*: a repeatable extraction layer you define once and run on any number of files.
+![CI](https://github.com/scpg/grepxcel/actions/workflows/ci.yml/badge.svg)
+---
+## Common use cases
+- **Supplier invoices and price lists** — each supplier sends Excel files in their own format. Define a pattern once per template, extract data from every new file automatically.
+- **Timesheets and HR reports** — team leads use slightly different layouts. grepxcel anchors on cell labels, not row numbers, so minor layout drift does not break extraction.
+- **Data pipelines ingesting Excel** — replace brittle `openpyxl` or `pandas` parsing code that breaks when a column shifts. The pattern file lives in git; extraction is deterministic and testable.
+- **Research and clinical data** — field workers submit data in Excel templates. Any file that deviates from the expected structure is rejected immediately, before it can corrupt your dataset.
+*The engineering philosophy behind grepxcel: [docs/MINDSET.md](docs/MINDSET.md).*
+---
+## What it does
+You describe the layout of your Excel sheet in a **pattern file** — either an Excel workbook (`.xlsx`) or a plain `.csv` (handy for hand-editing and git diffs). The engine reads any matching data file and extracts cells and tables into clean, hierarchical JSON — no coding required to define new patterns.
+---
+## How it works
+```
+pattern.xlsx  +  data.xlsx  →  { "po": { "number": "PO-2026" }, "line": [ {…} ] }
+```
+The pattern file has four row types:
+| Row type | Purpose |
+|---|---|
+| `config:` | Global settings: read direction, currency symbol, empty-cell aliases, case-insensitive matching (`ignore.case`) |
+| `lbl:` | Anchor label — matched for position, **never written to output JSON** |
+| `var:` | Data field — extracted and written to output JSON |
+| `doc:` | Comment / documentation row — ignored by the engine |
+Between `START:` and `END:` you list the extraction sequence:
+- `cell:next fieldName` — read the next non-empty cell into a field (alias: `cell:1`)
+- `cell:B5 fieldName` — jump directly to an absolute cell (A1-notation reference)
+- `seek:G5` — reposition the cursor to G5 **without** reading it; the next `cell:next` starts from there
+- `dir:LR` / `dir:TD` — switch the scan direction partway through (left-to-right / top-down); affects subsequent `cell:next`
+- `table:*` — match all instances of a repeating mini-table block
+Dot notation in `var:` field names creates nested output: `po.number` → `{"po": {"number": …}}`.
+### Pattern file formats
+A pattern can be authored as **`.xlsx`** or **`.csv`** — both are read into the
+same internal grid, so they behave identically. CSV is convenient for
+hand-editing and produces clean git diffs. When writing CSV:
+- One pattern row per CSV line; column A is the keyword (`config:`, `lbl:`, `var:`, `cell:…`).
+- Table-template rows start with an **empty first field** (blank column A), e.g. `,HEADER:1,col_a,col_b`.
+- **Quote any regex containing a comma**, e.g. `var,line.qty,integer,"\d{1,3}"`.
+- Plain text only — formulas (a leading `=`) are rejected, exactly as in `.xlsx`.
+---
+## Writing the match pattern (regex) — the simple version
+The last column of a `lbl:` or `var:` row is a **regex**: a little pattern that
+describes *what the cell should look like*. You don't need to know regex to start —
+here's everything most people need.
+**Two rules to remember:**
+1. **The whole cell must match.** If your pattern is `\d{4}` (four digits), the cell
+   `2026` matches but `2026-05` does **not** (the `-05` is left over). You don't add
+   `^` or `$` anchors — grepxcel does that for you.
+2. **Leave it blank to accept anything.** No pattern in the last column = "any value
+   is fine". For `date`/`datetime` fields, always leave it blank (the type is checked,
+   not the text).
+**The building blocks you'll actually use:**
+| You want to match… | Write this | Matches |
+|---|---|---|
+| Any value at all | *(leave blank)* or `.*` | anything |
+| A whole number | `\d+` | `7`, `2026`, `100` |
+| Exactly 4 digits | `\d{4}` | `2026` (not `26`) |
+| Between 1 and 3 digits | `\d{1,3}` | `5`, `42`, `999` |
+| Letters only | `[A-Za-z]+` | `Acme` |
+| A code like `PO-2026` | `PO-\d+` | `PO-1`, `PO-2026` |
+| One of a few words | `paid\|unpaid\|pending` | `paid` |
+| A price like `19.99` | `\d+\.\d{2}` | `19.99` |
+Cheat-sheet: `\d` = a digit, `[A-Z]` = one capital letter, `+` = "one or more",
+`{4}` = "exactly four", `{1,3}` = "between one and three", `.` = any character,
+`.*` = "anything", `|` = "or".
+**Tips for beginners:**
+- **Start loose, then tighten.** Begin with a blank pattern (accept anything), run
+  `extract`, see what comes out, then add a pattern only where you need to be strict.
+- **Use `[A-Z]+`, not `([A-Z])+`.** Both look similar, but the bracket form is faster
+  and the parentheses-with-a-`+` form is rejected by grepxcel as unsafe (it can make
+  matching hang). The tool will tell you if you hit this.
+- **Case doesn't matter?** Add one config row at the top of the pattern file:
+  `config: | ignore.case | yes`. Then `PAID`, `Paid`, and `paid` all match the same
+  pattern. (Default is case-sensitive.)
+- **CSV pattern files:** if your pattern contains a comma (like `\d{1,3}`), wrap it in
+  quotes — `"\d{1,3}"` — so the comma isn't read as a new column.
+> Under the hood these are standard [Python `re`](https://docs.python.org/3/library/re.html)
+> patterns, so anything from that syntax works if you already know regex.
+---
+## Quick start
+### Install
+```bash
+git clone https://github.com/scpg/grepxcel.git
+cd grepxcel
+python -m venv .venv
+.venv/bin/pip install -e .                            # core (extract + docs)
+.venv/bin/python3 scripts/install_llm_deps.py         # optional: local draft (auto-detects GPU)
+```
+**What you actually need** — grepxcel ships in layers, so you only install what you use:
+| You want to… | Install |
+|---|---|
+| Extract data / generate the docs reference | `pip install grepxcel` *(core — nothing extra)* |
+| Draft patterns with the **local** model (offline) | `pip install 'grepxcel[suggest]'` |
+| Draft patterns with a **cloud** model (Claude) | `pip install 'grepxcel[draft-cloud]'` |
+| Use as an **MCP server** for AI agents | `pip install 'grepxcel[mcp]'` |
+The `extract` and `docs` commands work with the core install alone. The `draft`
+command is optional; if its dependencies are missing, grepxcel tells you exactly
+what to install (and points you at the cloud option as an alternative).
+> **Windows (PowerShell):** use `.venv\Scripts\` instead of `.venv/bin/`, e.g.
+> `.venv\Scripts\pip install -e .` and `.venv\Scripts\grepxcel ...`.
+Requires Python **3.11+**.
+### Try it — bundled examples
+Don't have your own Excel files yet? Generate ready-to-run examples:
+```bash
+.venv/bin/grepxcel generate-examples          # creates ./grepxcel-examples/
+cd grepxcel-examples/01_simple_invoice
+.venv/bin/grepxcel extract -p pattern.xlsx data.xlsx
+```
+Each example includes a pattern, a data file, and a README with commands to try.
+### CLI usage
+```bash
+# Extract data from an Excel file using a pattern
+.venv/bin/grepxcel extract -p pattern.xlsx data.xlsx
+# Write output to a directory instead of stdout
+.venv/bin/grepxcel extract -p pattern.xlsx data.xlsx -o output/
+# Process a specific sheet, or every sheet at once
+.venv/bin/grepxcel extract -p pattern.xlsx data.xlsx --sheet 2025
+.venv/bin/grepxcel extract -p pattern.xlsx data.xlsx --all-sheets
+# Verbose mode (step-by-step match log)
+.venv/bin/grepxcel extract -p pattern.xlsx data.xlsx -v
+# Legacy flat output format ({"cells":{}, "tables":[]})
+.venv/bin/grepxcel extract -p pattern.xlsx data.xlsx --format legacy
+# Generate a colour-coded pattern reference file
+.venv/bin/grepxcel docs -o pattern-reference.xlsx
+# Inspect an Excel file before extraction (format, encryption, extent, merges)
+.venv/bin/grepxcel lint data.xlsx
+# Generate a JSON Schema describing the extraction output of a pattern
+.venv/bin/grepxcel schema pattern.xlsx -o schema.json
+# Use a local LLM to draft a starter pattern for an unseen Excel file
+.venv/bin/grepxcel draft data.xlsx -o draft-pattern.xlsx
+```
+### Python API
+The one-call entry point is `grepxcel.extract()`:
+```python
+import grepxcel
+result = grepxcel.extract("pattern.xlsx", "data.xlsx")
+# Fields are grouped by dot-notation prefix
+print(result["po"]["number"])        # "PO-2026"
+print(result["vendor"]["name"])      # "Acme Supplies"
+# Tables are arrays of instance objects
+for row in result["line"][0]["data"]:
+    print(row["item"], row["qty"])
+# Options: a specific sheet, every sheet, or the legacy output shape
+result  = grepxcel.extract("pattern.xlsx", "data.xlsx", sheet="Q1")
+by_sheet = grepxcel.extract("pattern.xlsx", "data.xlsx", all_sheets=True)
+```
+`extract()` is silent by default. For progress/warnings, or to reuse one engine
+across many calls, use the underlying `Engine` directly:
+```python
+from grepxcel import Engine, Logger, VerbosityLevel
+logger = Logger(level=VerbosityLevel.NORMAL)
+result = Engine().process("pattern.xlsx", "data.xlsx", logger=logger)
+```
+---
+## Output format
+Fields defined with dot notation (`po.number`, `po.date`) are grouped into nested objects.
+Tables always produce an array of instance objects, each containing `data`, and optionally
+`header` and `footer` sections.
+```json
+{
+  "po":     { "number": "PO-2026", "date": "2026-05-01" },
+  "vendor": { "name": "Acme Supplies" },
+  "line": [
+    {
+      "data": [
+        { "item": "Laptop", "qty": 2, "price": 1200.00, "total": 2400.00 },
+        { "item": "Dock",   "qty": 6, "price":   75.00, "total":  450.00 }
+      ],
+      "footer": { "label": "Grand Total", "value": 3030.00 }
+    }
+  ]
+}
+```
+Label fields (`lbl:`) are used only for positional anchoring and are never included in output.
+---
+## CLI reference
+### `grepxcel extract`
+| Flag | Default | Purpose |
+|---|---|---|
+| `-p FILE` | required | Pattern file — `.xlsx` or `.csv` |
+| `--format` | `nested` | Output format: `nested` (default) or `legacy` |
+| `-o DIR` | — | Write JSON to directory (stdout if omitted) |
+| `-l FILE` | — | Append structured log to file |
+| `--log-format` | `text` | Log format: `text` (human) or `json` (NDJSON for SIEM/cloud) |
+| `--meta` | off | Add `_meta` block to JSON output (run_id, stats, issues) |
+| `-v` / `-vv` | off | Verbosity: per-field trace (`field ← B1 = value ✓/✗`) / anchor probes |
+| `-d` / `--debug` | off | Same as `-vv` |
+| `--max-rows N` | 2048 | Max data-sheet rows (raise up to Excel's 1,048,576; untested above default) |
+| `--max-columns N` | 1024 | Max data-sheet columns (raise up to Excel's 16,384; untested above default) |
+| `--max-size MB` | 5 | Compressed file size limit |
+| `--max-uncompressed MB` | 50 | Uncompressed ZIP content limit (ZIP bomb guard) |
+| `--max-cell-len N` | 1000 | Max cell chars fed to regex (ReDoS guard) |
+| `--sheet NAME_OR_INDEX` | active | Sheet name or 0-based index to process |
+| `--all-sheets` | off | Process every sheet; output is a dict keyed by sheet name (mutually exclusive with `--sheet`) |
+### `grepxcel validate-pattern`
+Check that a pattern file (`.xlsx` or `.csv`) is valid to use — **without** running
+an extraction. It applies the same rules extraction does (structure, field types,
+multiplicities, regex safety, comments) and additionally flags an empty extraction
+sequence and references to undefined fields. **Exits non-zero** if any file is invalid.
+```bash
+grepxcel validate-pattern pattern.xlsx
+grepxcel validate-pattern pattern.csv -v       # + parsed config, fields, and steps
+grepxcel validate-pattern a.xlsx b.csv         # validate several at once
+```
+| Flag | Purpose |
+|---|---|
+| `FILE...` | One or more pattern files to validate |
+| `-v`, `--verbose` | Print the parsed config, fields, and extraction sequence |
+### `grepxcel docs`
+| Flag | Default | Purpose |
+|---|---|---|
+| `-o FILE` | `pattern-reference.xlsx` | Output path for the reference file |
+### `grepxcel generate-examples`
+Creates 4 ready-to-run example sets (pattern + data + README) in a local directory.
+| Flag | Default | Purpose |
+|---|---|---|
+| `-o DIR` | `./grepxcel-examples/` | Directory to create |
+Examples included: simple invoice (KV), product catalog (table), expense report
+(KV + table + footer), loan schedule (KV header + amortization table).
+### `grepxcel draft`
+Drafts a starter pattern file for an unseen Excel file using an LLM. The output is
+a *starting point* — review and refine the generated regexes before use.
+`grepxcel suggest` is a backward-compatible alias for this command.
+> Curious which model to use, and how good `draft` actually is? See
+> [docs/EVALUATION.md](docs/EVALUATION.md) for the methodology, model
+> comparisons (local vs cloud), and honest notes on where it succeeds and fails.
+| Flag | Default | Purpose |
+|---|---|---|
+| `-o FILE` | `draft_pattern.xlsx` | Write draft pattern to this path |
+| `-v` | off | Print the Excel analysis sent to the model + update status |
+| `--dry-run` | off | Print the analysis that would be sent to the model, then exit (no inference) |
+| `--backend local\|claude\|github\|server\|gemini` | `local` | Inference backend (see below) |
+| `--sheet NAME_OR_INDEX` | active | Sheet to analyse |
+| `--max-size MB` | 5 | Compressed file size limit |
+| `--max-uncompressed MB` | 50 | Uncompressed ZIP content limit (ZIP bomb guard) |
+#### Backends
+| Backend | Install | Notes |
+|---|---|---|
+| `local` *(default)* | `python3 scripts/install_llm_deps.py` | Runs Gemma-4-E4B (GGUF) in-process. **No data leaves your machine.** Model is pinned to a revision and downloaded once (~5 GB); set `GREPXCEL_MODEL_AUTOUPDATE=1` to track upstream, `GREPXCEL_MODEL_DIR` to relocate the cache. |
+| `github` | `pip install -e '.[draft-cloud]'` | **GitHub Models** — free with a GitHub subscription (quota-limited, no per-token charge), and the **highest-quality option** in our eval. Requires `GITHUB_TOKEN` (fine-grained, `Models: read`). Pick a model with `--github-model`, e.g. `openai/gpt-4.1`, `openai/gpt-4o`, `meta/llama-3.3-70b-instruct`. Per-draft token usage + remaining quota are printed. |
+| `claude` | `pip install -e '.[draft-cloud]'` | Anthropic API. Requires `ANTHROPIC_API_KEY`. Prints a one-line privacy notice and per-call token cost (~$0.003–0.04/draft). |
+| `server` | `pip install openai` | Any **OpenAI-compatible server** (LM Studio, Ollama, vLLM, text-generation-inference). Default URL: `http://localhost:1234/v1` (override with `--server-url` or `GREPXCEL_SERVER_URL`). Model is auto-discovered unless `--server-model` is set. Data stays local. |
+| `gemini` | — | **Planned for a future release** — not yet available. Selecting it prints a notice and exits. |
+Backends read keys from a `.env` file in the project (or any parent) directory,
+so you don't have to export them. Real environment variables take precedence.
+On draft quality (execution-based eval, all fixtures): the free **`github`**
+models (`openai/gpt-4.1`, `openai/gpt-4o`, `meta/llama-3.3-70b-instruct`) lead by
+a wide margin, ahead of the `local` model — see
+[docs/EVALUATION.md](docs/EVALUATION.md).
+> **Faster first download (`local` backend):** the model is fetched once from
+> HuggingFace. Anonymous downloads work but are rate-limited and can be slow or
+> stall on the ~5 GB file. For a faster, more reliable first run, set a free
+> [HuggingFace token](https://huggingface.co/settings/tokens) (read scope):
+> `HF_TOKEN=hf_... grepxcel draft data.xlsx`. It's optional and one-time — the
+> model is cached afterwards. If a download stalls, grepxcel prints this tip too.
+> **Privacy:** the `claude` cloud backend sends only the *structure description*
+> of your sheet (column types, sample values, labels) — never the raw file. The
+> `local` backend sends nothing over the network during inference.
+```bash
+grepxcel draft data.xlsx                       # local model (default)
+grepxcel draft data.xlsx --dry-run             # inspect the analysis, no inference
+ANTHROPIC_API_KEY=sk-... grepxcel draft data.xlsx --backend claude
+grepxcel draft data.xlsx --backend server      # LM Studio / Ollama on localhost:1234
+grepxcel draft data.xlsx --backend server --server-url http://host:8080/v1
+```
+#### Model cache & offline / alternative downloads
+The local model is stored once in the platform-appropriate per-user cache
+(resolved with [`platformdirs`](https://pypi.org/project/platformdirs/), the same
+convention pip uses):
+| OS | Default cache |
+|---|---|
+| Linux | `$XDG_CACHE_HOME/grepxcel/models/` (default `~/.cache/grepxcel/models/`) |
+| macOS | `~/Library/Caches/grepxcel/models/` |
+| Windows | `%LOCALAPPDATA%\grepxcel\Cache\models\` |
+Override the location with `GREPXCEL_MODEL_DIR` (handy for Docker volumes or a
+shared model dir) — it takes precedence over the defaults above.
+If the HuggingFace download is slow or blocked, you don't have to let grepxcel
+fetch it — **grepxcel only downloads when the file isn't already in the cache**,
+so you can supply it yourself from any source:
+```bash
+mkdir -p ~/.cache/grepxcel/models            # or your $GREPXCEL_MODEL_DIR
+# download the GGUF anywhere (HF website, a mirror, ModelScope, …), then place it
+# with this EXACT name so grepxcel finds it and skips the download:
+mv gemma-4-E4B-it-Q4_K_M.gguf  ~/.cache/grepxcel/models/
+grepxcel draft data.xlsx                     # uses the local file — no download
+```
+Other options:
+- **Mirror:** `huggingface_hub` honors `HF_ENDPOINT`, e.g.
+  `HF_ENDPOINT=https://hf-mirror.com grepxcel draft data.xlsx` (third-party mirror).
+- **Token:** `HF_TOKEN=hf_...` removes the anonymous rate limit (fastest fix).
+> The normal HuggingFace download is **hash-verified** against the pinned
+> revision. A **manually-placed file is also trusted automatically when its
+> sha256 matches the pinned build** (grepxcel ships the known-good hash, so the
+> trust is portable across machines). A file that doesn't match a known-good
+> hash and has no recorded fingerprint is used on trust with a warning — make
+> sure such a source is trustworthy, or pass `--allow-unverified-model` to
+> silence the warning.
+#### Behind a corporate proxy / TLS inspection (experimental)
+Corporate networks often route HTTPS through an inspection proxy (NetSkope,
+Zscaler, …) that re-signs traffic with a **company CA** Python doesn't trust by
+default — so the model download and the cloud backends fail with
+`CERTIFICATE_VERIFY_FAILED`. grepxcel can trust that CA (TLS verification stays
+on — it is never disabled):
+- **Best:** have IT install the corporate root CA in your OS trust store, then
+  `pip install truststore` (it ships with the `suggest` / `draft-cloud` extras).
+  grepxcel then uses the OS store automatically — no env vars needed.
+- **Or** point grepxcel at the CA `.pem` file:
+  ```bash
+  grepxcel draft data.xlsx --ca-bundle /path/to/corporate-ca.pem
+  # or: export GREPXCEL_CA_BUNDLE=/path/to/corporate-ca.pem
+  #     (REQUESTS_CA_BUNDLE / SSL_CERT_FILE are also honored)
+  ```
+- If your network requires a proxy, set `HTTPS_PROXY` / `NO_PROXY` as usual
+  (both `requests` and `httpx` read them automatically).
+- Run `grepxcel doctor` to verify the setup — it does a live TLS handshake.
+> ⚠️ This support has **not been tested against a real intercept proxy**;
+> grepxcel prints a one-time notice when proxy/CA settings are in effect.
+### `grepxcel lint`
+Inspect an Excel data file before extraction — catches issues that would cause
+extraction to fail or produce unexpected results.
+```bash
+grepxcel lint data.xlsx
+grepxcel lint jan.xlsx feb.xlsx        # lint several files
+```
+It prints a `✓/⚠/✗/ℹ` checklist covering:
+- **File format** — extension, ZIP integrity, encryption/IRM detection
+- **Microsoft Information Protection (MIP)** — detects OLE Compound Documents
+  (encrypted by sensitivity labels) and explains how to obtain an extractable copy
+- **Sheet dimensions** — declared vs real used extent (detects styling inflation)
+- **Merged cells** — listed with a note on how grepxcel handles them
+- **Formula cells** — warns about potentially stale cached values
+- **Empty sheets** — nothing to extract
+- **Advisory notes** — known limitations not yet auto-detected (password
+  protection, conditional formatting, pivot tables, VBA)
+### `grepxcel schema`
+Generate a [JSON Schema](https://json-schema.org/) (draft 2020-12) that describes
+the extraction output for a pattern. Use it to validate extracted JSON with any
+standard JSON Schema validator — handy when importing thousands of files that must
+all follow the same structure.
+```bash
+grepxcel schema pattern.xlsx               # print schema to stdout
+grepxcel schema pattern.xlsx -o schema.json
+grepxcel schema pattern.csv                # CSV patterns work too
+```
+The schema mirrors the nested output shape:
+- **Field types** map to JSON types (`currency`/`percentage` → `number`,
+  `date`/`datetime` → `string` with `format: date-time`, etc.)
+- **Dot-notation** fields (`po.number`) become nested objects
+- **Tables** become arrays of objects with `_source`/`header`/`data`/`footer`
+  sub-objects
+- All fields are **nullable** (`["string", "null"]`) since an empty cell extracts
+  as `null`
+Validate an extraction against it with any tool, e.g. Python's `jsonschema`:
+```bash
+grepxcel schema pattern.xlsx -o schema.json
+grepxcel extract -p pattern.xlsx data.xlsx -o out/
+python -m jsonschema -i out/data.json schema.json
+```
+### `grepxcel doctor`
+Preflight check that tells you exactly what's needed to run the tool.
+```bash
+grepxcel doctor            # everything: extract + draft + proxy/TLS
+grepxcel doctor extract    # only what extract needs (offline)
+grepxcel doctor draft      # only what draft needs (deps, keys, model, proxy/TLS)
+```
+It prints a `✓/⚠/✗` checklist — Python version, dependencies, API keys, the
+local-model cache + disk space, and proxy/CA config with a live handshake — and
+**exits non-zero** if the selected area has a blocking (`✗`) problem.
+| Flag | Purpose |
+|---|---|
+| `area` | `extract`, `draft`, or `all` (default `all`) |
+| `--no-probe` | Skip the live TLS handshake (offline / faster) |
+### `grepxcel mcp`
+Starts grepxcel as a [Model Context Protocol](https://modelcontextprotocol.io)
+(MCP) server so AI agents can call grepxcel tools directly.
+```bash
+pip install 'grepxcel[mcp]'       # install the MCP dependency
+grepxcel mcp-config               # print the config for your AI agent
+grepxcel mcp                      # start the server (stdio transport)
+```
+**Exposed tools:** `extract`, `validate_pattern`, `lint`, `schema`, `docs`,
+`doctor`, `generate_examples`.
+| Flag | Default | Purpose |
+|---|---|---|
+| `--target` (mcp-config) | `claude-code` | Config format: `claude-code`, `claude-desktop`, `cursor` |
+---
+## Verbosity levels
+| Level | What you see |
+|---|---|
+| `QUIET` | Nothing — records still collected in memory |
+| `NORMAL` | Summary + all validation warnings with hints *(default)* |
+| `VERBOSE` | + every cell and table match step by step |
+| `DEBUG` | + every anchor probe and rejection reason |
+---
+## Logging model & data safety
+grepxcel's logging is designed so that **no extracted Excel cell value ever reaches
+a structured log record** — by construction, not by redaction.
+### How it works
+- **Console / text logs** (`--log-format text`, default) show the full diagnostic
+  detail (including cell values) on stderr — intended for a human operator sitting
+  at the terminal.
+- **Structured logs** (`--log-format json`) write one NDJSON record per event to
+  the `--log` file. Each record is built from an **allow-list of safe keys** only:
+  `ts`, `level`, `category`, `event`, `cell`, `field`, `field_type`, `value_len`,
+  `value_sha8`, `run_id`, `source`, `schema_version`. Free-form fields (`message`,
+  `hint`, `expected`, `found`) are never serialized.
+- A **non-reversible value fingerprint** (`value_len` + `value_sha8`, a truncated
+  SHA-256) is included for diagnostics without revealing the cell content.
+- An end-of-run **summary** event carries extraction statistics — counts and field
+  names only, never values.
+- There is **no opt-out** for the allow-list model; the text log remains human-only.
+### GDPR / PII considerations
+Because structured JSON logs never contain cell values, they can be shipped to a
+SIEM or cloud log aggregator without risk of leaking personal data that may be
+present in the Excel files being processed.
+The `--meta` flag adds a `_meta` block to the extracted JSON output with run_id,
+statistics, and safe issue records — suitable for pipeline auto-verification. This
+block also follows the allow-list model and never contains cell values.
+**Operators are responsible for** controlling access to the text-mode log files and
+the extracted JSON output, which do contain the actual data.
+---
+## Project layout
+```
+grepxcel/        ← importable Python package (engine, parser, models, security, cli, drafter)
+scripts/         ← reusable utility scripts for contributors
+tests/
+  fixtures/      ← pattern + data xlsx pairs (one folder per scenario)
+  unit/          ← pytest unit tests
+  integration/   ← pytest integration tests
+docs/            ← additional documentation
+tmp.local/       ← throwaway scripts, never synced  (gitignored)
+samples.local/   ← local-only Excel files, never synced  (gitignored)
+logs/            ← log file output                 (gitignored)
+output/          ← JSON extraction results         (gitignored)
+```
+---
+## Running the tests
+```bash
+.venv/bin/pytest tests/ -q
+```
+1200+ unit and integration tests across 22 fixture scenarios — all green.
+---
+## Requirements
+- Python 3.11+
+- `openpyxl >= 3.1`
+- `defusedxml >= 0.7`
+- `llama-cpp-python >= 0.2.90` and `huggingface_hub >= 0.23` — only for `grepxcel draft`
+---
+## Support
+grepxcel is free and open source. If it saves you time and you'd like to say thanks,
+you can [buy me a coffee](https://buymeacoffee.com/scpg.dev) ☕ — entirely optional and
+always appreciated.
+<a href="https://buymeacoffee.com/scpg.dev" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/v2/default-yellow.png" alt="Buy Me A Coffee" height="41" width="174"></a>
+## Contributing
+See [CONTRIBUTING.md](CONTRIBUTING.md). All PRs target the `dev` branch.
+## Security
+See [SECURITY.md](SECURITY.md) for how to report vulnerabilities privately.
+## License
+MIT — see [LICENSE](LICENSE).