pybutt 2.0.0__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- old_tests/app.py +713 -0
- pybutt/__init__.py +17 -0
- pybutt/cli/__init__.py +11 -0
- pybutt/cli/app.py +94 -0
- pybutt/cli/combine_command.py +236 -0
- pybutt/cli/export_command.py +317 -0
- pybutt/cli/import_command.py +286 -0
- pybutt/cli/inspect_command.py +30 -0
- pybutt/cli/purge_command.py +235 -0
- pybutt/core/__init__.py +30 -0
- pybutt/core/base.py +124 -0
- pybutt/core/config.py +144 -0
- pybutt/core/logobs.py +445 -0
- pybutt/exceptions.py +82 -0
- pybutt/files/__init__.py +28 -0
- pybutt/files/combine.py +93 -0
- pybutt/files/inspect.py +51 -0
- pybutt/files/manifest.py +160 -0
- pybutt/io/__init__.py +6 -0
- pybutt/io/combiner.py +119 -0
- pybutt/io/exporter.py +612 -0
- pybutt/io/importer.py +928 -0
- pybutt/io/purger.py +44 -0
- pybutt-2.0.0.dist-info/METADATA +756 -0
- pybutt-2.0.0.dist-info/RECORD +39 -0
- pybutt-2.0.0.dist-info/WHEEL +5 -0
- pybutt-2.0.0.dist-info/entry_points.txt +2 -0
- pybutt-2.0.0.dist-info/licenses/LICENSE +21 -0
- pybutt-2.0.0.dist-info/top_level.txt +3 -0
- tests/conftest.py +22 -0
- tests/test_cli.py +979 -0
- tests/test_cli_help.py +130 -0
- tests/test_combiner.py +259 -0
- tests/test_core.py +1009 -0
- tests/test_exporter.py +637 -0
- tests/test_files.py +178 -0
- tests/test_import_retry_logic.py +837 -0
- tests/test_logobs.py +491 -0
- tests/test_purge.py +219 -0
|
@@ -0,0 +1,756 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: pybutt
|
|
3
|
+
Version: 2.0.0
|
|
4
|
+
Requires-Python: >=3.12
|
|
5
|
+
Description-Content-Type: text/markdown
|
|
6
|
+
License-File: LICENSE
|
|
7
|
+
Requires-Dist: typer
|
|
8
|
+
Requires-Dist: pyodbc
|
|
9
|
+
Requires-Dist: pyarrow
|
|
10
|
+
Requires-Dist: duckdb
|
|
11
|
+
Requires-Dist: mssql-python
|
|
12
|
+
Requires-Dist: psutil
|
|
13
|
+
Provides-Extra: dev
|
|
14
|
+
Requires-Dist: black; extra == "dev"
|
|
15
|
+
Requires-Dist: ruff; extra == "dev"
|
|
16
|
+
Requires-Dist: isort; extra == "dev"
|
|
17
|
+
Requires-Dist: pytest; extra == "dev"
|
|
18
|
+
Requires-Dist: build; extra == "dev"
|
|
19
|
+
Dynamic: license-file
|
|
20
|
+
|
|
21
|
+
# PyButt
|
|
22
|
+
|
|
23
|
+
**Python Bulk Transfer Tool** - A tool for exporting SQL Server tables to Parquet files and importing Parquet data back into SQL Server.
|
|
24
|
+
|
|
25
|
+
## Features
|
|
26
|
+
|
|
27
|
+
- **SQL Server to Parquet Export**: Partition tables and export them as multiple Parquet files in parallel
|
|
28
|
+
- **Parquet to SQL Server Import**: Bulk import Parquet files into SQL Server with configurable batch sizing
|
|
29
|
+
- **Flexible Authentication**: Supports both SQL authentication and Windows integrated authentication
|
|
30
|
+
- **Command-Line Interface**: Full-featured CLI with Typer for easy command execution
|
|
31
|
+
- **Python API**: Use PyButt as a module in your Python projects for programmatic access
|
|
32
|
+
- **Manifest-Based Import**: Track exported files with automatic manifests
|
|
33
|
+
- **Performance Optimized**: Multi-process export and multi-threaded import for maximum throughput
|
|
34
|
+
|
|
35
|
+
## Documentation
|
|
36
|
+
|
|
37
|
+
In-depth guides on the data pipeline, memory behaviour, tuning knobs, engine
|
|
38
|
+
differences, and defaults live in [`docs/`](docs/README.md). Start with
|
|
39
|
+
[concepts](docs/concepts.md), then [tuning](docs/tuning.md),
|
|
40
|
+
[engines](docs/engines.md), and [defaults](docs/defaults.md).
|
|
41
|
+
|
|
42
|
+
## Prerequisites
|
|
43
|
+
|
|
44
|
+
Before installing PyButt, ensure your system has the required ODBC components:
|
|
45
|
+
|
|
46
|
+
### Linux
|
|
47
|
+
|
|
48
|
+
```bash
|
|
49
|
+
# Check for libodbc
|
|
50
|
+
ldconfig -p | grep libodbc
|
|
51
|
+
|
|
52
|
+
# Check for ODBC Driver 18 for SQL Server
|
|
53
|
+
odbcinst -q -d
|
|
54
|
+
```
|
|
55
|
+
|
|
56
|
+
**Required packages:**
|
|
57
|
+
- `libodbc.so.2` (usually from the `unixodbc` package)
|
|
58
|
+
- `msodbcsql` version 18
|
|
59
|
+
- `duckdb` (see https://duckdb.org/install/?platform=linux&environment=cli)
|
|
60
|
+
|
|
61
|
+
### Windows
|
|
62
|
+
|
|
63
|
+
Install these packages using winget, and set the PowerShell ExecutionPolicy so you can activate your virtual environment:
|
|
64
|
+
|
|
65
|
+
```pwsh
|
|
66
|
+
winget install -e --id Microsoft.msodbcsql.18
|
|
67
|
+
winget install -e --id DuckDB.cli
|
|
68
|
+
Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
|
|
69
|
+
|
|
70
|
+
# If you haven't already got `git` or `python`
|
|
71
|
+
winget install -e --id Git.Git
|
|
72
|
+
winget install -e --id Python.Python.3.14 --location C:\Python314
|
|
73
|
+
```
|
|
74
|
+
|
|
75
|
+
**Required packages:**
|
|
76
|
+
- `msodbcsql` version 18
|
|
77
|
+
- `duckdb` (see https://duckdb.org/install/?platform=windows&environment=cli)
|
|
78
|
+
|
|
79
|
+
## Installation
|
|
80
|
+
|
|
81
|
+
### Quick Start
|
|
82
|
+
|
|
83
|
+
PyButt uses `pyproject.toml` as the source of truth for runtime dependencies and optional development tooling.
|
|
84
|
+
|
|
85
|
+
```bash
|
|
86
|
+
git clone https://github.com/dmonlineuk/pybutt && cd pybutt
|
|
87
|
+
python -m venv .venv
|
|
88
|
+
source .venv/bin/activate # On Windows: `.venv\Scripts\Activate.ps1`
|
|
89
|
+
python -m pip install --upgrade pip
|
|
90
|
+
pip install -e .
|
|
91
|
+
```
|
|
92
|
+
|
|
93
|
+
If you want the full developer environment with formatting, linting, and tests:
|
|
94
|
+
|
|
95
|
+
```bash
|
|
96
|
+
pip install -e ".[dev]"
|
|
97
|
+
```
|
|
98
|
+
|
|
99
|
+
### Install as a Package
|
|
100
|
+
|
|
101
|
+
To use PyButt from Python projects and to enable the `pybutt` CLI executable:
|
|
102
|
+
|
|
103
|
+
```bash
|
|
104
|
+
pip install -e .
|
|
105
|
+
```
|
|
106
|
+
|
|
107
|
+
## Usage
|
|
108
|
+
|
|
109
|
+
### Command-Line Interface
|
|
110
|
+
|
|
111
|
+
PyButt provides the following commands: `export`, `import`, `combine`, `inspect`, and `purge`.
|
|
112
|
+
|
|
113
|
+
#### Export Command
|
|
114
|
+
|
|
115
|
+
Export a SQL Server table to Parquet files:
|
|
116
|
+
|
|
117
|
+
```bash
|
|
118
|
+
pybutt export \
|
|
119
|
+
--server YOUR_SERVER \
|
|
120
|
+
--database YOUR_DB \
|
|
121
|
+
--schema dbo \
|
|
122
|
+
--table YOUR_TABLE \
|
|
123
|
+
--username your_user \
|
|
124
|
+
--output-path ./output
|
|
125
|
+
```
|
|
126
|
+
|
|
127
|
+
**Export Options:**
|
|
128
|
+
|
|
129
|
+
```
|
|
130
|
+
--server, -s SQL Server hostname or instance (required)
|
|
131
|
+
--database, -d Target database (required)
|
|
132
|
+
--schema, -S Table schema (default: dbo)
|
|
133
|
+
--table, -t Table name (required)
|
|
134
|
+
--output-path, -o Output directory for Parquet files (required)
|
|
135
|
+
--manifest-filename, -m Custom manifest filename to write (default: <schema>_<table>_manifest.json)
|
|
136
|
+
--username, -u SQL Server username
|
|
137
|
+
--password, -p SQL Server password (prompted if not provided)
|
|
138
|
+
--trusted-connection, -T Use Windows integrated authentication
|
|
139
|
+
--driver, -D ODBC driver name (default: ODBC Driver 18 for SQL Server)
|
|
140
|
+
--trust-cert, -c Trust the SQL Server TLS certificate
|
|
141
|
+
--encrypt/--no-encrypt Enable/disable encrypted transport (default: enabled)
|
|
142
|
+
--retries, -r Number of retry attempts for transient errors (default: 3)
|
|
143
|
+
--packet-size TDS packet size in bytes, 512–32767 (default: 4096)
|
|
144
|
+
--pk-column, -P Primary key column for deterministic partitioning
|
|
145
|
+
--columns, -C Comma-separated list of columns to export (all by default)
|
|
146
|
+
--parameters, -a Comma-separated list of parameter values to pass to a table-valued function (e.g. 12,'fred','1989')
|
|
147
|
+
--worker-count, -w Number of worker processes (default: 1)
|
|
148
|
+
--file-count, -f Number of output Parquet files (default: 1)
|
|
149
|
+
--rowgroup-size, -R Number of rows per rowgroup inside each Parquet file (default: 1048576)
|
|
150
|
+
--fetch-size, -F Cursor fetch size for pyodbc export (default: 1000)
|
|
151
|
+
--engine, -e Export engine to use: duckdb, pyodbc, or mssql-python (default: pyodbc)
|
|
152
|
+
--mem-heartbeat Log process memory every N seconds (default: 30.0; 0 to disable)
|
|
153
|
+
--mem-threshold System memory % at which workers are throttled (default: 85.0; 0 to disable)
|
|
154
|
+
--mem-sleep Seconds to sleep per throttle check (default: 5.0)
|
|
155
|
+
--mem-max-wait Max seconds to wait during memory throttling (default: 300.0)
|
|
156
|
+
--mem-cooldown Seconds after a throttle event before re-checking (default: 30.0)
|
|
157
|
+
--verbose, -V Show verbose logging output
|
|
158
|
+
--help, -? Show help and exit
|
|
159
|
+
```
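The `--mem-*` options describe a throttle loop: when system memory use reaches the threshold, a worker sleeps, re-checks, and gives up waiting after a maximum time. The following is a minimal sketch of that idea only, not PyButt's implementation. The `wait_for_memory` helper, its parameters, and the injected `read_percent` callable are hypothetical (a reading such as `psutil.virtual_memory().percent` could be supplied), and the cooldown option is not modelled.

```python
import time
from typing import Callable


def wait_for_memory(
    read_percent: Callable[[], float],
    threshold: float = 85.0,
    sleep_seconds: float = 5.0,
    max_wait: float = 300.0,
    sleeper: Callable[[float], None] = time.sleep,
) -> float:
    """Block while memory use is at or above `threshold`; return seconds waited.

    Hypothetical helper for illustration. A threshold of 0 disables
    throttling, matching the documented meaning of `--mem-threshold 0`.
    """
    if threshold <= 0:
        return 0.0
    waited = 0.0
    # Re-check memory after every sleep, and stop once max_wait is reached.
    while read_percent() >= threshold and waited < max_wait:
        sleeper(sleep_seconds)
        waited += sleep_seconds
    return waited
```

Injecting `read_percent` and `sleeper` keeps the loop testable without touching real system memory or actually sleeping.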
|
|
160
|
+
|
|
161
|
+
**Examples:**
|
|
162
|
+
|
|
163
|
+
Export entire table with 4 parallel workers:
|
|
164
|
+
```bash
|
|
165
|
+
pybutt export \
|
|
166
|
+
--server sqlserver.example.com \
|
|
167
|
+
--database MyDatabase \
|
|
168
|
+
--table Customers \
|
|
169
|
+
--output-path ./exports/customers \
|
|
170
|
+
--username dbuser \
|
|
171
|
+
--worker-count 4 \
|
|
172
|
+
--file-count 4
|
|
173
|
+
```
|
|
174
|
+
|
|
175
|
+
Export using the duckdb engine:
|
|
176
|
+
```bash
|
|
177
|
+
pybutt export \
|
|
178
|
+
--server sqlserver.example.com \
|
|
179
|
+
--database MyDatabase \
|
|
180
|
+
--table Customers \
|
|
181
|
+
--output-path ./exports/customers \
|
|
182
|
+
--username dbuser \
|
|
183
|
+
--engine duckdb
|
|
184
|
+
```
|
|
185
|
+
|
|
186
|
+
Export using the mssql-python engine:
|
|
187
|
+
```bash
|
|
188
|
+
pybutt export \
|
|
189
|
+
--server sqlserver.example.com \
|
|
190
|
+
--database MyDatabase \
|
|
191
|
+
--table Customers \
|
|
192
|
+
--output-path ./exports/customers \
|
|
193
|
+
--username dbuser \
|
|
194
|
+
--engine mssql-python
|
|
195
|
+
```
|
|
196
|
+
|
|
197
|
+
Export specific columns using primary key partitioning:
|
|
198
|
+
```bash
|
|
199
|
+
pybutt export \
|
|
200
|
+
--server sqlserver.example.com \
|
|
201
|
+
--database MyDatabase \
|
|
202
|
+
--table Orders \
|
|
203
|
+
--output-path ./exports/orders \
|
|
204
|
+
--username dbuser \
|
|
205
|
+
--pk-column OrderID \
|
|
206
|
+
--columns "OrderID,OrderDate,Amount" \
|
|
207
|
+
--file-count 8
|
|
208
|
+
```
|
|
209
|
+
|
|
210
|
+
Exporting database views is also supported. If partition statistics are unavailable for the target object, PyButt will fall back to `SELECT COUNT(*)` to determine the row count before partitioning.
|
|
211
|
+
|
|
212
|
+
Export from a TVF with parameters:
|
|
213
|
+
```bash
|
|
214
|
+
pybutt export \
|
|
215
|
+
--server sqlserver.example.com \
|
|
216
|
+
--database MyDatabase \
|
|
217
|
+
--schema export \
|
|
218
|
+
--table tvf_users \
|
|
219
|
+
--parameters "12,'fred','1989'" \
|
|
220
|
+
--output-path ./exports/tvf_users \
|
|
221
|
+
--username dbuser
|
|
222
|
+
```
|
|
223
|
+
|
|
224
|
+
Export using Windows authentication:
|
|
225
|
+
```bash
|
|
226
|
+
pybutt export \
|
|
227
|
+
--server SQLSERVER01\INSTANCE \
|
|
228
|
+
--database MyDatabase \
|
|
229
|
+
--table LargeTable \
|
|
230
|
+
--output-path ./exports \
|
|
231
|
+
--trusted-connection
|
|
232
|
+
```
|
|
233
|
+
|
|
234
|
+
#### Import Command
|
|
235
|
+
|
|
236
|
+
Import Parquet files into a SQL Server table:
|
|
237
|
+
|
|
238
|
+
```bash
|
|
239
|
+
pybutt import \
|
|
240
|
+
./exports/customers/dbo_Customers_manifest.json \
|
|
241
|
+
--server YOUR_SERVER \
|
|
242
|
+
--database YOUR_DB \
|
|
243
|
+
--schema dbo \
|
|
244
|
+
--table YOUR_TABLE \
|
|
245
|
+
--username your_user
|
|
246
|
+
```
|
|
247
|
+
|
|
248
|
+
**Import Options:**
|
|
249
|
+
|
|
250
|
+
```
|
|
251
|
+
manifest_path Path to the input manifest file (positional, required)
|
|
252
|
+
--server, -s SQL Server hostname or instance (required)
|
|
253
|
+
--database, -d Target database (required)
|
|
254
|
+
--schema, -S Table schema (default: dbo)
|
|
255
|
+
--table, -t Table name (required)
|
|
256
|
+
--imported-manifest-filename, -o Override the import worker manifest filename
|
|
257
|
+
--username, -u SQL Server username
|
|
258
|
+
--password, -p SQL Server password (prompted if not provided)
|
|
259
|
+
--trusted-connection, -T Use Windows integrated authentication
|
|
260
|
+
--driver, -D ODBC driver name (default: ODBC Driver 18 for SQL Server)
|
|
261
|
+
--trust-cert, -c Trust the SQL Server TLS certificate
|
|
262
|
+
--encrypt/--no-encrypt Enable/disable encrypted transport (default: enabled)
|
|
263
|
+
--retries, -r Number of retry attempts for transient errors (default: 3)
|
|
264
|
+
--packet-size TDS packet size in bytes, 512–32767 (default: 4096)
|
|
265
|
+
--worker-count, -w Number of parallel import threads (default: 1)
|
|
266
|
+
--batch-size, -b Rows per batch insert (default: 1000)
|
|
267
|
+
--engine, -e Import engine to use: duckdb, pyodbc, or mssql-python (default: mssql-python)
|
|
268
|
+
--transaction-mode, -M Transaction scope: batch, rowgroup (default), file
|
|
269
|
+
--cci/--no-cci Create a clustered columnstore index on per-worker temp tables (default: enabled)
|
|
270
|
+
--mem-heartbeat Log process memory every N seconds (default: 30.0; 0 to disable)
|
|
271
|
+
--mem-threshold System memory % at which workers are throttled (default: 85.0; 0 to disable)
|
|
272
|
+
--mem-sleep Seconds to sleep per throttle check (default: 5.0)
|
|
273
|
+
--mem-max-wait Max seconds to wait during memory throttling (default: 300.0)
|
|
274
|
+
--mem-cooldown Seconds after a throttle event before re-checking (default: 30.0)
|
|
275
|
+
--verbose, -V Show verbose logging output
|
|
276
|
+
--help, -? Show help and exit
|
|
277
|
+
```
|
|
278
|
+
|
|
279
|
+
**Columnstore on temporary tables:**
|
|
280
|
+
|
|
281
|
+
When importing with `--worker-count` of 2 or more, PyButt creates one temporary
|
|
282
|
+
table per worker (`SELECT TOP 0 * INTO ... FROM <source>`) which can then be combined
|
|
283
|
+
into the target afterwards. By default a clustered columnstore index (CCI) is now
|
|
284
|
+
created on each temporary table to reduce the storage footprint of these staging
|
|
285
|
+
tables. Pass `--no-cci` to keep the previous heap behaviour.
|
|
286
|
+
|
|
287
|
+
Notes:
|
|
288
|
+
- The CCI is only created on the multi-worker path (single-worker imports are
|
|
289
|
+
unaffected).
|
|
290
|
+
- Space savings come from columnstore compression. The SQL Server tuple mover
|
|
291
|
+
compresses row groups as data is loaded once they are large enough, so the
|
|
292
|
+
benefit applies when it matters most (large imports). Small loads may sit in
|
|
293
|
+
the uncompressed delta store until a row group fills.
|
|
294
|
+
- Clustered columnstore indexes require SQL Server 2014+ (and are available in
|
|
295
|
+
all editions from SQL Server 2016 SP1). On unsupported instances, or with
|
|
296
|
+
source columns that columnstore does not support, use `--no-cci`.
|
|
297
|
+
|
|
298
|
+
**Examples:**
|
|
299
|
+
|
|
300
|
+
Basic import (uses rowgroup transaction mode by default):
|
|
301
|
+
```bash
|
|
302
|
+
pybutt import \
|
|
303
|
+
./exports/customers/dbo_Customers_manifest.json \
|
|
304
|
+
--server sqlserver.example.com \
|
|
305
|
+
--database MyDatabase \
|
|
306
|
+
--table Customers \
|
|
307
|
+
--username dbuser
|
|
308
|
+
```
|
|
309
|
+
|
|
310
|
+
Import using the pyodbc engine:
|
|
311
|
+
```bash
|
|
312
|
+
pybutt import \
|
|
313
|
+
./exports/customers/dbo_Customers_manifest.json \
|
|
314
|
+
--server sqlserver.example.com \
|
|
315
|
+
--database MyDatabase \
|
|
316
|
+
--table Customers \
|
|
317
|
+
--username dbuser \
|
|
318
|
+
--engine pyodbc
|
|
319
|
+
```
|
|
320
|
+
|
|
321
|
+
Import using the duckdb engine:
|
|
322
|
+
```bash
|
|
323
|
+
pybutt import \
|
|
324
|
+
./exports/customers/dbo_Customers_manifest.json \
|
|
325
|
+
--server sqlserver.example.com \
|
|
326
|
+
--database MyDatabase \
|
|
327
|
+
--table Customers \
|
|
328
|
+
--username dbuser \
|
|
329
|
+
--engine duckdb
|
|
330
|
+
```
|
|
331
|
+
|
|
332
|
+
High-throughput import with larger batches (batch mode):
|
|
333
|
+
```bash
|
|
334
|
+
pybutt import \
|
|
335
|
+
./imports/orders/dbo_Orders_manifest.json \
|
|
336
|
+
--server sqlserver.example.com \
|
|
337
|
+
--database MyDatabase \
|
|
338
|
+
--table Orders \
|
|
339
|
+
--username dbuser \
|
|
340
|
+
--worker-count 4 \
|
|
341
|
+
--batch-size 5000 \
|
|
342
|
+
--transaction-mode batch \
|
|
343
|
+
--verbose
|
|
344
|
+
```
|
|
345
|
+
|
|
346
|
+
Import with batch transactions (per-batch retries):
|
|
347
|
+
```bash
|
|
348
|
+
pybutt import \
|
|
349
|
+
./imports/data/dbo_LargeTable_manifest.json \
|
|
350
|
+
--server sqlserver.example.com \
|
|
351
|
+
--database MyDatabase \
|
|
352
|
+
--table LargeTable \
|
|
353
|
+
--username dbuser \
|
|
354
|
+
--transaction-mode batch
|
|
355
|
+
```
|
|
356
|
+
|
|
357
|
+
Import with file-level transactions (all-or-nothing for critical data):
|
|
358
|
+
```bash
|
|
359
|
+
pybutt import \
|
|
360
|
+
./imports/financials/dbo_FinancialData_manifest.json \
|
|
361
|
+
--server sqlserver.example.com \
|
|
362
|
+
--database MyDatabase \
|
|
363
|
+
--table FinancialData \
|
|
364
|
+
--username dbuser \
|
|
365
|
+
--transaction-mode file
|
|
366
|
+
```
|
|
367
|
+
|
|
368
|
+
#### Combine Command
|
|
369
|
+
|
|
370
|
+
Combine objects listed in a manifest file. This command supports two types of combines depending on the manifest type:
|
|
371
|
+
- **Files manifest (`type: "files"`)**: Concatenates multiple Parquet files into a single output Parquet file.
|
|
372
|
+
- **Tables manifest (`type: "tables"`)**: Combines multiple temporary/worker SQL tables into a single target table on your SQL Server.
|
|
373
|
+
|
|
374
|
+
```bash
|
|
375
|
+
# File combine example:
|
|
376
|
+
pybutt combine \
|
|
377
|
+
./exports/customers/dbo_Customers_manifest.json \
|
|
378
|
+
--output-file ./exports/customers/combined.parquet
|
|
379
|
+
|
|
380
|
+
# Table combine example:
|
|
381
|
+
pybutt combine \
|
|
382
|
+
./exports/customers/dbo_Customers_temp_manifest.json \
|
|
383
|
+
--server YOUR_SERVER \
|
|
384
|
+
--database YOUR_DB \
|
|
385
|
+
--schema dbo \
|
|
386
|
+
--table Customers \
|
|
387
|
+
--username your_user
|
|
388
|
+
```
|
|
389
|
+
|
|
390
|
+
**Combine Options:**
|
|
391
|
+
|
|
392
|
+
```
|
|
393
|
+
manifest Path to manifest file (positional, required)
|
|
394
|
+
--output-file, -o Output Parquet file path (required for file combines)
|
|
395
|
+
--rowgroup-size, -R Rowgroup size for output Parquet file (default: 1048576)
|
|
396
|
+
--combined-manifest-filename, -m Override the combined manifest filename
|
|
397
|
+
--server, -s SQL Server hostname or instance (required for table combines)
|
|
398
|
+
--database, -d Target database (required for table combines)
|
|
399
|
+
--schema, -S Target schema (required for table combines)
|
|
400
|
+
--table, -t Target table name (required for table combines)
|
|
401
|
+
--username, -u SQL Server username (for table combines)
|
|
402
|
+
--password, -p SQL Server password (for table combines)
|
|
403
|
+
--trusted-connection, -T Use Windows integrated authentication
|
|
404
|
+
--driver, -D ODBC driver name (default: ODBC Driver 18 for SQL Server)
|
|
405
|
+
--trust-cert, -c Trust the SQL Server TLS certificate
|
|
406
|
+
--encrypt/--no-encrypt, -e/-n Enable/disable encrypted transport (default: enabled)
|
|
407
|
+
--retries, -r Number of retry attempts for transient SQL errors (default: 3)
|
|
408
|
+
--verbose, -V Show verbose logging output
|
|
409
|
+
```
|
|
410
|
+
|
|
411
|
+
#### Inspect Command
|
|
412
|
+
|
|
413
|
+
Inspect details of the Parquet files listed in a manifest (including row counts, row group counts, size, and columns):
|
|
414
|
+
|
|
415
|
+
```bash
|
|
416
|
+
pybutt inspect ./exports/customers/dbo_Customers_manifest.json
|
|
417
|
+
```
|
|
418
|
+
|
|
419
|
+
**Inspect Options:**
|
|
420
|
+
|
|
421
|
+
```
|
|
422
|
+
manifest Path to manifest.json file (positional, required)
|
|
423
|
+
--verbose, -v Show full column definitions, schema, and detailed metadata
|
|
424
|
+
```
|
|
425
|
+
|
|
426
|
+
### Password Input
|
|
427
|
+
|
|
428
|
+
When you provide a username without a password, PyButt will prompt you interactively:
|
|
429
|
+
|
|
430
|
+
```bash
|
|
431
|
+
pybutt export \
|
|
432
|
+
--server myserver \
|
|
433
|
+
--database mydb \
|
|
434
|
+
--table mytable \
|
|
435
|
+
--output-path ./output \
|
|
436
|
+
--username myuser
|
|
437
|
+
# You'll be prompted: Enter your password: [hidden input]
|
|
438
|
+
```
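The prompting rule above (username given, password missing) can be sketched in a few lines of Python. This is an illustration under assumptions rather than PyButt's own code: `resolve_password` is a hypothetical helper, and the standard library's `getpass.getpass` stands in for whatever hidden-input prompt the CLI actually uses.

```python
import getpass
from typing import Callable, Optional


def resolve_password(
    username: Optional[str],
    password: Optional[str],
    prompt: Callable[[str], str] = getpass.getpass,
) -> Optional[str]:
    """Return the password to use, prompting only when one is needed.

    Hypothetical helper: prompts when a username is supplied without a
    password, otherwise returns the password unchanged (possibly None,
    for example with Windows integrated authentication).
    """
    if username and password is None:
        return prompt("Enter your password: ")
    return password
```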
|
|
439
|
+
|
|
440
|
+
### Python API
|
|
441
|
+
|
|
442
|
+
Use PyButt as a module in your Python projects:
|
|
443
|
+
|
|
444
|
+
#### Configuration
|
|
445
|
+
|
|
446
|
+
First, create a `SqlConfig` object with your connection details. `SqlConfig` is
|
|
447
|
+
purely connection configuration — schema and table are passed directly to
|
|
448
|
+
`Exporter`, `Importer`, and `TableCombine`.
|
|
449
|
+
|
|
450
|
+
```python
|
|
451
|
+
from pybutt import SqlConfig, Exporter, Importer
|
|
452
|
+
from pathlib import Path
|
|
453
|
+
|
|
454
|
+
config = SqlConfig(
|
|
455
|
+
server="sqlserver.example.com",
|
|
456
|
+
database="MyDatabase",
|
|
457
|
+
username="dbuser",
|
|
458
|
+
password="dbpassword",
|
|
459
|
+
trusted_connection=False,
|
|
460
|
+
trust_cert=False,
|
|
461
|
+
encrypt=True,
|
|
462
|
+
retries=3,
|
|
463
|
+
)
|
|
464
|
+
```
|
|
465
|
+
|
|
466
|
+
Or with Windows authentication:
|
|
467
|
+
|
|
468
|
+
```python
|
|
469
|
+
config = SqlConfig(
|
|
470
|
+
server="SQLSERVER01\\INSTANCE",
|
|
471
|
+
database="MyDatabase",
|
|
472
|
+
trusted_connection=True,
|
|
473
|
+
)
|
|
474
|
+
```
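To make the connection options more concrete, here is a rough sketch of how fields like these could map onto an ODBC connection string. It is an illustration under assumptions, not PyButt's internal code: the `build_connection_string` helper is hypothetical, and it uses only standard ODBC Driver 18 keywords (`DRIVER`, `SERVER`, `DATABASE`, `UID`, `PWD`, `Trusted_Connection`, `Encrypt`, `TrustServerCertificate`).

```python
from typing import Optional


def build_connection_string(
    server: str,
    database: str,
    username: Optional[str] = None,
    password: Optional[str] = None,
    trusted_connection: bool = False,
    trust_cert: bool = False,
    encrypt: bool = True,
    driver: str = "ODBC Driver 18 for SQL Server",
) -> str:
    """Assemble an ODBC connection string (hypothetical helper)."""
    parts = [
        f"DRIVER={{{driver}}}",  # the driver name is wrapped in braces
        f"SERVER={server}",
        f"DATABASE={database}",
    ]
    if trusted_connection:
        # Windows integrated authentication: no UID/PWD needed.
        parts.append("Trusted_Connection=yes")
    else:
        if not username:
            raise ValueError("username is required without trusted_connection")
        # Note: values containing ';' or braces would need escaping; omitted here.
        parts.append(f"UID={username}")
        parts.append(f"PWD={password or ''}")
    parts.append(f"Encrypt={'yes' if encrypt else 'no'}")
    parts.append(f"TrustServerCertificate={'yes' if trust_cert else 'no'}")
    return ";".join(parts)
```

With the Windows authentication values shown above, this sketch would produce a string containing `Trusted_Connection=yes` and no `UID` or `PWD` entries.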
|
|
475
|
+
|
|
476
|
+
#### Exporting Data
|
|
477
|
+
|
|
478
|
+
```python
|
|
479
|
+
from pathlib import Path
|
|
480
|
+
|
|
481
|
+
exporter = Exporter(
|
|
482
|
+
config=config,
|
|
483
|
+
table="Customers", # Target table name
|
|
484
|
+
output_path=Path("./exports/customers"),
|
|
485
|
+
schema="dbo", # Schema (default: dbo)
|
|
486
|
+
pk_column=None, # None for CHECKSUM partitioning
|
|
487
|
+
columns=None, # None for all columns
|
|
488
|
+
worker_count=4, # Number of parallel processes
|
|
489
|
+
file_count=4, # Number of output files
|
|
490
|
+
fetch_size=None, # Cursor fetch size for pyodbc export (None = auto)
|
|
491
|
+
)
|
|
492
|
+
|
|
493
|
+
exporter.perform_work()
|
|
494
|
+
print("Export completed successfully!")
|
|
495
|
+
```
|
|
496
|
+
|
|
497
|
+
With primary key partitioning:
|
|
498
|
+
|
|
499
|
+
```python
|
|
500
|
+
exporter = Exporter(
|
|
501
|
+
config=config,
|
|
502
|
+
table="Orders",
|
|
503
|
+
output_path=Path("./exports/orders"),
|
|
504
|
+
pk_column="OrderID", # Use PK for deterministic partitioning
|
|
505
|
+
columns=["OrderID", "OrderDate", "Amount"],
|
|
506
|
+
worker_count=8,
|
|
507
|
+
file_count=8,
|
|
508
|
+
fetch_size=None, # Optional: tune pyodbc fetch size for streaming
|
|
509
|
+
)
|
|
510
|
+
|
|
511
|
+
exporter.perform_work()
|
|
512
|
+
```
|
|
513
|
+
|
|
514
|
+
With multiple workers and files:
|
|
515
|
+
|
|
516
|
+
```python
|
|
517
|
+
exporter = Exporter(
|
|
518
|
+
config=config,
|
|
519
|
+
table="Orders",
|
|
520
|
+
output_path=Path("./exports/orders"),
|
|
521
|
+
worker_count=4,
|
|
522
|
+
file_count=4, # Distribute across 4 output files
|
|
523
|
+
rowgroup_size=1_048_576, # 1M rows per rowgroup
|
|
524
|
+
)
|
|
525
|
+
|
|
526
|
+
exporter.perform_work()
|
|
527
|
+
```
|
|
528
|
+
|
|
529
|
+
#### Importing Data
|
|
530
|
+
|
|
531
|
+
**Default (rowgroup-level transactions):**
|
|
532
|
+
```python
|
|
533
|
+
from pybutt import TransactionMode
|
|
534
|
+
|
|
535
|
+
importer = Importer(
|
|
536
|
+
config=config,
|
|
537
|
+
table="Customers",
|
|
538
|
+
input_path=Path("./exports/customers"),
|
|
539
|
+
manifest_filename="dbo_Customers_manifest.json",
|
|
540
|
+
worker_count=4, # Number of parallel threads
|
|
541
|
+
batch_size=1000, # Rows per batch
|
|
542
|
+
transaction_mode=TransactionMode.ROWGROUP, # Each row group in its own transaction (default)
|
|
543
|
+
)
|
|
544
|
+
|
|
545
|
+
importer.perform_work()
|
|
546
|
+
print("Import completed successfully!")
|
|
547
|
+
```
|
|
548
|
+
|
|
549
|
+
**With batch-level transactions (per-batch retries):**
|
|
550
|
+
```python
|
|
551
|
+
importer = Importer(
|
|
552
|
+
config=config,
|
|
553
|
+
table="Orders",
|
|
554
|
+
input_path=Path("./exports/orders"),
|
|
555
|
+
manifest_filename="dbo_Orders_manifest.json",
|
|
556
|
+
worker_count=4,
|
|
557
|
+
batch_size=5000,
|
|
558
|
+
transaction_mode=TransactionMode.BATCH, # Each batch in its own transaction
|
|
559
|
+
)
|
|
560
|
+
|
|
561
|
+
importer.perform_work()
|
|
562
|
+
```
|
|
563
|
+
|
|
564
|
+
**With file-level transactions (all-or-nothing safety):**
|
|
565
|
+
```python
|
|
566
|
+
importer = Importer(
|
|
567
|
+
config=config,
|
|
568
|
+
table="LargeTable",
|
|
569
|
+
input_path=Path("./exports/data"),
|
|
570
|
+
manifest_filename="dbo_LargeTable_manifest.json",
|
|
571
|
+
worker_count=4,
|
|
572
|
+
batch_size=1000,
|
|
573
|
+
transaction_mode=TransactionMode.FILE, # Entire file in one transaction
|
|
574
|
+
)
|
|
575
|
+
|
|
576
|
+
importer.perform_work()
|
|
577
|
+
```
|
|
578
|
+
|
|
579
|
+
#### Complete Example
|
|
580
|
+
|
|
581
|
+
```python
|
|
582
|
+
from pathlib import Path
|
|
583
|
+
from pybutt import SqlConfig, TransactionMode, Exporter, Importer
|
|
584
|
+
|
|
585
|
+
# Configure connection (purely connection details — no schema/table)
|
|
586
|
+
config = SqlConfig(
|
|
587
|
+
server="sqlserver.example.com",
|
|
588
|
+
database="MyDatabase",
|
|
589
|
+
username="dbuser",
|
|
590
|
+
password="dbpassword",
|
|
591
|
+
)
|
|
592
|
+
|
|
593
|
+
# Export
|
|
594
|
+
export_path = Path("./data_export")
|
|
595
|
+
exporter = Exporter(
|
|
596
|
+
config=config,
|
|
597
|
+
table="LargeTable",
|
|
598
|
+
output_path=export_path,
|
|
599
|
+
worker_count=4,
|
|
600
|
+
file_count=4,
|
|
601
|
+
)
|
|
602
|
+
exporter.perform_work()
|
|
603
|
+
print("✓ Export complete")
|
|
604
|
+
|
|
605
|
+
# Import into another table (reuse same connection config)
|
|
606
|
+
importer = Importer(
|
|
607
|
+
config=config,
|
|
608
|
+
table="LargeTableBackup",
|
|
609
|
+
input_path=export_path,
|
|
610
|
+
manifest_filename="dbo_LargeTable_manifest.json",
|
|
611
|
+
worker_count=4,
|
|
612
|
+
batch_size=5000,
|
|
613
|
+
transaction_mode=TransactionMode.ROWGROUP, # Rowgroup-level transactions (default)
|
|
614
|
+
)
|
|
615
|
+
importer.perform_work()
|
|
616
|
+
print("✓ Import complete")
|
|
617
|
+
```
|
|
618
|
+
|
|
619
|
+
## Manifest Files
|
|
620
|
+
|
|
621
|
+
When exporting, PyButt automatically creates a manifest JSON file listing all generated Parquet files. This manifest is required for importing.
|
|
622
|
+
|
|
623
|
+
As of version 2, a manifest is a JSON object with a `version`, a `type`, and an
|
|
624
|
+
`entries` list. Two manifest types are supported:
|
|
625
|
+
|
|
626
|
+
- **`files`** — `entries` are Parquet file names (written by `export` and file
|
|
627
|
+
`combine`).
|
|
628
|
+
- **`tables`** — `entries` are SQL Server table names (written during multi-worker
|
|
629
|
+
`import` and table `combine`, for consumption by the `combine` command).
|
|
630
|
+
|
|
631
|
+
**Example file manifest** (`dbo_MyTable_manifest.json`):
|
|
632
|
+
```json
|
|
633
|
+
{
|
|
634
|
+
"version": 2,
|
|
635
|
+
"type": "files",
|
|
636
|
+
"entries": [
|
|
637
|
+
"dbo_MyTable_part_00000.parquet",
|
|
638
|
+
"dbo_MyTable_part_00001.parquet",
|
|
639
|
+
"dbo_MyTable_part_00002.parquet",
|
|
640
|
+
"dbo_MyTable_part_00003.parquet"
|
|
641
|
+
]
|
|
642
|
+
}
|
|
643
|
+
```
|
|
644
|
+
|
|
645
|
+
**Example table manifest** (`dbo_MyTable_temp_manifest.json`):
|
|
646
|
+
```json
|
|
647
|
+
{
|
|
648
|
+
"version": 2,
|
|
649
|
+
"type": "tables",
|
|
650
|
+
"entries": [
|
|
651
|
+
"dbo.MyTable_01_a1b2c3d4",
|
|
652
|
+
"dbo.MyTable_02_e5f6a7b8"
|
|
653
|
+
]
|
|
654
|
+
}
|
|
655
|
+
```
|
|
656
|
+
|
|
657
|
+
For backwards compatibility, legacy version 1 manifests — a plain JSON array of
|
|
658
|
+
Parquet file names — are still accepted when reading and are treated as a `files`
|
|
659
|
+
manifest:
|
|
660
|
+
```json
|
|
661
|
+
[
|
|
662
|
+
"dbo_MyTable_part_00000.parquet",
|
|
663
|
+
"dbo_MyTable_part_00001.parquet"
|
|
664
|
+
]
|
|
665
|
+
```
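If you read manifests in your own tooling, both shapes can be handled with a small normaliser. The sketch below uses only the standard library and is not PyButt's own loader; the function name `normalize_manifest` and the choice to report legacy manifests as version 1 are assumptions made for illustration.

```python
import json
from typing import Any


def normalize_manifest(raw: Any) -> dict:
    """Return a dict with `version`, `type`, and `entries` for either shape."""
    if isinstance(raw, list):
        # Legacy version 1: a bare JSON array of Parquet file names.
        return {"version": 1, "type": "files", "entries": list(raw)}
    if isinstance(raw, dict) and raw.get("type") in {"files", "tables"}:
        # Version 2: an object carrying version, type, and entries.
        return {
            "version": raw.get("version"),
            "type": raw["type"],
            "entries": list(raw.get("entries", [])),
        }
    raise ValueError("unrecognised manifest structure")


legacy = json.loads('["dbo_MyTable_part_00000.parquet"]')
print(normalize_manifest(legacy)["type"])  # prints: files
```

Checking the `type` field first lets a caller decide whether the entries are Parquet files or SQL Server table names before acting on them.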
|
|
666
|
+
|
|
667
|
+
## Performance Tips
|
|
668
|
+
|
|
669
|
+
- **Export**: Increase `--worker-count` and `--file-count` for large tables (use values matching your CPU core count)
|
|
670
|
+
- **Import**: Use `--worker-count` up to your CPU core count and adjust `--batch-size` (higher values = fewer database round trips)
|
|
671
|
+
- **mssql-python engine**: The default import engine (`mssql-python`) uses native bulk insert (`bulkcopy`) which is significantly faster than parameterized `INSERT` statements used by pyodbc
|
|
672
|
+
- **Primary Key Partitioning**: Use `--pk-column` for deterministic partitioning when re-importing the same data
|
|
673
|
+
- **Encryption**: Use `--no-encrypt` only in secure networks to reduce overhead
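As a rough illustration of the `--batch-size` point, the number of batches needed for a given row count falls as the batch size grows. The sketch below assumes, as a simplification, that each batch costs one database round trip; `batch_count` is a hypothetical helper, and the row count is simply the default rowgroup size.

```python
def batch_count(total_rows: int, batch_size: int) -> int:
    """Number of batches needed to cover `total_rows` (ceiling division)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return -(-total_rows // batch_size)


# One default-sized rowgroup (1,048,576 rows) at two batch sizes.
for size in (1000, 5000):
    print(size, batch_count(1_048_576, size))
# prints:
# 1000 1049
# 5000 210
```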
|
|
674
|
+
|
|
675
|
+
## Transaction Modes for Import
|
|
676
|
+
|
|
677
|
+
The `--transaction-mode` option controls how data is committed during import and how retries are handled. Choose based on your safety, performance, and recovery needs:
|
|
678
|
+
|
|
679
|
+
| Mode | Behavior | Retry Scope | Best For | Pros | Cons |
|
|
680
|
+
|------|----------|-------------|----------|------|------|
|
|
681
|
+
| **batch** | Each batch of `batch_size` rows commits together | Per-batch retry | High throughput with per-batch retries | Fast, limited lock duration, failed batches retry independently | Rare edge case: partial batch on non-retryable error |
|
|
682
|
+
| **rowgroup** | Each Parquet row group commits together | Per-rowgroup retry | **Default — recommended for most use cases** | Row group boundary safety, independent rowgroup retries | Longer locks than batch mode, fewer retry opportunities |
|
|
683
|
+
| **file** | Entire file in one transaction | Entire file retry | Production, critical data | All-or-nothing atomicity, complete data integrity | Holds locks longer on large files; on failure the entire file is retried |
|
|
684
|
+
|
|
685
|
+
**Retry Behavior:**
|
|
686
|
+
- **batch/rowgroup modes**: When a batch or rowgroup fails, only that unit is rolled back and retried (up to `--retries` times). Already-committed units remain intact.
|
|
687
|
+
- **file mode**: If any part of the file fails, the file's transaction is rolled back and the entire file is retried, so a partially imported file is never committed.
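The per-unit retry behaviour can be sketched generically. This is a simplified model, not PyButt's importer: `TransientError`, `commit_units`, and the `write_unit`/`rollback` callables are hypothetical stand-ins, and the sketch assumes `retries` counts additional attempts after the first one.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for a retryable database error (hypothetical)."""


def commit_units(
    units: Iterable[T],
    write_unit: Callable[[T], None],
    rollback: Callable[[], None],
    retries: int = 3,
) -> int:
    """Write each unit in its own transaction, retrying only the failed unit.

    A unit is a batch, a row group, or a whole file, depending on the mode.
    Returns the number of units committed.
    """
    committed = 0
    for unit in units:
        for attempt in range(retries + 1):
            try:
                write_unit(unit)  # assumed to commit on success
                committed += 1
                break
            except TransientError:
                rollback()  # discard the partial unit; earlier units stay committed
                if attempt == retries:
                    raise  # retries exhausted for this unit
    return committed
```

The only difference between the modes in this model is how large a unit is: a smaller unit means less work is repeated on failure, while a larger unit gives a stronger all-or-nothing guarantee.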
|
|
688
|
+
|
|
689
|
+
**Recommended Configuration:**
|
|
690
|
+
```bash
|
|
691
|
+
pybutt import \
|
|
692
|
+
./data/dbo_YOUR_TABLE_manifest.json \
|
|
693
|
+
--server YOUR_SERVER \
|
|
694
|
+
--database YOUR_DB \
|
|
695
|
+
--table YOUR_TABLE \
|
|
696
|
+
--username your_user \
|
|
697
|
+
--batch-size 5000 \
|
|
698
|
+
--worker-count 4
|
|
699
|
+
```
|
|
700
|
+
|
|
701
|
+
**Choosing a mode:**
|
|
702
|
+
- **Default**: Use `rowgroup` (default) — balance between data safety, locking/blocking and speed
|
|
703
|
+
- **High Throughput**: Use `batch` for per-batch retries and limited lock duration
|
|
704
|
+
- **Safety-Critical (Small Files)**: Use `file` for complete all-or-nothing atomicity per file, but higher chance of locking/blocking
|
|
705
|
+
|
|
706
|
+
**Retry Configuration:**
|
|
707
|
+
Use `--retries` (default: 3) to control retry attempts. This applies at the transaction scope level:
|
|
708
|
+
```bash
|
|
709
|
+
# Retry individual batches up to 5 times before failing
|
|
710
|
+
pybutt import \
|
|
711
|
+
... \
|
|
712
|
+
--transaction-mode batch \
|
|
713
|
+
--retries 5
|
|
714
|
+
```
|
|
715
|
+
|
|
716
|
+
## Troubleshooting
|
|
717
|
+
|
|
718
|
+
**Connection Issues:**
|
|
719
|
+
- Verify SQL Server hostname and port
|
|
720
|
+
- Check ODBC driver: `odbcinst -q -d`
|
|
721
|
+
- Test ODBC connection: `isql -v your_dsn username password`
|
|
722
|
+
|
|
723
|
+
**Empty Table Errors:**
|
|
724
|
+
- Ensure the table exists and contains data
|
|
725
|
+
|
|
726
|
+
**Memory Issues:**
|
|
727
|
+
- Reduce `--worker-count` — it multiplies per-worker memory in both directions.
|
|
728
|
+
- Export: lower `--rowgroup-size`. The writer buffers a whole rowgroup in memory,
|
|
729
|
+
so this (not `--fetch-size`) drives export memory.
|
|
730
|
+
- Import: peak memory is one Parquet rowgroup (pyodbc/mssql-python) or the whole
|
|
731
|
+
file (duckdb engine) — not `--batch-size`. Re-export with a smaller
|
|
732
|
+
`--rowgroup-size`, or avoid the duckdb engine for very large files.
|
|
733
|
+
- Process smaller tables first to verify setup.
|
|
734
|
+
- Diagnose with `--mem-heartbeat <seconds>` and the `rss`/`peak` fields on each
|
|
735
|
+
log line — see [docs/logging.md](docs/logging.md#memory-observability).
|
|
736
|
+
- See [docs/concepts.md](docs/concepts.md) for the full memory model.
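For a back-of-the-envelope feel for how `--rowgroup-size` and `--worker-count` interact on export, the buffered data scales roughly with their product. The estimate below is a deliberately crude sketch: `estimate_export_buffer_bytes` is a hypothetical helper, the 200-byte average row width is an assumed figure, and real usage also depends on Arrow overhead, column types, and the engine.

```python
def estimate_export_buffer_bytes(
    rowgroup_size: int, avg_row_bytes: int, worker_count: int
) -> int:
    """Rough lower bound: one buffered rowgroup per worker."""
    return rowgroup_size * avg_row_bytes * worker_count


# Default rowgroup size, an assumed 200-byte row, and 4 workers.
mib = estimate_export_buffer_bytes(1_048_576, 200, 4) / (1024 ** 2)
print(f"{mib:.0f} MiB")  # prints: 800 MiB
```

Halving either the rowgroup size or the worker count halves this estimate, which is why those two options are the first things to adjust.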
|
|
737
|
+
|
|
738
|
+
**Frequent Batch/Rowgroup Failures:**
|
|
739
|
+
- Increase `--retries` and `--batch-size` for more resilient imports
|
|
740
|
+
- Check SQL Server logs for transient connection issues
|
|
741
|
+
- Verify network stability if errors are intermittent
|
|
742
|
+
|
|
743
|
+
## Contributions
|
|
744
|
+
|
|
745
|
+
When coding, please consider the following:
|
|
746
|
+
|
|
747
|
+
- Use the developer environment: `pip install -e ".[dev]"`
|
|
748
|
+
- Write tests for your changes and new features, and make sure they pass: `pytest`
|
|
749
|
+
- Run isort: `isort .`
|
|
750
|
+
- Run black: `black .`
|
|
751
|
+
- Run ruff: `ruff check .`
|
|
752
|
+
|
|
753
|
+
## License
|
|
754
|
+
|
|
755
|
+
See LICENSE file for details.
|
|
756
|
+
|