pybutt 2.0.0__py3-none-any.whl

@@ -0,0 +1,756 @@
1
+ Metadata-Version: 2.4
2
+ Name: pybutt
3
+ Version: 2.0.0
4
+ Requires-Python: >=3.12
5
+ Description-Content-Type: text/markdown
6
+ License-File: LICENSE
7
+ Requires-Dist: typer
8
+ Requires-Dist: pyodbc
9
+ Requires-Dist: pyarrow
10
+ Requires-Dist: duckdb
11
+ Requires-Dist: mssql-python
12
+ Requires-Dist: psutil
13
+ Provides-Extra: dev
14
+ Requires-Dist: black; extra == "dev"
15
+ Requires-Dist: ruff; extra == "dev"
16
+ Requires-Dist: isort; extra == "dev"
17
+ Requires-Dist: pytest; extra == "dev"
18
+ Requires-Dist: build; extra == "dev"
19
+ Dynamic: license-file
20
+
21
+ # PyButt
22
+
23
+ **Python Bulk Transfer Tool** - A tool for exporting SQL Server tables to Parquet files and importing Parquet data back into SQL Server.
24
+
25
+ ## Features
26
+
27
+ - **SQL Server to Parquet Export**: Partition tables and export them as multiple Parquet files in parallel
28
+ - **Parquet to SQL Server Import**: Bulk import Parquet files into SQL Server with configurable batch sizing
29
+ - **Flexible Authentication**: Supports both SQL authentication and Windows integrated authentication
30
+ - **Command-Line Interface**: Full-featured CLI built with Typer for easy command execution
31
+ - **Python API**: Use PyButt as a module in your Python projects for programmatic access
32
+ - **Manifest-Based Import**: Track exported files with automatic manifests
33
+ - **Performance Optimized**: Multi-process export and multi-threaded import for maximum throughput
34
+
35
+ ## Documentation
36
+
37
+ In-depth guides on the data pipeline, memory behaviour, tuning knobs, engine
38
+ differences, and defaults live in [`docs/`](docs/README.md). Start with
39
+ [concepts](docs/concepts.md), then [tuning](docs/tuning.md),
40
+ [engines](docs/engines.md), and [defaults](docs/defaults.md).
41
+
42
+ ## Prerequisites
43
+
44
+ Before installing PyButt, ensure your system has the required ODBC components:
45
+
46
+ ### Linux
47
+
48
+ ```bash
49
+ # Check for libodbc
50
+ ldconfig -p | grep libodbc
51
+
52
+ # Check for ODBC Driver 18 for SQL Server
53
+ odbcinst -q -d
54
+ ```
55
+
56
+ **Required packages:**
57
+ - `libodbc.so.2` (usually from the `unixodbc` package)
58
+ - `msodbcsql` version 18
59
+ - `duckdb` (see https://duckdb.org/install/?platform=linux&environment=cli)
60
+
61
+ ### Windows
62
+
63
+ Install these packages using winget, and set the PowerShell ExecutionPolicy so you can activate your virtual environment:
64
+
65
+ ```pwsh
66
+ winget install -e --id Microsoft.msodbcsql.18
67
+ winget install -e --id DuckDB.cli
68
+ Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
69
+
70
+ # If you haven't already got `git` or `python`
71
+ winget install -e --id Git.Git
72
+ winget install -e --id Python.Python.3.14 --location C:\Python314
73
+ ```
74
+
75
+ **Required packages:**
76
+ - `msodbcsql` version 18
77
+ - `duckdb` (see https://duckdb.org/install/?platform=windows&environment=cli)
78
+
79
+ ## Installation
80
+
81
+ ### Quick Start
82
+
83
+ PyButt uses `pyproject.toml` as the source of truth for runtime dependencies and optional development tooling.
84
+
85
+ ```bash
86
+ git clone https://github.com/dmonlineuk/pybutt && cd pybutt
87
+ python -m venv .venv
88
+ source .venv/bin/activate # On Windows: `.venv\Scripts\Activate.ps1`
89
+ python -m pip install --upgrade pip
90
+ pip install -e .
91
+ ```
92
+
93
+ If you want the full developer environment with formatting, linting, and tests:
94
+
95
+ ```bash
96
+ pip install -e ".[dev]"  # quoted so shells such as zsh do not expand the brackets
97
+ ```
98
+
99
+ ### Install as a Package
100
+
101
+ To use PyButt in your Python projects and enable the `pybutt` CLI executable:
102
+
103
+ ```bash
104
+ pip install -e .
105
+ ```
106
+
107
+ ## Usage
108
+
109
+ ### Command-Line Interface
110
+
111
+ PyButt provides the following commands: `export`, `import`, `combine`, `inspect`, and `purge`.
112
+
113
+ #### Export Command
114
+
115
+ Export a SQL Server table to Parquet files:
116
+
117
+ ```bash
118
+ pybutt export \
119
+ --server YOUR_SERVER \
120
+ --database YOUR_DB \
121
+ --schema dbo \
122
+ --table YOUR_TABLE \
123
+ --username your_user \
124
+ --output-path ./output
125
+ ```
126
+
127
+ **Export Options:**
128
+
129
+ ```
130
+ --server, -s SQL Server hostname or instance (required)
131
+ --database, -d Target database (required)
132
+ --schema, -S Table schema (default: dbo)
133
+ --table, -t Table name (required)
134
+ --output-path, -o Output directory for Parquet files (required)
135
+ --manifest-filename, -m Custom manifest filename to write (default: <schema>_<table>_manifest.json)
136
+ --username, -u SQL Server username
137
+ --password, -p SQL Server password (prompted if not provided)
138
+ --trusted-connection, -T Use Windows integrated authentication
139
+ --driver, -D ODBC driver name (default: ODBC Driver 18 for SQL Server)
140
+ --trust-cert, -c Trust the SQL Server TLS certificate
141
+ --encrypt/--no-encrypt Enable/disable encrypted transport (default: enabled)
142
+ --retries, -r Number of retry attempts for transient errors (default: 3)
143
+ --packet-size TDS packet size in bytes, 512–32767 (default: 4096)
144
+ --pk-column, -P Primary key column for deterministic partitioning
145
+ --columns, -C Comma-separated list of columns to export (all by default)
146
+ --parameters, -a Comma-separated list of parameter values to pass to a table-valued function (e.g. 12,'fred','1989')
147
+ --worker-count, -w Number of worker processes (default: 1)
148
+ --file-count, -f Number of output Parquet files (default: 1)
149
+ --rowgroup-size, -R Number of rows per rowgroup inside each Parquet file (default: 1048576)
150
+ --fetch-size, -F Cursor fetch size for pyodbc export (default: 1000)
151
+ --engine, -e Export engine to use: duckdb, pyodbc, or mssql-python (default: pyodbc)
152
+ --mem-heartbeat Log process memory every N seconds (default: 30.0; 0 to disable)
153
+ --mem-threshold System memory % at which workers are throttled (default: 85.0; 0 to disable)
154
+ --mem-sleep Seconds to sleep per throttle check (default: 5.0)
155
+ --mem-max-wait Max seconds to wait during memory throttling (default: 300.0)
156
+ --mem-cooldown Seconds after a throttle event before re-checking (default: 30.0)
157
+ --verbose, -V Show verbose logging output
158
+ --help, -? Show help and exit
159
+ ```
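The `--mem-*` options above describe a throttle: workers pause while system memory is at or above `--mem-threshold`, sleeping `--mem-sleep` seconds per check, for at most `--mem-max-wait` seconds. The loop below is a minimal sketch of that idea, not PyButt's implementation. The function name `wait_for_memory` and its injectable `read_percent` and `sleep` callables are hypothetical, and `--mem-cooldown` is left out.

```python
import time
from typing import Callable


def wait_for_memory(
    read_percent: Callable[[], float],
    threshold: float = 85.0,      # mirrors --mem-threshold
    sleep_seconds: float = 5.0,   # mirrors --mem-sleep
    max_wait: float = 300.0,      # mirrors --mem-max-wait
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Block while memory use is at or above the threshold; return seconds waited."""
    if threshold <= 0:
        return 0.0  # a threshold of 0 disables throttling
    waited = 0.0
    while read_percent() >= threshold and waited < max_wait:
        sleep(sleep_seconds)
        waited += sleep_seconds
    return waited


# Demo with canned readings and a no-op sleep so it runs instantly.
readings = iter([90.0, 88.0, 80.0])
print(wait_for_memory(lambda: next(readings), sleep=lambda _s: None))  # prints 10.0
```

In a real process, `read_percent` could be `lambda: psutil.virtual_memory().percent`, since `psutil` is already one of the package's dependencies.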
160
+
161
+ **Examples:**
162
+
163
+ Export entire table with 4 parallel workers:
164
+ ```bash
165
+ pybutt export \
166
+ --server sqlserver.example.com \
167
+ --database MyDatabase \
168
+ --table Customers \
169
+ --output-path ./exports/customers \
170
+ --username dbuser \
171
+ --worker-count 4 \
172
+ --file-count 4
173
+ ```
174
+
175
+ Export using the duckdb engine:
176
+ ```bash
177
+ pybutt export \
178
+ --server sqlserver.example.com \
179
+ --database MyDatabase \
180
+ --table Customers \
181
+ --output-path ./exports/customers \
182
+ --username dbuser \
183
+ --engine duckdb
184
+ ```
185
+
186
+ Export using the mssql-python engine:
187
+ ```bash
188
+ pybutt export \
189
+ --server sqlserver.example.com \
190
+ --database MyDatabase \
191
+ --table Customers \
192
+ --output-path ./exports/customers \
193
+ --username dbuser \
194
+ --engine mssql-python
195
+ ```
196
+
197
+ Export specific columns using primary key partitioning:
198
+ ```bash
199
+ pybutt export \
200
+ --server sqlserver.example.com \
201
+ --database MyDatabase \
202
+ --table Orders \
203
+ --output-path ./exports/orders \
204
+ --username dbuser \
205
+ --pk-column OrderID \
206
+ --columns "OrderID,OrderDate,Amount" \
207
+ --file-count 8
208
+ ```
209
+
210
+ Exporting database views is also supported. If partition statistics are unavailable for the target object, PyButt will fall back to `SELECT COUNT(*)` to determine the row count before partitioning.
211
+
212
+ Export from a TVF with parameters:
213
+ ```bash
214
+ pybutt export \
215
+ --server sqlserver.example.com \
216
+ --database MyDatabase \
217
+ --schema export \
218
+ --table tvf_users \
219
+ --parameters "12,'fred','1989'" \
220
+ --output-path ./exports/tvf_users \
221
+ --username dbuser
222
+ ```
223
+
224
+ Export using Windows authentication:
225
+ ```bash
226
+ pybutt export \
227
+ --server SQLSERVER01\INSTANCE \
228
+ --database MyDatabase \
229
+ --table LargeTable \
230
+ --output-path ./exports \
231
+ --trusted-connection
232
+ ```
233
+
234
+ #### Import Command
235
+
236
+ Import Parquet files into a SQL Server table:
237
+
238
+ ```bash
239
+ pybutt import \
240
+ ./exports/customers/dbo_Customers_manifest.json \
241
+ --server YOUR_SERVER \
242
+ --database YOUR_DB \
243
+ --schema dbo \
244
+ --table YOUR_TABLE \
245
+ --username your_user
246
+ ```
247
+
248
+ **Import Options:**
249
+
250
+ ```
251
+ manifest_path Path to the input manifest file (positional, required)
252
+ --server, -s SQL Server hostname or instance (required)
253
+ --database, -d Target database (required)
254
+ --schema, -S Table schema (default: dbo)
255
+ --table, -t Table name (required)
256
+ --imported-manifest-filename, -o Override the import worker manifest filename
257
+ --username, -u SQL Server username
258
+ --password, -p SQL Server password (prompted if not provided)
259
+ --trusted-connection, -T Use Windows integrated authentication
260
+ --driver, -D ODBC driver name (default: ODBC Driver 18 for SQL Server)
261
+ --trust-cert, -c Trust the SQL Server TLS certificate
262
+ --encrypt/--no-encrypt Enable/disable encrypted transport (default: enabled)
263
+ --retries, -r Number of retry attempts for transient errors (default: 3)
264
+ --packet-size TDS packet size in bytes, 512–32767 (default: 4096)
265
+ --worker-count, -w Number of parallel import threads (default: 1)
266
+ --batch-size, -b Rows per batch insert (default: 1000)
267
+ --engine, -e Import engine to use: duckdb, pyodbc, or mssql-python (default: mssql-python)
268
+ --transaction-mode, -M Transaction scope: batch, rowgroup (default), file
269
+ --cci/--no-cci Create a clustered columnstore index on per-worker temp tables (default: enabled)
270
+ --mem-heartbeat Log process memory every N seconds (default: 30.0; 0 to disable)
271
+ --mem-threshold System memory % at which workers are throttled (default: 85.0; 0 to disable)
272
+ --mem-sleep Seconds to sleep per throttle check (default: 5.0)
273
+ --mem-max-wait Max seconds to wait during memory throttling (default: 300.0)
274
+ --mem-cooldown Seconds after a throttle event before re-checking (default: 30.0)
275
+ --verbose, -V Show verbose logging output
276
+ --help, -? Show help and exit
277
+ ```
278
+
279
+ **Columnstore on temporary tables:**
280
+
281
+ When importing with `--worker-count` of 2 or more, PyButt creates one temporary
282
+ table per worker (`SELECT TOP 0 * INTO ... FROM <source>`) which can then be combined
283
+ into the target afterwards. By default a clustered columnstore index (CCI) is now
284
+ created on each temporary table to reduce the storage footprint of these staging
285
+ tables. Pass `--no-cci` to keep the previous heap behaviour.
286
+
287
+ Notes:
288
+ - The CCI is only created on the multi-worker path (single-worker imports are
289
+ unaffected).
290
+ - Space savings come from columnstore compression. The SQL Server tuple mover
291
+ compresses row groups as data is loaded once they are large enough, so the
292
+ benefit applies when it matters most (large imports). Small loads may sit in
293
+ the uncompressed delta store until a row group fills.
294
+ - Clustered columnstore indexes require SQL Server 2014+ (and are available in
295
+ all editions from SQL Server 2016 SP1). On unsupported instances, or with
296
+ source columns that columnstore does not support, use `--no-cci`.
297
+
298
+ **Examples:**
299
+
300
+ Basic import (uses rowgroup transaction mode by default):
301
+ ```bash
302
+ pybutt import \
303
+ ./exports/customers/dbo_Customers_manifest.json \
304
+ --server sqlserver.example.com \
305
+ --database MyDatabase \
306
+ --table Customers \
307
+ --username dbuser
308
+ ```
309
+
310
+ Import using the pyodbc engine:
311
+ ```bash
312
+ pybutt import \
313
+ ./exports/customers/dbo_Customers_manifest.json \
314
+ --server sqlserver.example.com \
315
+ --database MyDatabase \
316
+ --table Customers \
317
+ --username dbuser \
318
+ --engine pyodbc
319
+ ```
320
+
321
+ Import using the duckdb engine:
322
+ ```bash
323
+ pybutt import \
324
+ ./exports/customers/dbo_Customers_manifest.json \
325
+ --server sqlserver.example.com \
326
+ --database MyDatabase \
327
+ --table Customers \
328
+ --username dbuser \
329
+ --engine duckdb
330
+ ```
331
+
332
+ High-throughput import with larger batches (batch mode):
333
+ ```bash
334
+ pybutt import \
335
+ ./imports/orders/dbo_Orders_manifest.json \
336
+ --server sqlserver.example.com \
337
+ --database MyDatabase \
338
+ --table Orders \
339
+ --username dbuser \
340
+ --worker-count 4 \
341
+ --batch-size 5000 \
342
+ --transaction-mode batch \
343
+ --verbose
344
+ ```
345
+
346
+ Import with batch transactions (per-batch retries):
347
+ ```bash
348
+ pybutt import \
349
+ ./imports/data/dbo_LargeTable_manifest.json \
350
+ --server sqlserver.example.com \
351
+ --database MyDatabase \
352
+ --table LargeTable \
353
+ --username dbuser \
354
+ --transaction-mode batch
355
+ ```
356
+
357
+ Import with file-level transactions (all-or-nothing for critical data):
358
+ ```bash
359
+ pybutt import \
360
+ ./imports/financials/dbo_FinancialData_manifest.json \
361
+ --server sqlserver.example.com \
362
+ --database MyDatabase \
363
+ --table FinancialData \
364
+ --username dbuser \
365
+ --transaction-mode file
366
+ ```
367
+
368
+ #### Combine Command
369
+
370
+ Combine objects listed in a manifest file. This command supports two types of combines depending on the manifest type:
371
+ - **Files manifest (`type: "files"`)**: Concatenates multiple Parquet files into a single output Parquet file.
372
+ - **Tables manifest (`type: "tables"`)**: Combines multiple temporary/worker SQL tables into a single target table on your SQL Server.
373
+
374
+ ```bash
375
+ # File combine example:
376
+ pybutt combine \
377
+ ./exports/customers/dbo_Customers_manifest.json \
378
+ --output-file ./exports/customers/combined.parquet
379
+
380
+ # Table combine example:
381
+ pybutt combine \
382
+ ./exports/customers/dbo_Customers_temp_manifest.json \
383
+ --server YOUR_SERVER \
384
+ --database YOUR_DB \
385
+ --schema dbo \
386
+ --table Customers \
387
+ --username your_user
388
+ ```
389
+
390
+ **Combine Options:**
391
+
392
+ ```
393
+ manifest Path to manifest file (positional, required)
394
+ --output-file, -o Output Parquet file path (required for file combines)
395
+ --rowgroup-size, -R Rowgroup size for output Parquet file (default: 1048576)
396
+ --combined-manifest-filename, -m Override the combined manifest filename
397
+ --server, -s SQL Server hostname or instance (required for table combines)
398
+ --database, -d Target database (required for table combines)
399
+ --schema, -S Target schema (required for table combines)
400
+ --table, -t Target table name (required for table combines)
401
+ --username, -u SQL Server username (for table combines)
402
+ --password, -p SQL Server password (for table combines)
403
+ --trusted-connection, -T Use Windows integrated authentication
404
+ --driver, -D ODBC driver name (default: ODBC Driver 18 for SQL Server)
405
+ --trust-cert, -c Trust the SQL Server TLS certificate
406
+ --encrypt/--no-encrypt, -e/-n Enable/disable encrypted transport (default: enabled)
407
+ --retries, -r Number of retry attempts for transient SQL errors (default: 3)
408
+ --verbose, -V Show verbose logging output
409
+ ```
410
+
411
+ #### Inspect Command
412
+
413
+ Inspect details of the Parquet files listed in a manifest (including row counts, row group counts, size, and columns):
414
+
415
+ ```bash
416
+ pybutt inspect ./exports/customers/dbo_Customers_manifest.json
417
+ ```
418
+
419
+ **Inspect Options:**
420
+
421
+ ```
422
+ manifest Path to manifest.json file (positional, required)
423
+ --verbose, -v Show full column definitions, schema, and detailed metadata
424
+ ```
425
+
426
+ ### Password Input
427
+
428
+ When you provide a username without a password, PyButt will prompt you interactively:
429
+
430
+ ```bash
431
+ pybutt export \
432
+ --server myserver \
433
+ --database mydb \
434
+ --table mytable \
435
+ --output-path ./output \
436
+ --username myuser
437
+ # You'll be prompted: Enter your password: [hidden input]
438
+ ```
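The prompting rule above can be sketched in a few lines of Python. This is an illustration only: `resolve_password` is a hypothetical helper, not part of PyButt's API, and the assumption that no password is needed with Windows authentication is inferred from the `--trusted-connection` option.

```python
import getpass
from typing import Callable, Optional


def resolve_password(
    username: Optional[str],
    password: Optional[str],
    trusted_connection: bool = False,
    prompt: Callable[[str], str] = getpass.getpass,
) -> Optional[str]:
    """Return the password to use, prompting only when one is actually missing."""
    if trusted_connection or not username:
        return None  # assumed: integrated authentication needs no password
    if password is not None:
        return password  # supplied explicitly, e.g. via --password
    return prompt("Enter your password: ")  # hidden input when using getpass


# Demo with a stand-in prompt so the example does not block on input.
print(resolve_password("myuser", None, prompt=lambda _msg: "s3cret"))
```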
439
+
440
+ ### Python API
441
+
442
+ Use PyButt as a module in your Python projects:
443
+
444
+ #### Configuration
445
+
446
+ First, create a `SqlConfig` object with your connection details. `SqlConfig` is
447
+ purely connection configuration — schema and table are passed directly to
448
+ `Exporter`, `Importer`, and `TableCombine`.
449
+
450
+ ```python
451
+ from pybutt import SqlConfig, Exporter, Importer
452
+ from pathlib import Path
453
+
454
+ config = SqlConfig(
455
+ server="sqlserver.example.com",
456
+ database="MyDatabase",
457
+ username="dbuser",
458
+ password="dbpassword",
459
+ trusted_connection=False,
460
+ trust_cert=False,
461
+ encrypt=True,
462
+ retries=3,
463
+ )
464
+ ```
465
+
466
+ Or with Windows authentication:
467
+
468
+ ```python
469
+ config = SqlConfig(
470
+ server="SQLSERVER01\\INSTANCE",
471
+ database="MyDatabase",
472
+ trusted_connection=True,
473
+ )
474
+ ```
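For orientation, the `SqlConfig` fields above map naturally onto standard ODBC connection-string keywords for ODBC Driver 18. The sketch below shows one plausible mapping. `build_connection_string` is a hypothetical helper, and whether PyButt assembles its connection string this way is an assumption.

```python
from typing import Optional


def build_connection_string(
    server: str,
    database: str,
    username: Optional[str] = None,
    password: Optional[str] = None,
    trusted_connection: bool = False,
    trust_cert: bool = False,
    encrypt: bool = True,
    driver: str = "ODBC Driver 18 for SQL Server",
) -> str:
    """Assemble an ODBC connection string from SqlConfig-style fields."""
    parts = [f"DRIVER={{{driver}}}", f"SERVER={server}", f"DATABASE={database}"]
    if trusted_connection:
        parts.append("Trusted_Connection=yes")  # Windows integrated authentication
    else:
        # Note: values containing ';' or '}' would need brace-escaping.
        parts += [f"UID={username}", f"PWD={password}"]
    parts.append(f"Encrypt={'yes' if encrypt else 'no'}")
    parts.append(f"TrustServerCertificate={'yes' if trust_cert else 'no'}")
    return ";".join(parts)


print(build_connection_string("SQLSERVER01\\INSTANCE", "MyDatabase", trusted_connection=True))
```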
475
+
476
+ #### Exporting Data
477
+
478
+ ```python
479
+ from pathlib import Path
480
+
481
+ exporter = Exporter(
482
+ config=config,
483
+ table="Customers", # Target table name
484
+ output_path=Path("./exports/customers"),
485
+ schema="dbo", # Schema (default: dbo)
486
+ pk_column=None, # None for CHECKSUM partitioning
487
+ columns=None, # None for all columns
488
+ worker_count=4, # Number of parallel processes
489
+ file_count=4, # Number of output files
490
+ fetch_size=None, # Cursor fetch size for pyodbc export (None = auto)
491
+ )
492
+
493
+ exporter.perform_work()
494
+ print("Export completed successfully!")
495
+ ```
496
+
497
+ With primary key partitioning:
498
+
499
+ ```python
500
+ exporter = Exporter(
501
+ config=config,
502
+ table="Orders",
503
+ output_path=Path("./exports/orders"),
504
+ pk_column="OrderID", # Use PK for deterministic partitioning
505
+ columns=["OrderID", "OrderDate", "Amount"],
506
+ worker_count=8,
507
+ file_count=8,
508
+ fetch_size=None, # Optional: tune pyodbc fetch size for streaming
509
+ )
510
+
511
+ exporter.perform_work()
512
+ ```
513
+
514
+ With multiple workers and files:
515
+
516
+ ```python
517
+ exporter = Exporter(
518
+ config=config,
519
+ table="Orders",
520
+ output_path=Path("./exports/orders"),
521
+ worker_count=4,
522
+ file_count=4, # Distribute across 4 output files
523
+ rowgroup_size=1_048_576, # 1M rows per rowgroup
524
+ )
525
+
526
+ exporter.perform_work()
527
+ ```
528
+
529
+ #### Importing Data
530
+
531
+ **Default (rowgroup-level transactions):**
532
+ ```python
533
+ from pybutt import TransactionMode
534
+
535
+ importer = Importer(
536
+ config=config,
537
+ table="Customers",
538
+ input_path=Path("./exports/customers"),
539
+ manifest_filename="dbo_Customers_manifest.json",
540
+ worker_count=4, # Number of parallel threads
541
+ batch_size=1000, # Rows per batch
542
+ transaction_mode=TransactionMode.ROWGROUP, # Each row group in its own transaction (default)
543
+ )
544
+
545
+ importer.perform_work()
546
+ print("Import completed successfully!")
547
+ ```
548
+
549
+ **With batch-level transactions (per-batch retries):**
550
+ ```python
551
+ importer = Importer(
552
+ config=config,
553
+ table="Orders",
554
+ input_path=Path("./exports/orders"),
555
+ manifest_filename="dbo_Orders_manifest.json",
556
+ worker_count=4,
557
+ batch_size=5000,
558
+ transaction_mode=TransactionMode.BATCH, # Each batch in its own transaction
559
+ )
560
+
561
+ importer.perform_work()
562
+ ```
563
+
564
+ **With file-level transactions (all-or-nothing safety):**
565
+ ```python
566
+ importer = Importer(
567
+ config=config,
568
+ table="LargeTable",
569
+ input_path=Path("./exports/data"),
570
+ manifest_filename="dbo_LargeTable_manifest.json",
571
+ worker_count=4,
572
+ batch_size=1000,
573
+ transaction_mode=TransactionMode.FILE, # Entire file in one transaction
574
+ )
575
+
576
+ importer.perform_work()
577
+ ```
578
+
579
+ #### Complete Example
580
+
581
+ ```python
582
+ from pathlib import Path
583
+ from pybutt import SqlConfig, TransactionMode, Exporter, Importer
584
+
585
+ # Configure connection (purely connection details — no schema/table)
586
+ config = SqlConfig(
587
+ server="sqlserver.example.com",
588
+ database="MyDatabase",
589
+ username="dbuser",
590
+ password="dbpassword",
591
+ )
592
+
593
+ # Export
594
+ export_path = Path("./data_export")
595
+ exporter = Exporter(
596
+ config=config,
597
+ table="LargeTable",
598
+ output_path=export_path,
599
+ worker_count=4,
600
+ file_count=4,
601
+ )
602
+ exporter.perform_work()
603
+ print("✓ Export complete")
604
+
605
+ # Import into another table (reuse same connection config)
606
+ importer = Importer(
607
+ config=config,
608
+ table="LargeTableBackup",
609
+ input_path=export_path,
610
+ manifest_filename="dbo_LargeTable_manifest.json",
611
+ worker_count=4,
612
+ batch_size=5000,
613
+ transaction_mode=TransactionMode.ROWGROUP, # Rowgroup-level transactions (default)
614
+ )
615
+ importer.perform_work()
616
+ print("✓ Import complete")
617
+ ```
618
+
619
+ ## Manifest Files
620
+
621
+ When exporting, PyButt automatically creates a manifest JSON file listing all generated Parquet files. This manifest is required for importing.
622
+
623
+ As of version 2, a manifest is a JSON object with a `version`, a `type`, and an
624
+ `entries` list. Two manifest types are supported:
625
+
626
+ - **`files`** — `entries` are Parquet file names (written by `export` and file
627
+ `combine`).
628
+ - **`tables`** — `entries` are SQL Server table names (written during multi-worker
629
+ `import` and table `combine`, for consumption by the `combine` command).
630
+
631
+ **Example file manifest** (`dbo_MyTable_manifest.json`):
632
+ ```json
633
+ {
634
+ "version": 2,
635
+ "type": "files",
636
+ "entries": [
637
+ "dbo_MyTable_part_00000.parquet",
638
+ "dbo_MyTable_part_00001.parquet",
639
+ "dbo_MyTable_part_00002.parquet",
640
+ "dbo_MyTable_part_00003.parquet"
641
+ ]
642
+ }
643
+ ```
644
+
645
+ **Example table manifest** (`dbo_MyTable_temp_manifest.json`):
646
+ ```json
647
+ {
648
+ "version": 2,
649
+ "type": "tables",
650
+ "entries": [
651
+ "dbo.MyTable_01_a1b2c3d4",
652
+ "dbo.MyTable_02_e5f6a7b8"
653
+ ]
654
+ }
655
+ ```
656
+
657
+ For backwards compatibility, legacy version 1 manifests — a plain JSON array of
658
+ Parquet file names — are still accepted when reading and are treated as a `files`
659
+ manifest:
660
+ ```json
661
+ [
662
+ "dbo_MyTable_part_00000.parquet",
663
+ "dbo_MyTable_part_00001.parquet"
664
+ ]
665
+ ```
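If you need to read manifests from your own scripts, the two layouts above can be normalised into one shape. The sketch below is not PyButt's own reader. The `Manifest` dataclass and `parse_manifest` function are hypothetical names, and the validation rules are assumptions based on the description above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    version: int
    type: str          # "files" or "tables"
    entries: list[str]


def parse_manifest(text: str) -> Manifest:
    """Parse manifest JSON, accepting version 2 objects and legacy version 1 arrays."""
    data = json.loads(text)
    if isinstance(data, list):
        # Legacy version 1: a bare array of Parquet file names.
        return Manifest(version=1, type="files", entries=[str(e) for e in data])
    if not isinstance(data, dict):
        raise ValueError("manifest must be a JSON object or array")
    if data.get("type") not in ("files", "tables"):
        raise ValueError(f"unsupported manifest type: {data.get('type')!r}")
    return Manifest(
        version=int(data["version"]),
        type=data["type"],
        entries=[str(e) for e in data["entries"]],
    )


legacy = parse_manifest('["dbo_MyTable_part_00000.parquet"]')
print(legacy.version, legacy.type)  # prints: 1 files
```

To read from disk, pass `Path(manifest_path).read_text()` to `parse_manifest`.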
666
+
667
+ ## Performance Tips
668
+
669
+ - **Export**: Increase `--worker-count` and `--file-count` for large tables (use values matching your CPU core count)
670
+ - **Import**: Use `--worker-count` up to your CPU core count and adjust `--batch-size` (higher values = fewer database round trips)
671
+ - **mssql-python engine**: The default import engine (`mssql-python`) uses native bulk insert (`bulkcopy`), which is significantly faster than the parameterized `INSERT` statements used by pyodbc
672
+ - **Primary Key Partitioning**: Use `--pk-column` for deterministic partitioning when re-importing the same data
673
+ - **Encryption**: Use `--no-encrypt` only on trusted networks; it reduces overhead but sends data unencrypted
674
+
675
+ ## Transaction Modes for Import
676
+
677
+ The `--transaction-mode` option controls how data is committed during import and how retries are handled. Choose based on your safety, performance, and recovery needs:
678
+
679
+ | Mode | Behavior | Retry Scope | Best For | Pros | Cons |
680
+ |------|----------|-------------|----------|------|------|
681
+ | **batch** | Each batch of `batch_size` rows commits together | Per-batch retry | High throughput with per-batch retries | Fast, limited lock duration, failed batches retry independently | Rare edge case: partial batch on non-retryable error |
682
+ | **rowgroup** | Each Parquet row group commits together | Per-rowgroup retry | **Default — recommended for most use cases** | Row group boundary safety, independent rowgroup retries | Longer locks than batch mode, fewer retry opportunities |
683
+ | **file** | Entire file in one transaction | Entire file retry | Production, critical data | All-or-nothing atomicity, complete data integrity | Holds locks longer on large files; any failure retries the entire file |
684
+
685
+ **Retry Behavior:**
686
+ - **batch/rowgroup modes**: When a batch or rowgroup fails, only that unit is rolled back and retried (up to `--retries` times). Already-committed units remain intact.
687
+ - **file mode**: If any part of the file fails, the whole file transaction is rolled back and the entire file is retried (up to `--retries` times). Nothing from that file is committed until every row has loaded.
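To make the trade-off concrete, the following sketch counts how many transactions a single file produces in each mode. It is illustrative arithmetic, not PyButt code. `transactions_per_file` is a hypothetical helper, and it assumes batches are cut inside a row group and never span two.

```python
import math


def transactions_per_file(rowgroup_rows: list[int], batch_size: int, mode: str) -> int:
    """Count the commit units for one Parquet file under a given transaction mode."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if mode == "file":
        return 1  # the whole file commits (or rolls back) together
    if mode == "rowgroup":
        return len(rowgroup_rows)  # one commit per Parquet row group
    if mode == "batch":
        # Assumption: each row group is split into batches independently.
        return sum(math.ceil(rows / batch_size) for rows in rowgroup_rows)
    raise ValueError(f"unknown mode: {mode!r}")


# A file with two full default-size row groups and one partial row group.
rowgroups = [1_048_576, 1_048_576, 500_000]
for mode in ("batch", "rowgroup", "file"):
    print(mode, transactions_per_file(rowgroups, batch_size=5000, mode=mode))
# prints: batch 520, rowgroup 3, file 1 (one per line)
```

More commit units mean shorter locks and smaller retries, and fewer units mean stronger atomicity. That is the same trade-off the table describes.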
688
+
689
+ **Recommended Configuration:**
690
+ ```bash
691
+ pybutt import \
692
+ ./data/dbo_YOUR_TABLE_manifest.json \
693
+ --server YOUR_SERVER \
694
+ --database YOUR_DB \
695
+ --table YOUR_TABLE \
696
+ --username your_user \
697
+ --batch-size 5000 \
698
+ --worker-count 4
699
+ ```
700
+
701
+ **Choosing a mode:**
702
+ - **Default**: Use `rowgroup` (the default), which balances data safety, locking/blocking, and speed
703
+ - **High Throughput**: Use `batch` for per-batch retries and limited lock duration
704
+ - **Safety-Critical (Small Files)**: Use `file` for complete all-or-nothing atomicity per file, at the cost of a higher chance of locking/blocking
705
+
706
+ **Retry Configuration:**
707
+ Use `--retries` (default: 3) to control retry attempts. This applies at the transaction scope level:
708
+ ```bash
709
+ # Retry individual batches up to 5 times before failing
710
+ pybutt import \
711
+ ... \
712
+ --transaction-mode batch \
713
+ --retries 5
714
+ ```
715
+
716
+ ## Troubleshooting
717
+
718
+ **Connection Issues:**
719
+ - Verify SQL Server hostname and port
720
+ - Check ODBC driver: `odbcinst -q -d`
721
+ - Test ODBC connection: `isql -v your_dsn username password`
722
+
723
+ **Empty Table Errors:**
724
+ - Ensure the table exists and contains data
725
+
726
+ **Memory Issues:**
727
+ - Reduce `--worker-count` — it multiplies per-worker memory in both directions.
728
+ - Export: lower `--rowgroup-size`. The writer buffers a whole rowgroup in memory,
729
+ so this (not `--fetch-size`) drives export memory.
730
+ - Import: peak memory is one Parquet rowgroup (pyodbc/mssql-python) or the whole
731
+ file (duckdb engine) — not `--batch-size`. Re-export with a smaller
732
+ `--rowgroup-size`, or avoid the duckdb engine for very large files.
733
+ - Process smaller tables first to verify setup.
734
+ - Diagnose with `--mem-heartbeat <seconds>` and the `rss`/`peak` fields on each
735
+ log line — see [docs/logging.md](docs/logging.md#memory-observability).
736
+ - See [docs/concepts.md](docs/concepts.md) for the full memory model.
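As a rough guide to the export advice above, buffered memory scales with rowgroup size times row width times worker count. The estimate below is a back-of-envelope sketch. `estimate_export_buffer_bytes` is a hypothetical helper, the 200-byte average row is an invented figure, and real usage will be higher because of Arrow and driver overhead.

```python
def estimate_export_buffer_bytes(rowgroup_size: int, avg_row_bytes: int, worker_count: int) -> int:
    """Approximate bytes held in rowgroup buffers across all export workers."""
    return rowgroup_size * avg_row_bytes * worker_count


MIB = 1024 ** 2

# Default rowgroup size with 4 workers and an assumed 200-byte average row.
default = estimate_export_buffer_bytes(1_048_576, 200, 4)
# Same export after lowering --rowgroup-size to a quarter of the default.
reduced = estimate_export_buffer_bytes(262_144, 200, 4)

print(f"{default / MIB:.0f} MiB -> {reduced / MIB:.0f} MiB")  # prints: 800 MiB -> 200 MiB
```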
737
+
738
+ **Frequent Batch/Rowgroup Failures:**
739
+ - Increase `--retries` and `--batch-size` for more resilient imports
740
+ - Check SQL Server logs for transient connection issues
741
+ - Verify network stability if errors are intermittent
742
+
743
+ ## Contributions
744
+
745
+ When coding, please consider the following:
746
+
747
+ - Use the developer environment: `pip install -e ".[dev]"`
748
+ - Write tests for your changes and new features, and make sure they pass: `pytest`
749
+ - Run isort: `isort .`
750
+ - Run black: `black .`
751
+ - Run ruff: `ruff check .`
752
+
753
+ ## License
754
+
755
+ See LICENSE file for details.
756
+