datablade 0.0.5.tar.gz → 0.0.6.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (60)
  1. {datablade-0.0.5/src/datablade.egg-info → datablade-0.0.6}/PKG-INFO +68 -13
  2. {datablade-0.0.5 → datablade-0.0.6}/pyproject.toml +5 -1
  3. {datablade-0.0.5 → datablade-0.0.6}/readme.md +67 -12
  4. {datablade-0.0.5 → datablade-0.0.6}/src/datablade/__init__.py +10 -2
  5. datablade-0.0.6/src/datablade/blade.py +322 -0
  6. {datablade-0.0.5 → datablade-0.0.6}/src/datablade/dataframes/__init__.py +8 -0
  7. {datablade-0.0.5 → datablade-0.0.6}/src/datablade/dataframes/frames.py +127 -27
  8. datablade-0.0.6/src/datablade/dataframes/readers.py +1367 -0
  9. datablade-0.0.6/src/datablade/docs/ARCHITECTURE.md +102 -0
  10. datablade-0.0.6/src/datablade/docs/OBJECT_REGISTRY.md +194 -0
  11. datablade-0.0.6/src/datablade/docs/README.md +57 -0
  12. datablade-0.0.6/src/datablade/docs/TESTING.md +37 -0
  13. datablade-0.0.6/src/datablade/docs/USAGE.md +409 -0
  14. datablade-0.0.6/src/datablade/docs/__init__.py +87 -0
  15. datablade-0.0.6/src/datablade/docs/__main__.py +6 -0
  16. datablade-0.0.6/src/datablade/io/json.py +70 -0
  17. datablade-0.0.6/src/datablade/io/zip.py +111 -0
  18. datablade-0.0.6/src/datablade/registry.py +581 -0
  19. datablade-0.0.6/src/datablade/sql/__init__.py +56 -0
  20. {datablade-0.0.5 → datablade-0.0.6}/src/datablade/sql/bulk_load.py +309 -49
  21. datablade-0.0.6/src/datablade/sql/ddl.py +402 -0
  22. {datablade-0.0.5 → datablade-0.0.6}/src/datablade/sql/ddl_pyarrow.py +150 -26
  23. {datablade-0.0.5 → datablade-0.0.6}/src/datablade/sql/dialects.py +2 -0
  24. {datablade-0.0.5 → datablade-0.0.6}/src/datablade/sql/quoting.py +2 -0
  25. datablade-0.0.6/src/datablade/sql/schema_spec.py +65 -0
  26. datablade-0.0.6/src/datablade/sql/sqlserver.py +390 -0
  27. {datablade-0.0.5 → datablade-0.0.6}/src/datablade/utils/__init__.py +2 -1
  28. {datablade-0.0.5 → datablade-0.0.6}/src/datablade/utils/lists.py +3 -0
  29. {datablade-0.0.5 → datablade-0.0.6}/src/datablade/utils/logging.py +46 -1
  30. datablade-0.0.6/src/datablade/utils/strings.py +249 -0
  31. {datablade-0.0.5 → datablade-0.0.6/src/datablade.egg-info}/PKG-INFO +68 -13
  32. {datablade-0.0.5 → datablade-0.0.6}/src/datablade.egg-info/SOURCES.txt +11 -0
  33. {datablade-0.0.5 → datablade-0.0.6}/tests/test_dataframes.py +28 -4
  34. {datablade-0.0.5 → datablade-0.0.6}/tests/test_integration.py +9 -0
  35. {datablade-0.0.5 → datablade-0.0.6}/tests/test_io.py +1 -1
  36. {datablade-0.0.5 → datablade-0.0.6}/tests/test_readers.py +122 -6
  37. datablade-0.0.6/tests/test_registry.py +118 -0
  38. {datablade-0.0.5 → datablade-0.0.6}/tests/test_sql.py +397 -10
  39. {datablade-0.0.5 → datablade-0.0.6}/tests/test_utils.py +36 -0
  40. datablade-0.0.5/src/datablade/blade.py +0 -153
  41. datablade-0.0.5/src/datablade/dataframes/readers.py +0 -540
  42. datablade-0.0.5/src/datablade/io/json.py +0 -33
  43. datablade-0.0.5/src/datablade/io/zip.py +0 -73
  44. datablade-0.0.5/src/datablade/sql/__init__.py +0 -32
  45. datablade-0.0.5/src/datablade/sql/ddl.py +0 -227
  46. datablade-0.0.5/src/datablade/utils/strings.py +0 -86
  47. {datablade-0.0.5 → datablade-0.0.6}/LICENSE +0 -0
  48. {datablade-0.0.5 → datablade-0.0.6}/setup.cfg +0 -0
  49. {datablade-0.0.5 → datablade-0.0.6}/src/datablade/core/__init__.py +0 -0
  50. {datablade-0.0.5 → datablade-0.0.6}/src/datablade/core/frames.py +0 -0
  51. {datablade-0.0.5 → datablade-0.0.6}/src/datablade/core/json.py +0 -0
  52. {datablade-0.0.5 → datablade-0.0.6}/src/datablade/core/lists.py +0 -0
  53. {datablade-0.0.5 → datablade-0.0.6}/src/datablade/core/messages.py +0 -0
  54. {datablade-0.0.5 → datablade-0.0.6}/src/datablade/core/strings.py +0 -0
  55. {datablade-0.0.5 → datablade-0.0.6}/src/datablade/core/zip.py +0 -0
  56. {datablade-0.0.5 → datablade-0.0.6}/src/datablade/io/__init__.py +0 -0
  57. {datablade-0.0.5 → datablade-0.0.6}/src/datablade/utils/messages.py +0 -0
  58. {datablade-0.0.5 → datablade-0.0.6}/src/datablade.egg-info/dependency_links.txt +0 -0
  59. {datablade-0.0.5 → datablade-0.0.6}/src/datablade.egg-info/requires.txt +0 -0
  60. {datablade-0.0.5 → datablade-0.0.6}/src/datablade.egg-info/top_level.txt +0 -0
PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: datablade
- Version: 0.0.5
+ Version: 0.0.6
  Summary: datablade is a suite of functions to provide standard syntax across data engineering projects.
  Author-email: Brent Carpenetti <brentwc.git@pm.me>
  License: MIT License
@@ -63,7 +63,6 @@ Dynamic: license-file

  # datablade

- [![Tests](https://github.com/brentwc/data-prep/actions/workflows/test.yml/badge.svg)](https://github.com/brentwc/data-prep/actions/workflows/test.yml)
  [![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

@@ -115,20 +114,23 @@ datablade is ideal for:
  ## Installation

  ```bash
- pip install git+https://github.com/brentwc/data-prep.git
+ pip install datablade
  ```

  **Optional dependencies:**

  ```bash
  # For high-performance file reading with Polars
- pip install git+https://github.com/brentwc/data-prep.git#egg=datablade[performance]
+ pip install "datablade[performance]"

- # For development and testing
- pip install git+https://github.com/brentwc/data-prep.git#egg=datablade[dev]
+ # For testing
+ pip install "datablade[test]"
+
+ # For development (includes testing + lint/format tooling)
+ pip install "datablade[dev]"

  # All optional dependencies
- pip install git+https://github.com/brentwc/data-prep.git#egg=datablade[all]
+ pip install "datablade[all]"
  ```

  ## Features
@@ -172,6 +174,7 @@ Multi-dialect SQL utilities:
  - Dialect-aware identifier quoting
  - CREATE TABLE generation for all dialects (from pandas DataFrames)
  - CREATE TABLE generation from Parquet schemas (schema-only, via PyArrow)
+ - Optional `schema_spec` overrides for column types, nullability, and string sizing
  - Bulk loading helpers:
    - SQL Server: executes `bcp` via subprocess
    - PostgreSQL/MySQL/DuckDB: returns command strings you can run in your environment
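As context for the new `schema_spec` bullet above, here is a minimal sketch of how such an override might be passed to `generate_create_table`. The keyword name, the override shape, and the `table_name`/`dialect` arguments are assumptions for illustration only; the actual definitions live in the new `src/datablade/sql/schema_spec.py`, which is not shown in this diff.

```python
# Hypothetical sketch -- keyword names and the override shape are assumptions,
# not the confirmed datablade 0.0.6 API.
import pandas as pd

from datablade.sql import Dialect, generate_create_table

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# Assumed shape: per-column overrides for type, nullability, and string sizing.
schema_spec = {
    "id": {"type": "BIGINT", "nullable": False},
    "name": {"type": "VARCHAR", "length": 100},
}

ddl = generate_create_table(
    df,
    table_name="sales",        # assumed keyword
    dialect=Dialect.POSTGRES,  # assumed enum member name
    schema_spec=schema_spec,   # new in 0.0.6 per the feature list; shape assumed
)
print(ddl)
```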
@@ -215,10 +218,25 @@ ddl_from_parquet = generate_create_table_from_parquet(
  data = get_json('https://api.example.com/data.json')
  ```

+ Most file path parameters accept `str` or `pathlib.Path`. To treat case mismatches
+ as errors on case-insensitive filesystems, use `configure_paths(path_strict=True)`.
+
  ### Memory-Aware File Reading

+ See the file format support matrix in the bundled USAGE doc:
+
+ ```bash
+ python -m datablade.docs --show USAGE
+ ```
+
  ```python
- from datablade.dataframes import read_file_chunked, read_file_iter, read_file_to_parquets, stream_to_parquets
+ from datablade.dataframes import (
+     excel_to_parquets,
+     read_file_chunked,
+     read_file_iter,
+     read_file_to_parquets,
+     stream_to_parquets,
+ )

  # Read large files in chunks
  for chunk in read_file_chunked('huge_file.csv', memory_fraction=0.5):
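The path-handling note added above maps to a short, mostly grounded sketch: `configure_paths(path_strict=True)` and the `read_file_iter` keywords come from the readme text itself, while the exception raised on a case mismatch is not specified in this diff.

```python
from pathlib import Path

from datablade import configure_paths
from datablade.dataframes import read_file_iter

# Opt in to strict case checking on case-insensitive filesystems
# (macOS/Windows); the error raised on a mismatch is not documented here.
configure_paths(path_strict=True)

# Path parameters accept str or pathlib.Path, per the readme.
source = Path("data") / "huge_file.csv"

for chunk in read_file_iter(source, memory_fraction=0.3, verbose=True):
    print(len(chunk))
```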
@@ -232,6 +250,10 @@ for chunk in read_file_iter('huge_file.csv', memory_fraction=0.3, verbose=True):
  for chunk in read_file_iter('huge_file.parquet', memory_fraction=0.3, verbose=True):
      process(chunk)

+ # Excel streaming is available with openpyxl installed (read-only mode)
+ for chunk in read_file_iter('large.xlsx', chunksize=25_000, verbose=True):
+     process(chunk)
+
  # Partition large files to multiple Parquets
  files = read_file_to_parquets(
      'large_file.csv',
@@ -248,6 +270,15 @@ files = stream_to_parquets(
      convert_types=True,
      verbose=True,
  )
+
+ # Excel streaming to Parquet partitions
+ files = excel_to_parquets(
+     'large.xlsx',
+     output_dir='partitioned_excel/',
+     rows_per_file=200_000,
+     convert_types=True,
+     verbose=True,
+ )
  ```

  ## Blade (Optional Facade)
@@ -284,10 +315,21 @@ ddl2 = blade.create_table_sql_from_parquet(

  ## Documentation

- - [Docs Home](docs/README.md) - Documentation landing page
- - [Usage Guide](docs/USAGE.md) - File reading (including streaming), SQL, IO, logging
- - [Testing Guide](docs/TESTING.md) - How to run tests locally
- - [Test Suite](tests/README.md) - Testing documentation and coverage
+ Docs are bundled with the installed package:
+
+ ```bash
+ python -m datablade.docs --list
+ python -m datablade.docs --show USAGE
+ python -m datablade.docs --write-dir .\datablade-docs
+ ```
+
+ After writing docs to disk, open the markdown files locally:
+
+ - README (docs landing page)
+ - USAGE (file reading, streaming, SQL, IO, logging)
+ - TESTING (how to run tests locally)
+ - ARCHITECTURE (pipeline overview)
+ - OBJECT_REGISTRY (registry reference)

  ## Testing

@@ -304,7 +346,11 @@ pytest
  pytest --cov=datablade --cov-report=html
  ```

- See [tests/README.md](tests/README.md) for detailed testing documentation.
+ For detailed testing documentation, use the bundled TESTING doc:
+
+ ```bash
+ python -m datablade.docs --show TESTING
+ ```

  ## Backward Compatibility

@@ -332,6 +378,12 @@ from datablade.core.json import get
  - **Streaming vs materializing**:
    - Use `read_file_iter()` to process arbitrarily large files chunk-by-chunk.
    - `read_file_smart()` returns a single DataFrame and may still be memory-intensive.
+ - **Chunk concatenation**: the large-file pandas fallback in `read_file_smart()` can
+   temporarily spike memory usage during concat. Use `read_file_iter()` or
+   `return_type="iterator"` to avoid concatenation.
+ - **Polars materialization**: when returning a pandas DataFrame, Polars still
+   collects into memory; use `return_type="polars"` or `"polars_lazy"` to keep
+   Polars frames.
  - **Parquet support**:
    - Streaming reads support single `.parquet` files.
    - Parquet “dataset directories” (Hive partitions / directory-of-parquets) are not a primary target API.
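The new memory notes above reference a `return_type` option; a small sketch follows. It assumes `return_type` is a keyword of `read_file_smart` (the notes imply this, but the signature itself is not part of the diff).

```python
# Sketch only: return_type as a read_file_smart keyword is inferred from the
# notes above, not from a signature shown in this diff.
from datablade import read_file_smart

# Default: one pandas DataFrame (the large-file fallback may spike memory on concat).
df = read_file_smart("large_file.csv")

# Avoid the concat spike by requesting an iterator of chunks instead.
for chunk in read_file_smart("large_file.csv", return_type="iterator"):
    ...

# Keep data in Polars (requires the "performance" extra); "polars_lazy" would
# return a lazy frame per the notes.
pl_df = read_file_smart("large_file.csv", return_type="polars")
```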
@@ -339,6 +391,9 @@ from datablade.core.json import get
    - Uses the Parquet schema (PyArrow) without scanning data.
    - Complex/nested columns (struct/list/map/union) are dropped and logged as warnings.
  - **DDL scope**: `CREATE TABLE` generation is column/type oriented (no indexes/constraints).
+ - **SQL Server bulk load**: the SQL Server helpers use the `bcp` CLI and require it
+   to be installed and available on PATH. When using `-U`/`-P`, credentials are
+   passed via process args (logs are redacted); prefer `-T` or `-G` where possible.

  **Optional dependencies:**

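For the SQL Server bulk-load note above, a heavily hedged sketch of `bulk_load`: only the behavior (SQL Server runs `bcp`, other dialects return a command string) and the `-T`/`-G` versus `-U`/`-P` guidance come from this diff; every keyword below is an assumption.

```python
# Hedged sketch: bulk_load's parameters are assumptions; only the documented
# behaviour (SQL Server -> runs bcp, others -> returns a command string) and
# the bcp auth flags are taken from the notes above.
from datablade.sql import Dialect, bulk_load

# SQL Server: requires the bcp CLI on PATH; prefer -T (trusted connection) or
# -G (Azure AD) so credentials never appear in process arguments.
bulk_load(
    "staging.sales",            # assumed: target table
    "sales.csv",                # assumed: source data file
    dialect=Dialect.SQLSERVER,  # assumed enum member name
    server="myserver",          # assumed keyword
    trusted_connection=True,    # assumed keyword standing in for bcp -T
)

# PostgreSQL/MySQL/DuckDB: per the feature list, a command string is returned
# for you to run in your own environment.
cmd = bulk_load("staging.sales", "sales.csv", dialect=Dialect.POSTGRES)
print(cmd)
```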
pyproject.toml

@@ -6,7 +6,7 @@ build-backend = "setuptools.build_meta"
  name = "datablade"
  dynamic = ["version"]
  description = "datablade is a suite of functions to provide standard syntax across data engineering projects."
- readme = "readme.md"
+ readme = { file = "readme.md", content-type = "text/markdown" }
  requires-python = ">=3.12"
  license = { file = "LICENSE" }
  authors = [
@@ -56,6 +56,10 @@ all = [
  [tool.setuptools]
  include-package-data = true

+ [tool.setuptools.package-data]
+ datablade = ["docs/*.md"]
+ "datablade.docs" = ["*.md"]
+
  [tool.setuptools.packages.find]
  where = ["src"]

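Because the package-data stanza above ships the markdown files inside the `datablade.docs` package, they can also be read with the standard library, independent of the documented `python -m datablade.docs` CLI (a sketch that assumes only the declared `docs/*.md` layout):

```python
# Reads a bundled doc straight from package data using only the stdlib; this
# relies on the package-data declaration above, not on any datablade.docs helper.
from importlib.resources import files

usage_md = files("datablade.docs").joinpath("USAGE.md").read_text(encoding="utf-8")
print(usage_md.splitlines()[0])
```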
readme.md

@@ -1,6 +1,5 @@
  # datablade

- [![Tests](https://github.com/brentwc/data-prep/actions/workflows/test.yml/badge.svg)](https://github.com/brentwc/data-prep/actions/workflows/test.yml)
  [![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

@@ -52,20 +51,23 @@ datablade is ideal for:
  ## Installation

  ```bash
- pip install git+https://github.com/brentwc/data-prep.git
+ pip install datablade
  ```

  **Optional dependencies:**

  ```bash
  # For high-performance file reading with Polars
- pip install git+https://github.com/brentwc/data-prep.git#egg=datablade[performance]
+ pip install "datablade[performance]"

- # For development and testing
- pip install git+https://github.com/brentwc/data-prep.git#egg=datablade[dev]
+ # For testing
+ pip install "datablade[test]"
+
+ # For development (includes testing + lint/format tooling)
+ pip install "datablade[dev]"

  # All optional dependencies
- pip install git+https://github.com/brentwc/data-prep.git#egg=datablade[all]
+ pip install "datablade[all]"
  ```

  ## Features
@@ -109,6 +111,7 @@ Multi-dialect SQL utilities:
  - Dialect-aware identifier quoting
  - CREATE TABLE generation for all dialects (from pandas DataFrames)
  - CREATE TABLE generation from Parquet schemas (schema-only, via PyArrow)
+ - Optional `schema_spec` overrides for column types, nullability, and string sizing
  - Bulk loading helpers:
    - SQL Server: executes `bcp` via subprocess
    - PostgreSQL/MySQL/DuckDB: returns command strings you can run in your environment
@@ -152,10 +155,25 @@ ddl_from_parquet = generate_create_table_from_parquet(
  data = get_json('https://api.example.com/data.json')
  ```

+ Most file path parameters accept `str` or `pathlib.Path`. To treat case mismatches
+ as errors on case-insensitive filesystems, use `configure_paths(path_strict=True)`.
+
  ### Memory-Aware File Reading

+ See the file format support matrix in the bundled USAGE doc:
+
+ ```bash
+ python -m datablade.docs --show USAGE
+ ```
+
  ```python
- from datablade.dataframes import read_file_chunked, read_file_iter, read_file_to_parquets, stream_to_parquets
+ from datablade.dataframes import (
+     excel_to_parquets,
+     read_file_chunked,
+     read_file_iter,
+     read_file_to_parquets,
+     stream_to_parquets,
+ )

  # Read large files in chunks
  for chunk in read_file_chunked('huge_file.csv', memory_fraction=0.5):
@@ -169,6 +187,10 @@ for chunk in read_file_iter('huge_file.csv', memory_fraction=0.3, verbose=True):
  for chunk in read_file_iter('huge_file.parquet', memory_fraction=0.3, verbose=True):
      process(chunk)

+ # Excel streaming is available with openpyxl installed (read-only mode)
+ for chunk in read_file_iter('large.xlsx', chunksize=25_000, verbose=True):
+     process(chunk)
+
  # Partition large files to multiple Parquets
  files = read_file_to_parquets(
      'large_file.csv',
@@ -185,6 +207,15 @@ files = stream_to_parquets(
      convert_types=True,
      verbose=True,
  )
+
+ # Excel streaming to Parquet partitions
+ files = excel_to_parquets(
+     'large.xlsx',
+     output_dir='partitioned_excel/',
+     rows_per_file=200_000,
+     convert_types=True,
+     verbose=True,
+ )
  ```

  ## Blade (Optional Facade)
@@ -221,10 +252,21 @@ ddl2 = blade.create_table_sql_from_parquet(

  ## Documentation

- - [Docs Home](docs/README.md) - Documentation landing page
- - [Usage Guide](docs/USAGE.md) - File reading (including streaming), SQL, IO, logging
- - [Testing Guide](docs/TESTING.md) - How to run tests locally
- - [Test Suite](tests/README.md) - Testing documentation and coverage
+ Docs are bundled with the installed package:
+
+ ```bash
+ python -m datablade.docs --list
+ python -m datablade.docs --show USAGE
+ python -m datablade.docs --write-dir .\datablade-docs
+ ```
+
+ After writing docs to disk, open the markdown files locally:
+
+ - README (docs landing page)
+ - USAGE (file reading, streaming, SQL, IO, logging)
+ - TESTING (how to run tests locally)
+ - ARCHITECTURE (pipeline overview)
+ - OBJECT_REGISTRY (registry reference)

  ## Testing

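For the Blade facade referenced in the hunk above, a speculative sketch: only the `Blade` class and the `create_table_sql_from_parquet` method name appear in this diff, so the constructor and keyword arguments below are assumptions.

```python
# Speculative sketch -- constructor and keywords are assumptions; only the class
# and method name are taken from this diff.
from datablade import Blade

blade = Blade()  # assumed no-argument construction
ddl2 = blade.create_table_sql_from_parquet(
    "large_file.parquet",  # assumed: source Parquet path
    table_name="staging",  # assumed keyword
)
print(ddl2)
```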
@@ -241,7 +283,11 @@ pytest
  pytest --cov=datablade --cov-report=html
  ```

- See [tests/README.md](tests/README.md) for detailed testing documentation.
+ For detailed testing documentation, use the bundled TESTING doc:
+
+ ```bash
+ python -m datablade.docs --show TESTING
+ ```

  ## Backward Compatibility

@@ -269,6 +315,12 @@ from datablade.core.json import get
  - **Streaming vs materializing**:
    - Use `read_file_iter()` to process arbitrarily large files chunk-by-chunk.
    - `read_file_smart()` returns a single DataFrame and may still be memory-intensive.
+ - **Chunk concatenation**: the large-file pandas fallback in `read_file_smart()` can
+   temporarily spike memory usage during concat. Use `read_file_iter()` or
+   `return_type="iterator"` to avoid concatenation.
+ - **Polars materialization**: when returning a pandas DataFrame, Polars still
+   collects into memory; use `return_type="polars"` or `"polars_lazy"` to keep
+   Polars frames.
  - **Parquet support**:
    - Streaming reads support single `.parquet` files.
    - Parquet “dataset directories” (Hive partitions / directory-of-parquets) are not a primary target API.
@@ -276,6 +328,9 @@ from datablade.core.json import get
    - Uses the Parquet schema (PyArrow) without scanning data.
    - Complex/nested columns (struct/list/map/union) are dropped and logged as warnings.
  - **DDL scope**: `CREATE TABLE` generation is column/type oriented (no indexes/constraints).
+ - **SQL Server bulk load**: the SQL Server helpers use the `bcp` CLI and require it
+   to be installed and available on PATH. When using `-U`/`-P`, credentials are
+   passed via process args (logs are redacted); prefer `-T` or `-G` where possible.

  **Optional dependencies:**

src/datablade/__init__.py

@@ -12,24 +12,28 @@ For backward compatibility, all functions are also available from datablade.core

  # Also maintain core for backward compatibility
  # Import from new organized structure
- from . import core, dataframes, io, sql, utils
+ from . import core, dataframes, io, registry, sql, utils
  from .blade import Blade
  from .dataframes import read_file_chunked, read_file_smart, read_file_to_parquets
+ from .registry import DialectSpec, ObjectNode, ObjectRef, ObjectRegistry
  from .sql import Dialect, bulk_load, generate_create_table

  # Convenience re-exports for commonly used functions
  from .utils.logging import configure_logging, get_logger
+ from .utils.strings import configure_paths

- __version__ = "0.0.5"
+ __version__ = "0.0.6"

  __all__ = [
      "dataframes",
      "io",
      "utils",
      "sql",
+     "registry",
      "core",  # Maintain backward compatibility
      # Convenience re-exports
      "configure_logging",
+     "configure_paths",
      "get_logger",
      "read_file_smart",
      "read_file_chunked",
@@ -38,4 +42,8 @@ __all__ = [
      "generate_create_table",
      "bulk_load",
      "Blade",
+     "DialectSpec",
+     "ObjectRef",
+     "ObjectNode",
+     "ObjectRegistry",
  ]
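A quick smoke check of the 0.0.6 top-level surface shown in this hunk; it exercises only names that the `__init__.py` diff itself re-exports and makes no calls into the new registry API, whose methods are not shown here.

```python
import datablade
from datablade import (
    DialectSpec,
    ObjectNode,
    ObjectRef,
    ObjectRegistry,
    configure_paths,
)

assert datablade.__version__ == "0.0.6"
print(ObjectRegistry, ObjectRef, ObjectNode, DialectSpec, configure_paths)
```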