PyPI - tab-cli - Versions diffs - 0.1.7__tar.gz → 0.1.8__tar.gz - Mend

tab-cli 0.1.7__tar.gz → 0.1.8__tar.gz


Files changed (125)

{tab_cli-0.1.7 → tab_cli-0.1.8}/.gitignore RENAMED

@@ -29,7 +29,6 @@ ENV/
 # uv
 .uv/
-uv.lock
 # PyCharm
 .idea/

{tab_cli-0.1.7 → tab_cli-0.1.8}/AGENTS.md RENAMED

@@ -43,7 +43,7 @@ Use it as the default operating guide when changing code in this repo.
 - The CLI tests rely on `typer.testing.CliRunner`.
 - Test data is stored under `tests/assets`.
 - Existing tests emphasize user-visible CLI output, not internal implementation details.
-- When adding CLI behavior, extend `tests/test_cli.py` unless the change clearly deserves a new file.
+- The CLI tests are split across focused files under `tests/`; extend the nearest existing file unless a new one is clearly warranted.
 - Assert both `exit_code` and key output fragments.
 - For stdin support, pass `"-"` as the path and provide `input=` to `runner.invoke(...)`.
@@ -89,6 +89,9 @@ Use it as the default operating guide when changing code in this repo.
 ## Types
+- NEVER implicitly cast any variable to bool with `if var:` or `if not var:` unless the variable is already a bool. Do NOT rely on truthiness for control flow:
+  for example, testing if a list is empty with `if not my_list:` is not allowed. Instead, use explicit length checks like `if len(my_list) > 0:`.
+  always write `if x is not None:` or `if x is None:` when checking for `None` values.
 - Type hints are used widely and should be preserved.
 - Prefer modern built-in generics like `list[str]` and `dict[str, Any]`.
 - Use `X | None` instead of `Optional[X]` in new code unless matching nearby style requires otherwise.
@@ -118,6 +121,7 @@ Use it as the default operating guide when changing code in this repo.
 ## Logging And Output
 - The CLI configures Loguru with `RichHandler` in the Typer callback.
+- When writing Loguru messages, use f-strings instead of Loguru brace-style formatting.
 - User-facing table and summary output is rendered with Rich.
 - Streaming command output usually writes bytes to `sys.stdout.buffer`.
 - Keep stderr/stdout behavior consistent with the existing command design.
@@ -155,4 +159,4 @@ Use it as the default operating guide when changing code in this repo.
 - Run `uv run pytest` for broader validation before finalizing cross-cutting changes.
 - Mention pre-existing lint or type-check failures separately from regressions you introduce.
 - Update CHANGELOG.md when necessary.
--
+-
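
The rule added to the Types section above can be illustrated with a minimal sketch. The function `describe_items` is hypothetical and not part of tab-cli; it only shows explicit `None` and length checks standing in for truthiness tests.

```python
from __future__ import annotations


def describe_items(items: list[str] | None) -> str:
    """Classify a possibly-missing list without relying on truthiness."""
    # Explicit None check, as the new AGENTS.md rule asks for.
    if items is None:
        return "missing"
    # Explicit length check; this keeps [] distinct from None.
    if len(items) == 0:
        return "empty"
    return f"{len(items)} item(s)"
```

A single `if not items:` test would send both `None` and `[]` down the same branch, which is the ambiguity the rule appears to be guarding against.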

tab_cli-0.1.8/CHANGELOG.md ADDED

@@ -0,0 +1,34 @@
+- 0.1.8:
+  - Improved `tab view` performance for partitioned directories by reading only as many early partitions as needed for an unfiltered preview.
+  - Added glob-pattern support for multi-file inputs such as `s3://.../date=*/*.parquet`.
+  - Speed up Parquet row counting in summaries by reading footer metadata instead of scanning file contents.
+  - Fixed S3 Polars `storage_options` to avoid nested `client_kwargs` values that could break native reads.
+  - Added `default_num_view_rows` config so the default `tab view` preview size can be customized.
+  - Added `log_level` config so the CLI log level can default from `~/.config/tab/config.json` when `--log-level` is omitted.
+  - Added `max_cell_length` config so `tab view` can default to truncating long cell values without passing `--max-cell-len` every time.
+  - Added `num_remote_workers` config to parallelize remote per-partition summary row counting.
+  - Validated config file value types instead of silently accepting invalid JSON types.
+  - Fixed the developer `Makefile` targets to point at `src/tab_cli`.
+  - Tightened multi-file summary validation to reject inconsistent schemas, not just mismatched column counts.
+  - `tab cat` now rejects mixed input formats with a clear error instead of reusing the first reader implicitly.
+  - Cleaned up package metadata and repository hygiene issues including version drift and `uv.lock` ignore rules.
+- 0.1.7:
+  - Optional dependency groups are now named `tab-cli[s3|gs|az]`, in accordance with the protocol name.
+- 0.1.6:
+  - Fixed bug in pyarrow loading of Parquet files.
+  - Added global config file support: settings can be persisted in `~/.config/tab/config.json`. Config file values serve as defaults that CLI flags override.
+- 0.1.5:
+  - Added stdin support: use `-` as the file path to read from stdin (e.g. `cat data.csv | tab view -i csv -`). Requires `-i`/`--input-format` since format cannot be inferred. Works with `view`, `schema`, `summary`, `convert`, and `cat`.
+  - Added row-wise JMESPath queries via `--jmespath` / `--jp` on `view`, `convert`, and `cat`. Object results become columns; non-object results go into a `value` column. `--sql` and `--jp` are mutually exclusive.
+  - Implemented `--jp` with `LazyFrame.map_batches(...)` so row reshaping stays batch-oriented instead of materializing the full transformed dataset up front.
+- 0.1.4:
+  - Removed `tab sql` subcommand; SQL is now a `--sql` option on `tab view`, `tab convert`, and `tab cat`.
+  - Automatic PyArrow fallback for Parquet files that fail to read with Polars' native reader.
+- 0.1.3:
+  - Separate `tab view` from `tab cat`: `tab view` does not convert formats, `tab cat` does.
+  - Added `--max-cell-len` option to `tab view` to truncate long cell contents.
+- 0.1.2:
+  - Bugfix on reading directories.
+- 0.1.1:
+  - Better credential handling for Azure Blob Storage and Google Cloud Storage.
+- 0.1.0: Initial release
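
The 0.1.8 entry about `tab cat` rejecting mixed input formats can be sketched roughly as follows. This is a simplified, standard-library-only approximation: `infer_format` is a hypothetical helper that guesses the format from the file extension, whereas the real CLI resolves formats through its own reader inference.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def infer_format(path: str) -> str | None:
    """Hypothetical stand-in for format inference: use the file extension."""
    suffix = PurePosixPath(path).suffix.lstrip(".").lower()
    return suffix if len(suffix) > 0 else None


def resolve_shared_format(paths: list[str], input_format: str | None = None) -> str | None:
    """Return the single format shared by all inputs, or raise on a mismatch."""
    # An explicit format applies to every input, so no mismatch is possible.
    if input_format is not None:
        return input_format.lower()
    resolved: str | None = None
    for path in paths:
        current = infer_format(path)
        if current is None:
            continue
        if resolved is None:
            resolved = current  # first input with a known format sets the expectation
        elif current != resolved:
            raise ValueError(
                "All inputs must use the same format unless an input format is provided"
            )
    return resolved
```

For example, `resolve_shared_format(["a.csv", "b.parquet"])` raises `ValueError`, while passing `input_format="csv"` for the same paths returns `"csv"`.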

{tab_cli-0.1.7 → tab_cli-0.1.8}/Makefile RENAMED

@@ -13,13 +13,13 @@ clean:
 	find . -type d -name __pycache__ -exec rm -rf {} +
 lint:
-	uv run ruff check tab_cli/
+	uv run ruff check src/tab_cli tests
 format:
-	uv run ruff format tab_cli/
+	uv run ruff format src/tab_cli tests
 typecheck:
-	uv run ty check tab_cli/
+	uv run ty check src/tab_cli
 test:
 	uv run pytest

{tab_cli-0.1.7 → tab_cli-0.1.8}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: tab-cli
-Version: 0.1.7
+Version: 0.1.8
 Summary: A CLI tool for tabular data
 Author-email: Tongfei Chen <tongfei@pm.me>
 License-File: LICENSE

{tab_cli-0.1.7 → tab_cli-0.1.8}/docs/cli-ref.md RENAMED

@@ -17,7 +17,7 @@ Options:
 | `--jmespath` / `--jp`   | JMESPath expression to apply to each row as JSON. Object outputs become columns; non-object outputs go to a `value` column. The result shape must stay consistent across rows. |
 | `--limit`               | Maximum number of rows to display.                                                                        |
 | `--skip`                | Number of rows to skip from the beginning.                                                                |
-| `--max-cell-len`        | Truncate cell contents longer than this.                                                                 |
+| `--max-cell-len`        | Truncate cell contents longer than this. If omitted, `max_cell_length` from config is used when set.    |
 ## `tab schema`
@@ -91,4 +91,4 @@ Options:
 | Option                  | Description                                                                                                                  |
 |-------------------------|------------------------------------------------------------------------------------------------------------------------------|
 | `--az-url-authority-is-account` | Interpret az:// URL authority as storage account name instead of container name. See [azure.md](Azure) for more information. |
-| `--log-level`               | Log level from `{DEBUG, INFO, WARNING, ERROR, CRITICAL}`.                                                                     |
+| `--log-level`               | Log level from `{DEBUG, INFO, WARNING, ERROR, CRITICAL}`. If omitted, uses `log_level` from config.                         |

{tab_cli-0.1.7 → tab_cli-0.1.8}/docs/configuration.md RENAMED

@@ -11,6 +11,10 @@ mkdir -p ~/.config/tab
 cat > ~/.config/tab/config.json << 'EOF'
 {
   "az_url_authority_is_account": false,
+  "default_num_view_rows": 20,
+  "log_level": "INFO",
+  "max_cell_length": null,
+  "num_remote_workers": 8,
   "sampling_size_for_schema_inference": 32
 }
 EOF
@@ -21,6 +25,10 @@ EOF
 | Key | Type | Default | Description |
 |-----|------|---------|-------------|
 | `az_url_authority_is_account` | `bool` | `false` | Interpret `az://` URL authority as storage account name instead of container name. |
+| `default_num_view_rows` | `int` | `20` | Default number of rows shown by `tab view` when `--limit` is omitted. |
+| `log_level` | `str` | `"INFO"` | Default CLI log level when `--log-level` is omitted. |
+| `max_cell_length` | `int \| null` | `null` | Default maximum cell length for `tab view`. The CLI `--max-cell-len` flag overrides it. |
+| `num_remote_workers` | `int` | `8` | Maximum worker threads for remote per-partition summary work such as Parquet row counts. |
 | `sampling_size_for_schema_inference` | `int` | `32` | Number of rows sampled for schema inference (e.g. when using `--jp`). |
 ## Precedence
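
The diff elides the body of this section. Going only by the changelog and the option tables shown here (CLI flags override config file values, which in turn override built-in defaults), the lookup for optional flags such as `--log-level` and `--max-cell-len` can be sketched as below. `resolve_setting` is a hypothetical helper name, not a function from the package.

```python
from __future__ import annotations

from typing import TypeVar

T = TypeVar("T")


def resolve_setting(cli_value: T | None, config_value: T) -> T:
    """Prefer the CLI flag when it was passed; otherwise fall back to config."""
    # `None` means "flag omitted". The check is explicit so that a falsy
    # CLI value such as 0 still overrides the config value.
    return cli_value if cli_value is not None else config_value
```

With this shape, `resolve_setting(None, "INFO")` yields the config value, and `resolve_setting(0, 40)` yields `0` rather than silently falling back.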

{tab_cli-0.1.7 → tab_cli-0.1.8}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "tab-cli"
-version = "0.1.7"
+version = "0.1.8"
 description = "A CLI tool for tabular data"
 authors = [{name = "Tongfei Chen", email = "tongfei@pm.me"}]
 readme = "README.md"

{tab_cli-0.1.7 → tab_cli-0.1.8}/src/tab_cli/__init__.py RENAMED

@@ -1,3 +1,3 @@
 """Tab CLI - A CLI tool for tabular data."""
-__version__ = "0.1.0"
+__version__ = "0.1.7"

{tab_cli-0.1.7 → tab_cli-0.1.8}/src/tab_cli/cli.py RENAMED

@@ -79,6 +79,8 @@ app = typer.Typer(
     no_args_is_help=True,
 )
+DEFAULT_VIEW_TRUNCATION_PROBE_ROWS = 1
 @app.callback()
 def main_callback(
@@ -90,13 +92,18 @@ def main_callback(
         ),
     ] = False,
     log_level: Annotated[
-        str,
+        str | None,
         typer.Option(
-            "--log-level", help="Log level from {DEBUG, INFO, WARNING, ERROR, CRITICAL}"
+            "--log-level",
+            help="Log level from {DEBUG, INFO, WARNING, ERROR, CRITICAL}; defaults to config when omitted",
         ),
-    ] = "INFO",
+    ] = None,
 ) -> None:
     """Global options for tab_cli CLI."""
+    load_config_file()
+    effective_log_level = (
+        log_level.upper() if log_level is not None else config.config.log_level.upper()
+    )
     logger.remove()
     logger.add(
         RichHandler(
@@ -105,9 +112,8 @@ def main_callback(
             markup=True,
         ),
         format="{message}",
-        level=log_level.upper(),
+        level=effective_log_level,
     )
-    load_config_file()
     # CLI flags override config file values
     if az_url_authority_is_account:
         config.config.az_url_authority_is_account = az_url_authority_is_account
@@ -176,12 +182,20 @@ def _apply_jmespath(lf: pl.LazyFrame, expression: str) -> pl.LazyFrame:
     compiled = jmespath.compile(expression)
     sample_df = lf.slice(0, Config.sampling_size_for_schema_inference).collect()
+    logger.debug(
+        "Inferring JMESPath output schema from "
+        f"{Config.sampling_size_for_schema_inference} sampled row(s)"
+    )
     if sample_df.is_empty():
+        logger.debug("JMESPath schema inference sample was empty; returning empty LazyFrame")
         return pl.DataFrame().lazy()
     transformed_sample, result_mode = _transform_jmespath_batch(sample_df, compiled)
     output_schema = transformed_sample.schema
     expected_columns = tuple(transformed_sample.columns)
+    logger.debug(
+        f"Inferred JMESPath result mode '{result_mode}' with columns {expected_columns}"
+    )
     return lf.map_batches(
         lambda batch: _transform_jmespath_batch(
@@ -223,16 +237,119 @@ def _apply_limit(
     and returns whether the data was truncated.
     """
     if limit is None and default_limit is not None:
+        logger.debug(
+            f"Applying inferred default row limit {default_limit} with skip {skip}"
+        )
         lf = lf.slice(skip, length=default_limit + 1)
         df = lf.collect()
         truncated = len(df) > default_limit
         if truncated:
             df = df.head(default_limit)
+            logger.debug("Detected truncated preview after applying inferred default row limit")
         return df.lazy(), truncated
-    else:
-        if skip > 0 or limit is not None:
-            lf = lf.slice(skip, length=limit)
-        return lf, False
+    if skip > 0 or limit is not None:
+        lf = lf.slice(skip, length=limit)
+    return lf, False
+def _read_source(path: str, input_format: str | None) -> tuple[pl.LazyFrame, str | None]:
+    """Read a source path and return its LazyFrame and inferred format."""
+    if is_stdin(path):
+        logger.debug(
+            "Using stdin source with explicit format "
+            f"'{input_format.lower() if input_format is not None else None}'"
+        )
+        return (
+            read_stdin(format=input_format),
+            input_format.lower() if input_format is not None else None,
+        )
+    reader = infer_reader(path, format=input_format)
+    logger.debug(
+        f"Read source '{path}' using inferred format '{reader.format.extension()}'"
+    )
+    return reader.read(path), reader.format.extension()
+def _prepare_view_frame(
+    path: str,
+    input_format: str | None,
+    sql: str | None,
+    jmespath_expr: str | None,
+    limit: int | None,
+    skip: int,
+) -> tuple[pl.LazyFrame, bool]:
+    """Prepare the LazyFrame used by `tab view` and report truncation."""
+    default_view_rows = config.config.default_num_view_rows
+    if is_stdin(path):
+        logger.debug("Preparing view for stdin input")
+        lf = read_stdin(format=input_format)
+        lf = _apply_query(lf, sql=sql, jmespath_expr=jmespath_expr)
+        return _apply_limit(
+            lf,
+            limit=limit,
+            skip=skip,
+            default_limit=default_view_rows if limit is None else None,
+        )
+    reader = infer_reader(path, format=input_format)
+    if sql is None and jmespath_expr is None:
+        preview_limit = (
+            limit
+            if limit is not None
+            else default_view_rows + DEFAULT_VIEW_TRUNCATION_PROBE_ROWS
+        )
+        logger.debug(
+            f"Using preview read for '{path}' with inferred preview limit "
+            f"{preview_limit} and skip {skip}"
+        )
+        lf = reader.read_preview(path, limit=preview_limit, offset=skip)
+        if limit is not None:
+            return lf, False
+        df = lf.collect()
+        truncated = len(df) > default_view_rows
+        if truncated:
+            df = df.head(default_view_rows)
+        return df.lazy(), truncated
+    logger.debug(f"Using full read for '{path}' because a query transform was provided")
+    lf = reader.read(path)
+    lf = _apply_query(lf, sql=sql, jmespath_expr=jmespath_expr)
+    return _apply_limit(
+        lf,
+        limit=limit,
+        skip=skip,
+        default_limit=default_view_rows if limit is None else None,
+    )
+def _resolve_cat_output_format(
+    paths: list[str],
+    input_format: str | None,
+) -> tuple[list[pl.LazyFrame], str | None]:
+    """Read all inputs for `tab cat` and validate format consistency."""
+    files: list[pl.LazyFrame] = []
+    resolved_format = input_format.lower() if input_format is not None else None
+    if resolved_format is not None:
+        logger.debug(f"Using explicit shared input format '{resolved_format}' for `tab cat`")
+    for path in paths:
+        lf, current_format = _read_source(path, input_format)
+        if current_format is not None:
+            if resolved_format is None:
+                resolved_format = current_format
+                logger.debug(
+                    f"Inferred shared `tab cat` format '{resolved_format}' from '{path}'"
+                )
+            elif current_format != resolved_format:
+                raise ValueError(
+                    "All inputs to `tab cat` must use the same format unless -i/--input-format is provided"
+                )
+        files.append(lf)
+    return files, resolved_format
 @app.command()
@@ -247,19 +364,25 @@ def view(
     table_svg: TableSvgOpt = False,
 ) -> None:
     """View tabular data as a formatted table."""
-    if is_stdin(path):
-        lf = read_stdin(format=input)
-    else:
-        reader = infer_reader(path, format=input)
-        lf = reader.read(path)
-    lf = _apply_query(lf, sql=sql, jmespath_expr=jmespath_expr)
-    lf, truncated = _apply_limit(
-        lf, limit=limit, skip=skip, default_limit=20 if limit is None else None
+    effective_max_cell_len = (
+        max_cell_len if max_cell_len is not None else config.config.max_cell_length
+    )
+    if max_cell_len is None and effective_max_cell_len is not None:
+        logger.debug(
+            f"Inferred max_cell_len={effective_max_cell_len} for `tab view` from config"
+        )
+    lf, truncated = _prepare_view_frame(
+        path,
+        input_format=input,
+        sql=sql,
+        jmespath_expr=jmespath_expr,
+        limit=limit,
+        skip=skip,
     )
     writer = infer_writer(
         "table-svg" if table_svg else None,
         truncated=truncated,
-        max_cell_len=max_cell_len,
+        max_cell_len=effective_max_cell_len,
     )
     for chunk in writer.write(lf):
         sys.stdout.buffer.write(chunk)
@@ -278,7 +401,7 @@ def schema(
     else:
         reader = infer_reader(path, format=input)
         table_schema = reader.schema(path)
-    console = Console(force_terminal=True)
+    console = Console()
     console.print(table_schema)
@@ -299,7 +422,7 @@ def summary(
     else:
         handler = infer_reader(path, format=input)
         table_summary = handler.summary(path)
-    console = Console(force_terminal=True)
+    console = Console()
     console.print(table_summary)
@@ -320,6 +443,9 @@ def convert(
         if output is not None:
             writer = infer_writer(format=output)
         elif input is not None:
+            logger.debug(
+                f"Inferred convert output format '{input.lower()}' from stdin input format"
+            )
             writer = infer_writer(format=input)
         else:
             raise ValueError(
@@ -333,9 +459,15 @@ def convert(
         if output is not None:
             writer = infer_writer(format=output)
         elif input is not None:
+            logger.debug(
+                f"Inferred convert output format '{input.lower()}' from explicit input format override"
+            )
             writer = infer_writer(format=input)
         else:
             writer = reader
+            logger.debug(
+                f"Inferred convert output format '{reader.format.extension()}' from source '{src}'"
+            )
             assert isinstance(writer, TableWriter)
         lf = reader.read(src)
         lf = _apply_query(lf, sql=sql, jmespath_expr=jmespath_expr)
@@ -351,24 +483,14 @@ def cat(
     jmespath_expr: JmespathOpt = None,
 ) -> None:
     """Concatenate tabular data from multiple files, or just print a single file."""
-    files: list[pl.LazyFrame] = []
-    reader = None
-    for path in paths:
-        if is_stdin(path):
-            files.append(read_stdin(format=input))
-        else:
-            if reader is None:
-                reader = infer_reader(path, format=input)
-            files.append(reader.read(path))
+    files, resolved_format = _resolve_cat_output_format(paths, input)
     lf = pl.concat(files, how="vertical")
     lf = _apply_query(lf, sql=sql, jmespath_expr=jmespath_expr)
     if output is not None:
         writer = infer_writer(format=output)
-    elif reader is not None:
-        writer = infer_writer(format=reader.format.extension())
-        assert isinstance(writer, TableWriter)
-    elif input is not None:
-        writer = infer_writer(format=input)
+    elif resolved_format is not None:
+        logger.debug(f"Inferred `tab cat` output format '{resolved_format}' from input sources")
+        writer = infer_writer(format=resolved_format)
         assert isinstance(writer, TableWriter)
     else:
         raise ValueError(
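
Both `_apply_limit` and `_prepare_view_frame` in the cli.py changes above detect truncation by reading one row more than the default preview size (`DEFAULT_VIEW_TRUNCATION_PROBE_ROWS = 1`) and checking whether that extra row arrived. A minimal sketch of the idea over a plain iterable, assuming nothing about Polars; `preview_rows` is a hypothetical name:

```python
from __future__ import annotations

from itertools import islice
from typing import Iterable, TypeVar

T = TypeVar("T")

PROBE_ROWS = 1  # one extra row is enough to know whether more data exists


def preview_rows(
    rows: Iterable[T], default_limit: int, skip: int = 0
) -> tuple[list[T], bool]:
    """Return at most `default_limit` rows after `skip`, plus a truncation flag."""
    # Read limit + 1 rows: if the probe row shows up, the preview is truncated.
    window = list(islice(rows, skip, skip + default_limit + PROBE_ROWS))
    truncated = len(window) > default_limit
    return window[:default_limit], truncated
```

The source never consumes more than `skip + default_limit + 1` rows, which is what lets an unfiltered preview stop early instead of scanning the full input.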

{tab_cli-0.1.7 → tab_cli-0.1.8}/src/tab_cli/config.py RENAMED

@@ -3,6 +3,8 @@
 import json
 from dataclasses import dataclass, fields
 from pathlib import Path
+from types import UnionType
+from typing import Any, Union, get_args, get_origin
 from loguru import logger
@@ -15,6 +17,10 @@ class Config:
     """Global configuration settings."""
     az_url_authority_is_account: bool = False
+    default_num_view_rows: int = 20
+    log_level: str = "INFO"
+    max_cell_length: int | None = None
+    num_remote_workers: int = 8
     sampling_size_for_schema_inference: int = 32
@@ -22,6 +28,15 @@ class Config:
 config: Config = Config()
+def _matches_type(value: Any, expected_type: Any) -> bool:
+    origin = get_origin(expected_type)
+    if origin in {UnionType, Union}:
+        return any(_matches_type(value, option) for option in get_args(expected_type))
+    if expected_type is type(None):
+        return value is None
+    return type(value) is expected_type
 def load_config_file(path: Path = CONFIG_FILE) -> None:
     """Load settings from a JSON config file into the global config.
@@ -29,6 +44,7 @@ def load_config_file(path: Path = CONFIG_FILE) -> None:
     If the file does not exist, this is a no-op.
     """
     if not path.is_file():
+        logger.debug(f"No config file found at {path}; using built-in defaults")
         return
     text = path.read_text(encoding="utf-8")
@@ -41,7 +57,13 @@ def load_config_file(path: Path = CONFIG_FILE) -> None:
     known = {f.name: f.type for f in fields(Config)}
     for key, value in data.items():
         if key not in known:
-            logger.warning("Unknown config key '{}' in {}", key, path)
+            logger.warning(f"Unknown config key '{key}' in {path}")
             continue
+        expected_type = known[key]
+        expected_name = getattr(expected_type, "__name__", str(expected_type))
+        if _matches_type(value, expected_type) is False:
+            raise ValueError(
+                f"Config key '{key}' must be of type {expected_name}, got {type(value).__name__}"
+            )
         setattr(config, key, value)
-    logger.debug("Loaded config from {}", path)
+    logger.debug(f"Loaded config from {path}")

{tab_cli-0.1.7 → tab_cli-0.1.8}/src/tab_cli/formats/avro.py RENAMED

@@ -2,7 +2,7 @@
 from collections.abc import Iterable
 from io import BytesIO
-from typing import BinaryIO
+from typing import BinaryIO, Callable
 import polars as pl
 import polars_fastavro
@@ -32,7 +32,12 @@ class AvroFormat(FormatHandler):
         # polars_fastavro doesn't support storage_options
         return list(polars_fastavro.scan_avro(url).collect_schema().items())
-    def count_rows(self, url: str, storage_options: dict[str, str] | None = None) -> int:
+    def count_rows(
+        self,
+        url: str,
+        storage_options: dict[str, str] | None = None,
+        opener: Callable[[str], BinaryIO] | None = None,
+    ) -> int:
         # polars_fastavro doesn't support storage_options
         return polars_fastavro.scan_avro(url).select(pl.len()).collect().item()

{tab_cli-0.1.7 → tab_cli-0.1.8}/src/tab_cli/formats/base.py RENAMED

@@ -2,7 +2,7 @@
 from abc import ABC, abstractmethod
 from collections.abc import Iterable
-from typing import BinaryIO
+from typing import BinaryIO, Callable
 import polars as pl
@@ -44,7 +44,12 @@ class FormatHandler(ABC):
         pass
     @abstractmethod
-    def count_rows(self, url: str, storage_options: dict[str, str] | None = None) -> int:
+    def count_rows(
+        self,
+        url: str,
+        storage_options: dict[str, str] | None = None,
+        opener: Callable[[str], BinaryIO] | None = None,
+    ) -> int:
         """Count rows in the file."""
         pass

{tab_cli-0.1.7 → tab_cli-0.1.8}/src/tab_cli/formats/csv.py RENAMED

@@ -2,7 +2,7 @@
 from collections.abc import Iterable
 from io import BytesIO
-from typing import BinaryIO
+from typing import BinaryIO, Callable
 import polars as pl
@@ -30,7 +30,12 @@ class CsvFormat(FormatHandler):
     def collect_schema(self, url: str, storage_options: dict[str, str] | None = None) -> list[tuple[str, pl.DataType]]:
         return list(pl.scan_csv(url, separator=self.separator, storage_options=storage_options).collect_schema().items())
-    def count_rows(self, url: str, storage_options: dict[str, str] | None = None) -> int:
+    def count_rows(
+        self,
+        url: str,
+        storage_options: dict[str, str] | None = None,
+        opener: Callable[[str], BinaryIO] | None = None,
+    ) -> int:
         return pl.scan_csv(url, separator=self.separator, storage_options=storage_options).select(pl.len()).collect().item()
     def write(self, lf: pl.LazyFrame) -> Iterable[bytes]:

{tab_cli-0.1.7 → tab_cli-0.1.8}/src/tab_cli/formats/jsonl.py RENAMED

@@ -2,7 +2,7 @@
 import json
 from collections.abc import Iterable
-from typing import BinaryIO
+from typing import BinaryIO, Callable
 import polars as pl
@@ -27,7 +27,12 @@ class JsonlFormat(FormatHandler):
     def collect_schema(self, url: str, storage_options: dict[str, str] | None = None) -> list[tuple[str, pl.DataType]]:
         return list(pl.scan_ndjson(url, storage_options=storage_options).collect_schema().items())
-    def count_rows(self, url: str, storage_options: dict[str, str] | None = None) -> int:
+    def count_rows(
+        self,
+        url: str,
+        storage_options: dict[str, str] | None = None,
+        opener: Callable[[str], BinaryIO] | None = None,
+    ) -> int:
         return pl.scan_ndjson(url, storage_options=storage_options).select(pl.len()).collect().item()
     def write(self, lf: pl.LazyFrame) -> Iterable[bytes]:

{tab_cli-0.1.7 → tab_cli-0.1.8}/src/tab_cli/formats/parquet.py RENAMED

@@ -2,10 +2,12 @@
 from collections.abc import Iterable
 from io import BytesIO
-from typing import BinaryIO
+from typing import BinaryIO, Callable
 from loguru import logger
 import polars as pl
+import pyarrow as pa
+import pyarrow.parquet as pq
 from tab_cli.formats.base import FormatHandler
@@ -27,8 +29,7 @@ def _scan_parquet_with_pyarrow_fallback(
         return lf
     except Exception as e:
         logger.warning(
-            "Polars native Parquet reader failed ({}), retrying with PyArrow backend",
-            e,
+            f"Polars native Parquet reader failed ({e}), retrying with PyArrow backend"
         )
         return pl.read_parquet(url, storage_options=storage_options, use_pyarrow=True).lazy()
@@ -51,12 +52,20 @@ class ParquetFormat(FormatHandler):
     def collect_schema(self, url: str, storage_options: dict[str, str] | None = None) -> list[tuple[str, pl.DataType]]:
         return list(_scan_parquet_with_pyarrow_fallback(url, storage_options=storage_options).collect_schema().items())
-    def count_rows(self, url: str, storage_options: dict[str, str] | None = None) -> int:
-        return _scan_parquet_with_pyarrow_fallback(url, storage_options=storage_options).select(pl.len()).collect().item()
+    def count_rows(
+        self,
+        url: str,
+        storage_options: dict[str, str] | None = None,
+        opener: Callable[[str], BinaryIO] | None = None,
+    ) -> int:
+        if opener is not None:
+            with opener(url) as stream:
+                return pq.ParquetFile(pa.PythonFile(stream, mode="r")).metadata.num_rows
+        with open(url, "rb") as stream:
+            return pq.ParquetFile(pa.PythonFile(stream, mode="r")).metadata.num_rows
     def extra_summary(self, url: str) -> dict[str, str | int | float] | None:
-        # TODO: Parquet metadata
-        pass
+        return None
     def write(self, lf: pl.LazyFrame) -> Iterable[bytes]:
         output = BytesIO()
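
The new `opener: Callable[[str], BinaryIO] | None` parameter on `count_rows` lets a caller supply how a URL is opened, with a local `open(url, "rb")` as the fallback. The sketch below shows only that injection shape. To stay within the standard library it counts newline-terminated records instead of reading a Parquet footer, and `FAKE_STORE`, `fake_opener`, and `count_records` are hypothetical names.

```python
from __future__ import annotations

from io import BytesIO
from typing import BinaryIO, Callable

# Hypothetical in-memory stand-in for a remote object store.
FAKE_STORE: dict[str, bytes] = {
    "mem://part-0.jsonl": b'{"x": 1}\n{"x": 2}\n{"x": 3}\n',
}


def fake_opener(url: str) -> BinaryIO:
    """Open a 'remote' object by URL; a real opener would hit S3, GCS, etc."""
    return BytesIO(FAKE_STORE[url])


def count_records(url: str, opener: Callable[[str], BinaryIO] | None = None) -> int:
    """Count records in `url`, opening it through `opener` when one is given."""
    if opener is not None:
        with opener(url) as stream:
            return sum(1 for _ in stream)
    # No opener supplied: treat `url` as a local filesystem path.
    with open(url, "rb") as stream:
        return sum(1 for _ in stream)
```

In the Parquet handler the body of each branch reads `metadata.num_rows` from the file footer instead of iterating lines, which is why the changelog describes the change as avoiding a scan of file contents.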
