PyPI: csvsmith 0.2.0__tar.gz → 0.2.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (20)

{csvsmith-0.2.0/src/csvsmith.egg-info → csvsmith-0.2.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: csvsmith
-Version: 0.2.0
+Version: 0.2.1
 Summary: Small CSV utilities: classification, duplicates, row digests, and CLI helpers.
 Author-email: Eiichi YAMAMOTO <info@yeiichi.com>
 License: MIT License
@@ -51,49 +51,33 @@ Dynamic: license-file
 `csvsmith` is a lightweight collection of CSV utilities designed for
 data integrity, deduplication, and organization. It provides a robust
 Python API for programmatic data cleaning and a convenient CLI for quick
-operations. Whether you need to organize thousands of files based on
-their structural signatures or pinpoint duplicate rows in a complex
-dataset, `csvsmith` ensures the process is predictable, transparent, and
-reversible.
+operations.
-## Table of Contents
+Whether you need to organize thousands of files based on their structural
+signatures or pinpoint duplicate rows in a complex dataset, `csvsmith`
+ensures the process is predictable, transparent, and reversible.
--   [Installation](#installation)
+As of recent versions, CSV classification supports:
--
+- strict vs relaxed header matching
+- exact vs subset (“contains”) matching
+- auto clustering with collision‑resistant hashes
+- dry‑run preview
+- report‑only planning mode (scan without moving)
+- full rollback via manifest
-    [Python API Usage](#python-api-usage)
-    :   -   [Count duplicate values](#count-duplicate-values)
-        -   [Find duplicate rows in a
-            DataFrame](#find-duplicate-rows-in-a-dataframe)
-        -   [Deduplicate with report](#deduplicate-with-report)
-        -   [CSV File Classification](#csv-file-classification)
--
-    [CLI Usage](#cli-usage)
-    :   -   [Show duplicate rows](#show-duplicate-rows)
-        -   [Deduplicate and generate a duplicate
-            report](#deduplicate-and-generate-a-duplicate-report)
-        -   [Classify CSVs](#classify-csvs)
--   [Philosophy](#philosophy)
--   [License](#license)
 ## Installation
 From PyPI:
-``` bash
+```bash
 pip install csvsmith
 ```
 For local development:
-``` bash
+```bash
 git clone https://github.com/yeiichi/csvsmith.git
 cd csvsmith
 python -m venv .venv
@@ -101,13 +85,12 @@ source .venv/bin/activate
 pip install -e .[dev]
 ```
 ## Python API Usage
 ### Count duplicate values
-Works on any iterable of hashable items.
-``` python
+```python
 from csvsmith import count_duplicates_sorted
 items = ["a", "b", "a", "c", "a", "b"]
@@ -115,106 +98,120 @@ print(count_duplicates_sorted(items))
 # [('a', 3), ('b', 2)]
 ```
 ### Find duplicate rows in a DataFrame
-``` python
+```python
 import pandas as pd
 from csvsmith import find_duplicate_rows
 df = pd.read_csv("input.csv")
 dup_rows = find_duplicate_rows(df)
-print(dup_rows)
 ```
 ### Deduplicate with report
-``` python
+```python
 import pandas as pd
 from csvsmith import dedupe_with_report
 df = pd.read_csv("input.csv")
-# Use all columns
 deduped, report = dedupe_with_report(df)
 deduped.to_csv("deduped.csv", index=False)
 report.to_csv("duplicate_report.csv", index=False)
-# Use all columns except an ID column
-deduped_no_id, report_no_id = dedupe_with_report(df, exclude=["id"])
+# Exclude columns (e.g. IDs or timestamps)
+deduped2, report2 = dedupe_with_report(df, exclude=["id"])
 ```
-### CSV File Classification
-Organize files into directories based on their headers.
+### CSV File Classification (Python)
-``` python
+```python
 from csvsmith.classify import CSVClassifier
 classifier = CSVClassifier(
     source_dir="./raw_data",
     dest_dir="./organized",
-    auto=True  # Automatically group files with identical headers
+    auto=True,
+    mode="relaxed",        # or "strict"
+    match="exact",        # or "contains"
 )
-# Execute the classification
 classifier.run()
-# Or rollback a previous run using its manifest
-classifier.rollback("./organized/manifest_20260121_120000.json")
+# Roll back using the generated manifest
+classifier.rollback("./organized/manifest_YYYYMMDD_HHMMSS.json")
 ```
 ## CLI Usage
-`csvsmith` includes a command-line interface for duplicate detection and
-file organization.
+csvsmith provides a CLI for duplicate detection and CSV organization.
 ### Show duplicate rows
-``` bash
+```bash
 csvsmith row-duplicates input.csv
 ```
-Save only duplicate rows to a file:
+Save duplicate rows only:
-``` bash
+```bash
 csvsmith row-duplicates input.csv -o duplicates_only.csv
 ```
-### Deduplicate and generate a duplicate report
-``` bash
+### Deduplicate and generate a report
+```bash
 csvsmith dedupe input.csv --deduped deduped.csv --report duplicate_report.csv
 ```
 ### Classify CSVs
-Organize a mess of CSV files into structured folders based on their
-column headers.
+```bash
+# Dry-run (preview only)
+csvsmith classify --src ./raw --dest ./out --auto --dry-run
+# Exact matching (default)
+csvsmith classify --src ./raw --dest ./out --config signatures.json
+# Relaxed matching (ignore column order)
+csvsmith classify --src ./raw --dest ./out --config signatures.json --mode relaxed
-``` bash
-# Preview what would happen (Dry Run)
-csvsmith classify --src ./raw_data --dest ./organized --auto --dry-run
+# Subset matching (signature columns must be present)
+csvsmith classify --src ./raw --dest ./out --config signatures.json --match contains
-# Run classification with a signature config
-csvsmith classify --src ./raw_data --dest ./organized --config signatures.json
+# Report-only (plan without moving files)
+csvsmith classify --src ./raw --dest ./out --auto --report-only
-# Undo a classification run
-csvsmith classify --rollback ./organized/manifest_20260121_120000.json
+# Roll back using manifest
+csvsmith classify --rollback ./out/manifest_YYYYMMDD_HHMMSS.json
 ```
+### Report-only mode
+`--report-only` scans all CSVs and writes a manifest describing what *would*
+happen, without touching the filesystem. This enables downstream pipelines
+to consume the classification plan for custom processing.
 ## Philosophy
-1.  CSVs deserve tools that are simple, predictable, and transparent.
-2.  A row has meaning only when its identity is stable and hashable.
-3.  Collisions are sin; determinism is virtue.
-4.  Let no delimiter sow ambiguity among fields.
-5.  **Love thy \\x1f.** The unseen separator, the quiet guardian of
-    clean hashes.
-6.  The pipeline should be silent unless something is wrong.
-7.  Your data deserves respect --- and your tools should help you give
-    it.
-For more, see `MANIFESTO.md`.
+1. CSVs deserve tools that are simple, predictable, and transparent.
+2. A row has meaning only when its identity is stable and hashable.
+3. Collisions are sin; determinism is virtue.
+4. Let no delimiter sow ambiguity among fields.
+5. Love thy \x1f — the unseen separator, guardian of clean hashes.
+6. The pipeline should be silent unless something is wrong.
+7. Your data deserves respect — and your tools should help you give it.
 ## License
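
Note on Philosophy items 3 to 5 in the hunk above: the diff does not show how csvsmith computes row digests, so the following is only a minimal sketch of why a `\x1f` (ASCII unit separator) between fields helps keep hashes unambiguous. The helper name `row_digest`, the `sep` parameter, and the choice of SHA-256 are assumptions made for illustration and are not taken from the package.

```python
import hashlib

UNIT_SEPARATOR = "\x1f"  # ASCII unit separator; unlikely to occur inside CSV field text


def row_digest(fields, sep=UNIT_SEPARATOR):
    """Hypothetical helper (not csvsmith's API): hash a row after joining its fields."""
    joined = sep.join(str(field) for field in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Without a separator, two different rows collapse into the same string.
assert "".join(["ab", "c"]) == "".join(["a", "bc"])

# With the separator, field boundaries become part of the hashed text.
assert row_digest(["ab", "c"]) != row_digest(["a", "bc"])

# Identical rows still produce identical digests (determinism).
assert row_digest(["ab", "c"]) == row_digest(["ab", "c"])
```

The separator only removes ambiguity as long as it does not appear inside a field value; a real implementation would need to decide how to handle that case.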

csvsmith-0.2.1/README.md ADDED

@@ -0,0 +1,176 @@
+# csvsmith
+[![PyPI version](https://img.shields.io/pypi/v/csvsmith.svg)](https://pypi.org/project/csvsmith/)
+![Python versions](https://img.shields.io/pypi/pyversions/csvsmith.svg)
+[![License](https://img.shields.io/pypi/l/csvsmith.svg)](https://pypi.org/project/csvsmith/)
+## Introduction
+`csvsmith` is a lightweight collection of CSV utilities designed for
+data integrity, deduplication, and organization. It provides a robust
+Python API for programmatic data cleaning and a convenient CLI for quick
+operations.
+Whether you need to organize thousands of files based on their structural
+signatures or pinpoint duplicate rows in a complex dataset, `csvsmith`
+ensures the process is predictable, transparent, and reversible.
+As of recent versions, CSV classification supports:
+- strict vs relaxed header matching
+- exact vs subset (“contains”) matching
+- auto clustering with collision‑resistant hashes
+- dry‑run preview
+- report‑only planning mode (scan without moving)
+- full rollback via manifest
+## Installation
+From PyPI:
+```bash
+pip install csvsmith
+```
+For local development:
+```bash
+git clone https://github.com/yeiichi/csvsmith.git
+cd csvsmith
+python -m venv .venv
+source .venv/bin/activate
+pip install -e .[dev]
+```
+## Python API Usage
+### Count duplicate values
+```python
+from csvsmith import count_duplicates_sorted
+items = ["a", "b", "a", "c", "a", "b"]
+print(count_duplicates_sorted(items))
+# [('a', 3), ('b', 2)]
+```
+### Find duplicate rows in a DataFrame
+```python
+import pandas as pd
+from csvsmith import find_duplicate_rows
+df = pd.read_csv("input.csv")
+dup_rows = find_duplicate_rows(df)
+```
+### Deduplicate with report
+```python
+import pandas as pd
+from csvsmith import dedupe_with_report
+df = pd.read_csv("input.csv")
+deduped, report = dedupe_with_report(df)
+deduped.to_csv("deduped.csv", index=False)
+report.to_csv("duplicate_report.csv", index=False)
+# Exclude columns (e.g. IDs or timestamps)
+deduped2, report2 = dedupe_with_report(df, exclude=["id"])
+```
+### CSV File Classification (Python)
+```python
+from csvsmith.classify import CSVClassifier
+classifier = CSVClassifier(
+    source_dir="./raw_data",
+    dest_dir="./organized",
+    auto=True,
+    mode="relaxed",        # or "strict"
+    match="exact",        # or "contains"
+)
+classifier.run()
+# Roll back using the generated manifest
+classifier.rollback("./organized/manifest_YYYYMMDD_HHMMSS.json")
+```
+## CLI Usage
+csvsmith provides a CLI for duplicate detection and CSV organization.
+### Show duplicate rows
+```bash
+csvsmith row-duplicates input.csv
+```
+Save duplicate rows only:
+```bash
+csvsmith row-duplicates input.csv -o duplicates_only.csv
+```
+### Deduplicate and generate a report
+```bash
+csvsmith dedupe input.csv --deduped deduped.csv --report duplicate_report.csv
+```
+### Classify CSVs
+```bash
+# Dry-run (preview only)
+csvsmith classify --src ./raw --dest ./out --auto --dry-run
+# Exact matching (default)
+csvsmith classify --src ./raw --dest ./out --config signatures.json
+# Relaxed matching (ignore column order)
+csvsmith classify --src ./raw --dest ./out --config signatures.json --mode relaxed
+# Subset matching (signature columns must be present)
+csvsmith classify --src ./raw --dest ./out --config signatures.json --match contains
+# Report-only (plan without moving files)
+csvsmith classify --src ./raw --dest ./out --auto --report-only
+# Roll back using manifest
+csvsmith classify --rollback ./out/manifest_YYYYMMDD_HHMMSS.json
+```
+### Report-only mode
+`--report-only` scans all CSVs and writes a manifest describing what *would*
+happen, without touching the filesystem. This enables downstream pipelines
+to consume the classification plan for custom processing.
+## Philosophy
+1. CSVs deserve tools that are simple, predictable, and transparent.
+2. A row has meaning only when its identity is stable and hashable.
+3. Collisions are sin; determinism is virtue.
+4. Let no delimiter sow ambiguity among fields.
+5. Love thy \x1f — the unseen separator, guardian of clean hashes.
+6. The pipeline should be silent unless something is wrong.
+7. Your data deserves respect — and your tools should help you give it.
+## License
+MIT License.
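
Note on the new `mode` and `match` options in the README above: the README describes relaxed matching as ignoring column order and "contains" matching as requiring the signature columns to be present, but the diff shown here does not include the implementation. The sketch below is one possible reading of those four combinations. The function `headers_match` is hypothetical, and treating strict + contains as an order-preserving subsequence check is an assumption, not documented behavior.

```python
def headers_match(header, signature, mode="strict", match="exact"):
    """Hypothetical check of a CSV header against a signature (not csvsmith's code).

    mode:  "strict" respects column order; "relaxed" ignores it.
    match: "exact" requires the same columns; "contains" only requires
           every signature column to be present in the header.
    """
    if mode not in ("strict", "relaxed") or match not in ("exact", "contains"):
        raise ValueError(f"unsupported mode/match: {mode!r}/{match!r}")

    if mode == "relaxed":
        if match == "exact":
            # Same columns, any order (sorting keeps duplicate columns significant).
            return sorted(header) == sorted(signature)
        # Every signature column appears somewhere in the header.
        return set(signature) <= set(header)

    if match == "exact":
        # Same columns in the same order.
        return list(header) == list(signature)

    # Assumption: strict + contains means the signature columns appear in the
    # header in the same relative order (a subsequence check).
    remaining = iter(header)
    return all(column in remaining for column in signature)


header = ["id", "name", "email"]
print(headers_match(header, ["name", "id", "email"]))                    # False
print(headers_match(header, ["name", "id", "email"], mode="relaxed"))    # True
print(headers_match(header, ["email", "id"], mode="relaxed", match="contains"))  # True
```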

{csvsmith-0.2.0 → csvsmith-0.2.1}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "csvsmith"
-version = "0.2.0"
+version = "0.2.1"
 description = "Small CSV utilities: classification, duplicates, row digests, and CLI helpers."
 readme = "README.md"
 requires-python = ">=3.10"

{csvsmith-0.2.0 → csvsmith-0.2.1}/src/csvsmith/__init__.py RENAMED

@@ -1,13 +1,20 @@
 """
 csvsmith: small, focused CSV utilities.
-Current submodules:
+Public API:
+- count_duplicates_sorted
+- add_row_digest
+- find_duplicate_rows
+- dedupe_with_report
+- CSVClassifier
+Submodules:
 - csvsmith.duplicates
 - csvsmith.classify
 - csvsmith.cli (CLI entrypoint)
 """
-__version__ = "0.2.0"
+__version__ = "0.2.1"
 from .duplicates import (
     count_duplicates_sorted,
@@ -23,4 +30,4 @@ __all__ = [
     "find_duplicate_rows",
     "dedupe_with_report",
     "CSVClassifier",
-]
+]
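
Note on the public API listed in the docstring above: `count_duplicates_sorted` is the only function whose output appears in the README (`[('a', 3), ('b', 2)]` for `["a", "b", "a", "c", "a", "b"]`), and its source is not part of the hunks shown. The stand-in below, named `count_duplicates_sketch` to make clear it is not the library function, reproduces that one documented result with `collections.Counter`. The tie-breaking order is an assumption.

```python
from collections import Counter


def count_duplicates_sketch(items):
    """Hypothetical stand-in: (value, count) pairs for values seen more than once."""
    counts = Counter(items)
    duplicates = [(value, n) for value, n in counts.items() if n > 1]
    # Most frequent first; ties keep first-seen order because sorted() is stable.
    return sorted(duplicates, key=lambda pair: pair[1], reverse=True)


print(count_duplicates_sketch(["a", "b", "a", "c", "a", "b"]))
# [('a', 3), ('b', 2)]
```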
