PyPI - csvsmith - Versions diffs - 0.2.3__tar.gz → 0.4.0__tar.gz - Mend

csvsmith 0.2.3tar.gz → 0.4.0tar.gz

This diff shows the contents of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects the changes between the two versions as they appear in their public registries.

Files changed (28)

{csvsmith-0.2.3/src/csvsmith.egg-info → csvsmith-0.4.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: csvsmith
-Version: 0.2.3
+Version: 0.4.0
 Summary: Small CSV utilities: row deduplication, classification, row filtering, and CLI helpers.
 Author-email: Eiichi YAMAMOTO <info@yeiichi.com>
 License: MIT License
@@ -27,7 +27,7 @@ License: MIT License
 Project-URL: Homepage, https://github.com/yeiichi/csvsmith
 Project-URL: Repository, https://github.com/yeiichi/csvsmith
-Keywords: csv,pandas,deduplication,data-filtering,file-organization,filtering
+Keywords: csv,deduplication,data-filtering,file-organization,filtering
 Classifier: Programming Language :: Python :: 3
 Classifier: Programming Language :: Python :: 3 :: Only
 Classifier: License :: OSI Approved :: MIT License
@@ -37,7 +37,6 @@ Classifier: Topic :: Utilities
 Requires-Python: >=3.10
 Description-Content-Type: text/x-rst
 License-File: LICENSE
-Requires-Dist: pandas>=2.0
 Requires-Dist: openpyxl>=3.1
 Dynamic: license-file
@@ -51,32 +50,35 @@ csvsmith
    :target: https://pypi.org/project/csvsmith/
 .. image:: https://img.shields.io/pypi/l/csvsmith.svg
-   :target: https://pypi.org/project/csvsmith/
+   :target: https://pypi.org/project/ccsvsmith/
 Introduction
 ------------
 csvsmith is a lightweight collection of CSV utilities designed for data
-integrity, deduplication, organization, and Excel-to-CSV conversion.
+integrity, deduplication, organization, Excel-to-CSV conversion, and
+string-similarity analysis.
 It provides a small Python API for programmatic data filtering and a single
 CLI entrypoint for quick operations.
 Whether you need to organize CSV files by header signatures, find duplicate
-rows in a dataset, convert an Excel worksheet into CSV, or drop rows by a
-substring rule, csvsmith aims to keep the process predictable and reversible.
+rows in a dataset, convert an Excel worksheet into CSV, drop rows by a
+substring rule, or compare two strings for similarity, csvsmith aims to keep
+the process predictable and reversible.
 Features
 --------
 - row duplicate counting and reporting
-- DataFrame deduplication with reports
+- CSV deduplication with reports
 - CSV classification by header signature
 - dry-run and report-only classification modes
 - rollback support via manifest
 - row filtering by substring
 - Excel worksheet to CSV conversion
 - file moving by suffix
+- string distance and similarity analysis
 - a single command-line entrypoint with subcommands
 Installation
@@ -112,34 +114,46 @@ Count duplicate values
    print(count_duplicates_sorted(items))
    # [('a', 3), ('b', 2)]
-Find duplicate rows in a DataFrame
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Find duplicate rows in a CSV
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. code-block:: python
-   import pandas as pd
-   from csvsmith import find_duplicate_rows
+   from csvsmith import find_duplicate_rows, read_csv_rows
-   df = pd.read_csv("input.csv")
-   dup_rows = find_duplicate_rows(df)
+   rows = read_csv_rows("input.csv")
+   dup_rows = find_duplicate_rows(rows)
 Deduplicate with report
 ~~~~~~~~~~~~~~~~~~~~~~~
 .. code-block:: python
-   import pandas as pd
-   from csvsmith import dedupe_with_report
+   from csvsmith import dedupe_with_report, read_csv_rows, write_csv_rows
-   df = pd.read_csv("input.csv")
+   rows = read_csv_rows("input.csv")
-   deduped, report = dedupe_with_report(df)
-   deduped.to_csv("deduped.csv", index=False)
-   report.to_csv("duplicate_report.csv", index=False)
+   deduped, report = dedupe_with_report(rows)
+   write_csv_rows("deduped.csv", deduped, fieldnames=list(rows[0].keys()))
    # Exclude columns (e.g. IDs or timestamps)
-   deduped2, report2 = dedupe_with_report(df, exclude=["id"])
+   deduped2, report2 = dedupe_with_report(rows, exclude=["id"])
+Analyze string distance
+~~~~~~~~~~~~~~~~~~~~~~~
+.. code-block:: python
+   from csvsmith import analyze_pair
+   result = analyze_pair("kitten", "sitting")
+   print(result.get_relation_string())
+   print(result.damerau_levenshtein_distance)
+   print(result.jaro_winkler_score)
+   print(result.similarity_percentage)
+CLI Usage
 Drop rows in a CSV by column name
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -212,7 +226,8 @@ CLI Usage
 ---------
 csvsmith provides a single CLI entrypoint with subcommands for duplicate
-detection, CSV organization, Excel conversion, file moving, and row filtering.
+detection, CSV organization, Excel conversion, file moving, row filtering,
+and string comparison.
 Show duplicate rows
 ~~~~~~~~~~~~~~~~~~~
@@ -227,6 +242,19 @@ Save duplicate rows only:
    csvsmith row-duplicates input.csv -o duplicates_only.csv
+Analyze string distance
+~~~~~~~~~~~~~~~~~~~~~~~
+.. code-block:: bash
+   csvsmith string-distance "kitten" "sitting"
+Ignore case:
+.. code-block:: bash
+   csvsmith string-distance "Hello" "hello" --ignore-case
 Deduplicate and generate a report
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

{csvsmith-0.2.3 → csvsmith-0.4.0}/README.rst RENAMED

@@ -8,32 +8,35 @@ csvsmith
    :target: https://pypi.org/project/csvsmith/
 .. image:: https://img.shields.io/pypi/l/csvsmith.svg
-   :target: https://pypi.org/project/csvsmith/
+   :target: https://pypi.org/project/ccsvsmith/
 Introduction
 ------------
 csvsmith is a lightweight collection of CSV utilities designed for data
-integrity, deduplication, organization, and Excel-to-CSV conversion.
+integrity, deduplication, organization, Excel-to-CSV conversion, and
+string-similarity analysis.
 It provides a small Python API for programmatic data filtering and a single
 CLI entrypoint for quick operations.
 Whether you need to organize CSV files by header signatures, find duplicate
-rows in a dataset, convert an Excel worksheet into CSV, or drop rows by a
-substring rule, csvsmith aims to keep the process predictable and reversible.
+rows in a dataset, convert an Excel worksheet into CSV, drop rows by a
+substring rule, or compare two strings for similarity, csvsmith aims to keep
+the process predictable and reversible.
 Features
 --------
 - row duplicate counting and reporting
-- DataFrame deduplication with reports
+- CSV deduplication with reports
 - CSV classification by header signature
 - dry-run and report-only classification modes
 - rollback support via manifest
 - row filtering by substring
 - Excel worksheet to CSV conversion
 - file moving by suffix
+- string distance and similarity analysis
 - a single command-line entrypoint with subcommands
 Installation
@@ -69,34 +72,46 @@ Count duplicate values
    print(count_duplicates_sorted(items))
    # [('a', 3), ('b', 2)]
-Find duplicate rows in a DataFrame
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Find duplicate rows in a CSV
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. code-block:: python
-   import pandas as pd
-   from csvsmith import find_duplicate_rows
+   from csvsmith import find_duplicate_rows, read_csv_rows
-   df = pd.read_csv("input.csv")
-   dup_rows = find_duplicate_rows(df)
+   rows = read_csv_rows("input.csv")
+   dup_rows = find_duplicate_rows(rows)
 Deduplicate with report
 ~~~~~~~~~~~~~~~~~~~~~~~
 .. code-block:: python
-   import pandas as pd
-   from csvsmith import dedupe_with_report
+   from csvsmith import dedupe_with_report, read_csv_rows, write_csv_rows
-   df = pd.read_csv("input.csv")
+   rows = read_csv_rows("input.csv")
-   deduped, report = dedupe_with_report(df)
-   deduped.to_csv("deduped.csv", index=False)
-   report.to_csv("duplicate_report.csv", index=False)
+   deduped, report = dedupe_with_report(rows)
+   write_csv_rows("deduped.csv", deduped, fieldnames=list(rows[0].keys()))
    # Exclude columns (e.g. IDs or timestamps)
-   deduped2, report2 = dedupe_with_report(df, exclude=["id"])
+   deduped2, report2 = dedupe_with_report(rows, exclude=["id"])
+Analyze string distance
+~~~~~~~~~~~~~~~~~~~~~~~
+.. code-block:: python
+   from csvsmith import analyze_pair
+   result = analyze_pair("kitten", "sitting")
+   print(result.get_relation_string())
+   print(result.damerau_levenshtein_distance)
+   print(result.jaro_winkler_score)
+   print(result.similarity_percentage)
+CLI Usage
 Drop rows in a CSV by column name
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -169,7 +184,8 @@ CLI Usage
 ---------
 csvsmith provides a single CLI entrypoint with subcommands for duplicate
-detection, CSV organization, Excel conversion, file moving, and row filtering.
+detection, CSV organization, Excel conversion, file moving, row filtering,
+and string comparison.
 Show duplicate rows
 ~~~~~~~~~~~~~~~~~~~
@@ -184,6 +200,19 @@ Save duplicate rows only:
    csvsmith row-duplicates input.csv -o duplicates_only.csv
+Analyze string distance
+~~~~~~~~~~~~~~~~~~~~~~~
+.. code-block:: bash
+   csvsmith string-distance "kitten" "sitting"
+Ignore case:
+.. code-block:: bash
+   csvsmith string-distance "Hello" "hello" --ignore-case
 Deduplicate and generate a report
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

{csvsmith-0.2.3 → csvsmith-0.4.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "csvsmith"
-version = "0.2.3"
+version = "0.4.0"
 description = "Small CSV utilities: row deduplication, classification, row filtering, and CLI helpers."
 readme = "README.rst"
 requires-python = ">=3.10"
@@ -16,7 +16,6 @@ authors = [
 keywords = [
   "csv",
-  "pandas",
   "deduplication",
   "data-filtering",
   "file-organization",
@@ -33,7 +32,6 @@ classifiers = [
 ]
 dependencies = [
-  "pandas>=2.0",
   "openpyxl>=3.1",
 ]

{csvsmith-0.2.3 → csvsmith-0.4.0}/src/csvsmith/__init__.py RENAMED

@@ -6,14 +6,22 @@ Public API:
 - add_row_digest
 - find_duplicate_rows
 - dedupe_with_report
+- read_csv_rows
+- write_csv_rows
 - CSVClassifier
 - DropRowsBySubstring
 - excel_to_csv
+- move_by_suffix
+- StringDistance
+- Relation
+- Result
+- analyze_pair
 Compatibility aliases:
 - CSVCleaner
 Submodules:
+- csvsmith.string_distance
 - csvsmith.row_dedup
 - csvsmith.classify
 - csvsmith.filter_rows
@@ -22,26 +30,35 @@ Submodules:
 - csvsmith.cli (CLI entrypoint)
 """
-__version__ = "0.2.3"
+__version__ = "0.4.0"
 from .row_dedup import (
     count_duplicates_sorted,
     add_row_digest,
     find_duplicate_rows,
     dedupe_with_report,
+    read_csv_rows,
+    write_csv_rows,
 )
 from .classify import CSVClassifier
 from .filter_rows import DropRowsBySubstring, CSVCleaner
 from .excel2csv import excel_to_csv
 from .move_files import move_by_suffix
+from .string_distance import StringDistance, Relation, Result, analyze_pair
 __all__ = [
     "count_duplicates_sorted",
     "add_row_digest",
     "find_duplicate_rows",
     "dedupe_with_report",
+    "read_csv_rows",
+    "write_csv_rows",
     "CSVClassifier",
     "DropRowsBySubstring",
     "excel_to_csv",
     "move_by_suffix",
+    "StringDistance",
+    "Relation",
+    "Result",
+    "analyze_pair",
 ]

{csvsmith-0.2.3 → csvsmith-0.4.0}/src/csvsmith/cli.py RENAMED

@@ -1,17 +1,23 @@
 import argparse
+import csv
 import json
 import sys
 from pathlib import Path
 from typing import Optional, Sequence
-import pandas as pd
 from . import __version__
 from .classify import CSVClassifier
 from .excel2csv import excel_to_csv
 from .filter_rows import DropRowsBySubstring
 from .move_files import move_by_suffix
-from .row_dedup import dedupe_with_report, find_duplicate_rows
+from .row_dedup import (
+    dedupe_with_report,
+    find_duplicate_rows,
+    read_csv_rows,
+    write_csv_rows,
+)
+from .string_distance import analyze_pair
 def _parse_suffixes(value: str | None) -> set[str]:
@@ -30,28 +36,33 @@ def _parse_suffixes(value: str | None) -> set[str]:
 def cmd_row_duplicates(args: argparse.Namespace) -> int:
-    df = pd.read_csv(args.input)
+    rows = read_csv_rows(args.input)
     subset = args.subset.split(",") if args.subset else None
-    dupes = find_duplicate_rows(df, subset=subset)
-    if dupes.empty:
+    dupes = find_duplicate_rows(rows, subset=subset)
+    if not dupes:
         print("No duplicate rows found.")
     else:
         print(f"Found {len(dupes)} duplicate rows:")
-        print(dupes.to_csv(index=False))
+        fieldnames = list(dupes[0].keys())
+        writer = csv.DictWriter(sys.stdout, fieldnames=fieldnames)
+        writer.writeheader()
+        writer.writerows(dupes)
     return 0
 def cmd_dedupe(args: argparse.Namespace) -> int:
-    df = pd.read_csv(args.input)
+    rows = read_csv_rows(args.input)
     subset = args.subset.split(",") if args.subset else None
     exclude = args.exclude.split(",") if args.exclude else None
-    deduped_df, report = dedupe_with_report(
-        df, subset=subset, exclude=exclude, keep=args.keep
+    deduped_rows, report = dedupe_with_report(
+        rows, subset=subset, exclude=exclude, keep=args.keep
     )
     output_path = Path(args.output) if args.output else Path(args.input).with_suffix(".deduped.csv")
-    deduped_df.to_csv(output_path, index=False)
+    fieldnames = list(rows[0].keys()) if rows else []
+    write_csv_rows(output_path, deduped_rows, fieldnames=fieldnames)
     print(f"Wrote deduped CSV to: {output_path}")
     if args.report:
@@ -127,6 +138,16 @@ def cmd_drop_rows(args: argparse.Namespace) -> int:
     return 0
+def cmd_string_distance(args: argparse.Namespace) -> int:
+    res = analyze_pair(args.string_a, args.string_b, args.ignore_case)
+    print(f"{'Classification':<18}: {res.get_relation_string()}")
+    print(f"{'D-Levenshtein Dist':<18}: {res.damerau_levenshtein_distance} changes")
+    print(f"{'Jaro-Winkler':<18}: {res.jaro_winkler_score:.4f}")
+    print(f"{'Similarity':<18}: {res.similarity_percentage:.2f}%")
+    return 0
 def _add_row_duplicates_parser(subparsers) -> None:
     parser = subparsers.add_parser("row-duplicates", help="Find duplicate rows in a CSV.")
     parser.add_argument("input", help="Input CSV file.")
@@ -197,6 +218,18 @@ def _add_drop_rows_parser(subparsers) -> None:
     parser.set_defaults(func=cmd_drop_rows)
+def _add_string_distance_parser(subparsers) -> None:
+    parser = subparsers.add_parser("string-distance", help="Analyze distance between two strings.")
+    parser.add_argument("string_a", help="First string.")
+    parser.add_argument("string_b", help="Second string.")
+    parser.add_argument(
+        "--ignore-case",
+        action="store_true",
+        help="Ignore case for distance calculation.",
+    )
+    parser.set_defaults(func=cmd_string_distance)
 def build_parser() -> argparse.ArgumentParser:
     parser = argparse.ArgumentParser(
         prog="csvsmith",
@@ -215,6 +248,7 @@ def build_parser() -> argparse.ArgumentParser:
     _add_move_files_parser(subparsers)
     _add_excel_to_csv_parser(subparsers)
     _add_drop_rows_parser(subparsers)
+    _add_string_distance_parser(subparsers)
     return parser
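
Note on the cli.py hunks above: with pandas removed, `cmd_row_duplicates` now prints duplicate rows through `csv.DictWriter` instead of `DataFrame.to_csv`. The following is a minimal standalone sketch of that pattern, using only the standard library. The helper name `rows_to_csv_text` and the sample rows are hypothetical and are not part of csvsmith; the sketch writes to an in-memory buffer rather than `sys.stdout` so the result can be inspected.

```python
import csv
import io


def rows_to_csv_text(rows):
    """Render a list of row dicts as CSV text (hypothetical helper).

    The header is taken from the keys of the first row, mirroring the
    `fieldnames = list(dupes[0].keys())` line in the diff.
    """
    if not rows:
        return ""
    # newline="" keeps the csv module's own line terminators untouched.
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Sample data, invented for illustration only.
sample = [{"id": "1", "name": "a"}, {"id": "2", "name": "a"}]
text = rows_to_csv_text(sample)
print(text, end="")
```

One visible consequence of this design is that `csv.DictWriter` uses `\r\n` as its default line terminator, so the printed output may differ in line endings from what `DataFrame.to_csv` produced in 0.2.3.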

csvsmith-0.4.0/src/csvsmith/row_dedup.py ADDED

@@ -0,0 +1,192 @@
+from __future__ import annotations
+import csv
+from collections import Counter, defaultdict
+from hashlib import sha256
+from pathlib import Path
+from typing import Hashable, Iterable, Mapping, Optional, Sequence
+ROW_SEP = "\x1f"
+KEEP_OPTIONS = {"first", "last"}
+Row = dict[str, str]
+RowLike = Mapping[str, object]
+def count_duplicates_sorted(
+    items: Iterable[Hashable],
+    threshold: int = 2,
+    reverse: bool = True,
+) -> list[tuple[Hashable, int]]:
+    """Count items and return those occurring at least `threshold` times."""
+    counter = Counter(items)
+    duplicates = [(key, count) for key, count in counter.items() if count >= threshold]
+    duplicates.sort(key=lambda x: x[1], reverse=reverse)
+    return duplicates
+def read_csv_rows(csv_path: Path | str, encoding: str = "utf-8") -> list[Row]:
+    """Read a CSV file into a list of row dictionaries."""
+    path = Path(csv_path)
+    with path.open("r", encoding=encoding, newline="") as fp:
+        reader = csv.DictReader(fp)
+        return list(reader)
+def write_csv_rows(
+    csv_path: Path | str,
+    rows: Sequence[Mapping[str, object]],
+    *,
+    fieldnames: Sequence[str],
+    encoding: str = "utf-8",
+) -> None:
+    """Write row dictionaries to a CSV file."""
+    path = Path(csv_path)
+    with path.open("w", encoding=encoding, newline="") as fp:
+        writer = csv.DictWriter(fp, fieldnames=fieldnames)
+        writer.writeheader()
+        for row in rows:
+            writer.writerow(row)
+def _normalize_cell(value: object) -> str:
+    """Convert a cell value to a stable string for hashing."""
+    if value is None:
+        return ""
+    return str(value)
+def _resolve_columns(
+    rows: Sequence[RowLike],
+    *,
+    subset: Optional[Sequence[Hashable]] = None,
+    exclude: Optional[Sequence[Hashable]] = None,
+) -> list[str]:
+    """Resolve the effective column list used for comparison."""
+    if subset is None:
+        if not rows:
+            return []
+        cols = list(rows[0].keys())
+    else:
+        cols = [str(col) for col in subset]
+    if exclude:
+        exclude_set = {str(col) for col in exclude}
+        cols = [col for col in cols if col not in exclude_set]
+    return cols
+def make_row_digest(row: RowLike, *, columns: Sequence[str]) -> str:
+    """Build a SHA-256 digest for a row using selected columns."""
+    joined = ROW_SEP.join(_normalize_cell(row.get(col, "")) for col in columns)
+    return sha256(joined.encode("utf-8")).hexdigest()
+def add_row_digest(
+    rows: Sequence[RowLike],
+    *,
+    subset: Optional[Sequence[Hashable]] = None,
+    exclude: Optional[Sequence[Hashable]] = None,
+    colname: str = "row_digest",
+    inplace: bool = False,
+) -> list[dict[str, object]]:
+    """Add a row digest column and return the resulting rows."""
+    columns = _resolve_columns(rows, subset=subset, exclude=exclude)
+    out = rows if inplace else [dict(row) for row in rows]
+    for row in out:
+        row[colname] = make_row_digest(row, columns=columns)
+    return [dict(row) for row in out]
+def find_duplicate_rows(
+    rows: Sequence[RowLike],
+    *,
+    subset: Optional[Sequence[Hashable]] = None,
+) -> list[dict[str, object]]:
+    """Return only rows that participate in duplicate groups."""
+    columns = _resolve_columns(rows, subset=subset)
+    grouped: dict[str, list[int]] = defaultdict(list)
+    for idx, row in enumerate(rows):
+        digest = make_row_digest(row, columns=columns)
+        grouped[digest].append(idx)
+    dup_indices = {
+        idx
+        for indices in grouped.values()
+        if len(indices) > 1
+        for idx in indices
+    }
+    return [dict(rows[idx]) for idx in sorted(dup_indices)]
+def dedupe_with_report(
+    rows: Sequence[RowLike],
+    *,
+    subset: Optional[Sequence[Hashable]] = None,
+    exclude: Optional[Sequence[Hashable]] = None,
+    keep: str = "first",
+    digest_col: str = "row_digest",
+) -> tuple[list[dict[str, object]], list[dict[str, object]]]:
+    """Drop duplicates and return `(deduped_rows, report)`."""
+    if keep not in KEEP_OPTIONS:
+        raise ValueError(f"keep must be one of {sorted(KEEP_OPTIONS)}")
+    columns = _resolve_columns(rows, subset=subset, exclude=exclude)
+    grouped: dict[str, list[int]] = defaultdict(list)
+    for idx, row in enumerate(rows):
+        digest = make_row_digest(row, columns=columns)
+        grouped[digest].append(idx)
+    report = [
+        {
+            digest_col: digest,
+            "count": len(indices),
+            "indices": indices,
+        }
+        for digest, indices in grouped.items()
+        if len(indices) > 1
+    ]
+    report.sort(key=lambda x: x["count"], reverse=True)
+    kept_indices: set[int] = set()
+    for indices in grouped.values():
+        kept_indices.add(indices[0] if keep == "first" else indices[-1])
+    deduped_rows = [
+        dict(row)
+        for idx, row in enumerate(rows)
+        if idx in kept_indices
+    ]
+    return deduped_rows, report
+def dedupe_csv_file(
+    src: Path | str,
+    dst: Path | str,
+    *,
+    subset: Optional[Sequence[Hashable]] = None,
+    exclude: Optional[Sequence[Hashable]] = None,
+    keep: str = "first",
+    encoding: str = "utf-8",
+) -> list[dict[str, object]]:
+    """Deduplicate a CSV file, write the result, and return the report."""
+    rows = read_csv_rows(src, encoding=encoding)
+    deduped_rows, report = dedupe_with_report(
+        rows,
+        subset=subset,
+        exclude=exclude,
+        keep=keep,
+    )
+    fieldnames = list(rows[0].keys()) if rows else []
+    write_csv_rows(dst, deduped_rows, fieldnames=fieldnames, encoding=encoding)
+    return report
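
Note on the row_dedup.py hunk above: the new implementation groups row indices by a SHA-256 digest of the selected columns and then keeps the first or last index of each group. The sketch below restates that idea in a standalone form so the `keep` behavior can be tried without installing the package. The names `digest` and `dedupe` and the sample rows are hypothetical simplifications, not csvsmith functions; only the separator value and the grouping logic are taken from the diff.

```python
from collections import defaultdict
from hashlib import sha256

SEP = "\x1f"  # unit separator; same value as ROW_SEP in the diff


def digest(row, columns):
    """Hash the selected columns of one row (None is treated as "")."""
    cells = ("" if row.get(col) is None else str(row.get(col)) for col in columns)
    return sha256(SEP.join(cells).encode("utf-8")).hexdigest()


def dedupe(rows, columns, keep="first"):
    """Keep one row per digest group, preserving the original row order."""
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[digest(row, columns)].append(idx)
    # Pick the first or last index of every group, as in dedupe_with_report.
    kept = {idxs[0] if keep == "first" else idxs[-1] for idxs in groups.values()}
    return [dict(row) for idx, row in enumerate(rows) if idx in kept]


# Sample data, invented for illustration only.
rows = [
    {"id": "1", "name": "ann"},
    {"id": "2", "name": "bob"},
    {"id": "3", "name": "ann"},
]
first = dedupe(rows, ["name"])
last = dedupe(rows, ["name"], keep="last")
print([r["id"] for r in first])
print([r["id"] for r in last])
```

Comparing on `["name"]` only is the effect that `exclude=["id"]` has in the README example: the `id` column is left out of the digest, so rows 1 and 3 fall into the same group.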
