csv-stream-diff 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,195 @@
1
+ # Building `csv-stream-diff`: A Fast, Streaming CSV Comparison Tool for Very Large Files
2
+
3
+ Comparing two CSV files sounds simple until the files are no longer small.
4
+
5
+ Once the datasets move into the millions of rows, the usual approaches start to fall apart. Loading both files into memory is expensive. Spreadsheet tools stop being useful. Even many ad hoc scripts become slow, fragile, or impossible to run reliably in production-like environments.
6
+
7
+ That is the problem I wanted to solve with `csv-stream-diff`.
8
+
9
+ ## The Problem
10
+
11
+ In real systems, CSV comparison is rarely just "diff these two files."
12
+
13
+ Usually the job looks more like this:
14
+
15
+ - The files are large enough that a full in-memory load is risky or impossible
16
+ - The key columns on the left and right files do not use the same names
17
+ - The comparison should only consider a selected subset of columns
18
+ - Duplicate keys may exist and need to be reported clearly
19
+ - Sometimes a full comparison is required, but sometimes a statistically useful sample is enough
20
+ - The output needs to be machine-readable so it can feed downstream validation or remediation workflows
21
+
22
+ I wanted a tool that could handle that cleanly, with minimal dependencies, and still be easy to package and run anywhere.
23
+
24
+ ## What `csv-stream-diff` Does
25
+
26
+ `csv-stream-diff` is a Python CLI tool for comparing very large CSV files using:
27
+
28
+ - streaming reads
29
+ - hash-based partitioning
30
+ - multiprocessing
31
+ - YAML-driven configuration
32
+
33
+ It produces structured output files for:
34
+
35
+ - rows only on the left
36
+ - rows only on the right
37
+ - rows with cell-level differences
38
+ - duplicate keys
39
+ - summary metadata
40
+
41
+ It is designed to be practical rather than clever.
42
+
43
+ ## The Core Design
44
+
45
+ The main design constraint was memory.
46
+
47
+ If a tool tries to build a single giant in-memory index for both files, it will eventually hit a limit. So instead of comparing the full files at once, `csv-stream-diff` uses a two-phase approach.
48
+
49
+ ### 1. Partition both files into hashed buckets
50
+
51
+ Each row key is normalized and hashed into a bucket. The left and right files are streamed row by row and written into matching bucket files on disk.
52
+
53
+ That matters because rows with the same normalized key always land in the same bucket. Once that is true, each bucket can be compared independently.
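
The bucketing idea described above can be sketched in a few lines. This is an illustrative sketch, not the tool's actual internals: the function names, the normalization options, and the use of `md5` are assumptions, chosen because a stable digest (unlike Python's salted built-in `hash()`) gives the same bucket in every process.

```python
import hashlib

def normalize_key(values, case_insensitive=True, trim=True):
    """Normalize key fields so 'Alice ' and 'alice' land in the same bucket."""
    parts = []
    for v in values:
        v = v or ""
        if trim:
            v = v.strip()
        if case_insensitive:
            v = v.lower()
        parts.append(v)
    # Unit-separator join keeps ("a", "bc") distinct from ("ab", "c")
    return "\x1f".join(parts)

def bucket_for(key, bucket_count=64):
    """Map a normalized key to a stable bucket index in [0, bucket_count)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % bucket_count
```

Because the digest depends only on the normalized key, the left and right files can be partitioned independently and still agree on bucket placement.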
54
+
55
+ ### 2. Compare buckets in parallel
56
+
57
+ After partitioning, the tool compares bucket pairs using multiple worker processes. Each worker only needs to index one bucket of the left file at a time, not the entire dataset.
58
+
59
+ This keeps memory bounded while still taking advantage of all available CPU cores.
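
A minimal sketch of the per-bucket comparison step, assuming each bucket pair arrives as lists of `(key, row)` tuples; the function names and return shape are hypothetical, not the tool's API. Only one left bucket is indexed at a time, which is what keeps worker memory bounded.

```python
from multiprocessing import Pool

def compare_bucket(args):
    """Compare one (left, right) bucket pair; returns keys by outcome."""
    left_rows, right_rows = args
    left_index = {}
    for key, row in left_rows:
        left_index.setdefault(key, row)  # first occurrence wins
    only_left = dict(left_index)
    only_right, changed = [], []
    for key, row in right_rows:
        if key in left_index:
            only_left.pop(key, None)
            if left_index[key] != row:
                changed.append(key)
        else:
            only_right.append(key)
    return list(only_left), only_right, changed

def compare_all(bucket_pairs, workers=4):
    """Fan the independent bucket pairs out across worker processes."""
    with Pool(workers) as pool:
        return pool.map(compare_bucket, bucket_pairs)
```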
60
+
61
+ The result is a design that scales much better for heavy loads than a naive single-process implementation.
62
+
63
+ ## Why YAML Configuration
64
+
65
+ I did not want the CLI to become a wall of flags.
66
+
67
+ The comparison usually needs several pieces of information:
68
+
69
+ - the left and right file paths
70
+ - the left and right key columns
71
+ - the left and right comparison columns
72
+ - CSV dialect options
73
+ - output paths
74
+ - sampling settings
75
+ - performance settings
76
+
77
+ That is much easier to manage in a YAML file than on the command line.
78
+
79
+ The CLI still supports overrides, but the YAML file is the primary contract. That makes runs reproducible and easier to version alongside data validation jobs.
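
A trimmed-down config, following the shape of the full `config.example.yaml` included later in the package, might look like:

```yaml
files:
  left: ./data/left.csv
  right: ./data/right.csv
keys:
  left: [customer_id, transaction_date]
  right: [cust_id, txn_dt]
compare:
  left: [amount, status]
  right: [transaction_amount, txn_status]
sampling:
  size: 0        # 0 = compare everything
  seed: 12345
output:
  directory: ./output
  prefix: comparison_
```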
80
+
81
+ ## Exact Sampling for Large Validation Runs
82
+
83
+ Sometimes you do not want to compare every row.
84
+
85
+ For example, if the source files contain tens of millions of records, you may want to run a fast validation pass against an exact random sample of keys before committing to a full comparison.
86
+
87
+ `csv-stream-diff` supports that:
88
+
89
+ - `sampling.size: 0` means compare everything
90
+ - `sampling.size > 0` means compare an exact random sample of left-side unique keys
91
+ - `sampling.seed` makes the sample reproducible
92
+
93
+ This gives you a useful middle ground between tiny spot checks and full heavy-load comparisons.
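
The package README notes that the exact sample is drawn with reservoir sampling. A minimal sketch of that behavior (classic Algorithm R, with illustrative names) looks like this: size `0` keeps everything, and a fixed seed makes the draw reproducible.

```python
import random

def sample_keys(keys, size, seed=12345):
    """Draw an exact random sample of `size` keys from an iterable of keys."""
    if size <= 0:
        return list(keys)  # size 0 means "compare everything"
    rng = random.Random(seed)  # fixed seed -> reproducible sample
    reservoir = []
    for i, key in enumerate(keys):
        if i < size:
            reservoir.append(key)
        else:
            # Each seen key replaces a reservoir slot with probability size/(i+1)
            j = rng.randint(0, i)
            if j < size:
                reservoir[j] = key
    return reservoir
```

Because the reservoir never exceeds `size` entries, the sample can be drawn in a single streaming pass even over tens of millions of keys.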
94
+
95
+ ## Handling Duplicate Keys
96
+
97
+ Duplicate keys are one of the most annoying edge cases in file comparison work.
98
+
99
+ If a key appears multiple times, the comparison becomes ambiguous. Instead of failing outright or hiding the problem silently, the tool reports duplicate keys explicitly and continues using the first occurrence for the main comparison.
100
+
101
+ That behavior is deliberate:
102
+
103
+ - you get a warning
104
+ - you get a separate duplicate-key artifact
105
+ - you still get a usable comparison result
106
+
107
+ This makes the tool better suited for messy real-world data.
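
The "first occurrence wins, but report duplicates" policy can be sketched as follows; the function name is illustrative, and in the real tool the duplicates feed the `duplicate_keys.csv` artifact rather than an in-memory list.

```python
def index_with_duplicates(rows):
    """Index rows by key, keeping the first occurrence and collecting the rest."""
    index, duplicates = {}, []
    for key, row in rows:
        if key in index:
            duplicates.append((key, row))  # reported, not used for matching
        else:
            index[key] = row
    return index, duplicates
```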
108
+
109
+ ## Keeping Dependencies Small
110
+
111
+ I wanted the runtime dependency footprint to stay minimal.
112
+
113
+ The tool is built mostly with the Python standard library. The only runtime dependencies are `PyYAML`, used for configuration loading, and `rich`, used for console output.
114
+
115
+ That keeps installation simple and reduces operational friction when the tool needs to run in different environments.
116
+
117
+ ## Outputs That Are Actually Useful
118
+
119
+ One important goal was to avoid producing a human-only report.
120
+
121
+ The tool writes separate output files for each class of result, which makes it easier to automate downstream processing:
122
+
123
+ - `only_in_left.csv`
124
+ - `only_in_right.csv`
125
+ - `differences.csv`
126
+ - `duplicate_keys.csv`
127
+ - `summary.json`
128
+
129
+ The `differences.csv` file is especially useful because it reports cell-level differences with both the left and right column names and values.
130
+
131
+ That means you can do more than say "this row changed." You can say exactly how it changed.
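
The cell-level output described above can be sketched like this, assuming compared columns arrive as left/right name pairs (as in the config's `compare.left` / `compare.right` lists); the function and field names are hypothetical.

```python
def cell_differences(key, left_row, right_row, column_pairs):
    """Emit one record per differing cell, carrying both sides' names and values."""
    rows = []
    for left_col, right_col in column_pairs:
        lv, rv = left_row.get(left_col), right_row.get(right_col)
        if lv != rv:
            rows.append({
                "key": key,
                "left_column": left_col,
                "right_column": right_col,
                "left_value": lv,
                "right_value": rv,
            })
    return rows
```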
132
+
133
+ ## Testing the Tool Properly
134
+
135
+ I also wanted the project to be easy to validate.
136
+
137
+ So the repository includes:
138
+
139
+ - unit tests with `pytest`
140
+ - BDD-style acceptance tests with `behave`
141
+ - a fixture generator that creates two baseline-identical CSV files and then introduces controlled differences
142
+
143
+ The generator makes it easy to create realistic comparison scenarios involving:
144
+
145
+ - changed values
146
+ - left-only rows
147
+ - right-only rows
148
+ - duplicate keys
149
+
150
+ That is useful both for development and for demonstrating the tool to others.
151
+
152
+ ## A Few Practical Lessons
153
+
154
+ Building this reinforced a few engineering lessons:
155
+
156
+ - For large-file tooling, streaming and partitioning beat clever in-memory shortcuts
157
+ - Exact sampling is worth implementing properly because it gives a fast validation mode without becoming a toy feature
158
+ - Duplicate handling should be explicit, not implicit
159
+ - Machine-readable outputs matter as much as console output
160
+ - Minimal dependencies make utility tools easier to adopt
161
+
162
+ ## Example Usage
163
+
164
+ With a config file in place, the tool is intentionally simple to run:
165
+
166
+ ```bash
167
+ csv-stream-diff --config config.yaml
168
+ ```
169
+
170
+ You can also override selected settings from the CLI:
171
+
172
+ ```bash
173
+ csv-stream-diff --config config.yaml --sample-size 100000 --workers 8
174
+ ```
175
+
176
+ ## Why I Built It
177
+
178
+ This project came from a practical need: compare large CSV datasets reliably, with clear outputs, and without depending on heavy frameworks or fragile one-off scripts.
179
+
180
+ The result is a tool that is meant to be packaged, published, and reused anywhere.
181
+
182
+ That was the bar from the start.
183
+
184
+ ## Closing
185
+
186
+ If you work with large exports, migration validation, reconciliation jobs, or data quality checks, CSV comparison becomes infrastructure very quickly.
187
+
188
+ `csv-stream-diff` is my attempt to make that infrastructure solid:
189
+
190
+ - reproducible
191
+ - scalable
192
+ - explicit
193
+ - easy to automate
194
+
195
+ If you want to explore the project, the repository includes the CLI, example configuration, test generator, and packaging setup needed to build and publish it.
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Jordi Corbilla
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,161 @@
1
+ Metadata-Version: 2.1
2
+ Name: csv-stream-diff
3
+ Version: 0.1.0
4
+ Summary: Stream and compare very large CSV files with multiprocessing.
5
+ License: MIT
6
+ Keywords: csv,diff,streaming,multiprocessing,comparison
7
+ Author: Jordi
8
+ Requires-Python: >=3.10,<4.0
9
+ Classifier: Development Status :: 3 - Alpha
10
+ Classifier: Intended Audience :: Developers
11
+ Classifier: License :: OSI Approved :: MIT License
12
+ Classifier: Programming Language :: Python :: 3
13
+ Classifier: Programming Language :: Python :: 3.10
14
+ Classifier: Programming Language :: Python :: 3.11
15
+ Classifier: Programming Language :: Python :: 3.12
16
+ Classifier: Programming Language :: Python :: 3.13
17
+ Classifier: Topic :: File Formats
18
+ Classifier: Topic :: Software Development :: Testing
19
+ Classifier: Topic :: Utilities
20
+ Requires-Dist: PyYAML (>=6.0)
21
+ Requires-Dist: rich (>=13.7)
22
+ Description-Content-Type: text/markdown
23
+
24
+ # csv-stream-diff
25
+
26
+ `csv-stream-diff` compares very large CSV files with streaming I/O, hashed bucket partitioning, and multiprocessing. It is designed for datasets that are too large to load fully into memory.
27
+
28
+ ## Features
29
+
30
+ - Compare CSVs by configurable key columns, even when left and right headers differ
31
+ - Stream files in chunks with configurable `chunk_size`
32
+ - Partition by stable hashed key to keep worker memory bounded
33
+ - Use all CPUs by default, or set a worker count explicitly
34
+ - Write machine-usable output artifacts for left-only, right-only, cell differences, duplicate keys, and run summary
35
+ - Support exact random sampling for validation runs with `sampling.size > 0`
36
+ - Warn on duplicate keys and continue using the first occurrence per key
37
+ - Include a fixture generator and both `pytest` and `behave` tests
38
+
39
+ ## Installation
40
+
41
+ ```bash
42
+ pip install csv-stream-diff
43
+ ```
44
+
45
+ For local development:
46
+
47
+ ```bash
48
+ poetry install
49
+ ```
50
+
51
+ ## CLI
52
+
53
+ ```bash
54
+ csv-stream-diff --config config.yaml
55
+ ```
56
+
57
+ Optional overrides:
58
+
59
+ ```bash
60
+ csv-stream-diff \
61
+ --config config.yaml \
62
+ --left-file ./left.csv \
63
+ --right-file ./right.csv \
64
+ --chunk-size 100000 \
65
+ --sample-size 100000 \
66
+ --sample-seed 20260321 \
67
+ --workers 8 \
68
+ --output-dir ./output \
69
+ --output-prefix run_
70
+ ```
71
+
72
+ The YAML config is the default source of truth. CLI flags override it for a single run.
73
+
74
+ ## Configuration
75
+
76
+ See [config.example.yaml](./config.example.yaml) for a full example.
77
+
78
+ Main sections:
79
+
80
+ - `files.left`, `files.right`: input CSV paths
81
+ - `csv.left`, `csv.right`: dialect and encoding settings
82
+ - `keys.left`, `keys.right`: key columns used to match rows
83
+ - `compare.left`, `compare.right`: value columns to compare
84
+ - `comparison`: normalization options
85
+ - `sampling`: `size: 0` means full comparison; any positive value means exact random sample by left-side unique key with a fixed seed
86
+ - `performance`: chunking, worker count, bucket count, temp directory, progress reporting
87
+ - `output`: output directory, filename prefix, whether to include serialized full rows, and whether to write a text summary
88
+
89
+ ## Output Files
90
+
91
+ The tool writes these artifacts to `output.directory`:
92
+
93
+ - `<prefix>only_in_left.csv`
94
+ - `<prefix>only_in_right.csv`
95
+ - `<prefix>differences.csv`
96
+ - `<prefix>duplicate_keys.csv`
97
+ - `<prefix>summary.json`
98
+ - `<prefix>summary.txt` when `output.summary_format` is `text` or `both`
99
+
100
+ `differences.csv` contains one row per differing cell with both the left and right column names and values.
101
+
102
+ ## Sampling
103
+
104
+ - `sampling.size: 0` runs the full comparison.
105
+ - `sampling.size > 0` selects an exact random sample of left-side unique keys using reservoir sampling.
106
+ - Sampling is reproducible when `sampling.seed` stays the same.
107
+ - Duplicate keys do not expand the sampling population because only the first occurrence per key is considered.
108
+
109
+ ## Duplicate Keys
110
+
111
+ Duplicate keys do not stop the run. They are written to `duplicate_keys.csv`, counted in the summary, and the main comparison uses the first occurrence of each key on each side.
112
+
113
+ ## Generator
114
+
115
+ The generator creates two baseline-identical CSVs, applies controlled mutations, writes a matching config, and saves an expected manifest:
116
+
117
+ ```bash
118
+ python generator/generate_fixtures.py --output-dir ./generated --rows 10000 --seed 42
119
+ ```
120
+
121
+ Generated artifacts:
122
+
123
+ - `left.csv`
124
+ - `right.csv`
125
+ - `config.generated.yaml`
126
+ - `expected.json`
127
+
128
+ ## Tests
129
+
130
+ Run unit tests:
131
+
132
+ ```bash
133
+ poetry run pytest
134
+ ```
135
+
136
+ Run BDD acceptance tests:
137
+
138
+ ```bash
139
+ poetry run behave tests/features
140
+ ```
141
+
142
+ Run a package build:
143
+
144
+ ```bash
145
+ poetry build
146
+ ```
147
+
148
+ ## PyPI Packaging
149
+
150
+ Build source and wheel distributions:
151
+
152
+ ```bash
153
+ poetry build
154
+ ```
155
+
156
+ Upload after verifying artifacts:
157
+
158
+ ```bash
159
+ poetry publish
160
+ ```
161
+
@@ -0,0 +1,137 @@
1
+ # csv-stream-diff
2
+
3
+ `csv-stream-diff` compares very large CSV files with streaming I/O, hashed bucket partitioning, and multiprocessing. It is designed for datasets that are too large to load fully into memory.
4
+
5
+ ## Features
6
+
7
+ - Compare CSVs by configurable key columns, even when left and right headers differ
8
+ - Stream files in chunks with configurable `chunk_size`
9
+ - Partition by stable hashed key to keep worker memory bounded
10
+ - Use all CPUs by default, or set a worker count explicitly
11
+ - Write machine-usable output artifacts for left-only, right-only, cell differences, duplicate keys, and run summary
12
+ - Support exact random sampling for validation runs with `sampling.size > 0`
13
+ - Warn on duplicate keys and continue using the first occurrence per key
14
+ - Include a fixture generator and both `pytest` and `behave` tests
15
+
16
+ ## Installation
17
+
18
+ ```bash
19
+ pip install csv-stream-diff
20
+ ```
21
+
22
+ For local development:
23
+
24
+ ```bash
25
+ poetry install
26
+ ```
27
+
28
+ ## CLI
29
+
30
+ ```bash
31
+ csv-stream-diff --config config.yaml
32
+ ```
33
+
34
+ Optional overrides:
35
+
36
+ ```bash
37
+ csv-stream-diff \
38
+ --config config.yaml \
39
+ --left-file ./left.csv \
40
+ --right-file ./right.csv \
41
+ --chunk-size 100000 \
42
+ --sample-size 100000 \
43
+ --sample-seed 20260321 \
44
+ --workers 8 \
45
+ --output-dir ./output \
46
+ --output-prefix run_
47
+ ```
48
+
49
+ The YAML config is the default source of truth. CLI flags override it for a single run.
50
+
51
+ ## Configuration
52
+
53
+ See [config.example.yaml](./config.example.yaml) for a full example.
54
+
55
+ Main sections:
56
+
57
+ - `files.left`, `files.right`: input CSV paths
58
+ - `csv.left`, `csv.right`: dialect and encoding settings
59
+ - `keys.left`, `keys.right`: key columns used to match rows
60
+ - `compare.left`, `compare.right`: value columns to compare
61
+ - `comparison`: normalization options
62
+ - `sampling`: `size: 0` means full comparison; any positive value means exact random sample by left-side unique key with a fixed seed
63
+ - `performance`: chunking, worker count, bucket count, temp directory, progress reporting
64
+ - `output`: output directory, filename prefix, whether to include serialized full rows, and whether to write a text summary
65
+
66
+ ## Output Files
67
+
68
+ The tool writes these artifacts to `output.directory`:
69
+
70
+ - `<prefix>only_in_left.csv`
71
+ - `<prefix>only_in_right.csv`
72
+ - `<prefix>differences.csv`
73
+ - `<prefix>duplicate_keys.csv`
74
+ - `<prefix>summary.json`
75
+ - `<prefix>summary.txt` when `output.summary_format` is `text` or `both`
76
+
77
+ `differences.csv` contains one row per differing cell with both the left and right column names and values.
78
+
79
+ ## Sampling
80
+
81
+ - `sampling.size: 0` runs the full comparison.
82
+ - `sampling.size > 0` selects an exact random sample of left-side unique keys using reservoir sampling.
83
+ - Sampling is reproducible when `sampling.seed` stays the same.
84
+ - Duplicate keys do not expand the sampling population because only the first occurrence per key is considered.
85
+
86
+ ## Duplicate Keys
87
+
88
+ Duplicate keys do not stop the run. They are written to `duplicate_keys.csv`, counted in the summary, and the main comparison uses the first occurrence of each key on each side.
89
+
90
+ ## Generator
91
+
92
+ The generator creates two baseline-identical CSVs, applies controlled mutations, writes a matching config, and saves an expected manifest:
93
+
94
+ ```bash
95
+ python generator/generate_fixtures.py --output-dir ./generated --rows 10000 --seed 42
96
+ ```
97
+
98
+ Generated artifacts:
99
+
100
+ - `left.csv`
101
+ - `right.csv`
102
+ - `config.generated.yaml`
103
+ - `expected.json`
104
+
105
+ ## Tests
106
+
107
+ Run unit tests:
108
+
109
+ ```bash
110
+ poetry run pytest
111
+ ```
112
+
113
+ Run BDD acceptance tests:
114
+
115
+ ```bash
116
+ poetry run behave tests/features
117
+ ```
118
+
119
+ Run a package build:
120
+
121
+ ```bash
122
+ poetry build
123
+ ```
124
+
125
+ ## PyPI Packaging
126
+
127
+ Build source and wheel distributions:
128
+
129
+ ```bash
130
+ poetry build
131
+ ```
132
+
133
+ Upload after verifying artifacts:
134
+
135
+ ```bash
136
+ poetry publish
137
+ ```
@@ -0,0 +1,59 @@
1
+ files:
2
+ left: ./data/left.csv
3
+ right: ./data/right.csv
4
+
5
+ csv:
6
+ left:
7
+ encoding: utf-8-sig
8
+ delimiter: ","
9
+ quotechar: '"'
10
+ escapechar:
11
+ newline: ""
12
+ right:
13
+ encoding: utf-8-sig
14
+ delimiter: ","
15
+ quotechar: '"'
16
+ escapechar:
17
+ newline: ""
18
+
19
+ keys:
20
+ left:
21
+ - customer_id
22
+ - transaction_date
23
+ right:
24
+ - cust_id
25
+ - txn_dt
26
+
27
+ compare:
28
+ left:
29
+ - amount
30
+ - status
31
+ - description
32
+ right:
33
+ - transaction_amount
34
+ - txn_status
35
+ - desc
36
+
37
+ comparison:
38
+ case_insensitive: true
39
+ trim_whitespace: true
40
+ treat_null_as_equal: false
41
+
42
+ sampling:
43
+ size: 0
44
+ seed: 12345
45
+
46
+ performance:
47
+ chunk_size: 100000
48
+ workers:
49
+ bucket_count:
50
+ report_every_rows: 50000
51
+ temp_directory:
52
+ keep_temp_files: false
53
+ show_progress: true
54
+
55
+ output:
56
+ directory: ./output
57
+ prefix: comparison_
58
+ include_full_rows: true
59
+ summary_format: both