csv-stream-diff 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,195 @@
1
+ # Building `csv-stream-diff`: A Fast, Streaming CSV Comparison Tool for Very Large Files
2
+
3
+ Comparing two CSV files sounds simple until the files are no longer small.
4
+
5
+ Once the datasets move into the millions of rows, the usual approaches start to fall apart. Loading both files into memory is expensive. Spreadsheet tools stop being useful. Even many ad hoc scripts become slow, fragile, or impossible to run reliably in production-like environments.
6
+
7
+ That is the problem I wanted to solve with `csv-stream-diff`.
8
+
9
+ ## The Problem
10
+
11
+ In real systems, CSV comparison is rarely just "diff these two files."
12
+
13
+ Usually the job looks more like this:
14
+
15
+ - The files are large enough that a full in-memory load is risky or impossible
16
+ - The key columns on the left and right files do not use the same names
17
+ - The comparison should only consider a selected subset of columns
18
+ - Duplicate keys may exist and need to be reported clearly
19
+ - Sometimes a full comparison is required, but sometimes a statistically useful sample is enough
20
+ - The output needs to be machine-readable so it can feed downstream validation or remediation workflows
21
+
22
+ I wanted a tool that could handle that cleanly, with minimal dependencies, and still be easy to package and run anywhere.
23
+
24
+ ## What `csv-stream-diff` Does
25
+
26
+ `csv-stream-diff` is a Python CLI tool for comparing very large CSV files using:
27
+
28
+ - streaming reads
29
+ - hash-based partitioning
30
+ - multiprocessing
31
+ - YAML-driven configuration
32
+
33
+ It produces structured output files for:
34
+
35
+ - rows only on the left
36
+ - rows only on the right
37
+ - rows with cell-level differences
38
+ - duplicate keys
39
+ - summary metadata
40
+
41
+ It is designed to be practical rather than clever.
42
+
43
+ ## The Core Design
44
+
45
+ The main design constraint was memory.
46
+
47
+ If a tool tries to build a single giant in-memory index for both files, it will eventually hit a limit. So instead of comparing the full files at once, `csv-stream-diff` uses a two-phase approach.
48
+
49
+ ### 1. Partition both files into hashed buckets
50
+
51
+ Each row key is normalized and hashed into a bucket. The left and right files are streamed row by row and written into matching bucket files on disk.
52
+
53
+ That matters because rows with the same normalized key always land in the same bucket. Once that is true, each bucket can be compared independently.
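
The bucketing idea described above can be sketched in a few lines. This is an illustrative sketch, not the tool's actual internals: the function names, the normalization options, and the use of `md5` are assumptions, chosen because a stable digest (unlike Python's salted built-in `hash()`) gives the same bucket in every process.

```python
import hashlib

def normalize_key(values, case_insensitive=True, trim=True):
    """Normalize key fields so 'Alice ' and 'alice' land in the same bucket."""
    parts = []
    for v in values:
        v = v or ""
        if trim:
            v = v.strip()
        if case_insensitive:
            v = v.lower()
        parts.append(v)
    # Unit-separator join keeps ("a", "bc") distinct from ("ab", "c")
    return "\x1f".join(parts)

def bucket_for(key, bucket_count=64):
    """Map a normalized key to a stable bucket index in [0, bucket_count)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % bucket_count
```

Because the digest depends only on the normalized key, the left and right files can be partitioned independently and still agree on bucket placement.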
54
+
55
+ ### 2. Compare buckets in parallel
56
+
57
+ After partitioning, the tool compares bucket pairs using multiple worker processes. Each worker only needs to index one bucket of the left file at a time, not the entire dataset.
58
+
59
+ This keeps memory bounded while still taking advantage of all available CPU cores.
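
A minimal sketch of the per-bucket comparison step, assuming each bucket pair arrives as lists of `(key, row)` tuples; the function names and return shape are hypothetical, not the tool's API. Only one left bucket is indexed at a time, which is what keeps worker memory bounded.

```python
from multiprocessing import Pool

def compare_bucket(args):
    """Compare one (left, right) bucket pair; returns keys by outcome."""
    left_rows, right_rows = args
    left_index = {}
    for key, row in left_rows:
        left_index.setdefault(key, row)  # first occurrence wins
    only_left = dict(left_index)
    only_right, changed = [], []
    for key, row in right_rows:
        if key in left_index:
            only_left.pop(key, None)
            if left_index[key] != row:
                changed.append(key)
        else:
            only_right.append(key)
    return list(only_left), only_right, changed

def compare_all(bucket_pairs, workers=4):
    """Fan the independent bucket pairs out across worker processes."""
    with Pool(workers) as pool:
        return pool.map(compare_bucket, bucket_pairs)
```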
60
+
61
+ The result is a design that scales much better for heavy loads than a naive single-process implementation.
62
+
63
+ ## Why YAML Configuration
64
+
65
+ I did not want the CLI to become a wall of flags.
66
+
67
+ The comparison usually needs several pieces of information:
68
+
69
+ - the left and right file paths
70
+ - the left and right key columns
71
+ - the left and right comparison columns
72
+ - CSV dialect options
73
+ - output paths
74
+ - sampling settings
75
+ - performance settings
76
+
77
+ That is much easier to manage in a YAML file than on the command line.
78
+
79
+ The CLI still supports overrides, but the YAML file is the primary contract. That makes runs reproducible and easier to version alongside data validation jobs.
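
A trimmed-down config, following the shape of the full `config.example.yaml` included later in the package, might look like:

```yaml
files:
  left: ./data/left.csv
  right: ./data/right.csv
keys:
  left: [customer_id, transaction_date]
  right: [cust_id, txn_dt]
compare:
  left: [amount, status]
  right: [transaction_amount, txn_status]
sampling:
  size: 0        # 0 = compare everything
  seed: 12345
output:
  directory: ./output
  prefix: comparison_
```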
80
+
81
+ ## Exact Sampling for Large Validation Runs
82
+
83
+ Sometimes you do not want to compare every row.
84
+
85
+ For example, if the source files contain tens of millions of records, you may want to run a fast validation pass against an exact random sample of keys before committing to a full comparison.
86
+
87
+ `csv-stream-diff` supports that:
88
+
89
+ - `sampling.size: 0` means compare everything
90
+ - `sampling.size > 0` means compare an exact random sample of left-side unique keys
91
+ - `sampling.seed` makes the sample reproducible
92
+
93
+ This gives you a useful middle ground between tiny spot checks and full heavy-load comparisons.
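
The package README notes that the exact sample is drawn with reservoir sampling. A minimal sketch of that behavior (classic Algorithm R, with illustrative names) looks like this: size `0` keeps everything, and a fixed seed makes the draw reproducible.

```python
import random

def sample_keys(keys, size, seed=12345):
    """Draw an exact random sample of `size` keys from an iterable of keys."""
    if size <= 0:
        return list(keys)  # size 0 means "compare everything"
    rng = random.Random(seed)  # fixed seed -> reproducible sample
    reservoir = []
    for i, key in enumerate(keys):
        if i < size:
            reservoir.append(key)
        else:
            # Each seen key replaces a reservoir slot with probability size/(i+1)
            j = rng.randint(0, i)
            if j < size:
                reservoir[j] = key
    return reservoir
```

Because the reservoir never exceeds `size` entries, the sample can be drawn in a single streaming pass even over tens of millions of keys.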
94
+
95
+ ## Handling Duplicate Keys
96
+
97
+ Duplicate keys are one of the most annoying edge cases in file comparison work.
98
+
99
+ If a key appears multiple times, the comparison becomes ambiguous. Instead of failing outright or hiding the problem silently, the tool reports duplicate keys explicitly and continues using the first occurrence for the main comparison.
100
+
101
+ That behavior is deliberate:
102
+
103
+ - you get a warning
104
+ - you get a separate duplicate-key artifact
105
+ - you still get a usable comparison result
106
+
107
+ This makes the tool better suited for messy real-world data.
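
The "first occurrence wins, but report duplicates" policy can be sketched as follows; the function name is illustrative, and in the real tool the duplicates feed the `duplicate_keys.csv` artifact rather than an in-memory list.

```python
def index_with_duplicates(rows):
    """Index rows by key, keeping the first occurrence and collecting the rest."""
    index, duplicates = {}, []
    for key, row in rows:
        if key in index:
            duplicates.append((key, row))  # reported, not used for matching
        else:
            index[key] = row
    return index, duplicates
```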
108
+
109
+ ## Keeping Dependencies Small
110
+
111
+ I wanted the runtime dependency footprint to stay minimal.
112
+
113
+ The tool is built mostly with the Python standard library. The only runtime dependencies are `PyYAML`, used for configuration loading, and `rich`, used for console output.
114
+
115
+ That keeps installation simple and reduces operational friction when the tool needs to run in different environments.
116
+
117
+ ## Outputs That Are Actually Useful
118
+
119
+ One important goal was to avoid producing a human-only report.
120
+
121
+ The tool writes separate output files for each class of result, which makes it easier to automate downstream processing:
122
+
123
+ - `only_in_left.csv`
124
+ - `only_in_right.csv`
125
+ - `differences.csv`
126
+ - `duplicate_keys.csv`
127
+ - `summary.json`
128
+
129
+ The `differences.csv` file is especially useful because it reports cell-level differences with both the left and right column names and values.
130
+
131
+ That means you can do more than say "this row changed." You can say exactly how it changed.
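
The cell-level output described above can be sketched like this, assuming compared columns arrive as left/right name pairs (as in the config's `compare.left` / `compare.right` lists); the function and field names are hypothetical.

```python
def cell_differences(key, left_row, right_row, column_pairs):
    """Emit one record per differing cell, carrying both sides' names and values."""
    rows = []
    for left_col, right_col in column_pairs:
        lv, rv = left_row.get(left_col), right_row.get(right_col)
        if lv != rv:
            rows.append({
                "key": key,
                "left_column": left_col,
                "right_column": right_col,
                "left_value": lv,
                "right_value": rv,
            })
    return rows
```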
132
+
133
+ ## Testing the Tool Properly
134
+
135
+ I also wanted the project to be easy to validate.
136
+
137
+ So the repository includes:
138
+
139
+ - unit tests with `pytest`
140
+ - BDD-style acceptance tests with `behave`
141
+ - a fixture generator that creates two baseline-identical CSV files and then introduces controlled differences
142
+
143
+ The generator makes it easy to create realistic comparison scenarios involving:
144
+
145
+ - changed values
146
+ - left-only rows
147
+ - right-only rows
148
+ - duplicate keys
149
+
150
+ That is useful both for development and for demonstrating the tool to others.
151
+
152
+ ## A Few Practical Lessons
153
+
154
+ Building this reinforced a few engineering lessons:
155
+
156
+ - For large-file tooling, streaming and partitioning beat clever in-memory shortcuts
157
+ - Exact sampling is worth implementing properly because it gives a fast validation mode without becoming a toy feature
158
+ - Duplicate handling should be explicit, not implicit
159
+ - Machine-readable outputs matter as much as console output
160
+ - Minimal dependencies make utility tools easier to adopt
161
+
162
+ ## Example Usage
163
+
164
+ With a config file in place, the tool is intentionally simple to run:
165
+
166
+ ```bash
167
+ csv-stream-diff --config config.yaml
168
+ ```
169
+
170
+ You can also override selected settings from the CLI:
171
+
172
+ ```bash
173
+ csv-stream-diff --config config.yaml --sample-size 100000 --workers 8
174
+ ```
175
+
176
+ ## Why I Built It
177
+
178
+ This project came from a practical need: compare large CSV datasets reliably, with clear outputs, and without depending on heavy frameworks or fragile one-off scripts.
179
+
180
+ The result is a tool that is meant to be packaged, published, and reused anywhere.
181
+
182
+ That was the bar from the start.
183
+
184
+ ## Closing
185
+
186
+ If you work with large exports, migration validation, reconciliation jobs, or data quality checks, CSV comparison becomes infrastructure very quickly.
187
+
188
+ `csv-stream-diff` is my attempt to make that infrastructure solid:
189
+
190
+ - reproducible
191
+ - scalable
192
+ - explicit
193
+ - easy to automate
194
+
195
+ If you want to explore the project, the repository includes the CLI, example configuration, test generator, and packaging setup needed to build and publish it.
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Jordi Corbilla
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,161 @@
1
+ Metadata-Version: 2.1
2
+ Name: csv-stream-diff
3
+ Version: 0.1.0
4
+ Summary: Stream and compare very large CSV files with multiprocessing.
5
+ License: MIT
6
+ Keywords: csv,diff,streaming,multiprocessing,comparison
7
+ Author: Jordi
8
+ Requires-Python: >=3.10,<4.0
9
+ Classifier: Development Status :: 3 - Alpha
10
+ Classifier: Intended Audience :: Developers
11
+ Classifier: License :: OSI Approved :: MIT License
12
+ Classifier: Programming Language :: Python :: 3
13
+ Classifier: Programming Language :: Python :: 3.10
14
+ Classifier: Programming Language :: Python :: 3.11
15
+ Classifier: Programming Language :: Python :: 3.12
16
+ Classifier: Programming Language :: Python :: 3.13
17
+ Classifier: Topic :: File Formats
18
+ Classifier: Topic :: Software Development :: Testing
19
+ Classifier: Topic :: Utilities
20
+ Requires-Dist: PyYAML (>=6.0)
21
+ Requires-Dist: rich (>=13.7)
22
+ Description-Content-Type: text/markdown
23
+
24
+ # csv-stream-diff
25
+
26
+ `csv-stream-diff` compares very large CSV files with streaming I/O, hashed bucket partitioning, and multiprocessing. It is designed for datasets that are too large to load fully into memory.
27
+
28
+ ## Features
29
+
30
+ - Compare CSVs by configurable key columns, even when left and right headers differ
31
+ - Stream files in chunks with configurable `chunk_size`
32
+ - Partition by stable hashed key to keep worker memory bounded
33
+ - Use all CPUs by default, or set a worker count explicitly
34
+ - Write machine-usable output artifacts for left-only, right-only, cell differences, duplicate keys, and run summary
35
+ - Support exact random sampling for validation runs with `sampling.size > 0`
36
+ - Warn on duplicate keys and continue using the first occurrence per key
37
+ - Include a fixture generator and both `pytest` and `behave` tests
38
+
39
+ ## Installation
40
+
41
+ ```bash
42
+ pip install csv-stream-diff
43
+ ```
44
+
45
+ For local development:
46
+
47
+ ```bash
48
+ poetry install
49
+ ```
50
+
51
+ ## CLI
52
+
53
+ ```bash
54
+ csv-stream-diff --config config.yaml
55
+ ```
56
+
57
+ Optional overrides:
58
+
59
+ ```bash
60
+ csv-stream-diff \
61
+ --config config.yaml \
62
+ --left-file ./left.csv \
63
+ --right-file ./right.csv \
64
+ --chunk-size 100000 \
65
+ --sample-size 100000 \
66
+ --sample-seed 20260321 \
67
+ --workers 8 \
68
+ --output-dir ./output \
69
+ --output-prefix run_
70
+ ```
71
+
72
+ The YAML config is the default source of truth. CLI flags override it for a single run.
73
+
74
+ ## Configuration
75
+
76
+ See [config.example.yaml](./config.example.yaml) for a full example.
77
+
78
+ Main sections:
79
+
80
+ - `files.left`, `files.right`: input CSV paths
81
+ - `csv.left`, `csv.right`: dialect and encoding settings
82
+ - `keys.left`, `keys.right`: key columns used to match rows
83
+ - `compare.left`, `compare.right`: value columns to compare
84
+ - `comparison`: normalization options
85
+ - `sampling`: `size: 0` means full comparison; any positive value means exact random sample by left-side unique key with a fixed seed
86
+ - `performance`: chunking, worker count, bucket count, temp directory, progress reporting
87
+ - `output`: output directory, filename prefix, whether to include serialized full rows, and whether to write a text summary
88
+
89
+ ## Output Files
90
+
91
+ The tool writes these artifacts to `output.directory`:
92
+
93
+ - `<prefix>only_in_left.csv`
94
+ - `<prefix>only_in_right.csv`
95
+ - `<prefix>differences.csv`
96
+ - `<prefix>duplicate_keys.csv`
97
+ - `<prefix>summary.json`
98
+ - `<prefix>summary.txt` when `output.summary_format` is `text` or `both`
99
+
100
+ `differences.csv` contains one row per differing cell with both the left and right column names and values.
101
+
102
+ ## Sampling
103
+
104
+ - `sampling.size: 0` runs the full comparison.
105
+ - `sampling.size > 0` selects an exact random sample of left-side unique keys using reservoir sampling.
106
+ - Sampling is reproducible when `sampling.seed` stays the same.
107
+ - Duplicate keys do not expand the sampling population because only the first occurrence per key is considered.
108
+
109
+ ## Duplicate Keys
110
+
111
+ Duplicate keys do not stop the run. They are written to `duplicate_keys.csv`, counted in the summary, and the main comparison uses the first occurrence of each key on each side.
112
+
113
+ ## Generator
114
+
115
+ The generator creates two baseline-identical CSVs, applies controlled mutations, writes a matching config, and saves an expected manifest:
116
+
117
+ ```bash
118
+ python generator/generate_fixtures.py --output-dir ./generated --rows 10000 --seed 42
119
+ ```
120
+
121
+ Generated artifacts:
122
+
123
+ - `left.csv`
124
+ - `right.csv`
125
+ - `config.generated.yaml`
126
+ - `expected.json`
127
+
128
+ ## Tests
129
+
130
+ Run unit tests:
131
+
132
+ ```bash
133
+ poetry run pytest
134
+ ```
135
+
136
+ Run BDD acceptance tests:
137
+
138
+ ```bash
139
+ poetry run behave tests/features
140
+ ```
141
+
142
+ Run a package build:
143
+
144
+ ```bash
145
+ poetry build
146
+ ```
147
+
148
+ ## PyPI Packaging
149
+
150
+ Build source and wheel distributions:
151
+
152
+ ```bash
153
+ poetry build
154
+ ```
155
+
156
+ Upload after verifying artifacts:
157
+
158
+ ```bash
159
+ poetry publish
160
+ ```
161
+
@@ -0,0 +1,137 @@
1
+ # csv-stream-diff
2
+
3
+ `csv-stream-diff` compares very large CSV files with streaming I/O, hashed bucket partitioning, and multiprocessing. It is designed for datasets that are too large to load fully into memory.
4
+
5
+ ## Features
6
+
7
+ - Compare CSVs by configurable key columns, even when left and right headers differ
8
+ - Stream files in chunks with configurable `chunk_size`
9
+ - Partition by stable hashed key to keep worker memory bounded
10
+ - Use all CPUs by default, or set a worker count explicitly
11
+ - Write machine-usable output artifacts for left-only, right-only, cell differences, duplicate keys, and run summary
12
+ - Support exact random sampling for validation runs with `sampling.size > 0`
13
+ - Warn on duplicate keys and continue using the first occurrence per key
14
+ - Include a fixture generator and both `pytest` and `behave` tests
15
+
16
+ ## Installation
17
+
18
+ ```bash
19
+ pip install csv-stream-diff
20
+ ```
21
+
22
+ For local development:
23
+
24
+ ```bash
25
+ poetry install
26
+ ```
27
+
28
+ ## CLI
29
+
30
+ ```bash
31
+ csv-stream-diff --config config.yaml
32
+ ```
33
+
34
+ Optional overrides:
35
+
36
+ ```bash
37
+ csv-stream-diff \
38
+ --config config.yaml \
39
+ --left-file ./left.csv \
40
+ --right-file ./right.csv \
41
+ --chunk-size 100000 \
42
+ --sample-size 100000 \
43
+ --sample-seed 20260321 \
44
+ --workers 8 \
45
+ --output-dir ./output \
46
+ --output-prefix run_
47
+ ```
48
+
49
+ The YAML config is the default source of truth. CLI flags override it for a single run.
50
+
51
+ ## Configuration
52
+
53
+ See [config.example.yaml](./config.example.yaml) for a full example.
54
+
55
+ Main sections:
56
+
57
+ - `files.left`, `files.right`: input CSV paths
58
+ - `csv.left`, `csv.right`: dialect and encoding settings
59
+ - `keys.left`, `keys.right`: key columns used to match rows
60
+ - `compare.left`, `compare.right`: value columns to compare
61
+ - `comparison`: normalization options
62
+ - `sampling`: `size: 0` means full comparison; any positive value means exact random sample by left-side unique key with a fixed seed
63
+ - `performance`: chunking, worker count, bucket count, temp directory, progress reporting
64
+ - `output`: output directory, filename prefix, whether to include serialized full rows, and whether to write a text summary
65
+
66
+ ## Output Files
67
+
68
+ The tool writes these artifacts to `output.directory`:
69
+
70
+ - `<prefix>only_in_left.csv`
71
+ - `<prefix>only_in_right.csv`
72
+ - `<prefix>differences.csv`
73
+ - `<prefix>duplicate_keys.csv`
74
+ - `<prefix>summary.json`
75
+ - `<prefix>summary.txt` when `output.summary_format` is `text` or `both`
76
+
77
+ `differences.csv` contains one row per differing cell with both the left and right column names and values.
78
+
79
+ ## Sampling
80
+
81
+ - `sampling.size: 0` runs the full comparison.
82
+ - `sampling.size > 0` selects an exact random sample of left-side unique keys using reservoir sampling.
83
+ - Sampling is reproducible when `sampling.seed` stays the same.
84
+ - Duplicate keys do not expand the sampling population because only the first occurrence per key is considered.
85
+
86
+ ## Duplicate Keys
87
+
88
+ Duplicate keys do not stop the run. They are written to `duplicate_keys.csv`, counted in the summary, and the main comparison uses the first occurrence of each key on each side.
89
+
90
+ ## Generator
91
+
92
+ The generator creates two baseline-identical CSVs, applies controlled mutations, writes a matching config, and saves an expected manifest:
93
+
94
+ ```bash
95
+ python generator/generate_fixtures.py --output-dir ./generated --rows 10000 --seed 42
96
+ ```
97
+
98
+ Generated artifacts:
99
+
100
+ - `left.csv`
101
+ - `right.csv`
102
+ - `config.generated.yaml`
103
+ - `expected.json`
104
+
105
+ ## Tests
106
+
107
+ Run unit tests:
108
+
109
+ ```bash
110
+ poetry run pytest
111
+ ```
112
+
113
+ Run BDD acceptance tests:
114
+
115
+ ```bash
116
+ poetry run behave tests/features
117
+ ```
118
+
119
+ Run a package build:
120
+
121
+ ```bash
122
+ poetry build
123
+ ```
124
+
125
+ ## PyPI Packaging
126
+
127
+ Build source and wheel distributions:
128
+
129
+ ```bash
130
+ poetry build
131
+ ```
132
+
133
+ Upload after verifying artifacts:
134
+
135
+ ```bash
136
+ poetry publish
137
+ ```
@@ -0,0 +1,59 @@
1
+ files:
2
+ left: ./data/left.csv
3
+ right: ./data/right.csv
4
+
5
+ csv:
6
+ left:
7
+ encoding: utf-8-sig
8
+ delimiter: ","
9
+ quotechar: '"'
10
+ escapechar:
11
+ newline: ""
12
+ right:
13
+ encoding: utf-8-sig
14
+ delimiter: ","
15
+ quotechar: '"'
16
+ escapechar:
17
+ newline: ""
18
+
19
+ keys:
20
+ left:
21
+ - customer_id
22
+ - transaction_date
23
+ right:
24
+ - cust_id
25
+ - txn_dt
26
+
27
+ compare:
28
+ left:
29
+ - amount
30
+ - status
31
+ - description
32
+ right:
33
+ - transaction_amount
34
+ - txn_status
35
+ - desc
36
+
37
+ comparison:
38
+ case_insensitive: true
39
+ trim_whitespace: true
40
+ treat_null_as_equal: false
41
+
42
+ sampling:
43
+ size: 0
44
+ seed: 12345
45
+
46
+ performance:
47
+ chunk_size: 100000
48
+ workers:
49
+ bucket_count:
50
+ report_every_rows: 50000
51
+ temp_directory:
52
+ keep_temp_files: false
53
+ show_progress: true
54
+
55
+ output:
56
+ directory: ./output
57
+ prefix: comparison_
58
+ include_full_rows: true
59
+ summary_format: both