csvsmith 0.2.1__tar.gz → 0.2.3__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (30)
  1. {csvsmith-0.2.1 → csvsmith-0.2.3}/LICENSE +1 -1
  2. csvsmith-0.2.3/PKG-INFO +396 -0
  3. csvsmith-0.2.3/README.rst +353 -0
  4. {csvsmith-0.2.1 → csvsmith-0.2.3}/pyproject.toml +15 -5
  5. {csvsmith-0.2.1 → csvsmith-0.2.3}/src/csvsmith/__init__.py +18 -4
  6. {csvsmith-0.2.1 → csvsmith-0.2.3}/src/csvsmith/classify.py +3 -3
  7. csvsmith-0.2.3/src/csvsmith/cli.py +234 -0
  8. csvsmith-0.2.3/src/csvsmith/excel2csv.py +58 -0
  9. csvsmith-0.2.3/src/csvsmith/filter_rows.py +89 -0
  10. csvsmith-0.2.3/src/csvsmith/move_files.py +54 -0
  11. csvsmith-0.2.3/src/csvsmith/row_dedup.py +128 -0
  12. csvsmith-0.2.3/src/csvsmith.egg-info/PKG-INFO +396 -0
  13. {csvsmith-0.2.1 → csvsmith-0.2.3}/src/csvsmith.egg-info/SOURCES.txt +10 -3
  14. csvsmith-0.2.3/src/csvsmith.egg-info/requires.txt +2 -0
  15. {csvsmith-0.2.1 → csvsmith-0.2.3}/tests/test_classify.py +2 -2
  16. csvsmith-0.2.3/tests/test_cli.py +128 -0
  17. csvsmith-0.2.3/tests/test_excel2csv.py +81 -0
  18. csvsmith-0.2.3/tests/test_filter_rows.py +136 -0
  19. csvsmith-0.2.3/tests/test_move_files.py +77 -0
  20. csvsmith-0.2.1/tests/test_duplicates.py → csvsmith-0.2.3/tests/test_row_dedup.py +15 -81
  21. csvsmith-0.2.1/PKG-INFO +0 -218
  22. csvsmith-0.2.1/README.md +0 -176
  23. csvsmith-0.2.1/src/csvsmith/cli.py +0 -277
  24. csvsmith-0.2.1/src/csvsmith/duplicates.py +0 -221
  25. csvsmith-0.2.1/src/csvsmith.egg-info/PKG-INFO +0 -218
  26. csvsmith-0.2.1/src/csvsmith.egg-info/requires.txt +0 -1
  27. {csvsmith-0.2.1 → csvsmith-0.2.3}/setup.cfg +0 -0
  28. {csvsmith-0.2.1 → csvsmith-0.2.3}/src/csvsmith.egg-info/dependency_links.txt +0 -0
  29. {csvsmith-0.2.1 → csvsmith-0.2.3}/src/csvsmith.egg-info/entry_points.txt +0 -0
  30. {csvsmith-0.2.1 → csvsmith-0.2.3}/src/csvsmith.egg-info/top_level.txt +0 -0
@@ -1,6 +1,6 @@
1
1
  MIT License
2
2
 
3
- Copyright (c) 2025 Eiichi YAMAMOTO
3
+ Copyright (c) 2026 Eiichi YAMAMOTO
4
4
 
5
5
  Permission is hereby granted, free of charge, to any person obtaining a copy
6
6
  of this software and associated documentation files (the "Software"), to deal
@@ -0,0 +1,396 @@
1
+ Metadata-Version: 2.4
2
+ Name: csvsmith
3
+ Version: 0.2.3
4
+ Summary: Small CSV utilities: row deduplication, classification, row filtering, and CLI helpers.
5
+ Author-email: Eiichi YAMAMOTO <info@yeiichi.com>
6
+ License: MIT License
7
+
8
+ Copyright (c) 2026 Eiichi YAMAMOTO
9
+
10
+ Permission is hereby granted, free of charge, to any person obtaining a copy
11
+ of this software and associated documentation files (the "Software"), to deal
12
+ in the Software without restriction, including without limitation the rights
13
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
14
+ copies of the Software, and to permit persons to whom the Software is
15
+ furnished to do so, subject to the following conditions:
16
+
17
+ The above copyright notice and this permission notice shall be included in
18
+ all copies or substantial portions of the Software.
19
+
20
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
21
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
22
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
23
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
24
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
25
+ FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
26
+ IN THE SOFTWARE.
27
+
28
+ Project-URL: Homepage, https://github.com/yeiichi/csvsmith
29
+ Project-URL: Repository, https://github.com/yeiichi/csvsmith
30
+ Keywords: csv,pandas,deduplication,data-filtering,file-organization,filtering
31
+ Classifier: Programming Language :: Python :: 3
32
+ Classifier: Programming Language :: Python :: 3 :: Only
33
+ Classifier: License :: OSI Approved :: MIT License
34
+ Classifier: Intended Audience :: Developers
35
+ Classifier: Topic :: Software Development :: Libraries
36
+ Classifier: Topic :: Utilities
37
+ Requires-Python: >=3.10
38
+ Description-Content-Type: text/x-rst
39
+ License-File: LICENSE
40
+ Requires-Dist: pandas>=2.0
41
+ Requires-Dist: openpyxl>=3.1
42
+ Dynamic: license-file
43
+
44
+ csvsmith
45
+ ========
46
+
47
+ .. image:: https://img.shields.io/pypi/v/csvsmith.svg
48
+    :target: https://pypi.org/project/csvsmith/
49
+
50
+ .. image:: https://img.shields.io/pypi/pyversions/csvsmith.svg
51
+    :target: https://pypi.org/project/csvsmith/
52
+
53
+ .. image:: https://img.shields.io/pypi/l/csvsmith.svg
54
+    :target: https://pypi.org/project/csvsmith/
55
+
56
+ Introduction
57
+ ------------
58
+
59
+ csvsmith is a lightweight collection of CSV utilities designed for data
60
+ integrity, deduplication, organization, and Excel-to-CSV conversion.
61
+
62
+ It provides a small Python API for programmatic data filtering and a single
63
+ CLI entrypoint for quick operations.
64
+
65
+ Whether you need to organize CSV files by header signatures, find duplicate
66
+ rows in a dataset, convert an Excel worksheet into CSV, or drop rows by a
67
+ substring rule, csvsmith aims to keep the process predictable and reversible.
68
+
69
+ Features
70
+ --------
71
+
72
+ - row duplicate counting and reporting
73
+ - DataFrame deduplication with reports
74
+ - CSV classification by header signature
75
+ - dry-run and report-only classification modes
76
+ - rollback support via manifest
77
+ - row filtering by substring
78
+ - Excel worksheet to CSV conversion
79
+ - file moving by suffix
80
+ - a single command-line entrypoint with subcommands
81
+
82
+ Installation
83
+ ------------
84
+
85
+ From PyPI:
86
+
87
+ .. code-block:: bash
88
+
89
+     pip install csvsmith
90
+
91
+ For local development:
92
+
93
+ .. code-block:: bash
94
+
95
+     git clone https://github.com/yeiichi/csvsmith.git
96
+     cd csvsmith
97
+     python -m venv .venv
98
+     source .venv/bin/activate
99
+     pip install -e .[dev]
100
+
101
+ Python API Usage
102
+ ----------------
103
+
104
+ Count duplicate values
105
+ ~~~~~~~~~~~~~~~~~~~~~~
106
+
107
+ .. code-block:: python
108
+
109
+     from csvsmith import count_duplicates_sorted
110
+
111
+     items = ["a", "b", "a", "c", "a", "b"]
112
+     print(count_duplicates_sorted(items))
113
+     # [('a', 3), ('b', 2)]
114
+
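A minimal sketch of how counting like this could be done with the standard library, assuming that only values seen more than once are reported and that results are ordered by descending count. The name ``count_duplicates_sketch`` is hypothetical; this is not csvsmith's implementation.

```python
from collections import Counter


def count_duplicates_sketch(items):
    """Return (value, count) pairs for values that occur more than once."""
    counts = Counter(items)
    dups = [(value, n) for value, n in counts.items() if n > 1]
    # Assumption: highest count first; ties keep first-appearance order
    # because sorted() is stable and Counter preserves insertion order.
    return sorted(dups, key=lambda pair: -pair[1])


print(count_duplicates_sketch(["a", "b", "a", "c", "a", "b"]))
# [('a', 3), ('b', 2)]
```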
115
+ Find duplicate rows in a DataFrame
116
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
117
+
118
+ .. code-block:: python
119
+
120
+     import pandas as pd
121
+     from csvsmith import find_duplicate_rows
122
+
123
+     df = pd.read_csv("input.csv")
124
+     dup_rows = find_duplicate_rows(df)
125
+
126
+ Deduplicate with report
127
+ ~~~~~~~~~~~~~~~~~~~~~~~
128
+
129
+ .. code-block:: python
130
+
131
+     import pandas as pd
132
+     from csvsmith import dedupe_with_report
133
+
134
+     df = pd.read_csv("input.csv")
135
+
136
+     deduped, report = dedupe_with_report(df)
137
+     deduped.to_csv("deduped.csv", index=False)
138
+     report.to_csv("duplicate_report.csv", index=False)
139
+
140
+     # Exclude columns (e.g. IDs or timestamps)
141
+     deduped2, report2 = dedupe_with_report(df, exclude=["id"])
142
+
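To illustrate what excluding columns means for deduplication, here is a standard-library sketch that works on plain dict rows rather than a DataFrame. The function name, the report layout, and the keep-first policy are assumptions made for illustration; the README does not describe how ``dedupe_with_report`` works internally.

```python
def dedupe_rows_sketch(rows, exclude=()):
    """Drop repeated rows (dicts), ignoring the columns named in `exclude`."""
    seen = {}      # comparison key -> number of occurrences
    deduped = []
    for row in rows:
        # The key holds only the columns that take part in the comparison.
        key = tuple((col, row[col]) for col in sorted(row) if col not in exclude)
        if key not in seen:
            deduped.append(row)  # assumption: keep the first occurrence
        seen[key] = seen.get(key, 0) + 1
    # Hypothetical report: one entry per key that appeared more than once.
    report = [dict(key, count=n) for key, n in seen.items() if n > 1]
    return deduped, report


rows = [
    {"id": "1", "name": "Alice"},
    {"id": "2", "name": "Alice"},
    {"id": "3", "name": "Bob"},
]
deduped, report = dedupe_rows_sketch(rows, exclude=["id"])
print(len(deduped), report)
```

With ``id`` excluded, the two Alice rows compare equal and only the first is kept; without the exclusion all three rows are distinct.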
143
+ Drop rows in a CSV by column name
144
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
145
+
146
+ .. code-block:: python
147
+
148
+     from csvsmith import DropRowsBySubstring
149
+
150
+     cleaner = DropRowsBySubstring(
151
+         "input.csv",
152
+         column_name="notes",
153
+         unwanted_text="spam",
154
+         case_sensitive=False,
155
+     )
156
+
157
+     cleaner.write_filtered_rows()
158
+
159
+ If you are upgrading from an older version, CSVCleaner is still available as a
160
+ compatibility alias, but DropRowsBySubstring is the preferred name.
161
+
162
+ Convert Excel to CSV
163
+ ~~~~~~~~~~~~~~~~~~~~
164
+
165
+ .. code-block:: python
166
+
167
+     from csvsmith import excel_to_csv
168
+
169
+     csv_path = excel_to_csv(
170
+         "input.xlsx",
171
+         sheet_name="Details",
172
+     )
173
+
174
+     print(csv_path)
175
+
176
+ Move files by suffix
177
+ ~~~~~~~~~~~~~~~~~~~~
178
+
179
+ .. code-block:: python
180
+
181
+     from csvsmith import move_by_suffix
182
+
183
+     moved_count = move_by_suffix(
184
+         src_dir="./raw",
185
+         dst_dir="./processed",
186
+         suffixes=[".csv", ".pdf"],
187
+     )
188
+
189
+     print(f"Moved {moved_count} files.")
190
+
191
+ CSV File Classification (Python)
192
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
193
+
194
+ .. code-block:: python
195
+
196
+     from csvsmith.classify import CSVClassifier
197
+
198
+     classifier = CSVClassifier(
199
+         source_dir="./raw_data",
200
+         dest_dir="./organized",
201
+         auto=True,
202
+         mode="relaxed",  # or "strict"
203
+         match="exact",   # or "contains"
204
+     )
205
+
206
+     classifier.run()
207
+
208
+     # Roll back using the generated manifest
209
+     classifier.rollback("./organized/manifest_YYYYMMDD_HHMMSS.json")
210
+
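The idea behind manifest-based rollback can be sketched as follows: record every move as it happens, then replay the records in reverse. The manifest layout (a JSON list of ``src``/``dst`` pairs) and both function names are hypothetical; the actual csvsmith manifest format is not shown in this README.

```python
import json
import shutil
from pathlib import Path


def move_with_manifest_sketch(moves, manifest_path):
    """Move each (src, dst) pair and record it in a JSON manifest."""
    records = []
    for src, dst in moves:
        Path(dst).parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
        records.append({"src": str(src), "dst": str(dst)})
    Path(manifest_path).write_text(json.dumps(records, indent=2), encoding="utf-8")


def rollback_sketch(manifest_path):
    """Undo the recorded moves, most recent first."""
    records = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    for rec in reversed(records):
        Path(rec["src"]).parent.mkdir(parents=True, exist_ok=True)
        shutil.move(rec["dst"], rec["src"])
```

Replaying in reverse order matters when a later move depends on an earlier one, for example when two files pass through the same intermediate path.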
211
+ CLI Usage
212
+ ---------
213
+
214
+ csvsmith provides a single CLI entrypoint with subcommands for duplicate
215
+ detection, CSV organization, Excel conversion, file moving, and row filtering.
216
+
217
+ Show duplicate rows
218
+ ~~~~~~~~~~~~~~~~~~~
219
+
220
+ .. code-block:: bash
221
+
222
+     csvsmith row-duplicates input.csv
223
+
224
+ Save duplicate rows only:
225
+
226
+ .. code-block:: bash
227
+
228
+     csvsmith row-duplicates input.csv -o duplicates_only.csv
229
+
230
+ Deduplicate and generate a report
231
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
232
+
233
+ .. code-block:: bash
234
+
235
+     csvsmith dedupe input.csv -o deduped.csv --report duplicate_report.json
236
+
237
+ Convert Excel to CSV
238
+ ~~~~~~~~~~~~~~~~~~~~
239
+
240
+ .. code-block:: bash
241
+
242
+     csvsmith excel-to-csv input.xlsx
243
+
244
+ Select a named worksheet:
245
+
246
+ .. code-block:: bash
247
+
248
+     csvsmith excel-to-csv input.xlsx --sheet-name Details
249
+
250
+ Write to a custom output path:
251
+
252
+ .. code-block:: bash
253
+
254
+     csvsmith excel-to-csv input.xlsx -o output/result.csv
255
+
256
+ Classify CSVs
257
+ ~~~~~~~~~~~~~
258
+
259
+ .. code-block:: bash
260
+
261
+     # Dry-run (preview only)
262
+     csvsmith classify ./raw ./out --auto --dry-run
263
+
264
+     # Exact matching (default)
265
+     csvsmith classify ./raw ./out
266
+
267
+     # Relaxed matching (ignore col order)
268
+     csvsmith classify ./raw ./out --mode relaxed
269
+
270
+     # Subset matching (signature columns must be present)
271
+     csvsmith classify ./raw ./out --match contains
272
+
273
+     # Report-only (plan without moving files)
274
+     csvsmith classify ./raw ./out --auto --report-only
275
+
276
+     # Roll back using manifest
277
+     # Use the Python API for rollback:
278
+     #     classifier.rollback("./out/manifest_YYYYMMDD_HHMMSS.json")
279
+
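One way to read the matching options above, sketched in Python. This is an interpretation of the comments in the block, not csvsmith's code: the function name is hypothetical, and treating ``contains`` as order-independent is an assumption.

```python
def header_matches_sketch(header, signature, mode="strict", match="exact"):
    """Decide whether a CSV header fits a column signature."""
    if match == "contains":
        # Assumption: every signature column must be present; extra columns
        # and column order are ignored.
        return set(signature) <= set(header)
    if mode == "relaxed":
        # Same columns, in any order.
        return sorted(header) == sorted(signature)
    # Strict and exact: same columns in the same order.
    return list(header) == list(signature)


sig = ["id", "name"]
print(header_matches_sketch(["id", "name"], sig))                             # True
print(header_matches_sketch(["name", "id"], sig))                             # False
print(header_matches_sketch(["name", "id"], sig, mode="relaxed"))             # True
print(header_matches_sketch(["id", "name", "notes"], sig, match="contains"))  # True
```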
280
+ Move files by suffix
281
+ ~~~~~~~~~~~~~~~~~~~~
282
+
283
+ .. code-block:: bash
284
+
285
+     csvsmith move-files src_dir dst_dir --suffixes csv,pdf
286
+
287
+ This moves files whose suffix matches one of the given values. The suffixes can
288
+ be written with or without a leading dot, and matching is case-insensitive.
289
+
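The suffix rule described above (leading dot optional, case ignored) could be normalized along these lines. Both helper names are hypothetical and the sketch only illustrates the matching rule, not how csvsmith moves files.

```python
from pathlib import Path


def normalize_suffixes_sketch(suffixes):
    """Lower-case each suffix and give it exactly one leading dot."""
    cleaned = (s.strip().lstrip(".").lower() for s in suffixes)
    return {"." + s for s in cleaned if s}


def matches_suffix_sketch(filename, suffixes):
    """True if the file's final suffix is among the requested ones."""
    return Path(filename).suffix.lower() in normalize_suffixes_sketch(suffixes)


wanted = "csv,pdf".split(",")  # the form used on the command line above
print(matches_suffix_sketch("Report.CSV", wanted))  # True
print(matches_suffix_sketch("scan.pdf", [".PDF"]))  # True
print(matches_suffix_sketch("notes.txt", wanted))   # False
```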
290
+ Drop CSV rows
291
+ ~~~~~~~~~~~~~
292
+
293
+ Use the ``drop-rows`` subcommand to remove rows from a CSV file when a chosen
294
+ column contains an unwanted substring.
295
+
296
+ The command expects three positional arguments:
297
+
298
+ - input: path to the source CSV file
299
+ - column_name: the header name of the column to inspect
300
+ - unwanted_text: the text that, if found in the chosen column, causes a row to be removed
301
+
302
+ It also supports two optional flags:
303
+
304
+ - --case-insensitive: match unwanted_text without regard to letter case
305
+ - --drop-header: do not copy the first row to the output file
306
+
307
+ The output is written next to the input file using the same name with
308
+ ``.filtered.csv`` appended. For example:
309
+
310
+ - orders.csv -> orders.filtered.csv
311
+
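The naming rule can be reproduced with ``pathlib``. This is a sketch that assumes the input name ends in ``.csv``; the helper name is hypothetical.

```python
from pathlib import Path


def filtered_path_sketch(input_path):
    """Return the sibling path whose final suffix is replaced by .filtered.csv."""
    return Path(input_path).with_suffix(".filtered.csv")


print(filtered_path_sketch("orders.csv"))
# orders.filtered.csv
```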
312
+ Basic usage
313
+ ^^^^^^^^^^^
314
+
315
+ .. code-block:: bash
316
+
317
+     csvsmith drop-rows input.csv notes spam
318
+
319
+ This removes every row where the notes column contains spam. The header row is
320
+ preserved by default.
321
+
322
+ Case-insensitive matching
323
+ ^^^^^^^^^^^^^^^^^^^^^^^^^
324
+
325
+ .. code-block:: bash
326
+
327
+     csvsmith drop-rows input.csv notes spam --case-insensitive
328
+
329
+ This is useful when the data may contain values such as Spam, SPAM, or sPaM.
330
+
331
+ Skip the header row
332
+ ^^^^^^^^^^^^^^^^^^^
333
+
334
+ .. code-block:: bash
335
+
336
+     csvsmith drop-rows input.csv notes spam --drop-header
337
+
338
+ Use this only if you explicitly want the output file to contain data rows only.
339
+
340
+ How to use it effectively
341
+ ^^^^^^^^^^^^^^^^^^^^^^^^^
342
+
343
+ - Make sure column_name exactly matches a header value in the CSV.
344
+ - Choose a substring that is specific enough to avoid removing unrelated rows.
345
+ - Use --case-insensitive when the source data is inconsistent in capitalization.
346
+ - Keep the header unless you are intentionally producing a headerless file.
347
+ - If the column name is missing, the command will fail with a clear error.
348
+
349
+ Example
350
+ ^^^^^^^
351
+
352
+ Suppose you have a CSV like this:
353
+
354
+ .. code-block:: csv
355
+
356
+     id,name,notes
357
+     1,Alice,ok
358
+     2,Bob,contains spam here
359
+     3,Carol,ok
360
+
361
+ Running:
362
+
363
+ .. code-block:: bash
364
+
365
+     csvsmith drop-rows input.csv notes spam
366
+
367
+ produces a filtered file containing:
368
+
369
+ .. code-block:: csv
370
+
371
+     id,name,notes
372
+     1,Alice,ok
373
+     3,Carol,ok
374
+
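The filtering rule in this example can be approximated with the standard ``csv`` module. This is a sketch under stated assumptions: it works on in-memory text rather than files, the function and parameter names are hypothetical, and it is not the code behind ``drop-rows``.

```python
import csv
import io


def drop_rows_sketch(text, column_name, unwanted_text,
                     case_insensitive=False, drop_header=False):
    """Return CSV text without the rows whose column contains the substring."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    if column_name not in header:
        raise ValueError(f"column not found: {column_name!r}")
    idx = header.index(column_name)
    needle = unwanted_text.lower() if case_insensitive else unwanted_text

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    if not drop_header:
        writer.writerow(header)  # the header is kept unless asked otherwise
    for row in reader:
        cell = row[idx].lower() if case_insensitive else row[idx]
        if needle not in cell:
            writer.writerow(row)
    return out.getvalue()


sample = "id,name,notes\n1,Alice,ok\n2,Bob,contains spam here\n3,Carol,ok\n"
print(drop_rows_sketch(sample, "notes", "spam"), end="")
# id,name,notes
# 1,Alice,ok
# 3,Carol,ok
```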
375
+ Report-only mode
376
+ ~~~~~~~~~~~~~~~~
377
+
378
+ ``--report-only`` scans matching CSVs and writes a manifest describing what
379
+ would happen, without touching the filesystem. This enables downstream
380
+ pipelines to consume the classification plan for custom processing.
381
+
382
+ Philosophy
383
+ ----------
384
+
385
+ 1. CSVs deserve tools that are simple, predictable, and transparent.
386
+ 2. A row has meaning only when its identity is stable and hashable.
387
+ 3. Collisions are sin; determinism is virtue.
388
+ 4. Let no delimiter sow ambiguity among fields.
389
+ 5. Love thy \x1f — the unseen separator, guardian of clean hashes.
390
+ 6. The pipeline should be silent unless something is wrong.
391
+ 7. Your data deserves respect — and your tools should help you give it.
392
+
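Points 2 to 5 can be made concrete with a short sketch. Joining fields with an ordinary delimiter before hashing lets two different rows produce the same string, whereas the ASCII unit separator ``\x1f`` almost never occurs inside field values. The helper name and the choice of SHA-256 are assumptions; the README does not say how csvsmith computes row identity.

```python
import hashlib

UNIT_SEP = "\x1f"  # ASCII unit separator


def row_digest_sketch(fields):
    """Hash a row by joining its fields with the unit separator."""
    joined = UNIT_SEP.join(str(f) for f in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


row_a = ["a,b", "c"]
row_b = ["a", "b,c"]

# A comma join cannot tell these two rows apart ...
print(",".join(row_a) == ",".join(row_b))                      # True
# ... while the unit separator keeps their digests distinct.
print(row_digest_sketch(row_a) == row_digest_sketch(row_b))    # False
```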
393
+ License
394
+ -------
395
+
396
+ MIT License.