codeclone 1.2.0__tar.gz → 1.3.0__tar.gz

Files changed (81)
  1. codeclone-1.3.0/PKG-INFO +402 -0
  2. codeclone-1.3.0/README.md +361 -0
  3. {codeclone-1.2.0 → codeclone-1.3.0}/codeclone/__init__.py +1 -1
  4. codeclone-1.3.0/codeclone/_cli_args.py +161 -0
  5. codeclone-1.3.0/codeclone/_cli_meta.py +43 -0
  6. codeclone-1.3.0/codeclone/_cli_paths.py +36 -0
  7. codeclone-1.3.0/codeclone/_cli_summary.py +115 -0
  8. codeclone-1.3.0/codeclone/_html_escape.py +35 -0
  9. codeclone-1.3.0/codeclone/_html_snippets.py +208 -0
  10. codeclone-1.3.0/codeclone/_report_grouping.py +64 -0
  11. codeclone-1.3.0/codeclone/_report_segments.py +247 -0
  12. codeclone-1.3.0/codeclone/_report_serialize.py +160 -0
  13. codeclone-1.3.0/codeclone/_report_types.py +14 -0
  14. codeclone-1.3.0/codeclone/baseline.py +245 -0
  15. {codeclone-1.2.0 → codeclone-1.3.0}/codeclone/blockhash.py +1 -1
  16. codeclone-1.3.0/codeclone/blocks.py +131 -0
  17. codeclone-1.3.0/codeclone/cache.py +285 -0
  18. codeclone-1.3.0/codeclone/cfg.py +410 -0
  19. codeclone-1.3.0/codeclone/cfg_model.py +47 -0
  20. codeclone-1.3.0/codeclone/cli.py +895 -0
  21. codeclone-1.3.0/codeclone/errors.py +41 -0
  22. codeclone-1.3.0/codeclone/extractor.py +258 -0
  23. codeclone-1.3.0/codeclone/html_report.py +434 -0
  24. codeclone-1.3.0/codeclone/meta_markers.py +13 -0
  25. codeclone-1.3.0/codeclone/normalize.py +236 -0
  26. codeclone-1.3.0/codeclone/py.typed +0 -0
  27. codeclone-1.3.0/codeclone/report.py +56 -0
  28. codeclone-1.3.0/codeclone/scanner.py +111 -0
  29. codeclone-1.3.0/codeclone/templates.py +2457 -0
  30. codeclone-1.3.0/codeclone/ui_messages.py +257 -0
  31. codeclone-1.3.0/codeclone.egg-info/PKG-INFO +402 -0
  32. codeclone-1.3.0/codeclone.egg-info/SOURCES.txt +58 -0
  33. {codeclone-1.2.0 → codeclone-1.3.0}/codeclone.egg-info/requires.txt +3 -0
  34. {codeclone-1.2.0 → codeclone-1.3.0}/pyproject.toml +37 -4
  35. codeclone-1.3.0/tests/test_baseline.py +296 -0
  36. codeclone-1.3.0/tests/test_blockhash.py +11 -0
  37. codeclone-1.3.0/tests/test_blocks.py +107 -0
  38. codeclone-1.3.0/tests/test_cache.py +437 -0
  39. codeclone-1.3.0/tests/test_cfg.py +852 -0
  40. codeclone-1.3.0/tests/test_cfg_model.py +18 -0
  41. codeclone-1.3.0/tests/test_cli_inprocess.py +2676 -0
  42. codeclone-1.3.0/tests/test_cli_main_guard.py +17 -0
  43. codeclone-1.3.0/tests/test_cli_main_guard_runpy.py +12 -0
  44. {codeclone-1.2.0 → codeclone-1.3.0}/tests/test_cli_smoke.py +28 -8
  45. codeclone-1.3.0/tests/test_cli_unit.py +202 -0
  46. codeclone-1.3.0/tests/test_extractor.py +353 -0
  47. codeclone-1.3.0/tests/test_fingerprint.py +15 -0
  48. codeclone-1.3.0/tests/test_html_report.py +595 -0
  49. codeclone-1.3.0/tests/test_init.py +26 -0
  50. codeclone-1.3.0/tests/test_normalize.py +433 -0
  51. codeclone-1.3.0/tests/test_report.py +758 -0
  52. codeclone-1.3.0/tests/test_scanner_extra.py +215 -0
  53. codeclone-1.3.0/tests/test_security.py +82 -0
  54. codeclone-1.3.0/tests/test_segments.py +105 -0
  55. codeclone-1.2.0/PKG-INFO +0 -264
  56. codeclone-1.2.0/README.md +0 -225
  57. codeclone-1.2.0/codeclone/baseline.py +0 -58
  58. codeclone-1.2.0/codeclone/blocks.py +0 -73
  59. codeclone-1.2.0/codeclone/cache.py +0 -56
  60. codeclone-1.2.0/codeclone/cfg.py +0 -338
  61. codeclone-1.2.0/codeclone/cli.py +0 -409
  62. codeclone-1.2.0/codeclone/extractor.py +0 -169
  63. codeclone-1.2.0/codeclone/html_report.py +0 -936
  64. codeclone-1.2.0/codeclone/normalize.py +0 -130
  65. codeclone-1.2.0/codeclone/report.py +0 -67
  66. codeclone-1.2.0/codeclone/scanner.py +0 -48
  67. codeclone-1.2.0/codeclone.egg-info/PKG-INFO +0 -264
  68. codeclone-1.2.0/codeclone.egg-info/SOURCES.txt +0 -30
  69. codeclone-1.2.0/tests/test_baseline.py +0 -62
  70. codeclone-1.2.0/tests/test_blocks.py +0 -32
  71. codeclone-1.2.0/tests/test_cfg.py +0 -176
  72. codeclone-1.2.0/tests/test_extractor.py +0 -49
  73. codeclone-1.2.0/tests/test_html_report.py +0 -44
  74. codeclone-1.2.0/tests/test_normalize.py +0 -22
  75. codeclone-1.2.0/tests/test_report.py +0 -24
  76. {codeclone-1.2.0 → codeclone-1.3.0}/LICENSE +0 -0
  77. {codeclone-1.2.0 → codeclone-1.3.0}/codeclone/fingerprint.py +0 -0
  78. {codeclone-1.2.0 → codeclone-1.3.0}/codeclone.egg-info/dependency_links.txt +0 -0
  79. {codeclone-1.2.0 → codeclone-1.3.0}/codeclone.egg-info/entry_points.txt +0 -0
  80. {codeclone-1.2.0 → codeclone-1.3.0}/codeclone.egg-info/top_level.txt +0 -0
  81. {codeclone-1.2.0 → codeclone-1.3.0}/setup.cfg +0 -0
@@ -0,0 +1,402 @@
1
+ Metadata-Version: 2.4
2
+ Name: codeclone
3
+ Version: 1.3.0
4
+ Summary: AST and CFG-based code clone detector for Python focused on architectural duplication
5
+ Author-email: Den Rozhnovskiy <pytelemonbot@mail.ru>
6
+ Maintainer-email: Den Rozhnovskiy <pytelemonbot@mail.ru>
7
+ License: MIT
8
+ Project-URL: Homepage, https://github.com/orenlab/codeclone
9
+ Project-URL: Repository, https://github.com/orenlab/codeclone
10
+ Project-URL: Issues, https://github.com/orenlab/codeclone/issues
11
+ Project-URL: Changelog, https://github.com/orenlab/codeclone/releases
12
+ Project-URL: Documentation, https://github.com/orenlab/codeclone/tree/main/docs
13
+ Keywords: python,ast,cfg,code-clone,duplication,static-analysis,architecture,control-flow,ci
14
+ Classifier: Development Status :: 5 - Production/Stable
15
+ Classifier: Intended Audience :: Developers
16
+ Classifier: Topic :: Software Development :: Quality Assurance
17
+ Classifier: Topic :: Software Development :: Testing
18
+ Classifier: Typing :: Typed
19
+ Classifier: License :: OSI Approved :: MIT License
20
+ Classifier: Programming Language :: Python :: 3
21
+ Classifier: Programming Language :: Python :: 3.10
22
+ Classifier: Programming Language :: Python :: 3.11
23
+ Classifier: Programming Language :: Python :: 3.12
24
+ Classifier: Programming Language :: Python :: 3.13
25
+ Classifier: Programming Language :: Python :: 3.14
26
+ Classifier: Operating System :: OS Independent
27
+ Requires-Python: >=3.10
28
+ Description-Content-Type: text/markdown
29
+ License-File: LICENSE
30
+ Requires-Dist: pygments>=2.19.2
31
+ Requires-Dist: rich>=14.3.2
32
+ Provides-Extra: dev
33
+ Requires-Dist: pytest>=9.0.0; extra == "dev"
34
+ Requires-Dist: pytest-cov>=6.1.0; extra == "dev"
35
+ Requires-Dist: build>=1.2.0; extra == "dev"
36
+ Requires-Dist: twine>=5.0.0; extra == "dev"
37
+ Requires-Dist: mypy>=1.19.1; extra == "dev"
38
+ Requires-Dist: ruff>=0.15.0; extra == "dev"
39
+ Requires-Dist: pre-commit>=4.5.1; extra == "dev"
40
+ Dynamic: license-file
41
+
42
+ # CodeClone
43
+
44
+ [![PyPI](https://img.shields.io/pypi/v/codeclone.svg?style=flat-square)](https://pypi.org/project/codeclone/)
45
+ [![Downloads](https://img.shields.io/pypi/dm/codeclone.svg?style=flat-square)](https://pypi.org/project/codeclone/)
46
+ [![tests](https://github.com/orenlab/codeclone/actions/workflows/tests.yml/badge.svg?branch=main&style=flat-square)](https://github.com/orenlab/codeclone/actions/workflows/tests.yml)
47
+ [![Python](https://img.shields.io/pypi/pyversions/codeclone.svg?style=flat-square)](https://pypi.org/project/codeclone/)
48
+ ![CI First](https://img.shields.io/badge/CI-first-green?style=flat-square)
49
+ ![Baseline](https://img.shields.io/badge/baseline-versioned-green?style=flat-square)
50
+ [![License](https://img.shields.io/pypi/l/codeclone.svg?style=flat-square)](LICENSE)
51
+
52
+ **CodeClone** is a Python code clone detector based on **normalized Python AST and Control Flow Graphs (CFG)**.
53
+ It helps teams discover architectural duplication and prevent new copy-paste from entering the codebase via CI.
54
+
55
+ CodeClone is designed to help teams:
56
+
57
+ - discover **structural and control-flow duplication**,
58
+ - identify architectural hotspots,
59
+ - prevent *new* duplication via CI and pre-commit hooks.
60
+
61
+ Unlike token- or text-based tools, CodeClone operates on **normalized Python AST and CFG**, making it robust against
62
+ renaming, formatting, and minor refactoring.
63
+
64
+ ---
65
+
66
+ ## Why CodeClone?
67
+
68
+ Most existing tools detect *textual* duplication.
69
+ CodeClone detects **structural and block-level duplication**, which usually signals missing abstractions or
70
+ architectural drift.
71
+
72
+ Typical use cases:
73
+
74
+ - duplicated service or orchestration logic across layers (API ↔ application),
75
+ - repeated validation or guard blocks,
76
+ - copy-pasted request / handler flows,
77
+ - duplicated control-flow logic in routers, handlers, or services.
78
+
79
+ ---
80
+
81
+ ## Features
82
+
83
+ ### Function-level clone detection (Type-2, CFG-based)
84
+
85
+ - Detects functions and methods with identical **control-flow structure**.
86
+ - Based on **Control Flow Graph (CFG)** fingerprinting.
87
+ - Robust to:
88
+ - variable renaming,
89
+ - constant changes,
90
+ - attribute renaming,
91
+ - formatting differences,
92
+ - docstrings and type annotations.
93
+ - Ideal for spotting architectural duplication across layers.
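
Editor's illustration (not part of the package): the robustness properties listed above can be approximated with a small AST-only sketch. The class `Normalizer`, the function `fingerprint`, and the placeholder strings are hypothetical names chosen for this example. CodeClone itself fingerprints a CFG, which is not modeled here, so treat this as a rough intuition under those assumptions rather than the tool's algorithm.

```python
import ast
import hashlib


class Normalizer(ast.NodeTransformer):
    """Replace names, attributes, constants, and annotations with placeholders."""

    def visit_FunctionDef(self, node):
        # Drop a leading docstring before visiting the remaining body.
        if ast.get_docstring(node) is not None:
            node.body = node.body[1:] or [ast.Pass()]
        node.name = "_FN_"
        node.returns = None  # ignore the return annotation
        self.generic_visit(node)
        return node

    visit_AsyncFunctionDef = visit_FunctionDef

    def visit_arg(self, node):
        # Parameter names and parameter annotations are discarded.
        return ast.arg(arg="_ARG_", annotation=None)

    def visit_Name(self, node):
        return ast.Name(id="_VAR_", ctx=node.ctx)

    def visit_Attribute(self, node):
        self.generic_visit(node)
        node.attr = "_ATTR_"
        return node

    def visit_Constant(self, node):
        return ast.Constant(value="_CONST_")


def fingerprint(source: str) -> str:
    """Hash the normalized AST of the first function in `source`."""
    func = ast.parse(source).body[0]
    normalized = Normalizer().visit(func)
    return hashlib.sha256(ast.dump(normalized).encode("utf-8")).hexdigest()


A = '''
def total(items, tax):
    """Sum prices."""
    result = 0
    for item in items:
        if item.price > 10:
            result += item.price * tax
    return result
'''

B = '''
def accumulate(rows: list, rate: float) -> float:
    acc = 1
    for row in rows:
        if row.cost > 99:
            acc += row.cost * rate
    return acc
'''

same_shape = fingerprint(A) == fingerprint(B)
```

The two sample functions differ in names, constants, attribute names, annotations, and the docstring, yet they share one statement structure, so `same_shape` compares equal fingerprints. A function with a different loop or branch structure would hash differently.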
94
+
95
+ ### Block-level clone detection (Type-3-lite)
96
+
97
+ - Detects repeated **statement blocks** inside larger functions.
98
+ - Uses sliding windows over CFG-normalized statement sequences.
99
+ - Targets:
100
+ - validation blocks,
101
+ - guard clauses,
102
+ - repeated orchestration logic.
103
+ - Carefully filtered to reduce noise:
104
+ - no overlapping windows,
105
+ - no clones inside the same function,
106
+ - no `__init__` noise,
107
+ - size and statement-count thresholds.
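
Editor's illustration (not part of the package): a minimal sketch of the sliding-window idea described above, assuming each function has already been reduced to a list of normalized statement strings. The names `window_hashes` and `find_block_clones` and the default window size are hypothetical. Only the "no clones inside the same function" filter is modeled; the overlap, `__init__`, and threshold filters are left out.

```python
import hashlib
from collections import defaultdict


def window_hashes(stmts, size):
    """Yield (start_index, digest) for every window of `size` statements."""
    for start in range(len(stmts) - size + 1):
        chunk = "\n".join(stmts[start:start + size])
        yield start, hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def find_block_clones(functions, size=4):
    """Group identical windows that occur in at least two different functions.

    `functions` maps a function name to its list of normalized statements.
    """
    groups = defaultdict(list)
    for name, stmts in functions.items():
        for start, digest in window_hashes(stmts, size):
            groups[digest].append((name, start))
    # Keep a group only if it spans more than one function.
    return {
        digest: locations
        for digest, locations in groups.items()
        if len({name for name, _ in locations}) > 1
    }


functions = {
    "a": ["x", "y", "z", "w", "q"],
    "b": ["p", "x", "y", "z", "w"],
    "c": ["m", "n", "m", "n", "m", "n"],  # repeats only within itself
}
clones = find_block_clones(functions, size=4)
```

Here the window `x, y, z, w` appears at index 0 of `a` and index 1 of `b`, so it forms the only group. The repetition inside `c` is dropped because both occurrences belong to the same function.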
108
+
109
+ ### Segment-level internal clone detection
110
+
111
+ - Detects repeated **segment windows** inside the same function.
112
+ - Uses a two-step deterministic match (candidate signature → strict hash).
113
+ - Included in reports for explainability, **not** in baseline/CI failure logic.
114
+
115
+ ### Control-Flow Awareness (CFG v1)
116
+
117
+ - Each function is converted into a **Control Flow Graph**.
118
+ - CFG nodes contain normalized AST statements.
119
+ - CFG edges represent structural control flow:
120
+ - `if` / `else`
121
+ - `for` / `async for` / `while`
122
+ - `try` / `except` / `finally`
123
+ - `with` / `async with`
124
+ - `match` / `case` (Python 3.10+)
125
+ - Current CFG semantics (v1):
126
+ - `and` / `or` are modeled as short-circuit micro-CFG branches,
127
+ - `try/except` links only from statements that may raise,
128
+ - `break` / `continue` are modeled as terminating loop transitions with explicit targets,
129
+ - `for/while ... else` semantics are preserved structurally,
130
+ - `match case` and `except` handler order is preserved structurally,
131
+ - after-blocks are explicit and always present,
132
+ - focus is on **structural similarity**, not precise runtime semantics.
133
+
134
+ This design keeps clone detection **stable, deterministic, and low-noise**.
135
+
136
+ ### Low-noise by design
137
+
138
+ - AST + CFG normalization instead of token matching.
139
+ - Conservative defaults tuned for real-world Python projects.
140
+ - Explicit thresholds for size and statement count.
141
+ - No probabilistic scoring or heuristic similarity thresholds.
142
+ - Safe commutative normalization and local logical equivalences only.
143
+ - Focus on *architectural duplication*, not micro-similarities.
144
+
145
+ ### CI-friendly baseline mode
146
+
147
+ - Establish a baseline of existing clones.
148
+ - Fail CI **only when new clones are introduced**.
149
+ - Safe for legacy codebases and incremental refactoring.
150
+
151
+ ---
152
+
153
+ ## Installation
154
+
155
+ ```bash
156
+ pip install codeclone
157
+ ```
158
+
159
+ Python 3.10+ is required.
160
+
161
+ ## Quick Start
162
+
163
+ Run on a project:
164
+
165
+ ```bash
166
+ codeclone .
167
+ ```
168
+
169
+ This will:
170
+
171
+ - scan Python files,
172
+ - build CFGs for functions,
173
+ - detect function-level and block-level clones,
174
+ - print a summary to stdout.
175
+
176
+ Generate reports:
177
+
178
+ ```bash
179
+ codeclone . \
180
+ --json .cache/codeclone/report.json \
181
+ --text .cache/codeclone/report.txt
182
+ ```
183
+
184
+ Generate an HTML report:
185
+
186
+ ```bash
187
+ codeclone . --html .cache/codeclone/report.html
188
+ ```
189
+
190
+ Check version:
191
+
192
+ ```bash
193
+ codeclone --version
194
+ ```
195
+
196
+ ---
197
+
198
+ ## Reports and Metadata
199
+
200
+ All report formats include provenance metadata for auditability:
201
+
202
+ `codeclone_version`, `python_version`, `baseline_path`, `baseline_version`,
203
+ `baseline_schema_version`, `baseline_python_version`, `baseline_loaded`,
204
+ `baseline_status` (and cache metadata when available).
205
+
206
+ `baseline_status` values:
207
+
208
+ - `ok`
209
+ - `missing`
210
+ - `legacy`
211
+ - `invalid`
212
+ - `mismatch_version`
213
+ - `mismatch_schema`
214
+ - `mismatch_python`
215
+ - `generator_mismatch`
216
+ - `integrity_missing`
217
+ - `integrity_failed`
218
+ - `too_large`
219
+
220
+ ---
221
+
222
+ ## Baseline Workflow (Recommended)
223
+
224
+ 1. Create a baseline
225
+
226
+ Run once on your current codebase:
227
+
228
+ ```bash
229
+ codeclone . --update-baseline
230
+ ```
231
+
232
+ Commit the generated baseline file to the repository.
233
+
234
+ Baselines are versioned. If CodeClone is upgraded, regenerate the baseline to keep
235
+ CI deterministic and explainable.
236
+
237
+ Baseline format in 1.3+ is tamper-evident (generator, payload_sha256) and validated
238
+ before baseline comparison.
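
Editor's illustration (not part of the package): one way a tamper-evident baseline could be checked, assuming a JSON document with top-level `generator`, `payload`, and `payload_sha256` keys. Only the field names `generator` and `payload_sha256` and the status strings come from the README; the file layout, the canonical JSON encoding, and the function names are assumptions made for this sketch.

```python
import hashlib
import json


def payload_digest(payload):
    """SHA-256 over a canonical JSON encoding (sorted keys, compact separators)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def seal(payload, generator):
    """Wrap a payload with its generator tag and digest (hypothetical layout)."""
    return {
        "generator": generator,
        "payload": payload,
        "payload_sha256": payload_digest(payload),
    }


def verify(document, expected_generator):
    """Return a status string using the names listed in the README."""
    if document.get("generator") != expected_generator:
        return "generator_mismatch"
    if "payload_sha256" not in document:
        return "integrity_missing"
    if document["payload_sha256"] != payload_digest(document.get("payload", {})):
        return "integrity_failed"
    return "ok"


baseline = seal({"functions": ["h1", "h2"], "blocks": []}, generator="codeclone")
status = verify(baseline, expected_generator="codeclone")
```

Hashing a canonical encoding rather than the raw file bytes means whitespace or key-order changes do not trip the check, while any edit to the payload content does.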
239
+
240
+ 2. Trusted vs untrusted baseline behavior
241
+
242
+ Baseline states considered untrusted:
243
+
244
+ - `invalid`
245
+ - `too_large`
246
+ - `generator_mismatch`
247
+ - `integrity_missing`
248
+ - `integrity_failed`
249
+
250
+ Behavior:
251
+
252
+ - in normal mode, an untrusted baseline is ignored with a warning (comparison falls back to an empty baseline);
253
+ - in `--fail-on-new` / `--ci`, untrusted baseline fails fast (exit code 2).
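
Editor's illustration (not part of the package): the trusted/untrusted rule above written as a small decision function. The status names and exit code 2 come from the README; the function name `effective_baseline`, its signature, and the warning text are hypothetical. Statuses outside the untrusted list (for example `missing` or `legacy`) are passed through unchanged here, which is an assumption since the README does not describe them in this section.

```python
from __future__ import annotations

import sys

UNTRUSTED_STATUSES = frozenset(
    {
        "invalid",
        "too_large",
        "generator_mismatch",
        "integrity_missing",
        "integrity_failed",
    }
)


def effective_baseline(
    status: str, baseline: set[str], gating: bool
) -> tuple[set[str], int | None]:
    """Return (baseline to compare against, exit code or None).

    `gating` stands for running with --fail-on-new or --ci.
    """
    if status in UNTRUSTED_STATUSES:
        if gating:
            # Fail fast: an untrusted baseline must not silently pass CI.
            return set(), 2
        print(f"warning: ignoring untrusted baseline ({status})", file=sys.stderr)
        return set(), None
    return set(baseline), None
```

For example, `effective_baseline("integrity_failed", {"h1"}, gating=True)` yields an empty baseline together with exit code 2, while the same call with `gating=False` yields an empty baseline and no exit code.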
254
+
255
+ 3. Use in CI
256
+
257
+ ```bash
258
+ codeclone . --ci
259
+ ```
260
+
261
+ or:
262
+
263
+ ```bash
264
+ codeclone . --ci --html .cache/codeclone/report.html
265
+ ```
266
+
267
+ `--ci` is equivalent to `--fail-on-new --no-color --quiet`.
268
+
269
+ Behavior:
270
+
271
+ - existing clones are allowed,
272
+ - the build fails if new clones appear,
273
+ - refactoring that removes duplication is always allowed.
274
+
275
+ `--fail-on-new` / `--ci` exits with a non-zero code when new clones are detected.
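
Editor's illustration (not part of the package): the three CI rules above reduce to a set difference between the clone-group identifiers found now and those recorded in the baseline. The function names are hypothetical, and the exit value `1` is a placeholder; the README only states that the code is non-zero.

```python
def new_clone_groups(current, baseline):
    """Groups present now but absent from the baseline."""
    return set(current) - set(baseline)


def gate(current, baseline):
    """Return 0 when no new groups appear, otherwise a non-zero placeholder."""
    return 1 if new_clone_groups(current, baseline) else 0


baseline = {"g1", "g2", "g3"}
unchanged = gate({"g1", "g2", "g3"}, baseline)   # existing clones are allowed
refactored = gate({"g1"}, baseline)              # removing duplication is allowed
regressed = gate({"g1", "g2", "g3", "g4"}, baseline)  # a new group fails the build
```

Because only the difference `current - baseline` matters, shrinking the set of clones can never fail the gate.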
276
+
277
+ ---
278
+
279
+ ### Cache
280
+
281
+ By default, CodeClone stores the cache per project at:
282
+
283
+ ```bash
284
+ <root>/.cache/codeclone/cache.json
285
+ ```
286
+
287
+ You can override this path with `--cache-path` (`--cache-dir` is a legacy alias).
288
+
289
+ If you used an older version of CodeClone, delete the legacy cache file at
290
+ `~/.cache/codeclone/cache.json` and add `.cache/` to `.gitignore`.
291
+
292
+ Cache integrity checks are strict: signature mismatch or oversized cache files are ignored
293
+ with an explicit warning, then rebuilt from source.
294
+
295
+ Cache entries are validated against expected structure/types; invalid entries are ignored
296
+ deterministically.
297
+
298
+ ---
299
+
300
+ ## Python Version Consistency for Baseline Checks
301
+
302
+ Due to inherent differences in Python’s AST between interpreter versions, baseline
303
+ generation and verification must be performed using the same Python version.
304
+
305
+ This ensures deterministic and reproducible clone detection results.
306
+
307
+ CI checks therefore pin baseline verification to a single Python version, while the
308
+ test matrix continues to validate compatibility across Python 3.10–3.14.
309
+
310
+ ---
311
+
312
+ ## Using with pre-commit
313
+
314
+ ```yaml
315
+ repos:
316
+ - repo: local
317
+ hooks:
318
+ - id: codeclone
319
+ name: CodeClone
320
+ entry: codeclone
321
+ language: system
322
+ pass_filenames: false
323
+ args: [ ".", "--ci" ]
324
+ types: [ python ]
325
+ ```
326
+
327
+ ---
328
+
329
+ ## What CodeClone Is (and Is Not)
330
+
331
+ ### CodeClone Is
332
+
333
+ - an architectural analysis tool,
334
+ - a duplication radar,
335
+ - a CI guard against copy-paste,
336
+ - a control-flow-aware clone detector.
337
+
338
+ ### CodeClone Is Not
339
+
340
+ - a linter,
341
+ - a formatter,
342
+ - a semantic equivalence prover,
343
+ - a runtime analyzer.
344
+
345
+ ## How It Works (High Level)
346
+
347
+ 1. Parse Python source into AST.
348
+ 2. Normalize AST (names, constants, attributes, annotations).
349
+ 3. Build a Control Flow Graph (CFG) per function.
350
+ 4. Compute stable CFG fingerprints.
351
+ 5. Extract segment windows for internal clone discovery.
352
+ 6. Detect function-level, block-level, and segment-level clones.
353
+ 7. Apply conservative filters to suppress noise.
354
+
355
+ See the architectural overview:
356
+
357
+ - [docs/architecture.md](docs/architecture.md)
358
+
359
+ ---
360
+
361
+ ## Control Flow Graph (CFG)
362
+
363
+ Starting from version 1.1.0, CodeClone uses a Control Flow Graph (CFG)
364
+ to improve structural clone detection robustness.
365
+
366
+ The CFG is a structural abstraction, not a runtime execution model.
367
+
368
+ See full design and semantics:
369
+
370
+ - [docs/cfg.md](docs/cfg.md)
371
+
372
+ ---
373
+
374
+ ## CLI Options
375
+
376
+ | Option | Description | Default |
377
+ |-------------------------------|----------------------------------------------------------------------|--------------------------------------|
378
+ | `root` | Project root directory to scan | `.` |
379
+ | `--version` | Print CodeClone version and exit | - |
380
+ | `--min-loc` | Minimum function LOC to analyze | `15` |
381
+ | `--min-stmt` | Minimum AST statements to analyze | `6` |
382
+ | `--processes` | Number of worker processes | `4` |
383
+ | `--cache-path FILE` | Cache file path | `<root>/.cache/codeclone/cache.json` |
384
+ | `--cache-dir FILE` | Legacy alias for `--cache-path` | - |
385
+ | `--max-cache-size-mb MB` | Max cache size before ignore + warning | `50` |
386
+ | `--baseline FILE` | Baseline file path | `codeclone.baseline.json` |
387
+ | `--max-baseline-size-mb MB` | Max baseline size; untrusted baseline fails in CI, ignored otherwise | `5` |
388
+ | `--update-baseline` | Regenerate baseline from current results | `False` |
389
+ | `--fail-on-new` | Fail if new function/block clone groups appear vs baseline | `False` |
390
+ | `--fail-threshold MAX_CLONES` | Fail if total clone groups (`function + block`) exceed threshold | `-1` (disabled) |
391
+ | `--ci` | CI preset: `--fail-on-new --no-color --quiet` | `False` |
392
+ | `--html FILE` | Write HTML report (`.html`) | - |
393
+ | `--json FILE` | Write JSON report (`.json`) | - |
394
+ | `--text FILE` | Write text report (`.txt`) | - |
395
+ | `--no-progress` | Disable progress bar output | `False` |
396
+ | `--no-color` | Disable ANSI colors | `False` |
397
+ | `--quiet` | Minimize output (warnings/errors still shown) | `False` |
398
+ | `--verbose` | Show hash details for new clone groups in fail output | `False` |
399
+
400
+ ## License
401
+
402
+ MIT License