jscpd-rs 0.1.0

Files changed (96)
  1. package/CHANGELOG.md +69 -0
  2. package/Cargo.lock +1323 -0
  3. package/Cargo.toml +54 -0
  4. package/LICENSE +21 -0
  5. package/README.md +372 -0
  6. package/docs/api-parity.md +49 -0
  7. package/docs/cloning-plan.md +281 -0
  8. package/docs/compat-baseline.md +535 -0
  9. package/docs/format-porting.md +86 -0
  10. package/docs/junior-task-template.md +62 -0
  11. package/docs/junior-workflow.md +87 -0
  12. package/docs/migrating-from-jscpd.md +193 -0
  13. package/docs/npm-release.md +116 -0
  14. package/docs/public-benchmark-suite.md +81 -0
  15. package/docs/release-checklist.md +200 -0
  16. package/docs/release-decisions.md +103 -0
  17. package/docs/release-readiness.md +51 -0
  18. package/docs/upstream-bugs.md +501 -0
  19. package/docs/upstream-issue-drafts.md +393 -0
  20. package/docs/user-guide.md +309 -0
  21. package/examples/dump_oxc_tokens.rs +112 -0
  22. package/examples/library_api.rs +42 -0
  23. package/npm/bin/jscpd-rs.js +6 -0
  24. package/npm/bin/jscpd-server.js +6 -0
  25. package/npm/lib/run-binary.js +68 -0
  26. package/npm/scripts/postinstall.js +50 -0
  27. package/package.json +53 -0
  28. package/skills/dry-refactoring/SKILL.md +63 -0
  29. package/skills/jscpd/SKILL.md +85 -0
  30. package/src/app.rs +512 -0
  31. package/src/bin/jscpd-server.rs +429 -0
  32. package/src/blame.rs +130 -0
  33. package/src/cli/config.rs +543 -0
  34. package/src/cli/parsing.rs +301 -0
  35. package/src/cli/tests.rs +543 -0
  36. package/src/cli.rs +671 -0
  37. package/src/detector/matching/secondary.rs +387 -0
  38. package/src/detector/matching.rs +274 -0
  39. package/src/detector/model.rs +190 -0
  40. package/src/detector/prepare.rs +71 -0
  41. package/src/detector/skip_local.rs +40 -0
  42. package/src/detector/statistics.rs +138 -0
  43. package/src/detector/store.rs +96 -0
  44. package/src/detector/tests.rs +238 -0
  45. package/src/detector.rs +265 -0
  46. package/src/files/discovery.rs +508 -0
  47. package/src/files/gitignore.rs +203 -0
  48. package/src/files/paths.rs +68 -0
  49. package/src/files/shebang.rs +106 -0
  50. package/src/files/tests.rs +523 -0
  51. package/src/files.rs +25 -0
  52. package/src/formats.rs +570 -0
  53. package/src/lib.rs +433 -0
  54. package/src/main.rs +26 -0
  55. package/src/report/ai.rs +125 -0
  56. package/src/report/badge.rs +238 -0
  57. package/src/report/console.rs +180 -0
  58. package/src/report/console_common.rs +37 -0
  59. package/src/report/console_full.rs +139 -0
  60. package/src/report/csv.rs +65 -0
  61. package/src/report/escape.rs +8 -0
  62. package/src/report/file_output.rs +28 -0
  63. package/src/report/html/assets.rs +47 -0
  64. package/src/report/html.rs +336 -0
  65. package/src/report/json.rs +119 -0
  66. package/src/report/markdown.rs +125 -0
  67. package/src/report/sarif.rs +302 -0
  68. package/src/report/silent.rs +22 -0
  69. package/src/report/source.rs +38 -0
  70. package/src/report/summary.rs +50 -0
  71. package/src/report/test_support.rs +133 -0
  72. package/src/report/threshold.rs +76 -0
  73. package/src/report/xcode.rs +90 -0
  74. package/src/report/xml.rs +119 -0
  75. package/src/report.rs +250 -0
  76. package/src/server/mcp.rs +942 -0
  77. package/src/server.rs +1081 -0
  78. package/src/tokenizer/apex.rs +97 -0
  79. package/src/tokenizer/blocks.rs +532 -0
  80. package/src/tokenizer/embedded.rs +106 -0
  81. package/src/tokenizer/generic.rs +511 -0
  82. package/src/tokenizer/hash.rs +27 -0
  83. package/src/tokenizer/ignore.rs +33 -0
  84. package/src/tokenizer/line_index.rs +33 -0
  85. package/src/tokenizer/markdown.rs +289 -0
  86. package/src/tokenizer/markup_attrs.rs +289 -0
  87. package/src/tokenizer/oxc/fallback.rs +275 -0
  88. package/src/tokenizer/oxc/jsx.rs +168 -0
  89. package/src/tokenizer/oxc/kind.rs +177 -0
  90. package/src/tokenizer/oxc/lexical.rs +67 -0
  91. package/src/tokenizer/oxc.rs +659 -0
  92. package/src/tokenizer/scan.rs +88 -0
  93. package/src/tokenizer/tap.rs +150 -0
  94. package/src/tokenizer/tests.rs +915 -0
  95. package/src/tokenizer.rs +328 -0
  96. package/src/verbose.rs +195 -0
@@ -0,0 +1,281 @@
1
+ # jscpd-rs Cloning Plan
2
+
3
+ ## Upstream Review
4
+
5
+ The reference implementation is a TypeScript monorepo:
6
+
7
+ - `apps/jscpd` owns the CLI, option/config merging, store setup, reporters, and
8
+ top-level execution.
9
+ - `packages/finder` discovers files, applies `.gitignore`/ignore/size/line
10
+ filters, coordinates detection across files, and hosts most reporters.
11
+ - `packages/core` contains the detector, in-memory store, statistics, validators,
12
+ and Rabin-Karp based clone search.
13
+ - `packages/tokenizer` maps file names/extensions to formats and tokenizes
14
+ supported languages. Current upstream supports 223 formats and special
15
+ block-aware tokenization for Vue, Svelte, Astro, and Markdown.
16
+
17
+ Upstream testing is conventional package-level Vitest behind `pnpm test`,
18
+ orchestrated by Turbo. The public CI workflow builds, lints, runs tests, and
19
+ then smoke-runs `./apps/jscpd/bin/jscpd ./fixtures`. Upstream does not keep a
20
+ separate public benchmark suite or a pinned set of large repositories; the
21
+ README only documents small fixture-based output-size timing/token examples and
22
+ a note about tokenizer speed on unspecified real projects.
23
+
24
+ The core flow is:
25
+
26
+ 1. Parse CLI/config into options.
27
+ 2. Find supported files with glob, ignore, size, line, symlink, and gitignore
28
+ filters.
29
+ 3. Tokenize each source.
30
+ 4. Convert token windows of `minTokens` into hashes.
31
+ 5. Use a per-format store to find matching windows and grow adjacent matches
32
+ into clones.
33
+ 6. Validate by `minLines` and optional validators.
34
+ 7. Emit statistics and reports.
35
+
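Step 4 of the flow above can be sketched as a polynomial rolling hash over per-token hashes. This is a minimal illustration, not the project's implementation: the function name `window_hashes`, the `BASE` constant, and the use of `u64` token hashes are assumptions made for the example.

```rust
/// Hypothetical multiplier for the polynomial hash; any odd constant works for this sketch.
const BASE: u64 = 1_000_003;

/// Returns one hash per window of `min_tokens` consecutive token hashes.
/// Arithmetic wraps modulo 2^64, so each update is O(1) regardless of window size.
pub fn window_hashes(tokens: &[u64], min_tokens: usize) -> Vec<u64> {
    if min_tokens == 0 || tokens.len() < min_tokens {
        return Vec::new();
    }

    // BASE^(min_tokens - 1): the weight of the oldest token in a window.
    let mut top = 1u64;
    for _ in 1..min_tokens {
        top = top.wrapping_mul(BASE);
    }

    // Hash of the first window, computed directly.
    let mut hash = 0u64;
    for &token in &tokens[..min_tokens] {
        hash = hash.wrapping_mul(BASE).wrapping_add(token);
    }

    let mut out = Vec::with_capacity(tokens.len() - min_tokens + 1);
    out.push(hash);

    // Slide the window: drop the outgoing token, shift, add the incoming token.
    for i in min_tokens..tokens.len() {
        let outgoing = tokens[i - min_tokens].wrapping_mul(top);
        hash = hash
            .wrapping_sub(outgoing)
            .wrapping_mul(BASE)
            .wrapping_add(tokens[i]);
        out.push(hash);
    }
    out
}
```

Equal windows always produce equal hashes, so a store keyed by these values can surface candidate matches for step 5. Unequal windows can still collide, which is why a real detector would confirm a hit by comparing the tokens.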
36
+ ## MVP Scope
37
+
38
+ The first Rust MVP intentionally implements the minimum vertical slice needed to
39
+ measure whether a Rust clone has enough performance upside to continue:
40
+
41
+ - CLI with common jscpd flags.
42
+ - Partial `.jscpd.json` support.
43
+ - File discovery with the Rust `ignore` crate for `.gitignore`.
44
+ - Upstream-synchronized extension/name format registry.
45
+ - Language-agnostic non-whitespace tokenizer.
46
+ - Numeric rolling window hashing and in-memory per-format store.
47
+ - Clone growth and `minLines` validation.
48
+ - Console, consoleFull, AI, JSON, CSV, Markdown, XML PMD CPD, SARIF, badge,
49
+ HTML, Xcode, silent, and threshold reporters.
50
+ - Benchmark script against upstream on the same target path.
51
+
52
+ Known MVP gaps:
53
+
54
+ - Tokenization is not language-compatible with upstream yet.
55
+ - The upstream format registry is synchronized, but most long-tail formats still
56
+ use generic tokenization rather than Prism-compatible tokenization.
57
+ - `strict/mild/weak` are still converging overall. `strict` now preserves
58
+ whitespace tokens in the native JS/TS/Oxc path and the generic tokenizer;
59
+ `weak` strips common comment spans for generic formats.
60
+ - Terminal timing/tips/progress/verbose behavior is partially aligned with
61
+ upstream, including clone progress and detector event output.
62
+ - Blame data is populated from native `git blame -w`. Store options currently
63
+ match the local upstream missing-store fallback. Dynamic external stores are
64
+ not implemented yet.
65
+ - A native Rust API now exposes detection from configured paths and prepared
66
+ in-memory sources. Exact upstream JavaScript package API compatibility remains
67
+ follow-up work.
68
+ - A native `jscpd-server` binary exposes the first REST surface:
69
+ `/`, `/api/health`, `/api/stats`, `/api/check`, `/api/recheck`, and `/mcp`.
70
+ The MCP endpoint supports initialize/session handling, core tools, and the
71
+ statistics resource. The current snippet check reuses prepared project token
72
+ maps after `/api/recheck`; a dedicated indexed hybrid-store path can still be
73
+ added if server-scale benchmarks show this is needed.
74
+ - `cache`, config `listeners`, and `tokensToSkip` are parsed for option-surface
75
+ compatibility, but upstream currently does not consume them in runtime code.
76
+ - No full parity for non-native syntax-specific token streams yet.
77
+ - Markdown front matter and fenced code blocks are extracted into embedded
78
+ format maps, with coverage parity on the current upstream fixture.
79
+
80
+ ## Growth Plan
81
+
82
+ 1. Compatibility harness: run upstream and Rust on shared fixtures, compare clone
83
+ counts, locations, statistics, reports, and exit behavior.
84
+ 2. CLI/config parity: harden remaining flags, config merging rules, exit codes,
85
+ threshold behavior, and list output.
86
+ 3. Tokenizer backend: replace the MVP tokenizer with maintained crates and
87
+ language-aware token streams. Prefer existing parsers/tokenizers over custom
88
+ grammars.
89
+ 4. Reporters: polish remaining report details and terminal UX.
90
+ 5. Advanced surfaces: full non-native tokenizer parity, dynamic external
91
+ stores, dynamic external reporters, programmatic API/server parity, and
92
+ stricter `strict`/`mild`/`weak` parity.
93
+ 6. Performance work: parallel file reads/tokenization, compact hash storage,
94
+ faster hashers where compatible, memory profiling, and optional external
95
+ store backends.
96
+
97
+ ## Release Candidate Gate
98
+
99
+ Before publishing a release candidate, run:
100
+
101
+ ```bash
102
+ scripts/release-candidate.sh
103
+ ```
104
+
105
+ This wraps the strict local lint check, the default release gate, the full
106
+ coverage matrix, and the public benchmark/coverage suite into one reproducible
107
+ pre-publication command. The defaults can still be overridden with environment
108
+ variables such as `PUBLIC_CASES`, `PUBLIC_RUNS`, `PUBLIC_MIN_SPEEDUP`, and
109
+ `STRICT`.
110
+
111
+ ## Current Benchmark
112
+
113
+ Command:
114
+
115
+ ```bash
116
+ FORMAT=typescript RUNS=5 scripts/bench.sh jscpd/packages
117
+ ```
118
+
119
+ Result on this workspace:
120
+
121
+ - Rust MVP: `0.108s` average.
122
+ - Upstream `jscpd`: `0.818s` average.
123
+ - Same file count for this run: 297 TypeScript files.
124
+
125
+ Broader command:
126
+
127
+ ```bash
128
+ RUNS=3 scripts/bench.sh jscpd/packages
129
+ ```
130
+
131
+ Result on this workspace:
132
+
133
+ - Rust MVP: `0.130s` average.
134
+ - Upstream `jscpd`: `0.937s` average.
135
+ - This broader run is not fully apples-to-apples yet: the MVP supports fewer
136
+ formats than upstream.
137
+
138
+ Initial signal: continuing makes sense, but the next milestone must measure
139
+ speed while closing tokenization/report compatibility gaps.
140
+
141
+ ## Compatibility Gate
142
+
143
+ The project now uses a coverage-first compatibility rule for ongoing cloning
144
+ work:
145
+
146
+ - Rust must not miss duplicated lines reported by upstream `jscpd` for the same
147
+ file, format, input, and options.
148
+ - Rust may report additional duplicates while compatibility is converging.
149
+ - Missing upstream line coverage is a blocking compatibility failure.
150
+ - Extra Rust duplicates are tracked as diagnostics and fixed when they represent
151
+ likely false positives or user-visible report noise.
152
+ - Exact clone pair and fragment-boundary overlap is diagnostic only: when three
153
+ or more equivalent fragments exist, upstream and Rust may choose different
154
+ pairs or wider/split ranges while still covering the same duplicated lines.
155
+ - Exact 1:1 parity remains a useful quality metric, but it is not the default
156
+ gate for deciding whether the clone is viable.
157
+
158
+ Use the compatibility harness with `STRICT=coverage` to enforce this rule.
159
+ Use `scripts/compat-matrix.sh` for the current JS/TS-focused release matrix.
160
+
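The coverage-first rule can be illustrated as a set comparison over duplicated lines. The sketch below is hypothetical: the `Fragment` type and both function names are invented for the example, and it keys coverage by file and line only, whereas the real gate also fixes format, input, and options.

```rust
use std::collections::BTreeSet;

/// Hypothetical duplicated fragment: an inclusive line range in one file.
pub struct Fragment<'a> {
    pub file: &'a str,
    pub start_line: u32,
    pub end_line: u32,
}

/// Expands fragments into the set of (file, line) pairs they cover.
fn covered_lines<'a>(fragments: &[Fragment<'a>]) -> BTreeSet<(&'a str, u32)> {
    let mut lines = BTreeSet::new();
    for fragment in fragments {
        for line in fragment.start_line..=fragment.end_line {
            lines.insert((fragment.file, line));
        }
    }
    lines
}

/// Lines upstream reports as duplicated that Rust does not cover.
/// Under the coverage-first rule, a non-empty result is a blocking failure.
pub fn missing_upstream_lines<'a>(
    upstream: &[Fragment<'a>],
    rust: &[Fragment<'a>],
) -> Vec<(&'a str, u32)> {
    let rust_lines = covered_lines(rust);
    covered_lines(upstream)
        .into_iter()
        .filter(|line| !rust_lines.contains(line))
        .collect()
}

/// Lines only Rust reports. These are diagnostics, not failures.
pub fn extra_rust_lines<'a>(
    upstream: &[Fragment<'a>],
    rust: &[Fragment<'a>],
) -> Vec<(&'a str, u32)> {
    let upstream_lines = covered_lines(upstream);
    covered_lines(rust)
        .into_iter()
        .filter(|line| !upstream_lines.contains(line))
        .collect()
}
```

Comparing line sets rather than clone pairs is what lets upstream and Rust pick different pairs or wider ranges among equivalent fragments without failing the gate.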
161
+ ## Format Coverage Strategy
162
+
163
+ The format registry is generated from upstream `@jscpd/tokenizer` using
164
+ `scripts/sync-formats.mjs`. This keeps extension detection and `--list` aligned
165
+ with upstream while avoiding hand-maintained mapping drift.
166
+
167
+ Tokenizer strategy remains hybrid:
168
+
169
+ - native Rust/Oxc path for hot JS/TS formats;
170
+ - native Rust block splitting for Markdown, Vue, Svelte, and Astro embedded
171
+ code/style/template regions;
172
+ - generic tokenizer for other recognized formats without parity claims;
173
+ - no embedded JavaScript runtime fallback. Formats that need real compatibility
174
+ should get native Rust tokenizers and focused compat tests.
175
+
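As a rough illustration of the hybrid strategy, backend selection could be a simple match on the format name. The `Backend` enum, the `backend_for` function, and the exact format strings below are assumptions for this sketch, not the project's actual types or registry keys.

```rust
/// Hypothetical tokenizer backends mirroring the three tiers listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    /// Native Rust/Oxc path for hot JS/TS formats.
    Oxc,
    /// Native block splitting for formats with embedded regions.
    Blocks,
    /// Generic tokenizer, with no parity claim.
    Generic,
}

/// Picks a backend from a format name. The names used here are assumed.
pub fn backend_for(format: &str) -> Backend {
    match format {
        "javascript" | "typescript" | "jsx" | "tsx" => Backend::Oxc,
        "markdown" | "vue" | "svelte" | "astro" => Backend::Blocks,
        _ => Backend::Generic,
    }
}
```

Any recognized format without a dedicated arm falls through to the generic tokenizer, so adding a native tokenizer later would be a one-line change in this kind of dispatch.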
176
+ ## Accepted Hard-Feature Decisions
177
+
178
+ The current release decisions are canonicalized in
179
+ `docs/release-decisions.md`. The summary below mirrors that file for quick
180
+ orientation.
181
+
182
+ These choices are part of the current cloning direction until a compatibility
183
+ gate proves they are insufficient:
184
+
185
+ - Dynamic npm reporters, stores, and plugins are post-MVP. The first release
186
+ should implement popular built-in reporters and stores natively instead of
187
+ embedding or casually spawning JavaScript from Rust.
188
+ - Reporter compatibility should be strict for machine-readable contracts such as
189
+ JSON, XML, SARIF, CSV, and Markdown. HTML should stay practically compatible,
190
+ but pixel-perfect parity is not a release blocker.
191
+ - Tokenizer compatibility stays hybrid. Use maintained Rust crates or Oxc for
192
+ hot formats where they materially help; keep long-tail formats on generic
193
+ tokenization until a fixture or public-repo gate shows missed upstream
194
+ coverage.
195
+ - Node/Commander quirks are mirrored only when they are user-visible and covered
196
+ by compatibility tests. The project should not import a JavaScript runtime to
197
+ reproduce every incidental JS behavior.
198
+ - Blame should stay native (`git blame`/git library based) and fail per file
199
+ where possible instead of inheriting upstream nested-repo failure modes.
200
+ - Store/cache work should stay native and demand-driven. Custom external stores
201
+ remain documented gaps unless a real large-repo benchmark proves they are
202
+ needed for release viability.
203
+
204
+ ## Larger Local Repo Benchmarks
205
+
206
+ All commands below used TypeScript-only scanning to keep the comparison focused:
207
+
208
+ ```bash
209
+ FORMAT=typescript RUNS=3 scripts/bench.sh <repo>
210
+ ```
211
+
212
+ These early exploratory timings used private local repositories. The repository
213
+ names and paths are intentionally omitted; repeatable public benchmarks are
214
+ tracked separately in `docs/public-benchmark-suite.md`.
215
+
216
+ | Repo | Rust MVP | Upstream `jscpd` | Files | Rust clones | Upstream clones |
217
+ | --- | ---: | ---: | ---: | ---: | ---: |
218
+ | Private repo A | `0.447s` | `1.930s` | 316 | 93 | 475 |
219
+ | Private repo B | `0.350s` | `1.877s` | 566 Rust / 572 upstream | 371 | 1371 |
220
+ | Private repo C | `0.010s` | `0.290s` | 28 | 12 | 41 |
221
+
222
+ Broader all-format stress on private repo B:
223
+
224
+ - Rust MVP, `RUNS=2`: `0.600s` average.
225
+ - Upstream `jscpd`: first run took `70.32s`, then the benchmark was stopped.
226
+ - This broader run is not a fair compatibility comparison yet because upstream
227
+ supports far more formats and emitted a warning for `excel-formula`.
228
+
229
+ Conclusion remains positive: the Rust path is consistently faster on larger
230
+ repos, but the next milestone must prioritize tokenizer/discovery parity before
231
+ the speedup can be treated as product-quality.
232
+
233
+ ## Acceleration Pass
234
+
235
+ The first MVP was still too conservative: it used MD5 strings for token/window
236
+ hashes and prepared files sequentially. The current hot path now uses:
237
+
238
+ - zero-copy detection tokens: no per-token `String` allocation in detection;
239
+ - `xxh3_128` token hashes;
240
+ - numeric rolling window hashes instead of MD5 over concatenated strings;
241
+ - `rustc_hash::FxHashMap` for the in-memory window store;
242
+ - parallel file reads and line counting;
243
+ - parallel per-file tokenization/window preparation;
244
+ - nanosecond-resolution benchmark timing in `scripts/bench.sh`.
245
+
246
+ Updated TypeScript-only benchmark:
247
+
248
+ | Repo | Rust MVP before | Rust MVP now | Upstream `jscpd` | Approx speedup vs upstream |
249
+ | --- | ---: | ---: | ---: | ---: |
250
+ | `jscpd/packages` | `0.108s` | `0.019s` | `0.856s` | ~45x |
251
+ | Private repo A | `0.447s` | `0.085s` | `2.028s` | ~24x |
252
+ | Private repo B | `0.350s` | `0.074s` | `1.955s` | ~26x |
253
+ | Private repo C | `0.010s` | `0.009s` | `0.305s` | ~33x |
254
+
255
+ This is the first speed signal that is strong enough to justify continuing,
256
+ provided compatibility can be raised without destroying the margin.
257
+
258
+ ## Core Stabilization Pass
259
+
260
+ The detector core was refactored again so later tokenizer/reporting work does
261
+ not have to revisit the detection hot path:
262
+
263
+ - introduced numeric `SourceId` and `FormatId` in the core;
264
+ - introduced `TokenStream` as the detector input contract;
265
+ - removed per-window `Frame` allocation;
266
+ - stores only the first `Occurrence { source_id, token_start }` per window hash;
267
+ - streams rolling windows directly from token hashes;
268
+ - verifies matching windows on hash hits;
269
+ - shards detection by format and runs format shards in parallel.
270
+
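The first-occurrence store with verification on hash hits can be sketched as follows. Only the `Occurrence { source_id, token_start }` shape comes from the list above. The field types, the `find_matches` function, and the non-rolling `window_hash` helper are assumptions chosen to keep the example short.

```rust
use std::collections::HashMap;

/// First position at which a window hash was seen (field types are assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Occurrence {
    pub source_id: u32,
    pub token_start: usize,
}

/// Folds a window of token hashes into one value. Non-rolling, for brevity.
fn window_hash(window: &[u64]) -> u64 {
    window
        .iter()
        .fold(0u64, |acc, &t| acc.wrapping_mul(1_000_003).wrapping_add(t))
}

/// Returns (first occurrence, later occurrence) pairs whose windows are
/// token-for-token equal. `sources[i]` holds the token hashes of source `i`.
pub fn find_matches(
    sources: &[Vec<u64>],
    min_tokens: usize,
) -> Vec<(Occurrence, Occurrence)> {
    let mut matches = Vec::new();
    if min_tokens == 0 {
        return matches; // `slice::windows(0)` would panic.
    }

    let mut store: HashMap<u64, Occurrence> = HashMap::new();
    for (source_id, tokens) in sources.iter().enumerate() {
        for (token_start, window) in tokens.windows(min_tokens).enumerate() {
            let hash = window_hash(window);
            let current = Occurrence { source_id: source_id as u32, token_start };

            if let Some(first) = store.get(&hash).copied() {
                // Hash hit: verify against the stored window to rule out collisions.
                let prior = &sources[first.source_id as usize]
                    [first.token_start..first.token_start + min_tokens];
                if prior == window {
                    matches.push((first, current));
                }
            } else {
                // Only the first occurrence per hash is stored.
                store.insert(hash, current);
            }
        }
    }
    matches
}
```

Keeping a single `Occurrence` per hash bounds the store at one entry per distinct window, and the verification step means a hash collision costs one slice comparison rather than a false clone.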
271
+ Updated TypeScript-only benchmark after this pass:
272
+
273
+ | Repo | Rust after hash pass | Rust after core pass | Upstream `jscpd` | Approx speedup vs upstream |
274
+ | --- | ---: | ---: | ---: | ---: |
275
+ | `jscpd/packages` | `0.019s` | `0.011s` | `0.821s` | ~75x |
276
+ | Private repo A | `0.085s` | `0.038s` | `2.041s` | ~54x |
277
+ | Private repo B | `0.074s` | `0.034s` | `1.939s` | ~57x |
278
+
279
+ Do not treat these as fixed release gates yet. Before publication, choose a
280
+ small set of popular public repositories and use them as the repeatable
281
+ benchmark/compatibility suite.