jscpd-rs 0.1.0
- package/CHANGELOG.md +69 -0
- package/Cargo.lock +1323 -0
- package/Cargo.toml +54 -0
- package/LICENSE +21 -0
- package/README.md +372 -0
- package/docs/api-parity.md +49 -0
- package/docs/cloning-plan.md +281 -0
- package/docs/compat-baseline.md +535 -0
- package/docs/format-porting.md +86 -0
- package/docs/junior-task-template.md +62 -0
- package/docs/junior-workflow.md +87 -0
- package/docs/migrating-from-jscpd.md +193 -0
- package/docs/npm-release.md +116 -0
- package/docs/public-benchmark-suite.md +81 -0
- package/docs/release-checklist.md +200 -0
- package/docs/release-decisions.md +103 -0
- package/docs/release-readiness.md +51 -0
- package/docs/upstream-bugs.md +501 -0
- package/docs/upstream-issue-drafts.md +393 -0
- package/docs/user-guide.md +309 -0
- package/examples/dump_oxc_tokens.rs +112 -0
- package/examples/library_api.rs +42 -0
- package/npm/bin/jscpd-rs.js +6 -0
- package/npm/bin/jscpd-server.js +6 -0
- package/npm/lib/run-binary.js +68 -0
- package/npm/scripts/postinstall.js +50 -0
- package/package.json +53 -0
- package/skills/dry-refactoring/SKILL.md +63 -0
- package/skills/jscpd/SKILL.md +85 -0
- package/src/app.rs +512 -0
- package/src/bin/jscpd-server.rs +429 -0
- package/src/blame.rs +130 -0
- package/src/cli/config.rs +543 -0
- package/src/cli/parsing.rs +301 -0
- package/src/cli/tests.rs +543 -0
- package/src/cli.rs +671 -0
- package/src/detector/matching/secondary.rs +387 -0
- package/src/detector/matching.rs +274 -0
- package/src/detector/model.rs +190 -0
- package/src/detector/prepare.rs +71 -0
- package/src/detector/skip_local.rs +40 -0
- package/src/detector/statistics.rs +138 -0
- package/src/detector/store.rs +96 -0
- package/src/detector/tests.rs +238 -0
- package/src/detector.rs +265 -0
- package/src/files/discovery.rs +508 -0
- package/src/files/gitignore.rs +203 -0
- package/src/files/paths.rs +68 -0
- package/src/files/shebang.rs +106 -0
- package/src/files/tests.rs +523 -0
- package/src/files.rs +25 -0
- package/src/formats.rs +570 -0
- package/src/lib.rs +433 -0
- package/src/main.rs +26 -0
- package/src/report/ai.rs +125 -0
- package/src/report/badge.rs +238 -0
- package/src/report/console.rs +180 -0
- package/src/report/console_common.rs +37 -0
- package/src/report/console_full.rs +139 -0
- package/src/report/csv.rs +65 -0
- package/src/report/escape.rs +8 -0
- package/src/report/file_output.rs +28 -0
- package/src/report/html/assets.rs +47 -0
- package/src/report/html.rs +336 -0
- package/src/report/json.rs +119 -0
- package/src/report/markdown.rs +125 -0
- package/src/report/sarif.rs +302 -0
- package/src/report/silent.rs +22 -0
- package/src/report/source.rs +38 -0
- package/src/report/summary.rs +50 -0
- package/src/report/test_support.rs +133 -0
- package/src/report/threshold.rs +76 -0
- package/src/report/xcode.rs +90 -0
- package/src/report/xml.rs +119 -0
- package/src/report.rs +250 -0
- package/src/server/mcp.rs +942 -0
- package/src/server.rs +1081 -0
- package/src/tokenizer/apex.rs +97 -0
- package/src/tokenizer/blocks.rs +532 -0
- package/src/tokenizer/embedded.rs +106 -0
- package/src/tokenizer/generic.rs +511 -0
- package/src/tokenizer/hash.rs +27 -0
- package/src/tokenizer/ignore.rs +33 -0
- package/src/tokenizer/line_index.rs +33 -0
- package/src/tokenizer/markdown.rs +289 -0
- package/src/tokenizer/markup_attrs.rs +289 -0
- package/src/tokenizer/oxc/fallback.rs +275 -0
- package/src/tokenizer/oxc/jsx.rs +168 -0
- package/src/tokenizer/oxc/kind.rs +177 -0
- package/src/tokenizer/oxc/lexical.rs +67 -0
- package/src/tokenizer/oxc.rs +659 -0
- package/src/tokenizer/scan.rs +88 -0
- package/src/tokenizer/tap.rs +150 -0
- package/src/tokenizer/tests.rs +915 -0
- package/src/tokenizer.rs +328 -0
- package/src/verbose.rs +195 -0
@@ -0,0 +1,281 @@
# jscpd-rs Cloning Plan

## Upstream Review

The reference implementation is a TypeScript monorepo:

- `apps/jscpd` owns the CLI, option/config merging, store setup, reporters, and
  top-level execution.
- `packages/finder` discovers files, applies `.gitignore`/ignore/size/line
  filters, coordinates detection across files, and hosts most reporters.
- `packages/core` contains the detector, in-memory store, statistics, validators,
  and Rabin-Karp based clone search.
- `packages/tokenizer` maps file names/extensions to formats and tokenizes
  supported languages. Current upstream supports 223 formats and special
  block-aware tokenization for Vue, Svelte, Astro, and Markdown.

Upstream testing is conventional package-level Vitest behind `pnpm test`,
orchestrated by Turbo. The public CI workflow builds, lints, runs tests, and
then smoke-runs `./apps/jscpd/bin/jscpd ./fixtures`. Upstream does not keep a
separate public benchmark suite or a pinned set of large repositories; the
README only documents small fixture-based output-size timing/token examples and
a note about tokenizer speed on unspecified real projects.

The core flow is:

1. Parse CLI/config into options.
2. Find supported files with glob, ignore, size, line, symlink, and gitignore
   filters.
3. Tokenize each source.
4. Convert token windows of `minTokens` into hashes.
5. Use a per-format store to find matching windows and grow adjacent matches
   into clones.
6. Validate by `minLines` and optional validators.
7. Emit statistics and reports.
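
Steps 3 to 5 of the flow above can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions, not the upstream or jscpd-rs implementation: `tokenize`, `window_hash`, `find_clones`, and `CloneRange` are hypothetical names, tokenization is plain whitespace splitting, the hasher is the standard library's `DefaultHasher`, and only two sources are compared.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// `len` tokens starting at `a_start` in source A equal the tokens
/// starting at `b_start` in source B. (Hypothetical type for this sketch.)
#[derive(Debug, PartialEq)]
struct CloneRange {
    a_start: usize,
    b_start: usize,
    len: usize,
}

/// Step 3, simplified: split on whitespace. Real tokenizers are language-aware.
fn tokenize(source: &str) -> Vec<&str> {
    source.split_whitespace().collect()
}

/// Step 4: hash one window of tokens.
fn window_hash(window: &[&str]) -> u64 {
    let mut hasher = DefaultHasher::new();
    window.hash(&mut hasher);
    hasher.finish()
}

/// Steps 4-5: index every window of `a`, probe with windows of `b`, and
/// grow each hit into the longest run of equal tokens.
fn find_clones(a: &[&str], b: &[&str], min_tokens: usize) -> Vec<CloneRange> {
    let mut clones = Vec::new();
    if min_tokens == 0 || a.len() < min_tokens || b.len() < min_tokens {
        return clones;
    }
    // Store: window hash -> first start position in `a`.
    let mut store: HashMap<u64, usize> = HashMap::new();
    for (start, window) in a.windows(min_tokens).enumerate() {
        store.entry(window_hash(window)).or_insert(start);
    }
    let mut b_start = 0;
    while b_start + min_tokens <= b.len() {
        let probe = &b[b_start..b_start + min_tokens];
        let hit = store
            .get(&window_hash(probe))
            .copied()
            // Guard against hash collisions by comparing the tokens.
            .filter(|&a_start| a[a_start..a_start + min_tokens] == *probe);
        match hit {
            Some(a_start) => {
                // Grow the match token by token while both sides agree.
                let mut len = min_tokens;
                while a_start + len < a.len()
                    && b_start + len < b.len()
                    && a[a_start + len] == b[b_start + len]
                {
                    len += 1;
                }
                clones.push(CloneRange { a_start, b_start, len });
                b_start += len;
            }
            None => b_start += 1,
        }
    }
    clones
}
```

Step 6 would then map each token range back to source lines and drop ranges shorter than `minLines`; that mapping is omitted here.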

## MVP Scope

The first Rust MVP intentionally implements the minimum vertical slice needed to
measure whether a Rust clone has enough performance upside to continue:

- CLI with common jscpd flags.
- Partial `.jscpd.json` support.
- File discovery with the Rust `ignore` crate for `.gitignore`.
- Upstream-synchronized extension/name format registry.
- Language-agnostic non-whitespace tokenizer.
- Numeric rolling window hashing and in-memory per-format store.
- Clone growth and `minLines` validation.
- Console, consoleFull, AI, JSON, CSV, Markdown, XML PMD CPD, SARIF, badge,
  HTML, Xcode, silent, and threshold reporters.
- Benchmark script against upstream on the same target path.

Known MVP gaps:

- Tokenization is not language-compatible with upstream yet.
- The upstream format registry is synchronized, but most long-tail formats still
  use generic tokenization rather than Prism-compatible tokenization.
- `strict/mild/weak` are still converging overall. `strict` now preserves
  whitespace tokens in the native JS/TS/Oxc path and the generic tokenizer;
  `weak` strips common comment spans for generic formats.
- Terminal timing/tips/progress/verbose behavior is partially aligned with
  upstream, including clone progress and detector event output.
- Blame data is populated from native `git blame -w`. Store options currently
  match the local upstream missing-store fallback. Dynamic external stores are
  not implemented yet.
- A native Rust API now exposes detection from configured paths and prepared
  in-memory sources. Exact upstream JavaScript package API compatibility remains
  follow-up work.
- A native `jscpd-server` binary exposes the first REST surface:
  `/`, `/api/health`, `/api/stats`, `/api/check`, `/api/recheck`, and `/mcp`.
  The MCP endpoint supports initialize/session handling, core tools, and the
  statistics resource. The current snippet check reuses prepared project token
  maps after `/api/recheck`; a dedicated indexed hybrid-store path can still be
  added if server-scale benchmarks show this is needed.
- `cache`, config `listeners`, and `tokensToSkip` are parsed for option-surface
  compatibility, but upstream currently does not consume them in runtime code.
- No full parity for non-native syntax-specific token streams yet.
- Markdown front matter and fenced code blocks are extracted into embedded
  format maps, with coverage parity on the current upstream fixture.

## Growth Plan

1. Compatibility harness: run upstream and Rust on shared fixtures, compare clone
   counts, locations, statistics, reports, and exit behavior.
2. CLI/config parity: harden remaining flags, config merging rules, exit codes,
   threshold behavior, and list output.
3. Tokenizer backend: replace the MVP tokenizer with maintained crates and
   language-aware token streams. Prefer existing parsers/tokenizers over custom
   grammars.
4. Reporters: polish remaining report details and terminal UX.
5. Advanced surfaces: full non-native tokenizer parity, dynamic external
   stores, dynamic external reporters, programmatic API/server parity, and
   stricter `strict`/`mild`/`weak` parity.
6. Performance work: parallel file reads/tokenization, compact hash storage,
   faster hashers where compatible, memory profiling, and optional external
   store backends.

## Release Candidate Gate

Before publishing a release candidate, run:

```bash
scripts/release-candidate.sh
```

This wraps the strict local lint check, the default release gate, the full
coverage matrix, and the public benchmark/coverage suite into one reproducible
pre-publication command. The defaults can still be overridden with environment
variables such as `PUBLIC_CASES`, `PUBLIC_RUNS`, `PUBLIC_MIN_SPEEDUP`, and
`STRICT`.

## Current Benchmark

Command:

```bash
FORMAT=typescript RUNS=5 scripts/bench.sh jscpd/packages
```

Result on this workspace:

- Rust MVP: `0.108s` average.
- Upstream `jscpd`: `0.818s` average.
- Both tools scanned the same 297 TypeScript files in this run.

Broader command:

```bash
RUNS=3 scripts/bench.sh jscpd/packages
```

Result on this workspace:

- Rust MVP: `0.130s` average.
- Upstream `jscpd`: `0.937s` average.
- This broader run is not fully apples-to-apples yet: the MVP supports fewer
  formats than upstream.

Initial signal: continuing makes sense, but the next milestone must measure
speed while closing tokenization/report compatibility gaps.

## Compatibility Gate

The project now uses a coverage-first compatibility rule for ongoing cloning
work:

- Rust must not miss duplicated lines reported by upstream `jscpd` for the same
  file, format, input, and options.
- Rust may report additional duplicates while compatibility is converging.
- Missing upstream line coverage is a blocking compatibility failure.
- Extra Rust duplicates are tracked as diagnostics and fixed when they represent
  likely false positives or user-visible report noise.
- Exact clone pair and fragment-boundary overlap is diagnostic only: when three
  or more equivalent fragments exist, upstream and Rust may choose different
  pairs or wider/split ranges while still covering the same duplicated lines.
- Exact 1:1 parity remains a useful quality metric, but it is not the default
  gate for deciding whether the clone is viable.

Use the compatibility harness with `STRICT=coverage` to enforce this rule.
Use `scripts/compat-matrix.sh` for the current JS/TS-focused release matrix.
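
The coverage-first rule amounts to a set comparison over duplicated lines. The sketch below shows one way to express it; it is not the harness's actual code. `Fragment`, `covered_lines`, `missing_coverage`, and `extra_coverage` are hypothetical names, and fragments are assumed to be inclusive 1-based line ranges.

```rust
use std::collections::BTreeSet;

/// One reported duplicate fragment: an inclusive line range in a file.
/// (Hypothetical shape; real reports carry more fields.)
struct Fragment<'a> {
    file: &'a str,
    start_line: u32,
    end_line: u32,
}

/// Expand fragments into the set of (file, line) pairs they cover, so that
/// wider, split, or differently paired ranges compare equal when they cover
/// the same lines.
fn covered_lines<'a>(fragments: &[Fragment<'a>]) -> BTreeSet<(&'a str, u32)> {
    let mut lines = BTreeSet::new();
    for fragment in fragments {
        for line in fragment.start_line..=fragment.end_line {
            lines.insert((fragment.file, line));
        }
    }
    lines
}

/// Lines upstream reported as duplicated that the Rust run missed.
/// A non-empty result is a blocking failure under the coverage rule.
fn missing_coverage<'a>(
    upstream: &[Fragment<'a>],
    rust: &[Fragment<'a>],
) -> Vec<(&'a str, u32)> {
    let upstream_lines = covered_lines(upstream);
    let rust_lines = covered_lines(rust);
    upstream_lines.difference(&rust_lines).copied().collect()
}

/// Lines only the Rust run reported; diagnostics, not failures.
fn extra_coverage<'a>(
    upstream: &[Fragment<'a>],
    rust: &[Fragment<'a>],
) -> Vec<(&'a str, u32)> {
    let upstream_lines = covered_lines(upstream);
    let rust_lines = covered_lines(rust);
    rust_lines.difference(&upstream_lines).copied().collect()
}
```

Comparing line sets rather than clone pairs is what lets the two tools pick different pairs among three or more equivalent fragments without failing the gate.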

## Format Coverage Strategy

The format registry is generated from upstream `@jscpd/tokenizer` using
`scripts/sync-formats.mjs`. This keeps extension detection and `--list` aligned
with upstream while avoiding hand-maintained mapping drift.

Tokenizer strategy remains hybrid:

- native Rust/Oxc path for hot JS/TS formats;
- native Rust block splitting for Markdown, Vue, Svelte, and Astro embedded
  code/style/template regions;
- generic tokenizer for other recognized formats without parity claims;
- no embedded JavaScript runtime fallback. Formats that need real compatibility
  should get native Rust tokenizers and focused compat tests.
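
The hybrid strategy can be pictured as a dispatch on the detected format. The following is only a sketch: the `Backend` enum, the `backend_for` function, and the format-name strings are illustrative assumptions, not the crate's actual registry identifiers or API.

```rust
/// Tokenizer backends named in the strategy above. (Hypothetical enum.)
#[derive(Debug, PartialEq)]
enum Backend {
    /// Native Rust/Oxc path for hot JS/TS formats.
    Oxc,
    /// Native block splitting for formats with embedded regions.
    BlockSplit,
    /// Generic tokenizer, with no parity claim.
    Generic,
}

/// Picks a backend for a format name. The strings below are assumed
/// examples; the real registry is generated from upstream.
fn backend_for(format: &str) -> Backend {
    match format {
        "javascript" | "typescript" | "jsx" | "tsx" => Backend::Oxc,
        "markdown" | "vue" | "svelte" | "astro" => Backend::BlockSplit,
        // Every other recognized format falls through to the generic
        // tokenizer; there is no JavaScript runtime fallback arm.
        _ => Backend::Generic,
    }
}
```

Promoting a long-tail format then means adding a native arm plus focused compat tests, rather than changing the fallback.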

## Accepted Hard-Feature Decisions

The current release decisions are canonicalized in
`docs/release-decisions.md`. The summary below mirrors that file for quick
orientation.

These choices are part of the current cloning direction until a compatibility
gate proves they are insufficient:

- Dynamic npm reporters, stores, and plugins are post-MVP. The first release
  should implement popular built-in reporters and stores natively instead of
  embedding or casually spawning JavaScript from Rust.
- Reporter compatibility should be strict for machine-readable contracts such as
  JSON, XML, SARIF, CSV, and Markdown. HTML should stay practically compatible,
  but pixel-perfect parity is not a release blocker.
- Tokenizer compatibility stays hybrid. Use maintained Rust crates or Oxc for
  hot formats where they materially help; keep long-tail formats on generic
  tokenization until a fixture or public-repo gate shows missed upstream
  coverage.
- Node/Commander quirks are mirrored only when they are user-visible and covered
  by compatibility tests. The project should not import a JavaScript runtime to
  reproduce every incidental JS behavior.
- Blame should stay native (`git blame`/git library based) and fail per file
  where possible instead of inheriting upstream nested-repo failure modes.
- Store/cache work should stay native and demand-driven. Custom external stores
  remain documented gaps unless a real large-repo benchmark proves they are
  needed for release viability.

## Larger Local Repo Benchmarks

All commands below used TypeScript-only scanning to keep the comparison focused:

```bash
FORMAT=typescript RUNS=3 scripts/bench.sh <repo>
```

These early exploratory timings used private local repositories. The repository
names and paths are intentionally omitted; repeatable public benchmarks are
tracked separately in `docs/public-benchmark-suite.md`.

| Repo | Rust MVP | Upstream `jscpd` | Files | Rust clones | Upstream clones |
| --- | ---: | ---: | ---: | ---: | ---: |
| Private repo A | `0.447s` | `1.930s` | 316 | 93 | 475 |
| Private repo B | `0.350s` | `1.877s` | 566 Rust / 572 upstream | 371 | 1371 |
| Private repo C | `0.010s` | `0.290s` | 28 | 12 | 41 |

Broader all-format stress on private repo B:

- Rust MVP, `RUNS=2`: `0.600s` average.
- Upstream `jscpd`: first run took `70.32s`, then the benchmark was stopped.
- This broader run is not a fair compatibility comparison yet because upstream
  supports far more formats and emitted a warning for `excel-formula`.

Conclusion remains positive: the Rust path is consistently faster on larger
repos, but the next milestone must prioritize tokenizer/discovery parity before
the speedup can be treated as product-quality.

## Acceleration Pass

The first MVP was still too conservative: it used MD5 strings for token/window
hashes and prepared files sequentially. The current hot path now uses:

- zero-copy detection tokens: no per-token `String` allocation in detection;
- `xxh3_128` token hashes;
- numeric rolling window hashes instead of MD5 over concatenated strings;
- `rustc_hash::FxHashMap` for the in-memory window store;
- parallel file reads and line counting;
- parallel per-file tokenization/window preparation;
- nanosecond-resolution benchmark timing in `scripts/bench.sh`.
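
To illustrate why numeric rolling window hashes are cheaper than hashing concatenated strings, here is a minimal polynomial rolling hash over per-token hashes. It is a sketch of the general technique, not the hash jscpd-rs uses: `BASE`, `direct_hash`, and `rolling_hashes` are hypothetical, token hashes are narrowed to `u64` for brevity (the list above mentions `xxh3_128`), and the multiplier is simply the 64-bit FNV prime picked as a convenient odd constant.

```rust
/// Multiplier for the polynomial hash. Assumption for this sketch only:
/// the 64-bit FNV prime, chosen as a convenient odd constant.
const BASE: u64 = 0x0000_0100_0000_01B3;

/// Hash of one window computed from scratch, with wrapping arithmetic:
/// w[0]*BASE^(n-1) + w[1]*BASE^(n-2) + ... + w[n-1].
fn direct_hash(window: &[u64]) -> u64 {
    window
        .iter()
        .fold(0u64, |acc, &h| acc.wrapping_mul(BASE).wrapping_add(h))
}

/// Hashes of every `size`-token window. After the first window, each hash
/// is derived from the previous one in constant time.
fn rolling_hashes(token_hashes: &[u64], size: usize) -> Vec<u64> {
    if size == 0 || token_hashes.len() < size {
        return Vec::new();
    }
    // BASE^(size-1): the weight of the token that leaves the window.
    let top_weight = (1..size).fold(1u64, |acc, _| acc.wrapping_mul(BASE));
    let mut current = direct_hash(&token_hashes[..size]);
    let mut out = Vec::with_capacity(token_hashes.len() - size + 1);
    out.push(current);
    for i in size..token_hashes.len() {
        let leaving = token_hashes[i - size];
        // Drop the oldest token, shift the rest up one power, add the newest.
        current = current
            .wrapping_sub(leaving.wrapping_mul(top_weight))
            .wrapping_mul(BASE)
            .wrapping_add(token_hashes[i]);
        out.push(current);
    }
    out
}
```

Each step touches two token hashes regardless of window size, whereas hashing a concatenated string re-reads every token in the window.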

Updated TypeScript-only benchmark:

| Repo | Rust MVP before | Rust MVP now | Upstream `jscpd` | Approx speedup vs upstream |
| --- | ---: | ---: | ---: | ---: |
| `jscpd/packages` | `0.108s` | `0.019s` | `0.856s` | ~45x |
| Private repo A | `0.447s` | `0.085s` | `2.028s` | ~24x |
| Private repo B | `0.350s` | `0.074s` | `1.955s` | ~26x |
| Private repo C | `0.010s` | `0.009s` | `0.305s` | ~33x |

This is the first speed signal that is strong enough to justify continuing,
provided compatibility can be raised without destroying the margin.
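
The speedup column is the upstream average divided by the Rust average for the same repository. A small sketch of that arithmetic, with hypothetical helper names `speedup` and `reduction_percent`, using the published (already rounded) averages as inputs:

```rust
/// Speedup factor of the Rust run relative to upstream on the same input.
fn speedup(upstream_secs: f64, rust_secs: f64) -> f64 {
    upstream_secs / rust_secs
}

/// Share of the earlier Rust wall time removed by a pass, as a percentage.
fn reduction_percent(before_secs: f64, after_secs: f64) -> f64 {
    (before_secs - after_secs) / before_secs * 100.0
}
```

For `jscpd/packages`, `speedup(0.856, 0.019)` is about 45, and `reduction_percent(0.108, 0.019)` is about 82. Because the inputs are rounded to milliseconds, ratios for very short runs are only approximate.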

## Core Stabilization Pass

The detector core was refactored again so later tokenizer/reporting work does
not have to revisit the detection hot path:

- introduced numeric `SourceId` and `FormatId` in the core;
- introduced `TokenStream` as the detector input contract;
- removed per-window `Frame` allocation;
- changed the store to keep only the first `Occurrence { source_id, token_start }`
  per window hash;
- switched to streaming rolling windows directly from token hashes;
- added verification of matching windows on hash hits;
- sharded detection by format, with format shards running in parallel.
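
A store that keeps only the first occurrence per window hash, combined with verification on hash hits, might look roughly like this. The field names `source_id` and `token_start` come from the list above; everything else is an assumption for illustration: the `u32`/`usize` field types, the `WindowStore` type, its `insert_or_get` method, the `verify` function, and the use of the standard `HashMap` in place of `FxHashMap`.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Position of a window: which source, and the index of its first token.
/// Field types are assumed for this sketch.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Occurrence {
    source_id: u32,
    token_start: usize,
}

/// Keeps only the first occurrence seen for each window hash.
#[derive(Default)]
struct WindowStore {
    first: HashMap<u64, Occurrence>,
}

impl WindowStore {
    /// Records `occurrence` if the hash is new and returns `None`;
    /// otherwise leaves the store unchanged and returns the earlier one.
    fn insert_or_get(&mut self, window_hash: u64, occurrence: Occurrence) -> Option<Occurrence> {
        match self.first.entry(window_hash) {
            Entry::Occupied(entry) => Some(*entry.get()),
            Entry::Vacant(entry) => {
                entry.insert(occurrence);
                None
            }
        }
    }
}

/// Confirms a hash hit by comparing the underlying token hashes, so a
/// window-hash collision is never reported as a clone. `sources` is
/// indexed by `source_id`; out-of-range windows do not verify.
fn verify(sources: &[Vec<u64>], a: Occurrence, b: Occurrence, size: usize) -> bool {
    let left = sources[a.source_id as usize].get(a.token_start..a.token_start + size);
    let right = sources[b.source_id as usize].get(b.token_start..b.token_start + size);
    left.is_some() && left == right
}
```

Storing a small `Copy` value per hash, instead of a list of every position, keeps the map compact; later equal windows are found by probing and then growing from the first occurrence.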

Updated TypeScript-only benchmark after this pass:

| Repo | Rust after hash pass | Rust after core pass | Upstream `jscpd` | Approx speedup vs upstream |
| --- | ---: | ---: | ---: | ---: |
| `jscpd/packages` | `0.019s` | `0.011s` | `0.821s` | ~75x |
| Private repo A | `0.085s` | `0.038s` | `2.041s` | ~54x |
| Private repo B | `0.074s` | `0.034s` | `1.939s` | ~57x |

Do not treat these as fixed release gates yet. Before publication, choose a
small set of popular public repositories and use them as the repeatable
benchmark/compatibility suite.