jscpd-rs 0.1.0

Files changed (96)
  1. package/CHANGELOG.md +69 -0
  2. package/Cargo.lock +1323 -0
  3. package/Cargo.toml +54 -0
  4. package/LICENSE +21 -0
  5. package/README.md +372 -0
  6. package/docs/api-parity.md +49 -0
  7. package/docs/cloning-plan.md +281 -0
  8. package/docs/compat-baseline.md +535 -0
  9. package/docs/format-porting.md +86 -0
  10. package/docs/junior-task-template.md +62 -0
  11. package/docs/junior-workflow.md +87 -0
  12. package/docs/migrating-from-jscpd.md +193 -0
  13. package/docs/npm-release.md +116 -0
  14. package/docs/public-benchmark-suite.md +81 -0
  15. package/docs/release-checklist.md +200 -0
  16. package/docs/release-decisions.md +103 -0
  17. package/docs/release-readiness.md +51 -0
  18. package/docs/upstream-bugs.md +501 -0
  19. package/docs/upstream-issue-drafts.md +393 -0
  20. package/docs/user-guide.md +309 -0
  21. package/examples/dump_oxc_tokens.rs +112 -0
  22. package/examples/library_api.rs +42 -0
  23. package/npm/bin/jscpd-rs.js +6 -0
  24. package/npm/bin/jscpd-server.js +6 -0
  25. package/npm/lib/run-binary.js +68 -0
  26. package/npm/scripts/postinstall.js +50 -0
  27. package/package.json +53 -0
  28. package/skills/dry-refactoring/SKILL.md +63 -0
  29. package/skills/jscpd/SKILL.md +85 -0
  30. package/src/app.rs +512 -0
  31. package/src/bin/jscpd-server.rs +429 -0
  32. package/src/blame.rs +130 -0
  33. package/src/cli/config.rs +543 -0
  34. package/src/cli/parsing.rs +301 -0
  35. package/src/cli/tests.rs +543 -0
  36. package/src/cli.rs +671 -0
  37. package/src/detector/matching/secondary.rs +387 -0
  38. package/src/detector/matching.rs +274 -0
  39. package/src/detector/model.rs +190 -0
  40. package/src/detector/prepare.rs +71 -0
  41. package/src/detector/skip_local.rs +40 -0
  42. package/src/detector/statistics.rs +138 -0
  43. package/src/detector/store.rs +96 -0
  44. package/src/detector/tests.rs +238 -0
  45. package/src/detector.rs +265 -0
  46. package/src/files/discovery.rs +508 -0
  47. package/src/files/gitignore.rs +203 -0
  48. package/src/files/paths.rs +68 -0
  49. package/src/files/shebang.rs +106 -0
  50. package/src/files/tests.rs +523 -0
  51. package/src/files.rs +25 -0
  52. package/src/formats.rs +570 -0
  53. package/src/lib.rs +433 -0
  54. package/src/main.rs +26 -0
  55. package/src/report/ai.rs +125 -0
  56. package/src/report/badge.rs +238 -0
  57. package/src/report/console.rs +180 -0
  58. package/src/report/console_common.rs +37 -0
  59. package/src/report/console_full.rs +139 -0
  60. package/src/report/csv.rs +65 -0
  61. package/src/report/escape.rs +8 -0
  62. package/src/report/file_output.rs +28 -0
  63. package/src/report/html/assets.rs +47 -0
  64. package/src/report/html.rs +336 -0
  65. package/src/report/json.rs +119 -0
  66. package/src/report/markdown.rs +125 -0
  67. package/src/report/sarif.rs +302 -0
  68. package/src/report/silent.rs +22 -0
  69. package/src/report/source.rs +38 -0
  70. package/src/report/summary.rs +50 -0
  71. package/src/report/test_support.rs +133 -0
  72. package/src/report/threshold.rs +76 -0
  73. package/src/report/xcode.rs +90 -0
  74. package/src/report/xml.rs +119 -0
  75. package/src/report.rs +250 -0
  76. package/src/server/mcp.rs +942 -0
  77. package/src/server.rs +1081 -0
  78. package/src/tokenizer/apex.rs +97 -0
  79. package/src/tokenizer/blocks.rs +532 -0
  80. package/src/tokenizer/embedded.rs +106 -0
  81. package/src/tokenizer/generic.rs +511 -0
  82. package/src/tokenizer/hash.rs +27 -0
  83. package/src/tokenizer/ignore.rs +33 -0
  84. package/src/tokenizer/line_index.rs +33 -0
  85. package/src/tokenizer/markdown.rs +289 -0
  86. package/src/tokenizer/markup_attrs.rs +289 -0
  87. package/src/tokenizer/oxc/fallback.rs +275 -0
  88. package/src/tokenizer/oxc/jsx.rs +168 -0
  89. package/src/tokenizer/oxc/kind.rs +177 -0
  90. package/src/tokenizer/oxc/lexical.rs +67 -0
  91. package/src/tokenizer/oxc.rs +659 -0
  92. package/src/tokenizer/scan.rs +88 -0
  93. package/src/tokenizer/tap.rs +150 -0
  94. package/src/tokenizer/tests.rs +915 -0
  95. package/src/tokenizer.rs +328 -0
  96. package/src/verbose.rs +195 -0
@@ -0,0 +1,535 @@
1
+ # Compatibility Baseline
2
+
3
+ Baseline date: 2026-05-31.
4
+
5
+ Latest full release gate:
6
+ `FULL=1 PUBLIC=1 scripts/release-gate.sh`
7
+ passed on 2026-05-31 at code commit `8c3da0e` as part of
8
+ `scripts/prepublish-check.sh`.
9
+
10
+ Latest public release gate:
11
+ `PUBLIC=1 PUBLIC_RUNS=3 scripts/release-gate.sh`
12
+ passed on 2026-05-31 at code commit `8c3da0e` as part of
13
+ `scripts/prepublish-check.sh`.
14
+
15
+ Default gate:
16
+
17
+ ```bash
18
+ STRICT=coverage scripts/compat-matrix.sh
19
+ ```
20
+
21
+ Coverage means every upstream duplicated line must be covered by the Rust report
22
+ for the same file, and Rust must not report fewer clones. Exact clone starts,
23
+ formats, fragment boundaries, source totals, line totals, and pair ordering are
24
+ diagnostic only because Rust may find a wider or split equivalent range while
25
+ compatibility is converging.
26
+
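The coverage rule above can be sketched in a few lines of Rust. This is only an illustration of the rule as stated, not the logic of `scripts/compat-matrix.sh`: the `Report` struct, the `report` helper, and `coverage_passes` are hypothetical names, and a report is reduced here to a per-file set of duplicated line numbers plus a clone count.

```rust
use std::collections::{BTreeSet, HashMap};
use std::ops::RangeInclusive;

/// Hypothetical, simplified report: duplicated line numbers per file plus
/// the total clone count. Not the real jscpd JSON report shape.
struct Report {
    duplicated_lines: HashMap<String, BTreeSet<u32>>,
    clone_count: usize,
}

/// Builds a `Report` from (file, duplicated line range) pairs.
fn report(entries: &[(&str, RangeInclusive<u32>)], clone_count: usize) -> Report {
    let mut duplicated_lines: HashMap<String, BTreeSet<u32>> = HashMap::new();
    for (file, range) in entries {
        duplicated_lines
            .entry(file.to_string())
            .or_default()
            .extend(range.clone());
    }
    Report { duplicated_lines, clone_count }
}

/// Returns true when `rust` satisfies the coverage rule against `upstream`.
fn coverage_passes(upstream: &Report, rust: &Report) -> bool {
    // Rust must not report fewer clones than upstream.
    if rust.clone_count < upstream.clone_count {
        return false;
    }
    // Every upstream duplicated line must be covered for the same file.
    upstream.duplicated_lines.iter().all(|(file, lines)| {
        rust.duplicated_lines
            .get(file)
            .map_or(lines.is_empty(), |covered| lines.is_subset(covered))
    })
}
```

Under this sketch, an upstream clone on lines 10-20 of a file is covered by a Rust clone on lines 8-25 of the same file, which is why a wider range still passes while exact starts stay diagnostic.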
27
+ Reporter gate:
28
+
29
+ ```bash
30
+ scripts/compat-reporters.sh
31
+ ```
32
+
33
+ This smoke check runs Rust and upstream with
34
+ `json,csv,markdown,xml,sarif,badge,html`, verifies the expected report files,
35
+ parses JSON/SARIF payloads, checks stable artifact contracts, and compares the
36
+ root JSON report with the default coverage rule. Stable artifact checks include
37
+ CSV/Markdown line and clone summary columns, the upstream Markdown heading
38
+ prefix, exact XML output for the fixture, SARIF structure with normalized
39
+ paths, badge title/aria text, HTML report text and clone summaries, and
40
+ equality between each HTML JSON payload and its root JSON report. The aggregate
41
+ release gate also runs this reporter check against a no-duplicates JavaScript
42
+ fixture so empty JSON/CSV/Markdown/XML/SARIF/badge/HTML reports stay covered.
43
+
44
+ CLI gate:
45
+
46
+ ```bash
47
+ scripts/compat-cli.sh
48
+ ```
49
+
50
+ This smoke check compares Rust and upstream exit codes plus stable terminal
51
+ contracts for `--help`, `--version`, `--list`, `--debug`, `--exitCode`,
52
+ `--threshold`, invalid `--mode`, bare `--config`, `--store`, `--store-path`,
53
+ bare optional string flag crashes, `--formats-exts`, `--formats-names`,
54
+ malformed `--formats-exts`/`--formats-names` mappings, `--ignore-pattern`,
55
+ `--ignoreCase`, unknown reporters, explicit `time`
56
+ reporter fallback, terminal footer/tips, `xcode`, `ai`, `consoleFull`, and
57
+ `--verbose`.
58
+ The debug checks include cwd `.gitignore` expansion in the printed `ignore`
59
+ option and user-order preservation for explicit `--format` lists.
60
+
61
+ Config gate:
62
+
63
+ ```bash
64
+ scripts/compat-config.sh
65
+ ```
66
+
67
+ This smoke check runs both implementations from real `.jscpd.json` and
68
+ `package.json#jscpd` configs, including relative `path`, config `output`,
69
+ `silent`, JSON reporter setup, `exitCode`, and order-sensitive `formatsExts`
70
+ object mappings. It also verifies explicit `--config` files outside `cwd`,
71
+ `formatsNames` mappings for extensionless filenames,
72
+ `reportersOptions.badge` path/subject/status/color overrides, debug
73
+ option-surface preservation for `config`, `cache`, `listeners`, and
74
+ `tokensToSkip`, upstream-coerced string numeric config values for `minLines`,
75
+ `maxLines`, and `threshold`, and checks that
76
+ malformed `package.json` files emit a warning and do not prevent detection from
77
+ continuing. Malformed `.jscpd.json` files are checked separately: both
78
+ implementations fail before detection with an upstream-style `SyntaxError`
79
+ printed to stdout. Symlinked explicit config files are also checked so
80
+ `config`, relative `path`, and relative `ignore` resolution follow the symlink
81
+ location rather than the real target path.
82
+
83
+ Blame gate:
84
+
85
+ ```bash
86
+ scripts/compat-blame.sh
87
+ ```
88
+
89
+ This smoke check creates a temporary Git repository, commits a duplicated pair,
90
+ runs both implementations with `--blame --reporters json`, verifies that both
91
+ JSON reports include matching blame data on both duplicate fragments, and then
92
+ compares the reports with the default coverage rule.
93
+
94
+ Server gate:
95
+
96
+ ```bash
97
+ scripts/compat-server.sh
98
+ ```
99
+
100
+ This smoke check compares the native `jscpd-server` binary with upstream
101
+ `apps/jscpd-server`. It verifies exact server `--help` output, invalid or bare
102
+ `--port`, bare common optional flag error shapes, missing-store warning
103
+ fallback, bare and explicit `--host` startup output, rejection of main-CLI-only
104
+ options that the upstream server does not accept, and config-only `workingDirectory`
105
+ semantics. It then starts both servers on local ports and checks the root API info,
106
+ `/api/health`, `/api/stats`, JSON and urlencoded `/api/check`,
107
+ empty/missing/non-string field validation, large and special-character
108
+ snippets, JSON content-type headers, JSON syntax errors, upstream-style JSON
109
+ 404 responses for missing routes and wrong API methods, MCP
110
+ initialize/session handling, `tools/list`,
111
+ `resources/list`, `get_statistics`, `check_duplication` with `recheck`,
112
+ `check_current_directory`, `jscpd://statistics`, repeated snippet isolation,
113
+ and `GET /mcp` method rejection. It also checks upstream-style MCP UUID-v4
114
+ session IDs, `Content-Type` rejection,
115
+ `DELETE /mcp` and `OPTIONS /mcp` JSON 404 responses, plus JSON-RPC
116
+ single-request and multi-request batch handling. Stable MCP SDK-shaped
117
+ responses for `initialize`, `tools/list`, `resources/list`, and batch
118
+ list/resource requests are compared exactly against upstream, with only the
119
+ package version normalized.
120
+
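As an illustration of the UUID-v4 session ID check mentioned above, the following Rust sketch validates the canonical textual form of a version-4 UUID. The function name `is_uuid_v4` is hypothetical, and how `scripts/compat-server.sh` actually performs this check is not shown in this document; the sketch assumes the RFC 4122 layout of version nibble `4` and variant nibble `8`, `9`, `a`, or `b`.

```rust
/// Returns true if `id` is a canonical hyphenated version-4 UUID string.
/// Hypothetical helper; assumes the RFC 4122 textual layout.
fn is_uuid_v4(id: &str) -> bool {
    let bytes = id.as_bytes();
    // Canonical form is 32 hex digits plus 4 hyphens.
    if bytes.len() != 36 {
        return false;
    }
    bytes.iter().enumerate().all(|(i, &b)| match i {
        // Hyphen positions in 8-4-4-4-12 grouping.
        8 | 13 | 18 | 23 => b == b'-',
        // Version nibble must be 4.
        14 => b == b'4',
        // Variant nibble must be 8, 9, a, or b.
        19 => matches!(b.to_ascii_lowercase(), b'8' | b'9' | b'a' | b'b'),
        // Everything else is a hex digit.
        _ => b.is_ascii_hexdigit(),
    })
}
```

A check like this only constrains the shape of the session ID, which is what lets two implementations that generate different random IDs still be compared.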
121
+ Package/install gate:
122
+
123
+ ```bash
124
+ scripts/package-check.sh
125
+ ```
126
+
127
+ This release-surface check verifies the crate package file list, rejects
128
+ accidental publication of the upstream `jscpd/` submodule, `target/`,
129
+ `node_modules`, and internal scripts, runs `cargo package --locked`, installs
130
+ the `jscpd` and `jscpd-server` binaries into a temporary Cargo root with
131
+ `cargo install --bins`, and checks the installed binaries' versions and the CLI
132
+ binary's upstream-compatible command name.
133
+
134
+ Native API smoke tests are covered by the Rust test suite. They verify the
135
+ path-based detector API, in-memory source API, upstream singular
136
+ `detectClonesAndStatistic` spelling, default options, supported format registry,
137
+ and default/custom format lookup helpers.
138
+
139
+ Upstream CI fixture gate:
140
+
141
+ ```bash
142
+ scripts/compat-upstream-ci.sh
143
+ ```
144
+
145
+ This mirrors upstream's CI smoke command, `jscpd ./fixtures`, with the upstream
146
+ defaults that matter for detection (`minTokens=50`, `minLines=5`,
147
+ `maxSize=100kb`). It uses the coverage-first comparison, so Rust may report
148
+ additional clones but must cover every upstream duplicated line.
149
+
150
+ Aggregate gate:
151
+
152
+ ```bash
153
+ scripts/release-gate.sh
154
+ FULL=1 scripts/release-gate.sh
155
+ PUBLIC=1 scripts/release-gate.sh
156
+ ```
157
+
158
+ The default run covers formatting, unit tests, shell syntax, package/install
159
+ verification, and fast CLI/config/reporter/blame/server compatibility checks.
160
+ `FULL=1` also runs the full coverage-first compatibility matrix. `PUBLIC=1`
161
+ runs the project-owned public benchmark suite with coverage compatibility
162
+ enabled, using `PUBLIC_CASES`, `PUBLIC_RUNS`, `PUBLIC_CHECK_COMPAT`, and
163
+ `PUBLIC_MIN_SPEEDUP` to override its defaults.
164
+ `FULL=1 PUBLIC=1 scripts/release-gate.sh` is required before publication.
165
+
166
+ Release candidate gate:
167
+
168
+ ```bash
169
+ scripts/release-candidate.sh
170
+ ```
171
+
172
+ This is the pre-publication gate: it runs
173
+ `cargo clippy --all-targets -- -D warnings`, the default release gate, the full
174
+ compatibility matrix with `STRICT=coverage`, and the public benchmark/coverage
175
+ suite with three timing runs on the default public cases.
176
+ The GitHub Actions workflow exposes the same path through the
177
+ `release_candidate` manual dispatch input.
178
+
179
+ CI gate:
180
+
181
+ ```bash
182
+ .github/workflows/release-gate.yml
183
+ ```
184
+
185
+ The GitHub Actions workflow checks out the upstream submodule, installs Rust
186
+ and Node, restores Cargo/pnpm/upstream-build caches, and runs the default
187
+ release gate on pushes and pull requests. The gate prints per-step timings so
188
+ CI regressions are visible in logs. Default push/PR CI uses the already-built
189
+ Cargo target for the npm package smoke; release-candidate and prepublish gates
190
+ still run the cold npm source-build path. Manual workflow dispatch exposes
191
+ `full`, `public`, `release_candidate`, and `public_runs` inputs for the
192
+ pre-release full matrix, public benchmark, and release-candidate gates.
193
+
194
+ Latest local prepublish check: `scripts/prepublish-check.sh` passed on
195
+ 2026-05-31 at code commit `8c3da0e`, covering
196
+ `cargo clippy --all-targets -- -D warnings`, the default release gate, the full
197
+ coverage matrix, the public benchmark/coverage suite, package/install
198
+ verification, crate/tag availability checks, npm package/name/npx verification,
199
+ and `cargo publish --dry-run --locked`.
200
+
201
+ Documentation-only updates after `8c3da0e` may reuse the release-candidate
202
+ evidence if they do not change code, scripts, package metadata, or benchmark
203
+ configuration. Rerun `RUN_RELEASE_CANDIDATE=0 scripts/prepublish-check.sh`
204
+ after documentation edits so package/dry-run evidence matches the exact package
205
+ contents being tagged.
206
+
207
+ Latest GitHub Actions default release-gate check:
208
+ `push` passed on 2026-05-31 at code commit `8c3da0e`:
209
+ https://github.com/vv-bogdanov/jscpd-rs/actions/runs/26710762680
210
+
211
+ Recorded public benchmark baseline:
212
+
213
+ | Case | Commit | Format | Rust avg | Upstream avg | Speedup | Compat |
214
+ | --- | --- | --- | ---: | ---: | ---: | --- |
215
+ | `react` | `f0dfee3` | `javascript` | 0.199097s | 10.079214s | 50.62x | pass |
216
+ | `next` | `2bbb67b9` | `typescript` | 0.262433s | 14.715736s | 56.07x | pass |
217
+ | `prometheus` | `a0524ee` | `go` | 0.085239s | 4.642435s | 54.46x | pass |
218
+
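The Speedup column is consistent with dividing the upstream average by the Rust average. A minimal Rust sketch of that arithmetic follows; `speedup` and `format_speedup` are hypothetical names, and the benchmark scripts may compute or round the value differently.

```rust
/// Speedup as upstream average time divided by Rust average time.
/// Hypothetical helper; both arguments are in seconds.
fn speedup(rust_avg_secs: f64, upstream_avg_secs: f64) -> f64 {
    upstream_avg_secs / rust_avg_secs
}

/// Formats a speedup the way the table shows it, for example "50.62x".
fn format_speedup(value: f64) -> String {
    format!("{:.2}x", value)
}
```

For the `react` row, `format_speedup(speedup(0.199097, 10.079214))` yields `50.62x`, matching the recorded value.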
219
+ ## Current Matrix
220
+
221
+ | Target | Format | Gate | Notes |
222
+ | --- | --- | --- | --- |
223
+ | `jscpd/fixtures` | `javascript` | pass | exact summary parity |
224
+ | `jscpd/fixtures` | `typescript` | pass | exact summary parity |
225
+ | `jscpd/fixtures/javascript` | `json` | pass | exact clone and line summary parity |
226
+ | `jscpd/fixtures` | auto, upstream CI defaults | pass | 422/422 upstream fragments line-covered; Rust reports a few extra generic/SFC ranges |
227
+ | `jscpd/fixtures/custom` | auto + `--formats-exts c:ccc,cc1` | pass | exact clone and line summary parity |
228
+ | `jscpd/fixtures/ignore` | auto | pass | clone-summary gate; inline `style` attributes produce upstream-compatible CSS source buckets; ignored blocks produce 0 clones |
229
+ | `jscpd/fixtures/ignore-pattern` | auto + `--ignore-pattern` | pass | exact clone and line summary parity |
230
+ | `jscpd/fixtures/ignore-case` | auto | pass | clone-summary gate; no clones without `--ignoreCase` |
231
+ | `jscpd/fixtures/ignore-case` | auto + `--ignoreCase` | pass | clone-summary gate; 1 clone with case folding |
232
+ | `jscpd/fixtures/one-file/one-file.js` | auto | pass | exact summary parity for intra-file clones |
233
+ | `jscpd/fixtures/folder1` + `jscpd/fixtures/folder2` | auto | pass | exact clone and line summary parity without `--skipLocal` |
234
+ | `jscpd/fixtures/folder1` + `jscpd/fixtures/folder2` | auto + `--skipLocal` | pass | exact clone and line summary parity with local clones skipped |
235
+ | `jscpd/fixtures/mixed-formats` | auto | pass | upstream JS-in-HTML clone line-covered; Rust reports a wider cross-file JS range |
236
+ | `jscpd/fixtures/shebang` | auto | pass | exact clone and line summary parity for extensionless bash/python shebang files |
237
+ | `jscpd/fixtures/javascript` | `javascript` / `strict` | pass | exact clone and line summary parity; token totals differ |
238
+ | `jscpd/fixtures` | `typescript` / `strict` | pass | exact clone and line summary parity; token totals differ |
239
+ | `jscpd/fixtures/javascript` | `javascript` / `weak` | pass | clone and line summary parity; token totals differ slightly |
240
+ | `jscpd/fixtures` | `jsx` | pass | exact clone and line summary parity; token totals differ slightly |
241
+ | `jscpd/fixtures` | `tsx` | pass | exact clone and line summary parity; token totals differ slightly |
242
+ | `jscpd/fixtures/markdown` | `markdown` | pass | exact clone/start and duplicated-line parity; source line and token totals differ |
243
+ | `jscpd/fixtures` | `vue` | pass | exact upstream fragment/start coverage; Rust still reports duplicate extra script/template clones |
244
+ | `jscpd/fixtures` | `svelte` | pass | 6/6 upstream fragments line-covered; exact start differs for wider css range |
245
+ | `jscpd/fixtures` | `astro` | pass | exact upstream fragment/start coverage; Rust still reports duplicate extra embedded clones |
246
+ | `jscpd/fixtures/pug` | `pug` | pass | exact clone and line summary parity; upstream overextended `style.` range is mirrored |
247
+ | `jscpd/fixtures/haml` | `haml` | pass | exact clone and line summary parity; upstream overextended silent-comment range is mirrored |
248
+ | `jscpd/fixtures/css` | `css` | pass | exact clone coverage; token totals differ |
249
+ | `jscpd/fixtures/css` | `less` | pass | exact clone and line summary parity |
250
+ | `jscpd/fixtures/css` | `scss` | pass | exact clone and line summary parity |
251
+ | `jscpd/fixtures/python` | `python` | pass | 2/2 upstream fragments line-covered |
252
+ | `jscpd/fixtures/go` | `go` | pass | 2/2 upstream fragments line-covered |
253
+ | `jscpd/fixtures/ruby` | `ruby` | pass | 2/2 upstream fragments line-covered |
254
+ | `jscpd/fixtures/php` | `php` | pass | 2/2 upstream fragments line-covered |
255
+ | `jscpd/fixtures/yaml` | `yaml` | pass | 2/2 upstream fragments line-covered |
256
+ | `jscpd/fixtures/sql` | `sql` | pass | 2/2 upstream fragments line-covered |
257
+ | `jscpd/fixtures/toml` | `toml` | pass | 2/2 upstream fragments line-covered |
258
+ | `jscpd/fixtures/shell` | `bash` | pass | 2/2 upstream fragments line-covered |
259
+ | `jscpd/fixtures/swift` | `swift` | pass | 2/2 upstream fragments line-covered |
260
+ | `jscpd/fixtures/powershell` | `powershell` | pass | 2/2 upstream fragments line-covered |
261
+ | `jscpd/fixtures/lua` | `lua` | pass | 2/2 upstream fragments line-covered |
262
+ | `jscpd/fixtures/haskell` | `haskell` | pass | 4/4 upstream fragments line-covered |
263
+ | `jscpd/fixtures/haskell-literate` | `haskell` | pass | exact clone and line summary parity |
264
+ | `jscpd/fixtures/clojure` | `clojure` | pass | 2/2 upstream fragments line-covered |
265
+ | `jscpd/fixtures/sass` | `sass` | pass | 6/6 upstream fragments line-covered |
266
+ | `jscpd/fixtures/stylus` | `stylus` | pass | 2/2 upstream fragments line-covered |
267
+ | `jscpd/fixtures/rust` | `rust` | pass | exact summary parity; 76/76 upstream fragments line-covered |
268
+ | `jscpd/fixtures/dart` | `dart` | pass | exact summary parity; 4/4 upstream fragments line-covered |
269
+ | `jscpd/fixtures/solidity` | `solidity` | pass | 4/4 upstream fragments line-covered; Rust reports one extra clone |
270
+ | `jscpd/fixtures/perl` | `perl` | pass | exact summary parity; 8/8 upstream fragments line-covered |
271
+ | `jscpd/fixtures/commonlisp` | `lisp` | pass | exact clone and line summary parity |
272
+ | `jscpd/fixtures/mllike` | `ocaml` | pass | exact clone and line summary parity |
273
+ | `jscpd/fixtures/mllike` | `fsharp` | pass | exact clone and line summary parity |
274
+ | `jscpd/fixtures/objective-c` | `objectivec` | pass | exact clone and line summary parity |
275
+ | `jscpd/fixtures/clike` | `c` | pass | 4/4 upstream fragments line-covered |
276
+ | `jscpd/fixtures/z80` | `c` | pass | exact clone and line summary parity |
277
+ | `jscpd/fixtures/clike` | `cpp` | pass | 4/4 upstream fragments line-covered |
278
+ | `jscpd/fixtures/clike` | `c-header` | pass | exact clone and line summary parity |
279
+ | `jscpd/fixtures/clike` | `cpp-header` | pass | exact clone and line summary parity |
280
+ | `jscpd/fixtures/clike` | `java` | pass | 4/4 upstream fragments line-covered |
281
+ | `jscpd/fixtures/clike` | `csharp` | pass | 4/4 upstream fragments line-covered |
282
+ | `jscpd/fixtures/clike` | `kotlin` | pass | 4/4 upstream fragments line-covered |
283
+ | `jscpd/fixtures/clike` | `scala` | pass | 2/2 upstream fragments line-covered |
284
+ | `jscpd/fixtures/groovy` | `groovy` | pass | 2/2 upstream fragments line-covered |
285
+ | `jscpd/fixtures/actionscript` | `actionscript` | pass | 2/2 upstream fragments line-covered |
286
+ | `jscpd/fixtures/awk` | `awk` | pass | 2/2 upstream fragments line-covered |
287
+ | `jscpd/fixtures/basic` | `basic` | pass | 2/2 upstream fragments line-covered |
288
+ | `jscpd/fixtures/coffeescript` | `coffeescript` | pass | 4/4 upstream fragments line-covered |
289
+ | `jscpd/fixtures/crystal` | `crystal` | pass | 2/2 upstream fragments line-covered |
290
+ | `jscpd/fixtures/d` | `d` | pass | 2/2 upstream fragments line-covered |
291
+ | `jscpd/fixtures/elm` | `elm` | pass | 4/4 upstream fragments line-covered |
292
+ | `jscpd/fixtures/erlang` | `erlang` | pass | 2/2 upstream fragments line-covered |
293
+ | `jscpd/fixtures/fortran` | `fortran` | pass | 2/2 upstream fragments line-covered |
294
+ | `jscpd/fixtures/gdscript` | `gdscript` | pass | 4/4 upstream fragments line-covered |
295
+ | `jscpd/fixtures/graphql` | `graphql` | pass | 4/4 upstream fragments line-covered |
296
+ | `jscpd/fixtures/julia` | `julia` | pass | 2/2 upstream fragments line-covered |
297
+ | `jscpd/fixtures/protobuf` | `protobuf` | pass | 2/2 upstream fragments line-covered |
298
+ | `jscpd/fixtures/ada` | `ada` | pass | exact summary parity; 6/6 upstream fragments line-covered |
299
+ | `jscpd/fixtures/apex` | `apex` | pass | exact summary parity; includes embedded SOQL as `sql` |
300
+ | `jscpd/fixtures/haxe` | `haxe` | pass | exact summary parity; 8/8 upstream fragments line-covered |
301
+ | `jscpd/fixtures/r` | `r` | pass | exact summary parity; 4/4 upstream fragments line-covered |
302
+ | `jscpd/fixtures/csv` | `csv` | pass | 2/2 upstream fragments line-covered |
303
+ | `jscpd/fixtures/diff` | `diff` | pass | 2/2 upstream fragments line-covered |
304
+ | `jscpd/fixtures/cmake` | `cmake` | pass | 2/2 upstream fragments line-covered |
305
+ | `jscpd/fixtures/hcl` | `hcl` | pass | 2/2 upstream fragments line-covered |
306
+ | `jscpd/fixtures/gitignore` | `ignore` | pass | exact clone and line summary parity |
307
+ | `jscpd/fixtures/json5` | `json5` | pass | 2/2 upstream fragments line-covered |
308
+ | `jscpd/fixtures/latex` | `latex` | pass | 2/2 upstream fragments line-covered |
309
+ | `jscpd/fixtures/puppet` | `puppet` | pass | 4/4 upstream fragments line-covered |
310
+ | `jscpd/fixtures/qsharp` | `qsharp` | pass | 2/2 upstream fragments line-covered |
311
+ | `jscpd/fixtures/racket` | `racket` | pass | 2/2 upstream fragments line-covered |
312
+ | `jscpd/fixtures/sas` | `sas` | pass | 2/2 upstream fragments line-covered |
313
+ | `jscpd/fixtures/scheme` | `scheme` | pass | 2/2 upstream fragments line-covered |
314
+ | `jscpd/fixtures/vhdl` | `vhdl` | pass | 4/4 upstream fragments line-covered |
315
+ | `jscpd/fixtures/xquery` | `xquery` | pass | 2/2 upstream fragments line-covered |
316
+ | `jscpd/fixtures/verilog` | `verilog` | pass | 4/4 upstream fragments line-covered |
317
+ | `jscpd/fixtures/wgsl` | `wgsl` | pass | 4/4 upstream fragments line-covered |
318
+ | `jscpd/fixtures/zig` | `zig` | pass | 4/4 upstream fragments line-covered |
319
+ | `jscpd/fixtures/tcl` | `tcl` | pass | 4/4 upstream fragments line-covered |
320
+ | `jscpd/fixtures/turtle` | `turtle` | pass | 4/4 upstream fragments line-covered |
321
+ | `jscpd/fixtures/twig` | `twig` | pass | exact upstream fragment/start and line summary parity; token totals differ slightly |
322
+ | `jscpd/fixtures/properties` | `properties` | pass | exact clone and line summary parity |
323
+ | `jscpd/fixtures/properties` | `ini` | pass | exact clone and line summary parity |
324
+ | `jscpd/fixtures/xml` | `markup` | pass | 6/6 upstream fragments line-covered; Rust skips empty XML/XSD inputs |
325
+ | `jscpd/fixtures/htmlmixed` | `markup` | pass | exact clone and line summary parity; upstream also reports embedded script/style sources |
326
+ | `jscpd/fixtures/htmlembedded` | `aspnet` | pass | 9/10 upstream fragments line-covered; one documented upstream range overextends through an inserted email block |
327
+ | `jscpd/fixtures/vb` | `vbnet` | pass | exact clone and line summary parity |
328
+ | `jscpd/fixtures/text` | `txt` | pass | exact clone and line summary parity |
329
+ | `jscpd/fixtures/robotframework` | `robotframework` | pass | 4/4 upstream fragments line-covered; upstream reports final newline as one-past-content |
330
+ | `jscpd/fixtures/tap` | `tap` | pass | exact clone and line summary parity for embedded YAML diagnostics |
331
+ | `jscpd/fixtures/textile` | `textile` | pass | exact clone summary parity |
332
+ | `jscpd/fixtures/antlr4` | `antlr4` | pass | 2/2 upstream fragments line-covered |
333
+ | `jscpd/fixtures/apl` | `apl` | pass | 2/2 upstream fragments line-covered |
334
+ | `jscpd/fixtures/bicep` | `bicep` | pass | 2/2 upstream fragments line-covered |
335
+ | `jscpd/fixtures/brainfuck` | `brainfuck` | pass | 8/8 upstream fragments line-covered |
336
+ | `jscpd/fixtures/cfml` | `cfml` | pass | exact clone and line summary parity |
337
+ | `jscpd/fixtures/cfscript` | `cfscript` | pass | exact clone and line summary parity |
338
+ | `jscpd/fixtures/dot` | `dot` | pass | 2/2 upstream fragments line-covered |
339
+ | `jscpd/fixtures/eiffel` | `eiffel` | pass | exact clone and line summary parity |
340
+ | `jscpd/fixtures/gettext` | `gettext` | pass | 2/2 upstream fragments line-covered; Rust reports extra covered ranges |
341
+ | `jscpd/fixtures/gherkin` | `gherkin` | pass | 2/2 upstream fragments line-covered |
342
+ | `jscpd/fixtures/handlebars` | `handlebars` | pass | 2/2 upstream fragments line-covered |
343
+ | `jscpd/fixtures/idris` | `idris` | pass | 4/4 upstream fragments line-covered |
344
+ | `jscpd/fixtures/lilypond` | `lilypond` | pass | 6/6 upstream fragments line-covered |
345
+ | `jscpd/fixtures/livescript` | `livescript` | pass | 2/2 upstream fragments line-covered |
346
+ | `jscpd/fixtures/linker-script` | `linker-script` | pass | exact clone and line summary parity |
347
+ | `jscpd/fixtures/llvm` | `llvm` | pass | 2/2 upstream fragments line-covered |
348
+ | `jscpd/fixtures/log` | `log` | pass | 2/2 upstream fragments line-covered |
349
+ | `jscpd/fixtures/nsis` | `nsis` | pass | 2/2 upstream fragments line-covered |
350
+ | `jscpd/fixtures/openqasm` | `openqasm` | pass | 2/2 upstream fragments line-covered |
351
+ | `jscpd/fixtures/oz` | `oz` | pass | 2/2 upstream fragments line-covered |
352
+ | `jscpd/fixtures/pascal` | `pascal` | pass | 2/2 upstream fragments line-covered |
353
+ | `jscpd/fixtures/idl` | `prolog` | pass | exact clone and line summary parity |
354
+ | `jscpd/fixtures/plsql` | `plsql` | pass | exact clone and line summary parity |
355
+ | `jscpd/fixtures/plant-uml` | `plant-uml` | pass | 2/2 upstream fragments line-covered |
356
+ | `jscpd/fixtures/powerquery` | `powerquery` | pass | 2/2 upstream fragments line-covered |
357
+ | `jscpd/fixtures/purescript` | `purescript` | pass | exact clone and line summary parity |
358
+ | `jscpd/fixtures/q` | `q` | pass | 2/2 upstream fragments line-covered |
359
+ | `jscpd/fixtures/rescript` | `rescript` | pass | exact clone and line summary parity |
360
+ | `jscpd/fixtures/smalltalk` | `smalltalk` | pass | 2/2 upstream fragments line-covered |
361
+ | `jscpd/fixtures/smarty` | `smarty` | pass | 2/2 upstream fragments line-covered |
362
+ | `jscpd/fixtures/soy` | `soy` | pass | 2/2 upstream fragments line-covered |
363
+ | `jscpd/fixtures/sparql` | `sparql` | pass | 2/2 upstream fragments line-covered |
364
+ | `jscpd/fixtures/tt2` | `tt2` | pass | exact clone and line summary parity |
365
+ | `jscpd/fixtures/unrealscript` | `unrealscript` | pass | 2/2 upstream fragments line-covered |
366
+ | `jscpd/fixtures/velocity` | `velocity` | pass | 2/2 upstream fragments line-covered |
367
+ | `jscpd/fixtures/mathematica` | `wolfram` | pass | exact clone and line summary parity |
368
+ | `jscpd/packages` | `javascript` | pass | no clones in either implementation |
369
+ | `jscpd/packages` | `typescript` | pass | 66/66 upstream fragments line-covered |
370
+ | Private app fixture | `javascript` | pass | 154/154 upstream fragments line-covered; one exact pair differs in generated `.next` chunks |
371
+ | Private app fixture | `typescript` | pass | 408/408 upstream fragments line-covered |
372
+ | Private app fixture | `tsx` | pass | 14/14 upstream fragments line-covered; Rust currently reports extra findings |
373
+
374
+ ## Known Deltas
375
+
376
+ - JS/TS/JSX/TSX use native Rust/Oxc tokenization, so token totals can differ
377
+ from Prism while fragment coverage remains green.
378
+ - Long-tail formats are now discoverable through the upstream-synchronized
379
+ registry, but most use generic tokenization and do not carry parity claims.
380
+ - Markdown extracts YAML front matter and fenced code blocks into embedded
381
+ format maps. YAML quoted scalars are kept whole and fenced gap whitespace is
382
+ preserved well enough to reach exact upstream Markdown clone/start and duplicated-line
383
+ parity, while source line and token totals still differ.
384
+ - Vue, Svelte, and Astro now split embedded template/script/style/frontmatter
385
+ regions into format maps. CSS-like style blocks skip internal whitespace
386
+ tokens so Vue SCSS starts align with upstream, while other embedded generic
387
+ block maps still preserve internal whitespace where it is needed for
388
+ coverage. Their fixtures are line-covered, with remaining wider ranges from
389
+ generic markup/style tokenization.
390
+ - Plain `markup` now extracts top-level `<script>` and `<style>` blocks into
391
+ embedded JavaScript/TypeScript/CSS-like maps. This covers upstream mixed HTML
392
+ fixture clones, though Rust may report a wider equivalent embedded range.
393
+ - Pug and HAML mirror Prism's multiline block behavior for fixture parity:
394
+ `pug` keeps non-`script` dot blocks as one token, and `haml` keeps silent
395
+ comment blocks as one token. The overextended upstream report ranges remain
396
+ listed in `docs/upstream-bugs.md`.
397
+ - Non-native generic formats use coarse whitespace tokenization; weak mode
398
+ strips common comment spans on a best-effort basis, including `#`, `//`, `/* */`,
399
+ `<!-- -->`, SQL-style `--`, and Lisp/INI-style `;` comments where those
400
+ prefixes are comments in the upstream Prism grammar.
401
+ - CSS-like generic formats split common punctuation so practical stylesheet
402
+ clones meet upstream token thresholds without carrying a full Prism port.
403
+ - Code-like and Prism-like generic formats split common punctuation and
404
+ operator runs so practical language fixtures meet upstream default token
405
+ thresholds without carrying a full Prism port. This includes long-tail
406
+ fixture formats such as YAML, INI, markup, HAML, DOT, CSV, CMake, Clojure,
407
+ CoffeeScript, Q#, SPARQL, and Robot Framework.
408
+ - Properties uses the same generic punctuation/operator split so dotted keys
409
+ and assignments reach upstream clone thresholds without a dedicated lexer.
410
+ - Several upstream fixture directories are gated through upstream aliases:
411
+ `gitignore` as `ignore`, `mathematica` as `wolfram`, `idl` as `prolog`, and
412
+ `z80` as `c`.
413
+ - ASP.NET uses the code-like generic splitter and is gated with a narrow
414
+ documented upstream range exception for `file2.aspx:18-43`, where upstream
415
+ reports through an inserted email field block that is not present in the
416
+ paired source.
417
+ - Apex extracts bracketed SOQL regions into an embedded `sql` map to match
418
+ upstream's multi-format Apex reports.
419
+ - `--mode strict` now preserves Prism-style `empty` and `new_line` whitespace
420
+ tokens in the native JS/TS/Oxc path and the generic tokenizer. The
421
+ JavaScript fixture has exact strict-mode summary parity.
422
+ - Extensionless names such as `Makefile` and `Dockerfile` require
423
+ `--formats-names`, matching upstream behavior.
424
+ - Custom extension and filename mappings are supported through
425
+ `--formats-exts`/`formatsExts` and `--formats-names`/`formatsNames`.
426
+ - Relative `ignore`/`--ignore` patterns are normalized against each configured
427
+ scan root and the current working directory, matching upstream behavior for
428
+ absolute scan paths outside `cwd`.
429
+ - `--noSymlinks` skips symlink scan roots as well as symlinks found during tree
430
+ walking, matching upstream's pre-glob path filtering.
431
+ - File discovery respects the current working directory `.gitignore`, scan-root
432
+ `.gitignore` files, `.git/info/exclude`, and the global Git excludes file
433
+ from `git config --global core.excludesFile`.
434
+ - `--max-size`/`maxSize` follows upstream `bytes.parse` semantics, including
435
+ decimal `kb` through `pb` values, `parseInt` fallback for non-matching
436
+ suffixes such as `1k`, and zero-file behavior for invalid limits.
437
+ - CLI `--min-lines`, `--min-tokens`, and `--max-lines` accept upstream-style
438
+ `parseInt` numeric prefixes, so values such as `20.9` are treated as `20`;
439
+ missing optional values are accepted like Commander `[number]` options.
440
+ - Bare optional values for `--threshold`, `--exitCode`, `--max-size`,
441
+ `--pattern`, `--store`, and `--store-path` follow the local upstream runtime
442
+ behavior where upstream continues instead of failing during CLI parsing.
443
+ - Bare optional values for `--ignore`, `--ignore-pattern`, `--reporters`,
444
+ `--mode`, `--format`, `--formats-exts`, `--formats-names`, and file-writing
445
+ `--output` paths now mirror upstream's Commander runtime TypeError shape
446
+ instead of failing during CLI parsing, including the different
447
+ `fs.mkdirSync` and `path.join` error strings used by different file
448
+ reporters.
449
+ - Malformed CLI `--formats-exts`/`--formats-names` entries without `:` now
450
+ preserve upstream's visible `Cannot read properties of undefined` TypeError
451
+ instead of silently ignoring the entry.
452
+ - CLI `--threshold` follows JavaScript `Number(...)` parsing for values such as
453
+ `0x10` and `nope`, matching upstream threshold reporter behavior.
454
+ - CLI/config `exitCode` keeps the raw Node-like value until clones are found.
455
+ Integer strings such as `0x10` exit with the matching code, while invalid,
456
+ fractional, or bare boolean values emit the same Node-style error as upstream
457
+ after reports are written.
458
+ - Config `minLines`, `maxLines`, and `threshold` accept string numeric values
459
+ that upstream coerces at runtime, including JavaScript-style threshold strings
460
+ such as `0x10`. Config `minTokens` remains intentionally strict because
461
+ upstream's string value path can corrupt token-window indexing and crash in
462
+ detection.
463
+ - Invalid `--mode` values fail after CLI parsing with the upstream-style
464
+ `Error: Mode ... does not supported yet.` message printed to stdout.
465
+ - If discovery, size, or line filters leave no files to detect, reporters are
466
+ not run, matching upstream's `InFilesDetector` early return. Silent mode
467
+ stays quiet; non-silent mode only prints the terminal footer.
468
+ - `skipLocal` follows the upstream configured-root validator: clones are skipped
469
+ only when both fragments are inside the same input path.
470
+ - The upstream workflow option surface for `blame`, `store`, `storePath`,
471
+ `cache`, `executionId`, `noTips`, `listeners`, and `tokensToSkip` is parsed
472
+ from CLI/config where applicable. The default `executionId` is generated as a
473
+ UTC RFC3339 timestamp, matching the upstream workflow shape. `--blame`
474
+ populates clone fragment blame data from native `git blame -w` output when
475
+ available.
476
+ - `cache`, config `listeners`, and `tokensToSkip` are intentionally treated as
477
+ option-surface compatibility only for now: the upstream CLI/reference code
478
+ defines or merges these fields, but does not consume them in the detection,
479
+ tokenizer, reporter, or store runtime.
480
+ - `--store <name>` currently follows the upstream missing-store fallback shape
481
+ in both CLI and server entrypoints: it warns that the store package is not
482
+ installed and continues with in-memory detection. Dynamic loading of external
483
+ store packages remains an implementation gap.
484
+ - `--debug` is a dry run like upstream: it prints JS-style option fields and
485
+ discovered files, then exits before clone detection and reporter execution.
486
+ - Explicit `--config` paths are resolved lexically like Node `path.resolve()`,
487
+ without canonicalizing symlinks, so config-relative options use the visible
488
+ config path's directory.
489
+ - `--list` follows the upstream output shape: a `Supported formats:` header
490
+ followed by comma-separated formats.
491
+ - Non-silent runs print clone progress for non-`ai` reporters, then reporter
492
+ output, then a `time:` footer. Tips are printed by default and suppressed by
493
+ `--noTips`; the Rust footer keeps only the AI refactoring tip and omits the
494
+ upstream promotional/support lines.
495
+ - Reporter normalization mirrors upstream append behavior: explicit `silent`
496
+ or `threshold` reporters are not deduplicated when `--silent` or
497
+ `--threshold` appends the same reporter.
498
+ - `--verbose` prints upstream-style format-filter skip messages and detector
499
+ events for `START_DETECTION`, `CLONE_FOUND`, and `CLONE_SKIPPED`.
500
+ - Unknown reporter names emit the upstream-style install warning. Dynamic
501
+ loading of external reporter packages is not implemented yet.
502
+ - `reportersOptions.badge` supports the upstream-style `subject`, `status`,
503
+ `color`, and `path` overrides for the built-in badge reporter.
504
+ - Known upstream bug candidates are tracked in `docs/upstream-bugs.md`.
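The `parseInt`-style numeric prefix rule noted above for `--min-lines`, `--min-tokens`, and `--max-lines` can be illustrated with a minimal sketch. `parse_int_prefix` is a hypothetical helper, not a function from the jscpd-rs source, and it assumes base-10 input only:

```rust
/// Hypothetical helper (not from the jscpd-rs source): parse the leading
/// base-10 integer prefix of `input`, loosely following JavaScript's
/// `parseInt(value, 10)`. Returns `None` where JavaScript would yield `NaN`.
/// Assumption: digit runs that overflow `i64` are also treated as `None`.
fn parse_int_prefix(input: &str) -> Option<i64> {
    let trimmed = input.trim_start();
    // Accept one optional sign before the digits.
    let (negative, rest) = match trimmed.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, trimmed.strip_prefix('+').unwrap_or(trimmed)),
    };
    // Keep the longest run of ASCII digits and ignore whatever follows.
    let digits_len = rest.bytes().take_while(|b| b.is_ascii_digit()).count();
    if digits_len == 0 {
        return None;
    }
    let value: i64 = rest[..digits_len].parse().ok()?;
    Some(if negative { -value } else { value })
}
```

Under these assumptions, `parse_int_prefix("20.9")` yields `Some(20)` and `parse_int_prefix("nope")` yields `None`.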
505
+
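The lexical `--config` resolution described in the list above (resolve like Node `path.resolve()`, without canonicalizing symlinks) might look roughly like the following. `resolve_lexically` is an illustrative name and the paths in the usage note are made up; this sketch never touches the filesystem:

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical sketch: join `path` onto `cwd` when it is relative, then fold
/// `.` and `..` components purely by name. Symlinks are never followed, so the
/// result keeps the path the user can see.
fn resolve_lexically(cwd: &Path, path: &Path) -> PathBuf {
    let joined = if path.is_absolute() {
        path.to_path_buf()
    } else {
        cwd.join(path)
    };
    let mut out = PathBuf::new();
    for component in joined.components() {
        match component {
            // `.` contributes nothing.
            Component::CurDir => {}
            // `..` drops the last pushed component; at the root it is a no-op.
            Component::ParentDir => {
                out.pop();
            }
            // Prefix, root, and normal components are kept as-is.
            other => out.push(other.as_os_str()),
        }
    }
    out
}
```

With a working directory of `/work/project`, the relative path `../conf/./jscpd.json` resolves to `/work/conf/jscpd.json`, so config-relative options would be read relative to `/work/conf` even if that directory is a symlink.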
506
+ ## Benchmark Sanity
507
+
508
+ Recent local sanity checks:
509
+
510
+ | Target | Format | Rust avg | Upstream avg | Approx speedup |
511
+ | --- | --- | ---: | ---: | ---: |
512
+ | Private app fixture | `tsx` | `0.0358s` | `0.568s` | `16x` |
513
+ | `jscpd/packages` | `typescript` | `0.0143s` | `0.831s` | `58x` |
514
+
515
+ Latest public benchmark suite checks, using repositories cloned outside the
516
+ project tree:
517
+
518
+ | Target | Commit | Format | Rust avg | Upstream avg | Approx speedup |
519
+ | --- | --- | --- | ---: | ---: | ---: |
520
+ | `facebook/react` | `f0dfee3` | `javascript` | `0.199097s` | `10.079214s` | `50.62x` |
521
+ | `vercel/next.js` | `2bbb67b9` | `typescript` | `0.262433s` | `14.715736s` | `56.07x` |
522
+ | `prometheus/prometheus` | `a0524ee` | `go` | `0.085239s` | `4.642435s` | `54.46x` |
523
+
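The "Approx speedup" column appears to be the upstream average divided by the Rust average. A small sketch of that arithmetic, with hypothetical function names, reproduces the public-suite figures from the averages in the table:

```rust
/// Speedup as the ratio of the upstream average to the Rust average.
fn speedup(upstream_secs: f64, rust_secs: f64) -> f64 {
    upstream_secs / rust_secs
}

/// Format the ratio with two decimals, matching the public-suite table.
fn format_speedup(upstream_secs: f64, rust_secs: f64) -> String {
    format!("{:.2}x", speedup(upstream_secs, rust_secs))
}
```

For the `facebook/react` row, `format_speedup(10.079214, 0.199097)` gives `50.62x`.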
524
+ ## Additional Mode Checks
525
+
526
+ ```bash
527
+ DETECTION_MODE=strict FORMAT=javascript MIN_TOKENS=20 MIN_LINES=3 MAX_SIZE=1mb \
528
+ STRICT=coverage scripts/compat.sh jscpd/fixtures/javascript
529
+ ```
530
+
531
+ The default matrix also includes strict JavaScript/TypeScript and weak
532
+ JavaScript mode checks so mode regressions are gated directly. Strict mode uses
533
+ the same coverage-first release rule; token totals remain diagnostic because
534
+ the native token stream may split whitespace differently from Prism while still
535
+ covering every upstream duplicated line.
@@ -0,0 +1,86 @@
1
+ # Format Porting Guide
2
+
3
+ The first-release policy is coverage-first for hot JS/TS formats and smoke-only
4
+ for long-tail generic formats. A format may report more clones than upstream
5
+ while compatibility converges, but release-compatible formats must not miss
6
+ any duplicate fragment that upstream reports on their fixtures.
7
+
8
+ ## Status Levels
9
+
10
+ - `generic`: format is recognized through the upstream-synchronized registry and
11
+ uses coarse whitespace tokenization.
12
+ - `native-smoke`: Rust has format-specific logic and local smoke tests, but no
13
+ upstream coverage claim.
14
+ - `coverage`: `MODE=compat scripts/check-format.sh <format> <target>` passes
15
+ with `STRICT=coverage`.
16
+ - `release`: docs and tests make the support level explicit, and the format is
17
+ included in the release matrix.
18
+
19
+ ## Files To Know
20
+
21
+ - `src/formats.rs`: generated format and extension registry. Do not edit by
22
+ hand; run `node scripts/sync-formats.mjs` after upstream tokenizer changes.
23
+ - `src/tokenizer.rs`: tokenizer entrypoint and format dispatch. Native, generic,
24
+ embedded-block, hashing, ignore, and position helpers live under
25
+ `src/tokenizer/`.
26
+ - `src/files.rs`: discovery entrypoint and format filtering. Gitignore, shebang,
27
+ and path-order helpers live under `src/files/`.
28
+ - `src/detector.rs`: clone detection; do not change for ordinary format work.
29
+ - `scripts/check-format.sh`: one-format smoke/compat checks.
30
+ - `scripts/compat.sh`: Rust vs upstream report comparison.
31
+ - `docs/compat-baseline.md`: current compatibility claims and known deltas.
32
+
33
+ ## Minimal Format Task
34
+
35
+ 1. Confirm the format is present:
36
+
37
+ ```bash
38
+ cargo run --quiet -- --list | rg '(^|, )<format>(,|$)'
39
+ ```
40
+
41
+ 2. Add or reuse a tiny target directory for the format.
42
+
43
+ 3. Run smoke mode:
44
+
45
+ ```bash
46
+ scripts/check-format.sh <format> <target>
47
+ ```
48
+
49
+ 4. If the task claims upstream coverage, run compat mode:
50
+
51
+ ```bash
52
+ MODE=compat scripts/check-format.sh <format> <target>
53
+ ```
54
+
55
+ 5. Add focused tests near the code being changed.
56
+
57
+ 6. Update docs only when the support level changes.
58
+
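The `rg` pattern in step 1 matches the format only as a whole entry in the comma-separated `--list` output, so a format name that is merely a prefix of another entry does not count. The same check can be sketched in Rust. `list_contains_format` is a hypothetical helper and the sample list in the usage note is made up:

```rust
/// Hypothetical helper: true when `format` is a whole entry in a
/// comma-separated list line, mirroring the intent of the
/// `(^|, )<format>(,|$)` pattern from step 1.
fn list_contains_format(list_line: &str, format: &str) -> bool {
    list_line
        .split(',')
        .map(str::trim)
        .any(|entry| entry == format)
}
```

For a line such as `c, cpp, typescript`, the check succeeds for `cpp` and fails for `type`.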
59
+ ## Native Tokenizer Task
60
+
61
+ Native tokenizers should be added only when generic tokenization is too noisy or
62
+ misses practical clones. Prefer maintained Rust crates where available. If a
63
+ custom scanner is needed, keep it small and format-specific.
64
+
65
+ Expected shape:
66
+
67
+ - One small scanner/helper for the format.
68
+ - Unit tests for token slices, comments, weak mode, and at least one duplicate
69
+ detection path.
70
+ - No detector changes unless there is a proven cross-format contract issue.
71
+ - No JavaScript runtime fallback.
72
+
73
+ ## Junior-Safe Format Tasks
74
+
75
+ - Add a smoke test for a format already handled by generic tokenization.
76
+ - Add one comment-style test and no production code.
77
+ - Add one small production helper by copying an existing tokenizer pattern.
78
+ - Run `scripts/check-format.sh` and report exact `sources`/`clones` output.
79
+
80
+ ## Main-Agent-Only Decisions
81
+
82
+ - Promoting a format to `coverage` or `release`.
83
+ - Adding dependencies.
84
+ - Changing detector contracts.
85
+ - Changing compatibility gate semantics.
86
+ - Editing generated registry logic.