diffinite 0.9.6__tar.gz → 0.11.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (54)
  1. {diffinite-0.9.6/src/diffinite.egg-info → diffinite-0.11.1}/PKG-INFO +100 -54
  2. {diffinite-0.9.6 → diffinite-0.11.1}/README.md +98 -52
  3. {diffinite-0.9.6 → diffinite-0.11.1}/pyproject.toml +3 -2
  4. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/cli.py +141 -40
  5. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/collector.py +35 -11
  6. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/deep_compare.py +7 -1
  7. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/differ.py +3 -0
  8. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/models.py +6 -0
  9. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/pdf_gen.py +74 -32
  10. diffinite-0.11.1/src/diffinite/pipeline.py +1183 -0
  11. {diffinite-0.9.6 → diffinite-0.11.1/src/diffinite.egg-info}/PKG-INFO +100 -54
  12. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite.egg-info/SOURCES.txt +1 -0
  13. {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_cli.py +57 -2
  14. diffinite-0.11.1/tests/test_json_report_integration.py +46 -0
  15. {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_pdf_gen.py +104 -11
  16. diffinite-0.11.1/tests/test_pipeline.py +343 -0
  17. diffinite-0.9.6/src/diffinite/pipeline.py +0 -777
  18. diffinite-0.9.6/tests/test_pipeline.py +0 -126
  19. {diffinite-0.9.6 → diffinite-0.11.1}/LICENSE +0 -0
  20. {diffinite-0.9.6 → diffinite-0.11.1}/NOTICE +0 -0
  21. {diffinite-0.9.6 → diffinite-0.11.1}/setup.cfg +0 -0
  22. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/__init__.py +0 -0
  23. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/__main__.py +0 -0
  24. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/evidence.py +0 -0
  25. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/fingerprint.py +0 -0
  26. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/__init__.py +0 -0
  27. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/_registry.py +0 -0
  28. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/_spec.py +0 -0
  29. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/c_family.py +0 -0
  30. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/csharp.py +0 -0
  31. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/data.py +0 -0
  32. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/go_rust_swift.py +0 -0
  33. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/java.py +0 -0
  34. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/javascript.py +0 -0
  35. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/markup.py +0 -0
  36. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/python.py +0 -0
  37. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/scripting.py +0 -0
  38. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/parser.py +0 -0
  39. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite.egg-info/dependency_links.txt +0 -0
  40. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite.egg-info/entry_points.txt +0 -0
  41. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite.egg-info/requires.txt +0 -0
  42. {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite.egg-info/top_level.txt +0 -0
  43. {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_collector.py +0 -0
  44. {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_deep_compare.py +0 -0
  45. {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_differ.py +0 -0
  46. {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_differ_extended.py +0 -0
  47. {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_evidence.py +0 -0
  48. {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_evidence_hash.py +0 -0
  49. {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_fingerprint.py +0 -0
  50. {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_languages.py +0 -0
  51. {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_normalize.py +0 -0
  52. {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_parser.py +0 -0
  53. {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_plagiarism_dataset.py +0 -0
  54. {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_sqlite_integration.py +0 -0
@@ -1,13 +1,13 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: diffinite
3
- Version: 0.9.6
3
+ Version: 0.11.1
4
4
  Summary: Forensic source-code comparison tool — Winnowing fingerprints and professional PDF reports for IP litigation & code audit
5
5
  Author: nash-dir
6
6
  License: Apache-2.0
7
7
  Project-URL: Homepage, https://github.com/nash-dir/diffinite
8
8
  Project-URL: Repository, https://github.com/nash-dir/diffinite
9
9
  Project-URL: Issues, https://github.com/nash-dir/diffinite/issues
10
- Keywords: diff,pdf,source-code,comparison,bates-number,forensics,code-audit,plagiarism-detection,winnowing,clone-detection
10
+ Keywords: diff,pdf,source-code,comparison,bates-number,forensics,code-audit,plagiarism-detection,winnowing,clone-detection,side-by-side,similarity,jaccard,copyright,litigation,vscode
11
11
  Classifier: Development Status :: 4 - Beta
12
12
  Classifier: Intended Audience :: Legal Industry
13
13
  Classifier: Intended Audience :: Developers
@@ -37,15 +37,39 @@ Dynamic: license-file
37
37
 
38
38
  # Diffinite
39
39
 
40
- **Source-code comparison tool for code audit and similarity analysis.**
40
+ **Forensic source-code comparison tool for IP litigation and code audit.**
41
41
 
42
- Diffinite compares two directories of source code and produces professional PDF/HTML reports with syntax-highlighted side-by-side diffs. It uses [Winnowing fingerprints](https://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf) (Schleimer et al., 2003 — the algorithm that also forms the basis of [Stanford MOSS](https://theory.stanford.edu/~aiken/moss/)) for N:M cross-matching to detect code reuse even across renamed, split, or merged files.
42
+ Diffinite compares two directories of source code and produces professional PDF/HTML reports with syntax-highlighted side-by-side diffs. It uses [Winnowing fingerprints](https://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf) (Schleimer et al., 2003 — the algorithm behind [Stanford MOSS](https://theory.stanford.edu/~aiken/moss/)) for N:M cross-matching to detect code reuse even across renamed, split, or merged files.
43
43
 
44
44
  > **Design Principle**: Diffinite reports **how similar** and **where similar**. It does not classify the type of copying — that is the expert witness's job.
45
45
 
46
46
  ---
47
47
 
48
- ## Installation
48
+ ## VS Code Extension
49
+
50
+ The recommended way to use Diffinite is through the **VS Code extension**, which bundles an embedded Python runtime — no separate Python installation required.
51
+
52
+ ### Features
53
+ - **Visual directory picker** — Select two directories and configure options via a GUI panel
54
+ - **Real-time progress bar** — Live percentage tracking during analysis
55
+ - **Pre-analysis time estimation** — Scans file sizes upfront and estimates Simple/Deep mode duration
56
+ - **Dynamic CPU calibration** — Benchmarks Phase 1 performance to refine Phase 2 time predictions
57
+ - **OOM defense** — Warns before analyzing file pairs exceeding 5MB
58
+ - **Interactive tree viewer** — Review matched pairs and selectively export
59
+ - **One-click PDF/HTML export** — With Bates numbering, page numbers, and filename annotations
60
+
61
+ ### Install from Source
62
+
63
+ ```bash
64
+ cd vscode-extension
65
+ npm install
66
+ npm run compile
67
+ # Press F5 in VS Code to launch Extension Development Host
68
+ ```
69
+
70
+ ---
71
+
72
+ ## CLI Installation
49
73
 
50
74
  ```bash
51
75
  pip install diffinite
@@ -73,7 +97,7 @@ diffinite original/ suspect/ -o report.pdf
73
97
 
74
98
  # With comment stripping and Bates numbering (forensic use)
75
99
  diffinite original/ suspect/ -o report.pdf \
76
- --no-comments --bates-number --page-number --show-filename
100
+ --strip-comments --bates-number --page-number --filename
77
101
 
78
102
  # HTML report (single self-contained file, opens in browser)
79
103
  diffinite original/ suspect/ --report-html report.html
@@ -89,7 +113,7 @@ Diffinite runs a two-stage pipeline:
89
113
 
90
114
  1. **Fuzzy name matching** — Pairs files across `dir_a` and `dir_b` using [RapidFuzz](https://github.com/rapidfuzz/RapidFuzz) string similarity (configurable threshold).
91
115
  2. **Comment stripping** — Optionally removes comments using a 5-state finite state machine parser supporting 30+ file extensions.
92
- 3. **Side-by-side diff** — Computes line-by-line (or word-by-word) diffs using Python's `difflib.SequenceMatcher`.
116
+ 3. **Side-by-side diff** — Computes line-by-line (or word-by-word) diffs using Python's `difflib.SequenceMatcher` with `autojunk=True` for O(n) performance on large files.
93
117
  4. **Report generation** — Renders syntax-highlighted HTML diffs via Pygments, then converts to PDF with xhtml2pdf.
94
118
 
95
119
  ### Stage 2: N:M Cross-Matching (`deep` mode, default)
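Note on step 3 of the hunk above: the Content Match figure is described as `difflib.SequenceMatcher.ratio()`, and the new wording mentions `autojunk=True`. A minimal sketch of that call from the standard library follows. The helper name `content_match` and the choice to compare line by line are assumptions for illustration, not code taken from `differ.py`. In CPython the autojunk heuristic only activates for sequences of at least 200 items, so it has no effect on the short inputs here.

```python
from difflib import SequenceMatcher


def content_match(a: str, b: str, autojunk: bool = True) -> float:
    """Return SequenceMatcher.ratio() for two texts compared line by line.

    Hypothetical helper: the real tool may tokenize differently.
    """
    a_lines = a.splitlines()
    b_lines = b.splitlines()
    return SequenceMatcher(None, a_lines, b_lines, autojunk=autojunk).ratio()


# ratio() is 2 * matches / (len(a) + len(b)), so it ranges from 0.0 to 1.0.
identical = content_match("x = 1\ny = 2\n", "x = 1\ny = 2\n")
disjoint = content_match("x = 1\n", "print('hi')\n")
partial = content_match("a\nb\nc\nd\n", "a\nb\nc\ne\n")  # 3 of 4 lines shared
```

Passing `autojunk=False` corresponds to what the `--no-autojunk` option is documented to do: every element is treated equally, at the cost of speed on large inputs.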
@@ -110,8 +134,9 @@ The cover page contains a summary table for each matched file pair:
110
134
  | Column | Description |
111
135
  |--------|-------------|
112
136
  | **File A / File B** | Matched file paths |
113
- | **Match** | `difflib.SequenceMatcher.ratio()` the proportion of matching characters between the two files. `1.0` = identical, `0.0` = completely different. |
114
- | **Added / Deleted** | Number of lines added to or deleted from File A to produce File B. |
137
+ | **Name Sim.** | Fuzzy filename similarity score (0–100) |
138
+ | **Content Match** | `difflib.SequenceMatcher.ratio()` proportion of matching content. `1.0` = identical. |
139
+ | **Added / Deleted** | Number of lines (or words) added to or deleted from File A to produce File B. |
115
140
 
116
141
  ### Diff Pages
117
142
 
@@ -119,6 +144,7 @@ Each matched pair gets a side-by-side diff page with:
119
144
 
120
145
  - **Green highlight** — Lines present only in File B (additions)
121
146
  - **Red highlight** — Lines present only in File A (deletions)
147
+ - **Yellow highlight** — Lines changed between A and B (word-level diff in `--by-word` mode)
122
148
  - **Purple highlight** — Lines moved from this position (`--detect-moved`)
123
149
  - **Blue highlight** — Lines moved to this position (`--detect-moved`)
124
150
  - **No highlight** — Identical lines (with configurable context folding)
@@ -131,9 +157,9 @@ When running in `deep` mode (default), the report includes an N:M cross-matching
131
157
  |--------|-------------|
132
158
  | **File A** | Source file from directory A |
133
159
  | **Matched Files (B)** | All files from directory B that share fingerprints above the Jaccard threshold |
134
- | **Jaccard** | `|A∩B| / |A∪B|` — the fraction of shared Winnowing fingerprints. A Jaccard of `0.73` means 73% of the code fingerprints are shared between the two files. |
160
+ | **Jaccard** | `|A∩B| / |A∪B|` — the fraction of shared Winnowing fingerprints. |
135
161
 
136
- Jaccard similarity is a well-defined set metric: `|A∩B| / |A∪B|`. Its interpretation depends on the domain, code size, and language. Diffinite reports the raw value without attaching qualitative labels.
162
+ Jaccard similarity is a well-defined set metric. Its interpretation depends on the domain, code size, and language. Diffinite reports the raw value without attaching qualitative labels.
137
163
 
138
164
  ### Page Annotations
139
165
 
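Note on the Jaccard column in the hunk above: the metric `|A∩B| / |A∪B|` can be sketched over two sets of fingerprint hashes as follows. The function name `jaccard`, the integer fingerprints, and the convention of returning `0.0` for two empty sets are assumptions for illustration. The `0.05` cutoff mirrors the documented default of `--threshold-deep`.

```python
def jaccard(a: set[int], b: set[int]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two fingerprint sets."""
    union = a | b
    if not union:
        # Assumption: two empty fingerprint sets are reported as 0.0.
        return 0.0
    return len(a & b) / len(union)


THRESHOLD_DEEP = 0.05  # documented default of --threshold-deep

fp_a = {101, 102, 103}
fp_b = {102, 103, 104}
score = jaccard(fp_a, fp_b)            # 2 shared out of 4 distinct
is_reported = score >= THRESHOLD_DEEP  # pairs below the cutoff are dropped
```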
@@ -142,7 +168,7 @@ Jaccard similarity is a well-defined set metric: `|A∩B| / |A∪B|`. Its interp
142
168
  | `--page-number` | `Page 3 / 47` | Bottom-right |
143
169
  | `--file-number` | `File 2 / 12` | Bottom-left |
144
170
  | `--bates-number` | `TEST-000003-CONF` | Bottom-center |
145
- | `--show-filename` | `com/example/Foo.java` | Top-right |
171
+ | `--filename` | `com/example/Foo.java` | Top-right |
146
172
 
147
173
  ---
148
174
 
@@ -169,18 +195,21 @@ dir_b Path to the comparison source directory (B)
169
195
  | `--report-pdf PATH` | Generate merged PDF report |
170
196
  | `--report-html PATH` | Generate standalone HTML report (single file, no external deps) |
171
197
  | `--report-md PATH` | Generate Markdown summary report |
198
+ | `--report-json PATH` | Generate machine-readable JSON report (used by VS Code extension) |
172
199
  | `--no-merge` | Generate individual PDFs per file instead of one merged PDF |
200
+ | `--preserve-tree` / `--no-preserve-tree` | Preserve directory tree structure in individual output (default: on) |
173
201
 
174
202
  ### Diff Options
175
203
 
176
204
  | Option | Default | Description |
177
205
  |--------|:-------:|-------------|
178
- | `--no-comments` | off | Strip comments before comparison (5-state FSM parser, 30+ extensions) |
206
+ | `--strip-comments` | off | Strip comments before comparison (5-state FSM parser, 30+ extensions) |
179
207
  | `--by-word` | off | Compare by word instead of by line |
180
208
  | `--squash-blanks` | off | Collapse runs of 3+ blank lines. ⚠️ Changes line numbers — not recommended for forensic line-tracing. |
181
209
  | `--threshold N` | `60` | Fuzzy file-name matching threshold (0–100). Lower = more aggressive matching. |
182
210
  | `--collapse-identical` | off | Fold unchanged code blocks (3 context lines around each change) |
183
- | `--detect-moved` | off | Detect moved code blocks and highlight with distinct colors (purple=original, blue=destination) |
211
+ | `--detect-moved` | off | Detect moved code blocks and highlight with distinct colors |
212
+ | `--encoding ENC` | `auto` | Force file encoding (e.g. `euc-kr`, `utf-8`). Default: auto-detect via charset-normalizer. |
184
213
 
185
214
  ### Deep Compare Options
186
215
 
@@ -190,7 +219,7 @@ dir_b Path to the comparison source directory (B)
190
219
  | `--window N` | `4` | Winnowing window size. Guarantees detection of any shared sequence ≥ `K+W−1` = 8 tokens. |
191
220
  | `--threshold-deep F` | `0.05` | Minimum Jaccard similarity to include in results. Below 5% is considered noise. |
192
221
  | `--normalize` | off | Normalize identifiers → `ID`, literals → `LIT` before fingerprinting. Improves Type-2 clone detection (renamed variables). |
193
- | `--workers N` | `4` | Number of parallel worker processes for fingerprint extraction. |
222
+ | `--workers N` | `4` | Number of parallel worker processes for diff rendering and fingerprint extraction. |
194
223
 
195
224
  ### Forensic Options
196
225
 
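Note on the `--window` row in the hunk above: the stated guarantee (any shared run of at least `K+W−1` = 8 tokens is detected, with `W` = 4 and therefore `K` = 5) follows from how Winnowing picks one minimum hash per window of consecutive k-gram hashes. The sketch below is a simplified version of the published algorithm, not the code in `fingerprint.py`. The function names, the BLAKE2b hash, and the handling of inputs shorter than one window are assumptions.

```python
import hashlib


def kgram_hashes(tokens: list[str], k: int) -> list[int]:
    """Hash every run of k consecutive tokens to a 64-bit integer."""
    hashes = []
    for i in range(len(tokens) - k + 1):
        gram = "\x00".join(tokens[i:i + k])
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
        hashes.append(int.from_bytes(digest, "big"))
    return hashes


def winnow(tokens: list[str], k: int = 5, w: int = 4) -> set[int]:
    """Return the set of window minima over the k-gram hash sequence."""
    hashes = kgram_hashes(tokens, k)
    if not hashes:
        return set()
    if len(hashes) < w:
        # Assumption: a short input contributes its single smallest hash.
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


# Two token streams that share an 8-token run (k + w - 1 = 8).
shared = "a b c d e f g h".split()
doc1 = ["x1", "x2"] + shared + ["x3"]
doc2 = ["y1"] + shared + ["y2", "y3", "y4"]
common = winnow(doc1) & winnow(doc2)
```

A shared run of 8 tokens produces 4 identical consecutive k-gram hashes in both inputs, which is exactly one full window. Its minimum is selected on both sides, so `common` is never empty in this situation.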
@@ -198,6 +227,9 @@ dir_b Path to the comparison source directory (B)
198
227
  |--------|:-------:|-------------|
199
228
  | `--no-autojunk` | off | Disable `SequenceMatcher`'s autojunk heuristic. Treats all tokens equally — slower but more precise for forensic analysis. |
200
229
  | `--max-index-entries N` | `10,000,000` | Memory cap for inverted index. Prevents OOM on large corpora. ~800MB at 10M entries. |
230
+ | `--max-file-size-mb N` | `10.0` | Skip files exceeding this size (MB). Prevents OOM on large binary/generated files. |
231
+ | `--hash` | off | Embed SHA-256 evidence integrity hashes for all analyzed files in the report. |
232
+ | `--uncompared-mode {inline,separate,none}` | `inline` | Control how unmatched files are displayed: inline in main report, as separate appendix, or omitted. |
201
233
 
202
234
  ### Page Annotation Options
203
235
 
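Note on the `--hash` row in the hunk above: the option is described as embedding SHA-256 integrity hashes for all analyzed files. A minimal sketch of computing such a manifest with the standard library follows. The function names `sha256_file` and `build_manifest`, the chunk size, and the manifest shape (relative POSIX path mapped to hex digest) are assumptions, not the layout used by `evidence.py`.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (relative POSIX path) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

Reading in chunks keeps memory use flat regardless of file size, which matters when the analyzed directories contain large generated files.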
@@ -209,7 +241,7 @@ dir_b Path to the comparison source directory (B)
209
241
  | `--bates-prefix TEXT` | Bates number prefix (e.g. `PLAINTIFF-`). Combined as: `{prefix}{number}{suffix}` |
210
242
  | `--bates-suffix TEXT` | Bates number suffix (e.g. `-CONFIDENTIAL`) |
211
243
  | `--bates-start N` | Starting Bates number (default: `1`). Useful for continuing numbering across reports. |
212
- | `--show-filename` | Show filename at the top-right |
244
+ | `--filename` | Show filename at the top-right |
213
245
 
214
246
  ---
215
247
 
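Note on the Bates options in the hunk above: the label is documented as `{prefix}{number}{suffix}`, and the annotation table elsewhere in the README shows `TEST-000003-CONF` as an example. The sketch below reproduces that shape. The function names and the six-digit zero padding are assumptions inferred from that single example, not confirmed behaviour of `pdf_gen.py`.

```python
def bates_label(number: int, prefix: str = "", suffix: str = "", width: int = 6) -> str:
    """Format one Bates label as {prefix}{zero-padded number}{suffix}."""
    return f"{prefix}{number:0{width}d}{suffix}"


def bates_sequence(page_count: int, start: int = 1, prefix: str = "", suffix: str = "") -> list[str]:
    """Labels for consecutive pages, beginning at `start` (cf. --bates-start)."""
    return [bates_label(start + i, prefix, suffix) for i in range(page_count)]


example = bates_label(3, prefix="TEST-", suffix="-CONF")
continued = bates_sequence(2, start=48, prefix="CASE2026-", suffix="-CONFIDENTIAL")
```

Starting a second report at the number after the last page of the first one is how `--bates-start` lets numbering continue across reports.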
@@ -220,17 +252,17 @@ dir_b Path to the comparison source directory (B)
220
252
  ```bash
221
253
  # Full forensic report with all annotations
222
254
  diffinite plaintiff_code/ defendant_code/ -o exhibit_A.pdf \
223
- --no-comments \
255
+ --strip-comments \
224
256
  --bates-number --bates-prefix "CASE2026-" --bates-suffix "-CONFIDENTIAL" \
225
- --bates-start 1 --page-number --file-number --show-filename \
226
- --collapse-identical --detect-moved
257
+ --bates-start 1 --page-number --file-number --filename \
258
+ --collapse-identical --detect-moved --hash
227
259
  ```
228
260
 
229
261
  ### Code Audit (Quick HTML)
230
262
 
231
263
  ```bash
232
264
  # HTML report for browser viewing (no PDF dependency issues)
233
- diffinite vendor_v1/ vendor_v2/ --report-html audit.html --no-comments
265
+ diffinite vendor_v1/ vendor_v2/ --report-html audit.html --strip-comments
234
266
  ```
235
267
 
236
268
  ### Maximum Sensitivity (Type-2 Clones)
@@ -238,7 +270,7 @@ diffinite vendor_v1/ vendor_v2/ --report-html audit.html --no-comments
238
270
  ```bash
239
271
  # Detect renamed-variable copies
240
272
  diffinite original/ suspect/ -o report.pdf \
241
- --normalize --no-autojunk --no-comments
273
+ --normalize --no-autojunk --strip-comments
242
274
  ```
243
275
 
244
276
  ### Simple Mode (Fast, No Cross-Matching)
@@ -251,11 +283,12 @@ diffinite dir_a/ dir_b/ --mode simple -o quick_report.pdf
251
283
  ### Multiple Output Formats
252
284
 
253
285
  ```bash
254
- # Generate all three formats at once
286
+ # Generate all formats at once
255
287
  diffinite dir_a/ dir_b/ \
256
288
  --report-pdf report.pdf \
257
289
  --report-html report.html \
258
- --report-md report.md
290
+ --report-md report.md \
291
+ --report-json report.json
259
292
  ```
260
293
 
261
294
  ### Tuning Sensitivity
@@ -275,7 +308,7 @@ diffinite dir_a/ dir_b/ --threshold 80
275
308
 
276
309
  ## Comment Stripping Support
277
310
 
278
- The `--no-comments` flag removes comments using a 5-state finite state machine parser:
311
+ The `--strip-comments` flag removes comments using a 5-state finite state machine parser:
279
312
 
280
313
  | Extensions | Comment Styles |
281
314
  |------------|---------------|
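Note on the comment-stripping description in the hunk above: the README refers to a 5-state finite state machine but does not name the states. The sketch below shows one plausible five-state design for C-style `//` and `/* */` comments. The state names, the function name, and the choice to keep newlines inside block comments (so line numbers stay stable) are assumptions, not the implementation in `parser.py`.

```python
from enum import Enum, auto


class State(Enum):
    CODE = auto()
    LINE_COMMENT = auto()
    BLOCK_COMMENT = auto()
    STRING = auto()
    ESCAPE = auto()  # character right after a backslash inside a string


def strip_c_style_comments(src: str) -> str:
    """Remove // and /* */ comments while leaving string literals untouched."""
    out = []
    state = State.CODE
    quote = ""
    i, n = 0, len(src)
    while i < n:
        ch = src[i]
        nxt = src[i + 1] if i + 1 < n else ""
        if state is State.CODE:
            if ch == "/" and nxt == "/":
                state = State.LINE_COMMENT
                i += 2
                continue
            if ch == "/" and nxt == "*":
                state = State.BLOCK_COMMENT
                i += 2
                continue
            if ch in ('"', "'"):
                state = State.STRING
                quote = ch
            out.append(ch)
        elif state is State.LINE_COMMENT:
            if ch == "\n":
                state = State.CODE
                out.append(ch)
        elif state is State.BLOCK_COMMENT:
            if ch == "*" and nxt == "/":
                state = State.CODE
                i += 2
                continue
            if ch == "\n":
                out.append(ch)  # keep the line count unchanged
        elif state is State.STRING:
            out.append(ch)
            if ch == "\\":
                state = State.ESCAPE
            elif ch == quote:
                state = State.CODE
        else:  # State.ESCAPE: emit the escaped character verbatim
            out.append(ch)
            state = State.STRING
        i += 1
    return "".join(out)


sample = 'int a = 1; // note\nchar *s = "// not a comment"; /* x\ny */ int b;\n'
stripped = strip_c_style_comments(sample)
```

Tracking string state separately is what prevents the `//` inside the string literal from being mistaken for a comment.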
@@ -298,19 +331,28 @@ The `--no-comments` flag removes comments using a 5-state finite state machine p
298
331
  diffinite/
299
332
  ├── src/diffinite/
300
333
  │ ├── cli.py # CLI entry point & argument parsing
301
- │ ├── pipeline.py # Orchestration (simple/deep modes)
334
+ │ ├── pipeline.py # Orchestration (simple/deep modes, parallel rendering)
302
335
  │ ├── collector.py # File collection & fuzzy name matching
303
336
  │ ├── parser.py # 5-state comment stripping FSM
304
- │ ├── differ.py # Diff computation & HTML rendering
337
+ │ ├── differ.py # Diff computation, moved-block detection & HTML rendering
305
338
  │ ├── fingerprint.py # Winnowing fingerprint extraction
306
- │ ├── deep_compare.py # N:M cross-matching (inverted index)
307
- │ ├── evidence.py # Jaccard similarity metric
308
- │ ├── models.py # Data classes
309
- │ ├── pdf_gen.py # PDF/HTML report generation
310
- │ └── languages/ # Per-language specs (30+ extensions)
311
- ├── tests/
339
+ │ ├── deep_compare.py # N:M cross-matching (inverted index + Jaccard)
340
+ │ ├── evidence.py # SHA-256 integrity hashing & manifest generation
341
+ │ ├── models.py # Data classes (DiffResult, DeepMatchResult, etc.)
342
+ │ ├── pdf_gen.py # PDF/HTML report generation (xhtml2pdf)
343
+ │ └── languages/ # Per-language comment specs (30+ extensions)
344
+ ├── vscode-extension/
345
+ │ ├── src/ # TypeScript extension source
346
+ │ │ ├── extension.ts # Extension activation & command registration
347
+ │ │ ├── compareCommand.ts # Directory selection, time estimation, pipeline orchestration
348
+ │ │ ├── dirScanner.ts # Pre-analysis file scanning & OOM heuristic
349
+ │ │ ├── runner.ts # Python backend spawner with progress bar integration
350
+ │ │ ├── optionsPanel.ts # GUI options webview (mode, comments, Bates, etc.)
351
+ │ │ ├── treeViewer.ts # Interactive matched-pair tree for selective export
352
+ │ │ └── resultViewer.ts # HTML report preview inside VS Code
353
+ │ ├── bin/python/ # Embedded Python 3.12 runtime (gitignored)
354
+ │ └── package.json
312
355
  ├── example/ # Benchmark datasets (see below)
313
- ├── AGENTS.md # AI agent development guidelines
314
356
  ├── pyproject.toml
315
357
  ├── LICENSE # Apache 2.0
316
358
  └── NOTICE
@@ -335,16 +377,18 @@ Pre-generated benchmark reports (Markdown) are in `example/benchmark/`.
335
377
 
336
378
  ```bash
337
379
  diffinite example/Case-Oracle/AOSP_Google example/Case-Oracle/OpenJDK_Oracle \
338
- --no-comments --report-md example/benchmark/case_oracle.md
380
+ --strip-comments --report-md example/benchmark/case_oracle.md
339
381
  ```
340
382
 
341
- | File | Match (difflib) | Jaccard (Winnowing) |
383
+ | File | Match (difflib) | Deep Cross-Match |
342
384
  |------|:-:|:-:|
343
- | `ArrayList.java` | 9.0% | 7.3% |
385
+ | `ArrayList.java` | 9.0% | |
344
386
  | `Collections.java` | 4.5% | — |
345
- | `String.java` | 3.3% | 7.3% |
387
+ | `List.java` | 6.3% | |
388
+ | `Math.java` | 5.2% | — |
389
+ | `String.java` | 3.3% | — |
346
390
 
347
- **Observation**: Low Match and Jaccard scores confirm these are **independent implementations** of the same API specification. The shared fingerprints come from identical method signatures, not copied logic.
391
+ **Observation**: Low Match scores and no Jaccard cross-matches above 5% confirm these are **independent implementations** of the same API specification. The structural similarity comes from identical method signatures, not copied logic.
348
392
 
349
393
  ### 2. Eclipse Collections v. OpenJDK — Negative Control
350
394
 
@@ -352,10 +396,10 @@ diffinite example/Case-Oracle/AOSP_Google example/Case-Oracle/OpenJDK_Oracle \
352
396
 
353
397
  ```bash
354
398
  diffinite example/Case-NegativeControl/Eclipse_Collections example/Case-NegativeControl/OpenJDK \
355
- --no-comments --report-md example/benchmark/case_negative.md
399
+ --strip-comments --report-md example/benchmark/case_negative.md
356
400
  ```
357
401
 
358
- | File A | File B | Match | Jaccard |
402
+ | File A | File B | Match | Deep Cross-Match |
359
403
  |--------|--------|:-:|:-:|
360
404
  | `StringIterate.java` | `String.java` | 2.4% | — |
361
405
  | `FastList.java` | `ArrayList.java` | 1.5% | — |
@@ -368,19 +412,21 @@ diffinite example/Case-NegativeControl/Eclipse_Collections example/Case-Negative
368
412
 
369
413
  ```bash
370
414
  diffinite example/plagiarism/case-01/original example/plagiarism/case-01/plagiarized \
371
- --normalize --no-comments --report-md example/benchmark/plagiarism_case01.md
415
+ --normalize --strip-comments --report-md example/benchmark/plagiarism_case01.md
372
416
  ```
373
417
 
374
418
  | Original | Plagiarized | Jaccard |
375
419
  |----------|-------------|:-:|
420
+ | `T1.java` | `L2/04/hellow.java` | 100.0% |
376
421
  | `T1.java` | `L1/04/T1.java` | 100.0% |
377
- | `T1.java` | `L1/06/HelloWorld.java` | 100.0% |
378
- | `T1.java` | `L1/05/HelloWorld.java` | 92.3% |
379
- | `T1.java` | `L4/01/L4.java` | 57.9% |
380
- | `T1.java` | `L5/03/WelcomeToJava.java` | 39.1% |
381
- | `T1.java` | `L6/02/Main.java` | 31.0% |
422
+ | `T1.java` | `L1/05/HelloWorld.java` | 90.0% |
423
+ | `T1.java` | `L4/05/hellow.java` | 56.2% |
424
+ | `T1.java` | `L5/02/Main.java` | 38.1% |
425
+ | `T1.java` | `L6/07/PrintJava.java` | 34.8% |
426
+ | `T1.java` | `L6/01/L6.java` | 26.1% |
427
+ | `T1.java` | `L6/05/HelloWorld.java` | 17.9% |
382
428
 
383
- **Observation**: Jaccard decreases monotonically as the plagiarism level increases (L1→L6). Verbatim copies score 100%. Heavily restructured copies (L5, L6) still show 30–40% shared fingerprints.
429
+ **Observation**: Jaccard decreases monotonically as the plagiarism level increases (L1→L6). Verbatim copies score 100%. Heavily restructured copies (L5, L6) still show 18–38% shared fingerprints — well above the negative control baseline.
384
430
 
385
431
  ### 4. AOSP Framework — Same Codebase, Minor Edits
386
432
 
@@ -388,16 +434,16 @@ diffinite example/plagiarism/case-01/original example/plagiarism/case-01/plagiar
388
434
 
389
435
  ```bash
390
436
  diffinite example/aosp/left example/aosp/right \
391
- --no-comments --report-md example/benchmark/aosp.md
437
+ --strip-comments --report-md example/benchmark/aosp.md
392
438
  ```
393
439
 
394
- | File | Match (difflib) | Jaccard |
395
- |------|:-:|:-:|
396
- | `Handler.java` | 88.6% | — |
397
- | `Looper.java` | 90.0% | 77.1% |
398
- | `Message.java` | 96.3% | — |
440
+ | File | Match (difflib) |
441
+ |------|:-:|
442
+ | `Handler.java` | 88.6% |
443
+ | `Looper.java` | 90.0% |
444
+ | `Message.java` | 96.3% |
399
445
 
400
- **Observation**: High Match and Jaccard scores correctly reflect that these are minor revisions of the same codebase.
446
+ **Observation**: High Match scores correctly reflect that these are minor revisions of the same codebase.
401
447
 
402
448
  ---
403
449