diffinite 0.9.6__tar.gz → 0.11.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {diffinite-0.9.6/src/diffinite.egg-info → diffinite-0.11.1}/PKG-INFO +100 -54
- {diffinite-0.9.6 → diffinite-0.11.1}/README.md +98 -52
- {diffinite-0.9.6 → diffinite-0.11.1}/pyproject.toml +3 -2
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/cli.py +141 -40
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/collector.py +35 -11
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/deep_compare.py +7 -1
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/differ.py +3 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/models.py +6 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/pdf_gen.py +74 -32
- diffinite-0.11.1/src/diffinite/pipeline.py +1183 -0
- {diffinite-0.9.6 → diffinite-0.11.1/src/diffinite.egg-info}/PKG-INFO +100 -54
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite.egg-info/SOURCES.txt +1 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_cli.py +57 -2
- diffinite-0.11.1/tests/test_json_report_integration.py +46 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_pdf_gen.py +104 -11
- diffinite-0.11.1/tests/test_pipeline.py +343 -0
- diffinite-0.9.6/src/diffinite/pipeline.py +0 -777
- diffinite-0.9.6/tests/test_pipeline.py +0 -126
- {diffinite-0.9.6 → diffinite-0.11.1}/LICENSE +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/NOTICE +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/setup.cfg +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/__init__.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/__main__.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/evidence.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/fingerprint.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/__init__.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/_registry.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/_spec.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/c_family.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/csharp.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/data.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/go_rust_swift.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/java.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/javascript.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/markup.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/python.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/languages/scripting.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite/parser.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite.egg-info/dependency_links.txt +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite.egg-info/entry_points.txt +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite.egg-info/requires.txt +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/src/diffinite.egg-info/top_level.txt +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_collector.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_deep_compare.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_differ.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_differ_extended.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_evidence.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_evidence_hash.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_fingerprint.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_languages.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_normalize.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_parser.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_plagiarism_dataset.py +0 -0
- {diffinite-0.9.6 → diffinite-0.11.1}/tests/test_sqlite_integration.py +0 -0
@@ -1,13 +1,13 @@
 Metadata-Version: 2.4
 Name: diffinite
-Version: 0.
+Version: 0.11.1
 Summary: Forensic source-code comparison tool — Winnowing fingerprints and professional PDF reports for IP litigation & code audit
 Author: nash-dir
 License: Apache-2.0
 Project-URL: Homepage, https://github.com/nash-dir/diffinite
 Project-URL: Repository, https://github.com/nash-dir/diffinite
 Project-URL: Issues, https://github.com/nash-dir/diffinite/issues
-Keywords: diff,pdf,source-code,comparison,bates-number,forensics,code-audit,plagiarism-detection,winnowing,clone-detection
+Keywords: diff,pdf,source-code,comparison,bates-number,forensics,code-audit,plagiarism-detection,winnowing,clone-detection,side-by-side,similarity,jaccard,copyright,litigation,vscode
 Classifier: Development Status :: 4 - Beta
 Classifier: Intended Audience :: Legal Industry
 Classifier: Intended Audience :: Developers
@@ -37,15 +37,39 @@ Dynamic: license-file
 
 # Diffinite
 
-**
+**Forensic source-code comparison tool for IP litigation and code audit.**
 
-Diffinite compares two directories of source code and produces professional PDF/HTML reports with syntax-highlighted side-by-side diffs. It uses [Winnowing fingerprints](https://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf) (Schleimer et al., 2003 — the algorithm
+Diffinite compares two directories of source code and produces professional PDF/HTML reports with syntax-highlighted side-by-side diffs. It uses [Winnowing fingerprints](https://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf) (Schleimer et al., 2003 — the algorithm behind [Stanford MOSS](https://theory.stanford.edu/~aiken/moss/)) for N:M cross-matching to detect code reuse even across renamed, split, or merged files.
 
 > **Design Principle**: Diffinite reports **how similar** and **where similar**. It does not classify the type of copying — that is the expert witness's job.
 
 ---
 
-##
+## VS Code Extension
+
+The recommended way to use Diffinite is through the **VS Code extension**, which bundles an embedded Python runtime — no separate Python installation required.
+
+### Features
+- **Visual directory picker** — Select two directories and configure options via a GUI panel
+- **Real-time progress bar** — Live percentage tracking during analysis
+- **Pre-analysis time estimation** — Scans file sizes upfront and estimates Simple/Deep mode duration
+- **Dynamic CPU calibration** — Benchmarks Phase 1 performance to refine Phase 2 time predictions
+- **OOM defense** — Warns before analyzing file pairs exceeding 5MB
+- **Interactive tree viewer** — Review matched pairs and selectively export
+- **One-click PDF/HTML export** — With Bates numbering, page numbers, and filename annotations
+
+### Install from Source
+
+```bash
+cd vscode-extension
+npm install
+npm run compile
+# Press F5 in VS Code to launch Extension Development Host
+```
+
+---
+
+## CLI Installation
 
 ```bash
 pip install diffinite
@@ -73,7 +97,7 @@ diffinite original/ suspect/ -o report.pdf
 
 # With comment stripping and Bates numbering (forensic use)
 diffinite original/ suspect/ -o report.pdf \
---
+--strip-comments --bates-number --page-number --filename
 
 # HTML report (single self-contained file, opens in browser)
 diffinite original/ suspect/ --report-html report.html
@@ -89,7 +113,7 @@ Diffinite runs a two-stage pipeline:
 
 1. **Fuzzy name matching** — Pairs files across `dir_a` and `dir_b` using [RapidFuzz](https://github.com/rapidfuzz/RapidFuzz) string similarity (configurable threshold).
 2. **Comment stripping** — Optionally removes comments using a 5-state finite state machine parser supporting 30+ file extensions.
-3. **Side-by-side diff** — Computes line-by-line (or word-by-word) diffs using Python's `difflib.SequenceMatcher
+3. **Side-by-side diff** — Computes line-by-line (or word-by-word) diffs using Python's `difflib.SequenceMatcher` with `autojunk=True` for O(n) performance on large files.
 4. **Report generation** — Renders syntax-highlighted HTML diffs via Pygments, then converts to PDF with xhtml2pdf.
 
 ### Stage 2: N:M Cross-Matching (`deep` mode, default)
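Step 2 in the hunk above describes comment stripping as a 5-state finite state machine. The README does not name the states, so the five used below (code, line comment, block comment, string, character literal) are an assumption, as are the name `strip_c_comments` and the choice to keep newlines so that line numbers stay stable. A minimal sketch covering only C-style `//` and `/* */` comments:

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical state names; the README only says "5-state".
    CODE = auto()
    LINE_COMMENT = auto()
    BLOCK_COMMENT = auto()
    STRING = auto()
    CHAR = auto()


def strip_c_comments(source: str) -> str:
    """Remove // and /* */ comments while leaving string literals intact."""
    out = []
    state = State.CODE
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        nxt = source[i + 1] if i + 1 < n else ""
        if state is State.CODE:
            if ch == "/" and nxt == "/":
                state = State.LINE_COMMENT
                i += 2
                continue
            if ch == "/" and nxt == "*":
                state = State.BLOCK_COMMENT
                i += 2
                continue
            if ch == '"':
                state = State.STRING
            elif ch == "'":
                state = State.CHAR
            out.append(ch)
        elif state is State.LINE_COMMENT:
            if ch == "\n":  # the newline ends the comment and is kept
                state = State.CODE
                out.append(ch)
        elif state is State.BLOCK_COMMENT:
            if ch == "*" and nxt == "/":
                state = State.CODE
                i += 2
                continue
            if ch == "\n":  # keep newlines so line numbers do not shift
                out.append(ch)
        else:  # STRING or CHAR: copy verbatim, honouring backslash escapes
            out.append(ch)
            if ch == "\\" and nxt:
                out.append(nxt)
                i += 2
                continue
            if ch == ('"' if state is State.STRING else "'"):
                state = State.CODE
        i += 1
    return "".join(out)
```

Tracking string and character literals as separate states is what keeps a sequence such as `"// not a comment"` from being stripped by mistake.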
@@ -110,8 +134,9 @@ The cover page contains a summary table for each matched file pair:
 | Column | Description |
 |--------|-------------|
 | **File A / File B** | Matched file paths |
-| **
-| **
+| **Name Sim.** | Fuzzy filename similarity score (0–100) |
+| **Content Match** | `difflib.SequenceMatcher.ratio()` — proportion of matching content. `1.0` = identical. |
+| **Added / Deleted** | Number of lines (or words) added to or deleted from File A to produce File B. |
 
 ### Diff Pages
 
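The Content Match and Added / Deleted columns in the hunk above can be approximated with the standard library alone. The sketch below assumes line-level comparison and assumes that a replaced line counts once as deleted and once as added; `compare_lines` is a hypothetical helper, not part of Diffinite's API.

```python
from difflib import SequenceMatcher


def compare_lines(text_a: str, text_b: str, autojunk: bool = True):
    """Return (ratio, added, deleted) for a line-by-line comparison."""
    a = text_a.splitlines()
    b = text_b.splitlines()
    matcher = SequenceMatcher(None, a, b, autojunk=autojunk)
    added = deleted = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            deleted += i2 - i1  # lines of A that do not survive in B
        if tag in ("replace", "insert"):
            added += j2 - j1    # lines of B that have no counterpart in A
    return matcher.ratio(), added, deleted


ratio, added, deleted = compare_lines("x\ny\nz", "x\ny2\nz\nw")
print(f"match={ratio:.3f} added={added} deleted={deleted}")
```

Passing `autojunk=False` corresponds to what the `--no-autojunk` option is documented to do: the popularity heuristic is switched off and every element is treated equally.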
@@ -119,6 +144,7 @@ Each matched pair gets a side-by-side diff page with:
 
 - **Green highlight** — Lines present only in File B (additions)
 - **Red highlight** — Lines present only in File A (deletions)
+- **Yellow highlight** — Lines changed between A and B (word-level diff in `--by-word` mode)
 - **Purple highlight** — Lines moved from this position (`--detect-moved`)
 - **Blue highlight** — Lines moved to this position (`--detect-moved`)
 - **No highlight** — Identical lines (with configurable context folding)
@@ -131,9 +157,9 @@ When running in `deep` mode (default), the report includes an N:M cross-matching
 |--------|-------------|
 | **File A** | Source file from directory A |
 | **Matched Files (B)** | All files from directory B that share fingerprints above the Jaccard threshold |
-| **Jaccard** | `|A∩B| / |A∪B|` — the fraction of shared Winnowing fingerprints.
+| **Jaccard** | `|A∩B| / |A∪B|` — the fraction of shared Winnowing fingerprints. |
 
-Jaccard similarity is a well-defined set metric
+Jaccard similarity is a well-defined set metric. Its interpretation depends on the domain, code size, and language. Diffinite reports the raw value without attaching qualitative labels.
 
 ### Page Annotations
 
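The Jaccard column above is the ratio `|A∩B| / |A∪B|` over two fingerprint sets. A minimal sketch, assuming fingerprints are plain hashable values and that two empty sets score `0.0` (the README does not say how that edge case is handled); `jaccard` and `above_threshold` are illustrative names:

```python
def jaccard(a: set, b: set) -> float:
    """Fraction of fingerprints shared between two sets: |A∩B| / |A∪B|."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty files count as no match
    return len(a & b) / len(union)


def above_threshold(scores: dict, threshold: float = 0.05) -> dict:
    """Keep pairs at or above the cut-off (0.05 mirrors --threshold-deep)."""
    return {pair: s for pair, s in scores.items() if s >= threshold}


fp_a = {101, 202, 303}
fp_b = {202, 303, 404}
print(jaccard(fp_a, fp_b))  # 2 shared out of 4 distinct fingerprints
```

Because the metric is symmetric and bounded in `[0, 1]`, the raw value can be reported without any qualitative label, which matches the stance the README takes.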
@@ -142,7 +168,7 @@ Jaccard similarity is a well-defined set metric: `|A∩B| / |A∪B|`. Its interp
 | `--page-number` | `Page 3 / 47` | Bottom-right |
 | `--file-number` | `File 2 / 12` | Bottom-left |
 | `--bates-number` | `TEST-000003-CONF` | Bottom-center |
-| `--
+| `--filename` | `com/example/Foo.java` | Top-right |
 
 ---
 
@@ -169,18 +195,21 @@ dir_b Path to the comparison source directory (B)
 | `--report-pdf PATH` | Generate merged PDF report |
 | `--report-html PATH` | Generate standalone HTML report (single file, no external deps) |
 | `--report-md PATH` | Generate Markdown summary report |
+| `--report-json PATH` | Generate machine-readable JSON report (used by VS Code extension) |
 | `--no-merge` | Generate individual PDFs per file instead of one merged PDF |
+| `--preserve-tree` / `--no-preserve-tree` | Preserve directory tree structure in individual output (default: on) |
 
 ### Diff Options
 
 | Option | Default | Description |
 |--------|:-------:|-------------|
-| `--
+| `--strip-comments` | off | Strip comments before comparison (5-state FSM parser, 30+ extensions) |
 | `--by-word` | off | Compare by word instead of by line |
 | `--squash-blanks` | off | Collapse runs of 3+ blank lines. ⚠️ Changes line numbers — not recommended for forensic line-tracing. |
 | `--threshold N` | `60` | Fuzzy file-name matching threshold (0–100). Lower = more aggressive matching. |
 | `--collapse-identical` | off | Fold unchanged code blocks (3 context lines around each change) |
-| `--detect-moved` | off | Detect moved code blocks and highlight with distinct colors
+| `--detect-moved` | off | Detect moved code blocks and highlight with distinct colors |
+| `--encoding ENC` | `auto` | Force file encoding (e.g. `euc-kr`, `utf-8`). Default: auto-detect via charset-normalizer. |
 
 ### Deep Compare Options
 
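`--collapse-identical` is described above as folding unchanged blocks while keeping 3 context lines around each change. The standard library exposes this grouping through `SequenceMatcher.get_grouped_opcodes`; whether Diffinite calls it is not stated in the README, so the following is an illustration of the idea rather than the tool's implementation, and `changed_groups` is a hypothetical name.

```python
from difflib import SequenceMatcher


def changed_groups(a_lines, b_lines, context: int = 3):
    """Return opcode groups, each a change plus `context` surrounding lines."""
    matcher = SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    return list(matcher.get_grouped_opcodes(context))


original = [f"line {i}" for i in range(20)]
edited = list(original)
edited[10] = "line ten, edited"

for group in changed_groups(original, edited):
    first, last = group[0], group[-1]
    # Everything outside [first[1], last[2]) is identical and can be folded.
    print(f"show A lines {first[1]}..{last[2] - 1}, fold the rest")
```

Each group starts and ends with at most `context` unchanged lines, so a renderer only needs to emit a fold marker between consecutive groups.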
@@ -190,7 +219,7 @@ dir_b Path to the comparison source directory (B)
 | `--window N` | `4` | Winnowing window size. Guarantees detection of any shared sequence ≥ `K+W−1` = 8 tokens. |
 | `--threshold-deep F` | `0.05` | Minimum Jaccard similarity to include in results. Below 5% is considered noise. |
 | `--normalize` | off | Normalize identifiers → `ID`, literals → `LIT` before fingerprinting. Improves Type-2 clone detection (renamed variables). |
-| `--workers N` | `4` | Number of parallel worker processes for fingerprint extraction. |
+| `--workers N` | `4` | Number of parallel worker processes for diff rendering and fingerprint extraction. |
 
 ### Forensic Options
 
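The `--window` row above states the Winnowing guarantee: any shared run of at least `K+W−1` tokens produces at least one common fingerprint. The sketch below follows the published algorithm (hash every k-gram, slide a window of `w` hashes, keep the rightmost minimum). The default `k=5` is inferred from `K+W−1 = 8` with `W = 4`; the truncated SHA-1 hash and the function names are illustrative choices, not Diffinite's.

```python
import hashlib


def kgram_hashes(tokens, k):
    """Hash every run of k consecutive tokens."""
    hashes = []
    for i in range(len(tokens) - k + 1):
        gram = "\x1f".join(tokens[i:i + k])
        digest = hashlib.sha1(gram.encode("utf-8")).digest()
        hashes.append(int.from_bytes(digest[:8], "big"))
    return hashes


def winnow(tokens, k=5, w=4):
    """Return the set of (hash, position) fingerprints selected by winnowing."""
    hashes = kgram_hashes(tokens, k)
    if not hashes:
        return set()
    span = min(w, len(hashes))  # short inputs get a single, smaller window
    fingerprints = set()
    for start in range(len(hashes) - span + 1):
        window = hashes[start:start + span]
        smallest = min(window)
        # Ties are broken by taking the rightmost minimum in the window.
        offset = span - 1 - window[::-1].index(smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints


shared = "a b c d e f g h".split()  # 8 tokens = k + w - 1
doc_a = ["x1", "x2", "x3"] + shared + ["x4", "x5"]
doc_b = ["y1"] + shared + ["y2", "y3", "y4"]
common = {h for h, _ in winnow(doc_a)} & {h for h, _ in winnow(doc_b)}
print(len(common) >= 1)
```

An 8-token shared run yields four identical consecutive k-gram hashes in both inputs, so one full window is identical on both sides and its minimum is selected in each, which is where the `K+W−1` bound comes from.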
@@ -198,6 +227,9 @@ dir_b Path to the comparison source directory (B)
 |--------|:-------:|-------------|
 | `--no-autojunk` | off | Disable `SequenceMatcher`'s autojunk heuristic. Treats all tokens equally — slower but more precise for forensic analysis. |
 | `--max-index-entries N` | `10,000,000` | Memory cap for inverted index. Prevents OOM on large corpora. ~800MB at 10M entries. |
+| `--max-file-size-mb N` | `10.0` | Skip files exceeding this size (MB). Prevents OOM on large binary/generated files. |
+| `--hash` | off | Embed SHA-256 evidence integrity hashes for all analyzed files in the report. |
+| `--uncompared-mode {inline,separate,none}` | `inline` | Control how unmatched files are displayed: inline in main report, as separate appendix, or omitted. |
 
 ### Page Annotation Options
 
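The `--hash` and `--max-file-size-mb` rows added above describe two simple mechanisms: a SHA-256 digest per analyzed file and a size cut-off. A rough sketch of both, under the assumptions that 1 MB means 1024 × 1024 bytes and that the manifest is a mapping from relative path to hex digest; neither detail is given in the README, and `sha256_of_file` and `build_manifest` are hypothetical names.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large inputs never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, max_file_size_mb: float = 10.0) -> dict:
    """Map each file under root (within the size limit) to its SHA-256."""
    limit = max_file_size_mb * 1024 * 1024  # assumption: binary megabytes
    manifest = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.stat().st_size <= limit:
            manifest[path.relative_to(root).as_posix()] = sha256_of_file(path)
    return manifest
```

Calling `build_manifest("original/")` before and after an analysis run gives two mappings that can be compared to show the inputs were not altered in between.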
@@ -209,7 +241,7 @@ dir_b Path to the comparison source directory (B)
 | `--bates-prefix TEXT` | Bates number prefix (e.g. `PLAINTIFF-`). Combined as: `{prefix}{number}{suffix}` |
 | `--bates-suffix TEXT` | Bates number suffix (e.g. `-CONFIDENTIAL`) |
 | `--bates-start N` | Starting Bates number (default: `1`). Useful for continuing numbering across reports. |
-| `--
+| `--filename` | Show filename at the top-right |
 
 ---
 
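The Bates options above combine as `{prefix}{number}{suffix}`. The six-digit zero padding used below is inferred from the README's sample label `TEST-000003-CONF` and is an assumption, as is the helper name `bates_labels`.

```python
def bates_labels(page_count: int, prefix: str = "", suffix: str = "",
                 start: int = 1, width: int = 6) -> list:
    """Return one Bates label per page: prefix + zero-padded number + suffix."""
    return [
        f"{prefix}{start + offset:0{width}d}{suffix}"
        for offset in range(page_count)
    ]


# First report: three pages starting at 1.
first = bates_labels(3, prefix="TEST-", suffix="-CONF")
# Second report continues where the first one stopped (--bates-start 4).
second = bates_labels(2, prefix="TEST-", suffix="-CONF", start=4)
print(first + second)
```

Continuing the sequence across reports only requires passing the previous report's last number plus one as `start`, which is the use case the `--bates-start` description mentions.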
@@ -220,17 +252,17 @@ dir_b Path to the comparison source directory (B)
 ```bash
 # Full forensic report with all annotations
 diffinite plaintiff_code/ defendant_code/ -o exhibit_A.pdf \
---
+--strip-comments \
 --bates-number --bates-prefix "CASE2026-" --bates-suffix "-CONFIDENTIAL" \
---bates-start 1 --page-number --file-number --
---collapse-identical --detect-moved
+--bates-start 1 --page-number --file-number --filename \
+--collapse-identical --detect-moved --hash
 ```
 
 ### Code Audit (Quick HTML)
 
 ```bash
 # HTML report for browser viewing (no PDF dependency issues)
-diffinite vendor_v1/ vendor_v2/ --report-html audit.html --
+diffinite vendor_v1/ vendor_v2/ --report-html audit.html --strip-comments
 ```
 
 ### Maximum Sensitivity (Type-2 Clones)
@@ -238,7 +270,7 @@ diffinite vendor_v1/ vendor_v2/ --report-html audit.html --no-comments
 ```bash
 # Detect renamed-variable copies
 diffinite original/ suspect/ -o report.pdf \
---normalize --no-autojunk --
+--normalize --no-autojunk --strip-comments
 ```
 
 ### Simple Mode (Fast, No Cross-Matching)
@@ -251,11 +283,12 @@ diffinite dir_a/ dir_b/ --mode simple -o quick_report.pdf
 ### Multiple Output Formats
 
 ```bash
-# Generate all
+# Generate all formats at once
 diffinite dir_a/ dir_b/ \
 --report-pdf report.pdf \
 --report-html report.html \
---report-md report.md
+--report-md report.md \
+--report-json report.json
 ```
 
 ### Tuning Sensitivity
@@ -275,7 +308,7 @@ diffinite dir_a/ dir_b/ --threshold 80
 
 ## Comment Stripping Support
 
-The `--
+The `--strip-comments` flag removes comments using a 5-state finite state machine parser:
 
 | Extensions | Comment Styles |
 |------------|---------------|
@@ -298,19 +331,28 @@ The `--no-comments` flag removes comments using a 5-state finite state machine p
 diffinite/
 ├── src/diffinite/
 │ ├── cli.py # CLI entry point & argument parsing
-│ ├── pipeline.py # Orchestration (simple/deep modes)
+│ ├── pipeline.py # Orchestration (simple/deep modes, parallel rendering)
 │ ├── collector.py # File collection & fuzzy name matching
 │ ├── parser.py # 5-state comment stripping FSM
-│ ├── differ.py # Diff computation & HTML rendering
+│ ├── differ.py # Diff computation, moved-block detection & HTML rendering
 │ ├── fingerprint.py # Winnowing fingerprint extraction
-│ ├── deep_compare.py # N:M cross-matching (inverted index)
-│ ├── evidence.py #
-│ ├── models.py # Data classes
-│ ├── pdf_gen.py # PDF/HTML report generation
-│ └── languages/ # Per-language specs (30+ extensions)
-├──
+│ ├── deep_compare.py # N:M cross-matching (inverted index + Jaccard)
+│ ├── evidence.py # SHA-256 integrity hashing & manifest generation
+│ ├── models.py # Data classes (DiffResult, DeepMatchResult, etc.)
+│ ├── pdf_gen.py # PDF/HTML report generation (xhtml2pdf)
+│ └── languages/ # Per-language comment specs (30+ extensions)
+├── vscode-extension/
+│ ├── src/ # TypeScript extension source
+│ │ ├── extension.ts # Extension activation & command registration
+│ │ ├── compareCommand.ts # Directory selection, time estimation, pipeline orchestration
+│ │ ├── dirScanner.ts # Pre-analysis file scanning & OOM heuristic
+│ │ ├── runner.ts # Python backend spawner with progress bar integration
+│ │ ├── optionsPanel.ts # GUI options webview (mode, comments, Bates, etc.)
+│ │ ├── treeViewer.ts # Interactive matched-pair tree for selective export
+│ │ └── resultViewer.ts # HTML report preview inside VS Code
+│ ├── bin/python/ # Embedded Python 3.12 runtime (gitignored)
+│ └── package.json
 ├── example/ # Benchmark datasets (see below)
-├── AGENTS.md # AI agent development guidelines
 ├── pyproject.toml
 ├── LICENSE # Apache 2.0
 └── NOTICE
@@ -335,16 +377,18 @@ Pre-generated benchmark reports (Markdown) are in `example/benchmark/`.
 
 ```bash
 diffinite example/Case-Oracle/AOSP_Google example/Case-Oracle/OpenJDK_Oracle \
---
+--strip-comments --report-md example/benchmark/case_oracle.md
 ```
 
-| File | Match (difflib) |
+| File | Match (difflib) | Deep Cross-Match |
 |------|:-:|:-:|
-| `ArrayList.java` | 9.0% |
+| `ArrayList.java` | 9.0% | — |
 | `Collections.java` | 4.5% | — |
-| `
+| `List.java` | 6.3% | — |
+| `Math.java` | 5.2% | — |
+| `String.java` | 3.3% | — |
 
-**Observation**: Low Match and Jaccard
+**Observation**: Low Match scores and no Jaccard cross-matches above 5% confirm these are **independent implementations** of the same API specification. The structural similarity comes from identical method signatures, not copied logic.
 
 ### 2. Eclipse Collections v. OpenJDK — Negative Control
 
@@ -352,10 +396,10 @@ diffinite example/Case-Oracle/AOSP_Google example/Case-Oracle/OpenJDK_Oracle \
 
 ```bash
 diffinite example/Case-NegativeControl/Eclipse_Collections example/Case-NegativeControl/OpenJDK \
---
+--strip-comments --report-md example/benchmark/case_negative.md
 ```
 
-| File A | File B | Match |
+| File A | File B | Match | Deep Cross-Match |
 |--------|--------|:-:|:-:|
 | `StringIterate.java` | `String.java` | 2.4% | — |
 | `FastList.java` | `ArrayList.java` | 1.5% | — |
@@ -368,19 +412,21 @@ diffinite example/Case-NegativeControl/Eclipse_Collections example/Case-Negative
 
 ```bash
 diffinite example/plagiarism/case-01/original example/plagiarism/case-01/plagiarized \
---normalize --
+--normalize --strip-comments --report-md example/benchmark/plagiarism_case01.md
 ```
 
 | Original | Plagiarized | Jaccard |
 |----------|-------------|:-:|
+| `T1.java` | `L2/04/hellow.java` | 100.0% |
 | `T1.java` | `L1/04/T1.java` | 100.0% |
-| `T1.java` | `L1/
-| `T1.java` | `
-| `T1.java` | `
-| `T1.java` | `
-| `T1.java` | `L6/
+| `T1.java` | `L1/05/HelloWorld.java` | 90.0% |
+| `T1.java` | `L4/05/hellow.java` | 56.2% |
+| `T1.java` | `L5/02/Main.java` | 38.1% |
+| `T1.java` | `L6/07/PrintJava.java` | 34.8% |
+| `T1.java` | `L6/01/L6.java` | 26.1% |
+| `T1.java` | `L6/05/HelloWorld.java` | 17.9% |
 
-**Observation**: Jaccard decreases monotonically as the plagiarism level increases (L1→L6). Verbatim copies score 100%. Heavily restructured copies (L5, L6) still show
+**Observation**: Jaccard decreases monotonically as the plagiarism level increases (L1→L6). Verbatim copies score 100%. Heavily restructured copies (L5, L6) still show 18–38% shared fingerprints — well above the negative control baseline.
 
 ### 4. AOSP Framework — Same Codebase, Minor Edits
 
@@ -388,16 +434,16 @@ diffinite example/plagiarism/case-01/original example/plagiarism/case-01/plagiar
 
 ```bash
 diffinite example/aosp/left example/aosp/right \
---
+--strip-comments --report-md example/benchmark/aosp.md
 ```
 
-| File | Match (difflib) |
-
-| `Handler.java` | 88.6% |
-| `Looper.java` | 90.0% |
-| `Message.java` | 96.3% |
+| File | Match (difflib) |
+|------|:-:|
+| `Handler.java` | 88.6% |
+| `Looper.java` | 90.0% |
+| `Message.java` | 96.3% |
 
-**Observation**: High Match
+**Observation**: High Match scores correctly reflect that these are minor revisions of the same codebase.
 
 ---
 