diffinite 0.1.0__tar.gz → 0.4.0__tar.gz

Files changed (58)
  1. diffinite-0.4.0/NOTICE +107 -0
  2. diffinite-0.4.0/PKG-INFO +430 -0
  3. diffinite-0.4.0/README.md +393 -0
  4. {diffinite-0.1.0 → diffinite-0.4.0}/pyproject.toml +22 -8
  5. diffinite-0.4.0/src/diffinite/__init__.py +7 -0
  6. diffinite-0.4.0/src/diffinite/__main__.py +5 -0
  7. diffinite-0.4.0/src/diffinite/cli.py +320 -0
  8. diffinite-0.4.0/src/diffinite/collector.py +137 -0
  9. diffinite-0.4.0/src/diffinite/deep_compare.py +226 -0
  10. diffinite-0.4.0/src/diffinite/differ.py +293 -0
  11. diffinite-0.4.0/src/diffinite/evidence.py +268 -0
  12. diffinite-0.4.0/src/diffinite/fingerprint.py +239 -0
  13. diffinite-0.4.0/src/diffinite/languages/__init__.py +36 -0
  14. diffinite-0.4.0/src/diffinite/languages/_registry.py +51 -0
  15. diffinite-0.4.0/src/diffinite/languages/_spec.py +29 -0
  16. diffinite-0.4.0/src/diffinite/languages/c_family.py +49 -0
  17. diffinite-0.4.0/src/diffinite/languages/csharp.py +37 -0
  18. diffinite-0.4.0/src/diffinite/languages/data.py +36 -0
  19. diffinite-0.4.0/src/diffinite/languages/go_rust_swift.py +63 -0
  20. diffinite-0.4.0/src/diffinite/languages/java.py +55 -0
  21. diffinite-0.4.0/src/diffinite/languages/javascript.py +57 -0
  22. diffinite-0.4.0/src/diffinite/languages/markup.py +46 -0
  23. diffinite-0.4.0/src/diffinite/languages/python.py +35 -0
  24. diffinite-0.4.0/src/diffinite/languages/scripting.py +76 -0
  25. diffinite-0.4.0/src/diffinite/models.py +200 -0
  26. diffinite-0.4.0/src/diffinite/parser.py +400 -0
  27. diffinite-0.4.0/src/diffinite/pdf_gen.py +670 -0
  28. diffinite-0.4.0/src/diffinite/pipeline.py +728 -0
  29. diffinite-0.4.0/src/diffinite.egg-info/PKG-INFO +430 -0
  30. diffinite-0.4.0/src/diffinite.egg-info/SOURCES.txt +49 -0
  31. diffinite-0.4.0/src/diffinite.egg-info/entry_points.txt +2 -0
  32. {diffinite-0.1.0 → diffinite-0.4.0/src}/diffinite.egg-info/requires.txt +6 -0
  33. diffinite-0.4.0/tests/test_cli.py +109 -0
  34. diffinite-0.4.0/tests/test_collector.py +82 -0
  35. diffinite-0.4.0/tests/test_deep_compare.py +49 -0
  36. diffinite-0.4.0/tests/test_differ.py +56 -0
  37. diffinite-0.4.0/tests/test_differ_extended.py +101 -0
  38. diffinite-0.4.0/tests/test_evidence.py +26 -0
  39. diffinite-0.4.0/tests/test_evidence_hash.py +206 -0
  40. diffinite-0.4.0/tests/test_fingerprint.py +106 -0
  41. diffinite-0.4.0/tests/test_languages.py +112 -0
  42. diffinite-0.4.0/tests/test_normalize.py +198 -0
  43. diffinite-0.4.0/tests/test_parser.py +135 -0
  44. diffinite-0.4.0/tests/test_pdf_gen.py +157 -0
  45. diffinite-0.4.0/tests/test_pipeline.py +126 -0
  46. diffinite-0.4.0/tests/test_plagiarism_dataset.py +201 -0
  47. diffinite-0.4.0/tests/test_sqlite_integration.py +479 -0
  48. diffinite-0.1.0/NOTICE +0 -6
  49. diffinite-0.1.0/PKG-INFO +0 -143
  50. diffinite-0.1.0/README.md +0 -111
  51. diffinite-0.1.0/diffinite.egg-info/PKG-INFO +0 -143
  52. diffinite-0.1.0/diffinite.egg-info/SOURCES.txt +0 -11
  53. diffinite-0.1.0/diffinite.egg-info/entry_points.txt +0 -2
  54. diffinite-0.1.0/diffinite.py +0 -1162
  55. {diffinite-0.1.0 → diffinite-0.4.0}/LICENSE +0 -0
  56. {diffinite-0.1.0 → diffinite-0.4.0}/setup.cfg +0 -0
  57. {diffinite-0.1.0 → diffinite-0.4.0/src}/diffinite.egg-info/dependency_links.txt +0 -0
  58. {diffinite-0.1.0 → diffinite-0.4.0/src}/diffinite.egg-info/top_level.txt +0 -0
diffinite-0.4.0/NOTICE ADDED
@@ -0,0 +1,107 @@
1
+ Diffinite
2
+ Copyright 2026 nash-dir
3
+
4
+ This product includes code that was partially generated with the assistance
5
+ of LLM-based AI tools (Anthropic Claude).
6
+ The final implementation was reviewed, tested, and approved by the author.
7
+
8
+ =========================================================================
9
+ Third-Party Reference Code and Datasets
10
+ =========================================================================
11
+
12
+ The following third-party source code files and datasets are included
13
+ in this repository solely as REFERENCE DATA for algorithm validation
14
+ and forensic analysis benchmarking. They are NOT part of the Diffinite
15
+ software itself and retain their original licenses as noted below.
16
+
17
+ These files are located under example/ and TDD/corpus/ directories,
18
+ both of which are excluded from distribution via .gitignore.
19
+
20
+ -------------------------------------------------------------------------
21
+ 1. OpenJDK (Oracle)
22
+ -------------------------------------------------------------------------
23
+ Path: example/Case-Oracle/OpenJDK_Oracle/
24
+ example/Case-NegativeControl/OpenJDK/
25
+ TDD/corpus/openjdk_extra/
26
+ Source: https://github.com/openjdk/jdk (tag: jdk7-b147)
27
+ License: GNU General Public License, version 2,
28
+ with the Classpath Exception
29
+ Copyright (c) 1994, 2011, Oracle and/or its affiliates.
30
+ Files: ArrayList.java, Collections.java, String.java, List.java,
31
+ Math.java, HashMap.java, HashSet.java, Arrays.java
32
+
33
+ -------------------------------------------------------------------------
34
+ 2. Android Open Source Project (AOSP / Google)
35
+ -------------------------------------------------------------------------
36
+ Path: example/Case-Oracle/AOSP_Google/
37
+ Source: https://android.googlesource.com/platform/libcore/
38
+ (Froyo release)
39
+ License: Apache License, Version 2.0
40
+ Copyright (c) 2006, 2010, The Android Open Source Project.
41
+ Files: ArrayList.java, Collections.java, String.java, List.java,
42
+ Math.java
43
+
44
+ -------------------------------------------------------------------------
45
+ 3. Eclipse Collections
46
+ -------------------------------------------------------------------------
47
+ Path: example/Case-NegativeControl/Eclipse_Collections/
48
+ Source: https://github.com/eclipse/eclipse-collections
49
+ License: Eclipse Public License - v 1.0
50
+ Eclipse Distribution License - v 1.0
51
+ Copyright (c) 2004, 2024, Goldman Sachs, Eclipse Foundation,
52
+ and/or their affiliates.
53
+ Files: FastList.java, UnifiedSet.java, UnifiedMap.java,
54
+ Iterate.java, StringIterate.java
55
+
56
+ -------------------------------------------------------------------------
57
+ 4. Apache Commons Lang
58
+ -------------------------------------------------------------------------
59
+ Path: TDD/corpus/apache_commons_lang/
60
+ Source: https://github.com/apache/commons-lang
61
+ License: Apache License, Version 2.0
62
+ Copyright (c) 2001, 2024, The Apache Software Foundation.
63
+ Files: StringUtils.java, ArrayUtils.java, NumberUtils.java
64
+
65
+ -------------------------------------------------------------------------
66
+ 5. Apache Commons Collections
67
+ -------------------------------------------------------------------------
68
+ Path: TDD/corpus/apache_commons_collections/
69
+ Source: https://github.com/apache/commons-collections
70
+ License: Apache License, Version 2.0
71
+ Copyright (c) 2001, 2024, The Apache Software Foundation.
72
+ Files: CollectionUtils.java, ListUtils.java
73
+
74
+ -------------------------------------------------------------------------
75
+ 6. Google Guava
76
+ -------------------------------------------------------------------------
77
+ Path: TDD/corpus/guava/
78
+ Source: https://github.com/google/guava
79
+ License: Apache License, Version 2.0
80
+ Copyright (c) 2007, 2024, Google LLC.
81
+ Files: Strings.java, Lists.java, Maps.java
82
+
83
+ -------------------------------------------------------------------------
84
+ 7. IR-Plag-Dataset (Source Code Plagiarism Dataset)
85
+ -------------------------------------------------------------------------
86
+ Path: example/plagiarism/
87
+ Source: https://github.com/oscarkarnalim/sourcecodeplagiarismdataset
88
+ License: Apache License, Version 2.0
89
+ Citation: Karnalim, O. (2017). "Source Code Plagiarism Detection
90
+ in Academia with Information Retrieval: Dataset and the
91
+ Observation." Informatics in Education, 16(1), 83-102.
92
+ Files: 467 Java source code files across 7 programming tasks
93
+ (case-01 through case-07), each with original,
94
+ non-plagiarized, and plagiarized (L1-L6) submissions.
95
+
96
+ -------------------------------------------------------------------------
97
+ 8. SOCO 2014 (PAN@FIRE Source Code Re-use Detection)
98
+ -------------------------------------------------------------------------
99
+ Path: TDD/corpus/soco14/
100
+ Source: https://zenodo.org/records/7433031
101
+ License: Open Access (Creative Commons Attribution 4.0 International)
102
+ Citation: Flores, E., Barrón-Cedeño, A., Rosso, P., Moreno, L.
103
+ (2014). "DeSoCoRe: Detecting Source Code Re-Use across
104
+ Programming Languages." Proceedings of the 6th Forum for
105
+ Information Retrieval Evaluation (FIRE 2014).
106
+ Files: Training: 259 Java + 79 C files with expert annotations
107
+ Test: ~30,000 Java + C/C++ files with relevance judgements
diffinite-0.4.0/PKG-INFO ADDED
@@ -0,0 +1,430 @@
1
+ Metadata-Version: 2.4
2
+ Name: diffinite
3
+ Version: 0.4.0
4
+ Summary: Forensic source-code comparison tool — Winnowing fingerprints and professional PDF reports for IP litigation & code audit
5
+ Author: nash-dir
6
+ License: Apache-2.0
7
+ Project-URL: Homepage, https://github.com/nash-dir/diffinite
8
+ Project-URL: Repository, https://github.com/nash-dir/diffinite
9
+ Project-URL: Issues, https://github.com/nash-dir/diffinite/issues
10
+ Keywords: diff,pdf,source-code,comparison,bates-number,forensics,code-audit,plagiarism-detection,winnowing,clone-detection
11
+ Classifier: Development Status :: 4 - Beta
12
+ Classifier: Intended Audience :: Legal Industry
13
+ Classifier: Intended Audience :: Developers
14
+ Classifier: License :: OSI Approved :: Apache Software License
15
+ Classifier: Programming Language :: Python :: 3
16
+ Classifier: Programming Language :: Python :: 3.10
17
+ Classifier: Programming Language :: Python :: 3.11
18
+ Classifier: Programming Language :: Python :: 3.12
19
+ Classifier: Programming Language :: Python :: 3.13
20
+ Classifier: Topic :: Software Development :: Quality Assurance
21
+ Classifier: Topic :: Text Processing :: General
22
+ Classifier: Operating System :: OS Independent
23
+ Requires-Python: >=3.10
24
+ Description-Content-Type: text/markdown
25
+ License-File: LICENSE
26
+ License-File: NOTICE
27
+ Requires-Dist: rapidfuzz>=3.0.0
28
+ Requires-Dist: charset-normalizer>=3.0.0
29
+ Requires-Dist: xhtml2pdf>=0.2.11
30
+ Requires-Dist: pypdf>=4.0.0
31
+ Requires-Dist: pygments>=2.16.0
32
+ Requires-Dist: reportlab>=4.0.0
33
+ Provides-Extra: dev
34
+ Requires-Dist: pytest>=7.0.0; extra == "dev"
35
+ Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
36
+ Dynamic: license-file
37
+
38
+ # Diffinite
39
+
40
+ **Source-code comparison tool for code audit and similarity analysis.**
41
+
42
+ Diffinite compares two directories of source code and produces professional PDF/HTML reports with syntax-highlighted side-by-side diffs. It uses [Winnowing fingerprints](https://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf) (Schleimer et al., 2003 — the algorithm that also forms the basis of [Stanford MOSS](https://theory.stanford.edu/~aiken/moss/)) for N:M cross-matching to detect code reuse even across renamed, split, or merged files.
43
+
44
+ > **Design Principle**: Diffinite reports **how similar** and **where similar**. It does not classify the type of copying — that is the expert witness's job.
45
+
46
+ ---
47
+
48
+ ## Installation
49
+
50
+ ```bash
51
+ pip install diffinite
52
+ ```
53
+
54
+ Or from source:
55
+
56
+ ```bash
57
+ git clone https://github.com/nash-dir/diffinite.git
58
+ cd diffinite
59
+ pip install -e ".[dev]"
60
+ ```
61
+
62
+ **Requirements**: Python ≥ 3.10
63
+
64
+ **Dependencies**: [RapidFuzz](https://github.com/rapidfuzz/RapidFuzz), [Pygments](https://pygments.org/), [xhtml2pdf](https://github.com/xhtml2pdf/xhtml2pdf), [pypdf](https://github.com/py-pdf/pypdf), [reportlab](https://docs.reportlab.com/), [charset-normalizer](https://github.com/Ousret/charset_normalizer)
65
+
66
+ ---
67
+
68
+ ## Quick Start
69
+
70
+ ```bash
71
+ # Compare two directories → PDF report
72
+ diffinite original/ suspect/ -o report.pdf
73
+
74
+ # With comment stripping and Bates numbering (forensic use)
75
+ diffinite original/ suspect/ -o report.pdf \
76
+ --no-comments --bates-number --page-number --show-filename
77
+
78
+ # HTML report (single self-contained file, opens in browser)
79
+ diffinite original/ suspect/ --report-html report.html
80
+ ```
81
+
82
+ ---
83
+
84
+ ## How It Works
85
+
86
+ Diffinite runs a two-stage pipeline:
87
+
88
+ ### Stage 1: 1:1 File Matching (`simple` mode)
89
+
90
+ 1. **Fuzzy name matching** — Pairs files across `dir_a` and `dir_b` using [RapidFuzz](https://github.com/rapidfuzz/RapidFuzz) string similarity (configurable threshold).
91
+ 2. **Comment stripping** — Optionally removes comments using a 5-state finite state machine parser supporting 30+ file extensions.
92
+ 3. **Side-by-side diff** — Computes line-by-line (or word-by-word) diffs using Python's `difflib.SequenceMatcher`.
93
+ 4. **Report generation** — Renders syntax-highlighted HTML diffs via Pygments, then converts to PDF with xhtml2pdf.
94
+
95
+ ### Stage 2: N:M Cross-Matching (`deep` mode, default)
96
+
97
+ 5. **Winnowing fingerprint extraction** — Extracts position-independent code fingerprints using the Winnowing algorithm (K-gram → rolling hash → window selection).
98
+ 6. **Inverted index construction** — Builds a hash-to-file mapping for all B-directory fingerprints.
99
+ 7. **Jaccard similarity computation** — For each A-file, queries the index to find all B-files sharing fingerprints, then computes Jaccard similarity `|A∩B| / |A∪B|`.
100
+ 8. **Cross-match reporting** — Appends an N:M similarity matrix to the report, showing which files from A are similar to which files in B.
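Steps 6 to 8 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Diffinite's implementation: `jaccard` and `cross_match` are hypothetical names, and each file is assumed to have already been reduced to a set of integer fingerprints.

```python
from collections import defaultdict


def jaccard(a: set[int], b: set[int]) -> float:
    """Return |A ∩ B| / |A ∪ B|, or 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cross_match(
    fps_a: dict[str, set[int]],
    fps_b: dict[str, set[int]],
    threshold: float = 0.05,
) -> dict[str, list[tuple[str, float]]]:
    """Map each A-file to the B-files whose Jaccard score meets the threshold."""
    # Inverted index: fingerprint hash -> names of the B-files that contain it.
    index: dict[int, set[str]] = defaultdict(set)
    for name_b, hashes in fps_b.items():
        for h in hashes:
            index[h].add(name_b)

    results: dict[str, list[tuple[str, float]]] = {}
    for name_a, hashes_a in fps_a.items():
        # Only B-files sharing at least one fingerprint are scored at all.
        candidates: set[str] = set()
        for h in hashes_a:
            candidates |= index.get(h, set())
        scored = [(b, jaccard(hashes_a, fps_b[b])) for b in candidates]
        # Highest score first; ties broken by file name for stable output.
        results[name_a] = sorted(
            [(b, s) for b, s in scored if s >= threshold],
            key=lambda pair: (-pair[1], pair[0]),
        )
    return results
```

The index means an A-file is only ever scored against B-files that share a fingerprint with it, which is what keeps N:M matching tractable on large directories.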
101
+
102
+ ---
103
+
104
+ ## Output Report
105
+
106
+ ### Cover Page
107
+
108
+ The cover page contains a summary table for each matched file pair:
109
+
110
+ | Column | Description |
111
+ |--------|-------------|
112
+ | **File A / File B** | Matched file paths |
113
+ | **Match** | `difflib.SequenceMatcher.ratio()` — the proportion of matching characters between the two files. `1.0` = identical, `0.0` = completely different. |
114
+ | **Added / Deleted** | Number of lines added to or deleted from File A to produce File B. |
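The Match and Added / Deleted columns can be approximated with the standard library alone. The sketch below is an illustration, not Diffinite's code: `diff_stats` is a hypothetical helper that computes the ratio over characters, as the table describes, and derives the line counts from `get_opcodes()`.

```python
import difflib


def diff_stats(text_a: str, text_b: str, autojunk: bool = True) -> tuple[float, int, int]:
    """Return (match_ratio, lines_added, lines_deleted) for two texts."""
    # Character-level similarity: 2 * matches / (len(a) + len(b)).
    ratio = difflib.SequenceMatcher(None, text_a, text_b, autojunk=autojunk).ratio()

    # Line-level opcodes describe how to turn A into B.
    matcher = difflib.SequenceMatcher(
        None, text_a.splitlines(), text_b.splitlines(), autojunk=autojunk
    )
    added = deleted = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted += i2 - i1  # lines of A that do not survive into B
        if tag in ("insert", "replace"):
            added += j2 - j1    # lines of B with no counterpart in A
    return ratio, added, deleted
```

A `replace` opcode contributes to both counts, so a single edited line shows up as one deletion plus one addition.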
115
+
116
+ ### Diff Pages
117
+
118
+ Each matched pair gets a side-by-side diff page with:
119
+
120
+ - **Green highlight** — Lines present only in File B (additions)
121
+ - **Red highlight** — Lines present only in File A (deletions)
122
+ - **No highlight** — Identical lines (with configurable context folding)
123
+
124
+ ### Deep Compare Section
125
+
126
+ When running in `deep` mode (default), the report includes an N:M cross-matching table:
127
+
128
+ | Column | Description |
129
+ |--------|-------------|
130
+ | **File A** | Source file from directory A |
131
+ | **Matched Files (B)** | All files from directory B that share fingerprints above the Jaccard threshold |
132
+ | **Jaccard** | `\|A∩B\| / \|A∪B\|`: the number of Winnowing fingerprints the two files share, divided by the number of distinct fingerprints found in either file. A Jaccard of `0.73` means 73% of that combined fingerprint set appears in both files. |
133
+
134
+ Jaccard similarity is a well-defined set metric: `|A∩B| / |A∪B|`. Its interpretation depends on the domain, code size, and language. Diffinite reports the raw value without attaching qualitative labels.
135
+
136
+ ### Page Annotations
137
+
138
+ | Option | Annotation | Position |
139
+ |--------|-----------|----------|
140
+ | `--page-number` | `Page 3 / 47` | Bottom-right |
141
+ | `--file-number` | `File 2 / 12` | Bottom-left |
142
+ | `--bates-number` | `DIFF-000003` | Bottom-center |
143
+ | `--show-filename` | `com/example/Foo.java` | Top-right |
144
+
145
+ ---
146
+
147
+ ## CLI Reference
148
+
149
+ ### Positional Arguments
150
+
151
+ ```
152
+ dir_a Path to the original source directory (A)
153
+ dir_b Path to the comparison source directory (B)
154
+ ```
155
+
156
+ ### Execution Mode
157
+
158
+ | Option | Default | Description |
159
+ |--------|:-------:|-------------|
160
+ | `--mode {simple,deep}` | `deep` | `simple` = 1:1 file matching only. `deep` = 1:1 + N:M Winnowing cross-matching. |
161
+
162
+ ### Output Options
163
+
164
+ | Option | Description |
165
+ |--------|-------------|
166
+ | `-o`, `--output-pdf PATH` | Output PDF path (default: `report.pdf`). Ignored when `--report-*` is specified. |
167
+ | `--report-pdf PATH` | Generate merged PDF report |
168
+ | `--report-html PATH` | Generate standalone HTML report (single file, no external deps) |
169
+ | `--report-md PATH` | Generate Markdown summary report |
170
+ | `--no-merge` | Generate individual PDFs per file instead of one merged PDF |
171
+
172
+ ### Diff Options
173
+
174
+ | Option | Default | Description |
175
+ |--------|:-------:|-------------|
176
+ | `--no-comments` | off | Strip comments before comparison (5-state FSM parser, 30+ extensions) |
177
+ | `--by-word` | off | Compare by word instead of by line |
178
+ | `--squash-blanks` | off | Collapse runs of 3+ blank lines. ⚠️ Changes line numbers — not recommended for forensic line-tracing. |
179
+ | `--threshold N` | `60` | Fuzzy file-name matching threshold (0–100). Lower = more aggressive matching. |
180
+ | `--collapse-identical` | off | Fold unchanged code blocks (3 context lines around each change) |
181
+
182
+ ### Deep Compare Options
183
+
184
+ | Option | Default | Description |
185
+ |--------|:-------:|-------------|
186
+ | `--k-gram N` | `5` | K-gram size for Winnowing. Larger K = fewer but more specific fingerprints. (Schleimer 2003, §4.2) |
187
+ | `--window N` | `4` | Winnowing window size. With the defaults, any shared sequence of at least `K+W−1` = 8 tokens is guaranteed to be detected. |
188
+ | `--threshold-deep F` | `0.05` | Minimum Jaccard similarity to include in results. Below 5% is considered noise. |
189
+ | `--normalize` | off | Normalize identifiers → `ID`, literals → `LIT` before fingerprinting. Improves Type-2 clone detection (renamed variables). |
190
+ | `--workers N` | `4` | Number of parallel worker processes for fingerprint extraction. |
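To illustrate what `--normalize` is described as doing, here is a rough sketch of identifier and literal normalization. The regex, the keyword list, and the name `normalize_tokens` are assumptions made for this example and are not taken from Diffinite's source; in particular, keeping keywords unchanged is a guess.

```python
import re

# Hypothetical keyword list; a real tool would carry one per language.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "void", "class"}

TOKEN_RE = re.compile(
    r"""
    (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')   # quoted literals
  | (?P<number>\d+(?:\.\d+)?)                          # numeric literals
  | (?P<word>[A-Za-z_]\w*)                             # identifiers and keywords
  | (?P<op>[^\s\w])                                    # any other single symbol
    """,
    re.VERBOSE,
)


def normalize_tokens(source: str) -> list[str]:
    """Replace identifiers with ID and literals with LIT; keep everything else."""
    tokens = []
    for m in TOKEN_RE.finditer(source):
        kind, text = m.lastgroup, m.group()
        if kind in ("string", "number"):
            tokens.append("LIT")
        elif kind == "word" and text not in KEYWORDS:
            tokens.append("ID")
        else:
            tokens.append(text)
    return tokens
```

Under this scheme `int total = count + 42;` and `int sum = n + 7;` reduce to the same token stream, which is why renamed variables no longer change the fingerprints.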
191
+
192
+ ### Forensic Options
193
+
194
+ | Option | Default | Description |
195
+ |--------|:-------:|-------------|
196
+ | `--no-autojunk` | off | Disable `SequenceMatcher`'s autojunk heuristic. Treats all tokens equally — slower but more precise for forensic analysis. |
197
+ | `--max-index-entries N` | `10,000,000` | Memory cap for the inverted index. Prevents out-of-memory failures on large corpora (roughly 800 MB at 10M entries). |
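The effect of the autojunk heuristic is easy to reproduce with `difflib` directly. Per the Python documentation, when a sequence has at least 200 items, any item whose duplicates account for more than 1% of it is treated as junk. The two inputs below are contrived for illustration and are not from any Diffinite dataset.

```python
import difflib

# Two hypothetical 251-line files: a different first line, then 250 closing braces.
lines_a = ["int a;"] + ["}"] * 250
lines_b = ["int b;"] + ["}"] * 250

# Default behaviour: "}" is popular, so it cannot seed a matching block.
with_heuristic = difflib.SequenceMatcher(None, lines_a, lines_b).ratio()

# autojunk=False: every line is eligible, so the 250 shared lines are matched.
without_heuristic = difflib.SequenceMatcher(
    None, lines_a, lines_b, autojunk=False
).ratio()

print(f"autojunk on:  {with_heuristic:.3f}")
print(f"autojunk off: {without_heuristic:.3f}")
```

Source code is full of frequently repeated lines such as `}` or `return;`, so long files are where the two settings are most likely to disagree.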
198
+
199
+ ### Page Annotation Options
200
+
201
+ | Option | Description |
202
+ |--------|-------------|
203
+ | `--page-number` | Show `Page n / N` at the bottom-right |
204
+ | `--file-number` | Show `File n / N` at the bottom-left |
205
+ | `--bates-number` | Stamp sequential Bates numbers at the bottom-center |
206
+ | `--show-filename` | Show filename at the top-right |
207
+
208
+ ---
209
+
210
+ ## Usage Examples
211
+
212
+ ### Basic IP Litigation Report
213
+
214
+ ```bash
215
+ # Full forensic report with all annotations
216
+ diffinite plaintiff_code/ defendant_code/ -o exhibit_A.pdf \
217
+ --no-comments \
218
+ --bates-number --page-number --file-number --show-filename \
219
+ --collapse-identical
220
+ ```
221
+
222
+ ### Code Audit (Quick HTML)
223
+
224
+ ```bash
225
+ # HTML report for browser viewing (no PDF dependency issues)
226
+ diffinite vendor_v1/ vendor_v2/ --report-html audit.html --no-comments
227
+ ```
228
+
229
+ ### Maximum Sensitivity (Type-2 Clones)
230
+
231
+ ```bash
232
+ # Detect renamed-variable copies
233
+ diffinite original/ suspect/ -o report.pdf \
234
+ --normalize --no-autojunk --no-comments
235
+ ```
236
+
237
+ ### Simple Mode (Fast, No Cross-Matching)
238
+
239
+ ```bash
240
+ # 1:1 matching only — faster for quick comparisons
241
+ diffinite dir_a/ dir_b/ --mode simple -o quick_report.pdf
242
+ ```
243
+
244
+ ### Multiple Output Formats
245
+
246
+ ```bash
247
+ # Generate all three formats at once
248
+ diffinite dir_a/ dir_b/ \
249
+ --report-pdf report.pdf \
250
+ --report-html report.html \
251
+ --report-md report.md
252
+ ```
253
+
254
+ ### Tuning Sensitivity
255
+
256
+ ```bash
257
+ # Larger K-gram = fewer false positives, may miss short matches
258
+ diffinite dir_a/ dir_b/ --k-gram 7 --window 5
259
+
260
+ # Lower Jaccard threshold = show weaker matches
261
+ diffinite dir_a/ dir_b/ --threshold-deep 0.02
262
+
263
+ # Stricter file name matching
264
+ diffinite dir_a/ dir_b/ --threshold 80
265
+ ```
266
+
267
+ ---
268
+
269
+ ## Comment Stripping Support
270
+
271
+ The `--no-comments` flag removes comments using a 5-state finite state machine parser:
272
+
273
+ | Extensions | Comment Styles |
274
+ |------------|---------------|
275
+ | `.py` | `# line comments`, `"""docstrings"""` |
276
+ | `.js`, `.ts`, `.jsx`, `.tsx` | `// line`, `/* block */`, `` `template literals` `` |
277
+ | `.java`, `.c`, `.cpp`, `.h`, `.cs`, `.go`, `.rs`, `.kt`, `.scala` | `// line`, `/* block */` |
278
+ | `.html`, `.xml`, `.svg`, `.htm` | `<!-- block -->` |
279
+ | `.css`, `.scss`, `.less` | `/* block */` |
280
+ | `.sql` | `-- line`, `/* block */` |
281
+ | `.rb` | `# line` |
282
+ | `.sh`, `.bash`, `.zsh` | `# line` |
283
+ | `.lua` | `-- line`, `--[[ block ]]` |
284
+ | `.r` | `# line` |
285
+
286
+ ---
287
+
288
+ ## Project Structure
289
+
290
+ ```
291
+ diffinite/
292
+ ├── src/diffinite/
293
+ │ ├── cli.py # CLI entry point & argument parsing
294
+ │ ├── pipeline.py # Orchestration (simple/deep modes)
295
+ │ ├── collector.py # File collection & fuzzy name matching
296
+ │ ├── parser.py # 5-state comment stripping FSM
297
+ │ ├── differ.py # Diff computation & HTML rendering
298
+ │ ├── fingerprint.py # Winnowing fingerprint extraction
299
+ │ ├── deep_compare.py # N:M cross-matching (inverted index)
300
+ │ ├── evidence.py # Jaccard similarity metric
301
+ │ ├── models.py # Data classes
302
+ │ ├── pdf_gen.py # PDF/HTML report generation
303
+ │ └── languages/ # Per-language specs (30+ extensions)
304
+ ├── tests/
305
+ ├── example/ # Benchmark datasets (see below)
306
+ ├── AGENTS.md # AI agent development guidelines
307
+ ├── pyproject.toml
308
+ ├── LICENSE # Apache 2.0
309
+ └── NOTICE
310
+ ```
311
+
312
+ ---
313
+
314
+ ## Benchmarks
315
+
316
+ Download the example datasets first, then run the benchmarks yourself:
317
+
318
+ ```bash
319
+ python example/download_examples.py # download all datasets
320
+ python example/download_examples.py --dataset aosp # or download one
321
+ ```
322
+
323
+ Pre-generated benchmark reports (Markdown) are in `example/benchmark/`.
324
+
325
+ ### 1. Google v. Oracle — API Header Similarity
326
+
327
+ **Why this dataset**: The [Oracle v. Google](https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_Inc.) case is the landmark SSO (Structure, Sequence, Organization) copyright dispute. Google's Android reimplemented Java API declarations. The code *bodies* are independently written, but the API *signatures* are necessarily similar.
328
+
329
+ ```bash
330
+ diffinite example/Case-Oracle/AOSP_Google example/Case-Oracle/OpenJDK_Oracle \
331
+ --no-comments --report-md example/benchmark/case_oracle.md
332
+ ```
333
+
334
+ | File | Match (difflib) | Jaccard (Winnowing) |
335
+ |------|:-:|:-:|
336
+ | `ArrayList.java` | 9.0% | 7.3% |
337
+ | `Collections.java` | 4.5% | — |
338
+ | `String.java` | 3.3% | 7.3% |
339
+
340
+ **Observation**: Low Match and Jaccard scores are consistent with **independent implementations** of the same API specification. The shared fingerprints come from identical method signatures, not copied logic.
341
+
342
+ ### 2. Eclipse Collections v. OpenJDK — Negative Control
343
+
344
+ **Why this dataset**: Eclipse Collections and OpenJDK solve similar problems (collection frameworks) but are developed by different teams with no code sharing. This is the **expected baseline for independent work** in the same domain.
345
+
346
+ ```bash
347
+ diffinite example/Case-NegativeControl/Eclipse_Collections example/Case-NegativeControl/OpenJDK \
348
+ --no-comments --report-md example/benchmark/case_negative.md
349
+ ```
350
+
351
+ | File A | File B | Match | Jaccard |
352
+ |--------|--------|:-:|:-:|
353
+ | `StringIterate.java` | `String.java` | 2.4% | — |
354
+ | `FastList.java` | `ArrayList.java` | 1.5% | — |
355
+
356
+ **Observation**: No cross-matches above the 5% Jaccard threshold. This is the correct result — independent projects should show near-zero similarity.
357
+
358
+ ### 3. IR-Plag Case 01 — Known Plagiarism
359
+
360
+ **Why this dataset**: [IR-Plag](https://github.com/oscarkarnalim/sourcecodeplagiarismdataset) is a publicly available plagiarism corpus with labeled modification levels (L1=verbatim copy through L6=heavy restructuring).
361
+
362
+ ```bash
363
+ diffinite example/plagiarism/case-01/original example/plagiarism/case-01/plagiarized \
364
+ --normalize --no-comments --report-md example/benchmark/plagiarism_case01.md
365
+ ```
366
+
367
+ | Original | Plagiarized | Jaccard |
368
+ |----------|-------------|:-:|
369
+ | `T1.java` | `L1/04/T1.java` | 100.0% |
370
+ | `T1.java` | `L1/06/HelloWorld.java` | 100.0% |
371
+ | `T1.java` | `L1/05/HelloWorld.java` | 92.3% |
372
+ | `T1.java` | `L4/01/L4.java` | 57.9% |
373
+ | `T1.java` | `L5/03/WelcomeToJava.java` | 39.1% |
374
+ | `T1.java` | `L6/02/Main.java` | 31.0% |
375
+
376
+ **Observation**: Jaccard decreases monotonically as the plagiarism level increases (L1→L6). Verbatim copies score 100%. Heavily restructured copies (L5, L6) still show 30–40% shared fingerprints.
377
+
378
+ ### 4. AOSP Framework — Same Codebase, Minor Edits
379
+
380
+ **Why this dataset**: Two versions of Android's `Handler`/`Looper`/`Message` framework. Small evolutionary changes between versions.
381
+
382
+ ```bash
383
+ diffinite example/aosp/left example/aosp/right \
384
+ --no-comments --report-md example/benchmark/aosp.md
385
+ ```
386
+
387
+ | File | Match (difflib) | Jaccard |
388
+ |------|:-:|:-:|
389
+ | `Handler.java` | 88.6% | — |
390
+ | `Looper.java` | 90.0% | 77.1% |
391
+ | `Message.java` | 96.3% | — |
392
+
393
+ **Observation**: High Match and Jaccard scores correctly reflect that these are minor revisions of the same codebase.
394
+
395
+ ---
396
+
397
+ ## Winnowing Algorithm
398
+
399
+ Diffinite uses the **Winnowing** algorithm (Schleimer, Wilkerson, Aiken. *"Winnowing: Local Algorithms for Document Fingerprinting."* SIGMOD 2003), which also forms the basis of [Stanford MOSS](https://theory.stanford.edu/~aiken/moss/).
400
+
401
+ **Pipeline**: `source → tokenize → K-gram → rolling hash → winnow → fingerprint set`
402
+
403
+ The algorithm provides a **detection guarantee**: any shared token sequence of length ≥ `K + W − 1` (default: 8) produces at least one common fingerprint, regardless of its position in the file.
404
+
405
+ **Parameters**:
406
+
407
+ | Parameter | Default | Rationale |
408
+ |-----------|:-------:|-----------|
409
+ | `K` (k-gram) | `5` | Schleimer 2003 §4.2 recommended range. 5 consecutive tokens per fingerprint unit. |
410
+ | `W` (window) | `4` | Window of 4 consecutive k-gram hashes → shortest sequence with guaranteed detection = 8 tokens. Shorter matches may still be found, but are not guaranteed. |
411
+ | `HASH_BASE` | `257` | Common prime base for a Rabin-Karp polynomial rolling hash. |
412
+ | `HASH_MOD` | `2⁶¹ − 1` | Mersenne prime — efficient modular arithmetic, minimal collision probability. |
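The pipeline and parameters above can be sketched as follows. This is a minimal illustration, not Diffinite's `fingerprint.py`: the regex tokenizer, the use of `zlib.crc32` to turn tokens into integers, and the function names are all assumptions chosen to keep the example deterministic and self-contained.

```python
import re
import zlib

HASH_BASE = 257
HASH_MOD = (1 << 61) - 1  # Mersenne prime 2^61 - 1


def tokenize(source: str) -> list[str]:
    """Hypothetical tokenizer: words, numbers, and single punctuation marks."""
    return re.findall(r"\w+|[^\s\w]", source)


def kgram_hashes(tokens: list[str], k: int = 5) -> list[int]:
    """Polynomial rolling hash of every run of k consecutive tokens."""
    values = [zlib.crc32(t.encode("utf-8")) for t in tokens]
    if len(values) < k:
        return []
    top = pow(HASH_BASE, k - 1, HASH_MOD)  # weight of the outgoing token
    h = 0
    for v in values[:k]:
        h = (h * HASH_BASE + v) % HASH_MOD
    hashes = [h]
    for i in range(k, len(values)):
        # Drop the oldest token, shift the rest, then append the newest token.
        h = ((h - values[i - k] * top) * HASH_BASE + values[i]) % HASH_MOD
        hashes.append(h)
    return hashes


def winnow(hashes: list[int], window: int = 4) -> set[int]:
    """Keep the minimum hash of each sliding window as a fingerprint."""
    if not hashes:
        return set()
    if len(hashes) < window:
        return {min(hashes)}
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}


def fingerprint(source: str, k: int = 5, window: int = 4) -> set[int]:
    return winnow(kgram_hashes(tokenize(source), k), window)
```

Because 8 shared tokens yield 4 consecutive identical k-gram hashes, they fill one complete window in both files, and the minimum of that window is selected on both sides. That is where the `K + W − 1` bound comes from.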
413
+
414
+ ---
415
+
416
+ ## Limitations
417
+
418
+ - **General-purpose tokenizer**: Uses a single regex tokenizer for all languages, not language-specific parsers. Accuracy may vary across languages.
419
+ - **Position-independent**: Fingerprints are compared as a set, so where a fragment appears in a file does not affect the score. Files that contain the same functions in a different order can score as highly as files that keep the original order.
420
+ - **No corpus-based analysis**: Each comparison is pairwise. There is no built-in corpus-wide frequency weighting (e.g., TF-IDF) to down-weight common idioms.
421
+ - **Binary and obfuscated code**: Not supported. Diffinite operates on source code text only.
422
+ - **Not a legal opinion**: Similarity scores are mathematical measurements, not legal conclusions. Professional review is required before use in any legal proceeding.
423
+
424
+ ---
425
+
426
+ ## License
427
+
428
+ [Apache License 2.0](LICENSE)
429
+
430
+ See [NOTICE](NOTICE) for attribution.