docrev 0.9.6 → 0.9.11

Files changed (80)
  1. package/CHANGELOG.md +41 -0
  2. package/dev_notes/bug_repro_comment_parser.md +71 -0
  3. package/dev_notes/stress2/adversarial.docx +0 -0
  4. package/dev_notes/stress2/build_adversarial.ts +186 -0
  5. package/dev_notes/stress2/drift_matcher.ts +62 -0
  6. package/dev_notes/stress2/probe_anchors.ts +35 -0
  7. package/dev_notes/stress2/project/adversarial.docx +0 -0
  8. package/dev_notes/stress2/project/discussion.before.md +3 -0
  9. package/dev_notes/stress2/project/discussion.md +3 -0
  10. package/dev_notes/stress2/project/methods.before.md +20 -0
  11. package/dev_notes/stress2/project/methods.md +20 -0
  12. package/dev_notes/stress2/project/rev.yaml +5 -0
  13. package/dev_notes/stress2/project/sections.yaml +4 -0
  14. package/dev_notes/stress2/sections.yaml +5 -0
  15. package/dev_notes/stress2/trace_placement.ts +50 -0
  16. package/dev_notes/stresstest_boundaries.ts +27 -0
  17. package/dev_notes/stresstest_drift_apply.ts +43 -0
  18. package/dev_notes/stresstest_drift_compare.ts +43 -0
  19. package/dev_notes/stresstest_drift_v2.ts +54 -0
  20. package/dev_notes/stresstest_inspect.ts +54 -0
  21. package/dev_notes/stresstest_pstyle.ts +55 -0
  22. package/dev_notes/stresstest_section_debug.ts +23 -0
  23. package/dev_notes/stresstest_split.ts +70 -0
  24. package/dev_notes/stresstest_trace.ts +19 -0
  25. package/dev_notes/stresstest_verify_no_overwrite.ts +40 -0
  26. package/dist/lib/anchor-match.d.ts +51 -0
  27. package/dist/lib/anchor-match.d.ts.map +1 -0
  28. package/dist/lib/anchor-match.js +227 -0
  29. package/dist/lib/anchor-match.js.map +1 -0
  30. package/dist/lib/annotations.d.ts.map +1 -1
  31. package/dist/lib/annotations.js +24 -11
  32. package/dist/lib/annotations.js.map +1 -1
  33. package/dist/lib/commands/index.d.ts +2 -1
  34. package/dist/lib/commands/index.d.ts.map +1 -1
  35. package/dist/lib/commands/index.js +3 -1
  36. package/dist/lib/commands/index.js.map +1 -1
  37. package/dist/lib/commands/quality.js +1 -1
  38. package/dist/lib/commands/quality.js.map +1 -1
  39. package/dist/lib/commands/section-boundaries.d.ts +22 -0
  40. package/dist/lib/commands/section-boundaries.d.ts.map +1 -0
  41. package/dist/lib/commands/section-boundaries.js +63 -0
  42. package/dist/lib/commands/section-boundaries.js.map +1 -0
  43. package/dist/lib/commands/sync.d.ts.map +1 -1
  44. package/dist/lib/commands/sync.js +141 -0
  45. package/dist/lib/commands/sync.js.map +1 -1
  46. package/dist/lib/commands/verify-anchors.d.ts +17 -0
  47. package/dist/lib/commands/verify-anchors.d.ts.map +1 -0
  48. package/dist/lib/commands/verify-anchors.js +226 -0
  49. package/dist/lib/commands/verify-anchors.js.map +1 -0
  50. package/dist/lib/comment-realign.js +2 -2
  51. package/dist/lib/comment-realign.js.map +1 -1
  52. package/dist/lib/import.d.ts +26 -8
  53. package/dist/lib/import.d.ts.map +1 -1
  54. package/dist/lib/import.js +166 -187
  55. package/dist/lib/import.js.map +1 -1
  56. package/dist/lib/response.js +1 -1
  57. package/dist/lib/response.js.map +1 -1
  58. package/dist/lib/word-extraction.d.ts +23 -0
  59. package/dist/lib/word-extraction.d.ts.map +1 -1
  60. package/dist/lib/word-extraction.js +79 -0
  61. package/dist/lib/word-extraction.js.map +1 -1
  62. package/dist/lib/wordcomments.d.ts.map +1 -1
  63. package/dist/lib/wordcomments.js +165 -73
  64. package/dist/lib/wordcomments.js.map +1 -1
  65. package/lib/anchor-match.ts +276 -0
  66. package/lib/annotations.ts +25 -11
  67. package/lib/commands/index.ts +3 -0
  68. package/lib/commands/quality.ts +1 -1
  69. package/lib/commands/section-boundaries.ts +82 -0
  70. package/lib/commands/sync.ts +170 -0
  71. package/lib/commands/verify-anchors.ts +272 -0
  72. package/lib/comment-realign.ts +2 -2
  73. package/lib/import.ts +197 -209
  74. package/lib/response.ts +1 -1
  75. package/lib/word-extraction.ts +93 -0
  76. package/lib/wordcomments.ts +180 -82
  77. package/package.json +1 -1
  78. package/skill/REFERENCE.md +29 -2
  79. package/skill/SKILL.md +12 -2
  80. package/dist/package.json +0 -137
package/CHANGELOG.md CHANGED
@@ -5,6 +5,47 @@ All notable changes to docrev will be documented in this file.
5
5
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
6
6
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
7
 
8
+ ## [0.9.11] - 2026-04-30
9
+
10
+ ### Fixed
11
+ - **Single-section comment placement.** `computeSectionBoundaries` left the last section's `end` at `Number.MAX_SAFE_INTEGER`, which collapsed the proportional-position math in `insertCommentsIntoMarkdown` to ~0. Every comment whose anchor wasn't in the first 200 chars of the markdown stacked at position 0. Now caps the last boundary's `end` at `fullDocText.length`, passed in from sync and verify-anchors.
12
+ - **Re-sync duplicated comments.** `sync --comments-only` re-inserted every comment on each invocation, producing `{>>R1<<}{>>R1<<}{>>R1<<}…` over time. `insertCommentsIntoMarkdown` now scans ±200 chars around the target for an identical `{>>author: text<<}` block and skips insertion when found.
13
+ - **Threading content destruction.** `prepareMarkdownWithMarkers`'s whitespace-consumption loops captured `charBefore` once outside the loop, so a single leading space caused `removeStart` to walk to position 0 and `slice()` to delete every preceding paragraph. Replaced with a one-char check.
14
+ - **Multi-run anchor injection.** Pandoc splits a single anchor across multiple `<w:r>` blocks whenever it applies styling mid-anchor — smart-quote substitution, `*italic*`, `` `code` ``, `**bold**` all trigger this. The single-run scan in `injectCommentsAtMarkers` grabbed the start marker's `<w:t>`, looked for the end marker inside it, found nothing, and silently skipped the comment. New multi-run path splits the start run at the start marker, keeps middle runs verbatim, splits the end run at the end marker, and rebuilds with `commentRangeStart`/`commentRangeEnd` around the styled anchor portions.
15
+ - **Nested-bracket anchors.** `prepareMarkdownWithMarkers` used `\[([^\]]+)\]\{\.mark\}` for the trailing anchor group, so any inner `]` (e.g. `[[0..9]]{.mark}`, `[*italic*]{.mark}`) terminated the match prematurely. Replaced with a manual balanced-bracket walker that handles arbitrary nesting depth and verifies a `{.mark}` suffix.
16
+ - **Orphan-`[` over-stripping.** `stripAnnotations`'s orphan cleanup used `\[(?![^\[\]]*\])`, treating any inner `[` as a barrier and stripping the outer `[` of nested forms. Loosened to `\[(?![^\]\n]*\])`: an `[` is orphan only when no `]` follows before end of line.
17
+
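The nested-bracket fix in the list above replaces a regex with a balanced-bracket walk. The following is a minimal sketch of that idea, not docrev's actual code: `findMarkedSpan` and `MarkedSpan` are hypothetical names introduced here for illustration.

```typescript
// Hypothetical helper (not docrev's API): locate the first `[anchor]{.mark}`
// span at or after `from`, allowing nested brackets inside the anchor.
interface MarkedSpan {
  start: number;  // index of the outer opening `[`
  end: number;    // index just past the `{.mark}` suffix
  anchor: string; // text between the outer brackets
}

function findMarkedSpan(text: string, from = 0): MarkedSpan | null {
  const suffix = '{.mark}';
  for (let open = text.indexOf('[', from); open !== -1; open = text.indexOf('[', open + 1)) {
    let depth = 0;
    for (let i = open; i < text.length; i++) {
      const ch = text[i];
      if (ch === '[') {
        depth++;
      } else if (ch === ']') {
        depth--;
        if (depth === 0) {
          // Brackets are balanced here; accept only if `{.mark}` follows.
          if (text.startsWith(suffix, i + 1)) {
            return { start: open, end: i + 1 + suffix.length, anchor: text.slice(open + 1, i) };
          }
          break; // balanced but unmarked: try the next `[`
        }
      }
    }
  }
  return null;
}
```

Under these assumptions, `findMarkedSpan('see [[0..9]]{.mark} here')` yields the anchor `[0..9]`, where a `[^\]]+` group cannot reach the outer closing bracket.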
18
+ ### Changed
19
+ - `sync --comments-only` summary distinguishes `placed` / `already present` / `unmatched` instead of subtracting before/after counts. Re-syncs now report "6 already present (skipped to avoid duplication)" instead of misreporting them as fully placed or fully unmatched. New `outStats` channel from `insertCommentsIntoMarkdown`.
20
+
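The duplicate-skipping behaviour and the three-way summary described above can be sketched as follows. This is an illustration under stated assumptions, not the real `insertCommentsIntoMarkdown`: `insertOnce` and `InsertStats` are hypothetical names, and widening the scan window by the block length is a choice made here so that blocks longer than the window are still detected.

```typescript
// Hypothetical sketch of idempotent comment insertion with a stats channel.
interface InsertStats {
  placed: number;
  alreadyPresent: number;
  unmatched: number;
}

const DEDUPE_WINDOW = 200; // chars scanned on each side of the target, per the changelog

function insertOnce(
  markdown: string,
  position: number | null, // null: no placement was found for this comment
  author: string,
  body: string,
  stats: InsertStats,
): string {
  if (position === null) {
    stats.unmatched++;
    return markdown;
  }
  const block = `{>>${author}: ${body}<<}`;
  const from = Math.max(0, position - DEDUPE_WINDOW);
  // Assumption: extend the right edge by block.length so a long block that
  // starts inside the window is still seen in full.
  const to = Math.min(markdown.length, position + DEDUPE_WINDOW + block.length);
  if (markdown.slice(from, to).includes(block)) {
    stats.alreadyPresent++;
    return markdown; // identical block already nearby: leave the prose untouched
  }
  stats.placed++;
  return markdown.slice(0, position) + block + markdown.slice(position);
}
```

Calling `insertOnce` twice with the same arguments on its own output returns the string unchanged the second time and increments `alreadyPresent` instead of `placed`.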
21
+ ## [0.9.10] - 2026-04-30
22
+
23
+ ### Fixed
24
+ - `stripAnnotations` stripped `[anchor]{.mark}` spans even when `keepComments=true`, leaving the dual-build marker generator with no anchor text and collapsing every multi-word anchor to a single fallback word in the rebuilt docx. Now preserves anchor spans that belong to retained `{>>...<<}` comments.
25
+ - Comments authored at the very start of a Word section landed before the markdown file's `# Heading` line and rendered in the previous section. Added `pushPastSectionHeading` so position-0 comments advance to the first body paragraph of the section they were authored in.
26
+ - Empty-anchor comments fell through to proportional placement even when before/after context uniquely identified the position, landing mid-word or splitting unrelated phrases. Context match now runs first; proportional placement is the fallback.
27
+ - When an anchor appeared multiple times in the search window (repeated phrasing, formulaic prose), `insertCommentsIntoMarkdown` always picked the first occurrence. Now picks the occurrence closest to the docx-derived insert position.
28
+
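The last fix in the list above (pick the occurrence nearest the docx-derived position rather than the first one) can be sketched in a few lines. `closestOccurrence` is a hypothetical helper for illustration; how docrev derives the expected position is not shown here.

```typescript
// Hypothetical helper: among all occurrences of `anchor` in `text`, return the
// start index closest to `expected`, or -1 when the anchor does not occur.
// Ties resolve to the earlier occurrence.
function closestOccurrence(text: string, anchor: string, expected: number): number {
  if (anchor.length === 0) return -1; // empty anchors need context matching instead
  let best = -1;
  for (let i = text.indexOf(anchor); i !== -1; i = text.indexOf(anchor, i + 1)) {
    if (best === -1 || Math.abs(i - expected) < Math.abs(best - expected)) {
      best = i;
    }
  }
  return best;
}

const prose = 'The model is fitted, the model is checked, the model is reported.';
const firstModel = prose.indexOf('model');
const secondModel = prose.indexOf('model', firstModel + 1);
// An expected position near the second "model" selects it, not the first.
console.log(closestOccurrence(prose, 'model', secondModel + 2) === secondModel);
```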
29
+ ## [0.9.7] - 2026-04-29
30
+
31
+ ### Added
32
+ - `rev sync --comments-only` — import only Word comments at fuzzy-matched anchors, leaving prose byte-identical. Use when the markdown was revised between sending the docx out for review and receiving it back; applying track changes from a stale draft would clobber newer edits.
33
+ - `rev verify-anchors <docx>` — drift report classifying every comment as `clean` / `drift` / `context-only` / `ambiguous` / `unmatched` against the current section markdown. Pair with `--comments-only` to plan placement before sync. Supports `--json` for scripting.
34
+ - `extractHeadings()` in `word-extraction.ts` — read heading paragraphs directly from `<w:pStyle>` styles, returning text + level + position in the same coordinate system as comment anchors.
35
+ - Shared `lib/commands/section-boundaries.ts` — single source of truth that maps `sections.yaml` to docx text positions, used by both sync and verify-anchors.
36
+ - Shared `lib/anchor-match.ts` — pure anchor-matching primitives (`findAnchorInText`, `stripCriticMarkup`, `classifyStrategy`) so sync (insertion) and verify-anchors (drift reporting) use the same fallback strategies.
37
+ - New tests: `test/anchor-match.test.js` (11 cases covering each fallback strategy and the quality classifier).
38
+
39
+ ### Fixed
40
+ - **Section detection mistook prose for headings.** The old keyword finder scanned the concatenated body text and would match "results across countries" as the Results heading or skip the real Methods heading because the structured-abstract label `Methods:` lost its colon during text-run concatenation. Replaced with paragraph-style-based heading extraction, so boundaries now reflect actual heading paragraphs. Affects the new commands; the existing sync flow already used pandoc-derived headings and was unaffected.
41
+ - `stripCriticMarkup` regex used `[^<]*` and silently failed on comments containing `<` characters (e.g. `pre-industrial trade (<1825)`). Switched to non-greedy `[\s\S]*?`.
42
+ - `insertCommentsIntoMarkdown` always prepended a leading space when there was no anchor, accumulating multiple spaces when several comments shared a position 0 anchor. Removed the heuristic; comments insert at exact position so prose stays byte-identical except for the inserted blocks.
43
+ - `verify-anchors` crashed with a stack trace when given a non-docx file (e.g. an `.md` path). Now reports a friendly error.
44
+
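The `stripCriticMarkup` fix above comes down to how the two character classes treat a lone `<`. The patterns below are illustrative reconstructions based on the changelog wording; docrev's actual regexes may differ in detail, and `stripComments` is a hypothetical wrapper.

```typescript
// Illustrative patterns only (assumed shape, based on the changelog entry).
const oldPattern = /\{>>[^<]*<<\}/g;    // cannot cross a single '<' inside the comment
const newPattern = /\{>>[\s\S]*?<<\}/g; // non-greedy: runs to the nearest '<<}'

function stripComments(text: string, pattern: RegExp): string {
  return text.replace(pattern, '');
}

const sample = 'Trade volumes{>>R1: pre-industrial trade (<1825)<<} were modest.';

// The old pattern fails to match, so the comment is left in place.
console.log(stripComments(sample, oldPattern) === sample);
// The new pattern removes the whole block.
console.log(stripComments(sample, newPattern) === 'Trade volumes were modest.');
```

The non-greedy quantifier also keeps two comments on one line separate, since each match stops at the first `<<}` it reaches.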
45
+ ### Changed
46
+ - New flag is `--comments-only` (positive form). The originally proposed `--no-overwrite` was dropped because Commander assigns `--no-X` to `options.x === false` rather than `options.noX === true`, which made the flag silently ignored.
47
+ - `insertCommentsIntoMarkdown` now accepts `wrapAnchor?: boolean` (default `true`). When `false`, comment blocks are inserted next to the anchor without `[anchor]{.mark}` wrapping. `--comments-only` uses this so multiple comments sharing an anchor (e.g. 6 reviewers commenting on the same word) no longer produce nested broken CriticMarkup.
48
+
8
49
  ## [0.7.1] - 2025-01-02
9
50
 
10
51
  ### Added
package/dev_notes/bug_repro_comment_parser.md ADDED
@@ -0,0 +1,71 @@
1
+ # Bug repro — comment parser drops valid comments
2
+
3
+ ## Symptom
4
+
5
+ A markdown file with 24 valid `{>>Author: text<<}` comments is shown by `rev comments file.md` as containing only 21. Three comments are silently dropped.
6
+
7
+ ## Reproducer
8
+
9
+ `test.md`:
10
+ ```markdown
11
+ # Test
12
+
13
+ **Para 1.** Sentence one{>>Jonathan Lenoir: short comment<<}.
14
+
15
+ **Para 2.** Sentence two{>>Jens-Christian Svenning: I'm quite skeptical about these listings of all the bad things alien species do, with any kind of proper benchmarking - you could arguably make similar lists of all the annoying or dangerous things native species do<<}.
16
+
17
+ **Para 3.** Sentence three{>>Jens-Christian Svenning: see https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/1365-2745.14307<<}.
18
+ ```
19
+
20
+ ```bash
21
+ rev comments test.md
22
+ ```
23
+
24
+ **Expected:** 3 comments listed.
25
+ **Actual:** 1 comment listed (Lenoir's short comment). Both Svenning comments are dropped.
26
+
27
+ ## Root cause — `lib/annotations.ts`
28
+
29
+ ### Bug A: hyphenated reviewer names (line 102)
30
+
31
+ ```ts
32
+ const hasAuthorPrefix = /^[A-Za-z][A-Za-z\s]{0,20}:\s/.test(commentContent.trim());
33
+ ```
34
+
35
+ The character class `[A-Za-z\s]` forbids hyphens, apostrophes, and accented characters. So:
36
+
37
+ - `Jens-Christian Svenning:` — fails (hyphen at position 4)
38
+ - `Camilla T Colding-Jørgensen:` — fails (hyphen + diacritic)
39
+ - `Renata Ćušterevska:` — fails (diacritic at position 7)
40
+ - Any name with `-`, `'`, or non-ASCII letters fails
41
+
42
+ When `hasAuthorPrefix === false` and the content is longer than `MAX_COMMENT_CONTENT_LENGTH` (200), line 104 discards the comment. Any long comment from a reviewer with a hyphenated, apostrophized, or accented name is silently dropped.
43
+
44
+ ### Bug B: any comment shorter than 150 chars that contains a URL (line 95)
45
+
46
+ ```ts
47
+ if (/https?:\/\/|www\./i.test(commentContent) && commentContent.length < 150) return true;
48
+ ```
49
+
50
+ This filter assumes URLs in CriticMarkup blocks are markdown links wrapped in CriticMarkup syntax (a real false positive). But reviewers legitimately cite URLs/DOIs inside comments, and those comments get dropped too.
51
+
52
+ `{>>Lenoir: see https://www.pnas.org/doi/10.1073/pnas.1608980113<<}` — 60 chars, contains URL → dropped, even though `Lenoir:` is a valid author prefix.
53
+
54
+ ## Suggested fix
55
+
56
+ 1. **Line 102 — broaden author-prefix regex:**
57
+ ```ts
58
+ const hasAuthorPrefix = /^[\p{L}][\p{L}\s\-'.]{0,30}:\s/u.test(commentContent.trim());
59
+ ```
60
+ Allows hyphens, apostrophes, periods, and Unicode letters. Length bumped to 30 (Camilla T Colding-Jørgensen is 27 chars).
61
+
62
+ 2. **Line 95 — only filter URLs when there's no author prefix:**
63
+ ```ts
64
+ const looksLikeAuthor = /^[\p{L}][\p{L}\s\-'.]{0,30}:\s/u.test(commentContent.trim());
65
+ if (!looksLikeAuthor && /https?:\/\/|www\./i.test(commentContent) && commentContent.length < 150) return true;
66
+ ```
67
+ A comment with a real `Author:` prefix should not be dropped just because it cites a URL.
68
+
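The difference between the two author-prefix patterns quoted in this note can be checked directly. This is a small sketch, not the code in `lib/annotations.ts`: `hasAuthorPrefix` is a hypothetical wrapper, and the NFC normalization step is an added assumption, because `\p{L}` does not match combining marks and a decomposed diacritic would otherwise still fail.

```typescript
// The two patterns quoted above: current (line 102) and suggested.
const oldPrefix = /^[A-Za-z][A-Za-z\s]{0,20}:\s/;
const newPrefix = /^[\p{L}][\p{L}\s\-'.]{0,30}:\s/u;

// Hypothetical wrapper for illustration. Normalizing to NFC is an assumption
// added here so that precomposed and decomposed names behave the same way.
function hasAuthorPrefix(content: string, pattern: RegExp): boolean {
  return pattern.test(content.trim().normalize('NFC'));
}

const reviewers = [
  'Jonathan Lenoir',
  'Jens-Christian Svenning',
  'Camilla T Colding-Jørgensen',
  'Renata Ćušterevska',
];

for (const name of reviewers) {
  const content = `${name}: see https://example.org/ref`;
  console.log(
    `${name}: old=${hasAuthorPrefix(content, oldPrefix)} new=${hasAuthorPrefix(content, newPrefix)}`,
  );
}
```

Only the first name passes the old pattern; all four pass the suggested one.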
69
+ ## Impact
70
+
71
+ In `paper_hexgrids_2026/abstract.md`, three reviewer comments from Jens-Christian Svenning (one tone-related, two with cited URLs/DOIs) were silently dropped from `rev comments`, `rev next`, and presumably `rev build docx --dual` output. Other affected reviewers in this manuscript: Camilla T Colding-Jørgensen (long comments would be dropped), Renata Ćušterevska, Jens-Christian Svenning across all sections.
package/dev_notes/stress2/build_adversarial.ts ADDED
@@ -0,0 +1,186 @@
1
+ /**
2
+ * Build an adversarial DOCX with hand-crafted commentRangeStart/End
3
+ * markers that probe whether docrev locates comments by *real*
4
+ * anchor (the highlighted run text, taken from the docx XML) or by
5
+ * guesswork. Each comment is designed to expose a specific failure
6
+ * mode if the matcher were not anchor-aware.
7
+ */
8
+ import AdmZip from 'adm-zip';
9
+ import path from 'path';
10
+
11
+ const out = path.resolve('dev_notes/stress2/adversarial.docx');
12
+
13
+ // Sentence with: a triplicated word, unicode look-alikes, mid-word
14
+ // hyphenation, and embedded XML-special chars in the *prose* (escaped).
15
+ const paragraphs: { text: string; comments?: { id: string; on: string }[] }[] = [
16
+ // heading
17
+ { text: 'Methods', comments: [] },
18
+
19
+ // p0: single highlight, easy
20
+ { text: 'The methods section describes our approach in detail.',
21
+ comments: [{ id: '0', on: 'methods section' }] },
22
+
23
+ // p1: word "model" appears 3x — anchor must pin the SECOND occurrence
24
+ { text: 'The model is fitted, the model is checked, the model is reported.',
25
+ comments: [{ id: '1', on: 'model' /* second occurrence */ }] },
26
+
27
+ // p2: empty anchor (zero-width comment, classic Word "insert here")
28
+ { text: 'Reviewers often place a marker between two sentences. Like this. Then continue.',
29
+ comments: [{ id: '2', on: '' }] },
30
+
31
+ // p3: anchor spanning across multiple runs (formatting break)
32
+ { text: 'Our results show that p < 0.001 across all conditions.',
33
+ comments: [{ id: '3', on: 'p < 0.001' }] },
34
+
35
+ // p4: text containing literal angle brackets that would break a naive
36
+ // [^<]* regex — anchor includes "(<1825)"
37
+ { text: 'Trade volumes pre-industrial (<1825) were modest.',
38
+ comments: [{ id: '4', on: '(<1825)' }] },
39
+
40
+ // p5: very long anchor (full sentence) — matcher must still place it
41
+ { text: 'We used a hierarchical Bayesian model with weakly informative priors and Hamiltonian Monte Carlo sampling implemented in Stan.',
42
+ comments: [{ id: '5', on: 'We used a hierarchical Bayesian model with weakly informative priors and Hamiltonian Monte Carlo sampling implemented in Stan.' }] },
43
+
44
+ // p6: anchor on punctuation only — should fall back to context
45
+ { text: 'See Table 1 for details; numbers are rounded.',
46
+ comments: [{ id: '6', on: ';' }] },
47
+
48
+ // p7: two overlapping comment ranges share a word
49
+ { text: 'The overrepresented species are listed in Appendix A.',
50
+ comments: [
51
+ { id: '7', on: 'overrepresented species' },
52
+ { id: '8', on: 'overrepresented' },
53
+ ] },
54
+
55
+ // p8: anchor that exists VERBATIM elsewhere in the doc — context disambiguates
56
+ { text: 'The cohort was small. Limitations are discussed in Section 5.',
57
+ comments: [{ id: '9', on: 'small' }] },
58
+ { text: 'The effect was small but significant.',
59
+ comments: [{ id: '10', on: 'small' }] },
60
+
61
+ // p9: heading paragraph (<w:pStyle w:val="Heading1"/>)
62
+ { text: 'Discussion', comments: [], /* heading marker handled below */ },
63
+ { text: 'In this section we situate the findings in prior literature.',
64
+ comments: [{ id: '11', on: 'situate the findings' }] },
65
+ ];
66
+
67
+ function escapeXml(s: string): string {
68
+ return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')
69
+ .replace(/"/g, '&quot;').replace(/'/g, '&apos;');
70
+ }
71
+
72
+ function buildParagraphXml(p: typeof paragraphs[0], isHeading: boolean): string {
73
+ const text = p.text;
74
+ const comments = p.comments || [];
75
+
76
+ if (comments.length === 0) {
77
+ return `<w:p>${isHeading ? '<w:pPr><w:pStyle w:val="Heading1"/></w:pPr>' : ''}<w:r><w:t xml:space="preserve">${escapeXml(text)}</w:t></w:r></w:p>`;
78
+ }
79
+
80
+ // Plan ranges: each comment marks [start..end] character offsets into `text`.
81
+ // Empty anchors place start=end at first sentence boundary.
82
+ type Range = { id: string; start: number; end: number };
83
+ const ranges: Range[] = [];
84
+ for (const c of comments) {
85
+ if (c.on === '') {
86
+ const dotIdx = text.indexOf('.');
87
+ const pos = dotIdx >= 0 ? dotIdx + 1 : 0;
88
+ ranges.push({ id: c.id, start: pos, end: pos });
89
+ continue;
90
+ }
91
+ if (c.id === '1') {
92
+ // "model" — pick the SECOND occurrence
93
+ const first = text.indexOf('model');
94
+ const second = text.indexOf('model', first + 1);
95
+ ranges.push({ id: c.id, start: second, end: second + 'model'.length });
96
+ continue;
97
+ }
98
+ const start = text.indexOf(c.on);
99
+ if (start < 0) throw new Error(`anchor not found: ${c.on}`);
100
+ ranges.push({ id: c.id, start, end: start + c.on.length });
101
+ }
102
+
103
+ // Build event list: at each char boundary we may need to emit
104
+ // <w:commentRangeStart/> or <w:commentRangeEnd/>.
105
+ type Event = { pos: number; kind: 'start' | 'end'; id: string };
106
+ const events: Event[] = [];
107
+ for (const r of ranges) {
108
+ events.push({ pos: r.start, kind: 'start', id: r.id });
109
+ events.push({ pos: r.end, kind: 'end', id: r.id });
110
+ }
111
+ // Emit ends before starts at same position, so an empty anchor's start=end
112
+ // ordering keeps a zero-width range
113
+ events.sort((a, b) => a.pos - b.pos || (a.kind === 'end' ? -1 : 1));
114
+
115
+ let xml = `<w:p>${isHeading ? '<w:pPr><w:pStyle w:val="Heading1"/></w:pPr>' : ''}`;
116
+ let cursor = 0;
117
+ for (const ev of events) {
118
+ if (ev.pos > cursor) {
119
+ xml += `<w:r><w:t xml:space="preserve">${escapeXml(text.slice(cursor, ev.pos))}</w:t></w:r>`;
120
+ cursor = ev.pos;
121
+ }
122
+ xml += ev.kind === 'start'
123
+ ? `<w:commentRangeStart w:id="${ev.id}"/>`
124
+ : `<w:commentRangeEnd w:id="${ev.id}"/><w:r><w:commentReference w:id="${ev.id}"/></w:r>`;
125
+ }
126
+ if (cursor < text.length) {
127
+ xml += `<w:r><w:t xml:space="preserve">${escapeXml(text.slice(cursor))}</w:t></w:r>`;
128
+ }
129
+ xml += `</w:p>`;
130
+ return xml;
131
+ }
132
+
133
+ const headingTexts = new Set(['Methods', 'Discussion']);
134
+ const documentBody = paragraphs.map(p => buildParagraphXml(p, headingTexts.has(p.text))).join('');
135
+
136
+ const documentXml = `<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
137
+ <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
138
+ <w:body>
139
+ ${documentBody}
140
+ <w:sectPr/>
141
+ </w:body>
142
+ </w:document>`;
143
+
144
+ const allComments = paragraphs.flatMap(p => p.comments || []);
145
+
146
+ const commentsXml = `<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
147
+ <w:comments xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
148
+ ${allComments.map(c => ` <w:comment w:id="${c.id}" w:author="Reviewer" w:initials="R" w:date="2026-01-01T00:00:00Z"><w:p><w:r><w:t>Comment ${c.id} on &quot;${escapeXml(c.on)}&quot; — text contains <embedded brackets and &amp;.</w:t></w:r></w:p></w:comment>`.replace('<embedded', '&lt;embedded')).join('\n')}
149
+ </w:comments>`;
150
+
151
+ const contentTypesXml = `<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
152
+ <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
153
+ <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
154
+ <Default Extension="xml" ContentType="application/xml"/>
155
+ <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
156
+ <Override PartName="/word/comments.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.comments+xml"/>
157
+ <Override PartName="/word/styles.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"/>
158
+ </Types>`;
159
+
160
+ const rootRelsXml = `<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
161
+ <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
162
+ <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
163
+ </Relationships>`;
164
+
165
+ const docRelsXml = `<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
166
+ <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
167
+ <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/comments" Target="comments.xml"/>
168
+ <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/>
169
+ </Relationships>`;
170
+
171
+ const stylesXml = `<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
172
+ <w:styles xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
173
+ <w:style w:type="paragraph" w:styleId="Heading1"><w:name w:val="heading 1"/></w:style>
174
+ </w:styles>`;
175
+
176
+ const zip = new AdmZip();
177
+ zip.addFile('[Content_Types].xml', Buffer.from(contentTypesXml, 'utf8'));
178
+ zip.addFile('_rels/.rels', Buffer.from(rootRelsXml, 'utf8'));
179
+ zip.addFile('word/document.xml', Buffer.from(documentXml, 'utf8'));
180
+ zip.addFile('word/comments.xml', Buffer.from(commentsXml, 'utf8'));
181
+ zip.addFile('word/styles.xml', Buffer.from(stylesXml, 'utf8'));
182
+ zip.addFile('word/_rels/document.xml.rels', Buffer.from(docRelsXml, 'utf8'));
183
+ zip.writeZip(out);
184
+
185
+ console.log(`wrote ${out}`);
186
+ console.log(`comments: ${allComments.length}`);
package/dev_notes/stress2/drift_matcher.ts ADDED
@@ -0,0 +1,62 @@
1
+ /**
2
+ * Run anchors from the adversarial docx against a HEAVILY DRIFTED
3
+ * markdown and report match quality per comment.
4
+ */
5
+ import { extractCommentAnchors } from '../../lib/word-extraction.js';
6
+ import { findAnchorInText, classifyStrategy } from '../../lib/anchor-match.js';
7
+
8
+ const docx = 'dev_notes/stress2/adversarial.docx';
9
+
10
+ // Drifted markdown — same comments need to land on this.
11
+ // - p0 paraphrased
12
+ // - p1 reordered, only TWO occurrences of "model"
13
+ // - p2 prose lightly edited, sentence boundary preserved
14
+ // - p3 prose changed slightly: "p < 0.001 (n=412)"
15
+ // - p4 prose moved (<1825)
16
+ // - p5 mostly intact
17
+ // - p6 entire sentence rewritten ("Round numbers are reported in Table 1.")
18
+ // - p7 unchanged
19
+ // - p8/p9 unchanged
20
+ // - p10/p11 unchanged
21
+ const drifted = [
22
+ '# Methods', '',
23
+ 'Below we describe our methodology in detail.', '',
24
+ 'The model is fitted; later, the model is reported.', '',
25
+ 'Reviewers often place a marker between two sentences. Like this. Then continue.', '',
26
+ 'Our results show that p < 0.001 (n=412) across all conditions.', '',
27
+ 'Trade volumes were modest pre-industrial (<1825).', '',
28
+ 'We used a hierarchical Bayesian model with weakly informative priors and Hamiltonian Monte Carlo sampling implemented in Stan.', '',
29
+ 'Round numbers are reported in Table 1.', '',
30
+ 'The overrepresented species are listed in Appendix A.', '',
31
+ 'The cohort was small. Limitations are discussed in Section 5.',
32
+ 'The effect was small but significant.', '',
33
+ '# Discussion', '',
34
+ 'In this section we situate the findings in prior literature.',
35
+ ].join('\n');
36
+
37
+ const { anchors } = await extractCommentAnchors(docx);
38
+
39
+ const sorted = [...anchors.entries()].sort((a, b) => Number(a[0]) - Number(b[0]));
40
+
41
+ const tally = { clean: 0, drift: 0, 'context-only': 0, ambiguous: 0, unmatched: 0 };
42
+
43
+ console.log('id │ quality │ strategy │ #hits │ anchor');
44
+ console.log('───┼───────────────┼─────────────────────┼───────┼──────────────────────');
45
+ for (const [id, a] of sorted) {
46
+ const r = findAnchorInText(a.anchor, drifted, a.before, a.after);
47
+ let q: keyof typeof tally = classifyStrategy(r.strategy, r.occurrences.length) as any;
48
+ if (r.occurrences.length > 1) q = 'ambiguous';
49
+ tally[q]++;
50
+ const tag = a.isEmpty ? '∅' : a.anchor.length > 30 ? a.anchor.slice(0, 27) + '...' : a.anchor;
51
+ console.log(`${id.padStart(2)} │ ${q.padEnd(13)} │ ${r.strategy.padEnd(19)} │ ${String(r.occurrences.length).padStart(5)} │ ${JSON.stringify(tag)}`);
52
+ }
53
+
54
+ console.log('\nTally:', tally);
55
+
56
+ // Also confirm: comment #6 (";") will fail anchor-direct match and force
57
+ // context fallback (because the original sentence was rewritten).
58
+ console.log('\nComment #6 (\";\") — sentence rewritten in drifted md.');
59
+ const c6 = anchors.get('6')!;
60
+ const r6 = findAnchorInText(c6.anchor, drifted, c6.before, c6.after);
61
+ console.log(` strategy=${r6.strategy}, hits=${r6.occurrences.length}`);
62
+ console.log(` ${r6.occurrences.length > 0 ? 'placed via fallback' : 'left for manual placement'}`);
package/dev_notes/stress2/probe_anchors.ts ADDED
@@ -0,0 +1,35 @@
1
+ /**
2
+ * Probe: do extracted anchors come from `<w:commentRangeStart/End>`
3
+ * markers (i.e. real anchors), and does each comment carry a unique
4
+ * docPosition that disambiguates duplicate prose?
5
+ */
6
+ import { extractCommentAnchors } from '../../lib/word-extraction.js';
7
+
8
+ const docx = 'dev_notes/stress2/adversarial.docx';
9
+ const { anchors, fullDocText } = await extractCommentAnchors(docx);
10
+
11
+ console.log(`Doc text length: ${fullDocText.length}`);
12
+ console.log(`Anchors extracted: ${anchors.size}\n`);
13
+
14
+ const sorted = [...anchors.entries()].sort((a, b) => Number(a[0]) - Number(b[0]));
15
+
16
+ for (const [id, a] of sorted) {
17
+ const tag = a.isEmpty ? '∅(empty)' : JSON.stringify(a.anchor);
18
+ console.log(`#${id} pos=${String(a.docPosition).padStart(4)} anchor=${tag}`);
19
+ if (a.before) console.log(` before=${JSON.stringify(a.before.slice(-40))}`);
20
+ if (a.after) console.log(` after =${JSON.stringify(a.after.slice(0, 40))}`);
21
+ }
22
+
23
+ // Sanity: comment 1 should pin the SECOND occurrence of "model"
24
+ const c1 = anchors.get('1')!;
25
+ const firstModel = fullDocText.indexOf('model');
26
+ const secondModel = fullDocText.indexOf('model', firstModel + 1);
27
+ console.log(`\nComment #1 expectation: pin SECOND "model" at pos ${secondModel}`);
28
+ console.log(`Comment #1 actual docPosition=${c1.docPosition}`);
29
+ console.log(`Match: ${c1.docPosition === secondModel ? 'YES (real-anchor)' : 'NO'}`);
30
+
31
+ // Comment 7 vs 8: overlapping ranges, both anchored on "overrepresented..."
32
+ const c7 = anchors.get('7')!;
33
+ const c8 = anchors.get('8')!;
34
+ console.log(`\nComment #7 anchor: ${JSON.stringify(c7.anchor)} (expected "overrepresented species")`);
35
+ console.log(`Comment #8 anchor: ${JSON.stringify(c8.anchor)} (expected "overrepresented")`);
package/dev_notes/stress2/project/discussion.before.md ADDED
@@ -0,0 +1,3 @@
1
+ # Discussion
2
+
3
+ In this section we situate the findings in prior literature.
package/dev_notes/stress2/project/discussion.md ADDED
@@ -0,0 +1,3 @@
1
+ # Discussion
2
+
3
+ In this section we {>>Reviewer: Comment 11 on "situate the findings" — text contains <embedded brackets and &.<<}situate the findings in prior literature.
package/dev_notes/stress2/project/methods.before.md ADDED
@@ -0,0 +1,20 @@
1
+ # Methods
2
+
3
+ Below we describe our methodology in detail.
4
+
5
+ The model is fitted; later, the model is reported.
6
+
7
+ Reviewers often place a marker between two sentences. Like this. Then continue.
8
+
9
+ Our results show that p < 0.001 (n=412) across all conditions.
10
+
11
+ Trade volumes were modest pre-industrial (<1825).
12
+
13
+ We used a hierarchical Bayesian model with weakly informative priors and Hamiltonian Monte Carlo sampling implemented in Stan.
14
+
15
+ Round numbers are reported in Table 1.
16
+
17
+ The overrepresented species are listed in Appendix A.
18
+
19
+ The cohort was small. Limitations are discussed in Section 5.
20
+ The effect was small but significant.
package/dev_notes/stress2/project/methods.md ADDED
@@ -0,0 +1,20 @@
1
+ # Methods
2
+
3
+ Below we describe{>>Reviewer: Comment 0 on "methods section" — text contains <embedded brackets and &.<<} our methodology in detail.
4
+
5
+ The {>>Reviewer: Comment 1 on "model" — text contains <embedded brackets and &.<<}model is fitted; later, the model is reported.
6
+
7
+ Reviewers often place a marker between two sentences. Like this. Then{>>Reviewer: Comment 2 on "" — text contains <embedded brackets and &.<<} continue.
8
+
9
+ Our results show that {>>Reviewer: Comment 3 on "p < 0.001" — text contains <embedded brackets and &.<<}p < 0.001 (n=412) across all conditions.
10
+
11
+ Trade volumes were modest pre-industrial {>>Reviewer: Comment 4 on "(<1825)" — text contains <embedded brackets and &.<<}(<1825).
12
+
13
+ {>>Reviewer: Comment 5 on "We used a hierarchical Bayesian model with weakly informative priors and Hamiltonian Monte Carlo sampling implemented in Stan." — text contains <embedded brackets and &.<<}We used a hierarchical Bayesian model with weakly informative priors and Hamiltonian Monte Carlo sampling implemented in Stan.
14
+
15
+ Round numbers are reported{>>Reviewer: Comment 6 on ";" — text contains <embedded brackets and &.<<} in Table 1.
+
+ The {>>Reviewer: Comment 7 on "overrepresented species" — text contains <embedded brackets and &.<<}overrepresented{>>Reviewer: Comment 8 on "overrepresented" — text contains <embedded brackets and &.<<} species are listed in Appendix A.
+
+ The cohort was {>>Reviewer: Comment 9 on "small" — text contains <embedded brackets and &.<<}small. Limitations are discussed in Section 5.
+ The effect was {>>Reviewer: Comment 10 on "small" — text contains <embedded brackets and &.<<}small but significant.
@@ -0,0 +1,5 @@
+ title: stress2
+ authors: []
+ sections:
+   - methods
+   - discussion
@@ -0,0 +1,4 @@
+ version: 1
+ sections:
+   methods.md: Methods
+   discussion.md: Discussion
@@ -0,0 +1,5 @@
+ sections:
+   - file: methods.md
+     header: Methods
+   - file: discussion.md
+     header: Discussion
@@ -0,0 +1,50 @@
+ /**
+  * Trace exactly where insertCommentsIntoMarkdown places each comment.
+  * Reproduces the methods.md sync to see if disambiguation works.
+  */
+ import { extractCommentAnchors, extractWordComments } from '../../lib/word-extraction.js';
+ import { insertCommentsIntoMarkdown } from '../../lib/import.js';
+ import * as fs from 'fs';
+
+ const docx = 'dev_notes/stress2/adversarial.docx';
+ const md = fs.readFileSync('dev_notes/stress2/project/methods.before.md', 'utf-8');
+
+ const { anchors } = await extractCommentAnchors(docx);
+ const comments = await extractWordComments(docx);
+
+ // Filter to methods (everything except #11)
+ const methodsComments = comments.filter((c: any) => c.id !== '11');
+
+ console.log('Markdown length:', md.length);
+ console.log('Comments to place:', methodsComments.length);
+
+ const smallPositions: number[] = [];
+ let i = 0;
+ while ((i = md.indexOf('small', i)) !== -1) {
+   smallPositions.push(i);
+   i += 5; // skip past this occurrence ('small'.length)
+ }
+ console.log('"small" occurrences in md at:', smallPositions);
+ for (const p of smallPositions) {
+   console.log(` pos ${p}: ...${JSON.stringify(md.slice(Math.max(0, p - 30), p + 30))}...`);
+ }
+
+ console.log('\nDocx anchor data for #9 and #10:');
+ for (const id of ['9', '10']) {
+   const a = anchors.get(id)!;
+   console.log(` #${id}: docPos=${a.docPosition}, before=${JSON.stringify(a.before.slice(-30))}, after=${JSON.stringify(a.after.slice(0, 30))}`);
+ }
+
+ // Run with quiet:false to see warnings
+ const out = insertCommentsIntoMarkdown(md, methodsComments, anchors, {
+   quiet: false,
+   wrapAnchor: false,
+ });
+
+ // Locate both inserted comment blocks
+ const block9 = out.indexOf('Comment 9 on');
+ const block10 = out.indexOf('Comment 10 on');
+ console.log(`\n#9 inserted at md offset: ${block9}`);
+ console.log(`#10 inserted at md offset: ${block10}`);
+ console.log(`Surrounding #9: ${JSON.stringify(out.slice(Math.max(0, block9 - 30), block9 + 50))}`);
+ console.log(`Surrounding #10: ${JSON.stringify(out.slice(Math.max(0, block10 - 30), block10 + 50))}`);
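Note on the trace script: the anchor text "small" occurs twice in methods.before.md, so comments #9 and #10 cannot be placed from the anchor text alone. As an illustration only, the sketch below picks among repeated occurrences by comparing the text in front of each candidate with the `before` context recorded for the comment. `allIndexes`, `commonSuffix`, and `pickByContext` are hypothetical helpers written for this note. They are not docrev's matcher and say nothing about how `insertCommentsIntoMarkdown` actually disambiguates.

```typescript
// Hypothetical helpers for illustration; not part of docrev.

// Every start offset of `needle` in `haystack` (non-overlapping).
function allIndexes(haystack: string, needle: string): number[] {
  const out: number[] = [];
  if (needle.length === 0) return out;
  let i = 0;
  while ((i = haystack.indexOf(needle, i)) !== -1) {
    out.push(i);
    i += needle.length;
  }
  return out;
}

// Number of trailing characters that two strings share.
function commonSuffix(a: string, b: string): number {
  let n = 0;
  while (n < a.length && n < b.length && a[a.length - 1 - n] === b[b.length - 1 - n]) n++;
  return n;
}

// Choose the occurrence whose preceding text best matches the recorded `before` context.
// Returns -1 when the anchor does not occur at all.
function pickByContext(md: string, anchor: string, before: string): number {
  let best = -1;
  let bestScore = -1;
  for (const pos of allIndexes(md, anchor)) {
    const score = commonSuffix(md.slice(0, pos), before);
    if (score > bestScore) {
      best = pos;
      bestScore = score;
    }
  }
  return best;
}

// The last two lines of methods.before.md, with assumed `before` contexts.
const sampleMd =
  'The cohort was small. Limitations are discussed in Section 5.\nThe effect was small but significant.';
console.log(pickByContext(sampleMd, 'small', 'The cohort was '));
console.log(pickByContext(sampleMd, 'small', 'The effect was '));
```

With the two assumed contexts, the first call resolves to the occurrence in "The cohort was small" and the second to the one in "The effect was small". A suffix comparison like this only helps while the text in front of the anchor is unchanged; drifted prose would need a fuzzier score.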
@@ -0,0 +1,27 @@
+ import { extractCommentAnchors } from '../lib/import.js';
+
+ const docx = 'C:/GillesC/tmp/docrev-stress/reviewed.docx';
+ const { fullDocText } = await extractCommentAnchors(docx);
+
+ function findSectionHeader(text: string, header: string): number {
+   const needle = header.toLowerCase().trim();
+   const lower = text.toLowerCase();
+   let idx = 0;
+   while ((idx = lower.indexOf(needle, idx)) !== -1) {
+     const after = text.slice(idx + needle.length, idx + needle.length + 5);
+     if (!after.startsWith(':') && !after.startsWith(' :')) return idx;
+     idx++;
+   }
+   return -1;
+ }
+
+ const headers = ['Abstract', 'Introduction', 'Methods', 'Results', 'Discussion', 'Conclusion', 'References', 'Supplementary Materials'];
+ const found = headers.map(h => ({ h, idx: findSectionHeader(fullDocText, h) }))
+   .filter(x => x.idx >= 0)
+   .sort((a, b) => a.idx - b.idx);
+
+ console.log('Sorted boundary positions found:');
+ for (const f of found) {
+   const ctx = fullDocText.slice(Math.max(0, f.idx - 20), f.idx + 30).replace(/\s+/g, ' ');
+   console.log(` ${f.h.padEnd(25)} @ ${f.idx} ctx: ...${ctx}...`);
+ }
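Note on `findSectionHeader`: the function is a case-insensitive substring search that skips any hit followed by `:` or ` :`. The sketch below runs a copy of it on a made-up string (the sample text is an assumption, not taken from reviewed.docx) to show both the skip rule and one limitation: a header word that appears inside running prose is also reported as a boundary.

```typescript
// Copy of the header search from the script above, exercised on invented text.
function findSectionHeader(text: string, header: string): number {
  const needle = header.toLowerCase().trim();
  const lower = text.toLowerCase();
  let idx = 0;
  while ((idx = lower.indexOf(needle, idx)) !== -1) {
    // Look at the characters right after the hit; a colon marks a label, not a heading.
    const after = text.slice(idx + needle.length, idx + needle.length + 5);
    if (!after.startsWith(':') && !after.startsWith(' :')) return idx;
    idx++;
  }
  return -1;
}

// "Methods:" appears first as a label, then "Methods" as a heading.
const sample = 'Methods: see below. Methods We fitted a model.';
const hit = findSectionHeader(sample, 'Methods');
console.log(hit, JSON.stringify(sample.slice(hit, hit + 18)));

// A header that is absent yields -1.
console.log(findSectionHeader(sample, 'Results'));

// Limitation: the word inside ordinary prose also matches.
console.log(findSectionHeader('our methods were sound', 'Methods'));
```

Because the first non-label hit wins, the reported boundary depends on the header word not appearing earlier in the body text. That is worth keeping in mind when reading the sorted positions the script prints.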
@@ -0,0 +1,43 @@
+ // Apply targeted drift edits to project-drifted/abstract.md and discussion.md
+ import * as fs from 'fs';
+
+ const dir = 'C:/GillesC/tmp/docrev-stress/project-drifted';
+
+ function edit(file: string, edits: Array<[string, string]>) {
+   const p = `${dir}/${file}`;
+   let t = fs.readFileSync(p, 'utf-8');
+   for (const [from, to] of edits) {
+     if (!t.includes(from)) {
+       console.error(`!! ${file}: pattern not found: "${from.slice(0, 60)}"`);
+       continue;
+     }
+     t = t.replace(from, to);
+     console.log(`OK ${file}: replaced "${from.slice(0, 60)}..."`);
+   }
+   fs.writeFileSync(p, t);
+ }
+
+ // Word swap: replacing "accelerating" with "rapid" in the abstract removes that single-word anchor entirely,
+ // so the stripped/partial fallback is not expected to recover it.
+ // Comments anchored on "accelerating" should therefore go drift -> unmatched.
+
+ edit('abstract.md', [
+   ['accelerating rates', 'rapid rates'],
+   // Numerical drift: the anchors are "0–20 years" and "57%"; both should fail to match directly
+   ['from 17% (0--20 years) to 57%', 'from 15% (0–25 years) to 60%'],
+   // Anchor-spanning rewrite: replace 'Man-made and ruderal habitats functioned as gateways' with 'Anthropogenic habitats acted as entry points'
+   ['Man-made and ruderal habitats functioned as gateways', 'Anthropogenic habitats acted as entry points'],
+   // Change a large number
+   ['835,891 vegetation plots', '1,200,000 vegetation plots'],
+ ]);
+
+ // Planned: delete an entire paragraph from discussion (so anchors there go unmatched). Not done in this script.
+ const disc = fs.readFileSync(`${dir}/discussion.md`, 'utf-8');
+ // No deletion happens here; instead, a new prose block is inserted to test that comments still land correctly relative to it.
+ // A deletion test would find a known phrase and remove its sentence.
+ edit('discussion.md', [
+   // Insert a new paragraph at top to force position drift in proportion-based placement
+   ['# Discussion\n', '# Discussion\n\n_NOTE: this paragraph was inserted after review. Just an extra block of prose to push everything downward in the markdown so that proportion-based anchor placement gets stress-tested. We pad with several sentences to ensure the offset is meaningful and that fuzzy matchers still find the right targets despite this drift._\n\n'],
+ ]);
+
+ console.log('drift applied');
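Note on `edit`: it calls `String.prototype.replace` with a string pattern, which replaces only the first occurrence and still interprets `$` sequences such as `$&` in the replacement string. None of the edit pairs above contain `$`, so the script as written is unaffected. The sketch below shows the difference on invented input, with `replaceLiteral` as a hypothetical variant that inserts the replacement verbatim.

```typescript
// Hypothetical variant of the replacement step: a replacer function is used,
// so "$" sequences in `to` are inserted as-is instead of being expanded.
function replaceLiteral(text: string, from: string, to: string): string {
  return text.replace(from, () => to);
}

const price = 'cost: 10 units, then 10 units more';

// String replacement: "$&" expands to the matched text, and only the first hit changes.
const expanded = price.replace('10 units', '$& total');

// Replacer function: the replacement text is taken literally.
const literal = replaceLiteral(price, '10 units', '$& total');

console.log(expanded);
console.log(literal);
```

If a drift edit ever needs to touch every occurrence, `String.prototype.replaceAll` with the same replacer function would be the analogous call, assuming a runtime that supports it (ES2021 or later).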