structurecc 2.0.3 → 2.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -168,13 +168,65 @@ Return ONLY this JSON structure:
 
 ## Chart Type Specifications
 
-### Kaplan-Meier / Survival Curves
+### Kaplan-Meier / Survival / Cumulative Incidence Curves
+
+**CRITICAL: These are STEP FUNCTIONS, not smooth lines.**
+
 Required fields:
-- …
-- …
-- …
-- …
-- …
+- **Curve style**: Must note `"curve_style": "step_function"`
+- **Step function data points** - Capture visible steps/jumps
+- **Risk table** (if present below chart)
+- **Censoring marks** (vertical ticks if visible)
+- **Confidence interval bands** - Note BOTH colors (often different per group)
+- **Log-rank p-value** or other statistical annotation
+- **Endpoint values** - Where each curve ends (y-value at max x)
+
+**LEGEND EXTRACTION - VERBATIM:**
+Read the legend box word-for-word. Common patterns:
+- "HSV: Dementia Risk" (not "HSV group")
+- "Control: Dementia Risk" (not "Control group")
+- "HSV: Dementia Risk 95% CI" (for confidence bands)
+- "Control: Dementia Risk 95% CI"
+
+**COLOR PRECISION:**
+- Main lines: Often purple vs dark blue
+- CI bands: Often light purple vs yellow/orange (DIFFERENT colors!)
+- Be specific: "light purple shaded band", "yellow/orange shaded band"
+
+**Example for cumulative incidence:**
+```json
+{
+  "chart_type": "kaplan_meier",
+  "curve_style": "step_function",
+  "chart_metadata": {
+    "title": "Cumulative Incidence of Dementia",
+    "source_page": 7
+  },
+  "axes": {
+    "x": {"label": "Time (Days) Since HSV Diagnosis", "min": 0, "max": 7000},
+    "y": {"label": "Cumulative Risk of Dementia", "min": 0.0, "max": 0.6}
+  },
+  "legend": {
+    "position": "right",
+    "title": "Legend",
+    "entries": [
+      {"label": "HSV: Dementia Risk", "color": "purple", "line_style": "solid step"},
+      {"label": "Control: Dementia Risk", "color": "dark blue", "line_style": "solid step"},
+      {"label": "HSV: Dementia Risk 95% CI", "color": "light purple", "style": "shaded band"},
+      {"label": "Control: Dementia Risk 95% CI", "color": "yellow/orange", "style": "shaded band"}
+    ]
+  },
+  "curve_endpoints": [
+    {"series": "HSV: Dementia Risk", "final_x": 7000, "final_y": 0.32},
+    {"series": "Control: Dementia Risk", "final_x": 7000, "final_y": 0.05}
+  ],
+  "key_observations": [
+    "Step-function curves with visible jumps at event times",
+    "Curves diverge early and separation increases",
+    "CI bands widen substantially after day 5000"
+  ]
+}
+```
 
 ### Bar Charts
 ```json
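The step-function rule above is easy to sanity-check mechanically. A minimal sketch, not part of the package, assuming extracted points arrive as `{"x": ..., "y": ...}` dicts in plot order with duplicate x-values marking the vertical jumps:

```python
# Illustrative only: check that extracted Kaplan-Meier / cumulative-incidence
# points behave like a step function rather than a smooth interpolation.
def looks_like_step_function(data_points, tol=1e-9):
    """x must be non-decreasing (duplicate x marks a vertical jump),
    cumulative y must never decrease, and a genuine step curve has at
    least one flat run between jumps."""
    xs = [p["x"] for p in data_points]
    ys = [p["y"] for p in data_points]
    if any(b < a for a, b in zip(xs, xs[1:])):
        return False  # x runs backwards: not a plotted curve
    if any(b < a - tol for a, b in zip(ys, ys[1:])):
        return False  # cumulative risk decreased: extraction error
    flat = sum(1 for a, b in zip(ys, ys[1:]) if abs(b - a) <= tol)
    return flat > 0  # smooth interpolations rarely show flat runs

points = [{"x": 0, "y": 0.00}, {"x": 500, "y": 0.00},
          {"x": 500, "y": 0.04}, {"x": 2000, "y": 0.04},
          {"x": 2000, "y": 0.11}, {"x": 7000, "y": 0.32}]
print(looks_like_step_function(points))  # True
```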
@@ -5,7 +5,7 @@ description: Phase 2 - Verbatim multi-panel figure extraction (A, B, C, D panels
 
 # Multi-Panel Figure Extractor
 
-You extract multi-panel figures by processing EACH PANEL SEPARATELY with…
+You extract multi-panel figures by processing EACH PANEL SEPARATELY with ABSOLUTE verbatim accuracy.
 
 ## VERBATIM EXTRACTION RULES
 
@@ -16,6 +16,82 @@ You extract multi-panel figures by processing EACH PANEL SEPARATELY with full ve
 3. **Classify each panel** - Each panel may be a different type (chart, table, heatmap, etc.)
 4. **Preserve panel relationships** - Note when panels share legends, axes, or data
 
+## LEGEND EXTRACTION - VERBATIM REQUIRED
+
+**CRITICAL FOR KAPLAN-MEIER/SURVIVAL CURVES:**
+
+Read the legend box carefully and extract EVERY entry EXACTLY as written:
+
+Example - If you see:
+```
+Legend
+— HSV: Dementia Risk
+— Control: Dementia Risk
+▒ HSV: Dementia Risk 95% CI
+▒ Control: Dementia Risk 95% CI
+```
+
+You MUST output:
+```json
+"legend": {
+  "entries": [
+    {"label": "HSV: Dementia Risk", "color": "purple", "line_style": "solid"},
+    {"label": "Control: Dementia Risk", "color": "dark blue", "line_style": "solid"},
+    {"label": "HSV: Dementia Risk 95% CI", "color": "light purple", "style": "shaded band"},
+    {"label": "Control: Dementia Risk 95% CI", "color": "yellow/orange", "style": "shaded band"}
+  ]
+}
+```
+
+Do NOT summarize as "HSV group" or "Control group" - use the EXACT text from the legend.
+
+## KAPLAN-MEIER / CUMULATIVE INCIDENCE CURVES
+
+For survival/incidence curves, capture:
+
+1. **Curve Shape**: Note that these are STEP FUNCTIONS, not smooth lines
+2. **Key Inflection Points**: Where curves separate, where steps occur
+3. **Endpoint Values**: The y-value where each curve ends (be precise)
+4. **Confidence Intervals**: Shaded bands - note BOTH colors (often different for each group)
+5. **Widening CI**: Note if confidence intervals widen over time
+
+Example output for a Kaplan-Meier panel:
+```json
+{
+  "panel_id": "A",
+  "panel_type": "chart_kaplan_meier",
+  "panel_title": "Overall HSV vs Control",
+  "extraction": {
+    "chart_type": "kaplan_meier",
+    "curve_style": "step_function",
+    "axes": {
+      "x": {"label": "Time (Days) Since HSV Diagnosis", "min": 0, "max": 7000, "ticks": [0, 1000, 2000, 3000, 4000, 5000, 6000, 7000]},
+      "y": {"label": "Cumulative Risk of Dementia", "min": 0.0, "max": 0.6, "ticks": [0, 0.2, 0.4, 0.6]}
+    },
+    "legend": {
+      "position": "right",
+      "title": "Legend",
+      "entries": [
+        {"label": "HSV: Dementia Risk", "color": "purple", "line_style": "solid step"},
+        {"label": "Control: Dementia Risk", "color": "dark blue", "line_style": "solid step"},
+        {"label": "HSV: Dementia Risk 95% CI", "color": "light purple", "style": "shaded band"},
+        {"label": "Control: Dementia Risk 95% CI", "color": "yellow/orange", "style": "shaded band"}
+      ]
+    },
+    "curve_endpoints": [
+      {"series": "HSV", "final_x": 7000, "final_y": 0.32, "note": "steep step around day 6500"},
+      {"series": "Control", "final_x": 7000, "final_y": 0.05, "note": "relatively flat"}
+    ],
+    "key_observations": [
+      "Curves diverge early (~day 500) and separation increases over time",
+      "HSV group shows multiple step increases, particularly steep after day 5000",
+      "Control group remains relatively flat throughout",
+      "95% CI bands widen substantially after day 5000, especially for HSV group"
+    ]
+  }
+}
+```
+
 ## Output Schema
 
 Return ONLY this JSON structure:
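The no-summarizing rule above lends itself to a cheap lint. An illustrative sketch, not shipped with the package; the hint words are assumptions generalized from the examples above:

```python
# Illustrative only: flag legend labels that look summarized ("HSV group")
# rather than verbatim ("HSV: Dementia Risk"). A heuristic, not a proof.
SUMMARIZED_HINTS = ("group", "arm", "cohort")  # assumed paraphrase endings

def flag_summarized_labels(entries):
    flagged = []
    for entry in entries:
        label = entry.get("label", "").strip()
        # Verbatim labels in the examples above contain a ':' separator;
        # paraphrases tend to end in a generic collective noun.
        if ":" not in label and label.lower().endswith(SUMMARIZED_HINTS):
            flagged.append(label)
    return flagged

entries = [{"label": "HSV: Dementia Risk"}, {"label": "Control group"}]
print(flag_summarized_labels(entries))  # ['Control group']
```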
package/bin/install.js
CHANGED
@@ -4,7 +4,7 @@ const fs = require('fs');
 const path = require('path');
 const os = require('os');
 
-const VERSION = '2.0.3';
+const VERSION = '2.1.0';
 const PACKAGE_NAME = 'structurecc';
 
 // Agent files in v2.0
@@ -39,23 +39,18 @@ function log(msg, color = '') {
 function banner() {
   const c = colors;
   console.log('');
-  console.log(c.cyan + '…
-  console.log(c.cyan + '…
-  console.log(c.cyan + '…
-  console.log(c.cyan + '…
-  console.log(c.cyan + '…
-  console.log(c.cyan + '…
-  console.log(…
-  console.log(c.…
-  console.log(c.…
-  console.log(…
-  console.log(c.…
-  console.log(c.cyan + ' │ │' + c.reset);
-  console.log(c.cyan + ' │ ' + c.white + '3-phase pipeline with quality scoring' + c.reset + c.cyan + ' │' + c.reset);
-  console.log(c.cyan + ' │ │' + c.reset);
-  console.log(c.cyan + ' └─────────────────────────────────────────────────────┘' + c.reset);
+  console.log(c.cyan + ' ███████╗████████╗██████╗ ██╗ ██╗ ██████╗████████╗██╗ ██╗██████╗ ███████╗' + c.reset);
+  console.log(c.cyan + ' ██╔════╝╚══██╔══╝██╔══██╗██║ ██║██╔════╝╚══██╔══╝██║ ██║██╔══██╗██╔════╝' + c.reset);
+  console.log(c.cyan + ' ███████╗ ██║ ██████╔╝██║ ██║██║ ██║ ██║ ██║██████╔╝█████╗ ' + c.reset);
+  console.log(c.cyan + ' ╚════██║ ██║ ██╔══██╗██║ ██║██║ ██║ ██║ ██║██╔══██╗██╔══╝ ' + c.reset);
+  console.log(c.cyan + ' ███████║ ██║ ██║ ██║╚██████╔╝╚██████╗ ██║ ╚██████╔╝██║ ██║███████╗' + c.reset);
+  console.log(c.cyan + ' ╚══════╝ ╚═╝ ╚═╝ ╚═╝ ╚═════╝ ╚═════╝ ╚═╝ ╚═════╝ ╚═╝ ╚═╝╚══════╝' + c.reset);
+  console.log('');
+  console.log(c.dim + ' Agentic Document Structuring' + c.reset + ' ' + c.yellow + 'v' + VERSION + c.reset);
+  console.log(c.dim + ' Verbatim extraction. Quality verified.' + c.reset);
+  console.log('');
+  console.log(c.dim + ' ─────────────────────────────────────────────────────────────────────────────' + c.reset);
   console.log('');
-  console.log(c.bright + 'structurecc' + c.reset + ' v' + VERSION);
 }
 
 function getClaudeDir() {
@@ -93,13 +88,13 @@ function install() {
   const srcCommandsDir = path.join(packageDir, 'commands', 'structure');
   const srcAgentsDir = path.join(packageDir, 'agents');
 
-  log('Installing…
+  log('Installing...', colors.dim);
   log('');
 
   // Install command
   if (fs.existsSync(srcCommandsDir)) {
     copyDir(srcCommandsDir, commandsDir);
-    log('  ✓…
+    log('  ✓ /structure command', colors.green);
   }
 
   // Install agents
@@ -115,11 +110,8 @@ function install() {
       if (fs.existsSync(srcPath)) {
        fs.copyFileSync(srcPath, destPath);
        const agentName = file.replace('.md', '');
-       log(`  ✓…
+       log(`  ✓ ${agentName}`, colors.green);
        installed++;
-      } else {
-       log(`  ⚠ Missing ${file}`, colors.yellow);
-       skipped++;
      }
    }
 
@@ -127,29 +119,11 @@ function install() {
    const oldExtractor = path.join(agentsDir, 'structurecc-extractor.md');
    if (fs.existsSync(oldExtractor)) {
      fs.unlinkSync(oldExtractor);
-     log('  ✓ Removed legacy structurecc-extractor', colors.dim);
-   }
-
-   log('');
-   log(`  Agents installed: ${installed}`, colors.dim);
-   if (skipped > 0) {
-     log(`  Agents skipped: ${skipped}`, colors.yellow);
    }
  }
 
  log('');
-  log(`${colors.green}Done!${colors.reset}…
-  log('');
-  log(`${colors.bright}What's new in v2.0:${colors.reset}`);
-  log(`  • 3-phase pipeline: Classify → Extract → Verify`, colors.dim);
-  log(`  • 7 specialized extractors (tables, charts, heatmaps, etc.)`, colors.dim);
-  log(`  • Verbatim extraction with quality scoring`, colors.dim);
-  log(`  • Auto-revision loop for failed extractions`, colors.dim);
-  log('');
-  log(`Run in Claude Code:`, colors.bright);
-  log(`  /structure path/to/document.pdf`, colors.cyan);
-  log('');
-  log(`${colors.dim}Supports: PDF, DOCX, PNG, JPG, TIFF${colors.reset}`);
+  log(`${colors.green}Done!${colors.reset} Run ${colors.cyan}/structure <path>${colors.reset} to extract.`);
  log('');
 }
 
@@ -187,9 +161,6 @@ function uninstall() {
 function showVersion() {
   log(`structurecc v${VERSION}`, colors.bright);
   log('');
-  log('Pipeline: 3-phase with verification', colors.dim);
-  log('Agents: 8 (classifier + 6 extractors + verifier)', colors.dim);
-  log('');
 }
 
 // Main
@@ -204,27 +175,11 @@ if (args.includes('--uninstall') || args.includes('-u')) {
 } else if (args.includes('--help') || args.includes('-h')) {
   log('Usage: npx structurecc [options]', colors.bright);
   log('');
-  log('Options:', colors.…
-  log('  --…
-  log('  --version, -v    Show version info', colors.dim);
-  log('  --uninstall, -u  Remove from Claude Code', colors.dim);
+  log('Options:', colors.dim);
+  log('  --uninstall, -u  Remove from Claude Code');
   log('');
-  log('After install,…
+  log('After install, run in Claude Code:', colors.bright);
   log('  /structure path/to/document.pdf', colors.cyan);
-  log('  /structure path/to/document.docx', colors.cyan);
-  log('');
-  log('Pipeline:', colors.bright);
-  log('  Phase 1: Classification (haiku - fast triage)', colors.dim);
-  log('  Phase 2: Specialized extraction (opus - quality)', colors.dim);
-  log('  Phase 3: Verification (sonnet - balance)', colors.dim);
-  log('');
-  log('Extractors:', colors.bright);
-  log('  • structurecc-extract-table - Tables with cell-by-cell accuracy', colors.dim);
-  log('  • structurecc-extract-chart - Charts with axes, legends, data', colors.dim);
-  log('  • structurecc-extract-heatmap - Heatmaps with color scales', colors.dim);
-  log('  • structurecc-extract-diagram - Flowcharts, timelines, networks', colors.dim);
-  log('  • structurecc-extract-multipanel - Multi-panel figures (A,B,C,D)', colors.dim);
-  log('  • structurecc-extract-generic - Fallback for other visuals', colors.dim);
   log('');
 } else {
   install();
@@ -52,7 +52,132 @@ for subdir in ["images", "classifications", "extractions", "verifications", "ele
 print(f"Output directory: {output_dir}")
 ```
 
-## Step 2: Extract Images
+## Step 2: Extract Text and Images
+
+**CRITICAL:** Extract BOTH the manuscript text AND images. Figures need context from the paper.
+
+### 2A: Extract Manuscript Text (PDF)
+
+```python
+import fitz
+import re
+import json
+from pathlib import Path
+
+pdf_path = "<document_path>"
+output_dir = Path("<output_dir>")
+
+doc = fitz.open(pdf_path)
+full_text = []
+page_texts = {}
+
+for page_num in range(len(doc)):
+    page = doc[page_num]
+    text = page.get_text("text")
+    full_text.append(text)
+    page_texts[page_num + 1] = text
+
+# Save full manuscript text
+with open(output_dir / "manuscript_text.txt", "w") as f:
+    f.write("\n\n---PAGE BREAK---\n\n".join(full_text))
+
+# Parse figure and table captions
+def extract_captions(text):
+    """Extract Figure and Table captions from manuscript text."""
+    captions = {"figures": {}, "tables": {}}
+
+    # Figure patterns: "Figure 1.", "Figure 1:", "Fig. 1.", "FIGURE 1"
+    fig_pattern = r'(?:Figure|Fig\.?|FIGURE)\s*(\d+[A-Za-z]?)[\.:]\s*([^\n]+(?:\n(?![A-Z][a-z])[^\n]+)*)'
+    for match in re.finditer(fig_pattern, text, re.IGNORECASE):
+        fig_num = match.group(1)
+        caption = match.group(2).strip()
+        # Clean up caption (remove extra whitespace)
+        caption = ' '.join(caption.split())
+        captions["figures"][fig_num] = caption
+
+    # Table patterns: "Table 1.", "Table 1:", "TABLE 1"
+    table_pattern = r'(?:Table|TABLE)\s*(\d+[A-Za-z]?)[\.:]\s*([^\n]+(?:\n(?![A-Z][a-z])[^\n]+)*)'
+    for match in re.finditer(table_pattern, text, re.IGNORECASE):
+        table_num = match.group(1)
+        caption = match.group(2).strip()
+        caption = ' '.join(caption.split())
+        captions["tables"][table_num] = caption
+
+    return captions
+
+all_text = "\n".join(full_text)
+captions = extract_captions(all_text)
+
+# Save captions
+with open(output_dir / "captions.json", "w") as f:
+    json.dump(captions, f, indent=2)
+
+print(f"Extracted text from {len(doc)} pages")
+print(f"Found {len(captions['figures'])} figure captions")
+print(f"Found {len(captions['tables'])} table captions")
+
+doc.close()
+```
+
+### 2B: Extract Context Snippets
+
+For each figure/table, extract surrounding manuscript context:
+
+```python
+def extract_context_for_element(text, element_type, element_num, window=500):
+    """Extract text context surrounding references to a figure/table."""
+    contexts = []
+
+    if element_type == "figure":
+        patterns = [
+            rf'(?:Figure|Fig\.?)\s*{element_num}\b',
+            rf'(?:figure|fig\.?)\s*{element_num}\b'
+        ]
+    else:
+        patterns = [
+            rf'(?:Table)\s*{element_num}\b',
+            rf'(?:table)\s*{element_num}\b'
+        ]
+
+    for pattern in patterns:
+        for match in re.finditer(pattern, text):
+            start = max(0, match.start() - window)
+            end = min(len(text), match.end() + window)
+            context = text[start:end].strip()
+            # Clean up
+            context = ' '.join(context.split())
+            if context not in contexts:
+                contexts.append(context)
+
+    return contexts
+
+# Extract contexts for each figure
+figure_contexts = {}
+for fig_num in captions["figures"]:
+    contexts = extract_context_for_element(all_text, "figure", fig_num)
+    figure_contexts[fig_num] = {
+        "caption": captions["figures"][fig_num],
+        "contexts": contexts
+    }
+
+# Extract contexts for each table
+table_contexts = {}
+for table_num in captions["tables"]:
+    contexts = extract_context_for_element(all_text, "table", table_num)
+    table_contexts[table_num] = {
+        "caption": captions["tables"][table_num],
+        "contexts": contexts
+    }
+
+# Save context data
+with open(output_dir / "manuscript_context.json", "w") as f:
+    json.dump({
+        "figures": figure_contexts,
+        "tables": table_contexts
+    }, f, indent=2)
+```
+
+### 2C: Extract Images
 
 **For PDF files** - Use PyMuPDF:
 
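A quick usage check for the caption regexes in 2A; this example assumes `extract_captions` (and the `re` import) from the snippet above are in scope, and is illustrative rather than part of the diff:

```python
# Exercise extract_captions() from step 2A on a toy manuscript string.
sample = (
    "Methods were as previously described.\n"
    "Figure 1. Cumulative incidence of dementia in HSV and control cohorts.\n"
    "Table 2: Baseline characteristics of the matched cohorts.\n"
)
captions = extract_captions(sample)
print(captions["figures"]["1"])  # Cumulative incidence of dementia in HSV and control cohorts.
print(captions["tables"]["2"])   # Baseline characteristics of the matched cohorts.
```

Note that the continuation clause `(?:\n(?![A-Z][a-z])[^\n]+)*` stops the caption at the next capitalized line, so the "Table 2" line is not swallowed into the Figure 1 caption.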
@@ -166,7 +291,7 @@ After classifications complete, read each classification file and dispatch to th
 | `multi_panel` | `structurecc-extract-multipanel.md` |
 | Everything else | `structurecc-extract-generic.md` |
 
-For EACH element, spawn the appropriate extractor:
+For EACH element, spawn the appropriate extractor WITH manuscript context:
 
 ```
 Task(
@@ -185,12 +310,31 @@ Read the agent instructions from:
 **Source:** Page <N> of <document_name>
 **Output:** Write JSON to <output_dir>/extractions/<element_id>.json
 
+## MANUSCRIPT CONTEXT (Use this to understand the figure)
+
+**Figure Caption:**
+<caption_from_captions.json>
+
+**Relevant Manuscript Text:**
+<context_snippets_from_manuscript_context.json>
+
+---
+
 Follow the extractor instructions EXACTLY. Output ONLY valid JSON.
-
+
+CRITICAL REQUIREMENTS:
+1. VERBATIM extraction - Copy ALL text exactly as shown in the image
+2. Use manuscript context to understand what the figure shows
+3. Include the figure caption in your extraction
+4. For charts: capture EXACT legend text, axis labels, tick values
+5. For Kaplan-Meier/survival curves: note step-function nature, describe curve progression
+6. Describe colors precisely (e.g., "purple line", "light purple shaded 95% CI", "yellow/orange shaded band")
 """
 )
 ```
 
+**IMPORTANT:** Read `manuscript_context.json` to get the caption and context for each element.
+
 Launch ALL extractions in ONE message for parallel processing.
 
 ## Step 5: Phase 3 - Verification (Parallel)
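One way the MANUSCRIPT CONTEXT block above might be assembled from `manuscript_context.json` before spawning the Task; a sketch only, with `fig_num` standing in for a hypothetical element-to-figure mapping:

```python
# Illustrative sketch: build the MANUSCRIPT CONTEXT section of the extractor
# prompt from manuscript_context.json. fig_num is hypothetical here.
import json
from pathlib import Path

output_dir = Path("<output_dir>")
with open(output_dir / "manuscript_context.json") as f:
    manuscript_context = json.load(f)

fig_num = "1"  # assumed: this element is known to be Figure 1
ctx = manuscript_context.get("figures", {}).get(fig_num, {})

lines = ["## MANUSCRIPT CONTEXT (Use this to understand the figure)", "",
         "**Figure Caption:**", ctx.get("caption", "(no caption found)"), "",
         "**Relevant Manuscript Text:**"]
lines += [f"> {snippet[:500]}" for snippet in ctx.get("contexts", [])[:2]]
context_block = "\n".join(lines)
print(context_block)
```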
@@ -317,48 +461,73 @@ for extract_file in extractions_dir.glob("*.json"):
 **Markdown conversion function:**
 
 ```python
-…
-…
+import json
+from pathlib import Path
+
+def json_to_markdown(extraction: dict, context: dict = None) -> str:
+    """Convert JSON extraction to clean markdown with manuscript context."""
 
     ext_type = extraction.get("extraction_type")
 
     if ext_type == "table":
-        return table_to_markdown(extraction)
+        return table_to_markdown(extraction, context)
     elif ext_type == "chart":
-        return chart_to_markdown(extraction)
+        return chart_to_markdown(extraction, context)
     elif ext_type == "heatmap":
-        return heatmap_to_markdown(extraction)
+        return heatmap_to_markdown(extraction, context)
     elif ext_type == "diagram":
-        return diagram_to_markdown(extraction)
+        return diagram_to_markdown(extraction, context)
     elif ext_type == "multi_panel":
-        return multipanel_to_markdown(extraction)
+        return multipanel_to_markdown(extraction, context)
     else:
-        return generic_to_markdown(extraction)
+        return generic_to_markdown(extraction, context)
+
+# Load manuscript context for element processing
+def get_context_for_element(output_dir: Path, element_num: int, element_type: str = "figure"):
+    """Get manuscript context for a specific element."""
+    context_file = output_dir / "manuscript_context.json"
+    if not context_file.exists():
+        return None
 
+    with open(context_file) as f:
+        manuscript_context = json.load(f)
 
-
+    key = "figures" if element_type == "figure" else "tables"
+    return manuscript_context.get(key, {}).get(str(element_num))
+
+
+def table_to_markdown(ext: dict, context: dict = None) -> str:
     md = []
     meta = ext.get("table_metadata", {})
 
-    md.append(f"…
-    md.append(f"\n**Type:**…
+    md.append(f"## {meta.get('title', 'Table')}")
+    md.append(f"\n**Type:** table ({meta.get('table_type', 'standard')})")
     md.append(f"**Source:** Page {meta.get('source_page', '?')}")
 
+    # Add manuscript context if available
+    if context:
+        if context.get("caption"):
+            md.append(f"\n> **Table Caption (from manuscript):** {context['caption']}")
+        if context.get("contexts"):
+            md.append("\n### Manuscript Context\n")
+            for ctx in context["contexts"][:2]:
+                md.append(f"> {ctx[:400]}...")
+
     if meta.get("caption"):
-        md.append(f"\n> {meta['caption']}")
+        md.append(f"\n> **Caption (from image):** {meta['caption']}")
 
-    md.append("\n…
+    md.append("\n### Data\n")
     md.append(ext.get("markdown_table", ""))
 
     if meta.get("footnotes"):
-        md.append("\n…
+        md.append("\n### Footnotes\n")
         for fn in meta["footnotes"]:
             md.append(f"- {fn}")
 
     return "\n".join(md)
 
 
-def chart_to_markdown(ext: dict) -> str:
+def chart_to_markdown(ext: dict, context: dict = None) -> str:
     md = []
     meta = ext.get("chart_metadata", {})
 
@@ -366,31 +535,61 @@ def chart_to_markdown(ext: dict) -> str:
     md.append(f"\n**Type:** {ext.get('chart_type', 'Chart')}")
     md.append(f"**Source:** Page {meta.get('source_page', '?')}")
 
+    # Add manuscript context if available
+    if context:
+        if context.get("caption"):
+            md.append(f"\n> **Caption:** {context['caption']}")
+        if context.get("contexts"):
+            md.append("\n### Manuscript Context\n")
+            for ctx in context["contexts"][:2]:  # Limit to 2 most relevant
+                md.append(f"> {ctx[:500]}...")  # Truncate long contexts
+
     axes = ext.get("axes", {})
     md.append("\n## Axes\n")
     if axes.get("x"):
         md.append(f"- **X-axis:** {axes['x'].get('label', 'unlabeled')}")
         md.append(f"  - Range: {axes['x'].get('min')} to {axes['x'].get('max')}")
+        if axes['x'].get('ticks'):
+            md.append(f"  - Ticks: {axes['x']['ticks']}")
     if axes.get("y"):
         md.append(f"- **Y-axis:** {axes['y'].get('label', 'unlabeled')}")
         md.append(f"  - Range: {axes['y'].get('min')} to {axes['y'].get('max')}")
+        if axes['y'].get('ticks'):
+            md.append(f"  - Ticks: {axes['y']['ticks']}")
 
     legend = ext.get("legend", {})
     if legend.get("entries"):
-        md.append("\n## Legend\n")
+        md.append("\n## Legend (Verbatim)\n")
         for entry in legend["entries"]:
             style = entry.get("line_style") or entry.get("style", "")
-…
+            color = entry.get("color", "")
+            md.append(f"- **\"{entry['label']}\"**: {color} {style}")
+
+    # Data series details (for Kaplan-Meier etc.)
+    series = ext.get("data_series", [])
+    if series:
+        md.append("\n## Data Series\n")
+        for s in series:
+            md.append(f"### {s.get('name', 'Series')}")
+            if s.get("data_points"):
+                md.append("Key data points:")
+                for pt in s["data_points"][:10]:  # First 10 points
+                    md.append(f"  - x={pt.get('x')}, y={pt.get('y')}")
 
     stats = ext.get("statistical_annotations", [])
     if stats:
         md.append("\n## Statistical Annotations\n")
         for stat in stats:
-…
+            if stat.get("type") == "hazard_ratio":
+                md.append(f"- Hazard Ratio: {stat.get('value')} (95% CI: {stat.get('ci_lower')}-{stat.get('ci_upper')})")
+            elif stat.get("type") == "p_value":
+                md.append(f"- {stat.get('test', 'P-value')}: {stat.get('value')}")
+            else:
+                md.append(f"- {stat.get('type', 'stat')}: {stat.get('value', '')}")
 
     risk = ext.get("risk_table", {})
     if risk.get("present"):
-        md.append("\n## Risk Table\n")
+        md.append("\n## Risk Table (Number at Risk)\n")
         headers = risk.get("headers", [])
         md.append("| " + " | ".join(headers) + " |")
         md.append("| " + " | ".join(["---"] * len(headers)) + " |")
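Tying the pieces above together for one element; illustrative usage that assumes the functions defined in this file are in scope (the `element_001` file name is hypothetical):

```python
# Illustrative usage of get_context_for_element() + json_to_markdown().
import json
from pathlib import Path

output_dir = Path("<output_dir>")
extraction_path = output_dir / "extractions" / "element_001.json"  # hypothetical name
with open(extraction_path) as f:
    extraction = json.load(f)

context = get_context_for_element(output_dir, element_num=1, element_type="figure")
markdown = json_to_markdown(extraction, context)  # context may be None
(output_dir / "elements" / "element_001.md").write_text(markdown)
```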
@@ -399,18 +598,137 @@ def chart_to_markdown(ext: dict) -> str:
             md.append("| " + " | ".join(values) + " |")
 
     return "\n".join(md)
+
+
+def multipanel_to_markdown(ext: dict, context: dict = None) -> str:
+    """Convert multi-panel figure extraction to detailed markdown."""
+    md = []
+    meta = ext.get("figure_metadata", {})
+
+    md.append(f"## {meta.get('title', 'Multi-Panel Figure')}")
+    md.append(f"\n**Type:** multi_panel ({meta.get('total_panels', '?')} panels)")
+    md.append(f"**Source:** Page {meta.get('source_page', '?')}")
+    md.append(f"**Layout:** {meta.get('layout', 'unknown')}")
+
+    # Add manuscript context if available
+    if context:
+        if context.get("caption"):
+            md.append(f"\n> **Figure Caption (from manuscript):** {context['caption']}")
+        if context.get("contexts"):
+            md.append("\n### Manuscript Context\n")
+            md.append("*How this figure is described in the paper:*\n")
+            for ctx in context["contexts"][:2]:
+                md.append(f"> ...{ctx[:500]}...\n")
+
+    # Process each panel in detail
+    panels = ext.get("panels", [])
+    for panel in panels:
+        panel_id = panel.get("panel_id", "?")
+        panel_type = panel.get("panel_type", "unknown")
+        panel_title = panel.get("panel_title", f"Panel {panel_id}")
+
+        md.append(f"\n### Panel {panel_id}: {panel_title}")
+        md.append(f"**Type:** {panel_type}")
+
+        extraction = panel.get("extraction", {})
+
+        # Axes
+        axes = extraction.get("axes", {})
+        if axes:
+            md.append("\n**Axes:**")
+            if axes.get("x"):
+                x = axes["x"]
+                md.append(f"- X-axis: \"{x.get('label', 'unlabeled')}\"")
+                md.append(f"  - Range: {x.get('min')} to {x.get('max')}")
+                if x.get("ticks"):
+                    md.append(f"  - Tick values: {x['ticks']}")
+            if axes.get("y"):
+                y = axes["y"]
+                md.append(f"- Y-axis: \"{y.get('label', 'unlabeled')}\"")
+                md.append(f"  - Range: {y.get('min')} to {y.get('max')}")
+                if y.get("ticks"):
+                    md.append(f"  - Tick values: {y['ticks']}")
+
+        # Legend (VERBATIM)
+        legend = extraction.get("legend", {})
+        if legend.get("entries"):
+            md.append("\n**Legend (Verbatim from image):**")
+            if legend.get("title"):
+                md.append(f"*{legend['title']}*")
+            for entry in legend["entries"]:
+                label = entry.get("label", "")
+                color = entry.get("color", "")
+                style = entry.get("line_style") or entry.get("style", "")
+                md.append(f"- \"{label}\" — {color} {style}")
+
+        # Curve endpoints (for Kaplan-Meier)
+        endpoints = extraction.get("curve_endpoints", [])
+        if endpoints:
+            md.append("\n**Curve Endpoints:**")
+            for ep in endpoints:
+                md.append(f"- {ep.get('series', 'Series')}: y={ep.get('final_y')} at x={ep.get('final_x')}")
+                if ep.get("note"):
+                    md.append(f"  - Note: {ep['note']}")
+
+        # Key observations
+        observations = extraction.get("key_observations", [])
+        if observations:
+            md.append("\n**Key Observations:**")
+            for obs in observations:
+                md.append(f"- {obs}")
+
+        # Statistical annotations
+        stats = extraction.get("statistical_annotations", [])
+        if stats:
+            md.append("\n**Statistical Annotations:**")
+            for stat in stats:
+                md.append(f"- {stat.get('type', 'stat')}: {stat.get('value', '')}")
+
+        # All visible text
+        all_text = extraction.get("all_visible_text", [])
+        if all_text:
+            md.append("\n**All Visible Text:**")
+            md.append(f"```\n{', '.join(all_text)}\n```")
+
+    # Shared elements
+    shared = ext.get("shared_elements", {})
+    if shared.get("shared_legend") or shared.get("cross_references"):
+        md.append("\n### Shared Elements")
+        if shared.get("shared_legend"):
+            md.append(f"- Shared legend applies to panels: {shared['shared_legend'].get('applies_to', [])}")
+        if shared.get("cross_references"):
+            for ref in shared["cross_references"]:
+                md.append(f"- {ref}")
+
+    return "\n".join(md)
 ```
 
-## Step 8: Generate Combined STRUCTURED.md
+## Step 8: Generate Combined STRUCTURED.md with Manuscript Context
 
 ```python
+import json
 from pathlib import Path
 from datetime import datetime
 
 output_dir = Path("<output_dir>")
 elements_dir = output_dir / "elements"
+extractions_dir = output_dir / "extractions"
 doc_name = "<document_name>"
 
+# Load manuscript context
+context_file = output_dir / "manuscript_context.json"
+manuscript_context = {}
+if context_file.exists():
+    with open(context_file) as f:
+        manuscript_context = json.load(f)
+
+# Load captions
+captions_file = output_dir / "captions.json"
+captions = {"figures": {}, "tables": {}}
+if captions_file.exists():
+    with open(captions_file) as f:
+        captions = json.load(f)
+
 # Read all element files in order
 element_files = sorted(elements_dir.glob("element_*.md"))
 
@@ -419,21 +737,70 @@ sections.append(f"# {doc_name} - Structured Extraction")
 sections.append(f"\n**Original:** {doc_name}")
 sections.append(f"**Extracted:** {datetime.now().isoformat()}")
 sections.append(f"**Elements:** {len(element_files)} visual elements processed")
-sections.append(f"**Pipeline:** structurecc v2.0 (3-phase with verification)")
+sections.append(f"**Pipeline:** structurecc v2.0 (3-phase with verification + manuscript context)")
 sections.append("\n---\n")
 
-#…
+# Table of contents
+sections.append("## Table of Contents\n")
 for i, elem_file in enumerate(element_files, 1):
+    elem_id = elem_file.stem
+    # Try to get title from extraction
+    extract_file = extractions_dir / f"{elem_id}.json"
+    title = f"Element {i}"
+    if extract_file.exists():
+        with open(extract_file) as f:
+            ext = json.load(f)
+        title = ext.get("chart_metadata", {}).get("title") or \
+                ext.get("table_metadata", {}).get("title") or \
+                ext.get("figure_metadata", {}).get("title") or \
+                f"Element {i}"
+    sections.append(f"{i}. [{title}](#{elem_id})")
+sections.append("\n---\n")
+
+# Add each element with context
+for i, elem_file in enumerate(element_files, 1):
+    elem_id = elem_file.stem
+
     with open(elem_file) as f:
         content = f.read()
 
-    sections.append(f"…
+    sections.append(f'<a id="{elem_id}"></a>\n')
+    sections.append(f"## Element {i}: {elem_id}\n")
+
+    # Try to match with manuscript context
+    # Heuristic: Figure 1 = first figure element, etc.
+    fig_num = str(i)
+    if fig_num in manuscript_context.get("figures", {}):
+        ctx = manuscript_context["figures"][fig_num]
+        if ctx.get("caption"):
+            sections.append(f"\n> **Figure Caption:** {ctx['caption']}\n")
+        if ctx.get("contexts"):
+            sections.append("\n### Manuscript Context\n")
+            sections.append("*Relevant text from the manuscript:*\n")
+            for c in ctx["contexts"][:2]:
+                # Truncate and clean
+                clean_ctx = ' '.join(c.split())[:600]
+                sections.append(f"> ...{clean_ctx}...\n")
+
     sections.append(content)
     sections.append("\n---\n")
 
+# Add manuscript summary section
+if manuscript_context.get("figures") or manuscript_context.get("tables"):
+    sections.append("## Manuscript References Summary\n")
+    sections.append("### Figure Captions from Manuscript\n")
+    for fig_num, ctx in manuscript_context.get("figures", {}).items():
+        sections.append(f"- **Figure {fig_num}:** {ctx.get('caption', 'No caption found')}")
+    sections.append("\n### Table Captions from Manuscript\n")
+    for table_num, ctx in manuscript_context.get("tables", {}).items():
+        sections.append(f"- **Table {table_num}:** {ctx.get('caption', 'No caption found')}")
+    sections.append("\n---\n")
+
 # Write combined file
 with open(output_dir / "STRUCTURED.md", "w") as f:
     f.write("\n".join(sections))
+
+print(f"Generated STRUCTURED.md with {len(element_files)} elements and manuscript context")
 ```
 
 ## Step 9: Generate Quality Report
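A caveat on the positional heuristic in Step 8 ("Figure 1 = first figure element"): it drifts as soon as tables and figures interleave in element order. A sketch of a type-aware numbering, under the assumption that each element's Phase 1 classification records its type; `element_types` here is hypothetical:

```python
# Illustrative only: number figures by counting figure-type elements in
# document order, so interleaved tables do not shift figure numbers.
element_types = {"element_001": "figure", "element_002": "table",
                 "element_003": "figure"}  # hypothetical classifications

figure_numbers = {}
count = 0
for elem_id in sorted(element_types):
    if element_types[elem_id] == "figure":
        count += 1
        figure_numbers[elem_id] = str(count)
print(figure_numbers)  # {'element_001': '1', 'element_003': '2'}
```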