structurecc 2.1.0 → 3.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,331 @@
---
name: structure
description: Extract structured data from documents using Claude vision and parallel chunk agents
argument-hint: <path> [--output dir]
allowed-tools: Read, Write, Task, Glob, Bash
model: opus
---

<command-name>structure</command-name>

# Document Structure Extraction

You are extracting structured data from a document using Claude's native vision capabilities and parallel Task agents.

## Input

**Document path:** $ARGUMENTS

## Workflow

### Step 1: Validate Input

1. Check if the file exists at the provided path
2. Determine the file type (PDF, DOCX, PNG, JPG, TIFF, etc.)
3. If the path is invalid, inform the user and stop

27
### Step 2: Determine Processing Strategy

Based on document type:

**For PDFs:**
- Use the Read tool to read the PDF (Claude can read PDFs natively)
- Determine total page count from the read
- If ≤10 pages: Process as single chunk
- If >10 pages: Split into chunks of 5 pages each

**For Images (PNG, JPG, TIFF, BMP):**
- Process as single chunk (one image = one chunk)

40
**For Word Documents (.docx):**
- Use the Read tool to read the .docx file directly (Claude can read these)
- If direct reading fails, fall back to extracting the text with Bash (macOS): `textutil -convert txt "<path>" -stdout`
- Process as single chunk unless very large

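The page-chunking rule above amounts to simple arithmetic; a minimal sketch (the `plan_chunks` helper is hypothetical, not part of the command):

```python
def plan_chunks(total_pages: int, max_single: int = 10, chunk_size: int = 5):
    """Split a page count into inclusive (start, end) ranges per the strategy above."""
    if total_pages <= max_single:
        return [(1, total_pages)]  # small documents stay as a single chunk
    return [(start, min(start + chunk_size - 1, total_pages))
            for start in range(1, total_pages + 1, chunk_size)]
```

For example, `plan_chunks(12)` yields `[(1, 5), (6, 10), (11, 12)]`: the last chunk is allowed to be short rather than padding past the end of the document.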
45
### Step 3: Create Output Directory

Create output directory structure:
```
<source_dir>/<filename>_extracted/
├── chunks/           # Individual chunk JSON files
├── structure.json    # Final merged JSON
└── STRUCTURE.md      # Human-readable markdown
```

55
### Step 4: Launch Chunk Agents (Parallel)

For each chunk, launch a Task agent with subagent_type="general-purpose":

**CRITICAL: Launch ALL chunk agents in a SINGLE message with multiple Task tool calls for true parallelism.**

Each agent receives:
1. The document path
2. Its assigned page range (e.g., pages 1-5)
3. The chunk extractor prompt (embedded below)
4. The output path for its chunk JSON

67
**Chunk Extractor Prompt for Agents:**

```
You are extracting structured data from a document chunk.

## Your Assignment
- Document: {document_path}
- Pages: {start_page} to {end_page} (or "all" for images/small docs)
- Output: Write JSON to {output_path}

## Protocol

### Step 1: SCAN
Read the document using the Read tool. Look at ALL pages in your assigned range.
List every element you see:
- Tables (data tables, comparison tables, any tabular data)
- Figures (charts, graphs, images, diagrams, gels, blots, plots)
- Text blocks (paragraphs, headers, captions)
- Equations (mathematical formulas)
- Lists (bulleted, numbered)

### Step 2: TRANSCRIBE
For EACH element, transcribe EXACTLY what you see:
- Every label, number, unit, symbol
- Every axis label, legend entry, annotation
- Every cell value, header, footnote
- Every variable, subscript, superscript

Rules:
- Use [unclear] for illegible content
- Use [ambiguous: A|B] for uncertain readings (e.g., "0" vs "O")
- Preserve EXACT formatting (1,234.56 not 1234.56)
- Include ALL flags, markers, asterisks, annotations
- Transcribe units exactly as shown (mg/dL, not mg per dL)

### Step 3: STRUCTURE
Output as JSON with this schema:

{
  "chunk": {
    "pages": [start_page, end_page],
    "document": "filename"
  },
  "elements": [
    {
      "id": "element_1",
      "page": N,
      "type": "table|figure|text|equation|list",
      "title": "Title if visible, null otherwise",
      "caption": "Caption if visible, null otherwise",
      "data": {
        // Type-specific structured data - see below
      },
      "raw_transcription": "Exact text as seen in document",
      "position": "top|middle|bottom of page",
      "confidence": 0.0-1.0,
      "notes": "Any observations about quality, legibility, etc."
    }
  ]
}

### Type-Specific Data Structures

**For Tables:**
{
  "headers": ["Col1", "Col2", "Col3"],
  "rows": [
    ["value1", "value2", "value3"],
    ["value4", "value5", {"value": "value6", "flag": "H"}]
  ],
  "footnotes": ["* p < 0.05", "** p < 0.01"]
}

**For Figures (charts/graphs):**
{
  "figure_type": "bar_chart|line_graph|scatter_plot|pie_chart|other",
  "description": "Brief description of what the figure shows",
  "axes": {
    "x": {"label": "Time (hours)", "range": [0, 24]},
    "y": {"label": "Concentration (ng/mL)", "range": [0, 100]}
  },
  "data_series": [
    {"name": "Control", "color": "blue", "values": [[0, 10], [6, 45], [12, 80]]},
    {"name": "Treatment", "color": "red", "values": [[0, 12], [6, 60], [12, 95]]}
  ],
  "legend": ["Control", "Treatment"],
  "annotations": ["Arrow pointing to peak at t=12h"]
}

**For Figures (gels/blots):**
{
  "figure_type": "western_blot|gel_electrophoresis|other",
  "description": "Western blot of protein X expression",
  "lanes": [
    {"position": 1, "label": "Ladder", "bands": [{"position": "250 kDa"}, {"position": "150 kDa"}]},
    {"position": 2, "label": "Control", "bands": [{"position": "~50 kDa", "intensity": "strong"}]},
    {"position": 3, "label": "Sample 1", "bands": [{"position": "~50 kDa", "intensity": "weak"}]}
  ],
  "loading_control": "Beta-actin at ~42 kDa",
  "annotations": ["Arrow indicates target protein"]
}

**For Figures (generic images/diagrams):**
{
  "figure_type": "diagram|photograph|illustration|other",
  "description": "Detailed description of what the image shows",
  "visible_labels": ["Label 1", "Label 2"],
  "annotations": ["Arrow pointing to structure X"],
  "colors_used": ["Red indicates inflammation", "Blue shows normal tissue"]
}

**For Equations:**
{
  "latex": "E = mc^2",
  "variables": {
    "E": "energy",
    "m": "mass",
    "c": "speed of light"
  }
}

**For Text Blocks:**
{
  "content": "The full text content",
  "type": "paragraph|header|caption|footnote"
}

**For Lists:**
{
  "list_type": "bulleted|numbered",
  "items": ["Item 1", "Item 2", "Item 3"]
}

## Rules
1. Extract EXACTLY what's visible - never fabricate data
2. If a figure has a caption nearby, include it
3. Link figures to their captions within your chunk
4. Low confidence (<0.8) = flag for review in notes
5. For tables with merged cells, represent structure accurately
6. Include ALL visible data points, not just a sample

## Output
Write your JSON to: {output_path}
```

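Before merging, the orchestrator can sanity-check each chunk file against the element schema above. A minimal sketch, assuming only the fields the merge step actually relies on (the `validate_chunk` helper itself is hypothetical):

```python
# Keys the merge step reads from every element (taken from the schema above)
REQUIRED_ELEMENT_KEYS = {"id", "page", "type", "raw_transcription", "confidence"}

def validate_chunk(chunk: dict) -> list[str]:
    """Return a list of problems; an empty list means the chunk looks mergeable."""
    if "chunk" not in chunk or "elements" not in chunk:
        return ["missing top-level 'chunk' or 'elements' key"]
    problems = []
    for elem in chunk["elements"]:
        missing = REQUIRED_ELEMENT_KEYS - elem.keys()
        if missing:
            problems.append(f"{elem.get('id', '?')}: missing {sorted(missing)}")
    return problems
```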
212
### Step 5: Wait for Chunk Agents

After launching all chunk agents, wait for them to complete.
Each agent will write its chunk JSON to the chunks/ directory.

### Step 6: Merge Chunks

Once all chunks are complete:

1. Read all chunk JSON files from the chunks/ directory
2. Merge into a single structure, renumbering elements globally:

224
```python
import json
from datetime import datetime, timezone

def merge_chunks(chunk_files, document_path):
    result = {
        "source": document_path,
        "extracted": datetime.now(timezone.utc).isoformat(),
        "pages": []
    }

    all_elements = []
    element_counter = 1

    for chunk_file in sorted(chunk_files):
        with open(chunk_file) as f:
            chunk_data = json.load(f)
        for element in chunk_data["elements"]:
            # Renumber elements globally across chunks
            element["id"] = f"element_{element_counter}"
            element_counter += 1
            all_elements.append(element)

    # Group by page
    pages = {}
    for elem in all_elements:
        page_num = elem["page"]
        if page_num not in pages:
            pages[page_num] = {"page": page_num, "elements": []}
        pages[page_num]["elements"].append(elem)

    result["pages"] = [pages[p] for p in sorted(pages)]

    # Calculate summary
    result["summary"] = {
        "total_pages": len(result["pages"]),
        "tables": sum(1 for p in result["pages"] for e in p["elements"] if e["type"] == "table"),
        "figures": sum(1 for p in result["pages"] for e in p["elements"] if e["type"] == "figure"),
        "equations": sum(1 for p in result["pages"] for e in p["elements"] if e["type"] == "equation"),
        "average_confidence": (
            sum(e.get("confidence", 0) for e in all_elements) / len(all_elements)
            if all_elements else None
        )
    }

    return result
```

3. Write the merged result to structure.json

267
### Step 7: Generate Markdown Summary

Create STRUCTURE.md with a human-readable format:

```markdown
# Document Extraction

**Source:** {document_path}
**Extracted:** {timestamp}
**Pages:** {total} | **Tables:** {count} | **Figures:** {count} | **Equations:** {count}
**Average Confidence:** {score}

---

## Page {N}

### {Element Title or "Element {id}"}

{Markdown representation of the element}

- For tables: Render as markdown table
- For figures: Description + key data points
- For equations: LaTeX in $$ blocks
- For text: The content

**Confidence:** {score}

---
```

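Rendering a table element into the summary can be sketched like this. The `table_to_markdown` helper is hypothetical; it assumes the table schema from the chunk extractor prompt, where a cell is either a plain value or an object carrying a flag.

```python
def table_to_markdown(headers, rows):
    """Render an extracted table element as a markdown table."""
    def cell(c):
        # Flagged cells arrive as {"value": ..., "flag": ...}; keep the flag visible
        return f'{c["value"]} [{c["flag"]}]' if isinstance(c, dict) else str(c)
    lines = ["| " + " | ".join(headers) + " |",
             "|" + "|".join(" --- " for _ in headers) + "|"]
    for row in rows:
        lines.append("| " + " | ".join(cell(c) for c in row) + " |")
    return "\n".join(lines)
```

For example, a lab-result row flagged "H" renders as `| Glucose | 180 [H] |`, so out-of-range markers survive into the human-readable summary.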
297
### Step 8: Display Completion

```
┌──────────────────────────────────────────────────────────────────────┐
│ EXTRACTION COMPLETE                                                  │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│ Source: {document_path}                                              │
│ Pages: {total} | Tables: {count} | Figures: {count}                  │
│ Average Confidence: {score}                                          │
│                                                                      │
│ Output:                                                              │
│   {output_dir}/structure.json                                        │
│   {output_dir}/STRUCTURE.md                                          │
│                                                                      │
│ Low Confidence Items: {count}                                        │
│   {List any elements with confidence < 0.8}                          │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```

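Collecting the low-confidence list for the display above is a single pass over the merged structure. A sketch (hypothetical helper; assumes the merged shape produced in Step 6):

```python
def low_confidence_items(merged: dict, threshold: float = 0.8):
    """Return (id, confidence) pairs for elements flagged for review."""
    return [(e["id"], e["confidence"])
            for page in merged["pages"] for e in page["elements"]
            if e["confidence"] < threshold]
```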
318
## Error Handling

- If the document cannot be read: report the error and suggest checking the file format
- If a chunk agent fails: report which chunk failed and continue with the others
- If the merge fails: keep the individual chunk files and report the merge error
- If the output directory cannot be created: fall back to the source document's directory

## Notes

- This command uses Claude's native vision - no external APIs needed
- Parallel chunk processing maximizes throughput
- Each chunk agent has a 200K-token context window - plenty for 5 pages
- Chunks preserve figure-caption relationships (usually within the same chunk)
- Edge cases (e.g., a figure on page 5 with its caption on page 6) are rare but detectable

package/install.js ADDED
@@ -0,0 +1,93 @@
#!/usr/bin/env node

const fs = require('fs');
const path = require('path');
const os = require('os');

const PLUGIN_NAME = 'structurecc';
const DEST_DIR = path.join(os.homedir(), '.claude', 'plugins', PLUGIN_NAME);

// Files to copy
const FILES = [
  '.claude-plugin/plugin.json',
  'commands/structure.md',
  'commands/structure-batch.md',
  'prompts/chunk-extractor.md',
  'README.md'
];

function copyFile(src, dest) {
  const destDir = path.dirname(dest);
  if (!fs.existsSync(destDir)) {
    fs.mkdirSync(destDir, { recursive: true });
  }
  fs.copyFileSync(src, dest);
}

function install() {
  console.log(`
┌──────────────────────────────────────────────────────────────────────┐
│                                                                      │
│ ███████╗████████╗██████╗ ██╗ ██╗ ██████╗████████╗██╗ ██╗██████╗│
│ ██╔════╝╚══██╔══╝██╔══██╗██║ ██║██╔════╝╚══██╔══╝██║ ██║██╔══██╗
│ ███████╗ ██║ ██████╔╝██║ ██║██║ ██║ ██║ ██║██████╔╝
│ ╚════██║ ██║ ██╔══██╗██║ ██║██║ ██║ ██║ ██║██╔══██╗
│ ███████║ ██║ ██║ ██║╚██████╔╝╚██████╗ ██║ ╚██████╔╝██║ ██║
│ ╚══════╝ ╚═╝ ╚═╝ ╚═╝ ╚═════╝ ╚═════╝ ╚═╝ ╚═════╝ ╚═╝ ╚═╝
│                                                                      │
│ Document Structure Extraction | Claude Code Plugin | v3.0            │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
`);

  const sourceDir = __dirname;

  console.log(`Installing to: ${DEST_DIR}\n`);

  // Create destination directory
  if (!fs.existsSync(DEST_DIR)) {
    fs.mkdirSync(DEST_DIR, { recursive: true });
  }

  // Copy each file
  let copied = 0;
  for (const file of FILES) {
    const src = path.join(sourceDir, file);
    const dest = path.join(DEST_DIR, file);

    if (fs.existsSync(src)) {
      copyFile(src, dest);
      console.log(`  ✓ ${file}`);
      copied++;
    } else {
      console.log(`  ✗ ${file} (not found)`);
    }
  }

  console.log(`
────────────────────────────────────────────────────────────────────────

Installation complete! ${copied}/${FILES.length} files installed.

USAGE:
  /structure <path>        Extract structured data from a document
  /structure:batch <dir>   Process multiple documents

SUPPORTED FORMATS:
  - PDF documents
  - Word documents (.docx)
  - Images (PNG, JPG, TIFF)

EXAMPLE:
  /structure ~/Documents/lab_report.pdf
  /structure ~/Pictures/gel_image.png
  /structure:batch ~/Documents/patient_files/

Output: JSON + Markdown in same directory as source document

────────────────────────────────────────────────────────────────────────
`);
}

// Run installation
install();
package/package.json CHANGED
@@ -1,33 +1,34 @@
  {
    "name": "structurecc",
-   "version": "2.1.0",
-   "description": "Extract structured data from PDFs, Word docs, and images using Claude Code.",
+   "version": "3.0.0",
+   "description": "Claude Code plugin for extracting structured data from documents using native vision and parallel Task agents",
+   "author": "UTMB Diagnostic Center",
+   "license": "MIT",
+   "bin": {
+     "structurecc": "./install.js"
+   },
+   "files": [
+     "install.js",
+     ".claude-plugin/",
+     "commands/",
+     "prompts/",
+     "README.md"
+   ],
    "keywords": [
+     "claude-code",
+     "plugin",
      "document-extraction",
      "pdf",
-     "structure",
-     "claude-code",
-     "llm",
-     "multimodal",
-     "tables",
-     "figures",
-     "charts",
-     "markdown",
-     "ai-agents"
+     "vision",
+     "structured-data",
+     "pharmacogenomics",
+     "medical-documents"
    ],
-   "author": "James Weatherhead",
-   "license": "MIT",
    "repository": {
      "type": "git",
      "url": "https://github.com/JamesWeatherhead/structure"
    },
-   "homepage": "https://github.com/JamesWeatherhead/structure#readme",
-   "bin": {
-     "structurecc": "./bin/install.js"
-   },
-   "files": [
-     "bin/",
-     "commands/",
-     "agents/"
-   ]
+   "engines": {
+     "node": ">=16.0.0"
+   }
  }