@biaoo/tiangong-wiki 0.3.10 → 0.3.11

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -117,7 +117,7 @@ tiangong-wiki dashboard # open dashboard in browse
 
 > Environment variables are managed via `.wiki.env` (created by `tiangong-wiki setup`). The CLI prefers the nearest local `.wiki.env`, then falls back to the global default workspace config. See [references/troubleshooting.md](./references/troubleshooting.md) for the full reference. For a centralized Linux + `systemd` + Nginx deployment, see [references/centralized-service-deployment.md](./references/centralized-service-deployment.md). That deployment guide now also includes Git repository / GitHub remote setup for daemon-side commit and optional auto-push.
 
- For document-heavy vaults, set `WIKI_PARSER_SKILLS=document-granular-decompose` and configure `UNSTRUCTURED_API_BASE_URL` plus `UNSTRUCTURED_AUTH_TOKEN` to prefer the TianGong Unstructure parser. `document-granular-decompose` is a broad document/image parser for PDF, Word, PowerPoint, Excel, Markdown, and common image formats; the skill's `SKILL.md` remains the source of truth for the exact extension allowlist. The wiki workflow uses `return_txt=true` and consumes the plain `txt` text as the agent input; `UNSTRUCTURED_PROVIDER` and `UNSTRUCTURED_MODEL` are optional overrides.
+ For document-heavy vaults, set `WIKI_PARSER_SKILLS=document-granular-decompose` and configure `UNSTRUCTURED_API_BASE_URL` plus `UNSTRUCTURED_AUTH_TOKEN` to prefer the TianGong Unstructure parser. `document-granular-decompose` is a broad document/image parser for PDF, Word, PowerPoint, Excel, Markdown, and common image formats; the skill's `SKILL.md` remains the source of truth for the exact extension allowlist. The wiki workflow uses `return_txt=true`, consumes the plain `txt` text as the agent input, and stores the extracted plain-text snapshot under `.queue-artifacts/<file-artifact>/extracted-fulltext.txt` with metadata in the queue result; `UNSTRUCTURED_PROVIDER` and `UNSTRUCTURED_MODEL` are optional overrides.
 
 ## MCP Server
 
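A minimal sketch of the configuration gate described in the paragraph above, using only the variable names from this README (the helper itself is illustrative, not part of the package; a comma-separated `WIKI_PARSER_SKILLS` list is assumed):

```js
// Hedged sketch: mirrors the README paragraph on when the TianGong Unstructure
// parser path is preferred. The helper is illustrative, not part of the package;
// a comma-separated WIKI_PARSER_SKILLS value is an assumption.
function prefersUnstructureParser(env = process.env) {
  const skills = (env.WIKI_PARSER_SKILLS ?? "").split(",").map((s) => s.trim());
  return (
    skills.includes("document-granular-decompose") &&
    Boolean(env.UNSTRUCTURED_API_BASE_URL) &&
    Boolean(env.UNSTRUCTURED_AUTH_TOKEN)
  );
  // UNSTRUCTURED_PROVIDER and UNSTRUCTURED_MODEL stay optional overrides.
}
```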
package/README.zh-CN.md CHANGED
@@ -117,7 +117,7 @@ tiangong-wiki dashboard # 在浏览器中打开仪
 
 > Environment variables are managed via `.wiki.env` (created by `tiangong-wiki setup`). The CLI prefers the nearest local `.wiki.env` and falls back to the global default workspace config when none is found. See [references/troubleshooting.md](./references/troubleshooting.md) for the full reference. For a centralized service deployment (Linux + `systemd` + Nginx), see [references/centralized-service-deployment.md](./references/centralized-service-deployment.md). That deployment guide now also covers Git repository initialization, GitHub remote configuration, and the Git settings for daemon auto-push.
 
- For vaults that are mostly document parsing work, set `WIKI_PARSER_SKILLS=document-granular-decompose` and configure `UNSTRUCTURED_API_BASE_URL` and `UNSTRUCTURED_AUTH_TOKEN` so the workflow prefers the TianGong Unstructure parser. `document-granular-decompose` is a broader document/image parser for PDF, Word, PowerPoint, Excel, Markdown, and common image formats; the skill's own `SKILL.md` is authoritative for the exact extension allowlist. The wiki workflow uses `return_txt=true` and takes the returned plain `txt` text as the agent's main input; `UNSTRUCTURED_PROVIDER` and `UNSTRUCTURED_MODEL` are optional overrides only.
+ For vaults that are mostly document parsing work, set `WIKI_PARSER_SKILLS=document-granular-decompose` and configure `UNSTRUCTURED_API_BASE_URL` and `UNSTRUCTURED_AUTH_TOKEN` so the workflow prefers the TianGong Unstructure parser. `document-granular-decompose` is a broader document/image parser for PDF, Word, PowerPoint, Excel, Markdown, and common image formats; the skill's own `SKILL.md` is authoritative for the exact extension allowlist. The wiki workflow uses `return_txt=true`, takes the returned plain `txt` text as the agent's main input, and saves the parsed plain-text snapshot to `.queue-artifacts/<file-artifact>/extracted-fulltext.txt` while keeping the metadata in the queue result; `UNSTRUCTURED_PROVIDER` and `UNSTRUCTURED_MODEL` are optional overrides only.
 
 ## MCP Server
 
package/dist/core/db.js CHANGED
@@ -90,6 +90,20 @@ function ensureBaseTables(db, embeddingDimensions) {
 CREATE INDEX IF NOT EXISTS idx_vchangelog_sync ON vault_changelog(sync_id);
 CREATE INDEX IF NOT EXISTS idx_vchangelog_time ON vault_changelog(detected_at);
 
+ CREATE TABLE IF NOT EXISTS vault_extractions (
+ file_id TEXT NOT NULL,
+ content_hash TEXT NOT NULL,
+ artifact_path TEXT NOT NULL,
+ artifact_sha256 TEXT NOT NULL,
+ parser_skill TEXT,
+ char_count INTEGER NOT NULL,
+ created_at TEXT NOT NULL,
+ updated_at TEXT NOT NULL,
+ PRIMARY KEY(file_id, content_hash)
+ );
+
+ CREATE INDEX IF NOT EXISTS idx_vex_file ON vault_extractions(file_id);
+
 CREATE TABLE IF NOT EXISTS vault_processing_queue (
 file_id TEXT PRIMARY KEY,
 status TEXT DEFAULT 'pending',
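The new `vault_extractions` table keys each snapshot by `(file_id, content_hash)`, so a re-parsed file overwrites its row and rows for stale content hashes simply stop matching the join used elsewhere in this diff. A minimal read-back sketch, assuming a better-sqlite3-style handle (the `prepare`/`run` calls in this diff suggest that API; the helper name is illustrative):

```js
// Hedged sketch: look up the extraction snapshot recorded for a file's current
// content hash. `db` is assumed to be a better-sqlite3 Database; findExtraction
// is not part of the package.
function findExtraction(db, fileId, contentHash) {
  return (
    db
      .prepare(
        `SELECT artifact_path AS artifactPath,
                artifact_sha256 AS artifactSha256,
                parser_skill AS parserSkill,
                char_count AS charCount
           FROM vault_extractions
          WHERE file_id = ? AND content_hash = ?`
      )
      .get(fileId, contentHash) ?? null
  );
}
```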
@@ -11,7 +11,7 @@ import { ensureLocalVaultFile } from "./vault.js";
 import { buildVaultWorkflowPrompt, ensureWorkflowArtifactSet, getWorkflowArtifactSet, } from "./workflow-context.js";
 import { readWorkflowResult } from "./workflow-result.js";
 import { AppError } from "../utils/errors.js";
- import { readTextFileSync } from "../utils/fs.js";
+ import { pathExistsSync, readTextFileSync, sha256Text } from "../utils/fs.js";
 import { addSeconds, toOffsetIso } from "../utils/time.js";
 const INLINE_WORKFLOW_ATTEMPTS = 2;
 const MAX_QUEUE_ERROR_RETRIES = 3;
@@ -90,6 +90,10 @@ function mapQueueRow(row) {
 appliedTypeNames: parseOptionalStringArray(row.appliedTypeNames),
 proposedTypeNames: parseOptionalStringArray(row.proposedTypeNames),
 skillsUsed: parseOptionalStringArray(row.skillsUsed),
+ extractedTextPath: typeof row.extractedTextPath === "string" ? row.extractedTextPath : null,
+ extractedTextSha256: typeof row.extractedTextSha256 === "string" ? row.extractedTextSha256 : null,
+ extractedTextParserSkill: typeof row.extractedTextParserSkill === "string" ? row.extractedTextParserSkill : null,
+ extractedTextCharCount: typeof row.extractedTextCharCount === "number" ? row.extractedTextCharCount : null,
 fileName: typeof row.fileName === "string" ? row.fileName : undefined,
 fileExt: typeof row.fileExt === "string" ? row.fileExt : null,
 sourceType: typeof row.sourceType === "string" ? row.sourceType : null,
@@ -113,8 +117,8 @@ function claimQueueItems(db, limit, options) {
 ].join("\n AND ");
 const select = db.prepare(`
 SELECT
- file_id AS fileId,
- status,
+ vault_processing_queue.file_id AS fileId,
+ vault_processing_queue.status,
 priority,
 queued_at AS queuedAt,
 claimed_at AS claimedAt,
@@ -141,16 +145,23 @@ function claimQueueItems(db, limit, options) {
 vault_files.file_ext AS fileExt,
 vault_files.source_type AS sourceType,
 vault_files.file_size AS fileSize,
- vault_files.file_path AS filePath
+ vault_files.file_path AS filePath,
+ vault_extractions.artifact_path AS extractedTextPath,
+ vault_extractions.artifact_sha256 AS extractedTextSha256,
+ vault_extractions.parser_skill AS extractedTextParserSkill,
+ vault_extractions.char_count AS extractedTextCharCount
 FROM vault_processing_queue
 LEFT JOIN vault_files ON vault_files.id = vault_processing_queue.file_id
+ LEFT JOIN vault_extractions
+ ON vault_extractions.file_id = vault_processing_queue.file_id
+ AND vault_extractions.content_hash = vault_files.content_hash
 WHERE (
 vault_processing_queue.status = 'pending'
 OR (
 ${errorEligibility}
 )
 )${filter.clause}${exclude.clause}
- ORDER BY priority DESC, queued_at ASC
+ ORDER BY vault_processing_queue.priority DESC, vault_processing_queue.queued_at ASC
 LIMIT ?
 `);
 const markProcessing = db.prepare(`
@@ -202,8 +213,8 @@ function claimQueueItems(db, limit, options) {
 function fetchQueueItemsByStatus(db, status) {
 const rows = db.prepare(`
 SELECT
- file_id AS fileId,
- status,
+ vault_processing_queue.file_id AS fileId,
+ vault_processing_queue.status,
 priority,
 queued_at AS queuedAt,
 claimed_at AS claimedAt,
@@ -230,19 +241,26 @@ function fetchQueueItemsByStatus(db, status) {
 vault_files.file_ext AS fileExt,
 vault_files.source_type AS sourceType,
 vault_files.file_size AS fileSize,
- vault_files.file_path AS filePath
+ vault_files.file_path AS filePath,
+ vault_extractions.artifact_path AS extractedTextPath,
+ vault_extractions.artifact_sha256 AS extractedTextSha256,
+ vault_extractions.parser_skill AS extractedTextParserSkill,
+ vault_extractions.char_count AS extractedTextCharCount
 FROM vault_processing_queue
 LEFT JOIN vault_files ON vault_files.id = vault_processing_queue.file_id
- ${status ? "WHERE status = ?" : ""}
- ORDER BY priority DESC, queued_at ASC
+ LEFT JOIN vault_extractions
+ ON vault_extractions.file_id = vault_processing_queue.file_id
+ AND vault_extractions.content_hash = vault_files.content_hash
+ ${status ? "WHERE vault_processing_queue.status = ?" : ""}
+ ORDER BY vault_processing_queue.priority DESC, vault_processing_queue.queued_at ASC
 `).all(...(status ? [status] : []));
 return rows.map(mapQueueRow);
 }
 function fetchQueueItemByFileId(db, fileId) {
 const row = db.prepare(`
 SELECT
- file_id AS fileId,
- status,
+ vault_processing_queue.file_id AS fileId,
+ vault_processing_queue.status,
 priority,
 queued_at AS queuedAt,
 claimed_at AS claimedAt,
@@ -269,9 +287,16 @@ function fetchQueueItemByFileId(db, fileId) {
 vault_files.file_ext AS fileExt,
 vault_files.source_type AS sourceType,
 vault_files.file_size AS fileSize,
- vault_files.file_path AS filePath
+ vault_files.file_path AS filePath,
+ vault_extractions.artifact_path AS extractedTextPath,
+ vault_extractions.artifact_sha256 AS extractedTextSha256,
+ vault_extractions.parser_skill AS extractedTextParserSkill,
+ vault_extractions.char_count AS extractedTextCharCount
 FROM vault_processing_queue
 LEFT JOIN vault_files ON vault_files.id = vault_processing_queue.file_id
+ LEFT JOIN vault_extractions
+ ON vault_extractions.file_id = vault_processing_queue.file_id
+ AND vault_extractions.content_hash = vault_files.content_hash
 WHERE vault_processing_queue.file_id = ?
 `).get(fileId);
 return row ? mapQueueRow(row) : null;
@@ -332,8 +357,8 @@ function fetchStaleProcessingQueueItems(db, processingOwnerId) {
 const cutoff = toOffsetIso(addSeconds(new Date(), -PROCESSING_STALE_THRESHOLD_SECONDS));
 const rows = db.prepare(`
 SELECT
- file_id AS fileId,
- status,
+ vault_processing_queue.file_id AS fileId,
+ vault_processing_queue.status,
 priority,
 queued_at AS queuedAt,
 claimed_at AS claimedAt,
@@ -360,10 +385,17 @@ function fetchStaleProcessingQueueItems(db, processingOwnerId) {
 vault_files.file_ext AS fileExt,
 vault_files.source_type AS sourceType,
 vault_files.file_size AS fileSize,
- vault_files.file_path AS filePath
+ vault_files.file_path AS filePath,
+ vault_extractions.artifact_path AS extractedTextPath,
+ vault_extractions.artifact_sha256 AS extractedTextSha256,
+ vault_extractions.parser_skill AS extractedTextParserSkill,
+ vault_extractions.char_count AS extractedTextCharCount
 FROM vault_processing_queue
 LEFT JOIN vault_files ON vault_files.id = vault_processing_queue.file_id
- WHERE status = 'processing'
+ LEFT JOIN vault_extractions
+ ON vault_extractions.file_id = vault_processing_queue.file_id
+ AND vault_extractions.content_hash = vault_files.content_hash
+ WHERE vault_processing_queue.status = 'processing'
 AND COALESCE(processing_owner_id, '') != ?
 AND (
 (heartbeat_at IS NOT NULL AND julianday(heartbeat_at) <= julianday(?))
@@ -513,10 +545,104 @@ function formatQueueErrorMessage(message, autoRetryExhausted) {
 : "";
 return `${message}${autoRetrySuffix}`.slice(0, 1_000);
 }
- function applyWorkflowManifest(db, fileId, manifest, resultManifestPath, currentAttempts) {
+ function resolveCurrentVaultContentHash(db, fileId) {
+ const row = db
+ .prepare("SELECT content_hash AS contentHash FROM vault_files WHERE id = ?")
+ .get(fileId);
+ return typeof row?.contentHash === "string" && row.contentHash.length > 0 ? row.contentHash : null;
+ }
+ function inferExtractionParserSkill(manifest) {
+ const explicit = manifest.extractedText?.parserSkill?.trim();
+ if (explicit) {
+ return explicit;
+ }
+ return manifest.skillsUsed.find((skill) => skill !== "tiangong-wiki-skill") ?? null;
+ }
+ function persistWorkflowExtraction(db, fileId, manifest, extractedTextPath, processedAt) {
+ const contentHash = resolveCurrentVaultContentHash(db, fileId);
+ if (!contentHash) {
+ return;
+ }
+ const expectedPath = path.resolve(extractedTextPath);
+ if (manifest.extractedText?.path && path.resolve(manifest.extractedText.path) !== expectedPath) {
+ throw new AppError("result.extractedText.path must match EXTRACTED_TEXT_PATH", "runtime", {
+ expectedPath,
+ actualPath: manifest.extractedText.path,
+ });
+ }
+ const extractedText = pathExistsSync(expectedPath) ? readTextFileSync(expectedPath) : "";
+ if (extractedText.length === 0) {
+ if (manifest.extractedText) {
+ throw new AppError("result.extractedText was declared but EXTRACTED_TEXT_PATH is empty", "runtime", {
+ expectedPath,
+ });
+ }
+ db.prepare("DELETE FROM vault_extractions WHERE file_id = ? AND content_hash = ?").run(fileId, contentHash);
+ return;
+ }
+ const artifactSha256 = sha256Text(extractedText);
+ if (manifest.extractedText?.sha256 && manifest.extractedText.sha256 !== artifactSha256) {
+ throw new AppError("result.extractedText.sha256 does not match EXTRACTED_TEXT_PATH content", "runtime", {
+ expectedSha256: artifactSha256,
+ actualSha256: manifest.extractedText.sha256,
+ });
+ }
+ db.prepare(`
+ INSERT INTO vault_extractions(
+ file_id,
+ content_hash,
+ artifact_path,
+ artifact_sha256,
+ parser_skill,
+ char_count,
+ created_at,
+ updated_at
+ )
+ VALUES (
+ @file_id,
+ @content_hash,
+ @artifact_path,
+ @artifact_sha256,
+ @parser_skill,
+ @char_count,
+ @created_at,
+ @updated_at
+ )
+ ON CONFLICT(file_id, content_hash) DO UPDATE SET
+ artifact_path = excluded.artifact_path,
+ artifact_sha256 = excluded.artifact_sha256,
+ parser_skill = excluded.parser_skill,
+ char_count = excluded.char_count,
+ updated_at = excluded.updated_at
+ `).run({
+ file_id: fileId,
+ content_hash: contentHash,
+ artifact_path: expectedPath,
+ artifact_sha256: artifactSha256,
+ parser_skill: inferExtractionParserSkill(manifest),
+ char_count: extractedText.length,
+ created_at: processedAt,
+ updated_at: processedAt,
+ });
+ }
+ function buildExtractionResultFields(manifest, extractedTextPath) {
+ const expectedPath = path.resolve(extractedTextPath);
+ const extractedText = pathExistsSync(expectedPath) ? readTextFileSync(expectedPath) : "";
+ if (extractedText.length === 0) {
+ return {};
+ }
+ return {
+ extractedTextPath: expectedPath,
+ extractedTextSha256: sha256Text(extractedText),
+ extractedTextParserSkill: inferExtractionParserSkill(manifest),
+ extractedTextCharCount: extractedText.length,
+ };
+ }
+ function applyWorkflowManifest(db, fileId, manifest, resultManifestPath, extractedTextPath, currentAttempts) {
 const resultPageId = manifest.createdPageIds[0] ?? manifest.updatedPageIds[0] ?? null;
 const status = manifest.status;
 const processedAt = toOffsetIso();
+ persistWorkflowExtraction(db, fileId, manifest, extractedTextPath, processedAt);
 if (status === "error") {
 const failureState = buildQueueFailureState(manifest.reason);
 const nextAttempts = currentAttempts + 1;
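`sha256Text` is imported from `../utils/fs.js` and its implementation is not part of this diff; a plausible sketch of the digest the comparison above relies on, assuming Node's built-in `crypto`, UTF-8 input, and a hex output:

```js
import { createHash } from "node:crypto";

// Hedged sketch only: the real sha256Text lives in ../utils/fs.js and is not
// shown here. UTF-8 input hashed to a hex digest is an assumption, chosen to
// match the string comparison in persistWorkflowExtraction above.
function sha256TextSketch(text) {
  return createHash("sha256").update(text, "utf8").digest("hex");
}
```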
@@ -694,7 +820,8 @@ function recoverStaleProcessingQueueItems(input) {
 });
 if (recoveredManifest && item.resultManifestPath) {
 assertTemplateEvolutionAllowed(recoveredManifest, input.templateEvolution);
- const outcome = applyWorkflowManifest(input.db, item.fileId, recoveredManifest, item.resultManifestPath, item.attempts);
+ const extractedTextPath = getWorkflowArtifactSet(input.paths, item.fileId).extractedTextPath;
+ const outcome = applyWorkflowManifest(input.db, item.fileId, recoveredManifest, item.resultManifestPath, extractedTextPath, item.attempts);
 input.log?.(`${item.fileId}: recovered stale processing with persisted result status=${outcome.status} thread=${recoveredManifest.threadId} ${formatManifestLogFields(recoveredManifest)} result=${item.resultManifestPath}`);
 recovered.push({
 fileId: item.fileId,
@@ -708,6 +835,7 @@ function recoverStaleProcessingQueueItems(input) {
 updatedPageIds: recoveredManifest.updatedPageIds,
 proposedTypeNames: recoveredManifest.proposedTypes.map((entry) => entry.name),
 resultManifestPath: item.resultManifestPath,
+ ...buildExtractionResultFields(recoveredManifest, extractedTextPath),
 });
 continue;
 }
@@ -783,6 +911,7 @@ function prepareCodexWorkflowInput(paths, item, file, localFilePath, env, allowT
 workspaceRoot,
 vaultFilePath: localFilePath,
 resultJsonPath: artifacts.resultPath,
+ extractedTextPath: artifacts.extractedTextPath,
 allowTemplateEvolution,
 });
 ensureWorkflowArtifactSet(paths, {
@@ -796,6 +925,7 @@ function prepareCodexWorkflowInput(paths, item, file, localFilePath, env, allowT
 vaultPath: paths.vaultPath,
 localFilePath,
 resultJsonPath: artifacts.resultPath,
+ extractedTextPath: artifacts.extractedTextPath,
 skillArtifactsPath: artifacts.skillArtifactsPath,
 file,
 queue: {
@@ -817,6 +947,7 @@ function prepareCodexWorkflowInput(paths, item, file, localFilePath, env, allowT
 promptText,
 queueItemPath: artifacts.queueItemPath,
 resultPath: artifacts.resultPath,
+ extractedTextPath: artifacts.extractedTextPath,
 skillArtifactsPath: artifacts.skillArtifactsPath,
 model: resolveAgentSettings(env).model,
 env,
@@ -903,7 +1034,7 @@ async function processClaimedQueueItem(input) {
 }));
 assertTemplateEvolutionAllowed(manifest, templateEvolution);
 finalOutcome = {
- outcome: applyWorkflowManifest(db, item.fileId, manifest, artifacts.resultPath, item.attempts),
+ outcome: applyWorkflowManifest(db, item.fileId, manifest, artifacts.resultPath, artifacts.extractedTextPath, item.attempts),
 manifest,
 handleThreadId: handle.threadId,
 };
@@ -925,7 +1056,7 @@ async function processClaimedQueueItem(input) {
 if (recoveredManifest) {
 assertTemplateEvolutionAllowed(recoveredManifest, templateEvolution);
 finalOutcome = {
- outcome: applyWorkflowManifest(db, item.fileId, recoveredManifest, artifacts.resultPath, item.attempts),
+ outcome: applyWorkflowManifest(db, item.fileId, recoveredManifest, artifacts.resultPath, artifacts.extractedTextPath, item.attempts),
 manifest: recoveredManifest,
 handleThreadId: recoveredManifest.threadId,
 };
@@ -956,6 +1087,7 @@ async function processClaimedQueueItem(input) {
 updatedPageIds: finalOutcome.manifest.updatedPageIds,
 proposedTypeNames: finalOutcome.manifest.proposedTypes.map((entry) => entry.name),
 resultManifestPath: artifacts.resultPath,
+ ...buildExtractionResultFields(finalOutcome.manifest, artifacts.extractedTextPath),
 },
 };
 }
@@ -965,7 +1097,8 @@ async function processClaimedQueueItem(input) {
 : null;
 if (recoveredManifest && resultManifestPath) {
 assertTemplateEvolutionAllowed(recoveredManifest, templateEvolution);
- const recoveredOutcome = applyWorkflowManifest(db, item.fileId, recoveredManifest, resultManifestPath, item.attempts);
+ const extractedTextPath = getWorkflowArtifactSet(paths, item.fileId).extractedTextPath;
+ const recoveredOutcome = applyWorkflowManifest(db, item.fileId, recoveredManifest, resultManifestPath, extractedTextPath, item.attempts);
 input.log?.(`${item.fileId}: recovered persisted workflow result after terminal failure status=${recoveredOutcome.status} thread=${recoveredManifest.threadId} ${formatManifestLogFields(recoveredManifest)} result=${resultManifestPath} message=${formatWorkflowError(error)}`);
 return {
 status: recoveredOutcome.status,
@@ -981,6 +1114,7 @@ async function processClaimedQueueItem(input) {
 updatedPageIds: recoveredManifest.updatedPageIds,
 proposedTypeNames: recoveredManifest.proposedTypes.map((entry) => entry.name),
 resultManifestPath,
+ ...buildExtractionResultFields(recoveredManifest, extractedTextPath),
 },
 };
 }
@@ -1101,6 +1235,7 @@ export async function processVaultQueueBatch(env = process.env, options = {}) {
 };
 for (const recoveredItem of recoverStaleProcessingQueueItems({
 db,
+ paths,
 processingOwnerId,
 log: options.log,
 templateEvolution,
@@ -42,6 +42,7 @@ export function getWorkflowArtifactSet(paths, queueItemId) {
 queueItemPath: path.join(rootDir, "queue-item.json"),
 promptPath: path.join(rootDir, "prompt.md"),
 resultPath: path.join(rootDir, "result.json"),
+ extractedTextPath: path.join(rootDir, "extracted-fulltext.txt"),
 skillArtifactsPath: path.join(rootDir, "skill-artifacts"),
 };
 }
@@ -52,6 +53,7 @@ export function buildVaultWorkflowPrompt(input) {
 `WORKSPACE_ROOT=${input.workspaceRoot}`,
 `VAULT_FILE_PATH=${input.vaultFilePath}`,
 `RESULT_JSON_PATH=${input.resultJsonPath}`,
+ `EXTRACTED_TEXT_PATH=${input.extractedTextPath}`,
 `ALLOW_TEMPLATE_EVOLUTION=${input.allowTemplateEvolution ? "true" : "false"}`,
 "",
 "## Goal",
@@ -85,6 +87,7 @@ export function buildVaultWorkflowPrompt(input) {
 "2. Read the target vault file at VAULT_FILE_PATH. Refer to `references/vault-to-wiki-instruction.md` (Phase 1) in the wiki package for file-type-specific reading strategies, parser skill discovery, image handling, and metadata utilization.",
 " - If `WIKI_PARSER_SKILLS` includes `document-granular-decompose` and `UNSTRUCTURED_API_BASE_URL` plus `UNSTRUCTURED_AUTH_TOKEN` are set, prefer that skill for supported document/image files before the legacy type-specific parser skills. This includes PDF, Word, PowerPoint, Excel, Markdown, and common image formats; use the skill's `SKILL.md` for the exact extension allowlist.",
 " - When using `document-granular-decompose`, request `return_txt=true`, treat the pure text extracted from `response.txt`/`txt` as the main input, and keep raw JSON only for debugging or page-number evidence.",
+ " - If you extract plain text through any parser skill, write that canonical plain text snapshot to EXTRACTED_TEXT_PATH. For `document-granular-decompose`, write the same pure text from `response.txt`/`txt`. Leave EXTRACTED_TEXT_PATH empty only when no extractable text exists.",
 "3. Discover the current page type ontology via `tiangong-wiki type list` and `tiangong-wiki type show <type>`. Do not assume any type, template, or default target type.",
 "4. Search the existing wiki for overlapping or related content:",
 " - Use `tiangong-wiki fts` and `tiangong-wiki search` with key terms from the source.",
@@ -173,7 +176,7 @@ export function buildVaultWorkflowPrompt(input) {
 "",
 "The authoritative threadId is queue-item.json.threadId. Read it from there and copy it unchanged into result.json.threadId. If it is empty on first read, read queue-item.json again immediately before writing the manifest.",
 "",
- "Write RESULT_JSON_PATH as one JSON object with: status, decision, reason, threadId, skillsUsed, createdPageIds, updatedPageIds, appliedTypeNames, proposedTypes, actions, lint.",
+ "Write RESULT_JSON_PATH as one JSON object with: status, decision, reason, threadId, skillsUsed, createdPageIds, updatedPageIds, appliedTypeNames, proposedTypes, actions, lint, and optional extractedText.",
 "",
 "### Allowed Values",
 "",
@@ -182,10 +185,11 @@ export function buildVaultWorkflowPrompt(input) {
 "- **actions**: Array of objects, never strings. Allowed action kinds: create_page, update_page, create_template. Every action object must include kind and summary. create_page requires pageType and title. update_page requires pageId. create_template requires pageType and title.",
 "- **proposedTypes**: Objects with name, reason, suggestedTemplateSections.",
 "- **lint**: Objects with pageId, errors, warnings.",
+ "- **extractedText**: Optional object when EXTRACTED_TEXT_PATH contains extracted plain text. Include path=EXTRACTED_TEXT_PATH, parserSkill, sha256 when practical, and charCount. Do not put the full text itself in result.json.",
 "",
 "### Example",
 "",
- '{"status":"done","decision":"apply","reason":"Updated the existing method.","threadId":"<copy queue-item.json.threadId>","skillsUsed":["tiangong-wiki-skill"],"createdPageIds":[],"updatedPageIds":["methods/example.md"],"appliedTypeNames":["method"],"proposedTypes":[],"actions":[{"kind":"update_page","pageId":"methods/example.md","pageType":"method","summary":"Updated the page with durable knowledge."}],"lint":[{"pageId":"methods/example.md","errors":0,"warnings":0}]}',
+ '{"status":"done","decision":"apply","reason":"Updated the existing method.","threadId":"<copy queue-item.json.threadId>","skillsUsed":["tiangong-wiki-skill"],"createdPageIds":[],"updatedPageIds":["methods/example.md"],"appliedTypeNames":["method"],"proposedTypes":[],"actions":[{"kind":"update_page","pageId":"methods/example.md","pageType":"method","summary":"Updated the page with durable knowledge."}],"lint":[{"pageId":"methods/example.md","errors":0,"warnings":0}],"extractedText":{"path":"<copy EXTRACTED_TEXT_PATH>","parserSkill":"document-granular-decompose","sha256":"<sha256 of extracted-fulltext.txt>","charCount":1234}}',
 "",
 "If no page change is justified, still write RESULT_JSON_PATH with decision=skip or decision=propose_only and then stop.",
 "Use RESULT_JSON_PATH only for the final structured manifest. Write raw JSON only, with no Markdown fences and no prose before or after the JSON object.",
@@ -244,5 +248,6 @@ export function ensureWorkflowArtifactSet(paths, input) {
 "This prompt is intentionally minimal and will be populated by the workflow runner.",
 ].join("\n"));
 writeTextFileSync(artifacts.resultPath, "");
+ writeTextFileSync(artifacts.extractedTextPath, "");
 return artifacts;
 }
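These prompt changes define the agent-side contract: write the canonical plain text to `EXTRACTED_TEXT_PATH` and report only metadata under `extractedText` in `result.json`. A minimal sketch of a step honoring that contract (the helper and the hashing details are illustrative assumptions, not part of the package):

```js
import { writeFileSync } from "node:fs";
import { createHash } from "node:crypto";

// Hedged sketch of an agent-side step; EXTRACTED_TEXT_PATH comes from the
// prompt header shown above, everything else is illustrative.
function writeSnapshot(extractedTextPath, plainText, parserSkill) {
  writeFileSync(extractedTextPath, plainText, "utf8");
  return {
    path: extractedTextPath,
    parserSkill,
    // sha256 is optional in the manifest; a hex digest is assumed here.
    sha256: createHash("sha256").update(plainText, "utf8").digest("hex"),
    charCount: plainText.length,
  };
}

// The returned object would go into result.json as `extractedText`; the full
// text stays only in the snapshot file, never in the manifest.
```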
@@ -27,6 +27,13 @@ function ensureNumber(value, label) {
 }
 return value;
 }
+ function ensureNonNegativeInteger(value, label) {
+ const parsed = ensureNumber(value, label);
+ if (!Number.isInteger(parsed) || parsed < 0) {
+ fail(`${label} must be a non-negative integer`);
+ }
+ return parsed;
+ }
 function ensureStatus(value) {
 const status = ensureString(value, "result.status");
 if (status === "done" || status === "skipped" || status === "error") {
@@ -50,6 +57,28 @@ function parseSourceFile(value) {
 const sha256 = sourceFile.sha256 === undefined ? undefined : ensureString(sourceFile.sha256, "result.sourceFile.sha256");
 return { path, ...(sha256 ? { sha256 } : {}) };
 }
+ function parseExtractedText(value) {
+ if (value === undefined) {
+ return undefined;
+ }
+ const extractedText = ensureRecord(value, "result.extractedText");
+ const path = ensureString(extractedText.path, "result.extractedText.path");
+ const sha256 = extractedText.sha256 === undefined
+ ? undefined
+ : ensureString(extractedText.sha256, "result.extractedText.sha256");
+ const parserSkill = extractedText.parserSkill === undefined
+ ? undefined
+ : ensureString(extractedText.parserSkill, "result.extractedText.parserSkill");
+ const charCount = extractedText.charCount === undefined
+ ? undefined
+ : ensureNonNegativeInteger(extractedText.charCount, "result.extractedText.charCount");
+ return {
+ path,
+ ...(sha256 ? { sha256 } : {}),
+ ...(parserSkill ? { parserSkill } : {}),
+ ...(charCount !== undefined ? { charCount } : {}),
+ };
+ }
 function parseProposedTypes(value) {
 if (!Array.isArray(value)) {
 fail("result.proposedTypes must be an array");
@@ -129,6 +158,7 @@ export function parseWorkflowResult(raw) {
 proposedTypes: parseProposedTypes(result.proposedTypes),
 actions: parseActions(result.actions),
 lint: parseLint(result.lint),
+ extractedText: parseExtractedText(result.extractedText),
 sourceFile: parseSourceFile(result.sourceFile),
 };
 if (manifest.decision === "apply" && manifest.actions.length === 0) {
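Only `path` is required by `parseExtractedText`; `sha256`, `parserSkill`, and `charCount` are optional and are dropped from the parsed manifest when omitted. Hedged example fragments (all values illustrative):

```js
// Illustrative manifest fragments matching the checks in parseExtractedText;
// the paths and counts are made up.
const minimal = { path: "/workspace/.queue-artifacts/abc123/extracted-fulltext.txt" };

const full = {
  path: "/workspace/.queue-artifacts/abc123/extracted-fulltext.txt",
  parserSkill: "document-granular-decompose",
  sha256: "<sha256 of extracted-fulltext.txt>", // optional; checked against the snapshot by the queue layer
  charCount: 1234, // optional; must be a non-negative integer
};
```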
@@ -292,6 +292,10 @@ function buildQueueListItem(item) {
 appliedTypeNames: item.appliedTypeNames ?? [],
 proposedTypeNames: item.proposedTypeNames ?? [],
 skillsUsed: item.skillsUsed ?? [],
+ extractedTextPath: item.extractedTextPath ?? null,
+ extractedTextSha256: item.extractedTextSha256 ?? null,
+ extractedTextParserSkill: item.extractedTextParserSkill ?? null,
+ extractedTextCharCount: item.extractedTextCharCount ?? null,
 timing: buildQueueTiming(item),
 };
 }
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
 "name": "@biaoo/tiangong-wiki",
- "version": "0.3.10",
+ "version": "0.3.11",
 "description": "Local-first wiki index and query engine for Markdown knowledge pages (Tiangong Wiki).",
 "type": "module",
 "publishConfig": {
@@ -295,7 +295,7 @@ tiangong-wiki vault queue [--status pending|processing|done|skipped|error]
 
 - `list` — List indexed vault files; `--path` does prefix matching on relative paths
 - `diff` — Show changes since the last sync (or since a given date with `--since`)
- - `queue` — Show processing queue status and item details
+ - `queue` — Show processing queue status and item details, including extracted plain-text artifact metadata when a parser snapshot exists
 
 ### lint
 
@@ -99,6 +99,8 @@ The agent uses [Codex SDK](https://www.npmjs.com/package/@openai/codex-sdk) to p
 
 When `document-granular-decompose` is configured with `UNSTRUCTURED_API_BASE_URL` and `UNSTRUCTURED_AUTH_TOKEN`, the wiki agent prefers it before the type-specific `pdf`, `docx`, `pptx`, and `xlsx` skills for supported document/image files. Keep the type-specific skills configured only when you want them available as fallback tools.
 
+ For successful parser runs, the workflow keeps the exact plain-text extraction used by the agent at `.queue-artifacts/<file-artifact>/extracted-fulltext.txt`. `tiangong-wiki vault queue` exposes `extractedTextPath`, `extractedTextSha256`, `extractedTextParserSkill`, and `extractedTextCharCount` when a snapshot exists.
+
 `tiangong-wiki setup` now prompts for `WIKI_AGENT_SANDBOX_MODE` when automatic vault processing is enabled. The default is `danger-full-access`, and the setup wizard highlights that this mode grants full runtime access.
 
 When `WIKI_AGENT_ENABLED=true` and `WIKI_AGENT_AUTH_MODE=codex-login`, `tiangong-wiki doctor` and `tiangong-wiki check-config` verify that `WIKI_AGENT_CODEX_HOME` exists and contains `auth.json`. They report the path and remediation command, but never print token or auth file contents.
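Based on the `buildQueueListItem` change in this diff, a queue item exposes the new fields in roughly this shape; all four are `null` when no snapshot exists for the file's current content hash (values below are illustrative):

```js
// Illustrative fragment of a queue list item; field names come from
// buildQueueListItem in this diff, values are made up.
const queueItemFragment = {
  extractedTextPath: "/workspace/.queue-artifacts/abc123/extracted-fulltext.txt",
  extractedTextSha256: "<sha256 of extracted-fulltext.txt>",
  extractedTextParserSkill: "document-granular-decompose",
  extractedTextCharCount: 1234,
};
```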
@@ -30,7 +30,7 @@ Parser skills are installed under `<workspace-root>/.agents/skills/`. Do not ass
 | `xlsx` | Extract tables and data from XLSX/CSV files |
 | `document-granular-decompose` | Extract fulltext from PDF, Office documents, Markdown, and common image formats through TianGong Unstructure |
 
- When `document-granular-decompose` is available, `WIKI_PARSER_SKILLS` includes it, and `UNSTRUCTURED_API_BASE_URL` plus `UNSTRUCTURED_AUTH_TOKEN` are set, prefer it for supported document/image formats before the type-specific parser skills below. This includes PDF, Word, PowerPoint, Excel, Markdown, and common image formats; read the skill's own `SKILL.md` for the exact extension allowlist. The client should request JSON with `return_txt=true`, then use the plain text from `response.txt` / `txt` as the wiki agent's primary input. Keep JSON chunks and page numbers only for debugging or provenance evidence.
+ When `document-granular-decompose` is available, `WIKI_PARSER_SKILLS` includes it, and `UNSTRUCTURED_API_BASE_URL` plus `UNSTRUCTURED_AUTH_TOKEN` are set, prefer it for supported document/image formats before the type-specific parser skills below. This includes PDF, Word, PowerPoint, Excel, Markdown, and common image formats; read the skill's own `SKILL.md` for the exact extension allowlist. The client should request JSON with `return_txt=true`, then use the plain text from `response.txt` / `txt` as the wiki agent's primary input. Write that same plain text to `EXTRACTED_TEXT_PATH` so the queue artifact retains the exact text snapshot used for analysis. Keep JSON chunks and page numbers only for debugging or provenance evidence.
 
 When any other parser skill is available and the vault file matches its type, use the skill. Read the skill's SKILL.md for interface details before invoking.
 
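A rough sketch of that request-and-write flow; the endpoint path, payload shape, and auth header below are placeholders (the skill's `SKILL.md` defines the real interface), and only `UNSTRUCTURED_API_BASE_URL`, `UNSTRUCTURED_AUTH_TOKEN`, `return_txt=true`, `response.txt` / `txt`, and `EXTRACTED_TEXT_PATH` come from the text above:

```js
import { writeFile } from "node:fs/promises";

// Hedged sketch only. "/parse", the JSON body shape, and the Bearer header are
// assumptions for illustration; consult the skill's SKILL.md for the real API.
async function extractPlainText(fileUrl, extractedTextPath, env = process.env) {
  const response = await fetch(`${env.UNSTRUCTURED_API_BASE_URL}/parse`, {
    method: "POST",
    headers: {
      authorization: `Bearer ${env.UNSTRUCTURED_AUTH_TOKEN}`,
      "content-type": "application/json",
    },
    body: JSON.stringify({ file_url: fileUrl, return_txt: true }),
  });
  const payload = await response.json();
  const plainText = payload.txt ?? payload.response?.txt ?? "";
  // Per the instruction above: the snapshot file gets the same pure text the
  // agent uses as its primary input.
  await writeFile(extractedTextPath, plainText, "utf8");
  return plainText;
}
```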
@@ -167,5 +167,8 @@ The workflow must write a valid `result.json` manifest with these fields:
 - `proposedTypes`
 - `actions`
 - `lint`
+ - `extractedText` when `EXTRACTED_TEXT_PATH` contains a plain-text extraction
 
 The service layer trusts this manifest, not free-form prose.
+
+ `extractedText` must be metadata only, not the full text body. Use `{ "path": EXTRACTED_TEXT_PATH, "parserSkill": "<skill-name>", "sha256": "<sha256>", "charCount": <characters> }`.