@biaoo/tiangong-wiki 0.3.9 → 0.3.11
- package/README.md +1 -1
- package/README.zh-CN.md +1 -1
- package/dist/core/db.js +14 -0
- package/dist/core/vault-processing.js +157 -22
- package/dist/core/workflow-context.js +8 -3
- package/dist/core/workflow-result.js +30 -0
- package/dist/operations/dashboard.js +4 -0
- package/package.json +4 -4
- package/references/cli-interface.md +1 -1
- package/references/troubleshooting.md +5 -1
- package/references/vault-to-wiki-instruction.md +5 -2
package/README.md
CHANGED

```diff
@@ -117,7 +117,7 @@ tiangong-wiki dashboard # open dashboard in browse
 
 > Environment variables are managed via `.wiki.env` (created by `tiangong-wiki setup`). The CLI prefers the nearest local `.wiki.env`, then falls back to the global default workspace config. See [references/troubleshooting.md](./references/troubleshooting.md) for the full reference. For a centralized Linux + `systemd` + Nginx deployment, see [references/centralized-service-deployment.md](./references/centralized-service-deployment.md). That deployment guide now also includes Git repository / GitHub remote setup for daemon-side commit and optional auto-push.
 
-For document-heavy vaults, set `WIKI_PARSER_SKILLS=document-granular-decompose` and configure `UNSTRUCTURED_API_BASE_URL` plus `UNSTRUCTURED_AUTH_TOKEN` to prefer the TianGong Unstructure parser. The wiki workflow uses `return_txt=true
+For document-heavy vaults, set `WIKI_PARSER_SKILLS=document-granular-decompose` and configure `UNSTRUCTURED_API_BASE_URL` plus `UNSTRUCTURED_AUTH_TOKEN` to prefer the TianGong Unstructure parser. `document-granular-decompose` is a broad document/image parser for PDF, Word, PowerPoint, Excel, Markdown, and common image formats; the skill's `SKILL.md` remains the source of truth for the exact extension allowlist. The wiki workflow uses `return_txt=true`, consumes the plain `txt` text as the agent input, and stores the extracted plain-text snapshot under `.queue-artifacts/<file-artifact>/extracted-fulltext.txt` with metadata in the queue result; `UNSTRUCTURED_PROVIDER` and `UNSTRUCTURED_MODEL` are optional overrides.
 
 ## MCP Server
```
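As an illustration of the snapshot layout the new README paragraph describes, the sketch below reads one queue item's extracted plain-text artifact and reports its size. The `<file-artifact>` directory name is a placeholder, not a real identifier; only the `.queue-artifacts/` root and the `extracted-fulltext.txt` file name come from this release.

```js
// Minimal sketch: inspect one queue item's extracted plain-text snapshot.
// "<file-artifact>" is hypothetical; real directory names depend on the queue item.
import { existsSync, readFileSync } from "node:fs";
import path from "node:path";

const snapshotPath = path.join(".queue-artifacts", "<file-artifact>", "extracted-fulltext.txt");

if (existsSync(snapshotPath)) {
  const text = readFileSync(snapshotPath, "utf8");
  console.log(`snapshot: ${snapshotPath} (${text.length} chars)`);
} else {
  console.log("no extracted-text snapshot for this item");
}
```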
package/README.zh-CN.md
CHANGED

```diff
@@ -117,7 +117,7 @@ tiangong-wiki dashboard # open dashboard in browser
 
 > Environment variables are managed via `.wiki.env` (created by `tiangong-wiki setup`). The CLI prefers the nearest local `.wiki.env` and falls back to the global default workspace config when none is found. See [references/troubleshooting.md](./references/troubleshooting.md) for the full reference. For a centralized service deployment (Linux + `systemd` + Nginx), see [references/centralized-service-deployment.md](./references/centralized-service-deployment.md). That deployment guide now also covers Git repository initialization, GitHub remote configuration, and daemon auto-push setup.
 
-For vaults dominated by document parsing, set `WIKI_PARSER_SKILLS=document-granular-decompose` and configure `UNSTRUCTURED_API_BASE_URL` and `UNSTRUCTURED_AUTH_TOKEN` so the workflow prefers the TianGong Unstructure parser
+For vaults dominated by document parsing, set `WIKI_PARSER_SKILLS=document-granular-decompose` and configure `UNSTRUCTURED_API_BASE_URL` and `UNSTRUCTURED_AUTH_TOKEN` so the workflow prefers the TianGong Unstructure parser. `document-granular-decompose` is a broader document/image parser covering PDF, Word, PowerPoint, Excel, Markdown, and common image formats; the skill's own `SKILL.md` is authoritative for the exact extension allowlist. The wiki workflow uses `return_txt=true`, takes the returned plain `txt` text as the agent's primary input, saves the extracted plain-text snapshot to `.queue-artifacts/<file-artifact>/extracted-fulltext.txt`, and keeps the metadata in the queue result; `UNSTRUCTURED_PROVIDER` and `UNSTRUCTURED_MODEL` are optional overrides.
 
 ## MCP Server
```
package/dist/core/db.js
CHANGED

```diff
@@ -90,6 +90,20 @@ function ensureBaseTables(db, embeddingDimensions) {
     CREATE INDEX IF NOT EXISTS idx_vchangelog_sync ON vault_changelog(sync_id);
     CREATE INDEX IF NOT EXISTS idx_vchangelog_time ON vault_changelog(detected_at);
 
+    CREATE TABLE IF NOT EXISTS vault_extractions (
+      file_id TEXT NOT NULL,
+      content_hash TEXT NOT NULL,
+      artifact_path TEXT NOT NULL,
+      artifact_sha256 TEXT NOT NULL,
+      parser_skill TEXT,
+      char_count INTEGER NOT NULL,
+      created_at TEXT NOT NULL,
+      updated_at TEXT NOT NULL,
+      PRIMARY KEY(file_id, content_hash)
+    );
+
+    CREATE INDEX IF NOT EXISTS idx_vex_file ON vault_extractions(file_id);
+
     CREATE TABLE IF NOT EXISTS vault_processing_queue (
       file_id TEXT PRIMARY KEY,
       status TEXT DEFAULT 'pending',
```
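The new `vault_extractions` table keys each snapshot by `(file_id, content_hash)`, so a snapshot is only picked up while the source file's content hash is unchanged. A minimal sketch of that lookup, assuming a better-sqlite3-style `db` handle (the same style the dist code uses); the helper name is hypothetical:

```js
// Minimal sketch: fetch the extraction snapshot that matches the file's
// *current* content hash; a stale snapshot (old hash) simply will not join.
function getCurrentExtraction(db, fileId) {
  return (
    db
      .prepare(`
        SELECT e.artifact_path, e.artifact_sha256, e.parser_skill, e.char_count
        FROM vault_files f
        JOIN vault_extractions e
          ON e.file_id = f.id
          AND e.content_hash = f.content_hash
        WHERE f.id = ?
      `)
      .get(fileId) ?? null // null when no snapshot exists for the current content
  );
}
```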
package/dist/core/vault-processing.js
CHANGED

```diff
@@ -11,7 +11,7 @@ import { ensureLocalVaultFile } from "./vault.js";
 import { buildVaultWorkflowPrompt, ensureWorkflowArtifactSet, getWorkflowArtifactSet, } from "./workflow-context.js";
 import { readWorkflowResult } from "./workflow-result.js";
 import { AppError } from "../utils/errors.js";
-import { readTextFileSync } from "../utils/fs.js";
+import { pathExistsSync, readTextFileSync, sha256Text } from "../utils/fs.js";
 import { addSeconds, toOffsetIso } from "../utils/time.js";
 const INLINE_WORKFLOW_ATTEMPTS = 2;
 const MAX_QUEUE_ERROR_RETRIES = 3;
@@ -90,6 +90,10 @@ function mapQueueRow(row) {
         appliedTypeNames: parseOptionalStringArray(row.appliedTypeNames),
         proposedTypeNames: parseOptionalStringArray(row.proposedTypeNames),
         skillsUsed: parseOptionalStringArray(row.skillsUsed),
+        extractedTextPath: typeof row.extractedTextPath === "string" ? row.extractedTextPath : null,
+        extractedTextSha256: typeof row.extractedTextSha256 === "string" ? row.extractedTextSha256 : null,
+        extractedTextParserSkill: typeof row.extractedTextParserSkill === "string" ? row.extractedTextParserSkill : null,
+        extractedTextCharCount: typeof row.extractedTextCharCount === "number" ? row.extractedTextCharCount : null,
         fileName: typeof row.fileName === "string" ? row.fileName : undefined,
         fileExt: typeof row.fileExt === "string" ? row.fileExt : null,
         sourceType: typeof row.sourceType === "string" ? row.sourceType : null,
@@ -113,8 +117,8 @@ function claimQueueItems(db, limit, options) {
     ].join("\n AND ");
     const select = db.prepare(`
     SELECT
-      file_id AS fileId,
-      status,
+      vault_processing_queue.file_id AS fileId,
+      vault_processing_queue.status,
       priority,
       queued_at AS queuedAt,
       claimed_at AS claimedAt,
@@ -141,16 +145,23 @@ function claimQueueItems(db, limit, options) {
       vault_files.file_ext AS fileExt,
       vault_files.source_type AS sourceType,
       vault_files.file_size AS fileSize,
-      vault_files.file_path AS filePath
+      vault_files.file_path AS filePath,
+      vault_extractions.artifact_path AS extractedTextPath,
+      vault_extractions.artifact_sha256 AS extractedTextSha256,
+      vault_extractions.parser_skill AS extractedTextParserSkill,
+      vault_extractions.char_count AS extractedTextCharCount
     FROM vault_processing_queue
     LEFT JOIN vault_files ON vault_files.id = vault_processing_queue.file_id
+    LEFT JOIN vault_extractions
+      ON vault_extractions.file_id = vault_processing_queue.file_id
+      AND vault_extractions.content_hash = vault_files.content_hash
     WHERE (
       vault_processing_queue.status = 'pending'
       OR (
        ${errorEligibility}
      )
    )${filter.clause}${exclude.clause}
-    ORDER BY priority DESC, queued_at ASC
+    ORDER BY vault_processing_queue.priority DESC, vault_processing_queue.queued_at ASC
     LIMIT ?
   `);
     const markProcessing = db.prepare(`
@@ -202,8 +213,8 @@ function claimQueueItems(db, limit, options) {
 function fetchQueueItemsByStatus(db, status) {
     const rows = db.prepare(`
     SELECT
-      file_id AS fileId,
-      status,
+      vault_processing_queue.file_id AS fileId,
+      vault_processing_queue.status,
       priority,
       queued_at AS queuedAt,
       claimed_at AS claimedAt,
@@ -230,19 +241,26 @@ function fetchQueueItemsByStatus(db, status) {
       vault_files.file_ext AS fileExt,
       vault_files.source_type AS sourceType,
       vault_files.file_size AS fileSize,
-      vault_files.file_path AS filePath
+      vault_files.file_path AS filePath,
+      vault_extractions.artifact_path AS extractedTextPath,
+      vault_extractions.artifact_sha256 AS extractedTextSha256,
+      vault_extractions.parser_skill AS extractedTextParserSkill,
+      vault_extractions.char_count AS extractedTextCharCount
     FROM vault_processing_queue
     LEFT JOIN vault_files ON vault_files.id = vault_processing_queue.file_id
-    ${status ? "WHERE status = ?" : ""}
-    ORDER BY priority DESC, queued_at ASC
+    LEFT JOIN vault_extractions
+      ON vault_extractions.file_id = vault_processing_queue.file_id
+      AND vault_extractions.content_hash = vault_files.content_hash
+    ${status ? "WHERE vault_processing_queue.status = ?" : ""}
+    ORDER BY vault_processing_queue.priority DESC, vault_processing_queue.queued_at ASC
   `).all(...(status ? [status] : []));
     return rows.map(mapQueueRow);
 }
 function fetchQueueItemByFileId(db, fileId) {
     const row = db.prepare(`
     SELECT
-      file_id AS fileId,
-      status,
+      vault_processing_queue.file_id AS fileId,
+      vault_processing_queue.status,
       priority,
       queued_at AS queuedAt,
       claimed_at AS claimedAt,
@@ -269,9 +287,16 @@ function fetchQueueItemByFileId(db, fileId) {
       vault_files.file_ext AS fileExt,
       vault_files.source_type AS sourceType,
       vault_files.file_size AS fileSize,
-      vault_files.file_path AS filePath
+      vault_files.file_path AS filePath,
+      vault_extractions.artifact_path AS extractedTextPath,
+      vault_extractions.artifact_sha256 AS extractedTextSha256,
+      vault_extractions.parser_skill AS extractedTextParserSkill,
+      vault_extractions.char_count AS extractedTextCharCount
     FROM vault_processing_queue
     LEFT JOIN vault_files ON vault_files.id = vault_processing_queue.file_id
+    LEFT JOIN vault_extractions
+      ON vault_extractions.file_id = vault_processing_queue.file_id
+      AND vault_extractions.content_hash = vault_files.content_hash
     WHERE vault_processing_queue.file_id = ?
   `).get(fileId);
     return row ? mapQueueRow(row) : null;
@@ -332,8 +357,8 @@ function fetchStaleProcessingQueueItems(db, processingOwnerId) {
     const cutoff = toOffsetIso(addSeconds(new Date(), -PROCESSING_STALE_THRESHOLD_SECONDS));
     const rows = db.prepare(`
     SELECT
-      file_id AS fileId,
-      status,
+      vault_processing_queue.file_id AS fileId,
+      vault_processing_queue.status,
       priority,
       queued_at AS queuedAt,
       claimed_at AS claimedAt,
@@ -360,10 +385,17 @@ function fetchStaleProcessingQueueItems(db, processingOwnerId) {
       vault_files.file_ext AS fileExt,
       vault_files.source_type AS sourceType,
       vault_files.file_size AS fileSize,
-      vault_files.file_path AS filePath
+      vault_files.file_path AS filePath,
+      vault_extractions.artifact_path AS extractedTextPath,
+      vault_extractions.artifact_sha256 AS extractedTextSha256,
+      vault_extractions.parser_skill AS extractedTextParserSkill,
+      vault_extractions.char_count AS extractedTextCharCount
     FROM vault_processing_queue
     LEFT JOIN vault_files ON vault_files.id = vault_processing_queue.file_id
-    WHERE status = 'processing'
+    LEFT JOIN vault_extractions
+      ON vault_extractions.file_id = vault_processing_queue.file_id
+      AND vault_extractions.content_hash = vault_files.content_hash
+    WHERE vault_processing_queue.status = 'processing'
       AND COALESCE(processing_owner_id, '') != ?
       AND (
         (heartbeat_at IS NOT NULL AND julianday(heartbeat_at) <= julianday(?))
@@ -513,10 +545,104 @@ function formatQueueErrorMessage(message, autoRetryExhausted) {
         : "";
     return `${message}${autoRetrySuffix}`.slice(0, 1_000);
 }
-function applyWorkflowManifest(db, fileId, manifest, resultManifestPath, currentAttempts) {
+function resolveCurrentVaultContentHash(db, fileId) {
+    const row = db
+        .prepare("SELECT content_hash AS contentHash FROM vault_files WHERE id = ?")
+        .get(fileId);
+    return typeof row?.contentHash === "string" && row.contentHash.length > 0 ? row.contentHash : null;
+}
+function inferExtractionParserSkill(manifest) {
+    const explicit = manifest.extractedText?.parserSkill?.trim();
+    if (explicit) {
+        return explicit;
+    }
+    return manifest.skillsUsed.find((skill) => skill !== "tiangong-wiki-skill") ?? null;
+}
+function persistWorkflowExtraction(db, fileId, manifest, extractedTextPath, processedAt) {
+    const contentHash = resolveCurrentVaultContentHash(db, fileId);
+    if (!contentHash) {
+        return;
+    }
+    const expectedPath = path.resolve(extractedTextPath);
+    if (manifest.extractedText?.path && path.resolve(manifest.extractedText.path) !== expectedPath) {
+        throw new AppError("result.extractedText.path must match EXTRACTED_TEXT_PATH", "runtime", {
+            expectedPath,
+            actualPath: manifest.extractedText.path,
+        });
+    }
+    const extractedText = pathExistsSync(expectedPath) ? readTextFileSync(expectedPath) : "";
+    if (extractedText.length === 0) {
+        if (manifest.extractedText) {
+            throw new AppError("result.extractedText was declared but EXTRACTED_TEXT_PATH is empty", "runtime", {
+                expectedPath,
+            });
+        }
+        db.prepare("DELETE FROM vault_extractions WHERE file_id = ? AND content_hash = ?").run(fileId, contentHash);
+        return;
+    }
+    const artifactSha256 = sha256Text(extractedText);
+    if (manifest.extractedText?.sha256 && manifest.extractedText.sha256 !== artifactSha256) {
+        throw new AppError("result.extractedText.sha256 does not match EXTRACTED_TEXT_PATH content", "runtime", {
+            expectedSha256: artifactSha256,
+            actualSha256: manifest.extractedText.sha256,
+        });
+    }
+    db.prepare(`
+    INSERT INTO vault_extractions(
+      file_id,
+      content_hash,
+      artifact_path,
+      artifact_sha256,
+      parser_skill,
+      char_count,
+      created_at,
+      updated_at
+    )
+    VALUES (
+      @file_id,
+      @content_hash,
+      @artifact_path,
+      @artifact_sha256,
+      @parser_skill,
+      @char_count,
+      @created_at,
+      @updated_at
+    )
+    ON CONFLICT(file_id, content_hash) DO UPDATE SET
+      artifact_path = excluded.artifact_path,
+      artifact_sha256 = excluded.artifact_sha256,
+      parser_skill = excluded.parser_skill,
+      char_count = excluded.char_count,
+      updated_at = excluded.updated_at
+  `).run({
+        file_id: fileId,
+        content_hash: contentHash,
+        artifact_path: expectedPath,
+        artifact_sha256: artifactSha256,
+        parser_skill: inferExtractionParserSkill(manifest),
+        char_count: extractedText.length,
+        created_at: processedAt,
+        updated_at: processedAt,
+    });
+}
+function buildExtractionResultFields(manifest, extractedTextPath) {
+    const expectedPath = path.resolve(extractedTextPath);
+    const extractedText = pathExistsSync(expectedPath) ? readTextFileSync(expectedPath) : "";
+    if (extractedText.length === 0) {
+        return {};
+    }
+    return {
+        extractedTextPath: expectedPath,
+        extractedTextSha256: sha256Text(extractedText),
+        extractedTextParserSkill: inferExtractionParserSkill(manifest),
+        extractedTextCharCount: extractedText.length,
+    };
+}
+function applyWorkflowManifest(db, fileId, manifest, resultManifestPath, extractedTextPath, currentAttempts) {
     const resultPageId = manifest.createdPageIds[0] ?? manifest.updatedPageIds[0] ?? null;
     const status = manifest.status;
     const processedAt = toOffsetIso();
+    persistWorkflowExtraction(db, fileId, manifest, extractedTextPath, processedAt);
     if (status === "error") {
         const failureState = buildQueueFailureState(manifest.reason);
         const nextAttempts = currentAttempts + 1;
@@ -694,7 +820,8 @@ function recoverStaleProcessingQueueItems(input) {
         });
         if (recoveredManifest && item.resultManifestPath) {
             assertTemplateEvolutionAllowed(recoveredManifest, input.templateEvolution);
-            const outcome = applyWorkflowManifest(input.db, item.fileId, recoveredManifest, item.resultManifestPath, item.attempts);
+            const extractedTextPath = getWorkflowArtifactSet(input.paths, item.fileId).extractedTextPath;
+            const outcome = applyWorkflowManifest(input.db, item.fileId, recoveredManifest, item.resultManifestPath, extractedTextPath, item.attempts);
             input.log?.(`${item.fileId}: recovered stale processing with persisted result status=${outcome.status} thread=${recoveredManifest.threadId} ${formatManifestLogFields(recoveredManifest)} result=${item.resultManifestPath}`);
             recovered.push({
                 fileId: item.fileId,
@@ -708,6 +835,7 @@ function recoverStaleProcessingQueueItems(input) {
                 updatedPageIds: recoveredManifest.updatedPageIds,
                 proposedTypeNames: recoveredManifest.proposedTypes.map((entry) => entry.name),
                 resultManifestPath: item.resultManifestPath,
+                ...buildExtractionResultFields(recoveredManifest, extractedTextPath),
             });
             continue;
         }
@@ -783,6 +911,7 @@ function prepareCodexWorkflowInput(paths, item, file, localFilePath, env, allowT
         workspaceRoot,
         vaultFilePath: localFilePath,
         resultJsonPath: artifacts.resultPath,
+        extractedTextPath: artifacts.extractedTextPath,
         allowTemplateEvolution,
     });
     ensureWorkflowArtifactSet(paths, {
@@ -796,6 +925,7 @@ function prepareCodexWorkflowInput(paths, item, file, localFilePath, env, allowT
         vaultPath: paths.vaultPath,
         localFilePath,
         resultJsonPath: artifacts.resultPath,
+        extractedTextPath: artifacts.extractedTextPath,
         skillArtifactsPath: artifacts.skillArtifactsPath,
         file,
         queue: {
@@ -817,6 +947,7 @@ function prepareCodexWorkflowInput(paths, item, file, localFilePath, env, allowT
         promptText,
         queueItemPath: artifacts.queueItemPath,
         resultPath: artifacts.resultPath,
+        extractedTextPath: artifacts.extractedTextPath,
         skillArtifactsPath: artifacts.skillArtifactsPath,
         model: resolveAgentSettings(env).model,
         env,
@@ -903,7 +1034,7 @@ async function processClaimedQueueItem(input) {
     }));
     assertTemplateEvolutionAllowed(manifest, templateEvolution);
     finalOutcome = {
-        outcome: applyWorkflowManifest(db, item.fileId, manifest, artifacts.resultPath, item.attempts),
+        outcome: applyWorkflowManifest(db, item.fileId, manifest, artifacts.resultPath, artifacts.extractedTextPath, item.attempts),
         manifest,
         handleThreadId: handle.threadId,
     };
@@ -925,7 +1056,7 @@ async function processClaimedQueueItem(input) {
     if (recoveredManifest) {
         assertTemplateEvolutionAllowed(recoveredManifest, templateEvolution);
         finalOutcome = {
-            outcome: applyWorkflowManifest(db, item.fileId, recoveredManifest, artifacts.resultPath, item.attempts),
+            outcome: applyWorkflowManifest(db, item.fileId, recoveredManifest, artifacts.resultPath, artifacts.extractedTextPath, item.attempts),
             manifest: recoveredManifest,
             handleThreadId: recoveredManifest.threadId,
         };
@@ -956,6 +1087,7 @@ async function processClaimedQueueItem(input) {
         updatedPageIds: finalOutcome.manifest.updatedPageIds,
         proposedTypeNames: finalOutcome.manifest.proposedTypes.map((entry) => entry.name),
         resultManifestPath: artifacts.resultPath,
+        ...buildExtractionResultFields(finalOutcome.manifest, artifacts.extractedTextPath),
     },
 };
@@ -965,7 +1097,8 @@ async function processClaimedQueueItem(input) {
         : null;
     if (recoveredManifest && resultManifestPath) {
         assertTemplateEvolutionAllowed(recoveredManifest, templateEvolution);
-        const recoveredOutcome = applyWorkflowManifest(db, item.fileId, recoveredManifest, resultManifestPath, item.attempts);
+        const extractedTextPath = getWorkflowArtifactSet(paths, item.fileId).extractedTextPath;
+        const recoveredOutcome = applyWorkflowManifest(db, item.fileId, recoveredManifest, resultManifestPath, extractedTextPath, item.attempts);
         input.log?.(`${item.fileId}: recovered persisted workflow result after terminal failure status=${recoveredOutcome.status} thread=${recoveredManifest.threadId} ${formatManifestLogFields(recoveredManifest)} result=${resultManifestPath} message=${formatWorkflowError(error)}`);
         return {
             status: recoveredOutcome.status,
@@ -981,6 +1114,7 @@ async function processClaimedQueueItem(input) {
         updatedPageIds: recoveredManifest.updatedPageIds,
         proposedTypeNames: recoveredManifest.proposedTypes.map((entry) => entry.name),
         resultManifestPath,
+        ...buildExtractionResultFields(recoveredManifest, extractedTextPath),
     },
 };
@@ -1101,6 +1235,7 @@ export async function processVaultQueueBatch(env = process.env, options = {}) {
     };
     for (const recoveredItem of recoverStaleProcessingQueueItems({
         db,
+        paths,
         processingOwnerId,
         log: options.log,
         templateEvolution,
```
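The persisted `artifact_sha256` lets a consumer check that the on-disk snapshot still matches what the workflow recorded. A minimal sketch of that check, assuming only Node's standard `node:crypto` and `node:fs` modules and assuming the package's internal `sha256Text` helper hashes the UTF-8 text to a hex digest (it is not part of the public API):

```js
// Minimal sketch: recompute the snapshot hash and compare it with the
// extractedTextSha256 reported for a queue item.
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

function snapshotMatches(item) {
  if (!item.extractedTextPath || !item.extractedTextSha256) {
    return null; // no snapshot recorded for this item
  }
  const text = readFileSync(item.extractedTextPath, "utf8");
  const sha256 = createHash("sha256").update(text, "utf8").digest("hex");
  return sha256 === item.extractedTextSha256;
}
```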
package/dist/core/workflow-context.js
CHANGED

```diff
@@ -42,6 +42,7 @@ export function getWorkflowArtifactSet(paths, queueItemId) {
         queueItemPath: path.join(rootDir, "queue-item.json"),
         promptPath: path.join(rootDir, "prompt.md"),
         resultPath: path.join(rootDir, "result.json"),
+        extractedTextPath: path.join(rootDir, "extracted-fulltext.txt"),
         skillArtifactsPath: path.join(rootDir, "skill-artifacts"),
     };
 }
@@ -52,6 +53,7 @@ export function buildVaultWorkflowPrompt(input) {
         `WORKSPACE_ROOT=${input.workspaceRoot}`,
         `VAULT_FILE_PATH=${input.vaultFilePath}`,
         `RESULT_JSON_PATH=${input.resultJsonPath}`,
+        `EXTRACTED_TEXT_PATH=${input.extractedTextPath}`,
         `ALLOW_TEMPLATE_EVOLUTION=${input.allowTemplateEvolution ? "true" : "false"}`,
         "",
         "## Goal",
@@ -83,8 +85,9 @@ export function buildVaultWorkflowPrompt(input) {
         "",
         "1. Read queue-item.json next to RESULT_JSON_PATH.",
         "2. Read the target vault file at VAULT_FILE_PATH. Refer to `references/vault-to-wiki-instruction.md` (Phase 1) in the wiki package for file-type-specific reading strategies, parser skill discovery, image handling, and metadata utilization.",
-        "   - If `WIKI_PARSER_SKILLS` includes `document-granular-decompose` and `UNSTRUCTURED_API_BASE_URL` plus `UNSTRUCTURED_AUTH_TOKEN` are set, prefer that skill for supported document/image
+        "   - If `WIKI_PARSER_SKILLS` includes `document-granular-decompose` and `UNSTRUCTURED_API_BASE_URL` plus `UNSTRUCTURED_AUTH_TOKEN` are set, prefer that skill for supported document/image files before the legacy type-specific parser skills. This includes PDF, Word, PowerPoint, Excel, Markdown, and common image formats; use the skill's `SKILL.md` for the exact extension allowlist.",
         "   - When using `document-granular-decompose`, request `return_txt=true`, treat the pure text extracted from `response.txt`/`txt` as the main input, and keep raw JSON only for debugging or page-number evidence.",
+        "   - If you extract plain text through any parser skill, write that canonical plain text snapshot to EXTRACTED_TEXT_PATH. For `document-granular-decompose`, write the same pure text from `response.txt`/`txt`. Leave EXTRACTED_TEXT_PATH empty only when no extractable text exists.",
         "3. Discover the current page type ontology via `tiangong-wiki type list` and `tiangong-wiki type show <type>`. Do not assume any type, template, or default target type.",
         "4. Search the existing wiki for overlapping or related content:",
         "   - Use `tiangong-wiki fts` and `tiangong-wiki search` with key terms from the source.",
@@ -173,7 +176,7 @@ export function buildVaultWorkflowPrompt(input) {
         "",
         "The authoritative threadId is queue-item.json.threadId. Read it from there and copy it unchanged into result.json.threadId. If it is empty on first read, read queue-item.json again immediately before writing the manifest.",
         "",
-        "Write RESULT_JSON_PATH as one JSON object with: status, decision, reason, threadId, skillsUsed, createdPageIds, updatedPageIds, appliedTypeNames, proposedTypes, actions, lint.",
+        "Write RESULT_JSON_PATH as one JSON object with: status, decision, reason, threadId, skillsUsed, createdPageIds, updatedPageIds, appliedTypeNames, proposedTypes, actions, lint, and optional extractedText.",
         "",
         "### Allowed Values",
         "",
@@ -182,10 +185,11 @@ export function buildVaultWorkflowPrompt(input) {
         "- **actions**: Array of objects, never strings. Allowed action kinds: create_page, update_page, create_template. Every action object must include kind and summary. create_page requires pageType and title. update_page requires pageId. create_template requires pageType and title.",
         "- **proposedTypes**: Objects with name, reason, suggestedTemplateSections.",
         "- **lint**: Objects with pageId, errors, warnings.",
+        "- **extractedText**: Optional object when EXTRACTED_TEXT_PATH contains extracted plain text. Include path=EXTRACTED_TEXT_PATH, parserSkill, sha256 when practical, and charCount. Do not put the full text itself in result.json.",
         "",
         "### Example",
         "",
-        '{"status":"done","decision":"apply","reason":"Updated the existing method.","threadId":"<copy queue-item.json.threadId>","skillsUsed":["tiangong-wiki-skill"],"createdPageIds":[],"updatedPageIds":["methods/example.md"],"appliedTypeNames":["method"],"proposedTypes":[],"actions":[{"kind":"update_page","pageId":"methods/example.md","pageType":"method","summary":"Updated the page with durable knowledge."}],"lint":[{"pageId":"methods/example.md","errors":0,"warnings":0}]}',
+        '{"status":"done","decision":"apply","reason":"Updated the existing method.","threadId":"<copy queue-item.json.threadId>","skillsUsed":["tiangong-wiki-skill"],"createdPageIds":[],"updatedPageIds":["methods/example.md"],"appliedTypeNames":["method"],"proposedTypes":[],"actions":[{"kind":"update_page","pageId":"methods/example.md","pageType":"method","summary":"Updated the page with durable knowledge."}],"lint":[{"pageId":"methods/example.md","errors":0,"warnings":0}],"extractedText":{"path":"<copy EXTRACTED_TEXT_PATH>","parserSkill":"document-granular-decompose","sha256":"<sha256 of extracted-fulltext.txt>","charCount":1234}}',
         "",
         "If no page change is justified, still write RESULT_JSON_PATH with decision=skip or decision=propose_only and then stop.",
         "Use RESULT_JSON_PATH only for the final structured manifest. Write raw JSON only, with no Markdown fences and no prose before or after the JSON object.",
@@ -244,5 +248,6 @@ export function ensureWorkflowArtifactSet(paths, input) {
         "This prompt is intentionally minimal and will be populated by the workflow runner.",
     ].join("\n"));
     writeTextFileSync(artifacts.resultPath, "");
+    writeTextFileSync(artifacts.extractedTextPath, "");
     return artifacts;
 }
```
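Taken together, these prompt changes ask the agent to persist the snapshot first and then describe it in the manifest. A minimal sketch of that final write step; the path constants, `extractedPlainText`, and the `manifest` object below are placeholders for values the agent receives via the KEY=value prompt header or produced earlier in the run, not package APIs:

```js
// Hedged sketch of the agent's final write step: snapshot first, then the
// manifest that references it. All values below are illustrative placeholders.
import { writeFileSync } from "node:fs";
import { createHash } from "node:crypto";

const EXTRACTED_TEXT_PATH = "/tmp/queue-item/extracted-fulltext.txt"; // from the prompt header
const RESULT_JSON_PATH = "/tmp/queue-item/result.json"; // from the prompt header
const extractedPlainText = "<plain text returned by the parser skill>";
const manifest = {
  status: "done", decision: "skip", reason: "sketch only", threadId: "",
  skillsUsed: [], createdPageIds: [], updatedPageIds: [],
  appliedTypeNames: [], proposedTypes: [], actions: [], lint: [],
};

writeFileSync(EXTRACTED_TEXT_PATH, extractedPlainText, "utf8");
manifest.extractedText = {
  path: EXTRACTED_TEXT_PATH,
  parserSkill: "document-granular-decompose",
  sha256: createHash("sha256").update(extractedPlainText, "utf8").digest("hex"),
  charCount: extractedPlainText.length,
};
writeFileSync(RESULT_JSON_PATH, JSON.stringify(manifest), "utf8"); // raw JSON, no Markdown fences
```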
package/dist/core/workflow-result.js
CHANGED

```diff
@@ -27,6 +27,13 @@ function ensureNumber(value, label) {
     }
     return value;
 }
+function ensureNonNegativeInteger(value, label) {
+    const parsed = ensureNumber(value, label);
+    if (!Number.isInteger(parsed) || parsed < 0) {
+        fail(`${label} must be a non-negative integer`);
+    }
+    return parsed;
+}
 function ensureStatus(value) {
     const status = ensureString(value, "result.status");
     if (status === "done" || status === "skipped" || status === "error") {
@@ -50,6 +57,28 @@ function parseSourceFile(value) {
     const sha256 = sourceFile.sha256 === undefined ? undefined : ensureString(sourceFile.sha256, "result.sourceFile.sha256");
     return { path, ...(sha256 ? { sha256 } : {}) };
 }
+function parseExtractedText(value) {
+    if (value === undefined) {
+        return undefined;
+    }
+    const extractedText = ensureRecord(value, "result.extractedText");
+    const path = ensureString(extractedText.path, "result.extractedText.path");
+    const sha256 = extractedText.sha256 === undefined
+        ? undefined
+        : ensureString(extractedText.sha256, "result.extractedText.sha256");
+    const parserSkill = extractedText.parserSkill === undefined
+        ? undefined
+        : ensureString(extractedText.parserSkill, "result.extractedText.parserSkill");
+    const charCount = extractedText.charCount === undefined
+        ? undefined
+        : ensureNonNegativeInteger(extractedText.charCount, "result.extractedText.charCount");
+    return {
+        path,
+        ...(sha256 ? { sha256 } : {}),
+        ...(parserSkill ? { parserSkill } : {}),
+        ...(charCount !== undefined ? { charCount } : {}),
+    };
+}
 function parseProposedTypes(value) {
     if (!Array.isArray(value)) {
         fail("result.proposedTypes must be an array");
@@ -129,6 +158,7 @@ export function parseWorkflowResult(raw) {
         proposedTypes: parseProposedTypes(result.proposedTypes),
         actions: parseActions(result.actions),
         lint: parseLint(result.lint),
+        extractedText: parseExtractedText(result.extractedText),
         sourceFile: parseSourceFile(result.sourceFile),
     };
     if (manifest.decision === "apply" && manifest.actions.length === 0) {
```
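For intuition, the validator above accepts an `extractedText` object with only `path` set, and rejects a fractional or negative `charCount`. A few hypothetical manifest fragments (illustrations, not package fixtures):

```js
// Hypothetical inputs checked against the parseExtractedText rules above.
const accepted = { path: "/tmp/extracted-fulltext.txt" };                   // ok: sha256, parserSkill, charCount are optional
const alsoAccepted = { path: "/tmp/extracted-fulltext.txt", charCount: 0 }; // ok: zero is a non-negative integer
const rejected = { path: "/tmp/extracted-fulltext.txt", charCount: 12.5 };  // fails: charCount must be an integer
const alsoRejected = { charCount: 10 };                                     // fails: path is required
```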
package/dist/operations/dashboard.js
CHANGED

```diff
@@ -292,6 +292,10 @@ function buildQueueListItem(item) {
         appliedTypeNames: item.appliedTypeNames ?? [],
         proposedTypeNames: item.proposedTypeNames ?? [],
         skillsUsed: item.skillsUsed ?? [],
+        extractedTextPath: item.extractedTextPath ?? null,
+        extractedTextSha256: item.extractedTextSha256 ?? null,
+        extractedTextParserSkill: item.extractedTextParserSkill ?? null,
+        extractedTextCharCount: item.extractedTextCharCount ?? null,
         timing: buildQueueTiming(item),
     };
 }
```
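A queue list item rendered by the dashboard (and surfaced via `tiangong-wiki vault queue`) therefore carries four new fields alongside the existing ones. A hedged sketch of the relevant fragment; the field names come from `buildQueueListItem` above, while the values are made up:

```js
// Illustrative fragment of a queue list item; values are hypothetical.
const queueListItemFragment = {
  skillsUsed: ["tiangong-wiki-skill", "document-granular-decompose"],
  extractedTextPath: ".queue-artifacts/<file-artifact>/extracted-fulltext.txt",
  extractedTextSha256: "<sha256 of the snapshot>",
  extractedTextParserSkill: "document-granular-decompose",
  extractedTextCharCount: 1234,
};
```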
package/package.json
CHANGED

```diff
@@ -1,6 +1,6 @@
 {
   "name": "@biaoo/tiangong-wiki",
-  "version": "0.3.9",
+  "version": "0.3.11",
   "description": "Local-first wiki index and query engine for Markdown knowledge pages (Tiangong Wiki).",
   "type": "module",
   "publishConfig": {
@@ -8,7 +8,7 @@
   },
   "repository": {
     "type": "git",
-    "url": "https://github.com/Biaoo/tiangong-wiki.git"
+    "url": "git+https://github.com/Biaoo/tiangong-wiki.git"
   },
   "homepage": "https://github.com/Biaoo/tiangong-wiki#readme",
   "bugs": {
@@ -16,8 +16,8 @@
   },
   "license": "MIT",
   "bin": {
-    "tiangong-wiki": "
-    "tiangong-wiki-mcp-server": "
+    "tiangong-wiki": "dist/index.js",
+    "tiangong-wiki-mcp-server": "mcp-server/dist/index.js"
   },
   "main": "./dist/index.js",
   "files": [
```
package/references/cli-interface.md
CHANGED

```diff
@@ -295,7 +295,7 @@ tiangong-wiki vault queue [--status pending|processing|done|skipped|error]
 
 - `list` — List indexed vault files; `--path` does prefix matching on relative paths
 - `diff` — Show changes since the last sync (or since a given date with `--since`)
-- `queue` — Show processing queue status and item details
+- `queue` — Show processing queue status and item details, including extracted plain-text artifact metadata when a parser snapshot exists
 
 ### lint
```
package/references/troubleshooting.md
CHANGED

```diff
@@ -91,12 +91,16 @@ The agent uses [Codex SDK](https://www.npmjs.com/package/@openai/codex-sdk) to p
 | `WIKI_AGENT_MODEL` | No | Model name (default: `gpt-5.5`; e.g. `Qwen/Qwen3.5-397B-A17B-GPTQ-Int4`) |
 | `WIKI_AGENT_BATCH_SIZE` | No | Max concurrent vault queue workers per cycle (default: `5`) |
 | `WIKI_AGENT_SANDBOX_MODE` | No | Codex sandbox mode: `danger-full-access` (default) or `workspace-write` |
-| `WIKI_PARSER_SKILLS` | No | Comma-separated parser skill list (e.g. `pdf,docx,pptx,xlsx,document-granular-decompose`) |
+| `WIKI_PARSER_SKILLS` | No | Comma-separated parser skill list (e.g. `pdf,docx,pptx,xlsx,document-granular-decompose`). `document-granular-decompose` covers PDF, Office, Markdown, and common image formats; use its own `SKILL.md` for the exact extension allowlist |
 | `UNSTRUCTURED_API_BASE_URL` | For `document-granular-decompose` | TianGong Unstructure API base URL |
 | `UNSTRUCTURED_AUTH_TOKEN` | For `document-granular-decompose` | Bearer token for TianGong Unstructure |
 | `UNSTRUCTURED_PROVIDER` | No | Optional provider override passed to TianGong Unstructure |
 | `UNSTRUCTURED_MODEL` | No | Optional model override passed to TianGong Unstructure |
 
+When `document-granular-decompose` is configured with `UNSTRUCTURED_API_BASE_URL` and `UNSTRUCTURED_AUTH_TOKEN`, the wiki agent prefers it before the type-specific `pdf`, `docx`, `pptx`, and `xlsx` skills for supported document/image files. Keep the type-specific skills configured only when you want them available as fallback tools.
+
+For successful parser runs, the workflow keeps the exact plain-text extraction used by the agent at `.queue-artifacts/<file-artifact>/extracted-fulltext.txt`. `tiangong-wiki vault queue` exposes `extractedTextPath`, `extractedTextSha256`, `extractedTextParserSkill`, and `extractedTextCharCount` when a snapshot exists.
+
 `tiangong-wiki setup` now prompts for `WIKI_AGENT_SANDBOX_MODE` when automatic vault processing is enabled. The default is `danger-full-access`, and the setup wizard highlights that this mode grants full runtime access.
 
 When `WIKI_AGENT_ENABLED=true` and `WIKI_AGENT_AUTH_MODE=codex-login`, `tiangong-wiki doctor` and `tiangong-wiki check-config` verify that `WIKI_AGENT_CODEX_HOME` exists and contains `auth.json`. They report the path and remediation command, but never print token or auth file contents.
```
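The preference rule documented above is simple to state in code. A minimal sketch, assuming only the documented environment variables; the helper name is hypothetical, and the package's real selection logic lives in the agent prompt, not in a function like this:

```js
// Hedged sketch of the documented skill-preference rule.
function prefersGranularDecompose(env = process.env) {
  const skills = (env.WIKI_PARSER_SKILLS ?? "")
    .split(",")
    .map((skill) => skill.trim())
    .filter(Boolean);
  return (
    skills.includes("document-granular-decompose") &&
    Boolean(env.UNSTRUCTURED_API_BASE_URL) &&
    Boolean(env.UNSTRUCTURED_AUTH_TOKEN)
  );
}
```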
package/references/vault-to-wiki-instruction.md
CHANGED

```diff
@@ -28,9 +28,9 @@ Parser skills are installed under `<workspace-root>/.agents/skills/`. Do not ass
 | `docx` | Extract text and structure from DOCX files |
 | `pptx` | Extract text, slide structure, and speaker notes from PPTX files |
 | `xlsx` | Extract tables and data from XLSX/CSV files |
-| `document-granular-decompose` | Extract
+| `document-granular-decompose` | Extract fulltext from PDF, Office documents, Markdown, and common image formats through TianGong Unstructure |
 
-When `document-granular-decompose` is available, `WIKI_PARSER_SKILLS` includes it, and `UNSTRUCTURED_API_BASE_URL` plus `UNSTRUCTURED_AUTH_TOKEN` are set, prefer it for supported document/image formats before the type-specific parser skills below. The client should request JSON with `return_txt=true`, then use the plain text from `response.txt` / `txt` as the wiki agent's primary input. Keep JSON chunks and page numbers only for debugging or provenance evidence.
+When `document-granular-decompose` is available, `WIKI_PARSER_SKILLS` includes it, and `UNSTRUCTURED_API_BASE_URL` plus `UNSTRUCTURED_AUTH_TOKEN` are set, prefer it for supported document/image formats before the type-specific parser skills below. This includes PDF, Word, PowerPoint, Excel, Markdown, and common image formats; read the skill's own `SKILL.md` for the exact extension allowlist. The client should request JSON with `return_txt=true`, then use the plain text from `response.txt` / `txt` as the wiki agent's primary input. Write that same plain text to `EXTRACTED_TEXT_PATH` so the queue artifact retains the exact text snapshot used for analysis. Keep JSON chunks and page numbers only for debugging or provenance evidence.
 
 When any other parser skill is available and the vault file matches its type, use the skill. Read the skill's SKILL.md for interface details before invoking.
 
@@ -167,5 +167,8 @@ The workflow must write a valid `result.json` manifest with these fields:
 - `proposedTypes`
 - `actions`
 - `lint`
+- `extractedText` when `EXTRACTED_TEXT_PATH` contains a plain-text extraction
 
 The service layer trusts this manifest, not free-form prose.
+
+`extractedText` must be metadata only, not the full text body. Use `{ "path": EXTRACTED_TEXT_PATH, "parserSkill": "<skill-name>", "sha256": "<sha256>", "charCount": <characters> }`.
```
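One detail worth knowing about `charCount`: the service side computes it from the JavaScript string length of the snapshot (`extractedText.length` in `persistWorkflowExtraction`), so it counts UTF-16 code units rather than bytes. A concrete illustration:

```js
// charCount follows JavaScript string length (UTF-16 code units), not bytes.
const ascii = "abc";
console.log(ascii.length);                     // 3
const emoji = "a😀";                            // the emoji is a surrogate pair
console.log(emoji.length);                     // 3 (1 + 2 code units)
console.log(Buffer.byteLength(emoji, "utf8")); // 5 bytes in UTF-8
```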