@biaoo/tiangong-wiki 0.3.10 → 0.3.12
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +1 -1
- package/README.zh-CN.md +1 -1
- package/dist/core/db.js +27 -0
- package/dist/core/vault-processing.js +209 -23
- package/dist/core/vault.js +190 -3
- package/dist/core/workflow-context.js +10 -3
- package/dist/core/workflow-result.js +30 -0
- package/dist/operations/dashboard.js +37 -0
- package/dist/operations/query.js +24 -1
- package/package.json +1 -1
- package/references/cli-interface.md +2 -2
- package/references/troubleshooting.md +4 -2
- package/references/vault-to-wiki-instruction.md +7 -3
package/README.md
CHANGED
@@ -117,7 +117,7 @@ tiangong-wiki dashboard # open dashboard in browse
 
 > Environment variables are managed via `.wiki.env` (created by `tiangong-wiki setup`). The CLI prefers the nearest local `.wiki.env`, then falls back to the global default workspace config. See [references/troubleshooting.md](./references/troubleshooting.md) for the full reference. For a centralized Linux + `systemd` + Nginx deployment, see [references/centralized-service-deployment.md](./references/centralized-service-deployment.md). That deployment guide now also includes Git repository / GitHub remote setup for daemon-side commit and optional auto-push.
 
-For document-heavy vaults, set `WIKI_PARSER_SKILLS=document-granular-decompose` and configure `UNSTRUCTURED_API_BASE_URL` plus `UNSTRUCTURED_AUTH_TOKEN` to prefer the TianGong Unstructure parser. `document-granular-decompose` is a broad document/image parser for PDF, Word, PowerPoint, Excel,
+For document-heavy vaults, set `WIKI_PARSER_SKILLS=document-granular-decompose` and configure `UNSTRUCTURED_API_BASE_URL` plus `UNSTRUCTURED_AUTH_TOKEN` to prefer the TianGong Unstructure parser for supported non-text documents and images. `document-granular-decompose` is a broad document/image parser for PDF, Word, PowerPoint, Excel, and common image formats; Markdown and other plain text-like files are read locally by the wiki workflow. The skill's `SKILL.md` remains the source of truth for the exact extension allowlist. The wiki workflow uses `return_txt=true`, consumes the plain `txt` text as the agent input, and stores the extracted plain-text snapshot under `.queue-artifacts/<file-artifact>/extracted-fulltext.txt` with metadata in the queue result; `UNSTRUCTURED_PROVIDER` and `UNSTRUCTURED_MODEL` are optional overrides.
 
 ## MCP Server
 
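A minimal `.wiki.env` sketch for the setup described in the new paragraph; the variable names come from the diff above, while the URL and token values are placeholders:

```env
WIKI_PARSER_SKILLS=document-granular-decompose
UNSTRUCTURED_API_BASE_URL=https://unstructure.example.com  # placeholder
UNSTRUCTURED_AUTH_TOKEN=<bearer-token>                     # placeholder
# Optional overrides (defaults apply when unset):
# UNSTRUCTURED_PROVIDER=<provider>
# UNSTRUCTURED_MODEL=<model>
```
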
package/README.zh-CN.md
CHANGED
@@ -117,7 +117,7 @@ tiangong-wiki dashboard # open the dashboard in a browser
 
 > Environment variables are managed via `.wiki.env` (created by `tiangong-wiki setup`). The CLI prefers the nearest local `.wiki.env` and falls back to the global default workspace config when none is found. See [references/troubleshooting.md](./references/troubleshooting.md) for the full reference. For a centralized service deployment (Linux + `systemd` + Nginx), see [references/centralized-service-deployment.md](./references/centralized-service-deployment.md). That deployment guide now also covers Git repository initialization, GitHub remote configuration, and daemon auto-push.
 
-If the vault is mostly documents to parse, set `WIKI_PARSER_SKILLS=document-granular-decompose` and configure `UNSTRUCTURED_API_BASE_URL` and `UNSTRUCTURED_AUTH_TOKEN` so the workflow
+If the vault is mostly documents to parse, set `WIKI_PARSER_SKILLS=document-granular-decompose` and configure `UNSTRUCTURED_API_BASE_URL` and `UNSTRUCTURED_AUTH_TOKEN` so the workflow prefers the TianGong Unstructure parser for supported non-plain-text documents and images. `document-granular-decompose` is a broader document/image parser covering PDF, Word, PowerPoint, Excel, and common image formats; Markdown and other plain text-like files are read locally by the wiki workflow. The skill's own `SKILL.md` is authoritative for the exact extension allowlist. The wiki workflow uses `return_txt=true`, takes the returned plain `txt` text as the agent's main input, saves the extracted plain-text snapshot to `.queue-artifacts/<file-artifact>/extracted-fulltext.txt`, and keeps the metadata in the queue result; `UNSTRUCTURED_PROVIDER` and `UNSTRUCTURED_MODEL` are optional overrides.
 
 ## MCP Server
 
package/dist/core/db.js
CHANGED
@@ -76,9 +76,15 @@ function ensureBaseTables(db, embeddingDimensions) {
         file_path TEXT NOT NULL,
         content_hash TEXT,
         file_mtime REAL,
+        source_timestamp TEXT,
+        source_timestamp_source TEXT,
+        source_timestamp_confidence REAL,
+        source_timestamp_candidates TEXT,
         indexed_at TEXT
       );
 
+      CREATE INDEX IF NOT EXISTS idx_vfiles_source_timestamp ON vault_files(source_timestamp);
+
       CREATE TABLE IF NOT EXISTS vault_changelog (
         id INTEGER PRIMARY KEY AUTOINCREMENT,
         file_id TEXT NOT NULL,
@@ -90,6 +96,20 @@ function ensureBaseTables(db, embeddingDimensions) {
       CREATE INDEX IF NOT EXISTS idx_vchangelog_sync ON vault_changelog(sync_id);
       CREATE INDEX IF NOT EXISTS idx_vchangelog_time ON vault_changelog(detected_at);
 
+      CREATE TABLE IF NOT EXISTS vault_extractions (
+        file_id TEXT NOT NULL,
+        content_hash TEXT NOT NULL,
+        artifact_path TEXT NOT NULL,
+        artifact_sha256 TEXT NOT NULL,
+        parser_skill TEXT,
+        char_count INTEGER NOT NULL,
+        created_at TEXT NOT NULL,
+        updated_at TEXT NOT NULL,
+        PRIMARY KEY(file_id, content_hash)
+      );
+
+      CREATE INDEX IF NOT EXISTS idx_vex_file ON vault_extractions(file_id);
+
       CREATE TABLE IF NOT EXISTS vault_processing_queue (
         file_id TEXT PRIMARY KEY,
         status TEXT DEFAULT 'pending',
@@ -150,6 +170,13 @@ function ensureBaseTables(db, embeddingDimensions) {
         proposed_type_names: "TEXT",
         skills_used: "TEXT",
     });
+    ensureTableColumns(db, "vault_files", {
+        source_timestamp: "TEXT",
+        source_timestamp_source: "TEXT",
+        source_timestamp_confidence: "REAL",
+        source_timestamp_candidates: "TEXT",
+    });
+    db.exec("CREATE INDEX IF NOT EXISTS idx_vfiles_source_timestamp ON vault_files(source_timestamp)");
     if (!tableExists(db, "vec_pages")) {
         db.exec(`
       CREATE VIRTUAL TABLE vec_pages USING vec0(
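A minimal sketch of how the new `vault_extractions` table acts as a content-addressed cache, assuming a `better-sqlite3`-style `db` handle as used throughout the code above; the helper name is hypothetical:

```js
// Hypothetical helper: return the cached plain-text artifact for a vault file
// only when it was extracted from the *current* content_hash. This mirrors the
// "LEFT JOIN vault_extractions ... AND vault_extractions.content_hash =
// vault_files.content_hash" condition added to the queue queries below.
function getCurrentExtraction(db, fileId) {
  return db.prepare(`
    SELECT e.artifact_path   AS artifactPath,
           e.artifact_sha256 AS artifactSha256,
           e.parser_skill    AS parserSkill,
           e.char_count      AS charCount
    FROM vault_files f
    JOIN vault_extractions e
      ON e.file_id = f.id
     AND e.content_hash = f.content_hash  -- stale extractions fall out here
    WHERE f.id = ?
  `).get(fileId) ?? null;
}
```

When the source file changes, its `content_hash` moves on and the old snapshot simply stops matching, so callers never see an outdated extraction.
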
@@ -11,7 +11,7 @@ import { ensureLocalVaultFile } from "./vault.js";
|
|
|
11
11
|
import { buildVaultWorkflowPrompt, ensureWorkflowArtifactSet, getWorkflowArtifactSet, } from "./workflow-context.js";
|
|
12
12
|
import { readWorkflowResult } from "./workflow-result.js";
|
|
13
13
|
import { AppError } from "../utils/errors.js";
|
|
14
|
-
import { readTextFileSync } from "../utils/fs.js";
|
|
14
|
+
import { pathExistsSync, readTextFileSync, sha256Text } from "../utils/fs.js";
|
|
15
15
|
import { addSeconds, toOffsetIso } from "../utils/time.js";
|
|
16
16
|
const INLINE_WORKFLOW_ATTEMPTS = 2;
|
|
17
17
|
const MAX_QUEUE_ERROR_RETRIES = 3;
|
|
@@ -61,6 +61,38 @@ function parseOptionalStringArray(value) {
|
|
|
61
61
|
return [];
|
|
62
62
|
}
|
|
63
63
|
}
|
|
64
|
+
function parseSourceTimestampCandidates(value) {
|
|
65
|
+
if (Array.isArray(value)) {
|
|
66
|
+
return value;
|
|
67
|
+
}
|
|
68
|
+
if (typeof value !== "string" || !value.trim()) {
|
|
69
|
+
return [];
|
|
70
|
+
}
|
|
71
|
+
try {
|
|
72
|
+
const parsed = JSON.parse(value);
|
|
73
|
+
return Array.isArray(parsed) ? parsed : [];
|
|
74
|
+
}
|
|
75
|
+
catch {
|
|
76
|
+
return [];
|
|
77
|
+
}
|
|
78
|
+
}
|
|
79
|
+
function mapVaultFileRow(row) {
|
|
80
|
+
return {
|
|
81
|
+
id: String(row.id),
|
|
82
|
+
fileName: String(row.fileName),
|
|
83
|
+
fileExt: typeof row.fileExt === "string" ? row.fileExt : null,
|
|
84
|
+
sourceType: typeof row.sourceType === "string" ? row.sourceType : null,
|
|
85
|
+
fileSize: Number(row.fileSize ?? 0),
|
|
86
|
+
filePath: String(row.filePath),
|
|
87
|
+
contentHash: typeof row.contentHash === "string" ? row.contentHash : null,
|
|
88
|
+
fileMtime: typeof row.fileMtime === "number" ? row.fileMtime : null,
|
|
89
|
+
sourceTimestamp: typeof row.sourceTimestamp === "string" ? row.sourceTimestamp : null,
|
|
90
|
+
sourceTimestampSource: typeof row.sourceTimestampSource === "string" ? row.sourceTimestampSource : null,
|
|
91
|
+
sourceTimestampConfidence: typeof row.sourceTimestampConfidence === "number" ? row.sourceTimestampConfidence : null,
|
|
92
|
+
sourceTimestampCandidates: parseSourceTimestampCandidates(row.sourceTimestampCandidates),
|
|
93
|
+
indexedAt: String(row.indexedAt),
|
|
94
|
+
};
|
|
95
|
+
}
|
|
64
96
|
function mapQueueRow(row) {
|
|
65
97
|
const attempts = Number(row.attempts ?? 0);
|
|
66
98
|
const status = row.status;
|
|
@@ -90,11 +122,18 @@ function mapQueueRow(row) {
|
|
|
90
122
|
appliedTypeNames: parseOptionalStringArray(row.appliedTypeNames),
|
|
91
123
|
proposedTypeNames: parseOptionalStringArray(row.proposedTypeNames),
|
|
92
124
|
skillsUsed: parseOptionalStringArray(row.skillsUsed),
|
|
125
|
+
extractedTextPath: typeof row.extractedTextPath === "string" ? row.extractedTextPath : null,
|
|
126
|
+
extractedTextSha256: typeof row.extractedTextSha256 === "string" ? row.extractedTextSha256 : null,
|
|
127
|
+
extractedTextParserSkill: typeof row.extractedTextParserSkill === "string" ? row.extractedTextParserSkill : null,
|
|
128
|
+
extractedTextCharCount: typeof row.extractedTextCharCount === "number" ? row.extractedTextCharCount : null,
|
|
93
129
|
fileName: typeof row.fileName === "string" ? row.fileName : undefined,
|
|
94
130
|
fileExt: typeof row.fileExt === "string" ? row.fileExt : null,
|
|
95
131
|
sourceType: typeof row.sourceType === "string" ? row.sourceType : null,
|
|
96
132
|
fileSize: typeof row.fileSize === "number" ? row.fileSize : undefined,
|
|
97
133
|
filePath: typeof row.filePath === "string" ? row.filePath : undefined,
|
|
134
|
+
sourceTimestamp: typeof row.sourceTimestamp === "string" ? row.sourceTimestamp : null,
|
|
135
|
+
sourceTimestampSource: typeof row.sourceTimestampSource === "string" ? row.sourceTimestampSource : null,
|
|
136
|
+
sourceTimestampConfidence: typeof row.sourceTimestampConfidence === "number" ? row.sourceTimestampConfidence : null,
|
|
98
137
|
};
|
|
99
138
|
}
|
|
100
139
|
function claimQueueItems(db, limit, options) {
|
|
@@ -113,8 +152,8 @@ function claimQueueItems(db, limit, options) {
|
|
|
113
152
|
].join("\n AND ");
|
|
114
153
|
const select = db.prepare(`
|
|
115
154
|
SELECT
|
|
116
|
-
file_id AS fileId,
|
|
117
|
-
status,
|
|
155
|
+
vault_processing_queue.file_id AS fileId,
|
|
156
|
+
vault_processing_queue.status,
|
|
118
157
|
priority,
|
|
119
158
|
queued_at AS queuedAt,
|
|
120
159
|
claimed_at AS claimedAt,
|
|
@@ -141,16 +180,26 @@ function claimQueueItems(db, limit, options) {
|
|
|
141
180
|
vault_files.file_ext AS fileExt,
|
|
142
181
|
vault_files.source_type AS sourceType,
|
|
143
182
|
vault_files.file_size AS fileSize,
|
|
144
|
-
vault_files.file_path AS filePath
|
|
183
|
+
vault_files.file_path AS filePath,
|
|
184
|
+
vault_files.source_timestamp AS sourceTimestamp,
|
|
185
|
+
vault_files.source_timestamp_source AS sourceTimestampSource,
|
|
186
|
+
vault_files.source_timestamp_confidence AS sourceTimestampConfidence,
|
|
187
|
+
vault_extractions.artifact_path AS extractedTextPath,
|
|
188
|
+
vault_extractions.artifact_sha256 AS extractedTextSha256,
|
|
189
|
+
vault_extractions.parser_skill AS extractedTextParserSkill,
|
|
190
|
+
vault_extractions.char_count AS extractedTextCharCount
|
|
145
191
|
FROM vault_processing_queue
|
|
146
192
|
LEFT JOIN vault_files ON vault_files.id = vault_processing_queue.file_id
|
|
193
|
+
LEFT JOIN vault_extractions
|
|
194
|
+
ON vault_extractions.file_id = vault_processing_queue.file_id
|
|
195
|
+
AND vault_extractions.content_hash = vault_files.content_hash
|
|
147
196
|
WHERE (
|
|
148
197
|
vault_processing_queue.status = 'pending'
|
|
149
198
|
OR (
|
|
150
199
|
${errorEligibility}
|
|
151
200
|
)
|
|
152
201
|
)${filter.clause}${exclude.clause}
|
|
153
|
-
ORDER BY priority DESC, queued_at ASC
|
|
202
|
+
ORDER BY vault_processing_queue.priority DESC, vault_processing_queue.queued_at ASC
|
|
154
203
|
LIMIT ?
|
|
155
204
|
`);
|
|
156
205
|
const markProcessing = db.prepare(`
|
|
@@ -202,8 +251,8 @@ function claimQueueItems(db, limit, options) {
|
|
|
202
251
|
function fetchQueueItemsByStatus(db, status) {
|
|
203
252
|
const rows = db.prepare(`
|
|
204
253
|
SELECT
|
|
205
|
-
file_id AS fileId,
|
|
206
|
-
status,
|
|
254
|
+
vault_processing_queue.file_id AS fileId,
|
|
255
|
+
vault_processing_queue.status,
|
|
207
256
|
priority,
|
|
208
257
|
queued_at AS queuedAt,
|
|
209
258
|
claimed_at AS claimedAt,
|
|
@@ -230,19 +279,29 @@ function fetchQueueItemsByStatus(db, status) {
|
|
|
230
279
|
vault_files.file_ext AS fileExt,
|
|
231
280
|
vault_files.source_type AS sourceType,
|
|
232
281
|
vault_files.file_size AS fileSize,
|
|
233
|
-
vault_files.file_path AS filePath
|
|
282
|
+
vault_files.file_path AS filePath,
|
|
283
|
+
vault_files.source_timestamp AS sourceTimestamp,
|
|
284
|
+
vault_files.source_timestamp_source AS sourceTimestampSource,
|
|
285
|
+
vault_files.source_timestamp_confidence AS sourceTimestampConfidence,
|
|
286
|
+
vault_extractions.artifact_path AS extractedTextPath,
|
|
287
|
+
vault_extractions.artifact_sha256 AS extractedTextSha256,
|
|
288
|
+
vault_extractions.parser_skill AS extractedTextParserSkill,
|
|
289
|
+
vault_extractions.char_count AS extractedTextCharCount
|
|
234
290
|
FROM vault_processing_queue
|
|
235
291
|
LEFT JOIN vault_files ON vault_files.id = vault_processing_queue.file_id
|
|
236
|
-
|
|
237
|
-
|
|
292
|
+
LEFT JOIN vault_extractions
|
|
293
|
+
ON vault_extractions.file_id = vault_processing_queue.file_id
|
|
294
|
+
AND vault_extractions.content_hash = vault_files.content_hash
|
|
295
|
+
${status ? "WHERE vault_processing_queue.status = ?" : ""}
|
|
296
|
+
ORDER BY vault_processing_queue.priority DESC, vault_processing_queue.queued_at ASC
|
|
238
297
|
`).all(...(status ? [status] : []));
|
|
239
298
|
return rows.map(mapQueueRow);
|
|
240
299
|
}
|
|
241
300
|
function fetchQueueItemByFileId(db, fileId) {
|
|
242
301
|
const row = db.prepare(`
|
|
243
302
|
SELECT
|
|
244
|
-
file_id AS fileId,
|
|
245
|
-
status,
|
|
303
|
+
vault_processing_queue.file_id AS fileId,
|
|
304
|
+
vault_processing_queue.status,
|
|
246
305
|
priority,
|
|
247
306
|
queued_at AS queuedAt,
|
|
248
307
|
claimed_at AS claimedAt,
|
|
@@ -269,9 +328,19 @@ function fetchQueueItemByFileId(db, fileId) {
|
|
|
269
328
|
vault_files.file_ext AS fileExt,
|
|
270
329
|
vault_files.source_type AS sourceType,
|
|
271
330
|
vault_files.file_size AS fileSize,
|
|
272
|
-
vault_files.file_path AS filePath
|
|
331
|
+
vault_files.file_path AS filePath,
|
|
332
|
+
vault_files.source_timestamp AS sourceTimestamp,
|
|
333
|
+
vault_files.source_timestamp_source AS sourceTimestampSource,
|
|
334
|
+
vault_files.source_timestamp_confidence AS sourceTimestampConfidence,
|
|
335
|
+
vault_extractions.artifact_path AS extractedTextPath,
|
|
336
|
+
vault_extractions.artifact_sha256 AS extractedTextSha256,
|
|
337
|
+
vault_extractions.parser_skill AS extractedTextParserSkill,
|
|
338
|
+
vault_extractions.char_count AS extractedTextCharCount
|
|
273
339
|
FROM vault_processing_queue
|
|
274
340
|
LEFT JOIN vault_files ON vault_files.id = vault_processing_queue.file_id
|
|
341
|
+
LEFT JOIN vault_extractions
|
|
342
|
+
ON vault_extractions.file_id = vault_processing_queue.file_id
|
|
343
|
+
AND vault_extractions.content_hash = vault_files.content_hash
|
|
275
344
|
WHERE vault_processing_queue.file_id = ?
|
|
276
345
|
`).get(fileId);
|
|
277
346
|
return row ? mapQueueRow(row) : null;
|
|
@@ -287,11 +356,15 @@ function fetchVaultFile(db, fileId) {
|
|
|
287
356
|
file_path AS filePath,
|
|
288
357
|
content_hash AS contentHash,
|
|
289
358
|
file_mtime AS fileMtime,
|
|
359
|
+
source_timestamp AS sourceTimestamp,
|
|
360
|
+
source_timestamp_source AS sourceTimestampSource,
|
|
361
|
+
source_timestamp_confidence AS sourceTimestampConfidence,
|
|
362
|
+
source_timestamp_candidates AS sourceTimestampCandidates,
|
|
290
363
|
indexed_at AS indexedAt
|
|
291
364
|
FROM vault_files
|
|
292
365
|
WHERE id = ?
|
|
293
366
|
`).get(fileId);
|
|
294
|
-
return row
|
|
367
|
+
return row ? mapVaultFileRow(row) : null;
|
|
295
368
|
}
|
|
296
369
|
function buildProcessingOwnerId() {
|
|
297
370
|
return `${os.hostname()}:${process.pid}:${Date.now()}:${randomUUID().slice(0, 8)}`;
|
|
@@ -332,8 +405,8 @@ function fetchStaleProcessingQueueItems(db, processingOwnerId) {
|
|
|
332
405
|
const cutoff = toOffsetIso(addSeconds(new Date(), -PROCESSING_STALE_THRESHOLD_SECONDS));
|
|
333
406
|
const rows = db.prepare(`
|
|
334
407
|
SELECT
|
|
335
|
-
file_id AS fileId,
|
|
336
|
-
status,
|
|
408
|
+
vault_processing_queue.file_id AS fileId,
|
|
409
|
+
vault_processing_queue.status,
|
|
337
410
|
priority,
|
|
338
411
|
queued_at AS queuedAt,
|
|
339
412
|
claimed_at AS claimedAt,
|
|
@@ -360,10 +433,20 @@ function fetchStaleProcessingQueueItems(db, processingOwnerId) {
|
|
|
360
433
|
vault_files.file_ext AS fileExt,
|
|
361
434
|
vault_files.source_type AS sourceType,
|
|
362
435
|
vault_files.file_size AS fileSize,
|
|
363
|
-
vault_files.file_path AS filePath
|
|
436
|
+
vault_files.file_path AS filePath,
|
|
437
|
+
vault_files.source_timestamp AS sourceTimestamp,
|
|
438
|
+
vault_files.source_timestamp_source AS sourceTimestampSource,
|
|
439
|
+
vault_files.source_timestamp_confidence AS sourceTimestampConfidence,
|
|
440
|
+
vault_extractions.artifact_path AS extractedTextPath,
|
|
441
|
+
vault_extractions.artifact_sha256 AS extractedTextSha256,
|
|
442
|
+
vault_extractions.parser_skill AS extractedTextParserSkill,
|
|
443
|
+
vault_extractions.char_count AS extractedTextCharCount
|
|
364
444
|
FROM vault_processing_queue
|
|
365
445
|
LEFT JOIN vault_files ON vault_files.id = vault_processing_queue.file_id
|
|
366
|
-
|
|
446
|
+
LEFT JOIN vault_extractions
|
|
447
|
+
ON vault_extractions.file_id = vault_processing_queue.file_id
|
|
448
|
+
AND vault_extractions.content_hash = vault_files.content_hash
|
|
449
|
+
WHERE vault_processing_queue.status = 'processing'
|
|
367
450
|
AND COALESCE(processing_owner_id, '') != ?
|
|
368
451
|
AND (
|
|
369
452
|
(heartbeat_at IS NOT NULL AND julianday(heartbeat_at) <= julianday(?))
|
|
@@ -513,10 +596,104 @@ function formatQueueErrorMessage(message, autoRetryExhausted) {
|
|
|
513
596
|
: "";
|
|
514
597
|
return `${message}${autoRetrySuffix}`.slice(0, 1_000);
|
|
515
598
|
}
|
|
516
|
-
function
|
|
599
|
+
function resolveCurrentVaultContentHash(db, fileId) {
|
|
600
|
+
const row = db
|
|
601
|
+
.prepare("SELECT content_hash AS contentHash FROM vault_files WHERE id = ?")
|
|
602
|
+
.get(fileId);
|
|
603
|
+
return typeof row?.contentHash === "string" && row.contentHash.length > 0 ? row.contentHash : null;
|
|
604
|
+
}
|
|
605
|
+
function inferExtractionParserSkill(manifest) {
|
|
606
|
+
const explicit = manifest.extractedText?.parserSkill?.trim();
|
|
607
|
+
if (explicit) {
|
|
608
|
+
return explicit;
|
|
609
|
+
}
|
|
610
|
+
return manifest.skillsUsed.find((skill) => skill !== "tiangong-wiki-skill") ?? null;
|
|
611
|
+
}
|
|
612
|
+
function persistWorkflowExtraction(db, fileId, manifest, extractedTextPath, processedAt) {
|
|
613
|
+
const contentHash = resolveCurrentVaultContentHash(db, fileId);
|
|
614
|
+
if (!contentHash) {
|
|
615
|
+
return;
|
|
616
|
+
}
|
|
617
|
+
const expectedPath = path.resolve(extractedTextPath);
|
|
618
|
+
if (manifest.extractedText?.path && path.resolve(manifest.extractedText.path) !== expectedPath) {
|
|
619
|
+
throw new AppError("result.extractedText.path must match EXTRACTED_TEXT_PATH", "runtime", {
|
|
620
|
+
expectedPath,
|
|
621
|
+
actualPath: manifest.extractedText.path,
|
|
622
|
+
});
|
|
623
|
+
}
|
|
624
|
+
const extractedText = pathExistsSync(expectedPath) ? readTextFileSync(expectedPath) : "";
|
|
625
|
+
if (extractedText.length === 0) {
|
|
626
|
+
if (manifest.extractedText) {
|
|
627
|
+
throw new AppError("result.extractedText was declared but EXTRACTED_TEXT_PATH is empty", "runtime", {
|
|
628
|
+
expectedPath,
|
|
629
|
+
});
|
|
630
|
+
}
|
|
631
|
+
db.prepare("DELETE FROM vault_extractions WHERE file_id = ? AND content_hash = ?").run(fileId, contentHash);
|
|
632
|
+
return;
|
|
633
|
+
}
|
|
634
|
+
const artifactSha256 = sha256Text(extractedText);
|
|
635
|
+
if (manifest.extractedText?.sha256 && manifest.extractedText.sha256 !== artifactSha256) {
|
|
636
|
+
throw new AppError("result.extractedText.sha256 does not match EXTRACTED_TEXT_PATH content", "runtime", {
|
|
637
|
+
expectedSha256: artifactSha256,
|
|
638
|
+
actualSha256: manifest.extractedText.sha256,
|
|
639
|
+
});
|
|
640
|
+
}
|
|
641
|
+
db.prepare(`
|
|
642
|
+
INSERT INTO vault_extractions(
|
|
643
|
+
file_id,
|
|
644
|
+
content_hash,
|
|
645
|
+
artifact_path,
|
|
646
|
+
artifact_sha256,
|
|
647
|
+
parser_skill,
|
|
648
|
+
char_count,
|
|
649
|
+
created_at,
|
|
650
|
+
updated_at
|
|
651
|
+
)
|
|
652
|
+
VALUES (
|
|
653
|
+
@file_id,
|
|
654
|
+
@content_hash,
|
|
655
|
+
@artifact_path,
|
|
656
|
+
@artifact_sha256,
|
|
657
|
+
@parser_skill,
|
|
658
|
+
@char_count,
|
|
659
|
+
@created_at,
|
|
660
|
+
@updated_at
|
|
661
|
+
)
|
|
662
|
+
ON CONFLICT(file_id, content_hash) DO UPDATE SET
|
|
663
|
+
artifact_path = excluded.artifact_path,
|
|
664
|
+
artifact_sha256 = excluded.artifact_sha256,
|
|
665
|
+
parser_skill = excluded.parser_skill,
|
|
666
|
+
char_count = excluded.char_count,
|
|
667
|
+
updated_at = excluded.updated_at
|
|
668
|
+
`).run({
|
|
669
|
+
file_id: fileId,
|
|
670
|
+
content_hash: contentHash,
|
|
671
|
+
artifact_path: expectedPath,
|
|
672
|
+
artifact_sha256: artifactSha256,
|
|
673
|
+
parser_skill: inferExtractionParserSkill(manifest),
|
|
674
|
+
char_count: extractedText.length,
|
|
675
|
+
created_at: processedAt,
|
|
676
|
+
updated_at: processedAt,
|
|
677
|
+
});
|
|
678
|
+
}
|
|
679
|
+
function buildExtractionResultFields(manifest, extractedTextPath) {
|
|
680
|
+
const expectedPath = path.resolve(extractedTextPath);
|
|
681
|
+
const extractedText = pathExistsSync(expectedPath) ? readTextFileSync(expectedPath) : "";
|
|
682
|
+
if (extractedText.length === 0) {
|
|
683
|
+
return {};
|
|
684
|
+
}
|
|
685
|
+
return {
|
|
686
|
+
extractedTextPath: expectedPath,
|
|
687
|
+
extractedTextSha256: sha256Text(extractedText),
|
|
688
|
+
extractedTextParserSkill: inferExtractionParserSkill(manifest),
|
|
689
|
+
extractedTextCharCount: extractedText.length,
|
|
690
|
+
};
|
|
691
|
+
}
|
|
692
|
+
function applyWorkflowManifest(db, fileId, manifest, resultManifestPath, extractedTextPath, currentAttempts) {
|
|
517
693
|
const resultPageId = manifest.createdPageIds[0] ?? manifest.updatedPageIds[0] ?? null;
|
|
518
694
|
const status = manifest.status;
|
|
519
695
|
const processedAt = toOffsetIso();
|
|
696
|
+
persistWorkflowExtraction(db, fileId, manifest, extractedTextPath, processedAt);
|
|
520
697
|
if (status === "error") {
|
|
521
698
|
const failureState = buildQueueFailureState(manifest.reason);
|
|
522
699
|
const nextAttempts = currentAttempts + 1;
|
|
@@ -694,7 +871,8 @@ function recoverStaleProcessingQueueItems(input) {
|
|
|
694
871
|
});
|
|
695
872
|
if (recoveredManifest && item.resultManifestPath) {
|
|
696
873
|
assertTemplateEvolutionAllowed(recoveredManifest, input.templateEvolution);
|
|
697
|
-
const
|
|
874
|
+
const extractedTextPath = getWorkflowArtifactSet(input.paths, item.fileId).extractedTextPath;
|
|
875
|
+
const outcome = applyWorkflowManifest(input.db, item.fileId, recoveredManifest, item.resultManifestPath, extractedTextPath, item.attempts);
|
|
698
876
|
input.log?.(`${item.fileId}: recovered stale processing with persisted result status=${outcome.status} thread=${recoveredManifest.threadId} ${formatManifestLogFields(recoveredManifest)} result=${item.resultManifestPath}`);
|
|
699
877
|
recovered.push({
|
|
700
878
|
fileId: item.fileId,
|
|
@@ -708,6 +886,7 @@ function recoverStaleProcessingQueueItems(input) {
|
|
|
708
886
|
updatedPageIds: recoveredManifest.updatedPageIds,
|
|
709
887
|
proposedTypeNames: recoveredManifest.proposedTypes.map((entry) => entry.name),
|
|
710
888
|
resultManifestPath: item.resultManifestPath,
|
|
889
|
+
...buildExtractionResultFields(recoveredManifest, extractedTextPath),
|
|
711
890
|
});
|
|
712
891
|
continue;
|
|
713
892
|
}
|
|
@@ -783,6 +962,7 @@ function prepareCodexWorkflowInput(paths, item, file, localFilePath, env, allowT
|
|
|
783
962
|
workspaceRoot,
|
|
784
963
|
vaultFilePath: localFilePath,
|
|
785
964
|
resultJsonPath: artifacts.resultPath,
|
|
965
|
+
extractedTextPath: artifacts.extractedTextPath,
|
|
786
966
|
allowTemplateEvolution,
|
|
787
967
|
});
|
|
788
968
|
ensureWorkflowArtifactSet(paths, {
|
|
@@ -796,6 +976,7 @@ function prepareCodexWorkflowInput(paths, item, file, localFilePath, env, allowT
|
|
|
796
976
|
vaultPath: paths.vaultPath,
|
|
797
977
|
localFilePath,
|
|
798
978
|
resultJsonPath: artifacts.resultPath,
|
|
979
|
+
extractedTextPath: artifacts.extractedTextPath,
|
|
799
980
|
skillArtifactsPath: artifacts.skillArtifactsPath,
|
|
800
981
|
file,
|
|
801
982
|
queue: {
|
|
@@ -817,6 +998,7 @@ function prepareCodexWorkflowInput(paths, item, file, localFilePath, env, allowT
|
|
|
817
998
|
promptText,
|
|
818
999
|
queueItemPath: artifacts.queueItemPath,
|
|
819
1000
|
resultPath: artifacts.resultPath,
|
|
1001
|
+
extractedTextPath: artifacts.extractedTextPath,
|
|
820
1002
|
skillArtifactsPath: artifacts.skillArtifactsPath,
|
|
821
1003
|
model: resolveAgentSettings(env).model,
|
|
822
1004
|
env,
|
|
@@ -903,7 +1085,7 @@ async function processClaimedQueueItem(input) {
|
|
|
903
1085
|
}));
|
|
904
1086
|
assertTemplateEvolutionAllowed(manifest, templateEvolution);
|
|
905
1087
|
finalOutcome = {
|
|
906
|
-
outcome: applyWorkflowManifest(db, item.fileId, manifest, artifacts.resultPath, item.attempts),
|
|
1088
|
+
outcome: applyWorkflowManifest(db, item.fileId, manifest, artifacts.resultPath, artifacts.extractedTextPath, item.attempts),
|
|
907
1089
|
manifest,
|
|
908
1090
|
handleThreadId: handle.threadId,
|
|
909
1091
|
};
|
|
@@ -925,7 +1107,7 @@ async function processClaimedQueueItem(input) {
|
|
|
925
1107
|
if (recoveredManifest) {
|
|
926
1108
|
assertTemplateEvolutionAllowed(recoveredManifest, templateEvolution);
|
|
927
1109
|
finalOutcome = {
|
|
928
|
-
outcome: applyWorkflowManifest(db, item.fileId, recoveredManifest, artifacts.resultPath, item.attempts),
|
|
1110
|
+
outcome: applyWorkflowManifest(db, item.fileId, recoveredManifest, artifacts.resultPath, artifacts.extractedTextPath, item.attempts),
|
|
929
1111
|
manifest: recoveredManifest,
|
|
930
1112
|
handleThreadId: recoveredManifest.threadId,
|
|
931
1113
|
};
|
|
@@ -956,6 +1138,7 @@ async function processClaimedQueueItem(input) {
|
|
|
956
1138
|
updatedPageIds: finalOutcome.manifest.updatedPageIds,
|
|
957
1139
|
proposedTypeNames: finalOutcome.manifest.proposedTypes.map((entry) => entry.name),
|
|
958
1140
|
resultManifestPath: artifacts.resultPath,
|
|
1141
|
+
...buildExtractionResultFields(finalOutcome.manifest, artifacts.extractedTextPath),
|
|
959
1142
|
},
|
|
960
1143
|
};
|
|
961
1144
|
}
|
|
@@ -965,7 +1148,8 @@ async function processClaimedQueueItem(input) {
|
|
|
965
1148
|
: null;
|
|
966
1149
|
if (recoveredManifest && resultManifestPath) {
|
|
967
1150
|
assertTemplateEvolutionAllowed(recoveredManifest, templateEvolution);
|
|
968
|
-
const
|
|
1151
|
+
const extractedTextPath = getWorkflowArtifactSet(paths, item.fileId).extractedTextPath;
|
|
1152
|
+
const recoveredOutcome = applyWorkflowManifest(db, item.fileId, recoveredManifest, resultManifestPath, extractedTextPath, item.attempts);
|
|
969
1153
|
input.log?.(`${item.fileId}: recovered persisted workflow result after terminal failure status=${recoveredOutcome.status} thread=${recoveredManifest.threadId} ${formatManifestLogFields(recoveredManifest)} result=${resultManifestPath} message=${formatWorkflowError(error)}`);
|
|
970
1154
|
return {
|
|
971
1155
|
status: recoveredOutcome.status,
|
|
@@ -981,6 +1165,7 @@ async function processClaimedQueueItem(input) {
|
|
|
981
1165
|
updatedPageIds: recoveredManifest.updatedPageIds,
|
|
982
1166
|
proposedTypeNames: recoveredManifest.proposedTypes.map((entry) => entry.name),
|
|
983
1167
|
resultManifestPath,
|
|
1168
|
+
...buildExtractionResultFields(recoveredManifest, extractedTextPath),
|
|
984
1169
|
},
|
|
985
1170
|
};
|
|
986
1171
|
}
|
|
@@ -1101,6 +1286,7 @@ export async function processVaultQueueBatch(env = process.env, options = {}) {
|
|
|
1101
1286
|
};
|
|
1102
1287
|
for (const recoveredItem of recoverStaleProcessingQueueItems({
|
|
1103
1288
|
db,
|
|
1289
|
+
paths,
|
|
1104
1290
|
processingOwnerId,
|
|
1105
1291
|
log: options.log,
|
|
1106
1292
|
templateEvolution,
|
package/dist/core/vault.js
CHANGED
@@ -19,6 +19,164 @@ function normalizeVaultFileExtension(filePath) {
     const fileExt = path.extname(filePath).replace(/^\./, "").toLowerCase();
     return fileExt || null;
 }
+function normalizeFileMtimeMs(fileMtime) {
+    if (typeof fileMtime !== "number" || !Number.isFinite(fileMtime) || fileMtime <= 0) {
+        return null;
+    }
+    return fileMtime < 1_000_000_000_000 ? fileMtime * 1000 : fileMtime;
+}
+function isValidDateParts(year, month, day, hour, minute, second) {
+    if (year < 1900 || year > 2099 || month < 1 || month > 12 || day < 1 || day > 31) {
+        return false;
+    }
+    if (hour < 0 || hour > 23 || minute < 0 || minute > 59 || second < 0 || second > 59) {
+        return false;
+    }
+    const date = new Date(year, month - 1, day, hour, minute, second);
+    return (date.getFullYear() === year &&
+        date.getMonth() === month - 1 &&
+        date.getDate() === day &&
+        date.getHours() === hour &&
+        date.getMinutes() === minute &&
+        date.getSeconds() === second);
+}
+function buildTimestampCandidate(input) {
+    const year = Number.parseInt(input.year, 10);
+    const month = Number.parseInt(input.month, 10);
+    const day = Number.parseInt(input.day, 10);
+    const hour = input.hour ? Number.parseInt(input.hour, 10) : 0;
+    const minute = input.minute ? Number.parseInt(input.minute, 10) : 0;
+    const second = input.second ? Number.parseInt(input.second, 10) : 0;
+    if (!isValidDateParts(year, month, day, hour, minute, second)) {
+        return null;
+    }
+    const precision = input.hour && input.minute ? "datetime" : "date";
+    const baseConfidence = input.source === "file_name" ? 0.9 : 0.8;
+    return {
+        timestamp: toOffsetIso(new Date(year, month - 1, day, hour, minute, second)),
+        source: input.source,
+        confidence: precision === "datetime" ? baseConfidence + 0.05 : baseConfidence,
+        precision,
+        raw: input.raw,
+    };
+}
+function collectDateCandidatesFromText(text, source) {
+    const candidates = [];
+    const separated = /(^|[^0-9])((?:19|20)\d{2})[-_.年/]([01]?\d)[-_.月/]([0-3]?\d)(?:[日\sT_@-]+([0-2]?\d)[::]?([0-5]\d)(?:[::]?([0-5]\d))?)?(?=$|[^0-9])/g;
+    const compact = /(^|[^0-9])((?:19|20)\d{2})([01]\d)([0-3]\d)(?:[T_\s@-]?([0-2]\d)([0-5]\d)([0-5]\d)?)?(?=$|[^0-9])/g;
+    for (const match of text.matchAll(separated)) {
+        const raw = match[0].slice(match[1].length);
+        const candidate = buildTimestampCandidate({
+            year: match[2],
+            month: match[3],
+            day: match[4],
+            hour: match[5],
+            minute: match[6],
+            second: match[7],
+            raw,
+            source,
+        });
+        if (candidate) {
+            candidates.push(candidate);
+        }
+    }
+    for (const match of text.matchAll(compact)) {
+        const raw = match[0].slice(match[1].length);
+        const candidate = buildTimestampCandidate({
+            year: match[2],
+            month: match[3],
+            day: match[4],
+            hour: match[5],
+            minute: match[6],
+            second: match[7],
+            raw,
+            source,
+        });
+        if (candidate) {
+            candidates.push(candidate);
+        }
+    }
+    return candidates;
+}
+function dedupeTimestampCandidates(candidates) {
+    const seen = new Set();
+    const result = [];
+    for (const candidate of candidates) {
+        const key = `${candidate.timestamp}:${candidate.source}:${candidate.raw}`;
+        if (seen.has(key)) {
+            continue;
+        }
+        seen.add(key);
+        result.push(candidate);
+    }
+    return result;
+}
+function inferVaultSourceTimestamp(input) {
+    const fileNameWithoutExt = input.fileName.replace(/\.[^.]+$/, "");
+    const directory = path.posix.dirname(input.id);
+    const pathText = directory === "." ? "" : directory;
+    const candidates = dedupeTimestampCandidates([
+        ...collectDateCandidatesFromText(fileNameWithoutExt, "file_name"),
+        ...collectDateCandidatesFromText(pathText, "path"),
+    ]);
+    const mtimeMs = normalizeFileMtimeMs(input.fileMtime);
+    if (mtimeMs !== null) {
+        candidates.push({
+            timestamp: toOffsetIso(new Date(mtimeMs)),
+            source: "file_mtime",
+            confidence: 0.5,
+            precision: "datetime",
+            raw: String(input.fileMtime),
+        });
+    }
+    const preferred = candidates
+        .slice()
+        .sort((left, right) => right.confidence - left.confidence || left.timestamp.localeCompare(right.timestamp))[0];
+    return {
+        sourceTimestamp: preferred?.timestamp ?? null,
+        sourceTimestampSource: preferred?.source ?? null,
+        sourceTimestampConfidence: preferred?.confidence ?? null,
+        sourceTimestampCandidates: candidates,
+    };
+}
+function serializeSourceTimestampCandidates(candidates) {
+    if (!candidates || candidates.length === 0) {
+        return null;
+    }
+    return JSON.stringify(candidates);
+}
+function parseSourceTimestampCandidates(value) {
+    if (Array.isArray(value)) {
+        return value;
+    }
+    if (typeof value !== "string" || value.trim().length === 0) {
+        return [];
+    }
+    try {
+        const parsed = JSON.parse(value);
+        return Array.isArray(parsed) ? parsed : [];
+    }
+    catch {
+        return [];
+    }
+}
+function mapVaultFileRow(row) {
+    return {
+        id: String(row.id),
+        fileName: String(row.fileName),
+        fileExt: typeof row.fileExt === "string" ? row.fileExt : null,
+        sourceType: typeof row.sourceType === "string" ? row.sourceType : null,
+        fileSize: Number(row.fileSize ?? 0),
+        filePath: String(row.filePath),
+        contentHash: typeof row.contentHash === "string" ? row.contentHash : null,
+        fileMtime: typeof row.fileMtime === "number" ? row.fileMtime : null,
+        sourceTimestamp: typeof row.sourceTimestamp === "string" ? row.sourceTimestamp : null,
+        sourceTimestampSource: typeof row.sourceTimestampSource === "string" ? row.sourceTimestampSource : null,
+        sourceTimestampConfidence: typeof row.sourceTimestampConfidence === "number" ? row.sourceTimestampConfidence : null,
+        sourceTimestampCandidates: parseSourceTimestampCandidates(row.sourceTimestampCandidates),
+        indexedAt: String(row.indexedAt),
+    };
+}
 function createAllowedVaultFileTypeSet(vaultFileTypes) {
     return new Set(vaultFileTypes.map((item) => item.trim().replace(/^\./, "").toLowerCase()).filter(Boolean));
 }
@@ -42,6 +200,11 @@ function localVaultFiles(vaultPath, hashMode, vaultFileTypes) {
         filePath,
         contentHash: computeVaultHash(hashMode, id, filePath, stats.size, stats.mtimeMs),
         fileMtime: stats.mtimeMs,
+        ...inferVaultSourceTimestamp({
+            id,
+            fileName: path.basename(filePath),
+            fileMtime: stats.mtimeMs,
+        }),
         indexedAt,
     };
 });
@@ -162,6 +325,11 @@ async function scanSynologyFolder(client, remoteRoot, currentFolder, results, al
         filePath,
         contentHash: sha256Text(`${relativeId}:${filePath}:${fileSize}:${fileMtime}`),
         fileMtime,
+        ...inferVaultSourceTimestamp({
+            id: relativeId,
+            fileName: item.name ?? path.basename(filePath),
+            fileMtime,
+        }),
         indexedAt,
     });
 }
@@ -198,10 +366,17 @@ function getExistingVaultFiles(db) {
       file_path AS filePath,
       content_hash AS contentHash,
       file_mtime AS fileMtime,
+      source_timestamp AS sourceTimestamp,
+      source_timestamp_source AS sourceTimestampSource,
+      source_timestamp_confidence AS sourceTimestampConfidence,
+      source_timestamp_candidates AS sourceTimestampCandidates,
       indexed_at AS indexedAt
     FROM vault_files
   `).all();
-    return new Map(rows.map((row) =>
+    return new Map(rows.map((row) => {
+        const file = mapVaultFileRow(row);
+        return [file.id, file];
+    }));
 }
 export function getVaultQueuePriority(fileExt) {
     const normalized = (fileExt ?? "").toLowerCase();
@@ -406,9 +581,13 @@ export function syncVaultIndex(db, currentFiles, syncId) {
     }
     const upsertStatement = db.prepare(`
     INSERT INTO vault_files(
-      id, file_name, file_ext, source_type, file_size, file_path, content_hash, file_mtime,
+      id, file_name, file_ext, source_type, file_size, file_path, content_hash, file_mtime,
+      source_timestamp, source_timestamp_source, source_timestamp_confidence, source_timestamp_candidates,
+      indexed_at
     ) VALUES (
-      @id, @file_name, @file_ext, @source_type, @file_size, @file_path, @content_hash, @file_mtime,
+      @id, @file_name, @file_ext, @source_type, @file_size, @file_path, @content_hash, @file_mtime,
+      @source_timestamp, @source_timestamp_source, @source_timestamp_confidence, @source_timestamp_candidates,
+      @indexed_at
     )
     ON CONFLICT(id) DO UPDATE SET
       file_name = excluded.file_name,
@@ -418,6 +597,10 @@ export function syncVaultIndex(db, currentFiles, syncId) {
       file_path = excluded.file_path,
       content_hash = excluded.content_hash,
       file_mtime = excluded.file_mtime,
+      source_timestamp = excluded.source_timestamp,
+      source_timestamp_source = excluded.source_timestamp_source,
+      source_timestamp_confidence = excluded.source_timestamp_confidence,
+      source_timestamp_candidates = excluded.source_timestamp_candidates,
      indexed_at = excluded.indexed_at
   `);
     const insertChange = db.prepare(`
@@ -573,6 +756,10 @@ export function syncVaultIndex(db, currentFiles, syncId) {
         file_path: file.filePath,
         content_hash: file.contentHash,
         file_mtime: file.fileMtime,
+        source_timestamp: file.sourceTimestamp ?? null,
+        source_timestamp_source: file.sourceTimestampSource ?? null,
+        source_timestamp_confidence: file.sourceTimestampConfidence ?? null,
+        source_timestamp_candidates: serializeSourceTimestampCandidates(file.sourceTimestampCandidates),
        indexed_at: file.indexedAt,
     });
 }
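A quick illustration of the "separated" date pattern above applied to a file name; the regex is copied verbatim from `collectDateCandidatesFromText`, and the file name and console output are invented for the demo:

```js
// Illustrative only: how a file-name candidate is found and why it outranks
// the mtime fallback in inferVaultSourceTimestamp.
const separated = /(^|[^0-9])((?:19|20)\d{2})[-_.年/]([01]?\d)[-_.月/]([0-3]?\d)(?:[日\sT_@-]+([0-2]?\d)[::]?([0-5]\d)(?:[::]?([0-5]\d))?)?(?=$|[^0-9])/g;

const name = "meeting-notes 2023-05-12";
for (const m of name.matchAll(separated)) {
  // m[2]=year, m[3]=month, m[4]=day; no hour/minute → "date" precision,
  // so the candidate gets the file_name base confidence of 0.9.
  console.log(m[2], m[3], m[4]); // "2023" "05" "12"
}
// A bare mtime fallback candidate has confidence 0.5, so the file-name
// candidate wins the sort and becomes source_timestamp.
```
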
package/dist/core/workflow-context.js
CHANGED

@@ -42,6 +42,7 @@ export function getWorkflowArtifactSet(paths, queueItemId) {
         queueItemPath: path.join(rootDir, "queue-item.json"),
         promptPath: path.join(rootDir, "prompt.md"),
         resultPath: path.join(rootDir, "result.json"),
+        extractedTextPath: path.join(rootDir, "extracted-fulltext.txt"),
         skillArtifactsPath: path.join(rootDir, "skill-artifacts"),
     };
 }
@@ -52,6 +53,7 @@ export function buildVaultWorkflowPrompt(input) {
         `WORKSPACE_ROOT=${input.workspaceRoot}`,
         `VAULT_FILE_PATH=${input.vaultFilePath}`,
         `RESULT_JSON_PATH=${input.resultJsonPath}`,
+        `EXTRACTED_TEXT_PATH=${input.extractedTextPath}`,
         `ALLOW_TEMPLATE_EVOLUTION=${input.allowTemplateEvolution ? "true" : "false"}`,
         "",
         "## Goal",
@@ -83,8 +85,11 @@ export function buildVaultWorkflowPrompt(input) {
         "",
         "1. Read queue-item.json next to RESULT_JSON_PATH.",
         "2. Read the target vault file at VAULT_FILE_PATH. Refer to `references/vault-to-wiki-instruction.md` (Phase 1) in the wiki package for file-type-specific reading strategies, parser skill discovery, image handling, and metadata utilization.",
-        "   -
+        "   - Plain text-like files (`.txt`, `.md`, `.markdown`, `.json`, `.csv`, `.tsv`, `.yaml`, `.yml`) must be read directly from VAULT_FILE_PATH. Do not send them to parser skills or remote unstructure APIs; write the direct plain-text snapshot to EXTRACTED_TEXT_PATH when it becomes the analysis input.",
+        "   - If `WIKI_PARSER_SKILLS` includes `document-granular-decompose` and `UNSTRUCTURED_API_BASE_URL` plus `UNSTRUCTURED_AUTH_TOKEN` are set, prefer that skill for supported non-text document/image files before the legacy type-specific parser skills. This includes PDF, Word, PowerPoint, Excel, and common image formats; use the skill's `SKILL.md` for the exact extension allowlist.",
         "   - When using `document-granular-decompose`, request `return_txt=true`, treat the pure text extracted from `response.txt`/`txt` as the main input, and keep raw JSON only for debugging or page-number evidence.",
+        "   - If you extract plain text through any parser skill, write that canonical plain text snapshot to EXTRACTED_TEXT_PATH. For `document-granular-decompose`, write the same pure text from `response.txt`/`txt`. Leave EXTRACTED_TEXT_PATH empty only when no extractable text exists.",
+        "   - queue-item.json may include `file.sourceTimestamp`, `file.sourceTimestampSource`, `file.sourceTimestampConfidence`, and timestamp candidates inferred from the source file name, path, or mtime. Use this as evidence about source recency, but do not blindly copy it into page `createdAt` or `updatedAt`; those fields are normalized by the wiki system.",
         "3. Discover the current page type ontology via `tiangong-wiki type list` and `tiangong-wiki type show <type>`. Do not assume any type, template, or default target type.",
         "4. Search the existing wiki for overlapping or related content:",
         "   - Use `tiangong-wiki fts` and `tiangong-wiki search` with key terms from the source.",
@@ -173,7 +178,7 @@ export function buildVaultWorkflowPrompt(input) {
         "",
         "The authoritative threadId is queue-item.json.threadId. Read it from there and copy it unchanged into result.json.threadId. If it is empty on first read, read queue-item.json again immediately before writing the manifest.",
         "",
-        "Write RESULT_JSON_PATH as one JSON object with: status, decision, reason, threadId, skillsUsed, createdPageIds, updatedPageIds, appliedTypeNames, proposedTypes, actions, lint.",
+        "Write RESULT_JSON_PATH as one JSON object with: status, decision, reason, threadId, skillsUsed, createdPageIds, updatedPageIds, appliedTypeNames, proposedTypes, actions, lint, and optional extractedText.",
         "",
         "### Allowed Values",
         "",
@@ -182,10 +187,11 @@ export function buildVaultWorkflowPrompt(input) {
         "- **actions**: Array of objects, never strings. Allowed action kinds: create_page, update_page, create_template. Every action object must include kind and summary. create_page requires pageType and title. update_page requires pageId. create_template requires pageType and title.",
         "- **proposedTypes**: Objects with name, reason, suggestedTemplateSections.",
         "- **lint**: Objects with pageId, errors, warnings.",
+        "- **extractedText**: Optional object when EXTRACTED_TEXT_PATH contains extracted plain text. Include path=EXTRACTED_TEXT_PATH, parserSkill, sha256 when practical, and charCount. Do not put the full text itself in result.json.",
         "",
         "### Example",
         "",
-        '{"status":"done","decision":"apply","reason":"Updated the existing method.","threadId":"<copy queue-item.json.threadId>","skillsUsed":["tiangong-wiki-skill"],"createdPageIds":[],"updatedPageIds":["methods/example.md"],"appliedTypeNames":["method"],"proposedTypes":[],"actions":[{"kind":"update_page","pageId":"methods/example.md","pageType":"method","summary":"Updated the page with durable knowledge."}],"lint":[{"pageId":"methods/example.md","errors":0,"warnings":0}]}',
+        '{"status":"done","decision":"apply","reason":"Updated the existing method.","threadId":"<copy queue-item.json.threadId>","skillsUsed":["tiangong-wiki-skill"],"createdPageIds":[],"updatedPageIds":["methods/example.md"],"appliedTypeNames":["method"],"proposedTypes":[],"actions":[{"kind":"update_page","pageId":"methods/example.md","pageType":"method","summary":"Updated the page with durable knowledge."}],"lint":[{"pageId":"methods/example.md","errors":0,"warnings":0}],"extractedText":{"path":"<copy EXTRACTED_TEXT_PATH>","parserSkill":"document-granular-decompose","sha256":"<sha256 of extracted-fulltext.txt>","charCount":1234}}',
         "",
         "If no page change is justified, still write RESULT_JSON_PATH with decision=skip or decision=propose_only and then stop.",
         "Use RESULT_JSON_PATH only for the final structured manifest. Write raw JSON only, with no Markdown fences and no prose before or after the JSON object.",
@@ -244,5 +250,6 @@ export function ensureWorkflowArtifactSet(paths, input) {
         "This prompt is intentionally minimal and will be populated by the workflow runner.",
     ].join("\n"));
     writeTextFileSync(artifacts.resultPath, "");
+    writeTextFileSync(artifacts.extractedTextPath, "");
     return artifacts;
 }
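A sketch of the worker-side step the prompt asks for, building the optional `extractedText` block from the snapshot already written to EXTRACTED_TEXT_PATH; the helper name is hypothetical and only Node built-ins are assumed:

```js
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

// Hypothetical helper: compose result.json's extractedText block with the
// fields the prompt lists (path, parserSkill, sha256, charCount).
function buildExtractedTextBlock(extractedTextPath, parserSkill) {
  const text = readFileSync(extractedTextPath, "utf8");
  if (text.length === 0) {
    return undefined; // omit the block when nothing was extracted
  }
  return {
    path: extractedTextPath,
    parserSkill,
    sha256: createHash("sha256").update(text, "utf8").digest("hex"),
    charCount: text.length,
  };
}
```
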
package/dist/core/workflow-result.js
CHANGED

@@ -27,6 +27,13 @@ function ensureNumber(value, label) {
     }
     return value;
 }
+function ensureNonNegativeInteger(value, label) {
+    const parsed = ensureNumber(value, label);
+    if (!Number.isInteger(parsed) || parsed < 0) {
+        fail(`${label} must be a non-negative integer`);
+    }
+    return parsed;
+}
 function ensureStatus(value) {
     const status = ensureString(value, "result.status");
     if (status === "done" || status === "skipped" || status === "error") {
@@ -50,6 +57,28 @@ function parseSourceFile(value) {
     const sha256 = sourceFile.sha256 === undefined ? undefined : ensureString(sourceFile.sha256, "result.sourceFile.sha256");
     return { path, ...(sha256 ? { sha256 } : {}) };
 }
+function parseExtractedText(value) {
+    if (value === undefined) {
+        return undefined;
+    }
+    const extractedText = ensureRecord(value, "result.extractedText");
+    const path = ensureString(extractedText.path, "result.extractedText.path");
+    const sha256 = extractedText.sha256 === undefined
+        ? undefined
+        : ensureString(extractedText.sha256, "result.extractedText.sha256");
+    const parserSkill = extractedText.parserSkill === undefined
+        ? undefined
+        : ensureString(extractedText.parserSkill, "result.extractedText.parserSkill");
+    const charCount = extractedText.charCount === undefined
+        ? undefined
+        : ensureNonNegativeInteger(extractedText.charCount, "result.extractedText.charCount");
+    return {
+        path,
+        ...(sha256 ? { sha256 } : {}),
+        ...(parserSkill ? { parserSkill } : {}),
+        ...(charCount !== undefined ? { charCount } : {}),
+    };
+}
 function parseProposedTypes(value) {
     if (!Array.isArray(value)) {
         fail("result.proposedTypes must be an array");
@@ -129,6 +158,7 @@ export function parseWorkflowResult(raw) {
         proposedTypes: parseProposedTypes(result.proposedTypes),
         actions: parseActions(result.actions),
         lint: parseLint(result.lint),
+        extractedText: parseExtractedText(result.extractedText),
         sourceFile: parseSourceFile(result.sourceFile),
     };
     if (manifest.decision === "apply" && manifest.actions.length === 0) {
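Two manifest fragments, with invented values, showing what `parseExtractedText` accepts and rejects under the rules above:

```js
// Accepted: path is a string, charCount passes ensureNonNegativeInteger.
const ok = {
  path: "/workspace/.queue-artifacts/item/extracted-fulltext.txt",
  parserSkill: "document-granular-decompose",
  charCount: 1234,
};
// Rejected: negative charCount fails with
// "result.extractedText.charCount must be a non-negative integer".
const bad = {
  path: "/workspace/.queue-artifacts/item/extracted-fulltext.txt",
  charCount: -1,
};
```
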
package/dist/operations/dashboard.js
CHANGED

@@ -287,11 +287,18 @@ function buildQueueListItem(item) {
         sourceType: item.sourceType ?? null,
         fileSize: item.fileSize ?? null,
         filePath: item.filePath ?? null,
+        sourceTimestamp: item.sourceTimestamp ?? null,
+        sourceTimestampSource: item.sourceTimestampSource ?? null,
+        sourceTimestampConfidence: item.sourceTimestampConfidence ?? null,
         createdPageIds: item.createdPageIds ?? [],
         updatedPageIds: item.updatedPageIds ?? [],
         appliedTypeNames: item.appliedTypeNames ?? [],
         proposedTypeNames: item.proposedTypeNames ?? [],
         skillsUsed: item.skillsUsed ?? [],
+        extractedTextPath: item.extractedTextPath ?? null,
+        extractedTextSha256: item.extractedTextSha256 ?? null,
+        extractedTextParserSkill: item.extractedTextParserSkill ?? null,
+        extractedTextCharCount: item.extractedTextCharCount ?? null,
         timing: buildQueueTiming(item),
     };
 }
@@ -426,6 +433,10 @@ async function resolvePageVaultSource(db, config, env, page, rawData) {
       file_path AS filePath,
       content_hash AS contentHash,
       file_mtime AS fileMtime,
+      source_timestamp AS sourceTimestamp,
+      source_timestamp_source AS sourceTimestampSource,
+      source_timestamp_confidence AS sourceTimestampConfidence,
+      source_timestamp_candidates AS sourceTimestampCandidates,
       indexed_at AS indexedAt
     FROM vault_files
     WHERE id = ?
@@ -447,6 +458,9 @@ async function resolvePageVaultSource(db, config, env, page, rawData) {
         sourceType: row.sourceType,
         fileSize: row.fileSize,
         remotePath: row.filePath,
+        sourceTimestamp: row.sourceTimestamp ?? null,
+        sourceTimestampSource: row.sourceTimestampSource ?? null,
+        sourceTimestampConfidence: row.sourceTimestampConfidence ?? null,
         indexedAt: row.indexedAt,
         ...preview,
     };
@@ -918,6 +932,10 @@ export async function openDashboardPageSource(env = process.env, inputPageId, ta
       file_path AS filePath,
       content_hash AS contentHash,
       file_mtime AS fileMtime,
+      source_timestamp AS sourceTimestamp,
+      source_timestamp_source AS sourceTimestampSource,
+      source_timestamp_confidence AS sourceTimestampConfidence,
+      source_timestamp_candidates AS sourceTimestampCandidates,
       indexed_at AS indexedAt
     FROM vault_files
     WHERE id = ?
@@ -951,6 +969,10 @@ export function getDashboardVaultSummary(env = process.env) {
       file_path AS filePath,
       content_hash AS contentHash,
       file_mtime AS fileMtime,
+      source_timestamp AS sourceTimestamp,
+      source_timestamp_source AS sourceTimestampSource,
+      source_timestamp_confidence AS sourceTimestampConfidence,
+      source_timestamp_candidates AS sourceTimestampCandidates,
       indexed_at AS indexedAt
     FROM vault_files
     ORDER BY id
@@ -1023,6 +1045,10 @@ export function listDashboardVaultFiles(env = process.env, options = {}) {
       file_path AS filePath,
       content_hash AS contentHash,
       file_mtime AS fileMtime,
+      source_timestamp AS sourceTimestamp,
+      source_timestamp_source AS sourceTimestampSource,
+      source_timestamp_confidence AS sourceTimestampConfidence,
+      source_timestamp_candidates AS sourceTimestampCandidates,
       indexed_at AS indexedAt
     FROM vault_files
     ORDER BY id
@@ -1049,6 +1075,9 @@ export function listDashboardVaultFiles(env = process.env, options = {}) {
         sourceType: file.sourceType,
         fileSize: file.fileSize,
         filePath: file.filePath,
+        sourceTimestamp: file.sourceTimestamp ?? null,
+        sourceTimestampSource: file.sourceTimestampSource ?? null,
+        sourceTimestampConfidence: file.sourceTimestampConfidence ?? null,
         indexedAt: file.indexedAt,
         queueStatus: queueItem?.status ?? "not-queued",
         queueItem: queueItem ? buildQueueListItem(queueItem) : null,
@@ -1093,6 +1122,10 @@ export async function getDashboardVaultFileDetail(env = process.env, fileId) {
       file_path AS filePath,
       content_hash AS contentHash,
       file_mtime AS fileMtime,
+      source_timestamp AS sourceTimestamp,
+      source_timestamp_source AS sourceTimestampSource,
+      source_timestamp_confidence AS sourceTimestampConfidence,
+      source_timestamp_candidates AS sourceTimestampCandidates,
       indexed_at AS indexedAt
     FROM vault_files
     WHERE id = ?
@@ -1130,6 +1163,10 @@ export async function openDashboardVaultFile(env = process.env, fileId) {
       file_path AS filePath,
       content_hash AS contentHash,
       file_mtime AS fileMtime,
+      source_timestamp AS sourceTimestamp,
+      source_timestamp_source AS sourceTimestampSource,
+      source_timestamp_confidence AS sourceTimestampConfidence,
+      source_timestamp_candidates AS sourceTimestampCandidates,
       indexed_at AS indexedAt
     FROM vault_files
     WHERE id = ?
package/dist/operations/query.js
CHANGED
|
@@ -67,6 +67,21 @@ function normalizeOptionalString(value) {
|
|
|
67
67
|
const normalized = value.trim();
|
|
68
68
|
return normalized ? normalized : null;
|
|
69
69
|
}
|
|
70
|
+
function parseOptionalJsonArray(value) {
|
|
71
|
+
if (Array.isArray(value)) {
|
|
72
|
+
return value;
|
|
73
|
+
}
|
|
74
|
+
if (typeof value !== "string" || !value.trim()) {
|
|
75
|
+
return [];
|
|
76
|
+
}
|
|
77
|
+
try {
|
|
78
|
+
const parsed = JSON.parse(value);
|
|
79
|
+
return Array.isArray(parsed) ? parsed : [];
|
|
80
|
+
}
|
|
81
|
+
catch {
|
|
82
|
+
return [];
|
|
83
|
+
}
|
|
84
|
+
}
|
|
70
85
|
function isAbsoluteLikePath(value) {
|
|
71
86
|
return path.isAbsolute(value) || /^[A-Za-z]:[\\/]/.test(value);
|
|
72
87
|
}
|
@@ -469,7 +484,7 @@ export function listVaultFiles(env = process.env, options = {}) {
         clauses.push("file_ext = ?");
         params.push(String(options.ext).replace(/^\./, ""));
     }
-    return db
+    const rows = db
         .prepare(`
       SELECT
         id,
@@ -480,12 +495,20 @@ export function listVaultFiles(env = process.env, options = {}) {
         file_path AS filePath,
         content_hash AS contentHash,
         file_mtime AS fileMtime,
+        source_timestamp AS sourceTimestamp,
+        source_timestamp_source AS sourceTimestampSource,
+        source_timestamp_confidence AS sourceTimestampConfidence,
+        source_timestamp_candidates AS sourceTimestampCandidates,
         indexed_at AS indexedAt
       FROM vault_files
       ${clauses.length > 0 ? `WHERE ${clauses.join(" AND ")}` : ""}
       ORDER BY id
     `)
         .all(...params);
+    return rows.map((row) => ({
+        ...row,
+        sourceTimestampCandidates: parseOptionalJsonArray(row.sourceTimestampCandidates),
+    }));
     }
     finally {
         db.close();
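With this change, `listVaultFiles` rows carry `sourceTimestampCandidates` as a real array rather than raw JSON text. A usage sketch (the `ext` filter exists in the function above; the logging is hypothetical):

```js
const files = listVaultFiles(process.env, { ext: "pdf" });
for (const file of files) {
  // parseOptionalJsonArray already normalized NULL/malformed values to [],
  // so length checks are safe without further guards.
  if (file.sourceTimestampCandidates.length > 1) {
    console.log(`${file.filePath}: ${file.sourceTimestampCandidates.length} candidate source dates`);
  }
}
```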
package/package.json
CHANGED
package/references/cli-interface.md
CHANGED
@@ -293,9 +293,9 @@ tiangong-wiki vault diff [--since <date>] [--path <prefix>]
 tiangong-wiki vault queue [--status pending|processing|done|skipped|error]
 ```

-- `list` — List indexed vault files; `--path` does prefix matching on relative paths
+- `list` — List indexed vault files; `--path` does prefix matching on relative paths. JSON output includes source timestamp inference fields when available (`sourceTimestamp`, `sourceTimestampSource`, `sourceTimestampConfidence`, `sourceTimestampCandidates`).
 - `diff` — Show changes since the last sync (or since a given date with `--since`)
-- `queue` — Show processing queue status and item details
+- `queue` — Show processing queue status and item details, including extracted plain-text artifact metadata when a parser snapshot exists

 ### lint

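For orientation, a `vault list` JSON record carrying the new inference fields might look like the sketch below. Every value is invented; only the field names come from this change, and the inference fields are `null` (with an empty candidates array) when nothing could be inferred:

```json
{
  "id": 17,
  "filePath": "papers/2021-06 lca-survey.pdf",
  "sourceTimestamp": "2021-06-01T00:00:00.000Z",
  "sourceTimestampSource": "filename",
  "sourceTimestampConfidence": "medium",
  "sourceTimestampCandidates": ["2021-06-01T00:00:00.000Z"],
  "indexedAt": "2026-02-03T08:15:00.000Z"
}
```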
package/references/troubleshooting.md
CHANGED
@@ -91,13 +91,15 @@ The agent uses [Codex SDK](https://www.npmjs.com/package/@openai/codex-sdk) to p
 | `WIKI_AGENT_MODEL` | No | Model name (default: `gpt-5.5`; e.g. `Qwen/Qwen3.5-397B-A17B-GPTQ-Int4`) |
 | `WIKI_AGENT_BATCH_SIZE` | No | Max concurrent vault queue workers per cycle (default: `5`) |
 | `WIKI_AGENT_SANDBOX_MODE` | No | Codex sandbox mode: `danger-full-access` (default) or `workspace-write` |
-| `WIKI_PARSER_SKILLS` | No | Comma-separated parser skill list (e.g. `pdf,docx,pptx,xlsx,document-granular-decompose`). `document-granular-decompose` covers PDF, Office,
+| `WIKI_PARSER_SKILLS` | No | Comma-separated parser skill list (e.g. `pdf,docx,pptx,xlsx,document-granular-decompose`). `document-granular-decompose` covers PDF, Office, and common image formats; Markdown and other plain text-like files are read locally by the wiki workflow. Use the skill's own `SKILL.md` for the exact extension allowlist |
 | `UNSTRUCTURED_API_BASE_URL` | For `document-granular-decompose` | TianGong Unstructure API base URL |
 | `UNSTRUCTURED_AUTH_TOKEN` | For `document-granular-decompose` | Bearer token for TianGong Unstructure |
 | `UNSTRUCTURED_PROVIDER` | No | Optional provider override passed to TianGong Unstructure |
 | `UNSTRUCTURED_MODEL` | No | Optional model override passed to TianGong Unstructure |

-When `document-granular-decompose` is configured with `UNSTRUCTURED_API_BASE_URL` and `UNSTRUCTURED_AUTH_TOKEN`, the wiki agent prefers it before the type-specific `pdf`, `docx`, `pptx`, and `xlsx` skills for supported document/image files. Keep the type-specific skills configured only when you want them available as fallback tools.
+When `document-granular-decompose` is configured with `UNSTRUCTURED_API_BASE_URL` and `UNSTRUCTURED_AUTH_TOKEN`, the wiki agent prefers it before the type-specific `pdf`, `docx`, `pptx`, and `xlsx` skills for supported non-text document/image files. Keep the type-specific skills configured only when you want them available as fallback tools.
+
+For successful parser runs, the workflow keeps the exact plain-text extraction used by the agent at `.queue-artifacts/<file-artifact>/extracted-fulltext.txt`. `tiangong-wiki vault queue` exposes `extractedTextPath`, `extractedTextSha256`, `extractedTextParserSkill`, and `extractedTextCharCount` when a snapshot exists.

 `tiangong-wiki setup` now prompts for `WIKI_AGENT_SANDBOX_MODE` when automatic vault processing is enabled. The default is `danger-full-access`, and the setup wizard highlights that this mode grants full runtime access.

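A `.wiki.env` fragment wiring this up might look as follows; all values are placeholders, and only the variable names are taken from the table above:

```dotenv
WIKI_PARSER_SKILLS=pdf,docx,pptx,xlsx,document-granular-decompose
UNSTRUCTURED_API_BASE_URL=https://unstructure.example.com
UNSTRUCTURED_AUTH_TOKEN=replace-with-a-real-token
# Optional overrides; the provider/model names here are examples only.
# UNSTRUCTURED_PROVIDER=some-provider
# UNSTRUCTURED_MODEL=some-model
```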
package/references/vault-to-wiki-instruction.md
CHANGED
@@ -28,9 +28,9 @@ Parser skills are installed under <workspace-root>/.agents/skills/. Do not ass
 | `docx` | Extract text and structure from DOCX files |
 | `pptx` | Extract text, slide structure, and speaker notes from PPTX files |
 | `xlsx` | Extract tables and data from XLSX/CSV files |
-| `document-granular-decompose` | Extract fulltext from PDF, Office documents,
+| `document-granular-decompose` | Extract fulltext from PDF, Office documents, and common image formats through TianGong Unstructure |

-When `document-granular-decompose` is available, `WIKI_PARSER_SKILLS` includes it, and `UNSTRUCTURED_API_BASE_URL` plus `UNSTRUCTURED_AUTH_TOKEN` are set, prefer it for supported document/image formats before the type-specific parser skills below. This includes PDF, Word, PowerPoint, Excel,
+When `document-granular-decompose` is available, `WIKI_PARSER_SKILLS` includes it, and `UNSTRUCTURED_API_BASE_URL` plus `UNSTRUCTURED_AUTH_TOKEN` are set, prefer it for supported non-text document/image formats before the type-specific parser skills below. This includes PDF, Word, PowerPoint, Excel, and common image formats; read the skill's own `SKILL.md` for the exact extension allowlist. The client should request JSON with `return_txt=true`, then use the plain text from `response.txt` / `txt` as the wiki agent's primary input. Write that same plain text to `EXTRACTED_TEXT_PATH` so the queue artifact retains the exact text snapshot used for analysis. Keep JSON chunks and page numbers only for debugging or provenance evidence.

 When any other parser skill is available and the vault file matches its type, use the skill. Read the skill's SKILL.md for interface details before invoking.

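A rough sketch of the request flow that instruction describes (not the skill's actual implementation; the endpoint path and form field names are assumptions, and only `return_txt`, the `txt` response field, and `EXTRACTED_TEXT_PATH` appear in the text above):

```js
// Assumes Node 18+ globals: fetch, FormData, Blob.
import { writeFile } from "node:fs/promises";

async function extractFulltext(fileBuffer, fileName, env) {
  const form = new FormData();
  form.append("file", new Blob([fileBuffer]), fileName); // field name assumed
  form.append("return_txt", "true");
  // "/parse" is a placeholder path, not a documented endpoint.
  const response = await fetch(`${env.UNSTRUCTURED_API_BASE_URL}/parse`, {
    method: "POST",
    headers: { Authorization: `Bearer ${env.UNSTRUCTURED_AUTH_TOKEN}` },
    body: form,
  });
  const result = await response.json();
  const text = result.txt; // plain text becomes the wiki agent's primary input
  // Persist the exact snapshot used for analysis into the queue artifact;
  // EXTRACTED_TEXT_PATH arriving via env is an assumption.
  await writeFile(env.EXTRACTED_TEXT_PATH, text, "utf8");
  return text;
}
```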
@@ -39,7 +39,7 @@ If a parser skill fails due to missing runtime dependencies, attempt to install
 ### File Type Strategies

 **Markdown / Plain Text (md, txt)**
-Read directly. For large files (>5000 lines), read in sections. Parse YAML frontmatter separately if present.
+Read directly. Do not send Markdown or other plain text-like files to parser skills or remote unstructure APIs. For large files (>5000 lines), read in sections. Parse YAML frontmatter separately if present.

 **PDF**
 Prefer the `pdf` parser skill. Without it: attempt direct read; if unreadable, skip. Use PDF metadata (title, author, date, subject) to inform decisions.
@@ -88,6 +88,7 @@ Use vision to understand each image in context. Extract only high-value images v
 6. `sourceRefs` may only contain existing wiki page ids. Raw file provenance belongs in the page body or a field like `vaultPath`.
 7. Only write frontmatter fields declared by the chosen type (`tiangong-wiki type show <type>`). Do not invent ad-hoc fields.
 8. If the type system cannot represent the knowledge cleanly, prefer `propose_only` unless template evolution is explicitly allowed.
+9. queue-item.json may include a source timestamp inferred from file name, path, or mtime. Use it only as source-date evidence; do not copy it blindly into page `createdAt` / `updatedAt`, which are system-normalized.

 ### Runtime Discovery

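To make the new constraint 9 concrete, a queue-item.json carrying inferred-timestamp evidence might look like the following; the structure and values are illustrative, with field names mirroring the database columns added in this release:

```json
{
  "filePath": "reports/2019-03 annual-report.pdf",
  "sourceTimestamp": "2019-03-01T00:00:00.000Z",
  "sourceTimestampSource": "filename",
  "sourceTimestampConfidence": "medium",
  "sourceTimestampCandidates": [
    "2019-03-01T00:00:00.000Z",
    "2019-06-30T00:00:00.000Z"
  ]
}
```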
@@ -167,5 +168,8 @@ The workflow must write a valid `result.json` manifest with these fields:
 - `proposedTypes`
 - `actions`
 - `lint`
+- `extractedText` when `EXTRACTED_TEXT_PATH` contains a plain-text extraction

 The service layer trusts this manifest, not free-form prose.
+
+`extractedText` must be metadata only, not the full text body. Use `{ "path": EXTRACTED_TEXT_PATH, "parserSkill": "<skill-name>", "sha256": "<sha256>", "charCount": <characters> }`.
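A minimal sketch of assembling that entry, using only Node's standard library; the helper name is hypothetical, while the four keys and the metadata-only rule come from the spec above:

```js
import { createHash } from "node:crypto";
import { readFile } from "node:fs/promises";

// Builds the extractedText manifest entry: metadata about the snapshot,
// never the text body itself.
async function buildExtractedTextEntry(extractedTextPath, parserSkill) {
  const text = await readFile(extractedTextPath, "utf8");
  return {
    path: extractedTextPath,
    parserSkill,
    sha256: createHash("sha256").update(text, "utf8").digest("hex"),
    charCount: text.length, // characters, not bytes
  };
}
```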