@biaoo/tiangong-wiki 0.3.11 → 0.3.12

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -117,7 +117,7 @@ tiangong-wiki dashboard # open dashboard in browse
 
 > Environment variables are managed via `.wiki.env` (created by `tiangong-wiki setup`). The CLI prefers the nearest local `.wiki.env`, then falls back to the global default workspace config. See [references/troubleshooting.md](./references/troubleshooting.md) for the full reference. For a centralized Linux + `systemd` + Nginx deployment, see [references/centralized-service-deployment.md](./references/centralized-service-deployment.md). That deployment guide now also includes Git repository / GitHub remote setup for daemon-side commit and optional auto-push.
 
-For document-heavy vaults, set `WIKI_PARSER_SKILLS=document-granular-decompose` and configure `UNSTRUCTURED_API_BASE_URL` plus `UNSTRUCTURED_AUTH_TOKEN` to prefer the TianGong Unstructure parser. `document-granular-decompose` is a broad document/image parser for PDF, Word, PowerPoint, Excel, Markdown, and common image formats; the skill's `SKILL.md` remains the source of truth for the exact extension allowlist. The wiki workflow uses `return_txt=true`, consumes the plain `txt` text as the agent input, and stores the extracted plain-text snapshot under `.queue-artifacts/<file-artifact>/extracted-fulltext.txt` with metadata in the queue result; `UNSTRUCTURED_PROVIDER` and `UNSTRUCTURED_MODEL` are optional overrides.
+For document-heavy vaults, set `WIKI_PARSER_SKILLS=document-granular-decompose` and configure `UNSTRUCTURED_API_BASE_URL` plus `UNSTRUCTURED_AUTH_TOKEN` to prefer the TianGong Unstructure parser for supported non-text documents and images. `document-granular-decompose` is a broad document/image parser for PDF, Word, PowerPoint, Excel, and common image formats; Markdown and other plain text-like files are read locally by the wiki workflow. The skill's `SKILL.md` remains the source of truth for the exact extension allowlist. The wiki workflow uses `return_txt=true`, consumes the plain `txt` text as the agent input, and stores the extracted plain-text snapshot under `.queue-artifacts/<file-artifact>/extracted-fulltext.txt` with metadata in the queue result; `UNSTRUCTURED_PROVIDER` and `UNSTRUCTURED_MODEL` are optional overrides.
 
 ## MCP Server
 
package/README.zh-CN.md CHANGED
@@ -117,7 +117,7 @@ tiangong-wiki dashboard # open dashboard in browser
 
 > Environment variables are managed via `.wiki.env` (created by `tiangong-wiki setup`). The CLI prefers the nearest local `.wiki.env` and falls back to the global default workspace config when none is found. See [references/troubleshooting.md](./references/troubleshooting.md) for the full reference. For a centralized service deployment (Linux + `systemd` + Nginx), see [references/centralized-service-deployment.md](./references/centralized-service-deployment.md). That deployment doc now also covers Git repository initialization, GitHub remote configuration, and daemon auto-push setup.
 
-If the vault is mostly document-parsing work, set `WIKI_PARSER_SKILLS=document-granular-decompose` and configure `UNSTRUCTURED_API_BASE_URL` and `UNSTRUCTURED_AUTH_TOKEN` so the workflow prefers the TianGong Unstructure parser. `document-granular-decompose` is a broader-coverage document/image parser for PDF, Word, PowerPoint, Excel, Markdown, and common image formats; the skill's own `SKILL.md` is authoritative for the exact extension allowlist. The wiki workflow uses `return_txt=true`, takes the returned plain `txt` text as the agent's main input, saves the extracted plain-text snapshot to `.queue-artifacts/<file-artifact>/extracted-fulltext.txt`, and keeps the metadata in the queue result; `UNSTRUCTURED_PROVIDER` and `UNSTRUCTURED_MODEL` are optional overrides only.
+If the vault is mostly document-parsing work, set `WIKI_PARSER_SKILLS=document-granular-decompose` and configure `UNSTRUCTURED_API_BASE_URL` and `UNSTRUCTURED_AUTH_TOKEN` so the workflow prefers the TianGong Unstructure parser for supported non-plain-text documents and images. `document-granular-decompose` is a broader-coverage document/image parser for PDF, Word, PowerPoint, Excel, and common image formats; Markdown and other plain text-like files are read directly by the wiki workflow locally. The skill's own `SKILL.md` is authoritative for the exact extension allowlist. The wiki workflow uses `return_txt=true`, takes the returned plain `txt` text as the agent's main input, saves the extracted plain-text snapshot to `.queue-artifacts/<file-artifact>/extracted-fulltext.txt`, and keeps the metadata in the queue result; `UNSTRUCTURED_PROVIDER` and `UNSTRUCTURED_MODEL` are optional overrides only.
 
 ## MCP Server
 
package/dist/core/db.js CHANGED
@@ -76,9 +76,15 @@ function ensureBaseTables(db, embeddingDimensions) {
   file_path TEXT NOT NULL,
   content_hash TEXT,
   file_mtime REAL,
+  source_timestamp TEXT,
+  source_timestamp_source TEXT,
+  source_timestamp_confidence REAL,
+  source_timestamp_candidates TEXT,
   indexed_at TEXT
 );
 
+CREATE INDEX IF NOT EXISTS idx_vfiles_source_timestamp ON vault_files(source_timestamp);
+
 CREATE TABLE IF NOT EXISTS vault_changelog (
   id INTEGER PRIMARY KEY AUTOINCREMENT,
   file_id TEXT NOT NULL,
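
Sketch of a query the new index enables (not from the package; assumes a synchronous SQLite handle with the better-sqlite3-style `prepare`/`all` API the surrounding dist code appears to use):

```js
// Sketch only: list the 20 most recently dated vault files. The filter and
// ordering on source_timestamp can be served by idx_vfiles_source_timestamp.
const recent = db
  .prepare(`
    SELECT id, file_path AS filePath, source_timestamp AS sourceTimestamp
    FROM vault_files
    WHERE source_timestamp IS NOT NULL
    ORDER BY source_timestamp DESC
    LIMIT 20
  `)
  .all();
```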
@@ -164,6 +170,13 @@ function ensureBaseTables(db, embeddingDimensions) {
     proposed_type_names: "TEXT",
     skills_used: "TEXT",
   });
+  ensureTableColumns(db, "vault_files", {
+    source_timestamp: "TEXT",
+    source_timestamp_source: "TEXT",
+    source_timestamp_confidence: "REAL",
+    source_timestamp_candidates: "TEXT",
+  });
+  db.exec("CREATE INDEX IF NOT EXISTS idx_vfiles_source_timestamp ON vault_files(source_timestamp)");
   if (!tableExists(db, "vec_pages")) {
     db.exec(`
       CREATE VIRTUAL TABLE vec_pages USING vec0(
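
The body of `ensureTableColumns` is not part of this diff; a minimal sketch of what such an idempotent column migration typically looks like (an assumption for illustration, not the package's actual implementation):

```js
// Hypothetical body for an ensureTableColumns-style helper: adds columns that
// are missing from an existing table so the migration can run repeatedly.
function ensureTableColumns(db, table, columns) {
  const existing = new Set(db.prepare(`PRAGMA table_info(${table})`).all().map((column) => column.name));
  for (const [name, type] of Object.entries(columns)) {
    if (!existing.has(name)) {
      db.exec(`ALTER TABLE ${table} ADD COLUMN ${name} ${type}`);
    }
  }
}
```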
@@ -61,6 +61,38 @@ function parseOptionalStringArray(value) {
     return [];
   }
 }
+function parseSourceTimestampCandidates(value) {
+  if (Array.isArray(value)) {
+    return value;
+  }
+  if (typeof value !== "string" || !value.trim()) {
+    return [];
+  }
+  try {
+    const parsed = JSON.parse(value);
+    return Array.isArray(parsed) ? parsed : [];
+  }
+  catch {
+    return [];
+  }
+}
+function mapVaultFileRow(row) {
+  return {
+    id: String(row.id),
+    fileName: String(row.fileName),
+    fileExt: typeof row.fileExt === "string" ? row.fileExt : null,
+    sourceType: typeof row.sourceType === "string" ? row.sourceType : null,
+    fileSize: Number(row.fileSize ?? 0),
+    filePath: String(row.filePath),
+    contentHash: typeof row.contentHash === "string" ? row.contentHash : null,
+    fileMtime: typeof row.fileMtime === "number" ? row.fileMtime : null,
+    sourceTimestamp: typeof row.sourceTimestamp === "string" ? row.sourceTimestamp : null,
+    sourceTimestampSource: typeof row.sourceTimestampSource === "string" ? row.sourceTimestampSource : null,
+    sourceTimestampConfidence: typeof row.sourceTimestampConfidence === "number" ? row.sourceTimestampConfidence : null,
+    sourceTimestampCandidates: parseSourceTimestampCandidates(row.sourceTimestampCandidates),
+    indexedAt: String(row.indexedAt),
+  };
+}
 function mapQueueRow(row) {
   const attempts = Number(row.attempts ?? 0);
   const status = row.status;
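
The parse helper's contract, shown with illustrative values (the `+08:00` offset is hypothetical; `toOffsetIso` emits whatever offset the indexing host uses):

```js
// Illustrative round-trip: candidates persist as JSON text in a TEXT column
// and come back as an array; null, non-JSON, or non-array input yields [].
const stored = JSON.stringify([
  { timestamp: "2024-03-15T00:00:00+08:00", source: "file_name", confidence: 0.9, precision: "date", raw: "2024-03-15" },
]);
parseSourceTimestampCandidates(stored);     // => [{ timestamp: "2024-03-15T00:00:00+08:00", ... }]
parseSourceTimestampCandidates(null);       // => []
parseSourceTimestampCandidates("not json"); // => []
```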
@@ -99,6 +131,9 @@ function mapQueueRow(row) {
     sourceType: typeof row.sourceType === "string" ? row.sourceType : null,
     fileSize: typeof row.fileSize === "number" ? row.fileSize : undefined,
     filePath: typeof row.filePath === "string" ? row.filePath : undefined,
+    sourceTimestamp: typeof row.sourceTimestamp === "string" ? row.sourceTimestamp : null,
+    sourceTimestampSource: typeof row.sourceTimestampSource === "string" ? row.sourceTimestampSource : null,
+    sourceTimestampConfidence: typeof row.sourceTimestampConfidence === "number" ? row.sourceTimestampConfidence : null,
   };
 }
 function claimQueueItems(db, limit, options) {
@@ -146,6 +181,9 @@ function claimQueueItems(db, limit, options) {
       vault_files.source_type AS sourceType,
       vault_files.file_size AS fileSize,
       vault_files.file_path AS filePath,
+      vault_files.source_timestamp AS sourceTimestamp,
+      vault_files.source_timestamp_source AS sourceTimestampSource,
+      vault_files.source_timestamp_confidence AS sourceTimestampConfidence,
       vault_extractions.artifact_path AS extractedTextPath,
       vault_extractions.artifact_sha256 AS extractedTextSha256,
       vault_extractions.parser_skill AS extractedTextParserSkill,
@@ -242,6 +280,9 @@ function fetchQueueItemsByStatus(db, status) {
       vault_files.source_type AS sourceType,
       vault_files.file_size AS fileSize,
       vault_files.file_path AS filePath,
+      vault_files.source_timestamp AS sourceTimestamp,
+      vault_files.source_timestamp_source AS sourceTimestampSource,
+      vault_files.source_timestamp_confidence AS sourceTimestampConfidence,
       vault_extractions.artifact_path AS extractedTextPath,
       vault_extractions.artifact_sha256 AS extractedTextSha256,
       vault_extractions.parser_skill AS extractedTextParserSkill,
@@ -288,6 +329,9 @@ function fetchQueueItemByFileId(db, fileId) {
       vault_files.source_type AS sourceType,
       vault_files.file_size AS fileSize,
       vault_files.file_path AS filePath,
+      vault_files.source_timestamp AS sourceTimestamp,
+      vault_files.source_timestamp_source AS sourceTimestampSource,
+      vault_files.source_timestamp_confidence AS sourceTimestampConfidence,
       vault_extractions.artifact_path AS extractedTextPath,
       vault_extractions.artifact_sha256 AS extractedTextSha256,
       vault_extractions.parser_skill AS extractedTextParserSkill,
@@ -312,11 +356,15 @@ function fetchVaultFile(db, fileId) {
       file_path AS filePath,
       content_hash AS contentHash,
       file_mtime AS fileMtime,
+      source_timestamp AS sourceTimestamp,
+      source_timestamp_source AS sourceTimestampSource,
+      source_timestamp_confidence AS sourceTimestampConfidence,
+      source_timestamp_candidates AS sourceTimestampCandidates,
       indexed_at AS indexedAt
     FROM vault_files
     WHERE id = ?
   `).get(fileId);
-  return row ?? null;
+  return row ? mapVaultFileRow(row) : null;
 }
 function buildProcessingOwnerId() {
   return `${os.hostname()}:${process.pid}:${Date.now()}:${randomUUID().slice(0, 8)}`;
@@ -386,6 +434,9 @@ function fetchStaleProcessingQueueItems(db, processingOwnerId) {
       vault_files.source_type AS sourceType,
       vault_files.file_size AS fileSize,
       vault_files.file_path AS filePath,
+      vault_files.source_timestamp AS sourceTimestamp,
+      vault_files.source_timestamp_source AS sourceTimestampSource,
+      vault_files.source_timestamp_confidence AS sourceTimestampConfidence,
       vault_extractions.artifact_path AS extractedTextPath,
       vault_extractions.artifact_sha256 AS extractedTextSha256,
       vault_extractions.parser_skill AS extractedTextParserSkill,
@@ -19,6 +19,164 @@ function normalizeVaultFileExtension(filePath) {
   const fileExt = path.extname(filePath).replace(/^\./, "").toLowerCase();
   return fileExt || null;
 }
+function normalizeFileMtimeMs(fileMtime) {
+  if (typeof fileMtime !== "number" || !Number.isFinite(fileMtime) || fileMtime <= 0) {
+    return null;
+  }
+  return fileMtime < 1_000_000_000_000 ? fileMtime * 1000 : fileMtime;
+}
+function isValidDateParts(year, month, day, hour, minute, second) {
+  if (year < 1900 || year > 2099 || month < 1 || month > 12 || day < 1 || day > 31) {
+    return false;
+  }
+  if (hour < 0 || hour > 23 || minute < 0 || minute > 59 || second < 0 || second > 59) {
+    return false;
+  }
+  const date = new Date(year, month - 1, day, hour, minute, second);
+  return (date.getFullYear() === year &&
+    date.getMonth() === month - 1 &&
+    date.getDate() === day &&
+    date.getHours() === hour &&
+    date.getMinutes() === minute &&
+    date.getSeconds() === second);
+}
+function buildTimestampCandidate(input) {
+  const year = Number.parseInt(input.year, 10);
+  const month = Number.parseInt(input.month, 10);
+  const day = Number.parseInt(input.day, 10);
+  const hour = input.hour ? Number.parseInt(input.hour, 10) : 0;
+  const minute = input.minute ? Number.parseInt(input.minute, 10) : 0;
+  const second = input.second ? Number.parseInt(input.second, 10) : 0;
+  if (!isValidDateParts(year, month, day, hour, minute, second)) {
+    return null;
+  }
+  const precision = input.hour && input.minute ? "datetime" : "date";
+  const baseConfidence = input.source === "file_name" ? 0.9 : 0.8;
+  return {
+    timestamp: toOffsetIso(new Date(year, month - 1, day, hour, minute, second)),
+    source: input.source,
+    confidence: precision === "datetime" ? baseConfidence + 0.05 : baseConfidence,
+    precision,
+    raw: input.raw,
+  };
+}
+function collectDateCandidatesFromText(text, source) {
+  const candidates = [];
+  const separated = /(^|[^0-9])((?:19|20)\d{2})[-_.年/]([01]?\d)[-_.月/]([0-3]?\d)(?:[日\sT_@-]+([0-2]?\d)[::]?([0-5]\d)(?:[::]?([0-5]\d))?)?(?=$|[^0-9])/g;
+  const compact = /(^|[^0-9])((?:19|20)\d{2})([01]\d)([0-3]\d)(?:[T_\s@-]?([0-2]\d)([0-5]\d)([0-5]\d)?)?(?=$|[^0-9])/g;
+  for (const match of text.matchAll(separated)) {
+    const raw = match[0].slice(match[1].length);
+    const candidate = buildTimestampCandidate({
+      year: match[2],
+      month: match[3],
+      day: match[4],
+      hour: match[5],
+      minute: match[6],
+      second: match[7],
+      raw,
+      source,
+    });
+    if (candidate) {
+      candidates.push(candidate);
+    }
+  }
+  for (const match of text.matchAll(compact)) {
+    const raw = match[0].slice(match[1].length);
+    const candidate = buildTimestampCandidate({
+      year: match[2],
+      month: match[3],
+      day: match[4],
+      hour: match[5],
+      minute: match[6],
+      second: match[7],
+      raw,
+      source,
+    });
+    if (candidate) {
+      candidates.push(candidate);
+    }
+  }
+  return candidates;
+}
+function dedupeTimestampCandidates(candidates) {
+  const seen = new Set();
+  const result = [];
+  for (const candidate of candidates) {
+    const key = `${candidate.timestamp}:${candidate.source}:${candidate.raw}`;
+    if (seen.has(key)) {
+      continue;
+    }
+    seen.add(key);
+    result.push(candidate);
+  }
+  return result;
+}
+function inferVaultSourceTimestamp(input) {
+  const fileNameWithoutExt = input.fileName.replace(/\.[^.]+$/, "");
+  const directory = path.posix.dirname(input.id);
+  const pathText = directory === "." ? "" : directory;
+  const candidates = dedupeTimestampCandidates([
+    ...collectDateCandidatesFromText(fileNameWithoutExt, "file_name"),
+    ...collectDateCandidatesFromText(pathText, "path"),
+  ]);
+  const mtimeMs = normalizeFileMtimeMs(input.fileMtime);
+  if (mtimeMs !== null) {
+    candidates.push({
+      timestamp: toOffsetIso(new Date(mtimeMs)),
+      source: "file_mtime",
+      confidence: 0.5,
+      precision: "datetime",
+      raw: String(input.fileMtime),
+    });
+  }
+  const preferred = candidates
+    .slice()
+    .sort((left, right) => right.confidence - left.confidence || left.timestamp.localeCompare(right.timestamp))[0];
+  return {
+    sourceTimestamp: preferred?.timestamp ?? null,
+    sourceTimestampSource: preferred?.source ?? null,
+    sourceTimestampConfidence: preferred?.confidence ?? null,
+    sourceTimestampCandidates: candidates,
+  };
+}
+function serializeSourceTimestampCandidates(candidates) {
+  if (!candidates || candidates.length === 0) {
+    return null;
+  }
+  return JSON.stringify(candidates);
+}
+function parseSourceTimestampCandidates(value) {
+  if (Array.isArray(value)) {
+    return value;
+  }
+  if (typeof value !== "string" || value.trim().length === 0) {
+    return [];
+  }
+  try {
+    const parsed = JSON.parse(value);
+    return Array.isArray(parsed) ? parsed : [];
+  }
+  catch {
+    return [];
+  }
+}
+function mapVaultFileRow(row) {
+  return {
+    id: String(row.id),
+    fileName: String(row.fileName),
+    fileExt: typeof row.fileExt === "string" ? row.fileExt : null,
+    sourceType: typeof row.sourceType === "string" ? row.sourceType : null,
+    fileSize: Number(row.fileSize ?? 0),
+    filePath: String(row.filePath),
+    contentHash: typeof row.contentHash === "string" ? row.contentHash : null,
+    fileMtime: typeof row.fileMtime === "number" ? row.fileMtime : null,
+    sourceTimestamp: typeof row.sourceTimestamp === "string" ? row.sourceTimestamp : null,
+    sourceTimestampSource: typeof row.sourceTimestampSource === "string" ? row.sourceTimestampSource : null,
+    sourceTimestampConfidence: typeof row.sourceTimestampConfidence === "number" ? row.sourceTimestampConfidence : null,
+    sourceTimestampCandidates: parseSourceTimestampCandidates(row.sourceTimestampCandidates),
+    indexedAt: String(row.indexedAt),
+  };
+}
 function createAllowedVaultFileTypeSet(vaultFileTypes) {
   return new Set(vaultFileTypes.map((item) => item.trim().replace(/^\./, "").toLowerCase()).filter(Boolean));
 }
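
A usage sketch of the inference added above, with illustrative inputs (the offset in the output depends on the timezone of the host that runs the scan):

```js
// A date found in the file name (confidence 0.9) outranks file mtime (0.5),
// so it wins the preferred slot; both remain listed as candidates.
const inferred = inferVaultSourceTimestamp({
  id: "reports/2024-03-15 quarterly-review.pdf",
  fileName: "2024-03-15 quarterly-review.pdf",
  fileMtime: 1718000000000, // ms epoch; seconds-epoch inputs are scaled by 1000
});
// inferred.sourceTimestamp           -> "2024-03-15T00:00:00+08:00" (offset varies)
// inferred.sourceTimestampSource     -> "file_name"
// inferred.sourceTimestampConfidence -> 0.9
// inferred.sourceTimestampCandidates -> 2 entries: file_name date + file_mtime
```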
@@ -42,6 +200,11 @@ function localVaultFiles(vaultPath, hashMode, vaultFileTypes) {
       filePath,
       contentHash: computeVaultHash(hashMode, id, filePath, stats.size, stats.mtimeMs),
       fileMtime: stats.mtimeMs,
+      ...inferVaultSourceTimestamp({
+        id,
+        fileName: path.basename(filePath),
+        fileMtime: stats.mtimeMs,
+      }),
       indexedAt,
     };
   });
@@ -162,6 +325,11 @@ async function scanSynologyFolder(client, remoteRoot, currentFolder, results, al
       filePath,
       contentHash: sha256Text(`${relativeId}:${filePath}:${fileSize}:${fileMtime}`),
       fileMtime,
+      ...inferVaultSourceTimestamp({
+        id: relativeId,
+        fileName: item.name ?? path.basename(filePath),
+        fileMtime,
+      }),
       indexedAt,
     });
   }
@@ -198,10 +366,17 @@ function getExistingVaultFiles(db) {
       file_path AS filePath,
       content_hash AS contentHash,
       file_mtime AS fileMtime,
+      source_timestamp AS sourceTimestamp,
+      source_timestamp_source AS sourceTimestampSource,
+      source_timestamp_confidence AS sourceTimestampConfidence,
+      source_timestamp_candidates AS sourceTimestampCandidates,
       indexed_at AS indexedAt
     FROM vault_files
   `).all();
-  return new Map(rows.map((row) => [row.id, row]));
+  return new Map(rows.map((row) => {
+    const file = mapVaultFileRow(row);
+    return [file.id, file];
+  }));
 }
 export function getVaultQueuePriority(fileExt) {
   const normalized = (fileExt ?? "").toLowerCase();
@@ -406,9 +581,13 @@ export function syncVaultIndex(db, currentFiles, syncId) {
   }
   const upsertStatement = db.prepare(`
     INSERT INTO vault_files(
-      id, file_name, file_ext, source_type, file_size, file_path, content_hash, file_mtime, indexed_at
+      id, file_name, file_ext, source_type, file_size, file_path, content_hash, file_mtime,
+      source_timestamp, source_timestamp_source, source_timestamp_confidence, source_timestamp_candidates,
+      indexed_at
     ) VALUES (
-      @id, @file_name, @file_ext, @source_type, @file_size, @file_path, @content_hash, @file_mtime, @indexed_at
+      @id, @file_name, @file_ext, @source_type, @file_size, @file_path, @content_hash, @file_mtime,
+      @source_timestamp, @source_timestamp_source, @source_timestamp_confidence, @source_timestamp_candidates,
+      @indexed_at
     )
     ON CONFLICT(id) DO UPDATE SET
       file_name = excluded.file_name,
@@ -418,6 +597,10 @@ export function syncVaultIndex(db, currentFiles, syncId) {
       file_path = excluded.file_path,
       content_hash = excluded.content_hash,
       file_mtime = excluded.file_mtime,
+      source_timestamp = excluded.source_timestamp,
+      source_timestamp_source = excluded.source_timestamp_source,
+      source_timestamp_confidence = excluded.source_timestamp_confidence,
+      source_timestamp_candidates = excluded.source_timestamp_candidates,
       indexed_at = excluded.indexed_at
   `);
   const insertChange = db.prepare(`
@@ -573,6 +756,10 @@ export function syncVaultIndex(db, currentFiles, syncId) {
       file_path: file.filePath,
       content_hash: file.contentHash,
       file_mtime: file.fileMtime,
+      source_timestamp: file.sourceTimestamp ?? null,
+      source_timestamp_source: file.sourceTimestampSource ?? null,
+      source_timestamp_confidence: file.sourceTimestampConfidence ?? null,
+      source_timestamp_candidates: serializeSourceTimestampCandidates(file.sourceTimestampCandidates),
       indexed_at: file.indexedAt,
     });
   }
@@ -85,9 +85,11 @@ export function buildVaultWorkflowPrompt(input) {
     "",
     "1. Read queue-item.json next to RESULT_JSON_PATH.",
     "2. Read the target vault file at VAULT_FILE_PATH. Refer to `references/vault-to-wiki-instruction.md` (Phase 1) in the wiki package for file-type-specific reading strategies, parser skill discovery, image handling, and metadata utilization.",
-    "   - If `WIKI_PARSER_SKILLS` includes `document-granular-decompose` and `UNSTRUCTURED_API_BASE_URL` plus `UNSTRUCTURED_AUTH_TOKEN` are set, prefer that skill for supported document/image files before the legacy type-specific parser skills. This includes PDF, Word, PowerPoint, Excel, Markdown, and common image formats; use the skill's `SKILL.md` for the exact extension allowlist.",
+    "   - Plain text-like files (`.txt`, `.md`, `.markdown`, `.json`, `.csv`, `.tsv`, `.yaml`, `.yml`) must be read directly from VAULT_FILE_PATH. Do not send them to parser skills or remote unstructure APIs; write the direct plain-text snapshot to EXTRACTED_TEXT_PATH when it becomes the analysis input.",
+    "   - If `WIKI_PARSER_SKILLS` includes `document-granular-decompose` and `UNSTRUCTURED_API_BASE_URL` plus `UNSTRUCTURED_AUTH_TOKEN` are set, prefer that skill for supported non-text document/image files before the legacy type-specific parser skills. This includes PDF, Word, PowerPoint, Excel, and common image formats; use the skill's `SKILL.md` for the exact extension allowlist.",
     "   - When using `document-granular-decompose`, request `return_txt=true`, treat the pure text extracted from `response.txt`/`txt` as the main input, and keep raw JSON only for debugging or page-number evidence.",
     "   - If you extract plain text through any parser skill, write that canonical plain text snapshot to EXTRACTED_TEXT_PATH. For `document-granular-decompose`, write the same pure text from `response.txt`/`txt`. Leave EXTRACTED_TEXT_PATH empty only when no extractable text exists.",
+    "   - queue-item.json may include `file.sourceTimestamp`, `file.sourceTimestampSource`, `file.sourceTimestampConfidence`, and timestamp candidates inferred from the source file name, path, or mtime. Use this as evidence about source recency, but do not blindly copy it into page `createdAt` or `updatedAt`; those fields are normalized by the wiki system.",
     "3. Discover the current page type ontology via `tiangong-wiki type list` and `tiangong-wiki type show <type>`. Do not assume any type, template, or default target type.",
     "4. Search the existing wiki for overlapping or related content:",
     "   - Use `tiangong-wiki fts` and `tiangong-wiki search` with key terms from the source.",
@@ -287,6 +287,9 @@ function buildQueueListItem(item) {
     sourceType: item.sourceType ?? null,
     fileSize: item.fileSize ?? null,
     filePath: item.filePath ?? null,
+    sourceTimestamp: item.sourceTimestamp ?? null,
+    sourceTimestampSource: item.sourceTimestampSource ?? null,
+    sourceTimestampConfidence: item.sourceTimestampConfidence ?? null,
     createdPageIds: item.createdPageIds ?? [],
     updatedPageIds: item.updatedPageIds ?? [],
     appliedTypeNames: item.appliedTypeNames ?? [],
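
An illustrative fragment of one queue list item with the fields added above (all values hypothetical):

```js
// Hypothetical queue list item fragment; the timestamp fields are evidence
// about source recency, not values to copy into page createdAt/updatedAt.
const queueListItem = {
  filePath: "/vault/reports/2024-03-15 quarterly-review.pdf",
  sourceTimestamp: "2024-03-15T00:00:00+08:00",
  sourceTimestampSource: "file_name", // "file_name" | "path" | "file_mtime"
  sourceTimestampConfidence: 0.9,
};
```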
@@ -430,6 +433,10 @@ async function resolvePageVaultSource(db, config, env, page, rawData) {
       file_path AS filePath,
       content_hash AS contentHash,
       file_mtime AS fileMtime,
+      source_timestamp AS sourceTimestamp,
+      source_timestamp_source AS sourceTimestampSource,
+      source_timestamp_confidence AS sourceTimestampConfidence,
+      source_timestamp_candidates AS sourceTimestampCandidates,
       indexed_at AS indexedAt
     FROM vault_files
     WHERE id = ?
@@ -451,6 +458,9 @@ async function resolvePageVaultSource(db, config, env, page, rawData) {
     sourceType: row.sourceType,
     fileSize: row.fileSize,
     remotePath: row.filePath,
+    sourceTimestamp: row.sourceTimestamp ?? null,
+    sourceTimestampSource: row.sourceTimestampSource ?? null,
+    sourceTimestampConfidence: row.sourceTimestampConfidence ?? null,
     indexedAt: row.indexedAt,
     ...preview,
   };
@@ -922,6 +932,10 @@ export async function openDashboardPageSource(env = process.env, inputPageId, ta
       file_path AS filePath,
       content_hash AS contentHash,
       file_mtime AS fileMtime,
+      source_timestamp AS sourceTimestamp,
+      source_timestamp_source AS sourceTimestampSource,
+      source_timestamp_confidence AS sourceTimestampConfidence,
+      source_timestamp_candidates AS sourceTimestampCandidates,
       indexed_at AS indexedAt
     FROM vault_files
     WHERE id = ?
@@ -955,6 +969,10 @@ export function getDashboardVaultSummary(env = process.env) {
       file_path AS filePath,
       content_hash AS contentHash,
       file_mtime AS fileMtime,
+      source_timestamp AS sourceTimestamp,
+      source_timestamp_source AS sourceTimestampSource,
+      source_timestamp_confidence AS sourceTimestampConfidence,
+      source_timestamp_candidates AS sourceTimestampCandidates,
       indexed_at AS indexedAt
     FROM vault_files
     ORDER BY id
@@ -1027,6 +1045,10 @@ export function listDashboardVaultFiles(env = process.env, options = {}) {
       file_path AS filePath,
       content_hash AS contentHash,
       file_mtime AS fileMtime,
+      source_timestamp AS sourceTimestamp,
+      source_timestamp_source AS sourceTimestampSource,
+      source_timestamp_confidence AS sourceTimestampConfidence,
+      source_timestamp_candidates AS sourceTimestampCandidates,
       indexed_at AS indexedAt
     FROM vault_files
     ORDER BY id
@@ -1053,6 +1075,9 @@ export function listDashboardVaultFiles(env = process.env, options = {}) {
     sourceType: file.sourceType,
     fileSize: file.fileSize,
     filePath: file.filePath,
+    sourceTimestamp: file.sourceTimestamp ?? null,
+    sourceTimestampSource: file.sourceTimestampSource ?? null,
+    sourceTimestampConfidence: file.sourceTimestampConfidence ?? null,
     indexedAt: file.indexedAt,
     queueStatus: queueItem?.status ?? "not-queued",
     queueItem: queueItem ? buildQueueListItem(queueItem) : null,
@@ -1097,6 +1122,10 @@ export async function getDashboardVaultFileDetail(env = process.env, fileId) {
       file_path AS filePath,
       content_hash AS contentHash,
       file_mtime AS fileMtime,
+      source_timestamp AS sourceTimestamp,
+      source_timestamp_source AS sourceTimestampSource,
+      source_timestamp_confidence AS sourceTimestampConfidence,
+      source_timestamp_candidates AS sourceTimestampCandidates,
       indexed_at AS indexedAt
     FROM vault_files
     WHERE id = ?
@@ -1134,6 +1163,10 @@ export async function openDashboardVaultFile(env = process.env, fileId) {
       file_path AS filePath,
       content_hash AS contentHash,
       file_mtime AS fileMtime,
+      source_timestamp AS sourceTimestamp,
+      source_timestamp_source AS sourceTimestampSource,
+      source_timestamp_confidence AS sourceTimestampConfidence,
+      source_timestamp_candidates AS sourceTimestampCandidates,
       indexed_at AS indexedAt
     FROM vault_files
     WHERE id = ?
@@ -67,6 +67,21 @@ function normalizeOptionalString(value) {
   const normalized = value.trim();
   return normalized ? normalized : null;
 }
+function parseOptionalJsonArray(value) {
+  if (Array.isArray(value)) {
+    return value;
+  }
+  if (typeof value !== "string" || !value.trim()) {
+    return [];
+  }
+  try {
+    const parsed = JSON.parse(value);
+    return Array.isArray(parsed) ? parsed : [];
+  }
+  catch {
+    return [];
+  }
+}
 function isAbsoluteLikePath(value) {
   return path.isAbsolute(value) || /^[A-Za-z]:[\\/]/.test(value);
 }
@@ -469,7 +484,7 @@ export function listVaultFiles(env = process.env, options = {}) {
     clauses.push("file_ext = ?");
     params.push(String(options.ext).replace(/^\./, ""));
   }
-  return db
+  const rows = db
     .prepare(`
       SELECT
         id,
@@ -480,12 +495,20 @@ export function listVaultFiles(env = process.env, options = {}) {
         file_path AS filePath,
         content_hash AS contentHash,
         file_mtime AS fileMtime,
+        source_timestamp AS sourceTimestamp,
+        source_timestamp_source AS sourceTimestampSource,
+        source_timestamp_confidence AS sourceTimestampConfidence,
+        source_timestamp_candidates AS sourceTimestampCandidates,
         indexed_at AS indexedAt
       FROM vault_files
       ${clauses.length > 0 ? `WHERE ${clauses.join(" AND ")}` : ""}
       ORDER BY id
     `)
     .all(...params);
+  return rows.map((row) => ({
+    ...row,
+    sourceTimestampCandidates: parseOptionalJsonArray(row.sourceTimestampCandidates),
+  }));
 }
 finally {
   db.close();
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@biaoo/tiangong-wiki",
-  "version": "0.3.11",
+  "version": "0.3.12",
   "description": "Local-first wiki index and query engine for Markdown knowledge pages (Tiangong Wiki).",
   "type": "module",
   "publishConfig": {
@@ -293,7 +293,7 @@ tiangong-wiki vault diff [--since <date>] [--path <prefix>]
 tiangong-wiki vault queue [--status pending|processing|done|skipped|error]
 ```
 
-- `list` — List indexed vault files; `--path` does prefix matching on relative paths
+- `list` — List indexed vault files; `--path` does prefix matching on relative paths. JSON output includes source timestamp inference fields when available (`sourceTimestamp`, `sourceTimestampSource`, `sourceTimestampConfidence`, `sourceTimestampCandidates`).
 - `diff` — Show changes since the last sync (or since a given date with `--since`)
 - `queue` — Show processing queue status and item details, including extracted plain-text artifact metadata when a parser snapshot exists
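
An illustrative `list` JSON item fragment with the new fields (shape follows the columns added in this diff; all values hypothetical):

```js
// Hypothetical vault list item fragment showing the inference fields.
const listItem = {
  id: "reports/2024-03-15 quarterly-review.pdf",
  sourceTimestamp: "2024-03-15T00:00:00+08:00",
  sourceTimestampSource: "file_name",
  sourceTimestampConfidence: 0.9,
  sourceTimestampCandidates: [
    { timestamp: "2024-03-15T00:00:00+08:00", source: "file_name", confidence: 0.9, precision: "date", raw: "2024-03-15" },
    { timestamp: "2024-06-10T14:13:20+08:00", source: "file_mtime", confidence: 0.5, precision: "datetime", raw: "1718000000000" },
  ],
};
```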
 
@@ -91,13 +91,13 @@ The agent uses [Codex SDK](https://www.npmjs.com/package/@openai/codex-sdk) to p
 | `WIKI_AGENT_MODEL` | No | Model name (default: `gpt-5.5`; e.g. `Qwen/Qwen3.5-397B-A17B-GPTQ-Int4`) |
 | `WIKI_AGENT_BATCH_SIZE` | No | Max concurrent vault queue workers per cycle (default: `5`) |
 | `WIKI_AGENT_SANDBOX_MODE` | No | Codex sandbox mode: `danger-full-access` (default) or `workspace-write` |
-| `WIKI_PARSER_SKILLS` | No | Comma-separated parser skill list (e.g. `pdf,docx,pptx,xlsx,document-granular-decompose`). `document-granular-decompose` covers PDF, Office, Markdown, and common image formats; use its own `SKILL.md` for the exact extension allowlist |
+| `WIKI_PARSER_SKILLS` | No | Comma-separated parser skill list (e.g. `pdf,docx,pptx,xlsx,document-granular-decompose`). `document-granular-decompose` covers PDF, Office, and common image formats; Markdown and other plain text-like files are read locally by the wiki workflow. Use the skill's own `SKILL.md` for the exact extension allowlist |
 | `UNSTRUCTURED_API_BASE_URL` | For `document-granular-decompose` | TianGong Unstructure API base URL |
 | `UNSTRUCTURED_AUTH_TOKEN` | For `document-granular-decompose` | Bearer token for TianGong Unstructure |
 | `UNSTRUCTURED_PROVIDER` | No | Optional provider override passed to TianGong Unstructure |
 | `UNSTRUCTURED_MODEL` | No | Optional model override passed to TianGong Unstructure |
 
-When `document-granular-decompose` is configured with `UNSTRUCTURED_API_BASE_URL` and `UNSTRUCTURED_AUTH_TOKEN`, the wiki agent prefers it before the type-specific `pdf`, `docx`, `pptx`, and `xlsx` skills for supported document/image files. Keep the type-specific skills configured only when you want them available as fallback tools.
+When `document-granular-decompose` is configured with `UNSTRUCTURED_API_BASE_URL` and `UNSTRUCTURED_AUTH_TOKEN`, the wiki agent prefers it before the type-specific `pdf`, `docx`, `pptx`, and `xlsx` skills for supported non-text document/image files. Keep the type-specific skills configured only when you want them available as fallback tools.
 
 For successful parser runs, the workflow keeps the exact plain-text extraction used by the agent at `.queue-artifacts/<file-artifact>/extracted-fulltext.txt`. `tiangong-wiki vault queue` exposes `extractedTextPath`, `extractedTextSha256`, `extractedTextParserSkill`, and `extractedTextCharCount` when a snapshot exists.
 
@@ -28,9 +28,9 @@ Parser skills are installed under `<workspace-root>/.agents/skills/`. Do not ass
 | `docx` | Extract text and structure from DOCX files |
 | `pptx` | Extract text, slide structure, and speaker notes from PPTX files |
 | `xlsx` | Extract tables and data from XLSX/CSV files |
-| `document-granular-decompose` | Extract fulltext from PDF, Office documents, Markdown, and common image formats through TianGong Unstructure |
+| `document-granular-decompose` | Extract fulltext from PDF, Office documents, and common image formats through TianGong Unstructure |
 
-When `document-granular-decompose` is available, `WIKI_PARSER_SKILLS` includes it, and `UNSTRUCTURED_API_BASE_URL` plus `UNSTRUCTURED_AUTH_TOKEN` are set, prefer it for supported document/image formats before the type-specific parser skills below. This includes PDF, Word, PowerPoint, Excel, Markdown, and common image formats; read the skill's own `SKILL.md` for the exact extension allowlist. The client should request JSON with `return_txt=true`, then use the plain text from `response.txt` / `txt` as the wiki agent's primary input. Write that same plain text to `EXTRACTED_TEXT_PATH` so the queue artifact retains the exact text snapshot used for analysis. Keep JSON chunks and page numbers only for debugging or provenance evidence.
+When `document-granular-decompose` is available, `WIKI_PARSER_SKILLS` includes it, and `UNSTRUCTURED_API_BASE_URL` plus `UNSTRUCTURED_AUTH_TOKEN` are set, prefer it for supported non-text document/image formats before the type-specific parser skills below. This includes PDF, Word, PowerPoint, Excel, and common image formats; read the skill's own `SKILL.md` for the exact extension allowlist. The client should request JSON with `return_txt=true`, then use the plain text from `response.txt` / `txt` as the wiki agent's primary input. Write that same plain text to `EXTRACTED_TEXT_PATH` so the queue artifact retains the exact text snapshot used for analysis. Keep JSON chunks and page numbers only for debugging or provenance evidence.
 
 When any other parser skill is available and the vault file matches its type, use the skill. Read the skill's SKILL.md for interface details before invoking.
 
@@ -39,7 +39,7 @@ If a parser skill fails due to missing runtime dependencies, attempt to install
 ### File Type Strategies
 
 **Markdown / Plain Text (md, txt)**
-Read directly. For large files (>5000 lines), read in sections. Parse YAML frontmatter separately if present.
+Read directly. Do not send Markdown or other plain text-like files to parser skills or remote unstructure APIs. For large files (>5000 lines), read in sections. Parse YAML frontmatter separately if present.
 
 **PDF**
 Prefer the `pdf` parser skill. Without it: attempt direct read; if unreadable, skip. Use PDF metadata (title, author, date, subject) to inform decisions.
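
A minimal sketch of the resulting routing policy (the helper and the unstructure extension list are assumptions for illustration; the plain-text list mirrors the workflow prompt, and the skill's `SKILL.md` stays authoritative):

```js
// Sketch only: pick a reading strategy for a vault file under this policy.
const PLAIN_TEXT_EXTS = new Set(["txt", "md", "markdown", "json", "csv", "tsv", "yaml", "yml"]);
// Assumed allowlist for illustration; SKILL.md is the source of truth.
const UNSTRUCTURE_EXTS = new Set(["pdf", "doc", "docx", "ppt", "pptx", "xls", "xlsx", "png", "jpg", "jpeg"]);

function chooseReadingStrategy(fileExt, env) {
  const ext = (fileExt ?? "").replace(/^\./, "").toLowerCase();
  if (PLAIN_TEXT_EXTS.has(ext)) {
    return "local-read"; // never sent to parser skills or remote APIs
  }
  const skills = (env.WIKI_PARSER_SKILLS ?? "").split(",").map((skill) => skill.trim());
  const unstructureReady = skills.includes("document-granular-decompose")
    && Boolean(env.UNSTRUCTURED_API_BASE_URL)
    && Boolean(env.UNSTRUCTURED_AUTH_TOKEN);
  if (unstructureReady && UNSTRUCTURE_EXTS.has(ext)) {
    return "document-granular-decompose"; // request return_txt=true
  }
  return "type-specific-skill"; // pdf/docx/pptx/xlsx fallbacks, else direct read or skip
}
```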
@@ -88,6 +88,7 @@ Use vision to understand each image in context. Extract only high-value images v
 6. `sourceRefs` may only contain existing wiki page ids. Raw file provenance belongs in the page body or a field like `vaultPath`.
 7. Only write frontmatter fields declared by the chosen type (`tiangong-wiki type show <type>`). Do not invent ad-hoc fields.
 8. If the type system cannot represent the knowledge cleanly, prefer `propose_only` unless template evolution is explicitly allowed.
+9. queue-item.json may include a source timestamp inferred from file name, path, or mtime. Use it only as source-date evidence; do not copy it blindly into page `createdAt` / `updatedAt`, which are system-normalized.
 
 ### Runtime Discovery