ethan-agent-skills 0.1.1 → 0.2.0

package/README.md CHANGED
@@ -25,8 +25,8 @@ npx -y ethan-agent-skills@latest update --target /tmp/test-skills
25
25
  npx -y ethan-agent-skills@latest list
26
26
  ```
27
27
 
28
- Current common bundled skills include `pdf-extract`, `skill-evolution`, and
29
- `fix-my-life`, plus the client-specific OpenSpec OPSX skills.
28
+ Current common bundled skills include `pdf-extract`, `skill-evolution`,
29
+ `fix-my-life`, and `smzdm-picks`, plus the client-specific OpenSpec OPSX skills.
30
30
 
31
31
  The updater writes `.skills-lock.json` in each target skill root and only
32
32
  rewrites skills managed by this package. If a destination skill directory exists
@@ -169,6 +169,98 @@ The workflow can also be started manually from the GitHub Actions tab, but tag
169
169
  pushes are the preferred release path because they tie npm versions to Git
170
170
  history.
171
171
 
172
+ ### Importing Existing Local Skills
173
+
174
+ Use this flow when a useful skill already exists under a local agent directory
175
+ such as `~/.claude/skills/<skill-dir>` and should become part of this package.
176
+
177
+ 1. Choose the package destination:
178
+ - Put cross-client skills in `skills/<skill-dir>`.
179
+ - Put Claude-only skills in `claude/skills/<skill-dir>`.
180
+ - Put Codex-only skills in `codex/skills/<skill-dir>`.
181
+ - Put source-command or agent-only skills in `agents/skills/<skill-dir>`.
182
+ 2. Inspect the source before copying:
183
+
184
+ ```bash
185
+ SOURCE="$HOME/.claude/skills/<skill-dir>"
186
+ rg --files -uu "$SOURCE"
187
+ rg -n "(secret|password|token|api[_-]?key|PRIVATE|sk-[A-Za-z0-9])" "$SOURCE"
188
+ ```
189
+
190
+ Do not copy real `.env` files, private keys, generated caches, or unrelated
191
+ local state. Placeholder files such as `.env.example` are fine.
192
+
193
+ 3. Copy the skill into the package:
194
+
195
+ ```bash
196
+ DEST="skills/<skill-dir>"
197
+ rm -rf "$DEST"
198
+ cp -R "$SOURCE" "$DEST"
199
+ find "$DEST" -type d -name "__pycache__" -prune -exec rm -rf {} +
200
+ find "$DEST" -name "*.pyc" -delete
201
+ ```
202
+
203
+ 4. Add or update `skill.json` beside `SKILL.md`:
204
+
205
+ ```json
206
+ {
207
+   "name": "<trigger-or-display-name>",
208
+   "version": "0.1.0",
209
+   "description": "Short description used by the package list command."
210
+ }
211
+ ```
212
+
213
+ Keep the `SKILL.md` frontmatter name unchanged when the existing trigger name
214
+ should remain stable. For example, a directory can be `skill-evolution` while
215
+ the skill frontmatter name remains `skill-dev`.
216
+
217
+ 5. Verify local discovery and install behavior:
218
+
219
+ ```bash
220
+ node bin/skills.mjs list
221
+ node bin/skills.mjs update --dry-run --target /tmp/test-skills --client claude
222
+ npm run test:local
223
+ npm run pack:check
224
+ ```
225
+
226
+ `npm run pack:check` should show the new `SKILL.md`, `skill.json`, references,
227
+ and scripts in the tarball contents.
228
+
229
+ 6. Commit and push the imported skill:
230
+
231
+ ```bash
232
+ git status --short
233
+ git add README.md skills/<skill-dir>
234
+ git commit -m "Add <skill-dir> skill"
235
+ ```
236
+
237
+ Adjust the `git add` path if the skill was copied into `claude/`, `codex/`, or
238
+ `agents/` instead of `skills/`.
239
+
240
+ 7. Publish a new npm version:
241
+
242
+ ```bash
243
+ npm version minor
244
+ git push --follow-tags
245
+ ```
246
+
247
+ Use `minor` for adding a new skill. Use `patch` if the skill was already
248
+ published and only its content changed.
249
+
250
+ 8. Confirm the automated release:
251
+
252
+ ```bash
253
+ npm view ethan-agent-skills version --json
254
+ npx -y ethan-agent-skills@latest list
255
+ npx -y ethan-agent-skills@latest update --dry-run --target /tmp/test-skills --client claude
256
+ ```
257
+
258
+ After the new version appears on npm, users can refresh with:
259
+
260
+ ```bash
261
+ npx -y ethan-agent-skills@latest update
262
+ ```
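Before running step 5 of the import flow above, a quick pre-flight check of the copied directory can catch a missing or malformed `skill.json`. The following is only a sketch: `check_skill_dir` is a hypothetical helper, not part of `bin/skills.mjs`, and the non-empty and `MAJOR.MINOR.PATCH` checks are assumptions that are stricter than what the package appears to enforce (the old `pdf-extract` manifest shipped with an empty description).

```python
import json
import re
import sys
from pathlib import Path

# Fields shown in the skill.json example of step 4.
REQUIRED_FIELDS = ("name", "version", "description")
# Assumption: versions are plain MAJOR.MINOR.PATCH, as in the examples.
SEMVER_RE = re.compile(r"^\d+\.\d+\.\d+$")


def check_skill_dir(skill_dir):
    """Return a list of problems found in one skill directory (empty if none)."""
    skill_dir = Path(skill_dir)
    problems = []
    if not (skill_dir / "SKILL.md").is_file():
        problems.append("missing SKILL.md")
    manifest = skill_dir / "skill.json"
    if not manifest.is_file():
        problems.append("missing skill.json")
        return problems
    try:
        data = json.loads(manifest.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        problems.append(f"skill.json is not valid JSON: {exc}")
        return problems
    if not isinstance(data, dict):
        problems.append("skill.json must contain a JSON object")
        return problems
    for field in REQUIRED_FIELDS:
        value = data.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"skill.json field '{field}' is missing or empty")
    version = data.get("version")
    if isinstance(version, str) and version.strip() and not SEMVER_RE.match(version):
        problems.append(f"version '{version}' is not MAJOR.MINOR.PATCH")
    return problems


if __name__ == "__main__":
    # Usage: python3 check_skill.py skills/<skill-dir> [more dirs...]
    for directory in sys.argv[1:]:
        for problem in check_skill_dir(directory):
            print(f"{directory}: {problem}")
```

An empty result only says the manifest looks well-formed; `npm run pack:check` remains the check that the files actually land in the tarball.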
263
+
172
264
  ## OpenSpec OPSX Usage
173
265
 
174
266
  Codex App can invoke the global custom prompts directly:
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "ethan-agent-skills",
3
- "version": "0.1.1",
3
+ "version": "0.2.0",
4
4
  "description": "Agent skills published from my_agent_skill",
5
5
  "type": "module",
6
6
  "bin": {
@@ -5,156 +5,55 @@ description: Extract text from PDF files and save as clean markdown documents. U
5
5
 
6
6
  # PDF Extract Skill
7
7
 
8
- Extract complete text from PDF files and save as clean markdown. Handles text-based PDFs, scanned/image pages, and mixed documents — all using macOS native frameworks (no external dependencies beyond pyobjc, which comes with the system Python on macOS).
8
+ Extract complete text from PDF files and save as clean markdown. Handles text-based PDFs, scanned/image pages, and mixed documents using macOS-native frameworks (pyobjc + Quartz/PDFKit).
9
9
 
10
10
  ## When to Use
11
11
 
12
12
  - A user uploads a PDF and wants its content extracted to text or markdown
13
- - The PDF contains a mix of text and scanned pages (e.g., insurance policy booklets)
14
- - Scanned pages are embedded within otherwise text-based PDFs
15
- - Large PDFs where the extracted text exceeds normal read limits
13
+ - The PDF mixes text and scanned pages (e.g., insurance policy booklets)
14
+ - Large PDFs whose extracted text exceeds normal read limits
16
15
  - Financial documents, insurance policies, contracts, or reports that need structured extraction
17
16
 
18
- ## Workflow Overview
17
+ ## Recommended Workflow
19
18
 
20
- 1. **Try text extraction first**: use PDFKit to get native text
21
- 2. **Detect scanned pages** — pages with little or no extracted text are likely scans
22
- 3. **Render scanned pages as images** — convert them to PNGs at high resolution
23
- 4. **Extract text from images** — use multimodal vision to OCR the rendered pages
24
- 5. **Crop if needed** — isolate specific regions (tables, signatures) from page images
25
- 6. **Assemble and save** — combine all extracted text into a clean markdown document
19
+ Run the bundled script. It handles text extraction + section keyword scan + optional rendering in one call:
26
20
 
27
- ## Step-by-Step Instructions
28
-
29
- ### Step 1: Inspect the PDF
30
-
31
- First, check the PDF file path and try basic text extraction on the first page to gauge quality:
32
-
33
- ```python
34
- import sys, Quartz, os
35
- sys.path.insert(0, '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/PyObjC')
36
- import objc
37
-
38
- pdf_path = "/path/to/file.pdf"
39
- url = Quartz.NSURL.fileURLWithPath_(pdf_path)
40
- doc = Quartz.PDFDocument.alloc().initWithURL_(doc)
41
-
42
- if doc is None:
43
- # Cannot read PDF
44
- exit()
45
-
46
- page_count = doc.pageCount()
47
- print(f"Pages: {page_count}")
48
-
49
- # Quick quality check on first page
50
- p1 = doc.pageAtIndex_(0)
51
- text = p1.string() if p1 else ""
52
- print(f"First page text length: {len(text)}")
53
- if len(text) < 50:
54
- print("WARNING: Low text extraction — likely a scanned PDF")
21
+ ```bash
22
+ python3 ~/.claude/skills/pdf-extract/scripts/pdf_extract.py <pdf_path> \
23
+ > /tmp/extracted.txt 2> /tmp/summary.txt
55
24
  ```
56
25
 
57
- ### Step 2: Extract All Text-Based Pages
26
+ Then:
58
27
 
59
- Attempt to extract text from every page. Pages with meaningful text (>50 chars) are text-based; pages with very little text (<10 chars, often just whitespace or decorative elements) are scanned images.
28
+ 1. **Read `/tmp/summary.txt` first.** It contains the `section_index` (page numbers of key financial sections such as the cash value table, benefit illustration, surrender value, and non-guaranteed projections) and any `warnings`. **Inspect every `section_index` entry before declaring extraction complete**; do not stop at the first value table found.
29
+ 2. **Heed all warnings.** They flag missing documents or coverage gaps (e.g., "participating policy without illustration").
30
+ 3. **Deep-read the pages** the section_index points to. For long documents, use offset/limit on `/tmp/extracted.txt`.
31
+ 4. **For scanned pages**, rerun with `--render-scanned` to write PNGs to `/tmp/pdf_rendered/`, then use the Read tool on the PNGs for vision-based OCR.
32
+ 5. **Assemble** the extracted content into structured markdown.
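The summary file mixes human-readable lines with a JSON object that the script writes last via `json.dump`. A minimal sketch of pulling that object back out, assuming the JSON sits on a single line as unindented `json.dump` produces; `load_summary`, `pages_to_review`, and the sample text are hypothetical and not part of the bundled script:

```python
import json


def load_summary(stderr_text):
    """Return the JSON summary dict from captured stderr text, or None."""
    # The JSON object is written after the readable lines, so scan from the end.
    for line in reversed(stderr_text.splitlines()):
        line = line.strip()
        if line.startswith("{"):
            try:
                return json.loads(line)
            except json.JSONDecodeError:
                continue
    return None


def pages_to_review(summary):
    """Sorted, de-duplicated 1-indexed pages named by any section_index entry."""
    pages = set()
    for hits in summary.get("section_index", {}).values():
        pages.update(hits)
    return sorted(pages)


# Made-up stderr capture, shaped like the script's output.
sample_stderr = (
    "[Summary] 60 text pages, 2 scanned pages: [3, 4]\n"
    "[Section Index] (1-indexed page numbers)\n"
    "  cash_value_table: [48]\n"
    '{"text_pages": 60, "scanned_pages": [3, 4], "scanned_count": 2, '
    '"section_index": {"cash_value_table": [48], "benefit_illustration": [51, 52]}, '
    '"warnings": []}\n'
)

summary = load_summary(sample_stderr)
print(pages_to_review(summary))
print(summary["warnings"])
```

With a real run, the text would come from reading `/tmp/summary.txt` instead of the inline sample.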
60
33
 
61
- ```python
62
- import sys, Quartz
34
+ For the underlying pyobjc/Quartz Python code (when modifying the script or running inline), see `references/manual-implementation.md`.
63
35
 
64
- pdf_path = "/path/to/file.pdf"
65
- url = Quartz.NSURL.fileURLWithPath_(pdf_path)
66
- doc = Quartz.PDFDocument.alloc().initWithURL_(url)
36
+ ## Policy / Financial PDF Workflow
67
37
 
68
- pages_text = []
69
- scanned_pages = []
38
+ Insurance policies (especially participating/with-profits plans), annuity contracts, and savings plans contain **multiple value tables** that look similar but represent very different things. Missing one leads to wildly wrong conclusions.
70
39
 
71
- for i in range(doc.pageCount()):
72
- page = doc.pageAtIndex_(i)
73
- text = page.string() if page else ""
74
- if len(text.strip()) < 10:
75
- scanned_pages.append(i)
76
- else:
77
- pages_text.append({"page": i + 1, "text": text})
40
+ **Key distinction for participating policies**:
41
+ - **Guaranteed Cash Value Table** (保證現金價值表) — contractually guaranteed, usually printed in the policy contract
42
+ - **Benefit Illustration / 建議書摘要** — projected values including **non-guaranteed dividends** (特別紅利, reversionary/terminal bonus). For a 分紅 / participating policy, **the illustration is the central financial document, not the guaranteed table.**
78
43
 
79
- print(f"Text pages: {len(pages_text)}, Scanned pages: {scanned_pages}")
80
- ```
81
-
82
- ### Step 3: Render Scanned Pages as PNG Images
83
-
84
- For each scanned page, render it to a PNG image at 6x scale (300+ DPI equivalent) for clarity. Lower scales (3x-4x) may not be readable for dense financial tables.
85
-
86
- ```python
87
- import sys, Quartz
88
-
89
- pdf_path = "/path/to/file.pdf"
90
- url = Quartz.NSURL.fileURLWithPath_(pdf_path)
91
- doc = Quartz.PDFDocument.alloc().initWithURL_(url)
92
- output_dir = "/tmp/pdf_rendered/"
93
- os.makedirs(output_dir, exist_ok=True)
94
-
95
- scale = 6.0 # 6x scale for dense tables
96
-
97
- for page_idx in scanned_pages:
98
- page = doc.pageAtIndex_(page_idx)
99
- media_box = page.boundsForBox_(Quartz.kCGPDFMediaBox)
100
-
101
- pw = int(media_box.size.width * scale)
102
- ph = int(media_box.size.height * scale)
103
-
104
- cs = Quartz.CGColorSpaceCreateDeviceRGB()
105
- ctx = Quartz.CGBitmapContextCreate(
106
- None, pw, ph, 8, pw * 4, cs,
107
- Quartz.kCGImageAlphaPremultipliedLast
108
- )
109
-
110
- # White background
111
- Quartz.CGContextSetRGBFillColor(ctx, 1.0, 1.0, 1.0, 1.0)
112
- Quartz.CGContextFillRect(ctx, Quartz.CGRectMake(0, 0, pw, ph))
113
- Quartz.CGContextScaleCTM(ctx, scale, scale)
114
- page.drawWithBox_toContext_(Quartz.kCGPDFMediaBox, ctx)
115
-
116
- cg_img = Quartz.CGBitmapContextCreateImage(ctx)
117
- Quartz.CGImageDestinationAddImage(
118
- Quartz.CGImageDestinationCreateWithURL(
119
- Quartz.NSURL.fileURLWithPath_(f"{output_dir}page_{page_idx+1}.png"),
120
- Quartz.kUTTypePNG, 1, None
121
- ),
122
- cg_img, None
123
- )
124
- # Finalize the destination
125
- dest = Quartz.CGImageDestinationCreateWithURL(
126
- Quartz.NSURL.fileURLWithPath_(f"{output_dir}page_{page_idx+1}.png"),
127
- Quartz.kUTTypePNG, 1, None
128
- )
129
- Quartz.CGImageDestinationAddImage(dest, cg_img, None)
130
- Quartz.CGImageDestinationFinalize(dest)
131
-
132
- print(f"Rendered page {page_idx+1} to {output_dir}page_{page_idx+1}.png ({pw}x{ph})")
133
- ```
44
+ For a typical participating policy, expected surrender value ≈ guaranteed cash value + projected non-guaranteed bonus. Using only the guaranteed table can understate expected returns by 2-5x.
134
45
 
135
- > **Scale reference:** 3x works for simple text, 4x for most documents, 5x for dense forms, 6x for financial tables with small fonts. Start at 4x and go up if text isn't legible.
46
+ **Rules**:
47
+ 1. If `section_index["participating_plan"]` is non-empty, **you must locate and extract** `benefit_illustration`. Don't summarize without it.
48
+ 2. If `section_index["benefit_illustration"]` is empty for a participating policy, **tell the user the illustration is missing** and recommend requesting it from the agent — do not compute IRR or surrender returns from the guaranteed table alone.
49
+ 3. When building the summary markdown, present **guaranteed (A) + non-guaranteed (B) + total (A+B)** as separate columns. Note that non-guaranteed values can be downgraded at the insurer's discretion.
50
+ 4. Look for pessimistic/optimistic scenario tables (悲觀情景 / 樂觀情景) — they bracket the range of plausible returns; include them when present.
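Rule 3 above can be illustrated with a small sketch that renders guaranteed (A), non-guaranteed (B), and total (A+B) as separate columns. The helper name `value_table` and all figures are made up for illustration; real values must come from the policy's own tables.

```python
def value_table(rows):
    """Render (policy_year, guaranteed, non_guaranteed) rows as a markdown table."""
    lines = [
        "| Policy year | Guaranteed (A) | Non-guaranteed (B) | Total (A+B) |",
        "|---:|---:|---:|---:|",
    ]
    for year, guaranteed, non_guaranteed in rows:
        total = guaranteed + non_guaranteed  # expected value = A + B
        lines.append(f"| {year} | {guaranteed:,} | {non_guaranteed:,} | {total:,} |")
    return "\n".join(lines)


# Hypothetical figures only; not taken from any real policy.
rows = [(10, 50_000, 30_000), (20, 120_000, 180_000)]
table = value_table(rows)
print(table)
```

Keeping B in its own column makes it visible how much of the total depends on amounts the insurer may later reduce.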
136
51
 
137
- ### Step 4: Crop Specific Regions (Optional)
52
+ **Example**: If the script reports `participating_plan: [55, 68]` and `benefit_illustration: [51, 52, 53, 54, 55]`, deep-read pages 51-55 — that's where the projection tables live. A cash value table elsewhere (e.g., page 48) is necessary but not sufficient.
138
53
 
139
- When only part of a page is relevant (e.g., a specific table), crop the rendered image to that region. Coordinates use the Quartz coordinate system where the origin (0,0) is at the **top-left** of the page.
140
-
141
- ```python
142
- import sys, Quartz
143
-
144
- # After rendering to cg_img (before saving), crop a region:
145
- # crop_rect = Quartz.CGRectMake(x * scale, y * scale, width * scale, height * scale)
146
- # cg_img = Quartz.CGImageCreateWithImageInRect(cg_img, crop_rect)
147
- ```
54
+ ## Assembled Markdown Template
148
55
 
149
- > **Finding coordinates:** Render the full page first, view the image, then estimate the crop rectangle. The page bounds are in PDF points (1 point = 1/72 inch). A standard A4 page is roughly 595x842 points. Multiply by `scale` for the pixel-space crop rectangle.
150
-
151
- ### Step 5: Extract Text from Rendered Images
152
-
153
- Use multimodal vision to extract text from the rendered PNG images. This works best with the Read tool on the image files.
154
-
155
- ### Step 6: Assemble and Save as Markdown
156
-
157
- Combine all extracted content into a structured markdown document. Follow the Obsidian wiki schema if the target vault uses it:
56
+ When saving the result to an Obsidian wiki, use this frontmatter pattern:
158
57
 
159
58
  ```markdown
160
59
  ---
@@ -168,47 +67,30 @@ status: active
168
67
 
169
68
  # Document Title
170
69
 
171
- > Brief description of the document
70
+ > Brief description
172
71
 
173
72
  ## Key Data
174
-
175
73
  [Extracted tables as markdown tables]
176
74
 
177
75
  ## Main Content
178
-
179
- [Extracted body text, organized by sections]
76
+ [Body text, organized by sections]
180
77
 
181
78
  ## Important Notes
182
-
183
- [Any specific details, warnings, or risks mentioned]
184
- ```
185
-
186
- ## Handling Large PDFs
187
-
188
- When the extracted text is very long and would exceed output limits:
189
-
190
- - Process pages in batches of 5-10
191
- - Save intermediate results to a temporary file
192
- - For text-based extraction, use chunked reading with offset/limit if reading from a file:
193
-
194
- ```python
195
- # Write extracted text to a file first, then read in chunks
196
- with open("/tmp/extracted.txt", "w") as f:
197
- for page_data in pages_text:
198
- f.write(f"\n--- Page {page_data['page']} ---\n")
199
- f.write(page_data["text"])
79
+ [Specific details, warnings, risks]
200
80
  ```
201
81
 
202
82
  ## Common Pitfalls
203
83
 
204
- - **Low resolution renders:** If OCR quality is poor, increase the scale from 4x to 6x. Dense tables with small fonts almost always need 6x.
205
- - **Page orientation:** Some PDFs have rotated pages. Check `media_box` dimensions to detect landscape pages.
206
- - **Watermarks/overlays:** Background watermarks can interfere with OCR. If pages have heavy watermarks, try cropping to the content region.
207
- - **Mixed content pages:** A page might have both text and scanned elements. The `< 10 chars` threshold detects pure scans, but pages with partial text need manual review.
208
- - **pyobjc availability:** On macOS, pyobjc is pre-installed with the system Python. Use `python3` from the system, not a Homebrew Python that may lack the Quartz bindings.
84
+ - **Stopping at the first value table.** Especially in policy PDFs, finding a "Cash Value Table" does NOT mean you're done. Always check the `section_index` for other value-related sections (benefit illustration, surrender value, non-guaranteed) before concluding.
85
+ - **Garbled CJK text from font subsetting.** Some PDFs (esp. Traditional Chinese insurance policies) embed subset fonts with custom glyph encodings — `page.string()` returns broken codepoints. Render those pages as PNG at 6x scale and OCR via vision. Often only the boilerplate provisions pages are affected; the data tables remain readable.
86
+ - **Low-resolution renders.** If OCR quality is poor, raise `--scale` from 4 to 6. Dense tables with small fonts need 6x.
87
+ - **Page orientation.** Some PDFs have rotated pages. Check `media_box` dimensions to detect landscape.
88
+ - **Watermarks/overlays.** Heavy background watermarks interfere with OCR; crop to the content region.
89
+ - **Mixed content pages.** A page may have both text and scanned elements. The `< 10 chars` threshold detects pure scans only; partial-text pages need manual review.
90
+ - **pyobjc availability.** On macOS, pyobjc is pre-installed with system Python. Use `python3` from the system, not a Homebrew Python that may lack Quartz bindings.
209
91
 
210
92
  ## Dependencies
211
93
 
212
- - macOS (required — uses Quartz/PDFKit frameworks)
213
- - pyobjc (pre-installed on macOS with system Python)
214
- - No additional packages needed (no poppler, no tesseract, no PIL)
94
+ - macOS (uses Quartz / PDFKit)
95
+ - pyobjc (pre-installed with system Python)
96
+ - No additional packages (no poppler, no tesseract, no PIL)
@@ -0,0 +1,141 @@
1
+ # Manual Implementation — pyobjc / Quartz PDF Extraction
2
+
3
+ The bundled `scripts/pdf_extract.py` handles all of this end-to-end. Read this reference only when:
4
+ - Modifying the script itself
5
+ - The script is unavailable and you need to inline the logic
6
+ - Doing a custom one-off (e.g., a different page-level filter)
7
+
8
+ The script source at `scripts/pdf_extract.py` is the authoritative implementation.
9
+
10
+ ## Step 1: Inspect the PDF
11
+
12
+ Check the PDF file path and try basic text extraction on the first page to gauge quality.
13
+
14
+ ```python
15
+ import Quartz
16
+
17
+ pdf_path = "/path/to/file.pdf"
18
+ url = Quartz.NSURL.fileURLWithPath_(pdf_path)
19
+ doc = Quartz.PDFDocument.alloc().initWithURL_(url)
20
+
21
+ if doc is None:
22
+     raise SystemExit("Cannot open PDF")
23
+
24
+ page_count = doc.pageCount()
25
+ print(f"Pages: {page_count}")
26
+
27
+ p1 = doc.pageAtIndex_(0)
28
+ text = p1.string() if p1 else ""
29
+ print(f"First page text length: {len(text)}")
30
+ if len(text) < 50:
31
+     print("WARNING: Low text extraction — likely a scanned PDF")
32
+ ```
33
+
34
+ ## Step 2: Extract All Text-Based Pages
35
+
36
+ Pages with `len(text.strip()) < 10` are treated as pure scans. Pages with partial text need manual review.
37
+
38
+ ```python
39
+ import Quartz
40
+
41
+ pdf_path = "/path/to/file.pdf"
42
+ url = Quartz.NSURL.fileURLWithPath_(pdf_path)
43
+ doc = Quartz.PDFDocument.alloc().initWithURL_(url)
44
+
45
+ pages_text = []
46
+ scanned_pages = []
47
+
48
+ for i in range(doc.pageCount()):
49
+     page = doc.pageAtIndex_(i)
50
+     text = page.string() if page else ""
51
+     if len(text.strip()) < 10:
52
+         scanned_pages.append(i)
53
+     else:
54
+         pages_text.append({"page": i + 1, "text": text})
55
+
56
+ print(f"Text pages: {len(pages_text)}, Scanned pages: {scanned_pages}")
57
+ ```
58
+
59
+ ## Step 3: Render Scanned Pages as PNG
60
+
61
+ For dense financial tables, use **6x scale** (432 DPI, since one PDF point is 1/72 inch). Scale guide:
62
+ - 3x — simple text
63
+ - 4x — most documents
64
+ - 5x — dense forms
65
+ - 6x — financial tables with small fonts
66
+
67
+ ```python
68
+ import os
69
+ import Quartz
70
+
71
+ output_dir = "/tmp/pdf_rendered/"
72
+ os.makedirs(output_dir, exist_ok=True)
73
+ scale = 6.0
74
+
75
+ for page_idx in scanned_pages:
76
+     page = doc.pageAtIndex_(page_idx)
77
+     media_box = page.boundsForBox_(Quartz.kCGPDFMediaBox)
78
+     pw = int(media_box.size.width * scale)
79
+     ph = int(media_box.size.height * scale)
80
+
81
+     cs = Quartz.CGColorSpaceCreateDeviceRGB()
82
+     ctx = Quartz.CGBitmapContextCreate(
83
+         None, pw, ph, 8, pw * 4, cs,
84
+         Quartz.kCGImageAlphaPremultipliedLast
85
+     )
86
+
87
+     Quartz.CGContextSetRGBFillColor(ctx, 1.0, 1.0, 1.0, 1.0)
88
+     Quartz.CGContextFillRect(ctx, Quartz.CGRectMake(0, 0, pw, ph))
89
+     Quartz.CGContextScaleCTM(ctx, scale, scale)
90
+     page.drawWithBox_toContext_(Quartz.kCGPDFMediaBox, ctx)
91
+
92
+     cg_img = Quartz.CGBitmapContextCreateImage(ctx)
93
+     out_path = os.path.join(output_dir, f"page_{page_idx+1}.png")
94
+     dest = Quartz.CGImageDestinationCreateWithURL(
95
+         Quartz.NSURL.fileURLWithPath_(out_path),
96
+         Quartz.kUTTypePNG, 1, None
97
+     )
98
+     Quartz.CGImageDestinationAddImage(dest, cg_img, None)
99
+     Quartz.CGImageDestinationFinalize(dest)
100
+     print(f"Rendered page {page_idx+1} -> {out_path} ({pw}x{ph})")
101
+ ```
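Since a PDF point is 1/72 inch, a scale of `s` corresponds to `72 * s` pixels per inch, and the RGBA bitmap context above needs `pw * ph * 4` bytes. The sketch below estimates both before rendering; `render_geometry` is a hypothetical helper, and the A4 size of 595×842 points is the approximation quoted in this reference.

```python
POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def render_geometry(width_pt, height_pt, scale):
    """Estimate pixel size, DPI, and bitmap bytes for one rendered page."""
    pw = int(width_pt * scale)  # same rounding as the render loop
    ph = int(height_pt * scale)
    return {
        "width_px": pw,
        "height_px": ph,
        "dpi": POINTS_PER_INCH * scale,
        "bytes": pw * ph * 4,  # 8 bits per component, 4 components (RGBA)
    }


a4_at_6x = render_geometry(595, 842, 6.0)
print(a4_at_6x)
```

For an A4 page at 6x this comes to a 3570×5052 bitmap of about 72 MB, which is worth keeping in mind when a document has many scanned pages.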
102
+
103
+ ## Step 4: Crop a Region (Optional)
104
+
105
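+ When only part of a page is relevant (e.g., one table), crop the rendered image. The rectangle passed to `CGImageCreateWithImageInRect` is in image pixel coordinates with the origin (0,0) at the **top-left**; this differs from Quartz drawing contexts and PDF page space, whose origin is at the bottom-left.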
106
+
107
+ ```python
108
+ # After rendering to cg_img (before saving):
109
+ # crop_rect uses pixel-space coordinates (PDF points × scale)
110
+ crop_rect = Quartz.CGRectMake(x * scale, y * scale, width * scale, height * scale)
111
+ cg_img = Quartz.CGImageCreateWithImageInRect(cg_img, crop_rect)
112
+ ```
113
+
114
+ **Finding coordinates**: render the full page first, view the image, estimate the crop rectangle in PDF points (1 point = 1/72 inch; A4 ≈ 595×842 points), then multiply by `scale`.
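The points-to-pixels conversion described above can be sketched without Quartz. `crop_rect_pixels` is a hypothetical helper; it assumes `y_pt` is measured from the top edge of the rendered image and clamps the box to the image bounds, which the snippet in this section does not do.

```python
def crop_rect_pixels(x_pt, y_pt, w_pt, h_pt, scale, image_w, image_h):
    """Convert a crop box in PDF points to a clamped pixel rectangle (x, y, w, h)."""
    x0 = max(0, int(x_pt * scale))
    y0 = max(0, int(y_pt * scale))
    x1 = min(image_w, int((x_pt + w_pt) * scale))
    y1 = min(image_h, int((y_pt + h_pt) * scale))
    if x1 <= x0 or y1 <= y0:
        raise ValueError("crop box lies outside the rendered image")
    return x0, y0, x1 - x0, y1 - y0


# A4 page rendered at 6x is 3570 x 5052 pixels.
inside = crop_rect_pixels(50, 100, 400, 200, 6.0, 3570, 5052)
clamped = crop_rect_pixels(500, 800, 200, 100, 6.0, 3570, 5052)
print(inside)
print(clamped)
```

The resulting tuple can be passed to `Quartz.CGRectMake(*rect)` before calling `CGImageCreateWithImageInRect`.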
115
+
116
+ ## Section Keyword Patterns
117
+
118
+ The script's auto section scan uses these regex patterns. Extend them when encountering new document genres (e.g., loan agreements, trust deeds):
119
+
120
+ | section_key | Pattern (case-insensitive) |
121
+ |-------------------------|-------------------------------------------------------------------------------------------------|
122
+ | cash_value_table | `保[證证]現?金價值表` / `Guaranteed Cash Value` / `Cash Value Table` |
123
+ | benefit_illustration | `建[議议]書摘要` / `Benefit Illustration` / `Illustration Summary` / `利益[說说]明` |
124
+ | surrender_value | `退保價值` / `退保价值` / `Surrender Value` |
125
+ | non_guaranteed | `非保[證证]` / `Non-Guaranteed` |
126
+ | projected_values | `預計` / `预计` / `Projected` / `Estimated` |
127
+ | special_bonus | `特別紅利` / `特别红利` / `Special Bonus` / `Reversionary Bonus` / `Terminal Bonus` |
128
+ | participating_plan | `分紅(計劃\|保單)?` / `分红(计划\|保单)?` / `Participating` / `With-Profits` |
129
+ | death_benefit | `身故保[障賠]` / `Death Benefit` |
130
+ | critical_illness | `嚴重疾病` / `严重疾病` / `危疾` / `Critical Illness` |
131
+ | premium_payment | `繳[費付]` / `缴[费付]` / `Premium Payment` / `Payment Schedule` |
132
+ | policy_terms | `保單條款` / `保单条款` / `Policy Provisions` / `Terms and Conditions` |
133
+ | endorsement | `批[註注]` / `Endorsement` / `Rider` |
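As a sketch of extending the patterns for a new genre, the example below applies a reduced pattern list to a few page strings in the same case-insensitive way. The `loan_agreement` entry, the `scan` helper, and the sample pages are hypothetical and not part of the bundled script.

```python
import re

SECTION_PATTERNS = [
    ("surrender_value", r"退保價值|退保价值|Surrender\s+Value"),
    ("non_guaranteed", r"非保[證证]|Non[-\s]?Guaranteed"),
    # Hypothetical addition for a new document genre.
    ("loan_agreement", r"貸款協議|贷款协议|Loan\s+Agreement"),
]


def scan(pages):
    """Map each section key to the 1-indexed pages whose text matches its pattern."""
    index = {key: [] for key, _ in SECTION_PATTERNS}
    for page_number, text in enumerate(pages, start=1):
        for key, pattern in SECTION_PATTERNS:
            if re.search(pattern, text, re.IGNORECASE):
                index[key].append(page_number)
    return index


# Made-up page texts; page 3 is empty, like a scanned page.
pages = [
    "This LOAN agreement is made between the parties.",
    "Non-guaranteed bonus and surrender value projections",
    "",
]
index = scan(pages)
print(index)
```

Broad patterns such as `Projected` or `Estimated` will match many pages, so a new entry is most useful when its keywords are specific to the section heading.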
134
+
135
+ ## Warning Triggers
136
+
137
+ The script emits warnings when the detected section pattern suggests missing documentation:
138
+
139
+ 1. **`participating_plan` present + `benefit_illustration` absent** → the illustration is often a separate document. Ask the user / agent to provide it.
140
+ 2. **`cash_value_table` present + `participating_plan` present + `surrender_value` absent** → guaranteed cash value alone understates expected return for a participating policy.
141
+ 3. **`non_guaranteed` referenced + `benefit_illustration` absent** → projection data likely in a separate proposal/illustration document.
@@ -9,22 +9,44 @@ Options:
9
9
  --render-scanned Render scanned/blank pages as PNG images for OCR
10
10
  --output-dir DIR Directory for rendered page images (default: /tmp/pdf_rendered/)
11
11
  --scale N Render scale multiplier (default: 6.0)
12
+ --no-section-scan Disable section keyword scan (on by default)
12
13
 
13
14
  Output:
14
15
  - Prints extracted text to stdout, page by page
15
16
  - If --render-scanned is set, saves scanned pages as PNGs to output_dir
16
- - Prints a JSON summary to stderr with page counts and scanned page indices
17
+ - Prints a JSON summary to stderr with page counts, scanned pages,
18
+ section_index (page locations of key financial sections), and warnings.
17
19
  """
18
20
 
19
21
  import sys
20
22
  import os
23
+ import re
21
24
  import json
22
25
  import argparse
23
26
 
24
27
  import Quartz
25
28
 
26
29
 
27
- def extract_text(pdf_path, render_scanned=False, output_dir=None, scale=6.0):
30
+ # Section keyword patterns for policy / financial document detection.
31
+ # Each entry: (section_key, regex_pattern). Patterns are case-insensitive
32
+ # and cover both Traditional / Simplified Chinese and English variants.
33
+ SECTION_PATTERNS = [
34
+ ("cash_value_table", r"保[證证]現?金價值表|保证现金价值表|Guaranteed\s+Cash\s+Value|Cash\s+Value\s+Table"),
35
+ ("benefit_illustration", r"建[議议]書摘要|建议书摘要|建[議议]書|建议书|Benefit\s+Illustration|Illustration\s+Summary|Sales\s+Illustration|利益[說说]明"),
36
+ ("surrender_value", r"退保價值|退保价值|Surrender\s+Value"),
37
+ ("non_guaranteed", r"非保[證证]|Non[-\s]?Guaranteed"),
38
+ ("projected_values", r"預計|预计|Projected|Estimated"),
39
+ ("special_bonus", r"特別紅利|特别红利|Special\s+Bonus|Reversionary\s+Bonus|Terminal\s+Bonus"),
40
+ ("participating_plan", r"分紅(?:計劃|保[單单])?|分红(?:计划|保[單单])?|Participating|With[-\s]?Profits"),
41
+ ("death_benefit", r"身故保[障賠]|Death\s+Benefit"),
42
+ ("critical_illness", r"嚴重疾病|严重疾病|危疾|Critical\s+Illness"),
43
+ ("premium_payment", r"繳[費付]|缴[费付]|Premium\s+Payment|Payment\s+Schedule"),
44
+ ("policy_terms", r"保[單单]條款|保[單单]条款|Policy\s+Provisions|Terms\s+and\s+Conditions"),
45
+ ("endorsement", r"批[註注]|Endorsement|Rider"),
46
+ ]
47
+
48
+
49
+ def extract_text(pdf_path, render_scanned=False, output_dir=None, scale=6.0, section_scan=True):
28
50
  url = Quartz.NSURL.fileURLWithPath_(pdf_path)
29
51
  doc = Quartz.PDFDocument.alloc().initWithURL_(url)
30
52
 
@@ -35,10 +57,12 @@ def extract_text(pdf_path, render_scanned=False, output_dir=None, scale=6.0):
35
57
  page_count = doc.pageCount()
36
58
  pages_text = []
37
59
  scanned_pages = []
60
+ all_page_texts = {} # page_idx -> text, for section scan
38
61
 
39
62
  for i in range(page_count):
40
63
  page = doc.pageAtIndex_(i)
41
64
  text = page.string() if page else ""
65
+ all_page_texts[i] = text
42
66
 
43
67
  if len(text.strip()) < 10:
44
68
  scanned_pages.append(i)
@@ -62,13 +86,78 @@ def extract_text(pdf_path, render_scanned=False, output_dir=None, scale=6.0):
62
86
  "scanned_pages": scanned_pages,
63
87
  "scanned_count": len(scanned_pages),
64
88
  }
65
- print(f"\n[Summary] {summary['text_pages']} text pages, {summary['scanned_count']} scanned pages: {scanned_pages}", file=sys.stderr)
66
- json.dump(summary, sys.stderr)
89
+
90
+     if section_scan:
91
+         section_index, warnings = _scan_sections(all_page_texts)
92
+         summary["section_index"] = section_index
93
+         summary["warnings"] = warnings
94
+
95
+         print(f"\n[Summary] {summary['text_pages']} text pages, {summary['scanned_count']} scanned pages: {scanned_pages}", file=sys.stderr)
96
+         print("[Section Index] (1-indexed page numbers)", file=sys.stderr)
97
+         for key, pages in section_index.items():
98
+             if pages:
99
+                 print(f" {key}: {pages}", file=sys.stderr)
100
+         if warnings:
101
+             print("[WARNINGS]", file=sys.stderr)
102
+             for w in warnings:
103
+                 print(f" ! {w}", file=sys.stderr)
104
+     else:
105
+         print(f"\n[Summary] {summary['text_pages']} text pages, {summary['scanned_count']} scanned pages: {scanned_pages}", file=sys.stderr)
106
+
107
+     json.dump(summary, sys.stderr, ensure_ascii=False)
67
108
  print("", file=sys.stderr)
68
109
 
69
110
  return summary
70
111
 
71
112
 
113
+ def _scan_sections(all_page_texts):
114
+     """Scan all pages for known financial document section keywords.
115
+
116
+     Returns (section_index, warnings):
117
+         section_index: {section_key: [1-indexed page numbers where matched]}
118
+         warnings: list of human-readable warnings about coverage gaps
119
+     """
120
+     section_index = {key: [] for key, _ in SECTION_PATTERNS}
121
+
122
+     for page_idx, text in all_page_texts.items():
123
+         if not text:
124
+             continue
125
+         for key, pattern in SECTION_PATTERNS:
126
+             if re.search(pattern, text, re.IGNORECASE):
127
+                 section_index[key].append(page_idx + 1)
128
+
129
+     warnings = []
130
+
131
+     # Participating policy without Benefit Illustration → likely missing the key document
132
+     if section_index["participating_plan"] and not section_index["benefit_illustration"]:
133
+         warnings.append(
134
+             "Participating/with-profits policy detected, but no Benefit Illustration "
135
+             "section found. The illustration (with non-guaranteed projections) is often "
136
+             "a separate document — ask the user / agent to provide it before drawing "
137
+             "conclusions about surrender/maturity values."
138
+         )
139
+
140
+     # Cash value table present but no surrender value section → for participating plans,
141
+     # guaranteed cash value alone understates the true expected return.
142
+     if (section_index["cash_value_table"]
143
+             and section_index["participating_plan"]
144
+             and not section_index["surrender_value"]):
145
+         warnings.append(
146
+             "Cash Value Table found but no 'Surrender Value' / 退保價值 section. "
147
+             "For a participating policy, 'Guaranteed Cash Value' ≠ 'Expected Surrender Value' "
148
+             "(the latter includes non-guaranteed dividends). Verify all value tables were located."
149
+         )
150
+
151
+     # Non-guaranteed mentioned but no Benefit Illustration → projection data probably elsewhere
152
+     if section_index["non_guaranteed"] and not section_index["benefit_illustration"]:
153
+         warnings.append(
154
+             "Non-guaranteed amounts referenced but no Benefit Illustration found — "
155
+             "projected values may be in a separate proposal/illustration document."
156
+         )
157
+
158
+     return section_index, warnings
159
+
160
+
72
161
  def _render_page(page, page_idx, output_dir, scale):
73
162
  media_box = page.boundsForBox_(Quartz.kCGPDFMediaBox)
74
163
  pw = int(media_box.size.width * scale)
@@ -107,6 +196,9 @@ if __name__ == "__main__":
107
196
  help="Output directory for rendered images")
108
197
  parser.add_argument("--scale", type=float, default=6.0,
109
198
  help="Scale multiplier for rendering (default: 6.0)")
199
+ parser.add_argument("--no-section-scan", action="store_true",
200
+ help="Disable section keyword scan (on by default)")
110
201
  args = parser.parse_args()
111
202
 
112
- extract_text(args.pdf_path, args.render_scanned, args.output_dir, args.scale)
203
+ extract_text(args.pdf_path, args.render_scanned, args.output_dir, args.scale,
204
+ section_scan=not args.no_section_scan)
@@ -1,5 +1,5 @@
1
1
  {
2
2
  "name": "pdf-extract",
3
- "version": "0.1.0",
4
- "description": ""
3
+ "version": "0.2.0",
4
+ "description": "Extract text from PDFs and save as clean markdown on macOS via PDFKit + Quartz. Auto-detects scanned pages, renders them for vision OCR, and scans for financial-document sections (cash value, benefit illustration, surrender value, non-guaranteed) so participating-policy data isn't missed."
5
5
  }
@@ -0,0 +1,108 @@
1
+ ---
2
+ name: smzdm-picks
3
+ description: Fetch personalized 什么值得买 (smzdm.com) deal picks from the user's already-logged-in Chrome session. Trigger when the user says "smzdm", "什么值得买", "今日好价", "smzdm 精选", "查一下值得买", "好价推荐", "smzdm picks", "zdm", "今天有什么好价", "值得买推荐", "smzdm 推荐", or asks to see today's curated deals / discounts / 优惠 / 特价 on smzdm. Drives Chrome via AppleScript because smzdm has anti-bot protection; uses the user's existing login for personalized content.
4
+ ---
5
+
6
+ # smzdm-picks
7
+
8
+ Pull today's personalized curated deals (个性化好价) from 什么值得买 by driving the user's already-logged-in Chrome via AppleScript + JS injection.
9
+
10
+ ## Why AppleScript
11
+
12
+ smzdm.com is protected by an Akamai-style JS challenge: plain `curl` returns HTTP 202 with a `probe.js` and no content. The only way to get the rendered, personalized feed is through a real browser that's already authenticated. The user has Chrome logged in, so this skill steers Chrome itself.
13
+
14
+ ## One-time setup (per Mac)
15
+
16
+ **This skill is per-machine** — each Mac needs its own setup because Chrome's cookie store and AppleScript permission are local. When the user mentions running this on a new Mac, or this skill triggers on a machine for the first time, walk them through this checklist.
17
+
18
+ Use the built-in self-check command to verify each step:
19
+ ```bash
20
+ bash ~/.claude/skills/smzdm-picks/scripts/fetch.sh --check
21
+ ```
22
+
23
+ It prints ✓/✗ for each requirement and pinpoints what's missing. Manual checklist if you prefer step-by-step:
24
+
25
+ 1. **Open Chrome** and confirm it stays open in the background.
26
+ 2. **Enable AppleScript JS API**: Chrome menu → View → Developer → **Allow JavaScript from Apple Events** ✓
27
+ - If the "Developer" submenu is hidden: Chrome → Settings → search for "developer menu" → enable.
28
+ - In Chinese Chrome: 查看 → 开发者 → 允许 Apple 事件中的 JavaScript.
29
+ 3. **Grant Automation permission (first run only)**: when the script runs for the first time on a new Mac, macOS pops up *"Terminal (or iTerm/Claude) wants to control 'Google Chrome'"* — click **OK**. To verify or fix later: System Settings → Privacy & Security → Automation → expand your terminal app → ensure **Google Chrome ✓** is checked.
30
+ 4. **Log into smzdm.com in this Chrome**: open https://www.smzdm.com/ and sign in. Cookies are per-profile and don't sync from other machines.
31
+ 5. **Chrome variant**: if using Chrome Canary or Chrome Beta, set `CHROME_APP="Google Chrome Canary"` (or similar) in the environment — the default is `"Google Chrome"`.
32
+
33
+ If `fetch.sh` returns exit 4 (JS API rejected) → step 2 is the issue. Exit 6 (--check failed) → check the ✗ line. Exit 2 (Chrome not running) → step 1.
34
+
35
+ ## Workflow
36
+
37
+ 1. **New Mac?** If this is the first time on this machine (no successful prior run in this session, or the user mentions "另一台电脑"/"new computer"/"on this Mac"), run `bash ~/.claude/skills/smzdm-picks/scripts/fetch.sh --check` first and walk through any ✗ items per the Setup section above. Don't proceed to scrape until --check passes.
38
+ 2. Run: `bash ~/.claude/skills/smzdm-picks/scripts/fetch.sh [target]`
39
+ - No arg → homepage (`/`) — the personalized recommendation feed
40
+ - `jingxuan` → `/jingxuan/` — curated picks
41
+ - `faxian` → `faxian.smzdm.com` — pure deals stream
42
+ 3. The script briefly opens a tab in the user's front Chrome window, waits 6s for JS, extracts items via DOM selectors, closes the tab, restores focus.
43
+ 4. Parse the JSON on stdout. Read stderr diagnostics for any warnings (logged_in flag, item count).
44
+ 5. Curate and render to the user (see Curation Rules + Output Format below).
45
+
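Steps 2 and 4 of the workflow can be sketched as a small wrapper. This is an illustration under assumptions, not part of the package: `run_fetch` and `parse_fetch_output` are hypothetical names, and the script path is the one quoted in the workflow. The JSON keys (`items`, `logged_in`, `total_candidates`) are the ones `fetch.sh` emits.

```python
import json
import os
import subprocess

# Path as quoted in the workflow; adjust if the skill is installed elsewhere.
FETCH_SH = os.path.expanduser("~/.claude/skills/smzdm-picks/scripts/fetch.sh")


def parse_fetch_output(stdout_text):
    """Parse the JSON document that fetch.sh prints on stdout."""
    data = json.loads(stdout_text)
    return {
        "items": data.get("items", []),
        "logged_in": bool(data.get("logged_in", False)),
        "total_candidates": data.get("total_candidates", 0),
    }


def run_fetch(target=None):
    """Hypothetical wrapper: returns (exit_code, parsed_or_None, stderr_text)."""
    cmd = ["bash", FETCH_SH] + ([target] if target else [])
    proc = subprocess.run(cmd, capture_output=True, text=True)
    parsed = parse_fetch_output(proc.stdout) if proc.returncode == 0 else None
    return proc.returncode, parsed, proc.stderr
```

Keeping stderr separate from the parsed stdout preserves the script's diagnostics for error reporting.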
46
+ ## Curation Rules
47
+
48
+ From the extracted ≤30 raw items, pick the **10 best** for display:
49
+
50
+ - **Must have** title + link. Items missing either are noise — skip.
51
+ - **Prefer items with a concrete price** (¥XXX) over vague ones ("低至"/"满减"). Sort priced items first.
52
+ - **Skip soft ads**: title containing 赞助 / 广告 / 软广 / 测评推荐.
53
+ - **Diversify sources**: avoid >3 items from the same e-commerce platform (京东/天猫/etc.) in the top 10.
54
+ - **Mark the standout**: prefix the top item with 🔥 if the `value` field (smzdm's 值得买/不值得 index) suggests strong consensus; otherwise use no prefix.
55
+
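The curation rules above can be sketched as a filter, a stable sort, and a per-source cap. This is one possible reading, not the skill's implementation: `curate` is a hypothetical function, and treating "concrete price" as "¥ followed by a digit" is an assumption. The 🔥 marker is left out because the rules do not define a threshold for the `value` field.

```python
import re

SOFT_AD_WORDS = ("赞助", "广告", "软广", "测评推荐")
# Assumption: a concrete price looks like "¥" followed by a digit.
CONCRETE_PRICE = re.compile(r"¥\s*\d")


def curate(items, limit=10, max_per_source=3):
    """Hypothetical curation pass over the raw items emitted by fetch.sh."""
    kept = []
    for item in items:
        title, link = item.get("title", ""), item.get("link", "")
        if not title or not link:
            continue  # must have both title and link
        if any(word in title for word in SOFT_AD_WORDS):
            continue  # skip soft ads
        kept.append(item)

    # Stable sort: concretely priced items first, feed order otherwise preserved.
    kept.sort(key=lambda it: 0 if CONCRETE_PRICE.search(it.get("price", "")) else 1)

    picked, per_source = [], {}
    for item in kept:
        source = item.get("source", "")
        # Items with an unknown (empty) source are not capped.
        if source and per_source.get(source, 0) >= max_per_source:
            continue
        per_source[source] = per_source.get(source, 0) + 1
        picked.append(item)
        if len(picked) == limit:
            break
    return picked
```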
56
+ ## Output format
57
+
58
+ Mobile-friendly markdown list, no tables. Match `daily-hunt` / `pulse` style:
59
+
60
+ ```markdown
61
+ # 🛒 smzdm 今日精选好价 · {today}
62
+
63
+ > 已为你筛选 {N} 条来自{个性化推荐|精选|发现}流的好价
64
+
65
+ 1. 🔥 **{标题}** · {商城} · {价格}
66
+ {短链}
67
+
68
+ 2. **{标题}** · {商城} · {价格}
69
+ {短链}
70
+
71
+ ...
72
+
73
+ — 数据来自 smzdm.com({personalized|curated}),共抓取 {total} 条,展示前 {N}
74
+ ```
75
+
76
+ Notes for rendering:
77
+ - Use full-width Chinese punctuation (,。:!) inside Chinese text.
78
+ - Keep each item to ≤ 2 lines; long titles wrap naturally.
79
+ - Link goes on its own line for tappability.
80
+ - If a tag is interesting (like 双 11 / PLUS 会员价 / 限时), append it inline: `· 京东 · ¥499 · 双11 价`.
81
+
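The template and rendering notes above could be implemented roughly as follows. `render` and its parameters are hypothetical; the string literals are copied from the template, and the choice of which item gets 🔥 is left to the caller.

```python
from datetime import date


def render(items, total, feed_label="个性化推荐", origin="personalized",
           today=None, hot_first=False):
    """Hypothetical renderer for the mobile-friendly list format."""
    today = today or date.today().isoformat()
    lines = [
        f"# 🛒 smzdm 今日精选好价 · {today}",
        "",
        f"> 已为你筛选 {len(items)} 条来自{feed_label}流的好价",
        "",
    ]
    for i, item in enumerate(items, start=1):
        flame = "🔥 " if (hot_first and i == 1) else ""
        parts = [f"**{item['title']}**"]
        # Source and price are optional; omit empty fields instead of printing blanks.
        parts += [item[k] for k in ("source", "price") if item.get(k)]
        lines.append(f"{i}. {flame}" + " · ".join(parts))
        lines.append(f"   {item['link']}")  # link on its own line for tappability
        lines.append("")
    lines.append(f"— 数据来自 smzdm.com({origin}),共抓取 {total} 条,展示前 {len(items)}")
    return "\n".join(lines)
```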
82
+ ## Error handling
83
+
84
+ | Symptom | Likely cause | Tell the user |
85
+ |---|---|---|
86
+ | Exit 2 — "Chrome not running" | Chrome process not found | "请先打开 Chrome 并确认 smzdm.com 已登录"(提示:Chrome Canary/Beta 需设置 `CHROME_APP` 环境变量) |
87
+ | Exit 4 — JS API rejected | Allow JS from Apple Events disabled | Walk through the one-time Setup above. Also check System Settings → Privacy & Security → Automation. |
88
+ | Exit 5 — empty result | Page load slow / DOM changed | "smzdm 页面没在 6 秒内加载完,请稍后重试" |
89
+ | Exit 6 — --check failed | New machine setup incomplete | Use the ✓/✗ output from `--check` to pinpoint which step is missing |
90
+ | `logged_in: false` in JSON | Chrome session expired on this Mac | "Chrome 里 smzdm 登录已过期 / 这台电脑还没登录过,请到浏览器登录后再试" |
91
+ | 0 items extracted | DOM selectors out of date | "smzdm 页面结构可能变了,需要更新 fetch.sh 里的 selector" |
92
+
93
+ Always show the **raw stderr** from the script when reporting an error — it has the exit code and hint.
94
+
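One way to apply the table above is a lookup from exit code to hint, always followed by the raw stderr. The mapping below paraphrases the table in English for illustration; `EXIT_HINTS` and `explain_failure` are hypothetical names, and the user-facing wording in the table remains the reference.

```python
# Exit codes as documented in scripts/fetch.sh; hint text is paraphrased.
EXIT_HINTS = {
    2: "Chrome is not running: open Chrome and confirm smzdm.com is logged in.",
    3: "osascript invocation failed.",
    4: "JS API rejected: enable 'Allow JavaScript from Apple Events' "
       "and check the Automation permission.",
    5: "Empty result: the page may not have loaded within 6 seconds, "
       "or the DOM changed.",
    6: "--check failed: use the ✓/✗ lines to find the missing setup step.",
}


def explain_failure(exit_code, stderr_text):
    """Return a hint for the exit code followed by the raw stderr."""
    hint = EXIT_HINTS.get(exit_code, f"Unexpected exit code {exit_code}.")
    return f"{hint}\n\n--- raw stderr ---\n{stderr_text.strip()}"
```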
95
+ ## Limitations
96
+
97
+ - **Per-machine setup**: this skill won't auto-work on a new Mac. Each Mac needs Chrome's JS-from-AppleEvents toggle, macOS Automation permission, and a fresh smzdm.com login. Run `--check` on new machines.
98
+ - **macOS-only**: relies on AppleScript + Chrome AppleScript bindings. Won't work on Linux/Windows without a port.
99
+ - **Briefly steals Chrome focus** during the 6s scrape (opens then closes a tab in the front window).
100
+ - **Cookie freshness**: smzdm cookies typically last 30+ days, but logging out, clearing cookies, or switching to a different Chrome profile invalidates the session.
101
+ - **DOM drift**: selectors will need updating as smzdm redesigns. If extraction returns 0 items repeatedly, update the `candidates` list in `scripts/fetch.sh`.
102
+ - **Read-only digest**: no price history, no alerts, no auto-purchasing.
103
+
104
+ ## Out of scope
105
+
106
+ - Scheduled push (cron/notifications) — manual trigger only.
107
+ - Multi-day trend tracking — stateless per invocation.
108
+ - Non-personalized deals (public RSS) fallback: if login breaks, fix the login; don't degrade silently.
@@ -0,0 +1,296 @@
1
+ #!/usr/bin/env bash
2
+ # fetch.sh — Drive the user's already-logged-in Chrome via AppleScript to
3
+ # scrape smzdm.com personalized feed and output items as JSON on stdout.
4
+ #
5
+ # Usage:
6
+ # bash fetch.sh # scrape homepage (/)
7
+ # bash fetch.sh jingxuan # scrape 精选 page (/jingxuan/)
8
+ # bash fetch.sh faxian # scrape 好价 page (faxian.smzdm.com)
9
+ # bash fetch.sh --check # setup self-check (no scrape, just verify env)
10
+ #
11
+ # Environment:
12
+ # CHROME_APP Chrome app name (default: "Google Chrome"). Set to
13
+ # "Google Chrome Canary" / "Google Chrome Beta" if needed.
14
+ #
15
+ # Exit codes:
16
+ # 0 success — JSON written to stdout (or all checks passed in --check mode)
17
+ # 2 Chrome not running
18
+ # 3 osascript invocation failed
19
+ # 4 Chrome JS API rejected (Allow JS from Apple Events likely disabled)
20
+ # 5 empty / unparseable result
21
+ # 6 --check found a problem (details on stderr)
22
+ #
23
+ # Diagnostics go to stderr. JSON goes to stdout.
24
+
25
+ set -uo pipefail
26
+
27
+ CHROME_APP="${CHROME_APP:-Google Chrome}"
28
+ TARGET="${1:-home}"
29
+ CHECK_MODE=0
30
+
31
+ case "$TARGET" in
32
+ home) URL="https://www.smzdm.com/" ;;
33
+ jingxuan) URL="https://www.smzdm.com/jingxuan/" ;;
34
+ faxian) URL="https://faxian.smzdm.com/" ;;
35
+ --check) CHECK_MODE=1; URL="https://www.smzdm.com/" ;;
36
+ *) URL="$TARGET" ;; # allow passing full URL
37
+ esac
38
+
39
+ # 1. Chrome must be running
40
+ if ! pgrep -x "$CHROME_APP" > /dev/null; then
41
+ echo "ERROR: $CHROME_APP is not running. Open it and make sure smzdm.com is logged in, then retry." >&2
42
+ echo "(If you use Chrome Canary/Beta, set: CHROME_APP=\"Google Chrome Canary\" bash fetch.sh)" >&2
43
+ exit 2
44
+ fi
45
+
46
+ # --check mode: run a trivial JS probe instead of full scrape, report each setup step
47
+ if [ "$CHECK_MODE" = "1" ]; then
48
+ echo "[smzdm-picks --check] running setup verification..." >&2
49
+ echo " ✓ $CHROME_APP is running" >&2
50
+
51
+ PROBE=$(osascript <<OSAEOF 2>&1
52
+ tell application "$CHROME_APP"
53
+ if (count of windows) is 0 then return "__OSAERROR__:0:no windows"
54
+ try
55
+ set probe to execute (active tab of front window) javascript "1+1"
56
+ return "OK:" & probe
57
+ on error errMsg number errNum
58
+ return "__OSAERROR__:" & errNum & ":" & errMsg
59
+ end try
60
+ end tell
61
+ OSAEOF
62
+ )
63
+ case "$PROBE" in
64
+ OK:2)
65
+ echo " ✓ AppleScript→Chrome JS injection works (View → Developer → Allow JavaScript from Apple Events is enabled)" >&2
66
+ ;;
67
+ __OSAERROR__:*)
68
+ echo " ✗ AppleScript→Chrome JS injection rejected: $PROBE" >&2
69
+ echo " Fix: $CHROME_APP menu → View → Developer → Allow JavaScript from Apple Events ✓" >&2
70
+ echo " (If 'Developer' is hidden: Chrome → Settings → search 'developer menu' → enable)" >&2
71
+ echo " Also check System Settings → Privacy & Security → Automation: allow your terminal to control $CHROME_APP" >&2
72
+ exit 6
73
+ ;;
74
+ *)
75
+ echo " ✗ Unexpected probe response: $PROBE" >&2
76
+ exit 6
77
+ ;;
78
+ esac
79
+
80
+ # Login check: visit smzdm and look for login markers
81
+ LOGIN_PROBE=$(osascript <<OSAEOF 2>&1
82
+ tell application "$CHROME_APP"
83
+ set originalIndex to active tab index of front window
84
+ set t to make new tab at end of tabs of front window with properties {URL:"https://www.smzdm.com/"}
85
+ delay 5
86
+ set logged to "false"
87
+ try
88
+ set logged to execute t javascript "(document.cookie.indexOf('sess=')>=0||document.cookie.indexOf('user=')>=0||!!document.querySelector('[class*=\"user-info\"],[class*=\"nickname\"],[class*=\"avatar\"]'))+''"
89
+ end try
90
+ try
91
+ close t
92
+ end try
93
+ try
94
+ set active tab index of front window to originalIndex
95
+ end try
96
+ return logged
97
+ end tell
98
+ OSAEOF
99
+ )
100
+ if [ "$LOGIN_PROBE" = "true" ]; then
101
+ echo " ✓ smzdm.com login state detected in Chrome" >&2
102
+ echo "" >&2
103
+ echo "All checks passed. Run 'bash fetch.sh' to scrape." >&2
104
+ exit 0
105
+ else
106
+ echo " ✗ smzdm.com not logged in (probe returned: $LOGIN_PROBE)" >&2
107
+ echo " Fix: open https://www.smzdm.com/ in $CHROME_APP and log in, then retry." >&2
108
+ exit 6
109
+ fi
110
+ fi
111
+
112
+ # 2. Write JS extractor to a tmpfile (avoids AppleScript string escaping hell)
113
+ JS_TMP=$(mktemp -t smzdm_extract.XXXXXX)
114
+ mv "$JS_TMP" "${JS_TMP}.js"
115
+ JS_TMP="${JS_TMP}.js"
116
+ trap 'rm -f "$JS_TMP"' EXIT
117
+
118
+ cat > "$JS_TMP" <<'JSEOF'
119
+ (function () {
120
+ function txt(el) {
121
+ return el ? (el.innerText || el.textContent || '').trim() : '';
122
+ }
123
+
124
+ function findPrice(root) {
125
+ // Look for leaf elements containing ¥ / 元 / 售价 / 价格 / 低至, usually the price chip
126
+ const all = root.querySelectorAll('*');
127
+ for (const n of all) {
128
+ if (n.children.length > 0) continue;
129
+ const t = txt(n);
130
+ if (t.length > 0 && t.length < 40 && /(¥|元|售价|价格|低至)/.test(t)) {
131
+ return t;
132
+ }
133
+ }
134
+ return '';
135
+ }
136
+
137
+ function findSource(root) {
138
+ const re = /^(京东|天猫|淘宝|拼多多|官网|苏宁|亚马逊|网易严选|得物|抖音|京东国际|京东自营|天猫超市|山姆|Costco|考拉|唯品会)$/;
139
+ const all = root.querySelectorAll('*');
140
+ for (const n of all) {
141
+ const t = txt(n);
142
+ if (t.length > 0 && t.length < 20 && re.test(t)) return t;
143
+ }
144
+ return '';
145
+ }
146
+
147
+ function findTags(root) {
148
+ return [...root.querySelectorAll('[class*="tag"], [class*="Tag"], [class*="label"]')]
149
+ .map(n => txt(n))
150
+ .filter(t => t && t.length < 20)
151
+ .slice(0, 5);
152
+ }
153
+
154
+ // Try multiple selector strategies — smzdm changes class names periodically
155
+ const candidates = document.querySelectorAll(
156
+ '[class*="feed-row"], [class*="feed-block"], [data-feed-id], article, [class*="z-feed"], [class*="feed-item"]'
157
+ );
158
+
159
+ const items = [];
160
+ const seen = new Set();
161
+
162
+ for (const el of candidates) {
163
+ // Title
164
+ const titleEl = el.querySelector('[class*="title"], h5, h3, h4, h2, a[title]');
165
+ let title = txt(titleEl);
166
+ if (!title && titleEl) title = titleEl.getAttribute('title') || '';
167
+ if (!title || title.length < 4) continue;
168
+
169
+ // Link — prefer post/faxian links, fall back to any href
170
+ const linkEl = el.querySelector(
171
+ 'a[href*="//post.smzdm.com/"], a[href*="//faxian.smzdm.com/"], a[href*="//www.smzdm.com/p/"], a[href]'
172
+ );
173
+ let link = linkEl ? linkEl.href : '';
174
+ if (!link || link === window.location.href) continue;
175
+ if (seen.has(link)) continue;
176
+ seen.add(link);
177
+
178
+ const price = findPrice(el);
179
+ const source = findSource(el);
180
+ const tags = findTags(el);
181
+
182
+ // Hot/value score if present
183
+ const valueEl = el.querySelector('[class*="value"], [class*="zhi"], [class*="hot"]');
184
+ const value = txt(valueEl);
185
+
186
+ items.push({ title, price, link, source, tags, value });
187
+ }
188
+
189
+ // Login detection — multiple heuristics
190
+ const loggedIn = !!(
191
+ document.querySelector('[class*="user-info"]') ||
192
+ document.querySelector('[class*="nickname"]') ||
193
+ document.querySelector('[class*="userpic"]') ||
194
+ document.querySelector('[class*="avatar"]') ||
195
+ document.cookie.includes('user=')
196
+ );
197
+
198
+ return JSON.stringify({
199
+ items: items.slice(0, 30),
200
+ total_candidates: candidates.length,
201
+ logged_in: loggedIn,
202
+ url: location.href,
203
+ title: document.title,
204
+ ts: new Date().toISOString()
205
+ });
206
+ })();
207
+ JSEOF
208
+
209
+ # 3. Drive Chrome via AppleScript
210
+ RESULT=$(osascript <<OSAEOF 2>&1
211
+ set jsFile to POSIX file "$JS_TMP"
212
+ set jsCode to read jsFile as «class utf8»
213
+
214
+ tell application "$CHROME_APP"
215
+ if (count of windows) is 0 then
216
+ make new window
217
+ end if
218
+
219
+ -- Remember which tab was active so we can restore focus
220
+ set originalIndex to active tab index of front window
221
+
222
+ -- Open target in a new tab
223
+ set newTab to make new tab at end of tabs of front window with properties {URL:"$URL"}
224
+
225
+ -- Wait for JS-rendered content
226
+ delay 6
227
+
228
+ set extractedJSON to ""
229
+ try
230
+ set extractedJSON to execute newTab javascript jsCode
231
+ on error errMsg number errNum
232
+ set extractedJSON to "__OSAERROR__:" & errNum & ":" & errMsg
233
+ end try
234
+
235
+ -- Close the scrape tab
236
+ try
237
+ close newTab
238
+ end try
239
+
240
+ -- Restore focus
241
+ try
242
+ set active tab index of front window to originalIndex
243
+ end try
244
+
245
+ return extractedJSON
246
+ end tell
247
+ OSAEOF
248
+ )
249
+
250
+ OSA_EXIT=$?
251
+
252
+ # 4. Handle osascript failures
253
+ if [ $OSA_EXIT -ne 0 ]; then
254
+ echo "ERROR: osascript invocation failed (exit $OSA_EXIT)." >&2
255
+ echo "Raw output: $RESULT" >&2
256
+ exit 3
257
+ fi
258
+
259
+ # 5. Detect JS API rejection
260
+ if [[ "$RESULT" == __OSAERROR__:* ]]; then
261
+ echo "ERROR: Chrome rejected the JS injection: $RESULT" >&2
262
+ echo "" >&2
263
+ echo "HINT (most common cause): enable the JS-from-AppleEvents toggle in Chrome:" >&2
264
+ echo " Chrome menu → View / 查看 → Developer / 开发者 → Allow JavaScript from Apple Events / 允许 Apple 事件中的 JavaScript ✓" >&2
265
+ echo " (If 'Developer' submenu is hidden: Chrome → Settings → search 'developer menu' → enable)" >&2
266
+ exit 4
267
+ fi
268
+
269
+ # 6. Empty result?
270
+ if [[ -z "$RESULT" || "$RESULT" == "missing value" ]]; then
271
+ echo "ERROR: Empty result from Chrome JS. Possible causes:" >&2
272
+ echo " - 'Allow JavaScript from Apple Events' is not enabled (see Chrome → View → Developer)" >&2
273
+ echo " - Page didn't finish loading in 6s (try again)" >&2
274
+ echo " - smzdm DOM structure changed" >&2
275
+ exit 5
276
+ fi
277
+
278
+ # 7. Output JSON to stdout
279
+ echo "$RESULT"
280
+
281
+ # 8. Diagnostics to stderr
282
+ python3 - "$RESULT" "$URL" <<'PY'
283
+ import sys, json
284
+ try:
285
+ d = json.loads(sys.argv[1]) # JSON passed as argv, not spliced into the source
286
+ items = d.get('items', [])
287
+ logged = d.get('logged_in', '?')
288
+ cand = d.get('total_candidates', '?')
289
+ sys.stderr.write(f"[smzdm-picks] target={sys.argv[2]} items={len(items)} candidates={cand} logged_in={logged}\n")
290
+ if not logged:
291
+ sys.stderr.write("WARNING: Not logged in — feed may be generic, not personalized. Re-login in Chrome.\n")
292
+ if len(items) == 0:
293
+ sys.stderr.write("WARNING: Zero items extracted. Page DOM likely changed; check the extractor selectors.\n")
294
+ except Exception as e:
295
+ sys.stderr.write(f"[smzdm-picks] (could not parse result for diagnostics: {e})\n")
296
+ PY
@@ -0,0 +1,5 @@
1
+ {
2
+ "name": "smzdm-picks",
3
+ "version": "0.1.0",
4
+ "description": "Fetch personalized 什么值得买 (smzdm.com) deal picks from the user's already-logged-in Chrome on macOS. Drives Chrome via AppleScript + JS injection to bypass anti-bot. Per-machine setup; includes --check mode for new-Mac self-verification."
5
+ }