ethan-agent-skills 0.1.1 → 0.2.0
- package/README.md +94 -2
- package/package.json +1 -1
- package/skills/pdf-extract/SKILL.md +42 -160
- package/skills/pdf-extract/references/manual-implementation.md +141 -0
- package/skills/pdf-extract/scripts/pdf_extract.py +97 -5
- package/skills/pdf-extract/skill.json +2 -2
- package/skills/smzdm-picks/SKILL.md +108 -0
- package/skills/smzdm-picks/scripts/fetch.sh +296 -0
- package/skills/smzdm-picks/skill.json +5 -0
package/README.md
CHANGED
|
@@ -25,8 +25,8 @@ npx -y ethan-agent-skills@latest update --target /tmp/test-skills
|
|
|
25
25
|
npx -y ethan-agent-skills@latest list
|
|
26
26
|
```
|
|
27
27
|
|
|
28
|
-
Current common bundled skills include `pdf-extract`, `skill-evolution`,
|
|
29
|
-
`fix-my-life`, plus the client-specific OpenSpec OPSX skills.
|
|
28
|
+
Current common bundled skills include `pdf-extract`, `skill-evolution`,
|
|
29
|
+
`fix-my-life`, and `smzdm-picks`, plus the client-specific OpenSpec OPSX skills.
|
|
30
30
|
|
|
31
31
|
The updater writes `.skills-lock.json` in each target skill root and only
|
|
32
32
|
rewrites skills managed by this package. If a destination skill directory exists
|
|
@@ -169,6 +169,98 @@ The workflow can also be started manually from the GitHub Actions tab, but tag
|
|
|
169
169
|
pushes are the preferred release path because they tie npm versions to Git
|
|
170
170
|
history.
|
|
171
171
|
|
|
172
|
+
### Importing Existing Local Skills
|
|
173
|
+
|
|
174
|
+
Use this flow when a useful skill already exists under a local agent directory
|
|
175
|
+
such as `~/.claude/skills/<skill-dir>` and should become part of this package.
|
|
176
|
+
|
|
177
|
+
1. Choose the package destination:
|
|
178
|
+
- Put cross-client skills in `skills/<skill-dir>`.
|
|
179
|
+
- Put Claude-only skills in `claude/skills/<skill-dir>`.
|
|
180
|
+
- Put Codex-only skills in `codex/skills/<skill-dir>`.
|
|
181
|
+
- Put source-command or agent-only skills in `agents/skills/<skill-dir>`.
|
|
182
|
+
2. Inspect the source before copying:
|
|
183
|
+
|
|
184
|
+
```bash
|
|
185
|
+
SOURCE="$HOME/.claude/skills/<skill-dir>"
|
|
186
|
+
rg --files -uu "$SOURCE"
|
|
187
|
+
rg -n "(secret|password|token|api[_-]?key|PRIVATE|sk-[A-Za-z0-9])" "$SOURCE"
|
|
188
|
+
```
|
|
189
|
+
|
|
190
|
+
Do not copy real `.env` files, private keys, generated caches, or unrelated
|
|
191
|
+
local state. Placeholder files such as `.env.example` are fine.
|
|
192
|
+
|
|
193
|
+
3. Copy the skill into the package:
|
|
194
|
+
|
|
195
|
+
```bash
|
|
196
|
+
DEST="skills/<skill-dir>"
|
|
197
|
+
rm -rf "$DEST"
|
|
198
|
+
cp -R "$SOURCE" "$DEST"
|
|
199
|
+
find "$DEST" -type d -name "__pycache__" -prune -exec rm -rf {} +
|
|
200
|
+
find "$DEST" -name "*.pyc" -delete
|
|
201
|
+
```
|
|
202
|
+
|
|
203
|
+
4. Add or update `skill.json` beside `SKILL.md`:
|
|
204
|
+
|
|
205
|
+
```json
|
|
206
|
+
{
|
|
207
|
+
"name": "<trigger-or-display-name>",
|
|
208
|
+
"version": "0.1.0",
|
|
209
|
+
"description": "Short description used by the package list command."
|
|
210
|
+
}
|
|
211
|
+
```
|
|
212
|
+
|
|
213
|
+
Keep the `SKILL.md` frontmatter name unchanged when the existing trigger name
|
|
214
|
+
should remain stable. For example, a directory can be `skill-evolution` while
|
|
215
|
+
the skill frontmatter name remains `skill-dev`.
|
|
216
|
+
|
|
217
|
+
5. Verify local discovery and install behavior:
|
|
218
|
+
|
|
219
|
+
```bash
|
|
220
|
+
node bin/skills.mjs list
|
|
221
|
+
node bin/skills.mjs update --dry-run --target /tmp/test-skills --client claude
|
|
222
|
+
npm run test:local
|
|
223
|
+
npm run pack:check
|
|
224
|
+
```
|
|
225
|
+
|
|
226
|
+
`npm run pack:check` should show the new `SKILL.md`, `skill.json`, references,
|
|
227
|
+
and scripts in the tarball contents.
|
|
228
|
+
|
|
229
|
+
6. Commit and push the imported skill:
|
|
230
|
+
|
|
231
|
+
```bash
|
|
232
|
+
git status --short
|
|
233
|
+
git add README.md skills/<skill-dir>
|
|
234
|
+
git commit -m "Add <skill-dir> skill"
|
|
235
|
+
```
|
|
236
|
+
|
|
237
|
+
Adjust the `git add` path if the skill was copied into `claude/`, `codex/`, or
|
|
238
|
+
`agents/` instead of `skills/`.
|
|
239
|
+
|
|
240
|
+
7. Publish a new npm version:
|
|
241
|
+
|
|
242
|
+
```bash
|
|
243
|
+
npm version minor
|
|
244
|
+
git push --follow-tags
|
|
245
|
+
```
|
|
246
|
+
|
|
247
|
+
Use `minor` for adding a new skill. Use `patch` if the skill was already
|
|
248
|
+
published and only its content changed.
|
|
249
|
+
|
|
250
|
+
8. Confirm the automated release:
|
|
251
|
+
|
|
252
|
+
```bash
|
|
253
|
+
npm view ethan-agent-skills version --json
|
|
254
|
+
npx -y ethan-agent-skills@latest list
|
|
255
|
+
npx -y ethan-agent-skills@latest update --dry-run --target /tmp/test-skills --client claude
|
|
256
|
+
```
|
|
257
|
+
|
|
258
|
+
After the new version appears on npm, users can refresh with:
|
|
259
|
+
|
|
260
|
+
```bash
|
|
261
|
+
npx -y ethan-agent-skills@latest update
|
|
262
|
+
```
|
|
263
|
+
|
|
172
264
|
## OpenSpec OPSX Usage
|
|
173
265
|
|
|
174
266
|
Codex App can invoke the global custom prompts directly:
|
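Step 2 of the import flow above scans the source skill with `rg`. Where `rg` is unavailable, roughly the same check can be sketched in Python. The regex is copied from the README command and kept case-sensitive, as `rg` is by default; the function name `scan_for_secrets` and the decision to skip unreadable files are assumptions made for illustration, not part of the package.

```python
import re
from pathlib import Path

# Pattern copied from the README's `rg -n` command.
SECRET_PATTERN = re.compile(r"(secret|password|token|api[_-]?key|PRIVATE|sk-[A-Za-z0-9])")


def scan_for_secrets(root):
    """Return (path, line_number, line) for every line matching SECRET_PATTERN."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # assumption: binary or unreadable files are skipped
        for lineno, line in enumerate(text.splitlines(), start=1):
            if SECRET_PATTERN.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

An empty result does not prove the directory is clean; the pattern only catches common names, so a manual look at the file list is still worthwhile.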
package/package.json
CHANGED
|
@@ -5,156 +5,55 @@ description: Extract text from PDF files and save as clean markdown documents. U
|
|
|
5
5
|
|
|
6
6
|
# PDF Extract Skill
|
|
7
7
|
|
|
8
|
-
Extract complete text from PDF files and save as clean markdown. Handles text-based PDFs, scanned/image pages, and mixed documents
|
|
8
|
+
Extract complete text from PDF files and save as clean markdown. Handles text-based PDFs, scanned/image pages, and mixed documents using macOS-native frameworks (pyobjc + Quartz/PDFKit).
|
|
9
9
|
|
|
10
10
|
## When to Use
|
|
11
11
|
|
|
12
12
|
- A user uploads a PDF and wants its content extracted to text or markdown
|
|
13
|
-
- The PDF
|
|
14
|
-
-
|
|
15
|
-
- Large PDFs where the extracted text exceeds normal read limits
|
|
13
|
+
- The PDF mixes text and scanned pages (e.g., insurance policy booklets)
|
|
14
|
+
- Large PDFs whose extracted text exceeds normal read limits
|
|
16
15
|
- Financial documents, insurance policies, contracts, or reports that need structured extraction
|
|
17
16
|
|
|
18
|
-
## Workflow
|
|
17
|
+
## Recommended Workflow
|
|
19
18
|
|
|
20
|
-
|
|
21
|
-
2. **Detect scanned pages** — pages with little or no extracted text are likely scans
|
|
22
|
-
3. **Render scanned pages as images** — convert them to PNGs at high resolution
|
|
23
|
-
4. **Extract text from images** — use multimodal vision to OCR the rendered pages
|
|
24
|
-
5. **Crop if needed** — isolate specific regions (tables, signatures) from page images
|
|
25
|
-
6. **Assemble and save** — combine all extracted text into a clean markdown document
|
|
19
|
+
Run the bundled script. It handles text extraction, the section keyword scan, and optional rendering in one call:
|
|
26
20
|
|
|
27
|
-
|
|
28
|
-
|
|
29
|
-
|
|
30
|
-
|
|
31
|
-
First, check the PDF file path and try basic text extraction on the first page to gauge quality:
|
|
32
|
-
|
|
33
|
-
```python
|
|
34
|
-
import sys, Quartz, os
|
|
35
|
-
sys.path.insert(0, '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/PyObjC')
|
|
36
|
-
import objc
|
|
37
|
-
|
|
38
|
-
pdf_path = "/path/to/file.pdf"
|
|
39
|
-
url = Quartz.NSURL.fileURLWithPath_(pdf_path)
|
|
40
|
-
doc = Quartz.PDFDocument.alloc().initWithURL_(doc)
|
|
41
|
-
|
|
42
|
-
if doc is None:
|
|
43
|
-
# Cannot read PDF
|
|
44
|
-
exit()
|
|
45
|
-
|
|
46
|
-
page_count = doc.pageCount()
|
|
47
|
-
print(f"Pages: {page_count}")
|
|
48
|
-
|
|
49
|
-
# Quick quality check on first page
|
|
50
|
-
p1 = doc.pageAtIndex_(0)
|
|
51
|
-
text = p1.string() if p1 else ""
|
|
52
|
-
print(f"First page text length: {len(text)}")
|
|
53
|
-
if len(text) < 50:
|
|
54
|
-
print("WARNING: Low text extraction — likely a scanned PDF")
|
|
21
|
+
```bash
|
|
22
|
+
python3 ~/.claude/skills/pdf-extract/scripts/pdf_extract.py <pdf_path> \
|
|
23
|
+
> /tmp/extracted.txt 2> /tmp/summary.txt
|
|
55
24
|
```
|
|
56
25
|
|
|
57
|
-
|
|
26
|
+
Then:
|
|
58
27
|
|
|
59
|
-
|
|
28
|
+
1. **Read `/tmp/summary.txt` first.** It contains the `section_index` (page numbers of key financial sections like cash value table, benefit illustration, surrender value, non-guaranteed projections) and any `warnings`. **Inspect every section_index entry before declaring extraction complete** — do not stop at the first value table found.
|
|
29
|
+
2. **Heed all warnings.** They flag missing documents or coverage gaps (e.g., "participating policy without illustration").
|
|
30
|
+
3. **Deep-read the pages** the section_index points to. For long documents, use offset/limit on `/tmp/extracted.txt`.
|
|
31
|
+
4. **For scanned pages**, rerun with `--render-scanned` to write PNGs to `/tmp/pdf_rendered/`, then use the Read tool on the PNGs for vision-based OCR.
|
|
32
|
+
5. **Assemble** the extracted content into structured markdown.
|
|
60
33
|
|
|
61
|
-
|
|
62
|
-
import sys, Quartz
|
|
34
|
+
For the underlying pyobjc/Quartz Python code (when modifying the script or running inline), see `references/manual-implementation.md`.
|
|
63
35
|
|
|
64
|
-
|
|
65
|
-
url = Quartz.NSURL.fileURLWithPath_(pdf_path)
|
|
66
|
-
doc = Quartz.PDFDocument.alloc().initWithURL_(url)
|
|
36
|
+
## Policy / Financial PDF Workflow
|
|
67
37
|
|
|
68
|
-
|
|
69
|
-
scanned_pages = []
|
|
38
|
+
Insurance policies (especially participating/with-profits plans), annuity contracts, and savings plans contain **multiple value tables** that look similar but represent very different things. Missing one leads to wildly wrong conclusions.
|
|
70
39
|
|
|
71
|
-
for
|
|
72
|
-
|
|
73
|
-
|
|
74
|
-
if len(text.strip()) < 10:
|
|
75
|
-
scanned_pages.append(i)
|
|
76
|
-
else:
|
|
77
|
-
pages_text.append({"page": i + 1, "text": text})
|
|
40
|
+
**Key distinction for participating policies**:
|
|
41
|
+
- **Guaranteed Cash Value Table** (保證現金價值表): contractually guaranteed, usually printed in the policy contract
|
|
42
|
+
- **Benefit Illustration / 建議書摘要**: projected values including **non-guaranteed dividends** (特別紅利, reversionary/terminal bonus). For a 分紅 (participating) policy, **the illustration is the central financial document, not the guaranteed table.**
|
|
78
43
|
|
|
79
|
-
|
|
80
|
-
```
|
|
81
|
-
|
|
82
|
-
### Step 3: Render Scanned Pages as PNG Images
|
|
83
|
-
|
|
84
|
-
For each scanned page, render it to a PNG image at 6x scale (300+ DPI equivalent) for clarity. Lower scales (3x-4x) may not be readable for dense financial tables.
|
|
85
|
-
|
|
86
|
-
```python
|
|
87
|
-
import sys, Quartz
|
|
88
|
-
|
|
89
|
-
pdf_path = "/path/to/file.pdf"
|
|
90
|
-
url = Quartz.NSURL.fileURLWithPath_(pdf_path)
|
|
91
|
-
doc = Quartz.PDFDocument.alloc().initWithURL_(url)
|
|
92
|
-
output_dir = "/tmp/pdf_rendered/"
|
|
93
|
-
os.makedirs(output_dir, exist_ok=True)
|
|
94
|
-
|
|
95
|
-
scale = 6.0 # 6x scale for dense tables
|
|
96
|
-
|
|
97
|
-
for page_idx in scanned_pages:
|
|
98
|
-
page = doc.pageAtIndex_(page_idx)
|
|
99
|
-
media_box = page.boundsForBox_(Quartz.kCGPDFMediaBox)
|
|
100
|
-
|
|
101
|
-
pw = int(media_box.size.width * scale)
|
|
102
|
-
ph = int(media_box.size.height * scale)
|
|
103
|
-
|
|
104
|
-
cs = Quartz.CGColorSpaceCreateDeviceRGB()
|
|
105
|
-
ctx = Quartz.CGBitmapContextCreate(
|
|
106
|
-
None, pw, ph, 8, pw * 4, cs,
|
|
107
|
-
Quartz.kCGImageAlphaPremultipliedLast
|
|
108
|
-
)
|
|
109
|
-
|
|
110
|
-
# White background
|
|
111
|
-
Quartz.CGContextSetRGBFillColor(ctx, 1.0, 1.0, 1.0, 1.0)
|
|
112
|
-
Quartz.CGContextFillRect(ctx, Quartz.CGRectMake(0, 0, pw, ph))
|
|
113
|
-
Quartz.CGContextScaleCTM(ctx, scale, scale)
|
|
114
|
-
page.drawWithBox_toContext_(Quartz.kCGPDFMediaBox, ctx)
|
|
115
|
-
|
|
116
|
-
cg_img = Quartz.CGBitmapContextCreateImage(ctx)
|
|
117
|
-
Quartz.CGImageDestinationAddImage(
|
|
118
|
-
Quartz.CGImageDestinationCreateWithURL(
|
|
119
|
-
Quartz.NSURL.fileURLWithPath_(f"{output_dir}page_{page_idx+1}.png"),
|
|
120
|
-
Quartz.kUTTypePNG, 1, None
|
|
121
|
-
),
|
|
122
|
-
cg_img, None
|
|
123
|
-
)
|
|
124
|
-
# Finalize the destination
|
|
125
|
-
dest = Quartz.CGImageDestinationCreateWithURL(
|
|
126
|
-
Quartz.NSURL.fileURLWithPath_(f"{output_dir}page_{page_idx+1}.png"),
|
|
127
|
-
Quartz.kUTTypePNG, 1, None
|
|
128
|
-
)
|
|
129
|
-
Quartz.CGImageDestinationAddImage(dest, cg_img, None)
|
|
130
|
-
Quartz.CGImageDestinationFinalize(dest)
|
|
131
|
-
|
|
132
|
-
print(f"Rendered page {page_idx+1} to {output_dir}page_{page_idx+1}.png ({pw}x{ph})")
|
|
133
|
-
```
|
|
44
|
+
For a typical participating policy, expected surrender value ≈ guaranteed cash value + projected non-guaranteed bonus. Using only the guaranteed table can understate expected returns by 2-5x.
|
|
134
45
|
|
|
135
|
-
|
|
46
|
+
**Rules**:
|
|
47
|
+
1. If `section_index["participating_plan"]` is non-empty, **you must locate and extract** `benefit_illustration`. Don't summarize without it.
|
|
48
|
+
2. If `section_index["benefit_illustration"]` is empty for a participating policy, **tell the user the illustration is missing** and recommend requesting it from the agent — do not compute IRR or surrender returns from the guaranteed table alone.
|
|
49
|
+
3. When building the summary markdown, present **guaranteed (A) + non-guaranteed (B) + total (A+B)** as separate columns. Note that non-guaranteed values can be downgraded at the insurer's discretion.
|
|
50
|
+
4. Look for pessimistic/optimistic scenario tables (悲觀情景 / 樂觀情景) — they bracket the range of plausible returns; include them when present.
|
|
136
51
|
|
|
137
|
-
|
|
52
|
+
**Example**: If the script reports `participating_plan: [55, 68]` and `benefit_illustration: [51, 52, 53, 54, 55]`, deep-read pages 51-55 — that's where the projection tables live. A cash value table elsewhere (e.g., page 48) is necessary but not sufficient.
|
|
138
53
|
|
|
139
|
-
|
|
140
|
-
|
|
141
|
-
```python
|
|
142
|
-
import sys, Quartz
|
|
143
|
-
|
|
144
|
-
# After rendering to cg_img (before saving), crop a region:
|
|
145
|
-
# crop_rect = Quartz.CGRectMake(x * scale, y * scale, width * scale, height * scale)
|
|
146
|
-
# cg_img = Quartz.CGImageCreateWithImageInRect(cg_img, crop_rect)
|
|
147
|
-
```
|
|
54
|
+
## Assembled Markdown Template
|
|
148
55
|
|
|
149
|
-
|
|
150
|
-
|
|
151
|
-
### Step 5: Extract Text from Rendered Images
|
|
152
|
-
|
|
153
|
-
Use multimodal vision to extract text from the rendered PNG images. This works best with the Read tool on the image files.
|
|
154
|
-
|
|
155
|
-
### Step 6: Assemble and Save as Markdown
|
|
156
|
-
|
|
157
|
-
Combine all extracted content into a structured markdown document. Follow the Obsidian wiki schema if the target vault uses it:
|
|
56
|
+
When saving the result to an Obsidian wiki, use this frontmatter pattern:
|
|
158
57
|
|
|
159
58
|
```markdown
|
|
160
59
|
---
|
|
@@ -168,47 +67,30 @@ status: active
|
|
|
168
67
|
|
|
169
68
|
# Document Title
|
|
170
69
|
|
|
171
|
-
> Brief description
|
|
70
|
+
> Brief description
|
|
172
71
|
|
|
173
72
|
## Key Data
|
|
174
|
-
|
|
175
73
|
[Extracted tables as markdown tables]
|
|
176
74
|
|
|
177
75
|
## Main Content
|
|
178
|
-
|
|
179
|
-
[Extracted body text, organized by sections]
|
|
76
|
+
[Body text, organized by sections]
|
|
180
77
|
|
|
181
78
|
## Important Notes
|
|
182
|
-
|
|
183
|
-
[Any specific details, warnings, or risks mentioned]
|
|
184
|
-
```
|
|
185
|
-
|
|
186
|
-
## Handling Large PDFs
|
|
187
|
-
|
|
188
|
-
When the extracted text is very long and would exceed output limits:
|
|
189
|
-
|
|
190
|
-
- Process pages in batches of 5-10
|
|
191
|
-
- Save intermediate results to a temporary file
|
|
192
|
-
- For text-based extraction, use chunked reading with offset/limit if reading from a file:
|
|
193
|
-
|
|
194
|
-
```python
|
|
195
|
-
# Write extracted text to a file first, then read in chunks
|
|
196
|
-
with open("/tmp/extracted.txt", "w") as f:
|
|
197
|
-
for page_data in pages_text:
|
|
198
|
-
f.write(f"\n--- Page {page_data['page']} ---\n")
|
|
199
|
-
f.write(page_data["text"])
|
|
79
|
+
[Specific details, warnings, risks]
|
|
200
80
|
```
|
|
201
81
|
|
|
202
82
|
## Common Pitfalls
|
|
203
83
|
|
|
204
|
-
- **
|
|
205
|
-
- **
|
|
206
|
-
- **
|
|
207
|
-
- **
|
|
208
|
-
- **
|
|
84
|
+
- **Stopping at the first value table.** Especially in policy PDFs, finding a "Cash Value Table" does NOT mean you're done. Always check the `section_index` for other value-related sections (benefit illustration, surrender value, non-guaranteed) before concluding.
|
|
85
|
+
- **Garbled CJK text from font subsetting.** Some PDFs (esp. Traditional Chinese insurance policies) embed subset fonts with custom glyph encodings — `page.string()` returns broken codepoints. Render those pages as PNG at 6x scale and OCR via vision. Often only the boilerplate provisions pages are affected; the data tables remain readable.
|
|
86
|
+
- **Low-resolution renders.** If OCR quality is poor at a reduced `--scale` (e.g., 4), go back to 6, which is the script's default. Dense tables with small fonts need 6x.
|
|
87
|
+
- **Page orientation.** Some PDFs have rotated pages. Check `media_box` dimensions to detect landscape.
|
|
88
|
+
- **Watermarks/overlays.** Heavy background watermarks interfere with OCR — crop to the content region.
|
|
89
|
+
- **Mixed content pages.** A page may have both text and scanned elements. The `< 10 chars` threshold detects pure scans only; partial-text pages need manual review.
|
|
90
|
+
- **pyobjc availability.** On macOS, pyobjc is pre-installed with system Python. Use `python3` from the system, not a Homebrew Python that may lack Quartz bindings.
|
|
209
91
|
|
|
210
92
|
## Dependencies
|
|
211
93
|
|
|
212
|
-
- macOS (
|
|
213
|
-
- pyobjc (pre-installed
|
|
214
|
-
- No additional packages
|
|
94
|
+
- macOS (uses Quartz / PDFKit)
|
|
95
|
+
- pyobjc (pre-installed with system Python)
|
|
96
|
+
- No additional packages (no poppler, no tesseract, no PIL)
|
|
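Step 1 of the Recommended Workflow and rules 1 and 2 of the policy workflow can be checked mechanically once stderr has been captured to `/tmp/summary.txt`. The sketch below assumes the JSON summary sits on a single line after the human-readable lines, which matches the `json.dump` call in the script diff. The helper names `load_summary` and `illustration_missing` are hypothetical, and the sample text is made up.

```python
import json


def load_summary(stderr_text):
    """Parse the JSON summary from captured stderr.

    Assumption: the JSON object is written on one line after the
    human-readable lines, so the last line starting with "{" is the summary.
    """
    for line in reversed(stderr_text.splitlines()):
        line = line.strip()
        if line.startswith("{"):
            return json.loads(line)
    raise ValueError("no JSON summary found")


def illustration_missing(summary):
    """True when a participating plan was detected but no illustration pages were."""
    index = summary.get("section_index", {})
    return bool(index.get("participating_plan")) and not index.get("benefit_illustration")


# Made-up stderr capture for illustration only.
sample = (
    "[Summary] 60 text pages, 2 scanned pages: [3, 9]\n"
    '{"text_pages": 60, "scanned_count": 2, '
    '"section_index": {"participating_plan": [55, 68], "benefit_illustration": []}, '
    '"warnings": []}\n'
)
summary = load_summary(sample)
if illustration_missing(summary):
    print("Ask for the benefit illustration before computing returns.")
```

When `illustration_missing` returns true, rule 2 applies: report the gap rather than deriving returns from the guaranteed table.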
@@ -0,0 +1,141 @@
|
|
|
1
|
+
# Manual Implementation — pyobjc / Quartz PDF Extraction
|
|
2
|
+
|
|
3
|
+
The bundled `scripts/pdf_extract.py` handles all of this end-to-end. Read this reference only when:
|
|
4
|
+
- Modifying the script itself
|
|
5
|
+
- The script is unavailable and you need to inline the logic
|
|
6
|
+
- Doing a custom one-off (e.g., a different page-level filter)
|
|
7
|
+
|
|
8
|
+
The script source at `scripts/pdf_extract.py` is the authoritative implementation.
|
|
9
|
+
|
|
10
|
+
## Step 1: Inspect the PDF
|
|
11
|
+
|
|
12
|
+
Check the PDF file path and try basic text extraction on the first page to gauge quality.
|
|
13
|
+
|
|
14
|
+
```python
|
|
15
|
+
import Quartz
|
|
16
|
+
|
|
17
|
+
pdf_path = "/path/to/file.pdf"
|
|
18
|
+
url = Quartz.NSURL.fileURLWithPath_(pdf_path)
|
|
19
|
+
doc = Quartz.PDFDocument.alloc().initWithURL_(url)
|
|
20
|
+
|
|
21
|
+
if doc is None:
|
|
22
|
+
raise SystemExit("Cannot open PDF")
|
|
23
|
+
|
|
24
|
+
page_count = doc.pageCount()
|
|
25
|
+
print(f"Pages: {page_count}")
|
|
26
|
+
|
|
27
|
+
p1 = doc.pageAtIndex_(0)
|
|
28
|
+
text = p1.string() if p1 else ""
|
|
29
|
+
print(f"First page text length: {len(text)}")
|
|
30
|
+
if len(text) < 50:
|
|
31
|
+
print("WARNING: Low text extraction — likely a scanned PDF")
|
|
32
|
+
```
|
|
33
|
+
|
|
34
|
+
## Step 2: Extract All Text-Based Pages
|
|
35
|
+
|
|
36
|
+
Pages with `len(text.strip()) < 10` are treated as pure scans. Pages with partial text need manual review.
|
|
37
|
+
|
|
38
|
+
```python
|
|
39
|
+
import Quartz
|
|
40
|
+
|
|
41
|
+
pdf_path = "/path/to/file.pdf"
|
|
42
|
+
url = Quartz.NSURL.fileURLWithPath_(pdf_path)
|
|
43
|
+
doc = Quartz.PDFDocument.alloc().initWithURL_(url)
|
|
44
|
+
|
|
45
|
+
pages_text = []
|
|
46
|
+
scanned_pages = []
|
|
47
|
+
|
|
48
|
+
for i in range(doc.pageCount()):
|
|
49
|
+
page = doc.pageAtIndex_(i)
|
|
50
|
+
text = page.string() if page else ""
|
|
51
|
+
if len(text.strip()) < 10:
|
|
52
|
+
scanned_pages.append(i)
|
|
53
|
+
else:
|
|
54
|
+
pages_text.append({"page": i + 1, "text": text})
|
|
55
|
+
|
|
56
|
+
print(f"Text pages: {len(pages_text)}, Scanned pages: {scanned_pages}")
|
|
57
|
+
```
|
|
58
|
+
|
|
59
|
+
## Step 3: Render Scanned Pages as PNG
|
|
60
|
+
|
|
61
|
+
For dense financial tables, use **6x scale** (6 × 72 = 432 DPI, since one PDF point is 1/72 inch). Scale guide:
|
|
62
|
+
- 3x — simple text
|
|
63
|
+
- 4x — most documents
|
|
64
|
+
- 5x — dense forms
|
|
65
|
+
- 6x — financial tables with small fonts
|
|
66
|
+
|
|
67
|
+
```python
|
|
68
|
+
import os
|
|
69
|
+
import Quartz
|
|
70
|
+
|
|
71
|
+
output_dir = "/tmp/pdf_rendered/"
|
|
72
|
+
os.makedirs(output_dir, exist_ok=True)
|
|
73
|
+
scale = 6.0
|
|
74
|
+
|
|
75
|
+
for page_idx in scanned_pages:
|
|
76
|
+
page = doc.pageAtIndex_(page_idx)
|
|
77
|
+
media_box = page.boundsForBox_(Quartz.kCGPDFMediaBox)
|
|
78
|
+
pw = int(media_box.size.width * scale)
|
|
79
|
+
ph = int(media_box.size.height * scale)
|
|
80
|
+
|
|
81
|
+
cs = Quartz.CGColorSpaceCreateDeviceRGB()
|
|
82
|
+
ctx = Quartz.CGBitmapContextCreate(
|
|
83
|
+
None, pw, ph, 8, pw * 4, cs,
|
|
84
|
+
Quartz.kCGImageAlphaPremultipliedLast
|
|
85
|
+
)
|
|
86
|
+
|
|
87
|
+
Quartz.CGContextSetRGBFillColor(ctx, 1.0, 1.0, 1.0, 1.0)
|
|
88
|
+
Quartz.CGContextFillRect(ctx, Quartz.CGRectMake(0, 0, pw, ph))
|
|
89
|
+
Quartz.CGContextScaleCTM(ctx, scale, scale)
|
|
90
|
+
page.drawWithBox_toContext_(Quartz.kCGPDFMediaBox, ctx)
|
|
91
|
+
|
|
92
|
+
cg_img = Quartz.CGBitmapContextCreateImage(ctx)
|
|
93
|
+
out_path = os.path.join(output_dir, f"page_{page_idx+1}.png")
|
|
94
|
+
dest = Quartz.CGImageDestinationCreateWithURL(
|
|
95
|
+
Quartz.NSURL.fileURLWithPath_(out_path),
|
|
96
|
+
Quartz.kUTTypePNG, 1, None
|
|
97
|
+
)
|
|
98
|
+
Quartz.CGImageDestinationAddImage(dest, cg_img, None)
|
|
99
|
+
Quartz.CGImageDestinationFinalize(dest)
|
|
100
|
+
print(f"Rendered page {page_idx+1} -> {out_path} ({pw}x{ph})")
|
|
101
|
+
```
|
|
102
|
+
|
|
103
|
+
## Step 4: Crop a Region (Optional)
|
|
104
|
+
|
|
105
|
+
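When only part of a page is relevant (e.g., one table), crop the rendered image. The crop rectangle is in image pixel coordinates with the origin (0,0) at the **top-left** of the rendered image; PDF page space itself uses a bottom-left origin, so measure `y` down from the top of the image.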
|
|
106
|
+
|
|
107
|
+
```python
|
|
108
|
+
# After rendering to cg_img (before saving):
|
|
109
|
+
# crop_rect uses pixel-space coordinates (PDF points × scale)
|
|
110
|
+
crop_rect = Quartz.CGRectMake(x * scale, y * scale, width * scale, height * scale)
|
|
111
|
+
cg_img = Quartz.CGImageCreateWithImageInRect(cg_img, crop_rect)
|
|
112
|
+
```
|
|
113
|
+
|
|
114
|
+
**Finding coordinates**: render the full page first, view the image, estimate the crop rectangle in PDF points (1 point = 1/72 inch; A4 ≈ 595×842 points), then multiply by `scale`.
|
|
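The scale and crop arithmetic in Steps 3 and 4 can be checked without Quartz. This is a minimal sketch: `effective_dpi`, `render_size`, and `crop_rect_pixels` are hypothetical helper names, and the A4 size is the approximate 595×842 points quoted above.

```python
POINTS_PER_INCH = 72  # 1 PDF point = 1/72 inch


def effective_dpi(scale):
    """Pixels per inch when a page is rendered at `scale` pixels per point."""
    return POINTS_PER_INCH * scale


def render_size(width_pt, height_pt, scale):
    """Pixel dimensions of the rendered bitmap, truncated with int() as in Step 3."""
    return int(width_pt * scale), int(height_pt * scale)


def crop_rect_pixels(x, y, width, height, scale):
    """Convert a crop rectangle from PDF points to pixel space (points x scale)."""
    return (x * scale, y * scale, width * scale, height * scale)


print(effective_dpi(6.0))          # 432.0
print(render_size(595, 842, 6.0))  # (3570, 5052)
```

At 6x an A4 page is roughly 18 megapixels, which is worth keeping in mind when rendering many scanned pages.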
115
|
+
|
|
116
|
+
## Section Keyword Patterns
|
|
117
|
+
|
|
118
|
+
The script's auto section scan uses these regex patterns. Extend them when encountering new document genres (e.g., loan agreements, trust deeds):
|
|
119
|
+
|
|
120
|
+
| section_key | Pattern (case-insensitive) |
|
|
121
|
+
|-------------------------|-------------------------------------------------------------------------------------------------|
|
|
122
|
+
| cash_value_table | `保[證证]現?金價值表` / `Guaranteed Cash Value` / `Cash Value Table` |
|
|
123
|
+
| benefit_illustration | `建[議议]書摘要` / `Benefit Illustration` / `Illustration Summary` / `利益[說说]明` |
|
|
124
|
+
| surrender_value | `退保價值` / `退保价值` / `Surrender Value` |
|
|
125
|
+
| non_guaranteed | `非保[證证]` / `Non-Guaranteed` |
|
|
126
|
+
| projected_values | `預計` / `预计` / `Projected` / `Estimated` |
|
|
127
|
+
| special_bonus | `特別紅利` / `特别红利` / `Special Bonus` / `Reversionary Bonus` / `Terminal Bonus` |
|
|
128
|
+
| participating_plan | `分紅(計劃|保單)?` / `分红(计划|保单)?` / `Participating` / `With-Profits` |
|
|
129
|
+
| death_benefit | `身故保[障賠]` / `Death Benefit` |
|
|
130
|
+
| critical_illness | `嚴重疾病` / `严重疾病` / `危疾` / `Critical Illness` |
|
|
131
|
+
| premium_payment | `繳[費付]` / `缴[费付]` / `Premium Payment` / `Payment Schedule` |
|
|
132
|
+
| policy_terms | `保單條款` / `保单条款` / `Policy Provisions` / `Terms and Conditions` |
|
|
133
|
+
| endorsement | `批[註注]` / `Endorsement` / `Rider` |
|
|
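Extending the scan for a new genre amounts to appending one `(section_key, pattern)` pair. The sketch below reuses two patterns from the table and adds a `loan_agreement` entry; that key and its regex are assumptions for illustration and are not part of the shipped script. `scan_pages` is likewise a simplified, hypothetical stand-in for the script's scan.

```python
import re

# Two patterns copied from the table above, plus one hypothetical addition.
SECTION_PATTERNS = [
    ("surrender_value", r"退保價值|退保价值|Surrender\s+Value"),
    ("non_guaranteed", r"非保[證证]|Non[-\s]?Guaranteed"),
    ("loan_agreement", r"貸款協議|贷款协议|Loan\s+Agreement"),  # hypothetical entry
]


def scan_pages(page_texts):
    """Map each section key to the 1-indexed pages whose text matches its pattern."""
    index = {key: [] for key, _ in SECTION_PATTERNS}
    for page_idx, text in enumerate(page_texts):
        for key, pattern in SECTION_PATTERNS:
            if re.search(pattern, text, re.IGNORECASE):
                index[key].append(page_idx + 1)
    return index


pages = ["Policy schedule", "SURRENDER  VALUE table", "This Loan Agreement is made..."]
print(scan_pages(pages))
```

Broad patterns such as `Projected` or `Rider` will match incidental mentions, so a new entry is best kept as specific as the genre allows.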
134
|
+
|
|
135
|
+
## Warning Triggers
|
|
136
|
+
|
|
137
|
+
The script emits warnings when the detected section pattern suggests missing documentation:
|
|
138
|
+
|
|
139
|
+
1. **`participating_plan` present + `benefit_illustration` absent** → the illustration is often a separate document. Ask the user / agent to provide it.
|
|
140
|
+
2. **`cash_value_table` present + `participating_plan` present + `surrender_value` absent** → guaranteed cash value alone understates expected return for a participating policy.
|
|
141
|
+
3. **`non_guaranteed` referenced + `benefit_illustration` absent** → projection data likely in a separate proposal/illustration document.
|
|
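The three triggers reduce to presence checks on `section_index`. A minimal sketch, assuming `section_index` maps each key to a list of page numbers as in the script's summary; the function name `coverage_warnings` and the short labels are placeholders, not the script's actual warning text.

```python
def coverage_warnings(section_index):
    """Return short labels for the three coverage gaps listed above."""

    def has(key):
        # A section counts as present when at least one page matched.
        return bool(section_index.get(key))

    warnings = []
    if has("participating_plan") and not has("benefit_illustration"):
        warnings.append("participating plan without benefit illustration")
    if has("cash_value_table") and has("participating_plan") and not has("surrender_value"):
        warnings.append("cash value table without surrender value section")
    if has("non_guaranteed") and not has("benefit_illustration"):
        warnings.append("non-guaranteed amounts without benefit illustration")
    return warnings
```

The checks are independent, so a single document can raise all three at once.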
@@ -9,22 +9,44 @@ Options:
|
|
|
9
9
|
--render-scanned Render scanned/blank pages as PNG images for OCR
|
|
10
10
|
--output-dir DIR Directory for rendered page images (default: /tmp/pdf_rendered/)
|
|
11
11
|
--scale N Render scale multiplier (default: 6.0)
|
|
12
|
+
--no-section-scan Disable section keyword scan (on by default)
|
|
12
13
|
|
|
13
14
|
Output:
|
|
14
15
|
- Prints extracted text to stdout, page by page
|
|
15
16
|
- If --render-scanned is set, saves scanned pages as PNGs to output_dir
|
|
16
|
-
- Prints a JSON summary to stderr with page counts
|
|
17
|
+
- Prints a JSON summary to stderr with page counts, scanned pages,
|
|
18
|
+
section_index (page locations of key financial sections), and warnings.
|
|
17
19
|
"""
|
|
18
20
|
|
|
19
21
|
import sys
|
|
20
22
|
import os
|
|
23
|
+
import re
|
|
21
24
|
import json
|
|
22
25
|
import argparse
|
|
23
26
|
|
|
24
27
|
import Quartz
|
|
25
28
|
|
|
26
29
|
|
|
27
|
-
|
|
30
|
+
# Section keyword patterns for policy / financial document detection.
|
|
31
|
+
# Each entry: (section_key, regex_pattern). Patterns are case-insensitive
|
|
32
|
+
# and cover both Traditional / Simplified Chinese and English variants.
|
|
33
|
+
SECTION_PATTERNS = [
|
|
34
|
+
("cash_value_table", r"保[證证]現?金價值表|保证现金价值表|Guaranteed\s+Cash\s+Value|Cash\s+Value\s+Table"),
|
|
35
|
+
("benefit_illustration", r"建[議议]書摘要|建议书摘要|建[議议]書|建议书|Benefit\s+Illustration|Illustration\s+Summary|Sales\s+Illustration|利益[說说]明"),
|
|
36
|
+
("surrender_value", r"退保價值|退保价值|Surrender\s+Value"),
|
|
37
|
+
("non_guaranteed", r"非保[證证]|Non[-\s]?Guaranteed"),
|
|
38
|
+
("projected_values", r"預計|预计|Projected|Estimated"),
|
|
39
|
+
("special_bonus", r"特別紅利|特别红利|Special\s+Bonus|Reversionary\s+Bonus|Terminal\s+Bonus"),
|
|
40
|
+
("participating_plan", r"分紅(?:計劃|保[單单])?|分红(?:计划|保[單单])?|Participating|With[-\s]?Profits"),
|
|
41
|
+
("death_benefit", r"身故保[障賠]|Death\s+Benefit"),
|
|
42
|
+
("critical_illness", r"嚴重疾病|严重疾病|危疾|Critical\s+Illness"),
|
|
43
|
+
("premium_payment", r"繳[費付]|缴[费付]|Premium\s+Payment|Payment\s+Schedule"),
|
|
44
|
+
("policy_terms", r"保[單单]條款|保[單单]条款|Policy\s+Provisions|Terms\s+and\s+Conditions"),
|
|
45
|
+
("endorsement", r"批[註注]|Endorsement|Rider"),
|
|
46
|
+
]
|
|
47
|
+
|
|
48
|
+
|
|
49
|
+
def extract_text(pdf_path, render_scanned=False, output_dir=None, scale=6.0, section_scan=True):
|
|
28
50
|
url = Quartz.NSURL.fileURLWithPath_(pdf_path)
|
|
29
51
|
doc = Quartz.PDFDocument.alloc().initWithURL_(url)
|
|
30
52
|
|
|
@@ -35,10 +57,12 @@ def extract_text(pdf_path, render_scanned=False, output_dir=None, scale=6.0):
|
|
|
35
57
|
page_count = doc.pageCount()
|
|
36
58
|
pages_text = []
|
|
37
59
|
scanned_pages = []
|
|
60
|
+
all_page_texts = {} # page_idx -> text, for section scan
|
|
38
61
|
|
|
39
62
|
for i in range(page_count):
|
|
40
63
|
page = doc.pageAtIndex_(i)
|
|
41
64
|
text = page.string() if page else ""
|
|
65
|
+
all_page_texts[i] = text
|
|
42
66
|
|
|
43
67
|
if len(text.strip()) < 10:
|
|
44
68
|
scanned_pages.append(i)
|
|
@@ -62,13 +86,78 @@ def extract_text(pdf_path, render_scanned=False, output_dir=None, scale=6.0):
|
|
|
62
86
|
"scanned_pages": scanned_pages,
|
|
63
87
|
"scanned_count": len(scanned_pages),
|
|
64
88
|
}
|
|
65
|
-
|
|
66
|
-
|
|
89
|
+
|
|
90
|
+
if section_scan:
|
|
91
|
+
section_index, warnings = _scan_sections(all_page_texts)
|
|
92
|
+
summary["section_index"] = section_index
|
|
93
|
+
summary["warnings"] = warnings
|
|
94
|
+
|
|
95
|
+
print(f"\n[Summary] {summary['text_pages']} text pages, {summary['scanned_count']} scanned pages: {scanned_pages}", file=sys.stderr)
|
|
96
|
+
print("[Section Index] (1-indexed page numbers)", file=sys.stderr)
|
|
97
|
+
for key, pages in section_index.items():
|
|
98
|
+
if pages:
|
|
99
|
+
print(f" {key}: {pages}", file=sys.stderr)
|
|
100
|
+
if warnings:
|
|
101
|
+
print("[WARNINGS]", file=sys.stderr)
|
|
102
|
+
for w in warnings:
|
|
103
|
+
print(f" ! {w}", file=sys.stderr)
|
|
104
|
+
else:
|
|
105
|
+
print(f"\n[Summary] {summary['text_pages']} text pages, {summary['scanned_count']} scanned pages: {scanned_pages}", file=sys.stderr)
|
|
106
|
+
|
|
107
|
+
json.dump(summary, sys.stderr, ensure_ascii=False)
|
|
67
108
|
print("", file=sys.stderr)
|
|
68
109
|
|
|
69
110
|
return summary
|
|
70
111
|
|
|
71
112
|
|
|
113
|
+
def _scan_sections(all_page_texts):
|
|
114
|
+
"""Scan all pages for known financial document section keywords.
|
|
115
|
+
|
|
116
|
+
Returns (section_index, warnings):
|
|
117
|
+
section_index: {section_key: [1-indexed page numbers where matched]}
|
|
118
|
+
warnings: list of human-readable warnings about coverage gaps
|
|
119
|
+
"""
|
|
120
|
+
section_index = {key: [] for key, _ in SECTION_PATTERNS}
|
|
121
|
+
|
|
122
|
+
for page_idx, text in all_page_texts.items():
|
|
123
|
+
if not text:
|
|
124
|
+
continue
|
|
125
|
+
for key, pattern in SECTION_PATTERNS:
|
|
126
|
+
if re.search(pattern, text, re.IGNORECASE):
|
|
127
|
+
section_index[key].append(page_idx + 1)
|
|
128
|
+
|
|
129
|
+
warnings = []
|
|
130
|
+
|
|
131
|
+
# Participating policy without Benefit Illustration → likely missing the key document
|
|
132
|
+
if section_index["participating_plan"] and not section_index["benefit_illustration"]:
|
|
133
|
+
warnings.append(
|
|
134
|
+
"Participating/with-profits policy detected, but no Benefit Illustration "
|
|
135
|
+
"section found. The illustration (with non-guaranteed projections) is often "
|
|
136
|
+
"a separate document — ask the user / agent to provide it before drawing "
|
|
137
|
+
"conclusions about surrender/maturity values."
|
|
138
|
+
)
|
|
139
|
+
|
|
140
|
+
# Cash value table present but no surrender value section → for participating plans,
|
|
141
|
+
# guaranteed cash value alone understates the true expected return.
|
|
142
|
+
if (section_index["cash_value_table"]
|
|
143
|
+
and section_index["participating_plan"]
|
|
144
|
+
and not section_index["surrender_value"]):
|
|
145
|
+
warnings.append(
|
|
146
|
+
"Cash Value Table found but no 'Surrender Value' / 退保價值 section. "
|
|
147
|
+
"For a participating policy, 'Guaranteed Cash Value' ≠ 'Expected Surrender Value' "
|
|
148
|
+
"(the latter includes non-guaranteed dividends). Verify all value tables were located."
|
|
149
|
+
)
|
|
150
|
+
|
|
151
|
+
# Non-guaranteed mentioned but no Benefit Illustration → projection data probably elsewhere
|
|
152
|
+
if section_index["non_guaranteed"] and not section_index["benefit_illustration"]:
|
|
153
|
+
warnings.append(
|
|
154
|
+
"Non-guaranteed amounts referenced but no Benefit Illustration found — "
|
|
155
|
+
"projected values may be in a separate proposal/illustration document."
|
|
156
|
+
)
|
|
157
|
+
|
|
158
|
+
return section_index, warnings
|
|
159
|
+
|
|
160
|
+
|
|
72
161
|
def _render_page(page, page_idx, output_dir, scale):
|
|
73
162
|
media_box = page.boundsForBox_(Quartz.kCGPDFMediaBox)
|
|
74
163
|
pw = int(media_box.size.width * scale)
|
|
@@ -107,6 +196,9 @@ if __name__ == "__main__":
|
|
|
107
196
|
help="Output directory for rendered images")
|
|
108
197
|
parser.add_argument("--scale", type=float, default=6.0,
|
|
109
198
|
help="Scale multiplier for rendering (default: 6.0)")
|
|
199
|
+
parser.add_argument("--no-section-scan", action="store_true",
|
|
200
|
+
help="Disable section keyword scan (on by default)")
|
|
110
201
|
args = parser.parse_args()
|
|
111
202
|
|
|
112
|
-
extract_text(args.pdf_path, args.render_scanned, args.output_dir, args.scale
|
|
203
|
+
extract_text(args.pdf_path, args.render_scanned, args.output_dir, args.scale,
|
|
204
|
+
section_scan=not args.no_section_scan)
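The section scan added in this hunk can be tried in isolation with the sketch below. It is an illustration, not the script's actual code: the `SECTION_PATTERNS` regexes are assumptions (the real list is not visible in this part of the diff), the function name `scan_sections` is hypothetical, and only the first of the three warnings is reproduced.

```python
import re
from collections import defaultdict

# Hypothetical patterns for illustration only; the real SECTION_PATTERNS
# list is defined elsewhere in pdf_extract.py and is not shown in this hunk.
SECTION_PATTERNS = [
    ("participating_plan", r"participating|with[- ]profits"),
    ("benefit_illustration", r"benefit\s+illustration"),
    ("cash_value_table", r"cash\s+value"),
    ("surrender_value", r"surrender\s+value|退保價值"),
    ("non_guaranteed", r"non[- ]guaranteed"),
]


def scan_sections(all_page_texts):
    """Map each section key to the 1-based pages where its pattern matches."""
    section_index = defaultdict(list)
    for page_idx, text in all_page_texts.items():
        if not text:  # scanned or empty pages carry no text to search
            continue
        for key, pattern in SECTION_PATTERNS:
            if re.search(pattern, text, re.IGNORECASE):
                section_index[key].append(page_idx + 1)

    warnings = []
    # Same shape as the first check in the diff: a participating policy
    # with no Benefit Illustration page anywhere in the document.
    if section_index["participating_plan"] and not section_index["benefit_illustration"]:
        warnings.append("Participating policy detected, but no Benefit Illustration found.")
    return section_index, warnings


# Made-up page texts keyed by 0-based page index.
pages = {
    0: "This is a Participating whole-life plan.",
    1: "",
    2: "Guaranteed Cash Value table",
}
index, warnings = scan_sections(pages)
```

Using a `defaultdict(list)` means a key that never matched reads as an empty list, so the warning conditions can test it directly.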
|
|
@@ -1,5 +1,5 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "pdf-extract",
|
|
3
|
-
"version": "0.
|
|
4
|
-
"description": ""
|
|
3
|
+
"version": "0.2.0",
|
|
4
|
+
"description": "Extract text from PDFs and save as clean markdown on macOS via PDFKit + Quartz. Auto-detects scanned pages, renders them for vision OCR, and scans for financial-document sections (cash value, benefit illustration, surrender value, non-guaranteed) so participating-policy data isn't missed."
|
|
5
5
|
}
|
|
@@ -0,0 +1,108 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: smzdm-picks
|
|
3
|
+
description: Fetch personalized 什么值得买 (smzdm.com) deal picks from the user's already-logged-in Chrome session. Trigger when the user says "smzdm", "什么值得买", "今日好价", "smzdm 精选", "查一下值得买", "好价推荐", "smzdm picks", "zdm", "今天有什么好价", "值得买推荐", "smzdm 推荐", or asks to see today's curated deals / discounts / 优惠 / 特价 on smzdm. Drives Chrome via AppleScript because smzdm has anti-bot protection; uses the user's existing login for personalized content.
|
|
4
|
+
---
|
|
5
|
+
|
|
6
|
+
# smzdm-picks
|
|
7
|
+
|
|
8
|
+
Pull today's personalized curated deals (个性化好价) from 什么值得买 by driving the user's already-logged-in Chrome via AppleScript + JS injection.
|
|
9
|
+
|
|
10
|
+
## Why AppleScript
|
|
11
|
+
|
|
12
|
+
smzdm.com is protected by an Akamai-style JS challenge: plain `curl` returns HTTP 202 with a `probe.js` and no content. The only way to get the rendered, personalized feed is through a real browser that's already authenticated. The user has Chrome logged in, so this skill steers Chrome itself.
|
|
13
|
+
|
|
14
|
+
## One-time setup (per Mac)
|
|
15
|
+
|
|
16
|
+
**This skill is per-machine** — each Mac needs its own setup because Chrome's cookie store and AppleScript permission are local. When the user mentions running this on a new Mac, or this skill triggers on a machine for the first time, walk them through this checklist.
|
|
17
|
+
|
|
18
|
+
Use the built-in self-check command to verify each step:
|
|
19
|
+
```bash
|
|
20
|
+
bash ~/.claude/skills/smzdm-picks/scripts/fetch.sh --check
|
|
21
|
+
```
|
|
22
|
+
|
|
23
|
+
It prints ✓/✗ for each requirement and pinpoints what's missing. Manual checklist if you prefer step-by-step:
|
|
24
|
+
|
|
25
|
+
1. **Open Chrome** and confirm it stays open in the background.
|
|
26
|
+
2. **Enable AppleScript JS API**: Chrome menu → View → Developer → **Allow JavaScript from Apple Events** ✓
|
|
27
|
+
- If the "Developer" submenu is hidden: Chrome → Settings → search for "developer menu" → enable.
|
|
28
|
+
- In Chinese Chrome: 查看 → 开发者 → 允许 Apple 事件中的 JavaScript.
|
|
29
|
+
3. **Grant Automation permission (first run only)**: when the script runs for the first time on a new Mac, macOS pops up *"Terminal (or iTerm/Claude) wants to control 'Google Chrome'"* — click **OK**. To verify or fix later: System Settings → Privacy & Security → Automation → expand your terminal app → ensure **Google Chrome ✓** is checked.
|
|
30
|
+
4. **Log into smzdm.com in this Chrome**: open https://www.smzdm.com/ and sign in. Cookies are per-profile and don't sync from other machines.
|
|
31
|
+
5. **Chrome variant**: if using Chrome Canary or Chrome Beta, set `CHROME_APP="Google Chrome Canary"` (or similar) in the environment — the default is `"Google Chrome"`.
|
|
32
|
+
|
|
33
|
+
If `fetch.sh` returns exit 4 (JS API rejected) → step 2 is the issue. Exit 6 (--check failed) → check the ✗ line. Exit 2 (Chrome not running) → step 1.
|
|
34
|
+
|
|
35
|
+
## Workflow
|
|
36
|
+
|
|
37
|
+
1. **New Mac?** If this is the first time on this machine (no successful prior run in this session, or the user mentions "另一台电脑"/"new computer"/"on this Mac"), run `bash ~/.claude/skills/smzdm-picks/scripts/fetch.sh --check` first and walk through any ✗ items per the Setup section above. Don't proceed to scrape until --check passes.
|
|
38
|
+
2. Run: `bash ~/.claude/skills/smzdm-picks/scripts/fetch.sh [target]`
|
|
39
|
+
- No arg → homepage (`/`) — the personalized recommendation feed
|
|
40
|
+
- `jingxuan` → `/jingxuan/` — curated picks
|
|
41
|
+
- `faxian` → `faxian.smzdm.com` — pure deals stream
|
|
42
|
+
3. The script briefly opens a tab in the user's front Chrome window, waits 6s for JS, extracts items via DOM selectors, closes the tab, restores focus.
|
|
43
|
+
4. Parse the JSON on stdout. Read stderr diagnostics for any warnings (logged_in flag, item count).
|
|
44
|
+
5. Curate and render to the user (see Curation Rules + Output Format below).
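Step 4 can be sketched in Python as below. This is a minimal illustration under stated assumptions: the helper name `parse_fetch_output` is hypothetical, the field names follow the JSON object built by the extractor in `scripts/fetch.sh`, and the sample payload is invented for the example.

```python
import json


def parse_fetch_output(stdout_text):
    """Parse the JSON printed by fetch.sh and collect the two warnings
    the workflow asks the agent to watch for."""
    data = json.loads(stdout_text)
    items = data.get("items", [])
    warnings = []
    if not data.get("logged_in", False):
        warnings.append("not logged in: feed may be generic, not personalized")
    if not items:
        warnings.append("zero items: DOM selectors may be out of date")
    return items, warnings


# Invented sample payload using the extractor's field names.
sample = json.dumps({
    "items": [{
        "title": "机械键盘 示例商品",
        "price": "¥499",
        "link": "https://www.smzdm.com/p/1/",
        "source": "京东",
        "tags": [],
        "value": "",
    }],
    "total_candidates": 1,
    "logged_in": True,
    "url": "https://www.smzdm.com/",
    "title": "smzdm",
    "ts": "2024-01-01T00:00:00.000Z",
})
items, warnings = parse_fetch_output(sample)
```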
|
|
45
|
+
|
|
46
|
+
## Curation Rules
|
|
47
|
+
|
|
48
|
+
From the extracted ≤30 raw items, pick the **10 best** for display:
|
|
49
|
+
|
|
50
|
+
- **Must have** title + link. Items missing either are noise — skip.
|
|
51
|
+
- **Prefer items with a concrete price** (¥XXX) over vague ones ("低至"/"满减"). Sort priced items first.
|
|
52
|
+
- **Skip soft ads**: title containing 赞助 / 广告 / 软广 / 测评推荐.
|
|
53
|
+
- **Diversify sources**: avoid >3 items from the same e-commerce platform (京东/天猫/etc.) in the top 10.
|
|
54
|
+
- **Mark the standout**: prefix the top item with 🔥 if its `value` field (smzdm's 值得买/不值得 index) suggests strong consensus; otherwise add no prefix.
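The rules above can be read as a small filter-and-rank pass. The sketch below is one possible interpretation, not the skill's prescribed implementation: the function name `curate`, the price regex, and the demo items are assumptions, and the 🔥 rule is left out because judging "strong consensus" from the `value` field is left to the agent.

```python
import re
from collections import Counter

SOFT_AD_WORDS = ("赞助", "广告", "软广", "测评推荐")
# Assumption: a "concrete" price is a ¥ sign followed by a digit.
CONCRETE_PRICE = re.compile(r"¥\s*\d")


def curate(items, limit=10, per_source_cap=3):
    """Apply the curation rules: required fields, no soft ads,
    priced items first, and at most `per_source_cap` per platform."""
    usable = [
        it for it in items
        if it.get("title") and it.get("link")
        and not any(word in it["title"] for word in SOFT_AD_WORDS)
    ]
    # sorted() is stable, so items keep their feed order within each group.
    ranked = sorted(
        usable,
        key=lambda it: 0 if CONCRETE_PRICE.search(it.get("price") or "") else 1,
    )
    picked, per_source = [], Counter()
    for it in ranked:
        source = it.get("source") or ""
        if source and per_source[source] >= per_source_cap:
            continue
        per_source[source] += 1
        picked.append(it)
        if len(picked) == limit:
            break
    return picked


# Invented items to exercise each rule.
raw = [
    {"title": "A 键盘", "link": "l1", "price": "低至5折", "source": "京东"},
    {"title": "B 耳机", "link": "l2", "price": "¥199", "source": "京东"},
    {"title": "广告 C", "link": "l3", "price": "¥9", "source": "天猫"},
    {"title": "D 水杯", "link": "", "price": "¥1", "source": "天猫"},
    {"title": "E 鼠标", "link": "l5", "price": "¥5", "source": "京东"},
    {"title": "F 充电器", "link": "l6", "price": "¥6", "source": "京东"},
    {"title": "G 数据线", "link": "l7", "price": "¥7", "source": "京东"},
]
picked = curate(raw)
```

In the demo, the ad and the link-less item are dropped, priced items move ahead of the vague "低至5折" entry, and the cap stops a fourth 京东 item from being picked.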
|
|
55
|
+
|
|
56
|
+
## Output format
|
|
57
|
+
|
|
58
|
+
Mobile-friendly markdown list, no tables. Match `daily-hunt` / `pulse` style:
|
|
59
|
+
|
|
60
|
+
```markdown
|
|
61
|
+
# 🛒 smzdm 今日精选好价 · {today}
|
|
62
|
+
|
|
63
|
+
> 已为你筛选 {N} 条来自{个性化推荐|精选|发现}流的好价
|
|
64
|
+
|
|
65
|
+
1. 🔥 **{标题}** · {商城} · {价格}
|
|
66
|
+
{短链}
|
|
67
|
+
|
|
68
|
+
2. **{标题}** · {商城} · {价格}
|
|
69
|
+
{短链}
|
|
70
|
+
|
|
71
|
+
...
|
|
72
|
+
|
|
73
|
+
— 数据来自 smzdm.com({personalized|curated}),共抓取 {total} 条,展示前 {N}
|
|
74
|
+
```
|
|
75
|
+
|
|
76
|
+
Notes for rendering:
|
|
77
|
+
- Use full-width Chinese punctuation (,。:!) inside Chinese text.
|
|
78
|
+
- Keep each item to ≤ 2 lines; long titles wrap naturally.
|
|
79
|
+
- Link goes on its own line for tappability.
|
|
80
|
+
- If a tag is interesting (like 双 11 / PLUS 会员价 / 限时), append it inline: `· 京东 · ¥499 · 双11 价`.
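A rendering pass matching the template could look like the sketch below. It is a minimal illustration with assumptions: the function name `render_picks` and the `standout` key are hypothetical (the extractor does not emit `standout`; the agent would set it after curation), the link is indented three spaces so it stays inside its list item, and the footer line and inline tags are omitted for brevity.

```python
def render_picks(items, today, feed_label):
    """Build the header, summary line, and numbered items as markdown."""
    lines = [
        f"# 🛒 smzdm 今日精选好价 · {today}",
        "",
        f"> 已为你筛选 {len(items)} 条来自{feed_label}流的好价",
        "",
    ]
    for n, item in enumerate(items, start=1):
        # `standout` is a hypothetical flag set by the curation step.
        fire = "🔥 " if item.get("standout") else ""
        fields = [f"**{item['title']}**", item.get("source") or "", item.get("price") or ""]
        lines.append(f"{n}. {fire}" + " · ".join(f for f in fields if f))
        lines.append(f"   {item['link']}")  # link on its own line for tappability
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Invented items for the demo.
demo = render_picks(
    [
        {"title": "机械键盘", "source": "京东", "price": "¥499",
         "link": "https://www.smzdm.com/p/1/", "standout": True},
        {"title": "保温杯", "price": "¥59", "link": "https://www.smzdm.com/p/2/"},
    ],
    "2024-01-01",
    "个性化推荐",
)
```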
|
|
81
|
+
|
|
82
|
+
## Error handling
|
|
83
|
+
|
|
84
|
+
| Symptom | Likely cause | Tell the user |
|
|
85
|
+
|---|---|---|
|
|
86
|
+
| Exit 2 — "Chrome not running" | Chrome process not found | "请先打开 Chrome 并确认 smzdm.com 已登录"(提示:Chrome Canary/Beta 需设置 `CHROME_APP` 环境变量) |
|
|
87
|
+
| Exit 4 — JS API rejected | Allow JS from Apple Events disabled | Walk through the one-time Setup above. Also check System Settings → Privacy & Security → Automation. |
|
|
88
|
+
| Exit 5 — empty result | Page load slow / DOM changed | "smzdm 页面没在 6 秒内加载完,请稍后重试" |
|
|
89
|
+
| Exit 6 — --check failed | New machine setup incomplete | Use the ✓/✗ output from `--check` to pinpoint which step is missing |
|
|
90
|
+
| `logged_in: false` in JSON | Chrome session expired on this Mac | "Chrome 里 smzdm 登录已过期 / 这台电脑还没登录过,请到浏览器登录后再试" |
|
|
91
|
+
| 0 items extracted | DOM selectors out of date | "smzdm 页面结构可能变了,需要更新 fetch.sh 里的 selector" |
|
|
92
|
+
|
|
93
|
+
Always show the **raw stderr** from the script when reporting an error — it has the exit code and hint.
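One way to wire the table into an agent-side wrapper is sketched below. The exit codes are the ones documented in `scripts/fetch.sh`; the helper names `describe_exit` and `run_fetch` and the hint wording are assumptions made for this example.

```python
import subprocess

# Exit codes as documented in the fetch.sh header; hint text is paraphrased.
EXIT_HINTS = {
    2: "Chrome not running: open Chrome and log into smzdm.com",
    3: "osascript invocation failed",
    4: "JS API rejected: enable Allow JavaScript from Apple Events",
    5: "empty result: page slow to load or DOM changed",
    6: "--check failed: read the ✗ line in stderr",
}


def describe_exit(code, stderr_text):
    """Turn an exit code into a short hint and always attach the raw stderr."""
    if code == 0:
        return "ok"
    hint = EXIT_HINTS.get(code, "unknown exit code")
    return f"exit {code}: {hint}\n--- raw stderr ---\n{stderr_text}"


def run_fetch(script_path, target=None):
    """Run fetch.sh and return (stdout, status message). macOS only."""
    cmd = ["bash", script_path] + ([target] if target else [])
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.stdout, describe_exit(proc.returncode, proc.stderr)
```

Keeping `describe_exit` separate from the subprocess call lets the mapping be checked without Chrome or macOS.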
|
|
94
|
+
|
|
95
|
+
## Limitations
|
|
96
|
+
|
|
97
|
+
- **Per-machine setup**: this skill won't auto-work on a new Mac. Each Mac needs Chrome's JS-from-AppleEvents toggle, macOS Automation permission, and a fresh smzdm.com login. Run `--check` on new machines.
|
|
98
|
+
- **macOS-only**: relies on AppleScript + Chrome AppleScript bindings. Won't work on Linux/Windows without a port.
|
|
99
|
+
- **Briefly steals Chrome focus** during the 6s scrape (opens then closes a tab in the front window).
|
|
100
|
+
- **Cookie freshness**: smzdm cookies typically last 30+ days, but logging out, clearing cookies, or switching to a different Chrome profile invalidates the session.
|
|
101
|
+
- **DOM drift**: selectors will need updating as smzdm redesigns. If extraction returns 0 items repeatedly, update the `candidates` list in `scripts/fetch.sh`.
|
|
102
|
+
- **Read-only digest**: no price history, no alerts, no auto-purchasing.
|
|
103
|
+
|
|
104
|
+
## Out of scope
|
|
105
|
+
|
|
106
|
+
- Scheduled push (cron/notifications) — manual trigger only.
|
|
107
|
+
- Multi-day trend tracking — stateless per invocation.
|
|
108
|
+
- Falling back to non-personalized deals (public RSS): if login breaks, fix the login rather than degrading silently.
|
|
@@ -0,0 +1,296 @@
|
|
|
1
|
+
#!/usr/bin/env bash
|
|
2
|
+
# fetch.sh — Drive the user's already-logged-in Chrome via AppleScript to
|
|
3
|
+
# scrape smzdm.com personalized feed and output items as JSON on stdout.
|
|
4
|
+
#
|
|
5
|
+
# Usage:
|
|
6
|
+
# bash fetch.sh # scrape homepage (/)
|
|
7
|
+
# bash fetch.sh jingxuan # scrape 精选 page (/jingxuan/)
|
|
8
|
+
#   bash fetch.sh faxian       # scrape 好价 page (https://faxian.smzdm.com/)
|
|
9
|
+
# bash fetch.sh --check # setup self-check (no scrape, just verify env)
|
|
10
|
+
#
|
|
11
|
+
# Environment:
|
|
12
|
+
# CHROME_APP Chrome app name (default: "Google Chrome"). Set to
|
|
13
|
+
# "Google Chrome Canary" / "Google Chrome Beta" if needed.
|
|
14
|
+
#
|
|
15
|
+
# Exit codes:
|
|
16
|
+
# 0 success — JSON written to stdout (or all checks passed in --check mode)
|
|
17
|
+
# 2 Chrome not running
|
|
18
|
+
# 3 osascript invocation failed
|
|
19
|
+
# 4 Chrome JS API rejected (Allow JS from Apple Events likely disabled)
|
|
20
|
+
# 5 empty / unparseable result
|
|
21
|
+
# 6 --check found a problem (details on stderr)
|
|
22
|
+
#
|
|
23
|
+
# Diagnostics go to stderr. JSON goes to stdout.
|
|
24
|
+
|
|
25
|
+
set -uo pipefail
|
|
26
|
+
|
|
27
|
+
CHROME_APP="${CHROME_APP:-Google Chrome}"
|
|
28
|
+
TARGET="${1:-home}"
|
|
29
|
+
CHECK_MODE=0
|
|
30
|
+
|
|
31
|
+
case "$TARGET" in
|
|
32
|
+
home) URL="https://www.smzdm.com/" ;;
|
|
33
|
+
jingxuan) URL="https://www.smzdm.com/jingxuan/" ;;
|
|
34
|
+
faxian) URL="https://faxian.smzdm.com/" ;;
|
|
35
|
+
--check) CHECK_MODE=1; URL="https://www.smzdm.com/" ;;
|
|
36
|
+
*) URL="$TARGET" ;; # allow passing full URL
|
|
37
|
+
esac
|
|
38
|
+
|
|
39
|
+
# 1. Chrome must be running
|
|
40
|
+
if ! pgrep -x "$CHROME_APP" > /dev/null; then
|
|
41
|
+
echo "ERROR: $CHROME_APP is not running. Open it and make sure smzdm.com is logged in, then retry." >&2
|
|
42
|
+
echo "(If you use Chrome Canary/Beta, set: CHROME_APP=\"Google Chrome Canary\" bash fetch.sh)" >&2
|
|
43
|
+
exit 2
|
|
44
|
+
fi
|
|
45
|
+
|
|
46
|
+
# --check mode: run a trivial JS probe instead of full scrape, report each setup step
|
|
47
|
+
if [ "$CHECK_MODE" = "1" ]; then
|
|
48
|
+
echo "[smzdm-picks --check] running setup verification..." >&2
|
|
49
|
+
echo " ✓ $CHROME_APP is running" >&2
|
|
50
|
+
|
|
51
|
+
PROBE=$(osascript <<OSAEOF 2>&1
|
|
52
|
+
tell application "$CHROME_APP"
|
|
53
|
+
if (count of windows) is 0 then return "__OSAERROR__:0:no windows"
|
|
54
|
+
try
|
|
55
|
+
set probe to execute (active tab of front window) javascript "1+1"
|
|
56
|
+
return "OK:" & probe
|
|
57
|
+
on error errMsg number errNum
|
|
58
|
+
return "__OSAERROR__:" & errNum & ":" & errMsg
|
|
59
|
+
end try
|
|
60
|
+
end tell
|
|
61
|
+
OSAEOF
|
|
62
|
+
)
|
|
63
|
+
case "$PROBE" in
|
|
64
|
+
OK:2)
|
|
65
|
+
echo " ✓ AppleScript→Chrome JS injection works (View → Developer → Allow JavaScript from Apple Events is enabled)" >&2
|
|
66
|
+
;;
|
|
67
|
+
__OSAERROR__:*)
|
|
68
|
+
echo " ✗ AppleScript→Chrome JS injection rejected: $PROBE" >&2
|
|
69
|
+
echo " Fix: $CHROME_APP menu → View → Developer → Allow JavaScript from Apple Events ✓" >&2
|
|
70
|
+
echo " (If 'Developer' is hidden: Chrome → Settings → search 'developer menu' → enable)" >&2
|
|
71
|
+
echo " Also check System Settings → Privacy & Security → Automation: allow your terminal to control $CHROME_APP" >&2
|
|
72
|
+
exit 6
|
|
73
|
+
;;
|
|
74
|
+
*)
|
|
75
|
+
echo " ✗ Unexpected probe response: $PROBE" >&2
|
|
76
|
+
exit 6
|
|
77
|
+
;;
|
|
78
|
+
esac
|
|
79
|
+
|
|
80
|
+
# Login check: visit smzdm and look for login markers
|
|
81
|
+
LOGIN_PROBE=$(osascript <<OSAEOF 2>&1
|
|
82
|
+
tell application "$CHROME_APP"
|
|
83
|
+
set originalIndex to active tab index of front window
|
|
84
|
+
set t to make new tab at end of tabs of front window with properties {URL:"https://www.smzdm.com/"}
|
|
85
|
+
delay 5
|
|
86
|
+
set logged to "false"
|
|
87
|
+
try
|
|
88
|
+
set logged to execute t javascript "(document.cookie.indexOf('sess=')>=0||document.cookie.indexOf('user=')>=0||!!document.querySelector('[class*=\"user-info\"],[class*=\"nickname\"],[class*=\"avatar\"]'))+''"
|
|
89
|
+
end try
|
|
90
|
+
try
|
|
91
|
+
close t
|
|
92
|
+
end try
|
|
93
|
+
try
|
|
94
|
+
set active tab index of front window to originalIndex
|
|
95
|
+
end try
|
|
96
|
+
return logged
|
|
97
|
+
end tell
|
|
98
|
+
OSAEOF
|
|
99
|
+
)
|
|
100
|
+
if [ "$LOGIN_PROBE" = "true" ]; then
|
|
101
|
+
echo " ✓ smzdm.com login state detected in Chrome" >&2
|
|
102
|
+
echo "" >&2
|
|
103
|
+
echo "All checks passed. Run 'bash fetch.sh' to scrape." >&2
|
|
104
|
+
exit 0
|
|
105
|
+
else
|
|
106
|
+
echo " ✗ smzdm.com not logged in (probe returned: $LOGIN_PROBE)" >&2
|
|
107
|
+
echo " Fix: open https://www.smzdm.com/ in $CHROME_APP and log in, then retry." >&2
|
|
108
|
+
exit 6
|
|
109
|
+
fi
|
|
110
|
+
fi
|
|
111
|
+
|
|
112
|
+
# 2. Write JS extractor to a tmpfile (avoids AppleScript string escaping hell)
|
|
113
|
+
JS_TMP=$(mktemp -t smzdm_extract.XXXXXX)
|
|
114
|
+
mv "$JS_TMP" "${JS_TMP}.js"
|
|
115
|
+
JS_TMP="${JS_TMP}.js"
|
|
116
|
+
trap 'rm -f "$JS_TMP"' EXIT
|
|
117
|
+
|
|
118
|
+
cat > "$JS_TMP" <<'JSEOF'
|
|
119
|
+
(function () {
|
|
120
|
+
function txt(el) {
|
|
121
|
+
return el ? (el.innerText || el.textContent || '').trim() : '';
|
|
122
|
+
}
|
|
123
|
+
|
|
124
|
+
function findPrice(root) {
|
|
125
|
+
// Look for leaf elements containing ¥/元/价 — usually the price chip
|
|
126
|
+
const all = root.querySelectorAll('*');
|
|
127
|
+
for (const n of all) {
|
|
128
|
+
if (n.children.length > 0) continue;
|
|
129
|
+
const t = txt(n);
|
|
130
|
+
if (t.length > 0 && t.length < 40 && /(¥|元|售价|价格|低至)/.test(t)) {
|
|
131
|
+
return t;
|
|
132
|
+
}
|
|
133
|
+
}
|
|
134
|
+
return '';
|
|
135
|
+
}
|
|
136
|
+
|
|
137
|
+
function findSource(root) {
|
|
138
|
+
const re = /^(京东|天猫|淘宝|拼多多|官网|苏宁|亚马逊|网易严选|得物|抖音|京东国际|京东自营|天猫超市|山姆|Costco|考拉|唯品会)$/;
|
|
139
|
+
const all = root.querySelectorAll('*');
|
|
140
|
+
for (const n of all) {
|
|
141
|
+
const t = txt(n);
|
|
142
|
+
if (t.length > 0 && t.length < 20 && re.test(t)) return t;
|
|
143
|
+
}
|
|
144
|
+
return '';
|
|
145
|
+
}
|
|
146
|
+
|
|
147
|
+
function findTags(root) {
|
|
148
|
+
return [...root.querySelectorAll('[class*="tag"], [class*="Tag"], [class*="label"]')]
|
|
149
|
+
.map(n => txt(n))
|
|
150
|
+
      .filter(t => t && t.length < 20)
|
|
151
|
+
.slice(0, 5);
|
|
152
|
+
}
|
|
153
|
+
|
|
154
|
+
// Try multiple selector strategies — smzdm changes class names periodically
|
|
155
|
+
const candidates = document.querySelectorAll(
|
|
156
|
+
'[class*="feed-row"], [class*="feed-block"], [data-feed-id], article, [class*="z-feed"], [class*="feed-item"]'
|
|
157
|
+
);
|
|
158
|
+
|
|
159
|
+
const items = [];
|
|
160
|
+
const seen = new Set();
|
|
161
|
+
|
|
162
|
+
for (const el of candidates) {
|
|
163
|
+
// Title
|
|
164
|
+
const titleEl = el.querySelector('[class*="title"], h5, h3, h4, h2, a[title]');
|
|
165
|
+
let title = txt(titleEl);
|
|
166
|
+
if (!title && titleEl) title = titleEl.getAttribute('title') || '';
|
|
167
|
+
if (!title || title.length < 4) continue;
|
|
168
|
+
|
|
169
|
+
// Link — prefer post/faxian links, fall back to any href
|
|
170
|
+
const linkEl = el.querySelector(
|
|
171
|
+
'a[href*="//post.smzdm.com/"], a[href*="//faxian.smzdm.com/"], a[href*="//www.smzdm.com/p/"], a[href]'
|
|
172
|
+
);
|
|
173
|
+
let link = linkEl ? linkEl.href : '';
|
|
174
|
+
if (!link || link === window.location.href) continue;
|
|
175
|
+
if (seen.has(link)) continue;
|
|
176
|
+
seen.add(link);
|
|
177
|
+
|
|
178
|
+
const price = findPrice(el);
|
|
179
|
+
const source = findSource(el);
|
|
180
|
+
const tags = findTags(el);
|
|
181
|
+
|
|
182
|
+
// Hot/value score if present
|
|
183
|
+
const valueEl = el.querySelector('[class*="value"], [class*="zhi"], [class*="hot"]');
|
|
184
|
+
const value = txt(valueEl);
|
|
185
|
+
|
|
186
|
+
items.push({ title, price, link, source, tags, value });
|
|
187
|
+
}
|
|
188
|
+
|
|
189
|
+
// Login detection — multiple heuristics
|
|
190
|
+
const loggedIn = !!(
|
|
191
|
+
document.querySelector('[class*="user-info"]') ||
|
|
192
|
+
document.querySelector('[class*="nickname"]') ||
|
|
193
|
+
document.querySelector('[class*="userpic"]') ||
|
|
194
|
+
document.querySelector('[class*="avatar"]') ||
|
|
195
|
+
document.cookie.includes('user=')
|
|
196
|
+
);
|
|
197
|
+
|
|
198
|
+
return JSON.stringify({
|
|
199
|
+
items: items.slice(0, 30),
|
|
200
|
+
total_candidates: candidates.length,
|
|
201
|
+
logged_in: loggedIn,
|
|
202
|
+
url: location.href,
|
|
203
|
+
title: document.title,
|
|
204
|
+
ts: new Date().toISOString()
|
|
205
|
+
});
|
|
206
|
+
})();
|
|
207
|
+
JSEOF
|
|
208
|
+
|
|
209
|
+
# 3. Drive Chrome via AppleScript
|
|
210
|
+
RESULT=$(osascript <<OSAEOF 2>&1
|
|
211
|
+
set jsFile to POSIX file "$JS_TMP"
|
|
212
|
+
set jsCode to read jsFile as «class utf8»
|
|
213
|
+
|
|
214
|
+
tell application "$CHROME_APP"
|
|
215
|
+
if (count of windows) is 0 then
|
|
216
|
+
make new window
|
|
217
|
+
end if
|
|
218
|
+
|
|
219
|
+
-- Remember which tab was active so we can restore focus
|
|
220
|
+
set originalIndex to active tab index of front window
|
|
221
|
+
|
|
222
|
+
-- Open target in a new tab
|
|
223
|
+
set newTab to make new tab at end of tabs of front window with properties {URL:"$URL"}
|
|
224
|
+
|
|
225
|
+
-- Wait for JS-rendered content
|
|
226
|
+
delay 6
|
|
227
|
+
|
|
228
|
+
set extractedJSON to ""
|
|
229
|
+
try
|
|
230
|
+
set extractedJSON to execute newTab javascript jsCode
|
|
231
|
+
on error errMsg number errNum
|
|
232
|
+
set extractedJSON to "__OSAERROR__:" & errNum & ":" & errMsg
|
|
233
|
+
end try
|
|
234
|
+
|
|
235
|
+
-- Close the scrape tab
|
|
236
|
+
try
|
|
237
|
+
close newTab
|
|
238
|
+
end try
|
|
239
|
+
|
|
240
|
+
-- Restore focus
|
|
241
|
+
try
|
|
242
|
+
set active tab index of front window to originalIndex
|
|
243
|
+
end try
|
|
244
|
+
|
|
245
|
+
return extractedJSON
|
|
246
|
+
end tell
|
|
247
|
+
OSAEOF
|
|
248
|
+
)
|
|
249
|
+
|
|
250
|
+
OSA_EXIT=$?
|
|
251
|
+
|
|
252
|
+
# 4. Handle osascript failures
|
|
253
|
+
if [ $OSA_EXIT -ne 0 ]; then
|
|
254
|
+
echo "ERROR: osascript invocation failed (exit $OSA_EXIT)." >&2
|
|
255
|
+
echo "Raw output: $RESULT" >&2
|
|
256
|
+
exit 3
|
|
257
|
+
fi
|
|
258
|
+
|
|
259
|
+
# 5. Detect JS API rejection
|
|
260
|
+
if [[ "$RESULT" == __OSAERROR__:* ]]; then
|
|
261
|
+
echo "ERROR: Chrome rejected the JS injection: $RESULT" >&2
|
|
262
|
+
echo "" >&2
|
|
263
|
+
echo "HINT (most common cause): enable the JS-from-AppleEvents toggle in Chrome:" >&2
|
|
264
|
+
echo " Chrome menu → View / 查看 → Developer / 开发者 → Allow JavaScript from Apple Events / 允许 Apple 事件中的 JavaScript ✓" >&2
|
|
265
|
+
  echo "  (If 'Developer' submenu is hidden: Chrome → Settings → search 'developer menu' → enable)" >&2
|
|
266
|
+
exit 4
|
|
267
|
+
fi
|
|
268
|
+
|
|
269
|
+
# 6. Empty result?
|
|
270
|
+
if [[ -z "$RESULT" || "$RESULT" == "missing value" ]]; then
|
|
271
|
+
echo "ERROR: Empty result from Chrome JS. Possible causes:" >&2
|
|
272
|
+
echo " - 'Allow JavaScript from Apple Events' is not enabled (see Chrome → View → Developer)" >&2
|
|
273
|
+
echo " - Page didn't finish loading in 6s (try again)" >&2
|
|
274
|
+
echo " - smzdm DOM structure changed" >&2
|
|
275
|
+
exit 5
|
|
276
|
+
fi
|
|
277
|
+
|
|
278
|
+
# 7. Output JSON to stdout
|
|
279
|
+
echo "$RESULT"
|
|
280
|
+
|
|
281
|
+
# 8. Diagnostics to stderr
|
|
282
|
+
python3 - <<PY
|
|
283
|
+
import sys, json
|
|
284
|
+
try:
|
|
285
|
+
    d = json.loads(r'''$RESULT''')  # raw string so JSON backslash escapes reach json.loads intact
|
|
286
|
+
items = d.get('items', [])
|
|
287
|
+
logged = d.get('logged_in', '?')
|
|
288
|
+
cand = d.get('total_candidates', '?')
|
|
289
|
+
sys.stderr.write(f"[smzdm-picks] target={'$URL'} items={len(items)} candidates={cand} logged_in={logged}\n")
|
|
290
|
+
if not logged:
|
|
291
|
+
sys.stderr.write("WARNING: Not logged in — feed may be generic, not personalized. Re-login in Chrome.\n")
|
|
292
|
+
if len(items) == 0:
|
|
293
|
+
sys.stderr.write("WARNING: Zero items extracted. Page DOM likely changed; check the extractor selectors.\n")
|
|
294
|
+
except Exception as e:
|
|
295
|
+
sys.stderr.write(f"[smzdm-picks] (could not parse result for diagnostics: {e})\n")
|
|
296
|
+
PY
|
|
@@ -0,0 +1,5 @@
|
|
|
1
|
+
{
|
|
2
|
+
"name": "smzdm-picks",
|
|
3
|
+
"version": "0.1.0",
|
|
4
|
+
"description": "Fetch personalized 什么值得买 (smzdm.com) deal picks from the user's already-logged-in Chrome on macOS. Drives Chrome via AppleScript + JS injection to bypass anti-bot. Per-machine setup; includes --check mode for new-Mac self-verification."
|
|
5
|
+
}
|