@saulwade/swl-ses 1.6.3 → 1.6.5
- package/CLAUDE.md +3 -3
- package/README.md +2 -2
- package/agentes/gh-fix-ci-swl.md +275 -0
- package/agentes/nemesis-auditor-swl.md +90 -1
- package/comandos/swl/exportar-vault.md +106 -14
- package/comandos/swl/nemesis.md +70 -3
- package/comandos/swl/release.md +62 -2
- package/comandos/swl/salud.md +32 -0
- package/comandos/swl/verificar.md +116 -2
- package/habilidades/agent-browser/SKILL.md +111 -4
- package/habilidades/agent-deep-links/SKILL.md +148 -0
- package/habilidades/backend-async-postgres-testing/SKILL.md +215 -0
- package/habilidades/backend-error-design/SKILL.md +221 -0
- package/habilidades/browser-interaction-patterns/SKILL.md +514 -0
- package/habilidades/browser-research-domains/SKILL.md +635 -0
- package/habilidades/changelog-generator/SKILL.md +172 -0
- package/habilidades/changelog-generator/scripts/parse-commits.js +354 -0
- package/habilidades/devsecops-pipeline-security/SKILL.md +3 -0
- package/habilidades/fastapi-experto/SKILL.md +49 -4
- package/habilidades/harness-claude-code/SKILL.md +4 -1
- package/habilidades/postgresql-experto/SKILL.md +80 -4
- package/habilidades/proceso-discovery-machote/SKILL.md +157 -0
- package/habilidades/proceso-modular-split/SKILL.md +256 -0
- package/habilidades/tdd-workflow/SKILL.md +12 -5
- package/hooks/extraccion-aprendizajes.js +8 -0
- package/hooks/lib/deep-links.js +185 -0
- package/hooks/lib/evolution-tracker.js +115 -18
- package/hooks/lib/gateway-notify.js +70 -7
- package/manifiestos/modulos.json +13 -3
- package/manifiestos/skills-lock.json +1247 -1191
- package/package.json +3 -3
- package/plugin.json +11 -2
- package/reglas/arquitectura.md +38 -0
- package/reglas/arreglar-al-detectar.md +93 -0
- package/reglas/auditorias-documentales-estructurales.md +38 -0
- package/reglas/registro-componentes-nuevos.md +14 -0
- package/reglas/tests-cleanup.md +220 -0
- package/scripts/lib/mcp_config.py +29 -14
- package/scripts/mcp-orchestrator.py +153 -131
- package/scripts/mcp-pool-manager.py +132 -107
- package/scripts/mcp-telemetry.py +139 -120
- package/scripts/verificar-release.js +199 -1
|
@@ -0,0 +1,635 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: browser-research-domains
|
|
3
|
+
description: >
|
|
4
|
+
  Per-domain shortcuts for technical research that avoid opening a browser when
  a documented API is available. Covers GitHub, ArXiv, Hacker News, Stack
  Overflow, PubMed, OpenAlex, and SEC EDGAR with the correct endpoint, batch fetch,
  rate limits, and gotchas. Load when investigador-swl, agent-browser, or a
  research flow consumes multiple pages from any of these domains
  and the goal is to reduce tokens and latency.
|
|
10
|
+
version: "1.0.0"
|
|
11
|
+
herramientasPermitidas: [Read, Bash, WebFetch]
|
|
12
|
+
evolved: false
|
|
13
|
+
fuente: "browser-use/browser-harness — domain-skills (MIT License, 2026)"
|
|
14
|
+
evolvable: true
|
|
15
|
+
exclusiones:
|
|
16
|
+
  - "Do not load if the research touches a domain not listed here; use agent-browser directly."
  - "Do not load for automation with login/interaction; this is read-only research via public APIs."
  - "Do not load for tasks that require JS-rendered content; use agent-browser or browser-interaction-patterns."
  - "Do not load for academic PDF downloads; use swl-markitdown on the direct PDF URL."
|
|
20
|
+
---
|
|
21
|
+
|
|
22
|
+
# Skill: browser-research-domains
|
|
23
|
+
|
|
24
|
+
For the 7 domains covered here, **NEVER open a browser** (the rare exceptions are
listed under the universal rule). All the data is reachable via `http_get` plus a
REST/Atom/XML API, with no auth or optional auth. This cuts tokens by 20-50× and
latency by 5-20×.

Adapted from `browser-use/browser-harness/agent-workspace/domain-skills/`
(MIT License). The snippets are runtime-agnostic: they work in any Python
context that has `urllib.request` or an equivalent. In SWL, call them via the
`agent-browser` CLI when necessary, or directly with `Bash + curl/python`
if the situation allows.
|
|
33
|
+
|
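The snippets in this skill call `http_get` without defining it. The following is a minimal sketch of such a helper built on `urllib.request`, under the assumption that it takes a URL plus optional headers and returns decoded text; the real harness helper may differ. The default User-Agent value and the `timeout` parameter are hypothetical. Gzip is detected from the payload's magic bytes because some APIs (Stack Exchange, for one) always compress their responses.

```python
import gzip
import urllib.request

# Hypothetical default; replace with an identifier for your own project.
DEFAULT_UA = "research-script/0.1 (contact@example.com)"


def http_get(url, headers=None, timeout=20):
    """Fetch a URL and return the response body as text (assumed contract)."""
    merged = {"User-Agent": DEFAULT_UA, **(headers or {})}
    req = urllib.request.Request(url, headers=merged)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = resp.read()
    # Transparently decompress gzip payloads (detected by magic bytes).
    if body[:2] == b"\x1f\x8b":
        body = gzip.decompress(body)
    return body.decode("utf-8", errors="replace")
```

HTTP errors are left to propagate as `urllib.error.HTTPError`, which matches the behaviour the per-domain gotchas describe.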
|
34
|
+
## When to load

- Technical research that will touch GitHub, ArXiv, Hacker News, Stack Overflow, PubMed,
  OpenAlex, or SEC EDGAR.
- A need to batch-fetch (10+ items) from any of these domains.
- `agent-browser` is being invoked for something that has a documented API.
|
|
40
|
+
|
|
41
|
+
## When NOT to load

- Research that crosses into domains NOT listed here.
- Automation tasks that involve interaction (login, forms).
- Academic PDF downloads: use `swl-markitdown` on the PDF URL.
|
|
46
|
+
|
|
47
|
+
## Universal rule: API before browser

For the 7 domains covered, a browser is only needed for:
- The GitHub trending page (server-side rendered, no API equivalent).
- MathJax rendering on Stack Overflow (rare).
- Any content outside the public API.

**Everything else is done with `http_get` + JSON/XML.** Typical gains:
- `arxiv`: browser 5-8 s per paper vs. API batch 1.9 s per 10 papers (~25× faster).
- `hackernews`: browser 3-8 s per page vs. `http_get` + regex 170 ms.
|
|
57
|
+
|
|
58
|
+
---
|
|
59
|
+
|
|
60
|
+
## 1. GitHub
|
|
61
|
+
|
|
62
|
+
A mix of REST API and browser. The browser is needed only for `/trending` (server-side
rendered, no API equivalent). Everything else goes through REST.
|
|
64
|
+
|
|
65
|
+
### Repo metadata

```python
import json

# {owner} and {repo} are placeholders for the repository path.
data = json.loads(http_get("https://api.github.com/repos/{owner}/{repo}"))
# Fields: stargazers_count, forks_count, description, language, topics,
#         open_issues_count, created_at, updated_at, pushed_at, default_branch,
#         license, homepage, visibility
```
|
|
74
|
+
|
|
75
|
+
### File contents: `raw.githubusercontent.com`

No auth and no base64 decoding, and requests do not count against the REST API quota.
|
|
78
|
+
|
|
79
|
+
```python
|
|
80
|
+
readme = http_get("https://raw.githubusercontent.com/owner/repo/main/README.md")
|
|
81
|
+
```
|
|
82
|
+
|
|
83
|
+
### Repo search
|
|
84
|
+
|
|
85
|
+
```python
|
|
86
|
+
results = json.loads(http_get(
|
|
87
|
+
"https://api.github.com/search/repositories"
|
|
88
|
+
"?q=browser+automation+language:python&sort=stars&per_page=10"
|
|
89
|
+
))
|
|
90
|
+
```
|
|
91
|
+
|
|
92
|
+
Search rate limit: **10 req/min unauthenticated** (separate from the core limit of 60 req/hour).
|
|
93
|
+
|
|
94
|
+
### Trending page: the only case that needs a browser

```python
import json

goto_url("https://github.com/trending")  # or /trending/python?since=weekly
wait_for_load()
wait(2)  # React hydration continues after readyState=complete

result = js("""
(function(){
  var rows = Array.from(document.querySelectorAll('article.Box-row'));
  return JSON.stringify(rows.map(function(el){
    var h2link = el.querySelector('h2 a');
    var starLink = el.querySelector('a[href*="/stargazers"]');
    return {
      name: h2link?.innerText.trim().replace(/\\s+/g,' '),
      url: h2link ? 'https://github.com' + h2link.getAttribute('href') : null,
      stars_total: starLink?.innerText.trim()
    };
  }));
})()
""")
```
|
|
117
|
+
|
|
118
|
+
### Auth
|
|
119
|
+
|
|
120
|
+
Without a token: 60 req/hour (core), 10 req/min (search).
With `GITHUB_TOKEN`: 5,000 req/hour (core), 30 req/min (search).

```python
import os

headers = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
           "X-GitHub-Api-Version": "2022-11-28"}
```
|
|
127
|
+
|
|
128
|
+
### Gotchas
|
|
129
|
+
|
|
130
|
+
- A 404 raises `urllib.error.HTTPError` rather than returning a JSON error body.
- Code search (`/search/code`) requires auth and returns 401 without a token.
- Trending star counts arrive as strings such as `"4,548"`; convert with `int(s.replace(',', ''))`.
|
|
133
|
+
|
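As a small illustration of the last gotcha, the sketch below normalizes rows shaped like the output of the trending scrape (`name`, `url`, `stars_total`). The helper names and the sample rows are hypothetical, and collapsing `owner / repo` into `owner/repo` assumes the heading text is rendered with spaces around the slash.

```python
def parse_star_count(text):
    """Convert a trending star string such as "4,548" to an int, or None."""
    if not text:
        return None
    cleaned = text.replace(",", "").strip()
    return int(cleaned) if cleaned.isdigit() else None


def normalize_trending(rows):
    """Turn scraped rows into dicts with a compact name and numeric stars."""
    return [
        {
            "name": (row.get("name") or "").replace(" ", ""),
            "url": row.get("url"),
            "stars": parse_star_count(row.get("stars_total")),
        }
        for row in rows
    ]


# Hypothetical sample rows in the shape produced by the scrape above.
sample_rows = [
    {"name": "browser-use / browser-use",
     "url": "https://github.com/browser-use/browser-use",
     "stars_total": "4,548"},
    {"name": "owner / repo", "url": "https://github.com/owner/repo",
     "stars_total": None},
]
repos = normalize_trending(sample_rows)
```

Returning `None` for a missing or non-numeric count keeps a single malformed row from aborting the whole batch.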
|
134
|
+
---
|
|
135
|
+
|
|
136
|
+
## 2. ArXiv
|
|
137
|
+
|
|
138
|
+
**NEVER use a browser for ArXiv.** Everything is accessible via the Atom API. `id_list`
supports batches of up to 200 IDs per call.
|
|
140
|
+
|
|
141
|
+
### Batch fetch (~3× faster than parallel single calls)

```python
import xml.etree.ElementTree as ET

NS = {'atom': 'http://www.w3.org/2005/Atom',
      'arxiv': 'http://arxiv.org/schemas/atom'}

ids = ['1706.03762', '1810.04805', '2005.14165']
xml = http_get(
    f"http://export.arxiv.org/api/query?id_list={','.join(ids)}"
    f"&max_results={len(ids)}"
)
root = ET.fromstring(xml)
for e in root.findall('atom:entry', NS):
    arxiv_id = e.find('atom:id', NS).text.split('/')[-1]
    title = e.find('atom:title', NS).text.strip()
    summary_el = e.find('atom:summary', NS)  # can be missing, see Gotchas
    abstract = summary_el.text.strip() if summary_el is not None else None
    pdf_link = next((l.get('href') for l in e.findall('atom:link', NS)
                     if l.get('title') == 'pdf'), None)
```

Confirmed: a batch of 10 IDs took 1.91 s; 10 parallel single calls took 6.34 s.
|
|
163
|
+
|
|
164
|
+
### Search by category
|
|
165
|
+
|
|
166
|
+
```python
|
|
167
|
+
xml = http_get(
|
|
168
|
+
"http://export.arxiv.org/api/query"
|
|
169
|
+
"?search_query=ti:transformer+AND+cat:cs.LG"
|
|
170
|
+
"&max_results=10&sortBy=submittedDate&sortOrder=descending"
|
|
171
|
+
)
|
|
172
|
+
```
|
|
173
|
+
|
|
174
|
+
Field prefixes: `ti:` title, `au:` author, `abs:` abstract, `cat:` category,
|
|
175
|
+
`all:` all fields. Boolean: `AND`/`OR`/`ANDNOT`.
|
|
176
|
+
|
|
177
|
+
### URL construction
|
|
178
|
+
|
|
179
|
+
```python
import re

arxiv_id = "1706.03762v7"
pdf_versioned = f"https://arxiv.org/pdf/{arxiv_id}"
# Strip the trailing version suffix to point at the latest version.
base_id = re.sub(r'v\d+$', '', arxiv_id)  # "1706.03762"
pdf_latest = f"https://arxiv.org/pdf/{base_id}"
```
|
|
184
|
+
|
|
185
|
+
### Gotchas
|
|
186
|
+
|
|
187
|
+
- Rate limit: wait 3 s between requests when bulk crawling. Quick bursts of ~10
  requests work without being blocked.
- `atom:id` is a full URL such as `http://arxiv.org/abs/1706.03762v7`; always take
  `.split('/')[-1]`.
- A batch `id_list` returns entries in unpredictable order; index by ID, not
  by position.
- ~5% of papers have no `atom:summary`; guard with `if el is not None`.
- `max_results` is capped at 2000 per call. Paginate with a `start` offset plus a 3 s sleep.
|
|
195
|
+
|
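The ordering and missing-summary gotchas can be handled together by indexing parsed entries by their version-less ID. The sketch below runs against an inline sample feed rather than the live API; `index_entries` and `SAMPLE_FEED` are hypothetical names, and the sample deliberately returns entries in a different order than requested and omits one summary.

```python
import re
import xml.etree.ElementTree as ET

NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical, trimmed-down feed: reversed order, second entry has no summary.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/1810.04805v2</id>
    <title>BERT: Pre-training of Deep
      Bidirectional Transformers</title>
    <summary>We introduce BERT.</summary>
  </entry>
  <entry>
    <id>http://arxiv.org/abs/1706.03762v7</id>
    <title>Attention Is All You Need</title>
  </entry>
</feed>"""


def index_entries(feed_xml):
    """Return {base_id: paper dict} so callers never rely on entry order."""
    papers = {}
    for entry in ET.fromstring(feed_xml).findall("atom:entry", NS):
        versioned = entry.find("atom:id", NS).text.split("/")[-1]
        base_id = re.sub(r"v\d+$", "", versioned)
        # Titles can wrap across lines; collapse internal whitespace.
        title = " ".join(entry.find("atom:title", NS).text.split())
        summary_el = entry.find("atom:summary", NS)
        has_text = summary_el is not None and summary_el.text
        abstract = summary_el.text.strip() if has_text else None
        papers[base_id] = {"version": versioned, "title": title,
                           "abstract": abstract}
    return papers


requested = ["1706.03762", "1810.04805"]
papers = index_entries(SAMPLE_FEED)
ordered = [papers.get(pid) for pid in requested]  # back in request order
```

Looking results up with `papers.get(pid)` also makes an ID that the API silently dropped show up as `None` instead of shifting every later result by one position.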
|
196
|
+
---
|
|
197
|
+
|
|
198
|
+
## 3. Hacker News
|
|
199
|
+
|
|
200
|
+
Three access paths (HTML scrape, Algolia, Firebase), none of which needs a browser:

| Goal | Approach | Latency |
|------|----------|---------|
| Front page (30 stories) | `http_get` + regex | ~170ms |
| Historical / keyword search | Algolia API | ~400ms |
| Full comment tree | Algolia items API | ~300ms |
| Specific item | Firebase API | ~200ms |
| 500 ranked IDs | Firebase topstories | ~200ms |
|
|
209
|
+
|
|
210
|
+
### Front page scrape (fastest for real-time)

```python
import re, html as htmllib

page = http_get("https://news.ycombinator.com")
story_ids = re.findall(r'<tr class="athing submission" id="(\d+)">', page)
titles_urls = re.findall(
    r'class="titleline"[^>]*><a href="([^"]*)"[^>]*>(.*?)</a>', page
)
# Titles MUST go through html.unescape(); they contain &#x27;, &amp;, etc.
stories = [(sid, url, htmllib.unescape(title))
           for sid, (url, title) in zip(story_ids, titles_urls)]
```
|
|
222
|
+
|
|
223
|
+
### Algolia (search + nested comments)
|
|
224
|
+
|
|
225
|
+
```python
import json

# Keyword search
data = json.loads(http_get(
    "https://hn.algolia.com/api/v1/search"
    "?query=llm&tags=story&hitsPerPage=20"
))

# Most recent
data = json.loads(http_get(
    "https://hn.algolia.com/api/v1/search_by_date"
    "?tags=story&hitsPerPage=20"
))

# Full thread with nested tree
thread = json.loads(http_get(
    "https://hn.algolia.com/api/v1/items/47806725"
))
# thread['children'] = top-level comments, each with nested .children
```
|
|
246
|
+
|
|
247
|
+
### Official Firebase API
|
|
248
|
+
|
|
249
|
+
```python
|
|
250
|
+
top = json.loads(http_get("https://hacker-news.firebaseio.com/v0/topstories.json"))
|
|
251
|
+
item = json.loads(http_get(f"https://hacker-news.firebaseio.com/v0/item/{id}.json"))
|
|
252
|
+
user = json.loads(http_get(f"https://hacker-news.firebaseio.com/v0/user/{u}.json"))
|
|
253
|
+
```
|
|
254
|
+
|
|
255
|
+
### Gotchas
|
|
256
|
+
|
|
257
|
+
- Titles contain HTML entities (`&#x27;`, `&amp;`); always apply `html.unescape()`.
- Job posts have no score or author, so `scores_by_id.get(sid)` returns `None`.
- Algolia comment fields use `comment_text`, NOT `text`.
- Fetching 500 Firebase items sequentially takes ~100 s; Algolia with `tags=front_page` is
  much faster for bulk.
|
|
262
|
+
|
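To make the entity gotcha concrete, here is a sketch that applies the front-page regexes to an inline HTML fragment and unescapes both URLs and titles. `parse_front_page` and `SAMPLE_PAGE` are hypothetical; the fragment only imitates the markup that the regexes above target, so the live page may need adjustments.

```python
import html
import re

# Hypothetical fragment imitating two front-page rows.
SAMPLE_PAGE = (
    '<tr class="athing submission" id="101"><td class="title">'
    '<span class="titleline"><a href="https://example.com/a?x=1&amp;y=2">'
    'Show HN: Tom&#x27;s tool &amp; more</a></span></td></tr>'
    '<tr class="athing submission" id="102"><td class="title">'
    '<span class="titleline"><a href="item?id=102">Ask HN: Why?</a>'
    '</span></td></tr>'
)


def parse_front_page(page):
    """Pair story IDs with their (url, title) and decode HTML entities."""
    ids = re.findall(r'<tr class="athing submission" id="(\d+)">', page)
    links = re.findall(
        r'class="titleline"[^>]*><a href="([^"]*)"[^>]*>(.*?)</a>', page
    )
    return [
        {"id": int(sid), "url": html.unescape(url), "title": html.unescape(title)}
        for sid, (url, title) in zip(ids, links)
    ]


stories = parse_front_page(SAMPLE_PAGE)
```

Relative links such as `item?id=102` (Ask HN, job posts) are returned as-is; resolving them against `https://news.ycombinator.com/` is left to the caller.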
|
263
|
+
---
|
|
264
|
+
|
|
265
|
+
## 4. Stack Overflow
|
|
266
|
+
|
|
267
|
+
Stack Exchange API v2.3, JSON, no browser. Hard rate limits:
- **300 req/day per IP** without a key.
- **10,000 req/day** with a key.
|
|
270
|
+
|
|
271
|
+
### Top questions by tag
|
|
272
|
+
|
|
273
|
+
```python
|
|
274
|
+
data = json.loads(http_get(
|
|
275
|
+
"https://api.stackexchange.com/2.3/questions"
|
|
276
|
+
"?order=desc&sort=votes&tagged=python&site=stackoverflow"
|
|
277
|
+
"&pagesize=5&filter=withbody"
|
|
278
|
+
))
|
|
279
|
+
print("Quota:", data['quota_remaining'])
|
|
280
|
+
```
|
|
281
|
+
|
|
282
|
+
`filter=withbody` is **required** to include the body; without it the field
is simply absent (no error, no warning).
|
|
284
|
+
|
|
285
|
+
### Batch IDs (semicolon-delimited, up to 100)
|
|
286
|
+
|
|
287
|
+
```python
|
|
288
|
+
data = json.loads(http_get(
|
|
289
|
+
"https://api.stackexchange.com/2.3/questions/231767;419163;394809"
|
|
290
|
+
"?site=stackoverflow&filter=withbody"
|
|
291
|
+
))
|
|
292
|
+
```
|
|
293
|
+
|
|
294
|
+
### Decoding
|
|
295
|
+
|
|
296
|
+
- `title`: contains entities (`&quot;`, `&#39;`) → `html.unescape()`.
- `body`: full HTML → `HTMLParser` for plain text.
- `display_name`, `tags`: plain text, no decoding needed.
|
|
299
|
+
|
|
300
|
+
### Multi-site
|
|
301
|
+
|
|
302
|
+
`site=stackoverflow` can be replaced with `superuser`, `serverfault`,
`askubuntu`, `unix`, `datascience`, or `math`. Same API, same quota pool.
|
|
304
|
+
|
|
305
|
+
### Gotchas
|
|
306
|
+
|
|
307
|
+
- Quota is per IP and resets at midnight UTC. 6 tests consumed ~27 quota.
- Check `data.get('backoff')` and sleep that many seconds if it returns an int.
- `pagesize` max is 100. `page=` is 1-indexed.
- Errors raise an `HTTPError` exception instead of returning a JSON body.
|
|
311
|
+
|
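The decoding rules and the `backoff` check can be combined in one post-processing step. The sketch below works on an already-parsed response dict; `html_to_text`, `summarize`, and the sample payload are hypothetical, and only the field names (`items`, `title`, `body`, `quota_remaining`, `backoff`) come from the API response described above.

```python
import html
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes and drop tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(body):
    """Strip tags from an HTML body and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(body)
    parser.close()
    return " ".join("".join(parser.parts).split())


def summarize(data):
    """Return (decoded items, remaining quota, seconds to back off)."""
    items = [
        {"title": html.unescape(q["title"]),
         "text": html_to_text(q.get("body", ""))}
        for q in data.get("items", [])
    ]
    return items, data.get("quota_remaining"), data.get("backoff") or 0


# Hypothetical payload in the shape of a /questions response.
sample = {
    "items": [{"title": "What does &quot;yield&quot; do?",
               "body": "<p>Use <code>yield</code> &amp; generators.</p>"}],
    "quota_remaining": 295,
    "backoff": 10,
}
items, quota, wait_s = summarize(sample)
```

A caller would `time.sleep(wait_s)` before the next request whenever `wait_s` is non-zero, and stop early once `quota` gets close to zero.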
|
312
|
+
---
|
|
313
|
+
|
|
314
|
+
## 5. PubMed
|
|
315
|
+
|
|
316
|
+
NCBI E-utilities REST. **NEVER use a browser.** Without an API key: 3 req/s. With a free key:
10 req/s.
|
|
318
|
+
|
|
319
|
+
### ESearch → ESummary pipeline (most common)

```python
import json

# Step 1: search → PMIDs
search = json.loads(http_get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    "?db=pubmed&term=deep+learning+radiology&retmax=10&retmode=json"
))
pmids = search['esearchresult']['idlist']

# Step 2: batch metadata
summary = json.loads(http_get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
    f"?db=pubmed&id={','.join(pmids)}&retmode=json"
))
result = summary['result']
for uid in result['uids']:
    art = result[uid]
    title = art['title']
    pubdate = art['pubdate']                       # '2026 Apr 18'
    authors = [a['name'] for a in art['authors']]  # abbreviated, 'Last I'
    doi = {x['idtype']: x['value'] for x in art['articleids']}.get('doi')
```
|
|
344
|
+
|
|
345
|
+
### EFetch XML (full abstract + MeSH + author full names)
|
|
346
|
+
|
|
347
|
+
```python
import xml.etree.ElementTree as ET

raw = http_get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    "?db=pubmed&id=41999029&retmode=xml&rettype=abstract"
)
root = ET.fromstring(raw)
for art in root.findall('.//PubmedArticle'):
    mc = art.find('MedlineCitation')
    pmid = mc.find('PMID').text
    article = mc.find('Article')
    title = ''.join(article.find('ArticleTitle').itertext()).strip()
    # The abstract may be structured (BACKGROUND/METHODS/RESULTS/CONCLUSION)
    abstract = None
    abstract_el = article.find('Abstract')
    if abstract_el is not None:
        sections = []
        for t in abstract_el.findall('AbstractText'):
            label = t.get('Label', '')
            text = ''.join(t.itertext()).strip()
            sections.append(f"[{label}] {text}" if label else text)
        abstract = ' '.join(sections)
```
|
|
370
|
+
|
|
371
|
+
### Bulk with WebEnv (>10k results)

```python
import json

search = json.loads(http_get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    "?db=pubmed&term=CRISPR&retmax=0&retmode=json&usehistory=y"
))
webenv = search['esearchresult']['webenv']
query_key = search['esearchresult']['querykey']

for start in range(0, 1000, 200):
    raw = http_get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
        f"?db=pubmed&query_key={query_key}&WebEnv={webenv}"
        f"&retstart={start}&retmax=200&retmode=xml&rettype=abstract"
    )
```
|
|
388
|
+
|
|
389
|
+
### Gotchas
|
|
390
|
+
|
|
391
|
+
- `count` is a string, not an int: `int(search['esearchresult']['count'])`.
- EFetch `retmode` must be `xml`, NOT `json` (json returns plain MEDLINE text).
- `ArticleTitle` can contain embedded tags (`<i>`, `<sub>`); use `itertext()`.
- ~15% of articles have no abstract; guard with `if abstract_el is not None`.
- An author can be a `CollectiveName` (consortium) instead of `LastName`/`ForeName`;
  check `CollectiveName` first.
- ELink `pubmed_pubmed` (related articles) is persistently broken; use the DOI as a
  fallback.
|
|
399
|
+
|
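The `CollectiveName` and embedded-tag gotchas are easy to get wrong, so here is a sketch of author and title extraction run against an inline fragment. `author_names` and `SAMPLE` are hypothetical; the element names follow the EFetch XML used above, and the fragment is heavily trimmed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed EFetch fragment with a person and a consortium author.
SAMPLE = """<PubmedArticle><MedlineCitation><PMID>1</PMID><Article>
<ArticleTitle>Effects of <i>BRCA1</i> loss</ArticleTitle>
<AuthorList>
<Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
<Author><CollectiveName>Example Consortium</CollectiveName></Author>
</AuthorList></Article></MedlineCitation></PubmedArticle>"""


def author_names(article_el):
    """Return display names, checking CollectiveName before personal names."""
    names = []
    for author in article_el.findall("./AuthorList/Author"):
        collective = author.find("CollectiveName")
        if collective is not None and collective.text:
            names.append(collective.text.strip())
            continue
        fore = author.findtext("ForeName", default="")
        last = author.findtext("LastName", default="")
        full = f"{fore} {last}".strip()
        if full:
            names.append(full)
    return names


article = ET.fromstring(SAMPLE).find("MedlineCitation/Article")
title = "".join(article.find("ArticleTitle").itertext()).strip()
authors = author_names(article)
```

Using `findtext` with a default avoids an `AttributeError` when an author record carries only one of the two name parts.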
|
400
|
+
---
|
|
401
|
+
|
|
402
|
+
## 6. OpenAlex
|
|
403
|
+
|
|
404
|
+
260M+ works, 90M+ authors, JSON API, no auth.

**Always include `mailto=` to use the polite pool** (10 req/s, more reliable).
|
|
407
|
+
|
|
408
|
+
### Search papers
|
|
409
|
+
|
|
410
|
+
```python
import json

data = json.loads(http_get(
    "https://api.openalex.org/works"
    "?search=transformer+attention"
    "&per-page=5&sort=cited_by_count:desc"
    "&select=id,doi,display_name,publication_year,cited_by_count,open_access"
    "&mailto=you@example.com"
))
for w in data["results"]:
    bare_id = w["id"].split("/")[-1]  # W2626778328
    print(bare_id, w["publication_year"], w["cited_by_count"])
```
|
|
424
|
+
|
|
425
|
+
### By DOI
|
|
426
|
+
|
|
427
|
+
```python
|
|
428
|
+
w = json.loads(http_get(
|
|
429
|
+
"https://api.openalex.org/works/https://doi.org/10.1038/nature14539"
|
|
430
|
+
"?mailto=you@example.com"
|
|
431
|
+
))
|
|
432
|
+
```
|
|
433
|
+
|
|
434
|
+
### Reconstruct the abstract (inverted index)
|
|
435
|
+
|
|
436
|
+
```python
|
|
437
|
+
w = json.loads(http_get(
|
|
438
|
+
"https://api.openalex.org/works/W2626778328"
|
|
439
|
+
"?select=id,abstract_inverted_index&mailto=you@example.com"
|
|
440
|
+
))
|
|
441
|
+
aii = w.get("abstract_inverted_index") or {}
|
|
442
|
+
words_pos = [(pos, word) for word, positions in aii.items() for pos in positions]
|
|
443
|
+
abstract = " ".join(word for _, word in sorted(words_pos))
|
|
444
|
+
```
|
|
445
|
+
|
|
446
|
+
### Citation traversal
|
|
447
|
+
|
|
448
|
+
```python
|
|
449
|
+
# Forward citations
|
|
450
|
+
citing = json.loads(http_get(
|
|
451
|
+
f"https://api.openalex.org/works?filter=cites:{paper_id}"
|
|
452
|
+
"&per-page=5&sort=cited_by_count:desc&mailto=you@example.com"
|
|
453
|
+
))
|
|
454
|
+
|
|
455
|
+
# Backward references
|
|
456
|
+
paper = json.loads(http_get(
|
|
457
|
+
f"https://api.openalex.org/works/{paper_id}"
|
|
458
|
+
"?select=referenced_works&mailto=you@example.com"
|
|
459
|
+
))
|
|
460
|
+
refs = paper.get("referenced_works", [])
|
|
461
|
+
```
|
|
462
|
+
|
|
463
|
+
### Cursor pagination (bulk >10k)
|
|
464
|
+
|
|
465
|
+
```python
import json
import urllib.parse

# flt is a placeholder for your filter string (see "Filter syntax").
cursor = "*"
while True:
    encoded = urllib.parse.quote(cursor, safe="")
    data = json.loads(http_get(
        f"https://api.openalex.org/works?filter={flt}"
        f"&per-page=200&cursor={encoded}&mailto=you@example.com"
    ))
    if not data.get("results"):
        break
    # process data["results"] here
    cursor = data["meta"].get("next_cursor")
    if not cursor:
        break
```
|
|
479
|
+
|
|
480
|
+
### Filter syntax
|
|
481
|
+
|
|
482
|
+
`filter=author.id:A5108093963,publication_year:>2020,open_access.is_oa:true`
|
|
483
|
+
- AND: comma
|
|
484
|
+
- OR: pipe `2022|2023`
|
|
485
|
+
- Negation: `!2020`
|
|
486
|
+
- Range: `>1000`, `<2010`, `100-500`
|
|
487
|
+
|
|
488
|
+
### Entity ID prefixes
|
|
489
|
+
|
|
490
|
+
`W` Work, `A` Author, `I` Institution, `S` Source, `C` Concept, `T` Topic,
|
|
491
|
+
`F` Funder, `P` Publisher.
|
|
492
|
+
|
|
493
|
+
### Gotchas
|
|
494
|
+
|
|
495
|
+
- The `id` field is a full URL such as `https://openalex.org/W2626778328`; always use
  `.split("/")[-1]` to get the bare ID.
- DOI lookup uses the full URL: `/works/https://doi.org/...`, NOT `/works/10.1038/...`.
- Page-based pagination hard-stops at 10,000 results. Use `cursor=*` to go beyond that.
- `cursor=*` must be URL-encoded: `urllib.parse.quote(cursor, safe="")`.
- `group_by` and `page` are incompatible.
- `abstract_inverted_index` can be `null` for closed-access papers.
- `select=` reduces the payload by ~90%; use it in bulk harvests.
|
|
503
|
+
|
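Cursor pagination is easier to reuse and to test when the fetch step is injected. The sketch below wraps the loop in a generator; `iter_results`, `fake_fetch`, and the canned pages are hypothetical, and the stub stands in for `json.loads(http_get(url))` so the example runs offline. `urllib.parse.urlencode` percent-encodes the cursor (`*` becomes `%2A`) along with the filter value.

```python
import urllib.parse


def iter_results(fetch_json, base_url, params):
    """Yield every result across pages, following meta.next_cursor."""
    cursor = "*"
    while cursor:
        query = urllib.parse.urlencode({**params, "cursor": cursor})
        data = fetch_json(f"{base_url}?{query}")
        results = data.get("results") or []
        if not results:
            break
        yield from results
        cursor = (data.get("meta") or {}).get("next_cursor")


# Offline stub: two canned pages keyed by the cursor they answer to.
_PAGES = {
    "*": {"results": [{"id": "https://openalex.org/W1"},
                      {"id": "https://openalex.org/W2"}],
          "meta": {"next_cursor": "Ik5leHQi"}},
    "Ik5leHQi": {"results": [{"id": "https://openalex.org/W3"}],
                 "meta": {"next_cursor": None}},
}
requested_urls = []


def fake_fetch(url):
    requested_urls.append(url)
    query = urllib.parse.urlsplit(url).query
    return _PAGES[urllib.parse.parse_qs(query)["cursor"][0]]


ids = [
    w["id"].split("/")[-1]
    for w in iter_results(
        fake_fetch,
        "https://api.openalex.org/works",
        {"filter": "publication_year:>2020", "per-page": 200},
    )
]
```

In real use, `fetch_json` would be something like `lambda url: json.loads(http_get(url))`, with `mailto` added to `params`.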
|
504
|
+
---
|
|
505
|
+
|
|
506
|
+
## 7. SEC EDGAR
|
|
507
|
+
|
|
508
|
+
Public data, no auth. **`www.sec.gov` requires a custom User-Agent** or it returns 403.

```python
UA = {"User-Agent": "swl-ses research@example.com"}
# Required format: "CompanyName contact@email.com"
```
|
|
514
|
+
|
|
515
|
+
### Ticker → CIK
|
|
516
|
+
|
|
517
|
+
```python
|
|
518
|
+
import json
|
|
519
|
+
tickers = json.loads(http_get(
|
|
520
|
+
"https://www.sec.gov/files/company_tickers.json", headers=UA
|
|
521
|
+
))
|
|
522
|
+
aapl = next(v for v in tickers.values() if v['ticker'] == 'AAPL')
|
|
523
|
+
cik = str(aapl['cik_str']).zfill(10) # "0000320193"
|
|
524
|
+
```
|
|
525
|
+
|
|
526
|
+
### Submissions (~1000 most recent filings)

```python
data = json.loads(http_get(
    f"https://data.sec.gov/submissions/CIK{cik}.json", headers=UA
))
recent = data['filings']['recent']
# Annual (10-K) and quarterly (10-Q) reports.
periodic_filings = [
    (f, d, a, doc)
    for f, d, a, doc in zip(
        recent['form'], recent['filingDate'],
        recent['accessionNumber'], recent['primaryDocument']
    )
    if f in ('10-K', '10-Q')
]
```
|
|
542
|
+
|
|
543
|
+
### XBRL: one concept over time

```python
data = json.loads(http_get(
    f"https://data.sec.gov/api/xbrl/companyconcept/CIK{cik}"
    "/us-gaap/Assets.json", headers=UA
))
entries = data['units']['USD']
# Deduplicate: restatements multiply the entries per period
seen = {}
for e in entries:
    if e.get('form') == '10-K' and e.get('fp') == 'FY':
        end = e['end']
        if end not in seen or e['filed'] > seen[end]['filed']:
            seen[end] = e
```
|
|
559
|
+
|
|
560
|
+
### Cross-company (XBRL frames)
|
|
561
|
+
|
|
562
|
+
```python
|
|
563
|
+
data = json.loads(http_get(
|
|
564
|
+
"https://data.sec.gov/api/xbrl/frames/us-gaap"
|
|
565
|
+
"/RevenueFromContractWithCustomerExcludingAssessedTax/USD/CY2024.json",
|
|
566
|
+
headers=UA
|
|
567
|
+
))
|
|
568
|
+
# data['data'] = list of all companies reporting that concept/period
|
|
569
|
+
```
|
|
570
|
+
|
|
571
|
+
### Full-text search
|
|
572
|
+
|
|
573
|
+
```python
|
|
574
|
+
data = json.loads(http_get(
|
|
575
|
+
"https://efts.sec.gov/LATEST/search-index"
|
|
576
|
+
"?q=%22climate+risk%22&forms=10-K&dateRange=custom&startdt=2024-01-01",
|
|
577
|
+
headers=UA
|
|
578
|
+
))
|
|
579
|
+
# efts.sec.gov accepts the default Mozilla/5.0 User-Agent
|
|
580
|
+
```
|
|
581
|
+
|
|
582
|
+
### Rate limit
|
|
583
|
+
|
|
584
|
+
**10 req/s** documented. Keep `max_workers ≤ 8` for `ThreadPoolExecutor`.
|
|
585
|
+
|
|
586
|
+
### Gotchas
|
|
587
|
+
|
|
588
|
+
- `www.sec.gov` returns 403 with the default `Mozilla/5.0` User-Agent. ALWAYS pass `headers=UA`.
- `data.sec.gov` and `efts.sec.gov` are more permissive, but send the UA anyway as a matter of
  policy.
- XBRL data contains duplicates per period; dedupe by `end`, keeping the latest `filed`.
- The revenue concept varies: `RevenueFromContractWithCustomerExcludingAssessedTax`
  (post-2018) vs. `SalesRevenueNet` (older).
- `fp` for annual figures is `'FY'`; quarterly figures also appear in 10-K filings, so filter
  on both `form == '10-K'` AND `fp == 'FY'`.
- `companyfacts.json` is ~5 MB; for a single metric use `companyconcept`.
- CIK format: `cik_str` is an int in `company_tickers.json`; the APIs require `str(cik).zfill(10)`.
|
|
598
|
+
|
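A worked example of the dedup rule, using made-up numbers rather than real filings: one fiscal year appears twice (original and restated), and a quarterly 10-Q entry must be ignored. `latest_annual_values` is a hypothetical helper; the entry keys (`end`, `val`, `form`, `fp`, `filed`) follow the `companyconcept` response shape.

```python
def latest_annual_values(entries):
    """Keep one 10-K/FY value per period end: the most recently filed."""
    latest = {}
    for e in entries:
        if e.get("form") != "10-K" or e.get("fp") != "FY":
            continue
        end = e["end"]
        # ISO dates compare correctly as plain strings.
        if end not in latest or e["filed"] > latest[end]["filed"]:
            latest[end] = e
    return [(end, latest[end]["val"]) for end in sorted(latest)]


# Made-up sample: FY2022 is restated in the FY2023 filing.
sample_entries = [
    {"end": "2022-12-31", "val": 100, "form": "10-K", "fp": "FY",
     "filed": "2023-02-15"},
    {"end": "2022-12-31", "val": 105, "form": "10-K", "fp": "FY",
     "filed": "2024-02-14"},
    {"end": "2023-12-31", "val": 120, "form": "10-K", "fp": "FY",
     "filed": "2024-02-14"},
    {"end": "2023-09-30", "val": 115, "form": "10-Q", "fp": "Q3",
     "filed": "2023-11-01"},
]
series = latest_annual_values(sample_entries)
```

Keeping the latest `filed` entry means restated figures win over the originally reported ones; flip the comparison if the as-first-reported value is what the analysis needs.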
|
599
|
+
---
|
|
600
|
+
|
|
601
|
+
## Quick decision table

| Domain | Browser needed | Preferred endpoint |
|--------|----------------|--------------------|
| GitHub | Only for /trending | api.github.com (60/h without token) |
| ArXiv | NEVER | export.arxiv.org/api/query (Atom) |
| Hacker News | NEVER | hn.algolia.com / hacker-news.firebaseio.com |
| Stack Overflow | NEVER | api.stackexchange.com (300/day without key) |
| PubMed | NEVER | eutils.ncbi.nlm.nih.gov (3/s without key) |
| OpenAlex | NEVER | api.openalex.org (with mailto) |
| SEC EDGAR | NEVER | data.sec.gov + mandatory UA |
|
|
612
|
+
|
|
613
|
+
## Canonical pattern

For the 7 domains, the pattern is:

1. Skip `goto_url`, `wait_for_load`, `capture_screenshot`.
2. Build the API endpoint URL with its query params.
3. Call `http_get` (or `WebFetch` in SWL if the agent has no Python CLI).
4. Parse the JSON or XML as appropriate.
5. Check the quota / rate-limit field in the response if the API provides one.

If the domain is NOT on this list, default to `agent-browser` + `browser-interaction-patterns`.
|
|
624
|
+
|
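The decision table can be expressed as a small routing function that an agent runs before choosing a tool. This is only a sketch: `route` and `API_ENDPOINTS` are hypothetical names, the website hostnames on the left are assumptions about which page URLs an agent would encounter, and the endpoints on the right are taken from the table.

```python
from urllib.parse import urlsplit

# Website host (assumed) -> preferred API endpoint (from the decision table).
API_ENDPOINTS = {
    "github.com": "api.github.com",
    "arxiv.org": "export.arxiv.org/api/query",
    "news.ycombinator.com": "hn.algolia.com",
    "stackoverflow.com": "api.stackexchange.com",
    "pubmed.ncbi.nlm.nih.gov": "eutils.ncbi.nlm.nih.gov",
    "openalex.org": "api.openalex.org",
    "sec.gov": "data.sec.gov",
}


def route(url):
    """Return ("api", endpoint) for covered domains, else ("browser", None)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # The trending page is the one GitHub case with no API equivalent.
    if host == "github.com" and parts.path.startswith("/trending"):
        return ("browser", None)
    endpoint = API_ENDPOINTS.get(host)
    return ("api", endpoint) if endpoint else ("browser", None)
```

For example, `route("https://arxiv.org/abs/1706.03762")` selects the Atom API, while any URL on an unlisted domain falls through to the browser path.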
|
625
|
+
---
|
|
626
|
+
|
|
627
|
+
## Relationship to other skills

- **`agent-browser`**: when the domain is NOT covered here or requires JS/login.
- **`browser-interaction-patterns`**: low-level CDP patterns for when the UI
  has to be automated.
- **`web-fetcher-routing`**: routes between WebFetch and agent-browser by URL type.
- **`swl-markitdown`**: for academic PDFs linked from papers (arxiv, pubmed).
|
|
634
|
+
|
|
635
|
+
<!-- Adapted from browser-use/browser-harness under the MIT License (browser-use, 2026). -->
|