elementus-ai 1.0.2 → 1.1.1
- package/README.md +47 -17
- package/elementus.js +1174 -193
- package/package.json +16 -1
- package/wdio.d.ts +1 -3
package/README.md
CHANGED
|
@@ -27,9 +27,9 @@ I just installed the npm package "elementus-ai" — a self-healing element resol
|
|
|
27
27
|
- If none found, tell me you can't detect a supported framework and stop
|
|
28
28
|
|
|
29
29
|
2. CHOOSE THE LLM PROVIDER
|
|
30
|
-
- Ask me: "Do you want to use a local LLM (LM Studio, free, private) or Google Gemini (cloud, fast, ~$0.
|
|
30
|
+
- Ask me: "Do you want to use a local LLM (LM Studio, free, private) or Google Gemini (cloud, fast, ~$0.001 per AI-healed selector on gemini-3.5-flash; selectors that still work cost nothing)?"
|
|
31
31
|
- If Gemini: ask for API key or check for GEMINI_API_KEY env var
|
|
32
|
-
- If LM Studio: use defaults (localhost:1234
|
|
32
|
+
- If LM Studio: use defaults (localhost:1234) with a vision/grounding model loaded (recommended: holo-3.1-9b)
|
|
33
33
|
|
|
34
34
|
3. INTEGRATE BASED ON MY FRAMEWORK
|
|
35
35
|
|
|
@@ -95,17 +95,21 @@ await p.locator('#stable-element').click()
|
|
|
95
95
|
### Option A: Local LLM via LM Studio (free, private)
|
|
96
96
|
|
|
97
97
|
1. Download [LM Studio](https://lmstudio.ai)
|
|
98
|
-
2. Load a vision-capable model
|
|
98
|
+
2. Load a vision-capable model. Recommended: **`holo-3.1-9b`** — a GUI-grounding model that locates on-screen elements far better than general chat VLMs, and it's small (9B). Any vision model works, but grounding models earn their keep on the vision-fallback path.
|
|
99
99
|
3. Start the local server (default: `http://localhost:1234`)
|
|
100
100
|
|
|
101
101
|
```javascript
|
|
102
102
|
const el = createElementus({
|
|
103
103
|
provider: 'lmstudio',
|
|
104
104
|
lmStudioUrl: 'http://localhost:1234/v1/chat/completions',
|
|
105
|
-
model: '
|
|
105
|
+
model: 'holo-3.1-9b',
|
|
106
106
|
})
|
|
107
107
|
```
|
|
108
108
|
|
|
109
|
+
Tips for the local setup:
|
|
110
|
+
- **Context length:** set it to 16k+ in LM Studio — the ARIA-snapshot grounding step can send large prompts, and the default 4k will silently truncate.
|
|
111
|
+
- **Semantic matching:** load an embedding model (e.g. `text-embedding-nomic-embed-text-v1.5`) and set `embeddingModel` to let paraphrased descriptions ("sign in" vs "log in") resolve without vision.
|
|
112
|
+
|
|
109
113
|
### Option B: Google Gemini API (cloud, fast, better vision)
|
|
110
114
|
|
|
111
115
|
1. Get an API key from [Google AI Studio](https://aistudio.google.com/apikey)
|
|
@@ -114,10 +118,14 @@ const el = createElementus({
|
|
|
114
118
|
const el = createElementus({
|
|
115
119
|
provider: 'gemini',
|
|
116
120
|
geminiApiKey: 'AIza...', // or set GEMINI_API_KEY env var
|
|
117
|
-
geminiModel: 'gemini-
|
|
121
|
+
geminiModel: 'gemini-3.5-flash', // or 'gemini-3.1-flash-lite' (cheaper/faster)
|
|
118
122
|
})
|
|
119
123
|
```
|
|
120
124
|
|
|
125
|
+
Older `gemini-2.5-flash` / `gemini-2.5-flash-lite` still work (Google retires them 2026-10-16) — Elementus picks the right thinking config per model family automatically.
|
|
126
|
+
|
|
127
|
+
Note: Google's *computer-use* capability requires dedicated models (`gemini-2.5-computer-use-preview-*`, `gemini-3-flash-preview`) and is **not** available on `gemini-3.5-flash`. Elementus's own vision pipeline does not need it — this only matters if you point `geminiModel` at a computer-use model expecting agentic behavior.
|
|
128
|
+
|
|
121
129
|
## Framework Setup
|
|
122
130
|
|
|
123
131
|
### Playwright
|
|
@@ -235,25 +243,39 @@ createElementus({
|
|
|
235
243
|
|
|
236
244
|
// LM Studio
|
|
237
245
|
lmStudioUrl: 'http://localhost:1234/v1/chat/completions',
|
|
238
|
-
model: '
|
|
246
|
+
model: 'holo-3.1-9b',
|
|
239
247
|
|
|
240
248
|
// Gemini
|
|
241
249
|
geminiApiKey: null, // or GEMINI_API_KEY env var
|
|
242
|
-
geminiModel: 'gemini-
|
|
250
|
+
geminiModel: 'gemini-3.5-flash',
|
|
243
251
|
|
|
244
252
|
// Behavior
|
|
245
253
|
maxCandidates: 20, // max elements sent to LLM for disambiguation
|
|
246
254
|
visionMaxWidth: 1280, // max screenshot width (px) sent to vision LLM
|
|
247
255
|
|
|
256
|
+
// Fingerprint cache (opt-in) — remember healed elements across runs and
|
|
257
|
+
// re-match them algorithmically (zero LLM cost, ~ms) before any AI call
|
|
258
|
+
cacheFile: null, // e.g. './elementus-cache.json'
|
|
259
|
+
|
|
260
|
+
// Semantic matching (opt-in) — embedding model for paraphrase matching
|
|
261
|
+
// when keyword scoring finds nothing
|
|
262
|
+
embeddingModel: null, // e.g. 'text-embedding-nomic-embed-text-v1.5'
|
|
263
|
+
|
|
248
264
|
// Debugging
|
|
249
265
|
debug: false, // save screenshots to debugDir
|
|
250
|
-
debugDir:
|
|
266
|
+
debugDir: null, // required when debug: true, e.g. './debug'
|
|
251
267
|
|
|
252
268
|
// Custom stop words
|
|
253
269
|
stopWords: null, // Set of words to ignore in descriptions
|
|
254
270
|
})
|
|
255
271
|
```
|
|
256
272
|
|
|
273
|
+
## Security Notes
|
|
274
|
+
|
|
275
|
+
- **Debug screenshots** capture the full page — including any sensitive data visible on it. Keep `debugDir` out of version control.
|
|
276
|
+
- **Page content reaches the LLM.** Element texts and screenshots from the page under test are sent to your configured LLM provider as part of the resolution prompts. Run Elementus against pages you trust, and prefer a local LLM (LM Studio) for sensitive applications.
|
|
277
|
+
- The Gemini API key is sent via the `x-goog-api-key` header (never in the URL) and can be supplied via the `GEMINI_API_KEY` env var instead of code.
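
To illustrate the last point, here is a minimal sketch of a request that carries the key in the `x-goog-api-key` header instead of the URL. The helper name `buildGeminiRequest` is hypothetical and not part of the elementus-ai API. The endpoint shape follows Google's public `generateContent` REST API. How Elementus builds its requests internally is not shown in this README.

```javascript
// Illustrative helper, not part of the elementus-ai API.
// Builds a Gemini request whose API key travels only in a header.
function buildGeminiRequest(model, prompt, apiKey = process.env.GEMINI_API_KEY) {
  if (!apiKey) throw new Error('Set GEMINI_API_KEY or pass an API key');
  const url =
    'https://generativelanguage.googleapis.com/v1beta/models/' +
    encodeURIComponent(model) +
    ':generateContent';
  return {
    url, // contains the model name only, never the key
    init: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'x-goog-api-key': apiKey,
      },
      body: JSON.stringify({ contents: [{ parts: [{ text: prompt }] }] }),
    },
  };
}
```

The returned `url` and `init` can be passed straight to `fetch(url, init)`. Keeping the key out of the URL keeps it out of proxy logs and error messages that echo the request URL.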
|
|
278
|
+
|
|
257
279
|
## Timeouts
|
|
258
280
|
|
|
259
281
|
Elementus respects your framework's configured timeouts. It does **not** override or race against them. Set appropriate action timeouts in your framework config:
|
|
@@ -270,13 +292,17 @@ If a selector works, it returns immediately (zero overhead). If it fails after y
|
|
|
270
292
|
|
|
271
293
|
## How It Works
|
|
272
294
|
|
|
273
|
-
When a selector fails, Elementus runs a
|
|
295
|
+
When a selector fails, Elementus runs a 5-step pipeline — free and deterministic steps first, LLM steps later, vision last:
|
|
274
296
|
|
|
275
297
|
**Step 1: Locator** — Try the original selector. If it works, done (zero overhead).
|
|
276
298
|
|
|
277
|
-
**Step 2:
|
|
299
|
+
**Step 2: Fingerprint cache** (opt-in via `cacheFile`) — If this selector+description healed before on this page, re-match the stored multi-attribute fingerprint (tag, id, text, neighbor text, position, size, …) against the live DOM with weighted similarity. Milliseconds, zero LLM cost. Accepted only with a confidence threshold *and* a margin over the runner-up.
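
The acceptance rule in Step 2 can be sketched as follows. This is a minimal illustration, not Elementus's actual implementation: the weights, the two thresholds, the 100px distance scale, and every function name here are assumptions chosen for the example.

```javascript
// Hypothetical weights and thresholds; the real values are not documented here.
const WEIGHTS = { tag: 1, id: 3, text: 3, neighborText: 2, position: 1, size: 1 };
const MIN_CONFIDENCE = 0.75; // best score must reach this
const MIN_MARGIN = 0.1; // and beat the runner-up by at least this much

// Similarity of a single attribute, in [0, 1].
function attributeSimilarity(key, stored, live) {
  if (stored === undefined || live === undefined) return 0;
  if (key === 'position' || key === 'size') {
    // Numeric pairs: closeness decays with distance on an assumed 100px scale.
    const dist = Math.hypot(stored[0] - live[0], stored[1] - live[1]);
    return 1 / (1 + dist / 100);
  }
  return stored === live ? 1 : 0;
}

// Weighted average of the per-attribute similarities.
function similarity(fingerprint, candidate) {
  let total = 0;
  let score = 0;
  for (const [key, weight] of Object.entries(WEIGHTS)) {
    total += weight;
    score += weight * attributeSimilarity(key, fingerprint[key], candidate[key]);
  }
  return score / total;
}

// Returns the best candidate, or null when the match is weak or ambiguous.
function matchFingerprint(fingerprint, candidates) {
  const ranked = candidates
    .map((candidate) => ({ candidate, score: similarity(fingerprint, candidate) }))
    .sort((a, b) => b.score - a.score);
  if (ranked.length === 0) return null;
  const [best, runnerUp] = ranked;
  const margin = best.score - (runnerUp ? runnerUp.score : 0);
  return best.score >= MIN_CONFIDENCE && margin >= MIN_MARGIN ? best.candidate : null;
}
```

The margin check is what rejects a page with two near-identical elements: both score high, neither stands out, so the sketch returns `null` and the pipeline would continue to the next step.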
|
|
300
|
+
|
|
301
|
+
**Step 3: DOM/Element Tree Scoring** — Scan all interactive elements on the page (DOM for web, XML source for native apps). Score each by keyword and phrase relevance to the description. If one clear winner, use it. If multiple tied, send the ranked top-N to the LLM. If all identical (e.g., 10x "Edit" buttons), use positional LLM with coordinates. With `embeddingModel` set, zero keyword matches fall back to semantic (embedding) ranking.
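
The branching in Step 3 can be illustrated with a small sketch. The scoring formula, the stop-word list, the phrase bonus, and the outcome labels below are assumptions made for the example. Only the `maxCandidates` default of 20 comes from the configuration reference.

```javascript
// Hypothetical stop words; the package's own list is not shown in this README.
const STOP_WORDS = new Set(['the', 'a', 'an', 'in', 'on', 'of', 'to']);

function tokenize(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((word) => word && !STOP_WORDS.has(word));
}

// One point per shared keyword, plus a bonus when the whole phrase appears.
function scoreElement(description, elementText) {
  const have = new Set(tokenize(elementText));
  let score = tokenize(description).filter((word) => have.has(word)).length;
  if (elementText.toLowerCase().includes(description.toLowerCase())) score += 2;
  return score;
}

// Decide which branch of Step 3 applies to a list of { text } elements.
function classify(description, elements, maxCandidates = 20) {
  const ranked = elements
    .map((el) => ({ el, score: scoreElement(description, el.text) }))
    .filter((entry) => entry.score > 0)
    .sort((a, b) => b.score - a.score);
  if (ranked.length === 0) return { outcome: 'no-match', candidates: [] };

  const top = ranked.filter((entry) => entry.score === ranked[0].score);
  if (top.length === 1) return { outcome: 'winner', candidates: [top[0].el] };

  const allIdentical = top.every((entry) => entry.el.text === top[0].el.text);
  return allIdentical
    ? { outcome: 'positional-llm', candidates: top.map((entry) => entry.el) }
    : { outcome: 'llm-disambiguation', candidates: ranked.slice(0, maxCandidates).map((entry) => entry.el) };
}
```

In this sketch, the description "log in" against a button labeled "Sign in" yields `no-match`, because "in" is a stop word and no keyword overlaps. That is the case where the README says semantic ranking takes over if `embeddingModel` is set.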
|
|
302
|
+
|
|
303
|
+
**Step 4: Snapshot grounding** — Playwright: capture an ARIA accessibility snapshot with element refs and ask the text LLM to pick the matching ref. WDIO/native: synthesize an indexed role/name list from the element scan and do the same. No vision model needed.
|
|
278
304
|
|
|
279
|
-
**Step
|
|
305
|
+
**Step 5: Vision** — First Set-of-Marks: numbered badges drawn on the candidate elements, one vision call returns a badge number. If that fails: screenshot with a labeled 3x3 grid, region re-scan, then a precise-coordinate fail-safe for elements a DOM scan can never see (e.g. canvas-rendered icons). The fail-safe narrows the search band to roughly one viewport before asking the model for pixels — so the target is never too small to locate — then verifies the result on a zoomed crop and fails loudly rather than ever clicking the wrong place.
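
For the grid fallback, a region re-scan needs the pixel rectangle behind whichever grid cell the model names. A minimal sketch of that mapping follows. It assumes cells are labeled 1 to 9 in row-major order. The README does not specify the labeling scheme, and `gridCellBounds` is a hypothetical name.

```javascript
// Map a 3x3 grid cell (1..9, row-major; an assumed labeling) to a pixel rectangle.
function gridCellBounds(cell, width, height) {
  if (!Number.isInteger(cell) || cell < 1 || cell > 9) {
    throw new RangeError('cell must be an integer from 1 to 9');
  }
  const col = (cell - 1) % 3;
  const row = Math.floor((cell - 1) / 3);
  // Round the cell edges so adjacent cells tile the screenshot without gaps.
  const x0 = Math.round((col * width) / 3);
  const x1 = Math.round(((col + 1) * width) / 3);
  const y0 = Math.round((row * height) / 3);
  const y1 = Math.round(((row + 1) * height) / 3);
  return { x: x0, y: y0, width: x1 - x0, height: y1 - y0 };
}
```

Because each cell's right and bottom edges are computed the same way as its neighbor's left and top edges, the nine rectangles cover the screenshot exactly, even when the width is not divisible by three.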
|
|
280
306
|
|
|
281
307
|
## Tips for Writing Descriptions
|
|
282
308
|
|
|
@@ -297,12 +323,16 @@ When a selector fails, Elementus runs a 3-step pipeline:
|
|
|
297
323
|
|
|
298
324
|
| Platform | Element scan | Vision | Status |
|
|
299
325
|
|----------|-------------|--------|--------|
|
|
300
|
-
| Playwright (web) | DOM |
|
|
301
|
-
| WDIO (web) | DOM |
|
|
302
|
-
| Appium (mobile web) | DOM |
|
|
303
|
-
| Appium (native Android/iOS) | XML source |
|
|
304
|
-
| Appium (Flutter) | XML source |
|
|
305
|
-
| Appium (React Native) | XML source |
|
|
326
|
+
| Playwright (web) | DOM | Full-page screenshot + LLM | Full support |
|
|
327
|
+
| WDIO (web) | DOM | Viewport screenshot + LLM | Full support (WDIO v9+ recommended) |
|
|
328
|
+
| Appium (mobile web) | DOM | Viewport screenshot + LLM | Full support |
|
|
329
|
+
| Appium (native Android/iOS) | XML source | — | Element-tree resolution (no vision) |
|
|
330
|
+
| Appium (Flutter) | XML source | — | Element-tree resolution (no vision) |
|
|
331
|
+
| Appium (React Native) | XML source | — | Element-tree resolution (no vision) |
|
|
332
|
+
|
|
333
|
+
Notes:
|
|
334
|
+
- WDIO screenshots are viewport-only, so the vision fallback sees the current viewport rather than the full page. DOM scoring always covers the whole document on every platform.
|
|
335
|
+
- Native apps have no DOM to overlay the vision grid on — resolution uses the native element tree (accessibility-id, resource-id, text) and stops there with an actionable error if unresolved.
|
|
306
336
|
|
|
307
337
|
## License
|
|
308
338
|
|