elementus-ai 1.0.1 → 1.1.0
- package/README.md +63 -16
- package/elementus.js +1174 -188
- package/package.json +17 -1
- package/wdio.d.ts +5 -0
package/README.md
CHANGED
|
@@ -39,7 +39,9 @@ I just installed the npm package "elementus-ai" — a self-healing element resol
|
|
|
39
39
|
- Set actionTimeout: 10000 in playwright config (Elementus respects framework timeouts)
|
|
40
40
|
|
|
41
41
|
For WebDriverIO:
|
|
42
|
-
-
|
|
42
|
+
- In wdio.conf.js before hook, wrap browser and override global $:
|
|
43
|
+
const wrapped = el.wrapBrowser(browser); globalThis.$ = wrapped.$.bind(wrapped)
|
|
44
|
+
- This way all page objects use plain $() with optional { ai } — zero changes needed
|
|
43
45
|
|
|
44
46
|
For Appium:
|
|
45
47
|
- Add el.wrapBrowser(driver) in the before hook
|
|
@@ -104,6 +106,10 @@ const el = createElementus({
|
|
|
104
106
|
})
|
|
105
107
|
```
|
|
106
108
|
|
|
109
|
+
Tips for the local setup:
|
|
110
|
+
- **Vision accuracy:** a dedicated GUI-grounding model (e.g. `Holo2-8B`, Apache-2.0 GGUF on Hugging Face) typically grounds screen coordinates better than general chat VLMs — benchmark numbers are vendor-reported (Nov 2025), verify it loads in your LM Studio version before switching.
|
|
111
|
+
- **Semantic matching:** load an embedding model (e.g. `text-embedding-nomic-embed-text-v1.5`) and set `embeddingModel` to let paraphrased descriptions ("sign in" vs "log in") resolve without vision.
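
To illustrate the semantic-matching tip above, here is a minimal sketch of how embedding vectors can rank paraphrases. It assumes the vectors were already obtained from whichever embedding model you loaded. The function names `cosine` and `rankBySimilarity` are hypothetical and are not part of the Elementus API; the package's own ranking logic is not shown in this diff.

```javascript
// Cosine similarity between two equal-length numeric vectors (range -1..1).
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as "no similarity".
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Rank candidate elements by how close their embedding is to the description's.
// `candidates` is an array of { label, vector } objects (hypothetical shape).
function rankBySimilarity(queryVector, candidates) {
  return candidates
    .map((c) => ({ label: c.label, similarity: cosine(queryVector, c.vector) }))
    .sort((x, y) => y.similarity - x.similarity);
}
```

With real embeddings, a description such as "sign in" would be expected to land closer to a "Log in" button than to unrelated controls, even though the two share no keywords.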
|
|
112
|
+
|
|
107
113
|
### Option B: Google Gemini API (cloud, fast, better vision)
|
|
108
114
|
|
|
109
115
|
1. Get an API key from [Google AI Studio](https://aistudio.google.com/apikey)
|
|
@@ -112,10 +118,14 @@ const el = createElementus({
|
|
|
112
118
|
const el = createElementus({
|
|
113
119
|
provider: 'gemini',
|
|
114
120
|
geminiApiKey: 'AIza...', // or set GEMINI_API_KEY env var
|
|
115
|
-
geminiModel: 'gemini-
|
|
121
|
+
geminiModel: 'gemini-3.5-flash', // or 'gemini-3.1-flash-lite' (cheaper/faster)
|
|
116
122
|
})
|
|
117
123
|
```
|
|
118
124
|
|
|
125
|
+
Older `gemini-2.5-flash` / `gemini-2.5-flash-lite` still work (Google retires them 2026-10-16) — Elementus picks the right thinking config per model family automatically.
|
|
126
|
+
|
|
127
|
+
Note: Google's *computer-use* capability requires dedicated models (`gemini-2.5-computer-use-preview-*`, `gemini-3-flash-preview`) and is **not** available on `gemini-3.5-flash`. Elementus's own vision pipeline does not need it — this only matters if you point `geminiModel` at a computer-use model expecting agentic behavior.
|
|
128
|
+
|
|
119
129
|
## Framework Setup
|
|
120
130
|
|
|
121
131
|
### Playwright
|
|
@@ -150,10 +160,25 @@ test('example', async ({ page }) => {
|
|
|
150
160
|
|
|
151
161
|
### WebDriverIO
|
|
152
162
|
|
|
163
|
+
Override the global `$` in your `wdio.conf.js` so all page objects work transparently:
|
|
164
|
+
|
|
153
165
|
```javascript
|
|
154
|
-
|
|
155
|
-
|
|
156
|
-
|
|
166
|
+
// wdio.conf.js
|
|
167
|
+
const { createElementus } = require('elementus-ai')
|
|
168
|
+
const el = createElementus({ provider: 'gemini', geminiApiKey: '...' })
|
|
169
|
+
|
|
170
|
+
exports.config = {
|
|
171
|
+
// ... other config
|
|
172
|
+
async before() {
|
|
173
|
+
const wrapped = el.wrapBrowser(browser)
|
|
174
|
+
globalThis.$ = wrapped.$.bind(wrapped)
|
|
175
|
+
}
|
|
176
|
+
}
|
|
177
|
+
|
|
178
|
+
// In tests / page objects — plain $() with optional { ai }:
|
|
179
|
+
await $('[data-testid="btn-send"]') // unchanged, zero overhead
|
|
180
|
+
await $('#btn', { ai: 'Submit order button' }).click() // self-healing
|
|
181
|
+
await $('#email', { ai: 'Email input field' }).setValue('test@test.com')
|
|
157
182
|
```
|
|
158
183
|
|
|
159
184
|
### Appium (Native Android / iOS / Flutter)
|
|
@@ -222,21 +247,35 @@ createElementus({
|
|
|
222
247
|
|
|
223
248
|
// Gemini
|
|
224
249
|
geminiApiKey: null, // or GEMINI_API_KEY env var
|
|
225
|
-
geminiModel: 'gemini-
|
|
250
|
+
geminiModel: 'gemini-3.5-flash',
|
|
226
251
|
|
|
227
252
|
// Behavior
|
|
228
253
|
maxCandidates: 20, // max elements sent to LLM for disambiguation
|
|
229
254
|
visionMaxWidth: 1280, // max screenshot width (px) sent to vision LLM
|
|
230
255
|
|
|
256
|
+
// Fingerprint cache (opt-in) — remember healed elements across runs and
|
|
257
|
+
// re-match them algorithmically (zero LLM cost, ~ms) before any AI call
|
|
258
|
+
cacheFile: null, // e.g. './elementus-cache.json'
|
|
259
|
+
|
|
260
|
+
// Semantic matching (opt-in) — embedding model for paraphrase matching
|
|
261
|
+
// when keyword scoring finds nothing
|
|
262
|
+
embeddingModel: null, // e.g. 'text-embedding-nomic-embed-text-v1.5'
|
|
263
|
+
|
|
231
264
|
// Debugging
|
|
232
265
|
debug: false, // save screenshots to debugDir
|
|
233
|
-
debugDir:
|
|
266
|
+
debugDir: null, // required when debug: true, e.g. './debug'
|
|
234
267
|
|
|
235
268
|
// Custom stop words
|
|
236
269
|
stopWords: null, // Set of words to ignore in descriptions
|
|
237
270
|
})
|
|
238
271
|
```
|
|
239
272
|
|
|
273
|
+
## Security Notes
|
|
274
|
+
|
|
275
|
+
- **Debug screenshots** capture the full page — including any sensitive data visible on it. Keep `debugDir` out of version control.
|
|
276
|
+
- **Page content reaches the LLM.** Element texts and screenshots from the page under test are sent to your configured LLM provider as part of the resolution prompts. Run Elementus against pages you trust, and prefer a local LLM (LM Studio) for sensitive applications.
|
|
277
|
+
- The Gemini API key is sent via the `x-goog-api-key` header (never in the URL) and can be supplied via the `GEMINI_API_KEY` env var instead of code.
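
As an illustration of the last note, the sketch below builds (but does not send) a request to Google's public `generateContent` REST endpoint with the key in the `x-goog-api-key` header. `buildGeminiRequest` is a hypothetical helper written for this example; it is not necessarily how Elementus constructs its requests internally.

```javascript
// Builds a Gemini generateContent request without sending it.
// The API key travels in a header, so it never appears in the URL or in URL logs.
function buildGeminiRequest(model, prompt, apiKey = process.env.GEMINI_API_KEY) {
  if (!apiKey) throw new Error('Set GEMINI_API_KEY or pass an API key');
  return {
    url: `https://generativelanguage.googleapis.com/v1beta/models/${model}:generateContent`,
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'x-goog-api-key': apiKey,
      },
      body: JSON.stringify({ contents: [{ parts: [{ text: prompt }] }] }),
    },
  };
}
```

The returned `url` and `options` can be passed to `fetch(url, options)` in Node 18 or later.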
|
|
278
|
+
|
|
240
279
|
## Timeouts
|
|
241
280
|
|
|
242
281
|
Elementus respects your framework's configured timeouts. It does **not** override or race against them. Set appropriate action timeouts in your framework config:
|
|
@@ -253,13 +292,17 @@ If a selector works, it returns immediately (zero overhead). If it fails after y
|
|
|
253
292
|
|
|
254
293
|
## How It Works
|
|
255
294
|
|
|
256
|
-
When a selector fails, Elementus runs a
|
|
295
|
+
When a selector fails, Elementus runs a 5-step pipeline — free and deterministic steps first, LLM steps later, vision last:
|
|
257
296
|
|
|
258
297
|
**Step 1: Locator** — Try the original selector. If it works, done (zero overhead).
|
|
259
298
|
|
|
260
|
-
**Step 2:
|
|
299
|
+
**Step 2: Fingerprint cache** (opt-in via `cacheFile`) — If this selector+description healed before on this page, re-match the stored multi-attribute fingerprint (tag, id, text, neighbor text, position, size, …) against the live DOM with weighted similarity. Milliseconds, zero LLM cost. Accepted only with a confidence threshold *and* a margin over the runner-up.
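
The acceptance rule in Step 2 (a confidence threshold plus a margin over the runner-up) can be sketched as follows. The attribute weights, the `0.6` threshold, the `0.15` margin, and the function names are all assumptions chosen for illustration; the values Elementus actually uses are not shown in this diff.

```javascript
// Hypothetical attribute weights: stable identifiers count more than the tag name.
const WEIGHTS = { tag: 1, id: 3, text: 3, neighborText: 2 };

// Weighted share of stored fingerprint attributes that match exactly (0..1).
function similarity(stored, candidate) {
  let matched = 0;
  let total = 0;
  for (const [attr, weight] of Object.entries(WEIGHTS)) {
    if (stored[attr] === undefined) continue; // attribute was never recorded
    total += weight;
    if (stored[attr] === candidate[attr]) matched += weight;
  }
  return total === 0 ? 0 : matched / total;
}

// Accept the best candidate only if it clears the threshold AND
// leads the runner-up by at least `margin`; otherwise return null.
function rematch(stored, candidates, { threshold = 0.6, margin = 0.15 } = {}) {
  const ranked = candidates
    .map((candidate) => ({ candidate, score: similarity(stored, candidate) }))
    .sort((a, b) => b.score - a.score);
  if (ranked.length === 0) return null;
  const [best, runnerUp] = ranked;
  const lead = best.score - (runnerUp ? runnerUp.score : 0);
  return best.score >= threshold && lead >= margin ? best.candidate : null;
}
```

The margin check is what keeps a cache hit from silently picking one of two near-identical elements: when the top two scores are close, the sketch returns `null` and the pipeline would continue to the next step.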
|
|
300
|
+
|
|
301
|
+
**Step 3: DOM/Element Tree Scoring** — Scan all interactive elements on the page (DOM for web, XML source for native apps). Score each by keyword and phrase relevance to the description. If one clear winner, use it. If multiple tied, send the ranked top-N to the LLM. If all identical (e.g., 10x "Edit" buttons), use positional LLM with coordinates. With `embeddingModel` set, zero keyword matches fall back to semantic (embedding) ranking.
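
The branching described in Step 3 can be sketched like this. The scoring formula (one point per matched keyword, two bonus points for a full-phrase match), the stop-word list, and the names `tokenize`, `score`, and `decide` are assumptions made for this example rather than the package's real implementation.

```javascript
// Hypothetical stop words; the package accepts a custom set via `stopWords`.
const STOP_WORDS = new Set(['the', 'a', 'an', 'of', 'for']);

function tokenize(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((word) => word && !STOP_WORDS.has(word));
}

// One point per description keyword present in the element text,
// plus two bonus points when the whole description appears as a phrase.
function score(description, elementText) {
  const elementTokens = tokenize(elementText);
  const words = new Set(elementTokens);
  const keywords = tokenize(description);
  let points = keywords.filter((k) => words.has(k)).length;
  const phrase = ` ${keywords.join(' ')} `;
  if (keywords.length > 1 && ` ${elementTokens.join(' ')} `.includes(phrase)) {
    points += 2;
  }
  return points;
}

// Pick the branch: single winner, LLM disambiguation, positional LLM, or no match.
function decide(description, elements) {
  const ranked = elements
    .map((el) => ({ el, points: score(description, el.text) }))
    .sort((a, b) => b.points - a.points);
  if (ranked.length === 0 || ranked[0].points === 0) return { action: 'no-match' };
  const top = ranked.filter((r) => r.points === ranked[0].points);
  if (top.length === 1) return { action: 'use', element: top[0].el };
  const identical = top.every((r) => r.el.text === top[0].el.text);
  return {
    action: identical ? 'positional-llm' : 'llm-disambiguate',
    candidates: top.map((r) => r.el),
  };
}
```

In this sketch a `no-match` result is the point at which an embedding-based ranking would take over when `embeddingModel` is set.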
|
|
302
|
+
|
|
303
|
+
**Step 4: Snapshot grounding** — Playwright: capture an ARIA accessibility snapshot with element refs and ask the text LLM to pick the matching ref. WDIO/native: synthesize an indexed role/name list from the element scan and do the same. No vision model needed.
|
|
261
304
|
|
|
262
|
-
**Step
|
|
305
|
+
**Step 5: Vision** — First Set-of-Marks: numbered badges drawn on the candidate elements, one vision call returns a badge number. If that fails: screenshot with a labeled 3x3 grid, region re-scan, then a precise-coordinate fail-safe for elements a DOM scan can never see (e.g. canvas-rendered icons). The fail-safe narrows the search band to roughly one viewport before asking the model for pixels — so the target is never too small to locate — then verifies the result on a zoomed crop and fails loudly rather than ever clicking the wrong place.
|
|
263
306
|
|
|
264
307
|
## Tips for Writing Descriptions
|
|
265
308
|
|
|
@@ -280,12 +323,16 @@ When a selector fails, Elementus runs a 3-step pipeline:
|
|
|
280
323
|
|
|
281
324
|
| Platform | Element scan | Vision | Status |
|
|
282
325
|
|----------|-------------|--------|--------|
|
|
283
|
-
| Playwright (web) | DOM |
|
|
284
|
-
| WDIO (web) | DOM |
|
|
285
|
-
| Appium (mobile web) | DOM |
|
|
286
|
-
| Appium (native Android/iOS) | XML source |
|
|
287
|
-
| Appium (Flutter) | XML source |
|
|
288
|
-
| Appium (React Native) | XML source |
|
|
326
|
+
| Playwright (web) | DOM | Full-page screenshot + LLM | Full support |
|
|
327
|
+
| WDIO (web) | DOM | Viewport screenshot + LLM | Full support (WDIO v9+ recommended) |
|
|
328
|
+
| Appium (mobile web) | DOM | Viewport screenshot + LLM | Full support |
|
|
329
|
+
| Appium (native Android/iOS) | XML source | — | Element-tree resolution (no vision) |
|
|
330
|
+
| Appium (Flutter) | XML source | — | Element-tree resolution (no vision) |
|
|
331
|
+
| Appium (React Native) | XML source | — | Element-tree resolution (no vision) |
|
|
332
|
+
|
|
333
|
+
Notes:
|
|
334
|
+
- WDIO screenshots are viewport-only, so the vision fallback sees the current viewport rather than the full page. DOM scoring always covers the whole document on every platform.
|
|
335
|
+
- Native apps have no DOM to overlay the vision grid on — resolution uses the native element tree (accessibility-id, resource-id, text) and stops there with an actionable error if unresolved.
|
|
289
336
|
|
|
290
337
|
## License
|
|
291
338
|
|