elementus-ai 1.0.1 → 1.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (4)
  1. package/README.md +63 -16
  2. package/elementus.js +1174 -188
  3. package/package.json +17 -1
  4. package/wdio.d.ts +5 -0
package/README.md CHANGED
@@ -39,7 +39,9 @@ I just installed the npm package "elementus-ai" — a self-healing element resol
39
39
  - Set actionTimeout: 10000 in playwright config (Elementus respects framework timeouts)
40
40
 
41
41
  For WebDriverIO:
42
- - Add el.wrapBrowser(browser) in the before hook in wdio.conf.js
42
+ - In wdio.conf.js before hook, wrap browser and override global $:
43
+ const wrapped = el.wrapBrowser(browser); globalThis.$ = wrapped.$.bind(wrapped)
44
+ - This way all page objects use plain $() with optional { ai } — zero changes needed
43
45
 
44
46
  For Appium:
45
47
  - Add el.wrapBrowser(driver) in the before hook
@@ -104,6 +106,10 @@ const el = createElementus({
104
106
  })
105
107
  ```
106
108
 
109
+ Tips for the local setup:
110
+ - **Vision accuracy:** a dedicated GUI-grounding model (e.g. `Holo2-8B`, Apache-2.0 GGUF on Hugging Face) typically grounds screen coordinates better than general chat VLMs. Benchmark numbers are vendor-reported (Nov 2025); verify that the model loads in your LM Studio version before switching.
111
+ - **Semantic matching:** load an embedding model (e.g. `text-embedding-nomic-embed-text-v1.5`) and set `embeddingModel` to let paraphrased descriptions ("sign in" vs "log in") resolve without vision.
112
+
107
113
  ### Option B: Google Gemini API (cloud, fast, better vision)
108
114
 
109
115
  1. Get an API key from [Google AI Studio](https://aistudio.google.com/apikey)
@@ -112,10 +118,14 @@ const el = createElementus({
112
118
  const el = createElementus({
113
119
  provider: 'gemini',
114
120
  geminiApiKey: 'AIza...', // or set GEMINI_API_KEY env var
115
- geminiModel: 'gemini-2.5-flash',
121
+ geminiModel: 'gemini-3.5-flash', // or 'gemini-3.1-flash-lite' (cheaper/faster)
116
122
  })
117
123
  ```
118
124
 
125
+ Older `gemini-2.5-flash` / `gemini-2.5-flash-lite` still work (Google retires them 2026-10-16) — Elementus picks the right thinking config per model family automatically.
126
+
127
+ Note: Google's *computer-use* capability requires dedicated models (`gemini-2.5-computer-use-preview-*`, `gemini-3-flash-preview`) and is **not** available on `gemini-3.5-flash`. Elementus's own vision pipeline does not need it — this only matters if you point `geminiModel` at a computer-use model expecting agentic behavior.
128
+
119
129
  ## Framework Setup
120
130
 
121
131
  ### Playwright
@@ -150,10 +160,25 @@ test('example', async ({ page }) => {
150
160
 
151
161
  ### WebDriverIO
152
162
 
163
+ Override the global `$` in your `wdio.conf.js` so all page objects work transparently:
164
+
153
165
  ```javascript
154
- const b = el.wrapBrowser(browser)
155
- await b.$('#btn', { ai: 'Submit order button' }).click()
156
- await b.$('#email', { ai: 'Email input field' }).setValue('test@test.com')
166
+ // wdio.conf.js
167
+ const { createElementus } = require('elementus-ai')
168
+ const el = createElementus({ provider: 'gemini', geminiApiKey: '...' })
169
+
170
+ exports.config = {
171
+ // ... other config
172
+ async before() {
173
+ const wrapped = el.wrapBrowser(browser)
174
+ globalThis.$ = wrapped.$.bind(wrapped)
175
+ }
176
+ }
177
+
178
+ // In tests / page objects — plain $() with optional { ai }:
179
+ await $('[data-testid="btn-send"]') // unchanged, zero overhead
180
+ await $('#btn', { ai: 'Submit order button' }).click() // self-healing
181
+ await $('#email', { ai: 'Email input field' }).setValue('test@test.com')
157
182
  ```
158
183
 
159
184
  ### Appium (Native Android / iOS / Flutter)
@@ -222,21 +247,35 @@ createElementus({
222
247
 
223
248
  // Gemini
224
249
  geminiApiKey: null, // or GEMINI_API_KEY env var
225
- geminiModel: 'gemini-2.5-flash',
250
+ geminiModel: 'gemini-3.5-flash',
226
251
 
227
252
  // Behavior
228
253
  maxCandidates: 20, // max elements sent to LLM for disambiguation
229
254
  visionMaxWidth: 1280, // max screenshot width (px) sent to vision LLM
230
255
 
256
+ // Fingerprint cache (opt-in) — remember healed elements across runs and
257
+ // re-match them algorithmically (zero LLM cost, ~ms) before any AI call
258
+ cacheFile: null, // e.g. './elementus-cache.json'
259
+
260
+ // Semantic matching (opt-in) — embedding model for paraphrase matching
261
+ // when keyword scoring finds nothing
262
+ embeddingModel: null, // e.g. 'text-embedding-nomic-embed-text-v1.5'
263
+
231
264
  // Debugging
232
265
  debug: false, // save screenshots to debugDir
233
- debugDir: './debug', // directory for debug screenshots
266
+ debugDir: null, // required when debug: true, e.g. './debug'
234
267
 
235
268
  // Custom stop words
236
269
  stopWords: null, // Set of words to ignore in descriptions
237
270
  })
238
271
  ```
239
272
 
273
+ ## Security Notes
274
+
275
+ - **Debug screenshots** capture the full page — including any sensitive data visible on it. Keep `debugDir` out of version control.
276
+ - **Page content reaches the LLM.** Element texts and screenshots from the page under test are sent to your configured LLM provider as part of the resolution prompts. Run Elementus against pages you trust, and prefer a local LLM (LM Studio) for sensitive applications.
277
+ - The Gemini API key is sent via the `x-goog-api-key` header (never in the URL) and can be supplied via the `GEMINI_API_KEY` env var instead of code.
278
+
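The notes above suggest keeping the API key out of code and treating the debug directory as sensitive. A minimal sketch of one way to do that, assuming the key is read from `GEMINI_API_KEY` by Elementus itself as the README states; the helper name `buildElementusOptions` and the `ELEMENTUS_DEBUG_DIR` variable are hypothetical and not part of the package:

```javascript
// Builds a createElementus() options object from environment variables.
// ELEMENTUS_DEBUG_DIR is a hypothetical variable name used only in this sketch.
function buildElementusOptions(env = process.env) {
  // Fail fast if the key is missing; Elementus is documented to read it from the env var.
  if (!env.GEMINI_API_KEY) {
    throw new Error('GEMINI_API_KEY is not set')
  }
  const options = { provider: 'gemini' }
  // Enable debug screenshots only when a directory is explicitly provided,
  // since debugDir is required whenever debug is true.
  if (env.ELEMENTUS_DEBUG_DIR) {
    options.debug = true
    options.debugDir = env.ELEMENTUS_DEBUG_DIR
  }
  return options
}
```

The result could then be passed as `createElementus(buildElementusOptions())`, with the chosen debug directory added to `.gitignore`.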
240
279
  ## Timeouts
241
280
 
242
281
  Elementus respects your framework's configured timeouts. It does **not** override or race against them. Set appropriate action timeouts in your framework config:
@@ -253,13 +292,17 @@ If a selector works, it returns immediately (zero overhead). If it fails after y
253
292
 
254
293
  ## How It Works
255
294
 
256
- When a selector fails, Elementus runs a 3-step pipeline:
295
+ When a selector fails, Elementus runs a 5-step pipeline — free and deterministic steps first, LLM steps later, vision last:
257
296
 
258
297
  **Step 1: Locator** — Try the original selector. If it works, done (zero overhead).
259
298
 
260
- **Step 2: DOM/Element Tree Scoring** — Scan all interactive elements on the page (DOM for web, XML source for native apps). Score each by keyword and phrase relevance to the description. If one clear winner, use it. If multiple tied, send candidates to LLM. If all identical (e.g., 10x "Edit" buttons), use positional LLM with coordinates.
299
+ **Step 2: Fingerprint cache** (opt-in via `cacheFile`) If this selector+description healed before on this page, re-match the stored multi-attribute fingerprint (tag, id, text, neighbor text, position, size, …) against the live DOM with weighted similarity. Milliseconds, zero LLM cost. Accepted only with a confidence threshold *and* a margin over the runner-up.
300
+
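The acceptance rule in Step 2 (a confidence threshold plus a margin over the runner-up) can be sketched as follows. This is an illustration only, not Elementus's actual implementation: the attribute weights, the `0.7` threshold, the `0.1` margin, and the helper names `similarity` and `matchFingerprint` are all assumptions chosen for the example.

```javascript
// Hypothetical weights; the real weights are not documented in the README.
const WEIGHTS = { tag: 1, id: 3, text: 3, neighborText: 2 }

// Weighted fraction of the stored attributes that match exactly on a candidate.
function similarity(stored, candidate) {
  let score = 0
  let total = 0
  for (const [attr, weight] of Object.entries(WEIGHTS)) {
    if (stored[attr] === undefined) continue // attribute was never recorded
    total += weight
    if (stored[attr] === candidate[attr]) score += weight
  }
  return total === 0 ? 0 : score / total
}

// Accept the best candidate only if it clears the threshold AND
// beats the runner-up by at least the margin; otherwise return null.
function matchFingerprint(stored, candidates, { threshold = 0.7, margin = 0.1 } = {}) {
  const ranked = candidates
    .map((el) => ({ el, score: similarity(stored, el) }))
    .sort((a, b) => b.score - a.score)
  if (ranked.length === 0) return null
  const [best, runnerUp] = ranked
  if (best.score < threshold) return null
  if (runnerUp && best.score - runnerUp.score < margin) return null
  return best.el
}
```

Requiring a margin as well as a threshold means that two near-identical candidates fall through to the later pipeline steps instead of being picked arbitrarily.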
301
+ **Step 3: DOM/Element Tree Scoring** — Scan all interactive elements on the page (DOM for web, XML source for native apps). Score each by keyword and phrase relevance to the description. If one clear winner, use it. If multiple tied, send the ranked top-N to the LLM. If all identical (e.g., 10x "Edit" buttons), use positional LLM with coordinates. With `embeddingModel` set, zero keyword matches fall back to semantic (embedding) ranking.
302
+
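The keyword and phrase scoring in Step 3 can be pictured with a small sketch. The scoring formula, the phrase bonus of 2, the stop-word list, and the names `tokenize`, `scoreElement`, and `rankElements` are assumptions for illustration; only the `maxCandidates` default of 20 comes from the configuration reference.

```javascript
// Hypothetical stop-word list; Elementus lets you supply your own via `stopWords`.
const STOP_WORDS = new Set(['the', 'a', 'an', 'of', 'for'])

// Lowercase, split on non-alphanumerics, and drop stop words.
function tokenize(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((word) => word && !STOP_WORDS.has(word))
}

// One point per description keyword found in the element text,
// plus a bonus when the whole description appears as a contiguous phrase.
function scoreElement(description, elementText) {
  const wanted = tokenize(description)
  const present = tokenize(elementText)
  const presentSet = new Set(present)
  let score = wanted.filter((word) => presentSet.has(word)).length
  const phrase = ` ${wanted.join(' ')} `
  if (wanted.length > 0 && ` ${present.join(' ')} `.includes(phrase)) score += 2
  return score
}

// Classify the outcome: a single clear winner, a tie to hand to the LLM, or nothing.
function rankElements(description, elements, maxCandidates = 20) {
  const scored = elements
    .map((el) => ({ el, score: scoreElement(description, el.text) }))
    .filter((entry) => entry.score > 0)
    .sort((a, b) => b.score - a.score)
  if (scored.length === 0) return { kind: 'none', candidates: [] }
  const tiedAtTop = scored.filter((entry) => entry.score === scored[0].score)
  if (tiedAtTop.length === 1) return { kind: 'winner', candidates: [scored[0].el] }
  return { kind: 'tied', candidates: scored.slice(0, maxCandidates).map((entry) => entry.el) }
}
```

In this sketch a `'none'` result is the point where an embedding-based fallback would take over, and a `'tied'` result is what would be sent to the LLM for disambiguation.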
303
+ **Step 4: Snapshot grounding** — Playwright: capture an ARIA accessibility snapshot with element refs and ask the text LLM to pick the matching ref. WDIO/native: synthesize an indexed role/name list from the element scan and do the same. No vision model needed.
261
304
 
262
- **Step 3: Vision** — Take a screenshot with a labeled grid overlay. Ask the vision LLM which region contains the target. Scroll there, re-scan. If still unresolved, ask for precise pixel coordinates.
305
+ **Step 5: Vision** — First Set-of-Marks: numbered badges are drawn on the candidate elements, and one vision call returns a badge number. If that fails: screenshot with a labeled 3x3 grid, region re-scan, then a precise-coordinate fail-safe for elements a DOM scan can never see (e.g. canvas-rendered icons). The fail-safe narrows the search band to roughly one viewport before asking the model for pixels, so the target is never too small to locate. It then verifies the result on a zoomed crop and fails loudly rather than clicking the wrong place.
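To make the grid fallback in Step 5 concrete, here is a rough sketch of turning a grid label returned by the vision model into a pixel region to re-scan. The README does not say how the 3x3 cells are labeled, so the column-letter plus row-number scheme (`A1` to `C3`) and the function name `gridRegion` are assumptions.

```javascript
// Maps a grid label such as "B2" to a pixel rectangle on the screenshot.
// The column-letter/row-number labelling is an assumption for this sketch.
function gridRegion(label, width, height, size = 3) {
  const match = /^([A-Z])(\d+)$/.exec(label.trim().toUpperCase())
  if (!match) throw new Error(`Unrecognized grid label: ${label}`)
  const col = match[1].charCodeAt(0) - 'A'.charCodeAt(0)
  const row = Number(match[2]) - 1
  if (col >= size || row < 0 || row >= size) {
    throw new Error(`Grid label out of range: ${label}`)
  }
  const cellWidth = width / size
  const cellHeight = height / size
  return {
    x: Math.round(col * cellWidth),
    y: Math.round(row * cellHeight),
    width: Math.round(cellWidth),
    height: Math.round(cellHeight),
  }
}
```

For a 1200x900 screenshot, `gridRegion('B2', 1200, 900)` yields the center cell, which is the area a region re-scan would then cover.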
263
306
 
264
307
  ## Tips for Writing Descriptions
265
308
 
@@ -280,12 +323,16 @@ When a selector fails, Elementus runs a 3-step pipeline:
280
323
 
281
324
  | Platform | Element scan | Vision | Status |
282
325
  |----------|-------------|--------|--------|
283
- | Playwright (web) | DOM | Screenshot + LLM | Full support |
284
- | WDIO (web) | DOM | Screenshot + LLM | Full support |
285
- | Appium (mobile web) | DOM | Screenshot + LLM | Full support |
286
- | Appium (native Android/iOS) | XML source | Screenshot + LLM | Full support |
287
- | Appium (Flutter) | XML source | Screenshot + LLM | Full support |
288
- | Appium (React Native) | XML source | Screenshot + LLM | Full support |
326
+ | Playwright (web) | DOM | Full-page screenshot + LLM | Full support |
327
+ | WDIO (web) | DOM | Viewport screenshot + LLM | Full support (WDIO v9+ recommended) |
328
+ | Appium (mobile web) | DOM | Viewport screenshot + LLM | Full support |
329
+ | Appium (native Android/iOS) | XML source | None | Element-tree resolution (no vision) |
330
+ | Appium (Flutter) | XML source | None | Element-tree resolution (no vision) |
331
+ | Appium (React Native) | XML source | None | Element-tree resolution (no vision) |
332
+
333
+ Notes:
334
+ - WDIO screenshots are viewport-only, so the vision fallback sees the current viewport rather than the full page. DOM scoring always covers the whole document on every platform.
335
+ - Native apps have no DOM to overlay the vision grid on — resolution uses the native element tree (accessibility-id, resource-id, text) and stops there with an actionable error if unresolved.
289
336
 
290
337
  ## License
291
338