elementus-ai 1.0.2 → 1.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (4)
  1. package/README.md +42 -12
  2. package/elementus.js +1169 -188
  3. package/package.json +16 -1
  4. package/wdio.d.ts +1 -3
package/README.md CHANGED
@@ -106,6 +106,10 @@ const el = createElementus({
  })
  ```
 
+ Tips for the local setup:
+ - **Vision accuracy:** a dedicated GUI-grounding model (e.g. `Holo2-8B`, Apache-2.0 GGUF on Hugging Face) typically grounds screen coordinates better than general chat VLMs — benchmark numbers are vendor-reported (Nov 2025), so verify it loads in your LM Studio version before switching.
+ - **Semantic matching:** load an embedding model (e.g. `text-embedding-nomic-embed-text-v1.5`) and set `embeddingModel` to let paraphrased descriptions ("sign in" vs "log in") resolve without vision.
+
  ### Option B: Google Gemini API (cloud, fast, better vision)
 
  1. Get an API key from [Google AI Studio](https://aistudio.google.com/apikey)
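The semantic-matching tip added in the hunk above relies on comparing embedding vectors. The sketch below illustrates the general idea only: it ranks candidate labels by cosine similarity to a query vector. The function names and the toy 3-dimensional vectors are hypothetical; the diff does not show how Elementus computes or compares embeddings.

```javascript
// Hypothetical helper: cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('vector length mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as "no similarity".
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical helper: sort candidates ({ label, vector }) by similarity, best first.
function rankBySimilarity(queryVector, candidates) {
  return candidates
    .map((c) => ({ label: c.label, score: cosineSimilarity(queryVector, c.vector) }))
    .sort((x, y) => y.score - x.score);
}

// Toy vectors made up for illustration; a real embedding model returns
// vectors with hundreds of dimensions.
const ranked = rankBySimilarity([1, 0.9, 0], [
  { label: 'Delete account', vector: [0, 0.1, 1] },
  { label: 'Log in', vector: [0.9, 1, 0.1] },
]);
console.log(ranked[0].label); // prints "Log in"
```

In this toy setup the query vector stands in for the description "sign in", and the candidate whose vector points in nearly the same direction wins even though the words differ.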
@@ -114,10 +118,14 @@ const el = createElementus({
  const el = createElementus({
  provider: 'gemini',
  geminiApiKey: 'AIza...', // or set GEMINI_API_KEY env var
- geminiModel: 'gemini-2.5-flash',
+ geminiModel: 'gemini-3.5-flash', // or 'gemini-3.1-flash-lite' (cheaper/faster)
  })
  ```
 
+ Older `gemini-2.5-flash` / `gemini-2.5-flash-lite` still work (Google retires them 2026-10-16) — Elementus picks the right thinking config per model family automatically.
+
+ Note: Google's *computer-use* capability requires dedicated models (`gemini-2.5-computer-use-preview-*`, `gemini-3-flash-preview`) and is **not** available on `gemini-3.5-flash`. Elementus's own vision pipeline does not need it — this only matters if you point `geminiModel` at a computer-use model expecting agentic behavior.
+
  ## Framework Setup
 
  ### Playwright
@@ -239,21 +247,35 @@ createElementus({
 
  // Gemini
  geminiApiKey: null, // or GEMINI_API_KEY env var
- geminiModel: 'gemini-2.5-flash',
+ geminiModel: 'gemini-3.5-flash',
 
  // Behavior
  maxCandidates: 20, // max elements sent to LLM for disambiguation
  visionMaxWidth: 1280, // max screenshot width (px) sent to vision LLM
 
+ // Fingerprint cache (opt-in) — remember healed elements across runs and
+ // re-match them algorithmically (zero LLM cost, ~ms) before any AI call
+ cacheFile: null, // e.g. './elementus-cache.json'
+
+ // Semantic matching (opt-in) — embedding model for paraphrase matching
+ // when keyword scoring finds nothing
+ embeddingModel: null, // e.g. 'text-embedding-nomic-embed-text-v1.5'
+
  // Debugging
  debug: false, // save screenshots to debugDir
- debugDir: './debug', // directory for debug screenshots
+ debugDir: null, // required when debug: true, e.g. './debug'
 
  // Custom stop words
  stopWords: null, // Set of words to ignore in descriptions
  })
  ```
 
+ ## Security Notes
+
+ - **Debug screenshots** capture the full page — including any sensitive data visible on it. Keep `debugDir` out of version control.
+ - **Page content reaches the LLM.** Element texts and screenshots from the page under test are sent to your configured LLM provider as part of the resolution prompts. Run Elementus against pages you trust, and prefer a local LLM (LM Studio) for sensitive applications.
+ - The Gemini API key is sent via the `x-goog-api-key` header (never in the URL) and can be supplied via the `GEMINI_API_KEY` env var instead of code.
+
  ## Timeouts
 
  Elementus respects your framework's configured timeouts. It does **not** override or race against them. Set appropriate action timeouts in your framework config:
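The Security Notes added in this hunk state that the Gemini API key travels in the `x-goog-api-key` header rather than in the URL. A minimal sketch of what such a request could look like follows. `buildGeminiRequest` is a hypothetical helper written for illustration, not code from `elementus.js`; the endpoint and header name are those of Google's public Generative Language REST API, and the model name is the one used in the README.

```javascript
// Hypothetical helper (not Elementus's internal code): builds a Gemini
// generateContent request with the key in a header, never in the query string.
function buildGeminiRequest({ model, prompt, apiKey = process.env.GEMINI_API_KEY }) {
  if (!apiKey) {
    throw new Error('Missing API key: pass apiKey or set GEMINI_API_KEY');
  }
  return {
    url: `https://generativelanguage.googleapis.com/v1beta/models/${model}:generateContent`,
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'x-goog-api-key': apiKey, // key stays out of URLs, proxies' access logs, and history
      },
      body: JSON.stringify({ contents: [{ parts: [{ text: prompt }] }] }),
    },
  };
}

// 'example-key' is a placeholder, not a real credential.
const request = buildGeminiRequest({
  model: 'gemini-3.5-flash',
  prompt: 'ping',
  apiKey: 'example-key',
});
console.log(request.url.includes('example-key')); // prints false
```

On Node 18 or later the result could be passed to the built-in `fetch(request.url, request.options)`. Keeping the key in a header means it does not end up in logged or shared URLs.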
@@ -270,13 +292,17 @@ If a selector works, it returns immediately (zero overhead). If it fails after y
 
  ## How It Works
 
- When a selector fails, Elementus runs a 3-step pipeline:
+ When a selector fails, Elementus runs a 5-step pipeline — free and deterministic steps first, LLM steps later, vision last:
 
  **Step 1: Locator** — Try the original selector. If it works, done (zero overhead).
 
- **Step 2: DOM/Element Tree Scoring** — Scan all interactive elements on the page (DOM for web, XML source for native apps). Score each by keyword and phrase relevance to the description. If one clear winner, use it. If multiple tied, send candidates to LLM. If all identical (e.g., 10x "Edit" buttons), use positional LLM with coordinates.
+ **Step 2: Fingerprint cache** (opt-in via `cacheFile`): If this selector+description healed before on this page, re-match the stored multi-attribute fingerprint (tag, id, text, neighbor text, position, size, …) against the live DOM with weighted similarity. Milliseconds, zero LLM cost. Accepted only with a confidence threshold *and* a margin over the runner-up.
+
+ **Step 3: DOM/Element Tree Scoring** — Scan all interactive elements on the page (DOM for web, XML source for native apps). Score each by keyword and phrase relevance to the description. If one clear winner, use it. If multiple tied, send the ranked top-N to the LLM. If all identical (e.g., 10x "Edit" buttons), use positional LLM with coordinates. With `embeddingModel` set, zero keyword matches fall back to semantic (embedding) ranking.
+
+ **Step 4: Snapshot grounding** — Playwright: capture an ARIA accessibility snapshot with element refs and ask the text LLM to pick the matching ref. WDIO/native: synthesize an indexed role/name list from the element scan and do the same. No vision model needed.
 
- **Step 3: Vision** — Take a screenshot with a labeled grid overlay. Ask the vision LLM which region contains the target. Scroll there, re-scan. If still unresolved, ask for precise pixel coordinates.
+ **Step 5: Vision** — First Set-of-Marks: numbered badges drawn on the candidate elements, one vision call returns a badge number. If that fails: screenshot with a labeled 3x3 grid, region re-scan, then a precise-coordinate fail-safe for elements a DOM scan can never see (e.g. canvas-rendered icons). The fail-safe narrows the search band to roughly one viewport before asking the model for pixels (so the target is never too small to locate), then verifies the result on a zoomed crop and fails loudly rather than ever clicking the wrong place.
 
  ## Tips for Writing Descriptions
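The fingerprint-cache step in the hunk above accepts a match only when it clears a confidence threshold and beats the runner-up by a margin. The sketch below shows one way such a rule could be written. The attribute weights, the `0.6` threshold, the `0.15` margin, and all function names are assumptions chosen for illustration; the actual values and similarity function live in `elementus.js`, which this view does not show.

```javascript
// Hypothetical attribute weights: higher means more identifying.
const WEIGHTS = { tag: 1, id: 3, text: 3, neighborText: 2 };

// Weighted share of stored attributes that the candidate matches exactly (0..1).
function similarity(stored, candidate) {
  let matched = 0;
  let total = 0;
  for (const [attr, weight] of Object.entries(WEIGHTS)) {
    if (stored[attr] === undefined) continue; // attribute was never recorded
    total += weight;
    if (stored[attr] === candidate[attr]) matched += weight;
  }
  return total === 0 ? 0 : matched / total;
}

// Return the best candidate only if it is both confident and unambiguous.
function matchFingerprint(stored, candidates, { threshold = 0.6, margin = 0.15 } = {}) {
  const scored = candidates
    .map((el) => ({ el, score: similarity(stored, el) }))
    .sort((a, b) => b.score - a.score);
  if (scored.length === 0) return null;
  const [best, runnerUp] = scored;
  const gap = best.score - (runnerUp ? runnerUp.score : 0);
  return best.score >= threshold && gap >= margin ? best.el : null;
}

const stored = { tag: 'button', id: 'save', text: 'Save', neighborText: 'Profile' };
const renamed = { tag: 'button', id: 'save-btn', text: 'Save', neighborText: 'Profile' };
const other = { tag: 'button', id: 'cancel', text: 'Cancel', neighborText: 'Profile' };

// The id changed, but tag, text, and neighbor text still match (6 of 9 weight points).
const healed = matchFingerprint(stored, [other, renamed]);
// Two indistinguishable candidates: no margin over the runner-up, so no match.
const ambiguous = matchFingerprint(stored, [renamed, { ...renamed }]);
console.log(healed === renamed, ambiguous); // prints: true null
```

The margin check is what keeps a page with several near-identical elements from being "healed" to an arbitrary one; in that case the sketch returns `null` so a later pipeline step can decide.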
 
@@ -297,12 +323,16 @@ When a selector fails, Elementus runs a 3-step pipeline:
 
  | Platform | Element scan | Vision | Status |
  |----------|-------------|--------|--------|
- | Playwright (web) | DOM | Screenshot + LLM | Full support |
- | WDIO (web) | DOM | Screenshot + LLM | Full support |
- | Appium (mobile web) | DOM | Screenshot + LLM | Full support |
- | Appium (native Android/iOS) | XML source | Screenshot + LLM | Full support |
- | Appium (Flutter) | XML source | Screenshot + LLM | Full support |
- | Appium (React Native) | XML source | Screenshot + LLM | Full support |
+ | Playwright (web) | DOM | Full-page screenshot + LLM | Full support |
+ | WDIO (web) | DOM | Viewport screenshot + LLM | Full support (WDIO v9+ recommended) |
+ | Appium (mobile web) | DOM | Viewport screenshot + LLM | Full support |
+ | Appium (native Android/iOS) | XML source | | Element-tree resolution (no vision) |
+ | Appium (Flutter) | XML source | | Element-tree resolution (no vision) |
+ | Appium (React Native) | XML source | | Element-tree resolution (no vision) |
+
+ Notes:
+ - WDIO screenshots are viewport-only, so the vision fallback sees the current viewport rather than the full page. DOM scoring always covers the whole document on every platform.
+ - Native apps have no DOM to overlay the vision grid on — resolution uses the native element tree (accessibility-id, resource-id, text) and stops there with an actionable error if unresolved.
 
  ## License