npm - elementus-ai - Versions diffs - 1.0.2 → 1.1.0 - Mend

Files changed (4)

package/README.md CHANGED

@@ -106,6 +106,10 @@ const el = createElementus({
 })
 ```
+Tips for the local setup:
+- **Vision accuracy:** a dedicated GUI-grounding model (e.g. `Holo2-8B`, Apache-2.0 GGUF on Hugging Face) typically grounds screen coordinates better than general chat VLMs — benchmark numbers are vendor-reported (Nov 2025), verify it loads in your LM Studio version before switching.
+- **Semantic matching:** load an embedding model (e.g. `text-embedding-nomic-embed-text-v1.5`) and set `embeddingModel` to let paraphrased descriptions ("sign in" vs "log in") resolve without vision.
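To illustrate the semantic-matching tip, here is a minimal sketch of embedding-based ranking using cosine similarity. The diff does not show how Elementus compares embeddings, so cosine similarity is an assumption, as are the helper names `cosine` and `rankBySimilarity`. The three-dimensional vectors are toy values standing in for real embedding-model output.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Rank candidate elements by similarity to the description's embedding.
// Hypothetical helper; not part of the elementus-ai API.
function rankBySimilarity(queryVector, elements) {
  return elements
    .map((el) => ({ label: el.label, score: cosine(queryVector, el.vector) }))
    .sort((x, y) => y.score - x.score);
}

// Toy vectors: pretend [0.9, 0.1, 0.0] is the embedding of "sign in".
const ranked = rankBySimilarity([0.9, 0.1, 0.0], [
  { label: 'Download report', vector: [0.0, 0.1, 0.9] },
  { label: 'Log in', vector: [0.8, 0.2, 0.1] },
]);
console.log(ranked[0].label);
```

With real embeddings, paraphrases such as "sign in" and "log in" would be expected to land close together even though they share no keyword.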
 ### Option B: Google Gemini API (cloud, fast, better vision)
 1. Get an API key from [Google AI Studio](https://aistudio.google.com/apikey)
@@ -114,10 +118,14 @@ const el = createElementus({
 const el = createElementus({
   provider: 'gemini',
   geminiApiKey: 'AIza...',  // or set GEMINI_API_KEY env var
-  geminiModel: 'gemini-2.5-flash',
+  geminiModel: 'gemini-3.5-flash',  // or 'gemini-3.1-flash-lite' (cheaper/faster)
 })
 ```
+Older `gemini-2.5-flash` / `gemini-2.5-flash-lite` still work (Google retires them 2026-10-16) — Elementus picks the right thinking config per model family automatically.
+Note: Google's *computer-use* capability requires dedicated models (`gemini-2.5-computer-use-preview-*`, `gemini-3-flash-preview`) and is **not** available on `gemini-3.5-flash`. Elementus's own vision pipeline does not need it — this only matters if you point `geminiModel` at a computer-use model expecting agentic behavior.
 ## Framework Setup
 ### Playwright
@@ -239,21 +247,35 @@ createElementus({
   // Gemini
   geminiApiKey: null,       // or GEMINI_API_KEY env var
-  geminiModel: 'gemini-2.5-flash',
+  geminiModel: 'gemini-3.5-flash',
   // Behavior
   maxCandidates: 20,        // max elements sent to LLM for disambiguation
   visionMaxWidth: 1280,     // max screenshot width (px) sent to vision LLM
+  // Fingerprint cache (opt-in) — remember healed elements across runs and
+  // re-match them algorithmically (zero LLM cost, ~ms) before any AI call
+  cacheFile: null,          // e.g. './elementus-cache.json'
+  // Semantic matching (opt-in) — embedding model for paraphrase matching
+  // when keyword scoring finds nothing
+  embeddingModel: null,     // e.g. 'text-embedding-nomic-embed-text-v1.5'
   // Debugging
   debug: false,             // save screenshots to debugDir
-  debugDir: './debug',      // directory for debug screenshots
+  debugDir: null,           // required when debug: true, e.g. './debug'
   // Custom stop words
   stopWords: null,          // Set of words to ignore in descriptions
 })
 ```
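As a minimal sketch of how the new opt-in options might be combined, the helper below merges overrides onto a subset of the defaults listed above and enforces the documented rule that `debugDir` is required when `debug` is true. The option names come from the diff. The function `buildElementusOptions` and its error message are hypothetical and not part of elementus-ai, and how the library itself reacts to a missing `debugDir` is not shown here.

```javascript
// Subset of the defaults from the README diff (1.1.0 values).
const DEFAULTS = {
  geminiModel: 'gemini-3.5-flash',
  maxCandidates: 20,
  visionMaxWidth: 1280,
  cacheFile: null,      // opt-in fingerprint cache
  embeddingModel: null, // opt-in semantic matching
  debug: false,
  debugDir: null,       // required when debug is true
};

// Hypothetical helper: merge overrides and check the debug/debugDir rule
// before the options are handed to createElementus().
function buildElementusOptions(overrides = {}) {
  const options = { ...DEFAULTS, ...overrides };
  if (options.debug && !options.debugDir) {
    throw new Error('debugDir is required when debug is true');
  }
  return options;
}

const options = buildElementusOptions({
  cacheFile: './elementus-cache.json',
  embeddingModel: 'text-embedding-nomic-embed-text-v1.5',
  debug: true,
  debugDir: './debug',
});
```

The resulting object could then be passed as `createElementus(options)`. Checking the rule up front surfaces a missing `debugDir` at setup time rather than mid-run.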
+## Security Notes
+- **Debug screenshots** capture the full page — including any sensitive data visible on it. Keep `debugDir` out of version control.
+- **Page content reaches the LLM.** Element texts and screenshots from the page under test are sent to your configured LLM provider as part of the resolution prompts. Run Elementus against pages you trust, and prefer a local LLM (LM Studio) for sensitive applications.
+- The Gemini API key is sent via the `x-goog-api-key` header (never in the URL) and can be supplied via the `GEMINI_API_KEY` env var instead of code.
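The header-based key handling described in the last note can be sketched as follows. This is not Elementus's actual request code, which the diff does not show. The function name `buildGeminiRequest` is hypothetical, and the URL and body follow the public Gemini REST `generateContent` shape as an assumption about what such a call looks like.

```javascript
// Hypothetical helper: build a Gemini REST request with the key in a header.
// The key comes from the argument or, failing that, GEMINI_API_KEY.
function buildGeminiRequest({ model, prompt, apiKey, env = process.env }) {
  const key = apiKey || env.GEMINI_API_KEY;
  if (!key) {
    throw new Error('No Gemini API key: pass apiKey or set GEMINI_API_KEY');
  }
  return {
    // The URL carries only the model name, never the key.
    url:
      'https://generativelanguage.googleapis.com/v1beta/models/' +
      encodeURIComponent(model) +
      ':generateContent',
    init: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'x-goog-api-key': key,
      },
      body: JSON.stringify({ contents: [{ parts: [{ text: prompt }] }] }),
    },
  };
}

// Usage (not executed here): const { url, init } = buildGeminiRequest({...});
// then: const response = await fetch(url, init);
```

Keeping the key out of the URL means it is less likely to leak through request logs or error messages that print URLs.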
 ## Timeouts
 Elementus respects your framework's configured timeouts. It does **not** override or race against them. Set appropriate action timeouts in your framework config:
@@ -270,13 +292,17 @@ If a selector works, it returns immediately (zero overhead). If it fails after y
 ## How It Works
-When a selector fails, Elementus runs a 3-step pipeline:
+When a selector fails, Elementus runs a 5-step pipeline — free and deterministic steps first, LLM steps later, vision last:
 **Step 1: Locator** — Try the original selector. If it works, done (zero overhead).
-**Step 2: DOM/Element Tree Scoring** — Scan all interactive elements on the page (DOM for web, XML source for native apps). Score each by keyword and phrase relevance to the description. If one clear winner, use it. If multiple tied, send candidates to LLM. If all identical (e.g., 10x "Edit" buttons), use positional LLM with coordinates.
+**Step 2: Fingerprint cache** (opt-in via `cacheFile`) — If this selector+description healed before on this page, re-match the stored multi-attribute fingerprint (tag, id, text, neighbor text, position, size, …) against the live DOM with weighted similarity. Milliseconds, zero LLM cost. Accepted only with a confidence threshold *and* a margin over the runner-up.
+**Step 3: DOM/Element Tree Scoring** — Scan all interactive elements on the page (DOM for web, XML source for native apps). Score each by keyword and phrase relevance to the description. If one clear winner, use it. If multiple tied, send the ranked top-N to the LLM. If all identical (e.g., 10x "Edit" buttons), use positional LLM with coordinates. With `embeddingModel` set, zero keyword matches fall back to semantic (embedding) ranking.
+**Step 4: Snapshot grounding** — Playwright: capture an ARIA accessibility snapshot with element refs and ask the text LLM to pick the matching ref. WDIO/native: synthesize an indexed role/name list from the element scan and do the same. No vision model needed.
-**Step 3: Vision** — Take a screenshot with a labeled grid overlay. Ask the vision LLM which region contains the target. Scroll there, re-scan. If still unresolved, ask for precise pixel coordinates.
+**Step 5: Vision** — First Set-of-Marks: numbered badges drawn on the candidate elements, one vision call returns a badge number. If that fails: screenshot with a labeled 3x3 grid, region re-scan, then a precise-coordinate fail-safe for elements a DOM scan can never see (e.g. canvas-rendered icons). The fail-safe narrows the search band to roughly one viewport before asking the model for pixels — so the target is never too small to locate — then verifies the result on a zoomed crop and fails loudly rather than ever clicking the wrong place.
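The acceptance rule in Step 2 (a confidence threshold plus a margin over the runner-up) can be sketched as below. The real attribute weights, threshold, and margin are not given in the diff, so the values here are placeholders, and `similarity` and `matchFingerprint` are hypothetical names. Exact-equality comparison per attribute is also a simplification of whatever weighted similarity the library uses.

```javascript
// Placeholder weights; the diff lists the attributes but not their weights.
const WEIGHTS = { tag: 1, id: 3, text: 3, neighborText: 2 };

// Weighted share of stored attributes that match exactly on the candidate.
function similarity(stored, candidate) {
  let matched = 0;
  let total = 0;
  for (const [attr, weight] of Object.entries(WEIGHTS)) {
    if (stored[attr] == null) continue; // attribute was never recorded
    total += weight;
    if (stored[attr] === candidate[attr]) matched += weight;
  }
  return total === 0 ? 0 : matched / total;
}

// Accept the best candidate only if it clears the threshold AND beats the
// runner-up by the margin; otherwise return null so later steps can run.
function matchFingerprint(stored, candidates, { threshold = 0.8, margin = 0.15 } = {}) {
  const ranked = candidates
    .map((candidate) => ({ candidate, score: similarity(stored, candidate) }))
    .sort((a, b) => b.score - a.score);
  if (ranked.length === 0) return null;
  const [best, runnerUp] = ranked;
  const gap = best.score - (runnerUp ? runnerUp.score : 0);
  return best.score >= threshold && gap >= margin ? best.candidate : null;
}
```

The margin check is what prevents a confident but ambiguous match: two identical "Edit" buttons would both score highly, the gap would be zero, and the cache step would decline rather than guess.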
 ## Tips for Writing Descriptions
@@ -297,12 +323,16 @@ When a selector fails, Elementus runs a 3-step pipeline:
 | Platform | Element scan | Vision | Status |
 |----------|-------------|--------|--------|
-| Playwright (web) | DOM | Screenshot + LLM | Full support |
-| WDIO (web) | DOM | Screenshot + LLM | Full support |
-| Appium (mobile web) | DOM | Screenshot + LLM | Full support |
-| Appium (native Android/iOS) | XML source | Screenshot + LLM | Full support |
-| Appium (Flutter) | XML source | Screenshot + LLM | Full support |
-| Appium (React Native) | XML source | Screenshot + LLM | Full support |
+| Playwright (web) | DOM | Full-page screenshot + LLM | Full support |
+| WDIO (web) | DOM | Viewport screenshot + LLM | Full support (WDIO v9+ recommended) |
+| Appium (mobile web) | DOM | Viewport screenshot + LLM | Full support |
+| Appium (native Android/iOS) | XML source | — | Element-tree resolution (no vision) |
+| Appium (Flutter) | XML source | — | Element-tree resolution (no vision) |
+| Appium (React Native) | XML source | — | Element-tree resolution (no vision) |
+Notes:
+- WDIO screenshots are viewport-only, so the vision fallback sees the current viewport rather than the full page. DOM scoring always covers the whole document on every platform.
+- Native apps have no DOM to overlay the vision grid on — resolution uses the native element tree (accessibility-id, resource-id, text) and stops there with an actionable error if unresolved.
 ## License