@jrpool/kilotest 31.2.4 → 33.0.0

package/AI-TOOL.md CHANGED
@@ -1,27 +1,40 @@
1
- # Kilotest as an AI Tool
1
+ # Kilotest as connector to and creator of AI Tools
2
2
 
3
3
  ## Introduction
4
4
 
5
- Until 2026 Kilotest was intended, and implemented, as a web application for human users.
5
+ Until 2026 Kilotest was intended, and implemented, as a web application performing a service for human users.
6
6
 
7
- Beginning in May 2026, it [became evident](https://github.com/jrpool/kilotest/issues/2) that Kilotest could also act as a tool for use by language models. Language models are asked for help in all domains, including the domain of software quality. When asked about the front-end quality (accessibility, usability, and standard conformity) of specific web pages, models gave answers without the use of tools, and the answers were inferior: simplistic, speculative, and fabricated. If a model were to use Kilotest as a tool, the model could give comprehensive, authoritative, grounded, and factual answers. Every reported defect could be documented and ascribed to one or more specific tools in the Kilotest ensemble.
7
+ Beginning in May 2026, it [became evident](https://github.com/jrpool/kilotest/issues/2) that Kilotest could also act as a connector to tools for use by language models. Language models are asked for help in all domains, including the domain of software quality. When asked about the front-end quality (accessibility, usability, and standard conformity) of specific web pages, models gave answers without the use of tools. The answers were at best fragmentary, and often speculative and fabricated. If Kilotest offered to connect language models to its functionalities as tools, models could give less expensive and more comprehensive, factual, authoritative, and grounded answers. Every reported defect could be documented and ascribed to one or more specific rule engines in the Kilotest ensemble.
8
8
 
9
- Given the potential of Kilotest as an AI tool and the expected continued growth in the share of questions that are directed to language models, a decision was made to make Kilotest discoverable and usable as a tool for language models.
9
+ Given the potential of Kilotest to serve language models and the expected continued growth in the share of questions that are directed to AI platforms that use language models, a decision was made to **make Kilotest discoverable, usable, and, where appropriate, used as a connector to, and creator of, tools for language models**.
10
+
11
+ ## Terms
12
+
13
+ - **Language model**: A model (e.g., Claude Haiku 4.5, Kimi K2.6, GPT-5.4, Gemini 3.1 Pro) that can consume and generate text, images, or other content.
14
+ - **AI platform**: A platform (e.g., Claude Desktop, Perplexity, ChatGPT, Gemini) that gives human users access to the services of language models and connects language models to productivity resources.
15
+ - **Tool**: A specialized productivity resource providing a capability that the language model itself needs but does not have. In the OpenAPI specification, the term _operation_ is used to mean tool.
16
+ - **Connector**: A service that allows AI platforms to enable their language models to discover, evaluate, and use tools.
17
+
18
+ Tools can be independent of the connectors that connect them to language models, but Kilotest can provide both. It already provides capabilities to human users; it can re-use the same functions that underlie those capabilities and reconfigure those functions as tools for language models. Kilotest can also provide connectors to those tools for language models and make those connectors compatible with AI platforms.
10
19
 
11
20
  ## Internal additions
12
21
 
13
- The internal changes that have been made in the Kilotest codebase to support the use of Kilotest as an AI tool are:
22
+ The internal changes that have been made in the Kilotest codebase to support the use of Kilotest as a set of tools and connectors to them are:
14
23
 
15
24
  - An API, consisting of:
16
25
  - Additions to `index.js`.
17
- - Addition to `util.js`.
26
+ - Additions to `util.js`.
18
27
  - Within directories providing API functionality:
19
28
  - `api.js` modules.
20
29
  - `util.js` modules, if needed, containing resources shared by the `index.js` and `api.js` modules in those directories.
30
+ - A [`JSON-LD`](https://json-ld.org/) script in the `index.html` file, providing structured data about the Kilotest API.
21
31
  - Additions to the `env.example` file.
22
32
  - An `llms.txt` file and an `llms-full.txt` file, documenting the use of Kilotest by language models, conforming to the [llms-txt](https://llmstxt.org/) specification.
23
33
  - An `openapi.yaml` file, documenting the Kilotest API, conforming to the [OpenAPI specification](https://spec.openapis.org/oas/v3.1.0).
34
+ - An `mcp.js` file, providing an MCP server for Kilotest.
24
35
  - A `sitemap.xml` file.
36
+ - Additions to the `README.md` file.
37
+ - This `AI-TOOL.md` file.
25
38
  - A `researchAgent.js` file, testing the API.
26
39
 
27
40
  ## External actions
@@ -32,35 +45,84 @@ The external actions that have been taken to support the use of Kilotest as an A
32
45
  - A [request](https://rapidapi.com/studio/api_91f2ce07-2572-48bd-a34d-ff01ed6cd039/publish/general) to add Kilotest to the Rapid API Hub.
33
46
  - An [issue](https://github.com/APIs-guru/openapi-directory/issues/2677) to add Kilotest to `openapi-directory`.
34
47
  - A [pull request](https://github.com/w3c/wai-evaluation-tools-list/pull/1153) to add Kilotest to the WAI evaluation tools list.
35
- - [Configuration of Claude Desktop](https://github.com/ivo-toby/mcp-openapi-server#option-1-using-with-claude-desktop-stdio-transport) on the local development host to let Claude models find Kilotest.
48
+ - [Configuration of Claude Desktop](https://github.com/ivo-toby/mcp-openapi-server#option-1-using-with-claude-desktop-stdio-transport) on the local development host to connect Claude Desktop models to Kilotest tools.
49
+ - Deployment of an MCP server in HTTP mode on the Kilotest service host.
36
50
 
37
51
  ## Use cases
38
52
 
39
- The rationale for Kilotest as a tool for language models is set forth in the `llms-full.txt` file and is not repeated here.
53
+ The rationale for Kilotest as a connector to tools for language models is set forth in the `llms-full.txt` file and is not repeated here.
40
54
 
41
55
  Some common anticipated use cases for this role are:
42
56
 
43
- - A user of a web-builder platform with responsibility for a website asks an AI platform for help in creating or improving the website.
44
- - A prospective customer of a web development service asks an AI platform to evaluate the quality of websites in the portfolios of candidate vendors.
45
- - A professional web developer within an organization asks an AI platform for a code review.
46
- - A risk-management professional within an organization asks an AI platform to report on any web accessibility defects that could expose the organization to prosecution or civil litigation.
47
- - An person who depends on web accessibility because of disabilities asks an AI platform to provide documentary support for a complaint to the owner of a website about accessibility defects.
48
- - A disability-rights advocate or attorney concerned with inaccessibility in a particular industry asks an AI platform to perform a front-end-quality comparison of some websites in that industry.
57
+ 1. A user of a web-builder platform with responsibility for a website asks an AI platform for help in creating or improving the website.
58
+ 1. A prospective customer of a web development service asks an AI platform to evaluate the quality of websites in the portfolios of candidate vendors.
59
+ 1. A professional web developer within an organization asks an AI platform for a code review.
60
+ 1. A risk-management professional within an organization asks an AI platform to report on any web accessibility defects that could expose the organization to prosecution or civil litigation for disability discrimination.
61
+ 1. A person who depends on web accessibility because of disabilities asks an AI platform to provide documentary support for a complaint to the owner of a website about accessibility defects.
62
+ 1. A disability-rights advocate or attorney concerned with inaccessibility in a particular industry asks an AI platform to perform a front-end-quality comparison of some websites in that industry.
49
63
 
50
- ## Strategy
64
+ Among these use cases, only case 3 would make it feasible for the user to tell an AI platform explicitly and formally that Kilotest is an available connector to tools that are relevant to the task. None of the other 5 use cases would make that feasible. In those 5 use cases, the user has a question but relies on the AI platform to know or discover which relevant connectors exist, to select appropriate connectors, and to provide language models that can use those connectors and the tools that they connect to.
65
+
66
+ Use cases 1, 2, 4, 5, and 6 exemplify a widespread expectation and demand for AI platform capability. The commonality is: “I have a question; answer it.” If Kilotest can be employed as an expert for AI platforms in relevant cases, platforms will be more successful in satisfying that demand. At present this is a difficult problem because of platform limitations and a lack of standardization.
51
67
 
52
- ### Ease of use
68
+ ## Strategy
53
69
 
54
- Any use of Kilotest as a tool of language models will fail for some of the anticipated users if they are obligated to be aware of Kilotest and to translate that awareness into instructions or documentation. Therefore, for success in all the use cases, users should be able to tell models what is wanted and rely on models to figure out whether they need tools and, if so, which ones and how to use them.
70
+ The Kilotest project is attempting to solve the problem just described. The attempt follows this strategy.
55
71
 
56
72
  ### Increments
57
73
 
58
- The standardization of tool discovery and utilization by and for language models and AI platforms is partial. Major differences in protocols exist among model and platform families. Therefore, small testable increments of improvement in the tool functionality of Kilotest can best be defined by model and platform family. For example, working on discoverability first and then on usability would not be effective, because both are complex and testability would be postponed until both were complete. Instead, a coherent model and platform family should be identified and any and all improvements to make use cases successful within that family should be made and tested, before development proceeds to the next family.
74
+ Connector and tool discovery and utilization have been only partly standardized. Major differences in protocols exist among model and platform families. Therefore, small testable increments of improvement in the connector and tool functionality of Kilotest can best be defined by model and platform family and by use-case characteristics.
59
75
 
60
- One benefit of this type of incrementalism is that, after the first increment succeeds, it becomes possible to make and test external changes publicizing the fact that a particular platform-model combination delivers unprecedentedly comprehensive, truthful, and low-cost answers to questions about front-end web quality.
76
+ One benefit of such incrementalism is that, after the first increment succeeds, it becomes possible to make and test external changes publicizing the fact that, in a particular class of use cases, a particular platform-model combination delivers unprecedentedly inexpensive, comprehensive, and truthful answers to questions about front-end web quality.
61
77
 
62
78
  Another benefit is that subsequent increments can be defined incrementally rather than in advance. Lessons learned from the work on each increment can inform the choice of what to work on next.
63
79
 
64
80
  #### Increment 1
65
81
 
66
- In the first increment, the objective is to make Kilotest automatically discovered and used by Anthropic Claude models when used within the Claude Desktop application. That investigation is under way. Results will be summarized here.
82
+ In the first increment, the objective was to make Anthropic Claude models use Kilotest to help them answer questions from developers using the Claude Desktop application for use case 3, where the code in question is already deployed as a public web page.
83
+
84
+ As mentioned above, Claude Desktop was installed on the local development host and then configured for Kilotest. That configuration consisted of the addition of a property to the Claude Desktop configuration file, located at `~/Library/Application Support/Claude/claude_desktop_config.json`. The added property is:
85
+
86
+ ```json
87
+ "mcpServers": {
88
+   "kilotest": {
89
+     "command": "npx",
90
+     "args": [
91
+       "-y",
92
+       "@ivotoby/openapi-mcp-server",
93
+       "--disable-abbreviation",
94
+       "true"
95
+     ],
96
+     "env": {
97
+       "API_BASE_URL": "https://kilotest.com",
98
+       "OPENAPI_SPEC_PATH": "https://kilotest.com/openapi.yaml"
99
+     }
100
+   }
101
+ },
102
+ ```
103
+
104
+ That configuration notifies models that Kilotest is available as an MCP server, but does not explain what tools it provides, what they are useful for, or how to use them.
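For readers who would rather script that edit than make it by hand, the following is a minimal JavaScript sketch of merging the property shown above into an existing configuration object. The helper name `addKilotestServer` and the sample `existing` object are illustrative assumptions, not part of Kilotest or Claude Desktop. Only the contents of the `kilotest` entry are taken from the configuration above.

```javascript
// The kilotest entry, copied from the configuration property shown above.
const kilotestServer = {
  command: 'npx',
  args: ['-y', '@ivotoby/openapi-mcp-server', '--disable-abbreviation', 'true'],
  env: {
    API_BASE_URL: 'https://kilotest.com',
    OPENAPI_SPEC_PATH: 'https://kilotest.com/openapi.yaml'
  }
};

// Hypothetical helper: returns a copy of a configuration object with the
// kilotest server added, leaving the input and any other servers untouched.
function addKilotestServer(config) {
  return {
    ...config,
    mcpServers: {
      ...(config.mcpServers || {}),
      kilotest: kilotestServer
    }
  };
}

// Assumed sample of a configuration that already declares another server.
const existing = {mcpServers: {other: {command: 'node', args: ['server.js']}}};
const updated = addKilotestServer(existing);
console.log(JSON.stringify(updated, null, 2));
```

In practice the input would be the parsed content of the configuration file, and the result would be written back as JSON. The sketch leaves file reading and writing out so that it stays independent of the host's file layout.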
105
+
106
+ Experimentation revealed that Claude Sonnet 4.6 with Medium effort chose to use Kilotest when answering questions about the accessibility and usability of particular public web pages. Low effort did not result in the use of Kilotest. The less capable Claude Haiku model did not use Kilotest.
107
+
108
+ #### Increment 2
109
+
110
+ Increment 2 somewhat lightens the burden on the user by shifting the user environment from an installed application to a web browser, and by providing an easier interface for users to tell the platform about Kilotest. The platform is the `claude.ai` web application instead of Claude Desktop. Instead of editing a JSON file, the user tells the platform about Kilotest with a setting.
111
+
112
+ The investigation began with the addition of Kilotest as a _connector_ to the `claude.ai` web application. The user used the `Customize/Connectors/Add connector/Add custom connector` interface, providing these data before activating the `Add` button:
113
+
114
+ - Name: Kilotest
115
+ - Remote MCP server URL: `https://kilotest.com/mcp`
116
+
117
+ When the connector configuration with 4 endpoints appeared, the user changed the tool permissions for all 4 endpoints from `Needs approval` to `Always allow`.
118
+
119
+ Then the user asked the platform, “Summarize the accessibility and usability of the home page of the website of Salesforce.”
120
+
121
+ The results with the least powerful model were:
122
+
123
+ - Claude Haiku 4.5 not extended: Summarized accessibility programs of Salesforce, instead of saying anything about its home page.
124
+ - Claude Haiku 4.5 extended: Announced this thought process: “Deliberated between direct inspection and specialized accessibility testing tools”; then announced that it was using Kilotest, specifically the `summarize-accessibility-of-all-tested-web-pages` path; then stated: “Great! I've loaded the Kilotest tools. Now let me first check if Salesforce has already been tested by calling the summarize function. If not, I'll submit it for testing.”
125
+
126
+ These results show that on the `claude.ai` platform, when configured with Kilotest as a connector, even Claude Haiku, in its Extended mode, can decide to use Kilotest when appropriate and can check for existing data before deciding to recommend that Kilotest test a page. This correct sequencing may be a result of the revised path descriptions in the `openapi.yaml` file recommending that sequencing.
127
+
128
+ In the first run, Claude Haiku paused for about 5 minutes and then reported: “The Kilotest service seems to have timed out. Let me try a different approach - I'll fetch the Salesforce home page directly and evaluate it myself for accessibility and usability issues.”
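The sequencing and the timeout described in the Increment 2 notes can be sketched in JavaScript as follows. This is only an illustration under stated assumptions: `summarizeOrRecommend`, `getSummary`, `recommendTest`, and the 30-second default are hypothetical names and values chosen for the sketch, and the two functions stand in for whatever calls a client would actually make to Kilotest. The sketch shows one way to check for existing results first, recommend a test only when none exist, and report a bounded failure instead of waiting for minutes.

```javascript
// Hypothetical sketch: getSummary and recommendTest are caller-supplied async
// functions standing in for real calls to a service such as Kilotest.
async function summarizeOrRecommend(pageURL, getSummary, recommendTest, timeoutMs = 30000) {
  let timer;
  // A promise that rejects once the time budget is exhausted.
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('timeout')), timeoutMs);
  });
  try {
    // Step 1: look for results of tests that have already been performed.
    const summary = await Promise.race([getSummary(pageURL), timeout]);
    if (summary) {
      return {status: 'found', summary};
    }
    // Step 2: only if nothing exists, recommend that the page be tested.
    await Promise.race([recommendTest(pageURL), timeout]);
    return {status: 'recommended'};
  } catch (error) {
    // A timeout or a service error is reported to the caller, not hidden.
    return {status: 'unavailable', reason: error.message};
  } finally {
    clearTimeout(timer);
  }
}
```

A caller that receives `unavailable` can tell the user that the service did not respond, which is easier to interpret than a silent switch to an unaided evaluation.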
package/IDEAS.md CHANGED
@@ -1,4 +1,4 @@
1
- # development
1
+ # ideas
2
2
 
3
3
  Notes about the further development of this project.
4
4
 
package/README.md CHANGED
@@ -4,7 +4,7 @@ An ensemble testing service with a focus on accessibility
4
4
 
5
5
  ## Features
6
6
 
7
- This application uses an ensemble of 10 tools to test public web pages for standards conformity, usability, and accessibility.
7
+ This application uses an ensemble of 10 rule engines to test public web pages for front-end quality (i.e. accessibility, usability, and conformity to standards).
8
8
 
9
9
  The testing paradigm employed by Kilotest is discussed in these papers:
10
10
 
package/SERVICE.md CHANGED
@@ -239,7 +239,7 @@ In case Testaro again becomes a dependency of Kilotest, the notes below on secur
239
239
 
240
240
  ### Browser privileges
241
241
 
242
- Testaro uses Playwright to launch and control headless browsers, often `chromium`. Those browsers navigate to web pages that are tested by the tools that Testaro integrates. The `qualWeb` tool launches its own browser via Puppeteer as a dependency to perform its tests.
242
+ Testaro uses Playwright to launch and control headless browsers, often `chromium`. Those browsers navigate to web pages that are tested by the rule engines that Testaro integrates. The `qualWeb` tool launches its own browser via Puppeteer as a dependency to perform its tests.
243
243
 
244
244
  When either Playwright or Puppeteer launches a `chromium` browser, in most environments it is [sandboxed](https://www.geeksforgeeks.org/ethical-hacking/what-is-browser-sandboxing/). Sandboxing is a security feature that prevents the browser from accessing potentially unsafe system resources. But in the Ubuntu Linux operating system that was installed on the Vultr Cloud Compute host a sandboxed browser requires an [unprivileged user namespace](https://ubuntu.com/blog/ubuntu-23-10-restricted-unprivileged-user-namespaces), and when Ubuntu was installed its configuration disallowed such namespaces. The file `/etc/sysctl.d/99-kilotest-userns.conf` with the content `kernel.apparmor_restrict_unprivileged_userns = 1` prohibited unprivileged user namespaces and thereby made sandboxed browsers unlaunchable.
245
245
 
@@ -37,7 +37,7 @@
37
37
  <li>Related WCAG standard: __wcag__</li>
38
38
  </ul>
39
39
  <h2>Diagnoses</h2>
40
- <p>Here is how tools diagnose the <q>__issue__</q> issue for HTML element __catalogIndex__ of the __target__ page.</p>
40
+ <p>Here is how rule engines diagnose the <q>__issue__</q> issue for HTML element __catalogIndex__ of the __target__ page.</p>
41
41
  <ol>
42
42
  __diagnoses__
43
43
  </ol>
package/index.html CHANGED
@@ -15,22 +15,22 @@
15
15
  "name": "Kilotest",
16
16
  "sameAs": "https://github.com/jrpool/kilotest",
17
17
  "datePublished": "2025-10-14",
18
- "applicationCategory": "Web accessibility, usability, and standard conformity testing and reporting service",
18
+ "applicationCategory": "Front-end quality (i.e. web accessibility, usability, and standard conformity) testing and reporting service",
19
19
  "isAccessibleForFree": true,
20
- "description": "Testing of web pages for accessibility, usability, and standard conformity with an ensemble of 10 tools and provision of results at selectable levels of detail.",
20
+ "description": "Testing of web pages for front-end quality (i.e. web accessibility, usability, and standard conformity) with an ensemble of 10 rule engines and provision of results at selectable levels of detail.",
21
21
  "softwareHelp": {
22
22
  "@type": "CreativeWork",
23
23
  "url": "https://kilotest.com/tutorial.html"
24
24
  },
25
25
  "featureList": [
26
- "Automated testing of web accessibility, usability, and standard conformity",
26
+ "Automated testing of front-end quality (i.e. web accessibility, usability, and standard conformity)",
27
27
  "WCAG 2.2 conformance evaluation",
28
28
  "HTML5 conformance evaluation",
29
29
  "CSS3 conformance evaluation",
30
- "Ensemble of 10 testing tools performing more than a thousand tests",
30
+ "Ensemble of 10 rule engines performing more than a thousand tests",
31
31
  "Multi-page, single-page, single-issue, and single-element reporting",
32
32
  "Priority-ranked issue lists",
33
- "Public API for agent integration"
33
+ "Public API, connector, and tools for language-model integration"
34
34
  ],
35
35
  "url": "https://kilotest.com/",
36
36
  "maintainer": {
@@ -63,17 +63,17 @@
63
63
  <summary>More than a thousand? Really?</summary>
64
64
  <h2>Ensemble testing</h2>
65
65
  <p>Kilotest does <strong>ensemble testing</strong> of web pages for <a href="https://www.w3.org/mission/accessibility/">accessibility</a>, usability, and conformity to HTML standards.</p>
66
- <p>Efficient testing <strong>first</strong> leverages <strong>automated tools</strong> and corrects the defects that they discover, <strong>then</strong> employs human testers.</p>
67
- <p>Choosing an automated tool is hard, because they <a href="https://www.w3.org/WAI/test-evaluate/tools/list/">are numerous</a> and <a href="https://arxiv.org/pdf/2304.07591">vary in what and how they test</a>.</p>
68
- <p><strong>Ensemble testing</strong>, instead of choosing a tool, combines multiple tools. It can discover more issues than single-tool testing, and the tools can check one another&rsquo;s quality.</p>
69
- <h2>The tools</h2>
70
- <p>The Kilotest ensemble is <a href="https://github.com/jrpool/testaro?tab=readme-ov-file#dependencies">10 tools</a> developed by teams in the USA, Canada, Denmark, Australia, and Portugal.</p>
71
- <p>The tools perform about 1,300 tests in total. <a href="https://medium.com/cvs-health-tech-blog/how-to-run-a-thousand-accessibility-tests-63692ad120c3">Open-source software</a> developed initially at CVS Health drives the tools, tries to weed out invalid results, and distills the findings into about 350 <q>issues</q>.</p>
66
+ <p>Efficient testing <strong>first</strong> leverages <strong>automated rule engines</strong> and corrects the defects that they discover, <strong>then</strong> employs human testers.</p>
67
+ <p>Choosing an automated rule engine is hard, because they <a href="https://www.w3.org/WAI/test-evaluate/tools/list/">are numerous</a> and <a href="https://arxiv.org/pdf/2304.07591">vary in what and how they test</a>.</p>
68
+ <p><strong>Ensemble testing</strong>, instead of choosing a rule engine, combines multiple rule engines. It can discover more issues than single-rule-engine testing, and the rule engines can check one another&rsquo;s quality.</p>
69
+ <h2>The rule engines</h2>
70
+ <p>The Kilotest ensemble is <a href="https://github.com/jrpool/testaro?tab=readme-ov-file#dependencies">10 rule engines</a> developed by teams in the USA, Canada, Denmark, Australia, and Portugal.</p>
71
+ <p>The rule engines perform about 1,300 tests in total. <a href="https://medium.com/cvs-health-tech-blog/how-to-run-a-thousand-accessibility-tests-63692ad120c3">Open-source software</a> developed initially at CVS Health drives the rule engines, tries to weed out invalid results, and distills the findings into about 350 <q>issues</q>.</p>
72
72
  <h2>Status</h2>
73
73
  <p>Kilotest is a proof of concept, launched in October 2025 and being actively developed. Your suggestions, bug reports, and collaboration proposals are welcome! Post them on <a href="https://github.com/jrpool/kilotest/issues">GitHub</a> or email them to <a href="mailto:info@kilotest.com">info@kilotest.com</a>.</p>
74
74
  <p>In the current version, users can get results of already performed tests and can recommend new tests. Kilotest managers can approve recommendations.</p>
75
75
  <h2 id="a11y">Accessibility</h2>
76
- <p>Kilotest intends to be accessible. At a minimum, Kilotest intends to conform to the rules of the tools in the Kilotest ensemble. Those rules enforce various standards and best practices, including many success criteria of conformance levels AA and AAA of the <a href="https://www.w3.org/WAI/standards-guidelines/wcag/">Web Content Accessibility Guidelines (WCAG)</a>.</p>
76
+ <p>Kilotest intends to be accessible. At a minimum, Kilotest intends to conform to the rules of the rule engines in the Kilotest ensemble. Those rules enforce various standards and best practices, including many success criteria of conformance levels AA and AAA of the <a href="https://www.w3.org/WAI/standards-guidelines/wcag/">Web Content Accessibility Guidelines (WCAG)</a>.</p>
77
77
  <p>Humans and automated agents encountering any accessibility barriers during the use of Kilotest are encouraged to report them on <a href="https://github.com/jrpool/kilotest/issues">GitHub</a> or by email to <a href="mailto:info@kilotest.com">info@kilotest.com</a>.</p>
78
78
  </details>
79
79
  <h2>What do you want to know?</h2>