browsergym-workarena 0.3.0__py3-none-any.whl → 0.3.2__py3-none-any.whl

@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.3
2
2
  Name: browsergym-workarena
3
- Version: 0.3.0
3
+ Version: 0.3.2
4
4
  Summary: WorkArena benchmark for BrowserGym
5
5
  Project-URL: homepage, https://github.com/ServiceNow/WorkArena
6
6
  Author: Léo Boisvert, Alex Drouin, Maxime Gasse, Alex Lacoste, Manuel Del Verme, Megh Thakkar
@@ -22,9 +22,14 @@ Requires-Dist: tenacity>=8.2.3
22
22
  Requires-Dist: tqdm>=4.66.2
23
23
  Description-Content-Type: text/markdown
24
24
 
25
- # WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks?
25
+ # WorkArena: A Benchmark for Evaluating Agents on Knowledge Work Tasks
26
+ [[Benchmark Contents]](#benchmark-contents) ♦ [[Getting Started]](#getting-started) ♦ [[Live Demo]](#live-demo) ♦ [[BrowserGym]](https://github.com/ServiceNow/BrowserGym) ♦ [[Citing This Work]](#citing-this-work)
26
27
 
27
- [[Paper]](https://arxiv.org/abs/2403.07718) ♦ [[Benchmark Contents]](#benchmark-contents) ♦ [[Getting Started]](#getting-started) ♦ [[Live Demo]](#live-demo) ♦ [[BrowserGym]](https://github.com/ServiceNow/BrowserGym) ♦ [[Citing This Work]](#citing-this-work)
28
+ ### Papers
29
+ * [ICML 2024] WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks? [[Paper]](https://arxiv.org/abs/2403.07718)
30
+
31
+ * WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks [[Paper]](https://arxiv.org/abs/2407.05291)
32
+
28
33
 
29
34
  `WorkArena` is a suite of browser-based tasks tailored to gauge web agents' effectiveness in supporting routine tasks for knowledge workers.
30
35
  By harnessing the ubiquitous [ServiceNow](https://www.servicenow.com/what-is-servicenow.html) platform, this benchmark will be instrumental in assessing the widespread state of such automations in modern knowledge work environments.
@@ -34,9 +39,47 @@ WorkArena is included in [BrowserGym](https://github.com/ServiceNow/BrowserGym),
34
39
 
35
40
  https://github.com/ServiceNow/WorkArena/assets/2374980/68640f09-7d6f-4eb1-b556-c294a6afef70
36
41
 
42
+ ## Getting Started
43
+
44
+ To setup WorkArena, you will need to get your own ServiceNow instance, install our Python package, and upload some data to your instance. Follow the steps below to achieve this.
45
+
46
+ ### a) Create a ServiceNow Developer Instance
47
+
48
+ 1. Go to https://developer.servicenow.com/ and create an account.
49
+ 2. Click on `Request an instance` and select the `Washington` release (initializing the instance will take a few minutes)
50
+ 3. Once the instance is ready, you should see your instance URL and credentials. If not, click _Return to the Developer Portal_, then navigate to _Manage instance password_ and click _Reset instance password_.
51
+ 4. You should now see your URL and credentials. Based on this information, set the following environment variables:
52
+ * `SNOW_INSTANCE_URL`: The URL of your ServiceNow developer instance
53
+ * `SNOW_INSTANCE_UNAME`: The username, should be "admin"
54
+ * `SNOW_INSTANCE_PWD`: The password, make sure you place the value in quotes "" and be mindful of [escaping special shell characters](https://onlinelinuxtools.com/escape-shell-characters). Running `echo $SNOW_INSTANCE_PWD` should print the correct password.
55
+ 6. Log into your instance via a browser using the admin credentials. Close any popup that appears on the main screen (e.g., agreeing to analytics).
56
+
57
+ **Warning:** Feel free to look around the platform, but please make sure you revert any changes (e.g., changes to list views, pinning some menus, etc.) as these changes will be persistent and affect the benchmarking process.
58
+
59
+ ### b) Install WorkArena and Initialize your Instance
60
+
61
+ Run the following command to install WorkArena in the [BrowserGym](https://github.com/servicenow/browsergym) environment:
62
+ ```
63
+ pip install browsergym
64
+ ```
65
+
66
+ Then, install [Playwright](https://github.com/microsoft/playwright):
67
+ ```
68
+ playwright install
69
+ ```
70
+
71
+ Finally, run this command in a terminal to upload the benchmark data to your ServiceNow instance:
72
+ ```
73
+ workarena-install
74
+ ```
75
+ Your installation is now complete! 🎉
76
+
77
+
37
78
  ## Benchmark Contents
38
79
 
39
- At the moment, WorkArena includes `19,912` unique instances drawn from `33` tasks that cover the main components of the ServiceNow user interface. The following videos show an agent built on `GPT-4-vision` interacting with every such component. As emphasized by our results, this benchmark is not solved and thus, the performance of the agent is not always on point.
80
+ At the moment, WorkArena-L1 includes `19,912` unique instances drawn from `33` tasks that cover the main components of the ServiceNow user interface, otherwise referred to as "atomic" tasks. WorkArena++ contains 682 tasks, each one sampling among thousands of potential configurations. WorkArena++ uses the atomic components presented in WorkArena, and composes them into real-world use cases evaluating planning, reasoning, and memorizing abilities of agents.
81
+
82
+ The following videos show an agent built on `GPT-4-vision` interacting with every atomic component of the benchmark. As emphasized by our results, this benchmark is not solved and thus, the performance of the agent is not always on point.
40
83
 
41
84
  ### Knowledge Bases
42
85
 
@@ -80,7 +123,6 @@ https://github.com/ServiceNow/WorkArena/assets/1726818/ca26dfaf-2358-4418-855f-8
80
123
 
81
124
  https://github.com/ServiceNow/WorkArena/assets/1726818/0023232c-081f-4be4-99bd-f60c766e6c3f
82
125
 
83
-
84
126
  ## Getting Started
85
127
 
86
128
  To setup WorkArena, you will need to get your own ServiceNow instance, install our Python package, and upload some data to your instance. Follow the steps below to achieve this.
@@ -93,7 +135,7 @@ To setup WorkArena, you will need to get your own ServiceNow instance, install o
93
135
  4. You should now see your URL and credentials. Based on this information, set the following environment variables:
94
136
  * `SNOW_INSTANCE_URL`: The URL of your ServiceNow developer instance
95
137
  * `SNOW_INSTANCE_UNAME`: The username, should be "admin"
96
- * `SNOW_INSTANCE_PWD`: The password, make sure you place the value in quotes "" and be mindful of [escaping special shell characters](https://onlinelinuxtools.com/escape-shell-characters). Running `echo $SNOW_INSTANCE_PWD` should print the correct password.
138
+ * `SNOW_INSTANCE_PWD`: The password, make sure you place the value in single quotes '' and be mindful of [escaping special shell characters](https://onlinelinuxtools.com/escape-shell-characters). Running `echo $SNOW_INSTANCE_PWD` should print the correct password.
97
139
  6. Log into your instance via a browser using the admin credentials. Close any popup that appears on the main screen (e.g., agreeing to analytics).
98
140
 
99
141
  **Warning:** Feel free to look around the platform, but please make sure you revert any changes (e.g., changes to list views, pinning some menus, etc.) as these changes will be persistent and affect the benchmarking process.
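Reviewer note: the quoting change in this hunk only matters if the three variables actually reach the Python process. A minimal sketch for checking that before running `workarena-install`, assuming nothing beyond the variable names listed in the README (the helper `missing_snow_vars` is hypothetical and not part of the package):

```python
import os

# Variable names are taken from the README text in this hunk.
REQUIRED_VARS = ("SNOW_INSTANCE_URL", "SNOW_INSTANCE_UNAME", "SNOW_INSTANCE_PWD")


def missing_snow_vars(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_snow_vars()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All ServiceNow variables are set.")
```

This only confirms that the variables are present; it does not check that the password survived shell expansion intact, which is what the `echo $SNOW_INSTANCE_PWD` step in the README is for.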
@@ -105,25 +147,24 @@ Run the following command to install WorkArena in the [BrowswerGym](https://gith
105
147
  pip install browsergym-workarena
106
148
  ```
107
149
 
108
- Then, run this command in a terminal to upload the benchmark data to your ServiceNow instance:
150
+ Then, install [Playwright](https://github.com/microsoft/playwright):
109
151
  ```
110
- workarena-install
152
+ playwright install
111
153
  ```
112
154
 
113
- Finally, install [Playwright](https://github.com/microsoft/playwright):
155
+ Finally, run this command in a terminal to upload the benchmark data to your ServiceNow instance:
114
156
  ```
115
- playwright install
157
+ workarena-install
116
158
  ```
117
-
118
159
  Your installation is now complete! 🎉
119
160
 
120
-
121
161
  ## Live Demo
122
162
 
123
163
  Run this code to see WorkArena in action.
124
164
 
125
165
  Note: the following example executes WorkArena's oracle (cheat) function to solve each task. To evaluate an agent, calls to `env.step()` must be used instead.
126
166
 
167
+ - To run a demo of WorkArena-L1 (ICML 2024) tasks using BrowserGym, use the following script:
127
168
  ```python
128
169
  import random
129
170
 
@@ -165,9 +206,55 @@ for task in ALL_WORKARENA_TASKS:
165
206
  ```
166
207
 
167
208
 
209
+
210
+ - To run a demo of WorkArena-L2 (WorkArena++) tasks using BrowserGym, use the following script. Change the filter on line 6 to `l3` to sample L3 tasks.
211
+
212
+ ```python
213
+ import random
214
+
215
+ from browsergym.core.env import BrowserEnv
216
+ from browsergym.workarena import get_all_tasks_agents
217
+
218
+ AGENT_L2_SAMPLED_SET = get_all_tasks_agents(filter="l2")
219
+
220
+ AGENT_L2_SAMPLED_TASKS, AGENT_L2_SEEDS = [sampled_set[0] for sampled_set in AGENT_L2_SAMPLED_SET], [
221
+     sampled_set[1] for sampled_set in AGENT_L2_SAMPLED_SET
222
+ ]
223
+ from time import sleep
224
+
225
+ for (task, seed) in zip(AGENT_L2_SAMPLED_TASKS, AGENT_L2_SEEDS):
226
+     print("Task:", task)
227
+
228
+     # Instantiate a new environment
229
+     env = BrowserEnv(task_entrypoint=task,
230
+                      headless=False)
231
+     env.reset()
232
+
233
+     # Cheat functions use Playwright to automatically solve the task
234
+     env.chat.add_message(role="assistant", msg="On it. Please wait...")
235
+
236
+     for i in range(len(env.task)):
237
+         sleep(1)
238
+         env.task.cheat(page=env.page, chat_messages=env.chat.messages, subtask_idx=i)
239
+         sleep(1)
240
+         reward, done, message, info = env.task.validate(page=env.page, chat_messages=env.chat.messages)
241
+
242
+     if reward == 1:
243
+         env.chat.add_message(role="user", msg="Yes, that works. Thanks!")
244
+     else:
245
+         env.chat.add_message(role="user", msg=f"No, that doesn't work. {info.get('message', '')}")
246
+
247
+     sleep(3)
248
+     env.close()
249
+ ```
250
+
251
+ Note: the example above executes WorkArena's oracle (cheat) function to solve each task. To evaluate an agent, calls to `env.step()` must be used instead.
252
+
168
253
  ## Citing This Work
169
254
 
170
255
  Please use the following BibTeX to cite our work:
256
+
257
+ ### WorkArena
171
258
  ```
172
259
  @misc{workarena2024,
173
260
  title={WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?},
@@ -178,3 +265,15 @@ Please use the following BibTeX to cite our work:
178
265
  primaryClass={cs.LG}
179
266
  }
180
267
  ```
268
+ ### WorkArena++
269
+ ```
270
+ @misc{boisvert2024workarenacompositionalplanningreasoningbased,
271
+ title={WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks},
272
+ author={Léo Boisvert and Megh Thakkar and Maxime Gasse and Massimo Caccia and Thibault Le Sellier De Chezelles and Quentin Cappart and Nicolas Chapados and Alexandre Lacoste and Alexandre Drouin},
273
+ year={2024},
274
+ eprint={2407.05291},
275
+ archivePrefix={arXiv},
276
+ primaryClass={cs.AI},
277
+ url={https://arxiv.org/abs/2407.05291},
278
+ }
279
+ ```
@@ -1,4 +1,4 @@
1
- browsergym/workarena/__init__.py,sha256=G39gKQHio1b2YMfdVZ_E-kdJDIUimnoBcAlFyMPn9wU,1062
1
+ browsergym/workarena/__init__.py,sha256=hJlDYA-SqyBTfaQQma5vvJyKlJJbc6FTGdw2GyUd6bs,6289
2
2
  browsergym/workarena/config.py,sha256=tblmOUpqSoL3qlQHK_TFEDSFbC3o2kuRP_GFpoTNsX4,8522
3
3
  browsergym/workarena/install.py,sha256=UaPE1K70xJB-2Gr1P5rJbcolkwMeWyRt04F7_5gpR4E,39341
4
4
  browsergym/workarena/instance.py,sha256=Qw4lzHhgnl8IuiWOelsmzCJce3jXYivYYwtfTPt2H-s,4314
@@ -76,7 +76,7 @@ browsergym/workarena/human_eval/tool.py,sha256=SwPqArNnvEeOPLRgem6kwl8ho345o-1f3
76
76
  browsergym/workarena/tasks/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
77
77
  browsergym/workarena/tasks/base.py,sha256=Ikh_A5I9_9acHFQCcnVMEnlBg3u3QHQD2I_NbGvD6SE,6411
78
78
  browsergym/workarena/tasks/comp_building_block.py,sha256=Lg3KbAWrxzAHe5XbPN6L8bvdu7mfJpmBvI7jXeSDwKE,194
79
- browsergym/workarena/tasks/dashboard.py,sha256=xubemHTVZotc-TynTTUC4qeH6bK4FPVImmS34NEzCTA,34014
79
+ browsergym/workarena/tasks/dashboard.py,sha256=HDGygBVtUM88lWKkUjyd43JvqmGUOPjmGfmRPkTJruE,34199
80
80
  browsergym/workarena/tasks/form.py,sha256=_s07yZ-zcZbi5v6VK6km1BPzUfIFfMEVWFm56QhoznM,64141
81
81
  browsergym/workarena/tasks/knowledge.py,sha256=kANjlC7DpptMbRlUlZGdDjqZeWIwwyJzozV58qEA6KU,13751
82
82
  browsergym/workarena/tasks/list.py,sha256=4Ov7fHD4smr_L_EB9og7j7pWTQ2zKAI8LWRrr-7ryiA,53389
@@ -87,15 +87,15 @@ browsergym/workarena/tasks/service_catalog.py,sha256=y-MxuJ-L3uJDB1RJz4cUpkoCN1F
87
87
  browsergym/workarena/tasks/compositional/__init__.py,sha256=zgbl23owwUZSnFD84rh-QJitaAsNCH0PNSct_H_NrM4,2341
88
88
  browsergym/workarena/tasks/compositional/base.py,sha256=eIZhfpBOvZvrlC2X7PSbY_7JrILuezYe-NRzDTECHik,14578
89
89
  browsergym/workarena/tasks/compositional/dash_do_base.py,sha256=ihxgwVxUfxBJXt49KzOSEH1i_8uymm1oMLGPrsD4zfI,58252
90
- browsergym/workarena/tasks/compositional/dash_do_catalog.py,sha256=P0ZKPv_LibTIXXaCL4oX6l39mWe4Ry7VUkqwHU5T8lw,35114
91
- browsergym/workarena/tasks/compositional/dash_do_catalog_infeasible.py,sha256=3k5xXiSqsJv0aC-oMlgPSJIkL_XC_X1Hb1xpfDq0kRI,63706
92
- browsergym/workarena/tasks/compositional/dash_do_create_incident.py,sha256=XmhES9vu9IL6977AuLJ1p9EuI1ogIJm5HNJ26m_gNbs,17683
93
- browsergym/workarena/tasks/compositional/dash_do_create_incident_infeasible.py,sha256=88eCqZInWm6M72zZuxC6ROJ7OojmGpphnqDKAlRAEX8,11098
94
- browsergym/workarena/tasks/compositional/dash_do_create_problem.py,sha256=yjhz6axOMP1qTYWK22t6SDGGf30rAHfM-mDNVtaZOSQ,14207
95
- browsergym/workarena/tasks/compositional/dash_do_create_problem_infeasible.py,sha256=YMT0Q-WAg_Lqui_3NIXkiXo3OmqilW33hkazojjCTw4,9187
96
- browsergym/workarena/tasks/compositional/dash_do_filter.py,sha256=sGmVWlifTJw7InBVgXbRDzggFJtPfeQJpKqLHzqShwM,62097
97
- browsergym/workarena/tasks/compositional/dash_do_request_item.py,sha256=CUGwjBebGHU6Ky_7-Xrb35U5aSv4_c6KXLNQNzZ8V8w,45321
98
- browsergym/workarena/tasks/compositional/dash_do_request_item_infeasible.py,sha256=3SV4qpntBbHIXZUQAIEeELs9hy2oedYYM1sD0cLw9OY,23299
90
+ browsergym/workarena/tasks/compositional/dash_do_catalog.py,sha256=cB3LdFnb-msU-i-L4Cx05nILoUUGExBbsribxMQWhBI,35349
91
+ browsergym/workarena/tasks/compositional/dash_do_catalog_infeasible.py,sha256=qHXsdtfsI3rCayX4jgaC-jZyWtFgKktX-nEQ-nvyFZ0,64121
92
+ browsergym/workarena/tasks/compositional/dash_do_create_incident.py,sha256=sqek6DspIe-cy0zlW7mE4hyKhY3_3AW3lJh44iQvw7E,17713
93
+ browsergym/workarena/tasks/compositional/dash_do_create_incident_infeasible.py,sha256=ccCcAMMlFvHqHvOd6ENNPnptDyK2itnmMdtb4oB9Y2U,11113
94
+ browsergym/workarena/tasks/compositional/dash_do_create_problem.py,sha256=cLWuwxBFXMOxKkmqQ8qp9PBQLfVWydqd1usg1-0wjMs,14237
95
+ browsergym/workarena/tasks/compositional/dash_do_create_problem_infeasible.py,sha256=XUmVt6f8zcJmS6NZ_D3ALfKlf021stJ6tja0LGkRYQY,9202
96
+ browsergym/workarena/tasks/compositional/dash_do_filter.py,sha256=SEUNDUBPVQo2ruki6x2ADhi3zUJoE7zXyxhzErkIdWI,62267
97
+ browsergym/workarena/tasks/compositional/dash_do_request_item.py,sha256=h3eH_T_m-lxl0f7stK1LPjeMlqUGeVrYb1yBqBuCh-U,45531
98
+ browsergym/workarena/tasks/compositional/dash_do_request_item_infeasible.py,sha256=rzTHohm9x1OG8uC4A1wZc98sAvAd9vt7yu18BbIfLTI,23404
99
99
  browsergym/workarena/tasks/compositional/delete_record.py,sha256=fXs_pdppZ60LA5ck6LgN3fJ3w9Z_MkmZMI6-AEI73Og,13164
100
100
  browsergym/workarena/tasks/compositional/edit_knowledge_base.py,sha256=bKlcpfBNCsRyt-ktBQcvjn8Qpcx9Y8kaSPf1FeqxsPE,17262
101
101
  browsergym/workarena/tasks/compositional/expense_management.py,sha256=99bpD0sXXDPGiN3EmNXUdijPjs-zcX-dHrwv5jCGSwA,21235
@@ -131,8 +131,8 @@ browsergym/workarena/tasks/utils/js_utils.js,sha256=n97fmY2Jkr59rEcQSuSbCnn1L2ZN
131
131
  browsergym/workarena/tasks/utils/private_tasks.py,sha256=r7Z9SnBMuZdZ2i-tK6eULj0q8hclANXFSzdLl49KYHI,2128
132
132
  browsergym/workarena/tasks/utils/string.py,sha256=ir5_ASD9QSFMZ9kuHo2snSXRuSfv_wROH6nxBLOTP4I,330
133
133
  browsergym/workarena/tasks/utils/utils.py,sha256=xQD-njEwgN7qxfn1dLBN8MYfd3kl3TuVfpmI1yxML9k,955
134
- browsergym_workarena-0.3.0.dist-info/METADATA,sha256=nRKiUhU9MfRGozpEPZSCB-dyWFosRkLOLTpclCCG0ws,7707
135
- browsergym_workarena-0.3.0.dist-info/WHEEL,sha256=zEMcRr9Kr03x1ozGwg5v9NQBKn3kndp6LSoSlVg-jhU,87
136
- browsergym_workarena-0.3.0.dist-info/entry_points.txt,sha256=1lCeAbQFCcU6UTFwS5QIA3TKhT2P9ZabaZKT7sIShKc,137
137
- browsergym_workarena-0.3.0.dist-info/licenses/LICENSE,sha256=sZLFiZHo_1hcxXRhXUDnQYVATUuWwRCdQjBxqxNnNEs,579
138
- browsergym_workarena-0.3.0.dist-info/RECORD,,
134
+ browsergym_workarena-0.3.2.dist-info/METADATA,sha256=EMvS0RoUmBnanW481LS7Ssdk2AvxNJPdsSvONSDcAoo,12498
135
+ browsergym_workarena-0.3.2.dist-info/WHEEL,sha256=1yFddiXMmvYK7QYTqtRNtX66WJ0Mz8PYEiEUoOUUxRY,87
136
+ browsergym_workarena-0.3.2.dist-info/entry_points.txt,sha256=1lCeAbQFCcU6UTFwS5QIA3TKhT2P9ZabaZKT7sIShKc,137
137
+ browsergym_workarena-0.3.2.dist-info/licenses/LICENSE,sha256=sZLFiZHo_1hcxXRhXUDnQYVATUuWwRCdQjBxqxNnNEs,579
138
+ browsergym_workarena-0.3.2.dist-info/RECORD,,
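Reviewer note: each RECORD line above has the form `path,sha256=<digest>,<size>`, where the digest is the URL-safe base64 encoding of the file's SHA-256 hash with the trailing `=` padding removed, and the size is in bytes. A minimal sketch of how such an entry can be recomputed to confirm that a file really changed between versions (the function name `record_entry` is hypothetical):

```python
import base64
import hashlib


def record_entry(path: str, data: bytes) -> str:
    """Build a wheel RECORD-style line for a file's path and contents."""
    digest = hashlib.sha256(data).digest()
    # RECORD uses URL-safe base64 with the '=' padding stripped.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


# The empty tasks/__init__.py listed in this RECORD has size 0:
print(record_entry("browsergym/workarena/tasks/__init__.py", b""))
# prints: browsergym/workarena/tasks/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
```

The printed line matches the unchanged `tasks/__init__.py` entry in the RECORD, which is a zero-byte file in both versions.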
@@ -1,4 +1,4 @@
1
1
  Wheel-Version: 1.0
2
- Generator: hatchling 1.24.2
2
+ Generator: hatchling 1.25.0
3
3
  Root-Is-Purelib: true
4
4
  Tag: py3-none-any