@opentermsarchive/engine 0.17.0 → 0.17.1

package/README.md CHANGED
@@ -1,245 +1,289 @@
1
- # Open Terms Archive
2
-
3
- **Services** have **terms** that can change over time. _Open Terms Archive_ enables users rights advocates, regulatory bodies and any interested citizen to follow the **changes** to these **terms** by being **notified** whenever a new **version** is published, and exploring their entire **history**.
4
-
5
- > Les services ont des conditions générales qui évoluent dans le temps. _Open Terms Archive_ permet aux défenseurs des droits des utilisateurs, aux régulateurs et à toute personne intéressée de suivre les évolutions de ces conditions générales en étant notifiée à chaque publication d'une nouvelle version, et en explorant leur historique.
6
-
7
- [🇫🇷 Manuel en français](README.fr.md).
8
-
9
- ## Table of Contents
10
-
11
- - [How it works](#how-it-works)
12
- - [Exploring the versions history](#exploring-the-versions-history)
13
- - [Be notified](#be-notified)
14
- - [By email](#by-email)
15
- - [By RSS](#by-rss)
16
- - [Importing as a module](#importing-as-a-module)
17
- - [CLI](#cli)
18
- - [Features exposed](#features-exposed)
19
- - [fetch](#fetch)
20
- - [filter](#filter)
21
- - [Using locally](#using-locally)
22
- - [Installing](#installing)
23
- - [Declarations repository](#declarations-repository)
24
- - [Core](#core)
25
- - [Configuring](#configuring)
26
- - [Configuration file](#configuration-file)
27
- - [Storage repositories](#storage-repositories)
28
- - [Environment variables](#environment-variables)
29
- - [Running](#running)
1
+ _The document you are reading now is targeted at developers wanting to use or contribute to the engine of [Open Terms Archive](https://opentermsarchive.org). For a high-level overview of Open Terms Archive’s wider goals and processes, please read its [public homepage](https://opentermsarchive.org)._
2
+
3
+ # Open Terms Archive Engine
4
+
5
+ This codebase is a Node.js module enabling downloading, archiving and publishing versions of documents obtained online. It can be used independently from the Open Terms Archive ecosystem.
6
+
7
+ ## Table of contents
8
+
9
+ - [Motivation](#motivation)
10
+ - [Main concepts](#main-concepts)
11
+ - [How to add documents to a collection](#how-to-add-documents-to-a-collection)
12
+ - [How to use the engine](#how-to-use-the-engine)
13
+ - [Configuring](#configuring)
30
14
  - [Deploying](#deploying)
31
- - [Publishing](#publishing)
32
15
  - [Contributing](#contributing)
33
- - [Adding or updating a service](#adding-a-new-service-or-updating-an-existing-service)
34
- - [Core engine](#core-engine)
35
- - [Funding and partnerships](#funding-and-partnerships)
36
16
  - [License](#license)
37
17
 
38
- ## How it works
18
+ ## Motivation
19
+
20
+ _Words in bold are [business domain names](https://en.wikipedia.org/wiki/Domain-driven_design)._
21
+
22
+ **Services** have **terms** written in **documents**, contractual (Terms of Services, Privacy Policy…) or not (Community Guidelines, Deceased User Policy…), that can change over time. Open Terms Archive enables users rights advocates, regulatory bodies and interested citizens to follow the **changes** to these **terms**, to be notified whenever a new **version** is published, to explore their entire **history** and to collaborate in analysing them. This free and open-source engine is developed to support these goals.
23
+
24
+ ## Main concepts
39
25
 
40
- _Note: Words in bold are [business domain names](https://en.wikipedia.org/wiki/Domain-driven_design)._
26
+ ### Instances
41
27
 
42
- **Services** are **declared** within _Open Terms Archive_ with a **declaration file** listing all the **documents** that, together, constitute the **terms** under which this **service** can be used. These **documents** all have a **type**, such as “terms and conditions”, “privacy policy”, “developer agreement”…
28
+ Open Terms Archive is a decentralised system.
43
29
 
44
- In order to **track** their **changes**, **documents** are periodically obtained by **fetching** a web **location** and **selecting content** within the **web page** to remove the **noise** (ads, navigation menu, login fields…). Beyond selecting a subset of a page, some **documents** have additional **noise** (hashes in links, CSRF tokens…) that would be false positives for **changes**. _Open Terms Archive_ thus supports specific **filters** for each **document**.
30
+ It aims at enabling any entity to **track** **terms** on its own and at federating a number of public **instances** in a single ecosystem to maximise discoverability, collaboration and political power. To that end, the Open Terms Archive **engine** can be run on any server, thus making it a dedicated **instance**.
45
31
 
46
- However, the shape of that **noise** can change over time. In order to recover in case of information loss during the **noise filtering** step, a **snapshot** is **recorded** every time there is a **change**. After the **noise** is **filtered out** from the **snapshot**, if there are **changes** in the resulting **document**, a new **version** of the **document** is **recorded**.
32
+ > Federated public instances can be [found on GitHub](
33
+ https://github.com/OpenTermsArchive?q=declarations).
47
34
 
48
- Anyone can run their own **private** instance and track changes on their own. However, we also **publish** each **version** on a [**public** instance](https://github.com/OpenTermsArchive/contrib-versions) that makes it easy to explore the entire **history** and enables **notifying** over email whenever a new **version** is **recorded**.
49
- Users can [**subscribe** to **notifications**](#be-notified).
35
+ ### Collections
50
36
 
51
- _Note: For now, when multiple versions coexist, **terms** are only **tracked** in their English version and for the European jurisdiction._
37
+ An **instance** **tracks** **documents** of a single **collection**.
52
38
 
53
- ## Exploring the versions history
39
+ A **collection** is characterised by a **scope** across **dimensions** that describe the **terms** it **tracks**, such as **language**, **jurisdiction** and **industry**.
54
40
 
55
- We offer a public database of versions recorded each time there is a change in the terms of service and other contractual documents of tracked services: [contrib-versions](https://github.com/OpenTermsArchive/contrib-versions).
41
+ > Federated public collections can be [found on GitHub](https://github.com/OpenTermsArchive?q=versions).
56
42
 
57
- From the **repository homepage** [contrib-versions](https://github.com/OpenTermsArchive/contrib-versions), open the folder of the **service of your choice** (e.g. [WhatsApp](https://github.com/OpenTermsArchive/contrib-versions/tree/main/WhatsApp)).
43
+ #### Example scope
58
44
 
59
- You will see the **set of documents tracked** for that service, now click **on the document of your choice** (e.g. [WhatsApp's Privacy Policy](https://github.com/OpenTermsArchive/contrib-versions/blob/main/WhatsApp/Privacy%20Policy.md)). The **latest version** (updated hourly) will be displayed.
45
+ > The documents declared in this collection are:
46
+ > - Related to dating services used in Europe.
47
+ > - In the European Union and Switzerland jurisdictions.
48
+ > - In English, unless no English version exists, in which case the primary official language of the jurisdiction of incorporation of the service operator will be used.
60
49
 
61
- To view the **history of changes** made to this document, click on **History** at the top right of the document (for our previous [example](https://github.com/OpenTermsArchive/contrib-versions/commits/main/WhatsApp/Privacy%20Policy.md)). The **changes** are ordered **by date**, with the latest first.
50
+ ### Terms types
62
51
 
63
- Click on a change to see what it consists of (for example [this one](https://github.com/OpenTermsArchive/contrib-versions/commit/58a1d2ae4187a3260ac58f3f3c7dcd3aeacaebcd)). There are **two types of display** you can choose from the icons in the gray bar above the document.
52
+ To distinguish between the different **terms** of a **service**, each has a **type**, such as “Terms of Service”, “Privacy Policy”, “Developer Agreement”…
64
53
 
65
- - The first one, named _source diff_ (button with chevrons) allows you to **display the old version and the new one side by side** (for our [example](https://github.com/OpenTermsArchive/contrib-versions/commit/58a1d2ae4187a3260ac58f3f3c7dcd3aeacaebcd#diff-e8bdae8692561f60aeac9d27a55e84fc)). This display has the merit of **explicitly showing** all additions and deletions.
66
- - The second one, named _rich diff_ (button with a document icon) allows you to **unify all the changes in a single document** (for our [example](https://github.com/OpenTermsArchive/contrib-versions/commit/58a1d2ae4187a3260ac58f3f3c7dcd3aeacaebcd?short_path=e8bdae8#diff-e8bdae8692561f60aeac9d27a55e84fc)). The **red** color shows **deleted** elements, the **yellow** color shows **modified** paragraphs, and the **green** color shows **added** elements. Be careful, this display **does not show some changes** such as hyperlinks and text style's changes.
54
+ This **type** matches the topic, but not necessarily the title the **service** gives to it. Unifying the **types** enables comparing **terms** across **services**.
67
55
 
68
- ### Notes
56
+ > More information on terms types can be found in the [dedicated repository](https://github.com/OpenTermsArchive/terms-types). They are published on NPM under [`@opentermsarchive/terms-types`](https://www.npmjs.com/package/@opentermsarchive/terms-types), enabling standardisation and interoperability beyond the Open Terms Archive engine.
69
57
 
70
- - For long documents, unchanged **paragraphs will not be displayed by default**. You can manually make them appear by clicking on the small arrows just above or just below the displayed paragraphs.
71
- - You can use the **History button anywhere** in the repository contrib-versions, which will then display the **history of changes made to all documents in the folder** where you are (including sub-folders).
58
+ ### Declarations
72
59
 
73
- ## Be notified
60
+ The **documents** that constitute a **collection** are defined in simple JSON files called **declarations**.
74
61
 
75
- ### By email
62
+ A **declaration** also contains some metadata on the **service** the **documents** relate to.
76
63
 
77
- #### Document per document
64
+ > Here is an example declaration tracking the Privacy Policy of Open Terms Archive:
65
+ >
66
+ > ```json
67
+ > {
68
+ > "name": "Open Terms Archive",
69
+ > "documents": {
70
+ > "Privacy Policy": {
71
+ > "fetch": "https://opentermsarchive.org/en/privacy-policy",
72
+ > "select": ".TextContent_textContent__ToW2S"
73
+ > }
74
+ > }
75
+ > }
76
+ > ```
78
77
 
79
- You can go on the official front website [opentermsarchive.org](https://opentermsarchive.org). From there, you can select a service and then the corresponding document type.
80
- After you enter your email and click on subscribe, we will add your email to the corresponding mailing list in [SendInBlue](https://www.sendinblue.com/) and will not store your email anywhere else.
81
- Then, every time a modification is found in the corresponding document, we will send you an email.
78
+ ## How to add documents to a collection
82
79
 
83
- You can unsubscribe at any moment by clicking on the `unsubscribe` link at the bottom of the received email.
80
+ Open Terms Archive **acquires** **documents** to deliver an explorable **history** of **changes**. This can be done in two ways:
84
81
 
85
- #### For all documents at once
82
+ 1. For the present and future, by **tracking** **documents**.
83
+ 2. For the past, by **importing** from an existing **fonds** such as [ToSBack](https://tosback.org), the [Internet Archive](https://archive.org/web/), [Common Crawl](https://commoncrawl.org) or any other in-house format.
86
84
 
87
- You can [subscribe](https://59692a77.sibforms.com/serve/MUIEAKuTv3y67e27PkjAiw7UkHCn0qVrcD188cQb-ofHVBGpvdUWQ6EraZ5AIb6vJqz3L8LDvYhEzPb2SE6eGWP35zXrpwEFVJCpGuER9DKPBUrifKScpF_ENMqwE_OiOZ3FdCV2ra-TXQNxB2sTEL13Zj8HU7U0vbbeF7TnbFiW8gGbcOa5liqmMvw_rghnEB2htMQRCk6A3eyj) to receive an email whenever a document is updated in the database.
85
+ ### Tracking documents
88
86
 
89
- **Beware, you are likely to receive a large amount of notifications!** You can unsubscribe by replying to any email you will receive.
87
+ The **engine** **reads** **declarations** to **record** a **snapshot** by **fetching** the declared web **location** periodically. The **engine** then **extracts** a **version** from this **snapshot** by:
90
88
 
91
- ### By RSS
89
+ 1. **Selecting** the subset of the **snapshot** that contains the **terms** (instead of navigation menus, footers, cookies banners…).
90
+ 2. **Removing** residual content in this subset that is not part of the **terms** (ads, illustrative pictures, internal navigation links…).
91
+ 3. **Filtering noise** by preventing parts that change frequently from triggering false positives for **changes** (tracker identifiers in links, relative dates…). The **engine** can execute custom **filters** written in JavaScript to that end.
92
92
 
93
- You can receive notification for a specific service or document by subscribing to RSS feeds.
93
+ After these steps, if **changes** are spotted in the resulting **document**, a new **version** is **recorded**.
94
94
 
95
- > An RSS feed is a type of web page that contains information about the latest content published by a website, such as the date of publication and the address where you can view it. When this resource is updated, a feed reader app automatically notifies you and you can see the update.
95
+ Preserving **snapshots** enables recovering after the fact information potentially lost in the **extraction** step: if **declarations** were wrong, they can be **maintained** and corrected **versions** can be **extracted** from the original **snapshots**.
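
The noise-filtering and change-detection steps described above can be sketched as follows. This is a minimal illustration only: `removeTrackingParams`, `filterNoise` and `shouldRecordVersion` are hypothetical names, the `utm_` prefix is an assumed example of a tracker identifier, and none of this is the engine's actual filter API.

```js
// Hypothetical sketch, not the engine's implementation.
// Drops tracking parameters (assumed here to start with "utm_") from a link,
// so that two fetches differing only by tracker identifiers compare equal.
function removeTrackingParams(href) {
  const url = new URL(href);
  // Copy the keys first: deleting while iterating over searchParams is unsafe.
  for (const key of [...url.searchParams.keys()]) {
    if (key.startsWith('utm_')) {
      url.searchParams.delete(key);
    }
  }
  return url.toString();
}

// Applies the link clean-up to every inline Markdown link target.
function filterNoise(markdown) {
  return markdown.replace(
    /\((https?:\/\/[^)\s]+)\)/g,
    (match, href) => `(${removeTrackingParams(href)})`,
  );
}

// A new version is only worth recording when the filtered content differs.
function shouldRecordVersion(previousVersion, extractedVersion) {
  return previousVersion !== extractedVersion;
}
```

With such a filter, two snapshots whose links differ only by their tracker identifier yield the same extracted content, so no spurious version would be recorded.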
96
96
 
97
- To find out the address of the RSS feed you want to subscribe to:
97
+ ### Importing documents
98
98
 
99
- 1. [Navigate](#exploring-the-versions-history) to the page with the history of changes you are interested in. _In the WhatsApp example above, this would be [this page](https://github.com/OpenTermsArchive/contrib-versions/commits/main/WhatsApp/Privacy%20Policy.md)._
100
- 2. Copy the address of that page from your browser’s address bar. _In the WhatsApp example, this would be `https://github.com/OpenTermsArchive/contrib-versions/commits/main/WhatsApp/Privacy%20Policy.md`._
101
- 3. Append `.atom` at the end of this address. _In the WhatsApp example, this would become `https://github.com/OpenTermsArchive/contrib-versions/commits/main/WhatsApp/Privacy%20Policy.md.atom`._
102
- 4. Subscribe your RSS feed reader to the resulting address.
99
+ Existing **fonds** can be prepared for easier analysis by unifying their format to the **Open Terms Archive dataset format**. This unique format enables building interoperable tools, fostering collaboration across reusers.
100
+ Such a dataset can be generated from **versions** alone. If **snapshots** and **declarations** can be retrieved from the **fonds** too, then a full-fledged **collection** can be created.
103
101
 
104
- #### Recap of available RSS feeds
102
+ ## How to use the engine
105
103
 
106
- | Updated for | URL |
107
- | ----------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
108
- | all services and documents | `https://github.com/OpenTermsArchive/contrib-versions/commits.atom` |
109
- | all the documents of a service | Replace `$serviceId` with the service ID:<br>`https://github.com/OpenTermsArchive/contrib-versions/commits/main/$serviceId.atom` |
110
- | a specific document of a service | Replace `$serviceId` with the service ID and `$documentType` with the document type:<br>`https://github.com/OpenTermsArchive/contrib-versions/commits/main/$serviceId/$documentType.md.atom` |
104
+ This documentation describes how to execute the **engine** independently from any specific **instance**. For other use cases, other parts of the documentation could be more relevant:
111
105
 
112
- For example:
106
+ - to contribute **declarations** to an existing **instance**, see [how to contribute documents](./docs/doc-contributing-documents.md);
107
+ - to create a new **collection**, see the [collection bootstrap](https://github.com/OpenTermsArchive/template-declarations) script;
108
+ - to create a new public **instance**, see the [governance](./docs/doc-governance.md) documentation.
113
109
 
114
- - To receive all updates of `Facebook` documents, the URL is `https://github.com/OpenTermsArchive/contrib-versions/commits/main/Facebook.atom`.
115
- - To receive all updates of the `Privacy Policy` from `Google`, the URL is `https://github.com/OpenTermsArchive/contrib-versions/commits/main/Google/Privacy%20Policy.md.atom`.
110
+ ### Requirements
116
111
 
117
- ## Importing as a module
112
+ This module is tested to work across operating systems (continuous testing on UNIX, macOS and Windows).
113
+
114
+ A [Node.js](https://nodejs.org/en/download/) runtime is required to execute this engine.
115
+
116
+ ![Supported Node.js version can be found in the package.json file](https://img.shields.io/node/v/@opentermsarchive/engine?color=informational&label=Supported%20Node.js%20version)
117
+
118
+ ### Getting started
119
+
120
+ This engine is published as a [module on NPM](https://npmjs.com/package/@opentermsarchive/engine). The recommended install is as a dependency in a `package.json` file, next to a folder containing [declaration files](#declarations).
121
+
122
+ ```sh
123
+ npm install --save @opentermsarchive/engine
124
+ mkdir declarations
125
+ ```
118
126
 
119
- Open Terms Archive exposes a JavaScript API to make some of its capabilities available in NodeJS. You can install it as an NPM module:
127
+ In an editor, create the following declaration file in `declarations/Open Terms Archive.json` to track the terms of the Open Terms Archive website:
120
128
 
129
+ ```json
130
+ {
131
+ "name": "Open Terms Archive",
132
+ "documents": {
133
+ "Privacy Policy": {
134
+ "fetch": "https://opentermsarchive.org/en/privacy-policy",
135
+ "select": ".TextContent_textContent__ToW2S"
136
+ }
137
+ }
138
+ }
121
139
  ```
122
- npm install "ambanum/OpenTermsArchive#main"
140
+
141
+ In the terminal:
142
+
143
+ ```sh
144
+ npx ota-track
123
145
  ```
124
146
 
147
+ The tracked documents can be found in the `data` folder.
148
+
149
+ This quick example is meant to let you try the engine. Most likely, you will simply `npm install` from an existing collection, or create a new collection from the [collection template](https://github.com/OpenTermsArchive/template-declarations).
150
+
125
151
  ### CLI
126
152
 
127
- The following commands are available where the package is installed:
153
+ Once the engine module is installed as a dependency within another module, the following commands are available.
128
154
 
129
- - `./node_modules/.bin/ota-lint-declarations`: check and normalise the format of declarations.
130
- - `./node_modules/.bin/ota-validate-declarations`: validate declarations.
131
- - `./node_modules/.bin/ota-track`: track services. Recorded snapshots and versions will be stored in the `data` folder at the root of the module where the package is installed.
155
+ In these commands:
132
156
 
133
- In order to have them available globally in your command line, install it with the `--global` option.
157
+ - **`<service_id>`** is the case sensitive name of the service declaration file without the extension. For example, for `Twitter.json`, the service ID is `Twitter`.
158
+ - **`<terms_type>`** is the property name used under the `documents` property in the declaration to declare a terms. For example, in the getting started declaration, the terms type declared is `Privacy Policy`.
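
The two placeholders above can be derived from a declaration file as sketched below. The helper names `serviceIdFromFileName` and `termsTypesOf` are hypothetical and only illustrate the naming rules; they are not functions exposed by the engine.

```js
// Hypothetical helpers illustrating the naming rules; not part of the engine.

// The service ID is the declaration file name without its `.json` extension.
// It is case sensitive, so the name is returned unchanged otherwise.
function serviceIdFromFileName(fileName) {
  return fileName.replace(/\.json$/, '');
}

// The terms types are the property names found under `documents`.
function termsTypesOf(declaration) {
  return Object.keys(declaration.documents);
}
```

For the getting started declaration stored in `Open Terms Archive.json`, these rules give the service ID `Open Terms Archive` and the single terms type `Privacy Policy`.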
134
159
 
135
- ### Features exposed
160
+ #### `ota-track`
136
161
 
137
- #### fetch
162
+ ```sh
163
+ npx ota-track
164
+ ```
138
165
 
139
- The `fetch` module gets the MIME type and content of a document from its URL.
166
+ [Track](#tracking-documents) the current terms of services according to provided declarations.
140
167
 
141
- You can use it in your code by using `import fetch from 'open-terms-archive/fetch';`.
168
+ The declarations, snapshots and versions paths are defined in the [configuration](#configuring).
142
169
 
143
- Documentation on how to use `fetch` is provided as JSDoc within [./src/archivist/fetcher/index.js](./src/archivist/fetcher/index.js).
170
+ > Note that the snapshots and versions will be recorded at the moment the command is executed, on top of the existing local history. If a shared history already exists and the goal is to add on top of it, that history has to be downloaded before executing that command.
144
171
 
145
- If you plan to use `executeClientScripts` as a parameter of `fetch`, the fetching will be done using a headless browser.
146
- In order to not instantiate this browser at each fetch, the starting and stopping of the browser is your responsibility.
172
+ ##### Recap of available options
147
173
 
148
- Here is an example on how to use it:
174
+ ```sh
175
+ npx ota-track --help
176
+ ```
149
177
 
150
- ```js
151
- import fetch, { launchHeadlessBrowser, stopHeadlessBrowser } from 'open-terms-archive/fetch';
178
+ ##### Track terms of specific services
152
179
 
153
- await launchHeadlessBrowser();
154
- await fetch({ executeClientScripts: true, ... });
155
- await fetch({ executeClientScripts: true, ... });
156
- await fetch({ executeClientScripts: true, ... });
157
- await stopHeadlessBrowser();
180
+ ```sh
181
+ npx ota-track --services "<service_id>" ["<service_id>"...]
158
182
  ```
159
183
 
160
- The `fetch` module can also be configured as a [`node-config` submodule](https://github.com/node-config/node-config/wiki/Sub-Module-Configuration).
161
- If [`node-config`](https://github.com/node-config/node-config) is used in the project, the default `fetcher` configuration can be overridden by adding a `fetcher` object to the local config. See [Configuration file](#configuration-file) for full reference.
184
+ ##### Track specific terms of specific services
162
185
 
163
- #### filter
186
+ ```sh
187
+ npx ota-track --services "<service_id>" ["<service_id>"...] --documentTypes "<terms_type>" ["<terms_type>"...]
188
+ ```
164
189
 
165
- The `filter` module transforms HTML or PDF content into a Markdown string.
166
- It will filter content based on the [document declaration](https://github.com/OpenTermsArchive/contrib-declarations/blob/main/CONTRIBUTING.md#declaring-a-new-service).
190
+ ##### Track documents four times a day
167
191
 
168
- You can use the filter in your code by using `import filter from 'open-terms-archive/filter';`.
192
+ ```sh
193
+ npx ota-track --schedule
194
+ ```
169
195
 
170
- The `filter` function documentation is available as JSDoc within [./src/archivist/filter/index.js](./src/archivist/filter/index.js).
196
+ #### `ota-validate-declarations`
171
197
 
172
- #### page-declaration
198
+ ```sh
199
+ npx ota-validate-declarations [--services <service_id>...]
200
+ ```
173
201
 
174
- PageDeclaration object is used to describe a page to be tracked by Open Terms Archive.
202
+ Check that all declarations allow recording a snapshot and a version properly.
175
203
 
176
- You can use the page-declaration in your code by using `import pageDeclaration from 'open-terms-archive/page-declaration';`.
204
+ If one or several `<service_id>` are provided, check only those services.
177
205
 
178
- ## Using locally
206
+ ##### Validate schema only
179
207
 
180
- ### Installing
208
+ ```sh
209
+ npx ota-validate-declarations --schema-only [--services <service_id>...]
210
+ ```
181
211
 
182
- This module is built with [Node](https://nodejs.org/en/) and is tested on macOS, UNIX and Windows. You will need to [install Node >= v16.x](https://nodejs.org/en/download/) to run it.
212
+ Check that all declarations are readable by the engine.
183
213
 
184
- #### Declarations repository
214
+ Allows for a much faster check of declarations, but does not check that the documents are actually accessible.
185
215
 
186
- 1. Locally clone your declarations repository, e.g., `git@github.com:OpenTermsArchive/contrib-declarations.git`.
187
- 2. Go into your folder and initialize it, e.g., `cd contrib-declarations; npm install`.
188
- 3. You can now modify your declarations in the `./declarations/` folder, following [these instructions](https://github.com/OpenTermsArchive/contrib-declarations/blob/main/CONTRIBUTING.md).
189
- 4. When you want to test:
190
- - If you want to test every declaration, run `npm test`.
191
- - If you want to test a specific declaration, run `npm test $serviceId`, e.g., `npm test HER`.
192
- - If you want to have faster feedback on the structure of a specific declaration, run `npm run test:schema $serviceId`, e.g., `npm run test:schema HER`.
193
- 5. Once you have done that, if you have any error, it will be prompted and detailed at the end of the test.
194
- - E.g., `InaccessibleContentError`: Your selector is wrong and should be fixed.
195
- - E.g., `TypeError`: The file declaration is invalid.
196
- - E.g., if you have a weird error, you may want to contact OTA, it may be a bug.
216
+ If one or several `<service_id>` are provided, check only those services.
197
217
 
198
- ##### Note: Testing
218
+ #### `ota-lint-declarations`
199
219
 
200
- Testing runs multiple checks (e.g., that the file is valid, that the URL is correct and reachable, that the content is correctly gathered, etc.). As this may take a bit of time, you may want to use `npm run test:schema` for faster feedback.
220
+ ```sh
221
+ npx ota-lint-declarations [--services <service_id>...]
222
+ ```
201
223
 
202
- #### Core
224
+ Normalise the format of declarations.
203
225
 
204
- When referring to the base folder, it means the folder where you will be `git pull`ing everything.
226
+ Automatically correct formatting mistakes and ensure that all declarations are standardised.
205
227
 
206
- 1. If not done already, follow the previous part with the repo of your choice.
207
- 2. In the base folder of the previous step (i.e., not _in_ the previous folder, but _where the previous folder is_), clone the core engine: `git clone git@github.com:ambanum/OpenTermsArchive.git`.
208
- 3. Go into the cloned folder and install dependencies: `cd OpenTermsArchive; npm install`.
209
- 4. If you are using the main repo, you are done, go to step 6.
210
- 5. If you are using a special repo instance (e.g., `dating-declarations`), create a new [config file](#configuring), `config/development.json`, and add:
211
- ```json
212
- {
228
+ If one or several `<service_id>` are provided, check only those services.
213
229
 
214
- "services": {
215
- "declarationsPath": "../<name of the repo>/declarations"
216
- }
217
- }
218
- ```
219
- e.g.,
220
- ```json
221
- {
222
- "services": {
223
- "declarationsPath": "../dating-declarations/declarations"
224
- }
225
- }
226
- ```
227
- 6. In the folder of the repo (i.e., `OpenTermsArchive`), use `npm start`.
228
- - It will first do a refiltering to check whether everything works properly.
229
- - You will then start to see everything being downloaded under `data/`.
230
- - More details in [Running](#running).
230
+ ### API
231
231
 
232
- ##### Notes: Tips
232
+ Once added as a dependency, the engine exposes a JavaScript API that can be called in your own code. The following modules are available.
233
233
 
234
- - You may want to regularly `git pull` to have the latest updates, both in the core engine and in the declarations repos.
235
- - You have to `npm install` in the declarations repo at least once, and at least once each time `package.json` changes.
236
- - Be careful, it doesn't download the history! If you want that, you need to git clone `snapshots` and `versions` in `data/`.
234
+ #### `fetch`
237
235
 
238
- You can clone as many declarations repositories as you want. The one that will be loaded at execution will be defined through configuration.
236
+ The `fetch` module gets the MIME type and content of a document from its URL.
239
237
 
240
- ### Configuring
238
+ ```js
239
+ import fetch from '@opentermsarchive/engine/fetch';
240
+ ```
241
+
242
+ Documentation on how to use `fetch` is provided [as JSDoc](./src/archivist/fetcher/index.js).
243
+
244
+ ##### Headless browser management
245
+
246
+ If you pass the `executeClientScripts` option to `fetch`, a headless browser will be used to download and execute the page before serialising its DOM. To avoid instantiating a browser on each fetch, which would hurt performance, starting and stopping the browser is your responsibility. Here is an example of how to use this feature:
247
+
248
+ ```js
249
+ import fetch, { launchHeadlessBrowser, stopHeadlessBrowser } from '@opentermsarchive/engine/fetch';
241
250
 
242
- #### Configuration file
251
+ await launchHeadlessBrowser();
252
+ await fetch({ executeClientScripts: true, ... });
253
+ await fetch({ executeClientScripts: true, ... });
254
+ await fetch({ executeClientScripts: true, ... });
255
+ await stopHeadlessBrowser();
256
+ ```
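
Since stopping the browser is left to the caller, one possible way to make sure it is always stopped, even when a fetch fails, is a `try`/`finally` wrapper. The sketch below is an assumption about how calling code might be organised: `withHeadlessBrowser` is a hypothetical helper, and the `launch` and `stop` functions are passed in as parameters so the pattern can be read independently of the engine.

```js
// Hypothetical wrapper, not provided by the engine.
// `launch` and `stop` stand for functions such as launchHeadlessBrowser and
// stopHeadlessBrowser; `task` is the caller's code performing the fetches.
async function withHeadlessBrowser({ launch, stop }, task) {
  await launch();
  try {
    // All fetches made inside `task` share the single browser instance.
    return await task();
  } finally {
    // Runs whether `task` resolved or threw, so the browser is never leaked.
    await stop();
  }
}
```

The design choice here is to keep a single browser alive for a whole batch of fetches while guaranteeing it is released afterwards.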
257
+
258
+ The `fetch` module options are defined as a [`node-config` submodule](https://github.com/node-config/node-config/wiki/Sub-Module-Configuration). The default `fetcher` configuration can be overridden by adding a `fetcher` object to the [local configuration file](#configuration-file).
259
+
260
+ #### `filter`
261
+
262
+ The `filter` module transforms HTML or PDF content into a Markdown string according to a [declaration](#declarations).
263
+
264
+ ```js
265
+ import filter from '@opentermsarchive/engine/filter';
266
+ ```
267
+
268
+ The `filter` function documentation is available [as JSDoc](./src/archivist/filter/index.js).
269
+
270
+ #### `PageDeclaration`
271
+
272
+ The `PageDeclaration` class encapsulates information about a page tracked by Open Terms Archive.
273
+
274
+ ```js
275
+ import pageDeclaration from '@opentermsarchive/engine/page-declaration';
276
+ ```
277
+
278
+ The `PageDeclaration` format is defined [in source code](./src/archivist/services/pageDeclaration.js).
279
+
280
+ ### Dataset generation
281
+
282
+ See the [`dataset` script documentation](./scripts/dataset/README.md).
283
+
284
+ ## Configuring
285
+
286
+ ### Configuration file
243
287
 
244
288
  The default configuration can be found in `config/default.json`. The full reference is given below. You are unlikely to want to edit all of these elements.
245
289
 
@@ -276,7 +320,7 @@ The default configuration can be found in `config/default.json`. The full refere
276
320
  "host": "SMTP server hostname",
277
321
  "username": "User for server authentication" // Password for server authentication is defined in environment variables, see the “Environment variables” section below
278
322
  },
279
- "sendMailOnError": { // Can be set to `false` if you do not want to send email on error
323
+ "sendMailOnError": { // Can be set to `false` if sending email on error is not needed
280
324
  "to": "The address to send the email to in case of an error",
281
325
  "from": "The address from which to send the email",
282
326
  "sendWarnings": "Boolean. Set to true to also send email in case of warning",
@@ -299,15 +343,15 @@ The default configuration can be found in `config/default.json`. The full refere
299
343
  }
300
344
  ```
301
345
 
302
- The default configuration is merged with (and overridden by) environment-specific configuration that can be specified at startup with the `NODE_ENV` environment variable. For example, you would run `NODE_ENV=development npm start` to load the `development.json` configuration file.
346
+ The default configuration is merged with (and overridden by) environment-specific configuration that can be specified at startup with the `NODE_ENV` environment variable. For example, running `NODE_ENV=vagrant npm start` will load the `vagrant.json` configuration file. See [node-config](https://github.com/node-config/node-config) for more information about configuration files.
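As a rough illustration of that merge, the sketch below layers an environment-specific object over a default one. It is a simplified model and not the actual `node-config` implementation: `deepMerge`, `defaultConfig` and `vagrantConfig` are hypothetical names, and the keys only loosely echo the reference above.

```javascript
// Hypothetical stand-ins for the parsed contents of config/default.json and config/vagrant.json.
const defaultConfig = {
  smtp: { host: 'smtp.example.org' },
  sendMailOnError: { to: 'ops@example.org', sendWarnings: false },
};
const vagrantConfig = {
  sendMailOnError: { sendWarnings: true },
};

function isPlainObject(value) {
  return value !== null && typeof value === 'object' && !Array.isArray(value);
}

// Recursively overlay `override` on `base` without mutating either argument.
// Nested objects are merged key by key; any other value (including `false`) replaces the default.
function deepMerge(base, override) {
  const result = { ...base };

  for (const [ key, value ] of Object.entries(override)) {
    result[key] = isPlainObject(value) && isPlainObject(base[key])
      ? deepMerge(base[key], value)
      : value;
  }

  return result;
}

const merged = deepMerge(defaultConfig, vagrantConfig);
const disabled = deepMerge(defaultConfig, { sendMailOnError: false });
```

In this model, keys absent from the environment-specific object keep their default values, so an override file only needs to list what differs.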
303
347
 
304
- If you want to change your local configuration, we suggest you create a `config/development.json` file with overridden values. Example production configuration files can be found in the `config` folder.
348
+ To have a local configuration that overrides the existing configuration, it is recommended to create a `config/development.json` file with the overridden values.
305
349
 
306
- ##### Storage repositories
350
+ #### Storage repositories
307
351
 
308
352
  Two storage repositories are currently supported: Git and MongoDB. Each one can be used independently for versions and snapshots.
309
353
 
310
- ###### Git
354
+ ##### Git
311
355
 
312
356
  ```json
313
357
  {
@@ -326,11 +370,11 @@ Two storage repositories are currently supported: Git and MongoDB. Each one can
326
370
 
327
371
  }
328
372
  ```
329
- ###### MongoDB
373
+ ##### MongoDB
330
374
 
331
375
  ```json
332
376
  {
333
-
377
+
334
378
  "storage": {
335
379
  "mongo": {
336
380
  "connectionURI": "URI for defining connection to the MongoDB instance. See https://docs.mongodb.com/manual/reference/connection-string/",
@@ -342,7 +386,7 @@ Two storage repositories are currently supported: Git and MongoDB. Each one can
342
386
  }
343
387
  ```
344
388
 
345
- #### Environment variables
389
+ ### Environment variables
346
390
 
347
391
  Environment variables can be passed in the command-line or provided in a `.env` file at the root of the repository. See `.env.example` for an example of such a file.
348
392
 
@@ -350,89 +394,53 @@ Environment variables can be passed in the command-line or provided in a `.env`
350
394
  - `SENDINBLUE_API_KEY`: a SendInBlue API key, in order to send email notifications with that service.
351
395
  - `GITHUB_TOKEN`: a token with repository privileges to access the [GitHub API](https://github.com/settings/tokens).
352
396
 
353
- If your infrastructure requires using an outgoing HTTP/HTTPS proxy to access the Internet, you can provide it through the `HTTP_PROXY` and `HTTPS_PROXY` environment variable.
354
-
355
- ### Running
356
-
357
- To get the latest versions of all documents:
358
-
359
- ```
360
- npm start
361
- ```
362
-
363
- The latest version of a document will be available in the versions path defined in your configuration, under `$versions_folder/$service_provider_name/$document_type.md`.
364
-
365
- To update documents automatically:
366
-
367
- ```
368
- npm run start:scheduler
369
- ```
370
-
371
- To get the latest version of a specific service's terms:
372
-
373
- ```
374
- npm start -- --services <service_id>
375
- ```
376
-
377
- > The service ID is the case sensitive name of the service declaration file without the extension. For example, for `Twitter.json`, the service ID is `Twitter`.
378
-
379
-
380
- To get the latest version of a specific service's terms and document type:
381
-
382
- ```
383
- npm start -- --services <service_id> --documentTypes <document_type>
384
- ```
385
-
386
- To display help:
387
-
388
- ```
389
- npm start -- --help
390
- ```
397
+ If an outgoing HTTP/HTTPS proxy is required to access the Internet, it can be provided through the `HTTP_PROXY` and `HTTPS_PROXY` environment variables.
391
398
 
392
399
  ## Deploying
393
400
 
394
- See [Ops Readme](ops/README.md).
401
+ Deployment is managed with [Ansible](https://www.ansible.com). See the [Open Terms Archive deployment Ansible collection](https://github.com/OpenTermsArchive/ota.deployment-ansible-collection).
395
402
 
396
- ## Publishing
403
+ ## Contributing
397
404
 
398
- To generate a dataset:
405
+ ### Getting a copy
399
406
 
400
- ```
401
- npm run dataset:generate
402
- ```
407
+ In order to edit the code of the engine itself, an editable and executable copy is necessary.
403
408
 
404
- To release a dataset:
409
+ First of all, follow the [requirements](#requirements) above. Then, clone the repository:
405
410
 
406
- ```
407
- npm run dataset:release
411
+ ```sh
412
+ git clone https://github.com/ambanum/OpenTermsArchive.git
413
+ cd OpenTermsArchive
408
414
  ```
409
415
 
410
- To weekly release a dataset:
416
+ Install dependencies:
411
417
 
418
+ ```sh
419
+ npm install
412
420
  ```
413
- npm run dataset:scheduler
414
- ```
415
-
416
- ## Contributing
417
421
 
418
- Thanks for wanting to contribute! There are different ways to contribute to Open Terms Archive. We describe the most common below. If you want to explore other venues for contributing, please contact us over email (contact@[our domain name]) or [Twitter](https://twitter.com/OpenTerms).
422
+ ### Testing
419
423
 
420
- ### Adding a new service or updating an existing service
424
+ If changes are made to the engine, check that all parts covered by tests still work properly:
421
425
 
422
- See the [CONTRIBUTING](https://github.com/OpenTermsArchive/contrib-declarations/blob/main/CONTRIBUTING.md) of repository [`OpenTermsArchive/contrib-declarations`](https://github.com/OpenTermsArchive/contrib-declarations). You will need knowledge of JSON and web DOM.
426
+ ```sh
427
+ npm test
428
+ ```
423
429
 
424
- ### Core engine
430
+ If existing features are changed or new ones are added, relevant tests must be added too.
425
431
 
426
- To contribute to the core engine of Open Terms Archive, see the [CONTRIBUTING](CONTRIBUTING.md) file of this repository. You will need knowledge of JavaScript and NodeJS.
432
+ ### Suggesting changes
427
433
 
428
- ### Funding and partnerships
434
+ To contribute to the core engine of Open Terms Archive, see the [CONTRIBUTING](CONTRIBUTING.md) file of this repository. You will need knowledge of JavaScript and Node.js.
429
435
 
430
- Beyond individual contributions, we need funds and committed partners to pay for a core team to maintain and grow Open Terms Archive. If you know of opportunities, please let us know! You can find [on our website](https://opentermsarchive.org/en/about) an up-to-date list of the partners and funders that make Open Terms Archive possible.
436
+ ### Sponsorship and partnerships
431
437
 
438
+ Beyond individual contributions, we need funds and committed partners to pay for a core team to maintain and grow Open Terms Archive. If you know of opportunities, please let us know over email at `contact@[project name without spaces].org`!
432
439
 
433
- ---
440
+ - - -
434
441
 
435
442
  ## License
436
443
 
437
- The code for this software is distributed under the European Union Public Licence (EUPL) v1.2.
438
- Contact the author if you have any specific need or question regarding licensing.
444
+ The code for this software is distributed under the [European Union Public Licence (EUPL) v1.2](https://joinup.ec.europa.eu/collection/eupl/eupl-text-eupl-12). In short, this [means](https://choosealicense.com/licenses/eupl-1.2/) you are allowed to read, use, modify and redistribute this source code, as long as you credit “Open Terms Archive Contributors” and make available any change you make to it under similar conditions.
445
+
446
+ Contact the core team over email at `contact@[project name without spaces].org` if you have any specific need or question regarding licensing.
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@opentermsarchive/engine",
3
- "version": "0.17.0",
3
+ "version": "0.17.1",
4
4
  "description": "Tracks and makes visible changes to the terms of online services",
5
5
  "homepage": "https://github.com/ambanum/OpenTermsArchive#readme",
6
6
  "bugs": {
@@ -4,7 +4,7 @@ Export the versions dataset into a ZIP file and publish it to GitHub releases.
4
4
 
5
5
  ## Configuring
6
6
 
7
- You can change the configuration in the appropriate config file in the `config` folder. See the [main README](https://github.com/ambanum/OpenTermsArchive#configuring) for documentation on using the configuration file.
7
+ You can change the configuration in the appropriate config file in the `config` folder. See the [main README](../../README.md#configuring) for documentation on using the configuration file.
8
8
 
9
9
  ## Running
10
10
 
@@ -34,4 +34,4 @@ node scripts/dataset/main.js --schedule --publish --remove-local-copy
34
34
 
35
35
  ## Adding renaming rules
36
36
 
37
- See the [renamer module documentation](../renamer/README.md).
37
+ See the [renamer module documentation](../utils/renamer/README.md).
@@ -31,27 +31,27 @@ It has been generated with [Open Terms Archive](https://opentermsarchive.org).
31
31
 
32
32
  ### Dataset format
33
33
 
34
- This dataset represents each version of a document as a separate [Markdown](https://spec.commonmark.org/0.30/) file, nested in a directory with the name of the service provider and in a directory with the name of the document type. The filesystem layout will look like below.
34
+ This dataset represents each version of a document as a separate [Markdown](https://spec.commonmark.org/0.30/) file, nested in a directory with the name of the service provider and in a directory with the name of the terms type. The filesystem layout will look like below.
35
35
 
36
36
  \`\`\`
37
37
  ├ README.md
38
38
  ├┬ Service provider 1 (e.g. Facebook)
39
- │├┬ Document type 1 (e.g. Terms of Service)
39
+ │├┬ Terms type 1 (e.g. Terms of Service)
40
40
  ││├ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-08-01T01-03-12Z.md)
41
41
  ┆┆┆
42
42
  ││└ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-10-03T08-12-25Z.md)
43
43
  ┆┆
44
- │└┬ Document type X (e.g. Privacy Policy)
44
+ │└┬ Terms type X (e.g. Privacy Policy)
45
45
  │ ├ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-05-02T03-02-15Z.md)
46
46
  ┆ ┆
47
47
  │ └ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-11-14T12-36-45Z.md)
48
48
 
49
49
  └┬ Service provider Y (e.g. Google)
50
- ├┬ Document type 1 (e.g. Developer Terms)
50
+ ├┬ Terms type 1 (e.g. Developer Terms)
51
51
  │├ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2019-03-12T04-18-22Z.md)
52
52
  ┆┆
53
53
  │└ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-12-04T22-47-05Z.md)
54
- └┬ Document type Z (e.g. Privacy Policy)
54
+ └┬ Terms type Z (e.g. Privacy Policy)
55
55
 
56
56
  ├ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-05-02T03-02-15Z.md)
57
57
 
@@ -8,27 +8,27 @@ It has been generated with [Open Terms Archive](https://opentermsarchive.org).
8
8
 
9
9
  ### Dataset format
10
10
 
11
- This dataset represents each version of a document as a separate [Markdown](https://spec.commonmark.org/0.30/) file, nested in a directory with the name of the service provider and in a directory with the name of the document type. The filesystem layout will look like below.
11
+ This dataset represents each version of a document as a separate [Markdown](https://spec.commonmark.org/0.30/) file, nested in a directory with the name of the service provider and in a directory with the name of the terms type. The filesystem layout will look like below.
12
12
 
13
13
  ```
14
14
  ├ README.md
15
15
  ├┬ Service provider 1 (e.g. Facebook)
16
- │├┬ Document type 1 (e.g. Terms of Service)
16
+ │├┬ Terms type 1 (e.g. Terms of Service)
17
17
  ││├ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-08-01T01-03-12Z.md)
18
18
  ┆┆┆
19
19
  ││└ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-10-03T08-12-25Z.md)
20
20
  ┆┆
21
- │└┬ Document type X (e.g. Privacy Policy)
21
+ │└┬ Terms type X (e.g. Privacy Policy)
22
22
  │ ├ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-05-02T03-02-15Z.md)
23
23
  ┆ ┆
24
24
  │ └ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-11-14T12-36-45Z.md)
25
25
 
26
26
  └┬ Service provider Y (e.g. Google)
27
- ├┬ Document type 1 (e.g. Developer Terms)
27
+ ├┬ Terms type 1 (e.g. Developer Terms)
28
28
  │├ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2019-03-12T04-18-22Z.md)
29
29
  ┆┆
30
30
  │└ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-12-04T22-47-05Z.md)
31
- └┬ Document type Z (e.g. Privacy Policy)
31
+ └┬ Terms type Z (e.g. Privacy Policy)
32
32
 
33
33
  ├ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-05-02T03-02-15Z.md)
34
34
 
@@ -54,6 +54,6 @@ NODE_ENV=import node scripts/import/index.js
54
54
  The script will:
55
55
 
56
56
  - Ignore commits which are not a document snapshot (like renaming or documentation commits).
57
- - Rename document types according to declared rules. See the [renamer module documentation](../renamer/README.md).
57
+ - Rename terms types according to declared rules. See the [renamer module documentation](../renamer/README.md).
58
58
  - Rename services according to declared rules. See the [renamer module documentation](../renamer/README.md).
59
59
  - Handle duplicates, so you can run it twice without worrying about duplicate entries in the database.
@@ -2,7 +2,7 @@ __:warning: These scripts are no longer up-to-date with the codebase and are not
2
2
 
3
3
  # Rewrite history
4
4
 
5
- As some document types or service names can change over time or as we need to import history from other tools, provided they have a history with the same structure as Open Terms Archive, we need a way to rewrite, reorder and apply changes to the snapshots or versions history.
5
+ As some terms types or service names can change over time or as we need to import history from other tools, provided they have a history with the same structure as Open Terms Archive, we need a way to rewrite, reorder and apply changes to the snapshots or versions history.
6
6
 
7
7
  The script works by reading commits from a **source** repository, applying changes and then committing the result in another, empty or not, **target** repository. So a source repository with commits is required.
8
8
 
@@ -125,7 +125,7 @@ Currently, the script will:
125
125
 
126
126
  - Ignore commits which are not a document snapshot (like renaming or documentation commits)
127
127
  - Reorder commits according to their author date
128
- - Rename document types according to declared rules
128
+ - Rename terms types according to declared rules
129
129
  - Rename services according to declared rules
130
130
  - Skip commits with empty content
131
131
  - Skip commits which do not change the document
@@ -101,7 +101,7 @@ let recorder;
101
101
  );
102
102
 
103
103
  if (!documentDeclaration) {
104
- console.log(`⌙ Skip unknown document type "${documentType}" for service "${serviceId}"`);
104
+ console.log(`⌙ Skip unknown terms type "${documentType}" for service "${serviceId}"`);
105
105
  continue;
106
106
  }
107
107
 
@@ -1,6 +1,6 @@
1
1
  # Renamer
2
2
 
3
- This module is used to apply renaming rules to service IDs and document types.
3
+ This module is used to apply renaming rules to service IDs and terms types.
4
4
 
5
5
  ## Usage
6
6
 
@@ -24,9 +24,9 @@ To rename a service, add a rule in `./rules/services.json`, for example, to rena
24
24
  }
25
25
  ```
26
26
 
27
- ### Document type
27
+ ### Terms type
28
28
 
29
- To rename a document type, add a rule in `./rules/documentTypes.json`, for example, to rename "Program Policies" to "Acceptable Use Policy", add the following line in the file:
29
+ To rename a terms type, add a rule in `./rules/documentTypes.json`, for example, to rename "Program Policies" to "Acceptable Use Policy", add the following line in the file:
30
30
 
31
31
  ```json
32
32
  {
@@ -35,9 +35,9 @@ To rename a document type, add a rule in `./rules/documentTypes.json`, for examp
35
35
  }
36
36
  ```
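The rule above can be pictured as a plain lookup table from old name to new name. The sketch below is a minimal model of that lookup, assuming the `documentTypes` and `documentTypesByService` keys that appear in the renamer source further down in this diff; `renameTermsType`, the `ExampleService` service and its rule are hypothetical and only serve the illustration.

```javascript
// Illustrative rules; the real ones live in the JSON files under ./rules.
const renamingRules = {
  documentTypes: {
    'Program Policies': 'Acceptable Use Policy', // example taken from the README
  },
  documentTypesByService: {
    ExampleService: { 'User Agreement': 'Terms of Service' }, // hypothetical service-specific rule
  },
};

// Apply the general rule first, then the service-specific one on the result.
// A terms type with no matching rule is returned unchanged.
function renameTermsType(serviceId, termsType, rules) {
  const afterGeneralRule = rules.documentTypes[termsType] ?? termsType;
  const serviceRules = rules.documentTypesByService[serviceId];

  return serviceRules?.[afterGeneralRule] ?? afterGeneralRule;
}

const renamedEverywhere = renameTermsType('Skype', 'Program Policies', renamingRules);
const renamedForOneService = renameTermsType('ExampleService', 'User Agreement', renamingRules);
const untouched = renameTermsType('Skype', 'User Agreement', renamingRules);
```

Applying the general rule before the service-specific one means a service-specific rule has to be keyed by the already renamed terms type.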
37
37
 
38
- ### Document type for a specific service
38
+ ### Terms type for a specific service
39
39
 
40
- To rename a document type only for a specific service, add a rule in `./rules/servicesDocumentTypes.json`, for example, to rename "Program Policies" to "Acceptable Use Policy" only for Skype, add the following line in the file:
40
+ To rename a terms type only for a specific service, add a rule in `./rules/servicesDocumentTypes.json`, for example, to rename "Program Policies" to "Acceptable Use Policy" only for Skype, add the following line in the file:
41
41
 
42
42
  ```json
43
43
  {
@@ -26,7 +26,7 @@ export function applyRules(serviceId, documentType) {
26
26
  const renamedDocumentType = renamingRules.documentTypes[documentType];
27
27
 
28
28
  if (renamedDocumentType) {
29
- console.log(`⌙ Rename document type "${documentType}" to "${renamedDocumentType}" of "${serviceId}" service`);
29
+ console.log(`⌙ Rename terms type "${documentType}" to "${renamedDocumentType}" of "${serviceId}" service`);
30
30
  documentType = renamedDocumentType;
31
31
  }
32
32
 
@@ -34,7 +34,7 @@ export function applyRules(serviceId, documentType) {
34
34
  && renamingRules.documentTypesByService[serviceId][documentType];
35
35
 
36
36
  if (renamedServiceDocumentType) {
37
- console.log(`⌙ Specific rename document type "${documentType}" to "${renamedServiceDocumentType}" of "${serviceId}" service`);
37
+ console.log(`⌙ Specific rename terms type "${documentType}" to "${renamedServiceDocumentType}" of "${serviceId}" service`);
38
38
  documentType = renamedServiceDocumentType;
39
39
  }
40
40
 
@@ -27,7 +27,7 @@ export default class Recorder {
27
27
  }
28
28
 
29
29
  if (!documentType) {
30
- throw new Error('A document type is required');
30
+ throw new Error('A terms type is required');
31
31
  }
32
32
 
33
33
  if (!fetchDate) {
@@ -51,7 +51,7 @@ export default class Recorder {
51
51
  }
52
52
 
53
53
  if (!documentType) {
54
- throw new Error('A document type is required');
54
+ throw new Error('A terms type is required');
55
55
  }
56
56
 
57
57
  if (!snapshotIds?.length) {
@@ -49,7 +49,7 @@ describe('Recorder', () => {
49
49
 
50
50
  const paramsNameToExpectedTextInError = {
51
51
  serviceId: 'service ID',
52
- documentType: 'document type',
52
+ documentType: 'terms type',
53
53
  fetchDate: 'fetch date',
54
54
  content: 'content',
55
55
  mimeType: 'mime type',
@@ -190,7 +190,7 @@ describe('Recorder', () => {
190
190
 
191
191
  const paramsNameToExpectedTextInError = {
192
192
  serviceId: 'service ID',
193
- documentType: 'document type',
193
+ documentType: 'terms type',
194
194
  snapshotIds: 'snapshot ID',
195
195
  fetchDate: 'fetch date',
196
196
  content: 'content',
@@ -335,7 +335,7 @@ describe('Recorder', () => {
335
335
 
336
336
  const paramsNameToExpectedTextInError = {
337
337
  serviceId: 'service ID',
338
- documentType: 'document type',
338
+ documentType: 'terms type',
339
339
  snapshotIds: 'snapshot ID',
340
340
  fetchDate: 'fetch date',
341
341
  content: 'content',
@@ -77,7 +77,7 @@ function generateFileName(documentType, pageId, extension) {
77
77
  }
78
78
 
79
79
  export function generateFilePath(serviceId, documentType, pageId, mimeType) {
80
- const extension = mime.getExtension(mimeType) || '*'; // If mime type is undefined, an asterisk is set as an extension. Used to match all files for the given service ID, document type and page ID when mime type is unknown.
80
+ const extension = mime.getExtension(mimeType) || '*'; // If mime type is undefined, an asterisk is set as an extension. Used to match all files for the given service ID, terms type and page ID when mime type is unknown.
81
81
 
82
82
  return `${serviceId}/${generateFileName(documentType, pageId, extension)}`; // Do not use `path.join` as even for Windows, the path should be with `/` and not `\`. See https://github.com/ambanum/OpenTermsArchive/runs/8110230474?check_suite_focus=true#step:7:125
83
83
  }
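To make the wildcard fallback concrete, here is a self-contained sketch of how a path ending in `.*` could be used to match stored files when the MIME type is unknown. Everything in it is an assumption made for illustration: the `EXTENSIONS` table stands in for the `mime` package, `fileName` is a hypothetical replacement for the `generateFileName` helper whose body is not shown in this diff, and `matcherFor` is not part of the engine.

```javascript
// Hypothetical MIME-to-extension table standing in for the `mime` package.
const EXTENSIONS = { 'text/html': 'html', 'application/pdf': 'pdf' };

// Hypothetical naming scheme; the engine's own generateFileName may differ.
function fileName(termsType, pageId, extension) {
  return `${termsType}${pageId ? ` - ${pageId}` : ''}.${extension}`;
}

function filePath(serviceId, termsType, pageId, mimeType) {
  const extension = EXTENSIONS[mimeType] || '*'; // Unknown MIME type: fall back to a wildcard extension

  return `${serviceId}/${fileName(termsType, pageId, extension)}`; // Always `/`, even on Windows
}

// Turn a path that may end with `.*` into a predicate over stored file paths.
function matcherFor(pattern) {
  if (!pattern.endsWith('.*')) {
    return candidate => candidate === pattern;
  }

  const prefix = pattern.slice(0, -1); // Drop the asterisk, keep the trailing dot

  return candidate => candidate.startsWith(prefix);
}

const storedPaths = [
  'Skype/Privacy Policy.html',
  'Skype/Privacy Policy.pdf',
  'Skype/Terms of Service.html',
];

const wildcardPath = filePath('Skype', 'Privacy Policy', undefined, undefined);
const matches = storedPaths.filter(matcherFor(wildcardPath));
const exactPath = filePath('Skype', 'Privacy Policy', undefined, 'application/pdf');
```

With a known MIME type the path is exact; without one, every stored file sharing the same service ID, terms type and page ID is matched whatever its extension.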
@@ -101,7 +101,7 @@ describe('GitRepository', () => {
101
101
  expect(commit.message).to.include(SERVICE_PROVIDER_ID);
102
102
  });
103
103
 
104
- it('stores the document type', () => {
104
+ it('stores the terms type', () => {
105
105
  expect(commit.message).to.include(DOCUMENT_TYPE);
106
106
  });
107
107
 
@@ -314,7 +314,7 @@ describe('GitRepository', () => {
314
314
  expect(commit.message).to.include(SERVICE_PROVIDER_ID);
315
315
  });
316
316
 
317
- it('stores the document type', () => {
317
+ it('stores the terms type', () => {
318
318
  expect(commit.message).to.include(DOCUMENT_TYPE);
319
319
  });
320
320
 
@@ -351,7 +351,7 @@ describe('GitRepository', () => {
351
351
  expect(commit.message).to.include(SERVICE_PROVIDER_ID);
352
352
  });
353
353
 
354
- it('stores the document type', () => {
354
+ it('stores the terms type', () => {
355
355
  expect(commit.message).to.include(DOCUMENT_TYPE);
356
356
  });
357
357
 
@@ -394,7 +394,7 @@ describe('GitRepository', () => {
394
394
  expect(commit.message).to.include(SERVICE_PROVIDER_ID);
395
395
  });
396
396
 
397
- it('stores the document type', () => {
397
+ it('stores the terms type', () => {
398
398
  expect(commit.message).to.include(DOCUMENT_TYPE);
399
399
  });
400
400
 
@@ -436,7 +436,7 @@ describe('GitRepository', () => {
436
436
  expect(record.serviceId).to.equal(SERVICE_PROVIDER_ID);
437
437
  });
438
438
 
439
- it('returns the document type', () => {
439
+ it('returns the terms type', () => {
440
440
  expect(record.documentType).to.equal(DOCUMENT_TYPE);
441
441
  });
442
442
 
@@ -35,11 +35,11 @@ export default class RepositoryInterface {
35
35
  }
36
36
 
37
37
  /**
38
- * Find the most recent record that matches the given service ID and document type and optionally the page ID
38
+ * Find the most recent record that matches the given service ID and terms type and optionally the page ID
39
39
  * In case of snapshots, if the record is related to a multipage document, the page ID is required to find the corresponding snapshot
40
40
  *
41
41
  * @param {string} serviceId - Service ID of record to find
42
- * @param {string} documentType - Document type of record to find
42
+ * @param {string} documentType - Terms type of record to find
43
43
  * @param {string} [pageId] - Page ID of record to find. Used to differentiate pages of multipage document. Not necessary for single page document
44
44
  * @returns {Promise<Record>} Promise that will be resolved with the found record or an empty object if none match the given criteria
45
45
  */
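The contract described by this JSDoc can be sketched with an in-memory list of records. This is only an illustration under stated assumptions: `latestRecord` and `InMemoryRepository` are hypothetical names, the record fields (`serviceId`, `documentType`, `pageId`, `fetchDate`) are taken from identifiers visible elsewhere in this diff, and the real Git and MongoDB repositories work differently.

```javascript
// Return the most recent record matching the criteria, or an empty object if none match.
// `pageId` is optional: when omitted, records are matched on service ID and terms type only.
function latestRecord(records, serviceId, documentType, pageId) {
  const matching = records
    .filter(record => record.serviceId === serviceId && record.documentType === documentType)
    .filter(record => pageId === undefined || record.pageId === pageId)
    .sort((a, b) => b.fetchDate - a.fetchDate); // Newest first; `filter` already returned a copy

  return matching[0] || {};
}

// Hypothetical in-memory repository exposing the lookup as a promise, as the JSDoc describes.
class InMemoryRepository {
  constructor(records = []) {
    this.records = records;
  }

  async findLatest(serviceId, documentType, pageId) {
    return latestRecord(this.records, serviceId, documentType, pageId);
  }
}

const records = [
  { serviceId: 'Skype', documentType: 'Privacy Policy', pageId: 'main', fetchDate: new Date('2021-05-02T03:02:15Z') },
  { serviceId: 'Skype', documentType: 'Privacy Policy', pageId: 'main', fetchDate: new Date('2021-11-14T12:36:45Z') },
  { serviceId: 'Skype', documentType: 'Privacy Policy', pageId: 'cookies', fetchDate: new Date('2021-12-04T22:47:05Z') },
];

const latestMainPage = latestRecord(records, 'Skype', 'Privacy Policy', 'main');
const latestAnyPage = latestRecord(records, 'Skype', 'Privacy Policy');
const noMatch = latestRecord(records, 'Skype', 'Terms of Service');
```

The example also shows why the page ID matters for multipage documents: without it, the newest record of any page is returned, which may not be the page being compared.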
@@ -95,7 +95,7 @@ describe('MongoRepository', () => {
95
95
  expect(mongoDocument.serviceId).to.include(SERVICE_PROVIDER_ID);
96
96
  });
97
97
 
98
- it('stores the document type', () => {
98
+ it('stores the terms type', () => {
99
99
  expect(mongoDocument.documentType).to.include(DOCUMENT_TYPE);
100
100
  });
101
101
 
@@ -349,7 +349,7 @@ describe('MongoRepository', () => {
349
349
  expect(mongoDocument.serviceId).to.include(SERVICE_PROVIDER_ID);
350
350
  });
351
351
 
352
- it('stores the document type', () => {
352
+ it('stores the terms type', () => {
353
353
  expect(mongoDocument.documentType).to.include(DOCUMENT_TYPE);
354
354
  });
355
355
 
@@ -392,7 +392,7 @@ describe('MongoRepository', () => {
392
392
  expect(mongoDocument.serviceId).to.include(SERVICE_PROVIDER_ID);
393
393
  });
394
394
 
395
- it('stores the document type', () => {
395
+ it('stores the terms type', () => {
396
396
  expect(mongoDocument.documentType).to.include(DOCUMENT_TYPE);
397
397
  });
398
398
 
@@ -434,7 +434,7 @@ describe('MongoRepository', () => {
434
434
  expect(record.serviceId).to.equal(SERVICE_PROVIDER_ID);
435
435
  });
436
436
 
437
- it('returns the document type', () => {
437
+ it('returns the terms type', () => {
438
438
  expect(record.documentType).to.equal(DOCUMENT_TYPE);
439
439
  });
440
440
 
@@ -51,7 +51,7 @@ describe('Services', () => {
51
51
  expect(actualDocumentDeclaration.service.name).to.eql(expectedDocumentDeclaration.service.name);
52
52
  });
53
53
 
54
- it('has the proper document type', () => {
54
+ it('has the proper terms type', () => {
55
55
  expect(actualDocumentDeclaration.type).to.eql(expectedDocumentDeclaration.type);
56
56
  });
57
57
 
@@ -170,7 +170,7 @@ describe('Services', () => {
170
170
  expect(actualDocumentDeclaration.service.name).to.eql(expectedDocumentDeclaration.service.name);
171
171
  });
172
172
 
173
- it('has the proper document type', () => {
173
+ it('has the proper terms type', () => {
174
174
  expect(actualDocumentDeclaration.type).to.eql(expectedDocumentDeclaration.type);
175
175
  });
176
176
 
@@ -154,7 +154,7 @@ describe('Service', () => {
154
154
  subject.addDocumentDeclaration(privacyPolicyDeclaration);
155
155
  });
156
156
 
157
- it('returns the service document types', async () => {
157
+ it('returns the service terms types', async () => {
158
158
  expect(subject.getDocumentTypes()).to.have.members([
159
159
  termsOfServiceDeclaration.type,
160
160
  privacyPolicyDeclaration.type,
package/src/main.js CHANGED
@@ -11,7 +11,7 @@ program
11
11
  .description(description)
12
12
  .version(version)
13
13
  .option('-s, --services [serviceId...]', 'service IDs of services to handle')
14
- .option('-d, --documentTypes [documentType...]', 'document types to handle')
14
+ .option('-d, --documentTypes [documentType...]', 'terms types to handle')
15
15
  .option('-r, --refilter-only', 'only refilter exisiting snapshots with last declarations and engine\'s updates')
16
16
  .option('--schedule', 'schedule automatic document tracking');
17
17
 
package/README.fr.md DELETED
@@ -1,110 +0,0 @@
1
- <img src="https://disinfo.quaidorsay.fr/assets/img/logo.png" width="140">
2
-
3
- # Open Terms Archive
4
-
5
- Les services en ligne ont des conditions générales qui évoluent dans le temps. _Open Terms Archive_ permet aux défenseurs des droits des utilisateurs, aux régulateurs et à toute personne intéressée de suivre les évolutions de ces conditions générales en étant notifiée à chaque publication d'une nouvelle version, et en explorant leur historique.
6
-
7
- ## Table des matières
8
-
9
- - [Fonctionnement](#fonctionnement)
10
- - [Naviguer dans l'historique des versions](#naviguer-dans-lhistorique-des-versions)
11
- - [Remarques](#remarques)
12
- - [Recevoir des notifications](#recevoir-des-notifications)
13
- - [Par courriel](#par-courriel)
14
- - [Recevoir les mises à jour de services ou documents spécifiques](#recevoir-les-mises-%C3%A0-jour-de-services-ou-documents-sp%C3%A9cifiques)
15
- - [Par RSS](#par-rss)
16
- - [Récapitulatif des flux RSS disponibles](#r%C3%A9capitulatif-des-flux-rss-disponibles)
17
- - [Désabonnement](#désabonnement)
18
- - [Contribuer](#contribuer)
19
- - [Ajouter un nouveau service](#ajouter-un-nouveau-service)
20
-
21
- ## Fonctionnement
22
-
23
- _Note: Les mots en gras sont les [termes du domaine](https://fr.wikipedia.org/wiki/Conception_pilot%C3%A9e_par_le_domaine)._
24
-
25
- Les **services** sont **déclarés** dans l'outil _Open Terms Archive_ grâce à un **fichier de déclaration** listant les **documents** qui forment l'ensemble des **conditions** régissant l'usage du **service**. Ces **documents** peuvent être de plusieurs **types** : « conditions d'utilisation », « politique de confidentialité », « contrat de développeur »…
26
-
27
- Afin de **suivre** leurs **évolutions**, les **documents** sont régulièrement mis à jour, en les **téléchargeant** depuis une **adresse** web et en **sélectionnant leur contenu** dans la **page web** pour supprimer le **bruit** (publicités, menus de navigation, champs de connexion…). En plus de simplement sélectionner une zone de la page, certains documents possèdent du **bruit** supplémentaire (hashs dans des liens, jetons CSRF...) créant de faux positifs en terme d'**évolutions**. En conséquence, _Open Terms Archive_ supporte des **filtres** spécifiques pour chaque **document**.
28
-
29
- Néanmoins, le **bruit** peut changer de forme avec le temps. Afin d'éviter des pertes d'information irrécupérables pendant l'étape de **filtrage du bruit**, un **instantané** de la page Web est **enregistré** à chaque **évolution**. Après avoir **filtré l'instantané** de son **bruit**, si le **document** résultant a changé par rapport à sa **version** précédente, une nouvelle **version** est **enregistrée**.
30
-
31
- Vous pouvez disposer de votre propre instance **privée** de l'outil _Open Terms Archive_ et suivre vous-même les **évolutions**. Néanmoins, nous **publions** chaque **version** sur une [instance **publique**](https://github.com/OpenTermsArchive/contrib-versions) facilitant l'exploration de l'**historique** et **notifiant** par courriels l'**enregistrement** de nouvelles **versions**. Les **utilisateurs** peuvent [**s'abonner** aux **notifications**](#recevoir-des-notifications).
32
-
33
- _Note: Actuellement, nous ne suivons que les **conditions** rédigées en anglais et concernant la juridiction européenne._
34
-
35
- ## Naviguer dans l'historique des versions
36
-
37
- À partir de la **page d'accueil du dépôt** [contrib-versions](https://github.com/OpenTermsArchive/contrib-versions), ouvrez le dossier du **service de votre choix** (prenons par exemple [WhatsApp](https://github.com/OpenTermsArchive/contrib-versions/tree/main/WhatsApp)).
38
-
39
- L'**ensemble des documents suivis** pour ce service s'affichent, cliquez ensuite sur **celui dont vous souhaitez suivre l'historique** (par exemple la [politique d'utilisation des données de WhatsApp](https://github.com/OpenTermsArchive/contrib-versions/blob/main/WhatsApp/Privacy%20Policy.md)). Le document s'affiche alors dans sa **dernière version** (il est actualisé toutes les heures).
40
-
41
- Pour afficher l'**historique des modifications** subies par ce document, cliquez sur **History** en haut à droite du document (pour l'exemple précédent nous arrivons [ici](https://github.com/OpenTermsArchive/contrib-versions/commits/main/WhatsApp/Privacy%20Policy.md)). Les **modifications** sont affichées **par dates**, de la plus récente à la plus ancienne.
42
-
43
- Cliquez sur une modification pour voir en quoi elle consiste (par exemple [celle-ci](https://github.com/OpenTermsArchive/contrib-versions/commit/58a1d2ae4187a3260ac58f3f3c7dcd3aeacaebcd)). Vous disposez de **deux types d'affichage**, sélectionnables à partir des icônes dans la barre grisée qui chapeaute le document.
44
-
45
- - Le premier, appelé _source diff_ (bouton avec des chevrons) permet d'**afficher côte-à-côte l'ancienne version et la nouvelle** (pour notre [exemple](https://github.com/OpenTermsArchive/contrib-versions/commit/58a1d2ae4187a3260ac58f3f3c7dcd3aeacaebcd#diff-e8bdae8692561f60aeac9d27a55e84fc)). Cet affichage a le mérite de **montrer explicitement** l'ensemble des ajouts/suppressions.
46
- - Le second, appelé _rich diff_ (bouton avec l'icône document) permet d'**unifier l'ensemble des modifications sur un seul document** (pour notre [exemple](https://github.com/OpenTermsArchive/contrib-versions/commit/58a1d2ae4187a3260ac58f3f3c7dcd3aeacaebcd?short_path=e8bdae8#diff-e8bdae8692561f60aeac9d27a55e84fc)). La couleur **rouge** montre les éléments **supprimés**, la couleur **jaune** montre les paragraphes **modifiés**, et la couleur **verte** montrent les éléments **ajoutés**. Attention, cet affichage **ne montre pas certaines modifications** comme le changement des hyperliens et le style du texte.
47
-
48
- ### Remarques
49
-
50
- - Pour les longs documents, les **paragraphes inchangés ne seront pas affichés par défaut**. Vous pouvez manuellement les faire apparaître en cliquant sur les petites flèches juste au-dessus ou juste en-dessous des paragraphes affichés.
51
- - Vous pouvez utiliser le bouton **History n'importe où** dans le dépôt contrib-versions, qui affichera alors l'**historique des modifications subies par tous les documents se trouvant dans le dossier** où vous vous trouvez (y compris dans les sous-dossiers).
52
-
53
- ## Be notified
-
- ### By email
-
- #### For all documents at once
-
- You can [subscribe](https://59692a77.sibforms.com/serve/MUIEAKuTv3y67e27PkjAiw7UkHCn0qVrcD188cQb-ofHVBGpvdUWQ6EraZ5AIb6vJqz3L8LDvYhEzPb2SE6eGWP35zXrpwEFVJCpGuER9DKPBUrifKScpF_ENMqwE_OiOZ3FdCV2ra-TXQNxB2sTEL13Zj8HU7U0vbbeF7TnbFiW8gGbcOa5liqmMvw_rghnEB2htMQRCk6A3eyj) to receive an email whenever a document is modified anywhere in the database.
-
- **Be careful, you are likely to receive many notifications!** You can unsubscribe by replying to any email you receive.
-
- #### Receive updates for specific services or documents
-
- You can go to the official website, [opentermsarchive.org](https://opentermsarchive.org). From there, you can select a service, then the corresponding document type.
- After you enter your email address and click "Subscribe", we will add your address to the corresponding mailing list in [SendInBlue](https://www.sendinblue.com/), and we will not store it anywhere else.
- Then, whenever a change is detected in the corresponding document, we will send you an email.
-
- You can unsubscribe at any time by clicking the "unsubscribe" link at the bottom of any email you receive.
-
- ### By RSS
-
- You can be notified about a specific service or document by subscribing to RSS feeds.
-
- > An RSS feed is a type of web page that contains information about the latest content published by a website, such as its publication date and the address where it can be viewed. When this resource is updated, a feed reader application notifies you automatically so that you can view the update.
-
- To find the address of the RSS feed to subscribe to:
-
- 1. [Navigate](#naviguer-dans-lhistorique-des-versions) to the page showing the history of the changes you are interested in. _In the WhatsApp example given above, this is [this page](https://github.com/OpenTermsArchive/contrib-versions/commits/main/WhatsApp/Privacy%20Policy.md)._
- 2. Copy the address of that page from your browser’s address bar. _In the WhatsApp example, this is `https://github.com/OpenTermsArchive/contrib-versions/commits/main/WhatsApp/Privacy%20Policy.md`._
- 3. Append `.atom` to that address. _In the WhatsApp example, this gives `https://github.com/OpenTermsArchive/contrib-versions/commits/main/WhatsApp/Privacy%20Policy.md.atom`._
- 4. Subscribe your RSS feed reader to the resulting address.
-
- #### Summary of available RSS feeds
-
- | Updated for                      | URL |
- | -------------------------------- | --- |
- | all services and documents       | `https://github.com/OpenTermsArchive/contrib-versions/commits.atom` |
- | all the documents of a service   | Replace `$serviceId` with the service ID:<br>`https://github.com/OpenTermsArchive/contrib-versions/commits/main/$serviceId.atom` |
- | a specific document of a service | Replace `$serviceId` with the service ID and `$documentType` with the document type:<br>`https://github.com/OpenTermsArchive/contrib-versions/commits/main/$serviceId/$documentType.md.atom` |
-
- For example:
-
- - To receive all updates to `Facebook` documents, subscribe to `https://github.com/OpenTermsArchive/contrib-versions/commits/main/Facebook.atom`.
- - To receive all updates to the `Privacy Policy` of `Google`, subscribe to `https://github.com/OpenTermsArchive/contrib-versions/commits/main/Google/Privacy%20Policy.md.atom`.
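
The URL patterns above can be assembled programmatically. The following is a minimal sketch, not part of the project: the function name `feed_url` and its parameters are hypothetical, and it assumes the `main` branch and the `.md` file extension shown in the example URLs.

```python
from urllib.parse import quote

# Repository that hosts the tracked versions (taken from the URLs above).
BASE_URL = "https://github.com/OpenTermsArchive/contrib-versions"


def feed_url(service_id=None, document_type=None, branch="main"):
    """Build an Atom feed URL (hypothetical helper, for illustration only).

    - no arguments: feed for all services and documents
    - service_id only: feed for all the documents of that service
    - service_id and document_type: feed for one specific document
    """
    if service_id is None:
        if document_type is not None:
            raise ValueError("document_type requires a service_id")
        return f"{BASE_URL}/commits.atom"

    # Percent-encode each path segment, so "Privacy Policy" becomes "Privacy%20Policy".
    path = quote(service_id, safe="")
    if document_type is not None:
        path += "/" + quote(document_type, safe="") + ".md"

    return f"{BASE_URL}/commits/{branch}/{path}.atom"


if __name__ == "__main__":
    print(feed_url())
    print(feed_url("Facebook"))
    print(feed_url("Google", "Privacy Policy"))
```

Percent-encoding each segment separately keeps spaces in service or document names from breaking the address, which matches the `Privacy%20Policy` form used in the examples.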
96
-
97
- ### Désabonnement
98
-
99
- Afin de ne plus recevoir d'e-mails de mise à jour des services, deux liens sont inclus dans chaque e-mail reçu :
100
-
101
- - un pour ne plus recevoir tous les e-mails de bot@opentermsarchive.org
102
- - un pour ne plus recevoir les e-mails d'un document particulier
103
-
104
- Ce dernier lien consiste à envoyer un courriel à contact@opentermsarchive.org pour être retiré manuellement de la liste correspondante.
105
-
106
- ## Contributing
-
- ### Adding a new service
-
- See the [CONTRIBUTING](CONTRIBUTING.md) file (in English).