@opentermsarchive/engine 0.17.0 → 0.17.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (20)

package/README.md +240 -232
package/package.json +1 -1
package/scripts/dataset/README.md +2 -2
package/scripts/dataset/assets/README.template.js +5 -5
package/scripts/dataset/export/test/fixtures/dataset/README.md +5 -5
package/scripts/import/README.md +1 -1
package/scripts/rewrite/README.md +2 -2
package/scripts/rewrite/rewrite-versions.js +1 -1
package/scripts/utils/renamer/README.md +5 -5
package/scripts/utils/renamer/index.js +2 -2
package/src/archivist/recorder/index.js +2 -2
package/src/archivist/recorder/index.test.js +3 -3
package/src/archivist/recorder/repositories/git/dataMapper.js +1 -1
package/src/archivist/recorder/repositories/git/index.test.js +5 -5
package/src/archivist/recorder/repositories/interface.js +2 -2
package/src/archivist/recorder/repositories/mongo/index.test.js +4 -4
package/src/archivist/services/index.test.js +2 -2
package/src/archivist/services/service.test.js +1 -1
package/src/main.js +1 -1
package/README.fr.md +0 -110

package/README.md CHANGED

@@ -1,245 +1,289 @@
-# Open Terms Archive
-**Services** have **terms** that can change over time. _Open Terms Archive_ enables users rights advocates, regulatory bodies and any interested citizen to follow the **changes** to these **terms** by being **notified** whenever a new **version** is published, and exploring their entire **history**.
-> Les services ont des conditions générales qui évoluent dans le temps. _Open Terms Archive_ permet aux défenseurs des droits des utilisateurs, aux régulateurs et à toute personne intéressée de suivre les évolutions de ces conditions générales en étant notifiée à chaque publication d'une nouvelle version, et en explorant leur historique.
-[🇫🇷 Manuel en français](README.fr.md).
-## Table of Contents
-- [How it works](#how-it-works)
-- [Exploring the versions history](#exploring-the-versions-history)
-- [Be notified](#be-notified)
-  - [By email](#by-email)
-  - [By RSS](#by-rss)
-- [Importing as a module](#importing-as-a-module)
-  - [CLI](#cli)
-  - [Features exposed](#features-exposed)
-    - [fetch](#fetch)
-    - [filter](#filter)
-- [Using locally](#using-locally)
-  - [Installing](#installing)
-    - [Declarations repository](#declarations-repository)
-    - [Core](#core)
-  - [Configuring](#configuring)
-    - [Configuration file](#configuration-file)
-      - [Storage repositories](#storage-repositories)
-    - [Environment variables](#environment-variables)
-  - [Running](#running)
+_The document you are reading now is targeted at developers wanting to use or contribute to the engine of [Open Terms Archive](https://opentermsarchive.org). For a high-level overview of Open Terms Archive’s wider goals and processes, please read its [public homepage](https://opentermsarchive.org)._
+# Open Terms Archive Engine
+This codebase is a Node.js module enabling downloading, archiving and publishing versions of documents obtained online. It can be used independently from the Open Terms Archive ecosystem.
+## Table of contents
+- [Motivation](#motivation)
+- [Main concepts](#main-concepts)
+- [How to add documents to a collection](#how-to-add-documents-to-a-collection)
+- [How to use the engine](#how-to-use-the-engine)
+- [Configuring](#configuring)
 - [Deploying](#deploying)
-- [Publishing](#publishing)
 - [Contributing](#contributing)
-  - [Adding or updating a service](#adding-a-new-service-or-updating-an-existing-service)
-  - [Core engine](#core-engine)
-  - [Funding and partnerships](#funding-and-partnerships)
 - [License](#license)
-## How it works
+## Motivation
+_Words in bold are [business domain names](https://en.wikipedia.org/wiki/Domain-driven_design)._
+**Services** have **terms** written in **documents**, contractual (Terms of Services, Privacy Policy…) or not (Community Guidelines, Deceased User Policy…), that can change over time. Open Terms Archive enables users rights advocates, regulatory bodies and interested citizens to follow the **changes** to these **terms**, to be notified whenever a new **version** is published, to explore their entire **history** and to collaborate in analysing them. This free and open-source engine is developed to support these goals.
+## Main concepts
-_Note: Words in bold are [business domain names](https://en.wikipedia.org/wiki/Domain-driven_design)._
+### Instances
-**Services** are **declared** within _Open Terms Archive_ with a **declaration file** listing all the **documents** that, together, constitute the **terms** under which this **service** can be used. These **documents** all have a **type**, such as “terms and conditions”, “privacy policy”, “developer agreement”…
+Open Terms Archive is a decentralised system.
-In order to **track** their **changes**, **documents** are periodically obtained by **fetching** a web **location** and **selecting content** within the **web page** to remove the **noise** (ads, navigation menu, login fields…). Beyond selecting a subset of a page, some **documents** have additional **noise** (hashes in links, CSRF tokens…) that would be false positives for **changes**. _Open Terms Archive_ thus supports specific **filters** for each **document**.
+It aims at enabling any entity to **track** **terms** on its own and at federating a number of public **instances** in a single ecosystem to maximise discoverability, collaboration and political power. To that end, the Open Terms Archive **engine** can be run on any server, thus making it a dedicated **instance**.
-However, the shape of that **noise** can change over time. In order to recover in case of information loss during the **noise filtering** step, a **snapshot** is **recorded** every time there is a **change**. After the **noise** is **filtered out** from the **snapshot**, if there are **changes** in the resulting **document**, a new **version** of the **document** is **recorded**.
+> Federated public instances can be [found on GitHub](https://github.com/OpenTermsArchive?q=declarations).
-Anyone can run their own **private** instance and track changes on their own. However, we also **publish** each **version** on a [**public** instance](https://github.com/OpenTermsArchive/contrib-versions) that makes it easy to explore the entire **history** and enables **notifying** over email whenever a new **version** is **recorded**.
-Users can [**subscribe** to **notifications**](#be-notified).
+### Collections
-_Note: For now, when multiple versions coexist, **terms** are only **tracked** in their English version and for the European jurisdiction._
+An **instance** **tracks** **documents** of a single **collection**.
-## Exploring the versions history
+A **collection** is characterised by a **scope** across **dimensions** that describe the **terms** it **tracks**, such as **language**, **jurisdiction** and **industry**.
-We offer a public database of versions recorded each time there is a change in the terms of service and other contractual documents of tracked services: [contrib-versions](https://github.com/OpenTermsArchive/contrib-versions).
+> Federated public collections can be [found on GitHub](https://github.com/OpenTermsArchive?q=versions).
-From the **repository homepage** [contrib-versions](https://github.com/OpenTermsArchive/contrib-versions), open the folder of the **service of your choice** (e.g. [WhatsApp](https://github.com/OpenTermsArchive/contrib-versions/tree/main/WhatsApp)).
+#### Example scope
-You will see the **set of documents tracked** for that service, now click **on the document of your choice** (e.g. [WhatsApp's Privacy Policy](https://github.com/OpenTermsArchive/contrib-versions/blob/main/WhatsApp/Privacy%20Policy.md)). The **latest version** (updated hourly) will be displayed.
+> The documents declared in this collection are:
+> - Related to dating services used in Europe.
+> - In the European Union and Switzerland jurisdictions.
+> - In English, unless no English version exists, in which case the primary official language of the jurisdiction of incorporation of the service operator will be used.
-To view the **history of changes** made to this document, click on **History** at the top right of the document (for our previous [example](https://github.com/OpenTermsArchive/contrib-versions/commits/main/WhatsApp/Privacy%20Policy.md)). The **changes** are ordered **by date**, with the latest first.
+### Terms types
-Click on a change to see what it consists of (for example [this one](https://github.com/OpenTermsArchive/contrib-versions/commit/58a1d2ae4187a3260ac58f3f3c7dcd3aeacaebcd)). There are **two types of display** you can choose from the icons in the gray bar above the document.
+To distinguish between the different **terms** of a **service**, each has a **type**, such as “Terms of Service”, “Privacy Policy”, “Developer Agreement”…
-- The first one, named _source diff_ (button with chevrons) allows you to **display the old version and the new one side by side** (for our [example](https://github.com/OpenTermsArchive/contrib-versions/commit/58a1d2ae4187a3260ac58f3f3c7dcd3aeacaebcd#diff-e8bdae8692561f60aeac9d27a55e84fc)). This display has the merit of **explicitly showing** all additions and deletions.
-- The second one, named _rich diff_ (button with a document icon) allows you to **unify all the changes in a single document** (for our [example](https://github.com/OpenTermsArchive/contrib-versions/commit/58a1d2ae4187a3260ac58f3f3c7dcd3aeacaebcd?short_path=e8bdae8#diff-e8bdae8692561f60aeac9d27a55e84fc)). The **red** color shows **deleted** elements, the **yellow** color shows **modified** paragraphs, and the **green** color shows **added** elements. Be careful, this display **does not show some changes** such as hyperlinks and text style's changes.
+This **type** matches the topic, but not necessarily the title the **service** gives to it. Unifying the **types** enables comparing **terms** across **services**.
-### Notes
+> More information on terms types can be found in the [dedicated repository](https://github.com/OpenTermsArchive/terms-types). They are published on NPM under [`@opentermsarchive/terms-types`](https://www.npmjs.com/package/@opentermsarchive/terms-types), enabling standardisation and interoperability beyond the Open Terms Archive engine.
-- For long documents, unchanged **paragraphs will not be displayed by default**. You can manually make them appear by clicking on the small arrows just above or just below the displayed paragraphs.
-- You can use the **History button anywhere** in the repository contrib-versions, which will then display the **history of changes made to all documents in the folder** where you are (including sub-folders).
+### Declarations
-## Be notified
+The **documents** that constitute a **collection** are defined in simple JSON files called **declarations**.
-### By email
+A **declaration** also contains some metadata on the **service** the **documents** relate to.
-#### Document per document
+> Here is an example declaration tracking the Privacy Policy of Open Terms Archive:
+>
+> ```json
+> {
+>   "name": "Open Terms Archive",
+>   "documents": {
+>     "Privacy Policy": {
+>       "fetch": "https://opentermsarchive.org/en/privacy-policy",
+>       "select": ".TextContent_textContent__ToW2S"
+>     }
+>   }
+> }
+> ```
-You can go on the official front website [opentermsarchive.org](https://opentermsarchive.org). From there, you can select a service and then the corresponding document type.
-After you enter your email and click on subscribe, we will add your email to the corresponding mailing list in [SendInBlue](https://www.sendinblue.com/) and will not store your email anywhere else.
-Then, every time a modification is found in the corresponding document, we will send you an email.
+## How to add documents to a collection
-You can unsubscribe at any moment by clicking on the `unsubscribe` link at the bottom of the received email.
+Open Terms Archive **acquires** **documents** to deliver an explorable **history** of **changes**. This can be done in two ways:
-#### For all documents at once
+1. For the present and future, by **tracking** **documents**.
+2. For the past, by **importing** from an existing **fonds** such as [ToSBack](https://tosback.org), the [Internet Archive](https://archive.org/web/), [Common Crawl](https://commoncrawl.org) or any other in-house format.
-You can [subscribe](https://59692a77.sibforms.com/serve/MUIEAKuTv3y67e27PkjAiw7UkHCn0qVrcD188cQb-ofHVBGpvdUWQ6EraZ5AIb6vJqz3L8LDvYhEzPb2SE6eGWP35zXrpwEFVJCpGuER9DKPBUrifKScpF_ENMqwE_OiOZ3FdCV2ra-TXQNxB2sTEL13Zj8HU7U0vbbeF7TnbFiW8gGbcOa5liqmMvw_rghnEB2htMQRCk6A3eyj) to receive an email whenever a document is updated in the database.
+### Tracking documents
-**Beware, you are likely to receive a large amount of notifications!** You can unsubscribe by replying to any email you will receive.
+The **engine** **reads** **declarations** to **record** a **snapshot** by **fetching** the declared web **location** periodically. The **engine** then **extracts** a **version** from this **snapshot** by:
-### By RSS
+1. **Selecting** the subset of the **snapshot** that contains the **terms** (instead of navigation menus, footers, cookies banners…).
+2. **Removing** residual content in this subset that is not part of the **terms** (ads, illustrative pictures, internal navigation links…).
+3. **Filtering noise** by preventing parts that change frequently from triggering false positives for **changes** (tracker identifiers in links, relative dates…). The **engine** can execute custom **filters** written in JavaScript to that end.
-You can receive notification for a specific service or document by subscribing to RSS feeds.
+After these steps, if **changes** are spotted in the resulting **document**, a new **version** is **recorded**.
-> An RSS feed is a type of web page that contains information about the latest content published by a website, such as the date of publication and the address where you can view it. When this resource is updated, a feed reader app automatically notifies you and you can see the update.
+Preserving **snapshots** enables recovering after the fact information potentially lost in the **extraction** step: if **declarations** were wrong, they can be **maintained** and corrected **versions** can be **extracted** from the original **snapshots**.
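The noise filtering step described above can be illustrated with a minimal sketch. Every name in it (`TRACKING_PARAMS`, `stripTrackingParams`, `filterNoise`, `hasChanged`) is hypothetical and chosen for illustration; this is not the engine's filter API, and it assumes the links found in the content are well-formed absolute URLs.

```javascript
// Hypothetical names throughout: this is NOT the engine's filter API.
// Query parameters assumed to carry tracker identifiers only.
const TRACKING_PARAMS = [ 'utm_source', 'utm_medium', 'utm_campaign' ];

// Remove tracking parameters from a single absolute URL.
function stripTrackingParams(href) {
  const url = new URL(href);

  TRACKING_PARAMS.forEach(param => url.searchParams.delete(param));

  return url.toString();
}

// Apply the clean-up to every absolute link found in the extracted content.
function filterNoise(content) {
  return content.replace(/https?:\/\/[^\s)]+/g, match => stripTrackingParams(match));
}

// A new version is only worth recording when the filtered contents differ.
function hasChanged(previousVersion, extractedContent) {
  return filterNoise(previousVersion) !== filterNoise(extractedContent);
}
```

With such a filter, two extractions that differ only by a tracker identifier in a link compare as equal, so no spurious version would be recorded.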
-To find out the address of the RSS feed you want to subscribe to:
+### Importing documents
-1. [Navigate](#exploring-the-versions-history) to the page with the history of changes you are interested in. _In the WhatsApp example above, this would be [this page](https://github.com/OpenTermsArchive/contrib-versions/commits/main/WhatsApp/Privacy%20Policy.md)._
-2. Copy the address of that page from your browser’s address bar. _In the WhatsApp example, this would be `https://github.com/OpenTermsArchive/contrib-versions/commits/main/WhatsApp/Privacy%20Policy.md`._
-3. Append `.atom` at the end of this address. _In the WhatsApp example, this would become `https://github.com/OpenTermsArchive/contrib-versions/commits/main/WhatsApp/Privacy%20Policy.md.atom`._
-4. Subscribe your RSS feed reader to the resulting address.
+Existing **fonds** can be prepared for easier analysis by unifying their format to the **Open Terms Archive dataset format**. This unique format enables building interoperable tools, fostering collaboration across reusers.
+Such a dataset can be generated from **versions** alone. If **snapshots** and **declarations** can be retrieved from the **fonds** too, then a full-fledged **collection** can be created.
-#### Recap of available RSS feeds
+## How to use the engine
-| Updated for                         | URL                                                                                                                                                                                            |
-| ----------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| all services and documents          | `https://github.com/OpenTermsArchive/contrib-versions/commits.atom`                                                                                                                            |
-| all the documents of a service      | Replace `$serviceId` with the service ID:<br>`https://github.com/OpenTermsArchive/contrib-versions/commits/main/$serviceId.atom`                                                             |
-| a specific document of a service | Replace `$serviceId` with the service ID and `$documentType` with the document type:<br>`https://github.com/OpenTermsArchive/contrib-versions/commits/main/$serviceId/$documentType.md.atom` |
+This documentation describes how to execute the **engine** independently from any specific **instance**. For other use cases, other parts of the documentation could be more relevant:
-For example:
+- to contribute **declarations** to an existing **instance**, see [how to contribute documents](./docs/doc-contributing-documents.md);
+- to create a new **collection**, see the [collection bootstrap](https://github.com/OpenTermsArchive/template-declarations) script;
+- to create a new public **instance**, see the [governance](./docs/doc-governance.md) documentation.
-- To receive all updates of `Facebook` documents, the URL is `https://github.com/OpenTermsArchive/contrib-versions/commits/main/Facebook.atom`.
-- To receive all updates of the `Privacy Policy` from `Google`, the URL is `https://github.com/OpenTermsArchive/contrib-versions/commits/main/Google/Privacy%20Policy.md.atom`.
+### Requirements
-## Importing as a module
+This module is tested to work across operating systems (continuous testing on UNIX, macOS and Windows).
+A [Node.js](https://nodejs.org/en/download/) runtime is required to execute this engine.
+![Supported Node.js version can be found in the package.json file](https://img.shields.io/node/v/@opentermsarchive/engine?color=informational&label=Supported%20Node.js%20version)
+### Getting started
+This engine is published as a [module on NPM](https://npmjs.com/package/@opentermsarchive/engine). The recommended install is as a dependency in a `package.json` file, next to a folder containing [declaration files](#declarations).
+```sh
+npm install --save @opentermsarchive/engine
+mkdir declarations
+```
-Open Terms Archive exposes a JavaScript API to make some of its capabilities available in NodeJS. You can install it as an NPM module:
+In an editor, create the following declaration file in `declarations/Open Terms Archive.json` to track the terms of the Open Terms Archive website:
+```json
+{
+  "name": "Open Terms Archive",
+  "documents": {
+    "Privacy Policy": {
+      "fetch": "https://opentermsarchive.org/en/privacy-policy",
+      "select": ".TextContent_textContent__ToW2S"
+    }
+  }
+}
 ```
-npm install "ambanum/OpenTermsArchive#main"
+In the terminal:
+```sh
+npx ota-track
 ```
+The tracked documents can be found in the `data` folder.
+This quick example is aimed at letting you try the engine. Most likely, you will simply `npm install` from an existing collection, or create a new collection from the [collection template](https://github.com/OpenTermsArchive/template-declarations).
 ### CLI
-The following commands are available where the package is installed:
+Once the engine module is installed as a dependency within another module, the following commands are available.
-- `./node_modules/.bin/ota-lint-declarations`: check and normalise the format of declarations.
-- `./node_modules/.bin/ota-validate-declarations`: validate declarations.
-- `./node_modules/.bin/ota-track`: track services. Recorded snapshots and versions will be stored in the `data` folder at the root of the module where the package is installed.
+In these commands:
-In order to have them available globally in your command line, install it with the `--global` option.
+- **`<service_id>`** is the case sensitive name of the service declaration file without the extension. For example, for `Twitter.json`, the service ID is `Twitter`.
+- **`<terms_type>`** is the property name used under the `documents` property in the declaration to declare terms. For example, in the getting started declaration, the terms type declared is `Privacy Policy`.
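As a rough illustration of how these two identifiers relate to a declaration file, the sketch below derives them from the getting started declaration. `describeDeclaration` is a hypothetical helper written for this example; the engine's own declaration loader may work differently.

```javascript
// Hypothetical helper: the engine's own declaration loader may work differently.
function describeDeclaration(fileName, declaration) {
  // The service ID is the case-sensitive file name without its extension.
  const serviceId = fileName.replace(/\.json$/, '');
  // Terms types are the property names under `documents`.
  const termsTypes = Object.keys(declaration.documents);

  return { serviceId, termsTypes };
}

const gettingStartedDeclaration = {
  name: 'Open Terms Archive',
  documents: {
    'Privacy Policy': {
      fetch: 'https://opentermsarchive.org/en/privacy-policy',
      select: '.TextContent_textContent__ToW2S',
    },
  },
};

const described = describeDeclaration('Open Terms Archive.json', gettingStartedDeclaration);

// Build the matching command line from the derived identifiers.
console.log(`npx ota-track --services "${described.serviceId}" --documentTypes "${described.termsTypes[0]}"`);
```

Quoting both values matters here, since service IDs and terms types may contain spaces.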
-### Features exposed
+#### `ota-track`
-#### fetch
+```sh
+npx ota-track
+```
-The `fetch` module gets the MIME type and content of a document from its URL.
+[Track](#tracking-documents) the current terms of services according to provided declarations.
-You can use it in your code by using `import fetch from 'open-terms-archive/fetch';`.
+The declarations, snapshots and versions paths are defined in the [configuration](#configuring).
-Documentation on how to use `fetch` is provided as JSDoc within [./src/archivist/fetcher/index.js](./src/archivist/fetcher/index.js).
+> Note that the snapshots and versions will be recorded at the moment the command is executed, on top of the existing local history. If a shared history already exists and the goal is to add on top of it, that history has to be downloaded before executing that command.
-If you plan to use `executeClientScripts` as a parameter of `fetch`, the fetching will be done using a headless browser.
-In order to not instantiate this browser at each fetch, the starting and stopping of the browser is your responsibility.
+##### Recap of available options
-Here is an example on how to use it:
+```sh
+npx ota-track --help
+```
-```js
-import fetch, { launchHeadlessBrowser, stopHeadlessBrowser } from 'open-terms-archive/fetch';
+##### Track terms of specific services
-await launchHeadlessBrowser();
-await fetch({ executeClientScripts: true, ... });
-await fetch({ executeClientScripts: true, ... });
-await fetch({ executeClientScripts: true, ... });
-await stopHeadlessBrowser();
+```sh
+npx ota-track --services "<service_id>" ["<service_id>"...]
 ```
-The `fetch` module can also be configured as a [`node-config` submodule](https://github.com/node-config/node-config/wiki/Sub-Module-Configuration).
-If [`node-config`](https://github.com/node-config/node-config) is used in the project, the default `fetcher` configuration can be overridden by adding a `fetcher` object to the local config. See [Configuration file](#configuration-file) for full reference.
+##### Track specific terms of specific services
-#### filter
+```sh
+npx ota-track --services "<service_id>" ["<service_id>"...] --documentTypes "<terms_type>" ["<terms_type>"...]
+```
-The `filter` module transforms HTML or PDF content into a Markdown string.
-It will filter content based on the [document declaration](https://github.com/OpenTermsArchive/contrib-declarations/blob/main/CONTRIBUTING.md#declaring-a-new-service).
+##### Track documents four times a day
-You can use the filter in your code by using `import filter from 'open-terms-archive/filter';`.
+```sh
+npx ota-track --schedule
+```
-The `filter` function documentation is available as JSDoc within [./src/archivist/filter/index.js](./src/archivist/filter/index.js).
+#### `ota-validate-declarations`
-#### page-declaration
+```sh
+npx ota-validate-declarations [--services <service_id>...]
+```
-PageDeclaration object is used to describe a page to be tracked by Open Terms Archive.
+Check that all declarations allow recording a snapshot and a version properly.
-You can use the page-declaration in your code by using `import pageDeclaration from 'open-terms-archive/page-declaration';`.
+If one or several `<service_id>` are provided, check only those services.
-## Using locally
+##### Validate schema only
-### Installing
+```sh
+npx ota-validate-declarations --schema-only [--services <service_id>...]
+```
-This module is built with [Node](https://nodejs.org/en/) and is tested on macOS, UNIX and Windows. You will need to [install Node >= v16.x](https://nodejs.org/en/download/) to run it.
+Check that all declarations are readable by the engine.
-#### Declarations repository
+Allows for a much faster check of declarations, but does not check that the documents are actually accessible.
-1. Locally clone your declarations repository, e.g., `git@github.com:OpenTermsArchive/contrib-declarations.git`.
-2. Go into your folder and initialize it, e.g., `cd contrib-declarations; npm install`.
-3. You can now modify your declarations in the `./declarations/` folder, following [these instructions](https://github.com/OpenTermsArchive/contrib-declarations/blob/main/CONTRIBUTING.md).
-4. When you want to test:
-    - If you want to test every declaration, run `npm test`.
-    - If you want to test a specific declaration, run `npm test $serviceId`, e.g., `npm test HER`.
-    - If you want to have faster feedback on the structure of a specific declaration, run `npm run test:schema $serviceId`, e.g., `npm run test:schema HER`.
-5. Once you have done that, if you have any error, it will be prompted and detailed at the end of the test.
-    - E.g., `InaccessibleContentError`: Your selector is wrong and should be fixed.
-    - E.g., `TypeError`: The file declaration is invalid.
-    - E.g., if you have a weird error, you may want to contact OTA, as it may be a bug.
+If one or several `<service_id>` are provided, check only those services.
-##### Note: Testing
+#### `ota-lint-declarations`
-Testing works with multiple tests (e.g., checking the validity of the file, that the URL is correct and reachable, that the content is correctly gathered, etc.); as it may take a bit of time, that's why you may want to use `npm run test:schema`.
+```sh
+npx ota-lint-declarations [--services <service_id>...]
+```
-#### Core
+Normalise the format of declarations.
-When referring to the base folder, it means the folder where you will be `git pull`ing everything.
+Automatically correct formatting mistakes and ensure that all declarations are standardised.
-1. If not done already, follow the previous part with the repo of your choice.
-2. In the base folder of the previous step (i.e., not _in_ the previous folder, but _where the previous folder is_), clone the core engine: `git clone git@github.com:ambanum/OpenTermsArchive.git`.
-3. Go into the cloned folder and install dependencies: `cd contrib-declarations; npm install`.
-4. If you are using the main repo, you are done, go to step 6.
-5. If you are using a special repo instance (e.g., `dating-declarations`), create a new [config file](#configuring), `config/development.json`, and add:
-    ```json
-    {
+If one or several `<service_id>` are provided, check only those services.
-      "services": {
-        "declarationsPath": "../<name of the repo>/declarations"
-      }
-    }
-    ```
-    e.g.,
-    ```json
-    {
-      "services": {
-        "declarationsPath": "../dating-declarations/declarations"
-      }
-    }
-    ```
-6. In the folder of the repo (i.e., `OpenTermsArchive`), use `npm start`.
-    - It will first do a refiltering to check whenever everything works properly.
-    - You will then start to see everything being downloaded under `data/`.
-    - More details in [Running](#running).
+### API
-##### Notes: Tips
+Once added as a dependency, the engine exposes a JavaScript API that can be called in your own code. The following modules are available.
-- You may want to regularly `git pull` to have the latest updates, both in the core engine and in the declarations repos.
-- You have to `npm install` in the declarations repo at least once, and at least once each time `package.json` changes.
-- Be careful, it doesn't download the history! If you want that, you need to git clone `snapshots` and `versions` in `data/`.
+#### `fetch`
-You can clone as many declarations repositories as you want. The one that will be loaded at execution will be defined through configuration.
+The `fetch` module gets the MIME type and content of a document from its URL.
-### Configuring
+```js
+import fetch from '@opentermsarchive/engine/fetch';
+```
+Documentation on how to use `fetch` is provided [as JSDoc](./src/archivist/fetcher/index.js).
+##### Headless browser management
+If you pass the `executeClientScripts` option to `fetch`, a headless browser will be used to download and execute the page before serialising its DOM. For performance reasons, starting and stopping the browser is your responsibility, which avoids instantiating a browser on each fetch. Here is an example of how to use this feature:
+```js
+import fetch, { launchHeadlessBrowser, stopHeadlessBrowser } from '@opentermsarchive/engine/fetch';
-#### Configuration file
+await launchHeadlessBrowser();
+await fetch({ executeClientScripts: true, ... });
+await fetch({ executeClientScripts: true, ... });
+await fetch({ executeClientScripts: true, ... });
+await stopHeadlessBrowser();
+```
+The `fetch` module options are defined as a [`node-config` submodule](https://github.com/node-config/node-config/wiki/Sub-Module-Configuration). The default `fetcher` configuration can be overridden by adding a `fetcher` object to the [local configuration file](#configuration-file).
+#### `filter`
+The `filter` module transforms HTML or PDF content into a Markdown string according to a [declaration](#declarations).
+```js
+import filter from '@opentermsarchive/engine/filter';
+```
+The `filter` function documentation is available [as JSDoc](./src/archivist/filter/index.js).
+#### `PageDeclaration`
+The `PageDeclaration` class encapsulates information about a page tracked by Open Terms Archive.
+```js
+import pageDeclaration from '@opentermsarchive/engine/page-declaration';
+```
+The `PageDeclaration` format is defined [in source code](./src/archivist/services/pageDeclaration.js).
+### Dataset generation
+See the [`dataset` script documentation](./scripts/dataset/README.md).
+## Configuring
+### Configuration file
 The default configuration can be found in `config/default.json`. The full reference is given below. You are unlikely to want to edit all of these elements.
@@ -276,7 +320,7 @@ The default configuration can be found in `config/default.json`. The full refere
       "host": "SMTP server hostname",
       "username": "User for server authentication" // Password for server authentication is defined in environment variables, see the “Environment variables” section below
     },
-    "sendMailOnError": { // Can be set to `false` if you do not want to send email on error
+    "sendMailOnError": { // Can be set to `false` if sending email on error is not needed
       "to": "The address to send the email to in case of an error",
       "from": "The address from which to send the email",
       "sendWarnings": "Boolean. Set to true to also send email in case of warning",
@@ -299,15 +343,15 @@ The default configuration can be found in `config/default.json`. The full refere
 }
 ```
-The default configuration is merged with (and overridden by) environment-specific configuration that can be specified at startup with the `NODE_ENV` environment variable. For example, you would run `NODE_ENV=development npm start` to load the `development.json` configuration file.
+The default configuration is merged with (and overridden by) environment-specific configuration that can be specified at startup with the `NODE_ENV` environment variable. For example, running `NODE_ENV=vagrant npm start` will load the `vagrant.json` configuration file. See [node-config](https://github.com/node-config/node-config) for more information about configuration files.
-If you want to change your local configuration, we suggest you create a `config/development.json` file with overridden values. Example production configuration files can be found in the `config` folder.
+To have a local configuration that overrides all existing configuration, it is recommended to create a `config/development.json` file with the overridden values.
-##### Storage repositories
+#### Storage repositories
 Two storage repositories are currently supported: Git and MongoDB. Each one can be used independently for versions and snapshots.
-###### Git
+##### Git
 ```json
 {
@@ -326,11 +370,11 @@ Two storage repositories are currently supported: Git and MongoDB. Each one can
   …
 }
 ```
-###### MongoDB
+##### MongoDB
 ```json
 {
-    …
+  …
   "storage": {
     "mongo": {
       "connectionURI": "URI for defining connection to the MongoDB instance. See https://docs.mongodb.com/manual/reference/connection-string/",
@@ -342,7 +386,7 @@ Two storage repositories are currently supported: Git and MongoDB. Each one can
 }
 ```
-#### Environment variables
+### Environment variables
 Environment variables can be passed in the command-line or provided in a `.env` file at the root of the repository. See `.env.example` for an example of such a file.
@@ -350,89 +394,53 @@ Environment variables can be passed in the command-line or provided in a `.env`
 - `SENDINBLUE_API_KEY`: a SendInBlue API key, in order to send email notifications with that service.
 - `GITHUB_TOKEN`: a token with repository privileges to access the [GitHub API](https://github.com/settings/tokens).
-If your infrastructure requires using an outgoing HTTP/HTTPS proxy to access the Internet, you can provide it through the `HTTP_PROXY` and `HTTPS_PROXY` environment variable.
-### Running
-To get the latest versions of all documents:
-```
-npm start
-```
-The latest version of a document will be available in the versions path defined in your configuration, under `$versions_folder/$service_provider_name/$document_type.md`.
-To update documents automatically:
-```
-npm run start:scheduler
-```
-To get the latest version of a specific service's terms:
-```
-npm start -- --services <service_id>
-```
-> The service ID is the case sensitive name of the service declaration file without the extension. For example, for `Twitter.json`, the service ID is `Twitter`.
-To get the latest version of a specific service's terms and document type:
-```
-npm start -- --services <service_id> --documentTypes <document_type>
-```
-To display help:
-```
-npm start -- --help
-```
+If an outgoing HTTP/HTTPS proxy is required to access the Internet, it can be provided through the `HTTP_PROXY` and `HTTPS_PROXY` environment variables.
 ## Deploying
-See [Ops Readme](ops/README.md).
+Deployment is managed with [Ansible](https://www.ansible.com). See the [Open Terms Archive deployment Ansible collection](https://github.com/OpenTermsArchive/ota.deployment-ansible-collection).
-## Publishing
+## Contributing
-To generate a dataset:
+### Getting a copy
-```
-npm run dataset:generate
-```
+In order to edit the code of the engine itself, an editable and executable copy is necessary.
-To release a dataset:
+First of all, follow the [requirements](#requirements) above. Then, clone the repository:
-```
-npm run dataset:release
+```sh
+git clone https://github.com/ambanum/OpenTermsArchive.git
+cd OpenTermsArchive
 ```
-To weekly release a dataset:
+Install dependencies:
+```sh
+npm install
 ```
-npm run dataset:scheduler
-```
-## Contributing
-Thanks for wanting to contribute! There are different ways to contribute to Open Terms Archive. We describe the most common below. If you want to explore other venues for contributing, please contact us over email (contact@[our domain name]) or [Twitter](https://twitter.com/OpenTerms).
+### Testing
-### Adding a new service or updating an existing service
+If changes are made to the engine, check that all parts covered by tests still work properly:
-See the [CONTRIBUTING](https://github.com/OpenTermsArchive/contrib-declarations/blob/main/CONTRIBUTING.md) of repository [`OpenTermsArchive/contrib-declarations`](https://github.com/OpenTermsArchive/contrib-declarations). You will need knowledge of JSON and web DOM.
+```sh
+npm test
+```
-### Core engine
+If existing features are changed or new ones are added, relevant tests must be added too.
-To contribute to the core engine of Open Terms Archive, see the [CONTRIBUTING](CONTRIBUTING.md) file of this repository. You will need knowledge of JavaScript and NodeJS.
+### Suggesting changes
-### Funding and partnerships
+To contribute to the core engine of Open Terms Archive, see the [CONTRIBUTING](CONTRIBUTING.md) file of this repository. You will need knowledge of JavaScript and Node.js.
-Beyond individual contributions, we need funds and committed partners to pay for a core team to maintain and grow Open Terms Archive. If you know of opportunities, please let us know! You can find [on our website](https://opentermsarchive.org/en/about) an up-to-date list of the partners and funders that make Open Terms Archive possible.
+### Sponsorship and partnerships
+Beyond individual contributions, we need funds and committed partners to pay for a core team to maintain and grow Open Terms Archive. If you know of opportunities, please let us know over email at `contact@[project name without spaces].org`!
----
+- - -
 ## License
-The code for this software is distributed under the European Union Public Licence (EUPL) v1.2.
-Contact the author if you have any specific need or question regarding licensing.
+The code for this software is distributed under the [European Union Public Licence (EUPL) v1.2](https://joinup.ec.europa.eu/collection/eupl/eupl-text-eupl-12). In short, this [means](https://choosealicense.com/licenses/eupl-1.2/) you are allowed to read, use, modify and redistribute this source code, as long as you credit “Open Terms Archive Contributors” and make available any change you make to it under similar conditions.
+Contact the core team over email at `contact@[project name without spaces].org` if you have any specific need or question regarding licensing.
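
A rough sketch of the override behaviour described in the configuration section of the README above: values from an environment-specific file replace the defaults key by key. This is not node-config's actual implementation. `mergeConfig`, `isPlainObject`, the `smtp` key and all sample values are hypothetical and only illustrate the idea.

```javascript
// Hypothetical helper: true for plain object values, false for arrays,
// null and primitives.
function isPlainObject(value) {
  return value !== null && typeof value === 'object' && !Array.isArray(value);
}

// Hypothetical sketch of "default merged with, and overridden by,
// environment-specific configuration". The real node-config logic is
// more involved.
function mergeConfig(base, override) {
  const result = { ...base };

  for (const [key, value] of Object.entries(override)) {
    const bothObjects = isPlainObject(value) && isPlainObject(result[key]);

    // Nested objects are merged recursively; anything else is replaced.
    result[key] = bothObjects ? mergeConfig(result[key], value) : value;
  }

  return result;
}

// Illustrative values only, not taken from the real config files.
const defaults = {
  smtp: { host: 'smtp.example.org' },
  sendMailOnError: { sendWarnings: true },
};
const development = {
  smtp: { username: 'dev' },
  sendMailOnError: false, // disables email on error, as the README allows
};

const merged = mergeConfig(defaults, development);
console.log(merged.sendMailOnError); // false
```

Under these assumptions, keys that appear only in the defaults (here `smtp.host`) survive the merge, while `sendMailOnError` is replaced wholesale by `false`.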

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@opentermsarchive/engine",
-  "version": "0.17.0",
+  "version": "0.17.1",
   "description": "Tracks and makes visible changes to the terms of online services",
   "homepage": "https://github.com/ambanum/OpenTermsArchive#readme",
   "bugs": {

package/scripts/dataset/README.md CHANGED

@@ -4,7 +4,7 @@ Export the versions dataset into a ZIP file and publish it to GitHub releases.
 ## Configuring
-You can change the configuration in the appropriate config file in the `config` folder. See the [main README](https://github.com/ambanum/OpenTermsArchive#configuring) for documentation on using the configuration file.
+You can change the configuration in the appropriate config file in the `config` folder. See the [main README](../../README.md#configuring) for documentation on using the configuration file.
 ## Running
@@ -34,4 +34,4 @@ node scripts/dataset/main.js --schedule --publish --remove-local-copy
 ## Adding renaming rules
-See the [renamer module documentation](../renamer/README.md).
+See the [renamer module documentation](../utils/renamer/README.md).

package/scripts/dataset/assets/README.template.js CHANGED

@@ -31,27 +31,27 @@ It has been generated with [Open Terms Archive](https://opentermsarchive.org).
 ### Dataset format
-This dataset represents each version of a document as a separate [Markdown](https://spec.commonmark.org/0.30/) file, nested in a directory with the name of the service provider and in a directory with the name of the document type. The filesystem layout will look like below.
+This dataset represents each version of a document as a separate [Markdown](https://spec.commonmark.org/0.30/) file, nested in a directory with the name of the service provider and in a directory with the name of the terms type. The filesystem layout will look like below.
 \`\`\`
 ├ README.md
 ├┬ Service provider 1 (e.g. Facebook)
-│├┬ Document type 1 (e.g. Terms of Service)
+│├┬ Terms type 1 (e.g. Terms of Service)
 ││├ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-08-01T01-03-12Z.md)
 ┆┆┆
 ││└ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-10-03T08-12-25Z.md)
 ┆┆
-│└┬ Document type X (e.g. Privacy Policy)
+│└┬ Terms type X (e.g. Privacy Policy)
 │ ├ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-05-02T03-02-15Z.md)
 ┆ ┆
 │ └ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-11-14T12-36-45Z.md)
 ┆
 └┬ Service provider Y (e.g. Google)
- ├┬ Document type 1 (e.g. Developer Terms)
+ ├┬ Terms type 1 (e.g. Developer Terms)
  │├ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2019-03-12T04-18-22Z.md)
  ┆┆
  │└ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-12-04T22-47-05Z.md)
- └┬ Document type Z (e.g. Privacy Policy)
+ └┬ Terms type Z (e.g. Privacy Policy)
   ┆
   ├ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-05-02T03-02-15Z.md)
   ┆

package/scripts/dataset/export/test/fixtures/dataset/README.md CHANGED

@@ -8,27 +8,27 @@ It has been generated with [Open Terms Archive](https://opentermsarchive.org).
 ### Dataset format
-This dataset represents each version of a document as a separate [Markdown](https://spec.commonmark.org/0.30/) file, nested in a directory with the name of the service provider and in a directory with the name of the document type. The filesystem layout will look like below.
+This dataset represents each version of a document as a separate [Markdown](https://spec.commonmark.org/0.30/) file, nested in a directory with the name of the service provider and in a directory with the name of the terms type. The filesystem layout will look like below.
 ```
 ├ README.md
 ├┬ Service provider 1 (e.g. Facebook)
-│├┬ Document type 1 (e.g. Terms of Service)
+│├┬ Terms type 1 (e.g. Terms of Service)
 ││├ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-08-01T01-03-12Z.md)
 ┆┆┆
 ││└ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-10-03T08-12-25Z.md)
 ┆┆
-│└┬ Document type X (e.g. Privacy Policy)
+│└┬ Terms type X (e.g. Privacy Policy)
 │ ├ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-05-02T03-02-15Z.md)
 ┆ ┆
 │ └ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-11-14T12-36-45Z.md)
 ┆
 └┬ Service provider Y (e.g. Google)
- ├┬ Document type 1 (e.g. Developer Terms)
+ ├┬ Terms type 1 (e.g. Developer Terms)
  │├ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2019-03-12T04-18-22Z.md)
  ┆┆
  │└ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-12-04T22-47-05Z.md)
- └┬ Document type Z (e.g. Privacy Policy)
+ └┬ Terms type Z (e.g. Privacy Policy)
   ┆
   ├ YYYY-DD-MMTHH-MM-SSZ.md (e.g. 2021-05-02T03-02-15Z.md)
   ┆

package/scripts/import/README.md CHANGED

@@ -54,6 +54,6 @@ NODE_ENV=import node scripts/import/index.js
 The script will:
 - Ignore commits which are not a document snapshot (like renaming or documentation commits).
-- Rename document types according to declared rules. See the [renamer module documentation](../renamer/README.md).
+- Rename terms types according to declared rules. See the [renamer module documentation](../renamer/README.md).
 - Rename services according to declared rules. See the [renamer module documentation](../renamer/README.md).
 - Handle duplicates, so you can run it twice without worrying about duplicate entries in the database.

package/scripts/rewrite/README.md CHANGED

@@ -2,7 +2,7 @@ __:warning: These scripts are no longer up-to-date with the codebase and are not
 # Rewrite history
-As some document types or service names can change over time or as we need to import history from other tools, provided they have an history with the same structure as Open Terms Archive, we need a way to rewrite, reorder and apply changes to the snapshots or versions history.
+As some terms types or service names can change over time or as we need to import history from other tools, provided they have an history with the same structure as Open Terms Archive, we need a way to rewrite, reorder and apply changes to the snapshots or versions history.
 The script works by reading commits from a **source** repository, applying changes and then committing the result in another, empty or not, **target** repository. So a source repository with commits is required.
@@ -125,7 +125,7 @@ Currently, the script will:
 - Ignore commits which are not a document snapshot (like renaming or documentation commits)
 - Reorder commits according to their author date
-- Rename document types according to declared rules
+- Rename terms types according to declared rules
 - Rename services according to declared rules
 - Skip commits with empty content
 - Skip commits which do not change the document

package/scripts/rewrite/rewrite-versions.js CHANGED

@@ -101,7 +101,7 @@ let recorder;
     );
     if (!documentDeclaration) {
-      console.log(`⌙ Skip unknown document type "${documentType}" for service "${serviceId}"`);
+      console.log(`⌙ Skip unknown terms type "${documentType}" for service "${serviceId}"`);
       continue;
     }

package/scripts/utils/renamer/README.md CHANGED

@@ -1,6 +1,6 @@
 # Renamer
-This module is used to apply renaming rules to service IDs and document types.
+This module is used to apply renaming rules to service IDs and terms types.
 ## Usage
@@ -24,9 +24,9 @@ To rename a service, add a rule in `./rules/services.json`, for example, to rena
 }
 ```
-### Document type
+### Terms type
-To rename a document type, add a rule in `./rules/documentTypes.json`, for example, to rename "Program Policies" to "Acceptable Use Policy", add the following line in the file:
+To rename a terms type, add a rule in `./rules/documentTypes.json`, for example, to rename "Program Policies" to "Acceptable Use Policy", add the following line in the file:
 ```json
 {
@@ -35,9 +35,9 @@ To rename a document type, add a rule in `./rules/documentTypes.json`, for examp
 }
 ```
-### Document type for a specific service
+### Terms type for a specific service
-To rename a document type only for a specific service, add a rule in `./rules/servicesDocumentTypes.json`, for example, to rename "Program Policies" to "Acceptable Use Policy" only for Skype, add the following line in the file:
+To rename a terms type only for a specific service, add a rule in `./rules/servicesDocumentTypes.json`, for example, to rename "Program Policies" to "Acceptable Use Policy" only for Skype, add the following line in the file:
 ```json
 {

package/scripts/utils/renamer/index.js CHANGED

@@ -26,7 +26,7 @@ export function applyRules(serviceId, documentType) {
   const renamedDocumentType = renamingRules.documentTypes[documentType];
   if (renamedDocumentType) {
-    console.log(`⌙ Rename document type "${documentType}" to "${renamedDocumentType}" of "${serviceId}" service`);
+    console.log(`⌙ Rename terms type "${documentType}" to "${renamedDocumentType}" of "${serviceId}" service`);
     documentType = renamedDocumentType;
   }
@@ -34,7 +34,7 @@ export function applyRules(serviceId, documentType) {
     && renamingRules.documentTypesByService[serviceId][documentType];
   if (renamedServiceDocumentType) {
-    console.log(`⌙ Specific rename document type "${documentType}" to "${renamedServiceDocumentType}" of "${serviceId}" service`);
+    console.log(`⌙ Specific rename terms type "${documentType}" to "${renamedServiceDocumentType}" of "${serviceId}" service`);
     documentType = renamedServiceDocumentType;
   }
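
The renaming flow visible in the hunks above can be sketched roughly as follows. Only the `documentTypes` and `documentTypesByService` lookups appear in the diff. The `services` lookup, the order of the steps, the return shape, the function name `applyRulesSketch` and every rule except "Program Policies" to "Acceptable Use Policy" are assumptions made for illustration.

```javascript
// Hypothetical rule sets standing in for the module's JSON rule files.
const sampleRules = {
  services: { 'Old Example Service': 'Example Service' },
  documentTypes: { 'Program Policies': 'Acceptable Use Policy' },
  documentTypesByService: {
    Skype: { 'Example Policy': 'Community Guidelines' },
  },
};

// Sketch under stated assumptions: rename the service first, then apply
// the generic terms type rule, then the service-specific one.
function applyRulesSketch(rules, serviceId, documentType) {
  const renamedServiceId = rules.services[serviceId];

  if (renamedServiceId) {
    serviceId = renamedServiceId;
  }

  const renamedDocumentType = rules.documentTypes[documentType];

  if (renamedDocumentType) {
    documentType = renamedDocumentType;
  }

  const renamedServiceDocumentType = rules.documentTypesByService[serviceId]
    && rules.documentTypesByService[serviceId][documentType];

  if (renamedServiceDocumentType) {
    documentType = renamedServiceDocumentType;
  }

  return { serviceId, documentType };
}

const generic = applyRulesSketch(sampleRules, 'Skype', 'Program Policies');
console.log(generic.documentType); // "Acceptable Use Policy"

const specific = applyRulesSketch(sampleRules, 'Skype', 'Example Policy');
console.log(specific.documentType); // "Community Guidelines"
```

In this sketch, a service-specific rule is looked up with the already-renamed terms type, so a generic rule and a service-specific rule can chain.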

package/src/archivist/recorder/index.js CHANGED

@@ -27,7 +27,7 @@ export default class Recorder {
     }
     if (!documentType) {
-      throw new Error('A document type is required');
+      throw new Error('A terms type is required');
     }
     if (!fetchDate) {
@@ -51,7 +51,7 @@ export default class Recorder {
     }
     if (!documentType) {
-      throw new Error('A document type is required');
+      throw new Error('A terms type is required');
     }
     if (!snapshotIds?.length) {
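
The guard clauses touched above follow a simple pattern: each required parameter is checked, and a missing one raises an error naming it. A minimal sketch of that pattern, assuming a table-driven helper, is shown here. `validateRecordParams` and `REQUIRED_PARAMS` are hypothetical names, and only the message `A terms type is required` is confirmed by the diff; the other messages are assumed to follow the same wording.

```javascript
// Hypothetical mapping of parameter names to the labels used in messages.
const REQUIRED_PARAMS = {
  serviceId: 'service ID',
  documentType: 'terms type',
  fetchDate: 'fetch date',
};

// Throws for the first required parameter that is missing or, for arrays,
// empty. The real Recorder uses individual `if` statements instead.
function validateRecordParams(params, requiredParams) {
  for (const [name, label] of Object.entries(requiredParams)) {
    const value = params[name];
    const missing = Array.isArray(value) ? value.length === 0 : !value;

    if (missing) {
      throw new Error(`A ${label} is required`);
    }
  }
}

let message = null;

try {
  validateRecordParams({ serviceId: 'Example Service', fetchDate: new Date() }, REQUIRED_PARAMS);
} catch (error) {
  message = error.message;
}

console.log(message); // "A terms type is required"
```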

package/src/archivist/recorder/index.test.js CHANGED

@@ -49,7 +49,7 @@ describe('Recorder', () => {
           const paramsNameToExpectedTextInError = {
             serviceId: 'service ID',
-            documentType: 'document type',
+            documentType: 'terms type',
             fetchDate: 'fetch date',
             content: 'content',
             mimeType: 'mime type',
@@ -190,7 +190,7 @@ describe('Recorder', () => {
           const paramsNameToExpectedTextInError = {
             serviceId: 'service ID',
-            documentType: 'document type',
+            documentType: 'terms type',
             snapshotIds: 'snapshot ID',
             fetchDate: 'fetch date',
             content: 'content',
@@ -335,7 +335,7 @@ describe('Recorder', () => {
           const paramsNameToExpectedTextInError = {
             serviceId: 'service ID',
-            documentType: 'document type',
+            documentType: 'terms type',
             snapshotIds: 'snapshot ID',
             fetchDate: 'fetch date',
             content: 'content',

package/src/archivist/recorder/repositories/git/dataMapper.js CHANGED

@@ -77,7 +77,7 @@ function generateFileName(documentType, pageId, extension) {
 }
 export function generateFilePath(serviceId, documentType, pageId, mimeType) {
-  const extension = mime.getExtension(mimeType) || '*'; // If mime type is undefined, an asterisk is set as an extension. Used to match all files for the given service ID, document type and page ID when mime type is unknown.
+  const extension = mime.getExtension(mimeType) || '*'; // If mime type is undefined, an asterisk is set as an extension. Used to match all files for the given service ID, terms type and page ID when mime type is unknown.
   return `${serviceId}/${generateFileName(documentType, pageId, extension)}`; // Do not use `path.join` as even for Windows, the path should be with `/` and not `\`. See https://github.com/ambanum/OpenTermsArchive/runs/8110230474?check_suite_focus=true#step:7:125
 }

package/src/archivist/recorder/repositories/git/index.test.js CHANGED

@@ -101,7 +101,7 @@ describe('GitRepository', () => {
         expect(commit.message).to.include(SERVICE_PROVIDER_ID);
       });
-      it('stores the document type', () => {
+      it('stores the terms type', () => {
         expect(commit.message).to.include(DOCUMENT_TYPE);
       });
@@ -314,7 +314,7 @@ describe('GitRepository', () => {
         expect(commit.message).to.include(SERVICE_PROVIDER_ID);
       });
-      it('stores the document type', () => {
+      it('stores the terms type', () => {
         expect(commit.message).to.include(DOCUMENT_TYPE);
       });
@@ -351,7 +351,7 @@ describe('GitRepository', () => {
         expect(commit.message).to.include(SERVICE_PROVIDER_ID);
       });
-      it('stores the document type', () => {
+      it('stores the terms type', () => {
         expect(commit.message).to.include(DOCUMENT_TYPE);
       });
@@ -394,7 +394,7 @@ describe('GitRepository', () => {
         expect(commit.message).to.include(SERVICE_PROVIDER_ID);
       });
-      it('stores the document type', () => {
+      it('stores the terms type', () => {
         expect(commit.message).to.include(DOCUMENT_TYPE);
       });
@@ -436,7 +436,7 @@ describe('GitRepository', () => {
       expect(record.serviceId).to.equal(SERVICE_PROVIDER_ID);
     });
-    it('returns the document type', () => {
+    it('returns the terms type', () => {
       expect(record.documentType).to.equal(DOCUMENT_TYPE);
     });

package/src/archivist/recorder/repositories/interface.js CHANGED

@@ -35,11 +35,11 @@ export default class RepositoryInterface {
   }
   /**
-  * Find the most recent record that matches the given service ID and document type and optionally the page ID
+  * Find the most recent record that matches the given service ID and terms type and optionally the page ID
   * In case of snapshots, if the record is related to a multipage document, the page ID is required to find the corresponding snapshot
   *
   * @param {string} serviceId - Service ID of record to find
-  * @param {string} documentType - Document type of record to find
+  * @param {string} documentType - Terms type of record to find
   * @param {string} [pageId] - Page ID of record to find. Used to differentiate pages of multipage document. Not necessary for single page document
   * @returns {Promise<Record>} Promise that will be resolved with the found record or an empty object if none match the given criteria
   */

package/src/archivist/recorder/repositories/mongo/index.test.js CHANGED

@@ -95,7 +95,7 @@ describe('MongoRepository', () => {
         expect(mongoDocument.serviceId).to.include(SERVICE_PROVIDER_ID);
       });
-      it('stores the document type', () => {
+      it('stores the terms type', () => {
         expect(mongoDocument.documentType).to.include(DOCUMENT_TYPE);
       });
@@ -349,7 +349,7 @@ describe('MongoRepository', () => {
         expect(mongoDocument.serviceId).to.include(SERVICE_PROVIDER_ID);
       });
-      it('stores the document type', () => {
+      it('stores the terms type', () => {
         expect(mongoDocument.documentType).to.include(DOCUMENT_TYPE);
       });
@@ -392,7 +392,7 @@ describe('MongoRepository', () => {
         expect(mongoDocument.serviceId).to.include(SERVICE_PROVIDER_ID);
       });
-      it('stores the document type', () => {
+      it('stores the terms type', () => {
         expect(mongoDocument.documentType).to.include(DOCUMENT_TYPE);
       });
@@ -434,7 +434,7 @@ describe('MongoRepository', () => {
       expect(record.serviceId).to.equal(SERVICE_PROVIDER_ID);
     });
-    it('returns the document type', () => {
+    it('returns the terms type', () => {
       expect(record.documentType).to.equal(DOCUMENT_TYPE);
     });

package/src/archivist/services/index.test.js CHANGED

@@ -51,7 +51,7 @@ describe('Services', () => {
                 expect(actualDocumentDeclaration.service.name).to.eql(expectedDocumentDeclaration.service.name);
               });
-              it('has the proper document type', () => {
+              it('has the proper terms type', () => {
                 expect(actualDocumentDeclaration.type).to.eql(expectedDocumentDeclaration.type);
               });
@@ -170,7 +170,7 @@ describe('Services', () => {
             expect(actualDocumentDeclaration.service.name).to.eql(expectedDocumentDeclaration.service.name);
           });
-          it('has the proper document type', () => {
+          it('has the proper terms type', () => {
             expect(actualDocumentDeclaration.type).to.eql(expectedDocumentDeclaration.type);
           });

package/src/archivist/services/service.test.js CHANGED

@@ -154,7 +154,7 @@ describe('Service', () => {
       subject.addDocumentDeclaration(privacyPolicyDeclaration);
     });
-    it('returns the service document types', async () => {
+    it('returns the service terms types', async () => {
       expect(subject.getDocumentTypes()).to.have.members([
         termsOfServiceDeclaration.type,
         privacyPolicyDeclaration.type,

package/src/main.js CHANGED

@@ -11,7 +11,7 @@ program
   .description(description)
   .version(version)
   .option('-s, --services [serviceId...]', 'service IDs of services to handle')
-  .option('-d, --documentTypes [documentType...]', 'document types to handle')
+  .option('-d, --documentTypes [documentType...]', 'terms types to handle')
   .option('-r, --refilter-only', 'only refilter exisiting snapshots with last declarations and engine\'s updates')
   .option('--schedule', 'schedule automatic document tracking');

package/README.fr.md DELETED

@@ -1,110 +0,0 @@
-<img src="https://disinfo.quaidorsay.fr/assets/img/logo.png" width="140">
-# Open Terms Archive
-Les services en ligne ont des conditions générales qui évoluent dans le temps. _Open Terms Archive_ permet aux défenseurs des droits des utilisateurs, aux régulateurs et à toute personne intéressée de suivre les évolutions de ces conditions générales en étant notifiée à chaque publication d'une nouvelle version, et en explorant leur historique.
-## Table des matières
-- [Fonctionnement](#fonctionnement)
-- [Naviguer dans l'historique des versions](#naviguer-dans-lhistorique-des-versions)
-  - [Remarques](#remarques)
-- [Recevoir des notifications](#recevoir-des-notifications)
-  - [Par courriel](#par-courriel)
-    - [Recevoir les mises à jour de services ou documents spécifiques](#recevoir-les-mises-%C3%A0-jour-de-services-ou-documents-sp%C3%A9cifiques)
-  - [Par RSS](#par-rss)
-    - [Récapitulatif des flux RSS disponibles](#r%C3%A9capitulatif-des-flux-rss-disponibles)
-  - [Désabonnement](#désabonnement)
-- [Contribuer](#contribuer)
-  - [Ajouter un nouveau service](#ajouter-un-nouveau-service)
-## Fonctionnement
-_Note: Les mots en gras sont les [termes du domaine](https://fr.wikipedia.org/wiki/Conception_pilot%C3%A9e_par_le_domaine)._
-Les **services** sont **déclarés** dans l'outil _Open Terms Archive_ grâce à un **fichier de déclaration** listant les **documents** qui forment l'ensemble des **conditions** régissant l'usage du **service**. Ces **documents** peuvent être de plusieurs **types** : « conditions d'utilisation », « politique de confidentialité », « contrat de développeur »…
-Afin de **suivre** leurs **évolutions**, les **documents** sont régulièrement mis à jour, en les **téléchargeant** depuis une **adresse** web et en **sélectionnant leur contenu** dans la **page web** pour supprimer le **bruit** (publicités, menus de navigation, champs de connexion…). En plus de simplement sélectionner une zone de la page, certains documents possèdent du **bruit** supplémentaire (hashs dans des liens, jetons CSRF...) créant de faux positifs en terme d'**évolutions**. En conséquence, _Open Terms Archive_ supporte des **filtres** spécifiques pour chaque **document**.
-Néanmoins, le **bruit** peut changer de forme avec le temps. Afin d'éviter des pertes d'information irrécupérables pendant l'étape de **filtrage du bruit**, un **instantané** de la page Web est **enregistré** à chaque **évolution**. Après avoir **filtré l'instantané** de son **bruit**, si le **document** résultant a changé par rapport à sa **version** précédente, une nouvelle **version** est **enregistrée**.
-Vous pouvez disposer de votre propre instance **privée** de l'outil _Open Terms Archive_ et suivre vous-même les **évolutions**. Néanmoins, nous **publions** chaque **version** sur une [instance **publique**](https://github.com/OpenTermsArchive/contrib-versions) facilitant l'exploration de l'**historique** et **notifiant** par courriels l'**enregistrement** de nouvelles **versions**. Les **utilisateurs** peuvent [**s'abonner** aux **notifications**](#recevoir-des-notifications).
-_Note: Actuellement, nous ne suivons que les **conditions** rédigées en anglais et concernant la juridiction européenne._
-## Naviguer dans l'historique des versions
-À partir de la **page d'accueil du dépôt** [contrib-versions](https://github.com/OpenTermsArchive/contrib-versions), ouvrez le dossier du **service de votre choix** (prenons par exemple [WhatsApp](https://github.com/OpenTermsArchive/contrib-versions/tree/main/WhatsApp)).
-L'**ensemble des documents suivis** pour ce service s'affichent, cliquez ensuite sur **celui dont vous souhaitez suivre l'historique** (par exemple la [politique d'utilisation des données de WhatsApp](https://github.com/OpenTermsArchive/contrib-versions/blob/main/WhatsApp/Privacy%20Policy.md)). Le document s'affiche alors dans sa **dernière version** (il est actualisé toutes les heures).
-Pour afficher l'**historique des modifications** subies par ce document, cliquez sur **History** en haut à droite du document (pour l'exemple précédent nous arrivons [ici](https://github.com/OpenTermsArchive/contrib-versions/commits/main/WhatsApp/Privacy%20Policy.md)). Les **modifications** sont affichées **par dates**, de la plus récente à la plus ancienne.
-Cliquez sur une modification pour voir en quoi elle consiste (par exemple [celle-ci](https://github.com/OpenTermsArchive/contrib-versions/commit/58a1d2ae4187a3260ac58f3f3c7dcd3aeacaebcd)). Vous disposez de **deux types d'affichage**, sélectionnables à partir des icônes dans la barre grisée qui chapeaute le document.
-- Le premier, appelé _source diff_ (bouton avec des chevrons) permet d'**afficher côte-à-côte l'ancienne version et la nouvelle** (pour notre [exemple](https://github.com/OpenTermsArchive/contrib-versions/commit/58a1d2ae4187a3260ac58f3f3c7dcd3aeacaebcd#diff-e8bdae8692561f60aeac9d27a55e84fc)). Cet affichage a le mérite de **montrer explicitement** l'ensemble des ajouts/suppressions.
-- Le second, appelé _rich diff_ (bouton avec l'icône document) permet d'**unifier l'ensemble des modifications sur un seul document** (pour notre [exemple](https://github.com/OpenTermsArchive/contrib-versions/commit/58a1d2ae4187a3260ac58f3f3c7dcd3aeacaebcd?short_path=e8bdae8#diff-e8bdae8692561f60aeac9d27a55e84fc)). La couleur **rouge** montre les éléments **supprimés**, la couleur **jaune** montre les paragraphes **modifiés**, et la couleur **verte** montrent les éléments **ajoutés**. Attention, cet affichage **ne montre pas certaines modifications** comme le changement des hyperliens et le style du texte.
-### Remarques
-- Pour les longs documents, les **paragraphes inchangés ne seront pas affichés par défaut**. Vous pouvez manuellement les faire apparaître en cliquant sur les petites flèches juste au-dessus ou juste en-dessous des paragraphes affichés.
-- Vous pouvez utiliser le bouton **History n'importe où** dans le dépôt contrib-versions, qui affichera alors l'**historique des modifications subies par tous les documents se trouvant dans le dossier** où vous vous trouvez (y compris dans les sous-dossiers).
-## Recevoir des notifications
-### Par courriel
-#### Pour tous les documents d'un coup
-Vous pouvez [vous abonner](https://59692a77.sibforms.com/serve/MUIEAKuTv3y67e27PkjAiw7UkHCn0qVrcD188cQb-ofHVBGpvdUWQ6EraZ5AIb6vJqz3L8LDvYhEzPb2SE6eGWP35zXrpwEFVJCpGuER9DKPBUrifKScpF_ENMqwE_OiOZ3FdCV2ra-TXQNxB2sTEL13Zj8HU7U0vbbeF7TnbFiW8gGbcOa5liqmMvw_rghnEB2htMQRCk6A3eyj) pour recevoir un courriel à chaque modification d'un document dans l'ensemble de la base.
-**Attention, vous risquez de recevoir de nombreuses notifications !** Vous pourrez vous désabonner en répondant à n'importe quel courriel reçu.
-#### Recevoir les mises à jour de services ou documents spécifiques
-Vous pouvez vous rendre sur le site officiel [opentermsarchive.org] (https://opentermsarchive.org). De là, vous pouvez sélectionner un service, puis le type de document correspondant.
-Après avoir entré votre adresse électronique et cliqué sur "S'inscrire", nous ajouterons votre adresse à la liste de diffusion correspondante dans [SendInBlue](https://www.sendinblue.com/) et nous ne la conserverons nulle part ailleurs.
-Ensuite, chaque fois qu'une modification sera trouvée sur le document correspondant, nous vous enverrons un e-mail.
-Vous pouvez vous désinscrire à tout moment en cliquant sur le lien "désinscription" en bas de l'email reçu.
-### Par RSS
-Vous pouvez recevoir une notification pour un service ou un document spécifique en vous abonnant à des flux RSS.
-> Un flux RSS est un type de page accessible en ligne qui contient des informations sur les derniers contenus publiés par un site web comme leur date de publication et l'adresse pour les consulter. Lorsque cette ressource est mise à jour, une application de type lecteur de flux vous notifie automatiquement et vous pouvez ainsi consulter la mise à jour.
-Pour obtenir l'adresse du flux RSS auquel vous abonner :
-1. [Navigate](#naviguer-dans-lhistorique-des-versions) to the page that shows the history of the changes you are interested in. _In the WhatsApp example given earlier, this is [this page](https://github.com/OpenTermsArchive/contrib-versions/commits/main/WhatsApp/Privacy%20Policy.md)._
-2. Copy the address of that page from your browser's address bar. _In the WhatsApp example, this is `https://github.com/OpenTermsArchive/contrib-versions/commits/main/WhatsApp/Privacy%20Policy.md`._
-3. Append `.atom` to the end of that address. _In the WhatsApp example, this gives `https://github.com/OpenTermsArchive/contrib-versions/commits/main/WhatsApp/Privacy%20Policy.md.atom`._
-4. Subscribe your RSS feed reader to the resulting address.
-#### Summary of available RSS feeds
-| Updates for                      | URL                                                                                                                                                                                                   |
-| -------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| all services and documents       | `https://github.com/OpenTermsArchive/contrib-versions/commits.atom`                                                                                                                                   |
-| all the documents of a service   | Replace `$serviceId` with the service ID:<br>`https://github.com/OpenTermsArchive/contrib-versions/commits/main/$serviceId.atom`                                                                      |
-| a specific document of a service | Replace `$serviceId` with the service ID and `$documentType` with the document type:<br>`https://github.com/OpenTermsArchive/contrib-versions/commits/main/$serviceId/$documentType.md.atom`          |
-For example:
-- To receive all updates to the `Facebook` documents, subscribe to `https://github.com/OpenTermsArchive/contrib-versions/commits/main/Facebook.atom`.
-- To receive all updates to the `Privacy Policy` of `Google`, subscribe to `https://github.com/OpenTermsArchive/contrib-versions/commits/main/Google/Privacy%20Policy.md.atom`.
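The URL patterns in the table above can be sketched as a small helper. This is only an illustration: `buildFeedUrl` and `BASE_URL` are hypothetical names that do not come from the Open Terms Archive codebase, and the sketch assumes that percent-encoding the service ID and document type (for example, a space becoming `%20`) is enough to build a valid path.

```javascript
// Commit history base URL, taken from the feed summary table.
const BASE_URL = 'https://github.com/OpenTermsArchive/contrib-versions/commits';

// Hypothetical helper, not part of Open Terms Archive.
// Builds the Atom feed URL for all services, one service, or one document.
function buildFeedUrl(serviceId, documentType) {
  if (!serviceId) {
    // Feed for all services and documents.
    return `${BASE_URL}.atom`;
  }

  const servicePath = encodeURIComponent(serviceId);

  if (!documentType) {
    // Feed for all the documents of one service.
    return `${BASE_URL}/main/${servicePath}.atom`;
  }

  // Feed for one document of one service; spaces become %20.
  return `${BASE_URL}/main/${servicePath}/${encodeURIComponent(documentType)}.md.atom`;
}

console.log(buildFeedUrl());
console.log(buildFeedUrl('Facebook'));
console.log(buildFeedUrl('Google', 'Privacy Policy'));
```

Called with `'Google'` and `'Privacy Policy'`, the helper produces the same address as the second example above.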
-### Unsubscribing
-To stop receiving service update emails, use one of the two links included in every email you receive:
-- one to stop receiving all emails from bot@opentermsarchive.org
-- one to stop receiving emails about a particular document
-The latter link sends an email to contact@opentermsarchive.org asking to be removed manually from the corresponding list.
-## Contributing
-### Adding a new service
-See the [CONTRIBUTING](CONTRIBUTING.md) file (in English).