@opentermsarchive/engine 0.17.0 → 0.17.2
- package/.eslintrc.yaml +116 -0
- package/README.md +240 -232
- package/package.json +8 -7
- package/scripts/dataset/README.md +2 -2
- package/scripts/dataset/assets/README.template.js +5 -5
- package/scripts/dataset/export/test/fixtures/dataset/README.md +5 -5
- package/scripts/import/README.md +1 -1
- package/scripts/rewrite/README.md +2 -2
- package/scripts/rewrite/rewrite-versions.js +1 -1
- package/scripts/utils/renamer/README.md +5 -5
- package/scripts/utils/renamer/index.js +2 -2
- package/src/archivist/recorder/index.js +2 -2
- package/src/archivist/recorder/index.test.js +3 -3
- package/src/archivist/recorder/repositories/git/dataMapper.js +1 -1
- package/src/archivist/recorder/repositories/git/index.test.js +5 -5
- package/src/archivist/recorder/repositories/interface.js +2 -2
- package/src/archivist/recorder/repositories/mongo/index.test.js +4 -4
- package/src/archivist/services/index.test.js +2 -2
- package/src/archivist/services/service.test.js +1 -1
- package/src/main.js +1 -1
- package/README.fr.md +0 -110
package/README.md
CHANGED
|
@@ -1,245 +1,289 @@
|
|
|
1
|
-
|
|
2
|
-
|
|
3
|
-
|
|
4
|
-
|
|
5
|
-
|
|
6
|
-
|
|
7
|
-
|
|
8
|
-
|
|
9
|
-
|
|
10
|
-
|
|
11
|
-
- [How
|
|
12
|
-
- [
|
|
13
|
-
- [
|
|
14
|
-
- [By email](#by-email)
|
|
15
|
-
- [By RSS](#by-rss)
|
|
16
|
-
- [Importing as a module](#importing-as-a-module)
|
|
17
|
-
- [CLI](#cli)
|
|
18
|
-
- [Features exposed](#features-exposed)
|
|
19
|
-
- [fetch](#fetch)
|
|
20
|
-
- [filter](#filter)
|
|
21
|
-
- [Using locally](#using-locally)
|
|
22
|
-
- [Installing](#installing)
|
|
23
|
-
- [Declarations repository](#declarations-repository)
|
|
24
|
-
- [Core](#core)
|
|
25
|
-
- [Configuring](#configuring)
|
|
26
|
-
- [Configuration file](#configuration-file)
|
|
27
|
-
- [Storage repositories](#storage-repositories)
|
|
28
|
-
- [Environment variables](#environment-variables)
|
|
29
|
-
- [Running](#running)
|
|
1
|
+
_The document you are reading now is targeted at developers wanting to use or contribute to the engine of [Open Terms Archive](https://opentermsarchive.org). For a high-level overview of Open Terms Archive’s wider goals and processes, please read its [public homepage](https://opentermsarchive.org)._
|
|
2
|
+
|
|
3
|
+
# Open Terms Archive Engine
|
|
4
|
+
|
|
5
|
+
This codebase is a Node.js module enabling downloading, archiving and publishing versions of documents obtained online. It can be used independently from the Open Terms Archive ecosystem.
|
|
6
|
+
|
|
7
|
+
## Table of contents
|
|
8
|
+
|
|
9
|
+
- [Motivation](#motivation)
|
|
10
|
+
- [Main concepts](#main-concepts)
|
|
11
|
+
- [How to add documents to a collection](#how-to-add-documents-to-a-collection)
|
|
12
|
+
- [How to use the engine](#how-to-use-the-engine)
|
|
13
|
+
- [Configuring](#configuring)
|
|
30
14
|
- [Deploying](#deploying)
|
|
31
|
-
- [Publishing](#publishing)
|
|
32
15
|
- [Contributing](#contributing)
|
|
33
|
-
- [Adding or updating a service](#adding-a-new-service-or-updating-an-existing-service)
|
|
34
|
-
- [Core engine](#core-engine)
|
|
35
|
-
- [Funding and partnerships](#funding-and-partnerships)
|
|
36
16
|
- [License](#license)
|
|
37
17
|
|
|
38
|
-
##
|
|
18
|
+
## Motivation
|
|
19
|
+
|
|
20
|
+
_Words in bold are [business domain names](https://en.wikipedia.org/wiki/Domain-driven_design)._
|
|
21
|
+
|
|
22
|
+
**Services** have **terms** written in **documents**, contractual (Terms of Services, Privacy Policy…) or not (Community Guidelines, Deceased User Policy…), that can change over time. Open Terms Archive enables user rights advocates, regulatory bodies and interested citizens to follow the **changes** to these **terms**, to be notified whenever a new **version** is published, to explore their entire **history** and to collaborate in analysing them. This free and open-source engine is developed to support these goals.
|
|
23
|
+
|
|
24
|
+
## Main concepts
|
|
39
25
|
|
|
40
|
-
|
|
26
|
+
### Instances
|
|
41
27
|
|
|
42
|
-
|
|
28
|
+
Open Terms Archive is a decentralised system.
|
|
43
29
|
|
|
44
|
-
|
|
30
|
+
It aims at enabling any entity to **track** **terms** on its own and at federating a number of public **instances** in a single ecosystem to maximise discoverability, collaboration and political power. To that end, the Open Terms Archive **engine** can be run on any server, thus making it a dedicated **instance**.
|
|
45
31
|
|
|
46
|
-
|
|
32
|
+
> Federated public instances can be [found on GitHub](
|
|
33
|
+
https://github.com/OpenTermsArchive?q=declarations).
|
|
47
34
|
|
|
48
|
-
|
|
49
|
-
Users can [**subscribe** to **notifications**](#be-notified).
|
|
35
|
+
### Collections
|
|
50
36
|
|
|
51
|
-
|
|
37
|
+
An **instance** **tracks** **documents** of a single **collection**.
|
|
52
38
|
|
|
53
|
-
|
|
39
|
+
A **collection** is characterised by a **scope** across **dimensions** that describe the **terms** it **tracks**, such as **language**, **jurisdiction** and **industry**.
|
|
54
40
|
|
|
55
|
-
|
|
41
|
+
> Federated public collections can be [found on GitHub](https://github.com/OpenTermsArchive?q=versions).
|
|
56
42
|
|
|
57
|
-
|
|
43
|
+
#### Example scope
|
|
58
44
|
|
|
59
|
-
|
|
45
|
+
> The documents declared in this collection are:
|
|
46
|
+
> - Related to dating services used in Europe.
|
|
47
|
+
> - In the European Union and Switzerland jurisdictions.
|
|
48
|
+
> - In English, unless no English version exists, in which case the primary official language of the jurisdiction of incorporation of the service operator will be used.
|
|
60
49
|
|
|
61
|
-
|
|
50
|
+
### Terms types
|
|
62
51
|
|
|
63
|
-
|
|
52
|
+
To distinguish between the different **terms** of a **service**, each has a **type**, such as “Terms of Service”, “Privacy Policy”, “Developer Agreement”…
|
|
64
53
|
|
|
65
|
-
|
|
66
|
-
- The second one, named _rich diff_ (button with a document icon) allows you to **unify all the changes in a single document** (for our [example](https://github.com/OpenTermsArchive/contrib-versions/commit/58a1d2ae4187a3260ac58f3f3c7dcd3aeacaebcd?short_path=e8bdae8#diff-e8bdae8692561f60aeac9d27a55e84fc)). The **red** color shows **deleted** elements, the **yellow** color shows **modified** paragraphs, and the **green** color shows **added** elements. Be careful, this display **does not show some changes** such as hyperlinks and text style's changes.
|
|
54
|
+
This **type** matches the topic, but not necessarily the title the **service** gives to it. Unifying the **types** enables comparing **terms** across **services**.
|
|
67
55
|
|
|
68
|
-
|
|
56
|
+
> More information on terms types can be found in the [dedicated repository](https://github.com/OpenTermsArchive/terms-types). They are published on NPM under [`@opentermsarchive/terms-types`](https://www.npmjs.com/package/@opentermsarchive/terms-types), enabling standardisation and interoperability beyond the Open Terms Archive engine.
|
|
69
57
|
|
|
70
|
-
|
|
71
|
-
- You can use the **History button anywhere** in the repository contrib-versions, which will then display the **history of changes made to all documents in the folder** where you are (including sub-folders).
|
|
58
|
+
### Declarations
|
|
72
59
|
|
|
73
|
-
|
|
60
|
+
The **documents** that constitute a **collection** are defined in simple JSON files called **declarations**.
|
|
74
61
|
|
|
75
|
-
|
|
62
|
+
A **declaration** also contains some metadata on the **service** the **documents** relate to.
|
|
76
63
|
|
|
77
|
-
|
|
64
|
+
> Here is an example declaration tracking the Privacy Policy of Open Terms Archive:
|
|
65
|
+
>
|
|
66
|
+
> ```json
|
|
67
|
+
> {
|
|
68
|
+
> "name": "Open Terms Archive",
|
|
69
|
+
> "documents": {
|
|
70
|
+
> "Privacy Policy": {
|
|
71
|
+
> "fetch": "https://opentermsarchive.org/en/privacy-policy",
|
|
72
|
+
> "select": ".TextContent_textContent__ToW2S"
|
|
73
|
+
> }
|
|
74
|
+
> }
|
|
75
|
+
> }
|
|
76
|
+
> ```
|
|
78
77
|
|
|
79
|
-
|
|
80
|
-
After you enter your email and click on subscribe, we will add your email to the corresponding mailing list in [SendInBlue](https://www.sendinblue.com/) and will not store your email anywhere else.
|
|
81
|
-
Then, every time a modification is found on the corresponding document, we will send you an email.
|
|
78
|
+
## How to add documents to a collection
|
|
82
79
|
|
|
83
|
-
|
|
80
|
+
Open Terms Archive **acquires** **documents** to deliver an explorable **history** of **changes**. This can be done in two ways:
|
|
84
81
|
|
|
85
|
-
|
|
82
|
+
1. For the present and future, by **tracking** **documents**.
|
|
83
|
+
2. For the past, by **importing** from an existing **fonds** such as [ToSBack](https://tosback.org), the [Internet Archive](https://archive.org/web/), [Common Crawl](https://commoncrawl.org) or any other in-house format.
|
|
86
84
|
|
|
87
|
-
|
|
85
|
+
### Tracking documents
|
|
88
86
|
|
|
89
|
-
**
|
|
87
|
+
The **engine** **reads** **declarations** to **record** a **snapshot** by **fetching** the declared web **location** periodically. The **engine** then **extracts** a **version** from this **snapshot** by:
|
|
90
88
|
|
|
91
|
-
|
|
89
|
+
1. **Selecting** the subset of the **snapshot** that contains the **terms** (instead of navigation menus, footers, cookies banners…).
|
|
90
|
+
2. **Removing** residual content in this subset that is not part of the **terms** (ads, illustrative pictures, internal navigation links…).
|
|
91
|
+
3. **Filtering noise** by preventing parts that change frequently from triggering false positives for **changes** (tracker identifiers in links, relative dates…). The **engine** can execute custom **filters** written in JavaScript to that end.
|
|
92
92
|
|
|
93
|
-
|
|
93
|
+
After these steps, if **changes** are spotted in the resulting **document**, a new **version** is **recorded**.
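
The noise-filtering step above can be illustrated with a minimal sketch. It is not the engine's filter API: the function names and the list of tracking parameters are assumptions made for this example, which only shows why removing tracker identifiers from links prevents false positives when comparing two snapshots.

```js
// Hypothetical list of query parameters treated as tracker identifiers.
const TRACKING_PARAMETERS = [ 'utm_source', 'utm_medium', 'utm_campaign', 'fbclid' ];

// Return the link without its tracking parameters (uses the standard WHATWG URL API).
function removeTrackingParameters(link) {
  const url = new URL(link);

  TRACKING_PARAMETERS.forEach(parameter => url.searchParams.delete(parameter));

  return url.toString();
}

// Report a change only when the links still differ once tracker identifiers are removed.
function hasChanged(previousLinks, currentLinks) {
  const clean = links => links.map(removeTrackingParameters).join('\n');

  return clean(previousLinks) !== clean(currentLinks);
}
```

With such a filter, two snapshots whose links differ only by a tracker identifier produce the same cleaned content, so no new **version** would be recorded for them.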
|
|
94
94
|
|
|
95
|
-
|
|
95
|
+
Preserving **snapshots** enables recovering after the fact information potentially lost in the **extraction** step: if **declarations** were wrong, they can be **maintained** and corrected **versions** can be **extracted** from the original **snapshots**.
|
|
96
96
|
|
|
97
|
-
|
|
97
|
+
### Importing documents
|
|
98
98
|
|
|
99
|
-
|
|
100
|
-
|
|
101
|
-
3. Append `.atom` at the end of this address. _In the WhatsApp example, this would become `https://github.com/OpenTermsArchive/contrib-versions/commits/main/WhatsApp/Privacy%20Policy.md.atom`._
|
|
102
|
-
4. Subscribe your RSS feed reader to the resulting address.
|
|
99
|
+
Existing **fonds** can be prepared for easier analysis by unifying their format to the **Open Terms Archive dataset format**. This unique format enables building interoperable tools, fostering collaboration across reusers.
|
|
100
|
+
Such a dataset can be generated from **versions** alone. If **snapshots** and **declarations** can be retrieved from the **fonds** too, then a full-fledged **collection** can be created.
|
|
103
101
|
|
|
104
|
-
|
|
102
|
+
## How to use the engine
|
|
105
103
|
|
|
106
|
-
|
|
107
|
-
| ----------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
108
|
-
| all services and documents | `https://github.com/OpenTermsArchive/contrib-versions/commits.atom` |
|
|
109
|
-
| all the documents of a service | Replace `$serviceId` with the service ID:<br>`https://github.com/OpenTermsArchive/contrib-versions/commits/main/$serviceId.atom.` |
|
|
110
|
-
| a specific document of a service | Replace `$serviceId` with the service ID and `$documentType` with the document type:<br>`https://github.com/OpenTermsArchive/contrib-versions/commits/main/$serviceId/$documentType.md.atom` |
|
|
104
|
+
This documentation describes how to execute the **engine** independently from any specific **instance**. For other use cases, other parts of the documentation could be more relevant:
|
|
111
105
|
|
|
112
|
-
|
|
106
|
+
- to contribute **declarations** to an existing **instance**, see [how to contribute documents](./docs/doc-contributing-documents.md);
|
|
107
|
+
- to create a new **collection**, see the [collection bootstrap](https://github.com/OpenTermsArchive/template-declarations) script;
|
|
108
|
+
- to create a new public **instance**, see the [governance](./docs/doc-governance.md) documentation.
|
|
113
109
|
|
|
114
|
-
|
|
115
|
-
- To receive all updates of the `Privacy Policy` from `Google`, the URL is `https://github.com/OpenTermsArchive/contrib-versions/commits/main/Google/Privacy%20Policy.md.atom`.
|
|
110
|
+
### Requirements
|
|
116
111
|
|
|
117
|
-
|
|
112
|
+
This module is tested to work across operating systems (continuous testing on UNIX, macOS and Windows).
|
|
113
|
+
|
|
114
|
+
A [Node.js](https://nodejs.org/en/download/) runtime is required to execute this engine.
|
|
115
|
+
|
|
116
|
+

|
|
117
|
+
|
|
118
|
+
### Getting started
|
|
119
|
+
|
|
120
|
+
This engine is published as a [module on NPM](https://npmjs.com/package/@opentermsarchive/engine). The recommended install is as a dependency in a `package.json` file, next to a folder containing [declaration files](#declarations).
|
|
121
|
+
|
|
122
|
+
```sh
|
|
123
|
+
npm install --save @opentermsarchive/engine
|
|
124
|
+
mkdir declarations
|
|
125
|
+
```
|
|
118
126
|
|
|
119
|
-
|
|
127
|
+
In an editor, create the following declaration file in `declarations/Open Terms Archive.json` to track the terms of the Open Terms Archive website:
|
|
120
128
|
|
|
129
|
+
```json
|
|
130
|
+
{
|
|
131
|
+
"name": "Open Terms Archive",
|
|
132
|
+
"documents": {
|
|
133
|
+
"Privacy Policy": {
|
|
134
|
+
"fetch": "https://opentermsarchive.org/en/privacy-policy",
|
|
135
|
+
"select": ".TextContent_textContent__ToW2S"
|
|
136
|
+
}
|
|
137
|
+
}
|
|
138
|
+
}
|
|
121
139
|
```
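
As a rough illustration of the declaration structure shown above, the following sketch lists obvious problems in a declaration object. The helper name is hypothetical and it only checks the properties that appear in the example (`name`, `documents`, `fetch`, `select`); the schema enforced by the engine remains the reference.

```js
// Hypothetical helper: list the problems found in a declaration object.
// Assumption: every entry under `documents` has `fetch` and `select`, as in the example.
function findDeclarationProblems(declaration) {
  const problems = [];

  if (typeof declaration.name !== 'string' || !declaration.name) {
    problems.push('missing "name"');
  }

  const documents = declaration.documents || {};

  if (!Object.keys(documents).length) {
    problems.push('no entry under "documents"');
  }

  for (const [ termsType, document ] of Object.entries(documents)) {
    if (typeof document.fetch !== 'string') problems.push(`${termsType}: missing "fetch"`);
    if (!document.select) problems.push(`${termsType}: missing "select"`);
  }

  return problems;
}

const declaration = {
  name: 'Open Terms Archive',
  documents: {
    'Privacy Policy': {
      fetch: 'https://opentermsarchive.org/en/privacy-policy',
      select: '.TextContent_textContent__ToW2S',
    },
  },
};

console.log(findDeclarationProblems(declaration)); // prints []
```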
|
|
122
|
-
|
|
140
|
+
|
|
141
|
+
In the terminal:
|
|
142
|
+
|
|
143
|
+
```sh
|
|
144
|
+
npx ota-track
|
|
123
145
|
```
|
|
124
146
|
|
|
147
|
+
The tracked documents can be found in the `data` folder.
|
|
148
|
+
|
|
149
|
+
This quick example aims to let you try the engine. Most likely, you will simply `npm install` from an existing collection, or create a new collection from the [collection template](https://github.com/OpenTermsArchive/template-declarations).
|
|
150
|
+
|
|
125
151
|
### CLI
|
|
126
152
|
|
|
127
|
-
|
|
153
|
+
Once the engine module is installed as a dependency within another module, the following commands are available.
|
|
128
154
|
|
|
129
|
-
|
|
130
|
-
- `./node_modules/.bin/ota-validate-declarations`: validate declarations.
|
|
131
|
-
- `./node_modules/.bin/ota-track`: track services. Recorded snapshots and versions will be stored in the `data` folder at the root of the module where the package is installed.
|
|
155
|
+
In these commands:
|
|
132
156
|
|
|
133
|
-
|
|
157
|
+
- **`<service_id>`** is the case sensitive name of the service declaration file without the extension. For example, for `Twitter.json`, the service ID is `Twitter`.
|
|
158
|
+
- **`<terms_type>`** is the property name used under the `documents` property in the declaration to declare a terms. For example, in the getting started declaration, the terms type declared is `Privacy Policy`.
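
The two definitions above can be sketched in a few lines. The function names are hypothetical and are not part of the engine; the sketch only derives a service ID from a declaration file name and lists the terms types of a declaration object.

```js
// The service ID is the declaration file name without its `.json` extension (case preserved).
function serviceIdFromFileName(fileName) {
  return fileName.replace(/\.json$/, '');
}

// The terms types are the property names found under `documents`.
function termsTypesOf(declaration) {
  return Object.keys(declaration.documents);
}

console.log(serviceIdFromFileName('Twitter.json')); // prints Twitter
console.log(termsTypesOf({ name: 'Open Terms Archive', documents: { 'Privacy Policy': {} } })); // prints [ 'Privacy Policy' ]
```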
|
|
134
159
|
|
|
135
|
-
|
|
160
|
+
#### `ota-track`
|
|
136
161
|
|
|
137
|
-
|
|
162
|
+
```sh
|
|
163
|
+
npx ota-track
|
|
164
|
+
```
|
|
138
165
|
|
|
139
|
-
|
|
166
|
+
[Track](#tracking-documents) the current terms of services according to provided declarations.
|
|
140
167
|
|
|
141
|
-
|
|
168
|
+
The declarations, snapshots and versions paths are defined in the [configuration](#configuring).
|
|
142
169
|
|
|
143
|
-
|
|
170
|
+
> Note that the snapshots and versions will be recorded at the moment the command is executed, on top of the existing local history. If a shared history already exists and the goal is to add on top of it, that history has to be downloaded before executing that command.
|
|
144
171
|
|
|
145
|
-
|
|
146
|
-
In order to not instantiate this browser at each fetch, the starting and stopping of the browser is your responsibility.
|
|
172
|
+
##### Recap of available options
|
|
147
173
|
|
|
148
|
-
|
|
174
|
+
```sh
|
|
175
|
+
npx ota-track --help
|
|
176
|
+
```
|
|
149
177
|
|
|
150
|
-
|
|
151
|
-
import fetch, { launchHeadlessBrowser, stopHeadlessBrowser } from 'open-terms-archive/fetch';
|
|
178
|
+
##### Track terms of specific services
|
|
152
179
|
|
|
153
|
-
|
|
154
|
-
|
|
155
|
-
await fetch({ executeClientScripts: true, ... });
|
|
156
|
-
await fetch({ executeClientScripts: true, ... });
|
|
157
|
-
await stopHeadlessBrowser();
|
|
180
|
+
```sh
|
|
181
|
+
npx ota-track --services "<service_id>" ["<service_id>"...]
|
|
158
182
|
```
|
|
159
183
|
|
|
160
|
-
|
|
161
|
-
If [`node-config`](https://github.com/node-config/node-config) is used in the project, the default `fetcher` configuration can be overridden by adding a `fetcher` object to the local config. See [Configuration file](#configuration-file) for full reference.
|
|
184
|
+
##### Track specific terms of specific services
|
|
162
185
|
|
|
163
|
-
|
|
186
|
+
```sh
|
|
187
|
+
npx ota-track --services "<service_id>" ["<service_id>"...] --documentTypes "<terms_type>" ["<terms_type>"...]
|
|
188
|
+
```
|
|
164
189
|
|
|
165
|
-
|
|
166
|
-
It will filter content based on the [document declaration](https://github.com/OpenTermsArchive/contrib-declarations/blob/main/CONTRIBUTING.md#declaring-a-new-service).
|
|
190
|
+
##### Track documents four times a day
|
|
167
191
|
|
|
168
|
-
|
|
192
|
+
```sh
|
|
193
|
+
npx ota-track --schedule
|
|
194
|
+
```
|
|
169
195
|
|
|
170
|
-
|
|
196
|
+
#### `ota-validate-declarations`
|
|
171
197
|
|
|
172
|
-
|
|
198
|
+
```sh
|
|
199
|
+
npx ota-validate-declarations [--services <service_id>...]
|
|
200
|
+
```
|
|
173
201
|
|
|
174
|
-
|
|
202
|
+
Check that all declarations allow recording a snapshot and a version properly.
|
|
175
203
|
|
|
176
|
-
|
|
204
|
+
If one or several `<service_id>` are provided, check only those services.
|
|
177
205
|
|
|
178
|
-
|
|
206
|
+
##### Validate schema only
|
|
179
207
|
|
|
180
|
-
|
|
208
|
+
```sh
|
|
209
|
+
npx ota-validate-declarations --schema-only [--services <service_id>...]
|
|
210
|
+
```
|
|
181
211
|
|
|
182
|
-
|
|
212
|
+
Check that all declarations are readable by the engine.
|
|
183
213
|
|
|
184
|
-
|
|
214
|
+
Allows for a much faster check of declarations, but does not check that the documents are actually accessible.
|
|
185
215
|
|
|
186
|
-
|
|
187
|
-
2. Go into your folder and initialize it, e.g., `cd contrib-declarations; npm install`.
|
|
188
|
-
3. You can now modify your declarations in the `./declarations/` folder, following [these instructions](https://github.com/OpenTermsArchive/contrib-declarations/blob/main/CONTRIBUTING.md).
|
|
189
|
-
4. When you want to test:
|
|
190
|
-
- If you want to test every declaration, run `npm test`.
|
|
191
|
-
- If you want to test a specific declaration, run `npm test $serviceId`, e.g., `npm test HER`.
|
|
192
|
-
- If you want to have faster feedback on the structure of a specific declaration, run `npm run test:schema $serviceId`, e.g., `npm run test:schema HER`.
|
|
193
|
-
5. Once you have done that, if you have any error, it will be prompted and detailed at the end of the test.
|
|
194
|
-
- E.g., `InaccessibleContentError`: Your selector is wrong and should be fixed.
|
|
195
|
-
- E.g., `TypeError`: The file declaration is invalid.
|
|
196
|
-
- E.g., if you have a weird error, you may want to contact OTA, it may be a bug.
|
|
216
|
+
If one or several `<service_id>` are provided, check only those services.
|
|
197
217
|
|
|
198
|
-
|
|
218
|
+
#### `ota-lint-declarations`
|
|
199
219
|
|
|
200
|
-
|
|
220
|
+
```sh
|
|
221
|
+
npx ota-lint-declarations [--services <service_id>...]
|
|
222
|
+
```
|
|
201
223
|
|
|
202
|
-
|
|
224
|
+
Normalise the format of declarations.
|
|
203
225
|
|
|
204
|
-
|
|
226
|
+
Automatically correct formatting mistakes and ensure that all declarations are standardised.
|
|
205
227
|
|
|
206
|
-
|
|
207
|
-
2. In the base folder of the previous step (i.e., not _in_ the previous folder, but _where the previous folder is_), clone the core engine: `git clone git@github.com:ambanum/OpenTermsArchive.git`.
|
|
208
|
-
3. Go into the cloned folder and install dependencies: `cd contrib-declarations; npm install`.
|
|
209
|
-
4. If you are using the main repo, you are done, go to step 6.
|
|
210
|
-
5. If you are using a special repo instance (e.g., `dating-declarations`), create a new [config file](#configuring), `config/development.json`, and add:
|
|
211
|
-
```json
|
|
212
|
-
{
|
|
228
|
+
If one or several `<service_id>` are provided, check only those services.
|
|
213
229
|
|
|
214
|
-
|
|
215
|
-
"declarationsPath": "../<name of the repo>/declarations"
|
|
216
|
-
}
|
|
217
|
-
}
|
|
218
|
-
```
|
|
219
|
-
e.g.,
|
|
220
|
-
```json
|
|
221
|
-
{
|
|
222
|
-
"services": {
|
|
223
|
-
"declarationsPath": "../dating-declarations/declarations"
|
|
224
|
-
}
|
|
225
|
-
}
|
|
226
|
-
```
|
|
227
|
-
6. In the folder of the repo (i.e., `OpenTermsArchive`), use `npm start`.
|
|
228
|
-
- It will first do a refiltering to check whenever everything works properly.
|
|
229
|
-
- You will then start to see everything being downloaded under `data/`.
|
|
230
|
-
- More details in [Running](#running).
|
|
230
|
+
### API
|
|
231
231
|
|
|
232
|
-
|
|
232
|
+
Once added as a dependency, the engine exposes a JavaScript API that can be called in your own code. The following modules are available.
|
|
233
233
|
|
|
234
|
-
|
|
235
|
-
- You have to `npm install` in the declarations repo at least once, and at least once each time `package.json` changes.
|
|
236
|
-
- Be careful, it doesn't download the history! If you want that, you need to git clone `snapshots` and `versions` in `data/`.
|
|
234
|
+
#### `fetch`
|
|
237
235
|
|
|
238
|
-
|
|
236
|
+
The `fetch` module gets the MIME type and content of a document from its URL.
|
|
239
237
|
|
|
240
|
-
|
|
238
|
+
```js
|
|
239
|
+
import fetch from '@opentermsarchive/engine/fetch';
|
|
240
|
+
```
|
|
241
|
+
|
|
242
|
+
Documentation on how to use `fetch` is provided [as JSDoc](./src/archivist/fetcher/index.js).
|
|
243
|
+
|
|
244
|
+
##### Headless browser management
|
|
245
|
+
|
|
246
|
+
If you pass the `executeClientScripts` option to `fetch`, a headless browser will be used to download and execute the page before serialising its DOM. For performance reasons, starting and stopping the browser is your responsibility, which avoids instantiating a browser on each fetch. Here is an example of how to use this feature:
|
|
247
|
+
|
|
248
|
+
```js
|
|
249
|
+
import fetch, { launchHeadlessBrowser, stopHeadlessBrowser } from '@opentermsarchive/engine/fetch';
|
|
241
250
|
|
|
242
|
-
|
|
251
|
+
await launchHeadlessBrowser();
|
|
252
|
+
await fetch({ executeClientScripts: true, ... });
|
|
253
|
+
await fetch({ executeClientScripts: true, ... });
|
|
254
|
+
await fetch({ executeClientScripts: true, ... });
|
|
255
|
+
await stopHeadlessBrowser();
|
|
256
|
+
```
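
Since stopping the browser is left to the caller, one possible pattern is to wrap the fetches so that the browser is stopped even when one of them fails. The sketch below is an assumption about usage, not part of the engine: `withHeadlessBrowser` is a hypothetical helper that receives the launch and stop functions as parameters, so it makes no claim about their signatures.

```js
// Hypothetical helper: run `task` between `launch` and `stop`.
// `launch`, `stop` and `task` are asynchronous functions supplied by the caller.
async function withHeadlessBrowser(launch, stop, task) {
  await launch();

  try {
    return await task();
  } finally {
    await stop(); // Runs whether `task` resolved or threw.
  }
}
```

With the engine, `launch` and `stop` would be `launchHeadlessBrowser` and `stopHeadlessBrowser`, and `task` would be a function performing the `fetch` calls with `executeClientScripts: true`.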
|
|
257
|
+
|
|
258
|
+
The `fetch` module options are defined as a [`node-config` submodule](https://github.com/node-config/node-config/wiki/Sub-Module-Configuration). The default `fetcher` configuration can be overridden by adding a `fetcher` object to the [local configuration file](#configuration-file).
|
|
259
|
+
|
|
260
|
+
#### `filter`
|
|
261
|
+
|
|
262
|
+
The `filter` module transforms HTML or PDF content into a Markdown string according to a [declaration](#declarations).
|
|
263
|
+
|
|
264
|
+
```js
|
|
265
|
+
import filter from '@opentermsarchive/engine/filter';
|
|
266
|
+
```
|
|
267
|
+
|
|
268
|
+
The `filter` function documentation is available [as JSDoc](./src/archivist/filter/index.js).
|
|
269
|
+
|
|
270
|
+
#### `PageDeclaration`
|
|
271
|
+
|
|
272
|
+
The `PageDeclaration` class encapsulates information about a page tracked by Open Terms Archive.
|
|
273
|
+
|
|
274
|
+
```js
|
|
275
|
+
import pageDeclaration from '@opentermsarchive/engine/page-declaration';
|
|
276
|
+
```
|
|
277
|
+
|
|
278
|
+
The `PageDeclaration` format is defined [in source code](./src/archivist/services/pageDeclaration.js).
|
|
279
|
+
|
|
280
|
+
### Dataset generation
|
|
281
|
+
|
|
282
|
+
See the [`dataset` script documentation](./scripts/dataset/README.md).
|
|
283
|
+
|
|
284
|
+
## Configuring
|
|
285
|
+
|
|
286
|
+
### Configuration file
|
|
243
287
|
|
|
244
288
|
The default configuration can be found in `config/default.json`. The full reference is given below. You are unlikely to want to edit all of these elements.
|
|
245
289
|
|
|
@@ -276,7 +320,7 @@ The default configuration can be found in `config/default.json`. The full refere
|
|
|
276
320
|
"host": "SMTP server hostname",
|
|
277
321
|
"username": "User for server authentication" // Password for server authentication is defined in environment variables, see the “Environment variables” section below
|
|
278
322
|
},
|
|
279
|
-
"sendMailOnError": { // Can be set to `false` if
|
|
323
|
+
"sendMailOnError": { // Can be set to `false` if sending email on error is not needed
|
|
280
324
|
"to": "The address to send the email to in case of an error",
|
|
281
325
|
"from": "The address from which to send the email",
|
|
282
326
|
"sendWarnings": "Boolean. Set to true to also send email in case of warning",
|
|
@@ -299,15 +343,15 @@ The default configuration can be found in `config/default.json`. The full refere
|
|
|
299
343
|
}
|
|
300
344
|
```
|
|
301
345
|
|
|
302
|
-
The default configuration is merged with (and overridden by) environment-specific configuration that can be specified at startup with the `NODE_ENV` environment variable. For example,
|
|
346
|
+
The default configuration is merged with (and overridden by) environment-specific configuration that can be specified at startup with the `NODE_ENV` environment variable. For example, running `NODE_ENV=vagrant npm start` will load the `vagrant.json` configuration file. See [node-config](https://github.com/node-config/node-config) for more information about configuration files.
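
To make the "merged with (and overridden by)" behaviour concrete, here is a minimal sketch of a deep merge. It is an illustration only, not the `node-config` implementation, and the configuration values used are placeholders.

```js
const isPlainObject = value => value !== null && typeof value === 'object' && !Array.isArray(value);

// Return a new object where values from `overrides` replace those from `defaults`,
// recursing into nested objects so that untouched keys keep their default value.
function mergeConfig(defaults, overrides) {
  const merged = { ...defaults };

  for (const [ key, value ] of Object.entries(overrides)) {
    const bothObjects = isPlainObject(value) && isPlainObject(defaults[key]);

    merged[key] = bothObjects ? mergeConfig(defaults[key], value) : value;
  }

  return merged;
}

// Placeholder values standing in for `config/default.json` and an environment-specific file.
const defaults = { services: { declarationsPath: './declarations' }, fetcher: { exampleOption: 'default value' } };
const environment = { services: { declarationsPath: '../dating-declarations/declarations' } };

const merged = mergeConfig(defaults, environment);

console.log(merged.services.declarationsPath); // prints ../dating-declarations/declarations
console.log(merged.fetcher.exampleOption); // prints default value
```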
|
|
303
347
|
|
|
304
|
-
|
|
348
|
+
To have a local configuration that overrides all existing configuration, it is recommended to create a `config/development.json` file with the overridden values.
|
|
305
349
|
|
|
306
|
-
|
|
350
|
+
#### Storage repositories
|
|
307
351
|
|
|
308
352
|
Two storage repositories are currently supported: Git and MongoDB. Each one can be used independently for versions and snapshots.
|
|
309
353
|
|
|
310
|
-
|
|
354
|
+
##### Git
|
|
311
355
|
|
|
312
356
|
```json
|
|
313
357
|
{
|
|
@@ -326,11 +370,11 @@ Two storage repositories are currently supported: Git and MongoDB. Each one can
|
|
|
326
370
|
…
|
|
327
371
|
}
|
|
328
372
|
```
|
|
329
|
-
|
|
373
|
+
##### MongoDB
|
|
330
374
|
|
|
331
375
|
```json
|
|
332
376
|
{
|
|
333
|
-
|
|
377
|
+
…
|
|
334
378
|
"storage": {
|
|
335
379
|
"mongo": {
|
|
336
380
|
"connectionURI": "URI for defining connection to the MongoDB instance. See https://docs.mongodb.com/manual/reference/connection-string/",
|
|
@@ -342,7 +386,7 @@ Two storage repositories are currently supported: Git and MongoDB. Each one can
|
|
|
342
386
|
}
|
|
343
387
|
```
|
|
344
388
|
|
|
345
|
-
|
|
389
|
+
### Environment variables
|
|
346
390
|
|
|
347
391
|
Environment variables can be passed in the command-line or provided in a `.env` file at the root of the repository. See `.env.example` for an example of such a file.
|
|
348
392
|
|
|
@@ -350,89 +394,53 @@ Environment variables can be passed in the command-line or provided in a `.env`
|
|
|
350
394
|
- `SENDINBLUE_API_KEY`: a SendInBlue API key, in order to send email notifications with that service.
|
|
351
395
|
- `GITHUB_TOKEN`: a token with repository privileges to access the [GitHub API](https://github.com/settings/tokens).
|
|
352
396
|
|
|
353
|
-
If
|
|
354
|
-
|
|
355
|
-
### Running
|
|
356
|
-
|
|
357
|
-
To get the latest versions of all documents:
|
|
358
|
-
|
|
359
|
-
```
|
|
360
|
-
npm start
|
|
361
|
-
```
|
|
362
|
-
|
|
363
|
-
The latest version of a document will be available in the versions path defined in your configuration, under `$versions_folder/$service_provider_name/$document_type.md`.
|
|
364
|
-
|
|
365
|
-
To update documents automatically:
|
|
366
|
-
|
|
367
|
-
```
|
|
368
|
-
npm run start:scheduler
|
|
369
|
-
```
|
|
370
|
-
|
|
371
|
-
To get the latest version of a specific service's terms:
|
|
372
|
-
|
|
373
|
-
```
|
|
374
|
-
npm start -- --services <service_id>
|
|
375
|
-
```
|
|
376
|
-
|
|
377
|
-
> The service ID is the case sensitive name of the service declaration file without the extension. For example, for `Twitter.json`, the service ID is `Twitter`.
|
|
378
|
-
|
|
379
|
-
|
|
380
|
-
To get the latest version of a specific service's terms and document type:
|
|
381
|
-
|
|
382
|
-
```
|
|
383
|
-
npm start -- --services <service_id> --documentTypes <document_type>
|
|
384
|
-
```
|
|
385
|
-
|
|
386
|
-
To display help:
|
|
387
|
-
|
|
388
|
-
```
|
|
389
|
-
npm start -- --help
|
|
390
|
-
```
|
|
397
|
+
If an outgoing HTTP/HTTPS proxy is required to access the Internet, it can be provided through the `HTTP_PROXY` and `HTTPS_PROXY` environment variables.
|
|
391
398
|
|
|
392
399
|
## Deploying
|
|
393
400
|
|
|
394
|
-
See [
|
|
401
|
+
Deployment is managed with [Ansible](https://www.ansible.com). See the [Open Terms Archive deployment Ansible collection](https://github.com/OpenTermsArchive/ota.deployment-ansible-collection).
|
|
395
402
|
|
|
396
|
-
##
|
|
403
|
+
## Contributing
|
|
397
404
|
|
|
398
|
-
|
|
405
|
+
### Getting a copy
|
|
399
406
|
|
|
400
|
-
|
|
401
|
-
npm run dataset:generate
|
|
402
|
-
```
|
|
407
|
+
In order to edit the code of the engine itself, an editable and executable copy is necessary.
|
|
403
408
|
|
|
404
|
-
|
|
409
|
+
First of all, follow the [requirements](#requirements) above. Then, clone the repository:
|
|
405
410
|
|
|
406
|
-
```
|
|
407
|
-
|
|
411
|
+
```sh
|
|
412
|
+
git clone https://github.com/ambanum/OpenTermsArchive.git
|
|
413
|
+
cd OpenTermsArchive
|
|
408
414
|
```
|
|
409
415
|
|
|
410
|
-
|
|
416
|
+
Install dependencies:
|
|
411
417
|
|
|
418
|
+
```sh
|
|
419
|
+
npm install
|
|
412
420
|
```
|
|
413
|
-
npm run dataset:scheduler
|
|
414
|
-
```
|
|
415
|
-
|
|
416
|
-
## Contributing
|
|
417
421
|
|
|
418
|
-
|
|
422
|
+
### Testing
|
|
419
423
|
|
|
420
|
-
|
|
424
|
+
If changes are made to the engine, check that all parts covered by tests still work properly:
|
|
421
425
|
|
|
422
|
-
|
|
426
|
+
```sh
|
|
427
|
+
npm test
|
|
428
|
+
```
|
|
423
429
|
|
|
424
|
-
|
|
430
|
+
If existing features are changed or new ones are added, relevant tests must be added too.
|
|
425
431
|
|
|
426
|
-
|
|
432
|
+
### Suggesting changes
|
|
427
433
|
|
|
428
|
-
|
|
434
|
+
To contribute to the core engine of Open Terms Archive, see the [CONTRIBUTING](CONTRIBUTING.md) file of this repository. You will need knowledge of JavaScript and Node.js.
|
|
429
435
|
|
|
430
|
-
|
|
436
|
+
### Sponsorship and partnerships
|
|
431
437
|
|
|
438
|
+
Beyond individual contributions, we need funds and committed partners to pay for a core team to maintain and grow Open Terms Archive. If you know of opportunities, please let us know over email at `contact@[project name without spaces].org`!
|
|
432
439
|
|
|
433
|
-
|
|
440
|
+
- - -
|
|
434
441
|
|
|
435
442
|
## License
|
|
436
443
|
|
|
437
|
-
The code for this software is distributed under the European Union Public Licence (EUPL) v1.2.
|
|
438
|
-
|
|
444
|
+
The code for this software is distributed under the [European Union Public Licence (EUPL) v1.2](https://joinup.ec.europa.eu/collection/eupl/eupl-text-eupl-12). In short, this [means](https://choosealicense.com/licenses/eupl-1.2/) you are allowed to read, use, modify and redistribute this source code, as long as you credit “Open Terms Archive Contributors” and make available any change you make to it under similar conditions.
|
|
445
|
+
|
|
446
|
+
Contact the core team over email at `contact@[project name without spaces].org` if you have any specific need or question regarding licensing.
|