@opentermsarchive/engine 0.15.0 → 0.17.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (64)

package/package.json +7 -1
package/src/tracker/index.js +1 -1
package/.env.example +0 -3
package/.eslintrc.yaml +0 -116
package/.github/workflows/deploy.yml +0 -50
package/.github/workflows/release.yml +0 -64
package/.github/workflows/test.yml +0 -77
package/CHANGELOG.md +0 -14
package/CODE_OF_CONDUCT.md +0 -128
package/CONTRIBUTING.md +0 -143
package/MIGRATING.md +0 -42
package/Vagrantfile +0 -38
package/ansible.cfg +0 -13
package/decision-records/0001-service-name-and-id.md +0 -73
package/decision-records/0002-service-history.md +0 -212
package/decision-records/0003-snapshots-database.md +0 -123
package/ops/README.md +0 -280
package/ops/app.yml +0 -5
package/ops/infra.yml +0 -6
package/ops/inventories/dev.yml +0 -7
package/ops/inventories/production.yml +0 -27
package/ops/roles/infra/defaults/main.yml +0 -2
package/ops/roles/infra/files/.gitconfig +0 -3
package/ops/roles/infra/files/mongod.conf +0 -18
package/ops/roles/infra/files/ota-bot-key.private_key +0 -26
package/ops/roles/infra/tasks/main.yml +0 -78
package/ops/roles/infra/tasks/mongo.yml +0 -40
package/ops/roles/infra/templates/ssh_config.j2 +0 -5
package/ops/roles/ota/defaults/main.yml +0 -14
package/ops/roles/ota/files/.env +0 -21
package/ops/roles/ota/tasks/database.yml +0 -65
package/ops/roles/ota/tasks/main.yml +0 -110
package/ops/site.yml +0 -6
package/pm2.config.cjs +0 -20
package/test/fixtures/service_A.js +0 -22
package/test/fixtures/service_A_terms.md +0 -10
package/test/fixtures/service_A_terms_snapshot.html +0 -14
package/test/fixtures/service_B.js +0 -22
package/test/fixtures/service_with_declaration_history.js +0 -65
package/test/fixtures/service_with_filters_history.js +0 -155
package/test/fixtures/service_with_history.js +0 -188
package/test/fixtures/service_with_multipage_document.js +0 -100
package/test/fixtures/service_without_history.js +0 -31
package/test/fixtures/services.js +0 -19
package/test/fixtures/terms.pdf +0 -0
package/test/fixtures/termsFromPDF.md +0 -25
package/test/fixtures/termsModified.pdf +0 -0
package/test/services/service_A.json +0 -9
package/test/services/service_B.json +0 -9
package/test/services/service_with_declaration_history.filters.js +0 -7
package/test/services/service_with_declaration_history.history.json +0 -17
package/test/services/service_with_declaration_history.json +0 -13
package/test/services/service_with_filters_history.filters.history.js +0 -29
package/test/services/service_with_filters_history.filters.js +0 -7
package/test/services/service_with_filters_history.json +0 -13
package/test/services/service_with_history.filters.history.js +0 -29
package/test/services/service_with_history.filters.js +0 -7
package/test/services/service_with_history.history.json +0 -26
package/test/services/service_with_history.json +0 -17
package/test/services/service_with_multipage_document.filters.js +0 -7
package/test/services/service_with_multipage_document.history.json +0 -37
package/test/services/service_with_multipage_document.json +0 -28
package/test/services/service_without_history.filters.js +0 -7
package/test/services/service_without_history.json +0 -13

package/Vagrantfile DELETED

@@ -1,38 +0,0 @@
-# -*- mode: ruby -*-
-# vi: set ft=ruby :
-Vagrant.configure("2") do |config|
-  config.vm.hostname = "vagrant"
-  config.vm.box = "debian/bullseye64" # Unable to locate package mongodb-org
-  # in order to have the same config for both Docker and VirtualBox providers, we load the key manually
-  # if necessary, create the key with `ssh-keygen -f ~/.ssh/ota-vagrant -q -N ""`
-  # CAUTION: use of `~` in path causes problems with ssh
-  config.vm.provision "file", source: File.join(ENV['HOME'], ".ssh", "ota-vagrant.pub"), destination: "/home/vagrant/.ssh/authorized_keys"
-  # based on https://github.com/rofrano/vagrant-docker-provider#example-vagrantfile
-  config.vm.provider :docker do |docker, override|
-    override.vm.box = nil
-    docker.image = "rofrano/vagrant-provider:debian"
-    docker.remains_running = true
-    docker.has_ssh = true
-    docker.privileged = true
-    docker.volumes = ["/sys/fs/cgroup:/sys/fs/cgroup:rw"]
-    docker.create_args = ["--cgroupns=host"]
-    # python is not installed by default in the vagrant-provider image
-    # and deploying results in  /bin/sh: 1: /usr/bin/python: not found
-    # use a provision to fix that
-    # only with debian, no need with ubuntu
-    # Also need to name the provisioner, so that it runs only once https://github.com/hashicorp/vagrant/issues/7685#issuecomment-308281283
-    config.vm.provision "install_python3", type: "shell", inline: $installPython3
-  end
-end
-$installPython3 = <<-SCRIPT
-echo Updating apt...
-sudo apt-get update --fix-missing # Needed to fix "No package matching 'chromium' is available"
-echo Installing python...
-sudo apt-get --assume-yes install python3 python3-pip
-SCRIPT

package/ansible.cfg DELETED

@@ -1,13 +0,0 @@
-[defaults]
-inventory = ops/inventories/dev.yml
-roles_path = ops/roles
-# The two following lines allow to have human readable output
-# Use the YAML callback plugin.
-stdout_callback = yaml
-# Use the stdout_callback when running ad-hoc commands.
-bin_ansible_callbacks = true
-vault_password_file = vault.key

package/decision-records/0001-service-name-and-id.md DELETED

@@ -1,73 +0,0 @@
-# Choosing service name and service ID
-- Date: 2020-10-14
-## Context and Problem Statement
-To scale up from 50 to 5,000 services, we need a clear way to choose the service name and the service ID.
-### We need
-A name that reflects the common name used by the provider itself, to be exposed in a GUI. This name is currently exposed as the name property in the service declaration.
-An ID of sorts that can be represented in the filesystem. This ID is currently exposed as the filename of the service declaration, without the .json extension.
-### Use cases
-The service name is presented to end users. It should reflect as closely as possible the official service name, so that it can be identified easily.
-The ID is used internally and exposed for analysis. It should be easy to handle with scripts and other tools.
-### Constraints for the ID
-As long as this ID is stored in the filesystem:
-- No `/` for UNIX.
-- No `\` for Windows.
-- No `:` for APFS and HFS.
-- No case-sensitive duplicates to support case-insensitive filesystems.
-- No more than 255 characters to support transfer over [FAT32](https://en.wikipedia.org/wiki/File_Allocation_Table#FAT32).
-UTF, spaces and capitals are all supported, even on case-insensitive filesystems.
-### However
-- UTF in filenames can be [a (fixable) problem with Git and HFS+](https://stackoverflow.com/questions/5581857/git-and-the-umlaut-problem-on-mac-os-x).
-- UTF in filenames is quoted by default in Git, so that, for example, `été.txt` is displayed as `"\303\251t\303\251.txt"`.
-- Most online services align their brand name with their domain name. Even though UTF is now officially supported in domain names, support is limited and most services, even non-Western, have an official ASCII transliteration used at least in their domain name (e.g. “qq” by Tencent, “rzd.ru” for “РЖД”, “yahoo” for “Yahoo!”).
-- We currently use GitHub as a GUI, so the service ID is presented to the user instead of the service name. The name is used in email notifications.
-## Decision Outcome
-1. The service name should be the one used by the service itself, no matter the alphabet.
-- _Example: `туту.ру`_.
-2. We don't support non-ASCII characters in service IDs, at least as long as the database is Git and the filesystem, in order to minimise risk. Service IDs are derived from the service name through normalization into ASCII.
-- _Example: `туту.ру` → `tutu.ru`_.
-- _Example: `historielærer.dk` → `historielaerer.dk`_.
-- _Example: `RTÉ` → `RTE`_.
-3. We support punctuation, except characters that have meaning at filesystem level (`:`, `/`, `\`). These are replaced with a dash (`-`).
-- _Example: `Yahoo!` → `Yahoo!`_.
-- _Example: `Last.fm` → `Last.fm`_.
-- _Example: `re:start` → `re-start`_.
-- _Example: `we://` → `we---`_.
-4. We support capitals. Casing is expected to reflect the official service name casing.
-- _Example: `hi5` → `hi5`_.
-- _Example: `DeviantArt` → `DeviantArt`_.
-- _Example: `LINE` → `LINE`_.
-5. We support spaces. Spaces are expected to reflect the official service name spacing.
-- _Example: `App Store` → `App Store`_.
-- _Example: `DeviantArt` → `DeviantArt`_.
-6. We prefix the service name with the provider name, separated by a space, when self-references are ambiguous. For example, Facebook refers to their Self-serve Ads service simply as “Ads”, which we cannot use in a shared database. We thus call the service “Facebook Ads”.
-- _Example: `Ads` (by Facebook) → `Facebook Ads`_.
-- _Example: `Analytics` (by Google) → `Google Analytics`_.
-- _Example: `Firebase` (by Google) → `Firebase`_.
-- _Example: `App Store` (by Apple) → `App Store`_.
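
The ID rules in this decision record can be illustrated with a minimal sketch. The function `toServiceId` and the `LIGATURES` table are hypothetical names introduced here, not part of `@opentermsarchive/engine`. The sketch only covers rules 2 to 5 for Latin-script names: it strips diacritics, expands a few ligatures, and replaces `:`, `/` and `\` with a dash. Transliterating other alphabets (such as `туту.ру` → `tutu.ru`) and the provider prefix of rule 6 are out of scope and would need a dedicated table or human judgment.

```javascript
// Hypothetical helper, not the engine's implementation.
// Letters that have no Unicode decomposition and need an explicit ASCII form.
const LIGATURES = { æ: 'ae', Æ: 'AE', œ: 'oe', Œ: 'OE', ø: 'o', Ø: 'O', ß: 'ss' };

function toServiceId(serviceName) {
  return serviceName
    // Split accented letters into base letter + combining mark ("É" becomes "E" + U+0301)...
    .normalize('NFD')
    // ...then drop the combining marks.
    .replace(/[\u0300-\u036f]/g, '')
    // Expand letters that NFD leaves untouched.
    .replace(/[æÆœŒøØß]/g, char => LIGATURES[char])
    // Characters with a meaning at filesystem level become dashes.
    .replace(/[:/\\]/g, '-');
}

console.log(toServiceId('historielærer.dk'));
console.log(toServiceId('re:start'));
```

Casing, spaces and other punctuation are deliberately left untouched, as required by rules 3 to 5.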

package/decision-records/0002-service-history.md DELETED

@@ -1,212 +0,0 @@
-# Defining a service history system
-- Date: 2020-11-23
-## Context and Problem Statement
-We need to be able to regenerate versions from snapshots. As documents are expected to change over time (location or filters), we can't rely on the latest version of the declaration to regenerate a version from an old snapshot. So we need a system to keep track of declaration changes. This is what we call declarations and filters versioning.
-## Solutions considered
-At this time, we see three solutions which have in common the following rules:
-- `history` is optional
-- the current valid declaration has no date and should be clearly identifiable
-- the `valid_until` date is an inclusive expiration date. It should be the exact authored date of the last snapshot commit for which the declaration is still valid.
-## Option 1: Add a `history` field in service declaration
-In `services/ASKfm.json`:
-```
-{
-  "name": "ASKfm",
-  "documents": {
-    "Terms of Service": {
-      "fetch": "https://ask.fm/docs/terms_of_use/?lang=en",
-      "select": ".selection",
-      "filter": [ "add" ],
-      "history": [
-        {
-          "fetch": "https://ask.fm/docs/terms_of_use/?lang=en",
-          "select": "body",
-          "filter": [ "add" ],
-          "valid_until": "2020-08-24T14:02:39Z"
-        },
-        {
-          "fetch": "https://ask.fm/docs/terms_of_use/?lang=en",
-          "select": "body",
-          "valid_until": "2020-08-23T14:02:39Z"
-        }
-      ]
-    }
-  }
-}
-```
-Note: When no historisation is needed the file may have no mention of history.
-**Pros:**
-- Everything is in the same file:
-  - might prevent forgetting to update the existing history
-  - might help users learn that history exists and encourage them to learn about it if they feel the need
-  - no (pseudo-)hidden knowledge about history
-**Cons:**
-- Apparent complexity can discourage new contributors
-- With time, the file can become huge
-## Option 2: Add a `serviceId.history.json` file
-In `services/ASKfm.json`:
-```
-{
-  "name": "ASKfm",
-  "documents": {
-    "Terms of Service": {
-      "fetch": "https://ask.fm/docs/terms_of_use/?lang=en",
-      "select": ".selection",
-      "filter": [ "add" ]
-    }
-  }
-}
-```
-In `services/ASKfm.history.json`:
-```
-{
-  "name": "ASKfm",
-  "documents": {
-    "Terms of Service": [
-      {
-        "fetch": "https://ask.fm/docs/terms_of_use/?lang=en",
-        "select": "body",
-        "filter": [ "add" ],
-        "valid_until": "2020-08-24T14:02:39Z"
-      },
-      {
-        "fetch": "https://ask.fm/docs/terms_of_use/?lang=en",
-        "select": "body",
-        "valid_until": "2020-08-23T14:02:39Z"
-      }
-    ]
-  }
-}
-```
-**Pros:**
-- Service declarations stay small and simple
-- The history file is kept close to the service declaration, so users might see it
-**Cons:**
-- Makes history capabilities harder to discover
-- Increases the probability of forgetting to update the history file when changing the service declaration
-## Option 2A
-Same as option 2, but the history file should only contain the document declarations to avoid divergence on service properties with the one in the original file.
-In `services/ASKfm.json`, **called the “service declaration”**:
-```
-{
-  "name": "ASKfm",
-  "documents": {
-    "Terms of Service": {
-      "fetch": "https://ask.fm/docs/terms_of_use/?lang=en",
-      "select": ".selection",
-      "filter": [ "add" ]
-    }
-  }
-}
-```
-In `services/ASKfm.history.json`, **called the “service history”**:
-```
-{
-  "Terms of Service": [
-    {
-      "fetch": "https://ask.fm/docs/terms_of_use/?lang=en",
-      "select": "body",
-      "filter": [ "add" ],
-      "valid_until": "2020-08-24T14:02:39Z"
-    },
-    {
-      "fetch": "https://ask.fm/docs/terms_of_use/?lang=en",
-      "select": "body",
-      "valid_until": "2020-08-23T14:02:39Z"
-    }
-  ]
-}
-```
-## Option 3: Add a history service declaration file in `services/history` folder
-In `services/ASKfm.json`:
-```
-{
-  "name": "ASKfm",
-  "documents": {
-    "Terms of Service": {
-      "fetch": "https://ask.fm/docs/terms_of_use/?lang=en",
-      "select": ".selection",
-      "filter": [ "add" ]
-    }
-  }
-}
-```
-In `services/history/ASKfm.json`:
-```
-{
-  "name": "ASKfm",
-  "documents": {
-    "Terms of Service": [
-      {
-        "fetch": "https://ask.fm/docs/terms_of_use/?lang=en",
-        "select": "body",
-        "filter": [ "add" ],
-        "valid_until": "2020-08-24T14:02:39Z"
-      },
-      {
-        "fetch": "https://ask.fm/docs/terms_of_use/?lang=en",
-        "select": "body",
-        "valid_until": "2020-08-23T14:02:39Z"
-      }
-    ]
-  }
-}
-```
-**Pros:**
-- Service declarations stay small and simple
-- All history updates are reserved for knowledgeable users, who might act as gatekeepers
-**Cons:**
-- All history updates are reserved for knowledgeable users, who might act as gatekeepers :)
-- Requires relying on knowledgeable people to maintain the history
-## Some thoughts
-### Community
-The choice might have implications for the community that will grow around the project.
-_Option 1_ shows everything to everyone. It might frighten some contributors with its apparent complexity (once there is history in the declaration file), but it might also encourage them to learn about it if they want or feel the need to. All contributors will share the same view and knowledge of the system. This might encourage collaboration between them to learn and improve together.
-_Option 2_ and _Option 3_ hide the complexity of history management in separate files, and only the most adventurous contributors will find them by themselves. Contributions to those files will probably be made by specific contributors who will be taught to manage those files. This creates two different kinds of contributors: those who stay with the basic service declaration, not knowing that more complex options exist, and those who know about history management, whose work might stay in the shadows or who might act as gatekeepers.
-## Decision Outcome
-[After consulting the community](https://github.com/ambanum/OpenTermsArchive/issues/156), Option 2A is retained, as it hides the complexity of the history (compared to Option 1) while increasing its discoverability (compared to Option 3) for contributors who might become more “adventurous”.
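
To make the role of the inclusive `valid_until` date concrete, here is a minimal sketch of how a declaration could be selected for a given snapshot under Option 2A. The function name `declarationAt` and its signature are assumptions made for illustration; the engine's actual lookup may differ. The sample data is the ASKfm example from the decision record.

```javascript
// Hypothetical helper, not the engine's implementation.
// `current` is the document declaration from the service declaration file,
// `history` is the array for that document from the service history file (optional).
function declarationAt(current, history, snapshotDate) {
  const date = new Date(snapshotDate);

  const stillValid = (history || [])
    // `valid_until` is an inclusive expiration date, hence `<=`.
    .filter(entry => date <= new Date(entry.valid_until))
    // Among entries not yet expired, the one expiring soonest was in force.
    .sort((a, b) => new Date(a.valid_until) - new Date(b.valid_until));

  // No historical entry covers the date: the current declaration applies.
  return stillValid[0] || current;
}

const current = {
  fetch: 'https://ask.fm/docs/terms_of_use/?lang=en',
  select: '.selection',
  filter: ['add'],
};

const history = [
  {
    fetch: 'https://ask.fm/docs/terms_of_use/?lang=en',
    select: 'body',
    filter: ['add'],
    valid_until: '2020-08-24T14:02:39Z',
  },
  {
    fetch: 'https://ask.fm/docs/terms_of_use/?lang=en',
    select: 'body',
    valid_until: '2020-08-23T14:02:39Z',
  },
];

console.log(declarationAt(current, history, '2020-08-24T00:00:00Z').select);
```

A snapshot authored exactly at `2020-08-23T14:02:39Z` still resolves to the oldest entry, because the expiration date is inclusive. A snapshot authored after the latest `valid_until` falls back to the current declaration.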

package/decision-records/0003-snapshots-database.md DELETED

@@ -1,123 +0,0 @@
-# Determining an appropriate database system to store snapshots
-- Date: 2021-10-20
-## Context and Problem Statement
-### Context
-The Versions repository has several purposes:
-- Display differences between two versions, in particular when users receive a notification of change, so that they can simply see the changes.
-- Explore significant changes in tracked documents.
-- Offer a corpus of the latest versions of all the documents of the monitored services.
-- Serve as a dataset for research.
-It is therefore important that the repository constitutes a quality dataset, to provide relevant information to users.
-For this purpose, the following constraints are considered necessary:
-- Versions must be ordered chronologically, so that navigation through the history of a document is intuitive.
-- Versions should not contain noise, only significant changes.
-- Each version must contain a link to the snapshot that was used to generate it.
-Currently, the following problems with the repository of Versions are identified:
-- Noise in the versions: URL or structure changes in the tracked documents.
-- Presence of refilter commits: related to URL and selector updates in service declarations or to Open Terms Archive code evolution.
-- Presence of commits due to code changes: type renaming, service renaming, documentation changes in the repository.
-- Presence of unordered commits: consequence of the import of the ToSBack history in snapshots or to the import of snapshots corresponding to archived documents provided by the services themselves.
-The solution considered in order to provide a quality dataset therefore consists of being able to regenerate the `versions` from the `snapshots`. This is what we call rewriting history.
-#### Rewriting history
-To rewrite history, we go through the snapshot commits one by one after reordering them (in memory) and we create a version commit each time, avoiding commits corresponding to noise and performing any renaming.
-This implies being able to version the service filters (used to generate the version from the snapshot).
-See https://github.com/ambanum/OpenTermsArchive/issues/156.
-### Problem
-Currently, `git` is used as database for storing snapshots and versions.
-One year ago, the process of rewriting history was estimated to take about 16 hours for 100,000 commits. It was also noted that the time does not grow linearly: the more commits there are in `snapshots`, the more the average time per commit increases.
-It appears that the most costly operation is accessing the contents of a commit (checkout).
-It also appears that the older the commit is in the git history, the longer this operation takes.
-> For example, on a history containing about 100,000 commits, accessing the contents of the oldest commit takes about 1,000 ms while accessing the most recent commit takes only 100 ms.
-At the date of this document, the number of commits approaches one million, and iterating over these snapshots to rewrite the versions history currently takes about 3 months.
-Also, `git` stores data in a hash tree in the form of chronologically ordered commits. Inserting snapshots into the history therefore requires rewriting the whole snapshots history, which takes as long as reading it.
-As described previously, we need to be able to regenerate versions from snapshots (for example to [rename services](https://github.com/ambanum/OpenTermsArchive/issues/314)) and to be able to insert snapshots in the history (for example to [import databases](https://github.com/ambanum/OpenTermsArchive/pull/214)).
-**This cannot take 6 months.**
-Moreover, as the number of snapshots will keep on growing, we need a system which allows scaling, potentially across multiple servers.
-Thus, we need a database management system meeting the following requirements:
-- Access time to a snapshot should be constant and independent from its authoring date.
-- Inserting time of a snapshot should be constant and independent from its authoring date.
-- Support concurrent access.
-- Scale horizontally.
-### Solutions considered
-#### 1. Keep the system under git
-##### Splitting into sub-repos
-Since accessing the contents of a commit takes longer the older it is in the history considered, the idea would be to work successively on ordered subsets of this history.
-This means truncating the history, browsing the remaining commits and regenerating the corresponding versions. Then creating another subset of the history which contains an arbitrary number of commits following the commits already browsed and perform the processing.
-To create a history subset with git:
-- Create a clone of a subset of N commits from the local snapshot: `git clone --depth <N> "file://local/path/snapshots" snapshots-tmp` with `N` corresponding to the position of the first commit you want in the block relative to the last commit in the history
-- Remove all commits older than the last commit you want to keep in the block: `git reset --hard <sha>` with `sha` corresponding to the id of the last commit you want to have in the block.
-- Clean up git to ensure that history navigation is efficient: `git gc`.
-So we need to split the history into chronologically ordered blocks, which leads us to the next problem.
-##### Splitting and reordering blocks of snapshots
-Because snapshot commits are unordered, we can't simply create blocks of a fixed size from the git history (otherwise we'd process commits out of order).
-It is necessary to create blocks whose commits are ordered within the block but also in relation to the other blocks: for example, all the commits of the first block processed must be older than the commits of all the other blocks.
-The solution would be to create blocks in order: from the git history, we look for commits that are not in their place (whose date is earlier than that of its predecessor).
-Each of these commits represents the first commit of a block. The block extends up to the commit just before the starting point of the next block.
-We thus obtain blocks whose commits are ordered.
-We still have to order the blocks relative to each other (note that a block may need to be cut so that another can be placed).
-These chronologically ordered commit blocks, without overlap, can then be used with the previous approach (it may be necessary to re-split these blocks so that they have a reasonable size).
-#### 2. Move snapshots to a document-oriented database
-The idea of this solution is to keep the `versions` under git in order to continue to enjoy the benefits that GitHub provides in terms of browsing and viewing diffs, but to save the snapshots in a database, since we don't really need to browse the snapshots via a graphical interface nor to see the diff between two snapshots, which would allow us to be able to access the content more efficiently.
-MongoDB seems to meet the constraints:
-- It natively allows horizontal scaling with [replica sets](https://docs.mongodb.com/manual/replication/) and [sharding](https://docs.mongodb.com/manual/sharding/).
-- It supports concurrent access.
-- It has [In-Memory storage engine](https://docs.mongodb.com/manual/core/inmemory/) as an option for performance.
-We also did a simple test to ensure that access time and insert time also meet the requirements. We populated a database with one million entries and tried accessing snapshots with random dates, and we found that access times remained stable. In our test of 1,000 sequential accesses to random snapshots, the average access time was ~3.5ms with a maximum of ~50ms.
-Moreover, MongoDB has the following benefits:
-- Easy to use: offers a query syntax simpler than SQL and has a quick learning curve, especially for JavaScript developers.
-- Flexible and adaptable: allows managing data of any structure, not just tabular structures defined in advance.
-- Widely used in the JavaScript ecosystem.
-As a downside, joining documents in MongoDB is no easy task and pulling data from several collections requires a number of queries, which will lead to long turn-around times. This is not a problem in our case as we do not currently envision a need for such complex queries.
-## Decision Outcome
-As MongoDB meets the requirements it is retained as a solution.
-### Benchmark
-With the MongoDB implementation, refilter takes about 3 minutes, where it took about 1 hour 20 minutes with the Git version.
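
For reference, the first step of the rejected Git approach described under “Splitting and reordering blocks of snapshots” can be sketched as follows. The function `splitIntoOrderedBlocks` and the `{ sha, date }` commit shape are assumptions made for illustration. The sketch only cuts the history into internally ordered blocks. Ordering the blocks relative to each other, and cutting blocks that overlap, is not covered.

```javascript
// Hypothetical helper, not the engine's implementation.
// `commits` is the snapshot history in Git order, each item shaped as { sha, date }.
function splitIntoOrderedBlocks(commits) {
  const blocks = [];

  for (const commit of commits) {
    const currentBlock = blocks[blocks.length - 1];
    const previous = currentBlock && currentBlock[currentBlock.length - 1];

    // A commit dated earlier than its predecessor is "not in its place":
    // it becomes the first commit of a new block.
    if (!previous || Date.parse(commit.date) < Date.parse(previous.date)) {
      blocks.push([commit]);
    } else {
      currentBlock.push(commit);
    }
  }

  return blocks;
}

const blocks = splitIntoOrderedBlocks([
  { sha: 'a', date: '2020-01-03T00:00:00Z' },
  { sha: 'b', date: '2020-01-04T00:00:00Z' },
  { sha: 'c', date: '2020-01-01T00:00:00Z' }, // earlier than its predecessor: starts a new block
  { sha: 'd', date: '2020-01-02T00:00:00Z' },
  { sha: 'e', date: '2020-01-05T00:00:00Z' },
]);

console.log(blocks.map(block => block.map(commit => commit.sha).join('')));
```

In this sample, the second block spans dates both before and after the first one, which is the overlap case where a block has to be cut before the blocks can be ordered.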