@tricoteuses/senat 2.18.11 → 2.18.13

This diff shows the content of package versions that have been publicly released to one of the supported registries. It is provided for informational purposes only and reflects the changes between the two versions as they appear in their respective public registries.
package/LICENSE.md CHANGED
@@ -1,22 +1,22 @@
- # Tricoteuses-Senat
-
- ## _Handle French Sénat's open data_
-
- By: Emmanuel Raviart <mailto:emmanuel@raviart.com>
-
- Copyright (C) 2019, 2020, 2021 Emmanuel Raviart
-
- https://git.tricoteuses.fr/logiciels/tricoteuses-senat
-
- > Tricoteuses-Senat is free software; you can redistribute it and/or modify
- > it under the terms of the GNU Affero General Public License as
- > published by the Free Software Foundation, either version 3 of the
- > License, or (at your option) any later version.
- >
- > Tricoteuses-Senat is distributed in the hope that it will be useful,
- > but WITHOUT ANY WARRANTY; without even the implied warranty of
- > MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- > GNU Affero General Public License for more details.
- >
- > You should have received a copy of the GNU Affero General Public License
- > along with this program. If not, see <http://www.gnu.org/licenses/>.
+ # Tricoteuses-Senat
+
+ ## _Handle French Sénat's open data_
+
+ By: Emmanuel Raviart <mailto:emmanuel@raviart.com>
+
+ Copyright (C) 2019, 2020, 2021 Emmanuel Raviart
+
+ https://git.tricoteuses.fr/logiciels/tricoteuses-senat
+
+ > Tricoteuses-Senat is free software; you can redistribute it and/or modify
+ > it under the terms of the GNU Affero General Public License as
+ > published by the Free Software Foundation, either version 3 of the
+ > License, or (at your option) any later version.
+ >
+ > Tricoteuses-Senat is distributed in the hope that it will be useful,
+ > but WITHOUT ANY WARRANTY; without even the implied warranty of
+ > MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ > GNU Affero General Public License for more details.
+ >
+ > You should have received a copy of the GNU Affero General Public License
+ > along with this program. If not, see <http://www.gnu.org/licenses/>.
package/README.md CHANGED
@@ -1,120 +1,123 @@
- # Tricoteuses-Senat
-
- ## _Retrieve, clean up & handle French Sénat's open data_
-
- ## Requirements
-
- - Node >= 22
-
- ## Installation
-
- ```bash
- git clone https://git.tricoteuses.fr/logiciels/tricoteuses-senat
- cd tricoteuses-senat/
- ```
-
- Create a `.env` file to set PostgreSQL database informations and other configuration variables (you can use `example.env` as a template). Then
-
- ```bash
- npm install
- ```
-
- ### Database creation (not needed if downloading with Docker image)
-
- #### Using Docker
-
- ```bash
- docker run --name local-postgres -d -p 5432:5432 -e POSTGRES_PASSWORD=$YOUR_CUSTOM_DB_PASSWORD postgres
- # Default Postgres user is postgres
- # But scripts require an "opendata" role
- docker exec -it local-postgres psql -U postgres -c "CREATE ROLE opendata;"
- ```
-
- ## Download data
-
- Create a folder where the data will be downloaded and run the following command to download the data and convert it into JSON files.
-
- ```bash
- mkdir ../senat-data/
-
- # Available options for optional `categories` parameter : All, Ameli, Debats, DosLeg, Questions, Sens
- npm run data:download ../senat-data -- [--categories All]
- ```
-
- Data from other sources is also available :
-
- ```bash
- # Retrieval of textes and rapports from Sénat's website
- # Available options for optional `formats` parameter : xml, html, pdf
- # Available options for optional `types` parameter : textes, rapports
- npm run data:retrieve_documents ../senat-data -- --fromSession 2022 [--formats xml pdf] [--types textes]
-
- # Retrieval & parsing (textes in xml format only for now)
- npm run data:retrieve_documents ../senat-data -- --fromSession 2022 --parseDocuments
-
- # Parsing only
- npm run data:parse_textes_lois ../senat-data
-
- # Retrieval (& parsing) of agenda from Sénat's website
- npm run data:retrieve_agenda ../senat-data -- --fromSession 2022 [--parseAgenda]
-
- # Retrieval (& parsing) of comptes-rendus de séance from Sénat's data
- npm run data:retrieve_cr_seance ../senat-data -- [--parseDebats]
-
- # Retrieval (& parsing) of comptes-rendus de commissions from Sénat's website
- npm run data:retrieve_cr_commission ../senat-data -- [--parseDebats]
-
- # Retrieval of sénateurs' pictures from Sénat's website
- npm run data:retrieve_senateurs_photos ../senat-data
- ```
-
- ## Data download using Docker
-
- A Docker image that downloads and converts the data all at once is available. Build it locally or run it from the container registry.
- Use the environment variables `FROM_SESSION` and `CATEGORIES` if needed.
-
- ```bash
- docker run --pull always --name tricoteuses-senat -v ../senat-data:/app/senat-data -d git.tricoteuses.fr/logiciels/tricoteuses-senat:latest
- ```
-
- Use the environment variable `CATEGORIES` and `FROM_SESSION` if needed.
-
- ## Using the data
-
- Once the data is downloaded, you can use loaders to retrieve it.
- To use loaders in your project, you can install the _@tricoteuses/senat_ package, and import the iterator functions that you need.
-
- ```bash
- npm install @tricoteuses/senat
- ```
-
- ```js
- import { iterLoadSenatQuestions } from "@tricoteuses/senat/loaders"
-
- // Pass data directory and legislature as arguments
- for (const { item: question } of iterLoadSenatQuestions("../senat-data", 17)) {
- console.log(question.id)
- }
- ```
-
- ## Generation of raw types from SQL schema (for contributors only)
-
- ```bash
- npm run data:generate_schemas ../senat-data
- ```
-
- ## Publishing
-
- To publish a new version of this package onto npm, bump the package version and publish.
-
- ```bash
- npm version x.y.z # Bumps version in package.json and creates a new tag x.y.z
- npx tsc
- npm publish
- ```
-
- The Docker image will be automatically built during a CI Workflow if you push the tag to the remote repository.
-
- ```bash
- git push --tags
- ```
+ # Tricoteuses-Senat
+
+ ## _Retrieve, clean up & handle French Sénat's open data_
+
+ ## Requirements
+
+ - Node >= 22
+
+ ## Installation
+
+ ```bash
+ git clone https://git.tricoteuses.fr/logiciels/tricoteuses-senat
+ cd tricoteuses-senat/
+ ```
+
+ Create a `.env` file to set PostgreSQL database informations and other configuration variables (you can use `example.env` as a template). Then
+
+ ```bash
+ npm install
+ ```
+
+ ### Database creation (not needed if downloading with Docker image)
+
+ #### Using Docker
+
+ ```bash
+ docker run --name local-postgres -d -p 5432:5432 -e POSTGRES_PASSWORD=$YOUR_CUSTOM_DB_PASSWORD postgres
+ # Default Postgres user is postgres
+ # But scripts require an "opendata" role
+ docker exec -it local-postgres psql -U postgres -c "CREATE ROLE opendata;"
+ ```
+
+ ## Download data
+
+ Create a folder where the data will be downloaded and run the following command to download the data and convert it into JSON files.
+
+ ```bash
+ mkdir ../senat-data/
+
+ # Available options for optional `categories` parameter : All, Ameli, Debats, DosLeg, Questions, Sens
+ npm run data:download ../senat-data -- [--categories All]
+ ```
+
+ Data from other sources is also available :
+
+ ```bash
+ # Retrieval of textes and rapports from Sénat's website
+ # Available options for optional `formats` parameter : xml, html, pdf
+ # Available options for optional `types` parameter : textes, rapports
+ npm run data:retrieve_documents ../senat-data -- --fromSession 2022 [--formats xml pdf] [--types textes]
+
+ # Retrieval & parsing (textes in xml format only for now)
+ npm run data:retrieve_documents ../senat-data -- --fromSession 2022 --parseDocuments
+
+ # Parsing only
+ npm run data:parse_textes_lois ../senat-data
+
+ # Retrieval (& parsing) of agenda from Sénat's website
+ npm run data:retrieve_agenda ../senat-data -- --fromSession 2022 [--parseAgenda]
+
+ # Retrieval (& parsing) of comptes-rendus de séance from Sénat's data
+ npm run data:retrieve_cr_seance ../senat-data -- [--parseDebats]
+
+ # Retrieval (& parsing) of comptes-rendus de commissions from Sénat's website
+ npm run data:retrieve_cr_commission ../senat-data -- [--parseDebats]
+
+ # Retrieval of sénateurs' pictures from Sénat's website
+ npm run data:retrieve_senateurs_photos ../senat-data
+ ```
+
+ ## Data download using Docker
+
+ A Docker image that downloads and converts the data all at once is available. Build it locally or run it from the container registry.
+ Use the environment variables `FROM_SESSION` and `CATEGORIES` if needed.
+
+ ```bash
+ docker run --pull always --name tricoteuses-senat -v ../senat-data:/app/senat-data -d git.tricoteuses.fr/logiciels/tricoteuses-senat:latest
+ ```
+
+ Use the environment variable `CATEGORIES` and `FROM_SESSION` if needed.
+
+ ## Using the data
+
+ Once the data is downloaded, you can use loaders to retrieve it.
+ To use loaders in your project, you can install the _@tricoteuses/senat_ package, and import the iterator functions that you need.
+
+ ```bash
+ npm install @tricoteuses/senat
+ ```
+
+ ```js
+ import { iterLoadSenatQuestions } from "@tricoteuses/senat/loaders"
+
+ // Pass data directory and legislature as arguments
+ for (const { item: question } of iterLoadSenatQuestions("../senat-data", 17)) {
+ console.log(question.id)
+ }
+ ```
+
+ ## Generation of raw types from SQL schema (for contributors only)
+
+ ```bash
+ npm run data:generate_schemas ../senat-data
+ ```
+
+ ## Publishing
+
+ To publish a new version of this package onto npm, bump the package version and publish.
+
+ ```bash
+ # Increment version and create a new Git tag automatically
+ npm version patch # +0.0.1 → small fixes
+ npm version minor # +0.1.0 → new features
+ npm version major # +1.0.0 → breaking changes
+ npx tsc
+ npm publish
+ ```
+
+ The Docker image will be automatically built during a CI Workflow if you push the tag to the remote repository.
+
+ ```bash
+ git push --tags
+ ```
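
The README's `.env` step feeds the database checks that appear later in this diff (`retrieve_open_data` asserts that `PGHOST`, `PGPORT`, `PGUSER` and `PGPASSWORD` are set). A minimal sketch of how those variables are consumed, assuming `example.env` exposes them under these libpq names; the connection string at the end is illustrative only:

```js
// Sketch only: variable names taken from the assertion in retrieve_open_data.
import dotenv from "dotenv"

dotenv.config() // loads .env from the working directory
const { PGHOST, PGPORT, PGUSER, PGPASSWORD } = process.env
if (!(PGHOST && PGPORT && PGUSER && PGPASSWORD)) {
  throw new Error(
    "Missing database configuration: set PGHOST, PGPORT, PGUSER and PGPASSWORD (or TRICOTEUSES_SENAT_DB_*) in .env"
  )
}
console.log(`Would connect to postgresql://${PGUSER}@${PGHOST}:${PGPORT}/senat`)
```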
@@ -116,8 +116,7 @@ const findAllAmendementsQuery = dbSenat
  "ameli.com_ameli.lil as au_nom_de_commission",
  eb.case().when("ameli.cab.entid", "is not", null).then(true).else(false).end().as("auteur_est_gouvernement"),
  auteurs(ref("ameli.amd.id")).as("auteurs"),
- ])
- .distinctOn("ameli.amd.id");
+ ]);
  export function findAllAmendements(fromSession) {
  if (fromSession !== undefined) {
  return findAllAmendementsQuery.where("ameli.ses.ann", ">=", fromSession).stream();
@@ -1,11 +1,13 @@
  import assert from "assert";
- import { execSync, spawn } from "child_process";
+ import { execSync } from "child_process";
  import commandLineArgs from "command-line-args";
  import fs from "fs-extra";
+ // import fetch from "node-fetch"
  import path from "path";
  import StreamZip from "node-stream-zip";
+ import readline from "readline";
  import windows1252 from "windows-1252";
- import { pipeline, Transform } from "stream";
+ import { pipeline } from "stream";
  import { promisify } from "util";
  import config from "../config";
  import { getChosenDatasets, getEnabledDatasets } from "../datasets";
@@ -67,88 +69,60 @@ async function downloadFile(url, dest) {
  }
  /**
  * Copy a dataset database to the main Senat database (overwriting its contents).
- * Optimized to combine encoding repair and schema transformation in a single pass.
  */
  async function copyToSenat(dataset, dataDir, options) {
  if (!options["silent"]) {
  console.log(`Copying ${dataset.database} to Senat database...`);
  }
  const sqlFilePath = path.join(dataDir, `${dataset.database}.sql`);
- // Helper function to replace 'public' schema outside single-quoted strings
- function replacePublicOutsideStrings(line, schema) {
- const parts = line.split(/(')/);
- let inString = false;
- for (let i = 0; i < parts.length; i++) {
- if (parts[i] === "'") {
- inString = !inString;
- }
- else if (!inString) {
- parts[i] = parts[i].replace(/\bpublic\b(?=(\s*\.|\s*[,;]|\s|$))/g, schema);
- }
- }
- return parts.join('');
- }
- // Spawn psql process
- const psqlArgs = options["sudo"]
- ? ["-u", options["sudo"], "psql", "--quiet", "-d", "senat"]
- : ["--quiet", "-d", "senat"];
- const psql = spawn(options["sudo"] ? "sudo" : "psql", psqlArgs, {
- stdio: ["pipe", "ignore", "pipe"],
- env: process.env,
+ const schemaDumpFile = path.join(dataDir, `${dataset.database}_schema_dump.sql`);
+ // Write the header and then stream the rest of the SQL file
+ const schemaSqlWriter = fs.createWriteStream(schemaDumpFile, { encoding: "utf8" });
+ // Add CREATE SCHEMA statement at the top
+ schemaSqlWriter.write(`CREATE SCHEMA IF NOT EXISTS ${dataset.database};\n`);
+ const lineReader = readline.createInterface({
+ input: fs.createReadStream(sqlFilePath, { encoding: "utf8" }),
+ crlfDelay: Infinity,
  });
- psql.stdin.write(`DROP SCHEMA IF EXISTS ${dataset.database} CASCADE;\n`);
- psql.stdin.write(`CREATE SCHEMA IF NOT EXISTS ${dataset.database};\n`);
- let buffer = '';
- const combinedTransform = new Transform({
- transform(chunk, encoding, callback) {
- // Encoding repair if needed (decode from latin1 and fix Windows-1252 characters)
- let data = dataset.repairEncoding
- ? chunk.toString('latin1').replace(badWindows1252CharacterRegex, (match) => windows1252.decode(match, { mode: "fatal" }))
- : chunk.toString();
- buffer += data;
- const lines = buffer.split('\n');
- buffer = lines.pop() || '';
- let processedData = '';
- for (const line of lines) {
- let newLine = replacePublicOutsideStrings(line, dataset.database);
- newLine = newLine.replace(/SET client_encoding = 'LATIN1';/i, "SET client_encoding = 'UTF8';");
- processedData += newLine + '\n';
- }
- callback(null, processedData);
- },
- flush(callback) {
- // Process any remaining data in buffer
- if (buffer) {
- let newLine = replacePublicOutsideStrings(buffer, dataset.database);
- newLine = newLine.replace(/SET client_encoding = 'LATIN1';/i, "SET client_encoding = 'UTF8';");
- callback(null, newLine);
- }
- else {
- callback();
+ for await (const line of lineReader) {
+ let newLine = line;
+ // Replace 'public' schema outside single-quoted strings
+ function replacePublicOutsideStrings(line, schema) {
+ const parts = line.split(/(')/);
+ let inString = false;
+ for (let i = 0; i < parts.length; i++) {
+ if (parts[i] === "'") {
+ inString = !inString;
+ }
+ else if (!inString) {
+ // Only replace outside of strings, including before comma
+ parts[i] = parts[i].replace(/\bpublic\b(?=(\s*\.|\s*[,;]|\s|$))/g, schema);
+ }
  }
+ return parts.join('');
  }
- });
- let stderrData = '';
- psql.stderr.on('data', (data) => {
- stderrData += data.toString();
- });
- const pipelinePromise = streamPipeline(fs.createReadStream(sqlFilePath, {
- encoding: dataset.repairEncoding ? undefined : "utf8",
- highWaterMark: 4 * 1024 * 1024
- }), combinedTransform, psql.stdin);
- await pipelinePromise;
- return new Promise((resolve, reject) => {
- psql.on("close", (code) => {
- if (code === 0) {
- resolve();
+ newLine = replacePublicOutsideStrings(line, dataset.database);
+ // Replace SET client_encoding to UTF8
+ newLine = newLine.replace(/SET client_encoding = 'LATIN1';/i, "SET client_encoding = 'UTF8';");
+ schemaSqlWriter.write(newLine + "\n");
+ }
+ schemaSqlWriter.end();
+ await new Promise((resolve, reject) => {
+ schemaSqlWriter.on("finish", () => {
+ try {
+ execSync(`${options["sudo"] ? `sudo -u ${options["sudo"]} ` : ""}psql --quiet -d senat -f ${schemaDumpFile}`, {
+ env: process.env,
+ encoding: "utf-8",
+ stdio: ["ignore", "ignore", "pipe"],
+ });
  }
- else {
- if (!options["silent"] && stderrData) {
- console.error(`psql stderr: ${stderrData}`);
- }
- reject(new Error(`psql exited with code ${code}`));
+ finally {
+ try { }
+ catch { }
  }
+ resolve();
  });
+ schemaSqlWriter.on("error", reject);
  });
  }
  async function retrieveDataset(dataDir, dataset) {
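
The rewritten `copyToSenat` above streams the dump line by line and requalifies the `public` schema with `replacePublicOutsideStrings` before handing the file to `psql -f`. A standalone sketch of that helper's behaviour, using an illustrative input line and the `ameli` dataset name as the target schema:

```js
// Occurrences of the "public" schema outside single-quoted strings are replaced
// by the dataset name, so each dump lands in its own schema; quoted data is untouched.
function replacePublicOutsideStrings(line, schema) {
  const parts = line.split(/(')/)
  let inString = false
  for (let i = 0; i < parts.length; i++) {
    if (parts[i] === "'") inString = !inString
    else if (!inString) parts[i] = parts[i].replace(/\bpublic\b(?=(\s*\.|\s*[,;]|\s|$))/g, schema)
  }
  return parts.join("")
}

console.log(replacePublicOutsideStrings("CREATE TABLE public.amd (note text DEFAULT 'public');", "ameli"))
// → CREATE TABLE ameli.amd (note text DEFAULT 'public');
```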
@@ -202,9 +176,31 @@ async function retrieveDataset(dataDir, dataset) {
  dataset.repairZip(dataset, dataDir);
  }
  }
- // Encoding repair is now handled in copyToSenat for better performance (single-pass processing)
- // The separate encoding repair step has been removed
+ if ((options["all"] || options["repairEncoding"]) && dataset.repairEncoding) {
+ if (!options["silent"]) {
+ console.log(`Repairing Windows CP1252 encoding in ${dataset.title}: ${sqlFilename}…`);
+ }
+ const repairedSqlFilePath = sqlFilePath + ".repaired";
+ const repairedSqlWriter = fs.createWriteStream(repairedSqlFilePath, {
+ encoding: "utf8",
+ });
+ // Read the file as latin1 (ISO-8859-1/CP1252) and write as UTF-8
+ const lineReader = readline.createInterface({
+ input: fs.createReadStream(sqlFilePath, { encoding: "latin1" }),
+ crlfDelay: Infinity,
+ });
+ for await (const line of lineReader) {
+ // Optionally repair Windows-1252 control characters
+ let repairedLine = line.replace(badWindows1252CharacterRegex, (match) => windows1252.decode(match, { mode: "fatal" }));
+ repairedSqlWriter.write(repairedLine + "\n");
+ }
+ repairedSqlWriter.end();
+ await fs.move(repairedSqlFilePath, sqlFilePath, { overwrite: true });
+ }
  if (options["all"] || options["import"] || options["schema"]) {
+ if (!options["silent"]) {
+ console.log(`Importing ${dataset.title}: ${sqlFilename}…`);
+ }
  await copyToSenat(dataset, dataDir, options);
  // Create indexes programmatically after import
  if (dataset.indexes) {
@@ -274,7 +270,12 @@ async function retrieveOpenData() {
  process.env["PGUSER"] &&
  process.env["PGPASSWORD"], "Missing database configuration: environment variables PGHOST, PGPORT, PGUSER and PGPASSWORD or TRICOTEUSES_SENAT_DB_* in .env file");
  console.time("data extraction time");
- execSync(`${options["sudo"] ? `sudo -u ${options["sudo"]} ` : ""}psql --quiet -c "CREATE DATABASE senat WITH OWNER opendata" || true`, {
+ execSync(`${options["sudo"] ? `sudo -u ${options["sudo"]} ` : ""}psql --quiet -c "DROP DATABASE IF EXISTS senat"`, {
+ cwd: dataDir,
+ env: process.env,
+ encoding: "utf-8",
+ });
+ execSync(`${options["sudo"] ? `sudo -u ${options["sudo"]} ` : ""}psql --quiet -c "CREATE DATABASE senat WITH OWNER opendata"`, {
  cwd: dataDir,
  env: process.env,
  encoding: "utf-8",
@@ -135,15 +135,6 @@ function extractCandidatesFromSearchHtml(html) {
  return true;
  });
  }
- function parseFinalNvs(nvs) {
- const playerTag = nvs.match(/<player\b[^>]*>/i)?.[0];
- if (!playerTag)
- return {};
- const sessionStartStr = playerTag.match(/\bsessionstart="(\d+)"/i)?.[1];
- return {
- sessionStart: sessionStartStr ? Number(sessionStartStr) : undefined,
- };
- }
  function parseDataNvs(nvs) {
  const epochStr = nvs.match(/<metadata\s+name="date"\s+value="(\d+)"/i)?.[1];
  const epoch = epochStr ? Number(epochStr) : undefined;
@@ -207,8 +198,6 @@ function score(agenda, agendaTs, sameOrg, videoTitle, videoEpoch, videoOrganes)
  const titleScore = Math.max(objetS, titleS);
  let timeScore = 0;
  if (agendaTs && videoEpoch) {
- console.log("agendaTs", agendaTs);
- console.log("videoEpoch", videoEpoch);
  const deltaMin = Math.abs(videoEpoch - agendaTs) / 60;
  timeScore = Math.exp(-deltaMin / 60);
  }
@@ -292,142 +281,168 @@ async function processGroupedReunion(agenda, session, dataDir) {
  if (agendaTs && agendaTs * 1000 > now) {
  return;
  }
- STATS.total++;
  const reunionUid = agenda.uid;
  const baseDir = path.join(dataDir, VIDEOS_ROOT_FOLDER, String(session), reunionUid);
  await fs.ensureDir(baseDir);
- const searchParams = {
- search: "true",
- videotype: getAgendaType(agenda),
- };
- if (agenda.date) {
- const fr = toFRDate(agenda.date);
- searchParams.period = "custom";
- searchParams.begin = fr;
- searchParams.end = fr;
- }
- if (agenda.organe) {
- searchParams.organe = agenda.organe;
- }
- const pages = await fetchAllSearchPages(searchParams);
- if (!pages.length) {
- if (!options["silent"]) {
- console.log(`[miss] ${agenda.uid} no candidates (videotype=${searchParams.videotype}, organe=${searchParams.organe || "-"}, date=${searchParams.begin || "-"})`);
+ let skipDownload = false;
+ if (options["only-recent"]) {
+ const now = Date.now();
+ const cutoff = now - options["only-recent"] * 24 * 3600 * 1000;
+ const reunionTs = Date.parse(agenda.date);
+ if (reunionTs < cutoff) {
+ // Check if files already exist
+ const dataNvsPath = path.join(baseDir, "data.nvs");
+ const finalplayerNvsPath = path.join(baseDir, "finalplayer.nvs");
+ if (fs.existsSync(dataNvsPath) && fs.existsSync(finalplayerNvsPath)) {
+ skipDownload = true;
+ }
  }
- return;
  }
- const combinedHtml = pages.join("\n<!-- PAGE SPLIT -->\n");
- const candidates = extractCandidatesFromSearchHtml(combinedHtml).slice(0, MAX_CANDIDATES);
- if (!candidates.length) {
- if (!options["silent"]) {
- console.log(`[miss] ${agenda.uid} no candidates after parse (videotype=${searchParams.videotype}, organe=${searchParams.organe || "-"}, date=${searchParams.begin || "-"})`);
+ let master = null;
+ let accepted = false;
+ if (!skipDownload) {
+ STATS.total++;
+ const searchParams = {
+ search: "true",
+ videotype: getAgendaType(agenda),
+ };
+ if (agenda.date) {
+ const fr = toFRDate(agenda.date);
+ searchParams.period = "custom";
+ searchParams.begin = fr;
+ searchParams.end = fr;
  }
- return;
- }
- // ==== 2) Enrich via data.nvs + scoring; pick best ====
- let best = null;
- for (const c of candidates) {
- const dataUrl = `${SENAT_DATAS_ROOT}/${c.id}_${c.hash}/content/data.nvs`;
- const finalUrl = `${SENAT_DATAS_ROOT}/${c.id}_${c.hash}/content/finalplayer.nvs`;
- const dataBuf = await fetchBuffer(dataUrl);
- if (!dataBuf)
- continue;
- const meta = parseDataNvs(dataBuf.toString("utf-8"));
- let sameOrg = false;
- // If organes are too different, go to next candidates
- if (agenda.organe && meta.organes?.length) {
- const agendaOrgNorm = normalize(agenda.organe);
- const agendaKey = getOrgKey(agendaOrgNorm);
- let bestDice = 0;
- let hasSameKey = false;
- for (const vo of meta.organes) {
- const videoOrgNorm = normalize(vo);
- const videoKey = getOrgKey(videoOrgNorm);
- const d = dice(agendaOrgNorm, videoOrgNorm);
- if (videoKey === agendaKey && videoKey !== "autre") {
- hasSameKey = true;
- }
- if (d > bestDice)
- bestDice = d;
+ if (agenda.organe) {
+ searchParams.organe = agenda.organe;
+ }
+ const pages = await fetchAllSearchPages(searchParams);
+ if (!pages.length) {
+ if (!options["silent"]) {
+ console.log(`[miss] ${agenda.uid} no candidates (videotype=${searchParams.videotype}, organe=${searchParams.organe || "-"}, date=${searchParams.begin || "-"})`);
  }
- if (hasSameKey) {
- sameOrg = true; // we are sure this is the same org
+ return;
+ }
+ const combinedHtml = pages.join("\n<!-- PAGE SPLIT -->\n");
+ const candidates = extractCandidatesFromSearchHtml(combinedHtml).slice(0, MAX_CANDIDATES);
+ if (!candidates.length) {
+ if (!options["silent"]) {
+ console.log(`[miss] ${agenda.uid} no candidates after parse (videotype=${searchParams.videotype}, organe=${searchParams.organe || "-"}, date=${searchParams.begin || "-"})`);
  }
- else if (bestDice < 0.8) {
- // if diff org and dice too low we skip
+ return;
+ }
+ // ==== 2) Enrich via data.nvs + scoring; pick best ====
+ let best = null;
+ for (const c of candidates) {
+ const dataUrl = `${SENAT_DATAS_ROOT}/${c.id}_${c.hash}/content/data.nvs`;
+ const finalUrl = `${SENAT_DATAS_ROOT}/${c.id}_${c.hash}/content/finalplayer.nvs`;
+ const dataBuf = await fetchBuffer(dataUrl);
+ if (!dataBuf)
  continue;
+ const meta = parseDataNvs(dataBuf.toString("utf-8"));
+ let sameOrg = false;
+ // If organes are too different, go to next candidates
+ if (agenda.organe && meta.organes?.length) {
+ const agendaOrgNorm = normalize(agenda.organe);
+ const agendaKey = getOrgKey(agendaOrgNorm);
+ let bestDice = 0;
+ let hasSameKey = false;
+ for (const vo of meta.organes) {
+ const videoOrgNorm = normalize(vo);
+ const videoKey = getOrgKey(videoOrgNorm);
+ const d = dice(agendaOrgNorm, videoOrgNorm);
+ if (videoKey === agendaKey && videoKey !== "autre") {
+ hasSameKey = true;
+ }
+ if (d > bestDice)
+ bestDice = d;
+ }
+ if (hasSameKey) {
+ sameOrg = true; // we are sure this is the same org
+ }
+ else if (bestDice < 0.8) {
+ // if diff org and dice too low we skip
+ continue;
+ }
+ }
+ let videoTitle = c.title;
+ if (c.isSeancePublique && meta.firstChapterLabel) {
+ videoTitle = meta.firstChapterLabel;
+ }
+ const s = score(agenda, agendaTs, sameOrg, videoTitle, meta.epoch, meta.organes);
+ if (!best || s > best.score) {
+ best = {
+ id: c.id,
+ hash: c.hash,
+ pageUrl: c.pageUrl,
+ epoch: meta.epoch,
+ vtitle: videoTitle,
+ score: s,
+ vorgane: meta.organes[0],
+ };
  }
  }
- let videoTitle = c.title;
- if (c.isSeancePublique && meta.firstChapterLabel) {
- videoTitle = meta.firstChapterLabel;
+ if (!best) {
+ if (!options["silent"])
+ console.log(`[miss] ${agenda.uid} No candidate found for this reunion`);
+ return;
  }
- const s = score(agenda, agendaTs, sameOrg, videoTitle, meta.epoch, meta.organes);
- if (!best || s > best.score) {
- best = {
- id: c.id,
- hash: c.hash,
- pageUrl: c.pageUrl,
- epoch: meta.epoch,
- vtitle: videoTitle,
- score: s,
- vorgane: meta.organes[0],
- };
+ accepted = best.score >= MATCH_THRESHOLD;
+ if (accepted)
+ STATS.accepted++;
+ if (!options["silent"]) {
+ console.log(`[pick] ${agenda.uid} score=${best.score.toFixed(2)}
+ agenda title="${agenda.titre ?? ""}" agenda organe="${agenda.organe ?? ""}" agenda heure=${agenda.startTime}
+ best title="${best.vtitle ?? ""}" best organe="${best.vorgane ?? ""}"
+ accepted=${accepted}`);
  }
+ // ==== 3) Write metadata + NVS of the best candidate (always) ====
+ const bestDt = best?.epoch ? epochToParisDateTime(best.epoch) : null;
+ const metadata = {
+ reunionUid,
+ session,
+ accepted,
+ threshold: MATCH_THRESHOLD,
+ agenda: {
+ date: agenda.date,
+ startTime: agenda.startTime,
+ titre: agenda.titre,
+ organe: agenda.organe ?? undefined,
+ uid: agenda.uid,
+ },
+ best: {
+ id: best.id,
+ hash: best.hash,
+ pageUrl: best.pageUrl,
+ epoch: best.epoch ?? null,
+ date: bestDt?.date ?? null,
+ startTime: bestDt?.startTime ?? null,
+ title: best.vtitle ?? null,
+ score: best.score,
+ },
+ };
+ await writeIfChanged(path.join(baseDir, "metadata.json"), JSON.stringify(metadata, null, 2));
+ const dataUrl = `${SENAT_DATAS_ROOT}/${best.id}_${best.hash}/content/data.nvs`;
+ const finalUrl = `${SENAT_DATAS_ROOT}/${best.id}_${best.hash}/content/finalplayer.nvs`;
+ const dataTxt = await fetchText(dataUrl);
+ const finalTxt = await fetchText(finalUrl);
+ if (dataTxt)
+ await fsp.writeFile(path.join(baseDir, "data.nvs"), dataTxt, "utf-8");
+ if (finalTxt)
+ await fsp.writeFile(path.join(baseDir, "finalplayer.nvs"), finalTxt, "utf-8");
+ if (dataTxt && finalTxt)
+ master = buildSenatVodMasterM3u8FromNvs(dataTxt, finalTxt);
  }
- if (!best) {
- if (!options["silent"])
- console.log(`[miss] ${agenda.uid} No candidate found for this reunion`);
- return;
- }
- const accepted = best.score >= MATCH_THRESHOLD;
- if (accepted)
- STATS.accepted++;
- if (!options["silent"]) {
- console.log(`[pick] ${agenda.uid} score=${best.score.toFixed(2)}
- agenda title="${agenda.titre ?? ""}" agenda organe="${agenda.organe ?? ""}" agenda heure=${agenda.startTime}
- best title="${best.vtitle ?? ""}" best organe="${best.vorgane ?? ""}"
- accepted=${accepted}`);
+ else {
+ // Skipped download, but need to read data.nvs for urlVideo
+ try {
+ const dataTxt = await fsp.readFile(path.join(baseDir, "data.nvs"), "utf-8");
+ const finalTxt = await fsp.readFile(path.join(baseDir, "finalplayer.nvs"), "utf-8");
+ master = buildSenatVodMasterM3u8FromNvs(dataTxt, finalTxt);
+ }
+ catch { }
  }
- // ==== 3) Write metadata + NVS of the best candidate (always) ====
- const bestDt = best?.epoch ? epochToParisDateTime(best.epoch) : null;
- const metadata = {
- reunionUid,
- session,
- accepted,
- threshold: MATCH_THRESHOLD,
- agenda: {
- date: agenda.date,
- startTime: agenda.startTime,
- titre: agenda.titre,
- organe: agenda.organe ?? undefined,
- uid: agenda.uid,
- },
- best: {
- id: best.id,
- hash: best.hash,
- pageUrl: best.pageUrl,
- epoch: best.epoch ?? null,
- date: bestDt?.date ?? null,
- startTime: bestDt?.startTime ?? null,
- title: best.vtitle ?? null,
- score: best.score,
- },
- };
- await writeIfChanged(path.join(baseDir, "metadata.json"), JSON.stringify(metadata, null, 2));
- const dataUrl = `${SENAT_DATAS_ROOT}/${best.id}_${best.hash}/content/data.nvs`;
- const finalUrl = `${SENAT_DATAS_ROOT}/${best.id}_${best.hash}/content/finalplayer.nvs`;
- const dataTxt = await fetchText(dataUrl);
- const finalTxt = await fetchText(finalUrl);
- if (dataTxt)
- await fsp.writeFile(path.join(baseDir, "data.nvs"), dataTxt, "utf-8");
- if (finalTxt)
- await fsp.writeFile(path.join(baseDir, "finalplayer.nvs"), finalTxt, "utf-8");
- let master = null;
- if (dataTxt && finalTxt)
- master = buildSenatVodMasterM3u8FromNvs(dataTxt, finalTxt);
  // ==== 4) Update agenda file (only if accepted + m3u8) ====
- if (accepted && master) {
+ if ((accepted || skipDownload) && master) {
  const agendaJsonPath = path.join(dataDir, AGENDA_FOLDER, DATA_TRANSFORMED_FOLDER, String(session), `${agenda.uid}.json`);
  if (await fs.pathExists(agendaJsonPath)) {
  const raw = await fsp.readFile(agendaJsonPath, "utf-8");
@@ -48,13 +48,12 @@ export declare const commonOptions: ({
  name: string;
  type: StringConstructor;
  } | {
- defaultValue: number;
+ alias: string;
  help: string;
  name: string;
- type: NumberConstructor;
+ type: BooleanConstructor;
  } | {
- alias: string;
  help: string;
  name: string;
- type: BooleanConstructor;
+ type: NumberConstructor;
  })[];
@@ -35,4 +35,11 @@ export const onlyRecentOption = {
  name: "only-recent",
  type: Number,
  };
- export const commonOptions = [categoriesOption, dataDirDefaultOption, fromSessionOption, silentOption, verboseOption];
+ export const commonOptions = [
+ categoriesOption,
+ dataDirDefaultOption,
+ fromSessionOption,
+ silentOption,
+ verboseOption,
+ onlyRecentOption,
+ ];
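
The `onlyRecentOption` added to `commonOptions` here is the `--only-recent` flag consumed in `processGroupedReunion` above; the cutoff computation there treats its value as a number of days. A sketch of how it could be read, assuming the usual `npm run ... -- --flag value` invocation shown in the README (the import path and the value 30 are illustrative):

```js
import commandLineArgs from "command-line-args"
import { commonOptions } from "./scripts/options.js" // hypothetical path

// e.g. `npm run data:retrieve_videos ../senat-data -- --only-recent 30`
const options = commandLineArgs(commonOptions, { partial: true })
if (options["only-recent"]) {
  // Reunions older than the cutoff are skipped when their .nvs files already exist.
  const cutoff = Date.now() - options["only-recent"] * 24 * 3600 * 1000
  console.log(`Cutoff: ${new Date(cutoff).toISOString()}`)
}
```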
package/package.json CHANGED
@@ -1,101 +1,101 @@
- {
- "name": "@tricoteuses/senat",
- "version": "2.18.11",
- "description": "Handle French Sénat's open data",
- "keywords": [
- "France",
- "open data",
- "Parliament",
- "Sénat"
- ],
- "author": "Emmanuel Raviart <emmanuel@raviart.com>",
- "bugs": {
- "url": "https://git.tricoteuses.fr/logiciels/tricoteuses-senat/issues"
- },
- "homepage": "https://tricoteuses.fr/",
- "license": "AGPL-3.0-or-later",
- "repository": {
- "type": "git",
- "url": "https://git.tricoteuses.fr/logiciels/tricoteuses-senat.git"
- },
- "type": "module",
- "engines": {
- "node": ">=22"
- },
- "files": [
- "lib"
- ],
- "exports": {
- ".": {
- "import": "./lib/index.js",
- "types": "./lib/index.d.ts"
- },
- "./loaders": {
- "import": "./lib/loaders.js",
- "types": "./lib/loaders.d.ts"
- },
- "./package.json": "./package.json"
- },
- "publishConfig": {
- "access": "public"
- },
- "scripts": {
- "build": "tsc",
- "build:types": "tsc --emitDeclarationOnly",
- "data:convert_data": "tsx src/scripts/convert_data.ts",
- "data:download": "tsx src/scripts/data-download.ts",
- "data:generate_schemas": "tsx src/scripts/retrieve_open_data.ts --schema",
- "data:retrieve_agenda": "cross-env TZ='Etc/UTC' tsx src/scripts/retrieve_agenda.ts",
- "data:retrieve_cr_seance": "tsx src/scripts/retrieve_cr_seance.ts",
- "data:retrieve_cr_commission": "tsx src/scripts/retrieve_cr_commission.ts",
- "data:retrieve_documents": "tsx src/scripts/retrieve_documents.ts",
- "data:retrieve_open_data": "tsx src/scripts/retrieve_open_data.ts --all",
- "data:retrieve_senateurs_photos": "tsx src/scripts/retrieve_senateurs_photos.ts --fetch",
- "data:retrieve_videos": "tsx src/scripts/retrieve_videos.ts",
- "data:parse_textes_lois": "tsx src/scripts/parse_textes.ts",
- "prepare": "npm run build",
- "prepublishOnly": "npm run build",
- "prettier": "prettier --write 'src/**/*.ts' 'tests/**/*.test.ts'",
- "test:iter_load": "tsx src/scripts/test_iter_load.ts",
- "type-check": "tsc --noEmit",
- "type-check:watch": "npm run type-check -- --watch"
- },
- "dependencies": {
- "@biryani/core": "^0.2.1",
- "cheerio": "^1.1.2",
- "command-line-args": "^5.1.1",
- "dotenv": "^8.2.0",
- "fs-extra": "^9.1.0",
- "jsdom": "^26.0.0",
- "kysely": "^0.27.4",
- "luxon": "^3.7.2",
- "node-stream-zip": "^1.8.2",
- "pg": "^8.13.1",
- "pg-cursor": "^2.12.1",
- "p-limit": "^7.2.0",
- "slug": "^11.0.0",
- "tsx": "^4.20.6",
- "windows-1252": "^1.0.0"
- },
- "devDependencies": {
- "@typed-code/schemats": "^5.0.1",
- "@types/cheerio": "^1.0.0",
- "@types/command-line-args": "^5.0.0",
- "@types/fs-extra": "^9.0.7",
- "@types/jsdom": "^21.1.7",
- "@types/luxon": "^3.7.1",
- "@types/node": "^20.17.6",
- "@types/pg": "^8.15.5",
- "@types/pg-cursor": "^2.7.2",
- "@types/slug": "^5.0.9",
- "@typescript-eslint/eslint-plugin": "^8.46.0",
- "@typescript-eslint/parser": "^8.46.0",
- "cross-env": "^10.1.0",
- "eslint": "^8.57.1",
- "iconv-lite": "^0.7.0",
- "kysely-codegen": "^0.19.0",
- "prettier": "^3.5.3",
- "tslib": "^2.1.0",
- "typescript": "^5.9.3"
- }
- }
+ {
+ "name": "@tricoteuses/senat",
+ "version": "2.18.13",
+ "description": "Handle French Sénat's open data",
+ "keywords": [
+ "France",
+ "open data",
+ "Parliament",
+ "Sénat"
+ ],
+ "author": "Emmanuel Raviart <emmanuel@raviart.com>",
+ "bugs": {
+ "url": "https://git.tricoteuses.fr/logiciels/tricoteuses-senat/issues"
+ },
+ "homepage": "https://tricoteuses.fr/",
+ "license": "AGPL-3.0-or-later",
+ "repository": {
+ "type": "git",
+ "url": "https://git.tricoteuses.fr/logiciels/tricoteuses-senat.git"
+ },
+ "type": "module",
+ "engines": {
+ "node": ">=22"
+ },
+ "files": [
+ "lib"
+ ],
+ "exports": {
+ ".": {
+ "import": "./lib/index.js",
+ "types": "./lib/index.d.ts"
+ },
+ "./loaders": {
+ "import": "./lib/loaders.js",
+ "types": "./lib/loaders.d.ts"
+ },
+ "./package.json": "./package.json"
+ },
+ "publishConfig": {
+ "access": "public"
+ },
+ "scripts": {
+ "build": "tsc",
+ "build:types": "tsc --emitDeclarationOnly",
+ "data:convert_data": "tsx src/scripts/convert_data.ts",
+ "data:download": "tsx src/scripts/data-download.ts",
+ "data:generate_schemas": "tsx src/scripts/retrieve_open_data.ts --schema",
+ "data:retrieve_agenda": "cross-env TZ='Etc/UTC' tsx src/scripts/retrieve_agenda.ts",
+ "data:retrieve_cr_seance": "tsx src/scripts/retrieve_cr_seance.ts",
+ "data:retrieve_cr_commission": "tsx src/scripts/retrieve_cr_commission.ts",
+ "data:retrieve_documents": "tsx src/scripts/retrieve_documents.ts",
+ "data:retrieve_open_data": "tsx src/scripts/retrieve_open_data.ts --all",
+ "data:retrieve_senateurs_photos": "tsx src/scripts/retrieve_senateurs_photos.ts --fetch",
+ "data:retrieve_videos": "tsx src/scripts/retrieve_videos.ts",
+ "data:parse_textes_lois": "tsx src/scripts/parse_textes.ts",
+ "prepare": "npm run build",
+ "prepublishOnly": "npm run build",
+ "prettier": "prettier --write 'src/**/*.ts' 'tests/**/*.test.ts'",
+ "test:iter_load": "tsx src/scripts/test_iter_load.ts",
+ "type-check": "tsc --noEmit",
+ "type-check:watch": "npm run type-check -- --watch"
+ },
+ "dependencies": {
+ "@biryani/core": "^0.2.1",
+ "cheerio": "^1.1.2",
+ "command-line-args": "^5.1.1",
+ "dotenv": "^8.2.0",
+ "fs-extra": "^9.1.0",
+ "jsdom": "^26.0.0",
+ "kysely": "^0.27.4",
+ "luxon": "^3.7.2",
+ "node-stream-zip": "^1.8.2",
+ "pg": "^8.13.1",
+ "pg-cursor": "^2.12.1",
+ "p-limit": "^7.2.0",
+ "slug": "^11.0.0",
+ "tsx": "^4.20.6",
+ "windows-1252": "^1.0.0"
+ },
+ "devDependencies": {
+ "@typed-code/schemats": "^5.0.1",
+ "@types/cheerio": "^1.0.0",
+ "@types/command-line-args": "^5.0.0",
+ "@types/fs-extra": "^9.0.7",
+ "@types/jsdom": "^21.1.7",
+ "@types/luxon": "^3.7.1",
+ "@types/node": "^20.17.6",
+ "@types/pg": "^8.15.5",
+ "@types/pg-cursor": "^2.7.2",
+ "@types/slug": "^5.0.9",
+ "@typescript-eslint/eslint-plugin": "^8.46.0",
+ "@typescript-eslint/parser": "^8.46.0",
+ "cross-env": "^10.1.0",
+ "eslint": "^8.57.1",
+ "iconv-lite": "^0.7.0",
+ "kysely-codegen": "^0.19.0",
+ "prettier": "^3.5.3",
+ "tslib": "^2.1.0",
+ "typescript": "^5.9.3"
+ }
+ }