xml-stream-editor 0.1.1 → 0.2.0

package/CHANGELOG.md CHANGED
@@ -1,6 +1,18 @@
1
1
  CHANGELOG
2
2
  ===
3
3
 
4
+ 0.2.0
5
+ ---
6
+
7
+ Significantly improve performance of selector matching.
8
+
9
+ Add validation checks for created or modified XML element attribute names.
10
+
11
+ Add an options argument, which currently supports 1. disabling validation
12
+ of outgoing XML, and 2. configuring the "saxes" parser.
13
+
14
+ Hopefully more helpful, additional text in README.md.
15
+
4
16
  0.1.1
5
17
  ---
6
18
 
package/README.md CHANGED
@@ -1,5 +1,4 @@
1
- xml-stream-editor
2
- ===
1
+ # xml-stream-editor
3
2
 
4
3
  Library to edit xml files in a streaming manner. Inspired by
5
4
  [xml-stream](https://www.npmjs.com/package/xml-stream), but 1. allows using
@@ -11,23 +10,86 @@ allows you to modify XML without needing to buffer the XML files in memory.
11
10
  For small to mid-sized XML files buffering is fine. But when editing very large
12
11
  files (e.g., multi-GB files) buffering can be a problem or an absolute blocker.
13
12
 
14
- Usage
15
- ---
13
+ ## Usage
16
14
 
17
15
  `xml-stream-editor` is designed to be used with node's stream systems
18
16
  by subclassing [`stream.Transform`](https://nodejs.org/api/stream.html#class-streamtransform),
19
17
  so it can be used with the [streams promises API](https://nodejs.org/api/stream.html#streams-promises-api)
20
18
  and stdlib interfaces like [`stream.pipeline`](https://nodejs.org/api/stream.html#streampipelinestreams-options).
21
19
 
22
- The main way to use `xml-stream-editor` is to 1. select which XML elements
23
- you want to edit using simple declarative selectors (like _very_ simple XPath
24
- rules or CSS selectors), and 2. write functions to be called with each
25
- matching XML element in the document. Those functions then either edit and
26
- return the provided element, or remove the element from the document
27
- by returning nothing.
20
+ The main way to use `xml-stream-editor` is to:
28
21
 
29
- Example
30
- ---
22
+ 1. select which XML elements you want to edit using simple declarative selectors
23
+ (like _very_ simple XPath rules or CSS selectors), and
24
+ 2. write functions to be called with each matching XML element in the document.
25
+ Those functions then either edit and return the provided element, or remove
26
+ the element from the document by returning nothing.
27
+
28
+ ### Calling xml-stream-editor
29
+
30
+ The main way to call `xml-stream-editor` is by importing `createXMLEditor`,
31
+ passing that function an object whose keys are `selectors` (strings that describe
32
+ which elements to edit) and whose values are functions that get passed
33
+ matching elements (to edit or delete those elements).
34
+
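As a rough illustration of the shape described above, here is a minimal sketch of a rules object. It does not import the library: `makeElement` is a hypothetical local stand-in for the package's `newElement`, and the attribute name `reviewed` is invented for the example, so the rule functions can be exercised directly.

```javascript
// Hypothetical stand-in for the library's `newElement`, so this sketch
// runs without installing the package.
const makeElement = (name, text) => ({
  name,
  text,
  attributes: Object.create(null),
  children: [],
});

// Keys are selector strings; values are functions called with each match.
const rules = {
  // Edit: return the (modified) element to keep it in the output.
  "main character": (elm) => {
    elm.attributes.reviewed = "yes";
    return elm;
  },
  // Delete: return nothing to drop the element from the output.
  "side character": () => undefined,
};

const kept = rules["main character"](makeElement("character", "Marge Simpson"));
const dropped = rules["side character"](makeElement("character", "Disco Stu"));
console.log(kept.attributes.reviewed, dropped); // prints: yes undefined
```

In real use the same object would be handed to `createXMLEditor(rules)`, which calls these functions as matching elements stream past.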
35
+ ### Elements Selectors
36
+
37
+ You choose which XML elements to edit by writing (simple, limited) CSS-selector
38
+ like statements. For example, the selector `parent child` will match
39
+ all `<child>` elements that are _immediate_ children of `<parent>` nodes.
40
+ **Note**, this is a little different than CSS selectors, where the selector
41
+ `div a` would match `<a>` elements that were contained in `<div>` elements,
42
+ regardless of whether the `<a>` was an immediate child or more deeply nested.
43
+
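The immediate-child behavior can be sketched with a plain suffix test on the element's path, which is the approach visible in this package's `dist` code (`path.endsWith(selector)`). The helper names `pathFor` and `matches` are hypothetical and exist only for this illustration.

```javascript
// Sketch: an element's "path" is its ancestors' names plus its own name,
// joined by single spaces. A selector matches when the path ends with it.
const pathFor = (names) => names.join(" ");
const matches = (names, selector) => pathFor(names).endsWith(selector);

// <parent><child/></parent>: child is an immediate child of parent.
console.log(matches(["root", "parent", "child"], "parent child")); // true

// <parent><wrapper><child/></wrapper></parent>: child is nested deeper,
// so "parent child" does not match (unlike the CSS descendant combinator).
console.log(matches(["root", "parent", "wrapper", "child"], "parent child")); // false

// A single-name selector matches that element at any depth.
console.log(matches(["root", "parent", "wrapper", "child"], "child")); // true
```

Because this sketch compares raw string suffixes, a path such as `grandparent child` would also satisfy the selector `parent child`; a stricter matcher would compare whole element names instead.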
44
+ ### Editing Elements
45
+
46
+ Each element that matches a given selector is passed to the matching
47
+ function, with the signature `(elm: Element) => Element | undefined`,
48
+ and elements are structured as follows (as TypeScript):
49
+
50
+ ```typescript
51
+ interface Element {
52
+ name: string
53
+ text?: string
54
+ attributes: Record<string, string>
55
+ children: Element[]
56
+ }
57
+ ```
58
+
59
+ ### Options / Configuration
60
+
61
+ In addition to a `rules` argument, `createXMLEditor` can also take
62
+ a second `Options` argument. This object has the following properties.
63
+
64
+ ```typescript
65
+ interface Options {
66
+ // Whether to check and enforce the validity of created and modified
67
+ // XML element names and attributes. If true, will throw an error
68
+ // if you create an XML element with a disallowed name (e.g.,
69
+ // <no spaces allowed>) or with an invalid attribute name
70
+ // (<my-elm a:b:c="too many namespaces" d@y="no @ in attr names">)
71
+ //
72
+ // This only checks the syntax of the XML element names and attributes.
73
+ // It does not perform any further validation, like if used namespaces
74
+ // are valid.
75
+ //
76
+ // default: `true`
77
+ validate: boolean // true
78
+
79
+ // Options defined by the "saxes" library, and passed to the "saxes" parser
80
+ //
81
+ // eslint-disable-next-line max-len
82
+ // https://github.com/lddubeau/saxes/blob/4968bd09b5fd0270a989c69913614b0e640dae1b/src/saxes.ts#L557
83
+ // https://www.npmjs.com/package/saxes
84
+ saxes?: SaxesOptions
85
+ }
86
+
87
+ // The createXMLEditor function takes the options object as an optional
88
+ // second argument.
89
+ const transformer = createXMLEditor(rules, options)
90
+ ```
91
+
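A minimal sketch of how omitted options might fall back to their defaults, modeled on the `??` merge in this release's constructor. `mergeOptions` is a hypothetical helper name, and `xmlns` is used only as a sample "saxes" option.

```javascript
// Defaults as described above: validation on, no custom saxes options.
const defaultOptions = { validate: true, saxes: undefined };

// Sketch of merging caller options over defaults with `??`, which only
// falls back when a value is null or undefined (so `false` is respected).
const mergeOptions = (options) => ({
  validate: options?.validate ?? defaultOptions.validate,
  saxes: options?.saxes ?? defaultOptions.saxes,
});

console.log(mergeOptions());                    // { validate: true, saxes: undefined }
console.log(mergeOptions({ validate: false })); // { validate: false, saxes: undefined }
console.log(mergeOptions({ saxes: { xmlns: true } }).validate); // true
```

Using `??` rather than `||` matters here: with `||`, an explicit `validate: false` would be silently replaced by the default `true`.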
92
+ ## Examples
31
93
 
32
94
  Start with this input as `simpsons.xml`:
33
95
 
@@ -55,8 +117,9 @@ import { pipeline } from 'node:stream/promises'
55
117
  import { createXMLEditor, newElement } from 'xml-stream-editor'
56
118
 
57
119
  (async () => {
58
- const config = {
59
- // Map element selectors to editing functions
120
+ // The keys of this object are selector strings, and the
121
+ // values are functions that get called with matching elements.
122
+ const rules = {
60
123
  "main character": (elm) => {
61
124
  switch (elm.text) {
62
125
  case "Marge Simpson":
@@ -68,10 +131,14 @@ import { createXMLEditor, newElement } from 'xml-stream-editor'
68
131
  case "Lisa Simpson":
69
132
  elm.text = ""
70
133
 
134
+ // Create an <instrument> element and make it
135
+ // a child element.
71
136
  const instrumentElm = newElement("instrument")
72
137
  instrumentElm.text = "saxophone"
73
138
  elm.children.push(instrumentElm)
74
139
 
140
+ // Also create a new <name> element, and also make it
141
+ // a child element.
75
142
  const nameElm = newElement("name")
76
143
  nameElm.text = "Lisa Simpson"
77
144
  elm.children.push(nameElm)
@@ -85,18 +152,20 @@ import { createXMLEditor, newElement } from 'xml-stream-editor'
85
152
  }
86
153
  await pipeline(
87
154
  createReadStream("simpsons.xml"), // above example
88
- createXMLEditor(config),
155
+ createXMLEditor(rules),
89
156
  process.stdout
90
157
  )
91
158
  })()
92
159
  ```
93
160
 
94
- And you'll find this printed to `STDOUT` (reformatted):
161
+ And you'll find this printed to `STDOUT` (reformatted and annotated):
95
162
 
96
163
  ```xml
97
164
  <?xml version="1.0" encoding="UTF-8"?>
98
165
  <simpsons decade="90s" locale="US">
99
166
  <main>
167
+ <!-- These character elements were edited because they're
168
+ children of the main element (i.e., "main character"). -->
100
169
  <character sex="female" hair="blue">Marge Simpson</character>
101
170
  <character sex="male">Homer Simpson (Sr.)</character>
102
171
  <character sex="female">
@@ -104,16 +173,22 @@ And you'll find this printed to `STDOUT` (reformatted):
104
173
  <name>Lisa Simpson</name>
105
174
  </character>
106
175
  <character sex="female">Maggie Simpson</character>
176
+ <!-- There is no <character>Bart Simpson</character>
177
+ element anymore because the `case "Bart Simpson":`
178
+ case didn't return an element from the function. -->
107
179
  </main>
108
180
  <side>
181
+ <!-- These side character elements were not edited or affected
182
+ at all because they didn't match the given selector
183
+ (i.e., they are not "character" elements that are direct
184
+ children of "main" elements). -->
109
185
  <character sex="male">Disco Stu</character>
110
186
  <character sex="male" title="Dr.">Julius Hibbert</character>
111
187
  </side>
112
188
  </simpsons>
113
189
  ```
114
190
 
115
- Notes
116
- ---
191
+ ## Notes
117
192
 
118
193
  Nested editing functions are not supported. You can define as many editing
119
194
  rules as you'd like, but only one rule can be matching the xml document
@@ -129,7 +204,7 @@ import { pipeline } from 'node:stream/promises'
129
204
  import { createXMLEditor, newElement } from 'xml-stream-editor'
130
205
 
131
206
  (async () => {
132
- const config = {
207
+ const rules = {
133
208
  // This rule will match first, since the "main" element will be
134
209
  // identified first during parsing.
135
210
  "main character": (elm) => {
@@ -147,14 +222,13 @@ import { createXMLEditor, newElement } from 'xml-stream-editor'
147
222
  }
148
223
  await pipeline(
149
224
  createReadStream("simpsons.xml"), // above example
150
- createXMLEditor(config),
225
+ createXMLEditor(rules),
151
226
  process.stdout
152
227
  )
153
228
  })()
154
229
  ```
155
230
 
156
- Motivation
157
- ---
231
+ ## Motivation
158
232
 
159
233
  `xml-stream-editor` was built to handle the extremely large XML files
160
234
  generated by [Brave Software's PageGraph system](https://github.com/brave/brave-browser/wiki/PageGraph),
package/dist/markup.js CHANGED
@@ -1,6 +1,6 @@
1
1
  import xmlescape from 'xml-escape';
2
2
  import xnv from 'xml-name-validator';
3
- export const isValidName = xnv.name;
3
+ export const isValidName = xnv.qname;
4
4
  export const toAttrValue = (value) => {
5
5
  return xmlescape(value);
6
6
  };
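The one-line change above swaps `xnv.name` for `xnv.qname`. The practical difference is that an XML `Name` may contain any number of colons, while a qualified name allows at most one prefix separator. The sketch below uses simplified, ASCII-only regular expressions as an approximation; it is not the validator library's implementation, and `isName` / `isQName` are hypothetical names.

```javascript
// ASCII-only approximations of the XML "Name" and "QName" productions.
// Assumption: real validators also accept many non-ASCII characters.
const isName = (s) => /^[A-Za-z_:][A-Za-z0-9_.:-]*$/.test(s);
const isQName = (s) =>
  /^[A-Za-z_][A-Za-z0-9_.-]*(:[A-Za-z_][A-Za-z0-9_.-]*)?$/.test(s);

for (const candidate of ["my-elm", "a:b", "a:b:c", "d@y", "no spaces allowed"]) {
  console.log(candidate, isName(candidate), isQName(candidate));
}
// my-elm true true
// a:b true true
// a:b:c true false   <- a Name may hold several colons, a QName at most one
// d@y false false
// no spaces allowed false false
```

This lines up with the README's example of `a:b:c` as an attribute name that validation should reject.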
package/dist/index.js CHANGED
@@ -3,16 +3,30 @@ import { Transform } from 'node:stream';
3
3
  import { SaxesParser } from 'saxes';
4
4
  import { isValidName, toAttrValue, toBodyText, toCloseTag, toOpenTag } from './markup.js';
5
5
  export const newElement = (name) => {
6
- if (isValidName(name) === false) {
7
- throw new Error(`"${name}" is not a valid XML element name`);
8
- }
9
6
  return {
10
7
  name: name,
11
8
  text: undefined,
12
- attributes: {},
9
+ attributes: Object.create(null),
13
10
  children: [],
14
11
  };
15
12
  };
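The diff does not say why `attributes` moved from `{}` to `Object.create(null)`. One plausible motivation (an assumption, not stated by the author) is avoiding collisions between attribute names and properties inherited from `Object.prototype`. A small sketch of the difference:

```javascript
const plain = {};
const bare = Object.create(null);

// Inherited properties make a plain object look like it has attributes.
console.log("toString" in plain); // true
console.log("toString" in bare);  // false

// On a plain object, assigning a string to "__proto__" hits the inherited
// setter and is silently ignored; on a null-prototype object it is stored.
plain["__proto__"] = "x";
bare["__proto__"] = "x";
console.log(Object.keys(plain)); // []
console.log(Object.keys(bare));  // [ '__proto__' ]
```

One consequence for callers: a null-prototype object has no `hasOwnProperty` or `toString` methods of its own, so code inspecting `attributes` should use `Object.keys` or the `in` operator.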
13
+ const throwOnInvalidElement = (elm) => {
14
+ if (typeof elm.name !== 'string') {
15
+ throw new Error('No name provided for element');
16
+ }
17
+ if (!isValidName(elm.name)) {
18
+ throw new Error(`"${elm.name}" is not a valid XML element name`);
19
+ }
20
+ if (typeof elm.attributes !== 'object' || elm.attributes === null) {
21
+ throw new Error('"attributes" property on element is not an object');
22
+ }
23
+ for (const attrName of Object.keys(elm.attributes)) {
24
+ if (!isValidName(attrName)) {
25
+ throw new Error(`"${attrName}" is not a valid XML attribute name`);
26
+ }
27
+ }
28
+ elm.children.forEach(throwOnInvalidElement);
29
+ };
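The recursive check above can be tried in isolation with a sketch like the following. `assertValidTree` is a hypothetical name, the name test is a simplified ASCII-only stand-in for a real qualified-name validator, and unlike the function in the diff this version tolerates a missing `attributes` or `children` property.

```javascript
// Simplified ASCII-only name check; a stand-in (assumption) for a real
// XML qualified-name validator.
const isValidName = (s) =>
  /^[A-Za-z_][A-Za-z0-9_.-]*(:[A-Za-z_][A-Za-z0-9_.-]*)?$/.test(s);

// Depth-first check: the element itself, its attribute names, then children.
const assertValidTree = (elm) => {
  if (typeof elm.name !== "string" || !isValidName(elm.name)) {
    throw new Error(`invalid element name: ${String(elm.name)}`);
  }
  for (const attrName of Object.keys(elm.attributes ?? {})) {
    if (!isValidName(attrName)) {
      throw new Error(`invalid attribute name: ${attrName}`);
    }
  }
  (elm.children ?? []).forEach(assertValidTree);
};

const tree = {
  name: "character",
  attributes: { sex: "female" },
  children: [{ name: "instrument", attributes: { "d@y": "oops" }, children: [] }],
};

try {
  assertValidTree(tree);
} catch (err) {
  console.log(err.message); // invalid attribute name: d@y
}
```

Because the whole returned subtree is checked before anything is written, an invalid name in a deeply nested child stops the element from being emitted at all.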
16
30
  const cloneElement = (elm) => {
17
31
  const newElm = newElement(elm.name);
18
32
  newElm.text = elm.text;
@@ -25,7 +39,7 @@ const elementForNode = (node) => {
25
39
  // string), or in the namespace representation the "saxes" library
26
40
  // uses (in which case attrValue will be a SaxesAttributeNS
27
41
  // object, that we have to unpack a bit)
28
- const attributes = {};
42
+ const attributes = Object.create(null);
29
43
  if (node.attributes) {
30
44
  for (const [attrName, attrValue] of Object.entries(node.attributes)) {
31
45
  if (typeof attrValue === 'string') {
@@ -45,12 +59,21 @@ const elementForNameAndAttrs = (name, attrs) => {
45
59
  return newElm;
46
60
  };
47
61
  class XMLStreamEditorTransformer extends Transform {
62
+ // Default options, used if the caller doesn't provide any options (or
63
+ // merged into the provided options if the user only sets some options).
64
+ static defaultOptions = {
65
+ validate: true,
66
+ saxes: undefined,
67
+ };
68
+ // The configuration options, including possible options to pass to
69
+ // the (above) saxes parser at instantiation.
70
+ #options;
48
71
  // Used to track how deep in the XML tree the parser is, so that we can
49
72
  // check newly parsed elements against the passed editor rules.
50
- #elmStack = [];
73
+ #parseStack = [];
51
74
  // This is a map of (VERY) simple xpaths (i.e., only XML element names;
52
75
  // no attributes, no name spaces, etc).
53
- #config;
76
+ #editingRules;
54
77
  // Handle to the 'saxes' xml parser object.
55
78
  #xmlParser;
56
79
  // If set, tracks the current element in the parser stack that matches
@@ -60,17 +83,31 @@ class XMLStreamEditorTransformer extends Transform {
60
83
  // Store any errors we've been passed by the saxes parser so that we
61
84
  // can pass it along in the transformer callback next time we get data.
62
85
  #error;
86
+ #pushParsedElementToStack(element) {
87
+ const topOfStackElm = this.#parseStack.at(-1);
88
+ const pathToElement = topOfStackElm
89
+ ? topOfStackElm.path + ' ' + element.name
90
+ : element.name;
91
+ this.#parseStack.push({
92
+ element: element,
93
+ path: pathToElement,
94
+ });
95
+ }
63
96
  // Checks to see if the current editor stack (which tracks the current
64
97
  // element being parsed in the input XML stream, along with its parent
65
98
  // elements) matches any of the passed editor rules.
66
- #doesStackMatchEditorRule() {
67
- const currentElementPath = this.#elmStack.map(x => x.name).join(' ');
68
- for (const [selector, editorFunc] of Object.entries(this.#config)) {
69
- if (currentElementPath.endsWith(selector)) {
99
+ #doesStackMatchEditingRule() {
100
+ const topOfStack = this.#parseStack.at(-1);
101
+ // This method is only called after pushing an element to the stack,
102
+ // so this is guaranteed to be true
103
+ assert(topOfStack);
104
+ const topOfStackPath = topOfStack.path;
105
+ for (const [selector, editorFunc] of Object.entries(this.#editingRules)) {
106
+ if (topOfStackPath.endsWith(selector)) {
70
107
  // The depth of the root of this subtree in the stack
71
- const depth = this.#elmStack.length - 1;
108
+ const depth = this.#parseStack.length - 1;
72
109
  assert(depth >= 0);
73
- const elmToEdit = this.#elmStack[depth];
110
+ const elmToEdit = this.#parseStack[depth].element;
74
111
  return { selector: selector, func: editorFunc, element: elmToEdit };
75
112
  }
76
113
  }
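The hunk above replaces a per-check `map(...).join(' ')` over the whole stack with a path cached on each stack frame, which may be part of the selector-matching speedup noted in the changelog (an inference; the diff does not say so). A self-contained sketch of the idea, with `ParseStack` as a hypothetical class name:

```javascript
// Sketch of a parse stack where each frame caches its space-joined path,
// so a new frame's path costs one concatenation instead of a full re-join.
class ParseStack {
  #frames = [];

  push(name) {
    const parent = this.#frames.at(-1);
    const path = parent ? parent.path + " " + name : name;
    this.#frames.push({ name, path });
  }

  pop() {
    return this.#frames.pop();
  }

  // Path of the innermost open element (undefined when the stack is empty).
  get currentPath() {
    return this.#frames.at(-1)?.path;
  }

  // The approach the cached path replaces: rebuild from every frame.
  recomputePath() {
    return this.#frames.map((f) => f.name).join(" ");
  }
}

const stack = new ParseStack();
["simpsons", "main", "character"].forEach((n) => stack.push(n));
console.log(stack.currentPath);                           // simpsons main character
console.log(stack.currentPath === stack.recomputePath()); // true
stack.pop();
console.log(stack.currentPath);                           // simpsons main
```

The trade-off is one extra string held per open element, which stays small because the stack is only as deep as the current nesting level.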
@@ -89,12 +126,15 @@ class XMLStreamEditorTransformer extends Transform {
89
126
  }
90
127
  this.push(toCloseTag(element.name));
91
128
  }
92
- #writeSubtreeToStream() {
129
+ #callUserFuncOnCompletedElementAndWriteToStream() {
93
130
  assert(this.#elmToEditInfo);
94
131
  const clonedElm = cloneElement(this.#elmToEditInfo.element);
95
132
  try {
96
133
  const editedElm = this.#elmToEditInfo.func(clonedElm);
97
134
  if (editedElm) {
135
+ if (this.#options.validate === true) {
136
+ throwOnInvalidElement(editedElm);
137
+ }
98
138
  this.#writeElementToStream(editedElm);
99
139
  }
100
140
  this.#elmToEditInfo = undefined;
@@ -116,13 +156,13 @@ class XMLStreamEditorTransformer extends Transform {
116
156
  // 3. We are NOT the root of a subtree to be edited, in which case
117
157
  // we just add ourselves to the stack.
118
158
  const newElement = elementForNode(node);
119
- this.#elmStack.push(newElement);
159
+ this.#pushParsedElementToStack(newElement);
120
160
  // Check for case one
121
161
  if (this.#isInSubtreeToBeEdited()) {
122
162
  return;
123
163
  }
124
164
  // Check for case two, if we're at the root of a subtree to edit.
125
- const matchingElementInfo = this.#doesStackMatchEditorRule();
165
+ const matchingElementInfo = this.#doesStackMatchEditingRule();
126
166
  if (matchingElementInfo !== null) {
127
167
  this.#elmToEditInfo = matchingElementInfo;
128
168
  return;
@@ -140,9 +180,9 @@ class XMLStreamEditorTransformer extends Transform {
140
180
  // print the text out immediately.
141
181
  // Check for case one
142
182
  if (this.#isInSubtreeToBeEdited()) {
143
- const topOfStack = this.#elmStack.at(-1);
183
+ const topOfStack = this.#parseStack.at(-1);
144
184
  assert(topOfStack);
145
- topOfStack.text = text;
185
+ topOfStack.element.text = text;
146
186
  return;
147
187
  }
148
188
  // Otherwise we're in case two, and can print the text out immediately.
@@ -176,36 +216,52 @@ class XMLStreamEditorTransformer extends Transform {
176
216
  // 3. We've completed a CHILD NODE in a subtree being edited,
177
217
  // in which case we append this node to our buffered subtree
178
218
  // and pop it off the stack.
179
- const completedElm = this.#elmStack.pop();
219
+ const completedStackElement = this.#parseStack.pop();
220
+ const completedElm = completedStackElement?.element;
180
221
  assert(completedElm);
181
222
  // Check for case one
182
223
  if (this.#isInSubtreeToBeEdited() === false) {
224
+ // Write the closing tag of the just-completed element
225
+ // to the write stream.
183
226
  this.push(toCloseTag(node.name));
184
227
  return;
185
228
  }
186
229
  // Check for case two
187
230
  assert(this.#elmToEditInfo);
188
231
  if (completedElm === this.#elmToEditInfo.element) {
189
- this.#writeSubtreeToStream();
232
+ this.#callUserFuncOnCompletedElementAndWriteToStream();
190
233
  return;
191
234
  }
192
235
  // Otherwise, we must be in case three
193
- assert(this.#elmStack.length > 0);
194
- this.#elmStack.at(-1)?.children.push(completedElm);
236
+ const topOfStack = this.#parseStack.at(-1);
237
+ assert(topOfStack);
238
+ topOfStack.element.children.push(completedElm);
195
239
  });
196
240
  }
197
- constructor(config, saxesOptions) {
241
+ constructor(editingRules, options) {
198
242
  super();
199
- this.#config = config;
200
- this.#xmlParser = new SaxesParser(saxesOptions);
243
+ const defaultOptions = XMLStreamEditorTransformer.defaultOptions;
244
+ const mergedOptions = {
245
+ validate: options?.validate ?? defaultOptions.validate,
246
+ saxes: options?.saxes ?? defaultOptions.saxes,
247
+ };
248
+ this.#options = mergedOptions;
249
+ this.#editingRules = editingRules;
250
+ this.#xmlParser = new SaxesParser(this.#options.saxes);
201
251
  this.#configureParserCallbacks();
202
252
  }
203
253
  _transform(chunk, encoding, callback) {
254
+ // Don't do any parsing if something threw an error parsing the previous
255
+ // chunk.
204
256
  if (this.#error) {
205
257
  callback(this.#error);
206
258
  return;
207
259
  }
208
260
  this.#xmlParser.write(chunk);
261
+ // And, similarly, don't continue parsing if we've caught any errors
262
+ // parsing the current chunk. This looks a little redundant, but because
263
+ // the XML from the input stream is parsed asynchronously, this is
264
+ // just an attempt to catch and handle an error as quickly as possible.
209
265
  if (this.#error) {
210
266
  callback(this.#error);
211
267
  return;
@@ -213,6 +269,10 @@ class XMLStreamEditorTransformer extends Transform {
213
269
  callback();
214
270
  }
215
271
  }
216
- export const createXMLEditor = (config, saxesOptions) => {
217
- return new XMLStreamEditorTransformer(config, saxesOptions);
272
+ // This is the entry point to the library, and is designed / named
273
+ // to mirror the naming of transformers in the standard lib
274
+ // (e.g., createGzip, createDeflate, etc. in the stdlib zlib module,
275
+ // or createHmac, createECDH, etc. in the stdlib crypto module).
276
+ export const createXMLEditor = (rules, options) => {
277
+ return new XMLStreamEditorTransformer(rules, options);
218
278
  };
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "xml-stream-editor",
3
- "version": "0.1.1",
3
+ "version": "0.2.0",
4
4
  "description": "A streaming xml editor.",
5
5
  "main": "dist/index.js",
6
6
  "files": [
@@ -18,7 +18,7 @@
18
18
  "type": "module",
19
19
  "types": "src/types.d.ts",
20
20
  "repository": {
21
- "url": "https://github.com/pes10k/xml-stream-editor.git"
21
+ "url": "git+https://github.com/pes10k/xml-stream-editor.git"
22
22
  },
23
23
  "keywords": [
24
24
  "xml",
package/src/types.d.ts CHANGED
@@ -9,9 +9,31 @@ export declare interface Element {
9
9
  children: Element[]
10
10
  }
11
11
 
12
+ export declare interface Options {
13
+ // Whether to check and enforce the validity of created and modified
14
+ // XML element names and attributes. If true, will throw an error
15
+ // if you create an XML element with a disallowed name (e.g.,
16
+ // <no spaces allowed>) or with an invalid attribute name
17
+ // (<my-elm a:b:c="too many namespaces" d@y="no @ in attr names">)
18
+ //
19
+ // This only checks the syntax of the XML element names and attributes.
20
+ // It does not perform any further validation, like if used namespaces
21
+ // are valid.
22
+ //
23
+ // default: `true`
24
+ validate: boolean // true
25
+
26
+ // Options defined by the "saxes" library, and passed to the "saxes" parser
27
+ //
28
+ // eslint-disable-next-line max-len
29
+ // https://github.com/lddubeau/saxes/blob/4968bd09b5fd0270a989c69913614b0e640dae1b/src/saxes.ts#L557
30
+ // https://www.npmjs.com/package/saxes
31
+ saxes?: SaxesOptions
32
+ }
33
+
12
34
  export type Selector = string
13
35
  export type EditorFunc = (elm: Element) => Element | undefined
14
- export type Config = Record<Selector, EditorFunc>
36
+ export type EditingRules = Record<Selector, EditorFunc>
15
37
  export declare const newElement: (name: string) => Element
16
38
  export declare const createXMLEditor: (
17
- config: Config, saxesOptions?: SaxesOptions) => Transform
39
+ editingRules: EditingRules, options?: Options) => Transform