mdld-parse 0.1.0
- package/README.md +374 -0
- package/index.js +882 -0
- package/package.json +39 -0
- package/tests.js +409 -0
package/README.md
ADDED
@@ -0,0 +1,374 @@
# MD-LD Parser

A standards-compliant parser for **MD-LD (Markdown-Linked Data)** — a human-friendly RDF authoring format that extends Markdown with semantic annotations.

## What is MD-LD?

MD-LD allows you to author RDF graphs directly in Markdown using familiar syntax:

```markdown
---
"@context":
  "@vocab": "http://schema.org/"
"@id": "#doc"
"@type": Article
---

# My Article {#article typeof="Article"}

Written by [Alice Johnson](#alice){property="author" typeof="Person"}

[Alice](#alice) works at [Tech Corp](#company){rel="worksFor" typeof="Organization"}
```

This generates valid RDF triples while remaining readable as plain Markdown.

## Architecture

### Design Principles

1. **Streaming First** — Process documents incrementally without loading entire AST into memory
2. **Zero Dependencies** — Pure JavaScript, runs in Node.js and browsers
3. **Standards Compliant** — Outputs RDF quads compatible with RDFa semantics
4. **Markdown Native** — Plain Markdown yields minimal but valid RDF
5. **Progressive Enhancement** — Add semantics incrementally via attributes

### Stack Choices

#### Parser: Custom Zero-Dependency Tokenizer

We implement a **minimal, purpose-built parser** for maximum control and zero dependencies:

- **Custom Markdown tokenizer** — Line-by-line parsing of headings, lists, paragraphs, code blocks
- **Inline attribute parser** — Pandoc-style `{#id .class key="value"}` attribute extraction
- **YAML-LD frontmatter parser** — Minimal YAML subset for `@context` and `@id` parsing
- **RDF quad generator** — Direct mapping from tokens to RDF/JS quads
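
To make the inline attribute step concrete, here is a minimal sketch of Pandoc-style attribute extraction. The function name `parseAttributes` and its return shape are assumptions made for illustration; the package's internal parser may be structured differently.

```javascript
// Hypothetical helper (not part of the documented mdld-parse API):
// parse a Pandoc-style attribute block such as {#id .class key="value"}.
function parseAttributes(source) {
  const result = { id: null, classes: [], attrs: {} };
  // Drop the surrounding braces if they are present.
  const inner = source.trim().replace(/^\{/, "").replace(/\}$/, "");
  // One alternative per token kind: #id, .class, key=value
  // (value may be double-quoted, single-quoted, or bare).
  const token =
    /#([^\s}]+)|\.([^\s}]+)|([\w:-]+)=(?:"([^"]*)"|'([^']*)'|([^\s}]+))/g;
  for (const m of inner.matchAll(token)) {
    if (m[1] !== undefined) result.id = m[1];
    else if (m[2] !== undefined) result.classes.push(m[2]);
    else result.attrs[m[3]] = m[4] ?? m[5] ?? m[6];
  }
  return result;
}

const parsed = parseAttributes('{#alice .person typeof="Person" property="author"}');
console.log(parsed.id, parsed.classes, parsed.attrs);
```

Quoted values are consumed as a whole, so a `#` or `.` inside a value (for example `resource="#alice"`) is not mistaken for an id or class.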

**Why custom?**

- **Zero dependencies** — Runs anywhere JavaScript runs
- **Lightweight** — ~15KB minified, no AST overhead
- **Focused** — Optimized specifically for MD-LD semantics
- **Transparent** — Easy to understand and extend
- **Fast** — Single-pass parsing with minimal allocations

#### RDF Output: RDF/JS Data Model

We implement the [RDF/JS specification](https://rdf.js.org/data-model-spec/):

```javascript
{
  termType: 'NamedNode' | 'BlankNode' | 'Literal',
  value: string,
  language?: string,
  datatype?: NamedNode
}
```

This ensures compatibility with:

- `n3.js` — Turtle/N-Triples serialization
- `rdflib.js` — RDF store and reasoning
- `sparqljs` — SPARQL query parsing
- `rdf-ext` — Extended RDF utilities
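
As an illustration of the term shape above, the following sketch builds terms and quads with a small factory that follows the RDF/JS `DataFactory` method names (`namedNode`, `blankNode`, `literal`, `defaultGraph`, `quad`). It is not the package's built-in factory, and it omits the `equals` methods that a complete RDF/JS implementation provides.

```javascript
// Minimal RDF/JS-shaped factory (illustrative sketch only).
const XSD_STRING = "http://www.w3.org/2001/XMLSchema#string";
const RDF_LANG_STRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString";

const factory = {
  namedNode: (value) => ({ termType: "NamedNode", value }),
  blankNode: (value) => ({ termType: "BlankNode", value }),
  defaultGraph: () => ({ termType: "DefaultGraph", value: "" }),
  // The second argument is either a language tag (string) or a datatype NamedNode.
  literal(value, languageOrDatatype) {
    if (typeof languageOrDatatype === "string") {
      return {
        termType: "Literal",
        value,
        language: languageOrDatatype,
        datatype: factory.namedNode(RDF_LANG_STRING),
      };
    }
    return {
      termType: "Literal",
      value,
      language: "",
      datatype: languageOrDatatype ?? factory.namedNode(XSD_STRING),
    };
  },
  // The graph defaults to the default graph when omitted.
  quad: (subject, predicate, object, graph) => ({
    subject,
    predicate,
    object,
    graph: graph ?? factory.defaultGraph(),
  }),
};

const age = factory.literal(
  "30",
  factory.namedNode("http://www.w3.org/2001/XMLSchema#integer")
);
console.log(age.datatype.value);
```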

### Processing Pipeline

```
Markdown Text
↓
[Custom Tokenizer] — Extract headings, lists, paragraphs, code blocks
↓
[YAML-LD Parser] — Extract frontmatter @context and @id
↓
[Attribute Parser] — Parse {#id property="value"} from tokens
↓
[Inline Parser] — Extract [text](url){attrs} spans
↓
[RDF Quad Generator] — Map tokens to RDF/JS quads
↓
RDF Quads (RDF/JS format)
↓
[Optional] n3.js Writer → Turtle/N-Triples
```
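
The optional last step can also be approximated without a serializer library. The sketch below turns RDF/JS-shaped quads into N-Triples lines. The helper names are hypothetical, graph names are ignored, and only the most common literal escapes are handled, so a full writer such as n3.js remains the safer choice for production output.

```javascript
// Sketch: serialize RDF/JS-shaped quads as N-Triples (illustrative, not exhaustive).
const XSD_STRING = "http://www.w3.org/2001/XMLSchema#string";

// Escape backslashes first so later escapes are not doubled.
function escapeLiteral(value) {
  return value
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n")
    .replace(/\r/g, "\\r");
}

function termToNT(term) {
  switch (term.termType) {
    case "NamedNode":
      return `<${term.value}>`;
    case "BlankNode":
      return `_:${term.value}`;
    case "Literal": {
      const lexical = `"${escapeLiteral(term.value)}"`;
      if (term.language) return `${lexical}@${term.language}`;
      const dt = term.datatype && term.datatype.value;
      // xsd:string is the implicit datatype and can be left out.
      return dt && dt !== XSD_STRING ? `${lexical}^^<${dt}>` : lexical;
    }
    default:
      throw new Error(`Unsupported term type: ${term.termType}`);
  }
}

function quadsToNTriples(quads) {
  return quads
    .map((q) => `${termToNT(q.subject)} ${termToNT(q.predicate)} ${termToNT(q.object)} .`)
    .join("\n");
}
```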

### Architecture Benefits

The zero-dependency design provides:

1. **Single-pass parsing** — Process document once, emit quads immediately
2. **Minimal memory** — No AST construction, only token stream
3. **Predictable performance** — Linear time complexity, bounded memory
4. **Easy integration** — Works in Node.js, browsers, and edge runtimes

### Performance Profile

| Document Size | Peak Memory | Parse Time |
| ------------- | ----------- | ---------- |
| 10 KB         | ~100 KB     | <2ms       |
| 100 KB        | ~500 KB     | <20ms      |
| 1 MB          | ~2 MB       | <100ms     |
| 10 MB         | ~10 MB      | <1s        |

_Measured on modern JavaScript engines. Actual performance depends on document structure._

## Installation

### Node.js

```bash
npm install mdld-parse
```

```javascript
import { parseMDLD } from "mdld-parse";

const markdown = `# Hello\n{#doc typeof="Article"}`;
const quads = parseMDLD(markdown, {
  baseIRI: "http://example.org/doc",
});
```

### Browser (via CDN)

```html
<script type="importmap">
  {
    "imports": {
      "mdld-parse": "https://cdn.jsdelivr.net/npm/mdld-parse/+esm"
    }
  }
</script>

<script type="module">
  import { parseMDLD } from "mdld-parse";
  // use parseMDLD...
</script>
```

## API

### `parseMDLD(markdown, options)`

Parse MD-LD markdown and return RDF quads.

**Parameters:**

- `markdown` (string) — MD-LD formatted text
- `options` (object, optional):
  - `baseIRI` (string) — Base IRI for relative references (default: `''`)
  - `defaultVocab` (string) — Default vocabulary (default: `'http://schema.org/'`)
  - `dataFactory` (object) — Custom RDF/JS DataFactory (default: built-in)

**Returns:** Array of RDF/JS Quads

```javascript
const quads = parseMDLD(
  `
# Article Title
{#article typeof="Article"}

Written by [Alice](#alice){property="author"}
`,
  {
    baseIRI: "http://example.org/doc",
    defaultVocab: "http://schema.org/",
  }
);

// quads[0] = {
//   subject: { termType: 'NamedNode', value: 'http://example.org/doc#article' },
//   predicate: { termType: 'NamedNode', value: 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' },
//   object: { termType: 'NamedNode', value: 'http://schema.org/Article' },
//   graph: { termType: 'DefaultGraph' }
// }
```
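
The subject IRI in the example output is what standard relative-reference resolution of `#article` against `baseIRI` yields. The sketch below reproduces that step with the WHATWG `URL` class. `resolveIRI` is a hypothetical helper, not a package export, and it assumes prefixed names such as `xsd:integer` have already been expanded, since `URL` would otherwise treat the prefix as a scheme.

```javascript
// Sketch: resolve a fragment or relative reference against a base IRI.
function resolveIRI(reference, baseIRI) {
  try {
    // Absolute IRIs parse on their own and are returned (normalized) as-is.
    return new URL(reference).href;
  } catch {
    // Not absolute: fall through to base resolution.
  }
  // With an empty base there is nothing to resolve against.
  if (!baseIRI) return reference;
  return new URL(reference, baseIRI).href;
}

console.log(resolveIRI("#article", "http://example.org/doc"));
// → http://example.org/doc#article
```

Note that `URL` normalizes its input (for example, it percent-encodes non-ASCII characters), so a parser that must preserve IRIs byte for byte may need a dedicated resolver instead.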

### Batch Processing

For multiple documents, process them sequentially:

```javascript
const documents = [markdown1, markdown2, markdown3];
const allQuads = documents.flatMap((md) =>
  parseMDLD(md, { baseIRI: "http://example.org/" })
);
```
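
When documents share one `baseIRI`, identical fragment IDs (such as `#doc`) in different documents resolve to the same IRI, and their statements end up merged on one subject. If that is not intended, one option is to pair each document with its own base IRI, as in this sketch. The `parseAll` helper and the `{ markdown, baseIRI }` record shape are assumptions; `parse` is passed in so the snippet runs without the package, and in practice it would be `parseMDLD`.

```javascript
// Sketch: batch parsing with one base IRI per document.
function parseAll(documents, parse) {
  return documents.flatMap(({ markdown, baseIRI }) =>
    parse(markdown, { baseIRI })
  );
}

// Usage with a stand-in parser; replace `fakeParse` with parseMDLD in real code.
const fakeParse = (markdown, { baseIRI }) => [
  { subject: `${baseIRI}#doc`, source: markdown },
];
const results = parseAll(
  [
    { markdown: "# One", baseIRI: "http://example.org/one" },
    { markdown: "# Two", baseIRI: "http://example.org/two" },
  ],
  fakeParse
);
console.log(results.map((r) => r.subject));
```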

## Implementation Details

### Subject Resolution

MD-LD follows a clear subject inheritance model:

1. **Root subject** — Declared in YAML-LD `@id` field
2. **Heading subjects** — `## Title {#id typeof="Type"}`
3. **Inline subjects** — `[text](#id){typeof="Type"}`
4. **Blank nodes** — Generated for incomplete triples

```markdown
# Document

{#doc typeof="Article"}

## Section

{#sec1 typeof="Section"}

[Text]{property="name"} ← property of #sec1
```
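
The inheritance rule in the example can be sketched as a single pass that remembers the most recently declared subject and attaches each literal span to it. This is a deliberately simplified illustration, not the package's algorithm: `assignSubjects` is a hypothetical name, and only heading or standalone `{#id ...}` declarations and `[text]{property="..."}` spans are recognized.

```javascript
// Sketch: attach literal spans to the nearest preceding subject declaration.
function assignSubjects(markdown, rootSubject = "#doc") {
  let current = rootSubject;
  const statements = [];
  // Matches "## Title {#id ...}" or a standalone "{#id ...}" line.
  const declaration = /^(?:#{1,6}\s.*)?\{(?:[^}]*\s)?#([\w-]+)[^}]*\}\s*$/;
  // Matches [text]{... property="name" ...} literal spans.
  const span = /\[([^\]]+)\]\{[^}]*property="([^"]+)"[^}]*\}/g;

  for (const line of markdown.split("\n")) {
    const declared = line.match(declaration);
    if (declared) {
      current = `#${declared[1]}`;
      continue;
    }
    for (const m of line.matchAll(span)) {
      statements.push({ subject: current, property: m[2], value: m[1] });
    }
  }
  return statements;
}

const sample = [
  "# Document",
  "",
  '{#doc typeof="Article"}',
  "",
  "## Section",
  "",
  '{#sec1 typeof="Section"}',
  "",
  '[Text]{property="name"}',
].join("\n");
console.log(assignSubjects(sample));
```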

### Property Mapping

| Markdown                | RDF Predicate                                                                   |
| ----------------------- | ------------------------------------------------------------------------------- |
| Top-level H1 (no `#id`) | `rdfs:label` on root subject                                                    |
| Heading with `{#id}`    | `rdfs:label` on subject                                                         |
| First paragraph         | `dct:description` on root                                                       |
| `{property="name"}`     | Resolved via `@vocab` (e.g., `schema:name`)                                     |
| `{rel="author"}`        | Resolved via `@vocab` (e.g., `schema:author`)                                   |
| Code block              | `schema:SoftwareSourceCode` with `schema:programmingLanguage` and `schema:text` |
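
Resolving a term "via `@vocab`" amounts to prefix expansion. The sketch below shows one plausible way to do it: bare terms are appended to the vocabulary IRI, and prefixed names are expanded from a prefix table. The `expandTerm` helper and the contents of `DEFAULT_PREFIXES` are assumptions for illustration; which prefixes the parser predefines is not stated here.

```javascript
// Sketch: expand a property or type name to a full IRI.
const DEFAULT_PREFIXES = {
  rdf: "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
  rdfs: "http://www.w3.org/2000/01/rdf-schema#",
  xsd: "http://www.w3.org/2001/XMLSchema#",
  dct: "http://purl.org/dc/terms/",
  schema: "http://schema.org/",
};

function expandTerm(term, context = {}) {
  const vocab = context["@vocab"] ?? "http://schema.org/";
  // Prefixes declared in the context override the assumed defaults.
  const prefixes = { ...DEFAULT_PREFIXES, ...context };

  // Absolute IRIs pass through untouched.
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(term)) return term;

  const colon = term.indexOf(":");
  if (colon > 0) {
    const prefix = term.slice(0, colon);
    if (typeof prefixes[prefix] === "string") {
      return prefixes[prefix] + term.slice(colon + 1);
    }
    return term; // Unknown prefix: leave the term as written.
  }
  return vocab + term; // Bare term: resolve against @vocab.
}

console.log(expandTerm("name")); // → http://schema.org/name
console.log(expandTerm("xsd:integer")); // → http://www.w3.org/2001/XMLSchema#integer
```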

### List Handling

```markdown
- [Item 1]{property="item"}
- [Item 2]{property="item"}
```

This creates **multiple triples** with the same predicate (not an RDF list):

```turtle
<#doc> schema:item "Item 1" .
<#doc> schema:item "Item 2" .
```

For RDF lists (`rdf:List`), use `@inlist` in generated HTML.

### Code Block Semantics

Fenced code blocks are automatically mapped to `schema:SoftwareSourceCode`:

````markdown
```sparql {#query-1}
SELECT * WHERE { ?s ?p ?o }
```
````

Creates:

- A `schema:SoftwareSourceCode` resource (or custom type via `typeof`)
- `schema:programmingLanguage` from the info string (`sparql`)
- `schema:text` with the raw source code
- `schema:hasPart` link from the surrounding section

This enables semantic queries like "find all SPARQL queries in my notes."
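
Such a query does not require a SPARQL engine; a filter over the quad array is enough for simple cases. The sketch below assumes quads shaped as described in this section (`schema:programmingLanguage` and `schema:text` on the code resource, both under `http://schema.org/`). The helper name `findCodeByLanguage` is hypothetical.

```javascript
// Sketch: collect the source text of all code blocks in a given language.
const SCHEMA = "http://schema.org/";

// Key terms by type and value so a blank node and a named node never collide.
const termKey = (term) => `${term.termType}:${term.value}`;

function findCodeByLanguage(quads, language) {
  const wanted = language.toLowerCase();
  const subjects = new Set(
    quads
      .filter(
        (q) =>
          q.predicate.value === `${SCHEMA}programmingLanguage` &&
          q.object.value.toLowerCase() === wanted
      )
      .map((q) => termKey(q.subject))
  );
  return quads
    .filter(
      (q) =>
        q.predicate.value === `${SCHEMA}text` && subjects.has(termKey(q.subject))
    )
    .map((q) => q.object.value);
}
```

Calling `findCodeByLanguage(quads, "sparql")` on parser output would then return the raw text of every SPARQL block.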

### Blank Node Strategy

Blank nodes are created for:

1. Task list items without explicit `#id`
2. Code blocks without explicit `#id`
3. Inline `typeof` without `id` when used with `rel`
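
One simple way to mint blank nodes for these cases is a counter scoped to a single parse, which keeps labels unique within a document and deterministic across runs. The sketch below is an assumption about how this could be done, not a description of the package's internals; the `b` label prefix is arbitrary.

```javascript
// Sketch: per-parse blank node generator with deterministic labels.
function createBlankNodeGenerator(prefix = "b") {
  let counter = 0;
  return () => ({ termType: "BlankNode", value: `${prefix}${counter++}` });
}

const nextBlankNode = createBlankNodeGenerator();
console.log(nextBlankNode().value); // → b0
console.log(nextBlankNode().value); // → b1
```

Because blank node labels are only meaningful within one document, merging output from several parses into one dataset may require relabeling to avoid accidental collisions.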

## Testing

```bash
npm test
```

Tests cover:

- ✅ YAML-LD frontmatter parsing
- ✅ Subject inheritance via headings
- ✅ Property literals and datatypes (`property`, `datatype`)
- ✅ Object relationships (`rel` on links)
- ✅ Blank node generation (tasks, code blocks)
- ✅ List mappings (repeated properties)
- ✅ Code block semantics (`SoftwareSourceCode`)
- ✅ Semantic links in lists (`hasPart` TOC)
- ✅ Cross-references via fragment IDs
- ✅ Minimal Markdown → RDF (headings, paragraphs)

## Syntax Overview

### Core Features

**YAML-LD Frontmatter** — Define context and root subject:

```yaml
---
"@context":
  "@vocab": "http://schema.org/"
"@id": "#doc"
"@type": Article
---
```
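
A frontmatter block like the one above needs only a small YAML subset: top-level `key: value` pairs plus one level of nested mapping. The following sketch handles exactly that subset and nothing more (no lists, multi-line scalars, or deeper nesting). `parseFrontmatter` is a hypothetical helper, not the package's parser.

```javascript
// Sketch: parse a minimal YAML-LD frontmatter block (flat keys, one nested level).
function parseFrontmatter(markdown) {
  const match = markdown.match(/^---\r?\n([\s\S]*?)\r?\n---\r?\n?/);
  if (!match) return { data: {}, body: markdown };

  // Remove one pair of matching quotes around a key or value.
  const unquote = (s) => s.trim().replace(/^(["'])(.*)\1$/, "$2");
  const data = {};
  let parent = null; // Nested mapping currently being filled, if any.

  for (const line of match[1].split(/\r?\n/)) {
    if (!line.trim() || line.trim().startsWith("#")) continue; // blank or comment
    const indented = /^\s/.test(line);
    const pair = line.match(/^\s*("[^"]*"|'[^']*'|[^:]+):\s*(.*)$/);
    if (!pair) continue;
    const key = unquote(pair[1]);
    const value = unquote(pair[2]);
    if (indented && parent) parent[key] = value; // entry of the open nested mapping
    else if (value === "") parent = data[key] = {}; // key with no value opens a mapping
    else {
      data[key] = value;
      parent = null;
    }
  }
  return { data, body: markdown.slice(match[0].length) };
}
```

The returned `body` is the Markdown that follows the closing `---`, ready for the tokenizer stage.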

**Subject Declaration** — Headings create typed subjects:

```markdown
## Alice Johnson {#alice typeof="Person"}
```

**Literal Properties** — Inline spans create properties:

```markdown
[Alice Johnson]{property="name"}
[30]{property="age" datatype="xsd:integer"}
```

**Object Properties** — Links create relationships:

```markdown
[Tech Corp](#company){rel="worksFor"}
```

**Lists** — Repeated properties:

```markdown
- [Item 1]{property="tag"}
- [Item 2]{property="tag"}
```

**Code Blocks** — Automatic `SoftwareSourceCode` mapping:

````markdown
```sparql
SELECT * WHERE { ?s ?p ?o }
```
````

**Tasks** — Markdown checklists become `schema:Action`:

```markdown
- [x] Completed task
- [ ] Pending task
```
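
Recognizing a checklist item is a one-line pattern match. The sketch below only detects the item and its checked state; how completion is represented on the resulting `schema:Action` is not specified in this section, so that mapping is left out. `parseTaskItem` is a hypothetical helper.

```javascript
// Sketch: detect a Markdown task list item and whether it is checked.
function parseTaskItem(line) {
  const m = line.match(/^\s*[-*+]\s+\[([ xX])\]\s+(.*)$/);
  if (!m) return null; // Not a task item (e.g. a plain bullet or an annotated span).
  return { completed: m[1] !== " ", text: m[2].trim() };
}

console.log(parseTaskItem("- [x] Completed task"));
console.log(parseTaskItem("- [ ] Pending task"));
```

An annotated list item such as `- [Item 1]{property="tag"}` returns `null`, because the brackets must contain exactly a space or an `x`.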

### Optimization Tips

1. **Reuse DataFactory** — Pass custom factory instance to avoid allocations
2. **Minimize frontmatter** — Keep `@context` simple for faster parsing
3. **Batch processing** — Process multiple documents sequentially
4. **Fragment IDs** — Use `#id` on headings for efficient cross-references

## Future Work

- [ ] Streaming API for large documents
- [ ] Tables → CSVW integration
- [ ] Math blocks → MathML + RDF
- [ ] Image syntax → `schema:ImageObject`
- [ ] Bare URL links → `dct:references`
- [ ] Language tags (`lang` attribute)
- [ ] Source maps for debugging

## Standards Compliance

This parser implements:

- [MD-LD v0.1 Specification](./mdld_spec_dogfood.md)
- [RDF/JS Data Model](https://rdf.js.org/data-model-spec/)
- [RDFa Core 1.1](https://www.w3.org/TR/rdfa-core/) (subset)
- [JSON-LD 1.1](https://www.w3.org/TR/json-ld11/) (frontmatter)