@udx/mq 0.1.1 → 1.1.3

package/README.md ADDED
@@ -0,0 +1,242 @@
1
+ # @udx/mq - Markdown Query
2
+
3
+ A powerful tool for querying and transforming markdown documents, designed as a companion to @udx/mcurl. Think of it as "jq for markdown" - a tool that lets you treat markdown as structured data.
4
+
5
+ ## Key Capabilities
6
+
7
+ - **Clean Content Extraction**: Pull narrative content without code blocks for cleaner analysis
8
+ - **Structured Querying**: Filter and transform markdown content like jq does for JSON
9
+ - **Document Analysis**: Generate actionable insights and understand document structure
10
+ - **Format Conversion**: Transform between JSON, markdown, and other formats
11
+ - **Composability**: Combine with other tools in Unix-style pipelines
12
+
13
+ ## Why Clean Content Extraction Matters
14
+
15
+ Code blocks in technical documents serve a crucial purpose for developers but act as "noise" when analyzing the narrative flow. By separating content from code, mq helps:
16
+
17
+ - Improve focus on conceptual information
18
+ - Extract cleaner summaries without code snippets
19
+ - Identify key points and arguments more easily
20
+ - Create more approachable versions of technical content
21
+
22
+ ## Installation
23
+
24
+ ```bash
25
+ npm install -g @udx/mq
26
+ ```
27
+
28
+ ## Usage Examples
29
+
30
+ ### Extract Clean Content (No Code Blocks)
31
+
32
+ ```bash
33
+ # Extract clean content without code blocks
34
+ mq --clean-content --input test/fixtures/test-code-blocks.md
35
+
36
+ # Filter content to include only h1 and h2 headings and their content
37
+ mq --clean-content=2 --input test/fixtures/complex-test.md
38
+
39
+ # Get clean content in JSON format
40
+ mq --clean-content --format json --input test/fixtures/test-code-blocks.md
41
+ ```
42
+
43
+ ### Basic Query Operations
44
+
45
+ ```bash
46
+ # Extract headings from a document (returns JSON structure by default)
47
+ mq --input test/fixtures/basic-test.md '.headings[]'
48
+
49
+ # Analyze document structure (returns formatted Markdown report)
50
+ mq --analyze --input test/fixtures/complex-test.md
51
+
52
+ # Generate a table of contents (returns Markdown TOC)
53
+ mq --input test/fixtures/test-document.md '.toc'
54
+
55
+ # Extract code blocks by language (returns JSON structure)
56
+ mq --language javascript --input test/fixtures/test-code-blocks.md
57
+
58
+ # Extract code content only in raw format
59
+ mq --language javascript --input test/fixtures/test-code-blocks.md | jq -r '.[0].content'
60
+
61
+ # Extract all images (returns JSON structure)
62
+ mq --input test/fixtures/test-images.md '.images[]'
63
+
64
+ # Extract first sentences from sections (returns text content)
65
+ mq --first-sentences 2 --input test/fixtures/test-sentences.md
66
+ ```
67
+
68
+ ### Pipe with mcurl
69
+
70
+ ```bash
71
+ # Fetch web content and analyze it
72
+ mcurl https://udx.io | mq --analyze
73
+
74
+ # Fetch web content and extract key information
75
+ mcurl https://udx.io/work | mq --clean-content
76
+
77
+ # First analyze the overall structure of web content
78
+ mcurl https://udx.io/about | mq --analyze
79
+ ```
80
+
81
+ ### Complex Queries
82
+
83
+ ```bash
84
+ # Extract level 2 headings
85
+ mq --input test/fixtures/complex-test.md '.headings[] | select(.level == 2)'
86
+
87
+ # Extract links to specific domain
88
+ mq --input test/fixtures/test-document.md '.links[] | select(.href | contains("example"))'
89
+
90
+ # Extract code blocks and make them collapsible
91
+ mq --input test/fixtures/test-code-blocks.md --transform-code-blocks
92
+ ```
93
+
94
+ ### Integration with curl and jq
95
+
96
+ One of the most powerful aspects of mq is its ability to integrate with curl, mcurl, and jq in Unix-style pipelines:
97
+
98
+ ```bash
99
+ # Fetch a GitHub markdown file and extract headings
100
+ curl -s https://raw.githubusercontent.com/WordPress/wordpress-develop/HEAD/README.md | mq '.headings[]'
101
+
102
+ # Get content from a website and extract clean narrative content
103
+ mcurl https://udx.io/about | mq --clean-content
104
+
105
+ # Process markdown content and pipe to jq for further filtering
106
+ curl -s https://raw.githubusercontent.com/WordPress/wordpress-develop/HEAD/README.md | mq --clean-content --format json | \
107
+ jq '[.[] | select(.type=="heading" and .level == 1)]'
108
+
109
+ # Extract expertise data from UDX API using proper jq patterns
110
+ curl -s 'https://udx.io/wp-json/udx/v2/works/search?query=&page=1' | \
111
+ jq '.facets.expertise[] | select(.count > 10) | {name: .name, count: .count}'
112
+ ```
113
+
114
+ ## Advanced Features
115
+
116
+ ### Clean Content Extraction
117
+
118
+ The clean content extractor is one of mq's most powerful features for document analysis. It removes code blocks while preserving the document's narrative structure:
119
+
120
+ ```bash
121
+ # Extract clean content without code blocks
122
+ mq --clean-content --input test/fixtures/test-code-blocks.md
123
+
124
+ # Limit extraction to specific heading levels (h1 and h2 only)
125
+ mq --clean-content=2 --input test/fixtures/complex-test.md
126
+
127
+ # Get JSON output for programmatic processing
128
+ mq --clean-content --format json --input test/fixtures/test-code-blocks.md | jq length
129
+ ```
130
+
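+ The same extraction is available programmatically through the library functions added in this release. A minimal sketch follows; the deep import path is an assumption, so adjust it to however the package exposes `lib/operations` in your setup:
+
+ ```javascript
+ import { readFileSync } from 'node:fs';
+ import { fromMarkdown } from 'mdast-util-from-markdown';
+ // Assumed deep import path into the installed package
+ import { extractCleanContent, contentToMarkdown } from '@udx/mq/lib/operations/content-extractor.js';
+
+ // Parse a markdown file into an mdast tree
+ const ast = fromMarkdown(readFileSync('test/fixtures/test-code-blocks.md', 'utf8'));
+
+ // Keep headings up to h2 and drop code blocks (selector and option mirror the CLI behavior)
+ const nodes = extractCleanContent(ast, 'level=2', { maxHeadingLevel: 2 });
+
+ // Convert the structured result back to markdown
+ console.log(contentToMarkdown(nodes, { addLineBreaks: true }));
+ ```
+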
131
+ #### Benefits of Clean Content Extraction
132
+
133
+ - **Improved Analysis**: Focus on the narrative without code noise
134
+ - **Better Summarization**: Generate more coherent summaries from technical content
135
+ - **Hierarchical Understanding**: Preserve document structure while filtering code
136
+ - **Content Repurposing**: Transform code-heavy tutorials into conceptual guides
137
+ - **Incremental Content Processing**: Extract varying amounts of content for different purposes
138
+
139
+ ### Advanced UDX Website Examples
140
+
141
+ ```bash
142
+ # Extract links from HTML content using mq
143
+ mcurl https://udx.io/about | mq '.links[0:5]'
144
+
145
+ # Extract clean content from a WordPress page for easier reading
146
+ mcurl https://udx.io/guidance | mq --clean-content
147
+
148
+ # First analyze page structure, then extract specific elements
149
+ mcurl https://udx.io/work | mq --analyze
150
+ ```
151
+
152
+ ## Approach
153
+
154
+ ### Best Practices for Working with Markdown and APIs
155
+
156
+ - **Native Node.js Functions**: Prefer Node.js built-ins for fetching API data rather than third-party HTTP clients. For example:
157
+
158
+ ```javascript
159
+ // Using native Node.js rather than dedicated modules
160
+ import https from 'node:https';
161
+
162
+ function fetchContent(url) {
163
+ // Function fetches content from URL using native Node.js modules
164
+ // Input: url - String URL to fetch
165
+ // Output: Promise that resolves to response body
166
+ return new Promise((resolve, reject) => {
167
+ https.get(url, (res) => {
168
+ let data = '';
169
+ res.on('data', (chunk) => { data += chunk; });
170
+ res.on('end', () => { resolve(data); });
171
+ }).on('error', reject);
172
+ });
173
+ }
174
+ ```
175
+
176
+ - **Logging and Debugging**: Always log API request metadata and response data for troubleshooting:
177
+
178
+ ```javascript
179
+ // Proper logging for API requests
180
+ function logApiRequest(url, options, response) {
181
+ // Log API request details when verbose mode is enabled
182
+ // Input: url - request URL, options - request options, response - API response
183
+ // Output: None, logs to console
184
+ if (process.env.DEBUG || process.env.VERBOSE) {
185
+ console.log(`[API Request] ${options.method || 'GET'} ${url}`);
186
+ console.log(`[API Response] Status: ${response.statusCode}`);
187
+ if (process.env.VERBOSE) {
188
+ console.log(`[API Response Body] ${JSON.stringify(response.body).substring(0, 200)}...`);
189
+ }
190
+ }
191
+ }
192
+ ```
193
+
194
+ - **Use Lodash for Complex Operations**: Leverage Lodash for data transformations to keep pipelines readable and resilient to missing or irregular data.
195
+
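+ For example, here is a minimal, self-contained sketch; the response shape and values are illustrative, mirroring the jq examples earlier in this README:
+
+ ```javascript
+ import _ from 'lodash';
+
+ // Illustrative sample data shaped like the UDX works/search response used above
+ const response = {
+   facets: {
+     expertise: [
+       { name: 'Cloud Automation', count: 14 },
+       { name: 'DevOps', count: 8 }
+     ]
+   }
+ };
+
+ // _.get tolerates missing paths; filter/map/sortBy keep the transformation readable
+ const topExpertise = _(_.get(response, 'facets.expertise', []))
+   .filter((item) => item.count > 10)
+   .map((item) => _.pick(item, ['name', 'count']))
+   .sortBy('count')
+   .value();
+
+ console.log(topExpertise); // [ { name: 'Cloud Automation', count: 14 } ]
+ ```
+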
196
+ - **Progressive Enhancement Workflow**:
197
+ 1. Start by analyzing content structure with `mq --analyze`
198
+ 2. Extract relevant sections with targeted selectors
199
+ 3. Process and transform with clean content extraction
200
+ 4. Format output appropriately for your use case
201
+
202
+ - **Testing Strategy**: Verify your pipelines with Mocha unit tests, a REST client, or simple curl commands.
203
+
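+ For instance, here is a minimal Mocha check for the CLI; this sketch assumes it runs from the package root, where mq.js and the shipped test/fixtures files live:
+
+ ```javascript
+ import { execSync } from 'node:child_process';
+ import assert from 'node:assert';
+
+ describe('mq --clean-content', () => {
+   it('strips fenced code blocks from the output', () => {
+     // Run the CLI directly rather than relying on a global install
+     const output = execSync(
+       'node mq.js --clean-content --input test/fixtures/test-code-blocks.md',
+       { encoding: 'utf8' }
+     );
+     // Cleaned output should be non-empty and contain no fenced code blocks
+     const fence = '`'.repeat(3);
+     assert.ok(output.trim().length > 0);
+     assert.ok(!output.includes(fence));
+   });
+ });
+ ```
+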
204
+ - **Documentation**: Add comprehensive function headers that explain purpose, inputs, and outputs for all custom operations.
205
+
206
+ ### Common Pipelines
207
+
208
+ ```bash
209
+ # Fetch content → clean it → format as JSON → count nodes
210
+ mcurl https://udx.io/about | mq --clean-content --format json | jq 'length'
211
+
212
+ # Analyze content structure then target specific elements
213
+ mcurl https://udx.io/work | mq --analyze && mcurl https://udx.io/work | mq '.headings[0:5]'
214
+
215
+ # Process multiple sources with consistent transformations
216
+ for url in "udx.io/about" "udx.io/work" "udx.io/guidance"; do
217
+ echo "Processing $url"
218
+ mcurl https://$url | mq --clean-content=2 | wc -l
219
+ done
220
+ ```
221
+
222
+ ### UDX API Integration Patterns
223
+
224
+ mq can be used as part of a larger data processing pipeline, working alongside other tools like curl and jq:
225
+
226
+ ```bash
227
+ # Use mq for HTML content processing
228
+ mcurl https://udx.io/work | mq --clean-content | grep "Cloud"
229
+
230
+ # Use curl+jq for JSON API processing (not mcurl!)
231
+ curl -s 'https://udx.io/wp-json/udx/v2/works/search?query=&page=1' | \
232
+ jq '.facets.expertise[] | select(.count > 10) | {name: .name, count: .count}'
233
+
234
+ # Get industry distribution with better formatting
235
+ curl -s 'https://udx.io/wp-json/udx/v2/works/search?query=&page=1' | \
236
+ jq '.facets.industries[] | select(.count > 5) | {name: .name, count: .count}'
237
+
238
+ # Pipeline: Extract content from UDX pages, clean it, then analyze structure
239
+ for page in "about" "work" "guidance"; do
240
+ mcurl "https://udx.io/$page" | mq --clean-content | mq --analyze | grep -i "headings"
241
+ done
242
+ ```
@@ -0,0 +1,195 @@
1
+ /**
2
+ * Enhanced Content Extractor for Markdown Query
3
+ *
4
+ * Provides flexible extraction methods for markdown content with customizable filters
5
+ * Designed specifically for extracting meaningful content while omitting code blocks
6
+ */
7
+
8
+ import { toString } from 'mdast-util-to-string';
9
+ import { visit } from 'unist-util-visit';
10
+ import _ from 'lodash';
11
+
12
+ /**
13
+ * Extract clean content without code blocks
14
+ * Extracts readable content from markdown while filtering code blocks
15
+ *
16
+ * Input: Markdown AST document
17
+ * Output: Array of content nodes with metadata
18
+ *
19
+ * @param {Object} ast - Markdown AST
20
+ * @param {string} selector - Selector for filtering (e.g., 'level=2' for headings up to level 2)
21
+ * @param {Object} options - Additional options for extraction
22
+ * @returns {Array} Array of content objects
23
+ */
24
+ function extractCleanContent(ast, selector, options = {}) {
25
+ // Default options
26
+ const config = _.defaults(options, {
27
+ includeHeadings: true,
28
+ includeParagraphs: true,
29
+ includeImages: false,
30
+ includeLists: true,
31
+ maxHeadingLevel: 6,
32
+ preserveHierarchy: true,
33
+ debug: false
34
+ });
35
+
36
+ // Track which nodes to exclude (code blocks and their containers)
37
+ const excludeNodes = new Set();
38
+ const contentNodes = [];
39
+
40
+ // First pass: identify all code blocks to exclude
41
+ visit(ast, 'code', (node, index, parent) => {
42
+ excludeNodes.add(node);
43
+ if (config.debug) {
44
+ console.log(`Found code block at index ${index}, marking for exclusion`);
45
+ }
46
+ });
47
+
48
+ // Parse options from selector
49
+ if (selector && selector.includes('level=')) {
50
+ const levelMatch = selector.match(/level=(\d+)/);
51
+ if (levelMatch) {
52
+ config.maxHeadingLevel = parseInt(levelMatch[1], 10);
53
+ }
54
+ }
55
+
56
+ // Second pass: extract all desired content
57
+ visit(ast, (node) => {
58
+ // Skip excluded nodes
59
+ if (excludeNodes.has(node)) {
60
+ return;
61
+ }
62
+
63
+ let content;
64
+
65
+ // Process based on node type
66
+ switch (node.type) {
67
+ case 'heading':
68
+ if (config.includeHeadings && node.depth <= config.maxHeadingLevel) {
69
+ content = {
70
+ type: 'heading',
71
+ level: node.depth,
72
+ text: toString(node),
73
+ position: node.position
74
+ };
75
+ }
76
+ break;
77
+
78
+ case 'paragraph':
79
+ if (config.includeParagraphs) {
80
+ content = {
81
+ type: 'paragraph',
82
+ text: toString(node),
83
+ position: node.position
84
+ };
85
+ }
86
+ break;
87
+
88
+ case 'image':
89
+ if (config.includeImages) {
90
+ content = {
91
+ type: 'image',
92
+ alt: node.alt || '',
93
+ src: node.url,
94
+ title: node.title || '',
95
+ position: node.position
96
+ };
97
+ }
98
+ break;
99
+
100
+ case 'list':
101
+ if (config.includeLists) {
102
+ const items = [];
103
+ visit(node, 'listItem', (item) => {
104
+ items.push(toString(item));
105
+ });
106
+
107
+ content = {
108
+ type: 'list',
109
+ items,
110
+ ordered: node.ordered,
111
+ position: node.position
112
+ };
113
+ }
114
+ break;
115
+ }
116
+
117
+ if (content) {
118
+ contentNodes.push(content);
119
+ }
120
+ });
121
+
122
+ // Sort content by original position in document
123
+ const sortedContent = _.sortBy(contentNodes, (node) => {
124
+ return node.position ? node.position.start.line : 0;
125
+ });
126
+
127
+ // Log operation info when in verbose debug mode
128
+ if (config.debug) {
129
+ console.log(`Extracted ${sortedContent.length} content nodes, filtering out code blocks`);
130
+ }
131
+
132
+ return sortedContent;
133
+ }
134
+
135
+ /**
136
+ * Convert extracted content to markdown
137
+ * Transforms the output from extractCleanContent back to markdown
138
+ *
139
+ * Input: Array of content nodes
140
+ * Output: Markdown string
141
+ *
142
+ * @param {Array} contentNodes - Array of content objects
143
+ * @param {Object} options - Formatting options
144
+ * @returns {string} Formatted markdown content
145
+ */
146
+ function contentToMarkdown(contentNodes, options = {}) {
147
+ // Default options
148
+ const config = _.defaults(options, {
149
+ addSeparators: false,
150
+ headingStyle: '#', // or 'setext'
151
+ addLineBreaks: true
152
+ });
153
+
154
+ let markdown = '';
155
+
156
+ contentNodes.forEach((node) => {
157
+ switch (node.type) {
158
+ case 'heading':
159
+ // Add proper number of # for heading level
160
+ markdown += '#'.repeat(node.level) + ' ' + node.text;
161
+ markdown += config.addLineBreaks ? '\n\n' : '\n';
162
+ break;
163
+
164
+ case 'paragraph':
165
+ markdown += node.text;
166
+ markdown += config.addLineBreaks ? '\n\n' : '\n';
167
+ break;
168
+
169
+ case 'image':
170
+ markdown += `![${node.alt}](${node.src}${node.title ? ` "${node.title}"` : ''})`;
171
+ markdown += config.addLineBreaks ? '\n\n' : '\n';
172
+ break;
173
+
174
+ case 'list':
175
+ node.items.forEach((item, index) => {
176
+ const prefix = node.ordered ? `${index + 1}. ` : '- ';
177
+ markdown += prefix + item;
178
+ markdown += config.addLineBreaks ? '\n' : '';
179
+ });
180
+ markdown += config.addLineBreaks ? '\n' : '';
181
+ break;
182
+ }
183
+
184
+ if (config.addSeparators) {
185
+ markdown += '---\n\n';
186
+ }
187
+ });
188
+
189
+ return markdown;
190
+ }
191
+
192
+ export {
193
+ extractCleanContent,
194
+ contentToMarkdown
195
+ };
@@ -14,7 +14,9 @@ export {
14
14
  extractLinks,
15
15
  generateToc,
16
16
  extractSections,
17
- filterHeadingsByLevel
17
+ filterHeadingsByLevel,
18
+ extractImages,
19
+ extractFirstSentences
18
20
  };
19
21
 
20
22
  /**
@@ -57,11 +59,12 @@ function extractHeadings(ast, selector) {
57
59
  /**
58
60
  * Extract code blocks from AST
59
61
  *
60
- * Extracts all code blocks from a markdown document with their language and content
62
+ * Extracts all code blocks from a markdown document with their language and content.
63
+ * Supports filtering by language using the lang= selector.
61
64
  *
62
65
  * @param {Object} ast - Markdown AST
63
- * @param {string} selector - Query selector
64
- * @returns {Array|Object} Array of code blocks or single code block if index specified
66
+ * @param {string} selector - Query selector (e.g., '[0]' for first block, 'lang=php' for PHP blocks)
67
+ * @returns {Array|Object} Array of code blocks or filtered blocks based on selector
65
68
  */
66
69
  function extractCodeBlocks(ast, selector) {
67
70
  const codeBlocks = [];
@@ -73,6 +76,15 @@ function extractCodeBlocks(ast, selector) {
73
76
  });
74
77
  });
75
78
 
79
+ // Handle language selector if present
80
+ if (selector && selector.includes('lang=')) {
81
+ const langMatch = selector.match(/lang=["']?([a-zA-Z0-9]+)["']?/);
82
+ if (langMatch) {
83
+ const language = langMatch[1];
84
+ return codeBlocks.filter(block => block.language === language);
85
+ }
86
+ }
87
+
76
88
  // Handle array index selector if present
77
89
  if (selector && selector.includes('[')) {
78
90
  const indexMatch = selector.match(/\[(\d+)\]/);
@@ -245,3 +257,92 @@ function filterHeadingsByLevel(ast, level) {
245
257
 
246
258
  return output;
247
259
  }
260
+
261
+ /**
262
+ * Extract images from AST
263
+ *
264
+ * Extracts all images from a markdown document with their alt text and source URL
265
+ *
266
+ * @param {Object} ast - Markdown AST
267
+ * @param {string} selector - Query selector
268
+ * @returns {Array|Object} Array of images or single image if index specified
269
+ */
270
+ function extractImages(ast, selector) {
271
+ const images = [];
272
+
273
+ visit(ast, 'image', (node) => {
274
+ images.push({
275
+ alt: node.alt || '',
276
+ src: node.url,
277
+ title: node.title || ''
278
+ });
279
+ });
280
+
281
+ // Handle array index selector if present
282
+ if (selector && selector.includes('[')) {
283
+ const indexMatch = selector.match(/\[(\d+)\]/);
284
+ if (indexMatch) {
285
+ const index = parseInt(indexMatch[1], 10);
286
+ return images[index];
287
+ }
288
+ return images;
289
+ }
290
+
291
+ return images;
292
+ }
293
+
294
+ /**
295
+ * Extract first sentences from sections
296
+ *
297
+ * Extracts the first sentence from each section down to a specified heading level
298
+ *
299
+ * @param {Object} ast - Markdown AST
300
+ * @param {string} selector - Query selector (e.g., 'level=3' for headings up to level 3)
301
+ * @returns {Array} Array of sections with their first sentences
302
+ */
303
+ function extractFirstSentences(ast, selector) {
304
+ // Parse selector for max level if present
305
+ let maxLevel = 6;
306
+ if (selector && selector.includes('level=')) {
307
+ const levelMatch = selector.match(/level=(\d+)/);
308
+ if (levelMatch) {
309
+ maxLevel = parseInt(levelMatch[1], 10);
310
+ }
311
+ }
312
+
313
+ // Extract sections
314
+ const sections = extractSections(ast);
315
+ const result = [];
316
+
317
+ // Process each section to extract first sentence
318
+ sections.forEach(section => {
319
+ // Skip sections with heading level greater than maxLevel
320
+ if (section.depth > maxLevel) return;
321
+
322
+ // Find the first paragraph after the heading
323
+ let firstSentence = '';
324
+ for (const node of section.content) {
325
+ if (node.type === 'paragraph') {
326
+ const text = toString(node);
327
+ // Extract first sentence (ending with period, question mark, or exclamation mark)
328
+ const sentenceMatch = text.match(/^[^.!?]*[.!?]/);
329
+ if (sentenceMatch) {
330
+ firstSentence = sentenceMatch[0];
331
+ break;
332
+ } else {
333
+ // If no sentence ending found, use the full paragraph
334
+ firstSentence = text;
335
+ break;
336
+ }
337
+ }
338
+ }
339
+
340
+ result.push({
341
+ title: section.title,
342
+ level: section.depth,
343
+ firstSentence
344
+ });
345
+ });
346
+
347
+ return result;
348
+ }
package/mq.js CHANGED
@@ -45,7 +45,8 @@ const packageJson = JSON.parse(
45
45
  );
46
46
 
47
47
  // Import extract operations
48
- import { extractHeadings, extractCodeBlocks, extractLinks, generateToc, extractSections, filterHeadingsByLevel } from './lib/operations/extractors.js';
48
+ import { extractHeadings, extractCodeBlocks, extractLinks, generateToc, extractSections, filterHeadingsByLevel, extractImages, extractFirstSentences } from './lib/operations/extractors.js';
49
+ import { extractCleanContent, contentToMarkdown } from './lib/operations/content-extractor.js';
49
50
 
50
51
  // Import analysis operations
51
52
  import { showDocumentStructure, countDocumentElements, analyzeDocument } from './lib/operations/analysis.js';
@@ -60,6 +61,9 @@ const queryOperations = {
60
61
  'count': countDocumentElements,
61
62
  'sections': extractSections,
62
63
  'level': filterHeadingsByLevel,
64
+ 'images': extractImages,
65
+ 'firstSentences': extractFirstSentences,
66
+ 'cleanContent': extractCleanContent,
63
67
  'default': (ast, query) => ast
64
68
  };
65
69
 
@@ -98,11 +102,15 @@ program
98
102
  .option('-a, --analyze', 'Analyze document structure')
99
103
  .option('-s, --structure', 'Show document structure (headings hierarchy)')
100
104
  .option('-c, --count', 'Count document elements')
105
+ .option('-l, --language <lang>', 'Filter code blocks by language')
101
106
  .option('-i, --input <file>', 'Input file (defaults to stdin)')
102
107
  .option('-o, --output <file>', 'Output file (defaults to stdout)')
103
108
  .option('-f, --format <format>', 'Output format (json, yaml, markdown)', 'markdown')
104
109
  .option('-v, --verbose', 'Verbose output with operation details')
105
110
  .option('-d, --debug', 'Debug mode with detailed logs for troubleshooting')
111
+ .option('--images', 'Extract all images from markdown')
112
+ .option('--first-sentences [level]', 'Extract first sentences of sections, optionally specify max heading level')
113
+ .option('--clean-content [level]', 'Extract clean content without code blocks, optionally specify max heading level')
106
114
  .parse(process.argv);
107
115
 
108
116
  const options = program.opts();
@@ -338,6 +346,39 @@ async function main() {
338
346
  log(`Filtering headings by level: ${options.level}`, 'verbose');
339
347
  result = filterHeadingsByLevel(ast, parseInt(options.level, 10));
340
348
  log('Heading filtering completed', 'debug');
349
+ } else if (options.language) {
350
+ log(`Filtering code blocks by language: ${options.language}`, 'verbose');
351
+ result = extractCodeBlocks(ast, `lang=${options.language}`);
352
+ log('Code block filtering completed', 'debug');
353
+ } else if (options.images) {
354
+ log('Extracting images from markdown', 'verbose');
355
+ result = extractImages(ast);
356
+ log('Image extraction completed', 'debug');
357
+ } else if (options.firstSentences) {
358
+ const level = typeof options.firstSentences === 'string' ?
359
+ options.firstSentences : '6';
360
+ log(`Extracting first sentences with max level ${level}`, 'verbose');
361
+ result = extractFirstSentences(ast, `level=${level}`);
362
+ log('First sentence extraction completed', 'debug');
363
+ } else if (options.cleanContent) {
364
+ const level = typeof options.cleanContent === 'string' ?
365
+ options.cleanContent : '6';
366
+ log(`Extracting clean content without code blocks, max level ${level}`, 'verbose');
367
+
368
+ const cleanContent = extractCleanContent(ast, `level=${level}`, {
369
+ debug: options.debug || options.verbose,
370
+ maxHeadingLevel: parseInt(level, 10)
371
+ });
372
+
373
+ if (options.format === 'json' || options.format === 'yaml') {
374
+ result = cleanContent;
375
+ } else {
376
+ // For markdown output, convert the structured content back to markdown
377
+ result = contentToMarkdown(cleanContent, {
378
+ addLineBreaks: true
379
+ });
380
+ }
381
+ log('Clean content extraction completed', 'debug');
341
382
  } else if (options.transform) {
342
383
  log(`Applying transform operation: ${options.transform}`, 'verbose');
343
384
  result = transformMarkdown(ast, options.transform);
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@udx/mq",
3
- "version": "0.1.1",
3
+ "version": "1.1.3",
4
4
  "description": "Markdown Query - jq for Markdown documents",
5
5
  "main": "mq.js",
6
6
  "type": "module",
@@ -10,7 +10,7 @@
10
10
  "scripts": {
11
11
  "test": "NODE_OPTIONS=--experimental-vm-modules npx mocha",
12
12
  "test:gist": "NODE_OPTIONS=--experimental-vm-modules npx mocha test/gist-integration.test.mjs",
13
- "test:integration": "NODE_OPTIONS=--experimental-vm-modules npx mocha test/integration.test.mjs",
13
+ "test:integration": "NODE_OPTIONS=--experimental-vm-modules npx mocha test/integration.test.mjs",
14
14
  "test:all": "npm test",
15
15
  "start": "node mq.js",
16
16
  "prepublishOnly": "npm test"
@@ -46,11 +46,12 @@
46
46
  "examples"
47
47
  ],
48
48
  "engines": {
49
- "node": ">=14.16"
49
+ "node": ">=18.17"
50
50
  },
51
51
  "dependencies": {
52
52
  "commander": "^11.1.0",
53
53
  "js-yaml": "^4.1.0",
54
+ "lodash": "^4.17.21",
54
55
  "mdast-util-from-markdown": "^1.3.0",
55
56
  "mdast-util-gfm": "^3.1.0",
56
57
  "mdast-util-to-markdown": "^1.5.0",
@@ -62,6 +63,7 @@
62
63
  "@udx/mcurl": "^0.3.0"
63
64
  },
64
65
  "devDependencies": {
66
+ "express": "^4.21.2",
65
67
  "mocha": "^11.1.0"
66
68
  }
67
69
  }
package/readme.md DELETED
@@ -1,242 +0,0 @@
1
- # mq - Markdown Query
2
-
3
- A powerful tool for querying, transforming, and analyzing markdown documents, designed as an extension for [@udx/mcurl](../mcurl).
4
-
5
- ## Quick Start
6
-
7
- Installing mq
8
-
9
- ```bash
10
- npm install -g @udx/mq
11
- ```
12
-
13
- Test installation:
14
-
15
- ```bash
16
- mq --version
17
- ```
18
-
19
- ### Basic Usage
20
-
21
- ```bash
22
- # Extract all headings
23
- mq '.headings[]' -i ../../README.md
24
-
25
- # Get the table of contents
26
- mq '.toc' -i ../../README.md
27
-
28
- # Extract all code blocks with their language
29
- mq '.codeBlocks[] | {language, content}' -i ../../README.md
30
-
31
- # Make all code blocks collapsible
32
- mq -t '.codeBlocks[] |= makeCollapsible' -i ../../README.md
33
-
34
- # Analyze document structure
35
- mq --analyze -i ../../README.md
36
-
37
- # Show document structure (headings hierarchy)
38
- mq --structure -i ../../README.md
39
-
40
- # Count document elements
41
- mq --count -i ../../README.md
42
- ```
43
-
44
- ### Using Piped Input
45
-
46
- You can also use mq with piped input from the tools/mq directory:
47
-
48
- ```bash
49
- cat ../../README.md | mq '.headings[]'
50
- ```
51
-
52
- ## Concept
53
-
54
- Just as `jq` allows you to query and transform JSON data, `mq` provides a similar experience for markdown documents. It parses markdown into a structured object that you can query, transform, and analyze using a familiar syntax.
55
-
56
- ## Features
57
-
58
- - Query markdown documents with a jq-like syntax
59
- - Transform markdown content with various operations
60
- - Extract specific elements like headings, code blocks, links
61
- - Generate table of contents
62
- - Analyze document structure and content
63
- - Count document elements
64
- - Integration with mcurl
65
-
66
- ## Query Syntax
67
-
68
- mq uses a simplified jq-like syntax for querying markdown elements:
69
-
70
- ```bash
71
- # Basic element selection
72
- .headings[] # All headings
73
- .headings[0] # First heading
74
- .headings[] | select(.level == 2) # All h2 headings
75
- .links[] # All links
76
- .codeBlocks[] # All code blocks
77
- .toc # Table of contents
78
-
79
- # Filtering and transformations
80
- .headings[] | select(.text | contains("Introduction")) # Headings containing "Introduction"
81
- .codeBlocks[] | select(.language == "javascript") # JavaScript code blocks
82
- .links[] | select(.href | startswith("http")) # External links
83
- ```
84
-
85
- ## Transform Operations
86
-
87
- Transform operations modify the markdown structure:
88
-
89
- ```bash
90
- # Make code blocks collapsible
91
- mq -t '.codeBlocks[] |= makeCollapsible' -i test/fixtures/content-website.md
92
-
93
- # Add a descriptive TOC
94
- mq -t '.toc |= makeDescriptive' -i test/fixtures/content-website.md
95
-
96
- # Add cross-links section after TOC
97
- mq -t '.toc |= addCrossLinks(["worker.md", "platform.md"])' -i test/fixtures/content-website.md
98
-
99
- # Fix heading hierarchy
100
- mq -t '.headings |= fixHierarchy' -i test/fixtures/content-website.md
101
- ```
102
-
103
- ## Analysis Features
104
-
105
- mq provides powerful analysis capabilities:
106
-
107
- ```bash
108
- # Analyze document structure and content
109
- mq --analyze -i test/fixtures/content-website.md
110
-
111
- # Show document structure (headings hierarchy)
112
- mq --structure -i test/fixtures/content-website.md
113
-
114
- # Count document elements
115
- mq --count -i test/fixtures/content-website.md
116
- ```
117
-
118
- ## Piping with Other Tools
119
-
120
- mq works seamlessly with Unix pipes and other tools:
121
-
122
- ```bash
123
- # Find markdown files with inconsistent heading hierarchy
124
- find ../../content/ -name "*.md" | xargs -I{} sh -c 'cat {} | mq ".headings | validateHierarchy" || echo "Issue in {}"'
125
-
126
- # Extract all external links from documentation
127
- find ../../content/ -name "*.md" | xargs cat | mq '.links[] | select(.href | startswith("http")) | .href' | sort | uniq
128
-
129
- # Generate a combined TOC from multiple files
130
- find ../../content/architecture/ -name "*.md" | xargs cat | mq '.headings[] | select(.level <= 2) | {file: input_filename, heading: .text}'
131
- ```
132
-
133
- ## Integration with mCurl
134
-
135
- Use mq as a content handler extension for mCurl:
136
-
137
- ```javascript
138
- import { registerContentHandler } from '@udx/mcurl';
139
- import { markdownHandler } from '@udx/mq';
140
-
141
- // Register the markdown handler with mcurl
142
- registerContentHandler('text/markdown', markdownHandler);
143
-
144
- // Fetch and transform markdown content
145
- const result = await mcurl('https://udx.io', {
146
- mqQuery: '.headings[] | select(.level == 2) | .text'
147
- });
148
- ```
149
-
150
- ## Common Use Cases
151
-
152
- ### 1. Documentation Analysis
153
-
154
- ```bash
155
- # Find documents missing a proper TOC
156
- find test/fixtures/ -name "*.md" | xargs -I{} sh -c 'cat {} | mq ".toc | length" | grep -q "^0$" && echo "{} missing TOC"'
157
-
158
- # List all unique tags used in documentation
159
- find test/fixtures/ -name "*.md" | xargs cat | mq '.tags[]?' | sort | uniq -c | sort -nr
160
-
161
- # Analyze content distribution
162
- find test/fixtures/ -name "*.md" | xargs -I{} sh -c 'cat {} | mq --analyze > {}.analysis.md'
163
- ```
164
-
165
- ### 2. Batch Transformations
166
-
167
- ```bash
168
- # Add cross-links between related documents
169
- for file in test/fixtures/*.md; do
170
- cat $file | mq -t '.toc |= addCrossLinks(["worker.md", "platform.md"])' > $file.new
171
- mv $file.new $file
172
- done
173
-
174
- # Make all code blocks collapsible
175
- find test/fixtures/ -name "*.md" | xargs -I{} sh -c 'cat {} | mq -t ".codeBlocks[] |= makeCollapsible" > {}.new && mv {}.new {}'
176
- ```
177
-
178
- ### 3. Documentation Validation
179
-
180
- ```bash
181
- # Validate all links are working
182
- find test/fixtures/ -name "*.md" | xargs cat | mq '.links[] | .href' | xargs -I{} curl -s -o /dev/null -w "%{http_code} {}\n" {} | grep -v "^200"
183
-
184
- # Check for consistent heading structure
185
- find test/fixtures/ -name "*.md" | xargs -I{} sh -c 'cat {} | mq ".headings | validateHierarchy" || echo "Issue in {}"'
186
- ```
187
-
188
- ## Examples
189
-
190
- See the [examples](./examples) directory for more usage examples.
191
-
192
- ## Document Comparison Examples
193
-
194
- Use `mq` to compare and analyze differences between architecture documents. Run these commands from the tools/mq directory:
195
-
196
- ```bash
197
-
198
- # Compare document statistics between files
199
- mq --count -i test/fixtures/stateless.md | grep 'Headings\|Paragraphs\|Links\|CodeBlocks'
200
-
201
- # Find unique sections in a specific document (e.g., mobile credential-related sections)
202
- mq '.headings[] | select(.text | contains("Mobile Credential"))' -i test/fixtures/stateless.md
203
-
204
- # Compare content distribution to identify most detailed sections
205
- mq --analyze -i test/fixtures/stateless.md | grep -A20 'Content Distribution'
206
-
207
- # Extract links to see different external references
208
- mq '.links[]' -i test/fixtures/stateless.md
209
- ```
210
-
211
- For inline comparison without creating temporary files, use command substitution:
212
-
213
- ```bash
214
- # Make sure to run these from the tools/mq directory
215
-
216
- # Directly compare heading counts
217
- echo "Stateless: $(mq --count -i test/fixtures/stateless.md | grep 'Headings' | awk '{print $2}')"
218
-
219
- # Find terms that appear in one document but not the other
220
- comm -23 <(mq '.headings[].text' -i test/fixtures/stateless.md | sort) \
221
- <(mq '.headings[].text' -i test/fixtures/cloud-automation-group.md | sort)
222
- ```
223
-
224
- ### Document Comparison Script
225
-
226
- Create a simple shell script to help compare architecture documents:
227
-
228
- ```bash
229
- #!/bin/bash
230
- # Save as tools/mq/compare-architectures.sh and make executable with chmod +x
231
-
232
- echo "=== Comparing Document Statistics ==="
233
- echo "Stateless:"
234
- mq --count -i test/fixtures/stateless.md | grep 'Headings\|Paragraphs\|Links\|CodeBlocks'
235
-
236
- echo "\n=== Finding Mobile Credential Sections ==="
237
- mq '.headings[] | select(.text | contains("Mobile") or .text | contains("Credential"))' -i test/fixtures/stateless.md | grep "text"
238
- ```
239
-
240
- ## License
241
-
242
- MIT