tree-sitter-markdown-text 0.2.0__tar.gz

Files changed (22)
  1. tree_sitter_markdown_text-0.2.0/LICENSE +21 -0
  2. tree_sitter_markdown_text-0.2.0/PKG-INFO +181 -0
  3. tree_sitter_markdown_text-0.2.0/README.md +162 -0
  4. tree_sitter_markdown_text-0.2.0/bindings/python/tree_sitter_markdown_text/__init__.py +34 -0
  5. tree_sitter_markdown_text-0.2.0/bindings/python/tree_sitter_markdown_text/__init__.pyi +17 -0
  6. tree_sitter_markdown_text-0.2.0/bindings/python/tree_sitter_markdown_text/binding.c +35 -0
  7. tree_sitter_markdown_text-0.2.0/bindings/python/tree_sitter_markdown_text/py.typed +0 -0
  8. tree_sitter_markdown_text-0.2.0/bindings/python/tree_sitter_markdown_text.egg-info/PKG-INFO +181 -0
  9. tree_sitter_markdown_text-0.2.0/bindings/python/tree_sitter_markdown_text.egg-info/SOURCES.txt +15 -0
  10. tree_sitter_markdown_text-0.2.0/bindings/python/tree_sitter_markdown_text.egg-info/dependency_links.txt +1 -0
  11. tree_sitter_markdown_text-0.2.0/bindings/python/tree_sitter_markdown_text.egg-info/not-zip-safe +1 -0
  12. tree_sitter_markdown_text-0.2.0/bindings/python/tree_sitter_markdown_text.egg-info/requires.txt +3 -0
  13. tree_sitter_markdown_text-0.2.0/bindings/python/tree_sitter_markdown_text.egg-info/top_level.txt +2 -0
  14. tree_sitter_markdown_text-0.2.0/pyproject.toml +29 -0
  15. tree_sitter_markdown_text-0.2.0/queries/highlights.scm +47 -0
  16. tree_sitter_markdown_text-0.2.0/queries/injections.scm +22 -0
  17. tree_sitter_markdown_text-0.2.0/setup.cfg +4 -0
  18. tree_sitter_markdown_text-0.2.0/setup.py +82 -0
  19. tree_sitter_markdown_text-0.2.0/src/parser.c +139233 -0
  20. tree_sitter_markdown_text-0.2.0/src/tree_sitter/alloc.h +54 -0
  21. tree_sitter_markdown_text-0.2.0/src/tree_sitter/array.h +330 -0
  22. tree_sitter_markdown_text-0.2.0/src/tree_sitter/parser.h +286 -0
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2023 Airbus CERT
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,181 @@
1
+ Metadata-Version: 2.4
2
+ Name: tree-sitter-markdown-text
3
+ Version: 0.2.0
4
+ Summary: Markdown grammar for tree-sitter, with a textlint-style AST shape
5
+ Author: Ophidiarium
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/ophidiarium/tree-sitter-markdown-text
8
+ Keywords: incremental,parsing,tree-sitter,markdown,textlint
9
+ Classifier: Intended Audience :: Developers
10
+ Classifier: Topic :: Software Development :: Compilers
11
+ Classifier: Topic :: Text Processing :: Linguistic
12
+ Classifier: Typing :: Typed
13
+ Requires-Python: >=3.10
14
+ Description-Content-Type: text/markdown
15
+ License-File: LICENSE
16
+ Provides-Extra: core
17
+ Requires-Dist: tree-sitter~=0.24; extra == "core"
18
+ Dynamic: license-file
19
+
20
+ # tree-sitter-markdown-text
21
+
22
+ Markdown grammar for [tree-sitter](https://github.com/tree-sitter/tree-sitter), shaped so that its AST lines up with the [textlint `TxtNode`](https://github.com/textlint/textlint/blob/master/docs/txtnode.md) model.
23
+
24
+ Parses `.md` (and `.markdown`, `.mdown`, `.mkd`, `.mkdn`) files into a concrete syntax tree covering the full CommonMark block structure plus common extensions (GFM pipe tables, task lists, GFM alerts, YAML/TOML front matter, Pandoc math and directive blocks, footnotes, MDX JSX). Inline content is surfaced as structured children of the `inline` wrapper: classified tokens (`word_token`, `numeric_token`, `identifier_like_token`, `path_like_token`) and punctuation-class nodes (`terminator`, `separator`, `bracket`, `operator_like`), plus inline structural nodes (`emphasis`, `strong`, `strikethrough`, `link`, `image`, `autolink`, `inline_code`, `html_inline`, `math_inline`, `mdx_jsx_inline`, `footnote_reference`).
25
+
26
+ ## Features
27
+
28
+ ### Block nodes
29
+
30
+ - **Document structure** &mdash; `document`, nested `section` wrappers around ATX headings, `paragraph`, `blank_line` (as a first-class node).
31
+ - **Headings** &mdash; ATX (`#`..`######`) and setext (`===`/`---`) with the heading level exposed as a `level` field on both `atx_heading` and `setext_heading`.
32
+ - **Code blocks** &mdash; indented code blocks and fenced code blocks (backtick and tilde), with `info_string`/`language` children for the GFM language tag.
33
+ - **Math blocks** &mdash; Pandoc/GitLab/KaTeX display math (`$$…$$`) as a dedicated `math_block` with `math_block_delimiter`/`math_block_content` children.
34
+ - **Lists** &mdash; unordered (`+`/`-`/`*`) and ordered (`1.`/`1)`) list markers. GFM task list items are promoted to `task_list_item` (distinct from `list_item`), with `task_list_marker_checked`/`task_list_marker_unchecked` markers.
35
+ - **Block quotes and callouts** &mdash; nested quotes and lazy continuations. A block quote whose first paragraph begins with `[!NOTE]` / `[!TIP]` / `[!IMPORTANT]` / `[!WARNING]` / `[!CAUTION]` (or any uppercase-only label) is surfaced as `callout` with a `callout_type` field.
36
+ - **Thematic breaks** &mdash; `---`, `***`, `___`.
37
+ - **HTML blocks** &mdash; all 7 CommonMark HTML block types; block-level HTML comments are aliased to `html_comment_block` for easy metric extraction.
38
+ - **MDX JSX blocks** &mdash; shallow `mdx_jsx_block` for lines that start with an MDX-style JSX element (`<Component ...>`, `<Component/>`, `</Component>`). Component-style mixed-case names disambiguate from all-caps HTML blocks such as `<DIV>`.
39
+ - **Pipe tables** &mdash; `pipe_table` with `pipe_table_header`, `pipe_table_delimiter_row`, `pipe_table_row`, `pipe_table_cell`, `pipe_table_align_left`/`pipe_table_align_right`.
40
+ - **Link reference definitions** &mdash; `link_reference_definition` with `link_label`/`link_destination`/`link_title` children.
41
+ - **Footnote definitions** &mdash; `footnote_definition` (`[^id]: …`) with a `footnote_label` child.
42
+ - **Directive blocks** &mdash; generic container directives (`:::name … :::`, per remark-directive / MyST / Pandoc fenced divs) as `directive_block` with `directive_block_delimiter`/`directive_name`/`directive_block_content` children.
43
+ - **Image blocks** &mdash; a paragraph consisting of a single block-level image (`![alt](dest)` on its own line) is surfaced as `image_block` with `link_label`/`link_destination` children.
44
+ - **Front matter** &mdash; YAML (`---` fenced) as `minus_metadata`, TOML (`+++` fenced) as `plus_metadata`.
45
+
46
+ ### Inline nodes (children of the `inline` wrapper)
47
+
48
+ - **Classified text tokens** &mdash; `text_span` wraps runs of classified tokens: `word_token` (Unicode alphabetic), `numeric_token` (integers, decimals, versions), `identifier_like_token` (camelCase / PascalCase / snake_case), `path_like_token` (paths with `/` separators or dotted identifiers).
49
+ - **Punctuation classes** &mdash; every punctuation lexeme is classified: `terminator` (`.`, `?`, `!`, `。`, `…`), `separator` (`,`, `;`, `:`), `bracket` (`(`, `)`, `[`, `]`, `{`, `}`, `<`, `>`), `operator_like` (`::`, `->`, `=>`, `=`, `+`, `-`, `*`, `/`, `|`, `&`, and other punctuation).
50
+ - **Emphasis / strong / strikethrough** &mdash; `emphasis` (`*…*` or `_…_`), `strong` (`**…**` or `__…__`), `strikethrough` (`~~…~~`), each with a `_delimiter`/`_content`/`_delimiter` sub-tree.
51
+ - **Code spans** &mdash; `inline_code` with matched backtick-run delimiters (1 or 2 backticks).
52
+ - **Links and images** &mdash; `link` (inline, full-reference, collapsed-reference, shortcut-reference forms) and `image` (`![alt](dest)` or `![alt][ref]`). Both expose `link_label`/`link_destination`/`link_title` children.
53
+ - **Autolinks** &mdash; `autolink` with `uri` or `email` children for `<https://…>` and `<user@example.com>`.
54
+ - **Raw HTML inline** &mdash; `html_inline` with `html_open_tag`/`html_close_tag`/`html_comment`/`html_cdata`/`html_declaration`/`html_processing_instruction` children.
55
+ - **MDX JSX inline** &mdash; shallow `mdx_jsx_inline` with `mdx_jsx_open_tag`/`mdx_jsx_close_tag`/`mdx_jsx_expression` children.
56
+ - **Inline math** &mdash; `math_inline` (`$…$`) with `math_inline_delimiter`/`math_inline_content` children. Disambiguated from `math_block` (`$$…$$`).
57
+ - **Footnote references** &mdash; `footnote_reference` (`[^id]` inside prose) with a `footnote_reference_label` child.
58
+
59
+ - **Injections query** &mdash; ships a `queries/injections.scm` that injects into fenced-code-block info strings, HTML blocks, and front matter.
60
+
61
+ ## Example
62
+
63
+ ````markdown
64
+ # Heading
65
+
66
+ A paragraph with inline content.
67
+
68
+ - one
69
+ - two
70
+
71
+ ```go
72
+ func main() {}
73
+ ```
74
+ ````
75
+
76
+ Parsed tree (abbreviated):
77
+
78
+ ```
79
+ (document
80
+   (section
81
+     (atx_heading level: (atx_h1_marker) heading_content: (inline))
82
+     (blank_line)
83
+     (paragraph (inline))
84
+     (blank_line)
85
+     (list
86
+       (list_item (list_marker_minus) (paragraph (inline)))
87
+       (list_item (list_marker_minus) (paragraph (inline))))
88
+     (blank_line)
89
+     (fenced_code_block
90
+       (fenced_code_block_delimiter)
91
+       (info_string (language))
92
+       (code_fence_content)
93
+       (fenced_code_block_delimiter))))
94
+ ```
95
+
96
+ ## Relationship to textlint
97
+
98
+ The grammar is structurally close to the textlint AST. Every block-level `TxtNode` type has a direct counterpart here; inline `TxtNode` types (`Str`, `Emphasis`, `Strong`, `Link`, `Image`, `Code`, `Html`, `Delete`, `FootnoteReference`) also have direct counterparts as children of the `inline` wrapper. Names stay snake_case per the tree-sitter convention; consumers map names themselves. See [docs/textlint-mapping.md](docs/textlint-mapping.md) for the full table.
99
+
100
+ ## Installation
101
+
102
+ ### npm
103
+
104
+ ```sh
105
+ npm install tree-sitter-markdown-text
106
+ ```
107
+
108
+ ### Cargo
109
+
110
+ ```sh
111
+ cargo add tree-sitter-markdown-text
112
+ ```
113
+
114
+ ### PyPI
115
+
116
+ ```sh
117
+ pip install tree-sitter-markdown-text
118
+ ```
119
+
120
+ ### Go
121
+
122
+ ```go
123
+ import tree_sitter_markdown_text "github.com/ophidiarium/tree-sitter-markdown-text/bindings/go"
124
+ ```
125
+
126
+ The root package also exports the bundled queries via `go:embed`:
127
+
128
+ ```go
129
+ import markdown "github.com/ophidiarium/tree-sitter-markdown-text"
130
+
131
+ lang := markdown.GetLanguage()
132
+ query, _ := markdown.GetHighlightsQuery()
133
+ ```
134
+
135
+ ## Usage
136
+
137
+ ### Node.js
138
+
139
+ ```javascript
140
+ import Parser from "tree-sitter";
141
+ import Markdown from "tree-sitter-markdown-text";
142
+
143
+ const parser = new Parser();
144
+ parser.setLanguage(Markdown);
145
+
146
+ const tree = parser.parse("# hello\n");
147
+ console.log(tree.rootNode.toString());
148
+ ```
149
+
150
+ ### Rust
151
+
152
+ ```rust
153
+ let mut parser = tree_sitter::Parser::new();
154
+ let language = tree_sitter_markdown_text::LANGUAGE;
155
+ parser.set_language(&language.into()).unwrap();
156
+
157
+ let tree = parser.parse("# hello\n", None).unwrap();
158
+ println!("{}", tree.root_node().to_sexp());
159
+ ```
160
+
161
+ ### Python
162
+
163
+ ```python
164
+ from tree_sitter import Language, Parser
165
+ import tree_sitter_markdown_text
166
+
167
+ parser = Parser(Language(tree_sitter_markdown_text.language()))
168
+ tree = parser.parse(b"# hello\n")
169
+ print(str(tree.root_node))
170
+ ```
171
+
172
+ ## Credits and references
173
+
174
+ - [tree-sitter-grammars/tree-sitter-markdown](https://github.com/tree-sitter-grammars/tree-sitter-markdown) &mdash; upstream grammar, specifically the `split_parser` branch's block grammar, which this grammar is derived from.
175
+ - [textlint TxtNode](https://github.com/textlint/textlint/blob/master/docs/txtnode.md) &mdash; the AST shape this grammar targets for compatibility.
176
+ - [CommonMark Spec](https://spec.commonmark.org/) &mdash; the block structure this grammar implements.
177
+ - [GitHub Flavored Markdown](https://github.github.com/gfm/) &mdash; for the pipe-table and task-list extensions.
178
+
179
+ ## License
180
+
181
+ [MIT](LICENSE)
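
The metadata headers at the top of this PKG-INFO use the RFC 822-style key/value layout of Python core metadata, so the standard library's `email` parser can read them. The following is a minimal sketch, assuming an abridged copy of those headers is held in a string; the `PKG_INFO` constant and the `read_metadata` helper are hypothetical names used only for illustration.

```python
from email.parser import Parser

# Abridged copy of the headers shown above; the README body is omitted.
PKG_INFO = """\
Metadata-Version: 2.4
Name: tree-sitter-markdown-text
Version: 0.2.0
Requires-Python: >=3.10
Provides-Extra: core
Requires-Dist: tree-sitter~=0.24; extra == "core"
"""


def read_metadata(text):
    """Return (name, version, requirements) parsed from PKG-INFO text."""
    msg = Parser().parsestr(text)
    requirements = []
    for raw in msg.get_all("Requires-Dist", []):
        # A requirement may carry an environment marker after ';'.
        spec, _, marker = raw.partition(";")
        requirements.append((spec.strip(), marker.strip()))
    return msg["Name"], msg["Version"], requirements


name, version, requirements = read_metadata(PKG_INFO)
print(name, version, requirements)
```

Because the `tree-sitter` requirement is gated on the `core` extra, a plain `pip install tree-sitter-markdown-text` does not pull it in, whereas `pip install "tree-sitter-markdown-text[core]"` does.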
@@ -0,0 +1,162 @@
1
+ # tree-sitter-markdown-text
2
+
3
+ Markdown grammar for [tree-sitter](https://github.com/tree-sitter/tree-sitter), shaped so that its AST lines up with the [textlint `TxtNode`](https://github.com/textlint/textlint/blob/master/docs/txtnode.md) model.
4
+
5
+ Parses `.md` (and `.markdown`, `.mdown`, `.mkd`, `.mkdn`) files into a concrete syntax tree covering the full CommonMark block structure plus common extensions (GFM pipe tables, task lists, GFM alerts, YAML/TOML front matter, Pandoc math and directive blocks, footnotes, MDX JSX). Inline content is surfaced as structured children of the `inline` wrapper: classified tokens (`word_token`, `numeric_token`, `identifier_like_token`, `path_like_token`) and punctuation-class nodes (`terminator`, `separator`, `bracket`, `operator_like`), plus inline structural nodes (`emphasis`, `strong`, `strikethrough`, `link`, `image`, `autolink`, `inline_code`, `html_inline`, `math_inline`, `mdx_jsx_inline`, `footnote_reference`).
6
+
7
+ ## Features
8
+
9
+ ### Block nodes
10
+
11
+ - **Document structure** &mdash; `document`, nested `section` wrappers around ATX headings, `paragraph`, `blank_line` (as a first-class node).
12
+ - **Headings** &mdash; ATX (`#`..`######`) and setext (`===`/`---`) with the heading level exposed as a `level` field on both `atx_heading` and `setext_heading`.
13
+ - **Code blocks** &mdash; indented code blocks and fenced code blocks (backtick and tilde), with `info_string`/`language` children for the GFM language tag.
14
+ - **Math blocks** &mdash; Pandoc/GitLab/KaTeX display math (`$$…$$`) as a dedicated `math_block` with `math_block_delimiter`/`math_block_content` children.
15
+ - **Lists** &mdash; unordered (`+`/`-`/`*`) and ordered (`1.`/`1)`) list markers. GFM task list items are promoted to `task_list_item` (distinct from `list_item`), with `task_list_marker_checked`/`task_list_marker_unchecked` markers.
16
+ - **Block quotes and callouts** &mdash; nested quotes and lazy continuations. A block quote whose first paragraph begins with `[!NOTE]` / `[!TIP]` / `[!IMPORTANT]` / `[!WARNING]` / `[!CAUTION]` (or any uppercase-only label) is surfaced as `callout` with a `callout_type` field.
17
+ - **Thematic breaks** &mdash; `---`, `***`, `___`.
18
+ - **HTML blocks** &mdash; all 7 CommonMark HTML block types; block-level HTML comments are aliased to `html_comment_block` for easy metric extraction.
19
+ - **MDX JSX blocks** &mdash; shallow `mdx_jsx_block` for lines that start with an MDX-style JSX element (`<Component ...>`, `<Component/>`, `</Component>`). Component-style mixed-case names disambiguate from all-caps HTML blocks such as `<DIV>`.
20
+ - **Pipe tables** &mdash; `pipe_table` with `pipe_table_header`, `pipe_table_delimiter_row`, `pipe_table_row`, `pipe_table_cell`, `pipe_table_align_left`/`pipe_table_align_right`.
21
+ - **Link reference definitions** &mdash; `link_reference_definition` with `link_label`/`link_destination`/`link_title` children.
22
+ - **Footnote definitions** &mdash; `footnote_definition` (`[^id]: …`) with a `footnote_label` child.
23
+ - **Directive blocks** &mdash; generic container directives (`:::name … :::`, per remark-directive / MyST / Pandoc fenced divs) as `directive_block` with `directive_block_delimiter`/`directive_name`/`directive_block_content` children.
24
+ - **Image blocks** &mdash; a paragraph consisting of a single block-level image (`![alt](dest)` on its own line) is surfaced as `image_block` with `link_label`/`link_destination` children.
25
+ - **Front matter** &mdash; YAML (`---` fenced) as `minus_metadata`, TOML (`+++` fenced) as `plus_metadata`.
26
+
27
+ ### Inline nodes (children of the `inline` wrapper)
28
+
29
+ - **Classified text tokens** &mdash; `text_span` wraps runs of classified tokens: `word_token` (Unicode alphabetic), `numeric_token` (integers, decimals, versions), `identifier_like_token` (camelCase / PascalCase / snake_case), `path_like_token` (paths with `/` separators or dotted identifiers).
30
+ - **Punctuation classes** &mdash; every punctuation lexeme is classified: `terminator` (`.`, `?`, `!`, `。`, `…`), `separator` (`,`, `;`, `:`), `bracket` (`(`, `)`, `[`, `]`, `{`, `}`, `<`, `>`), `operator_like` (`::`, `->`, `=>`, `=`, `+`, `-`, `*`, `/`, `|`, `&`, and other punctuation).
31
+ - **Emphasis / strong / strikethrough** &mdash; `emphasis` (`*…*` or `_…_`), `strong` (`**…**` or `__…__`), `strikethrough` (`~~…~~`), each with a `_delimiter`/`_content`/`_delimiter` sub-tree.
32
+ - **Code spans** &mdash; `inline_code` with matched backtick-run delimiters (1 or 2 backticks).
33
+ - **Links and images** &mdash; `link` (inline, full-reference, collapsed-reference, shortcut-reference forms) and `image` (`![alt](dest)` or `![alt][ref]`). Both expose `link_label`/`link_destination`/`link_title` children.
34
+ - **Autolinks** &mdash; `autolink` with `uri` or `email` children for `<https://…>` and `<user@example.com>`.
35
+ - **Raw HTML inline** &mdash; `html_inline` with `html_open_tag`/`html_close_tag`/`html_comment`/`html_cdata`/`html_declaration`/`html_processing_instruction` children.
36
+ - **MDX JSX inline** &mdash; shallow `mdx_jsx_inline` with `mdx_jsx_open_tag`/`mdx_jsx_close_tag`/`mdx_jsx_expression` children.
37
+ - **Inline math** &mdash; `math_inline` (`$…$`) with `math_inline_delimiter`/`math_inline_content` children. Disambiguated from `math_block` (`$$…$$`).
38
+ - **Footnote references** &mdash; `footnote_reference` (`[^id]` inside prose) with a `footnote_reference_label` child.
39
+
40
+ - **Injections query** &mdash; ships a `queries/injections.scm` that injects into fenced-code-block info strings, HTML blocks, and front matter.
41
+
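
The token and punctuation classes listed above can be approximated in a few lines of Python. This is only a sketch of the idea, not the grammar's actual lexer: the regular expressions, their precedence, and the `classify` helper are assumptions made for illustration, and the real rules in `src/parser.c` may differ on edge cases.

```python
import re

# Ordered (class, pattern) pairs; the first full match wins. The patterns are
# illustrative guesses based on the README examples, not the real lexer.
TOKEN_CLASSES = [
    ("numeric_token", re.compile(r"\d+(?:\.\d+)*")),
    ("path_like_token", re.compile(r"[\w.-]*/[\w./-]+|\w+(?:\.\w+)+")),
    (
        "identifier_like_token",
        re.compile(
            r"[a-z]+(?:[A-Z][a-z0-9]*)+"        # camelCase
            r"|(?:[A-Z][a-z0-9]+){2,}"          # PascalCase
            r"|[A-Za-z0-9]+(?:_[A-Za-z0-9]+)+"  # snake_case
        ),
    ),
    ("word_token", re.compile(r"[^\W\d_]+")),   # Unicode letters only
    ("terminator", re.compile(r"[.?!。…]")),
    ("separator", re.compile(r"[,;:]")),
    ("bracket", re.compile(r"[()\[\]{}<>]")),
]


def classify(lexeme):
    """Return the node-type name this sketch assigns to a single lexeme."""
    for name, pattern in TOKEN_CLASSES:
        if pattern.fullmatch(lexeme):
            return name
    # Everything else falls into the catch-all punctuation class.
    return "operator_like"


for lexeme in ["hello", "1.2.3", "camelCase", "src/parser.c", ".", "::"]:
    print(f"{lexeme!r:>16} -> {classify(lexeme)}")
```

Checking the numeric pattern before the path pattern keeps a version string such as `1.2.3` from being read as a dotted identifier; whether the grammar resolves that overlap the same way is not stated in the README.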
42
+ ## Example
43
+
44
+ ````markdown
45
+ # Heading
46
+
47
+ A paragraph with inline content.
48
+
49
+ - one
50
+ - two
51
+
52
+ ```go
53
+ func main() {}
54
+ ```
55
+ ````
56
+
57
+ Parsed tree (abbreviated):
58
+
59
+ ```
60
+ (document
61
+   (section
62
+     (atx_heading level: (atx_h1_marker) heading_content: (inline))
63
+     (blank_line)
64
+     (paragraph (inline))
65
+     (blank_line)
66
+     (list
67
+       (list_item (list_marker_minus) (paragraph (inline)))
68
+       (list_item (list_marker_minus) (paragraph (inline))))
69
+     (blank_line)
70
+     (fenced_code_block
71
+       (fenced_code_block_delimiter)
72
+       (info_string (language))
73
+       (code_fence_content)
74
+       (fenced_code_block_delimiter))))
75
+ ```
76
+
77
+ ## Relationship to textlint
78
+
79
+ The grammar is structurally close to the textlint AST. Every block-level `TxtNode` type has a direct counterpart here; inline `TxtNode` types (`Str`, `Emphasis`, `Strong`, `Link`, `Image`, `Code`, `Html`, `Delete`, `FootnoteReference`) also have direct counterparts as children of the `inline` wrapper. Names stay snake_case per the tree-sitter convention; consumers map names themselves. See [docs/textlint-mapping.md](docs/textlint-mapping.md) for the full table.
80
+
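
Since consumers map names themselves, a lookup table is one straightforward way to do it. The pairs below are inferred from the two lists in the paragraph above and are assumptions, not a copy of `docs/textlint-mapping.md`, which is not part of this archive; the `text_span` to `Str` pairing in particular is a guess, and `to_txtnode_type` is a hypothetical helper name.

```python
# Hypothetical mapping from this grammar's snake_case inline node names to
# textlint TxtNode type names. Inferred from the README prose; verify it
# against docs/textlint-mapping.md before relying on it.
INLINE_TO_TXTNODE = {
    "text_span": "Str",  # assumption: the README does not state this pairing
    "emphasis": "Emphasis",
    "strong": "Strong",
    "link": "Link",
    "image": "Image",
    "inline_code": "Code",
    "html_inline": "Html",
    "strikethrough": "Delete",
    "footnote_reference": "FootnoteReference",
}


def to_txtnode_type(node_type, default=None):
    """Translate a tree-sitter node type, or return `default` if unmapped."""
    return INLINE_TO_TXTNODE.get(node_type, default)


print(to_txtnode_type("strikethrough"))
print(to_txtnode_type("math_inline", default="(no textlint counterpart listed)"))
```

Returning a default rather than raising keeps extension nodes such as `math_inline`, which the paragraph does not pair with a `TxtNode` type, from aborting a tree walk.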
81
+ ## Installation
82
+
83
+ ### npm
84
+
85
+ ```sh
86
+ npm install tree-sitter-markdown-text
87
+ ```
88
+
89
+ ### Cargo
90
+
91
+ ```sh
92
+ cargo add tree-sitter-markdown-text
93
+ ```
94
+
95
+ ### PyPI
96
+
97
+ ```sh
98
+ pip install tree-sitter-markdown-text
99
+ ```
100
+
101
+ ### Go
102
+
103
+ ```go
104
+ import tree_sitter_markdown_text "github.com/ophidiarium/tree-sitter-markdown-text/bindings/go"
105
+ ```
106
+
107
+ The root package also exports the bundled queries via `go:embed`:
108
+
109
+ ```go
110
+ import markdown "github.com/ophidiarium/tree-sitter-markdown-text"
111
+
112
+ lang := markdown.GetLanguage()
113
+ query, _ := markdown.GetHighlightsQuery()
114
+ ```
115
+
116
+ ## Usage
117
+
118
+ ### Node.js
119
+
120
+ ```javascript
121
+ import Parser from "tree-sitter";
122
+ import Markdown from "tree-sitter-markdown-text";
123
+
124
+ const parser = new Parser();
125
+ parser.setLanguage(Markdown);
126
+
127
+ const tree = parser.parse("# hello\n");
128
+ console.log(tree.rootNode.toString());
129
+ ```
130
+
131
+ ### Rust
132
+
133
+ ```rust
134
+ let mut parser = tree_sitter::Parser::new();
135
+ let language = tree_sitter_markdown_text::LANGUAGE;
136
+ parser.set_language(&language.into()).unwrap();
137
+
138
+ let tree = parser.parse("# hello\n", None).unwrap();
139
+ println!("{}", tree.root_node().to_sexp());
140
+ ```
141
+
142
+ ### Python
143
+
144
+ ```python
145
+ from tree_sitter import Language, Parser
146
+ import tree_sitter_markdown_text
147
+
148
+ parser = Parser(Language(tree_sitter_markdown_text.language()))
149
+ tree = parser.parse(b"# hello\n")
150
+ print(str(tree.root_node))
151
+ ```
152
+
153
+ ## Credits and references
154
+
155
+ - [tree-sitter-grammars/tree-sitter-markdown](https://github.com/tree-sitter-grammars/tree-sitter-markdown) &mdash; upstream grammar, specifically the `split_parser` branch's block grammar, which this grammar is derived from.
156
+ - [textlint TxtNode](https://github.com/textlint/textlint/blob/master/docs/txtnode.md) &mdash; the AST shape this grammar targets for compatibility.
157
+ - [CommonMark Spec](https://spec.commonmark.org/) &mdash; the block structure this grammar implements.
158
+ - [GitHub Flavored Markdown](https://github.github.com/gfm/) &mdash; for the pipe-table and task-list extensions.
159
+
160
+ ## License
161
+
162
+ [MIT](LICENSE)
@@ -0,0 +1,34 @@
+ """"""
+
+ from importlib.resources import files as _files
+
+ from ._binding import language
+
+
+ def _get_query(name, file):
+     # Read the bundled query file and cache it as a module global,
+     # so later lookups of the same name bypass __getattr__.
+     query = _files(f"{__package__}.queries") / file
+     globals()[name] = query.read_text()
+     return globals()[name]
+
+
+ def __getattr__(name):
+     # Module-level __getattr__ (PEP 562): load queries lazily on first access.
+     if name == "HIGHLIGHTS_QUERY":
+         return _get_query("HIGHLIGHTS_QUERY", "highlights.scm")
+     if name == "INJECTIONS_QUERY":
+         return _get_query("INJECTIONS_QUERY", "injections.scm")
+
+     raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
+
+
+ __all__ = [
+     "language",
+     "HIGHLIGHTS_QUERY",
+     "INJECTIONS_QUERY",
+ ]
+
+
+ def __dir__():
+     return sorted(__all__ + [
+         "__all__", "__builtins__", "__cached__", "__doc__", "__file__",
+         "__loader__", "__name__", "__package__", "__path__", "__spec__",
+     ])
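
The module above defers reading the `.scm` query files until an attribute such as `HIGHLIGHTS_QUERY` is first accessed, using a module-level `__getattr__` (PEP 562) and caching the result on the module. The same pattern can be shown in isolation. This is a sketch under assumptions: the in-memory `demo_queries` module, the `_load` function, and the `LOAD_COUNT` counter are hypothetical stand-ins so that the example runs without the package installed.

```python
import types

# Build a throwaway module in memory to host a module-level __getattr__.
demo = types.ModuleType("demo_queries")
demo.LOAD_COUNT = 0


def _load(file_name):
    """Stand-in for reading a query file from package resources."""
    demo.LOAD_COUNT += 1
    return f"; contents of {file_name}"


def _module_getattr(name):
    # Python calls this only when normal attribute lookup on the module fails.
    if name == "HIGHLIGHTS_QUERY":
        value = _load("highlights.scm")
        setattr(demo, name, value)  # cache, so later lookups skip this hook
        return value
    raise AttributeError(f"module {demo.__name__!r} has no attribute {name!r}")


# PEP 562 looks the hook up in the module's own namespace.
demo.__getattr__ = _module_getattr

first = demo.HIGHLIGHTS_QUERY   # triggers the hook and loads once
second = demo.HIGHLIGHTS_QUERY  # served from the cached attribute
print(first, demo.LOAD_COUNT)
```

A name the hook does not handle still raises `AttributeError`, so `hasattr` checks on such a module behave as usual.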
@@ -0,0 +1,17 @@
+ from typing import Final
+ from typing_extensions import CapsuleType
+
+ HIGHLIGHTS_QUERY: Final[str] | None
+ """The syntax highlighting query for this grammar."""
+
+ INJECTIONS_QUERY: Final[str] | None
+ """The language injection query for this grammar."""
+
+ LOCALS_QUERY: Final[str] | None
+ """The local variable query for this grammar."""
+
+ TAGS_QUERY: Final[str] | None
+ """The symbol tagging query for this grammar."""
+
+ def language() -> CapsuleType:
+     """The tree-sitter language function for this grammar."""
@@ -0,0 +1,35 @@
+ #include <Python.h>
+
+ typedef struct TSLanguage TSLanguage;
+
+ TSLanguage *tree_sitter_markdown(void);
+
+ static PyObject* _binding_language(PyObject *Py_UNUSED(self), PyObject *Py_UNUSED(args)) {
+     return PyCapsule_New(tree_sitter_markdown(), "tree_sitter.Language", NULL);
+ }
+
+ static struct PyModuleDef_Slot slots[] = {
+ #ifdef Py_GIL_DISABLED
+     {Py_mod_gil, Py_MOD_GIL_NOT_USED},
+ #endif
+     {0, NULL}
+ };
+
+ static PyMethodDef methods[] = {
+     {"language", _binding_language, METH_NOARGS,
+      "Get the tree-sitter language for this grammar."},
+     {NULL, NULL, 0, NULL}
+ };
+
+ static struct PyModuleDef module = {
+     .m_base = PyModuleDef_HEAD_INIT,
+     .m_name = "_binding",
+     .m_doc = NULL,
+     .m_size = 0,
+     .m_methods = methods,
+     .m_slots = slots,
+ };
+
+ PyMODINIT_FUNC PyInit__binding(void) {
+     return PyModuleDef_Init(&module);
+ }
@@ -0,0 +1,15 @@
1
+ LICENSE
2
+ README.md
3
+ pyproject.toml
4
+ setup.py
5
+ bindings/python/tree_sitter_markdown_text/__init__.py
6
+ bindings/python/tree_sitter_markdown_text/__init__.pyi
7
+ bindings/python/tree_sitter_markdown_text/binding.c
8
+ bindings/python/tree_sitter_markdown_text/py.typed
9
+ bindings/python/tree_sitter_markdown_text.egg-info/PKG-INFO
10
+ bindings/python/tree_sitter_markdown_text.egg-info/SOURCES.txt
11
+ bindings/python/tree_sitter_markdown_text.egg-info/dependency_links.txt
12
+ bindings/python/tree_sitter_markdown_text.egg-info/not-zip-safe
13
+ bindings/python/tree_sitter_markdown_text.egg-info/requires.txt
14
+ bindings/python/tree_sitter_markdown_text.egg-info/top_level.txt
15
+ src/parser.c