markdown-analysis 0.0.5__tar.gz → 0.1.0__tar.gz
- markdown_analysis-0.1.0/PKG-INFO +204 -0
- markdown_analysis-0.1.0/README.md +186 -0
- markdown_analysis-0.1.0/markdown_analysis.egg-info/PKG-INFO +204 -0
- markdown_analysis-0.1.0/mrkdwn_analysis/markdown_analyzer.py +545 -0
- {markdown_analysis-0.0.5 → markdown_analysis-0.1.0}/setup.py +1 -1
- markdown_analysis-0.0.5/PKG-INFO +0 -137
- markdown_analysis-0.0.5/README.md +0 -122
- markdown_analysis-0.0.5/markdown_analysis.egg-info/PKG-INFO +0 -137
- markdown_analysis-0.0.5/mrkdwn_analysis/markdown_analyzer.py +0 -274
- {markdown_analysis-0.0.5 → markdown_analysis-0.1.0}/LICENSE +0 -0
- {markdown_analysis-0.0.5 → markdown_analysis-0.1.0}/markdown_analysis.egg-info/SOURCES.txt +0 -0
- {markdown_analysis-0.0.5 → markdown_analysis-0.1.0}/markdown_analysis.egg-info/dependency_links.txt +0 -0
- {markdown_analysis-0.0.5 → markdown_analysis-0.1.0}/markdown_analysis.egg-info/requires.txt +0 -0
- {markdown_analysis-0.0.5 → markdown_analysis-0.1.0}/markdown_analysis.egg-info/top_level.txt +0 -0
- {markdown_analysis-0.0.5 → markdown_analysis-0.1.0}/mrkdwn_analysis/__init__.py +0 -0
- {markdown_analysis-0.0.5 → markdown_analysis-0.1.0}/setup.cfg +0 -0
- {markdown_analysis-0.0.5 → markdown_analysis-0.1.0}/test/__init__.py +0 -0
@@ -0,0 +1,204 @@
+Metadata-Version: 2.1
+Name: markdown_analysis
+Version: 0.1.0
+Summary: UNKNOWN
+Home-page: https://github.com/yannbanas/mrkdwn_analysis
+Author: yannbanas
+Author-email: yannbanas@gmail.com
+License: UNKNOWN
+Platform: UNKNOWN
+Classifier: Development Status :: 2 - Pre-Alpha
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3.11
+Description-Content-Type: text/markdown
+License-File: LICENSE
+
+# mrkdwn_analysis
+
+`mrkdwn_analysis` is a powerful Python library designed to analyze Markdown files. It provides extensive parsing capabilities to extract and categorize various elements within a Markdown document, including headers, sections, links, images, blockquotes, code blocks, lists, tables, tasks (todos), footnotes, and even embedded HTML. This makes it a versatile tool for data analysis, content generation, or building other tools that work with Markdown.
+
+## Features
+
+- **File Loading**: Load any given Markdown file by providing its file path.
+
+- **Header Detection**: Identify all headers (ATX `#` to `######`, and Setext `===` and `---`) in the document, giving you a quick overview of its structure.
+
+- **Section Identification (Setext)**: Recognize sections defined by a block of text followed by `=` or `-` lines, helping you understand the document’s conceptual divisions.
+
+- **Paragraph Extraction**: Distinguish regular text (paragraphs) from structured elements like headers, lists, or code blocks, making it easy to isolate the body content.
+
+- **Blockquote Identification**: Extract all blockquotes defined by lines starting with `>`.
+
+- **Code Block Extraction**: Detect fenced code blocks delimited by triple backticks (```), optionally retrieve their language, and separate programming code from regular text.
+
+- **List Recognition**: Identify both ordered and unordered lists, including task lists (`- [ ]`, `- [x]`), and understand their structure and hierarchy.
+
+- **Tables (GFM)**: Detect GitHub-Flavored Markdown tables, parse their headers and rows, and separate structured tabular data for further analysis.
+
+- **Links and Images**: Identify text links (`[text](url)`) and images (`![alt](url)`), as well as reference-style links. This is useful for link validation or content analysis.
+
+- **Footnotes**: Extract and handle Markdown footnotes (`[^note1]`), providing a way to process reference notes in the document.
+
+- **HTML Blocks and Inline HTML**: Handle HTML blocks (`<div>...</div>`) as a single element, and detect inline HTML elements (`<span style="...">... </span>`) as a unified component.
+
+- **Front Matter**: If present, extract YAML front matter at the start of the file.
+
+- **Counting Elements**: Count how many occurrences of a certain element type (e.g., how many headers, code blocks, etc.).
+
+- **Textual Statistics**: Count the number of words and characters (excluding whitespace). Get a global summary (`analyse()`) of the document’s composition.
+
+## Installation
+
+Install `mrkdwn_analysis` from PyPI:
+
+```bash
+pip install markdown-analysis
+```
+
+## Usage
+
+Using `mrkdwn_analysis` is straightforward. Import `MarkdownAnalyzer`, create an instance with your Markdown file path, and then call the various methods to extract the elements you need.
+
+```python
+from mrkdwn_analysis import MarkdownAnalyzer
+
+analyzer = MarkdownAnalyzer("path/to/document.md")
+
+headers = analyzer.identify_headers()
+paragraphs = analyzer.identify_paragraphs()
+links = analyzer.identify_links()
+...
+```
+
+### Example
+
+Consider `example.md`:
+
+```markdown
+---
+title: "Python 3.11 Report"
+author: "John Doe"
+date: "2024-01-15"
+---
+
+Python 3.11
+===========
+
+A major **Python** release with significant improvements...
+
+### Performance Details
+
+```python
+import math
+print(math.factorial(10))
+```
+
+> *Quote*: "Python 3.11 brings the speed we needed"
+
+<div class="note">
+<p>HTML block example</p>
+</div>
+
+This paragraph contains inline HTML: <span style="color:red;">Red text</span>.
+
+- Unordered list:
+- A basic point
+- [ ] A task to do
+- [x] A completed task
+
+1. Ordered list item 1
+2. Ordered list item 2
+```
+
+After analysis:
+
+```python
+analyzer = MarkdownAnalyzer("example.md")
+
+print(analyzer.identify_headers())
+# {"Header": [{"line": X, "level": 1, "text": "Python 3.11"}, {"line": Y, "level": 3, "text": "Performance Details"}]}
+
+print(analyzer.identify_paragraphs())
+# {"Paragraph": ["A major **Python** release ...", "This paragraph contains inline HTML: ..."]}
+
+print(analyzer.identify_html_blocks())
+# [{"line": Z, "content": "<div class=\"note\">\n <p>HTML block example</p>\n</div>"}]
+
+print(analyzer.identify_html_inline())
+# [{"line": W, "html": "<span style=\"color:red;\">Red text</span>"}]
+
+print(analyzer.identify_lists())
+# {
+# "Ordered list": [["Ordered list item 1", "Ordered list item 2"]],
+# "Unordered list": [["A basic point", "A task to do [Task]", "A completed task [Task done]"]]
+# }
+
+print(analyzer.identify_code_blocks())
+# {"Code block": [{"start_line": X, "content": "import math\nprint(math.factorial(10))", "language": "python"}]}
+
+print(analyzer.analyse())
+# {
+# 'headers': 2,
+# 'paragraphs': 2,
+# 'blockquotes': 1,
+# 'code_blocks': 1,
+# 'ordered_lists': 2,
+# 'unordered_lists': 3,
+# 'tables': 0,
+# 'html_blocks': 1,
+# 'html_inline_count': 1,
+# 'words': 42,
+# 'characters': 250
+# }
+```
+
+### Key Methods
+
+- `__init__(self, file_path)`: Load the Markdown file.
+- `identify_headers()`: Returns all headers.
+- `identify_sections()`: Returns setext sections.
+- `identify_paragraphs()`: Returns paragraphs.
+- `identify_blockquotes()`: Returns blockquotes.
+- `identify_code_blocks()`: Returns code blocks with content and language.
+- `identify_lists()`: Returns both ordered and unordered lists (including tasks).
+- `identify_tables()`: Returns any GFM tables.
+- `identify_links()`: Returns text and image links.
+- `identify_footnotes()`: Returns footnotes used in the document.
+- `identify_html_blocks()`: Returns HTML blocks as single tokens.
+- `identify_html_inline()`: Returns inline HTML elements.
+- `identify_todos()`: Returns task items.
+- `count_elements(element_type)`: Counts occurrences of a specific element type.
+- `count_words()`: Counts words in the entire document.
+- `count_characters()`: Counts non-whitespace characters.
+- `analyse()`: Provides a global summary (headers count, paragraphs count, etc.).
+
+### Checking and Validating Links
+
+- `check_links()`: Validates text links to see if they are broken (e.g., non-200 status) and returns a list of broken links.
+
+### Global Analysis Example
+
+```python
+analysis = analyzer.analyse()
+print(analysis)
+# {
+# 'headers': X,
+# 'paragraphs': Y,
+# 'blockquotes': Z,
+# 'code_blocks': A,
+# 'ordered_lists': B,
+# 'unordered_lists': C,
+# 'tables': D,
+# 'html_blocks': E,
+# 'html_inline_count': F,
+# 'words': G,
+# 'characters': H
+# }
+```
+
+## Contributing
+
+Contributions are welcome! Feel free to open an issue or submit a pull request for bug reports, feature requests, or code improvements. Your input helps make `mrkdwn_analysis` more robust and versatile.
+
+
@@ -0,0 +1,186 @@
+# mrkdwn_analysis
+
+`mrkdwn_analysis` is a powerful Python library designed to analyze Markdown files. It provides extensive parsing capabilities to extract and categorize various elements within a Markdown document, including headers, sections, links, images, blockquotes, code blocks, lists, tables, tasks (todos), footnotes, and even embedded HTML. This makes it a versatile tool for data analysis, content generation, or building other tools that work with Markdown.
+
+## Features
+
+- **File Loading**: Load any given Markdown file by providing its file path.
+
+- **Header Detection**: Identify all headers (ATX `#` to `######`, and Setext `===` and `---`) in the document, giving you a quick overview of its structure.
+
+- **Section Identification (Setext)**: Recognize sections defined by a block of text followed by `=` or `-` lines, helping you understand the document’s conceptual divisions.
+
+- **Paragraph Extraction**: Distinguish regular text (paragraphs) from structured elements like headers, lists, or code blocks, making it easy to isolate the body content.
+
+- **Blockquote Identification**: Extract all blockquotes defined by lines starting with `>`.
+
+- **Code Block Extraction**: Detect fenced code blocks delimited by triple backticks (```), optionally retrieve their language, and separate programming code from regular text.
+
+- **List Recognition**: Identify both ordered and unordered lists, including task lists (`- [ ]`, `- [x]`), and understand their structure and hierarchy.
+
+- **Tables (GFM)**: Detect GitHub-Flavored Markdown tables, parse their headers and rows, and separate structured tabular data for further analysis.
+
+- **Links and Images**: Identify text links (`[text](url)`) and images (`![alt](url)`), as well as reference-style links. This is useful for link validation or content analysis.
+
+- **Footnotes**: Extract and handle Markdown footnotes (`[^note1]`), providing a way to process reference notes in the document.
+
+- **HTML Blocks and Inline HTML**: Handle HTML blocks (`<div>...</div>`) as a single element, and detect inline HTML elements (`<span style="...">... </span>`) as a unified component.
+
+- **Front Matter**: If present, extract YAML front matter at the start of the file.
+
+- **Counting Elements**: Count how many occurrences of a certain element type (e.g., how many headers, code blocks, etc.).
+
+- **Textual Statistics**: Count the number of words and characters (excluding whitespace). Get a global summary (`analyse()`) of the document’s composition.
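
Reviewer note: the header detection described in the feature list (ATX `#` to `######`, plus Setext underlines) can be illustrated with a small standalone sketch. This is not the code from `markdown_analyzer.py`, which is not shown in this part of the diff; `find_headers` and its return shape are hypothetical names chosen to resemble the README output. The sketch assumes that YAML front matter and fenced code blocks should be skipped so their `---` and `#` lines are not mistaken for headers.

```python
import re

FENCE = "`" * 3  # fenced code block marker (hypothetical helper constant)
ATX_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")   # "# Title" up to "###### Title"
SETEXT_RE = re.compile(r"^(=+|-+)$")                  # underline made only of "=" or "-"


def find_headers(text):
    """Return [{"line", "level", "text"}, ...] for ATX and Setext headers."""
    lines = text.splitlines()

    # Assumption: skip YAML front matter so its closing "---" is not
    # read as a Setext underline for the last metadata line.
    start = 0
    if lines and lines[0].strip() == "---":
        for j in range(1, len(lines)):
            if lines[j].strip() == "---":
                start = j + 1
                break

    headers = []
    in_fence = False
    for i in range(start, len(lines)):
        stripped = lines[i].strip()
        if stripped.startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a fenced code block
            continue
        if in_fence:
            continue
        atx = ATX_RE.match(stripped)
        if atx:
            headers.append({"line": i + 1, "level": len(atx.group(1)), "text": atx.group(2)})
            continue
        # Setext: a plain text line followed by a line of "=" (level 1) or "-" (level 2).
        is_plain_text = bool(stripped) and stripped[0] not in "-=>"
        if is_plain_text and i + 1 < len(lines):
            underline = SETEXT_RE.match(lines[i + 1].strip())
            if underline:
                level = 1 if underline.group(1).startswith("=") else 2
                headers.append({"line": i + 1, "level": level, "text": stripped})
    return headers
```

Line numbers here are 1-based and point at the header text line; whether the library reports them the same way is not stated in the README.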
+
+## Installation
+
+Install `mrkdwn_analysis` from PyPI:
+
+```bash
+pip install markdown-analysis
+```
+
+## Usage
+
+Using `mrkdwn_analysis` is straightforward. Import `MarkdownAnalyzer`, create an instance with your Markdown file path, and then call the various methods to extract the elements you need.
+
+```python
+from mrkdwn_analysis import MarkdownAnalyzer
+
+analyzer = MarkdownAnalyzer("path/to/document.md")
+
+headers = analyzer.identify_headers()
+paragraphs = analyzer.identify_paragraphs()
+links = analyzer.identify_links()
+...
+```
+
+### Example
+
+Consider `example.md`:
+
+```markdown
+---
+title: "Python 3.11 Report"
+author: "John Doe"
+date: "2024-01-15"
+---
+
+Python 3.11
+===========
+
+A major **Python** release with significant improvements...
+
+### Performance Details
+
+```python
+import math
+print(math.factorial(10))
+```
+
+> *Quote*: "Python 3.11 brings the speed we needed"
+
+<div class="note">
+<p>HTML block example</p>
+</div>
+
+This paragraph contains inline HTML: <span style="color:red;">Red text</span>.
+
+- Unordered list:
+- A basic point
+- [ ] A task to do
+- [x] A completed task
+
+1. Ordered list item 1
+2. Ordered list item 2
+```
+
+After analysis:
+
+```python
+analyzer = MarkdownAnalyzer("example.md")
+
+print(analyzer.identify_headers())
+# {"Header": [{"line": X, "level": 1, "text": "Python 3.11"}, {"line": Y, "level": 3, "text": "Performance Details"}]}
+
+print(analyzer.identify_paragraphs())
+# {"Paragraph": ["A major **Python** release ...", "This paragraph contains inline HTML: ..."]}
+
+print(analyzer.identify_html_blocks())
+# [{"line": Z, "content": "<div class=\"note\">\n <p>HTML block example</p>\n</div>"}]
+
+print(analyzer.identify_html_inline())
+# [{"line": W, "html": "<span style=\"color:red;\">Red text</span>"}]
+
+print(analyzer.identify_lists())
+# {
+# "Ordered list": [["Ordered list item 1", "Ordered list item 2"]],
+# "Unordered list": [["A basic point", "A task to do [Task]", "A completed task [Task done]"]]
+# }
+
+print(analyzer.identify_code_blocks())
+# {"Code block": [{"start_line": X, "content": "import math\nprint(math.factorial(10))", "language": "python"}]}
+
+print(analyzer.analyse())
+# {
+# 'headers': 2,
+# 'paragraphs': 2,
+# 'blockquotes': 1,
+# 'code_blocks': 1,
+# 'ordered_lists': 2,
+# 'unordered_lists': 3,
+# 'tables': 0,
+# 'html_blocks': 1,
+# 'html_inline_count': 1,
+# 'words': 42,
+# 'characters': 250
+# }
+```
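
Reviewer note: the `identify_code_blocks()` output in the example above reports a start line, the block content, and the fence language. A minimal sketch of how such records could be collected is given below. `find_code_blocks` is a hypothetical name, the record keys mirror the README output, and treating `start_line` as the 1-based line of the opening fence is an assumption, since the README leaves it as `X`.

```python
FENCE = "`" * 3  # triple-backtick fence marker (hypothetical helper constant)


def find_code_blocks(text):
    """Collect fenced code blocks as {"start_line", "content", "language"} dicts."""
    blocks = []
    current = None  # the block being read, or None when outside any fence
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith(FENCE):
            if current is None:
                # Opening fence: whatever follows the backticks is taken as the language.
                language = stripped[len(FENCE):].strip() or None
                current = {"start_line": number, "language": language, "lines": []}
            else:
                # Closing fence: emit the finished block.
                blocks.append({
                    "start_line": current["start_line"],
                    "content": "\n".join(current["lines"]),
                    "language": current["language"],
                })
                current = None
        elif current is not None:
            current["lines"].append(line)
    # An unterminated fence is silently dropped in this sketch.
    return blocks
```

Note that a sketch like this cannot handle the README's own `example.md` listing when it is embedded in another triple-backtick block, because the inner fence closes the outer one; longer outer fences would be needed for that.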
+
+### Key Methods
+
+- `__init__(self, file_path)`: Load the Markdown file.
+- `identify_headers()`: Returns all headers.
+- `identify_sections()`: Returns setext sections.
+- `identify_paragraphs()`: Returns paragraphs.
+- `identify_blockquotes()`: Returns blockquotes.
+- `identify_code_blocks()`: Returns code blocks with content and language.
+- `identify_lists()`: Returns both ordered and unordered lists (including tasks).
+- `identify_tables()`: Returns any GFM tables.
+- `identify_links()`: Returns text and image links.
+- `identify_footnotes()`: Returns footnotes used in the document.
+- `identify_html_blocks()`: Returns HTML blocks as single tokens.
+- `identify_html_inline()`: Returns inline HTML elements.
+- `identify_todos()`: Returns task items.
+- `count_elements(element_type)`: Counts occurrences of a specific element type.
+- `count_words()`: Counts words in the entire document.
+- `count_characters()`: Counts non-whitespace characters.
+- `analyse()`: Provides a global summary (headers count, paragraphs count, etc.).
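
Reviewer note: the two textual statistics in the method list are described only as "words in the entire document" and "non-whitespace characters". One plausible reading is sketched below with standalone functions; these are not the library's methods, and the assumption that Markdown markup characters are counted along with the text is mine, not something the README states.

```python
def count_words(text):
    """Count whitespace-separated tokens (markup such as '**' counts as part of a token)."""
    return len(text.split())


def count_characters(text):
    """Count every character that is not whitespace."""
    return sum(1 for ch in text if not ch.isspace())
```

Under this reading, `"A major **Python** release"` has 4 words and 23 non-whitespace characters, because the four asterisks are included.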
+
+### Checking and Validating Links
+
+- `check_links()`: Validates text links to see if they are broken (e.g., non-200 status) and returns a list of broken links.
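
Reviewer note: the README describes `check_links()` only as flagging text links with a non-200 status. The sketch below shows one way that idea could work using the standard library; `find_broken_links`, `http_status`, and the record keys are hypothetical, and the choice of a `HEAD` request is an assumption (some servers reject `HEAD`, so a real checker might fall back to `GET`). The status lookup is injectable so the logic can be exercised without network access.

```python
import re
import urllib.error
import urllib.request

# Inline text links only: "[text](http...)" not preceded by "!" (which marks an image).
LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\((https?://[^)\s]+)\)")


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the request fails outright."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered with an error status such as 404
    except (OSError, ValueError):
        return None  # DNS failure, timeout, refused connection, malformed URL


def find_broken_links(text, fetch_status=http_status):
    """Return a record for every text link whose status is not 200."""
    broken = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in LINK_RE.finditer(line):
            status = fetch_status(match.group(2))
            if status != 200:
                broken.append({
                    "line": number,
                    "text": match.group(1),
                    "url": match.group(2),
                    "status": status,
                })
    return broken
```

Passing a stub such as `fetch_status=lambda url: 200` makes the function deterministic in tests; reference-style links are not covered by this pattern.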
+
+### Global Analysis Example
+
+```python
+analysis = analyzer.analyse()
+print(analysis)
+# {
+# 'headers': X,
+# 'paragraphs': Y,
+# 'blockquotes': Z,
+# 'code_blocks': A,
+# 'ordered_lists': B,
+# 'unordered_lists': C,
+# 'tables': D,
+# 'html_blocks': E,
+# 'html_inline_count': F,
+# 'words': G,
+# 'characters': H
+# }
+```
+
+## Contributing
+
+Contributions are welcome! Feel free to open an issue or submit a pull request for bug reports, feature requests, or code improvements. Your input helps make `mrkdwn_analysis` more robust and versatile.
@@ -0,0 +1,204 @@
+Metadata-Version: 2.1
+Name: markdown-analysis
+Version: 0.1.0
+Summary: UNKNOWN
+Home-page: https://github.com/yannbanas/mrkdwn_analysis
+Author: yannbanas
+Author-email: yannbanas@gmail.com
+License: UNKNOWN
+Platform: UNKNOWN
+Classifier: Development Status :: 2 - Pre-Alpha
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3.11
+Description-Content-Type: text/markdown
+License-File: LICENSE
+
+# mrkdwn_analysis
+
+`mrkdwn_analysis` is a powerful Python library designed to analyze Markdown files. It provides extensive parsing capabilities to extract and categorize various elements within a Markdown document, including headers, sections, links, images, blockquotes, code blocks, lists, tables, tasks (todos), footnotes, and even embedded HTML. This makes it a versatile tool for data analysis, content generation, or building other tools that work with Markdown.
+
+## Features
+
+- **File Loading**: Load any given Markdown file by providing its file path.
+
+- **Header Detection**: Identify all headers (ATX `#` to `######`, and Setext `===` and `---`) in the document, giving you a quick overview of its structure.
+
+- **Section Identification (Setext)**: Recognize sections defined by a block of text followed by `=` or `-` lines, helping you understand the document’s conceptual divisions.
+
+- **Paragraph Extraction**: Distinguish regular text (paragraphs) from structured elements like headers, lists, or code blocks, making it easy to isolate the body content.
+
+- **Blockquote Identification**: Extract all blockquotes defined by lines starting with `>`.
+
+- **Code Block Extraction**: Detect fenced code blocks delimited by triple backticks (```), optionally retrieve their language, and separate programming code from regular text.
+
+- **List Recognition**: Identify both ordered and unordered lists, including task lists (`- [ ]`, `- [x]`), and understand their structure and hierarchy.
+
+- **Tables (GFM)**: Detect GitHub-Flavored Markdown tables, parse their headers and rows, and separate structured tabular data for further analysis.
+
+- **Links and Images**: Identify text links (`[text](url)`) and images (`![alt](url)`), as well as reference-style links. This is useful for link validation or content analysis.
+
+- **Footnotes**: Extract and handle Markdown footnotes (`[^note1]`), providing a way to process reference notes in the document.
+
+- **HTML Blocks and Inline HTML**: Handle HTML blocks (`<div>...</div>`) as a single element, and detect inline HTML elements (`<span style="...">... </span>`) as a unified component.
+
+- **Front Matter**: If present, extract YAML front matter at the start of the file.
+
+- **Counting Elements**: Count how many occurrences of a certain element type (e.g., how many headers, code blocks, etc.).
+
+- **Textual Statistics**: Count the number of words and characters (excluding whitespace). Get a global summary (`analyse()`) of the document’s composition.
+
+## Installation
+
+Install `mrkdwn_analysis` from PyPI:
+
+```bash
+pip install markdown-analysis
+```
+
+## Usage
+
+Using `mrkdwn_analysis` is straightforward. Import `MarkdownAnalyzer`, create an instance with your Markdown file path, and then call the various methods to extract the elements you need.
+
+```python
+from mrkdwn_analysis import MarkdownAnalyzer
+
+analyzer = MarkdownAnalyzer("path/to/document.md")
+
+headers = analyzer.identify_headers()
+paragraphs = analyzer.identify_paragraphs()
+links = analyzer.identify_links()
+...
+```
+
+### Example
+
+Consider `example.md`:
+
+```markdown
+---
+title: "Python 3.11 Report"
+author: "John Doe"
+date: "2024-01-15"
+---
+
+Python 3.11
+===========
+
+A major **Python** release with significant improvements...
+
+### Performance Details
+
+```python
+import math
+print(math.factorial(10))
+```
+
+> *Quote*: "Python 3.11 brings the speed we needed"
+
+<div class="note">
+<p>HTML block example</p>
+</div>
+
+This paragraph contains inline HTML: <span style="color:red;">Red text</span>.
+
+- Unordered list:
+- A basic point
+- [ ] A task to do
+- [x] A completed task
+
+1. Ordered list item 1
+2. Ordered list item 2
+```
+
+After analysis:
+
+```python
+analyzer = MarkdownAnalyzer("example.md")
+
+print(analyzer.identify_headers())
+# {"Header": [{"line": X, "level": 1, "text": "Python 3.11"}, {"line": Y, "level": 3, "text": "Performance Details"}]}
+
+print(analyzer.identify_paragraphs())
+# {"Paragraph": ["A major **Python** release ...", "This paragraph contains inline HTML: ..."]}
+
+print(analyzer.identify_html_blocks())
+# [{"line": Z, "content": "<div class=\"note\">\n <p>HTML block example</p>\n</div>"}]
+
+print(analyzer.identify_html_inline())
+# [{"line": W, "html": "<span style=\"color:red;\">Red text</span>"}]
+
+print(analyzer.identify_lists())
+# {
+# "Ordered list": [["Ordered list item 1", "Ordered list item 2"]],
+# "Unordered list": [["A basic point", "A task to do [Task]", "A completed task [Task done]"]]
+# }
+
+print(analyzer.identify_code_blocks())
+# {"Code block": [{"start_line": X, "content": "import math\nprint(math.factorial(10))", "language": "python"}]}
+
+print(analyzer.analyse())
+# {
+# 'headers': 2,
+# 'paragraphs': 2,
+# 'blockquotes': 1,
+# 'code_blocks': 1,
+# 'ordered_lists': 2,
+# 'unordered_lists': 3,
+# 'tables': 0,
+# 'html_blocks': 1,
+# 'html_inline_count': 1,
+# 'words': 42,
+# 'characters': 250
+# }
+```
+
+### Key Methods
+
+- `__init__(self, file_path)`: Load the Markdown file.
+- `identify_headers()`: Returns all headers.
+- `identify_sections()`: Returns setext sections.
+- `identify_paragraphs()`: Returns paragraphs.
+- `identify_blockquotes()`: Returns blockquotes.
+- `identify_code_blocks()`: Returns code blocks with content and language.
+- `identify_lists()`: Returns both ordered and unordered lists (including tasks).
+- `identify_tables()`: Returns any GFM tables.
+- `identify_links()`: Returns text and image links.
+- `identify_footnotes()`: Returns footnotes used in the document.
+- `identify_html_blocks()`: Returns HTML blocks as single tokens.
+- `identify_html_inline()`: Returns inline HTML elements.
+- `identify_todos()`: Returns task items.
+- `count_elements(element_type)`: Counts occurrences of a specific element type.
+- `count_words()`: Counts words in the entire document.
+- `count_characters()`: Counts non-whitespace characters.
+- `analyse()`: Provides a global summary (headers count, paragraphs count, etc.).
+
+### Checking and Validating Links
+
+- `check_links()`: Validates text links to see if they are broken (e.g., non-200 status) and returns a list of broken links.
+
+### Global Analysis Example
+
+```python
+analysis = analyzer.analyse()
+print(analysis)
+# {
+# 'headers': X,
+# 'paragraphs': Y,
+# 'blockquotes': Z,
+# 'code_blocks': A,
+# 'ordered_lists': B,
+# 'unordered_lists': C,
+# 'tables': D,
+# 'html_blocks': E,
+# 'html_inline_count': F,
+# 'words': G,
+# 'characters': H
+# }
+```
+
+## Contributing
+
+Contributions are welcome! Feel free to open an issue or submit a pull request for bug reports, feature requests, or code improvements. Your input helps make `mrkdwn_analysis` more robust and versatile.
+
+