markdown-analysis 0.0.3__tar.gz → 0.0.4__tar.gz
This diff shows the changes between package versions as published to a supported public registry. It is provided for informational purposes only and reflects the packages as they appear in that registry.
- markdown_analysis-0.0.4/PKG-INFO +140 -0
- markdown_analysis-0.0.4/README.md +122 -0
- markdown_analysis-0.0.4/markdown_analysis.egg-info/PKG-INFO +140 -0
- {markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/markdown_analysis.egg-info/SOURCES.txt +1 -0
- markdown_analysis-0.0.4/markdown_analysis.egg-info/requires.txt +2 -0
- markdown_analysis-0.0.4/mrkdwn_analysis/markdown_analyzer.py +283 -0
- {markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/setup.py +3 -3
- markdown_analysis-0.0.3/PKG-INFO +0 -54
- markdown_analysis-0.0.3/README.md +0 -36
- markdown_analysis-0.0.3/markdown_analysis.egg-info/PKG-INFO +0 -54
- markdown_analysis-0.0.3/mrkdwn_analysis/markdown_analyzer.py +0 -34
- {markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/LICENSE +0 -0
- {markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/markdown_analysis.egg-info/dependency_links.txt +0 -0
- {markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/markdown_analysis.egg-info/top_level.txt +0 -0
- {markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/mrkdwn_analysis/__init__.py +0 -0
- {markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/setup.cfg +0 -0
- {markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/test/__init__.py +0 -0
--- /dev/null
+++ markdown_analysis-0.0.4/PKG-INFO
@@ -0,0 +1,140 @@
+Metadata-Version: 2.1
+Name: markdown_analysis
+Version: 0.0.4
+Summary: UNKNOWN
+Home-page: https://github.com/yannbanas/mrkdwn_analysis
+Author: yannbanas
+Author-email: yannbanas@gmail.com
+License: UNKNOWN
+Platform: UNKNOWN
+Classifier: Development Status :: 2 - Pre-Alpha
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3.11
+Description-Content-Type: text/markdown
+License-File: LICENSE
+
+# mrkdwn_analysis
+
+`mrkdwn_analysis` is a Python library designed to analyze Markdown files. With its powerful parsing capabilities, it can extract and categorize various elements within a Markdown document, including headers, sections, links, images, blockquotes, code blocks, and lists. This makes it a valuable tool for anyone looking to parse Markdown content for data analysis, content generation, or for building other tools that utilize Markdown.
+
+## Features
+
+- File Loading: The MarkdownAnalyzer can load any given Markdown file provided through the file path.
+
+- Header Identification: The tool can extract all headers from the markdown file, ranging from H1 to H6 tags. This allows users to have a quick overview of the document's structure.
+
+- Section Identification: The analyzer can recognize different sections of the document. It defines a section as a block of text followed by a line composed solely of = or - characters.
+
+- Paragraph Identification: The tool can distinguish between regular text and other elements such as lists, headers, etc., thereby identifying all the paragraphs present in the document.
+
+- Blockquote Identification: The analyzer can identify and extract all blockquotes in the markdown file.
+
+- Code Block Identification: The tool can extract all code blocks defined in the document, allowing you to separate the programming code from the regular text easily.
+
+- List Identification: The analyzer can identify both ordered and unordered lists in the markdown file, providing information about the hierarchical structure of the points.
+
+- Table Identification: The tool can identify and extract tables from the markdown file, enabling users to separate and analyze tabular data quickly.
+
+- Link Identification and Validation: The analyzer can identify all links present in the markdown file, categorizing them into text and image links. Moreover, it can also verify if these links are valid or broken.
+
+- Todo Identification: The tool is capable of recognizing and extracting todos (tasks or action items) present in the document.
+
+- Element Counting: The analyzer can count the total number of a specific element type in the file. This can help in quantifying the extent of different elements in the document.
+
+- Word Counting: The tool can count the total number of words in the file, providing an estimate of the document's length.
+
+- Character Counting: The analyzer can count the total number of characters (excluding spaces) in the file, giving a detailed measure of the document's size.
+
+## Installation
+You can install `mrkdwn_analysis` from PyPI:
+
+```bash
+pip install mrkdwn_analysis
+```
+
+We hope `mrkdwn_analysis` helps you with all your Markdown analyzing needs!
+
+## Usage
+Using `mrkdwn_analysis` is simple. Just import the `MarkdownAnalyzer` class, create an instance with your Markdown file, and you're good to go!
+
+```python
+from mrkdwn_analysis import MarkdownAnalyzer
+
+analyzer = MarkdownAnalyzer("path/to/your/markdown.md")
+
+headers = analyzer.identify_headers()
+sections = analyzer.identify_sections()
+...
+```
+
+### Class MarkdownAnalyzer
+
+The `MarkdownAnalyzer` class is designed to analyze Markdown files. It has the ability to extract and categorize various elements of a Markdown document.
+
+### `__init__(self, file_path)`
+
+The constructor of the class. It opens the specified Markdown file and stores its content line by line.
+
+- `file_path`: the path of the Markdown file to analyze.
+
+### `identify_headers(self)`
+
+Analyzes the file and identifies all headers (from h1 to h6). Headers are returned as a dictionary where the key is "Header" and the value is a list of all headers found.
+
+### `identify_sections(self)`
+
+Analyzes the file and identifies all sections. Sections are defined as a block of text followed by a line composed solely of `=` or `-` characters. Sections are returned as a dictionary where the key is "Section" and the value is a list of all sections found.
+
+### `identify_paragraphs(self)`
+
+Analyzes the file and identifies all paragraphs. Paragraphs are defined as a block of text that is not a header, list, blockquote, etc. Paragraphs are returned as a dictionary where the key is "Paragraph" and the value is a list of all paragraphs found.
+
+### `identify_blockquotes(self)`
+
+Analyzes the file and identifies all blockquotes. Blockquotes are defined by a line starting with the `>` character. Blockquotes are returned as a dictionary where the key is "Blockquote" and the value is a list of all blockquotes found.
+
+### `identify_code_blocks(self)`
+
+Analyzes the file and identifies all code blocks. Code blocks are defined by a block of text surrounded by lines containing only the "```" text. Code blocks are returned as a dictionary where the key is "Code block" and the value is a list of all code blocks found.
+
+### `identify_ordered_lists(self)`
+
+Analyzes the file and identifies all ordered lists. Ordered lists are defined by lines starting with a number followed by a dot. Ordered lists are returned as a dictionary where the key is "Ordered list" and the value is a list of all ordered lists found.
+
+### `identify_unordered_lists(self)`
+
+Analyzes the file and identifies all unordered lists. Unordered lists are defined by lines starting with a `-`, `*`, or `+`. Unordered lists are returned as a dictionary where the key is "Unordered list" and the value is a list of all unordered lists found.
+
+### `identify_tables(self)`
+
+Analyzes the file and identifies all tables. Tables are defined by lines containing `|` to delimit cells and are separated by lines containing `-` to define the borders. Tables are returned as a dictionary where the key is "Table" and the value is a list of all tables found.
+
+### `identify_links(self)`
+
+Analyzes the file and identifies all links. Links are defined by the format `[text](url)`. Links are returned as a dictionary where the keys are "Text link" and "Image link" and the values are lists of all links found.
+
+### `check_links(self)`
+
+Checks all links identified by `identify_links` to see if they are broken (return a 404 error). Broken links are returned as a list, each item being a dictionary containing the line number, link text, and URL.
+
+### `identify_todos(self)`
+
+Analyzes the file and identifies all todos. Todos are defined by lines starting with `- [ ] `. Todos are returned as a list, each item being a dictionary containing the line number and todo text.
+
+### `count_elements(self, element_type)`
+
+Counts the total number of a specific element type in the file. The `element_type` should match the name of one of the identification methods (for example, "headers" for `identify_headers`). Returns the total number of elements of this type.
+
+### `count_words(self)`
+
+Counts the total number of words in the file. Returns the word count.
+
+### `count_characters(self)`
+
+Counts the total number of characters (excluding spaces) in the file. Returns the character count.
+
+## Contributions
+Contributions are always welcome! If you have a feature request, bug report, or just want to improve the code, feel free to create a pull request or open an issue.
+
+
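The header rule this README describes (ATX headers `#` through `######`, returned as `{"Header": [...]}`) can be sketched in isolation. This is an illustration of the described behavior using the same kind of regex, not the package's actual code:

```python
import re

# One to six '#' characters followed by whitespace, per the README's h1-h6 rule.
HEADER_RE = re.compile(r'^(#{1,6})\s(.*)')

def identify_headers(lines):
    """Return headers in the {"Header": [...]} shape the README documents."""
    headers = []
    for line in lines:
        m = HEADER_RE.match(line)
        if m:
            headers.append(m.group(2).strip())
    return {"Header": headers}

doc = ["# Title", "Some text", "## Section", "####### not a header"]
print(identify_headers(doc))  # → {'Header': ['Title', 'Section']}
```

Note that seven or more `#` characters never match, because `#{1,6}` must be immediately followed by whitespace.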
--- /dev/null
+++ markdown_analysis-0.0.4/README.md
@@ -0,0 +1,122 @@
+# mrkdwn_analysis
+
+`mrkdwn_analysis` is a Python library designed to analyze Markdown files. With its powerful parsing capabilities, it can extract and categorize various elements within a Markdown document, including headers, sections, links, images, blockquotes, code blocks, and lists. This makes it a valuable tool for anyone looking to parse Markdown content for data analysis, content generation, or for building other tools that utilize Markdown.
+
+## Features
+
+- File Loading: The MarkdownAnalyzer can load any given Markdown file provided through the file path.
+
+- Header Identification: The tool can extract all headers from the markdown file, ranging from H1 to H6 tags. This allows users to have a quick overview of the document's structure.
+
+- Section Identification: The analyzer can recognize different sections of the document. It defines a section as a block of text followed by a line composed solely of = or - characters.
+
+- Paragraph Identification: The tool can distinguish between regular text and other elements such as lists, headers, etc., thereby identifying all the paragraphs present in the document.
+
+- Blockquote Identification: The analyzer can identify and extract all blockquotes in the markdown file.
+
+- Code Block Identification: The tool can extract all code blocks defined in the document, allowing you to separate the programming code from the regular text easily.
+
+- List Identification: The analyzer can identify both ordered and unordered lists in the markdown file, providing information about the hierarchical structure of the points.
+
+- Table Identification: The tool can identify and extract tables from the markdown file, enabling users to separate and analyze tabular data quickly.
+
+- Link Identification and Validation: The analyzer can identify all links present in the markdown file, categorizing them into text and image links. Moreover, it can also verify if these links are valid or broken.
+
+- Todo Identification: The tool is capable of recognizing and extracting todos (tasks or action items) present in the document.
+
+- Element Counting: The analyzer can count the total number of a specific element type in the file. This can help in quantifying the extent of different elements in the document.
+
+- Word Counting: The tool can count the total number of words in the file, providing an estimate of the document's length.
+
+- Character Counting: The analyzer can count the total number of characters (excluding spaces) in the file, giving a detailed measure of the document's size.
+
+## Installation
+You can install `mrkdwn_analysis` from PyPI:
+
+```bash
+pip install mrkdwn_analysis
+```
+
+We hope `mrkdwn_analysis` helps you with all your Markdown analyzing needs!
+
+## Usage
+Using `mrkdwn_analysis` is simple. Just import the `MarkdownAnalyzer` class, create an instance with your Markdown file, and you're good to go!
+
+```python
+from mrkdwn_analysis import MarkdownAnalyzer
+
+analyzer = MarkdownAnalyzer("path/to/your/markdown.md")
+
+headers = analyzer.identify_headers()
+sections = analyzer.identify_sections()
+...
+```
+
+### Class MarkdownAnalyzer
+
+The `MarkdownAnalyzer` class is designed to analyze Markdown files. It has the ability to extract and categorize various elements of a Markdown document.
+
+### `__init__(self, file_path)`
+
+The constructor of the class. It opens the specified Markdown file and stores its content line by line.
+
+- `file_path`: the path of the Markdown file to analyze.
+
+### `identify_headers(self)`
+
+Analyzes the file and identifies all headers (from h1 to h6). Headers are returned as a dictionary where the key is "Header" and the value is a list of all headers found.
+
+### `identify_sections(self)`
+
+Analyzes the file and identifies all sections. Sections are defined as a block of text followed by a line composed solely of `=` or `-` characters. Sections are returned as a dictionary where the key is "Section" and the value is a list of all sections found.
+
+### `identify_paragraphs(self)`
+
+Analyzes the file and identifies all paragraphs. Paragraphs are defined as a block of text that is not a header, list, blockquote, etc. Paragraphs are returned as a dictionary where the key is "Paragraph" and the value is a list of all paragraphs found.
+
+### `identify_blockquotes(self)`
+
+Analyzes the file and identifies all blockquotes. Blockquotes are defined by a line starting with the `>` character. Blockquotes are returned as a dictionary where the key is "Blockquote" and the value is a list of all blockquotes found.
+
+### `identify_code_blocks(self)`
+
+Analyzes the file and identifies all code blocks. Code blocks are defined by a block of text surrounded by lines containing only the "```" text. Code blocks are returned as a dictionary where the key is "Code block" and the value is a list of all code blocks found.
+
+### `identify_ordered_lists(self)`
+
+Analyzes the file and identifies all ordered lists. Ordered lists are defined by lines starting with a number followed by a dot. Ordered lists are returned as a dictionary where the key is "Ordered list" and the value is a list of all ordered lists found.
+
+### `identify_unordered_lists(self)`
+
+Analyzes the file and identifies all unordered lists. Unordered lists are defined by lines starting with a `-`, `*`, or `+`. Unordered lists are returned as a dictionary where the key is "Unordered list" and the value is a list of all unordered lists found.
+
+### `identify_tables(self)`
+
+Analyzes the file and identifies all tables. Tables are defined by lines containing `|` to delimit cells and are separated by lines containing `-` to define the borders. Tables are returned as a dictionary where the key is "Table" and the value is a list of all tables found.
+
+### `identify_links(self)`
+
+Analyzes the file and identifies all links. Links are defined by the format `[text](url)`. Links are returned as a dictionary where the keys are "Text link" and "Image link" and the values are lists of all links found.
+
+### `check_links(self)`
+
+Checks all links identified by `identify_links` to see if they are broken (return a 404 error). Broken links are returned as a list, each item being a dictionary containing the line number, link text, and URL.
+
+### `identify_todos(self)`
+
+Analyzes the file and identifies all todos. Todos are defined by lines starting with `- [ ] `. Todos are returned as a list, each item being a dictionary containing the line number and todo text.
+
+### `count_elements(self, element_type)`
+
+Counts the total number of a specific element type in the file. The `element_type` should match the name of one of the identification methods (for example, "headers" for `identify_headers`). Returns the total number of elements of this type.
+
+### `count_words(self)`
+
+Counts the total number of words in the file. Returns the word count.
+
+### `count_characters(self)`
+
+Counts the total number of characters (excluding spaces) in the file. Returns the character count.
+
+## Contributions
+Contributions are always welcome! If you have a feature request, bug report, or just want to improve the code, feel free to create a pull request or open an issue.
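The section rule described above (a block of text followed by a line of `=` or `-` characters, i.e. setext-style underlines) can be sketched as a standalone check. This is an approximation of the documented behavior under that definition, not the package's code:

```python
import re

# A line made entirely of two or more '=' or '-' characters.
UNDERLINE_RE = re.compile(r'^[=-]{2,}$')

def identify_sections(lines):
    """Return {"Section": [...]}: non-empty lines followed by an ===/--- underline."""
    sections = []
    for i in range(len(lines) - 1):
        if lines[i].strip() and UNDERLINE_RE.match(lines[i + 1].strip()):
            sections.append(lines[i].strip())
    return {"Section": sections}

doc = ["Intro", "=====", "body text", "Part Two", "----"]
print(identify_sections(doc))  # → {'Section': ['Intro', 'Part Two']}
```

One ambiguity in this definition: a `---` line is also a valid thematic break in Markdown, so a simple underline check can over-match; the sketch follows the README's stated rule rather than resolving that.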
--- /dev/null
+++ markdown_analysis-0.0.4/markdown_analysis.egg-info/PKG-INFO
@@ -0,0 +1,140 @@
+Metadata-Version: 2.1
+Name: markdown-analysis
+Version: 0.0.4
+Summary: UNKNOWN
+Home-page: https://github.com/yannbanas/mrkdwn_analysis
+Author: yannbanas
+Author-email: yannbanas@gmail.com
+License: UNKNOWN
+Platform: UNKNOWN
+Classifier: Development Status :: 2 - Pre-Alpha
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3.11
+Description-Content-Type: text/markdown
+License-File: LICENSE
+
+# mrkdwn_analysis
+
+`mrkdwn_analysis` is a Python library designed to analyze Markdown files. With its powerful parsing capabilities, it can extract and categorize various elements within a Markdown document, including headers, sections, links, images, blockquotes, code blocks, and lists. This makes it a valuable tool for anyone looking to parse Markdown content for data analysis, content generation, or for building other tools that utilize Markdown.
+
+## Features
+
+- File Loading: The MarkdownAnalyzer can load any given Markdown file provided through the file path.
+
+- Header Identification: The tool can extract all headers from the markdown file, ranging from H1 to H6 tags. This allows users to have a quick overview of the document's structure.
+
+- Section Identification: The analyzer can recognize different sections of the document. It defines a section as a block of text followed by a line composed solely of = or - characters.
+
+- Paragraph Identification: The tool can distinguish between regular text and other elements such as lists, headers, etc., thereby identifying all the paragraphs present in the document.
+
+- Blockquote Identification: The analyzer can identify and extract all blockquotes in the markdown file.
+
+- Code Block Identification: The tool can extract all code blocks defined in the document, allowing you to separate the programming code from the regular text easily.
+
+- List Identification: The analyzer can identify both ordered and unordered lists in the markdown file, providing information about the hierarchical structure of the points.
+
+- Table Identification: The tool can identify and extract tables from the markdown file, enabling users to separate and analyze tabular data quickly.
+
+- Link Identification and Validation: The analyzer can identify all links present in the markdown file, categorizing them into text and image links. Moreover, it can also verify if these links are valid or broken.
+
+- Todo Identification: The tool is capable of recognizing and extracting todos (tasks or action items) present in the document.
+
+- Element Counting: The analyzer can count the total number of a specific element type in the file. This can help in quantifying the extent of different elements in the document.
+
+- Word Counting: The tool can count the total number of words in the file, providing an estimate of the document's length.
+
+- Character Counting: The analyzer can count the total number of characters (excluding spaces) in the file, giving a detailed measure of the document's size.
+
+## Installation
+You can install `mrkdwn_analysis` from PyPI:
+
+```bash
+pip install mrkdwn_analysis
+```
+
+We hope `mrkdwn_analysis` helps you with all your Markdown analyzing needs!
+
+## Usage
+Using `mrkdwn_analysis` is simple. Just import the `MarkdownAnalyzer` class, create an instance with your Markdown file, and you're good to go!
+
+```python
+from mrkdwn_analysis import MarkdownAnalyzer
+
+analyzer = MarkdownAnalyzer("path/to/your/markdown.md")
+
+headers = analyzer.identify_headers()
+sections = analyzer.identify_sections()
+...
+```
+
+### Class MarkdownAnalyzer
+
+The `MarkdownAnalyzer` class is designed to analyze Markdown files. It has the ability to extract and categorize various elements of a Markdown document.
+
+### `__init__(self, file_path)`
+
+The constructor of the class. It opens the specified Markdown file and stores its content line by line.
+
+- `file_path`: the path of the Markdown file to analyze.
+
+### `identify_headers(self)`
+
+Analyzes the file and identifies all headers (from h1 to h6). Headers are returned as a dictionary where the key is "Header" and the value is a list of all headers found.
+
+### `identify_sections(self)`
+
+Analyzes the file and identifies all sections. Sections are defined as a block of text followed by a line composed solely of `=` or `-` characters. Sections are returned as a dictionary where the key is "Section" and the value is a list of all sections found.
+
+### `identify_paragraphs(self)`
+
+Analyzes the file and identifies all paragraphs. Paragraphs are defined as a block of text that is not a header, list, blockquote, etc. Paragraphs are returned as a dictionary where the key is "Paragraph" and the value is a list of all paragraphs found.
+
+### `identify_blockquotes(self)`
+
+Analyzes the file and identifies all blockquotes. Blockquotes are defined by a line starting with the `>` character. Blockquotes are returned as a dictionary where the key is "Blockquote" and the value is a list of all blockquotes found.
+
+### `identify_code_blocks(self)`
+
+Analyzes the file and identifies all code blocks. Code blocks are defined by a block of text surrounded by lines containing only the "```" text. Code blocks are returned as a dictionary where the key is "Code block" and the value is a list of all code blocks found.
+
+### `identify_ordered_lists(self)`
+
+Analyzes the file and identifies all ordered lists. Ordered lists are defined by lines starting with a number followed by a dot. Ordered lists are returned as a dictionary where the key is "Ordered list" and the value is a list of all ordered lists found.
+
+### `identify_unordered_lists(self)`
+
+Analyzes the file and identifies all unordered lists. Unordered lists are defined by lines starting with a `-`, `*`, or `+`. Unordered lists are returned as a dictionary where the key is "Unordered list" and the value is a list of all unordered lists found.
+
+### `identify_tables(self)`
+
+Analyzes the file and identifies all tables. Tables are defined by lines containing `|` to delimit cells and are separated by lines containing `-` to define the borders. Tables are returned as a dictionary where the key is "Table" and the value is a list of all tables found.
+
+### `identify_links(self)`
+
+Analyzes the file and identifies all links. Links are defined by the format `[text](url)`. Links are returned as a dictionary where the keys are "Text link" and "Image link" and the values are lists of all links found.
+
+### `check_links(self)`
+
+Checks all links identified by `identify_links` to see if they are broken (return a 404 error). Broken links are returned as a list, each item being a dictionary containing the line number, link text, and URL.
+
+### `identify_todos(self)`
+
+Analyzes the file and identifies all todos. Todos are defined by lines starting with `- [ ] `. Todos are returned as a list, each item being a dictionary containing the line number and todo text.
+
+### `count_elements(self, element_type)`
+
+Counts the total number of a specific element type in the file. The `element_type` should match the name of one of the identification methods (for example, "headers" for `identify_headers`). Returns the total number of elements of this type.
+
+### `count_words(self)`
+
+Counts the total number of words in the file. Returns the word count.
+
+### `count_characters(self)`
+
+Counts the total number of characters (excluding spaces) in the file. Returns the character count.
+
+## Contributions
+Contributions are always welcome! If you have a feature request, bug report, or just want to improve the code, feel free to create a pull request or open an issue.
+
+
--- markdown_analysis-0.0.3/markdown_analysis.egg-info/SOURCES.txt
+++ markdown_analysis-0.0.4/markdown_analysis.egg-info/SOURCES.txt
@@ -4,6 +4,7 @@ setup.py
 markdown_analysis.egg-info/PKG-INFO
 markdown_analysis.egg-info/SOURCES.txt
 markdown_analysis.egg-info/dependency_links.txt
+markdown_analysis.egg-info/requires.txt
 markdown_analysis.egg-info/top_level.txt
 mrkdwn_analysis/__init__.py
 mrkdwn_analysis/markdown_analyzer.py
@@ -0,0 +1,283 @@
|
|
1
|
+
import re
|
2
|
+
import requests
|
3
|
+
from collections import defaultdict, Counter
|
4
|
+
|
5
|
+
class MarkdownAnalyzer:
|
6
|
+
def __init__(self, file_path):
|
7
|
+
with open(file_path, 'r') as file:
|
8
|
+
self.lines = file.readlines()
|
9
|
+
|
10
|
+
def identify_headers(self):
|
11
|
+
result = defaultdict(list)
|
12
|
+
pattern = r'^(#{1,6})\s(.*)'
|
13
|
+
pattern_image = r'!\[.*?\]\((.*?)\)' # pattern to identify images
|
14
|
+
for i, line in enumerate(self.lines):
|
15
|
+
line_without_images = re.sub(pattern_image, '', line) # remove images from the line
|
16
|
+
match = re.match(pattern, line_without_images)
|
17
|
+
if match:
|
18
|
+
cleaned_line = re.sub(r'^#+', '', line_without_images).strip()
|
19
|
+
result["Header"].append(cleaned_line)
|
20
|
+
return dict(result) # Convert defaultdict to dict before returning
|
21
|
+
|
22
|
+
def identify_sections(self):
|
23
|
+
result = defaultdict(list)
|
24
|
+
pattern = r'^.*\n[=-]{2,}$'
|
25
|
+
for i, line in enumerate(self.lines):
|
26
|
+
if i < len(self.lines) - 1:
|
27
|
+
match = re.match(pattern, line + self.lines[i+1])
|
28
|
+
else:
|
29
|
+
match = None
|
30
|
+
if match:
|
31
|
+
if self.lines[i+1].strip().startswith("===") or self.lines[i+1].strip().startswith("---"):
|
32
|
+
result["Section"].append(line.strip())
|
33
|
+
return dict(result) # Convert defaultdict to dict before returning
|
34
|
+
|
35
|
+
def identify_paragraphs(lines):
|
36
|
+
result = defaultdict(list)
|
37
|
+
pattern = r'^(?!#)(?!\n)(?!>)(?!-)(?!=)(.*\S)'
|
38
|
+
pattern_underline = r'^.*\n[=-]{2,}$'
|
39
|
+
in_code_block = False
|
40
|
+
for i, line in enumerate(lines):
|
41
|
+
if line.strip().startswith('```'):
|
42
|
+
in_code_block = not in_code_block
|
43
|
+
if in_code_block:
|
44
|
+
continue
|
45
|
+
if i < len(lines) - 1:
|
46
|
+
match_underline = re.match(pattern_underline, line + lines[i+1])
|
47
|
+
if match_underline:
|
48
|
+
continue
|
49
|
+
match = re.match(pattern, line)
|
50
|
+
if match and line.strip() != '```': # added a condition to skip lines that are just ```
|
51
|
+
result["Paragraph"].append(line.strip())
|
52
|
+
return dict(result)
|
53
|
+
|
54
|
+
def identify_blockquotes(lines):
|
55
|
+
result = defaultdict(list)
|
56
|
+
pattern = r'^(>{1,})\s(.*)'
|
57
|
+
blockquote = None
|
58
|
+
in_code_block = False
|
59
|
+
for i, line in enumerate(lines):
|
60
|
+
if line.strip().startswith('```'):
|
61
|
+
in_code_block = not in_code_block # Flip the flag
|
62
|
+
if in_code_block:
|
63
|
+
continue # Skip processing for code blocks
|
64
|
+
match = re.match(pattern, line)
|
65
|
+
if match:
|
66
|
+
depth = len(match.group(1)) # depth is determined by the number of '>' characters
|
67
|
+
text = match.group(2).strip()
|
68
|
+
if depth > 2:
|
69
|
+
raise ValueError(f"Encountered a blockquote of depth {depth} at line {i+1}, but the maximum allowed depth is 2")
|
70
|
+
if blockquote is None:
|
71
|
+
# Start of a new blockquote
|
72
|
+
blockquote = text
|
73
|
+
else:
|
74
|
+
# Continuation of the current blockquote, regardless of depth
|
75
|
+
blockquote += " " + text
|
76
|
+
elif blockquote is not None:
|
77
|
+
# End of the current blockquote
|
78
|
+
result["Blockquote"].append(blockquote)
|
79
|
+
blockquote = None
|
80
|
+
|
81
|
+
if blockquote is not None:
|
82
|
+
# End of the last blockquote
|
83
|
+
result["Blockquote"].append(blockquote)
|
84
|
+
|
85
|
+
return dict(result)
|
86
|
+
|
87
|
+
+    def identify_code_blocks(lines):
+        result = defaultdict(list)
+        pattern = r'^```'
+        in_code_block = False
+        code_block = None
+        for i, line in enumerate(lines):
+            match = re.match(pattern, line.strip())
+            if match:
+                if in_code_block:
+                    # End of code block
+                    in_code_block = False
+                    code_block += "\n" + line.strip()  # Add the line to the code block before ending it
+                    result["Code block"].append(code_block)
+                    code_block = None
+                else:
+                    # Start of code block
+                    in_code_block = True
+                    code_block = line.strip()
+            elif in_code_block:
+                code_block += "\n" + line.strip()
+
+        if code_block is not None:
+            result["Code block"].append(code_block)
+
+        return dict(result)
+
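The fence pairing works by toggling: the first fence line opens a block, the next one closes it, and anything left open at end of input is kept as an unterminated block. A minimal standalone sketch of the same pairing (names are illustrative; the fence marker is built from characters to avoid nesting literal fences here):

```python
FENCE = "`" * 3  # the literal ``` marker

def collect_fenced_blocks(lines):
    """Pair fence markers and return each fenced block, fences included."""
    blocks, current, inside = [], None, False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(FENCE):
            if inside:
                # Closing fence: finish the block
                blocks.append(current + "\n" + stripped)
                current, inside = None, False
            else:
                # Opening fence (possibly with a language tag)
                current, inside = stripped, True
        elif inside:
            current += "\n" + stripped
    if current is not None:  # unterminated fence at end of input
        blocks.append(current)
    return blocks

sample = [FENCE + "python", "print('hi')", FENCE, "text", FENCE, "raw", FENCE]
print(collect_fenced_blocks(sample))
```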
+    def identify_ordered_lists(lines):
+        result = defaultdict(list)
+        pattern = r'^\s*\d+\.\s'
+        in_list = False
+        list_items = []
+        for i, line in enumerate(lines):
+            match = re.match(pattern, line)
+            if match:
+                if not in_list:
+                    # Start of a new list
+                    in_list = True
+                # Add the current line to the current list
+                list_items.append(line.strip())
+            elif in_list:
+                # End of the current list
+                in_list = False
+                result["Ordered list"].append(list_items)
+                list_items = []
+
+        if list_items:
+            # End of the last list
+            result["Ordered list"].append(list_items)
+
+        return dict(result)
+
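The ordered-list pattern requires optional indentation, digits, a literal dot, then whitespace, so `3.no-space` is not treated as a list item. A quick check of the same regex:

```python
import re

pattern = r'^\s*\d+\.\s'  # optional indent, digits, a dot, then a space

for line in ["1. first", "  2. indented", "3.no-space", "a. lettered"]:
    print(repr(line), "->", bool(re.match(pattern, line)))
```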
+    def identify_unordered_lists(lines):
+        result = defaultdict(list)
+        pattern = r'^\s*((\d+\\\.|[-*+])\s)'
+        in_list = False
+        list_items = []
+        for i, line in enumerate(lines):
+            match = re.match(pattern, line)
+            if match:
+                if not in_list:
+                    # Start of a new list
+                    in_list = True
+                # Add the current line to the current list
+                list_items.append(line.strip())
+            elif in_list:
+                # End of the current list
+                in_list = False
+                result["Unordered list"].append(list_items)
+                list_items = []
+
+        if list_items:
+            # End of the last list
+            result["Unordered list"].append(list_items)
+
+        return dict(result)
+
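Note the alternation in the unordered-list pattern: besides the `-`, `*`, and `+` bullets, `\d+\\\.` also accepts a Markdown-escaped number such as `1\.`, while a plain `1.` is left to the ordered-list detector. A quick check:

```python
import re

pattern = r'^\s*((\d+\\\.|[-*+])\s)'  # -, *, + bullets, or an escaped number like 1\.

for line in ["- dash", "* star", "+ plus", "1\\. escaped", "1. ordered"]:
    print(repr(line), "->", bool(re.match(pattern, line)))
```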
+    def identify_tables(self):
+        result = defaultdict(list)
+        table_pattern = re.compile(r'^ {0,3}\|(?P<table_head>.+)\|[ \t]*\n' +
+                                   r' {0,3}\|(?P<table_align> *[-:]+[-| :]*)\|[ \t]*\n' +
+                                   r'(?P<table_body>(?: {0,3}\|.*\|[ \t]*(?:\n|$))*)\n*')
+        nptable_pattern = re.compile(r'^ {0,3}(?P<nptable_head>\S.*\|.*)\n' +
+                                     r' {0,3}(?P<nptable_align>[-:]+ *\|[-| :]*)\n' +
+                                     r'(?P<nptable_body>(?:.*\|.*(?:\n|$))*)\n*')
+
+        text = "".join(self.lines)
+        matches_table = re.findall(table_pattern, text)
+        matches_nptable = re.findall(nptable_pattern, text)
+        for match in matches_table + matches_nptable:
+            result["Table"].append(match)
+
+        return dict(result)
+
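The table regex captures three named groups in one shot: the header row, the alignment row, and the remaining body rows. A sketch applying the same head/align/body pattern to a sample table; note that without `re.MULTILINE` the leading `^` only anchors at the start of the scanned text:

```python
import re

# Same head/align/body pattern as identify_tables above
table_pattern = re.compile(
    r'^ {0,3}\|(?P<table_head>.+)\|[ \t]*\n'
    r' {0,3}\|(?P<table_align> *[-:]+[-| :]*)\|[ \t]*\n'
    r'(?P<table_body>(?: {0,3}\|.*\|[ \t]*(?:\n|$))*)\n*'
)

text = "| Name | Age |\n| --- | --- |\n| Ada | 36 |\n"
m = table_pattern.match(text)
print(m.group("table_head").strip())   # header cells, still pipe-separated
print(m.group("table_body").strip())   # body rows
```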
+    def identify_links(self):
+        result = defaultdict(list)
+        text_link_pattern = r'\[([^\]]+)\]\(([^)]+)\)'
+        image_link_pattern = r'!\[([^\]]*)\]\((.*?)\)'
+        for i, line in enumerate(self.lines):
+            text_links = re.findall(text_link_pattern, line)
+            image_links = re.findall(image_link_pattern, line)
+            for link in text_links:
+                result["Text link"].append({"line": i+1, "text": link[0], "url": link[1]})
+            for link in image_links:
+                result["Image link"].append({"line": i+1, "alt_text": link[0], "url": link[1]})
+        return dict(result)
+
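One behavior worth knowing: the text-link pattern also matches the `[alt](url)` part inside an image, so an inline image surfaces under both keys. A quick check with the same two patterns:

```python
import re

text_link_pattern = r'\[([^\]]+)\]\(([^)]+)\)'
image_link_pattern = r'!\[([^\]]*)\]\((.*?)\)'

line = "See [docs](https://example.com) and ![logo](logo.png)."
print(re.findall(text_link_pattern, line))   # the image matches here too
print(re.findall(image_link_pattern, line))
```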
+    def check_links(self):
+        broken_links = []
+        link_pattern = r'\[([^\]]+)\]\(([^)]+)\)'
+        for i, line in enumerate(self.lines):
+            links = re.findall(link_pattern, line)
+            for link in links:
+                try:
+                    response = requests.head(link[1], timeout=3)
+                    if response.status_code != 200:
+                        broken_links.append({'line': i+1, 'text': link[0], 'url': link[1]})
+                except (requests.ConnectionError, requests.Timeout):
+                    broken_links.append({'line': i+1, 'text': link[0], 'url': link[1]})
+        return broken_links
+
+    def identify_todos(self):
+        todos = []
+        todo_pattern = r'^\-\s\[ \]\s(.*)'
+        for i, line in enumerate(self.lines):
+            match = re.match(todo_pattern, line)
+            if match:
+                todos.append({'line': i+1, 'text': match.group(1)})
+        return todos
+
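As written, the todo pattern only matches top-level unchecked items: the line must start with `- [ ]` at column zero, so checked boxes and indented items are skipped. A quick check:

```python
import re

todo_pattern = r'^\-\s\[ \]\s(.*)'

for line in ["- [ ] write tests", "- [x] ship 0.0.4", "  - [ ] indented"]:
    m = re.match(todo_pattern, line)
    print(repr(line), "->", m.group(1) if m else None)
```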
+    def count_elements(self, element_type):
+        identify_func = getattr(self, f'identify_{element_type}', None)
+        if not identify_func:
+            raise ValueError(f"No method to identify {element_type} found.")
+        elements = identify_func()
+        return len(elements.get(element_type.capitalize(), []))
+
+    def count_words(self):
+        text = " ".join(self.lines)
+        words = text.split()
+        return len(words)
+
+    def count_characters(self):
+        text = " ".join(self.lines)
+        # Exclude white spaces
+        characters = [char for char in text if not char.isspace()]
+        return len(characters)
+
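The two counters rely on standard string behavior: `split()` with no argument collapses runs of whitespace, and `isspace()` filters out spaces, tabs, and newlines before counting characters. The same logic in isolation:

```python
text = "Markdown makes  writing easy"
words = text.split()                            # double space collapses to one boundary
chars = [c for c in text if not c.isspace()]    # non-whitespace characters only
print(len(words), len(chars))
```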
+    def get_text_statistics(self):
+        statistics = []
+        for i, line in enumerate(self.lines):
+            words = line.split()
+            if words:
+                statistics.append({
+                    'line': i+1,
+                    'word_count': len(words),
+                    'char_count': sum(len(word) for word in words),
+                    'average_word_length': sum(len(word) for word in words) / len(words),
+                })
+        return statistics
+
+    def get_word_frequency(self):
+        word_frequency = Counter()
+        for line in self.lines:
+            word_frequency.update(line.lower().split())
+        return dict(word_frequency.most_common())
+
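The frequency count lowercases each line before splitting, so "The" and "the" fold into one entry, and `most_common()` orders the resulting dict by descending count. The same approach in isolation:

```python
from collections import Counter

lines = ["The quick fox", "the lazy dog", "The fox again"]
freq = Counter()
for line in lines:
    freq.update(line.lower().split())  # case-folded word tally
print(dict(freq.most_common()))
```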
+    def search(self, search_string):
+        result = []
+        for i, line in enumerate(self.lines):
+            if search_string in line:
+                element_types = [func for func in dir(self) if func.startswith('identify_')]
+                found_in_element = None
+                for etype in element_types:
+                    element = getattr(self, etype)()
+                    for e, content in element.items():
+                        if any(search_string in c for c in content):
+                            found_in_element = e
+                            break
+                    if found_in_element:
+                        break
+                result.append({"line": i+1, "text": line.strip(), "element": found_in_element})
+        return result
+
+    def analyse(self):
+        analysis = {
+            'headers': self.count_elements('headers'),
+            'sections': self.count_elements('sections'),
+            'paragraphs': self.count_elements('paragraphs'),
+            'blockquotes': self.count_elements('blockquotes'),
+            'code_blocks': self.count_elements('code_blocks'),
+            'ordered_lists': self.count_elements('ordered_lists'),
+            'unordered_lists': self.count_elements('unordered_lists'),
+            'tables': self.count_elements('tables'),
+            'words': self.count_words(),
+            'characters': self.count_characters(),
+        }
+        return analysis
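`analyse()` folds the individual detectors into one summary dict. A dependency-free sketch of the same idea over a list of lines, using only a header pattern and the todo pattern from above (the `summarize` name and the reduced set of keys are illustrative, not the package's API):

```python
import re

def summarize(lines):
    """Rough per-document tally in the spirit of analyse()."""
    text = " ".join(lines)
    return {
        "headers": sum(bool(re.match(r'^#{1,6}\s', l)) for l in lines),
        "todos": sum(bool(re.match(r'^\-\s\[ \]\s', l)) for l in lines),
        "words": len(text.split()),
        "characters": sum(1 for c in text if not c.isspace()),
    }

doc = ["# Title", "Some intro text.", "- [ ] task one"]
print(summarize(doc))
```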
@@ -6,16 +6,16 @@ with open("README.md", "r", encoding="utf-8") as fh:
 
 setup(
     name='markdown_analysis',
-    version='0.0.3',
+    version='0.0.4',
     long_description=long_description,
     long_description_content_type="text/markdown",
-    description='A library to analyze markdown files',
     author='yannbanas',
     author_email='yannbanas@gmail.com',
     url='https://github.com/yannbanas/mrkdwn_analysis',
     packages=find_packages(),
     install_requires=[
-
+        'urllib3',
+        'requests'
     ],
     classifiers=[
         'Development Status :: 2 - Pre-Alpha',
markdown_analysis-0.0.3/PKG-INFO DELETED
@@ -1,54 +0,0 @@
-Metadata-Version: 2.1
-Name: markdown_analysis
-Version: 0.0.3
-Summary: A library to analyze markdown files
-Home-page: https://github.com/yannbanas/mrkdwn_analysis
-Author: yannbanas
-Author-email: yannbanas@gmail.com
-License: UNKNOWN
-Platform: UNKNOWN
-Classifier: Development Status :: 2 - Pre-Alpha
-Classifier: Intended Audience :: Developers
-Classifier: License :: OSI Approved :: MIT License
-Classifier: Programming Language :: Python :: 3.11
-Description-Content-Type: text/markdown
-License-File: LICENSE
-
-# mrkdwn_analysis
-
-`mrkdwn_analysis` is a Python library designed to analyze Markdown files. With its powerful parsing capabilities, it can extract and categorize various elements within a Markdown document, including headers, sections, links, images, blockquotes, code blocks, and lists. This makes it a valuable tool for anyone looking to parse Markdown content for data analysis, content generation, or for building other tools that utilize Markdown.
-
-## Features
-- Extract and categorize various elements of a Markdown file.
-- Handle both inline and reference-style links and images.
-- Recognize different types of headers and sections.
-- Identify and extract code blocks, even nested ones.
-- Handle both ordered and unordered lists, nested or otherwise.
-- A simple API that makes parsing Markdown documents a breeze.
-
-## Usage
-Using `mrkdwn_analysis` is simple. Just import the `MarkdownAnalyzer` class, create an instance with your Markdown file, and you're good to go!
-
-```python
-from mrkdwn_analysis import MarkdownAnalyzer
-
-analyzer = MarkdownAnalyzer("path/to/your/markdown.md")
-
-headers = analyzer.identify_headers()
-sections = analyzer.identify_sections()
-...
-```
-
-## Installation
-You can install `mrkdwn_analysis` from PyPI:
-
-```bash
-pip install mrkdwn_analysis
-```
-
-We hope `mrkdwn_analysis` helps you with all your Markdown analyzing needs!
-
-## Contributions
-Contributions are always welcome! If you have a feature request, bug report, or just want to improve the code, feel free to create a pull request or open an issue.
markdown_analysis-0.0.3/README.md DELETED
@@ -1,36 +0,0 @@
-# mrkdwn_analysis
-
-`mrkdwn_analysis` is a Python library designed to analyze Markdown files. With its powerful parsing capabilities, it can extract and categorize various elements within a Markdown document, including headers, sections, links, images, blockquotes, code blocks, and lists. This makes it a valuable tool for anyone looking to parse Markdown content for data analysis, content generation, or for building other tools that utilize Markdown.
-
-## Features
-- Extract and categorize various elements of a Markdown file.
-- Handle both inline and reference-style links and images.
-- Recognize different types of headers and sections.
-- Identify and extract code blocks, even nested ones.
-- Handle both ordered and unordered lists, nested or otherwise.
-- A simple API that makes parsing Markdown documents a breeze.
-
-## Usage
-Using `mrkdwn_analysis` is simple. Just import the `MarkdownAnalyzer` class, create an instance with your Markdown file, and you're good to go!
-
-```python
-from mrkdwn_analysis import MarkdownAnalyzer
-
-analyzer = MarkdownAnalyzer("path/to/your/markdown.md")
-
-headers = analyzer.identify_headers()
-sections = analyzer.identify_sections()
-...
-```
-
-## Installation
-You can install `mrkdwn_analysis` from PyPI:
-
-```bash
-pip install mrkdwn_analysis
-```
-
-We hope `mrkdwn_analysis` helps you with all your Markdown analyzing needs!
-
-## Contributions
-Contributions are always welcome! If you have a feature request, bug report, or just want to improve the code, feel free to create a pull request or open an issue.
markdown_analysis-0.0.3/markdown_analysis.egg-info/PKG-INFO DELETED
@@ -1,54 +0,0 @@
-Metadata-Version: 2.1
-Name: markdown-analysis
-Version: 0.0.3
-Summary: A library to analyze markdown files
-Home-page: https://github.com/yannbanas/mrkdwn_analysis
-Author: yannbanas
-Author-email: yannbanas@gmail.com
-License: UNKNOWN
-Platform: UNKNOWN
-Classifier: Development Status :: 2 - Pre-Alpha
-Classifier: Intended Audience :: Developers
-Classifier: License :: OSI Approved :: MIT License
-Classifier: Programming Language :: Python :: 3.11
-Description-Content-Type: text/markdown
-License-File: LICENSE
-
-# mrkdwn_analysis
-
-`mrkdwn_analysis` is a Python library designed to analyze Markdown files. With its powerful parsing capabilities, it can extract and categorize various elements within a Markdown document, including headers, sections, links, images, blockquotes, code blocks, and lists. This makes it a valuable tool for anyone looking to parse Markdown content for data analysis, content generation, or for building other tools that utilize Markdown.
-
-## Features
-- Extract and categorize various elements of a Markdown file.
-- Handle both inline and reference-style links and images.
-- Recognize different types of headers and sections.
-- Identify and extract code blocks, even nested ones.
-- Handle both ordered and unordered lists, nested or otherwise.
-- A simple API that makes parsing Markdown documents a breeze.
-
-## Usage
-Using `mrkdwn_analysis` is simple. Just import the `MarkdownAnalyzer` class, create an instance with your Markdown file, and you're good to go!
-
-```python
-from mrkdwn_analysis import MarkdownAnalyzer
-
-analyzer = MarkdownAnalyzer("path/to/your/markdown.md")
-
-headers = analyzer.identify_headers()
-sections = analyzer.identify_sections()
-...
-```
-
-## Installation
-You can install `mrkdwn_analysis` from PyPI:
-
-```bash
-pip install mrkdwn_analysis
-```
-
-We hope `mrkdwn_analysis` helps you with all your Markdown analyzing needs!
-
-## Contributions
-Contributions are always welcome! If you have a feature request, bug report, or just want to improve the code, feel free to create a pull request or open an issue.
markdown_analysis-0.0.3/mrkdwn_analysis/markdown_analyzer.py DELETED
@@ -1,34 +0,0 @@
-# core.py
-
-import re
-from collections import defaultdict
-
-class MarkdownAnalyzer:
-    def __init__(self, file_path):
-        with open(file_path, 'r') as file:
-            self.lines = file.readlines()
-
-    def identify_headers(self):
-        result = defaultdict(list)
-        pattern = r'^(#{1,6})\s(.*)'
-        pattern_image = r'!\[.*?\]\((.*?)\)'  # pattern to identify images
-        for i, line in enumerate(self.lines):
-            line_without_images = re.sub(pattern_image, '', line)  # remove images from the line
-            match = re.match(pattern, line_without_images)
-            if match:
-                cleaned_line = re.sub(r'^#+', '', line_without_images).strip()
-                result["Header"].append(cleaned_line)
-        return result
-
-    def identify_sections(self):
-        result = defaultdict(list)
-        pattern = r'^.*\n[=-]{2,}$'
-        for i, line in enumerate(self.lines):
-            if i < len(self.lines) - 1:
-                match = re.match(pattern, line + self.lines[i+1])
-            else:
-                match = None
-            if match:
-                if self.lines[i+1].strip().startswith("===") or self.lines[i+1].strip().startswith("---"):
-                    result["Section"].append(line.strip())
-        return result
{markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/LICENSE RENAMED (file without changes)
{markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/markdown_analysis.egg-info/dependency_links.txt RENAMED (file without changes)
{markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/markdown_analysis.egg-info/top_level.txt RENAMED (file without changes)
{markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/mrkdwn_analysis/__init__.py RENAMED (file without changes)
{markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/setup.cfg RENAMED (file without changes)
{markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/test/__init__.py RENAMED (file without changes)