markdown-analysis 0.0.3__tar.gz → 0.0.4__tar.gz
- markdown_analysis-0.0.4/PKG-INFO +140 -0
- markdown_analysis-0.0.4/README.md +122 -0
- markdown_analysis-0.0.4/markdown_analysis.egg-info/PKG-INFO +140 -0
- {markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/markdown_analysis.egg-info/SOURCES.txt +1 -0
- markdown_analysis-0.0.4/markdown_analysis.egg-info/requires.txt +2 -0
- markdown_analysis-0.0.4/mrkdwn_analysis/markdown_analyzer.py +283 -0
- {markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/setup.py +3 -3
- markdown_analysis-0.0.3/PKG-INFO +0 -54
- markdown_analysis-0.0.3/README.md +0 -36
- markdown_analysis-0.0.3/markdown_analysis.egg-info/PKG-INFO +0 -54
- markdown_analysis-0.0.3/mrkdwn_analysis/markdown_analyzer.py +0 -34
- {markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/LICENSE +0 -0
- {markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/markdown_analysis.egg-info/dependency_links.txt +0 -0
- {markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/markdown_analysis.egg-info/top_level.txt +0 -0
- {markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/mrkdwn_analysis/__init__.py +0 -0
- {markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/setup.cfg +0 -0
- {markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/test/__init__.py +0 -0
@@ -0,0 +1,140 @@

Metadata-Version: 2.1
Name: markdown_analysis
Version: 0.0.4
Summary: UNKNOWN
Home-page: https://github.com/yannbanas/mrkdwn_analysis
Author: yannbanas
Author-email: yannbanas@gmail.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Description-Content-Type: text/markdown
License-File: LICENSE

# mrkdwn_analysis

`mrkdwn_analysis` is a Python library designed to analyze Markdown files. With its powerful parsing capabilities, it can extract and categorize various elements within a Markdown document, including headers, sections, links, images, blockquotes, code blocks, and lists. This makes it a valuable tool for anyone looking to parse Markdown content for data analysis, content generation, or for building other tools that utilize Markdown.

## Features

- File Loading: The MarkdownAnalyzer can load any given Markdown file provided through the file path.

- Header Identification: The tool can extract all headers from the Markdown file, ranging from H1 to H6 tags. This allows users to have a quick overview of the document's structure.

- Section Identification: The analyzer can recognize different sections of the document. It defines a section as a block of text followed by a line composed solely of `=` or `-` characters.

- Paragraph Identification: The tool can distinguish between regular text and other elements such as lists, headers, etc., thereby identifying all the paragraphs present in the document.

- Blockquote Identification: The analyzer can identify and extract all blockquotes in the Markdown file.

- Code Block Identification: The tool can extract all code blocks defined in the document, allowing you to separate the programming code from the regular text easily.

- List Identification: The analyzer can identify both ordered and unordered lists in the Markdown file, providing information about the hierarchical structure of the points.

- Table Identification: The tool can identify and extract tables from the Markdown file, enabling users to separate and analyze tabular data quickly.

- Link Identification and Validation: The analyzer can identify all links present in the Markdown file, categorizing them into text and image links. Moreover, it can also verify whether these links are valid or broken.

- Todo Identification: The tool is capable of recognizing and extracting todos (tasks or action items) present in the document.

- Element Counting: The analyzer can count the total number of a specific element type in the file. This can help in quantifying the extent of different elements in the document.

- Word Counting: The tool can count the total number of words in the file, providing an estimate of the document's length.

- Character Counting: The analyzer can count the total number of characters (excluding spaces) in the file, giving a detailed measure of the document's size.

## Installation
You can install `mrkdwn_analysis` from PyPI:

```bash
pip install mrkdwn_analysis
```

We hope `mrkdwn_analysis` helps you with all your Markdown analyzing needs!

## Usage
Using `mrkdwn_analysis` is simple. Just import the `MarkdownAnalyzer` class, create an instance with your Markdown file, and you're good to go!

```python
from mrkdwn_analysis import MarkdownAnalyzer

analyzer = MarkdownAnalyzer("path/to/your/markdown.md")

headers = analyzer.identify_headers()
sections = analyzer.identify_sections()
...
```
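
For a quick check without installing the package, the header pass can be sketched with the same `#{1,6}` regex the analyzer uses. This is a simplified, illustrative stand-in, not the library's API; `sample_md` is a made-up document.

```python
import re
from collections import defaultdict

# Hypothetical sample document for illustration.
sample_md = """# Title

Some intro text.

## Features

- one item
"""

def identify_headers(lines):
    # Same '#'-prefix pattern as the analyzer: 1-6 hashes, a space, then text.
    result = defaultdict(list)
    pattern = r'^(#{1,6})\s(.*)'
    for line in lines:
        match = re.match(pattern, line)
        if match:
            result["Header"].append(match.group(2).strip())
    return dict(result)

print(identify_headers(sample_md.splitlines()))  # {'Header': ['Title', 'Features']}
```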

### Class MarkdownAnalyzer

The `MarkdownAnalyzer` class is designed to analyze Markdown files. It has the ability to extract and categorize various elements of a Markdown document.

### `__init__(self, file_path)`

The constructor of the class. It opens the specified Markdown file and stores its content line by line.

- `file_path`: the path of the Markdown file to analyze.

### `identify_headers(self)`

Analyzes the file and identifies all headers (from h1 to h6). Headers are returned as a dictionary where the key is "Header" and the value is a list of all headers found.

### `identify_sections(self)`

Analyzes the file and identifies all sections. Sections are defined as a block of text followed by a line composed solely of `=` or `-` characters. Sections are returned as a dictionary where the key is "Section" and the value is a list of all sections found.
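
The setext-underline rule above (a text line followed by a run of `=` or `-`) can be sketched as a small stand-alone function; the function name and sample `doc` here are illustrative assumptions, and the real analyzer operates on a file rather than a list.

```python
import re

def identify_sections(lines):
    """Collect setext-style section titles: a non-empty line followed by === or ---."""
    sections = []
    for i in range(len(lines) - 1):
        underline = lines[i + 1].strip()
        if lines[i].strip() and re.fullmatch(r'[=-]{2,}', underline):
            sections.append(lines[i].strip())
    return sections

doc = ["Intro", "=====", "", "Details", "-------", "body text"]
print(identify_sections(doc))  # ['Intro', 'Details']
```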
### `identify_paragraphs(self)`

Analyzes the file and identifies all paragraphs. Paragraphs are defined as a block of text that is not a header, list, blockquote, etc. Paragraphs are returned as a dictionary where the key is "Paragraph" and the value is a list of all paragraphs found.

### `identify_blockquotes(self)`

Analyzes the file and identifies all blockquotes. Blockquotes are defined by a line starting with the `>` character. Blockquotes are returned as a dictionary where the key is "Blockquote" and the value is a list of all blockquotes found.

### `identify_code_blocks(self)`

Analyzes the file and identifies all code blocks. Code blocks are defined by a block of text surrounded by lines containing only the "```" text. Code blocks are returned as a dictionary where the key is "Code block" and the value is a list of all code blocks found.
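
The fence-delimiter idea behind code-block detection can be sketched with a simple toggle flag. Unlike the library (which keeps the fence lines in its results), this simplified sketch collects only the lines between the fences.

```python
def extract_code_blocks(lines):
    """Toggle an in-block flag on each ``` fence; collect the lines in between."""
    blocks, current, in_block = [], [], False
    for line in lines:
        if line.strip().startswith("```"):
            if in_block:
                blocks.append("\n".join(current))
                current = []
            in_block = not in_block
        elif in_block:
            current.append(line)
    return blocks

doc = ["text", "```python", "x = 1", "```", "more text"]
print(extract_code_blocks(doc))  # ['x = 1']
```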
### `identify_ordered_lists(self)`

Analyzes the file and identifies all ordered lists. Ordered lists are defined by lines starting with a number followed by a dot. Ordered lists are returned as a dictionary where the key is "Ordered list" and the value is a list of all ordered lists found.

### `identify_unordered_lists(self)`

Analyzes the file and identifies all unordered lists. Unordered lists are defined by lines starting with a `-`, `*`, or `+`. Unordered lists are returned as a dictionary where the key is "Unordered list" and the value is a list of all unordered lists found.

### `identify_tables(self)`

Analyzes the file and identifies all tables. Tables are defined by lines containing `|` to delimit cells and are separated by lines containing `-` to define the borders. Tables are returned as a dictionary where the key is "Table" and the value is a list of all tables found.

### `identify_links(self)`

Analyzes the file and identifies all links. Links are defined by the format `[text](url)`. Links are returned as a dictionary where the keys are "Text link" and "Image link" and the values are lists of all links found.
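
The text-vs-image distinction above comes down to the leading `!` of image syntax. A minimal regex sketch (the library's exact patterns may differ):

```python
import re

text = 'See [docs](https://example.com) and ![logo](logo.png).'

# Image links carry a leading '!'; plain text links do not.
image_links = re.findall(r'!\[([^\]]*)\]\(([^)]+)\)', text)
text_links = re.findall(r'(?<!!)\[([^\]]*)\]\(([^)]+)\)', text)
print(image_links)  # [('logo', 'logo.png')]
print(text_links)   # [('docs', 'https://example.com')]
```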
### `check_links(self)`

Checks all links identified by `identify_links` to see if they are broken (return a 404 error). Broken links are returned as a list, each item being a dictionary containing the line number, link text, and URL.

### `identify_todos(self)`

Analyzes the file and identifies all todos. Todos are defined by lines starting with `- [ ] `. Todos are returned as a list, each item being a dictionary containing the line number and todo text.
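
The todo format above maps to one regex and a line-numbered result. A stand-alone sketch (the README describes only the unchecked `- [ ]` form, so checked items are deliberately skipped here):

```python
import re

lines = [
    "- [ ] write tests",
    "- [x] ship release",
    "plain text",
]
# One dict per unchecked todo, keeping its 1-based line number.
todos = [
    {"line": i + 1, "text": m.group(1)}
    for i, line in enumerate(lines)
    if (m := re.match(r'^\s*- \[ \] (.*)', line))
]
print(todos)  # [{'line': 1, 'text': 'write tests'}]
```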
### `count_elements(self, element_type)`

Counts the total number of a specific element type in the file. The `element_type` should match the name of one of the identification methods (for example, "headers" for `identify_headers`). Returns the total number of elements of this type.

### `count_words(self)`

Counts the total number of words in the file. Returns the word count.

### `count_characters(self)`

Counts the total number of characters (excluding spaces) in the file. Returns the character count.
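
The two counting metrics above amount to whitespace splitting and space removal; a one-screen sketch of the assumed behavior:

```python
text = "Count these five words now"

# Words: whitespace-separated tokens; characters: everything except spaces.
words = len(text.split())
characters = len(text.replace(" ", ""))
print(words, characters)  # 5 22
```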
## Contributions
Contributions are always welcome! If you have a feature request, bug report, or just want to improve the code, feel free to create a pull request or open an issue.
@@ -0,0 +1,122 @@

(README.md is new in 0.0.4; its 122 added lines are identical to the PKG-INFO description above, from "# mrkdwn_analysis" through the Contributions section.)
@@ -0,0 +1,140 @@

(markdown_analysis.egg-info/PKG-INFO is new in 0.0.4; its 140 added lines match the top-level PKG-INFO above, except that the Name field reads `markdown-analysis` here, with a dash instead of an underscore.)
@@ -4,6 +4,7 @@ setup.py
 markdown_analysis.egg-info/PKG-INFO
 markdown_analysis.egg-info/SOURCES.txt
 markdown_analysis.egg-info/dependency_links.txt
+markdown_analysis.egg-info/requires.txt
 markdown_analysis.egg-info/top_level.txt
 mrkdwn_analysis/__init__.py
 mrkdwn_analysis/markdown_analyzer.py
@@ -0,0 +1,283 @@

```python
import re
import requests
from collections import defaultdict, Counter

class MarkdownAnalyzer:
    def __init__(self, file_path):
        with open(file_path, 'r') as file:
            self.lines = file.readlines()

    def identify_headers(self):
        result = defaultdict(list)
        pattern = r'^(#{1,6})\s(.*)'
        pattern_image = r'!\[.*?\]\((.*?)\)'  # pattern to identify images
        for i, line in enumerate(self.lines):
            line_without_images = re.sub(pattern_image, '', line)  # remove images from the line
            match = re.match(pattern, line_without_images)
            if match:
                cleaned_line = re.sub(r'^#+', '', line_without_images).strip()
                result["Header"].append(cleaned_line)
        return dict(result)  # Convert defaultdict to dict before returning

    def identify_sections(self):
        result = defaultdict(list)
        pattern = r'^.*\n[=-]{2,}$'
        for i, line in enumerate(self.lines):
            if i < len(self.lines) - 1:
                match = re.match(pattern, line + self.lines[i+1])
            else:
                match = None
            if match:
                if self.lines[i+1].strip().startswith("===") or self.lines[i+1].strip().startswith("---"):
                    result["Section"].append(line.strip())
        return dict(result)  # Convert defaultdict to dict before returning

    def identify_paragraphs(lines):
        result = defaultdict(list)
        pattern = r'^(?!#)(?!\n)(?!>)(?!-)(?!=)(.*\S)'
        pattern_underline = r'^.*\n[=-]{2,}$'
        in_code_block = False
        for i, line in enumerate(lines):
            if line.strip().startswith('```'):
                in_code_block = not in_code_block
            if in_code_block:
                continue
            if i < len(lines) - 1:
                match_underline = re.match(pattern_underline, line + lines[i+1])
                if match_underline:
                    continue
            match = re.match(pattern, line)
            if match and line.strip() != '```':  # added a condition to skip lines that are just ```
                result["Paragraph"].append(line.strip())
        return dict(result)

    def identify_blockquotes(lines):
        result = defaultdict(list)
        pattern = r'^(>{1,})\s(.*)'
        blockquote = None
        in_code_block = False
        for i, line in enumerate(lines):
            if line.strip().startswith('```'):
                in_code_block = not in_code_block  # Flip the flag
            if in_code_block:
                continue  # Skip processing for code blocks
            match = re.match(pattern, line)
            if match:
                depth = len(match.group(1))  # depth is determined by the number of '>' characters
                text = match.group(2).strip()
                if depth > 2:
                    raise ValueError(f"Encountered a blockquote of depth {depth} at line {i+1}, but the maximum allowed depth is 2")
                if blockquote is None:
                    # Start of a new blockquote
                    blockquote = text
                else:
                    # Continuation of the current blockquote, regardless of depth
                    blockquote += " " + text
            elif blockquote is not None:
                # End of the current blockquote
                result["Blockquote"].append(blockquote)
                blockquote = None

        if blockquote is not None:
            # End of the last blockquote
            result["Blockquote"].append(blockquote)

        return dict(result)

    def identify_code_blocks(lines):
        result = defaultdict(list)
        pattern = r'^```'
        in_code_block = False
        code_block = None
        for i, line in enumerate(lines):
            match = re.match(pattern, line.strip())
            if match:
                if in_code_block:
                    # End of code block
                    in_code_block = False
                    code_block += "\n" + line.strip()  # Add the line to the code block before ending it
                    result["Code block"].append(code_block)
                    code_block = None
                else:
                    # Start of code block
                    in_code_block = True
                    code_block = line.strip()
            elif in_code_block:
                code_block += "\n" + line.strip()

        if code_block is not None:
            result["Code block"].append(code_block)

        return dict(result)

    def identify_ordered_lists(lines):
        result = defaultdict(list)
        pattern = r'^\s*\d+\.\s'
        in_list = False
        list_items = []
        for i, line in enumerate(lines):
            match = re.match(pattern, line)
            if match:
                if not in_list:
                    # Start of a new list
                    in_list = True
                # Add the current line to the current list
                list_items.append(line.strip())
            elif in_list:
                # End of the current list
                in_list = False
                result["Ordered list"].append(list_items)
                list_items = []

        if list_items:
            # End of the last list
            result["Ordered list"].append(list_items)

        return dict(result)

    def identify_unordered_lists(lines):
        result = defaultdict(list)
        pattern = r'^\s*((\d+\\\.|[-*+])\s)'
        in_list = False
        list_items = []
        for i, line in enumerate(lines):
            match = re.match(pattern, line)
            if match:
                if not in_list:
                    # Start of a new list
                    in_list = True
                # Add the current line to the current list
                list_items.append(line.strip())
            elif in_list:
                # End of the current list
                in_list = False
                result["Unordered list"].append(list_items)
                list_items = []

        if list_items:
            # End of the last list
            result["Unordered list"].append(list_items)

        return dict(result)

    def identify_tables(self):
        result = defaultdict(list)
        table_pattern = re.compile(r'^ {0,3}\|(?P<table_head>.+)\|[ \t]*\n' +
                                   r' {0,3}\|(?P<table_align> *[-:]+[-| :]*)\|[ \t]*\n' +
                                   r'(?P<table_body>(?: {0,3}\|.*\|[ \t]*(?:\n|$))*)\n*')
        nptable_pattern = re.compile(r'^ {0,3}(?P<nptable_head>\S.*\|.*)\n' +
                                     r' {0,3}(?P<nptable_align>[-:]+ *\|[-| :]*)\n' +
                                     r'(?P<nptable_body>(?:.*\|.*(?:\n|$))*)\n*')

        text = "".join(self.lines)
```
173
|
+
matches_table = re.findall(table_pattern, text)
|
174
|
+
matches_nptable = re.findall(nptable_pattern, text)
|
175
|
+
for match in matches_table + matches_nptable:
|
176
|
+
result["Table"].append(match)
|
177
|
+
|
178
|
+
return dict(result)
|
179
|
+
|
180
|
+
def identify_links(self):
|
181
|
+
result = defaultdict(list)
|
182
|
+
text_link_pattern = r'\[([^\]]+)\]\(([^)]+)\)'
|
183
|
+
image_link_pattern = r'!\[([^\]]*)\]\((.*?)\)'
|
184
|
+
for i, line in enumerate(self.lines):
|
185
|
+
text_links = re.findall(text_link_pattern, line)
|
186
|
+
image_links = re.findall(image_link_pattern, line)
|
187
|
+
for link in text_links:
|
188
|
+
result["Text link"].append({"line": i+1, "text": link[0], "url": link[1]})
|
189
|
+
for link in image_links:
|
190
|
+
result["Image link"].append({"line": i+1, "alt_text": link[0], "url": link[1]})
|
191
|
+
return dict(result)
|
192
|
+
|
193
|
+
def check_links(self):
|
194
|
+
broken_links = []
|
195
|
+
link_pattern = r'\[([^\]]+)\]\(([^)]+)\)'
|
196
|
+
for i, line in enumerate(self.lines):
|
197
|
+
links = re.findall(link_pattern, line)
|
198
|
+
for link in links:
|
199
|
+
try:
|
200
|
+
response = requests.head(link[1], timeout=3)
|
201
|
+
if response.status_code != 200:
|
202
|
+
broken_links.append({'line': i+1, 'text': link[0], 'url': link[1]})
|
203
|
+
except (requests.ConnectionError, requests.Timeout):
|
204
|
+
broken_links.append({'line': i+1, 'text': link[0], 'url': link[1]})
|
205
|
+
return broken_links
|
206
|
+
|
207
|
+
def identify_todos(self):
|
208
|
+
todos = []
|
209
|
+
todo_pattern = r'^\-\s\[ \]\s(.*)'
|
210
|
+
for i, line in enumerate(self.lines):
|
211
|
+
match = re.match(todo_pattern, line)
|
212
|
+
if match:
|
213
|
+
todos.append({'line': i+1, 'text': match.group(1)})
|
214
|
+
return todos
|
215
|
+
|
216
|
+
def count_elements(self, element_type):
|
217
|
+
identify_func = getattr(self, f'identify_{element_type}', None)
|
218
|
+
if not identify_func:
|
219
|
+
raise ValueError(f"No method to identify {element_type} found.")
|
220
|
+
elements = identify_func()
|
221
|
+
return len(elements.get(element_type.capitalize(), []))
|
222
|
+
|
223
|
+
def count_words(self):
|
224
|
+
text = " ".join(self.lines)
|
225
|
+
words = text.split()
|
226
|
+
return len(words)
|
227
|
+
|
228
|
+
def count_characters(self):
|
229
|
+
text = " ".join(self.lines)
|
230
|
+
# Exclude white spaces
|
231
|
+
characters = [char for char in text if not char.isspace()]
|
232
|
+
return len(characters)
|
233
|
+
|
234
|
+
def get_text_statistics(self):
|
235
|
+
statistics = []
|
236
|
+
for i, line in enumerate(self.lines):
|
237
|
+
words = line.split()
|
238
|
+
if words:
|
239
|
+
statistics.append({
|
240
|
+
'line': i+1,
|
241
|
+
'word_count': len(words),
|
242
|
+
'char_count': sum(len(word) for word in words),
|
243
|
+
'average_word_length': sum(len(word) for word in words) / len(words),
|
244
|
+
})
|
245
|
+
return statistics
|
246
|
+
|
247
|
+
def get_word_frequency(self):
|
248
|
+
word_frequency = Counter()
|
249
|
+
for line in self.lines:
|
250
|
+
word_frequency.update(line.lower().split())
|
251
|
+
return dict(word_frequency.most_common())
|
252
|
+
|
253
|
+
def search(self, search_string):
|
254
|
+
result = []
|
255
|
+
for i, line in enumerate(self.lines):
|
256
|
+
if search_string in line:
|
257
|
+
element_types = [func for func in dir(self) if func.startswith('identify_')]
|
258
|
+
found_in_element = None
|
259
|
+
for etype in element_types:
|
260
|
+
element = getattr(self, etype)()
|
261
|
+
for e, content in element.items():
|
262
|
+
if any(search_string in c for c in content):
|
263
|
+
found_in_element = e
|
264
|
+
break
|
265
|
+
if found_in_element:
|
266
|
+
break
|
267
|
+
result.append({"line": i+1, "text": line.strip(), "element": found_in_element})
|
268
|
+
return result
|
269
|
+
|
270
|
+
def analyse(self):
|
271
|
+
analysis = {
|
272
|
+
'headers': self.count_elements('headers'),
|
273
|
+
'sections': self.count_elements('sections'),
|
274
|
+
'paragraphs': self.count_elements('paragraphs'),
|
275
|
+
'blockquotes': self.count_elements('blockquotes'),
|
276
|
+
'code_blocks': self.count_elements('code_blocks'),
|
277
|
+
'ordered_lists': self.count_elements('ordered_lists'),
|
278
|
+
'unordered_lists': self.count_elements('unordered_lists'),
|
279
|
+
'tables': self.count_elements('tables'),
|
280
|
+
'words': self.count_words(),
|
281
|
+
'characters': self.count_characters(),
|
282
|
+
}
|
283
|
+
return analysis
|
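For readers skimming the diff, the behavior of the new link-extraction patterns can be sanity-checked in isolation. This is a minimal sketch: the two regexes are copied verbatim from the `identify_links` method added above, while the sample line is made up for illustration.

```python
import re

# Patterns as added in identify_links (mrkdwn_analysis 0.0.4)
text_link_pattern = r'\[([^\]]+)\]\(([^)]+)\)'
image_link_pattern = r'!\[([^\]]*)\]\((.*?)\)'

line = "See [the docs](https://example.com) and ![logo](img/logo.png)."

text_links = re.findall(text_link_pattern, line)
image_links = re.findall(image_link_pattern, line)

# Note: the text-link pattern does not exclude a leading '!',
# so it also matches the bracket pair of the image link.
print(text_links)
print(image_links)
```

This shows why the method reports an image link under both "Text link" and "Image link": the text pattern has no negative lookbehind for `!`.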
{markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/setup.py RENAMED
@@ -6,16 +6,16 @@ with open("README.md", "r", encoding="utf-8") as fh:
 
 setup(
     name='markdown_analysis',
-    version='0.0.3',
+    version='0.0.4',
     long_description=long_description,
     long_description_content_type="text/markdown",
-    description='A library to analyze markdown files',
     author='yannbanas',
     author_email='yannbanas@gmail.com',
     url='https://github.com/yannbanas/mrkdwn_analysis',
     packages=find_packages(),
     install_requires=[
-
+        'urllib3',
+        'requests'
     ],
     classifiers=[
         'Development Status :: 2 - Pre-Alpha',
markdown_analysis-0.0.3/PKG-INFO DELETED
@@ -1,54 +0,0 @@
-Metadata-Version: 2.1
-Name: markdown_analysis
-Version: 0.0.3
-Summary: A library to analyze markdown files
-Home-page: https://github.com/yannbanas/mrkdwn_analysis
-Author: yannbanas
-Author-email: yannbanas@gmail.com
-License: UNKNOWN
-Platform: UNKNOWN
-Classifier: Development Status :: 2 - Pre-Alpha
-Classifier: Intended Audience :: Developers
-Classifier: License :: OSI Approved :: MIT License
-Classifier: Programming Language :: Python :: 3.11
-Description-Content-Type: text/markdown
-License-File: LICENSE
-
-# mrkdwn_analysis
-
-`mrkdwn_analysis` is a Python library designed to analyze Markdown files. With its powerful parsing capabilities, it can extract and categorize various elements within a Markdown document, including headers, sections, links, images, blockquotes, code blocks, and lists. This makes it a valuable tool for anyone looking to parse Markdown content for data analysis, content generation, or for building other tools that utilize Markdown.
-
-## Features
-- Extract and categorize various elements of a Markdown file.
-- Handle both inline and reference-style links and images.
-- Recognize different types of headers and sections.
-- Identify and extract code blocks, even nested ones.
-- Handle both ordered and unordered lists, nested or otherwise.
-- A simple API that makes parsing Markdown documents a breeze.
-
-## Usage
-Using `mrkdwn_analysis` is simple. Just import the `MarkdownAnalyzer` class, create an instance with your Markdown file, and you're good to go!
-
-```python
-from mrkdwn_analysis import MarkdownAnalyzer
-
-analyzer = MarkdownAnalyzer("path/to/your/markdown.md")
-
-headers = analyzer.identify_headers()
-sections = analyzer.identify_sections()
-...
-```
-
-## Installation
-You can install `mrkdwn_analysis` from PyPI:
-
-```bash
-pip install mrkdwn_analysis
-```
-
-We hope `mrkdwn_analysis` helps you with all your Markdown analyzing needs!
-
-## Contributions
-Contributions are always welcome! If you have a feature request, bug report, or just want to improve the code, feel free to create a pull request or open an issue.
-
-
markdown_analysis-0.0.3/README.md DELETED
@@ -1,36 +0,0 @@
-# mrkdwn_analysis
-
-`mrkdwn_analysis` is a Python library designed to analyze Markdown files. With its powerful parsing capabilities, it can extract and categorize various elements within a Markdown document, including headers, sections, links, images, blockquotes, code blocks, and lists. This makes it a valuable tool for anyone looking to parse Markdown content for data analysis, content generation, or for building other tools that utilize Markdown.
-
-## Features
-- Extract and categorize various elements of a Markdown file.
-- Handle both inline and reference-style links and images.
-- Recognize different types of headers and sections.
-- Identify and extract code blocks, even nested ones.
-- Handle both ordered and unordered lists, nested or otherwise.
-- A simple API that makes parsing Markdown documents a breeze.
-
-## Usage
-Using `mrkdwn_analysis` is simple. Just import the `MarkdownAnalyzer` class, create an instance with your Markdown file, and you're good to go!
-
-```python
-from mrkdwn_analysis import MarkdownAnalyzer
-
-analyzer = MarkdownAnalyzer("path/to/your/markdown.md")
-
-headers = analyzer.identify_headers()
-sections = analyzer.identify_sections()
-...
-```
-
-## Installation
-You can install `mrkdwn_analysis` from PyPI:
-
-```bash
-pip install mrkdwn_analysis
-```
-
-We hope `mrkdwn_analysis` helps you with all your Markdown analyzing needs!
-
-## Contributions
-Contributions are always welcome! If you have a feature request, bug report, or just want to improve the code, feel free to create a pull request or open an issue.
markdown_analysis-0.0.3/markdown_analysis.egg-info/PKG-INFO DELETED
@@ -1,54 +0,0 @@
-Metadata-Version: 2.1
-Name: markdown-analysis
-Version: 0.0.3
-Summary: A library to analyze markdown files
-Home-page: https://github.com/yannbanas/mrkdwn_analysis
-Author: yannbanas
-Author-email: yannbanas@gmail.com
-License: UNKNOWN
-Platform: UNKNOWN
-Classifier: Development Status :: 2 - Pre-Alpha
-Classifier: Intended Audience :: Developers
-Classifier: License :: OSI Approved :: MIT License
-Classifier: Programming Language :: Python :: 3.11
-Description-Content-Type: text/markdown
-License-File: LICENSE
-
-# mrkdwn_analysis
-
-`mrkdwn_analysis` is a Python library designed to analyze Markdown files. With its powerful parsing capabilities, it can extract and categorize various elements within a Markdown document, including headers, sections, links, images, blockquotes, code blocks, and lists. This makes it a valuable tool for anyone looking to parse Markdown content for data analysis, content generation, or for building other tools that utilize Markdown.
-
-## Features
-- Extract and categorize various elements of a Markdown file.
-- Handle both inline and reference-style links and images.
-- Recognize different types of headers and sections.
-- Identify and extract code blocks, even nested ones.
-- Handle both ordered and unordered lists, nested or otherwise.
-- A simple API that makes parsing Markdown documents a breeze.
-
-## Usage
-Using `mrkdwn_analysis` is simple. Just import the `MarkdownAnalyzer` class, create an instance with your Markdown file, and you're good to go!
-
-```python
-from mrkdwn_analysis import MarkdownAnalyzer
-
-analyzer = MarkdownAnalyzer("path/to/your/markdown.md")
-
-headers = analyzer.identify_headers()
-sections = analyzer.identify_sections()
-...
-```
-
-## Installation
-You can install `mrkdwn_analysis` from PyPI:
-
-```bash
-pip install mrkdwn_analysis
-```
-
-We hope `mrkdwn_analysis` helps you with all your Markdown analyzing needs!
-
-## Contributions
-Contributions are always welcome! If you have a feature request, bug report, or just want to improve the code, feel free to create a pull request or open an issue.
-
-
markdown_analysis-0.0.3/mrkdwn_analysis/markdown_analyzer.py DELETED
@@ -1,34 +0,0 @@
-# core.py
-
-import re
-from collections import defaultdict
-
-class MarkdownAnalyzer:
-    def __init__(self, file_path):
-        with open(file_path, 'r') as file:
-            self.lines = file.readlines()
-
-    def identify_headers(self):
-        result = defaultdict(list)
-        pattern = r'^(#{1,6})\s(.*)'
-        pattern_image = r'!\[.*?\]\((.*?)\)'  # pattern to identify images
-        for i, line in enumerate(self.lines):
-            line_without_images = re.sub(pattern_image, '', line)  # remove images from the line
-            match = re.match(pattern, line_without_images)
-            if match:
-                cleaned_line = re.sub(r'^#+', '', line_without_images).strip()
-                result["Header"].append(cleaned_line)
-        return result
-
-    def identify_sections(self):
-        result = defaultdict(list)
-        pattern = r'^.*\n[=-]{2,}$'
-        for i, line in enumerate(self.lines):
-            if i < len(self.lines) - 1:
-                match = re.match(pattern, line + self.lines[i+1])
-            else:
-                match = None
-            if match:
-                if self.lines[i+1].strip().startswith("===") or self.lines[i+1].strip().startswith("---"):
-                    result["Section"].append(line.strip())
-        return result
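The deleted `identify_sections` logic can be exercised by hand to see why it joins each line with its successor. This is a minimal sketch: the pattern is copied verbatim from the deleted 0.0.3 code, and the sample lines are made up.

```python
import re

# Setext-header pattern from the deleted identify_sections (0.0.3)
pattern = r'^.*\n[=-]{2,}$'

# A setext-style H1: title line followed by an underline of '='
match = re.match(pattern, "Title\n" + "=====\n")

# Without re.MULTILINE, '$' still matches just before a trailing newline,
# so the title/underline pair is recognized.
print(bool(match))

# A title followed by ordinary prose is not a section.
no_match = re.match(pattern, "Title\n" + "body text\n")
print(bool(no_match))
```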
{markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/LICENSE RENAMED
File without changes
{markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/markdown_analysis.egg-info/dependency_links.txt RENAMED
File without changes
{markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/markdown_analysis.egg-info/top_level.txt RENAMED
File without changes
{markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/mrkdwn_analysis/__init__.py RENAMED
File without changes
{markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/setup.cfg RENAMED
File without changes
{markdown_analysis-0.0.3 → markdown_analysis-0.0.4}/test/__init__.py RENAMED
File without changes