markdown-analysis 0.0.4__tar.gz → 0.0.5__tar.gz
- {markdown_analysis-0.0.4 → markdown_analysis-0.0.5}/LICENSE +0 -0
- markdown_analysis-0.0.5/PKG-INFO +137 -0
- {markdown_analysis-0.0.4 → markdown_analysis-0.0.5}/README.md +0 -0
- markdown_analysis-0.0.5/markdown_analysis.egg-info/PKG-INFO +137 -0
- {markdown_analysis-0.0.4 → markdown_analysis-0.0.5}/markdown_analysis.egg-info/SOURCES.txt +0 -0
- {markdown_analysis-0.0.4 → markdown_analysis-0.0.5}/markdown_analysis.egg-info/dependency_links.txt +0 -0
- {markdown_analysis-0.0.4 → markdown_analysis-0.0.5}/markdown_analysis.egg-info/requires.txt +1 -1
- {markdown_analysis-0.0.4 → markdown_analysis-0.0.5}/markdown_analysis.egg-info/top_level.txt +0 -0
- {markdown_analysis-0.0.4 → markdown_analysis-0.0.5}/mrkdwn_analysis/__init__.py +0 -0
- {markdown_analysis-0.0.4 → markdown_analysis-0.0.5}/mrkdwn_analysis/markdown_analyzer.py +6 -15
- {markdown_analysis-0.0.4 → markdown_analysis-0.0.5}/setup.cfg +4 -4
- {markdown_analysis-0.0.4 → markdown_analysis-0.0.5}/setup.py +1 -1
- {markdown_analysis-0.0.4 → markdown_analysis-0.0.5}/test/__init__.py +0 -0
- markdown_analysis-0.0.4/PKG-INFO +0 -140
- markdown_analysis-0.0.4/markdown_analysis.egg-info/PKG-INFO +0 -140
File without changes
markdown_analysis-0.0.5/PKG-INFO
ADDED
@@ -0,0 +1,137 @@
+Metadata-Version: 2.1
+Name: markdown_analysis
+Version: 0.0.5
+Summary: UNKNOWN
+Home-page: https://github.com/yannbanas/mrkdwn_analysis
+Author: yannbanas
+Author-email: yannbanas@gmail.com
+License: UNKNOWN
+Description: # mrkdwn_analysis
+
+`mrkdwn_analysis` is a Python library designed to analyze Markdown files. With its powerful parsing capabilities, it can extract and categorize various elements within a Markdown document, including headers, sections, links, images, blockquotes, code blocks, and lists. This makes it a valuable tool for anyone looking to parse Markdown content for data analysis, content generation, or for building other tools that utilize Markdown.
+
+## Features
+
+- File Loading: The MarkdownAnalyzer can load any given Markdown file provided through the file path.
+
+- Header Identification: The tool can extract all headers from the markdown file, ranging from H1 to H6 tags. This allows users to have a quick overview of the document's structure.
+
+- Section Identification: The analyzer can recognize different sections of the document. It defines a section as a block of text followed by a line composed solely of = or - characters.
+
+- Paragraph Identification: The tool can distinguish between regular text and other elements such as lists, headers, etc., thereby identifying all the paragraphs present in the document.
+
+- Blockquote Identification: The analyzer can identify and extract all blockquotes in the markdown file.
+
+- Code Block Identification: The tool can extract all code blocks defined in the document, allowing you to separate the programming code from the regular text easily.
+
+- List Identification: The analyzer can identify both ordered and unordered lists in the markdown file, providing information about the hierarchical structure of the points.
+
+- Table Identification: The tool can identify and extract tables from the markdown file, enabling users to separate and analyze tabular data quickly.
+
+- Link Identification and Validation: The analyzer can identify all links present in the markdown file, categorizing them into text and image links. Moreover, it can also verify if these links are valid or broken.
+
+- Todo Identification: The tool is capable of recognizing and extracting todos (tasks or action items) present in the document.
+
+- Element Counting: The analyzer can count the total number of a specific element type in the file. This can help in quantifying the extent of different elements in the document.
+
+- Word Counting: The tool can count the total number of words in the file, providing an estimate of the document's length.
+
+- Character Counting: The analyzer can count the total number of characters (excluding spaces) in the file, giving a detailed measure of the document's size.
+
+## Installation
+You can install `mrkdwn_analysis` from PyPI:
+
+```bash
+pip install mrkdwn_analysis
+```
+
+We hope `mrkdwn_analysis` helps you with all your Markdown analyzing needs!
+
+## Usage
+Using `mrkdwn_analysis` is simple. Just import the `MarkdownAnalyzer` class, create an instance with your Markdown file, and you're good to go!
+
+```python
+from mrkdwn_analysis import MarkdownAnalyzer
+
+analyzer = MarkdownAnalyzer("path/to/your/markdown.md")
+
+headers = analyzer.identify_headers()
+sections = analyzer.identify_sections()
+...
+```
+
+### Class MarkdownAnalyzer
+
+The `MarkdownAnalyzer` class is designed to analyze Markdown files. It has the ability to extract and categorize various elements of a Markdown document.
+
+### `__init__(self, file_path)`
+
+The constructor of the class. It opens the specified Markdown file and stores its content line by line.
+
+- `file_path`: the path of the Markdown file to analyze.
+
+### `identify_headers(self)`
+
+Analyzes the file and identifies all headers (from h1 to h6). Headers are returned as a dictionary where the key is "Header" and the value is a list of all headers found.
+
+### `identify_sections(self)`
+
+Analyzes the file and identifies all sections. Sections are defined as a block of text followed by a line composed solely of `=` or `-` characters. Sections are returned as a dictionary where the key is "Section" and the value is a list of all sections found.
+
+### `identify_paragraphs(self)`
+
+Analyzes the file and identifies all paragraphs. Paragraphs are defined as a block of text that is not a header, list, blockquote, etc. Paragraphs are returned as a dictionary where the key is "Paragraph" and the value is a list of all paragraphs found.
+
+### `identify_blockquotes(self)`
+
+Analyzes the file and identifies all blockquotes. Blockquotes are defined by a line starting with the `>` character. Blockquotes are returned as a dictionary where the key is "Blockquote" and the value is a list of all blockquotes found.
+
+### `identify_code_blocks(self)`
+
+Analyzes the file and identifies all code blocks. Code blocks are defined by a block of text surrounded by lines containing only the "```" text. Code blocks are returned as a dictionary where the key is "Code block" and the value is a list of all code blocks found.
+
+### `identify_ordered_lists(self)`
+
+Analyzes the file and identifies all ordered lists. Ordered lists are defined by lines starting with a number followed by a dot. Ordered lists are returned as a dictionary where the key is "Ordered list" and the value is a list of all ordered lists found.
+
+### `identify_unordered_lists(self)`
+
+Analyzes the file and identifies all unordered lists. Unordered lists are defined by lines starting with a `-`, `*`, or `+`. Unordered lists are returned as a dictionary where the key is "Unordered list" and the value is a list of all unordered lists found.
+
+### `identify_tables(self)`
+
+Analyzes the file and identifies all tables. Tables are defined by lines containing `|` to delimit cells and are separated by lines containing `-` to define the borders. Tables are returned as a dictionary where the key is "Table" and the value is a list of all tables found.
+
+### `identify_links(self)`
+
+Analyzes the file and identifies all links. Links are defined by the format `[text](url)`. Links are returned as a dictionary where the keys are "Text link" and "Image link" and the values are lists of all links found.
+
+### `check_links(self)`
+
+Checks all links identified by `identify_links` to see if they are broken (return a 404 error). Broken links are returned as a list, each item being a dictionary containing the line number, link text, and URL.
+
+### `identify_todos(self)`
+
+Analyzes the file and identifies all todos. Todos are defined by lines starting with `- [ ] `. Todos are returned as a list, each item being a dictionary containing the line number and todo text.
+
+### `count_elements(self, element_type)`
+
+Counts the total number of a specific element type in the file. The `element_type` should match the name of one of the identification methods (for example, "headers" for `identify_headers`). Returns the total number of elements of this type.
+
+### `count_words(self)`
+
+Counts the total number of words in the file. Returns the word count.
+
+### `count_characters(self)`
+
+Counts the total number of characters (excluding spaces) in the file. Returns the character count.
+
+## Contributions
+Contributions are always welcome! If you have a feature request, bug report, or just want to improve the code, feel free to create a pull request or open an issue.
+
+Platform: UNKNOWN
+Classifier: Development Status :: 2 - Pre-Alpha
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3.11
+Description-Content-Type: text/markdown
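Note on the `identify_links` description in the README text above: the documented return value is a dictionary keyed by "Text link" and "Image link". The following is only a minimal standalone sketch of that shape, not the package's implementation. The regular expression, the free function taking a list of lines, and the per-item layout (`line`, `text`, `url`) are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# An optional "!" marks an image; it is followed by [text](url).
LINK_PATTERN = re.compile(r'(!?)\[([^\]]*)\]\(([^)\s]+)\)')

def identify_links(lines):
    """Group inline links by kind, using the dictionary keys the README documents."""
    result = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        for bang, text, url in LINK_PATTERN.findall(line):
            key = "Image link" if bang else "Text link"
            # The item layout (line, text, url) is an assumption of this sketch.
            result[key].append({"line": number, "text": text, "url": url})
    return dict(result)

sample = [
    "See the [docs](https://example.com/docs).\n",
    "![logo](https://example.com/logo.png)\n",
]
links = identify_links(sample)
```

A dictionary built this way only contains a key when at least one link of that kind was found, so callers would want `links.get("Image link", [])` rather than direct indexing.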
File without changes
markdown_analysis-0.0.5/markdown_analysis.egg-info/PKG-INFO
ADDED
@@ -0,0 +1,137 @@
+Metadata-Version: 2.1
+Name: markdown-analysis
+Version: 0.0.5
+Summary: UNKNOWN
+Home-page: https://github.com/yannbanas/mrkdwn_analysis
+Author: yannbanas
+Author-email: yannbanas@gmail.com
+License: UNKNOWN
+Description: # mrkdwn_analysis
+
+`mrkdwn_analysis` is a Python library designed to analyze Markdown files. With its powerful parsing capabilities, it can extract and categorize various elements within a Markdown document, including headers, sections, links, images, blockquotes, code blocks, and lists. This makes it a valuable tool for anyone looking to parse Markdown content for data analysis, content generation, or for building other tools that utilize Markdown.
+
+## Features
+
+- File Loading: The MarkdownAnalyzer can load any given Markdown file provided through the file path.
+
+- Header Identification: The tool can extract all headers from the markdown file, ranging from H1 to H6 tags. This allows users to have a quick overview of the document's structure.
+
+- Section Identification: The analyzer can recognize different sections of the document. It defines a section as a block of text followed by a line composed solely of = or - characters.
+
+- Paragraph Identification: The tool can distinguish between regular text and other elements such as lists, headers, etc., thereby identifying all the paragraphs present in the document.
+
+- Blockquote Identification: The analyzer can identify and extract all blockquotes in the markdown file.
+
+- Code Block Identification: The tool can extract all code blocks defined in the document, allowing you to separate the programming code from the regular text easily.
+
+- List Identification: The analyzer can identify both ordered and unordered lists in the markdown file, providing information about the hierarchical structure of the points.
+
+- Table Identification: The tool can identify and extract tables from the markdown file, enabling users to separate and analyze tabular data quickly.
+
+- Link Identification and Validation: The analyzer can identify all links present in the markdown file, categorizing them into text and image links. Moreover, it can also verify if these links are valid or broken.
+
+- Todo Identification: The tool is capable of recognizing and extracting todos (tasks or action items) present in the document.
+
+- Element Counting: The analyzer can count the total number of a specific element type in the file. This can help in quantifying the extent of different elements in the document.
+
+- Word Counting: The tool can count the total number of words in the file, providing an estimate of the document's length.
+
+- Character Counting: The analyzer can count the total number of characters (excluding spaces) in the file, giving a detailed measure of the document's size.
+
+## Installation
+You can install `mrkdwn_analysis` from PyPI:
+
+```bash
+pip install mrkdwn_analysis
+```
+
+We hope `mrkdwn_analysis` helps you with all your Markdown analyzing needs!
+
+## Usage
+Using `mrkdwn_analysis` is simple. Just import the `MarkdownAnalyzer` class, create an instance with your Markdown file, and you're good to go!
+
+```python
+from mrkdwn_analysis import MarkdownAnalyzer
+
+analyzer = MarkdownAnalyzer("path/to/your/markdown.md")
+
+headers = analyzer.identify_headers()
+sections = analyzer.identify_sections()
+...
+```
+
+### Class MarkdownAnalyzer
+
+The `MarkdownAnalyzer` class is designed to analyze Markdown files. It has the ability to extract and categorize various elements of a Markdown document.
+
+### `__init__(self, file_path)`
+
+The constructor of the class. It opens the specified Markdown file and stores its content line by line.
+
+- `file_path`: the path of the Markdown file to analyze.
+
+### `identify_headers(self)`
+
+Analyzes the file and identifies all headers (from h1 to h6). Headers are returned as a dictionary where the key is "Header" and the value is a list of all headers found.
+
+### `identify_sections(self)`
+
+Analyzes the file and identifies all sections. Sections are defined as a block of text followed by a line composed solely of `=` or `-` characters. Sections are returned as a dictionary where the key is "Section" and the value is a list of all sections found.
+
+### `identify_paragraphs(self)`
+
+Analyzes the file and identifies all paragraphs. Paragraphs are defined as a block of text that is not a header, list, blockquote, etc. Paragraphs are returned as a dictionary where the key is "Paragraph" and the value is a list of all paragraphs found.
+
+### `identify_blockquotes(self)`
+
+Analyzes the file and identifies all blockquotes. Blockquotes are defined by a line starting with the `>` character. Blockquotes are returned as a dictionary where the key is "Blockquote" and the value is a list of all blockquotes found.
+
+### `identify_code_blocks(self)`
+
+Analyzes the file and identifies all code blocks. Code blocks are defined by a block of text surrounded by lines containing only the "```" text. Code blocks are returned as a dictionary where the key is "Code block" and the value is a list of all code blocks found.
+
+### `identify_ordered_lists(self)`
+
+Analyzes the file and identifies all ordered lists. Ordered lists are defined by lines starting with a number followed by a dot. Ordered lists are returned as a dictionary where the key is "Ordered list" and the value is a list of all ordered lists found.
+
+### `identify_unordered_lists(self)`
+
+Analyzes the file and identifies all unordered lists. Unordered lists are defined by lines starting with a `-`, `*`, or `+`. Unordered lists are returned as a dictionary where the key is "Unordered list" and the value is a list of all unordered lists found.
+
+### `identify_tables(self)`
+
+Analyzes the file and identifies all tables. Tables are defined by lines containing `|` to delimit cells and are separated by lines containing `-` to define the borders. Tables are returned as a dictionary where the key is "Table" and the value is a list of all tables found.
+
+### `identify_links(self)`
+
+Analyzes the file and identifies all links. Links are defined by the format `[text](url)`. Links are returned as a dictionary where the keys are "Text link" and "Image link" and the values are lists of all links found.
+
+### `check_links(self)`
+
+Checks all links identified by `identify_links` to see if they are broken (return a 404 error). Broken links are returned as a list, each item being a dictionary containing the line number, link text, and URL.
+
+### `identify_todos(self)`
+
+Analyzes the file and identifies all todos. Todos are defined by lines starting with `- [ ] `. Todos are returned as a list, each item being a dictionary containing the line number and todo text.
+
+### `count_elements(self, element_type)`
+
+Counts the total number of a specific element type in the file. The `element_type` should match the name of one of the identification methods (for example, "headers" for `identify_headers`). Returns the total number of elements of this type.
+
+### `count_words(self)`
+
+Counts the total number of words in the file. Returns the word count.
+
+### `count_characters(self)`
+
+Counts the total number of characters (excluding spaces) in the file. Returns the character count.
+
+## Contributions
+Contributions are always welcome! If you have a feature request, bug report, or just want to improve the code, feel free to create a pull request or open an issue.
+
+Platform: UNKNOWN
+Classifier: Development Status :: 2 - Pre-Alpha
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3.11
+Description-Content-Type: text/markdown
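Note on the `Name` field: the two added PKG-INFO files record the project as `markdown_analysis` and `markdown-analysis` respectively. Under the name normalization rule from PEP 503, runs of `-`, `_`, and `.` are treated as equivalent, so both spellings should refer to the same project. The sketch below applies that rule; the helper name `normalize_name` is chosen here for illustration and is not part of this package.

```python
import re

def normalize_name(name):
    """Normalize a project name following the PEP 503 rule."""
    # Collapse any run of "-", "_" or "." into a single "-" and lowercase.
    return re.sub(r"[-_.]+", "-", name).lower()

top_level_name = normalize_name("markdown_analysis")
egg_info_name = normalize_name("markdown-analysis")
```

Both calls yield `markdown-analysis`, which suggests the differing `Name` lines are a cosmetic difference between the two metadata files rather than a rename.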
File without changes
|
{markdown_analysis-0.0.4 → markdown_analysis-0.0.5}/markdown_analysis.egg-info/dependency_links.txt
RENAMED
File without changes
|
{markdown_analysis-0.0.4 → markdown_analysis-0.0.5}/markdown_analysis.egg-info/top_level.txt
RENAMED
File without changes
|
File without changes
{markdown_analysis-0.0.4 → markdown_analysis-0.0.5}/mrkdwn_analysis/markdown_analyzer.py
RENAMED
@@ -3,8 +3,8 @@ import requests
 from collections import defaultdict, Counter
 
 class MarkdownAnalyzer:
-    def __init__(self, file_path):
-        with open(file_path, 'r') as file:
+    def __init__(self, file_path, encoding='utf-8'):
+        with open(file_path, 'r', encoding=encoding) as file:
             self.lines = file.readlines()
 
     def identify_headers(self):
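Note on the constructor change: without an explicit `encoding`, `open()` in text mode falls back to a platform-dependent default, so a UTF-8 Markdown file with non-ASCII characters can fail to decode (or decode incorrectly) on systems whose default is not UTF-8. The sketch below reproduces only the file-reading part of the new constructor in a hypothetical free function, `read_lines`, to show the behavior in isolation.

```python
import os
import tempfile

def read_lines(file_path, encoding='utf-8'):
    """Hypothetical stand-in for the file handling in the updated constructor."""
    with open(file_path, 'r', encoding=encoding) as file:
        return file.readlines()

# Write a small UTF-8 file containing non-ASCII text, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.md")
    with open(path, 'w', encoding='utf-8') as file:
        file.write("# Café\n\nRésumé notes\n")
    lines = read_lines(path)
```

Because `encoding` is a keyword argument with a default, existing callers that pass only `file_path` keep working, while callers with files in another encoding can pass it explicitly.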
@@ -162,19 +162,10 @@ class MarkdownAnalyzer:
 
     def identify_tables(self):
         result = defaultdict(list)
-        table_pattern = re.compile(r'
-
-
-
-            r' {0,3}(?P<nptable_align>[-:]+ *\|[-| :]*)\n' +
-            r'(?P<nptable_body>(?:.*\|.*(?:\n|$))*)\n*')
-
-        text = "".join(self.lines)
-        matches_table = re.findall(table_pattern, text)
-        matches_nptable = re.findall(nptable_pattern, text)
-        for match in matches_table + matches_nptable:
-            result["Table"].append(match)
-
+        table_pattern = re.compile(r'^\|.*\|$', re.MULTILINE)
+        table_rows = table_pattern.findall("".join(self.lines))
+        for table_row in table_rows:
+            result["Table"].append(table_row.strip().split("|"))
         return dict(result)
 
     def identify_links(self):
{markdown_analysis-0.0.4 → markdown_analysis-0.0.5}/setup.cfg
RENAMED
@@ -1,4 +1,4 @@
-[egg_info]
-tag_build =
-tag_date = 0
-
+[egg_info]
+tag_build =
+tag_date = 0
+
File without changes
|
markdown_analysis-0.0.4/PKG-INFO
DELETED
@@ -1,140 +0,0 @@
-Metadata-Version: 2.1
-Name: markdown_analysis
-Version: 0.0.4
-Summary: UNKNOWN
-Home-page: https://github.com/yannbanas/mrkdwn_analysis
-Author: yannbanas
-Author-email: yannbanas@gmail.com
-License: UNKNOWN
-Platform: UNKNOWN
-Classifier: Development Status :: 2 - Pre-Alpha
-Classifier: Intended Audience :: Developers
-Classifier: License :: OSI Approved :: MIT License
-Classifier: Programming Language :: Python :: 3.11
-Description-Content-Type: text/markdown
-License-File: LICENSE
-
-# mrkdwn_analysis
-
-`mrkdwn_analysis` is a Python library designed to analyze Markdown files. With its powerful parsing capabilities, it can extract and categorize various elements within a Markdown document, including headers, sections, links, images, blockquotes, code blocks, and lists. This makes it a valuable tool for anyone looking to parse Markdown content for data analysis, content generation, or for building other tools that utilize Markdown.
-
-## Features
-
-- File Loading: The MarkdownAnalyzer can load any given Markdown file provided through the file path.
-
-- Header Identification: The tool can extract all headers from the markdown file, ranging from H1 to H6 tags. This allows users to have a quick overview of the document's structure.
-
-- Section Identification: The analyzer can recognize different sections of the document. It defines a section as a block of text followed by a line composed solely of = or - characters.
-
-- Paragraph Identification: The tool can distinguish between regular text and other elements such as lists, headers, etc., thereby identifying all the paragraphs present in the document.
-
-- Blockquote Identification: The analyzer can identify and extract all blockquotes in the markdown file.
-
-- Code Block Identification: The tool can extract all code blocks defined in the document, allowing you to separate the programming code from the regular text easily.
-
-- List Identification: The analyzer can identify both ordered and unordered lists in the markdown file, providing information about the hierarchical structure of the points.
-
-- Table Identification: The tool can identify and extract tables from the markdown file, enabling users to separate and analyze tabular data quickly.
-
-- Link Identification and Validation: The analyzer can identify all links present in the markdown file, categorizing them into text and image links. Moreover, it can also verify if these links are valid or broken.
-
-- Todo Identification: The tool is capable of recognizing and extracting todos (tasks or action items) present in the document.
-
-- Element Counting: The analyzer can count the total number of a specific element type in the file. This can help in quantifying the extent of different elements in the document.
-
-- Word Counting: The tool can count the total number of words in the file, providing an estimate of the document's length.
-
-- Character Counting: The analyzer can count the total number of characters (excluding spaces) in the file, giving a detailed measure of the document's size.
-
-## Installation
-You can install `mrkdwn_analysis` from PyPI:
-
-```bash
-pip install mrkdwn_analysis
-```
-
-We hope `mrkdwn_analysis` helps you with all your Markdown analyzing needs!
-
-## Usage
-Using `mrkdwn_analysis` is simple. Just import the `MarkdownAnalyzer` class, create an instance with your Markdown file, and you're good to go!
-
-```python
-from mrkdwn_analysis import MarkdownAnalyzer
-
-analyzer = MarkdownAnalyzer("path/to/your/markdown.md")
-
-headers = analyzer.identify_headers()
-sections = analyzer.identify_sections()
-...
-```
-
-### Class MarkdownAnalyzer
-
-The `MarkdownAnalyzer` class is designed to analyze Markdown files. It has the ability to extract and categorize various elements of a Markdown document.
-
-### `__init__(self, file_path)`
-
-The constructor of the class. It opens the specified Markdown file and stores its content line by line.
-
-- `file_path`: the path of the Markdown file to analyze.
-
-### `identify_headers(self)`
-
-Analyzes the file and identifies all headers (from h1 to h6). Headers are returned as a dictionary where the key is "Header" and the value is a list of all headers found.
-
-### `identify_sections(self)`
-
-Analyzes the file and identifies all sections. Sections are defined as a block of text followed by a line composed solely of `=` or `-` characters. Sections are returned as a dictionary where the key is "Section" and the value is a list of all sections found.
-
-### `identify_paragraphs(self)`
-
-Analyzes the file and identifies all paragraphs. Paragraphs are defined as a block of text that is not a header, list, blockquote, etc. Paragraphs are returned as a dictionary where the key is "Paragraph" and the value is a list of all paragraphs found.
-
-### `identify_blockquotes(self)`
-
-Analyzes the file and identifies all blockquotes. Blockquotes are defined by a line starting with the `>` character. Blockquotes are returned as a dictionary where the key is "Blockquote" and the value is a list of all blockquotes found.
-
-### `identify_code_blocks(self)`
-
-Analyzes the file and identifies all code blocks. Code blocks are defined by a block of text surrounded by lines containing only the "```" text. Code blocks are returned as a dictionary where the key is "Code block" and the value is a list of all code blocks found.
-
-### `identify_ordered_lists(self)`
-
-Analyzes the file and identifies all ordered lists. Ordered lists are defined by lines starting with a number followed by a dot. Ordered lists are returned as a dictionary where the key is "Ordered list" and the value is a list of all ordered lists found.
-
-### `identify_unordered_lists(self)`
-
-Analyzes the file and identifies all unordered lists. Unordered lists are defined by lines starting with a `-`, `*`, or `+`. Unordered lists are returned as a dictionary where the key is "Unordered list" and the value is a list of all unordered lists found.
-
-### `identify_tables(self)`
-
-Analyzes the file and identifies all tables. Tables are defined by lines containing `|` to delimit cells and are separated by lines containing `-` to define the borders. Tables are returned as a dictionary where the key is "Table" and the value is a list of all tables found.
-
-### `identify_links(self)`
-
-Analyzes the file and identifies all links. Links are defined by the format `[text](url)`. Links are returned as a dictionary where the keys are "Text link" and "Image link" and the values are lists of all links found.
-
-### `check_links(self)`
-
-Checks all links identified by `identify_links` to see if they are broken (return a 404 error). Broken links are returned as a list, each item being a dictionary containing the line number, link text, and URL.
-
-### `identify_todos(self)`
-
-Analyzes the file and identifies all todos. Todos are defined by lines starting with `- [ ] `. Todos are returned as a list, each item being a dictionary containing the line number and todo text.
-
-### `count_elements(self, element_type)`
-
-Counts the total number of a specific element type in the file. The `element_type` should match the name of one of the identification methods (for example, "headers" for `identify_headers`). Returns the total number of elements of this type.
-
-### `count_words(self)`
-
-Counts the total number of words in the file. Returns the word count.
-
-### `count_characters(self)`
-
-Counts the total number of characters (excluding spaces) in the file. Returns the character count.
-
-## Contributions
-Contributions are always welcome! If you have a feature request, bug report, or just want to improve the code, feel free to create a pull request or open an issue.
-
-
markdown_analysis-0.0.4/markdown_analysis.egg-info/PKG-INFO
DELETED
@@ -1,140 +0,0 @@
-Metadata-Version: 2.1
-Name: markdown-analysis
-Version: 0.0.4
-Summary: UNKNOWN
-Home-page: https://github.com/yannbanas/mrkdwn_analysis
-Author: yannbanas
-Author-email: yannbanas@gmail.com
-License: UNKNOWN
-Platform: UNKNOWN
-Classifier: Development Status :: 2 - Pre-Alpha
-Classifier: Intended Audience :: Developers
-Classifier: License :: OSI Approved :: MIT License
-Classifier: Programming Language :: Python :: 3.11
-Description-Content-Type: text/markdown
-License-File: LICENSE
-
-# mrkdwn_analysis
-
-`mrkdwn_analysis` is a Python library designed to analyze Markdown files. With its powerful parsing capabilities, it can extract and categorize various elements within a Markdown document, including headers, sections, links, images, blockquotes, code blocks, and lists. This makes it a valuable tool for anyone looking to parse Markdown content for data analysis, content generation, or for building other tools that utilize Markdown.
-
-## Features
-
-- File Loading: The MarkdownAnalyzer can load any given Markdown file provided through the file path.
-
-- Header Identification: The tool can extract all headers from the markdown file, ranging from H1 to H6 tags. This allows users to have a quick overview of the document's structure.
-
-- Section Identification: The analyzer can recognize different sections of the document. It defines a section as a block of text followed by a line composed solely of = or - characters.
-
-- Paragraph Identification: The tool can distinguish between regular text and other elements such as lists, headers, etc., thereby identifying all the paragraphs present in the document.
-
-- Blockquote Identification: The analyzer can identify and extract all blockquotes in the markdown file.
-
-- Code Block Identification: The tool can extract all code blocks defined in the document, allowing you to separate the programming code from the regular text easily.
-
-- List Identification: The analyzer can identify both ordered and unordered lists in the markdown file, providing information about the hierarchical structure of the points.
-
-- Table Identification: The tool can identify and extract tables from the markdown file, enabling users to separate and analyze tabular data quickly.
-
-- Link Identification and Validation: The analyzer can identify all links present in the markdown file, categorizing them into text and image links. Moreover, it can also verify if these links are valid or broken.
-
-- Todo Identification: The tool is capable of recognizing and extracting todos (tasks or action items) present in the document.
-
-- Element Counting: The analyzer can count the total number of a specific element type in the file. This can help in quantifying the extent of different elements in the document.
-
-- Word Counting: The tool can count the total number of words in the file, providing an estimate of the document's length.
-
-- Character Counting: The analyzer can count the total number of characters (excluding spaces) in the file, giving a detailed measure of the document's size.
-
-## Installation
-You can install `mrkdwn_analysis` from PyPI:
-
-```bash
-pip install mrkdwn_analysis
-```
-
-We hope `mrkdwn_analysis` helps you with all your Markdown analyzing needs!
-
-## Usage
-Using `mrkdwn_analysis` is simple. Just import the `MarkdownAnalyzer` class, create an instance with your Markdown file, and you're good to go!
-
-```python
-from mrkdwn_analysis import MarkdownAnalyzer
-
-analyzer = MarkdownAnalyzer("path/to/your/markdown.md")
-
-headers = analyzer.identify_headers()
-sections = analyzer.identify_sections()
-...
-```
-
-### Class MarkdownAnalyzer
-
-The `MarkdownAnalyzer` class is designed to analyze Markdown files. It has the ability to extract and categorize various elements of a Markdown document.
|
74
|
-
|
75
|
-
### `__init__(self, file_path)`
|
76
|
-
|
77
|
-
The constructor of the class. It opens the specified Markdown file and stores its content line by line.
|
78
|
-
|
79
|
-
- `file_path`: the path of the Markdown file to analyze.
|
80
|
-
|
81
|
-
### `identify_headers(self)`
|
82
|
-
|
83
|
-
Analyzes the file and identifies all headers (from h1 to h6). Headers are returned as a dictionary where the key is "Header" and the value is a list of all headers found.
|
84
|
-
|
85
|
-
### `identify_sections(self)`
|
86
|
-
|
87
|
-
Analyzes the file and identifies all sections. Sections are defined as a block of text followed by a line composed solely of `=` or `-` characters. Sections are returned as a dictionary where the key is "Section" and the value is a list of all sections found.
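
The section rule described above can be illustrated with a minimal sketch. This is not the library's implementation: the helper name `find_setext_sections` and the sample input are hypothetical, and the only assumptions are the ones stated in the description (a non-blank line followed by a line made up solely of `=` or `-` characters, returned under the key "Section").

```python
import re

# Hypothetical helper, not part of mrkdwn_analysis.
# A line composed solely of '=' or '-' characters marks the line
# before it as a section title.
UNDERLINE = re.compile(r"^[=-]+\s*$")


def find_setext_sections(lines):
    """Return {"Section": [...]} for every underlined title in `lines`."""
    sections = []
    # Walk over consecutive (previous, current) line pairs.
    for previous, current in zip(lines, lines[1:]):
        if previous.strip() and UNDERLINE.match(current):
            sections.append(previous.strip())
    return {"Section": sections}


sample = ["Title", "=====", "", "Some text", "Subtitle", "--------", "More text"]
print(find_setext_sections(sample))
```

Note that a rule this simple also matches a paragraph line that happens to sit directly above a `---` horizontal rule, so a real analyzer would likely need extra checks.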
|
88
|
-
|
89
|
-
### `identify_paragraphs(self)`
|
90
|
-
|
91
|
-
Analyzes the file and identifies all paragraphs. Paragraphs are defined as a block of text that is not a header, list, blockquote, etc. Paragraphs are returned as a dictionary where the key is "Paragraph" and the value is a list of all paragraphs found.
|
92
|
-
|
93
|
-
### `identify_blockquotes(self)`
|
94
|
-
|
95
|
-
Analyzes the file and identifies all blockquotes. Blockquotes are defined by a line starting with the `>` character. Blockquotes are returned as a dictionary where the key is "Blockquote" and the value is a list of all blockquotes found.
|
96
|
-
|
97
|
-
### `identify_code_blocks(self)`
|
98
|
-
|
99
|
-
Analyzes the file and identifies all code blocks. Code blocks are defined by a block of text surrounded by lines containing only the "```" text. Code blocks are returned as a dictionary where the key is "Code block" and the value is a list of all code blocks found.
|
100
|
-
|
101
|
-
### `identify_ordered_lists(self)`
|
102
|
-
|
103
|
-
Analyzes the file and identifies all ordered lists. Ordered lists are defined by lines starting with a number followed by a dot. Ordered lists are returned as a dictionary where the key is "Ordered list" and the value is a list of all ordered lists found.
|
104
|
-
|
105
|
-
### `identify_unordered_lists(self)`
|
106
|
-
|
107
|
-
Analyzes the file and identifies all unordered lists. Unordered lists are defined by lines starting with a `-`, `*`, or `+`. Unordered lists are returned as a dictionary where the key is "Unordered list" and the value is a list of all unordered lists found.
|
108
|
-
|
109
|
-
### `identify_tables(self)`
|
110
|
-
|
111
|
-
Analyzes the file and identifies all tables. Tables are defined by lines containing `|` to delimit cells and are separated by lines containing `-` to define the borders. Tables are returned as a dictionary where the key is "Table" and the value is a list of all tables found.
|
112
|
-
|
113
|
-
### `identify_links(self)`
|
114
|
-
|
115
|
-
Analyzes the file and identifies all links. Links are defined by the format `[text](url)`. Links are returned as a dictionary where the keys are "Text link" and "Image link" and the values are lists of all links found.
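
As a rough illustration of the `[text](url)` format and the two result keys named above, the following sketch separates text links from image links with a regular expression. The helper name `find_links`, the regular expression, and the per-link fields (`line`, `text`, `url`) are assumptions made for this example, not the library's actual code.

```python
import re

# Hypothetical pattern: an optional leading '!' marks an image link,
# followed by [text](url). Titles and nested brackets are not handled.
LINK = re.compile(r"(!?)\[([^\]]*)\]\(([^)\s]+)\)")


def find_links(lines):
    """Group links by kind, recording the 1-based line number of each."""
    found = {"Text link": [], "Image link": []}
    for number, line in enumerate(lines, start=1):
        for bang, text, url in LINK.findall(line):
            key = "Image link" if bang else "Text link"
            found[key].append({"line": number, "text": text, "url": url})
    return found


print(find_links(["See [docs](https://example.com/docs) and ![logo](logo.png)"]))
```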
|
116
|
-
|
117
|
-
### `check_links(self)`
|
118
|
-
|
119
|
-
Checks all links identified by `identify_links` to see whether they are broken, that is, whether requesting them returns a 404 error. Broken links are returned as a list, each item being a dictionary containing the line number, link text, and URL.
|
120
|
-
|
121
|
-
### `identify_todos(self)`
|
122
|
-
|
123
|
-
Analyzes the file and identifies all todos. Todos are defined by lines starting with `- [ ] `. Todos are returned as a list, each item being a dictionary containing the line number and todo text.
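
The todo rule above (lines starting with `- [ ] `, each reported with its line number and text) might be sketched as follows. The helper name `find_todos` and the dictionary keys `line` and `text` are hypothetical; the description does not state the actual key names.

```python
def find_todos(lines):
    """Collect unchecked todo items with their 1-based line numbers.

    Hypothetical helper, not part of mrkdwn_analysis; the key names
    "line" and "text" are assumptions for illustration.
    """
    prefix = "- [ ] "
    todos = []
    for number, line in enumerate(lines, start=1):
        if line.startswith(prefix):
            todos.append({"line": number, "text": line[len(prefix):].strip()})
    return todos


sample = ["# Tasks", "- [ ] write docs", "- [x] done", "- [ ] ship"]
print(find_todos(sample))
```

Checked items such as `- [x] done` are skipped here because they do not start with the `- [ ] ` prefix.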
|
124
|
-
|
125
|
-
### `count_elements(self, element_type)`
|
126
|
-
|
127
|
-
Counts the total number of a specific element type in the file. The `element_type` should match the name of one of the identification methods (for example, "headers" for `identify_headers`). Returns the total number of elements of this type.
|
128
|
-
|
129
|
-
### `count_words(self)`
|
130
|
-
|
131
|
-
Counts the total number of words in the file. Returns the word count.
|
132
|
-
|
133
|
-
### `count_characters(self)`
|
134
|
-
|
135
|
-
Counts the total number of characters (excluding spaces) in the file. Returns the character count.
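
The two counting rules can be expressed in a few lines. This is a sketch under assumptions, not the library's code: the standalone functions below take the text as an argument (the documented methods take no argument), words are split on whitespace, and "excluding spaces" is interpreted as excluding all whitespace, including newlines and tabs.

```python
def count_words(text):
    """Count whitespace-separated words (assumed definition of a word)."""
    return len(text.split())


def count_characters(text):
    """Count characters, skipping every whitespace character.

    Assumption: "excluding spaces" covers newlines and tabs as well.
    """
    return sum(1 for ch in text if not ch.isspace())


sample = "Hello  world\nfoo"
print(count_words(sample), count_characters(sample))
```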
|
136
|
-
|
137
|
-
## Contributions
|
138
|
-
Contributions are always welcome! If you have a feature request, bug report, or just want to improve the code, feel free to create a pull request or open an issue.
|
139
|
-
|
140
|
-
|