markdown-analysis 0.0.3__tar.gz → 0.0.4__tar.gz

@@ -0,0 +1,140 @@
1
+ Metadata-Version: 2.1
2
+ Name: markdown_analysis
3
+ Version: 0.0.4
4
+ Summary: UNKNOWN
5
+ Home-page: https://github.com/yannbanas/mrkdwn_analysis
6
+ Author: yannbanas
7
+ Author-email: yannbanas@gmail.com
8
+ License: UNKNOWN
9
+ Platform: UNKNOWN
10
+ Classifier: Development Status :: 2 - Pre-Alpha
11
+ Classifier: Intended Audience :: Developers
12
+ Classifier: License :: OSI Approved :: MIT License
13
+ Classifier: Programming Language :: Python :: 3.11
14
+ Description-Content-Type: text/markdown
15
+ License-File: LICENSE
16
+
17
+ # mrkdwn_analysis
18
+
19
+ `mrkdwn_analysis` is a Python library designed to analyze Markdown files. With its powerful parsing capabilities, it can extract and categorize various elements within a Markdown document, including headers, sections, links, images, blockquotes, code blocks, and lists. This makes it a valuable tool for anyone looking to parse Markdown content for data analysis, content generation, or for building other tools that utilize Markdown.
20
+
21
+ ## Features
22
+
23
+ - File Loading: The MarkdownAnalyzer can load any given Markdown file provided through the file path.
24
+
25
+ - Header Identification: The tool can extract all headers from the markdown file, ranging from H1 to H6 tags. This allows users to have a quick overview of the document's structure.
26
+
27
+ - Section Identification: The analyzer can recognize different sections of the document. It defines a section as a block of text followed by a line composed solely of = or - characters.
28
+
29
+ - Paragraph Identification: The tool can distinguish between regular text and other elements such as lists, headers, etc., thereby identifying all the paragraphs present in the document.
30
+
31
+ - Blockquote Identification: The analyzer can identify and extract all blockquotes in the markdown file.
32
+
33
+ - Code Block Identification: The tool can extract all code blocks defined in the document, allowing you to separate the programming code from the regular text easily.
34
+
35
+ - List Identification: The analyzer can identify both ordered and unordered lists in the markdown file, providing information about the hierarchical structure of the points.
36
+
37
+ - Table Identification: The tool can identify and extract tables from the markdown file, enabling users to separate and analyze tabular data quickly.
38
+
39
+ - Link Identification and Validation: The analyzer can identify all links present in the markdown file, categorizing them into text and image links. Moreover, it can also verify if these links are valid or broken.
40
+
41
+ - Todo Identification: The tool is capable of recognizing and extracting todos (tasks or action items) present in the document.
42
+
43
+ - Element Counting: The analyzer can count the total number of a specific element type in the file. This can help in quantifying the extent of different elements in the document.
44
+
45
+ - Word Counting: The tool can count the total number of words in the file, providing an estimate of the document's length.
46
+
47
+ - Character Counting: The analyzer can count the total number of characters (excluding spaces) in the file, giving a detailed measure of the document's size.
48
+
49
+ ## Installation
50
+ You can install `mrkdwn_analysis` from PyPI:
51
+
52
+ ```bash
53
+ pip install mrkdwn_analysis
54
+ ```
55
+
56
+ We hope `mrkdwn_analysis` helps you with all your Markdown analyzing needs!
57
+
58
+ ## Usage
59
+ Using `mrkdwn_analysis` is simple. Just import the `MarkdownAnalyzer` class, create an instance with your Markdown file, and you're good to go!
60
+
61
+ ```python
62
+ from mrkdwn_analysis import MarkdownAnalyzer
63
+
64
+ analyzer = MarkdownAnalyzer("path/to/your/markdown.md")
65
+
66
+ headers = analyzer.identify_headers()
67
+ sections = analyzer.identify_sections()
68
+ ...
69
+ ```
70
+
71
+ ### Class MarkdownAnalyzer
72
+
73
+ The `MarkdownAnalyzer` class is designed to analyze Markdown files. It has the ability to extract and categorize various elements of a Markdown document.
74
+
75
+ ### `__init__(self, file_path)`
76
+
77
+ The constructor of the class. It opens the specified Markdown file and stores its content line by line.
78
+
79
+ - `file_path`: the path of the Markdown file to analyze.
80
+
81
+ ### `identify_headers(self)`
82
+
83
+ Analyzes the file and identifies all headers (from h1 to h6). Headers are returned as a dictionary where the key is "Header" and the value is a list of all headers found.
84
+
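The ATX-header matching described above can be sketched with the same kind of regex the library uses internally. This is a minimal standalone sketch (the `extract_headers` helper and sample `doc` are illustrative, not part of the package):

```python
import re

# ATX headers: one to six '#' characters followed by a space and the title.
HEADER_PATTERN = re.compile(r'^(#{1,6})\s(.*)')

def extract_headers(lines):
    """Return every ATX header found, mirroring the {"Header": [...]} shape."""
    headers = []
    for line in lines:
        match = HEADER_PATTERN.match(line)
        if match:
            headers.append(match.group(2).strip())
    return {"Header": headers} if headers else {}

doc = ["# Title", "Some text", "## Section A", "### Sub"]
print(extract_headers(doc))  # → {'Header': ['Title', 'Section A', 'Sub']}
```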
85
+ ### `identify_sections(self)`
86
+
87
+ Analyzes the file and identifies all sections. Sections are defined as a block of text followed by a line composed solely of `=` or `-` characters. Sections are returned as a dictionary where the key is "Section" and the value is a list of all sections found.
88
+
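Setext-style section detection of the kind described here can be sketched as follows (an illustrative sketch assuming a two-character minimum for the underline; `extract_sections` is a hypothetical helper, not the package's exact code):

```python
import re

# A section underline is a line made solely of '=' or '-' characters.
UNDERLINE = re.compile(r'^[=-]{2,}\s*$')

def extract_sections(lines):
    """A non-empty line counts as a section title when the next line is all '=' or '-'."""
    sections = []
    for i in range(len(lines) - 1):
        if lines[i].strip() and UNDERLINE.match(lines[i + 1].strip()):
            sections.append(lines[i].strip())
    return {"Section": sections} if sections else {}

doc = ["Intro", "=====", "body text", "Details", "-------", "more text"]
print(extract_sections(doc))  # → {'Section': ['Intro', 'Details']}
```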
89
+ ### `identify_paragraphs(self)`
90
+
91
+ Analyzes the file and identifies all paragraphs. Paragraphs are defined as a block of text that is not a header, list, blockquote, etc. Paragraphs are returned as a dictionary where the key is "Paragraph" and the value is a list of all paragraphs found.
92
+
93
+ ### `identify_blockquotes(self)`
94
+
95
+ Analyzes the file and identifies all blockquotes. Blockquotes are defined by a line starting with the `>` character. Blockquotes are returned as a dictionary where the key is "Blockquote" and the value is a list of all blockquotes found.
96
+
97
+ ### `identify_code_blocks(self)`
98
+
99
+ Analyzes the file and identifies all code blocks. Code blocks are defined by a block of text surrounded by lines containing only the "```" text. Code blocks are returned as a dictionary where the key is "Code block" and the value is a list of all code blocks found.
100
+
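Fence-delimited extraction of this kind can be sketched with a simple open/close toggle (a minimal sketch; the library's own implementation may differ in detail):

```python
def extract_code_blocks(lines):
    """Collect text between pairs of ``` fences, fences included."""
    blocks, current, in_block = [], [], False
    for line in lines:
        if line.strip().startswith("```"):
            current.append(line.strip())
            if in_block:  # closing fence: flush the finished block
                blocks.append("\n".join(current))
                current = []
            in_block = not in_block
        elif in_block:
            current.append(line)
    return {"Code block": blocks} if blocks else {}

doc = ["text", "```python", "print('hi')", "```", "more text"]
print(extract_code_blocks(doc))
```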
101
+ ### `identify_ordered_lists(self)`
102
+
103
+ Analyzes the file and identifies all ordered lists. Ordered lists are defined by lines starting with a number followed by a dot. Ordered lists are returned as a dictionary where the key is "Ordered list" and the value is a list of all ordered lists found.
104
+
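The "number followed by a dot" rule, plus the grouping of consecutive items into one list, can be sketched like this (a standalone sketch; `extract_ordered_lists` is illustrative, not the package API):

```python
import re

# An ordered-list item starts with digits, a dot, and a space.
ORDERED_ITEM = re.compile(r'^\s*\d+\.\s')

def extract_ordered_lists(lines):
    """Group consecutive 'N. item' lines into separate lists."""
    lists, items = [], []
    for line in lines:
        if ORDERED_ITEM.match(line):
            items.append(line.strip())
        elif items:  # any non-item line ends the current list
            lists.append(items)
            items = []
    if items:
        lists.append(items)
    return {"Ordered list": lists} if lists else {}

doc = ["1. first", "2. second", "", "1. other"]
print(extract_ordered_lists(doc))  # → {'Ordered list': [['1. first', '2. second'], ['1. other']]}
```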
105
+ ### `identify_unordered_lists(self)`
106
+
107
+ Analyzes the file and identifies all unordered lists. Unordered lists are defined by lines starting with a `-`, `*`, or `+`. Unordered lists are returned as a dictionary where the key is "Unordered list" and the value is a list of all unordered lists found.
108
+
109
+ ### `identify_tables(self)`
110
+
111
+ Analyzes the file and identifies all tables. Tables are defined by lines containing `|` to delimit cells and are separated by lines containing `-` to define the borders. Tables are returned as a dictionary where the key is "Table" and the value is a list of all tables found.
112
+
113
+ ### `identify_links(self)`
114
+
115
+ Analyzes the file and identifies all links. Links are defined by the format `[text](url)`. Links are returned as a dictionary where the keys are "Text link" and "Image link" and the values are lists of all links found.
116
+
117
+ ### `check_links(self)`
118
+
119
+ Checks all links identified by `identify_links` to see if they are broken (the request fails or returns a non-200 status). Broken links are returned as a list, each item being a dictionary containing the line number, link text, and URL.
120
+
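The two steps involved — extracting `[text](url)` occurrences, then probing each URL — can be sketched as below. The package itself depends on `requests`; this dependency-free sketch uses the standard library's `urllib` instead, and `find_links`/`is_broken` are hypothetical helper names:

```python
import re
import urllib.request
import urllib.error

LINK_PATTERN = re.compile(r'\[([^\]]+)\]\(([^)]+)\)')

def find_links(lines):
    """Yield (line_number, text, url) for every [text](url) occurrence."""
    for i, line in enumerate(lines, start=1):
        for text, url in LINK_PATTERN.findall(line):
            yield i, text, url

def is_broken(url, timeout=3):
    """HEAD-style probe; a failed request or a 4xx/5xx status counts as broken."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status >= 400
    except (urllib.error.URLError, ValueError):
        return True

doc = ["See [the repo](https://github.com/yannbanas/mrkdwn_analysis)."]
print(list(find_links(doc)))  # → [(1, 'the repo', 'https://github.com/yannbanas/mrkdwn_analysis')]
```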
121
+ ### `identify_todos(self)`
122
+
123
+ Analyzes the file and identifies all todos. Todos are defined by lines starting with `- [ ] `. Todos are returned as a list, each item being a dictionary containing the line number and todo text.
124
+
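The unchecked-checkbox rule above can be sketched in a few lines (an illustrative sketch mirroring the documented return shape; `extract_todos` is not the package API):

```python
import re

# An open todo is a line starting with '- [ ] ' followed by the task text.
TODO_PATTERN = re.compile(r'^-\s\[ \]\s(.*)')

def extract_todos(lines):
    """Return {'line': n, 'text': ...} for every unchecked '- [ ] task' line."""
    return [
        {"line": i, "text": match.group(1)}
        for i, line in enumerate(lines, start=1)
        if (match := TODO_PATTERN.match(line))
    ]

doc = ["- [ ] write docs", "- [x] done already", "plain text"]
print(extract_todos(doc))  # → [{'line': 1, 'text': 'write docs'}]
```

Note that a completed task (`- [x]`) deliberately does not match.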
125
+ ### `count_elements(self, element_type)`
126
+
127
+ Counts the total number of a specific element type in the file. The `element_type` should match the name of one of the identification methods (for example, "headers" for `identify_headers`). Returns the total number of elements of this type.
128
+
129
+ ### `count_words(self)`
130
+
131
+ Counts the total number of words in the file. Returns the word count.
132
+
133
+ ### `count_characters(self)`
134
+
135
+ Counts the total number of characters (excluding spaces) in the file. Returns the character count.
136
+
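The two counting rules — whitespace-separated words, and characters with spaces excluded — can be sketched as follows (standalone helpers for illustration):

```python
def count_words(lines):
    """Whitespace-separated token count across the whole document."""
    return sum(len(line.split()) for line in lines)

def count_characters(lines):
    """Character count with all whitespace excluded."""
    return sum(1 for line in lines for ch in line if not ch.isspace())

doc = ["Hello world", "three more words"]
print(count_words(doc), count_characters(doc))  # → 5 24
```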
137
+ ## Contributions
138
+ Contributions are always welcome! If you have a feature request, bug report, or just want to improve the code, feel free to create a pull request or open an issue.
139
+
140
+
@@ -0,0 +1,122 @@
1
+ # mrkdwn_analysis
2
+
3
+ `mrkdwn_analysis` is a Python library designed to analyze Markdown files. With its powerful parsing capabilities, it can extract and categorize various elements within a Markdown document, including headers, sections, links, images, blockquotes, code blocks, and lists. This makes it a valuable tool for anyone looking to parse Markdown content for data analysis, content generation, or for building other tools that utilize Markdown.
4
+
5
+ ## Features
6
+
7
+ - File Loading: The MarkdownAnalyzer can load any given Markdown file provided through the file path.
8
+
9
+ - Header Identification: The tool can extract all headers from the markdown file, ranging from H1 to H6 tags. This allows users to have a quick overview of the document's structure.
10
+
11
+ - Section Identification: The analyzer can recognize different sections of the document. It defines a section as a block of text followed by a line composed solely of = or - characters.
12
+
13
+ - Paragraph Identification: The tool can distinguish between regular text and other elements such as lists, headers, etc., thereby identifying all the paragraphs present in the document.
14
+
15
+ - Blockquote Identification: The analyzer can identify and extract all blockquotes in the markdown file.
16
+
17
+ - Code Block Identification: The tool can extract all code blocks defined in the document, allowing you to separate the programming code from the regular text easily.
18
+
19
+ - List Identification: The analyzer can identify both ordered and unordered lists in the markdown file, providing information about the hierarchical structure of the points.
20
+
21
+ - Table Identification: The tool can identify and extract tables from the markdown file, enabling users to separate and analyze tabular data quickly.
22
+
23
+ - Link Identification and Validation: The analyzer can identify all links present in the markdown file, categorizing them into text and image links. Moreover, it can also verify if these links are valid or broken.
24
+
25
+ - Todo Identification: The tool is capable of recognizing and extracting todos (tasks or action items) present in the document.
26
+
27
+ - Element Counting: The analyzer can count the total number of a specific element type in the file. This can help in quantifying the extent of different elements in the document.
28
+
29
+ - Word Counting: The tool can count the total number of words in the file, providing an estimate of the document's length.
30
+
31
+ - Character Counting: The analyzer can count the total number of characters (excluding spaces) in the file, giving a detailed measure of the document's size.
32
+
33
+ ## Installation
34
+ You can install `mrkdwn_analysis` from PyPI:
35
+
36
+ ```bash
37
+ pip install mrkdwn_analysis
38
+ ```
39
+
40
+ We hope `mrkdwn_analysis` helps you with all your Markdown analyzing needs!
41
+
42
+ ## Usage
43
+ Using `mrkdwn_analysis` is simple. Just import the `MarkdownAnalyzer` class, create an instance with your Markdown file, and you're good to go!
44
+
45
+ ```python
46
+ from mrkdwn_analysis import MarkdownAnalyzer
47
+
48
+ analyzer = MarkdownAnalyzer("path/to/your/markdown.md")
49
+
50
+ headers = analyzer.identify_headers()
51
+ sections = analyzer.identify_sections()
52
+ ...
53
+ ```
54
+
55
+ ### Class MarkdownAnalyzer
56
+
57
+ The `MarkdownAnalyzer` class is designed to analyze Markdown files. It has the ability to extract and categorize various elements of a Markdown document.
58
+
59
+ ### `__init__(self, file_path)`
60
+
61
+ The constructor of the class. It opens the specified Markdown file and stores its content line by line.
62
+
63
+ - `file_path`: the path of the Markdown file to analyze.
64
+
65
+ ### `identify_headers(self)`
66
+
67
+ Analyzes the file and identifies all headers (from h1 to h6). Headers are returned as a dictionary where the key is "Header" and the value is a list of all headers found.
68
+
69
+ ### `identify_sections(self)`
70
+
71
+ Analyzes the file and identifies all sections. Sections are defined as a block of text followed by a line composed solely of `=` or `-` characters. Sections are returned as a dictionary where the key is "Section" and the value is a list of all sections found.
72
+
73
+ ### `identify_paragraphs(self)`
74
+
75
+ Analyzes the file and identifies all paragraphs. Paragraphs are defined as a block of text that is not a header, list, blockquote, etc. Paragraphs are returned as a dictionary where the key is "Paragraph" and the value is a list of all paragraphs found.
76
+
77
+ ### `identify_blockquotes(self)`
78
+
79
+ Analyzes the file and identifies all blockquotes. Blockquotes are defined by a line starting with the `>` character. Blockquotes are returned as a dictionary where the key is "Blockquote" and the value is a list of all blockquotes found.
80
+
81
+ ### `identify_code_blocks(self)`
82
+
83
+ Analyzes the file and identifies all code blocks. Code blocks are defined by a block of text surrounded by lines containing only the "```" text. Code blocks are returned as a dictionary where the key is "Code block" and the value is a list of all code blocks found.
84
+
85
+ ### `identify_ordered_lists(self)`
86
+
87
+ Analyzes the file and identifies all ordered lists. Ordered lists are defined by lines starting with a number followed by a dot. Ordered lists are returned as a dictionary where the key is "Ordered list" and the value is a list of all ordered lists found.
88
+
89
+ ### `identify_unordered_lists(self)`
90
+
91
+ Analyzes the file and identifies all unordered lists. Unordered lists are defined by lines starting with a `-`, `*`, or `+`. Unordered lists are returned as a dictionary where the key is "Unordered list" and the value is a list of all unordered lists found.
92
+
93
+ ### `identify_tables(self)`
94
+
95
+ Analyzes the file and identifies all tables. Tables are defined by lines containing `|` to delimit cells and are separated by lines containing `-` to define the borders. Tables are returned as a dictionary where the key is "Table" and the value is a list of all tables found.
96
+
97
+ ### `identify_links(self)`
98
+
99
+ Analyzes the file and identifies all links. Links are defined by the format `[text](url)`. Links are returned as a dictionary where the keys are "Text link" and "Image link" and the values are lists of all links found.
100
+
101
+ ### `check_links(self)`
102
+
103
+ Checks all links identified by `identify_links` to see if they are broken (the request fails or returns a non-200 status). Broken links are returned as a list, each item being a dictionary containing the line number, link text, and URL.
104
+
105
+ ### `identify_todos(self)`
106
+
107
+ Analyzes the file and identifies all todos. Todos are defined by lines starting with `- [ ] `. Todos are returned as a list, each item being a dictionary containing the line number and todo text.
108
+
109
+ ### `count_elements(self, element_type)`
110
+
111
+ Counts the total number of a specific element type in the file. The `element_type` should match the name of one of the identification methods (for example, "headers" for `identify_headers`). Returns the total number of elements of this type.
112
+
113
+ ### `count_words(self)`
114
+
115
+ Counts the total number of words in the file. Returns the word count.
116
+
117
+ ### `count_characters(self)`
118
+
119
+ Counts the total number of characters (excluding spaces) in the file. Returns the character count.
120
+
121
+ ## Contributions
122
+ Contributions are always welcome! If you have a feature request, bug report, or just want to improve the code, feel free to create a pull request or open an issue.
@@ -0,0 +1,140 @@
1
+ Metadata-Version: 2.1
2
+ Name: markdown-analysis
3
+ Version: 0.0.4
4
+ Summary: UNKNOWN
5
+ Home-page: https://github.com/yannbanas/mrkdwn_analysis
6
+ Author: yannbanas
7
+ Author-email: yannbanas@gmail.com
8
+ License: UNKNOWN
9
+ Platform: UNKNOWN
10
+ Classifier: Development Status :: 2 - Pre-Alpha
11
+ Classifier: Intended Audience :: Developers
12
+ Classifier: License :: OSI Approved :: MIT License
13
+ Classifier: Programming Language :: Python :: 3.11
14
+ Description-Content-Type: text/markdown
15
+ License-File: LICENSE
16
+
17
+ # mrkdwn_analysis
18
+
19
+ `mrkdwn_analysis` is a Python library designed to analyze Markdown files. With its powerful parsing capabilities, it can extract and categorize various elements within a Markdown document, including headers, sections, links, images, blockquotes, code blocks, and lists. This makes it a valuable tool for anyone looking to parse Markdown content for data analysis, content generation, or for building other tools that utilize Markdown.
20
+
21
+ ## Features
22
+
23
+ - File Loading: The MarkdownAnalyzer can load any given Markdown file provided through the file path.
24
+
25
+ - Header Identification: The tool can extract all headers from the markdown file, ranging from H1 to H6 tags. This allows users to have a quick overview of the document's structure.
26
+
27
+ - Section Identification: The analyzer can recognize different sections of the document. It defines a section as a block of text followed by a line composed solely of = or - characters.
28
+
29
+ - Paragraph Identification: The tool can distinguish between regular text and other elements such as lists, headers, etc., thereby identifying all the paragraphs present in the document.
30
+
31
+ - Blockquote Identification: The analyzer can identify and extract all blockquotes in the markdown file.
32
+
33
+ - Code Block Identification: The tool can extract all code blocks defined in the document, allowing you to separate the programming code from the regular text easily.
34
+
35
+ - List Identification: The analyzer can identify both ordered and unordered lists in the markdown file, providing information about the hierarchical structure of the points.
36
+
37
+ - Table Identification: The tool can identify and extract tables from the markdown file, enabling users to separate and analyze tabular data quickly.
38
+
39
+ - Link Identification and Validation: The analyzer can identify all links present in the markdown file, categorizing them into text and image links. Moreover, it can also verify if these links are valid or broken.
40
+
41
+ - Todo Identification: The tool is capable of recognizing and extracting todos (tasks or action items) present in the document.
42
+
43
+ - Element Counting: The analyzer can count the total number of a specific element type in the file. This can help in quantifying the extent of different elements in the document.
44
+
45
+ - Word Counting: The tool can count the total number of words in the file, providing an estimate of the document's length.
46
+
47
+ - Character Counting: The analyzer can count the total number of characters (excluding spaces) in the file, giving a detailed measure of the document's size.
48
+
49
+ ## Installation
50
+ You can install `mrkdwn_analysis` from PyPI:
51
+
52
+ ```bash
53
+ pip install mrkdwn_analysis
54
+ ```
55
+
56
+ We hope `mrkdwn_analysis` helps you with all your Markdown analyzing needs!
57
+
58
+ ## Usage
59
+ Using `mrkdwn_analysis` is simple. Just import the `MarkdownAnalyzer` class, create an instance with your Markdown file, and you're good to go!
60
+
61
+ ```python
62
+ from mrkdwn_analysis import MarkdownAnalyzer
63
+
64
+ analyzer = MarkdownAnalyzer("path/to/your/markdown.md")
65
+
66
+ headers = analyzer.identify_headers()
67
+ sections = analyzer.identify_sections()
68
+ ...
69
+ ```
70
+
71
+ ### Class MarkdownAnalyzer
72
+
73
+ The `MarkdownAnalyzer` class is designed to analyze Markdown files. It has the ability to extract and categorize various elements of a Markdown document.
74
+
75
+ ### `__init__(self, file_path)`
76
+
77
+ The constructor of the class. It opens the specified Markdown file and stores its content line by line.
78
+
79
+ - `file_path`: the path of the Markdown file to analyze.
80
+
81
+ ### `identify_headers(self)`
82
+
83
+ Analyzes the file and identifies all headers (from h1 to h6). Headers are returned as a dictionary where the key is "Header" and the value is a list of all headers found.
84
+
85
+ ### `identify_sections(self)`
86
+
87
+ Analyzes the file and identifies all sections. Sections are defined as a block of text followed by a line composed solely of `=` or `-` characters. Sections are returned as a dictionary where the key is "Section" and the value is a list of all sections found.
88
+
89
+ ### `identify_paragraphs(self)`
90
+
91
+ Analyzes the file and identifies all paragraphs. Paragraphs are defined as a block of text that is not a header, list, blockquote, etc. Paragraphs are returned as a dictionary where the key is "Paragraph" and the value is a list of all paragraphs found.
92
+
93
+ ### `identify_blockquotes(self)`
94
+
95
+ Analyzes the file and identifies all blockquotes. Blockquotes are defined by a line starting with the `>` character. Blockquotes are returned as a dictionary where the key is "Blockquote" and the value is a list of all blockquotes found.
96
+
97
+ ### `identify_code_blocks(self)`
98
+
99
+ Analyzes the file and identifies all code blocks. Code blocks are defined by a block of text surrounded by lines containing only the "```" text. Code blocks are returned as a dictionary where the key is "Code block" and the value is a list of all code blocks found.
100
+
101
+ ### `identify_ordered_lists(self)`
102
+
103
+ Analyzes the file and identifies all ordered lists. Ordered lists are defined by lines starting with a number followed by a dot. Ordered lists are returned as a dictionary where the key is "Ordered list" and the value is a list of all ordered lists found.
104
+
105
+ ### `identify_unordered_lists(self)`
106
+
107
+ Analyzes the file and identifies all unordered lists. Unordered lists are defined by lines starting with a `-`, `*`, or `+`. Unordered lists are returned as a dictionary where the key is "Unordered list" and the value is a list of all unordered lists found.
108
+
109
+ ### `identify_tables(self)`
110
+
111
+ Analyzes the file and identifies all tables. Tables are defined by lines containing `|` to delimit cells and are separated by lines containing `-` to define the borders. Tables are returned as a dictionary where the key is "Table" and the value is a list of all tables found.
112
+
113
+ ### `identify_links(self)`
114
+
115
+ Analyzes the file and identifies all links. Links are defined by the format `[text](url)`. Links are returned as a dictionary where the keys are "Text link" and "Image link" and the values are lists of all links found.
116
+
117
+ ### `check_links(self)`
118
+
119
+ Checks all links identified by `identify_links` to see if they are broken (return a 404 error). Broken links are returned as a list, each item being a dictionary containing the line number, link text, and URL.
120
+
121
+ ### `identify_todos(self)`
122
+
123
+ Analyzes the file and identifies all todos. Todos are defined by lines starting with `- [ ] `. Todos are returned as a list, each item being a dictionary containing the line number and todo text.
124
+
125
+ ### `count_elements(self, element_type)`
126
+
127
+ Counts the total number of a specific element type in the file. The `element_type` should match the name of one of the identification methods (for example, "headers" for `identify_headers`). Returns the total number of elements of this type.
128
+
129
+ ### `count_words(self)`
130
+
131
+ Counts the total number of words in the file. Returns the word count.
132
+
133
+ ### `count_characters(self)`
134
+
135
+ Counts the total number of characters (excluding spaces) in the file. Returns the character count.
136
+
137
+ ## Contributions
138
+ Contributions are always welcome! If you have a feature request, bug report, or just want to improve the code, feel free to create a pull request or open an issue.
139
+
140
+
@@ -4,6 +4,7 @@ setup.py
4
4
  markdown_analysis.egg-info/PKG-INFO
5
5
  markdown_analysis.egg-info/SOURCES.txt
6
6
  markdown_analysis.egg-info/dependency_links.txt
7
+ markdown_analysis.egg-info/requires.txt
7
8
  markdown_analysis.egg-info/top_level.txt
8
9
  mrkdwn_analysis/__init__.py
9
10
  mrkdwn_analysis/markdown_analyzer.py
@@ -0,0 +1,2 @@
1
+ requests
2
+ urllib3
@@ -0,0 +1,283 @@
1
+ import re
2
+ import requests
3
+ from collections import defaultdict, Counter
4
+
5
+ class MarkdownAnalyzer:
6
+ def __init__(self, file_path):
7
+ with open(file_path, 'r') as file:
8
+ self.lines = file.readlines()
9
+
10
+ def identify_headers(self):
11
+ result = defaultdict(list)
12
+ pattern = r'^(#{1,6})\s(.*)'
13
+ pattern_image = r'!\[.*?\]\((.*?)\)' # pattern to identify images
14
+ for i, line in enumerate(self.lines):
15
+ line_without_images = re.sub(pattern_image, '', line) # remove images from the line
16
+ match = re.match(pattern, line_without_images)
17
+ if match:
18
+ cleaned_line = re.sub(r'^#+', '', line_without_images).strip()
19
+ result["Header"].append(cleaned_line)
20
+ return dict(result) # Convert defaultdict to dict before returning
21
+
22
+ def identify_sections(self):
23
+ result = defaultdict(list)
24
+ pattern = r'^.*\n[=-]{2,}$'
25
+ for i, line in enumerate(self.lines):
26
+ if i < len(self.lines) - 1:
27
+ match = re.match(pattern, line + self.lines[i+1])
28
+ else:
29
+ match = None
30
+ if match:
31
+ if self.lines[i+1].strip().startswith("===") or self.lines[i+1].strip().startswith("---"):
32
+ result["Section"].append(line.strip())
33
+ return dict(result) # Convert defaultdict to dict before returning
34
+
35
+ def identify_paragraphs(self):
36
+ result = defaultdict(list)
37
+ pattern = r'^(?!#)(?!\n)(?!>)(?!-)(?!=)(.*\S)'
38
+ pattern_underline = r'^.*\n[=-]{2,}$'
39
+ in_code_block = False
40
+ for i, line in enumerate(self.lines):
41
+ if line.strip().startswith('```'):
42
+ in_code_block = not in_code_block
43
+ if in_code_block:
44
+ continue
45
+ if i < len(self.lines) - 1:
46
+ match_underline = re.match(pattern_underline, line + self.lines[i+1])
47
+ if match_underline:
48
+ continue
49
+ match = re.match(pattern, line)
50
+ if match and line.strip() != '```': # added a condition to skip lines that are just ```
51
+ result["Paragraph"].append(line.strip())
52
+ return dict(result)
53
+
54
+ def identify_blockquotes(self):
55
+ result = defaultdict(list)
56
+ pattern = r'^(>{1,})\s(.*)'
57
+ blockquote = None
58
+ in_code_block = False
59
+ for i, line in enumerate(self.lines):
60
+ if line.strip().startswith('```'):
61
+ in_code_block = not in_code_block # Flip the flag
62
+ if in_code_block:
63
+ continue # Skip processing for code blocks
64
+ match = re.match(pattern, line)
65
+ if match:
66
+ depth = len(match.group(1)) # depth is determined by the number of '>' characters
67
+ text = match.group(2).strip()
68
+ if depth > 2:
69
+ raise ValueError(f"Encountered a blockquote of depth {depth} at line {i+1}, but the maximum allowed depth is 2")
70
+ if blockquote is None:
71
+ # Start of a new blockquote
72
+ blockquote = text
73
+ else:
74
+ # Continuation of the current blockquote, regardless of depth
75
+ blockquote += " " + text
76
+ elif blockquote is not None:
77
+ # End of the current blockquote
78
+ result["Blockquote"].append(blockquote)
79
+ blockquote = None
80
+
81
+ if blockquote is not None:
82
+ # End of the last blockquote
83
+ result["Blockquote"].append(blockquote)
84
+
85
+ return dict(result)
86
+
87
+ def identify_code_blocks(self):
88
+ result = defaultdict(list)
89
+ pattern = r'^```'
90
+ in_code_block = False
91
+ code_block = None
92
+ for i, line in enumerate(self.lines):
93
+ match = re.match(pattern, line.strip())
94
+ if match:
95
+ if in_code_block:
96
+ # End of code block
97
+ in_code_block = False
98
+ code_block += "\n" + line.strip() # Add the line to the code block before ending it
99
+ result["Code block"].append(code_block)
100
+ code_block = None
101
+ else:
102
+ # Start of code block
103
+ in_code_block = True
104
+ code_block = line.strip()
105
+ elif in_code_block:
106
+ code_block += "\n" + line.strip()
107
+
108
+ if code_block is not None:
109
+ result["Code block"].append(code_block)
110
+
111
+ return dict(result)
112
+
113
+ def identify_ordered_lists(self):
114
+ result = defaultdict(list)
115
+ pattern = r'^\s*\d+\.\s'
116
+ in_list = False
117
+ list_items = []
118
+ for i, line in enumerate(self.lines):
119
+ match = re.match(pattern, line)
120
+ if match:
121
+ if not in_list:
122
+ # Start of a new list
123
+ in_list = True
124
+ # Add the current line to the current list
125
+ list_items.append(line.strip())
126
+ elif in_list:
127
+ # End of the current list
128
+ in_list = False
129
+ result["Ordered list"].append(list_items)
130
+ list_items = []
131
+
132
+ if list_items:
133
+ # End of the last list
134
+ result["Ordered list"].append(list_items)
135
+
136
+ return dict(result)
137
+
138
+ def identify_unordered_lists(self):
139
+ result = defaultdict(list)
140
+ pattern = r'^\s*((\d+\\\.|[-*+])\s)'
141
+ in_list = False
142
+ list_items = []
143
+ for i, line in enumerate(self.lines):
144
+ match = re.match(pattern, line)
145
+ if match:
146
+ if not in_list:
147
+ # Start of a new list
148
+ in_list = True
149
+ # Add the current line to the current list
150
+ list_items.append(line.strip())
151
+ elif in_list:
152
+ # End of the current list
153
+ in_list = False
154
+ result["Unordered list"].append(list_items)
155
+ list_items = []
156
+
157
+ if list_items:
158
+ # End of the last list
159
+ result["Unordered list"].append(list_items)
160
+
161
+ return dict(result)
162
+
163
+ def identify_tables(self):
164
+ result = defaultdict(list)
165
+ table_pattern = re.compile(r'^ {0,3}\|(?P<table_head>.+)\|[ \t]*\n' +
166
+ r' {0,3}\|(?P<table_align> *[-:]+[-| :]*)\|[ \t]*\n' +
167
+ r'(?P<table_body>(?: {0,3}\|.*\|[ \t]*(?:\n|$))*)\n*')
168
+ nptable_pattern = re.compile(r'^ {0,3}(?P<nptable_head>\S.*\|.*)\n' +
169
+ r' {0,3}(?P<nptable_align>[-:]+ *\|[-| :]*)\n' +
170
+ r'(?P<nptable_body>(?:.*\|.*(?:\n|$))*)\n*')
171
+
172
+ text = "".join(self.lines)
173
+ matches_table = re.findall(table_pattern, text)
174
+ matches_nptable = re.findall(nptable_pattern, text)
175
+ for match in matches_table + matches_nptable:
176
+ result["Table"].append(match)
177
+
178
+ return dict(result)
179
+
180
+ def identify_links(self):
181
+ result = defaultdict(list)
182
+ text_link_pattern = r'\[([^\]]+)\]\(([^)]+)\)'
183
+ image_link_pattern = r'!\[([^\]]*)\]\((.*?)\)'
184
+ for i, line in enumerate(self.lines):
185
+ text_links = re.findall(text_link_pattern, line)
186
+ image_links = re.findall(image_link_pattern, line)
187
+ for link in text_links:
188
+ result["Text link"].append({"line": i+1, "text": link[0], "url": link[1]})
189
+ for link in image_links:
190
+ result["Image link"].append({"line": i+1, "alt_text": link[0], "url": link[1]})
191
+ return dict(result)
192
+
193
+ def check_links(self):
194
+ broken_links = []
195
+ link_pattern = r'\[([^\]]+)\]\(([^)]+)\)'
196
+ for i, line in enumerate(self.lines):
197
+ links = re.findall(link_pattern, line)
198
+ for link in links:
199
+ try:
200
+ response = requests.head(link[1], timeout=3)
201
+ if response.status_code != 200:
202
+ broken_links.append({'line': i+1, 'text': link[0], 'url': link[1]})
203
+ except (requests.ConnectionError, requests.Timeout):
204
+ broken_links.append({'line': i+1, 'text': link[0], 'url': link[1]})
205
+ return broken_links
206
+
+     def identify_todos(self):
+         todos = []
+         todo_pattern = r'^\-\s\[ \]\s(.*)'
+         for i, line in enumerate(self.lines):
+             match = re.match(todo_pattern, line)
+             if match:
+                 todos.append({'line': i+1, 'text': match.group(1)})
+         return todos
+ 
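A quick check of the todo pattern on hypothetical sample lines: an unchecked task (`[ ]`) matches, while a completed one (`[x]`) deliberately does not.

```python
import re

todo_pattern = r'^\-\s\[ \]\s(.*)'
lines = ["- [ ] write tests\n", "- [x] publish release\n"]

todos = [re.match(todo_pattern, line).group(1).rstrip()
         for line in lines if re.match(todo_pattern, line)]
```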
+     def count_elements(self, element_type):
+         identify_func = getattr(self, f'identify_{element_type}', None)
+         if not identify_func:
+             raise ValueError(f"No method to identify {element_type} found.")
+         elements = identify_func()
+         # Some identify_* methods return a plain list rather than a dict.
+         if not isinstance(elements, dict):
+             return len(elements)
+         # identify_* methods store singular, space-separated keys such as
+         # "Header" or "Text link", while element_type is plural snake_case,
+         # so the key must be derived rather than just capitalized.
+         key = element_type.rstrip('s').replace('_', ' ').capitalize()
+         return len(elements.get(key, []))
+ 
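The key mismatch that `count_elements` has to bridge is easy to demonstrate: `'headers'.capitalize()` yields `'Headers'`, but `identify_headers` stores its results under the singular key `'Header'`. A hypothetical helper sketching one mapping that works for the keys visible in this file:

```python
def element_key(element_type):
    # Plural snake_case argument -> singular, space-separated dict key,
    # e.g. 'code_blocks' -> 'Code block'. Assumes all identify_* methods
    # follow that naming convention.
    return element_type.rstrip('s').replace('_', ' ').capitalize()

keys = [element_key(t) for t in ('headers', 'sections', 'code_blocks')]
```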
+     def count_words(self):
+         text = " ".join(self.lines)
+         words = text.split()
+         return len(words)
+ 
+     def count_characters(self):
+         text = " ".join(self.lines)
+         # Exclude whitespace characters
+         characters = [char for char in text if not char.isspace()]
+         return len(characters)
+ 
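A one-line illustration (the sample line is hypothetical) of the two counting rules above: words are whitespace-separated tokens, and the character count excludes all whitespace.

```python
line = "alpha  beta\tgamma\n"
word_count = len(line.split())                          # whitespace-split tokens
char_count = len([c for c in line if not c.isspace()])  # non-whitespace chars only
```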
+     def get_text_statistics(self):
+         statistics = []
+         for i, line in enumerate(self.lines):
+             words = line.split()
+             if words:
+                 char_count = sum(len(word) for word in words)
+                 statistics.append({
+                     'line': i+1,
+                     'word_count': len(words),
+                     'char_count': char_count,
+                     'average_word_length': char_count / len(words),
+                 })
+         return statistics
+ 
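A worked example of the per-line statistics, using a hypothetical line: the character count sums word lengths (so whitespace is excluded), and the average word length divides that by the word count.

```python
words = "to be or not".split()
char_count = sum(len(w) for w in words)        # 2 + 2 + 2 + 3
average_word_length = char_count / len(words)  # 9 / 4
```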
+     def get_word_frequency(self):
+         word_frequency = Counter()
+         for line in self.lines:
+             # Naive tokenization: punctuation stays attached, so "word,"
+             # and "word" are counted separately.
+             word_frequency.update(line.lower().split())
+         return dict(word_frequency.most_common())
+ 
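The frequency logic mirrors `collections.Counter.update` over lowercased, whitespace-split tokens. A sketch on a hypothetical two-line document:

```python
from collections import Counter

lines = ["The cat sat\n", "the cat ran\n"]
freq = Counter()
for line in lines:
    freq.update(line.lower().split())  # lowercase so "The" and "the" merge
```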
+     def search(self, search_string):
+         result = []
+         for i, line in enumerate(self.lines):
+             if search_string in line:
+                 element_types = [func for func in dir(self) if func.startswith('identify_')]
+                 found_in_element = None
+                 for etype in element_types:
+                     element = getattr(self, etype)()
+                     # Some identify_* methods return lists rather than dicts,
+                     # and entries may be tuples or dicts, so skip non-dict
+                     # results and normalize entries to str before matching.
+                     if not isinstance(element, dict):
+                         continue
+                     for e, content in element.items():
+                         if any(search_string in str(c) for c in content):
+                             found_in_element = e
+                             break
+                     if found_in_element:
+                         break
+                 result.append({"line": i+1, "text": line.strip(), "element": found_in_element})
+         return result
+ 
+     def analyse(self):
+         analysis = {
+             'headers': self.count_elements('headers'),
+             'sections': self.count_elements('sections'),
+             'paragraphs': self.count_elements('paragraphs'),
+             'blockquotes': self.count_elements('blockquotes'),
+             'code_blocks': self.count_elements('code_blocks'),
+             'ordered_lists': self.count_elements('ordered_lists'),
+             'unordered_lists': self.count_elements('unordered_lists'),
+             'tables': self.count_elements('tables'),
+             'words': self.count_words(),
+             'characters': self.count_characters(),
+         }
+         return analysis
@@ -6,16 +6,16 @@ with open("README.md", "r", encoding="utf-8") as fh:
  
  setup(
      name='markdown_analysis',
-     version='0.0.3',
+     version='0.0.4',
      long_description=long_description,
      long_description_content_type="text/markdown",
-     description='A library to analyze markdown files',
      author='yannbanas',
      author_email='yannbanas@gmail.com',
      url='https://github.com/yannbanas/mrkdwn_analysis',
      packages=find_packages(),
      install_requires=[
-         # list your dependencies here
+         'urllib3',
+         'requests'
      ],
      classifiers=[
          'Development Status :: 2 - Pre-Alpha',
@@ -1,54 +0,0 @@
- Metadata-Version: 2.1
- Name: markdown_analysis
- Version: 0.0.3
- Summary: A library to analyze markdown files
- Home-page: https://github.com/yannbanas/mrkdwn_analysis
- Author: yannbanas
- Author-email: yannbanas@gmail.com
- License: UNKNOWN
- Platform: UNKNOWN
- Classifier: Development Status :: 2 - Pre-Alpha
- Classifier: Intended Audience :: Developers
- Classifier: License :: OSI Approved :: MIT License
- Classifier: Programming Language :: Python :: 3.11
- Description-Content-Type: text/markdown
- License-File: LICENSE
- 
- # mrkdwn_analysis
- 
- `mrkdwn_analysis` is a Python library designed to analyze Markdown files. With its powerful parsing capabilities, it can extract and categorize various elements within a Markdown document, including headers, sections, links, images, blockquotes, code blocks, and lists. This makes it a valuable tool for anyone looking to parse Markdown content for data analysis, content generation, or for building other tools that utilize Markdown.
- 
- ## Features
- - Extract and categorize various elements of a Markdown file.
- - Handle both inline and reference-style links and images.
- - Recognize different types of headers and sections.
- - Identify and extract code blocks, even nested ones.
- - Handle both ordered and unordered lists, nested or otherwise.
- - A simple API that makes parsing Markdown documents a breeze.
- 
- ## Usage
- Using `mrkdwn_analysis` is simple. Just import the `MarkdownAnalyzer` class, create an instance with your Markdown file, and you're good to go!
- 
- ```python
- from mrkdwn_analysis import MarkdownAnalyzer
- 
- analyzer = MarkdownAnalyzer("path/to/your/markdown.md")
- 
- headers = analyzer.identify_headers()
- sections = analyzer.identify_sections()
- ...
- ```
- 
- ## Installation
- You can install `mrkdwn_analysis` from PyPI:
- 
- ```bash
- pip install mrkdwn_analysis
- ```
- 
- We hope `mrkdwn_analysis` helps you with all your Markdown analyzing needs!
- 
- ## Contributions
- Contributions are always welcome! If you have a feature request, bug report, or just want to improve the code, feel free to create a pull request or open an issue.
- 
- 
@@ -1,36 +0,0 @@
- # mrkdwn_analysis
- 
- `mrkdwn_analysis` is a Python library designed to analyze Markdown files. With its powerful parsing capabilities, it can extract and categorize various elements within a Markdown document, including headers, sections, links, images, blockquotes, code blocks, and lists. This makes it a valuable tool for anyone looking to parse Markdown content for data analysis, content generation, or for building other tools that utilize Markdown.
- 
- ## Features
- - Extract and categorize various elements of a Markdown file.
- - Handle both inline and reference-style links and images.
- - Recognize different types of headers and sections.
- - Identify and extract code blocks, even nested ones.
- - Handle both ordered and unordered lists, nested or otherwise.
- - A simple API that makes parsing Markdown documents a breeze.
- 
- ## Usage
- Using `mrkdwn_analysis` is simple. Just import the `MarkdownAnalyzer` class, create an instance with your Markdown file, and you're good to go!
- 
- ```python
- from mrkdwn_analysis import MarkdownAnalyzer
- 
- analyzer = MarkdownAnalyzer("path/to/your/markdown.md")
- 
- headers = analyzer.identify_headers()
- sections = analyzer.identify_sections()
- ...
- ```
- 
- ## Installation
- You can install `mrkdwn_analysis` from PyPI:
- 
- ```bash
- pip install mrkdwn_analysis
- ```
- 
- We hope `mrkdwn_analysis` helps you with all your Markdown analyzing needs!
- 
- ## Contributions
- Contributions are always welcome! If you have a feature request, bug report, or just want to improve the code, feel free to create a pull request or open an issue.
@@ -1,54 +0,0 @@
- Metadata-Version: 2.1
- Name: markdown-analysis
- Version: 0.0.3
- Summary: A library to analyze markdown files
- Home-page: https://github.com/yannbanas/mrkdwn_analysis
- Author: yannbanas
- Author-email: yannbanas@gmail.com
- License: UNKNOWN
- Platform: UNKNOWN
- Classifier: Development Status :: 2 - Pre-Alpha
- Classifier: Intended Audience :: Developers
- Classifier: License :: OSI Approved :: MIT License
- Classifier: Programming Language :: Python :: 3.11
- Description-Content-Type: text/markdown
- License-File: LICENSE
- 
- # mrkdwn_analysis
- 
- `mrkdwn_analysis` is a Python library designed to analyze Markdown files. With its powerful parsing capabilities, it can extract and categorize various elements within a Markdown document, including headers, sections, links, images, blockquotes, code blocks, and lists. This makes it a valuable tool for anyone looking to parse Markdown content for data analysis, content generation, or for building other tools that utilize Markdown.
- 
- ## Features
- - Extract and categorize various elements of a Markdown file.
- - Handle both inline and reference-style links and images.
- - Recognize different types of headers and sections.
- - Identify and extract code blocks, even nested ones.
- - Handle both ordered and unordered lists, nested or otherwise.
- - A simple API that makes parsing Markdown documents a breeze.
- 
- ## Usage
- Using `mrkdwn_analysis` is simple. Just import the `MarkdownAnalyzer` class, create an instance with your Markdown file, and you're good to go!
- 
- ```python
- from mrkdwn_analysis import MarkdownAnalyzer
- 
- analyzer = MarkdownAnalyzer("path/to/your/markdown.md")
- 
- headers = analyzer.identify_headers()
- sections = analyzer.identify_sections()
- ...
- ```
- 
- ## Installation
- You can install `mrkdwn_analysis` from PyPI:
- 
- ```bash
- pip install mrkdwn_analysis
- ```
- 
- We hope `mrkdwn_analysis` helps you with all your Markdown analyzing needs!
- 
- ## Contributions
- Contributions are always welcome! If you have a feature request, bug report, or just want to improve the code, feel free to create a pull request or open an issue.
- 
- 
@@ -1,34 +0,0 @@
- # core.py
- 
- import re
- from collections import defaultdict
- 
- class MarkdownAnalyzer:
-     def __init__(self, file_path):
-         with open(file_path, 'r') as file:
-             self.lines = file.readlines()
- 
-     def identify_headers(self):
-         result = defaultdict(list)
-         pattern = r'^(#{1,6})\s(.*)'
-         pattern_image = r'!\[.*?\]\((.*?)\)'  # pattern to identify images
-         for i, line in enumerate(self.lines):
-             line_without_images = re.sub(pattern_image, '', line)  # remove images from the line
-             match = re.match(pattern, line_without_images)
-             if match:
-                 cleaned_line = re.sub(r'^#+', '', line_without_images).strip()
-                 result["Header"].append(cleaned_line)
-         return result
- 
-     def identify_sections(self):
-         result = defaultdict(list)
-         pattern = r'^.*\n[=-]{2,}$'
-         for i, line in enumerate(self.lines):
-             if i < len(self.lines) - 1:
-                 match = re.match(pattern, line + self.lines[i+1])
-             else:
-                 match = None
-             if match:
-                 if self.lines[i+1].strip().startswith("===") or self.lines[i+1].strip().startswith("---"):
-                     result["Section"].append(line.strip())
-         return result