site-to-md 1.0.5
- checksums.yaml +7 -0
- data/CHANGELOG.md +55 -0
- data/LICENSE +21 -0
- data/README.md +206 -0
- data/exe/site-to-md +7 -0
- data/lib/site_to_md/cli.rb +23 -0
- data/lib/site_to_md/errors.rb +27 -0
- data/lib/site_to_md/file_converter.rb +60 -0
- data/lib/site_to_md/html_converter.rb +21 -0
- data/lib/site_to_md/processor.rb +54 -0
- data/lib/site_to_md/version.rb +5 -0
- data/lib/site_to_md.rb +12 -0
- data/test/file_converter_test.rb +50 -0
- data/test/fixtures/sample_site/about.html +9 -0
- data/test/fixtures/sample_site/index.html +12 -0
- data/test/html_converter_test.rb +38 -0
- data/test/processor_test.rb +58 -0
- data/test/test_helper.rb +22 -0
- metadata +113 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: b2d16ad412ed98a97e3871cf72fd3b83665b4ce9f802eb6264e8f76c77b9f3a7
+  data.tar.gz: 7f2aa545d6ee2a3e97f8a80fc69e6c2e68b0f117b908660fc4370282f34008f3
+SHA512:
+  metadata.gz: 0527d93e4d21d06ae60bab265edd19b500ffd584623bcd3a349521797ce60a8c466fef932724cbaa229bef6ddad3ecc1de1bc3845db74172792c826f566f571d
+  data.tar.gz: 8c1631e55a27e3f2274cf18e2a0a5f2b9af148a2869e78432a4767da9f4456cb14e9ea28ebdf3a4de1d5ff18bf7f0eaea70f7e2a5359d2c4c5dca9f268f96fbe
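The digests in checksums.yaml cover the `metadata.gz` and `data.tar.gz` members of the packaged gem. As a minimal sketch of how one such value could be checked by hand, the helper below compares a file's SHA-256 digest with an expected hex string. The name `checksum_matches?` is hypothetical and not part of the gem, and the demonstration uses a throwaway file rather than a real archive member.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper: compares a file's SHA-256 digest with an expected hex string.
def checksum_matches?(path, expected_sha256)
  Digest::SHA256.file(path).hexdigest == expected_sha256
end

# Demonstration on a throwaway file; for a real check, pass the path of an
# extracted data.tar.gz and the SHA256 value recorded in checksums.yaml.
Tempfile.create('checksum-sketch') do |file|
  file.write('example payload')
  file.flush
  expected = Digest::SHA256.hexdigest('example payload')
  puts checksum_matches?(file.path, expected)
end
```

The same approach should work for the SHA512 entries by swapping in `Digest::SHA512`.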
data/CHANGELOG.md
ADDED
@@ -0,0 +1,55 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [Unreleased]
+
+## [1.0.5] - 2024-12-27
+
+### Fixed
+
+- Fix Release process
+
+## [1.0.4] - 2024-12-27
+
+### Fixed
+
+- Fix Release process
+
+## [1.0.3] - 2024-12-27
+
+### Fixed
+
+- Fix Release process
+
+## [1.0.2] - 2024-12-27
+
+### Fixed
+
+- Fix Release process
+
+## [1.0.1] - 2024-12-27
+
+## 1.0.0 - 2024-12-26
+
+### Added
+
+- Command-line interface (CLI) for converting HTML files to a single markdown document
+- Support for extracting content from `<main>` tag with `<body>` fallback
+- Preservation of frontmatter metadata in markdown output
+- Removal of unnecessary HTML elements to optimize markdown for AI tools
+- Enhanced description emphasizing the tool's ease of use for AI tools like Claude AI or ChatGPT
+- Initial test suite for ensuring code quality and reliability
+- Continuous Integration (CI) setup using GitHub Actions
+- Dockerfile to make the CLI tool available as a Docker image (useful for CI)
+- Dependabot configuration for automated dependency updates
+
+[1.0.1]: https://github.com/tmaier/site-to-md/compare/v1.0.0...v1.0.1
+[1.0.2]: https://github.com/tmaier/site-to-md/compare/v1.0.1...v1.0.2
+[1.0.3]: https://github.com/tmaier/site-to-md/compare/v1.0.2...v1.0.3
+[1.0.4]: https://github.com/tmaier/site-to-md/compare/v1.0.3...v1.0.4
+
+[1.0.5]: https://github.com/tmaier/site-to-md/compare/v1.0.4...v1.0.5
data/LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2024 maier.io UG (haftungsbeschränkt)
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,206 @@
+# site-to-md
+
+[](https://badge.fury.io/rb/site-to-md)
+[](https://github.com/tmaier/site-to-md/actions?query=workflow%3ATests)
+[](https://github.com/tmaier/site-to-md/actions?query=workflow%3ARuboCop)
+[](LICENSE.txt)
+
+This command-line tool aggregates content from HTML pages into a single, streamlined markdown file.
+It removes unnecessary HTML elements to reduce token usage and provides an easily uploadable format for AI tools like Claude AI or ChatGPT.
+
+This tool can also be used to create a [llms.txt or llms-full.txt](https://llmstxt.org/).
+For this use case, consider adding this tool to the build pipeline of your static website.
+
+## Features
+
+- Converts all HTML files in a directory (and subdirectories) to markdown
+- Extracts content from `<main>` tag (falls back to `<body>` if not found)
+- Preserves document structure with frontmatter metadata
+- Maintains proper markdown formatting for:
+  - Headers
+  - Lists
+  - Tables
+  - Code blocks
+  - Links
+  - And more...
+- Command-line interface for easy integration
+
+## Installation
+
+```bash
+gem install site-to-md
+```
+
+Or add to your Gemfile:
+
+```ruby
+gem 'site-to-md'
+```
+
+## Usage
+
+### Command Line
+
+Basic usage:
+
+```bash
+site-to-md convert path/to/site
+```
+
+Specify output file:
+
+```bash
+site-to-md convert path/to/site -o output.md
+```
+
+Get help:
+
+```bash
+site-to-md help
+```
+
+### Docker Image
+
+You can use the [site-to-md tool via a Docker image](https://github.com/tmaier/site-to-md/pkgs/container/site-to-md), making it convenient to include in your build pipeline for static websites.
+
+#### Example: GitLab CI
+
+Here's an example GitLab CI configuration.
+This configuration includes a job `llms-full-txt` that uses the Docker image to convert HTML files in the public folder and generates the llms-full.txt file in the same folder.
+
+```yaml
+llms-full-txt:
+  image: ghcr.io/tmaier/site-to-md:latest
+  script:
+    - site-to-md convert public -o public/llms-full.txt
+  artifacts:
+    paths:
+      - public/llms-full.txt
+```
+
+### Ruby API
+
+```ruby
+require 'site_to_md'
+
+processor = SiteToMd::Processor.new('path/to/site', 'output.md')
+processor.process
+```
+
+## Output Format
+
+The generated markdown file contains all HTML documents concatenated with frontmatter metadata:
+
+```markdown
+---
+path: relative/path/to/file.html
+title: Page Title
+---
+
+# Content starts here
+
+Document content in markdown format...
+
+================================================================
+
+---
+
+path: another/file.html
+title: Another Page
+
+---
+
+More content...
+```
+
+## Development
+
+### Requirements
+
+- Ruby 3.2 or higher
+- Bundler
+
+### Getting Started
+
+1. Clone the repository
+2. Open in VSCode with Dev Containers extension installed
+3. Click "Reopen in Container" when prompted
+
+The development container will set up everything you need:
+
+- Ruby development environment
+- Ruby LSP for code intelligence
+- RuboCop for code style checking
+- Development dependencies
+
+Alternatively, you can run without Dev Containers:
+
+```bash
+bin/setup
+```
+
+### Testing
+
+Run the test suite:
+
+```bash
+bin/rake test
+```
+
+Run the linter:
+
+```bash
+bin/rubocop
+```
+
+### Dependency Management
+
+We use Dependabot to keep dependencies up to date.
+Dependabot creates pull requests to update:
+
+- Ruby gem dependencies (weekly)
+- GitHub Actions (weekly)
+- Dockerfile (weekly)
+
+### Release Process
+
+To create a new release, run `bin/cut-new-release {major|minor|patch}`.
+Follow semantic versioning guidelines and ensure the [CHANGELOG](CHANGELOG.md) is up to date with the latest changes.
+
+`bin/cut-new-release` will:
+
+- Ensure some basic checks pass (e.g., right branch, clean working directory, tests pass)
+- Update the version number
+- Update the [CHANGELOG](CHANGELOG.md) with the new version number and release date
+- Commit the changes and create the new tag
+- Push the changes to GitHub
+
+You need to create a new release manually on GitHub and update it with the content of the [CHANGELOG](CHANGELOG.md).
+
+## Contributing
+
+Bug reports and pull requests are welcome on GitHub at <https://github.com/tmaier/site-to-md>.
+This project is intended to be a safe, welcoming space for collaboration.
+
+1. Fork it
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Make your changes:
+   - Add tests for new functionality
+   - Update documentation if needed
+   - Ensure tests pass (`rake test`)
+   - Ensure code style checks pass (`bundle exec rubocop`)
+4. Commit your changes (`git commit -am 'Add some feature'`)
+5. Push to the branch (`git push origin my-new-feature`)
+6. Create new Pull Request
+
+## License
+
+The gem is available as open source under the terms of the [MIT License](LICENSE).
+
+## About
+
+site-to-md is maintained by [maier.io UG (haftungsbeschränkt)](https://maier.io) and [Tobias L. Maier](https://tobiasmaier.info).
+
+## Related Projects
+
+- [reverse_markdown](https://github.com/xijo/reverse_markdown) - The HTML to Markdown converter used by this gem
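The README's "Output Format" section describes pages concatenated with frontmatter and a line of `=` characters between them. As a rough sketch of how a consumer might split such a file back into pages, the helper below parses that layout. `split_documents` and `SEPARATOR` are hypothetical names, not part of the gem, and the sketch assumes frontmatter values are plain `key: value` lines.

```ruby
# Hypothetical helper, not part of site-to-md: splits an aggregated file
# back into one hash per page. Assumes a separator line of ten or more "=".
SEPARATOR = /^={10,}$/

def split_documents(text)
  text.split(SEPARATOR).filter_map do |chunk|
    # Each chunk is expected to start with a frontmatter block between "---" lines.
    match = chunk.match(/\A\s*---\n(?<meta>.*?)\n---\n(?<body>.*)\z/m)
    next unless match

    meta = match[:meta].lines.to_h do |line|
      key, value = line.chomp.split(': ', 2)
      [key, value]
    end
    meta.merge('body' => match[:body].strip)
  end
end

sample = <<~MARKDOWN
  ---
  path: index.html
  title: Home Page
  ---

  # Welcome

  ================================================================

  ---
  path: about.html
  title: About
  ---

  About page content

  ================================================================
MARKDOWN

split_documents(sample).each do |page|
  puts "#{page['path']}: #{page['title']}"
end
```

Titles are written into the frontmatter unquoted, so a full YAML parser might reject some of them; the line-based split above sidesteps that at the cost of ignoring YAML syntax.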
data/exe/site-to-md
ADDED
data/lib/site_to_md/cli.rb
ADDED
@@ -0,0 +1,23 @@
+# frozen_string_literal: true
+
+require 'thor'
+
+module SiteToMd
+  # CLI class handles the command line interface for site-to-markdown conversion.
+  class CLI < Thor
+    desc 'convert DIRECTORY', 'Convert HTML files from DIRECTORY to markdown'
+    method_option :output, aliases: '-o', desc: 'Output file path', default: 'site_content.md'
+    def convert(directory)
+      processor = Processor.new(directory, options[:output])
+      processor.process
+      puts "Successfully converted HTML files to #{options[:output]}"
+    rescue StandardError => e
+      puts "Error: #{e.message}"
+      exit 1
+    end
+
+    def self.exit_on_failure?
+      true
+    end
+  end
+end
data/lib/site_to_md/errors.rb
ADDED
@@ -0,0 +1,27 @@
+# frozen_string_literal: true
+
+module SiteToMd
+  # Base error class for all SiteToMd errors
+  class Error < StandardError; end
+
+  # Error raised when no converter is available for a given file type
+  class UnsupportedFileTypeError < Error
+    def initialize(file)
+      super("No converter available for #{file}")
+    end
+  end
+
+  # Error raised when no files are found in the source directory
+  class NoFilesFoundError < Error
+    def initialize
+      super('No files found in the source directory')
+    end
+  end
+
+  # Error raised when the source directory is invalid
+  class InvalidSourceDirectoryError < ArgumentError
+    def initialize
+      super('Source directory is required and must exist')
+    end
+  end
+end
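One detail of this hierarchy matters to callers: `InvalidSourceDirectoryError` inherits from `ArgumentError` rather than from the gem's own `Error` base class, so rescuing the base class alone would not catch it. The sketch below illustrates this with stand-in classes. The `Sketch` module and `classify` method are hypothetical, and are defined locally so the example runs without the gem installed.

```ruby
# Stand-in definitions mirroring the hierarchy in errors.rb; the module name
# Sketch is hypothetical so the example does not depend on the gem.
module Sketch
  class Error < StandardError; end

  class NoFilesFoundError < Error
    def initialize
      super('No files found in the source directory')
    end
  end

  class InvalidSourceDirectoryError < ArgumentError
    def initialize
      super('Source directory is required and must exist')
    end
  end
end

# Raises the given error class and reports which rescue clause caught it.
def classify(error_class)
  raise error_class
rescue Sketch::Error
  :library_error
rescue ArgumentError
  :argument_error
end

puts classify(Sketch::NoFilesFoundError)
puts classify(Sketch::InvalidSourceDirectoryError)
```

Code that wants to handle every failure from the library would therefore need to rescue both `SiteToMd::Error` and `ArgumentError`, or fall back to `StandardError` as the CLI does.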
data/lib/site_to_md/file_converter.rb
ADDED
@@ -0,0 +1,60 @@
+# frozen_string_literal: true
+
+require 'nokogiri'
+
+module SiteToMd
+  # FileConverter is responsible for converting individual files to a format based on the given converter.
+  class FileConverter
+    def initialize(file_path, base_directory, converter)
+      @file_path = file_path
+      @base_directory = base_directory
+      @converter = converter
+    end
+
+    def convert
+      return nil if content.nil? || content.strip.empty?
+
+      format_document
+    end
+
+    private
+
+    def relative_path
+      @file_path.sub("#{@base_directory}/", '')
+    end
+
+    def document
+      @document ||= Nokogiri::HTML(File.read(@file_path))
+    end
+
+    def title
+      document.at_css('title')&.text || 'Untitled'
+    end
+
+    def content_element
+      @content_element ||= document.at_css('main') || document.at_css('body')
+    end
+
+    def html_content
+      content_element.to_html
+    end
+
+    def content
+      @content ||= @converter.convert(html_content)
+    end
+
+    def format_document
+      <<~DOCUMENT
+        ---
+        path: #{relative_path}
+        title: #{title}
+        ---
+
+        #{content.strip}
+
+        ================================================================
+
+      DOCUMENT
+    end
+  end
+end
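`relative_path` strips the base directory with a plain string substitution, which depends on the base directory being passed without a trailing slash. As an illustration only, and not the gem's method, the sketch below computes the same value with `Pathname#relative_path_from`, which normalizes both paths first. The helper name `relative_to` is hypothetical.

```ruby
require 'pathname'

# Hypothetical helper, not part of the gem: returns file_path relative to
# base_directory, with or without a trailing slash on the base.
def relative_to(file_path, base_directory)
  Pathname.new(file_path).relative_path_from(Pathname.new(base_directory)).to_s
end

puts relative_to('public/docs/index.html', 'public')
puts relative_to('public/docs/index.html', 'public/')
```

With the substitution approach, a base of `public/` produces the search string `public//`, which never occurs in the file path, so the `path:` value would keep its `public/` prefix.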
data/lib/site_to_md/html_converter.rb
ADDED
@@ -0,0 +1,21 @@
+# frozen_string_literal: true
+
+require 'reverse_markdown'
+
+module SiteToMd
+  # HTMLConverter uses ReverseMarkdown to convert HTML to markdown.
+  class HTMLConverter
+    def initialize
+      @config = {
+        unknown_tags: :bypass,
+        github_flavored: true,
+        tables: true,
+        tag_border: ' '
+      }
+    end
+
+    def convert(html)
+      ReverseMarkdown.convert(html, @config)
+    end
+  end
+end
data/lib/site_to_md/processor.rb
ADDED
@@ -0,0 +1,54 @@
+# frozen_string_literal: true
+
+module SiteToMd
+  # Processor collects HTML files and converts them using FileConverter.
+  class Processor
+    CONVERTERS = { '.html' => HTMLConverter.new }.freeze
+
+    def initialize(source_directory, output_file = 'site_content.md')
+      raise InvalidSourceDirectoryError if source_directory.nil? || source_directory.empty?
+      raise InvalidSourceDirectoryError unless Dir.exist?(source_directory)
+
+      @source_directory = source_directory
+      @output_file = output_file
+    end
+
+    def process
+      files = collect_files
+      raise NoFilesFoundError if files.empty?
+
+      content = convert_files(files)
+      write_output(content)
+    end
+
+    private
+
+    def collect_files
+      extensions_pattern = CONVERTERS.keys.map { |ext| ext.delete_prefix('.') }.join(',')
+      Dir.glob(File.join(@source_directory, '**', "*.{#{extensions_pattern}}"))
+    end
+
+    def convert_files(files)
+      files.each_with_object([]) do |file, output|
+        content = convert_file(file)
+        output << content if content
+      end.join("\n")
+    end
+
+    def convert_file(file)
+      extension = File.extname(file)
+      converter = CONVERTERS.fetch(extension) { raise SiteToMd::UnsupportedFileTypeError, file }
+      document = FileConverter.new(file, @source_directory, converter)
+      document.convert
+    rescue SiteToMd::UnsupportedFileTypeError
+      raise
+    rescue StandardError => e
+      warn "Error processing #{file}: #{e.message}"
+      nil
+    end
+
+    def write_output(content)
+      File.write(@output_file, content)
+    end
+  end
+end
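`collect_files` builds a single brace-expansion glob from the keys of `CONVERTERS`. The sketch below reproduces that pattern construction in isolation so its behavior with more than one extension can be seen. `EXTENSIONS`, `glob_pattern`, and `collect` are hypothetical names, and the `.htm` entry is an assumption added for illustration; the released gem registers only `.html`.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical stand-in for the keys of Processor::CONVERTERS.
# '.htm' is an illustrative addition; the gem itself lists only '.html'.
EXTENSIONS = ['.html', '.htm'].freeze

# Builds a recursive glob such as "public/**/*.{html,htm}".
def glob_pattern(source_directory, extensions)
  joined = extensions.map { |ext| ext.delete_prefix('.') }.join(',')
  File.join(source_directory, '**', "*.{#{joined}}")
end

def collect(source_directory, extensions = EXTENSIONS)
  Dir.glob(glob_pattern(source_directory, extensions)).sort
end

Dir.mktmpdir('site-sketch') do |dir|
  FileUtils.mkdir_p(File.join(dir, 'docs'))
  FileUtils.touch(File.join(dir, 'index.html'))
  FileUtils.touch(File.join(dir, 'docs', 'guide.htm'))
  FileUtils.touch(File.join(dir, 'notes.txt'))

  # Print paths relative to the temporary directory; notes.txt is not matched.
  puts collect(dir).map { |path| path.delete_prefix("#{dir}/") }.inspect
end
```

Because the glob only returns files whose extension is already a `CONVERTERS` key, the `UnsupportedFileTypeError` branch in `convert_file` looks reachable only when that private method is called directly, as the processor test does.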
data/lib/site_to_md.rb
ADDED
@@ -0,0 +1,12 @@
+# frozen_string_literal: true
+
+require 'site_to_md/cli'
+require 'site_to_md/errors'
+require 'site_to_md/file_converter'
+require 'site_to_md/html_converter'
+require 'site_to_md/processor'
+require 'site_to_md/version'
+
+# SiteToMd module serves as the namespace for all classes related to the site-to-markdown conversion.
+module SiteToMd
+end
data/test/file_converter_test.rb
ADDED
@@ -0,0 +1,50 @@
+# frozen_string_literal: true
+
+require 'test_helper'
+
+class FileConverterTest < Minitest::Test
+  include TestHelpers
+
+  def setup
+    @sample_site = fixture_path('sample_site')
+    @markdown_converter = SiteToMd::HTMLConverter.new
+  end
+
+  def test_converts_document_with_main_tag # rubocop:disable Minitest/MultipleAssertions
+    file_path = File.join(@sample_site, 'index.html')
+    converter = SiteToMd::FileConverter.new(file_path, @sample_site, @markdown_converter)
+    result = converter.convert
+
+    assert_match(/^---$/, result)
+    assert_match(/^path: index\.html$/, result)
+    assert_match(/^title: Home Page$/, result)
+    assert_match(/# Welcome/, result)
+    assert_match(/This is a test page/, result)
+  end
+
+  def test_converts_document_without_main_tag # rubocop:disable Minitest/MultipleAssertions
+    file_path = File.join(@sample_site, 'about.html')
+    converter = SiteToMd::FileConverter.new(file_path, @sample_site, @markdown_converter)
+    result = converter.convert
+
+    assert_match(/^---$/, result)
+    assert_match(/^path: about\.html$/, result)
+    assert_match(/^title: About$/, result)
+    assert_match(/About page content/, result)
+  end
+
+  def test_handles_missing_title
+    html = '<html><body><p>Content</p></body></html>'
+    temp_file = File.join(@sample_site, 'no_title.html')
+    File.write(temp_file, html)
+
+    begin
+      converter = SiteToMd::FileConverter.new(temp_file, @sample_site, @markdown_converter)
+      result = converter.convert
+
+      assert_match(/^title: Untitled$/, result)
+    ensure
+      File.delete(temp_file)
+    end
+  end
+end
data/test/html_converter_test.rb
ADDED
@@ -0,0 +1,38 @@
+# frozen_string_literal: true
+
+require 'test_helper'
+
+class HTMLConverterTest < Minitest::Test
+  def setup
+    @converter = SiteToMd::HTMLConverter.new
+  end
+
+  def test_converts_basic_html
+    html = '<h1>Title</h1><p>Content</p>'
+    result = @converter.convert(html)
+
+    assert_equal "# Title\n\nContent", result.strip
+  end
+
+  def test_converts_links
+    html = '<a href="https://example.com">Link</a>'
+    result = @converter.convert(html)
+
+    assert_equal '[Link](https://example.com)', result.strip
+  end
+
+  def test_converts_lists
+    html = '<ul><li>Item 1</li><li>Item 2</li></ul>'
+    result = @converter.convert(html)
+
+    assert_match(/- Item 1\n- Item 2/, result.strip)
+  end
+
+  def test_converts_tables
+    html = '<table><tr><th>Header</th></tr><tr><td>Data</td></tr></table>'
+    result = @converter.convert(html)
+
+    assert_match(/\| Header \|/, result)
+    assert_match(/\| Data \|/, result)
+  end
+end
data/test/processor_test.rb
ADDED
@@ -0,0 +1,58 @@
+# frozen_string_literal: true
+
+require 'test_helper'
+
+class ProcessorTest < Minitest::Test
+  include TestHelpers
+
+  def setup
+    @temp_dir = create_temp_dir
+    @sample_site = fixture_path('sample_site')
+    @output_file = File.join(@temp_dir, 'output.md')
+    @processor = SiteToMd::Processor.new(@sample_site, @output_file)
+  end
+
+  def teardown
+    remove_temp_dir(@temp_dir)
+  end
+
+  def test_raises_error_for_nonexistent_directory
+    assert_raises(ArgumentError) do
+      SiteToMd::Processor.new('nonexistent_directory')
+    end
+  end
+
+  def test_raises_error_for_unsupported_extension
+    unsupported_file = File.join(@sample_site, 'unsupported.foobar')
+    File.write(unsupported_file, 'This is test content.')
+
+    begin
+      assert_raises(SiteToMd::UnsupportedFileTypeError) do
+        @processor.send(:convert_file, unsupported_file)
+      end
+    ensure
+      FileUtils.rm_f(unsupported_file)
+    end
+  end
+
+  def test_processes_html_files # rubocop:disable Minitest/MultipleAssertions
+    @processor.process
+
+    assert_path_exists @output_file
+    content = File.read(@output_file)
+
+    assert_match(/path: index\.html/, content)
+    assert_match(/title: Home Page/, content)
+    assert_match(/# Welcome/, content)
+    assert_match(/This is a test page/, content)
+  end
+
+  def test_handles_files_without_main_tag
+    @processor.process
+    content = File.read(@output_file)
+
+    assert_match(/path: about\.html/, content)
+    assert_match(/title: About/, content)
+    assert_match(/About page content/, content)
+  end
+end
data/test/test_helper.rb
ADDED
@@ -0,0 +1,22 @@
+# frozen_string_literal: true
+
+require 'fileutils'
+require 'minitest/autorun'
+require 'minitest/pride'
+require 'site_to_md'
+
+module TestHelpers
+  def fixture_path(path)
+    File.join(File.expand_path('fixtures', __dir__), path)
+  end
+
+  def create_temp_dir
+    dir = File.join(Dir.tmpdir, "site-to-md-#{Time.now.to_i}")
+    FileUtils.mkdir_p(dir)
+    dir
+  end
+
+  def remove_temp_dir(dir)
+    FileUtils.rm_rf(dir) if dir && File.directory?(dir)
+  end
+end
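`create_temp_dir` names its directory from `Time.now.to_i`, which has one-second resolution, so two calls within the same second resolve to the same path. It also calls `Dir.tmpdir`, which comes from the `tmpdir` standard library that the helper does not require itself, so presumably another dependency loads it. A possible alternative, sketched here under the hypothetical module name `TempDirSketch`, is to let `Dir.mktmpdir` pick a unique name.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical variant of the test helper, not the gem's code.
module TempDirSketch
  module_function

  # Dir.mktmpdir creates the directory and returns its path; the prefix is
  # followed by a generated unique suffix.
  def create_temp_dir
    Dir.mktmpdir('site-to-md-')
  end

  def remove_temp_dir(dir)
    FileUtils.remove_entry(dir) if dir && File.directory?(dir)
  end
end

first = TempDirSketch.create_temp_dir
second = TempDirSketch.create_temp_dir
puts first != second
TempDirSketch.remove_temp_dir(first)
TempDirSketch.remove_temp_dir(second)
```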
metadata
ADDED
@@ -0,0 +1,113 @@
+--- !ruby/object:Gem::Specification
+name: site-to-md
+version: !ruby/object:Gem::Version
+  version: 1.0.5
+platform: ruby
+authors:
+- maier.io
+autorequire:
+bindir: exe
+cert_chain: []
+date: 2024-12-27 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: nokogiri
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.18'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.18'
+- !ruby/object:Gem::Dependency
+  name: reverse_markdown
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.0'
+- !ruby/object:Gem::Dependency
+  name: thor
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.3'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.3'
+description: |
+  A tool that extracts and combines text from HTML files into a single, streamlined markdown document.
+  It provides a command-line interface for easy usage, removes unnecessary HTML elements to reduce
+  token usage, and creates an easily uploadable format for AI tools like Claude AI or ChatGPT.
+  The tool preserves document structure and includes frontmatter metadata.
+email:
+- hello@maier.io
+executables:
+- site-to-md
+extensions: []
+extra_rdoc_files: []
+files:
+- CHANGELOG.md
+- LICENSE
+- README.md
+- exe/site-to-md
+- lib/site_to_md.rb
+- lib/site_to_md/cli.rb
+- lib/site_to_md/errors.rb
+- lib/site_to_md/file_converter.rb
+- lib/site_to_md/html_converter.rb
+- lib/site_to_md/processor.rb
+- lib/site_to_md/version.rb
+- test/file_converter_test.rb
+- test/fixtures/sample_site/about.html
+- test/fixtures/sample_site/index.html
+- test/html_converter_test.rb
+- test/processor_test.rb
+- test/test_helper.rb
+homepage: https://github.com/tmaier/site-to-md
+licenses:
+- MIT
+metadata:
+  homepage_uri: https://github.com/tmaier/site-to-md
+  source_code_uri: https://github.com/tmaier/site-to-md
+  changelog_uri: https://github.com/tmaier/site-to-md/blob/main/CHANGELOG.md
+  bug_tracker_uri: https://github.com/tmaier/site-to-md/issues
+  documentation_uri: https://github.com/tmaier/site-to-md/blob/main/README.md
+  rubygems_mfa_required: 'true'
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: 3.2.0
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubygems_version: 3.4.19
+signing_key:
+specification_version: 4
+summary: Convert static site HTML to a single markdown file
+test_files: []
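The runtime dependencies in the metadata use the pessimistic `~>` operator, and the Ruby requirement uses `>=`. As a small sketch of what those constraints admit, the snippet below checks a few versions with RubyGems' own `Gem::Requirement`. The constraint strings are copied from the metadata, while the candidate versions and the helper name `satisfied?` are made up for illustration.

```ruby
require 'rubygems'

# Hypothetical helper: true when the version satisfies the constraint string.
def satisfied?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# "~> 1.18" (nokogiri) allows 1.18 and later 1.x releases, but not 2.0.
puts satisfied?('~> 1.18', '1.18.0')
puts satisfied?('~> 1.18', '1.19.2')
puts satisfied?('~> 1.18', '2.0.0')

# ">= 3.2.0" (required_ruby_version) excludes Ruby 3.1.
puts satisfied?('>= 3.2.0', '3.1.4')
```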