@mintlify/scraping 4.0.113 → 4.0.115
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CONTRIBUTING.md +22 -106
- package/bin/cli.js +6 -1
- package/bin/cli.js.map +1 -1
- package/bin/constants.js +1 -1
- package/bin/tsconfig.build.tsbuildinfo +1 -1
- package/package.json +4 -4
- package/src/cli.ts +7 -1
- package/src/constants.ts +1 -1
package/CONTRIBUTING.md
CHANGED

@@ -1,131 +1,47 @@
-#
+# Mintlify Scraping CLI
 
-
+## Installation
 
-
-
-
-
-To uninstall locally, run `npm uninstall @mintlify/scraping -g`.
+```sh
+npm i -g @mintlify/scraping
+```
 
-
+### Uninstall
 
-
+To uninstall, run `npm uninstall -g @mintlify/scraping`.
 
-##
+## Usage
 
 There are three main commands:
 
-
-
-
-
-
-
-Scraping a page downloads a single page’s content. Scraping a section goes through the navigation and scrapes each page. The code for downloading a page’s content is shared between the two commands. Scraping an OpenAPI file supports either a file path or an HTTPS URL.
-
-Important files: `scraping/scrapePageCommands.ts`, `scraping/scrapeSectionAutomatically.ts`
-
-We have `scrape-gitbook-page` and similar commands for debugging. Ignore them, they just call internal functions directly. You should not need to use them unless you are debugging issues with Detecting Frameworks.
-
-## Overwriting
-
-The user has to add a `--overwrite` flag if they want to overwrite their current files.
-
-## Sections vs Websites
-
-We call the command `scrape-section` instead of `scrape-website` because we cannot scrape pages not in the navigation of the URL first passed in. For example, ReadMe has API Reference and other sections accessible through a separate top-navigation which we do not parse. We only scrape the navigation on the left: [https://docs.readme.com/main/docs](https://docs.readme.com/main/docs)
+```sh
+mintlify-scrape page [url] # for scraping a single page in your docs
+mintlify-scrape section [url] # for scraping your entire docs site
+mintlify-scrape openapi-file [url] # for scraping an OpenAPI spec into the required MDX format for Mintlify to display
+```
 
 ## Detecting Frameworks
 
-The
-
-Each framework’s scrapers live in `scraping/site-scrapers/`
+The CLI will automatically detect the framework of the passed in URL and scrape each page accordingly.
 
 We currently support:
 
 - Docusaurus
 - GitBook
 - ReadMe
-- Intercom
-
-## Terminal Output
-
-We print a line in the terminal for every file we write. `util.ts` has a createPage function that takes care of writing the file and logging.
 
-
+## Output
 
-
+You will get an output of all the pages that we scraped into MDX for you, as well as the `docs.json` config file that allows you to use the Mintlify platform.
 
-
-Add the following to your navigation in mint.json:
+We recommend running the CLI in a new directory, because it will add every page from the docs you scrape into the folder you're currently running in:
 
-
-
-
-
+```sh
+mkdir <new-docs-folder>
+cd <new-docs-folder>
+mintlify-scrape section <url>
 ```
 
-# Navigation Scraping
-
-Most sites use JavaScript to open navigation menus which do not automatically include the menu buttons in the HTML. We use Puppeteer to click every nested menu so the site adds the menu buttons to the HTML. For example the original site’s HTML:
-
-```jsx
-<div>
-<a id="my-nested-menu"></a>
-</div>
-```
-
-can turn into this after opening the nested menu:
-
-```jsx
-<div>
-<a id="my-nested-menu" aria-expanded=true></a>
-<div>
-<a href="/page"></a>
-<a href="/other-page"></a>
-</div>
-</div>
-```
-
-Ultimately, all section scrapers need to find an array of links to visit then call the scrape page function in a loop.
-
-We use axios instead of Puppeteer if a site doesn’t hide links. Puppeteer is slow.
-
 # Image File Locations
 
-Images go in an `images/` folder
-
-# Cheerio
-
-Cheerio is a library to scrape/handle the HTML after we have it in a string. Most of the work is using inspect-element to view a website and figure out where the content we want is, then writing the corresponding Cheerio code.
-
-# HTML to MDX
-
-We use an open-source library to convert HTML to Markdown: https://github.com/crosstype/node-html-markdown
-
-The `util.ts` createPage function assembles the MDX metadata, we just need to return an object of the form `{ title, description, content }` from each page scraper.
-
-## Parsing Issues
-
-Parsing struggles when documentation websites are using non-standard HTML. For example, code blocks are supposed to use `<pre><code></code></pre>` but GitBook just uses divs.
-
-We can write custom translators for the library that determine how we parse certain objects.
-
-In some cases, we will want custom translators even if parsing succeeds. For example, ReadMe callouts are using quote syntax
-
-```jsx
-> 💡
-> Callout text
->
-```
-
-When we want to convert them to:
-
-```jsx
-<Tip>Callout text</Tip>
-```
-
-## Regex
-
-You can use regex to make small changes where translators are overkill or there’s no obvious component to modify. For example, here’s the end of `scrapeDocusaurusPage.ts`:
+Images go in an `images/` folder and map 1:1 with the docs they were scraped from. For example, if there's a page under the URL path `/integrations/payments/stripe`, any images on that page will be scraped to `/images/integrations/payments/stripe/image.png`.
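The image-path convention added on the new line 47 can be sketched as a small helper. This is a hypothetical illustration of the mapping only; the scraper's real implementation is not part of this diff.

```javascript
// Hypothetical helper illustrating the 1:1 page-path -> image-path mapping;
// not the package's actual code.
function imageOutputPath(pagePath, imageName) {
  // Strip leading/trailing slashes so '/integrations/payments/stripe'
  // maps to 'images/integrations/payments/stripe/<imageName>'.
  const trimmed = pagePath.replace(/^\/+|\/+$/g, '');
  return ['images', trimmed, imageName].join('/');
}

console.log(imageOutputPath('/integrations/payments/stripe', 'image.png'));
// → images/integrations/payments/stripe/image.png
```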
package/bin/cli.js
CHANGED

@@ -1,4 +1,5 @@
 #!/usr/bin/env node
+import { upgradeToDocsConfig } from '@mintlify/validation';
 import yargs from 'yargs';
 import { hideBin } from 'yargs/helpers';
 import { FINAL_SUCCESS_MESSAGE } from './constants.js';

@@ -100,7 +101,11 @@ async function site(url) {
     log('Successfully retrieved initial HTML from src: ' + urlObj.toString());
     const result = await scrapeAllSiteTabs(html, urlObj);
     if (result.success) {
-
+        const mintConfig = result.data;
+        const docsConfig = upgradeToDocsConfig(mintConfig, {
+            shouldUpgradeTheme: true,
+        });
+        write('docs.json', JSON.stringify(docsConfig, undefined, 2));
         log(FINAL_SUCCESS_MESSAGE);
     }
     else {
package/bin/cli.js.map
CHANGED

@@ -1 +1 @@
(generated source map for cli.js regenerated; base64-VLQ mappings omitted)
package/bin/constants.js
CHANGED

@@ -43,5 +43,5 @@ ${SPACES}We currently support: ReadMe, GitBook, and Docusaurus`;
 export const MDAST_FAILURE_MSG = 'failed to convert MDAST to Markdown string';
 export const FINAL_SUCCESS_MESSAGE = `We've successfully scraped your docs site.
 ${SPACES}We've downloaded the ${activeColors.cyan}\`navigation\`${activeColors.default} array (and if necessary, the ${activeColors.cyan}\`tabs\`${activeColors.default} array)
-${SPACES}into ${activeColors.blue}\`
+${SPACES}into ${activeColors.blue}\`docs.json\`${activeColors.default}.`;
 //# sourceMappingURL=constants.js.map
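The message change swaps the old filename for `docs.json` inside a color-coded template literal. A sketch of how the pieces assemble, assuming `activeColors` maps to standard ANSI escape codes and `SPACES` is a plain indentation string (neither definition appears in this diff):

```javascript
// Assumed ANSI escape sequences; the package's actual activeColors map may differ.
const activeColors = { cyan: '\x1b[36m', blue: '\x1b[34m', default: '\x1b[0m' };
const SPACES = '    '; // assumed indentation constant

const FINAL_SUCCESS_MESSAGE = `We've successfully scraped your docs site.
${SPACES}We've downloaded the ${activeColors.cyan}\`navigation\`${activeColors.default} array (and if necessary, the ${activeColors.cyan}\`tabs\`${activeColors.default} array)
${SPACES}into ${activeColors.blue}\`docs.json\`${activeColors.default}.`;

console.log(FINAL_SUCCESS_MESSAGE);
```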