npm - @yuji-min/google-docs-parser - Versions diffs - 1.0.0 - Mend

@yuji-min/google-docs-parser 1.0.0

Files changed (8)

package/README.md ADDED

@@ -0,0 +1,242 @@
+# 📄 Google Docs Parser
+<h1 align="center">
+	<img width="200px" src="media/logo.png" alt="Google Docs Parser logo">
+</h1>
+![TypeScript](https://img.shields.io/badge/TypeScript-5.0+-blue?logo=typescript)
+![License](https://img.shields.io/badge/License-MIT-green)
+![Status](https://img.shields.io/badge/Status-Beta-orange)
+![npm](https://img.shields.io/npm/v/@yuji-min/google-docs-parser)
+**Turn your Google Docs into a Headless CMS.**
+`google-docs-parser` is a TypeScript library that transforms raw Google Docs content into structured JSON data based on a user-defined schema. Stop wrestling with the raw Google Docs API structure—define your schema and get clean data instantly.
+---
+## 🚀 Why use this?
+Parsing the raw `docs_v1.Schema$Document` JSON from the Google API is complex. It involves handling deep nesting of `structuralElements`, `paragraph`, `elements`, and `textRun`, along with varying styling attributes.
+This library solves that complexity by allowing you to define a **Schema** that maps your document's visual structure (Headings, Lists, Key-Values) directly to data structures.
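
To make the nesting concrete, here is a minimal sketch of the flattening step that any parser of this response has to perform. The interfaces are simplified local stand-ins for the `googleapis` types, and `paragraphText`, `listParagraphs`, and `sampleDoc` are illustrative names, not part of the package's public API.

```typescript
// Simplified stand-ins for the relevant parts of docs_v1.Schema$Document.
interface TextRun { content?: string | null }
interface ParagraphElement { textRun?: TextRun }
interface Paragraph {
  elements?: ParagraphElement[];
  paragraphStyle?: { namedStyleType?: string | null };
}
interface StructuralElement { paragraph?: Paragraph }
interface DocumentLike { body?: { content?: StructuralElement[] } }

// Concatenate every text run of a paragraph into one trimmed line.
function paragraphText(paragraph: Paragraph): string {
  return (paragraph.elements ?? [])
    .map((el) => el.textRun?.content ?? "")
    .join("")
    .trim()
    .replace(/\n/g, " ");
}

// Walk body.content, skip non-paragraph elements and empty lines,
// and pair each remaining line with its named style.
function listParagraphs(doc: DocumentLike): { style: string; text: string }[] {
  return (doc.body?.content ?? [])
    .map((el) => el.paragraph)
    .filter((p): p is Paragraph => p !== undefined)
    .map((p) => ({
      style: p.paragraphStyle?.namedStyleType ?? "NORMAL_TEXT",
      text: paragraphText(p),
    }))
    .filter((p) => p.text.length > 0);
}

// Hand-written sample shaped like an API response (illustrative data only).
const sampleDoc: DocumentLike = {
  body: {
    content: [
      {
        paragraph: {
          paragraphStyle: { namedStyleType: "HEADING_1" },
          elements: [{ textRun: { content: "Profile\n" } }],
        },
      },
      {
        paragraph: {
          elements: [
            { textRun: { content: "Senior Software Engineer " } },
            { textRun: { content: "based in Seoul.\n" } },
          ],
        },
      },
      {}, // a structural element without a paragraph, such as a table
      { paragraph: { elements: [{ textRun: { content: "\n" } }] } }, // empty line
    ],
  },
};

console.log(listParagraphs(sampleDoc));
```

The schema-driven parser then works on this flat list of styled lines rather than on the raw nested response.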
+### ✨ Key Features
+- **Type-Safe:** The return type is automatically inferred from your schema configuration using TypeScript generics.
+- **Hierarchical Parsing:** Supports nested tree structures (e.g., _Heading 2_ containing _Heading 3_ children).
+- **Smart Text Parsing:** Built-in parsers for:
+  - Key-Value pairs (e.g., `Role: Engineer`)
+  - Delimited fields (e.g., `2024 | Senior Dev | Google`)
+  - Flattened lists or grouped arrays.
+- **Auth Ready:** Seamless integration with `google-auth-library` and `googleapis`.
+---
+## 📦 Installation
+```bash
+npm install @yuji-min/google-docs-parser googleapis google-auth-library
+# or
+yarn add @yuji-min/google-docs-parser googleapis google-auth-library
+```
+---
+## 🔑 Authentication & Setup
+To use this library, you need a Google Cloud Service Account with access to the Google Docs API.
+### 1. Create Google Cloud Credentials
+1.  Go to the [Google Cloud Console](https://console.cloud.google.com/).
+2.  Create a new project (or select an existing one).
+3.  Enable the **Google Docs API** in the "APIs & Services" > "Library" section.
+4.  Go to "IAM & Admin" > "Service Accounts" and create a new service account.
+5.  Create and download a **JSON key** for this service account.
+### 2. Configure Environment Variable
+Set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of your downloaded JSON key file.
+**Mac/Linux:**
+```bash
+export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/service-account-key.json"
+```
+**Windows (PowerShell):**
+```powershell
+$env:GOOGLE_APPLICATION_CREDENTIALS="C:\path\to\your\service-account-key.json"
+```
+> **Note:** The library uses `google-auth-library` internally, which automatically looks for credentials at the path defined in this environment variable.
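
If you rely only on this variable, a small pre-flight check can surface a missing path before any API call is made. This is a sketch, not library code: `requireCredentialsPath` is a hypothetical helper, and it is stricter than `google-auth-library`, which can also fall back to other credential sources.

```typescript
// Shape of an environment map such as Node's process.env.
type Env = { [name: string]: string | undefined };

// Hypothetical helper: return the configured key path or fail with a clear message.
function requireCredentialsPath(env: Env): string {
  const path = env["GOOGLE_APPLICATION_CREDENTIALS"];
  if (path === undefined || path.trim() === "") {
    throw new Error(
      "GOOGLE_APPLICATION_CREDENTIALS is not set; point it at your service account JSON key."
    );
  }
  return path;
}

// In a Node script you would typically call: requireCredentialsPath(process.env)
console.log(
  requireCredentialsPath({
    GOOGLE_APPLICATION_CREDENTIALS: "/path/to/your/service-account-key.json",
  })
);
```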
+### 3. Share the Document (Important!)
+Google Docs are private by default. You must share the target document with your Service Account's email address (the `client_email` field of your JSON key, e.g., `my-bot@my-project.iam.gserviceaccount.com`) and grant it Viewer permission.
+**If you don't do this, `getParsedDocument` will throw an error telling you to check the Doc ID and Service Account permissions.**
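
The address to share the document with is stored in the `client_email` field of a service account key file. The sketch below reads it from the key's JSON text; `serviceAccountEmail` is a hypothetical helper and the sample key is a trimmed-down illustration, not a real credential.

```typescript
// Hypothetical helper: extract the service account address from the key's JSON text.
function serviceAccountEmail(keyJson: string): string {
  const key = JSON.parse(keyJson) as { client_email?: unknown };
  const email = key.client_email;
  if (typeof email !== "string" || email === "") {
    throw new Error("No client_email field found in the key file.");
  }
  return email;
}

// Illustrative key content; real key files contain more fields.
const sampleKey = JSON.stringify({
  type: "service_account",
  client_email: "my-bot@my-project.iam.gserviceaccount.com",
});

console.log(serviceAccountEmail(sampleKey));
```

In a Node script, the JSON text would usually come from reading the file named by `GOOGLE_APPLICATION_CREDENTIALS`.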
+---
+## 🛠️ Getting Started
+### 1. Prepare your Google Doc
+Imagine a Google Doc structured like a resume or project list:
+> **Profile** (Heading 1)
+>
+> Senior Software Engineer based in Seoul.
+>
+> **Experience** (Heading 1)
+>
+> **Tech Corp | Backend Lead** (Heading 2)
+>
+> - Designed microservices architecture
+> - Managed a team of 5
+>
+> **Startup Inc | Full Stack** (Heading 2)
+>
+> - Built MVP in 3 months
+### 2. Define Schema & Parse
+Create a schema object that mirrors the visual hierarchy of your document.
+```typescript
+import { getParsedDocument, ParseSchema } from "@yuji-min/google-docs-parser";
+// 1. Define the schema
+const resumeSchema = {
+  sections: [
+    {
+      // Matches a "Heading 1" named 'Profile'
+      title: { name: "Profile", namedStyleType: "HEADING_1" },
+      // content is undefined -> defaults to simple text block
+    },
+    {
+      // Matches a "Heading 1" named 'Experience'
+      title: { name: "Experience", namedStyleType: "HEADING_1" },
+      content: {
+        kind: "tree", // This section is a hierarchical tree
+        node: {
+          // The tree nodes start with "Heading 2"
+          // We can also parse the heading text itself!
+          title: {
+            namedStyleType: "HEADING_2",
+            keys: ["company", "role"],
+            delimiter: "|",
+          },
+          // Under each H2, treat the content as a list
+          content: { kind: "list" },
+        },
+      },
+    },
+  ],
+} as const; // 'as const' is CRITICAL for type inference
+// 2. Fetch and Parse
+async function main() {
+  const docId = "YOUR_GOOGLE_DOC_ID";
+  try {
+    // 'data' is fully typed based on resumeSchema!
+    const data = await getParsedDocument(docId, resumeSchema);
+    console.log(JSON.stringify(data, null, 2));
+  } catch (error) {
+    console.error(error);
+  }
+}
+main();
+```
+### 3. The Result
+```json
+{
+  "Profile": "Senior Software Engineer based in Seoul.",
+  "Experience": [
+    {
+      "title": { "company": "Tech Corp", "role": "Backend Lead" },
+      "content": ["Designed microservices architecture", "Managed a team of 5"]
+    },
+    {
+      "title": { "company": "Startup Inc", "role": "Full Stack" },
+      "content": ["Built MVP in 3 months"]
+    }
+  ]
+}
+```
+---
+## 📚 Parsing Schema Guide
+The `ParseSchema` object controls how the parser reads your document.
+### Section Configuration
+| Property               | Type     | Description                                                                                    |
+| :--------------------- | :------- | :--------------------------------------------------------------------------------------------- |
+| `title.name`           | `string` | The text of the heading to find (case-insensitive). This becomes the key in the result object. |
+| `title.namedStyleType` | `string` | The Google Docs style to match (e.g., `HEADING_1`, `TITLE`).                                   |
+| `content`              | `Object` | (Optional) Defines the content structure. If omitted, parses as a text block.                  |
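
As a rough illustration of how the two `title` properties work together, the sketch below matches a heading against a list of section titles: the style must be equal and the text must match after trimming and lower-casing. It mirrors the behaviour visible in the bundled `getSectionTitle`, but `matchSectionTitle` and `SectionTitle` are simplified names chosen for this example.

```typescript
interface SectionTitle { name: string; namedStyleType: string }

// Return the schema's own spelling of the section name, or null when nothing matches.
function matchSectionTitle(
  headingText: string,
  headingStyle: string,
  titles: SectionTitle[]
): string | null {
  const normalized = headingText.trim().toLowerCase();
  for (const title of titles) {
    const sameStyle = title.namedStyleType === headingStyle;
    const sameText = title.name.trim().toLowerCase() === normalized;
    if (sameStyle && sameText) return title.name;
  }
  return null;
}

const sectionTitles: SectionTitle[] = [
  { name: "Profile", namedStyleType: "HEADING_1" },
  { name: "Experience", namedStyleType: "HEADING_1" },
];

console.log(matchSectionTitle("  EXPERIENCE ", "HEADING_1", sectionTitles));
console.log(matchSectionTitle("Experience", "HEADING_2", sectionTitles));
```

Because the returned key is the schema's spelling, a heading typed as "EXPERIENCE" in the document still ends up under `Experience` in the result.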
+### Content Kinds
+#### 1. Text Block (Default)
+If `content` is undefined, the parser collects all paragraphs following the header until the next section or heading starts, joining them with single spaces into one string.
+#### 2. List (`kind: "list"`)
+Parses each paragraph after the heading into one entry of an array. Useful for bullet points or simple lists.
+- **`isFlatten`**: (boolean) If true, the items split from each line are merged into one flat array instead of one array per line.
+- **`keyDelimiter`**: (string) Parses "Key: Value" lines into `{ key: "...", value: [...] }` objects. Lines with no key before the delimiter are kept as plain strings.
+- **`delimiter`**: (string) Splits a line by a character into an array. Defaults to `,`.
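
The options above can be sketched as a small line parser. This is a simplified re-implementation modelled on the bundled `parseStructuredText`, written for illustration only; `parseLine`, `parseLines`, and the type names are assumptions of this sketch, and `keys` is included because the bundled code accepts it for list content as well.

```typescript
interface ListOptions {
  delimiter?: string;
  keyDelimiter?: string;
  keys?: string[];
  isFlatten?: boolean;
}
type KeyedList = { key: string; value: string[] };
type Fields = { [key: string]: string };
type ParsedLine = string | string[] | KeyedList | Fields;

// Split, trim, and drop empty items.
function splitItems(text: string, delimiter: string): string[] {
  return text.split(delimiter).map((t) => t.trim()).filter((t) => t.length > 0);
}

function parseLine(text: string, options: ListOptions): ParsedLine {
  const delimiter = options.delimiter || ","; // comma is the default
  if (options.keyDelimiter) {
    const at = text.indexOf(options.keyDelimiter);
    if (at <= 0) return text; // no key before the delimiter: keep the raw line
    const rest = text.substring(at + options.keyDelimiter.length);
    return { key: text.substring(0, at).trim(), value: splitItems(rest, delimiter) };
  }
  if (options.keys && options.keys.length > 0) {
    // Positional fields: missing values become empty strings.
    const values = text.split(delimiter).map((t) => t.trim());
    const fields: Fields = {};
    options.keys.forEach((key, i) => {
      const value = values[i];
      fields[key] = value === undefined ? "" : value;
    });
    return fields;
  }
  return splitItems(text, delimiter);
}

// Parse every line; with isFlatten, array results are merged into one flat list.
function parseLines(lines: string[], options: ListOptions): ParsedLine[] {
  const result: ParsedLine[] = [];
  for (const line of lines) {
    const parsed = parseLine(line, options);
    if (options.isFlatten && Array.isArray(parsed)) {
      for (const item of parsed) result.push(item);
    } else {
      result.push(parsed);
    }
  }
  return result;
}

console.log(parseLine("Role: Engineer", { keyDelimiter: ":" }));
console.log(parseLine("2024 | Senior Dev | Google", { keys: ["year", "role", "company"], delimiter: "|" }));
console.log(parseLines(["TypeScript, Go", "Rust"], { isFlatten: true }));
```

Note the precedence in this sketch: `keyDelimiter` wins over `keys`, and plain delimiter splitting is the fallback.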
+#### 3. Tree (`kind: "tree"`)
+Parses hierarchical structures. Ideal for nested sections like "H2 -> H3 -> Content".
+- **`node`**: Defines the schema for the child nodes.
+- **Strict Nesting**: The parser automatically stops collecting children when it encounters a heading of the same or higher level (e.g., an H2 stops an open H2 block).
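
The nesting rule can be illustrated with a generic stack-based sketch: a heading closes every open block of the same or a higher level before it is attached. This is not the library's implementation, which walks the document recursively according to the schema; `Line`, `TreeNode`, and `nest` are names invented for this example, and levels are plain numbers (0 for normal text, 2 for Heading 2, and so on).

```typescript
interface Line { level: number; text: string }
interface TreeNode { title: string; content: (string | TreeNode)[] }

function nest(lines: Line[]): TreeNode[] {
  const roots: TreeNode[] = [];
  // Currently open headings, outermost first.
  const open: { level: number; node: TreeNode }[] = [];
  for (const line of lines) {
    if (line.level === 0) {
      // Plain text belongs to the innermost open heading, if there is one.
      if (open.length > 0) open[open.length - 1].node.content.push(line.text);
      continue;
    }
    // A heading closes every open block of the same or a higher level.
    while (open.length > 0 && open[open.length - 1].level >= line.level) {
      open.pop();
    }
    const node: TreeNode = { title: line.text, content: [] };
    if (open.length > 0) open[open.length - 1].node.content.push(node);
    else roots.push(node);
    open.push({ level: line.level, node });
  }
  return roots;
}

// Illustrative input: two Heading 2 blocks, the first with Heading 3 children.
const sampleLines: Line[] = [
  { level: 2, text: "Tech Corp" },
  { level: 3, text: "Backend" },
  { level: 0, text: "Designed microservices architecture" },
  { level: 3, text: "Team" },
  { level: 0, text: "Managed a team of 5" },
  { level: 2, text: "Startup Inc" },
  { level: 0, text: "Built MVP in 3 months" },
];

console.log(JSON.stringify(nest(sampleLines), null, 2));
```

One difference to keep in mind: in the bundled code, plain text directly under a node whose own content is a tree is skipped, whereas this sketch attaches it to the nearest open heading.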
+---
+## 🧪 Testing
+We use **Vitest** for testing. The repository includes a comprehensive test suite covering parsers, cursors, and authentication logic.
+```bash
+# Run tests
+npm test
+# Run tests with coverage
+npm run test:coverage
+```
+---
+## 🤝 Contributing
+Contributions are welcome! If you find a bug or have a feature request, please open an issue.
+1.  Fork the repository
+2.  Create your feature branch (`git checkout -b feature/amazing-feature`)
+3.  Commit your changes (`git commit -m 'Add some amazing feature'`)
+4.  Push to the branch (`git push origin feature/amazing-feature`)
+5.  Open a Pull Request
+---
+## 📃 License
+This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

package/dist/index.cjs ADDED

@@ -0,0 +1,426 @@
+'use strict';
+var googleapis = require('googleapis');
+var googleAuthLibrary = require('google-auth-library');
+// src/constants.ts
+var VALID_NAMED_STYLES = [
+  "HEADING_1",
+  "HEADING_2",
+  "HEADING_3",
+  "HEADING_4",
+  "HEADING_5",
+  "HEADING_6",
+  "TITLE",
+  "SUBTITLE"
+];
+var VALID_NAMED_STYLES_SET = new Set(VALID_NAMED_STYLES);
+// src/utils.ts
+function extractParagraphText(paragraph) {
+  const elements = paragraph.elements ?? [];
+  const text = elements.map((el) => el.textRun?.content || "").join("").trim().replace(/\n/g, " ");
+  return text || "";
+}
+function hasNamedStyle(paragraph, namedStyleType) {
+  if (!namedStyleType) return false;
+  const style = paragraph.paragraphStyle?.namedStyleType;
+  return style === namedStyleType;
+}
+function getParagraphNamedStyleType(paragraph) {
+  if (!paragraph.paragraphStyle && !paragraph.elements?.length) {
+    return void 0;
+  }
+  if (!paragraph.paragraphStyle) {
+    return "NORMAL_TEXT";
+  }
+  const namedStyleType = paragraph.paragraphStyle.namedStyleType;
+  if (!namedStyleType) {
+    return "NORMAL_TEXT";
+  }
+  if (VALID_NAMED_STYLES_SET.has(namedStyleType)) {
+    return namedStyleType;
+  }
+  return void 0;
+}
+function isNamedStyleType(style) {
+  if (typeof style !== "string") return false;
+  if (style === "NORMAL_TEXT") return false;
+  return VALID_NAMED_STYLES_SET.has(style);
+}
+function splitAndTrim(text, delimiter, filterEmpty = false) {
+  if (text === "") {
+    return [];
+  }
+  const items = text.split(delimiter).map((t) => t.trim());
+  return filterEmpty ? items.filter((t) => t.length > 0) : items;
+}
+function parseToKeyedList(text, keyDelimiter, listDelimiter) {
+  const delimiterIndex = text.indexOf(keyDelimiter);
+  if (delimiterIndex <= 0) {
+    return text;
+  }
+  const key = text.substring(0, delimiterIndex).trim();
+  const valuePart = text.substring(delimiterIndex + keyDelimiter.length).trim();
+  const value = valuePart ? splitAndTrim(valuePart, listDelimiter, true) : [];
+  return { key, value };
+}
+function parseToFields(text, keys, delimiter) {
+  const values = splitAndTrim(text, delimiter, false);
+  return keys.reduce((acc, key, index) => {
+    const value = values[index];
+    acc[key] = value !== void 0 && value !== "" ? value : "";
+    return acc;
+  }, {});
+}
+function parseDelimitedList(text, delimiter) {
+  const values = text.split(delimiter).map((v) => v.trim()).filter((v) => v.length > 0);
+  return values;
+}
+function parseStructuredText(text, schema) {
+  const delimiter = schema.delimiter || ",";
+  if (schema.keyDelimiter) {
+    return parseToKeyedList(text, schema.keyDelimiter, delimiter);
+  }
+  if (schema.keys && schema.keys.length > 0) {
+    return parseToFields(text, schema.keys, delimiter);
+  }
+  return parseDelimitedList(text, delimiter);
+}
+// src/tree.ts
+var CONTENT_KEY = "content";
+function collectNodeStylesRecursive(node, set) {
+  if (node.title.namedStyleType) set.add(node.title.namedStyleType);
+  if (node.content && node.content.kind === "tree") {
+    collectNodeStylesRecursive(node.content.node, set);
+  }
+}
+function createNodeFromTitle(text, titleSchema) {
+  const hasDelimiterSchema = !!titleSchema.keyDelimiter || !!titleSchema.keys && titleSchema.keys.length > 0 || !!titleSchema.delimiter;
+  if (hasDelimiterSchema) {
+    const structuredText = parseStructuredText(text, titleSchema);
+    if (Array.isArray(structuredText)) {
+      return { title: structuredText, content: [] };
+    } else if (typeof structuredText === "object" && structuredText !== null) {
+      return { title: structuredText, content: [] };
+    } else {
+      return { title: text, content: [] };
+    }
+  } else {
+    return { title: text, content: [] };
+  }
+}
+function determineTreeParsingAction(paragraph, cursor, nodeSchema, ancestorNodeList) {
+  const nodeTitleStyle = nodeSchema.title.namedStyleType;
+  const childNode = nodeSchema.content?.kind === "tree" ? nodeSchema.content.node : void 0;
+  const isCurrentNodeTitle = hasNamedStyle(paragraph, nodeTitleStyle);
+  const isChildNodeTitle = !!childNode && hasNamedStyle(paragraph, childNode.title.namedStyleType);
+  const isAncestorNodeTitle = ancestorNodeList.some(
+    (a) => hasNamedStyle(paragraph, a.title.namedStyleType)
+  );
+  const isInThisTree = isCurrentNodeTitle || isChildNodeTitle || isAncestorNodeTitle;
+  const isHeading = cursor.isAtParagraphHeading();
+  const isAtNewSection = cursor.isAtNewSection();
+  const isHeadingOutsideThisTree = isHeading && !isInThisTree;
+  const isHigherLevelHeading = isHeading && isAncestorNodeTitle;
+  const isSameLevelHeading = isHeading && isCurrentNodeTitle;
+  if (isAtNewSection) return { kind: "exitSection" };
+  if (isCurrentNodeTitle) return { kind: "createNode" };
+  if (isChildNodeTitle) return { kind: "startChildNode" };
+  if (isHigherLevelHeading || isSameLevelHeading)
+    return { kind: "finishCurrentNode" };
+  if (isHeadingOutsideThisTree) return { kind: "exitSection" };
+  return { kind: "appendDetail" };
+}
+function parseTreeSection(cursor, section, allNodeTitleStyles) {
+  const treeContent = section.content;
+  const nodeSchema = treeContent?.kind === "tree" ? treeContent.node : void 0;
+  if (!treeContent || !nodeSchema) return [];
+  while (!cursor.isEndOfDocument()) {
+    const info = cursor.getCurrentParagraph();
+    if (!info) {
+      cursor.getNextParagraph();
+      continue;
+    }
+    const { paragraph, style } = info;
+    if (cursor.isAtNewSection()) break;
+    if (cursor.isAtParagraphHeading() && !allNodeTitleStyles.has(style)) break;
+    if (hasNamedStyle(paragraph, nodeSchema.title.namedStyleType)) {
+      return parseTreeNode(cursor, nodeSchema, [], allNodeTitleStyles);
+    }
+    cursor.getNextParagraph();
+  }
+  return [];
+}
+function parseTreeNode(cursor, nodeSchema, ancestorNodeList, allNodeTitleStyles) {
+  const result = [];
+  let currentNode = null;
+  const childNodeSchema = nodeSchema.content?.kind === "tree" ? nodeSchema.content.node : void 0;
+  while (!cursor.isEndOfDocument()) {
+    const info = cursor.getCurrentParagraph();
+    if (!info) {
+      cursor.getNextParagraph();
+      continue;
+    }
+    const { text, paragraph, style } = info;
+    if (cursor.isAtNewSection()) {
+      return result;
+    }
+    if (cursor.isAtParagraphHeading() && !allNodeTitleStyles.has(style)) {
+      return result;
+    }
+    const decision = determineTreeParsingAction(
+      paragraph,
+      cursor,
+      nodeSchema,
+      ancestorNodeList
+    );
+    switch (decision.kind) {
+      case "exitSection": {
+        return result;
+      }
+      case "createNode": {
+        currentNode = createNodeFromTitle(text, nodeSchema.title);
+        result.push(currentNode);
+        cursor.getNextParagraph();
+        break;
+      }
+      case "startChildNode": {
+        if (!currentNode || !Array.isArray(currentNode[CONTENT_KEY]) || !childNodeSchema) {
+          cursor.getNextParagraph();
+          break;
+        }
+        const children = parseTreeNode(
+          cursor,
+          childNodeSchema,
+          [nodeSchema, ...ancestorNodeList],
+          allNodeTitleStyles
+        );
+        currentNode[CONTENT_KEY].push(...children);
+        break;
+      }
+      case "finishCurrentNode": {
+        if (currentNode) return result;
+        cursor.getNextParagraph();
+        break;
+      }
+      case "appendDetail": {
+        if (nodeSchema.content?.kind === "tree") {
+          cursor.getNextParagraph();
+          break;
+        }
+        if (currentNode && Array.isArray(currentNode[CONTENT_KEY])) {
+          currentNode[CONTENT_KEY].push(text.trim());
+        }
+        cursor.getNextParagraph();
+        break;
+      }
+    }
+  }
+  return result;
+}
+// src/list.ts
+function parseListSection(cursor, section) {
+  const result = [];
+  if (!section.content || section.content.kind !== "list") {
+    return [];
+  }
+  const contentSchema = section.content;
+  while (!cursor.isEndOfDocument()) {
+    const info = cursor.getCurrentParagraph();
+    if (!info) {
+      cursor.getNextParagraph();
+      continue;
+    }
+    if (cursor.isAtNewSection()) break;
+    if (cursor.isAtParagraphHeading()) break;
+    const parsed = parseStructuredText(info.text, contentSchema);
+    if (contentSchema.isFlatten && Array.isArray(parsed)) {
+      result.push(...parsed);
+    } else {
+      result.push(parsed);
+    }
+    cursor.getNextParagraph();
+  }
+  return result;
+}
+// src/textBlock.ts
+function parseTextBlockSection(cursor) {
+  const textPartList = [];
+  while (!cursor.isEndOfDocument()) {
+    const paragraph = cursor.getCurrentParagraph();
+    if (!paragraph) {
+      cursor.getNextParagraph();
+      continue;
+    }
+    if (cursor.isAtNewSection()) break;
+    if (cursor.isAtParagraphHeading()) break;
+    textPartList.push(paragraph.text);
+    cursor.getNextParagraph();
+  }
+  return textPartList.join(" ");
+}
+// src/section.ts
+function getSectionTitle(paragraph, text, parseSchema) {
+  const normalized = text.trim().toLowerCase();
+  const sectionList = parseSchema.sections ?? [];
+  for (const section of sectionList) {
+    const { name, namedStyleType } = section.title;
+    if (name && namedStyleType) {
+      const styleMatches = hasNamedStyle(paragraph, namedStyleType);
+      const textMatches = normalized === name.trim().toLowerCase();
+      if (styleMatches && textMatches) {
+        return name;
+      }
+    }
+  }
+  return null;
+}
+function parseSectionContent(cursor, section) {
+  const content = section.content;
+  if (!content) {
+    return parseTextBlockSection(cursor);
+  }
+  switch (content.kind) {
+    case "tree": {
+      if (!content.node) {
+        return [];
+      }
+      const allNodeTitleStyles = /* @__PURE__ */ new Set();
+      collectNodeStylesRecursive(content.node, allNodeTitleStyles);
+      return parseTreeSection(cursor, section, allNodeTitleStyles);
+    }
+    case "list":
+      return parseListSection(cursor, section);
+    default:
+      return parseTextBlockSection(cursor);
+  }
+}
+// src/cursor.ts
+function getParagraph(paragraph) {
+  const text = extractParagraphText(paragraph);
+  if (!text) return null;
+  const style = getParagraphNamedStyleType(paragraph);
+  return { text, style, paragraph };
+}
+var ParagraphCursor = class {
+  constructor(paragraphList, parseSchema) {
+    this.paragraphList = paragraphList;
+    this.parseSchema = parseSchema;
+  }
+  index = 0;
+  /**
+   * Retrieves the paragraph at the current cursor position.
+   *
+   * @returns The current `Paragraph` object, or `null` if the cursor is at the end of the document or the line is empty.
+   */
+  getCurrentParagraph() {
+    if (this.isEndOfDocument()) return null;
+    const paragraph = this.paragraphList[this.index];
+    if (!paragraph) return null;
+    return getParagraph(paragraph);
+  }
+  /**
+   * Advances the cursor to the next position and returns the new paragraph.
+   *
+   * @returns The next `Paragraph` object, or `null` if the end of the document is reached.
+   */
+  getNextParagraph() {
+    if (this.isEndOfDocument()) return null;
+    this.index++;
+    return this.getCurrentParagraph();
+  }
+  /**
+   * Checks if the cursor has reached the end of the paragraph list.
+   */
+  isEndOfDocument() {
+    return this.index >= this.paragraphList.length;
+  }
+  /**
+   * Determines if the current paragraph corresponds to a section title defined in the schema.
+   *
+   * @returns The section name if matched, otherwise `null`.
+   */
+  getCurrentSectionTitle() {
+    const info = this.getCurrentParagraph();
+    if (!info) return null;
+    return getSectionTitle(info.paragraph, info.text, this.parseSchema);
+  }
+  /**
+   * Checks if the current cursor position marks the start of a new section.
+   */
+  isAtNewSection() {
+    return this.getCurrentSectionTitle() !== null;
+  }
+  /**
+   * Checks if the current paragraph has a named heading style (e.g., HEADING_1).
+   */
+  isAtParagraphHeading() {
+    const info = this.getCurrentParagraph();
+    return !!info && isNamedStyleType(info.style);
+  }
+};
+function createDocsClient() {
+  try {
+    const auth = new googleAuthLibrary.GoogleAuth({
+      scopes: ["https://www.googleapis.com/auth/documents.readonly"]
+    });
+    return googleapis.google.docs({
+      version: "v1",
+      auth
+    });
+  } catch (error) {
+    console.error("Error initializing Google Docs client:", error);
+    throw new Error(
+      "Failed to initialize Google Docs client. Check setup and credentials."
+    );
+  }
+}
+// src/parser.ts
+function parseDocument(doc, parseSchema) {
+  const content = doc.body?.content || [];
+  const result = {};
+  const validParagraphList = content.map((element) => element.paragraph).filter((paragraph) => !!paragraph);
+  const cursor = new ParagraphCursor(validParagraphList, parseSchema);
+  while (!cursor.isEndOfDocument()) {
+    const currentSectionTitle = cursor.getCurrentSectionTitle();
+    if (currentSectionTitle) {
+      const section = parseSchema.sections.find(
+        (s) => s.title.name === currentSectionTitle
+      );
+      if (section) {
+        cursor.getNextParagraph();
+        const parsedData = parseSectionContent(cursor, section);
+        result[currentSectionTitle] = parsedData;
+        continue;
+      }
+    }
+    cursor.getNextParagraph();
+  }
+  return result;
+}
+async function getParsedDocument(documentId, parseSchema) {
+  try {
+    const docs = createDocsClient();
+    const response = await docs.documents.get({ documentId });
+    if (!response.data) {
+      throw new Error("Empty document response from Google Docs API.");
+    }
+    const parsedDocument = parseDocument(response.data, parseSchema);
+    return parsedDocument;
+  } catch (e) {
+    throw new Error(
+      `Google Docs API call failed. Check Doc ID and Service Account permissions. Original error: ${e instanceof Error ? e.message : String(e)}`
+    );
+  }
+}
+exports.getParsedDocument = getParsedDocument;
+//# sourceMappingURL=index.cjs.map
+//# sourceMappingURL=index.cjs.map