tryll-dataset-builder-mcp 1.1.1 → 1.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (4)
  1. package/README.md +228 -28
  2. package/index.js +344 -14
  3. package/lib/store.js +131 -1
  4. package/package.json +2 -1
package/README.md CHANGED
@@ -1,9 +1,11 @@
  # Tryll Dataset Builder — MCP Server

- An MCP (Model Context Protocol) server for building structured RAG knowledge base datasets. Use it with Claude Code to create, manage, and export JSON datasets via natural language.
+ An MCP (Model Context Protocol) server for building structured RAG knowledge base datasets. Use it with Claude Code to create, manage, and export JSON datasets via natural language — with optional real-time sync to the [Dataset Builder web app](https://trylljsoncreator.onrender.com).

  Built by [Tryll Engine](https://tryllengine.com) | [Discord](https://discord.gg/CMnMrmapyB)

+ ---
+
  ## Quick Start

  ### 1. Install
@@ -14,8 +16,6 @@ npm install -g tryll-dataset-builder-mcp

  ### 2. Add to Claude Code

- Run in your terminal:
-
  ```bash
  claude mcp add dataset-builder -- npx tryll-dataset-builder-mcp
  ```
@@ -42,55 +42,208 @@ Just talk to Claude:

  > "Create a knowledge base about Minecraft with categories: Mobs, Blocks, Biomes. Add 10 chunks to each category."

- > "Import my existing dataset from ./data/minecraft.json"
+ > "Parse this wiki page and add it to my dataset: https://minecraft.wiki/w/Creeper"
+
+ > "Show me the version history of my project"

- > "Search for all chunks mentioning 'diamond' in my project"
+ ---

  ## Configuration

  | Variable | Default | Description |
  |----------|---------|-------------|
- | `DATA_DIR` | `./datasets` | Directory where project JSON files are stored |
+ | `DATA_DIR` | `./datasets` | Directory for project JSON files (local mode) |
+
+ ---
+
+ ## Two Modes of Operation
+
+ ### Local Mode (default)
+ Data is stored as JSON files in `DATA_DIR`. No server needed.
+
+ ### Connected Mode (real-time sync)
+ Connect to the [Dataset Builder web app](https://trylljsoncreator.onrender.com) for live collaboration. Changes made via MCP appear instantly in the browser, and vice versa.
+
+ ```
+ You: "Connect to session ABC123"
+ Claude: *connects via WebSocket*
+ You: "Add 5 chunks about dragons"
+ → chunks appear in the browser in real-time
+ ```
+
+ ---

- ## Available Tools (18)
+ ## Available Tools (27)
+
+ ### Session Management
+
+ | Tool | Description |
+ |------|-------------|
+ | `connect_session` | Connect to the web app for real-time collaboration. Requires a 6-character session code from the browser UI |
+ | `disconnect_session` | Disconnect from the web app, switch back to local storage |

  ### Project Management
+
  | Tool | Description |
  |------|-------------|
  | `create_project` | Create a new dataset project |
  | `list_projects` | List all projects with stats |
- | `delete_project` | Delete a project |
- | `get_project_stats` | Detailed statistics |
+ | `delete_project` | Permanently delete a project |
+ | `get_project_stats` | Detailed statistics (categories, chunks, text lengths) |

  ### Category Management
+
  | Tool | Description |
  |------|-------------|
- | `create_category` | Add a category to a project |
+ | `create_category` | Add a category to organize chunks |
  | `list_categories` | List categories with chunk counts |
  | `rename_category` | Rename a category |
- | `delete_category` | Delete a category and its chunks |
+ | `delete_category` | Delete a category and all its chunks |

  ### Chunk Operations
+
  | Tool | Description |
  |------|-------------|
- | `add_chunk` | Add a single knowledge chunk |
- | `bulk_add_chunks` | Add multiple chunks at once |
- | `get_chunk` | Get chunk content by ID |
- | `update_chunk` | Update chunk fields |
- | `delete_chunk` | Delete a chunk |
- | `duplicate_chunk` | Clone a chunk |
- | `move_chunk` | Move chunk between categories |
+ | `add_chunk` | Add a single knowledge chunk with ID, text, and metadata |
+ | `bulk_add_chunks` | Add multiple chunks at once (faster than one by one) |
+ | `get_chunk` | Get full content of a chunk by ID |
+ | `update_chunk` | Update chunk fields (ID, text, metadata) |
+ | `delete_chunk` | Delete a chunk by ID |
+ | `duplicate_chunk` | Clone a chunk (creates `id_copy`) |
+ | `move_chunk` | Move a chunk between categories |

  ### Search & Export
+
+ | Tool | Description |
+ |------|-------------|
+ | `search_chunks` | Search by chunk ID or text content |
+ | `export_project` | Export as flat JSON array (RAG-ready) |
+ | `import_json` | Import an existing JSON dataset |
+ | `export_category` | Export a single category as JSON |
+
+ ### URL Parsing
+
+ | Tool | Description |
+ |------|-------------|
+ | `parse_url` | Fetch a web page, extract text, auto-create chunks. Splits text > 2000 chars into multiple chunks. Extracts wiki infobox metadata |
+ | `batch_parse_urls` | Parse multiple URLs at once |
+
+ ### Bulk Operations
+
  | Tool | Description |
  |------|-------------|
- | `search_chunks` | Search by ID or text content |
- | `export_project` | Export as flat JSON (RAG-ready) |
- | `import_json` | Import existing JSON dataset |
+ | `bulk_update_metadata` | Set a metadata field across all chunks (or per category) |
+ | `merge_projects` | Merge all data from one project into another |
+
+ ### Version History
+
+ | Tool | Description |
+ |------|-------------|
+ | `get_history` | Get version history (last 50 commits) for a project |
+ | `get_commit` | Get a specific commit with full snapshot data for diffing |
+ | `rollback` | Rollback a project to a previous commit's state |
+
+ ---
+
+ ## Tool Details
+
+ ### `add_chunk`
+
+ ```
+ project: "minecraft"
+ category: "Mobs"
+ id: "creeper"
+ text: "A Creeper is a hostile mob that silently approaches players..."
+ metadata:
+   page_title: "Creeper"
+   source: "Minecraft Wiki"
+   license: "CC BY-NC-SA 3.0"
+   health: "20"            ← custom metadata field
+   behavior: "explodes"    ← custom metadata field
+ ```
+
+ Standard metadata fields: `page_title`, `source`, `license`. Any extra fields become custom metadata.
+
+ ### `parse_url`
+
+ ```
+ project: "minecraft"
+ category: "Mobs"
+ url: "https://minecraft.wiki/w/Creeper"
+ chunk_id: "creeper"
+ license: "CC BY-NC-SA 3.0"
+ ```

- ## Export Format
+ - Fetches the page, extracts main text content
+ - If text > 2000 chars → auto-splits into `creeper_1`, `creeper_2`, etc.
+ - Extracts page title and source URL as metadata
+ - For wiki pages: extracts infobox/sidebar data as custom metadata fields

- The exported JSON is a flat array, compatible with the [Dataset Builder web app](https://github.com/Skizziik/json_creator) and ready for RAG pipelines:
+ ### `get_history`
+
+ ```
+ project: "minecraft"
+ ```
+
+ Returns:
+ ```json
+ [
+   {
+     "id": "uuid",
+     "timestamp": "2026-02-27T14:30:00.000Z",
+     "source": "mcp",
+     "action": "addChunk",
+     "summary": "Added chunk 'creeper' to 'Mobs'",
+     "stats": { "categories": 3, "chunks": 12 }
+   }
+ ]
+ ```
+
+ ### `rollback`
+
+ ```
+ project: "minecraft"
+ commit_id: "uuid-of-target-commit"
+ ```
+
+ Restores the project to that commit's snapshot. Creates a new "rollback" commit so you can undo the rollback later.
+
+ ---
+
+ ## Data Formats
+
+ ### Project JSON (internal)
+
+ ```json
+ {
+   "name": "minecraft",
+   "createdAt": "2026-02-27T10:00:00.000Z",
+   "categories": [
+     {
+       "id": "uuid",
+       "name": "Mobs",
+       "expanded": true,
+       "chunks": [
+         {
+           "_uid": "uuid",
+           "id": "creeper",
+           "text": "A Creeper is a hostile mob...",
+           "metadata": {
+             "page_title": "Creeper",
+             "source": "Minecraft Wiki",
+             "license": "CC BY-NC-SA 3.0"
+           },
+           "customFields": [
+             { "key": "health", "value": "20" }
+           ]
+         }
+       ]
+     }
+   ]
+ }
+ ```
+
+ ### Export Format (RAG-ready)

  ```json
  [
@@ -101,20 +254,67 @@ The exported JSON is a flat array, compatible with the [Dataset Builder web app]
      "page_title": "Creeper",
      "source": "Minecraft Wiki",
      "license": "CC BY-NC-SA 3.0",
-      "type": "hostile_mob",
      "health": "20"
    }
  }
  ]
  ```

+ ### History Commit
+
+ ```json
+ {
+   "id": "uuid",
+   "timestamp": "2026-02-27T14:30:00.000Z",
+   "source": "browser | mcp",
+   "action": "addChunk",
+   "summary": "Added chunk 'creeper' to 'Mobs'",
+   "stats": { "categories": 3, "chunks": 12 },
+   "snapshot": { "...full project state..." }
+ }
+ ```
+
+ ---
+
+ ## Real-Time Collaboration
+
+ ```
+ ┌─────────────┐    WebSocket     ┌──────────────┐    REST API      ┌─────────────┐
+ │   Browser   │ ◄──────────────► │  Web Server  │ ◄──────────────► │ MCP Server  │
+ │  (Dataset   │   data:changed   │  (Express +  │   POST/PUT/DEL   │  (Claude    │
+ │   Builder)  │   mcp:connected  │  WebSocket)  │   + source:mcp   │   Code)     │
+ └─────────────┘                  └──────────────┘                  └─────────────┘
+ ```
+
+ 1. Open the [Dataset Builder](https://trylljsoncreator.onrender.com) in your browser
+ 2. Copy the 6-character session code from the top bar
+ 3. Tell Claude: *"Connect to session ABC123"*
+ 4. All changes sync in real-time between browser and Claude
+ 5. Version history tracks who made each change (browser vs MCP)
+
+ ---
+
  ## Example Prompts

  - *"Create a Dark Souls knowledge base with categories for Bosses, Weapons, and Locations"*
- - *"Add 15 chunks about Minecraft mobs with detailed descriptions"*
- - *"Export my project as JSON and save to file"*
- - *"Search for chunks about 'fire' in my dark_souls project"*
- - *"Move chunk 'ancient_dragon' from Bosses to Enemies category"*
+ - *"Parse these wiki pages and add them to my Minecraft project: [url1], [url2], [url3]"*
+ - *"Bulk update the license field to 'MIT' for all chunks in the Mobs category"*
+ - *"Show me the version history of my project"*
+ - *"Rollback my project to the commit before I deleted that category"*
+ - *"Merge my test_data project into the main production project"*
+ - *"Export the Bosses category as JSON"*
+ - *"Connect to session XYZ789 and add 20 chunks about potions"*
+
+ ---
+
+ ## Links
+
+ - **Web App**: [trylljsoncreator.onrender.com](https://trylljsoncreator.onrender.com)
+ - **Web App Repo**: [github.com/Skizziik/json_creator](https://github.com/Skizziik/json_creator)
+ - **MCP Repo**: [github.com/Skizziik/tryll_dataset_builder](https://github.com/Skizziik/tryll_dataset_builder)
+ - **npm**: [tryll-dataset-builder-mcp](https://www.npmjs.com/package/tryll-dataset-builder-mcp)
+ - **Tryll Engine**: [tryllengine.com](https://tryllengine.com)
+ - **Discord**: [discord.gg/CMnMrmapyB](https://discord.gg/CMnMrmapyB)

  ## License
 
package/index.js CHANGED
@@ -5,11 +5,12 @@ import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js"
  import { CallToolRequestSchema, ListToolsRequestSchema } from "@modelcontextprotocol/sdk/types.js";
  import { Store } from "./lib/store.js";
  import WebSocket from "ws";
+ import * as cheerio from "cheerio";

  const store = new Store(process.env.DATA_DIR);

  const server = new Server(
-   { name: "tryll-dataset-builder", version: "1.1.0" },
+   { name: "tryll-dataset-builder", version: "1.3.0" },
    { capabilities: { tools: {} } }
  );

@@ -38,6 +39,96 @@ async function apiCall(method, path, body) {
    return data;
  }

+ // ============================================
+ // URL PARSING HELPERS
+ // ============================================
+
+ const CHUNK_LIMIT = 2000;
+
+ async function parseUrl(url) {
+   const res = await fetch(url, {
+     headers: { 'User-Agent': 'Mozilla/5.0 (compatible; TryllDatasetBuilder/1.2)' },
+   });
+   if (!res.ok) throw new Error(`Failed to fetch ${url}: HTTP ${res.status}`);
+   const html = await res.text();
+   const $ = cheerio.load(html);
+
+   // Extract page title
+   const pageTitle = $('title').first().text().trim()
+     || $('h1').first().text().trim()
+     || '';
+
+   // Extract wiki infobox metadata
+   const infobox = {};
+   $('.infobox tr, .sidebar tr, .wikitable.infobox tr, table.infobox tr').each((_, row) => {
+     const $row = $(row);
+     const key = $row.find('th').first().text().trim().replace(/\s+/g, ' ');
+     const val = $row.find('td').first().text().trim().replace(/\s+/g, ' ');
+     if (key && val && key.length < 60 && val.length < 200) {
+       infobox[key] = val;
+     }
+   });
+
+   // Remove noise elements
+   $('script, style, nav, footer, header, .sidebar, .infobox, .navbox, .mw-editsection, .reference, .reflist, #mw-navigation, .noprint, .toc').remove();
+
+   // Extract main text
+   const mainContent = $('article, main, #mw-content-text, #content, .mw-parser-output, #bodyContent, .entry-content, .post-content').first();
+   let text = '';
+   if (mainContent.length) {
+     text = mainContent.text();
+   } else {
+     text = $('body').text();
+   }
+
+   // Clean up whitespace
+   text = text
+     .replace(/\t/g, ' ')
+     .replace(/[ ]{2,}/g, ' ')
+     .replace(/\n{3,}/g, '\n\n')
+     .trim();
+
+   return { text, pageTitle, infobox, source: url };
+ }
+
+ function splitTextIntoChunks(text, baseId, limit = CHUNK_LIMIT) {
+   if (text.length <= limit) {
+     return [{ id: baseId, text }];
+   }
+
+   const chunks = [];
+   let remaining = text;
+   let index = 1;
+
+   while (remaining.length > 0) {
+     let cutPoint = limit;
+     if (remaining.length > limit) {
+       // Try to cut at paragraph boundary
+       const paraBreak = remaining.lastIndexOf('\n\n', limit);
+       if (paraBreak > limit * 0.3) {
+         cutPoint = paraBreak;
+       } else {
+         // Try sentence boundary
+         const sentBreak = remaining.lastIndexOf('. ', limit);
+         if (sentBreak > limit * 0.3) {
+           cutPoint = sentBreak + 1;
+         }
+       }
+     } else {
+       cutPoint = remaining.length;
+     }
+
+     chunks.push({
+       id: `${baseId}_${index}`,
+       text: remaining.substring(0, cutPoint).trim(),
+     });
+     remaining = remaining.substring(cutPoint).trim();
+     index++;
+   }
+
+   return chunks;
+ }
+
  // ============================================
  // TOOL DEFINITIONS
  // ============================================
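The `splitTextIntoChunks` helper above is what produces the `_1`, `_2` chunk suffixes the README describes. As a rough illustration (not part of the package), pasting `CHUNK_LIMIT` and `splitTextIntoChunks` from the hunk above into a scratch ES module and feeding them a synthetic multi-paragraph text gives:

```js
// Scratch demo — assumes CHUNK_LIMIT and splitTextIntoChunks from the hunk above
// are defined in the same file; the sample text is synthetic.
const sample = Array.from({ length: 30 }, (_, i) => `Paragraph ${i + 1}. `.repeat(10)).join('\n\n');

const pieces = splitTextIntoChunks(sample, 'creeper');
console.log(pieces.map(p => ({ id: p.id, chars: p.text.length })));
// e.g. [ { id: 'creeper_1', chars: ... }, { id: 'creeper_2', chars: ... }, ... ]
// Each piece ends near a paragraph boundary and stays at or just under the 2000-character limit.
```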
@@ -309,6 +400,126 @@ const TOOLS = [
      required: ["project"],
    },
  },
+
+ // ---- URL Parsing ----
+ {
+   name: "parse_url",
+   description: "Fetch a web page, extract its text content, and auto-create chunks. If text exceeds 2000 characters, it auto-splits into multiple chunks with _1, _2 suffixes. Extracts page title and source URL as metadata. For wiki pages, extracts infobox/sidebar data as custom metadata fields.",
+   inputSchema: {
+     type: "object",
+     properties: {
+       project: { type: "string", description: "Project name" },
+       category: { type: "string", description: "Category to add chunks into" },
+       url: { type: "string", description: "URL to fetch and parse" },
+       chunk_id: { type: "string", description: "Base chunk ID. If text is split, becomes chunk_id_1, chunk_id_2, etc." },
+       license: { type: "string", description: "License for the content. Default: CC BY-NC-SA 3.0" },
+     },
+     required: ["project", "category", "url", "chunk_id"],
+   },
+ },
+ {
+   name: "batch_parse_urls",
+   description: "Parse multiple URLs at once and add all chunks to a category. Each URL gets its own chunk ID prefix. Auto-splits long texts into multiple chunks.",
+   inputSchema: {
+     type: "object",
+     properties: {
+       project: { type: "string", description: "Project name" },
+       category: { type: "string", description: "Category to add chunks into" },
+       urls: {
+         type: "array",
+         description: "Array of URL entries to parse",
+         items: {
+           type: "object",
+           properties: {
+             url: { type: "string", description: "URL to fetch" },
+             chunk_id: { type: "string", description: "Base chunk ID for this URL" },
+           },
+           required: ["url", "chunk_id"],
+         },
+       },
+       license: { type: "string", description: "License for all content. Default: CC BY-NC-SA 3.0" },
+     },
+     required: ["project", "category", "urls"],
+   },
+ },
+
+ // ---- Bulk Operations ----
+ {
+   name: "bulk_update_metadata",
+   description: "Update a metadata field across ALL chunks in a project (or a specific category). Useful for setting license, source, or custom fields in bulk.",
+   inputSchema: {
+     type: "object",
+     properties: {
+       project: { type: "string", description: "Project name" },
+       field: { type: "string", description: "Metadata field to update (e.g. 'license', 'source', or any custom field name)" },
+       value: { type: "string", description: "New value for the field" },
+       category: { type: "string", description: "Optional: only update chunks in this category. If omitted, updates all chunks in the project." },
+     },
+     required: ["project", "field", "value"],
+   },
+ },
+ {
+   name: "merge_projects",
+   description: "Merge all categories and chunks from a source project into a target project. Categories with the same name are combined. Chunks with duplicate IDs are skipped.",
+   inputSchema: {
+     type: "object",
+     properties: {
+       source: { type: "string", description: "Source project name (data is copied FROM here)" },
+       target: { type: "string", description: "Target project name (data is merged INTO here)" },
+     },
+     required: ["source", "target"],
+   },
+ },
+ {
+   name: "export_category",
+   description: "Export a single category as a flat JSON array. Same format as export_project but filtered to one category.",
+   inputSchema: {
+     type: "object",
+     properties: {
+       project: { type: "string", description: "Project name" },
+       category: { type: "string", description: "Category name to export" },
+       save_to_file: { type: "boolean", description: "If true, saves to a file. Default: false." },
+     },
+     required: ["project", "category"],
+   },
+ },
+
+ // ---- History ----
+ {
+   name: "get_history",
+   description: "Get version history (last 50 commits) for a project. Each commit shows who made the change (browser/MCP), what was changed, and when. Returns lightweight list without snapshots.",
+   inputSchema: {
+     type: "object",
+     properties: {
+       project: { type: "string", description: "Project name" },
+     },
+     required: ["project"],
+   },
+ },
+ {
+   name: "get_commit",
+   description: "Get a specific commit with full snapshot data. Returns the commit's snapshot and the previous commit's snapshot for computing diffs.",
+   inputSchema: {
+     type: "object",
+     properties: {
+       project: { type: "string", description: "Project name" },
+       commit_id: { type: "string", description: "Commit UUID" },
+     },
+     required: ["project", "commit_id"],
+   },
+ },
+ {
+   name: "rollback",
+   description: "Rollback a project to a specific commit's state. Restores the project data from that commit's snapshot and creates a new 'rollback' commit in history. Safe: you can undo a rollback by rolling back to a later commit.",
+   inputSchema: {
+     type: "object",
+     properties: {
+       project: { type: "string", description: "Project name" },
+       commit_id: { type: "string", description: "Commit UUID to rollback to" },
+     },
+     required: ["project", "commit_id"],
+   },
+ },
  ];

  // ============================================
@@ -327,28 +538,28 @@ async function handleRemote(name, args) {

    switch (name) {
      case "create_project":
-       return apiCall('POST', '/api/projects', { name: args.name, session: s });
+       return apiCall('POST', '/api/projects', { name: args.name, session: s, source: 'mcp' });
      case "list_projects":
        return apiCall('GET', '/api/projects');
      case "delete_project":
-       return apiCall('DELETE', `/api/projects/${p(args.name)}?session=${s}`);
+       return apiCall('DELETE', `/api/projects/${p(args.name)}?session=${s}&source=mcp`);
      case "get_project_stats":
        return apiCall('GET', `/api/projects/${p(args.name)}/stats`);
      case "create_category":
-       return apiCall('POST', `/api/projects/${p(args.project)}/categories`, { name: args.name, session: s });
+       return apiCall('POST', `/api/projects/${p(args.project)}/categories`, { name: args.name, session: s, source: 'mcp' });
      case "list_categories":
        return apiCall('GET', `/api/projects/${p(args.project)}/categories`);
      case "rename_category":
-       return apiCall('PUT', `/api/projects/${p(args.project)}/categories/${p(args.old_name)}`, { newName: args.new_name, session: s });
+       return apiCall('PUT', `/api/projects/${p(args.project)}/categories/${p(args.old_name)}`, { newName: args.new_name, session: s, source: 'mcp' });
      case "delete_category":
-       return apiCall('DELETE', `/api/projects/${p(args.project)}/categories/${p(args.name)}?session=${s}`);
+       return apiCall('DELETE', `/api/projects/${p(args.project)}/categories/${p(args.name)}?session=${s}&source=mcp`);
      case "add_chunk":
        return apiCall('POST', `/api/projects/${p(args.project)}/categories/${p(args.category)}/chunks`, {
-         id: args.id, text: args.text, metadata: args.metadata, session: s,
+         id: args.id, text: args.text, metadata: args.metadata, session: s, source: 'mcp',
        });
      case "bulk_add_chunks":
        return apiCall('POST', `/api/projects/${p(args.project)}/categories/${p(args.category)}/chunks/bulk`, {
-         chunks: args.chunks, session: s,
+         chunks: args.chunks, session: s, source: 'mcp',
        });
      case "get_chunk": {
        const proj = await apiCall('GET', `/api/projects/${p(args.project)}`);
@@ -363,7 +574,7 @@ async function handleRemote(name, args) {
        for (const cat of proj2.categories) {
          const ch = cat.chunks.find(c => c.id === args.id);
          if (ch) {
-           const body = { session: s };
+           const body = { session: s, source: 'mcp' };
            if (args.new_id !== undefined) body.id = args.new_id;
            if (args.text !== undefined) body.text = args.text;
            const meta = {};
@@ -384,7 +595,7 @@ async function handleRemote(name, args) {
        for (const cat of proj3.categories) {
          const ch = cat.chunks.find(c => c.id === args.id);
          if (ch) {
-           return apiCall('DELETE', `/api/projects/${p(args.project)}/categories/${cat.id}/chunks/${ch._uid}?session=${s}`);
+           return apiCall('DELETE', `/api/projects/${p(args.project)}/categories/${cat.id}/chunks/${ch._uid}?session=${s}&source=mcp`);
          }
        }
        throw new Error(`Chunk "${args.id}" not found`);
@@ -394,14 +605,14 @@ async function handleRemote(name, args) {
        for (const cat of proj4.categories) {
          const ch = cat.chunks.find(c => c.id === args.id);
          if (ch) {
-           return apiCall('POST', `/api/projects/${p(args.project)}/categories/${cat.id}/chunks/${ch._uid}/duplicate`);
+           return apiCall('POST', `/api/projects/${p(args.project)}/categories/${cat.id}/chunks/${ch._uid}/duplicate`, { source: 'mcp' });
          }
        }
        throw new Error(`Chunk "${args.id}" not found`);
      }
      case "move_chunk":
        return apiCall('POST', `/api/projects/${p(args.project)}/chunks/${p(args.id)}/move`, {
-         targetCategory: args.target_category, session: s,
+         targetCategory: args.target_category, session: s, source: 'mcp',
        });
      case "search_chunks":
        return apiCall('GET', `/api/projects/${p(args.project)}/search?q=${encodeURIComponent(args.query)}`);
@@ -415,9 +626,61 @@ async function handleRemote(name, args) {
        }
        if (!jsonData) throw new Error('Provide either "json_path" or "data" parameter');
        return apiCall('POST', `/api/projects/${p(args.project)}/import`, {
-         data: jsonData, category: args.category, session: s,
+         data: jsonData, category: args.category, session: s, source: 'mcp',
+       });
+     }
+     case "parse_url": {
+       const parsed = await parseUrl(args.url);
+       const chunks = splitTextIntoChunks(parsed.text, args.chunk_id);
+       const license = args.license || 'CC BY-NC-SA 3.0';
+       const chunkData = chunks.map(ch => ({
+         id: ch.id, text: ch.text,
+         metadata: { page_title: parsed.pageTitle, source: parsed.source, license, ...parsed.infobox },
+       }));
+       const result = await apiCall('POST', `/api/projects/${p(args.project)}/categories/${p(args.category)}/chunks/bulk`, {
+         chunks: chunkData, session: s, source: 'mcp',
        });
+       return { ...result, pageTitle: parsed.pageTitle, chunksCreated: chunks.length, infoboxFields: Object.keys(parsed.infobox) };
      }
+     case "batch_parse_urls": {
+       const results = [];
+       const license = args.license || 'CC BY-NC-SA 3.0';
+       for (const entry of args.urls) {
+         try {
+           const parsed = await parseUrl(entry.url);
+           const chunks = splitTextIntoChunks(parsed.text, entry.chunk_id);
+           const chunkData = chunks.map(ch => ({
+             id: ch.id, text: ch.text,
+             metadata: { page_title: parsed.pageTitle, source: parsed.source, license, ...parsed.infobox },
+           }));
+           const r = await apiCall('POST', `/api/projects/${p(args.project)}/categories/${p(args.category)}/chunks/bulk`, {
+             chunks: chunkData, session: s, source: 'mcp',
+           });
+           results.push({ url: entry.url, chunk_id: entry.chunk_id, chunks: chunks.length, added: r.added, errors: r.errors });
+         } catch (err) {
+           results.push({ url: entry.url, chunk_id: entry.chunk_id, error: err.message });
+         }
+       }
+       return { parsed: results.filter(r => !r.error).length, failed: results.filter(r => r.error).length, results };
+     }
+     case "bulk_update_metadata":
+       return apiCall('POST', `/api/projects/${p(args.project)}/bulk-metadata`, {
+         field: args.field, value: args.value, category: args.category, session: s, source: 'mcp',
+       });
+     case "merge_projects":
+       return apiCall('POST', `/api/projects/${p(args.source)}/merge`, {
+         target: args.target, session: s, source: 'mcp',
+       });
+     case "export_category":
+       return apiCall('GET', `/api/projects/${p(args.project)}/categories/${p(args.category)}/export`);
+     case "get_history":
+       return apiCall('GET', `/api/projects/${p(args.project)}/history`);
+     case "get_commit":
+       return apiCall('GET', `/api/projects/${p(args.project)}/history/${args.commit_id}`);
+     case "rollback":
+       return apiCall('POST', `/api/projects/${p(args.project)}/history/${args.commit_id}/rollback`, {
+         session: s, source: 'mcp',
+       });
      default:
        throw new Error(`Unknown tool: ${name}`);
    }
@@ -593,6 +856,73 @@ server.setRequestHandler(CallToolRequestSchema, async (request) => {
        break;
      }

+     case "parse_url": {
+       const parsed = await parseUrl(args.url);
+       const chunks = splitTextIntoChunks(parsed.text, args.chunk_id);
+       const license = args.license || 'CC BY-NC-SA 3.0';
+       const chunkData = chunks.map(ch => ({
+         id: ch.id, text: ch.text,
+         metadata: { page_title: parsed.pageTitle, source: parsed.source, license, ...parsed.infobox },
+       }));
+       const bulkResult = store.bulkAddChunks(args.project, args.category, chunkData);
+       result = { ...bulkResult, pageTitle: parsed.pageTitle, chunksCreated: chunks.length, infoboxFields: Object.keys(parsed.infobox) };
+       break;
+     }
+
+     case "batch_parse_urls": {
+       const results = [];
+       const license = args.license || 'CC BY-NC-SA 3.0';
+       for (const entry of args.urls) {
+         try {
+           const parsed = await parseUrl(entry.url);
+           const chunks = splitTextIntoChunks(parsed.text, entry.chunk_id);
+           const chunkData = chunks.map(ch => ({
+             id: ch.id, text: ch.text,
+             metadata: { page_title: parsed.pageTitle, source: parsed.source, license, ...parsed.infobox },
+           }));
+           const r = store.bulkAddChunks(args.project, args.category, chunkData);
+           results.push({ url: entry.url, chunk_id: entry.chunk_id, chunks: chunks.length, added: r.added, errors: r.errors });
+         } catch (err) {
+           results.push({ url: entry.url, chunk_id: entry.chunk_id, error: err.message });
+         }
+       }
+       result = { parsed: results.filter(r => !r.error).length, failed: results.filter(r => r.error).length, results };
+       break;
+     }
+
+     case "bulk_update_metadata":
+       result = store.bulkUpdateMetadata(args.project, args.field, args.value, args.category);
+       break;
+
+     case "merge_projects":
+       result = store.mergeProjects(args.source, args.target);
+       break;
+
+     case "export_category": {
+       const exported = store.exportCategory(args.project, args.category);
+       if (args.save_to_file) {
+         const outPath = store._filePath(args.project).replace('.json', `.${args.category}.export.json`);
+         const { writeFileSync } = await import('fs');
+         writeFileSync(outPath, JSON.stringify(exported, null, 2), 'utf-8');
+         result = { exported: exported.length, savedTo: outPath };
+       } else {
+         result = { exported: exported.length, data: exported };
+       }
+       break;
+     }
+
+     case "get_history":
+       result = store.getHistory(args.project);
+       break;
+
+     case "get_commit":
+       result = store.getCommit(args.project, args.commit_id);
+       break;
+
+     case "rollback":
+       result = store.rollback(args.project, args.commit_id, 'mcp');
+       break;
+
      default:
        throw new Error(`Unknown tool: ${name}`);
    }
@@ -617,7 +947,7 @@ server.setRequestHandler(CallToolRequestSchema, async (request) => {
  async function main() {
    const transport = new StdioServerTransport();
    await server.connect(transport);
-   console.error("Tryll Dataset Builder MCP server running (v1.1.0)");
+   console.error("Tryll Dataset Builder MCP server running (v1.3.0)");
  }

  main().catch((err) => {
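Since `cheerio` is the new dependency doing the scraping here, a small self-contained sketch (not from the package) of the infobox pattern `parseUrl()` relies on: `<th>`/`<td>` pairs inside `table.infobox` rows become metadata keys and values. The HTML fragment below is made up:

```js
import * as cheerio from "cheerio";

// Made-up infobox fragment mirroring the structure parseUrl() scrapes.
const html = `
  <table class="infobox">
    <tr><th>Health</th><td>20</td></tr>
    <tr><th>Behavior</th><td>Explodes when close</td></tr>
  </table>`;

const $ = cheerio.load(html);
const infobox = {};
$('table.infobox tr').each((_, row) => {
  const key = $(row).find('th').first().text().trim();
  const val = $(row).find('td').first().text().trim();
  if (key && val) infobox[key] = val; // parseUrl() additionally caps key/value lengths
});

console.log(infobox); // { Health: '20', Behavior: 'Explodes when close' }
```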
package/lib/store.js CHANGED
@@ -4,6 +4,7 @@ import { randomUUID } from 'crypto';

  const DEFAULT_LICENSE = 'CC BY-NC-SA 3.0';
  const STANDARD_META = ['page_title', 'source', 'license'];
+ const MAX_HISTORY = 50;

  export class Store {
    constructor(dataDir) {
@@ -25,7 +26,7 @@ export class Store {

    listProjects() {
      this._ensureDir();
-     const files = readdirSync(this.dataDir).filter(f => f.endsWith('.json'));
+     const files = readdirSync(this.dataDir).filter(f => f.endsWith('.json') && !f.endsWith('.history.json'));
      return files.map(f => {
        const name = f.replace(/\.json$/, '');
        try {
@@ -343,6 +344,74 @@ export class Store {
      return { project: projectName, category: catName, imported, skipped };
    }

+   // ---- BULK UPDATE METADATA ----
+
+   bulkUpdateMetadata(projectName, field, value, categoryName) {
+     const data = this._load(projectName);
+     let updated = 0;
+     const cats = categoryName
+       ? [this._findCategory(data, categoryName)]
+       : data.categories;
+     for (const cat of cats) {
+       for (const ch of cat.chunks) {
+         if (STANDARD_META.includes(field)) {
+           ch.metadata[field] = value;
+         } else {
+           if (!ch.customFields) ch.customFields = [];
+           const existing = ch.customFields.find(cf => cf.key === field);
+           if (existing) { existing.value = value; }
+           else { ch.customFields.push({ key: field, value }); }
+         }
+         updated++;
+       }
+     }
+     this._save(projectName, data);
+     return { project: projectName, field, value, updated };
+   }
+
+   // ---- MERGE PROJECTS ----
+
+   mergeProjects(sourceName, targetName) {
+     const source = this._load(sourceName);
+     const target = this._load(targetName);
+     let categoriesMerged = 0, chunksAdded = 0, chunksSkipped = 0;
+
+     for (const srcCat of source.categories) {
+       let tgtCat = target.categories.find(c => c.name.toLowerCase() === srcCat.name.toLowerCase());
+       if (!tgtCat) {
+         tgtCat = { id: randomUUID(), name: srcCat.name, expanded: true, chunks: [] };
+         target.categories.push(tgtCat);
+         categoriesMerged++;
+       }
+       for (const ch of srcCat.chunks) {
+         if (this._isIdTaken(target, ch.id)) { chunksSkipped++; continue; }
+         tgtCat.chunks.push({ ...JSON.parse(JSON.stringify(ch)), _uid: randomUUID() });
+         chunksAdded++;
+       }
+     }
+
+     this._save(targetName, target);
+     return { source: sourceName, target: targetName, categoriesMerged, chunksAdded, chunksSkipped };
+   }
+
+   // ---- EXPORT CATEGORY ----
+
+   exportCategory(projectName, categoryName) {
+     const data = this._load(projectName);
+     const cat = this._findCategory(data, categoryName);
+     const flat = [];
+     for (const ch of cat.chunks) {
+       const entry = { id: ch.id, text: ch.text, metadata: { ...ch.metadata } };
+       if (ch.customFields) {
+         for (const cf of ch.customFields) {
+           if (cf.key && cf.key.trim()) entry.metadata[cf.key.trim()] = String(cf.value ?? '');
+         }
+       }
+       flat.push(entry);
+     }
+     return flat;
+   }
+
    // ---- INTERNAL ----

    _load(name) {
@@ -371,6 +440,67 @@ export class Store {
      return false;
    }

+   // ---- HISTORY ----
+
+   _historyFilePath(name) {
+     return join(this.dataDir, `${name}.history.json`);
+   }
+
+   _loadHistory(name) {
+     const fp = this._historyFilePath(name);
+     if (!existsSync(fp)) return { project: name, commits: [] };
+     return JSON.parse(readFileSync(fp, 'utf-8'));
+   }
+
+   _saveHistory(name, history) {
+     this._ensureDir();
+     writeFileSync(this._historyFilePath(name), JSON.stringify(history, null, 2), 'utf-8');
+   }
+
+   _commit(projectName, action, summary, source) {
+     try {
+       const data = this._load(projectName);
+       const history = this._loadHistory(projectName);
+       const totalChunks = data.categories.reduce((sum, c) => sum + c.chunks.length, 0);
+       history.commits.unshift({
+         id: randomUUID(),
+         timestamp: new Date().toISOString(),
+         source: source || 'mcp',
+         action, summary,
+         stats: { categories: data.categories.length, chunks: totalChunks },
+         snapshot: JSON.parse(JSON.stringify(data)),
+       });
+       if (history.commits.length > MAX_HISTORY) history.commits.length = MAX_HISTORY;
+       this._saveHistory(projectName, history);
+     } catch { /* history logging should never break mutations */ }
+   }
+
+   getHistory(name) {
+     const history = this._loadHistory(name);
+     return history.commits.map(c => ({
+       id: c.id, timestamp: c.timestamp, source: c.source,
+       action: c.action, summary: c.summary, stats: c.stats,
+     }));
+   }
+
+   getCommit(name, commitId) {
+     const history = this._loadHistory(name);
+     const idx = history.commits.findIndex(c => c.id === commitId);
+     if (idx === -1) throw new Error('Commit not found');
+     const commit = history.commits[idx];
+     const prev = idx + 1 < history.commits.length ? history.commits[idx + 1].snapshot : null;
+     return { ...commit, prevSnapshot: prev };
+   }
+
+   rollback(name, commitId, source) {
+     const history = this._loadHistory(name);
+     const commit = history.commits.find(c => c.id === commitId);
+     if (!commit) throw new Error('Commit not found');
+     this._save(name, commit.snapshot);
+     this._commit(name, 'rollback', `Rolled back to commit from ${commit.timestamp}`, source || 'mcp');
+     return this._load(name);
+   }
+
    _parseCustomFields(metadata) {
      if (!metadata || typeof metadata !== 'object') return [];
      return Object.entries(metadata)
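A note on the new history layer: commits are written to a sidecar `<project>.history.json` next to the project file (which is why `listProjects()` now filters those files out), and `getHistory()` returns commits without their snapshots. A standalone sketch with one synthetic commit, just to show the shapes involved:

```js
// Synthetic history object in the shape _commit() writes to <project>.history.json.
const history = {
  project: 'minecraft',
  commits: [
    {
      id: 'uuid-1',
      timestamp: '2026-02-27T14:30:00.000Z',
      source: 'mcp', // or 'browser' when the change came from the web app
      action: 'addChunk',
      summary: "Added chunk 'creeper' to 'Mobs'",
      stats: { categories: 3, chunks: 12 },
      snapshot: { name: 'minecraft', categories: [] }, // real commits hold the full project state
    },
  ],
};

// getHistory() maps each commit to the same fields minus the snapshot:
const listing = history.commits.map(({ id, timestamp, source, action, summary, stats }) =>
  ({ id, timestamp, source, action, summary, stats }));
console.log(listing);
```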
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "tryll-dataset-builder-mcp",
-   "version": "1.1.1",
+   "version": "1.3.0",
    "description": "MCP server for building RAG knowledge base datasets. Create, manage and export structured JSON datasets via Claude Code.",
    "type": "module",
    "main": "index.js",
@@ -31,6 +31,7 @@
    },
    "dependencies": {
      "@modelcontextprotocol/sdk": "^1.12.1",
+     "cheerio": "^1.2.0",
      "ws": "^8.19.0",
      "zod": "^3.24.0"
    }
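Finally, a rough sense of what consuming the RAG-ready export documented in the README looks like; the file name below is a placeholder for wherever you saved the output of `export_project` or `export_category`:

```js
import { readFileSync } from 'fs';

// Each exported entry is { id, text, metadata }, with custom fields already flattened into metadata.
const chunks = JSON.parse(readFileSync('./minecraft.export.json', 'utf-8'));

const bySource = {};
for (const { metadata } of chunks) {
  const source = metadata.source ?? 'unknown';
  bySource[source] = (bySource[source] ?? 0) + 1;
}

console.log(`${chunks.length} chunks ready for ingestion`, bySource);
```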