studylens 0.1.2 → 0.1.3

package/README.md CHANGED
@@ -1,118 +1,46 @@
- # StudyLens
-
- AI-powered deep study assistant. Paste notes, upload files, or provide URLs — StudyLens extracts knowledge points, organizes them for browsing, and generates topic pages with AI-driven Q&A.
-
- ## Features
-
- - **Multi-source ingestion** — text, PDF, DOCX, XLSX, and web URLs
- - **LLM-powered extraction** — automatically identifies knowledge points, tags, and relationships
- - **Knowledge graph** — visual force-directed graph of connected concepts
- - **Topic pages** — AI-generated study pages with version history
- - **Deep analysis** — drill down into any concept with AI-powered sub-topic expansion
- - **Smart Q&A** — ask questions about your knowledge base with context-aware answers
- - **Timeline & category views** — browse knowledge by time or subject
- - **Export** — single-page HTML export with print-optimized CSS
- - **Granularity control** — limit max knowledge points per ingestion for high-level summaries
- - **Multi-provider LLM** — supports OpenAI-compatible APIs, Ollama, and custom endpoints
-
- ## Install
-
- ```bash
- npm install -g studylens
- studylens
- ```
-
- Open `http://localhost:3000` — on first launch the Settings panel opens automatically to guide you through LLM setup.
-
- Data is stored in `./studylens-data/` in the current directory. Set `STUDYLENS_DATA_DIR` to change the location.
-
- ## Quick Start (Development)
-
- ```bash
- npm run setup # Install dependencies (server + portal)
- npm run dev # Start server (port 3000) + dev portal (port 3001)
- ```
-
- Open `http://localhost:3001` for development (hot-reload, recommended).
-
- Port 3000 serves the production build — run `npm run build` first to generate `portal/dist/`, otherwise it will only serve the API.
-
- ## LLM Configuration
-
- StudyLens requires an LLM backend. On first launch the Settings panel opens automatically to guide you through setup. Three options:
-
- ### Option A: Agent Maestro (recommended for GitHub Copilot users)
-
- Zero API key needed — uses your existing Copilot subscription via VS Code.
-
- 1. Install the [Agent Maestro](https://marketplace.visualstudio.com/items?itemName=Joouis.agent-maestro) VS Code extension
- 2. It starts a local proxy at `http://localhost:23333`
- 3. In StudyLens settings, enable `agent-maestro` and test the connection
-
- ### Option B: OpenAI-compatible API
-
- Works with OpenAI, Azure OpenAI, DeepSeek, or any compatible endpoint.
-
- 1. In StudyLens settings, enable `openai-compatible`
- 2. Set `baseUrl` (default: `https://api.openai.com/v1`), `apiKey`, and `model`
-
- ### Option C: Ollama (fully local, free)
-
- Run models locally with no API key or internet required.
-
- 1. Install [Ollama](https://ollama.com) and pull a model: `ollama pull llama3.2`
- 2. In StudyLens settings, enable `ollama` (default URL: `http://localhost:11434`)
-
- Configuration is stored in `wiki/config/llm-config.json`. A template is at `config/llm-config.template.json`.
-
- ## Project Structure
-
- ```
- StudyLens/
- ├── server/ # Express API server
- │   └── index.js
- ├── core/ # Business logic
- │   ├── extractor.js # Knowledge extraction prompts
- │   ├── llm-provider.js # Multi-provider LLM client
- │   └── wiki-storage.js # Markdown-based file storage
- ├── portal/ # React frontend (Vite)
- │   └── src/
- │       ├── components/ # UI components
- │       └── lib/ # Shared utilities
- ├── config/ # Configuration templates
- ├── e2e/ # Playwright E2E tests
- ├── tests/ # API integration tests
- ├── scripts/ # Utility scripts
- └── docs/ # User & developer guides
- ```
-
- ## Data Storage
-
- All data is stored as Markdown files in the `wiki/` directory (gitignored by default):
-
- - `wiki/entries/` — knowledge point Markdown files with YAML frontmatter
- - `wiki/topic-pages/` — generated topic page HTML
- - `wiki/index/` — JSON indexes for fast lookup
- - `wiki/config/` — runtime configuration
-
- ## Testing
-
- ```bash
- npm test # Unit tests (API + portal)
- npm run test:e2e # Playwright E2E tests
- npm run test:api # API tests only
- npm run test:portal # Portal component tests only
- ```
-
- ## Scripts
-
- ```bash
- npm run server # Start API server only (port 3000)
- npm run portal # Start Vite dev server only (port 3001)
- npm run dev # Start both concurrently
- npm run setup # Install all dependencies
- ```
-
- ## License
-
- MIT
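+ # StudyLens
+
+ > Turn scattered notes into a personal knowledge base that keeps growing.
+
+ StudyLens is an **AI-powered deep study assistant**. Paste a note, upload a file, or drop in a URL, and it automatically extracts knowledge points, organizes how they fit together, helps you dig deeper with follow-up questions, and generates shareable topic pages.
+
+ What you learn is no longer a pile of scattered text but a knowledge network you can browse, drill into, and keep extending.
+
+ ## What It Can Do
+
+ - **Multi-source ingestion**: text, PDF, Word, Excel, or even a web page link
+ - **Automatic knowledge extraction**: AI identifies key points, categorizes and tags them, and links related ones
+ - **Knowledge graph**: a force-directed graph shows how concepts connect
+ - **Drill-down**: ask follow-up questions on any concept; AI breaks it into subtopics to work through one by one
+ - **Topic pages**: one-click generation of illustrated study pages with version history, annotations, and export
+ - **Smart Q&A**: ask questions about your knowledge base and get context-aware answers
+ - **Multiple views**: browse by subject category or along a timeline
+ - **Models under your control**: supports any OpenAI-compatible service such as Zhipu, Tongyi Qianwen, DeepSeek, or OpenAI, and can also run locally for free with Ollama
+
+ ## Quick Start
+
+ ```bash
+ npm install -g studylens
+ studylens
+ ```
+
+ Open `http://localhost:3000`. On first launch you are guided through LLM setup automatically. See **[Getting Started](docs/getting-started.md)** for detailed steps.
+
+ ## Documentation
+
+ - 📖 **[Getting Started](docs/getting-started.md)**: installation, LLM configuration (with examples for China-based model providers), and creating your first knowledge entry
+ - 📘 **[User Guide](docs/user-guide.md)**: the full feature set, including ingestion, exploration, topic pages, and deep analysis
+ - 🛠️ **[Developer Guide](docs/developer-guide.md)**: architecture, directory layout, local development, and testing
+
+ ## Local Development
+
+ ```bash
+ npm install # Install dependencies (postinstall also installs the frontend)
+ npm run dev # Start backend (3000) + dev frontend (3001)
+ ```
+
+ During development, open `http://localhost:3001` (hot reload supported). For more scripts, port configuration, and testing notes, see the **[Developer Guide](docs/developer-guide.md)**.
+
+ ## License
+
+ MIT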
@@ -1,21 +1,21 @@
  {
    "defaultProvider": "auto",
    "providers": {
-     "agent-maestro": {
-       "enabled": false,
-       "baseUrl": "http://localhost:23333/api/anthropic",
-       "model": "claude-sonnet-4-6"
-     },
      "openai-compatible": {
-       "enabled": false,
-       "baseUrl": "https://api.openai.com/v1",
+       "enabled": true,
+       "baseUrl": "https://open.bigmodel.cn/api/paas/v4",
        "apiKey": "",
-       "model": "gpt-4o"
+       "model": "GLM-4.6"
      },
      "ollama": {
        "enabled": false,
        "baseUrl": "http://localhost:11434",
-       "model": "llama3.2"
+       "model": "deepseek-r1:7b"
+     },
+     "agent-maestro": {
+       "enabled": false,
+       "baseUrl": "http://localhost:23333/api/anthropic",
+       "model": "claude-sonnet-4-6"
      }
    },
    "taskRouting": {
@@ -1,14 +1,14 @@
  {
    "subjects": {
      "英语": {
- "analyzePrompt": "You are a knowledge extraction assistant for a middle school English student. Analyze the following study notes and extract structured knowledge entries.\n\nFor each distinct knowledge point, return a JSON array of objects with:\n- \"title\": concise title in Chinese (under 20 chars), e.g. \"Unit5 比较级与最高级\"\n- \"content\": the knowledge point explained clearly in Chinese, include English examples with Chinese translations in parentheses\n- \"subject\": precise classification like \"英语-词汇与话题\", \"英语-语法\", \"英语-形容词比较级\", \"英语-写作技巧\", \"英语-动名词\" etc.\n- \"tags\": array of relevant tags including:\n 1. Grammar points: \"比较级\", \"最高级\", \"一般过去时\", \"动名词\", \"可数名词\", \"不可数名词\" etc.\n 2. Topic/Unit tags: \"自然景观\", \"运动\", \"饮食健康\" etc.\n 3. Key phrases as tags: \"be famous for\", \"imagine doing\" etc.\n 4. Skill dimensions: \"短语搭配\", \"语法规则\", \"易错点\", \"写作\" etc.\n\nImportant:\n- Each entry should focus on ONE grammar point or vocabulary cluster\n- Always include English examples with Chinese translations\n- Tag grammar-related entries with specific grammar terms",
- "questionsPrompt": "生成的问题应该覆盖以下维度,帮助学生深入掌握英语知识点:\n1. 语法规则深挖(规则的例外情况、易错点、为什么是这样)\n2. 造句练习(要求使用特定短语或语法结构造复合句)\n3. 辨析对比(易混淆词、近义词区别、相似语法的区分)\n4. 知识迁移(把语法规则应用到新语境,如翻译句子)\n5. 综合运用(同时使用多个短语或语法点完成一个写作任务)\n\n问题要求:\n- 问题用中文提出,但涉及的英语内容保持英文\n- 鼓励学生写出完整英文句子而不是只选择答案\n- 涉及语法规则时,要求学生解释原因而不只是记忆\n- 可以引用课文中的重点短语来设计造句题\n\n返回JSON数组,每个元素: {\"question\": \"问题内容\", \"category\": \"语法/造句/辨析/迁移/综合\"}"
+ "analyzePrompt": "你是一名面向中学生的英语知识点提取助手。请分析以下学习笔记,提取结构化的知识条目。\n\n对每一个独立知识点,返回一个 JSON 数组,每个对象包含以下字段:\n- \"title\":简洁的中文标题(20 字以内),例如 \"Unit5 比较级与最高级\"\n- \"content\":用中文清晰讲解该知识点,英文例句后用括号附上中文翻译\n- \"subject\":精确的分类,例如 \"英语-词汇与话题\"、\"英语-语法\"、\"英语-形容词比较级\"、\"英语-写作技巧\"、\"英语-动名词\" 等\n- \"tags\":相关标签数组,需包含:\n 1. 语法点:\"比较级\"、\"最高级\"、\"一般过去时\"、\"动名词\"、\"可数名词\"、\"不可数名词\" 等\n 2. 话题/单元标签:\"自然景观\"、\"运动\"、\"饮食健康\" 等\n 3. 重点短语作为标签:\"be famous for\"、\"imagine doing\" 等(短语本身保留英文)\n 4. 能力维度:\"短语搭配\"、\"语法规则\"、\"易错点\"、\"写作\" 等\n\n注意:\n- 每个条目只聚焦一个语法点或一组词汇\n- 英文例句务必附中文翻译\n- 涉及语法的条目要用具体语法术语打标签",
+ "questionsPrompt": "生成的问题应该覆盖以下维度,帮助学生深入掌握英语知识点:\n1. 语法规则深挖(规则的例外情况、易错点、为什么是这样)\n2. 造句练习(要求使用特定短语或语法结构造复合句)\n3. 辨析对比(易混淆词、近义词区别、相似语法的区分)\n4. 知识迁移(把语法规则应用到新语境,如翻译句子)\n5. 综合运用(同时使用多个短语或语法点完成一个写作任务)\n\n问题要求:\n- 问题用中文提出,但涉及的英语内容保持英文\n- 鼓励学生写出完整英文句子而不是只选择答案\n- 涉及语法规则时,要求学生解释原因而不只是记忆\n- 可以引用课文中的重点短语来设计造句题\n\n返回 JSON 数组,每个元素:{\"question\": \"问题内容\", \"category\": \"语法/造句/辨析/迁移/综合\"}"
      }
    },
    "defaultPrompts": {
- "analyzePrompt": "You are a knowledge extraction assistant for a student. Analyze the following study notes and extract structured knowledge entries.\n\nFor each distinct knowledge point, return a JSON array of objects with:\n- \"title\": concise title (under 20 chars)\n- \"content\": the knowledge point explained clearly\n- \"subject\": precise subject classification (see rules below)\n- \"tags\": array of relevant tags — include ALL of the following dimensions:\n 1. Core concepts: key terms, names, formulas (e.g. \"科举制\", \"赵匡胤\", \"勾股定理\")\n 2. Category dimensions: assign multi-dimensional category tags based on the subject area:\n - For history: add tags from these dimensions where applicable:\n \"政治制度\", \"军事战争\", \"经济发展\", \"民族关系\", \"对外交流\", \"科技发明\", \"文化艺术\", \"社会生活\", \"人物\"\n - For math: \"代数\", \"几何\", \"概率\", \"函数\", \"公式\", \"定理\", \"证明\"\n - For physics: \"力学\", \"电磁\", \"热学\", \"光学\", \"实验\", \"公式\"\n - For other subjects: infer appropriate dimensional tags\n 3. Connections: tags that link to related knowledge across different categories\n\nSubject classification rules:\n- For history: use specific dynasty like \"历史-隋朝\", \"历史-唐朝\", \"历史-北宋\" etc.\n- For other subjects: use patterns like \"数学-代数\", \"物理-力学\", \"化学-有机\" etc.\n- Each knowledge point must belong to exactly ONE specific category.\n\nReturn ONLY valid JSON array, no other text.",
- "topicPrompt": "你是一个教育内容设计师。基于以下知识点和相关资料,生成一个美观的HTML专题页面。\n\n要求:\n1. 生成完整的HTML页面(含内联CSS),适合iframe嵌入\n2. 深色主题(背景 #0f1117,文字 #e0e0e0)\n3. 分章节展示:导语→背景→核心内容→影响/意义→总结\n4. 使用清晰的排版:标题、卡片、分隔线、高亮重点\n5. 中文内容,适合中学生阅读\n6. 使用你自己的知识补充完整内容,不要局限于提供的材料\n7. 页面宽度100%,无需滚动条样式\n8. 配色美观,使用渐变和阴影效果",
- "qaPrompt": "You are an expert study assistant with deep knowledge across all subjects. A student is studying and asks you questions.\n\nIMPORTANT: Use your OWN comprehensive knowledge to answer thoroughly and accurately. The student's notes are supplementary context, not the boundary of your answer.\n\nInstructions:\n1. Answer using your full knowledge — be thorough, accurate, and educational\n2. If the student has relevant notes, reference them to build connections\n3. Use comparisons, analysis, and specific facts/data where appropriate\n4. Write in Chinese, suitable for a middle/high school student\n5. Suggest knowledge cards that capture KEY points — NEW knowledge beyond existing notes\n\nReturn a JSON object:\n{\n \"answer\": \"Your comprehensive answer in Chinese...\",\n \"suggestedCards\": [\n {\n \"title\": \"card title (under 20 chars)\",\n \"content\": \"knowledge point explained clearly\",\n \"subject\": \"precise subject like 历史-唐朝\",\n \"tags\": [\"relevant\", \"tags\"]\n }\n ]\n}\n\nCRITICAL: The answer field must be PLAIN TEXT only — no markdown formatting.\nReturn ONLY valid JSON, no other text.",
- "questionsPrompt": "生成的问题应该覆盖:\n1. 基本概念(是什么)\n2. 原因分析(为什么)\n3. 影响/意义(有什么影响)\n4. 比较对比(与其他知识的关联)\n5. 深入思考(评价/启示)\n\n返回JSON数组,每个元素: {\"question\": \"问题内容\", \"category\": \"概念/原因/影响/对比/思考\"}"
+ "analyzePrompt": "你是一名面向学生的知识点提取助手。请分析以下学习笔记,提取结构化的知识条目。\n\n对每一个独立知识点,返回一个 JSON 数组,每个对象包含以下字段:\n- \"title\":简洁标题(20 字以内)\n- \"content\":清晰讲解该知识点\n- \"subject\":精确的学科分类(见下方规则)\n- \"tags\":相关标签数组,需包含以下所有维度:\n 1. 核心概念:关键术语、人名、公式(如 \"科举制\"、\"赵匡胤\"、\"勾股定理\")\n 2. 分类维度:根据学科领域赋予多维分类标签:\n - 历史:在适用时添加这些维度的标签:\n \"政治制度\"、\"军事战争\"、\"经济发展\"、\"民族关系\"、\"对外交流\"、\"科技发明\"、\"文化艺术\"、\"社会生活\"、\"人物\"\n - 数学:\"代数\"、\"几何\"、\"概率\"、\"函数\"、\"公式\"、\"定理\"、\"证明\"\n - 物理:\"力学\"、\"电磁\"、\"热学\"、\"光学\"、\"实验\"、\"公式\"\n - 其他学科:推断合适的维度标签\n 3. 关联:跨分类指向相关知识的标签\n\n学科分类规则:\n- 历史:使用具体朝代,如 \"历史-隋朝\"、\"历史-唐朝\"、\"历史-北宋\" 等\n- 其他学科:使用 \"数学-代数\"、\"物理-力学\"、\"化学-有机\" 等格式\n- 每个知识点必须只属于一个具体分类\n\n只返回合法的 JSON 数组,不要输出其他文字。",
+ "topicPrompt": "你是一个教育内容设计师。基于以下知识点和相关资料,生成一个美观的 HTML 专题页面。\n\n要求:\n1. 生成完整的 HTML 页面(含内联 CSS),适合 iframe 嵌入\n2. 深色主题(背景 #0f1117,文字 #e0e0e0)\n3. 分章节展示:导语→背景→核心内容→影响/意义→总结\n4. 使用清晰的排版:标题、卡片、分隔线、高亮重点\n5. 中文内容,适合中学生阅读\n6. 使用你自己的知识补充完整内容,不要局限于提供的材料\n7. 页面宽度 100%,无需滚动条样式\n8. 配色美观,使用渐变和阴影效果",
+ "qaPrompt": "你是一位知识渊博、精通各学科的学习助手。一名学生正在学习并向你提问。\n\n重要:请用你自己的全面知识来透彻、准确地回答。学生的笔记只是补充背景,不是你回答的边界。\n\n要求:\n1. 用你的全部知识作答,做到透彻、准确、有教育意义\n2. 如果学生有相关笔记,引用它们来建立联系\n3. 适当使用对比、分析和具体的事实/数据\n4. 用中文书写,适合中学生阅读\n5. 推荐能捕捉关键要点的知识卡片——是已有笔记之外的新知识\n\n返回一个 JSON 对象:\n{\n \"answer\": \"你的中文完整回答……\",\n \"suggestedCards\": [\n {\n \"title\": \"卡片标题(20 字以内)\",\n \"content\": \"清晰讲解的知识点\",\n \"subject\": \"精确分类,如 历史-唐朝\",\n \"tags\": [\"相关\", \"标签\"]\n }\n ]\n}\n\n关键:answer 字段必须是纯文本,不要使用任何 markdown 格式。\n只返回合法的 JSON,不要输出其他文字。",
+ "questionsPrompt": "生成的问题应该覆盖:\n1. 基本概念(是什么)\n2. 原因分析(为什么)\n3. 影响/意义(有什么影响)\n4. 比较对比(与其他知识的关联)\n5. 深入思考(评价/启示)\n\n返回 JSON 数组,每个元素:{\"question\": \"问题内容\", \"category\": \"概念/原因/影响/对比/思考\"}"
    }
- }
+ }
package/core/extractor.js CHANGED
@@ -1,70 +1,70 @@
- const fs = require('fs');
- const path = require('path');
- const http = require('http');
- const https = require('https');
-
- async function extractFromFile(filePath) {
-   const ext = path.extname(filePath).toLowerCase();
-   const buf = fs.readFileSync(filePath);
-
-   if (ext === '.pdf') {
-     const pdfParse = require('pdf-parse');
-     const data = await pdfParse(buf);
-     return data.text;
-   }
-
-   if (ext === '.docx') {
-     const mammoth = require('mammoth');
-     const result = await mammoth.extractRawText({ buffer: buf });
-     return result.value;
-   }
-
-   if (ext === '.xlsx' || ext === '.xls') {
-     const XLSX = require('xlsx');
-     const wb = XLSX.read(buf);
-     const texts = [];
-     for (const name of wb.SheetNames) {
-       const sheet = wb.Sheets[name];
-       texts.push(`[${name}]\n${XLSX.utils.sheet_to_csv(sheet)}`);
-     }
-     return texts.join('\n\n');
-   }
-
-   if (ext === '.txt' || ext === '.md') {
-     return buf.toString('utf-8');
-   }
-
-   throw new Error(`Unsupported file type: ${ext}`);
- }
-
- function fetchUrl(urlStr) {
-   return new Promise((resolve, reject) => {
-     const url = new URL(urlStr);
-     const mod = url.protocol === 'https:' ? https : http;
-     mod.get(url, { timeout: 30000, headers: { 'User-Agent': 'Mozilla/5.0 StudyLens/1.0' } }, (res) => {
-       if (res.statusCode >= 300 && res.statusCode < 400 && res.headers.location) {
-         return fetchUrl(res.headers.location).then(resolve, reject);
-       }
-       if (res.statusCode >= 400) return reject(new Error(`HTTP ${res.statusCode}`));
-       const chunks = [];
-       res.on('data', c => chunks.push(c));
-       res.on('end', () => resolve(Buffer.concat(chunks).toString('utf-8')));
-     }).on('error', reject);
-   });
- }
-
- async function extractFromUrl(urlStr) {
-   const html = await fetchUrl(urlStr);
-   const cheerio = require('cheerio');
-   const $ = cheerio.load(html);
-
-   $('script, style, nav, footer, header, iframe, noscript').remove();
-
-   const article = $('article').length ? $('article') : $('main').length ? $('main') : $('body');
-   const text = article.text().replace(/\s+/g, ' ').trim();
-
-   if (text.length < 50) throw new Error('Could not extract meaningful text from URL');
-   return text.slice(0, 15000);
- }
-
- module.exports = { extractFromFile, extractFromUrl };
+ const fs = require('fs');
+ const path = require('path');
+ const http = require('http');
+ const https = require('https');
+
+ async function extractFromFile(filePath) {
+   const ext = path.extname(filePath).toLowerCase();
+   const buf = fs.readFileSync(filePath);
+
+   if (ext === '.pdf') {
+     const pdfParse = require('pdf-parse');
+     const data = await pdfParse(buf);
+     return data.text;
+   }
+
+   if (ext === '.docx') {
+     const mammoth = require('mammoth');
+     const result = await mammoth.extractRawText({ buffer: buf });
+     return result.value;
+   }
+
+   if (ext === '.xlsx' || ext === '.xls') {
+     const XLSX = require('xlsx');
+     const wb = XLSX.read(buf);
+     const texts = [];
+     for (const name of wb.SheetNames) {
+       const sheet = wb.Sheets[name];
+       texts.push(`[${name}]\n${XLSX.utils.sheet_to_csv(sheet)}`);
+     }
+     return texts.join('\n\n');
+   }
+
+   if (ext === '.txt' || ext === '.md') {
+     return buf.toString('utf-8');
+   }
+
+   throw new Error(`Unsupported file type: ${ext}`);
+ }
+
+ function fetchUrl(urlStr) {
+   return new Promise((resolve, reject) => {
+     const url = new URL(urlStr);
+     const mod = url.protocol === 'https:' ? https : http;
+     mod.get(url, { timeout: 30000, headers: { 'User-Agent': 'Mozilla/5.0 StudyLens/1.0' } }, (res) => {
+       if (res.statusCode >= 300 && res.statusCode < 400 && res.headers.location) {
+         return fetchUrl(res.headers.location).then(resolve, reject);
+       }
+       if (res.statusCode >= 400) return reject(new Error(`HTTP ${res.statusCode}`));
+       const chunks = [];
+       res.on('data', c => chunks.push(c));
+       res.on('end', () => resolve(Buffer.concat(chunks).toString('utf-8')));
+     }).on('error', reject);
+   });
+ }
+
+ async function extractFromUrl(urlStr) {
+   const html = await fetchUrl(urlStr);
+   const cheerio = require('cheerio');
+   const $ = cheerio.load(html);
+
+   $('script, style, nav, footer, header, iframe, noscript').remove();
+
+   const article = $('article').length ? $('article') : $('main').length ? $('main') : $('body');
+   const text = article.text().replace(/\s+/g, ' ').trim();
+
+   if (text.length < 50) throw new Error('Could not extract meaningful text from URL');
+   return text.slice(0, 15000);
+ }
+
+ module.exports = { extractFromFile, extractFromUrl };
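
A note on the redirect branch of `fetchUrl` in the hunk above: it passes `res.headers.location` directly to `new URL(...)`, and the single-argument `URL` constructor throws on a relative value such as `/login`, so the returned promise would reject for servers that send relative `Location` headers. The sketch below is not part of the package; `resolveRedirect` is a hypothetical helper that shows how the two-argument `URL` constructor resolves a redirect target against the URL that produced it.

```javascript
// Hypothetical helper (not in studylens): resolve a redirect target against
// the URL of the response that carried the Location header.
function resolveRedirect(currentUrl, location) {
  // The second argument is the base URL; absolute locations ignore it.
  return new URL(location, currentUrl).toString();
}

// An absolute Location passes through unchanged.
const absolute = resolveRedirect('https://example.com/a/b', 'https://other.example/x');

// A relative Location is resolved against the current URL.
const relative = resolveRedirect('https://example.com/a/b', '/login');

console.log(absolute);
console.log(relative);
```

Whether the package needs this depends on the sites it is pointed at; the diff itself does not change the redirect handling.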
@@ -3,7 +3,7 @@ const https = require('https');
  const fs = require('fs');
  const path = require('path');

- const LLM_CONFIG_PATH = path.join(__dirname, '..', 'wiki', 'config', 'llm-config.json');
+ const LLM_CONFIG_PATH = path.join(__dirname, '..', 'config', 'llm-config.json');
  const LLM_CONFIG_TEMPLATE = path.join(__dirname, '..', 'config', 'llm-config.template.json');

  function loadLLMConfig() {
@@ -670,4 +670,4 @@ Return ONLY valid JSON, no other text.`;
    return extractJSON(result, { isArray: true }) || [];
  }

- module.exports = { callLLM, analyze, findConnections, askQuestion, restructure, buildQAMindMap, generateSmartQuestions, generateTopicHTML, expandEntry, extractJSON, loadLLMConfig, saveLLMConfig, probeAgentMaestro, checkDuplicates };
+ module.exports = { callLLM, analyze, findConnections, askQuestion, restructure, buildQAMindMap, generateSmartQuestions, generateTopicHTML, expandEntry, extractJSON, loadLLMConfig, saveLLMConfig, probeAgentMaestro, checkDuplicates, buildProvider };
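
To make the `LLM_CONFIG_PATH` change concrete, here is a minimal sketch of how the two `path.join` expressions resolve. The install location `/opt/studylens/core` is an assumption chosen for illustration and stands in for `__dirname`; `path.posix` is used so the output is the same on every platform.

```javascript
const path = require('path');

// Hypothetical stand-in for __dirname; the real value depends on where
// the package is installed.
const moduleDir = '/opt/studylens/core';

// 0.1.2: the runtime config lived under the wiki data tree.
const oldConfigPath = path.posix.join(moduleDir, '..', 'wiki', 'config', 'llm-config.json');

// 0.1.3: the runtime config sits in the same directory as the template.
const newConfigPath = path.posix.join(moduleDir, '..', 'config', 'llm-config.json');
const templatePath = path.posix.join(moduleDir, '..', 'config', 'llm-config.template.json');

console.log(oldConfigPath);
console.log(newConfigPath);
console.log(path.posix.dirname(newConfigPath) === path.posix.dirname(templatePath));
```

Under this assumption, a config file written by 0.1.2 at the old location would not be at the path 0.1.3 reads. The diff does not show whether `loadLLMConfig` handles that case.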