rosetta-cli 2.0.0b108__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- rosetta_cli-2.0.0b108/MANIFEST.in +9 -0
- rosetta_cli-2.0.0b108/PKG-INFO +639 -0
- rosetta_cli-2.0.0b108/README.md +611 -0
- rosetta_cli-2.0.0b108/env.template +47 -0
- rosetta_cli-2.0.0b108/pyproject.toml +51 -0
- rosetta_cli-2.0.0b108/rosetta_cli/__init__.py +12 -0
- rosetta_cli-2.0.0b108/rosetta_cli/__main__.py +6 -0
- rosetta_cli-2.0.0b108/rosetta_cli/cli.py +379 -0
- rosetta_cli-2.0.0b108/rosetta_cli/commands/__init__.py +5 -0
- rosetta_cli-2.0.0b108/rosetta_cli/commands/base_command.py +82 -0
- rosetta_cli-2.0.0b108/rosetta_cli/commands/cleanup_command.py +214 -0
- rosetta_cli-2.0.0b108/rosetta_cli/commands/list_command.py +70 -0
- rosetta_cli-2.0.0b108/rosetta_cli/commands/parse_command.py +205 -0
- rosetta_cli-2.0.0b108/rosetta_cli/commands/publish_command.py +113 -0
- rosetta_cli-2.0.0b108/rosetta_cli/commands/verify_command.py +46 -0
- rosetta_cli-2.0.0b108/rosetta_cli/ims_auth.py +124 -0
- rosetta_cli-2.0.0b108/rosetta_cli/ims_config.py +317 -0
- rosetta_cli-2.0.0b108/rosetta_cli/ims_publisher.py +836 -0
- rosetta_cli-2.0.0b108/rosetta_cli/ims_utils.py +28 -0
- rosetta_cli-2.0.0b108/rosetta_cli/ragflow_client.py +928 -0
- rosetta_cli-2.0.0b108/rosetta_cli/services/__init__.py +8 -0
- rosetta_cli-2.0.0b108/rosetta_cli/services/auth_service.py +114 -0
- rosetta_cli-2.0.0b108/rosetta_cli/services/dataset_service.py +72 -0
- rosetta_cli-2.0.0b108/rosetta_cli/services/document_data.py +408 -0
- rosetta_cli-2.0.0b108/rosetta_cli/services/document_service.py +357 -0
- rosetta_cli-2.0.0b108/rosetta_cli/typing_utils.py +49 -0
- rosetta_cli-2.0.0b108/rosetta_cli.egg-info/PKG-INFO +639 -0
- rosetta_cli-2.0.0b108/rosetta_cli.egg-info/SOURCES.txt +31 -0
- rosetta_cli-2.0.0b108/rosetta_cli.egg-info/dependency_links.txt +1 -0
- rosetta_cli-2.0.0b108/rosetta_cli.egg-info/entry_points.txt +2 -0
- rosetta_cli-2.0.0b108/rosetta_cli.egg-info/requires.txt +10 -0
- rosetta_cli-2.0.0b108/rosetta_cli.egg-info/top_level.txt +1 -0
- rosetta_cli-2.0.0b108/setup.cfg +4 -0
@@ -0,0 +1,639 @@
Metadata-Version: 2.4
Name: rosetta-cli
Version: 2.0.0b108
Summary: Rosetta CLI for publishing knowledge base content to RAGFlow
Author: Igor Solomatov
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/griddynamics/rosetta
Project-URL: Discord, https://discord.gg/QzZ2cWg36g
Project-URL: Website, https://griddynamics.github.io/rosetta/
Project-URL: Support, https://github.com/griddynamics/rosetta/issues
Keywords: rosetta,ragflow,cli,knowledge-base,publishing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.12
Description-Content-Type: text/markdown
Requires-Dist: python-dotenv<2.0.0,>=1.0.0
Requires-Dist: python-frontmatter<2.0.0,>=1.1.0
Requires-Dist: ragflow-sdk<1.0.0,>=0.23.1
Requires-Dist: requests<3.0.0,>=2.31.0
Requires-Dist: tqdm<5.0.0,>=4.67.0
Provides-Extra: dev
Requires-Dist: build>=1.0.0; extra == "dev"
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: twine>=4.0.0; extra == "dev"

# Rosetta CLI

> Knowledge base publishing and management tools powered by RAGFlow

## 🎯 Overview

This directory contains the Python package for publishing knowledge base content to RAGFlow instances. The CLI supports multi-environment workflows with smart change detection and automatic metadata extraction.

## Community

- [Discord](https://discord.gg/QzZ2cWg36g)
- [Website](https://griddynamics.github.io/rosetta/)
- [rosetta-support@griddynamics.com](mailto:rosetta-support@griddynamics.com)

### Key Features

- **🚀 Smart Publishing** - MD5 hash-based change detection (~77% faster republishing)
- **🏗️ Modular Architecture** - Command pattern with service layer for maintainability
- **🏷️ Tag-in-Title Format** - `[tag1][tag2] filename.ext` for powerful server-side filtering
- **📊 Parse Status Tracking** - Monitor document parsing progress with visual indicators
- **🔄 Upsert Semantics** - No duplicates, republishing updates existing documents
- **⏱️ Performance Timing** - All commands show execution time
- **🌍 Multi-Environment** - Switch between local, dev, and production configs
- **🔐 API Key Auth** - Secure authentication via RAGFlow API keys
- **🎯 Server-Side Filtering** - Reduce network traffic with metadata conditions

### Quick Navigation

- **Complete Setup Guide:** See [docs/QUICKSTART.md](../docs/QUICKSTART.md) for detailed setup instructions
- **CLI Commands:** See [CLI Commands](#-cli-commands) for all available commands
- **Environment Management:** See [Environment Management](#-environment-management) for switching configs

## 📁 Contents

```
rosetta-cli/
├── pyproject.toml          # Package metadata + console entrypoint
├── rosetta_cli/            # Installable Python package
│   ├── cli.py              # CLI entry point
│   ├── commands/           # Command implementations
│   ├── services/           # Shared business logic
│   ├── ims_config.py       # Configuration management
│   ├── ims_publisher.py    # Publishing orchestration
│   └── ragflow_client.py   # RAGFlow SDK wrapper
├── env.template            # Environment configuration template
├── tests/                  # CLI unit tests
└── README.md               # This file
```

## 🚀 Quick Start

Complete setup instructions are in [docs/QUICKSTART.md](../docs/QUICKSTART.md). Here's the quick reference:

### Prerequisites

- Python 3.12 (required by ragflow-sdk 0.23.1)
- RAGFlow instance (local via Docker Compose or remote)
- `uvx` for installed CLI usage
- Root virtual environment configured for local CLI development

### Installed Usage

```bash
uvx rosetta-cli@latest verify
```

### Local Development

```bash
python3 -m venv venv
venv/bin/pip install -r requirements.txt
cp rosetta-cli/.env.dev .env
venv/bin/rosetta-cli verify
```

## 🔧 CLI Commands

All commands support the `--env <environment>` flag, which overrides the active environment for that invocation.
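
One common way to give every subcommand the same flag is an `argparse` parent parser. The sketch below is only an illustration of that pattern, not the package's actual `cli.py`: the `build_parser` helper is hypothetical, and the real commands take further arguments (for example the content path for `publish`) that are omitted here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser: every subcommand inherits the shared --env option."""
    # Options on a parent parser (add_help=False) are copied into each child.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument(
        "--env",
        metavar="ENVIRONMENT",
        default=None,
        help="override the active environment for this run",
    )

    parser = argparse.ArgumentParser(prog="rosetta-cli")
    subcommands = parser.add_subparsers(dest="command", required=True)
    # Command names are the ones used throughout this README.
    for name in ("publish", "parse", "list-dataset", "cleanup-dataset", "verify"):
        subcommands.add_parser(name, parents=[common])
    return parser


args = build_parser().parse_args(["verify", "--env", "production"])
print(args.command, args.env)
```

When `--env` is omitted, `args.env` is `None` in this sketch, which a caller could treat as "use the active `.env` file".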

### Publishing Commands

#### Publish Knowledge Base Content

```bash
# Publish all instructions (only changed files)
uvx rosetta-cli@latest publish ../instructions

# Publish business context
uvx rosetta-cli@latest publish ../business

# Force republish all files (bypass change detection)
uvx rosetta-cli@latest publish ../instructions --force

# Preview changes without publishing
uvx rosetta-cli@latest publish ../instructions --dry-run

# Use different environment
uvx rosetta-cli@latest publish ../instructions --env production
```

**Performance:**
- First publish: ~10-15s per file (embedding generation + parsing)
- Subsequent publishes: Only changed files (~77% faster)
- Dry run: Preview in ~2-3s

**What gets published:**

```
File: /instructions/agents/r1/agents.md

Published as:
  Document ID: b0ec4d56-6cc5-5bbd-9868-5d49afa2a7d8 (UUID from path)
  Title: [instructions][agents][r1] agents.md
  Dataset: aia-r1 (from template: aia-{release})
  Tags: ["instructions", "agents", "r1"] (in metadata)
  Domain: instructions (first folder)
  Release: r1 (auto-detected from path)
  Content Hash: abc123... (MD5 of content)
```
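
A path-derived document ID is what makes republishing an update rather than a duplicate: the same path always maps to the same ID. The sketch below shows one way to get that property with a name-based UUID. The version digit of the sample ID above is 5, which is consistent with `uuid.uuid5`, but the namespace and the exact input string used by the tool are not stated here, so `NAMESPACE_URL` and the `document_id` helper are assumptions and will not reproduce the sample ID.

```python
import uuid
from pathlib import PurePosixPath


def document_id(relative_path: str) -> str:
    """Deterministic ID for a file path (namespace choice is an assumption)."""
    normalized = PurePosixPath(relative_path).as_posix()
    return str(uuid.uuid5(uuid.NAMESPACE_URL, normalized))


first = document_id("instructions/agents/r1/agents.md")
second = document_id("instructions/agents/r1/agents.md")
other = document_id("instructions/agents/r1/bootstrap.md")
print(first == second, first == other)
```

Because the ID depends only on the path, renaming or moving a file yields a new document, while editing its content keeps the ID stable.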

#### Trigger Document Parsing

Re-parse documents without re-uploading them (useful after changing parser settings):

```bash
# Parse all unparsed documents
uvx rosetta-cli@latest parse

# Parse specific dataset
uvx rosetta-cli@latest parse --dataset aia-r1

# Force re-parse ALL documents
uvx rosetta-cli@latest parse --dataset aia-r1 --force

# Preview without parsing (dry run)
uvx rosetta-cli@latest parse --dataset aia-r1 --dry-run
```
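
The difference between the default run and `--force` can be pictured as a selection step over the documents in a dataset. This is a sketch under stated assumptions: the `DocumentStatus` type, the status strings, and the rule that finished or in-progress documents are skipped are all hypothetical and may not match how `parse_command.py` decides.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DocumentStatus:
    name: str
    status: str  # assumed values such as "UNSTART", "RUNNING", "DONE", "FAIL"


# Assumption: without --force, finished and in-progress documents are left alone.
SKIP_WHEN_NOT_FORCED = {"DONE", "RUNNING"}


def select_for_parsing(
    documents: Iterable[DocumentStatus], force: bool = False
) -> list[DocumentStatus]:
    """Pick the documents a parse run would act on."""
    if force:
        return list(documents)
    return [doc for doc in documents if doc.status not in SKIP_WHEN_NOT_FORCED]


docs = [
    DocumentStatus("agents.md", "DONE"),
    DocumentStatus("bootstrap.md", "UNSTART"),
    DocumentStatus("RFP.pdf", "FAIL"),
]
print([d.name for d in select_for_parsing(docs)])
print([d.name for d in select_for_parsing(docs, force=True)])
```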

#### List Documents

```bash
# List documents in default dataset
uvx rosetta-cli@latest list-dataset

# List specific dataset
uvx rosetta-cli@latest list-dataset --dataset aia-r1
```

**Output shows:**
- Document title (with tag prefixes)
- Document ID, file size, parse status, chunk count
- Metadata (tags, domain, release, source path)

#### Cleanup Dataset

```bash
# Preview cleanup without deleting
uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --dry-run

# Cleanup documents with specific prefix
uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --prefix "aqa-phase" --dry-run

# Cleanup documents with specific tags (space-separated)
uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --tags "r1 agents" --dry-run

# Cleanup documents with specific tags (comma-separated)
uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --tags "r1,agents" --dry-run

# Force cleanup without confirmation
uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --force

# Force cleanup with prefix
uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --prefix "aqa-phase" --force

# Force cleanup with tags
uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --tags "r1,agents" --force
```

⚠️ **Warning:** Without `--prefix` or `--tags`, this deletes ALL documents in the dataset. Run with `--dry-run` first.

**Filtering Options:**
- `--prefix`: Match documents by title prefix (e.g., `"[instructions]"`)
- `--tags`: Match documents by metadata tags (e.g., `"r1 agents"` or `"r1,agents"`)
  - Uses OR logic: finds documents with ANY of the specified tags
  - Server-side filtering for efficiency

### Verification Commands

#### Verify Connection

```bash
uvx rosetta-cli@latest verify

# Check production environment
uvx rosetta-cli@latest verify --env production
```

Checks:
- API key validity
- RAGFlow server connectivity
- System health (database, Redis, document engine)
- Available datasets

## 🌍 Environment Management

### Configuration Files

| File | Environment | Purpose |
|------|-------------|---------|
| `env.template` | Template | Create new environments |
| `.env` | **Active** | Current configuration (gitignored) |
| `.env.local` | Local | Local RAGFlow development |
| `.env.remote` | Remote | Production RAGFlow instance |

### Switch Environments

**Method 1: Copy file to .env** (recommended)

```bash
# Switch to local
cp .env.local .env

# Switch to production
cp .env.remote .env

# Check current environment
grep "ENVIRONMENT=" .env
```

**Method 2: Use --env flag** (temporary override)

```bash
uvx rosetta-cli@latest list-dataset --env local
uvx rosetta-cli@latest publish ../instructions --env production
```

### Environment Variables

```bash
# Required
RAGFLOW_BASE_URL=http://your-ragflow-instance
RAGFLOW_API_KEY=ragflow-xxx...
ENVIRONMENT=local

# Dataset Configuration
RAGFLOW_DATASET_DEFAULT=aia
RAGFLOW_DATASET_TEMPLATE=aia-{release}

# Embedding Model (optional)
RAGFLOW_EMBEDDING_MODEL=text-embedding-3-small@OpenAI

# Chunking Configuration (optional)
RAGFLOW_CHUNK_METHOD=naive
RAGFLOW_CHUNK_TOKEN_NUM=512
RAGFLOW_DELIMITER=\n
RAGFLOW_AUTO_KEYWORDS=0
RAGFLOW_AUTO_QUESTIONS=0
```
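
To show how these variables might be validated before use, here is a small sketch that reads them from any mapping (such as `os.environ`). The `Settings` dataclass and `load_settings` function are hypothetical and are not the package's `IMSConfig`; the fallback values simply reuse the sample values above and are assumptions rather than documented defaults.

```python
from dataclasses import dataclass
from typing import Mapping

# The three variables listed under "# Required" above.
REQUIRED = ("RAGFLOW_BASE_URL", "RAGFLOW_API_KEY", "ENVIRONMENT")


@dataclass(frozen=True)
class Settings:
    base_url: str
    api_key: str
    environment: str
    dataset_default: str
    dataset_template: str
    chunk_token_num: int


def load_settings(env: Mapping[str, str]) -> Settings:
    """Validate required keys and convert numeric values."""
    missing = [key for key in REQUIRED if not env.get(key, "").strip()]
    if missing:
        raise ValueError(f"missing required settings: {', '.join(missing)}")
    return Settings(
        base_url=env["RAGFLOW_BASE_URL"].rstrip("/"),
        api_key=env["RAGFLOW_API_KEY"],
        environment=env["ENVIRONMENT"],
        # Fallbacks below are assumptions copied from the sample file.
        dataset_default=env.get("RAGFLOW_DATASET_DEFAULT", "aia"),
        dataset_template=env.get("RAGFLOW_DATASET_TEMPLATE", "aia-{release}"),
        chunk_token_num=int(env.get("RAGFLOW_CHUNK_TOKEN_NUM", "512")),
    )


settings = load_settings({
    "RAGFLOW_BASE_URL": "http://your-ragflow-instance/",
    "RAGFLOW_API_KEY": "ragflow-xxx",
    "ENVIRONMENT": "local",
    "RAGFLOW_CHUNK_TOKEN_NUM": "256",
})
print(settings.base_url, settings.chunk_token_num)
```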

### Creating New Environments

```bash
cp env.template .env.staging
nano .env.staging
uvx rosetta-cli@latest verify --env staging
```

## 🏗️ Architecture

### Key Components

#### RAGFlowClient (`ragflow_client.py`)

Wrapper around ragflow-sdk:

```python
from pathlib import Path

from rosetta_cli.ragflow_client import RAGFlowClient, DocumentMetadata

client = RAGFlowClient(api_key="ragflow-xxx", base_url="http://your-ragflow-instance")

# Dataset management
client.create_dataset(name="aia-r1", description="Release 1")
client.get_dataset(name="aia-r1")
client.list_datasets()

# Document upload with change detection
client.upload_document(
    file_path=Path("agents.md"),
    metadata=DocumentMetadata(...),
    dataset_id="dataset-id",
    force=False,  # Skip if unchanged
)

# Health check
client.verify_connection()
client.get_system_health()
```

#### IMSConfig (`ims_config.py`)

Configuration management with smart .env discovery:

```python
from rosetta_cli.ims_config import IMSConfig

# Auto-discover .env (searches cwd, script dir, git root)
config = IMSConfig.from_env()

# Use specific environment
config = IMSConfig.from_env(environment="production")

# Validate configuration
config.validate()
```

#### ContentPublisher (`ims_publisher.py`)

Publishing logic with metadata extraction:

```python
from pathlib import Path

from rosetta_cli.ims_publisher import ContentPublisher

# client, config, and workspace_root are created as in the sections above
publisher = ContentPublisher(client, config, workspace_root)

results = publisher.publish(
    content_path=Path("../instructions"),
    force=False,        # False: skip unchanged files
    dry_run=False,      # True: preview only
    no_parse=False,     # True: skip parsing after upload
    parse_timeout=300,  # Parse timeout in seconds
)

print(f"Published: {results.published_count}")
print(f"Skipped: {results.skipped_count}")
print(f"Failed: {results.failed_count}")
```

**Metadata Extraction:**

```
File: /instructions/agents/r1/bootstrap.md

Extracted:
  Tags: ["instructions", "agents", "r1"]
  Domain: instructions
  Release: r1
  Title: bootstrap.md
  Content Hash: abc123... (MD5)
  Document ID: uuid-from-path
```

## 🎯 Tag-in-Title Format

### What is Tag-in-Title?

Document titles are stored with their tags as bracketed prefixes, which enables server-side filtering:

```
Format: [tag1][tag2][tag3] filename.ext

Examples:
  [instructions][agents][r1] agents.md
  [business][project] RFP.pdf
```
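
The format is simple enough to build and take apart with a few lines of Python. This is an illustrative sketch only; `format_title` and `parse_title` are hypothetical helpers and not part of the `rosetta_cli` API.

```python
import re

# Zero or more "[tag]" groups, optional whitespace, then the filename.
_TITLE = re.compile(r"^((?:\[[^\[\]]+\])*)\s*(.+)$")


def format_title(tags: list[str], filename: str) -> str:
    """Build '[tag1][tag2] filename.ext' from tags and a filename."""
    if not tags:
        return filename
    return "".join(f"[{tag}]" for tag in tags) + f" {filename}"


def parse_title(title: str) -> tuple[list[str], str]:
    """Split a tag-in-title string back into (tags, filename)."""
    match = _TITLE.match(title)
    if match is None:
        raise ValueError("title must not be empty")
    prefix, filename = match.group(1), match.group(2)
    return re.findall(r"\[([^\[\]]+)\]", prefix), filename


print(format_title(["instructions", "agents", "r1"], "agents.md"))
print(parse_title("[business][project] RFP.pdf"))
```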

### Why Two Locations?

Tags are stored in **both title and metadata**:

- **Title:** Fast server-side keyword search
- **Metadata:** Precise client-side filtering with complex queries

### How Tags are Generated

Tags come from **folder structure only**:

```
File: /instructions/agents/r1/bootstrap.md
Folders: instructions / agents / r1 / (file)
Tags: [instructions][agents][r1]
```

### Using Tags for Filtering

```bash
# Delete all instruction documents
uvx rosetta-cli@latest cleanup-dataset --prefix "[instructions]"

# Delete all r1 agent documents
uvx rosetta-cli@latest cleanup-dataset --prefix "[instructions][agents][r1]"
```

## 💻 Usage Examples

### Example 1: First-Time Setup

```bash
python3 -m venv venv
venv/bin/pip install -r requirements.txt
cp rosetta-cli/env.template .env
nano .env  # Add RAGFLOW_BASE_URL and RAGFLOW_API_KEY
uvx rosetta-cli@latest verify
uvx rosetta-cli@latest publish instructions
```

### Example 2: Daily Publishing Workflow

```bash
uvx rosetta-cli@latest publish ../instructions --dry-run
uvx rosetta-cli@latest publish ../instructions
uvx rosetta-cli@latest list-dataset
```

### Example 3: Multi-Environment Publishing

```bash
# Publish to dev
uvx rosetta-cli@latest publish ../instructions --env dev

# Verify on dev
uvx rosetta-cli@latest verify --env dev

# Publish to production
uvx rosetta-cli@latest publish ../instructions --env prod
```

### Example 4: Cleanup and Republish

```bash
# Preview deletion
uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --dry-run

# Delete all documents
uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --force

# Republish everything
uvx rosetta-cli@latest publish ../instructions --force
```

### Example 5: Programmatic Usage

```python
from pathlib import Path
from rosetta_cli.ragflow_client import RAGFlowClient, DocumentMetadata
from rosetta_cli.ims_config import IMSConfig
from rosetta_cli.ims_publisher import ContentPublisher

config = IMSConfig.from_env()
client = RAGFlowClient(
    api_key=config.api_key,
    base_url=config.base_url,
    embedding_model=config.embedding_model,
    chunk_method=config.chunk_method,
    parser_config=config.parser_config,
)

client.verify_connection()
publisher = ContentPublisher(client, config, Path("/path/to/workspace"))

results = publisher.publish(
    content_path=Path("/path/to/workspace") / "instructions",
    force=False,
    dry_run=False,
)

print(f"Published: {results.published_count}, Skipped: {results.skipped_count}")
```

## 🔍 Troubleshooting

### Error: "api_key cannot be empty"

Set `RAGFLOW_API_KEY` in `.env`:
```bash
nano .env
# Add: RAGFLOW_API_KEY=ragflow-xxxxxxxxxxxxxxxxxxxx
```

### Error: "Invalid API key or expired token"

Generate a new API key:
1. Log in to RAGFlow
2. Profile → API Keys → Generate New Key
3. Update the `.env` file

### Error: "Connection refused"

1. Check that RAGFlow is running: `docker ps | grep ragflow`
2. Verify the URL: `grep RAGFLOW_BASE_URL .env`
3. Test the endpoint: `curl http://your-ragflow-instance/v1/system/healthz`

### Error: "Module 'ragflow_sdk' not found"

```bash
venv/bin/pip install -r requirements.txt
```

### Error: "No .env file found"

```bash
cp rosetta-cli/env.template .env
nano .env
```

### Parse Status Shows "FAIL"

1. Check the document format (PDF, MD, TXT supported)
2. Re-trigger parsing: `uvx rosetta-cli@latest parse --dataset aia-r1 --force`
3. Check the RAGFlow logs: `docker logs ragflow-server`

### Slow Publishing Performance

- Use a faster embedding model, for example `RAGFLOW_EMBEDDING_MODEL=text-embedding-3-small@OpenAI`
- Let change detection do its work (avoid `--force` unless a full republish is needed)
- Reduce the chunk size: `RAGFLOW_CHUNK_TOKEN_NUM=256`

### Documents Not Showing Tags

Tags should appear in the title in the format `[tag1][tag2]`:
```bash
uvx rosetta-cli@latest list-dataset
# Output: 1. [instructions][agents][r1] agents.md
```

## 🚦 Performance Tips

### 1. Use Change Detection

```bash
# Good: Only publishes changed files (~77% faster)
uvx rosetta-cli@latest publish ../instructions

# Bad: Republishes everything
uvx rosetta-cli@latest publish ../instructions --force
```

### 2. Use Dry Run to Preview

```bash
# Preview (fast)
uvx rosetta-cli@latest publish ../instructions --dry-run

# Then publish for real
uvx rosetta-cli@latest publish ../instructions
```

### 3. Optimize Chunking

```bash
# Faster parsing
RAGFLOW_CHUNK_TOKEN_NUM=256

# Better context
RAGFLOW_CHUNK_TOKEN_NUM=1024
```

### 4. Use Selective Cleanup

```bash
# Fast: Delete specific documents
uvx rosetta-cli@latest cleanup-dataset --prefix "[instructions][agents]" --force

# Slow: Delete and republish everything
uvx rosetta-cli@latest cleanup-dataset --force
uvx rosetta-cli@latest publish ../instructions --force
```

### 5. Monitor Parse Status

```bash
uvx rosetta-cli@latest list-dataset | grep "Parse Status"
```

## 📖 Advanced Topics

### Custom Dataset Naming

`RAGFLOW_DATASET_TEMPLATE` supports a `{release}` placeholder, which is filled with the release detected from the file path. Files without a release folder go to the default dataset:

```bash
RAGFLOW_DATASET_TEMPLATE=aia-{release}

# /instructions/r1/file.md → aia-r1
# /instructions/r2/file.md → aia-r2
# /instructions/file.md → aia (default)
```

### Supported File Types

**Text files** (extracted and chunked):
- Markdown (`.md`)
- Plain text (`.txt`)

**Binary files** (uploaded for storage):
- PDF, Excel, Word, PowerPoint
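
A publisher has to decide which of the two treatments a file gets, and an extension lookup is the simplest way to do it. The sketch below is illustrative only: the README names the binary formats but not their extensions, so the `.pdf`, `.xlsx`, `.docx`, and `.pptx` entries are assumptions, and `classify` is a hypothetical helper.

```python
from pathlib import PurePosixPath

TEXT_EXTENSIONS = {".md", ".txt"}
# Assumed extensions for "PDF, Excel, Word, PowerPoint".
BINARY_EXTENSIONS = {".pdf", ".xlsx", ".docx", ".pptx"}


def classify(filename: str) -> str:
    """Return 'text', 'binary', or 'unsupported' based on the file extension."""
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix in TEXT_EXTENSIONS:
        return "text"
    if suffix in BINARY_EXTENSIONS:
        return "binary"
    return "unsupported"


for name in ("agents.md", "RFP.PDF", "archive.zip"):
    print(name, classify(name))
```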

### Environment File Discovery

When a command runs without an explicit configuration file, the CLI searches for one in this order:

1. Current directory: `.env.{environment}` or `.env`
2. Script directory: `.env.{environment}` or `.env`
3. Git root: `.env.{environment}` or `.env`

## 📝 Related Documentation

- **Complete Setup:** [docs/QUICKSTART.md](../docs/QUICKSTART.md) - Comprehensive setup guide
- **Architecture:** [docs/CONTEXT.md](../docs/CONTEXT.md) - System architecture
- **Environment Template:** `env.template` - Configuration options
- **Requirements:** `requirements.txt` - Python dependencies