rosetta-cli 2.0.0b108__tar.gz

Files changed (33)
  1. rosetta_cli-2.0.0b108/MANIFEST.in +9 -0
  2. rosetta_cli-2.0.0b108/PKG-INFO +639 -0
  3. rosetta_cli-2.0.0b108/README.md +611 -0
  4. rosetta_cli-2.0.0b108/env.template +47 -0
  5. rosetta_cli-2.0.0b108/pyproject.toml +51 -0
  6. rosetta_cli-2.0.0b108/rosetta_cli/__init__.py +12 -0
  7. rosetta_cli-2.0.0b108/rosetta_cli/__main__.py +6 -0
  8. rosetta_cli-2.0.0b108/rosetta_cli/cli.py +379 -0
  9. rosetta_cli-2.0.0b108/rosetta_cli/commands/__init__.py +5 -0
  10. rosetta_cli-2.0.0b108/rosetta_cli/commands/base_command.py +82 -0
  11. rosetta_cli-2.0.0b108/rosetta_cli/commands/cleanup_command.py +214 -0
  12. rosetta_cli-2.0.0b108/rosetta_cli/commands/list_command.py +70 -0
  13. rosetta_cli-2.0.0b108/rosetta_cli/commands/parse_command.py +205 -0
  14. rosetta_cli-2.0.0b108/rosetta_cli/commands/publish_command.py +113 -0
  15. rosetta_cli-2.0.0b108/rosetta_cli/commands/verify_command.py +46 -0
  16. rosetta_cli-2.0.0b108/rosetta_cli/ims_auth.py +124 -0
  17. rosetta_cli-2.0.0b108/rosetta_cli/ims_config.py +317 -0
  18. rosetta_cli-2.0.0b108/rosetta_cli/ims_publisher.py +836 -0
  19. rosetta_cli-2.0.0b108/rosetta_cli/ims_utils.py +28 -0
  20. rosetta_cli-2.0.0b108/rosetta_cli/ragflow_client.py +928 -0
  21. rosetta_cli-2.0.0b108/rosetta_cli/services/__init__.py +8 -0
  22. rosetta_cli-2.0.0b108/rosetta_cli/services/auth_service.py +114 -0
  23. rosetta_cli-2.0.0b108/rosetta_cli/services/dataset_service.py +72 -0
  24. rosetta_cli-2.0.0b108/rosetta_cli/services/document_data.py +408 -0
  25. rosetta_cli-2.0.0b108/rosetta_cli/services/document_service.py +357 -0
  26. rosetta_cli-2.0.0b108/rosetta_cli/typing_utils.py +49 -0
  27. rosetta_cli-2.0.0b108/rosetta_cli.egg-info/PKG-INFO +639 -0
  28. rosetta_cli-2.0.0b108/rosetta_cli.egg-info/SOURCES.txt +31 -0
  29. rosetta_cli-2.0.0b108/rosetta_cli.egg-info/dependency_links.txt +1 -0
  30. rosetta_cli-2.0.0b108/rosetta_cli.egg-info/entry_points.txt +2 -0
  31. rosetta_cli-2.0.0b108/rosetta_cli.egg-info/requires.txt +10 -0
  32. rosetta_cli-2.0.0b108/rosetta_cli.egg-info/top_level.txt +1 -0
  33. rosetta_cli-2.0.0b108/setup.cfg +4 -0
@@ -0,0 +1,9 @@
1
+ include README.md
2
+ include env.template
3
+ recursive-include rosetta_cli *.py
4
+ recursive-exclude tests *
5
+ exclude .env
6
+ exclude .env.dev
7
+ exclude .env.prod
8
+ exclude .DS_Store
9
+ prune logs
@@ -0,0 +1,639 @@
1
+ Metadata-Version: 2.4
2
+ Name: rosetta-cli
3
+ Version: 2.0.0b108
4
+ Summary: Rosetta CLI for publishing knowledge base content to RAGFlow
5
+ Author: Igor Solomatov
6
+ License-Expression: Apache-2.0
7
+ Project-URL: Homepage, https://github.com/griddynamics/rosetta
8
+ Project-URL: Discord, https://discord.gg/QzZ2cWg36g
9
+ Project-URL: Website, https://griddynamics.github.io/rosetta/
10
+ Project-URL: Support, https://github.com/griddynamics/rosetta/issues
11
+ Keywords: rosetta,ragflow,cli,knowledge-base,publishing
12
+ Classifier: Development Status :: 4 - Beta
13
+ Classifier: Intended Audience :: Developers
14
+ Classifier: Programming Language :: Python :: 3
15
+ Classifier: Programming Language :: Python :: 3.12
16
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
17
+ Requires-Python: >=3.12
18
+ Description-Content-Type: text/markdown
19
+ Requires-Dist: python-dotenv<2.0.0,>=1.0.0
20
+ Requires-Dist: python-frontmatter<2.0.0,>=1.1.0
21
+ Requires-Dist: ragflow-sdk<1.0.0,>=0.23.1
22
+ Requires-Dist: requests<3.0.0,>=2.31.0
23
+ Requires-Dist: tqdm<5.0.0,>=4.67.0
24
+ Provides-Extra: dev
25
+ Requires-Dist: build>=1.0.0; extra == "dev"
26
+ Requires-Dist: pytest>=7.0.0; extra == "dev"
27
+ Requires-Dist: twine>=4.0.0; extra == "dev"
28
+
29
+ # Rosetta CLI
30
+
31
+ > Knowledge base publishing and management tools powered by RAGFlow
32
+
33
+ ## 🎯 Overview
34
+
35
+ This directory contains the Python package for publishing knowledge base content to RAGFlow instances. The CLI supports multi-environment workflows with smart change detection and auto-metadata extraction.
36
+
37
+ ## Community
38
+
39
+ - [Discord](https://discord.gg/QzZ2cWg36g)
40
+ - [Website](https://griddynamics.github.io/rosetta/)
41
+ - [rosetta-support@griddynamics.com](mailto:rosetta-support@griddynamics.com)
42
+
43
+ ### Key Features
44
+
45
+ - **🚀 Smart Publishing** - MD5 hash-based change detection (~77% faster republishing)
46
+ - **🏗️ Modular Architecture** - Command pattern with service layer for maintainability
47
+ - **🏷️ Tag-in-Title Format** - `[tag1][tag2] filename.ext` for powerful server-side filtering
48
+ - **📊 Parse Status Tracking** - Monitor document parsing progress with visual indicators
49
+ - **🔄 Upsert Semantics** - No duplicates, republishing updates existing documents
50
+ - **⏱️ Performance Timing** - All commands show execution time
51
+ - **🌍 Multi-Environment** - Switch between local, dev, and production configs
52
+ - **🔐 API Key Auth** - Secure authentication via RAGFlow API keys
53
+ - **🎯 Server-Side Filtering** - Reduce network traffic with metadata conditions
54
+
55
+ ### Quick Navigation
56
+
57
+ - **Complete Setup Guide:** See [docs/QUICKSTART.md](../docs/QUICKSTART.md) for detailed setup instructions
58
+ - **CLI Commands:** See [CLI Commands](#-cli-commands) for all available commands
59
+ - **Environment Management:** See [Environment Management](#-environment-management) for switching configs
60
+
61
+ ## 📁 Contents
62
+
63
+ ```
64
+ rosetta-cli/
65
+ ├── pyproject.toml # Package metadata + console entrypoint
66
+ ├── rosetta_cli/ # Installable Python package
67
+ │ ├── cli.py # CLI entry point
68
+ │ ├── commands/ # Command implementations
69
+ │ ├── services/ # Shared business logic
70
+ │ ├── ims_config.py # Configuration management
71
+ │ ├── ims_publisher.py # Publishing orchestration
72
+ │ └── ragflow_client.py # RAGFlow SDK wrapper
73
+ ├── env.template # Environment configuration template
74
+ ├── tests/ # CLI unit tests
75
+ └── README.md # This file
76
+ ```
77
+
78
+ ## 🚀 Quick Start
79
+
80
+ Complete setup instructions are in [docs/QUICKSTART.md](../docs/QUICKSTART.md). Here's the quick reference:
81
+
82
+ ### Prerequisites
83
+
84
+ - Python 3.12 (required by ragflow-sdk 0.23.1)
85
+ - RAGFlow instance (local via Docker Compose or remote)
86
+ - `uvx` for installed CLI usage
87
+ - Root virtual environment configured for local CLI development
88
+
89
+ ### Installed Usage
90
+
91
+ ```bash
92
+ uvx rosetta-cli@latest verify
93
+ ```
94
+
95
+ ### Local Development
96
+
97
+ ```bash
98
+ python3 -m venv venv
99
+ venv/bin/pip install -r requirements.txt
100
+ cp rosetta-cli/.env.dev .env
101
+ venv/bin/rosetta-cli verify
102
+ ```
103
+
104
+ ## 🔧 CLI Commands
105
+
106
+ All commands support the `--env <environment>` flag to override the active environment.
107
+
108
+ ### Publishing Commands
109
+
110
+ #### Publish Knowledge Base Content
111
+
112
+ ```bash
113
+ # Publish all instructions (only changed files)
114
+ uvx rosetta-cli@latest publish ../instructions
115
+
116
+ # Publish business context
117
+ uvx rosetta-cli@latest publish ../business
118
+
119
+ # Force republish all files (bypass change detection)
120
+ uvx rosetta-cli@latest publish ../instructions --force
121
+
122
+ # Preview changes without publishing
123
+ uvx rosetta-cli@latest publish ../instructions --dry-run
124
+
125
+ # Use different environment
126
+ uvx rosetta-cli@latest publish ../instructions --env production
127
+ ```
128
+
129
+ **Performance:**
130
+ - First publish: ~10-15s per file (embedding generation + parsing)
131
+ - Subsequent publishes: Only changed files (~77% faster)
132
+ - Dry run: Preview in ~2-3s
133
+
134
+ **What gets published:**
135
+
136
+ ```
137
+ File: /instructions/agents/r1/agents.md
138
+
139
+ Published as:
140
+ Document ID: b0ec4d56-6cc5-5bbd-9868-5d49afa2a7d8 (UUID from path)
141
+ Title: [instructions][agents][r1] agents.md
142
+ Dataset: aia-r1 (from template: aia-{release})
143
+ Tags: ["instructions", "agents", "r1"] (in metadata)
144
+ Domain: instructions (first folder)
145
+ Release: r1 (auto-detected from path)
146
+ Content Hash: abc123... (MD5 of content)
147
+ ```
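The change detection and path-derived document ID described above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: the function names are hypothetical, and the use of UUIDv5 with the URL namespace is an assumption (the sample ID above carries version digit 5, which suggests a name-based UUID, but the README does not state the namespace).

```python
import hashlib
import uuid
from typing import Optional


def content_hash(data: bytes) -> str:
    """Return the MD5 hex digest of a file's bytes."""
    return hashlib.md5(data).hexdigest()


def document_id(relative_path: str) -> str:
    """Derive a stable ID from the file path.

    Assumption: UUIDv5 in the URL namespace; the real namespace is unknown.
    """
    return str(uuid.uuid5(uuid.NAMESPACE_URL, relative_path))


def needs_publish(data: bytes, stored_hash: Optional[str], force: bool = False) -> bool:
    """Publish when forced, when no hash is stored, or when the content changed."""
    return force or stored_hash != content_hash(data)
```

Because the ID depends only on the path and the hash only on the content, republishing an unchanged file can be skipped, and republishing a changed file can update the same document instead of creating a duplicate.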
148
+
149
+ #### Trigger Document Parsing
150
+
151
+ Re-parse documents without re-uploading (useful after changing parser settings):
152
+
153
+ ```bash
154
+ # Parse all unparsed documents
155
+ uvx rosetta-cli@latest parse
156
+
157
+ # Parse specific dataset
158
+ uvx rosetta-cli@latest parse --dataset aia-r1
159
+
160
+ # Force re-parse ALL documents
161
+ uvx rosetta-cli@latest parse --dataset aia-r1 --force
162
+
163
+ # Preview without parsing (dry run)
164
+ uvx rosetta-cli@latest parse --dataset aia-r1 --dry-run
165
+ ```
166
+
167
+ #### List Documents
168
+
169
+ ```bash
170
+ # List documents in default dataset
171
+ uvx rosetta-cli@latest list-dataset
172
+
173
+ # List specific dataset
174
+ uvx rosetta-cli@latest list-dataset --dataset aia-r1
175
+ ```
176
+
177
+ **Output shows:**
178
+ - Document title (with tag prefixes)
179
+ - Document ID, file size, parse status, chunk count
180
+ - Metadata (tags, domain, release, source path)
181
+
182
+ #### Cleanup Dataset
183
+
184
+ ```bash
185
+ # Preview cleanup without deleting
186
+ uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --dry-run
187
+
188
+ # Cleanup documents with specific prefix
189
+ uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --prefix "aqa-phase" --dry-run
190
+
191
+ # Cleanup documents with specific tags (space-separated)
192
+ uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --tags "r1 agents" --dry-run
193
+
194
+ # Cleanup documents with specific tags (comma-separated)
195
+ uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --tags "r1,agents" --dry-run
196
+
197
+ # Force cleanup without confirmation
198
+ uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --force
199
+
200
+ # Force cleanup with prefix
201
+ uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --prefix "aqa-phase" --force
202
+
203
+ # Force cleanup with tags
204
+ uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --tags "r1,agents" --force
205
+ ```
206
+
207
+ ⚠️ **Warning:** Without `--prefix` or `--tags`, this deletes ALL documents. Use `--dry-run` first.
208
+
209
+ **Filtering Options:**
210
+ - `--prefix`: Match documents by title prefix (e.g., `"[instructions]"`)
211
+ - `--tags`: Match documents by metadata tags (e.g., `"r1 agents"` or `"r1,agents"`)
212
+ - Uses OR logic: finds documents with ANY of the specified tags
213
+ - Server-side filtering for efficiency
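The `--tags` semantics above (space- or comma-separated input, OR matching) can be illustrated with a small sketch. The helper names are hypothetical, and the matching is done locally here only to show the logic; per the README, the CLI applies the filter server-side.

```python
import re
from typing import Iterable, List


def parse_tags(raw: str) -> List[str]:
    """Split a --tags value on commas and/or whitespace, dropping empties."""
    return [tag for tag in re.split(r"[,\s]+", raw.strip()) if tag]


def matches_any(doc_tags: Iterable[str], wanted: Iterable[str]) -> bool:
    """OR logic: true when the document carries at least one wanted tag."""
    return bool(set(doc_tags) & set(wanted))
```

With this reading, `"r1 agents"` and `"r1,agents"` select the same documents: anything tagged `r1` or `agents`.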
214
+
215
+ ### Verification Commands
216
+
217
+ #### Verify Connection
218
+
219
+ ```bash
220
+ uvx rosetta-cli@latest verify
221
+
222
+ # Check production environment
223
+ uvx rosetta-cli@latest verify --env production
224
+ ```
225
+
226
+ Checks:
227
+ - API key validity
228
+ - RAGFlow server connectivity
229
+ - System health (database, Redis, document engine)
230
+ - Available datasets
231
+
232
+ ## 🌍 Environment Management
233
+
234
+ ### Configuration Files
235
+
236
+ | File | Environment | Purpose |
237
+ |------|-------------|---------|
238
+ | `env.template` | Template | Create new environments |
239
+ | `.env` | **Active** | Current configuration (gitignored) |
240
+ | `.env.local` | Local | Local RAGFlow development |
241
+ | `.env.remote` | Remote | Production RAGFlow instance |
242
+
243
+ ### Switch Environments
244
+
245
+ **Method 1: Copy file to .env** (recommended)
246
+
247
+ ```bash
248
+ # Switch to local
249
+ cp .env.local .env
250
+
251
+ # Switch to production
252
+ cp .env.remote .env
253
+
254
+ # Check current environment
255
+ grep "ENVIRONMENT=" .env
256
+ ```
257
+
258
+ **Method 2: Use --env flag** (temporary override)
259
+
260
+ ```bash
261
+ uvx rosetta-cli@latest list-dataset --env local
262
+ uvx rosetta-cli@latest publish ../instructions --env production
263
+ ```
264
+
265
+ ### Environment Variables
266
+
267
+ ```bash
268
+ # Required
269
+ RAGFLOW_BASE_URL=http://your-ragflow-instance
270
+ RAGFLOW_API_KEY=ragflow-xxx...
271
+ ENVIRONMENT=local
272
+
273
+ # Dataset Configuration
274
+ RAGFLOW_DATASET_DEFAULT=aia
275
+ RAGFLOW_DATASET_TEMPLATE=aia-{release}
276
+
277
+ # Embedding Model (optional)
278
+ RAGFLOW_EMBEDDING_MODEL=text-embedding-3-small@OpenAI
279
+
280
+ # Chunking Configuration (optional)
281
+ RAGFLOW_CHUNK_METHOD=naive
282
+ RAGFLOW_CHUNK_TOKEN_NUM=512
283
+ RAGFLOW_DELIMITER=\n
284
+ RAGFLOW_AUTO_KEYWORDS=0
285
+ RAGFLOW_AUTO_QUESTIONS=0
286
+ ```
287
+
288
+ ### Creating New Environments
289
+
290
+ ```bash
291
+ cp env.template .env.staging
292
+ nano .env.staging
293
+ uvx rosetta-cli@latest verify --env staging
294
+ ```
295
+
296
+ ## 🏗️ Architecture
297
+
298
+ ### Key Components
299
+
300
+ #### RAGFlowClient (`ragflow_client.py`)
301
+
302
+ Wrapper around ragflow-sdk:
303
+
304
+ ```python
305
+ from rosetta_cli.ragflow_client import RAGFlowClient, DocumentMetadata
306
+
307
+ client = RAGFlowClient(api_key="ragflow-xxx", base_url="http://your-ragflow-instance")
308
+
309
+ # Dataset management
310
+ client.create_dataset(name="aia-r1", description="Release 1")
311
+ client.get_dataset(name="aia-r1")
312
+ client.list_datasets()
313
+
314
+ # Document upload with change detection
315
+ client.upload_document(
316
+ file_path=Path("agents.md"),
317
+ metadata=DocumentMetadata(...),
318
+ dataset_id="dataset-id",
319
+ force=False # Skip if unchanged
320
+ )
321
+
322
+ # Health check
323
+ client.verify_connection()
324
+ client.get_system_health()
325
+ ```
326
+
327
+ #### IMSConfig (`ims_config.py`)
328
+
329
+ Configuration management with smart .env discovery:
330
+
331
+ ```python
332
+ from rosetta_cli.ims_config import IMSConfig
333
+
334
+ # Auto-discover .env (searches cwd, script dir, git root)
335
+ config = IMSConfig.from_env()
336
+
337
+ # Use specific environment
338
+ config = IMSConfig.from_env(environment="production")
339
+
340
+ # Validate configuration
341
+ config.validate()
342
+ ```
343
+
344
+ #### ContentPublisher (`ims_publisher.py`)
345
+
346
+ Publishing logic with metadata extraction:
347
+
348
+ ```python
349
+ from rosetta_cli.ims_publisher import ContentPublisher
350
+
351
+ publisher = ContentPublisher(client, config, workspace_root)
352
+
353
+ results = publisher.publish(
354
+ content_path=Path("../instructions"),
355
+ force=False, # Skip unchanged files
356
+ dry_run=False, # Set True to preview without publishing
356
+ no_parse=False, # Set True to skip parsing after upload
358
+ parse_timeout=300 # Parse timeout in seconds
359
+ )
360
+
361
+ print(f"Published: {results.published_count}")
362
+ print(f"Skipped: {results.skipped_count}")
363
+ print(f"Failed: {results.failed_count}")
364
+ ```
365
+
366
+ **Metadata Extraction:**
367
+
368
+ ```
369
+ File: /instructions/agents/r1/bootstrap.md
370
+
371
+ Extracted:
372
+ Tags: ["instructions", "agents", "r1"]
373
+ Domain: instructions
374
+ Release: r1
375
+ Title: bootstrap.md
376
+ Content Hash: abc123... (MD5)
377
+ Document ID: uuid-from-path
378
+ ```
379
+
380
+ ## 🎯 Tag-in-Title Format
381
+
382
+ ### What is Tag-in-Title?
383
+
384
+ Documents are stored with tags as prefixes for server-side filtering:
385
+
386
+ ```
387
+ Format: [tag1][tag2][tag3] filename.ext
388
+
389
+ Examples:
390
+ [instructions][agents][r1] agents.md
391
+ [business][project] RFP.pdf
392
+ ```
393
+
394
+ ### Why Two Locations?
395
+
396
+ Tags are stored in **both title and metadata**:
397
+
398
+ - **Title:** Fast server-side keyword search
399
+ - **Metadata:** Precise client-side filtering with complex queries
400
+
401
+ ### How Tags are Generated
402
+
403
+ Tags come from **folder structure only**:
404
+
405
+ ```
406
+ File: /instructions/agents/r1/bootstrap.md
407
+ Folders: instructions / agents / r1 / (file)
408
+ Tags: [instructions][agents][r1]
409
+ ```
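The folder-to-tag mapping above can be sketched as a pair of helpers that build a tag-in-title string from a path and split it back apart. The function names are hypothetical and this is only an illustration of the format, not the package's implementation.

```python
import re
from pathlib import PurePosixPath
from typing import List, Tuple

# Leading run of [tag] groups, optional whitespace, then the file name.
_TITLE_RE = re.compile(r"^((?:\[[^\]]+\])*)\s*(.*)$")


def build_title(relative_path: str) -> str:
    """Turn 'a/b/c/file.ext' into '[a][b][c] file.ext' (folders become tags)."""
    path = PurePosixPath(relative_path.lstrip("/"))
    prefix = "".join(f"[{folder}]" for folder in path.parts[:-1])
    return f"{prefix} {path.name}" if prefix else path.name


def split_title(title: str) -> Tuple[List[str], str]:
    """Recover (tags, filename) from a tag-in-title string."""
    match = _TITLE_RE.match(title)
    tags = re.findall(r"\[([^\]]+)\]", match.group(1))
    return tags, match.group(2)
```

Since the tags form a literal prefix of the title, a `--prefix "[instructions][agents]"` filter amounts to a plain string prefix match on titles built this way.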
410
+
411
+ ### Using Tags for Filtering
412
+
413
+ ```bash
414
+ # Delete all instruction documents
415
+ uvx rosetta-cli@latest cleanup-dataset --prefix "[instructions]"
416
+
417
+ # Delete all r1 agent documents
418
+ uvx rosetta-cli@latest cleanup-dataset --prefix "[instructions][agents][r1]"
419
+ ```
420
+
421
+ ## 💻 Usage Examples
422
+
423
+ ### Example 1: First-Time Setup
424
+
425
+ ```bash
426
+ python3 -m venv venv
427
+ venv/bin/pip install -r requirements.txt
428
+ cp rosetta-cli/env.template .env
429
+ nano .env # Add RAGFLOW_BASE_URL and RAGFLOW_API_KEY
430
+ uvx rosetta-cli@latest verify
431
+ uvx rosetta-cli@latest publish instructions
432
+ ```
433
+
434
+ ### Example 2: Daily Publishing Workflow
435
+
436
+ ```bash
437
+ uvx rosetta-cli@latest publish ../instructions --dry-run
438
+ uvx rosetta-cli@latest publish ../instructions
439
+ uvx rosetta-cli@latest list-dataset
440
+ ```
441
+
442
+ ### Example 3: Multi-Environment Publishing
443
+
444
+ ```bash
445
+ # Publish to dev
446
+ uvx rosetta-cli@latest publish ../instructions --env dev
447
+
448
+ # Verify on dev
449
+ uvx rosetta-cli@latest verify --env dev
450
+
451
+ # Publish to production
452
+ uvx rosetta-cli@latest publish ../instructions --env prod
453
+ ```
454
+
455
+ ### Example 4: Cleanup and Republish
456
+
457
+ ```bash
458
+ # Preview deletion
459
+ uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --dry-run
460
+
461
+ # Delete all documents
462
+ uvx rosetta-cli@latest cleanup-dataset --dataset aia-r1 --force
463
+
464
+ # Republish everything
465
+ uvx rosetta-cli@latest publish ../instructions --force
466
+ ```
467
+
468
+ ### Example 5: Programmatic Usage
469
+
470
+ ```python
471
+ from pathlib import Path
472
+ from rosetta_cli.ragflow_client import RAGFlowClient, DocumentMetadata
473
+ from rosetta_cli.ims_config import IMSConfig
474
+ from rosetta_cli.ims_publisher import ContentPublisher
475
+
476
+ config = IMSConfig.from_env()
477
+ client = RAGFlowClient(
478
+ api_key=config.api_key,
479
+ base_url=config.base_url,
480
+ embedding_model=config.embedding_model,
481
+ chunk_method=config.chunk_method,
482
+ parser_config=config.parser_config
483
+ )
484
+
485
+ client.verify_connection()
486
+ publisher = ContentPublisher(client, config, Path("/path/to/workspace"))
487
+
488
+ results = publisher.publish(
489
+ content_path=Path("/path/to/workspace") / "instructions",
490
+ force=False,
491
+ dry_run=False
492
+ )
493
+
494
+ print(f"Published: {results.published_count}, Skipped: {results.skipped_count}")
495
+ ```
496
+
497
+ ## 🔍 Troubleshooting
498
+
499
+ ### Error: "api_key cannot be empty"
500
+
501
+ Set `RAGFLOW_API_KEY` in `.env`:
502
+ ```bash
503
+ nano .env
504
+ # Add: RAGFLOW_API_KEY=ragflow-xxxxxxxxxxxxxxxxxxxx
505
+ ```
506
+
507
+ ### Error: "Invalid API key or expired token"
508
+
509
+ Generate a new API key:
510
+ 1. Log in to RAGFlow
511
+ 2. Profile → API Keys → Generate New Key
512
+ 3. Update `.env` file
513
+
514
+ ### Error: "Connection refused"
515
+
516
+ 1. Check RAGFlow is running: `docker ps | grep ragflow`
517
+ 2. Verify URL: `grep RAGFLOW_BASE_URL .env`
518
+ 3. Test: `curl http://your-ragflow-instance/v1/system/healthz`
519
+
520
+ ### Error: "Module 'ragflow_sdk' not found"
521
+
522
+ ```bash
523
+ venv/bin/pip install -r requirements.txt
524
+ ```
525
+
526
+ ### Error: "No .env file found"
527
+
528
+ ```bash
529
+ cp rosetta-cli/env.template .env
530
+ nano .env
531
+ ```
532
+
533
+ ### Parse Status Shows "FAIL"
534
+
535
+ 1. Check document format (PDF, MD, TXT supported)
536
+ 2. Re-trigger parsing: `uvx rosetta-cli@latest parse --dataset aia-r1 --force`
537
+ 3. Check RAGFlow logs: `docker logs ragflow-server`
538
+
539
+ ### Slow Publishing Performance
540
+
541
+ - Use a faster embedding model: `RAGFLOW_EMBEDDING_MODEL=text-embedding-3-small@OpenAI`
542
+ - Ensure change detection works (don't use `--force`)
543
+ - Reduce chunk size: `RAGFLOW_CHUNK_TOKEN_NUM=256`
544
+
545
+ ### Documents Not Showing Tags
546
+
547
+ Tags should appear in the title in the format `[tag1][tag2]`:
548
+ ```bash
549
+ uvx rosetta-cli@latest list-dataset
550
+ # Output: 1. [instructions][agents][r1] agents.md
551
+ ```
552
+
553
+ ## 🚦 Performance Tips
554
+
555
+ ### 1. Use Change Detection
556
+
557
+ ```bash
558
+ # Good: Only publishes changed files (~77% faster)
559
+ uvx rosetta-cli@latest publish ../instructions
560
+
561
+ # Bad: Republishes everything
562
+ uvx rosetta-cli@latest publish ../instructions --force
563
+ ```
564
+
565
+ ### 2. Use Dry Run to Preview
566
+
567
+ ```bash
568
+ # Preview (fast)
569
+ uvx rosetta-cli@latest publish ../instructions --dry-run
570
+
571
+ # Then publish for real
572
+ uvx rosetta-cli@latest publish ../instructions
573
+ ```
574
+
575
+ ### 3. Optimize Chunking
576
+
577
+ ```bash
578
+ # Faster parsing
579
+ RAGFLOW_CHUNK_TOKEN_NUM=256
580
+
581
+ # Better context
582
+ RAGFLOW_CHUNK_TOKEN_NUM=1024
583
+ ```
584
+
585
+ ### 4. Use Selective Cleanup
586
+
587
+ ```bash
588
+ # Fast: Delete specific documents
589
+ uvx rosetta-cli@latest cleanup-dataset --prefix "[instructions][agents]" --force
590
+
591
+ # Slow: Delete and republish everything
592
+ uvx rosetta-cli@latest cleanup-dataset --force
593
+ uvx rosetta-cli@latest publish ../instructions --force
594
+ ```
595
+
596
+ ### 5. Monitor Parse Status
597
+
598
+ ```bash
599
+ uvx rosetta-cli@latest list-dataset | grep "Parse Status"
600
+ ```
601
+
602
+ ## 📖 Advanced Topics
603
+
604
+ ### Custom Dataset Naming
605
+
606
+ `RAGFLOW_DATASET_TEMPLATE` supports a `{release}` placeholder:
607
+
608
+ ```bash
609
+ RAGFLOW_DATASET_TEMPLATE=aia-{release}
610
+
611
+ # /instructions/r1/file.md → aia-r1
612
+ # /instructions/r2/file.md → aia-r2
613
+ # /instructions/file.md → aia (default)
614
+ ```
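The mapping above can be sketched as a release lookup followed by a template fill. This is a hedged illustration: the helper names are hypothetical, and treating any folder named `r` plus digits as the release is an assumption based on the `r1`/`r2` examples; the README does not spell out the detection rule.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# Assumption: a release folder looks like r1, r2, ...
RELEASE_RE = re.compile(r"^r\d+$")


def detect_release(relative_path: str) -> Optional[str]:
    """Return the first folder that looks like a release, or None."""
    folders = PurePosixPath(relative_path.lstrip("/")).parts[:-1]
    for folder in folders:
        if RELEASE_RE.match(folder):
            return folder
    return None


def dataset_name(relative_path: str,
                 template: str = "aia-{release}",
                 default: str = "aia") -> str:
    """Fill the template when a release is found, else use the default dataset."""
    release = detect_release(relative_path)
    return template.format(release=release) if release else default
```

The `template` and `default` arguments stand in for `RAGFLOW_DATASET_TEMPLATE` and `RAGFLOW_DATASET_DEFAULT`.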
615
+
616
+ ### Supported File Types
617
+
618
+ **Text files** (extracted and chunked):
619
+ - Markdown (`.md`)
620
+ - Plain text (`.txt`)
621
+
622
+ **Binary files** (uploaded for storage):
623
+ - PDF, Excel, Word, PowerPoint
624
+
625
+
626
+ ### Environment File Discovery
627
+
628
+ When a command runs without an explicit configuration, `.env` files are searched in this order:
629
+
630
+ 1. Current directory: `.env.{environment}` or `.env`
631
+ 2. Script directory: `.env.{environment}` or `.env`
632
+ 3. Git root: `.env.{environment}` or `.env`
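The search order above can be sketched as an ordered candidate list that is probed until a file exists. The function names are hypothetical, and trying `.env.{environment}` before `.env` within each directory is an assumption; the README lists both per directory without stating which wins.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def candidate_env_files(dirs: Iterable[Path], environment: Optional[str] = None) -> List[Path]:
    """List candidate files in search order (cwd, script dir, git root)."""
    candidates: List[Path] = []
    for directory in dirs:
        if environment:
            # Assumption: the environment-specific file is tried first.
            candidates.append(Path(directory) / f".env.{environment}")
        candidates.append(Path(directory) / ".env")
    return candidates


def find_env_file(dirs: Iterable[Path], environment: Optional[str] = None) -> Optional[Path]:
    """Return the first candidate that exists, or None when nothing is found."""
    for candidate in candidate_env_files(dirs, environment):
        if candidate.is_file():
            return candidate
    return None
```

A caller would pass the directories in priority order, for example `find_env_file([Path.cwd(), script_dir, git_root], "staging")`, where `script_dir` and `git_root` are resolved by the caller.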
633
+
634
+ ## 📝 Related Documentation
635
+
636
+ - **Complete Setup:** [docs/QUICKSTART.md](../docs/QUICKSTART.md) - Comprehensive setup guide
637
+ - **Architecture:** [docs/CONTEXT.md](../docs/CONTEXT.md) - System architecture
638
+ - **Environment Template:** `env.template` - Configuration options
639
+ - **Requirements:** `requirements.txt` - Python dependencies