RubyGems - ragnar-cli - Versions diffs - 0.1.0.pre.2 → 0.1.0.pre.4 - Mend

ragnar-cli 0.1.0.pre.2 → 0.1.0.pre.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (20)

checksums.yaml +4 -4
data/README.md +187 -36
data/lib/ragnar/cli.rb +543 -172
data/lib/ragnar/cli_visualization.rb +184 -0
data/lib/ragnar/config.rb +226 -0
data/lib/ragnar/database.rb +94 -8
data/lib/ragnar/llm_manager.rb +4 -1
data/lib/ragnar/query_processor.rb +38 -20
data/lib/ragnar/topic_modeling.rb +13 -10
data/lib/ragnar/umap_processor.rb +190 -73
data/lib/ragnar/umap_transform_service.rb +169 -88
data/lib/ragnar/version.rb +1 -1
metadata +43 -22
data/lib/ragnar/topic_modeling/engine.rb +0 -221
data/lib/ragnar/topic_modeling/labeling_strategies.rb +0 -300
data/lib/ragnar/topic_modeling/llm_adapter.rb +0 -131
data/lib/ragnar/topic_modeling/metrics.rb +0 -186
data/lib/ragnar/topic_modeling/term_extractor.rb +0 -170
data/lib/ragnar/topic_modeling/topic.rb +0 -117
data/lib/ragnar/topic_modeling/topic_labeler.rb +0 -61

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: b0a173feabe7b256be1fd768e9dc8cf756074db612682815616ff0032693b850
-  data.tar.gz: 1be44293e519e221e72b9c497ffc96f2be3bf68e416ecb0f32213c7965192c84
+  metadata.gz: e0837b5d907d7b5336d938a4fa94c53b4dacdb96b56ffc753144dfaa4f476133
+  data.tar.gz: 7cd9d94241f8dc38a7dd4b3b2732966f21f9783b818087b30db3f03b5e6c2dfd
 SHA512:
-  metadata.gz: f5ff7022d1764633f781f0a9f8ead81a9efbbf179815ef9232ab18ff559f99f204b6b3b020a56187988c54ed600285c116e82392fa59771805fc4418e219d492
-  data.tar.gz: 2e1917dc5f2da3c602864a802d3297f47d638ceccd452c05a45e4091be845c2b74b83e9bbc27b10b7d0cd81c1d9d6c5149cd9a92e13dd78d2dddb8f24fb45439
+  metadata.gz: 2a87f654f8502b292d3bfbea31c5f6bb5ba6f02638cd024e8efd623ec88c69f528c59ca4f2604437df9169ec4ad11ffcf4f6223441f9787b8d94ab312c149192
+  data.tar.gz: 133364ec6142c14ded8c58c7041aab342b290e9a196da54c9164f3dfc83d466410d173cb35aa5a5988b201dac251f34e441346f2debfcc0d8bb3e95e24476d1c
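
The digests in `checksums.yaml` cover the two archives packed inside the `.gem` file, `metadata.gz` and `data.tar.gz`. As a rough sketch (not part of the package), they could be recomputed with Ruby's standard `digest` library. The helper names `archive_digests` and `verify_checksums` are hypothetical, and the sketch assumes the `.gem` has already been unpacked so both archives sit in one directory:

```ruby
require "digest"

# Hypothetical helper: compute the two digests that checksums.yaml
# records for a single archive file.
def archive_digests(path)
  {
    "SHA256" => Digest::SHA256.file(path).hexdigest,
    "SHA512" => Digest::SHA512.file(path).hexdigest
  }
end

# Hypothetical helper: `expected` mirrors the layout of checksums.yaml,
# e.g. { "SHA256" => { "data.tar.gz" => "7cd9d942..." } }.
# Returns true only if every listed digest matches the file on disk.
def verify_checksums(dir, expected)
  expected.all? do |algorithm, files|
    files.all? do |name, hex|
      archive_digests(File.join(dir, name))[algorithm] == hex
    end
  end
end
```

Passing the parsed contents of `checksums.yaml` as `expected` would check all four values at once. A mismatch on any one file makes the whole check return `false`.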

data/README.md CHANGED

@@ -124,14 +124,14 @@ flowchart TB
 ### As a Gem
 ```bash
-gem install ragnar
+gem install ragnar-cli
 ```
 ### From Source
 ```bash
-git clone https://github.com/yourusername/ragnar.git
-cd ragnar
+git clone https://github.com/scientist-labs/ragnar-cli.git
+cd ragnar-cli
 bundle install
 gem build ragnar.gemspec
 gem install ./ragnar-*.gem
@@ -165,7 +165,36 @@ ragnar train-umap \
 ragnar apply-umap
 ```
-### 3. Query the System
+### 3. Extract Topics
+Perform topic modeling to discover themes in your indexed documents:
+```bash
+# Basic topic extraction (requires minimum 20-30 indexed documents)
+ragnar topics
+# Adjust clustering parameters for smaller datasets
+ragnar topics --min-cluster-size 3  # Allow smaller topics
+ragnar topics --min-samples 2       # Less strict density requirements
+# Export visualizations
+ragnar topics --export html  # Interactive D3.js visualization
+ragnar topics --export json  # JSON data for further processing
+# Verbose mode for debugging
+ragnar topics --verbose
+```
+**Note**: Topic modeling requires sufficient documents to identify meaningful patterns. For best results:
+- Index at least 20-30 documents (ideally 50+)
+- Ensure documents cover diverse topics
+- Documents should be substantial (50+ words each)
+The HTML export includes:
+- **Topic Bubbles**: Interactive bubble chart showing topic sizes and coherence
+- **Embedding Scatter Plot**: Visualization of all documents in embedding space, colored by cluster
+### 4. Query the System
 ```bash
 # Basic query
@@ -197,7 +226,7 @@ When using `--verbose` or `-v`, you'll see:
 6. **Response Generation**: The final LLM prompt and response
 7. **Final Results**: Confidence score and source attribution
-### 4. Check Statistics
+### 5. Check Statistics
 ```bash
 ragnar stats
@@ -231,30 +260,111 @@ ragnar stats
 ## Configuration
-### Default Settings
+Ragnar uses a flexible YAML-based configuration system that allows you to customize all aspects of the RAG pipeline.
-```ruby
-DEFAULT_DB_PATH = "ragnar_database"
-DEFAULT_CHUNK_SIZE = 512
-DEFAULT_CHUNK_OVERLAP = 50
-DEFAULT_EMBEDDING_MODEL = "jinaai/jina-embeddings-v2-base-en"
+### Configuration File
+Ragnar looks for configuration files in the following order:
+1. `.ragnar.yml` in the current directory
+2. `.ragnarrc.yml` in the current directory
+3. `ragnar.yml` in the current directory
+4. `.ragnar.yml` in your home directory
+5. Built-in defaults
+Generate a configuration file:
+```bash
+# Create local config (in current directory)
+ragnar init-config
+# Create global config (in home directory)
+ragnar init-config --global
+# Force overwrite existing config
+ragnar init-config --force
+```
+### Configuration Options
+Example `.ragnar.yml` file:
+```yaml
+# Storage paths (all support ~ expansion)
+storage:
+  database_path: "~/.cache/ragnar/database"    # Vector database location
+  models_dir: "~/.cache/ragnar/models"         # Downloaded model files
+  history_file: "~/.cache/ragnar/history"      # Interactive mode history
+# Embedding configuration
+embeddings:
+  model: jinaai/jina-embeddings-v2-base-en    # Embedding model to use
+  chunk_size: 512                              # Tokens per chunk
+  chunk_overlap: 50                            # Token overlap between chunks
+# UMAP dimensionality reduction
+umap:
+  reduced_dimensions: 64                       # Target dimensions (2-100)
+  n_neighbors: 15                              # UMAP neighbors parameter
+  min_dist: 0.1                                # UMAP minimum distance
+  model_filename: umap_model.bin              # Saved model filename
+# LLM configuration
+llm:
+  default_model: TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF
+  default_gguf_file: tinyllama-1.1b-chat-v1.0.q4_k_m.gguf
+# Query processing
+query:
+  top_k: 3                      # Number of documents to retrieve
+  enable_query_rewriting: true  # Use LLM to improve queries
+# Interactive mode
+interactive:
+  prompt: 'ragnar> '            # Command prompt
+  quiet_mode: true              # Suppress verbose output
+# Output settings
+output:
+  show_progress: true           # Show progress bars during indexing
 ```
+### Viewing Configuration
+Check current configuration:
+```bash
+# Show all configuration settings
+ragnar config
+# Show LLM model information
+ragnar model
+```
+In interactive mode:
+```bash
+ragnar interactive
+ragnar> config    # Show configuration
+ragnar> model     # Show model details
+```
+### Environment Variables
+Configuration values can be overridden with environment variables:
+- `XDG_CACHE_HOME` - Override default cache directory (~/.cache)
 ### Supported Models
 **Embedding Models** (via red-candle):
-- jinaai/jina-embeddings-v2-base-en
-- BAAI/bge-base-en-v1.5
-- sentence-transformers/all-MiniLM-L6-v2
+- `jinaai/jina-embeddings-v2-base-en` (default, 768 dimensions)
+- `BAAI/bge-base-en-v1.5`
+- `sentence-transformers/all-MiniLM-L6-v2`
-**LLM Models** (via red-candle):
-- Qwen/Qwen2.5-1.5B-Instruct
-- microsoft/phi-2
-- TinyLlama/TinyLlama-1.1B-Chat-v1.0
+**LLM Models** (via red-candle, GGUF format):
+- `TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF` (default, fast)
+- `TheBloke/Qwen2.5-1.5B-Instruct-GGUF`
+- `TheBloke/phi-2-GGUF`
 **Reranker Models** (via red-candle):
-- BAAI/bge-reranker-base
-- cross-encoder/ms-marco-MiniLM-L-6-v2
+- `BAAI/bge-reranker-base`
+- `cross-encoder/ms-marco-MiniLM-L-6-v2`
 ## Advanced Usage
@@ -284,6 +394,60 @@ puts result[:answer]
 puts "Confidence: #{result[:confidence]}%"
 ```
+### Topic Modeling
+Extract topics from your indexed documents:
+```ruby
+# Example with sufficient documents for clustering (minimum ~20-30 needed)
+documents = [
+  # Finance cluster
+  "Federal Reserve raises interest rates to combat inflation",
+  "Stock markets rally on positive earnings reports",
+  "Cryptocurrency markets show increased volatility",
+  "Corporate bonds yield higher returns this quarter",
+  "Central banks coordinate global monetary policy",
+  # Technology cluster
+  "AI breakthrough in natural language processing announced",
+  "Machine learning transforms healthcare diagnostics",
+  "Cloud computing adoption accelerates in enterprises",
+  "Quantum computing reaches new error correction milestone",
+  "Open source frameworks receive major updates",
+  # Healthcare cluster
+  "Clinical trials show promise for cancer immunotherapy",
+  "Telemedicine reshapes patient care delivery models",
+  "Gene editing advances treatment for rare diseases",
+  "Mental health awareness campaigns gain momentum",
+  "mRNA vaccine technology platform expands",
+  # Add more documents for better clustering...
+  # See TOPIC_MODELING_EXAMPLE.md for complete example
+]
+# Extract topics using Topical
+database = Ragnar::Database.new("ragnar_database")
+docs = database.get_all_documents_with_embeddings
+embeddings = docs.map { |d| d[:embedding] }
+texts = docs.map { |d| d[:chunk_text] }
+topics = Topical.extract(
+  embeddings: embeddings,
+  documents: texts,
+  min_topic_size: 3  # Minimum docs per topic
+)
+topics.each do |topic|
+  puts "Topic: #{topic.label}"
+  puts "Terms: #{topic.terms.join(', ')}"
+  puts "Size: #{topic.size} documents\n\n"
+end
+```
+For a complete working example with 40+ documents, see [TOPIC_MODELING_EXAMPLE.md](TOPIC_MODELING_EXAMPLE.md).
 ### Custom Chunking Strategies
 ```ruby
@@ -420,20 +584,7 @@ MIT License - see LICENSE file for details
 This project integrates several excellent Ruby gems:
 - [red-candle](https://github.com/assaydepot/red-candle) - Ruby ML/LLM toolkit
-- [lancelot](https://github.com/cpetersen/lancelot) - Lance database bindings
-- [clusterkit](https://github.com/cpetersen/clusterkit) - UMAP and clustering implementation
-- [parsekit](https://github.com/cpetersen/parsekit) - Content extraction
+- [lancelot](https://github.com/scientist-labs/lancelot) - Lance database bindings
+- [clusterkit](https://github.com/scientist-labs/clusterkit) - UMAP and clustering implementation
+- [parsekit](https://github.com/scientist-labs/parsekit) - Content extraction
 - [baran](https://github.com/moeki0/baran) - Text splitting utilities
-## Roadmap
-- [ ] Add support for PDF and HTML documents
-- [ ] Implement incremental indexing
-- [ ] Add conversation memory for multi-turn queries
-- [ ] Support for hybrid search (vector + keyword)
-- [ ] Web UI for interactive queries
-- [ ] Docker containerization
-- [ ] Performance benchmarking suite
-- [ ] Support for multiple embedding models simultaneously
-- [ ] Query result caching
-- [ ] Automatic index optimization
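
The updated README describes a lookup order for configuration files: `.ragnar.yml`, `.ragnarrc.yml`, and `ragnar.yml` in the current directory, then `.ragnar.yml` in the home directory, then built-in defaults. A minimal sketch of that search in Ruby follows. It is not the code in `data/lib/ragnar/config.rb`, which this part of the diff does not show. The names `find_config_file` and `load_config` are hypothetical, and only a small subset of the README's default values is included:

```ruby
require "yaml"

# Candidate file names in the current directory, in the README's order.
LOCAL_CANDIDATES = [".ragnar.yml", ".ragnarrc.yml", "ragnar.yml"].freeze

# Illustrative subset of the defaults quoted in the README example.
DEFAULTS = {
  "embeddings" => { "chunk_size" => 512, "chunk_overlap" => 50 },
  "query" => { "top_k" => 3 }
}.freeze

# Hypothetical helper: return the first config path that exists, or nil.
def find_config_file(cwd: Dir.pwd, home: Dir.home)
  candidates = LOCAL_CANDIDATES.map { |name| File.join(cwd, name) }
  candidates << File.join(home, ".ragnar.yml")
  candidates.find { |path| File.file?(path) }
end

# Hypothetical helper: overlay the file's settings on the defaults,
# merging one level deep so unspecified keys keep their default values.
def load_config(cwd: Dir.pwd, home: Dir.home)
  path = find_config_file(cwd: cwd, home: home)
  return DEFAULTS unless path

  loaded = YAML.safe_load(File.read(path)) || {}
  DEFAULTS.merge(loaded) do |_key, default, override|
    default.is_a?(Hash) && override.is_a?(Hash) ? default.merge(override) : override
  end
end
```

In this sketch the first file found wins outright. Settings from a home-directory file are not combined with a local one. Whether the gem itself layers multiple files is not stated in the README.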