rust-crate-pipeline 1.1.1__py3-none-any.whl → 1.2.1__py3-none-any.whl

@@ -1,474 +0,0 @@
- Metadata-Version: 2.4
- Name: rust-crate-pipeline
- Version: 1.1.1
- Summary: A comprehensive system for gathering, enriching, and analyzing metadata for Rust crates using AI-powered insights
- Home-page: https://github.com/DaveTmire85/SigilDERG-Data_Production
- Author: SuperUser666-Sigil
- Author-email: SuperUser666-Sigil <miragemodularframework@gmail.com>
- License-Expression: MIT
- Project-URL: Homepage, https://github.com/DaveTmire85/SigilDERG-Data_Production
- Project-URL: Documentation, https://github.com/DaveTmire85/SigilDERG-Data_Production#readme
- Project-URL: Repository, https://github.com/DaveTmire85/SigilDERG-Data_Production
- Project-URL: Bug Tracker, https://github.com/DaveTmire85/SigilDERG-Data_Production/issues
- Keywords: rust,crates,metadata,ai,analysis,pipeline,dependencies
- Classifier: Development Status :: 4 - Beta
- Classifier: Intended Audience :: Developers
- Classifier: Operating System :: OS Independent
- Classifier: Programming Language :: Python :: 3
- Classifier: Programming Language :: Python :: 3.8
- Classifier: Programming Language :: Python :: 3.9
- Classifier: Programming Language :: Python :: 3.10
- Classifier: Programming Language :: Python :: 3.11
- Classifier: Programming Language :: Python :: 3.12
- Classifier: Topic :: Software Development :: Libraries :: Python Modules
- Classifier: Topic :: Software Development :: Build Tools
- Classifier: Topic :: Software Development :: Quality Assurance
- Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
- Requires-Python: >=3.8
- Description-Content-Type: text/markdown
- License-File: LICENSE
- Requires-Dist: requests>=2.28.0
- Requires-Dist: requests-cache>=1.0.0
- Requires-Dist: beautifulsoup4>=4.11.0
- Requires-Dist: tqdm>=4.64.0
- Requires-Dist: llama-cpp-python>=0.2.0
- Requires-Dist: tiktoken>=0.5.0
- Requires-Dist: psutil>=5.9.0
- Requires-Dist: python-dateutil>=2.8.0
- Provides-Extra: dev
- Requires-Dist: pytest>=7.0.0; extra == "dev"
- Requires-Dist: black>=22.0.0; extra == "dev"
- Requires-Dist: isort>=5.10.0; extra == "dev"
- Provides-Extra: advanced
- Requires-Dist: radon>=6.0.0; extra == "advanced"
- Requires-Dist: rustworkx>=0.13.0; extra == "advanced"
- Dynamic: author
- Dynamic: home-page
- Dynamic: license-file
- Dynamic: requires-python
-
- # Rust Crate Data Processing Pipeline
-
- [![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
- [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
-
- A comprehensive system for gathering, enriching, and analyzing metadata for Rust crates using AI-powered insights and dependency analysis.
-
- ## 🚀 Features
-
- ### 📊 **Comprehensive Data Collection**
- - **Multi-source metadata fetching**: Pulls data from crates.io, GitHub, and lib.rs
- - **Dependency analysis**: Complete dependency graphs and reverse dependency mapping
- - **Code snippet extraction**: Automatically extracts Rust code examples from READMEs
- - **Feature analysis**: Detailed breakdown of crate features and their dependencies
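
The "code snippet extraction" bullet above can be illustrated with a small sketch. This is not the package's own implementation, which the diff does not show; `extract_rust_snippets` and `RUST_FENCE` are hypothetical names, and the sketch assumes snippets are Markdown fences tagged `rust`.

```python
import re
from typing import List

# Build the fence marker programmatically so this example can itself
# live inside a fenced block without confusing Markdown renderers.
FENCE = "`" * 3

# Matches fences tagged `rust`, including variants such as `rust,no_run`.
RUST_FENCE = re.compile(FENCE + r"rust[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def extract_rust_snippets(readme: str) -> List[str]:
    """Return the body of every rust-tagged fenced block in a README string."""
    return [m.group(1).strip() for m in RUST_FENCE.finditer(readme)]


# A tiny made-up README with one Rust block and one TOML block.
sample = (
    "# demo\n\n"
    + FENCE + "rust\nfn main() {}\n" + FENCE + "\n\n"
    + FENCE + "toml\nserde = \"1\"\n" + FENCE + "\n"
)
snippets = extract_rust_snippets(sample)
print(snippets)
```

Only the Rust block is returned; the TOML block is skipped because its fence tag does not start with `rust`.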
-
- ### 🤖 **AI-Powered Enrichment**
- - **Use case classification**: Automatically categorizes crates (Web Framework, ML, Database, etc.)
- - **Feature summarization**: AI-generated explanations of crate features
- - **Factual/counterfactual pairs**: Generates training data for fact verification
- - **Smart content truncation**: Intelligently preserves important README sections
-
- ### 🔍 **Advanced Analysis**
- - **Source code metrics**: Lines of code, complexity analysis, API surface area
- - **Security scanning**: Vulnerability checks and security pattern analysis
- - **Community metrics**: GitHub activity, issue tracking, version adoption
- - **Performance optimization**: Batch processing, caching, and retry logic
-
- ### ⚡ **Production-Ready Features**
- - **Robust error handling**: Graceful degradation and comprehensive logging
- - **Rate limiting**: Respects GitHub API limits with intelligent backoff
- - **Checkpointing**: Automatic progress saving for long-running processes
- - **Configurable processing**: Extensive CLI and config file options
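
The README does not say what "intelligent backoff" means in practice. One common approach is capped exponential backoff, sketched below under that assumption; `backoff_delay` and its `base`, `cap`, and `jitter` parameters are hypothetical and not taken from the package.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  jitter: float = 0.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The delay doubles on each attempt and is capped at `cap`; an optional
    random jitter spreads out retries from parallel workers.
    """
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, jitter)


# With jitter disabled the schedule is deterministic.
delays = [backoff_delay(n) for n in range(8)]
print(delays)
```

A caller would sleep for `backoff_delay(attempt)` after each rate-limited response and give up after a fixed number of attempts.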
-
- ## 📋 Prerequisites
-
- ### Required Dependencies
- ```bash
- pip install requests requests-cache beautifulsoup4 tqdm llama-cpp-python tiktoken psutil python-dateutil
- ```
-
- ### Optional Dependencies
- ```bash
- pip install radon rustworkx # For advanced code analysis
- ```
-
- ### System Requirements
- - **Python 3.8+**
- - **Local LLM Model**: Deepseek Coder or compatible GGUF model
- - **GitHub Token**: For enhanced GitHub API access (optional but recommended)
- - **Disk Space**: ~1GB free space for processing and caching
-
- ## 🛠️ Installation
-
- ### 1. Clone the Repository
- ```bash
- git clone <repository-url>
- cd enrichment-flow2
- ```
-
- ### 2. Install Dependencies
- ```bash
- pip install -r requirements.txt
- ```
-
- ### 3. Download LLM Model
- ```bash
- # Example: Download Deepseek Coder model
- mkdir -p ~/models/deepseek/
- wget https://huggingface.co/TheBloke/deepseek-coder-6.7B-instruct-GGUF/resolve/main/deepseek-coder-6.7b-instruct.Q4_K_M.gguf \
-     -O ~/models/deepseek/deepseek-coder-6.7b-instruct.Q4_K_M.gguf
- ```
-
- ### 4. Set Environment Variables (Optional)
- ```bash
- export GITHUB_TOKEN="your_github_token_here"
- ```
-
- ## 🚀 Quick Start
-
- ### Installation
-
- #### From PyPI (Recommended)
- ```bash
- pip install rust-crate-pipeline
- ```
-
- #### From Source
- ```bash
- git clone https://github.com/DaveTmire85/SigilDERG-Data_Production.git
- cd SigilDERG-Data_Production
- pip install -e .
- ```
-
- #### Development Installation
- ```bash
- git clone https://github.com/DaveTmire85/SigilDERG-Data_Production.git
- cd SigilDERG-Data_Production
- pip install -e ".[dev]"
- ```
-
- ### Basic Usage
- ```bash
- # Run with default settings
- python -m rust_crate_pipeline
-
- # Process only 20 crates for testing
- python -m rust_crate_pipeline --limit 20
-
- # Skip AI processing for faster metadata-only collection
- python -m rust_crate_pipeline --skip-ai --limit 50
- ```
-
- ### Advanced Usage
- ```bash
- # Custom configuration
- python -m rust_crate_pipeline \
-     --limit 100 \
-     --batch-size 5 \
-     --workers 2 \
-     --log-level DEBUG \
-     --output-dir ./results
-
- # Process specific crates
- python -m rust_crate_pipeline \
-     --crate-list serde tokio actix-web reqwest \
-     --output-dir ./specific_crates
-
- # Use custom model and config
- python -m rust_crate_pipeline \
-     --model-path ./my-model.gguf \
-     --config-file ./custom_config.json
- ```
-
- ## 📁 Project Structure
-
- ```
- enrichment-flow2/
- ├── __init__.py              # Package initialization and public API
- ├── __main__.py              # Entry point for python -m execution
- ├── main.py                  # CLI interface and main execution logic
- ├── config.py                # Configuration classes and data models
- ├── pipeline.py              # Main orchestration and workflow management
- ├── ai_processing.py         # LLM integration and AI-powered enrichment
- ├── network.py               # API clients and HTTP request handling
- ├── analysis.py              # Source code, security, and dependency analysis
- └── utils/                   # Utility functions
-     ├── logging_utils.py     # Logging configuration and decorators
-     └── file_utils.py        # File operations and disk management
- ```
-
- ## ⚙️ Configuration
-
- ### Command Line Arguments
-
- | Argument | Type | Default | Description |
- |----------|------|---------|-------------|
- | `--limit` | int | None | Limit number of crates to process |
- | `--batch-size` | int | 10 | Crates processed per batch |
- | `--workers` | int | 4 | Parallel workers for API requests |
- | `--output-dir` | str | auto | Custom output directory |
- | `--model-path` | str | default | Path to LLM model file |
- | `--max-tokens` | int | 256 | Maximum tokens for LLM generation |
- | `--checkpoint-interval` | int | 10 | Save progress every N crates |
- | `--log-level` | str | INFO | Logging verbosity |
- | `--skip-ai` | flag | False | Skip AI enrichment |
- | `--skip-source-analysis` | flag | False | Skip source code analysis |
- | `--crate-list` | list | None | Specific crates to process |
- | `--config-file` | str | None | JSON configuration file |
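
The `--checkpoint-interval` row describes saving progress every N crates. A minimal sketch of that idea follows. It assumes a JSON checkpoint file holding the names processed so far; `process_with_checkpoints` and the file layout are hypothetical, since the package's real checkpoint format is not shown in this diff.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, List


def process_with_checkpoints(crates: Iterable[str], checkpoint_path: Path,
                             interval: int = 10) -> List[str]:
    """Process crates in order, writing a checkpoint every `interval` crates."""
    done: List[str] = []
    for i, name in enumerate(crates, start=1):
        done.append(name)  # placeholder for the real per-crate work
        if i % interval == 0:
            # Overwrite the checkpoint with everything finished so far.
            checkpoint_path.write_text(json.dumps(done))
    return done


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "checkpoint.json"
    processed = process_with_checkpoints(
        ["serde", "tokio", "reqwest", "clap", "rand"], path, interval=2
    )
    saved = json.loads(path.read_text())

print(processed, saved)
```

With five crates and an interval of 2, the last checkpoint covers the first four crates, so a restart would redo at most `interval - 1` crates.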
-
- ### Configuration File Example
- ```json
- {
-     "model_path": "/path/to/your/model.gguf",
-     "batch_size": 5,
-     "n_workers": 2,
-     "max_tokens": 512,
-     "checkpoint_interval": 5,
-     "github_token": "ghp_your_token_here",
-     "cache_ttl": 7200
- }
- ```
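
As a rough illustration of how a JSON file like the one above could be merged over defaults, the sketch below uses a dataclass. `ExampleConfig` and `load_config` are hypothetical stand-ins, not the package's `PipelineConfig`. The field names follow the JSON keys above, most defaults are taken from the CLI table, and the `cache_ttl` default is an assumption.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ExampleConfig:
    """Hypothetical config holder; not the package's PipelineConfig."""
    model_path: Optional[str] = None
    batch_size: int = 10
    n_workers: int = 4
    max_tokens: int = 256
    checkpoint_interval: int = 10
    github_token: Optional[str] = None
    cache_ttl: int = 3600  # assumed default, not stated in the README


def load_config(text: str) -> ExampleConfig:
    """Parse JSON text and overlay it on the defaults, rejecting unknown keys."""
    data = json.loads(text)
    known = {f.name for f in fields(ExampleConfig)}
    unknown = set(data) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return ExampleConfig(**data)


cfg = load_config('{"batch_size": 5, "n_workers": 2, "max_tokens": 512}')
print(cfg)
```

Rejecting unknown keys is a design choice that catches typos such as `batchsize` early; a more lenient loader could log and ignore them instead.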
-
- ## 📊 Output Format
-
- The pipeline generates several output files:
-
- ### 1. **Enriched Metadata** (`enriched_crate_metadata_TIMESTAMP.jsonl`)
- ```json
- {
-     "name": "serde",
-     "version": "1.0.193",
-     "description": "A generic serialization/deserialization framework",
-     "use_case": "Serialization",
-     "score": 8542.3,
-     "feature_summary": "Provides derive macros for automatic serialization...",
-     "factual_counterfactual": "✅ Factual: Serde supports JSON serialization...",
-     "source_analysis": {
-         "file_count": 45,
-         "loc": 12500,
-         "functions": ["serialize", "deserialize", ...],
-         "has_tests": true
-     }
- }
- ```
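
The `.jsonl` extension suggests one JSON object per line, with the pretty-printed record above expanded for readability. Under that assumption, a file of this kind could be read as sketched below. `read_jsonl` is a hypothetical helper, and the sample records reuse only the names and scores from the README's own examples.

```python
import io
import json
from typing import Dict, Iterator, TextIO


def read_jsonl(stream: TextIO) -> Iterator[Dict]:
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# An in-memory stand-in for enriched_crate_metadata_TIMESTAMP.jsonl.
sample = io.StringIO(
    '{"name": "serde", "score": 8542.3}\n'
    '\n'
    '{"name": "tokio", "score": 7234.1}\n'
)
records = list(read_jsonl(sample))
top = max(records, key=lambda r: r["score"])["name"]
print(len(records), top)
```

For a real file, `open(path, encoding="utf-8")` can be passed in place of the `StringIO` object.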
-
- ### 2. **Dependency Analysis** (`dependency_analysis_TIMESTAMP.json`)
- ```json
- {
-     "dependency_graph": {
-         "actix-web": ["tokio", "serde", "futures"],
-         "tokio": ["mio", "parking_lot"]
-     },
-     "reverse_dependencies": {
-         "serde": ["actix-web", "reqwest", "clap"],
-         "tokio": ["actix-web", "reqwest"]
-     },
-     "most_depended": [
-         ["serde", 156],
-         ["tokio", 98]
-     ]
- }
- ```
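
The relationship between `dependency_graph`, `reverse_dependencies`, and `most_depended` can be sketched as a graph inversion followed by a count. This is an illustration of the data shapes only, not the package's `DependencyAnalyzer`; `invert_graph` is a hypothetical name and the three-crate graph is a made-up sample.

```python
from collections import defaultdict
from typing import Dict, List


def invert_graph(graph: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each dependency to the list of crates that depend on it."""
    reverse: Dict[str, List[str]] = defaultdict(list)
    for crate, deps in graph.items():
        for dep in deps:
            reverse[dep].append(crate)
    return dict(reverse)


# Illustrative sample graph (crate -> direct dependencies).
graph = {
    "actix-web": ["tokio", "serde", "futures"],
    "tokio": ["mio", "parking_lot"],
    "reqwest": ["tokio", "serde"],
}
reverse = invert_graph(graph)

# Rank by number of dependents, breaking ties alphabetically.
most_depended = sorted(
    ((name, len(users)) for name, users in reverse.items()),
    key=lambda pair: (-pair[1], pair[0]),
)[:2]
print(reverse["tokio"], most_depended)
```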
-
- ### 3. **Summary Report** (`summary_report_TIMESTAMP.json`)
- ```json
- {
-     "total_crates": 150,
-     "total_time": "1247.32s",
-     "timestamp": "2025-06-18T10:30:00",
-     "most_popular": [
-         {"name": "serde", "score": 8542.3},
-         {"name": "tokio", "score": 7234.1}
-     ]
- }
- ```
-
- ## 🔧 Advanced Features
-
- ### Custom Crate Lists
- Process specific crates by providing a custom list:
- ```bash
- python -m rust_crate_pipeline --crate-list \
-     serde tokio actix-web reqwest clap \
-     --output-dir ./web_framework_analysis
- ```
-
- ### Performance Tuning
- Optimize for your system:
- ```bash
- # High-performance setup (good internet, powerful machine)
- python -m rust_crate_pipeline --batch-size 20 --workers 8
-
- # Conservative setup (limited resources)
- python -m rust_crate_pipeline --batch-size 3 --workers 1
- ```
-
- ### Development Mode
- Quick testing with minimal processing:
- ```bash
- python -m rust_crate_pipeline \
-     --limit 5 \
-     --skip-ai \
-     --skip-source-analysis \
-     --log-level DEBUG
- ```
-
- ## 🏗️ Architecture
-
- ### Core Components
-
- 1. **CrateDataPipeline**: Main orchestration class that coordinates all processing
- 2. **LLMEnricher**: Handles AI-powered enrichment using local LLM models
- 3. **CrateAPIClient**: Manages API interactions with crates.io and fallback sources
- 4. **GitHubBatchClient**: Optimized GitHub API client with rate limiting
- 5. **SourceAnalyzer**: Analyzes source code metrics and complexity
- 6. **SecurityAnalyzer**: Checks for security vulnerabilities and patterns
- 7. **UserBehaviorAnalyzer**: Tracks community engagement and version adoption
- 8. **DependencyAnalyzer**: Builds and analyzes dependency relationships
-
- ### Processing Flow
-
- ```
- 1. Crate Discovery    → 2. Metadata Fetching  → 3. AI Enrichment
-         ↓                       ↓                      ↓
- 4. Source Analysis    → 5. Security Scanning  → 6. Community Analysis
-         ↓                       ↓                      ↓
- 7. Dependency Mapping → 8. Data Aggregation   → 9. Report Generation
- ```
-
- ## 🧪 API Usage
-
- ### Programmatic Usage
- ```python
- from rust_crate_pipeline import CrateDataPipeline, PipelineConfig
-
- # Create custom configuration
- config = PipelineConfig(
-     batch_size=5,
-     max_tokens=512,
-     model_path="/path/to/model.gguf"
- )
-
- # Initialize and run pipeline
- pipeline = CrateDataPipeline(config)
- pipeline.run()
-
- # Or use individual components
- from rust_crate_pipeline import LLMEnricher, SourceAnalyzer
-
- enricher = LLMEnricher(config)
- analyzer = SourceAnalyzer()
- ```
-
- ### Custom Processing
- ```python
- # Process specific crates with custom options
- pipeline = CrateDataPipeline(
-     config,
-     limit=50,
-     crate_list=["serde", "tokio", "actix-web"],
-     skip_ai=False,
-     output_dir="./custom_analysis"
- )
- ```
-
- ## 🐛 Troubleshooting
-
- ### Common Issues
-
- **🔴 Model Loading Errors**
- ```bash
- # Verify model path
- ls -la ~/models/deepseek/deepseek-coder-6.7b-instruct.Q4_K_M.gguf
-
- # Check that llama-cpp-python is installed and importable (this does not load the model)
- python -c "from llama_cpp import Llama; print('llama_cpp import OK')"
- ```
-
- **🔴 API Rate Limiting**
- ```bash
- # Set GitHub token for higher rate limits
- export GITHUB_TOKEN="your_token_here"
-
- # Reduce batch size and workers
- python -m rust_crate_pipeline --batch-size 3 --workers 1
- ```
-
- **🔴 Memory Issues**
- ```bash
- # Reduce token limits and batch size
- python -m rust_crate_pipeline --max-tokens 128 --batch-size 2
- ```
-
- **🔴 Network Timeouts**
- ```bash
- # Enable debug logging to identify issues
- python -m rust_crate_pipeline --log-level DEBUG --limit 10
- ```
-
- ### Performance Optimization
-
- 1. **Use SSD storage** for faster caching and temporary file operations
- 2. **Increase RAM** if processing large batches (recommended: 8GB+)
- 3. **Set GITHUB_TOKEN** for 5000 req/hour instead of 60 req/hour
- 4. **Use appropriate batch sizes** based on your internet connection
- 5. **Monitor disk space** - processing can generate several GB of data
-
- ## 📈 Performance Metrics
-
- ### Typical Processing Times
- - **Metadata only**: ~2-3 seconds per crate
- - **With AI enrichment**: ~15-30 seconds per crate
- - **Full analysis**: ~45-60 seconds per crate
-
- ### Resource Usage
- - **Memory**: 2-4GB during processing
- - **Disk**: 10-50MB per crate (temporary files)
- - **Network**: ~1-5MB per crate (API calls)
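
The per-crate figures above can be turned into a back-of-the-envelope estimate for a whole run. The sketch below assumes crates are processed one after another and that temporary files are not cleaned up during the run, neither of which the README states; `estimate_run` is a hypothetical helper, with defaults taken from the "Full analysis" and "Disk" ranges.

```python
from typing import Dict, Tuple


def estimate_run(n_crates: int,
                 seconds_per_crate: Tuple[float, float] = (45, 60),
                 mb_per_crate: Tuple[float, float] = (10, 50)) -> Dict[str, Tuple[float, float]]:
    """Return (low, high) estimates of wall-clock hours and disk use in GB."""
    lo_s, hi_s = seconds_per_crate
    lo_mb, hi_mb = mb_per_crate
    return {
        "hours": (n_crates * lo_s / 3600, n_crates * hi_s / 3600),
        "disk_gb": (n_crates * lo_mb / 1000, n_crates * hi_mb / 1000),
    }


est = estimate_run(150)
print(est)
```

For 150 crates this gives roughly 1.9 to 2.5 hours and 1.5 to 7.5 GB under the stated assumptions, which matches the README's note that processing can generate several GB of data.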
-
- ## 🤝 Contributing
-
- ### Development Setup
- ```bash
- # Clone repository
- git clone <repository-url>
- cd enrichment-flow2
-
- # Install development dependencies
- pip install -r requirements-dev.txt
-
- # Run tests
- python -m pytest tests/
-
- # Format code
- black . && isort .
- ```
-
- ### Adding New Analysis Features
- 1. Implement new analyzer in `analysis.py`
- 2. Add configuration options to `config.py`
- 3. Integrate with pipeline in `pipeline.py`
- 4. Add CLI arguments in `main.py`
- 5. Update documentation
-
- ## 📄 License
-
- This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
-
- ## 🙏 Acknowledgments
-
- - **Rust Community** for the excellent crates ecosystem
- - **crates.io** for providing comprehensive API access
- - **GitHub** for repository metadata and community data
- - **Deepseek** for the powerful code-focused language model
- - **llama.cpp** team for efficient local inference capabilities
-
- ## 📞 Support
-
- - **Issues**: [GitHub Issues](https://github.com/your-repo/issues)
- - **Discussions**: [GitHub Discussions](https://github.com/your-repo/discussions)
- - **Documentation**: [Wiki](https://github.com/your-repo/wiki)
-
- ---
-
- **Happy crate analyzing! 🦀✨**