splunk-ddss-extractor 0.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,277 @@
1
+ Metadata-Version: 2.4
2
+ Name: splunk-ddss-extractor
3
+ Version: 0.3.0
4
+ Classifier: Development Status :: 4 - Beta
5
+ Classifier: Intended Audience :: Developers
6
+ Classifier: License :: OSI Approved :: MIT License
7
+ Classifier: Programming Language :: Python :: 3
8
+ Classifier: Programming Language :: Python :: 3.10
9
+ Classifier: Programming Language :: Python :: 3.11
10
+ Classifier: Programming Language :: Python :: 3.12
11
+ Classifier: Programming Language :: Python :: 3.13
12
+ Classifier: Programming Language :: Rust
13
+ Classifier: Topic :: System :: Archiving
14
+ Requires-Dist: zstandard>=0.22.0
15
+ Requires-Dist: orjson>=3.9.0
16
+ Requires-Dist: click>=8.1.7 ; extra == 'cli'
17
+ Requires-Dist: pytest>=7.4.3 ; extra == 'dev'
18
+ Requires-Dist: pytest-cov>=4.1.0 ; extra == 'dev'
19
+ Requires-Dist: pyarrow>=14.0.0 ; extra == 'parquet'
20
+ Requires-Dist: boto3>=1.34.0 ; extra == 's3'
21
+ Provides-Extra: cli
22
+ Provides-Extra: dev
23
+ Provides-Extra: parquet
24
+ Provides-Extra: s3
25
+ Summary: Extract events from Splunk journal archives to raw format (JSON, CSV, Parquet)
26
+ Author: Lech Lachowicz
27
+ License: MIT
28
+ Requires-Python: >=3.10
29
+ Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
30
+ Project-URL: Homepage, https://github.com/ponquersohn/splunk_ddss_extractor
31
+ Project-URL: Issues, https://github.com/ponquersohn/splunk_ddss_extractor/issues
32
+ Project-URL: Repository, https://github.com/ponquersohn/splunk_ddss_extractor
33
+
34
+ # Splunk DDSS Extractor
35
+
36
+ Convert Splunk self-hosted storage archives from compressed journal format to raw format.
37
+
38
+ ## Overview
39
+
40
+ Splunk DDSS Extractor is a Python library that processes Splunk journal archives, extracts events, and converts them to raw format for easier analysis and long-term storage. Use it in your own applications, data pipelines, or as a CLI tool.
41
+
42
+ **Note:** This project is based on the concept from [fionera/splunker](https://github.com/fionera/splunker), reimplemented in Python with additional features for production use.
43
+
44
+ ## Features
45
+
46
+ - Automatic compression detection (.zst, .gz, uncompressed)
47
+ - Extract events with full metadata (host, source, sourcetype, timestamps)
48
+ - Multiple output formats (JSON Lines, CSV, Parquet)
49
+ - Streaming processing for memory efficiency
50
+ - Simple Python API and CLI interface
51
+ - Docker support for containerized deployments
52
+ - Integrates with AWS Lambda, ECS, or any Python environment
53
+
54
+ ## Quick Start
55
+
56
+ ### Using the Makefile (Recommended)
57
+
58
+ ```bash
59
+ # Show all available commands
60
+ make env
61
+
62
+ # Complete development setup (venv + dependencies)
63
+ make dev-setup
64
+
65
+ # Run tests
66
+ make test
67
+
68
+ # Build Docker image
69
+ make docker
70
+ ```
71
+
72
+ ### Manual Setup
73
+
74
+ #### Installation
75
+
76
+ ```bash
77
+ # Create virtual environment
78
+ python3 -m venv venv
79
+ source venv/bin/activate
80
+
81
+ # Install dependencies
82
+ pip install -r requirements.txt
83
+ pip install -e .
84
+
85
+ # Optional: Install Parquet support
86
+ pip install pyarrow
87
+ ```
88
+
89
+ #### Basic Usage
90
+
91
+ **Extract a journal file:**
92
+
93
+ ```python
94
+ from splunk_ddss_extractor.extractor import Extractor
95
+
96
+ extractor = Extractor()
97
+
98
+ # Extract to JSON Lines
99
+ extractor.extract(
100
+ input_path='/path/to/journal.zst',
101
+ output_path='output.json',
102
+ output_format='ndjson'
103
+ )
104
+
105
+ # Extract to CSV
106
+ extractor.extract(
107
+ input_path='/path/to/journal.zst',
108
+ output_path='output.csv',
109
+ output_format='csv'
110
+ )
111
+
112
+ # Extract to Parquet
113
+ extractor.extract(
114
+ input_path='/path/to/journal.zst',
115
+ output_path='output.parquet',
116
+ output_format='parquet'
117
+ )
118
+
119
+ # Extract from S3 to local file (streaming, no download)
120
+ extractor.extract(
121
+ input_path='s3://bucket/path/journal.zst',
122
+ output_path='output.json',
123
+ output_format='ndjson'
124
+ )
125
+
126
+ # Extract from local to S3
127
+ extractor.extract(
128
+ input_path='/path/to/journal.zst',
129
+ output_path='s3://bucket/output/data.json',
130
+ output_format='ndjson'
131
+ )
132
+ ```
133
+
134
+ **Low-level streaming (advanced):**
135
+
136
+ ```python
137
+ from splunk_ddss_extractor.decoder import JournalDecoder
138
+ import zstandard as zstd
139
+
140
+ # For low-level access, decoder needs an uncompressed stream
141
+ # If reading a compressed file, decompress it first:
142
+ with open('/path/to/journal.zst', 'rb') as compressed_file:
143
+ dctx = zstd.ZstdDecompressor()
144
+ with dctx.stream_reader(compressed_file) as reader:
145
+ decoder = JournalDecoder(reader=reader)
146
+ while decoder.scan():
147
+ event = decoder.get_event()
148
+ print(f"Host: {decoder.host()}")
149
+ print(f"Source: {decoder.source()}")
150
+ print(f"Sourcetype: {decoder.source_type()}")
151
+ print(f"Timestamp: {event.index_time}")
152
+ print(f"Message: {event.message_string()}")
153
+
154
+ # For uncompressed journal files:
155
+ with open('/path/to/journal', 'rb') as f:
156
+ decoder = JournalDecoder(reader=f)
157
+ while decoder.scan():
158
+ event = decoder.get_event()
159
+ # Process event...
160
+ ```
161
+
162
+ #### Docker Usage
163
+
164
+ ```bash
165
+ # Build image
166
+ make docker
167
+
168
+ # Run with local file
169
+ docker run -v /path/to/data:/data ghcr.io/ponquersohn/splunk_ddss_extractor:latest
170
+
171
+ # Use in your own Dockerfile
172
+ FROM ghcr.io/ponquersohn/splunk_ddss_extractor:latest
173
+ COPY your_script.py /app/
174
+ CMD ["python", "/app/your_script.py"]
175
+ ```
176
+
177
+ ## Architecture
178
+
179
+ This is a **Python library** with the following components:
180
+
181
+ 1. **Journal Decoder** - Low-level decoder for Splunk's binary journal format
182
+ 2. **Extractor Interface** - High-level API for common extraction tasks
183
+ 3. **Output Writers** - Support for JSON, CSV, and Parquet formats
184
+ 4. **Compression Detection** - Automatic detection and handling of .zst, .gz formats
185
+
186
+ **Integration Options:**
187
+ - Direct Python import in your applications
188
+ - AWS Lambda functions for serverless processing
189
+ - ECS/Fargate tasks for batch processing
190
+ - Docker containers for isolated environments
191
+ - Local scripts for one-off extractions
192
+
193
+ See [CLAUDE.md](CLAUDE.md) for detailed architecture documentation.
194
+
195
+ ## Development
196
+
197
+ ### Quick Commands
198
+
199
+ ```bash
200
+ # Run tests
201
+ make test
202
+
203
+ # Run tests with coverage
204
+ make test-coverage
205
+
206
+ # Build Docker image
207
+ make docker
208
+
209
+ # Test Docker locally
210
+ make docker-run
211
+
212
+ # Run all checks (tests)
213
+ make check
214
+
215
+ # Clean temporary files
216
+ make clean
217
+ ```
218
+
219
+ ### Manual Commands
220
+
221
+ ```bash
222
+ # Run tests
223
+ pytest tests/
224
+
225
+ # Code formatting
226
+ black src/ tests/
227
+
228
+ # Local Docker testing
229
+ cd docker
230
+ docker-compose up
231
+ ```
232
+
233
+ ## Configuration
234
+
235
+ When integrating with AWS or other environments, you may use these environment variables:
236
+
237
+ - `OUTPUT_FORMAT`: Output format - json, csv, or parquet (default: json)
238
+ - `LOG_LEVEL`: Logging level (default: INFO)
239
+ - `AWS_REGION`: AWS region for S3 operations (default: us-east-1)
240
+ - `S3_BUCKET`: S3 bucket name (for S3 integrations)
241
+
242
+ ## Output Formats
243
+
244
+ ### JSON Lines (default)
245
+
246
+ ```json
247
+ {"timestamp": 1234567890, "host": "server01", "source": "/var/log/app.log", "sourcetype": "app", "message": "Event data"}
248
+ ```
249
+
250
+ ### CSV
251
+
252
+ ```csv
253
+ timestamp,host,source,sourcetype,message
254
+ 1234567890,server01,/var/log/app.log,app,"Event data"
255
+ ```
256
+
257
+ ### Parquet
258
+
259
+ Columnar format optimized for analytics (requires pyarrow).
260
+
261
+ ## Credits
262
+
263
+ This project is inspired by and based on the concept from [fionera/splunker](https://github.com/fionera/splunker), a Go implementation for extracting Splunk journal files. This Python implementation extends the original concept with:
264
+
265
+ - Streaming S3 support (no temporary files)
266
+ - Multiple output formats (JSON Lines, CSV, Parquet)
267
+ - Python library API for easy integration
268
+ - Docker and AWS deployment options
269
+
270
+ ## License
271
+
272
+ MIT
273
+
274
+ ## Contributing
275
+
276
+ See [CLAUDE.md](CLAUDE.md) for development guidelines.
277
+
@@ -0,0 +1,243 @@
1
+ # Splunk DDSS Extractor
2
+
3
+ Convert Splunk self-hosted storage archives from compressed journal format to raw format.
4
+
5
+ ## Overview
6
+
7
+ Splunk DDSS Extractor is a Python library that processes Splunk journal archives, extracts events, and converts them to raw format for easier analysis and long-term storage. Use it in your own applications, data pipelines, or as a CLI tool.
8
+
9
+ **Note:** This project is based on the concept from [fionera/splunker](https://github.com/fionera/splunker), reimplemented in Python with additional features for production use.
10
+
11
+ ## Features
12
+
13
+ - Automatic compression detection (.zst, .gz, uncompressed)
14
+ - Extract events with full metadata (host, source, sourcetype, timestamps)
15
+ - Multiple output formats (JSON Lines, CSV, Parquet)
16
+ - Streaming processing for memory efficiency
17
+ - Simple Python API and CLI interface
18
+ - Docker support for containerized deployments
19
+ - Integrates with AWS Lambda, ECS, or any Python environment
20
+
21
+ ## Quick Start
22
+
23
+ ### Using the Makefile (Recommended)
24
+
25
+ ```bash
26
+ # Show all available commands
27
+ make env
28
+
29
+ # Complete development setup (venv + dependencies)
30
+ make dev-setup
31
+
32
+ # Run tests
33
+ make test
34
+
35
+ # Build Docker image
36
+ make docker
37
+ ```
38
+
39
+ ### Manual Setup
40
+
41
+ #### Installation
42
+
43
+ ```bash
44
+ # Create virtual environment
45
+ python3 -m venv venv
46
+ source venv/bin/activate
47
+
48
+ # Install dependencies
49
+ pip install -r requirements.txt
50
+ pip install -e .
51
+
52
+ # Optional: Install Parquet support
53
+ pip install pyarrow
54
+ ```
55
+
56
+ #### Basic Usage
57
+
58
+ **Extract a journal file:**
59
+
60
+ ```python
61
+ from splunk_ddss_extractor.extractor import Extractor
62
+
63
+ extractor = Extractor()
64
+
65
+ # Extract to JSON Lines
66
+ extractor.extract(
67
+ input_path='/path/to/journal.zst',
68
+ output_path='output.json',
69
+ output_format='ndjson'
70
+ )
71
+
72
+ # Extract to CSV
73
+ extractor.extract(
74
+ input_path='/path/to/journal.zst',
75
+ output_path='output.csv',
76
+ output_format='csv'
77
+ )
78
+
79
+ # Extract to Parquet
80
+ extractor.extract(
81
+ input_path='/path/to/journal.zst',
82
+ output_path='output.parquet',
83
+ output_format='parquet'
84
+ )
85
+
86
+ # Extract from S3 to local file (streaming, no download)
87
+ extractor.extract(
88
+ input_path='s3://bucket/path/journal.zst',
89
+ output_path='output.json',
90
+ output_format='ndjson'
91
+ )
92
+
93
+ # Extract from local to S3
94
+ extractor.extract(
95
+ input_path='/path/to/journal.zst',
96
+ output_path='s3://bucket/output/data.json',
97
+ output_format='ndjson'
98
+ )
99
+ ```
100
+
101
+ **Low-level streaming (advanced):**
102
+
103
+ ```python
104
+ from splunk_ddss_extractor.decoder import JournalDecoder
105
+ import zstandard as zstd
106
+
107
+ # For low-level access, decoder needs an uncompressed stream
108
+ # If reading a compressed file, decompress it first:
109
+ with open('/path/to/journal.zst', 'rb') as compressed_file:
110
+ dctx = zstd.ZstdDecompressor()
111
+ with dctx.stream_reader(compressed_file) as reader:
112
+ decoder = JournalDecoder(reader=reader)
113
+ while decoder.scan():
114
+ event = decoder.get_event()
115
+ print(f"Host: {decoder.host()}")
116
+ print(f"Source: {decoder.source()}")
117
+ print(f"Sourcetype: {decoder.source_type()}")
118
+ print(f"Timestamp: {event.index_time}")
119
+ print(f"Message: {event.message_string()}")
120
+
121
+ # For uncompressed journal files:
122
+ with open('/path/to/journal', 'rb') as f:
123
+ decoder = JournalDecoder(reader=f)
124
+ while decoder.scan():
125
+ event = decoder.get_event()
126
+ # Process event...
127
+ ```
128
+
129
+ #### Docker Usage
130
+
131
+ ```bash
132
+ # Build image
133
+ make docker
134
+
135
+ # Run with local file
136
+ docker run -v /path/to/data:/data ghcr.io/ponquersohn/splunk_ddss_extractor:latest
137
+
138
+ # Use in your own Dockerfile
139
+ FROM ghcr.io/ponquersohn/splunk_ddss_extractor:latest
140
+ COPY your_script.py /app/
141
+ CMD ["python", "/app/your_script.py"]
142
+ ```
143
+
144
+ ## Architecture
145
+
146
+ This is a **Python library** with the following components:
147
+
148
+ 1. **Journal Decoder** - Low-level decoder for Splunk's binary journal format
149
+ 2. **Extractor Interface** - High-level API for common extraction tasks
150
+ 3. **Output Writers** - Support for JSON, CSV, and Parquet formats
151
+ 4. **Compression Detection** - Automatic detection and handling of .zst, .gz formats
152
+
153
+ **Integration Options:**
154
+ - Direct Python import in your applications
155
+ - AWS Lambda functions for serverless processing
156
+ - ECS/Fargate tasks for batch processing
157
+ - Docker containers for isolated environments
158
+ - Local scripts for one-off extractions
159
+
160
+ See [CLAUDE.md](CLAUDE.md) for detailed architecture documentation.
161
+
162
+ ## Development
163
+
164
+ ### Quick Commands
165
+
166
+ ```bash
167
+ # Run tests
168
+ make test
169
+
170
+ # Run tests with coverage
171
+ make test-coverage
172
+
173
+ # Build Docker image
174
+ make docker
175
+
176
+ # Test Docker locally
177
+ make docker-run
178
+
179
+ # Run all checks (tests)
180
+ make check
181
+
182
+ # Clean temporary files
183
+ make clean
184
+ ```
185
+
186
+ ### Manual Commands
187
+
188
+ ```bash
189
+ # Run tests
190
+ pytest tests/
191
+
192
+ # Code formatting
193
+ black src/ tests/
194
+
195
+ # Local Docker testing
196
+ cd docker
197
+ docker-compose up
198
+ ```
199
+
200
+ ## Configuration
201
+
202
+ When integrating with AWS or other environments, you may use these environment variables:
203
+
204
+ - `OUTPUT_FORMAT`: Output format - json, csv, or parquet (default: json)
205
+ - `LOG_LEVEL`: Logging level (default: INFO)
206
+ - `AWS_REGION`: AWS region for S3 operations (default: us-east-1)
207
+ - `S3_BUCKET`: S3 bucket name (for S3 integrations)
208
+
209
+ ## Output Formats
210
+
211
+ ### JSON Lines (default)
212
+
213
+ ```json
214
+ {"timestamp": 1234567890, "host": "server01", "source": "/var/log/app.log", "sourcetype": "app", "message": "Event data"}
215
+ ```
216
+
217
+ ### CSV
218
+
219
+ ```csv
220
+ timestamp,host,source,sourcetype,message
221
+ 1234567890,server01,/var/log/app.log,app,"Event data"
222
+ ```
223
+
224
+ ### Parquet
225
+
226
+ Columnar format optimized for analytics (requires pyarrow).
227
+
228
+ ## Credits
229
+
230
+ This project is inspired by and based on the concept from [fionera/splunker](https://github.com/fionera/splunker), a Go implementation for extracting Splunk journal files. This Python implementation extends the original concept with:
231
+
232
+ - Streaming S3 support (no temporary files)
233
+ - Multiple output formats (JSON Lines, CSV, Parquet)
234
+ - Python library API for easy integration
235
+ - Docker and AWS deployment options
236
+
237
+ ## License
238
+
239
+ MIT
240
+
241
+ ## Contributing
242
+
243
+ See [CLAUDE.md](CLAUDE.md) for development guidelines.
@@ -0,0 +1,47 @@
1
+ [build-system]
2
+ requires = ["maturin>=1.0,<2.0"]
3
+ build-backend = "maturin"
4
+
5
+ [project]
6
+ name = "splunk-ddss-extractor"
7
+ version = "0.3.0"
8
+ description = "Extract events from Splunk journal archives to raw format (JSON, CSV, Parquet)"
9
+ readme = "README.md"
10
+ license = {text = "MIT"}
11
+ requires-python = ">=3.10"
12
+ authors = [
13
+ {name = "Lech Lachowicz"},
14
+ ]
15
+ classifiers = [
16
+ "Development Status :: 4 - Beta",
17
+ "Intended Audience :: Developers",
18
+ "License :: OSI Approved :: MIT License",
19
+ "Programming Language :: Python :: 3",
20
+ "Programming Language :: Python :: 3.10",
21
+ "Programming Language :: Python :: 3.11",
22
+ "Programming Language :: Python :: 3.12",
23
+ "Programming Language :: Python :: 3.13",
24
+ "Programming Language :: Rust",
25
+ "Topic :: System :: Archiving",
26
+ ]
27
+ dependencies = ["zstandard>=0.22.0", "orjson>=3.9.0"]
28
+
29
+ [project.optional-dependencies]
30
+ s3 = ["boto3>=1.34.0"]
31
+ cli = ["click>=8.1.7"]
32
+ parquet = ["pyarrow>=14.0.0"]
33
+ dev = ["pytest>=7.4.3", "pytest-cov>=4.1.0"]
34
+
35
+ [project.urls]
36
+ Homepage = "https://github.com/ponquersohn/splunk_ddss_extractor"
37
+ Repository = "https://github.com/ponquersohn/splunk_ddss_extractor"
38
+ Issues = "https://github.com/ponquersohn/splunk_ddss_extractor/issues"
39
+
40
+ [project.scripts]
41
+ splunk-extract = "splunk_ddss_extractor.main:main"
42
+
43
+ [tool.maturin]
44
+ manifest-path = "rust/Cargo.toml"
45
+ module-name = "splunk_ddss_extractor._native"
46
+ python-source = "src"
47
+ sdist-include = ["rust/Cargo.toml", "rust/src/**"]