rustream-0.1.0-py3-none-win_amd64.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Binary file

@@ -0,0 +1,201 @@
Metadata-Version: 2.4
Name: rustream
Version: 0.1.0
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Rust
Classifier: Topic :: Database
Summary: Fast Postgres → Parquet sync tool
Keywords: postgres,parquet,s3,sync,etl
License: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Repository, https://github.com/kraftaa/rustream

# rustream

Fast Postgres to Parquet sync tool. Reads tables from Postgres, writes Parquet files to local disk or S3. Supports incremental sync via `updated_at` watermark tracking.

## Installation

### From PyPI

```bash
pipx install rustream
# or
pip install rustream
```

### From source

```bash
git clone https://github.com/kraftaa/rustream.git
cd rustream
cargo build --release
# binary is at target/release/rustream
```

### With maturin (local dev)

```bash
pip install maturin
maturin develop --release
# now `rustream` is on your PATH
```

## Usage

```bash
# Copy and edit the example config
cp config.example.yaml config.yaml

# Preview what will be synced (no files written)
rustream sync --config config.yaml --dry-run

# Run sync
rustream sync --config config.yaml
```

Enable debug logging with `RUST_LOG`:

```bash
RUST_LOG=rustream=debug rustream sync --config config.yaml
```

## Configuration

### Specific tables (recommended)

```yaml
postgres:
  host: localhost
  database: mydb
  user: postgres
  password: secret

output:
  type: local
  path: ./output

tables:
  - name: users
    incremental_column: updated_at
    columns:          # optional: pick specific columns
      - id
      - email
      - created_at
      - updated_at

  - name: orders
    incremental_column: updated_at

  - name: products    # no incremental_column = full sync every run
```

### All tables (auto-discover)

Omit `tables` to sync every table in the schema. Use `exclude` to skip some:

```yaml
postgres:
  host: localhost
  database: mydb
  user: postgres

output:
  type: local
  path: ./output

# schema: public   # default
exclude:
  - schema_migrations
  - ar_internal_metadata
```

### S3 output

```yaml
output:
  type: s3
  bucket: my-data-lake
  prefix: raw/postgres
  region: us-east-1
```

AWS credentials come from environment variables, `~/.aws/credentials`, or an IAM role.

### Config reference

| Field | Description |
|---|---|
| `postgres.host` | Postgres host |
| `postgres.port` | Postgres port (default: 5432) |
| `postgres.database` | Database name |
| `postgres.user` | Database user |
| `postgres.password` | Database password (optional) |
| `output.type` | `local` or `s3` |
| `output.path` | Local directory for Parquet files (when type=local) |
| `output.bucket` | S3 bucket (when type=s3) |
| `output.prefix` | S3 key prefix (when type=s3) |
| `output.region` | AWS region (when type=s3, optional) |
| `batch_size` | Rows per Parquet file (default: 10000) |
| `state_dir` | Directory for SQLite watermark state (default: `.rustream_state`) |
| `schema` | Schema to discover tables from (default: `public`) |
| `exclude` | List of table names to skip when using auto-discovery |
| `tables[].name` | Table name |
| `tables[].schema` | Schema name (default: `public`) |
| `tables[].columns` | Columns to sync (default: all) |
| `tables[].incremental_column` | Column for watermark-based incremental sync |
| `tables[].partition_by` | Partition output files: `date`, `month`, or `year` |
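
The tuning fields compose with any of the configs above. For instance (table name and values illustrative, not defaults):

```yaml
batch_size: 50000             # rows per Parquet file
state_dir: ./.rustream_state  # where the SQLite watermark state lives

tables:
  - name: events
    incremental_column: updated_at
    partition_by: date        # one output partition per day
```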

## How it works

1. Connects to Postgres and introspects each table's schema via `information_schema`
2. Maps Postgres column types to Arrow types automatically
3. Reads rows in batches, converting to Arrow RecordBatches
4. Writes each batch as a Snappy-compressed Parquet file
5. Tracks the high watermark (max value of `incremental_column`) in local SQLite
6. On next run, only reads rows where `incremental_column > last_watermark`

Tables without `incremental_column` do a full sync every run.
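
Steps 5–6 boil down to appending a `WHERE` clause whenever a stored watermark exists. A minimal sketch in Rust (function and parameter names are illustrative, not rustream's internal API):

```rust
// Build the SELECT for one table. Incremental tables with a stored
// watermark read only new rows; everything else gets a full scan.
fn build_query(
    table: &str,
    incremental_column: Option<&str>,
    last_watermark: Option<&str>,
) -> String {
    let base = format!("SELECT * FROM {table}");
    match (incremental_column, last_watermark) {
        // Incremental table with a stored watermark: only rows past it.
        (Some(col), Some(mark)) => format!("{base} WHERE {col} > '{mark}'"),
        // First run of an incremental table, or a full-sync table.
        _ => base,
    }
}

fn main() {
    // First run: no watermark stored yet, so a full scan.
    println!("{}", build_query("users", Some("updated_at"), None));
    // Subsequent runs: only rows newer than the stored high watermark.
    println!(
        "{}",
        build_query("users", Some("updated_at"), Some("2024-01-01T00:00:00"))
    );
}
```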

## Supported Postgres types

| Postgres | Arrow |
|---|---|
| `boolean` | Boolean |
| `smallint` | Int16 |
| `integer`, `serial` | Int32 |
| `bigint`, `bigserial` | Int64 |
| `real` | Float32 |
| `double precision` | Float64 |
| `numeric` / `decimal` | Utf8 (preserves precision) |
| `text`, `varchar`, `char` | Utf8 |
| `bytea` | Binary |
| `date` | Date32 |
| `timestamp` | Timestamp(Microsecond) |
| `timestamptz` | Timestamp(Microsecond, UTC) |
| `uuid` | Utf8 |
| `json`, `jsonb` | Utf8 |
| arrays | Utf8 (JSON serialized) |
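
The mapping above can be sketched as a single match. Arrow types are plain strings here to keep the example dependency-free; the real implementation would target arrow's `DataType` enum:

```rust
// Map a Postgres type name to its Arrow target, per the table above.
fn pg_to_arrow(pg_type: &str) -> &'static str {
    match pg_type {
        "boolean" => "Boolean",
        "smallint" => "Int16",
        "integer" | "serial" => "Int32",
        "bigint" | "bigserial" => "Int64",
        "real" => "Float32",
        "double precision" => "Float64",
        "bytea" => "Binary",
        "date" => "Date32",
        "timestamp" => "Timestamp(Microsecond)",
        "timestamptz" => "Timestamp(Microsecond, UTC)",
        // numeric/decimal stay as text so precision is preserved;
        // uuid, json/jsonb, text-likes, and arrays also land in Utf8.
        _ => "Utf8",
    }
}

fn main() {
    println!("{}", pg_to_arrow("bigserial")); // Int64
    println!("{}", pg_to_arrow("numeric"));   // Utf8
}
```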

## Publishing

The project uses [maturin](https://github.com/PyO3/maturin) to package the Rust binary as a Python wheel (same approach as ruff, uv, etc.). The CI workflow in `.github/workflows/release.yml` builds wheels for Linux, macOS, and Windows, then publishes to PyPI on tagged releases.
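
A minimal `pyproject.toml` for this bin-packaging setup might look like the following. This is a sketch based on maturin's documented `bin` bindings, not necessarily the project's exact file:

```toml
[build-system]
requires = ["maturin>=1.0"]
build-backend = "maturin"

[project]
name = "rustream"
requires-python = ">=3.8"

[tool.maturin]
# Ship the compiled Rust binary as a console script
# instead of building a Python extension module.
bindings = "bin"
```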

To publish manually:

```bash
# Build wheels for current platform
maturin build --release

# Upload to PyPI (needs PYPI_API_TOKEN)
maturin publish
```

## License

MIT

@@ -0,0 +1,4 @@
rustream-0.1.0.data\scripts\rustream.exe,sha256=Zj23eK-y1VvsCK7WpEY6Xzsu28dUop8sZvXdGCH26fY,29388288
rustream-0.1.0.dist-info\METADATA,sha256=9pcWqF5bPtejATiztJ8NzmxIN6Aw1rxcNsPRZbQ8mDo,5242
rustream-0.1.0.dist-info\WHEEL,sha256=jsSEiVNsW1dJj5gDaReR40i7mhgBjWtms6nAD6EViXU,94
rustream-0.1.0.dist-info\RECORD,,