owlbear 0.2.0__tar.gz

@@ -0,0 +1,9 @@
1
+ {
2
+ "permissions": {
3
+ "allow": [
4
+ "WebFetch(domain:github.com)",
5
+ "WebFetch(domain:boto3.amazonaws.com)",
6
+ "WebFetch(domain:arrow.apache.org)"
7
+ ]
8
+ }
9
+ }
@@ -0,0 +1,31 @@
1
+ # These are some examples of commonly ignored file patterns.
2
+ # You should customize this list as applicable to your project.
3
+ # Learn more about .gitignore:
4
+ # https://www.atlassian.com/git/tutorials/saving-changes/gitignore
5
+
6
+ # Python specific
7
+ venv/
8
+ __pycache__/
9
+ *.py[cod]
10
+ *.pyc
11
+ .pytest_cache/
12
+
13
+ # direnv
14
+ .direnv/
15
+ .envrc
16
+
17
+ # Build artifacts
18
+ dist/
19
+ build/
20
+ *.egg-info/
21
+
22
+ # IDE
23
+ .idea/
24
+ .vscode/
25
+
26
+ # OS generated files
27
+ .DS_Store
28
+ Thumbs.db
29
+
30
+ # Log files
31
+ *.log
@@ -0,0 +1,26 @@
1
+ # Claude Instructions for owlbear
2
+
3
+ ## Project Overview
4
+ Owlbear is a Python client that bridges AWS Athena and Polars. It executes Athena SQL queries and returns results as typed Polars DataFrames via PyArrow. Named for its two halves: Owl (Athena) + Bear (Polars).
5
+
6
+ ## Development Guidelines
7
+ - Use Polars for all data processing operations
8
+ - Follow Python packaging best practices with pyproject.toml
9
+ - Maintain compatibility with Python 3.9+
10
+
11
+ ## Dependencies
12
+ - polars: Core data processing library
13
+ - boto3: AWS SDK for Athena integration
14
+
15
+ ## Development Dependencies
16
+ - pytest: Testing framework
17
+ - black: Code formatter
18
+ - ruff: Linter
19
+ - mypy: Type checker
20
+
21
+ ## Commands
22
+ - Install dependencies: `pip install -e .[dev]`
23
+ - Run tests: `pytest`
24
+ - Format code: `black .`
25
+ - Lint code: `ruff check .`
26
+ - Type check: `mypy src/`
owlbear-0.2.0/PKG-INFO ADDED
@@ -0,0 +1,436 @@
1
+ Metadata-Version: 2.4
2
+ Name: owlbear
3
+ Version: 0.2.0
4
+ Summary: Feathers and claws for your data lake
5
+ Project-URL: Homepage, https://github.com/jdonaldson/owlbear
6
+ Project-URL: Repository, https://github.com/jdonaldson/owlbear
7
+ Project-URL: Issues, https://github.com/jdonaldson/owlbear/issues
8
+ Author-email: "J. Justin Donaldson" <jjd@jjd.io>
9
+ License: MIT
10
+ Keywords: analytics,athena,aws,data,polars,trino
11
+ Classifier: Development Status :: 3 - Alpha
12
+ Classifier: Intended Audience :: Developers
13
+ Classifier: License :: OSI Approved :: MIT License
14
+ Classifier: Programming Language :: Python :: 3
15
+ Classifier: Programming Language :: Python :: 3.9
16
+ Classifier: Programming Language :: Python :: 3.10
17
+ Classifier: Programming Language :: Python :: 3.11
18
+ Classifier: Programming Language :: Python :: 3.12
19
+ Classifier: Topic :: Database
20
+ Classifier: Topic :: Scientific/Engineering :: Information Analysis
21
+ Requires-Python: >=3.9
22
+ Requires-Dist: polars>=0.20.0
23
+ Requires-Dist: pyarrow>=10.0.0
24
+ Provides-Extra: all
25
+ Requires-Dist: boto3>=1.26.0; extra == 'all'
26
+ Requires-Dist: trino>=0.320.0; extra == 'all'
27
+ Provides-Extra: athena
28
+ Requires-Dist: boto3>=1.26.0; extra == 'athena'
29
+ Provides-Extra: dev
30
+ Requires-Dist: black>=23.0.0; extra == 'dev'
31
+ Requires-Dist: boto3>=1.26.0; extra == 'dev'
32
+ Requires-Dist: mypy>=1.0.0; extra == 'dev'
33
+ Requires-Dist: pytest>=7.0.0; extra == 'dev'
34
+ Requires-Dist: ruff>=0.1.0; extra == 'dev'
35
+ Requires-Dist: trino>=0.320.0; extra == 'dev'
36
+ Provides-Extra: trino
37
+ Requires-Dist: trino>=0.320.0; extra == 'trino'
38
+ Description-Content-Type: text/markdown
39
+
40
+ # owlbear
41
+
42
+ <img src="owlbear.png" width="150" align="right" alt="Owlbear" />
43
+
44
+ **Feathers and claws for your data lake.**
45
+
46
+ Owlbear is a Python client that bridges **Athena** and **Trino** to **Polars** DataFrames via PyArrow. A wise chimera — part **Owl** ([Athena](https://aws.amazon.com/athena/), goddess of wisdom), part **Bear** ([Polars](https://pola.rs/), the bear constellation). Query your data lake with SQL, get back fast, typed DataFrames — no serialization or ODBC overhead.
47
+
48
+ ## Features
49
+
50
+ - **Two backends**: `AthenaClient` (AWS Athena via boto3) and `TrinoClient` (direct Trino connection)
51
+ - Shared Presto-family type conversion — both backends produce identically typed Polars DataFrames
52
+ - Pagination support for large result sets (Athena) and row limits (both)
53
+ - Comprehensive error handling and timeout management
54
+ - Query cancellation and execution monitoring (Athena)
55
+ - Built-in retry logic with exponential backoff (Athena)
56
+
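The retry behaviour listed under Features can be pictured with a minimal sketch of exponential backoff. This is not owlbear's implementation: the function name `retry_with_backoff`, its parameters, and the jitter factor are all assumptions made for illustration.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Call fn() until it succeeds, doubling the wait after each failure.

    Hypothetical helper for illustration; not part of owlbear's API.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay grows as base_delay * 2**attempt, capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Up to 10% jitter so concurrent callers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` argument is injectable so the schedule can be inspected in tests without actually waiting.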
57
+ ## Installation
58
+
59
+ ### From GitHub (Git)
60
+
61
+ ```bash
62
+ # Core only (no backend)
63
+ pip install git+https://github.com/jdonaldson/owlbear.git
64
+
65
+ # With Athena backend
66
+ pip install "owlbear[athena] @ git+https://github.com/jdonaldson/owlbear.git"
67
+
68
+ # With Trino backend
69
+ pip install "owlbear[trino] @ git+https://github.com/jdonaldson/owlbear.git"
70
+
71
+ # Both backends
72
+ pip install "owlbear[all] @ git+https://github.com/jdonaldson/owlbear.git"
73
+ ```
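Because `boto3` and `trino` are optional extras, it can help to check which backend dependencies are importable before constructing a client. The snippet below is a small sketch using only the standard library; the helper name `available_backends` is an assumption and does not come from owlbear.

```python
from importlib.util import find_spec


def available_backends():
    """Report which optional backend dependencies can be imported.

    Hypothetical helper for illustration; not part of owlbear's API.
    """
    return {
        "athena": find_spec("boto3") is not None,  # installed by owlbear[athena]
        "trino": find_spec("trino") is not None,   # installed by owlbear[trino]
    }


print(available_backends())
```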
74
+
75
+ ### For Development
76
+
77
+ ```bash
78
+ git clone https://github.com/jdonaldson/owlbear.git
79
+ cd owlbear
80
+ pip install -e ".[dev]"
81
+ ```
82
+
83
+ ## Prerequisites
84
+
85
+ - Python 3.9+
86
+ - **Athena**: AWS credentials configured (via AWS CLI, environment variables, or IAM roles) and an S3 bucket for query results
87
+ - **Trino**: A running Trino cluster with network access
88
+
89
+ ## Quick Start
90
+
91
+ ### Athena
92
+
93
+ ```python
94
+ from owlbear import AthenaClient
95
+
96
+ client = AthenaClient(
97
+ database="my_database",
98
+ output_location="s3://my-bucket/athena-results/",
99
+ region="us-east-1"
100
+ )
101
+
102
+ execution_id = client.query("SELECT * FROM orders LIMIT 5")
103
+ df = client.results(execution_id)
104
+ print(df)
105
+ ```
106
+
107
+ ```
108
+ shape: (5, 4)
109
+ ┌─────────────┬────────────┬──────────────┬────────────┐
110
+ │ customer_id ┆ order_date ┆ order_amount ┆ status │
111
+ │ --- ┆ --- ┆ --- ┆ --- │
112
+ │ i64 ┆ date ┆ f64 ┆ str │
113
+ ╞═════════════╪════════════╪══════════════╪════════════╡
114
+ │ 1001 ┆ 2024-03-15 ┆ 249.99 ┆ shipped │
115
+ │ 1002 ┆ 2024-03-15 ┆ 89.50 ┆ delivered │
116
+ │ 1003 ┆ 2024-03-16 ┆ 1024.00 ┆ processing │
117
+ │ 1001 ┆ 2024-03-17 ┆ 54.25 ┆ shipped │
118
+ │ 1004 ┆ 2024-03-17 ┆ 399.99 ┆ delivered │
119
+ └─────────────┴────────────┴──────────────┴────────────┘
120
+ ```
121
+
122
+ ### Trino
123
+
124
+ ```python
125
+ from owlbear import TrinoClient
126
+
127
+ client = TrinoClient(
128
+ host="trino.example.com",
129
+ port=443,
130
+ user="analyst",
131
+ catalog="hive",
132
+ schema="default",
133
+ )
134
+
135
+ df = client.query("SELECT * FROM orders LIMIT 5")
136
+ print(df)
137
+ ```
138
+
139
+ ```
140
+ shape: (5, 4)
141
+ ┌─────────────┬────────────┬──────────────┬────────────┐
142
+ │ customer_id ┆ order_date ┆ order_amount ┆ status │
143
+ │ --- ┆ --- ┆ --- ┆ --- │
144
+ │ i64 ┆ date ┆ f64 ┆ str │
145
+ ╞═════════════╪════════════╪══════════════╪════════════╡
146
+ │ 1001 ┆ 2024-03-15 ┆ 249.99 ┆ shipped │
147
+ │ 1002 ┆ 2024-03-15 ┆ 89.50 ┆ delivered │
148
+ │ 1003 ┆ 2024-03-16 ┆ 1024.00 ┆ processing │
149
+ │ 1001 ┆ 2024-03-17 ┆ 54.25 ┆ shipped │
150
+ │ 1004 ┆ 2024-03-17 ┆ 399.99 ┆ delivered │
151
+ └─────────────┴────────────┴──────────────┴────────────┘
152
+ ```
153
+
154
+ ## Usage Examples
155
+
156
+ ### Basic Query Execution
157
+
158
+ ```python
159
+ from owlbear import AthenaClient
+ import polars as pl
160
+
161
+ # Initialize client
162
+ client = AthenaClient(
163
+ database="analytics_db",
164
+ output_location="s3://my-athena-results/queries/",
165
+ region="us-west-2"
166
+ )
167
+
168
+ # Execute query with automatic waiting
169
+ query = """
170
+ SELECT
171
+ customer_id,
172
+ SUM(order_amount) as total_spent,
173
+ COUNT(*) as order_count
174
+ FROM orders
175
+ WHERE order_date >= '2024-01-01'
176
+ GROUP BY customer_id
177
+ ORDER BY total_spent DESC
178
+ LIMIT 50
179
+ """
180
+
181
+ execution_id = client.query(query, wait_for_completion=True)
182
+ results_df = client.results(execution_id)
183
+
184
+ # Use Polars operations
185
+ top_customers = results_df.filter(pl.col("total_spent") > 1000)
186
+ print(f"Found {len(top_customers)} high-value customers")
187
+ ```
188
+
189
+ ### Asynchronous Query Execution
190
+
191
+ ```python
192
+ # Start query without waiting
193
+ execution_id = client.query(
194
+ "SELECT * FROM large_table",
195
+ wait_for_completion=False
196
+ )
197
+
198
+ # Check query status
199
+ query_info = client.get_query_info(execution_id)
200
+ print(f"Query status: {query_info['Status']['State']}")
201
+
202
+ # Wait for completion and get results when ready
203
+ client._wait_for_completion(execution_id)
204
+ df = client.results(execution_id)
205
+ ```
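The example above waits through `_wait_for_completion`, whose leading underscore marks it as private by Python convention. As an alternative, here is a rough sketch of a polling loop built only on the `get_query_info` call shown above. The function name `wait_for_query` and its parameters are assumptions; the three terminal states are the ones Athena reports for finished queries.

```python
import time

# Athena query states that will not change any further.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_query(get_info, execution_id, timeout=300.0, interval=1.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_info(execution_id) until the query reaches a terminal state.

    Hypothetical helper: get_info is any callable returning a dict shaped
    like {"Status": {"State": ...}}, e.g. client.get_query_info.
    """
    deadline = clock() + timeout
    while True:
        state = get_info(execution_id)["Status"]["State"]
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError(
                f"Query {execution_id} still {state} after {timeout}s"
            )
        sleep(interval)
```

With a real client this might be called as `wait_for_query(client.get_query_info, execution_id)`, fetching results only when the returned state is `SUCCEEDED`.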
206
+
207
+ ### Using Work Groups
208
+
209
+ ```python
210
+ # Execute query with a specific work group
211
+ execution_id = client.query(
212
+ query="SELECT COUNT(*) FROM my_table",
213
+ work_group="my-workgroup"
214
+ )
215
+ df = client.results(execution_id)
216
+ ```
217
+
218
+ ### Handling Large Result Sets
219
+
220
+ ```python
221
+ # Get results with pagination (limit to 5000 rows)
222
+ df = client.results(execution_id, max_rows=5000)
223
+
224
+ # For larger datasets, consider using LIMIT in your SQL query
225
+ # or processing results in chunks
226
+ ```
227
+
228
+ ### Using with Existing boto3 Session
229
+
230
+ ```python
231
+ import boto3
232
+ from owlbear import AthenaClient
233
+
234
+ # Use existing session (useful for custom credential handling)
235
+ session = boto3.Session(profile_name='my-profile')
236
+ client = AthenaClient.from_session(
237
+ session=session,
238
+ database="my_db",
239
+ output_location="s3://my-bucket/results/"
240
+ )
241
+
242
+ # Or with custom config
243
+ from botocore.config import Config
244
+
245
+ config = Config(
246
+ region_name='eu-west-1',
247
+ retries={'max_attempts': 5}
248
+ )
249
+
250
+ client = AthenaClient(
251
+ database="my_db",
252
+ output_location="s3://my-bucket/results/",
253
+ config=config
254
+ )
255
+ ```
256
+
257
+ ### Query Management
258
+
259
+ ```python
260
+ # List available work groups
261
+ work_groups = client.list_work_groups()
262
+ print(f"Available work groups: {work_groups}")
263
+
264
+ # Cancel a running query
265
+ client.cancel_query(execution_id)
266
+
267
+ # Get detailed query information
268
+ query_info = client.get_query_info(execution_id)
269
+ print(f"Query execution time: {query_info['Statistics']['TotalExecutionTimeInMillis']}ms")
270
+ print(f"Data processed: {query_info['Statistics']['DataProcessedInBytes']} bytes")
271
+ ```
272
+
273
+ ### Error Handling
274
+
275
+ ```python
276
+ try:
277
+ execution_id = client.query("SELECT * FROM non_existent_table")
278
+ df = client.results(execution_id)
279
+ except Exception as e:
280
+ if "Query failed" in str(e):
281
+ print(f"Query execution failed: {e}")
282
+ elif "timeout" in str(e).lower():
283
+ print(f"Query timed out: {e}")
284
+ else:
285
+ print(f"Unexpected error: {e}")
286
+ ```
287
+
288
+ ## Advanced Usage
289
+
290
+ ### Custom Query Context
291
+
292
+ ```python
293
+ execution_id = client.query(
294
+ query="SELECT * FROM my_table",
295
+ query_context={"Catalog": "my_catalog"},
296
+ result_config={"EncryptionConfiguration": {"EncryptionOption": "SSE_S3"}}
297
+ )
298
+ ```
299
+
300
+ ### Working with Different Data Types
301
+
302
+ The library automatically handles various Athena data types using PyArrow for proper type inference:
303
+
304
+ ```python
305
+ import polars as pl
+
+ # Data types are automatically inferred and converted
306
+ df = client.results(execution_id)
307
+
308
+ # Check the inferred types
309
+ print(df.dtypes)  # e.g. [Int32, Utf8, Float64, Boolean, Date, ...]
310
+
311
+ # No manual casting needed for basic types, but you can still cast if needed
312
+ df_modified = df.with_columns([
313
+ pl.col("timestamp_col").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S"),
314
+ ])
315
+ ```
316
+
317
+ ## Configuration
318
+
319
+ ### Environment Variables
320
+
321
+ You can configure AWS credentials using standard environment variables:
322
+
323
+ ```bash
324
+ export AWS_ACCESS_KEY_ID=your_access_key
325
+ export AWS_SECRET_ACCESS_KEY=your_secret_key
326
+ export AWS_DEFAULT_REGION=us-east-1
327
+ ```
328
+
329
+ ### IAM Permissions
330
+
331
+ Your AWS credentials need the following permissions:
332
+
333
+ ```json
334
+ {
335
+ "Version": "2012-10-17",
336
+ "Statement": [
337
+ {
338
+ "Effect": "Allow",
339
+ "Action": [
340
+ "athena:StartQueryExecution",
341
+ "athena:GetQueryExecution",
342
+ "athena:GetQueryResults",
343
+ "athena:StopQueryExecution",
344
+ "athena:ListWorkGroups"
345
+ ],
346
+ "Resource": "*"
347
+ },
348
+ {
349
+ "Effect": "Allow",
350
+ "Action": [
351
+ "s3:GetObject",
352
+ "s3:PutObject"
353
+ ],
354
+ "Resource": "arn:aws:s3:::your-athena-results-bucket/*"
355
+ },
356
+ {
357
+ "Effect": "Allow",
358
+ "Action": [
359
+ "glue:GetDatabase",
360
+ "glue:GetTable",
361
+ "glue:GetPartitions"
362
+ ],
363
+ "Resource": "*"
364
+ }
365
+ ]
366
+ }
367
+ ```
368
+
369
+ ## Testing
370
+
371
+ Run the test suite:
372
+
373
+ ```bash
374
+ pytest tests/ -v
375
+ ```
376
+
377
+ Run tests with coverage:
378
+
379
+ ```bash
380
+ pytest tests/ --cov=src --cov-report=html
381
+ ```
382
+
383
+ ## Development
384
+
385
+ ### Setup Development Environment
386
+
387
+ ```bash
388
+ git clone https://github.com/jdonaldson/owlbear.git
389
+ cd owlbear
390
+ pip install -e ".[dev]"
391
+ ```
392
+
393
+ ### Code Quality
394
+
395
+ Format code:
396
+ ```bash
397
+ black .
398
+ ```
399
+
400
+ Lint code:
401
+ ```bash
402
+ ruff check .
403
+ ```
404
+
405
+ Type checking:
406
+ ```bash
407
+ mypy src/
408
+ ```
409
+
410
+ ## License
411
+
412
+ MIT License - see LICENSE file for details.
413
+
414
+ ## Contributing
415
+
416
+ 1. Fork the repository on GitHub
417
+ 2. Create a feature branch
418
+ 3. Make your changes with tests
419
+ 4. Ensure all tests pass and code is formatted
420
+ 5. Submit a pull request
421
+
422
+ ## Changelog
423
+
424
+ ### v0.2.0
425
+ - Add `TrinoClient` for direct Trino connections
426
+ - Rename `OwlbearClient` → `AthenaClient` (alias kept for backward compatibility)
427
+ - Extract shared `presto_type_to_pyarrow` type converter
428
+ - Make `boto3` and `trino` optional extras (`[athena]`, `[trino]`, `[all]`)
429
+
430
+ ### v0.1.0 (2024-08-28)
431
+ - Initial release
432
+ - `AthenaClient` for executing Athena SQL and returning typed Polars DataFrames via PyArrow
433
+ - Automatic Athena-to-PyArrow type mapping (integers, floats, decimals, timestamps, booleans, arrays, maps)
434
+ - Paginated result retrieval with configurable row limits
435
+ - Async query execution with exponential-backoff polling
436
+ - Work group support, query cancellation, and execution monitoring