spark-connect-cli 0.2.0__py3-none-any.whl
- spark_connect_cli/SKILL.md +151 -0
- spark_connect_cli/__init__.py +2 -0
- spark_connect_cli/__main__.py +4 -0
- spark_connect_cli/cli.py +138 -0
- spark_connect_cli/jobs.py +221 -0
- spark_connect_cli/meta.py +95 -0
- spark_connect_cli/query.py +57 -0
- spark_connect_cli/rest.py +53 -0
- spark_connect_cli/session.py +72 -0
- spark_connect_cli/sync.py +146 -0
- spark_connect_cli-0.2.0.dist-info/METADATA +156 -0
- spark_connect_cli-0.2.0.dist-info/RECORD +15 -0
- spark_connect_cli-0.2.0.dist-info/WHEEL +4 -0
- spark_connect_cli-0.2.0.dist-info/entry_points.txt +3 -0
- spark_connect_cli-0.2.0.dist-info/licenses/LICENSE +21 -0
@@ -0,0 +1,151 @@
---
name: spark-connect-cli
description: >-
  Query Spark / Hive from the shell with the `scq` CLI over Spark Connect, and
  run long Spark jobs (e.g. Hive->ClickHouse syncs) without blocking. Use
  whenever the user wants to read Hive/Spark data, explore databases/tables/
  schema, run a Spark SQL analysis, or sync a Hive table somewhere. Triggers:
  Hive, Spark, Spark SQL, 查 Hive (query Hive), 跑个 Spark SQL (run a Spark SQL
  query), 看下这个表 (take a look at this table), 同步到 ClickHouse (sync to
  ClickHouse), sync table.
---

# scq — Spark Connect from the shell

`scq` queries a Spark Connect server (JSON-first, **read-only by default**) and
manages **async long jobs** so you never sit in a blocking tool call.

## Discover before you query

Don't guess names. Discover them first:

1. `scq databases` — list databases.
2. `scq tables [DB] --like '%keyword%'` — list tables.
3. `scq describe <db.table>` — list columns (name, type, comment).
4. `scq query "SELECT ..."` — run it once you know the schema.

## Reading output

- **stdout** carries data. Default is **JSONEachRow** (NDJSON — one JSON object
  per line). Other formats: `--format json|csv|tsv|table`.
- **stderr** carries errors as one JSON object `{"error": ..., "code": ...}`.
- `query` caps at `SCQ_MAX_ROWS` (default 10k); a `{"warning": ...}` on stderr
  means it was truncated — add `LIMIT`/filters or raise `--max-rows`.
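
A minimal sketch of consuming this output from Python, assuming the stdout and stderr text has already been captured (for example with `subprocess.run`). The helper names `parse_rows` and `parse_stderr` are hypothetical and not part of `scq`; the sample strings are made up for illustration.

```python
import json


def parse_rows(stdout_text: str) -> list:
    """Parse JSONEachRow (NDJSON) output: one JSON object per non-empty line."""
    return [json.loads(line) for line in stdout_text.splitlines() if line.strip()]


def parse_stderr(stderr_text: str) -> dict:
    """Merge the JSON objects found on stderr into one dict.

    Assumes each stderr line is either a JSON object or free-form text;
    anything that is not a JSON object is kept under "other".
    """
    found: dict = {"other": []}
    for line in stderr_text.splitlines():
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            found["other"].append(line)
            continue
        if isinstance(obj, dict):
            found.update(obj)
        else:
            found["other"].append(line)
    return found


rows = parse_rows('{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n')
info = parse_stderr('{"warning": "result capped at 10000 rows"}\n')
```

Checking `"warning" in info` after a query is one way to notice truncation before trusting an aggregate computed from `rows`.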

## Branch on the exit code

`0` ok · `1` query error (fix the SQL) · `2` connection error (check
`$SPARK_REMOTE`) · `3` read-only guard blocked it · `4` job-control error.
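
As a small illustration, a caller could branch on the code with a lookup table. The codes are the ones listed above; the dictionary name `EXIT_ACTIONS`, the function `describe_exit`, and the wording of each message are assumptions made for this sketch.

```python
# Exit codes as listed above; the message strings are paraphrases.
EXIT_ACTIONS = {
    0: "ok",
    1: "query error: fix the SQL",
    2: "connection error: check $SPARK_REMOTE",
    3: "blocked by the read-only guard",
    4: "job-control error",
}


def describe_exit(code: int) -> str:
    """Return the meaning of an scq exit code, or a fallback for unknown codes."""
    return EXIT_ACTIONS.get(code, f"unexpected exit code {code}")
```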

## Read-only by default

`scq query` allows only SELECT/SHOW/DESCRIBE/EXPLAIN/WITH. Writes and DDL exit
with code `3` unless you pass `--allow-ddl`. **Only** add `--allow-ddl` when the
user explicitly asked to modify data or schema.
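
The guard's implementation (`is_read_only` in `session.py`) is not shown in this excerpt. As a rough illustration only, a first-keyword check along the following lines would behave as described. The function name `looks_read_only` and the handling of leading `--` comments are assumptions, not the package's code, and a first-keyword test is a heuristic rather than a SQL parser.

```python
import re

# Statement kinds the text above lists as allowed without --allow-ddl.
READ_ONLY_KEYWORDS = {"SELECT", "SHOW", "DESCRIBE", "EXPLAIN", "WITH"}


def looks_read_only(sql: str) -> bool:
    """Return True if the first keyword of the statement is in the allow-list."""
    # Drop leading "-- ..." line comments and whitespace (an assumption).
    stripped = re.sub(r"^(\s*--[^\n]*\n)*\s*", "", sql)
    match = re.match(r"[A-Za-z]+", stripped)
    return bool(match) and match.group(0).upper() in READ_ONLY_KEYWORDS
```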

## Long jobs — submit, then poll. NEVER block.

A full-table sync is a multi-minute Spark job. **Do not** run it in the
foreground and wait. Submit it, tell the user the job id, and hand control back:

```bash
scq sync ods.orders --to clickhouse   # prints {"job_id": "...", "state": "running"}
```

Then, *only when the user asks* "how's it going" / after a natural pause:

```bash
scq jobs status j-20260625-...          # state, source_rows, written_rows, exit_code
scq jobs logs   j-20260625-... --tail 40  # recent progress
scq jobs cancel j-20260625-...          # stop it
```

Etiquette:
- After submitting, reply with the job id and a one-line "I'll check when you
  want." Don't loop on `status` in a tight wait — let the user drive, or poll on
  a relaxed cadence.
- Report terminal state plainly: `succeeded` with `written_rows`, or `failed`
  with the tail of the log.
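
If polling is needed at all, a relaxed cadence could look like the sketch below. It assumes a caller-supplied `fetch_status` callable that returns the parsed JSON of `scq jobs status <id>`; that callable, the function name `poll_until_done`, and the default interval are hypothetical choices for illustration.

```python
import time

TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


def poll_until_done(fetch_status, interval_s=30.0, max_polls=20, sleep=time.sleep):
    """Call fetch_status() at a relaxed cadence until the job reaches a terminal state.

    Returns the last status dict seen. It may still be non-terminal if
    max_polls is exhausted, in which case control goes back to the user.
    """
    status = fetch_status()
    for _ in range(max_polls - 1):
        if status.get("state") in TERMINAL_STATES:
            break
        sleep(interval_s)
        status = fetch_status()
    return status
```

The `sleep` parameter is injectable so the loop can be exercised without waiting; the bounded `max_polls` keeps the call from turning into the blocking wait this section warns against.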

## Hive → ClickHouse sync workflow

When the user says "同步 X 表到 ClickHouse" ("sync table X to ClickHouse"):

1. `scq describe <src>` — get the Hive schema.
2. Decide the **target database and table**. `--target` takes `db.table`:
   - If the user names a database (e.g. `class_db`), pass it **qualified**:
     `--target class_db.class`. A **bare table name lands in the connection's
     default database** (`default`) — don't let data silently go there.
   - The **database must already exist** (auto-create makes the table, not the
     database). Ensure it first: `chsql query --allow-ddl "CREATE DATABASE IF NOT
     EXISTS class_db"`.
3. Make sure the target table is good:
   - For a quick/one-off sync, let `scq sync` auto-create it — but pass
     `--order-by <key>` so it gets a real sort key (otherwise it is created with
     `ORDER BY tuple()`, no primary index).
   - For a production table, **pre-create it** with `chsql query --allow-ddl
     "CREATE TABLE class_db.class (...) ENGINE = MergeTree ORDER BY (...)"` (full
     control over engine, keys, partitioning), then sync.
4. Submit: `scq sync <src> --target db.table [--order-by key] [--where ...]`.
5. Hand back the job id. Verify with row counts when it finishes.

The ClickHouse JDBC connection (`$SCQ_CH_JDBC`) is preconfigured — you do **not**
pass credentials; just choose the `db.table` with `--target`.

### Spark/Hive → ClickHouse type mapping

| Spark/Hive | ClickHouse |
|------------|------------|
| boolean | Bool |
| tinyint / smallint / int / bigint | Int8 / Int16 / Int32 / Int64 |
| float / double | Float32 / Float64 |
| decimal(p,s) | Decimal(p,s) |
| string / varchar / char / binary | String |
| date | Date32 |
| timestamp | DateTime64(3) |

Nullable columns map to `Nullable(T)`. Nested/complex types default to `String`
(JSON) — confirm with the user before relying on them.
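
The mapping table can be read as a small lookup function. The package's own `map_type` lives in `sync.py`, which is not shown in this excerpt, so the sketch below is only an illustration of the table. The name `map_type_sketch` and the `nullable` flag are assumptions.

```python
import re

SIMPLE_TYPES = {
    "boolean": "Bool",
    "tinyint": "Int8", "smallint": "Int16", "int": "Int32", "bigint": "Int64",
    "float": "Float32", "double": "Float64",
    "string": "String", "binary": "String",
    "date": "Date32",
    "timestamp": "DateTime64(3)",
}


def map_type_sketch(spark_type: str, nullable: bool = False) -> str:
    """Map a Spark/Hive type name to a ClickHouse type per the table above."""
    t = spark_type.strip().lower()
    decimal = re.fullmatch(r"decimal\((\d+),\s*(\d+)\)", t)
    if decimal:
        ch = f"Decimal({decimal.group(1)},{decimal.group(2)})"
    elif re.fullmatch(r"(varchar|char)\(\d+\)", t):
        ch = "String"
    else:
        # Nested/complex types (array, map, struct) fall back to String.
        ch = SIMPLE_TYPES.get(t, "String")
    return f"Nullable({ch})" if nullable else ch
```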

## Metadata & execution introspection

Two general primitives — don't hand-stitch many queries.

**Table metadata → `scq meta db.table`** — one JSON: schema (+ ClickHouse type
mapping), created time, owner, format, HDFS location, partition columns +
partition list/count, file count/total size, and min/max file modification time
(i.e. "when did the data arrive"). Add `--count` for an exact row count (runs a
`count(*)`, so only when asked). For ad-hoc bits you can still use
`scq query "DESCRIBE EXTENDED t"` / `"SHOW PARTITIONS t"`.

**Execution metadata → `scq exec <path>`** — read-only passthrough to the Spark
REST API (auto-discovers the app, GET-only). The model reads the JSON, so any
runtime question is the same command with a different path. Quote any path that
contains `?` or `<...>` so the shell does not treat it as a glob or a redirect:

```bash
scq exec 'stages?status=active'    # what's running now
scq exec sql                       # each query's plan + metrics
scq exec executors                 # cores / memory / GC / shuffle
scq exec jobs
scq exec 'stages/<id>/<attempt>/taskSummary?quantiles=0.5,0.95,1.0'
```

- **Data skew**: pull a stage's `taskSummary` and compare a metric's **max vs
  median** (`executorRunTime`, `shuffleReadBytes`, `shuffleReadRecords`). A large
  `max/median` ratio = a straggler / skewed partition. `…?details=true` on a
  stage lists every task to find the hot one.
- Stage/job lists can be long — filter (`?status=active`) or fetch one id.
- For the *plan before running*, use `scq query "EXPLAIN FORMATTED SELECT ..."`.
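
The max-versus-median comparison for data skew can be sketched as below. It assumes the `taskSummary` response carries a `quantiles` list and that each metric is a list of values aligned with it, as requested with `quantiles=0.5,0.95,1.0`. The function name `skew_ratio` and the sample numbers are made up for illustration.

```python
def skew_ratio(task_summary: dict, metric: str = "executorRunTime") -> float:
    """Return max/median for one metric of a taskSummary response.

    Assumes "quantiles" includes 0.5 and 1.0 and that the metric is a list
    aligned with it. A zero median yields infinity.
    """
    quantiles = task_summary["quantiles"]
    values = task_summary[metric]
    median = values[quantiles.index(0.5)]
    maximum = values[quantiles.index(1.0)]
    return float("inf") if median == 0 else maximum / median


sample = {"quantiles": [0.5, 0.95, 1.0], "executorRunTime": [1200.0, 1500.0, 48000.0]}
ratio = skew_ratio(sample)
```

What counts as a "large" ratio is a judgment call that depends on the workload; the sketch only computes the number.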

## Connection

`scq --remote sc://host:15002 ...` or set `$SPARK_REMOTE`. No Kerberos or JVM is
needed on this side — the Spark Connect server does the auth.

## Recipes

```bash
scq --format table tables analytics --like '%event%'
scq query --format table "SELECT count(*) FROM analytics.events"
scq query --max-rows 0 "SELECT * FROM small_dim"     # no cap
scq sync analytics.events --to clickhouse --where "dt='2026-06-25'"
```
spark_connect_cli/cli.py
ADDED
@@ -0,0 +1,138 @@
"""Argument parsing and dispatch for `scq` / `spark-connect-cli`."""
from __future__ import annotations

import json
import os
import sys

from . import jobs, meta, query, rest
from .session import DEFAULT_REMOTE


def cmd_sync(args) -> None:
    # The JDBC URL carries the ClickHouse password, so it must NOT land in argv
    # (argv is persisted in the job registry's meta.json). Pass it to the worker
    # through the environment instead — submit() copies os.environ to the child.
    if args.ch_jdbc:
        os.environ["SCQ_CH_JDBC"] = args.ch_jdbc
    argv = [args.source, "--to", args.to, "--remote", args.remote, "--mode", args.mode]
    if args.target:
        argv += ["--target", args.target]
    if args.where:
        argv += ["--where", args.where]
    if args.limit:
        argv += ["--limit", str(args.limit)]
    if args.batchsize:
        argv += ["--batchsize", str(args.batchsize)]
    if args.num_partitions:
        argv += ["--num-partitions", str(args.num_partitions)]
    if args.order_by:
        argv += ["--order-by", args.order_by]
    if args.engine:
        argv += ["--engine", args.engine]
    job_id = jobs.submit("sync", argv,
                         {"source": args.source, "target": args.target or "", "to": args.to})
    print(json.dumps({
        "job_id": job_id, "state": "running",
        "message": f"sync of {args.source} -> {args.to} submitted; "
                   f"poll with `scq jobs status {job_id}`",
    }))


def cmd_skill_install(args) -> None:
    """Write the bundled SKILL.md into an agent skills directory (mirrors
    `chsql skill install`)."""
    import importlib.resources as ir
    from pathlib import Path
    root = Path(args.dir or os.environ.get("SKILLS_DIR")
                or (Path.home() / ".agents" / "skills"))
    dest = root / "spark-connect-cli"
    dest.mkdir(parents=True, exist_ok=True)
    content = ir.files("spark_connect_cli").joinpath("SKILL.md").read_text()
    (dest / "SKILL.md").write_text(content)
    print(json.dumps({"installed": str(dest / "SKILL.md")}))


def build_parser():
    import argparse
    ap = argparse.ArgumentParser(
        prog="scq",
        description="Agent-friendly Spark Connect CLI: read-only querying + "
                    "async long-job control. No JVM, no Kerberos on the client.")
    ap.add_argument("--remote", default=DEFAULT_REMOTE,
                    help=f"Spark Connect endpoint (default {DEFAULT_REMOTE} / $SPARK_REMOTE)")
    ap.add_argument("--format", default="jsoneachrow",
                    choices=["jsoneachrow", "json", "csv", "tsv", "table"])
    sub = ap.add_subparsers(dest="cmd", required=True)

    sub.add_parser("databases", help="List databases").set_defaults(func=query.cmd_databases)

    pt = sub.add_parser("tables", help="List tables in a database")
    pt.add_argument("database", nargs="?", default=None)
    pt.add_argument("--like", default=None)
    pt.set_defaults(func=query.cmd_tables)

    pd = sub.add_parser("describe", help="Show a table's columns")
    pd.add_argument("table")
    pd.set_defaults(func=query.cmd_describe)

    pq = sub.add_parser("query", help="Run SQL (read-only unless --allow-ddl)")
    pq.add_argument("sql")
    pq.add_argument("--allow-ddl", action="store_true")
    pq.add_argument("--max-rows", type=int, default=None)
    pq.set_defaults(func=query.cmd_query)

    ps = sub.add_parser("sync", help="Submit an async Hive->ClickHouse sync (returns a job id)")
    ps.add_argument("source", help="Hive table, e.g. db.table")
    ps.add_argument("--to", default="clickhouse")
    ps.add_argument("--mode", choices=["auto", "parallel", "single"], default="auto")
    ps.add_argument("--ch-jdbc", default=None, help="ClickHouse JDBC URL (or $SCQ_CH_JDBC)")
    ps.add_argument("--target", default=None, help="ClickHouse target as db.table (or bare table = default db)")
    ps.add_argument("--where", default=None)
    ps.add_argument("--limit", type=int, default=0)
    ps.add_argument("--batchsize", type=int, default=None)
    ps.add_argument("--num-partitions", type=int, default=None)
    ps.add_argument("--order-by", default=None, help="ORDER BY key for an auto-created table, e.g. 'id'")
    ps.add_argument("--engine", default=None, help="Engine for an auto-created table (default MergeTree)")
    ps.set_defaults(func=cmd_sync)

    pm = sub.add_parser("meta", help="One JSON bundle of a table's metadata")
    pm.add_argument("table", help="db.table")
    pm.add_argument("--count", action="store_true", help="include exact row count (runs count(*))")
    pm.set_defaults(func=meta.cmd_meta)

    pe = sub.add_parser("exec", help="Read-only Spark execution metadata (REST API passthrough)")
    pe.add_argument("path", nargs="?", default="",
                    help="REST subpath, e.g. 'stages' or 'stages/61/0/taskSummary?quantiles=0.5,0.95,1.0'")
    pe.add_argument("--rm", default=None, help="YARN RM base URL (or $SCQ_YARN_RM)")
    pe.add_argument("--compact", action="store_true", help="compact JSON output")
    pe.set_defaults(func=rest.cmd_exec)

    psk = sub.add_parser("skill", help="Manage the agent skill")
    sksub = psk.add_subparsers(dest="skcmd", required=True)
    ski = sksub.add_parser("install", help="Write SKILL.md into the skills dir")
    ski.add_argument("--dir", default=None, help="Skills dir (or $SKILLS_DIR)")
    ski.set_defaults(func=cmd_skill_install)

    pj = sub.add_parser("jobs", help="Manage async jobs")
    js = pj.add_subparsers(dest="jcmd", required=True)
    js.add_parser("list", help="List jobs").set_defaults(func=jobs.cmd_list)
    s = js.add_parser("status", help="Show a job's full status")
    s.add_argument("id")
    s.set_defaults(func=jobs.cmd_status)
    lg = js.add_parser("logs", help="Show a job's log (tail by default)")
    lg.add_argument("id")
    lg.add_argument("--tail", type=int, default=40)
    lg.add_argument("--full", action="store_true")
    lg.set_defaults(func=jobs.cmd_logs)
    c = js.add_parser("cancel", help="Cancel a running job")
    c.add_argument("id")
    c.set_defaults(func=jobs.cmd_cancel)

    return ap


def main(argv=None) -> None:
    argv = list(sys.argv[1:] if argv is None else argv)
    # Internal entrypoint for the detached worker.
    if argv and argv[0] == "__run-job":
        jobs.run_worker(argv[1])
        return
    args = build_parser().parse_args(argv)
    args.func(args)
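
The comment in `cmd_sync` explains why the JDBC URL travels through the environment rather than argv. A minimal, self-contained sketch of that idea follows. The variable name `DEMO_SECRET`, the function `run_with_secret`, and the child script are hypothetical, and unlike `cmd_sync` this sketch passes a copied environment instead of mutating `os.environ`.

```python
import os
import subprocess
import sys


def run_with_secret(secret: str) -> str:
    """Run a child process that reads a secret from its environment, not argv."""
    env = os.environ.copy()
    env["DEMO_SECRET"] = secret  # hypothetical variable name for this sketch
    child_code = "import os; print(len(os.environ['DEMO_SECRET']))"
    # argv holds only the interpreter and the script; the secret is absent from it.
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Anything that records the child's argv, such as a job registry or a process listing, then never sees the secret value.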
@@ -0,0 +1,221 @@
"""Layer A — async background-job control.

The point: long Spark jobs must not block the caller (an LLM agent should not sit
in a 30-minute tool call). `submit()` spawns the work detached in its own process
group, records a durable handle on disk, and returns immediately. The handle
survives a process/container restart because it lives in a file registry, not in
memory.

A job is generic: `kind="exec"` runs an arbitrary argv (used for tests and ad-hoc
long commands); `kind="sync"` runs the Hive->ClickHouse mover. New kinds plug in
via `_dispatch`.
"""
from __future__ import annotations

import json
import os
import signal
import subprocess
import sys
import time
import uuid
from datetime import datetime, timezone
from pathlib import Path

from .session import EXIT_JOB_ERR, err

JOBS_DIR = Path(os.environ.get(
    "SCQ_JOBS_DIR", str(Path.home() / ".spark-connect-cli" / "jobs")))


def _now() -> str:
    return datetime.now(timezone.utc).astimezone().isoformat(timespec="seconds")


def _job_dir(job_id: str) -> Path:
    return JOBS_DIR / job_id


def _meta_path(job_id: str) -> Path:
    return _job_dir(job_id) / "meta.json"


def _log_path(job_id: str) -> Path:
    return _job_dir(job_id) / "out.log"


def read_meta(job_id: str) -> dict:
    p = _meta_path(job_id)
    if not p.exists():
        err(f"no such job: {job_id}", EXIT_JOB_ERR)
    return json.loads(p.read_text())


def write_meta(job_id: str, meta: dict) -> None:
    tmp = _meta_path(job_id).with_suffix(".tmp")
    tmp.write_text(json.dumps(meta, indent=2, default=str))
    tmp.replace(_meta_path(job_id))


def _pid_alive(pid: int) -> bool:
    if not pid:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True
    # Signal-0 succeeds for a zombie (terminated but not yet reaped by its
    # parent), but a zombie is not doing work — treat it as dead. On Linux the
    # process state is the char right after the ")" that closes comm in stat.
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
        if stat[stat.rfind(")") + 2] == "Z":
            return False
    except (FileNotFoundError, ProcessLookupError, IndexError):
        return False
    return True


def reconcile(meta: dict) -> dict:
    """A 'running' job whose process is gone but that never recorded an end has
    crashed — mark it failed so status never lies."""
    if meta.get("state") == "running" and not _pid_alive(meta.get("pid", 0)):
        meta["state"] = "failed"
        meta["ended_at"] = _now()
        if meta.get("exit_code") is None:
            meta["exit_code"] = -1
        meta["error"] = meta.get("error") or "process exited without recording completion"
        write_meta(meta["id"], meta)
    return meta


def _new_job_id() -> str:
    return f"j-{datetime.now().strftime('%Y%m%d-%H%M%S')}-{uuid.uuid4().hex[:4]}"


def submit(kind: str, argv: list[str], descr: dict | None = None) -> str:
    """Spawn a detached worker for this job and return its id immediately."""
    JOBS_DIR.mkdir(parents=True, exist_ok=True)
    job_id = _new_job_id()
    _job_dir(job_id).mkdir(parents=True)
    meta = {
        "id": job_id, "kind": kind, "state": "submitted",
        "submitted_at": _now(), "started_at": None, "ended_at": None,
        "pid": None, "pgid": None, "exit_code": None, "argv": argv,
        **(descr or {}),
    }
    write_meta(job_id, meta)

    log = open(_log_path(job_id), "ab", buffering=0)
    proc = subprocess.Popen(
        [sys.executable, "-m", "spark_connect_cli", "__run-job", job_id],
        stdout=log, stderr=subprocess.STDOUT, stdin=subprocess.DEVNULL,
        start_new_session=True,  # own process group -> kill the whole tree
        env=os.environ.copy(),
    )
    meta["pid"] = proc.pid
    meta["pgid"] = os.getpgid(proc.pid)
    meta["state"] = "running"
    meta["started_at"] = _now()
    write_meta(job_id, meta)
    return job_id


def _dispatch(meta: dict) -> int:
    kind = meta["kind"]
    argv = meta.get("argv", [])
    if kind == "exec":
        return subprocess.run(argv).returncode
    if kind == "sync":
        from .sync import run as sync_run
        return sync_run(argv, meta)
    print(f"[scq] unknown job kind: {kind}", flush=True)
    return 2


def run_worker(job_id: str) -> None:
    """Runs INSIDE the detached child. Executes the work, then records a terminal
    state. Never raises out — always writes a final meta."""
    meta = read_meta(job_id)
    rc = 0
    try:
        rc = _dispatch(meta)
    except SystemExit as e:
        rc = int(e.code) if isinstance(e.code, int) else 1
    except Exception as e:  # noqa: BLE001
        meta = read_meta(job_id)
        meta["error"] = str(e)
        # Persist the error now; the re-read below would otherwise discard it.
        write_meta(job_id, meta)
        print(f"[scq] job failed: {e}", flush=True)
        rc = 1
    meta = read_meta(job_id)  # re-read: the body may have updated counters
    meta["state"] = "succeeded" if rc == 0 else "failed"
    meta["exit_code"] = rc
    meta["ended_at"] = _now()
    write_meta(job_id, meta)
    sys.exit(rc)


# -- agent-facing commands -------------------------------------------------

def cmd_list(args) -> None:
    if not JOBS_DIR.exists():
        print(json.dumps({"jobs": []}))
        return
    out = []
    for d in sorted(JOBS_DIR.iterdir(), reverse=True):
        if not (d / "meta.json").exists():
            continue
        m = reconcile(json.loads((d / "meta.json").read_text()))
        out.append({k: m.get(k) for k in
                    ("id", "kind", "state", "source", "target", "submitted_at",
                     "source_rows", "written_rows")})
    if args.format == "table":
        from .session import emit_rows
        emit_rows([(j["id"], j["kind"], j["state"], j.get("source") or "",
                    str(j.get("written_rows") if j.get("written_rows") is not None else ""))
                   for j in out],
                  ["id", "kind", "state", "source", "written"], "table")
    else:
        print(json.dumps({"jobs": out}, default=str))


def cmd_status(args) -> None:
    print(json.dumps(reconcile(read_meta(args.id)), indent=2, default=str))


def cmd_logs(args) -> None:
    read_meta(args.id)  # validate existence
    p = _log_path(args.id)
    if not p.exists():
        print("")
        return
    data = p.read_text(errors="replace")
    if args.full:
        sys.stdout.write(data)
        return
    lines = data.splitlines()
    sys.stdout.write("\n".join(lines[-args.tail:]) + ("\n" if lines else ""))


def cmd_cancel(args) -> None:
    meta = reconcile(read_meta(args.id))
    if meta["state"] not in ("running", "submitted"):
        print(json.dumps({"id": args.id, "state": meta["state"],
                          "message": "job already finished"}))
        return
    pgid = meta.get("pgid")
    if pgid:
        try:
            os.killpg(pgid, signal.SIGTERM)
            time.sleep(2)
            if _pid_alive(meta.get("pid", 0)):
                os.killpg(pgid, signal.SIGKILL)
        except ProcessLookupError:
            pass
    meta["state"] = "cancelled"
    meta["ended_at"] = _now()
    write_meta(args.id, meta)
    print(json.dumps({"id": args.id, "state": "cancelled"}))
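
`write_meta` in this module uses a write-to-temp-then-replace pattern so a reader does not see a half-written `meta.json`. The standalone sketch below shows the same idea in isolation; the function name `write_json_atomically` and the temporary-directory demo are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_json_atomically(path: Path, payload: dict) -> None:
    """Write payload to a sibling temp file, then rename it over the target."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(payload, indent=2))
    # Path.replace renames in place; on POSIX this is atomic on one filesystem,
    # so a concurrent reader sees either the old file or the new one.
    tmp.replace(path)


with tempfile.TemporaryDirectory() as d:
    target = Path(d) / "meta.json"
    write_json_atomically(target, {"state": "submitted"})
    write_json_atomically(target, {"state": "running"})
    final_state = json.loads(target.read_text())["state"]
    leftovers = sorted(p.name for p in Path(d).iterdir())
```

The rename does not by itself guard against two writers racing on the same `.tmp` name; that case is outside what this sketch covers.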
@@ -0,0 +1,95 @@
"""scq meta — one structured metadata document for a table.

Bundles what is otherwise spread across DESCRIBE EXTENDED + SHOW PARTITIONS +
the Spark `_metadata` hidden column, so an agent gets the whole picture of a
table in a single call instead of stitching several queries.
"""
from __future__ import annotations

import json

from .session import EXIT_QUERY_ERR, err, get_spark
from .sync import map_type


def _describe_extended(spark, table):
    """Parse DESCRIBE EXTENDED into (columns, partition_columns, details)."""
    rows = spark.sql(f"DESCRIBE EXTENDED {table}").collect()
    cols, part_cols, details = [], [], {}
    section = "cols"
    for r in rows:
        name = (r[0] or "").strip()
        val = (r[1] or "")
        if name.startswith("# Partition Information"):
            section = "partcols"
            continue
        if name.startswith("# Detailed Table Information"):
            section = "details"
            continue
        if name.startswith("#") or not name:
            continue
        if section == "cols":
            cols.append({"name": name, "type": val, "clickhouse": map_type(val),
                         "comment": r[2]})
        elif section == "partcols":
            part_cols.append(name)
        else:
            details[name] = val.strip()
    return cols, part_cols, details


def _file_stats(spark, table):
    """Aggregate per-file size/mtime from the _metadata hidden column. Returns
    None if the source doesn't expose _metadata."""
    try:
        row = spark.sql(
            f"SELECT count(*) AS num_files, sum(sz) AS total_bytes, "
            f"min(mt) AS first_modified, max(mt) AS last_modified FROM ("
            f" SELECT _metadata.file_path AS p, max(_metadata.file_size) AS sz, "
            f" max(_metadata.file_modification_time) AS mt FROM {table} GROUP BY 1)"
        ).collect()[0]
        return {"numFiles": row["num_files"], "totalBytes": row["total_bytes"],
                "firstModified": str(row["first_modified"]),
                "lastModified": str(row["last_modified"])}
    except Exception:  # noqa: BLE001 — _metadata unsupported / empty table
        return None


def cmd_meta(args) -> None:
    spark = get_spark(args.remote)
    table = args.table
    try:
        cols, part_cols, details = _describe_extended(spark, table)
    except Exception as e:  # noqa: BLE001
        err(f"describe failed: {e}", EXIT_QUERY_ERR)

    out = {
        "table": table,
        "createdTime": details.get("Created Time"),
        "lastAccess": details.get("Last Access"),
        "owner": details.get("Owner"),
        "createdBy": details.get("Created By"),
        "provider": details.get("Provider"),
        "type": details.get("Type"),
        "location": details.get("Location"),
        "statistics": details.get("Statistics"),
        "partitionColumns": part_cols,
        "columns": cols,
    }

    if part_cols:
        try:
            parts = [r[0] for r in spark.sql(f"SHOW PARTITIONS {table}").collect()]
            out["partitionCount"] = len(parts)
            out["partitions"] = parts if len(parts) <= 200 else parts[:200] + ["…"]
        except Exception:  # noqa: BLE001
            pass

    files = _file_stats(spark, table)
    if files:
        out["files"] = files

    if args.count:
        out["rowCount"] = spark.sql(f"SELECT count(*) c FROM {table}").collect()[0]["c"]

    print(json.dumps(out, ensure_ascii=False, indent=2, default=str))
@@ -0,0 +1,57 @@
"""Read-only / discovery commands over Spark Connect."""
from __future__ import annotations

import json
import sys

from .session import (DEFAULT_MAX_ROWS, EXIT_BLOCKED, EXIT_QUERY_ERR, emit_rows,
                      err, get_spark, is_read_only)


def cmd_databases(args) -> None:
    spark = get_spark(args.remote)
    rows = spark.sql("SHOW DATABASES").collect()
    emit_rows([(r[0],) for r in rows], ["database"], args.format)


def cmd_tables(args) -> None:
    spark = get_spark(args.remote)
    db = args.database or "default"
    rows = spark.sql(f"SHOW TABLES IN `{db}`").collect()
    out = [(r["namespace"], r["tableName"]) for r in rows]
    if args.like:
        pat = args.like.replace("%", "").lower()
        out = [t for t in out if pat in t[1].lower()]
    emit_rows(out, ["database", "table"], args.format)


def cmd_describe(args) -> None:
    spark = get_spark(args.remote)
    rows = spark.sql(f"DESCRIBE TABLE {args.table}").collect()
    emit_rows([(r[0], r[1], r[2]) for r in rows],
              ["col_name", "data_type", "comment"], args.format)


def cmd_query(args) -> None:
    sql = args.sql
    if not args.allow_ddl and not is_read_only(sql):
        err("write/DDL blocked by read-only guard; pass --allow-ddl to override",
            EXIT_BLOCKED)
    spark = get_spark(args.remote)
    try:
        df = spark.sql(sql)
    except Exception as e:  # noqa: BLE001
        err(f"query failed: {e}", EXIT_QUERY_ERR)
    if not is_read_only(sql):
        print(json.dumps({"ok": True}))
        return
    max_rows = args.max_rows if args.max_rows is not None else DEFAULT_MAX_ROWS
    columns = df.columns
    limited = df.limit(max_rows + 1).collect() if max_rows > 0 else df.collect()
    truncated = max_rows > 0 and len(limited) > max_rows
    rows = limited[:max_rows] if truncated else limited
    emit_rows([tuple(r) for r in rows], columns, args.format)
    if truncated:
        print(json.dumps({"warning": f"result capped at {max_rows} rows; "
                                     "add LIMIT/filters or raise --max-rows"}),
              file=sys.stderr)
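
`cmd_query` detects truncation by fetching one row more than the cap and checking whether that extra row arrived. The same technique on a plain Python iterable, as a sketch; the function name `cap_rows` is hypothetical and no Spark session is involved.

```python
def cap_rows(rows, max_rows: int):
    """Return (kept_rows, truncated) using the fetch-one-extra-row technique.

    rows may be any iterable; at most max_rows + 1 items are consumed when
    max_rows is positive. max_rows <= 0 means no cap.
    """
    if max_rows <= 0:
        return list(rows), False
    fetched = []
    for row in rows:
        fetched.append(row)
        if len(fetched) > max_rows:
            break
    truncated = len(fetched) > max_rows
    return fetched[:max_rows], truncated
```

A result with exactly `max_rows` rows is reported as complete, which a plain "fetch `max_rows`" approach could not distinguish from a truncated one.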
@@ -0,0 +1,53 @@
"""scq exec — read-only passthrough to the Spark REST API.

One generic accessor over the Spark monitoring REST API instead of a pile of
single-purpose diagnostics. The model composes the path and interprets the JSON,
so skew, slow stages, shuffle spill, executor GC/OOM, etc. are all the same
command with different readings.

Access path: the Spark UI redirects to the YARN ResourceManager web proxy, so we
discover the running Spark application via the RM REST and go through the proxy.
Pure-Python (urllib) — no curl, no manual app id. GET-only against /api/v1.
"""
from __future__ import annotations

import json
import os
import urllib.request

from .session import EXIT_CONN_ERR, err

DEFAULT_RM = os.environ.get("SCQ_YARN_RM", "http://namenode.hive-net:8088")


def _get_json(url: str, timeout: int = 15):
    with urllib.request.urlopen(url, timeout=timeout) as r:  # noqa: S310 — fixed RM base
        return json.load(r)


def _discover_app(rm: str) -> str | None:
    apps = _get_json(f"{rm}/ws/v1/cluster/apps?states=RUNNING&applicationTypes=SPARK")
    lst = ((apps.get("apps") or {}).get("app")) or []
    for a in lst:  # prefer the Spark Connect Server
        if "connect" in (a.get("name") or "").lower():
            return a["id"]
    return lst[0]["id"] if lst else None


def cmd_exec(args) -> None:
    rm = (args.rm or DEFAULT_RM).rstrip("/")
    path = (args.path or "").lstrip("/")
    if ".." in path:
        err("path must not contain '..'", EXIT_CONN_ERR)
    try:
        app = _discover_app(rm)
        if not app:
            err("no RUNNING Spark application found via the YARN RM", EXIT_CONN_ERR)
        base = f"{rm}/proxy/{app}/api/v1/applications/{app}"
        data = _get_json(f"{base}/{path}" if path else base)
    except SystemExit:
        raise
    except Exception as e:  # noqa: BLE001
        err(f"Spark REST request failed: {e}", EXIT_CONN_ERR)
    print(json.dumps(data, ensure_ascii=False,
                     indent=None if args.compact else 2, default=str))
|
|
@@ -0,0 +1,72 @@
|
|
|
1
|
+
"""Spark Connect session, read-only guard, and output formatting.
|
|
2
|
+
|
|
3
|
+
Connecting to Spark Connect needs NO Kerberos and NO JVM on the client side —
|
|
4
|
+
the server runs under its own keytab and does the auth. The endpoint is a plain
|
|
5
|
+
gRPC address, e.g. sc://spark-connect:15002.
|
|
6
|
+
"""
|
|
7
|
+
from __future__ import annotations
|
|
8
|
+
|
|
9
|
+
import json
|
|
10
|
+
import os
|
|
11
|
+
import sys
|
|
12
|
+
|
|
13
|
+
DEFAULT_REMOTE = os.environ.get("SPARK_REMOTE", "sc://localhost:15002")
|
|
14
|
+
DEFAULT_MAX_ROWS = int(os.environ.get("SCQ_MAX_ROWS", "10000"))
|
|
15
|
+
|
|
16
|
+
# Exit codes — stable contract so an agent can branch on them.
|
|
17
|
+
EXIT_OK = 0
|
|
18
|
+
EXIT_QUERY_ERR = 1
|
|
19
|
+
EXIT_CONN_ERR = 2
|
|
20
|
+
EXIT_BLOCKED = 3 # read-only guard tripped
|
|
21
|
+
EXIT_JOB_ERR = 4 # job-control error (no such job, etc.)
|
|
22
|
+
|
|
23
|
+
READ_ONLY_LEADERS = ("select", "show", "describe", "desc", "explain", "with")
|
|
24
|
+
|
|
25
|
+
|
|
26
|
+
def err(msg: object, code: int) -> None:
|
|
27
|
+
"""Emit a single JSON error object on stderr and exit."""
|
|
28
|
+
print(json.dumps({"error": str(msg), "code": code}), file=sys.stderr)
|
|
29
|
+
sys.exit(code)
|
|
30
|
+
|
|
31
|
+
|
|
32
|
+
def is_read_only(sql: str) -> bool:
|
|
33
|
+
"""True if the statement only reads (SELECT/SHOW/DESCRIBE/EXPLAIN/WITH)."""
|
|
34
|
+
leader = sql.strip().lstrip("(").split(None, 1)
|
|
35
|
+
return bool(leader) and leader[0].lower() in READ_ONLY_LEADERS
|
|
36
|
+
|
|
37
|
+
|
|
38
|
+
def get_spark(remote: str):
|
|
39
|
+
try:
|
|
40
|
+
from pyspark.sql import SparkSession
|
|
41
|
+
except ModuleNotFoundError:
|
|
42
|
+
err("pyspark is not installed; `pip install 'pyspark[connect]>=3.5,<4'`", EXIT_CONN_ERR)
|
|
43
|
+
try:
|
|
44
|
+
return SparkSession.builder.remote(remote).getOrCreate()
|
|
45
|
+
except Exception as e: # noqa: BLE001
|
|
46
|
+
err(f"could not connect to Spark Connect at {remote}: {e}", EXIT_CONN_ERR)
|
|
47
|
+
|
|
48
|
+
|
|
49
|
+
def emit_rows(rows, columns, fmt: str) -> None:
|
|
50
|
+
"""Render a result set. JSON-first (JSONEachRow) by default."""
|
|
51
|
+
if fmt == "jsoneachrow":
|
|
52
|
+
for r in rows:
|
|
53
|
+
print(json.dumps(dict(zip(columns, r)), default=str, ensure_ascii=False))
|
|
54
|
+
elif fmt == "json":
|
|
55
|
+
print(json.dumps(
|
|
56
|
+
{"meta": list(columns), "data": [list(r) for r in rows], "rows": len(rows)},
|
|
57
|
+
default=str, ensure_ascii=False))
|
|
58
|
+
elif fmt in ("csv", "tsv"):
|
|
59
|
+
sep = "," if fmt == "csv" else "\t"
|
|
60
|
+
print(sep.join(columns))
|
|
61
|
+
for r in rows:
|
|
62
|
+
print(sep.join("" if v is None else str(v) for v in r))
|
|
63
|
+
elif fmt == "table":
|
|
64
|
+
widths = [len(c) for c in columns]
|
|
65
|
+
srows = [["" if v is None else str(v) for v in r] for r in rows]
|
|
66
|
+
for r in srows:
|
|
67
|
+
for i, v in enumerate(r):
|
|
68
|
+
widths[i] = max(widths[i], len(v))
|
|
69
|
+
print(" | ".join(c.ljust(widths[i]) for i, c in enumerate(columns)))
|
|
70
|
+
print("-+-".join("-" * w for w in widths))
|
|
71
|
+
for r in srows:
|
|
72
|
+
print(" | ".join(v.ljust(widths[i]) for i, v in enumerate(r)))
|
|
@@ -0,0 +1,146 @@
|
|
|
1
|
+
"""Hive -> ClickHouse sync — one feature built on the async job subsystem.
|
|
2
|
+
|
|
3
|
+
Runs inside a detached job worker, so all output goes to the job log and never
|
|
4
|
+
into the agent's context. The data path is **Spark direct write**: a Spark
|
|
5
|
+
Connect job reads the Hive table and writes to ClickHouse over JDBC. The write
|
|
6
|
+
happens on the executors (in the cluster), so rows never pass through this
|
|
7
|
+
process or the agent.
|
|
8
|
+
|
|
9
|
+
Requirements (the "path A wiring"):
|
|
10
|
+
- clickhouse-jdbc on the Spark Connect server classpath (/opt/spark/jars/),
|
|
11
|
+
- network egress from the cluster to ClickHouse,
|
|
12
|
+
- a JDBC URL with credentials (`--ch-jdbc` / $SCQ_CH_JDBC),
|
|
13
|
+
- the target ClickHouse table already created with a suitable engine
|
|
14
|
+
(Spark `append` does not create a usable MergeTree table for you).
|
|
15
|
+
|
|
16
|
+
Modes:
|
|
17
|
+
single — one JDBC connection (numPartitions=1). Best for small tables.
|
|
18
|
+
parallel — N partitions write concurrently. Best for large tables.
|
|
19
|
+
auto — single under --auto-threshold rows, else parallel.
|
|
20
|
+
"""
|
|
21
|
+
from __future__ import annotations
|
|
22
|
+
|
|
23
|
+
import argparse
|
|
24
|
+
import os
|
|
25
|
+
import re
|
|
26
|
+
|
|
27
|
+
from .jobs import write_meta
|
|
28
|
+
from .session import DEFAULT_REMOTE, get_spark
|
|
29
|
+
|
|
30
|
+
# Spark/Hive -> ClickHouse type mapping. The SKILL carries the authoritative
|
|
31
|
+
# table the agent reasons with; this is just for the descriptive log line.
|
|
32
|
+
SPARK_TO_CH = {
|
|
33
|
+
"boolean": "Bool", "tinyint": "Int8", "smallint": "Int16", "int": "Int32",
|
|
34
|
+
"integer": "Int32", "bigint": "Int64", "float": "Float32", "double": "Float64",
|
|
35
|
+
"string": "String", "varchar": "String", "char": "String", "binary": "String",
|
|
36
|
+
"date": "Date32", "timestamp": "DateTime64(3)",
|
|
37
|
+
}
|
|
38
|
+
|
|
39
|
+
AUTO_THRESHOLD = int(os.environ.get("SCQ_AUTO_THRESHOLD", "1000000"))
|
|
40
|
+
DEFAULT_BATCHSIZE = int(os.environ.get("SCQ_BATCHSIZE", "100000"))
|
|
41
|
+
DEFAULT_NUM_PARTITIONS = int(os.environ.get("SCQ_NUM_PARTITIONS", "8"))
|
|
42
|
+
|
|
43
|
+
|
|
44
|
+
def map_type(spark_type: str) -> str:
|
|
45
|
+
t = spark_type.lower().strip()
|
|
46
|
+
if t.startswith("decimal"):
|
|
47
|
+
return t.replace("decimal", "Decimal")
|
|
48
|
+
base = re.split(r"[(<]", t, 1)[0]
|
|
49
|
+
return SPARK_TO_CH.get(base, "String")
|
|
50
|
+
|
|
51
|
+
|
|
52
|
+
def _parse(argv: list[str]):
|
|
53
|
+
p = argparse.ArgumentParser(prog="scq sync")
|
|
54
|
+
p.add_argument("source")
|
|
55
|
+
p.add_argument("--to", default="clickhouse")
|
|
56
|
+
p.add_argument("--remote", default=DEFAULT_REMOTE)
|
|
57
|
+
p.add_argument("--mode", choices=["auto", "parallel", "single"], default="auto")
|
|
58
|
+
p.add_argument("--ch-jdbc", default=os.environ.get("SCQ_CH_JDBC", ""))
|
|
59
|
+
p.add_argument("--target", default=None)
|
|
60
|
+
p.add_argument("--where", default=None)
|
|
61
|
+
p.add_argument("--limit", type=int, default=0)
|
|
62
|
+
p.add_argument("--batchsize", type=int, default=DEFAULT_BATCHSIZE)
|
|
63
|
+
p.add_argument("--num-partitions", type=int, default=DEFAULT_NUM_PARTITIONS)
|
|
64
|
+
p.add_argument("--auto-threshold", type=int, default=AUTO_THRESHOLD)
|
|
65
|
+
# Auto-create control: when the target table doesn't exist, Spark creates it.
|
|
66
|
+
# Without an explicit sort key ClickHouse defaults to ORDER BY tuple() (no
|
|
67
|
+
# primary index). --order-by injects a real key via createTableOptions.
|
|
68
|
+
p.add_argument("--order-by", default=None, help="ORDER BY key(s) for auto-created table, e.g. 'id' or 'id,dt'")
|
|
69
|
+
p.add_argument("--engine", default="MergeTree", help="Engine for auto-created table (default MergeTree)")
|
|
70
|
+
return p.parse_args(argv)
|
|
71
|
+
|
|
72
|
+
|
|
73
|
+
def run(argv: list[str], meta: dict) -> int:
|
|
74
|
+
a = _parse(argv)
|
|
75
|
+
if not a.ch_jdbc:
|
|
76
|
+
print("[scq] no --ch-jdbc / SCQ_CH_JDBC set — cannot write to ClickHouse",
|
|
77
|
+
flush=True)
|
|
78
|
+
return 2
|
|
79
|
+
|
|
80
|
+
print(f"[scq] sync start: {a.source} -> {a.to} mode={a.mode}", flush=True)
|
|
81
|
+
spark = get_spark(a.remote)
|
|
82
|
+
|
|
83
|
+
    # 1. discover schema (informational: it only feeds the type-mapping log line below)
|
|
84
|
+
desc = spark.sql(f"DESCRIBE TABLE {a.source}").collect()
|
|
85
|
+
cols = [(r[0], r[1]) for r in desc if r[0] and not r[0].startswith("#")]
|
|
86
|
+
print(f"[scq] {len(cols)} columns: "
|
|
87
|
+
+ ", ".join(f"{c}:{t}->{map_type(t)}" for c, t in cols), flush=True)
|
|
88
|
+
|
|
89
|
+
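    # Count exactly the rows the read in step 2 selects, so the row counts that
    # are logged and stored in meta stay correct when --where / --limit are used.
    counted = f"SELECT * FROM {a.source}" + (f" WHERE {a.where}" if a.where else "") + (f" LIMIT {a.limit}" if a.limit else "")
    src_count = spark.sql(f"SELECT count(*) c FROM ({counted}) s").collect()[0]["c"]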
|
|
90
|
+
# Target may be `db.table` (explicit database) or a bare table name (lands in
|
|
91
|
+
# the JDBC connection's default database). Keep the landing spot explicit.
|
|
92
|
+
target = a.target or a.source.split(".")[-1]
|
|
93
|
+
qualified = "." in target
|
|
94
|
+
meta["source_rows"] = src_count
|
|
95
|
+
meta["target"] = target
|
|
96
|
+
write_meta(meta["id"], meta)
|
|
97
|
+
hint = "" if qualified else " (connection default database; pass --target db.table to choose one)"
|
|
98
|
+
print(f"[scq] source rows: {src_count} -> target {target}{hint}", flush=True)
|
|
99
|
+
|
|
100
|
+
# 2. build the read
|
|
101
|
+
sel = f"SELECT * FROM {a.source}"
|
|
102
|
+
if a.where:
|
|
103
|
+
sel += f" WHERE {a.where}"
|
|
104
|
+
if a.limit:
|
|
105
|
+
sel += f" LIMIT {a.limit}"
|
|
106
|
+
df = spark.sql(sel)
|
|
107
|
+
|
|
108
|
+
# 3. choose write parallelism
|
|
109
|
+
mode = a.mode
|
|
110
|
+
if mode == "auto":
|
|
111
|
+
mode = "parallel" if src_count >= a.auto_threshold else "single"
|
|
112
|
+
num_partitions = 1 if mode == "single" else max(1, a.num_partitions)
|
|
113
|
+
if num_partitions == 1:
|
|
114
|
+
df = df.coalesce(1)
|
|
115
|
+
else:
|
|
116
|
+
df = df.repartition(num_partitions)
|
|
117
|
+
print(f"[scq] writing via JDBC: mode={mode} numPartitions={num_partitions} "
|
|
118
|
+
f"batchsize={a.batchsize}", flush=True)
|
|
119
|
+
|
|
120
|
+
# 4. Spark direct write to ClickHouse. Rows are written by the executors;
|
|
121
|
+
# nothing flows through this process.
|
|
122
|
+
writer = (df.write.format("jdbc")
|
|
123
|
+
.option("url", a.ch_jdbc)
|
|
124
|
+
.option("dbtable", target)
|
|
125
|
+
.option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
|
|
126
|
+
.option("batchsize", a.batchsize)
|
|
127
|
+
.option("isolationLevel", "NONE")) # ClickHouse has no txns
|
|
128
|
+
# createTableOptions only affects auto-create (when the table is missing); it
|
|
129
|
+
# is ignored when the table already exists.
|
|
130
|
+
if a.order_by:
|
|
131
|
+
writer = writer.option("createTableOptions",
|
|
132
|
+
f"ENGINE = {a.engine} ORDER BY ({a.order_by})")
|
|
133
|
+
print(f"[scq] auto-create (if needed): ENGINE = {a.engine} ORDER BY ({a.order_by})", flush=True)
|
|
134
|
+
else:
|
|
135
|
+
print("[scq] note: an auto-created target uses ORDER BY tuple() (no sort key) — "
|
|
136
|
+
"pass --order-by for a real key, or pre-create the table", flush=True)
|
|
137
|
+
try:
|
|
138
|
+
writer.mode("append").save()
|
|
139
|
+
except Exception as e: # noqa: BLE001
|
|
140
|
+
print(f"[scq] JDBC write failed: {e}", flush=True)
|
|
141
|
+
return 1
|
|
142
|
+
|
|
143
|
+
print(f"[scq] done: wrote {src_count} rows to {target}", flush=True)
|
|
144
|
+
meta["written_rows"] = src_count
|
|
145
|
+
write_meta(meta["id"], meta)
|
|
146
|
+
return 0
|
|
@@ -0,0 +1,156 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: spark-connect-cli
|
|
3
|
+
Version: 0.2.0
|
|
4
|
+
Summary: Agent-friendly Spark Connect CLI: read-only querying + async long-job control. No JVM, no Kerberos on the client.
|
|
5
|
+
Project-URL: Homepage, https://github.com/dengshu2/spark-connect-cli
|
|
6
|
+
Project-URL: Issues, https://github.com/dengshu2/spark-connect-cli/issues
|
|
7
|
+
Author: dengshu
|
|
8
|
+
License: MIT
|
|
9
|
+
License-File: LICENSE
|
|
10
|
+
Keywords: agent,cli,clickhouse,hive,llm,spark,spark-connect
|
|
11
|
+
Classifier: Environment :: Console
|
|
12
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
13
|
+
Classifier: Programming Language :: Python :: 3
|
|
14
|
+
Classifier: Topic :: Database :: Front-Ends
|
|
15
|
+
Requires-Python: >=3.9
|
|
16
|
+
Requires-Dist: pyspark[connect]<4,>=3.5
|
|
17
|
+
Provides-Extra: dev
|
|
18
|
+
Requires-Dist: pytest>=7; extra == 'dev'
|
|
19
|
+
Description-Content-Type: text/markdown
|
|
20
|
+
|
|
21
|
+
# spark-connect-cli (`scq`)
|
|
22
|
+
|
|
23
|
+
An agent-friendly [Spark Connect](https://spark.apache.org/spark-connect/) CLI —
|
|
24
|
+
**read-only querying** plus **async control for long-running jobs**.
|
|
25
|
+
|
|
26
|
+
Built for LLM agents and humans who live in a shell. Unlike `spark-sql` /
|
|
27
|
+
`spark-submit`, the client is a thin **pure-Python gRPC client**: no JVM, and
|
|
28
|
+
**no Kerberos on the client side** — the Spark Connect server authenticates with
|
|
29
|
+
its own keytab, so you just point at `sc://host:15002` and go.
|
|
30
|
+
|
|
31
|
+
## Why
|
|
32
|
+
|
|
33
|
+
- **JSON-first, read-only by default.** Safe for an agent to call for
|
|
34
|
+
exploration; writes/DDL are blocked unless you opt in (`--allow-ddl`).
|
|
35
|
+
- **Long jobs don't block you.** A multi-minute Spark job shouldn't trap an agent
|
|
36
|
+
in a 30-minute tool call. `scq` submits the job, hands back a durable **job
|
|
37
|
+
id**, and returns immediately. Poll it whenever you like; the handle survives a
|
|
38
|
+
client/container restart because it lives in an on-disk registry.
|
|
39
|
+
- **Stable exit codes** so a caller can branch without scraping text.
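
As a rough illustration of branching on exit codes, a caller might wrap `scq` as sketched below. The code-to-meaning mapping follows the exit-code table in this README; the helper names `classify` and `run_scq` are hypothetical and not part of the package.

```python
import subprocess

# Exit codes as documented in this README.
EXIT_MEANING = {
    0: "success",
    1: "query error",
    2: "connection error",
    3: "blocked by the read-only guard",
    4: "job-control error",
}


def classify(returncode: int) -> str:
    """Map an scq exit code to its documented meaning."""
    return EXIT_MEANING.get(returncode, f"unexpected exit code {returncode}")


def run_scq(*args: str):
    """Run scq (assumed to be on PATH) and return (meaning, stdout, stderr)."""
    proc = subprocess.run(["scq", *args], capture_output=True, text=True)
    return classify(proc.returncode), proc.stdout, proc.stderr
```

A caller could then retry on `"connection error"` but stop on `"blocked by the read-only guard"`, without parsing any message text.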
|
|
40
|
+
|
|
41
|
+
## Install
|
|
42
|
+
|
|
43
|
+
```bash
|
|
44
|
+
pip install spark-connect-cli # once published
|
|
45
|
+
# or, from source:
|
|
46
|
+
pip install -e .
|
|
47
|
+
```
|
|
48
|
+
|
|
49
|
+
## Quick start
|
|
50
|
+
|
|
51
|
+
```bash
|
|
52
|
+
export SPARK_REMOTE=sc://localhost:15002 # your Spark Connect endpoint
|
|
53
|
+
|
|
54
|
+
scq databases
|
|
55
|
+
scq tables mydb --like '%orders%'
|
|
56
|
+
scq describe mydb.orders
|
|
57
|
+
scq query "SELECT id, name FROM mydb.orders LIMIT 10"
|
|
58
|
+
```
|
|
59
|
+
|
|
60
|
+
Output is **JSONEachRow** (one JSON object per line) by default; pick another with
|
|
61
|
+
`--format json|csv|tsv|table`.
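
To make the difference between the two JSON formats concrete, here is a small sketch modelled on `emit_rows` in `session.py`. The function name `render` and the sample rows are illustrative only; it returns the text instead of printing it.

```python
import json


def render(rows, columns, fmt="jsoneachrow"):
    """Return the text for a small result set in one of the two JSON formats."""
    if fmt == "jsoneachrow":
        # One self-contained JSON object per line, keyed by column name.
        return "\n".join(
            json.dumps(dict(zip(columns, r)), default=str, ensure_ascii=False)
            for r in rows
        )
    if fmt == "json":
        # One document: column names, row arrays, and a row count.
        return json.dumps(
            {"meta": list(columns), "data": [list(r) for r in rows], "rows": len(rows)},
            default=str, ensure_ascii=False,
        )
    raise ValueError(f"format not covered by this sketch: {fmt}")


rows = [(1, "alice"), (2, None)]
print(render(rows, ["id", "name"]))
print(render(rows, ["id", "name"], "json"))
```

JSONEachRow can be consumed line by line, while `json` needs the whole document before it can be parsed.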
|
|
62
|
+
|
|
63
|
+
### Read-only guard
|
|
64
|
+
|
|
65
|
+
`scq query` allows only statements that start with `SELECT`, `SHOW`, `DESCRIBE`/`DESC`, `EXPLAIN`, or `WITH`. Anything else exits
|
|
66
|
+
with code **3** unless you pass `--allow-ddl`.
|
|
67
|
+
|
|
68
|
+
| exit | meaning |
|
|
69
|
+
|------|---------|
|
|
70
|
+
| 0 | success |
|
|
71
|
+
| 1 | query error (bad SQL) |
|
|
72
|
+
| 2 | connection error |
|
|
73
|
+
| 3 | blocked by the read-only guard |
|
|
74
|
+
| 4 | job-control error (no such job, …) |
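
The guard is a leading-keyword check. The snippet below reproduces `is_read_only` from `session.py` and runs it on a few statements; the sample statements are made up for illustration.

```python
READ_ONLY_LEADERS = ("select", "show", "describe", "desc", "explain", "with")


def is_read_only(sql: str) -> bool:
    """True if the first keyword (after optional opening parentheses) is read-only."""
    leader = sql.strip().lstrip("(").split(None, 1)
    return bool(leader) and leader[0].lower() in READ_ONLY_LEADERS


for stmt in (
    "SELECT 1",
    "(select 1)",
    "DROP TABLE t",
    "WITH x AS (SELECT 1) INSERT INTO t SELECT * FROM x",
):
    print(is_read_only(stmt), stmt)
```

Because only the first keyword is inspected, a statement that opens with `WITH` passes the check even when it goes on to write, as the last sample shows. Treat the guard as a safety net, not as an access control.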
|
|
75
|
+
|
|
76
|
+
## Async jobs (Layer A)
|
|
77
|
+
|
|
78
|
+
Long work runs detached and is tracked by a file-based registry under
|
|
79
|
+
`$SCQ_JOBS_DIR` (default `~/.spark-connect-cli/jobs`).
|
|
80
|
+
|
|
81
|
+
```bash
|
|
82
|
+
# submit — returns a job id immediately, does NOT block
|
|
83
|
+
scq sync ods.orders --to clickhouse
|
|
84
|
+
# {"job_id": "j-20260625-...", "state": "running", "message": "... poll with ..."}
|
|
85
|
+
|
|
86
|
+
scq jobs list # all jobs + state
|
|
87
|
+
scq jobs status j-20260625-... # full status (rows, timings, pid, exit code)
|
|
88
|
+
scq jobs logs j-20260625-... --tail 40
|
|
89
|
+
scq jobs cancel j-20260625-... # kills the whole process group
|
|
90
|
+
```
|
|
91
|
+
|
|
92
|
+
Design: each job is a directory with `meta.json` (state machine:
|
|
93
|
+
`submitted → running → succeeded|failed|cancelled`) and `out.log`. The worker
|
|
94
|
+
runs in its **own process group**, so cancel kills the entire tree (no orphans).
|
|
95
|
+
A `running` job whose process has vanished is reconciled to `failed` on the next
|
|
96
|
+
status read, so status never lies.
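
The process-group design can be sketched with the standard library alone. This is not the package's `jobs.py`: it is a minimal POSIX-only illustration, and the names `spawn_detached`, `is_alive`, `cancel`, and `reconcile` are assumptions made for the example.

```python
import os
import signal
import subprocess


def spawn_detached(argv):
    """Start a worker in its own session, and therefore its own process group."""
    return subprocess.Popen(
        argv,
        start_new_session=True,  # the child calls setsid() before exec
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )


def is_alive(pid: int) -> bool:
    """Probe a pid with signal 0, which checks existence without sending anything."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def cancel(pid: int) -> None:
    """Terminate the worker's whole process group, not just the leader."""
    os.killpg(os.getpgid(pid), signal.SIGTERM)


def reconcile(meta: dict) -> dict:
    """Mark a 'running' job as failed when its process no longer exists."""
    if meta.get("state") == "running" and not is_alive(meta["pid"]):
        return {**meta, "state": "failed"}
    return meta
```

In this sketch, `reconcile` would run on every status read, which is one way to get the "status never lies" behaviour described above.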
|
|
97
|
+
|
|
98
|
+
## Hive → ClickHouse sync
|
|
99
|
+
|
|
100
|
+
`scq sync` is one job kind built on the async subsystem. It uses **Spark direct
|
|
101
|
+
write**: a Spark Connect job reads the Hive table and writes to ClickHouse over
|
|
102
|
+
JDBC. The write runs on the executors, so rows never pass through this process or
|
|
103
|
+
the agent.
|
|
104
|
+
|
|
105
|
+
Modes control write parallelism — `single` (one connection, small tables),
|
|
106
|
+
`parallel` (N partitions, large tables), `auto` (picks by row count).
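
The mode selection can be summarised in a few lines. The sketch below follows the logic in `sync.py`, with its defaults of 1,000,000 rows for the auto threshold and 8 partitions; the function name `choose_partitions` is made up for this example.

```python
def choose_partitions(mode: str, source_rows: int,
                      auto_threshold: int = 1_000_000,
                      num_partitions: int = 8):
    """Resolve 'auto' to a concrete mode and return (mode, partition count)."""
    if mode == "auto":
        mode = "parallel" if source_rows >= auto_threshold else "single"
    # 'single' always writes through one connection; 'parallel' uses at least one.
    return mode, (1 if mode == "single" else max(1, num_partitions))


print(choose_partitions("auto", 50_000))
print(choose_partitions("auto", 5_000_000))
```

Each partition opens its own JDBC connection, so the partition count is also the number of concurrent writers ClickHouse sees.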
|
|
107
|
+
|
|
108
|
+
Requires:
|
|
109
|
+
- `clickhouse-jdbc` on the Spark Connect server classpath (`/opt/spark/jars/`),
|
|
110
|
+
- cluster→ClickHouse network egress,
|
|
111
|
+
- a JDBC URL with credentials via `--ch-jdbc` / `$SCQ_CH_JDBC`,
|
|
112
|
+
- the **target ClickHouse table created beforehand** with a suitable engine
|
|
113
|
+
(Spark `append` won't build a usable MergeTree table for you — create it first,
|
|
114
|
+
e.g. with the `chsql` skill).
|
|
115
|
+
|
|
116
|
+
## Introspection
|
|
117
|
+
|
|
118
|
+
```bash
|
|
119
|
+
scq meta db.table # one JSON: schema, created time, location,
|
|
120
|
+
# partitions, file count/size, mtime range
|
|
121
|
+
scq meta db.table --count # also run an exact count(*)
|
|
122
|
+
|
|
123
|
+
scq exec stages?status=active # read-only Spark REST passthrough
|
|
124
|
+
scq exec executors
|
|
125
|
+
scq exec stages/<id>/<attempt>/taskSummary?quantiles=0.5,0.95,1.0 # skew: max/median
|
|
126
|
+
```
|
|
127
|
+
|
|
128
|
+
`scq exec` auto-discovers the running Spark app via the YARN ResourceManager and
|
|
129
|
+
proxies its monitoring REST API (GET-only). Set the RM base with `$SCQ_YARN_RM`.
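
The discovery and URL composition can be shown without touching the network. The sketch below mirrors the logic in `rest.py`; the function names, the sample application ids, and the `rm.example` host are placeholders.

```python
from __future__ import annotations


def pick_app(apps_json: dict) -> str | None:
    """Pick an application id from a YARN RM /ws/v1/cluster/apps response."""
    apps = ((apps_json.get("apps") or {}).get("app")) or []
    for app in apps:  # prefer an application whose name mentions "connect"
        if "connect" in (app.get("name") or "").lower():
            return app["id"]
    return apps[0]["id"] if apps else None


def proxy_url(rm: str, app_id: str, path: str = "") -> str:
    """Build the Spark REST URL behind the RM web proxy."""
    base = f"{rm.rstrip('/')}/proxy/{app_id}/api/v1/applications/{app_id}"
    path = path.lstrip("/")
    return f"{base}/{path}" if path else base


sample = {"apps": {"app": [
    {"id": "application_1_0001", "name": "nightly-etl"},
    {"id": "application_1_0002", "name": "Spark Connect Server"},
]}}
print(proxy_url("http://rm.example:8088/", pick_app(sample), "stages?status=active"))
```

An empty path returns the application summary itself, which is a quick way to check that the proxy route works.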
|
|
130
|
+
|
|
131
|
+
**Reading `scq exec executors`** — the `maxMemory` field is Spark's
|
|
132
|
+
**storage/cache pool** (`(heap − 300 MB reserved) × 0.6`), *not* the executor's
|
|
133
|
+
total memory: a 512 MB executor reports ~93 MB, a 1536 MB driver ~741 MB. The
|
|
134
|
+
real heap is `spark.executor.memory` (+ off-heap overhead). The `driver` row has
|
|
135
|
+
0 cores and runs no tasks. With dynamic allocation, idle executors are released —
|
|
136
|
+
so the list may show only the driver when nothing is running.
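
The pool formula quoted above can be checked with a few lines of arithmetic. This sketch assumes the default `spark.memory.fraction` of 0.6 and takes the heap size as an input. The heap a JVM actually reports can be lower than the configured value, so real `maxMemory` readings may come out below what the configured figure would suggest.

```python
RESERVED_MB = 300      # memory Spark sets aside before sizing the pool
MEMORY_FRACTION = 0.6  # default spark.memory.fraction (assumed unchanged)


def unified_pool_mb(jvm_heap_mb: float) -> float:
    """Approximate the pool size for a given JVM heap, in MB."""
    return max(0.0, (jvm_heap_mb - RESERVED_MB) * MEMORY_FRACTION)


print(round(unified_pool_mb(1536), 1))  # 741.6
```

The 1536 MB case matches the ~741 MB driver figure quoted above.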
|
|
137
|
+
|
|
138
|
+
## Configuration
|
|
139
|
+
|
|
140
|
+
| env | default | meaning |
|
|
141
|
+
|-----|---------|---------|
|
|
142
|
+
| `SPARK_REMOTE` | `sc://localhost:15002` | Spark Connect endpoint |
|
|
143
|
+
| `SCQ_JOBS_DIR` | `~/.spark-connect-cli/jobs` | job registry (put on a persistent volume) |
|
|
144
|
+
| `SCQ_MAX_ROWS` | `10000` | default row cap for `query` |
|
|
145
|
+
| `SCQ_CH_JDBC` | — | ClickHouse JDBC URL for `sync` path A |
|
|
146
|
+
| `SCQ_YARN_RM` | `http://namenode.hive-net:8088` | YARN RM base for `scq exec` |
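
The `SCQ_MAX_ROWS` cap is enforced in `query.py` by fetching one row more than the cap, which shows whether the result was truncated without a separate count. A sketch of that technique, using a plain list instead of a DataFrame (`cap_rows` and `fetch` are illustrative names):

```python
def cap_rows(fetch, max_rows: int):
    """Return (rows, truncated). fetch(n) yields up to n rows, or all rows if n is None."""
    if max_rows <= 0:
        return fetch(None), False      # a non-positive cap disables the limit
    rows = fetch(max_rows + 1)         # one extra row reveals whether more exist
    truncated = len(rows) > max_rows
    return rows[:max_rows], truncated


data = list(range(5))


def fetch(n):
    return data if n is None else data[:n]


print(cap_rows(fetch, 3))
print(cap_rows(fetch, 5))
```

When `truncated` is true, `scq query` prints a JSON warning on stderr suggesting a `LIMIT`, extra filters, or a higher `--max-rows`.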
|
|
147
|
+
|
|
148
|
+
## Use with an LLM agent
|
|
149
|
+
|
|
150
|
+
`SKILL.md` ships a ready-made skill (discover-before-query workflow, async-job
|
|
151
|
+
etiquette, type-mapping table). Drop it into your agent's skills directory and
|
|
152
|
+
the agent drives `scq` through a shell/Bash tool.
|
|
153
|
+
|
|
154
|
+
## License
|
|
155
|
+
|
|
156
|
+
MIT
|
|
@@ -0,0 +1,15 @@
|
|
|
1
|
+
spark_connect_cli/__init__.py,sha256=d_xkc721_-_mG3JGkeXU7BspJ1x_-yufyS9D6dLgR-k,87
|
|
2
|
+
spark_connect_cli/__main__.py,sha256=MSmt_5Xg84uHqzTN38JwgseJK8rsJn_11A8WD99VtEo,61
|
|
3
|
+
spark_connect_cli/cli.py,sha256=8VLGgQIVgTr0OccAT3uLnBpFCsO0i8IYTq1fScYn-9U,6379
|
|
4
|
+
spark_connect_cli/jobs.py,sha256=t55-tBFXd5jNMhiSWL7StrTlgCGPjU_XsYGr5qew7gY,7233
|
|
5
|
+
spark_connect_cli/meta.py,sha256=u1-RaKxP-Y7wbKXppeSuh2d9UCRk1HZ1U-GaFOJ4M9g,3473
|
|
6
|
+
spark_connect_cli/query.py,sha256=Ej0M0yW6RAdpShhEudTGTOJhL2ezSzwovFDPgRhCWH8,2103
|
|
7
|
+
spark_connect_cli/rest.py,sha256=JueQlHgTFhNvl8KCTum-1LGVxJWDs441NVDsU1GjBOE,2021
|
|
8
|
+
spark_connect_cli/session.py,sha256=xSXsSUZTwERj0uGKDyuQQUNveH6ox9LqsAurNiD7AyY,2756
|
|
9
|
+
spark_connect_cli/sync.py,sha256=fNCfqbYO6ClCfBo0Y_CDIfnCbGuEX26bNilKMqieM-4,6554
|
|
10
|
+
spark_connect_cli/SKILL.md,sha256=Rm_9aT_4G24iu2gM8MR6YHp4G_mLLbKMLPANU3-9qSA,6450
|
|
11
|
+
spark_connect_cli-0.2.0.dist-info/METADATA,sha256=RuIoJAQxqZl1Y21Msioj1JrqpTD_kLlpyvSMnbM1BdI,6081
|
|
12
|
+
spark_connect_cli-0.2.0.dist-info/WHEEL,sha256=mffPy8wBnZQn2VnJUU5jE99KsxaSfiyMHV9Yt0aLVxs,87
|
|
13
|
+
spark_connect_cli-0.2.0.dist-info/entry_points.txt,sha256=xweUT4medkn5exc6YvKPNBWlOV9qWBz5hxjZsoRdLK4,98
|
|
14
|
+
spark_connect_cli-0.2.0.dist-info/licenses/LICENSE,sha256=CQKNwxelRbTkzP7qdNam0XmENhKgOC6PgWd3ljvRiVM,1064
|
|
15
|
+
spark_connect_cli-0.2.0.dist-info/RECORD,,
|
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2026 dengshu
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|