spark-connect-cli 0.2.0__py3-none-any.whl

@@ -0,0 +1,151 @@
1
+ ---
2
+ name: spark-connect-cli
3
+ description: >-
4
+ Query Spark / Hive from the shell with the `scq` CLI over Spark Connect, and
5
+ run long Spark jobs (e.g. Hive->ClickHouse syncs) without blocking. Use
6
+ whenever the user wants to read Hive/Spark data, explore databases/tables/
7
+ schema, run a Spark SQL analysis, or sync a Hive table somewhere. Triggers:
8
+ Hive, Spark, Spark SQL, 查 Hive (query Hive), 跑个 Spark SQL (run a Spark SQL), 看下这个表 (take a look at this table), 同步到 ClickHouse (sync to ClickHouse),
9
+ sync table.
10
+ ---
11
+
12
+ # scq — Spark Connect from the shell
13
+
14
+ `scq` queries a Spark Connect server (JSON-first, **read-only by default**) and
15
+ manages **async long jobs** so you never sit in a blocking tool call.
16
+
17
+ ## Discover before you query
18
+
19
+ Don't guess names. Discover them first:
20
+
21
+ 1. `scq databases` — list databases.
22
+ 2. `scq tables [DB] --like '%keyword%'` — list tables.
23
+ 3. `scq describe <db.table>` — list columns (name, type, comment).
24
+ 4. `scq query "SELECT ..."` — run it once you know the schema.
25
+
26
+ ## Reading output
27
+
28
+ - **stdout** carries data. Default is **JSONEachRow** (NDJSON — one JSON object
29
+ per line). Other formats: `--format json|csv|tsv|table`.
30
+ - **stderr** carries errors as one JSON object `{"error": ..., "code": ...}`.
31
+ - `query` caps at `SCQ_MAX_ROWS` (default 10k); a `{"warning": ...}` on stderr
32
+ means it was truncated — add `LIMIT`/filters or raise `--max-rows`.
33
+
34
+ ## Branch on the exit code
35
+
36
+ `0` ok · `1` query error (fix the SQL) · `2` connection error (check
37
+ `$SPARK_REMOTE`) · `3` read-only guard blocked it · `4` job-control error.
38
+
39
+ ## Read-only by default
40
+
41
+ `scq query` allows only SELECT/SHOW/DESCRIBE/EXPLAIN/WITH. Writes and DDL exit
42
+ with code `3` unless you pass `--allow-ddl`. **Only** add `--allow-ddl` when the
43
+ user explicitly asked to modify data or schema.
44
+
45
+ ## Long jobs — submit, then poll. NEVER block.
46
+
47
+ A full-table sync is a multi-minute Spark job. **Do not** run it in the
48
+ foreground and wait. Submit it, tell the user the job id, and hand control back:
49
+
50
+ ```bash
51
+ scq sync ods.orders --to clickhouse # prints {"job_id": "...", "state": "running"}
52
+ ```
53
+
54
+ Then, *only when the user asks* "how's it going" / after a natural pause:
55
+
56
+ ```bash
57
+ scq jobs status j-20260625-... # state, source_rows, written_rows, exit_code
58
+ scq jobs logs j-20260625-... --tail 40 # recent progress
59
+ scq jobs cancel j-20260625-... # stop it
60
+ ```
61
+
62
+ Etiquette:
63
+ - After submitting, reply with the job id and a one-line "I'll check when you
64
+ want." Don't loop on `status` in a tight wait — let the user drive, or poll on
65
+ a relaxed cadence.
66
+ - Report terminal state plainly: `succeeded` with `written_rows`, or `failed`
67
+ with the tail of the log.
68
+
69
+ ## Hive → ClickHouse sync workflow
70
+
71
+ When the user says "同步 X 表到 ClickHouse" ("sync table X to ClickHouse"):
72
+
73
+ 1. `scq describe <src>` — get the Hive schema.
74
+ 2. Decide the **target database and table**. `--target` takes `db.table`:
75
+ - If the user names a database (e.g. `class_db`), pass it **qualified**:
76
+ `--target class_db.class`. A **bare table name lands in the connection's
77
+ default database** (`default`) — don't let data silently go there.
78
+ - The **database must already exist** (auto-create makes the table, not the
79
+ database). Ensure it first: `chsql query --allow-ddl "CREATE DATABASE IF NOT
80
+ EXISTS class_db"`.
81
+ 3. Make sure the target table is good:
82
+ - For a quick/one-off sync, let `scq sync` auto-create it — but pass
83
+ `--order-by <key>` so it gets a real sort key (otherwise it is created with
84
+ `ORDER BY tuple()`, no primary index).
85
+ - For a production table, **pre-create it** with `chsql query --allow-ddl
86
+ "CREATE TABLE class_db.class (...) ENGINE = MergeTree ORDER BY (...)"` (full
87
+ control over engine, keys, partitioning), then sync.
88
+ 4. Submit: `scq sync <src> --target db.table [--order-by key] [--where ...]`.
89
+ 5. Hand back the job id. Verify with row counts when it finishes.
90
+
91
+ The ClickHouse JDBC connection (`$SCQ_CH_JDBC`) is preconfigured — you do **not**
92
+ pass credentials; just choose the `db.table` with `--target`.
93
+
94
+ ### Spark/Hive → ClickHouse type mapping
95
+
96
+ | Spark/Hive | ClickHouse |
97
+ |------------|------------|
98
+ | boolean | Bool |
99
+ | tinyint / smallint / int / bigint | Int8 / Int16 / Int32 / Int64 |
100
+ | float / double | Float32 / Float64 |
101
+ | decimal(p,s) | Decimal(p,s) |
102
+ | string / varchar / char / binary | String |
103
+ | date | Date32 |
104
+ | timestamp | DateTime64(3) |
105
+
106
+ Nullable columns map to `Nullable(T)`. Nested/complex types default to `String`
107
+ (JSON) — confirm with the user before relying on them.
108
+
109
+ ## Metadata & execution introspection
110
+
111
+ Two general primitives — don't hand-stitch many queries.
112
+
113
+ **Table metadata → `scq meta db.table`** — one JSON: schema (+ ClickHouse type
114
+ mapping), created time, owner, format, HDFS location, partition columns +
115
+ partition list/count, file count/total size, and min/max file modification time
116
+ (i.e. "when did the data arrive"). Add `--count` for an exact row count (runs a
117
+ `count(*)`, so only when asked). For ad-hoc bits you can still use
118
+ `scq query "DESCRIBE EXTENDED t"` / `"SHOW PARTITIONS t"`.
119
+
120
+ **Execution metadata → `scq exec <path>`** — read-only passthrough to the Spark
121
+ REST API (auto-discovers the app, GET-only). The model reads the JSON, so any
122
+ runtime question is the same command with a different path:
123
+
124
+ ```bash
125
+ scq exec 'stages?status=active' # what's running now (quoted so the shell does not glob the ?)
126
+ scq exec sql # each query's plan + metrics
127
+ scq exec executors # cores / memory / GC / shuffle
128
+ scq exec jobs
129
+ scq exec 'stages/<id>/<attempt>/taskSummary?quantiles=0.5,0.95,1.0'
130
+ ```
131
+
132
+ - **Data skew**: pull a stage's `taskSummary` and compare a metric's **max vs
133
+ median** (`executorRunTime`, `shuffleReadMetrics.readBytes`, `shuffleReadMetrics.readRecords`). A large
134
+ `max/median` ratio = a straggler / skewed partition. `…?details=true` on a
135
+ stage lists every task to find the hot one.
136
+ - Stage/job lists can be long — filter (`?status=active`) or fetch one id.
137
+ - For the *plan before running*, use `scq query "EXPLAIN FORMATTED SELECT ..."`.
138
+
139
+ ## Connection
140
+
141
+ `scq --remote sc://host:15002 ...` or set `$SPARK_REMOTE`. No Kerberos or JVM is
142
+ needed on this side — the Spark Connect server does the auth.
143
+
144
+ ## Recipes
145
+
146
+ ```bash
147
+ scq --format table tables analytics --like '%event%'
148
+ scq --format table query "SELECT count(*) FROM analytics.events"
149
+ scq query --max-rows 0 "SELECT * FROM small_dim" # no cap
150
+ scq sync analytics.events --to clickhouse --where "dt='2026-06-25'"
151
+ ```
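The output and exit-code conventions in the skill file above lend themselves to a small wrapper. The following is a minimal sketch, assuming the caller has already captured the return code, stdout and stderr of one `scq` invocation in the default JSONEachRow format. `interpret_scq_result` and the wording in `ACTIONS` are hypothetical names for illustration and are not part of the package.

```python
import json

# Exit codes as listed in the skill text; the action strings are illustrative.
ACTIONS = {
    1: "query error: fix the SQL",
    2: "connection error: check $SPARK_REMOTE",
    3: "blocked by the read-only guard",
    4: "job-control error",
}


def interpret_scq_result(returncode, stdout, stderr):
    """Turn one captured scq run (JSONEachRow on stdout) into a plain dict."""
    if returncode != 0:
        # On failure stderr carries one JSON object: {"error": ..., "code": ...}
        try:
            detail = json.loads(stderr.strip().splitlines()[-1])
        except (ValueError, IndexError):
            detail = None
        if not isinstance(detail, dict):
            detail = {"error": stderr.strip(), "code": returncode}
        return {"ok": False,
                "action": ACTIONS.get(returncode, "unknown"),
                "error": detail.get("error")}

    # One JSON object per non-empty stdout line.
    rows = [json.loads(line) for line in stdout.splitlines() if line.strip()]

    # A {"warning": ...} object on stderr signals a truncated result.
    truncated = False
    for line in stderr.splitlines():
        try:
            obj = json.loads(line)
        except ValueError:
            continue
        if isinstance(obj, dict) and "warning" in obj:
            truncated = True
    return {"ok": True, "rows": rows, "truncated": truncated}
```

Keeping the parsing separate from process spawning means the same function can be exercised with canned strings, without a Spark Connect server.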
@@ -0,0 +1,2 @@
1
+ """spark-connect-cli — an agent-friendly Spark Connect CLI."""
2
+ __version__ = "0.2.0"
@@ -0,0 +1,4 @@
1
+ from .cli import main
2
+
3
+ if __name__ == "__main__":
4
+ main()
@@ -0,0 +1,138 @@
1
+ """Argument parsing and dispatch for `scq` / `spark-connect-cli`."""
2
+ from __future__ import annotations
3
+
4
+ import json
5
+ import os
6
+ import sys
7
+
8
+ from . import jobs, meta, query, rest
9
+ from .session import DEFAULT_REMOTE
10
+
11
+
12
+ def cmd_sync(args) -> None:
13
+ # The JDBC URL carries the ClickHouse password, so it must NOT land in argv
14
+ # (argv is persisted in the job registry's meta.json). Pass it to the worker
15
+ # through the environment instead — submit() copies os.environ to the child.
16
+ if args.ch_jdbc:
17
+ os.environ["SCQ_CH_JDBC"] = args.ch_jdbc
18
+ argv = [args.source, "--to", args.to, "--remote", args.remote, "--mode", args.mode]
19
+ if args.target:
20
+ argv += ["--target", args.target]
21
+ if args.where:
22
+ argv += ["--where", args.where]
23
+ if args.limit:
24
+ argv += ["--limit", str(args.limit)]
25
+ if args.batchsize:
26
+ argv += ["--batchsize", str(args.batchsize)]
27
+ if args.num_partitions:
28
+ argv += ["--num-partitions", str(args.num_partitions)]
29
+ if args.order_by:
30
+ argv += ["--order-by", args.order_by]
31
+ if args.engine:
32
+ argv += ["--engine", args.engine]
33
+ job_id = jobs.submit("sync", argv,
34
+ {"source": args.source, "target": args.target or "", "to": args.to})
35
+ print(json.dumps({
36
+ "job_id": job_id, "state": "running",
37
+ "message": f"sync of {args.source} -> {args.to} submitted; "
38
+ f"poll with `scq jobs status {job_id}`",
39
+ }))
40
+
41
+
42
+ def cmd_skill_install(args) -> None:
43
+ """Write the bundled SKILL.md into an agent skills directory (mirrors
44
+ `chsql skill install`)."""
45
+ import importlib.resources as ir
46
+ from pathlib import Path
47
+ root = Path(args.dir or os.environ.get("SKILLS_DIR")
48
+ or (Path.home() / ".agents" / "skills"))
49
+ dest = root / "spark-connect-cli"
50
+ dest.mkdir(parents=True, exist_ok=True)
51
+ content = ir.files("spark_connect_cli").joinpath("SKILL.md").read_text()
52
+ (dest / "SKILL.md").write_text(content)
53
+ print(json.dumps({"installed": str(dest / "SKILL.md")}))
54
+
55
+
56
+ def build_parser():
57
+ import argparse
58
+ ap = argparse.ArgumentParser(
59
+ prog="scq",
60
+ description="Agent-friendly Spark Connect CLI: read-only querying + "
61
+ "async long-job control. No JVM, no Kerberos on the client.")
62
+ ap.add_argument("--remote", default=DEFAULT_REMOTE,
63
+ help=f"Spark Connect endpoint (default {DEFAULT_REMOTE} / $SPARK_REMOTE)")
64
+ ap.add_argument("--format", default="jsoneachrow",
65
+ choices=["jsoneachrow", "json", "csv", "tsv", "table"])
66
+ sub = ap.add_subparsers(dest="cmd", required=True)
67
+
68
+ sub.add_parser("databases", help="List databases").set_defaults(func=query.cmd_databases)
69
+
70
+ pt = sub.add_parser("tables", help="List tables in a database")
71
+ pt.add_argument("database", nargs="?", default=None)
72
+ pt.add_argument("--like", default=None)
73
+ pt.set_defaults(func=query.cmd_tables)
74
+
75
+ pd = sub.add_parser("describe", help="Show a table's columns")
76
+ pd.add_argument("table")
77
+ pd.set_defaults(func=query.cmd_describe)
78
+
79
+ pq = sub.add_parser("query", help="Run SQL (read-only unless --allow-ddl)")
80
+ pq.add_argument("sql")
81
+ pq.add_argument("--allow-ddl", action="store_true")
82
+ pq.add_argument("--max-rows", type=int, default=None)
83
+ pq.set_defaults(func=query.cmd_query)
84
+
85
+ ps = sub.add_parser("sync", help="Submit an async Hive->ClickHouse sync (returns a job id)")
86
+ ps.add_argument("source", help="Hive table, e.g. db.table")
87
+ ps.add_argument("--to", default="clickhouse")
88
+ ps.add_argument("--mode", choices=["auto", "parallel", "single"], default="auto")
89
+ ps.add_argument("--ch-jdbc", default=None, help="ClickHouse JDBC URL (or $SCQ_CH_JDBC)")
90
+ ps.add_argument("--target", default=None, help="ClickHouse target as db.table (or bare table = default db)")
91
+ ps.add_argument("--where", default=None)
92
+ ps.add_argument("--limit", type=int, default=0)
93
+ ps.add_argument("--batchsize", type=int, default=None)
94
+ ps.add_argument("--num-partitions", type=int, default=None)
95
+ ps.add_argument("--order-by", default=None, help="ORDER BY key for an auto-created table, e.g. 'id'")
96
+ ps.add_argument("--engine", default=None, help="Engine for an auto-created table (default MergeTree)")
97
+ ps.set_defaults(func=cmd_sync)
98
+
99
+ pm = sub.add_parser("meta", help="One JSON bundle of a table's metadata")
100
+ pm.add_argument("table", help="db.table")
101
+ pm.add_argument("--count", action="store_true", help="include exact row count (runs count(*))")
102
+ pm.set_defaults(func=meta.cmd_meta)
103
+
104
+ pe = sub.add_parser("exec", help="Read-only Spark execution metadata (REST API passthrough)")
105
+ pe.add_argument("path", nargs="?", default="",
106
+ help="REST subpath, e.g. 'stages' or 'stages/61/0/taskSummary?quantiles=0.5,0.95,1.0'")
107
+ pe.add_argument("--rm", default=None, help="YARN RM base URL (or $SCQ_YARN_RM)")
108
+ pe.add_argument("--compact", action="store_true", help="compact JSON output")
109
+ pe.set_defaults(func=rest.cmd_exec)
110
+
111
+ psk = sub.add_parser("skill", help="Manage the agent skill")
112
+ sksub = psk.add_subparsers(dest="skcmd", required=True)
113
+ ski = sksub.add_parser("install", help="Write SKILL.md into the skills dir")
114
+ ski.add_argument("--dir", default=None, help="Skills dir (or $SKILLS_DIR)")
115
+ ski.set_defaults(func=cmd_skill_install)
116
+
117
+ pj = sub.add_parser("jobs", help="Manage async jobs")
118
+ js = pj.add_subparsers(dest="jcmd", required=True)
119
+ js.add_parser("list", help="List jobs").set_defaults(func=jobs.cmd_list)
120
+ s = js.add_parser("status", help="Show a job's full status")
121
+ s.add_argument("id"); s.set_defaults(func=jobs.cmd_status)
122
+ lg = js.add_parser("logs", help="Show a job's log (tail by default)")
123
+ lg.add_argument("id"); lg.add_argument("--tail", type=int, default=40)
124
+ lg.add_argument("--full", action="store_true"); lg.set_defaults(func=jobs.cmd_logs)
125
+ c = js.add_parser("cancel", help="Cancel a running job")
126
+ c.add_argument("id"); c.set_defaults(func=jobs.cmd_cancel)
127
+
128
+ return ap
129
+
130
+
131
+ def main(argv=None) -> None:
132
+ argv = list(sys.argv[1:] if argv is None else argv)
133
+ # Internal entrypoint for the detached worker.
134
+ if argv and argv[0] == "__run-job":
135
+ jobs.run_worker(argv[1])
136
+ return
137
+ args = build_parser().parse_args(argv)
138
+ args.func(args)
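The comment in `cmd_sync` explains why the JDBC URL travels through the environment instead of argv: argv is persisted in the job registry, the environment is not. The pattern can be sketched in isolation as follows, assuming a throwaway child process that only reports the length of the secret. `run_child_with_secret` and `DEMO_SECRET` are hypothetical names chosen for this illustration.

```python
import os
import subprocess
import sys


def run_child_with_secret(argv, secret_name, secret_value):
    """Run a child process; the secret is placed in its environment, not argv."""
    env = os.environ.copy()
    env[secret_name] = secret_value
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=True)


# The child prints only the length of the secret, so nothing sensitive is echoed.
child = [sys.executable, "-c",
         "import os; print(len(os.environ['DEMO_SECRET']))"]
result = run_child_with_secret(child, "DEMO_SECRET", "s3cr3t")
```

Anything that records `child` (a log line, a `meta.json`) therefore never contains the secret value itself.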
@@ -0,0 +1,221 @@
1
+ """Layer A — async background-job control.
2
+
3
+ The point: long Spark jobs must not block the caller (an LLM agent should not sit
4
+ in a 30-minute tool call). `submit()` spawns the work detached in its own process
5
+ group, records a durable handle on disk, and returns immediately. The handle
6
+ survives a process/container restart because it lives in a file registry, not in
7
+ memory.
8
+
9
+ A job is generic: `kind="exec"` runs an arbitrary argv (used for tests and ad-hoc
10
+ long commands); `kind="sync"` runs the Hive->ClickHouse mover. New kinds plug in
11
+ via `_dispatch`.
12
+ """
13
+ from __future__ import annotations
14
+
15
+ import json
16
+ import os
17
+ import signal
18
+ import subprocess
19
+ import sys
20
+ import time
21
+ import uuid
22
+ from datetime import datetime, timezone
23
+ from pathlib import Path
24
+
25
+ from .session import EXIT_JOB_ERR, err
26
+
27
+ JOBS_DIR = Path(os.environ.get(
28
+ "SCQ_JOBS_DIR", str(Path.home() / ".spark-connect-cli" / "jobs")))
29
+
30
+
31
+ def _now() -> str:
32
+ return datetime.now(timezone.utc).astimezone().isoformat(timespec="seconds")
33
+
34
+
35
+ def _job_dir(job_id: str) -> Path:
36
+ return JOBS_DIR / job_id
37
+
38
+
39
+ def _meta_path(job_id: str) -> Path:
40
+ return _job_dir(job_id) / "meta.json"
41
+
42
+
43
+ def _log_path(job_id: str) -> Path:
44
+ return _job_dir(job_id) / "out.log"
45
+
46
+
47
+ def read_meta(job_id: str) -> dict:
48
+ p = _meta_path(job_id)
49
+ if not p.exists():
50
+ err(f"no such job: {job_id}", EXIT_JOB_ERR)
51
+ return json.loads(p.read_text())
52
+
53
+
54
+ def write_meta(job_id: str, meta: dict) -> None:
55
+ tmp = _meta_path(job_id).with_suffix(".tmp")
56
+ tmp.write_text(json.dumps(meta, indent=2, default=str))
57
+ tmp.replace(_meta_path(job_id))
58
+
59
+
60
+ def _pid_alive(pid: int) -> bool:
61
+     if not pid:
62
+         return False
63
+     try:
64
+         os.kill(pid, 0)
65
+     except ProcessLookupError:
66
+         return False
67
+     except PermissionError:
68
+         return True
69
+     # Signal-0 succeeds for a zombie (terminated but not yet reaped by its
70
+     # parent), but a zombie is not doing work, so treat it as dead. On Linux the
71
+     # process state is the char right after the ")" that closes comm in stat.
72
+     try:
73
+         with open(f"/proc/{pid}/stat") as f:
74
+             stat = f.read()
75
+         if stat[stat.rfind(")") + 2] == "Z":
76
+             return False
77
+     except (FileNotFoundError, ProcessLookupError, IndexError):
78
+         return not os.path.isdir("/proc")  # no procfs (e.g. macOS): trust signal 0
79
+     return True
80
+
81
+
82
+ def reconcile(meta: dict) -> dict:
83
+ """A 'running' job whose process is gone but that never recorded an end has
84
+ crashed — mark it failed so status never lies."""
85
+ if meta.get("state") == "running" and not _pid_alive(meta.get("pid", 0)):
86
+ meta["state"] = "failed"
87
+ meta["ended_at"] = _now()
88
+ if meta.get("exit_code") is None:
89
+ meta["exit_code"] = -1
90
+ meta["error"] = meta.get("error") or "process exited without recording completion"
91
+ write_meta(meta["id"], meta)
92
+ return meta
93
+
94
+
95
+ def _new_job_id() -> str:
96
+ return f"j-{datetime.now().strftime('%Y%m%d-%H%M%S')}-{uuid.uuid4().hex[:4]}"
97
+
98
+
99
+ def submit(kind: str, argv: list[str], descr: dict | None = None) -> str:
100
+ """Spawn a detached worker for this job and return its id immediately."""
101
+ JOBS_DIR.mkdir(parents=True, exist_ok=True)
102
+ job_id = _new_job_id()
103
+ _job_dir(job_id).mkdir(parents=True)
104
+ meta = {
105
+ "id": job_id, "kind": kind, "state": "submitted",
106
+ "submitted_at": _now(), "started_at": None, "ended_at": None,
107
+ "pid": None, "pgid": None, "exit_code": None, "argv": argv,
108
+ **(descr or {}),
109
+ }
110
+ write_meta(job_id, meta)
111
+
112
+ log = open(_log_path(job_id), "ab", buffering=0)
113
+ proc = subprocess.Popen(
114
+ [sys.executable, "-m", "spark_connect_cli", "__run-job", job_id],
115
+ stdout=log, stderr=subprocess.STDOUT, stdin=subprocess.DEVNULL,
116
+ start_new_session=True, # own process group -> kill the whole tree
117
+ env=os.environ.copy(),
118
+ )
119
+ meta["pid"] = proc.pid
120
+ meta["pgid"] = os.getpgid(proc.pid)
121
+ meta["state"] = "running"
122
+ meta["started_at"] = _now()
123
+ write_meta(job_id, meta)
124
+ return job_id
125
+
126
+
127
+ def _dispatch(meta: dict) -> int:
128
+ kind = meta["kind"]
129
+ argv = meta.get("argv", [])
130
+ if kind == "exec":
131
+ return subprocess.run(argv).returncode
132
+ if kind == "sync":
133
+ from .sync import run as sync_run
134
+ return sync_run(argv, meta)
135
+ print(f"[scq] unknown job kind: {kind}", flush=True)
136
+ return 2
137
+
138
+
139
+ def run_worker(job_id: str) -> None:
140
+     """Runs INSIDE the detached child. Executes the work, then records a terminal
141
+     state. Never raises out; always writes a final meta."""
142
+     meta = read_meta(job_id)
143
+     rc = 0
144
+     try:
145
+         rc = _dispatch(meta)
146
+     except SystemExit as e:
147
+         rc = int(e.code) if isinstance(e.code, int) else (0 if e.code is None else 1)
148
+     except Exception as e:  # noqa: BLE001
149
+         meta = {**read_meta(job_id), "error": str(e)}
150
+         write_meta(job_id, meta)  # persist now: meta is re-read below
151
+         print(f"[scq] job failed: {e}", flush=True)
152
+         rc = 1
153
+     meta = read_meta(job_id)  # re-read: the body may have updated counters
154
+     meta["state"] = "succeeded" if rc == 0 else "failed"
155
+     meta["exit_code"] = rc
156
+     meta["ended_at"] = _now()
157
+     write_meta(job_id, meta)
158
+     sys.exit(rc)
159
+
160
+
161
+ # -- agent-facing commands -------------------------------------------------
162
+
163
+ def cmd_list(args) -> None:
164
+ if not JOBS_DIR.exists():
165
+ print(json.dumps({"jobs": []}))
166
+ return
167
+ out = []
168
+ for d in sorted(JOBS_DIR.iterdir(), reverse=True):
169
+ if not (d / "meta.json").exists():
170
+ continue
171
+ m = reconcile(json.loads((d / "meta.json").read_text()))
172
+ out.append({k: m.get(k) for k in
173
+ ("id", "kind", "state", "source", "target", "submitted_at",
174
+ "source_rows", "written_rows")})
175
+ if args.format == "table":
176
+ from .session import emit_rows
177
+ emit_rows([(j["id"], j["kind"], j["state"], j.get("source") or "",
178
+ str(j.get("written_rows") if j.get("written_rows") is not None else ""))
179
+ for j in out],
180
+ ["id", "kind", "state", "source", "written"], "table")
181
+ else:
182
+ print(json.dumps({"jobs": out}, default=str))
183
+
184
+
185
+ def cmd_status(args) -> None:
186
+ print(json.dumps(reconcile(read_meta(args.id)), indent=2, default=str))
187
+
188
+
189
+ def cmd_logs(args) -> None:
190
+ read_meta(args.id) # validate existence
191
+ p = _log_path(args.id)
192
+ if not p.exists():
193
+ print("")
194
+ return
195
+ data = p.read_text(errors="replace")
196
+ if args.full:
197
+ sys.stdout.write(data)
198
+ return
199
+ lines = data.splitlines()
200
+ sys.stdout.write("\n".join(lines[-args.tail:]) + ("\n" if lines else ""))
201
+
202
+
203
+ def cmd_cancel(args) -> None:
204
+ meta = reconcile(read_meta(args.id))
205
+ if meta["state"] not in ("running", "submitted"):
206
+ print(json.dumps({"id": args.id, "state": meta["state"],
207
+ "message": "job already finished"}))
208
+ return
209
+ pgid = meta.get("pgid")
210
+ if pgid:
211
+ try:
212
+ os.killpg(pgid, signal.SIGTERM)
213
+ time.sleep(2)
214
+ if _pid_alive(meta.get("pid", 0)):
215
+ os.killpg(pgid, signal.SIGKILL)
216
+ except ProcessLookupError:
217
+ pass
218
+ meta["state"] = "cancelled"
219
+ meta["ended_at"] = _now()
220
+ write_meta(args.id, meta)
221
+ print(json.dumps({"id": args.id, "state": "cancelled"}))
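`write_meta` in the module above writes to a sibling `.tmp` file and then renames it over `meta.json`, so a reader sees either the old or the new document. The sketch below shows the same write-then-rename idea with one deliberate difference: it uses `tempfile.mkstemp` to get a unique temporary name, on the assumption that two writers might update the same job at once. `write_json_atomic` is a hypothetical helper and not a function of the package.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path, payload):
    """Write payload so readers see the old file or the new one, never a partial one."""
    path = Path(path)
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f, indent=2, default=str)
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

The rename is what makes the update atomic; the unique temporary name only guards against two writers truncating each other's scratch file.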
@@ -0,0 +1,95 @@
1
+ """scq meta — one structured metadata document for a table.
2
+
3
+ Bundles what is otherwise spread across DESCRIBE EXTENDED + SHOW PARTITIONS +
4
+ the Spark `_metadata` hidden column, so an agent gets the whole picture of a
5
+ table in a single call instead of stitching several queries.
6
+ """
7
+ from __future__ import annotations
8
+
9
+ import json
10
+
11
+ from .session import EXIT_QUERY_ERR, err, get_spark
12
+ from .sync import map_type
13
+
14
+
15
+ def _describe_extended(spark, table):
16
+ """Parse DESCRIBE EXTENDED into (columns, partition_columns, details)."""
17
+ rows = spark.sql(f"DESCRIBE EXTENDED {table}").collect()
18
+ cols, part_cols, details = [], [], {}
19
+ section = "cols"
20
+ for r in rows:
21
+ name = (r[0] or "").strip()
22
+ val = (r[1] or "")
23
+ if name.startswith("# Partition Information"):
24
+ section = "partcols"
25
+ continue
26
+ if name.startswith("# Detailed Table Information"):
27
+ section = "details"
28
+ continue
29
+ if name.startswith("#") or not name:
30
+ continue
31
+ if section == "cols":
32
+ cols.append({"name": name, "type": val, "clickhouse": map_type(val),
33
+ "comment": r[2]})
34
+ elif section == "partcols":
35
+ part_cols.append(name)
36
+ else:
37
+ details[name] = val.strip()
38
+ return cols, part_cols, details
39
+
40
+
41
+ def _file_stats(spark, table):
42
+ """Aggregate per-file size/mtime from the _metadata hidden column. Returns
43
+ None if the source doesn't expose _metadata."""
44
+ try:
45
+ row = spark.sql(
46
+ f"SELECT count(*) AS num_files, sum(sz) AS total_bytes, "
47
+ f"min(mt) AS first_modified, max(mt) AS last_modified FROM ("
48
+ f" SELECT _metadata.file_path AS p, max(_metadata.file_size) AS sz, "
49
+ f" max(_metadata.file_modification_time) AS mt FROM {table} GROUP BY 1)"
50
+ ).collect()[0]
51
+ return {"numFiles": row["num_files"], "totalBytes": row["total_bytes"],
52
+ "firstModified": str(row["first_modified"]),
53
+ "lastModified": str(row["last_modified"])}
54
+ except Exception: # noqa: BLE001 — _metadata unsupported / empty table
55
+ return None
56
+
57
+
58
+ def cmd_meta(args) -> None:
59
+ spark = get_spark(args.remote)
60
+ table = args.table
61
+ try:
62
+ cols, part_cols, details = _describe_extended(spark, table)
63
+ except Exception as e: # noqa: BLE001
64
+ err(f"describe failed: {e}", EXIT_QUERY_ERR)
65
+
66
+ out = {
67
+ "table": table,
68
+ "createdTime": details.get("Created Time"),
69
+ "lastAccess": details.get("Last Access"),
70
+ "owner": details.get("Owner"),
71
+ "createdBy": details.get("Created By"),
72
+ "provider": details.get("Provider"),
73
+ "type": details.get("Type"),
74
+ "location": details.get("Location"),
75
+ "statistics": details.get("Statistics"),
76
+ "partitionColumns": part_cols,
77
+ "columns": cols,
78
+ }
79
+
80
+ if part_cols:
81
+ try:
82
+ parts = [r[0] for r in spark.sql(f"SHOW PARTITIONS {table}").collect()]
83
+ out["partitionCount"] = len(parts)
84
+ out["partitions"] = parts if len(parts) <= 200 else parts[:200] + ["…"]
85
+ except Exception: # noqa: BLE001
86
+ pass
87
+
88
+ files = _file_stats(spark, table)
89
+ if files:
90
+ out["files"] = files
91
+
92
+ if args.count:
93
+ out["rowCount"] = spark.sql(f"SELECT count(*) c FROM {table}").collect()[0]["c"]
94
+
95
+ print(json.dumps(out, ensure_ascii=False, indent=2, default=str))
@@ -0,0 +1,57 @@
1
+ """Read-only / discovery commands over Spark Connect."""
2
+ from __future__ import annotations
3
+
4
+ import json
5
+ import sys
6
+
7
+ from .session import (DEFAULT_MAX_ROWS, EXIT_BLOCKED, EXIT_QUERY_ERR, emit_rows,
8
+ err, get_spark, is_read_only)
9
+
10
+
11
+ def cmd_databases(args) -> None:
12
+ spark = get_spark(args.remote)
13
+ rows = spark.sql("SHOW DATABASES").collect()
14
+ emit_rows([(r[0],) for r in rows], ["database"], args.format)
15
+
16
+
17
+ def cmd_tables(args) -> None:
18
+ spark = get_spark(args.remote)
19
+ db = args.database or "default"
20
+ rows = spark.sql(f"SHOW TABLES IN `{db}`").collect()
21
+ out = [(r["namespace"], r["tableName"]) for r in rows]
22
+ if args.like:
23
+ pat = args.like.replace("%", "").lower()
24
+ out = [t for t in out if pat in t[1].lower()]
25
+ emit_rows(out, ["database", "table"], args.format)
26
+
27
+
28
+ def cmd_describe(args) -> None:
29
+ spark = get_spark(args.remote)
30
+ rows = spark.sql(f"DESCRIBE TABLE {args.table}").collect()
31
+ emit_rows([(r[0], r[1], r[2]) for r in rows],
32
+ ["col_name", "data_type", "comment"], args.format)
33
+
34
+
35
+ def cmd_query(args) -> None:
36
+ sql = args.sql
37
+ if not args.allow_ddl and not is_read_only(sql):
38
+ err("write/DDL blocked by read-only guard; pass --allow-ddl to override",
39
+ EXIT_BLOCKED)
40
+ spark = get_spark(args.remote)
41
+ try:
42
+ df = spark.sql(sql)
43
+ except Exception as e: # noqa: BLE001
44
+ err(f"query failed: {e}", EXIT_QUERY_ERR)
45
+ if not is_read_only(sql):
46
+ print(json.dumps({"ok": True}))
47
+ return
48
+ max_rows = args.max_rows if args.max_rows is not None else DEFAULT_MAX_ROWS
49
+ columns = df.columns
50
+ limited = df.limit(max_rows + 1).collect() if max_rows > 0 else df.collect()
51
+ truncated = max_rows > 0 and len(limited) > max_rows
52
+ rows = limited[:max_rows] if truncated else limited
53
+ emit_rows([tuple(r) for r in rows], columns, args.format)
54
+ if truncated:
55
+ print(json.dumps({"warning": f"result capped at {max_rows} rows; "
56
+ "add LIMIT/filters or raise --max-rows"}),
57
+ file=sys.stderr)
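`cmd_query` detects truncation by asking for one row more than the cap and checking whether that extra row arrived. The same trick works on any iterable, as in this minimal sketch; `take_capped` is a hypothetical name, and a plain Python iterable stands in for the Spark DataFrame.

```python
from itertools import islice


def take_capped(rows, max_rows):
    """Return (rows, truncated); max_rows <= 0 means no cap."""
    if max_rows <= 0:
        return list(rows), False
    # Fetch one extra row: its presence proves there was more data.
    fetched = list(islice(rows, max_rows + 1))
    truncated = len(fetched) > max_rows
    return fetched[:max_rows], truncated
```

Fetching `max_rows + 1` avoids a separate `count(*)` just to learn whether the result was cut off.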
@@ -0,0 +1,53 @@
1
+ """scq exec — read-only passthrough to the Spark REST API.
2
+
3
+ One generic accessor over the Spark monitoring REST API instead of a pile of
4
+ single-purpose diagnostics. The model composes the path and interprets the JSON,
5
+ so skew, slow stages, shuffle spill, executor GC/OOM, etc. are all the same
6
+ command with different readings.
7
+
8
+ Access path: the Spark UI redirects to the YARN ResourceManager web proxy, so we
9
+ discover the running Spark application via the RM REST and go through the proxy.
10
+ Pure-Python (urllib) — no curl, no manual app id. GET-only against /api/v1.
11
+ """
12
+ from __future__ import annotations
13
+
14
+ import json
15
+ import os
16
+ import urllib.request
17
+
18
+ from .session import EXIT_CONN_ERR, err
19
+
20
+ DEFAULT_RM = os.environ.get("SCQ_YARN_RM", "http://namenode.hive-net:8088")
21
+
22
+
23
+ def _get_json(url: str, timeout: int = 15):
24
+ with urllib.request.urlopen(url, timeout=timeout) as r: # noqa: S310 — fixed RM base
25
+ return json.load(r)
26
+
27
+
28
+ def _discover_app(rm: str) -> str | None:
29
+ apps = _get_json(f"{rm}/ws/v1/cluster/apps?states=RUNNING&applicationTypes=SPARK")
30
+ lst = ((apps.get("apps") or {}).get("app")) or []
31
+ for a in lst: # prefer the Spark Connect Server
32
+ if "connect" in (a.get("name") or "").lower():
33
+ return a["id"]
34
+ return lst[0]["id"] if lst else None
35
+
36
+
37
+ def cmd_exec(args) -> None:
38
+ rm = (args.rm or DEFAULT_RM).rstrip("/")
39
+ path = (args.path or "").lstrip("/")
40
+ if ".." in path:
41
+ err("path must not contain '..'", EXIT_CONN_ERR)
42
+ try:
43
+ app = _discover_app(rm)
44
+ if not app:
45
+ err("no RUNNING Spark application found via the YARN RM", EXIT_CONN_ERR)
46
+ base = f"{rm}/proxy/{app}/api/v1/applications/{app}"
47
+ data = _get_json(f"{base}/{path}" if path else base)
48
+ except SystemExit:
49
+ raise
50
+ except Exception as e: # noqa: BLE001
51
+ err(f"Spark REST request failed: {e}", EXIT_CONN_ERR)
52
+ print(json.dumps(data, ensure_ascii=False,
53
+ indent=None if args.compact else 2, default=str))
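The module docstring above leaves the interpretation of the REST JSON to the caller. For the skew check, a minimal sketch follows. It assumes the `taskSummary` response holds a `quantiles` list plus per-metric lists aligned with it, and that it was requested with `quantiles=0.5,0.95,1.0`. `skew_ratio` is a hypothetical helper, and the sample numbers in the usage note are invented.

```python
def skew_ratio(task_summary, metric="executorRunTime"):
    """Return max/median for one metric of a taskSummary response, or None."""
    quantiles = task_summary["quantiles"]
    values = task_summary[metric]
    # Pair each requested quantile with its value for this metric.
    by_quantile = dict(zip(quantiles, values))
    median = by_quantile.get(0.5)
    worst = by_quantile.get(1.0)
    if not median or worst is None:
        # Missing quantiles or a zero median: no meaningful ratio.
        return None
    return worst / median
```

For example, `skew_ratio({"quantiles": [0.5, 0.95, 1.0], "executorRunTime": [100.0, 150.0, 900.0]})` gives `9.0`, which would point at one task running far longer than the typical one.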
@@ -0,0 +1,72 @@
1
+ """Spark Connect session, read-only guard, and output formatting.
2
+
3
+ Connecting to Spark Connect needs NO Kerberos and NO JVM on the client side —
4
+ the server runs under its own keytab and does the auth. The endpoint is a plain
5
+ gRPC address, e.g. sc://spark-connect:15002.
6
+ """
7
+ from __future__ import annotations
8
+
9
+ import json
10
+ import os
11
+ import sys
12
+
13
+ DEFAULT_REMOTE = os.environ.get("SPARK_REMOTE", "sc://localhost:15002")
14
+ DEFAULT_MAX_ROWS = int(os.environ.get("SCQ_MAX_ROWS", "10000"))
15
+
16
+ # Exit codes — stable contract so an agent can branch on them.
17
+ EXIT_OK = 0
18
+ EXIT_QUERY_ERR = 1
19
+ EXIT_CONN_ERR = 2
20
+ EXIT_BLOCKED = 3 # read-only guard tripped
21
+ EXIT_JOB_ERR = 4 # job-control error (no such job, etc.)
22
+
23
+ READ_ONLY_LEADERS = ("select", "show", "describe", "desc", "explain", "with")
24
+
25
+
26
+ def err(msg: object, code: int) -> None:
27
+ """Emit a single JSON error object on stderr and exit."""
28
+ print(json.dumps({"error": str(msg), "code": code}), file=sys.stderr)
29
+ sys.exit(code)
30
+
31
+
32
+ def is_read_only(sql: str) -> bool:
33
+ """True if the statement only reads (SELECT/SHOW/DESCRIBE/EXPLAIN/WITH)."""
34
+ leader = sql.strip().lstrip("(").split(None, 1)
35
+ return bool(leader) and leader[0].lower() in READ_ONLY_LEADERS
36
+
37
+
38
+ def get_spark(remote: str):
39
+ try:
40
+ from pyspark.sql import SparkSession
41
+ except ModuleNotFoundError:
42
+ err("pyspark is not installed; `pip install 'pyspark[connect]>=3.5,<4'`", EXIT_CONN_ERR)
43
+ try:
44
+ return SparkSession.builder.remote(remote).getOrCreate()
45
+ except Exception as e: # noqa: BLE001
46
+ err(f"could not connect to Spark Connect at {remote}: {e}", EXIT_CONN_ERR)
47
+
48
+
49
+ def emit_rows(rows, columns, fmt: str) -> None:
50
+ """Render a result set. JSON-first (JSONEachRow) by default."""
51
+ if fmt == "jsoneachrow":
52
+ for r in rows:
53
+ print(json.dumps(dict(zip(columns, r)), default=str, ensure_ascii=False))
54
+ elif fmt == "json":
55
+ print(json.dumps(
56
+ {"meta": list(columns), "data": [list(r) for r in rows], "rows": len(rows)},
57
+ default=str, ensure_ascii=False))
58
+ elif fmt in ("csv", "tsv"):
59
+ sep = "," if fmt == "csv" else "\t"
60
+ print(sep.join(columns))
61
+ for r in rows:
62
+ print(sep.join("" if v is None else str(v) for v in r))
63
+ elif fmt == "table":
64
+ widths = [len(c) for c in columns]
65
+ srows = [["" if v is None else str(v) for v in r] for r in rows]
66
+ for r in srows:
67
+ for i, v in enumerate(r):
68
+ widths[i] = max(widths[i], len(v))
69
+ print(" | ".join(c.ljust(widths[i]) for i, c in enumerate(columns)))
70
+ print("-+-".join("-" * w for w in widths))
71
+ for r in srows:
72
+ print(" | ".join(v.ljust(widths[i]) for i, v in enumerate(r)))
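The read-only guard in the module above looks only at the first keyword of the statement. The sketch below repeats that check verbatim and runs it over a few inputs to show where a first-keyword test is strict and where it is lenient. The example statements are illustrative, and whether the server accepts each of them is an assumption (Spark SQL is understood to allow a CTE ahead of `INSERT`).

```python
READ_ONLY_LEADERS = ("select", "show", "describe", "desc", "explain", "with")


def is_read_only(sql):
    """Same first-keyword check as the guard in the module above."""
    leader = sql.strip().lstrip("(").split(None, 1)
    return bool(leader) and leader[0].lower() in READ_ONLY_LEADERS


examples = [
    "SELECT 1",                                            # plain read: allowed
    "(select id from t)",                                  # leading paren is stripped
    "INSERT INTO t VALUES (1)",                            # write: blocked
    "-- note\nSELECT 1",                                   # read behind a comment: blocked
    "WITH c AS (SELECT 1) INSERT INTO t SELECT * FROM c",  # starts with WITH: allowed
]
verdicts = {sql: is_read_only(sql) for sql in examples}
```

The last two cases are the ones to keep in mind: a leading SQL comment makes a harmless query look blocked, and a statement that opens with `WITH` passes the check regardless of what follows, so the guard is best treated as a guard against accidents and not as a security boundary.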
@@ -0,0 +1,146 @@
1
+ """Hive -> ClickHouse sync — one feature built on the async job subsystem.
2
+
3
+ Runs inside a detached job worker, so all output goes to the job log and never
4
+ into the agent's context. The data path is **Spark direct write**: a Spark
5
+ Connect job reads the Hive table and writes to ClickHouse over JDBC. The write
6
+ happens on the executors (in the cluster), so rows never pass through this
7
+ process or the agent.
8
+
9
+ Requirements (the "path A wiring"):
10
+   - clickhouse-jdbc on the Spark Connect server classpath (/opt/spark/jars/),
11
+   - network egress from the cluster to ClickHouse,
12
+   - a JDBC URL with credentials (`--ch-jdbc` / $SCQ_CH_JDBC),
13
+   - the target ClickHouse table, ideally created beforehand with a suitable
14
+     engine (an auto-created table has no sort key unless --order-by is passed).
15
+
16
+ Modes:
17
+   single:   one JDBC connection (numPartitions=1). Best for small tables.
18
+   parallel: N partitions write concurrently. Best for large tables.
19
+   auto:     single under --auto-threshold rows, else parallel.
20
+ """
21
+ from __future__ import annotations
22
+
23
+ import argparse
24
+ import os
25
+ import re
26
+
27
+ from .jobs import write_meta
28
+ from .session import DEFAULT_REMOTE, get_spark
29
+
30
+ # Spark/Hive -> ClickHouse type mapping. The SKILL carries the authoritative
31
+ # table the agent reasons with; this is just for the descriptive log line.
32
+ SPARK_TO_CH = {
33
+     "boolean": "Bool", "tinyint": "Int8", "smallint": "Int16", "int": "Int32",
34
+     "integer": "Int32", "bigint": "Int64", "float": "Float32", "double": "Float64",
35
+     "string": "String", "varchar": "String", "char": "String", "binary": "String",
36
+     "date": "Date32", "timestamp": "DateTime64(3)",
37
+ }
38
+
39
+ AUTO_THRESHOLD = int(os.environ.get("SCQ_AUTO_THRESHOLD", "1000000"))
40
+ DEFAULT_BATCHSIZE = int(os.environ.get("SCQ_BATCHSIZE", "100000"))
41
+ DEFAULT_NUM_PARTITIONS = int(os.environ.get("SCQ_NUM_PARTITIONS", "8"))
42
+
43
+
44
+ def map_type(spark_type: str) -> str:
45
+     t = spark_type.lower().strip()
46
+     if t.startswith("decimal"):
47
+         return t.replace("decimal", "Decimal")
48
+     base = re.split(r"[(<]", t, maxsplit=1)[0]  # drop parameters, e.g. varchar(10) -> varchar
49
+     return SPARK_TO_CH.get(base, "String")
50
+
51
+
52
+ def _parse(argv: list[str]):
53
+     p = argparse.ArgumentParser(prog="scq sync")
54
+     p.add_argument("source")
55
+     p.add_argument("--to", default="clickhouse")
56
+     p.add_argument("--remote", default=DEFAULT_REMOTE)
57
+     p.add_argument("--mode", choices=["auto", "parallel", "single"], default="auto")
58
+     p.add_argument("--ch-jdbc", default=os.environ.get("SCQ_CH_JDBC", ""))
59
+     p.add_argument("--target", default=None)
60
+     p.add_argument("--where", default=None)
61
+     p.add_argument("--limit", type=int, default=0)
62
+     p.add_argument("--batchsize", type=int, default=DEFAULT_BATCHSIZE)
63
+     p.add_argument("--num-partitions", type=int, default=DEFAULT_NUM_PARTITIONS)
64
+     p.add_argument("--auto-threshold", type=int, default=AUTO_THRESHOLD)
65
+     # Auto-create control: when the target table doesn't exist, Spark creates it.
66
+     # Without an explicit sort key ClickHouse defaults to ORDER BY tuple() (no
67
+     # primary index). --order-by injects a real key via createTableOptions.
68
+     p.add_argument("--order-by", default=None, help="ORDER BY key(s) for auto-created table, e.g. 'id' or 'id,dt'")
69
+     p.add_argument("--engine", default="MergeTree", help="Engine for auto-created table (default MergeTree)")
70
+     return p.parse_args(argv)
71
+
72
+
73
+ def run(argv: list[str], meta: dict) -> int:
74
+     a = _parse(argv)
75
+     if not a.ch_jdbc:
76
+         print("[scq] no --ch-jdbc / SCQ_CH_JDBC set; cannot write to ClickHouse",
77
+               flush=True)
78
+         return 2
79
+
80
+     print(f"[scq] sync start: {a.source} -> {a.to} mode={a.mode}", flush=True)
81
+     spark = get_spark(a.remote)
82
+
83
+     # 1. discover schema (informational: the type mapping is only logged)
84
+     desc = spark.sql(f"DESCRIBE TABLE {a.source}").collect()
85
+     cols = [(r[0], r[1]) for r in desc if r[0] and not r[0].startswith("#")]
86
+     print(f"[scq] {len(cols)} columns: "
87
+           + ", ".join(f"{c}:{t}->{map_type(t)}" for c, t in cols), flush=True)
88
+
89
+     src_count = spark.sql(f"SELECT count(*) c FROM {a.source}").collect()[0]["c"]
90
+     # Target may be `db.table` (explicit database) or a bare table name (lands in
91
+     # the JDBC connection's default database). Keep the landing spot explicit.
92
+     target = a.target or a.source.split(".")[-1]
93
+     qualified = "." in target
94
+     meta["source_rows"] = src_count
95
+     meta["target"] = target
96
+     write_meta(meta["id"], meta)
97
+     hint = "" if qualified else " (connection default database; pass --target db.table to choose one)"
98
+     print(f"[scq] source rows: {src_count} -> target {target}{hint}", flush=True)
99
+
100
+     # 2. build the read
101
+     sel = f"SELECT * FROM {a.source}"
102
+     if a.where:
103
+         sel += f" WHERE {a.where}"
104
+     if a.limit:
105
+         sel += f" LIMIT {a.limit}"
106
+     df = spark.sql(sel)
107
+
108
+     # 3. choose write parallelism
109
+     mode = a.mode
110
+     if mode == "auto":
111
+         mode = "parallel" if src_count >= a.auto_threshold else "single"
112
+     num_partitions = 1 if mode == "single" else max(1, a.num_partitions)
113
+     if num_partitions == 1:
114
+         df = df.coalesce(1)
115
+     else:
116
+         df = df.repartition(num_partitions)
117
+     print(f"[scq] writing via JDBC: mode={mode} numPartitions={num_partitions} "
118
+           f"batchsize={a.batchsize}", flush=True)
119
+
120
+     # 4. Spark direct write to ClickHouse. Rows are written by the executors;
121
+     #    nothing flows through this process.
122
+     writer = (df.write.format("jdbc")
123
+               .option("url", a.ch_jdbc)
124
+               .option("dbtable", target)
125
+               .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
126
+               .option("batchsize", a.batchsize)
127
+               .option("isolationLevel", "NONE"))  # ClickHouse has no txns
128
+     # createTableOptions only affects auto-create (when the table is missing); it
129
+     # is ignored when the table already exists.
130
+     if a.order_by:
131
+         writer = writer.option("createTableOptions",
132
+                                f"ENGINE = {a.engine} ORDER BY ({a.order_by})")
133
+         print(f"[scq] auto-create (if needed): ENGINE = {a.engine} ORDER BY ({a.order_by})", flush=True)
134
+     else:
135
+         print("[scq] note: an auto-created target uses ORDER BY tuple() (no sort key); "
136
+               "pass --order-by for a real key, or pre-create the table", flush=True)
137
+     try:
138
+         writer.mode("append").save()
139
+     except Exception as e:  # noqa: BLE001
140
+         print(f"[scq] JDBC write failed: {e}", flush=True)
141
+         return 1
142
+
143
+     meta["written_rows"] = src_count if not (a.where or a.limit) else spark.sql(sel).count()  # re-count filtered reads
144
+     print(f"[scq] done: wrote {meta['written_rows']} rows to {target}", flush=True)
145
+     write_meta(meta["id"], meta)
146
+     return 0
@@ -0,0 +1,156 @@
1
+ Metadata-Version: 2.4
2
+ Name: spark-connect-cli
3
+ Version: 0.2.0
4
+ Summary: Agent-friendly Spark Connect CLI: read-only querying + async long-job control. No JVM, no Kerberos on the client.
5
+ Project-URL: Homepage, https://github.com/dengshu2/spark-connect-cli
6
+ Project-URL: Issues, https://github.com/dengshu2/spark-connect-cli/issues
7
+ Author: dengshu
8
+ License: MIT
9
+ License-File: LICENSE
10
+ Keywords: agent,cli,clickhouse,hive,llm,spark,spark-connect
11
+ Classifier: Environment :: Console
12
+ Classifier: License :: OSI Approved :: MIT License
13
+ Classifier: Programming Language :: Python :: 3
14
+ Classifier: Topic :: Database :: Front-Ends
15
+ Requires-Python: >=3.9
16
+ Requires-Dist: pyspark[connect]<4,>=3.5
17
+ Provides-Extra: dev
18
+ Requires-Dist: pytest>=7; extra == 'dev'
19
+ Description-Content-Type: text/markdown
20
+
21
+ # spark-connect-cli (`scq`)
22
+
23
+ An agent-friendly [Spark Connect](https://spark.apache.org/spark-connect/) CLI —
24
+ **read-only querying** plus **async control for long-running jobs**.
25
+
26
+ Built for LLM agents and humans who live in a shell. Unlike `spark-sql` /
27
+ `spark-submit`, the client is a thin **pure-Python gRPC client**: no JVM, and
28
+ **no Kerberos on the client side** — the Spark Connect server authenticates with
29
+ its own keytab, so you just point at `sc://host:15002` and go.
30
+
31
+ ## Why
32
+
33
+ - **JSON-first, read-only by default.** Safe for an agent to call for
34
+ exploration; writes/DDL are blocked unless you opt in (`--allow-ddl`).
35
+ - **Long jobs don't block you.** A multi-minute Spark job shouldn't trap an agent
36
+ in a 30-minute tool call. `scq` submits the job, hands back a durable **job
37
+ id**, and returns immediately. Poll it whenever you like; the handle survives a
38
+ client/container restart because it lives in an on-disk registry.
39
+ - **Stable exit codes** so a caller can branch without scraping text.
40
+
41
+ ## Install
42
+
43
+ ```bash
44
+ pip install spark-connect-cli # once published
45
+ # or, from source:
46
+ pip install -e .
47
+ ```
48
+
49
+ ## Quick start
50
+
51
+ ```bash
52
+ export SPARK_REMOTE=sc://localhost:15002 # your Spark Connect endpoint
53
+
54
+ scq databases
55
+ scq tables mydb --like '%orders%'
56
+ scq describe mydb.orders
57
+ scq query "SELECT id, name FROM mydb.orders LIMIT 10"
58
+ ```
59
+
60
+ Output is **JSONEachRow** (one JSON object per line) by default; pick another with
61
+ `--format json|csv|tsv|table`.
62
+
63
+ ### Read-only guard
64
+
65
+ `scq query` allows only `SELECT/SHOW/DESCRIBE/EXPLAIN/WITH`. Anything else exits
66
+ with code **3** unless you pass `--allow-ddl`.
67
+
68
+ | exit | meaning |
69
+ |------|---------|
70
+ | 0 | success |
71
+ | 1 | query error (bad SQL) |
72
+ | 2 | connection error |
73
+ | 3 | blocked by the read-only guard |
74
+ | 4 | job-control error (no such job, …) |
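The exit-code table above lends itself to a small dispatcher on the caller's side. What follows is a minimal sketch, not part of the package: `EXIT_MEANINGS`, `parse_error`, and `run_scq` are hypothetical names, and the sketch assumes the default JSONEachRow output on stdout and the `{"error": ..., "code": ...}` object on stderr.

```python
import json
import subprocess

# Exit codes as listed in the README table.
EXIT_MEANINGS = {
    0: "success",
    1: "query error",
    2: "connection error",
    3: "blocked by the read-only guard",
    4: "job-control error",
}


def parse_error(stderr_text):
    """Return the last JSON object on stderr that carries an "error" key, or None."""
    for line in reversed(stderr_text.strip().splitlines()):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # not JSON (e.g. a stray log line); keep looking
        if isinstance(obj, dict) and "error" in obj:
            return obj
    return None


def run_scq(args):
    """Hypothetical wrapper: run `scq` and return (meaning, rows, error)."""
    proc = subprocess.run(["scq", *args], capture_output=True, text=True)
    meaning = EXIT_MEANINGS.get(proc.returncode, "unknown")
    rows = []
    if proc.returncode == 0:
        # Default output is one JSON object per line.
        rows = [json.loads(line) for line in proc.stdout.splitlines() if line.strip()]
    return meaning, rows, parse_error(proc.stderr)
```

A caller would branch on the first element of the tuple, for example retrying only on `"connection error"` and rewriting the SQL on `"query error"`.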
75
+
76
+ ## Async jobs (Layer A)
77
+
78
+ Long work runs detached and is tracked by a file-based registry under
79
+ `$SCQ_JOBS_DIR` (default `~/.spark-connect-cli/jobs`).
80
+
81
+ ```bash
82
+ # submit — returns a job id immediately, does NOT block
83
+ scq sync ods.orders --to clickhouse
84
+ # {"job_id": "j-20260625-...", "state": "running", "message": "... poll with ..."}
85
+
86
+ scq jobs list # all jobs + state
87
+ scq jobs status j-20260625-... # full status (rows, timings, pid, exit code)
88
+ scq jobs logs j-20260625-... --tail 40
89
+ scq jobs cancel j-20260625-... # kills the whole process group
90
+ ```
91
+
92
+ Design: each job is a directory with `meta.json` (state machine:
93
+ `submitted → running → succeeded|failed|cancelled`) and `out.log`. The worker
94
+ runs in its **own process group**, so cancel kills the entire tree (no orphans).
95
+ A `running` job whose process has vanished is reconciled to `failed` on the next
96
+ status read, so status never lies.
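The reconcile-on-read rule described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `jobs.py`: the function names are hypothetical, `meta` is assumed to be the parsed `meta.json` with `state` and `pid` keys, and the liveness probe is POSIX-only.

```python
import os


def pid_alive(pid):
    """True if a process with this pid exists (POSIX signal-0 probe)."""
    if not isinstance(pid, int) or pid <= 0:
        return False  # 0 and negative pids address process groups, not one process
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def reconcile(meta):
    """Return a copy of meta with a stale 'running' state downgraded to 'failed'."""
    fixed = dict(meta)
    if fixed.get("state") == "running" and not pid_alive(fixed.get("pid")):
        fixed["state"] = "failed"
    return fixed
```

Running the check at read time, rather than in a background daemon, means nothing else has to stay alive for the registry to remain truthful.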
97
+
98
+ ## Hive → ClickHouse sync
99
+
100
+ `scq sync` is one job kind built on the async subsystem. It uses **Spark direct
101
+ write**: a Spark Connect job reads the Hive table and writes to ClickHouse over
102
+ JDBC. The write runs on the executors, so rows never pass through this process or
103
+ the agent.
104
+
105
+ Modes control write parallelism — `single` (one connection, small tables),
106
+ `parallel` (N partitions, large tables), `auto` (picks by row count).
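The mode selection can be written as a pure function. The sketch below mirrors the logic in `sync.py`'s `run`; the function name `choose_write_plan` is hypothetical, and the defaults are the fallback values of `SCQ_AUTO_THRESHOLD` and `SCQ_NUM_PARTITIONS` from that module.

```python
def choose_write_plan(mode, source_rows, auto_threshold=1_000_000, num_partitions=8):
    """Return (resolved_mode, partitions) for a sync write."""
    if mode not in ("auto", "parallel", "single"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "auto":
        # Large tables get concurrent writers; small ones use a single connection.
        mode = "parallel" if source_rows >= auto_threshold else "single"
    partitions = 1 if mode == "single" else max(1, num_partitions)
    return mode, partitions


print(choose_write_plan("auto", 10))          # ('single', 1)
print(choose_write_plan("auto", 1_000_000))   # ('parallel', 8)
```

Note that the threshold comparison is inclusive: a table with exactly `auto_threshold` rows is written in parallel.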
107
+
108
+ Requires:
109
+ - `clickhouse-jdbc` on the Spark Connect server classpath (`/opt/spark/jars/`),
110
+ - cluster→ClickHouse network egress,
111
+ - a JDBC URL with credentials via `--ch-jdbc` / `$SCQ_CH_JDBC`,
112
+ - the **target ClickHouse table created beforehand** with a suitable engine
113
+ (Spark `append` won't build a usable MergeTree table for you — create it first,
114
+ e.g. with the `chsql` skill).
115
+
116
+ ## Introspection
117
+
118
+ ```bash
119
+ scq meta db.table # one JSON: schema, created time, location,
120
+ # partitions, file count/size, mtime range
121
+ scq meta db.table --count # also run an exact count(*)
122
+
123
+ scq exec 'stages?status=active'   # read-only Spark REST passthrough (quoted so the shell leaves ? alone)
124
+ scq exec executors
125
+ scq exec 'stages/<id>/<attempt>/taskSummary?quantiles=0.5,0.95,1.0'   # skew: max/median
126
+ ```
127
+
128
+ `scq exec` auto-discovers the running Spark app via the YARN ResourceManager and
129
+ proxies its monitoring REST API (GET-only). Set the RM base with `$SCQ_YARN_RM`.
130
+
131
+ **Reading `scq exec executors`**: the `maxMemory` field is Spark's
132
+ **storage/cache pool** (`(JVM-reported heap − 300 MB reserved) × 0.6`), *not* the
133
+ executor's total memory: a 512 MB executor reports ~93 MB, a 1536 MB driver
134
+ ~741 MB. The real heap is `spark.executor.memory` (+ off-heap overhead). The
135
+ `driver` row has 0 cores and runs no tasks. With dynamic allocation, idle
136
+ executors are released, so the list may show only the driver when nothing is running.
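The formula above is easy to check by hand. The sketch below applies it in both directions; the helper names are hypothetical. Plugging 1536 MB in gives about 741 MB, matching the driver figure, while plugging 512 MB in gives about 127 MB rather than ~93 MB. One reading, stated here as an assumption, is that the heap in the formula is the maximum the JVM reports, which can sit below the configured size; ~93 MB corresponds to a reported heap of roughly 455 MB.

```python
RESERVED_MB = 300        # reserved memory from the formula above
MEMORY_FRACTION = 0.6    # the 0.6 factor from the formula above


def pool_mb(heap_mb):
    """Pool size in MB for a given heap size in MB."""
    return max(0.0, (heap_mb - RESERVED_MB) * MEMORY_FRACTION)


def implied_heap_mb(pool):
    """Invert the formula: the heap size that would yield this pool size."""
    return pool / MEMORY_FRACTION + RESERVED_MB


print(round(pool_mb(1536), 1))       # 741.6
print(round(pool_mb(512), 1))        # 127.2
print(round(implied_heap_mb(93)))    # 455
```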
137
+
138
+ ## Configuration
139
+
140
+ | env | default | meaning |
141
+ |-----|---------|---------|
142
+ | `SPARK_REMOTE` | `sc://localhost:15002` | Spark Connect endpoint |
143
+ | `SCQ_JOBS_DIR` | `~/.spark-connect-cli/jobs` | job registry (put on a persistent volume) |
144
+ | `SCQ_MAX_ROWS` | `10000` | default row cap for `query` |
145
+ | `SCQ_CH_JDBC` | — | ClickHouse JDBC URL for `sync` path A |
146
+ | `SCQ_YARN_RM` | `http://namenode.hive-net:8088` | YARN RM base for `scq exec` |
147
+
148
+ ## Use with an LLM agent
149
+
150
+ `SKILL.md` ships a ready-made skill (discover-before-query workflow, async-job
151
+ etiquette, type-mapping table). Drop it into your agent's skills directory and
152
+ the agent drives `scq` through a shell/Bash tool.
153
+
154
+ ## License
155
+
156
+ MIT
@@ -0,0 +1,15 @@
1
+ spark_connect_cli/__init__.py,sha256=d_xkc721_-_mG3JGkeXU7BspJ1x_-yufyS9D6dLgR-k,87
2
+ spark_connect_cli/__main__.py,sha256=MSmt_5Xg84uHqzTN38JwgseJK8rsJn_11A8WD99VtEo,61
3
+ spark_connect_cli/cli.py,sha256=8VLGgQIVgTr0OccAT3uLnBpFCsO0i8IYTq1fScYn-9U,6379
4
+ spark_connect_cli/jobs.py,sha256=t55-tBFXd5jNMhiSWL7StrTlgCGPjU_XsYGr5qew7gY,7233
5
+ spark_connect_cli/meta.py,sha256=u1-RaKxP-Y7wbKXppeSuh2d9UCRk1HZ1U-GaFOJ4M9g,3473
6
+ spark_connect_cli/query.py,sha256=Ej0M0yW6RAdpShhEudTGTOJhL2ezSzwovFDPgRhCWH8,2103
7
+ spark_connect_cli/rest.py,sha256=JueQlHgTFhNvl8KCTum-1LGVxJWDs441NVDsU1GjBOE,2021
8
+ spark_connect_cli/session.py,sha256=xSXsSUZTwERj0uGKDyuQQUNveH6ox9LqsAurNiD7AyY,2756
9
+ spark_connect_cli/sync.py,sha256=fNCfqbYO6ClCfBo0Y_CDIfnCbGuEX26bNilKMqieM-4,6554
10
+ spark_connect_cli/SKILL.md,sha256=Rm_9aT_4G24iu2gM8MR6YHp4G_mLLbKMLPANU3-9qSA,6450
11
+ spark_connect_cli-0.2.0.dist-info/METADATA,sha256=RuIoJAQxqZl1Y21Msioj1JrqpTD_kLlpyvSMnbM1BdI,6081
12
+ spark_connect_cli-0.2.0.dist-info/WHEEL,sha256=mffPy8wBnZQn2VnJUU5jE99KsxaSfiyMHV9Yt0aLVxs,87
13
+ spark_connect_cli-0.2.0.dist-info/entry_points.txt,sha256=xweUT4medkn5exc6YvKPNBWlOV9qWBz5hxjZsoRdLK4,98
14
+ spark_connect_cli-0.2.0.dist-info/licenses/LICENSE,sha256=CQKNwxelRbTkzP7qdNam0XmENhKgOC6PgWd3ljvRiVM,1064
15
+ spark_connect_cli-0.2.0.dist-info/RECORD,,
@@ -0,0 +1,4 @@
1
+ Wheel-Version: 1.0
2
+ Generator: hatchling 1.30.1
3
+ Root-Is-Purelib: true
4
+ Tag: py3-none-any
@@ -0,0 +1,3 @@
1
+ [console_scripts]
2
+ scq = spark_connect_cli.cli:main
3
+ spark-connect-cli = spark_connect_cli.cli:main
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 dengshu
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.