kipimo-0.1.0.tar.gz

kipimo-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 AfriKaziOS maintainers
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
kipimo-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,47 @@
+ Metadata-Version: 2.4
+ Name: kipimo
+ Version: 0.1.0
+ Summary: Swahili agent-task evaluation suite for the East Africa coordination stack — model-agnostic seed benchmark (46 tasks: server routing, term grounding, cascade routing).
+ Author: AfriKaziOS maintainers
+ License-Expression: MIT
+ Project-URL: Homepage, https://github.com/gabrielmahia/kipimo
+ Project-URL: Repository, https://github.com/gabrielmahia/kipimo
+ Project-URL: Issues, https://github.com/gabrielmahia/kipimo/issues
+ Keywords: swahili,benchmark,evaluation,mcp,kenya,ai-agents
+ Requires-Python: >=3.10
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Dynamic: license-file
+
+ # kipimo
+
+ Over one hundred million people coordinate their lives in Swahili, yet no benchmark measures whether an AI agent can route their requests correctly — send money, check drought status, find a clinic, verify a worker's credentials. Agents targeting East Africa are evaluated on English tasks and deployed on faith.
+
+ `kipimo` (Swahili: *a measure*) is a model-agnostic seed benchmark for exactly that gap: **46 tasks** across three types, with golds machine-derived from authoritative sources — the coordination-stack registry and the live `africa-coord-bus` routing table — never from memory.
+
+ | Type | n | What it measures | Metric |
+ |---|---|---|---|
+ | `server_routing` | 25 | Swahili request → correct stack server (payments, tax, health, land, labour…) | exact |
+ | `term_grounding` | 14 | Swahili domain term → English meaning | exact (case-insensitive) |
+ | `cascade_routing` | 7 | Coordination event → which sectors must be notified | set F1 |
+
+ ## Use it (any model, no API keys)
+
+ ```bash
+ pip install kipimo
+ kipimo tasks > tasks.jsonl # feed to your agent however you like
+ kipimo template > preds.jsonl # fill "prediction": [...] per id
+ kipimo score preds.jsonl # per-type + overall report
+ ```
+
+ The harness never calls a model — you generate predictions with whatever system you're evaluating; kipimo only scores. Any lab can publish comparable numbers.
+
+ ## Honesty box
+ - **v0.1 is a seed set.** 46 tasks establish the format and scoring; breadth comes from contributions.
+ - Swahili phrasing is simple-register and **pending native-speaker review** — that is [issue #1](https://github.com/gabrielmahia/kipimo/issues), and corrections are the most valuable contribution possible.
+ - Scores measure stack-routing competence, not general Swahili fluency.
+ - Dataset: **CC BY 4.0** (usable by everyone, including commercial labs — that's the point). Harness: **MIT**.
+
+ ## IP & Collaboration
+
+ MIT-licensed harness, CC BY 4.0 data. Feedback via GitHub Issues only — pull requests are not accepted; task corrections and additions via Issues are actively wanted. Full policy: [docs/architecture/IP_POLICY.md](docs/architecture/IP_POLICY.md). Security: see [SECURITY.md](SECURITY.md).
kipimo-0.1.0/README.md ADDED
@@ -0,0 +1,32 @@
+ # kipimo
+
+ Over one hundred million people coordinate their lives in Swahili, yet no benchmark measures whether an AI agent can route their requests correctly — send money, check drought status, find a clinic, verify a worker's credentials. Agents targeting East Africa are evaluated on English tasks and deployed on faith.
+
+ `kipimo` (Swahili: *a measure*) is a model-agnostic seed benchmark for exactly that gap: **46 tasks** across three types, with golds machine-derived from authoritative sources — the coordination-stack registry and the live `africa-coord-bus` routing table — never from memory.
+
+ | Type | n | What it measures | Metric |
+ |---|---|---|---|
+ | `server_routing` | 25 | Swahili request → correct stack server (payments, tax, health, land, labour…) | exact |
+ | `term_grounding` | 14 | Swahili domain term → English meaning | exact (case-insensitive) |
+ | `cascade_routing` | 7 | Coordination event → which sectors must be notified | set F1 |
+
+ ## Use it (any model, no API keys)
+
+ ```bash
+ pip install kipimo
+ kipimo tasks > tasks.jsonl # feed to your agent however you like
+ kipimo template > preds.jsonl # fill "prediction": [...] per id
+ kipimo score preds.jsonl # per-type + overall report
+ ```
+
+ The harness never calls a model — you generate predictions with whatever system you're evaluating; kipimo only scores. Any lab can publish comparable numbers.
+
+ ## Honesty box
+ - **v0.1 is a seed set.** 46 tasks establish the format and scoring; breadth comes from contributions.
+ - Swahili phrasing is simple-register and **pending native-speaker review** — that is [issue #1](https://github.com/gabrielmahia/kipimo/issues), and corrections are the most valuable contribution possible.
+ - Scores measure stack-routing competence, not general Swahili fluency.
+ - Dataset: **CC BY 4.0** (usable by everyone, including commercial labs — that's the point). Harness: **MIT**.
+
+ ## IP & Collaboration
+
+ MIT-licensed harness, CC BY 4.0 data. Feedback via GitHub Issues only — pull requests are not accepted; task corrections and additions via Issues are actively wanted. Full policy: [docs/architecture/IP_POLICY.md](docs/architecture/IP_POLICY.md). Security: see [SECURITY.md](SECURITY.md).
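The three-command workflow in the README leaves the prediction step to whoever runs the evaluation. A minimal sketch of that step in Python follows, assuming a hypothetical `agent` callable that maps a task dict to a list of answer strings. `build_predictions` and `toy_agent` are illustrative names, not part of kipimo. Because `kipimo tasks` prints each task as stored, the rows carry the `gold` field, so the sketch drops it before the agent sees the task.

```python
from __future__ import annotations

import json
from typing import Callable, Iterable


def build_predictions(task_lines: Iterable[str],
                      agent: Callable[[dict], list[str]]) -> list[str]:
    """Turn task JSONL lines into prediction JSONL lines of {id, prediction}."""
    rows = []
    for line in task_lines:
        if not line.strip():
            continue  # tolerate blank lines
        task = json.loads(line)
        # Hide the gold answers from the system under evaluation.
        visible = {k: v for k, v in task.items() if k != "gold"}
        rows.append(json.dumps({"id": task["id"], "prediction": list(agent(visible))},
                               ensure_ascii=False))
    return rows


def toy_agent(task: dict) -> list[str]:
    """Hypothetical stand-in for a real model: one keyword rule, else no answer."""
    return ["mpesa-mcp"] if "pesa" in task["input"].lower() else []


if __name__ == "__main__":
    sample = ['{"id": "rt-01", "type": "server_routing", "lang": "sw", '
              '"input": "Nataka kutuma pesa kwa mama yangu kupitia simu.", '
              '"gold": ["mpesa-mcp"], "metric": "exact"}']
    for row in build_predictions(sample, toy_agent):
        print(row)
```

Writing the returned lines to `preds.jsonl` gives a file in the `{id, prediction: [...]}` shape that `kipimo score` reads.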
kipimo-0.1.0/pyproject.toml ADDED
@@ -0,0 +1,33 @@
+ [build-system]
+ requires = ["setuptools>=77", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "kipimo"
+ version = "0.1.0"
+ description = "Swahili agent-task evaluation suite for the East Africa coordination stack — model-agnostic seed benchmark (46 tasks: server routing, term grounding, cascade routing)."
+ readme = "README.md"
+ requires-python = ">=3.10"
+ license = "MIT"
+ license-files = ["LICENSE"]
+ authors = [{name = "AfriKaziOS maintainers"}]
+ keywords = ["swahili", "benchmark", "evaluation", "mcp", "kenya", "ai-agents"]
+
+ [project.urls]
+ Homepage = "https://github.com/gabrielmahia/kipimo"
+ Repository = "https://github.com/gabrielmahia/kipimo"
+ Issues = "https://github.com/gabrielmahia/kipimo/issues"
+
+ [project.scripts]
+ kipimo = "kipimo.cli:_main"
+
+ [tool.setuptools]
+ package-dir = {"" = "src"}
+ packages = ["kipimo"]
+
+ [tool.setuptools.package-data]
+ kipimo = ["data/*.jsonl"]
+
+ [tool.ruff]
+ target-version = "py310"
+ line-length = 100
kipimo-0.1.0/setup.cfg ADDED
@@ -0,0 +1,4 @@
+ [egg_info]
+ tag_build =
+ tag_date = 0
+
kipimo-0.1.0/src/kipimo/__init__.py ADDED
@@ -0,0 +1,6 @@
+ """kipimo — Swahili agent-task evaluation for the East Africa coordination stack."""
+
+ from .cli import __version__ as __version__
+ from .cli import load_tasks as load_tasks
+ from .cli import score_file as score_file
+ from .cli import score_one as score_one
kipimo-0.1.0/src/kipimo/cli.py ADDED
@@ -0,0 +1,126 @@
+ """kipimo — Swahili agent-task evaluation for the East Africa coordination stack.
+
+ Model-agnostic by design: kipimo emits tasks and scores prediction files. It
+ never calls a model API, so any lab, student, or vendor can evaluate any agent
+ against the same gold set. Golds are machine-derived from authoritative
+ sources (the stack registry; the africa-coord-bus routing table), never from
+ memory.
+
+ Usage:
+     kipimo tasks              # emit the task set (JSONL) to stdout
+     kipimo template           # emit an empty predictions file to fill in
+     kipimo score preds.jsonl  # score predictions against gold
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import os
+ import sys
+ from collections import defaultdict
+ from importlib import resources
+
+ __version__ = "0.1.0"
+
+ DISCLAIMER = ("kipimo v0.1 is a SEED benchmark (46 tasks). Swahili phrasing is "
+               "simple-register and pending native-speaker review (issue #1). "
+               "Scores indicate stack-routing competence, not general Swahili "
+               "fluency. Do not use as a sole deployment gate.")
+
+
+ def load_tasks() -> list[dict]:
+     """Load the bundled task set (one JSON object per non-blank line)."""
+     text = resources.files("kipimo").joinpath("data/kipimo_v0.1.jsonl").read_text("utf-8")
+     return [json.loads(line) for line in text.splitlines() if line.strip()]
+
+
+ def _norm(s: str) -> str:
+     """Lowercase and collapse runs of whitespace to single spaces."""
+     return " ".join(str(s).lower().split())
+
+
+ def score_one(task: dict, pred: list[str]) -> float:
+     """Score one prediction list against a task's gold answers, in [0, 1]."""
+     if isinstance(pred, str):
+         pred = [pred]  # a bare string would otherwise be iterated per character
+     gold = [_norm(g) for g in task["gold"]]
+     p = [_norm(x) for x in (pred or [])]
+     if task["metric"] in ("exact", "exact_ci"):
+         # Both metrics compare normalized strings; only the first prediction counts.
+         return 1.0 if p and p[0] in gold else 0.0
+     if task["metric"] == "set_f1":
+         gs, ps = set(gold), set(p)
+         if not ps or not gs:
+             return 0.0
+         tp = len(gs & ps)
+         prec, rec = tp / len(ps), tp / len(gs)
+         return 0.0 if tp == 0 else 2 * prec * rec / (prec + rec)
+     raise ValueError(f"unknown metric {task['metric']}")
+
+
+ def score_file(path: str) -> dict:
+     """Score a predictions JSONL file; tasks without a prediction score 0."""
+     preds = {}
+     with open(path, encoding="utf-8") as f:
+         for line in f:
+             if line.strip():
+                 row = json.loads(line)
+                 preds[row["id"]] = row.get("prediction", [])
+     tasks = load_tasks()
+     by_type: dict[str, list[float]] = defaultdict(list)
+     missing = []
+     for t in tasks:
+         if t["id"] not in preds:
+             missing.append(t["id"])
+             by_type[t["type"]].append(0.0)
+         else:
+             by_type[t["type"]].append(score_one(t, preds[t["id"]]))
+     report = {k: round(sum(v) / len(v), 4) for k, v in by_type.items()}
+     # "overall" is the mean over all tasks, not the mean of the per-type means.
+     allv = [x for v in by_type.values() for x in v]
+     report["overall"] = round(sum(allv) / len(allv), 4)
+     report["n_tasks"] = len(tasks)
+     report["n_missing"] = len(missing)
+     return report
+
+
+ def main(argv: list[str] | None = None) -> int:
+     """Parse arguments and run the selected subcommand."""
+     p = argparse.ArgumentParser(prog="kipimo", description=__doc__.split("\n")[0],
+                                 epilog=DISCLAIMER)
+     sub = p.add_subparsers(dest="cmd", required=True)
+     sub.add_parser("tasks", help="emit task set JSONL to stdout")
+     sub.add_parser("template", help="emit empty predictions JSONL to stdout")
+     sp = sub.add_parser("score", help="score a predictions file")
+     sp.add_argument("predictions", help="JSONL with {id, prediction:[...]} rows")
+     args = p.parse_args(argv)
+
+     if args.cmd == "tasks":
+         for t in load_tasks():
+             print(json.dumps(t, ensure_ascii=False))
+     elif args.cmd == "template":
+         for t in load_tasks():
+             print(json.dumps({"id": t["id"], "prediction": []}))
+     else:
+         rep = score_file(args.predictions)
+         print(json.dumps(rep, indent=2))
+         print(f"\n{DISCLAIMER}", file=sys.stderr)
+         if rep["n_missing"]:
+             print(f"Note: {rep['n_missing']} task(s) had no prediction and scored 0. "
+                   f"Run `kipimo template` for the full id list.", file=sys.stderr)
+     return 0
+
+
+ def _main() -> int:
+     """Console-script entry point: exit quietly when stdout is closed early."""
+     try:
+         return main()
+     except BrokenPipeError:
+         # Point stdout at devnull so the interpreter's exit-time flush cannot fail.
+         os.dup2(os.open(os.devnull, os.O_WRONLY), sys.stdout.fileno())
+         return 0
+
+
+ if __name__ == "__main__":
+     raise SystemExit(_main())
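For `cascade_routing` tasks, `score_one` computes F1 over the sets of normalized gold and predicted sectors. The same arithmetic is sketched below as a standalone, hypothetical `set_f1` helper (not a kipimo export) so that it runs without the package installed.

```python
from __future__ import annotations


def _norm(s: str) -> str:
    """Lowercase and collapse whitespace, mirroring the harness's normalization."""
    return " ".join(str(s).lower().split())


def set_f1(gold: list[str], pred: list[str]) -> float:
    """F1 between the gold and predicted label sets; 0.0 when nothing overlaps."""
    gs, ps = {_norm(g) for g in gold}, {_norm(p) for p in pred}
    tp = len(gs & ps)  # labels present in both sets
    if tp == 0:
        return 0.0
    precision, recall = tp / len(ps), tp / len(gs)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = ["agriculture", "finance", "health"]
    print(set_f1(gold, ["health", "finance", "water"]))  # two of three correct
    print(set_f1(gold, ["Health"]))                      # one correct, two missed
```

With gold {agriculture, finance, health}, predicting {health, finance, water} gives precision 2/3 and recall 2/3, so F1 is 2/3. Predicting only `Health` gives precision 1 and recall 1/3, so F1 is 1/2. Over-predicting and under-predicting are both penalized.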
kipimo-0.1.0/src/kipimo/data/kipimo_v0.1.jsonl ADDED
@@ -0,0 +1,46 @@
+ {"id": "rt-01", "type": "server_routing", "lang": "sw", "input": "Nataka kutuma pesa kwa mama yangu kupitia simu.", "options": null, "gold": ["mpesa-mcp"], "metric": "exact"}
+ {"id": "rt-02", "type": "server_routing", "lang": "sw", "input": "Ninahitaji kujua deni langu la ushuru KRA.", "options": null, "gold": ["kra-mcp"], "metric": "exact"}
+ {"id": "rt-03", "type": "server_routing", "lang": "sw", "input": "Nataka mkopo mdogo kwa biashara yangu ya mboga.", "options": null, "gold": ["mkopo-mcp"], "metric": "exact"}
+ {"id": "rt-04", "type": "server_routing", "lang": "sw", "input": "Je, kuna bima ya mazao dhidi ya ukame?", "options": null, "gold": ["bima-mcp"], "metric": "exact"}
+ {"id": "rt-05", "type": "server_routing", "lang": "sw", "input": "Bei ya mahindi sokoni Nakuru ni ngapi leo?", "options": null, "gold": ["soko-mcp"], "metric": "exact"}
+ {"id": "rt-06", "type": "server_routing", "lang": "sw", "input": "Nataka kutuma pesa kutoka Marekani hadi Kenya kwa gharama nafuu.", "options": null, "gold": ["remit-mcp"], "metric": "exact"}
+ {"id": "rt-07", "type": "server_routing", "lang": "sw", "input": "Ni wapi hospitali iliyo karibu inayotoa huduma za upasuaji?", "options": null, "gold": ["kenya-health-mcp"], "metric": "exact"}
+ {"id": "rt-08", "type": "server_routing", "lang": "sw", "input": "Mtoto wangu ana homa na kikohozi, nifanye nini?", "options": null, "gold": ["afya-mcp"], "metric": "exact"}
+ {"id": "rt-09", "type": "server_routing", "lang": "sw", "input": "Ninahisi huzuni sana na msongo wa mawazo, nipate msaada wapi?", "options": null, "gold": ["afya-ya-akili-mcp"], "metric": "exact"}
+ {"id": "rt-10", "type": "server_routing", "lang": "sw", "input": "Nipande mahindi lini msimu huu Uasin Gishu?", "options": null, "gold": ["kilimo-mcp"], "metric": "exact"}
+ {"id": "rt-11", "type": "server_routing", "lang": "sw", "input": "Hali ya ukame kaunti ya Baringo ikoje sasa?", "options": null, "gold": ["wapimaji-mcp"], "metric": "exact"}
+ {"id": "rt-12", "type": "server_routing", "lang": "sw", "input": "Nahitaji kibali cha mazingira kwa kiwanda changu kidogo.", "options": null, "gold": ["mazingira-mcp"], "metric": "exact"}
+ {"id": "rt-13", "type": "server_routing", "lang": "sw", "input": "Huduma za kaunti ya Machakos ni zipi na bajeti yake?", "options": null, "gold": ["county-mcp"], "metric": "exact"}
+ {"id": "rt-14", "type": "server_routing", "lang": "sw", "input": "Nifanye nini kupata cheti cha kuzaliwa? Fomu gani inahitajika?", "options": null, "gold": ["fomu-mcp"], "metric": "exact"}
+ {"id": "rt-15", "type": "server_routing", "lang": "sw", "input": "Mwajiri wangu amenifukuza bila notisi. Haki zangu ni zipi?", "options": null, "gold": ["haki-ya-kazi-mcp"], "metric": "exact"}
+ {"id": "rt-16", "type": "server_routing", "lang": "sw", "input": "Natafuta kazi ya udereva Nairobi.", "options": null, "gold": ["kazi-mcp"], "metric": "exact"}
+ {"id": "rt-17", "type": "server_routing", "lang": "sw", "input": "Shule za sekondari Kiambu na matokeo ya KCSE.", "options": null, "gold": ["elimu-mcp"], "metric": "exact"}
+ {"id": "rt-18", "type": "server_routing", "lang": "sw", "input": "Tafsiri sentensi hii kwa Kiingereza tafadhali.", "options": null, "gold": ["tafsiri-mcp"], "metric": "exact"}
+ {"id": "rt-19", "type": "server_routing", "lang": "sw", "input": "Nataka kuhakiki ujuzi na vyeti vya mfanyakazi ninayemwajiri.", "options": null, "gold": ["sifa-mcp"], "metric": "exact"}
+ {"id": "rt-20", "type": "server_routing", "lang": "sw", "input": "Nyaraka za shamba langu Kiambu — nithibitishe umiliki.", "options": null, "gold": ["ardhi-mcp"], "metric": "exact"}
+ {"id": "rt-21", "type": "server_routing", "lang": "sw", "input": "Nyumba ya kupanga Ruaka bei gani kwa mwezi?", "options": null, "gold": ["nyumba-mcp"], "metric": "exact"}
+ {"id": "rt-22", "type": "server_routing", "lang": "sw", "input": "Njia za matatu kutoka CBD hadi Rongai?", "options": null, "gold": ["usafiri-mcp"], "metric": "exact"}
+ {"id": "rt-23", "type": "server_routing", "lang": "sw", "input": "Niko diaspora — nasimamia vipi mali yangu Kenya nikiwa nje?", "options": null, "gold": ["diaspora-mcp"], "metric": "exact"}
+ {"id": "rt-24", "type": "server_routing", "lang": "sw", "input": "Tunataka kuanzisha chama cha kuweka akiba mtaani.", "options": null, "gold": ["jumuia-mcp"], "metric": "exact"}
+ {"id": "rt-25", "type": "server_routing", "lang": "sw", "input": "Umeme haujafika kijiji chetu — kuna mpango gani wa nishati?", "options": null, "gold": ["nishati-mcp"], "metric": "exact"}
+ {"id": "tm-01", "type": "term_grounding", "lang": "sw", "input": "Neno la Kiswahili 'mkopo' linamaanisha nini kwa Kiingereza (neno moja au mawili)?", "gold": ["credit/loan"], "metric": "exact_ci"}
+ {"id": "tm-02", "type": "term_grounding", "lang": "sw", "input": "Neno la Kiswahili 'bima' linamaanisha nini kwa Kiingereza (neno moja au mawili)?", "gold": ["insurance"], "metric": "exact_ci"}
+ {"id": "tm-03", "type": "term_grounding", "lang": "sw", "input": "Neno la Kiswahili 'faida' linamaanisha nini kwa Kiingereza (neno moja au mawili)?", "gold": ["profit/returns"], "metric": "exact_ci"}
+ {"id": "tm-04", "type": "term_grounding", "lang": "sw", "input": "Neno la Kiswahili 'soko' linamaanisha nini kwa Kiingereza (neno moja au mawili)?", "gold": ["market"], "metric": "exact_ci"}
+ {"id": "tm-05", "type": "term_grounding", "lang": "sw", "input": "Neno la Kiswahili 'afya' linamaanisha nini kwa Kiingereza (neno moja au mawili)?", "gold": ["health"], "metric": "exact_ci"}
+ {"id": "tm-06", "type": "term_grounding", "lang": "sw", "input": "Neno la Kiswahili 'afya ya akili' linamaanisha nini kwa Kiingereza (neno moja au mawili)?", "gold": ["mental health"], "metric": "exact_ci"}
+ {"id": "tm-07", "type": "term_grounding", "lang": "sw", "input": "Neno la Kiswahili 'kilimo' linamaanisha nini kwa Kiingereza (neno moja au mawili)?", "gold": ["agriculture"], "metric": "exact_ci"}
+ {"id": "tm-08", "type": "term_grounding", "lang": "sw", "input": "Neno la Kiswahili 'wapimaji' linamaanisha nini kwa Kiingereza (neno moja au mawili)?", "gold": ["water"], "metric": "exact_ci"}
+ {"id": "tm-09", "type": "term_grounding", "lang": "sw", "input": "Neno la Kiswahili 'mazingira' linamaanisha nini kwa Kiingereza (neno moja au mawili)?", "gold": ["environment"], "metric": "exact_ci"}
+ {"id": "tm-10", "type": "term_grounding", "lang": "sw", "input": "Neno la Kiswahili 'fomu' linamaanisha nini kwa Kiingereza (neno moja au mawili)?", "gold": ["forms"], "metric": "exact_ci"}
+ {"id": "tm-11", "type": "term_grounding", "lang": "sw", "input": "Neno la Kiswahili 'haki ya kazi' linamaanisha nini kwa Kiingereza (neno moja au mawili)?", "gold": ["labour rights"], "metric": "exact_ci"}
+ {"id": "tm-12", "type": "term_grounding", "lang": "sw", "input": "Neno la Kiswahili 'kazi' linamaanisha nini kwa Kiingereza (neno moja au mawili)?", "gold": ["work"], "metric": "exact_ci"}
+ {"id": "tm-13", "type": "term_grounding", "lang": "sw", "input": "Neno la Kiswahili 'habari' linamaanisha nini kwa Kiingereza (neno moja au mawili)?", "gold": ["news"], "metric": "exact_ci"}
+ {"id": "tm-14", "type": "term_grounding", "lang": "sw", "input": "Neno la Kiswahili 'elimu' linamaanisha nini kwa Kiingereza (neno moja au mawili)?", "gold": ["education"], "metric": "exact_ci"}
+ {"id": "cas-01", "type": "cascade_routing", "lang": "sw", "input": "Ukame umefikia hatua ya tahadhari Baringo. Taarifa hii inapaswa kufika kwenye sekta zipi?", "event": {"domain": "water", "event_type": "drought_alert", "severity": "alert"}, "gold": ["agriculture", "finance", "health"], "metric": "set_f1"}
+ {"id": "cas-02", "type": "cascade_routing", "lang": "sw", "input": "Ukame ni hatari sana Turkana. Sekta zipi zinapaswa kuarifiwa?", "event": {"domain": "water", "event_type": "drought_alert", "severity": "critical"}, "gold": ["agriculture", "finance", "health"], "metric": "set_f1"}
+ {"id": "cas-03", "type": "cascade_routing", "lang": "sw", "input": "Mlipuko wa kipindupindu Kisumu. Taarifa ipelekwe kwenye sekta zipi?", "event": {"domain": "health", "event_type": "disease_outbreak", "severity": "alert"}, "gold": ["civic", "procurement"], "metric": "set_f1"}
+ {"id": "cas-04", "type": "cascade_routing", "lang": "sw", "input": "Onyo la mafuriko bonde la Tana. Sekta zipi zihusishwe?", "event": {"domain": "water", "event_type": "flood_alert", "severity": "warning"}, "gold": ["civic", "health"], "metric": "set_f1"}
+ {"id": "cas-05", "type": "cascade_routing", "lang": "sw", "input": "Mafuriko makubwa Tana River. Nani apokee taarifa?", "event": {"domain": "water", "event_type": "flood_alert", "severity": "critical"}, "gold": ["civic", "health"], "metric": "set_f1"}
+ {"id": "cas-06", "type": "cascade_routing", "lang": "sw", "input": "Mlipuko mkubwa wa ugonjwa. Sekta zipi zinahusika?", "event": {"domain": "health", "event_type": "disease_outbreak", "severity": "critical"}, "gold": ["civic", "procurement"], "metric": "set_f1"}
+ {"id": "cas-07", "type": "cascade_routing", "lang": "sw", "input": "Dalili za ukame zimeanza Kajiado. Wapi taarifa ipelekwe?", "event": {"domain": "water", "event_type": "drought_alert", "severity": "warning"}, "gold": ["agriculture", "finance"], "metric": "set_f1"}
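The split across task types is uneven (25 routing, 14 term, and 7 cascade rows), and `score_file` computes `overall` as the mean over all 46 tasks rather than the mean of the three per-type scores. A small sketch of the difference follows, using made-up per-task scores and a hypothetical `summarize` helper that is not part of kipimo.

```python
from __future__ import annotations

from collections import defaultdict


def summarize(scored: list[tuple[str, float]]) -> dict[str, float]:
    """Per-type means plus the task-weighted overall mean."""
    by_type: dict[str, list[float]] = defaultdict(list)
    for task_type, score in scored:
        by_type[task_type].append(score)
    report = {t: sum(v) / len(v) for t, v in by_type.items()}
    # Every task counts once, so larger types weigh more in "overall".
    report["overall"] = sum(s for _, s in scored) / len(scored)
    return report


if __name__ == "__main__":
    # Illustrative scores only: 20/25 routing, 7/14 term, 7/7 cascade correct.
    scored = ([("server_routing", 1.0)] * 20 + [("server_routing", 0.0)] * 5
              + [("term_grounding", 1.0)] * 7 + [("term_grounding", 0.0)] * 7
              + [("cascade_routing", 1.0)] * 7)
    rep = summarize(scored)
    macro = (rep["server_routing"] + rep["term_grounding"] + rep["cascade_routing"]) / 3
    print(f"overall={rep['overall']:.4f} macro={macro:.4f}")
```

In this made-up case the task-weighted `overall` is 34/46, while the unweighted mean of the three type scores is (0.8 + 0.5 + 1.0)/3. Reporting the per-type numbers alongside `overall` avoids confusing the two.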
kipimo-0.1.0/src/kipimo.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,13 @@
+ LICENSE
+ README.md
+ pyproject.toml
+ src/kipimo/__init__.py
+ src/kipimo/cli.py
+ src/kipimo.egg-info/PKG-INFO
+ src/kipimo.egg-info/SOURCES.txt
+ src/kipimo.egg-info/dependency_links.txt
+ src/kipimo.egg-info/entry_points.txt
+ src/kipimo.egg-info/top_level.txt
+ src/kipimo/data/kipimo_v0.1.jsonl
+ tests/test_kipimo.py
+ tests/test_smoke.py
kipimo-0.1.0/src/kipimo.egg-info/entry_points.txt ADDED
@@ -0,0 +1,2 @@
+ [console_scripts]
+ kipimo = kipimo.cli:_main
kipimo-0.1.0/src/kipimo.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
+ kipimo
kipimo-0.1.0/tests/test_kipimo.py ADDED
@@ -0,0 +1,47 @@
+ """Integrity + scoring tests for the kipimo seed benchmark."""
+ import json
+ import subprocess
+ import sys
+
+ from kipimo import load_tasks, score_one
+
+
+ def test_dataset_integrity():
+     tasks = load_tasks()
+     assert len(tasks) == 46
+     ids = [t["id"] for t in tasks]
+     assert len(ids) == len(set(ids)), "duplicate ids"
+     for t in tasks:
+         assert t["type"] in ("server_routing", "term_grounding", "cascade_routing")
+         assert t["gold"], t["id"]
+         assert t["metric"] in ("exact", "exact_ci", "set_f1")
+         assert t["input"].strip()
+
+
+ def test_exact_scoring():
+     t = {"gold": ["mpesa-mcp"], "metric": "exact"}
+     assert score_one(t, ["mpesa-mcp"]) == 1.0
+     assert score_one(t, ["MPESA-MCP"]) == 1.0  # normalized
+     assert score_one(t, ["kra-mcp"]) == 0.0
+     assert score_one(t, []) == 0.0
+
+
+ def test_set_f1_scoring():
+     t = {"gold": ["health", "finance"], "metric": "set_f1"}
+     assert score_one(t, ["health", "finance"]) == 1.0
+     assert 0 < score_one(t, ["health"]) < 1.0
+     assert score_one(t, ["water"]) == 0.0
+
+
+ def test_cli_end_to_end(tmp_path):
+     r = subprocess.run([sys.executable, "-m", "kipimo.cli", "template"],
+                        capture_output=True, text=True)
+     assert r.returncode == 0
+     preds = tmp_path / "p.jsonl"
+     # perfect predictions = gold copied in
+     lines = [json.dumps({"id": t["id"], "prediction": t["gold"]}) for t in load_tasks()]
+     preds.write_text("\n".join(lines))
+     r = subprocess.run([sys.executable, "-m", "kipimo.cli", "score", str(preds)],
+                        capture_output=True, text=True)
+     rep = json.loads(r.stdout)
+     assert rep["overall"] == 1.0 and rep["n_missing"] == 0
kipimo-0.1.0/tests/test_smoke.py ADDED
@@ -0,0 +1,12 @@
+ import ast
+ import pathlib
+
+
+ def test_all_sources_parse():
+     root = pathlib.Path(__file__).parent.parent / "src"
+     for f in root.rglob("*.py"):
+         ast.parse(f.read_text(encoding="utf-8"))
+
+
+ def test_package_importable():
+     import kipimo  # noqa: F401