langtell 0.0.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +91 -0
- package/package.json +20 -0
package/LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Oleksandr Zhuravlov
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
package/README.md
ADDED
@@ -0,0 +1,91 @@
+# langtell
+
+> Tell me the language.
+
+`langtell` infers the language of short strings — titles, snippets, headlines —
+by **fusing evidence from many signals** into a single weighted verdict with a
+confidence score and an auditable trail. It reads the *tells*: the script and
+distinctive letters of the text, the `<html lang>` / `og:locale` / meta tags of
+the page it came from, the HTTP `Content-Language` header, and — optionally —
+heavier statistical engines like [franc](https://github.com/wooorm/franc) or the
+on-device Chrome AI language detector.
+
+It is **not** another trigram detector competing with franc/cld3/tinyld. Those
+answer *"what language is this body of text?"* from the characters alone.
+`langtell` answers *"what language is this **title**, given the page, transport,
+and source it arrived in?"* — and shows its work.
+
+> **Status:** design preview. The API below is the committed design; the
+> implementation is in progress. This `0.0.x` release reserves the name and
+> documents the design — it has no runtime yet.
+
+## Why
+
+- **Short strings beat statistical detectors.** A two-word title gives franc too
+  little to chew on. `langtell` leans on script ranges, distinctive letters, and
+  out-of-band metadata that a pure text detector never sees.
+- **Auditable, not magic.** Every verdict carries the list of signals that
+  produced it (`evidence[]`), each with its kind, language, confidence, and raw
+  value — so you can debug *why* a title was classified the way it was.
+- **Pay only for what you use.** The zero-dependency core (script + HTML + header
+  signals) is fully synchronous. Heavy engines (franc's trigram tables, the
+  browser detector) live behind their own subpaths and only enter your bundle —
+  and only run — when you opt in.
+
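The first bullet above says short titles are classified from script ranges and distinctive letters rather than trigram statistics. The following is a minimal sketch of that idea for the Ukrainian/Russian pair used in the README's example string. The `Evidence` property names, `sketchEvidenceFromText`, the letter sets, and the 0.9 confidence value are all assumptions made for illustration; only the `title-script` kind and the four evidence fields are named in the README, and the package itself ships no runtime yet.

```ts
// Hypothetical evidence shape: the README lists these four fields but not
// their exact property names.
interface Evidence {
  kind: string;
  language: string;
  confidence: number;
  value: string;
}

// Basic Cyrillic block only; an assumption, real script ranges are wider.
const CYRILLIC = /[\u0400-\u04FF]/;
// Lowercase letters used in Ukrainian but not in Russian: є, і, ї, ґ.
const UKRAINIAN_ONLY = /[\u0454\u0456\u0457\u0491]/;
// Lowercase letters used in Russian but not in Ukrainian: ъ, ы, э, ё.
const RUSSIAN_ONLY = /[\u044A\u044B\u044D\u0451]/;

function sketchEvidenceFromText(text: string): Evidence[] {
  const lower = text.toLowerCase();
  if (!CYRILLIC.test(lower)) return [];

  const uk = lower.match(UKRAINIAN_ONLY);
  const ru = lower.match(RUSSIAN_ONLY);

  // A distinctive letter from exactly one alphabet is treated as a strong tell.
  // The 0.9 confidence is an arbitrary illustrative value.
  if (uk && !ru) {
    return [{ kind: "title-script", language: "uk", confidence: 0.9, value: uk[0] }];
  }
  if (ru && !uk) {
    return [{ kind: "title-script", language: "ru", confidence: 0.9, value: ru[0] }];
  }
  // Cyrillic with no distinctive letters, or with conflicting ones: no signal.
  return [];
}

const tells = sketchEvidenceFromText("Їжак Сонік");
```

Two words are enough here because the decision rests on a single letter, which is the case where a statistical detector has the least to work with. A sketch like this only separates the two candidates it was written for; other Cyrillic-script languages share some of these letters.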
+## Quick start
+
+```ts
+import { compile } from "langtell";
+
+// compile() does the per-roster setup once; call the returned fn many times.
+const detect = compile({ candidates: [UK, RU, EN] });
+
+const result = detect({
+  text: "Їжак Сонік",
+  html, // optional: <html lang>, og:locale, meta content-language
+  responseHeaders, // optional: HTTP Content-Language
+});
+// → { language: "uk", confidence: 0.9x, evidence: [{ kind: "title-script", ... }, ...] }
+```
+
+Add a heavy engine — it stays behind its own import door, and the return type
+becomes `Promise` automatically because the engine is async:
+
+```ts
+import { compile } from "langtell";
+import { francEngine } from "langtell/franc";
+
+const detect = compile({ candidates: [UK, RU, EN], engines: [francEngine] });
+const result = await detect({ text, html, responseHeaders });
+```
+
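The Quick start says the return type becomes `Promise` automatically once an async engine is registered. One way such behavior could be typed is with a conditional type over the engines' return types. The sketch below is an assumption about how that might look, not the package's implementation: `sketchCompile`, `Detect`, `Engine`, `coreClassify`, `best`, and `exampleAsyncEngine` are hypothetical names, and the "highest confidence wins" merge is a placeholder for the real fusion step.

```ts
interface Input { text: string; }
interface Classification { language: string; confidence: number; }

type SyncEngine = (input: Input) => Classification;
type AsyncEngine = (input: Input) => Promise<Classification>;
type Engine = SyncEngine | AsyncEngine;

// If every engine returns a plain Classification, so does detect; otherwise
// detect returns a Promise. The tuple wrapper stops the conditional type from
// distributing over a union of engine types.
type Detect<E extends Engine> = [ReturnType<E>] extends [Classification]
  ? (input: Input) => Classification
  : (input: Input) => Promise<Classification>;

// Stand-in for the synchronous core; "und" is the BCP 47 tag for undetermined.
const coreClassify = (_input: Input): Classification => ({ language: "und", confidence: 0 });

// Placeholder merge: keep the result with the highest confidence.
const best = (all: Classification[]): Classification =>
  all.reduce((a, b) => (b.confidence > a.confidence ? b : a));

function sketchCompile<E extends Engine = never>(engines: readonly E[] = []): Detect<E> {
  const detect = (input: Input): Classification | Promise<Classification> => {
    const core = coreClassify(input);
    const results = engines.map((engine) => engine(input));
    if (results.some((r) => r instanceof Promise)) {
      return Promise.all(results).then((resolved) => best([core, ...resolved]));
    }
    return best([core, ...(results as Classification[])]);
  };
  // The runtime function is the same either way; the cast selects the static type.
  return detect as unknown as Detect<E>;
}

// Hypothetical async engine for the usage example.
const exampleAsyncEngine = async (_input: Input): Promise<Classification> =>
  ({ language: "en", confidence: 0.6 });

const syncDetect = sketchCompile();                      // (input) => Classification
const asyncDetect = sketchCompile([exampleAsyncEngine]); // (input) => Promise<Classification>
```

With this typing, `syncDetect({ text }).language` compiles without `await`, while the result of `asyncDetect` must be awaited before its fields can be read, so the compiler enforces the distinction the README describes.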
+## API at a glance
+
+| Export | Role |
+| --- | --- |
+| `compile(config)` | Build a configured `detect` function (does the precompute once). |
+| `detect(input)` | The compiled detector. Sync or `Promise`, by config — see below. |
+| `evidenceFromText(text)` | Producer: script + distinctive-letter signals. Zero-dep, sync. |
+| `evidenceFromHtml(html)` | Producer: `<html lang>`, meta content-language, `og:locale`. Zero-dep, sync. |
+| `evidenceFromHeaders(h)` | Producer: HTTP `Content-Language`. Zero-dep, sync. |
+| `fuse(evidence, opts?)` | Weighted blend + "context never overrides clear script" guard. |
+| `langtell/franc` | Opt-in franc engine (pulls trigram tables). |
+| `langtell/chrome-ai` | Opt-in on-device Chrome AI engine (browser). |
+
+`detect` returns a plain `Classification` when every registered source is
+synchronous, and `Promise<Classification>` the moment an async engine is in the
+mix — the type reflects the config, so you never guess whether to `await`. See
+[DESIGN.md](./DESIGN.md) for the full architecture.
+
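The `fuse` row in the table above describes a weighted blend with a guard so that page or transport context never overrides a clear script signal. The sketch below shows one plausible reading of that sentence. Everything in it is an assumption: `sketchFuse`, the weight table, the 0.9 threshold for "clear", and every evidence kind other than `title-script` are hypothetical, and the referenced DESIGN.md may specify something different.

```ts
interface Evidence { kind: string; language: string; confidence: number; value: string; }
interface Classification { language: string; confidence: number; evidence: Evidence[]; }

// Hypothetical per-kind weights; kinds not listed default to 1.
const WEIGHTS: Record<string, number> = { "title-script": 3 };
// Hypothetical threshold above which a script signal counts as "clear".
const CLEAR_SCRIPT = 0.9;

function sketchFuse(evidence: Evidence[]): Classification {
  const scores = new Map<string, number>();
  let total = 0;
  for (const e of evidence) {
    const weighted = (WEIGHTS[e.kind] ?? 1) * e.confidence;
    scores.set(e.language, (scores.get(e.language) ?? 0) + weighted);
    total += weighted;
  }

  // Weighted blend: the language with the highest summed score wins.
  let language = "und";
  let top = 0;
  scores.forEach((score, lang) => {
    if (score > top) { language = lang; top = score; }
  });

  // Guard: a clear script signal is not outvoted by page or transport context.
  const clear = evidence.find((e) => e.kind === "title-script" && e.confidence >= CLEAR_SCRIPT);
  if (clear) language = clear.language;

  // Confidence is the chosen language's share of the total weighted score.
  const confidence = total > 0 ? (scores.get(language) ?? 0) / total : 0;
  return { language, confidence, evidence };
}

// A Ukrainian title on a page whose metadata all claims Russian.
const verdict = sketchFuse([
  { kind: "title-script", language: "uk", confidence: 0.95, value: "ї" },
  { kind: "html-lang", language: "ru", confidence: 1, value: "ru" },
  { kind: "og-locale", language: "ru", confidence: 1, value: "ru_RU" },
  { kind: "content-language", language: "ru", confidence: 1, value: "ru" },
]);
```

In this example the three context signals together outscore the script signal in the blend, so the guard is what keeps the verdict on `uk`. Returning the input `evidence` array unchanged alongside the verdict is what makes the result auditable in the sense the README describes.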
+## Prior art
+
+- [`franc`](https://github.com/wooorm/franc) — trigram detection over 400+
+  languages. `langtell` can use it as one engine, but works on short strings
+  where franc has too little signal, and fuses it with page/transport metadata.
+- `cld3`, `tinyld`, `languagedetect` — statistical text-only detectors.
+  `langtell` differs by combining script logic with out-of-band evidence and
+  emitting an auditable trail.
+
+## License
+
+[MIT](./LICENSE)
package/package.json
ADDED
@@ -0,0 +1,20 @@
+{
+  "name": "langtell",
+  "version": "0.0.1",
+  "description": "Tell me the language — evidence-fusion language detection for short strings, with an auditable confidence trail.",
+  "type": "module",
+  "license": "MIT",
+  "sideEffects": false,
+  "keywords": [
+    "language",
+    "language-detection",
+    "language-identification",
+    "i18n",
+    "locale",
+    "bcp47",
+    "script-detection"
+  ],
+  "files": [
+    "dist"
+  ]
+}