datemonkey 0.1.0__tar.gz → 0.2.0__tar.gz
- {datemonkey-0.1.0/src/datemonkey.egg-info → datemonkey-0.2.0}/PKG-INFO +20 -1
- {datemonkey-0.1.0 → datemonkey-0.2.0}/README.md +13 -0
- {datemonkey-0.1.0 → datemonkey-0.2.0}/pyproject.toml +12 -1
- {datemonkey-0.1.0 → datemonkey-0.2.0}/src/datemonkey/__init__.py +1 -1
- {datemonkey-0.1.0 → datemonkey-0.2.0}/src/datemonkey/cli.py +43 -20
- {datemonkey-0.1.0 → datemonkey-0.2.0}/src/datemonkey/detector.py +73 -36
- {datemonkey-0.1.0 → datemonkey-0.2.0}/src/datemonkey/excel.py +45 -19
- {datemonkey-0.1.0 → datemonkey-0.2.0}/src/datemonkey/formats.py +30 -2
- {datemonkey-0.1.0 → datemonkey-0.2.0}/src/datemonkey/models.py +5 -1
- {datemonkey-0.1.0 → datemonkey-0.2.0}/src/datemonkey/parser.py +59 -17
- {datemonkey-0.1.0 → datemonkey-0.2.0/src/datemonkey.egg-info}/PKG-INFO +20 -1
- {datemonkey-0.1.0 → datemonkey-0.2.0}/src/datemonkey.egg-info/SOURCES.txt +3 -1
- datemonkey-0.2.0/src/datemonkey.egg-info/requires.txt +7 -0
- datemonkey-0.2.0/tests/test_cli.py +112 -0
- datemonkey-0.2.0/tests/test_detector.py +288 -0
- datemonkey-0.2.0/tests/test_excel.py +169 -0
- datemonkey-0.2.0/tests/test_parser.py +250 -0
- datemonkey-0.2.0/tests/test_properties.py +125 -0
- datemonkey-0.1.0/tests/test_cli.py +0 -68
- datemonkey-0.1.0/tests/test_detector.py +0 -137
- datemonkey-0.1.0/tests/test_excel.py +0 -87
- datemonkey-0.1.0/tests/test_parser.py +0 -134
- {datemonkey-0.1.0 → datemonkey-0.2.0}/LICENSE +0 -0
- {datemonkey-0.1.0 → datemonkey-0.2.0}/setup.cfg +0 -0
- {datemonkey-0.1.0 → datemonkey-0.2.0}/src/datemonkey.egg-info/dependency_links.txt +0 -0
- {datemonkey-0.1.0 → datemonkey-0.2.0}/src/datemonkey.egg-info/entry_points.txt +0 -0
- {datemonkey-0.1.0 → datemonkey-0.2.0}/src/datemonkey.egg-info/top_level.txt +0 -0
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: datemonkey
|
|
3
|
-
Version: 0.
|
|
3
|
+
Version: 0.2.0
|
|
4
4
|
Summary: Batch date parsing with ambiguity detection, confidence scores, and format lock-in.
|
|
5
5
|
Author-email: RexBytes <pythonic@rexbytes.com>
|
|
6
6
|
License-Expression: MIT
|
|
@@ -21,6 +21,12 @@ Classifier: Topic :: Text Processing
|
|
|
21
21
|
Requires-Python: >=3.9
|
|
22
22
|
Description-Content-Type: text/markdown
|
|
23
23
|
License-File: LICENSE
|
|
24
|
+
Provides-Extra: dev
|
|
25
|
+
Requires-Dist: pytest; extra == "dev"
|
|
26
|
+
Requires-Dist: pytest-cov; extra == "dev"
|
|
27
|
+
Requires-Dist: hypothesis; extra == "dev"
|
|
28
|
+
Requires-Dist: ruff; extra == "dev"
|
|
29
|
+
Requires-Dist: mypy; extra == "dev"
|
|
24
30
|
Dynamic: license-file
|
|
25
31
|
|
|
26
32
|
# datemonkey
|
|
@@ -193,6 +199,19 @@ Convert an Excel serial date number to a Python datetime.
|
|
|
193
199
|
|
|
194
200
|
datemonkey is designed to work well as a tool for large language models. Date parsing is a common source of silent errors in LLM-driven data pipelines — ambiguous formats lead to wrong guesses, wasted tokens on retries, and broken downstream logic. datemonkey reduces that complexity: a single call returns a structured result with the detected format, confidence level, and any ambiguities — no multi-step prompting or validation loops required. Fewer tokens in, reliable answers out.
|
|
195
201
|
|
|
202
|
+
## Changelog
|
|
203
|
+
|
|
204
|
+
See [CHANGELOG.md](CHANGELOG.md) for release history.
|
|
205
|
+
|
|
206
|
+
## Development & review
|
|
207
|
+
|
|
208
|
+
datemonkey is hardened with a competitive multi-model review methodology. The
|
|
209
|
+
self-contained kit lives in [`review-kit/`](review-kit/):
|
|
210
|
+
|
|
211
|
+
- [`review-kit/CONTRIBUTING.md`](review-kit/CONTRIBUTING.md) — testing philosophy and the review-panel process
|
|
212
|
+
- [`review-kit/LIMITATIONS.md`](review-kit/LIMITATIONS.md) — **deliberate** design tradeoffs (DD/MM ambiguity, the Excel leap-year bug, the two-digit-year pivot, format lock-in). Read this before "fixing" behaviour that looks wrong.
|
|
213
|
+
- [`review-kit/RELEASE_READINESS.md`](review-kit/RELEASE_READINESS.md) — the release rubric; run `python review-kit/scripts/readiness.py`.
|
|
214
|
+
|
|
196
215
|
## License
|
|
197
216
|
|
|
198
217
|
MIT
|
|
@@ -168,6 +168,19 @@ Convert an Excel serial date number to a Python datetime.
|
|
|
168
168
|
|
|
169
169
|
datemonkey is designed to work well as a tool for large language models. Date parsing is a common source of silent errors in LLM-driven data pipelines — ambiguous formats lead to wrong guesses, wasted tokens on retries, and broken downstream logic. datemonkey reduces that complexity: a single call returns a structured result with the detected format, confidence level, and any ambiguities — no multi-step prompting or validation loops required. Fewer tokens in, reliable answers out.
|
|
170
170
|
|
|
171
|
+
## Changelog
|
|
172
|
+
|
|
173
|
+
See [CHANGELOG.md](CHANGELOG.md) for release history.
|
|
174
|
+
|
|
175
|
+
## Development & review
|
|
176
|
+
|
|
177
|
+
datemonkey is hardened with a competitive multi-model review methodology. The
|
|
178
|
+
self-contained kit lives in [`review-kit/`](review-kit/):
|
|
179
|
+
|
|
180
|
+
- [`review-kit/CONTRIBUTING.md`](review-kit/CONTRIBUTING.md) — testing philosophy and the review-panel process
|
|
181
|
+
- [`review-kit/LIMITATIONS.md`](review-kit/LIMITATIONS.md) — **deliberate** design tradeoffs (DD/MM ambiguity, the Excel leap-year bug, the two-digit-year pivot, format lock-in). Read this before "fixing" behaviour that looks wrong.
|
|
182
|
+
- [`review-kit/RELEASE_READINESS.md`](review-kit/RELEASE_READINESS.md) — the release rubric; run `python review-kit/scripts/readiness.py`.
|
|
183
|
+
|
|
171
184
|
## License
|
|
172
185
|
|
|
173
186
|
MIT
|
|
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
|
|
|
4
4
|
|
|
5
5
|
[project]
|
|
6
6
|
name = "datemonkey"
|
|
7
|
-
version = "0.
|
|
7
|
+
version = "0.2.0"
|
|
8
8
|
description = "Batch date parsing with ambiguity detection, confidence scores, and format lock-in."
|
|
9
9
|
readme = "README.md"
|
|
10
10
|
license = "MIT"
|
|
@@ -26,6 +26,9 @@ classifiers = [
|
|
|
26
26
|
"Topic :: Text Processing",
|
|
27
27
|
]
|
|
28
28
|
|
|
29
|
+
[project.optional-dependencies]
|
|
30
|
+
dev = ["pytest", "pytest-cov", "hypothesis", "ruff", "mypy"]
|
|
31
|
+
|
|
29
32
|
[project.scripts]
|
|
30
33
|
datemonkey = "datemonkey.cli:main"
|
|
31
34
|
|
|
@@ -40,3 +43,11 @@ where = ["src"]
|
|
|
40
43
|
[tool.pytest.ini_options]
|
|
41
44
|
testpaths = ["tests"]
|
|
42
45
|
pythonpath = ["src"]
|
|
46
|
+
|
|
47
|
+
[tool.ruff]
|
|
48
|
+
src = ["src", "tests"]
|
|
49
|
+
|
|
50
|
+
[tool.mypy]
|
|
51
|
+
files = ["src"]
|
|
52
|
+
python_version = "3.9"
|
|
53
|
+
ignore_missing_imports = true
|
|
@@ -10,35 +10,58 @@ from typing import Optional
|
|
|
10
10
|
|
|
11
11
|
from .detector import detect_format
|
|
12
12
|
from .formats import ALL_FORMATS
|
|
13
|
-
from .models import Confidence
|
|
14
13
|
from .parser import parse_dates
|
|
15
14
|
|
|
16
15
|
|
|
16
|
+
def _extract_column(f, col: int, skip_header: bool) -> list[str]:
|
|
17
|
+
"""Pull one CSV column from an open text stream."""
|
|
18
|
+
if col < 0:
|
|
19
|
+
print("Error: --column must be a non-negative integer", file=sys.stderr)
|
|
20
|
+
sys.exit(1)
|
|
21
|
+
reader = csv.reader(f)
|
|
22
|
+
values = []
|
|
23
|
+
skipped = 0
|
|
24
|
+
for i, row in enumerate(reader):
|
|
25
|
+
if skip_header and i == 0:
|
|
26
|
+
continue
|
|
27
|
+
if col < len(row):
|
|
28
|
+
values.append(row[col])
|
|
29
|
+
elif row:
|
|
30
|
+
# Non-empty row missing the requested column: don't drop it
|
|
31
|
+
# silently — report the data loss to stderr.
|
|
32
|
+
skipped += 1
|
|
33
|
+
if skipped:
|
|
34
|
+
print(
|
|
35
|
+
f"Warning: {skipped} row(s) had no column {col}; skipped.",
|
|
36
|
+
file=sys.stderr,
|
|
37
|
+
)
|
|
38
|
+
return values
|
|
39
|
+
|
|
40
|
+
|
|
41
|
+
def _read_lines(lines, skip_header: bool) -> list[str]:
|
|
42
|
+
"""Read non-empty stripped lines, optionally dropping a header row."""
|
|
43
|
+
stripped = [line.strip() for line in lines]
|
|
44
|
+
if skip_header and stripped:
|
|
45
|
+
stripped = stripped[1:]
|
|
46
|
+
return [ln for ln in stripped if ln]
|
|
47
|
+
|
|
48
|
+
|
|
17
49
|
def _read_values(args: argparse.Namespace) -> list[str]:
|
|
18
|
-
"""Read date values from arguments or stdin."""
|
|
50
|
+
"""Read date values from arguments, a file, or stdin."""
|
|
19
51
|
if args.values:
|
|
20
52
|
return args.values
|
|
21
53
|
|
|
22
54
|
if args.file:
|
|
23
55
|
with open(args.file) as f:
|
|
24
56
|
if args.column is not None:
|
|
25
|
-
|
|
26
|
-
|
|
27
|
-
sys.exit(1)
|
|
28
|
-
reader = csv.reader(f)
|
|
29
|
-
col = args.column
|
|
30
|
-
values = []
|
|
31
|
-
for i, row in enumerate(reader):
|
|
32
|
-
if args.skip_header and i == 0:
|
|
33
|
-
continue
|
|
34
|
-
if col < len(row):
|
|
35
|
-
values.append(row[col])
|
|
36
|
-
return values
|
|
37
|
-
else:
|
|
38
|
-
return [line.strip() for line in f if line.strip()]
|
|
57
|
+
return _extract_column(f, args.column, args.skip_header)
|
|
58
|
+
return _read_lines(f, args.skip_header)
|
|
39
59
|
|
|
40
60
|
if not sys.stdin.isatty():
|
|
41
|
-
|
|
61
|
+
# --column and --skip-header apply to piped CSV too, not only to --file.
|
|
62
|
+
if args.column is not None:
|
|
63
|
+
return _extract_column(sys.stdin, args.column, args.skip_header)
|
|
64
|
+
return _read_lines(sys.stdin, args.skip_header)
|
|
42
65
|
|
|
43
66
|
print("No input provided. Pass values as arguments, via --file, or pipe to stdin.", file=sys.stderr)
|
|
44
67
|
sys.exit(1)
|
|
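The column-extraction behaviour introduced in this hunk can be tried outside the CLI. What follows is a minimal sketch under stated assumptions, not the package's code: `extract_column` is a hypothetical stand-in that returns the skipped-row count instead of printing a warning or calling `sys.exit`, and the sample CSV is invented for illustration.

```python
import csv
import io


def extract_column(stream, col, skip_header):
    """Return (values, skipped) for one CSV column of an open text stream."""
    if col < 0:
        raise ValueError("column must be a non-negative integer")
    values = []
    skipped = 0
    for i, row in enumerate(csv.reader(stream)):
        if skip_header and i == 0:
            continue
        if col < len(row):
            values.append(row[col])
        elif row:
            # Non-empty row that lacks the column: count it rather than drop it silently.
            skipped += 1
    return values, skipped


# Invented sample: one short row ("2") and one blank line.
sample = io.StringIO("id,date\n1,2024-01-05\n2\n\n3,2024-02-10\n")
values, skipped = extract_column(sample, col=1, skip_header=True)
print(values, skipped)
```

The blank line yields an empty row from `csv.reader`, so it is ignored without being counted, while the short row `2` is counted as skipped.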
@@ -87,12 +110,12 @@ def cmd_detect(args: argparse.Namespace) -> None:
|
|
|
87
110
|
print(f"\n AMBIGUOUS: {', '.join(a.value for a in result.ambiguities)}")
|
|
88
111
|
|
|
89
112
|
if result.candidates:
|
|
90
|
-
print(
|
|
113
|
+
print("\n Top candidates:")
|
|
91
114
|
for c in result.candidates[:5]:
|
|
92
115
|
print(f" {c.format.label:40s} {c.match_count}/{c.sample_size} matches")
|
|
93
116
|
|
|
94
117
|
if result.warnings:
|
|
95
|
-
print(
|
|
118
|
+
print("\n Warnings:")
|
|
96
119
|
for w in result.warnings:
|
|
97
120
|
print(f" - {w}")
|
|
98
121
|
|
|
@@ -169,7 +192,7 @@ def main(argv: Optional[list[str]] = None) -> None:
|
|
|
169
192
|
prog="datemonkey",
|
|
170
193
|
description="Batch date parsing with ambiguity detection.",
|
|
171
194
|
)
|
|
172
|
-
ap.add_argument("--version", action="version", version="datemonkey 0.
|
|
195
|
+
ap.add_argument("--version", action="version", version="datemonkey 0.2.0")
|
|
173
196
|
|
|
174
197
|
sub = ap.add_subparsers(dest="command", help="Available commands")
|
|
175
198
|
|
|
@@ -11,10 +11,10 @@ from typing import Any, Optional, Sequence
|
|
|
11
11
|
|
|
12
12
|
from .formats import (
|
|
13
13
|
ALL_FORMATS,
|
|
14
|
-
AMBIGUOUS_PAIRS,
|
|
15
14
|
could_be_excel_serial,
|
|
16
15
|
get_ambiguous_partner,
|
|
17
16
|
is_date_like,
|
|
17
|
+
normalize_locale,
|
|
18
18
|
)
|
|
19
19
|
from .models import (
|
|
20
20
|
AmbiguityType,
|
|
@@ -185,8 +185,10 @@ def detect_format(
|
|
|
185
185
|
],
|
|
186
186
|
)
|
|
187
187
|
|
|
188
|
-
# Score all candidate formats
|
|
189
|
-
|
|
188
|
+
# Score all candidate formats. Use `is not None` (not truthiness) so an
|
|
189
|
+
# explicit empty list means "test no formats" rather than silently falling
|
|
190
|
+
# back to the full catalogue — the docstring promises None alone defaults.
|
|
191
|
+
test_formats = formats if formats is not None else ALL_FORMATS
|
|
190
192
|
candidates = _score_candidates(date_like, test_formats)
|
|
191
193
|
|
|
192
194
|
if not candidates:
|
|
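The comment in this hunk distinguishes `is not None` from a truthiness test. A small sketch of the difference, assuming a placeholder catalogue and a truthiness-based default of the form `formats or ALL_FORMATS` (the removed line is not visible in this diff, so that form is an assumption):

```python
# Placeholder catalogue for illustration; not the package's real ALL_FORMATS.
ALL_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y")


def choose_by_truthiness(formats):
    # An empty list is falsy, so it silently falls back to the full catalogue.
    return formats or ALL_FORMATS


def choose_by_identity(formats):
    # Only None triggers the default; an explicit empty list is honoured.
    return formats if formats is not None else ALL_FORMATS


print(choose_by_truthiness([]))
print(choose_by_identity([]))
print(choose_by_identity(None))
```

With the identity check, passing `[]` means "test no formats", which matches the behaviour the comment describes.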
@@ -212,46 +214,55 @@ def detect_format(
|
|
|
212
214
|
# Check DD/MM vs MM/DD ambiguity
|
|
213
215
|
ambig_partner = _check_day_month_ambiguity(date_like, best, candidates)
|
|
214
216
|
if ambig_partner is not None:
|
|
215
|
-
# Try to resolve from data
|
|
217
|
+
# Try to resolve from data. _check_day_month_ambiguity only returns a
|
|
218
|
+
# partner with an *equal* match count, so _resolve_day_month can only
|
|
219
|
+
# confirm `best` (some value parses under best but not the partner) or
|
|
220
|
+
# report it as truly ambiguous — it can never name the partner as the
|
|
221
|
+
# winner here, so `best` never needs swapping.
|
|
216
222
|
resolved = _resolve_day_month(date_like, best.format, ambig_partner.format)
|
|
217
|
-
if resolved is
|
|
218
|
-
# Data resolves it — pick the right one
|
|
219
|
-
if resolved != best.format:
|
|
220
|
-
# Swap: the partner is actually the correct one
|
|
221
|
-
for c in candidates:
|
|
222
|
-
if c.format == resolved:
|
|
223
|
-
best = c
|
|
224
|
-
break
|
|
225
|
-
else:
|
|
223
|
+
if resolved is None:
|
|
226
224
|
# Truly ambiguous
|
|
227
225
|
ambiguities.append(AmbiguityType.DAY_MONTH_SWAP)
|
|
228
226
|
warnings.append(
|
|
229
227
|
f"Ambiguous: cannot distinguish {best.format.label} from "
|
|
230
228
|
f"{ambig_partner.format.label} using data alone."
|
|
231
229
|
)
|
|
232
|
-
# Apply locale preference if provided
|
|
230
|
+
# Apply locale preference if provided. Choose between the two
|
|
231
|
+
# ambiguous candidates only (`best` and its partner `ambig_partner`)
|
|
232
|
+
# — never scan all candidates for any %m/%d format, or a custom
|
|
233
|
+
# `formats=` list could surface an unrelated MM/DD format (e.g.
|
|
234
|
+
# US_DASH for slash data) that cannot even parse the values.
|
|
233
235
|
if locale_preference:
|
|
234
|
-
lp = locale_preference
|
|
235
|
-
if lp
|
|
236
|
-
# Prefer MM/DD
|
|
237
|
-
|
|
238
|
-
|
|
239
|
-
|
|
240
|
-
|
|
236
|
+
lp = normalize_locale(locale_preference)
|
|
237
|
+
if lp == "us":
|
|
238
|
+
# Prefer MM/DD (the %m side of the ambiguous pair).
|
|
239
|
+
best = (
|
|
240
|
+
best
|
|
241
|
+
if best.format.pattern.startswith("%m")
|
|
242
|
+
else ambig_partner
|
|
243
|
+
)
|
|
241
244
|
warnings.append(
|
|
242
245
|
f"Locale preference '{locale_preference}' applied: "
|
|
243
246
|
f"using {best.format.label}."
|
|
244
247
|
)
|
|
245
|
-
elif lp
|
|
246
|
-
# Prefer DD/MM
|
|
247
|
-
|
|
248
|
-
|
|
249
|
-
|
|
250
|
-
|
|
248
|
+
elif lp == "eu":
|
|
249
|
+
# Prefer DD/MM (the %d side of the ambiguous pair).
|
|
250
|
+
best = (
|
|
251
|
+
best
|
|
252
|
+
if best.format.pattern.startswith("%d")
|
|
253
|
+
else ambig_partner
|
|
254
|
+
)
|
|
251
255
|
warnings.append(
|
|
252
256
|
f"Locale preference '{locale_preference}' applied: "
|
|
253
257
|
f"using {best.format.label}."
|
|
254
258
|
)
|
|
259
|
+
else:
|
|
260
|
+
# Unrecognised hint: do not silently ignore it — say so, so
|
|
261
|
+
# the ambiguity is not mistaken for "resolved by locale".
|
|
262
|
+
warnings.append(
|
|
263
|
+
f"Locale preference '{locale_preference}' not "
|
|
264
|
+
f"recognized (use 'us' or 'eu'); ignoring."
|
|
265
|
+
)
|
|
255
266
|
|
|
256
267
|
# Check two-digit year
|
|
257
268
|
if _has_two_digit_year(best.format):
|
|
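The locale branch above picks only between `best` and its ambiguous partner. The sketch below restates that selection in isolation; `Candidate`, its labels, and `apply_locale` are hypothetical names, and `locale` is assumed to have been normalised to `"us"`, `"eu"`, or `None` already.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    label: str    # hypothetical display label
    pattern: str  # strptime-style pattern


def apply_locale(best: Candidate, partner: Candidate, locale: Optional[str]) -> Candidate:
    """Choose between the two ambiguous candidates only, never a third format."""
    if locale == "us":
        return best if best.pattern.startswith("%m") else partner
    if locale == "eu":
        return best if best.pattern.startswith("%d") else partner
    # Unrecognised or missing hint: keep the current pick; the caller should warn.
    return best


eu_slash = Candidate("DD/MM/YYYY", "%d/%m/%Y")
us_slash = Candidate("MM/DD/YYYY", "%m/%d/%Y")
print(apply_locale(eu_slash, us_slash, "us").label)
print(apply_locale(us_slash, eu_slash, "eu").label)
```

Because the choice is restricted to the pair, a custom format list cannot surface an unrelated month-first format that does not parse the data.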
@@ -261,14 +272,24 @@ def detect_format(
|
|
|
261
272
|
"00-68 as 2000-2068 and 69-99 as 1969-1999."
|
|
262
273
|
)
|
|
263
274
|
|
|
264
|
-
# Check mixed formats (multiple candidates with high match counts)
|
|
275
|
+
# Check mixed formats (multiple candidates with high match counts).
|
|
276
|
+
# Compare against the strongest candidate that is neither `best` itself nor
|
|
277
|
+
# its DD/MM-vs-MM/DD partner. Indexing candidates[1] blindly is wrong once a
|
|
278
|
+
# locale_preference has reassigned `best` to candidates[1]: the "second"
|
|
279
|
+
# would then be `best` again, producing a self-referential MIXED_FORMATS
|
|
280
|
+
# warning on the eu branch but not the us branch.
|
|
265
281
|
if len(candidates) >= 2:
|
|
266
|
-
|
|
267
|
-
|
|
268
|
-
|
|
269
|
-
|
|
270
|
-
|
|
271
|
-
|
|
282
|
+
partner = get_ambiguous_partner(best.format)
|
|
283
|
+
second = next(
|
|
284
|
+
(
|
|
285
|
+
c
|
|
286
|
+
for c in candidates
|
|
287
|
+
if c.format != best.format and c.format != partner
|
|
288
|
+
),
|
|
289
|
+
None,
|
|
290
|
+
)
|
|
291
|
+
# If the next distinct family matches a significant portion, flag it.
|
|
292
|
+
if second is not None and second.match_count >= len(date_like) * 0.2:
|
|
272
293
|
ambiguities.append(AmbiguityType.MIXED_FORMATS)
|
|
273
294
|
warnings.append(
|
|
274
295
|
f"Possible mixed formats: {best.format.label} "
|
|
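The mixed-format check in this hunk looks for the strongest candidate that is neither `best` nor its day/month partner. A minimal sketch of that selection, using hypothetical names (`Scored`, `strongest_other`, `is_mixed`) and invented match counts:

```python
from typing import NamedTuple, Optional, Sequence


class Scored(NamedTuple):
    name: str
    match_count: int


def strongest_other(
    candidates: Sequence[Scored], best: str, partner: Optional[str]
) -> Optional[Scored]:
    """First candidate (list is assumed sorted best-first) outside the ambiguous pair."""
    return next((c for c in candidates if c.name not in (best, partner)), None)


def is_mixed(candidates, best, partner, n_values) -> bool:
    second = strongest_other(candidates, best, partner)
    # Flag only when a distinct format family matches at least 20% of the values.
    return second is not None and second.match_count >= n_values * 0.2


ranked = [Scored("DD/MM", 10), Scored("MM/DD", 10), Scored("ISO", 1)]
# After a locale swap the pick is "MM/DD", which is also ranked[1]:
print(ranked[1].name, is_mixed(ranked, best="MM/DD", partner="DD/MM", n_values=10))
```

Indexing `ranked[1]` here would compare the chosen format with itself, which is the self-referential warning the comment describes.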
@@ -286,9 +307,25 @@ def detect_format(
|
|
|
286
307
|
else:
|
|
287
308
|
confidence = Confidence.LOW
|
|
288
309
|
|
|
289
|
-
# Assign confidence to candidates
|
|
310
|
+
# Assign confidence to candidates. `FormatCandidate.confidence` means
|
|
311
|
+
# "confidence if this format were chosen", so it must agree with the
|
|
312
|
+
# headline value the same format would produce as the pick — otherwise the
|
|
313
|
+
# same DateFormat reports two different numbers.
|
|
314
|
+
# - the chosen format carries the headline confidence directly;
|
|
315
|
+
# - best's unresolved-swap partner has the SAME counterfactual headline
|
|
316
|
+
# (LOW), so mirror it rather than letting a full match fall to MEDIUM;
|
|
317
|
+
# - everything else gets a per-format estimate (an ambiguous-pair format
|
|
318
|
+
# is capped below HIGH only while the swap is actually unresolved).
|
|
319
|
+
swap_unresolved = AmbiguityType.DAY_MONTH_SWAP in ambiguities
|
|
320
|
+
partner_of_best = get_ambiguous_partner(best.format)
|
|
290
321
|
for c in candidates:
|
|
291
|
-
if c.
|
|
322
|
+
if c.format == best.format:
|
|
323
|
+
c.confidence = confidence
|
|
324
|
+
elif swap_unresolved and c.format == partner_of_best:
|
|
325
|
+
c.confidence = confidence
|
|
326
|
+
elif c.match_count == len(date_like) and not (
|
|
327
|
+
swap_unresolved and get_ambiguous_partner(c.format) is not None
|
|
328
|
+
):
|
|
292
329
|
c.confidence = Confidence.HIGH
|
|
293
330
|
elif c.match_count >= len(date_like) * 0.8:
|
|
294
331
|
c.confidence = Confidence.MEDIUM
|
|
@@ -12,6 +12,7 @@ by matching Excel's behavior for compatibility.
|
|
|
12
12
|
from __future__ import annotations
|
|
13
13
|
|
|
14
14
|
import datetime
|
|
15
|
+
import math
|
|
15
16
|
from typing import Optional, Union
|
|
16
17
|
|
|
17
18
|
from .models import Confidence, DateFormat, DateResult
|
|
@@ -47,31 +48,56 @@ def excel_serial_to_datetime(
|
|
|
47
48
|
except (ValueError, TypeError):
|
|
48
49
|
return None
|
|
49
50
|
|
|
50
|
-
|
|
51
|
+
# NaN and infinity survive float() but cannot be converted to a day count:
|
|
52
|
+
# int(nan) raises ValueError and int(inf) raises OverflowError, and NaN
|
|
53
|
+
# comparisons are all False so the range check below would not stop them.
|
|
54
|
+
# They are "invalid" per the contract, so map to None rather than crash.
|
|
55
|
+
if not math.isfinite(num):
|
|
51
56
|
return None
|
|
52
57
|
|
|
53
|
-
#
|
|
54
|
-
#
|
|
55
|
-
#
|
|
56
|
-
|
|
57
|
-
|
|
58
|
-
|
|
59
|
-
return datetime.datetime(1900, 3, 1)
|
|
60
|
-
|
|
61
|
-
days = int(num)
|
|
62
|
-
fraction = num - days
|
|
58
|
+
# Bound on the integer day part: a serial like 2958465.5 (noon on the last
|
|
59
|
+
# representable day, 9999-12-31) is in range even though 2958465.5 > the
|
|
60
|
+
# integer _MAX_SERIAL. Comparing the float against the int max wrongly
|
|
61
|
+
# rejected every fractional time on the final day.
|
|
62
|
+
if num < _MIN_SERIAL or int(num) > _MAX_SERIAL:
|
|
63
|
+
return None
|
|
63
64
|
|
|
64
|
-
|
|
65
|
-
|
|
65
|
+
try:
|
|
66
|
+
# Handle the Lotus 1-2-3 leap year bug:
|
|
67
|
+
# Excel thinks 1900-02-29 exists (serial 60).
|
|
68
|
+
# For serials > 60, subtract 1 to correct the off-by-one.
|
|
69
|
+
if int(num) == 60:
|
|
70
|
+
# Serial 60 is the non-existent Feb 29, 1900. Map to March 1, 1900,
|
|
71
|
+
# preserving any fractional time-of-day component (the fraction is
|
|
72
|
+
# otherwise silently dropped at this one serial).
|
|
73
|
+
dt = datetime.datetime(1900, 3, 1)
|
|
74
|
+
dt += _time_of_day(num - 60)
|
|
75
|
+
return dt
|
|
76
|
+
|
|
77
|
+
days = int(num)
|
|
78
|
+
fraction = num - days
|
|
79
|
+
|
|
80
|
+
if days > 60:
|
|
81
|
+
days -= 1
|
|
82
|
+
|
|
83
|
+
dt = _EXCEL_EPOCH + datetime.timedelta(days=days)
|
|
84
|
+
dt += _time_of_day(fraction)
|
|
85
|
+
return dt
|
|
86
|
+
except (OverflowError, ValueError):
|
|
87
|
+
# Defensive: any residual range/overflow surprise -> invalid -> None.
|
|
88
|
+
return None
|
|
66
89
|
|
|
67
|
-
dt = _EXCEL_EPOCH + datetime.timedelta(days=days)
|
|
68
90
|
|
|
69
|
-
|
|
70
|
-
|
|
71
|
-
total_seconds = round(fraction * 86400)
|
|
72
|
-
dt += datetime.timedelta(seconds=total_seconds)
|
|
91
|
+
def _time_of_day(fraction: float) -> datetime.timedelta:
|
|
92
|
+
"""Convert a day fraction to a timedelta within the same calendar day.
|
|
73
93
|
|
|
74
|
-
|
|
94
|
+
A fraction at or above 86399.5/86400 (≈ the last half-second of a day)
|
|
95
|
+
would round up to a full 86400 seconds and silently roll the date into the
|
|
96
|
+
next day; clamp to 86399 so the time stays inside the intended day.
|
|
97
|
+
"""
|
|
98
|
+
if fraction <= 0:
|
|
99
|
+
return datetime.timedelta(0)
|
|
100
|
+
return datetime.timedelta(seconds=min(round(fraction * 86400), 86399))
|
|
75
101
|
|
|
76
102
|
|
|
77
103
|
def parse_excel_serial(
|
|
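The conversion logic in this hunk (finite check, bound on the integer day part, the serial-60 special case, and the time-of-day clamp) can be condensed into a self-contained sketch. The constant values are assumptions: the diff names `_EXCEL_EPOCH`, `_MIN_SERIAL`, and `_MAX_SERIAL` without showing them, so the epoch below is simply the value that makes serial 1 equal 1900-01-01, and the function names are illustrative.

```python
import datetime
import math
from typing import Optional

# Assumed values; the diff references these constants but does not show them.
EXCEL_EPOCH = datetime.datetime(1899, 12, 31)  # chosen so serial 1 is 1900-01-01
MIN_SERIAL = 1
MAX_SERIAL = 2958465  # 9999-12-31


def time_of_day(fraction: float) -> datetime.timedelta:
    """Day fraction to timedelta, clamped so it never rolls into the next day."""
    if fraction <= 0:
        return datetime.timedelta(0)
    return datetime.timedelta(seconds=min(round(fraction * 86400), 86399))


def serial_to_datetime(num: float) -> Optional[datetime.datetime]:
    if not math.isfinite(num):
        return None  # NaN and infinity are invalid input
    if num < MIN_SERIAL or int(num) > MAX_SERIAL:
        return None  # bound the integer day part, not the raw float
    days = int(num)
    if days == 60:
        # Excel's non-existent 1900-02-29: map to March 1 and keep the time.
        return datetime.datetime(1900, 3, 1) + time_of_day(num - 60)
    if days > 60:
        days -= 1  # compensate for the phantom leap day
    return EXCEL_EPOCH + datetime.timedelta(days=days) + time_of_day(num - int(num))


print(serial_to_datetime(60.5))
print(serial_to_datetime(2958465.5))
print(time_of_day(0.999999))
```

Bounding `int(num)` rather than `num` is what lets a fractional time on the final representable day pass the range check.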
@@ -119,8 +119,36 @@ def could_be_excel_serial(value: str) -> bool:
|
|
|
119
119
|
return False
|
|
120
120
|
num = float(s)
|
|
121
121
|
# Excel serial dates: 1 = 1900-01-01, reasonable range up to ~2958465 (9999-12-31)
|
|
122
|
-
# But practically, most dates are between 1 (1900) and ~55000 (2050+)
|
|
123
|
-
|
|
122
|
+
# But practically, most dates are between 1 (1900) and ~55000 (2050+).
|
|
123
|
+
# Bound on the integer day part so a valid fractional time on the last day
|
|
124
|
+
# (e.g. "2958465.5") is still recognised, matching excel_serial_to_datetime.
|
|
125
|
+
return 1 <= num and int(num) <= 2958465
|
|
126
|
+
|
|
127
|
+
|
|
128
|
+
# ── Locale preference normalisation ──────────────────────────────────────
|
|
129
|
+
|
|
130
|
+
US_LOCALE_ALIASES = frozenset({"us", "en_us", "en-us", "american"})
|
|
131
|
+
EU_LOCALE_ALIASES = frozenset(
|
|
132
|
+
{"eu", "european", "en_gb", "en-gb", "british", "de", "fr", "es", "it"}
|
|
133
|
+
)
|
|
134
|
+
|
|
135
|
+
|
|
136
|
+
def normalize_locale(locale_preference: Optional[str]) -> Optional[str]:
|
|
137
|
+
"""Normalise a locale_preference hint to ``"us"``, ``"eu"``, or ``None``.
|
|
138
|
+
|
|
139
|
+
Trims surrounding whitespace and lowercases before matching, so values like
|
|
140
|
+
``" eu "`` or ``"EU"`` resolve. Returns ``None`` for both a missing
|
|
141
|
+
preference and an *unrecognised* one — callers that must tell those apart
|
|
142
|
+
should also test ``locale_preference`` itself.
|
|
143
|
+
"""
|
|
144
|
+
if not locale_preference:
|
|
145
|
+
return None
|
|
146
|
+
lp = locale_preference.strip().lower()
|
|
147
|
+
if lp in US_LOCALE_ALIASES:
|
|
148
|
+
return "us"
|
|
149
|
+
if lp in EU_LOCALE_ALIASES:
|
|
150
|
+
return "eu"
|
|
151
|
+
return None
|
|
124
152
|
|
|
125
153
|
|
|
126
154
|
def get_ambiguous_partner(fmt: DateFormat) -> Optional[DateFormat]:
|
|
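For reference, the locale normalisation added in this hunk behaves as follows when reproduced standalone. The alias sets and logic are copied from the hunk; only the sample inputs are invented.

```python
from typing import Optional

US_LOCALE_ALIASES = frozenset({"us", "en_us", "en-us", "american"})
EU_LOCALE_ALIASES = frozenset(
    {"eu", "european", "en_gb", "en-gb", "british", "de", "fr", "es", "it"}
)


def normalize_locale(locale_preference: Optional[str]) -> Optional[str]:
    """Normalise a hint to "us", "eu", or None (missing or unrecognised)."""
    if not locale_preference:
        return None
    lp = locale_preference.strip().lower()
    if lp in US_LOCALE_ALIASES:
        return "us"
    if lp in EU_LOCALE_ALIASES:
        return "eu"
    return None


for hint in (" EU ", "en-US", "en", "", None):
    print(repr(hint), "->", normalize_locale(hint))
```

Note that `"en"` on its own is not an alias in either set, so it normalises to `None` just like a missing hint.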
@@ -166,7 +166,11 @@ class BatchResult:
|
|
|
166
166
|
@property
|
|
167
167
|
def ok(self) -> bool:
|
|
168
168
|
"""True if all values were parsed successfully."""
|
|
169
|
-
|
|
169
|
+
# Require every value to have parsed — not merely "nothing failed".
|
|
170
|
+
# Early-return paths (undetectable format, strict refusal, all-blank
|
|
171
|
+
# input) leave parsed_count == failed_count == 0 with total > 0; those
|
|
172
|
+
# must report ok == False, not True.
|
|
173
|
+
return self.total > 0 and self.parsed_count == self.total
|
|
170
174
|
|
|
171
175
|
@property
|
|
172
176
|
def success_ratio(self) -> float:
|
|
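The revised `ok` property requires every value to have parsed. The sketch below isolates that rule in a hypothetical `MiniBatch` dataclass that carries only the three counters the property mentions; it is not the package's `BatchResult`.

```python
from dataclasses import dataclass


@dataclass
class MiniBatch:
    total: int = 0
    parsed_count: int = 0
    failed_count: int = 0

    @property
    def ok(self) -> bool:
        # "Nothing failed" is not enough: every value must have parsed.
        return self.total > 0 and self.parsed_count == self.total


# An early-return path: three values in, nothing parsed, nothing counted as failed.
print(MiniBatch(total=3).ok)
print(MiniBatch(total=3, parsed_count=3).ok)
```

An empty batch (`total == 0`) also reports `ok == False` under this rule.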
@@ -11,7 +11,7 @@ from typing import Any, Optional, Sequence, Union
|
|
|
11
11
|
|
|
12
12
|
from .detector import detect_format
|
|
13
13
|
from .excel import EXCEL_SERIAL_FORMAT, parse_excel_serial
|
|
14
|
-
from .formats import could_be_excel_serial
|
|
14
|
+
from .formats import could_be_excel_serial, normalize_locale
|
|
15
15
|
from .models import (
|
|
16
16
|
AmbiguityType,
|
|
17
17
|
BatchResult,
|
|
@@ -68,6 +68,30 @@ def _parse_single(
|
|
|
68
68
|
)
|
|
69
69
|
|
|
70
70
|
|
|
71
|
+
def _all_failed(
|
|
72
|
+
values: Sequence[Any],
|
|
73
|
+
fmt: Optional[DateFormat],
|
|
74
|
+
reason: str,
|
|
75
|
+
) -> list[DateResult]:
|
|
76
|
+
"""Build one FAILED DateResult per value for an early-return path.
|
|
77
|
+
|
|
78
|
+
Keeps the batch invariant ``total == failed_count == len(results)`` and
|
|
79
|
+
lets callers inspect ``batch.failed`` even when parsing was refused or no
|
|
80
|
+
format could be determined (otherwise ``results`` would be empty while
|
|
81
|
+
``total`` reported N values).
|
|
82
|
+
"""
|
|
83
|
+
return [
|
|
84
|
+
DateResult(
|
|
85
|
+
original=v,
|
|
86
|
+
format_used=fmt,
|
|
87
|
+
confidence=Confidence.FAILED,
|
|
88
|
+
warnings=[reason],
|
|
89
|
+
row_index=i,
|
|
90
|
+
)
|
|
91
|
+
for i, v in enumerate(values)
|
|
92
|
+
]
|
|
93
|
+
|
|
94
|
+
|
|
71
95
|
def parse_dates(
|
|
72
96
|
values: Sequence[Any],
|
|
73
97
|
*,
|
|
@@ -86,8 +110,11 @@ def parse_dates(
|
|
|
86
110
|
If None, auto-detect from the values.
|
|
87
111
|
locale_preference: Hint for resolving DD/MM vs MM/DD ambiguity
|
|
88
112
|
during auto-detection. "us" for MM/DD, "eu" for DD/MM.
|
|
89
|
-
strict: If True,
|
|
90
|
-
|
|
113
|
+
strict: If True, refuse to parse when the batch has an unresolved
|
|
114
|
+
DD/MM vs MM/DD (day/month swap) ambiguity and no recognized
|
|
115
|
+
locale_preference was supplied to break it. Other ambiguity types
|
|
116
|
+
(two-digit year, mixed formats) are reported as warnings but do not
|
|
117
|
+
block parsing.
|
|
91
118
|
|
|
92
119
|
Returns:
|
|
93
120
|
BatchResult containing per-value results, detected format,
|
|
@@ -120,25 +147,40 @@ def parse_dates(
|
|
|
120
147
|
resolved_format = detection_result.format
|
|
121
148
|
|
|
122
149
|
if resolved_format is None:
|
|
150
|
+
reason = "Could not determine date format."
|
|
123
151
|
return BatchResult(
|
|
152
|
+
results=_all_failed(values, None, reason),
|
|
124
153
|
total=len(values),
|
|
125
|
-
|
|
154
|
+
failed_count=len(values),
|
|
155
|
+
warnings=[reason],
|
|
126
156
|
format_detection=detection_result,
|
|
127
157
|
)
|
|
128
158
|
|
|
129
|
-
# In strict mode, refuse to parse if ambiguous
|
|
130
|
-
|
|
131
|
-
|
|
132
|
-
|
|
133
|
-
|
|
134
|
-
|
|
135
|
-
|
|
136
|
-
|
|
137
|
-
|
|
138
|
-
|
|
139
|
-
|
|
140
|
-
|
|
141
|
-
|
|
159
|
+
# In strict mode, refuse to parse if ambiguous — UNLESS the caller supplied
|
|
160
|
+
# a *recognized* locale_preference, the documented escape hatch for the
|
|
161
|
+
# DD/MM vs MM/DD case. Gate on whether the preference actually resolves the
|
|
162
|
+
# ambiguity (normalize_locale), not merely on whether some string was
|
|
163
|
+
# passed: an unrecognized/whitespace/empty hint (e.g. "en", " eu ", "") is
|
|
164
|
+
# never applied by detection, so it must NOT silently disable the refusal.
|
|
165
|
+
if (
|
|
166
|
+
strict
|
|
167
|
+
and detection_result
|
|
168
|
+
and detection_result.is_ambiguous
|
|
169
|
+
and normalize_locale(locale_preference) is None
|
|
170
|
+
and AmbiguityType.DAY_MONTH_SWAP in detection_result.ambiguities
|
|
171
|
+
):
|
|
172
|
+
reason = (
|
|
173
|
+
"Strict mode: refusing to parse due to DD/MM vs MM/DD "
|
|
174
|
+
"ambiguity. Provide a format or locale_preference to resolve."
|
|
175
|
+
)
|
|
176
|
+
return BatchResult(
|
|
177
|
+
results=_all_failed(values, resolved_format, reason),
|
|
178
|
+
detected_format=resolved_format,
|
|
179
|
+
total=len(values),
|
|
180
|
+
failed_count=len(values),
|
|
181
|
+
warnings=[reason] + detection_result.warnings,
|
|
182
|
+
format_detection=detection_result,
|
|
183
|
+
)
|
|
142
184
|
|
|
143
185
|
# Parse each value with the locked-in format
|
|
144
186
|
results: list[DateResult] = []
|
|
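The strict-mode gate in this hunk can be summarised as a single predicate. This is a simplified sketch: `Ambiguity` is a hypothetical stand-in for the package's `AmbiguityType`, `should_refuse` is an illustrative name, and the locale argument is assumed to be the already-normalised value (`"us"`, `"eu"`, or `None`).

```python
from enum import Enum
from typing import Collection, Optional


class Ambiguity(Enum):
    DAY_MONTH_SWAP = "day_month_swap"
    TWO_DIGIT_YEAR = "two_digit_year"
    MIXED_FORMATS = "mixed_formats"


def should_refuse(
    strict: bool,
    ambiguities: Collection[Ambiguity],
    normalized_locale: Optional[str],
) -> bool:
    """Refuse only for an unresolved day/month swap with no recognised locale hint."""
    return (
        strict
        and Ambiguity.DAY_MONTH_SWAP in ambiguities
        and normalized_locale is None
    )


print(should_refuse(True, [Ambiguity.DAY_MONTH_SWAP], None))
print(should_refuse(True, [Ambiguity.DAY_MONTH_SWAP], "eu"))
print(should_refuse(True, [Ambiguity.TWO_DIGIT_YEAR], None))
```

Gating on the normalised value means an unrecognised hint such as `"en"` does not disable the refusal, since it normalises to `None`.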
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: datemonkey
|
|
3
|
-
Version: 0.
|
|
3
|
+
Version: 0.2.0
|
|
4
4
|
Summary: Batch date parsing with ambiguity detection, confidence scores, and format lock-in.
|
|
5
5
|
Author-email: RexBytes <pythonic@rexbytes.com>
|
|
6
6
|
License-Expression: MIT
|
|
@@ -21,6 +21,12 @@ Classifier: Topic :: Text Processing
|
|
|
21
21
|
Requires-Python: >=3.9
|
|
22
22
|
Description-Content-Type: text/markdown
|
|
23
23
|
License-File: LICENSE
|
|
24
|
+
Provides-Extra: dev
|
|
25
|
+
Requires-Dist: pytest; extra == "dev"
|
|
26
|
+
Requires-Dist: pytest-cov; extra == "dev"
|
|
27
|
+
Requires-Dist: hypothesis; extra == "dev"
|
|
28
|
+
Requires-Dist: ruff; extra == "dev"
|
|
29
|
+
Requires-Dist: mypy; extra == "dev"
|
|
24
30
|
Dynamic: license-file
|
|
25
31
|
|
|
26
32
|
# datemonkey
|
|
@@ -193,6 +199,19 @@ Convert an Excel serial date number to a Python datetime.
|
|
|
193
199
|
|
|
194
200
|
datemonkey is designed to work well as a tool for large language models. Date parsing is a common source of silent errors in LLM-driven data pipelines — ambiguous formats lead to wrong guesses, wasted tokens on retries, and broken downstream logic. datemonkey reduces that complexity: a single call returns a structured result with the detected format, confidence level, and any ambiguities — no multi-step prompting or validation loops required. Fewer tokens in, reliable answers out.
|
|
195
201
|
|
|
202
|
+
## Changelog
|
|
203
|
+
|
|
204
|
+
See [CHANGELOG.md](CHANGELOG.md) for release history.
|
|
205
|
+
|
|
206
|
+
## Development & review
|
|
207
|
+
|
|
208
|
+
datemonkey is hardened with a competitive multi-model review methodology. The
|
|
209
|
+
self-contained kit lives in [`review-kit/`](review-kit/):
|
|
210
|
+
|
|
211
|
+
- [`review-kit/CONTRIBUTING.md`](review-kit/CONTRIBUTING.md) — testing philosophy and the review-panel process
|
|
212
|
+
- [`review-kit/LIMITATIONS.md`](review-kit/LIMITATIONS.md) — **deliberate** design tradeoffs (DD/MM ambiguity, the Excel leap-year bug, the two-digit-year pivot, format lock-in). Read this before "fixing" behaviour that looks wrong.
|
|
213
|
+
- [`review-kit/RELEASE_READINESS.md`](review-kit/RELEASE_READINESS.md) — the release rubric; run `python review-kit/scripts/readiness.py`.
|
|
214
|
+
|
|
196
215
|
## License
|
|
197
216
|
|
|
198
217
|
MIT
|