datemonkey 0.1.0.tar.gz → 0.2.0.tar.gz

This diff compares the contents of two publicly released package versions as published to one of the supported registries. It is provided for informational purposes only.
Files changed (27)
  1. {datemonkey-0.1.0/src/datemonkey.egg-info → datemonkey-0.2.0}/PKG-INFO +20 -1
  2. {datemonkey-0.1.0 → datemonkey-0.2.0}/README.md +13 -0
  3. {datemonkey-0.1.0 → datemonkey-0.2.0}/pyproject.toml +12 -1
  4. {datemonkey-0.1.0 → datemonkey-0.2.0}/src/datemonkey/__init__.py +1 -1
  5. {datemonkey-0.1.0 → datemonkey-0.2.0}/src/datemonkey/cli.py +43 -20
  6. {datemonkey-0.1.0 → datemonkey-0.2.0}/src/datemonkey/detector.py +73 -36
  7. {datemonkey-0.1.0 → datemonkey-0.2.0}/src/datemonkey/excel.py +45 -19
  8. {datemonkey-0.1.0 → datemonkey-0.2.0}/src/datemonkey/formats.py +30 -2
  9. {datemonkey-0.1.0 → datemonkey-0.2.0}/src/datemonkey/models.py +5 -1
  10. {datemonkey-0.1.0 → datemonkey-0.2.0}/src/datemonkey/parser.py +59 -17
  11. {datemonkey-0.1.0 → datemonkey-0.2.0/src/datemonkey.egg-info}/PKG-INFO +20 -1
  12. {datemonkey-0.1.0 → datemonkey-0.2.0}/src/datemonkey.egg-info/SOURCES.txt +3 -1
  13. datemonkey-0.2.0/src/datemonkey.egg-info/requires.txt +7 -0
  14. datemonkey-0.2.0/tests/test_cli.py +112 -0
  15. datemonkey-0.2.0/tests/test_detector.py +288 -0
  16. datemonkey-0.2.0/tests/test_excel.py +169 -0
  17. datemonkey-0.2.0/tests/test_parser.py +250 -0
  18. datemonkey-0.2.0/tests/test_properties.py +125 -0
  19. datemonkey-0.1.0/tests/test_cli.py +0 -68
  20. datemonkey-0.1.0/tests/test_detector.py +0 -137
  21. datemonkey-0.1.0/tests/test_excel.py +0 -87
  22. datemonkey-0.1.0/tests/test_parser.py +0 -134
  23. {datemonkey-0.1.0 → datemonkey-0.2.0}/LICENSE +0 -0
  24. {datemonkey-0.1.0 → datemonkey-0.2.0}/setup.cfg +0 -0
  25. {datemonkey-0.1.0 → datemonkey-0.2.0}/src/datemonkey.egg-info/dependency_links.txt +0 -0
  26. {datemonkey-0.1.0 → datemonkey-0.2.0}/src/datemonkey.egg-info/entry_points.txt +0 -0
  27. {datemonkey-0.1.0 → datemonkey-0.2.0}/src/datemonkey.egg-info/top_level.txt +0 -0
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: datemonkey
3
- Version: 0.1.0
3
+ Version: 0.2.0
4
4
  Summary: Batch date parsing with ambiguity detection, confidence scores, and format lock-in.
5
5
  Author-email: RexBytes <pythonic@rexbytes.com>
6
6
  License-Expression: MIT
@@ -21,6 +21,12 @@ Classifier: Topic :: Text Processing
21
21
  Requires-Python: >=3.9
22
22
  Description-Content-Type: text/markdown
23
23
  License-File: LICENSE
24
+ Provides-Extra: dev
25
+ Requires-Dist: pytest; extra == "dev"
26
+ Requires-Dist: pytest-cov; extra == "dev"
27
+ Requires-Dist: hypothesis; extra == "dev"
28
+ Requires-Dist: ruff; extra == "dev"
29
+ Requires-Dist: mypy; extra == "dev"
24
30
  Dynamic: license-file
25
31
 
26
32
  # datemonkey
@@ -193,6 +199,19 @@ Convert an Excel serial date number to a Python datetime.
193
199
 
194
200
  datemonkey is designed to work well as a tool for large language models. Date parsing is a common source of silent errors in LLM-driven data pipelines — ambiguous formats lead to wrong guesses, wasted tokens on retries, and broken downstream logic. datemonkey reduces that complexity: a single call returns a structured result with the detected format, confidence level, and any ambiguities — no multi-step prompting or validation loops required. Fewer tokens in, reliable answers out.
195
201
 
202
+ ## Changelog
203
+
204
+ See [CHANGELOG.md](CHANGELOG.md) for release history.
205
+
206
+ ## Development & review
207
+
208
+ datemonkey is hardened with a competitive multi-model review methodology. The
209
+ self-contained kit lives in [`review-kit/`](review-kit/):
210
+
211
+ - [`review-kit/CONTRIBUTING.md`](review-kit/CONTRIBUTING.md) — testing philosophy and the review-panel process
212
+ - [`review-kit/LIMITATIONS.md`](review-kit/LIMITATIONS.md) — **deliberate** design tradeoffs (DD/MM ambiguity, the Excel leap-year bug, the two-digit-year pivot, format lock-in). Read this before "fixing" behaviour that looks wrong.
213
+ - [`review-kit/RELEASE_READINESS.md`](review-kit/RELEASE_READINESS.md) — the release rubric; run `python review-kit/scripts/readiness.py`.
214
+
196
215
  ## License
197
216
 
198
217
  MIT
@@ -168,6 +168,19 @@ Convert an Excel serial date number to a Python datetime.
168
168
 
169
169
  datemonkey is designed to work well as a tool for large language models. Date parsing is a common source of silent errors in LLM-driven data pipelines — ambiguous formats lead to wrong guesses, wasted tokens on retries, and broken downstream logic. datemonkey reduces that complexity: a single call returns a structured result with the detected format, confidence level, and any ambiguities — no multi-step prompting or validation loops required. Fewer tokens in, reliable answers out.
170
170
 
171
+ ## Changelog
172
+
173
+ See [CHANGELOG.md](CHANGELOG.md) for release history.
174
+
175
+ ## Development & review
176
+
177
+ datemonkey is hardened with a competitive multi-model review methodology. The
178
+ self-contained kit lives in [`review-kit/`](review-kit/):
179
+
180
+ - [`review-kit/CONTRIBUTING.md`](review-kit/CONTRIBUTING.md) — testing philosophy and the review-panel process
181
+ - [`review-kit/LIMITATIONS.md`](review-kit/LIMITATIONS.md) — **deliberate** design tradeoffs (DD/MM ambiguity, the Excel leap-year bug, the two-digit-year pivot, format lock-in). Read this before "fixing" behaviour that looks wrong.
182
+ - [`review-kit/RELEASE_READINESS.md`](review-kit/RELEASE_READINESS.md) — the release rubric; run `python review-kit/scripts/readiness.py`.
183
+
171
184
  ## License
172
185
 
173
186
  MIT
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
4
4
 
5
5
  [project]
6
6
  name = "datemonkey"
7
- version = "0.1.0"
7
+ version = "0.2.0"
8
8
  description = "Batch date parsing with ambiguity detection, confidence scores, and format lock-in."
9
9
  readme = "README.md"
10
10
  license = "MIT"
@@ -26,6 +26,9 @@ classifiers = [
26
26
  "Topic :: Text Processing",
27
27
  ]
28
28
 
29
+ [project.optional-dependencies]
30
+ dev = ["pytest", "pytest-cov", "hypothesis", "ruff", "mypy"]
31
+
29
32
  [project.scripts]
30
33
  datemonkey = "datemonkey.cli:main"
31
34
 
@@ -40,3 +43,11 @@ where = ["src"]
40
43
  [tool.pytest.ini_options]
41
44
  testpaths = ["tests"]
42
45
  pythonpath = ["src"]
46
+
47
+ [tool.ruff]
48
+ src = ["src", "tests"]
49
+
50
+ [tool.mypy]
51
+ files = ["src"]
52
+ python_version = "3.9"
53
+ ignore_missing_imports = true
@@ -31,7 +31,7 @@ from .models import (
31
31
  )
32
32
  from .parser import parse_dates
33
33
 
34
- __version__ = "0.1.0"
34
+ __version__ = "0.2.0"
35
35
 
36
36
  __all__ = [
37
37
  # Core API
@@ -10,35 +10,58 @@ from typing import Optional
10
10
 
11
11
  from .detector import detect_format
12
12
  from .formats import ALL_FORMATS
13
- from .models import Confidence
14
13
  from .parser import parse_dates
15
14
 
16
15
 
16
+ def _extract_column(f, col: int, skip_header: bool) -> list[str]:
17
+ """Pull one CSV column from an open text stream."""
18
+ if col < 0:
19
+ print("Error: --column must be a non-negative integer", file=sys.stderr)
20
+ sys.exit(1)
21
+ reader = csv.reader(f)
22
+ values = []
23
+ skipped = 0
24
+ for i, row in enumerate(reader):
25
+ if skip_header and i == 0:
26
+ continue
27
+ if col < len(row):
28
+ values.append(row[col])
29
+ elif row:
30
+ # Non-empty row missing the requested column: don't drop it
31
+ # silently — report the data loss to stderr.
32
+ skipped += 1
33
+ if skipped:
34
+ print(
35
+ f"Warning: {skipped} row(s) had no column {col}; skipped.",
36
+ file=sys.stderr,
37
+ )
38
+ return values
39
+
40
+
41
+ def _read_lines(lines, skip_header: bool) -> list[str]:
42
+ """Read non-empty stripped lines, optionally dropping a header row."""
43
+ stripped = [line.strip() for line in lines]
44
+ if skip_header and stripped:
45
+ stripped = stripped[1:]
46
+ return [ln for ln in stripped if ln]
47
+
48
+
17
49
  def _read_values(args: argparse.Namespace) -> list[str]:
18
- """Read date values from arguments or stdin."""
50
+ """Read date values from arguments, a file, or stdin."""
19
51
  if args.values:
20
52
  return args.values
21
53
 
22
54
  if args.file:
23
55
  with open(args.file) as f:
24
56
  if args.column is not None:
25
- if args.column < 0:
26
- print("Error: --column must be a non-negative integer", file=sys.stderr)
27
- sys.exit(1)
28
- reader = csv.reader(f)
29
- col = args.column
30
- values = []
31
- for i, row in enumerate(reader):
32
- if args.skip_header and i == 0:
33
- continue
34
- if col < len(row):
35
- values.append(row[col])
36
- return values
37
- else:
38
- return [line.strip() for line in f if line.strip()]
57
+ return _extract_column(f, args.column, args.skip_header)
58
+ return _read_lines(f, args.skip_header)
39
59
 
40
60
  if not sys.stdin.isatty():
41
- return [line.strip() for line in sys.stdin if line.strip()]
61
+ # --column and --skip-header apply to piped CSV too, not only to --file.
62
+ if args.column is not None:
63
+ return _extract_column(sys.stdin, args.column, args.skip_header)
64
+ return _read_lines(sys.stdin, args.skip_header)
42
65
 
43
66
  print("No input provided. Pass values as arguments, via --file, or pipe to stdin.", file=sys.stderr)
44
67
  sys.exit(1)
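The column-extraction change above counts rows that lack the requested column instead of dropping them silently. A minimal standalone sketch of that idea follows. It assumes an in-memory stream, and `extract_column` is a hypothetical stand-in that returns the skipped-row count rather than printing to stderr or exiting as the CLI helper does.

```python
import csv
import io


def extract_column(stream, col, skip_header=False):
    """Return (values, skipped) for one CSV column of an open text stream.

    Hypothetical stand-in for the diff's _extract_column: it reports the
    number of too-short rows to the caller instead of writing a warning.
    """
    if col < 0:
        raise ValueError("column must be a non-negative integer")
    values = []
    skipped = 0
    for i, row in enumerate(csv.reader(stream)):
        if skip_header and i == 0:
            continue
        if col < len(row):
            values.append(row[col])
        elif row:
            # Non-empty row that is too short: count it rather than lose it.
            skipped += 1
    return values, skipped


# One header, one short row ("2"), one blank line, two complete rows.
sample = io.StringIO("id,date\n1,2024-01-05\n2\n\n3,2024-02-10\n")
values, skipped = extract_column(sample, col=1, skip_header=True)
```

Blank lines come back from `csv.reader` as empty lists, so the `elif row` test leaves them out of the skipped count; only rows that carry data but not the requested column are reported.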
@@ -87,12 +110,12 @@ def cmd_detect(args: argparse.Namespace) -> None:
87
110
  print(f"\n AMBIGUOUS: {', '.join(a.value for a in result.ambiguities)}")
88
111
 
89
112
  if result.candidates:
90
- print(f"\n Top candidates:")
113
+ print("\n Top candidates:")
91
114
  for c in result.candidates[:5]:
92
115
  print(f" {c.format.label:40s} {c.match_count}/{c.sample_size} matches")
93
116
 
94
117
  if result.warnings:
95
- print(f"\n Warnings:")
118
+ print("\n Warnings:")
96
119
  for w in result.warnings:
97
120
  print(f" - {w}")
98
121
 
@@ -169,7 +192,7 @@ def main(argv: Optional[list[str]] = None) -> None:
169
192
  prog="datemonkey",
170
193
  description="Batch date parsing with ambiguity detection.",
171
194
  )
172
- ap.add_argument("--version", action="version", version="datemonkey 0.1.0")
195
+ ap.add_argument("--version", action="version", version="datemonkey 0.2.0")
173
196
 
174
197
  sub = ap.add_subparsers(dest="command", help="Available commands")
175
198
 
@@ -11,10 +11,10 @@ from typing import Any, Optional, Sequence
11
11
 
12
12
  from .formats import (
13
13
  ALL_FORMATS,
14
- AMBIGUOUS_PAIRS,
15
14
  could_be_excel_serial,
16
15
  get_ambiguous_partner,
17
16
  is_date_like,
17
+ normalize_locale,
18
18
  )
19
19
  from .models import (
20
20
  AmbiguityType,
@@ -185,8 +185,10 @@ def detect_format(
185
185
  ],
186
186
  )
187
187
 
188
- # Score all candidate formats
189
- test_formats = formats or ALL_FORMATS
188
+ # Score all candidate formats. Use `is not None` (not truthiness) so an
189
+ # explicit empty list means "test no formats" rather than silently falling
190
+ # back to the full catalogue — the docstring promises None alone defaults.
191
+ test_formats = formats if formats is not None else ALL_FORMATS
190
192
  candidates = _score_candidates(date_like, test_formats)
191
193
 
192
194
  if not candidates:
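The switch from `formats or ALL_FORMATS` to an explicit `is not None` test only matters for an empty list. A small sketch of the difference, using a placeholder catalogue (the three patterns below are illustrative, not the package's real `ALL_FORMATS`):

```python
# Placeholder catalogue for illustration only.
ALL_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y"]


def pick_by_truthiness(formats=None):
    """0.1.0-style selection: an empty list is falsy, so it falls back."""
    return formats or ALL_FORMATS


def pick_by_none_check(formats=None):
    """0.2.0-style selection: only None falls back to the full catalogue."""
    return formats if formats is not None else ALL_FORMATS


old_result = pick_by_truthiness([])   # the full catalogue
new_result = pick_by_none_check([])   # the caller's empty list
default_result = pick_by_none_check(None)
```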
@@ -212,46 +214,55 @@ def detect_format(
212
214
  # Check DD/MM vs MM/DD ambiguity
213
215
  ambig_partner = _check_day_month_ambiguity(date_like, best, candidates)
214
216
  if ambig_partner is not None:
215
- # Try to resolve from data
217
+ # Try to resolve from data. _check_day_month_ambiguity only returns a
218
+ # partner with an *equal* match count, so _resolve_day_month can only
219
+ # confirm `best` (some value parses under best but not the partner) or
220
+ # report it as truly ambiguous — it can never name the partner as the
221
+ # winner here, so `best` never needs swapping.
216
222
  resolved = _resolve_day_month(date_like, best.format, ambig_partner.format)
217
- if resolved is not None:
218
- # Data resolves it — pick the right one
219
- if resolved != best.format:
220
- # Swap: the partner is actually the correct one
221
- for c in candidates:
222
- if c.format == resolved:
223
- best = c
224
- break
225
- else:
223
+ if resolved is None:
226
224
  # Truly ambiguous
227
225
  ambiguities.append(AmbiguityType.DAY_MONTH_SWAP)
228
226
  warnings.append(
229
227
  f"Ambiguous: cannot distinguish {best.format.label} from "
230
228
  f"{ambig_partner.format.label} using data alone."
231
229
  )
232
- # Apply locale preference if provided
230
+ # Apply locale preference if provided. Choose between the two
231
+ # ambiguous candidates only (`best` and its partner `ambig_partner`)
232
+ # — never scan all candidates for any %m/%d format, or a custom
233
+ # `formats=` list could surface an unrelated MM/DD format (e.g.
234
+ # US_DASH for slash data) that cannot even parse the values.
233
235
  if locale_preference:
234
- lp = locale_preference.lower()
235
- if lp in ("us", "en_us", "en-us", "american"):
236
- # Prefer MM/DD
237
- for c in candidates:
238
- if c.format.pattern.startswith("%m"):
239
- best = c
240
- break
236
+ lp = normalize_locale(locale_preference)
237
+ if lp == "us":
238
+ # Prefer MM/DD (the %m side of the ambiguous pair).
239
+ best = (
240
+ best
241
+ if best.format.pattern.startswith("%m")
242
+ else ambig_partner
243
+ )
241
244
  warnings.append(
242
245
  f"Locale preference '{locale_preference}' applied: "
243
246
  f"using {best.format.label}."
244
247
  )
245
- elif lp in ("eu", "european", "en_gb", "en-gb", "british", "de", "fr", "es", "it"):
246
- # Prefer DD/MM
247
- for c in candidates:
248
- if c.format.pattern.startswith("%d"):
249
- best = c
250
- break
248
+ elif lp == "eu":
249
+ # Prefer DD/MM (the %d side of the ambiguous pair).
250
+ best = (
251
+ best
252
+ if best.format.pattern.startswith("%d")
253
+ else ambig_partner
254
+ )
251
255
  warnings.append(
252
256
  f"Locale preference '{locale_preference}' applied: "
253
257
  f"using {best.format.label}."
254
258
  )
259
+ else:
260
+ # Unrecognised hint: do not silently ignore it — say so, so
261
+ # the ambiguity is not mistaken for "resolved by locale".
262
+ warnings.append(
263
+ f"Locale preference '{locale_preference}' not "
264
+ f"recognized (use 'us' or 'eu'); ignoring."
265
+ )
255
266
 
256
267
  # Check two-digit year
257
268
  if _has_two_digit_year(best.format):
@@ -261,14 +272,24 @@ def detect_format(
261
272
  "00-68 as 2000-2068 and 69-99 as 1969-1999."
262
273
  )
263
274
 
264
- # Check mixed formats (multiple candidates with high match counts)
275
+ # Check mixed formats (multiple candidates with high match counts).
276
+ # Compare against the strongest candidate that is neither `best` itself nor
277
+ # its DD/MM-vs-MM/DD partner. Indexing candidates[1] blindly is wrong once a
278
+ # locale_preference has reassigned `best` to candidates[1]: the "second"
279
+ # would then be `best` again, producing a self-referential MIXED_FORMATS
280
+ # warning on the eu branch but not the us branch.
265
281
  if len(candidates) >= 2:
266
- second = candidates[1]
267
- # If second-best matches a significant portion and is a different "family"
268
- if (
269
- second.match_count >= len(date_like) * 0.2
270
- and get_ambiguous_partner(best.format) != second.format
271
- ):
282
+ partner = get_ambiguous_partner(best.format)
283
+ second = next(
284
+ (
285
+ c
286
+ for c in candidates
287
+ if c.format != best.format and c.format != partner
288
+ ),
289
+ None,
290
+ )
291
+ # If the next distinct family matches a significant portion, flag it.
292
+ if second is not None and second.match_count >= len(date_like) * 0.2:
272
293
  ambiguities.append(AmbiguityType.MIXED_FORMATS)
273
294
  warnings.append(
274
295
  f"Possible mixed formats: {best.format.label} "
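The mixed-format check above replaces a fixed `candidates[1]` lookup with a search for the first candidate that is neither `best` nor its day/month partner. A reduced sketch of that selection, assuming a hypothetical `Candidate` tuple keyed by label instead of the package's `FormatCandidate` objects:

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    """Hypothetical, simplified candidate: a label and a match count."""
    label: str
    match_count: int


def next_distinct(
    candidates: Sequence[Candidate], best: str, partner: Optional[str]
) -> Optional[Candidate]:
    """First candidate that is neither `best` nor its ambiguous partner."""
    return next(
        (c for c in candidates if c.label != best and c.label != partner),
        None,
    )


ranked = [
    Candidate("MM/DD/YYYY", 10),
    Candidate("DD/MM/YYYY", 10),
    Candidate("YYYY-MM-DD", 3),
]
# Suppose an "eu" preference moved `best` to ranked[1]; ranked[1] would then
# be `best` itself, so indexing position 1 compares the pick with itself.
second = next_distinct(ranked, best="DD/MM/YYYY", partner="MM/DD/YYYY")
only_pair = next_distinct(ranked[:2], best="DD/MM/YYYY", partner="MM/DD/YYYY")
```

With only the ambiguous pair present the search yields `None`, which is why the rewritten condition tests `second is not None` before comparing match counts.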
@@ -286,9 +307,25 @@ def detect_format(
286
307
  else:
287
308
  confidence = Confidence.LOW
288
309
 
289
- # Assign confidence to candidates
310
+ # Assign confidence to candidates. `FormatCandidate.confidence` means
311
+ # "confidence if this format were chosen", so it must agree with the
312
+ # headline value the same format would produce as the pick — otherwise the
313
+ # same DateFormat reports two different numbers.
314
+ # - the chosen format carries the headline confidence directly;
315
+ # - best's unresolved-swap partner has the SAME counterfactual headline
316
+ # (LOW), so mirror it rather than letting a full match fall to MEDIUM;
317
+ # - everything else gets a per-format estimate (an ambiguous-pair format
318
+ # is capped below HIGH only while the swap is actually unresolved).
319
+ swap_unresolved = AmbiguityType.DAY_MONTH_SWAP in ambiguities
320
+ partner_of_best = get_ambiguous_partner(best.format)
290
321
  for c in candidates:
291
- if c.match_count == len(date_like) and get_ambiguous_partner(c.format) is None:
322
+ if c.format == best.format:
323
+ c.confidence = confidence
324
+ elif swap_unresolved and c.format == partner_of_best:
325
+ c.confidence = confidence
326
+ elif c.match_count == len(date_like) and not (
327
+ swap_unresolved and get_ambiguous_partner(c.format) is not None
328
+ ):
292
329
  c.confidence = Confidence.HIGH
293
330
  elif c.match_count >= len(date_like) * 0.8:
294
331
  c.confidence = Confidence.MEDIUM
@@ -12,6 +12,7 @@ by matching Excel's behavior for compatibility.
12
12
  from __future__ import annotations
13
13
 
14
14
  import datetime
15
+ import math
15
16
  from typing import Optional, Union
16
17
 
17
18
  from .models import Confidence, DateFormat, DateResult
@@ -47,31 +48,56 @@ def excel_serial_to_datetime(
47
48
  except (ValueError, TypeError):
48
49
  return None
49
50
 
50
- if num < _MIN_SERIAL or num > _MAX_SERIAL:
51
+ # NaN and infinity survive float() but cannot be converted to a day count:
52
+ # int(nan) raises ValueError and int(inf) raises OverflowError, and NaN
53
+ # comparisons are all False so the range check below would not stop them.
54
+ # They are "invalid" per the contract, so map to None rather than crash.
55
+ if not math.isfinite(num):
51
56
  return None
52
57
 
53
- # Handle the Lotus 1-2-3 leap year bug:
54
- # Excel thinks 1900-02-29 exists (serial 60).
55
- # For serials > 60, subtract 1 to correct the off-by-one.
56
- if int(num) == 60:
57
- # Serial 60 is the non-existent Feb 29, 1900.
58
- # Return March 1, 1900 to match common convention.
59
- return datetime.datetime(1900, 3, 1)
60
-
61
- days = int(num)
62
- fraction = num - days
58
+ # Bound on the integer day part: a serial like 2958465.5 (noon on the last
59
+ # representable day, 9999-12-31) is in range even though 2958465.5 > the
60
+ # integer _MAX_SERIAL. Comparing the float against the int max wrongly
61
+ # rejected every fractional time on the final day.
62
+ if num < _MIN_SERIAL or int(num) > _MAX_SERIAL:
63
+ return None
63
64
 
64
- if days > 60:
65
- days -= 1
65
+ try:
66
+ # Handle the Lotus 1-2-3 leap year bug:
67
+ # Excel thinks 1900-02-29 exists (serial 60).
68
+ # For serials > 60, subtract 1 to correct the off-by-one.
69
+ if int(num) == 60:
70
+ # Serial 60 is the non-existent Feb 29, 1900. Map to March 1, 1900,
71
+ # preserving any fractional time-of-day component (the fraction is
72
+ # otherwise silently dropped at this one serial).
73
+ dt = datetime.datetime(1900, 3, 1)
74
+ dt += _time_of_day(num - 60)
75
+ return dt
76
+
77
+ days = int(num)
78
+ fraction = num - days
79
+
80
+ if days > 60:
81
+ days -= 1
82
+
83
+ dt = _EXCEL_EPOCH + datetime.timedelta(days=days)
84
+ dt += _time_of_day(fraction)
85
+ return dt
86
+ except (OverflowError, ValueError):
87
+ # Defensive: any residual range/overflow surprise -> invalid -> None.
88
+ return None
66
89
 
67
- dt = _EXCEL_EPOCH + datetime.timedelta(days=days)
68
90
 
69
- # Add time component from fractional part
70
- if fraction > 0:
71
- total_seconds = round(fraction * 86400)
72
- dt += datetime.timedelta(seconds=total_seconds)
91
+ def _time_of_day(fraction: float) -> datetime.timedelta:
92
+ """Convert a day fraction to a timedelta within the same calendar day.
73
93
 
74
- return dt
94
+ A fraction at or above 86399.5/86400 (≈ the last half-second of a day)
95
+ would round up to a full 86400 seconds and silently roll the date into the
96
+ next day; clamp to 86399 so the time stays inside the intended day.
97
+ """
98
+ if fraction <= 0:
99
+ return datetime.timedelta(0)
100
+ return datetime.timedelta(seconds=min(round(fraction * 86400), 86399))
75
101
 
76
102
 
77
103
  def parse_excel_serial(
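Three numeric edge cases drive the excel.py changes above: NaN slips past a plain range check, a fractional serial on the last day exceeds the integer maximum, and a fraction near 1.0 rounds up to a whole day. A standalone sketch of all three, where `in_range` is a hypothetical helper combining the two guards and `MAX_SERIAL` is the 2958465 bound quoted in the diff:

```python
import datetime
import math

MAX_SERIAL = 2958465  # 9999-12-31, the bound quoted in the diff


def time_of_day(fraction: float) -> datetime.timedelta:
    """Day fraction -> timedelta, clamped so it never reaches a full day."""
    if fraction <= 0:
        return datetime.timedelta(0)
    return datetime.timedelta(seconds=min(round(fraction * 86400), 86399))


def in_range(num: float) -> bool:
    """Finite, and bounded on the integer day part rather than the raw float."""
    return math.isfinite(num) and num >= 1 and int(num) <= MAX_SERIAL


nan = float("nan")
# Every comparison with NaN is False, so a naive range check never rejects it.
naive_rejects_nan = nan < 1 or nan > MAX_SERIAL

noon = time_of_day(0.5)
almost_midnight = time_of_day(0.999999)  # would round to 86400 without the clamp
```

Because `math.isfinite` is evaluated first, `int(num)` is never reached for NaN or infinity, so `in_range` cannot raise on those inputs.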
@@ -119,8 +119,36 @@ def could_be_excel_serial(value: str) -> bool:
119
119
  return False
120
120
  num = float(s)
121
121
  # Excel serial dates: 1 = 1900-01-01, reasonable range up to ~2958465 (9999-12-31)
122
- # But practically, most dates are between 1 (1900) and ~55000 (2050+)
123
- return 1 <= num <= 2958465
122
+ # But practically, most dates are between 1 (1900) and ~55000 (2050+).
123
+ # Bound on the integer day part so a valid fractional time on the last day
124
+ # (e.g. "2958465.5") is still recognised, matching excel_serial_to_datetime.
125
+ return 1 <= num and int(num) <= 2958465
126
+
127
+
128
+ # ── Locale preference normalisation ──────────────────────────────────────
129
+
130
+ US_LOCALE_ALIASES = frozenset({"us", "en_us", "en-us", "american"})
131
+ EU_LOCALE_ALIASES = frozenset(
132
+ {"eu", "european", "en_gb", "en-gb", "british", "de", "fr", "es", "it"}
133
+ )
134
+
135
+
136
+ def normalize_locale(locale_preference: Optional[str]) -> Optional[str]:
137
+ """Normalise a locale_preference hint to ``"us"``, ``"eu"``, or ``None``.
138
+
139
+ Trims surrounding whitespace and lowercases before matching, so values like
140
+ ``" eu "`` or ``"EU"`` resolve. Returns ``None`` for both a missing
141
+ preference and an *unrecognised* one — callers that must tell those apart
142
+ should also test ``locale_preference`` itself.
143
+ """
144
+ if not locale_preference:
145
+ return None
146
+ lp = locale_preference.strip().lower()
147
+ if lp in US_LOCALE_ALIASES:
148
+ return "us"
149
+ if lp in EU_LOCALE_ALIASES:
150
+ return "eu"
151
+ return None
124
152
 
125
153
 
126
154
  def get_ambiguous_partner(fmt: DateFormat) -> Optional[DateFormat]:
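The `normalize_locale` docstring notes that `None` covers both a missing and an unrecognised hint. A sketch of how a caller might tell the two apart: the alias sets and `normalize_locale` are reproduced from the hunk above so the block runs on its own, while `classify_hint` is a hypothetical helper that is not part of the package.

```python
from typing import Optional

# Reproduced from the hunk above so this sketch is self-contained.
US_LOCALE_ALIASES = frozenset({"us", "en_us", "en-us", "american"})
EU_LOCALE_ALIASES = frozenset(
    {"eu", "european", "en_gb", "en-gb", "british", "de", "fr", "es", "it"}
)


def normalize_locale(locale_preference: Optional[str]) -> Optional[str]:
    if not locale_preference:
        return None
    lp = locale_preference.strip().lower()
    if lp in US_LOCALE_ALIASES:
        return "us"
    if lp in EU_LOCALE_ALIASES:
        return "eu"
    return None


def classify_hint(locale_preference: Optional[str]) -> str:
    """Hypothetical helper: 'us', 'eu', 'absent', or 'unrecognised'."""
    resolved = normalize_locale(locale_preference)
    if resolved is not None:
        return resolved
    # None is overloaded, so consult the raw value as the docstring advises.
    return "unrecognised" if locale_preference else "absent"


results = [classify_hint(h) for h in (" EU ", "American", "en", "", None)]
```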
@@ -166,7 +166,11 @@ class BatchResult:
166
166
  @property
167
167
  def ok(self) -> bool:
168
168
  """True if all values were parsed successfully."""
169
- return self.failed_count == 0 and self.total > 0
169
+ # Require every value to have parsed — not merely "nothing failed".
170
+ # Early-return paths (undetectable format, strict refusal, all-blank
171
+ # input) leave parsed_count == failed_count == 0 with total > 0; those
172
+ # must report ok == False, not True.
173
+ return self.total > 0 and self.parsed_count == self.total
170
174
 
171
175
  @property
172
176
  def success_ratio(self) -> float:
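The `ok` change above is easiest to see on a batch where nothing was parsed and nothing was counted as failed. A reduced sketch, assuming a hypothetical `Batch` dataclass that carries only the three counters involved (the real `BatchResult` has more fields):

```python
from dataclasses import dataclass


@dataclass
class Batch:
    """Hypothetical, stripped-down stand-in for BatchResult."""
    total: int = 0
    parsed_count: int = 0
    failed_count: int = 0

    @property
    def ok_old(self) -> bool:
        # 0.1.0 rule: "nothing failed".
        return self.failed_count == 0 and self.total > 0

    @property
    def ok_new(self) -> bool:
        # 0.2.0 rule: "everything parsed".
        return self.total > 0 and self.parsed_count == self.total


# An early-return batch as described in the comment: counters left at zero.
early_return = Batch(total=3)
fully_parsed = Batch(total=3, parsed_count=3)
empty = Batch()
```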
@@ -11,7 +11,7 @@ from typing import Any, Optional, Sequence, Union
11
11
 
12
12
  from .detector import detect_format
13
13
  from .excel import EXCEL_SERIAL_FORMAT, parse_excel_serial
14
- from .formats import could_be_excel_serial
14
+ from .formats import could_be_excel_serial, normalize_locale
15
15
  from .models import (
16
16
  AmbiguityType,
17
17
  BatchResult,
@@ -68,6 +68,30 @@ def _parse_single(
68
68
  )
69
69
 
70
70
 
71
+ def _all_failed(
72
+ values: Sequence[Any],
73
+ fmt: Optional[DateFormat],
74
+ reason: str,
75
+ ) -> list[DateResult]:
76
+ """Build one FAILED DateResult per value for an early-return path.
77
+
78
+ Keeps the batch invariant ``total == failed_count == len(results)`` and
79
+ lets callers inspect ``batch.failed`` even when parsing was refused or no
80
+ format could be determined (otherwise ``results`` would be empty while
81
+ ``total`` reported N values).
82
+ """
83
+ return [
84
+ DateResult(
85
+ original=v,
86
+ format_used=fmt,
87
+ confidence=Confidence.FAILED,
88
+ warnings=[reason],
89
+ row_index=i,
90
+ )
91
+ for i, v in enumerate(values)
92
+ ]
93
+
94
+
71
95
  def parse_dates(
72
96
  values: Sequence[Any],
73
97
  *,
@@ -86,8 +110,11 @@ def parse_dates(
86
110
  If None, auto-detect from the values.
87
111
  locale_preference: Hint for resolving DD/MM vs MM/DD ambiguity
88
112
  during auto-detection. "us" for MM/DD, "eu" for DD/MM.
89
- strict: If True, treat any ambiguity as a failure (don't parse
90
- ambiguous values, report them as errors).
113
+ strict: If True, refuse to parse when the batch has an unresolved
114
+ DD/MM vs MM/DD (day/month swap) ambiguity and no recognized
115
+ locale_preference was supplied to break it. Other ambiguity types
116
+ (two-digit year, mixed formats) are reported as warnings but do not
117
+ block parsing.
91
118
 
92
119
  Returns:
93
120
  BatchResult containing per-value results, detected format,
@@ -120,25 +147,40 @@ def parse_dates(
120
147
  resolved_format = detection_result.format
121
148
 
122
149
  if resolved_format is None:
150
+ reason = "Could not determine date format."
123
151
  return BatchResult(
152
+ results=_all_failed(values, None, reason),
124
153
  total=len(values),
125
- warnings=["Could not determine date format."],
154
+ failed_count=len(values),
155
+ warnings=[reason],
126
156
  format_detection=detection_result,
127
157
  )
128
158
 
129
- # In strict mode, refuse to parse if ambiguous
130
- if strict and detection_result and detection_result.is_ambiguous:
131
- if AmbiguityType.DAY_MONTH_SWAP in detection_result.ambiguities:
132
- return BatchResult(
133
- total=len(values),
134
- detected_format=resolved_format,
135
- warnings=[
136
- "Strict mode: refusing to parse due to DD/MM vs MM/DD "
137
- "ambiguity. Provide a format or locale_preference to resolve."
138
- ]
139
- + detection_result.warnings,
140
- format_detection=detection_result,
141
- )
159
+ # In strict mode, refuse to parse if ambiguous — UNLESS the caller supplied
160
+ # a *recognized* locale_preference, the documented escape hatch for the
161
+ # DD/MM vs MM/DD case. Gate on whether the preference actually resolves the
162
+ # ambiguity (normalize_locale), not merely on whether some string was
163
+ # passed: an unrecognized/whitespace/empty hint (e.g. "en", " eu ", "") is
164
+ # never applied by detection, so it must NOT silently disable the refusal.
165
+ if (
166
+ strict
167
+ and detection_result
168
+ and detection_result.is_ambiguous
169
+ and normalize_locale(locale_preference) is None
170
+ and AmbiguityType.DAY_MONTH_SWAP in detection_result.ambiguities
171
+ ):
172
+ reason = (
173
+ "Strict mode: refusing to parse due to DD/MM vs MM/DD "
174
+ "ambiguity. Provide a format or locale_preference to resolve."
175
+ )
176
+ return BatchResult(
177
+ results=_all_failed(values, resolved_format, reason),
178
+ detected_format=resolved_format,
179
+ total=len(values),
180
+ failed_count=len(values),
181
+ warnings=[reason] + detection_result.warnings,
182
+ format_detection=detection_result,
183
+ )
142
184
 
143
185
  # Parse each value with the locked-in format
144
186
  results: list[DateResult] = []
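The strict-mode gate above refuses only when three things hold together: strict is on, a day/month swap is unresolved, and no recognised locale hint was given. A boolean sketch of that rule, where `should_refuse` and `RECOGNISED_HINTS` are hypothetical names and the set is a shortened placeholder for the package's alias lists:

```python
from typing import Optional

# Shortened placeholder for the recognised aliases; an assumption of this sketch.
RECOGNISED_HINTS = frozenset({"us", "eu"})


def should_refuse(
    strict: bool, has_day_month_swap: bool, locale_preference: Optional[str]
) -> bool:
    """True when strict mode should refuse to parse the batch."""
    hint = (locale_preference or "").strip().lower()
    return strict and has_day_month_swap and hint not in RECOGNISED_HINTS


cases = {
    "no hint": should_refuse(True, True, None),
    "unrecognised hint": should_refuse(True, True, "en"),
    "recognised hint": should_refuse(True, True, "eu"),
    "not strict": should_refuse(False, True, None),
    "no swap ambiguity": should_refuse(True, False, None),
}
```

Gating on whether the hint is recognised, rather than on whether any string was passed, keeps a mistyped preference from switching the refusal off.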
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: datemonkey
3
- Version: 0.1.0
3
+ Version: 0.2.0
4
4
  Summary: Batch date parsing with ambiguity detection, confidence scores, and format lock-in.
5
5
  Author-email: RexBytes <pythonic@rexbytes.com>
6
6
  License-Expression: MIT
@@ -21,6 +21,12 @@ Classifier: Topic :: Text Processing
21
21
  Requires-Python: >=3.9
22
22
  Description-Content-Type: text/markdown
23
23
  License-File: LICENSE
24
+ Provides-Extra: dev
25
+ Requires-Dist: pytest; extra == "dev"
26
+ Requires-Dist: pytest-cov; extra == "dev"
27
+ Requires-Dist: hypothesis; extra == "dev"
28
+ Requires-Dist: ruff; extra == "dev"
29
+ Requires-Dist: mypy; extra == "dev"
24
30
  Dynamic: license-file
25
31
 
26
32
  # datemonkey
@@ -193,6 +199,19 @@ Convert an Excel serial date number to a Python datetime.
193
199
 
194
200
  datemonkey is designed to work well as a tool for large language models. Date parsing is a common source of silent errors in LLM-driven data pipelines — ambiguous formats lead to wrong guesses, wasted tokens on retries, and broken downstream logic. datemonkey reduces that complexity: a single call returns a structured result with the detected format, confidence level, and any ambiguities — no multi-step prompting or validation loops required. Fewer tokens in, reliable answers out.
195
201
 
202
+ ## Changelog
203
+
204
+ See [CHANGELOG.md](CHANGELOG.md) for release history.
205
+
206
+ ## Development & review
207
+
208
+ datemonkey is hardened with a competitive multi-model review methodology. The
209
+ self-contained kit lives in [`review-kit/`](review-kit/):
210
+
211
+ - [`review-kit/CONTRIBUTING.md`](review-kit/CONTRIBUTING.md) — testing philosophy and the review-panel process
212
+ - [`review-kit/LIMITATIONS.md`](review-kit/LIMITATIONS.md) — **deliberate** design tradeoffs (DD/MM ambiguity, the Excel leap-year bug, the two-digit-year pivot, format lock-in). Read this before "fixing" behaviour that looks wrong.
213
+ - [`review-kit/RELEASE_READINESS.md`](review-kit/RELEASE_READINESS.md) — the release rubric; run `python review-kit/scripts/readiness.py`.
214
+
196
215
  ## License
197
216
 
198
217
  MIT