application-pipeline 0.6.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- application_pipeline-0.6.0/.github/workflows/publish.yml +103 -0
- application_pipeline-0.6.0/.gitignore +215 -0
- application_pipeline-0.6.0/.python-version +1 -0
- application_pipeline-0.6.0/CLAUDE.md +13 -0
- application_pipeline-0.6.0/CONTEXT.md +132 -0
- application_pipeline-0.6.0/PKG-INFO +13 -0
- application_pipeline-0.6.0/README.md +48 -0
- application_pipeline-0.6.0/docs/adr/0001-claude-cli-as-llm-backend.md +17 -0
- application_pipeline-0.6.0/docs/adr/0002-seen-state-durable-via-syncthing.md +18 -0
- application_pipeline-0.6.0/docs/adr/0003-dedup-alias-on-tuple-match.md +17 -0
- application_pipeline-0.6.0/docs/adr/0004-layout-as-user-editable-python-module.md +20 -0
- application_pipeline-0.6.0/docs/adr/0005-parser-threads-as-pure-producers.md +19 -0
- application_pipeline-0.6.0/docs/adr/0006-seen-result-three-variant.md +16 -0
- application_pipeline-0.6.0/docs/adr/0007-orchestrator-queue-topology.md +20 -0
- application_pipeline-0.6.0/docs/adr/0010-failures-as-syncthing-files.md +25 -0
- application_pipeline-0.6.0/docs/adr/0011-user-settings-and-paths-anchored-to-data-dir.md +40 -0
- application_pipeline-0.6.0/docs/adr/0012-parser-location-resolution-policy.md +27 -0
- application_pipeline-0.6.0/docs/adr/0013-external-redirect-skip-or-log.md +35 -0
- application_pipeline-0.6.0/docs/adr/0014-classify-batching-and-worker-model.md +35 -0
- application_pipeline-0.6.0/docs/adr/0015-claude-cli-pinned-model-and-tag-wrapped-output.md +38 -0
- application_pipeline-0.6.0/docs/adr/0016-hardcoded-prompts-with-externalized-user-info-single-language.md +35 -0
- application_pipeline-0.6.0/docs/adr/0017-no-dedup-discover-early-stop.md +17 -0
- application_pipeline-0.6.0/docs/adr/0018-log-convention-components-by-layer.md +40 -0
- application_pipeline-0.6.0/docs/adr/0019-prefilter-pure-title-only-blacklist.md +28 -0
- application_pipeline-0.6.0/docs/adr/0020-match-tier-retired-judge-picks-top-n.md +28 -0
- application_pipeline-0.6.0/docs/adr/0021-daily-results-file-replaces-trio.md +29 -0
- application_pipeline-0.6.0/docs/adr/0022-structured-extracts-as-pool-members.md +36 -0
- application_pipeline-0.6.0/docs/adr/0023-quota-sleep-and-retry.md +35 -0
- application_pipeline-0.6.0/docs/adr/0024-cron-once-per-day-no-migration.md +28 -0
- application_pipeline-0.6.0/docs/adr/0025-freshness-gate-drops-stale-listings.md +32 -0
- application_pipeline-0.6.0/docs/adr/0026-triage-profile-reused-as-v2-authoring-context.md +29 -0
- application_pipeline-0.6.0/docs/adr/0027-distribution-via-pypi-and-cron-upgrade.md +30 -0
- application_pipeline-0.6.0/docs/agents/domain.md +51 -0
- application_pipeline-0.6.0/docs/agents/issue-tracker.md +22 -0
- application_pipeline-0.6.0/docs/agents/triage-labels.md +15 -0
- application_pipeline-0.6.0/docs/cron-setup.md +118 -0
- application_pipeline-0.6.0/docs/miktex-setup.md +48 -0
- application_pipeline-0.6.0/docs/usage.md +125 -0
- application_pipeline-0.6.0/pycastle/.gitignore +3 -0
- application_pipeline-0.6.0/pyproject.toml +52 -0
- application_pipeline-0.6.0/setup.cfg +4 -0
- application_pipeline-0.6.0/src/application_pipeline/__init__.py +59 -0
- application_pipeline-0.6.0/src/application_pipeline/__main__.py +104 -0
- application_pipeline-0.6.0/src/application_pipeline/_context.py +7 -0
- application_pipeline-0.6.0/src/application_pipeline/config/__init__.py +11 -0
- application_pipeline-0.6.0/src/application_pipeline/config/loader.py +207 -0
- application_pipeline-0.6.0/src/application_pipeline/config/types.py +68 -0
- application_pipeline-0.6.0/src/application_pipeline/dedup/__init__.py +14 -0
- application_pipeline-0.6.0/src/application_pipeline/dedup/errors.py +2 -0
- application_pipeline-0.6.0/src/application_pipeline/dedup/store.py +319 -0
- application_pipeline-0.6.0/src/application_pipeline/extracts/__init__.py +8 -0
- application_pipeline-0.6.0/src/application_pipeline/extracts/errors.py +2 -0
- application_pipeline-0.6.0/src/application_pipeline/extracts/store.py +116 -0
- application_pipeline-0.6.0/src/application_pipeline/failure_report.py +56 -0
- application_pipeline-0.6.0/src/application_pipeline/freshness_gate.py +133 -0
- application_pipeline-0.6.0/src/application_pipeline/http/__init__.py +15 -0
- application_pipeline-0.6.0/src/application_pipeline/http/errors.py +17 -0
- application_pipeline-0.6.0/src/application_pipeline/init_cmd.py +35 -0
- application_pipeline-0.6.0/src/application_pipeline/layout/__init__.py +12 -0
- application_pipeline-0.6.0/src/application_pipeline/layout/loader.py +149 -0
- application_pipeline-0.6.0/src/application_pipeline/layout/types.py +23 -0
- application_pipeline-0.6.0/src/application_pipeline/llm/__init__.py +45 -0
- application_pipeline-0.6.0/src/application_pipeline/llm/agent_output.py +55 -0
- application_pipeline-0.6.0/src/application_pipeline/llm/claude.py +445 -0
- application_pipeline-0.6.0/src/application_pipeline/llm/claude_cli.py +220 -0
- application_pipeline-0.6.0/src/application_pipeline/llm/quota.py +85 -0
- application_pipeline-0.6.0/src/application_pipeline/llm/types.py +116 -0
- application_pipeline-0.6.0/src/application_pipeline/orchestrator.py +958 -0
- application_pipeline-0.6.0/src/application_pipeline/parser_log.py +63 -0
- application_pipeline-0.6.0/src/application_pipeline/parsers/__init__.py +13 -0
- application_pipeline-0.6.0/src/application_pipeline/parsers/_text.py +46 -0
- application_pipeline-0.6.0/src/application_pipeline/parsers/bundesagentur_api.py +245 -0
- application_pipeline-0.6.0/src/application_pipeline/parsers/errors.py +2 -0
- application_pipeline-0.6.0/src/application_pipeline/parsers/http.py +169 -0
- application_pipeline-0.6.0/src/application_pipeline/parsers/jobs_beim_staat_html.py +323 -0
- application_pipeline-0.6.0/src/application_pipeline/parsers/location.py +112 -0
- application_pipeline-0.6.0/src/application_pipeline/parsers/protocol.py +24 -0
- application_pipeline-0.6.0/src/application_pipeline/parsers/registry.py +23 -0
- application_pipeline-0.6.0/src/application_pipeline/parsers/stellen_hamburg_api.py +235 -0
- application_pipeline-0.6.0/src/application_pipeline/parsers/types.py +74 -0
- application_pipeline-0.6.0/src/application_pipeline/prefilter_gate.py +121 -0
- application_pipeline-0.6.0/src/application_pipeline/prompts.py +140 -0
- application_pipeline-0.6.0/src/application_pipeline/renderer.py +82 -0
- application_pipeline-0.6.0/src/application_pipeline/results/__init__.py +8 -0
- application_pipeline-0.6.0/src/application_pipeline/results/errors.py +2 -0
- application_pipeline-0.6.0/src/application_pipeline/results/manager.py +25 -0
- application_pipeline-0.6.0/src/application_pipeline/run_metrics.py +671 -0
- application_pipeline-0.6.0/src/application_pipeline/status_display.py +152 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/__init__.py +0 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/config.py +28 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/latex/README.md +117 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/latex/cv_template.tex +118 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/latex/moderncv.cls +491 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/latex/moderncvcolorblue.sty +27 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/latex/moderncvstylecasual.sty +194 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/latex/tweaklist.sty +56 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/layout.py +41 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/prompts/__init__.py +0 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/prompts/classify_relevance.md +29 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/prompts/judge_match.md +44 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/prompts/judge_top_n.md +31 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/setup/cron-install.sh +16 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/setup/cron-uninstall.sh +21 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/setup/cron.sh +40 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/user-info/contact.tex +7 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/user-info/content_pool.tex +100 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/user-info/domain-fit.md +7 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/user-info/identity.tex +6 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/user-info/match-criteria.md +7 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/user-info/profile.png +0 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/user-info/self-description.md +6 -0
- application_pipeline-0.6.0/src/application_pipeline/templates/user-info/signature.png +0 -0
- application_pipeline-0.6.0/src/application_pipeline/text/__init__.py +3 -0
- application_pipeline-0.6.0/src/application_pipeline/text/normalize.py +6 -0
- application_pipeline-0.6.0/src/application_pipeline/user_settings.py +32 -0
- application_pipeline-0.6.0/src/application_pipeline.egg-info/PKG-INFO +13 -0
- application_pipeline-0.6.0/src/application_pipeline.egg-info/SOURCES.txt +169 -0
- application_pipeline-0.6.0/src/application_pipeline.egg-info/dependency_links.txt +1 -0
- application_pipeline-0.6.0/src/application_pipeline.egg-info/entry_points.txt +2 -0
- application_pipeline-0.6.0/src/application_pipeline.egg-info/requires.txt +10 -0
- application_pipeline-0.6.0/src/application_pipeline.egg-info/top_level.txt +1 -0
- application_pipeline-0.6.0/tests/__init__.py +0 -0
- application_pipeline-0.6.0/tests/conftest.py +10 -0
- application_pipeline-0.6.0/tests/fake_status_display.py +51 -0
- application_pipeline-0.6.0/tests/parsers/__init__.py +0 -0
- application_pipeline-0.6.0/tests/parsers/fixtures/bundesagentur/detail.json +12 -0
- application_pipeline-0.6.0/tests/parsers/fixtures/bundesagentur/search.json +27 -0
- application_pipeline-0.6.0/tests/parsers/fixtures/jobs_beim_staat/detail.html +32 -0
- application_pipeline-0.6.0/tests/parsers/fixtures/jobs_beim_staat/iframe_target.html +22 -0
- application_pipeline-0.6.0/tests/parsers/fixtures/jobs_beim_staat/list.html +336 -0
- application_pipeline-0.6.0/tests/parsers/fixtures/jobs_beim_staat/wrapper.html +11 -0
- application_pipeline-0.6.0/tests/parsers/fixtures/stellen_hamburg/detail.html +21 -0
- application_pipeline-0.6.0/tests/parsers/fixtures/stellen_hamburg/detail_list_wrapped.html +28 -0
- application_pipeline-0.6.0/tests/parsers/fixtures/stellen_hamburg/detail_no_job_posting.html +24 -0
- application_pipeline-0.6.0/tests/parsers/fixtures/stellen_hamburg/search.json +25 -0
- application_pipeline-0.6.0/tests/parsers/smoke/__init__.py +0 -0
- application_pipeline-0.6.0/tests/parsers/smoke/test_bundesagentur_api_smoke.py +55 -0
- application_pipeline-0.6.0/tests/parsers/smoke/test_jobs_beim_staat_html_smoke.py +39 -0
- application_pipeline-0.6.0/tests/parsers/smoke/test_stellen_hamburg_api_smoke.py +28 -0
- application_pipeline-0.6.0/tests/parsers/test_bundesagentur_api.py +930 -0
- application_pipeline-0.6.0/tests/parsers/test_jobs_beim_staat_html.py +856 -0
- application_pipeline-0.6.0/tests/parsers/test_stellen_hamburg_api.py +553 -0
- application_pipeline-0.6.0/tests/test_agent_output.py +103 -0
- application_pipeline-0.6.0/tests/test_claude_cli_invoker.py +379 -0
- application_pipeline-0.6.0/tests/test_claude_extractor.py +915 -0
- application_pipeline-0.6.0/tests/test_cli_dispatch.py +65 -0
- application_pipeline-0.6.0/tests/test_component_log.py +244 -0
- application_pipeline-0.6.0/tests/test_config_loader.py +783 -0
- application_pipeline-0.6.0/tests/test_dedup_store.py +941 -0
- application_pipeline-0.6.0/tests/test_extract_store.py +119 -0
- application_pipeline-0.6.0/tests/test_failure_report.py +103 -0
- application_pipeline-0.6.0/tests/test_freshness_gate.py +373 -0
- application_pipeline-0.6.0/tests/test_init_cmd.py +449 -0
- application_pipeline-0.6.0/tests/test_layout_loader.py +331 -0
- application_pipeline-0.6.0/tests/test_llm_extractor.py +201 -0
- application_pipeline-0.6.0/tests/test_log_artifacts.py +182 -0
- application_pipeline-0.6.0/tests/test_main_dotenv.py +50 -0
- application_pipeline-0.6.0/tests/test_main_failure.py +59 -0
- application_pipeline-0.6.0/tests/test_normalize.py +30 -0
- application_pipeline-0.6.0/tests/test_orchestrator.py +4810 -0
- application_pipeline-0.6.0/tests/test_parser_http_class.py +434 -0
- application_pipeline-0.6.0/tests/test_parser_log.py +246 -0
- application_pipeline-0.6.0/tests/test_parsers.py +330 -0
- application_pipeline-0.6.0/tests/test_prefilter_gate.py +308 -0
- application_pipeline-0.6.0/tests/test_prompt_loader.py +242 -0
- application_pipeline-0.6.0/tests/test_renderer.py +501 -0
- application_pipeline-0.6.0/tests/test_results_manager.py +101 -0
- application_pipeline-0.6.0/tests/test_run_metrics.py +989 -0
- application_pipeline-0.6.0/tests/test_status_display.py +292 -0
- application_pipeline-0.6.0/tests/test_stellen_hamburg_parser.py +518 -0
- application_pipeline-0.6.0/tests/test_user_settings.py +83 -0
@@ -0,0 +1,103 @@
+name: Publish
+
+on:
+  push:
+    branches: [main]
+    tags: ["v*.*.*"]
+  pull_request:
+    branches: [main]
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+
+      - name: Install dependencies
+        run: pip install -e '.[dev]'
+
+      - name: Lint
+        run: ruff check .
+
+      - name: Test
+        run: pytest -m "not smoke"
+
+  build:
+    if: github.event_name == 'push'
+    runs-on: ubuntu-latest
+    outputs:
+      version: ${{ steps.version.outputs.version }}
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+
+      - name: Install build tools
+        run: pip install build setuptools-scm
+
+      - name: Compute version
+        id: version
+        run: |
+          VERSION=$(python -m setuptools_scm)
+          echo "SETUPTOOLS_SCM_PRETEND_VERSION_FOR_APPLICATION_PIPELINE=$VERSION" >> "$GITHUB_ENV"
+          echo "version=$VERSION" >> "$GITHUB_OUTPUT"
+
+      - name: Reject dev version on tag
+        if: startsWith(github.ref, 'refs/tags/')
+        env:
+          VERSION: ${{ steps.version.outputs.version }}
+        run: |
+          if [[ "$VERSION" == *.dev* ]]; then
+            echo "Error: tag build produced a dev version: $VERSION"
+            exit 1
+          fi
+
+      - name: Build
+        run: python -m build
+
+      - uses: actions/upload-artifact@v4
+        with:
+          name: dist
+          path: dist/
+
+  publish-testpypi:
+    needs: [test, build]
+    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
+    runs-on: ubuntu-latest
+    environment: testpypi
+    permissions:
+      id-token: write
+    steps:
+      - uses: actions/download-artifact@v4
+        with:
+          name: dist
+          path: dist/
+
+      - uses: pypa/gh-action-pypi-publish@release/v1
+        with:
+          repository-url: https://test.pypi.org/legacy/
+
+  publish-pypi:
+    needs: [test, build]
+    if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v')
+    runs-on: ubuntu-latest
+    environment: pypi
+    permissions:
+      id-token: write
+    steps:
+      - uses: actions/download-artifact@v4
+        with:
+          name: dist
+          path: dist/
+
+      - uses: pypa/gh-action-pypi-publish@release/v1
@@ -0,0 +1,215 @@
+*.idea
+*.claude
+pycastle/
+synched/
+
+# Deduplication store — durability via Syncthing per ADR-0010
+.seen.json
+
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[codz]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py.cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+# For a library or package, you might want to ignore these files since the code is
+# intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+# However, in case of collaboration, if having platform-specific dependencies or dependencies
+# having no cross-platform support, pipenv may install dependencies that don't work, or not
+# install all needed dependencies.
+#Pipfile.lock
+
+# UV
+# Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
+# This is especially recommended for binary packages to ensure reproducibility, and is more
+# commonly ignored for libraries.
+#uv.lock
+
+# poetry
+# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+# This is especially recommended for binary packages to ensure reproducibility, and is more
+# commonly ignored for libraries.
+# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+#poetry.toml
+
+# pdm
+# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+# pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python.
+# https://pdm-project.org/en/latest/usage/project/#working-with-version-control
+#pdm.lock
+#pdm.toml
+.pdm-python
+.pdm-build/
+
+# pixi
+# Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control.
+#pixi.lock
+# Pixi creates a virtual environment in the .pixi directory, just like venv module creates one
+# in the .venv directory. It is recommended not to include this directory in version control.
+.pixi
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.envrc
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# PyCharm
+# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+# and can be added to the global gitignore or merged into this file. For a more nuclear
+# option (not recommended) you can uncomment the following to ignore the entire idea folder.
+#.idea/
+
+# Abstra
+# Abstra is an AI-powered process automation framework.
+# Ignore directories containing user credentials, local state, and settings.
+# Learn more at https://abstra.io/docs
+.abstra/
+
+# Visual Studio Code
+# Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore
+# that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
+# and can be added to the global gitignore or merged into this file. However, if you prefer,
+# you could uncomment the following to ignore the entire vscode folder
+# .vscode/
+
+# Ruff stuff:
+.ruff_cache/
+
+# PyPI configuration file
+.pypirc
+
+# Cursor
+# Cursor is an AI-powered code editor. `.cursorignore` specifies files/directories to
+# exclude from AI features like autocomplete and code analysis. Recommended for sensitive data
+# refer to https://docs.cursor.com/context/ignore-files
+.cursorignore
+.cursorindexingignore
+
+# Marimo
+marimo/_static/
+marimo/_lsp/
+__marimo__/
@@ -0,0 +1 @@
+3.11.3
@@ -0,0 +1,13 @@
+## Agent skills
+
+### Issue tracker
+
+Issues are tracked in GitHub Issues on `Johannes-Kutsch/application-pipeline` via the `gh` CLI. See `docs/agents/issue-tracker.md`.
+
+### Triage labels
+
+Uses the five canonical triage label names as-is (`needs-triage`, `needs-info`, `ready-for-agent`, `ready-for-human`, `wontfix`). See `docs/agents/triage-labels.md`.
+
+### Domain docs
+
+Single-context: `CONTEXT.md` and `docs/adr/` at the repo root. See `docs/agents/domain.md`.
@@ -0,0 +1,132 @@
+# Application Pipeline
+
+Personal job-discovery and triage pipeline. Fetches listings from a small set of sources, classifies relevance with Claude, accumulates an in-domain **Pool** across days, emits one dated **Daily Results File** per day carrying the **Daily Top-5** **Cards** ranked by the **Match Judge**. Application authoring (CV/cover letter) is out of scope for v1.
+
+## Scope
+
+- **In scope (v1):** Working **Parsers** for **Bundesagentur**, **stellen.hamburg**, **jobs-beim-staat**, smoke-tested standalone on the laptop.
+- **In scope (v1.1):** Full pipeline on Pi — orchestrator, **Relevance Classifier** (producing **Structured Extracts**), **Match Judge** (one call per run, picks **Daily Top-5** from the **Pool**), **Deduplication**, daily file, cron once per day, Syncthing sync.
+- **Out of scope:** **CV** / **Cover Letter** generation, LaTeX pipeline, **Profile** ingestion, additional commercial parsers, browser automation.
+
+## Language
+
+### Pipeline artifacts
+
+**Config**: A `config.py` at the root of the settings directory the user picks at `init` time (conventionally `~/application-pipeline/`) controlling the search — `KEYWORDS`, `SKILLS`, `NEGATIVE_KEYWORDS`, `SOURCES`, `LOCATIONS`, `INCLUDE_REMOTE`, `USER_INFO_DIR`, layout-path override, Claude settings (`claude_classify_batch_size: int` default 100, optional `claude_cli_path`), and `MAX_LISTING_AGE_DAYS: int` (default 180, `≥ 1`) driving the **Freshness Gate**'s `posted_date` arm. Plain Python literals; `#` comments. Materialised on first bootstrap from `src/application_pipeline/templates/config.py` via `application-pipeline init <dir>` (ADR-0011); the user edits in place from then on. Loaded by **Config Loader**, returns a frozen typed `Config`. `KEYWORDS` applies to every **Source** (Cartesian product over keyword × source); **Deduplication** absorbs overlap. Each `SourceEntry` carries its own `max_results` (default 1000). `NEGATIVE_KEYWORDS` entries validate to length ≥ 3. Cron schedule lives in crontab, not Config (ADR-0024, once daily at 00:30 local). _Avoid_: config file, settings file, search config.
+
+**Layout**: A `layout.py` alongside **Config**, owning user-tunable cosmetic/structural choices for the **Daily Results File**. Seeded from `src/application_pipeline/templates/layout.py` by `init`. Named placeholder groups + a single `CARD_TEMPLATE` (no per-tier dispatch — **Match Tier** retired per ADR-0020). Placeholder set excludes `emoji`/`color`/`tier`, includes `rank` (1..5). Loaded by `load_user_module`, validated into a frozen `Layout`. The **Renderer** substitutes placeholders via `str.format_map`. `LayoutError` subclasses `UserSettingsError` (shared with `ConfigError`). _Avoid_: style file, theme file, template file.
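
The placeholder substitution described in the **Layout** entry can be sketched as follows. This is a minimal illustration, not the package's renderer: the `CARD_TEMPLATE` body and the `render_card` helper are hypothetical, and only the H1 shape, the `rank` placeholder, and the use of `str.format_map` are taken from `CONTEXT.md`.

```python
# Hypothetical template: only the H1 shape and the {rank} placeholder
# come from CONTEXT.md; the "Rank:" line is invented for illustration.
CARD_TEMPLATE = (
    "# {company} · {title} · {location_segment}\n"
    "\n"
    "Rank: {rank}\n"
)


def render_card(template: str, values: dict[str, object]) -> str:
    """Substitute named placeholders; a missing key raises KeyError."""
    return template.format_map(values)


card = render_card(
    CARD_TEMPLATE,
    {
        "company": "Example GmbH",
        "title": "Data Engineer",
        "location_segment": "Hamburg",
        "rank": 1,
    },
)
```

Because `format_map` raises `KeyError` on an unknown placeholder, a loader could plausibly validate a user-edited template by rendering it once against the full placeholder set.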
+
+**Daily Results File**: One dated markdown file per calendar day at `<settings-dir>/results/YYYY-MM-DD.md`, holding the **Daily Top-5** as **Cards** in **Rank** order. Date is **cron-anchored** (ADR-0021/0023) — a run started day X writes to day X's file even if it crosses midnight sleeping through quota. No `FILE_HEADER`, no preamble, no **Run Divider**. If today's pool had fewer than 5, the file carries however many cards; if pool was empty, no file is written. Write-once on the host; if a sync channel is configured (e.g. Syncthing), the file propagates outward read-only. _Avoid_: results file (without "daily"), `current.md` (gone), dated results file.
+
+**Failure Report**: A markdown file at `<settings-dir>/failures/<timestamp>.md` written on a failed run — `cron.sh` pip-upgrade errors (ADR-0027), orchestrator runtime errors, **Match Judge** failure (no daily file), and non-quota classifier errors that left the run with no writeable output. Per ADR-0010. Acknowledged by deleting the file. Quota errors do NOT trigger a Failure Report — they sleep until reset (ADR-0023). _Avoid_: incident report, error log.
+
+**Position**: A single job listing surviving the **Relevance Classifier** and the **Match Judge**'s top-5 selection. Identified within its **Daily Results File** by its rendered H1 (`# {company} · {title} · {location_segment}`); the explicit `{rank}` placeholder carries ordering. _Avoid_: job, listing, vacancy.
+
+**Position Schema**: The frozen-dataclass shape every **Parser** must return from `enrich()` — `Position(stub: PositionStub, raw_description: str, ...)` with optional `salary: str | None` (free string — German pay grades like `"E13 TV-L"` preserved verbatim), `contract_type: Literal["permanent","fixed-term","freelance"] | None`, `employment_type: Literal["full-time","part-time","internship"] | None`, `work_model: Literal["remote","hybrid","on-site"] | None`, `posted_date: date | None`, `deadline: date | None`. Closed-enum fields are `None` when the source exposes no signal — parsers never guess. `Position` *contains* its `PositionStub` (composition); fields reach consumers as `position.stub.<field>` (required: `url, title, source`; `company, location` are `str | None`). Both dataclasses live in `parsers/types.py` (parsers don't import from consumers). Both `frozen=True`, never mutated. Not persisted across runs — reconstructed by `Parser.enrich(stub)` each day for Pool items. The **Match Judge** later attaches a **Match Verdict** to the 5 winners. _Avoid_: output format, dict structure.
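
The two dataclasses named in the **Position Schema** entry might look roughly like the sketch below. It is not the contents of `parsers/types.py`: the field order, the `None` defaults, and the example values are assumptions, and any fields hidden behind the `...` in the entry are left out.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass
from datetime import date
from typing import Literal


@dataclass(frozen=True)
class PositionStub:
    # Required fields per the entry: url, title, source.
    url: str
    title: str
    source: str
    # Optional fields; defaulting them to None is an assumption.
    company: str | None = None
    location: str | None = None


@dataclass(frozen=True)
class Position:
    stub: PositionStub  # composition: consumers read position.stub.<field>
    raw_description: str
    salary: str | None = None  # free string, e.g. "E13 TV-L"
    contract_type: Literal["permanent", "fixed-term", "freelance"] | None = None
    employment_type: Literal["full-time", "part-time", "internship"] | None = None
    work_model: Literal["remote", "hybrid", "on-site"] | None = None
    posted_date: date | None = None
    deadline: date | None = None


# Illustrative values only.
stub = PositionStub(url="https://example.org/job/1", title="Data Engineer", source="example")
position = Position(stub=stub, raw_description="", salary="E13 TV-L")
```

Leaving `work_model` and the other closed-enum fields at `None` matches the stated rule that parsers never guess when the source exposes no signal.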
+
+**Raw Description**: The full body text of a **Position**, normalized to plain Unicode with `\n\n` paragraph breaks (no HTML, Markdown, or entities). Each parser owns normalization for its source. Empty after normalization is allowed (`raw_description = ""`). Input to the **Relevance Classifier** (where the **Structured Extract** is produced) and rendered verbatim into the **Card**'s `## Job Description`. The **Match Judge** never sees raw_description — it ranks on **Structured Extracts** only (ADR-0022). _Avoid_: description (when full text is meant); not to be confused with the **Match Verdict**'s `summary` or the **Structured Extract**.
+
+**Structured Extract**: A fixed-field representation of an in-domain **Position**, produced by the **Relevance Classifier** as side output alongside `in_domain: true`, consumed by the **Match Judge** as its sole per-candidate input. Carries `seniority`, `work_model`, `contract_type`, `key_skills: list[str]` (≤10), `key_responsibilities: list[str]` (≤10), `must_have_requirements: list[str]` (≤10), and free-text `notable_caveats: str` (≤200 chars) — the escape hatch for tone/negation ("kein Homeoffice"). Per ADR-0022. Persisted in `extracts.json` keyed by a **stable cross-URL identifier** (canonical-url or synthetic id, shared across tuple-aliased records per ADR-0003) so the same role's extract is paid once even when re-discovered under a syndicated URL. Classify worker writes eagerly (same fsync as `mark_in_domain`). Deleted on transition to `selected_by_judge` or `out_of_domain`. _Avoid_: summary (overloaded with **Match Verdict**'s `summary` — the classifier produces the extract, the judge produces the summary; these are different artifacts), classification (the verdict is the bool, the extract is its side output).
+
+### Filtering & scoring
+
+**Triage Profile**: The applicant's self-description plus rules deciding in-domain / good-match. Lives in `<settings-dir>/user-info/` as four markdown files — `self-description.md` (fed into both LLM call sites), `domain-fit.md` (classifier), `match-criteria.md` (judge), and `writing-style.md` (PRD-#16-v2.1 authoring skills only, NOT injected into v1 prompts). Per ADR-0016, the **Prompt Loader** concatenates the three v1-consumed files per call site; the **LLM Extractor** injects the payload into the hardcoded prompt's `{USER_INFO}` slot wrapped in `<user-info>` tags. Bullets/keywords, German, "extremely concise, sacrifice grammar for concision" (ADR-0026). Reused as v2 authoring context — no separate CV Profile. _Avoid_: profile (without qualifier), bio, CV Profile (retired before being built — do not reintroduce).
+
+**Skill**: A hard-skill or technology item from the **Config**'s `SKILLS`. Rendered as a bullet list into the **Match Judge** prompt's `{skills}` slot at `ClaudeExtractor` construction. Per ADR-0019, `SKILLS` is **only** consumed by the judge prompt — the **Domain Pre-Filter** no longer reads it. The LLM's `matched`/`missing` lists are open-vocabulary. _Avoid_: keyword (when matching).
+
+**Keyword**: A search term used to query a **Source**, from `Config.KEYWORDS`. Distinct from a **Skill** (does not affect judgment) and from `NEGATIVE_KEYWORDS` (which drives the **Domain Pre-Filter**). _Avoid_: skill (when querying), negative keyword (when querying).
+
+**Negative Keyword**: An entry in `Config.NEGATIVE_KEYWORDS`. Patterns (length ≥ 3) that, if found via case-insensitive substring match in a **Position**'s **title** (body text not consulted, per ADR-0019), cause the **Domain Pre-Filter** to drop the listing. No rescue mechanism. _Avoid_: blacklist, exclusion.
|
|
40
|
+
|
|
41
|
+
**Match Verdict**: The structured output of the **Match Judge** for each of the 5 winners — `{rank: 1..5, matched: list[str], missing: list[str], summary: "2-3 sentences"}`. Open-vocabulary lists (validation is type/length only). `rank` is the explicit position in the **Daily Top-5** (1 = best). Drives **Card** content; **Match Tier** is retired (ADR-0020). _Avoid_: score, rating, tier (retired).
|
|
42
|
+
|
|
43
|
+
**Rank**: The `rank` field — integer 1..5 assigned by the judge when picking the **Daily Top-5**. Surfaced in the **Card** via `{rank}` placeholder. Not a score (no cross-day comparability); not a tier (no enum). _Avoid_: tier (retired), score, position (overloaded).
|
|
44
|
+
|
|
45
|
+
**Pool**: The implicit set of in-domain **Positions** eligible for today's **Match Judge** call — every URL in `.seen.json` with `status == in_domain` re-discovered by a parser in the current run. No separate data structure; computed per-run as `{url ∈ enriched_today : .seen.json[url].status == in_domain}`. Enter: classified `in_domain` (status flips, extract written). Exit: judge picks it (status → `selected_by_judge`, extract deleted) or no longer re-discovered (record sticks but never re-enters). No pool size cap, no TTL eviction in v1 (ADR-0022). _Avoid_: queue, candidate set, judge list — and not to be confused with **Daily Top-5** (the pool is the input set; the top-5 is the selected output).
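The set expression in this entry translates almost directly into a comprehension. The function name `compute_pool` and the plain-dict shape of the loaded `.seen.json` are assumptions for illustration only.

```python
def compute_pool(enriched_today: set[str], seen: dict[str, dict]) -> set[str]:
    """Return the URLs re-discovered today whose stored status is in_domain.

    `seen` stands in for the loaded .seen.json mapping of url -> record.
    """
    return {
        url
        for url in enriched_today
        if seen.get(url, {}).get("status") == "in_domain"
    }
```

Because the result is recomputed from today's parse output on every run, a record that is no longer re-discovered simply never shows up in the set, which matches the "record sticks but never re-enters" exit rule.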
|
|
46
|
+
|
|
47
|
+
**Daily Top-5**: The 5 **Positions** the **Match Judge** returns from its single end-of-run call, drawn from today's **Pool**. ≤5 if pool was smaller. Surfaced as **Cards** in **Rank** order. Each winner gets `mark_selected_by_judge(stub)` after its **Card** is appended+fsynced, removing it from the **Pool**. _Avoid_: top-N, shortlist.
|
|
48
|
+
|
|
49
|
+
**Relevance Classifier**: The call site that decides whether a **Position** is in the applicant's professional domain (AI / Data / Game Dev / SWE) and, when in-domain, produces its **Structured Extract**. Implemented as `LLMExtractor.classify_relevance_batch` — no wrapper class. Per-batch response: `[{"id": "...", "in_domain": true, "extract": {...}}, ...]` for in-domain, `[{"id": "...", "in_domain": false}, ...]` otherwise. Batched (default 100/call, ADR-0014); may stay pipelined with parsers or run serially. On `ClaudeUsageLimitError` the orchestrator sleeps until reset+2min (ADR-0023) and retries. Malformed batch (bad JSON, length/id mismatch) fails the whole batch — none marked seen; failure to `llm_classify_relevance.events.jsonl` and the full prompt+response to `llm_classify_relevance.transcripts.jsonl`. Per ADR-0018, every component identifier carries a layer prefix (`parser_`, `llm_`, `pipeline_`). Runs only on Positions that survive the **Domain Pre-Filter** and **Freshness Gate** with status `not_classified`. Off-domain Positions get `mark_out_of_domain(stub)` immediately. _Avoid_: filter (ambiguous — see **Domain Pre-Filter**), gate.
|
|
50
|
+
|
|
51
|
+
**Freshness Gate**: A gate after `Parser.enrich()` and before the **Relevance Classifier**, dropping temporally invalid **Positions** (ADR-0025). Owns its full operational protocol — verdict, per-Position transcript, per-reason counter, drop cleanup, per-run aggregate — behind `admit(position) -> bool` / `emit_run_complete()`. Drop conditions: `posted_date is not None and (anchored_today - posted_date).days > MAX_LISTING_AGE_DAYS`, **or** `deadline is not None and deadline < anchored_today`. `None` on a field is "no signal, don't drop on that field alone"; both `None` → pass. `anchored_today` is the cron-anchored logical date (ADR-0021), shared with the **Daily Results File**. Drops write status `expired`; on `in_domain → expired` the URL's `extracts.json` entry is deleted. Runs on **every** enrich including Pool re-discovery, so pool items naturally age out. Log component `pipeline_freshness`. Future-dated `posted_date` (negative age) passes silently. _Avoid_: staleness filter, expiry gate, date filter.
|
|
52
|
+
|
|
53
|
+
**Domain Pre-Filter**: A gate before the **Relevance Classifier**, dropping **Positions** whose **title** matches any **Negative Keyword** (title-only, blacklist-only, no whitelist — ADR-0019). Substring match, case-insensitive after `normalize()`. Owns its full operational protocol behind `admit(position) -> bool` / `emit_run_complete()`. Drops write status `out_of_domain` (same status the classifier writes for LLM-rejected items, per ADR-0020 renaming). Enumerates all matching keywords internally so it records per-position decisions to `pipeline_prefilter.transcripts.jsonl` and emits a per-keyword end-of-run summary to `pipeline_prefilter.events.jsonl`. Transcript record: `url, title, source, passes, reason ∈ {passed, blacklist_drop}, blacklist_matches, title_len`. Aggregate counts *positions matched* (not raw occurrences) and surfaces a `NEGATIVE_KEYWORDS_dead` list of zero-match keywords. Assumes input has been keyword-filtered upstream (parser invariant). _Avoid_: filter (ambiguous — see **Relevance Classifier**), gate, classifier.
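A minimal sketch of the title match and the end-of-run aggregate, under stated assumptions: the real `normalize()` is not shown in this document, so the version here (casefold plus whitespace collapse) is a stand-in, and `blacklist_matches`, `admits_title`, and `summarize` are hypothetical helper names.

```python
from collections import Counter


def normalize(text: str) -> str:
    # Stand-in for the package's normalize(); its real behaviour is assumed.
    return " ".join(text.casefold().split())


def blacklist_matches(title: str, negative_keywords: list[str]) -> list[str]:
    """Enumerate every negative keyword (length >= 3) found in the title."""
    haystack = normalize(title)
    return [
        kw for kw in negative_keywords
        if len(kw) >= 3 and normalize(kw) in haystack
    ]


def admits_title(title: str, negative_keywords: list[str]) -> bool:
    return not blacklist_matches(title, negative_keywords)


def summarize(
    per_position_matches: list[list[str]], negative_keywords: list[str]
) -> tuple[Counter[str], list[str]]:
    """Count positions matched per keyword and list keywords that never hit."""
    counts: Counter[str] = Counter()
    for matches in per_position_matches:
        counts.update(set(matches))  # one count per position, not per occurrence
    dead = [kw for kw in negative_keywords if counts[kw] == 0]
    return counts, dead
```

Note that plain substring matching means a keyword such as `sales` also hits a title like "Salesforce Developer", which is worth keeping in mind when choosing short patterns.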
|
|
54
|
+
|
|
55
|
+
**Match Judge**: The call site that picks the **Daily Top-5** from today's **Pool**. Implemented as a single `LLMExtractor.judge_top_n(candidates)` per run — no wrapper class, no per-item call. Per ADR-0020, takes `list[JudgeCandidate]` (stable id + **Structured Extract**) and returns up to 5 **Match Verdicts** (`{id, rank, matched, missing, summary}`). The judge sees only structured extracts — no `raw_description` in the judge prompt. **Skills** and the **Triage Profile**'s match-criteria reach the judge via `{skills}` and `{USER_INFO}` slots. On `ClaudeUsageLimitError` orchestrator sleeps and retries (ADR-0023). On any non-quota error the run fails without writing a daily file, and a **Failure Report** is emitted. Runs only on days the **Pool** is non-empty.
|
|
56
|
+
|
|
57
|
+
### Deduplication and run state
|
|
58
|
+
|
|
59
|
+
**Deduplication**: Skipping **Positions** seen in a previous run or earlier in the current run. Three-tier: an in-run ephemeral URL set (`RunScopedDedup`, obtained from the **Deduplication Store** via `run_scope()` context manager, sits in front of the persistent store to absorb the Cartesian overlap without paying duplicate `enrich()` cost) plus the persistent **Deduplication Store** with two tiers: exact-match on URL, plus exact-match on `(company_lc, title_lc, location_lc)`. The tuple tier fires only when **all three fields are non-`None`** on both sides. When the tuple matches under a new URL (syndicated copy), the store records an alias under the new URL — copying `status` and `first_seen` — so subsequent runs hit the cheap URL tier (ADR-0003). `first_seen` answers "when did this *role* first appear", not "when did this URL first appear". `is_seen` returns a `SeenResult`: `url_hit`/`tuple_hit` skip; `in_domain` enriches and routes directly into the **Pool** for today's judge call (no classify pass); `miss` processes from scratch. Alias write is performed inside `is_seen`.
|
|
60
|
+
|
|
61
|
+
**Dedup status enum** (per ADR-0020 / ADR-0022, supersedes the prior set):
|
|
62
|
+
- `not_classified` — first-contact write; eligible for the classifier next re-discovery.
|
|
63
|
+
- `out_of_domain` — written by **Domain Pre-Filter** (title hit) and **Relevance Classifier** (LLM `in_domain: false`). Terminal-skip.
|
|
64
|
+
- `in_domain` — written by classifier on `in_domain: true`, alongside the **Structured Extract** in `extracts.json`. Means "in the **Pool**"; re-discovery routes directly into today's judge candidates.
|
|
65
|
+
- `selected_by_judge` — written after judge picks the item and the **Card** is appended+fsynced. Terminal-skip; extract deleted.
|
|
66
|
+
- `expired` — written by **Freshness Gate** when `posted_date` exceeds `MAX_LISTING_AGE_DAYS` or `deadline` < anchored date (ADR-0025). Terminal-skip. May transition from `not_classified` or `in_domain` (the latter also deletes the extract).
|
|
67
|
+
- `enrich_failed` — parser's `enrich()` raised `ParserError` (incl. per-URL 4xx wrapped by HTTP layer). Terminal-skip.
|
|
68
|
+
- `external_redirect` — parser emitted `ExternalRedirect` (ADR-0013). Terminal-skip.
|
|
69
|
+
|
|
70
|
+
**Error semantics for `mark_*`** (orchestrator behavior, single-writer Pi):
|
|
71
|
+
- **Pre-filter drop** → `mark_out_of_domain(stub)` immediately. No LLM cost.
|
|
72
|
+
- **Classifier non-quota error**: batch fails; items NOT marked; orchestrator continues with other batches and still calls the judge on what classified successfully (ADR-0021). Next run retries.
|
|
73
|
+
- **Classifier `ClaudeUsageLimitError`** (ADR-0023): parse reset time, sleep until reset+2min (or fallback `next_top_of_hour + 2min`), retry the batch. No `degraded_reason`; no Failure Report. Cron-anchored day (ADR-0021) handles midnight crossings.
|
|
74
|
+
- **Judge non-quota error**: no daily file; **Failure Report** emitted; **Pool** intact (no `selected_by_judge` transitions); tomorrow sees today's pool plus tomorrow's arrivals.
|
|
75
|
+
- **Judge `ClaudeUsageLimitError`**: sleep and retry.
|
|
76
|
+
- **Per-card append succeeds, then judge processing fails mid-loop**: each card is appended+fsynced+`mark_selected_by_judge` atomically per winner. A failure between cards leaves earlier winners marked, later winners unmarked (back in pool for tomorrow) — bounded.
|
|
77
|
+
- **`enrich()` raises `ParserError`** → `mark_enrich_failed(stub)`. Terminal-skip. Includes per-URL 4xx from `ParserHttp.get()` (404/400/422 wrapped by HTTP layer; auth/5xx raise `HttpParserFatalError` and propagate to thread-bootstrap, becoming `PARSER_DEAD`).
|
|
78
|
+
|
|
79
|
+
State lives in `.seen.json` (synced via Syncthing alongside Daily Results Files, ADR-0002), shaped `{url: {company_lc, title_lc, location_lc, status, first_seen}}` (any of the lowercased fields may be `None`); fuzzy index built in-memory at load, not persisted. `DeduplicationStore` exposes seven narrow methods (one per status: `mark_out_of_domain`, `mark_in_domain` — takes the extract as kwarg, `mark_selected_by_judge`, `mark_expired`, `mark_enrich_failed`, `mark_external_redirect`, `mark_not_classified`). Pipeline is single-writer at the process level (ADR-0002); the lock inside `DeduplicationStore` (ADR-0014) covers concurrent classify/judge worker writes. Parser threads remain pure producers. **Migration**: ADR-0024 wipes the previous `.seen.json` and the old trio; no automatic translation. _Avoid_: duplicate filtering, URL filtering.
|
|
80
|
+
|
|
81
|
+
### Display
|
|
82
|
+
|
|
83
|
+
**Card**: The expanded view rendered for **every** **Daily Top-5** winner. Structure: H1 `# {company} · {title} · {location_segment}` (`location_segment` = literal `location` for on-site/location-only; `{location} (Hybrid)`/`(Remote)` when a non-on-site `work_model` is known; `Unknown Location` when `location` is `None`; `Unknown Location (Hybrid|Remote)` for `None`-location + non-on-site). Below H1: meta line `posted_date · contract_type · employment_type` (None segments omitted; whole line omitted if all None); `**Salary:** {salary}` (omitted on `None`); `## AI Assessment` opening with `**Rank {rank}/5**`, then summary, then `**Matched:**`/`**Missing:**` bulleted lists (each list+label omitted when empty); `## Job Description` (whole section omitted on empty `raw_description`); horizontal rule and bare-autolink URL footer. Headings are English.
|
|
84
|
+
|
|
85
|
+
**Renderer**: A pure module exposing `render(position, verdict, layout) -> str`. Uses the **Layout**'s `CARD_TEMPLATE`. Builds a placeholder dict + named groups + `rank`, substitutes via `str.format_map`, returns the block. No file I/O, deterministic. No tier-derived placeholders (ADR-0020). _Avoid_: formatter, presenter.
|
|
86
|
+
|
|
87
|
+
**Results File Manager**: The only module that reads or writes the **Daily Results File**. Function-shaped: orchestrator holds the day's `Path` (`<settings-dir>/results/{cron_anchored_date}.md`, ADR-0021) and dispatches. Surface: `ensure_initialized(path)` (`mkdir(parents=True)`; no header — files start empty), `append(path, rendered_block)` (verbatim write + `flush` + `os.fsync`, propagates `OSError`). `data_dir` is the parent of the loaded `config.py` (ADR-0011). _Avoid_: writer, output manager.
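A sketch of the two-function surface, assuming UTF-8 text files. Two details are guesses rather than documented behaviour: `exist_ok=True` on the directory creation, and creating the empty file eagerly in `ensure_initialized`.

```python
import os
from pathlib import Path


def ensure_initialized(path: Path) -> None:
    """Create the results directory and an empty daily file (no header)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch(exist_ok=True)  # assumption: the file is created up front


def append(path: Path, rendered_block: str) -> None:
    """Write the block verbatim and force it to disk; OSError propagates."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(rendered_block)
        fh.flush()             # push Python's buffer to the OS
        os.fsync(fh.fileno())  # ask the OS to flush to stable storage
```

Syncing after every append is what lets the orchestrator mark a winner as `selected_by_judge` only once its Card is durably on disk.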
|
|
88
|
+
|
|
89
|
+
### Observability
|
|
90
|
+
|
|
91
|
+
**Log Artifacts**: Files under `logs/`, laid out **by reader** (ADR-0018). Four types: `<comp>.events.jsonl` (one structured row per step; the run-end metrics row that used to live on the retired Run Divider now lands as `event=run_complete` here per ADR-0021); `lifecycle.jsonl` (single shared file carrying status-display `registered`/`phase_changed`/`removed` events); `run.log` (single shared file carrying tracebacks and `SUMMARY OF SESSION` blocks); `<comp>.transcripts.jsonl` (LLM components only — full prompt/response payloads). Every component identifier carries a layer prefix — `parser_`, `llm_`, `pipeline_`. For LLM code, **component is the call site** (`llm_classify_relevance`, `llm_judge_match`), not the implementation class. _Avoid_: log file (without qualifier), debug log.
|
|
92
|
+
|
|
93
|
+
**Run Log**: The per-run instance that writes the **Log Artifacts** to `logs/`. Constructed once at orchestrator entry from `cfg.logs_dir`; threaded as a kwarg into every component that emits log events (`RunMetrics`, `ClaudeExtractor`, the status-display core, the outbound dispatcher, every **Parser** via context-manager construction which forwards into its `ParserHttp`). Owns the directory path, file-naming convention, JSONL envelope shape, and `=== <component> <ts> <kind> ===` header convention in `run.log`. Filesystem errors propagate as `OSError`. Safe to share across threads — each method opens, writes, closes per call. _Avoid_: log writer, logger (overloaded with `logging.Logger`).
|
|
94
|
+
|
|
95
|
+
**Status Display**: A live, in-process view of pipeline progress. Ephemeral — exists only while `run()` executes. A `StatusDisplay` Protocol with two implementations: `RichStatusDisplay` (`rich.Live` table repainting at ~4 Hz, when `sys.stdout.isatty()`) and `PlainStatusDisplay` (line-by-line `print`, Pi cron). One row per stage: overall `pipeline` phase, transient `startup`, one **Agent Row** per configured **Parser** (`parser_<parser_type>`), and rows for `pipeline_dedup`, `pipeline_prefilter`, `pipeline_freshness`, `llm_classify_relevance`, `llm_judge_match`. Body strings formatted by **RunMetrics**. `register()`/`update_phase()` also emit to `lifecycle.jsonl` (ADR-0018). _Avoid_: progress bar (implies known total), TUI, dashboard.
|
|
96
|
+
|
|
97
|
+
### Sources & extraction
|
|
98
|
+
|
|
99
|
+
**Source**: A configured job board or API in `Config.SOURCES`, declared as a `SourceEntry` carrying a **Parser Type** and per-source `max_results`. A **Source** is a *config entry* (what to search); a **Parser** is the *module* (how). One-to-one in v1. _Avoid_: site, board.
|
|
100
|
+
|
|
101
|
+
**Parser**: A Python module in `src/application_pipeline/parsers/`. Two-phase API: `discover(query)` returns cheap **Position Stubs**; `enrich(stub)` fetches the detail page and returns a full **Position**. The orchestrator runs **Deduplication** between phases. Each parser is a context manager owning a single `httpx.Client`; no shared mutable module-level state (ADR-0005). The orchestrator owns Cartesian expansion of `KEYWORDS × LOCATIONS [+ remote]` and emits one `discover(ParserQuery(keyword, location, max_results))` per combination; `location` is sealed **Location** (`City(name) | Remote`). Each parser declares its **Location Coverage** as module-level symbols (ADR-0012). Parsers own all per-host pacing; the orchestrator never sleeps between queries. Parsers may emit `ExternalRedirect` from `enrich`. **Keyword-match invariant** (ADR-0019): every stub yielded from `discover()` is expected to match its `ParserQuery.keyword` in the title. _Avoid_: scraper, fetcher.
|
|
102
|
+
|
|
103
|
+
**Position Stub**: The cheap result of `Parser.discover()` — `url, title, company, location, source`. Required: `url, title, source`. Optional (`str | None`): `company, location`. _Avoid_: preview, summary.
|
|
104
|
+
|
|
105
|
+
**External Redirect**: A payload emitted by a **Parser**'s `enrich()` instead of a `Position` when the detail page carries an outbound URL **and no usable body**. Carries `PositionStub` + `outbound_url`. Orchestrator marks the stub `external_redirect` and bumps the `external_redirects` counter. A separate `external_redirect` event with `skipped=false` may also be emitted by a parser that detected an outbound URL but still produced a normal `Position` — for outbound-host analysis, does not affect dedup state or the skip counter (ADR-0013). _Avoid_: redirect stub, forwarded listing.
|
|
106
|
+
|
|
107
|
+
**Parser Type**: The string on a `SourceEntry` identifying which **Parser** to use — maps directly to a filename in `parsers/`. _Avoid_: adapter.
|
|
108
|
+
|
|
109
|
+
**Location**: Geographic slot the orchestrator passes via `ParserQuery.location`. Sealed sum type: `City(name: str)` for a named place from `Config.LOCATIONS`, `Remote` for the remote-only slot when `INCLUDE_REMOTE=True`. _Avoid_: place, geo.
|
|
110
|
+
|
|
111
|
+
**City**: The `City(name)` arm. `name` is the user-typed string, normalized via `normalize()` before lookup. No central catalog — resolvable city names are the dynamic union of configured **Sources**' `serves()` predicates. _Avoid_: town, place.
|
|
112
|
+
|
|
113
|
+
**Remote**: The `Remote` arm. Each **Parser** decides what `Remote` means for its **Source** via `remote_wire()`. _Avoid_: homeoffice, work-from-home (when the type variant is meant).
|
|
114
|
+
|
|
115
|
+
**Location Coverage**: The module-level **Protocol** every **Parser** satisfies — `serves(name: str) -> bool`, `to_wire(name: str) -> str`, `serves_remote: bool`, `remote_wire() -> Any`. Each parser is local authority. **Config Loader** validates `LOCATIONS` at load time and rejects unresolvable entries with `ConfigError`. Per ADR-0012. _Avoid_: location map, slug table.
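One way to picture the load-time check: a location is resolvable if at least one configured parser `serves()` it. The Protocol below copies the four members named in this entry; `validate_locations`, the local `ConfigError`, the lookup normalisation, and the two stand-in parsers are all hypothetical.

```python
from types import SimpleNamespace
from typing import Any, Protocol


class LocationCoverage(Protocol):
    serves_remote: bool

    def serves(self, name: str) -> bool: ...
    def to_wire(self, name: str) -> str: ...
    def remote_wire(self) -> Any: ...


class ConfigError(Exception):
    """Stand-in for the package's ConfigError."""


def validate_locations(locations: list[str], parsers: list[LocationCoverage]) -> None:
    """Reject any configured city that no parser claims to serve."""
    unresolved = [
        name for name in locations
        # Assumed normalisation before lookup: trim and casefold.
        if not any(p.serves(name.strip().casefold()) for p in parsers)
    ]
    if unresolved:
        raise ConfigError(f"unresolvable LOCATIONS entries: {unresolved}")


# Two stand-in "parser modules" built from namespaces, purely for illustration.
board_a = SimpleNamespace(
    serves=lambda name: name in {"berlin", "hamburg"},
    to_wire=lambda name: name,
    serves_remote=True,
    remote_wire=lambda: "remote",
)
board_b = SimpleNamespace(
    serves=lambda name: name == "munich",
    to_wire=lambda name: name.upper(),
    serves_remote=False,
    remote_wire=lambda: None,
)
```

Since there is no central catalog, adding a Source widens the set of accepted city names without touching the validator.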
|
|
116
|
+
|
|
117
|
+
**LLM Extractor**: A Protocol with two methods: `classify_relevance_batch(items: list[ClassifyItem]) -> (list[RelevanceVerdict], CallUsage)` and `judge_top_n(candidates: list[JudgeCandidate]) -> (list[MatchVerdict], CallUsage)`. `ClassifyItem` carries opaque `id`, `title`, `raw_description`; response is id-keyed (`[{"id":..., "in_domain": bool, "extract": {...} | absent}, ...]`). `RelevanceVerdict` is `{in_domain, extract: StructuredExtract | None}` (extract non-None iff in-domain). `JudgeCandidate` carries stable id, **Structured Extract**, and the `Position`'s `stub` fields for tie-breaking. Hardcoded prompts live in `src/application_pipeline/templates/prompts/`; **Triage Profile** injected via `{USER_INFO}` slot; **Skills** bound at construction. The v1 production implementation is `ClaudeExtractor` shelling out to **Claude Code CLI** (`claude -p - --output-format json --model <alias>` + `--effort <level>` where applicable) as a fresh subprocess per call. Per ADR-0015: `haiku` (no `--effort`) for the classifier, `haiku` with `--effort medium` for the judge. Structured outputs wrapped in `<verdicts>` tags and parsed via **Agent Output Protocol** (ADR-0015). When the CLI reports a usage limit, the implementation parses the 429 reset time and raises `ClaudeUsageLimitError(reset_time)` so the orchestrator can sleep (ADR-0023). _Avoid_: LLM, model (without qualifier).
|
|
118
|
+
|
|
119
|
+
**Agent Output Protocol**: A project-agnostic pure module under `application_pipeline/llm/` that extracts a structured JSON payload from a CLI response whose `result` carries free-form text with payload wrapped in a semantic XML tag (ADR-0015). Exposes `extract_json_block(text: str, tag: str) -> Any` + `AgentOutputProtocolError(kind: "tag_missing" | "json_malformed")`. Finds rightmost `</tag>`, walks back through preceding `<tag>` openings, strips an optional surrounding markdown fence, then `json.loads`. Returns the first candidate that parses; raises on no tag / no parse. The **LLM Extractor** catches at its call-site boundary and re-raises as `ExtractorBatchMalformedError`/`ExtractorMalformedJSONError`. _Avoid_: output parser (too generic), response handler (overloaded with HTTP).
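One possible reading of the extraction steps in this entry, written as a self-contained sketch. The function and error names follow the entry, but the body is an interpretation rather than the package's code; in particular the fence-stripping helper and its handling of an info string such as `json` are assumptions.

```python
import json
from typing import Any

FENCE = "`" * 3  # a markdown code fence marker


class AgentOutputProtocolError(Exception):
    def __init__(self, kind: str) -> None:
        super().__init__(kind)
        self.kind = kind  # "tag_missing" or "json_malformed"


def _strip_fence(body: str) -> str:
    """Remove one optional surrounding markdown fence, including its info string."""
    body = body.strip()
    if len(body) >= 2 * len(FENCE) and body.startswith(FENCE) and body.endswith(FENCE):
        inner = body[len(FENCE):-len(FENCE)]
        _info, newline, rest = inner.partition("\n")
        return rest.strip() if newline else inner.strip()
    return body


def extract_json_block(text: str, tag: str) -> Any:
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    close_at = text.rfind(close_tag)  # rightmost closing tag
    if close_at == -1:
        raise AgentOutputProtocolError("tag_missing")
    open_at = text.rfind(open_tag, 0, close_at)
    if open_at == -1:
        raise AgentOutputProtocolError("tag_missing")
    # Walk back through earlier openings until one candidate parses.
    while open_at != -1:
        candidate = _strip_fence(text[open_at + len(open_tag):close_at])
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            open_at = text.rfind(open_tag, 0, open_at)
    raise AgentOutputProtocolError("json_malformed")
```

Walking back matters when the payload itself mentions the tag, for example inside a JSON string: the nearest opening then yields a fragment that fails to parse, and the earlier opening yields the full payload.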
|
|
120
|
+
|
|
121
|
+
**Pagination**: Successive page fetches per **Source**, preferably ordered newest-first by `posted_date`, until either the source signals no next page **or** the per-Source `max_results` cap is reached. Per ADR-0017 there is no dedup-driven discover early-stop. `SKIP_AND_END_QUERY` (ADR-0007) is only emitted on `run_state.is_aborted`. Each `SourceEntry` carries its own cap (default 1000).
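The two stop conditions (no next page, or the per-Source cap) can be sketched as a generator. The `fetch_page` callable returning `(items, has_next)` is a hypothetical interface chosen for the example; real parsers will signal the end of results in their own way.

```python
from collections.abc import Callable, Iterator


def paginate(
    fetch_page: Callable[[int], tuple[list[str], bool]],
    max_results: int = 1000,
) -> Iterator[str]:
    """Yield items page by page until the source ends or the cap is hit."""
    yielded = 0
    page = 1
    while yielded < max_results:
        items, has_next = fetch_page(page)
        for item in items:
            if yielded >= max_results:
                return  # cap reached mid-page
            yield item
            yielded += 1
        if not has_next:
            return  # source signalled no next page
        page += 1
```

There is deliberately no check against previously seen URLs in the loop, in line with the absence of a dedup-driven early stop.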
|
|
122
|
+
|
|
123
|
+
### Invocation
|
|
124
|
+
|
|
125
|
+
The package is distributed via PyPI (ADR-0027). Hosts install into a project-local `.venv` (`python3 -m venv .venv && .venv/bin/pip install application-pipeline`), bootstrap the settings folder with `application-pipeline init <dir>`, and arm cron with `bash <dir>/setup/cron-install.sh`. Cron fires **once per day at 00:30 local** (ADR-0024). Each tick runs the seeded `setup/cron.sh`: `pip install --upgrade application-pipeline` (×2 for CDN propagation), `application-pipeline init --refresh <dir>` (self-heal new template files), then `application-pipeline run <dir>/config.py`. A global `flock` at `$APPLICATION_PIPELINE_HOME/.cron.lock` (modelled on pycastle's pattern) serialises overlapping ticks; a run that sleeps through a quota window (ADR-0023) may overlap the next cron fire, which the flock silently waits out — absence of the next day's daily file is the operator signal. The laptop is used only during development to smoke-test individual **Parsers** in isolation. Note: `pycastle/` in this repo is an unrelated RALPH Loop coding-agent plugin used to *build* this project; its quota-handling logic (`pycastle/services/claude_service.py:88-145`) is the reference port for ADR-0023's parse-and-sleep behavior, and its `setup/*.sh` scripts are the reference shape for ADR-0027's seeded cron wrappers.
|
|
126
|
+
|
|
127
|
+
## Relationships
|
|
128
|
+
|
|
129
|
+
- A **Position** is written into a **Daily Results File** only if it passes the **Domain Pre-Filter**, then the **Freshness Gate**, then the **Relevance Classifier** (status → `in_domain`), then is picked into the **Daily Top-5** by the **Match Judge** (status → `selected_by_judge`). The **Freshness Gate** also re-runs on Pool re-discovery, so pool items can transition `in_domain → expired` before today's judge call.
|
|
130
|
+
- The **Pool** is the set of `status == in_domain` URLs re-discovered by a parser in the current run — an emergent property of `.seen.json` + today's parse output, not a stored data structure.
|
|
131
|
+
- The **Match Judge** runs once per run on the **Pool**, takes **Structured Extracts** plus construction-bound **Skills** and the injected **Triage Profile**, returns up to 5 **Match Verdicts** with explicit **Rank**.
|
|
132
|
+
- The **Triage Profile** reaches the LLM via the hardcoded prompt's `{USER_INFO}` slot (ADR-0016); **Skills** are bound at `ClaudeExtractor` construction and reach the LLM via `{skills}`; `NEGATIVE_KEYWORDS` reaches the **Domain Pre-Filter** directly (ADR-0019).
|
|
@@ -0,0 +1,13 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: application-pipeline
|
|
3
|
+
Version: 0.6.0
|
|
4
|
+
Requires-Python: >=3.11.3
|
|
5
|
+
Requires-Dist: beautifulsoup4
|
|
6
|
+
Requires-Dist: httpx~=0.27
|
|
7
|
+
Requires-Dist: python-dotenv
|
|
8
|
+
Requires-Dist: rich
|
|
9
|
+
Provides-Extra: dev
|
|
10
|
+
Requires-Dist: pytest; extra == "dev"
|
|
11
|
+
Requires-Dist: mypy; extra == "dev"
|
|
12
|
+
Requires-Dist: respx; extra == "dev"
|
|
13
|
+
Requires-Dist: ruff; extra == "dev"
|
|
@@ -0,0 +1,48 @@
|
|
|
1
|
+
# application-pipeline
|
|
2
|
+
|
|
3
|
+
A personal job-discovery and triage pipeline. It fetches listings from configured sources, filters
|
|
4
|
+
out noise, classifies each position's relevance with Claude, accumulates a rolling pool of in-domain
|
|
5
|
+
candidates, and emits one dated results file per day ranking the top five matches.
|
|
6
|
+
|
|
7
|
+
## Why
|
|
8
|
+
|
|
9
|
+
- **Automated discovery** — parsers walk configured job boards each morning so you don't have to.
|
|
10
|
+
- **Noise reduction** — the Domain Pre-Filter drops title-level mismatches cheaply, before any LLM
|
|
11
|
+
cost is incurred.
|
|
12
|
+
- **Freshness control** — the Freshness Gate discards listings beyond your configured age ceiling and
|
|
13
|
+
any past their deadline.
|
|
14
|
+
- **Structured ranking** — the Match Judge compares the in-domain candidates against your skills and
|
|
15
|
+
match criteria, returning an explicit rank with matched/missing skill lists.
|
|
16
|
+
- **One file per day** — a dated markdown file with up to five cards lands in your settings folder
|
|
17
|
+
and propagates via Syncthing if configured.
|
|
18
|
+
|
|
19
|
+
## The pipeline
|
|
20
|
+
|
|
21
|
+
Each cron tick walks every configured Source × Keyword × Location combination, then routes each
|
|
22
|
+
Position through the following phases in order:
|
|
23
|
+
|
|
24
|
+
1. **Parsers** — `discover()` yields cheap Position Stubs; `enrich()` fetches the detail page and
|
|
25
|
+
returns a full Position with raw description.
|
|
26
|
+
2. **Deduplication** — skips URLs seen in previous runs (exact URL or company/title/location tuple),
|
|
27
|
+
routing known in-domain positions directly back into the Pool.
|
|
28
|
+
3. **Domain Pre-Filter** — drops any Position whose title matches a Negative Keyword (configured in
|
|
29
|
+
`config.py` as `NEGATIVE_KEYWORDS`).
|
|
30
|
+
4. **Freshness Gate** — drops listings older than `MAX_LISTING_AGE_DAYS` or past their deadline.
|
|
31
|
+
5. **Relevance Classifier** — a batched Claude call decides `in_domain: true/false`; in-domain
|
|
32
|
+
positions receive a Structured Extract and enter the Pool.
|
|
33
|
+
6. **Match Judge** — a single Claude call at end-of-run picks the Daily Top-5 from the Pool,
|
|
34
|
+
returning a Match Verdict (rank, matched skills, missing skills, summary) for each winner.
|
|
35
|
+
7. **Daily Results File** — up to five Cards are written to `<settings-dir>/results/YYYY-MM-DD.md` in
|
|
36
|
+
rank order.
|
|
37
|
+
|
|
38
|
+
## Getting started
|
|
39
|
+
|
|
40
|
+
- **[docs/usage.md](docs/usage.md)** — installation, CLI reference, and per-file editing guide for
|
|
41
|
+
the settings folder.
|
|
42
|
+
- **[docs/cron-setup.md](docs/cron-setup.md)** — unattended operation via cron, flock semantics,
|
|
43
|
+
optional Syncthing, migration from a legacy layout, and PyPI release procedure.
|
|
44
|
+
|
|
45
|
+
## Acknowledgements
|
|
46
|
+
|
|
47
|
+
Install flow, cron wrapper shape, and flock-based serialisation modelled on
|
|
48
|
+
[pycastle](https://github.com/Johannes-Kutsch/pycastle).
|
|
@@ -0,0 +1,17 @@
|
|
|
1
|
+
# Claude Code CLI as the LLM backend
|
|
2
|
+
|
|
3
|
+
**LLM Extractor** drives Claude via `claude -p --output-format json` headless subprocess from the Pi. Runs against the user's Claude Code subscription (auth via one-time `claude login`); Anthropic API off-limits as budget item.
|
|
4
|
+
|
|
5
|
+
## Why
|
|
6
|
+
|
|
7
|
+
- **Pi 5 can't sustain local inference cheaply.** Inference off-device makes the Pi a thin coordinator — HTTP out, markdown in. No thermal headroom, no model pulls, no `keep_alive` tuning.
|
|
8
|
+
- **Headless mode works unattended.** `claude -p` is a normal subprocess driveable from cron; subscription auth in `~/.claude/` inherits via the user's home.
|
|
9
|
+
- **Quality ceiling lifts.** Claude beats Qwen on German listings without prompt-engineering acrobatics.
|
|
10
|
+
- **Subprocess + envelope, not SDK.** `--output-format json` returns envelope with `usage` (input/output/cache-read tokens), `total_cost_usd`, `session_id`. Anthropic SDK and Claude Agent SDK both wrap the same CLI but default to API keys — rejected.
|
|
11
|
+
- **Subscription rate-limit handling is structural.** Usage-limit errors surface in the envelope. See ADR-0023 for the sleep-and-retry behaviour.
|
|
12
|
+
|
|
13
|
+
## Consequences
|
|
14
|
+
|
|
15
|
+
- `Config`: `claude_classify_batch_size: int` (default 100, see ADR-0014) and optional `claude_cli_path`.
|
|
16
|
+
- Auth file `~/.claude/.credentials.json` lives outside `data/` deliberately — replicating OAuth through Syncthing is worse than a one-time re-auth.
|
|
17
|
+
- Run-time cost shifts from electricity to subscription quota. Per-call-site Claude token/cost fields land in `pipeline_orchestrator.events.jsonl` (see ADR-0018).
|
|
@@ -0,0 +1,18 @@
|
|
|
1
|
+
# Deduplication state (`.seen.json`) is durable via Syncthing
|
|
2
|
+
|
|
3
|
+
`.seen.json` lives at `data/.seen.json` (sibling to `data/results/`, per ADR-0011), inside the Syncthing-synced folder. The Pi is sole writer; the laptop's mirror is the backup. Not tracked in git.
|
|
4
|
+
|
|
5
|
+
## Why
|
|
6
|
+
|
|
7
|
+
- **Pipeline memory across resets.** The **Daily Results File** is per-day; if `.seen.json` reset too, every fresh day floods with listings the pipeline already showed.
|
|
8
|
+
- **Backup without a credential on the Pi.** Syncthing was already the transport for results — adding `.seen.json` is free, and the laptop mirror is disaster recovery.
|
|
9
|
+
- **History in the file, not the transport.** `first_seen` answers "when did I first see this URL"; git per-commit history was overkill.
|
|
10
|
+
- **Single-writer, no conflict surface.** Only Pi writes. Laptop never runs the full pipeline. Sibling-to-results placement also survives a `mv data/results data/results.archive` reset gesture.
|
|
11
|
+
|
|
12
|
+
## Consequences
|
|
13
|
+
|
|
14
|
+
- `.gitignore`d.
|
|
15
|
+
- Crontab wraps the entry point with `flock -n` so a still-running invocation causes the next cron tick to exit immediately.
|
|
16
|
+
- Disaster recovery: copy laptop's Syncthing copy back into place.
|
|
17
|
+
- Schema migrations: see ADR-0024 — the v2 cutover wipes the file instead.
|
|
18
|
+
- If a future stage introduces a second writer, the single-writer assumption needs a locking/merge strategy.
|
|
@@ -0,0 +1,17 @@
|
|
|
1
|
+
# Tuple-match writes a URL alias inside `is_seen`
|
|
2
|
+
|
|
3
|
+
When `is_seen` matches via the `(company_lc, title_lc, location_lc)` tuple under a *new* URL (syndicated copy), the **Deduplication Store** internally writes an alias entry under the new URL — duplicating the original record's `status` and `first_seen` — so subsequent runs short-circuit on the cheap URL lookup. The return value (the `SeenResult` variant) is unaffected; the alias is a transparent side effect of the read.
|
|
4
|
+
|
|
5
|
+
## Why
|
|
6
|
+
|
|
7
|
+
- **Most stubs expose URL before description.** Re-doing tuple normalisation + lookup on every run for an already-recognised syndicated copy is repeated work.
- **Alias-write internal to dedup module.** A caller-driven `record_alias` would force every `is_seen` call site to remember a follow-up — a footgun. Folding the write into `is_seen` removes it.
- **On-disk shape stays flat.** Each entry is self-describing `{url: record}`; no sum type on disk.
- **`first_seen` semantics stay correct.** Alias copies the *original*'s `first_seen` — the paper trail answers "when did this *role* first appear".
## Consequences
- Single-writer Pi (ADR-0002) means a side-effecting query is safe.
- `is_seen` may raise `OSError` from the alias write; the orchestrator's top-level `mark_*` handler covers it.
- Future fuzzy-match upgrades on the `_tuple_lookup` seam inherit the alias-write automatically.
- `is_seen`'s docstring must call out the side effect.
@@ -0,0 +1,20 @@
# Layout as a user-editable Python module in the synced folder
`settings/layout.py` lives alongside `config.py` in the synced folder (see ADR-0011). Plain Python: named placeholder groups, full `CARD_TEMPLATE` / `HEADLINE_TEMPLATE`. Loaded at runtime by the same `load_user_module` machinery as **Config**, validated into a frozen `Layout` dataclass, consumed by a pure **Renderer** via `str.format_map`.

When `LAYOUT` is unset, the loader defaults to `"layout.py"` next to `config.py` and **errors if missing** — matching how `USER_INFO_DIR` resolves. An explicit `LAYOUT = None` selects the built-in `layout.default()` stub (reserved for tests).
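The three cases (unset, explicit `None`, explicit file name) might be resolved along these lines. This is a hedged sketch rather than the package's loader: `resolve_layout_path`, the `_UNSET` sentinel, and the local `LayoutError` class are illustrative stand-ins, and the actual import and validation of the module via `load_user_module` is left out.

```python
import tempfile
from pathlib import Path

_UNSET = object()  # sentinel: tells "LAYOUT not set" apart from "LAYOUT = None"


class LayoutError(Exception):
    """Stand-in for the package's LayoutError."""


def resolve_layout_path(settings_dir, layout=_UNSET):
    """Return the layout file to load, or None to select the built-in stub."""
    if layout is None:  # explicit opt-in to the stub
        return None
    name = "layout.py" if layout is _UNSET else layout
    path = Path(settings_dir) / name
    if not path.is_file():  # fail loud instead of silently using the stub
        raise LayoutError(f"layout file not found: {path}")
    return path


# Demo in a throwaway settings directory.
demo_dir = tempfile.mkdtemp()
stub_choice = resolve_layout_path(demo_dir, None)
try:
    resolve_layout_path(demo_dir)  # unset and no layout.py present
    missing_failed = False
except LayoutError:
    missing_failed = True
(Path(demo_dir) / "layout.py").write_text("CARD_TEMPLATE = '{title}'\n")
found = resolve_layout_path(demo_dir)  # unset, layout.py now exists
```

A caller reading the user's config module could pass `getattr(config, "LAYOUT", _UNSET)` so that an absent attribute and an explicit `None` take different branches.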
## Why
- **User iterates on layout often, package code rarely.** Field tweaks, emoji swaps, visual hierarchy — should not require editing `src/`. Co-location with `config.py` matches the edit-restart-see workflow.
- **`str.format_map` + Python constants is the lightest tool.** No Jinja, no DSL. Named placeholder groups (`PLACEHOLDER_GROUPS = {"meta": (" · ", ["location", "url"])}`) handle the dangling-separator problem.
- **Loader plumbing already exists.** `load_user_module` (extracted from Config Loader) handles `importlib.util`, missing-attr checks, typed errors. `LayoutError` and `ConfigError` share `UserSettingsError`.
- **Renderer stays pure.** Layout passed in explicitly (`render(position, verdict, layout)`); no module-level state.
- **Auto-discovery footgun closure.** A user with a valid `layout.py` who forgot to add `LAYOUT = "layout.py"` would silently get the stub. Pattern-symmetry with `USER_INFO_DIR` + fail-loud on missing closes the gap. Explicit `LAYOUT = None` keeps the stub reachable.
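How named placeholder groups avoid a dangling separator can be shown with a small sketch. Only the `PLACEHOLDER_GROUPS` literal comes from the ADR; the `CARD_TEMPLATE` text, the `render_card` helper, and the field values are hypothetical, and the real renderer is called as `render(position, verdict, layout)` with a validated `Layout`, not plain dicts.

```python
PLACEHOLDER_GROUPS = {"meta": (" · ", ["location", "url"])}
CARD_TEMPLATE = "{title} @ {company}\n{meta}"  # hypothetical template


def render_card(fields, template=CARD_TEMPLATE, groups=PLACEHOLDER_GROUPS):
    """Fill the template, joining each group's non-empty members."""
    values = dict(fields)
    for name, (separator, members) in groups.items():
        present = [str(fields[m]) for m in members if fields.get(m)]
        values[name] = separator.join(present)  # no separator around gaps
    return template.format_map(values)


full = render_card({
    "title": "Data Engineer", "company": "Acme",
    "location": "Berlin", "url": "https://a.example/1",
})
partial = render_card({
    "title": "Data Engineer", "company": "Acme",
    "location": "", "url": "https://a.example/1",
})
```

With the location missing, the `meta` line contains only the URL and no stray ` · `. A misspelled placeholder such as `{titel}` in the template would raise `KeyError` from `str.format_map`, matching the fail-loud behaviour noted under Consequences.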
## Consequences
- Renderer takes a `Layout` argument; orchestrator loads once at startup and threads it through.
- `LayoutError` joins `ConfigError` under `UserSettingsError`.
- Typo in placeholder name raises `KeyError` from `str.format_map` at render time — fail-loud is intentional.
- Validation at startup: `layout_loader` checks every `placeholder_groups` entry references a known field; missing/invalid `layout.py` fails before the first fetch.