bidreader 0.2.0__tar.gz → 0.5.0__tar.gz
- bidreader-0.5.0/PKG-INFO +151 -0
- bidreader-0.5.0/README.md +124 -0
- {bidreader-0.2.0 → bidreader-0.5.0}/bidreader/__init__.py +1 -1
- {bidreader-0.2.0 → bidreader-0.5.0}/bidreader/cli.py +6 -0
- {bidreader-0.2.0 → bidreader-0.5.0}/bidreader/extract.py +69 -9
- bidreader-0.5.0/bidreader.egg-info/PKG-INFO +151 -0
- {bidreader-0.2.0 → bidreader-0.5.0}/bidreader.egg-info/SOURCES.txt +2 -1
- {bidreader-0.2.0 → bidreader-0.5.0}/bidreader.egg-info/requires.txt +3 -0
- {bidreader-0.2.0 → bidreader-0.5.0}/pyproject.toml +2 -1
- bidreader-0.5.0/tests/test_offline.py +36 -0
- bidreader-0.2.0/PKG-INFO +0 -109
- bidreader-0.2.0/README.md +0 -84
- bidreader-0.2.0/bidreader.egg-info/PKG-INFO +0 -109
- {bidreader-0.2.0 → bidreader-0.5.0}/LICENSE +0 -0
- {bidreader-0.2.0 → bidreader-0.5.0}/bidreader/mcp_server.py +0 -0
- {bidreader-0.2.0 → bidreader-0.5.0}/bidreader.egg-info/dependency_links.txt +0 -0
- {bidreader-0.2.0 → bidreader-0.5.0}/bidreader.egg-info/entry_points.txt +0 -0
- {bidreader-0.2.0 → bidreader-0.5.0}/bidreader.egg-info/top_level.txt +0 -0
- {bidreader-0.2.0 → bidreader-0.5.0}/setup.cfg +0 -0
bidreader-0.5.0/PKG-INFO
ADDED
@@ -0,0 +1,151 @@
+Metadata-Version: 2.4
+Name: bidreader
+Version: 0.5.0
+Summary: Read messy construction sub-quotes, bid packages & spec PDFs into clean structured data — and catch the scope gaps/exclusions vendors bury. Every value cited to its page.
+Author-email: Anmol <anmol@attentive.ai>
+License: MIT
+Project-URL: Homepage, https://github.com/anmolsam/bidreader
+Project-URL: Issues, https://github.com/anmolsam/bidreader/issues
+Keywords: construction,estimating,takeoff,subcontractor,bid,quote,scope,exclusions,spec,AEC,preconstruction,BOQ,LLM,MCP
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Topic :: Office/Business :: Financial
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: pymupdf>=1.24
+Requires-Dist: certifi>=2024.0
+Provides-Extra: tables
+Requires-Dist: pdfplumber>=0.11; extra == "tables"
+Provides-Extra: mcp
+Requires-Dist: mcp>=1.2; extra == "mcp"
+Provides-Extra: dev
+Requires-Dist: pytest>=8; extra == "dev"
+Dynamic: license-file
+
+<div align="center">
+
+# 📄 BidReader
+
+### Read messy construction sub-quotes, bid packages & spec PDFs into clean structured data — and catch the scope gaps and exclusions vendors bury in the fine print.
+
+Every line item carries its **page**, the **exact source text** it came from, and an **arithmetic check** (`qty × unit_price == amount`) — verification on top of extraction, not just an LLM guess.
+
+[](https://pypi.org/project/bidreader/)
+[](LICENSE)
+[](pyproject.toml)
+[](docs/MCP.md)
+[](docs/FREE_MODELS.md)
+
+</div>
+
+---
+
+> *"Manually typing numbers from a PDF into Excel because the formatting is a crime scene… hunting for the one line where a sub quietly excluded 'trash removal' in size-8 font."*
+> — r/Construction, **498 upvotes** ([source](https://www.reddit.com/r/Construction/comments/1pq34ur/))
+
+The construction-AI gold rush is all chasing the same crowded, resisted thing — autonomous *takeoff*. The **loudest unmet pain** of estimators is upstream and downstream of it: wrangling crime-scene PDFs into clean data, and **catching what subcontractors quietly excluded** before it costs six figures on the job.
+
+No permissively-licensed library did this. **BidReader is that primitive** — MIT, `pip install`, runs on free LLMs, and callable from any AI agent over MCP.
+
+## Quickstart (copy-paste, ~30 seconds)
+
+```bash
+pip install bidreader
+
+# Use any one — a FREE key works (see docs/FREE_MODELS.md):
+export GEMINI_API_KEY=... # free at aistudio.google.com
+# or export OPENROUTER_API_KEY=... (has :free models)
+# or export REQUESTY_API_KEY=...
+
+bidreader your_sub_quote.pdf
+```
+
+```python
+from bidreader import read
+
+doc = read("sub_quote.pdf")
+doc.line_items # [{section, description, qty, unit, amount, page}, ...]
+doc.exclusions # [{item, quote, page, risk}, ...] <- the buried stuff
+doc.scope_gaps # trade-standard scope NOT in the doc — confirm before bidding
+doc.to_json()
+```
+
+## Real output
+
+On a real **$324,240.61 drywall estimate** (72 line items, scanned in seconds), BidReader's scope engine caught a genuinely expensive hole:
+
+```
+!! SCOPE GAPS TO CONFIRM:
+  - Finishing (taping, mudding, sanding) -- the gypsum line items price the BOARD
+    only, not the finishing labor to reach a paint-ready surface.
+  - Door hardware -- "Door W/ Frame" lines don't include hinges/locks/closers.
+  - Firestopping at rated assemblies -- life-safety scope, commonly omitted.
+```
+
+On a real **25-page multi-trade GC estimate**, it parsed **959 line items across 16 CSI divisions** (demolition → concrete → steel → finishes → plumbing → fire suppression), each page-cited. See [docs/RESULTS.md](docs/RESULTS.md) and a full worked example in [`examples/`](examples/).
+
+## Use it from an AI agent (MCP)
+
+```bash
+pip install "bidreader[mcp]"
+```
+```json
+{ "mcpServers": { "bidreader": {
+    "command": "bidreader-mcp",
+    "env": { "GEMINI_API_KEY": "..." }
+}}}
+```
+Tools: `read_document`, `catch_exclusions`, `extract_line_items`. Now your agent can answer *"which subs excluded fire-stopping across this bid folder?"* Full guide: [docs/MCP.md](docs/MCP.md).
+
+## How it works
+
+```
+PDF (sub-quote / bid package / spec / schedule)
+  → page-tagged text extraction (PyMuPDF)
+  → chunk by page (scales to 25+ page, 900+ line-item estimates)
+  → LLM structured extraction (line items · exclusions · assumptions · alternates · scope gaps)
+  → merge + page-cited output (JSON / CLI / MCP)
+```
+
+Text-based, so it runs great on **free** models — see [docs/FREE_MODELS.md](docs/FREE_MODELS.md).
+
+## Benchmark
+
+Reproducible ground-truth benchmark ([`benchmark/`](benchmark/)) — synthetic docs we author, so truth is exact and the PDFs ship in-repo:
+
+| metric | score |
+|---|---|
+| Line-item recall | **100%** |
+| Exclusion-catch recall (incl. prose-buried) | **100%** |
+| No-hallucination rate (clean docs) | **100%** |
+| Bid-total accuracy (±2%) | **100%** |
+| Arithmetic errors caught | **2/2**, 0 false positives |
+
+Honest caveat: synthetic docs are cleaner than real scans — these are an **upper bound** on well-structured input, not a claim about messy real bids. Uncontrolled real-document results are in [docs/RESULTS.md](docs/RESULTS.md). Reproduce: `python benchmark/generate.py && python benchmark/run.py`.
+
+## Why this, and why now — the evidence
+
+A full write-up (problem, market data, prior-art gap, method, results) is in **[PAPER.md](PAPER.md)**. The short version:
+
+- **Loudest, most-shared pain** in construction-estimating communities (the 498-upvote thread above; more cited in the paper).
+- **It works *today*** — document extraction is LLM-native, unlike floor-plan symbol detection (academic SOTA tops out ~83% mAP).
+- **Empty slot** — `bidreader`, `blueprint-parser`, `pytakeoff` were all unclaimed on PyPI; the only adjacent tools are AGPL/non-commercial or abandoned toys.
+- **Broadest base** — every estimator *and* every construction-AI builder needs document extraction. The library is the dependency; the MCP server is the agent-era surface.
+
+## Roadmap
+
+- [ ] Scanned-PDF vision OCR path
+- [ ] Revision/addendum **diff** ("what changed between Addendum 3 and 4")
+- [ ] Excel/CSV BOQ export + multi-quote **leveling** (compare subs side-by-side)
+- [ ] Region/trade notation packs (AISC, BS/IS, AUS)
+
+## Contributing
+
+PRs welcome — see [CONTRIBUTING.md](CONTRIBUTING.md). Good first issues: add a notation parser, a new export format, or a test fixture.
+
+## License
+
+[MIT](LICENSE) © 2026. Cite via [CITATION.cff](CITATION.cff).
{bidreader-0.2.0 → bidreader-0.5.0}/bidreader/cli.py
@@ -24,6 +24,12 @@ def main():
               f"{str(li.get('qty') or ''):>8s}{str(li.get('unit') or ''):>5s}{amt:>13s} p{li.get('page','?')}")
     if d.get('bid_total'):
         print(f" {'BID TOTAL':56s}{'$' + format(d['bid_total'], ',.2f'):>13s}")
+    mm = [li for li in d.line_items if li.get('math_check') == 'mismatch']
+    if mm:
+        print(f"\n!! ARITHMETIC MISMATCHES ({len(mm)}) — qty x unit_price != amount:")
+        for li in mm[:10]:
+            print(f" - p{li.get('page','?')} {str(li.get('description',''))[:46]}: "
+                  f"stated {li.get('amount')}, computed {li.get('math_expected')}")
     print(f"\n!! EXCLUSIONS CAUGHT ({len(d.exclusions)}):")
     for e in d.exclusions:
         print(f" - {e.get('item','?')} (page {e.get('page','?')})")
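The new CLI lines filter line items by their `math_check` flag and print at most ten of them. A minimal sketch of that reporting step follows, assuming line items are plain dicts carrying the `math_check` and `math_expected` keys seen in this diff. `format_mismatches` is a hypothetical helper, not part of the package; it returns the lines instead of printing them so the behavior is easy to inspect.

```python
def format_mismatches(line_items, limit=10):
    """Return report lines for items whose math_check is 'mismatch'.

    Hypothetical helper: it mirrors the filter and message layout of the
    CLI hunk, but returns strings instead of printing them.
    """
    mismatched = [li for li in line_items if li.get("math_check") == "mismatch"]
    if not mismatched:
        return []
    # The header reports the full count even when fewer rows are listed.
    lines = [f"!! ARITHMETIC MISMATCHES ({len(mismatched)}): qty x unit_price != amount"]
    for li in mismatched[:limit]:
        desc = str(li.get("description", ""))[:46]  # truncate long descriptions
        lines.append(
            f"  - p{li.get('page', '?')} {desc}: "
            f"stated {li.get('amount')}, computed {li.get('math_expected')}"
        )
    return lines


# Illustrative data only; not taken from a real estimate.
sample_items = [
    {"description": "5/8 in. gypsum board", "page": 3, "amount": 25.0,
     "math_check": "mismatch", "math_expected": 20.0},
    {"description": "Metal studs", "page": 3, "amount": 40.0, "math_check": "ok"},
]
report = format_mismatches(sample_items)
print("\n".join(report))
```

Note that the count in the header covers every mismatch, while the listing stops at `limit`, so a document with more than ten flagged items shows a count larger than the number of rows printed.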
{bidreader-0.2.0 → bidreader-0.5.0}/bidreader/extract.py
@@ -24,7 +24,7 @@ SCHEMA_PROMPT = """You are a construction estimating assistant reading a vendor/
 "vendor": "<company or null>", "project": "<project/title or null>",
 "trade": "<trade e.g. Drywall, Electrical or null>", "currency": "<e.g. USD or null>",
 "bid_total": <number or null>,
-"line_items": [{"section":"<csi/section or null>","description":"<text>","qty":<num|null>,"unit":"<EA/SF/LF/LS|null>","unit_price":<num|null>,"amount":<num|null>,"page":<int>}],
+"line_items": [{"section":"<csi/section or null>","description":"<text>","qty":<num|null>,"unit":"<EA/SF/LF/LS|null>","unit_price":<num|null>,"amount":<num|null>,"page":<int>,"source_text":"<the EXACT line as printed in the document>"}],
 "exclusions": [{"item":"<short label>","quote":"<EXACT text as written>","page":<int>,"risk":"<one line: why this matters / who eats the cost>"}],
 "assumptions": [{"text":"<exact>","page":<int>}],
 "alternates": [{"text":"<exact>","amount":<num|null>,"page":<int>}],
@@ -36,13 +36,26 @@ Rules: exclusions are CRITICAL — hunt everywhere, including fine print, footno
 For scope_gaps, infer trade-standard scope a vendor commonly omits that is NOT mentioned in this doc."""
 
 
-def
-
+def _page_blocks(doc):
+    out = []
     for i, p in enumerate(doc):
         t = p.get_text().strip()
         if t:
-
-    return
+            out.append(f"[PAGE {i+1}]\n{t}")
+    return out
+
+
+def _chunk(blocks, budget=42000):
+    """Group page-blocks into chunks under a char budget so the model's JSON
+    output never overflows on large multi-page estimates."""
+    chunks, cur, n = [], [], 0
+    for b in blocks:
+        if cur and n + len(b) > budget:
+            chunks.append("\n\n".join(cur)); cur, n = [], 0
+        cur.append(b); n += len(b)
+    if cur:
+        chunks.append("\n\n".join(cur))
+    return chunks
 
 
 def _llm(text):
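The `_chunk` helper added above packs page blocks greedily under a character budget. The sketch below re-implements the same loop under a different name (`chunk_blocks`, a hypothetical stand-in) so it can run without the package, and shows two properties that follow from the code as written: pages are never split across chunks, and a single page longer than the budget still becomes one oversized chunk.

```python
def chunk_blocks(blocks, budget=42000):
    """Greedily pack page blocks into chunks of roughly `budget` characters.

    Illustrative re-implementation; the name is hypothetical and the logic
    follows the `_chunk` hunk shown in this diff.
    """
    chunks, current, size = [], [], 0
    for block in blocks:
        # Close the current chunk when adding this block would exceed the budget.
        if current and size + len(block) > budget:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(block)
        size += len(block)  # separators are not counted toward the budget
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Five synthetic pages of about 20k characters each, as in the package's offline test.
pages = [f"[PAGE {i}]\n" + "x" * 20000 for i in range(1, 6)]
chunks = chunk_blocks(pages)
sizes = [len(c) for c in chunks]

# A single block larger than the budget is kept whole, not split.
oversized = chunk_blocks(["y" * 50000], budget=42000)
```

With these inputs the pages group as two, two, and one. The budget is measured in characters rather than model tokens, so it is best read as a rough guard on output size, not a hard limit.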
@@ -74,6 +87,26 @@ def _clean(s):
     return json.loads(s)
 
 
+def validate(data):
+    """Attach an objective arithmetic check to each line item:
+    'ok' if qty*unit_price ≈ amount, 'mismatch' if not, 'n/a' if missing inputs.
+    This is non-LLM verification — the trust layer on top of extraction."""
+    flagged = 0
+    for li in data.get("line_items", []):
+        q, up, amt = li.get("qty"), li.get("unit_price"), li.get("amount")
+        if isinstance(q, (int, float)) and isinstance(up, (int, float)) and isinstance(amt, (int, float)):
+            expect = q * up
+            tol = max(1.0, abs(amt) * 0.02)  # 2% or $1
+            if abs(expect - amt) <= tol:
+                li["math_check"] = "ok"
+            else:
+                li["math_check"] = "mismatch"; li["math_expected"] = round(expect, 2); flagged += 1
+        else:
+            li["math_check"] = "n/a"
+    data["math_flagged"] = flagged
+    return data
+
+
 class Doc(dict):
     """Result with convenience accessors."""
     @property
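The tolerance in `validate` is the larger of $1 and 2% of the stated amount, so the acceptable gap grows with the size of the line item. The sketch below isolates that rule in a hypothetical single-item function, `math_check`, whose parameter names `rel_tol` and `abs_tol` are assumptions introduced for illustration; the package hard-codes both values.

```python
def math_check(qty, unit_price, amount, rel_tol=0.02, abs_tol=1.0):
    """Classify one line item as 'ok', 'mismatch', or 'n/a'.

    Hypothetical helper following the rule in the `validate` hunk:
    tolerance = max(abs_tol, |amount| * rel_tol).
    Returns (status, expected_amount_or_None).
    """
    numeric = (int, float)
    if not all(isinstance(v, numeric) for v in (qty, unit_price, amount)):
        return "n/a", None
    expected = qty * unit_price
    tolerance = max(abs_tol, abs(amount) * rel_tol)
    status = "ok" if abs(expected - amount) <= tolerance else "mismatch"
    return status, round(expected, 2)


results = [
    math_check(10, 2.0, 20.0),       # exact match
    math_check(10, 2.0, 25.0),       # off by 5, tolerance is 1.0
    math_check(None, 2.0, 20.0),     # missing quantity
    math_check(1000, 3.25, 3300.0),  # off by 50, tolerance is 66.0
]
```

The last case shows the trade-off: a $50 discrepancy on a $3,300 line passes because it falls inside the 2% band. That keeps rounding noise from being flagged, at the cost of letting small errors on large lines through.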
@@ -85,13 +118,40 @@ class Doc(dict):
     def to_json(self, **kw): return json.dumps(self, indent=2, **kw)
 
 
+def _merge(parts):
+    out = {"doc_type": None, "vendor": None, "project": None, "trade": None,
+           "currency": None, "bid_total": None, "line_items": [], "exclusions": [],
+           "assumptions": [], "alternates": [], "scope_gaps": [], "notes": None}
+    seen_gap = set()
+    totals = []
+    for d in parts:
+        for k in ("doc_type", "vendor", "project", "trade", "currency", "notes"):
+            if not out[k] and d.get(k):
+                out[k] = d[k]
+        if isinstance(d.get("bid_total"), (int, float)):
+            totals.append(d["bid_total"])
+        for k in ("line_items", "exclusions", "assumptions", "alternates"):
+            out[k].extend(d.get(k) or [])
+        for g in d.get("scope_gaps") or []:
+            key = (g.get("missing") or "").strip().lower()
+            if key and key not in seen_gap:
+                seen_gap.add(key); out["scope_gaps"].append(g)
+    out["bid_total"] = max(totals) if totals else None  # grand total > subtotals
+    return out
+
+
 def read(path: str) -> Doc:
-    """Read a construction PDF into structured, page-cited data.
+    """Read a construction PDF into structured, page-cited data.
+
+    Large multi-page estimates are chunked by page and merged, so the model's
+    JSON output never truncates."""
     doc = fitz.open(path)
-
-    if len(
+    blocks = _page_blocks(doc)
+    if sum(len(b) for b in blocks) < 40:
         raise RuntimeError("No extractable text (scanned PDF) — vision OCR path TODO.")
-
+    chunks = _chunk(blocks)
+    data = validate(_merge([_llm(c) for c in chunks]))
     data["_source"] = path.split("/")[-1]
     data["_pages"] = doc.page_count
+    data["_chunks"] = len(chunks)
     return Doc(data)
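The merge step combines the per-chunk results in three ways: the first non-empty value wins for scalar fields, list fields are concatenated, and scope gaps are de-duplicated on a lower-cased, stripped `missing` key. A reduced sketch follows, assuming each chunk result is a plain dict. `merge_parts` is a hypothetical name and it handles only four keys to stay short.

```python
def merge_parts(parts):
    """Combine per-chunk dicts. Reduced to four keys for illustration;
    the rules follow the `_merge` hunk shown in this diff."""
    merged = {"vendor": None, "bid_total": None, "line_items": [], "scope_gaps": []}
    seen, totals = set(), []
    for part in parts:
        if not merged["vendor"] and part.get("vendor"):
            merged["vendor"] = part["vendor"]  # first non-empty value wins
        if isinstance(part.get("bid_total"), (int, float)):
            totals.append(part["bid_total"])
        merged["line_items"].extend(part.get("line_items") or [])
        for gap in part.get("scope_gaps") or []:
            key = (gap.get("missing") or "").strip().lower()
            if key and key not in seen:  # case-insensitive de-duplication
                seen.add(key)
                merged["scope_gaps"].append(gap)
    # The largest total seen is treated as the grand total.
    merged["bid_total"] = max(totals) if totals else None
    return merged


# Illustrative chunk results; not real model output.
chunk_a = {"vendor": "Acme", "bid_total": 100,
           "line_items": [{"description": "a"}],
           "scope_gaps": [{"missing": "Firestopping"}]}
chunk_b = {"bid_total": 250,
           "line_items": [{"description": "b"}],
           "scope_gaps": [{"missing": " firestopping "}, {"missing": "Door hardware"}]}
merged = merge_parts([chunk_a, chunk_b])
```

Taking `max` assumes the grand total is the largest figure any chunk reports, which holds when the other candidates are subtotals. Line items are concatenated without de-duplication, so correctness there depends on chunks not overlapping.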
{bidreader-0.2.0 → bidreader-0.5.0}/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "bidreader"
-version = "0.
+version = "0.5.0"
 description = "Read messy construction sub-quotes, bid packages & spec PDFs into clean structured data — and catch the scope gaps/exclusions vendors bury. Every value cited to its page."
 readme = "README.md"
 requires-python = ">=3.10"
@@ -24,6 +24,7 @@ dependencies = ["pymupdf>=1.24", "certifi>=2024.0"]
 [project.optional-dependencies]
 tables = ["pdfplumber>=0.11"]
 mcp = ["mcp>=1.2"]
+dev = ["pytest>=8"]
 
 [project.scripts]
 bidreader = "bidreader.cli:main"
bidreader-0.5.0/tests/test_offline.py
ADDED
@@ -0,0 +1,36 @@
+"""Offline regression tests (no network / no LLM). Run: pytest -q"""
+from bidreader.extract import _chunk, _merge, validate
+
+
+def test_chunk_respects_budget_and_keeps_all_pages():
+    blocks = [f"[PAGE {i}]\n" + "x" * 20000 for i in range(1, 6)]
+    chunks = _chunk(blocks, budget=42000)
+    assert len(chunks) >= 2
+    joined = "\n\n".join(chunks)
+    for i in range(1, 6):
+        assert f"[PAGE {i}]" in joined  # no page dropped
+
+
+def test_merge_concats_and_picks_grand_total():
+    a = {"line_items": [{"description": "a"}], "bid_total": 100,
+         "scope_gaps": [{"missing": "X", "why": "1"}], "vendor": "Acme"}
+    b = {"line_items": [{"description": "b"}], "bid_total": 250,
+         "scope_gaps": [{"missing": "x", "why": "dup"}, {"missing": "Y", "why": "2"}]}
+    m = _merge([a, b])
+    assert len(m["line_items"]) == 2
+    assert m["bid_total"] == 250  # grand total = max
+    assert len(m["scope_gaps"]) == 2  # 'X' and 'x' dedupe
+    assert m["vendor"] == "Acme"
+
+
+def test_validate_flags_bad_arithmetic():
+    data = {"line_items": [
+        {"qty": 10, "unit_price": 2.0, "amount": 20.0},  # ok
+        {"qty": 10, "unit_price": 2.0, "amount": 25.0},  # mismatch
+        {"qty": None, "unit_price": 2.0, "amount": 20.0},  # n/a
+    ]}
+    out = validate(data)
+    checks = [li["math_check"] for li in out["line_items"]]
+    assert checks == ["ok", "mismatch", "n/a"]
+    assert out["math_flagged"] == 1
+    assert out["line_items"][1]["math_expected"] == 20.0
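The three offline tests above pin down a contract for chunking, merging and arithmetic validation without showing the code in `bidreader/extract.py`. The following is a minimal sketch of functions that would satisfy those assertions. It is not the package's implementation: the names `chunk_pages`, `merge_results` and `validate_math`, the greedy packing strategy, the rule of keeping the first scalar value seen, and the `tol` rounding tolerance are all assumptions made for illustration.

```python
from typing import Any


def chunk_pages(blocks: list[str], budget: int) -> list[str]:
    """Greedily pack page blocks into chunks of at most `budget` characters.

    An oversized block still becomes its own chunk, so no page is dropped.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for block in blocks:
        added = len(block) + (2 if current else 0)  # 2 = len("\n\n") separator
        if current and size + added > budget:
            chunks.append("\n\n".join(current))  # flush the full chunk
            current, size = [], 0
            added = len(block)
        current.append(block)
        size += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def merge_results(parts: list[dict[str, Any]]) -> dict[str, Any]:
    """Merge per-chunk results into one dict.

    Line items are concatenated, scope gaps are deduplicated
    case-insensitively on "missing", bid_total is the maximum seen
    (assumed to be the grand total), other fields keep the first value.
    """
    merged: dict[str, Any] = {"line_items": [], "scope_gaps": []}
    seen_gaps: set[str] = set()
    for part in parts:
        merged["line_items"].extend(part.get("line_items", []))
        for gap in part.get("scope_gaps", []):
            gap_key = str(gap.get("missing", "")).strip().lower()
            if gap_key not in seen_gaps:
                seen_gaps.add(gap_key)
                merged["scope_gaps"].append(gap)
        total = part.get("bid_total")
        if total is not None:
            merged["bid_total"] = max(total, merged.get("bid_total", total))
        for field, value in part.items():
            if field not in ("line_items", "scope_gaps", "bid_total"):
                merged.setdefault(field, value)
    return merged


def validate_math(data: dict[str, Any], tol: float = 0.01) -> dict[str, Any]:
    """Annotate each line item in place with a qty * unit_price check."""
    flagged = 0
    for item in data.get("line_items", []):
        qty, price, amount = item.get("qty"), item.get("unit_price"), item.get("amount")
        if qty is None or price is None or amount is None:
            item["math_check"] = "n/a"  # not enough data to check
            continue
        expected = round(qty * price, 2)
        if abs(expected - amount) <= tol:
            item["math_check"] = "ok"
        else:
            item["math_check"] = "mismatch"
            item["math_expected"] = expected
            flagged += 1
    data["math_flagged"] = flagged
    return data
```

Taking the maximum `bid_total` across chunks mirrors the "grand total = max" comment in the test: a chunk that only sees a subtotal reports a smaller number than the chunk holding the final total.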
bidreader-0.2.0/PKG-INFO
DELETED
@@ -1,109 +0,0 @@
-Metadata-Version: 2.4
-Name: bidreader
-Version: 0.2.0
-Summary: Read messy construction sub-quotes, bid packages & spec PDFs into clean structured data — and catch the scope gaps/exclusions vendors bury. Every value cited to its page.
-Author-email: Anmol <anmol@attentive.ai>
-License: MIT
-Project-URL: Homepage, https://github.com/anmolsam/bidreader
-Project-URL: Issues, https://github.com/anmolsam/bidreader/issues
-Keywords: construction,estimating,takeoff,subcontractor,bid,quote,scope,exclusions,spec,AEC,preconstruction,BOQ,LLM,MCP
-Classifier: Development Status :: 4 - Beta
-Classifier: Intended Audience :: Developers
-Classifier: License :: OSI Approved :: MIT License
-Classifier: Programming Language :: Python :: 3
-Classifier: Topic :: Office/Business :: Financial
-Requires-Python: >=3.10
-Description-Content-Type: text/markdown
-License-File: LICENSE
-Requires-Dist: pymupdf>=1.24
-Requires-Dist: certifi>=2024.0
-Provides-Extra: tables
-Requires-Dist: pdfplumber>=0.11; extra == "tables"
-Provides-Extra: mcp
-Requires-Dist: mcp>=1.2; extra == "mcp"
-Dynamic: license-file
-
-# BidReader
-
-**Read messy construction sub-quotes, bid packages & spec PDFs into clean structured data — and catch the scope gaps and exclusions vendors bury in the fine print.** Every value is cited to its page and exact source text.
-
-MIT · `pip install bidreader` · works on the PDFs estimators actually get.
-
-> *"Manually typing numbers from a PDF into Excel because the formatting is a crime scene… hunting for the one line where a sub quietly excluded 'trash removal' in size-8 font."* — r/Construction (498 upvotes)
-
-BidReader is that junior who never sleeps: it doesn't write anything new — it finds what's already there and points to it.
-
----
-
-## What it does
-
-```bash
-pip install bidreader
-export REQUESTY_API_KEY=... # or OPENROUTER_API_KEY / GEMINI_API_KEY (free tier works)
-bidreader sub_quote.pdf
-```
-
-```python
-from bidreader import read
-doc = read("sub_quote.pdf")
-doc.line_items # [{section, description, qty, unit, amount, page}, ...]
-doc.exclusions # [{item, quote, page, risk}, ...] <- the stuff they buried
-doc.scope_gaps # trade-standard scope NOT mentioned (confirm before you bid)
-doc.to_json()
-```
-
-## Real output (drywall sub-quote, exclusion buried in size-7 font)
-
-```
-LINE ITEMS (5):
-09 22 16 Metal stud framing, 3-5/8" 25ga walls 12400 SF $35,340.00 p1
-09 29 00 5/8" Type X gypsum board, both faces 24800 SF $40,920.00 p1
-09 29 00 Tape & finish, Level 4 24800 SF $23,560.00 p1
-... BID TOTAL $121,628.00
-
-!! EXCLUSIONS CAUGHT (4):
-- Fire-stopping/firecaulking (page 1)
-"this proposal EXCLUDES fire-stopping/firecaulking at rated assemblies"
-risk: life-safety scope; another trade or a change order eats this cost.
-- Debris removal/haul-off (page 1)
-"removal/haul-off of construction debris (by others)"
-...
-
-SCOPE GAPS TO CONFIRM (5):
-- Acoustic ceiling tiles -- grid framing is included but the tiles within it are not.
-- Rough carpentry blocking/backing for fixtures -- not mentioned.
-```
-
-## Why
-
-The construction-AI gold rush is all building the same crowded, resisted thing — autonomous *takeoff*. The loudest, most-repeated, **unmet** estimator pain is upstream and downstream of it: turning crime-scene PDFs into clean data and **catching what subs quietly excluded**. No permissive library does this. BidReader is that primitive.
-
-- **MIT** — depend on it inside your commercial estimating/BIM product (no AGPL/NC contamination).
-- **Provider-agnostic** — Requesty, OpenRouter, or Google AI Studio (free tier).
-- **Cited** — every number traces to a page + the exact source text. Trust is the real adoption blocker; this is built for it.
-
-## Use it from an AI agent (MCP)
-
-Any MCP client — Claude Desktop, Cursor, etc. — can call BidReader:
-
-```bash
-pip install "bidreader[mcp]"
-```
-```json
-{
-  "mcpServers": {
-    "bidreader": {
-      "command": "bidreader-mcp",
-      "env": { "REQUESTY_API_KEY": "rqsty-..." }
-    }
-  }
-}
-```
-Tools exposed: `read_document(path)`, `catch_exclusions(path)`, `extract_line_items(path)`.
-Now your agent can answer *"which subs excluded fire-stopping?"* across a bid folder.
-
-## Roadmap
-- Scanned-PDF vision OCR path · revision/addendum **diff** ("what changed between Addendum 3 and 4") · `bidreader-mcp` server (any agent can call it) · Excel/CSV export · multi-quote leveling (compare subs side-by-side).
-
-## License
-MIT.
bidreader-0.2.0/README.md
DELETED
@@ -1,84 +0,0 @@
-# BidReader
-
-**Read messy construction sub-quotes, bid packages & spec PDFs into clean structured data — and catch the scope gaps and exclusions vendors bury in the fine print.** Every value is cited to its page and exact source text.
-
-MIT · `pip install bidreader` · works on the PDFs estimators actually get.
-
-> *"Manually typing numbers from a PDF into Excel because the formatting is a crime scene… hunting for the one line where a sub quietly excluded 'trash removal' in size-8 font."* — r/Construction (498 upvotes)
-
-BidReader is that junior who never sleeps: it doesn't write anything new — it finds what's already there and points to it.
-
----
-
-## What it does
-
-```bash
-pip install bidreader
-export REQUESTY_API_KEY=... # or OPENROUTER_API_KEY / GEMINI_API_KEY (free tier works)
-bidreader sub_quote.pdf
-```
-
-```python
-from bidreader import read
-doc = read("sub_quote.pdf")
-doc.line_items # [{section, description, qty, unit, amount, page}, ...]
-doc.exclusions # [{item, quote, page, risk}, ...] <- the stuff they buried
-doc.scope_gaps # trade-standard scope NOT mentioned (confirm before you bid)
-doc.to_json()
-```
-
-## Real output (drywall sub-quote, exclusion buried in size-7 font)
-
-```
-LINE ITEMS (5):
-09 22 16 Metal stud framing, 3-5/8" 25ga walls 12400 SF $35,340.00 p1
-09 29 00 5/8" Type X gypsum board, both faces 24800 SF $40,920.00 p1
-09 29 00 Tape & finish, Level 4 24800 SF $23,560.00 p1
-... BID TOTAL $121,628.00
-
-!! EXCLUSIONS CAUGHT (4):
-- Fire-stopping/firecaulking (page 1)
-"this proposal EXCLUDES fire-stopping/firecaulking at rated assemblies"
-risk: life-safety scope; another trade or a change order eats this cost.
-- Debris removal/haul-off (page 1)
-"removal/haul-off of construction debris (by others)"
-...
-
-SCOPE GAPS TO CONFIRM (5):
-- Acoustic ceiling tiles -- grid framing is included but the tiles within it are not.
-- Rough carpentry blocking/backing for fixtures -- not mentioned.
-```
-
-## Why
-
-The construction-AI gold rush is all building the same crowded, resisted thing — autonomous *takeoff*. The loudest, most-repeated, **unmet** estimator pain is upstream and downstream of it: turning crime-scene PDFs into clean data and **catching what subs quietly excluded**. No permissive library does this. BidReader is that primitive.
-
-- **MIT** — depend on it inside your commercial estimating/BIM product (no AGPL/NC contamination).
-- **Provider-agnostic** — Requesty, OpenRouter, or Google AI Studio (free tier).
-- **Cited** — every number traces to a page + the exact source text. Trust is the real adoption blocker; this is built for it.
-
-## Use it from an AI agent (MCP)
-
-Any MCP client — Claude Desktop, Cursor, etc. — can call BidReader:
-
-```bash
-pip install "bidreader[mcp]"
-```
-```json
-{
-  "mcpServers": {
-    "bidreader": {
-      "command": "bidreader-mcp",
-      "env": { "REQUESTY_API_KEY": "rqsty-..." }
-    }
-  }
-}
-```
-Tools exposed: `read_document(path)`, `catch_exclusions(path)`, `extract_line_items(path)`.
-Now your agent can answer *"which subs excluded fire-stopping?"* across a bid folder.
-
-## Roadmap
-- Scanned-PDF vision OCR path · revision/addendum **diff** ("what changed between Addendum 3 and 4") · `bidreader-mcp` server (any agent can call it) · Excel/CSV export · multi-quote leveling (compare subs side-by-side).
-
-## License
-MIT.
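The removed 0.2.0 README claims an agent can answer "which subs excluded fire-stopping?" across a bid folder. A rough sketch of that query as plain Python follows, assuming (as the README comment suggests) that `read(path)` returns an object whose `exclusions` attribute is a list of dicts with `item`, `quote` and `page` keys. The helper `subs_excluding` and the offline `fake_reader` stand-in are hypothetical and not part of bidreader; the canned quotes are copied from the README's sample output.

```python
from pathlib import Path
from types import SimpleNamespace
from typing import Any, Callable, Iterable


def subs_excluding(
    pdf_paths: Iterable[Path],
    keyword: str,
    reader: Callable[[str], Any],
) -> dict[str, list[dict[str, Any]]]:
    """Map each PDF file name to the exclusions that mention `keyword`."""
    needle = keyword.lower()
    hits: dict[str, list[dict[str, Any]]] = {}
    for path in sorted(pdf_paths):
        doc = reader(str(path))  # with the real library this would be bidreader.read
        matched = [
            exc for exc in doc.exclusions
            if needle in f"{exc.get('item', '')} {exc.get('quote', '')}".lower()
        ]
        if matched:
            hits[path.name] = matched
    return hits


def fake_reader(path: str) -> SimpleNamespace:
    """Stand-in for bidreader.read so the sketch runs offline (canned data)."""
    canned = {
        "drywall.pdf": [{
            "item": "Fire-stopping/firecaulking",
            "quote": "this proposal EXCLUDES fire-stopping/firecaulking at rated assemblies",
            "page": 1,
        }],
        "paint.pdf": [{
            "item": "Debris removal/haul-off",
            "quote": "removal/haul-off of construction debris (by others)",
            "page": 1,
        }],
    }
    return SimpleNamespace(exclusions=canned.get(Path(path).name, []))


hits = subs_excluding(
    [Path("bids/drywall.pdf"), Path("bids/paint.pdf")],
    "fire-stopping",
    fake_reader,
)
for name, exclusions in hits.items():
    for exc in exclusions:
        print(f"{name}: {exc['item']} (page {exc['page']})")
```

Passing the reader in as an argument keeps the filter testable without an API key. Swapping `fake_reader` for `bidreader.read` and the path list for `Path("bids").glob("*.pdf")` should give the live version, provided the assumptions above about the return shape hold.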
bidreader-0.2.0/bidreader.egg-info/PKG-INFO
DELETED
@@ -1,109 +0,0 @@
-Metadata-Version: 2.4
-Name: bidreader
-Version: 0.2.0
-Summary: Read messy construction sub-quotes, bid packages & spec PDFs into clean structured data — and catch the scope gaps/exclusions vendors bury. Every value cited to its page.
-Author-email: Anmol <anmol@attentive.ai>
-License: MIT
-Project-URL: Homepage, https://github.com/anmolsam/bidreader
-Project-URL: Issues, https://github.com/anmolsam/bidreader/issues
-Keywords: construction,estimating,takeoff,subcontractor,bid,quote,scope,exclusions,spec,AEC,preconstruction,BOQ,LLM,MCP
-Classifier: Development Status :: 4 - Beta
-Classifier: Intended Audience :: Developers
-Classifier: License :: OSI Approved :: MIT License
-Classifier: Programming Language :: Python :: 3
-Classifier: Topic :: Office/Business :: Financial
-Requires-Python: >=3.10
-Description-Content-Type: text/markdown
-License-File: LICENSE
-Requires-Dist: pymupdf>=1.24
-Requires-Dist: certifi>=2024.0
-Provides-Extra: tables
-Requires-Dist: pdfplumber>=0.11; extra == "tables"
-Provides-Extra: mcp
-Requires-Dist: mcp>=1.2; extra == "mcp"
-Dynamic: license-file
-
-# BidReader
-
-**Read messy construction sub-quotes, bid packages & spec PDFs into clean structured data — and catch the scope gaps and exclusions vendors bury in the fine print.** Every value is cited to its page and exact source text.
-
-MIT · `pip install bidreader` · works on the PDFs estimators actually get.
-
-> *"Manually typing numbers from a PDF into Excel because the formatting is a crime scene… hunting for the one line where a sub quietly excluded 'trash removal' in size-8 font."* — r/Construction (498 upvotes)
-
-BidReader is that junior who never sleeps: it doesn't write anything new — it finds what's already there and points to it.
-
----
-
-## What it does
-
-```bash
-pip install bidreader
-export REQUESTY_API_KEY=... # or OPENROUTER_API_KEY / GEMINI_API_KEY (free tier works)
-bidreader sub_quote.pdf
-```
-
-```python
-from bidreader import read
-doc = read("sub_quote.pdf")
-doc.line_items # [{section, description, qty, unit, amount, page}, ...]
-doc.exclusions # [{item, quote, page, risk}, ...] <- the stuff they buried
-doc.scope_gaps # trade-standard scope NOT mentioned (confirm before you bid)
-doc.to_json()
-```
-
-## Real output (drywall sub-quote, exclusion buried in size-7 font)
-
-```
-LINE ITEMS (5):
-09 22 16 Metal stud framing, 3-5/8" 25ga walls 12400 SF $35,340.00 p1
-09 29 00 5/8" Type X gypsum board, both faces 24800 SF $40,920.00 p1
-09 29 00 Tape & finish, Level 4 24800 SF $23,560.00 p1
-... BID TOTAL $121,628.00
-
-!! EXCLUSIONS CAUGHT (4):
-- Fire-stopping/firecaulking (page 1)
-"this proposal EXCLUDES fire-stopping/firecaulking at rated assemblies"
-risk: life-safety scope; another trade or a change order eats this cost.
-- Debris removal/haul-off (page 1)
-"removal/haul-off of construction debris (by others)"
-...
-
-SCOPE GAPS TO CONFIRM (5):
-- Acoustic ceiling tiles -- grid framing is included but the tiles within it are not.
-- Rough carpentry blocking/backing for fixtures -- not mentioned.
-```
-
-## Why
-
-The construction-AI gold rush is all building the same crowded, resisted thing — autonomous *takeoff*. The loudest, most-repeated, **unmet** estimator pain is upstream and downstream of it: turning crime-scene PDFs into clean data and **catching what subs quietly excluded**. No permissive library does this. BidReader is that primitive.
-
-- **MIT** — depend on it inside your commercial estimating/BIM product (no AGPL/NC contamination).
-- **Provider-agnostic** — Requesty, OpenRouter, or Google AI Studio (free tier).
-- **Cited** — every number traces to a page + the exact source text. Trust is the real adoption blocker; this is built for it.
-
-## Use it from an AI agent (MCP)
-
-Any MCP client — Claude Desktop, Cursor, etc. — can call BidReader:
-
-```bash
-pip install "bidreader[mcp]"
-```
-```json
-{
-  "mcpServers": {
-    "bidreader": {
-      "command": "bidreader-mcp",
-      "env": { "REQUESTY_API_KEY": "rqsty-..." }
-    }
-  }
-}
-```
-Tools exposed: `read_document(path)`, `catch_exclusions(path)`, `extract_line_items(path)`.
-Now your agent can answer *"which subs excluded fire-stopping?"* across a bid folder.
-
-## Roadmap
-- Scanned-PDF vision OCR path · revision/addendum **diff** ("what changed between Addendum 3 and 4") · `bidreader-mcp` server (any agent can call it) · Excel/CSV export · multi-quote leveling (compare subs side-by-side).
-
-## License
-MIT.
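{bidreader-0.2.0 → bidreader-0.5.0}/LICENSE
File without changes
{bidreader-0.2.0 → bidreader-0.5.0}/bidreader/mcp_server.py
File without changes
{bidreader-0.2.0 → bidreader-0.5.0}/bidreader.egg-info/dependency_links.txt
File without changes
{bidreader-0.2.0 → bidreader-0.5.0}/bidreader.egg-info/entry_points.txt
File without changes
{bidreader-0.2.0 → bidreader-0.5.0}/bidreader.egg-info/top_level.txt
File without changes
{bidreader-0.2.0 → bidreader-0.5.0}/setup.cfg
File without changes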