reportea-0.1.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
reportea-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Xinlong

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
reportea-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,209 @@
Metadata-Version: 2.4
Name: reportea
Version: 0.1.0
Summary: An automated, LLM-powered research pipeline that discovers papers, summarises them, and delivers a daily tech digest to your phone.
Author: Xinlong
License: MIT License

Copyright (c) 2026 Xinlong

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests
Requires-Dist: pdfplumber
Dynamic: license-file

# Reportea

An automated, LLM-powered research pipeline that runs overnight — discovering papers, downloading PDFs, generating structured summaries, and producing a daily tech digest in newspaper style. Powered by the Claude CLI.

---

## How It Works

```
python scr/timer.py

      │  [waits until 01:00 AM]

01:00 ├── Build keywords library
      │     base_papers/*.pdf → KeywordExtractor (Claude) → all_kws

      ├── Init: process base papers
      │     DOIExtractor → base DOIs
      │     → get_pdf_links → download_pdf → summarize_and_save → summaries/*.md
      │     → clear pdf_cache/

      ├── Init: first discovery round
      │     CitationExtractor (base_papers/) → citation DOIs (validated + cached)
      │     → fetch keywords per DOI (Claude web) → keywords_list.csv
      │     → compare_csv_with_library → top-10 related DOIs
      │     → delete keywords_list.csv

      └── Loop [repeats until 04:00]
            process_dois → summaries/*.md
            CitationExtractor (pdf_cache/) → citation DOIs
            → keywords_list.csv → compare → next top-10 DOIs
            → clear pdf_cache/ + delete keywords_list.csv
            repeat ──────────────────────────────────────────┘

04:00 └── summarizer.py → summaries/{YYYY-MM-DD}report.md
```

---

## Modules

| File | Role |
|---|---|
| `scr/timer.py` | Orchestrator — waits for 01:00, runs init + loop, calls summarizer at 04:00 |
| `scr/extractor.py` | Three extraction classes: `DOIExtractor`, `KeywordExtractor`, `CitationExtractor` |
| `scr/key_words_lib.py` | Builds keyword library; generates `keywords_list.csv`; compares against library to rank DOIs |
| `scr/calling_llm_reader.py` | DOI → PDF search → download → text extraction → Claude summary → `.md`; also accepts a local PDF path directly via `process_local_pdf()` |
| `scr/summarizer.py` | Reads all summary `.md` files, generates a newspaper-style daily tech digest; pushes it to your iPhone via Bark |
| `scr/email_sender.py` | Bark push notification client — sends the daily report to your iPhone |

---

## Extractor Classes (`scr/extractor.py`)

| Class | What it does |
|---|---|
| `DOIExtractor` | Scans `base_papers/*.pdf`, extracts each paper's own DOI from the first 3000 chars |
| `KeywordExtractor` | Calls Claude (no web) on the full paper text; returns normalized keyword set |
| `CitationExtractor` | Extracts cited DOIs from the references section, validates each via `doi.org`, caches to `doi_cache/` |

---

## Usage

### Overnight mode (scheduled)

```bash
python scr/timer.py
```

Waits until 01:00 AM, then runs until 04:00 AM and generates the daily digest.

### Immediate mode

```bash
python scr/timer.py --now      # start now, run for 1 hour
python scr/timer.py --now 2    # start now, run for 2 hours
```

Skips the 01:00 wait and runs for the specified number of hours, then generates the digest.

### Local mode

```bash
python scr/timer.py --local
```

Skips all DOI lookup and citation discovery. Reads every PDF already in `base_papers/` directly, generates a structured summary for each one, then produces the daily digest. Useful for quickly summarising a personal collection without running the full overnight pipeline.

### Run individual modules manually

```bash
# Build keyword library + generate CSV + rank related papers
python scr/key_words_lib.py

# Summarize a single paper by DOI (edit DOI = "..." at top of file)
python scr/calling_llm_reader.py

# Generate today's tech digest from existing summaries
python scr/summarizer.py
```

---

## Setup

1. **Add your seed papers** — place PDF files into `base_papers/`. The pipeline extracts their DOIs and keywords to build the library, and uses their citation lists as the starting point for discovery.

2. **Install dependencies:**
   ```bash
   pip install requests pdfplumber
   ```

3. **iPhone notifications (optional)** — install [Bark](https://apps.apple.com/app/bark-customed-notifications/id1403753865) from the App Store. Open it to get your device key, then set `BARK_KEY` in `scr/email_sender.py`. The daily report will be pushed automatically after each run and saved in Bark's history.

4. **Claude binary** — requires the Claude Code VS Code extension. If you see `FileNotFoundError`:
   ```bash
   ls ~/.vscode/extensions/ | grep anthropic
   # Update CLAUDE_BIN in each scr/*.py with the current version number
   ```

---

## Outputs

| Path | Content |
|---|---|
| `summaries/{safe_doi}_summary.md` | Structured research summary per paper (Title, Authors, Abstract, Keywords, Methodology, Findings, etc.) |
| `summaries/{YYYY-MM-DD}report.md` | Daily tech digest in newspaper format (≤1500 words) |
| `doi_cache/*_cited_dois.json` | Validated citation DOIs per PDF — persists across runs to avoid re-validating |
| `keywords_list.csv` | Transient — written and deleted each cycle; `keywords (pipe-sep), doi` |
| `claude_responses.log` | Append-only log of every Claude interaction, separated by session |

### Daily Report Format

```
📰 Tech & Research Daily
### March 29, 2026

🔬 Today's Research Highlights
📌 Key Stories
💡 What This Means
🔭 On the Horizon
```

---

## Known Issues

| # | Description | Impact |
|---|---|---|
| 1 | Papers without author-defined keywords cannot be processed — `KeywordExtractor` returns an empty set, causing the paper to be skipped in library building and CSV comparison | Base papers with no keyword section produce an empty library; citation papers with no keywords are excluded from ranking |
| 2 | Papers without a DOI cannot be processed — `DOIExtractor` relies on finding a DOI string in the PDF text; if none is present the paper yields no DOI and is skipped entirely | Base papers without a DOI contribute no entry point into the discovery pipeline |
| 3 | DOIs in reference lists that span multiple lines are truncated at the line break during PDF text extraction — the partial DOI fails `doi.org` validation and is discarded | Citation discovery rate can be very low for PDFs with wrapped reference formatting, reducing the number of related papers found per loop iteration |

---

## File Structure

```
Reportea/
├── base_papers/           # Seed PDFs (input — add your papers here)
├── pdf_cache/             # Downloaded PDFs (auto-cleared each cycle)
├── doi_cache/             # Validated citation DOIs per PDF (persistent cache)
├── summaries/             # Generated .md summaries + daily report
├── scr/
│   ├── timer.py               # Orchestrator
│   ├── extractor.py           # DOIExtractor, KeywordExtractor, CitationExtractor
│   ├── key_words_lib.py       # Keyword library + CSV generation + comparison
│   ├── calling_llm_reader.py  # DOI → PDF → summary; or local PDF → summary
│   ├── summarizer.py          # Daily tech digest generator
│   └── email_sender.py        # Bark push notification client
├── keywords_list.csv      # Transient (created/deleted each cycle)
└── claude_responses.log   # Full interaction log
```
reportea-0.1.0/README.md ADDED
@@ -0,0 +1,175 @@
# Reportea

An automated, LLM-powered research pipeline that runs overnight — discovering papers, downloading PDFs, generating structured summaries, and producing a daily tech digest in newspaper style. Powered by the Claude CLI.

---

## How It Works

```
python scr/timer.py

      │  [waits until 01:00 AM]

01:00 ├── Build keywords library
      │     base_papers/*.pdf → KeywordExtractor (Claude) → all_kws

      ├── Init: process base papers
      │     DOIExtractor → base DOIs
      │     → get_pdf_links → download_pdf → summarize_and_save → summaries/*.md
      │     → clear pdf_cache/

      ├── Init: first discovery round
      │     CitationExtractor (base_papers/) → citation DOIs (validated + cached)
      │     → fetch keywords per DOI (Claude web) → keywords_list.csv
      │     → compare_csv_with_library → top-10 related DOIs
      │     → delete keywords_list.csv

      └── Loop [repeats until 04:00]
            process_dois → summaries/*.md
            CitationExtractor (pdf_cache/) → citation DOIs
            → keywords_list.csv → compare → next top-10 DOIs
            → clear pdf_cache/ + delete keywords_list.csv
            repeat ──────────────────────────────────────────┘

04:00 └── summarizer.py → summaries/{YYYY-MM-DD}report.md
```

---

## Modules

| File | Role |
|---|---|
| `scr/timer.py` | Orchestrator — waits for 01:00, runs init + loop, calls summarizer at 04:00 |
| `scr/extractor.py` | Three extraction classes: `DOIExtractor`, `KeywordExtractor`, `CitationExtractor` |
| `scr/key_words_lib.py` | Builds keyword library; generates `keywords_list.csv`; compares against library to rank DOIs |
| `scr/calling_llm_reader.py` | DOI → PDF search → download → text extraction → Claude summary → `.md`; also accepts a local PDF path directly via `process_local_pdf()` |
| `scr/summarizer.py` | Reads all summary `.md` files, generates a newspaper-style daily tech digest; pushes it to your iPhone via Bark |
| `scr/email_sender.py` | Bark push notification client — sends the daily report to your iPhone |

---

## Extractor Classes (`scr/extractor.py`)

| Class | What it does |
|---|---|
| `DOIExtractor` | Scans `base_papers/*.pdf`, extracts each paper's own DOI from the first 3000 chars |
| `KeywordExtractor` | Calls Claude (no web) on the full paper text; returns normalized keyword set |
| `CitationExtractor` | Extracts cited DOIs from the references section, validates each via `doi.org`, caches to `doi_cache/` |
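For illustration, the kind of DOI pattern matching these classes depend on can be sketched as follows (a hedged sketch; the actual regex in `scr/extractor.py` may differ):

```python
import re

# Crossref-style DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')

def extract_dois(text: str) -> list[str]:
    """Return DOI-like strings found in `text`, trimming trailing punctuation."""
    return [m.group(0).rstrip('.,;)') for m in DOI_RE.finditer(text)]
```

`DOIExtractor` would apply this to the first 3000 characters of a PDF's text; `CitationExtractor` to the references section, followed by a `doi.org` lookup to validate each hit.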

---

## Usage

### Overnight mode (scheduled)

```bash
python scr/timer.py
```

Waits until 01:00 AM, then runs until 04:00 AM and generates the daily digest.

### Immediate mode

```bash
python scr/timer.py --now      # start now, run for 1 hour
python scr/timer.py --now 2    # start now, run for 2 hours
```

Skips the 01:00 wait and runs for the specified number of hours, then generates the digest.

### Local mode

```bash
python scr/timer.py --local
```

Skips all DOI lookup and citation discovery. Reads every PDF already in `base_papers/` directly, generates a structured summary for each one, then produces the daily digest. Useful for quickly summarising a personal collection without running the full overnight pipeline.
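The three modes above could be dispatched from a single argument parser; the following is a guess at the interface described above, not the actual `scr/timer.py` code:

```python
import argparse

def parse_args(argv=None):
    """--now [HOURS] runs immediately; --local summarises base_papers/ only."""
    p = argparse.ArgumentParser(prog="timer.py")
    p.add_argument("--now", nargs="?", const=1.0, type=float, default=None,
                   metavar="HOURS", help="start now and run for HOURS (default 1)")
    p.add_argument("--local", action="store_true",
                   help="summarise PDFs already in base_papers/, skip discovery")
    return p.parse_args(argv)
```

`nargs="?"` with `const=1.0` gives exactly the documented behaviour: `--now` alone means one hour, `--now 2` means two, and omitting the flag falls back to the 01:00 schedule.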

### Run individual modules manually

```bash
# Build keyword library + generate CSV + rank related papers
python scr/key_words_lib.py

# Summarize a single paper by DOI (edit DOI = "..." at top of file)
python scr/calling_llm_reader.py

# Generate today's tech digest from existing summaries
python scr/summarizer.py
```

---

## Setup

1. **Add your seed papers** — place PDF files into `base_papers/`. The pipeline extracts their DOIs and keywords to build the library, and uses their citation lists as the starting point for discovery.

2. **Install dependencies:**
   ```bash
   pip install requests pdfplumber
   ```

3. **iPhone notifications (optional)** — install [Bark](https://apps.apple.com/app/bark-customed-notifications/id1403753865) from the App Store. Open it to get your device key, then set `BARK_KEY` in `scr/email_sender.py`. The daily report will be pushed automatically after each run and saved in Bark's history.

4. **Claude binary** — requires the Claude Code VS Code extension. If you see `FileNotFoundError`:
   ```bash
   ls ~/.vscode/extensions/ | grep anthropic
   # Update CLAUDE_BIN in each scr/*.py with the current version number
   ```
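For reference, a Bark push (step 3) is a plain HTTP GET of the form `https://api.day.app/{key}/{title}/{body}`. A minimal standard-library sketch of what `scr/email_sender.py` presumably does (function names here are illustrative):

```python
from urllib.parse import quote
from urllib.request import urlopen

def build_bark_url(key: str, title: str, body: str) -> str:
    """URL-encode the title and body into a Bark push URL."""
    return f"https://api.day.app/{key}/{quote(title)}/{quote(body)}"

# urlopen(build_bark_url("YOUR_DEVICE_KEY", "Daily Report", "digest ready"))  # fires the push
```

The package depends on `requests`, so the actual module likely issues the GET with `requests` instead of `urlopen`; either way the URL shape is the same.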

---

## Outputs

| Path | Content |
|---|---|
| `summaries/{safe_doi}_summary.md` | Structured research summary per paper (Title, Authors, Abstract, Keywords, Methodology, Findings, etc.) |
| `summaries/{YYYY-MM-DD}report.md` | Daily tech digest in newspaper format (≤1500 words) |
| `doi_cache/*_cited_dois.json` | Validated citation DOIs per PDF — persists across runs to avoid re-validating |
| `keywords_list.csv` | Transient — written and deleted each cycle; `keywords (pipe-sep), doi` |
| `claude_responses.log` | Append-only log of every Claude interaction, separated by session |
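The ranking step over `keywords_list.csv` (`compare_csv_with_library`) can be sketched as a keyword-overlap score over the `keywords (pipe-sep), doi` rows. This is an illustration only; the actual scoring in `scr/key_words_lib.py` may differ:

```python
import csv

def rank_dois(csv_path: str, library: set[str], top_n: int = 10) -> list[str]:
    """Score each row's pipe-separated keywords by overlap with the library;
    return the top-N DOIs that share at least one keyword."""
    scored = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for kw_field, doi in csv.reader(f):
            kws = {k.strip().lower() for k in kw_field.split("|") if k.strip()}
            scored.append((len(kws & library), doi))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doi for score, doi in scored[:top_n] if score > 0]
```

Filtering out zero-overlap rows matches the documented behaviour that papers with no keywords are excluded from ranking (Known Issue 1).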

### Daily Report Format

```
📰 Tech & Research Daily
### March 29, 2026

🔬 Today's Research Highlights
📌 Key Stories
💡 What This Means
🔭 On the Horizon
```

---

## Known Issues

| # | Description | Impact |
|---|---|---|
| 1 | Papers without author-defined keywords cannot be processed — `KeywordExtractor` returns an empty set, causing the paper to be skipped in library building and CSV comparison | Base papers with no keyword section produce an empty library; citation papers with no keywords are excluded from ranking |
| 2 | Papers without a DOI cannot be processed — `DOIExtractor` relies on finding a DOI string in the PDF text; if none is present the paper yields no DOI and is skipped entirely | Base papers without a DOI contribute no entry point into the discovery pipeline |
| 3 | DOIs in reference lists that span multiple lines are truncated at the line break during PDF text extraction — the partial DOI fails `doi.org` validation and is discarded | Citation discovery rate can be very low for PDFs with wrapped reference formatting, reducing the number of related papers found per loop iteration |
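Issue 3 could be mitigated by re-joining wrapped lines before DOI matching. A heuristic sketch, not part of the package; it only catches DOIs broken at a trailing `-` or `/`:

```python
import re

def unwrap_dois(text: str) -> str:
    """Glue a line onto the previous one when the previous line ends
    mid-DOI at a '/' or '-', so wrapped DOIs survive regex matching."""
    out: list[str] = []
    for line in text.splitlines():
        prev = out[-1] if out else ""
        if re.search(r'10\.\d{4,9}/\S*[/-]$', prev):
            out[-1] = prev + line.strip()
        else:
            out.append(line)
    return "\n".join(out)
```

Running extraction on `unwrap_dois(page_text)` instead of the raw text would let more wrapped references pass `doi.org` validation.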

---

## File Structure

```
Reportea/
├── base_papers/           # Seed PDFs (input — add your papers here)
├── pdf_cache/             # Downloaded PDFs (auto-cleared each cycle)
├── doi_cache/             # Validated citation DOIs per PDF (persistent cache)
├── summaries/             # Generated .md summaries + daily report
├── scr/
│   ├── timer.py               # Orchestrator
│   ├── extractor.py           # DOIExtractor, KeywordExtractor, CitationExtractor
│   ├── key_words_lib.py       # Keyword library + CSV generation + comparison
│   ├── calling_llm_reader.py  # DOI → PDF → summary; or local PDF → summary
│   ├── summarizer.py          # Daily tech digest generator
│   └── email_sender.py        # Bark push notification client
├── keywords_list.csv      # Transient (created/deleted each cycle)
└── claude_responses.log   # Full interaction log
```
reportea-0.1.0/pyproject.toml ADDED
@@ -0,0 +1,28 @@
[build-system]
requires = ["setuptools>=61", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "reportea"
version = "0.1.0"
description = "An automated, LLM-powered research pipeline that discovers papers, summarises them, and delivers a daily tech digest to your phone."
readme = "README.md"
license = { file = "LICENSE" }
authors = [
    { name = "Xinlong" }
]
requires-python = ">=3.11"
dependencies = [
    "requests",
    "pdfplumber",
]

[project.scripts]
reportea = "reportea.timer:main"

[tool.setuptools.package-dir]
"reportea" = "scr"

[tool.setuptools.packages.find]
where = ["."]
include = ["scr*"]
reportea-0.1.0/reportea.egg-info/PKG-INFO ADDED
@@ -0,0 +1,209 @@
Metadata-Version: 2.4
Name: reportea
Version: 0.1.0
Summary: An automated, LLM-powered research pipeline that discovers papers, summarises them, and delivers a daily tech digest to your phone.
Author: Xinlong
License: MIT License

Copyright (c) 2026 Xinlong

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests
Requires-Dist: pdfplumber
Dynamic: license-file

# Reportea

An automated, LLM-powered research pipeline that runs overnight — discovering papers, downloading PDFs, generating structured summaries, and producing a daily tech digest in newspaper style. Powered by the Claude CLI.

---

## How It Works

```
python scr/timer.py

      │  [waits until 01:00 AM]

01:00 ├── Build keywords library
      │     base_papers/*.pdf → KeywordExtractor (Claude) → all_kws

      ├── Init: process base papers
      │     DOIExtractor → base DOIs
      │     → get_pdf_links → download_pdf → summarize_and_save → summaries/*.md
      │     → clear pdf_cache/

      ├── Init: first discovery round
      │     CitationExtractor (base_papers/) → citation DOIs (validated + cached)
      │     → fetch keywords per DOI (Claude web) → keywords_list.csv
      │     → compare_csv_with_library → top-10 related DOIs
      │     → delete keywords_list.csv

      └── Loop [repeats until 04:00]
            process_dois → summaries/*.md
            CitationExtractor (pdf_cache/) → citation DOIs
            → keywords_list.csv → compare → next top-10 DOIs
            → clear pdf_cache/ + delete keywords_list.csv
            repeat ──────────────────────────────────────────┘

04:00 └── summarizer.py → summaries/{YYYY-MM-DD}report.md
```

---

## Modules

| File | Role |
|---|---|
| `scr/timer.py` | Orchestrator — waits for 01:00, runs init + loop, calls summarizer at 04:00 |
| `scr/extractor.py` | Three extraction classes: `DOIExtractor`, `KeywordExtractor`, `CitationExtractor` |
| `scr/key_words_lib.py` | Builds keyword library; generates `keywords_list.csv`; compares against library to rank DOIs |
| `scr/calling_llm_reader.py` | DOI → PDF search → download → text extraction → Claude summary → `.md`; also accepts a local PDF path directly via `process_local_pdf()` |
| `scr/summarizer.py` | Reads all summary `.md` files, generates a newspaper-style daily tech digest; pushes it to your iPhone via Bark |
| `scr/email_sender.py` | Bark push notification client — sends the daily report to your iPhone |

---

## Extractor Classes (`scr/extractor.py`)

| Class | What it does |
|---|---|
| `DOIExtractor` | Scans `base_papers/*.pdf`, extracts each paper's own DOI from the first 3000 chars |
| `KeywordExtractor` | Calls Claude (no web) on the full paper text; returns normalized keyword set |
| `CitationExtractor` | Extracts cited DOIs from the references section, validates each via `doi.org`, caches to `doi_cache/` |

---

## Usage

### Overnight mode (scheduled)

```bash
python scr/timer.py
```

Waits until 01:00 AM, then runs until 04:00 AM and generates the daily digest.

### Immediate mode

```bash
python scr/timer.py --now      # start now, run for 1 hour
python scr/timer.py --now 2    # start now, run for 2 hours
```

Skips the 01:00 wait and runs for the specified number of hours, then generates the digest.

### Local mode

```bash
python scr/timer.py --local
```

Skips all DOI lookup and citation discovery. Reads every PDF already in `base_papers/` directly, generates a structured summary for each one, then produces the daily digest. Useful for quickly summarising a personal collection without running the full overnight pipeline.

### Run individual modules manually

```bash
# Build keyword library + generate CSV + rank related papers
python scr/key_words_lib.py

# Summarize a single paper by DOI (edit DOI = "..." at top of file)
python scr/calling_llm_reader.py

# Generate today's tech digest from existing summaries
python scr/summarizer.py
```

---

## Setup

1. **Add your seed papers** — place PDF files into `base_papers/`. The pipeline extracts their DOIs and keywords to build the library, and uses their citation lists as the starting point for discovery.

2. **Install dependencies:**
   ```bash
   pip install requests pdfplumber
   ```

3. **iPhone notifications (optional)** — install [Bark](https://apps.apple.com/app/bark-customed-notifications/id1403753865) from the App Store. Open it to get your device key, then set `BARK_KEY` in `scr/email_sender.py`. The daily report will be pushed automatically after each run and saved in Bark's history.

4. **Claude binary** — requires the Claude Code VS Code extension. If you see `FileNotFoundError`:
   ```bash
   ls ~/.vscode/extensions/ | grep anthropic
   # Update CLAUDE_BIN in each scr/*.py with the current version number
   ```

---

## Outputs

| Path | Content |
|---|---|
| `summaries/{safe_doi}_summary.md` | Structured research summary per paper (Title, Authors, Abstract, Keywords, Methodology, Findings, etc.) |
| `summaries/{YYYY-MM-DD}report.md` | Daily tech digest in newspaper format (≤1500 words) |
| `doi_cache/*_cited_dois.json` | Validated citation DOIs per PDF — persists across runs to avoid re-validating |
| `keywords_list.csv` | Transient — written and deleted each cycle; `keywords (pipe-sep), doi` |
| `claude_responses.log` | Append-only log of every Claude interaction, separated by session |

### Daily Report Format

```
📰 Tech & Research Daily
### March 29, 2026

🔬 Today's Research Highlights
📌 Key Stories
💡 What This Means
🔭 On the Horizon
```

---

## Known Issues

| # | Description | Impact |
|---|---|---|
| 1 | Papers without author-defined keywords cannot be processed — `KeywordExtractor` returns an empty set, causing the paper to be skipped in library building and CSV comparison | Base papers with no keyword section produce an empty library; citation papers with no keywords are excluded from ranking |
| 2 | Papers without a DOI cannot be processed — `DOIExtractor` relies on finding a DOI string in the PDF text; if none is present the paper yields no DOI and is skipped entirely | Base papers without a DOI contribute no entry point into the discovery pipeline |
| 3 | DOIs in reference lists that span multiple lines are truncated at the line break during PDF text extraction — the partial DOI fails `doi.org` validation and is discarded | Citation discovery rate can be very low for PDFs with wrapped reference formatting, reducing the number of related papers found per loop iteration |

---

## File Structure

```
Reportea/
├── base_papers/           # Seed PDFs (input — add your papers here)
├── pdf_cache/             # Downloaded PDFs (auto-cleared each cycle)
├── doi_cache/             # Validated citation DOIs per PDF (persistent cache)
├── summaries/             # Generated .md summaries + daily report
├── scr/
│   ├── timer.py               # Orchestrator
│   ├── extractor.py           # DOIExtractor, KeywordExtractor, CitationExtractor
│   ├── key_words_lib.py       # Keyword library + CSV generation + comparison
│   ├── calling_llm_reader.py  # DOI → PDF → summary; or local PDF → summary
│   ├── summarizer.py          # Daily tech digest generator
│   └── email_sender.py        # Bark push notification client
├── keywords_list.csv      # Transient (created/deleted each cycle)
└── claude_responses.log   # Full interaction log
```
reportea-0.1.0/reportea.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,16 @@
LICENSE
README.md
pyproject.toml
reportea.egg-info/PKG-INFO
reportea.egg-info/SOURCES.txt
reportea.egg-info/dependency_links.txt
reportea.egg-info/entry_points.txt
reportea.egg-info/requires.txt
reportea.egg-info/top_level.txt
scr/__init__.py
scr/calling_llm_reader.py
scr/email_sender.py
scr/extractor.py
scr/key_words_lib.py
scr/summarizer.py
scr/timer.py
reportea-0.1.0/reportea.egg-info/entry_points.txt ADDED
@@ -0,0 +1,2 @@
[console_scripts]
reportea = reportea.timer:main
reportea-0.1.0/reportea.egg-info/requires.txt ADDED
@@ -0,0 +1,2 @@
requests
pdfplumber
reportea-0.1.0/reportea.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
scr
reportea-0.1.0/reportea.egg-info/dependency_links.txt ADDED (empty file)