reportea 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- reportea-0.1.0/LICENSE +21 -0
- reportea-0.1.0/PKG-INFO +209 -0
- reportea-0.1.0/README.md +175 -0
- reportea-0.1.0/pyproject.toml +28 -0
- reportea-0.1.0/reportea.egg-info/PKG-INFO +209 -0
- reportea-0.1.0/reportea.egg-info/SOURCES.txt +16 -0
- reportea-0.1.0/reportea.egg-info/dependency_links.txt +1 -0
- reportea-0.1.0/reportea.egg-info/entry_points.txt +2 -0
- reportea-0.1.0/reportea.egg-info/requires.txt +2 -0
- reportea-0.1.0/reportea.egg-info/top_level.txt +1 -0
- reportea-0.1.0/scr/__init__.py +0 -0
- reportea-0.1.0/scr/calling_llm_reader.py +191 -0
- reportea-0.1.0/scr/email_sender.py +63 -0
- reportea-0.1.0/scr/extractor.py +171 -0
- reportea-0.1.0/scr/key_words_lib.py +175 -0
- reportea-0.1.0/scr/summarizer.py +89 -0
- reportea-0.1.0/scr/timer.py +202 -0
- reportea-0.1.0/setup.cfg +4 -0
reportea-0.1.0/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Xinlong

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
reportea-0.1.0/PKG-INFO
ADDED
@@ -0,0 +1,209 @@
Metadata-Version: 2.4
Name: reportea
Version: 0.1.0
Summary: An automated, LLM-powered research pipeline that discovers papers, summarises them, and delivers a daily tech digest to your phone.
Author: Xinlong
License: MIT License

Copyright (c) 2026 Xinlong

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests
Requires-Dist: pdfplumber
Dynamic: license-file

# Reportea

An automated, LLM-powered research pipeline that runs overnight — discovering papers, downloading PDFs, generating structured summaries, and producing a daily tech digest in newspaper style. Powered by the Claude CLI.

---

## How It Works

```
python scr/timer.py
      │
      │  [waits until 01:00 AM]
      │
01:00 ├── Build keywords library
      │     base_papers/*.pdf → KeywordExtractor (Claude) → all_kws
      │
      ├── Init: process base papers
      │     DOIExtractor → base DOIs
      │     → get_pdf_links → download_pdf → summarize_and_save → summaries/*.md
      │     → clear pdf_cache/
      │
      ├── Init: first discovery round
      │     CitationExtractor (base_papers/) → citation DOIs (validated + cached)
      │     → fetch keywords per DOI (Claude web) → keywords_list.csv
      │     → compare_csv_with_library → top-10 related DOIs
      │     → delete keywords_list.csv
      │
      └── Loop [repeats until 04:00]
            process_dois → summaries/*.md
            CitationExtractor (pdf_cache/) → citation DOIs
            → keywords_list.csv → compare → next top-10 DOIs
            → clear pdf_cache/ + delete keywords_list.csv
            repeat ──────────────────────────────────────────┘
      │
04:00 └── summarizer.py → summaries/{YYYY-MM-DD}report.md
```

---
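The wait-until-01:00 step in the diagram above can be sketched with standard-library datetime arithmetic. This is an illustrative sketch only, not the actual `scr/timer.py` code; `seconds_until` is a hypothetical helper name:

```python
import time
from datetime import datetime, timedelta

def seconds_until(hour: int, minute: int = 0) -> float:
    """Seconds from now until the next occurrence of hour:minute."""
    now = datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)  # already past today's slot -> tomorrow
    return (target - now).total_seconds()

# time.sleep(seconds_until(1))  # block until 01:00, then start the pipeline
```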
## Modules

| File | Role |
|---|---|
| `scr/timer.py` | Orchestrator — waits for 01:00, runs init + loop, calls summarizer at 04:00 |
| `scr/extractor.py` | Three extraction classes: `DOIExtractor`, `KeywordExtractor`, `CitationExtractor` |
| `scr/key_words_lib.py` | Builds keyword library; generates `keywords_list.csv`; compares against library to rank DOIs |
| `scr/calling_llm_reader.py` | DOI → PDF search → download → text extraction → Claude summary → `.md`; also accepts a local PDF path directly via `process_local_pdf()` |
| `scr/summarizer.py` | Reads all summary `.md` files, generates a newspaper-style daily tech digest; pushes it to your iPhone via Bark |
| `scr/email_sender.py` | Bark push notification client — sends the daily report to your iPhone |

---

## Extractor Classes (`scr/extractor.py`)

| Class | What it does |
|---|---|
| `DOIExtractor` | Scans `base_papers/*.pdf`, extracts each paper's own DOI from the first 3000 chars |
| `KeywordExtractor` | Calls Claude (no web) on the full paper text; returns normalized keyword set |
| `CitationExtractor` | Extracts cited DOIs from the references section, validates each via `doi.org`, caches to `doi_cache/` |

---
## Usage

### Overnight mode (scheduled)

```bash
python scr/timer.py
```

Waits until 01:00 AM, then runs until 04:00 AM and generates the daily digest.

### Immediate mode

```bash
python scr/timer.py --now      # start now, run for 1 hour
python scr/timer.py --now 2    # start now, run for 2 hours
```

Skips the 01:00 wait and runs for the specified number of hours, then generates the digest.

### Local mode

```bash
python scr/timer.py --local
```

Skips all DOI lookup and citation discovery. Reads every PDF already in `base_papers/` directly, generates a structured summary for each one, then produces the daily digest. Useful for quickly summarising a personal collection without running the full overnight pipeline.

### Run individual modules manually

```bash
# Build keyword library + generate CSV + rank related papers
python scr/key_words_lib.py

# Summarize a single paper by DOI (edit DOI = "..." at top of file)
python scr/calling_llm_reader.py

# Generate today's tech digest from existing summaries
python scr/summarizer.py
```

---
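`compare_csv_with_library` is described only as ranking candidate DOIs against the keyword library to pick a top 10. One plausible scoring is a plain keyword-overlap count, sketched here with a hypothetical `rank_dois` helper; the package's real scoring may well differ:

```python
def rank_dois(candidates: dict[str, set[str]],
              library: set[str],
              top_n: int = 10) -> list[str]:
    """Rank candidate DOIs by how many of their keywords appear in the library.

    `candidates` maps DOI -> keyword set. The overlap count below is a
    stand-in for whatever compare_csv_with_library actually computes.
    """
    scored = {doi: len(kws & library) for doi, kws in candidates.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_n]
```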
## Setup

1. **Add your seed papers** — place PDF files into `base_papers/`. The pipeline extracts their DOIs and keywords to build the library, and uses their citation lists as the starting point for discovery.

2. **Install dependencies:**
   ```bash
   pip install requests pdfplumber
   ```

3. **iPhone notifications (optional)** — install [Bark](https://apps.apple.com/app/bark-customed-notifications/id1403753865) from the App Store. Open it to get your device key, then set `BARK_KEY` in `scr/email_sender.py`. The daily report will be pushed automatically after each run and saved in Bark's history.

4. **Claude binary** — requires the Claude Code VS Code extension. If you see `FileNotFoundError`:
   ```bash
   ls ~/.vscode/extensions/ | grep anthropic
   # Update CLAUDE_BIN in each scr/*.py with the current version number
   ```

---
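Bark delivers a push from a single HTTP GET to `api.day.app`, so the `BARK_KEY` setup above is the only credential needed. A stdlib sketch of that call follows; the package depends on `requests` so the real `scr/email_sender.py` likely differs, and `bark_url` and `push` are hypothetical names:

```python
from urllib.parse import quote
from urllib.request import urlopen

BARK_KEY = "your_device_key"  # placeholder; copy the real key from the Bark app

def bark_url(title: str, body: str, key: str = BARK_KEY) -> str:
    """Build a Bark push URL of the form https://api.day.app/<key>/<title>/<body>."""
    return f"https://api.day.app/{key}/{quote(title)}/{quote(body)}"

def push(title: str, body: str) -> int:
    """Send the notification; Bark answers with a small JSON status payload."""
    with urlopen(bark_url(title, body), timeout=10) as resp:
        return resp.status
```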
## Outputs

| Path | Content |
|---|---|
| `summaries/{safe_doi}_summary.md` | Structured research summary per paper (Title, Authors, Abstract, Keywords, Methodology, Findings, etc.) |
| `summaries/{YYYY-MM-DD}report.md` | Daily tech digest in newspaper format (≤1500 words) |
| `doi_cache/*_cited_dois.json` | Validated citation DOIs per PDF — persists across runs to avoid re-validating |
| `keywords_list.csv` | Transient — written and deleted each cycle; `keywords (pipe-sep), doi` |
| `claude_responses.log` | Append-only log of every Claude interaction, separated by session |

### Daily Report Format

```
📰 Tech & Research Daily
### March 29, 2026

🔬 Today's Research Highlights
📌 Key Stories
💡 What This Means
🔭 On the Horizon
```

---
## Known Issues

| # | Description | Impact |
|---|---|---|
| 1 | Papers without author-defined keywords cannot be processed — `KeywordExtractor` returns an empty set, causing the paper to be skipped in library building and CSV comparison | Base papers with no keyword section produce an empty library; citation papers with no keywords are excluded from ranking |
| 2 | Papers without a DOI cannot be processed — `DOIExtractor` relies on finding a DOI string in the PDF text; if none is present the paper yields no DOI and is skipped entirely | Base papers without a DOI contribute no entry point into the discovery pipeline |
| 3 | DOIs in reference lists that span multiple lines are truncated at the line break during PDF text extraction — the partial DOI fails `doi.org` validation and is discarded | Citation discovery rate can be very low for PDFs with wrapped reference formatting, reducing the number of related papers found per loop iteration |

---
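Issue 3 arises because a DOI wrapped across two extracted lines no longer matches as a single token. One possible mitigation, not implemented in the package, is to glue a line break that falls inside a DOI back together before matching; `unwrap_references` and `extract_dois` below are hypothetical helpers:

```python
import re

DOI_RE = re.compile(r'\b10\.\d{4,9}/\S+')

def unwrap_references(text: str) -> str:
    """Join a newline that interrupts a DOI so the full string matches again.

    Heuristic: if a line ends mid-DOI and the next line starts with a
    non-space character, remove the break between them.
    """
    return re.sub(r'(10\.\d{4,9}/[^\s]*)\n(\S)', r'\1\2', text)

def extract_dois(text: str) -> list[str]:
    """Find DOI-like strings after unwrapping, stripping trailing punctuation."""
    return [m.rstrip('.,;') for m in DOI_RE.findall(unwrap_references(text))]
```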
## File Structure

```
Reportea/
├── base_papers/              # Seed PDFs (input — add your papers here)
├── pdf_cache/                # Downloaded PDFs (auto-cleared each cycle)
├── doi_cache/                # Validated citation DOIs per PDF (persistent cache)
├── summaries/                # Generated .md summaries + daily report
├── scr/
│   ├── timer.py              # Orchestrator
│   ├── extractor.py          # DOIExtractor, KeywordExtractor, CitationExtractor
│   ├── key_words_lib.py      # Keyword library + CSV generation + comparison
│   ├── calling_llm_reader.py # DOI → PDF → summary; or local PDF → summary
│   ├── summarizer.py         # Daily tech digest generator
│   └── email_sender.py       # Bark push notification client
├── keywords_list.csv         # Transient (created/deleted each cycle)
└── claude_responses.log      # Full interaction log
```
reportea-0.1.0/README.md
ADDED
@@ -0,0 +1,175 @@
(content identical to the "# Reportea" README text embedded in reportea-0.1.0/PKG-INFO above)
reportea-0.1.0/pyproject.toml
ADDED
@@ -0,0 +1,28 @@
[build-system]
requires = ["setuptools>=61", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "reportea"
version = "0.1.0"
description = "An automated, LLM-powered research pipeline that discovers papers, summarises them, and delivers a daily tech digest to your phone."
readme = "README.md"
license = { file = "LICENSE" }
authors = [
  { name = "Xinlong" }
]
requires-python = ">=3.11"
dependencies = [
  "requests",
  "pdfplumber",
]

[project.scripts]
reportea = "reportea.timer:main"

[tool.setuptools.package-dir]
"reportea" = "scr"

[tool.setuptools.packages.find]
where = ["."]
include = ["scr*"]
reportea-0.1.0/reportea.egg-info/PKG-INFO
ADDED
@@ -0,0 +1,209 @@
(content identical to reportea-0.1.0/PKG-INFO above)
reportea-0.1.0/reportea.egg-info/SOURCES.txt
ADDED
@@ -0,0 +1,16 @@
LICENSE
README.md
pyproject.toml
reportea.egg-info/PKG-INFO
reportea.egg-info/SOURCES.txt
reportea.egg-info/dependency_links.txt
reportea.egg-info/entry_points.txt
reportea.egg-info/requires.txt
reportea.egg-info/top_level.txt
scr/__init__.py
scr/calling_llm_reader.py
scr/email_sender.py
scr/extractor.py
scr/key_words_lib.py
scr/summarizer.py
scr/timer.py
reportea-0.1.0/reportea.egg-info/dependency_links.txt
ADDED
@@ -0,0 +1 @@

reportea-0.1.0/reportea.egg-info/top_level.txt
ADDED
@@ -0,0 +1 @@
scr
File without changes