swarm-notes 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,228 @@
1
+ Metadata-Version: 2.3
2
+ Name: swarm-notes
3
+ Version: 0.1.0
4
+ Summary: Automated research paper tracking and knowledge synthesis
5
+ Author: LM
6
+ Author-email: LM <hi@leima.is>
7
+ Requires-Dist: beautifulsoup4>=4.14.3
8
+ Requires-Dist: pydantic>=2.12.5
9
+ Requires-Dist: pydantic-ai>=1.71.0
10
+ Requires-Dist: python-dotenv>=1.2.2
11
+ Requires-Dist: python-frontmatter>=1.1.0
12
+ Requires-Dist: pyyaml>=6.0.3
13
+ Requires-Dist: requests>=2.32.5
14
+ Requires-Dist: typer>=0.24.1
15
+ Requires-Python: >=3.11
16
+ Description-Content-Type: text/markdown
17
+
18
+ # Swarm Notes Core Package
19
+
20
+ An autonomous, serverless, multi-agent system that tracks academic papers, extracts structured data, and weaves the results into a local, interconnected Markdown knowledge graph — a **Second Brain** for ML research.
21
+ Built to eventually communicate with other identical systems, forming a decentralised **Hive Mind**.
22
+
23
+ ---
24
+
25
+ ## Architecture
26
+
27
+ ```
28
+ ┌─────────────────────────────────────────────────────┐
29
+ │ GitHub Actions CI │
30
+ │ (weekly schedule + workflow_dispatch) │
31
+ └─────────────────────┬───────────────────────────────┘
32
+
33
+ ┌────────────▼────────────┐
34
+ │ Federation Agent │ ← consumes external public_feed.json feeds
35
+ └────────────┬────────────┘
36
+
37
+ ┌────────────▼────────────┐
38
+ │ Watcher │ ← queries ArXiv API by keyword
39
+ └────────────┬────────────┘
40
+ │ RawPaper[]
41
+ ┌────────────▼────────────┐
42
+ │ Router (Skill │ ← routes each paper to a domain skill
43
+ │ Registry) │ (NLP, Vision, TimeSeries, …)
44
+ └────────────┬────────────┘
45
+ │ Skill
46
+ ┌────────────▼────────────┐
47
+ │ Analyst │ ← pydantic-ai structured extraction
48
+ │ (pydantic-ai) │ with taxonomy injection
49
+ └────────────┬────────────┘
50
+ │ PaperAnalysis
51
+ ┌────────────▼────────────┐
52
+ │ Vault Writer │ ← writes .md to tmp_vault/
53
+ │ │ generates concept stubs
54
+ │ │ updates public_feed.json
55
+ └────────────┬────────────┘
56
+ │ atomic move
57
+ ┌────────────▼────────────┐
58
+ │ /vault │ ← permanent, file-based knowledge graph
59
+ │ papers/ concepts/ │
60
+ │ datasets/ │
61
+ └─────────────────────────┘
62
+ ```
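
The data flow above amounts to a short sequential loop. A minimal sketch, using hypothetical stand-in functions for each stage (the real implementations live in the `swarm_notes/` modules, and the stage signatures here are assumptions):

```python
# Sketch of the pipeline shown in the diagram above.
# Every stage function here is a hypothetical stand-in, passed in as a callable.

def run_pipeline(fetch, route, analyse, write, publish):
    """Fetch papers, route each to a skill, analyse, write notes, then publish."""
    papers = fetch()                       # Watcher: returns RawPaper[]
    notes = []
    for paper in papers:
        skill = route(paper)               # Router: pick a domain skill
        analysis = analyse(paper, skill)   # Analyst: structured extraction
        notes.append(write(analysis))      # Vault Writer: .md into tmp_vault/
    publish(notes)                         # atomic move tmp_vault/ -> vault/
    return notes
```

Keeping each stage as a plain callable makes the flow easy to test: any stage can be replaced with a stub without touching the others.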
63
+
64
+ ## Directory Structure
65
+
66
+ ```
67
+ research-cruise/
68
+ ├── .github/
69
+ │ └── workflows/
70
+ │ └── autonomous-tracker.yml # CI/CD pipeline
71
+ ├── vault/
72
+ │ ├── papers/ # One .md file per paper
73
+ │ ├── concepts/ # Auto-generated concept stubs
74
+ │ └── datasets/ # Dataset stubs
75
+ ├── swarm_notes/
76
+ │ ├── config.py # Configuration & env vars
77
+ │ ├── vault_manager.py # Staging pattern (tmp_vault → vault)
78
+ │ ├── watcher.py # Configurable paper-source watcher
79
+ │ ├── router.py # Skill registry router
80
+ │ ├── analyst.py # pydantic-ai extraction agent
81
+ │ ├── vault_writer.py # Markdown writer + public_feed.json
82
+ │ ├── federation.py # Hive Mind federation agent
83
+ │ └── main.py # Pipeline orchestrator
84
+ ├── taxonomy.json # Controlled vocabulary (tags, domains)
85
+ ├── public_feed.json # Rolling feed of last 20 papers (for federation)
86
+ └── requirements.txt
87
+ ```
88
+
89
+ ## Quick Start
90
+
91
+ ### Prerequisites
92
+
93
+ - Python 3.11+
94
+ - An OpenAI-compatible API key
95
+
96
+ ### Local Run
97
+
98
+ ```bash
99
+ # Install dependencies
100
+ pip install -r requirements.txt
101
+
102
+ # Set your API key
103
+ export LLM_API_KEY="sk-..."
104
+
105
+ # Optionally customise keywords
106
+ export PAPER_KEYWORDS="mamba,diffusion model,retrieval augmented generation"
107
+
108
+ # Optional: switch the watcher to Semantic Scholar
109
+ export PAPER_SOURCE="semantic_scholar"
110
+ export SEMANTIC_SCHOLAR_API_KEY="..."
111
+
112
+ # Run the pipeline
113
+ python -m swarm_notes.main
114
+ ```
115
+
116
+ ### Configuration (Environment Variables)
117
+
118
+ | Variable | Default | Description |
119
+ |---|---|---|
120
+ | `LLM_API_KEY` | *(required)* | API key for the LLM provider |
121
+ | `LLM_MODEL` | `openai:gpt-4o-mini` | pydantic-ai model string |
122
+ | `PAPER_SOURCE` | `arxiv` | Paper search backend: `arxiv` or `semantic_scholar` |
123
+ | `PAPER_KEYWORDS` | See `config.py` | Comma-separated search terms |
124
+ | `PAPER_MAX_RESULTS_PER_KEYWORD` | `5` | Papers fetched per keyword |
125
+ | `PAPER_TOTAL_CAP` | `20` | Hard cap on total papers per run |
126
+ | `SEMANTIC_SCHOLAR_API_KEY` | *(empty)* | Optional Semantic Scholar API key sent as `x-api-key` |
127
+ | `FEDERATION_FEEDS` | *(empty)* | Comma-separated external feed URLs |
128
+ | `PUBLIC_FEED_MAX_ITEMS` | `20` | Max entries kept in `public_feed.json` |
129
+
130
+ When `PAPER_SOURCE=semantic_scholar`, the watcher queries Semantic Scholar's Graph API and keeps only results that can be mapped back to an ArXiv identifier. That preserves compatibility with the rest of the pipeline, which still stores papers by `arxiv_id`.
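
The mapping step can be sketched as a filter over the API response. Semantic Scholar's Graph API reports arXiv identifiers under `externalIds.ArXiv` when one exists; the exact record shape the watcher consumes is an assumption here:

```python
def keep_arxiv_only(results):
    """Keep only Semantic Scholar results that map back to an arXiv identifier."""
    papers = []
    for item in results:
        # externalIds can be missing or null; ArXiv is absent for non-arXiv papers
        arxiv_id = (item.get("externalIds") or {}).get("ArXiv")
        if arxiv_id:  # drop anything the rest of the pipeline cannot store
            papers.append({"arxiv_id": arxiv_id, "title": item.get("title")})
    return papers
```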
131
+
132
+ Legacy `ARXIV_KEYWORDS`, `ARXIV_MAX_RESULTS_PER_KEYWORD`, and `ARXIV_TOTAL_CAP` are still accepted for backward compatibility, but `PAPER_*` names are now canonical.
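
The fallback rule reads as: canonical `PAPER_*` name first, then the legacy `ARXIV_*` alias, then the built-in default. A sketch of plausible `config.py` behaviour (not its actual code):

```python
import os

def setting(name, legacy_name, default):
    """Resolve a config value: canonical env var, then legacy alias, then default."""
    value = os.environ.get(name)
    if value is None:
        value = os.environ.get(legacy_name)  # legacy ARXIV_* names still accepted
    return value if value is not None else default

# e.g. keywords = setting("PAPER_KEYWORDS", "ARXIV_KEYWORDS", "transformer")
```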
133
+
134
+ ## CI/CD Setup
135
+
136
+ ### 1. Fork the repository
137
+
138
+ Click **Fork** on GitHub to create your own copy of this repository.
139
+
140
+ ### 2. Add the required secret
141
+
142
+ The pipeline needs an OpenAI-compatible API key to run the LLM analyst step.
143
+
144
+ 1. Open your forked repository on GitHub.
145
+ 2. Go to **Settings → Secrets and variables → Actions**.
146
+ 3. Click **New repository secret**.
147
+ 4. Set **Name** to `LLM_API_KEY` and **Secret** to your API key (e.g. `sk-...`).
148
+ 5. Click **Add secret**.
149
+
150
+ > **Note:** The workflow exposes `LLM_API_KEY` as both `LLM_API_KEY` and `OPENAI_API_KEY`
151
+ > so that pydantic-ai's OpenAI provider picks it up automatically.
152
+
153
+ ### 3. (Optional) Override the model
154
+
155
+ By default the pipeline uses `openai:gpt-4o-mini`. To use a different model, add a
156
+ second repository secret (or variable) named `LLM_MODEL` with the pydantic-ai model
157
+ string, e.g. `openai:gpt-4o` or `anthropic:claude-3-5-haiku`.
158
+
159
+ You can also set `LLM_MODEL` in the workflow's `env:` block directly if you prefer not
160
+ to use a secret.
161
+
162
+ ### 4. Run the pipeline
163
+
164
+ - **Scheduled:** the pipeline fires automatically every **Monday at 06:00 UTC**.
165
+ - **Manual:** go to **Actions → Autonomous Research Tracker → Run workflow**, and
166
+ optionally override `keywords`, `federation_feeds`, and `max_results` in the dispatch form.
167
+
168
+ ## The Hive Mind (Federation)
169
+
170
+ Every successful run updates `public_feed.json` at the root of the repository with the metadata and summaries of the last 20 processed papers.
171
+
172
+ To subscribe to another agent's feed, pass their raw `public_feed.json` URL:
173
+
174
+ ```bash
175
+ export FEDERATION_FEEDS="https://raw.githubusercontent.com/alice/research-cruise/main/public_feed.json,https://raw.githubusercontent.com/bob/research-cruise/main/public_feed.json"
176
+ python -m swarm_notes.main
177
+ ```
178
+
179
+ Or set `federation_feeds` in the **workflow_dispatch** inputs.
180
+
181
+ **Conflict resolution:** If an external feed contains a review of a paper that already exists locally, the local metadata is preserved. The external summary is appended under a `### External Perspectives` section:
182
+
183
+ ```markdown
184
+ ### External Perspectives
185
+
186
+ > "Transformers are over-engineered for this dataset." - @Agent_alice
187
+ > *(Retrieved 2024-01-15)*
188
+ ```
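
The merge rule — local metadata always wins, external summaries are only appended — can be sketched as follows. This is a hypothetical helper, not the actual `federation.py`:

```python
SECTION = "### External Perspectives"

def merge_external(note_body, summary, agent, retrieved):
    """Append an external summary to a note, creating the section once if needed."""
    if SECTION not in note_body:  # local content is never overwritten or reordered
        note_body += f"\n{SECTION}\n"
    return note_body + f'\n> "{summary}" - @{agent}\n> *(Retrieved {retrieved})*\n'
```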
189
+
190
+ ## Vault File Format
191
+
192
+ Each paper note uses hybrid YAML frontmatter (CSL-compatible fields + custom fields):
193
+
194
+ ```yaml
195
+ ---
196
+ # CSL-compatible fields
197
+ title: "Attention Is All You Need"
198
+ author:
199
+ - literal: "Ashish Vaswani"
200
+ issued:
201
+ date-parts:
202
+ - [2017, 6, 12]
203
+ url: "https://arxiv.org/abs/1706.03762"
204
+
205
+ # Custom fields
206
+ arxiv_id: "1706.03762"
207
+ domain: "nlp"
208
+ tags:
209
+ - "transformer"
210
+ - "attention-mechanism"
211
+ architectures:
212
+ - "encoder-decoder"
213
+ datasets:
214
+ - "WMT 2014"
215
+ skill: "NLPSkill"
216
+ processed_at: "2024-01-15T06:00:00Z"
217
+ ---
218
+ ```
219
+
220
+ Body sections: **Summary**, **Key Contributions**, **Key Concepts** (with relative links to `../concepts/`), **Datasets**, **Limitations**, **Links**.
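
The package depends on `python-frontmatter` for parsing these notes; as a stdlib-only illustration of the file layout, a note splits on its `---` fences into metadata and body:

```python
def split_note(text):
    """Split a vault note into (frontmatter_text, body) on its '---' fences."""
    if text.startswith("---\n"):
        frontmatter, _, body = text[4:].partition("\n---\n")
        return frontmatter, body.lstrip("\n")
    return "", text  # no frontmatter block: the whole file is body
```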
221
+
222
+ ## Taxonomy
223
+
224
+ `taxonomy.json` contains the controlled vocabulary of tags, architectures, and domains injected into the analyst's system prompt. Constraining the analyst to this vocabulary curbs hallucinated metadata and keeps notes consistent across runs. Edit `taxonomy.json` to add new terms.
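
One way to see what the vocabulary buys: extracted fields can be validated against the taxonomy and anything outside it dropped. A sketch with a stand-in taxonomy (in the real package the constraint is applied via the system prompt and pydantic-ai validation):

```python
# Stand-in for taxonomy.json; the real file defines the full vocabulary.
TAXONOMY = {
    "tags": ["transformer", "attention-mechanism"],
    "domains": ["nlp", "vision", "time-series"],
}

def constrain(analysis, taxonomy=TAXONOMY):
    """Keep only tags and domains that exist in the controlled vocabulary."""
    allowed = set(taxonomy["tags"])
    return {
        "tags": [t for t in analysis["tags"] if t in allowed],
        "domain": analysis["domain"] if analysis["domain"] in taxonomy["domains"] else None,
    }
```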
225
+
226
+ ## License
227
+
228
+ MIT — see [LICENSE](LICENSE).
@@ -0,0 +1,211 @@
1
+ # Swarm Notes Core Package
2
+
3
+ An autonomous, serverless, multi-agent system that tracks academic papers, extracts structured data, and weaves the results into a local, interconnected Markdown knowledge graph — a **Second Brain** for ML research.
4
+ Built to eventually communicate with other identical systems, forming a decentralised **Hive Mind**.
5
+
6
+ ---
7
+
8
+ ## Architecture
9
+
10
+ ```
11
+ ┌─────────────────────────────────────────────────────┐
12
+ │ GitHub Actions CI │
13
+ │ (weekly schedule + workflow_dispatch) │
14
+ └─────────────────────┬───────────────────────────────┘
15
+
16
+ ┌────────────▼────────────┐
17
+ │ Federation Agent │ ← consumes external public_feed.json feeds
18
+ └────────────┬────────────┘
19
+
20
+ ┌────────────▼────────────┐
21
+ │ Watcher │ ← queries ArXiv API by keyword
22
+ └────────────┬────────────┘
23
+ │ RawPaper[]
24
+ ┌────────────▼────────────┐
25
+ │ Router (Skill │ ← routes each paper to a domain skill
26
+ │ Registry) │ (NLP, Vision, TimeSeries, …)
27
+ └────────────┬────────────┘
28
+ │ Skill
29
+ ┌────────────▼────────────┐
30
+ │ Analyst │ ← pydantic-ai structured extraction
31
+ │ (pydantic-ai) │ with taxonomy injection
32
+ └────────────┬────────────┘
33
+ │ PaperAnalysis
34
+ ┌────────────▼────────────┐
35
+ │ Vault Writer │ ← writes .md to tmp_vault/
36
+ │ │ generates concept stubs
37
+ │ │ updates public_feed.json
38
+ └────────────┬────────────┘
39
+ │ atomic move
40
+ ┌────────────▼────────────┐
41
+ │ /vault │ ← permanent, file-based knowledge graph
42
+ │ papers/ concepts/ │
43
+ │ datasets/ │
44
+ └─────────────────────────┘
45
+ ```
46
+
47
+ ## Directory Structure
48
+
49
+ ```
50
+ research-cruise/
51
+ ├── .github/
52
+ │ └── workflows/
53
+ │ └── autonomous-tracker.yml # CI/CD pipeline
54
+ ├── vault/
55
+ │ ├── papers/ # One .md file per paper
56
+ │ ├── concepts/ # Auto-generated concept stubs
57
+ │ └── datasets/ # Dataset stubs
58
+ ├── swarm_notes/
59
+ │ ├── config.py # Configuration & env vars
60
+ │ ├── vault_manager.py # Staging pattern (tmp_vault → vault)
61
+ │ ├── watcher.py # Configurable paper-source watcher
62
+ │ ├── router.py # Skill registry router
63
+ │ ├── analyst.py # pydantic-ai extraction agent
64
+ │ ├── vault_writer.py # Markdown writer + public_feed.json
65
+ │ ├── federation.py # Hive Mind federation agent
66
+ │ └── main.py # Pipeline orchestrator
67
+ ├── taxonomy.json # Controlled vocabulary (tags, domains)
68
+ ├── public_feed.json # Rolling feed of last 20 papers (for federation)
69
+ └── requirements.txt
70
+ ```
71
+
72
+ ## Quick Start
73
+
74
+ ### Prerequisites
75
+
76
+ - Python 3.11+
77
+ - An OpenAI-compatible API key
78
+
79
+ ### Local Run
80
+
81
+ ```bash
82
+ # Install dependencies
83
+ pip install -r requirements.txt
84
+
85
+ # Set your API key
86
+ export LLM_API_KEY="sk-..."
87
+
88
+ # Optionally customise keywords
89
+ export PAPER_KEYWORDS="mamba,diffusion model,retrieval augmented generation"
90
+
91
+ # Optional: switch the watcher to Semantic Scholar
92
+ export PAPER_SOURCE="semantic_scholar"
93
+ export SEMANTIC_SCHOLAR_API_KEY="..."
94
+
95
+ # Run the pipeline
96
+ python -m swarm_notes.main
97
+ ```
98
+
99
+ ### Configuration (Environment Variables)
100
+
101
+ | Variable | Default | Description |
102
+ |---|---|---|
103
+ | `LLM_API_KEY` | *(required)* | API key for the LLM provider |
104
+ | `LLM_MODEL` | `openai:gpt-4o-mini` | pydantic-ai model string |
105
+ | `PAPER_SOURCE` | `arxiv` | Paper search backend: `arxiv` or `semantic_scholar` |
106
+ | `PAPER_KEYWORDS` | See `config.py` | Comma-separated search terms |
107
+ | `PAPER_MAX_RESULTS_PER_KEYWORD` | `5` | Papers fetched per keyword |
108
+ | `PAPER_TOTAL_CAP` | `20` | Hard cap on total papers per run |
109
+ | `SEMANTIC_SCHOLAR_API_KEY` | *(empty)* | Optional Semantic Scholar API key sent as `x-api-key` |
110
+ | `FEDERATION_FEEDS` | *(empty)* | Comma-separated external feed URLs |
111
+ | `PUBLIC_FEED_MAX_ITEMS` | `20` | Max entries kept in `public_feed.json` |
112
+
113
+ When `PAPER_SOURCE=semantic_scholar`, the watcher queries Semantic Scholar's Graph API and keeps only results that can be mapped back to an ArXiv identifier. That preserves compatibility with the rest of the pipeline, which still stores papers by `arxiv_id`.
114
+
115
+ Legacy `ARXIV_KEYWORDS`, `ARXIV_MAX_RESULTS_PER_KEYWORD`, and `ARXIV_TOTAL_CAP` are still accepted for backward compatibility, but `PAPER_*` names are now canonical.
116
+
117
+ ## CI/CD Setup
118
+
119
+ ### 1. Fork the repository
120
+
121
+ Click **Fork** on GitHub to create your own copy of this repository.
122
+
123
+ ### 2. Add the required secret
124
+
125
+ The pipeline needs an OpenAI-compatible API key to run the LLM analyst step.
126
+
127
+ 1. Open your forked repository on GitHub.
128
+ 2. Go to **Settings → Secrets and variables → Actions**.
129
+ 3. Click **New repository secret**.
130
+ 4. Set **Name** to `LLM_API_KEY` and **Secret** to your API key (e.g. `sk-...`).
131
+ 5. Click **Add secret**.
132
+
133
+ > **Note:** The workflow exposes `LLM_API_KEY` as both `LLM_API_KEY` and `OPENAI_API_KEY`
134
+ > so that pydantic-ai's OpenAI provider picks it up automatically.
135
+
136
+ ### 3. (Optional) Override the model
137
+
138
+ By default the pipeline uses `openai:gpt-4o-mini`. To use a different model, add a
139
+ second repository secret (or variable) named `LLM_MODEL` with the pydantic-ai model
140
+ string, e.g. `openai:gpt-4o` or `anthropic:claude-3-5-haiku`.
141
+
142
+ You can also set `LLM_MODEL` in the workflow's `env:` block directly if you prefer not
143
+ to use a secret.
144
+
145
+ ### 4. Run the pipeline
146
+
147
+ - **Scheduled:** the pipeline fires automatically every **Monday at 06:00 UTC**.
148
+ - **Manual:** go to **Actions → Autonomous Research Tracker → Run workflow**, and
149
+ optionally override `keywords`, `federation_feeds`, and `max_results` in the dispatch form.
150
+
151
+ ## The Hive Mind (Federation)
152
+
153
+ Every successful run updates `public_feed.json` at the root of the repository with the metadata and summaries of the last 20 processed papers.
154
+
155
+ To subscribe to another agent's feed, pass their raw `public_feed.json` URL:
156
+
157
+ ```bash
158
+ export FEDERATION_FEEDS="https://raw.githubusercontent.com/alice/research-cruise/main/public_feed.json,https://raw.githubusercontent.com/bob/research-cruise/main/public_feed.json"
159
+ python -m swarm_notes.main
160
+ ```
161
+
162
+ Or set `federation_feeds` in the **workflow_dispatch** inputs.
163
+
164
+ **Conflict resolution:** If an external feed contains a review of a paper that already exists locally, the local metadata is preserved. The external summary is appended under a `### External Perspectives` section:
165
+
166
+ ```markdown
167
+ ### External Perspectives
168
+
169
+ > "Transformers are over-engineered for this dataset." - @Agent_alice
170
+ > *(Retrieved 2024-01-15)*
171
+ ```
172
+
173
+ ## Vault File Format
174
+
175
+ Each paper note uses hybrid YAML frontmatter (CSL-compatible fields + custom fields):
176
+
177
+ ```yaml
178
+ ---
179
+ # CSL-compatible fields
180
+ title: "Attention Is All You Need"
181
+ author:
182
+ - literal: "Ashish Vaswani"
183
+ issued:
184
+ date-parts:
185
+ - [2017, 6, 12]
186
+ url: "https://arxiv.org/abs/1706.03762"
187
+
188
+ # Custom fields
189
+ arxiv_id: "1706.03762"
190
+ domain: "nlp"
191
+ tags:
192
+ - "transformer"
193
+ - "attention-mechanism"
194
+ architectures:
195
+ - "encoder-decoder"
196
+ datasets:
197
+ - "WMT 2014"
198
+ skill: "NLPSkill"
199
+ processed_at: "2024-01-15T06:00:00Z"
200
+ ---
201
+ ```
202
+
203
+ Body sections: **Summary**, **Key Contributions**, **Key Concepts** (with relative links to `../concepts/`), **Datasets**, **Limitations**, **Links**.
204
+
205
+ ## Taxonomy
206
+
207
+ `taxonomy.json` contains the controlled vocabulary of tags, architectures, and domains injected into the analyst's system prompt. Constraining the analyst to this vocabulary curbs hallucinated metadata and keeps notes consistent across runs. Edit `taxonomy.json` to add new terms.
208
+
209
+ ## License
210
+
211
+ MIT — see [LICENSE](LICENSE).
@@ -0,0 +1,37 @@
1
+ [project]
2
+ name = "swarm-notes"
3
+ version = "0.1.0"
4
+ description = "Automated research paper tracking and knowledge synthesis"
5
+ readme = "README.md"
6
+ authors = [
7
+ { name = "LM", email = "hi@leima.is" }
8
+ ]
9
+ requires-python = ">=3.11"
10
+ dependencies = [
11
+ "beautifulsoup4>=4.14.3",
12
+ "pydantic>=2.12.5",
13
+ "pydantic-ai>=1.71.0",
14
+ "python-dotenv>=1.2.2",
15
+ "python-frontmatter>=1.1.0",
16
+ "pyyaml>=6.0.3",
17
+ "requests>=2.32.5",
18
+ "typer>=0.24.1",
19
+ ]
20
+
21
+ [project.scripts]
22
+ swarm-notes = "swarm_notes.main:app"
23
+
24
+ [build-system]
25
+ requires = ["uv_build>=0.8.15,<0.9.0"]
26
+ build-backend = "uv_build"
27
+
28
+ [dependency-groups]
29
+ dev = [
30
+ "pytest>=9.0.2",
31
+ ]
32
+
33
+ [tool.pytest.ini_options]
34
+ pythonpath = ["src"]
35
+ markers = [
36
+ "integration: mark a test as an integration test that hits external APIs."
37
+ ]
@@ -0,0 +1 @@
1
+ """research-cruise agent package."""