papergraph 1.0.0

Files changed (4)
  1. package/LICENSE +21 -0
  2. package/README.md +229 -0
  3. package/dist/index.js +2695 -0
  4. package/package.json +63 -0
package/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Dashanka De Silva
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,229 @@
1
+ # 📄 PaperGraph
2
+
3
+ **Build interactive research-paper connectivity graphs from any topic.**
4
+
5
+ PaperGraph is a command-line tool that discovers academic papers, traces their citation networks, computes text similarity, runs graph algorithms, and produces explorable visualizations, all from a single command.
6
+
7
+ ---
8
+
9
+ ## ✨ Motivation
10
+
11
+ Navigating academic literature is hard. A single topic can span thousands of papers across decades, and understanding *how* they connect (who cites whom, which share methods, which disagree) requires hours of manual work.
12
+
13
+ PaperGraph automates this:
14
+
15
+ 1. **You provide a topic** (e.g., *"transformer attention mechanisms"*)
16
+ 2. **It discovers papers** via OpenAlex or Semantic Scholar APIs
17
+ 3. **It traces citations** through configurable BFS depth
18
+ 4. **It computes relationships**: text similarity, co-citation, bibliographic coupling
19
+ 5. **It ranks and clusters** papers using PageRank and Louvain community detection
20
+ 6. **It produces outputs**: an interactive HTML viewer, JSON, GraphML, GEXF, CSV, or Mermaid diagrams
21
+
22
+ The result is a navigable knowledge graph that reveals the structure of a research field at a glance.
23
+
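The citation-tracing step (step 3) can be pictured as a depth-limited breadth-first search. The sketch below is illustrative only and is not taken from the package: `traverseCitations`, `CitationLookup`, and the toy data are hypothetical names, and a real run would fetch references from OpenAlex or Semantic Scholar instead of an in-memory table.

```typescript
// Hypothetical lookup: paper ID -> IDs of the papers it cites.
type CitationLookup = (paperId: string) => string[];

// Breadth-first traversal from the seed papers, stopping after maxDepth hops.
// Returns each discovered paper ID mapped to the depth at which it was first seen.
function traverseCitations(
  seeds: string[],
  getReferences: CitationLookup,
  maxDepth: number,
): Map<string, number> {
  const depthOf = new Map<string, number>();
  for (const seed of seeds) depthOf.set(seed, 0);

  let frontier = Array.from(depthOf.keys());
  for (let depth = 1; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const id of frontier) {
      for (const ref of getReferences(id)) {
        if (!depthOf.has(ref)) {
          depthOf.set(ref, depth); // first visit fixes the shortest depth
          next.push(ref);
        }
      }
    }
    frontier = next;
  }
  return depthOf;
}

// Toy citation data standing in for API responses.
const toyCitations: Record<string, string[]> = {
  A: ["B", "C"],
  B: ["D"],
  C: ["D"],
  D: ["E"],
};

const found = traverseCitations(["A"], (id) => toyCitations[id] ?? [], 2);
console.log(Array.from(found.entries()));
```

With a depth of 2, paper `E` (three hops from the seed) is never visited, which is how the depth setting bounds the size of the graph.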
24
+ ---
25
+
26
+ ## 🏗️ Architecture
27
+
28
+ ```
29
+ ┌──────────────────────────────────────────────┐
30
+ │               CLI (Commander)                │
31
+ │  build · export · view · inspect · cache     │
32
+ └───────────────┬──────────────────────────────┘
33
+                 │
34
+ ┌───────────────▼──────────────────────────────┐
35
+ │                Graph Builder                 │
36
+ │  Orchestrates the full pipeline:             │
37
+ │  seed → traverse → NLP → algorithms → store  │
38
+ └──┬──────────┬────────────┬─────────────┬─────┘
39
+    │          │            │             │
40
+    ▼          ▼            ▼             ▼
41
+ ┌──────┐  ┌────────┐  ┌──────────┐  ┌──────────┐
42
+ │Source│  │  NLP   │  │  Graph   │  │  SQLite  │
43
+ │Adapt.│  │Pipeline│  │  Algos   │  │ Storage  │
44
+ ├──────┤  ├────────┤  ├──────────┤  ├──────────┤
45
+ │OpenAl│  │TF-IDF  │  │PageRank  │  │10 tables │
46
+ │  ex  │  │Cosine  │  │Louvain   │  │WAL mode  │
47
+ │  S2  │  │Entity  │  │Co-cite   │  │Migrations│
48
+ │      │  │Extract │  │Coupling  │  │          │
49
+ └──┬───┘  └────────┘  │Scoring   │  └──────────┘
50
+    │                  └──────────┘
51
+    ▼
52
+ ┌──────────────────┐
53
+ │   HTTP Client    │
54
+ │  Rate limiting   │
55
+ │  Retry + backoff │
56
+ │  Token bucket    │
57
+ └──────────────────┘
58
+ ```
59
+
60
+ ### Data Flow
61
+
62
+ ```mermaid
63
+ graph LR
64
+ A["Topic / Papers / DOIs"] --> B["Seed Discovery"]
65
+ B --> C["BFS Citation Traversal"]
66
+ C --> D["TF-IDF Corpus"]
67
+ D --> E["Similarity Edges"]
68
+ C --> F["Co-Citation / Coupling"]
69
+ D --> G["PageRank + Louvain"]
70
+ E --> H["SQLite Database"]
71
+ F --> H
72
+ G --> H
73
+ H --> I["Exporters / Viewer"]
74
+ ```
75
+
76
+ ---
77
+
78
+ ## 📁 Project Structure
79
+
80
+ ```
81
+ Paper-Graph/
82
+ ├── src/
83
+ │   ├── cli/                      # CLI entry point (Commander)
84
+ │   │   └── index.ts              # 5 commands: build, export, view, inspect, cache
85
+ │   │
86
+ │   ├── builder/                  # Graph build orchestrator
87
+ │   │   └── graph-builder.ts      # Full pipeline: seed → traverse → NLP → rank → store
88
+ │   │
89
+ │   ├── sources/                  # API data source adapters
90
+ │   │   ├── openalex.ts           # OpenAlex API adapter
91
+ │   │   ├── semantic-scholar.ts   # Semantic Scholar API adapter
92
+ │   │   └── utils.ts              # Shared utilities (DOI stripping, title similarity)
93
+ │   │
94
+ │   ├── nlp/                      # Natural language processing
95
+ │   │   ├── tokenizer.ts          # Deterministic tokenization (no stemming)
96
+ │   │   ├── stopwords.ts          # 175+ English + academic stopwords
97
+ │   │   ├── tfidf.ts              # TF-IDF corpus building + topic relevance
98
+ │   │   ├── similarity.ts         # Cosine similarity + edge generation
99
+ │   │   └── entity-extraction.ts  # Dictionary-based entity extraction
100
+ │   │
101
+ │   ├── graph/                    # Graph algorithms
102
+ │   │   ├── algorithms.ts         # PageRank, Louvain, co-citation, coupling
103
+ │   │   └── scoring.ts            # Composite ranking (PageRank + relevance + recency)
104
+ │   │
105
+ │   ├── storage/                  # Persistence layer
106
+ │   │   └── database.ts           # SQLite via better-sqlite3 (10 tables, WAL mode)
107
+ │   │
108
+ │   ├── exporters/                # Output format exporters
109
+ │   │   └── export.ts             # JSON, GraphML, GEXF, CSV, Mermaid
110
+ │   │
111
+ │   ├── viewer/                   # Interactive visualization
112
+ │   │   └── html-viewer.ts        # Self-contained Cytoscape.js HTML viewer
113
+ │   │
114
+ │   ├── cache/                    # API response caching
115
+ │   │   └── response-cache.ts     # File-system cache with SHA-256 keys + TTL
116
+ │   │
117
+ │   ├── utils/                    # Shared infrastructure
118
+ │   │   ├── http-client.ts        # HTTP client with rate limiting + retries
119
+ │   │   ├── logger.ts             # Pino-based structured logging
120
+ │   │   └── config.ts             # Cosmiconfig configuration resolver
121
+ │   │
122
+ │   ├── types/                    # TypeScript type definitions
123
+ │   │   ├── index.ts              # Paper, Edge, Cluster, Entity, Config interfaces
124
+ │   │   └── config.ts             # Config types + defaults
125
+ │   │
126
+ │   └── __tests__/                # Test suites (86 tests)
127
+ │
128
+ ├── dist/                         # Built output (82 KB ESM bundle)
129
+ ├── package.json
130
+ ├── tsconfig.json
131
+ ├── tsup.config.ts
132
+ └── vitest.config.ts
133
+ ```
134
+
135
+ ---
136
+
137
+ ## 🔑 Features
138
+
139
+ ### Data Sources
140
+ | Source | API | Rate Limit | Key Required |
141
+ |--------|-----|-----------|-------------|
142
+ | **OpenAlex** | REST | 10 req/s (polite pool) | Optional (email for polite pool) |
143
+ | **Semantic Scholar** | REST | 1 req/s (100 with key) | Optional |
144
+
145
+ ### Graph Spine Strategies
146
+ | Spine | Description |
147
+ |-------|-------------|
148
+ | `citation` | Direct citation links (A cites B) |
149
+ | `similarity` | TF-IDF cosine similarity between abstracts |
150
+ | `co-citation` | Papers frequently cited together |
151
+ | `coupling` | Papers that cite the same references |
152
+ | `hybrid` | All of the above combined |
153
+
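The `co-citation` and `coupling` spines are mirror images of each other, which a small sketch can make concrete. This is an illustration under assumptions, not the package's code: the `Citation` type, `pairKey`, `groupBy`, and `countSharedPairs` are hypothetical helpers, and the real implementation may weight or threshold the counts differently.

```typescript
// A directed citation edge: `from` cites `to`.
type Citation = { from: string; to: string };

// Stable key for an unordered pair of paper IDs.
const pairKey = (a: string, b: string): string => (a < b ? `${a}|${b}` : `${b}|${a}`);

// Group edges by one endpoint and collect the opposite endpoints,
// e.g. key "from" yields: citing paper -> set of papers it cites.
function groupBy(edges: Citation[], key: "from" | "to"): Map<string, Set<string>> {
  const other = key === "from" ? "to" : "from";
  const groups = new Map<string, Set<string>>();
  for (const edge of edges) {
    const members = groups.get(edge[key]) ?? new Set<string>();
    members.add(edge[other]);
    groups.set(edge[key], members);
  }
  return groups;
}

// For every pair of members that appear in the same group, count the groups they share.
function countSharedPairs(groups: Map<string, Set<string>>): Map<string, number> {
  const counts = new Map<string, number>();
  groups.forEach((members) => {
    const ids = Array.from(members).sort();
    for (let i = 0; i < ids.length; i++) {
      for (let j = i + 1; j < ids.length; j++) {
        const key = pairKey(ids[i], ids[j]);
        counts.set(key, (counts.get(key) ?? 0) + 1);
      }
    }
  });
  return counts;
}

// Co-citation: two papers appear together in the same reference list.
const coCitationCounts = (edges: Citation[]) => countSharedPairs(groupBy(edges, "from"));
// Bibliographic coupling: two papers cite the same reference.
const couplingCounts = (edges: Citation[]) => countSharedPairs(groupBy(edges, "to"));

// Toy data: X cites A and B; Y cites A, B, and C.
const toyEdges: Citation[] = [
  { from: "X", to: "A" },
  { from: "X", to: "B" },
  { from: "Y", to: "A" },
  { from: "Y", to: "B" },
  { from: "Y", to: "C" },
];
console.log(coCitationCounts(toyEdges), couplingCounts(toyEdges));
```

In the toy data, `A` and `B` are co-cited twice (by both `X` and `Y`), while `X` and `Y` are coupled with strength 2 because they share two references.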
154
+ ### Graph Algorithms
155
+ - **PageRank**: Identifies the most influential papers
156
+ - **Louvain**: Community detection for topic clustering
157
+ - **Composite Scoring**: Weighted combination of PageRank, relevance, and recency
158
+
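As a rough picture of composite scoring, the sketch below normalizes each signal to the range 0 to 1 and takes a weighted sum. Everything specific here is an assumption: the README does not state the weights, the normalization, or how recency is measured, so `EXAMPLE_WEIGHTS`, `minMax`, and using the publication year as the recency signal are placeholders.

```typescript
interface ScoredPaper {
  id: string;
  pageRank: number;  // raw PageRank value
  relevance: number; // topic relevance, e.g. from TF-IDF
  year: number;      // publication year, used here as the recency signal
}

interface ScoreWeights {
  pageRank: number;
  relevance: number;
  recency: number;
}

// Hypothetical weights chosen for illustration only.
const EXAMPLE_WEIGHTS: ScoreWeights = { pageRank: 0.5, relevance: 0.3, recency: 0.2 };

// Rescale values to [0, 1]; a constant input maps to all zeros.
function minMax(values: number[]): number[] {
  const lo = Math.min(...values);
  const hi = Math.max(...values);
  return values.map((v) => (hi === lo ? 0 : (v - lo) / (hi - lo)));
}

// Weighted sum of the three normalized signals, keyed by paper ID.
function compositeScores(
  papers: ScoredPaper[],
  weights: ScoreWeights = EXAMPLE_WEIGHTS,
): Map<string, number> {
  const pr = minMax(papers.map((p) => p.pageRank));
  const rel = minMax(papers.map((p) => p.relevance));
  const rec = minMax(papers.map((p) => p.year));
  const scores = new Map<string, number>();
  papers.forEach((p, i) => {
    scores.set(
      p.id,
      weights.pageRank * pr[i] + weights.relevance * rel[i] + weights.recency * rec[i],
    );
  });
  return scores;
}

const toyPapers: ScoredPaper[] = [
  { id: "a", pageRank: 0.6, relevance: 0.2, year: 2017 },
  { id: "b", pageRank: 0.1, relevance: 0.9, year: 2023 },
  { id: "c", pageRank: 0.3, relevance: 0.5, year: 2020 },
];
console.log(compositeScores(toyPapers));
```

Normalizing first keeps any one signal from dominating simply because its raw values are larger; a different scheme, such as rank-based scaling, would work equally well for this purpose.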
159
+ ### Export Formats
160
+ | Format | Extension | Use Case |
161
+ |--------|-----------|----------|
162
+ | JSON | `.json` | Programmatic access, custom visualization |
163
+ | GraphML | `.graphml` | yEd, Gephi, NetworkX |
164
+ | GEXF | `.gexf` | Gephi (with attributes) |
165
+ | CSV | `.csv` | Spreadsheets, pandas |
166
+ | Mermaid | `.md` | GitHub/GitLab rendered diagrams |
167
+
168
+ ### Interactive Viewer
169
+ - **Cytoscape.js**: force-directed layout
170
+ - **Dark glassmorphism** UI with blur effects
171
+ - **Cluster coloring**: papers colored by community
172
+ - **Node sizing**: scaled by influence score
173
+ - **Edge coloring**: by relationship type
174
+ - **Search**: real-time filter by title, venue, DOI
175
+ - **Neighbor highlighting**: click a paper to highlight connections
176
+ - **Detail panel**: paper metadata with DOI/URL links
177
+
178
+ ### NLP Pipeline
179
+ - Deterministic TF-IDF (no stemming, for reproducible results)
180
+ - 175+ stopwords including academic terms
181
+ - Cosine similarity with configurable threshold
182
+ - Dictionary-based entity extraction (120+ known entities)
183
+
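The NLP steps listed above (tokenize without stemming, drop stopwords, weight by TF-IDF, compare by cosine, keep pairs above a threshold) can be sketched in a few functions. This is a minimal illustration, not the package's implementation: the stopword list is a tiny stand-in for the 175+ terms mentioned, the smoothed IDF formula is one common variant chosen as an assumption, and all function names are hypothetical.

```typescript
type SparseVector = Map<string, number>;

// Tiny illustrative stopword list; the real list is described as 175+ terms.
const STOPWORDS = new Set(["a", "an", "and", "for", "in", "of", "the", "to", "we", "with"]);

// Lowercase, split on non-alphanumerics, drop stopwords and 1-char tokens. No stemming.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 1 && !STOPWORDS.has(t));
}

// Build one TF-IDF vector per document using a smoothed IDF (an assumed variant).
function buildTfIdf(docs: string[]): SparseVector[] {
  const tokenized = docs.map(tokenize);
  const docFreq = new Map<string, number>();
  for (const tokens of tokenized) {
    new Set(tokens).forEach((term) => docFreq.set(term, (docFreq.get(term) ?? 0) + 1));
  }
  return tokenized.map((tokens) => {
    const vec: SparseVector = new Map();
    for (const term of tokens) vec.set(term, (vec.get(term) ?? 0) + 1);
    vec.forEach((count, term) => {
      const idf = Math.log((1 + docs.length) / (1 + (docFreq.get(term) ?? 0))) + 1;
      vec.set(term, (count / tokens.length) * idf);
    });
    return vec;
  });
}

// Cosine similarity between two sparse vectors; 0 when either is empty.
function cosine(a: SparseVector, b: SparseVector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((w, term) => {
    dot += w * (b.get(term) ?? 0);
    normA += w * w;
  });
  b.forEach((w) => {
    normB += w * w;
  });
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Emit [i, j, score] for every document pair at or above the threshold.
function similarityEdges(vectors: SparseVector[], threshold: number): Array<[number, number, number]> {
  const edges: Array<[number, number, number]> = [];
  for (let i = 0; i < vectors.length; i++) {
    for (let j = i + 1; j < vectors.length; j++) {
      const score = cosine(vectors[i], vectors[j]);
      if (score >= threshold) edges.push([i, j, score]);
    }
  }
  return edges;
}

const abstracts = [
  "Attention mechanisms in transformer models",
  "Transformer attention for language models",
  "Louvain community detection on citation graphs",
];
const tfidfVectors = buildTfIdf(abstracts);
console.log(similarityEdges(tfidfVectors, 0.2));
```

Skipping stemming means the same input always yields the same tokens regardless of stemmer version, at the cost of treating "model" and "models" as different terms.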
184
+ ### Infrastructure
185
+ - **Rate limiting**: per-source token bucket that keeps requests within each API's limits
186
+ - **Retry logic**: exponential backoff with jitter for 429/5xx errors
187
+ - **Response cache**: SHA-256 keyed file-system cache (24h TTL default)
188
+ - **SQLite with WAL**: fast concurrent reads, 10-table schema
189
+
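To show how a token bucket and jittered backoff fit together, here is a minimal sketch. It is not the package's HTTP client: the `TokenBucket` class, `backoffDelayMs`, `shouldRetry`, and the default base and cap values are all assumptions made for illustration, and the clock and random source are injectable only so the behavior can be checked deterministically.

```typescript
// Token bucket: holds up to `capacity` tokens and refills continuously over time.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private readonly now: () => number;

  constructor(capacity: number, refillPerSecond: number, now: () => number = Date.now) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity;
    this.lastRefill = now();
  }

  // Take one token if available; false means the caller should wait and try again.
  tryAcquire(): boolean {
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = current;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// "Full jitter" backoff: a random delay in [0, min(capMs, baseMs * 2^attempt)).
// The base and cap defaults are illustrative, not values from the package.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 500,
  capMs: number = 30000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Retry only on rate limiting (429) and server errors (5xx).
const shouldRetry = (status: number): boolean =>
  status === 429 || (status >= 500 && status <= 599);

// Example: a bucket sized for 10 requests per second.
const exampleBucket = new TokenBucket(10, 10);
console.log(exampleBucket.tryAcquire(), backoffDelayMs(2), shouldRetry(503));
```

The bucket caps the steady request rate while still allowing short bursts up to its capacity, and the jitter spreads retries out so that several clients hitting the same 429 do not all retry at the same instant.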
190
+ ---
191
+
192
+ ## 🔧 Tech Stack
193
+
194
+ | Layer | Technology |
195
+ |-------|-----------|
196
+ | Language | TypeScript (ESM, NodeNext) |
197
+ | Runtime | Node.js 20+ |
198
+ | CLI | Commander.js |
199
+ | HTTP | undici (Node.js built-in HTTP/1.1 & HTTP/2) |
200
+ | Database | better-sqlite3 (WAL mode) |
201
+ | Graph | graphology + graphology-communities |
202
+ | Logging | pino (JSON + pretty-print) |
203
+ | Config | cosmiconfig |
204
+ | Bundler | tsup |
205
+ | Testing | vitest (86 tests, 6 suites) |
206
+
207
+ ---
208
+
209
+ ## 🚀 Quick Start
210
+
211
+ ```bash
212
+ # Install dependencies
213
+ npm install
214
+
215
+ # Build
216
+ npm run build
217
+
218
+ # Run
219
+ npx papergraph build -t "transformer attention" -o graph.db
220
+ npx papergraph view -i graph.db
221
+ ```
222
+
223
+ See [USAGE.md](./USAGE.md) for detailed usage instructions.
224
+
225
+ ---
226
+
227
+ ## 📄 License
228
+
229
+ MIT