openalex-local 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- openalex_local-0.1.0/PKG-INFO +152 -0
- openalex_local-0.1.0/README.md +127 -0
- openalex_local-0.1.0/pyproject.toml +49 -0
- openalex_local-0.1.0/setup.cfg +4 -0
- openalex_local-0.1.0/src/openalex_local/__init__.py +14 -0
- openalex_local-0.1.0/src/openalex_local/config.py +73 -0
- openalex_local-0.1.0/src/openalex_local/models.py +187 -0
- openalex_local-0.1.0/src/openalex_local.egg-info/PKG-INFO +152 -0
- openalex_local-0.1.0/src/openalex_local.egg-info/SOURCES.txt +11 -0
- openalex_local-0.1.0/src/openalex_local.egg-info/dependency_links.txt +1 -0
- openalex_local-0.1.0/src/openalex_local.egg-info/entry_points.txt +2 -0
- openalex_local-0.1.0/src/openalex_local.egg-info/requires.txt +6 -0
- openalex_local-0.1.0/src/openalex_local.egg-info/top_level.txt +1 -0
@@ -0,0 +1,152 @@
+Metadata-Version: 2.4
+Name: openalex-local
+Version: 0.1.0
+Summary: Local OpenAlex database with 284M+ works, abstracts, and semantic search
+Author-email: Yusuke Watanabe <ywatanabe@alumni.u-tokyo.ac.jp>
+License: AGPL-3.0
+Project-URL: Homepage, https://github.com/ywatanabe1989/openalex-local
+Project-URL: Repository, https://github.com/ywatanabe1989/openalex-local
+Keywords: openalex,academic,research,abstracts,semantic-search
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: GNU Affero General Public License v3
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Scientific/Engineering
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+Requires-Dist: click>=8.0
+Requires-Dist: awscli>=1.0
+Provides-Extra: dev
+Requires-Dist: pytest>=7.0; extra == "dev"
+Requires-Dist: pytest-cov>=4.0; extra == "dev"
+
+# OpenAlex Local
+
+Local OpenAlex database with 284M+ scholarly works, abstracts, and semantic search.
+
+[](https://www.python.org/downloads/)
+[](LICENSE)
+
+<details>
+<summary><strong>Why OpenAlex Local?</strong></summary>
+
+**Built for the LLM era** - features that matter for AI research assistants:
+
+| Feature | Benefit |
+|---------|---------|
+| **284M Works** | More coverage than CrossRef |
+| **Abstracts** | ~45-60% availability for semantic search |
+| **Concepts & Topics** | Built-in classification |
+| **Author Disambiguation** | Linked to institutions |
+| **Open Access Info** | OA status and URLs |
+
+Perfect for: RAG systems, research assistants, literature review automation.
+
+</details>
+
+<details>
+<summary><strong>Installation</strong></summary>
+
+```bash
+pip install openalex-local
+```
+
+From source:
+```bash
+git clone https://github.com/ywatanabe1989/openalex-local
+cd openalex-local && make install
+```
+
+Database setup (~300 GB, ~1-2 days to build):
+```bash
+# Check system status
+make status
+
+# 1. Download OpenAlex Works snapshot (~300GB)
+make download-screen # runs in background
+
+# 2. Build SQLite database
+make build-db
+
+# 3. Build FTS5 index
+make build-fts
+```
+
+</details>
+
+<details>
+<summary><strong>Python API</strong></summary>
+
+```python
+from openalex_local import search, get, count
+
+# Full-text search (title + abstract)
+results = search("machine learning neural networks")
+for work in results:
+    print(f"{work.title} ({work.year})")
+    print(f" Abstract: {(work.abstract or '')[:200]}...")
+    print(f" Concepts: {[c['name'] for c in work.concepts]}")
+
+# Get by OpenAlex ID or DOI
+work = get("W2741809807")
+work = get("10.1038/nature12373")
+
+# Count matches
+n = count("CRISPR")
+```
+
+</details>
+
+<details>
+<summary><strong>CLI</strong></summary>
+
+```bash
+openalex-local search "CRISPR genome editing" -n 5
+openalex-local get W2741809807
+openalex-local get 10.1038/nature12373
+openalex-local count "machine learning"
+```
+
+</details>
+
+<details>
+<summary><strong>Related Projects</strong></summary>
+
+**[crossref-local](https://github.com/ywatanabe1989/crossref-local)** - Sister project with CrossRef data:
+
+| Feature | crossref-local | openalex-local |
+|---------|----------------|----------------|
+| Works | 167M | 284M |
+| Abstracts | ~21% | ~45-60% |
+| Update frequency | Real-time | Monthly |
+| DOI authority | ✓ (source) | Uses CrossRef |
+| Citations | Raw references | Linked works |
+| Concepts/Topics | ✗ | ✓ |
+| Author IDs | ✗ | ✓ |
+| Best for | DOI lookup, raw refs | Semantic search |
+
+**When to use CrossRef**: Real-time DOI updates, raw reference parsing, authoritative metadata.
+**When to use OpenAlex**: Semantic search, citation analysis, topic discovery.
+
+</details>
+
+<details>
+<summary><strong>Data Source</strong></summary>
+
+Data from [OpenAlex](https://openalex.org/), an open catalog of scholarly works.
+Updated monthly from their [snapshot](https://docs.openalex.org/download-all-data/openalex-snapshot).
+
+</details>
+
+---
+
+<p align="center">
+<a href="https://scitex.ai"><img src="docs/scitex-icon-navy-inverted.png" alt="SciTeX" width="40"/></a>
+<br>
+AGPL-3.0 · ywatanabe@scitex.ai
+</p>
+
+<!-- EOF -->
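PKG-INFO uses RFC 822-style headers followed by the long description as the body. As a minimal sketch (the sample string is trimmed from the hunk above, and nothing here is part of openalex-local itself), Python's standard `email` parser can read the declared dependencies and separate runtime requirements from those gated behind the `dev` extra:

```python
from email import message_from_string

# A trimmed copy of the PKG-INFO headers shown above; the text after the
# blank line stands in for the README body.
PKG_INFO = """\
Metadata-Version: 2.4
Name: openalex-local
Version: 0.1.0
Requires-Python: >=3.10
Requires-Dist: click>=8.0
Requires-Dist: awscli>=1.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"

# OpenAlex Local
"""

msg = message_from_string(PKG_INFO)

# Repeated headers such as Requires-Dist come back as a list of strings.
requires = msg.get_all("Requires-Dist")

# Runtime dependencies carry no environment marker; dev ones name the extra.
runtime = [r for r in requires if "extra ==" not in r]
dev_only = [r.split(";")[0].strip() for r in requires if 'extra == "dev"' in r]

print(msg["Name"], msg["Version"])
print("runtime:", runtime)
print("dev:", dev_only)
```

The string matching on `extra ==` is a shortcut for illustration; a real tool would parse the environment markers properly.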
@@ -0,0 +1,127 @@
+# OpenAlex Local
+
+Local OpenAlex database with 284M+ scholarly works, abstracts, and semantic search.
+
+[](https://www.python.org/downloads/)
+[](LICENSE)
+
+<details>
+<summary><strong>Why OpenAlex Local?</strong></summary>
+
+**Built for the LLM era** - features that matter for AI research assistants:
+
+| Feature | Benefit |
+|---------|---------|
+| **284M Works** | More coverage than CrossRef |
+| **Abstracts** | ~45-60% availability for semantic search |
+| **Concepts & Topics** | Built-in classification |
+| **Author Disambiguation** | Linked to institutions |
+| **Open Access Info** | OA status and URLs |
+
+Perfect for: RAG systems, research assistants, literature review automation.
+
+</details>
+
+<details>
+<summary><strong>Installation</strong></summary>
+
+```bash
+pip install openalex-local
+```
+
+From source:
+```bash
+git clone https://github.com/ywatanabe1989/openalex-local
+cd openalex-local && make install
+```
+
+Database setup (~300 GB, ~1-2 days to build):
+```bash
+# Check system status
+make status
+
+# 1. Download OpenAlex Works snapshot (~300GB)
+make download-screen # runs in background
+
+# 2. Build SQLite database
+make build-db
+
+# 3. Build FTS5 index
+make build-fts
+```
+
+</details>
+
+<details>
+<summary><strong>Python API</strong></summary>
+
+```python
+from openalex_local import search, get, count
+
+# Full-text search (title + abstract)
+results = search("machine learning neural networks")
+for work in results:
+    print(f"{work.title} ({work.year})")
+    print(f" Abstract: {(work.abstract or '')[:200]}...")
+    print(f" Concepts: {[c['name'] for c in work.concepts]}")
+
+# Get by OpenAlex ID or DOI
+work = get("W2741809807")
+work = get("10.1038/nature12373")
+
+# Count matches
+n = count("CRISPR")
+```
+
+</details>
+
+<details>
+<summary><strong>CLI</strong></summary>
+
+```bash
+openalex-local search "CRISPR genome editing" -n 5
+openalex-local get W2741809807
+openalex-local get 10.1038/nature12373
+openalex-local count "machine learning"
+```
+
+</details>
+
+<details>
+<summary><strong>Related Projects</strong></summary>
+
+**[crossref-local](https://github.com/ywatanabe1989/crossref-local)** - Sister project with CrossRef data:
+
+| Feature | crossref-local | openalex-local |
+|---------|----------------|----------------|
+| Works | 167M | 284M |
+| Abstracts | ~21% | ~45-60% |
+| Update frequency | Real-time | Monthly |
+| DOI authority | ✓ (source) | Uses CrossRef |
+| Citations | Raw references | Linked works |
+| Concepts/Topics | ✗ | ✓ |
+| Author IDs | ✗ | ✓ |
+| Best for | DOI lookup, raw refs | Semantic search |
+
+**When to use CrossRef**: Real-time DOI updates, raw reference parsing, authoritative metadata.
+**When to use OpenAlex**: Semantic search, citation analysis, topic discovery.
+
+</details>
+
+<details>
+<summary><strong>Data Source</strong></summary>
+
+Data from [OpenAlex](https://openalex.org/), an open catalog of scholarly works.
+Updated monthly from their [snapshot](https://docs.openalex.org/download-all-data/openalex-snapshot).
+
+</details>
+
+---
+
+<p align="center">
+<a href="https://scitex.ai"><img src="docs/scitex-icon-navy-inverted.png" alt="SciTeX" width="40"/></a>
+<br>
+AGPL-3.0 · ywatanabe@scitex.ai
+</p>
+
+<!-- EOF -->
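The README's setup steps end with an FTS5 index, and its API example describes full-text search over title and abstract. The sketch below shows how such a lookup could work with SQLite's FTS5 extension. The table name `works_fts`, its columns, and the `search`/`count` helpers are assumptions made for illustration; this release does not include the real schema or API module. It also assumes the local SQLite build has FTS5 enabled.

```python
import sqlite3

# In-memory database standing in for the ~300 GB openalex.db.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: the release does not show the real table layout.
conn.execute(
    "CREATE VIRTUAL TABLE works_fts USING fts5(openalex_id UNINDEXED, title, abstract)"
)
conn.executemany(
    "INSERT INTO works_fts (openalex_id, title, abstract) VALUES (?, ?, ?)",
    [
        ("W1", "CRISPR genome editing in plants", "We edit plant genomes with CRISPR."),
        ("W2", "Deep neural networks", "Machine learning with neural networks."),
        ("W3", "Soil chemistry survey", "Field measurements from forty sites."),
    ],
)


def search(query: str, limit: int = 10) -> list[tuple[str, str]]:
    """Return (openalex_id, title) pairs in FTS5's default relevance order."""
    rows = conn.execute(
        "SELECT openalex_id, title FROM works_fts "
        "WHERE works_fts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


def count(query: str) -> int:
    """Count rows matching the FTS5 query."""
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM works_fts WHERE works_fts MATCH ?", (query,)
    ).fetchone()
    return n


print(search("CRISPR"))
print(count("neural networks"))
```

FTS5 matches on tokens, so this is keyword search with relevance ranking; a query of several bare words requires all of them to appear in the row.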
@@ -0,0 +1,49 @@
+[build-system]
+requires = ["setuptools>=61.0", "wheel"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "openalex-local"
+version = "0.1.0"
+description = "Local OpenAlex database with 284M+ works, abstracts, and semantic search"
+readme = "README.md"
+license = {text = "AGPL-3.0"}
+authors = [
+    {name = "Yusuke Watanabe", email = "ywatanabe@alumni.u-tokyo.ac.jp"}
+]
+requires-python = ">=3.10"
+classifiers = [
+    "Development Status :: 3 - Alpha",
+    "Intended Audience :: Science/Research",
+    "License :: OSI Approved :: GNU Affero General Public License v3",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3.10",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Topic :: Scientific/Engineering",
+]
+keywords = ["openalex", "academic", "research", "abstracts", "semantic-search"]
+dependencies = [
+    "click>=8.0",
+    "awscli>=1.0",
+]
+
+[project.optional-dependencies]
+dev = [
+    "pytest>=7.0",
+    "pytest-cov>=4.0",
+]
+
+[project.scripts]
+openalex-local = "openalex_local.cli:main"
+
+[project.urls]
+Homepage = "https://github.com/ywatanabe1989/openalex-local"
+Repository = "https://github.com/ywatanabe1989/openalex-local"
+
+[tool.setuptools.packages.find]
+where = ["src"]
+
+[tool.pytest.ini_options]
+testpaths = ["tests"]
+python_files = ["test_*.py"]
@@ -0,0 +1,14 @@
+"""
+OpenAlex Local - Local OpenAlex database with 284M+ works and semantic search.
+
+Example:
+    >>> from openalex_local import search, get
+    >>> results = search("machine learning neural networks")
+    >>> work = get("W2741809807") # OpenAlex ID
+    >>> work = get("10.1038/nature12373") # or DOI
+"""
+
+__version__ = "0.1.0"
+
+# API will be exposed here after implementation
+# from .api import search, get, count, info
@@ -0,0 +1,73 @@
+"""Configuration for openalex_local."""
+
+import os
+from pathlib import Path
+from typing import Optional
+
+# Default database locations (checked in order)
+DEFAULT_DB_PATHS = [
+    Path("/home/ywatanabe/proj/openalex-local/data/openalex.db"),
+    Path("/home/ywatanabe/proj/openalex_local/data/openalex.db"),
+    Path("/mnt/nas_ug/openalex_local/data/openalex.db"),
+    Path.home() / ".openalex_local" / "openalex.db",
+    Path.cwd() / "data" / "openalex.db",
+]
+
+
+def get_db_path() -> Path:
+    """
+    Get database path from environment or auto-detect.
+
+    Priority:
+        1. OPENALEX_LOCAL_DB environment variable
+        2. First existing path from DEFAULT_DB_PATHS
+
+    Returns:
+        Path to the database file
+
+    Raises:
+        FileNotFoundError: If no database found
+    """
+    # Check environment variable first
+    env_path = os.environ.get("OPENALEX_LOCAL_DB")
+    if env_path:
+        path = Path(env_path)
+        if path.exists():
+            return path
+        raise FileNotFoundError(f"OPENALEX_LOCAL_DB path not found: {env_path}")
+
+    # Auto-detect from default locations
+    for path in DEFAULT_DB_PATHS:
+        if path.exists():
+            return path
+
+    raise FileNotFoundError(
+        "OpenAlex database not found. Set OPENALEX_LOCAL_DB environment variable "
+        f"or place database at one of: {[str(p) for p in DEFAULT_DB_PATHS]}"
+    )
+
+
+class Config:
+    """Configuration container."""
+
+    _db_path: Optional[Path] = None
+
+    @classmethod
+    def get_db_path(cls) -> Path:
+        """Get or auto-detect database path."""
+        if cls._db_path is None:
+            cls._db_path = get_db_path()
+        return cls._db_path
+
+    @classmethod
+    def set_db_path(cls, path: str | Path) -> None:
+        """Set database path explicitly."""
+        path = Path(path)
+        if not path.exists():
+            raise FileNotFoundError(f"Database not found: {path}")
+        cls._db_path = path
+
+    @classmethod
+    def reset(cls) -> None:
+        """Reset configuration (for testing)."""
+        cls._db_path = None
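The lookup order in config.py can be exercised without a real database. The sketch below is a standalone re-implementation for illustration only: `resolve_db_path` and its `candidates` parameter are hypothetical names, not part of the package. It mirrors the behavior in the hunk above: the `OPENALEX_LOCAL_DB` variable wins when set, a set-but-missing path raises instead of falling back, and otherwise the first existing candidate is returned.

```python
import os
import tempfile
from pathlib import Path


def resolve_db_path(candidates: list[Path], env_var: str = "OPENALEX_LOCAL_DB") -> Path:
    """Mirror config.py's order: environment variable first, then candidates."""
    env_path = os.environ.get(env_var)
    if env_path:
        path = Path(env_path)
        if path.exists():
            return path
        # An explicit but wrong setting is an error, not a cue to fall back.
        raise FileNotFoundError(f"{env_var} path not found: {env_path}")
    for path in candidates:
        if path.exists():
            return path
    raise FileNotFoundError(f"No database found in: {[str(p) for p in candidates]}")


with tempfile.TemporaryDirectory() as tmp:
    db = Path(tmp) / "openalex.db"
    db.touch()  # empty placeholder file, not a real database
    missing = Path(tmp) / "missing.db"

    # Case 1: no environment variable, so the candidate list is scanned.
    os.environ.pop("OPENALEX_LOCAL_DB", None)
    found_by_scan = resolve_db_path([missing, db])

    # Case 2: the environment variable takes priority over the candidates.
    os.environ["OPENALEX_LOCAL_DB"] = str(db)
    found_by_env = resolve_db_path([missing])

    # Case 3: a bad environment value raises even though a candidate exists.
    os.environ["OPENALEX_LOCAL_DB"] = str(missing)
    try:
        resolve_db_path([db])
        env_error = None
    except FileNotFoundError as exc:
        env_error = str(exc)

    os.environ.pop("OPENALEX_LOCAL_DB", None)

print(found_by_scan.name, found_by_env.name, env_error is not None)
```

Since the first three entries in `DEFAULT_DB_PATHS` are absolute paths under `/home/ywatanabe` and `/mnt/nas_ug`, setting `OPENALEX_LOCAL_DB` (or calling `Config.set_db_path`) is likely the more dependable route on other machines.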
@@ -0,0 +1,187 @@
+"""Data models for openalex_local."""
+
+from dataclasses import dataclass, field
+from typing import List, Optional, Dict, Any
+
+
+@dataclass
+class Work:
+    """
+    Represents a scholarly work from OpenAlex.
+
+    Attributes:
+        openalex_id: OpenAlex ID (e.g., W2741809807)
+        doi: Digital Object Identifier
+        title: Work title
+        abstract: Abstract text (reconstructed from inverted index)
+        authors: List of author names
+        year: Publication year
+        source: Journal/venue name
+        issn: Journal ISSN
+        volume: Volume number
+        issue: Issue number
+        pages: Page range
+        publisher: Publisher name
+        type: Work type (journal-article, book-chapter, etc.)
+        concepts: List of OpenAlex concepts
+        topics: List of OpenAlex topics
+        cited_by_count: Number of citations
+        referenced_works: List of referenced OpenAlex IDs
+        is_oa: Is open access
+        oa_url: Open access URL
+    """
+
+    openalex_id: str
+    doi: Optional[str] = None
+    title: Optional[str] = None
+    abstract: Optional[str] = None
+    authors: List[str] = field(default_factory=list)
+    year: Optional[int] = None
+    source: Optional[str] = None
+    issn: Optional[str] = None
+    volume: Optional[str] = None
+    issue: Optional[str] = None
+    pages: Optional[str] = None
+    publisher: Optional[str] = None
+    type: Optional[str] = None
+    concepts: List[Dict[str, Any]] = field(default_factory=list)
+    topics: List[Dict[str, Any]] = field(default_factory=list)
+    cited_by_count: Optional[int] = None
+    referenced_works: List[str] = field(default_factory=list)
+    is_oa: bool = False
+    oa_url: Optional[str] = None
+
+    @classmethod
+    def from_openalex(cls, data: dict) -> "Work":
+        """
+        Create Work from OpenAlex API/snapshot JSON.
+
+        Args:
+            data: OpenAlex work dictionary
+
+        Returns:
+            Work instance
+        """
+        # Extract OpenAlex ID
+        openalex_id = (data.get("id") or "").replace("https://openalex.org/", "")
+
+        # Extract DOI
+        doi = data.get("doi", "").replace("https://doi.org/", "") if data.get("doi") else None
+
+        # Extract authors
+        authors = []
+        for authorship in data.get("authorships") or []:
+            author = authorship.get("author") or {}
+            name = author.get("display_name")
+            if name:
+                authors.append(name)
+
+        # Reconstruct abstract from inverted index
+        abstract = None
+        inv_index = data.get("abstract_inverted_index")
+        if inv_index:
+            words = sorted(
+                [(pos, word) for word, positions in inv_index.items() for pos in positions]
+            )
+            abstract = " ".join(word for _, word in words)
+
+        # Extract source info
+        primary_location = data.get("primary_location") or {}
+        source_info = primary_location.get("source") or {}
+        source = source_info.get("display_name")
+        issns = source_info.get("issn") or []
+        issn = issns[0] if issns else None
+
+        # Extract biblio
+        biblio = data.get("biblio") or {}
+
+        # Extract concepts (top 5)
+        concepts = [
+            {"name": c.get("display_name"), "score": c.get("score")}
+            for c in (data.get("concepts") or [])[:5]
+        ]
+
+        # Extract topics (top 3)
+        topics = [
+            {"name": t.get("display_name"), "subfield": (t.get("subfield") or {}).get("display_name")}
+            for t in (data.get("topics") or [])[:3]
+        ]
+
+        # Extract OA info
+        oa_info = data.get("open_access") or {}
+
+        return cls(
+            openalex_id=openalex_id,
+            doi=doi,
+            title=data.get("title") or data.get("display_name"),
+            abstract=abstract,
+            authors=authors,
+            year=data.get("publication_year"),
+            source=source,
+            issn=issn,
+            volume=biblio.get("volume"),
+            issue=biblio.get("issue"),
+            pages=biblio.get("first_page"),
+            publisher=source_info.get("host_organization_name"),
+            type=data.get("type"),
+            concepts=concepts,
+            topics=topics,
+            cited_by_count=data.get("cited_by_count"),
+            referenced_works=[
+                r.replace("https://openalex.org/", "")
+                for r in (data.get("referenced_works") or [])
+            ],
+            is_oa=oa_info.get("is_oa", False),
+            oa_url=oa_info.get("oa_url"),
+        )
+
+    def to_dict(self) -> dict:
+        """Convert to dictionary."""
+        return {
+            "openalex_id": self.openalex_id,
+            "doi": self.doi,
+            "title": self.title,
+            "abstract": self.abstract,
+            "authors": self.authors,
+            "year": self.year,
+            "source": self.source,
+            "issn": self.issn,
+            "volume": self.volume,
+            "issue": self.issue,
+            "pages": self.pages,
+            "publisher": self.publisher,
+            "type": self.type,
+            "concepts": self.concepts,
+            "topics": self.topics,
+            "cited_by_count": self.cited_by_count,
+            "referenced_works": self.referenced_works,
+            "is_oa": self.is_oa,
+            "oa_url": self.oa_url,
+        }
+
+
+@dataclass
+class SearchResult:
+    """
+    Container for search results with metadata.
+
+    Attributes:
+        works: List of Work objects
+        total: Total number of matches
+        query: Original search query
+        elapsed_ms: Search time in milliseconds
+    """
+
+    works: List[Work]
+    total: int
+    query: str
+    elapsed_ms: float
+
+    def __len__(self) -> int:
+        return len(self.works)
+
+    def __iter__(self):
+        return iter(self.works)
+
+    def __getitem__(self, idx):
+        return self.works[idx]
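`Work.from_openalex` rebuilds the abstract from OpenAlex's `abstract_inverted_index`, which maps each word to the positions where it occurs. The standalone sketch below isolates that one step so it can be run without the package. The helper name `reconstruct_abstract` and the sample index are illustrative assumptions, not part of the release.

```python
from typing import Dict, List, Optional


def reconstruct_abstract(inv_index: Optional[Dict[str, List[int]]]) -> Optional[str]:
    """Rebuild abstract text from a word -> positions mapping."""
    if not inv_index:
        return None
    # Flatten to (position, word) pairs, then sort so words appear in order.
    pairs = sorted(
        (pos, word) for word, positions in inv_index.items() for pos in positions
    )
    return " ".join(word for _, word in pairs)


# "search" occurs twice, so it lists two positions.
sample = {"search": [1, 4], "Local": [0], "beats": [2], "remote": [3]}
print(reconstruct_abstract(sample))
```

A word that appears several times contributes one pair per position, which is why the comprehension iterates over `positions` rather than taking only the first entry.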
@@ -0,0 +1,152 @@
+Metadata-Version: 2.4
+Name: openalex-local
+Version: 0.1.0
+Summary: Local OpenAlex database with 284M+ works, abstracts, and semantic search
+Author-email: Yusuke Watanabe <ywatanabe@alumni.u-tokyo.ac.jp>
+License: AGPL-3.0
+Project-URL: Homepage, https://github.com/ywatanabe1989/openalex-local
+Project-URL: Repository, https://github.com/ywatanabe1989/openalex-local
+Keywords: openalex,academic,research,abstracts,semantic-search
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: GNU Affero General Public License v3
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Scientific/Engineering
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+Requires-Dist: click>=8.0
+Requires-Dist: awscli>=1.0
+Provides-Extra: dev
+Requires-Dist: pytest>=7.0; extra == "dev"
+Requires-Dist: pytest-cov>=4.0; extra == "dev"
+
+# OpenAlex Local
+
+Local OpenAlex database with 284M+ scholarly works, abstracts, and semantic search.
+
+[](https://www.python.org/downloads/)
+[](LICENSE)
+
+<details>
+<summary><strong>Why OpenAlex Local?</strong></summary>
+
+**Built for the LLM era** - features that matter for AI research assistants:
+
+| Feature | Benefit |
+|---------|---------|
+| **284M Works** | More coverage than CrossRef |
+| **Abstracts** | ~45-60% availability for semantic search |
+| **Concepts & Topics** | Built-in classification |
+| **Author Disambiguation** | Linked to institutions |
+| **Open Access Info** | OA status and URLs |
+
+Perfect for: RAG systems, research assistants, literature review automation.
+
+</details>
+
+<details>
+<summary><strong>Installation</strong></summary>
+
+```bash
+pip install openalex-local
+```
+
+From source:
+```bash
+git clone https://github.com/ywatanabe1989/openalex-local
+cd openalex-local && make install
+```
+
+Database setup (~300 GB, ~1-2 days to build):
+```bash
+# Check system status
+make status
+
+# 1. Download OpenAlex Works snapshot (~300GB)
+make download-screen # runs in background
+
+# 2. Build SQLite database
+make build-db
+
+# 3. Build FTS5 index
+make build-fts
+```
+
+</details>
+
+<details>
+<summary><strong>Python API</strong></summary>
+
+```python
+from openalex_local import search, get, count
+
+# Full-text search (title + abstract)
+results = search("machine learning neural networks")
+for work in results:
+    print(f"{work.title} ({work.year})")
+    print(f" Abstract: {(work.abstract or '')[:200]}...")
+    print(f" Concepts: {[c['name'] for c in work.concepts]}")
+
+# Get by OpenAlex ID or DOI
+work = get("W2741809807")
+work = get("10.1038/nature12373")
+
+# Count matches
+n = count("CRISPR")
+```
+
+</details>
+
+<details>
+<summary><strong>CLI</strong></summary>
+
+```bash
+openalex-local search "CRISPR genome editing" -n 5
+openalex-local get W2741809807
+openalex-local get 10.1038/nature12373
+openalex-local count "machine learning"
+```
+
+</details>
+
+<details>
+<summary><strong>Related Projects</strong></summary>
+
+**[crossref-local](https://github.com/ywatanabe1989/crossref-local)** - Sister project with CrossRef data:
+
+| Feature | crossref-local | openalex-local |
+|---------|----------------|----------------|
+| Works | 167M | 284M |
+| Abstracts | ~21% | ~45-60% |
+| Update frequency | Real-time | Monthly |
+| DOI authority | ✓ (source) | Uses CrossRef |
+| Citations | Raw references | Linked works |
+| Concepts/Topics | ✗ | ✓ |
+| Author IDs | ✗ | ✓ |
+| Best for | DOI lookup, raw refs | Semantic search |
+
+**When to use CrossRef**: Real-time DOI updates, raw reference parsing, authoritative metadata.
+**When to use OpenAlex**: Semantic search, citation analysis, topic discovery.
+
+</details>
+
+<details>
+<summary><strong>Data Source</strong></summary>
+
+Data from [OpenAlex](https://openalex.org/), an open catalog of scholarly works.
+Updated monthly from their [snapshot](https://docs.openalex.org/download-all-data/openalex-snapshot).
+
+</details>
+
+---
+
+<p align="center">
+<a href="https://scitex.ai"><img src="docs/scitex-icon-navy-inverted.png" alt="SciTeX" width="40"/></a>
+<br>
+AGPL-3.0 · ywatanabe@scitex.ai
+</p>
+
+<!-- EOF -->
@@ -0,0 +1,11 @@
+README.md
+pyproject.toml
+src/openalex_local/__init__.py
+src/openalex_local/config.py
+src/openalex_local/models.py
+src/openalex_local.egg-info/PKG-INFO
+src/openalex_local.egg-info/SOURCES.txt
+src/openalex_local.egg-info/dependency_links.txt
+src/openalex_local.egg-info/entry_points.txt
+src/openalex_local.egg-info/requires.txt
+src/openalex_local.egg-info/top_level.txt
@@ -0,0 +1 @@
+
@@ -0,0 +1 @@
+openalex_local