search_paper 0.1.2

package/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 MinsuChae
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
package/README.kr.md ADDED
@@ -0,0 +1,210 @@
1
+ # search-papers
2
+
3
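+ A TypeScript library that searches for papers across multiple academic search sources and returns them as normalized, structured data.
4
+
5
+ Built on [ghostfetch](https://github.com/user/ghostfetch), with support for browser fingerprint spoofing and anti-bot bypass.
6
+
7
+ ## Features
8
+
9
+ - **Multi-source search** - Searches Google Scholar, Semantic Scholar, and arXiv in parallel
10
+ - **Unified `Paper` interface** - Returns the same structured format regardless of the source
11
+ - **Deduplication** - Automatically merges duplicate papers across sources based on DOI, canonical URL, and normalized title
12
+ - **Impact Factor sorting** - Sorts by journal Impact Factor (static mapping of about 100 journals)
13
+ - **Citation and reference lookup** - Retrieves citing and referenced papers via Semantic Scholar
14
+ - **Anti-bot bypass** - ghostfetch handles browser spoofing, JS challenge solving, and redirect tracking
15
+ - **Partial failure tolerance** - If one source fails, results from the other sources are still returned
16
+
17
+ ## Requirements
18
+
19
+ - Node.js >= 22.0.0
20
+
21
+ ## Installation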
22
+
23
+ ```bash
24
+ npm install search-papers
25
+ ```
26
+
27
+ ## Quick Start
28
+
29
+ ```typescript
30
+ import { searchPapers, getPaper } from 'search-papers';
31
+
32
+ // Search across all sources
33
+ const result = await searchPapers('attention is all you need', {
34
+ limit: 10,
35
+ });
36
+ console.log(result.papers);
37
+
38
+ // Search specific sources only
39
+ const arxivOnly = await searchPapers('transformer', {
40
+ sources: ['arxiv'],
41
+ limit: 5,
42
+ sort: 'date',
43
+ });
44
+
45
+ // Look up a single paper by DOI
46
+ const paper = await getPaper('10.48550/arXiv.1706.03762');
47
+ console.log(paper?.title);
48
+ ```
49
+
50
+ ## API
51
+
52
+ ### `searchPapers(query, options?)`
53
+
54
+ Searches for papers across multiple sources simultaneously.
55
+
56
+ ```typescript
57
+ const result = await searchPapers('deep learning', {
58
+ sources: ['semantic_scholar', 'google_scholar', 'arxiv'], // default: all
59
+ limit: 10, // default: 10
60
+ offset: 0,
61
+ year: { from: 2020, to: 2024 },
62
+ sort: 'relevance', // 'relevance' | 'date' | 'citations'
63
+ client: {
64
+ semanticScholarApiKey: 'your-key', // optional
65
+ proxy: 'http://proxy:8080', // optional
66
+ timeout: 15000, // default: 15000ms
67
+ },
68
+ });
69
+ ```
70
+
71
+ **Returns**: `SearchResult`
72
+
73
+ ```typescript
74
+ interface SearchResult {
75
+ query: string;
76
+ totalResults?: number;
77
+ papers: Paper[];
78
+ nextPageToken?: string;
79
+ source: SourceType;
80
+ errors?: SourceError[]; // error details for failed sources
81
+ }
82
+ ```
83
+
84
+ ### `getPaper(doi, options?)`
85
+
86
+ Looks up a single paper on Semantic Scholar by DOI.
87
+
88
+ ```typescript
89
+ const paper = await getPaper('10.1038/nature14539');
90
+ // Returns Paper | null
91
+ ```
92
+
93
+ ### Paper Interface
94
+
95
+ Papers returned from every source follow the same interface:
96
+
97
+ ```typescript
98
+ interface Paper {
99
+ title: string;
100
+ authors: Author[];
101
+ abstract?: string;
102
+ year?: number;
103
+ venue?: string; // journal or conference name
104
+ doi?: string;
105
+ url: string; // link to the paper
106
+ canonicalUrl?: string; // final URL after redirects
107
+ pdfUrl?: string;
108
+ citationCount?: number;
109
+ impactFactor?: number; // journal Impact Factor
110
+ source: SourceType; // 'google_scholar' | 'semantic_scholar' | 'arxiv'
111
+ sourceId?: string; // source-internal ID
112
+ tags?: string[]; // e.g. arXiv categories
113
+ references?: string[];
114
+ }
115
+ ```
116
+
117
+ ## Using Individual Sources
118
+
119
+ When finer control is needed, the source classes can be used directly:
120
+
121
+ ```typescript
122
+ import { createClient, SemanticScholarSource, GoogleScholarSource, ArxivSource } from 'search-papers';
123
+
124
+ const client = createClient();
125
+
126
+ // Semantic Scholar (implements CitationSource)
127
+ const s2 = new SemanticScholarSource(client);
128
+ const result = await s2.search('transformer', { limit: 5 });
129
+ const paper = await s2.getPaper('DOI:10.48550/arXiv.1706.03762');
130
+ const citations = await s2.getCitations('204e3073870fae3d05bcbc2f6a8e263d9b72e776');
131
+ const references = await s2.getReferences('204e3073870fae3d05bcbc2f6a8e263d9b72e776');
132
+
133
+ // Google Scholar (implements PaperSource)
134
+ const gs = new GoogleScholarSource(client);
135
+ const gsResult = await gs.search('deep learning', { limit: 10 });
136
+
137
+ // arXiv (implements PaperSource)
138
+ const arxiv = new ArxivSource(client);
139
+ const arxivResult = await arxiv.search('neural network', { limit: 10 });
140
+ const arxivPaper = await arxiv.getPaper('1706.03762');
141
+
142
+ await client.destroy();
143
+ ```
144
+
145
+ ## Supported Sources
146
+
147
+ | Source | Type | Search | Paper lookup | Citations | References | Notes |
148
+ |------|------|------|----------|------|------|------|
149
+ | Semantic Scholar | API (JSON) | Yes | Yes (DOI, paperId, etc.) | Yes | Yes | An API key provides a dedicated rate limit |
150
+ | Google Scholar | Scraping (HTML) | Yes | Yes (title search) | No | No | CAPTCHA risk, 2-5 s random delay |
151
+ | arXiv | API (Atom XML) | Yes | Yes (arXiv ID) | No | No | Minimum 3 s delay between requests |
152
+
153
+ ## Search Options
154
+
155
+ | Option | Type | Default | Description |
156
+ |------|------|--------|------|
157
+ | `sources` | `SourceType[]` | All 3 sources | Sources to search |
158
+ | `limit` | `number` | `10` | Maximum number of results |
159
+ | `offset` | `number` | `0` | Pagination offset |
160
+ | `year` | `{ from?, to? }` | - | Publication year range filter |
161
+ | `sort` | `string` | `'relevance'` | Sort order: `'relevance'`, `'date'`, `'citations'` |
162
+
163
+ ## Client Options
164
+
165
+ | Option | Type | Default | Description |
166
+ |------|------|--------|------|
167
+ | `browser` | `string` | `'Chrome_131'` | Browser to spoof |
168
+ | `timeout` | `number` | `15000` | Request timeout (ms) |
169
+ | `proxy` | `string` | - | HTTP proxy URL |
170
+ | `proxyPool` | `string[]` | - | Round-robin proxy pool |
171
+ | `semanticScholarApiKey` | `string` | - | Semantic Scholar API key |
172
+
173
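+ ## How It Works
174
+
175
+ 1. **Parallel requests** - Sends requests to all selected sources concurrently via `Promise.allSettled`
176
+ 2. **Canonical URL resolution** - ghostfetch follows redirects to obtain the final URL of each paper
177
+ 3. **Impact Factor lookup** - Matches each paper's venue against a static journal Impact Factor table
178
+ 4. **Deduplication** - Identifies duplicates by DOI, then canonical URL, then normalized title, and merges metadata from multiple sources
179
+ 5. **Sorting** - Sorts by Impact Factor (descending), breaking ties by citation count
180
+ 6. **Apply limit** - Returns the final results trimmed to the requested count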
181
+
182
+ ## Error Handling
183
+
184
+ The library tolerates partial failures. If one source fails, results from the other sources are still returned:
185
+
186
+ ```typescript
187
+ const result = await searchPapers('query');
188
+
189
+ if (result.errors) {
190
+ for (const err of result.errors) {
191
+ console.warn(`${err.source}: ${err.message} (${err.code})`);
192
+ // err.code: 'RATE_LIMITED' | 'CAPTCHA' | 'TIMEOUT' | 'NETWORK_ERROR' | 'PARSE_ERROR' | 'UNKNOWN'
193
+ }
194
+ }
195
+
196
+ // result.papers contains the results from the sources that succeeded
197
+ ```
198
+
199
+ ## Development
200
+
201
+ ```bash
202
+ npm run build # Build ESM + CJS + .d.ts with tsup
203
+ npm run lint # Type check with tsc --noEmit
204
+ npm run test # Run unit tests
205
+ npm run test:live # Run live tests (requires LIVE_TEST=true)
206
+ ```
207
+
208
+ ## License
209
+
210
+ MIT
package/README.md ADDED
@@ -0,0 +1,210 @@
1
+ # search-papers
2
+
3
+ A TypeScript library for searching academic papers across multiple sources, returning structured and normalized results.
4
+
5
+ Built on [ghostfetch](https://github.com/user/ghostfetch) for robust HTTP requests with browser fingerprint spoofing and anti-bot bypass.
6
+
7
+ ## Features
8
+
9
+ - **Multi-source search** - Query Google Scholar, Semantic Scholar, and arXiv in parallel
10
+ - **Unified `Paper` interface** - All sources return the same structured format regardless of origin
11
+ - **Deduplication** - Automatically merges duplicate papers across sources using DOI, canonical URL, and title matching
12
+ - **Impact Factor ranking** - Results sorted by journal Impact Factor (static mapping of ~100 journals)
13
+ - **Citation & reference lookup** - Retrieve citing/referenced papers via Semantic Scholar
14
+ - **Anti-bot bypass** - ghostfetch handles browser spoofing, JS challenge solving, and redirect tracking
15
+ - **Partial failure tolerance** - If one source fails, results from other sources are still returned
16
+
17
+ ## Requirements
18
+
19
+ - Node.js >= 22.0.0
20
+
21
+ ## Installation
22
+
23
+ ```bash
24
+ npm install search-papers
25
+ ```
26
+
27
+ ## Quick Start
28
+
29
+ ```typescript
30
+ import { searchPapers, getPaper } from 'search-papers';
31
+
32
+ // Search across all sources
33
+ const result = await searchPapers('attention is all you need', {
34
+ limit: 10,
35
+ });
36
+ console.log(result.papers);
37
+
38
+ // Search specific sources only
39
+ const arxivOnly = await searchPapers('transformer', {
40
+ sources: ['arxiv'],
41
+ limit: 5,
42
+ sort: 'date',
43
+ });
44
+
45
+ // Look up a single paper by DOI
46
+ const paper = await getPaper('10.48550/arXiv.1706.03762');
47
+ console.log(paper?.title);
48
+ ```
49
+
50
+ ## API
51
+
52
+ ### `searchPapers(query, options?)`
53
+
54
+ Search for papers across multiple sources simultaneously.
55
+
56
+ ```typescript
57
+ const result = await searchPapers('deep learning', {
58
+ sources: ['semantic_scholar', 'google_scholar', 'arxiv'], // default: all
59
+ limit: 10, // default: 10
60
+ offset: 0,
61
+ year: { from: 2020, to: 2024 },
62
+ sort: 'relevance', // 'relevance' | 'date' | 'citations'
63
+ client: {
64
+ semanticScholarApiKey: 'your-key', // optional
65
+ proxy: 'http://proxy:8080', // optional
66
+ timeout: 15000, // default: 15000ms
67
+ },
68
+ });
69
+ ```
70
+
71
+ **Returns**: `SearchResult`
72
+
73
+ ```typescript
74
+ interface SearchResult {
75
+ query: string;
76
+ totalResults?: number;
77
+ papers: Paper[];
78
+ nextPageToken?: string;
79
+ source: SourceType;
80
+ errors?: SourceError[]; // errors from failed sources
81
+ }
82
+ ```
83
+
84
+ ### `getPaper(doi, options?)`
85
+
86
+ Look up a single paper by DOI using Semantic Scholar.
87
+
88
+ ```typescript
89
+ const paper = await getPaper('10.1038/nature14539');
90
+ // Returns Paper | null
91
+ ```
92
+
93
+ ### Paper Interface
94
+
95
+ Every paper returned by any source conforms to this interface:
96
+
97
+ ```typescript
98
+ interface Paper {
99
+ title: string;
100
+ authors: Author[];
101
+ abstract?: string;
102
+ year?: number;
103
+ venue?: string; // journal or conference name
104
+ doi?: string;
105
+ url: string; // link to the paper
106
+ canonicalUrl?: string; // final redirect URL
107
+ pdfUrl?: string;
108
+ citationCount?: number;
109
+ impactFactor?: number; // journal Impact Factor
110
+ source: SourceType; // 'google_scholar' | 'semantic_scholar' | 'arxiv'
111
+ sourceId?: string; // source-specific ID
112
+ tags?: string[]; // e.g. arXiv categories
113
+ references?: string[];
114
+ }
115
+ ```
116
+
117
+ ## Using Individual Sources
118
+
119
+ For more control, use source classes directly:
120
+
121
+ ```typescript
122
+ import { createClient, SemanticScholarSource, GoogleScholarSource, ArxivSource } from 'search-papers';
123
+
124
+ const client = createClient();
125
+
126
+ // Semantic Scholar (implements CitationSource)
127
+ const s2 = new SemanticScholarSource(client);
128
+ const result = await s2.search('transformer', { limit: 5 });
129
+ const paper = await s2.getPaper('DOI:10.48550/arXiv.1706.03762');
130
+ const citations = await s2.getCitations('204e3073870fae3d05bcbc2f6a8e263d9b72e776');
131
+ const references = await s2.getReferences('204e3073870fae3d05bcbc2f6a8e263d9b72e776');
132
+
133
+ // Google Scholar (implements PaperSource)
134
+ const gs = new GoogleScholarSource(client);
135
+ const gsResult = await gs.search('deep learning', { limit: 10 });
136
+
137
+ // arXiv (implements PaperSource)
138
+ const arxiv = new ArxivSource(client);
139
+ const arxivResult = await arxiv.search('neural network', { limit: 10 });
140
+ const arxivPaper = await arxiv.getPaper('1706.03762');
141
+
142
+ await client.destroy();
143
+ ```
144
+
145
+ ## Sources
146
+
147
+ | Source | Type | Search | Get Paper | Citations | References | Notes |
148
+ |--------|------|--------|-----------|-----------|------------|-------|
149
+ | Semantic Scholar | API (JSON) | Yes | Yes (DOI, paperId, etc.) | Yes | Yes | Optional API key for dedicated rate limit |
150
+ | Google Scholar | Scraping (HTML) | Yes | Yes (title search) | No | No | CAPTCHA risk, 2-5s random delay |
151
+ | arXiv | API (Atom XML) | Yes | Yes (arXiv ID) | No | No | 3s minimum delay between requests |
152
+
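The Notes column mentions a 2-5 s random delay for Google Scholar and a 3 s minimum delay for arXiv. How the package enforces these is not shown in this diff; the following is only a minimal sketch of one way such spacing could be done. The names `withMinDelay`, `arxivDelayMs`, and `scholarDelayMs` are hypothetical and are not part of the package's API.

```typescript
// Resolves after `ms` milliseconds.
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Assumed delay policies based on the Notes column (hypothetical helpers).
const arxivDelayMs = (): number => 3000;
const scholarDelayMs = (rand: () => number = Math.random): number =>
  2000 + rand() * 3000; // uniformly between 2000 and 5000 ms

// Wraps an async function so that successive calls start at least
// `nextDelayMs()` milliseconds apart. Calls are serialized through a queue.
function withMinDelay<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
  nextDelayMs: () => number,
): (...args: A) => Promise<R> {
  let lastStart = 0; // timestamp of the previous call's start (0 = never)
  let queue: Promise<unknown> = Promise.resolve();
  return (...args: A): Promise<R> => {
    const run = queue.then(async () => {
      const wait = lastStart + nextDelayMs() - Date.now();
      if (wait > 0) await sleep(wait);
      lastStart = Date.now();
      return fn(...args);
    });
    // Keep the queue usable even if this call rejects.
    queue = run.catch(() => undefined);
    return run;
  };
}
```

A caller could wrap a per-source request function once, for example `withMinDelay(fetchArxiv, arxivDelayMs)` where `fetchArxiv` is whatever function issues the request, and then call the wrapped function freely; the first call runs immediately and later calls wait as needed.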
153
+ ## Search Options
154
+
155
+ | Option | Type | Default | Description |
156
+ |--------|------|---------|-------------|
157
+ | `sources` | `SourceType[]` | All 3 sources | Which sources to query |
158
+ | `limit` | `number` | `10` | Max results to return |
159
+ | `offset` | `number` | `0` | Pagination offset |
160
+ | `year` | `{ from?, to? }` | - | Publication year range filter |
161
+ | `sort` | `string` | `'relevance'` | Sort order: `'relevance'`, `'date'`, `'citations'` |
162
+
163
+ ## Client Options
164
+
165
+ | Option | Type | Default | Description |
166
+ |--------|------|---------|-------------|
167
+ | `browser` | `string` | `'Chrome_131'` | Browser to spoof |
168
+ | `timeout` | `number` | `15000` | Request timeout in ms |
169
+ | `proxy` | `string` | - | HTTP proxy URL |
170
+ | `proxyPool` | `string[]` | - | Proxy pool with round-robin rotation |
171
+ | `semanticScholarApiKey` | `string` | - | Semantic Scholar API key |
172
+
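The `proxyPool` option is described as rotating in round-robin order. The rotation itself is presumably handled by the client and is not shown in this diff; as an illustration of what round-robin selection means, here is a small sketch with a hypothetical `createRoundRobin` helper.

```typescript
// Returns a function that yields the pool entries in order, wrapping around
// to the first entry after the last one. Hypothetical helper for illustration.
function createRoundRobin<T>(pool: readonly T[]): () => T {
  if (pool.length === 0) {
    throw new Error('pool must not be empty');
  }
  let index = 0;
  return (): T => {
    const item = pool[index % pool.length] as T;
    index += 1;
    return item;
  };
}

// Example pool; the proxy URLs are placeholders.
const nextProxy = createRoundRobin(['http://proxy-a:8080', 'http://proxy-b:8080']);
```

Each call to `nextProxy()` returns the next URL, so three consecutive calls yield the first proxy, the second, and then the first again.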
173
+ ## How It Works
174
+
175
+ 1. **Parallel queries** - All selected sources are queried simultaneously via `Promise.allSettled`
176
+ 2. **Canonical URL resolution** - ghostfetch follows redirects to determine the final URL of each paper
177
+ 3. **Impact Factor lookup** - Each paper's venue is matched against a static journal Impact Factor table
178
+ 4. **Deduplication** - Papers are deduplicated by DOI, then canonical URL, then normalized title (in that priority order), and metadata from multiple sources is merged
179
+ 5. **Sorting** - Results are sorted by Impact Factor (descending), then by citation count
180
+ 6. **Limit** - Final results are trimmed to the requested limit
181
+
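Steps 4 to 6 above can be sketched roughly as follows. This is not the package's implementation, which this diff does not include; `PaperLike`, `dedupKey`, `mergePapers`, and `dedupeAndRank` are hypothetical names, the title normalization is an assumed ASCII-only simplification, and the sketch does not merge a record that has a DOI with a record of the same paper that lacks one.

```typescript
// Hypothetical reduced shape; the README's Paper interface has more fields.
interface PaperLike {
  title: string;
  doi?: string;
  canonicalUrl?: string;
  citationCount?: number;
  impactFactor?: number;
}

// Pick the strongest identity available: DOI, then canonical URL, then title.
function dedupKey(p: PaperLike): string {
  if (p.doi) return `doi:${p.doi.toLowerCase()}`;
  if (p.canonicalUrl) return `url:${p.canonicalUrl}`;
  const normalized = p.title.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim();
  return `title:${normalized}`;
}

// Merge two records of the same paper: the first record wins, and the second
// only fills in fields that the first one is missing.
function mergePapers(a: PaperLike, b: PaperLike): PaperLike {
  return {
    title: a.title,
    doi: a.doi ?? b.doi,
    canonicalUrl: a.canonicalUrl ?? b.canonicalUrl,
    citationCount: a.citationCount ?? b.citationCount,
    impactFactor: a.impactFactor ?? b.impactFactor,
  };
}

function dedupeAndRank(papers: readonly PaperLike[], limit: number): PaperLike[] {
  const merged = new Map<string, PaperLike>();
  for (const p of papers) {
    const key = dedupKey(p);
    const existing = merged.get(key);
    merged.set(key, existing ? mergePapers(existing, p) : p);
  }
  return Array.from(merged.values())
    // Impact Factor descending, then citation count descending;
    // missing values are treated as 0 (an assumption of this sketch).
    .sort(
      (x, y) =>
        (y.impactFactor ?? 0) - (x.impactFactor ?? 0) ||
        (y.citationCount ?? 0) - (x.citationCount ?? 0),
    )
    .slice(0, limit);
}
```

Keying on a single identifier per record keeps the sketch short; a fuller version would index each record under every identifier it has so that records sharing any one of them are merged.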
182
+ ## Error Handling
183
+
184
+ The library tolerates partial failures. If one source fails, results from the other sources are still returned:
185
+
186
+ ```typescript
187
+ const result = await searchPapers('query');
188
+
189
+ if (result.errors) {
190
+ for (const err of result.errors) {
191
+ console.warn(`${err.source}: ${err.message} (${err.code})`);
192
+ // err.code: 'RATE_LIMITED' | 'CAPTCHA' | 'TIMEOUT' | 'NETWORK_ERROR' | 'PARSE_ERROR' | 'UNKNOWN'
193
+ }
194
+ }
195
+
196
+ // result.papers still contains results from successful sources
197
+ ```
198
+
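The README states that sources are queried via `Promise.allSettled`. As a rough sketch of how settled outcomes can be split into results and per-source errors, consider the following. The helper names `splitSettled` and `searchAll` are hypothetical, and every failure is labeled `'UNKNOWN'` here because the rules the package uses to assign the other error codes are not shown in this diff.

```typescript
type SourceType = 'google_scholar' | 'semantic_scholar' | 'arxiv';

interface SourceError {
  source: SourceType;
  message: string;
  code: string;
}

// Splits settled outcomes (one per source, in the same order as `sources`)
// into the collected items and a list of per-source errors.
function splitSettled<T>(
  sources: readonly SourceType[],
  settled: readonly PromiseSettledResult<T[]>[],
): { items: T[]; errors: SourceError[] } {
  const items: T[] = [];
  const errors: SourceError[] = [];
  settled.forEach((outcome, i) => {
    const source = sources[i];
    if (source === undefined) return; // ignore outcomes without a source
    if (outcome.status === 'fulfilled') {
      items.push(...outcome.value);
    } else {
      const reason: unknown = outcome.reason;
      errors.push({
        source,
        message: reason instanceof Error ? reason.message : String(reason),
        code: 'UNKNOWN', // classification is left out of this sketch
      });
    }
  });
  return { items, errors };
}

// Queries every source concurrently; a rejected source never rejects the whole call.
async function searchAll<T>(
  sources: readonly SourceType[],
  searchOne: (source: SourceType) => Promise<T[]>,
): Promise<{ items: T[]; errors: SourceError[] }> {
  const settled = await Promise.allSettled(sources.map((s) => searchOne(s)));
  return splitSettled(sources, settled);
}
```

Unlike `Promise.all`, `Promise.allSettled` waits for every promise and never rejects, which is what makes it possible to report failures alongside the results that did arrive.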
199
+ ## Development
200
+
201
+ ```bash
202
+ npm run build # Build ESM + CJS + .d.ts via tsup
203
+ npm run lint # Type check with tsc --noEmit
204
+ npm run test # Run unit tests
205
+ npm run test:live # Run live tests (requires LIVE_TEST=true)
206
+ ```
207
+
208
+ ## License
209
+
210
+ MIT