npm - search_paper - Versions diffs - 0.1.2 - Mend

search_paper 0.1.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (8)

package/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 MinsuChae
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

package/README.kr.md ADDED

@@ -0,0 +1,210 @@
+# search-papers
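+A TypeScript library that searches for papers across multiple academic search sources and returns them as normalized, structured data.
+Built on [ghostfetch](https://github.com/user/ghostfetch), it supports browser fingerprint spoofing and anti-bot bypass.
+## Features
+- **Multi-source search** - Searches Google Scholar, Semantic Scholar, and arXiv in parallel
+- **Unified `Paper` interface** - Returns the same structured format regardless of which source a paper came from
+- **Deduplication** - Automatically merges duplicate papers across sources based on DOI, canonical URL, and normalized title
+- **Impact Factor sorting** - Sorts by journal Impact Factor (static mapping of about 100 journals)
+- **Citation and reference lookup** - Looks up citing and referenced papers via Semantic Scholar
+- **Anti-bot bypass** - ghostfetch handles browser spoofing, JS challenge solving, and redirect tracking
+- **Partial failure tolerance** - If one source fails, results from the other sources are still returned
+## Requirements
+- Node.js >= 22.0.0
+## Installation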
+```bash
+npm install search-papers
+```
+## Quick Start
+```typescript
+import { searchPapers, getPaper } from 'search-papers';
+// Search all sources at once
+const result = await searchPapers('attention is all you need', {
+  limit: 10,
+});
+console.log(result.papers);
+// Search only specific sources
+const arxivOnly = await searchPapers('transformer', {
+  sources: ['arxiv'],
+  limit: 5,
+  sort: 'date',
+});
+// Look up a single paper by DOI
+const paper = await getPaper('10.48550/arXiv.1706.03762');
+console.log(paper?.title);
+```
+## API
+### `searchPapers(query, options?)`
+Searches multiple sources for papers simultaneously.
+```typescript
+const result = await searchPapers('deep learning', {
+  sources: ['semantic_scholar', 'google_scholar', 'arxiv'], // default: all
+  limit: 10,          // default: 10
+  offset: 0,
+  year: { from: 2020, to: 2024 },
+  sort: 'relevance',  // 'relevance' | 'date' | 'citations'
+  client: {
+    semanticScholarApiKey: 'your-key', // optional
+    proxy: 'http://proxy:8080',        // optional
+    timeout: 15000,                    // default: 15000ms
+  },
+});
+```
+**Returns**: `SearchResult`
+```typescript
+interface SearchResult {
+  query: string;
+  totalResults?: number;
+  papers: Paper[];
+  nextPageToken?: string;
+  source: SourceType;
+  errors?: SourceError[];  // error details for failed sources
+}
+```
+### `getPaper(doi, options?)`
+Looks up a single paper on Semantic Scholar by DOI.
+```typescript
+const paper = await getPaper('10.1038/nature14539');
+// Returns Paper | null
+```
+### Paper Interface
+Papers returned from every source follow the same interface:
+```typescript
+interface Paper {
+  title: string;
+  authors: Author[];
+  abstract?: string;
+  year?: number;
+  venue?: string;           // journal or conference name
+  doi?: string;
+  url: string;              // link to the paper
+  canonicalUrl?: string;    // final URL after redirects
+  pdfUrl?: string;
+  citationCount?: number;
+  impactFactor?: number;    // journal Impact Factor
+  source: SourceType;       // 'google_scholar' | 'semantic_scholar' | 'arxiv'
+  sourceId?: string;        // source-internal ID
+  tags?: string[];          // e.g. arXiv categories
+  references?: string[];
+}
+```
+## Using Individual Sources
+When finer control is needed, the source classes can be used directly:
+```typescript
+import { createClient, SemanticScholarSource, GoogleScholarSource, ArxivSource } from 'search-papers';
+const client = createClient();
+// Semantic Scholar (implements CitationSource)
+const s2 = new SemanticScholarSource(client);
+const result = await s2.search('transformer', { limit: 5 });
+const paper = await s2.getPaper('DOI:10.48550/arXiv.1706.03762');
+const citations = await s2.getCitations('204e3073870fae3d05bcbc2f6a8e263d9b72e776');
+const references = await s2.getReferences('204e3073870fae3d05bcbc2f6a8e263d9b72e776');
+// Google Scholar (implements PaperSource)
+const gs = new GoogleScholarSource(client);
+const gsResult = await gs.search('deep learning', { limit: 10 });
+// arXiv (implements PaperSource)
+const arxiv = new ArxivSource(client);
+const arxivResult = await arxiv.search('neural network', { limit: 10 });
+const arxivPaper = await arxiv.getPaper('1706.03762');
+await client.destroy();
+```
+## Supported Sources
+| Source | Type | Search | Paper lookup | Citations | References | Notes |
+|--------|------|--------|--------------|-----------|------------|-------|
+| Semantic Scholar | API (JSON) | Yes | Yes (DOI, paperId, etc.) | Yes | Yes | An API key provides a dedicated rate limit |
+| Google Scholar | Scraping (HTML) | Yes | Yes (title search) | No | No | CAPTCHA risk, 2-5 s random delay |
+| arXiv | API (Atom XML) | Yes | Yes (arXiv ID) | No | No | Minimum 3 s delay between requests |
+## Search Options
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `sources` | `SourceType[]` | All 3 sources | Sources to search |
+| `limit` | `number` | `10` | Maximum number of results |
+| `offset` | `number` | `0` | Pagination offset |
+| `year` | `{ from?, to? }` | - | Publication year range filter |
+| `sort` | `string` | `'relevance'` | Sort order: `'relevance'`, `'date'`, `'citations'` |
+## Client Options
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `browser` | `string` | `'Chrome_131'` | Browser to spoof |
+| `timeout` | `number` | `15000` | Request timeout (ms) |
+| `proxy` | `string` | - | HTTP proxy URL |
+| `proxyPool` | `string[]` | - | Round-robin proxy pool |
+| `semanticScholarApiKey` | `string` | - | Semantic Scholar API key |
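+## How It Works
+1. **Parallel requests** - All selected sources are queried simultaneously via `Promise.allSettled`
+2. **Canonical URL resolution** - ghostfetch follows redirects to obtain the final URL of each paper
+3. **Impact Factor lookup** - Each paper's venue is matched against a static journal Impact Factor table
+4. **Deduplication** - Duplicates are identified by DOI > canonical URL > normalized title, in that order, and metadata from multiple sources is merged
+5. **Sorting** - Sorted by Impact Factor (descending), with ties broken by citation count
+6. **Apply limit** - The final results are returned, up to the requested count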
+## Error Handling
+The library tolerates partial failures. If one source fails, results from the other sources are still returned:
+```typescript
+const result = await searchPapers('query');
+if (result.errors) {
+  for (const err of result.errors) {
+    console.warn(`${err.source}: ${err.message} (${err.code})`);
+    // err.code: 'RATE_LIMITED' | 'CAPTCHA' | 'TIMEOUT' | 'NETWORK_ERROR' | 'PARSE_ERROR' | 'UNKNOWN'
+  }
+}
+// result.papers contains the results from the sources that succeeded
+```
+## Development
+```bash
+npm run build      # Build ESM + CJS + .d.ts with tsup
+npm run lint       # Type check with tsc --noEmit
+npm run test       # Run unit tests
+npm run test:live  # Run live tests (requires LIVE_TEST=true)
+```
+## License
+MIT

package/README.md ADDED

@@ -0,0 +1,210 @@
+# search-papers
+A TypeScript library for searching academic papers across multiple sources, returning structured and normalized results.
+Built on [ghostfetch](https://github.com/user/ghostfetch) for robust HTTP requests with browser fingerprint spoofing and anti-bot bypass.
+## Features
+- **Multi-source search** - Query Google Scholar, Semantic Scholar, and arXiv in parallel
+- **Unified `Paper` interface** - All sources return the same structured format regardless of origin
+- **Deduplication** - Automatically merges duplicate papers across sources using DOI, canonical URL, and title matching
+- **Impact Factor ranking** - Results sorted by journal Impact Factor (static mapping of ~100 journals)
+- **Citation & reference lookup** - Retrieve citing/referenced papers via Semantic Scholar
+- **Anti-bot bypass** - ghostfetch handles browser spoofing, JS challenge solving, and redirect tracking
+- **Partial failure tolerance** - If one source fails, results from other sources are still returned
+## Requirements
+- Node.js >= 22.0.0
+## Installation
+```bash
+npm install search-papers
+```
+## Quick Start
+```typescript
+import { searchPapers, getPaper } from 'search-papers';
+// Search across all sources
+const result = await searchPapers('attention is all you need', {
+  limit: 10,
+});
+console.log(result.papers);
+// Search specific sources only
+const arxivOnly = await searchPapers('transformer', {
+  sources: ['arxiv'],
+  limit: 5,
+  sort: 'date',
+});
+// Look up a single paper by DOI
+const paper = await getPaper('10.48550/arXiv.1706.03762');
+console.log(paper?.title);
+```
+## API
+### `searchPapers(query, options?)`
+Search for papers across multiple sources simultaneously.
+```typescript
+const result = await searchPapers('deep learning', {
+  sources: ['semantic_scholar', 'google_scholar', 'arxiv'], // default: all
+  limit: 10,          // default: 10
+  offset: 0,
+  year: { from: 2020, to: 2024 },
+  sort: 'relevance',  // 'relevance' | 'date' | 'citations'
+  client: {
+    semanticScholarApiKey: 'your-key', // optional
+    proxy: 'http://proxy:8080',        // optional
+    timeout: 15000,                    // default: 15000ms
+  },
+});
+```
+**Returns**: `SearchResult`
+```typescript
+interface SearchResult {
+  query: string;
+  totalResults?: number;
+  papers: Paper[];
+  nextPageToken?: string;
+  source: SourceType;
+  errors?: SourceError[];  // errors from failed sources
+}
+```
+### `getPaper(doi, options?)`
+Look up a single paper by DOI using Semantic Scholar.
+```typescript
+const paper = await getPaper('10.1038/nature14539');
+// Returns Paper | null
+```
+### Paper Interface
+Every paper returned by any source conforms to this interface:
+```typescript
+interface Paper {
+  title: string;
+  authors: Author[];
+  abstract?: string;
+  year?: number;
+  venue?: string;           // journal or conference name
+  doi?: string;
+  url: string;              // link to the paper
+  canonicalUrl?: string;    // final redirect URL
+  pdfUrl?: string;
+  citationCount?: number;
+  impactFactor?: number;    // journal Impact Factor
+  source: SourceType;       // 'google_scholar' | 'semantic_scholar' | 'arxiv'
+  sourceId?: string;        // source-specific ID
+  tags?: string[];          // e.g. arXiv categories
+  references?: string[];
+}
+```
+## Using Individual Sources
+For more control, use source classes directly:
+```typescript
+import { createClient, SemanticScholarSource, GoogleScholarSource, ArxivSource } from 'search-papers';
+const client = createClient();
+// Semantic Scholar (implements CitationSource)
+const s2 = new SemanticScholarSource(client);
+const result = await s2.search('transformer', { limit: 5 });
+const paper = await s2.getPaper('DOI:10.48550/arXiv.1706.03762');
+const citations = await s2.getCitations('204e3073870fae3d05bcbc2f6a8e263d9b72e776');
+const references = await s2.getReferences('204e3073870fae3d05bcbc2f6a8e263d9b72e776');
+// Google Scholar (implements PaperSource)
+const gs = new GoogleScholarSource(client);
+const gsResult = await gs.search('deep learning', { limit: 10 });
+// arXiv (implements PaperSource)
+const arxiv = new ArxivSource(client);
+const arxivResult = await arxiv.search('neural network', { limit: 10 });
+const arxivPaper = await arxiv.getPaper('1706.03762');
+await client.destroy();
+```
+## Sources
+| Source | Type | Search | Get Paper | Citations | References | Notes |
+|--------|------|--------|-----------|-----------|------------|-------|
+| Semantic Scholar | API (JSON) | Yes | Yes (DOI, paperId, etc.) | Yes | Yes | Optional API key for dedicated rate limit |
+| Google Scholar | Scraping (HTML) | Yes | Yes (title search) | No | No | CAPTCHA risk, 2-5s random delay |
+| arXiv | API (Atom XML) | Yes | Yes (arXiv ID) | No | No | 3s minimum delay between requests |
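The per-source delays in the Notes column can be pictured with two small helpers. This is only a sketch under assumptions: the names `waitBeforeNext` and `jitterDelay` are hypothetical and are not part of the package's documented API, and the package's actual throttling may be implemented differently.

```typescript
// Milliseconds still to wait so that at least `minGapMs` separates the next
// request from the previous one (e.g. 3000 ms for arXiv).
function waitBeforeNext(
  lastRequestAt: number | undefined,
  now: number,
  minGapMs: number,
): number {
  if (lastRequestAt === undefined) return 0; // first request: no wait
  return Math.max(0, lastRequestAt + minGapMs - now);
}

// Random delay in [minMs, maxMs), e.g. 2000 to 5000 ms for Google Scholar.
// `random` is injectable so the function can be tested deterministically.
function jitterDelay(
  minMs: number,
  maxMs: number,
  random: () => number = Math.random,
): number {
  return minMs + random() * (maxMs - minMs);
}

// Previous request at t=1000 ms, current time t=2500 ms, 3000 ms minimum gap.
const arxivWait = waitBeforeNext(1000, 2500, 3000);
// Fixed "random" value of 0.5 picks the midpoint of the 2000-5000 ms range.
const scholarWait = jitterDelay(2000, 5000, () => 0.5);
console.log(arxivWait, scholarWait); // 1500 3500
```

Keeping the delay computation pure (time and randomness passed in) makes it easy to unit test without real sleeps.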
+## Search Options
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `sources` | `SourceType[]` | All 3 sources | Which sources to query |
+| `limit` | `number` | `10` | Max results to return |
+| `offset` | `number` | `0` | Pagination offset |
+| `year` | `{ from?, to? }` | - | Publication year range filter |
+| `sort` | `string` | `'relevance'` | Sort order: `'relevance'`, `'date'`, `'citations'` |
+## Client Options
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `browser` | `string` | `'Chrome_131'` | Browser to spoof |
+| `timeout` | `number` | `15000` | Request timeout in ms |
+| `proxy` | `string` | - | HTTP proxy URL |
+| `proxyPool` | `string[]` | - | Proxy pool with round-robin rotation |
+| `semanticScholarApiKey` | `string` | - | Semantic Scholar API key |
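As an illustration of what "round-robin rotation" over a `proxyPool` means, here is a minimal sketch. The helper `createRoundRobin` and the proxy URLs are hypothetical; the README does not show how the package or ghostfetch implement rotation internally.

```typescript
// Returns a function that cycles through `items` in order, wrapping around
// to the first item after the last one.
function createRoundRobin<T>(items: readonly T[]): () => T {
  if (items.length === 0) throw new Error('pool must not be empty');
  let next = 0;
  return () => {
    const item = items[next] as T;
    next = (next + 1) % items.length;
    return item;
  };
}

// Placeholder proxy URLs, for illustration only.
const nextProxy = createRoundRobin(['http://proxy-a:8080', 'http://proxy-b:8080']);
const picks = [nextProxy(), nextProxy(), nextProxy()];
console.log(picks); // proxy-a, proxy-b, then proxy-a again
```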
+## How It Works
+1. **Parallel queries** - All selected sources are queried simultaneously via `Promise.allSettled`
+2. **Canonical URL resolution** - ghostfetch follows redirects to determine the final URL of each paper
+3. **Impact Factor lookup** - Each paper's venue is matched against a static journal Impact Factor table
+4. **Deduplication** - Papers are deduplicated using DOI > canonical URL > normalized title (in priority order), merging metadata from multiple sources
+5. **Sorting** - Results are sorted by Impact Factor (descending), then by citation count
+6. **Limit** - Final results are trimmed to the requested limit
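Steps 4 to 6 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's actual code: `PaperLite` is a trimmed-down hypothetical type, the title normalization (lowercase, ASCII letters and digits only) is a guess, missing Impact Factor or citation values are treated as 0, and the sample Impact Factor value is made up.

```typescript
// Hypothetical subset of the Paper interface, enough for the sketch.
interface PaperLite {
  title: string;
  doi?: string;
  canonicalUrl?: string;
  citationCount?: number;
  impactFactor?: number;
}

// Assumed normalization: lowercase, collapse non-alphanumerics to spaces.
function normalizeTitle(title: string): string {
  return title.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim();
}

// Candidate match keys in priority order: DOI > canonical URL > title.
function keysFor(p: PaperLite): string[] {
  const keys: string[] = [];
  if (p.doi) keys.push(`doi:${p.doi.toLowerCase()}`);
  if (p.canonicalUrl) keys.push(`url:${p.canonicalUrl}`);
  keys.push(`title:${normalizeTitle(p.title)}`);
  return keys;
}

// Merge duplicates: the first occurrence wins, later ones only fill gaps.
function dedupe(papers: PaperLite[]): PaperLite[] {
  const index = new Map<string, PaperLite>();
  const unique: PaperLite[] = [];
  for (const paper of papers) {
    let existing: PaperLite | undefined;
    for (const key of keysFor(paper)) {
      existing = index.get(key);
      if (existing) break; // highest-priority key that matches
    }
    if (existing) {
      existing.doi ??= paper.doi;
      existing.canonicalUrl ??= paper.canonicalUrl;
      existing.citationCount ??= paper.citationCount;
      existing.impactFactor ??= paper.impactFactor;
    } else {
      existing = { ...paper };
      unique.push(existing);
    }
    // Re-index so keys gained through merging also match later papers.
    for (const key of keysFor(existing)) index.set(key, existing);
  }
  return unique;
}

// Sort by Impact Factor (descending), then citation count, then trim.
function rank(papers: PaperLite[], limit: number): PaperLite[] {
  return papers
    .slice()
    .sort(
      (a, b) =>
        (b.impactFactor ?? 0) - (a.impactFactor ?? 0) ||
        (b.citationCount ?? 0) - (a.citationCount ?? 0),
    )
    .slice(0, limit);
}

const merged = rank(
  dedupe([
    { title: 'Deep Learning', doi: '10.1038/nature14539', impactFactor: 50 }, // made-up IF
    { title: 'Deep learning.', citationCount: 100 }, // same paper, matched by title
    { title: 'Another Paper', citationCount: 5 },
  ]),
  10,
);
console.log(merged.length); // 2
```

The first two records share a normalized title, so they merge into one entry that carries both the DOI and the citation count.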
+## Error Handling
+The library tolerates partial failures. If one source fails, results from the other sources are still returned:
+```typescript
+const result = await searchPapers('query');
+if (result.errors) {
+  for (const err of result.errors) {
+    console.warn(`${err.source}: ${err.message} (${err.code})`);
+    // err.code: 'RATE_LIMITED' | 'CAPTCHA' | 'TIMEOUT' | 'NETWORK_ERROR' | 'PARSE_ERROR' | 'UNKNOWN'
+  }
+}
+// result.papers still contains results from successful sources
+```
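The partial-failure behavior follows naturally from `Promise.allSettled`, which the README names as the mechanism for parallel queries. The sketch below shows the general pattern only: `Fetcher`, `Outcome`, and `searchAll` are hypothetical names, results are reduced to plain title strings, and the mapping of failures to codes such as `'CAPTCHA'` or `'TIMEOUT'` is omitted.

```typescript
// Hypothetical stand-in for a source adapter: resolves to a list of titles.
type Fetcher = () => Promise<string[]>;

interface Outcome {
  titles: string[];
  errors: { source: string; message: string }[];
}

// Query every source in parallel; collect results from the ones that
// succeed and record an error entry for each one that rejects.
async function searchAll(fetchers: Record<string, Fetcher>): Promise<Outcome> {
  const entries = Object.entries(fetchers);
  const settled = await Promise.allSettled(entries.map(([, run]) => run()));
  const outcome: Outcome = { titles: [], errors: [] };
  settled.forEach((result, i) => {
    const source = entries[i]?.[0] ?? 'unknown';
    if (result.status === 'fulfilled') {
      outcome.titles.push(...result.value);
    } else {
      const reason: unknown = result.reason;
      outcome.errors.push({
        source,
        message: reason instanceof Error ? reason.message : String(reason),
      });
    }
  });
  return outcome;
}

// One simulated source fails; the other two still contribute results.
const demo = searchAll({
  arxiv: async () => ['Paper A'],
  google_scholar: async () => {
    throw new Error('CAPTCHA');
  },
  semantic_scholar: async () => ['Paper B'],
});
demo.then((o) => console.log(o.titles, o.errors));
```

Unlike `Promise.all`, `Promise.allSettled` never rejects, so a single failing source cannot discard the results already obtained from the others.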
+## Development
+```bash
+npm run build      # Build ESM + CJS + .d.ts via tsup
+npm run lint       # Type check with tsc --noEmit
+npm run test       # Run unit tests
+npm run test:live  # Run live tests (requires LIVE_TEST=true)
+```
+## License
+MIT