osint-feed 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +296 -0
- package/dist/index.d.ts +161 -0
- package/dist/index.js +495 -0
- package/dist/index.js.map +1 -0
- package/package.json +50 -0
package/README.md
ADDED
|
@@ -0,0 +1,296 @@
|
|
|
1
|
+
# osint-feed
|
|
2
|
+
|
|
3
|
+
Config-driven news harvester for Node.js. Pulls articles from RSS feeds and HTML pages, deduplicates them, and produces a compact digest ready to feed into an LLM context window.
|
|
4
|
+
|
|
5
|
+
No AI inside. No opinions about your stack. Just articles in, structured data out.
|
|
6
|
+
|
|
7
|
+
Use RSS when you have it. Use HTML selectors when you do not. Filtering for your specific topic belongs in the app that consumes this library.
|
|
8
|
+
|
|
9
|
+
## Why
|
|
10
|
+
|
|
11
|
+
You're building something that needs fresh news context — a SITREP generator, a threat monitor, a research assistant. You have 30+ sources across languages and formats. You need the data compact enough to fit in a Llama/GPT context window without blowing the budget.
|
|
12
|
+
|
|
13
|
+
Existing tools are either Python-only (newspaper4k), heavy self-hosted platforms (Huginn), or commercial APIs (Newscatcher, NewsAPI). Nothing in the JS/TS ecosystem does config-driven multi-source harvesting with built-in LLM-ready compression.
|
|
14
|
+
|
|
15
|
+
`osint-feed` fills that gap.
|
|
16
|
+
|
|
17
|
+
## Install
|
|
18
|
+
|
|
19
|
+
```bash
|
|
20
|
+
npm install osint-feed
|
|
21
|
+
```
|
|
22
|
+
|
|
23
|
+
Requires Node.js 18+.
|
|
24
|
+
|
|
25
|
+
## Quick Start
|
|
26
|
+
|
|
27
|
+
```typescript
|
|
28
|
+
import { createHarvester } from "osint-feed";
|
|
29
|
+
|
|
30
|
+
const harvester = createHarvester({
|
|
31
|
+
sources: [
|
|
32
|
+
{
|
|
33
|
+
id: "bbc-world",
|
|
34
|
+
name: "BBC World",
|
|
35
|
+
type: "rss",
|
|
36
|
+
url: "https://feeds.bbci.co.uk/news/world/rss.xml",
|
|
37
|
+
tags: ["global", "uk"],
|
|
38
|
+
interval: 15,
|
|
39
|
+
},
|
|
40
|
+
{
|
|
41
|
+
id: "nato",
|
|
42
|
+
name: "NATO Newsroom",
|
|
43
|
+
type: "html",
|
|
44
|
+
url: "https://www.nato.int/cps/en/natohq/news.htm",
|
|
45
|
+
tags: ["nato"],
|
|
46
|
+
interval: 30,
|
|
47
|
+
selectors: {
|
|
48
|
+
article: ".event-list-item",
|
|
49
|
+
title: "a span:first-child",
|
|
50
|
+
link: "a",
|
|
51
|
+
date: ".event-date",
|
|
52
|
+
},
|
|
53
|
+
},
|
|
54
|
+
],
|
|
55
|
+
});
|
|
56
|
+
|
|
57
|
+
// Fetch everything
|
|
58
|
+
const articles = await harvester.fetchAll();
|
|
59
|
+
|
|
60
|
+
// Or get an LLM-ready digest
|
|
61
|
+
const { articles: digest, stats } = await harvester.digest();
|
|
62
|
+
console.log(`${stats.totalFetched} articles -> ${stats.afterDedup} unique -> ${stats.estimatedTokens} tokens`);
|
|
63
|
+
```
|
|
64
|
+
|
|
65
|
+
## Source Types
|
|
66
|
+
|
|
67
|
+
### RSS / Atom
|
|
68
|
+
|
|
69
|
+
Works out of the box. No selectors needed — feeds are parsed automatically.
|
|
70
|
+
|
|
71
|
+
```typescript
|
|
72
|
+
{
|
|
73
|
+
id: "france24",
|
|
74
|
+
name: "France24",
|
|
75
|
+
type: "rss",
|
|
76
|
+
url: "https://www.france24.com/en/rss",
|
|
77
|
+
tags: ["global", "europe"],
|
|
78
|
+
interval: 15,
|
|
79
|
+
}
|
|
80
|
+
```
|
|
81
|
+
|
|
82
|
+
### HTML Scraping
|
|
83
|
+
|
|
84
|
+
You define CSS selectors per source. The library uses [cheerio](https://github.com/cheeriojs/cheerio) — no headless browser, no Puppeteer overhead.
|
|
85
|
+
|
|
86
|
+
This is still config-driven scraping: the library does not auto-discover article lists or infer what is relevant to your use case.
|
|
87
|
+
|
|
88
|
+
```typescript
|
|
89
|
+
{
|
|
90
|
+
id: "defence24",
|
|
91
|
+
name: "Defence24",
|
|
92
|
+
type: "html",
|
|
93
|
+
url: "https://defence24.pl/",
|
|
94
|
+
tags: ["poland", "defence"],
|
|
95
|
+
interval: 15,
|
|
96
|
+
selectors: {
|
|
97
|
+
article: "article", // repeating container
|
|
98
|
+
title: "h2 a", // title text (within article)
|
|
99
|
+
link: "h2 a", // link href (within article)
|
|
100
|
+
date: "time", // optional: publication date
|
|
101
|
+
summary: ".lead", // optional: description text
|
|
102
|
+
},
|
|
103
|
+
}
|
|
104
|
+
```
|
|
105
|
+
|
|
106
|
+
## API
|
|
107
|
+
|
|
108
|
+
### `createHarvester(options)`
|
|
109
|
+
|
|
110
|
+
Creates a harvester instance. Options:
|
|
111
|
+
|
|
112
|
+
| Option | Type | Default | Description |
|
|
113
|
+
|--------|------|---------|-------------|
|
|
114
|
+
| `sources` | `SourceConfig[]` | required | Array of source definitions |
|
|
115
|
+
| `dedup.known` | `() => string[]` | — | Returns hashes already in your DB (for cross-session dedup) |
|
|
116
|
+
| `digest` | `DigestOptions` | see below | Default digest settings |
|
|
117
|
+
| `requestTimeout` | `number` | `15000` | HTTP timeout in ms |
|
|
118
|
+
| `requestGap` | `number` | `1000` | Minimum ms between requests (rate limiting) |
|
|
119
|
+
| `maxItemsPerSource` | `number` | `50` | Cap articles returned from one source |
|
|
120
|
+
| `fetch` | `Function` | global fetch | Custom fetch for proxies/testing |
|
|
121
|
+
| `onError` | `Function` | — | Callback for per-source fetch or parse errors |
|
|
122
|
+
| `onWarning` | `Function` | — | Callback for non-fatal source diagnostics |
|
|
123
|
+
|
|
124
|
+
### `harvester.fetchAll()`
|
|
125
|
+
|
|
126
|
+
Fetches all enabled sources. Returns `Article[]`.
|
|
127
|
+
|
|
128
|
+
If one source fails, the method still returns articles from the remaining sources and reports the problem through `onError` when provided.
|
|
129
|
+
|
|
130
|
+
### `harvester.fetch(sourceId)`
|
|
131
|
+
|
|
132
|
+
Fetches a single source by ID.
|
|
133
|
+
|
|
134
|
+
### `harvester.fetchByTags(tags)`
|
|
135
|
+
|
|
136
|
+
Fetches sources matching any of the given tags.
|
|
137
|
+
|
|
138
|
+
### `harvester.digest(options?)`
|
|
139
|
+
|
|
140
|
+
The main event. Fetches all sources, then runs the compression pipeline:
|
|
141
|
+
|
|
142
|
+
1. **Dedup** — Groups similar headlines (Jaccard similarity) and keeps the richest version
|
|
143
|
+
2. **Sort** — Newest first
|
|
144
|
+
3. **Tag budget** — Caps articles per tag so no single region dominates
|
|
145
|
+
4. **Truncate** — Cuts content to N characters per article
|
|
146
|
+
5. **Token budget** — Trims from the bottom until under the token limit
|
|
147
|
+
|
|
148
|
+
```typescript
|
|
149
|
+
const { articles, stats } = await harvester.digest({
|
|
150
|
+
maxTokens: 12_000, // total token budget
|
|
151
|
+
maxArticlesPerTag: 10, // max articles per tag group
|
|
152
|
+
maxContentLength: 500, // chars per article content
|
|
153
|
+
similarityThreshold: 0.6, // title dedup threshold (0-1)
|
|
154
|
+
sort: "recency",
|
|
155
|
+
});
|
|
156
|
+
|
|
157
|
+
// stats.totalFetched → 700 (raw from all sources)
|
|
158
|
+
// stats.afterDedup → 200 (unique stories)
|
|
159
|
+
// stats.afterBudget → 80 (within tag limits)
|
|
160
|
+
// stats.estimatedTokens → 11800 (final token count, within the 12,000 budget)
|
|
161
|
+
```
|
|
162
|
+
|
|
163
|
+
### `harvester.start(callbacks)` / `harvester.stop()`
|
|
164
|
+
|
|
165
|
+
Runs sources on their configured intervals. You handle storage.
|
|
166
|
+
|
|
167
|
+
```typescript
|
|
168
|
+
harvester.start({
|
|
169
|
+
onArticles: async (articles, source) => {
|
|
170
|
+
await db.insert("articles", articles);
|
|
171
|
+
console.log(`${articles.length} new from ${source.name}`);
|
|
172
|
+
},
|
|
173
|
+
onError: (err, source) => {
|
|
174
|
+
console.error(`${source.name} failed:`, err);
|
|
175
|
+
},
|
|
176
|
+
onWarning: (warning, source) => {
|
|
177
|
+
console.warn(`${source.name}: ${warning.code} - ${warning.message}`);
|
|
178
|
+
},
|
|
179
|
+
});
|
|
180
|
+
|
|
181
|
+
// Later:
|
|
182
|
+
harvester.stop();
|
|
183
|
+
```
|
|
184
|
+
|
|
185
|
+
## Article Schema
|
|
186
|
+
|
|
187
|
+
```typescript
|
|
188
|
+
interface Article {
|
|
189
|
+
sourceId: string; // matches source config id
|
|
190
|
+
url: string; // canonical article URL
|
|
191
|
+
title: string;
|
|
192
|
+
content: string | null; // full text (when available)
|
|
193
|
+
summary: string | null; // short description
|
|
194
|
+
publishedAt: Date | null;
|
|
195
|
+
hash: string; // SHA-256 of URL (dedup key)
|
|
196
|
+
fetchedAt: Date;
|
|
197
|
+
tags: string[]; // inherited from source
|
|
198
|
+
}
|
|
199
|
+
```
|
|
200
|
+
|
|
201
|
+
## Dedup Across Sessions
|
|
202
|
+
|
|
203
|
+
The library handles within-batch dedup automatically. For cross-session dedup (don't re-process articles already in your DB), pass a `known` callback:
|
|
204
|
+
|
|
205
|
+
```typescript
|
|
206
|
+
const harvester = createHarvester({
|
|
207
|
+
sources,
|
|
208
|
+
dedup: {
|
|
209
|
+
known: async () => {
|
|
210
|
+
const rows = await db.query("SELECT hash FROM articles");
|
|
211
|
+
return rows.map(r => r.hash);
|
|
212
|
+
},
|
|
213
|
+
},
|
|
214
|
+
});
|
|
215
|
+
|
|
216
|
+
// fetchAll() now skips articles whose URL hash is already known
|
|
217
|
+
```
|
|
218
|
+
|
|
219
|
+
## Diagnostics
|
|
220
|
+
|
|
221
|
+
The library keeps the happy path simple: `fetchAll()` and `digest()` still return article data directly.
|
|
222
|
+
|
|
223
|
+
Use `onError` and `onWarning` if you want visibility into partial failures or weak-quality source output.
|
|
224
|
+
|
|
225
|
+
- `onError` covers hard failures like timeouts, HTTP errors, and parsing failures.
|
|
226
|
+
- `onWarning` covers non-fatal issues like empty source results, missing publication dates, or per-source truncation.
|
|
227
|
+
|
|
228
|
+
This matches the typical small-library OSS pattern: easy defaults, optional hooks for logging and monitoring.
|
|
229
|
+
|
|
230
|
+
## Scope and Limits
|
|
231
|
+
|
|
232
|
+
- RSS and HTML are first-class source types.
|
|
233
|
+
- HTML works best when you can define stable list selectors.
|
|
234
|
+
- The library does not execute page JavaScript or run a headless browser.
|
|
235
|
+
- The library does not decide what is relevant for your domain; apply your own filters downstream.
|
|
236
|
+
|
|
237
|
+
## Use with Next.js
|
|
238
|
+
|
|
239
|
+
```typescript
|
|
240
|
+
// app/api/feed/route.ts
|
|
241
|
+
import { createHarvester } from "osint-feed";
|
|
242
|
+
|
|
243
|
+
const harvester = createHarvester({ sources: [...] });
|
|
244
|
+
|
|
245
|
+
export async function GET() {
|
|
246
|
+
const { articles, stats } = await harvester.digest({ maxTokens: 8000 });
|
|
247
|
+
return Response.json({ articles, stats });
|
|
248
|
+
}
|
|
249
|
+
```
|
|
250
|
+
|
|
251
|
+
## Use with Express
|
|
252
|
+
|
|
253
|
+
```typescript
|
|
254
|
+
import express from "express";
|
|
255
|
+
import { createHarvester } from "osint-feed";
|
|
256
|
+
|
|
257
|
+
const app = express();
|
|
258
|
+
const harvester = createHarvester({ sources: [...] });
|
|
259
|
+
|
|
260
|
+
app.get("/digest", async (_req, res) => {
|
|
261
|
+
const result = await harvester.digest();
|
|
262
|
+
res.json(result);
|
|
263
|
+
});
|
|
264
|
+
```
|
|
265
|
+
|
|
266
|
+
## How the Digest Math Works
|
|
267
|
+
|
|
268
|
+
Real numbers from a smoke test with 10 RSS + 3 HTML sources:
|
|
269
|
+
|
|
270
|
+
```
|
|
271
|
+
Raw fetch: 324 articles
|
|
272
|
+
After title dedup: 319 unique stories
|
|
273
|
+
After tag budget: 47 (8 per tag, 6 tags)
|
|
274
|
+
Estimated tokens: 5,781
|
|
275
|
+
```
|
|
276
|
+
|
|
277
|
+
That's **about 4.5% of Llama 3.1's 128k context**. Plenty of room for system prompt, history, and reasoning.
|
|
278
|
+
|
|
279
|
+
With 35 sources polling every 15 min you'd get ~700 articles/hour. The digest pipeline compresses that to ~80 articles within the default ~12k-token budget. Adjust `maxArticlesPerTag` and `maxTokens` to taste.
|
|
280
|
+
|
|
281
|
+
## Dependencies
|
|
282
|
+
|
|
283
|
+
Just two:
|
|
284
|
+
|
|
285
|
+
- [`cheerio`](https://github.com/cheeriojs/cheerio) — HTML parsing
|
|
286
|
+
- [`rss-parser`](https://github.com/rbren/rss-parser) — RSS/Atom parsing
|
|
287
|
+
|
|
288
|
+
No headless browsers. No native modules. No bloat.
|
|
289
|
+
|
|
290
|
+
## License
|
|
291
|
+
|
|
292
|
+
MIT
|
|
293
|
+
|
|
294
|
+
## Disclaimer
|
|
295
|
+
|
|
296
|
+
This library is a tool for fetching and parsing publicly available web content. Users are responsible for compliance with target websites' terms of service and applicable laws. The authors assume no liability for how the library is used.
|
package/dist/index.d.ts
ADDED
|
@@ -0,0 +1,161 @@
|
|
|
1
|
+
/** Configuration for an RSS/Atom feed source. */
interface RssSourceConfig {
    /** Unique source identifier; stamped onto every Article.sourceId. */
    readonly id: string;
    /** Human-readable source name. */
    readonly name: string;
    /** Discriminant for the SourceConfig union. */
    readonly type: "rss";
    /** Feed URL to fetch. */
    readonly url: string;
    /** Tags inherited by every article from this source. */
    readonly tags: readonly string[];
    /** Minutes between checks. Defaults to 15. */
    readonly interval?: number;
    /** Whether this source is active. Defaults to true. */
    readonly enabled?: boolean;
}
|
|
12
|
+
interface HtmlSelectors {
|
|
13
|
+
/** Selector for the repeating article container element. */
|
|
14
|
+
readonly article: string;
|
|
15
|
+
/** Selector (within article) for the title text. */
|
|
16
|
+
readonly title: string;
|
|
17
|
+
/** Selector (within article) for the link href. */
|
|
18
|
+
readonly link: string;
|
|
19
|
+
/** Selector (within article) for the publication date. */
|
|
20
|
+
readonly date?: string;
|
|
21
|
+
/** Selector (within article) for summary/description text. */
|
|
22
|
+
readonly summary?: string;
|
|
23
|
+
}
|
|
24
|
+
interface HtmlSourceConfig {
|
|
25
|
+
readonly id: string;
|
|
26
|
+
readonly name: string;
|
|
27
|
+
readonly type: "html";
|
|
28
|
+
readonly url: string;
|
|
29
|
+
readonly tags: readonly string[];
|
|
30
|
+
readonly selectors: HtmlSelectors;
|
|
31
|
+
/** Minutes between checks. Defaults to 15. */
|
|
32
|
+
readonly interval?: number;
|
|
33
|
+
/** Whether this source is active. Defaults to true. */
|
|
34
|
+
readonly enabled?: boolean;
|
|
35
|
+
}
|
|
36
|
+
/** Union of all supported source configurations, discriminated by `type`. */
type SourceConfig = RssSourceConfig | HtmlSourceConfig;
/** Machine-readable codes for non-fatal source diagnostics. */
type HarvesterWarningCode = "empty-html-result" | "empty-rss-result" | "missing-published-at" | "truncated-source";
/** Non-fatal diagnostic delivered through onWarning callbacks. */
interface HarvesterWarning {
    readonly code: HarvesterWarningCode;
    /** Human-readable description of the issue. */
    readonly message: string;
    /** Optional structured context (counts, limits, etc.). */
    readonly details?: Readonly<Record<string, string | number | boolean>>;
}
|
|
43
|
+
/** A single harvested news item. */
interface Article {
    /** Id of the SourceConfig this article came from. */
    readonly sourceId: string;
    /** Canonical article URL. */
    readonly url: string;
    readonly title: string;
    /** Full text when available; null otherwise. */
    readonly content: string | null;
    /** Short description when available; null otherwise. */
    readonly summary: string | null;
    /** Publication date, or null when the source did not provide one. */
    readonly publishedAt: Date | null;
    /** SHA-256 hex hash of the url — stable dedup key. */
    readonly hash: string;
    /** When the harvester fetched this article. */
    readonly fetchedAt: Date;
    /** Tags inherited from the source configuration. */
    readonly tags: readonly string[];
}
|
|
55
|
+
interface DigestOptions {
|
|
56
|
+
/** Approximate upper bound of tokens in the output. Defaults to 12000. */
|
|
57
|
+
readonly maxTokens?: number;
|
|
58
|
+
/** Maximum articles per tag group. Defaults to 10. */
|
|
59
|
+
readonly maxArticlesPerTag?: number;
|
|
60
|
+
/** Maximum character length of content/summary per article. Defaults to 500. */
|
|
61
|
+
readonly maxContentLength?: number;
|
|
62
|
+
/** Sort strategy. Defaults to "recency". */
|
|
63
|
+
readonly sort?: "recency" | "relevance";
|
|
64
|
+
/**
|
|
65
|
+
* Jaccard similarity threshold (0–1) for title dedup.
|
|
66
|
+
* Titles more similar than this are considered duplicates.
|
|
67
|
+
* Defaults to 0.6.
|
|
68
|
+
*/
|
|
69
|
+
readonly similarityThreshold?: number;
|
|
70
|
+
}
|
|
71
|
+
interface DigestResult {
|
|
72
|
+
readonly articles: readonly Article[];
|
|
73
|
+
readonly stats: {
|
|
74
|
+
readonly totalFetched: number;
|
|
75
|
+
readonly afterDedup: number;
|
|
76
|
+
readonly afterBudget: number;
|
|
77
|
+
readonly estimatedTokens: number;
|
|
78
|
+
};
|
|
79
|
+
}
|
|
80
|
+
/** Callback returning hashes already known/stored — used for dedup. */
|
|
81
|
+
type KnownHashesFn = () => Promise<readonly string[]> | readonly string[];
|
|
82
|
+
interface HarvesterOptions {
|
|
83
|
+
readonly sources: readonly SourceConfig[];
|
|
84
|
+
/** Dedup configuration. */
|
|
85
|
+
readonly dedup?: {
|
|
86
|
+
/** Callback returning hashes already known/stored. */
|
|
87
|
+
readonly known?: KnownHashesFn;
|
|
88
|
+
};
|
|
89
|
+
/** Default digest options. Can be overridden per call. */
|
|
90
|
+
readonly digest?: DigestOptions;
|
|
91
|
+
/**
|
|
92
|
+
* Request timeout in milliseconds. Defaults to 15000.
|
|
93
|
+
*/
|
|
94
|
+
readonly requestTimeout?: number;
|
|
95
|
+
/**
|
|
96
|
+
* Minimum gap between HTTP requests in milliseconds. Defaults to 1000.
|
|
97
|
+
*/
|
|
98
|
+
readonly requestGap?: number;
|
|
99
|
+
/** Maximum number of articles returned per source. Defaults to 50. */
|
|
100
|
+
readonly maxItemsPerSource?: number;
|
|
101
|
+
/**
|
|
102
|
+
* Custom fetch function. Defaults to global fetch.
|
|
103
|
+
* Useful for testing or proxying.
|
|
104
|
+
*/
|
|
105
|
+
readonly fetch?: (input: string | URL | Request, init?: RequestInit) => Promise<Response>;
|
|
106
|
+
/** Optional callback for per-source fetch or parse failures. */
|
|
107
|
+
readonly onError?: (error: unknown, source: SourceConfig) => void;
|
|
108
|
+
/** Optional callback for non-fatal source diagnostics. */
|
|
109
|
+
readonly onWarning?: (warning: HarvesterWarning, source: SourceConfig) => void;
|
|
110
|
+
}
|
|
111
|
+
interface SchedulerCallbacks {
|
|
112
|
+
readonly onArticles: (articles: readonly Article[], source: SourceConfig) => void | Promise<void>;
|
|
113
|
+
readonly onError?: (error: unknown, source: SourceConfig) => void;
|
|
114
|
+
readonly onWarning?: (warning: HarvesterWarning, source: SourceConfig) => void;
|
|
115
|
+
}
|
|
116
|
+
/** Harvester instance returned by createHarvester. */
interface Harvester {
    /** Fetch every enabled source; per-source failures are reported via onError. */
    fetchAll(): Promise<Article[]>;
    /** Fetch a single source by its configured id. */
    fetch(sourceId: string): Promise<Article[]>;
    /** Fetch all sources matching at least one of the given tags. */
    fetchByTags(tags: readonly string[]): Promise<Article[]>;
    /** Fetch all sources, then run the compression pipeline. */
    digest(options?: DigestOptions): Promise<DigestResult>;
    /** Begin polling sources on their configured intervals. */
    start(callbacks: SchedulerCallbacks): void;
    /** Stop all polling started by start(). */
    stop(): void;
}
|
|
124
|
+
|
|
125
|
+
declare const createHarvester: (options: HarvesterOptions) => Harvester;
|
|
126
|
+
|
|
127
|
+
/**
|
|
128
|
+
* Deduplicate articles by title similarity.
|
|
129
|
+
* When two articles are similar above the threshold, the one with more content wins.
|
|
130
|
+
*/
|
|
131
|
+
declare const dedup: (articles: readonly Article[], threshold: number) => Article[];
|
|
132
|
+
/**
|
|
133
|
+
* Full digest pipeline: dedup → sort → tag budget → truncate → token budget.
|
|
134
|
+
*/
|
|
135
|
+
declare const buildDigest: (articles: readonly Article[], options?: DigestOptions) => {
|
|
136
|
+
articles: Article[];
|
|
137
|
+
stats: {
|
|
138
|
+
totalFetched: number;
|
|
139
|
+
afterDedup: number;
|
|
140
|
+
afterBudget: number;
|
|
141
|
+
estimatedTokens: number;
|
|
142
|
+
};
|
|
143
|
+
};
|
|
144
|
+
|
|
145
|
+
interface FetchRssOptions {
|
|
146
|
+
readonly fetchFn?: (input: string | URL | Request, init?: RequestInit) => Promise<Response>;
|
|
147
|
+
readonly timeout?: number;
|
|
148
|
+
readonly maxItems?: number;
|
|
149
|
+
readonly onWarning?: (warning: HarvesterWarning) => void;
|
|
150
|
+
}
|
|
151
|
+
declare const fetchRss: (source: RssSourceConfig, options?: FetchRssOptions) => Promise<Article[]>;
|
|
152
|
+
|
|
153
|
+
interface FetchHtmlOptions {
|
|
154
|
+
readonly fetchFn?: (input: string | URL | Request, init?: RequestInit) => Promise<Response>;
|
|
155
|
+
readonly timeout?: number;
|
|
156
|
+
readonly maxItems?: number;
|
|
157
|
+
readonly onWarning?: (warning: HarvesterWarning) => void;
|
|
158
|
+
}
|
|
159
|
+
declare const fetchHtml: (source: HtmlSourceConfig, options?: FetchHtmlOptions) => Promise<Article[]>;
|
|
160
|
+
|
|
161
|
+
export { type Article, type DigestOptions, type DigestResult, type Harvester, type HarvesterOptions, type HarvesterWarning, type HarvesterWarningCode, type HtmlSelectors, type HtmlSourceConfig, type KnownHashesFn, type RssSourceConfig, type SchedulerCallbacks, type SourceConfig, buildDigest, createHarvester, dedup, fetchHtml, fetchRss };
|
package/dist/index.js
ADDED
|
@@ -0,0 +1,495 @@
|
|
|
1
|
+
// src/rss.ts
|
|
2
|
+
import RssParser from "rss-parser";
|
|
3
|
+
|
|
4
|
+
// src/utils.ts
|
|
5
|
+
import { createHash } from "crypto";
|
|
6
|
+
// SHA-256 hex digest of a URL — the stable dedup key used across the library.
var hashUrl = (url) => {
  const digest = createHash("sha256");
  digest.update(url);
  return digest.digest("hex");
};
|
|
7
|
+
// Split text into a set of lowercase word tokens (Latin incl. accented
// ranges, plus Cyrillic), dropping single-character tokens.
var tokenize = (text) => {
  const cleaned = text.toLowerCase().replace(/[^a-z0-9\u00C0-\u024F\u0400-\u04FF]+/gi, " ").trim();
  const tokens = new Set();
  for (const word of cleaned.split(/\s+/)) {
    if (word.length > 1) {
      tokens.add(word);
    }
  }
  return tokens;
};
// Jaccard similarity (|A ∩ B| / |A ∪ B|) over word-token sets.
// Two empty token sets count as identical (1); exactly one empty set gives 0.
var jaccardSimilarity = (a, b) => {
  const tokensA = tokenize(a);
  const tokensB = tokenize(b);
  if (tokensA.size === 0 && tokensB.size === 0) return 1;
  if (tokensA.size === 0 || tokensB.size === 0) return 0;
  let shared = 0;
  for (const token of tokensB) {
    if (tokensA.has(token)) {
      shared++;
    }
  }
  const combined = tokensA.size + tokensB.size - shared;
  return combined === 0 ? 0 : shared / combined;
};
|
|
23
|
+
// Crude token estimate used for the digest budget: roughly 4 chars per token.
var estimateTokens = (text) => {
  const APPROX_CHARS_PER_TOKEN = 4;
  return Math.ceil(text.length / APPROX_CHARS_PER_TOKEN);
};
// Collapse every whitespace run to a single space and trim both ends.
var normalizeText = (text) => {
  return text.replace(/\s+/g, " ").trim();
};
|
|
25
|
+
// Clip text to at most maxLength characters, preferring to break at the
// last space at or before the limit; appends "..." whenever clipping occurs.
var truncate = (text, maxLength) => {
  if (text.length <= maxLength) {
    return text;
  }
  const boundary = text.lastIndexOf(" ", maxLength);
  const head = boundary > 0 ? text.slice(0, boundary) : text.slice(0, maxLength);
  return `${head}...`;
};
|
|
30
|
+
// Serializes work so that consecutive runs are at least gapMs apart.
// Used for polite per-harvester rate limiting of HTTP requests.
var ThrottleQueue = class {
  // Timestamp (ms) of the most recent run; 0 means "never ran".
  lastRun = 0;
  constructor(gapMs) {
    this.gapMs = gapMs;
  }
  // Wait out any remaining gap since the last run, then invoke fn and
  // return its result (or promise).
  async run(fn) {
    const elapsed = Date.now() - this.lastRun;
    const remaining = this.gapMs - elapsed;
    if (remaining > 0) {
      await new Promise((done) => setTimeout(done, remaining));
    }
    this.lastRun = Date.now();
    return fn();
  }
};
|
|
45
|
+
// Resolve a possibly-relative href against a base URL. If either part is
// unusable by the URL constructor, fall back to the raw href unchanged.
var resolveUrl = (href, base) => {
  let absolute;
  try {
    absolute = new URL(href, base).href;
  } catch {
    absolute = href;
  }
  return absolute;
};
|
|
52
|
+
|
|
53
|
+
// src/rss.ts
|
|
54
|
+
// Identifying User-Agent sent with RSS requests (polite-crawler style).
var DEFAULT_UA = "osint-feed/0.1 (+https://github.com/osint-feed)";
// A fresh parser per fetch — avoids sharing parser state across sources.
var createRssParser = () => new RssParser();
/**
 * Fetch one RSS/Atom source and convert it into Article records.
 *
 * @param source  RssSourceConfig (id, url, tags, ...).
 * @param options fetchFn (defaults to global fetch), timeout in ms
 *                (default 15000), maxItems cap (default unlimited),
 *                and an optional onWarning callback.
 * @returns Promise<Article[]> — may be empty; emptiness is reported as an
 *          "empty-rss-result" warning rather than an error.
 * @throws  Error on non-2xx HTTP status; the AbortController also aborts
 *          the request after `timeout` ms (surfacing as an AbortError).
 */
var fetchRss = async (source, options = {}) => {
  // Captured up front so every article from this batch shares one fetchedAt.
  const now = /* @__PURE__ */ new Date();
  const {
    fetchFn = globalThis.fetch,
    timeout = 15e3,
    maxItems = Number.POSITIVE_INFINITY,
    onWarning
  } = options;
  // Timeout is enforced by aborting the in-flight request.
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeout);
  try {
    const res = await fetchFn(source.url, {
      headers: {
        "User-Agent": DEFAULT_UA,
        Accept: "application/rss+xml, application/xml, text/xml, */*"
      },
      signal: controller.signal
    });
    if (!res.ok) {
      throw new Error(`RSS fetch failed for ${source.id}: HTTP ${res.status}`);
    }
    const xml = await res.text();
    const feed = await createRssParser().parseString(xml);
    // Pre-cap item count, used to detect truncation below.
    const totalItems = feed.items.length;
    const articles = feedToArticles(feed, source, now, maxItems);
    // Non-fatal diagnostics: zero results, missing dates, truncation.
    if (articles.length === 0) {
      onWarning?.({
        code: "empty-rss-result",
        message: `RSS source '${source.id}' returned zero articles`
      });
    }
    if (articles.some((article) => article.publishedAt === null)) {
      const missingDates = articles.filter((article) => article.publishedAt === null).length;
      onWarning?.({
        code: "missing-published-at",
        message: `RSS source '${source.id}' returned articles without publication dates`,
        details: { missingDates, totalArticles: articles.length }
      });
    }
    if (Number.isFinite(maxItems) && totalItems > maxItems) {
      onWarning?.({
        code: "truncated-source",
        message: `RSS source '${source.id}' was truncated to ${maxItems} articles`,
        details: { maxItems, totalItems }
      });
    }
    return articles;
  } finally {
    // Always clear the abort timer, even when fetch/parse throws.
    clearTimeout(timer);
  }
};
|
|
107
|
+
// Parse one RSS date string into a Date, or null when absent or invalid.
var parseFeedDate = (raw) => {
  if (!raw) return null;
  const parsed = new Date(raw);
  return Number.isNaN(parsed.getTime()) ? null : parsed;
};
/**
 * Convert a parsed RSS feed into Article records for `source`.
 *
 * Items missing a link or a (normalized) title are skipped; at most
 * `maxItems` articles are produced. `isoDate` (normalized by rss-parser)
 * is preferred; `pubDate` is used as a fallback.
 *
 * Fix over the previous version: an *invalid* isoDate no longer blocks
 * the pubDate fallback — previously `if (item.isoDate) ... else if
 * (item.pubDate)` left publishedAt null whenever isoDate was present but
 * unparseable, even with a valid pubDate available.
 *
 * @param feed      parsed feed object with an `items` array.
 * @param source    RssSourceConfig providing id and tags.
 * @param fetchedAt shared fetch timestamp stamped onto every article.
 * @param maxItems  cap on the number of articles produced.
 * @returns Article[]
 */
var feedToArticles = (feed, source, fetchedAt, maxItems) => {
  const articles = [];
  for (const item of feed.items) {
    if (articles.length >= maxItems) break;
    const url = item.link?.trim();
    if (!url) continue;
    const title = normalizeText(item.title ?? "");
    if (!title) continue;
    // rss-parser exposes full body as content:encoded or content,
    // and a plain-text snippet as contentSnippet/summary.
    const content = item["content:encoded"] ?? item.content ?? null;
    const summary = item.contentSnippet ?? item.summary ?? null;
    const publishedAt = parseFeedDate(item.isoDate) ?? parseFeedDate(item.pubDate);
    articles.push({
      sourceId: source.id,
      url,
      title,
      content: typeof content === "string" ? normalizeText(content) : null,
      summary: typeof summary === "string" ? normalizeText(summary) : null,
      publishedAt,
      hash: hashUrl(url),
      fetchedAt,
      tags: [...source.tags]
    });
  }
  return articles;
};
|
|
139
|
+
|
|
140
|
+
// src/html.ts
|
|
141
|
+
import * as cheerio from "cheerio";
|
|
142
|
+
// Browser-like User-Agent for HTML pages, which often block obvious bots.
var DEFAULT_UA2 = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36";
// Extract a publication Date from the first element matching `selector`
// inside the article element. Checks the datetime attribute, then the
// content attribute, then the element text; returns null when nothing
// parses. NOTE(review): the text fallback relies on Date parsing of
// arbitrary strings — non-ISO formats are engine-dependent.
var parsePublishedAt = ($el, selector) => {
  if (!selector) return null;
  const node = $el.find(selector).first();
  for (const candidate of [node.attr("datetime"), node.attr("content"), node.text()]) {
    const raw = normalizeText(candidate ?? "");
    if (!raw) continue;
    const parsed = new Date(raw);
    if (!Number.isNaN(parsed.getTime())) {
      return parsed;
    }
  }
  return null;
};
|
|
161
|
+
/**
 * Parse an HTML document into Article records using the source's CSS
 * selectors: `selectors.article` matches the repeating container, and the
 * other selectors are evaluated *within* each container.
 *
 * Containers without a non-empty title or href are skipped. Content is
 * always null for HTML sources (list pages only carry teasers).
 *
 * @param html      raw HTML string.
 * @param source    HtmlSourceConfig (id, url, tags, selectors).
 * @param fetchedAt timestamp stamped onto each article.
 * @param maxItems  cap on articles produced.
 * @returns Article[]
 */
var parseHtml = (html, source, fetchedAt = /* @__PURE__ */ new Date(), maxItems = Number.POSITIVE_INFINITY) => {
  const $ = cheerio.load(html);
  const { selectors } = source;
  const articles = [];
  $(selectors.article).each((_i, el) => {
    // cheerio .each: returning false stops iteration (like `break`).
    if (articles.length >= maxItems) return false;
    const $el = $(el);
    const title = normalizeText($el.find(selectors.title).first().text());
    if (!title) return;
    const rawHref = normalizeText($el.find(selectors.link).first().attr("href") ?? "");
    if (!rawHref) return;
    // Relative hrefs are resolved against the page URL.
    const url = resolveUrl(rawHref, source.url);
    const publishedAt = parsePublishedAt($el, selectors.date);
    // Empty summary text is normalized to null rather than "".
    const summary = selectors.summary ? normalizeText($el.find(selectors.summary).first().text()) || null : null;
    articles.push({
      sourceId: source.id,
      url,
      title,
      content: null,
      summary,
      publishedAt,
      hash: hashUrl(url),
      fetchedAt,
      tags: [...source.tags]
    });
    // Explicit undefined = continue iteration (anything but false continues).
    return void 0;
  });
  return articles;
};
|
|
190
|
+
/**
 * Fetch one HTML source and scrape it into Article records.
 *
 * @param source  HtmlSourceConfig (url + CSS selectors).
 * @param options fetchFn (defaults to global fetch), timeout in ms
 *                (default 15000), maxItems cap (default unlimited), and
 *                an optional onWarning callback.
 * @returns Promise<Article[]>
 * @throws  Error on non-2xx HTTP status; the request is aborted after
 *          `timeout` ms via AbortController.
 */
var fetchHtml = async (source, options = {}) => {
  const {
    fetchFn = globalThis.fetch,
    timeout = 15e3,
    maxItems = Number.POSITIVE_INFINITY,
    onWarning
  } = options;
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeout);
  let html;
  try {
    const res = await fetchFn(source.url, {
      headers: {
        "User-Agent": DEFAULT_UA2,
        Accept: "text/html, application/xhtml+xml, */*"
      },
      signal: controller.signal
    });
    if (!res.ok) {
      throw new Error(`HTML fetch failed for ${source.id}: HTTP ${res.status}`);
    }
    html = await res.text();
  } finally {
    // Timer covers only the network request; parsing below is not aborted.
    clearTimeout(timer);
  }
  // Pre-cap container count, used to detect truncation below.
  // NOTE(review): this parses the document a second time (parseHtml loads
  // it again) — a candidate for sharing one cheerio instance.
  const totalMatches = cheerio.load(html)(source.selectors.article).length;
  const articles = parseHtml(html, source, /* @__PURE__ */ new Date(), maxItems);
  // Non-fatal diagnostics: zero results, missing dates, truncation.
  if (articles.length === 0) {
    onWarning?.({
      code: "empty-html-result",
      message: `HTML source '${source.id}' returned zero articles`
    });
  }
  if (articles.some((article) => article.publishedAt === null)) {
    const missingDates = articles.filter((article) => article.publishedAt === null).length;
    onWarning?.({
      code: "missing-published-at",
      message: `HTML source '${source.id}' returned articles without publication dates`,
      details: { missingDates, totalArticles: articles.length }
    });
  }
  if (Number.isFinite(maxItems) && totalMatches > maxItems) {
    onWarning?.({
      code: "truncated-source",
      message: `HTML source '${source.id}' was truncated to ${maxItems} articles`,
      details: { maxItems, totalMatches }
    });
  }
  return articles;
};
|
|
240
|
+
|
|
241
|
+
// src/digest.ts
|
|
242
|
+
// Built-in digest pipeline defaults; per-call DigestOptions override these.
var DEFAULT_DIGEST = {
  maxTokens: 12e3,          // approximate total token budget for the digest
  maxArticlesPerTag: 10,    // cap on articles contributed per tag group
  maxContentLength: 500,    // chars kept per article content/summary
  sort: "recency",          // newest-first ordering
  similarityThreshold: 0.6  // Jaccard threshold for title dedup (0-1)
};
|
|
249
|
+
/**
 * Deduplicate articles by title similarity (Jaccard over word tokens).
 *
 * The first article of a similarity group is kept in place; when a later
 * duplicate carries more content+summary text, it replaces the kept copy
 * ("richest version wins") without changing group order. Returns shallow
 * copies; the input array is not mutated.
 *
 * Fix over the previous version: the kept-list similarity scan ran twice
 * per article (`some` followed by `findIndex` with the identical
 * predicate); a single `findIndex` pass produces the same result with
 * half the similarity computations.
 *
 * @param articles  input articles, in priority order.
 * @param threshold Jaccard similarity (0-1) at or above which two titles
 *                  are considered the same story.
 * @returns Article[] of unique stories.
 */
var dedup = (articles, threshold) => {
  // Richness = combined length of content and summary text.
  const textLength = (a) => (a.content?.length ?? 0) + (a.summary?.length ?? 0);
  const kept = [];
  for (const article of articles) {
    const dupIdx = kept.findIndex(
      (existing) => jaccardSimilarity(existing.title, article.title) >= threshold
    );
    if (dupIdx === -1) {
      kept.push({ ...article });
    } else if (textLength(article) > textLength(kept[dupIdx])) {
      // Same story, richer body: swap it into the existing slot.
      kept[dupIdx] = { ...article };
    }
  }
  return kept;
};
|
|
273
|
+
// Cap how many articles each tag may contribute to the digest. An article
// is kept when ANY of its tags still has budget left; once kept it counts
// against every tag it carries. Untagged articles share the "_untagged"
// bucket. Preserves input order; does not mutate the input.
var applyTagBudget = (articles, maxPerTag) => {
  const usedPerTag = new Map();
  const kept = [];
  for (const article of articles) {
    const effectiveTags = article.tags.length > 0 ? article.tags : ["_untagged"];
    const hasRoom = effectiveTags.some((tag) => (usedPerTag.get(tag) ?? 0) < maxPerTag);
    if (!hasRoom) continue;
    kept.push(article);
    for (const tag of effectiveTags) {
      usedPerTag.set(tag, (usedPerTag.get(tag) ?? 0) + 1);
    }
  }
  return kept;
};
|
|
294
|
+
// Clamp each article's content and summary to `maxContentLength` characters
// (word-boundary truncation via `truncate`); null fields remain null, and the
// source articles are left untouched (fresh objects are returned).
var truncateArticles = (articles, maxContentLength) => {
  return articles.map((article) => {
    const { content, summary } = article;
    return {
      ...article,
      content: content ? truncate(content, maxContentLength) : null,
      summary: summary ? truncate(summary, maxContentLength) : null
    };
  });
};
|
|
299
|
+
// Sort articles newest-first without mutating the input. An article's
// effective date is publishedAt when known, otherwise fetchedAt.
// "relevance" is currently a placeholder that behaves exactly like
// "recency" — the original duplicated the identical sort block in both
// branches; they are collapsed into one comparator here. `sort` stays in
// the signature for forward compatibility with a real relevance heuristic.
var sortArticles = (articles, sort) => {
  const effectiveTime = (article) => article.publishedAt?.getTime() ?? article.fetchedAt.getTime();
  return [...articles].sort((a, b) => effectiveTime(b) - effectiveTime(a));
};
|
|
313
|
+
// Admit articles from the front until the estimated token total would exceed
// `maxTokens`. The first article is always admitted, so a single oversized
// item cannot produce an empty digest.
var applyTokenBudget = (articles, maxTokens) => {
  const selected = [];
  let spent = 0;
  for (const article of articles) {
    const combined = [article.title, article.summary, article.content].filter(Boolean).join(" ");
    const cost = estimateTokens(combined);
    const wouldOverflow = spent + cost > maxTokens;
    if (wouldOverflow && selected.length > 0) {
      break;
    }
    spent += cost;
    selected.push(article);
  }
  return selected;
};
|
|
327
|
+
// Full digest pipeline: dedup -> sort -> per-tag budget -> truncate -> token
// budget, reporting the article count surviving each budgeting stage plus the
// final estimated token footprint.
var buildDigest = (articles, options) => {
  const settings = { ...DEFAULT_DIGEST, ...options };
  const totalFetched = articles.length;
  const deduped = dedup(articles, settings.similarityThreshold);
  const afterDedup = deduped.length;
  const budgeted = applyTagBudget(
    sortArticles(deduped, settings.sort),
    settings.maxArticlesPerTag
  );
  const afterBudget = budgeted.length;
  const final = applyTokenBudget(
    truncateArticles(budgeted, settings.maxContentLength),
    settings.maxTokens
  );
  // Recount tokens on the truncated survivors for an honest budget estimate.
  const estimatedTokens = final.reduce((sum, article) => {
    const combined = [article.title, article.summary, article.content].filter(Boolean).join(" ");
    return sum + estimateTokens(combined);
  }, 0);
  return {
    articles: final,
    stats: { totalFetched, afterDedup, afterBudget, estimatedTokens }
  };
};
|
|
346
|
+
|
|
347
|
+
// src/scheduler.ts
|
|
348
|
+
// Start one polling loop per enabled source: an immediate fetch, then a
// repeat every `interval` minutes (default 15). Fetch errors are routed to
// callbacks.onError instead of killing the loop. The returned `stop` clears
// every timer and empties the registry.
var createScheduler = (sources, fetchSource2, callbacks) => {
  const entries = [];
  for (const source of sources) {
    if (source.enabled === false) continue;
    const intervalMs = (source.interval ?? 15) * 60000;
    const poll = async () => {
      try {
        const articles = await fetchSource2(source);
        if (articles.length > 0) {
          await callbacks.onArticles(articles, source);
        }
      } catch (error) {
        callbacks.onError?.(error, source);
      }
    };
    // Fire once right away; the interval only covers subsequent polls.
    void poll();
    entries.push({ source, timer: setInterval(() => void poll(), intervalMs) });
  }
  return {
    stop: () => {
      while (entries.length > 0) {
        const entry = entries.pop();
        clearInterval(entry.timer);
      }
    }
  };
};
|
|
376
|
+
|
|
377
|
+
// src/harvester.ts
|
|
378
|
+
// Fetch a single source through the shared throttle queue, dispatching to the
// RSS or HTML fetcher by source.type. Options are forwarded only when
// provided so each fetcher's own defaults still apply; onWarning is wrapped
// to attach the originating source.
var fetchSource = async (source, throttle, options = {}) => {
  return throttle.run(async () => {
    // Both fetchers accept the identical option shape; the original built the
    // same object twice (once per branch). Build it once instead.
    const fetcherOptions = {
      ...options.fetchFn ? { fetchFn: options.fetchFn } : {},
      ...options.timeout !== void 0 ? { timeout: options.timeout } : {},
      ...options.maxItemsPerSource !== void 0 ? { maxItems: options.maxItemsPerSource } : {},
      ...options.onWarning ? { onWarning: (warning) => options.onWarning?.(warning, source) } : {}
    };
    return source.type === "rss"
      ? fetchRss(source, fetcherOptions)
      : fetchHtml(source, fetcherOptions);
  });
};
|
|
398
|
+
// Drop articles whose hash the caller has already seen. `knownFn` may return
// the known-hash list synchronously or as a promise; when absent, the input
// passes through unchanged.
var filterKnown = async (articles, knownFn) => {
  if (!knownFn) return articles;
  const knownHashes = new Set(await knownFn());
  return articles.filter((article) => !knownHashes.has(article.hash));
};
|
|
403
|
+
// Build the public Harvester facade. Holds the source list, a ThrottleQueue
// enforcing `requestGap` ms between requests, and at most one active
// scheduler. All fetch paths funnel through fetchSingle -> fetchSource ->
// filterKnown.
var createHarvester = (options) => {
  const {
    sources,
    dedup: dedupOpts,
    digest: defaultDigestOpts,
    requestTimeout = 15e3,
    requestGap = 1e3,
    maxItemsPerSource = 50,
    onError,
    onWarning
  } = options;
  // Custom fetch implementation, if the caller supplied one.
  const fetchFn = options.fetch;
  const throttle = new ThrottleQueue(requestGap);
  // Active scheduler handle; null when not started.
  let scheduler = null;
  // Sources are enabled unless explicitly set to false.
  const getEnabledSources = () => sources.filter((s) => s.enabled !== false);
  // Fetch one source (throttled), then drop hashes the caller already knows.
  const fetchSingle = async (source) => {
    const fetchOptions = {
      timeout: requestTimeout,
      maxItemsPerSource,
      // Optional keys are added only when set so downstream defaults apply.
      ...fetchFn ? { fetchFn } : {},
      ...onWarning ? { onWarning } : {}
    };
    const articles = await fetchSource(source, throttle, fetchOptions);
    return filterKnown(articles, dedupOpts?.known);
  };
  // Fetch sources sequentially; a failing source is reported via onError and
  // skipped rather than aborting the whole run.
  const fetchMany = async (selectedSources) => {
    const results = [];
    for (const source of selectedSources) {
      try {
        const articles = await fetchSingle(source);
        results.push(...articles);
      } catch (error) {
        onError?.(error, source);
      }
    }
    return results;
  };
  return {
    // Fetch every enabled source.
    async fetchAll() {
      return fetchMany(getEnabledSources());
    },
    // Fetch one source by id; throws when the id is unknown. Note: this looks
    // up in `sources`, not getEnabledSources(), so a disabled source can
    // still be fetched explicitly.
    async fetch(sourceId) {
      const source = sources.find((s) => s.id === sourceId);
      if (!source) {
        throw new Error(`Source not found: ${sourceId}`);
      }
      return fetchSingle(source);
    },
    // Fetch every enabled source sharing at least one of the given tags.
    async fetchByTags(tags) {
      const tagSet = new Set(tags);
      const matching = getEnabledSources().filter(
        (s) => s.tags.some((t) => tagSet.has(t))
      );
      return fetchMany(matching);
    },
    // Fetch everything and run the digest pipeline; per-call overrides win
    // over the harvester-level digest defaults.
    async digest(overrides) {
      const articles = await fetchMany(getEnabledSources());
      const opts = { ...defaultDigestOpts, ...overrides };
      const result = buildDigest(articles, opts);
      return {
        articles: result.articles,
        stats: result.stats
      };
    },
    // Start (or restart) background polling. Per-call callbacks take
    // precedence over harvester-level onError/onWarning; `a ?? b ? ... : ...`
    // parses as `(a ?? b) ? ... : ...`, so the key is added when either
    // handler exists.
    start(callbacks) {
      if (scheduler) {
        scheduler.stop();
      }
      const schedulerCallbacks = {
        onArticles: callbacks.onArticles,
        ...callbacks.onError ?? onError ? { onError: callbacks.onError ?? onError } : {},
        ...callbacks.onWarning ?? onWarning ? { onWarning: callbacks.onWarning ?? onWarning } : {}
      };
      scheduler = createScheduler(
        getEnabledSources(),
        (source) => fetchSingle(source),
        schedulerCallbacks
      );
    },
    // Stop background polling; safe to call when not started.
    stop() {
      scheduler?.stop();
      scheduler = null;
    }
  };
};
|
|
488
|
+
// Public API surface of the bundle.
export {
  buildDigest,
  createHarvester,
  dedup,
  fetchHtml,
  fetchRss
};
|
|
495
|
+
//# sourceMappingURL=index.js.map
|
|
@@ -0,0 +1 @@
|
|
|
1
|
+
{"version":3,"sources":["../src/rss.ts","../src/utils.ts","../src/html.ts","../src/digest.ts","../src/scheduler.ts","../src/harvester.ts"],"sourcesContent":["import RssParser from \"rss-parser\";\nimport type { Article, HarvesterWarning, RssSourceConfig } from \"./types.js\";\nimport { hashUrl, normalizeText } from \"./utils.js\";\n\nconst DEFAULT_UA = \"osint-feed/0.1 (+https://github.com/osint-feed)\";\n\nconst createRssParser = (): RssParser => new RssParser();\n\ninterface FetchRssOptions {\n readonly fetchFn?: (input: string | URL | Request, init?: RequestInit) => Promise<Response>;\n readonly timeout?: number;\n readonly maxItems?: number;\n readonly onWarning?: (warning: HarvesterWarning) => void;\n}\n\nexport const fetchRss = async (\n source: RssSourceConfig,\n options: FetchRssOptions = {},\n): Promise<Article[]> => {\n const now = new Date();\n const {\n fetchFn = globalThis.fetch,\n timeout = 15_000,\n maxItems = Number.POSITIVE_INFINITY,\n onWarning,\n } = options;\n\n const controller = new AbortController();\n const timer = setTimeout(() => controller.abort(), timeout);\n\n try {\n const res = await fetchFn(source.url, {\n headers: {\n \"User-Agent\": DEFAULT_UA,\n Accept: \"application/rss+xml, application/xml, text/xml, */*\",\n },\n signal: controller.signal,\n });\n\n if (!res.ok) {\n throw new Error(`RSS fetch failed for ${source.id}: HTTP ${res.status}`);\n }\n\n const xml = await res.text();\n const feed = await createRssParser().parseString(xml);\n const totalItems = feed.items.length;\n const articles = feedToArticles(feed, source, now, maxItems);\n\n if (articles.length === 0) {\n onWarning?.({\n code: \"empty-rss-result\",\n message: `RSS source '${source.id}' returned zero articles`,\n });\n }\n\n if (articles.some((article) => article.publishedAt === null)) {\n const missingDates = articles.filter((article) => article.publishedAt === null).length;\n onWarning?.({\n code: \"missing-published-at\",\n message: `RSS source '${source.id}' 
returned articles without publication dates`,\n details: { missingDates, totalArticles: articles.length },\n });\n }\n\n if (Number.isFinite(maxItems) && totalItems > maxItems) {\n onWarning?.({\n code: \"truncated-source\",\n message: `RSS source '${source.id}' was truncated to ${maxItems} articles`,\n details: { maxItems, totalItems },\n });\n }\n\n return articles;\n } finally {\n clearTimeout(timer);\n }\n};\n\nconst feedToArticles = (\n feed: RssParser.Output<Record<string, unknown>>,\n source: RssSourceConfig,\n fetchedAt: Date,\n maxItems: number,\n): Article[] => {\n const articles: Article[] = [];\n\n for (const item of feed.items) {\n if (articles.length >= maxItems) break;\n\n const url = item.link?.trim();\n if (!url) continue;\n\n const title = normalizeText(item.title ?? \"\");\n if (!title) continue;\n\n const content = item[\"content:encoded\"] as string | undefined\n ?? item.content\n ?? null;\n\n const summary = item.contentSnippet\n ?? item.summary\n ?? null;\n\n let publishedAt: Date | null = null;\n if (item.isoDate) {\n const d = new Date(item.isoDate);\n if (!isNaN(d.getTime())) publishedAt = d;\n } else if (item.pubDate) {\n const d = new Date(item.pubDate);\n if (!isNaN(d.getTime())) publishedAt = d;\n }\n\n articles.push({\n sourceId: source.id,\n url,\n title,\n content: typeof content === \"string\" ? normalizeText(content) : null,\n summary: typeof summary === \"string\" ? normalizeText(summary) : null,\n publishedAt,\n hash: hashUrl(url),\n fetchedAt,\n tags: [...source.tags],\n });\n }\n\n return articles;\n};\n","import { createHash } from \"node:crypto\";\n\n/** SHA-256 hex hash of a string. 
*/\nexport const hashUrl = (url: string): string =>\n createHash(\"sha256\").update(url).digest(\"hex\");\n\nconst tokenize = (text: string): Set<string> => {\n const words = text\n .toLowerCase()\n .replace(/[^a-z0-9\\u00C0-\\u024F\\u0400-\\u04FF]+/gi, \" \")\n .trim()\n .split(/\\s+/)\n .filter((w) => w.length > 1);\n return new Set(words);\n};\n\n/**\n * Jaccard similarity between two strings (0–1).\n * Used for title-based dedup.\n */\nexport const jaccardSimilarity = (a: string, b: string): number => {\n const setA = tokenize(a);\n const setB = tokenize(b);\n if (setA.size === 0 && setB.size === 0) return 1;\n if (setA.size === 0 || setB.size === 0) return 0;\n\n let intersection = 0;\n for (const word of setA) {\n if (setB.has(word)) intersection++;\n }\n const union = setA.size + setB.size - intersection;\n return union === 0 ? 0 : intersection / union;\n};\n\n/**\n * Rough token count estimate. ~4 chars per token for English, slightly more for\n * non-Latin scripts. Good enough for budget estimation without a tokenizer dep.\n */\nexport const estimateTokens = (text: string): number =>\n Math.ceil(text.length / 4);\n\nexport const normalizeText = (text: string): string =>\n text.replace(/\\s+/g, \" \").trim();\n\n/**\n * Truncate a string to maxLength characters, breaking at a word boundary.\n */\nexport const truncate = (text: string, maxLength: number): string => {\n if (text.length <= maxLength) return text;\n const cut = text.lastIndexOf(\" \", maxLength);\n return (cut > 0 ? 
text.slice(0, cut) : text.slice(0, maxLength)) + \"...\";\n};\n\nexport class ThrottleQueue {\n private lastRun = 0;\n\n constructor(private readonly gapMs: number) {}\n\n async run<T>(fn: () => Promise<T>): Promise<T> {\n const now = Date.now();\n const wait = Math.max(0, this.gapMs - (now - this.lastRun));\n if (wait > 0) {\n await new Promise((resolve) => setTimeout(resolve, wait));\n }\n this.lastRun = Date.now();\n return fn();\n }\n}\n\nexport const resolveUrl = (href: string, base: string): string => {\n try {\n return new URL(href, base).href;\n } catch {\n return href;\n }\n};\n","import * as cheerio from \"cheerio\";\nimport type { AnyNode } from \"domhandler\";\nimport type { Article, HarvesterWarning, HtmlSourceConfig } from \"./types.js\";\nimport { hashUrl, normalizeText, resolveUrl } from \"./utils.js\";\n\n/**\n * Many government / military sites block non-browser user agents, so we default\n * to a real-looking UA. Consumers can override via HarvesterOptions.fetch.\n */\nconst DEFAULT_UA =\n \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36\";\n\ninterface FetchHtmlOptions {\n readonly fetchFn?: (input: string | URL | Request, init?: RequestInit) => Promise<Response>;\n readonly timeout?: number;\n readonly maxItems?: number;\n readonly onWarning?: (warning: HarvesterWarning) => void;\n}\n\nconst parsePublishedAt = ($el: cheerio.Cheerio<AnyNode>, selector?: string): Date | null => {\n if (!selector) return null;\n\n const dateEl = $el.find(selector).first();\n const candidates = [\n dateEl.attr(\"datetime\"),\n dateEl.attr(\"content\"),\n dateEl.text(),\n ];\n\n for (const candidate of candidates) {\n const raw = normalizeText(candidate ?? 
\"\");\n if (!raw) continue;\n const parsed = new Date(raw);\n if (!Number.isNaN(parsed.getTime())) {\n return parsed;\n }\n }\n\n return null;\n};\n\nexport const parseHtml = (\n html: string,\n source: HtmlSourceConfig,\n fetchedAt = new Date(),\n maxItems = Number.POSITIVE_INFINITY,\n): Article[] => {\n const $ = cheerio.load(html);\n const { selectors } = source;\n const articles: Article[] = [];\n\n $(selectors.article).each((_i, el) => {\n if (articles.length >= maxItems) return false;\n\n const $el = $(el);\n const title = normalizeText($el.find(selectors.title).first().text());\n if (!title) return;\n\n const rawHref = normalizeText($el.find(selectors.link).first().attr(\"href\") ?? \"\");\n if (!rawHref) return;\n\n const url = resolveUrl(rawHref, source.url);\n const publishedAt = parsePublishedAt($el, selectors.date);\n const summary = selectors.summary\n ? normalizeText($el.find(selectors.summary).first().text()) || null\n : null;\n\n articles.push({\n sourceId: source.id,\n url,\n title,\n content: null,\n summary,\n publishedAt,\n hash: hashUrl(url),\n fetchedAt,\n tags: [...source.tags],\n });\n\n return undefined;\n });\n\n return articles;\n};\n\nexport const fetchHtml = async (\n source: HtmlSourceConfig,\n options: FetchHtmlOptions = {},\n): Promise<Article[]> => {\n const {\n fetchFn = globalThis.fetch,\n timeout = 15_000,\n maxItems = Number.POSITIVE_INFINITY,\n onWarning,\n } = options;\n const controller = new AbortController();\n const timer = setTimeout(() => controller.abort(), timeout);\n\n let html: string;\n try {\n const res = await fetchFn(source.url, {\n headers: {\n \"User-Agent\": DEFAULT_UA,\n Accept: \"text/html, application/xhtml+xml, */*\",\n },\n signal: controller.signal,\n });\n if (!res.ok) {\n throw new Error(`HTML fetch failed for ${source.id}: HTTP ${res.status}`);\n }\n html = await res.text();\n } finally {\n clearTimeout(timer);\n }\n\n const totalMatches = cheerio.load(html)(source.selectors.article).length;\n const 
articles = parseHtml(html, source, new Date(), maxItems);\n\n if (articles.length === 0) {\n onWarning?.({\n code: \"empty-html-result\",\n message: `HTML source '${source.id}' returned zero articles`,\n });\n }\n\n if (articles.some((article) => article.publishedAt === null)) {\n const missingDates = articles.filter((article) => article.publishedAt === null).length;\n onWarning?.({\n code: \"missing-published-at\",\n message: `HTML source '${source.id}' returned articles without publication dates`,\n details: { missingDates, totalArticles: articles.length },\n });\n }\n\n if (Number.isFinite(maxItems) && totalMatches > maxItems) {\n onWarning?.({\n code: \"truncated-source\",\n message: `HTML source '${source.id}' was truncated to ${maxItems} articles`,\n details: { maxItems, totalMatches },\n });\n }\n\n return articles;\n};\n","import type { Article, DigestOptions } from \"./types.js\";\nimport { estimateTokens, jaccardSimilarity, truncate } from \"./utils.js\";\n\nconst DEFAULT_DIGEST: Required<DigestOptions> = {\n maxTokens: 12_000,\n maxArticlesPerTag: 10,\n maxContentLength: 500,\n sort: \"recency\",\n similarityThreshold: 0.6,\n};\n\n/**\n * Deduplicate articles by title similarity.\n * When two articles are similar above the threshold, the one with more content wins.\n */\nexport const dedup = (\n articles: readonly Article[],\n threshold: number,\n): Article[] => {\n const kept: Article[] = [];\n\n for (const article of articles) {\n const isDuplicate = kept.some(\n (existing) => jaccardSimilarity(existing.title, article.title) >= threshold,\n );\n if (!isDuplicate) {\n kept.push({ ...article });\n } else {\n const existingIdx = kept.findIndex(\n (existing) => jaccardSimilarity(existing.title, article.title) >= threshold,\n );\n if (existingIdx !== -1) {\n const existing = kept[existingIdx]!;\n const existingLen = (existing.content?.length ?? 0) + (existing.summary?.length ?? 0);\n const newLen = (article.content?.length ?? 
0) + (article.summary?.length ?? 0);\n if (newLen > existingLen) {\n kept[existingIdx] = { ...article };\n }\n }\n }\n }\n\n return kept;\n};\n\n/**\n * Apply tag-based budget limits — keep at most N articles per tag group.\n */\nconst applyTagBudget = (\n articles: Article[],\n maxPerTag: number,\n): Article[] => {\n const tagCounts = new Map<string, number>();\n const result: Article[] = [];\n\n for (const article of articles) {\n // An article passes if at least one of its tags hasn't exceeded budget\n const tags = article.tags.length > 0 ? article.tags : [\"_untagged\"];\n let allowed = false;\n\n for (const tag of tags) {\n const count = tagCounts.get(tag) ?? 0;\n if (count < maxPerTag) {\n allowed = true;\n }\n }\n\n if (allowed) {\n result.push(article);\n for (const tag of tags) {\n tagCounts.set(tag, (tagCounts.get(tag) ?? 0) + 1);\n }\n }\n }\n\n return result;\n};\n\n/**\n * Truncate article content/summary to fit within character limits.\n */\nconst truncateArticles = (\n articles: Article[],\n maxContentLength: number,\n): Article[] =>\n articles.map((a) => ({\n ...a,\n content: a.content ? truncate(a.content, maxContentLength) : null,\n summary: a.summary ? truncate(a.summary, maxContentLength) : null,\n }));\n\n/**\n * Sort articles by the chosen strategy.\n */\nconst sortArticles = (\n articles: Article[],\n sort: \"recency\" | \"relevance\",\n): Article[] => {\n if (sort === \"recency\") {\n return [...articles].sort((a, b) => {\n const dateA = a.publishedAt?.getTime() ?? a.fetchedAt.getTime();\n const dateB = b.publishedAt?.getTime() ?? b.fetchedAt.getTime();\n return dateB - dateA;\n });\n }\n // \"relevance\" — for now same as recency; could weight by content length / source priority later\n return [...articles].sort((a, b) => {\n const dateA = a.publishedAt?.getTime() ?? a.fetchedAt.getTime();\n const dateB = b.publishedAt?.getTime() ?? 
b.fetchedAt.getTime();\n return dateB - dateA;\n });\n};\n\n/**\n * Apply token budget — trim articles from the end until we're under budget.\n */\nconst applyTokenBudget = (\n articles: Article[],\n maxTokens: number,\n): Article[] => {\n let totalTokens = 0;\n const result: Article[] = [];\n\n for (const article of articles) {\n const text = [article.title, article.summary, article.content]\n .filter(Boolean)\n .join(\" \");\n const tokens = estimateTokens(text);\n\n if (totalTokens + tokens > maxTokens && result.length > 0) {\n break;\n }\n\n totalTokens += tokens;\n result.push(article);\n }\n\n return result;\n};\n\n/**\n * Full digest pipeline: dedup → sort → tag budget → truncate → token budget.\n */\nexport const buildDigest = (\n articles: readonly Article[],\n options?: DigestOptions,\n): { articles: Article[]; stats: { totalFetched: number; afterDedup: number; afterBudget: number; estimatedTokens: number } } => {\n const opts = { ...DEFAULT_DIGEST, ...options };\n const totalFetched = articles.length;\n\n let result = dedup(articles, opts.similarityThreshold);\n const afterDedup = result.length;\n\n result = sortArticles(result, opts.sort);\n\n result = applyTagBudget(result, opts.maxArticlesPerTag);\n const afterBudget = result.length;\n\n result = truncateArticles(result, opts.maxContentLength);\n\n result = applyTokenBudget(result, opts.maxTokens);\n\n const estimatedTokens = result.reduce((sum, a) => {\n const text = [a.title, a.summary, a.content].filter(Boolean).join(\" \");\n return sum + estimateTokens(text);\n }, 0);\n\n return {\n articles: result,\n stats: { totalFetched, afterDedup, afterBudget, estimatedTokens },\n };\n};\n","import type { SchedulerCallbacks, SourceConfig, Article } from \"./types.js\";\n\ninterface SchedulerEntry {\n source: SourceConfig;\n timer: ReturnType<typeof setInterval>;\n}\n\nexport const createScheduler = (\n sources: readonly SourceConfig[],\n fetchSource: (source: SourceConfig) => Promise<Article[]>,\n 
callbacks: SchedulerCallbacks,\n): { stop: () => void } => {\n const entries: SchedulerEntry[] = [];\n\n for (const source of sources) {\n if (source.enabled === false) continue;\n\n const intervalMs = (source.interval ?? 15) * 60_000;\n\n const tick = async (): Promise<void> => {\n try {\n const articles = await fetchSource(source);\n if (articles.length > 0) {\n await callbacks.onArticles(articles, source);\n }\n } catch (error) {\n callbacks.onError?.(error, source);\n }\n };\n\n void tick();\n\n const timer = setInterval(() => void tick(), intervalMs);\n entries.push({ source, timer });\n }\n\n return {\n stop: () => {\n for (const entry of entries) {\n clearInterval(entry.timer);\n }\n entries.length = 0;\n },\n };\n};\n","import type {\n Article,\n DigestOptions,\n DigestResult,\n Harvester,\n HarvesterOptions,\n HarvesterWarning,\n SchedulerCallbacks,\n SourceConfig,\n} from \"./types.js\";\nimport { fetchRss } from \"./rss.js\";\nimport { fetchHtml } from \"./html.js\";\nimport { buildDigest } from \"./digest.js\";\nimport { createScheduler } from \"./scheduler.js\";\nimport { ThrottleQueue } from \"./utils.js\";\n\ntype FetchFn = (input: string | URL | Request, init?: RequestInit) => Promise<Response>;\n\nconst fetchSource = async (\n source: SourceConfig,\n throttle: ThrottleQueue,\n options: {\n readonly fetchFn?: FetchFn;\n readonly timeout?: number;\n readonly maxItemsPerSource?: number;\n readonly onWarning?: (warning: HarvesterWarning, source: SourceConfig) => void;\n } = {},\n): Promise<Article[]> => {\n return throttle.run(async () => {\n if (source.type === \"rss\") {\n const rssOptions: Parameters<typeof fetchRss>[1] = {\n ...(options.fetchFn ? { fetchFn: options.fetchFn } : {}),\n ...(options.timeout !== undefined ? { timeout: options.timeout } : {}),\n ...(options.maxItemsPerSource !== undefined ? { maxItems: options.maxItemsPerSource } : {}),\n ...(options.onWarning\n ? 
{ onWarning: (warning: HarvesterWarning) => options.onWarning?.(warning, source) }\n : {}),\n };\n return fetchRss(source, rssOptions);\n }\n const htmlOptions: Parameters<typeof fetchHtml>[1] = {\n ...(options.fetchFn ? { fetchFn: options.fetchFn } : {}),\n ...(options.timeout !== undefined ? { timeout: options.timeout } : {}),\n ...(options.maxItemsPerSource !== undefined ? { maxItems: options.maxItemsPerSource } : {}),\n ...(options.onWarning\n ? { onWarning: (warning: HarvesterWarning) => options.onWarning?.(warning, source) }\n : {}),\n };\n return fetchHtml(source, htmlOptions);\n });\n};\n\nconst filterKnown = async (\n articles: Article[],\n knownFn?: () => Promise<readonly string[]> | readonly string[],\n): Promise<Article[]> => {\n if (!knownFn) return articles;\n const known = new Set(await knownFn());\n return articles.filter((a) => !known.has(a.hash));\n};\n\nexport const createHarvester = (options: HarvesterOptions): Harvester => {\n const {\n sources,\n dedup: dedupOpts,\n digest: defaultDigestOpts,\n requestTimeout = 15_000,\n requestGap = 1_000,\n maxItemsPerSource = 50,\n onError,\n onWarning,\n } = options;\n\n const fetchFn = options.fetch;\n const throttle = new ThrottleQueue(requestGap);\n\n let scheduler: { stop: () => void } | null = null;\n\n const getEnabledSources = (): SourceConfig[] =>\n sources.filter((s) => s.enabled !== false);\n\n const fetchSingle = async (source: SourceConfig): Promise<Article[]> => {\n const fetchOptions: Parameters<typeof fetchSource>[2] = {\n timeout: requestTimeout,\n maxItemsPerSource,\n ...(fetchFn ? { fetchFn } : {}),\n ...(onWarning ? 
{ onWarning } : {}),\n };\n const articles = await fetchSource(source, throttle, fetchOptions);\n return filterKnown(articles, dedupOpts?.known);\n };\n\n const fetchMany = async (selectedSources: readonly SourceConfig[]): Promise<Article[]> => {\n const results: Article[] = [];\n for (const source of selectedSources) {\n try {\n const articles = await fetchSingle(source);\n results.push(...articles);\n } catch (error) {\n onError?.(error, source);\n }\n }\n return results;\n };\n\n return {\n async fetchAll(): Promise<Article[]> {\n return fetchMany(getEnabledSources());\n },\n\n async fetch(sourceId: string): Promise<Article[]> {\n const source = sources.find((s) => s.id === sourceId);\n if (!source) {\n throw new Error(`Source not found: ${sourceId}`);\n }\n return fetchSingle(source);\n },\n\n async fetchByTags(tags: readonly string[]): Promise<Article[]> {\n const tagSet = new Set(tags);\n const matching = getEnabledSources().filter((s) =>\n s.tags.some((t) => tagSet.has(t)),\n );\n return fetchMany(matching);\n },\n\n async digest(overrides?: DigestOptions): Promise<DigestResult> {\n const articles = await fetchMany(getEnabledSources());\n const opts = { ...defaultDigestOpts, ...overrides };\n const result = buildDigest(articles, opts);\n return {\n articles: result.articles,\n stats: result.stats,\n };\n },\n\n start(callbacks: SchedulerCallbacks): void {\n if (scheduler) {\n scheduler.stop();\n }\n const schedulerCallbacks: SchedulerCallbacks = {\n onArticles: callbacks.onArticles,\n ...(callbacks.onError ?? onError ? { onError: callbacks.onError ?? onError } : {}),\n ...(callbacks.onWarning ?? onWarning ? { onWarning: callbacks.onWarning ?? 
onWarning } : {}),\n };\n scheduler = createScheduler(\n getEnabledSources(),\n (source) => fetchSingle(source),\n schedulerCallbacks,\n );\n },\n\n stop(): void {\n scheduler?.stop();\n scheduler = null;\n },\n };\n};\n"],"mappings":";AAAA,OAAO,eAAe;;;ACAtB,SAAS,kBAAkB;AAGpB,IAAM,UAAU,CAAC,QACtB,WAAW,QAAQ,EAAE,OAAO,GAAG,EAAE,OAAO,KAAK;AAE/C,IAAM,WAAW,CAAC,SAA8B;AAC9C,QAAM,QAAQ,KACX,YAAY,EACZ,QAAQ,0CAA0C,GAAG,EACrD,KAAK,EACL,MAAM,KAAK,EACX,OAAO,CAAC,MAAM,EAAE,SAAS,CAAC;AAC7B,SAAO,IAAI,IAAI,KAAK;AACtB;AAMO,IAAM,oBAAoB,CAAC,GAAW,MAAsB;AACjE,QAAM,OAAO,SAAS,CAAC;AACvB,QAAM,OAAO,SAAS,CAAC;AACvB,MAAI,KAAK,SAAS,KAAK,KAAK,SAAS,EAAG,QAAO;AAC/C,MAAI,KAAK,SAAS,KAAK,KAAK,SAAS,EAAG,QAAO;AAE/C,MAAI,eAAe;AACnB,aAAW,QAAQ,MAAM;AACvB,QAAI,KAAK,IAAI,IAAI,EAAG;AAAA,EACtB;AACA,QAAM,QAAQ,KAAK,OAAO,KAAK,OAAO;AACtC,SAAO,UAAU,IAAI,IAAI,eAAe;AAC1C;AAMO,IAAM,iBAAiB,CAAC,SAC7B,KAAK,KAAK,KAAK,SAAS,CAAC;AAEpB,IAAM,gBAAgB,CAAC,SAC5B,KAAK,QAAQ,QAAQ,GAAG,EAAE,KAAK;AAK1B,IAAM,WAAW,CAAC,MAAc,cAA8B;AACnE,MAAI,KAAK,UAAU,UAAW,QAAO;AACrC,QAAM,MAAM,KAAK,YAAY,KAAK,SAAS;AAC3C,UAAQ,MAAM,IAAI,KAAK,MAAM,GAAG,GAAG,IAAI,KAAK,MAAM,GAAG,SAAS,KAAK;AACrE;AAEO,IAAM,gBAAN,MAAoB;AAAA,EAGzB,YAA6B,OAAe;AAAf;AAAA,EAAgB;AAAA,EAFrC,UAAU;AAAA,EAIlB,MAAM,IAAO,IAAkC;AAC7C,UAAM,MAAM,KAAK,IAAI;AACrB,UAAM,OAAO,KAAK,IAAI,GAAG,KAAK,SAAS,MAAM,KAAK,QAAQ;AAC1D,QAAI,OAAO,GAAG;AACZ,YAAM,IAAI,QAAQ,CAAC,YAAY,WAAW,SAAS,IAAI,CAAC;AAAA,IAC1D;AACA,SAAK,UAAU,KAAK,IAAI;AACxB,WAAO,GAAG;AAAA,EACZ;AACF;AAEO,IAAM,aAAa,CAAC,MAAc,SAAyB;AAChE,MAAI;AACF,WAAO,IAAI,IAAI,MAAM,IAAI,EAAE;AAAA,EAC7B,QAAQ;AACN,WAAO;AAAA,EACT;AACF;;;ADvEA,IAAM,aAAa;AAEnB,IAAM,kBAAkB,MAAiB,IAAI,UAAU;AAShD,IAAM,WAAW,OACtB,QACA,UAA2B,CAAC,MACL;AACvB,QAAM,MAAM,oBAAI,KAAK;AACrB,QAAM;AAAA,IACJ,UAAU,WAAW;AAAA,IACrB,UAAU;AAAA,IACV,WAAW,OAAO;AAAA,IAClB;AAAA,EACF,IAAI;AAEJ,QAAM,aAAa,IAAI,gBAAgB;AACvC,QAAM,QAAQ,WAAW,MAAM,WAAW,MAAM,GAAG,OAAO;AAE1D,MAAI;AACF,UAAM,MAAM,MAAM,QAAQ,OAAO,KAAK;AAAA,MACpC,SAAS;AAAA,QACP,cAAc;AAAA,QACd,QAAQ;AAAA,MACV;AAAA,MACA,QAAQ,WAAW;AAAA,IACrB,CAAC;AAED,QAAI,C
AAC,IAAI,IAAI;AACX,YAAM,IAAI,MAAM,wBAAwB,OAAO,EAAE,UAAU,IAAI,MAAM,EAAE;AAAA,IACzE;AAEA,UAAM,MAAM,MAAM,IAAI,KAAK;AAC3B,UAAM,OAAO,MAAM,gBAAgB,EAAE,YAAY,GAAG;AACpD,UAAM,aAAa,KAAK,MAAM;AAC9B,UAAM,WAAW,eAAe,MAAM,QAAQ,KAAK,QAAQ;AAE3D,QAAI,SAAS,WAAW,GAAG;AACzB,kBAAY;AAAA,QACV,MAAM;AAAA,QACN,SAAS,eAAe,OAAO,EAAE;AAAA,MACnC,CAAC;AAAA,IACH;AAEA,QAAI,SAAS,KAAK,CAAC,YAAY,QAAQ,gBAAgB,IAAI,GAAG;AAC5D,YAAM,eAAe,SAAS,OAAO,CAAC,YAAY,QAAQ,gBAAgB,IAAI,EAAE;AAChF,kBAAY;AAAA,QACV,MAAM;AAAA,QACN,SAAS,eAAe,OAAO,EAAE;AAAA,QACjC,SAAS,EAAE,cAAc,eAAe,SAAS,OAAO;AAAA,MAC1D,CAAC;AAAA,IACH;AAEA,QAAI,OAAO,SAAS,QAAQ,KAAK,aAAa,UAAU;AACtD,kBAAY;AAAA,QACV,MAAM;AAAA,QACN,SAAS,eAAe,OAAO,EAAE,sBAAsB,QAAQ;AAAA,QAC/D,SAAS,EAAE,UAAU,WAAW;AAAA,MAClC,CAAC;AAAA,IACH;AAEA,WAAO;AAAA,EACT,UAAE;AACA,iBAAa,KAAK;AAAA,EACpB;AACF;AAEA,IAAM,iBAAiB,CACrB,MACA,QACA,WACA,aACc;AACd,QAAM,WAAsB,CAAC;AAE7B,aAAW,QAAQ,KAAK,OAAO;AAC7B,QAAI,SAAS,UAAU,SAAU;AAEjC,UAAM,MAAM,KAAK,MAAM,KAAK;AAC5B,QAAI,CAAC,IAAK;AAEV,UAAM,QAAQ,cAAc,KAAK,SAAS,EAAE;AAC5C,QAAI,CAAC,MAAO;AAEZ,UAAM,UAAU,KAAK,iBAAiB,KACjC,KAAK,WACL;AAEL,UAAM,UAAU,KAAK,kBAChB,KAAK,WACL;AAEL,QAAI,cAA2B;AAC/B,QAAI,KAAK,SAAS;AAChB,YAAM,IAAI,IAAI,KAAK,KAAK,OAAO;AAC/B,UAAI,CAAC,MAAM,EAAE,QAAQ,CAAC,EAAG,eAAc;AAAA,IACzC,WAAW,KAAK,SAAS;AACvB,YAAM,IAAI,IAAI,KAAK,KAAK,OAAO;AAC/B,UAAI,CAAC,MAAM,EAAE,QAAQ,CAAC,EAAG,eAAc;AAAA,IACzC;AAEA,aAAS,KAAK;AAAA,MACZ,UAAU,OAAO;AAAA,MACjB;AAAA,MACA;AAAA,MACA,SAAS,OAAO,YAAY,WAAW,cAAc,OAAO,IAAI;AAAA,MAChE,SAAS,OAAO,YAAY,WAAW,cAAc,OAAO,IAAI;AAAA,MAChE;AAAA,MACA,MAAM,QAAQ,GAAG;AAAA,MACjB;AAAA,MACA,MAAM,CAAC,GAAG,OAAO,IAAI;AAAA,IACvB,CAAC;AAAA,EACH;AAEA,SAAO;AACT;;;AE9HA,YAAY,aAAa;AASzB,IAAMA,cACJ;AASF,IAAM,mBAAmB,CAAC,KAA+B,aAAmC;AAC1F,MAAI,CAAC,SAAU,QAAO;AAEtB,QAAM,SAAS,IAAI,KAAK,QAAQ,EAAE,MAAM;AACxC,QAAM,aAAa;AAAA,IACjB,OAAO,KAAK,UAAU;AAAA,IACtB,OAAO,KAAK,SAAS;AAAA,IACrB,OAAO,KAAK;AAAA,EACd;AAEA,aAAW,aAAa,YAAY;AAClC,UAAM,MAAM,cAAc,aAAa,EAAE;AACzC,QAAI,CAAC,IAAK;AACV,UAAM,SAAS,IAAI,KAAK,GAAG;AAC3B,QAAI,CAAC,OAAO,MAAM,OAAO,QAAQ,CAAC,GAAG;AACnC,aAAO;AAAA,I
ACT;AAAA,EACF;AAEA,SAAO;AACT;AAEO,IAAM,YAAY,CACvB,MACA,QACA,YAAY,oBAAI,KAAK,GACrB,WAAW,OAAO,sBACJ;AACd,QAAM,IAAY,aAAK,IAAI;AAC3B,QAAM,EAAE,UAAU,IAAI;AACtB,QAAM,WAAsB,CAAC;AAE7B,IAAE,UAAU,OAAO,EAAE,KAAK,CAAC,IAAI,OAAO;AACpC,QAAI,SAAS,UAAU,SAAU,QAAO;AAExC,UAAM,MAAM,EAAE,EAAE;AAChB,UAAM,QAAQ,cAAc,IAAI,KAAK,UAAU,KAAK,EAAE,MAAM,EAAE,KAAK,CAAC;AACpE,QAAI,CAAC,MAAO;AAEZ,UAAM,UAAU,cAAc,IAAI,KAAK,UAAU,IAAI,EAAE,MAAM,EAAE,KAAK,MAAM,KAAK,EAAE;AACjF,QAAI,CAAC,QAAS;AAEd,UAAM,MAAM,WAAW,SAAS,OAAO,GAAG;AAC1C,UAAM,cAAc,iBAAiB,KAAK,UAAU,IAAI;AACxD,UAAM,UAAU,UAAU,UACtB,cAAc,IAAI,KAAK,UAAU,OAAO,EAAE,MAAM,EAAE,KAAK,CAAC,KAAK,OAC7D;AAEJ,aAAS,KAAK;AAAA,MACZ,UAAU,OAAO;AAAA,MACjB;AAAA,MACA;AAAA,MACA,SAAS;AAAA,MACT;AAAA,MACA;AAAA,MACA,MAAM,QAAQ,GAAG;AAAA,MACjB;AAAA,MACA,MAAM,CAAC,GAAG,OAAO,IAAI;AAAA,IACvB,CAAC;AAED,WAAO;AAAA,EACT,CAAC;AAED,SAAO;AACT;AAEO,IAAM,YAAY,OACvB,QACA,UAA4B,CAAC,MACN;AACvB,QAAM;AAAA,IACJ,UAAU,WAAW;AAAA,IACrB,UAAU;AAAA,IACV,WAAW,OAAO;AAAA,IAClB;AAAA,EACF,IAAI;AACJ,QAAM,aAAa,IAAI,gBAAgB;AACvC,QAAM,QAAQ,WAAW,MAAM,WAAW,MAAM,GAAG,OAAO;AAE1D,MAAI;AACJ,MAAI;AACF,UAAM,MAAM,MAAM,QAAQ,OAAO,KAAK;AAAA,MACpC,SAAS;AAAA,QACP,cAAcA;AAAA,QACd,QAAQ;AAAA,MACV;AAAA,MACA,QAAQ,WAAW;AAAA,IACrB,CAAC;AACD,QAAI,CAAC,IAAI,IAAI;AACX,YAAM,IAAI,MAAM,yBAAyB,OAAO,EAAE,UAAU,IAAI,MAAM,EAAE;AAAA,IAC1E;AACA,WAAO,MAAM,IAAI,KAAK;AAAA,EACxB,UAAE;AACA,iBAAa,KAAK;AAAA,EACpB;AAEA,QAAM,eAAuB,aAAK,IAAI,EAAE,OAAO,UAAU,OAAO,EAAE;AAClE,QAAM,WAAW,UAAU,MAAM,QAAQ,oBAAI,KAAK,GAAG,QAAQ;AAE7D,MAAI,SAAS,WAAW,GAAG;AACzB,gBAAY;AAAA,MACV,MAAM;AAAA,MACN,SAAS,gBAAgB,OAAO,EAAE;AAAA,IACpC,CAAC;AAAA,EACH;AAEA,MAAI,SAAS,KAAK,CAAC,YAAY,QAAQ,gBAAgB,IAAI,GAAG;AAC5D,UAAM,eAAe,SAAS,OAAO,CAAC,YAAY,QAAQ,gBAAgB,IAAI,EAAE;AAChF,gBAAY;AAAA,MACV,MAAM;AAAA,MACN,SAAS,gBAAgB,OAAO,EAAE;AAAA,MAClC,SAAS,EAAE,cAAc,eAAe,SAAS,OAAO;AAAA,IAC1D,CAAC;AAAA,EACH;AAEA,MAAI,OAAO,SAAS,QAAQ,KAAK,eAAe,UAAU;AACxD,gBAAY;AAAA,MACV,MAAM;AAAA,MACN,SAAS,gBAAgB,OAAO,EAAE,sBAAsB,QAAQ;AAAA,MAChE,SAAS,EAAE,UAAU,aAAa;AAAA,IACpC,CAAC;AAAA,EACH;AAEA,SAAO;AACT;;;AC5IA,IAAM,
iBAA0C;AAAA,EAC9C,WAAW;AAAA,EACX,mBAAmB;AAAA,EACnB,kBAAkB;AAAA,EAClB,MAAM;AAAA,EACN,qBAAqB;AACvB;AAMO,IAAM,QAAQ,CACnB,UACA,cACc;AACd,QAAM,OAAkB,CAAC;AAEzB,aAAW,WAAW,UAAU;AAC9B,UAAM,cAAc,KAAK;AAAA,MACvB,CAAC,aAAa,kBAAkB,SAAS,OAAO,QAAQ,KAAK,KAAK;AAAA,IACpE;AACA,QAAI,CAAC,aAAa;AAChB,WAAK,KAAK,EAAE,GAAG,QAAQ,CAAC;AAAA,IAC1B,OAAO;AACL,YAAM,cAAc,KAAK;AAAA,QACvB,CAAC,aAAa,kBAAkB,SAAS,OAAO,QAAQ,KAAK,KAAK;AAAA,MACpE;AACA,UAAI,gBAAgB,IAAI;AACtB,cAAM,WAAW,KAAK,WAAW;AACjC,cAAM,eAAe,SAAS,SAAS,UAAU,MAAM,SAAS,SAAS,UAAU;AACnF,cAAM,UAAU,QAAQ,SAAS,UAAU,MAAM,QAAQ,SAAS,UAAU;AAC5E,YAAI,SAAS,aAAa;AACxB,eAAK,WAAW,IAAI,EAAE,GAAG,QAAQ;AAAA,QACnC;AAAA,MACF;AAAA,IACF;AAAA,EACF;AAEA,SAAO;AACT;AAKA,IAAM,iBAAiB,CACrB,UACA,cACc;AACd,QAAM,YAAY,oBAAI,IAAoB;AAC1C,QAAM,SAAoB,CAAC;AAE3B,aAAW,WAAW,UAAU;AAE9B,UAAM,OAAO,QAAQ,KAAK,SAAS,IAAI,QAAQ,OAAO,CAAC,WAAW;AAClE,QAAI,UAAU;AAEd,eAAW,OAAO,MAAM;AACtB,YAAM,QAAQ,UAAU,IAAI,GAAG,KAAK;AACpC,UAAI,QAAQ,WAAW;AACrB,kBAAU;AAAA,MACZ;AAAA,IACF;AAEA,QAAI,SAAS;AACX,aAAO,KAAK,OAAO;AACnB,iBAAW,OAAO,MAAM;AACtB,kBAAU,IAAI,MAAM,UAAU,IAAI,GAAG,KAAK,KAAK,CAAC;AAAA,MAClD;AAAA,IACF;AAAA,EACF;AAEA,SAAO;AACT;AAKA,IAAM,mBAAmB,CACvB,UACA,qBAEA,SAAS,IAAI,CAAC,OAAO;AAAA,EACnB,GAAG;AAAA,EACH,SAAS,EAAE,UAAU,SAAS,EAAE,SAAS,gBAAgB,IAAI;AAAA,EAC7D,SAAS,EAAE,UAAU,SAAS,EAAE,SAAS,gBAAgB,IAAI;AAC/D,EAAE;AAKJ,IAAM,eAAe,CACnB,UACA,SACc;AACd,MAAI,SAAS,WAAW;AACtB,WAAO,CAAC,GAAG,QAAQ,EAAE,KAAK,CAAC,GAAG,MAAM;AAClC,YAAM,QAAQ,EAAE,aAAa,QAAQ,KAAK,EAAE,UAAU,QAAQ;AAC9D,YAAM,QAAQ,EAAE,aAAa,QAAQ,KAAK,EAAE,UAAU,QAAQ;AAC9D,aAAO,QAAQ;AAAA,IACjB,CAAC;AAAA,EACH;AAEA,SAAO,CAAC,GAAG,QAAQ,EAAE,KAAK,CAAC,GAAG,MAAM;AAClC,UAAM,QAAQ,EAAE,aAAa,QAAQ,KAAK,EAAE,UAAU,QAAQ;AAC9D,UAAM,QAAQ,EAAE,aAAa,QAAQ,KAAK,EAAE,UAAU,QAAQ;AAC9D,WAAO,QAAQ;AAAA,EACjB,CAAC;AACH;AAKA,IAAM,mBAAmB,CACvB,UACA,cACc;AACd,MAAI,cAAc;AAClB,QAAM,SAAoB,CAAC;AAE3B,aAAW,WAAW,UAAU;AAC9B,UAAM,OAAO,CAAC,QAAQ,OAAO,QAAQ,SAAS,QAAQ,OAAO,EAC1D,OAAO,OAAO,EACd,KAAK,GAAG;AACX,UAAM,SAAS,eAAe,IAAI;AAElC,QAAI,cAAc,SAAS,aAAa,OAAO,SAAS,GAAG;AACzD;AAAA,IACF;AAEA,mBAAe;
AACf,WAAO,KAAK,OAAO;AAAA,EACrB;AAEA,SAAO;AACT;AAKO,IAAM,cAAc,CACzB,UACA,YAC+H;AAC/H,QAAM,OAAO,EAAE,GAAG,gBAAgB,GAAG,QAAQ;AAC7C,QAAM,eAAe,SAAS;AAE9B,MAAI,SAAS,MAAM,UAAU,KAAK,mBAAmB;AACrD,QAAM,aAAa,OAAO;AAE1B,WAAS,aAAa,QAAQ,KAAK,IAAI;AAEvC,WAAS,eAAe,QAAQ,KAAK,iBAAiB;AACtD,QAAM,cAAc,OAAO;AAE3B,WAAS,iBAAiB,QAAQ,KAAK,gBAAgB;AAEvD,WAAS,iBAAiB,QAAQ,KAAK,SAAS;AAEhD,QAAM,kBAAkB,OAAO,OAAO,CAAC,KAAK,MAAM;AAChD,UAAM,OAAO,CAAC,EAAE,OAAO,EAAE,SAAS,EAAE,OAAO,EAAE,OAAO,OAAO,EAAE,KAAK,GAAG;AACrE,WAAO,MAAM,eAAe,IAAI;AAAA,EAClC,GAAG,CAAC;AAEJ,SAAO;AAAA,IACL,UAAU;AAAA,IACV,OAAO,EAAE,cAAc,YAAY,aAAa,gBAAgB;AAAA,EAClE;AACF;;;ACpKO,IAAM,kBAAkB,CAC7B,SACAC,cACA,cACyB;AACzB,QAAM,UAA4B,CAAC;AAEnC,aAAW,UAAU,SAAS;AAC5B,QAAI,OAAO,YAAY,MAAO;AAE9B,UAAM,cAAc,OAAO,YAAY,MAAM;AAE7C,UAAM,OAAO,YAA2B;AACtC,UAAI;AACF,cAAM,WAAW,MAAMA,aAAY,MAAM;AACzC,YAAI,SAAS,SAAS,GAAG;AACvB,gBAAM,UAAU,WAAW,UAAU,MAAM;AAAA,QAC7C;AAAA,MACF,SAAS,OAAO;AACd,kBAAU,UAAU,OAAO,MAAM;AAAA,MACnC;AAAA,IACF;AAEA,SAAK,KAAK;AAEV,UAAM,QAAQ,YAAY,MAAM,KAAK,KAAK,GAAG,UAAU;AACvD,YAAQ,KAAK,EAAE,QAAQ,MAAM,CAAC;AAAA,EAChC;AAEA,SAAO;AAAA,IACL,MAAM,MAAM;AACV,iBAAW,SAAS,SAAS;AAC3B,sBAAc,MAAM,KAAK;AAAA,MAC3B;AACA,cAAQ,SAAS;AAAA,IACnB;AAAA,EACF;AACF;;;AC1BA,IAAM,cAAc,OAClB,QACA,UACA,UAKI,CAAC,MACkB;AACvB,SAAO,SAAS,IAAI,YAAY;AAC9B,QAAI,OAAO,SAAS,OAAO;AACzB,YAAM,aAA6C;AAAA,QACjD,GAAI,QAAQ,UAAU,EAAE,SAAS,QAAQ,QAAQ,IAAI,CAAC;AAAA,QACtD,GAAI,QAAQ,YAAY,SAAY,EAAE,SAAS,QAAQ,QAAQ,IAAI,CAAC;AAAA,QACpE,GAAI,QAAQ,sBAAsB,SAAY,EAAE,UAAU,QAAQ,kBAAkB,IAAI,CAAC;AAAA,QACzF,GAAI,QAAQ,YACR,EAAE,WAAW,CAAC,YAA8B,QAAQ,YAAY,SAAS,MAAM,EAAE,IACjF,CAAC;AAAA,MACP;AACA,aAAO,SAAS,QAAQ,UAAU;AAAA,IACpC;AACA,UAAM,cAA+C;AAAA,MACnD,GAAI,QAAQ,UAAU,EAAE,SAAS,QAAQ,QAAQ,IAAI,CAAC;AAAA,MACtD,GAAI,QAAQ,YAAY,SAAY,EAAE,SAAS,QAAQ,QAAQ,IAAI,CAAC;AAAA,MACpE,GAAI,QAAQ,sBAAsB,SAAY,EAAE,UAAU,QAAQ,kBAAkB,IAAI,CAAC;AAAA,MACzF,GAAI,QAAQ,YACR,EAAE,WAAW,CAAC,YAA8B,QAAQ,YAAY,SAAS,MAAM,EAAE,IACjF,CAAC;AAAA,IACP;AACA,WAAO,UAAU,QAAQ,WAAW;AAAA,EACtC,CAAC;AACH;AAEA,IAAM,cAAc,OAClB,UACA,YACuB;AACvB,MAAI,CAAC,QA
AS,QAAO;AACrB,QAAM,QAAQ,IAAI,IAAI,MAAM,QAAQ,CAAC;AACrC,SAAO,SAAS,OAAO,CAAC,MAAM,CAAC,MAAM,IAAI,EAAE,IAAI,CAAC;AAClD;AAEO,IAAM,kBAAkB,CAAC,YAAyC;AACvE,QAAM;AAAA,IACJ;AAAA,IACA,OAAO;AAAA,IACP,QAAQ;AAAA,IACR,iBAAiB;AAAA,IACjB,aAAa;AAAA,IACb,oBAAoB;AAAA,IACpB;AAAA,IACA;AAAA,EACF,IAAI;AAEJ,QAAM,UAAU,QAAQ;AACxB,QAAM,WAAW,IAAI,cAAc,UAAU;AAE7C,MAAI,YAAyC;AAE7C,QAAM,oBAAoB,MACxB,QAAQ,OAAO,CAAC,MAAM,EAAE,YAAY,KAAK;AAE3C,QAAM,cAAc,OAAO,WAA6C;AACtE,UAAM,eAAkD;AAAA,MACtD,SAAS;AAAA,MACT;AAAA,MACA,GAAI,UAAU,EAAE,QAAQ,IAAI,CAAC;AAAA,MAC7B,GAAI,YAAY,EAAE,UAAU,IAAI,CAAC;AAAA,IACnC;AACA,UAAM,WAAW,MAAM,YAAY,QAAQ,UAAU,YAAY;AACjE,WAAO,YAAY,UAAU,WAAW,KAAK;AAAA,EAC/C;AAEA,QAAM,YAAY,OAAO,oBAAiE;AACxF,UAAM,UAAqB,CAAC;AAC5B,eAAW,UAAU,iBAAiB;AACpC,UAAI;AACF,cAAM,WAAW,MAAM,YAAY,MAAM;AACzC,gBAAQ,KAAK,GAAG,QAAQ;AAAA,MAC1B,SAAS,OAAO;AACd,kBAAU,OAAO,MAAM;AAAA,MACzB;AAAA,IACF;AACA,WAAO;AAAA,EACT;AAEA,SAAO;AAAA,IACL,MAAM,WAA+B;AACnC,aAAO,UAAU,kBAAkB,CAAC;AAAA,IACtC;AAAA,IAEA,MAAM,MAAM,UAAsC;AAChD,YAAM,SAAS,QAAQ,KAAK,CAAC,MAAM,EAAE,OAAO,QAAQ;AACpD,UAAI,CAAC,QAAQ;AACX,cAAM,IAAI,MAAM,qBAAqB,QAAQ,EAAE;AAAA,MACjD;AACA,aAAO,YAAY,MAAM;AAAA,IAC3B;AAAA,IAEA,MAAM,YAAY,MAA6C;AAC7D,YAAM,SAAS,IAAI,IAAI,IAAI;AAC3B,YAAM,WAAW,kBAAkB,EAAE;AAAA,QAAO,CAAC,MAC3C,EAAE,KAAK,KAAK,CAAC,MAAM,OAAO,IAAI,CAAC,CAAC;AAAA,MAClC;AACA,aAAO,UAAU,QAAQ;AAAA,IAC3B;AAAA,IAEA,MAAM,OAAO,WAAkD;AAC7D,YAAM,WAAW,MAAM,UAAU,kBAAkB,CAAC;AACpD,YAAM,OAAO,EAAE,GAAG,mBAAmB,GAAG,UAAU;AAClD,YAAM,SAAS,YAAY,UAAU,IAAI;AACzC,aAAO;AAAA,QACL,UAAU,OAAO;AAAA,QACjB,OAAO,OAAO;AAAA,MAChB;AAAA,IACF;AAAA,IAEA,MAAM,WAAqC;AACzC,UAAI,WAAW;AACb,kBAAU,KAAK;AAAA,MACjB;AACA,YAAM,qBAAyC;AAAA,QAC7C,YAAY,UAAU;AAAA,QACtB,GAAI,UAAU,WAAW,UAAU,EAAE,SAAS,UAAU,WAAW,QAAQ,IAAI,CAAC;AAAA,QAChF,GAAI,UAAU,aAAa,YAAY,EAAE,WAAW,UAAU,aAAa,UAAU,IAAI,CAAC;AAAA,MAC5F;AACA,kBAAY;AAAA,QACV,kBAAkB;AAAA,QAClB,CAAC,WAAW,YAAY,MAAM;AAAA,QAC9B;AAAA,MACF;AAAA,IACF;AAAA,IAEA,OAAa;AACX,iBAAW,KAAK;AAChB,kBAAY;AAAA,IACd;AAAA,EACF;AACF;","names":["DEFAULT_UA","fetchSource"]}
|
package/package.json
ADDED
|
@@ -0,0 +1,50 @@
|
|
|
1
|
+
{
|
|
2
|
+
"name": "osint-feed",
|
|
3
|
+
"version": "0.1.0",
|
|
4
|
+
"description": "Config-driven news harvester for OSINT. RSS + HTML scraping with built-in dedup and LLM-ready digest.",
|
|
5
|
+
"type": "module",
|
|
6
|
+
"main": "./dist/index.js",
|
|
7
|
+
"types": "./dist/index.d.ts",
|
|
8
|
+
"exports": {
|
|
9
|
+
".": {
|
|
10
|
+
"import": "./dist/index.js",
|
|
11
|
+
"types": "./dist/index.d.ts"
|
|
12
|
+
}
|
|
13
|
+
},
|
|
14
|
+
"files": [
|
|
15
|
+
"dist"
|
|
16
|
+
],
|
|
17
|
+
"scripts": {
|
|
18
|
+
"build": "tsup",
|
|
19
|
+
"typecheck": "tsc --noEmit",
|
|
20
|
+
"test": "vitest run",
|
|
21
|
+
"test:watch": "vitest",
|
|
22
|
+
"dev": "tsx src/dev.ts"
|
|
23
|
+
},
|
|
24
|
+
"keywords": [
|
|
25
|
+
"osint",
|
|
26
|
+
"rss",
|
|
27
|
+
"scraper",
|
|
28
|
+
"news",
|
|
29
|
+
"feed",
|
|
30
|
+
"harvester",
|
|
31
|
+
"llm",
|
|
32
|
+
"digest"
|
|
33
|
+
],
|
|
34
|
+
"author": "DewXIT",
|
|
35
|
+
"license": "MIT",
|
|
36
|
+
"engines": {
|
|
37
|
+
"node": ">=18"
|
|
38
|
+
},
|
|
39
|
+
"dependencies": {
|
|
40
|
+
"cheerio": "^1.0.0",
|
|
41
|
+
"rss-parser": "^3.13.0"
|
|
42
|
+
},
|
|
43
|
+
"devDependencies": {
|
|
44
|
+
"@types/node": "^20.0.0",
|
|
45
|
+
"tsup": "^8.0.0",
|
|
46
|
+
"typescript": "^5.5.0",
|
|
47
|
+
"vitest": "^2.0.0",
|
|
48
|
+
"tsx": "^4.0.0"
|
|
49
|
+
}
|
|
50
|
+
}
|