softhauzpy 0.0.1__tar.gz

@@ -0,0 +1,69 @@
1
+ Metadata-Version: 2.1
2
+ Name: softhauzpy
3
+ Version: 0.0.1
4
+ Description-Content-Type: text/markdown
5
+
6
+ # SofthauzPy
7
+ **SofthauzPy** is a comprehensive Python toolkit built for developers creating intelligent, data-driven web applications. It provides a powerful suite of web utilities including web scraping tools, crawling systems, content extraction pipelines, and search engine components that help developers build fully customizable in-house website search solutions.
8
+
9
+ Designed for scalability and flexibility, Softhauz enables teams to collect, process, index, and search website content efficiently — all within a clean Python-first development ecosystem.
10
+
11
+ Built for developers who need scalable web data tools and intelligent search capabilities, Softhauz simplifies the process of scraping, processing, indexing, and searching website content.
12
+ From lightweight crawlers to fully customizable in-house search engine functionality, Softhauz helps developers build smarter web applications without relying heavily on external search services.
13
+
14
+
15
+ ## Key Features
16
+
17
+ **Web Scraping & Crawling**
18
+
19
+ - High-performance web scraping utilities
20
+ - HTML parsing and structured data extraction
21
+ - Recursive website crawling
22
+ - Sitemap discovery and URL indexing
23
+ - Support for asynchronous scraping workflows
24
+ - Rate limiting and request handling utilities
25
+
26
+ **Search Engine Toolkit**
27
+
28
+ - In-house website search engine creation
29
+ - Full-text indexing and querying
30
+ - Custom relevance ranking algorithms
31
+ - Search filtering and query optimization
32
+ - Incremental indexing support
33
+ - Lightweight search infrastructure for internal platforms
34
+
35
+ **Content Processing**
36
+
37
+ - Text normalization and cleaning
38
+ - Metadata extraction
39
+ - Duplicate content detection
40
+ - Keyword extraction and tagging
41
+ - Content chunking for AI and search applications
42
+
43
+ **AI & Semantic Search Ready**
44
+
45
+ - Embedding generation helpers
46
+ - Vector database compatibility
47
+ - Semantic similarity search utilities
48
+ - Retrieval-Augmented Generation (RAG) support
49
+ - AI-powered content indexing workflows
50
+
51
+ **Developer Experience**
52
+
53
+ - Modular and extensible architecture
54
+ - Framework-friendly design for Flask, Django, and FastAPI
55
+ - Easy API integration
56
+ - Clean, Pythonic interfaces
57
+ - Production-ready utilities for scalable deployments
58
+
59
+ > This program may incorporate artificial intelligence (AI) tools solely
60
+ > to support and enhance development efficiency, code quality, and
61
+ > overall performance. All software design, implementation, testing,
62
+ > validation, and quality assurance processes are conducted and reviewed
63
+ > by a qualified human software professional to ensure accuracy,
64
+ > reliability, security, and compliance with applicable standards.
65
+
66
+ Author:
67
+ **Urate, Karen**<br>
68
+ *Softhauz Software Architect*<br>
69
+ [softhauz.ca](https://softhauz.ca)
@@ -0,0 +1,64 @@
1
+ # SofthauzPy
2
+ **SofthauzPy** is a comprehensive Python toolkit built for developers creating intelligent, data-driven web applications. It provides a powerful suite of web utilities including web scraping tools, crawling systems, content extraction pipelines, and search engine components that help developers build fully customizable in-house website search solutions.
3
+
4
+ Designed for scalability and flexibility, Softhauz enables teams to collect, process, index, and search website content efficiently — all within a clean Python-first development ecosystem.
5
+
6
+ Built for developers who need scalable web data tools and intelligent search capabilities, Softhauz simplifies the process of scraping, processing, indexing, and searching website content.
7
+ From lightweight crawlers to fully customizable in-house search engine functionality, Softhauz helps developers build smarter web applications without relying heavily on external search services.
8
+
9
+
10
+ ## Key Features
11
+
12
+ **Web Scraping & Crawling**
13
+
14
+ - High-performance web scraping utilities
15
+ - HTML parsing and structured data extraction
16
+ - Recursive website crawling
17
+ - Sitemap discovery and URL indexing
18
+ - Support for asynchronous scraping workflows
19
+ - Rate limiting and request handling utilities
20
+
21
+ **Search Engine Toolkit**
22
+
23
+ - In-house website search engine creation
24
+ - Full-text indexing and querying
25
+ - Custom relevance ranking algorithms
26
+ - Search filtering and query optimization
27
+ - Incremental indexing support
28
+ - Lightweight search infrastructure for internal platforms
29
+
30
+ **Content Processing**
31
+
32
+ - Text normalization and cleaning
33
+ - Metadata extraction
34
+ - Duplicate content detection
35
+ - Keyword extraction and tagging
36
+ - Content chunking for AI and search applications
37
+
38
+ **AI & Semantic Search Ready**
39
+
40
+ - Embedding generation helpers
41
+ - Vector database compatibility
42
+ - Semantic similarity search utilities
43
+ - Retrieval-Augmented Generation (RAG) support
44
+ - AI-powered content indexing workflows
45
+
46
+ **Developer Experience**
47
+
48
+ - Modular and extensible architecture
49
+ - Framework-friendly design for Flask, Django, and FastAPI
50
+ - Easy API integration
51
+ - Clean, Pythonic interfaces
52
+ - Production-ready utilities for scalable deployments
53
+
54
+ > This program may incorporate artificial intelligence (AI) tools solely
55
+ > to support and enhance development efficiency, code quality, and
56
+ > overall performance. All software design, implementation, testing,
57
+ > validation, and quality assurance processes are conducted and reviewed
58
+ > by a qualified human software professional to ensure accuracy,
59
+ > reliability, security, and compliance with applicable standards.
60
+
61
+ Author:
62
+ **Urate, Karen**<br>
63
+ *Softhauz Software Architect*<br>
64
+ [softhauz.ca](https://softhauz.ca)
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
@@ -0,0 +1,17 @@
1
+ from setuptools import setup, find_packages
2
+
3
+ with open("README.md", "r") as f:
4
+ description = f.read()
5
+
6
+ setup(
7
+ name='softhauzpy',
8
+ version='0.0.1',
9
+ packages=find_packages(),
10
+ install_requires=[
11
+ 'requests>=2.32.3',
12
+ 'beautifulsoup4>=4.12.3',
13
+ 'nltk>=3.9.4'
14
+ ],
15
+ long_description=description,
16
+ long_description_content_type="text/markdown",
17
+ )
@@ -0,0 +1,18 @@
1
+ # fingerprints and mappings
2
+ from .main import incremental_update, highlight_query_terms, build_sitemap_urls
3
+ from .main import fingerprint_page, generate_snippet
4
+
5
+ # extractions
6
+ from .main import extract_structured_data, extract_headings
7
+ from .main import extract_metadata, extract_links, extract_pure_text
8
+
9
+ # indexing
10
+ from .main import load_index, save_index, search_index, compute_tfidf, build_inverted_index
11
+
12
+ # crawls and scrapes
13
+ from .main import tokenize, chunk_text, crawl_site, parse_html, fetch_page, get_search_results_list
14
+
15
+
16
+
17
+
18
+
@@ -0,0 +1,1018 @@
1
+ """
2
+
3
+ Softhauz is a modern Python package built for software engineers, software developers, and web application architects who need scalable web data tools and intelligent search capabilities. It provides a powerful suite of web utilities including web scraping tools, content extraction pipelines, and search engine components that help developers build fully customizable in-house website search solutions. Softhauz simplifies the process of scraping, processing, indexing, and searching website content.
4
+
5
+ From lightweight crawlers to fully customizable in-house search engine functionality, Softhauz allows for a seamless building of smarter web applications without relying heavily on external search services. Softhauz helps transform web content into structured, searchable, and intelligent systems with minimal overhead.
6
+
7
+ Softhauz combines modern web scraping, intelligent indexing, and customizable search capabilities into a unified Python toolkit. Instead of stitching together multiple libraries and services, developers can use Softhauz as a centralized foundation for building scalable web data and search infrastructures tailored to their applications.
8
+
9
+ Author: Urate, Karen
10
+ Creation Date: 2026-05-09
11
+ External Package List:
12
+
13
+ - requests >= (v. 2.34.2)
14
+ - beautifulsoup4 >= (v. 4.14.3)
15
+ - nltk >= (v. 3.9.4)
16
+
17
+ """
18
+
19
+ import re
20
+ import json
21
+ import math
22
+ import time
23
+ import hashlib
24
+ import heapq
25
+ from collections import defaultdict, Counter
26
+ from urllib.parse import urljoin, urlparse, urldefrag
27
+ from pathlib import Path
28
+
29
+ import requests
30
+ from bs4 import BeautifulSoup
31
+
32
+ _SKIP_TAGS = {
33
+ "script", "style", "noscript", "head", "meta",
34
+ "link", "comment", "template", "svg", "iframe",
35
+ }
36
+
37
+ # ---------------------------------------------------------------------------
38
+ # Optional: nltk for stemming / stopwords. Gracefully degrade, if missing.
39
+ # ---------------------------------------------------------------------------
40
+ try:
41
+ import nltk
42
+ from nltk.stem import PorterStemmer
43
+ from nltk.corpus import stopwords as nltk_stopwords
44
+
45
+ nltk.download("punkt", quiet=True)
46
+ nltk.download("stopwords", quiet=True)
47
+ _stemmer = PorterStemmer()
48
+ _STOPWORDS = set(nltk_stopwords.words("english"))
49
+ _NLTK_AVAILABLE = True
50
+ except Exception:
51
+ _stemmer = None
52
+ _STOPWORDS = {
53
+ "a", "an", "the", "is", "it", "in", "on", "at", "to", "for",
54
+ "of", "and", "or", "but", "not", "with", "this", "that", "are",
55
+ "was", "be", "by", "from", "as", "we", "i", "you", "he", "she",
56
+ }
57
+ _NLTK_AVAILABLE = False
58
+
59
+ """
60
+
61
+ Fetch a webpage and return only the pure text content found within its HTML tags.
62
+
63
+ Parameters
64
+ ----------
65
+ page_url : The URL to fetch.
66
+ title : Optional document title (included in the returned text header when provided).
67
+ author : Optional document author (included in the returned text header when provided).
68
+ description : Optional description (included in the returned text header when provided).
69
+ creation_date : Optional creation date string (included in the returned text header when provided).
70
+ modified_date : Optional last-modified date string (included in the returned text header when provided).
71
+
72
+
73
+ Returns
74
+ -------
75
+ dict with keys:
76
+ "url" : str
77
+ "title" : str | None
78
+ "author" : str | None
79
+ "description" : str | None
80
+ "creation_date" : str | None
81
+ "modified_date" : str | None
82
+ "content" : str — pure text extracted from the page
83
+ "meta_data" : str — meta data provided in the parameters
84
+
85
+ Raises
86
+ ------
87
+ requests.HTTPError
88
+ If the server returns a non-2xx status code.
89
+
90
+ """
91
+
92
+
93
+ def extract_pure_text(
+     page_url: str,
+     *,
+     title: str | None = None,
+     author: str | None = None,
+     description: str | None = None,
+     creation_date: str | None = None,
+     modified_date: str | None = None) -> dict:
+     response = fetch_page(page_url, timeout=15)
+     # fetch_page() returns None once its retries are exhausted, so there is
+     # no response object left to call raise_for_status() on in that case.
+     if response is None:
+         raise requests.HTTPError(f"Failed to fetch {page_url}")
+
+     soup = BeautifulSoup(response.text, "html.parser")
+
+     # Drop non-content elements before extracting the text.
+     for tag in soup.find_all(_SKIP_TAGS):
+         tag.decompose()
+
+     # Collapse every run of whitespace into a single space.
+     raw_text = soup.get_text(separator=" ", strip=True)
+     lines = " ".join(raw_text.split()).strip()
+
+     # Build the metadata header from whichever optional values were supplied.
+     header_parts = []
+     if title:
+         header_parts.append(f"Title: {title}")
+     if author:
+         header_parts.append(f"Author: {author}")
+     if description:
+         header_parts.append(f"Description: {description}")
+     if creation_date:
+         header_parts.append(f"Created: {creation_date}")
+     if modified_date:
+         header_parts.append(f"Last Modified: {modified_date}")
+     if page_url:
+         header_parts.append(f"URL: {page_url}")
+
+     header = " ".join(header_parts)
+     result = {
+         "url": page_url,
+         "title": title,
+         "author": author,
+         "description": description,
+         "creation_date": creation_date,
+         "modified_date": modified_date,
+         "content": lines,
+         "meta_data": header
+     }
+
+     return result
139
+
140
+
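The text-extraction step above (drop non-content tags, then collapse whitespace) can be illustrated without any third-party dependency. The following is a minimal sketch that swaps BeautifulSoup for the standard library's `html.parser.HTMLParser`; the names `TextExtractor`, `pure_text`, and `SKIP_TAGS` are hypothetical and not part of the package.

```python
from html.parser import HTMLParser

# Hypothetical subset of tags whose contents should not appear in the text.
SKIP_TAGS = {"script", "style", "noscript", "template"}


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def pure_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


sample_html = (
    "<html><head><style>p{}</style></head>"
    "<body><p>Hello <b>world</b></p><script>var x=1;</script></body></html>"
)
text = pure_text(sample_html)
print(text)  # Hello world
```

Unlike the package function, this sketch does not fetch anything over the network and does not skip `<head>` as a whole.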
141
+ """
142
+ Searches a list of pages for entries that match the provided keywords.
143
+
144
+ This function fetches each page in the list with extract_pure_text() and returns
145
+ the entries whose extracted text contains the given keyword string. Each tuple in the
146
+ returned list contains detailed information about a page.
147
+
148
+ Parameters:
149
+ page_list (list of tuples): A list where each tuple represents a page with the following elements:
150
+ - url (str): The URL of the page.
151
+ - title (str): The title of the page.
152
+ - author (str): The author of the page.
153
+ - description (str): A brief description of the page.
154
+ - creation_date (str): The date the page was created.
155
+ - modified_date (str): The date the page was last modified.
156
+ keywords (str): A string matched as a single case-sensitive substring against each page's extracted text.
157
+
158
+ Returns:
159
+ list of tuples: A list of tuples matching the search criteria. Each tuple contains:
160
+ - url (str)
161
+ - title (str)
162
+ - author (str)
163
+ - description (str)
164
+ - creation_date (str)
165
+ - modified_date (str)
166
+
167
+ Example:
168
+ >>> pages = [
169
+ ... ("https://example.com", "Example Page", "Alice", "A sample page", "2023-01-01", "2023-01-05"),
170
+ ... ("https://another.com", "Another Page", "Bob", "Another sample page", "2023-02-01", "2023-02-05")
171
+ ... ]
172
+ >>> get_search_results_list(pages, "sample")
173
+ [
174
+ ("https://example.com", "Example Page", "Alice", "A sample page", "2023-01-01", "2023-01-05"),
175
+ ("https://another.com", "Another Page", "Bob", "Another sample page", "2023-02-01", "2023-02-05")
176
+ ]
177
+ """
178
+
179
+
180
+ def get_search_results_list(page_list=None, keywords='') -> list:
+     results = []
+
+     # A None default avoids sharing one mutable list between calls.
+     for page in page_list or []:
+
+         url = page[0]
+
+         # Skip entries that have no URL.
+         if not url:
+             continue
+
+         title = page[1] or ''
+         author = page[2] or ''
+         description = page[3] or ''
+         creation_date = page[4] or ''
+         modified_date = page[5] or ''
+
+         # extract_pure_text() accepts its metadata as keyword-only arguments.
+         page_text = extract_pure_text(
+             url,
+             title=title,
+             author=author,
+             description=description,
+             creation_date=creation_date,
+             modified_date=modified_date,
+         )["content"]
+
+         if keywords in page_text:
+             results.append((url, title, author, description, creation_date, modified_date))
+
+     return results
200
+
201
+
202
+
203
+ """
204
+ Fetch a single URL with retry logic and polite delay.
205
+
206
+ Args:
207
+ url: Target URL.
208
+ timeout: Per-request timeout in seconds.
209
+ retries: Maximum number of attempts before giving up.
210
+ delay: Seconds to wait between retries (doubles on each failure).
211
+ headers: Optional extra HTTP headers (merged with a default UA).
212
+ session: An existing requests.Session (useful for cookie sharing).
213
+
214
+ Returns:
215
+ A requests.Response on success, or None after all retries fail.
216
+
217
+ Example:
218
+ resp = fetch_page("https://example.com/docs/api")
219
+ if resp:
220
+ print(resp.text[:200])
221
+ """
222
+ def fetch_page(
223
+ url: str,
224
+ *,
225
+ timeout: int = 10,
226
+ retries: int = 3,
227
+ delay: float = 1.0,
228
+ headers: dict | None = None,
229
+ session: requests.Session | None = None,
230
+ ) -> requests.Response | None:
231
+ _headers = {
232
+ "User-Agent": (
233
+ "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
234
+ )
235
+ }
236
+ if headers:
237
+ _headers.update(headers)
238
+
239
+ requester = session or requests
240
+ wait = delay
241
+ for attempt in range(1, retries + 1):
242
+ try:
243
+ resp = requester.get(url, headers=_headers, timeout=timeout)
244
+ resp.raise_for_status()
245
+ return resp
246
+ except requests.RequestException as exc:
247
+ if attempt == retries:
248
+ print(f"[fetch_page] FAILED AFTER {retries} ATTEMPTS: {exc}")
249
+ return None
250
+ time.sleep(wait)
251
+ wait *= 2
252
+
253
+
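The retry loop in fetch_page doubles its wait after each failed attempt. As a rough sketch of that backoff pattern in isolation, the example below injects the fetch callable and the sleep function so it runs without network access; `fetch_with_backoff`, `flaky`, and the `sleep` parameter are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Optional


def fetch_with_backoff(
    fetch: Callable[[], str],
    *,
    retries: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch() up to `retries` times, doubling the wait after each failure."""
    wait = delay
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except OSError:
            if attempt == retries:
                return None  # give up after the final attempt
            sleep(wait)
            wait *= 2
    return None  # reached only when retries < 1


# Simulated endpoint that fails twice and then succeeds.
calls = {"n": 0}
waits = []


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


result = fetch_with_backoff(flaky, retries=3, delay=1.0, sleep=waits.append)
print(result, waits)  # <html>ok</html> [1.0, 2.0]
```

Passing `waits.append` as the sleep function records the requested delays instead of actually sleeping, which makes the doubling visible.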
254
+ """
255
+ Parse raw HTML into a BeautifulSoup tree.
256
+
257
+ Args:
258
+ html: Raw HTML string or bytes.
259
+ parser: BS4 parser backend ('lxml', 'html.parser', 'html5lib').
260
+
261
+ Returns:
262
+ A BeautifulSoup object ready for querying.
263
+
264
+ Example:
265
+ soup = parse_html(resp.text)
266
+ title = soup.find("h1").get_text()
267
+ """
268
+
269
+
270
+ def parse_html(
+     html: str | bytes,
+     *,
+     # "html.parser" ships with Python; "lxml" is not listed in install_requires.
+     parser: str = "html.parser",
+ ) -> BeautifulSoup:
+     return BeautifulSoup(html, parser)
276
+
277
+
278
+ """
279
+ Pull structured metadata from a page's <head> section.
280
+
281
+ Extracts title, description, keywords, Open Graph tags, canonical URL,
282
+ language, and author where available.
283
+
284
+ Args:
285
+ soup: Parsed BeautifulSoup object.
286
+ url: Original URL (used as fallback for canonical).
287
+
288
+ Returns:
289
+ Dict with keys: title, description, keywords, og_title,
290
+ og_description, og_image, canonical, lang, author.
291
+
292
+ Example:
293
+ meta = extract_metadata(soup, url="https://example.com/page")
294
+ print(meta["title"])
295
+ """
296
+
297
+
298
+ def extract_metadata(soup: BeautifulSoup, url: str = "") -> dict:
299
+
300
+ def _meta(name: str, attr: str = "name") -> str:
301
+ tag = soup.find("meta", {attr: name})
302
+ return tag["content"].strip() if tag and tag.get("content") else ""
303
+
304
+ canonical_tag = soup.find("link", rel="canonical")
305
+ canonical = (
306
+ canonical_tag["href"] if canonical_tag and canonical_tag.get("href")
307
+ else url
308
+ )
309
+
310
+ return {
311
+ "title": (soup.title.string.strip() if soup.title else ""),
312
+ "description": _meta("description"),
313
+ "keywords": _meta("keywords"),
314
+ "og_title": _meta("og:title", "property"),
315
+ "og_description": _meta("og:description", "property"),
316
+ "og_image": _meta("og:image", "property"),
317
+ "canonical": canonical,
318
+ "lang": soup.html.get("lang", "") if soup.html else "",
319
+ "author": _meta("author"),
320
+ }
321
+
322
+
323
+ """
324
+ Collect all hyperlinks from a page, normalised to absolute URLs.
325
+
326
+ Args:
327
+ soup: Parsed page.
328
+ base_url: Absolute URL of the page being parsed.
329
+ same_domain_only: When True, filters out external domains.
330
+ exclude_extensions: File extensions to skip (e.g. ['.pdf', '.jpg']).
331
+
332
+ Returns:
333
+ Deduplicated list of absolute URL strings.
334
+
335
+ Example:
336
+ links = extract_links(soup, "https://docs.example.com/intro")
337
+ # => ['https://docs.example.com/api', 'https://docs.example.com/faq']
338
+ """
339
+
340
+
341
+ def extract_links(
342
+ soup: BeautifulSoup,
343
+ base_url: str,
344
+ *,
345
+ same_domain_only: bool = True,
346
+ exclude_extensions: list[str] | None = None,
347
+ ) -> list[str]:
348
+ skip_ext = set(exclude_extensions or [
349
+ ".pdf", ".jpg", ".jpeg", ".png", ".gif", ".svg",
350
+ ".zip", ".tar", ".gz", ".mp4", ".mp3",
351
+ ])
352
+ base_parsed = urlparse(base_url)
353
+ seen: set[str] = set()
354
+ result: list[str] = []
355
+
356
+ for tag in soup.find_all("a", href=True):
357
+ raw = tag["href"].strip()
358
+ if raw.startswith(("mailto:", "tel:", "javascript:", "#")):
359
+ continue
360
+ abs_url, _ = urldefrag(urljoin(base_url, raw))
361
+ parsed = urlparse(abs_url)
362
+ if same_domain_only and parsed.netloc != base_parsed.netloc:
363
+ continue
364
+ if any(parsed.path.lower().endswith(e) for e in skip_ext):
365
+ continue
366
+ if abs_url not in seen:
367
+ seen.add(abs_url)
368
+ result.append(abs_url)
369
+
370
+ return result
371
+
372
+
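The normalisation steps in extract_links (resolve against the base URL, drop the fragment, filter by host and extension, deduplicate) can be tried on plain strings. This is a small sketch that takes a list of raw href values instead of a parsed page; `normalise_links` and its `skip_ext` default are hypothetical.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalise_links(base_url, hrefs, skip_ext=(".pdf", ".png")):
    """Resolve hrefs against base_url and keep unique same-host page URLs."""
    base_host = urlparse(base_url).netloc
    seen, result = set(), []
    for raw in hrefs:
        raw = raw.strip()
        if raw.startswith(("mailto:", "tel:", "javascript:", "#")):
            continue
        # urljoin resolves relative paths; urldefrag strips any "#fragment".
        abs_url, _fragment = urldefrag(urljoin(base_url, raw))
        parsed = urlparse(abs_url)
        if parsed.netloc != base_host:
            continue
        if parsed.path.lower().endswith(tuple(skip_ext)):
            continue
        if abs_url not in seen:
            seen.add(abs_url)
            result.append(abs_url)
    return result


links = normalise_links(
    "https://docs.example.com/intro",
    ["api", "/faq#top", "/faq", "guide.pdf",
     "https://other.example.org/x", "mailto:a@b.c"],
)
print(links)
# ['https://docs.example.com/api', 'https://docs.example.com/faq']
```

Note that `/faq#top` and `/faq` collapse to a single entry once the fragment is removed.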
373
+ """
374
+ BFS crawler that visits pages starting from start_url.
375
+
376
+ Each visited page is fetched, parsed, and its links are queued.
377
+ Returns a list of page records for further processing.
378
+
379
+ Args:
380
+ start_url: Root URL to begin crawling.
381
+ max_pages: Hard cap on pages visited.
382
+ same_domain_only: Stay within the same hostname.
383
+ delay: Polite pause (seconds) between requests.
384
+ session: Reusable requests.Session.
385
+
386
+ Returns:
387
+ List of dicts, each with keys: url, html, soup, status_code.
388
+
389
+ Example:
390
+ pages = crawl_site("https://docs.example.com", max_pages=50)
391
+ print(f"Crawled {len(pages)} pages")
392
+ """
393
+
394
+
395
+ def crawl_site(
396
+ start_url: str,
397
+ *,
398
+ max_pages: int = 200,
399
+ same_domain_only: bool = True,
400
+ delay: float = 0.5,
401
+ session: requests.Session | None = None,
402
+ ) -> list[dict]:
403
+ visited: set[str] = set()
404
+ queue: list[str] = [start_url]
405
+ results: list[dict] = []
406
+
407
+ sess = session or requests.Session()
408
+
409
+ while queue and len(visited) < max_pages:
410
+ url = queue.pop(0)
411
+ if url in visited:
412
+ continue
413
+ visited.add(url)
414
+
415
+ resp = fetch_page(url, session=sess)
416
+ if resp is None:
417
+ continue
418
+
419
+ soup = parse_html(resp.text)
420
+ results.append({
421
+ "url": url,
422
+ "html": resp.text,
423
+ "soup": soup,
424
+ "status_code": resp.status_code,
425
+ })
426
+
427
+ new_links = extract_links(
428
+ soup, url, same_domain_only=same_domain_only
429
+ )
430
+ for link in new_links:
431
+ if link not in visited:
432
+ queue.append(link)
433
+
434
+ time.sleep(delay)
435
+
436
+ return results
437
+
438
+
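The breadth-first traversal used by crawl_site can be exercised on an in-memory link graph. The sketch below is an illustration under assumptions, not the package's code: it uses `collections.deque` so that removing the next URL is O(1), whereas `list.pop(0)` shifts every remaining element. `bfs_crawl`, `get_links`, and the `site` dictionary are hypothetical.

```python
from collections import deque


def bfs_crawl(start, get_links, max_pages=200):
    """Visit pages breadth-first and return them in visiting order."""
    visited = set()
    queue = deque([start])
    order = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue  # the same URL may have been queued more than once
        visited.add(url)
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                queue.append(link)
    return order


# A tiny fake site: each key maps to the links found on that page.
site = {"/": ["/a", "/b"], "/a": ["/c", "/"], "/b": ["/c"], "/c": []}

order = bfs_crawl("/", lambda url: site.get(url, []))
print(order)  # ['/', '/a', '/b', '/c']

capped = bfs_crawl("/", lambda url: site.get(url, []), max_pages=2)
print(capped)  # ['/', '/a']
```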
439
+ """
440
+ Split a long document into overlapping word-level chunks.
441
+
442
+ Overlapping windows ensure that relevant phrases spanning a chunk
443
+ boundary are still findable.
444
+
445
+ Args:
446
+ text: Full document text.
447
+ chunk_size: Maximum words per chunk.
448
+ overlap: Words shared between consecutive chunks.
449
+
450
+ Returns:
451
+ List of text chunks.
452
+
453
+ Example:
454
+ chunks = chunk_text(long_article, chunk_size=200, overlap=30)
455
+ """
456
+
457
+
458
+ def chunk_text(
459
+ text: str,
460
+ *,
461
+ chunk_size: int = 300,
462
+ overlap: int = 50,
463
+ ) -> list[str]:
464
+ words = text.split()
465
+ step = max(1, chunk_size - overlap)
466
+ return [
467
+ " ".join(words[i: i + chunk_size])
468
+ for i in range(0, len(words), step)
469
+ if words[i: i + chunk_size]
470
+ ]
471
+
472
+
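To see how the overlapping windows behave, here is a small sketch of the same windowing on a ten-word string; `chunk_words` is a hypothetical stand-alone helper that mirrors the step calculation above.

```python
def chunk_words(text, chunk_size=300, overlap=50):
    """Split text into word windows that share `overlap` words with the next one."""
    words = text.split()
    step = max(1, chunk_size - overlap)  # never step by less than one word
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


chunks = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1)
print(chunks)  # ['a b c d', 'd e f g', 'g h i j', 'j']
```

The final chunk here contains only `j`, which already appears at the end of the previous chunk. With this stepping scheme a short trailing chunk of already-covered words is possible, so callers that care about duplicates may want to drop it.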
473
+ """
474
+ Normalise text into a list of meaningful tokens.
475
+
476
+ Lowercases, strips punctuation, removes stopwords, and optionally
477
+ applies Porter stemming (if nltk is installed).
478
+
479
+ Args:
480
+ text: Input string.
481
+ remove_stopwords: Filter common English stopwords.
482
+ stem: Apply stemming for root-form matching.
483
+ min_token_len: Discard tokens shorter than this length.
484
+
485
+ Returns:
486
+ List of processed token strings.
487
+
488
+ Example:
489
+ tokens = tokenize("Building scalable search engines")
490
+ # => ['build', 'scalabl', 'search', 'engin'] (with stemming)
491
+ """
492
+
493
+
494
+ def tokenize(
495
+ text: str,
496
+ *,
497
+ remove_stopwords: bool = True,
498
+ stem: bool = True,
499
+ min_token_len: int = 2,
500
+ ) -> list[str]:
501
+ text = text.lower()
502
+ text = re.sub(r"[^a-z0-9\s]", " ", text)
503
+ tokens = [t for t in text.split() if len(t) >= min_token_len]
504
+
505
+ if remove_stopwords:
506
+ tokens = [t for t in tokens if t not in _STOPWORDS]
507
+
508
+ if stem and _stemmer:
509
+ tokens = [_stemmer.stem(t) for t in tokens]
510
+
511
+ return tokens
512
+
513
+
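When nltk is unavailable, no stemmer is loaded and tokenisation reduces to lowercasing, punctuation stripping, and stopword removal. The following sketch shows that fallback path only; `simple_tokenize` and the shortened `STOPWORDS` set are hypothetical.

```python
import re

# Hypothetical, shortened stopword list for the example.
STOPWORDS = {"a", "an", "the", "is", "for", "and", "of", "to", "in"}


def simple_tokenize(text, min_token_len=2):
    """Lowercase, replace non-alphanumerics with spaces, drop short and stop words."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [
        token for token in text.split()
        if len(token) >= min_token_len and token not in STOPWORDS
    ]


tokens = simple_tokenize("Building scalable search-engines for the Web!")
print(tokens)  # ['building', 'scalable', 'search', 'engines', 'web']
```

Because punctuation becomes a space, the hyphenated `search-engines` is split into two tokens.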
514
+ """
515
+ Build an inverted index mapping tokens to list of (doc_id, frequency).
516
+
517
+ The index is the core data structure that enables fast keyword lookups
518
+ without scanning every document on every query.
519
+
520
+ Args:
521
+ documents: List of dicts, each containing at least text_field and
522
+ id_field.
523
+ text_field: Key whose value is the text to index.
524
+ id_field: Key used as the document identifier.
525
+
526
+ Returns:
527
+ Dict: { token: [(doc_id, freq), ...] }
528
+
529
+ Example:
530
+ index = build_inverted_index(docs)
531
+ print(index.get("search")) # [("https://…/search", 12), …]
532
+
533
+ ------------------------------------------------------------------------------------
534
+
535
+ # SAMPLE LIST OF DICTIONARIES
536
+ documents = [
537
+ {"url": "https://example.com/python-basics",
538
+ "text": "Python is a versatile programming language used for web development and data science"},
539
+ {"url": "https://example.com/data-science",
540
+ "text": "Data science uses Python and statistics to extract insights from data"},
541
+ {"url": "https://example.com/web-dev",
542
+ "text": "Web development with Python frameworks like Django makes building apps fast and fun"},
543
+ {"url": "https://example.com/machine-learning",
544
+ "text": "Machine learning is a subset of data science that trains models on data"},
545
+ ]
546
+
547
+ # BUILD THE INDEX
548
+ index = build_inverted_index(documents, text_field="text", id_field="url")
549
+
550
+ # EXAMPLE OUTPUT
551
+ index["python"] →
552
+ [("https://example.com/python-basics", 1), ("https://example.com/data-science", 1), ("https://example.com/web-dev", 1)]
553
+ """
554
+
555
+
556
+ def build_inverted_index(
557
+ documents: list[dict],
558
+ *,
559
+ text_field: str = "text",
560
+ id_field: str = "url",
561
+ ) -> dict:
562
+ index: dict[str, list[tuple[str, int]]] = defaultdict(list)
563
+
564
+ for doc in documents:
565
+ doc_id = doc[id_field]
566
+ text = doc.get(text_field, "")
567
+ token_freq = Counter(tokenize(text))
568
+ for token, freq in token_freq.items():
569
+ index[token].append((doc_id, freq))
570
+
571
+ return dict(index)
572
+
573
+
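The shape of the inverted index is easiest to see on a two-document corpus. This sketch uses whitespace splitting in place of the package's tokenizer, so no stemming or stopword removal is applied; `build_index` and the `docs` mapping are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(documents):
    """Map each token to a list of (doc_id, frequency) postings."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for token, freq in Counter(text.lower().split()).items():
            index[token].append((doc_id, freq))
    return dict(index)


docs = {"d1": "python web python", "d2": "web search"}
index = build_index(docs)

print(index["python"])  # [('d1', 2)]
print(index["web"])     # [('d1', 1), ('d2', 1)]
```

A lookup such as `index.get("search", [])` then touches only the documents that contain the term instead of scanning the whole corpus.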
574
+ """
575
+ Compute TF-IDF scores for every (document, token) pair.
576
+
577
+ TF-IDF (Term Frequency x Inverse Document Frequency) balances how
578
+ often a term appears in one document against how rare it is across
579
+ the whole corpus — the backbone of classical relevance ranking.
580
+
581
+ Args:
582
+ documents: Corpus as a list of dicts.
583
+ text_field: Field containing raw text.
584
+ id_field: Field used as document identifier.
585
+
586
+ Returns:
587
+ Nested dict: { doc_id: { token: tfidf_score } }
588
+
589
+ Example:
590
+ scores = compute_tfidf(docs)
591
+ top = sorted(scores["https://…"].items(), key=lambda x: -x[1])[:5]
592
+ """
593
+
594
+
595
+ def compute_tfidf(
596
+ documents: list[dict],
597
+ *,
598
+ text_field: str = "text",
599
+ id_field: str = "url",
600
+ ) -> dict[str, dict[str, float]]:
601
+ N = len(documents)
602
+ tf_store: dict[str, dict[str, float]] = {}
603
+ doc_freq: Counter = Counter()
604
+
605
+ for doc in documents:
606
+ doc_id = doc[id_field]
607
+ tokens = tokenize(doc.get(text_field, ""))
608
+ total = len(tokens) or 1
609
+ freq = Counter(tokens)
610
+ tf_store[doc_id] = {t: c / total for t, c in freq.items()}
611
+ doc_freq.update(freq.keys())
612
+
613
+ tfidf: dict[str, dict[str, float]] = {}
614
+ for doc in documents:
615
+ doc_id = doc[id_field]
616
+ tfidf[doc_id] = {
617
+ t: tf * math.log((N + 1) / (doc_freq[t] + 1))
618
+ for t, tf in tf_store[doc_id].items()
619
+ }
620
+
621
+ return tfidf
622
+
623
+
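The weighting used above is tf * log((N + 1) / (df + 1)), where N is the number of documents and df the number of documents containing the term. A worked sketch on pre-tokenised documents follows; `tfidf_scores` and the sample `docs` are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return {doc_id: {token: tf * log((N + 1) / (df + 1))}}."""
    n_docs = len(docs)
    doc_freq = Counter()
    tf = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        tf[doc_id] = {t: c / total for t, c in counts.items()}
        doc_freq.update(counts.keys())  # count each token once per document
    return {
        doc_id: {
            t: weight * math.log((n_docs + 1) / (doc_freq[t] + 1))
            for t, weight in weights.items()
        }
        for doc_id, weights in tf.items()
    }


docs = {"d1": ["python", "web"], "d2": ["python", "search"]}
scores = tfidf_scores(docs)

print(scores["d1"]["python"])  # 0.0
print(round(scores["d1"]["web"], 4))  # 0.2027
```

With this formula a term that occurs in every document has df = N, so its inverse document frequency is log(1) = 0 and it contributes nothing to ranking. In the example, `python` scores 0.0 in both documents while `web` scores 0.5 * log(3/2).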
624
+ """
625
+ Score and rank documents for a query using the inverted index + TF-IDF.
626
+
627
+ Accumulates TF-IDF scores for all query tokens found in the index,
628
+ then returns the top-k results by total relevance score.
629
+
630
+ Args:
631
+ query: Raw user query string.
632
+ index: Inverted index from build_inverted_index().
633
+ tfidf: TF-IDF matrix from compute_tfidf().
634
+ top_k: Maximum results to return.
635
+
636
+ Returns:
637
+ List of (doc_id, score) tuples, highest score first.
638
+
639
+ Example:
640
+ results = search_index("authentication tokens", index, tfidf)
641
+ for url, score in results:
642
+ print(f"{score:.3f} {url}")
643
+ """
644
+
645
+
646
+ def search_index(
647
+ query: str,
648
+ index: dict,
649
+ tfidf: dict[str, dict[str, float]],
650
+ *,
651
+ top_k: int = 10,
652
+ ) -> list[tuple[str, float]]:
653
+ query_tokens = tokenize(query)
654
+ scores: dict[str, float] = defaultdict(float)
655
+
656
+ for token in query_tokens:
657
+ if token in index:
658
+ for doc_id, _freq in index[token]:
659
+ scores[doc_id] += tfidf.get(doc_id, {}).get(token, 0.0)
660
+
661
+ return heapq.nlargest(top_k, scores.items(), key=lambda x: x[1])
662
+
663
+
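Query scoring sums the per-token TF-IDF values of each candidate document and keeps the best few with `heapq.nlargest`. Here is a minimal sketch with hand-written inputs; `rank` and the sample `index` and `tfidf` values are hypothetical and chosen so the sums are exact in floating point.

```python
import heapq
from collections import defaultdict


def rank(query_tokens, index, tfidf, top_k=10):
    """Sum TF-IDF scores per document and return the top_k (doc_id, score) pairs."""
    scores = defaultdict(float)
    for token in query_tokens:
        for doc_id, _freq in index.get(token, []):
            scores[doc_id] += tfidf.get(doc_id, {}).get(token, 0.0)
    return heapq.nlargest(top_k, scores.items(), key=lambda item: item[1])


index = {"search": [("d1", 1), ("d2", 2)], "python": [("d1", 1)]}
tfidf = {"d1": {"search": 0.25, "python": 0.5}, "d2": {"search": 0.5}}

results = rank(["python", "search"], index, tfidf, top_k=2)
print(results)  # [('d1', 0.75), ('d2', 0.5)]
```

`heapq.nlargest` avoids sorting every scored document when only a handful of results is needed.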
664
+ """
665
+ Return an ordered list of all headings (h1-h6) with their level and text.
666
+
667
+ Headings provide document structure signals — boosting a result whose
668
+ h1 matches a query is a simple way to improve ranking quality.
669
+
670
+ Args:
671
+ soup: Parsed BeautifulSoup object.
672
+
673
+ Returns:
674
+ List of dicts: [{ "level": 1, "text": "Getting Started" }, …]
675
+
676
+ Example:
677
+ headings = extract_headings(soup)
678
+ print(headings[0]) # {"level": 1, "text": "Introduction"}
679
+ """
680
+ def extract_headings(soup: BeautifulSoup) -> list[dict]:
681
+
682
+ return [
683
+ {"level": int(tag.name[1]), "text": tag.get_text(strip=True)}
684
+ for tag in soup.find_all(re.compile(r"^h[1-6]$"))
685
+ if tag.get_text(strip=True)
686
+ ]
687
+
688
+
689
+ """
690
+ Extract a query-centred excerpt to display in search results.
691
+
692
+ Finds the first occurrence of any query keyword in the text and
693
+ returns the surrounding word window, mimicking Google's snippet style.
694
+
695
+ Args:
696
+ text: Full document text.
697
+ query: User's search query.
698
+ window: Words to show on each side of the match.
699
+ max_length: Hard character cap on the returned snippet.
700
+
701
+ Returns:
702
+ A short excerpt string, potentially with leading/trailing ellipsis.
703
+
704
+ Example:
705
+ snippet = generate_snippet(article_text, "authentication tokens")
706
+ # => "…Users obtain authentication tokens via the /api/auth endpoint…"
707
+ """
708
+
709
+
710
+ def generate_snippet(
+     text: str,
+     query: str,
+     *,
+     window: int = 40,
+     max_length: int = 300,
+ ) -> str:
+     words = text.split()
+     query_tokens = set(tokenize(query, stem=False, remove_stopwords=False))
+     # Strip punctuation the same way tokenize() does (keeping digits) so that
+     # query terms containing numbers can still match.
+     match_idx = next(
+         (i for i, w in enumerate(words) if re.sub(r"[^a-z0-9]", "", w.lower()) in query_tokens),
+         None,
+     )
+
+     # No keyword found: fall back to the opening words of the document.
+     if match_idx is None:
+         snippet = " ".join(words[:window * 2])
+         return (snippet[:max_length] + "…") if len(snippet) > max_length else snippet
+
+     start = max(0, match_idx - window)
+     end = min(len(words), match_idx + window + 1)
+     snippet = " ".join(words[start:end])
+     if start > 0:
+         snippet = "…" + snippet
+     if end < len(words):
+         snippet += "…"
+
+     return snippet[:max_length] or snippet
737
+
738
+
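The window arithmetic behind the snippet can be checked on a short word list. This sketch assumes the match position is already known; `snippet_around` is a hypothetical helper and omits the character cap.

```python
def snippet_around(words, match_idx, window):
    """Join the words within `window` positions of match_idx, marking cut edges."""
    start = max(0, match_idx - window)
    end = min(len(words), match_idx + window + 1)
    text = " ".join(words[start:end])
    prefix = "…" if start > 0 else ""
    suffix = "…" if end < len(words) else ""
    return prefix + text + suffix


words = "users obtain authentication tokens via the auth endpoint".split()

middle = snippet_around(words, 2, 1)
print(middle)  # …obtain authentication tokens…

opening = snippet_around(words, 0, 1)
print(opening)  # users obtain…
```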
739
+ """
740
+ Generate a stable SHA-256 fingerprint of a page's text content.
741
+
742
+ Store the fingerprint alongside each crawled document to detect
743
+ whether a page has changed since the last crawl, avoiding re-indexing
744
+ unchanged content.
745
+
746
+ Args:
747
+ text: Extracted page text (for example, the "content" value returned by extract_pure_text).
748
+
749
+ Returns:
750
+ 64-character hex string (SHA-256 digest).
751
+
752
+ Example:
753
+ fp = fingerprint_page(page_text)
754
+ if fp != stored_fingerprints.get(url):
755
+ re_index(url)
756
+ """
757
+
758
+
759
+ def fingerprint_page(text: str) -> str:
760
+ normalised = " ".join(text.lower().split())
761
+ return hashlib.sha256(normalised.encode("utf-8")).hexdigest()
762
+
763
+
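Because the text is lowercased and its whitespace collapsed before hashing, cosmetic differences do not change the fingerprint. The sketch below shows a change check against a stored value; `fingerprint`, `stored`, and the example URL are hypothetical.

```python
import hashlib


def fingerprint(text):
    """SHA-256 of the text after lowercasing and whitespace normalisation."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


# Fingerprints recorded during a previous crawl (hypothetical data).
stored = {"https://example.com/a": fingerprint("Hello   World")}

same = fingerprint("hello world\n") == stored["https://example.com/a"]
changed = fingerprint("hello brave world") != stored["https://example.com/a"]
print(same, changed)  # True True
```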
764
+ """
765
+ Serialise the full search index to a JSON file for persistence.
766
+
767
+ The saved file can be reloaded on the next run, avoiding a full
768
+ re-crawl every time the search service starts.
769
+
770
+ Args:
771
+ index: Inverted index from build_inverted_index().
772
+ tfidf: TF-IDF scores from compute_tfidf().
773
+ metadata: Per-page metadata records (list of dicts).
774
+ path: Output file path.
775
+
776
+ Example:
777
+ save_index(index, tfidf, page_metadata, "data/index.json")
778
+ """
779
+
780
+
781
+ def save_index(
782
+ index: dict,
783
+ tfidf: dict,
784
+ metadata: list[dict],
785
+ path: str = "search_index.json",
786
+ ) -> None:
787
+ payload = {
788
+ "inverted_index": {k: v for k, v in index.items()},
789
+ "tfidf": tfidf,
790
+ "metadata": metadata,
791
+ }
792
+ Path(path).parent.mkdir(parents=True, exist_ok=True)
793
+ with open(path, "w", encoding="utf-8") as f:
794
+ json.dump(payload, f, indent=2, ensure_ascii=False)
795
+ print(f"[save_index] Saved index to {path}")
+
+
+ """
+ Deserialise a previously saved search index from JSON.
+
+ Args:
+     path: File path written by save_index().
+
+ Returns:
+     Tuple (inverted_index, tfidf, metadata_list).
+
+ Raises:
+     FileNotFoundError: If the file does not exist.
+
+ Example:
+     index, tfidf, metadata = load_index("data/index.json")
+ """
+
+
+ def load_index(path: str = "search_index.json") -> tuple[dict, dict, list]:
+     with open(path, "r", encoding="utf-8") as f:
+         payload = json.load(f)
+
+     index = payload["inverted_index"]
+     index = {k: [tuple(pair) for pair in v] for k, v in index.items()}
+     return index, payload["tfidf"], payload["metadata"]
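
JSON has no tuple type, so `(url, freq)` postings come back from disk as two-element lists. That is why `load_index` rebuilds tuples after reading the file. The following sketch shows the round trip with only the standard library; the sample index and the temporary file name are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical inverted index: token -> list of (url, frequency) postings.
index = {"token": [("https://example.com/a", 2), ("https://example.com/b", 1)]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "index.json"
    path.write_text(json.dumps({"inverted_index": index}), encoding="utf-8")
    raw = json.loads(path.read_text(encoding="utf-8"))["inverted_index"]

# Postings are lists after the round trip; convert them back to tuples.
restored = {k: [tuple(pair) for pair in v] for k, v in raw.items()}

print(type(raw["token"][0]).__name__)
print(restored == index)
```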
+
+
+ """
+ Extract JSON-LD structured data blocks embedded in a page.
+
+ Many modern websites include schema.org markup (Article, FAQPage,
+ BreadcrumbList, etc.), which provides clean, machine-readable content
+ ideal for enriching search results.
+
+ Args:
+     soup: Parsed BeautifulSoup object.
+
+ Returns:
+     List of parsed JSON-LD objects found on the page.
+
+ Example:
+     schemas = extract_structured_data(soup)
+     for s in schemas:
+         print(s.get("@type"), s.get("name"))
+ """
+
+
+ def extract_structured_data(soup: BeautifulSoup) -> list[dict]:
+     results = []
+     for tag in soup.find_all("script", type="application/ld+json"):
+         try:
+             data = json.loads(tag.string or "{}")
+         except json.JSONDecodeError:
+             continue
+         # A block may hold one object or a list of objects; keep only non-empty
+         # dicts so callers can safely use .get() on every result.
+         items = data if isinstance(data, list) else [data]
+         results.extend(item for item in items if isinstance(item, dict) and item)
+     return results
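
A JSON-LD script block can contain a single object, a list of objects, or malformed JSON. The sketch below shows that normalisation step on raw strings, without BeautifulSoup. `collect_jsonld` is a hypothetical helper, and keeping only dict items is a choice made in this sketch so that `.get()` is always safe.

```python
import json


def collect_jsonld(raw_blocks: list[str]) -> list[dict]:
    """Flatten raw JSON-LD strings into a list of dicts, skipping bad blocks."""
    results: list[dict] = []
    for raw in raw_blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: ignore it and carry on
        items = data if isinstance(data, list) else [data]
        results.extend(item for item in items if isinstance(item, dict))
    return results


blocks = [
    '{"@type": "Article", "name": "Intro"}',
    '[{"@type": "FAQPage"}, {"@type": "BreadcrumbList"}]',
    "{not valid json",
]
schemas = collect_jsonld(blocks)
types = [s.get("@type") for s in schemas]
print(types)
```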
+
+
+ """
+ Parse a site's sitemap.xml to seed the crawler URL queue.
+
+ Tries /sitemap.xml first and falls back to /sitemap_index.xml if no
+ URLs are found; follows nested sitemaps (sitemapindex elements) one
+ level deep.
+
+ Args:
+     base_url: Root URL of the site, e.g. "https://docs.example.com".
+     session: Optional reusable requests.Session.
+
+ Returns:
+     Sorted, deduplicated list of page URLs listed in the sitemap(s).
+
+ Example:
+     urls = build_sitemap_urls("https://docs.example.com")
+     print(f"Found {len(urls)} URLs in sitemap")
+ """
+
+
+ def build_sitemap_urls(
+     base_url: str,
+     *,
+     session: requests.Session | None = None,
+ ) -> list[str]:
+     parsed = urlparse(base_url)
+     origin = f"{parsed.scheme}://{parsed.netloc}"
+     candidates = [f"{origin}/sitemap.xml", f"{origin}/sitemap_index.xml"]
+
+     def _parse_sitemap(url: str, depth: int = 0) -> list[str]:
+         resp = fetch_page(url, session=session)
+         if not resp:
+             return []
+         # The "lxml-xml" parser needs the lxml package, which requires.txt does not list.
+         soup = BeautifulSoup(resp.content, "lxml-xml")
+         urls: list[str] = []
+         # Follow nested sitemaps one level deep only, as documented above.
+         if depth < 1:
+             for sm in soup.find_all("sitemap"):
+                 loc = sm.find("loc")
+                 if loc is not None:
+                     urls.extend(_parse_sitemap(loc.text.strip(), depth + 1))
+         # Read each <url>'s own <loc>, not the combined text of the whole element.
+         for entry in soup.find_all("url"):
+             loc = entry.find("loc")
+             if loc is not None:
+                 urls.append(loc.text.strip())
+         return urls
+
+     all_urls: set[str] = set()
+     for candidate in candidates:
+         all_urls.update(_parse_sitemap(candidate))
+         if all_urls:
+             break
+
+     return sorted(all_urls)
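
A sitemap file is either a `urlset` of page entries or a `sitemapindex` pointing at child sitemaps, and in both cases the address sits in a `<loc>` child next to optional fields such as `<lastmod>`. The sketch below reads only the `<loc>` text. It uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup so that it runs without extra packages; `locs` is a hypothetical helper and the XML samples are invented.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def locs(xml_text: str, parent: str) -> list[str]:
    """Return the <loc> text of every <parent> element under the root."""
    root = ET.fromstring(xml_text)
    found = []
    for node in root.findall(f"sm:{parent}", NS):
        loc = node.find("sm:loc", NS)
        if loc is not None and loc.text:
            found.append(loc.text.strip())
    return found


URLSET = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.example.com/b</loc><lastmod>2024-01-01</lastmod></url>
  <url><loc> https://docs.example.com/a </loc></url>
</urlset>"""

INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://docs.example.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>"""

page_urls = sorted(set(locs(URLSET, "url")))
child_maps = locs(INDEX, "sitemap")
print(page_urls)
print(child_maps)
```

Reading the `<loc>` child explicitly keeps the `<lastmod>` date out of the URL string.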
+
+
+ """
+ Wrap query keywords in HTML highlight tags within a snippet string.
+
+ Use in search result UIs to visually emphasise where the match occurs.
+ Matching is case-insensitive and applies to whole words only.
+
+ Args:
+     snippet: Text excerpt (from generate_snippet).
+     query: Original user query.
+     open_tag: Opening HTML tag (default <mark>).
+     close_tag: Closing HTML tag (default </mark>).
+
+ Returns:
+     Snippet string with matching keywords wrapped in tags.
+
+ Example:
+     highlighted = highlight_query_terms(
+         "retrieve authentication tokens from the vault",
+         "authentication tokens"
+     )
+     # => "retrieve <mark>authentication</mark> <mark>tokens</mark>…"
+ """
+
+
+ def highlight_query_terms(
+     snippet: str,
+     query: str,
+     *,
+     open_tag: str = "<mark>",
+     close_tag: str = "</mark>",
+ ) -> str:
+     keywords = [
+         re.escape(w) for w in query.split()
+         if w.lower() not in _STOPWORDS and len(w) > 1
+     ]
+     if not keywords:
+         return snippet
+
+     # Lookarounds enforce the whole-word matching promised in the docstring.
+     pattern = re.compile(r"(?<!\w)(" + "|".join(keywords) + r")(?!\w)", re.IGNORECASE)
+     # A callable replacement keeps backslashes in the tags from being read as escapes.
+     return pattern.sub(lambda m: f"{open_tag}{m.group(1)}{close_tag}", snippet)
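
Whole-word matching, as the docstring describes it, means a keyword should not light up inside a longer word. One way to get that with `re` is to guard the alternation with lookarounds, sketched below. The `highlight` function is a simplified stand-in that skips the stopword filter, since `_STOPWORDS` is defined outside this section.

```python
import re


def highlight(snippet: str, query: str,
              open_tag: str = "<mark>", close_tag: str = "</mark>") -> str:
    keywords = [re.escape(w) for w in query.split() if len(w) > 1]
    if not keywords:
        return snippet
    # (?<!\w) and (?!\w) reject matches that touch another word character.
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(keywords) + r")(?!\w)", re.IGNORECASE
    )
    return pattern.sub(lambda m: f"{open_tag}{m.group(1)}{close_tag}", snippet)


result = highlight("A token expires; tokenization is separate.", "token")
print(result)
```

Only the standalone word is wrapped; `tokenization` is left untouched.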
+
+
+ """
+ Re-index a single URL only if its content has changed.
+
+ Compares the page's current fingerprint against a stored one; if
+ unchanged the function exits early, making scheduled re-crawls cheap.
+
+ Args:
+     url: Page to check and potentially re-index.
+     index: Mutable inverted index (modified in place).
+     tfidf: Mutable TF-IDF store (modified in place).
+     metadata: Mutable metadata list (modified in place).
+     fingerprints: Dict mapping url to last known fingerprint (mutable).
+     session: Optional requests.Session.
+
+ Returns:
+     True if the page was re-indexed, False if it was unchanged or
+     could not be fetched.
+
+ Example:
+     changed = incremental_update(url, index, tfidf, metadata, fps)
+     print("Updated" if changed else "No change")
+ """
+
+
+ def incremental_update(
+     url: str,
+     index: dict,
+     tfidf: dict,
+     metadata: list[dict],
+     fingerprints: dict[str, str],
+     *,
+     session: requests.Session | None = None,
+ ) -> bool:
+     resp = fetch_page(url, session=session)
+     if resp is None:
+         return False
+
+     soup = parse_html(resp.text)
+     text = extract_text(soup)
+     fp = fingerprint_page(text)
+
+     if fingerprints.get(url) == fp:
+         return False  # No change: skip re-indexing
+
+     fingerprints[url] = fp
+
+     # Remove stale entries from index
+     for token in list(index.keys()):
+         index[token] = [(doc_id, freq) for doc_id, freq in index[token] if doc_id != url]
+         if not index[token]:
+             del index[token]
+
+     # Remove stale tfidf and metadata entries
+     tfidf.pop(url, None)
+     metadata[:] = [m for m in metadata if m.get("url") != url]
+
+     # Build fresh entries for this page
+     token_freq = Counter(tokenize(text))
+     total = sum(token_freq.values()) or 1
+     for token, freq in token_freq.items():
+         index.setdefault(token, []).append((url, freq))
+     # Only the normalised term frequency is stored here; no IDF factor is applied.
+     tfidf[url] = {t: (c / total) for t, c in token_freq.items()}
+
+     meta = extract_metadata(soup, url)
+     meta["url"] = url
+     meta["fingerprint"] = fp
+     metadata.append(meta)
+
+     return True
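
The core of the incremental update is replacing one document's postings in the inverted index: drop every `(doc_id, freq)` pair for that document, delete tokens left with no postings, then append fresh counts. The sketch below isolates that step. `replace_postings` and the sample data are hypothetical, and the token list is passed in directly instead of coming from the package's `tokenize`.

```python
from collections import Counter


def replace_postings(index: dict, doc_id: str, tokens: list[str]) -> None:
    """Swap a document's postings in place within an inverted index."""
    # Remove stale postings; iterate over a copy of the keys because we delete.
    for token in list(index):
        index[token] = [(d, f) for d, f in index[token] if d != doc_id]
        if not index[token]:
            del index[token]
    # Add fresh postings from the new token counts.
    for token, freq in Counter(tokens).items():
        index.setdefault(token, []).append((doc_id, freq))


index = {"old": [("u1", 1)], "shared": [("u1", 2), ("u2", 1)]}
replace_postings(index, "u1", ["new", "shared", "new"])
print(index)
```

Postings that belong to other documents, such as `("u2", 1)` here, are preserved.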
@@ -0,0 +1,69 @@
+ Metadata-Version: 2.1
+ Name: softhauzpy
+ Version: 0.0.1
+ Description-Content-Type: text/markdown
+
+ # SofthauzPy
+ **SofthauzPy** is a comprehensive Python toolkit built for developers creating intelligent, data-driven web applications. It provides a powerful suite of web utilities including web scraping tools, crawling systems, content extraction pipelines, and search engine components that help developers build fully customizable in-house website search solutions.
+
+ Designed for scalability and flexibility, Softhauz enables teams to collect, process, index, and search website content efficiently, all within a clean Python-first development ecosystem.
+
+ Built for developers who need scalable web data tools and intelligent search capabilities, Softhauz simplifies the process of scraping, processing, indexing, and searching website content.
+ From lightweight crawlers to fully customizable in-house search engine functionality, Softhauz helps developers build smarter web applications without relying heavily on external search services.
+
+
+ ## Key Features
+
+ **Web Scraping & Crawling**
+
+ - High-performance web scraping utilities
+ - HTML parsing and structured data extraction
+ - Recursive website crawling
+ - Sitemap discovery and URL indexing
+ - Support for asynchronous scraping workflows
+ - Rate limiting and request handling utilities
+
+ **Search Engine Toolkit**
+
+ - In-house website search engine creation
+ - Full-text indexing and querying
+ - Custom relevance ranking algorithms
+ - Search filtering and query optimization
+ - Incremental indexing support
+ - Lightweight search infrastructure for internal platforms
+
+ **Content Processing**
+
+ - Text normalization and cleaning
+ - Metadata extraction
+ - Duplicate content detection
+ - Keyword extraction and tagging
+ - Content chunking for AI and search applications
+
+ **AI & Semantic Search Ready**
+
+ - Embedding generation helpers
+ - Vector database compatibility
+ - Semantic similarity search utilities
+ - Retrieval-Augmented Generation (RAG) support
+ - AI-powered content indexing workflows
+
+ **Developer Experience**
+
+ - Modular and extensible architecture
+ - Framework-friendly design for Flask, Django, and FastAPI
+ - Easy API integration
+ - Clean, Pythonic interfaces
+ - Production-ready utilities for scalable deployments
+
+ > This program may incorporate artificial intelligence (AI) tools solely
+ > to support and enhance development efficiency, code quality, and
+ > overall performance. All software design, implementation, testing,
+ > validation, and quality assurance processes are conducted and reviewed
+ > by a qualified human software professional to ensure accuracy,
+ > reliability, security, and compliance with applicable standards.
+
+ Author:
+ **Urate, Karen**<br>
+ *Softhauz Software Architect*<br>
+ [softhauz.ca](https://softhauz.ca)
+ README.md
+ setup.py
+ softhauzpy/__init__.py
+ softhauzpy/main.py
+ softhauzpy.egg-info/PKG-INFO
+ softhauzpy.egg-info/SOURCES.txt
+ softhauzpy.egg-info/dependency_links.txt
+ softhauzpy.egg-info/requires.txt
+ softhauzpy.egg-info/top_level.txt
@@ -0,0 +1,3 @@
+ requests>=2.32.3
+ beautifulsoup4>=4.12.3
+ nltk>=3.9.4
@@ -0,0 +1 @@
+ softhauzpy