softhauzpy 0.0.1__tar.gz
- softhauzpy-0.0.1/PKG-INFO +69 -0
- softhauzpy-0.0.1/README.md +64 -0
- softhauzpy-0.0.1/setup.cfg +4 -0
- softhauzpy-0.0.1/setup.py +17 -0
- softhauzpy-0.0.1/softhauzpy/__init__.py +18 -0
- softhauzpy-0.0.1/softhauzpy/main.py +1018 -0
- softhauzpy-0.0.1/softhauzpy.egg-info/PKG-INFO +69 -0
- softhauzpy-0.0.1/softhauzpy.egg-info/SOURCES.txt +9 -0
- softhauzpy-0.0.1/softhauzpy.egg-info/dependency_links.txt +1 -0
- softhauzpy-0.0.1/softhauzpy.egg-info/requires.txt +3 -0
- softhauzpy-0.0.1/softhauzpy.egg-info/top_level.txt +1 -0
@@ -0,0 +1,69 @@
Metadata-Version: 2.1
Name: softhauzpy
Version: 0.0.1
Description-Content-Type: text/markdown

# SofthauzPy

**SofthauzPy** is a comprehensive Python toolkit built for developers creating intelligent, data-driven web applications. It provides a powerful suite of web utilities including web scraping tools, crawling systems, content extraction pipelines, and search engine components that help developers build fully customizable in-house website search solutions.

Designed for scalability and flexibility, Softhauz enables teams to collect, process, index, and search website content efficiently, all within a clean Python-first development ecosystem.

Built for developers who need scalable web data tools and intelligent search capabilities, Softhauz simplifies the process of scraping, processing, indexing, and searching website content.
From lightweight crawlers to fully customizable in-house search engine functionality, Softhauz helps developers build smarter web applications without relying heavily on external search services.


## Key Features

**Web Scraping & Crawling**

- High-performance web scraping utilities
- HTML parsing and structured data extraction
- Recursive website crawling
- Sitemap discovery and URL indexing
- Support for asynchronous scraping workflows
- Rate limiting and request handling utilities

**Search Engine Toolkit**

- In-house website search engine creation
- Full-text indexing and querying
- Custom relevance ranking algorithms
- Search filtering and query optimization
- Incremental indexing support
- Lightweight search infrastructure for internal platforms

**Content Processing**

- Text normalization and cleaning
- Metadata extraction
- Duplicate content detection
- Keyword extraction and tagging
- Content chunking for AI and search applications

**AI & Semantic Search Ready**

- Embedding generation helpers
- Vector database compatibility
- Semantic similarity search utilities
- Retrieval-Augmented Generation (RAG) support
- AI-powered content indexing workflows

**Developer Experience**

- Modular and extensible architecture
- Framework-friendly design for Flask, Django, and FastAPI
- Easy API integration
- Clean, Pythonic interfaces
- Production-ready utilities for scalable deployments

> This program may incorporate artificial intelligence (AI) tools solely
> to support and enhance development efficiency, code quality, and
> overall performance. All software design, implementation, testing,
> validation, and quality assurance processes are conducted and reviewed
> by a qualified human software professional to ensure accuracy,
> reliability, security, and compliance with applicable standards.

Author:
**Urate, Karen**<br>
*Softhauz Software Architect*<br>
[softhauz.ca](https://softhauz.ca)
@@ -0,0 +1,64 @@
# SofthauzPy

**SofthauzPy** is a comprehensive Python toolkit built for developers creating intelligent, data-driven web applications. It provides a powerful suite of web utilities including web scraping tools, crawling systems, content extraction pipelines, and search engine components that help developers build fully customizable in-house website search solutions.

Designed for scalability and flexibility, Softhauz enables teams to collect, process, index, and search website content efficiently, all within a clean Python-first development ecosystem.

Built for developers who need scalable web data tools and intelligent search capabilities, Softhauz simplifies the process of scraping, processing, indexing, and searching website content.
From lightweight crawlers to fully customizable in-house search engine functionality, Softhauz helps developers build smarter web applications without relying heavily on external search services.


## Key Features

**Web Scraping & Crawling**

- High-performance web scraping utilities
- HTML parsing and structured data extraction
- Recursive website crawling
- Sitemap discovery and URL indexing
- Support for asynchronous scraping workflows
- Rate limiting and request handling utilities

**Search Engine Toolkit**

- In-house website search engine creation
- Full-text indexing and querying
- Custom relevance ranking algorithms
- Search filtering and query optimization
- Incremental indexing support
- Lightweight search infrastructure for internal platforms

**Content Processing**

- Text normalization and cleaning
- Metadata extraction
- Duplicate content detection
- Keyword extraction and tagging
- Content chunking for AI and search applications

**AI & Semantic Search Ready**

- Embedding generation helpers
- Vector database compatibility
- Semantic similarity search utilities
- Retrieval-Augmented Generation (RAG) support
- AI-powered content indexing workflows

**Developer Experience**

- Modular and extensible architecture
- Framework-friendly design for Flask, Django, and FastAPI
- Easy API integration
- Clean, Pythonic interfaces
- Production-ready utilities for scalable deployments

> This program may incorporate artificial intelligence (AI) tools solely
> to support and enhance development efficiency, code quality, and
> overall performance. All software design, implementation, testing,
> validation, and quality assurance processes are conducted and reviewed
> by a qualified human software professional to ensure accuracy,
> reliability, security, and compliance with applicable standards.

Author:
**Urate, Karen**<br>
*Softhauz Software Architect*<br>
[softhauz.ca](https://softhauz.ca)
@@ -0,0 +1,17 @@
from setuptools import setup, find_packages

# Read the README as UTF-8 so the build does not depend on the platform's default encoding.
with open("README.md", "r", encoding="utf-8") as f:
    description = f.read()

setup(
    name='softhauzpy',
    version='0.0.1',
    packages=find_packages(),
    # main.py uses "X | None" annotations, which are evaluated at definition time
    # and therefore need Python 3.10 or newer.
    python_requires=">=3.10",
    install_requires=[
        'requests>=2.32.3',
        'beautifulsoup4>=4.12.3',
        'nltk>=3.9.4'
    ],
    long_description=description,
    long_description_content_type="text/markdown",
)
@@ -0,0 +1,18 @@
# fingerprints and mappings
from .main import incremental_update, highlight_query_terms, build_sitemap_urls
from .main import fingerprint_page, generate_snippet

# extractions
from .main import extract_structured_data, extract_headings
from .main import extract_metadata, extract_links, extract_pure_text

# indexing
from .main import load_index, save_index, search_index, compute_tfidf, build_inverted_index

# crawls and scrapes
from .main import tokenize, chunk_text, crawl_site, parse_html, fetch_page, get_search_results_list
@@ -0,0 +1,1018 @@
"""

Softhauz is a modern Python package built for software engineers, software developers, and web application architects who need scalable web data tools and intelligent search capabilities. It provides a powerful suite of web utilities including web scraping tools, content extraction pipelines, and search engine components that help developers build fully customizable in-house website search solutions. Softhauz simplifies the process of scraping, processing, indexing, and searching website content.

From lightweight crawlers to fully customizable in-house search engine functionality, Softhauz lets developers build smarter web applications without relying heavily on external search services. Softhauz helps transform web content into structured, searchable, and intelligent systems with minimal overhead.

Softhauz combines modern web scraping, intelligent indexing, and customizable search capabilities into a unified Python toolkit. Instead of stitching together multiple libraries and services, developers can use Softhauz as a centralized foundation for building scalable web data and search infrastructures tailored to their applications.

Author: Urate, Karen
Creation Date: 2026-05-09
External Package List:

- requests >= (v. 2.34.2)
- beautifulsoup4 >= (v. 4.14.3)
- nltk >= (v. 3.9.4)

"""

import re
import json
import math
import time
import hashlib
import heapq
from collections import defaultdict, Counter
from urllib.parse import urljoin, urlparse, urldefrag
from pathlib import Path

import requests
from bs4 import BeautifulSoup

# Tags whose contents are never treated as visible page text.
_SKIP_TAGS = {
    "script", "style", "noscript", "head", "meta",
    "link", "comment", "template", "svg", "iframe",
}

# ---------------------------------------------------------------------------
# Optional: nltk for stemming / stopwords. Gracefully degrade, if missing.
# ---------------------------------------------------------------------------
try:
    import nltk
    from nltk.stem import PorterStemmer
    from nltk.corpus import stopwords as nltk_stopwords

    nltk.download("punkt", quiet=True)
    nltk.download("stopwords", quiet=True)
    _stemmer = PorterStemmer()
    _STOPWORDS = set(nltk_stopwords.words("english"))
    _NLTK_AVAILABLE = True
except Exception:
    # Fall back to a small built-in stopword list and no stemming.
    _stemmer = None
    _STOPWORDS = {
        "a", "an", "the", "is", "it", "in", "on", "at", "to", "for",
        "of", "and", "or", "but", "not", "with", "this", "that", "are",
        "was", "be", "by", "from", "as", "we", "i", "you", "he", "she",
    }
    _NLTK_AVAILABLE = False

def extract_pure_text(
        page_url: str,
        *,
        title: str | None = None,
        author: str | None = None,
        description: str | None = None,
        creation_date: str | None = None,
        modified_date: str | None = None) -> dict:
    """
    Fetch a webpage and return only the pure text content found within its HTML tags.

    Parameters
    ----------
    page_url : The URL to fetch.
    title : Optional document title (included in the returned text header when provided).
    author : Optional document author (included in the returned text header when provided).
    description : Optional description (included in the returned text header when provided).
    creation_date : Optional creation date string (included in the returned text header when provided).
    modified_date : Optional last-modified date string (included in the returned text header when provided).

    Returns
    -------
    dict with keys:
        "url" : str
        "title" : str | None
        "author" : str | None
        "description" : str | None
        "creation_date" : str | None
        "modified_date" : str | None
        "content" : str, pure text extracted from the page
        "meta_data" : str, header line built from the metadata parameters and the URL

    Raises
    ------
    requests.RequestException
        If the page cannot be fetched (fetch_page() exhausted its retries,
        for example because the server kept returning a non-2xx status code).
    """
    response = fetch_page(page_url, timeout=15)
    # fetch_page() returns None instead of raising once its retries are used up.
    if response is None:
        raise requests.RequestException(f"Could not fetch {page_url}")

    soup = BeautifulSoup(response.text, "html.parser")

    # Drop non-visible elements before extracting text.
    for tag in soup.find_all(_SKIP_TAGS):
        tag.decompose()

    # Collapse all runs of whitespace into single spaces.
    raw_text = soup.get_text(separator=" ", strip=True)
    lines = " ".join(raw_text.split()).strip()

    header_parts = []
    if title:
        header_parts.append(f"Title: {title}")
    if author:
        header_parts.append(f"Author: {author}")
    if description:
        header_parts.append(f"Description: {description}")
    if creation_date:
        header_parts.append(f"Created: {creation_date}")
    if modified_date:
        header_parts.append(f"Last Modified: {modified_date}")
    if page_url:
        header_parts.append(f"URL: {page_url}")

    header = " ".join(header_parts)
    result = {
        "url": page_url,
        "title": title,
        "author": author,
        "description": description,
        "creation_date": creation_date,
        "modified_date": modified_date,
        "content": lines,
        "meta_data": header
    }

    return result

def get_search_results_list(page_list=None, keywords='') -> list:
    """
    Searches a list of pages for entries whose fetched text contains the given keywords.

    This function iterates over a list of pages, fetches each URL with
    extract_pure_text(), and returns the tuples whose page content contains
    the keyword string (a case-sensitive substring match).

    Parameters:
        page_list (list of tuples): A list where each tuple represents a page with the following elements:
            - url (str): The URL of the page.
            - title (str): The title of the page.
            - author (str): The author of the page.
            - description (str): A brief description of the page.
            - creation_date (str): The date the page was created.
            - modified_date (str): The date the page was last modified.
        keywords (str): A string to search for within the fetched page text.

    Returns:
        list of tuples: A list of tuples matching the search criteria. Each tuple contains:
            - url (str)
            - title (str)
            - author (str)
            - description (str)
            - creation_date (str)
            - modified_date (str)

    Example:
        pages = [
            ("https://example.com", "Example Page", "Alice", "A sample page", "2023-01-01", "2023-01-05"),
            ("https://another.com", "Another Page", "Bob", "Another sample page", "2023-02-01", "2023-02-05")
        ]
        matches = get_search_results_list(pages, "sample")
        # A tuple is returned only when "sample" occurs in the text of the
        # fetched page, so the result depends on the live page content.
    """
    results = []

    # Default to an empty list without sharing a mutable default argument.
    for page in page_list or []:

        url = page[0]

        # Skip entries without a URL.
        if not url:
            continue

        title = page[1] or ''
        author = page[2] or ''
        description = page[3] or ''
        creation_date = page[4] or ''
        modified_date = page[5] or ''

        # The metadata parameters of extract_pure_text() are keyword-only.
        page_text = extract_pure_text(
            url,
            title=title,
            author=author,
            description=description,
            creation_date=creation_date,
            modified_date=modified_date,
        )["content"]

        if keywords in page_text:
            results.append((url, title, author, description, creation_date, modified_date))

    return results

def fetch_page(
    url: str,
    *,
    timeout: int = 10,
    retries: int = 3,
    delay: float = 1.0,
    headers: dict | None = None,
    session: requests.Session | None = None,
) -> requests.Response | None:
    """
    Fetch a single URL with retry logic and polite delay.

    Args:
        url: Target URL.
        timeout: Per-request timeout in seconds.
        retries: Maximum number of attempts before giving up.
        delay: Seconds to wait between retries (doubles on each failure).
        headers: Optional extra HTTP headers (merged with a default UA).
        session: An existing requests.Session (useful for cookie sharing).

    Returns:
        A requests.Response on success, or None after all retries fail.

    Example:
        resp = fetch_page("https://example.com/docs/api")
        if resp:
            print(resp.text[:200])
    """
    _headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        )
    }
    if headers:
        _headers.update(headers)

    requester = session or requests
    wait = delay
    for attempt in range(1, retries + 1):
        try:
            resp = requester.get(url, headers=_headers, timeout=timeout)
            # Treat non-2xx responses as failures so they are retried too.
            resp.raise_for_status()
            return resp
        except requests.RequestException as exc:
            if attempt == retries:
                print(f"[fetch_page] FAILED AFTER {retries} ATTEMPTS: {exc}")
                return None
            time.sleep(wait)
            wait *= 2

    # Only reached when retries < 1, i.e. no attempt was made.
    return None

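The retry loop in `fetch_page` is a form of exponential backoff: the wait doubles after each failed attempt. The sketch below isolates that pattern from the HTTP call so it can be run without network access. The helper name `retry_with_backoff` and its injectable `sleep` parameter are assumptions made for illustration and are not part of the package.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    *,
    retries: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable so the demo does not really wait
) -> Optional[T]:
    """Call func() up to `retries` times, doubling the wait after each failure."""
    wait = delay
    for attempt in range(1, retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                return None  # give up after the last attempt
            sleep(wait)
            wait *= 2
    return None  # retries < 1: nothing was attempted


# Demo: a function that fails twice and then succeeds.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits: list = []
result = retry_with_backoff(flaky, retries=3, delay=1.0, sleep=waits.append)
print(result, waits)
```

With `delay=1.0` the recorded waits are 1.0 and 2.0 seconds, so the worst-case total pause for three attempts is 3 seconds plus the request timeouts.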
def parse_html(
    html: str | bytes,
    *,
    parser: str = "html.parser",
) -> BeautifulSoup:
    """
    Parse raw HTML into a BeautifulSoup tree.

    Args:
        html: Raw HTML string or bytes.
        parser: BS4 parser backend ('lxml', 'html.parser', 'html5lib').
            Defaults to the built-in 'html.parser', because 'lxml' and
            'html5lib' are separate packages that setup.py does not install.

    Returns:
        A BeautifulSoup object ready for querying.

    Example:
        soup = parse_html(resp.text)
        title = soup.find("h1").get_text()
    """
    return BeautifulSoup(html, parser)

def extract_metadata(soup: BeautifulSoup, url: str = "") -> dict:
    """
    Pull structured metadata from a page's <head> section.

    Extracts title, description, keywords, Open Graph tags, canonical URL,
    language, and author where available.

    Args:
        soup: Parsed BeautifulSoup object.
        url: Original URL (used as fallback for canonical).

    Returns:
        Dict with keys: title, description, keywords, og_title,
        og_description, og_image, canonical, lang, author.

    Example:
        meta = extract_metadata(soup, url="https://example.com/page")
        print(meta["title"])
    """

    def _meta(name: str, attr: str = "name") -> str:
        # Return the content of <meta attr="name" content="...">, or "" if absent.
        tag = soup.find("meta", {attr: name})
        return tag["content"].strip() if tag and tag.get("content") else ""

    canonical_tag = soup.find("link", rel="canonical")
    canonical = (
        canonical_tag["href"] if canonical_tag and canonical_tag.get("href")
        else url
    )

    return {
        # .string is None when <title> is empty or contains nested tags.
        "title": (soup.title.string.strip() if soup.title and soup.title.string else ""),
        "description": _meta("description"),
        "keywords": _meta("keywords"),
        "og_title": _meta("og:title", "property"),
        "og_description": _meta("og:description", "property"),
        "og_image": _meta("og:image", "property"),
        "canonical": canonical,
        "lang": soup.html.get("lang", "") if soup.html else "",
        "author": _meta("author"),
    }

def extract_links(
    soup: BeautifulSoup,
    base_url: str,
    *,
    same_domain_only: bool = True,
    exclude_extensions: list[str] | None = None,
) -> list[str]:
    """
    Collect all hyperlinks from a page, normalised to absolute URLs.

    Args:
        soup: Parsed page.
        base_url: Absolute URL of the page being parsed.
        same_domain_only: When True, filters out external domains.
        exclude_extensions: File extensions to skip (e.g. ['.pdf', '.jpg']).

    Returns:
        Deduplicated list of absolute URL strings.

    Example:
        links = extract_links(soup, "https://docs.example.com/intro")
        # => ['https://docs.example.com/api', 'https://docs.example.com/faq']
    """
    skip_ext = set(exclude_extensions or [
        ".pdf", ".jpg", ".jpeg", ".png", ".gif", ".svg",
        ".zip", ".tar", ".gz", ".mp4", ".mp3",
    ])
    base_parsed = urlparse(base_url)
    seen: set[str] = set()
    result: list[str] = []

    for tag in soup.find_all("a", href=True):
        raw = tag["href"].strip()
        # Skip non-HTTP schemes and in-page anchors.
        if raw.startswith(("mailto:", "tel:", "javascript:", "#")):
            continue
        # Resolve relative links and drop the #fragment part.
        abs_url, _ = urldefrag(urljoin(base_url, raw))
        parsed = urlparse(abs_url)
        if same_domain_only and parsed.netloc != base_parsed.netloc:
            continue
        if any(parsed.path.lower().endswith(e) for e in skip_ext):
            continue
        if abs_url not in seen:
            seen.add(abs_url)
            result.append(abs_url)

    return result

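The link normalisation used by `extract_links` can be tried out with the standard library alone. The sketch below applies the same steps (skip non-page schemes, resolve against the base URL, strip the fragment, filter by host, deduplicate) to a plain list of `href` strings instead of a parsed page. The function name `normalise_links` and the sample URLs are hypothetical.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalise_links(hrefs, base_url, same_domain_only=True):
    """Turn raw href values into a deduplicated list of absolute URLs."""
    base_host = urlparse(base_url).netloc
    seen = set()
    result = []
    for raw in hrefs:
        raw = raw.strip()
        # Skip schemes that are not pages, and pure in-page anchors.
        if raw.startswith(("mailto:", "tel:", "javascript:", "#")):
            continue
        # Resolve relative paths, then drop any "#fragment" suffix.
        abs_url, _fragment = urldefrag(urljoin(base_url, raw))
        if same_domain_only and urlparse(abs_url).netloc != base_host:
            continue
        if abs_url not in seen:
            seen.add(abs_url)
            result.append(abs_url)
    return result


links = normalise_links(
    [
        "api",                           # relative to the current page
        "/faq#top",                      # fragment is stripped
        "/faq",                          # duplicate once the fragment is gone
        "#section",                      # in-page anchor, skipped
        "mailto:team@example.com",       # not a page, skipped
        "https://other.example.org/x",   # different host, skipped
    ],
    "https://docs.example.com/intro",
)
print(links)
# ['https://docs.example.com/api', 'https://docs.example.com/faq']
```

Stripping fragments before deduplicating matters for crawling: without it, `/faq` and `/faq#top` would be queued as two different pages.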
def crawl_site(
    start_url: str,
    *,
    max_pages: int = 200,
    same_domain_only: bool = True,
    delay: float = 0.5,
    session: requests.Session | None = None,
) -> list[dict]:
    """
    BFS crawler that visits pages starting from start_url.

    Each visited page is fetched, parsed, and its links are queued.
    Returns a list of page records for further processing.

    Args:
        start_url: Root URL to begin crawling.
        max_pages: Hard cap on pages visited.
        same_domain_only: Stay within the same hostname.
        delay: Polite pause (seconds) between requests.
        session: Reusable requests.Session.

    Returns:
        List of dicts, each with keys: url, html, soup, status_code.

    Example:
        pages = crawl_site("https://docs.example.com", max_pages=50)
        print(f"Crawled {len(pages)} pages")
    """
    visited: set[str] = set()
    queue: list[str] = [start_url]
    results: list[dict] = []

    sess = session or requests.Session()

    while queue and len(visited) < max_pages:
        # Take the oldest queued URL first (breadth-first order).
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)

        resp = fetch_page(url, session=sess)
        if resp is None:
            continue

        soup = parse_html(resp.text)
        results.append({
            "url": url,
            "html": resp.text,
            "soup": soup,
            "status_code": resp.status_code,
        })

        new_links = extract_links(
            soup, url, same_domain_only=same_domain_only
        )
        for link in new_links:
            if link not in visited:
                queue.append(link)

        time.sleep(delay)

    return results

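The breadth-first traversal behind `crawl_site` can be illustrated on an in-memory link graph, with no HTTP involved. This is a minimal sketch under stated assumptions: `bfs_crawl`, the `get_links` callback, and the `site` dictionary are made up for the example, and `collections.deque` is swapped in for the list-based queue so that removing from the front is O(1).

```python
from collections import deque


def bfs_crawl(start, get_links, max_pages=200):
    """Visit pages breadth-first and return them in visiting order."""
    visited = set()
    order = []
    queue = deque([start])
    while queue and len(visited) < max_pages:
        url = queue.popleft()  # oldest entry first, which gives BFS order
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                queue.append(link)
    return order


# A tiny hypothetical site: page -> outgoing links.
site = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api", "/"],
    "/blog": ["/docs"],
    "/docs/api": [],
}

order = bfs_crawl("/", lambda url: site.get(url, []))
capped = bfs_crawl("/", lambda url: site.get(url, []), max_pages=2)
print(order)   # ['/', '/docs', '/blog', '/docs/api']
print(capped)  # ['/', '/docs']
```

Both first-level pages are visited before the deeper `/docs/api` page, and `max_pages` caps the number of pages visited rather than the number queued.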
def chunk_text(
    text: str,
    *,
    chunk_size: int = 300,
    overlap: int = 50,
) -> list[str]:
    """
    Split a long document into overlapping word-level chunks.

    Overlapping windows ensure that relevant phrases spanning a chunk
    boundary are still findable.

    Args:
        text: Full document text.
        chunk_size: Maximum words per chunk.
        overlap: Words shared between consecutive chunks.

    Returns:
        List of text chunks.

    Example:
        chunks = chunk_text(long_article, chunk_size=200, overlap=30)
    """
    words = text.split()
    # Each window starts (chunk_size - overlap) words after the previous one.
    step = max(1, chunk_size - overlap)
    return [
        " ".join(words[i: i + chunk_size])
        for i in range(0, len(words), step)
        if words[i: i + chunk_size]
    ]

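A small worked example makes the window arithmetic of `chunk_text` easier to see. The sketch below repeats the same slicing logic under a different, hypothetical name (`sliding_chunks`) so it runs on its own, and applies it to ten placeholder words with `chunk_size=4` and `overlap=1`, which gives a step of 3.

```python
def sliding_chunks(text, chunk_size=300, overlap=50):
    """Split text into word windows that share `overlap` words with their neighbour."""
    words = text.split()
    step = max(1, chunk_size - overlap)  # distance between window starts
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), step)
        if words[i:i + chunk_size]
    ]


text = " ".join(f"w{i}" for i in range(10))  # "w0 w1 ... w9"
chunks = sliding_chunks(text, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
# w9
```

Each chunk starts with the last word of the previous one. Note that the final window (`w9`) is already fully contained in the chunk before it; with this slicing scheme a short trailing chunk like that can appear whenever the last window starts inside the overlap region.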
def tokenize(
    text: str,
    *,
    remove_stopwords: bool = True,
    stem: bool = True,
    min_token_len: int = 2,
) -> list[str]:
    """
    Normalise text into a list of meaningful tokens.

    Lowercases, strips punctuation, removes stopwords, and optionally
    applies Porter stemming (if nltk is installed).

    Args:
        text: Input string.
        remove_stopwords: Filter common English stopwords.
        stem: Apply stemming for root-form matching.
        min_token_len: Discard tokens shorter than this length.

    Returns:
        List of processed token strings.

    Example:
        tokens = tokenize("Building scalable search engines")
        # => ['build', 'scalabl', 'search', 'engin'] (with stemming)
    """
    text = text.lower()
    # Replace everything except ASCII letters, digits, and whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [t for t in text.split() if len(t) >= min_token_len]

    if remove_stopwords:
        tokens = [t for t in tokens if t not in _STOPWORDS]

    # Stemming is skipped silently when nltk could not be loaded.
    if stem and _stemmer:
        tokens = [_stemmer.stem(t) for t in tokens]

    return tokens

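To show what the normalisation steps of `tokenize` do to real input, the following sketch reproduces the pipeline without nltk, which corresponds to the fallback path where no stemmer is available. The name `simple_tokenize` and the three-word stopword set are assumptions chosen to keep the example short.

```python
import re

# A deliberately tiny stopword set for the demo (the package uses a longer list).
STOPWORDS = {"the", "is", "at"}


def simple_tokenize(text, min_token_len=2):
    """Lowercase, strip punctuation, drop short tokens and stopwords (no stemming)."""
    text = text.lower()
    # Anything that is not a-z, 0-9, or whitespace becomes a space,
    # so "quick-brown" and "example.com" are split into separate words.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [t for t in text.split() if len(t) >= min_token_len]
    return [t for t in tokens if t not in STOPWORDS]


tokens = simple_tokenize("The Quick-Brown fox is at example.com!")
print(tokens)
# ['quick', 'brown', 'fox', 'example', 'com']
```

One consequence of the character class is that non-ASCII letters are also replaced by spaces, so a word such as "café" would be split into "caf" with the accented letter dropped.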
def build_inverted_index(
    documents: list[dict],
    *,
    text_field: str = "text",
    id_field: str = "url",
) -> dict:
    """
    Build an inverted index mapping tokens to list of (doc_id, frequency).

    The index is the core data structure that enables fast keyword lookups
    without scanning every document on every query.

    Args:
        documents: List of dicts, each containing at least text_field and
            id_field.
        text_field: Key whose value is the text to index.
        id_field: Key used as the document identifier.

    Returns:
        Dict: { token: [(doc_id, freq), ...] }

    Example:
        index = build_inverted_index(docs)
        print(index.get("search"))  # [("https://…/search", 12), …]

    ------------------------------------------------------------------------------------

    # SAMPLE LIST OF DICTIONARIES
    documents = [
        {"url": "https://example.com/python-basics",
         "text": "Python is a versatile programming language used for web development and data science"},
        {"url": "https://example.com/data-science",
         "text": "Data science uses Python and statistics to extract insights from data"},
        {"url": "https://example.com/web-dev",
         "text": "Web development with Python frameworks like Django makes building apps fast and fun"},
        {"url": "https://example.com/machine-learning",
         "text": "Machine learning is a subset of data science that trains models on data"},
    ]

    # BUILD THE INDEX
    index = build_inverted_index(documents, text_field="text", id_field="url")

    # EXAMPLE OUTPUT
    index["python"] →
    [("https://example.com/python-basics", 1), ("https://example.com/data-science", 1), ("https://example.com/web-dev", 1)]
    """
    index: dict[str, list[tuple[str, int]]] = defaultdict(list)

    for doc in documents:
        doc_id = doc[id_field]
        text = doc.get(text_field, "")
        # Count each token once per document, then record one posting per token.
        token_freq = Counter(tokenize(text))
        for token, freq in token_freq.items():
            index[token].append((doc_id, freq))

    return dict(index)

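The benefit of an inverted index is that a query only touches the postings of its own terms. The sketch below builds a small index in the same `{token: [(doc_id, freq), ...]}` shape and then answers an "all terms must match" query by intersecting posting sets. The helper names, the whitespace-only tokenisation, and the three sample documents are assumptions for illustration; the package's own index is built with `tokenize`.

```python
from collections import Counter, defaultdict


def build_index(documents):
    """Map each token to a list of (doc_id, frequency) postings."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for token, freq in Counter(text.lower().split()).items():
            index[token].append((doc_id, freq))
    return dict(index)


def docs_matching_all(index, terms):
    """Return the ids of documents that contain every query term."""
    postings = [{doc_id for doc_id, _freq in index.get(t, [])} for t in terms]
    return set.intersection(*postings) if postings else set()


docs = {
    "a": "python web python",
    "b": "web search",
    "c": "python search",
}
index = build_index(docs)

print(index["python"])                                   # [('a', 2), ('c', 1)]
print(docs_matching_all(index, ["python", "search"]))    # {'c'}
print(docs_matching_all(index, ["python", "missing"]))   # set()
```

A term that is absent from the index yields an empty posting set, so the intersection is empty without any document text being scanned.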
def compute_tfidf(
    documents: list[dict],
    *,
    text_field: str = "text",
    id_field: str = "url",
) -> dict[str, dict[str, float]]:
    """
    Compute TF-IDF scores for every (document, token) pair.

    TF-IDF (Term Frequency x Inverse Document Frequency) balances how
    often a term appears in one document against how rare it is across
    the whole corpus, which makes it the backbone of classical relevance ranking.

    Args:
        documents: Corpus as a list of dicts.
        text_field: Field containing raw text.
        id_field: Field used as document identifier.

    Returns:
        Nested dict: { doc_id: { token: tfidf_score } }

    Example:
        scores = compute_tfidf(docs)
        top = sorted(scores["https://…"].items(), key=lambda x: -x[1])[:5]
    """
    N = len(documents)
    tf_store: dict[str, dict[str, float]] = {}
    doc_freq: Counter = Counter()

    for doc in documents:
        doc_id = doc[id_field]
        tokens = tokenize(doc.get(text_field, ""))
        total = len(tokens) or 1  # avoid division by zero for empty documents
        freq = Counter(tokens)
        # Term frequency: share of the document's tokens taken by each term.
        tf_store[doc_id] = {t: c / total for t, c in freq.items()}
        # Document frequency: number of documents that contain each term.
        doc_freq.update(freq.keys())

    tfidf: dict[str, dict[str, float]] = {}
    for doc in documents:
        doc_id = doc[id_field]
        # Smoothed IDF: log((N + 1) / (df + 1)); zero when a term is in every document.
        tfidf[doc_id] = {
            t: tf * math.log((N + 1) / (doc_freq[t] + 1))
            for t, tf in tf_store[doc_id].items()
        }

    return tfidf

|
|
624
|
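The smoothed IDF in `compute_tfidf` is easiest to see with numbers. The sketch below is a worked calculation only; `smoothed_idf` is a hypothetical helper that mirrors the expression in the function, and the corpus size and term counts are made-up illustrative values.

```python
import math


def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    """IDF as computed in compute_tfidf: log((N + 1) / (df + 1))."""
    return math.log((n_docs + 1) / (doc_freq + 1))


# Toy numbers (illustrative, not taken from the package): a 3-document corpus.
N = 3
rare = smoothed_idf(N, 1)    # term found in one document
common = smoothed_idf(N, 3)  # term found in every document

# A term making up 2 of a document's 10 tokens has TF = 0.2.
tf = 2 / 10
print(round(tf * rare, 4))  # 0.1386
print(tf * common)          # 0.0
```

Because a term present in every document gets an IDF of zero, a query made only of such terms yields all-zero scores under this weighting.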

"""
Score and rank documents for a query using the inverted index + TF-IDF.

Accumulates TF-IDF scores for all query tokens found in the index,
then returns the top-k results by total relevance score.

Args:
    query: Raw user query string.
    index: Inverted index from build_inverted_index().
    tfidf: TF-IDF matrix from compute_tfidf().
    top_k: Maximum results to return.

Returns:
    List of (doc_id, score) tuples, highest score first.

Example:
    results = search_index("authentication tokens", index, tfidf)
    for url, score in results:
        print(f"{score:.3f} {url}")
"""


def search_index(
    query: str,
    index: dict,
    tfidf: dict[str, dict[str, float]],
    *,
    top_k: int = 10,
) -> list[tuple[str, float]]:
    query_tokens = tokenize(query)
    scores: dict[str, float] = defaultdict(float)

    for token in query_tokens:
        if token in index:
            for doc_id, _freq in index[token]:
                scores[doc_id] += tfidf.get(doc_id, {}).get(token, 0.0)

    return heapq.nlargest(top_k, scores.items(), key=lambda x: x[1])

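The score accumulation in `search_index` can be traced by hand on a tiny example. This is a sketch under stated assumptions: `toy_index`, `toy_tfidf` and `rank` are hypothetical, and the scores are hand-written values shaped like the outputs of `build_inverted_index()` and `compute_tfidf()`, not real results.

```python
import heapq
from collections import defaultdict

# Hand-written toy data (hypothetical values).
toy_index = {
    "auth": [("/login", 2), ("/api", 1)],
    "token": [("/api", 3)],
}
toy_tfidf = {
    "/login": {"auth": 0.30},
    "/api": {"auth": 0.10, "token": 0.25},
}


def rank(query_tokens: list[str], top_k: int = 10) -> list[tuple[str, float]]:
    """Sum per-token TF-IDF scores per document, then keep the best top_k."""
    scores: dict[str, float] = defaultdict(float)
    for token in query_tokens:
        for doc_id, _freq in toy_index.get(token, []):
            scores[doc_id] += toy_tfidf.get(doc_id, {}).get(token, 0.0)
    return heapq.nlargest(top_k, scores.items(), key=lambda item: item[1])


results = rank(["auth", "token", "missing"])
print([doc_id for doc_id, _score in results])  # ['/api', '/login']
```

`/api` wins because it matches both query tokens; tokens absent from the index contribute nothing.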

"""
Return an ordered list of all headings (h1-h6) with their level and text.

Headings provide document structure signals: boosting a result whose
h1 matches a query is a simple way to improve ranking quality.

Args:
    soup: Parsed BeautifulSoup object.

Returns:
    List of dicts: [{ "level": 1, "text": "Getting Started" }, …]

Example:
    headings = extract_headings(soup)
    print(headings[0])  # {"level": 1, "text": "Introduction"}
"""


def extract_headings(soup: BeautifulSoup) -> list[dict]:
    return [
        {"level": int(tag.name[1]), "text": tag.get_text(strip=True)}
        for tag in soup.find_all(re.compile(r"^h[1-6]$"))
        if tag.get_text(strip=True)
    ]


"""
Extract a query-centred excerpt to display in search results.

Finds the first occurrence of any query keyword in the text and
returns the surrounding word window, mimicking Google's snippet style.

Args:
    text: Full document text.
    query: User's search query.
    window: Words to show on each side of the match.
    max_length: Hard character cap on the returned snippet.

Returns:
    A short excerpt string, potentially with leading/trailing ellipsis.

Example:
    snippet = generate_snippet(article_text, "authentication tokens")
    # => "…Users obtain authentication tokens via the /api/auth endpoint…"
"""


def generate_snippet(
    text: str,
    query: str,
    *,
    window: int = 40,
    max_length: int = 300,
) -> str:
    words = text.split()
    query_tokens = set(tokenize(query, stem=False, remove_stopwords=False))
    match_idx = next(
        (i for i, w in enumerate(words) if re.sub(r"[^a-z]", "", w.lower()) in query_tokens),
        None,
    )

    if match_idx is None:
        snippet = " ".join(words[:window * 2])
        return (snippet[:max_length] + "…") if len(snippet) > max_length else snippet

    start = max(0, match_idx - window)
    end = min(len(words), match_idx + window + 1)
    snippet = " ".join(words[start:end])
    if start > 0:
        snippet = "…" + snippet
    if end < len(words):
        snippet += "…"

    return snippet[:max_length] or snippet

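The word-window logic of `generate_snippet` is sketched below in isolation. `window_snippet` is a hypothetical, simplified variant: it matches a single keyword by plain lowercase comparison instead of calling the package's `tokenize`, and it omits the `max_length` cap.

```python
def window_snippet(text: str, keyword: str, window: int = 3) -> str:
    """Return `window` words on each side of the first match of `keyword`."""
    words = text.split()
    match_idx = next(
        (i for i, w in enumerate(words) if w.lower().strip(".,;:!?") == keyword.lower()),
        None,
    )
    if match_idx is None:
        # No match: fall back to the opening words of the text.
        return " ".join(words[: window * 2])
    start = max(0, match_idx - window)
    end = min(len(words), match_idx + window + 1)
    snippet = " ".join(words[start:end])
    # Ellipses mark text that was cut off on either side.
    if start > 0:
        snippet = "…" + snippet
    if end < len(words):
        snippet += "…"
    return snippet


text = "Users obtain short lived authentication tokens via the auth endpoint every hour."
print(window_snippet(text, "tokens", window=2))
# …lived authentication tokens via the…
```

Clamping `start` and `end` to the word list keeps the window valid when the match sits near the beginning or end of the text.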

"""
Generate a stable SHA-256 fingerprint of a page's text content.

Store the fingerprint alongside each crawled document to detect
whether a page has changed since the last crawl, avoiding re-indexing
unchanged content.

Args:
    text: Extracted page text (from extract_text).

Returns:
    64-character hex string (SHA-256 digest).

Example:
    fp = fingerprint_page(page_text)
    if fp != stored_fingerprints.get(url):
        re_index(url)
"""


def fingerprint_page(text: str) -> str:
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

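A short check shows which edits the fingerprint treats as "no change". The sketch repeats the same normalisation under the hypothetical name `toy_fingerprint` so that it runs on its own.

```python
import hashlib


def toy_fingerprint(text: str) -> str:
    """Same normalisation as above: lowercase, collapse whitespace, SHA-256."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


fp_a = toy_fingerprint("Hello   World\n")
fp_b = toy_fingerprint("hello world")
fp_c = toy_fingerprint("hello, world")

print(fp_a == fp_b)  # True: case and whitespace differences are ignored
print(fp_a == fp_c)  # False: punctuation still counts as a change
print(len(fp_a))     # 64
```

Ignoring case and whitespace means purely cosmetic reflows of a page should not trigger re-indexing, while any change to the words or punctuation does.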

"""
Serialise the full search index to a JSON file for persistence.

The saved file can be reloaded on the next run, avoiding a full
re-crawl every time the search service starts.

Args:
    index: Inverted index from build_inverted_index().
    tfidf: TF-IDF scores from compute_tfidf().
    metadata: Per-page metadata records (list of dicts).
    path: Output file path.

Example:
    save_index(index, tfidf, page_metadata, "data/index.json")
"""


def save_index(
    index: dict,
    tfidf: dict,
    metadata: list[dict],
    path: str = "search_index.json",
) -> None:
    payload = {
        "inverted_index": dict(index),
        "tfidf": tfidf,
        "metadata": metadata,
    }
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2, ensure_ascii=False)
    print(f"[save_index] Saved index to {path}")


"""
Deserialise a previously saved search index from JSON.

Args:
    path: File path written by save_index().

Returns:
    Tuple (inverted_index, tfidf, metadata_list).

Raises:
    FileNotFoundError: If the file does not exist.

Example:
    index, tfidf, metadata = load_index("data/index.json")
"""


def load_index(path: str = "search_index.json") -> tuple[dict, dict, list]:
    with open(path, "r", encoding="utf-8") as f:
        payload = json.load(f)

    index = payload["inverted_index"]
    index = {k: [tuple(pair) for pair in v] for k, v in index.items()}
    return index, payload["tfidf"], payload["metadata"]

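The tuple conversion in `load_index` exists because JSON has no tuple type. The round trip below illustrates this with a hypothetical one-entry index; it uses in-memory strings rather than a file.

```python
import json

toy_index = {"python": [("https://example.com/a", 2)]}

# JSON has no tuple type, so postings come back as lists after a round trip.
restored = json.loads(json.dumps(toy_index))
print(restored["python"])  # [['https://example.com/a', 2]]

# Converting each pair back restores the original shape.
fixed = {token: [tuple(pair) for pair in postings] for token, postings in restored.items()}
print(fixed == toy_index)  # True
```

Without the conversion, code that compares or hashes postings as tuples would see lists instead after a reload.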

"""
Extract JSON-LD structured data blocks embedded in a page.

Many modern websites include schema.org markup (Article, FAQPage,
BreadcrumbList, etc.) which provides clean, machine-readable content
ideal for enriching search results.

Args:
    soup: Parsed BeautifulSoup object.

Returns:
    List of parsed JSON-LD objects found on the page.

Example:
    schemas = extract_structured_data(soup)
    for s in schemas:
        print(s.get("@type"), s.get("name"))
"""


def extract_structured_data(soup: BeautifulSoup) -> list[dict]:
    results = []
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.string or "{}")
            if isinstance(data, list):
                results.extend(data)
            else:
                results.append(data)
        except json.JSONDecodeError:
            continue
    return results

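The list-versus-object handling in `extract_structured_data` can be shown without an HTML parser. In this sketch, `flatten_jsonld` is a hypothetical helper that takes the raw script contents as strings (an assumption; the function above reads them from `<script>` tags), and the sample blocks are made up.

```python
import json


def flatten_jsonld(raw_blocks: list[str]) -> list[dict]:
    """Parse raw JSON-LD strings; arrays are flattened, bad JSON is skipped."""
    results: list[dict] = []
    for raw in raw_blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(data, list):
            results.extend(data)
        else:
            results.append(data)
    return results


blocks = [
    '{"@type": "Article", "name": "Intro"}',
    '[{"@type": "FAQPage"}, {"@type": "BreadcrumbList"}]',
    '{not valid json',
]
print([item["@type"] for item in flatten_jsonld(blocks)])
# ['Article', 'FAQPage', 'BreadcrumbList']
```

Flattening top-level arrays gives callers one uniform list to iterate, whether a page embeds one object per script tag or several in a single array.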

"""
Parse a site's sitemap.xml to seed the crawler URL queue.

Tries /sitemap.xml and /sitemap_index.xml; follows nested sitemaps
(sitemapindex elements) one level deep.

Args:
    base_url: Root URL of the site, e.g. "https://docs.example.com".
    session: Optional reusable requests.Session.

Returns:
    Sorted, deduplicated list of page URLs listed in the sitemap(s).

Example:
    urls = build_sitemap_urls("https://docs.example.com")
    print(f"Found {len(urls)} URLs in sitemap")
"""


def build_sitemap_urls(
    base_url: str,
    *,
    session: requests.Session | None = None,
) -> list[str]:
    parsed = urlparse(base_url)
    origin = f"{parsed.scheme}://{parsed.netloc}"
    candidates = [f"{origin}/sitemap.xml", f"{origin}/sitemap_index.xml"]
    sess = session or requests

    def _parse_sitemap(url: str, depth: int = 0) -> list[str]:
        resp = fetch_page(url, session=sess)
        if not resp:
            return []
        soup = BeautifulSoup(resp.content, "lxml-xml")
        urls: list[str] = []
        # Follow nested sitemaps one level deep only, as documented above.
        if depth < 1:
            for sm in soup.find_all("sitemap"):
                loc = sm.find("loc")
                if loc is not None:
                    urls.extend(_parse_sitemap(loc.text.strip(), depth + 1))
        # Read the <loc> child: the text of the whole <url> element would
        # also include <lastmod>, <changefreq> and similar children.
        for entry in soup.find_all("url"):
            loc = entry.find("loc")
            if loc is not None:
                urls.append(loc.text.strip())
        return urls

    all_urls: set[str] = set()
    for candidate in candidates:
        found = _parse_sitemap(candidate)
        all_urls.update(found)
        if all_urls:
            break

    return sorted(all_urls)

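Reading page URLs from a sitemap comes down to collecting the `<loc>` child of each `<url>` entry. The sketch below swaps in the standard library's `xml.etree.ElementTree` for BeautifulSoup so it runs without third-party packages; `page_urls` and the sample XML are hypothetical, and no network request is made.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://docs.example.com/b</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
  <url>
    <loc> https://docs.example.com/a </loc>
  </url>
  <url>
    <loc>https://docs.example.com/b</loc>
  </url>
</urlset>
"""


def page_urls(xml_text: str) -> list[str]:
    """Return the sorted, deduplicated <loc> values of every <url> entry."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    locs = {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", NS)
        if loc.text
    }
    return sorted(locs)


print(page_urls(SITEMAP_XML))
# ['https://docs.example.com/a', 'https://docs.example.com/b']
```

Selecting `<loc>` explicitly keeps sibling fields such as `<lastmod>` out of the URL, and the set removes the duplicate entry before sorting.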

"""
Wrap query keywords in HTML highlight tags within a snippet string.

Use in search result UIs to visually emphasise where the match occurs.
The matching is case-insensitive and handles whole words only.

Args:
    snippet: Text excerpt (from generate_snippet).
    query: Original user query.
    open_tag: Opening HTML tag (default <mark>).
    close_tag: Closing HTML tag (default </mark>).

Returns:
    Snippet string with matching keywords wrapped in tags.

Example:
    highlighted = highlight_query_terms(
        "retrieve authentication tokens from the vault",
        "authentication tokens"
    )
    # => "retrieve <mark>authentication</mark> <mark>tokens</mark>…"
"""


def highlight_query_terms(
    snippet: str,
    query: str,
    *,
    open_tag: str = "<mark>",
    close_tag: str = "</mark>",
) -> str:
    keywords = [
        re.escape(w) for w in query.split()
        if w.lower() not in _STOPWORDS and len(w) > 1
    ]
    if not keywords:
        return snippet

    # \b anchors restrict matches to whole words, as the docstring promises.
    pattern = re.compile(r"\b(" + "|".join(keywords) + r")\b", re.IGNORECASE)
    # A function replacement keeps backslashes in the tags from being
    # interpreted as regex escapes.
    return pattern.sub(lambda m: f"{open_tag}{m.group(1)}{close_tag}", snippet)

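Whole-word matching depends on the `\b` anchors around the keyword alternation. The sketch below shows the effect in isolation; `highlight` is a hypothetical helper that takes a ready-made keyword list and skips the stopword filtering done by the function above.

```python
import re


def highlight(
    snippet: str,
    keywords: list[str],
    open_tag: str = "<mark>",
    close_tag: str = "</mark>",
) -> str:
    """Wrap whole-word, case-insensitive keyword matches in the given tags."""
    if not keywords:
        return snippet
    alternation = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(rf"\b({alternation})\b", re.IGNORECASE)
    # A function replacement avoids treating backslashes in the tags as escapes.
    return pattern.sub(lambda m: f"{open_tag}{m.group(1)}{close_tag}", snippet)


print(highlight("Token expired; tokenization is separate.", ["token"]))
# <mark>Token</mark> expired; tokenization is separate.
```

Without the anchors, the `token` inside `tokenization` would be wrapped as well. Note that neither version HTML-escapes the snippet, so untrusted text should be escaped before it is rendered.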

"""
Re-index a single URL only if its content has changed.

Compares the page's current fingerprint against a stored one; if
unchanged the function exits early, making scheduled re-crawls cheap.

Args:
    url: Page to check and potentially re-index.
    index: Mutable inverted index (modified in place).
    tfidf: Mutable TF-IDF store (modified in place).
    metadata: Mutable metadata list (modified in place).
    fingerprints: Dict mapping url to last known fingerprint (mutable).
    session: Optional requests.Session.

Returns:
    True if the page was re-indexed, False if it was unchanged.

Example:
    changed = incremental_update(url, index, tfidf, metadata, fps)
    print("Updated" if changed else "No change")
"""


def incremental_update(
    url: str,
    index: dict,
    tfidf: dict,
    metadata: list[dict],
    fingerprints: dict[str, str],
    *,
    session: requests.Session | None = None,
) -> bool:
    resp = fetch_page(url, session=session)
    if resp is None:
        return False

    soup = parse_html(resp.text)
    text = extract_text(soup)
    fp = fingerprint_page(text)

    if fingerprints.get(url) == fp:
        return False  # No change: skip re-indexing

    fingerprints[url] = fp

    # Remove stale entries from index
    for token in list(index.keys()):
        index[token] = [(doc_id, freq) for doc_id, freq in index[token] if doc_id != url]
        if not index[token]:
            del index[token]

    # Remove stale tfidf and metadata entries
    tfidf.pop(url, None)
    metadata[:] = [m for m in metadata if m.get("url") != url]

    # Build fresh entries for this page
    token_freq = Counter(tokenize(text))
    total = sum(token_freq.values()) or 1
    for token, freq in token_freq.items():
        index.setdefault(token, []).append((url, freq))
    # NOTE: only the term frequency is stored here. IDF depends on the
    # whole corpus and is not recomputed by this function.
    tfidf[url] = {t: (c / total) for t, c in token_freq.items()}

    meta = extract_metadata(soup, url)
    meta["url"] = url
    meta["fingerprint"] = fp
    metadata.append(meta)

    return True
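The fingerprint gate and postings replacement in `incremental_update` can be exercised without any network access. This is a minimal sketch under stated assumptions: `update_if_changed` is hypothetical, takes the page text directly instead of fetching a URL, uses a whitespace split as a stand-in for `tokenize`, and leaves out the TF-IDF and metadata stores.

```python
import hashlib
from collections import Counter


def fingerprint(text: str) -> str:
    """Lowercase, collapse whitespace, then hash with SHA-256."""
    return hashlib.sha256(" ".join(text.lower().split()).encode("utf-8")).hexdigest()


def update_if_changed(url: str, text: str, index: dict, fingerprints: dict) -> bool:
    """Replace `url`'s postings only when its text fingerprint has changed."""
    fp = fingerprint(text)
    if fingerprints.get(url) == fp:
        return False
    fingerprints[url] = fp
    # Drop the page's old postings, deleting tokens left with none.
    for token in list(index):
        index[token] = [(d, f) for d, f in index[token] if d != url]
        if not index[token]:
            del index[token]
    # Add fresh postings (whitespace split is a stand-in for tokenize()).
    for token, freq in Counter(text.lower().split()).items():
        index.setdefault(token, []).append((url, freq))
    return True


toy_index, fps = {}, {}
print(update_if_changed("/a", "old text", toy_index, fps))    # True
print(update_if_changed("/a", "old   TEXT", toy_index, fps))  # False
print(update_if_changed("/a", "new text", toy_index, fps))    # True
print(sorted(toy_index))                                      # ['new', 'text']
```

The second call returns early because only case and spacing changed; the third removes the stale `old` posting before adding the new ones.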
@@ -0,0 +1,69 @@
Metadata-Version: 2.1
Name: softhauzpy
Version: 0.0.1
Description-Content-Type: text/markdown

# SofthauzPy
**SofthauzPy** is a comprehensive Python toolkit built for developers creating intelligent, data-driven web applications. It provides a powerful suite of web utilities including web scraping tools, crawling systems, content extraction pipelines, and search engine components that help developers build fully customizable in-house website search solutions.

Designed for scalability and flexibility, Softhauz enables teams to collect, process, index, and search website content efficiently, all within a clean Python-first development ecosystem.

Built for developers who need scalable web data tools and intelligent search capabilities, Softhauz simplifies the process of scraping, processing, indexing, and searching website content.
From lightweight crawlers to fully customizable in-house search engine functionality, Softhauz helps developers build smarter web applications without relying heavily on external search services.


## Key Features

**Web Scraping & Crawling**

- High-performance web scraping utilities
- HTML parsing and structured data extraction
- Recursive website crawling
- Sitemap discovery and URL indexing
- Support for asynchronous scraping workflows
- Rate limiting and request handling utilities

**Search Engine Toolkit**

- In-house website search engine creation
- Full-text indexing and querying
- Custom relevance ranking algorithms
- Search filtering and query optimization
- Incremental indexing support
- Lightweight search infrastructure for internal platforms

**Content Processing**

- Text normalization and cleaning
- Metadata extraction
- Duplicate content detection
- Keyword extraction and tagging
- Content chunking for AI and search applications

**AI & Semantic Search Ready**

- Embedding generation helpers
- Vector database compatibility
- Semantic similarity search utilities
- Retrieval-Augmented Generation (RAG) support
- AI-powered content indexing workflows

**Developer Experience**

- Modular and extensible architecture
- Framework-friendly design for Flask, Django, and FastAPI
- Easy API integration
- Clean, Pythonic interfaces
- Production-ready utilities for scalable deployments

> This program may incorporate artificial intelligence (AI) tools solely
> to support and enhance development efficiency, code quality, and
> overall performance. All software design, implementation, testing,
> validation, and quality assurance processes are conducted and reviewed
> by a qualified human software professional to ensure accuracy,
> reliability, security, and compliance with applicable standards.

Author:
**Urate, Karen**<br>
*Softhauz Software Architect*<br>
[softhauz.ca](https://softhauz.ca)
@@ -0,0 +1 @@
@@ -0,0 +1 @@
softhauzpy