sveden-table-matrixizer 1.0.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Gorshipisk
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,147 @@
1
+ Metadata-Version: 2.4
2
+ Name: sveden-table-matrixizer
3
+ Version: 1.0.0
4
+ Summary: Module for converting tables from HTML pages (/sveden) into matrices of bs4 tags
5
+ Author: BestTvGU
6
+ License: MIT
7
+ Requires-Python: >=3.11
8
+ Description-Content-Type: text/markdown
9
+ License-File: LICENSE
10
+ Requires-Dist: beautifulsoup4==4.14.3
11
+ Requires-Dist: aiohttp==3.13.5
12
+ Dynamic: license-file
13
+
14
+ # sveden-table-matrixizer
15
+
16
+ **Extract and matrixize HTML tables from Russian educational organization pages (`/sveden`) with full colspan/rowspan
17
+ support.**
18
+
19
+ `sveden-table-matrixizer` is an asynchronous Python library that parses tables found on `/sveden` (сведения об
20
+ образовательной организации) pages, expands `colspan` and `rowspan` attributes into clean two-dimensional matrices, and
21
+ returns structured `head` and `body` cell grids. It gives you full control over edge cases via configurable callbacks.
22
+
23
+ ## Features
24
+
25
+ - **Async page fetching** using `aiohttp`.
26
+ - **Full colspan/rowspan expansion** – cells spanning multiple rows and/or columns are duplicated into the matrix.
27
+ - **Separate head and body matrix extraction** – each table yields a `head` matrix (from `<thead>`) and a `body`
28
+ matrix (from `<tbody>`).
29
+ - **Customizable error handling** – replace default reactions to missing headers, missing bodies, or multiple `<thead>`/
30
+ `<tbody>` elements.
31
+ - **Lightweight** – only depends on `aiohttp` and `beautifulsoup4`.
32
+ - **Works with any HTML** – designed for `/sveden`, but usable on any page containing `<table>` elements.
33
+
34
+ ## Installation
35
+
36
+ ```bash
37
+ pip install sveden-table-matrixizer
38
+ ```
39
+
40
+ ## Quick Start
41
+
42
+ ```python
+ import asyncio
+ from sveden_table_matrixizer import matrixize_tables_from_page
+
+
+ async def main():
+     url = "https://example.edu/sveden/"
+     tables = await matrixize_tables_from_page(url)
+
+     for i, table in enumerate(tables):
+         print(f"Table {i + 1}:")
+         print(" Head rows:", len(table.head))
+         print(" Body rows:", len(table.body))
+         # Access cells as list[list[bs4.Tag]]
+
+
+ asyncio.run(main())
+ ```
60
+
61
+ Each `MatrixizedTable` contains:
62
+
63
+ - `head: list[list[Tag]]` – rows of header cells _(each row is a list of `bs4.Tag`)_.
64
+ - `body: list[list[Tag]]` – rows of body cells.
65
+
66
+ ## Handling Edge Cases
67
+
68
+ By default, a table that lacks a `<thead>` or `<tbody>` is skipped silently (the default callbacks are no-ops), while a
69
+ table with more than one `<thead>` / `<tbody>` raises an exception. You can override this behavior with `ExtractorOptions`.
70
+
71
+ ```python
+ import asyncio
+
+ from sveden_table_matrixizer import matrixize_tables_from_page
+ # ExtractorOptions is defined in the types module; the package __init__ does not re-export it.
+ from sveden_table_matrixizer.types import ExtractorOptions
+
+
+ def handle_missing_header(table_tag, collected_tables):
+     print(f"Skipping table without <thead>: {table_tag.get('id', 'no id')}")
+
+
+ opts = ExtractorOptions(
+     on_table_no_header=handle_missing_header,
+     # other callbacks can be set similarly
+ )
+
+
+ async def main():
+     url = "https://example.edu/sveden/"
+     tables = await matrixize_tables_from_page(url, options=opts)
+
+
+ asyncio.run(main())
+ ```
87
+
88
+ You can also supply async callbacks – the library automatically detects and awaits them.
89
+
90
+ ## API Reference
91
+
92
+ ### `matrixize_tables_from_page(url, *, options=None)`
93
+
94
+ - **Parameters:**
95
+ - `url` (`str`) – URL of the page to scrape.
96
+ - `options` (`ExtractorOptions`, optional) – configuration callbacks.
97
+ - **Returns:** `list[MatrixizedTable]` – extracted and matrixized tables.
98
+
99
+ ### `ExtractorOptions`
100
+
101
+ A frozen dataclass with the following fields (all optional):
102
+
103
+ | Field | Type | Default | Description |
104
+ |-----------------------------|---------------------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------|
105
+ | `on_table_no_header` | `Callable[[Tag, Sequence[MatrixizedTable]], Any]` | no‑op | Called when a table has no `<thead>` |
106
+ | `on_table_no_body` | `Callable[[Tag, Sequence[MatrixizedTable]], Any]` | no‑op | Called when a table has no `<tbody>` |
107
+ | `on_multiply_table_headers` | `Callable[[Sequence[Tag]], Tag \| Never]` | raises `MultiplyTableHeaderError` | Called when more than one `<thead>` is found; must return a single `<thead>` element. |
108
+ | `on_multiply_table_bodies` | `Callable[[Sequence[Tag]], Tag \| Never]` | raises `MultiplyTableBodyError` | Called when more than one `<tbody>` is found; must return a single `<tbody>` element. |
109
+
110
+ ### `MatrixizedTable`
111
+
112
+ ```python
+ @dataclass(frozen=True, kw_only=True)
+ class MatrixizedTable:
+     head: list[list[Tag]]  # matrix of <th> tags
+     body: list[list[Tag]]  # matrix of <td> tags
+ ```
118
+
119
+ ### Exceptions
120
+
121
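+ - `NoTableHeaderError` – raised internally when a table has no `<thead>`; caught by `matrixize_tables_from_page`, which then calls `options.on_table_no_header`.
121
+ - `NoTableBodyError` – raised internally when a table has no `<tbody>`; caught the same way and routed to `options.on_table_no_body`.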
123
+ - `MultiplyTableHeaderError` – default reaction to multiple `<thead>` elements.
124
+ - `MultiplyTableBodyError` – default reaction to multiple `<tbody>` elements.
125
+
126
+ All exceptions are exported from `sveden_table_matrixizer.errors`.
127
+
128
+ ## How It Works
129
+
130
+ 1. The page is fetched with `aiohttp` and parsed by BeautifulSoup.
131
+ 2. All `<table>` tags are collected.
132
+ 3. For each table:
133
+ - The `<thead>` is located; if missing or duplicate, the appropriate callback is invoked.
134
+ - The `<tbody>` is located similarly.
135
+ - Header rows are expanded: each `<th>` with `colspan`/`rowspan` is replicated into the correct cells of a 2D list.
136
+ - The same expansion is applied to body rows using `<td>` elements.
137
+ 4. A `MatrixizedTable(head=..., body=...)` is created and added to the result list.
138
+
139
+ ## Dependencies
140
+
141
+ - Python ≥ 3.11
142
+ - [aiohttp](https://pypi.org/project/aiohttp/)
143
+ - [beautifulsoup4](https://pypi.org/project/beautifulsoup4/)
144
+
145
+ ## License
146
+
147
+ This project is licensed under the MIT License – see the source repository for details.
@@ -0,0 +1,134 @@
1
+ # sveden-table-matrixizer
2
+
3
+ **Extract and matrixize HTML tables from Russian educational organization pages (`/sveden`) with full colspan/rowspan
4
+ support.**
5
+
6
+ `sveden-table-matrixizer` is an asynchronous Python library that parses tables found on `/sveden` (сведения об
7
+ образовательной организации) pages, expands `colspan` and `rowspan` attributes into clean two-dimensional matrices, and
8
+ returns structured `head` and `body` cell grids. It gives you full control over edge cases via configurable callbacks.
9
+
10
+ ## Features
11
+
12
+ - **Async page fetching** using `aiohttp`.
13
+ - **Full colspan/rowspan expansion** – cells spanning multiple rows and/or columns are duplicated into the matrix.
14
+ - **Separate head and body matrix extraction** – each table yields a `head` matrix (from `<thead>`) and a `body`
15
+ matrix (from `<tbody>`).
16
+ - **Customizable error handling** – replace default reactions to missing headers, missing bodies, or multiple `<thead>`/
17
+ `<tbody>` elements.
18
+ - **Lightweight** – only depends on `aiohttp` and `beautifulsoup4`.
19
+ - **Works with any HTML** – designed for `/sveden`, but usable on any page containing `<table>` elements.
20
+
21
+ ## Installation
22
+
23
+ ```bash
24
+ pip install sveden-table-matrixizer
25
+ ```
26
+
27
+ ## Quick Start
28
+
29
+ ```python
+ import asyncio
+ from sveden_table_matrixizer import matrixize_tables_from_page
+
+
+ async def main():
+     url = "https://example.edu/sveden/"
+     tables = await matrixize_tables_from_page(url)
+
+     for i, table in enumerate(tables):
+         print(f"Table {i + 1}:")
+         print(" Head rows:", len(table.head))
+         print(" Body rows:", len(table.body))
+         # Access cells as list[list[bs4.Tag]]
+
+
+ asyncio.run(main())
+ ```
47
+
48
+ Each `MatrixizedTable` contains:
49
+
50
+ - `head: list[list[Tag]]` – rows of header cells _(each row is a list of `bs4.Tag`)_.
51
+ - `body: list[list[Tag]]` – rows of body cells.
52
+
53
+ ## Handling Edge Cases
54
+
55
+ By default, a table that lacks a `<thead>` or `<tbody>` is skipped silently (the default callbacks are no-ops), while a
55
+ table with more than one `<thead>` / `<tbody>` raises an exception. You can override this behavior with `ExtractorOptions`.
57
+
58
+ ```python
+ import asyncio
+
+ from sveden_table_matrixizer import matrixize_tables_from_page
+ # ExtractorOptions is defined in the types module; the package __init__ does not re-export it.
+ from sveden_table_matrixizer.types import ExtractorOptions
+
+
+ def handle_missing_header(table_tag, collected_tables):
+     print(f"Skipping table without <thead>: {table_tag.get('id', 'no id')}")
+
+
+ opts = ExtractorOptions(
+     on_table_no_header=handle_missing_header,
+     # other callbacks can be set similarly
+ )
+
+
+ async def main():
+     url = "https://example.edu/sveden/"
+     tables = await matrixize_tables_from_page(url, options=opts)
+
+
+ asyncio.run(main())
+ ```
74
+
75
+ You can also supply async callbacks – the library automatically detects and awaits them.
76
+
77
+ ## API Reference
78
+
79
+ ### `matrixize_tables_from_page(url, *, options=None)`
80
+
81
+ - **Parameters:**
82
+ - `url` (`str`) – URL of the page to scrape.
83
+ - `options` (`ExtractorOptions`, optional) – configuration callbacks.
84
+ - **Returns:** `list[MatrixizedTable]` – extracted and matrixized tables.
85
+
86
+ ### `ExtractorOptions`
87
+
88
+ A frozen dataclass with the following fields (all optional):
89
+
90
+ | Field | Type | Default | Description |
91
+ |-----------------------------|---------------------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------|
92
+ | `on_table_no_header` | `Callable[[Tag, Sequence[MatrixizedTable]], Any]` | no‑op | Called when a table has no `<thead>` |
93
+ | `on_table_no_body` | `Callable[[Tag, Sequence[MatrixizedTable]], Any]` | no‑op | Called when a table has no `<tbody>` |
94
+ | `on_multiply_table_headers` | `Callable[[Sequence[Tag]], Tag \| Never]` | raises `MultiplyTableHeaderError` | Called when more than one `<thead>` is found; must return a single `<thead>` element. |
95
+ | `on_multiply_table_bodies` | `Callable[[Sequence[Tag]], Tag \| Never]` | raises `MultiplyTableBodyError` | Called when more than one `<tbody>` is found; must return a single `<tbody>` element. |
96
+
97
+ ### `MatrixizedTable`
98
+
99
+ ```python
+ @dataclass(frozen=True, kw_only=True)
+ class MatrixizedTable:
+     head: list[list[Tag]]  # matrix of <th> tags
+     body: list[list[Tag]]  # matrix of <td> tags
+ ```
105
+
106
+ ### Exceptions
107
+
108
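+ - `NoTableHeaderError` – raised internally when a table has no `<thead>`; caught by `matrixize_tables_from_page`, which then calls `options.on_table_no_header`.
108
+ - `NoTableBodyError` – raised internally when a table has no `<tbody>`; caught the same way and routed to `options.on_table_no_body`.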
110
+ - `MultiplyTableHeaderError` – default reaction to multiple `<thead>` elements.
111
+ - `MultiplyTableBodyError` – default reaction to multiple `<tbody>` elements.
112
+
113
+ All exceptions are exported from `sveden_table_matrixizer.errors`.
114
+
115
+ ## How It Works
116
+
117
+ 1. The page is fetched with `aiohttp` and parsed by BeautifulSoup.
118
+ 2. All `<table>` tags are collected.
119
+ 3. For each table:
120
+ - The `<thead>` is located; if missing or duplicate, the appropriate callback is invoked.
121
+ - The `<tbody>` is located similarly.
122
+ - Header rows are expanded: each `<th>` with `colspan`/`rowspan` is replicated into the correct cells of a 2D list.
123
+ - The same expansion is applied to body rows using `<td>` elements.
124
+ 4. A `MatrixizedTable(head=..., body=...)` is created and added to the result list.
125
+
126
+ ## Dependencies
127
+
128
+ - Python ≥ 3.11
129
+ - [aiohttp](https://pypi.org/project/aiohttp/)
130
+ - [beautifulsoup4](https://pypi.org/project/beautifulsoup4/)
131
+
132
+ ## License
133
+
134
+ This project is licensed under the MIT License – see the source repository for details.
@@ -0,0 +1,26 @@
1
+ [build-system]
2
+ requires = ["setuptools"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "sveden-table-matrixizer"
7
+ version = "1.0.0"
8
+ description = "Module for converting tables from HTML pages (/sveden) into matrices of bs4 tags"
9
+ readme = "README.md"
10
+ requires-python = ">=3.11"
11
+ license = { text = "MIT" }
12
+
13
+ authors = [
14
+ { name = "BestTvGU" }
15
+ ]
16
+
17
+ dependencies = [
18
+ "beautifulsoup4==4.14.3",
19
+ "aiohttp==3.13.5"
20
+ ]
21
+
22
+ [tool.setuptools]
23
+ packages = ["sveden_table_matrixizer"]
24
+
25
+ [tool.setuptools.package-data]
26
+ sveden_table_matrixizer = ["py.typed"]
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
@@ -0,0 +1,3 @@
1
+ from .tables_extractor import matrixize_tables_from_page
2
+
3
+ __all__ = ["matrixize_tables_from_page"]
@@ -0,0 +1,28 @@
+ from dataclasses import dataclass
+ from typing import Never
+
+ from bs4 import Tag, ResultSet
+
+ from .errors import MultiplyTableHeaderError, MultiplyTableBodyError
+
+
+ @dataclass(frozen=True, kw_only=True)
+ class MatrixizedTable:
+     head: list[list[Tag]]
+     body: list[list[Tag]]
+
+
+ def on_table_no_header_def(_table: Tag, _matrixized_tables: list[MatrixizedTable]) -> None:
+     pass
+
+
+ def on_table_no_body_def(_table: Tag, _matrixized_tables: list[MatrixizedTable]) -> None:
+     pass
+
+
+ def on_multiply_table_headers_def(_heads: ResultSet[Tag]) -> Never:
+     raise MultiplyTableHeaderError
+
+
+ def on_multiply_table_bodies_def(_bodies: ResultSet[Tag]) -> Never:
+     raise MultiplyTableBodyError
@@ -0,0 +1,14 @@
+ class NoTableHeaderError(Exception):
+     pass
+
+
+ class MultiplyTableHeaderError(Exception):
+     pass
+
+
+ class NoTableBodyError(Exception):
+     pass
+
+
+ class MultiplyTableBodyError(Exception):
+     pass
@@ -0,0 +1,50 @@
+ from typing import Generator
+
+ from bs4 import Tag, ResultSet
+
+ from .errors import NoTableBodyError
+ from .types import ExtractorOptions
+ from .utils import handle_maybe_async
+
+
+ async def extract_table_body(table: Tag, *, options: ExtractorOptions) -> Tag:
+     bodies: ResultSet[Tag] = table.find_all("tbody", recursive=False)
+
+     if len(bodies) == 0:
+         raise NoTableBodyError
+     if len(bodies) > 1:
+         return await handle_maybe_async(options.on_multiply_table_bodies, bodies)
+
+     return bodies[0]
+
+
+ def body_tr_generator(tr_tag: Tag) -> Generator[tuple[Tag, int, int], None, None]:
+     for td_tag in tr_tag.find_all("td", recursive=False):
+         td_colspan: int = int(td_tag.get("colspan") or 1)
+         td_rowspan: int = int(td_tag.get("rowspan") or 1)
+
+         yield td_tag, td_colspan, td_rowspan
+
+
+ def body_table_generator(table_body: Tag) -> Generator[Generator[tuple[Tag, int, int], None, None], None, None]:
+     for tr_tag in table_body.find_all("tr", recursive=False):
+         yield body_tr_generator(tr_tag)
+
+
+ async def extract_table_body_columns(table_body: Tag) -> list[list[Tag]]:
+     columns_matrix: list[list[Tag]] = []
+
+     for tr_ind, tr_info in enumerate(body_table_generator(table_body)):
+         for td_ind, (td_tag, td_colspan, td_rowspan) in enumerate(tr_info):
+             for row_span in range(td_rowspan):
+                 row_ind: int = tr_ind + row_span
+
+                 try:
+                     columns_matrix[row_ind]
+                 except IndexError:
+                     columns_matrix.append([])
+
+                 for column_span in range(td_colspan):
+                     columns_matrix[row_ind].append(td_tag)
+
+     return columns_matrix
@@ -0,0 +1,50 @@
+ from typing import Generator
+
+ from bs4 import Tag, ResultSet
+
+ from .errors import NoTableHeaderError
+ from .types import ExtractorOptions
+ from .utils import handle_maybe_async
+
+
+ async def extract_table_head(table: Tag, *, options: ExtractorOptions) -> Tag:
+     heads: ResultSet[Tag] = table.find_all("thead", recursive=False)
+
+     if len(heads) == 0:
+         raise NoTableHeaderError
+     if len(heads) > 1:
+         return await handle_maybe_async(options.on_multiply_table_headers, heads)
+
+     return heads[0]
+
+
+ def head_tr_generator(tr_tag: Tag) -> Generator[tuple[Tag, int, int], None, None]:
+     for th_tag in tr_tag.find_all("th", recursive=False):
+         th_colspan: int = int(th_tag.get("colspan") or 1)
+         th_rowspan: int = int(th_tag.get("rowspan") or 1)
+
+         yield th_tag, th_colspan, th_rowspan
+
+
+ def head_table_generator(table_head: Tag) -> Generator[Generator[tuple[Tag, int, int], None, None], None, None]:
+     for tr_tag in table_head.find_all("tr", recursive=False):
+         yield head_tr_generator(tr_tag)
+
+
+ async def extract_table_head_columns(table_head: Tag) -> list[list[Tag]]:
+     columns_matrix: list[list[Tag]] = []
+
+     for tr_ind, tr_info in enumerate(head_table_generator(table_head)):
+         for th_ind, (th_tag, th_colspan, th_rowspan) in enumerate(tr_info):
+             for row_span in range(th_rowspan):
+                 row_ind: int = tr_ind + row_span
+
+                 try:
+                     columns_matrix[row_ind]
+                 except IndexError:
+                     columns_matrix.append([])
+
+                 for column_span in range(th_colspan):
+                     columns_matrix[row_ind].append(th_tag)
+
+     return columns_matrix
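Review note on the span expansion in the two extractor modules: both append each cell to the target row in processing order, so a cell carried down by `rowspan` lands before the cells of the later row even when it belongs in a column to their right. For a first row `A, B(rowspan=2)` followed by a row `C`, the second output row comes out as `[B, C]` rather than `[C, B]`. The sketch below is one possible position-aware alternative, not the package's code. It works on plain `(cell, colspan, rowspan)` triples instead of `bs4.Tag` objects so it runs with the standard library alone, and the name `expand_spans` is hypothetical.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def expand_spans(rows: Sequence[Sequence[tuple[T, int, int]]]) -> list[list[T]]:
    """Expand (cell, colspan, rowspan) triples into a grid, honoring occupied slots."""
    grid: list[dict[int, T]] = []  # one {column index: cell} mapping per output row

    for row_index, row in enumerate(rows):
        while len(grid) <= row_index:
            grid.append({})

        column = 0
        for cell, colspan, rowspan in row:
            # Skip slots already claimed by a rowspan from an earlier row.
            while column in grid[row_index]:
                column += 1

            # Write the cell into every slot it covers.
            for row_offset in range(rowspan):
                while len(grid) <= row_index + row_offset:
                    grid.append({})
                for column_offset in range(colspan):
                    grid[row_index + row_offset][column + column_offset] = cell

            column += colspan

    # Emit each row in column order.
    return [[slots[index] for index in sorted(slots)] for slots in grid]


rows = [
    [("A", 1, 1), ("B", 1, 2)],  # B spans two rows
    [("C", 1, 1)],
]
grid = expand_spans(rows)
print(grid)
```

With real tables the triples could come from generators like `head_tr_generator` or `body_tr_generator` above, since they already yield `(tag, colspan, rowspan)`.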
@@ -0,0 +1,48 @@
+ from bs4 import BeautifulSoup, Tag, ResultSet
+
+ from .def_funcs import MatrixizedTable
+ from .errors import NoTableHeaderError, NoTableBodyError
+ from .table_body_extractor import extract_table_body_columns, extract_table_body
+ from .table_head_extractor import extract_table_head, extract_table_head_columns
+ from .types import ExtractorOptions
+ from .utils import get_page_bs4, handle_maybe_async
+
+
+ async def matrixize_table(table: Tag, options: ExtractorOptions) -> MatrixizedTable:
+     head: Tag = await extract_table_head(table, options=options)
+     head_columns: list[list[Tag]] = await extract_table_head_columns(head)
+
+     body: Tag = await extract_table_body(table, options=options)
+     body_columns: list[list[Tag]] = await extract_table_body_columns(body)
+
+     return MatrixizedTable(
+         head=head_columns,
+         body=body_columns
+     )
+
+
+ async def extract_tables(page_bs4: BeautifulSoup) -> ResultSet[Tag]:
+     tables: ResultSet[Tag] = page_bs4.find_all("table")  # search the whole tree, not only top-level nodes
+
+     return tables
+
+
+ async def matrixize_tables_from_page(url: str, *, options: ExtractorOptions | None = None) -> list[MatrixizedTable]:
+     if options is None:
+         options = ExtractorOptions()
+
+     page_bs4: BeautifulSoup = await get_page_bs4(url)
+     tables: ResultSet[Tag] = await extract_tables(page_bs4)
+
+     matrixized_tables: list[MatrixizedTable] = []
+     for table in tables:
+         try:
+             matrixized_table: MatrixizedTable = await matrixize_table(table, options=options)
+
+             matrixized_tables.append(matrixized_table)
+         except NoTableHeaderError:
+             await handle_maybe_async(options.on_table_no_header, table, matrixized_tables)
+         except NoTableBodyError:
+             await handle_maybe_async(options.on_table_no_body, table, matrixized_tables)
+
+     return matrixized_tables
@@ -0,0 +1,15 @@
+ from dataclasses import dataclass
+ from typing import Callable, Any, Never, Sequence
+
+ from bs4 import Tag
+
+ from .def_funcs import MatrixizedTable, on_table_no_header_def, on_table_no_body_def, on_multiply_table_headers_def, \
+     on_multiply_table_bodies_def
+
+
+ @dataclass(frozen=True, kw_only=True)
+ class ExtractorOptions:
+     on_table_no_header: Callable[[Tag, Sequence[MatrixizedTable]], Any] = on_table_no_header_def
+     on_table_no_body: Callable[[Tag, Sequence[MatrixizedTable]], Any] = on_table_no_body_def
+     on_multiply_table_headers: Callable[[Sequence[Tag]], Tag | Never] = on_multiply_table_headers_def
+     on_multiply_table_bodies: Callable[[Sequence[Tag]], Tag | Never] = on_multiply_table_bodies_def
@@ -0,0 +1,30 @@
+ import inspect
+ from typing import Coroutine, TypeVar, Awaitable, Callable
+
+ from aiohttp import ClientSession
+ from bs4 import BeautifulSoup
+
+
+ async def get_page_bs4(url: str) -> BeautifulSoup:
+     async with ClientSession() as session:
+         async with session.get(url) as response:
+             return BeautifulSoup(await response.text(), "html.parser")
+
+
+ T = TypeVar("T")
+
+
+ async def handle_maybe_async(func: Coroutine[..., ..., T] | Callable[..., T | Awaitable[T]] | T, *args, **kwargs) -> T:
+     if inspect.iscoroutinefunction(func):
+         return await func(*args, **kwargs)
+     if inspect.iscoroutine(func):
+         return await func
+     if not inspect.isfunction(func):
+         return func
+
+     result: T = func(*args, **kwargs)
+
+     if inspect.isawaitable(result):
+         return await result
+
+     return result
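Review note on `handle_maybe_async`: `inspect.isfunction` is true only for plain Python functions, so a callback passed as a `functools.partial` object or a bound method would be returned as-is rather than called. A simpler dispatcher, sketched below under the assumption that every callback is callable, calls the object first and awaits the result only when it is awaitable. The names `call_maybe_async`, `pick_first`, and `pick_last` are hypothetical and not part of the package.

```python
import asyncio
import functools
import inspect
from typing import Any, Callable


async def call_maybe_async(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Call func and await its result only if that result is awaitable."""
    result = func(*args, **kwargs)
    if inspect.isawaitable(result):
        return await result
    return result


def pick_first(items: list[str], *, offset: int = 0) -> str:
    """Synchronous callback."""
    return items[offset]


async def pick_last(items: list[str]) -> str:
    """Asynchronous callback."""
    await asyncio.sleep(0)
    return items[-1]


async def demo() -> tuple[str, str, str]:
    items = ["a", "b", "c"]
    sync_result = await call_maybe_async(pick_first, items)
    async_result = await call_maybe_async(pick_last, items)
    # functools.partial objects are callable but are not plain functions.
    partial_result = await call_maybe_async(functools.partial(pick_first, offset=1), items)
    return sync_result, async_result, partial_result


results = asyncio.run(demo())
print(results)
```

The trade-off is that this version no longer accepts a ready-made coroutine object or a plain value in place of a callback, which the released helper tolerates.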
@@ -0,0 +1,147 @@
1
+ Metadata-Version: 2.4
2
+ Name: sveden-table-matrixizer
3
+ Version: 1.0.0
4
+ Summary: Module for converting tables from HTML pages (/sveden) into matrices of bs4 tags
5
+ Author: BestTvGU
6
+ License: MIT
7
+ Requires-Python: >=3.11
8
+ Description-Content-Type: text/markdown
9
+ License-File: LICENSE
10
+ Requires-Dist: beautifulsoup4==4.14.3
11
+ Requires-Dist: aiohttp==3.13.5
12
+ Dynamic: license-file
13
+
14
+ # sveden-table-matrixizer
15
+
16
+ **Extract and matrixize HTML tables from Russian educational organization pages (`/sveden`) with full colspan/rowspan
17
+ support.**
18
+
19
+ `sveden-table-matrixizer` is an asynchronous Python library that parses tables found on `/sveden` (сведения об
20
+ образовательной организации) pages, expands `colspan` and `rowspan` attributes into clean two-dimensional matrices, and
21
+ returns structured `head` and `body` cell grids. It gives you full control over edge cases via configurable callbacks.
22
+
23
+ ## Features
24
+
25
+ - **Async page fetching** using `aiohttp`.
26
+ - **Full colspan/rowspan expansion** – cells spanning multiple rows and/or columns are duplicated into the matrix.
27
+ - **Separate head and body matrix extraction** – each table yields a `head` matrix (from `<thead>`) and a `body`
28
+ matrix (from `<tbody>`).
29
+ - **Customizable error handling** – replace default reactions to missing headers, missing bodies, or multiple `<thead>`/
30
+ `<tbody>` elements.
31
+ - **Lightweight** – only depends on `aiohttp` and `beautifulsoup4`.
32
+ - **Works with any HTML** – designed for `/sveden`, but usable on any page containing `<table>` elements.
33
+
34
+ ## Installation
35
+
36
+ ```bash
37
+ pip install sveden-table-matrixizer
38
+ ```
39
+
40
+ ## Quick Start
41
+
42
+ ```python
+ import asyncio
+ from sveden_table_matrixizer import matrixize_tables_from_page
+
+
+ async def main():
+     url = "https://example.edu/sveden/"
+     tables = await matrixize_tables_from_page(url)
+
+     for i, table in enumerate(tables):
+         print(f"Table {i + 1}:")
+         print(" Head rows:", len(table.head))
+         print(" Body rows:", len(table.body))
+         # Access cells as list[list[bs4.Tag]]
+
+
+ asyncio.run(main())
+ ```
60
+
61
+ Each `MatrixizedTable` contains:
62
+
63
+ - `head: list[list[Tag]]` – rows of header cells _(each row is a list of `bs4.Tag`)_.
64
+ - `body: list[list[Tag]]` – rows of body cells.
65
+
66
+ ## Handling Edge Cases
67
+
68
+ By default, a table that lacks a `<thead>` or `<tbody>` is skipped silently (the default callbacks are no-ops), while a
69
+ table with more than one `<thead>` / `<tbody>` raises an exception. You can override this behavior with `ExtractorOptions`.
70
+
71
+ ```python
+ import asyncio
+
+ from sveden_table_matrixizer import matrixize_tables_from_page
+ # ExtractorOptions is defined in the types module; the package __init__ does not re-export it.
+ from sveden_table_matrixizer.types import ExtractorOptions
+
+
+ def handle_missing_header(table_tag, collected_tables):
+     print(f"Skipping table without <thead>: {table_tag.get('id', 'no id')}")
+
+
+ opts = ExtractorOptions(
+     on_table_no_header=handle_missing_header,
+     # other callbacks can be set similarly
+ )
+
+
+ async def main():
+     url = "https://example.edu/sveden/"
+     tables = await matrixize_tables_from_page(url, options=opts)
+
+
+ asyncio.run(main())
+ ```
87
+
88
+ You can also supply async callbacks – the library automatically detects and awaits them.
89
+
90
+ ## API Reference
91
+
92
+ ### `matrixize_tables_from_page(url, *, options=None)`
93
+
94
+ - **Parameters:**
95
+ - `url` (`str`) – URL of the page to scrape.
96
+ - `options` (`ExtractorOptions`, optional) – configuration callbacks.
97
+ - **Returns:** `list[MatrixizedTable]` – extracted and matrixized tables.
98
+
99
+ ### `ExtractorOptions`
100
+
101
+ A frozen dataclass with the following fields (all optional):
102
+
103
+ | Field | Type | Default | Description |
104
+ |-----------------------------|---------------------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------|
105
+ | `on_table_no_header` | `Callable[[Tag, Sequence[MatrixizedTable]], Any]` | no‑op | Called when a table has no `<thead>` |
106
+ | `on_table_no_body` | `Callable[[Tag, Sequence[MatrixizedTable]], Any]` | no‑op | Called when a table has no `<tbody>` |
107
+ | `on_multiply_table_headers` | `Callable[[Sequence[Tag]], Tag \| Never]` | raises `MultiplyTableHeaderError` | Called when more than one `<thead>` is found; must return a single `<thead>` element. |
108
+ | `on_multiply_table_bodies` | `Callable[[Sequence[Tag]], Tag \| Never]` | raises `MultiplyTableBodyError` | Called when more than one `<tbody>` is found; must return a single `<tbody>` element. |
109
+
110
+ ### `MatrixizedTable`
111
+
112
+ ```python
+ @dataclass(frozen=True, kw_only=True)
+ class MatrixizedTable:
+     head: list[list[Tag]]  # matrix of <th> tags
+     body: list[list[Tag]]  # matrix of <td> tags
+ ```
118
+
119
+ ### Exceptions
120
+
121
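+ - `NoTableHeaderError` – raised internally when a table has no `<thead>`; caught by `matrixize_tables_from_page`, which then calls `options.on_table_no_header`.
121
+ - `NoTableBodyError` – raised internally when a table has no `<tbody>`; caught the same way and routed to `options.on_table_no_body`.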
123
+ - `MultiplyTableHeaderError` – default reaction to multiple `<thead>` elements.
124
+ - `MultiplyTableBodyError` – default reaction to multiple `<tbody>` elements.
125
+
126
+ All exceptions are exported from `sveden_table_matrixizer.errors`.
127
+
128
+ ## How It Works
129
+
130
+ 1. The page is fetched with `aiohttp` and parsed by BeautifulSoup.
131
+ 2. All `<table>` tags are collected.
132
+ 3. For each table:
133
+ - The `<thead>` is located; if missing or duplicate, the appropriate callback is invoked.
134
+ - The `<tbody>` is located similarly.
135
+ - Header rows are expanded: each `<th>` with `colspan`/`rowspan` is replicated into the correct cells of a 2D list.
136
+ - The same expansion is applied to body rows using `<td>` elements.
137
+ 4. A `MatrixizedTable(head=..., body=...)` is created and added to the result list.
138
+
139
+ ## Dependencies
140
+
141
+ - Python ≥ 3.11
142
+ - [aiohttp](https://pypi.org/project/aiohttp/)
143
+ - [beautifulsoup4](https://pypi.org/project/beautifulsoup4/)
144
+
145
+ ## License
146
+
147
+ This project is licensed under the MIT License – see the source repository for details.
@@ -0,0 +1,17 @@
1
+ LICENSE
2
+ README.md
3
+ pyproject.toml
4
+ sveden_table_matrixizer/__init__.py
5
+ sveden_table_matrixizer/def_funcs.py
6
+ sveden_table_matrixizer/errors.py
7
+ sveden_table_matrixizer/py.typed
8
+ sveden_table_matrixizer/table_body_extractor.py
9
+ sveden_table_matrixizer/table_head_extractor.py
10
+ sveden_table_matrixizer/tables_extractor.py
11
+ sveden_table_matrixizer/types.py
12
+ sveden_table_matrixizer/utils.py
13
+ sveden_table_matrixizer.egg-info/PKG-INFO
14
+ sveden_table_matrixizer.egg-info/SOURCES.txt
15
+ sveden_table_matrixizer.egg-info/dependency_links.txt
16
+ sveden_table_matrixizer.egg-info/requires.txt
17
+ sveden_table_matrixizer.egg-info/top_level.txt
@@ -0,0 +1,2 @@
1
+ beautifulsoup4==4.14.3
2
+ aiohttp==3.13.5
@@ -0,0 +1 @@
1
+ sveden_table_matrixizer