sveden-table-matrixizer 1.0.0__tar.gz
- sveden_table_matrixizer-1.0.0/LICENSE +21 -0
- sveden_table_matrixizer-1.0.0/PKG-INFO +147 -0
- sveden_table_matrixizer-1.0.0/README.md +134 -0
- sveden_table_matrixizer-1.0.0/pyproject.toml +26 -0
- sveden_table_matrixizer-1.0.0/setup.cfg +4 -0
- sveden_table_matrixizer-1.0.0/sveden_table_matrixizer/__init__.py +3 -0
- sveden_table_matrixizer-1.0.0/sveden_table_matrixizer/def_funcs.py +28 -0
- sveden_table_matrixizer-1.0.0/sveden_table_matrixizer/errors.py +14 -0
- sveden_table_matrixizer-1.0.0/sveden_table_matrixizer/py.typed +0 -0
- sveden_table_matrixizer-1.0.0/sveden_table_matrixizer/table_body_extractor.py +50 -0
- sveden_table_matrixizer-1.0.0/sveden_table_matrixizer/table_head_extractor.py +50 -0
- sveden_table_matrixizer-1.0.0/sveden_table_matrixizer/tables_extractor.py +48 -0
- sveden_table_matrixizer-1.0.0/sveden_table_matrixizer/types.py +15 -0
- sveden_table_matrixizer-1.0.0/sveden_table_matrixizer/utils.py +30 -0
- sveden_table_matrixizer-1.0.0/sveden_table_matrixizer.egg-info/PKG-INFO +147 -0
- sveden_table_matrixizer-1.0.0/sveden_table_matrixizer.egg-info/SOURCES.txt +17 -0
- sveden_table_matrixizer-1.0.0/sveden_table_matrixizer.egg-info/dependency_links.txt +1 -0
- sveden_table_matrixizer-1.0.0/sveden_table_matrixizer.egg-info/requires.txt +2 -0
- sveden_table_matrixizer-1.0.0/sveden_table_matrixizer.egg-info/top_level.txt +1 -0
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Gorshipisk
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,147 @@
+Metadata-Version: 2.4
+Name: sveden-table-matrixizer
+Version: 1.0.0
+Summary: Module for convert tables from HTML-pages (/sveden) to matrix of bs4 tags
+Author: BestTvGU
+License: MIT
+Requires-Python: >=3.11
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: beautifulsoup4==4.14.3
+Requires-Dist: aiohttp==3.13.5
+Dynamic: license-file
+
+# sveden-table-matrixizer
+
+**Extract and matrixize HTML tables from Russian educational organization pages (`/sveden`) with full colspan/rowspan
+support.**
+
+`sveden-table-matrixizer` is an asynchronous Python library that parses tables found on `/sveden` (сведения об
+образовательной организации) pages, expands `colspan` and `rowspan` attributes into clean two-dimensional matrices, and
+returns structured `head` and `body` cell grids. It gives you full control over edge cases via configurable callbacks.
+
+## Features
+
+- **Async page fetching** using `aiohttp`.
+- **Full colspan/rowspan expansion** – cells spanning multiple rows and/or columns are duplicated into the matrix.
+- **Separate head and body matrix extraction** – each table yields a `head` matrix (from `<thead>`) and a `body`
+  matrix (from `<tbody>`).
+- **Customizable error handling** – replace default reactions to missing headers, missing bodies, or multiple `<thead>`/
+  `<tbody>` elements.
+- **Lightweight** – only depends on `aiohttp` and `beautifulsoup4`.
+- **Works with any HTML** – designed for `/sveden`, but usable on any page containing `<table>` elements.
+
+## Installation
+
+```bash
+pip install sveden-table-matrixizer
+```
+
+## Quick Start
+
+```python
+import asyncio
+from sveden_table_matrixizer import matrixize_tables_from_page
+
+
+async def main():
+    url = "https://example.edu/sveden/"
+    tables = await matrixize_tables_from_page(url)
+
+    for i, table in enumerate(tables):
+        print(f"Table {i + 1}:")
+        print(" Head rows:", len(table.head))
+        print(" Body rows:", len(table.body))
+        # Access cells as list[list[bs4.Tag]]
+
+
+asyncio.run(main())
+```
+
+Each `MatrixizedTable` contains:
+
+- `head: list[list[Tag]]` – rows of header cells _(each row is a list of `bs4.Tag`)_.
+- `body: list[list[Tag]]` – rows of body cells.
+
+## Handling Edge Cases
+
+By default, the extractor raises exceptions when a table lacks a header or body, or contains more than one `<thead>` /
+`<tbody>`. You can override this behavior with `ExtractorOptions`.
+
+```python
+from sveden_table_matrixizer import matrixize_tables_from_page, ExtractorOptions
+from sveden_table_matrixizer.def_funcs import MatrixizedTable
+
+
+def handle_missing_header(table_tag, collected_tables):
+    print(f"Skipping table without <thead>: {table_tag.get('id', 'no id')}")
+
+
+opts = ExtractorOptions(
+    on_table_no_header=handle_missing_header,
+    # other callbacks can be set similarly
+)
+
+tables = await matrixize_tables_from_page(url, options=opts)
+```
+
+You can also supply async callbacks – the library automatically detects and awaits them.
+
+## API Reference
+
+### `matrixize_tables_from_page(url, *, options=None)`
+
+- **Parameters:**
+    - `url` (`str`) – URL of the page to scrape.
+    - `options` (`ExtractorOptions`, optional) – configuration callbacks.
+- **Returns:** `list[MatrixizedTable]` – extracted and matrixized tables.
+
+### `ExtractorOptions`
+
+A frozen dataclass with the following fields (all optional):
+
+| Field | Type | Default | Description |
+|-----------------------------|---------------------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------|
+| `on_table_no_header` | `Callable[[Tag, Sequence[MatrixizedTable]], Any]` | no‑op | Called when a table has no `<thead>` |
+| `on_table_no_body` | `Callable[[Tag, Sequence[MatrixizedTable]], Any]` | no‑op | Called when a table has no `<tbody>` |
+| `on_multiply_table_headers` | `Callable[[Sequence[Tag]], Tag \| Never]` | raises `MultiplyTableHeaderError` | Called when more than one `<thead>` is found; must return a single `<thead>` element. |
+| `on_multiply_table_bodies` | `Callable[[Sequence[Tag]], Tag \| Never]` | raises `MultiplyTableBodyError` | Called when more than one `<tbody>` is found; must return a single `<tbody>` element. |
+
+### `MatrixizedTable`
+
+```python
+@dataclass(frozen=True, kw_only=True)
+class MatrixizedTable:
+    head: list[list[Tag]]  # matrix of <th> tags
+    body: list[list[Tag]]  # matrix of <td> tags
+```
+
+### Exceptions
+
+- `NoTableHeaderError` – raised when `options.on_table_no_header` is not overridden.
+- `NoTableBodyError` – raised when `options.on_table_no_body` is not overridden.
+- `MultiplyTableHeaderError` – default reaction to multiple `<thead>` elements.
+- `MultiplyTableBodyError` – default reaction to multiple `<tbody>` elements.
+
+All exceptions are exported from `sveden_table_matrixizer.errors`.
+
+## How It Works
+
+1. The page is fetched with `aiohttp` and parsed by BeautifulSoup.
+2. All `<table>` tags are collected.
+3. For each table:
+    - The `<thead>` is located; if missing or duplicate, the appropriate callback is invoked.
+    - The `<tbody>` is located similarly.
+    - Header rows are expanded: each `<th>` with `colspan`/`rowspan` is replicated into the correct cells of a 2D list.
+    - The same expansion is applied to body rows using `<td>` elements.
+4. A `MatrixizedTable(head=..., body=...)` is created and added to the result list.
+
+## Dependencies
+
+- Python ≥ 3.11
+- [aiohttp](https://pypi.org/project/aiohttp/)
+- [beautifulsoup4](https://pypi.org/project/beautifulsoup4/)
+
+## License
+
+This project is licensed under the MIT License – see the source repository for details.
@@ -0,0 +1,134 @@
+# sveden-table-matrixizer
+
+**Extract and matrixize HTML tables from Russian educational organization pages (`/sveden`) with full colspan/rowspan
+support.**
+
+`sveden-table-matrixizer` is an asynchronous Python library that parses tables found on `/sveden` (сведения об
+образовательной организации) pages, expands `colspan` and `rowspan` attributes into clean two-dimensional matrices, and
+returns structured `head` and `body` cell grids. It gives you full control over edge cases via configurable callbacks.
+
+## Features
+
+- **Async page fetching** using `aiohttp`.
+- **Full colspan/rowspan expansion** – cells spanning multiple rows and/or columns are duplicated into the matrix.
+- **Separate head and body matrix extraction** – each table yields a `head` matrix (from `<thead>`) and a `body`
+  matrix (from `<tbody>`).
+- **Customizable error handling** – replace default reactions to missing headers, missing bodies, or multiple `<thead>`/
+  `<tbody>` elements.
+- **Lightweight** – only depends on `aiohttp` and `beautifulsoup4`.
+- **Works with any HTML** – designed for `/sveden`, but usable on any page containing `<table>` elements.
+
+## Installation
+
+```bash
+pip install sveden-table-matrixizer
+```
+
+## Quick Start
+
+```python
+import asyncio
+from sveden_table_matrixizer import matrixize_tables_from_page
+
+
+async def main():
+    url = "https://example.edu/sveden/"
+    tables = await matrixize_tables_from_page(url)
+
+    for i, table in enumerate(tables):
+        print(f"Table {i + 1}:")
+        print(" Head rows:", len(table.head))
+        print(" Body rows:", len(table.body))
+        # Access cells as list[list[bs4.Tag]]
+
+
+asyncio.run(main())
+```
+
+Each `MatrixizedTable` contains:
+
+- `head: list[list[Tag]]` – rows of header cells _(each row is a list of `bs4.Tag`)_.
+- `body: list[list[Tag]]` – rows of body cells.
+
+## Handling Edge Cases
+
+By default, the extractor raises exceptions when a table lacks a header or body, or contains more than one `<thead>` /
+`<tbody>`. You can override this behavior with `ExtractorOptions`.
+
+```python
+from sveden_table_matrixizer import matrixize_tables_from_page, ExtractorOptions
+from sveden_table_matrixizer.def_funcs import MatrixizedTable
+
+
+def handle_missing_header(table_tag, collected_tables):
+    print(f"Skipping table without <thead>: {table_tag.get('id', 'no id')}")
+
+
+opts = ExtractorOptions(
+    on_table_no_header=handle_missing_header,
+    # other callbacks can be set similarly
+)
+
+tables = await matrixize_tables_from_page(url, options=opts)
+```
+
+You can also supply async callbacks – the library automatically detects and awaits them.
+
+## API Reference
+
+### `matrixize_tables_from_page(url, *, options=None)`
+
+- **Parameters:**
+    - `url` (`str`) – URL of the page to scrape.
+    - `options` (`ExtractorOptions`, optional) – configuration callbacks.
+- **Returns:** `list[MatrixizedTable]` – extracted and matrixized tables.
+
+### `ExtractorOptions`
+
+A frozen dataclass with the following fields (all optional):
+
+| Field | Type | Default | Description |
+|-----------------------------|---------------------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------|
+| `on_table_no_header` | `Callable[[Tag, Sequence[MatrixizedTable]], Any]` | no‑op | Called when a table has no `<thead>` |
+| `on_table_no_body` | `Callable[[Tag, Sequence[MatrixizedTable]], Any]` | no‑op | Called when a table has no `<tbody>` |
+| `on_multiply_table_headers` | `Callable[[Sequence[Tag]], Tag \| Never]` | raises `MultiplyTableHeaderError` | Called when more than one `<thead>` is found; must return a single `<thead>` element. |
+| `on_multiply_table_bodies` | `Callable[[Sequence[Tag]], Tag \| Never]` | raises `MultiplyTableBodyError` | Called when more than one `<tbody>` is found; must return a single `<tbody>` element. |
+
+### `MatrixizedTable`
+
+```python
+@dataclass(frozen=True, kw_only=True)
+class MatrixizedTable:
+    head: list[list[Tag]]  # matrix of <th> tags
+    body: list[list[Tag]]  # matrix of <td> tags
+```
+
+### Exceptions
+
+- `NoTableHeaderError` – raised when `options.on_table_no_header` is not overridden.
+- `NoTableBodyError` – raised when `options.on_table_no_body` is not overridden.
+- `MultiplyTableHeaderError` – default reaction to multiple `<thead>` elements.
+- `MultiplyTableBodyError` – default reaction to multiple `<tbody>` elements.
+
+All exceptions are exported from `sveden_table_matrixizer.errors`.
+
+## How It Works
+
+1. The page is fetched with `aiohttp` and parsed by BeautifulSoup.
+2. All `<table>` tags are collected.
+3. For each table:
+    - The `<thead>` is located; if missing or duplicate, the appropriate callback is invoked.
+    - The `<tbody>` is located similarly.
+    - Header rows are expanded: each `<th>` with `colspan`/`rowspan` is replicated into the correct cells of a 2D list.
+    - The same expansion is applied to body rows using `<td>` elements.
+4. A `MatrixizedTable(head=..., body=...)` is created and added to the result list.
+
+## Dependencies
+
+- Python ≥ 3.11
+- [aiohttp](https://pypi.org/project/aiohttp/)
+- [beautifulsoup4](https://pypi.org/project/beautifulsoup4/)
+
+## License
+
+This project is licensed under the MIT License – see the source repository for details.
@@ -0,0 +1,26 @@
+[build-system]
+requires = ["setuptools"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "sveden-table-matrixizer"
+version = "1.0.0"
+description = "Module for convert tables from HTML-pages (/sveden) to matrix of bs4 tags"
+readme = "README.md"
+requires-python = ">=3.11"
+license = { text = "MIT" }
+
+authors = [
+    { name = "BestTvGU" }
+]
+
+dependencies = [
+    "beautifulsoup4==4.14.3",
+    "aiohttp==3.13.5"
+]
+
+[tool.setuptools]
+packages = ["sveden_table_matrixizer"]
+
+[tool.setuptools.package-data]
+besttvgu_backend = ["py.typed"]
@@ -0,0 +1,28 @@
+from dataclasses import dataclass
+from typing import Never
+
+from bs4 import Tag, ResultSet
+
+from .errors import MultiplyTableHeaderError, MultiplyTableBodyError
+
+
+@dataclass(frozen=True, kw_only=True)
+class MatrixizedTable:
+    head: list[list[Tag]]
+    body: list[list[Tag]]
+
+
+def on_table_no_header_def(_table: Tag, _matrixized_tables: list[MatrixizedTable]) -> None:
+    pass
+
+
+def on_table_no_body_def(_table: Tag, _matrixized_tables: list[MatrixizedTable]) -> None:
+    pass
+
+
+def on_multiply_table_headers_def(_heads: ResultSet[Tag]) -> Never:
+    raise MultiplyTableHeaderError
+
+
+def on_multiply_table_bodies_def(_bodies: ResultSet[Tag]) -> Never:
+    raise MultiplyTableBodyError
File without changes
@@ -0,0 +1,50 @@
+from typing import Generator
+
+from bs4 import Tag, ResultSet
+
+from .errors import NoTableBodyError
+from .types import ExtractorOptions
+from .utils import handle_maybe_async
+
+
+async def extract_table_body(table: Tag, *, options: ExtractorOptions) -> Tag:
+    bodies: ResultSet[Tag] = table.find_all("tbody", recursive=False)
+
+    if len(bodies) == 0:
+        raise NoTableBodyError
+    if len(bodies) > 1:
+        return await handle_maybe_async(options.on_multiply_table_bodies, bodies)
+
+    return bodies[0]
+
+
+def body_tr_generator(tr_tag: Tag) -> Generator[tuple[Tag, int, int], None, None]:
+    for td_tag in tr_tag.find_all("td", recursive=False):
+        td_colspan: int = int(td_tag.get("colspan") or 1)
+        td_rowspan: int = int(td_tag.get("rowspan") or 1)
+
+        yield td_tag, td_colspan, td_rowspan
+
+
+def body_table_generator(table_body: Tag) -> Generator[Generator[tuple[Tag, int, int], None, None], None, None]:
+    for tr_tag in table_body.find_all("tr", recursive=False):
+        yield body_tr_generator(tr_tag)
+
+
+async def extract_table_body_columns(table_body: Tag) -> list[list[Tag]]:
+    columns_matrix: list[list[Tag]] = []
+
+    for tr_ind, tr_info in enumerate(body_table_generator(table_body)):
+        for td_ind, (td_tag, td_colspan, td_rowspan) in enumerate(tr_info):
+            for row_span in range(td_rowspan):
+                row_ind: int = tr_ind + row_span
+
+                try:
+                    columns_matrix[row_ind]
+                except IndexError:
+                    columns_matrix.append([])
+
+                for column_span in range(td_colspan):
+                    columns_matrix[row_ind].append(td_tag)
+
+    return columns_matrix
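The expansion loop in `extract_table_body_columns` appends each cell to the end of every row it spans. As far as the code shown here goes, that means a cell following a `rowspan` from an earlier row can land in a different column than in the rendered table. The sketch below shows one position-aware alternative, assuming cells are plain `(label, colspan, rowspan)` tuples instead of `bs4.Tag` objects. The names `Cell` and `expand_spans` are hypothetical and not part of the package.

```python
from typing import Optional

# Hypothetical stand-in for a table cell: (label, colspan, rowspan).
Cell = tuple[str, int, int]


def expand_spans(rows: list[list[Cell]]) -> list[list[Optional[str]]]:
    """Expand colspan/rowspan into a rectangular grid, tracking column positions."""
    grid: dict[tuple[int, int], str] = {}

    for r, row in enumerate(rows):
        c = 0
        for label, colspan, rowspan in row:
            # Skip slots already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Copy the cell into every slot it covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = label
            c += colspan

    if not grid:
        return []

    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Slots no cell covers are filled with None so every row has equal length.
    return [[grid.get((r, c)) for c in range(n_cols)] for r in range(n_rows)]


rows = [[("A", 1, 1), ("B", 1, 2)], [("C", 1, 1)]]
matrix = expand_spans(rows)
```

For this input the second row comes out as `["C", "B"]`, whereas an append-only loop would produce `["B", "C"]`, because `B` is written into the second row before `C` is visited.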
@@ -0,0 +1,50 @@
+from typing import Generator
+
+from bs4 import Tag, ResultSet
+
+from .errors import NoTableHeaderError
+from .types import ExtractorOptions
+from .utils import handle_maybe_async
+
+
+async def extract_table_head(table: Tag, *, options: ExtractorOptions) -> Tag:
+    heads: ResultSet[Tag] = table.find_all("thead", recursive=False)
+
+    if len(heads) == 0:
+        raise NoTableHeaderError
+    if len(heads) > 1:
+        return await handle_maybe_async(options.on_multiply_table_headers, heads)
+
+    return heads[0]
+
+
+def head_tr_generator(tr_tag: Tag) -> Generator[tuple[Tag, int, int], None, None]:
+    for th_tag in tr_tag.find_all("th", recursive=False):
+        th_colspan: int = int(th_tag.get("colspan") or 1)
+        th_rowspan: int = int(th_tag.get("rowspan") or 1)
+
+        yield th_tag, th_colspan, th_rowspan
+
+
+def head_table_generator(table_head: Tag) -> Generator[Generator[tuple[Tag, int, int], None, None], None, None]:
+    for tr_tag in table_head.find_all("tr", recursive=False):
+        yield head_tr_generator(tr_tag)
+
+
+async def extract_table_head_columns(table_head: Tag) -> list[list[Tag]]:
+    columns_matrix: list[list[Tag]] = []
+
+    for tr_ind, tr_info in enumerate(head_table_generator(table_head)):
+        for th_ind, (th_tag, th_colspan, th_rowspan) in enumerate(tr_info):
+            for row_span in range(th_rowspan):
+                row_ind: int = tr_ind + row_span
+
+                try:
+                    columns_matrix[row_ind]
+                except IndexError:
+                    columns_matrix.append([])
+
+                for column_span in range(th_colspan):
+                    columns_matrix[row_ind].append(th_tag)
+
+    return columns_matrix
@@ -0,0 +1,48 @@
+from bs4 import BeautifulSoup, Tag, ResultSet
+
+from .def_funcs import MatrixizedTable
+from .errors import NoTableHeaderError, NoTableBodyError
+from .table_body_extractor import extract_table_body_columns, extract_table_body
+from .table_head_extractor import extract_table_head, extract_table_head_columns
+from .types import ExtractorOptions
+from .utils import get_page_bs4, handle_maybe_async
+
+
+async def matrixize_table(table: Tag, options: ExtractorOptions) -> MatrixizedTable:
+    head: Tag = await extract_table_head(table, options=options)
+    head_columns: list[list[Tag]] = await extract_table_head_columns(head)
+
+    body: Tag = await extract_table_body(table, options=options)
+    body_columns: list[list[Tag]] = await extract_table_body_columns(body)
+
+    return MatrixizedTable(
+        head=head_columns,
+        body=body_columns
+    )
+
+
+async def extract_tables(page_bs4: BeautifulSoup) -> ResultSet[Tag]:
+    tables: ResultSet[Tag] = page_bs4.find_all("table", recursive=False)
+
+    return tables
+
+
+async def matrixize_tables_from_page(url: str, *, options: ExtractorOptions | None = None) -> list[MatrixizedTable]:
+    if options is None:
+        options = ExtractorOptions()
+
+    page_bs4: BeautifulSoup = await get_page_bs4(url)
+    tables: ResultSet[Tag] = await extract_tables(page_bs4)
+
+    matrixized_tables: list[MatrixizedTable] = []
+    for table in tables:
+        try:
+            matrixized_table: MatrixizedTable = await matrixize_table(table, options=options)
+
+            matrixized_tables.append(matrixized_table)
+        except NoTableHeaderError:
+            await handle_maybe_async(options.on_table_no_header, table, matrixized_tables)
+        except NoTableBodyError:
+            await handle_maybe_async(options.on_table_no_body, table, matrixized_tables)
+
+    return matrixized_tables
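Reading the loop in `matrixize_tables_from_page`, a table without a `<thead>` seems to be caught inside the loop and handed to `on_table_no_header`. With the default no-op callback such a table would be skipped silently instead of raising to the caller. The sketch below reproduces only that control flow with stand-ins: a `dict` plays the role of a `<table>` tag, and `matrixize`, `collect` and `skipped` are hypothetical names, not package API.

```python
import asyncio


class NoTableHeaderError(Exception):
    """Stand-in for the package's exception of the same name."""


skipped: list[str] = []


async def matrixize(table: dict) -> dict:
    # Hypothetical stand-in for matrixize_table: fail when there is no header.
    if "head" not in table:
        raise NoTableHeaderError
    return {"head": table["head"], "body": table.get("body", [])}


async def collect(tables, on_no_header=lambda table, done: None):
    done = []
    for table in tables:
        try:
            done.append(await matrixize(table))
        except NoTableHeaderError:
            # The error never reaches the caller; the callback decides what happens.
            on_no_header(table, done)
    return done


tables = [{"head": [["h"]]}, {"id": "t2"}]
result = asyncio.run(collect(tables, on_no_header=lambda t, done: skipped.append(t["id"])))
silent = asyncio.run(collect([{"id": "t3"}]))  # default no-op: table is dropped
```

A callback that should abort the whole run can raise its own exception, which then propagates out of the loop.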
@@ -0,0 +1,15 @@
+from dataclasses import dataclass
+from typing import Callable, Any, Never, Sequence
+
+from bs4 import Tag
+
+from .def_funcs import MatrixizedTable, on_table_no_header_def, on_table_no_body_def, on_multiply_table_headers_def, \
+    on_multiply_table_bodies_def
+
+
+@dataclass(frozen=True, kw_only=True)
+class ExtractorOptions:
+    on_table_no_header: Callable[[Tag, Sequence[MatrixizedTable]], Any] = on_table_no_header_def
+    on_table_no_body: Callable[[Tag, Sequence[MatrixizedTable]], Any] = on_table_no_body_def
+    on_multiply_table_headers: Callable[[Sequence[Tag]], Tag | Never] = on_multiply_table_headers_def
+    on_multiply_table_bodies: Callable[[Sequence[Tag]], Tag | Never] = on_multiply_table_bodies_def
@@ -0,0 +1,30 @@
+import inspect
+from typing import Coroutine, TypeVar, Awaitable, Callable
+
+from aiohttp import ClientSession
+from bs4 import BeautifulSoup
+
+
+async def get_page_bs4(url: str) -> BeautifulSoup:
+    async with ClientSession() as session:
+        async with session.get(url) as response:
+            return BeautifulSoup(await response.text(), "html.parser")
+
+
+T = TypeVar("T")
+
+
+async def handle_maybe_async(func: Coroutine[..., ..., T] | Callable[..., T | Awaitable[T]] | T, *args, **kwargs) -> T:
+    if inspect.iscoroutinefunction(func):
+        return await func(*args, **kwargs)
+    if inspect.iscoroutine(func):
+        return await func
+    if not inspect.isfunction(func):
+        return func
+
+    result: T = func(*args, **kwargs)
+
+    if inspect.isawaitable(result):
+        return await result
+
+    return result
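`handle_maybe_async` uses `inspect.isfunction` to decide whether to call its argument. That check is true only for plain Python functions and lambdas, so a synchronous bound method or `functools.partial` passed as a callback would apparently be returned uncalled. A minimal sketch of a dispatcher based on `callable()` is shown below; `call_maybe_async`, `Picker` and the demo callbacks are hypothetical and not part of the package.

```python
import asyncio
import inspect
from typing import Any


async def call_maybe_async(func: Any, *args: Any, **kwargs: Any) -> Any:
    """Call func if it is callable and await the result when it is awaitable."""
    if not callable(func):
        # Plain values are passed through unchanged.
        return func

    result = func(*args, **kwargs)

    if inspect.isawaitable(result):
        return await result

    return result


def sync_cb(x: int) -> int:
    return x + 1


async def async_cb(x: int) -> int:
    return x * 2


class Picker:
    def first(self, items: list) -> Any:
        return items[0]


sync_result = asyncio.run(call_maybe_async(sync_cb, 1))
async_result = asyncio.run(call_maybe_async(async_cb, 2))
method_result = asyncio.run(call_maybe_async(Picker().first, ["a", "b"]))
```

Calling first and then checking `inspect.isawaitable` on the result covers coroutine functions, ordinary functions, bound methods and partials with one code path.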
@@ -0,0 +1,17 @@
+LICENSE
+README.md
+pyproject.toml
+sveden_table_matrixizer/__init__.py
+sveden_table_matrixizer/def_funcs.py
+sveden_table_matrixizer/errors.py
+sveden_table_matrixizer/py.typed
+sveden_table_matrixizer/table_body_extractor.py
+sveden_table_matrixizer/table_head_extractor.py
+sveden_table_matrixizer/tables_extractor.py
+sveden_table_matrixizer/types.py
+sveden_table_matrixizer/utils.py
+sveden_table_matrixizer.egg-info/PKG-INFO
+sveden_table_matrixizer.egg-info/SOURCES.txt
+sveden_table_matrixizer.egg-info/dependency_links.txt
+sveden_table_matrixizer.egg-info/requires.txt
+sveden_table_matrixizer.egg-info/top_level.txt
@@ -0,0 +1 @@
+
@@ -0,0 +1 @@
+sveden_table_matrixizer