PyPI - justhtml - Versions diffs - 2.1.0__tar.gz → 2.2.0__tar.gz - Mend

justhtml 2.1.0__tar.gz → 2.2.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (146)

{justhtml-2.1.0 → justhtml-2.2.0}/CHANGELOG.md RENAMED

@@ -7,6 +7,24 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+## [2.2.0] - 2026-06-07
+### Fixed
+- Handle `<select><selectedcontent></selectedcontent></select>` without crashing when no `<option>` is present, replace selectedcontent fallback content during parser finalization, and avoid repeated selectedcontent subtree scans.
+- Preserve source order and tag text when escape-mode sanitization handles disallowed rawtext/RCDATA elements with attributed or self-closing end tags.
+- Make `stream()` use namespace-aware tokenizer context for SVG/MathML CDATA, rawtext decisions, self-closing foreign tags, and foreign end-tag stack updates.
+- Use the correct initial tokenizer states for HTML fragment contexts such as `<title>`, `<textarea>`, `<script>`, `<style>`, and scripting-disabled `<noscript>`.
+- Use HTML rawtext/RCDATA tokenizer states for text-like elements inside SVG/MathML HTML integration points and MathML text integration points.
+- Generate implied end tags before removing `<form>` on `</form>` so following controls do not remain inside still-open descendants.
+- Keep `<form>` elements inside `<template>` from claiming the global form pointer, including table-template form insertion.
+- Close open `<p>` elements correctly around `<option>`, `<optgroup>`, `<hr>`, `<p>`, and `<div>` starts in `<select>` parsing.
+- Close `<template>` correctly when `</template>` is seen while parsing inside `<select>`.
+- Keep `</p>` and `</br>` foreign-content breakouts inside MathML text integration points such as `<mi>` and `<mtext>`.
+- Align customizable `<select>` parsing with Chromium for phantom `</p>` handling and generic custom child elements.
+### Security
+- (Severity: Low) Strip invisible Unicode during URL sink validation even when general invisible-Unicode stripping is disabled. Previously, custom policies using `strip_invisible_unicode=False` could preserve scheme-obfuscated values such as `javascript\u200b:` in otherwise URL-validated attributes.
 ## [2.1.0] - 2026-06-06
 ### Performance

{justhtml-2.1.0 → justhtml-2.2.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: justhtml
-Version: 2.1.0
+Version: 2.2.0
 Summary: A pure Python HTML5 parser that just works.
 Project-URL: Homepage, https://github.com/emilstenstrom/justhtml
 Project-URL: Issues, https://github.com/emilstenstrom/justhtml/issues
@@ -72,6 +72,8 @@ Requires Python 3.10 or later.
 [Documentation](https://emilstenstrom.github.io/justhtml/) | [Comparison](docs/comparison.md) | [Playground](https://emilstenstrom.github.io/justhtml/playground/) | [Security policy](SECURITY.md)
+![JustHTML turns messy unsafe HTML into a sanitized, queryable DOM, then serializes it to text, Markdown, or HTML.](assets/justhtml-readme-explainer.png)
 ## Why Use It?
 Most Python HTML libraries optimize for one part of the problem.

{justhtml-2.1.0 → justhtml-2.2.0}/README.md RENAMED

@@ -18,6 +18,8 @@ Requires Python 3.10 or later.
 [Documentation](https://emilstenstrom.github.io/justhtml/) | [Comparison](docs/comparison.md) | [Playground](https://emilstenstrom.github.io/justhtml/playground/) | [Security policy](SECURITY.md)
+![JustHTML turns messy unsafe HTML into a sanitized, queryable DOM, then serializes it to text, Markdown, or HTML.](assets/justhtml-readme-explainer.png)
 ## Why Use It?
 Most Python HTML libraries optimize for one part of the problem.

justhtml-2.2.0/assets/justhtml-readme-explainer.png ADDED

Binary file

{justhtml-2.1.0 → justhtml-2.2.0}/pyproject.toml RENAMED

@@ -1,7 +1,7 @@
 [project]
 name = "justhtml"
 authors = [{ name = "Emil Stenström", email = "emil@emilstenstrom.se" }]
-version = "2.1.0"
+version = "2.2.0"
 description = "A pure Python HTML5 parser that just works."
 readme = "README.md"
 license = { file = "LICENSE" }

{justhtml-2.1.0 → justhtml-2.2.0}/src/justhtml/parser/__init__.py RENAMED

@@ -163,14 +163,17 @@ class JustHTML:
         if needs_escape_incomplete_tags:
             opts.emit_bogus_markup_as_text = True
-        # For RAWTEXT fragment contexts, set initial tokenizer state and rawtext tag
-        if fragment_context and not fragment_context.namespace:
-            rawtext_elements = {"textarea", "title", "style"}
+        # For text-like HTML fragment contexts, set the initial tokenizer state
+        # to match the context element.
+        if fragment_context and fragment_context.namespace in {None, "html"}:
             tag_name = fragment_context.tag_name.lower()
-            if tag_name in rawtext_elements:
+            if tag_name in {"textarea", "title"}:
+                opts.initial_state = Tokenizer.RCDATA
+            elif tag_name in {"iframe", "noembed", "noframes", "script", "style", "xmp"} or (
+                tag_name == "noscript" and opts.scripting_enabled
+            ):
                 opts.initial_state = Tokenizer.RAWTEXT
-                opts.initial_rawtext_tag = tag_name
-            elif tag_name in ("plaintext", "script"):
+            elif tag_name == "plaintext":
                 opts.initial_state = Tokenizer.PLAINTEXT
         self.tokenizer = Tokenizer(
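
The new fragment-context branch in the hunk above can be read as a small lookup from context element to initial tokenizer state. The following is a minimal standalone sketch of that mapping, not the library's code: the function name `initial_state_for_fragment` and the string state labels are hypothetical stand-ins for the `Tokenizer.RCDATA`, `Tokenizer.RAWTEXT`, and `Tokenizer.PLAINTEXT` constants, and the `"DATA"` default is an assumption for the case where the diff leaves `opts.initial_state` untouched.

```python
from __future__ import annotations

# Tag groups copied from the 2.2.0 side of the hunk above.
RCDATA_CONTEXTS = frozenset({"textarea", "title"})
RAWTEXT_CONTEXTS = frozenset({"iframe", "noembed", "noframes", "script", "style", "xmp"})


def initial_state_for_fragment(
    tag_name: str,
    namespace: str | None = None,
    *,
    scripting_enabled: bool = False,
) -> str:
    """Return a label for the initial tokenizer state of a fragment context.

    Hypothetical helper: the labels are plain strings, and "DATA" is an
    assumed default for contexts the diff does not special-case.
    """
    # Only HTML-namespace context elements get a text-like initial state.
    if namespace not in {None, "html"}:
        return "DATA"
    name = tag_name.lower()
    if name in RCDATA_CONTEXTS:
        return "RCDATA"
    # <noscript> is raw text only when scripting is enabled.
    if name in RAWTEXT_CONTEXTS or (name == "noscript" and scripting_enabled):
        return "RAWTEXT"
    if name == "plaintext":
        return "PLAINTEXT"
    return "DATA"
```

Compared with 2.1.0, `<title>` and `<textarea>` move from RAWTEXT to RCDATA, `<script>` moves from PLAINTEXT to RAWTEXT, and an explicit `"html"` namespace is now accepted alongside `None`.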

justhtml-2.2.0/src/justhtml/parser/stream.py ADDED

@@ -0,0 +1,206 @@
+from __future__ import annotations
+from typing import TYPE_CHECKING, Any, Literal, TypeAlias, cast
+if TYPE_CHECKING:
+    from collections.abc import Generator
+from justhtml.core.constants import (
+    FOREIGN_BREAKOUT_ELEMENTS,
+    HTML_INTEGRATION_POINT_SET,
+    MATHML_TEXT_INTEGRATION_POINT_SET,
+    SVG_TAG_NAME_ADJUSTMENTS,
+)
+from justhtml.tokenizer import Tokenizer
+from justhtml.tokenizer.tokens import CommentToken, DoctypeToken, Tag
+from .encoding import decode_html
+StartEvent: TypeAlias = tuple[Literal["start"], tuple[str, dict[str, str | None]]]
+EndEvent: TypeAlias = tuple[Literal["end"], str]
+TextEvent: TypeAlias = tuple[Literal["text"], str]
+CommentEvent: TypeAlias = tuple[Literal["comment"], str]
+DoctypeEvent: TypeAlias = tuple[Literal["doctype"], tuple[str | None, str | None, str | None]]
+StreamEvent: TypeAlias = StartEvent | EndEvent | TextEvent | CommentEvent | DoctypeEvent
+class _DummyNode:
+    __slots__ = ("attrs", "name", "namespace")
+    attrs: dict[str, str | None]
+    name: str
+    namespace: str
+    def __init__(self, name: str, namespace: str, attrs: dict[str, str | None] | None = None) -> None:
+        self.attrs = attrs or {}
+        self.name = name
+        self.namespace = namespace
+class StreamSink:
+    """A sink that buffers tokens for the stream API."""
+    tokens: list[StreamEvent]
+    open_elements: list[_DummyNode]
+    def __init__(self) -> None:
+        self.tokens = []
+        self.open_elements = []  # Required by tokenizer for rawtext checks
+    def _font_breaks_out_of_foreign_content(self, attrs: dict[str, str | None]) -> bool:
+        for name in attrs:
+            if name.lower() in {"color", "face", "size"}:
+                return True
+        return False
+    def _node_attribute_value(self, node: _DummyNode, name: str) -> str | None:
+        target = name.lower()
+        for attr_name, attr_value in node.attrs.items():
+            if attr_name.lower() == target:
+                return attr_value or ""
+        return None
+    def _is_html_integration_point(self, node: _DummyNode) -> bool:
+        if node.namespace == "math" and node.name == "annotation-xml":
+            encoding = self._node_attribute_value(node, "encoding")
+            return encoding is not None and encoding.lower() in {"application/xhtml+xml", "text/html"}
+        return (node.namespace, node.name) in HTML_INTEGRATION_POINT_SET
+    def _is_mathml_text_integration_point(self, node: _DummyNode) -> bool:
+        return (node.namespace, node.name) in MATHML_TEXT_INTEGRATION_POINT_SET
+    def _adjusted_name_for_namespace(self, name: str, namespace: str) -> str:
+        if namespace == "svg":
+            return SVG_TAG_NAME_ADJUSTMENTS.get(name, name)
+        return name
+    def _namespace_from_html_context(self, name: str) -> str:
+        if name == "svg":
+            return "svg"
+        if name == "math":
+            return "math"
+        return "html"
+    def _namespace_for_start_tag(self, token: Tag) -> str:
+        name = token.name
+        parent = self.open_elements[-1] if self.open_elements else None
+        parent_namespace = parent.namespace if parent is not None else "html"
+        if parent is not None:
+            if self._is_html_integration_point(parent):
+                return self._namespace_from_html_context(name)
+            if self._is_mathml_text_integration_point(parent) and name not in {"mglyph", "malignmark"}:
+                return self._namespace_from_html_context(name)
+            if parent_namespace == "math" and parent.name == "annotation-xml" and name == "svg":
+                return "svg"
+        if parent_namespace not in {None, "html"}:
+            breaks_out = name in FOREIGN_BREAKOUT_ELEMENTS or (
+                name == "font" and self._font_breaks_out_of_foreign_content(token.attrs)
+            )
+            if breaks_out:
+                while self.open_elements and self.open_elements[-1].namespace not in {None, "html"}:
+                    self.open_elements.pop()
+                parent_namespace = self.open_elements[-1].namespace if self.open_elements else "html"
+            else:
+                return parent_namespace
+        return self._namespace_from_html_context(name)
+    def _pop_foreign_context(self) -> None:
+        while self.open_elements and self.open_elements[-1].namespace not in {None, "html"}:
+            self.open_elements.pop()
+    def _pop_for_end_tag(self, name: str) -> None:
+        if not self.open_elements:
+            return
+        name_lower = name.lower()
+        current = self.open_elements[-1]
+        if current.namespace not in {None, "html"} and name_lower in {"br", "p"}:
+            self._pop_foreign_context()
+            return
+        for index in range(len(self.open_elements) - 1, -1, -1):
+            node = self.open_elements[index]
+            if node.name.lower() == name_lower:
+                del self.open_elements[index:]
+                return
+            if node.namespace in {None, "html"}:
+                break
+        self.open_elements.pop()
+    def process_token(self, token: Tag | CommentToken | DoctypeToken | Any) -> int:
+        # Tokenizer reuses token objects, so we must copy data
+        if isinstance(token, Tag):
+            # Copy tag data
+            if token.kind == Tag.START:
+                self.tokens.append(("start", (token.name, token.attrs.copy())))
+            else:
+                self.tokens.append(("end", token.name))
+            # Maintain open_elements stack for tokenizer rawtext/CDATA checks.
+            if token.kind == Tag.START:
+                namespace = self._namespace_for_start_tag(token)
+                if not (token.self_closing and namespace not in {None, "html"}):
+                    name = self._adjusted_name_for_namespace(token.name, namespace)
+                    self.open_elements.append(_DummyNode(name, namespace, token.attrs.copy()))
+            else:  # Tag.END
+                self._pop_for_end_tag(token.name)
+        elif isinstance(token, CommentToken):
+            self.tokens.append(("comment", token.data))
+        elif isinstance(token, DoctypeToken):
+            dt = token.doctype
+            self.tokens.append(("doctype", (dt.name, dt.public_id, dt.system_id)))
+        return 0  # TokenSinkResult.Continue
+    def process_characters(self, data: str) -> None:
+        """Handle character data from tokenizer."""
+        self.tokens.append(("text", data))
+def stream(
+    html: str | bytes | bytearray | memoryview,
+    *,
+    encoding: str | None = None,
+) -> Generator[StreamEvent, None, None]:
+    """
+    Stream HTML events from the given HTML string.
+    Yields tuples of (event_type, data).
+    """
+    html_str: str
+    if isinstance(html, (bytes, bytearray, memoryview)):
+        html_str, _ = decode_html(bytes(html), transport_encoding=encoding)
+    else:
+        html_str = html
+    sink = StreamSink()
+    tokenizer = Tokenizer(sink)
+    tokenizer.initialize(html_str)
+    while True:
+        # Run one step of the tokenizer
+        is_eof = tokenizer.step()
+        # Yield any tokens produced by this step
+        if sink.tokens:
+            # Coalesce text tokens
+            text_buffer: list[str] = []
+            for event, data in sink.tokens:
+                if event == "text":
+                    text_buffer.append(cast("str", data))
+                else:
+                    if text_buffer:
+                        yield ("text", "".join(text_buffer))
+                        text_buffer = []
+                    yield cast("StartEvent | EndEvent | CommentEvent | DoctypeEvent", (event, data))
+            if text_buffer:
+                yield ("text", "".join(text_buffer))
+            sink.tokens.clear()
+        if is_eof:
+            break
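
The text-coalescing loop at the end of `stream()` above can be isolated as a small generator. This is a minimal sketch under stated assumptions: `coalesce_text` and the loose `Event` alias are hypothetical names, and the function merges adjacent `"text"` events within one batch of events, which corresponds to the events buffered by a single tokenizer step in the file above.

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator
from typing import Any

# Loose stand-in for the StreamEvent alias defined in stream.py.
Event = tuple[str, Any]


def coalesce_text(events: Iterable[Event]) -> Iterator[Event]:
    """Merge runs of adjacent ("text", str) events into a single text event."""
    buffer: list[str] = []
    for kind, data in events:
        if kind == "text":
            buffer.append(data)
            continue
        # A non-text event ends the current run: flush it first to keep order.
        if buffer:
            yield ("text", "".join(buffer))
            buffer = []
        yield (kind, data)
    # Flush any text left at the end of the batch.
    if buffer:
        yield ("text", "".join(buffer))
```

Because the real loop flushes once per tokenizer step, text that spans two steps may still arrive as two separate events; callers that need one string per text node would have to merge across batches themselves.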

{justhtml-2.1.0 → justhtml-2.2.0}/src/justhtml/sanitizer/url/runtime.py RENAMED

@@ -220,6 +220,8 @@ def _sanitize_url_value_with_rule(
         if rewritten is None:
             return None
         v = _strip_invisible_unicode(rewritten)
+    else:
+        v = _strip_invisible_unicode(v)
     stripped = v.strip()
     if _URL_CONTROL_CHAR_REGEX.search(stripped):
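
The added `else` branch matters because a scheme check that sees the raw value can miss an obfuscated scheme. The following minimal sketch shows the effect in isolation. Everything in it is hypothetical: the invisible-character set is a small illustrative subset (the library's own list is not shown in this diff), and `is_allowed_url`, `ALLOWED_SCHEMES`, and the scheme regex are stand-ins rather than justhtml's validation logic.

```python
import re

# Illustrative subset of invisible code points (assumption, not justhtml's list):
# zero-width space, non-joiner, joiner, word joiner, byte order mark.
_INVISIBLE = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")
_SCHEME = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):")
ALLOWED_SCHEMES = frozenset({"http", "https", "mailto"})


def strip_invisible(value: str) -> str:
    """Remove the invisible code points listed above."""
    return _INVISIBLE.sub("", value)


def is_allowed_url(value: str, *, strip_first: bool) -> bool:
    """Toy scheme allow-list check, optionally stripping invisibles first."""
    candidate = strip_invisible(value) if strip_first else value
    match = _SCHEME.match(candidate.strip())
    if match is None:
        # No recognizable scheme: this sketch treats the value as a relative URL.
        return True
    return match.group(1).lower() in ALLOWED_SCHEMES
```

With `strip_first=False`, the value `"javascript\u200b:alert(1)"` has no recognizable scheme and passes as if it were relative; with `strip_first=True` the `javascript` scheme becomes visible and is rejected. This is the class of bypass the changelog's Security entry describes.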

{justhtml-2.1.0 → justhtml-2.2.0}/src/justhtml/serializer/html.py RENAMED

@@ -26,7 +26,7 @@ if TYPE_CHECKING:
 # Note: This matches the logic of the previous loop-based implementation.
 # It checks for space characters, quotes, equals sign, and greater-than.
 _UNQUOTED_ATTR_VALUE_INVALID = re.compile(r'[ \t\n\f\r"\'=>]')
-_LITERAL_TEXT_SERIALIZATION_ELEMENTS = frozenset({"script", "style"})
+_LITERAL_TEXT_SERIALIZATION_ELEMENTS = frozenset({"plaintext", "script", "style"})
 _SERIALIZABLE_TAG_NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9:_-]*$")
 _SERIALIZABLE_ATTR_NAME_RE = re.compile(r"^[A-Za-z_:][A-Za-z0-9:._-]*$")
@@ -101,6 +101,8 @@ def _serialize_text_for_parent(text: str | None, parent_name: str | None) -> str
     if parent_name is not None:
         normalized_parent_name = parent_name if parent_name.islower() else parent_name.lower()
         if normalized_parent_name in _LITERAL_TEXT_SERIALIZATION_ELEMENTS:
+            if normalized_parent_name == "plaintext":
+                return text
             return _neutralize_rawtext_end_tag_sequences(text, normalized_parent_name)
     return _escape_text(text)

{justhtml-2.1.0 → justhtml-2.2.0}/src/justhtml/tokenizer/html.py RENAMED

@@ -7,6 +7,7 @@ from typing import TYPE_CHECKING, Any
 if TYPE_CHECKING:
     from collections.abc import Callable
+from justhtml.core.constants import HTML_INTEGRATION_POINT_SET, MATHML_TEXT_INTEGRATION_POINT_SET
 from justhtml.core.entities import decode_entities_in_text
 from justhtml.core.errors import generate_error_message
@@ -35,6 +36,7 @@ _ATTR_VALUE_UNQUOTED_FAST_BAD_PATTERN = re.compile(r"""[\x00"'<=`]""")
 _TAG_NAME_RUN_PATTERN = re.compile(r"[^\t\n\f />\0]+")
 _ATTR_NAME_RUN_PATTERN = re.compile(r"[^\t\n\f />=\0\"'<]+")
 _COMMENT_RUN_PATTERN = re.compile(r"[^-\0]+")
+_HTML_INTEGRATION_POINT_ENCODINGS = {"application/xhtml+xml", "text/html"}
 # XML Coercion Regex
 _xml_invalid_single_chars = []
@@ -2230,10 +2232,7 @@ class Tokenizer:
                 or (name == "noscript" and self.opts.scripting_enabled)
             )
             if needs_rawtext_check:
-                stack = self.sink.open_elements
-                current_node = stack[-1] if stack else None
-                namespace = current_node.namespace if current_node else None
-                if namespace is None or namespace == "html":
+                if self._current_node_uses_html_text_parsing():
                     if name in _RCDATA_ELEMENTS:
                         self.state = self.RCDATA
                         self.rawtext_tag_name = name
@@ -2260,6 +2259,37 @@ class Tokenizer:
         self.current_tag_kind = Tag.START
         return switched_to_rawtext
+    def _current_node_uses_html_text_parsing(self) -> bool:
+        stack = self.sink.open_elements
+        current_node = stack[-1] if stack else None
+        if current_node is None:
+            return True
+        namespace = current_node.namespace
+        if namespace is None or namespace == "html":
+            return True
+        node_name = current_node.name
+        if (namespace, node_name) in MATHML_TEXT_INTEGRATION_POINT_SET:
+            return True
+        if namespace == "math" and node_name == "annotation-xml":
+            encoding = self._node_attribute_value(current_node, "encoding")
+            return encoding is not None and encoding.lower() in _HTML_INTEGRATION_POINT_ENCODINGS
+        return (namespace, node_name) in HTML_INTEGRATION_POINT_SET
+    def _node_attribute_value(self, node: Any, name: str) -> str | None:
+        attrs = node.attrs
+        if not attrs:
+            return None
+        target = name.lower()
+        for attr_name, attr_value in attrs.items():
+            if attr_name.lower() == target:
+                return attr_value or ""
+        return None
     def _emit_incomplete_tag_as_text(self) -> None:
         if not self.opts.emit_bogus_markup_as_text:
             return
@@ -2528,6 +2558,7 @@ class Tokenizer:
             else:
                 # lt_index == pos - the only remaining possibility
                 # Less-than sign - might be start of end tag
+                self.current_token_start_pos = pos
                 pos += 1
                 self.pos = pos
                 self.state = self.RCDATA_LESS_THAN_SIGN
@@ -2570,6 +2601,9 @@ class Tokenizer:
                 if c == ">":
                     attrs: dict[str, str | None] = {}
                     tag = Tag(Tag.END, tag_name, attrs, False)
+                    if self.track_tag_positions:
+                        tag.start_pos = self.current_token_start_pos
+                        tag.end_pos = self.pos
                     self._flush_text()
                     self._emit_token(tag)
                     self.state = self.DATA
@@ -2578,6 +2612,7 @@ class Tokenizer:
                     return False
                 if c in (" ", "\t", "\n", "\r", "\f"):
                     # Whitespace after tag name - switch to BEFORE_ATTRIBUTE_NAME
+                    self._flush_text()
                     self.current_tag_kind = Tag.END
                     self.current_tag_attrs = {}
                     self.state = self.BEFORE_ATTRIBUTE_NAME
@@ -2647,6 +2682,7 @@ class Tokenizer:
             if lt_index > pos:
                 chunk = buffer[pos:lt_index]
                 self._append_text(chunk)
+            self.current_token_start_pos = lt_index
             pos = lt_index + 1
             self.pos = pos
             # Handle script escaped transition before treating '<' as markup boundary
@@ -2701,6 +2737,9 @@ class Tokenizer:
                 if c == ">":
                     attrs: dict[str, str | None] = {}
                     tag = Tag(Tag.END, tag_name, attrs, False)
+                    if self.track_tag_positions:
+                        tag.start_pos = self.current_token_start_pos
+                        tag.end_pos = self.pos
                     self._flush_text()
                     self._emit_token(tag)
                     self.state = self.DATA
@@ -2709,6 +2748,7 @@ class Tokenizer:
                     return False
                 if c in (" ", "\t", "\n", "\r", "\f"):
                     # Whitespace after tag name - switch to BEFORE_ATTRIBUTE_NAME
+                    self._flush_text()
                     self.current_tag_kind = Tag.END
                     self.current_tag_attrs = {}
                     self.state = self.BEFORE_ATTRIBUTE_NAME
@@ -2866,6 +2906,7 @@ class Tokenizer:
         if is_appropriate:
             if c in (" ", "\t", "\n", "\r", "\f"):
+                self._flush_text()
                 self.current_tag_kind = Tag.END
                 self.current_tag_attrs = {}
                 self.state = self.BEFORE_ATTRIBUTE_NAME
@@ -2880,6 +2921,9 @@ class Tokenizer:
                 self._flush_text()
                 attrs: dict[str, str | None] = {}
                 tag = Tag(Tag.END, tag_name, attrs, False)
+                if self.track_tag_positions:
+                    tag.start_pos = self.current_token_start_pos
+                    tag.end_pos = self.pos
                 self._emit_token(tag)
                 self.state = self.DATA
                 self.rawtext_tag_name = None
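
The `_current_node_uses_html_text_parsing` helper added in this file decides whether a text-like start tag may switch the tokenizer into RCDATA or RAWTEXT. Below is a standalone sketch of the same decision. The `Node` dataclass and the function name are hypothetical, and the two integration-point sets are assumptions filled in from the HTML Standard's definitions, since the contents of `HTML_INTEGRATION_POINT_SET` and `MATHML_TEXT_INTEGRATION_POINT_SET` are not shown in this diff.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Assumed contents, following the HTML Standard's integration-point definitions.
HTML_INTEGRATION_POINTS = frozenset({("svg", "foreignObject"), ("svg", "desc"), ("svg", "title")})
MATHML_TEXT_INTEGRATION_POINTS = frozenset(("math", n) for n in ("mi", "mo", "mn", "ms", "mtext"))
HTML_ENCODINGS = frozenset({"application/xhtml+xml", "text/html"})


@dataclass
class Node:
    """Hypothetical stand-in for an entry on the open-elements stack."""

    name: str
    namespace: str | None = None
    attrs: dict[str, str | None] = field(default_factory=dict)


def uses_html_text_parsing(current: Node | None) -> bool:
    """Return True when HTML rawtext/RCDATA rules apply under `current`."""
    if current is None or current.namespace in {None, "html"}:
        return True
    key = (current.namespace, current.name)
    if key in MATHML_TEXT_INTEGRATION_POINTS:
        return True
    if key == ("math", "annotation-xml"):
        # Case-insensitive attribute lookup; a valueless attribute counts as "".
        encoding = next(
            (value or "" for attr, value in current.attrs.items() if attr.lower() == "encoding"),
            None,
        )
        return encoding is not None and encoding.lower() in HTML_ENCODINGS
    return key in HTML_INTEGRATION_POINTS
```

In 2.1.0 the check accepted only a current node in the HTML namespace, so a `<title>` or `<textarea>` nested in `<foreignObject>` or `<mtext>` did not get HTML text-mode tokenization.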

{justhtml-2.1.0 → justhtml-2.2.0}/src/justhtml/treebuilder/core.py RENAMED

@@ -2,7 +2,7 @@
 from __future__ import annotations
-from typing import TYPE_CHECKING, Any, cast
+from typing import TYPE_CHECKING, Any, NamedTuple, cast
 from justhtml.core.constants import (
     BUTTON_SCOPE_TERMINATORS,
@@ -48,6 +48,12 @@ if TYPE_CHECKING:
     from collections.abc import Callable
+class _SelectedContentWalkItem(NamedTuple):
+    node: Any
+    in_disabled_optgroup: bool
+    in_datalist: bool
 class TreeBuilder(TreeBuilderModesMixin):
     __slots__ = (
         "_body_end_handlers",
@@ -787,14 +793,11 @@ class TreeBuilder(TreeBuilderModesMixin):
             if name not in existing:
                 existing[name] = value
-    def _remove_from_open_elements(self, node: Any) -> bool:
-        for index, current in enumerate(self.open_elements):
-            if current is node:
-                self._maybe_mark_end_tag(current)
-                self._note_open_element_removed(current)
-                del self.open_elements[index]
-                return True
-        return False
+    def _remove_from_open_elements(self, node: Any) -> None:
+        index = self.open_elements.index(node)
+        self._maybe_mark_end_tag(node)
+        self._note_open_element_removed(node)
+        del self.open_elements[index]
     def _is_special_element(self, node: Any) -> bool:
         if node.namespace not in {None, "html"}:
@@ -852,6 +855,15 @@ class TreeBuilder(TreeBuilderModesMixin):
                 return True
         return False
+    def _has_detached_active_formatting_a(self) -> bool:
+        for index in range(len(self.active_formatting) - 1, -1, -1):
+            entry = self.active_formatting[index]
+            if entry is FORMAT_MARKER:
+                break
+            if entry["name"] == "a":
+                return entry["node"] not in self.open_elements
+        return False
     def _remove_last_active_formatting_by_name(self, name: str) -> None:
         for index in range(len(self.active_formatting) - 1, -1, -1):
             entry = self.active_formatting[index]
@@ -988,7 +1000,7 @@ class TreeBuilder(TreeBuilderModesMixin):
             self._append_text(data)
             return
-        if self.pending_table_text_should_error:
+        if self.pending_table_text_should_error and self.collect_errors:
             # html5lib reports one foster-parenting error per non-whitespace character.
             for ch in data:
                 if ch not in HTML_SPACE_CHARACTERS:
@@ -1169,7 +1181,7 @@ class TreeBuilder(TreeBuilderModesMixin):
             node = self.open_elements[-1]
             if node.namespace in {None, "html"}:
                 return
-            if self._is_html_integration_point(node):
+            if self._is_html_integration_point(node) or self._is_mathml_text_integration_point(node):
                 return
             if self.fragment_context_element is not None and node is self.fragment_context_element:
                 return
@@ -1310,59 +1322,78 @@ class TreeBuilder(TreeBuilderModesMixin):
         Per HTML5 spec: selectedcontent mirrors the content of the selected option,
         or the first option if none is selected.
         """
-        # Find all select elements
-        selects: list[Any] = []
-        self._find_elements(root, "select", selects)
-        for select in selects:
-            # Find selectedcontent element in this select
-            selectedcontent = self._find_element(select, "selectedcontent")
-            if not selectedcontent:
-                continue
-            # Find all option elements
-            options: list[Any] = []
-            self._find_elements(select, "option", options)
-            # Find selected option or use first one
-            selected_option = None
-            for opt in options:
-                if opt.attrs:
-                    for attr_name in opt.attrs.keys():
-                        if attr_name == "selected":
-                            selected_option = opt
-                            break
-                if selected_option:
-                    break
-            if not selected_option:
-                selected_option = options[0]
-            # Clone content from selected option to selectedcontent
-            self._clone_children(selected_option, selectedcontent)
-    def _find_elements(self, node: Any, name: str, result: list[Any]) -> None:
-        """Find all elements with given name using iterative preorder traversal."""
-        stack: list[Any] = [node]
+        stack: list[Any] = [root]
         while stack:
             current = stack.pop()
-            if current.name == name:
-                result.append(current)
+            if current.name == "select":
+                self._populate_selectedcontent_for_select(current)
             if current.has_child_nodes():
                 stack.extend(reversed(current.children))
-    def _find_element(self, node: Any, name: str) -> Any | None:
-        """Find first element with given name using iterative preorder traversal."""
-        stack: list[Any] = [node]
+    def _populate_selectedcontent_for_select(self, select: Any) -> None:
+        selectedcontents: list[Any] = []
+        first_option = None
+        selected_option = None
+        is_multiple = select.attrs is not None and "multiple" in select.attrs
+        stack = [_SelectedContentWalkItem(select, in_disabled_optgroup=False, in_datalist=False)]
         while stack:
-            current = stack.pop()
-            if current.name == name:
-                return current
+            item = stack.pop()
+            current = item.node
+            attrs = getattr(current, "attrs", None)
+            name = current.name
+            if current is not select:
+                if name == "selectedcontent":
+                    selectedcontents.append(current)
+                if name == "option" and not item.in_datalist:
+                    if first_option is None and self._is_selectedcontent_fallback_option(
+                        attrs, item.in_disabled_optgroup
+                    ):
+                        first_option = current
+                    if attrs is not None and "selected" in attrs:
+                        if is_multiple:
+                            if selected_option is None:
+                                selected_option = current
+                        else:
+                            selected_option = current
             if current.has_child_nodes():
-                stack.extend(reversed(current.children))
-        return None
+                child_disabled_optgroup = item.in_disabled_optgroup or (
+                    name == "optgroup" and attrs is not None and "disabled" in attrs
+                )
+                child_in_datalist = item.in_datalist or name == "datalist"
+                stack.extend(
+                    _SelectedContentWalkItem(child, child_disabled_optgroup, child_in_datalist)
+                    for child in reversed(current.children)
+                )
+        if not selectedcontents:
+            return
+        source_option = selected_option or first_option
+        for selectedcontent in selectedcontents:
+            if source_option is not None and self._is_descendant_of(selectedcontent, source_option):
+                continue
+            children = selectedcontent.children
+            if children:
+                for child in children:
+                    child.parent = None
+                children.clear()
+            if source_option is not None:
+                self._clone_children(source_option, selectedcontent)
+    def _is_selectedcontent_fallback_option(self, attrs: Any, in_disabled_optgroup: bool) -> bool:
+        return not in_disabled_optgroup and (attrs is None or "disabled" not in attrs)
+    def _is_descendant_of(self, node: Any, ancestor: Any) -> bool:
+        parent = node.parent
+        while parent is not None:
+            if parent is ancestor:
+                return True
+            parent = parent.parent
+        return False
     def _clone_children(self, source: Any, target: Any) -> None:
         """Deep clone all children from source to target."""
@@ -1404,6 +1435,7 @@ class TreeBuilder(TreeBuilderModesMixin):
             if not data:
                 return TokenSinkResult.Continue
             if "\x00" in data:
+                self._parse_error("invalid-codepoint")
                 data = data.replace("\x00", "")
                 if not data:
                     return TokenSinkResult.Continue
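
The option-selection rules in `_populate_selectedcontent_for_select` above can be sketched apart from the tree builder. This is a minimal illustration with a hypothetical `Node` dataclass and a hypothetical `choose_source_option` function. It covers only the choice of source option (selected option, else first enabled option), not the clearing and cloning of `<selectedcontent>` children.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal DOM node for this sketch."""

    name: str
    attrs: dict[str, str | None] = field(default_factory=dict)
    children: list[Node] = field(default_factory=list)


def choose_source_option(select: Node) -> Node | None:
    """Pick the <option> whose content <selectedcontent> should mirror."""
    is_multiple = "multiple" in select.attrs
    first_option: Node | None = None
    selected_option: Node | None = None
    # Preorder walk; each entry carries (node, in_disabled_optgroup, in_datalist).
    stack = [(child, False, False) for child in reversed(select.children)]
    while stack:
        node, in_disabled_optgroup, in_datalist = stack.pop()
        if node.name == "option" and not in_datalist:
            enabled = not in_disabled_optgroup and "disabled" not in node.attrs
            if first_option is None and enabled:
                first_option = node
            # Single select: the last selected option wins; multiple: the first.
            if "selected" in node.attrs and not (is_multiple and selected_option is not None):
                selected_option = node
        child_disabled = in_disabled_optgroup or (
            node.name == "optgroup" and "disabled" in node.attrs
        )
        child_datalist = in_datalist or node.name == "datalist"
        stack.extend((c, child_disabled, child_datalist) for c in reversed(node.children))
    return selected_option or first_option
```

A `None` result corresponds to the case fixed in this release: a `<select>` containing `<selectedcontent>` but no usable `<option>`, where 2.1.0 indexed `options[0]` on an empty list.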
