selectolax.lexbor module¶

LexborHTMLParser¶

class selectolax.lexbor.LexborHTMLParser(html: str | bytes, is_fragment: bool = False, fragment_tag: str = 'div', fragment_namespace: str = 'html')¶

The lexbor HTML parser.

Use this class to parse raw HTML.

This parser mimics most of the HTMLParser API but does not inherit from it directly.

Parameters:

html (str or bytes)

any_css_matches(self, tuple selectors)¶

Return True if any of the specified CSS selectors match.

Parameters:

selectors (tuple[str]): CSS selectors to evaluate.

Returns:

bool: True when at least one selector matches.

body¶

Return document body.

Returns:

LexborNode or None: <body> element when present, otherwise None.

clone(self)¶

Clone the current document tree.

You can use it to make temporary modifications without affecting the original HTML tree. The clone is tied to the current parser instance and is destroyed when the parser instance is destroyed.

Returns:

LexborHTMLParser: A parser instance backed by a deep-copied document.

create_node(self, str tag)¶

Given an HTML tag name, e.g. “div”, create a single empty node for that tag, e.g. “<div></div>”.

Parameters:

tag (str): Name of the tag to create.

Returns:

LexborNode: Newly created element node.

Raises:

SelectolaxError: If the element cannot be created.

Examples

>>> parser = LexborHTMLParser("<div></div>")
>>> new_node = parser.create_node("span")
>>> new_node.tag_name
'span'
>>> parser.css_first("div").append_child(new_node)
>>> parser.html
'<html><head></head><body><div><span></span></div></body></html>'

css(self, str query)¶

A CSS selector.

Matches the pattern query against the HTML tree. See the CSS selectors reference.

Special selectors:

parser.css('p:lexbor-contains("awesome" i)'): case-insensitive contains

parser.css('p:lexbor-contains("awesome")'): case-sensitive contains

Parameters:

query (str): CSS selector (e.g. "div > :nth-child(2n+1):not(:has(a))").

Returns:

list of LexborNode: Nodes that match the query.

css_first(self, str query, default=None, strict=False)¶

Same as css but returns only the first match.

Parameters:

query (str): CSS selector to match.
default (Any, default None): Default value to return if there is no match.
strict (bool, default False): Set to True to check that there is strictly only one match in the document.

Returns:

LexborNode: The first matching node, or default if there is no match.

css_matches(self, str selector)¶

Return True if the document matches the selector at least once.

Parameters:

selector (str): CSS selector to test.

Returns:

bool: True when a match exists.

head¶

Return document head.

Returns:

LexborNode or None: <head> element when present, otherwise None.

html¶

Return HTML representation of the page.

Returns:

str or None: Serialized HTML of the current document.

html_pretty(self, Py_ssize_t indent=0, bool skip_ws_nodes=False, bool skip_comment=False, bool raw=False, bool without_closing=False, bool tag_with_ns=False, bool without_text_indent=False, bool full_doctype=False, bool html5test=False)¶

Return pretty-printed HTML representation of the page.

Parameters:

indent (int, optional): Initial indentation level passed to Lexbor. Defaults to 0.
skip_ws_nodes (bool, optional): Skip text nodes that contain only whitespace.
skip_comment (bool, optional): Exclude HTML comment nodes from the serialized output.
raw (bool, optional): Serialize text and attribute values without HTML escaping.
without_closing (bool, optional): Omit closing tags for non-void elements.
tag_with_ns (bool, optional): Include namespace prefixes in serialized tag names when available.
without_text_indent (bool, optional): Disable extra indentation added around text and comment content.
full_doctype (bool, optional): Serialize the full document type declaration when a doctype node is present.
html5test (bool, optional): Serialize using Lexbor’s HTML5 test formatting mode.

inner_html¶

LexborHTMLParser.inner_html: str

Return HTML representation of the child nodes.

Works similarly to innerHTML in JavaScript. Unlike the .html property, it does not include the current node. It can also be used to set HTML; see the setter docstring.

Returns:

text (str | None)

inner_html_pretty(self, Py_ssize_t indent=0, bool skip_ws_nodes=False, bool skip_comment=False, bool raw=False, bool without_closing=False, bool tag_with_ns=False, bool without_text_indent=False, bool full_doctype=False, bool html5test=False)¶

Return pretty-printed HTML representation of the child nodes.

Parameters:

indent (int, optional): Initial indentation level passed to Lexbor. Defaults to 0.
skip_ws_nodes (bool, optional): Skip text nodes that contain only whitespace.
skip_comment (bool, optional): Exclude HTML comment nodes from the serialized output.
raw (bool, optional): Serialize text and attribute values without HTML escaping.
without_closing (bool, optional): Omit closing tags for non-void elements.
tag_with_ns (bool, optional): Include namespace prefixes in serialized tag names when available.
without_text_indent (bool, optional): Disable extra indentation added around text and comment content.
full_doctype (bool, optional): Serialize the full document type declaration when a doctype node is present.
html5test (bool, optional): Serialize using Lexbor’s HTML5 test formatting mode.

merge_text_nodes(self)¶

Iterates over all text nodes and merges those that are adjacent to each other.

This is useful for text extraction. Use it when you need to strip HTML tags and merge “dangling” text.

Returns:

None

Examples

>>> tree = LexborHTMLParser("<div><p><strong>J</strong>ohn</p><p>Doe</p></div>")
>>> node = tree.css_first('div')
>>> tree.unwrap_tags(["strong"])
>>> # Text extraction produces an extra space because the strong tag was removed.
>>> tree.text(deep=True, separator=" ", strip=True)
'J ohn Doe'
>>> node.merge_text_nodes()
>>> tree.text(deep=True, separator=" ", strip=True)
'John Doe'

raw_html¶: raw_html: bytes

root¶

Return the document root node.

Returns:

LexborNode or None: Root of the parsed document, or None if unavailable.

script_srcs_contain(self, tuple queries)¶

Return True if any script src contains one of the strings.

Caches values on the first call to improve performance.

Parameters:

queries (tuple of str): Strings to look for inside src attributes.

Returns:

bool: True when a matching source value is found.

scripts_contain(self, str query)¶

Return True if any script tag contains the given text.

Caches script tags on the first call to improve performance.

Parameters:

query (str): Text to search for within script contents.

Returns:

bool: True when a matching script tag is found.

select(self, query=None)¶

Select nodes given a CSS selector.

Works similarly to the css method, but supports chained filtering and extra features.

Parameters:

query (str or None): The CSS selector to use when searching for nodes.

Returns:

LexborSelector or None: Selector bound to the root node, or None if the document is empty.

selector¶

Return a lazily created CSS selector helper.

Returns:

LexborCSSSelector: Selector instance bound to this parser.

strip_tags(self, list tags, bool recursive=False)¶

Remove specified tags from the node.

Parameters:

tags (list of str): List of tags to remove.
recursive (bool, default False): Whether to delete all child nodes.

Returns:

None

Examples

>>> tree = LexborHTMLParser('<html><head></head><body><script></script><div>Hello world!</div></body></html>')
>>> tags = ['head', 'style', 'script', 'xmp', 'iframe', 'noembed', 'noframes']
>>> tree.strip_tags(tags)
>>> tree.html
'<html><body><div>Hello world!</div></body></html>'

tags(self, str name)¶

Return all tags that match the provided name.

Parameters:

name (str): Tag name to search for (e.g., "div").

Returns:

list of LexborNode: Matching elements in document order.

Raises:

ValueError: If name is empty or longer than 100 characters.
SelectolaxError: If Lexbor cannot locate the elements.

text(self, deep: bool = True, separator: str = '', strip: bool = False, skip_empty: bool = False) → str¶

Returns the text of the node including text of all its child nodes.

Parameters:

strip (bool, default False): If True, calls str.strip() on each text part to remove extra whitespace.
separator (str, default ''): The separator to use when joining text from different nodes.
deep (bool, default True): If True, includes text from all child nodes.
skip_empty (bool, optional): Exclude text nodes whose content is only ASCII whitespace (space, tab, newline, form feed or carriage return) when True. Defaults to False.

Returns:

text (str): Combined textual content assembled according to the provided options.

unwrap_tags(self, list tags, delete_empty=False)¶

Unwraps specified tags from the HTML tree.

Works the same as the unwrap method, but applied to a list of tags.

Parameters:

tags (list): List of tags to remove.
delete_empty (bool): Whether to delete empty tags.

Returns:

None

Examples

>>> tree = LexborHTMLParser('<div><a href="">Hello</a> <i>world</i>!</div>')
>>> tree.body.unwrap_tags(['i','a'])
>>> tree.body.html
'<body><div>Hello world!</div></body>'

LexborNode¶

class selectolax.lexbor.LexborNode¶

A class that represents an HTML node (element).

any_css_matches(self, tuple selectors)¶: Returns True if any of the CSS selectors matches the node.

attributes¶

Get all attributes that belong to the current node.

The value of empty attributes is None.

Returns:

attributes (dict): Dictionary of all attributes.

Examples

>>> tree = LexborHTMLParser("<div data id='my_id'></div>")
>>> node = tree.css_first('div')
>>> node.attributes
{'data': None, 'id': 'my_id'}

attrs¶

A dict-like object that is similar to the attributes property, but operates directly on the Node data.

Warning

Use attributes instead, if you don’t want to modify Node attributes.

Returns:

attributes: Attributes mapping object.

Examples

>>> tree = LexborHTMLParser("<div id='a'></div>")
>>> node = tree.css_first('div')
>>> node.attrs
<div attributes, 1 items>
>>> node.attrs['id']
'a'
>>> node.attrs['foo'] = 'bar'
>>> del node.attrs['id']
>>> node.attributes
{'foo': 'bar'}
>>> node.attrs['id'] = 'new_id'
>>> node.html
'<div foo="bar" id="new_id"></div>'

child¶

Alias for the first_child property.

Deprecated. Please use first_child instead.

clone(self) → LexborNode¶

Clone the current node.

You can use it to make temporary modifications without affecting the original HTML tree.

The clone is tied to the current parser instance and is destroyed when the parser instance is destroyed.

comment_content¶

LexborNode.comment_content: str | None

Extract the textual content of an HTML comment node.

Returns:

str or None: Comment text with surrounding whitespace removed, or None if the current node is not a comment or the comment markup cannot be parsed.

Examples

>>> parse_fragment("<!-- hello -->")[0].comment_content
'hello'
>>> parse_fragment("<div>not a comment</div>")[0].comment_content is None
True

css(self, str query)¶

Evaluate CSS selector against current node and its child nodes.

Matches the pattern query against the HTML tree. See the CSS selectors reference.

Special selectors:

parser.css('p:lexbor-contains("awesome" i)'): case-insensitive contains

parser.css('p:lexbor-contains("awesome")'): case-sensitive contains

Parameters:

query (str): CSS selector (e.g. "div > :nth-child(2n+1):not(:has(a))").

Returns:

list of LexborNode: Nodes that match the query.

css_first(self, str query, default=None, bool strict=False)¶

Same as css but returns only the first match.

When strict=False, the search stops at the first match, which is faster.

Parameters:

query (str): CSS selector to match.
default (Any, default None): Default value to return if there is no match.
strict (bool, default False): Set to True to check that there is strictly only one match in the document.

Returns:

LexborNode: The first matching node, or default if there is no match.

css_matches(self, str selector)¶: Returns True if CSS selector matches a node.

decompose(self, bool recursive=True)¶

Remove the current node from the tree.

Parameters:

recursive (bool, default True): Whether to delete all child nodes.

Examples

>>> tree = LexborHTMLParser(html)
>>> for tag in tree.css('script'):
...     tag.decompose()

first_child¶: Return the first child node.

html¶

Return HTML representation of the current node including all its child nodes.

Returns:

text (str)

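html_pretty(self, Py_ssize_t indent=0, bool skip_ws_nodes=False, bool skip_comment=False, bool raw=False, bool without_closing=False, bool tag_with_ns=False, bool without_text_indent=False, bool full_doctype=False, bool html5test=False)¶

Return pretty-printed HTML for the current node.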

Parameters:

indent (int, optional): Initial indentation level passed to Lexbor. Defaults to 0.
skip_ws_nodes (bool, optional): Skip text nodes that contain only whitespace.
skip_comment (bool, optional): Exclude HTML comment nodes from the serialized output.
raw (bool, optional): Serialize text and attribute values without HTML escaping.
without_closing (bool, optional): Omit closing tags for non-void elements.
tag_with_ns (bool, optional): Include namespace prefixes in serialized tag names when available.
without_text_indent (bool, optional): Disable extra indentation added around text and comment content.
full_doctype (bool, optional): Serialize the full document type declaration when a doctype node is present.
html5test (bool, optional): Serialize using Lexbor’s HTML5 test formatting mode.

id¶

Get the id attribute of the node.

Returns None if the id attribute is not set.

Returns:

text (str or None)

inner_html¶

LexborNode.inner_html: str | None

Return HTML representation of the child nodes.

Works similarly to innerHTML in JavaScript. Unlike the .html property, it does not include the current node. It can also be used to set HTML; see the setter docstring.

Returns:

text (str | None)

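inner_html_pretty(self, Py_ssize_t indent=0, bool skip_ws_nodes=False, bool skip_comment=False, bool raw=False, bool without_closing=False, bool tag_with_ns=False, bool without_text_indent=False, bool full_doctype=False, bool html5test=False)¶

Return pretty-printed HTML representation of the child nodes.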

Parameters:

indent (int, optional): Initial indentation level passed to Lexbor. Defaults to 0.
skip_ws_nodes (bool, optional): Skip text nodes that contain only whitespace.
skip_comment (bool, optional): Exclude HTML comment nodes from the serialized output.
raw (bool, optional): Serialize text and attribute values without HTML escaping.
without_closing (bool, optional): Omit closing tags for non-void elements.
tag_with_ns (bool, optional): Include namespace prefixes in serialized tag names when available.
without_text_indent (bool, optional): Disable extra indentation added around text and comment content.
full_doctype (bool, optional): Serialize the full document type declaration when a doctype node is present.
html5test (bool, optional): Serialize using Lexbor’s HTML5 test formatting mode.

insert_after(self, value)¶

Insert a node after the current Node.

Parameters:

value (str, bytes or Node): The text or Node instance to insert after the Node. A text string is treated as plain text, so all HTML tags in it are escaped. To work with HTML, convert it to a Node object and pass that instead. The Node object is not cloned, so all future changes to the passed Node object are also taken into account.

Examples

>>> tree = LexborHTMLParser('<div>Get <img src="" alt="Laptop"></div>')
>>> img = tree.css_first('img')
>>> img.insert_after(img.attributes.get('alt', ''))
>>> tree.body.child.html
'<div>Get <img src="" alt="Laptop">Laptop</div>'

>>> html_parser = LexborHTMLParser('<div>Get <span alt="Laptop"><img src="/jpg"> <div></div></span></div>')
>>> html_parser2 = LexborHTMLParser('<div>Test</div>')
>>> img_node = html_parser.css_first('img')
>>> img_node.insert_after(html_parser2.body.child)
>>> html_parser.body.child.html
'<div>Get <span alt="Laptop"><img src="/jpg"><div>Test</div> <div></div></span></div>'

insert_before(self, value)¶

Insert a node before the current Node.

Parameters:

value (str, bytes or Node): The text or Node instance to insert before the Node. A text string is treated as plain text, so all HTML tags in it are escaped. To work with HTML, convert it to a Node object and pass that instead. The Node object is not cloned, so all future changes to the passed Node object are also taken into account.

Examples

>>> tree = LexborHTMLParser('<div>Get <img src="" alt="Laptop"></div>')
>>> img = tree.css_first('img')
>>> img.insert_before(img.attributes.get('alt', ''))
>>> tree.body.child.html
'<div>Get Laptop<img src="" alt="Laptop"></div>'

>>> html_parser = LexborHTMLParser('<div>Get <span alt="Laptop"><img src="/jpg"> <div></div></span></div>')
>>> html_parser2 = LexborHTMLParser('<div>Test</div>')
>>> img_node = html_parser.css_first('img')
>>> img_node.insert_before(html_parser2.body.child)
>>> html_parser.body.child.html
'<div>Get <span alt="Laptop"><div>Test</div><img src="/jpg"> <div></div></span></div>'

insert_child(self, value)¶

Insert a node inside (at the end of) the current Node.

Parameters:

value (str, bytes or Node): The text or Node instance to insert inside the Node. A text string is treated as plain text, so all HTML tags in it are escaped. To work with HTML, convert it to a Node object and pass that instead. The Node object is not cloned, so all future changes to the passed Node object are also taken into account.

Examples

>>> tree = LexborHTMLParser('<div>Get <img src=""></div>')
>>> div = tree.css_first('div')
>>> div.insert_child('Laptop')
>>> tree.body.child.html
'<div>Get <img src="">Laptop</div>'

>>> html_parser = LexborHTMLParser('<div>Get <span alt="Laptop"> <div>Laptop</div> </span></div>')
>>> html_parser2 = LexborHTMLParser('<div>Test</div>')
>>> span_node = html_parser.css_first('span')
>>> span_node.insert_child(html_parser2.body.child)
>>> html_parser.body.child.html
'<div>Get <span alt="Laptop"> <div>Laptop</div> <div>Test</div> </span></div>'

is_comment_node¶

LexborNode.is_comment_node: bool

Return True if the node represents a comment node.

is_document_node¶

LexborNode.is_document_node: bool

Return True if the node represents a document node.

is_element_node¶

LexborNode.is_element_node: bool

Return True if the node represents an element node.

is_empty_text_node¶

LexborNode.is_empty_text_node: bool

Check whether the current node is an empty text node.

Returns:

bool: True when the node is a text node whose character data consists only of ASCII whitespace characters (space, tab, newline, form feed or carriage return).

is_text_node¶

LexborNode.is_text_node: bool

Return True if the node represents a text node.

iter(self, bool include_text=False, bool skip_empty=False)¶

Iterate over direct children of this node.

Parameters:

include_text (bool, optional): When True, yield text nodes in addition to element nodes. Defaults to False.
skip_empty (bool, optional): When include_text is True, ignore text nodes made up solely of ASCII whitespace (space, tab, newline, form feed or carriage return). Defaults to False.

Yields:

LexborNode: Child nodes on the same tree level as this node, filtered according to the provided options.

last_child¶: Return the last child node.

merge_text_nodes(self)¶

Iterates over all text nodes and merges those that are adjacent to each other.

This is useful for text extraction. Use it when you need to strip HTML tags and merge “dangling” text.

Examples

>>> tree = LexborHTMLParser("<div><p><strong>J</strong>ohn</p><p>Doe</p></div>")
>>> node = tree.css_first('div')
>>> tree.unwrap_tags(["strong"])
>>> # Text extraction produces an extra space because the strong tag was removed.
>>> tree.text(deep=True, separator=" ", strip=True)
'J ohn Doe'
>>> node.merge_text_nodes()
>>> tree.text(deep=True, separator=" ", strip=True)
'John Doe'

next¶: Return next node.

parent¶: Return the parent node.

parser¶: parser: selectolax.lexbor.LexborHTMLParser

prev¶: Return previous node.

raw_value¶

Return the raw (unparsed, original) value of a node.

Currently, works on text nodes only.

Returns:

raw_value (bytes)

Examples

>>> html_parser = LexborHTMLParser('<div>&#x3C;test&#x3E;</div>')
>>> selector = html_parser.css_first('div')
>>> selector.child.html
'&lt;test&gt;'
>>> selector.child.raw_value
b'&#x3C;test&#x3E;'

remove(self, bool recursive=True)¶: An alias for the decompose method.

replace_with(self, value)¶

Replace current Node with specified value.

Parameters:

value (str, bytes or Node): The text or Node instance to replace the Node with. A text string is treated as plain text, so all HTML tags in it are escaped. To work with HTML, convert it to a Node object and pass that instead. The Node object is not cloned, so all future changes to the passed Node object are also taken into account.

Examples

>>> tree = LexborHTMLParser('<div>Get <img src="" alt="Laptop"></div>')
>>> img = tree.css_first('img')
>>> img.replace_with(img.attributes.get('alt', ''))
>>> tree.body.child.html
'<div>Get Laptop</div>'

>>> html_parser = LexborHTMLParser('<div>Get <span alt="Laptop"><img src="/jpg"> <div></div></span></div>')
>>> html_parser2 = LexborHTMLParser('<div>Test</div>')
>>> img_node = html_parser.css_first('img')
>>> img_node.replace_with(html_parser2.body.child)
>>> html_parser.body.child.html
'<div>Get <span alt="Laptop"><div>Test</div> <div></div></span></div>'

script_srcs_contain(self, tuple queries)¶

Returns True if any script src attribute contains one of the specified strings.

Caches values on the first call to improve performance.

Parameters:

queries (tuple of str)

scripts_contain(self, str query)¶

Returns True if any of the script tags contain specified text.

Caches script tags on the first call to improve performance.

Parameters:

query (str): The query to check.

select(self, query=None)¶

Select nodes given a CSS selector.

Works similarly to the css method, but supports chained filtering and extra features.

Parameters:

querystr or None: The CSS selector to use when searching for nodes.

Returns:

selector: The Selector class.

strip_tags(self, list tags, bool recursive=False)¶

Remove specified tags from the HTML tree.

Parameters:

tags (list): List of tags to remove.
recursive (bool, default False): Whether to delete all child nodes.

Examples

>>> tree = LexborHTMLParser('<html><head></head><body><script></script><div>Hello world!</div></body></html>')
>>> tags = ['head', 'style', 'script', 'xmp', 'iframe', 'noembed', 'noframes']
>>> tree.strip_tags(tags)
>>> tree.html
'<html><body><div>Hello world!</div></body></html>'

tag¶

Return the name of the current tag (e.g. div, p, img).

For non-tag nodes, returns the following names:

-text - text node

-document - document node

-comment - comment node

Returns:

text (str)

text(self, bool deep=True, str separator='', bool strip=False, bool skip_empty=False)¶

Return concatenated text from this node.

Parameters:

deep (bool, optional): When True (default), include text from all descendant nodes; when False, only include direct children.
separator (str, optional): String inserted between successive text fragments.
strip (bool, optional): If True, apply str.strip() to each fragment before joining to remove surrounding whitespace. Defaults to False.
skip_empty (bool, optional): Exclude text nodes whose content is only ASCII whitespace (space, tab, newline, form feed or carriage return) when True. Defaults to False.

Returns:

text (str): Combined textual content assembled according to the provided options.

text_content¶

Returns the text of the node if it is a text node.

Returns None for other nodes. Unlike the text method, does not include child nodes.

Returns:

text (str or None)

text_lexbor(self)¶

Returns the text of the node including text of all its child nodes.

Uses the built-in method from lexbor.

traverse(self, bool include_text=False, bool skip_empty=False)¶

Depth-first traversal starting at the current node.

Parameters:

include_text (bool, optional): When True, include text nodes in the traversal sequence. Defaults to False.
skip_empty (bool, optional): Skip text nodes that contain only ASCII whitespace (space, tab, newline, form feed or carriage return) when include_text is True. Defaults to False.

Yields:

LexborNode: Nodes encountered in depth-first order beginning with the current node, filtered according to the provided options.

unwrap(self, bool delete_empty=False)¶

Replace node with whatever is inside this node.

Does nothing if the same node is unwrapped a second time.

Parameters:

delete_empty (bool, default False): If True, removes empty tags.

Examples

>>> tree = LexborHTMLParser("<div>Hello <i>world</i>!</div>")
>>> tree.css_first('i').unwrap()
>>> tree.html
'<html><head></head><body><div>Hello world!</div></body></html>'

Note: by default, empty tags are ignored; use delete_empty to change this.

unwrap_tags(self, list tags, bool delete_empty=False)¶

Unwraps specified tags from the HTML tree.

Works the same as the unwrap method, but applied to a list of tags.

Parameters:

tags (list): List of tags to remove.
delete_empty (bool, default False): If True, removes empty tags.

Examples

>>> tree = LexborHTMLParser('<div><a href="">Hello</a> <i>world</i>!</div>')
>>> tree.body.unwrap_tags(['i','a'])
>>> tree.body.html
'<body><div>Hello world!</div></body>'

Note: by default, empty tags are ignored; use delete_empty to change this.

Selector¶

class selectolax.lexbor.LexborSelector(LexborNode node, query)¶

An advanced CSS selector that supports additional operations.

Think of it as a toolkit that mimics some of the features of XPath.

Please note, this is an experimental feature that can change in the future.

any_attribute_longer_than(self, str attribute, int length, str start=None) → bool¶

Returns True if any matched node has the given attribute longer than the specified length.

Similar to string-length in XPath.

any_matches¶

LexborSelector.any_matches: bool

Returns True if there are any matches.

any_text_contains(self, str text, bool deep=True, str separator='', bool strip=False) → bool¶: Returns True if any node in the current search scope contains the specified text.

attribute_longer_than(self, str attribute, int length, str start=None) → LexborSelector¶

Filter all current matches by attribute length.

Similar to string-length in XPath.

css(self, str query)¶: Evaluate CSS selector against current scope.

matches¶

LexborSelector.matches: list

Returns all possible matches.

text_contains(self, str text, bool deep=True, str separator='', bool strip=False) → LexborSelector¶: Filter all current matches given text.