selectolax.lexbor module

LexborHTMLParser

class selectolax.lexbor.LexborHTMLParser(html: str | bytes, is_fragment: bool = False, fragment_tag: str = 'div', fragment_namespace: str = 'html')

The lexbor HTML parser.

Use this class to parse raw HTML.

This parser mimics most of the HTMLParser API but does not inherit from it directly.

Parameters:
html: str (unicode) or bytes

any_css_matches(self, tuple selectors)

Return True if any of the specified CSS selectors match.

Parameters:
selectors: tuple[str]

CSS selectors to evaluate.

Returns:
bool

True when at least one selector matches.

body

Return document body.

Returns:
LexborNode or None

<body> element when present, otherwise None.

clone(self)

Clone the current document tree.

You can use the clone to make temporary modifications without affecting the original HTML tree. The clone is tied to the current parser instance and is destroyed when that parser instance is destroyed.

Returns:
LexborHTMLParser

A parser instance backed by a deep-copied document.

create_node(self, str tag)

Given an HTML tag name, e.g. “div”, create a single empty node for that tag, e.g. “<div></div>”.

Parameters:
tag: str

Name of the tag to create.

Returns:
LexborNode

Newly created element node.

Raises:
SelectolaxError

If the element cannot be created.

Examples

>>> parser = LexborHTMLParser("<div></div>")
>>> new_node = parser.create_node("span")
>>> new_node.tag
'span'
>>> parser.css_first("div").insert_child(new_node)
>>> parser.html
'<html><head></head><body><div><span></span></div></body></html>'

css(self, str query)

Evaluate a CSS selector against the document.

Matches the query pattern against the HTML tree. See the CSS selectors reference.

Special selectors:

  • parser.css('p:lexbor-contains("awesome" i)') – case-insensitive contains

  • parser.css('p:lexbor-contains("awesome")') – case-sensitive contains

Parameters:
query: str

CSS selector (e.g. "div > :nth-child(2n+1):not(:has(a))").

Returns:
selector: list of LexborNode objects

css_first(self, str query, default=None, strict=False)

Same as css but returns only the first match.

Parameters:
query: str
default: Any, default None

Default value to return if there is no match.

strict: bool, default False

Set to True if you want to check if there is strictly only one match in the document.

Returns:
selector: LexborNode object

css_matches(self, str selector)

Return True if the document matches the selector at least once.

Parameters:
selector: str

CSS selector to test.

Returns:
bool

True when a match exists.

head

Return document head.

Returns:
LexborNode or None

<head> element when present, otherwise None.

html

Return HTML representation of the page.

Returns:
str or None

Serialized HTML of the current document.

html_pretty(self, Py_ssize_t indent=0, bool skip_ws_nodes=False, bool skip_comment=False, bool raw=False, bool without_closing=False, bool tag_with_ns=False, bool without_text_indent=False, bool full_doctype=False, bool html5test=False)

Return pretty-printed HTML representation of the page.

Parameters:
indent: int, optional

Initial indentation level passed to Lexbor. Defaults to 0.

skip_ws_nodes: bool, optional

Skip text nodes that contain only whitespace.

skip_comment: bool, optional

Exclude HTML comment nodes from the serialized output.

raw: bool, optional

Serialize text and attribute values without HTML escaping.

without_closing: bool, optional

Omit closing tags for non-void elements.

tag_with_ns: bool, optional

Include namespace prefixes in serialized tag names when available.

without_text_indent: bool, optional

Disable extra indentation added around text and comment content.

full_doctype: bool, optional

Serialize the full document type declaration when a doctype node is present.

html5test: bool, optional

Serialize using Lexbor’s HTML5 test formatting mode.

inner_html

LexborHTMLParser.inner_html: str

Return HTML representation of the child nodes.

Works similarly to innerHTML in JavaScript. Unlike the .html property, it does not include the current node. It can also be used to set HTML; see the setter docstring.

Returns:
text: str | None

inner_html_pretty(self, Py_ssize_t indent=0, bool skip_ws_nodes=False, bool skip_comment=False, bool raw=False, bool without_closing=False, bool tag_with_ns=False, bool without_text_indent=False, bool full_doctype=False, bool html5test=False)

Return pretty-printed HTML representation of the child nodes.

Parameters:
indent: int, optional

Initial indentation level passed to Lexbor. Defaults to 0.

skip_ws_nodes: bool, optional

Skip text nodes that contain only whitespace.

skip_comment: bool, optional

Exclude HTML comment nodes from the serialized output.

raw: bool, optional

Serialize text and attribute values without HTML escaping.

without_closing: bool, optional

Omit closing tags for non-void elements.

tag_with_ns: bool, optional

Include namespace prefixes in serialized tag names when available.

without_text_indent: bool, optional

Disable extra indentation added around text and comment content.

full_doctype: bool, optional

Serialize the full document type declaration when a doctype node is present.

html5test: bool, optional

Serialize using Lexbor’s HTML5 test formatting mode.

merge_text_nodes(self)

Iterates over all text nodes and merges those that are adjacent to each other.

This is useful for text extraction. Use it when you need to strip HTML tags and merge “dangling” text.

Returns:
None

Examples

>>> tree = LexborHTMLParser("<div><p><strong>J</strong>ohn</p><p>Doe</p></div>")
>>> node = tree.css_first('div')
>>> tree.unwrap_tags(["strong"])
>>> tree.text(deep=True, separator=" ", strip=True)
"J ohn Doe" # Text extraction produces an extra space because the strong tag was removed.
>>> node.merge_text_nodes()
>>> tree.text(deep=True, separator=" ", strip=True)
"John Doe"

raw_html

raw_html: bytes

root

Return the document root node.

Returns:
LexborNode or None

Root of the parsed document, or None if unavailable.

script_srcs_contain(self, tuple queries)

Return True if any script src contains one of the strings.

Caches values on the first call to improve performance.

Parameters:
queries: tuple of str

Strings to look for inside src attributes.

Returns:
bool

True when a matching source value is found.

scripts_contain(self, str query)

Return True if any script tag contains the given text.

Caches script tags on the first call to improve performance.

Parameters:
query: str

Text to search for within script contents.

Returns:
bool

True when a matching script tag is found.

select(self, query=None)

Select nodes given a CSS selector.

Works similarly to the css method, but supports chained filtering and extra features.

Parameters:
query: str or None

The CSS selector to use when searching for nodes.

Returns:
LexborSelector or None

Selector bound to the root node, or None if the document is empty.

selector

Return a lazily created CSS selector helper.

Returns:
LexborCSSSelector

Selector instance bound to this parser.

strip_tags(self, list tags, bool recursive=False)

Remove specified tags from the node.

Parameters:
tags: list of str

List of tags to remove.

recursive: bool, default False

Whether to delete all of the child nodes as well.

Returns:
None

Examples

>>> tree = LexborHTMLParser('<html><head></head><body><script></script><div>Hello world!</div></body></html>')
>>> tags = ['head', 'style', 'script', 'xmp', 'iframe', 'noembed', 'noframes']
>>> tree.strip_tags(tags)
>>> tree.html
'<html><body><div>Hello world!</div></body></html>'

tags(self, str name)

Return all tags that match the provided name.

Parameters:
name: str

Tag name to search for (e.g., "div").

Returns:
list of LexborNode

Matching elements in document order.

Raises:
ValueError

If name is empty or longer than 100 characters.

SelectolaxError

If Lexbor cannot locate the elements.

text(self, deep: bool = True, separator: str = '', strip: bool = False, skip_empty: bool = False) -> str

Returns the text of the node including text of all its child nodes.

Parameters:
strip: bool, default False

If True, calls str.strip() on each text part to remove surrounding whitespace.

separator: str, default ''

The separator to use when joining text from different nodes.

deep: bool, default True

If True, includes text from all child nodes.

skip_empty: bool, optional

Exclude text nodes whose content is only ASCII whitespace (space, tab, newline, form feed or carriage return) when True. Defaults to False.

Returns:
text: str

Combined textual content assembled according to the provided options.

unwrap_tags(self, list tags, delete_empty=False)

Unwraps specified tags from the HTML tree.

Works the same as the unwrap method, but applied to a list of tags.

Parameters:
tags: list

List of tags to remove.

delete_empty: bool

Whether to delete empty tags.

Returns:
None

Examples

>>> tree = LexborHTMLParser('<div><a href="">Hello</a> <i>world</i>!</div>')
>>> tree.body.unwrap_tags(['i','a'])
>>> tree.body.html
'<body><div>Hello world!</div></body>'

LexborNode

class selectolax.lexbor.LexborNode

A class that represents an HTML node (element).

any_css_matches(self, tuple selectors)

Returns True if any of the CSS selectors matches the node.

attributes

Get all attributes that belong to the current node.

The value of empty attributes is None.

Returns:
attributes: dictionary of all attributes.

Examples

>>> tree = LexborHTMLParser("<div data id='my_id'></div>")
>>> node = tree.css_first('div')
>>> node.attributes
{'data': None, 'id': 'my_id'}

attrs

A dict-like object that is similar to the attributes property, but operates directly on the Node data.

Warning

Use attributes instead, if you don’t want to modify Node attributes.

Returns:
attributes: Attributes mapping object.

Examples

>>> tree = LexborHTMLParser("<div id='a'></div>")
>>> node = tree.css_first('div')
>>> node.attrs
<div attributes, 1 items>
>>> node.attrs['id']
'a'
>>> node.attrs['foo'] = 'bar'
>>> del node.attrs['id']
>>> node.attributes
{'foo': 'bar'}
>>> node.attrs['id'] = 'new_id'
>>> node.html
'<div foo="bar" id="new_id"></div>'

child

Alias for the first_child property.

Deprecated. Please use first_child instead.

clone(self) -> LexborNode

Clone the current node.

You can use the clone to make temporary modifications without affecting the original HTML tree.

The clone is tied to the current parser instance and is destroyed when that parser instance is destroyed.

comment_content

LexborNode.comment_content: str | None

Extract the textual content of an HTML comment node.

Returns:
str or None

Comment text with surrounding whitespace removed, or None if the current node is not a comment or the comment markup cannot be parsed.

Examples

>>> parse_fragment("<!-- hello -->")[0].comment_content
'hello'
>>> parse_fragment("<div>not a comment</div>")[0].comment_content is None
True

css(self, str query)

Evaluate a CSS selector against the current node and its child nodes.

Matches the query pattern against the HTML tree. See the CSS selectors reference.

Special selectors:

  • parser.css('p:lexbor-contains("awesome" i)') – case-insensitive contains

  • parser.css('p:lexbor-contains("awesome")') – case-sensitive contains

Parameters:
query: str

CSS selector (e.g. "div > :nth-child(2n+1):not(:has(a))").

Returns:
selector: list of LexborNode objects

css_first(self, str query, default=None, bool strict=False)

Same as css but returns only the first match.

When strict=False, the search stops at the first match, which is faster.

Parameters:
query: str
default: Any, default None

Default value to return if there is no match.

strict: bool, default False

Set to True if you want to check if there is strictly only one match in the document.

Returns:
selector: LexborNode object

css_matches(self, str selector)

Returns True if the CSS selector matches the node.

decompose(self, bool recursive=True)

Remove the current node from the tree.

Parameters:
recursive: bool, default True

Whether to delete all of its child nodes.

Examples

>>> tree = LexborHTMLParser(html)
>>> for tag in tree.css('script'):
...     tag.decompose()

first_child

Return the first child node.

html

Return HTML representation of the current node including all its child nodes.

Returns:
text: str

html_pretty(self, Py_ssize_t indent=0, bool skip_ws_nodes=False, bool skip_comment=False, bool raw=False, bool without_closing=False, bool tag_with_ns=False, bool without_text_indent=False, bool full_doctype=False, bool html5test=False)

Return pretty-printed HTML for the current node.

Parameters:
indent: int, optional

Initial indentation level passed to Lexbor. Defaults to 0.

skip_ws_nodes: bool, optional

Skip text nodes that contain only whitespace.

skip_comment: bool, optional

Exclude HTML comment nodes from the serialized output.

raw: bool, optional

Serialize text and attribute values without HTML escaping.

without_closing: bool, optional

Omit closing tags for non-void elements.

tag_with_ns: bool, optional

Include namespace prefixes in serialized tag names when available.

without_text_indent: bool, optional

Disable extra indentation added around text and comment content.

full_doctype: bool, optional

Serialize the full document type declaration when a doctype node is present.

html5test: bool, optional

Serialize using Lexbor’s HTML5 test formatting mode.

id

Get the id attribute of the node.

Returns None if the id attribute is not set.

Returns:
text: str or None

inner_html

LexborNode.inner_html: str | None

Return HTML representation of the child nodes.

Works similarly to innerHTML in JavaScript. Unlike the .html property, it does not include the current node. It can also be used to set HTML; see the setter docstring.

Returns:
text: str | None

inner_html_pretty(self, Py_ssize_t indent=0, bool skip_ws_nodes=False, bool skip_comment=False, bool raw=False, bool without_closing=False, bool tag_with_ns=False, bool without_text_indent=False, bool full_doctype=False, bool html5test=False)

Return pretty-printed HTML representation of the child nodes.

Parameters:
indent: int, optional

Initial indentation level passed to Lexbor. Defaults to 0.

skip_ws_nodes: bool, optional

Skip text nodes that contain only whitespace.

skip_comment: bool, optional

Exclude HTML comment nodes from the serialized output.

raw: bool, optional

Serialize text and attribute values without HTML escaping.

without_closing: bool, optional

Omit closing tags for non-void elements.

tag_with_ns: bool, optional

Include namespace prefixes in serialized tag names when available.

without_text_indent: bool, optional

Disable extra indentation added around text and comment content.

full_doctype: bool, optional

Serialize the full document type declaration when a doctype node is present.

html5test: bool, optional

Serialize using Lexbor’s HTML5 test formatting mode.

insert_after(self, value)

Insert a node after the current Node.

Parameters:
value: str, bytes or Node

The text or Node instance to insert after the Node. When a text string is passed, it’s treated as text. All HTML tags will be escaped. Convert and pass the Node object when you want to work with HTML. Does not clone the Node object. All future changes to the passed Node object will also be taken into account.

Examples

>>> tree = LexborHTMLParser('<div>Get <img src="" alt="Laptop"></div>')
>>> img = tree.css_first('img')
>>> img.insert_after(img.attributes.get('alt', ''))
>>> tree.body.child.html
'<div>Get <img src="" alt="Laptop">Laptop</div>'
>>> html_parser = LexborHTMLParser('<div>Get <span alt="Laptop"><img src="/jpg"> <div></div></span></div>')
>>> html_parser2 = LexborHTMLParser('<div>Test</div>')
>>> img_node = html_parser.css_first('img')
>>> img_node.insert_after(html_parser2.body.child)
>>> html_parser.body.child.html
'<div>Get <span alt="Laptop"><img src="/jpg"><div>Test</div> <div></div></span></div>'

insert_before(self, value)

Insert a node before the current Node.

Parameters:
value: str, bytes or Node

The text or Node instance to insert before the Node. When a text string is passed, it’s treated as text. All HTML tags will be escaped. Convert and pass the Node object when you want to work with HTML. Does not clone the Node object. All future changes to the passed Node object will also be taken into account.

Examples

>>> tree = LexborHTMLParser('<div>Get <img src="" alt="Laptop"></div>')
>>> img = tree.css_first('img')
>>> img.insert_before(img.attributes.get('alt', ''))
>>> tree.body.child.html
'<div>Get Laptop<img src="" alt="Laptop"></div>'
>>> html_parser = LexborHTMLParser('<div>Get <span alt="Laptop"><img src="/jpg"> <div></div></span></div>')
>>> html_parser2 = LexborHTMLParser('<div>Test</div>')
>>> img_node = html_parser.css_first('img')
>>> img_node.insert_before(html_parser2.body.child)
>>> html_parser.body.child.html
'<div>Get <span alt="Laptop"><div>Test</div><img src="/jpg"> <div></div></span></div>'

insert_child(self, value)

Insert a node inside (at the end of) the current Node.

Parameters:
value: str, bytes or Node

The text or Node instance to insert inside the Node. When a text string is passed, it’s treated as text. All HTML tags will be escaped. Convert and pass the Node object when you want to work with HTML. Does not clone the Node object. All future changes to the passed Node object will also be taken into account.

Examples

>>> tree = LexborHTMLParser('<div>Get <img src=""></div>')
>>> div = tree.css_first('div')
>>> div.insert_child('Laptop')
>>> tree.body.child.html
'<div>Get <img src="">Laptop</div>'
>>> html_parser = LexborHTMLParser('<div>Get <span alt="Laptop"> <div>Laptop</div> </span></div>')
>>> html_parser2 = LexborHTMLParser('<div>Test</div>')
>>> span_node = html_parser.css_first('span')
>>> span_node.insert_child(html_parser2.body.child)
>>> html_parser.body.child.html
'<div>Get <span alt="Laptop"> <div>Laptop</div> <div>Test</div> </span></div>'

is_comment_node

LexborNode.is_comment_node: bool

Return True if the node represents a comment node.

is_document_node

LexborNode.is_document_node: bool

Return True if the node represents a document node.

is_element_node

LexborNode.is_element_node: bool

Return True if the node represents an element node.

is_empty_text_node

LexborNode.is_empty_text_node: bool

Check whether the current node is an empty text node.

Returns:
bool

True when the node is a text node whose character data consists only of ASCII whitespace characters (space, tab, newline, form feed or carriage return).

is_text_node

LexborNode.is_text_node: bool

Return True if the node represents a text node.

iter(self, bool include_text=False, bool skip_empty=False)

Iterate over direct children of this node.

Parameters:
include_text: bool, optional

When True, yield text nodes in addition to element nodes. Defaults to False.

skip_empty: bool, optional

When include_text is True, ignore text nodes made up solely of ASCII whitespace (space, tab, newline, form feed or carriage return). Defaults to False.

Yields:
LexborNode

Child nodes on the same tree level as this node, filtered according to the provided options.

last_child

Return the last child node.

merge_text_nodes(self)

Iterates over all text nodes and merges those that are adjacent to each other.

This is useful for text extraction. Use it when you need to strip HTML tags and merge “dangling” text.

Examples

>>> tree = LexborHTMLParser("<div><p><strong>J</strong>ohn</p><p>Doe</p></div>")
>>> node = tree.css_first('div')
>>> tree.unwrap_tags(["strong"])
>>> tree.text(deep=True, separator=" ", strip=True)
"J ohn Doe" # Text extraction produces an extra space because the strong tag was removed.
>>> node.merge_text_nodes()
>>> tree.text(deep=True, separator=" ", strip=True)
"John Doe"

next

Return the next node.

parent

Return the parent node.

parser

parser: selectolax.lexbor.LexborHTMLParser

prev

Return the previous node.

raw_value

Return the raw (unparsed, original) value of a node.

Currently, works on text nodes only.

Returns:
raw_value: bytes

Examples

>>> html_parser = LexborHTMLParser('<div>&#x3C;test&#x3E;</div>')
>>> selector = html_parser.css_first('div')
>>> selector.child.html
'&lt;test&gt;'
>>> selector.child.raw_value
b'&#x3C;test&#x3E;'

remove(self, bool recursive=True)

An alias for the decompose method.

replace_with(self, value)

Replace current Node with specified value.

Parameters:
value: str, bytes or Node

The text or Node instance to replace the Node with. When a text string is passed, it’s treated as text. All HTML tags will be escaped. Convert and pass the Node object when you want to work with HTML. Does not clone the Node object. All future changes to the passed Node object will also be taken into account.

Examples

>>> tree = LexborHTMLParser('<div>Get <img src="" alt="Laptop"></div>')
>>> img = tree.css_first('img')
>>> img.replace_with(img.attributes.get('alt', ''))
>>> tree.body.child.html
'<div>Get Laptop</div>'
>>> html_parser = LexborHTMLParser('<div>Get <span alt="Laptop"><img src="/jpg"> <div></div></span></div>')
>>> html_parser2 = LexborHTMLParser('<div>Test</div>')
>>> img_node = html_parser.css_first('img')
>>> img_node.replace_with(html_parser2.body.child)
>>> html_parser.body.child.html
'<div>Get <span alt="Laptop"><div>Test</div> <div></div></span></div>'

script_srcs_contain(self, tuple queries)

Returns True if any script src attribute contains one of the specified strings.

Caches values on the first call to improve performance.

Parameters:
queries: tuple of str

scripts_contain(self, str query)

Returns True if any of the script tags contains the specified text.

Caches script tags on the first call to improve performance.

Parameters:
query: str

The query to check.

select(self, query=None)

Select nodes given a CSS selector.

Works similarly to the css method, but supports chained filtering and extra features.

Parameters:
query: str or None

The CSS selector to use when searching for nodes.

Returns:
selector: The Selector class.

strip_tags(self, list tags, bool recursive=False)

Remove specified tags from the HTML tree.

Parameters:
tags: list

List of tags to remove.

recursive: bool, default False

Whether to delete all of the child nodes as well.

Examples

>>> tree = LexborHTMLParser('<html><head></head><body><script></script><div>Hello world!</div></body></html>')
>>> tags = ['head', 'style', 'script', 'xmp', 'iframe', 'noembed', 'noframes']
>>> tree.strip_tags(tags)
>>> tree.html
'<html><body><div>Hello world!</div></body></html>'

tag

Return the name of the current tag (e.g. div, p, img).

For non-tag nodes, returns the following names:

  • -text - text node

  • -document - document node

  • -comment - comment node

Returns:
text: str

text(self, bool deep=True, str separator='', bool strip=False, bool skip_empty=False)

Return concatenated text from this node.

Parameters:
deep: bool, optional

When True (default), include text from all descendant nodes; when False, only include direct children.

separator: str, optional

String inserted between successive text fragments.

strip: bool, optional

If True, apply str.strip() to each fragment before joining to remove surrounding whitespace. Defaults to False.

skip_empty: bool, optional

Exclude text nodes whose content is only ASCII whitespace (space, tab, newline, form feed or carriage return) when True. Defaults to False.

Returns:
text: str

Combined textual content assembled according to the provided options.

text_content

Returns the text of the node if it is a text node.

Returns None for other nodes. Unlike the text method, does not include child nodes.

Returns:
text: str or None.

text_lexbor(self)

Returns the text of the node including text of all its child nodes.

Uses the built-in method from lexbor.

traverse(self, bool include_text=False, bool skip_empty=False)

Depth-first traversal starting at the current node.

Parameters:
include_text: bool, optional

When True, include text nodes in the traversal sequence. Defaults to False.

skip_empty: bool, optional

Skip text nodes that contain only ASCII whitespace (space, tab, newline, form feed or carriage return) when include_text is True. Defaults to False.

Yields:
LexborNode

Nodes encountered in depth-first order beginning with the current node, filtered according to the provided options.

unwrap(self, bool delete_empty=False)

Replace node with whatever is inside this node.

Does nothing if you unwrap the same node a second time.

Parameters:
delete_empty: bool, default False

If True, removes empty tags.

Examples

>>> tree = LexborHTMLParser("<div>Hello <i>world</i>!</div>")
>>> tree.css_first('i').unwrap()
>>> tree.html
'<html><head></head><body><div>Hello world!</div></body></html>'

Note: by default, empty tags are ignored. Use delete_empty to change this.

unwrap_tags(self, list tags, bool delete_empty=False)

Unwraps specified tags from the HTML tree.

Works the same as the unwrap method, but applied to a list of tags.

Parameters:
tags: list

List of tags to remove.

delete_empty: bool, default False

If True, removes empty tags.

Examples

>>> tree = LexborHTMLParser('<div><a href="">Hello</a> <i>world</i>!</div>')
>>> tree.body.unwrap_tags(['i','a'])
>>> tree.body.html
'<body><div>Hello world!</div></body>'

Note: by default, empty tags are ignored. Use delete_empty to change this.

Selector

class selectolax.lexbor.LexborSelector(LexborNode node, query)

An advanced CSS selector that supports additional operations.

Think of it as a toolkit that mimics some of the features of XPath.

Please note, this is an experimental feature that can change in the future.

any_attribute_longer_than(self, str attribute, int length, str start=None) -> bool

Returns True if any matched node has the specified attribute with a value longer than the given length.

Similar to string-length in XPath.

any_matches

LexborSelector.any_matches: bool

Returns True if there are any matches.

any_text_contains(self, str text, bool deep=True, str separator='', bool strip=False) -> bool

Returns True if any node in the current search scope contains the specified text.

attribute_longer_than(self, str attribute, int length, str start=None) -> LexborSelector

Filter all current matches by attribute length.

Similar to string-length in XPath.

css(self, str query)

Evaluate CSS selector against current scope.

matches

LexborSelector.matches: list

Returns all possible matches.

text_contains(self, str text, bool deep=True, str separator='', bool strip=False) -> LexborSelector

Filter all current matches given text.