Documents and Loaders

Documents

class delb.Document(source, parser_options=None, klass=None, **config_options)

This class is the entry point to obtain a representation of an XML-encoded text document.

Parameters:

source – Anything that the configured loaders can make sense of to return a parsed document tree.
parser_options – A delb.ParserOptions instance to configure the parser.
klass – Explicitly defines the class to instantiate. This can be useful for applications that have default document subclasses in use.
config_options – Additional keyword arguments for the configuration of extension classes.

Any object can be passed as the source, provided that a suitable loader is available for it. See Document loaders for the default loaders that come with this package. Plugins can alter the available loaders, see Extending delb.

Nodes can be tested for membership in a document:

>>> from delb import Document
>>> document = Document("<root>text</root>")
>>> text_node = document.root[0]
>>> text_node in document
True
>>> text_node.clone() in document
False

The string coercion of a document yields the XML-encoded document as a string. See Serialization for details.

>>> document = Document("<root/>")
>>> str(document)
'<?xml version="1.0" encoding="UTF-8"?><root/>'

Attributes:

`config`	Besides the `parser_options`, this property contains the namespaced data that extension classes and loaders may have stored.
`epilogue`	A list-like accessor to the nodes that follow the document's root node.
`namespaces`	The namespace mapping of the document's `root` node.
`prologue`	A list-like accessor to the nodes that precede the document's root node.
`root`	The root node of a document's content tree.
`source_url`	The source URL where a loader obtained the document's contents or `None`.

Methods:

`clone`	Clones the document with its contents.
`css_select`	This method proxies to the `TagNode.css_select()` method of the document's `root` node.
`merge_text_nodes`	This method proxies to the `TagNode.merge_text_nodes()` method of the document's `root` node.
`new_tag_node`	This method proxies to the `TagNode.new_tag_node()` method of the document's `root` node.
`reduce_whitespace`	Collapses and trims whitespace as described in this TEI recommendation.
`save`	Saves the serialized document contents to a file.
`write`	Writes the serialized document contents to a file-like object.
`xpath`	This method proxies to the `TagNode.xpath()` method of the document's `root` node.

clone() → Document

Clones the document with its contents.

Returns: A new document instance.

config: SimpleNamespace

Besides the parser_options, this property contains the namespaced data that extension classes and loaders may have stored.

css_select(expression: str, namespaces: delb.typing.NamespaceDeclarations | None = None) → QueryResults

This method proxies to the TagNode.css_select() method of the document’s root node.

epilogue

A list-like accessor to the nodes that follow the document’s root node. Note that nodes can’t be removed or replaced.

merge_text_nodes()

This method proxies to the TagNode.merge_text_nodes() method of the document’s root node.

property namespaces: Namespaces

The namespace mapping of the document’s root node.

new_tag_node(local_name: str, attributes: dict[AttributeAccessor, str] | None = None, namespace: str | None = None) → TagNode

This method proxies to the TagNode.new_tag_node() method of the document’s root node.

prologue

A list-like accessor to the nodes that precede the document’s root node. Note that nodes can’t be removed or replaced.

reduce_whitespace()

Collapses and trims whitespace as described in this TEI recommendation. Text in (sub-)trees with structured data should be trimmed further in subsequent processing. Implicitly merges all neighbouring text nodes.

property root: TagNode

The root node of a document’s content tree.

save(path: Path, pretty: bool | None = None, *, encoding: str = 'utf-8', format_options: FormatOptions | None = None, namespaces: NamespaceDeclarations | None = None, newline: None | str = None)

Saves the serialized document contents to a file. See Serialization for details.

Parameters:

path – The filesystem path to the target file.
pretty – Deprecated. Adds indentation for human consumers when True.
encoding – The desired text encoding.
format_options – An instance of FormatOptions can be provided to configure formatting.
namespaces – A mapping of prefixes to namespaces. These override any declarations from the parsed serialization that the document instance stems from. Prefixes for undeclared namespaces are enumerated with the prefix ns.
newline – See io.TextIOWrapper for a detailed explanation of the parameter with the same name.

source_url: str | None

The source URL where a loader obtained the document’s contents or None.

write(buffer: BinaryIO, pretty: bool | None = None, *, encoding: str = 'utf-8', format_options: FormatOptions | None = None, namespaces: delb.typing.NamespaceDeclarations | None = None, newline: None | str = None)

Writes the serialized document contents to a file-like object. See Serialization for details.

Parameters:

buffer – A file-like object that the document is written to.
pretty – Deprecated. Adds indentation for human consumers when True.
encoding – The desired text encoding.
format_options – An instance of FormatOptions can be provided to configure formatting.
namespaces – A mapping of prefixes to namespaces. These override any declarations from the parsed serialization that the document instance stems from. Prefixes for undeclared namespaces are enumerated with the prefix ns.
newline – See io.TextIOWrapper for a detailed explanation of the parameter with the same name.

xpath(expression: str, namespaces: delb.typing.NamespaceDeclarations | None = None) → QueryResults

This method proxies to the TagNode.xpath() method of the document’s root node.

Document loaders

If you want or need to manipulate the availability of loaders or the order in which they are attempted, you can change the delb.plugins.plugin_manager.plugins.loaders object, which is a list. Changes to it affect your whole application. Please refer to this issue when you require finer control over these aspects.

Core

The core_loaders module provides a set of loaders to retrieve documents from various data sources.

_delb.plugins.core_loaders.buffer_loader(data: Any, config: SimpleNamespace) → LoaderResult

This loader loads a document from a file-like object.

_delb.plugins.core_loaders.etree_loader(data: Any, config: SimpleNamespace) → LoaderResult

This loader processes lxml.etree._Element and lxml.etree._ElementTree instances.

_delb.plugins.core_loaders.path_loader(data: Any, config: SimpleNamespace) → LoaderResult

This loader loads a document from a file that is referenced by a pathlib.Path instance. That instance will be bound to source_path on the document’s Document.config attribute.

_delb.plugins.core_loaders.tag_node_loader(data: Any, config: SimpleNamespace) → LoaderResult

This loader loads, or rather clones, a delb.TagNode instance and its descendant nodes.

_delb.plugins.core_loaders.text_loader(data: Any, config: SimpleNamespace) → LoaderResult

This loader parses a string containing a full document.

Extra

If delb is installed with the https-loader extra, the dependencies required for this loader are installed as well. See Installation.

_delb.plugins.https_loader.https_loader(data: Any, config: SimpleNamespace, client: httpx.Client = <httpx.Client object>) → LoaderResult

This loader loads a document from a URL with the http or https scheme. The default httpx client follows redirects and can partially be configured with environment variables. The URL will be bound to the name source_url on the document’s Document.config attribute.

Loaders with specifically configured httpx-clients can build on this loader like so:

import httpx
from _delb.plugins import plugin_manager
from _delb.plugins.https_loader import https_loader


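# A client that neither follows redirects nor reads settings from the environment.
client = httpx.Client(follow_redirects=False, trust_env=False)

# Register the custom loader so that it is attempted before the default one.
@plugin_manager.register_loader(before=https_loader)
def custom_https_loader(data, config):
    return https_loader(data, config, client=client)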

Parser options

class delb.ParserOptions(reduce_whitespace: bool = False, remove_comments: bool = False, remove_processing_instructions: bool = False, resolve_entities: bool = True, unplugged: bool = False)

The configuration options that define an XML parser’s behaviour.

Parameters:

reduce_whitespace – Reduce the content's whitespace.
remove_comments – Ignore comments.
remove_processing_instructions – Don’t include processing instructions in the parsed tree.
resolve_entities – Resolve entities.
unplugged – Don’t load referenced resources over network.