Documents and Loaders

Documents

class delb.Document(source, /, parser_options=None, klass=None, source_url=None, **config_options)

This class is the entry point for obtaining a representation of an XML-encoded text document.

Parameters:

source – Anything that the configured loaders can make sense of to return a parsed document tree.
parser_options – A delb.parser.ParserOptions instance to configure the used parser.
klass – Explicitly define the initialized class. This can be useful for applications that have default document subclasses in use.
source_url – An optional source URL for situations where a reader can’t determine one.
config_options – Additional keyword arguments for the configuration of extension classes.

Any object can be passed for instantiation, provided that a suitable loader is available for the given source. See Document loaders for the default loaders that come with this package. Plugins can alter the available loaders; see Extending delb.

Nodes can be tested for membership in a document:

>>> document = Document("<root>text</root>")
>>> text_node = document.root[0]
>>> text_node in document
True
>>> text_node.clone() in document
False

The string coercion of a document yields the XML-encoded document as a string. See Serialization for details.

>>> document = Document("<root/>")
>>> str(document)
'<?xml version="1.0" encoding="UTF-8"?><root/>'

Attributes:

root

The root node of a document's content tree.

Methods:

`clone`	Clones the document with its contents.
`css_select`	This method proxies to the `delb.nodes.TagNode.css_select()` method of the document's `root` node.
`merge_text_nodes`	This method proxies to the `delb.nodes.TagNode.merge_text_nodes()` method of the document's `root` node.
`reduce_whitespace`	Collapses and trims whitespace as described in this TEI recommendation.
`save`	Saves the serialized document contents to a file.
`write`	Writes the serialized document contents to a file-like object.
`xpath`	This method proxies to the `delb.nodes.TagNode.xpath()` method of the document's `root` node.

clone() → Document

Clones the document with its contents.

Returns: A new document instance.

css_select(expression: str, namespaces: NamespaceDeclarations | None = None) → QueryResults: This method proxies to the delb.nodes.TagNode.css_select() method of the document’s root node.

merge_text_nodes(deep: bool = True): This method proxies to the delb.nodes.TagNode.merge_text_nodes() method of the document’s root node.

reduce_whitespace(): Collapses and trims whitespace as described in this TEI recommendation. Text in (sub-)trees with structured data should be trimmed further in subsequent processing. This routine implicitly merges all neighbouring text nodes. Note that the recommendation doesn’t sufficiently cover situations with neighbouring comments and processing instructions. For predictable results, remove such nodes before applying the whitespace reduction.

property root: delb.typing.TagNodeType: The root node of a document’s content tree.

save(path: Path, *, encoding: str = 'utf-8', format_options: FormatOptions | None = None, namespaces: NamespaceDeclarations | None = None, newline: None | str = None)

Saves the serialized document contents to a file. See Serialization for details.

Parameters:

path – The filesystem path to the target file.
encoding – The desired text encoding.
format_options – An instance of FormatOptions can be provided to configure formatting.
namespaces – A mapping of prefixes to namespaces. If not provided the root node’s namespace will serve as default namespace. Prefixes for undeclared namespaces are enumerated with the prefix ns.
newline – See io.TextIOWrapper for a detailed explanation of the parameter with the same name.

write(buffer: BinaryIO, *, encoding: str = 'utf-8', format_options: FormatOptions | None = None, namespaces: NamespaceDeclarations | None = None, newline: None | str = None)

Writes the serialized document contents to a file-like object. See Serialization for details.

Parameters:

buffer – A file-like object that the document is written to.
encoding – The desired text encoding.
format_options – An instance of FormatOptions can be provided to configure formatting.
namespaces – A mapping of prefixes to namespaces. If not provided the root node’s namespace will serve as default namespace. Prefixes for undeclared namespaces are enumerated with the prefix ns.
newline – See io.TextIOWrapper for a detailed explanation of the parameter with the same name.

xpath(expression: str, namespaces: NamespaceDeclarations | None = None) → QueryResults: This method proxies to the delb.nodes.TagNode.xpath() method of the document’s root node.

Document loaders

If you want or need to manipulate the availability of loaders or the order in which they are attempted, you can change the delb.plugins.plugin_manager.plugins.loaders object, which is a list. Its state affects your whole application. Please refer to this issue if you require finer control over these aspects.
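Why the order of that list matters can be pictured with a simplified, standalone model. This is not delb's implementation: the two loaders, the load function, and the convention that a loader returns a string when it declines a source are assumptions made purely for illustration.

```python
from types import SimpleNamespace
from typing import Any, Callable, Union

# Hypothetical convention: a loader returns a result on success or a
# string that explains why it could not handle the source.
LoaderOutcome = Union[dict, str]
Loader = Callable[[Any, SimpleNamespace], LoaderOutcome]


def bytes_loader(data: Any, config: SimpleNamespace) -> LoaderOutcome:
    if isinstance(data, bytes):
        return {"loader": "bytes", "content": data.decode("utf-8")}
    return "not a bytes object"


def string_loader(data: Any, config: SimpleNamespace) -> LoaderOutcome:
    if isinstance(data, str):
        return {"loader": "string", "content": data}
    return "not a string"


def load(source: Any, loaders: list, config: SimpleNamespace) -> dict:
    """Try the loaders in list order; the first one that succeeds wins."""
    reasons = []
    for loader in loaders:
        outcome = loader(source, config)
        if not isinstance(outcome, str):
            return outcome
        reasons.append(f"{loader.__name__}: {outcome}")
    raise ValueError("no loader accepted the source: " + "; ".join(reasons))


loaders = [bytes_loader, string_loader]
result = load("<root/>", loaders, SimpleNamespace())

# Reordering the list changes which loader is attempted first.
loaders.remove(string_loader)
loaders.insert(0, string_loader)
```

Because the list is shared state, a change like the reordering at the end would be visible to every caller that dispatches through it.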

Core

The core_loaders module provides a set of loaders to retrieve documents from various data sources.

_delb.plugins.core_loaders.buffer_loader(data: Any, config: SimpleNamespace) → LoaderResult: This loader loads a document from a file-like object that reads binary data.

_delb.plugins.core_loaders.path_loader(data: Any, config: SimpleNamespace) → LoaderResult: This loader loads from a file that is pointed at with a pathlib.Path instance. That instance will be bound to source_path on the document’s delb.Document.config attribute.

_delb.plugins.core_loaders.tag_node_loader(data: Any, config: SimpleNamespace) → LoaderResult: This loader either uses a root node (of type delb.typing.TagNodeType) that has no delb.Document context directly, or clones the node if it is a root node with such a context or any non-root node.

_delb.plugins.core_loaders.text_loader(data: Any, config: SimpleNamespace) → LoaderResult: Parses a string containing a full document.

Extra

If delb is installed with the web-loader extra, the dependencies required for this loader are installed as well. See Installation.

_delb.plugins.web_loader.web_loader(data: Any, config: SimpleNamespace, client: httpx.Client = <httpx.Client object>) → LoaderResult

This loader loads a document from a URL with the http or https scheme. The default httpx-client follows redirects and can partially be configured with environment variables. The URL will be bound to the name source_url on the document’s delb.Document.config attribute.

Loaders with specifically configured httpx-clients can build on this loader like so:

import httpx
from _delb.plugins import plugin_manager
from _delb.plugins.web_loader import web_loader


client = httpx.Client(follow_redirects=False, trust_env=False)

@plugin_manager.register_loader(before=web_loader)
def custom_web_loader(data, config):
    return web_loader(data, config, client=client)

Parser options

class delb.parser.ParserOptions(encoding: str | None = None, load_referenced_resources: bool = False, preferred_parsers: str | Sequence[str] = ('lxml', 'expat'), reduce_whitespace: bool = False, remove_comments: bool = False, remove_processing_instructions: bool = False, unplugged: bool = False)

The configuration options that define an XML parser’s behaviour.

The parser backend that is used is determined by the availability of the backends and the preferred_parsers setting. delb comes with two contributed implementations, and further ones that are based on _delb.plugins.XMLEventParserInterface can be added to the plugin manager.

Neither contributed implementation should be tasked with documents that refer to invalid Document Type Declarations (DTDs). Such documents may pass when the character entity declarations that the DTD includes aren’t used in the character data of the document, or they may lead to errors of different degrees of severity. Character entity declarations are the only DTD feature that is considered, in order to provide backward compatibility.

Neither will allow some non-word characters as part of XML names that should be allowed according to the 5th edition of the XML 1.0 specification, e.g. : or single combining characters.

Besides the _delb.exceptions.ParsingError exception and its derivations, the employed parsers may raise their own specific exceptions when confronted with invalid syntax and documents that are not well-formed.

The expat parser adapter depends on the xml.sax.expatreader module from the standard library that is available with many Python distributions.

The lxml-based parser requires the lxml package to be present in the interpreter environment. This parser is prone to crashing when processing invalid DTDs, and it also fails with uncommon DTD contents that are still valid according to the specification. To avoid crashes, it should not be used with encodings other than Unicode ones. Neither should it be used in conjunction with the load_referenced_resources option when processing larger files or trees.
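Whether the modules named above are importable in a given interpreter can be checked with the standard library alone. The following is only a sketch: the REQUIRED_MODULES mapping and the usable_parsers helper are hypothetical and do not mirror how delb itself selects a backend.

```python
from importlib.util import find_spec

# Assumed mapping of parser adapter names to a module each one needs.
REQUIRED_MODULES = {"lxml": "lxml", "expat": "xml.sax.expatreader"}


def usable_parsers(preferred=("lxml", "expat")):
    """Return the preferred adapter names whose module can be found."""
    if isinstance(preferred, str):
        preferred = (preferred,)
    usable = []
    for name in preferred:
        module = REQUIRED_MODULES.get(name)
        if module is None:
            continue  # unknown adapter name, nothing to check
        try:
            if find_spec(module) is not None:
                usable.append(name)
        except ModuleNotFoundError:
            # Raised when a parent package of a dotted name is missing.
            pass
    return usable


candidates = usable_parsers()
```

The resulting sequence could then be passed as preferred_parsers, which accepts either a single name or a sequence of names.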

encoding: str | None: This should be used for streams where the encoding is not noted in an XML document declaration or indicated by a BOM for Unicode encodings. It doesn’t affect parsing of data that is passed as str. Default: None.

load_referenced_resources: bool: Allows the loading of referenced external DTDs. Default: False.

preferred_parsers: str | Sequence[str]: A parser adapter name or a sequence of such names that are to be preferred. Default: ("lxml", "expat").

reduce_whitespace: bool: Reduce the content's whitespace. Default: False.

remove_comments: bool: Ignore comments. Default: False.

remove_processing_instructions: bool: Don’t include processing instructions in the parsed tree. Default: False.

unplugged: bool: Don’t load referenced resources over the network. Default: False.