Documents and Loaders
Documents
- class delb.Document(source, parser_options=None, klass=None, **config_options)
This class is the entrypoint to obtain a representation of an XML encoded text document.
- Parameters:
source – Anything that the configured loaders can make sense of to return a parsed document tree.
parser_options – A delb.ParserOptions instance to configure the used parser.
klass – Explicitly define the initialized class. This can be useful for applications that have default document subclasses in use.
config_options – Additional keyword arguments for the configuration of extension classes.
Any object can be passed for instantiation, provided that a suitable loader is available for the given source. See Document loaders for the default loaders that come with this package. Plugins can alter the available loaders, see Extending delb.
Nodes can be tested for membership in a document:
>>> document = Document("<root>text</root>")
>>> text_node = document.root[0]
>>> text_node in document
True
>>> text_node.clone() in document
False
The string coercion of a document yields an XML encoded stream as string. See Serialization for details.
>>> document = Document("<root/>")
>>> str(document)
'<?xml version="1.0" encoding="UTF-8"?><root/>'
Attributes:
config – Besides the parser_options, this property contains the namespaced data that extension classes and loaders may have stored.
epilogue – A list-like accessor to the nodes that follow the document's root node.
namespaces – The namespace mapping of the document's root node.
prologue – A list-like accessor to the nodes that precede the document's root node.
root – The root node of a document's content tree.
source_url – The source URL where a loader obtained the document's contents or None.
Methods:
clone() – Clones the document with its contents.
css_select() – This method proxies to the TagNode.css_select() method of the document's root node.
merge_text_nodes() – This method proxies to the TagNode.merge_text_nodes() method of the document's root node.
new_tag_node() – This method proxies to the TagNode.new_tag_node() method of the document's root node.
reduce_whitespace() – Collapses and trims whitespace as described in this TEI recommendation.
save() – Saves the serialized document contents to a file.
write() – Writes the serialized document contents to a file-like object.
xpath() – This method proxies to the TagNode.xpath() method of the document's root node.
- config: SimpleNamespace
Besides the parser_options, this property contains the namespaced data that extension classes and loaders may have stored.
- css_select(expression: str, namespaces: delb.typing.NamespaceDeclarations | None = None) → QueryResults
This method proxies to the TagNode.css_select() method of the document’s root node.
- epilogue
A list-like accessor to the nodes that follow the document’s root node. Note that nodes can’t be removed or replaced.
- merge_text_nodes()
This method proxies to the TagNode.merge_text_nodes() method of the document’s root node.
- new_tag_node(local_name: str, attributes: dict[AttributeAccessor, str] | None = None, namespace: str | None = None) → TagNode
This method proxies to the TagNode.new_tag_node() method of the document’s root node.
- prologue
A list-like accessor to the nodes that precede the document’s root node. Note that nodes can’t be removed or replaced.
- reduce_whitespace()
Collapses and trims whitespace as described in this TEI recommendation. Text in (sub-)trees with structured data should be trimmed further in subsequent processing. Implicitly merges all neighbouring text nodes.
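To give a rough idea of what collapsing and trimming means for a single string, here is a minimal sketch. It is not delb's implementation: the helper name collapse_and_trim is hypothetical, and the actual method operates on a whole document tree rather than on isolated strings.

```python
import re

# Hypothetical helper for illustration only; it is not part of delb.
def collapse_and_trim(text: str) -> str:
    # Collapse every run of XML whitespace (space, tab, CR, LF) into one space.
    collapsed = re.sub(r"[ \t\r\n]+", " ", text)
    # Trim the single spaces that may remain at either end.
    return collapsed.strip(" ")


example = collapse_and_trim("  A\n   line \t break ")
```

In this sketch the indentation and the line break inside the string end up as single spaces, while leading and trailing whitespace disappears entirely.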
- save(path: Path, pretty: bool | None = None, *, encoding: str = 'utf-8', format_options: FormatOptions | None = None, namespaces: NamespaceDeclarations | None = None, newline: None | str = None)
Saves the serialized document contents to a file. See Serialization for details.
- Parameters:
path – The filesystem path to the target file.
pretty – Deprecated. Adds indentation for human consumers when True.
encoding – The desired text encoding.
format_options – An instance of FormatOptions can be provided to configure formatting.
namespaces – A mapping of prefixes to namespaces. These override possible declarations from the parsed serialization that the document instance stems from. Prefixes for undeclared namespaces are enumerated with the prefix ns.
newline – See io.TextIOWrapper for a detailed explanation of the parameter with the same name.
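The interplay between parsed declarations, the namespaces argument and generated prefixes can be pictured with the following sketch. It is only an illustration under assumptions: the function resolve_prefixes is hypothetical and not part of delb, and both the per-prefix precedence and the numbering that starts at 0 are guesses at the described behaviour, not confirmed details.

```python
from typing import Iterable, Mapping, Optional

# Hypothetical helper for illustration only; it is not part of delb.
def resolve_prefixes(
    parsed_declarations: Mapping[str, str],
    used_namespaces: Iterable[str],
    overrides: Optional[Mapping[str, str]] = None,
) -> dict:
    # Start with the prefix-to-namespace declarations found while parsing.
    prefixes = dict(parsed_declarations)
    # Assumption: an explicitly passed mapping wins for the same prefix.
    prefixes.update(overrides or {})
    counter = 0
    for namespace in used_namespaces:
        if namespace in prefixes.values():
            continue  # this namespace already has a prefix
        # Assumption: generated prefixes are ns0, ns1, ...
        while f"ns{counter}" in prefixes:
            counter += 1
        prefixes[f"ns{counter}"] = namespace
    return prefixes


resolved = resolve_prefixes(
    parsed_declarations={"t": "http://example.org/a"},
    used_namespaces=["http://example.org/a", "http://example.org/b"],
    overrides={"t": "http://example.org/c"},
)
```

Here the passed mapping rebinds the prefix t, so both namespaces that are actually in use receive enumerated prefixes.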
- write(buffer: BinaryIO, pretty: bool | None = None, *, encoding: str = 'utf-8', format_options: FormatOptions | None = None, namespaces: delb.typing.NamespaceDeclarations | None = None, newline: None | str = None)
Writes the serialized document contents to a file-like object. See Serialization for details.
- Parameters:
buffer – A file-like object that the document is written to.
pretty – Deprecated. Adds indentation for human consumers when True.
encoding – The desired text encoding.
format_options – An instance of FormatOptions can be provided to configure formatting.
namespaces – A mapping of prefixes to namespaces. These override possible declarations from the parsed serialization that the document instance stems from. Prefixes for undeclared namespaces are enumerated with the prefix ns.
newline – See io.TextIOWrapper for a detailed explanation of the parameter with the same name.
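Since the newline parameter is documented by reference to io.TextIOWrapper, a short standard-library sketch may help to see its effect on the bytes that reach a binary buffer. The helper encode_with_newline is hypothetical and only demonstrates the behaviour of io.TextIOWrapper itself; how delb applies the parameter internally is not shown here.

```python
import io
from typing import Optional

# Hypothetical helper for illustration only; it is not part of delb.
def encode_with_newline(
    text: str, encoding: str = "utf-8", newline: Optional[str] = None
) -> bytes:
    buffer = io.BytesIO()
    # The wrapper encodes text and translates "\n" according to `newline`.
    wrapper = io.TextIOWrapper(buffer, encoding=encoding, newline=newline)
    wrapper.write(text)
    wrapper.flush()
    data = buffer.getvalue()
    wrapper.detach()  # release the buffer without closing it
    return data


unix_style = encode_with_newline("<root>\n</root>", newline="\n")
windows_style = encode_with_newline("<root>\n</root>", newline="\r\n")
```

With newline="\n" the line feed is written unchanged, while newline="\r\n" turns each line feed into a carriage return followed by a line feed. With None, the platform's default line separator is used.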
- xpath(expression: str, namespaces: delb.typing.NamespaceDeclarations | None = None) → QueryResults
This method proxies to the TagNode.xpath() method of the document’s root node.
Document loaders
If you need to change which loaders are available or the order in which they
are attempted, you can modify the
delb.plugins.plugin_manager.plugins.loaders object, which is a
list. Its state is reflected throughout your whole application. Please refer to
this issue when you require finer controls over these aspects.
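Why the order of that list matters can be illustrated with a self-contained sketch of a loader chain. Everything in it is a hypothetical stand-in rather than delb code: the loader functions, the try_loaders helper and the convention that a loader returns a string when it cannot handle the input are assumptions made for this example.

```python
from pathlib import Path
from types import SimpleNamespace

# All names below are hypothetical stand-ins; none of them belong to delb.
def string_loader(data, config):
    if isinstance(data, str) and data.lstrip().startswith("<"):
        return {"loaded_by": "string_loader", "content": data}
    return "The input is not a string that contains markup."

def path_like_loader(data, config):
    if isinstance(data, Path):
        config.source_path = data  # remember where the data came from
        return {"loaded_by": "path_like_loader", "content": None}
    return "The input is not a pathlib.Path instance."

def greedy_loader(data, config):
    # Accepts anything, so its position in the list decides who wins.
    return {"loaded_by": "greedy_loader", "content": data}

def try_loaders(source, loaders):
    config = SimpleNamespace()
    failures = {}
    for loader in loaders:  # the first loader that succeeds is used
        result = loader(source, config)
        if not isinstance(result, str):
            return result, config
        failures[loader.__name__] = result
    raise ValueError(f"No loader could handle the source: {failures}")


loaders = [string_loader, path_like_loader]
before, _ = try_loaders("<root/>", loaders)

loaders.insert(0, greedy_loader)  # change the order in which loaders run
after, _ = try_loaders("<root/>", loaders)
```

Because the list is consulted from front to back, inserting a loader at index 0 lets it take precedence over all others, and removing an entry makes that loader unavailable.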
Core
The core_loaders module provides a set of loaders to retrieve documents from various
data sources.
- _delb.plugins.core_loaders.buffer_loader(data: Any, config: SimpleNamespace) → LoaderResult
This loader loads a document from a file-like object.
- _delb.plugins.core_loaders.etree_loader(data: Any, config: SimpleNamespace) → LoaderResult
This loader processes lxml.etree._Element and lxml.etree._ElementTree instances.
- _delb.plugins.core_loaders.path_loader(data: Any, config: SimpleNamespace) → LoaderResult
This loader loads from a file that is pointed at with a pathlib.Path instance. That instance will be bound to source_path on the document’s Document.config attribute.
- _delb.plugins.core_loaders.tag_node_loader(data: Any, config: SimpleNamespace) → LoaderResult
This loader loads, or rather clones, a delb.TagNode instance and its descendant nodes.
Extra
If delb is installed with https-loader as extra, the required
dependencies for this loader are installed as well. See Installation.
- _delb.plugins.https_loader.https_loader(data: Any, config: SimpleNamespace, client: httpx.Client = <httpx.Client object>) → LoaderResult
This loader loads a document from a URL with the http or https scheme. The default httpx-client follows redirects and can partially be configured with environment variables. The URL will be bound to the name source_url on the document’s Document.config attribute.
Loaders with specifically configured httpx-clients can build on this loader like so:
import httpx
from _delb.plugins import plugin_manager
from _delb.plugins.https_loader import https_loader

# a client that neither follows redirects nor reads environment variables
client = httpx.Client(follow_redirects=False, trust_env=False)

# register the custom loader so that it is attempted before the default one
@plugin_manager.register_loader(before=https_loader)
def custom_https_loader(data, config):
    return https_loader(data, config, client=client)
Parser options
- class delb.ParserOptions(reduce_whitespace: bool = False, remove_comments: bool = False, remove_processing_instructions: bool = False, resolve_entities: bool = True, unplugged: bool = False)
The configuration options that define an XML parser’s behaviour.
- Parameters:
reduce_whitespace – Reduce the content's whitespace.
remove_comments – Ignore comments.
remove_processing_instructions – Don’t include processing instructions in the parsed tree.
resolve_entities – Resolve entities.
unplugged – Don’t load referenced resources over network.