Documents and Loaders
Documents
- class delb.Document(source, /, parser_options=None, klass=None, source_url=None, **config_options)
This class is the entrypoint to obtain a representation of an XML encoded text document.
- Parameters:
source – Anything that the configured loaders can make sense of to return a parsed document tree.
parser_options – A delb.parser.ParserOptions instance to configure the used parser.
klass – Explicitly define the initialized class. This can be useful for applications that have default document subclasses in use.
source_url – An optional source URL for situations where a reader can’t determine one.
config_options – Additional keyword arguments for the configuration of extension classes.
Any object can be passed for instantiation, provided that a suitable loader is available for the given source. See Document loaders for the default loaders that come with this package. Plugins can alter the available loaders; see Extending delb.
Nodes can be tested for membership in a document:
>>> document = Document("<root>text</root>")
>>> text_node = document.root[0]
>>> text_node in document
True
>>> text_node.clone() in document
False
The string coercion of a document yields an XML encoded stream as string. See Serialization for details.
>>> document = Document("<root/>")
>>> str(document)
'<?xml version="1.0" encoding="UTF-8"?><root/>'
Attributes:
root – The root node of a document's content tree.
Methods:
clone() – Clones the document with its contents.
css_select() – This method proxies to the delb.nodes.TagNode.css_select() method of the document's root node.
merge_text_nodes() – This method proxies to the delb.nodes.TagNode.merge_text_nodes() method of the document's root node.
reduce_whitespace() – Collapses and trims whitespace as described in this TEI recommendation.
save() – Saves the serialized document contents to a file.
write() – Writes the serialized document contents to a file-like object.
xpath() – This method proxies to the delb.nodes.TagNode.xpath() method of the document's root node.
- css_select(expression: str, namespaces: NamespaceDeclarations | None = None) -> QueryResults
This method proxies to the delb.nodes.TagNode.css_select() method of the document’s root node.
- merge_text_nodes(deep: bool = True)
This method proxies to the delb.nodes.TagNode.merge_text_nodes() method of the document’s root node.
- reduce_whitespace()
Collapses and trims whitespace as described in this TEI recommendation. Text in (sub-)trees with structured data should be trimmed further in subsequent processing. This routine implicitly merges all neighbouring text nodes.
Note that the recommendation doesn’t sufficiently cover situations with neighbouring comments and processing instructions. For predictable results, such nodes should be removed before applying the whitespace reduction.
- property root: delb.typing.TagNodeType
The root node of a document’s content tree.
- save(path: Path, *, encoding: str = 'utf-8', format_options: FormatOptions | None = None, namespaces: NamespaceDeclarations | None = None, newline: None | str = None)
Saves the serialized document contents to a file. See Serialization for details.
- Parameters:
path – The filesystem path to the target file.
encoding – The desired text encoding.
format_options – An instance of FormatOptions can be provided to configure formatting.
namespaces – A mapping of prefixes to namespaces. If not provided the root node’s namespace will serve as default namespace. Prefixes for undeclared namespaces are enumerated with the prefix ns.
newline – See io.TextIOWrapper for a detailed explanation of the parameter with the same name.
- write(buffer: BinaryIO, *, encoding: str = 'utf-8', format_options: FormatOptions | None = None, namespaces: NamespaceDeclarations | None = None, newline: None | str = None)
Writes the serialized document contents to a file-like object. See Serialization for details.
- Parameters:
buffer – A file-like object that the document is written to.
encoding – The desired text encoding.
format_options – An instance of FormatOptions can be provided to configure formatting.
namespaces – A mapping of prefixes to namespaces. If not provided the root node’s namespace will serve as default namespace. Prefixes for undeclared namespaces are enumerated with the prefix ns.
newline – See io.TextIOWrapper for a detailed explanation of the parameter with the same name.
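Since the newline parameter is documented by reference to io.TextIOWrapper, the following standard-library sketch shows what that parameter does when text is written to a binary buffer. The helper encode_with_newline is hypothetical and not part of delb; it only illustrates the semantics that save() and write() refer to.

```python
import io
from typing import Optional


def encode_with_newline(text: str, newline: Optional[str]) -> bytes:
    """Write text through a TextIOWrapper and return the resulting bytes."""
    buffer = io.BytesIO()
    wrapper = io.TextIOWrapper(buffer, encoding="utf-8", newline=newline)
    wrapper.write(text)
    wrapper.flush()
    data = buffer.getvalue()
    # Detach so that the wrapper does not close the buffer on cleanup.
    wrapper.detach()
    return data


text = "<root>\n</root>"
# "\n" leaves line feeds untouched when writing.
unix = encode_with_newline(text, "\n")
# "\r\n" translates every written line feed to CRLF.
windows = encode_with_newline(text, "\r\n")
# An empty string disables translation as well.
untranslated = encode_with_newline(text, "")
```

With None, the default, written line feeds are translated to the platform's line separator, so the output differs between operating systems.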
- xpath(expression: str, namespaces: NamespaceDeclarations | None = None) -> QueryResults
This method proxies to the delb.nodes.TagNode.xpath() method of the document’s root node.
Document loaders
If you want or need to manipulate the availability of or order in which loaders
are attempted, you can change the
delb.plugins.plugin_manager.plugins.loaders object which is a
list. Its state is reflected in your whole application. Please refer to
this issue when you require finer controls over these aspects.
Core
The core_loaders module provides a set of loaders to retrieve documents from various
data sources.
- _delb.plugins.core_loaders.buffer_loader(data: Any, config: SimpleNamespace) -> LoaderResult
This loader loads a document from a file-like object that reads binary data.
- _delb.plugins.core_loaders.path_loader(data: Any, config: SimpleNamespace) -> LoaderResult
This loader loads from a file that is pointed at with a pathlib.Path instance. That instance will be bound to source_path on the document’s delb.Document.config attribute.
Extra
If delb is installed with web-loader as extra, the required
dependencies for this loader are installed as well. See Installation.
- _delb.plugins.web_loader.web_loader(data: Any, config: SimpleNamespace, client: httpx.Client = <httpx.Client object>) -> LoaderResult
This loader loads a document from a URL with the http and https scheme. The default httpx-client follows redirects and can partially be configured with environment variables. The URL will be bound to the name source_url on the document’s delb.Document.config attribute.
Loaders with specifically configured httpx-clients can build on this loader like so:
import httpx

from _delb.plugins import plugin_manager
from _delb.plugins.web_loader import web_loader

client = httpx.Client(follow_redirects=False, trust_env=False)

@plugin_manager.register_loader(before=web_loader)
def custom_web_loader(data, config):
    return web_loader(data, config, client=client)
Parser options
- class delb.parser.ParserOptions(encoding: str | None = None, load_referenced_resources: bool = False, preferred_parsers: str | Sequence[str] = ('lxml', 'expat'), reduce_whitespace: bool = False, remove_comments: bool = False, remove_processing_instructions: bool = False, unplugged: bool = False)
The configuration options that define an XML parser’s behaviour.
The parser backend that is used is determined by availability and the preferred_parsers setting. delb comes with two contributed implementations, and further ones can be added to the plugin manager based on _delb.plugins.XMLEventParserInterface.
Both contributed implementations should not be tasked with documents that refer to invalid Document Type Declarations (DTDs). Such documents may pass when the included character entity declarations aren’t used in the character data of the document, or they may lead to errors of different degrees of severity. Character entity declarations are the only considered DTD feature to provide backward compatibility.
Both will not allow some non-word characters as part of XML names that should be allowed with the 5th edition of the XML 1.0 specification, e.g. : or single combining characters.
Besides the _delb.exceptions.ParsingError exception and its derivations, the employed parsers may raise their specific exceptions when confronted with invalid syntax and not-so-well-formed documents.
The expat parser adapter depends on the xml.sax.expatreader module from the standard library that is available with many Python distributions.
The lxml based parser requires the lxml package to be present in the interpreter environment. This parser is prone to crashing when processing invalid DTDs; it also fails with uncommon, but still valid by spec, DTD contents. It should not be used with encodings other than Unicode to avoid crashes. Neither should it be used in conjunction with the load_referenced_resources option when processing larger files / trees.
- encoding: str | None
This should be used for streams where the encoding is not noted in an XML document declaration or indicated by a BOM for Unicode encodings. It doesn’t affect parsing of data that is passed as str. Default: None.
- preferred_parsers: str | Sequence[str]
A parser adapter name, or a sequence of such names, that should preferably be used. Default: ("lxml", "expat").
- reduce_whitespace: bool
Reduce the content's whitespace. Default: False.