Extending delb
Note
Two packages are installed with delb: delb and _delb. As the underscore
indicates, the latter exposes the private parts of the API, while the former
re-exposes what is deemed public from it, along with additional contents.
As a rule of thumb, use the public API in applications and the private API
in delb extensions. By doing so, you can avoid circular dependencies if
your extension (or other code that it depends on) uses contents from the
_delb package.
delb offers a plugin system that allows Python packages to extend a few of
its mechanics.
A package that extends its functionality must provide entrypoint metadata
for an entrypoint group named delb that points to modules that contain
extensions. Some extensions have to be decorated with specific methods
of the plugin manager object. Authors are encouraged to prefix their package
names with delb- in order to increase discoverability.
These extension types are currently available:
Loaders are functions that try to make sense of any given input value, and if they can they return a parsed document.
Parser adapters can plug in XML parser implementations.
Custom XPath functions can be used in XPath predicate expressions.
Mixins add functionality / information to delb.Document (instead
of inheriting from it). That allows applications to rely optionally on
the availability of plugins and to combine various extensions.
Subclasses can be used to provide distinct models of arbitrary aspects for contents that are represented by a specific encoding. They can optionally implement a test method to qualify themselves as the default class for recognized contents.
The designated means of communication between extensions is the config
object: loaders receive it as their config argument, and a document instance
exposes it as the property of the same name.
Warning
A module that contains plugins and any module it is explicitly or implicitly
importing must not import anything from the delb package, because
that would initiate the collection of plugin implementations. And these
wouldn’t have been completely registered at that point. Import the required
module members from the according path in the _delb package instead.
Caution
Remember to reinstall a package under development whenever its entrypoint specification has changed.
There’s a repository that outlines the mechanics as developer reference: https://github.com/delb-xml/delb-py-reference-plugins
There’s also the delb-existdb project that implements the loader and document mixin plugin types to interact with eXist-db as storage.
Document loaders
Loaders are registered with this decorator:
- _delb.plugins.plugin_manager.register_loader(before: LoaderConstraint = None, after: LoaderConstraint = None) -> SecondOrderDecorator
Registers a document loader.
An example module that is specified as delb plugin for an IPFS loader might look like this:

from os import getenv
from types import SimpleNamespace
from typing import Any

from _delb.plugins import plugin_manager
from _delb.plugins.web_loader import web_loader
from _delb.typing import LoaderResult


IPFS_GATEWAY = getenv("IPFS_GATEWAY_PREFIX", "https://ipfs.io/ipfs/")


@plugin_manager.register_loader()
def ipfs_loader(source: Any, config: SimpleNamespace) -> LoaderResult:
    if isinstance(source, str) and source.startswith("ipfs://"):
        config.source_url = source
        config.ipfs_gateway_source_url = IPFS_GATEWAY + source[7:]
        return web_loader(config.ipfs_gateway_source_url, config)

    # return an indication why this loader didn't attempt to load in order
    # to support debugging
    return "The input value is not an URL with the ipfs scheme."

The source argument is what a delb.Document instance is initialized with as input data.

Note that the config argument that is passed to a loader function contains configuration data; it’s the delb.Document.config property after _init_config has been processed.

Loaders that retrieve a document from a URL should add the origin as a string to the config object as source_url.

You might want to specify a loader to be considered before or after another one. Let’s assume a loader shall figure out what to load from a remote XML resource that contains a reference to the actual document. That one would have to be considered before the one that loads XML documents from a URL with the https scheme:

from _delb.plugins import plugin_manager
from _delb.plugins.web_loader import web_loader
from _delb.typing import LoaderResult


@plugin_manager.register_loader(before=web_loader)
def mets_loader(source, config) -> LoaderResult:
    # loading logic here
    pass

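The contract described above, where a loader returns either a parsed document or a string that explains why it did not apply, can be pictured with a standard-library-only sketch. Everything in it is hypothetical: `try_loaders`, both loader functions and the dictionary that stands in for a parsed document are illustrations, and delb's actual dispatch may differ.

```python
from types import SimpleNamespace
from typing import Any, Callable, Union

# Stand-in for a loader result: a "document" (here a dict) or a reason string.
SketchResult = Union[dict, str]
Loader = Callable[[Any, SimpleNamespace], SketchResult]


def ipfs_like_loader(source: Any, config: SimpleNamespace) -> SketchResult:
    if isinstance(source, str) and source.startswith("ipfs://"):
        config.source_url = source
        return {"loaded_from": source}
    return "The input value is not an URL with the ipfs scheme."


def text_loader(source: Any, config: SimpleNamespace) -> SketchResult:
    if isinstance(source, str) and source.lstrip().startswith("<"):
        return {"loaded_from": "string"}
    return "The input value is not a string that looks like markup."


def try_loaders(source: Any, config: SimpleNamespace, loaders: list):
    """Return the first non-string result plus the refusals collected so far."""
    reasons = {}
    for loader in loaders:
        result = loader(source, config)
        if not isinstance(result, str):
            return result, reasons
        reasons[loader.__name__] = result
    return None, reasons


config = SimpleNamespace()
document, reasons = try_loaders(
    "<root/>", config, [ipfs_like_loader, text_loader]
)
print(document)
print(reasons)
```

The order of the list plays the role that the before and after constraints play for registered loaders: an earlier loader gets the chance to claim an input first.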
Parser adapters
- class _delb.plugins.XMLEventParserInterface(options: ParserOptions, base_url: str | None, encoding: str)
This is the base class for parser adapters. After initialization, their parse() method will be called to iterate over parser events. Instances don’t have to care about their state beyond the parsing of one input stream as they’re only employed once.
- Parameters:
options – The parsing options the user passed with the input stream.
base_url – The base URL for resolving references.
encoding – This is the encoding that was either provided by the user, noted in an XML document declaration or indicated by a Byte Order Mark. It can also be the fallback value utf-8 if none of these was available.
- name: str
The parser can be selected by this class attribute’s value as (member of) a delb.parser.ParserOptions.preferred_parsers setting.
- abstractmethod parse(data: BinaryReader | str) -> Iterator[Event]
This method must be implemented and yield the parsed contents in document order as Event tuples.
The parsed contents are passed with such constructs:
- _delb.parser.Event
An XML stream event tuple consists of two values. The first is a member of EventType that signals the type of event, the second carries the relevant data. All data must be stripped of XML markup characters and character data must be completely parsed and normalized. All XML names and character entities must be resolved.
XML event tuples’ structure
Event member
Data type
Notes
(target, content)
TagEventData | None
If data is provided, the tree builder can detect inconsistent tagging in debug mode.
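To illustrate what yielding events in document order with resolved entities means, here is a sketch built on the standard library's expat binding. It does not use delb's types: `SketchEventType`, its member names and the shape of the data values are stand-ins invented for this example, and a real adapter would subclass XMLEventParserInterface and yield delb's own Event tuples.

```python
from enum import Enum, auto
from typing import Any, Iterator
from xml.parsers import expat


class SketchEventType(Enum):
    # hypothetical stand-in, not delb's EventType
    TAG_START = auto()
    TAG_END = auto()
    TEXT = auto()


def iter_events(data: str) -> Iterator[tuple[SketchEventType, Any]]:
    """Yield (event type, data) tuples in document order."""
    events: list[tuple[SketchEventType, Any]] = []

    def start(name: str, attributes: dict) -> None:
        events.append((SketchEventType.TAG_START, (name, attributes)))

    def end(name: str) -> None:
        events.append((SketchEventType.TAG_END, name))

    def text(content: str) -> None:
        events.append((SketchEventType.TEXT, content))

    parser = expat.ParserCreate()
    # deliver adjacent character data, including resolved entities, as one chunk
    parser.buffer_text = True
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = text
    # the events are collected eagerly for brevity; this is not streaming
    parser.Parse(data, True)
    yield from events


for event in iter_events("<root><a>x &amp; y</a></root>"):
    print(event)
```

Note that the text event carries `x & y`: the entity reference has been resolved and no markup characters remain in the data.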
Document mixin classes
Document mixin classes are registered by subclassing them from this base class:
- class _delb.plugins.DocumentMixinBase
By deriving a subclass from this one, a document extension class is registered as plugin. These are supposed to add additional attributes to a document, e.g. derived data or methods to interact with storage systems. All attributes of an extension should share a common prefix that terminates with an underscore, e.g. storage_load, storage_save, etc.
This base class also terminates the chain of calls for methods that can be implemented by mixin classes. Any implementation of such a method must call the base class’s one, e.g.:

from types import SimpleNamespace

from _delb.plugins import DocumentMixinBase

from magic_wonderland import play_disco


class MyExtension(DocumentMixinBase):
    # this method can be implemented by any extension class
    @classmethod
    def _init_config(cls, config, kwargs):
        config.my_extension = SimpleNamespace(
            tonality=kwargs.pop("my_extension_tonality")
        )
        super()._init_config(config, kwargs)

    # this method is specific to this extension
    def my_extension_makes_magic(self):
        play_disco(self.config.my_extension.tonality)

- classmethod _init_config(config: SimpleNamespace, kwargs: dict[str, Any])
The kwargs argument contains the additional keyword arguments that a delb.Document instance is called with. Extension classes that expect configuration data must process their specific arguments by clearing them from the kwargs dictionary, e.g. with dict.pop(), and preferably storing the final configuration data in a types.SimpleNamespace and adding it to the types.SimpleNamespace passed as config with the extension’s name. The initially mentioned keyword arguments should be prefixed with that name as well. This method is called before the loaders try to read and parse the given source for a document.
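How several mixins cooperate through _init_config can be tried without delb. The sketch below is an assumption-laden illustration: `SketchMixinBase` stands in for DocumentMixinBase, and the two extensions, their keyword arguments and `SketchDocument` are hypothetical names. It shows each extension popping its own prefixed argument and passing the rest along with super().

```python
from types import SimpleNamespace
from typing import Any


class SketchMixinBase:
    """Stand-in for DocumentMixinBase: terminates the chain of calls."""

    @classmethod
    def _init_config(cls, config: SimpleNamespace, kwargs: dict[str, Any]) -> None:
        pass


class StorageExtension(SketchMixinBase):
    @classmethod
    def _init_config(cls, config, kwargs):
        # claim the prefixed argument and store it under the extension's name
        config.storage = SimpleNamespace(url=kwargs.pop("storage_url", None))
        super()._init_config(config, kwargs)


class CitationExtension(SketchMixinBase):
    @classmethod
    def _init_config(cls, config, kwargs):
        config.citation = SimpleNamespace(
            style=kwargs.pop("citation_style", "plain")
        )
        super()._init_config(config, kwargs)


class SketchDocument(StorageExtension, CitationExtension):
    """Combines both mixins the way a document class would."""


config = SimpleNamespace()
kwargs = {"storage_url": "https://db.example.org", "citation_style": "mla"}
SketchDocument._init_config(config, kwargs)

print(config.storage.url)
print(config.citation.style)
print(kwargs)
```

Because every implementation calls super(), both extensions get to see the arguments, and the dictionary is empty once each has removed what belongs to it.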
Document subclasses
Of course one can simply subclass delb.Document to add functionality.
Besides using a subclass directly, you can let delb.Document figure out
which subclass is an appropriate representation of the content. Subclasses can
claim that by implementing a staticmethod() named __class_test__ that
takes the document’s root node and the configuration to return a boolean that
indicates the subclass is suited. The first class to return a True value
will immediately be chosen, so be aware of the possible ambiguity in complex
setups. It is only ensured that subclasses are considered before others that
they derive from.
Subclasses are registered by importing them into an application; they must not be pointed to by entrypoint definitions.
Here’s an example:
import types

from delb import Document, TagNode


class TEIDocument(Document):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **{**kwargs, "collapse_whitespace": True})

    @staticmethod
    def __class_test__(root: TagNode, config: types.SimpleNamespace) -> bool:
        return root.universal_name == "{http://www.tei-c.org/ns/1.0}TEI"

    @property
    def title(self) -> str:
        return self.css_select('titleStmt title[type="main"]').first.full_text


document = Document("""\
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc><titleStmt>
<title type="main">The Document's Title</title>
</titleStmt></fileDesc></teiHeader></TEI>
""")

if isinstance(document, TEIDocument):
    print(document.title)
else:
    print("Sorry, I don't know how to retrieve the document's title.")

The Document's Title
The recommendations as laid out for DocumentMixinBase._init_config also apply to subclasses that
process configuration arguments in their __init__ method before
calling the superclass’s one.
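The selection rule stated in this section (derived classes are considered before the classes they derive from, and the first positive test wins) can be pictured with a sketch that needs nothing but the standard library. It is not delb's implementation: `SketchDocument`, both subclasses, `derived_first` and `pick_class` are hypothetical, and a plain string stands in for the root node's universal name.

```python
from types import SimpleNamespace
from typing import Iterator

TEI_ROOT = "{http://www.tei-c.org/ns/1.0}TEI"


class SketchDocument:
    """Stand-in for delb.Document."""


class TEISketch(SketchDocument):
    @staticmethod
    def __class_test__(root_name: str, config: SimpleNamespace) -> bool:
        return root_name == TEI_ROOT


class TEILetterSketch(TEISketch):
    @staticmethod
    def __class_test__(root_name: str, config: SimpleNamespace) -> bool:
        is_letter = getattr(config, "genre", None) == "letter"
        return root_name == TEI_ROOT and is_letter


def derived_first(cls: type) -> Iterator[type]:
    """Yield all subclasses so that each comes before its base classes."""
    for subclass in cls.__subclasses__():
        yield from derived_first(subclass)
        yield subclass


def pick_class(root_name: str, config: SimpleNamespace) -> type:
    for candidate in derived_first(SketchDocument):
        # only classes that define their own test take part
        if "__class_test__" not in vars(candidate):
            continue
        if candidate.__class_test__(root_name, config):
            return candidate
    return SketchDocument


print(pick_class(TEI_ROOT, SimpleNamespace(genre="letter")).__name__)
print(pick_class(TEI_ROOT, SimpleNamespace()).__name__)
print(pick_class("root", SimpleNamespace()).__name__)
```

Among sibling classes the order in this sketch simply follows definition order, which mirrors the ambiguity the text warns about for complex setups.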
XPath functions
Custom XPath functions are registered with this decorator:
- _delb.plugins.PluginManager.register_xpath_function(self, arg: str | GenericDecorated) -> SecondOrderDecorator | GenericDecorated
Custom XPath functions can be defined as shown in the following example. The first argument to a function is always an instance of _delb.xpath.EvaluationContext, followed by the expression’s arguments.

from delb import Document
from _delb.plugins import plugin_manager
from _delb.xpath import EvaluationContext


@plugin_manager.register_xpath_function("is-last")
def is_last(context: EvaluationContext) -> bool:
    return context.position == context.size


@plugin_manager.register_xpath_function
def lowercase(_, string: str) -> str:
    return string.lower()


document = Document("<root><node/><node foo='BAR'/></root>")
print(document.xpath("//*[is-last() and lowercase(@foo)='bar']").first)

<node foo="BAR"/>
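The example uses the decorator in two ways: with a name as argument and bare. A decorator that accepts both forms can be written along the following lines. This is a generic sketch of the pattern, not delb's code; `register_function` and the `xpath_functions` registry are hypothetical, and the functions omit the evaluation context for brevity.

```python
from typing import Callable, Union

# hypothetical registry that maps XPath function names to callables
xpath_functions: dict[str, Callable] = {}


def register_function(arg: Union[str, Callable]):
    if isinstance(arg, str):
        # used as @register_function("name"): return the actual decorator
        def decorator(function: Callable) -> Callable:
            xpath_functions[arg] = function
            return function

        return decorator

    # used as bare @register_function: the argument is the function itself
    xpath_functions[arg.__name__] = arg
    return arg


@register_function("is-last")
def is_last(position: int, size: int) -> bool:
    return position == size


@register_function
def lowercase(string: str) -> str:
    return string.lower()


print(sorted(xpath_functions))
```

Passing a name is what makes identifiers such as is-last possible, since a hyphen cannot appear in a Python function name.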