Extending delb

Note

Two packages are installed with delb: delb and _delb. As the underscore indicates, the latter exposes private parts of the API, while the former re-exports what is deemed public from it, plus additional contents. As a rule of thumb, use the public API in applications and the private API in delb extensions. Doing so avoids circular dependencies if your extension (or other code that it depends on) uses contents from the _delb package.

delb offers a plugin system that lets Python packages extend a few of its mechanics. A package that extends its functionality must provide entrypoint metadata for an entrypoint group named delb that points to the modules that contain extensions. Some extensions have to be decorated with specific methods of the plugin manager object. Authors are encouraged to prefix their package names with delb- in order to increase discoverability.

These extension types are currently available:

  • document loaders

  • document mixin classes

  • document subclasses

  • XPath functions

Loaders are functions that try to make sense of any given input value, and if they can they return a parsed document.

Mixin classes add functionality and attributes to the delb.Document class (instead of inheriting from it). That allows applications to optionally rely on the availability of plugins and to combine various extensions.

Subclasses can be used to provide distinct models of arbitrary aspects of contents that are represented by a specific encoding. They can optionally implement a test method that qualifies them as the default class for recognized contents.

The designated means of communication between extensions is the config object: it is passed to loaders as the config argument and is available on a document instance as the property of the same name.

Warning

A module that contains plugins, and any module that it imports explicitly or implicitly, must not import anything from the delb module itself, because that would initiate the collection of plugin implementations before these have been completely registered. Import from the _delb module instead.

Caution

Remember to re-install a package that is in development whenever its entrypoint specification changes.

There’s a repository that outlines the mechanics as developer reference: https://github.com/delb-xml/delb-py-reference-plugins

There’s also the snakesist project that implements the loader and document mixin plugin types to interact with eXist-db as storage.

Document loaders

Loaders are registered with this decorator:

_delb.plugins.plugin_manager.register_loader(before: LoaderConstraint = None, after: LoaderConstraint = None) → Callable

Registers a document loader.

An example module that is specified as a delb plugin for an IPFS loader might look like this:

from os import getenv
from types import SimpleNamespace
from typing import Any

from _delb.plugins import plugin_manager
from _delb.plugins.https_loader import https_loader
from _delb.typing import LoaderResult


IPFS_GATEWAY = getenv("IPFS_GATEWAY_PREFIX", "https://ipfs.io/ipfs/")


@plugin_manager.register_loader()
def ipfs_loader(source: Any, config: SimpleNamespace) -> LoaderResult:
    if isinstance(source, str) and source.startswith("ipfs://"):

        config.source_url = source
        config.ipfs_gateway_source_url = IPFS_GATEWAY + source[7:]

        return https_loader(config.ipfs_gateway_source_url, config)

    # return an indication why this loader didn't attempt to load in order
    # to support debugging
    return "The input value is not a URL with the ipfs scheme."

The source argument is what a Document instance is initialized with as input data.

Note that the config argument that is passed to a loader function contains configuration data; it is the delb.Document.config property after _init_config has been processed.

Loaders that retrieve a document from a URL should add the origin as a string to the config object as source_url.
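
The loader conventions described so far (return a document if the input is understood, otherwise return a string that says why not, and record a URL origin as source_url) can be illustrated without delb at all. The following is a toy sketch, not delb's implementation: the load function, both loaders and the dict that stands in for a parsed document are hypothetical, and nothing is fetched from the network.

```python
from types import SimpleNamespace
from typing import Any, Callable, Union

# Hypothetical stand-ins: a "document" is a plain dict here, and a loader
# returns either such a dict or a string that explains why it did not apply.
ToyResult = Union[dict, str]
ToyLoader = Callable[[Any, SimpleNamespace], ToyResult]


def text_loader(source: Any, config: SimpleNamespace) -> ToyResult:
    if isinstance(source, str) and source.lstrip().startswith("<"):
        return {"markup": source}
    return "The input value is not a string that contains markup."


def url_loader(source: Any, config: SimpleNamespace) -> ToyResult:
    if isinstance(source, str) and source.startswith("https://"):
        config.source_url = source  # record the origin as a string
        # No request is made; a placeholder stands in for the fetched markup.
        return {"markup": f"<!-- would be fetched from {source} -->"}
    return "The input value is not a URL with the https scheme."


def load(source: Any, config: SimpleNamespace, loaders: list[ToyLoader]) -> dict:
    """Ask each loader in turn and collect the reasons of those that decline."""
    reasons = {}
    for loader in loaders:
        result = loader(source, config)
        if not isinstance(result, str):
            return result
        reasons[loader.__name__] = result
    raise ValueError(f"No loader accepted the input: {reasons}")


config = SimpleNamespace()
document = load("https://example.org/doc.xml", config, [text_loader, url_loader])
print(config.source_url)  # https://example.org/doc.xml
```

Collecting the returned strings is what makes a failed load debuggable: the error message names every loader together with its reason for declining.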

You might want to specify a loader to be considered before or after another one. Let’s assume a loader shall figure out what to load from a remote XML resource that contains a reference to the actual document. That one would have to be considered before the one that loads XML documents from a URL with the https scheme:

from _delb.plugins import plugin_manager
from _delb.plugins.https_loader import https_loader
from _delb.typing import LoaderResult


@plugin_manager.register_loader(before=https_loader)
def mets_loader(source, config) -> LoaderResult:
    # loading logic here
    pass

Document extensions

Document mixin classes are registered by subclassing them from this base class:

class _delb.plugins.DocumentMixinBase

By deriving a subclass from this one, a document extension class is registered as a plugin. These are supposed to add attributes to a document, e.g. derived data or methods to interact with storage systems. All attributes of an extension should share a common prefix that terminates with an underscore, e.g. storage_load, storage_save, etc.

This base class also acts as the termination point for methods that mixin classes can implement. Any implementation of such a method must call the base class's implementation, e.g.:

from types import SimpleNamespace

from _delb.plugins import DocumentMixinBase
from magic_wonderland import play_disco


class MyExtension(DocumentMixinBase):

    # this method can be implemented by any extension class
    @classmethod
    def _init_config(cls, config, kwargs):
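        # pop with a default so that documents created without this
        # argument don't raise a KeyError
        config.my_extension = SimpleNamespace(
            conf=kwargs.pop("my_extension_conf", None)
        )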
        super()._init_config(config, kwargs)

    # this method is specific to this extension
    def my_extension_makes_magic(self):
        play_disco()

classmethod _init_config(config: SimpleNamespace, kwargs: dict[str, Any])

The kwargs argument contains the additional keyword arguments that a Document instance is initialized with. Extension classes that expect configuration data must process their specific arguments by removing them from the kwargs dictionary, e.g. with dict.pop(). Preferably they store the final configuration data in a types.SimpleNamespace and add it, under the extension's name, to the types.SimpleNamespace that is passed as config. The keyword arguments themselves should be prefixed with that name as well. This method is called before the loaders try to read and parse the given source for a document.
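
How several extensions can each consume their own prefixed keyword arguments through one cooperative call chain is easiest to see in a delb-free toy. In this sketch all class names (ToyMixinBase, StorageExtension, CacheExtension, ToyDocument) and argument names are hypothetical, and rejecting leftover keyword arguments is an assumption of the toy, not a statement about delb's behaviour.

```python
from types import SimpleNamespace
from typing import Any


class ToyMixinBase:
    """Hypothetical stand-in for the terminating base class."""

    @classmethod
    def _init_config(cls, config: SimpleNamespace, kwargs: dict[str, Any]) -> None:
        # End of the chain: there is nothing left to delegate to.
        pass


class StorageExtension(ToyMixinBase):
    @classmethod
    def _init_config(cls, config: SimpleNamespace, kwargs: dict[str, Any]) -> None:
        # Consume only the arguments that carry this extension's prefix.
        config.storage = SimpleNamespace(url=kwargs.pop("storage_url", None))
        super()._init_config(config, kwargs)


class CacheExtension(ToyMixinBase):
    @classmethod
    def _init_config(cls, config: SimpleNamespace, kwargs: dict[str, Any]) -> None:
        config.cache = SimpleNamespace(size=kwargs.pop("cache_size", 128))
        super()._init_config(config, kwargs)


class ToyDocument(StorageExtension, CacheExtension):
    def __init__(self, source: str, **kwargs: Any) -> None:
        self.config = SimpleNamespace()
        # One call walks the whole method resolution order.
        self._init_config(self.config, kwargs)
        if kwargs:  # toy behaviour: anything left over was not claimed
            raise TypeError(f"Unexpected keyword arguments: {sorted(kwargs)}")
        self.source = source


document = ToyDocument("<root/>", storage_url="https://example.org/db", cache_size=16)
print(document.config.storage.url)  # https://example.org/db
print(document.config.cache.size)  # 16
```

If one of the extensions omitted its super() call, the extensions behind it in the method resolution order would never see the keyword arguments, which is why every implementation has to delegate.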

Document subclasses

Of course one can simply subclass delb.Document to add functionality. Besides using a subclass directly, you can let delb.Document figure out which subclass is an appropriate representation of the content. Subclasses can claim that by implementing a staticmethod() named __class_test__ that takes the document's root node and the configuration and returns a boolean that indicates whether the subclass is suited. The first class to return a True value is chosen immediately, so be aware of possible ambiguity in complex setups. It is only ensured that subclasses are considered before the classes that they derive from.

Subclasses are registered by importing them into an application; they must not be pointed to by entrypoint definitions.

Here’s an example:

import types

from delb import Document, TagNode


class TEIDocument(Document):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **{**kwargs, "collapse_whitespace": True})

    @staticmethod
    def __class_test__(root: TagNode, config: types.SimpleNamespace) -> bool:
        return root.universal_name == "{http://www.tei-c.org/ns/1.0}TEI"

    @property
    def title(self) -> str:
        return self.css_select('titleStmt title[type="main"]').first.full_text

document = Document("""\
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc><titleStmt>
<title type="main">The Document's Title</title>
</titleStmt></fileDesc></teiHeader></TEI>
""")

if isinstance(document, TEIDocument):
    print(document.title)
else:
    print("Sorry, I don't know how to retrieve the document's title.")

The Document's Title

The recommendations laid out for DocumentMixinBase._init_config also apply to subclasses that process configuration arguments in their __init__ method before calling the superclass's.
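
The selection rule stated in this section (the first class whose test returns True wins, and subclasses are considered before the classes they derive from) can be pictured with a delb-free toy. This is only one possible way to obtain such an order, not a description of delb's internals; candidates, pick_class and all classes below are hypothetical, and the root node is reduced to a plain string.

```python
from types import SimpleNamespace
from typing import Iterator


class ToyDocument:
    @staticmethod
    def __class_test__(root: str, config: SimpleNamespace) -> bool:
        return False


def candidates(cls: type) -> Iterator[type]:
    """Yield all subclasses of cls, each one after its own subclasses."""
    for subclass in cls.__subclasses__():
        yield from candidates(subclass)
        yield subclass


def pick_class(base: type, root: str, config: SimpleNamespace) -> type:
    for candidate in candidates(base):
        if candidate.__class_test__(root, config):
            return candidate  # the first positive test wins
    return base  # fall back to the generic class


class TEIDoc(ToyDocument):
    @staticmethod
    def __class_test__(root: str, config: SimpleNamespace) -> bool:
        return root == "TEI"


class LetterDoc(TEIDoc):
    @staticmethod
    def __class_test__(root: str, config: SimpleNamespace) -> bool:
        return root == "TEI" and getattr(config, "genre", None) == "letter"


print(pick_class(ToyDocument, "TEI", SimpleNamespace(genre="letter")).__name__)
print(pick_class(ToyDocument, "TEI", SimpleNamespace()).__name__)
print(pick_class(ToyDocument, "html", SimpleNamespace()).__name__)
```

Both TEIDoc and LetterDoc accept a letter, so the more specific class is only picked because it is tested first. Two unrelated sibling classes with overlapping tests would have no such guarantee, which is the ambiguity the text warns about.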

XPath functions

Custom XPath functions are registered with this decorator:

_delb.plugins.PluginManager.register_xpath_function(self, arg: Callable | str) → Callable

Custom XPath functions can be defined as shown in the following example. The first argument to a function is always an instance of _delb.xpath.EvaluationContext, followed by the expression's arguments.

from delb import Document
from _delb.plugins import plugin_manager
from _delb.xpath import EvaluationContext


@plugin_manager.register_xpath_function("is-last")
def is_last(context: EvaluationContext) -> bool:
    return context.position == context.size

@plugin_manager.register_xpath_function
def lowercase(_, string: str) -> str:
    return string.lower()


document = Document("<root><node/><node foo='BAR'/></root>")
print(document.xpath("/*[is-last() and lowercase(@foo)='bar']").first)

<node foo="BAR"/>
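
The decorator in the example is used in two forms, with an explicit name and bare. A decorator that accepts either a callable or a string can be written as in the following delb-free sketch. The registry toy_functions and the function register_function are hypothetical, and taking the name unchanged from the function's __name__ is an assumption of this toy rather than a documented rule of delb.

```python
from types import SimpleNamespace
from typing import Callable, Union

# Hypothetical registry that maps function names to implementations.
toy_functions: dict[str, Callable] = {}


def register_function(arg: Union[Callable, str]) -> Callable:
    if callable(arg):
        # Bare form, @register_function: the function itself is passed in.
        toy_functions[arg.__name__] = arg
        return arg

    # Named form, @register_function("name"): return the actual decorator.
    def decorator(func: Callable) -> Callable:
        toy_functions[arg] = func
        return func

    return decorator


@register_function("is-last")
def is_last(context: SimpleNamespace) -> bool:
    return context.position == context.size


@register_function
def lowercase(_: object, string: str) -> str:
    return string.lower()


print(sorted(toy_functions))  # ['is-last', 'lowercase']
print(toy_functions["is-last"](SimpleNamespace(position=2, size=2)))  # True
```

The named form is what makes names such as is-last possible, since a hyphen is not allowed in a Python identifier.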