Extending delb

Note

There are actually two packages that are installed with delb: delb and _delb. As the underscore indicates, the latter exposes private parts of the API, while the former re-exports what is deemed public from it, along with additional contents. As a rule of thumb, use the public API in applications and the private API in delb extensions. Doing so avoids circular dependencies if your extension (or other code that it depends on) uses contents from the _delb package.

delb offers a plugin system that makes a few of its mechanics extensible with Python packages. A package that extends its functionality must provide entrypoint metadata for an entrypoint group named delb that points to the modules that contain extensions. Some extensions have to be decorated with specific methods of the plugin manager object. Authors are encouraged to prefix their package names with delb- in order to increase discoverability.

These extension types are currently available:

Document loaders

Loaders are functions that try to make sense of any given input value and, if they can, return a parsed document.


Parser adapters

Parser adapters plug XML parser implementations into delb.


XPath functions

Custom XPath functions can be used in XPath predicate expressions.


Document mixin classes

Mixins add functionality / information to delb.Document (instead of inheriting from it). That allows applications to rely optionally on the availability of plugins and to combine various extensions.


Document subclasses

Subclasses can be used to provide distinct models of arbitrary aspects for contents that are represented by a specific encoding. They can optionally implement a test method to qualify themselves as the default class for recognized contents.


The designated means of communication between extensions is the config object: loaders receive it as their config argument, and a document instance exposes it as the property of the same name.

Warning

A module that contains plugins, and any module that it imports explicitly or implicitly, must not import anything from the delb package, because that would trigger the collection of plugin implementations before they have been completely registered. Import the required module members from the corresponding path in the _delb package instead.

Caution

Remember to re-install a package that is under development whenever its entrypoint specification has changed.

There’s a repository that outlines the mechanics as developer reference: https://github.com/delb-xml/delb-py-reference-plugins

There’s also the delb-existdb project that implements the loader and document mixin plugin types to interact with eXist-db as storage.

Document loaders

Loaders are registered with this decorator:

_delb.plugins.plugin_manager.register_loader(before: LoaderConstraint = None, after: LoaderConstraint = None) → SecondOrderDecorator

Registers a document loader.

An example module that is specified as delb plugin for an IPFS loader might look like this:

from os import getenv
from types import SimpleNamespace
from typing import Any

from _delb.plugins import plugin_manager
from _delb.plugins.web_loader import web_loader
from _delb.typing import LoaderResult


IPFS_GATEWAY = getenv("IPFS_GATEWAY_PREFIX", "https://ipfs.io/ipfs/")


@plugin_manager.register_loader()
def ipfs_loader(source: Any, config: SimpleNamespace) -> LoaderResult:
    if isinstance(source, str) and source.startswith("ipfs://"):

        config.source_url = source
        config.ipfs_gateway_source_url = IPFS_GATEWAY + source[7:]

        return web_loader(config.ipfs_gateway_source_url, config)

    # return an indication why this loader didn't attempt to load in order
    # to support debugging
    return "The input value is not a URL with the ipfs scheme."

The source argument is what a delb.Document instance is initialized with as input data.

Note that the config argument that is passed to a loader function contains configuration data; it is the delb.Document.config property after _init_config has been processed.

Loaders that retrieve a document from a URL should add the origin as a string to the config object as source_url.
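The contract described above (a loader either returns a result or a string that explains why it didn't load) can be modelled without delb. This is only a simplified sketch: the try_loaders function, the two toy loaders and the use of a dict as result are assumptions made for illustration and don't reflect how delb dispatches loaders internally.

```python
from types import SimpleNamespace
from typing import Any, Callable, Union

# stand-in for a loader's return value: a loaded "document" or a reason string
Result = Union[dict, str]
Loader = Callable[[Any, SimpleNamespace], Result]


def file_scheme_loader(source: Any, config: SimpleNamespace) -> Result:
    if isinstance(source, str) and source.startswith("file://"):
        config.source_url = source  # record the origin for other extensions
        return {"loaded_from": source}
    return "The input value is not a URL with the file scheme."


def markup_loader(source: Any, config: SimpleNamespace) -> Result:
    if isinstance(source, str) and source.lstrip().startswith("<"):
        return {"markup": source}
    return "The input value doesn't look like XML markup."


LOADERS: list[Loader] = [file_scheme_loader, markup_loader]


def try_loaders(source: Any, config: SimpleNamespace, loaders: list[Loader]) -> dict:
    """Ask each loader in order and collect the reasons of those that decline."""
    reasons = {}
    for loader in loaders:
        result = loader(source, config)
        if not isinstance(result, str):
            return result
        reasons[loader.__name__] = result
    raise ValueError(f"No loader accepted the input: {reasons}")
```

Collecting the returned strings is what makes them useful for debugging: when no loader accepts an input, the error can report why each one declined.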

You might want to specify a loader to be considered before or after another one. Let’s assume a loader shall figure out what to load from a remote XML resource that contains a reference to the actual document. That one would have to be considered before the one that loads XML documents from a URL with the https scheme:

from _delb.plugins import plugin_manager
from _delb.plugins.web_loader import web_loader
from _delb.typing import LoaderResult


@plugin_manager.register_loader(before=web_loader)
def mets_loader(source, config) -> LoaderResult:
    # loading logic here
    pass

Parser adapters

class _delb.plugins.XMLEventParserInterface(options: ParserOptions, base_url: str | None, encoding: str)

This is the base class for parser adapters. After initialization, their parse() method is called to iterate over parser events. Instances don’t have to care about their state beyond the parsing of one input stream as they’re only employed once.

Parameters:

options – The parsing options the user passed with the input stream.
base_url – The base URL for resolving references.
encoding – This is the encoding that was either provided by the user, noted in an XML document declaration or indicated by a Byte Order Mark. If none of these was available, it is the fallback value utf-8.

name: str – The parser can be selected by this class attribute’s value as (member of) a delb.parser.ParserOptions.preferred_parsers setting.

abstractmethod parse(data: BinaryReader | str) → Iterator[Event] – This method must be implemented and yield the parsed contents in document order as Event tuples.

The parsed contents are passed with such constructs:

_delb.parser.Event

An XML stream event tuple consists of two values. The first is a member of EventType that signals the type of event, the second carries the relevant data. All data must be stripped of XML markup characters and character data must be completely parsed and normalized. All XML names and character entities must be resolved.

XML event tuples’ structure
Event member	Data type	Notes
`EventType.Comment`	`str`
`EventType.ProcessingInstruction`	`tuple` [`str`, `str`]	`(target, content)`
`EventType.TagStart`	`TagEventData`
`EventType.TagEnd`	`TagEventData` | `None`	If data is provided, the tree builder can detect inconsistent tagging in debug mode.
`EventType.Text`	`str`

enum _delb.parser.EventType(value)

Member Type: int

Valid values are as follows:

Comment = <EventType.Comment: 1>

ProcessingInstruction = <EventType.ProcessingInstruction: 2>

TagStart = <EventType.TagStart: 3>

TagEnd = <EventType.TagEnd: 4>

Text = <EventType.Text: 5>

class _delb.parser.TagEventData(namespace, local_name, attributes)

attributes: _AttributesData – The attributes data must not contain XML namespace declarations. It is optional in case of an EventType.TagEnd.

local_name: str

namespace: str
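To illustrate the shape of such an event stream, here is a sketch that turns the callbacks of the standard library's expat bindings into event tuples. It is deliberately not an actual adapter: EventType and TagEventData are local stand-ins that only mirror the structures documented above, the function iter_events is hypothetical, and representing attributes as a dict keyed by (namespace, local name) tuples is an assumption, since the layout of _AttributesData isn't described here.

```python
from enum import Enum
from typing import Any, Iterator, NamedTuple
from xml.parsers import expat


class EventType(Enum):  # local stand-in mirroring _delb.parser.EventType
    Comment = 1
    ProcessingInstruction = 2
    TagStart = 3
    TagEnd = 4
    Text = 5


class TagEventData(NamedTuple):  # local stand-in mirroring _delb.parser.TagEventData
    namespace: str
    local_name: str
    attributes: dict  # assumed layout: {(namespace, local_name): value}


def split_name(name: str) -> tuple[str, str]:
    # expat reports namespaced names as "<namespace> <local name>"
    namespace, _, local_name = name.rpartition(" ")
    return namespace, local_name


def iter_events(data: str) -> Iterator[tuple[EventType, Any]]:
    events: list[tuple[EventType, Any]] = []
    text: list[str] = []

    def flush_text() -> None:
        # character data may arrive in chunks and is emitted as one Text event
        if text:
            events.append((EventType.Text, "".join(text)))
            text.clear()

    def on_start(name: str, attributes: dict) -> None:
        flush_text()
        attrs = {split_name(key): value for key, value in attributes.items()}
        events.append((EventType.TagStart, TagEventData(*split_name(name), attrs)))

    def on_end(name: str) -> None:
        flush_text()
        events.append((EventType.TagEnd, None))  # data is optional for TagEnd

    def on_comment(content: str) -> None:
        flush_text()
        events.append((EventType.Comment, content))

    def on_pi(target: str, content: str) -> None:
        flush_text()
        events.append((EventType.ProcessingInstruction, (target, content)))

    # with a namespace separator expat resolves names and consumes declarations
    parser = expat.ParserCreate(namespace_separator=" ")
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CommentHandler = on_comment
    parser.ProcessingInstructionHandler = on_pi
    parser.CharacterDataHandler = text.append
    parser.Parse(data, True)
    flush_text()
    # the whole input is parsed before the collected events are yielded
    yield from events
```

An actual adapter would wrap logic of this kind in the parse() method of a class derived from XMLEventParserInterface and use the Event related members of _delb.parser instead of the stand-ins.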

Document mixin classes

Document mixin classes are registered by deriving them from this base class:

class _delb.plugins.DocumentMixinBase

By deriving a subclass from this one, a document extension class is registered as a plugin. These are supposed to add attributes to a document, e.g. derived data or methods to interact with storage systems. All attributes of an extension should share a common prefix that terminates with an underscore, e.g. storage_load, storage_save, etc.

This base class also acts as termination for methods that can be implemented by mixin classes. Any implementation of such a method must call the base class’ one, e.g.:

from types import SimpleNamespace

from _delb.plugins import DocumentMixinBase
from magic_wonderland import play_disco


class MyExtension(DocumentMixinBase):

    # this method can be implemented by any extension class
    @classmethod
    def _init_config(cls, config, kwargs):
        config.my_extension = SimpleNamespace(
            tonality=kwargs.pop("my_extension_tonality")
        )
        super()._init_config(config, kwargs)

    # this method is specific to this extension
    def my_extension_makes_magic(self):
        play_disco(self.config.my_extension.tonality)

classmethod _init_config(config: SimpleNamespace, kwargs: dict[str, Any]) – The kwargs argument contains the additional keyword arguments that a delb.Document instance is initialized with. Extension classes that expect configuration data must process their specific arguments by removing them from the kwargs dictionary, e.g. with dict.pop(), and should preferably store the final configuration data in a types.SimpleNamespace that is added under the extension’s name to the types.SimpleNamespace passed as config. The keyword arguments themselves should be prefixed with that name as well. This method is called before the loaders try to read and parse the given source for a document.
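Why every implementation has to call its base class becomes apparent when several mixins are combined. The following sketch demonstrates the cooperative chain with a local stand-in for DocumentMixinBase; the classes StorageExtension, AuditExtension and Combined as well as their keyword arguments are invented for this illustration.

```python
from types import SimpleNamespace
from typing import Any


class MixinBase:  # local stand-in for _delb.plugins.DocumentMixinBase
    @classmethod
    def _init_config(cls, config: SimpleNamespace, kwargs: dict[str, Any]) -> None:
        pass  # terminates the chain of super() calls


class StorageExtension(MixinBase):
    @classmethod
    def _init_config(cls, config: SimpleNamespace, kwargs: dict[str, Any]) -> None:
        # consume only the arguments that carry this extension's prefix
        config.storage = SimpleNamespace(url=kwargs.pop("storage_url", None))
        super()._init_config(config, kwargs)


class AuditExtension(MixinBase):
    @classmethod
    def _init_config(cls, config: SimpleNamespace, kwargs: dict[str, Any]) -> None:
        config.audit = SimpleNamespace(enabled=kwargs.pop("audit_enabled", False))
        super()._init_config(config, kwargs)


class Combined(StorageExtension, AuditExtension):
    pass


config = SimpleNamespace()
kwargs = {"storage_url": "https://example.org/db", "audit_enabled": True, "other": 1}
Combined._init_config(config, kwargs)
# both extensions stored their namespace, only the foreign argument remains
print(config.storage.url, config.audit.enabled, kwargs)
```

If StorageExtension omitted the super() call, AuditExtension would never see the keyword arguments and its configuration would be missing.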

Document subclasses

Of course one can simply subclass delb.Document to add functionality. Besides using a subclass directly, you can let delb.Document figure out which subclass is an appropriate representation of the content. Subclasses can claim that by implementing a staticmethod() named __class_test__ that takes the document’s root node and the configuration and returns a boolean that indicates whether the subclass is suited. The first class to return a True value will immediately be chosen, so be aware of the possible ambiguity in complex setups. It is only ensured that subclasses are considered before others that they derive from.

Subclasses are registered by importing them into an application; they must not be pointed to by entrypoint definitions.

Here’s an example:

import types

from delb import Document, TagNode


class TEIDocument(Document):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **{**kwargs, "collapse_whitespace": True})

    @staticmethod
    def __class_test__(root: TagNode, config: types.SimpleNamespace) -> bool:
        return root.universal_name == "{http://www.tei-c.org/ns/1.0}TEI"

    @property
    def title(self) -> str:
        return self.css_select('titleStmt title[type="main"]').first.full_text

document = Document("""\
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc><titleStmt>
<title type="main">The Document's Title</title>
</titleStmt></fileDesc></teiHeader></TEI>
""")

if isinstance(document, TEIDocument):
    print(document.title)
else:
    print("Sorry, I don't know how to retrieve the document's title.")

The Document's Title

The recommendations as laid out for DocumentMixinBase._init_config also apply to subclasses that process configuration arguments in their __init__ method before calling the super class’ one.
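The selection rules stated at the top of this section (derived classes are tested before the classes they derive from, and the first positive test wins) can be modelled as follows. This is a sketch of one possible resolution order, not delb's implementation; BaseDocument, Root, AnnotatedTEIDocument, candidates and select_class are stand-ins invented for the illustration.

```python
from types import SimpleNamespace
from typing import Iterator, NamedTuple

TEI_ROOT = "{http://www.tei-c.org/ns/1.0}TEI"


class Root(NamedTuple):  # stand-in for a document's root node
    universal_name: str


class BaseDocument:  # stand-in for delb.Document
    pass


class TEIDocument(BaseDocument):
    @staticmethod
    def __class_test__(root: Root, config: SimpleNamespace) -> bool:
        return root.universal_name == TEI_ROOT


class AnnotatedTEIDocument(TEIDocument):
    @staticmethod
    def __class_test__(root: Root, config: SimpleNamespace) -> bool:
        return root.universal_name == TEI_ROOT and getattr(config, "annotated", False)


def candidates(cls: type) -> Iterator[type]:
    """Yield all subclasses depth-first so that derived classes come first."""
    for subclass in cls.__subclasses__():
        yield from candidates(subclass)
        yield subclass


def select_class(root: Root, config: SimpleNamespace) -> type:
    for candidate in candidates(BaseDocument):
        # only classes that define their own test take part in the selection
        if "__class_test__" in vars(candidate) and candidate.__class_test__(root, config):
            return candidate
    return BaseDocument
```

With these definitions a TEI root node and config.annotated set to True select AnnotatedTEIDocument, a TEI root node alone selects TEIDocument, and any other root node falls back to BaseDocument. Between sibling classes the order is not meaningful, which is the ambiguity the text warns about.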

XPath functions

Custom XPath functions are registered with this decorator:

_delb.plugins.PluginManager.register_xpath_function(self, arg: str | GenericDecorated) → SecondOrderDecorator | GenericDecorated

Custom XPath functions can be defined as shown in the following example. The first argument to a function is always an instance of _delb.xpath.EvaluationContext followed by the expression’s arguments.

from delb import Document
from _delb.plugins import plugin_manager
from _delb.xpath import EvaluationContext


@plugin_manager.register_xpath_function("is-last")
def is_last(context: EvaluationContext) -> bool:
    return context.position == context.size

@plugin_manager.register_xpath_function
def lowercase(_, string: str) -> str:
    return string.lower()


document = Document("<root><node/><node foo='BAR'/></root>")
print(document.xpath("//*[is-last() and lowercase(@foo)='bar']").first)

<node foo="BAR"/>
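As the signature and the example show, the decorator can be applied with a name or bare. How a decorator can support both forms is sketched below with a local registry; register_function and REGISTRY are hypothetical names, and that the bare form registers the function under its own name is inferred from the example above rather than from delb's source.

```python
from typing import Callable, Union

REGISTRY: dict[str, Callable] = {}  # hypothetical stand-in for the plugin manager's store


def register_function(arg: Union[str, Callable]):
    if isinstance(arg, str):
        # used as @register_function("name"): return the actual decorator
        def decorator(function: Callable) -> Callable:
            REGISTRY[arg] = function
            return function

        return decorator

    # used as bare @register_function: the function itself was passed
    REGISTRY[arg.__name__] = arg
    return arg


@register_function("is-even")
def is_even(_, number: int) -> bool:
    return number % 2 == 0


@register_function
def lowercase(_, string: str) -> str:
    return string.lower()


print(sorted(REGISTRY))
```

The explicit name is needed for XPath function names such as is-last or is-even, since a hyphen can't be part of a Python identifier.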