About the design¶
tl;dr¶
lxml resp. libxml2 are powerful tools, but have an unergonomic data model to work with encoded text. Let’s build a DOM API inspired wrapper around it.
Aspects & Caveats¶
The library is partly opinionated to encourage good practices and to be more pythonic. Therefore its behaviour deviates from lxml and ignores stuff:
Serializations of documents are UTF-8 encoded by default and always start with an XML declaration.
Comment and processing instruction nodes are shadowed by default, see
delb.altered_default_filters()
on how to make them accessible.CDATA nodes are not accessible at all, but are retained and appear in serializations; unless you [DANGER ZONE] manipulate the tree. Depending on your actions you might encounter no alterations or a complete loss of these nodes within the root node. [/DANGER ZONE]
If you need to apply bad practices anyway, you can fall back to tinker with the
lxml objects that are bound to TagNode._etree_obj
.
Reasoning¶
XML can be used to encode text documents, examples for such uses would be the Open Document Format and XML-TEI. It’s more prevalent use however is to encode data that is to be consumed by algorithms as configuration, measurements, application events, various metadata and so on.
Python is a high-level, general programming language with a vast ecosystem, notably including diverse scientific communities and tools. As such it is well suited to solve and cause problems in the humanities related field of Research Software Engineering by programmers with diverse educational background and expertise.
The commonly used Python library to parse and interact with a representation
of an XML document is lxml. Other libraries like the
xml.etree.ElementTree
module from the Python standard library shall not
be discussed due to their insignificance and shortcomings. It is notable that at
least these two share significant design aspects of Java APIs which is perceived
as weird and clumsy in Python code.
lxml is a wrapper around libxml2 which was developed by the GNOME developers
for other data than text documents. Data that is strictly structured and
expectable. Text documents are different in these regards as natural languages
and variety of media allow and lead to unprecedented manifestations for which an
encoding mixes different abstracted encapsulations of text fragments. And they
are formulated and structured for human consumers, and often printing devices.
So, what’s wrong with lxml? Not much, it’s a rock-solid, fast API for XML documents with known issues and known workarounds that represents the full glory of what a full-fledged family of specification implies - of which a lot is not of concern for the problems at hand and occasionally make solutions complicated. The one aspect that’s very wrong in the context of text processing is unfortunately its central model of elements and data/text that is attached to it in two different relations. In particular the notion of an element tail makes the whole enchilada tricky to traverse / navigate. The existence of this attribute is due to the insignificance of these fragments of an XML stream in the aforementioned, common uses of XML.
Now it is time for an example, given this document snippet:
<p rendition="#justify">
<lb/>Nomadisme épique exploratori-
<lb/><space dim="horizontal" quantity="2" units="chars"/>sme urbain <hi rendition="#b">Art des voya-
<lb/><space dim="horizontal" quantity="2" units="chars"/>ges</hi> et des promenades
</p>
Here’s a graphical representation of the markup with etree’s elements and their text and tail attributes:
When thinking about a paragraph of text, a way to conceptualize it is as a
sequence of sentences, formed by a series of words, a sequence of graphemes,
and punctuation. That’s a quite simple cascade of categories which can be very
well anticipated when processing text. With that mental model, line beginnings
would rather be considered to be on the same level as signs, but “Nomadisme …”
turns out not to be a sibling object of the object that represents the line
beginning and is not in direct relation with the paragraph. In lxml’s model it
is rather an attribute tail
assigned to that line beginning. The text
contents of the object that represents the hi
element and its children give
a good impression how hairy simple tasks can become.
An algorithm that shall remove line beginnings, space representations and
concatenate broken words would need a function that removes the element objects
in question while preserving the text fragments in its meaningful sequence
attached to the text
and tail
properties. In case these have no content,
their value of None
leads to different operations to concatenate strings.
Here’s a working implementation from the inxs library 1 for that data
model:
def remove_elements(
*elements: etree.ElementBase,
keep_children=False,
preserve_text=False,
preserve_tail=False
) -> None:
""" Removes the given elements from its tree. Unless ``keep_children`` is
passed as ``True``, its children vanish with it into void. If
``preserve_text`` is ``True``, the text and tail of a deleted element
will be preserved either in its left sibling's tail or its parent's
text. """
for element in elements:
if preserve_text and element.text:
previous = element.getprevious()
if previous is None:
parent = element.getparent()
if parent.text is None:
parent.text = ''
parent.text += element.text
else:
if previous.tail is None:
previous.tail = element.text
else:
previous.tail += element.text
if preserve_tail and element.tail:
if keep_children and len(element):
if element[-1].tail:
element[-1].tail += element.tail
else:
element[-1].tail = element.tail
else:
previous = element.getprevious()
if previous is None:
parent = element.getparent()
if parent.text is None:
parent.text = ''
parent.text += element.tail
else:
if len(element):
if element[-1].tail is None:
element[-1].tail = element.tail
else:
element[-1].tail += element.tail
else:
if previous.tail is None:
previous.tail = ''
previous.tail += element.tail
if keep_children:
for child in element:
element.addprevious(child)
element.getparent().remove(element)
That by itself is enough to simply remove the space
elements, but also
considering word-breaking dashes to wrap everything up is a similar piece of
routine of its own. And these quirks come back to you steadily while actual
markup is regularly more complex.
Now obviously, the data model that lxml / libxml2 provides is not up to standard Python ergonomics to solve text encoding problems.
There must be a better way.
There is a notable other markup parser that wraps around lxml, BeautifulSoup4.
It carries some interesting ideas, but is overall too opinionated and partly
ambiguous to implement a stringent data model. A notable specification of a
solid model for text documents is the DOM API that is even implemented in the
standard library’s xml.dom.minidom
module. But it lacks an XPath
interface and rumours say it’s slow. To illustrate the more accessible model
with a better locatability, here’s another graphical representation of the
markup example from above with text content in an emancipated, dedicated node
type:
Note that text containing attributes appear in document order which promises an eased lookaround. So, the obvious (?) idea is to wrap lxml in a layer that takes the DOM API as paradigmatic inspiration, looks and behaves pythonic while keeping the wrapped powers accessible.
Now with that API available, this is what an equivalent of the horribly complicated function seen above would look like:
@altered_default_filters()
def remove_nodes(*nodes: NodeBase, keep_children=False):
""" Removes the given nodes from its tree. Unless ``keep_children`` is
passed as ``True``, its children vanish with it into void. """
for node in nodes:
node.detach(retain_child_nodes=keep_children)
Frequently Asked Questions¶
Isn’t XML an obsolete format for text encoding, invented by boomers and cynically held up by their Generation X apologists? Why don’t you put your efforts in developing new approaches such as storing text in a graph database?
We think that XML-based encodings are actually very well suited for long-term usable text representations with a broad potential for granularity of capturing and semantic annotations. Not only is the data format simple enough to hold a full artifact in a self-contained file, but we also consider the duality of a format that can be handled both as stream and as tree as a helpful feature to address the physical and logical dimensions of a text and its manifestation. That is advantageous over depending on a heavy-weight database system. We acknowledge unquestionably that the specifications in the XML universe are often over-engineered, partly stuck in the times of their genesis and thus (euphemistically put) no fun. As a direct result of that the availability of implementations for contemporary development contexts and their ergonomics are poor, if available at all for a platform. That is what delb is addressing.
What are your long-term goals with this project?
Currently we want to flesh out a concluded user interface that lets developers concentrate on their tasks and not on the shortcomings and idiosyncrasies of available tools in the Pythoniverse. After modeling that API as a wrapper around lxml the aim is now to replace it piece by piece with a Pure Python™ implementation that will later be transpiled to C extension code with mypyc.
Eventually we’d like to re-conquer the world wide web and make unagitated, long texts and Stooges clips its predominant content again. On that occasion, fuck you Mark, fuck off Jeff, go fuck yourself Peter and all the other fucknut character masks. What a disgusting misery it is that the capital created from Tim’s ideas.
- 1
The
inxs
library failed. Yet it made clear which layer in Python XML Text handling needs to be fixed.