Serialization

Overview

delb lets users produce content-agnostic XML serializations that are as readable as they can get.

The formatting options are controlled with delb.FormatOptions instances. These are either passed to serialization methods like delb.Document.save() and delb.TagNode.serialize(), or assigned to the class property delb.DefaultStringOptions.format_options, which applies to any conversion of documents and nodes to strings (e.g. with print() or str) on the general application level. Passing or setting None lets the serializer simply dump a tree’s contents to an XML stream without any extra effort.

The default “pretty formatting” is best suited to align structured data. It adds newlines and optional indentation to mark content nesting without adding or removing significant (text encoding related) whitespace.

The provided content wrapping implementation always assumes that a document contains mixed content (i.e. both structured and text data), while it prefers a continuous presentation of several nodes on a line with a constrained width. It’s currently quite expensive to compute. It also doesn’t account for combining Unicode characters, so wrapped text lengths are determined by the number of codepoints, not by the glyphs that are actually represented.
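The difference between codepoints and perceived characters can be illustrated with the standard library alone. This is only a sketch: the helper count_without_combining_marks is a made-up name, not part of delb, and it ignores other multi-codepoint glyphs such as emoji sequences.

```python
import unicodedata

# "ü" written as a base letter that is followed by a combining diaeresis
decomposed = "u\u0308ber"
# the same word with the precomposed character U+00FC
precomposed = unicodedata.normalize("NFC", decomposed)


def count_without_combining_marks(text: str) -> int:
    """Counts codepoints and skips combining marks (hypothetical helper)."""
    return sum(1 for char in text if unicodedata.combining(char) == 0)


print(len(decomposed))  # number of codepoints, combining mark included
print(len(precomposed))  # number of codepoints of the composed form
print(count_without_combining_marks(decomposed))
```

Both strings render as the same four glyphs, yet a wrapping algorithm that counts codepoints treats the decomposed form as one position longer.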

Serializations that alter whitespace for indentation or wrapping also apply a general reduction of insignificant whitespace as described in this TEI recommendation. Furthermore, it is guaranteed that a serialized stream from a document that was normalized according to this set of rules [1] will be parsed back to the identical tree if these rules are applied again [2].
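The idea behind that guarantee is that the whitespace reduction is idempotent. The following sketch shows this for a single text string. It assumes a strongly simplified rule set (collapse runs of XML whitespace, trim both ends) and is not delb’s actual implementation; reduce_whitespace is a hypothetical name.

```python
import re

# XML only considers these four characters as whitespace
XML_WHITESPACE = " \t\r\n"


def reduce_whitespace(text: str) -> str:
    """Collapses runs of XML whitespace to one space and trims both ends."""
    return re.sub(f"[{XML_WHITESPACE}]+", " ", text).strip(XML_WHITESPACE)


raw = "  Liquidae   voluptatis\n\tet liberae potest.  "
once = reduce_whitespace(raw)
twice = reduce_whitespace(once)

print(repr(once))
# applying the same rules a second time changes nothing
print(once == twice)
```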

There is currently no plan to support the production of character or entity references, let alone a notion of which extent that would cover.

As it stands, custom serialization algorithms should be implemented as standalone units: the contributed implementations aren’t suited for derivations, and the architecture isn’t ready for extensions in that regard yet.

A re-implementation with performance as its primary goal, due later in the beta phase, shall then allow customizations. Ideas can be contributed and discussed in this thread.

Examples and comparisons

The following input serves as an example:

Source stream
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE document [
    <!ENTITY Uuml "&#x00dc;">
]>
<document>
    <head>
    <title>&Uuml;ber suum venire vetuit.</title>
    <identifiers>
    <id>5dcebaa4-8760-4286-be7a-6b25fd6ae0f0</id>
    <id>15b0c526-585f-4daf-a45f-411929ffbd61</id>
    </identifiers>
    <locations>
    <shelf>A0</shelf>
    <shelf>B1c</shelf>
    </locations>
    <contributors xmlns:pi="https://pirates.code/">
        <contributor height="~5&#x22;" pi:greeting='Ay&#x27;e!'>Ed Teach</contributor>
    </contributors>
    </head>
<body>
<text>
<lb/>Liquidae voluptatis et liberae potest.
Atqui pugnantibus et <hi>contra</hi>riis studiis consiliisque semper utens nihil
<lb/>quieti videre, nihil tranquilli potest.
<lb/>Quodsi vitam omnem continent, neglegentur?
<lb/>Nam, ut sint illa vendibiliora, haec uberiora certe sunt.
Quamquam id quidem licebit iis existimare, qui legerint.
Nos autem hanc omnem quaestionem de finibus bonorum et malorum, <lb/>id est voluptatem.
Homines optimi non intellegunt totam rationem everti, <lb/>si ita res se habeat.
Nam si ea sola voluptas esset, <choice><sic>que</sic><corr>quae</corr></choice> quasi delapsa <lb/>de caelo est ad quiete vivendum, caritatem, praesertim cum omnino nulla sit causa peccandi.
Quae enim cupiditates a natura proficiscuntur, <lb/>facile explentur sine ulla iniuria, quae autem inanes sunt, iis parendum non est.
<lb/>Nihil enim desiderabile concupiscunt, plusque in ipsa iniuria detrimenti est quam in.
</text>
</body>
</document>

Note that there’s no indication of the document’s type or schema.

delb

Indentation and aligned attributes

Production
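from pathlib import Path

from delb import FormatOptions

# `document` is a delb.Document that was parsed from the source stream
document.save(
    Path("serialization-example-delb-indented.xml"),
    format_options=FormatOptions(
        align_attributes=True,
        indentation="  ",
        width=0,
    ),
)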
Product
 1<?xml version="1.0" encoding="UTF-8"?>
 2<document xmlns:pi="https://pirates.code/">
 3  <head>
 4    <title>
 5      Über suum venire vetuit.
 6    </title>
 7    <identifiers>
 8      <id>
 9        5dcebaa4-8760-4286-be7a-6b25fd6ae0f0
10      </id>
11      <id>
12        15b0c526-585f-4daf-a45f-411929ffbd61
13      </id>
14    </identifiers>
15    <locations>
16      <shelf>
17        A0
18      </shelf>
19      <shelf>
20        B1c
21      </shelf>
22    </locations>
23    <contributors>
24      <contributor
25              height="~5&quot;"
26         pi:greeting="Ay'e!"
27      >
28        Ed Teach
29      </contributor>
30    </contributors>
31  </head>
32  <body>
33    <text>
34      <lb/>Liquidae voluptatis et liberae potest. Atqui pugnantibus et
35      <hi>
36        contra
37      </hi>riis studiis consiliisque semper utens nihil
38      <lb/>quieti videre, nihil tranquilli potest.
39      <lb/>Quodsi vitam omnem continent, neglegentur?
40      <lb/>Nam, ut sint illa vendibiliora, haec uberiora certe sunt. Quamquam id quidem licebit iis existimare, qui legerint. Nos autem hanc omnem quaestionem de finibus bonorum et malorum,
41      <lb/>id est voluptatem. Homines optimi non intellegunt totam rationem everti,
42      <lb/>si ita res se habeat. Nam si ea sola voluptas esset,
43      <choice>
44        <sic>
45          que
46        </sic><corr>
47          quae
48        </corr>
49      </choice>
50      quasi delapsa
51      <lb/>de caelo est ad quiete vivendum, caritatem, praesertim cum omnino nulla sit causa peccandi. Quae enim cupiditates a natura proficiscuntur,
52      <lb/>facile explentur sine ulla iniuria, quae autem inanes sunt, iis parendum non est.
53      <lb/>Nihil enim desiderabile concupiscunt, plusque in ipsa iniuria detrimenti est quam in.
54    </text>
55  </body>
56</document>
  • l.2) namespace declarations are consolidated at the root node

  • l.2) attribute values are enclosed in double quotes for better readability

  • l.5) Unicode characters are produced where the input used an entity reference

  • l.25-26) this is the align_attributes option in action
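The alignment seen in lines 25-26 of the product can be approximated by right-justifying the attribute names to the length of the longest one. This is a minimal sketch under that assumption, not delb’s implementation; align_attributes and escape_attribute are hypothetical helper names.

```python
from xml.sax.saxutils import escape


def escape_attribute(value: str) -> str:
    """Escapes &, < and > as well as double quotes for use in an attribute."""
    return escape(value, {'"': "&quot;"})


def align_attributes(attributes: dict[str, str], indent: str = "") -> list[str]:
    """Renders one attribute per line with vertically aligned equal signs."""
    if not attributes:
        return []
    # the longest attribute name determines the column of the equal signs
    name_width = max(len(name) for name in attributes)
    return [
        f'{indent}{name.rjust(name_width)}="{escape_attribute(value)}"'
        for name, value in attributes.items()
    ]


lines = align_attributes({"height": '~5"', "pi:greeting": "Ay'e!"})
print("\n".join(lines))
```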

Text wrapping

Production
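from pathlib import Path

from delb import FormatOptions

# `document` is a delb.Document that was parsed from the source stream
document.save(
    Path("serialization-example-delb-wrapped.xml"),
    format_options=FormatOptions(
        align_attributes=False,
        indentation="  ",
        width=59,
    ),
)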
Product
 1<?xml version="1.0" encoding="UTF-8"?>
 2<document xmlns:pi="https://pirates.code/">
 3  <head>
 4    <title>Über suum venire vetuit.</title>
 5    <identifiers>
 6      <id>5dcebaa4-8760-4286-be7a-6b25fd6ae0f0</id>
 7      <id>15b0c526-585f-4daf-a45f-411929ffbd61</id>
 8    </identifiers>
 9    <locations> <shelf>A0</shelf> <shelf>B1c</shelf> </locations>
10    <contributors>
11      <contributor height="~5&quot;" pi:greeting="Ay'e!">
12        Ed Teach
13      </contributor>
14    </contributors>
15  </head>
16  <body>
17    <text>
18      <lb/>Liquidae voluptatis et liberae potest. Atqui
19      pugnantibus et <hi>contra</hi>riis studiis consiliisque
20      semper utens nihil <lb/>quieti videre, nihil tranquilli
21      potest. <lb/>Quodsi vitam omnem continent, neglegentur?
22      <lb/>Nam, ut sint illa vendibiliora, haec uberiora certe
23      sunt. Quamquam id quidem licebit iis existimare, qui
24      legerint. Nos autem hanc omnem quaestionem de finibus
25      bonorum et malorum, <lb/>id est voluptatem. Homines optimi
26      non intellegunt totam rationem everti, <lb/>si ita res se
27      habeat. Nam si ea sola voluptas esset,
28      <choice><sic>que</sic><corr>quae</corr></choice>
29      quasi delapsa <lb/>de caelo est ad quiete vivendum,
30      caritatem, praesertim cum omnino nulla sit causa peccandi.
31      Quae enim cupiditates a natura proficiscuntur, <lb/>facile
32      explentur sine ulla iniuria, quae autem inanes sunt, iis
33      parendum non est. <lb/>Nihil enim desiderabile
34      concupiscunt, plusque in ipsa iniuria detrimenti est quam
35      in.
36    </text>
37  </body>
38</document>
  • l.9) lacking semantic knowledge, structured data is also placed onto one line when it fits

  • l.28) nested content is kept on one line if it fits

lxml

Production
from pathlib import Path
from lxml import etree

etree.indent(tree)  # `tree` was parsed from the source stream
with Path("serialization-example-lxml.xml").open("wb") as f:
    tree.write(f, pretty_print=True)
Product
 1<?xml version='1.0' encoding='ASCII'?>
 2<!DOCTYPE document [
 3<!ENTITY Uuml "&#x00dc;">
 4]>
 5<document>
 6  <header>&#220;ber suum venire vetuit.</header>
 7  <identifiers>
 8    <id>5dcebaa4-8760-4286-be7a-6b25fd6ae0f0</id>
 9    <id>15b0c526-585f-4daf-a45f-411929ffbd61</id>
10  </identifiers>
11  <locations>
12    <shelf>A0</shelf>
13    <shelf>B1c</shelf>
14  </locations>
15  <contributors xmlns:pi="https://pirates.code/">
16    <contributor height="~5&quot;" pi:greeting="Ay'e!">Ed Teach</contributor>
17  </contributors>
18  <body>
19    <text>
20      <lb/>Liquidae voluptatis et liberae potest.
21Atqui pugnantibus et <hi>contra</hi>riis studiis consiliisque semper utens nihil
22<lb/>quieti videre, nihil tranquilli potest.
23<lb/>Quodsi vitam omnem continent, neglegentur?
24<lb/>Nam, ut sint illa vendibiliora, haec uberiora certe sunt.
25Quamquam id quidem licebit iis existimare, qui legerint.
26Nos autem hanc omnem quaestionem de finibus bonorum et malorum, <lb/>id est voluptatem.
27Homines optimi non intellegunt totam rationem everti, <lb/>si ita res se habeat.
28Nam si ea sola voluptas esset, <choice>
29        <sic>que</sic>
30        <corr>quae</corr>
31      </choice> quasi delapsa <lb/>de caelo est ad quiete vivendum, caritatem, praesertim cum omnino nulla sit causa peccandi.
32Quae enim cupiditates a natura proficiscuntur, <lb/>facile explentur sine ulla iniuria, quae autem inanes sunt, iis parendum non est.
33<lb/>Nihil enim desiderabile concupiscunt, plusque in ipsa iniuria detrimenti est quam in.
34</text>
35  </body>
36</document>
  • l.2-4) still defines an unused named entity

  • l.6) a character reference is produced where Unicode could have been used

  • l.20-33) there’s no wrapping option

  • l.21-34) unappealing, inconsistent indentation

  • l.28) the opening tag is kept on the started line, though a newline would be a proper substitute for the preceding space in the encoded text

  • l.28-31) related content is spread over lines

xml.dom.minidom

Production
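from pathlib import Path

# `document` is the result of xml.dom.minidom.parse() for the source stream
with Path("serialization-example-minidom.xml").open("wb") as f:
    f.write(
        document.toprettyxml("  ", encoding="utf-8", standalone=True)
    )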
Product
 1<?xml version="1.0" encoding="utf-8" standalone="yes"?>
 2<!DOCTYPE document [
 3    <!ENTITY Uuml "&#x00dc;">
 4]>
 5<document>
 6  
 7    
 8  <head>
 9    
10    
11    <title>Über suum venire vetuit.</title>
12    
13    
14    <identifiers>
15      
16    
17      <id>5dcebaa4-8760-4286-be7a-6b25fd6ae0f0</id>
18      
19    
20      <id>15b0c526-585f-4daf-a45f-411929ffbd61</id>
21      
22    
23    </identifiers>
24    
25    
26    <locations>
27      
28    
29      <shelf>A0</shelf>
30      
31    
32      <shelf>B1c</shelf>
33      
34    
35    </locations>
36    
37    
38    <contributors xmlns:pi="https://pirates.code/">
39      
40        
41      <contributor height="~5&quot;" pi:greeting="Ay'e!">Ed Teach</contributor>
42      
43    
44    </contributors>
45    
46    
47  </head>
48  
49
50  <body>
51    
52
53    <text>
54      
55
56      <lb/>
57      Liquidae voluptatis et liberae potest.
58Atqui pugnantibus et 
59      <hi>contra</hi>
60      riis studiis consiliisque semper utens nihil
61
62      <lb/>
63      quieti videre, nihil tranquilli potest.
64
65      <lb/>
66      Quodsi vitam omnem continent, neglegentur?
67
68      <lb/>
69      Nam, ut sint illa vendibiliora, haec uberiora certe sunt.
70Quamquam id quidem licebit iis existimare, qui legerint.
71Nos autem hanc omnem quaestionem de finibus bonorum et malorum, 
72      <lb/>
73      id est voluptatem.
74Homines optimi non intellegunt totam rationem everti, 
75      <lb/>
76      si ita res se habeat.
77Nam si ea sola voluptas esset, 
78      <choice>
79        <sic>que</sic>
80        <corr>quae</corr>
81      </choice>
82       quasi delapsa 
83      <lb/>
84      de caelo est ad quiete vivendum, caritatem, praesertim cum omnino nulla sit causa peccandi.
85Quae enim cupiditates a natura proficiscuntur, 
86      <lb/>
87      facile explentur sine ulla iniuria, quae autem inanes sunt, iis parendum non est.
88
89      <lb/>
90      Nihil enim desiderabile concupiscunt, plusque in ipsa iniuria detrimenti est quam in.
91
92    </text>
93    
94
95  </body>
96  
97
98</document>

Many of the previous flaws also manifest with this implementation from the standard library. It adds excessive whitespace, some of it significant, such as after each lb tag.

Configuration interfaces

class delb.DefaultStringOptions

This object’s class variables are used to configure the serialization parameters that are applied when nodes are coerced to str objects. Hence it also applies when node objects are fed to the print() function and in other cases where objects are implicitly cast to strings.

Attention

Use this once to define behaviour on application level. For thread-safe serializations of nodes with diverging parameters use XMLNodeType.serialize()! Think thrice whether you want to use this facility in a library.

format_options: ClassVar[None | FormatOptions] = None

An instance of delb.FormatOptions can be provided to configure formatting.

namespaces: ClassVar[None | NamespaceDeclarations] = None

A mapping of prefixes to namespaces. Any other prefixes for undeclared namespaces are enumerated with the prefix ns.

newline: ClassVar[None | str] = None

See io.TextIOWrapper for a detailed explanation of the parameter with the same name.

classmethod reset_defaults()

Restores the factory settings.

class delb.FormatOptions(align_attributes: bool = False, indentation: str = '\t', width: int = 0)

Instances of this class can be used to define serialization formatting that is not so hard to interpret for instances of Homo sapiens s., but more costly to compute.

When it’s employed, whitespace contents will be collapsed and trimmed, and newlines will be inserted to improve readability, but only where further whitespace reduction would drop them again.

The serialization respects when a tag node bears the xml:space attribute with the value preserve. If any descendant of such an annotated node signals to allow whitespace alterations again, that has no effect. Such attributes with invalid values are ignored.
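The described inheritance rule can be sketched with the standard library’s ElementTree. This is an illustration of the rule as stated above, not delb’s code; collect_preserving is a hypothetical function name.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved xml: prefix to this namespace
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def collect_preserving(element, inherited=False, result=None):
    """Returns the tags of all elements whose whitespace must be kept."""
    if result is None:
        result = []
    # once an ancestor demands preservation, descendants can't revoke it;
    # any value other than "preserve" is treated as if it wasn't there
    preserve = inherited or element.get(XML_SPACE) == "preserve"
    if preserve:
        result.append(element.tag)
    for child in element:
        collect_preserving(child, preserve, result)
    return result


root = ET.fromstring(
    "<text>"
    '<p xml:space="preserve">a  b<hi xml:space="default">c  d</hi></p>'
    '<p xml:space="bogus">e  f</p>'
    "</text>"
)
print(collect_preserving(root))  # prints ['p', 'hi']
```

The hi element stays protected although it declares default, and the second p element is not protected because its attribute value is invalid.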

align_attributes: bool

Determines whether attributes’ names and values line up sharply around vertically aligned equal signs.

indentation: str

This string prefixes descending nodes’ contents one time per depth level.

width: int

A positive value indicates that text nodes shall get wrapped at this character position. Indentations are not considered as part of text. This parameter is intended to define reasonable widths for text displays that could be scrolled horizontally.
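That indentation doesn’t count towards the width can be sketched with the standard library’s textwrap module. The example only handles plain text without markup, assumes that a non-positive width disables wrapping, and uses the hypothetical function name wrap_indented; it is not how delb computes its output.

```python
import textwrap


def wrap_indented(text: str, width: int, indentation: str, level: int) -> list[str]:
    """Wraps text at `width` characters and indents the lines afterwards."""
    prefix = indentation * level
    if width <= 0:
        # assumption: a non-positive width means that no wrapping is applied
        return [prefix + text]
    # textwrap counts its own indent arguments towards the width, hence
    # the text is wrapped first and the prefix is prepended afterwards
    return [prefix + line for line in textwrap.wrap(text, width=width)]


sample = (
    "Nam si ea sola voluptas esset, quasi delapsa de caelo est ad quiete "
    "vivendum, caritatem, praesertim cum omnino nulla sit causa peccandi."
)
wrapped = wrap_indented(sample, width=59, indentation="  ", level=3)
print("\n".join(wrapped))
```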