Serialization
Overview
delb allows users to produce content-agnostic XML serializations that are as readable as they can get.
The formatting options are controlled with delb.FormatOptions instances that
are either passed to serialization methods like delb.Document.save() and
delb.TagNode.serialize() or assigned to the class property
delb.DefaultStringOptions.format_options, which applies to any conversion of
documents and nodes to strings (e.g. with print() or str) on the
general application level. Passing or setting None lets the serializer
simply dump a tree’s contents to an XML stream without any extra effort.
The default “pretty formatting” is best suited to align structured data. It adds newlines and optional indentation to mark content nesting without adding or removing significant (text encoding related) whitespace.
The provided content wrapping implementation always assumes that a document contains mixed content (i.e. both structured and text data), while it prefers a continuous presentation of several nodes on a line with a constrained width. It’s currently quite expensive to compute. It also doesn’t account for combining Unicode characters, so wrapped text lengths are determined by the number of codepoints, not by the glyphs that are actually represented.
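The difference between codepoints and glyphs can be illustrated with the standard library alone. The following is a minimal sketch, not part of delb; the helper names are hypothetical, and skipping combining marks is only a rough approximation of glyph counting.

```python
import unicodedata


def codepoint_length(text: str) -> int:
    """Length as counted by len(): one unit per Unicode codepoint."""
    return len(text)


def approximate_glyph_length(text: str) -> int:
    """Rough glyph count that skips combining marks.

    Assumption for illustration only: real grapheme cluster
    segmentation follows more rules than this.
    """
    return sum(1 for char in text if unicodedata.combining(char) == 0)


composed = "\u00dcber"     # "Über" with a precomposed U+00DC
decomposed = "U\u0308ber"  # "Über" as "U" followed by COMBINING DIAERESIS

# Both strings render alike, but their codepoint counts differ.
print(codepoint_length(composed), codepoint_length(decomposed))
print(approximate_glyph_length(composed), approximate_glyph_length(decomposed))
```

A width limit that is measured in codepoints therefore treats the decomposed form as one unit longer than it appears.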
Serializations that alter whitespace for indentation or wrapping also apply a general reduction of insignificant whitespace as described in this TEI recommendation. Furthermore, it is guaranteed that a serialized stream from a document that was normalized according to this set of rules [1] will be parsed back to the identical tree if these rules are applied again [2].
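The round-trip guarantee rests on the reduction being idempotent: applying it a second time changes nothing. The sketch below shows that property for one simplified rule, collapsing runs of whitespace into a single space. It is an assumption-laden illustration with a hypothetical helper, not the full rule set referenced above and not delb's implementation.

```python
import re

# The four characters that XML treats as whitespace.
WHITESPACE_RUN = re.compile(r"[ \t\n\r]+")


def collapse_whitespace(text: str) -> str:
    """Replace each run of XML whitespace with a single space."""
    return WHITESPACE_RUN.sub(" ", text)


raw = "Quamquam id quidem\n   licebit iis existimare,\tqui legerint."
once = collapse_whitespace(raw)
twice = collapse_whitespace(once)

print(once)
print(once == twice)  # a second pass leaves the text unchanged
```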
There is currently no plan to support the production of character or entity references, nor is it settled what extent such support would cover.
As it stands, custom serialization algorithms should be implemented as standalone units: the contributed implementations are not suited for derivation, and the architecture is not yet ready for extensions in that regard.
A re-implementation with performance as its primary goal, planned for later in the beta phase, shall then allow customizations. Ideas can be contributed and discussed in this thread.
Examples and comparisons
As an example this input is given:
Source stream
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE document [
  <!ENTITY Uuml "Ü">
]>
<document>
  <head>
    <title>&Uuml;ber suum venire vetuit.</title>
    <identifiers>
      <id>5dcebaa4-8760-4286-be7a-6b25fd6ae0f0</id>
      <id>15b0c526-585f-4daf-a45f-411929ffbd61</id>
    </identifiers>
    <locations>
      <shelf>A0</shelf>
      <shelf>B1c</shelf>
    </locations>
    <contributors xmlns:pi="https://pirates.code/">
      <contributor height="~5&quot;" pi:greeting='Ay&apos;e!'>Ed Teach</contributor>
    </contributors>
  </head>
<body>
<text>
<lb/>Liquidae voluptatis et liberae potest.
Atqui pugnantibus et <hi>contra</hi>riis studiis consiliisque semper utens nihil
<lb/>quieti videre, nihil tranquilli potest.
<lb/>Quodsi vitam omnem continent, neglegentur?
<lb/>Nam, ut sint illa vendibiliora, haec uberiora certe sunt.
Quamquam id quidem licebit iis existimare, qui legerint.
Nos autem hanc omnem quaestionem de finibus bonorum et malorum, <lb/>id est voluptatem.
Homines optimi non intellegunt totam rationem everti, <lb/>si ita res se habeat.
Nam si ea sola voluptas esset, <choice><sic>que</sic><corr>quae</corr></choice> quasi delapsa <lb/>de caelo est ad quiete vivendum, caritatem, praesertim cum omnino nulla sit causa peccandi.
Quae enim cupiditates a natura proficiscuntur, <lb/>facile explentur sine ulla iniuria, quae autem inanes sunt, iis parendum non est.
<lb/>Nihil enim desiderabile concupiscunt, plusque in ipsa iniuria detrimenti est quam in.
</text>
</body>
</document>
Note that there’s no indication of the document’s type or schema.
delb
Indentation and aligned attributes
Production
from pathlib import Path

from delb import FormatOptions

# `document` holds the parsed source stream
document.save(
    Path("serialization-example-delb-indented.xml"),
    format_options=FormatOptions(
        align_attributes=True,
        indentation=" ",
        width=0,
    ),
)
Product
1<?xml version="1.0" encoding="UTF-8"?>
2<document xmlns:pi="https://pirates.code/">
3 <head>
4 <title>
5 Über suum venire vetuit.
6 </title>
7 <identifiers>
8 <id>
9 5dcebaa4-8760-4286-be7a-6b25fd6ae0f0
10 </id>
11 <id>
12 15b0c526-585f-4daf-a45f-411929ffbd61
13 </id>
14 </identifiers>
15 <locations>
16 <shelf>
17 A0
18 </shelf>
19 <shelf>
20 B1c
21 </shelf>
22 </locations>
23 <contributors>
24 <contributor
25 height="~5&quot;"
26 pi:greeting="Ay'e!"
27 >
28 Ed Teach
29 </contributor>
30 </contributors>
31 </head>
32 <body>
33 <text>
34 <lb/>Liquidae voluptatis et liberae potest. Atqui pugnantibus et
35 <hi>
36 contra
37 </hi>riis studiis consiliisque semper utens nihil
38 <lb/>quieti videre, nihil tranquilli potest.
39 <lb/>Quodsi vitam omnem continent, neglegentur?
40 <lb/>Nam, ut sint illa vendibiliora, haec uberiora certe sunt. Quamquam id quidem licebit iis existimare, qui legerint. Nos autem hanc omnem quaestionem de finibus bonorum et malorum,
41 <lb/>id est voluptatem. Homines optimi non intellegunt totam rationem everti,
42 <lb/>si ita res se habeat. Nam si ea sola voluptas esset,
43 <choice>
44 <sic>
45 que
46 </sic><corr>
47 quae
48 </corr>
49 </choice>
50 quasi delapsa
51 <lb/>de caelo est ad quiete vivendum, caritatem, praesertim cum omnino nulla sit causa peccandi. Quae enim cupiditates a natura proficiscuntur,
52 <lb/>facile explentur sine ulla iniuria, quae autem inanes sunt, iis parendum non est.
53 <lb/>Nihil enim desiderabile concupiscunt, plusque in ipsa iniuria detrimenti est quam in.
54 </text>
55 </body>
56</document>
l.2) namespace declarations are consolidated at the root node
l.2) attribute values are enclosed in double quotes for better readability
l.5) Unicode characters are produced where the input used an entity reference
l.25-26) this is the align_attributes option in action
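The layout that aligned attributes produce can be sketched in a few lines of plain Python. This is a hypothetical helper written for illustration, not delb's code; the function names and the choice to escape only `&`, `<` and `"` are assumptions.

```python
def escape_attribute_value(value: str) -> str:
    """Escape characters that must not appear literally in a double-quoted value."""
    return value.replace("&", "&amp;").replace("<", "&lt;").replace('"', "&quot;")


def start_tag_with_aligned_attributes(name, attributes, level=0, indentation="  "):
    """Render a start tag with one attribute per line, indented below the tag name."""
    base = indentation * level
    if not attributes:
        return f"{base}<{name}>"
    lines = [f"{base}<{name}"]
    for key, value in attributes.items():
        lines.append(f'{base}{indentation}{key}="{escape_attribute_value(value)}"')
    lines.append(f"{base}>")  # the closing bracket sits on its own line
    return "\n".join(lines)


tag = start_tag_with_aligned_attributes(
    "contributor", {"height": '~5"', "pi:greeting": "Ay'e!"}
)
print(tag)
```

Using double quotes as the only delimiter means an apostrophe in a value can stay literal, while a double quote has to be escaped.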
Text wrapping
Production
from pathlib import Path

from delb import FormatOptions

# `document` holds the parsed source stream
document.save(
    Path("serialization-example-delb-wrapped.xml"),
    format_options=FormatOptions(
        align_attributes=False,
        indentation=" ",
        width=59,
    ),
)
Product
1<?xml version="1.0" encoding="UTF-8"?>
2<document xmlns:pi="https://pirates.code/">
3 <head>
4 <title>Über suum venire vetuit.</title>
5 <identifiers>
6 <id>5dcebaa4-8760-4286-be7a-6b25fd6ae0f0</id>
7 <id>15b0c526-585f-4daf-a45f-411929ffbd61</id>
8 </identifiers>
9 <locations> <shelf>A0</shelf> <shelf>B1c</shelf> </locations>
10 <contributors>
11 <contributor height="~5&quot;" pi:greeting="Ay'e!">
12 Ed Teach
13 </contributor>
14 </contributors>
15 </head>
16 <body>
17 <text>
18 <lb/>Liquidae voluptatis et liberae potest. Atqui
19 pugnantibus et <hi>contra</hi>riis studiis consiliisque
20 semper utens nihil <lb/>quieti videre, nihil tranquilli
21 potest. <lb/>Quodsi vitam omnem continent, neglegentur?
22 <lb/>Nam, ut sint illa vendibiliora, haec uberiora certe
23 sunt. Quamquam id quidem licebit iis existimare, qui
24 legerint. Nos autem hanc omnem quaestionem de finibus
25 bonorum et malorum, <lb/>id est voluptatem. Homines optimi
26 non intellegunt totam rationem everti, <lb/>si ita res se
27 habeat. Nam si ea sola voluptas esset,
28 <choice><sic>que</sic><corr>quae</corr></choice>
29 quasi delapsa <lb/>de caelo est ad quiete vivendum,
30 caritatem, praesertim cum omnino nulla sit causa peccandi.
31 Quae enim cupiditates a natura proficiscuntur, <lb/>facile
32 explentur sine ulla iniuria, quae autem inanes sunt, iis
33 parendum non est. <lb/>Nihil enim desiderabile
34 concupiscunt, plusque in ipsa iniuria detrimenti est quam
35 in.
36 </text>
37 </body>
38</document>
l.9) lacking semantic knowledge, the serializer also places structured data onto one line when it fits
l.28) nested content is kept on one line if it fits
lxml
Production
from pathlib import Path
from lxml import etree

etree.indent(tree)  # `tree` holds the parsed source stream
with Path("serialization-example-lxml.xml").open("bw") as f:
    tree.write(f, pretty_print=True)
Product
1<?xml version='1.0' encoding='ASCII'?>
2<!DOCTYPE document [
3<!ENTITY Uuml "Ü">
4]>
5<document>
6 <header>&#220;ber suum venire vetuit.</header>
7 <identifiers>
8 <id>5dcebaa4-8760-4286-be7a-6b25fd6ae0f0</id>
9 <id>15b0c526-585f-4daf-a45f-411929ffbd61</id>
10 </identifiers>
11 <locations>
12 <shelf>A0</shelf>
13 <shelf>B1c</shelf>
14 </locations>
15 <contributors xmlns:pi="https://pirates.code/">
16 <contributor height="~5&quot;" pi:greeting="Ay'e!">Ed Teach</contributor>
17 </contributors>
18 <body>
19 <text>
20 <lb/>Liquidae voluptatis et liberae potest.
21Atqui pugnantibus et <hi>contra</hi>riis studiis consiliisque semper utens nihil
22<lb/>quieti videre, nihil tranquilli potest.
23<lb/>Quodsi vitam omnem continent, neglegentur?
24<lb/>Nam, ut sint illa vendibiliora, haec uberiora certe sunt.
25Quamquam id quidem licebit iis existimare, qui legerint.
26Nos autem hanc omnem quaestionem de finibus bonorum et malorum, <lb/>id est voluptatem.
27Homines optimi non intellegunt totam rationem everti, <lb/>si ita res se habeat.
28Nam si ea sola voluptas esset, <choice>
29 <sic>que</sic>
30 <corr>quae</corr>
31 </choice> quasi delapsa <lb/>de caelo est ad quiete vivendum, caritatem, praesertim cum omnino nulla sit causa peccandi.
32Quae enim cupiditates a natura proficiscuntur, <lb/>facile explentur sine ulla iniuria, quae autem inanes sunt, iis parendum non est.
33<lb/>Nihil enim desiderabile concupiscunt, plusque in ipsa iniuria detrimenti est quam in.
34</text>
35 </body>
36</document>
l.2-4) still defines an unused named entity
l.6) a character reference is produced where Unicode could have been used
l.20-33) there’s no wrapping option
l.21-34) unpleasing indentation
l.28) opening tag is kept on the started line though a newline would be a proper substitute for the preceding space in encoded text
l.28-31) related content is spread over lines
xml.dom.minidom
Production
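from pathlib import Path

# `document` is the xml.dom.minidom document parsed from the source stream
with Path("serialization-example-minidom.xml").open("bw") as f:
    f.write(
        document.toprettyxml(" ", encoding="utf-8", standalone=True)
    )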
Product
1<?xml version="1.0" encoding="utf-8" standalone="yes"?>
2<!DOCTYPE document [
3 <!ENTITY Uuml "Ü">
4]>
5<document>
6
7
8 <head>
9
10
11 <title>Über suum venire vetuit.</title>
12
13
14 <identifiers>
15
16
17 <id>5dcebaa4-8760-4286-be7a-6b25fd6ae0f0</id>
18
19
20 <id>15b0c526-585f-4daf-a45f-411929ffbd61</id>
21
22
23 </identifiers>
24
25
26 <locations>
27
28
29 <shelf>A0</shelf>
30
31
32 <shelf>B1c</shelf>
33
34
35 </locations>
36
37
38 <contributors xmlns:pi="https://pirates.code/">
39
40
41 <contributor height="~5&quot;" pi:greeting="Ay'e!">Ed Teach</contributor>
42
43
44 </contributors>
45
46
47 </head>
48
49
50 <body>
51
52
53 <text>
54
55
56 <lb/>
57 Liquidae voluptatis et liberae potest.
58Atqui pugnantibus et
59 <hi>contra</hi>
60 riis studiis consiliisque semper utens nihil
61
62 <lb/>
63 quieti videre, nihil tranquilli potest.
64
65 <lb/>
66 Quodsi vitam omnem continent, neglegentur?
67
68 <lb/>
69 Nam, ut sint illa vendibiliora, haec uberiora certe sunt.
70Quamquam id quidem licebit iis existimare, qui legerint.
71Nos autem hanc omnem quaestionem de finibus bonorum et malorum,
72 <lb/>
73 id est voluptatem.
74Homines optimi non intellegunt totam rationem everti,
75 <lb/>
76 si ita res se habeat.
77Nam si ea sola voluptas esset,
78 <choice>
79 <sic>que</sic>
80 <corr>quae</corr>
81 </choice>
82 quasi delapsa
83 <lb/>
84 de caelo est ad quiete vivendum, caritatem, praesertim cum omnino nulla sit causa peccandi.
85Quae enim cupiditates a natura proficiscuntur,
86 <lb/>
87 facile explentur sine ulla iniuria, quae autem inanes sunt, iis parendum non est.
88
89 <lb/>
90 Nihil enim desiderabile concupiscunt, plusque in ipsa iniuria detrimenti est quam in.
91
92 </text>
93
94
95 </body>
96
97
98</document>
Many of the previous flaws also manifest with this implementation from the
standard library. There’s excessive additional whitespace, some of which is
significant, such as the whitespace inserted after each lb tag.
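That the inserted whitespace is significant can be checked by parsing the pretty-printed output again and comparing the text content. The sketch below uses only the standard library; `text_content` is a hypothetical helper and the one-line input is a reduced stand-in for the example document.

```python
from xml.dom import minidom


def text_content(node) -> str:
    """Concatenate the data of all text nodes below ``node`` in document order."""
    if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
        return node.data
    return "".join(text_content(child) for child in node.childNodes)


original = minidom.parseString("<text><lb/>quieti videre</text>")
pretty = original.toprettyxml(indent="  ")
reparsed = minidom.parseString(pretty)

before = text_content(original.documentElement)
after = text_content(reparsed.documentElement)

# The reparsed text carries newlines and indentation that the input lacked.
print(repr(before))
print(repr(after))
```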
Configuration interfaces
- class delb.DefaultStringOptions
This object’s class variables are used to configure the serialization parameters that are applied when nodes are coerced to str objects. Hence it also applies when node objects are fed to the print() function and in other cases where objects are implicitly cast to strings.
Attention
Use this once to define behaviour on application level. For thread-safe serializations of nodes with diverging parameters use NodeBase.serialize()! Think thrice whether you want to use this facility in a library.
- format_options: ClassVar[None | FormatOptions] = None
An instance of FormatOptions can be provided to configure formatting.
- namespaces: ClassVar[None | NamespaceDeclarations] = None
A mapping of prefixes to namespaces. These override possible declarations from a parsed serialization that the document instance stems from. Any other prefixes for undeclared namespaces are enumerated with the prefix ns.
- newline: ClassVar[None | str] = None
See io.TextIOWrapper for a detailed explanation of the parameter with the same name.
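Why class-level defaults suit application-wide setup but not concurrent code with diverging parameters can be seen in a reduced model. All classes below are hypothetical stand-ins written for this sketch; they are not delb's classes and only mimic the pattern of a shared class variable that is consulted during implicit string conversion.

```python
from dataclasses import dataclass
from typing import ClassVar, Optional


@dataclass(frozen=True)
class Options:  # hypothetical stand-in for a formatting options object
    indentation: str = "\t"


class StringDefaults:  # hypothetical holder of application-wide defaults
    format_options: ClassVar[Optional[Options]] = None


class Node:  # hypothetical, minimal node type
    def __init__(self, name: str) -> None:
        self.name = name

    def serialize(self, format_options: Optional[Options] = None) -> str:
        prefix = "" if format_options is None else format_options.indentation
        return f"{prefix}<{self.name}/>"

    def __str__(self) -> str:
        # Implicit conversions take no arguments, so they read the shared state.
        return self.serialize(StringDefaults.format_options)


node = Node("lb")
StringDefaults.format_options = Options(indentation="  ")  # set once per application

print(repr(str(node)))                                      # uses the shared default
print(repr(node.serialize(Options(indentation="    "))))    # explicit and independent
```

Because every implicit conversion reads the same class variable, code that needs different parameters side by side is better served by passing them explicitly.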
- class delb.FormatOptions(align_attributes: bool = False, indentation: str = '\t', width: int = 0)
Instances of this class can be used to define serialization formatting that is not so hard to interpret for instances of Homo sapiens s., but more costly to compute.
When it’s employed, whitespace contents will be collapsed and trimmed, and newlines will be inserted to improve readability, but only where further whitespace reduction would drop them again.
The serialization respects when a tag node bears the xml:space attribute with the value preserve. If any descendant of such an annotated node signals to allow whitespace alterations again, that has no effect. Such attributes with invalid values are ignored.
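The described handling of xml:space can be modelled with the standard library's ElementTree. This is a sketch under the stated rules only; the function name is hypothetical and the code is not taken from delb.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml prefix under its namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def whitespace_may_be_altered(root):
    """Map each element's tag to whether a formatter may alter its whitespace."""
    result = {}

    def visit(element, preserved):
        # Only the value "preserve" switches protection on. Whatever a
        # descendant declares afterwards, and any invalid value, is ignored.
        preserved = preserved or element.get(XML_SPACE) == "preserve"
        result[element.tag] = not preserved
        for child in element:
            visit(child, preserved)

    visit(root, False)
    return result


root = ET.fromstring(
    "<text>"
    "<p>free</p>"
    '<pre xml:space="preserve"><code xml:space="default">kept</code></pre>'
    '<note xml:space="bogus">free</note>'
    "</text>"
)
print(whitespace_may_be_altered(root))
```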