:::message
After manual verification, this article was written by AI.
:::
Introduction
When editing TEI (Text Encoding Initiative) XML, it is often necessary to validate not only element and attribute structure but also more complex business rules. This article explains how to combine RELAX NG (RNG) and Schematron to validate both structure and content, using challenges from a real project as examples.
Challenges to Solve
When editing classical Japanese literary texts in TEI XML, we had the following requirements:
- Dynamic validation of ID references: Validate that IDs referenced in corresp attributes actually exist in witness elements within the document
- Auto-completion in Oxygen XML Editor: Automatically display ID candidates during editing
- Support for multiple ID references: Allow multiple IDs to be specified separated by spaces
- Restrict references to specific elements: Allow only witness element IDs to be referenced, and generate an error if person element IDs are included
Why RNG + Schematron?
RELAX NG Strengths
- Element and attribute structure definition
- Data type specification
- Basic content model definition
Schematron Strengths
- XPath-based complex validation rules
- Cross-reference checking within documents
- Custom error message provision
By combining these two, we can achieve strict validation from both structural and content perspectives.
Implementation Example
1. Basic RNG Schema Structure
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
xmlns:a="http://relaxng.org/ns/compatibility/annotations/1.0"
xmlns:sch="http://purl.oclc.org/dsdl/schematron"
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"
ns="http://www.tei-c.org/ns/1.0">
<!-- Schematron namespace declaration -->
<sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
<!-- Embed Schematron rules here -->
<start>
<ref name="TEI"/>
</start>
<!-- RNG structural definitions -->
</grammar>
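As a minimal sketch of where the embedded rules go: RELAX NG permits elements from other namespaces as annotations inside grammar, so a sch:pattern can sit next to start and define. The rule below (a non-empty witness label) and its pattern id are hypothetical, chosen only because `<text/>` alone cannot express them. How embedded rules are extracted and run depends on the validator.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:sch="http://purl.oclc.org/dsdl/schematron"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"
         ns="http://www.tei-c.org/ns/1.0">

  <!-- Bind the tei prefix used in the XPath expressions below -->
  <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>

  <!-- Hypothetical embedded rule: RNG's <text/> also accepts empty content,
       so the non-empty check is expressed in Schematron instead -->
  <sch:pattern id="witness-label">
    <sch:rule context="tei:witness">
      <sch:assert test="normalize-space(.) != ''" role="warning">
        A witness should have a non-empty label.
      </sch:assert>
    </sch:rule>
  </sch:pattern>

  <!-- Deliberately tiny structural part, for illustration only -->
  <start>
    <element name="listWit">
      <oneOrMore>
        <element name="witness">
          <attribute name="xml:id">
            <data type="ID"/>
          </attribute>
          <text/>
        </element>
      </oneOrMore>
    </element>
  </start>
</grammar>
```

The structural part here is reduced to a single listWit so the placement of the Schematron elements is easy to see. A real schema would keep the full TEI content model.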
2. ID Definition and anyURI Type Usage
To achieve auto-completion in Oxygen XML Editor, we use the anyURI type:
<!-- Witness list -->
<define name="listWit">
<element name="listWit">
<oneOrMore>
<element name="witness">
<attribute name="xml:id">
<data type="ID"/>
</attribute>
<text/>
</element>
</oneOrMore>
</element>
</define>
<!-- Base text reading -->
<define name="lem">
<element name="lem">
<attribute name="corresp">
<a:documentation>
Reference to witness
Internal reference written as # followed by an xml:id (IDREF-like)
Oxygen displays a list of xml:ids with #
</a:documentation>
<list>
<oneOrMore>
<data type="anyURI"/>
</oneOrMore>
</list>
</attribute>
<text/>
</element>
</define>
Key points:
- data type="ID" ensures uniqueness
- data type="anyURI" allows internal references with #
- list element allows multiple space-separated values
3. Advanced Validation with Schematron
<sch:pattern id="witness-references">
<sch:title>Witness ID Reference Validation</sch:title>
<sch:rule context="tei:lem[@corresp]">
<sch:let name="listWitIds" value="//tei:listWit/tei:witness/@xml:id"/>
<sch:let name="listPersonIds" value="//tei:listPerson/tei:person/@xml:id"/>
<sch:let name="correspTokens" value="tokenize(normalize-space(@corresp), '\s+')"/>
<!-- Should reference only witnesses -->
<sch:assert test="every $token in $correspTokens
satisfies (
starts-with($token, '#') and
substring($token, 2) = $listWitIds
)" role="error">
The corresp attribute should only reference witness IDs.
Available witness IDs: #<sch:value-of select="string-join($listWitIds, ', #')"/>
</sch:assert>
<!-- Error when person IDs are included -->
<sch:report test="some $token in $correspTokens
satisfies (
starts-with($token, '#') and
substring($token, 2) = $listPersonIds
)" role="error">
The corresp attribute contains person IDs.
Detected person IDs: <sch:value-of select="
string-join(
for $token in $correspTokens
return if (starts-with($token, '#') and substring($token, 2) = $listPersonIds)
then $token
else (),
', '
)
"/>
</sch:report>
</sch:rule>
</sch:pattern>
Key points:
- Define variables with sch:let and dynamically retrieve values with XPath
- Parse multiple ID references with tokenize()
- Error when condition is not met with sch:assert
- Error when condition is met with sch:report
- Specify error level with role="error" (warning, info also available)
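The same check can also live in a standalone Schematron file. The following is a minimal sketch under assumptions: the pattern id is hypothetical, and queryBinding="xslt2" is declared because tokenize() and every ... satisfies are XPath 2.0 features.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron"
            queryBinding="xslt2">

  <!-- Bind the tei prefix used in the XPath expressions below -->
  <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>

  <sch:pattern id="witness-references-standalone">
    <sch:rule context="tei:lem[@corresp]">
      <!-- All witness IDs declared anywhere in the document -->
      <sch:let name="listWitIds" value="//tei:listWit/tei:witness/@xml:id"/>
      <!-- Split the attribute value on runs of whitespace -->
      <sch:let name="correspTokens"
               value="tokenize(normalize-space(@corresp), '\s+')"/>

      <!-- Every token must be '#' followed by a declared witness ID -->
      <sch:assert test="every $token in $correspTokens satisfies
                          (starts-with($token, '#') and
                           substring($token, 2) = $listWitIds)"
                  role="error">
        The corresp attribute should only reference witness IDs.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

A document would then point a second xml-model processing instruction at this file (for example a hypothetical witness-rules.sch) with schematypens="http://purl.oclc.org/dsdl/schematron", while the RNG file keeps only the structural definitions.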
4. Practical Usage Example
<!-- Usage in XML document -->
<?xml-model href="schema.rng" type="application/xml"
schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="schema.rng" type="application/xml"
schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
<listWit>
<witness xml:id="aaa">Witness A</witness>
<witness xml:id="iii">Witness I</witness>
</listWit>
<listPerson>
<person xml:id="abc">
<persName>Person ABC</persName>
</person>
</listPerson>
</teiHeader>
<text>
<body>
<app>
<!-- Correct example: referencing only witnesses -->
<lem corresp="#aaa #iii">Main text</lem>
<rdg corresp="#aaa">Alternative reading</rdg>
</app>
<app>
<!-- Error example: includes person -->
<lem corresp="#aaa #abc">Main text</lem>
<rdg>Alternative reading</rdg>
</app>
</body>
</text>
</TEI>
Implementation Considerations
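1. XPath 2.0 / 3.0 Syntax
Pay attention to how for and let expressions combine in XPath within Schematron. Unlike XQuery FLWOR expressions, XPath does not allow a let clause between for and return: each for or let needs its own return. Note also that let is an XPath 3.0 feature; XPath 2.0 has for but no let.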
<!-- Correct -->
let $invalid := (
for $token in $correspTokens
return
let $id := substring($token, 2)
return if ($id = $validIds) then () else $token
)
<!-- Will cause error -->
let $invalid := for $token in $correspTokens
let $id := substring($token, 2)
return if ($id = $validIds) then () else $token
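To show the accepted form in context, here is a minimal sketch of a complete rule. It assumes a Schematron processor that accepts queryBinding="xslt3" (needed for the XPath 3.0 let expression); the pattern id and the message text are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron"
            queryBinding="xslt3">

  <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>

  <sch:pattern id="invalid-witness-tokens">
    <sch:rule context="tei:lem[@corresp]">
      <sch:let name="validIds" value="//tei:listWit/tei:witness/@xml:id"/>
      <sch:let name="correspTokens"
               value="tokenize(normalize-space(@corresp), '\s+')"/>

      <!-- for ... return let ... return ...: each clause has its own return.
           substring($token, 2) drops the first character (expected to be '#'). -->
      <sch:let name="invalid" value="
        for $token in $correspTokens
        return
          let $id := substring($token, 2)
          return if ($id = $validIds) then () else $token"/>

      <!-- The rule passes only when no unmatched token is left -->
      <sch:assert test="empty($invalid)" role="error">
        Unknown witness reference(s): <sch:value-of select="string-join($invalid, ', ')"/>
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Moving the intermediate value into sch:let keeps the assert test short and lets the message reuse the same sequence.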
2. IDREF vs anyURI
- IDREF type: Cannot include #, limiting completion in Oxygen
- anyURI type: Allows values with #, Oxygen automatically provides ID completion
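For comparison, a sketch of the IDREF-based alternative, which is not the approach used in this article. It assumes the define sits inside the same grammar, so the element namespace comes from the grammar's ns attribute.

```xml
<!-- Variant sketch: corresp holds bare IDs (no '#'), typed as IDREFS.
     Inside the real grammar the xmlns and datatypeLibrary attributes
     are inherited and need not be repeated here. -->
<define name="lem"
        xmlns="http://relaxng.org/ns/structure/1.0"
        datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <element name="lem">
    <attribute name="corresp">
      <!-- IDREFS: one or more space-separated names, e.g. corresp="aaa iii" -->
      <data type="IDREFS"/>
    </attribute>
    <text/>
  </element>
</define>
```

With this variant, validators that implement the RELAX NG DTD compatibility ID/IDREF checks can confirm that each name matches some xml:id in the document, but not that it belongs to a witness rather than a person. The Schematron rule would still be needed, minus the starts-with and substring steps that handle the #.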
3. Schematron role Attribute
- role="error": Red error marker
- role="warning": Yellow warning marker
- role="info": Blue information marker
Advanced Examples
Complex Cross-Reference Validation
<sch:pattern id="cross-references">
<!-- app element must have exactly one lem element -->
<sch:rule context="tei:app">
<sch:assert test="count(tei:lem) = 1">
The app element must have exactly one lem element
</sch:assert>
</sch:rule>
<!-- rdg element's corresp cannot duplicate lem element's -->
<sch:rule context="tei:rdg[@corresp]">
<sch:let name="lemCorresp" value="../tei:lem/@corresp"/>
<sch:assert test="not(@corresp = $lemCorresp)">
The rdg element's corresp must have a different value from the lem element
</sch:assert>
</sch:rule>
</sch:pattern>
Conditional Required Attributes
<sch:pattern id="conditional-attributes">
<sch:rule context="tei:date[@when]">
<!-- If when attribute exists, it must be in ISO format -->
<sch:assert test="matches(@when, '^\d{4}-\d{2}-\d{2}$')">
The when attribute must be specified in YYYY-MM-DD format
</sch:assert>
</sch:rule>
</sch:pattern>
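A regular expression checks only the shape of the value, so a string such as 2023-02-30 would pass. As a sketch, a second assertion using the XPath 2.0 castable as operator can add a calendar check. The xs prefix declaration and the pattern id are assumptions added for this example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron"
            queryBinding="xslt2">

  <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
  <!-- Declared so that xs:date resolves inside XPath expressions -->
  <sch:ns prefix="xs" uri="http://www.w3.org/2001/XMLSchema"/>

  <sch:pattern id="date-calendar-check">
    <sch:rule context="tei:date[@when]">
      <!-- Shape check: exactly YYYY-MM-DD -->
      <sch:assert test="matches(@when, '^\d{4}-\d{2}-\d{2}$')">
        The when attribute must be specified in YYYY-MM-DD format
      </sch:assert>
      <!-- Calendar check: rejects impossible dates such as 2023-02-30 -->
      <sch:assert test="@when castable as xs:date">
        The when attribute must be a valid calendar date
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

The two assertions are kept separate on purpose: xs:date on its own also accepts values with a time zone suffix, so the regular expression is still what enforces the plain YYYY-MM-DD shape.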
Summary
By combining RELAX NG and Schematron:
- Separation of structural and content validation: Design that leverages each technology’s strengths
- Dynamic validation rules: Flexible validation based on document content
- Editor support: Advanced editing support in Oxygen XML Editor and others
- Clear error messages: Custom messages in your native language
Especially when editing documents with complex structures like TEI XML, this combination becomes an extremely powerful tool.
The complete schema code introduced in this article is actually used in production projects. I hope it will be helpful for those facing similar challenges.