Implementation Guide for TEI XML Schema Combining RELAX NG and Schematron

:::message
This article was written by AI; its content has been manually verified.
:::

Introduction

When editing TEI (Text Encoding Initiative) XML, it is often necessary to validate not only element and attribute structure but also more complex, project-specific rules. This article explains how to combine RELAX NG (RNG) and Schematron to validate both structure and content, using challenges from a real project as examples.

Challenges to Solve

When editing classical Japanese literary texts in TEI XML, we had the following requirements:

  1. Dynamic validation of ID references: Validate that IDs referenced in corresp attributes actually exist in witness elements within the document
  2. Auto-completion in Oxygen XML Editor: Automatically display ID candidates during editing
  3. Support for multiple ID references: Allow multiple IDs to be specified separated by spaces
  4. Restrict references to specific elements: Allow only witness element IDs to be referenced, and generate an error if person element IDs are included

Why RNG + Schematron?

RELAX NG Strengths

  • Element and attribute structure definition
  • Data type specification
  • Basic content model definition

Schematron Strengths

  • XPath-based complex validation rules
  • Cross-reference checking within documents
  • Custom error message provision

By combining these two, we can achieve strict validation from both structural and content perspectives.

Implementation Example

1. Basic RNG Schema Structure

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:a="http://relaxng.org/ns/compatibility/annotations/1.0"
         xmlns:sch="http://purl.oclc.org/dsdl/schematron"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"
         ns="http://www.tei-c.org/ns/1.0">

  <!-- Schematron namespace declaration -->
  <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>

  <!-- Embed Schematron rules here -->

  <start>
    <ref name="TEI"/>
  </start>

  <!-- RNG structural definitions -->
</grammar>
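
As a minimal sketch of where the embedded rules go, the fragment below places one sch:pattern as a direct child of grammar, next to the sch:ns declaration. The pattern id and the rule itself (a witness must not be empty) are illustrative assumptions, not taken from the project schema. The rule is chosen because the RNG content model `<text/>` cannot express it.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:sch="http://purl.oclc.org/dsdl/schematron"
         ns="http://www.tei-c.org/ns/1.0">

  <!-- Bind the tei prefix used in the XPath expressions below -->
  <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>

  <!-- Illustrative embedded pattern (hypothetical id and rule) -->
  <sch:pattern id="witness-not-empty">
    <sch:rule context="tei:listWit/tei:witness">
      <sch:assert test="normalize-space(.) != ''">
        A witness element should contain a description of the witness.
      </sch:assert>
    </sch:rule>
  </sch:pattern>

  <start>
    <ref name="TEI"/>
  </start>

  <!-- RNG structural definitions, including a define named TEI -->
</grammar>
```

RELAX NG ignores elements from foreign namespaces, so the pattern does not affect structural validation. A tool that supports embedded Schematron extracts the sch: elements and runs them as a separate validation pass.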

2. ID Definition and anyURI Type Usage

To achieve auto-completion in Oxygen XML Editor, we use the anyURI type:

<!-- Witness list -->
<define name="listWit">
  <element name="listWit">
    <oneOrMore>
      <element name="witness">
        <attribute name="xml:id">
          <data type="ID"/>
        </attribute>
        <text/>
      </element>
    </oneOrMore>
  </element>
</define>

<!-- Base text reading -->
<define name="lem">
  <element name="lem">
    <attribute name="corresp">
      <a:documentation>
        Reference to witness
        Internal reference in IDREF format with #
        Oxygen displays a list of xml:ids with #
      </a:documentation>
      <list>
        <oneOrMore>
          <data type="anyURI"/>
        </oneOrMore>
      </list>
    </attribute>
    <text/>
  </element>
</define>

Key points:

  • data type="ID" ensures uniqueness
  • data type="anyURI" allows internal references with #
  • list element allows multiple space-separated values
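
The usage example later in the article also puts corresp on rdg. One way to avoid repeating the attribute definition is a shared named pattern. The following is only a sketch: the names att.corresp and rdg are hypothetical, and corresp is assumed to be optional on rdg.

```xml
<!-- Shared attribute definition (hypothetical name) -->
<define name="att.corresp">
  <attribute name="corresp">
    <list>
      <oneOrMore>
        <data type="anyURI"/>
      </oneOrMore>
    </list>
  </attribute>
</define>

<!-- Variant reading: corresp assumed optional here -->
<define name="rdg">
  <element name="rdg">
    <optional>
      <ref name="att.corresp"/>
    </optional>
    <text/>
  </element>
</define>
```

The lem definition could reference att.corresp in the same way, so that both elements accept the same space-separated list of #-prefixed references.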

3. Advanced Validation with Schematron

<sch:pattern id="witness-references">
  <sch:title>Witness ID Reference Validation</sch:title>

  <sch:rule context="tei:lem[@corresp]">
    <sch:let name="listWitIds" value="//tei:listWit/tei:witness/@xml:id"/>
    <sch:let name="listPersonIds" value="//tei:listPerson/tei:person/@xml:id"/>
    <sch:let name="correspTokens" value="tokenize(normalize-space(@corresp), '\s+')"/>

    <!-- Should reference only witnesses -->
    <sch:assert test="every $token in $correspTokens 
                      satisfies (
                        starts-with($token, '#') and 
                        substring($token, 2) = $listWitIds
                      )" role="error">
      The corresp attribute should only reference witness IDs.
      Available witness IDs: #<sch:value-of select="string-join($listWitIds, ', #')"/>
    </sch:assert>

    <!-- Error when person IDs are included -->
    <sch:report test="some $token in $correspTokens 
                      satisfies (
                        starts-with($token, '#') and 
                        substring($token, 2) = $listPersonIds
                      )" role="error">
      The corresp attribute contains person IDs.
      Detected person IDs: <sch:value-of select="
        string-join(
          for $token in $correspTokens
          return if (starts-with($token, '#') and substring($token, 2) = $listPersonIds) 
                 then $token 
                 else (),
          ', '
        )
      "/>
    </sch:report>
  </sch:rule>
</sch:pattern>

Key points:

  • Define variables with sch:let and dynamically retrieve values with XPath
  • Parse multiple ID references with tokenize()
  • sch:assert raises an error when its test evaluates to false
  • sch:report raises an error when its test evaluates to true
  • Specify error level with role="error" (warning, info also available)

4. Practical Usage Example

<!-- Usage in XML document -->
<?xml-model href="schema.rng" type="application/xml" 
    schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="schema.rng" type="application/xml" 
    schematypens="http://purl.oclc.org/dsdl/schematron"?>

<TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
        <listWit>
            <witness xml:id="aaa">Witness A</witness>
            <witness xml:id="iii">Witness I</witness>
        </listWit>
        <listPerson>
            <person xml:id="abc">
                <persName>Person ABC</persName>
            </person>
        </listPerson>
    </teiHeader>
    <text>
        <body>
            <app>
                <!-- Correct example: referencing only witnesses -->
                <lem corresp="#aaa #iii">Main text</lem>
                <rdg corresp="#aaa">Alternative reading</rdg>
            </app>
            <app>
                <!-- Error example: includes person -->
                <lem corresp="#aaa #abc">Main text</lem>
                <rdg>Alternative reading</rdg>
            </app>
        </body>
    </text>
</TEI>

Implementation Considerations

1. XPath 2.0 Syntax

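Pay attention to the syntax of for expressions in XPath within Schematron. Unlike XQuery, XPath has no FLWOR expressions, so a let clause cannot follow a for clause directly; it has to be a separate, nested let ... return expression. Note that let expressions require XPath 3.0; XPath 2.0 has for but no let. The snippets below are fragments that show only the nesting: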

<!-- Correct -->
let $invalid := (
  for $token in $correspTokens
  return 
    let $id := substring($token, 2)
    return if ($id = $validIds) then () else $token
)

<!-- Will cause error -->
let $invalid := for $token in $correspTokens
                let $id := substring($token, 2)
                return if ($id = $validIds) then () else $token
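
If the Schematron processor is limited to XPath 2.0, the same filtering can be written without let by placing the for expression directly inside string-join. The pattern below is a sketch under that assumption. The pattern id and the variable name validIds are hypothetical, and the rule context mirrors the witness-references pattern.

```xml
<sch:pattern id="witness-references-unknown">
  <sch:rule context="tei:lem[@corresp]">
    <sch:let name="validIds" value="//tei:listWit/tei:witness/@xml:id"/>
    <sch:let name="correspTokens"
             value="tokenize(normalize-space(@corresp), '\s+')"/>

    <!-- XPath 2.0 only: no let expression, the for sits inside string-join -->
    <sch:assert test="every $token in $correspTokens
                      satisfies (
                        starts-with($token, '#') and
                        substring($token, 2) = $validIds
                      )" role="error">
      Unknown witness reference(s): <sch:value-of select="
        string-join(
          for $token in $correspTokens
          return if (starts-with($token, '#') and substring($token, 2) = $validIds)
                 then ()
                 else $token,
          ', '
        )
      "/>
    </sch:assert>
  </sch:rule>
</sch:pattern>
```

The message then lists only the tokens that failed, which tends to be easier to act on than a list of every available ID.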

2. IDREF vs anyURI

  • IDREF type: Cannot include #, limiting completion in Oxygen
  • anyURI type: Allows values with #, Oxygen automatically provides ID completion
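
For comparison, this is roughly what an IDREFS-based definition could look like. It is a sketch, and the define name lem-idrefs is hypothetical. With it, the attribute would be written corresp="aaa iii" without #, which departs from the URI-style pointer syntax TEI uses for corresp.

```xml
<define name="lem-idrefs">
  <element name="lem">
    <attribute name="corresp">
      <!-- IDREFS: space-separated names, no leading # allowed -->
      <data type="IDREFS"/>
    </attribute>
    <text/>
  </element>
</define>
```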

3. Schematron role Attribute

  • role="error": Red error marker
  • role="warning": Yellow warning marker
  • role="info": Blue information marker
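
A sketch of a softer rule using role="warning". The policy it encodes (an rdg should name its witness) and the pattern id are assumptions for illustration, not rules from the project schema.

```xml
<sch:pattern id="rdg-corresp-recommended">
  <sch:rule context="tei:rdg">
    <!-- role="warning": flagged for the editor, but at a lower severity than error -->
    <sch:assert test="@corresp" role="warning">
      This rdg element has no corresp attribute; consider naming its witness.
    </sch:assert>
  </sch:rule>
</sch:pattern>
```

How a given role value is displayed, and whether it counts as a validation failure, depends on the tool that runs the Schematron.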

Advanced Examples

Complex Cross-Reference Validation

<sch:pattern id="cross-references">
  <!-- app element must have exactly one lem element -->
  <sch:rule context="tei:app">
    <sch:assert test="count(tei:lem) = 1">
      The app element must have exactly one lem element
    </sch:assert>
  </sch:rule>

  <!-- rdg element's corresp cannot duplicate lem element's -->
  <sch:rule context="tei:rdg[@corresp]">
    <sch:let name="lemCorresp" value="../tei:lem/@corresp"/>
    <sch:assert test="not(@corresp = $lemCorresp)">
      The rdg element's corresp must have a different value from the lem element
    </sch:assert>
  </sch:rule>
</sch:pattern>

Conditional Required Attributes

<sch:pattern id="conditional-attributes">
  <sch:rule context="tei:date[@when]">
    <!-- If when attribute exists, it must be in ISO format -->
    <sch:assert test="matches(@when, '^\d{4}-\d{2}-\d{2}$')">
      The when attribute must be specified in YYYY-MM-DD format
    </sch:assert>
  </sch:rule>
</sch:pattern>
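
A regular expression checks only the shape of the value, so a string such as 2024-13-45 would still pass. With an XPath 2.0 query binding, castable as xs:date can also reject impossible dates. The following is a sketch under that assumption: the pattern id is hypothetical, and the xs prefix is declared explicitly in case the processor does not already bind it.

```xml
<!-- Declare the xs prefix once, next to the tei declaration -->
<sch:ns prefix="xs" uri="http://www.w3.org/2001/XMLSchema"/>

<sch:pattern id="date-when-is-real-date">
  <sch:rule context="tei:date[@when]">
    <!-- The regex keeps the plain YYYY-MM-DD shape (no time zone suffix);
         castable rejects values such as 2024-02-30 -->
    <sch:assert test="matches(@when, '^\d{4}-\d{2}-\d{2}$')
                      and @when castable as xs:date">
      The when attribute must be a valid calendar date in YYYY-MM-DD format
    </sch:assert>
  </sch:rule>
</sch:pattern>
```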

Summary

By combining RELAX NG and Schematron:

  1. Separation of structural and content validation: Design that leverages each technology’s strengths
  2. Dynamic validation rules: Flexible validation based on document content
  3. Editor support: Advanced editing support in Oxygen XML Editor and others
  4. Clear error messages: Custom messages in your native language

Especially when editing documents with complex structures like TEI XML, this combination becomes an extremely powerful tool.

References

The schema code presented in this article is in use in production projects. I hope it helps others facing similar challenges.
