Hedge automata: a formal model for XML schemata

This note shows preliminaries of the hedge automaton theory. In the XML community, this theory has been recently recognized as a simple but powerful model for XML schemata. In particular, the design of RELAX (REgular LAnguage for XML) is directly based on this theory.

Hedges

First, we introduce hedges. Informally, a hedge is a sequence of trees. In the XML terminology, a hedge is a sequence of elements possibly interevened by character data (or types of character data); in particular, an XML document is a hedge.

A hedge over a finite set Σ (of symbols) and a finite set X (of variables) is:

ε (the null hedge),
X, where X is a variable in X,
a <u>, where a is a symbol in Σ and u is a hedge (the addition of a symbol as the root node), or
uv, where u and v are hedges (the concatenation of two hedges).

Figure 1 depicts three hedges: a <ε> , a <x > , and a <ε > b < b <ε> x > . Observe that elements of Σ (i.e., a and b ) are used as labels of non-leaf nodes, while elements of X (i.e., x ) are used as labels of leaf nodes. We abbreviate a <ε> as a. Thus, the third example is denoted by a b <b x > .

Figure 1. Three hedges: a <ε> , a <x > , and a <ε > b < b <ε> x >

Next, we consider an XML document. Suppose that Σ = {doc, title, image, para} and X ={#PCDATA}. Then,

doc < title<#PCDATA> para<#PCDATA> <image/> para<#PCDATA>>

Regular Hedge Grammar

In this section, we introduce regular hedge grammars (RHGs). An RHG is a mechanism for generating hedges. In other words, an RHG describes a set of hedges.

Since the primary role of an XML schema is to describe a set of valid documents, an RHG can be considered as a formal representation of a XML schema.

Σ is a finite set of symbols,
X is a finite set of variables,
N is a finite set of non-terminals,
P is a finite set of production rules, each of which takes one of the two forms as below:
- n → x, where n is a non-terminal in N, and x is a variable in X,
- n → a <r>, where n is a non-terminal in N, a is a symbol in Σ, and r is a regular expression comprising non-terminals,
r_f is a regular expression comprising non-terminals.

Now, we consider derivation of RHGs. Informally speaking, given a sequence of non-terminals, we repeatedly replace non-terminals with hedges in the right-hand side of corresponding production rules.

for some production rule n → x, hedge v is obtained by replacing an occurence of n in u by X, or
for some production rule n → a <r>, hedge v is obtained by replacing an occurence of n in u by some a <w> such that w is a sequence of non-terminals and matches r.

The language generated by G, denoted by L(G), is the set of hedges which are derived from some non-terminal sequence that matches r_f .

P = {n₁ → a<n₂⁺ >, n₂ → x }.

L(G) ={ε, a<x >, a<xx>, a<xxx >, ...}

Next, we construct an RHG that mimicks a DTD. As an example, consider a DTD as follows:

Σ = {doc, title, image, para},
X = { #PCDATA } ,
N = {n_d , n_t , n_p , n_i , n_# }
P = {n_d → doc <n_t (n_p | n_i )^* > ,
n_t → title <n_# >,
n_p → para <n_# >,
n_i → image <ε>,
n_# → #PCDATA }

Σ = {segment , para} ,
X = { #PCDATA } ,
N = { n₁ , n₂ , n_p , n_# } ,
P = {n₁ → segment <n_p^* n₂^* > ,
n₂ → segment <n_p^* > ,
n_p → para <n_# > ,
n_# → #PCDATA }

Both the rule for non-terminal n₁ and that for n₂ have segment in the right-hand side. However, the former has content model n_p^* n₂^*, and the latter has content model n_p^*. This impiles that top-level segments can have subordinate segments, but these subordinate segments cannot have subordinate segments.

The DTD syntax cannot exactly capture this RHG, since every occurrence of segments is forced to have the same content model. The smallest DTD that covers this RHG is as below:

Observe that this DTD allows unlimited nesting of segments. Since the DTD syntax does not allow two content models for segments, this DTD uses one loose content model.

Hedge Automaton

In this section, we introduce deterministic hedge automata and non-deterministic hedge automata.

Σ is a finite set of symbols,
X is a finite set of variables,
Q is a finite set of states,
α is a function from Σ × Q^* to Q such that for every q in Q and x in Σ, {q₁ q₂ ... q_k | k >= 0, α(x, q₁ q₂ ... q_k ) = q } is a regular set,
ι is a function from X to Q, and
F is a regular set over Q.

Figure 2.Execution of a deterministic hedge automaton

Next, we show a DHA that accepts the first example in Section 2. Let M = <a , x , { q₀, q₁, q₂ }, α, ι, q₁^? }, where:

L(G) ={ε , a<x> , a<xx> , a<xxx> , ... }

A non-deterministic hedge automaton (NDHA) is < Σ, X, Q, α, ι, F>, where:

Q, Σ, and F are as specified in the definition of DFA,
α is a relation (called transition relation) from Σ × Q^* to Q (or a function from Σ × Q^* to 2^Q ) such that for every q in Q and x in Σ, {q₀q₁...q_k | k >= 0, α(x, q₀q₁...q_k, q)} is a regular string language, and
ι is a relation from X to Q (or a function from X to 2^Q).

By definition, a DHA is also a NDHA. We only have to confuse a state and a singleton set containing that state. Thus, the above DHA is also an example of NDHAs.

The last example RHG in Section ``Regular Hedge Grammer'' can be readily captured by a NDHA <Σ, X, Q, α, ι, F >, where

Σ = {segment, para },
X = { #PCDATA },
Q = < q₁, q₂, q_p, q_# >,
α(a, u ) contains q₁ when a = segment and u in L ( q_p^* q₂^* ) ),
α(a, u ) contains q₂ when a = segment and u in L ( q_p^* ),
α(a, u ) contains q_p when a = para and u in L(q_#)),
ι(x ) = q_# when x = #PCDATA,
F = q₁ .

Properties of Regular Hedge Languges

Equivalence

L is generated by a RHG,
L is accepted by a DHA, and
L is accepted by a NDHA.

The proof that (3) implies (2) is done by the subset construction. The rest of the proof is straightforward.

Boolean closure

Suppose that set L₁ and L₂ are accepted by (N)DHA M₁ and M₂, respectively. We can effectively construct (N)DHAs that accept the following languages.

the intersection of L₁ and L₂,
the union of L₁ and L₂,
the complement of L₁ (the set of all hedges not contained by L₁)

Parse trees of extended context-free grammars

The set of parse trees of an extended context-free grammar is said to be a local tree language. A lot is known about the relationships between local tree languages and regular hedge languages. We mention two observations which are directly relevant to XML.

A local tree language is a regular hedge language (in other words, for any extenced context-free grammar, we can construct a DHA.), and
For any regular hedge language that contains trees only, there exists a unique minimal local tree language that includes that regular hedge language.

Observation (1) implies that RHGs are more powerful than DTDs, while (2) ensures that given any RHG, we can construct a reasonable DTD.

BIBLIOGRAHICS NOTES

In the theoretical computer science community, regular hedge languages were first studied by Pair et al[PQ68] and Takahashi[Tak75]. Regular hedge language can also be considered as extensions of regular tree languages [Tha67]. We borrow some concepts from these papers but adopt definitions more similar to those for regular string languages.

We define RHG's similarily to [PQ68,Tak75], but we avoid projections. Alternatively, our definition can be considered as a hedge-version of Brainerd's tree regular grammars (called "tree generating regular systems) [Bra69].

Our definitions of NDHAs and DHAs are derived from (non-)deterministic tree automata of [Tha67] except that we have extended them to hedges.

It was Kil-Ho Shin (Fuji Xerox) who first proposed to use regular hedge languages as a formal model for schemata of structured documents. His proposal dates back to November, 1991, but he never published any papers. In search of a formalism for document schemata, HIYAMA Masayuki (FAMILY Given) reached a similar formalism in 1996. Since 1993, the present author has applied regular hedge languages (and hedge monoids, which are outside the scope of this note) for schema transformation [Mur97a,Mur97b,Mur98].

The word ``hedge'' was originally proposed by Bruno Courcelle [Cou89]. Derick Wood recommended the use of this word, and it has become the standard word in the XML community after a tutorial by Paul Prescod in 1999. For more information, see the special section on hedge automata in the he SGML/XML Web Page (http://www.oasis-open.org/cover/topics.html#forestAutomata).

References

[Tha87] James W. Thatcher. Tree automata: An informal survey. In Alfred V. Aho, editor, Currents in the theory of computing, pages 143--172. Prentice-Hall, 1987.