Monday, July 15, 2013

XQuery on SQL Hosts

XQuery on SQL Hosts
Torsten Grust, Sherif Sakr, and Jens Teubner
VLDB 2004

I've been interested in XML and XQuery for a long time, and this spring taught a course on Querying and Storing XML, which primarily addresses research topics over the last 10-15 years aimed at supporting database-style querying over XML data.  Although the wave of hype/industry enthusiasm for XML seems to have crested, many of the basic ideas live on in the various NoSQL, MapReduce or RDF-based (graph querying) systems that are now attracting more attention.

One of the most interesting (to me) lines of work on XML query processing has been based on "compiling" operations over XML down to relational operations for which efficient implementations are well-understood.  Previous work by DeHaan et al. (SIGMOD 2003), Grust et al. (TODS 2004), and others investigated an interval encoding approach to storing XML in a database, which has the nice feature that you can translate XPath expressions (even involving recursive steps) to SQL range queries, which existing relational databases can optimize well.

For example, suppose we have a tree represented by a table as follows:



where id is a unique id, parent is the id of a node's parent, name is the element node label name, and pre and post are the numbers shown in the diagram, corresponding to intervals.  Interval nesting corresponds to ancestor-descendant relationship, and the linear order of the pre / post numbers corresponds to document order.  (Incidentally, we use this as an example of the usefulness of higher-order querying in an upcoming ICFP 2013 paper).

In this paper, Grust et al. take this approach a step further: showing how one can represent sequences relationally and translate the core features of XQuery, including sequence operations, for-iteration and element construction to SQL + arithmetic (or SQL:1999 OLAP operations such as dense_rank).  Interestingly, the translation yields a single (albeit likely massive) SQL query for any XQuery input, even if the former is deeply nested.  This is referred to as "loop-lifting", since the basic idea is to introduce explicit indices for both the loop iteration and the position of each item within a given iteration.

This is best illustrated (as the paper already does) by a simple example: consider

for $\$x$ in (1,2,3) return ($\$x$,42)

Then the first iteration produces (1,42), the second (2,42) and the third (3,42).  So we can represent this by a table:

iter | pos | value
--------------------
  0  |  0  |  1
  0  |  1  |  42
  1  |  0  |  2
  1  |  1  |  42
  2  |  0  |  3
  2  |  1  |  42

  
However, this result can be constructed in different ways: we can do it iteration-by-iteration (which is what XQuery's semantics naively does) or "subexpression-by-subexpression", constructing two tables:

iter | pos | value
--------------------
  0  |  0  |  1
  1  |  0  |  2
  2  |  0  |  3 

iter | pos | value
--------------------
  0  |  0  |  42
  1  |  0  |  42
  2  |  0  |  42

and then combining them by "shifting" the positions of the second table over by the maximum position of the elements in the first table.  (The positions are essentially arbitrary as long as they maintain order; they don't have to be dense but this is helpful to support position queries like XPath's [position()=n] filter.)

This approach underlies the MonetDB XQuery implementation (which is no longer being actively developed), as implemented in a tool called Pathfinder.  Pathfinder is also the engine used by the Ferry and Database Supported Haskell tools from the same group; it supports translating queries that return nested results to multiple flat queries. 

Stray thoughts:

  • The translation maintains a query that lists all of the indices of the current loop, and another query that maintains all of the ids of the current document (including ids generated by newly constructed elements).  There is some room for optimization here, e.g. avoiding repeatedly copying subtrees during construction of nested element content (this can easily become quadratic).
  • It would be interesting to see how much of this translation can be supported in our higher-order LINQ approach.  Features like dense_rank or row_number seem not to be available in F#, and union is not first-class.  So these seem like obstacles.
  • There doesn't seem to be a full writeup of this approach; the later Ferry paper gives similar rules adapting the idea to a nested relational query language but it is still pretty mysterious what is happening under the hood in Pathfinder.

Labels: , ,

0 Comments:

Post a Comment

Subscribe to Post Comments [Atom]

<< Home