Wednesday, 7 October 2015

Change - because things don't stand still

As I've decided I like blogging more than tweeting I'm returning to some of my old blogs with a view to rekindling them and moving the conversations on. This blog used to be called 'Keeping it XML' but at the heart of that was my love of XSLT and the realisation that most programming tasks boiled down to a transformation of information or state from one form to another. With that in mind I've renamed it 'Transformations… underpin progress in a world of change.'

Monday, 14 September 2009

XProc and SMIL: Orchestrating Pipelines

Although the W3C's XML Pipeline Language (XProc) hasn't even left the stable yet, people are already looking beyond its original purpose. XProc was designed to solve the problem of how to describe the joining together of multiple XML processing steps. So, the question is, how do you extend XProc to handle new features like explicit concurrency...

Friday, 4 September 2009

Serving XML

Abstract

The stage has been set for a new kind of application development that does away with the impedance mismatch that occurs between programming languages like Java and the XML data they seek to process. The ability to query and update XML content in a consistent and logical manner is being provided by technologies like Native XML Databases, XQuery and the XRX architecture.

Introduction

This article takes a broad and high-level view of the current state of Native XML Databases, XML Querying, Processing and their future. Starting with a brief review of the way XML used to be stored, we then look at what Native XML databases are today and what is on offer. With XQuery now firmly established as a W3C Recommendation, we see how it is being employed for search, aggregation and the delivery of XML content. As a part of this we make a quick study of the XRX architecture, ask how XSLT fits in, look at XML Pipelines and finally cast our gaze to the future.

The Way Things Were

XML documents have traditionally been stored in Relational Database Management Systems (RDBMS) as 'blobs' of character data, usually broken into chunks that best fit the publisher's requirements for reuse, and then indexed accordingly. If the granularity of the 'chunks' is a good fit for your purposes then the system will be well balanced. The balancing act has to weigh the size of the chunks (e.g. article, chapter, section or paragraph) against how you might want to search, aggregate and reuse those chunks. Too small and it will be like trying to knit sawdust; too large and you'll be juggling bowling balls.

Native XML Databases

The Native XML Database (NXD) has been around for a while now, some built upon existing RDBMS systems and others taking a fresh approach. However, the essential thing about an NXD is the (logical) model used to store and retrieve an XML document. Two possible models are the XPath Data Model (XDM) and the XML Information Set (Infoset). The XDM would appear to be a logical choice as it would facilitate queries within the document using XPath expressions. Another useful feature is the ability to group documents into Collections that are identified by a Uniform Resource Identifier (URI).

There is a lot of information on the web about NXDs, and much of it dates back to the turn of the century. Consequently, the information and many people's opinions are out-of-date. One prevailing view is that NXDs are too slow. For me, whether or not an NXD can match the performance of an RDBMS is not really the question. What is important is the wider context in which the XML content is being processed. Querying rows in tables for paragraphs of a fragmented document via a mixture of SQL and Java is a fractured solution. NXDs provide a seamless environment within which the developer can query and manipulate document structure.

You no longer have to put up with disgruntled application developers bashing XML's square pegs into their roundish object-oriented holes. What you can have is a confident group of XML developers using a range of tools that are built upon the XML stack of technologies (UTF-8, URI, Namespaces, XML, XPath, XQuery and XSLT) working in a far more consistent environment.

The stage has been set for a new kind of application development that does away with the impedance mismatch that occurs between programming languages like Java and the XML data they seek to process. The ability to query and update XML content in a consistent and logical manner is being provided by technologies like NXDs, XQuery and the XRX architecture.

XQuery: A New Beginning

A new beginning indeed, but one that's been a long time coming. XQuery only became a W3C Recommendation in 2007 but has been in development for many years. Its promise is to provide a means of querying large stores of XML content, transforming and then delivering finished XML documents. It is this ability to not only query but create and transform document structure at the same time that sets it above other languages or combinations of languages. The sheer convenience of being able to use the same language to do both and in a way that's transparent to the underlying data model is undeniable.


XQuery can be regarded as a superset of XPath 2.0, where additional instructions allow the creation of new nodes (the transformation) and the ability to sort the sequences of nodes returned by a query. With a query language built upon the same logical model with which the content has been stored, NXDs are optimised for the task of storing, querying, retrieving and updating XML content. This enables them to operate with efficiency and speed. It hasn't taken long for developers to realise that an entire web application can be built on top of an NXD by using XQuery as the programming language.


The heart of XQuery is the FLWOR expression (pronounced 'flower'), which expands to For, Let, Where, Order by, Return. It is a very simple construct: 'for' a given sequence of nodes, 'let' some object (value or node) be bound so that it may be used 'where' a specific test condition is met, 'order' the resulting sequence 'by' a rule, and 'return' the node(s) as-is or wrapped in a new XML structure. A very powerful and expressive functional language that is being used to query, aggregate, transform and deliver a wide variety of content.
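To make that concrete, here is a minimal FLWOR sketch; the '/db/books' collection and the element names in it are hypothetical, not from any particular product:

```xquery
(: Hypothetical example: list in-print books, cheapest first. :)
for $book in collection('/db/books')//book
let $price := xs:decimal($book/price)
where $book/@status = 'in-print'
order by $price ascending
return
  <result title="{$book/title}" price="{$price}"/>
```

Each clause maps directly onto the letters of FLWOR, and the return clause shows the node construction that plain XPath cannot do.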

XRX: An End-to-End XML Solution

The acronym XRX stands for XForms, REST and XQuery: the front, middle and back end of an architecture for handling XML content. XRX came into being as a result of the understanding that XQuery is capable of processing HTTP requests and returning responses, provided that it has access to the request information as XML. A number of NXDs, including eXist and MarkLogic Server, do just that via extension functions and thereby provide a framework upon which the XRX architecture can be built.

XForms

Creating and editing XML content on the client side with XForms reinforces the simple 'one technology stack' approach that was mentioned previously. XForms 1.1 gains some significant enhancements for both User Interface construction and HTTP request/response handling. Previously, XForms was the preserve of the browser plug-in which, like SVG, has somewhat slowed its adoption.

Now, all of a sudden, there are three browser-based implementations nearing completion, at least two server-side implementations, as well as the two main plug-ins for Firefox and Internet Explorer.
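As a sketch of the request/response side of XForms 1.1, a model might bind an instance to a REST resource like this; the instance file and resource URI are hypothetical:

```xml
<xforms:model xmlns:xforms="http://www.w3.org/2002/xforms">
  <!-- Hypothetical instance document and target URI -->
  <xforms:instance src="order.xml"/>
  <xforms:submission id="save" method="put"
                     resource="/exist/rest/db/orders/order.xml"
                     replace="none"/>
</xforms:model>
```

The method="put" and resource attributes are part of the XForms 1.1 enhancements mentioned above, and they map straight onto the REST verbs described in the next section.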

XForms Implementations


The W3C XForms Group Wiki has a full list of XForms implementations.


XForms is being used by Xerox, IBM, Mark Logic and shortly by EMC/Documentum, as they've all realised that it reduces development time and relieves the bottleneck between data and the application.

REST

Representational State Transfer, or REST for short, is one of the fundamental building blocks of the web. It requires you to view your information as resources, which are identified by URIs. You have at your disposal a set of HTTP 'verbs' (POST, GET, PUT and DELETE) that describe the nature of the action you wish to perform (Create, Retrieve, Update and Delete). You also have a means of adding information to the request and response, via headers, that helps the server and client process them, whilst leaving the URI free from implementation-specific information.
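By way of illustration, storing a document in an NXD over REST might look like the exchange below; the collection path is hypothetical (eXist, for example, exposes the database under a /rest prefix):

```
PUT /exist/rest/db/articles/serving-xml.xml HTTP/1.1
Content-Type: application/xml

<article><title>Serving XML</title></article>
```

The URI names the resource, the verb states the intent (create/update), and the header carries the implementation detail, exactly as described above.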

XQuery

Instead of using a conventional programming language to provide the server-side logic for processing requests, XQuery will do for the server side what XForms does for the client, allowing direct and powerful access into the XML content. Although not part of the original W3C Recommendation, the ability to update the structure of XML documents within an NXD is supported via extensions that are close in nature to the up-coming XQuery Update Facility Recommendation.
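As a sketch in the draft XQuery Update Facility syntax (the extension dialects offered by individual NXDs vary, and the document URI and element names here are hypothetical), an in-place update reads like this:

```xquery
(: Hypothetical document: mark an order as shipped. :)
let $order := doc('/db/orders/order-42.xml')/order
return (
  replace value of node $order/@status with 'shipped',
  insert node <shipped-date>2009-09-04</shipped-date> into $order
)
```

The update happens inside the database, in the same language used for querying; no object mapping layer is involved.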

Where Does XSLT Fit In?

At this point it is worth mentioning Extensible Stylesheet Language Transformations (XSLT). Although the XSLT community does not have a problem with XQuery, you will find evidence of some XQuery developers trying to ignore the existence of, or at least change the subject from, XSLT. There isn't, nor should there be, a battle between the two. XQuery will query very well, and XSLT will transform very well. Each can do the other's work, but you have to push them harder to do it. With that in mind it is a good idea to have a tool that allows the two of them (XQuery and XSLT) to be strung together as part of a pipeline. One queries, the other transforms and the result is delivered. That leads us neatly on to XML pipelines.

XProc: An XML Pipeline Language

There are a number of existing technologies for 'stringing' together XML processes: Shell scripts, Apache Ant, Apache Cocoon to name but three. However, none of them hit the nail on the head quite like the W3C's XML Pipeline Language (XProc). Although it takes a little while to get your head around XProc, it is a well thought-out and powerful pipeline description language that enables a wide variety of XML related processing functions to be brought together in a single processing description.

Using the concept of processing steps that describe their inputs, outputs and options, an XProc pipeline can describe some quite complex workflows that include conditional constructs and loops. It is even possible to modify or create new structure in an XSLT-esque style. By using XQuery to retrieve a collection of documents that match a specific query, the resulting collection can be passed through many subsequent steps: XSLT transforms, further HTTP requests, Formatting Objects (FO) processors and many more besides. Currently, the two leading implementations are Norman Walsh's XML Calabash and Calumet by EMC.
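A minimal pipeline in that spirit might feed an XQuery step into an XSLT step; the query and stylesheet file names below are hypothetical:

```xml
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <!-- Query the store, then transform the result.
       Both referenced files are hypothetical. -->
  <p:xquery>
    <p:input port="query">
      <p:document href="find-articles.xq"/>
    </p:input>
  </p:xquery>
  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="to-xhtml.xsl"/>
    </p:input>
  </p:xslt>
</p:pipeline>
```

Each step declares its inputs on named ports, and the default output of one step flows implicitly into the next: one queries, the other transforms, the result is delivered.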

The idea here is to split up and describe your workflow using XProc, to make it self-describing, easy to compose, reusable and standardised. Also, being an XML language that uses XPath, it builds on the whole XML end-to-end philosophy. Even more intriguing are the possibilities of creating and transforming XProc pipelines with those same tools.

Serving XML

Search, update, aggregate, transform and deliver. These are the things that a new breed of application, called an XML Server, is bringing together under one roof. Companies like Mark Logic, with their fully integrated MarkLogic Server, are now offering end-to-end XML solutions. In the Open Source arena, eXist is another application that marries an NXD, a pipeline processor and XQuery together. Also in the mix are BaseX, which provides an XQuery implementation, and EMC's suite of XML tools. The Wikipedia entry for XML Databases has an extensive list of NXD implementations.

There are many companies and organisations that are waking up to the fact that the combination of an NXD and XQuery will make the job of delivering their current content in new and more versatile ways easier, whilst adding value to their products. Such users include:

  • Elsevier
  • McGraw-Hill Education
  • O'Reilly Media
  • Oxford University Press
  • Springer Science+Business Media
  • Wiley
  • Wolters Kluwer

These companies are all big-time book and journal publishers who are finding that the efficiency brought about by using technologies that are a better match for their content is enabling them to provide very compelling products in an increasingly competitive market.

What Does the Future Hold?

This is where things start to get interesting. Two of the hottest topics in the industry, let alone XML, are concurrency and streaming. Rather than build these features directly into the XProc recommendation, the W3C took the approach that these should be provided by the implementation. XProc does nothing to hinder this by its design. XProc will, hopefully this year, become a full Recommendation and will be part-and-parcel of a number of major XML product lines from both EMC and Mark Logic.

XSLT 2.1, which is currently being drafted, will provide for streaming transformations and it, as well as XQuery, will support higher-order functions (passing functions as parameters to functions) further strengthening their Functional Programming credentials.
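To give a flavour of higher-order functions, here is a sketch in the draft syntax of the time (the function names are hypothetical, and the syntax was still subject to change):

```xquery
(: A function item passed as an argument to another function. :)
declare function local:apply-twice($f, $x) { $f($f($x)) };
local:apply-twice(function($n) { $n * 2 }, 10)
```

The inline function is doubled twice, so the query evaluates to 40; it is this ability to treat functions as values that cements the functional credentials mentioned above.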

Another potentially very interesting development that will attempt to cross the divide between the Semantic Web and XML is XSPARQL, a merging of RDF's SPARQL query language and XQuery, whose aim is to provide the flexibility of semantic search with the result-processing power of XQuery.

The renewed interest in XForms will continue through the client-side implementations. Also on the client-side front, there are recent stories of XQuery plug-ins for browsers that will allow very elegant and efficient scripting of client-side behaviours.

The scene has quite definitely been set for an exciting future!

Saturday, 13 January 2007

A few clever tricks with XSLT 2

I've been working, either professionally or personally, with XSLT 2 and Saxon 8 for the best part of the last two years. It is a real step-up from XSLT 1 and I can honestly say I love it even more than XSLT 1. Here are a few useful/interesting things I've encountered on the way:

1) Resolving local fragment identifiers within a document. SVG has a mechanism for declaring pieces of reusable mark-up that can be referenced with a <use xlink:href="#someID"/> fragment where #someID is a fragment identifier URL that is resolved within the parent document. When using XSLT 2 to process a document and you wished to resolve that reference you could try the following.

<xsl:sequence select="doc(resolve-uri(@xlink:href, base-uri(root())))"/>

This example treats the fragment identifier as a URI (which it is), resolves it with respect to the URI of the source document's root, and then opens that document and extracts the fragment the identifier points to. It's short and sweet and doesn't require any tedious messing around with substring-after(@xlink:href, '#'). For that matter you could define your own wrapper function called my:resolve-fragment-identifier() that takes a single argument that is the fragment identifier.

2) Now, in saying all that, you could regard the previous reference as something akin to an ID/IDREF. But in cases where you have used the id attribute but don't have a schema or DTD you won't be able to use the XPath id() function.

Or can you?

In a previous post entitled 'Who knows what a node ID is?' I talked about the xml:id attribute. Saxon 8 understands this attribute's intent to uniquely identify its owner element within the scope of the parent document. So, put simply, if you don't or won't have a schema/DTD but you do want ID/IDREFs, then use xml:id and the id() function.
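As a small sketch of the two working together (the source vocabulary here is hypothetical):

```xml
<!-- Given a hypothetical source document:
       <shapes>
         <shape xml:id="star"/>
         <use ref="star"/>
       </shapes>
     this template resolves the reference with id(),
     no DTD or schema required: -->
<xsl:template match="use" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:sequence select="id(@ref)"/>
</xsl:template>
```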

3) Here's a nice little tip - you have a path expression that must match, for example, one of two attributes.

<xforms:input bind="date" ref="shipping/@date">
....
</xforms:input>

In this case the bind attribute has priority over the ref attribute so:

<xsl:value-of select="(@bind, @ref)[1]"/>

The parentheses construct a sequence and the [1] predicate states that the first item in the sequence will be selected. Where both are present, @bind is selected, but in the absence of the bind attribute, @ref will be selected.

Of Schematron, Unit Testing and oXygen...

I've been off the SVG trail for the last three or so months due to a shift in contract work. But, as this blog is not just about SVG, I'd like to pass on my experiences with some truly wonderful things:
  • Schematron - a rules based XML validation language
  • Unit testing - with respect to the above
  • <oxygen/> - a most excellent XML IDE
Schematron is an example of how you can put existing technologies to very good use. It uses XML and XPath to declaratively express a set of rules that, when applied to a source document, will test the validity of that source document. But here's the cool part: there is an XSLT implementation of Schematron and, what's more, it is, in my opinion, a very good example of what XSLT can do. The XSLT implementation has been written with many hooks to allow overriding of the standard functionality. It is, for example, quite straightforward to customise the report output. I'll say no more than this and advise you to check it out.
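As a taste, a minimal rule set reads almost like prose; the book/price vocabulary here is hypothetical:

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:pattern>
    <sch:rule context="book">
      <!-- Hypothetical vocabulary: every book needs an ISBN -->
      <sch:assert test="@isbn">A book must have an isbn attribute.</sch:assert>
      <sch:report test="price &lt; 0">A price must not be negative.</sch:report>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

An assert fires when its XPath test is false, a report fires when its test is true, and the messages come out in the validation report.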

Unit Testing is something I'm sure we've all had some involvement with one way or another; for some people it's necessary and for others it's a necessary evil. Now, considering what I've just been saying about Schematron, a rules-based language for validating XML documents, or fragments thereof, the light should be coming on about now. Yes, you can use Schematron to validate the results of your XSLT transformations. I could bang on for ages about this but I won't. Have a think and a play.

<oxygen/>, as I've already stated, is a most excellent XML IDE, which comes chock-full of wonderful tools, including a very good source editor with superb code completion, abbreviations and support for all the main validation languages (including Schematron). It has a sensational debugger that you've got to experience to believe, a profiler that I haven't touched yet but that others I know have found useful, and it supports XSLT 2 and XQuery via Saxon 8. I'm not kidding when I say - It rocks!!!

Tuesday, 3 October 2006

XSLT is an XML application so why not transform it

As it happens, Michael Kay is presenting a paper entitled Meta-stylesheets at XML2006.

It must be about three years ago that I had what can only be described as an epiphany with respect to seeing XSLT for what it is: an XML application, and as such one that can be generated by XSLT and, for that matter, transformed by XSLT into another XSLT stylesheet.

If you are using a framework like Apache's Cocoon, which allows XSLT transforms to be referenced as the product of another pipeline, then that's one way to employ meta-stylesheets. Another, and potentially more interesting, approach, which I'm sure Mr. Kay will bring up, is the use of the Saxon 8 / XSLT 2 extension functions saxon:compile-stylesheet() and saxon:transform(). These two together allow you to load a stylesheet into your running stylesheet and apply it to a node-set or sequence that you are working on.
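A sketch of the idea, with the saxon namespace declared inline and a hypothetical stylesheet URI:

```xml
<!-- Load a second stylesheet at run-time and apply it to the
     current source document. 'upgrade.xsl' is hypothetical. -->
<xsl:variable name="compiled"
              select="saxon:compile-stylesheet(doc('upgrade.xsl'))"
              xmlns:saxon="http://saxon.sf.net/"/>
<xsl:sequence select="saxon:transform($compiled, /)"
              xmlns:saxon="http://saxon.sf.net/"/>
```

The compiled stylesheet is just a value, so it can be chosen, or even constructed, based on what the running transform finds in its input.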

But why stop there when you could build a transform at run-time based on some aspect of your source document then apply that transform to either the source or some node-set derived from the source to produce the desired result.

All very wonderful stuff and I'd love to be there when he presents his paper but alas I will not. So I hope it will be available post conference in some way shape or form.

I initially used XSLT transforms on XSLT stylesheets to map some XHTML-generating XSLT into XSL-FO-generating XSLT. The end result of that was to simplify the maintenance of a website that published to both XHTML and PDF. Structure and style changes to the XHTML were propagated to the PDF output automatically... Sweet :)

More recently I have been looking at Schematron with a mind to using it for unit testing. More on that will follow in due course.

Wednesday, 2 August 2006

Who knows what a node ID is?

The question of accessing a node in the source tree by its identity depends upon which SVG viewer you are using.

Up until 9th September 2005 you could rely upon two things:

1) If there was a DTD available, then your parser would identify id attributes as being of type ID if they were declared so in the DTD.

2) Your application assumed that when you used 'id' as an attribute name then it most probably was one.

Either way you could use document.getElementById(id) to locate the uniquely identified node in the document tree.

Now, with respect to SVG viewers, both Batik and ASV3 treat nodes in the source tree as SVGElement nodes. As a result you can use document.getElementById(id) to locate metadata nodes that are not in the SVG namespace.

But, if they are not in the SVG namespace and not declared in the SVG DTD then they should not be accessible by this method. The two new implementations of SVG, Firefox 1.5 and Opera 9 have made this distinction. Neither of these browsers will allow you to use the DOM to retrieve nodes by their ID if they are not SVG elements.

However, there is light at the end of the tunnel. From 9th September 2005 onwards there was a new W3C Recommendation that covered this exact problem. The xml:id Recommendation defines a new XML attribute which, like xml:lang, has a special meaning. If you use an xml:id attribute, the application processing your XML should interpret it as a unique identifier for the owner element, regardless of the presence of a DTD or schema. I'm happy to say that Opera 9 supports xml:id but, unfortunately, at this moment in time, Firefox 1.5 does not.
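To illustrate, an SVG document can carry uniquely identified foreign metadata like this; the metadata vocabulary below is hypothetical:

```xml
<svg xmlns="http://www.w3.org/2000/svg">
  <metadata>
    <!-- Hypothetical metadata vocabulary: xml:id marks this element
         as retrievable via document.getElementById('meta-1')
         in viewers that support the xml:id Recommendation -->
    <record xmlns="http://example.org/meta" xml:id="meta-1">Some data</record>
  </metadata>
</svg>
```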

So, once again I find myself being bruised and bumped by web standards support.