Monday, 08 September 2008
Getting Started With XML

This column appeared in the March 2000 edition of EXE. It discusses working with XML-based data, and I'm proud to say that it is listed as a resource on the official W3C site.


Click here to go to download page


It is apparently a common mistake to assume that XML is a forthcoming replacement for HTML, but this isn’t in fact the case. XML was intended from the outset to be a tag-based markup language that is complementary to HTML. However, whereas HTML describes the layout of a page, XML is concerned with defining and describing data. This month I am discussing what XML actually is and how to work with it from Visual Basic. XML is a large subject, so there is no way that I can provide any kind of comprehensive coverage; my intention here is to provide an overview of the subject in order to acquaint you with it. It might be new right now but it is expected to become widespread very quickly; one of the primary additions to the next version of SQL Server is XML support, and it will also be catered for extensively in Visual Studio 7.

Description of XML

As I said, HTML purely describes the layout of a page; it isn’t suitable for defining data. Once the web evolved to the point where dynamic data started to become commonplace, the need to transport data together with its definition became more pressing. XML, or Extensible Markup Language to give it its full title, is based upon an earlier, more comprehensive markup language called SGML (Standard Generalised Markup Language). The problem with SGML, however, is that it is a rather complex and broad definition and frankly is something of an overkill. The size of this definition leads in turn to a greater overhead than is necessary. A body called the XML Working Group formed with the express intention of producing a data-defining language that would be easy to utilise over the Web. This body defined a suitable subset of SGML that carried less of an overhead but was powerful enough to describe data. The upshot of this is that XML is a legal subset of SGML; any XML document is also a true SGML document.

The XML 1.0 specification has been “recommended” by the World Wide Web Consortium (W3C). This recommendation means that XML is defined well enough for vendors to implement it, but technically speaking it isn’t yet a standard. As far as Microsoft is concerned, XML 1.0 support was made available with Internet Explorer 4, but Internet Explorer 5 shipped version 2.0 of the Microsoft XML parser, which added significant new functionality, most notably something called the XML Document Object Model which I’ll come back to a little later. Actually IE5 was re-engineered somewhat to make the individual components more granular. This means that the XML support engine, implemented in msxml.dll, doesn’t necessarily need the rest of IE5 to be of use. In fact, you can obtain the separate XML component from http://msdn.microsoft.com/downloads/tools/xmlparser/xmlparser.asp

HTML is a markup language that is based around a standard set of tags. For example,

<h1>My Header Text</h1>

displays the caption “My Header Text”. Because of the <h1></h1> tag pair the browser will identify the text as being one of the predefined header text types and will format it accordingly. Other tags such as head, body, form, input, and so on are all predefined as part of the working set for a particular version of HTML. When the W3C decide to add additional tags into a definition a new, formal version number is specified, such as 3.2 or 4. This approach is in contrast to XML, in which the author of the file can declare and then define new tags as necessary. This is where the ‘extensible’ part comes in.

It is possible for an HTML stream to contain an embedded XML section; these are known as XML Islands. While HTML and XML are two different specifications, the W3C are also introducing XHTML, which reformulates HTML as an application of XML. The MathML specification has also been released by the W3C as a specific application of XML that supports the use of complex mathematical formulae.

ADO RecordSet Support

How you manipulate XML data depends upon what you want to do with it. If you are working with an ADO 2.1 or above RecordSet object then you are able to persist the contents of the RecordSet to disk, and then read it back in again later. To illustrate this point we can open a RecordSet from the Northwind database:

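' adoConn is assumed to be an ADODB.Connection already opened on Northwind
Set adoCommand = New ADODB.Command
Set adoCommand.ActiveConnection = adoConn
adoCommand.CommandText = "SELECT * FROM Shippers"
Set adoRs = adoCommand.Execute
' Persist the RecordSet, schema and data together, as XML
adoRs.Save "C:\exexml\Shippers.xml", adPersistXML
adoRs.Close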

and it can then be read in again at a later time by calling

Set adoRs = New ADODB.Recordset
adoRs.Open "C:\exexml\Shippers.xml"

The XML produced by the Save method is shown in full in Listing 1, although I have slightly tidied up the presentation of it from the version that ADO actually produced.

<xml xmlns:s='uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882'
xmlns:dt='uuid:C2F41010-65B3-11d1-A29F-00AA00C14882'
xmlns:rs='urn:schemas-microsoft-com:rowset'
xmlns:z='#RowsetSchema'>
<s:Schema id='RowsetSchema'>
<s:ElementType name='row' content='eltOnly'>
<s:attribute type='ShipperID'/>
<s:attribute type='CompanyName'/>
<s:attribute type='Phone'/>
<s:extends type='rs:rowbase'/>
</s:ElementType>
<s:AttributeType name='ShipperID' rs:number='1'>
<s:datatype dt:type='int' dt:maxLength='4' rs:precision='10'
rs:fixedlength='true' rs:maybenull='false'/>
</s:AttributeType>
<s:AttributeType name='CompanyName' rs:number='2'
s:writeunknown='true'>
<s:datatype dt:type='string' dt:maxLength='40'
rs:maybenull='false'/>
</s:AttributeType>
<s:AttributeType name='Phone' rs:number='3'
rs:nullable='true' rs:writeunknown='true'>
<s:datatype dt:type='string' dt:maxLength='24'/>
</s:AttributeType>
</s:Schema>
<rs:data>
<z:row ShipperID='1' CompanyName='Speedy Express'
Phone='(503) 555-9831'/>
<z:row ShipperID='2' CompanyName='United Package'
Phone='(503) 555-3199'/>
<z:row ShipperID='3' CompanyName='Federal Shipping'
Phone='(503) 555-9931'/>
</rs:data>
</xml>

Listing 1: Shippers table from NorthWind database expressed as XML

XML Structure

An XML file is regarded in logical terms as a Document. Encased within this definition are two sub-structures:

  1. Prolog: The existence of the prolog within an XML document is actually optional. Its purpose is to store management information, such as the version of XML that exists within the file. It can also contain a reference to an external Document Type Definition (DTD) if appropriate. An external DTD file is merely a means of separating the definition of the data into a file that is separate from the actual data; if the DTD isn’t external then it is internal to the document.

  2. Document Element: This is really the core part of the file as it actually contains the data, as well as the data definition if it doesn’t reside within an external DTD.
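
The split between prolog and document element can be seen in a small sketch. It is written in Python with the standard library's xml.dom.minidom rather than Visual Basic and msxml, purely so that it can be run anywhere; the sample document and its shippers and shipper tags are invented for illustration.

```python
from xml.dom import minidom

# Hypothetical document: the first six lines are the prolog (XML declaration
# plus an internal DTD); the last line is the document element.
SAMPLE = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE shippers [\n'
    '  <!ELEMENT shippers (shipper*)>\n'
    '  <!ELEMENT shipper EMPTY>\n'
    '  <!ATTLIST shipper name CDATA #REQUIRED>\n'
    ']>\n'
    '<shippers><shipper name="Speedy Express"/></shippers>'
)

doc = minidom.parseString(SAMPLE)

# The DTD declared in the prolog is exposed as the document's doctype node.
doctype_name = doc.doctype.name

# The document element is the single root node that holds the data.
root_name = doc.documentElement.tagName

print("DTD declared for:", doctype_name)
print("Document element:", root_name)
```

Here the DTD is internal to the document; with an external DTD the DOCTYPE line would name a separate file instead of carrying the declarations in square brackets.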

Looking at the XML code in Listing 1 allows us to get a handle on the way that the document is laid out. As I said earlier this is almost exactly as ADO churned it out, so please don't be put off by the fact that many of the node names are prefixed with “s:”, “rs:” or “z:”; these are XML namespace prefixes, declared by the xmlns attributes on the root node, and they can be a little bit confusing for a first-time reader.

At the start of the file the root node is identified by the tag name xml. Everything else that appears within that particular set of angle brackets is a series of namespace declarations that associate the prefixes s, dt, rs and z with their namespaces. At the end of the document is the </xml> tag denoting the end of the root node. Beneath the first set of angle brackets are defined two child nodes called s:Schema and rs:data. The s:Schema node contains the s:ElementType node, which declares the names of the attributes that each row carries, and a series of s:AttributeType nodes which each contain a subnode called s:datatype that declares the nature of each attribute. Finally, down at the rs:data node, each item of data is presented as a z:row node.
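
The layout just described can be checked mechanically. The following is a minimal sketch in Python using the standard library's xml.dom.minidom (an assumption made for portability; it is not the msxml parser). LISTING_1 is an abbreviated copy of Listing 1 with the s:AttributeType nodes left out, and outline is a hypothetical helper name.

```python
from xml.dom import minidom

# Abbreviated copy of Listing 1 (the s:AttributeType nodes are omitted).
LISTING_1 = """<xml xmlns:s='uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882'
 xmlns:dt='uuid:C2F41010-65B3-11d1-A29F-00AA00C14882'
 xmlns:rs='urn:schemas-microsoft-com:rowset'
 xmlns:z='#RowsetSchema'>
<s:Schema id='RowsetSchema'>
<s:ElementType name='row' content='eltOnly'>
<s:attribute type='ShipperID'/>
<s:attribute type='CompanyName'/>
<s:attribute type='Phone'/>
<s:extends type='rs:rowbase'/>
</s:ElementType>
</s:Schema>
<rs:data>
<z:row ShipperID='1' CompanyName='Speedy Express' Phone='(503) 555-9831'/>
<z:row ShipperID='2' CompanyName='United Package' Phone='(503) 555-3199'/>
<z:row ShipperID='3' CompanyName='Federal Shipping' Phone='(503) 555-9931'/>
</rs:data>
</xml>"""


def outline(node, depth=0, lines=None):
    """Return one indented line per element node, visited depth-first."""
    if lines is None:
        lines = []
    # Skip the whitespace text nodes that sit between the elements.
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
        for child in node.childNodes:
            outline(child, depth + 1, lines)
    return lines


doc = minidom.parseString(LISTING_1)
tree = outline(doc.documentElement)
print("\n".join(tree))
```

The printed outline starts with xml at the left margin, with s:Schema and rs:data indented one level beneath it.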

The DTD provides the context for the actual data, so it is clearly important that the data actually follows these rules. It is the job of the parser to analyse the file while opening it and check it for conformance. This conformance check covers both compliance with the expected layout of an XML document and the conformance of the data to its own DTD. If a document passes both of these checks then it is classed as being a valid document. The downside to performing a validity check is of course that it can take some time to parse a file, particularly if it is large. Therefore a less rigid classification for a document can be given instead, in which case the document is referred to as being well-formed. A well-formed document purely complies with the formal rules for an XML document; the DTD is not considered in this case.
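
To illustrate the weaker of the two checks, here is a minimal sketch in Python. The parser in Python's standard library is non-validating, so it tests well-formedness only and never checks data against a DTD; is_well_formed and the two sample strings are hypothetical names and data made up for this example.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_well_formed(text):
    """Return True if text obeys the formal rules of XML (no DTD check)."""
    try:
        minidom.parseString(text)
    except ExpatError:
        return False
    return True


# Every opening tag is closed, so this document is well-formed.
good = "<shippers><shipper name='Speedy Express'/></shippers>"

# The shipper tag is never closed, which breaks the formal rules.
bad = "<shippers><shipper name='Speedy Express'></shippers>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

A validating parser would go one step further and reject a well-formed document whose tags or attributes are not permitted by its DTD.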

XML Document Object Model

While it is possible to open an XML document as a text file and pick your way through it by keeping tabs on your indentation levels, it is much easier to use an XML parser. Microsoft provide such a parser with Internet Explorer in the form of msxml.dll which resides in the Windows System32 folder. The parser exposes the XML Document Object Model (DOM for short) and presents the data as a series of objects that you can manipulate programmatically. To connect to this parser from Visual Basic you should select the Microsoft XML entry within the References dialog (found under the Project menu).

When loading up a new referenced object for the first time I always find it helpful to fire up the Object Browser in order to have a look at what interfaces are exposed. The most important object within msxml is the DOMDocument because it is the overall encapsulation of the XML file that you are working with. Elements are broken down into nodes, and each node exposes a childNodes collection to help you drill down.

In order to show how to use the DOM I have written a simple program that demonstrates the process whereby an ADO RecordSet saves data in XML format. The program also provides two different methods of reading the data back in, in this case into an MSFlexGrid component. The interface isn’t very exciting, but my intention is for you to look at the underlying code behind the buttons. In one method the ADO RecordSet is navigated through the MoveNext call, whereas in the other method the XML structure itself is traversed by means of the DOM. Within this section I also use two different methods for traversing the data, purely to illustrate the different ways in which it can be done. When analysing the schema to pick out the field names I use a fairly hairy drill down through several collections at once, just to show that you can:

Dim xmlFieldNodes As MSXML.IXMLDOMNodeList
' Drill down by position: xml -> s:Schema -> s:ElementType -> its child nodes
Set xmlFieldNodes = xmlDoc.childNodes.Item(0).childNodes.Item(0).childNodes.Item(0).childNodes

Whereas to obtain the actual data I adopt the more elegant approach of stepping down one node at a time and calling each node by name rather than ordinal position, such as

Dim xmlRootNode As MSXML.IXMLDOMNode
Dim xmlDataNode As MSXML.IXMLDOMNode
Set xmlRootNode = xmlDoc.selectSingleNode("xml")
Set xmlDataNode = xmlRootNode.selectSingleNode("rs:data")
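
The same by-name approach can be sketched outside Visual Basic. The example below uses Python's standard xml.dom.minidom as a stand-in for msxml (an assumption, chosen so that the code runs without Windows); LISTING_1 is an abbreviated copy of Listing 1, and field_names and rows are names invented for the sketch.

```python
from xml.dom import minidom

# Abbreviated copy of Listing 1 (the s:AttributeType nodes are omitted).
LISTING_1 = """<xml xmlns:s='uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882'
 xmlns:dt='uuid:C2F41010-65B3-11d1-A29F-00AA00C14882'
 xmlns:rs='urn:schemas-microsoft-com:rowset'
 xmlns:z='#RowsetSchema'>
<s:Schema id='RowsetSchema'>
<s:ElementType name='row' content='eltOnly'>
<s:attribute type='ShipperID'/>
<s:attribute type='CompanyName'/>
<s:attribute type='Phone'/>
<s:extends type='rs:rowbase'/>
</s:ElementType>
</s:Schema>
<rs:data>
<z:row ShipperID='1' CompanyName='Speedy Express' Phone='(503) 555-9831'/>
<z:row ShipperID='2' CompanyName='United Package' Phone='(503) 555-3199'/>
<z:row ShipperID='3' CompanyName='Federal Shipping' Phone='(503) 555-9931'/>
</rs:data>
</xml>"""

doc = minidom.parseString(LISTING_1)

# Field names: the type attribute of each s:attribute under s:ElementType.
element_type = doc.getElementsByTagName("s:ElementType")[0]
field_names = [
    node.getAttribute("type")
    for node in element_type.getElementsByTagName("s:attribute")
]

# Data: each z:row node carries one value per field as an XML attribute.
rows = [
    {name: row.getAttribute(name) for name in field_names}
    for row in doc.getElementsByTagName("z:row")
]

print(field_names)
for row in rows:
    print(row)
```

Looking nodes up by name keeps working if extra nodes are later added ahead of the ones you want, which is not true of a drill down by ordinal position.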

There are further comments within the code itself to explain what is going on. You can download it from the link at the top of this page.

Resources

In terms of available resources Microsoft Press has released a book called XML in Action, written by William J Pardi (ISBN 0-7356-0562-9, priced £25.99). To be honest it makes for fairly dry reading, but this is very much a reflection of the subject matter rather than the writing style: reading raw XML is necessary in order to understand the subject more completely, but it isn’t exactly riveting stuff. The book covers the theory of the subject in a comprehensive manner, but is restricted to providing the sample code in a generic way. The CD contains quite a rich set of resources that could keep you browsing for hours. For greater coverage of using XML with Visual Basic a forthcoming book from Wrox Press might be worth a look when it becomes available: Professional VB6 XML should be available by the time you read this, ISBN 1-861003-32-3.

Finally, Microsoft has made available an XML Code Generator that creates Visual Basic classes from XML schemas, downloadable from http://msdn.microsoft.com/xml/articles/generat.asp. This tool provides a means of working with XML data without having to traverse the DOM in order to get at the data. This tool is primarily implemented as a DLL so you can add the utility to your own applications should you wish. There is also an XML Resource Kit CD available, details at the MSDN site at http://msdn.microsoft.com/xml/default.asp.

XML is destined to be a very important technology in the near future and I have tried to provide an introduction within just a couple of pages. There is, however, a lot more to say on the subject so I will be providing further coverage of it (update: at least I intended to before the magazine folded!).

Copyright ©2002 Jon Perkins

I, Jon Michael Perkins, hereby assert and give notice of my right under section 77 of the Copyright, Designs, and Patents Act 1988 to be identified as the author of the foregoing article.