Getting Started With XML
This column appeared in the March 2000 edition of EXE.
It discusses working with XML-based data, and I'm proud to say that it is
listed as a resource on the official W3C site.

Click here to go to download page
It is apparently a common mistake to assume that XML is a
forthcoming replacement for HTML, but this isn't in fact the case. XML was
intended from the outset to be a tag-based markup language that is
complementary to HTML. Whereas HTML describes the layout of a page,
XML is concerned with defining and describing data. This month I am discussing
what XML actually is and how to work with it from Visual Basic. XML is a large
subject, so there is no way that I can provide any kind of comprehensive
coverage; my intention here is to provide an overview of the subject in order
to acquaint you with it. It might be new right now but it is expected to become
widespread very quickly; one of the primary additions to the next version of
SQL Server is XML support, and it will also be catered for extensively in
Visual Studio 7.
Description of XML
As I said, HTML purely describes the layout of a page; it
isn't suitable for defining data. Once the web evolved to the point where
dynamic data started to become commonplace, the need to transport data together
with its definition became more pressing. XML, or Extensible Markup Language to
give it its full title, is based upon an earlier, more comprehensive markup
language called SGML (Standard Generalized Markup Language). The problem with
SGML, however, is that it is a rather complex and broad definition and frankly is
something of an overkill. The size of this definition leads in turn to a
greater overhead than is necessary. A body called the XML Working Group formed
with the express intention of producing a data defining language that would be
easy to utilise over the Web. This body defined a suitable subset of SGML that
carried less of an overhead but was powerful enough to describe data. The
upshot of this is that XML is a legal subset of SGML; any XML document is also
a true SGML document.
The XML 1.0 specification has been recommended by
the World Wide Web Consortium (W3C). This recommendation means that XML is
defined well enough for vendors to implement it, but technically speaking it
isn't yet a standard. As far as Microsoft is concerned, XML 1.0 support was
made available with Internet Explorer 4, but Internet Explorer 5 shipped
version 2.0 of Microsoft's XML parser, which added significant new
functionality, most notably something called the XML Document Object Model,
which I'll come back to a little later. Actually IE5 was re-engineered somewhat
to make the individual components more granular. This means that the XML
support engine, implemented in msxml.dll, doesn't necessarily need the rest of
IE5 to be of use. In fact, you can obtain the separate XML component from
http://msdn.microsoft.com/downloads/tools/xmlparser/xmlparser.asp
HTML is a markup language that is based around a standard set
of tags. For example,
<h1>My Header Text</h1>
displays the caption My Header Text. Because of
the <h1></h1> tag pair the browser will identify the text as being
one of the predefined header text types and will format it accordingly. Other
tags such as head, body, form, input, and so on are all predefined as part of
the working set for a particular version of HTML. When the W3C decide to add
additional tags into a definition a new, formal version number is specified,
such as 3.2 or 4. This approach is in contrast to XML, in which the author of
the file can declare and then define new tags as necessary. This is where the
extensible part comes in.
It is possible for an HTML stream to contain an embedded XML
section; these are known as XML Islands. While HTML and XML are two different
specifications, the W3C are also introducing XHTML which is a properly defined
integration of the two. Additionally, the MathML specification has also been
released by W3C as a specific application of XML that supports the use of
complex mathematical formulae.
ADO RecordSet Support
How you manipulate XML data depends upon what you want to do
with it. If you are working with an ADO 2.1 or above RecordSet object then you
are able to persist the contents of the RecordSet to disk, and then read it
back in again later. To illustrate this point we can open a RecordSet from the
Northwind database:
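Set adoCommand = New ADODB.Command
' Assumption: adoConn is an ADODB.Connection already opened on Northwind
Set adoCommand.ActiveConnection = adoConn
adoCommand.CommandText = "SELECT * FROM Shippers"
Set adoRs = adoCommand.Execute
' Persist the rows, together with their schema, as XML
adoRs.Save "C:\exexml\Shippers.xml", adPersistXML
adoRs.Close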
and it can then be read in again at a later time by calling
Set adoRs = New ADODB.Recordset
adoRs.Open "C:\exexml\Shippers.xml"
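As a rough sketch of what you might do once the file has been reopened, the routine below walks the RecordSet with MoveNext and prints each shipper to the Immediate window. The procedure name ListShippers is my own invention for illustration; the field names are those of the Shippers table and the path is the one used in the Save call. It assumes a project reference to the Microsoft ActiveX Data Objects 2.1 (or later) library.

```vb
' Hypothetical helper: reopen the persisted Recordset and list its rows.
' Assumes a reference to Microsoft ActiveX Data Objects 2.1 or later.
Public Sub ListShippers()
    Dim adoRs As ADODB.Recordset

    Set adoRs = New ADODB.Recordset
    ' No database connection is involved; the file itself is the data source
    adoRs.Open "C:\exexml\Shippers.xml"

    ' Step through the rows exactly as with a database-backed Recordset
    Do While Not adoRs.EOF
        Debug.Print adoRs.Fields("ShipperID").Value, _
                    adoRs.Fields("CompanyName").Value
        adoRs.MoveNext
    Loop

    adoRs.Close
    Set adoRs = Nothing
End Sub
```

The point of the sketch is that, once loaded, the RecordSet behaves no differently from one that came straight out of the database.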
The XML produced by the Save method is shown in full in
Listing 1, although I have slightly tidied up the presentation of it from the
version that ADO actually produced.
<xml xmlns:s='uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882'
xmlns:dt='uuid:C2F41010-65B3-11d1-A29F-00AA00C14882'
xmlns:rs='urn:schemas-microsoft-com:rowset'
xmlns:z='#RowsetSchema'>
<s:Schema id='RowsetSchema'>
<s:ElementType name='row' content='eltOnly'>
<s:attribute type='ShipperID'/>
<s:attribute type='CompanyName'/>
<s:attribute type='Phone'/>
<s:extends type='rs:rowbase'/>
</s:ElementType>
<s:AttributeType name='ShipperID' rs:number='1'>
<s:datatype dt:type='int' dt:maxLength='4' rs:precision='10'
rs:fixedlength='true' rs:maybenull='false'/>
</s:AttributeType>
<s:AttributeType name='CompanyName' rs:number='2'
s:writeunknown='true'>
<s:datatype dt:type='string' dt:maxLength='40'
rs:maybenull='false'/>
</s:AttributeType>
<s:AttributeType name='Phone' rs:number='3'
rs:nullable='true' rs:writeunknown='true'>
<s:datatype dt:type='string' dt:maxLength='24'/>
</s:AttributeType>
</s:Schema>
<rs:data>
<z:row ShipperID='1' CompanyName='Speedy Express'
Phone='(503) 555-9831'/>
<z:row ShipperID='2' CompanyName='United Package'
Phone='(503) 555-3199'/>
<z:row ShipperID='3' CompanyName='Federal Shipping'
Phone='(503) 555-9931'/>
</rs:data>
</xml>
Listing 1: Shippers table from Northwind database expressed as XML
XML Structure
An XML file is regarded in logical terms as a Document.
Encased within this definition are two sub-structures:
- Prolog Element: The existence of this element within an XML document is
  actually optional. Its purpose is to store management information, such as
  the version of XML that exists within the file. It can also contain a
  reference to an external Document Type Definition (DTD) if appropriate. A
  DTD file is merely a means of separating the definition of the data into a
  file that is separate from the actual data; if it isn't external then it is
  internal to the document.
- Document Element: This is really the core part of the file as it actually
  contains the data, as well as the data definition if it doesn't reside
  within an external DTD.
Looking at the XML code in Listing 1 allows us to get a handle
on the way that the document is laid out. As I said earlier, this is almost
exactly as ADO churned it out, so please don't be put off by the fact that many
of the node names are prefixed with s:, rs: or z:; these are namespace prefixes
that ADO declares on the root node, and they can be a little bit confusing for
a first-time reader.
At the start of the file the root node is identified by the
presence of the name xml. Everything else that appears within that
particular set of angle brackets is a namespace declaration for one of the
prefixes used in the rest of the file. At the end of
the document is the </xml> tag denoting the end of the root node. Beneath
the first set of angle brackets are defined two child nodes called s:Schema and
rs:data. The s:Schema node then contains the s:ElementType node which declares
the row names, and a series of s:AttributeType nodes which each contain
subnodes called s:datatype that declare the nature of each attribute. Finally,
down at the rs:data node, each item of data is presented as a z:row node.
The DTD provides the context for the actual data so it is
clearly important that the data actually follows these rules. It is the job of
the parser to analyse the file while opening it and check it for conformance.
This conformance check covers both compliance with the expected layout of an
XML document and conformance of the data to its own DTD. If a
document passes both of these checks then it is classed as being a valid
document. The downside to performing a validity check is of course that it can
take some time to parse a file, particularly if it is of some size. Therefore a
less rigid classification for a document can be given instead, in which case
the document is referred to as being well-formed. A well-formed document purely
complies with the formal rules for an XML document; the DTD is not considered
in this case.
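To make the distinction a little more concrete, here is a minimal sketch of how the choice might be made when loading a file through the Microsoft parser that I describe in the next section. The function name LoadsCleanly and its parameters are hypothetical; validateOnParse, async, Load and parseError are members of the DOMDocument object in msxml.dll. It assumes a project reference to the Microsoft XML library.

```vb
' Hypothetical helper: report whether a file loads without parser errors.
' Assumes a reference to the Microsoft XML library (msxml.dll).
Public Function LoadsCleanly(ByVal sPath As String, _
                             ByVal bValidate As Boolean) As Boolean
    Dim xmlDoc As MSXML.DOMDocument

    Set xmlDoc = New MSXML.DOMDocument
    xmlDoc.async = False                 ' wait for the load to finish
    xmlDoc.validateOnParse = bValidate   ' False = check well-formedness only

    If xmlDoc.Load(sPath) Then
        LoadsCleanly = True
    Else
        ' parseError describes the first problem that the parser met
        Debug.Print "Line " & xmlDoc.parseError.Line & ": " & _
                    xmlDoc.parseError.reason
    End If
End Function
```

Passing False for bValidate asks only whether the document is well-formed, which is the cheaper of the two checks.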
XML Document Object Model
While it is possible to open an XML document as a text file
and pick your way through it by keeping tabs on your indentation levels, it is
much easier to use an XML parser. Microsoft provide such a parser with Internet
Explorer in the form of msxml.dll which resides in the Windows System32 folder.
The parser exposes the XML Document Object Model (DOM for short) and presents
the data as a series of objects that you can manipulate programmatically. To
connect to this parser from Visual Basic you should select the Microsoft XML
entry within the References dialog (found under the Project menu).
When loading up a new referenced object for the first time I
always find it helpful to fire up the Object Browser in order to have a look at
what interfaces are exposed. The most important object within msxml is the
DOMDocument because it is the overall encapsulation of the XML file that you
are working with. Elements are broken down into nodes, and each node exposes a
childNodes collection to help you drill down.
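As an illustration of drilling down through childNodes, the sketch below prints the name of every node in a document, indented by depth. The routine names DumpNode and DumpShippers are my own for the purposes of the example, and the path is the file saved earlier; it assumes a project reference to the Microsoft XML library.

```vb
' Hypothetical helper: print each node name, indented by its depth.
' Assumes a reference to the Microsoft XML library (msxml.dll).
Public Sub DumpNode(ByVal xmlNode As MSXML.IXMLDOMNode, _
                    ByVal nDepth As Long)
    Dim xmlChild As MSXML.IXMLDOMNode

    Debug.Print Space$(nDepth * 2) & xmlNode.nodeName

    ' Recurse into each child; a leaf node has an empty collection
    For Each xmlChild In xmlNode.childNodes
        DumpNode xmlChild, nDepth + 1
    Next xmlChild
End Sub

' Hypothetical caller: load the file and start at the root node.
Public Sub DumpShippers()
    Dim xmlDoc As MSXML.DOMDocument

    Set xmlDoc = New MSXML.DOMDocument
    xmlDoc.async = False
    If xmlDoc.Load("C:\exexml\Shippers.xml") Then
        DumpNode xmlDoc.documentElement, 0
    End If
End Sub
```

Recursion is a natural fit here because every node, whatever its level, exposes the same childNodes collection.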
In order to demonstrate how to use the DOM I have written a
simple program that demonstrates the process whereby an ADO RecordSet saves
data in XML format. However, the program also provides two different methods of
reading the data back in, in this case into an MSFlexGrid component. The
interface isn't very exciting, but my intention is for you to look at the
underlying code behind the buttons. In one method the ADO RecordSet is
navigated through the MoveNext call, whereas in the other method the XML
structure itself is traversed by means of the DOM. Within this section I
also use two different methods for traversing the data, purely to illustrate
the different ways in which it can be done. When analysing the DTD to pick out
the field names I use a fairly hairy drill down through several collections at
once, just to show that you can:
Dim xmlFieldNodes As MSXML.IXMLDOMNodeList
' xml -> s:Schema -> s:ElementType -> its s:attribute children
Set xmlFieldNodes = _
    xmlDoc.childNodes.Item(0).childNodes.Item(0).childNodes.Item(0).childNodes
Whereas to obtain the actual data I adopt the more elegant
approach of stepping down one node at a time and calling each node by name
rather than ordinal position, such as
Dim xmlRootNode As MSXML.IXMLDOMNode
Dim xmlDataNode As MSXML.IXMLDOMNode
Set xmlRootNode = xmlDoc.selectSingleNode("xml")
Set xmlDataNode = xmlRootNode.selectSingleNode("rs:data")
There are further comments within the code itself to explain
what is going on. You can download it from the link at the top of this page.
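For readers who don't have the download to hand, the following is a minimal sketch of how the rows might then be read once the rs:data node has been located. It is not the code from my sample program; the routine name ListRows is hypothetical, and the attribute names are taken from Listing 1. It assumes a project reference to the Microsoft XML library.

```vb
' Hypothetical helper: list each z:row in a file saved by ADO.
' Assumes a reference to the Microsoft XML library (msxml.dll).
Public Sub ListRows(ByVal sPath As String)
    Dim xmlDoc As MSXML.DOMDocument
    Dim xmlRootNode As MSXML.IXMLDOMNode
    Dim xmlDataNode As MSXML.IXMLDOMNode
    Dim xmlRow As MSXML.IXMLDOMNode

    Set xmlDoc = New MSXML.DOMDocument
    xmlDoc.async = False
    If Not xmlDoc.Load(sPath) Then
        Debug.Print xmlDoc.parseError.reason
        Exit Sub
    End If

    ' Step down one node at a time, by name
    Set xmlRootNode = xmlDoc.selectSingleNode("xml")
    Set xmlDataNode = xmlRootNode.selectSingleNode("rs:data")

    For Each xmlRow In xmlDataNode.childNodes
        If xmlRow.nodeName = "z:row" Then
            ' Each field is stored as an attribute of the row node
            Debug.Print xmlRow.Attributes.getNamedItem("ShipperID").Text, _
                        xmlRow.Attributes.getNamedItem("CompanyName").Text
        End If
    Next xmlRow
End Sub
```

Note that getNamedItem returns Nothing for an attribute that is absent, so a field that allows nulls, such as Phone, would need a check before its Text is read.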
Resources
In terms of available resources, Microsoft Press has released a
book called XML in Action, written by William J Pardi (ISBN 0-7356-0562-9,
priced £25.99). To be honest it makes for fairly dry reading, but this is
very much a reflection of the subject matter rather than the writing style;
reading raw XML, which is necessary in order to understand the subject more
completely, isn't exactly riveting stuff. The book covers the theory of the
subject in a comprehensive manner, but is restricted to providing the sample
code in a generic way. The CD contains quite a rich set of resources that could
keep you browsing for hours. For greater coverage of using XML with Visual
Basic, a forthcoming book from Wrox Press might be worth a look when it becomes
available: Professional VB6 XML should be available by the time you read this,
ISBN 1-861003-32-3.
Finally, Microsoft has made available an XML Code Generator
that creates Visual Basic classes from XML schemas, downloadable from
http://msdn.microsoft.com/xml/articles/generat.asp. This tool provides a
means of working with XML data without having to traverse the DOM in order to
get at the data. This tool is primarily implemented as a DLL so you can add the
utility to your own applications should you wish. There is also an XML Resource
Kit CD available, details at the MSDN site at
http://msdn.microsoft.com/xml/default.asp.
XML is destined to be a very important technology in the
near future and I have tried to provide an introduction within just a couple of
pages. There is, however, a lot more to say on the subject so I will be
providing further coverage of it (update: at least I
intended to before the magazine folded!).