Experiencing XPath

Introduction

A couple of months ago, we announced that we would illustrate XML-related technologies using programming languages such as C#, Python and Java. With this article we begin the second part of our XML series, in which we discuss XML from a programmer’s point of view. It is important that you understand the basics before reading these articles. Most articles in the XML series are available online at www.developeriq.com

XML developers have to keep abreast of the technologies that have cropped up in the past two years. One such technology that has gained both the mindshare and the confidence of developers is XPath.

You have by now understood the nuances of XML basics such as DTDs, XML Schemas and XHTML. However, when it comes to processing and gathering information from specific parts of XML documents, you require more than what you have gathered up to this point.

In some cases you would like to process only certain parts of an XML document. Consider a large XML score sheet of statistics on cricket players. If you specifically want to work on the part where cricketers are identified as, say, “Indian batsmen”, what would you do? Or if you want to exclude some part of the document, such as the personal details of players? For tasks like these you need to look at technologies such as XPath.

XPath is a sub-language for finding information in an XML document. It uses path expressions to navigate through the elements and attributes of a document and to address specific parts of it. XPath is a major component of XSLT, which we covered almost six months back; it is the result of an effort to provide a common syntax and semantics for functionality shared between XSL Transformations [XSLT] and XPointer [XPointer].

Path Expressions
The XPath data model represents most parts of a serialized document as a tree of nodes. For example, a root node represents the XML document itself. There are element nodes for representing elements and attribute nodes for representing attributes. XPath uses path expressions to select nodes or node-sets in an XML document. Most parts of an XML document can be addressed through XPath expressions. However, the document type declaration (DOCTYPE) and the XML declaration are not represented in the data model.

Nodes
A node is the representation, in the XPath data model, of a logical part of an XML document. In XPath, there are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment and document (root) nodes. XML documents are treated as trees of nodes. The root of the tree is called the document node (or root node).

Look at the following XML document:

<?xml version="1.0" encoding="ISO-8859-1"?>
<addressbook>
<address>
  <name>Anand</name>
  <House_no>44</House_no>
  <street>2nd Street</street>
  <city>Mumbai</city>
  <zip>400003</zip>
</address>
</addressbook>

Example of nodes in the XML document above:

<addressbook>  (root element node; its parent is the document node)
<name>Anand</name>  (element node)

XPath also has the concept of atomic values: plain values that have no children or parent, such as the string Anand or the number 44 in the document above.

Parents, children, ancestors and descendants

Each element and attribute has one parent. The address element is the parent of name, House_no, street, city and zip. Similarly, these elements are the children of address. Siblings are nodes that have the same parent, so street and name are siblings. An ancestor is a node's parent, the parent's parent, and so on; every element inside address therefore has both address and addressbook as ancestors. Likewise, descendants are a node's children, their children, and so on: name, street and the rest are descendants of both address and addressbook.
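These relationships can be explored from Python. The following is a minimal sketch using the standard library's xml.etree.ElementTree module, which is our choice for illustration and not the only option. The XML declaration is omitted and the variable names (parent_of, ancestors) are our own. ElementTree elements do not keep a pointer to their parent, so the sketch builds a child-to-parent map first.

```python
import xml.etree.ElementTree as ET

XML = """<addressbook>
<address>
  <name>Anand</name>
  <House_no>44</House_no>
  <street>2nd Street</street>
  <city>Mumbai</city>
  <zip>400003</zip>
</address>
</addressbook>"""

root = ET.fromstring(XML)        # the addressbook element
address = root.find("address")   # its first address child
street = address.find("street")

# Children of address, in document order.
children = [child.tag for child in address]

# Build a child -> parent lookup by visiting every element once.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Parent and siblings of street.
parent_tag = parent_of[street].tag
siblings = [el.tag for el in parent_of[street] if el is not street]

def ancestors(element):
    """Return the tags of all ancestors, nearest first."""
    found = []
    while element in parent_of:
        element = parent_of[element]
        found.append(element.tag)
    return found

print(children)           # ['name', 'House_no', 'street', 'city', 'zip']
print(parent_tag)         # address
print(siblings)           # ['name', 'House_no', 'city', 'zip']
print(ancestors(street))  # ['address', 'addressbook']
```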

A location path can be absolute or relative. An absolute location path starts with a slash (/) while a relative location path does not. In both cases, the location path consists of one or more steps, each separated by a slash.

XPath uses path expressions to select nodes in an XML document. A node is selected by following a path made up of steps. The most useful path expressions are listed in Table 1:

Expression    Description
nodename      Selects all child elements of the current node that are named nodename.
/             Selects from the root node.
//            Selects nodes in the document from the current node that match the selection, no matter where they are.
.             Selects the current node.
..            Selects the parent of the current node.
@             Selects attributes.

Table 1: Understanding XPath expressions

If you are processing the XML document above (saved as address.xml), the path expression addressbook, evaluated from the document node, selects the addressbook element. Similarly, addressbook/address selects all address elements that are children of addressbook.
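A few of the expressions from Table 1 can be tried with Python's standard xml.etree.ElementTree module. This is only a sketch: ElementTree implements a limited subset of XPath and evaluates paths relative to an element, so the absolute path /addressbook/address is written here as address relative to the root element. The XML declaration is omitted for brevity.

```python
import xml.etree.ElementTree as ET

XML = """<addressbook>
<address>
  <name>Anand</name>
  <House_no>44</House_no>
  <street>2nd Street</street>
  <city>Mumbai</city>
  <zip>400003</zip>
</address>
</addressbook>"""

root = ET.fromstring(XML)  # root is the addressbook element

# nodename: address elements that are children of the current node.
addresses = root.findall("address")

# . and //: start at the current node and search all of its descendants.
cities = [el.text for el in root.findall(".//city")]

# ..: step from each city element up to its parent.
city_parents = [el.tag for el in root.findall(".//city/..")]

print(len(addresses))  # 1
print(cities)          # ['Mumbai']
print(city_parents)    # ['address']
```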

Predicates are used to find a specific node or a node that contains a specific value. Predicates are always embedded in square brackets and can be compared to the way you index lists/arrays in Python/Perl. There are two differences: XPath positions start at 1 rather than 0, and a predicate may hold a function call or a comparison instead of a plain index. Table 2 lists some path expressions with predicates and the result of each expression.

Path Expression                       Result
/addressbook/address[1]               Selects the first address element that is a child of the addressbook element.
/addressbook/address[last()]          Selects the last address element that is a child of the addressbook element.
/addressbook/address[position()<4]    Selects the first three address elements that are children of the addressbook element.
/addressbook/address[name='Dev']      Selects all address elements of the addressbook element that have a name element with the value 'Dev'.

Table 2: Path Expression examples
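The predicates in Table 2 can be sketched in Python with xml.etree.ElementTree. The four-entry address book below is hypothetical sample data made up for this illustration. ElementTree supports [1], [last()] and [name='Dev'], but not position()<4, so the sketch uses a list slice for that case; a full XPath engine would accept the expression directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: four entries so that position matters.
XML = """<addressbook>
<address><name>Anand</name><city>Mumbai</city></address>
<address><name>Dev</name><city>Chennai</city></address>
<address><name>Ravi</name><city>Delhi</city></address>
<address><name>Meera</name><city>Pune</city></address>
</addressbook>"""

root = ET.fromstring(XML)

def names(elements):
    """Return the text of the name child of each address element."""
    return [el.find("name").text for el in elements]

first = root.findall("address[1]")            # first address child
last = root.findall("address[last()]")        # last address child
dev = root.findall("address[name='Dev']")     # address whose name is Dev
first_three = root.findall("address")[:3]     # stands in for position()<4

print(names(first))        # ['Anand']
print(names(last))         # ['Meera']
print(names(dev))          # ['Dev']
print(names(first_three))  # ['Anand', 'Dev', 'Ravi']
```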

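XPath wildcards can be used to select unknown XML elements; they are listed in Table 3. By using the | operator in an XPath expression you can also select several paths at once. For example, //name | //city selects all name elements and all city elements.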

Wildcard    Description
*           Matches any element node.
@*          Matches any attribute node.
node()      Matches any node of any kind.

Table 3: XPath wildcards
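As a rough illustration in Python, the * wildcard works directly in xml.etree.ElementTree. The module does not support @*, node() or the | operator, so the sketch below approximates them: the attrib dictionary stands in for @*, and two result lists are concatenated in place of a union. Unlike a real XPath union, the concatenation neither removes duplicates nor restores document order. The type attribute is a hypothetical addition to the sample document.

```python
import xml.etree.ElementTree as ET

# The type attribute is hypothetical; it is not in the article's document.
XML = """<addressbook>
<address type="home">
  <name>Anand</name>
  <city>Mumbai</city>
</address>
</addressbook>"""

root = ET.fromstring(XML)
address = root.find("address")

# address/*: every child element of address, whatever its name.
any_child = [el.tag for el in root.findall("address/*")]

# Stand-in for address/@*: all attributes of the address element.
any_attribute = dict(address.attrib)

# Stand-in for //name | //city: run both searches and join the results.
name_or_city = [el.text for el in root.findall(".//name") + root.findall(".//city")]

print(any_child)      # ['name', 'city']
print(any_attribute)  # {'type': 'home'}
print(name_or_city)   # ['Anand', 'Mumbai']
```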

Axes

Imagine your XML document is a map. If you need to find some specific data or analyze specific parts of the document, you need directions. On a map you have four fundamental directions; in XPath, the directions are called axes, and with them you can pinpoint data in the document. An axis defines a node-set relative to the current node. Unlike the four directions (North, South, East and West) on a map, there are 13 axes in XPath!

Axis name             Result
ancestor              Selects all ancestors (parent, grandparent, etc.) of the current node.
ancestor-or-self      Selects all ancestors (parent, grandparent, etc.) of the current node and the current node itself.
attribute             Selects all attributes of the current node.
child                 Selects all children of the current node.
descendant            Selects all descendants (children, grandchildren, etc.) of the current node.
descendant-or-self    Selects all descendants (children, grandchildren, etc.) of the current node and the current node itself.
following             Selects everything in the document after the closing tag of the current node.
following-sibling     Selects all siblings after the current node.
namespace             Selects all namespace nodes of the current node.
parent                Selects the parent of the current node.
preceding             Selects everything in the document that is before the start tag of the current node.
preceding-sibling     Selects all siblings before the current node.
self                  Selects the current node.

Table 4: The 13 axes and what they do
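Python's standard xml.etree.ElementTree module does not implement the axis syntax, but a few axes from Table 4 can be imitated by hand to see which nodes they cover. The sketch below does this for the street element of the address book document; it is an approximation for illustration, and the variable names are our own.

```python
import xml.etree.ElementTree as ET

XML = """<addressbook>
<address>
  <name>Anand</name>
  <House_no>44</House_no>
  <street>2nd Street</street>
  <city>Mumbai</city>
  <zip>400003</zip>
</address>
</addressbook>"""

root = ET.fromstring(XML)
address = root.find("address")
street = address.find("street")

# All children of street's parent, in document order.
same_parent = list(address)
position = same_parent.index(street)

# preceding-sibling and following-sibling, seen from street.
preceding_sibling = [el.tag for el in same_parent[:position]]
following_sibling = [el.tag for el in same_parent[position + 1:]]

# descendant-or-self and descendant, seen from address.
descendant_or_self = [el.tag for el in address.iter()]
descendant = descendant_or_self[1:]  # drop address itself

print(preceding_sibling)   # ['name', 'House_no']
print(following_sibling)   # ['city', 'zip']
print(descendant_or_self)  # ['address', 'name', 'House_no', 'street', 'city', 'zip']
```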

Each step is evaluated against the nodes in the current node-set.
A step consists of:

· An axis (defines the tree-relationship between the selected nodes and current node);
· A node-test (identifies a node within an axis); and
· Zero or more predicates (to further refine the selected node-set).

The syntax for a location step is:

axisname::nodetest[predicate] 
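To make the three parts of a step concrete, here is a small Python sketch that splits a single step into its axis, node test and predicates with a regular expression. The function name split_step is our own, and this is a toy for illustration only: it does not handle nested brackets or abbreviations such as @ and .., which a real XPath parser must. When the axis is left out, XPath assumes the child axis, and the sketch does the same.

```python
import re

STEP = re.compile(
    r"""
    ^(?:(?P<axis>[a-z-]+)::)?          # optional axis name followed by ::
    (?P<test>[^\[\]]+)                 # node test, e.g. address, * or node()
    (?P<predicates>(?:\[[^\]]*\])*)$   # zero or more [predicate] blocks
    """,
    re.VERBOSE,
)

def split_step(step):
    """Split one location step into (axis, node test, list of predicates)."""
    match = STEP.match(step)
    if match is None:
        raise ValueError("not a simple location step: " + step)
    axis = match.group("axis") or "child"  # child is the default axis
    predicates = re.findall(r"\[([^\]]*)\]", match.group("predicates"))
    return axis, match.group("test"), predicates

print(split_step("child::address[1]"))    # ('child', 'address', ['1'])
print(split_step("address[name='Dev']"))  # ('child', 'address', ["name='Dev'"])
print(split_step("descendant::city"))     # ('descendant', 'city', [])
```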

This concludes our initial experience with XPath. To continue our XML series, however, we need to understand the XML Document Object Model.

Microsoft Vista powered by XML

In his keynote address at the Professional Developers Conference, Bill Gates said that Microsoft Windows Vista and the new SQL Server have “XML built into the core.” Microsoft’s new range of products will also have XML as a core component. The new Office 12 is one such product, as the new Word format will essentially be an XML format.

LINQ project

One of the main issues currently facing developers industry-wide is the difficulty of creating data-rich applications, a difficulty that arises from the tremendous differences between query languages used to access data and programming languages commonly used to write applications. Developers writing applications that access data from relational (SQL) or hierarchical (XML) data sources must be adept at traversing very different language syntaxes to get the job done.

To reduce complexity for developers and help boost their productivity, Microsoft today announced a solution for the .NET Framework called the Language Integrated Query (LINQ) Project, a set of language extensions to the C# and Visual Basic programming languages that extends the Microsoft .NET Framework by providing integrated querying for objects, databases and XML data. Using LINQ, developers will be able to write queries natively in C# or Visual Basic without having to use other languages, such as Structured Query Language (SQL) or XQuery, a query language for accessing XML data. The announcement was made at the Microsoft Professional Developers Conference 2005, where Microsoft is making available a Tech Preview containing pre-release versions of various components of the LINQ Project.

More goodies

Like Google Desktop 2, Sidebar includes separate elements that are automatically updated with information, such as news feeds, weather or digital photos. Users can select from a gallery of the mini-applications, which Microsoft calls Gadgets.

Gates said Gadgets represent a good opportunity for developers. Yahoo offers a similar array of tiny apps called Widgets, acquired with its July purchase of Konfabulator, as well as tools for developers to create them.

Canon will provide a new color management system for Vista that will provide better color fidelity and screen-to-print matching for digital photos and graphics.

The Vista CTP distributed during the second week of September includes the WinFX programming model, made up of the Windows Presentation Foundation (formerly Avalon) and Windows Communication Foundation (formerly Indigo), as well as the .NET Framework. Together these will improve the presentation and interactivity of both HTML connections and code run on the client, allowing the browser to move beyond loading Web pages to providing desktop access to applications.

Microsoft is in the early stages of developing an AJAX (Asynchronous JavaScript and XML) authoring tool known as Atlas. AJAX is a development strategy that allows a Web page to pull information from a server without having to reload the page.

Atlas will be a Web client framework for building AJAX applications that will be integrated with Visual Studio 2005 and ASP.NET 2.0, to make it easier to develop AJAX apps.

Source: Microsoft, PR Newswire

Language: XML
Platform: Windows




Added on July 28, 2007