XML Parsing with SAX


XML has become a standard part of the IT landscape in a very short time. It provides a platform- and vendor-neutral approach to describing data, thus easing the burden of exchanging data between disparate systems. As a developer, I had to learn how to work with this new data format.

XML arrives as normal text. The burden is on the developer to properly interpret it. This interpretation involves the parsing of the XML. Currently, there are two standard parsing methods: DOM and SAX.

DOM (Document Object Model) treats an XML document as a whole. The XML is stored in memory. Once in memory, the data can be worked with as needed. One drawback to this approach, however, is the amount of memory consumed by the XML. Also, it is overkill if you are only concerned with one piece of data in the XML document. This is where SAX enters the picture.
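To make the DOM approach concrete, here is a minimal sketch. It assumes a current JDK, where the standard JAXP classes (javax.xml.parsers) are bundled, rather than the JDK 1.1.8 setup described later in this article. The DomSketch class name and the inline sample XML are hypothetical. Note that the whole document is loaded before a single value is read.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical illustration of the DOM approach; not part of the article's figures.
class DomSketch {

    // Assumed sample data, modeled on the book list used later in the article.
    static final String XML =
        "<booklist>"
        + "<book><title>Domino Development With Java</title><isbn>1930110049</isbn></book>"
        + "<book><title>Practical LotusScript</title><isbn>1884777767</isbn></book>"
        + "</booklist>";

    // Builds the full in-memory tree, then returns the text of the first title element.
    static String firstTitle(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList titles = doc.getElementsByTagName("title");
            return titles.getLength() > 0 ? titles.item(0).getTextContent() : null;
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstTitle(XML));
    }
}
```

Every book in the list is held in memory here even though only one title is wanted, which is the overhead the paragraph above describes.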

SAX (Simple API for XML) processes an XML document as it is read. Every element is treated as an event. The XML document is not stored in memory; it is read and discarded. It’s up to the developer to store any pertinent data. Thus, code is written to handle certain elements, data, or attributes of an XML document, and that code runs when the parser encounters the corresponding item.
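As a rough sketch of the event-driven idea, the handler below keeps nothing but a running count of book elements and lets everything else pass by. It assumes a current JDK and uses the bundled JAXP SAXParserFactory in place of Xerces so that it runs without extra libraries. The BookCounter name is hypothetical, and HandlerBase is deprecated in current JDKs, so expect a compiler warning.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;

// Hypothetical handler: stores only the one piece of data it cares about.
class BookCounter extends HandlerBase {
    private int books = 0;

    // Called once for every opening tag; all other events use the inherited no-op methods.
    @Override
    public void startElement(String name, AttributeList attrs) {
        if ("book".equals(name)) {
            books++;
        }
    }

    // Streams the XML through the handler and returns the number of book elements seen.
    static int countBooks(String xml) {
        BookCounter handler = new BookCounter();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return handler.books;
    }

    public static void main(String[] args) {
        String xml = "<booklist><book><title>A</title></book>"
            + "<book><title>B</title></book></booklist>";
        System.out.println(countBooks(xml)); // prints 2
    }
}
```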

We’ll utilize SAX 1.0 in this article. A newer version (2.0) has been released, but 1.0 is widely used and is supported in the newest version. We’ll use the freely available Xerces Java parser, which is available from the Apache Software Foundation (xml.apache.org).

SAX Classes and Methods

If you’ve worked with the DOM, you’ll be pleasantly surprised by how easy it is to use SAX. The SAX classes are contained in the org.xml.sax package. The base class in this package is called HandlerBase. The HandlerBase class contains all methods for working with XML data. Methods exist to deal with character data, elements, errors, and so forth.

In order to utilize SAX in your code, your XML parsing classes must use the HandlerBase class as the base or parent class. That is, you must extend it like so:

public class MyClassName extends HandlerBase

Once you have used HandlerBase as your base class, all of its methods for handling XML are available in your class.

Figure 1 shows a list of methods of the HandlerBase class; the list is not exhaustive, but it gives an idea of how SAX works. All or none of the methods can be utilized in your code. All of the methods return no value (void) and are public, and they throw SAX exceptions (SAXException class).
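Because every HandlerBase method has a default implementation, a subclass can override just the one it needs. The sketch below overrides only startElement and uses the AttributeList parameter to read attributes. The AttributeLister name and the format and lang attributes in the sample XML are hypothetical (the article's book list has no attributes). As before, it assumes a current JDK with the bundled JAXP parser standing in for Xerces.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;

// Hypothetical handler that overrides a single HandlerBase method.
class AttributeLister extends HandlerBase {
    final List<String> found = new ArrayList<>();

    // Records each attribute as element@name=value, in document order.
    @Override
    public void startElement(String name, AttributeList attrs) {
        for (int i = 0; i < attrs.getLength(); i++) {
            found.add(name + "@" + attrs.getName(i) + "=" + attrs.getValue(i));
        }
    }

    // Parses the XML and returns the attributes that were encountered.
    static List<String> list(String xml) {
        AttributeLister handler = new AttributeLister();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return handler.found;
    }

    public static void main(String[] args) {
        // Assumed sample input with made-up attributes.
        String xml = "<booklist><book format=\"paperback\">"
            + "<title lang=\"en\">Practical LotusScript</title></book></booklist>";
        for (String entry : list(xml)) {
            System.out.println(entry);
        }
    }
}
```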

SAX in Action

The easiest way to get an idea of how the class and its methods are used is in an example. The XML shown in Figure 2 will be used in our first example; it contains a list of books. Each book entry contains a title and ISBN. The XML parsing class shown in Figure 3 utilizes SAX and print statements to handle events (XML entities) with the following 15 steps:

1. The appropriate SAX classes are made available in the code.

2. The class is declared.

3. A String object is used to store text data.

4. A String object is used to store the current element name.

5. The startDocument method is fired at the beginning of an XML document.

6. The endDocument method is fired at the end of an XML document.

7. The startElement method is fired at the opening tag of an element. The element name is one of its parameters, and it is used to populate the String object.

8. The String object is set to the name of the current element.

9. The endElement method is fired at the ending tag of an element.

10. The String object is cleared at the end of the element.

11. The characters method is fired when text is encountered inside an element. A character array, its starting point in the array, and its length are passed as parameters in the method.

12. A String object is populated with the current text.

13. The text is displayed if, and only if, it is not blank. The trim and equalsIgnoreCase methods of the String class are utilized.

14. The error method is fired when parsing errors are encountered.

15. The fatalError method is fired when fatal errors are encountered.
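One caveat about step 11 is worth a sketch: a SAX parser is allowed to deliver the text of a single element in more than one call to the characters method, for example around an entity reference. A common way to cope is to accumulate the text and use it only when the element ends. The version below is a hedged illustration of that technique, not the article's Figure 3. The TitleCollector name is hypothetical, and it assumes a current JDK with the bundled JAXP parser instead of Xerces.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;

// Hypothetical handler that buffers character data until the element closes.
class TitleCollector extends HandlerBase {
    private final StringBuilder text = new StringBuilder();
    final List<String> titles = new ArrayList<>();

    // A new element starts: forget any text left over from before.
    @Override
    public void startElement(String name, AttributeList attrs) {
        text.setLength(0);
    }

    // May be called several times for one element, so append rather than assign.
    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // The element is complete: the buffer now holds its full text.
    @Override
    public void endElement(String name) {
        if ("title".equals(name)) {
            titles.add(text.toString().trim());
        }
        text.setLength(0);
    }

    // Parses the XML and returns the text of every title element.
    static List<String> collect(String xml) {
        TitleCollector handler = new TitleCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return handler.titles;
    }

    public static void main(String[] args) {
        System.out.println(collect(
            "<booklist><book><title>Java &amp; XML</title></book></booklist>"));
    }
}
```

Whether the parser splits the text or not, the title is assembled in one piece before it is stored.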

Now that we have our SAX class, we need to utilize it in an application that reads an XML document. Figure 4 shows the Java code for our saxDemo class. The following 10 steps explain the Java code; Figure 5 shows the resulting output:

1. The SAXException class is available in the code.

2. The SAXParser class from Xerces is available in the code.

3. The class is declared.

4. A local file is used as the XML source.

5. The main method is the entry point of the Java application.

6. An instance of the SAXParser class is created.

7. An instance of our SAX class (saxExample) is created.

8. The setDocumentHandler method of the SAXParser class is called. This assigns our saxExample instance as the document handler.

9. The parse method of the SAXParser class is used to parse our file.

10. Errors are handled.
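The error and fatalError methods receive a SAXParseException, which carries the position of the problem. The sketch below records the line number instead of printing a fixed message. It is an illustration under assumptions, not the article's code: the ReportingHandler name is hypothetical, and it again uses the JAXP parser bundled with a current JDK in place of Xerces.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.HandlerBase;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical handler that records where parsing problems occurred.
class ReportingHandler extends HandlerBase {
    final List<String> problems = new ArrayList<>();

    // Recoverable errors: record them and let parsing continue.
    @Override
    public void error(SAXParseException e) {
        problems.add("Error at line " + e.getLineNumber() + ": " + e.getMessage());
    }

    // Fatal errors (such as malformed XML): record, then rethrow to stop parsing,
    // which is what the default HandlerBase implementation also does.
    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        problems.add("Fatal error at line " + e.getLineNumber() + ": " + e.getMessage());
        throw e;
    }

    // Parses the XML and returns the list of recorded problems (empty if none).
    static List<String> check(String xml) {
        ReportingHandler handler = new ReportingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (SAXException e) {
            // Already recorded by fatalError above; nothing more to do in this sketch.
        } catch (Exception e) {
            handler.problems.add("I/O or configuration problem: " + e);
        }
        return handler.problems;
    }

    public static void main(String[] args) {
        // The book element is never closed, so the parser reports a fatal error.
        for (String problem : check("<booklist><book></booklist>")) {
            System.out.println(problem);
        }
    }
}
```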

Setting It All Up

I want to discuss setup before diving into the next example. The Java classes that comprise the Xerces XML parser (available for free download at xml.apache.org) must be available in your Java development environment. I am using the command-line Java Development Kit (JDK) from Sun Microsystems, so the system classpath variable is used to locate needed Java files. The classpath variable is used when locating class files referenced in import statements or the base Java class files. I am using version 1.1.8 of the Sun JDK and Xerces 1.01. Xerces is installed in the xerces directory on my D drive, and the JDK is installed in the jdk1.1.8 directory on my C drive. Therefore, my classpath for these examples is:

classpath = c:\jdk1.1.8\lib\classes.zip;d:\xerces\xerces.jar

The classes.zip file contains all base Java classes. The setup for your development environment may be different, so please consult your documentation.
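A quick way to confirm that the classpath is set up as intended is to ask the JVM to load one class from each library. The following is a small sketch; the ClasspathCheck name is hypothetical, while the two class names it probes are the ones imported in this article's figures.

```java
// Hypothetical helper for checking that required classes can be found on the classpath.
class ClasspathCheck {

    // Returns true if the named class can be loaded, false if it is not on the classpath.
    static boolean isAvailable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("SAX HandlerBase found: "
            + isAvailable("org.xml.sax.HandlerBase"));
        System.out.println("Xerces SAXParser found: "
            + isAvailable("org.apache.xerces.parsers.SAXParser"));
    }
}
```

If the second line reports false, the Xerces jar file is probably missing from the classpath.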

Taking It a Step Farther

Although the first example may seem simple, it introduces the major aspects of parsing XML via SAX. After all, the S in SAX stands for simple. The example shown in Figure 6 uses the same basic XML, but it is received via an HTTP request. The file is parsed and displayed as an HTML table. The saxExample class from Figure 3 is modified by adding statements to output HTML. The beginning and end of the XML document generate the beginning and ending HTML table tags. Each book occupies one row in the table, with individual values in separate cells.

Figure 7 shows how the code from Figure 4 can be changed slightly to work with a URL instead of a local file to produce the following output:


<table>
<tr>
<td>Domino Development With Java</td>
<td>1930110049</td>
</tr>
<tr>
<td>Practical LotusScript</td>
<td>1884777767</td>
</tr>
</table>

Notice the output lists only the HTML for a table. It does not format an entire HTML page. For this reason, this code could be used inside a JavaServer Page (JSP), a servlet, or any other Java application.
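To reuse the handler inside a servlet or JSP, one option is to have it write to a java.io.Writer supplied by the caller rather than to System.out. The sketch below shows that variation. It is an assumption-laden illustration rather than the article's Figure 6: the TableWriterHandler name is hypothetical, the bare table, tr, and td tags are assumed, the text is not HTML-escaped, and the bundled JAXP parser of a current JDK stands in for Xerces.

```java
import java.io.ByteArrayInputStream;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;

// Hypothetical handler that writes an HTML table to any Writer the caller provides.
class TableWriterHandler extends HandlerBase {
    private final PrintWriter out;

    TableWriterHandler(Writer target) {
        this.out = new PrintWriter(target);
    }

    @Override
    public void startDocument() {
        out.print("<table>\n");
    }

    @Override
    public void endDocument() {
        out.print("</table>\n");
        out.flush();
    }

    // Each book element becomes one table row.
    @Override
    public void startElement(String name, AttributeList attrs) {
        if (name.equalsIgnoreCase("book")) {
            out.print("<tr>\n");
        }
    }

    @Override
    public void endElement(String name) {
        if (name.equalsIgnoreCase("book")) {
            out.print("</tr>\n");
        }
    }

    // Each non-blank text value becomes one table cell (no HTML escaping in this sketch).
    @Override
    public void characters(char[] ch, int start, int length) {
        String value = new String(ch, start, length).trim();
        if (!value.isEmpty()) {
            out.print("<td>" + value + "</td>\n");
        }
    }

    // Convenience method: parses the XML and returns the generated HTML as a String.
    static String toHtml(String xml) {
        StringWriter buffer = new StringWriter();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new TableWriterHandler(buffer));
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.print(toHtml("<booklist><book><title>Practical LotusScript</title>"
            + "<isbn>1884777767</isbn></book></booklist>"));
    }
}
```

In a servlet, the Writer passed to the constructor could be the one returned by the response object, so the table lands directly in the page being generated.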

Only the Beginning

SAX is an excellent way to handle XML documents when only portions of the data are needed. It is an alternative to, not a replacement for, the DOM approach. There are instances in which the whole XML document is needed and SAX will not suffice. SAX is available in most Java and C parsers and in the most recent version of the Microsoft parser included in the Internet Explorer browser.

XML is the latest “hot” technology, but it is definitely here to stay. It provides a universal, platform-independent approach to describing data. It has been embraced by the development community and industry giants like IBM, Oracle, and Microsoft. In fact, Microsoft has made it an integral part of the next iteration of its Office suite.

REFERENCES AND RELATED MATERIALS

• Apache XML Project Web site: xml.apache.org

• IBM DeveloperWorks XML Zone: www.ibm.com/xml

• Java and XML. Brett McLaughlin and Mike Loukides. Cambridge, Massachusetts: O’Reilly and Associates, 2000

• The XML Companion. Neil Bradley. Addison-Wesley Publishing Co., 2000

• XML.com Web site: www.xml.com

• XML Developer Center: msdn.microsoft.com/xml

• XML in Action. William J. Pardi. Microsoft Press, 1999

• XML.org Web site: www.xml.org

• XML Pocket Reference. Robert Eckstein. Cambridge, Massachusetts: O’Reilly and Associates, 1999

• XML Programming with VB and ASP. Mark Wilson and Tracey Wilson. Greenwich, Connecticut: Manning Publications, 1999

Method                   Description

characters               The characters method is triggered when character data is encountered inside an element.
endDocument              The endDocument method is triggered at the end of an XML document.
endElement               The endElement method is triggered at the ending tag of an element in the XML document.
error                    The error method is fired when a parser error is raised.
processingInstruction    The processingInstruction method is triggered when a processing instruction is encountered.
startDocument            The startDocument method is triggered at the beginning of an XML document.
startElement             The startElement method is triggered at the beginning of an element in the XML document.

Figure 1: The HandlerBase class contains a number of methods that are invoked when the XML parser encounters various portions of a document.


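<booklist>
  <book>
    <title>Domino Development With Java</title>
    <isbn>1930110049</isbn>
  </book>
  <book>
    <title>Practical LotusScript</title>
    <isbn>1884777767</isbn>
  </book>
</booklist>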


Figure 2: An XML document is a set of nested tags that describe the data that it contains.

// 1. Make the SAX classes available.
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// 2. Declare the class.
public class saxExample extends HandlerBase {
    // 3. Stores text data.
    private String chValue;
    // 4. Stores the name of the current element.
    private String eName;

    // 5. Fired at the beginning of the XML document.
    public void startDocument() throws SAXException {
        System.out.println("Beginning of XML document.");
    }

    // 6. Fired at the end of the XML document.
    public void endDocument() throws SAXException {
        System.out.println("End of XML document.");
    }

    // 7. Fired at the opening tag of an element.
    public void startElement(String name, AttributeList amap)
            throws SAXException {
        // 8. Remember the name of the current element.
        eName = name;
        System.out.println("Beginning of element: " + name);
    }

    // 9. Fired at the ending tag of an element.
    public void endElement(String name) throws SAXException {
        // 10. Clear the element name.
        eName = "";
        System.out.println("End of element: " + name);
    }

    // 11. Fired when text is encountered inside an element.
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        // 12. Populate the String with the current text.
        chValue = new String(ch, start, length);
        // 13. Display the text only if it is not blank.
        if (!chValue.trim().equalsIgnoreCase("")) {
            System.out.println("Data in element (" + eName + "): " + chValue);
        }
    }

    // 14. Fired when parsing errors are encountered.
    public void error(SAXParseException e) {
        System.out.println("Error has occurred.");
    }

    // 15. Fired when fatal errors are encountered.
    public void fatalError(SAXParseException e) {
        System.out.println("Fatal error has occurred.");
    }
}

Figure 3: The saxExample class is a utility class that customizes the SAX handler methods of the HandlerBase parent class.

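// 1. Make the SAXException class available.
import org.xml.sax.SAXException;
// 2. Make the Xerces SAXParser class available.
import org.apache.xerces.parsers.SAXParser;

// 3. Declare the class.
public class saxDemo {
    // 4. A local file is the XML source (the backslash must be escaped in Java).
    private static String xmlSource = "c:\\books.xml";

    // 5. Entry point of the Java application.
    public static void main(String argv[]) throws SAXException {
        // 6. Create the parser.
        SAXParser parser = new SAXParser();
        try {
            // 7. Create our SAX handler class.
            saxExample saxTest = new saxExample();
            // 8. Assign it as the document handler.
            parser.setDocumentHandler(saxTest);
            // 9. Parse the file.
            parser.parse(xmlSource);
        // 10. Handle errors.
        } catch (SAXException e) {
            System.out.println("SAX Exception");
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}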

Figure 4: The saxDemo class uses the saxExample utility class to list the contents of the books.xml file.

Beginning of XML document.
Beginning of element: booklist
Beginning of element: book
Beginning of element: title
Data in element (title): Domino Development With Java
End of element: title
Beginning of element: isbn
Data in element (isbn): 1930110049
End of element: isbn
End of element: book
Beginning of element: book
Beginning of element: title
Data in element (title): Practical LotusScript
End of element: title
Beginning of element: isbn
Data in element (isbn): 1884777767
End of element: isbn
End of element: book
End of element: booklist
End of XML document.

Figure 5: The books.xml file is listed as standard output by the saxDemo class (Figure 4) as parsing events are handled by the saxExample class (Figure 3).

import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class saxExample2 extends HandlerBase {

    private String chValue;

    // The beginning of an HTML table is printed at the start of the XML document.
    public void startDocument() throws SAXException {
        System.out.println("<table>");
    }

    // The ending HTML table tag is printed at the end of the XML document.
    public void endDocument() throws SAXException {
        System.out.println("</table>");
    }

    // A new HTML table row is started for each book element.
    public void startElement(String name, AttributeList amap)
            throws SAXException {
        if (name.equalsIgnoreCase("book")) {
            System.out.println("<tr>");
        }
    }

    // The HTML table row is ended at the end of each book element.
    public void endElement(String name) throws SAXException {
        if (name.equalsIgnoreCase("book")) {
            System.out.println("</tr>");
        }
    }

    // Display each text value in a separate HTML table cell.
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        chValue = new String(ch, start, length);
        if (!chValue.trim().equalsIgnoreCase("")) {
            System.out.println("<td>" + chValue + "</td>");
        }
    }
}

Figure 6: The saxExample2 class outputs an HTML table.

import org.xml.sax.SAXException;
import org.apache.xerces.parsers.SAXParser;

public class saxDemo2 {

    // The XML source is now a URL instead of a local file.
    private static String xmlSource = "http://127.0.0.1/books.xml";

    public static void main(String argv[]) throws SAXException {
        SAXParser parser = new SAXParser();
        try {
            saxExample2 saxTest = new saxExample2();
            parser.setDocumentHandler(saxTest);
            parser.parse(xmlSource);
        } catch (SAXException e) {
            System.out.println("SAX Exception");
        } catch (Exception e) {
            // parse can also fail with an I/O error, as handled in Figure 4.
            e.printStackTrace();
        }
    }
}

Figure 7: The saxDemo2 class uses a URL instead of a file and the HTML output capabilities of saxExample2 (Figure 6) to Web-enable the book XML.
