In this part of our RPG XML series, you'll learn how to use RPG's XML-SAX op-code to process problematic XML documents and handle situations that XML-INTO cannot.
In the previous two articles in this series, "%Handling XML-INTO Problems" and "i5/OS Offers Native XML Support in V5R4", we focused on the capabilities of RPG's XML-INTO. As we saw, this op-code processes an entire document, either as a single piece or, when needed or desired, in "chunks" by using the %HANDLER BIF.
There are, however, situations in which this will not work for you, often because of limitations in RPG's data structure (DS) capabilities. As you know, a named DS is limited to a maximum size of 64K (at least until V6R1). What if even a single repeating element will not fit within that limit? That may sound unlikely, but it doesn't take a huge number of repeating text fields to exceed it.
Another example, and one that seems to occur quite often, arises when your XML document contains a structure that simply cannot be represented in an RPG DS. To illustrate this, take a look at the new version of our XML document, shown below:
(A) <Description type="short">Two slot chrome</Description>
(B) <Description type="long">This beautiful two slot chrome finished toaster is
a perfect complement to any modern kitchen ...</Description>
<Description type="short">Four slot matt black</Description>
<CatDescr type="short">Coffee Makers</CatDescr>
<Description>10 cup auto start</Description>
It is substantively the same as in our previous examples, but with one very significant exception: The <Description> element can now be repeated. If that were the only difference, then we could accommodate it by adding a DIM( ) keyword to the element's definition in the DS. But notice that not only does the element repeat, but there is also a new attribute, type, which is used to indicate the type of description (short or long) that is being defined. This presents us with a problem. Since an attribute is treated in the same way as a child element of the parent, the correct RPG definition for "type" would be this:
d description DS
d type 5a
But this leaves us with nowhere to put the content of the description since the content of a DS is the sum of its subfields and any data placed there would overwrite those subfields. In other words, in our situation, the description would overwrite the type field (or vice versa). Not a lot of help! In theory, a DS that looks like the one below should solve the problem:
d description DS Qualified Dim(2)
d description 1000a Varying
d type 5a
In this case, the <Description> would be stored in the field description.description and the "type" attribute would be stored in description.type. Makes sense, doesn't it? Maybe to you, but sadly, not to the compiler.
IBM is aware of this deficiency, and it is on their "to-do" list, but don't expect to see it in V6R1. And don't hold me to it working the way I have described it here; IBM may well have other ideas.
So if we cannot create a DS that matches the structure of the XML data, then we cannot use XML-INTO or at least cannot use it for the whole task. So what are our options?
There are effectively three options:
- The first is to take advantage of RPG's XML-SAX op-code. This can be used either by itself to process the entire document or as a follow-on to an XML-INTO parse to "fill in the gaps." We will be dealing with the usage of XML-SAX in the balance of this article.
- The second is to reformat the document by using an XSL transform so that it is in a format that can be expressed in RPG terms. This is the approach recommended in the IBM Redbook The Ins and Outs of XML and DB2 UDB for i5/OS. If you have the required XSL skills or are prepared to develop them, this is certainly a valid option and can also help to deal with other issues, such as empty elements. Since the Redbook provides a good working example, we won't duplicate that work here.
- The third is to process the document in two passes using XML-INTO, with a different target DS on each pass. You would also need to use the "AllowExtra" and "AllowMissing" processing options to persuade the parser to handle the document, since neither DS will exactly match it. This is not as effective as the XML-SAX option, so we will not be discussing it further.
The operation of XML-SAX is very different from that of XML-INTO. XML-INTO parses the data from many elements at a time and places the parsed content into the appropriate field in the target DS or array. XML-SAX on the other hand parses the document one event at a time. Examples of events include the beginning of an element (i.e., its starting tag), the value of an element, the end of an element (i.e., its ending tag), the name of an attribute, the value of the attribute, etc.
With XML-INTO, the use of a handler procedure is optional, but with XML-SAX, %HANDLER must always be specified. Your handler procedure will be called for every event that the parser encounters. It is up to your logic to decide whether to ignore the event or react to it in some way.
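As a rough illustration of that requirement, here is a minimal sketch that invokes XML-SAX with a handler that does nothing but count the events it is given. It is written in fully free-form RPG, which needs a later release than the one discussed in this article. The names countEvents and eventCount, and the IFS path, are assumptions made purely for illustration.

```rpgle
**free
// Minimal sketch: XML-SAX with its mandatory %HANDLER operand.
// countEvents, eventCount and the IFS path are hypothetical.
ctl-opt dftactgrp(*no);

dcl-s xmlFile    varchar(256) inz('/home/myuser/products.xml');
dcl-s eventCount int(10) inz(0);   // doubles as the communication area

xml-sax %handler(countEvents : eventCount) %xml(xmlFile : 'doc=file');

dsply ('Events signaled: ' + %char(eventCount));
*inlr = *on;
return;

// Called by the parser once for every event it encounters
dcl-proc countEvents;
  dcl-pi *n int(10);
    commArea    int(10);           // must match the type of eventCount
    event       int(10) value;
    pstring     pointer value;
    stringLen   int(20) value;
    exceptionId int(10) value;
  end-pi;

  commArea += 1;   // "react" to the event by counting it
  return 0;        // zero lets the parse continue
end-proc;
```

A handler that only wanted a few of the events would test the event parameter and fall through to the same return for everything else.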
Logic is needed in the handler to recognize and react to the beginning of each element and attribute and to store the values in the appropriate places. You will perhaps get a better idea of the kind of logic that might be required if you study the list below. It represents the sequence of events and the associated data (in parentheses) that would be passed to the handler when processing the section of the XML document that begins at (A) above and ends at (B).
• Start Element (Description)
• Attribute Name (type)
• Attribute Characters (short)
• End Attribute (type)
• Element Characters (Two slot chrome)
• End Element (Description)
Notice that when we receive the element and attribute data, we have no idea which element/attribute it belongs to. That is up to us to determine. In fact, this is not a difficult task as the data will always belong to the last element/attribute that began but has not yet ended. With so many events being signaled to your handler, you can no doubt see that writing the logic to completely process even a simple document with XML-SAX would be somewhat tedious, requiring a lot of rather repetitive code. Luckily, we rarely require all of the data in a document, and we also have the option to combine XML-SAX with XML-INTO to simplify our task.
So to handle the situation in our example, that is what we will do. We will use XML-INTO to capture the bulk of the data and then process again using XML-SAX to fill in the missing piece: the type codes associated with the descriptions.
Let's look at the code that achieves this (shown at the end of this article).
The first thing to notice is the change in the product DS (A). Notice that we have made the description field an array with two elements and also added the type field as a two-element array. Note that the name of the type field in the DS (descrType) does not match the name of the attribute (type) to ensure that XML-INTO will not try to populate it and to make that fact more obvious to those who come after us. In fact, there is no need to actually include the type in the DS at all, but it is convenient to keep all the data together.
The XML-INTO must have the "allowextra=yes" option specified (B) to accommodate the type attributes, which are present in the document but have no counterpart in the DS. Without this option, the parse would fail since the new version of the DS no longer corresponds to the XML document. Once XML-INTO has completed, we invoke XML-SAX (C) to reprocess the document.
There is no difference in the definition of %HANDLER, but there is a difference between the information passed to an XML-SAX handler and the information passed to the XML-INTO handler we saw in the last article. Take a look at the prototype at (D) and you will see what I mean. The only parameter that is common to the two versions is the first one, the Communication Area. The remaining parameters are as follows:
• event is a four-byte integer that identifies the type of event being processed. Don't worry about the fact that the event is identified by a number. As you will see later, RPG supplies a number of named constants that can be compared with the event value.
• pstring is a pointer to the beginning of the string containing the event data (e.g., the element/attribute names or data).
• stringLen is the length of the string "pointed to" by the previous parameter. This length must be used to determine if data is present as there are occasions when a valid pointer is passed even though there is no data. Only the number of characters indicated by this parameter should be processed.
• exceptionId is an error code identifying any error passed to the handler by the parser. We will not be discussing this in this article. Check the RPG manuals for more information.
Having seen the parameters passed to the handler, it is time to study the mechanics of the handler procedure MySAXHandler. The first step (E) is to check whether any data was received. If not, the handler simply returns control to the parser. If data is present, the procedure RmvWhiteSpace( ) is called to remove any unwanted characters and reduce them to a single space. We will look at what I mean by "unwanted" in a moment. Notice that %SUBST is used to pass only the valid portion of the data to the subprocedure. Remember, we were passed only a pointer and a length, and there is probably other data beyond the point indicated by the length parameter.
It is worth noting at this point that the field string, which is based on the pointer, can be very useful during debugging. If you display it, you will usually be able to see not only the data you are about to process, but also the next part of the XML document. In other words, you will know what to expect next and can perhaps set appropriate breakpoints. This is not guaranteed, as sometimes the pointer references a work area, but it is worth remembering.
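The data check and the use of %SUBST described above might look something like the following sketch. It is not the article's MySAXHandler: showData, eventData, msg, and the IFS path are hypothetical names, the code is fully free-form RPG (a later release than the one discussed here), and ccsid=job is specified on the assumption that the event data should arrive as job-CCSID characters.

```rpgle
**free
// Sketch only: skip events that carry no data, and use %SUBST to
// pick up just the valid bytes. All names here are hypothetical.
ctl-opt dftactgrp(*no);

dcl-s xmlFile   varchar(256) inz('/home/myuser/products.xml');
dcl-s dummyArea char(1);

xml-sax %handler(showData : dummyArea)
        %xml(xmlFile : 'doc=file ccsid=job');
*inlr = *on;
return;

dcl-proc showData;
  dcl-pi *n int(10);
    commArea    char(1);
    event       int(10) value;
    pstring     pointer value;
    stringLen   int(20) value;
    exceptionId int(10) value;
  end-pi;

  // Overlays whatever storage the parser's pointer addresses
  dcl-s string    char(65535) based(pstring);
  dcl-s eventData varchar(65535);
  dcl-s msg       char(52);

  // Nothing usable: hand control straight back to the parser
  if pstring = *null or stringLen <= 0 or stringLen > %size(string);
    return 0;
  endif;

  // Only the first stringLen bytes belong to this event
  eventData = %subst(string : 1 : stringLen);

  msg = %char(event) + ': ' + eventData;   // cut to 52 bytes for DSPLY
  dsply msg;
  return 0;
end-proc;
```

Displaying string itself in the debugger, rather than eventData, is what reveals the upcoming portion of the document mentioned above.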
What do we mean by "unwanted" and why do we need the RmvWhiteSpace routine? Because carriage returns, new lines, tabs, and excess spaces are often present in XML data (sometimes to make it look "pretty"), and we need to remove them from the data. We will not be studying the detail of this procedure, but you will find it included in the version of the program that is available for download. Hopefully, its operation is self-explanatory. (Many thanks to IBM Toronto's Barbara Morris for supplying this routine.)
At (F), the real work begins. A SELECT group is used to identify the type of event we are handling; this is where the named constants mentioned earlier come into play. For example, *XML_START_ELEMENT represents the event code that announces the arrival of a new element name. In the SELECT group at (G), we then identify the specific element that we are dealing with and process accordingly. All this logic is really doing is setting up the appropriate array indices for the Category, Product, and Description arrays. Since we know that the document we are processing is the same one that we just parsed with XML-INTO, we can afford to short-circuit the process, so no attempt is made to match the product codes with the descriptions.
If the event does not represent the beginning of an element, then we next test to see if it is an attribute name (H). If it is, we check to see if it is the type attribute, and if so, we turn on the waitingForType indicator. This indicator allows us to associate the attribute data when it arrives (I) as belonging to the type attribute. Remember, we said earlier that it is up to us to determine that. We then store the value for the type attribute in the appropriate descrType array element.
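To make the flow of that logic easier to picture, here is a hedged sketch of a handler built along the same lines. It is not the article's MySAXHandler. It assumes the elements are named Category, Product, and Description, and the state_t layout, the typeHandler name, the inDescription flag, and the IFS path are all invented for illustration. It is written in fully free-form RPG, which requires a later release than the one discussed here.

```rpgle
**free
// Hedged sketch, not the article's handler. Element names, state_t,
// typeHandler and the IFS path are assumptions for illustration.
ctl-opt dftactgrp(*no);

dcl-c LOWER 'abcdefghijklmnopqrstuvwxyz';
dcl-c UPPER 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
dcl-c MAX_CAT  20;
dcl-c MAX_PROD 50;
dcl-c MAX_DESC 2;

dcl-ds state_t qualified template;
  catIdx         int(5);
  prodIdx        int(5);
  descIdx        int(5);
  inDescription  ind;
  waitingForType ind;
end-ds;

dcl-ds product_t qualified template;
  descrType char(5) dim(MAX_DESC);
end-ds;

dcl-ds category qualified dim(MAX_CAT);
  product likeds(product_t) dim(MAX_PROD);
end-ds;

dcl-ds state   likeds(state_t) inz;   // INZ so the indices start at 0
dcl-s  xmlFile varchar(256) inz('/home/myuser/products.xml');

xml-sax %handler(typeHandler : state)
        %xml(xmlFile : 'doc=file ccsid=job');
dsply ('First type: ' + category(1).product(1).descrType(1));
*inlr = *on;
return;

dcl-proc typeHandler;
  dcl-pi *n int(10);
    st          likeds(state_t);
    event       int(10) value;
    pstring     pointer value;
    stringLen   int(20) value;
    exceptionId int(10) value;
  end-pi;

  dcl-s string char(65535) based(pstring);
  dcl-s data   varchar(65535);
  dcl-s name   varchar(65535);

  if pstring = *null or stringLen <= 0 or stringLen > %size(string);
    return 0;
  endif;
  data = %subst(string : 1 : stringLen);
  name = %xlate(LOWER : UPPER : data);   // case-insensitive name tests

  select;
    when event = *XML_START_ELEMENT;
      // Keep the array indices in step with the document
      st.inDescription = (name = 'DESCRIPTION');
      select;
        when name = 'CATEGORY';
          st.catIdx += 1;
          st.prodIdx = 0;
        when name = 'PRODUCT';
          st.prodIdx += 1;
          st.descIdx = 0;
        when name = 'DESCRIPTION';
          st.descIdx += 1;
      endsl;

    when event = *XML_ATTR_NAME;
      // Only a type attribute on a Description is of interest
      st.waitingForType = (st.inDescription and name = 'TYPE');

    when event = *XML_ATTR_CHARS;
      if st.waitingForType
         and st.catIdx  >= 1 and st.catIdx  <= MAX_CAT
         and st.prodIdx >= 1 and st.prodIdx <= MAX_PROD
         and st.descIdx >= 1 and st.descIdx <= MAX_DESC;
        category(st.catIdx).product(st.prodIdx).descrType(st.descIdx)
          = data;
      endif;

    when event = *XML_END_ATTR;
      st.waitingForType = *off;
  endsl;

  return 0;
end-proc;
```

The inDescription flag is an addition of this sketch: because the sample document also carries a type attribute on <CatDescr>, the sketch only arms waitingForType when the attribute belongs to a Description element. The range checks simply protect the arrays if the document holds more entries than the DS allows for.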
After processing the document, the XML-SAX parse completes and control returns to the program's main line at (J). At this point, the complete content of the XML document has been stored in our category DS, so our program can process or store that data as necessary. In this simple example, we will just display the data. The logic simply loops through all of the categories and products. As in our previous example, the category loop is controlled by the RPG-supplied xmlElements count in the Program Status Data Structure, which was populated by the XML-INTO operation, and the product loop completes when a blank product code is encountered. The format of our XML document is such that there must be a short description, so the first elements of the description and type arrays are displayed. At (K), the logic then tests to see if a second set is present and, if it is, displays the relevant data.
And that's really all there is to it. I won't describe it here, but I have included in the source code accompanying this article a utility program (XMLSAXLIST) that you might find useful when studying XML documents that you need to process. It uses XML-SAX to parse the document and produces a listing of all the events signaled and the length and content of the associated data. If you run the program, you will be able to see the effect of the RmvWhiteSpace procedure as the original length of the data item is included. If you have any questions about the operation of the program, please let me know.
H Option(*NoDebugIO : *SrcStmt )
// This count is populated by XML-INTO whenever the INTO
// variable is an array
D progStatus SDS
D xmlElements 20i 0 Overlay(progStatus: 372)
(D) D MySAXHandler Pr 10i 0
D commArea Like(dummyCommArea)
D event 10i 0 Value
D pstring * Value
D stringLen 20i 0 Value
D exceptionId 10i 0 Value
D RmvWhitespace pr 65535a Varying
D input 65535a Varying Const
D category DS Qualified Dim(20)
D code 2a
D catDescr 20a
D product LikeDS(product) Dim(50)
D product DS Qualified
D code 4a
(A) D descrType 5a Dim(2)
D description 600a Dim(2)
D mSRP 7p 2
D sellPrice 7p 2
D qtyOnHand 5i 0
D XML_Source S 256a Varying
// Short version of Description for display purposes
D dispDescription S 40a
D dummyCommArea S 1a
D i S 5i 0
D p S 5i 0
(B) XML-INTO category
%XML(XML_Source: 'case=any doc=file allowextra=yes +
// XML-INTO has filled the category array
// Next we use XML-SAX to fill in the missing type details
(C) XML-SAX %HANDLER(MySAXHandler: dummyCommArea)
Dsply ('xmlElements = ' + %char(xmlElements) );
// The XML parser's element count is used to control the loop
(J) For i = 1 to xmlElements;
Dsply ('Cat: ' + category(i).code + ' ' +
For p = 1 to %Elem(category.product);
If category(i).product(p).code = *Blanks;
Leave; // Exit once blank product code entry located
EndIf;
// Process the current product entry
dispDescription = category(i).product(p).description(1);
Dsply ('Product: ' + dispDescription);
Dsply ('Type: ' + category(i).product(p).descrType(1));
// If second description is present, display details
(K) If category(i).product(p).description(2) <> *Blanks;
dispDescription = category(i).product(p).description(2);
Dsply ('Product: ' + dispDescription);
Dsply ('Type: ' + category(i).product(p).descrType(2));
EndIf;
EndFor; // End of product loop
EndFor; // End of category loop