Find out the rest of the story about PRGDLTRCDS and Watch considerations.
This is the sixth in a series of articles on detecting that certain messages have been sent on your system and then making processing decisions based on those messages. The underlying technology, known as Watch support, became available with V5R4.
The first article, "One Approach to System Automation," introduced the Start Watch (STRWCH) command and provided the source for a user exit program to run when the local system time changed due to Daylight Saving Time transitions. The second article, "Handling System Changes Automatically," discussed the internals of that user exit program. The third article, "Re-enable Disabled User Profiles," provided further examples of the automation capabilities available with watches. The fourth article, "Selectively Using RGZPFM on Files," introduced a watch exit program that ran when files exceeded their specified DLTPCT value. The fifth article was not directly related to watches. Rather, it discussed the Purge Deleted Records program PRGDLTRCDS, which works with the watch exit program of article four. If you have not previously read these articles, you should do so before reading the current article. In this article, we will conclude our discussion of the PRGDLTRCDS program. For space reasons, the source for program PRGDLTRCDS will not be repeated here, so you may want to refer to the fifth article of this series.
When we left off in the fifth article, "Reorganizing Files Based on Percentage of Deleted Records," the PRGDLTRCDS program had just completed all of the validation associated with preparing to reorganize the file member identified by the current RGZPFMLST1 record. The program PRGDLTRCDS is now ready to start the reorganization.
At this point, PRGDLTRCDS does an exclusive allocate (ALCOBJ) of the member to be reorganized. The program performs this allocate explicitly, rather than allowing the subsequent RGZPFM command to do it implicitly, for a few reasons.
One reason has to do with ease of use in error reporting. The RGZPFM command may fail for a variety of reasons, only one of which is not being able to allocate the file member. By explicitly allocating the file member, we do not have to "See the previously listed messages," which is the Recovery text for the most frequently returned escape message of the RGZPFM command (CPF2981 - Member &3 file &1 in &2 not reorganized). Though not shown in the sample program, PRGDLTRCDS could, upon receiving an error on the ALCOBJ command, use an API such as List Object Locks (QWCLOBJL) to determine what jobs are currently locking the member and return that information as additional messages in the job log. The QWCLOBJL API provides member-level locking information. As shown, PRGDLTRCDS simply displays the library, file, and member name when the ALCOBJ request fails. This dsply is done because the error messages returned by the ALCOBJ command do not always provide full object/member identification information.
The QWCLOBJL API could also be used in conjunction with an error returned by the RGZPFM command. You would, however, first have to programmatically read through the "previously listed messages" and determine whether the problem is indeed related to the allocation of the file member. One diagnostic message that you would be looking for is CPF3202 - File &1 in library &2 in use. Just as the sample programs use the dsply operation rather than message-handler APIs to provide error-related information, I am using the ALCOBJ command to avoid introducing the message-handler APIs needed to receive messages from the job log/program message queue. But as mentioned in the first article of this series, I see the need to discuss the message-handler APIs in the future. Those articles would most likely include packaging the message-handler API functions as exported procedures of a *SRVPGM so that we can use them without adding much code to our sample programs (which is what I'm trying to avoid in the current series).
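To make the "read through the previously listed messages" step more concrete, here is a minimal sketch in Python rather than RPG. It assumes the job log has already been retrieved into a simple list; the JobLogMessage type, the "DIAG"/"ESCAPE" labels, and the failed_on_allocation function are hypothetical names used only for illustration, not part of any IBM API. The only message IDs taken from the article are CPF2981 and CPF3202.

```python
from typing import NamedTuple


class JobLogMessage(NamedTuple):
    """Hypothetical, simplified view of one job log entry."""
    msg_id: str
    msg_type: str  # "DIAG" or "ESCAPE" in this sketch
    text: str


# Diagnostics treated as "member could not be allocated".
# CPF3202 is the one named in the article; others could be added.
ALLOCATION_DIAGNOSTICS = {"CPF3202"}


def failed_on_allocation(messages, escape_id="CPF2981"):
    """Return True if the diagnostics immediately preceding the most
    recent escape_id include an allocation-related message."""
    # Locate the most recent escape message of interest.
    for i in range(len(messages) - 1, -1, -1):
        if messages[i].msg_id == escape_id and messages[i].msg_type == "ESCAPE":
            break
    else:
        return False  # the escape message was never sent

    # Walk backward over the diagnostics that led up to it.
    j = i - 1
    while j >= 0 and messages[j].msg_type == "DIAG":
        if messages[j].msg_id in ALLOCATION_DIAGNOSTICS:
            return True
        j -= 1
    return False
```

Only when a check like this returns true would it make sense to go on and list the jobs holding locks on the member.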
A second, and more important, reason for explicitly allocating the member to be reorganized is to make the RGZPFM and the updating of the REORG field to 'N' as atomic an operation as possible. If PRGDLTRCDS simply ran the RGZPFM command and then updated the RGZPFMLST record to reflect that reorganization is no longer needed (that is, setting REORG to 'N'), a watch update to REORG could be lost. The lost update could occur as follows:
Step 1: The RGZPFMLST1 file contains a record for member X with REORG = 'Y' and DLTPCTSTS = 'H' (indicating that a member should be reorganized based on a past (Historical) CPF4653 independent of the current DLTPCT status).
Step 2: In job Z, PRGDLTRCDS performs a RGZPFM on member X.
Step 3: In job Y, program A runs after the RGZPFM of step 2 has completed (that is, member X has been de-allocated) and before PRGDLTRCDS regains control in step 4 to update the RGZPFMLST record. Program A deletes sufficient records to exceed member X's DLTPCT and then closes member X. Message CPF4653 is sent. The WCHCPF4653 exit program is run and effectively sets REORG to 'Y' (though, as coded, WCHCPF4653 will simply release the RGZPFMLST record as REORG is already set to 'Y' due to PRGDLTRCDS not yet having run the update logic of step 4).
Step 4: In job Z, PRGDLTRCDS resumes running, reads the control record for member X using RGZPFMLST, updates the control record with REORG = 'N', and moves on to the next control record in RGZPFMLST1.
Step 5: Program B runs (in any job and at any point in the future), adding sufficient records to member X such that the DLTPCT is no longer exceeded. The only requirement in our scenario is that Program B be the first program to process member X since step 4 completed.
We now have a situation where member X should be reorganized due to the past CPF4653 message having been sent, but there is no indication in the RGZPFMLST control file to actually perform the RGZPFM. The probability of this scenario is admittedly small, but it is not zero and should be addressed. I will also point out that it is these "little" timing windows that can cause much grief when debugging an application problem. In this case, WCHCPF4653 and PRGDLTRCDS will appear to be working perfectly for months. But every once in a while, you find that a RGZPFMLST control record has REORG set incorrectly. Anytime an application depends on two or more resources being in sync (in our case, the actual member data and the REORG flag of the control record), you need to think about the scenario of some other program running in between what are sequential operations in your program.
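The five steps can be replayed as a small, deterministic model. The sketch below is Python, not the RPG of the sample programs, and everything in it is an assumption made for illustration: the record counts, the DLTPCT value of 20, and the dictionary layout standing in for the member and its control record. It models only the order of events, with no locking at all.

```python
def deleted_pct(member):
    """Percentage of the member's records that are deleted."""
    total = member["active"] + member["deleted"]
    return 100.0 * member["deleted"] / total if total else 0.0


def close_member(member, control, sent):
    """Model of a close: CPF4653 is sent if DLTPCT is exceeded, and the
    watch exit program then sets REORG to 'Y' in the control record."""
    if deleted_pct(member) > member["dltpct"]:
        sent.append("CPF4653")
        control["REORG"] = "Y"


def run_scenario():
    # Step 1: control record says member X needs reorganizing.
    member = {"active": 100, "deleted": 40, "dltpct": 20}
    control = {"REORG": "Y", "DLTPCTSTS": "H"}
    sent = []  # CPF4653 messages sent after the reorganize

    # Step 2: job Z reorganizes member X (deleted records removed).
    member["deleted"] = 0

    # Step 3: job Y, program A deletes 30 records and closes member X.
    member["active"] -= 30
    member["deleted"] += 30
    close_member(member, control, sent)  # 30% > 20%: CPF4653 is sent

    # Step 4: job Z resumes and marks the member as reorganized.
    control["REORG"] = "N"

    # Step 5: program B adds 200 records and closes member X.
    member["active"] += 200
    close_member(member, control, sent)  # 10% <= 20%: no message

    return member, control, sent
```

In this model, run_scenario() ends with 30 deleted records still in the member and one CPF4653 sent since the reorganize, yet the control record shows REORG = 'N'. That mismatch is the lost update.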
Explicitly locking the member before PRGDLTRCDS runs RGZPFM ensures that no other job on the system can use that member (and therefore cannot close it, which is what we really want to avoid) until PRGDLTRCDS explicitly de-allocates the member. As PRGDLTRCDS does not de-allocate the member until after updating the control record to reflect REORG = 'N', there is no risk of job Y "sneaking in."
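The effect of holding the lock across both the reorganize and the control-record update can be modeled in the same spirit. This is a Python sketch under stated assumptions, not the actual PRGDLTRCDS logic: a threading.Lock stands in for the exclusive member lock obtained with ALCOBJ, a non-blocking acquire stands in for job Y failing to get the member, and all names and record counts are hypothetical.

```python
import threading


def close_member(state):
    """Model of a close: if DLTPCT is exceeded, the watch exit program
    sets REORG to 'Y' in the control record."""
    m = state["member"]
    total = m["active"] + m["deleted"]
    if total and 100.0 * m["deleted"] / total > m["dltpct"]:
        state["control"]["REORG"] = "Y"


def program_a(state):
    """Job Y: delete 30 records and close the member.
    Returns False if the member is currently unavailable."""
    if not state["lock"].acquire(blocking=False):
        return False
    try:
        state["member"]["active"] -= 30
        state["member"]["deleted"] += 30
        close_member(state)
        return True
    finally:
        state["lock"].release()


def prgdltrcds(state, between_steps):
    """Job Z: allocate, reorganize, update the control record, and only
    then de-allocate. between_steps simulates another job trying to run
    in the window between the reorganize and the control-record update."""
    state["lock"].acquire()                 # stands in for ALCOBJ
    try:
        state["member"]["deleted"] = 0      # stands in for RGZPFM
        got_in = between_steps(state)       # job Y tries to sneak in
        state["control"]["REORG"] = "N"
    finally:
        state["lock"].release()             # stands in for DLCOBJ
    return got_in


def demo():
    state = {
        "lock": threading.Lock(),
        "member": {"active": 100, "deleted": 40, "dltpct": 20},
        "control": {"REORG": "Y"},
    }
    got_in = prgdltrcds(state, program_a)
    after_reorg = state["control"]["REORG"]
    ran_later = program_a(state)            # job Y retries afterward
    return got_in, after_reorg, ran_later, state["control"]["REORG"]
```

Here job Y is turned away while the lock is held, so its deletes and close can only happen after REORG has been set to 'N'. The watch then sets REORG back to 'Y', and nothing is lost.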
Another approach to avoid this type of exposure, and which quite a few users use as it is easy to implement, is to simply prevent user jobs from running when system maintenance programs such as PRGDLTRCDS run. This approach certainly stops program A from running in the previous scenario.
This approach, however, also tends to shut down user applications for a longer fixed period of time, such as 11:00 p.m. to 5:00 a.m., than is absolutely necessary. The outage window could be smaller if the application provided for greater granularity in identifying the resources needed, as is provided with the explicit ALCOBJ. Applications, for instance, that need to only read member X could be handled by changing the lock state requested on the ALCOBJ and the RGZPFM commands. Any application requiring update capability to member X, however, would need to change due to the initial requirement (mentioned in the initial article of this series) to maintain the original record sequence when reorganizing the file and limitations of the RGZPFM command. These update-capable applications would need to handle the situation where the resource (member X in our case) is not available if we allow user jobs to run at any time. Jobs requiring member X with a lock state that conflicts with the ALCOBJ and RGZPFM lock states will need to inform the operator of the temporary lack of availability and then retry the operation (similar to retrying a read for update after receiving an error on a read or chain operation). But using ALCOBJ to provide atomicity across the RGZPFM command and the updating of the RGZPFMLST control record could be used as a stepping stone to improved availability. Of course, the ideal--providing for 24x7 availability--may require substantial changes to the design of the user application.
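The "inform the operator and then retry" behavior that update-capable applications would need can be sketched as a small helper. This is an illustrative Python sketch only: the MemberUnavailable exception, the with_retry function, and the attempt count and delay are all assumptions, and a real application would use its own error handling and operator messaging.

```python
import time


class MemberUnavailable(Exception):
    """Hypothetical error raised when the member cannot be obtained
    because of a conflicting lock."""


def with_retry(operation, attempts=3, delay_seconds=1.0,
               notify=print, sleep=time.sleep):
    """Run operation(); if the member is unavailable, tell the operator
    and try again, giving up after the final attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except MemberUnavailable as err:
            notify(f"Attempt {attempt} of {attempts} failed: {err}")
            if attempt == attempts:
                raise  # out of retries; let the caller decide what to do
            sleep(delay_seconds)
```

Passing notify and sleep as parameters is a design choice made here so the helper can be exercised without real delays; the appropriate number of attempts and the wait between them depend on how long a reorganize of the member typically takes.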
Prior to running ALCOBJ, PRGDLTRCDS started a monitor group. This monitor allows us to easily know when the ALCOBJ function failed and to be able to dsply the library/file/member that was being processed.
Immediately after running ALCOBJ, PRGDLTRCDS starts a second monitor group. This monitor encompasses the running of the RGZPFM command and the updating of the RGZPFMLST control record. The main purpose for this monitor group is to ensure that a failure in the RGZPFM command will still result in PRGDLTRCDS de-allocating the member currently allocated.
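The shape of the two monitor groups can be summarized in a short sketch. It is Python rather than RPG, with try/except/finally standing in for the monitor groups, and it is a model of the structure described above, not the actual PRGDLTRCDS source. The function process_member and its callable parameters (alcobj, rgzpfm, update_control, dlcobj, dsply) are hypothetical stand-ins for the corresponding commands and operations.

```python
def process_member(lib, file, mbr, alcobj, rgzpfm, update_control,
                   dlcobj, dsply):
    """Reorganize one member; return True on success."""
    # First monitor group: covers only the allocate, so a failure here
    # can be reported with full library/file/member identification.
    try:
        alcobj(lib, file, mbr)
    except Exception:
        dsply(f"{lib}/{file}({mbr}) could not be allocated")
        return False

    # Second monitor group: covers the reorganize and the control-record
    # update. The finally clause guarantees the member is de-allocated
    # whether or not either step fails.
    try:
        rgzpfm(lib, file, mbr)
        update_control(lib, file, mbr, reorg="N")
        return True
    except Exception:
        dsply(f"{lib}/{file}({mbr}) was not reorganized")
        return False
    finally:
        dlcobj(lib, file, mbr)
```

Note that the de-allocate runs only once the allocate has succeeded, and that the control record is left untouched when the reorganize fails, so the member remains flagged for a later run.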
After successfully reorganizing the member, or displaying an error message if unsuccessful, PRGDLTRCDS then reads the next RGZPFMLST1 control record and re-enters the DOW loop.
This completes our discussion of the PRGDLTRCDS program.
This series of articles has examined the watch support available since V5R4. Watches represent a key enabler for system automation, and hopefully you already have a few ideas on how message watches might be used to streamline operations in your environment. Watch capability is a powerful tool that can be used to address a wide range of programming and operational opportunities. In this series, we have seen how watches can be applied to time changes, disabled user profiles, and file reorganizations--certainly a rather broad range of areas!