In my day-to-day activities, I am frequently asked to assist with AS/400 problem management. I define AS/400 problem management as consisting of the following phases: problem awareness, problem analysis, problem reporting, and problem resolution, including Program Temporary Fixes (PTFs).
Each phase is an important part of managing the AS/400. Included with every AS/400 operating system are functions to assist with each phase of problem analysis and resolution. These functions are referred to as Electronic Customer Support (ECS). Most AS/400 users think of ECS as the function that enables them to get PTFs, but it's really much more than that. Develop a good understanding of the ECS features and you will enhance your ability to minimize problems and their impact on your system.
This may sound a bit comical, but how many users are aware when a problem occurs? Oh, sure, if the system goes casters up, you know about it pretty quickly, but how soon are you aware of a disk drive failure in a mirrored disk pair? Or communications errors on your communications lines?
Most AS/400 sites monitor the QSYSOPR message queue for all messages and, hopefully, they will catch those that are really important. If this is all you do, though, you may miss opportunities to catch minor problems before they become big ones. Let's look at some system tools built into OS/400 to assist with problem awareness.
When monitoring the QSYSOPR message queue, you can easily miss important messages like "Error occurred on disk unit 12" if they are mixed in with messages such as "Adapter has inserted or left the ring on line TRNLINE1." You can easily segregate the important messages from the riffraff by creating a separate critical system message queue. All you do is create a message queue called QSYSMSG in library QSYS by entering the following command:
CRTMSGQ MSGQ(QSYS/QSYSMSG) TEXT('Critical System Messages')
Once the queue is created, the system (V2R3 and higher) will automatically route the important system-related messages there. (For a complete listing and description of these messages, see Chapter 8, "Working with Messages," in the OS/400 CL Programming V3R1 manual.) This segregation of messages makes your job of system monitoring much easier. I recommend you have two display devices (or one device with the ability to display two sessions side by side) to monitor messages: one device to monitor the QSYSOPR message queue and the other to monitor the QSYSMSG message queue. You can easily accomplish this by creating two user profiles. Specify QSYSOPR as the MSGQ on one user profile and QSYSMSG for the MSGQ value on the other user profile and set the delivery mode to *BREAK.
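As a rough sketch of the two-profile setup, the commands below create one monitoring profile per queue. The profile names MONOPR and MONMSG and the TEXT values are placeholders of my own, not names supplied by OS/400. All other parameters (password, user class, initial program) are left at their defaults here and should be set to match your security policy.

```cl
/* Hypothetical profile names; substitute your own.              */
/* The trailing + continues a command in CL source. On a command */
/* line, type each command on one line without it.               */
CRTUSRPRF USRPRF(MONOPR) MSGQ(QSYS/QSYSOPR) DLVRY(*BREAK) +
          TEXT('Monitors QSYSOPR')
CRTUSRPRF USRPRF(MONMSG) MSGQ(QSYS/QSYSMSG) DLVRY(*BREAK) +
          TEXT('Monitors QSYSMSG')
```

If the profiles already exist, the Change User Profile (CHGUSRPRF) command accepts the same MSGQ and DLVRY parameters.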
In addition to the QSYSMSG message queue, the AS/400 will route critical system messages to a list of users or user groups. This is a feature of V3R1, and it is part of the system's service attributes. This ensures that someone will receive the message so it will not get lost among other system-generated messages. You can specify users, or classes of users, to receive a break message when the system detects a critical condition. Enter the user values in order of priority (highest to lowest).
In the event the system detects a critical condition, it will attempt to send a break message to the user or class of users with the highest priority. If the entry is a user name, a break message is sent only if the user is signed on. When the entry specifies a user class, a break message is sent to all users of that class who are currently signed on. If no users specified in the entry are currently signed on, the next entry is checked. This process continues until either a break message can be sent or the last entry is checked.
This feature is enabled only if automatic problem analysis is enabled (more on that later). The following example sets up the critical message notification feature. It will enable automatic problem handling and send critical system messages to nearly everyone:
CHGSRVA ANZPRBAUTO(*YES) CRITMSGUSR(*SYSOPR *SECOFR *SECADM *PGMR *USER)
Another method of finding problems is to monitor the problem log. When the system detects a problem, it will usually create a problem record in the problem log. Monitor the problem log for problems that are obviously serious. This may seem daunting, as many problem records are not of particular consequence (e.g., occasional workstation device errors due to shutdown of personal computers). To view the problem log, enter the Work with Problems (WRKPRB) command or take option 2 from the PROBLEM menu. You can print a list of the problems using the Display Problem (DSPPRB) command.
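To print the problem log rather than browse it, DSPPRB can send its list to a spooled file. The following is a sketch. The STATUS filter is just one reasonable choice, and the selection values offered may vary by release, so prompt the command with F4 to confirm them.

```cl
/* Print every problem record in the problem log */
DSPPRB OUTPUT(*PRINT)

/* Print only the records that still await analysis */
DSPPRB STATUS(*OPENED) OUTPUT(*PRINT)
```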
It is also a good idea to periodically monitor the system error log. The system maintains an error log, mostly for service personnel use, that can be displayed or printed. Look for trends or obvious error conditions. If you find an entry that warrants attention, check the problem log to see if an entry matches the error log entry. If there is no entry in the problem log, you can manually create a problem log entry and perform an analysis (see "Analyzing Problems," later in this article). A little common sense here avoids a bad situation later. Use the following steps to browse or print the error log:
1. Sign on as security officer.
2. Start System Service Tools using the command STRSST.
3. Take option 1, Start a service tool.
4. Take option 1, Error log utility.
5. Take option 1, Display or print error log.
6. On the Select Subsystem Data screen, select option 1, All error logs, and change the From and To date and time, if necessary.
7. On the Select Report Type for Subsystem screen, select option 1, Display summary of error log, and press Enter.
The first summary is displayed. Press the F11 key to go to the next summary or F10 to go to the previous summary. The summaries are Processor Entries, Magnetic Media Entries, Local Workstation Entries, Communication Entries, Power Entries, Licensed Program Entries, and Lic Int Code Entries.
You can also print the error log by using the Print Error Log (PRTERRLOG) command or by taking option 17 from the CMDSRV menu.
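From a command line, a printed report might be requested as follows. This is only a sketch: I am assuming TYPE(*ALLSUM) is the summary report type on your release, so prompt PRTERRLOG with F4 to check the values it offers.

```cl
/* Summary of all error log entries (assumed report type) */
PRTERRLOG TYPE(*ALLSUM)

/* Full detail for all error log entries */
PRTERRLOG TYPE(*ALL)
```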
Automated Problem Analysis
In many cases, you will be able to solve problems yourself. Other times, you may need the help of a technical support person or a service representative. When you do need outside help, it is important to collect as much information about the problem as you can. Now that we have some tools to help with the awareness of problems, let's look at some tools to perform the problem analysis. The AS/400 can perform many hardware diagnostics tests, including self-diagnosis. There are also tools to help perform manual problem analysis.
If you have IBM maintenance, IBM will install, at no cost, a product called AS/400 Performance Edge, which includes AS/400 Service Director. This service constantly monitors AS/400 system functional hardware and peripherals for error activity and automatically reports problems to IBM Service. This greatly simplifies the task of preventive maintenance. The IBM service representatives often respond to a call before the AS/400 administrators realize there is a problem with their systems. If you want AS/400 Performance Edge on your system, ask your IBM service representative to install it.
In V3R1, new problem management functions are also available. They provide automated problem analysis, as well as automated problem reporting. In order to enable these functions, you must set the new (V3R1) service attributes to *YES. The service attributes specify whether problem analysis routines should run automatically when a failure occurs, how the specified service provider should be notified of problems, when PTFs should be installed, and where critical system messages are sent.
Two new commands, Display Service Attributes (DSPSRVA) and Change Service Attributes (CHGSRVA), are used with service attributes. The system can be set to run a background batch job that will perform automatic problem analysis of all problems at the time of failure. If problem analysis routines are not run automatically at the time of failure, they can be run manually from the QSYSOPR message queue or by using the WRKPRB command. You also specify your service provider (the default is IBM Service). The following example will set the service attributes to enable automatic problem handling and report the problems to IBM Service:
CHGSRVA ANZPRBAUTO(*YES) RPTPRBAUTO(*YES) RPTSRVPVD(*IBMSRV)
In the service attributes, you specify how PTFs are applied when using the Install PTF (INSPTF) command or options 7 or 8 from the GO PTF menu. The default is to mark all PTFs as delayed and then perform an automatic IPL.
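Before changing anything, it is worth displaying the current service attributes. The second command below only illustrates overriding the default. The value *IMMDLY (apply immediate PTFs at install time and mark delayed PTFs for the next IPL) is my assumption about what your release offers, so prompt CHGSRVA with F4 to confirm the choices.

```cl
/* Show the current service attributes */
DSPSRVA

/* Illustrative only: apply immediate PTFs now, defer the rest */
CHGSRVA PTFINSTYPE(*IMMDLY)
```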
System APAR Libraries
If ANZPRBAUTO is set to *YES, the system may create libraries for storing Authorized Problem Analysis Report (APAR) data. These APAR libraries are used to store data related to software or hardware problems and have the naming convention QSCxxxxxxx, where xxxxxxx is the problem ID.
If you set the system value QSFWERRLOG to *NOLOG, this will prevent the system from logging software errors, thereby reducing the generation of APAR libraries. The shipped value for QSFWERRLOG is *LOG. When a problem record is deleted from the system, the associated APAR library is also deleted.
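The commands below sketch how you might check and change this system value, then see which APAR libraries exist. Keep in mind that turning off software error logging also discards diagnostic data your service provider may want, so treat *NOLOG as a trade-off rather than a default.

```cl
/* Check the current setting */
DSPSYSVAL SYSVAL(QSFWERRLOG)

/* Stop logging software errors */
CHGSYSVAL SYSVAL(QSFWERRLOG) VALUE(*NOLOG)

/* List any APAR libraries currently on the system */
WRKLIB LIB(QSC*)
```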
Manual Problem Analysis
Several tools assist with manual problem analysis. Most of these are accessible from the Operational Assistant. As with many choices from the Operational Assistant, fast-path commands are available. The most important part of performing problem analysis is to find as much information as you can about the problem. The following tools can help.
AS/400 System Startup and Problem Handling V3R1 Manual
IBM ships this manual with every AS/400 system. Keep it handy for quick reference when researching any problem.
Problem Summary Forms
A problem summary form is useful for recording information about problems when they occur. Record as much information as you can to help the service provider analyze the problem. Refer to Figure 1 for an example of a form you can use for this purpose. Appendix B of the AS/400 System Startup and Problem Handling V3R1 manual contains additional examples of problem summary forms.
User Help Menu
The User Help menu is provided for the end user who needs help resolving problems. For example, if you were trying to help an end user with problems he was experiencing, instruct the user to go to the User Help menu by entering this command:
Option 3 will then display information about the user that may be helpful to the analyzer.
Option 10, Save information to help resolve a problem, allows the user to enter some text about the problem he is experiencing. When he finishes entering the text, a problem record and several spool files will be created. The spool files are a copy of the user's job log, a list of the user's job status attributes, a listing of PTFs on the system, and a listing of relevant entries from the history log. They also include a listing of the currently active jobs, a copy of the user's message queue, a copy of the workstation's message queue, and a copy of the QSYSOPR message queue.
If you are the person analyzing the problem, you can access the problem record and the spool files to help solve the user's problem. The easiest way to access the problem records is from the Problem menu.
The Problem menu is the main menu for working with problems on the AS/400. The menu provides various options for analyzing problems, creating problem records, viewing problem records, and reporting problems to the service provider. Most of the options on the menu are self-explanatory. To get to the Problem menu, enter GO PROBLEM.
Let's explore the option to work with problems. Option 2 runs the Work with Problems (WRKPRB) command. You will be presented with a listing of all recorded problem records. From this point, you can analyze problems (hardware and software), view the problem details, work with the problems, and report prepared problems.
Refer to Figure 2 for an example of the Work with Problems display. A list of all the problem records is shown. The problem status of Opened indicates that you will need to perform analysis before submitting the problem to your service provider.
Analyzing problems serves two functions. First, you may be able to solve the problem yourself. Second, it prepares a problem record to be submitted to your service provider.
From the Work with Problems display, use option 8 to work with an existing problem. If you want to create a new problem record, press F14, Analyze new problem, from the Work with Problems display, or enter the Analyze Problem (ANZPRB) command. If the problem is hardware related, you may be prompted to perform tasks such as ending subsystems or jobs to allow hardware testing. If the problem is a communication adapter, you may be prompted to remove a communication cable and insert a wrap connector.
At the conclusion of the analysis, the system may provide suggestions for resolution, recommendations for further testing, and the opportunity for reporting the problem to your service provider. Refer to Figure 3 for an example of the analysis results.
You may be able to resolve the problem yourself. Further analysis may involve third parties (e.g., your communications vendor if the analysis points to a telephone line problem). If it appears that you should report the problem to your AS/400 service provider (presumably IBM), then prepare the problem record for reporting.
Software problems, particularly software usage problems, usually require that you create a new problem record and then enter text to describe the problem. If you describe the problem yourself, the system does not perform any analysis. However, you can manually create problem records and perform analysis if you have jobs, programs, or hardware on which problem analysis can be performed. For example, if you discover hardware errors in the system error log and there is not an entry in the problem log, you can manually create an entry and perform analysis. From the Work with Problems display, press F14 or enter the ANZPRB command. Choose which system you are preparing a problem for and you will be presented with a screen similar to Figure 4.
Choose either to analyze a problem (this assumes that a job, a program, or hardware exists so the system can perform an analysis) or to describe a problem (this is generally chosen for software usage questions). If you are having problems with system software, you most likely will choose option 6, Job or program problem (application or system), to describe a problem.
If you choose options 1 through 4, follow the instructions to perform analysis. If you choose option 5, you will be asked a series of questions and then be given the opportunity to enter text to describe the problem. If you choose option 6, select the program product you are having difficulty with and follow the instructions on each of the remaining screens.
After the analysis phase (if there is any), you will get to the Enter Problem Description display. Enter a short description of the problem you are experiencing and press Enter. When you get to the Report Problem display, press F13 to enter notes. Now is your opportunity to provide as much text as you need to completely describe the problem. Once you have completed entering notes, press Enter to report your problem.
Before you can report a problem using ECS, you must analyze it. This will set the problem record status to READY. When the analysis is complete, you can prepare a service request. During the analysis phase, all relevant information and any notes you enter become part of the problem record.
Once the problem is analyzed and you are at the Display Problem Analysis Results display, you can display the details of the analysis, press F13 to add notes (a very good idea), or press F6 to report the problem (refer to Figure 3). I strongly recommend you add notes to your problem record at this point. Comments can be very helpful for the service provider. These notes become a part of the problem record.
When you are ready to prepare the report for submission to the service provider, press F6. Take the option to prepare the service request. Verify the contact information, select the problem severity level, select the service provider, and choose to send or not send the service request.
If you choose not to send the service request, the problem record will be changed to PREPARED after you exit. You can send all PREPARED service requests from the Work with Problems display using F16.
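Before pressing F16, it can help to list only the service requests that are waiting to be sent. A minimal sketch, assuming WRKPRB accepts the STATUS selection on your release:

```cl
/* Show only problem records with a status of PREPARED */
WRKPRB STATUS(*PREPARED)
```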
If a problem is new, and assuming that your service provider is IBM, a problem management record (PMR) number is created by the IBM service system when you report the problem. This PMR number is returned to your AS/400 system and becomes part of the problem record. It is then used as a problem reference number. If your problem is hardware related, information on the failing component and your system will also be gathered during the analysis phase and included in the problem record. The service provider will have a complete description of the failure, including the serial and part numbers of the system and the failing component.
If the problem is hardware related and you have IBM maintenance, your service representative will receive a call on his portable terminal notifying him that you are requesting service. If your problem is software related, the problem record will be submitted for software support. The record will be submitted to the IBM SupportLine staff, with all the information collected during the analysis phase (including any notes you added). If you have an IBM SupportLine contract for voice support (telephone), IBM Service Center personnel should contact you and work with you to resolve the problem. If you do not have voice support, you may view the response to the problem using the Query Problem Status (QRYPRBSTS) command. You can use QRYPRBSTS any time after reporting a problem.
The final phase is to resolve the problem. Over the years, I have discovered that problems sometimes occur and then go away without any human intervention. When that happens, I blame it on magic: smoke and mirrors. Occasionally, I rely on divine intervention. That notwithstanding, let's review some of the things to do to resolve problems.
As already stated, the problem analysis phase may include several steps to test system hardware. If you are assisting end users with application problems, the data collected during that analysis phase may be helpful. If you collect the right data (a bit of a supposition here), you may be able to resolve the problem immediately. If you submitted the problem to a service provider, then that provider will be working with you to resolve the problem.
After submitting a software-related problem to IBM, you can use ECS to query the problem status. This function is not supported with hardware-related problems. If you do not have voice (telephone) support with IBM SupportLine, this is your only option to check on the status. You can use the QRYPRBSTS command, or you can choose to query the status by using option 41 from the Work with Problems display.
To use the QRYPRBSTS command, you must first note the PMR number assigned when you submitted the problem. If the status of the problem is SENT, then select option 5 from the Work with Problems display to view the problem details and scroll down until you find the Service Assigned Number. Then, enter the following from the command line:
QRYPRBSTS PRBID(*PMR) SRVID(service_number)
You can also query the problem status from the Work with Problems display. Again, if the problem status is SENT, select option 8 to work with the problem, then select option 41 to query the status. The AS/400 will dial the service provider and return with the current status of your problem. As a result of the query, additional text will be added to the problem record. Go back to the problem record and use option 5 from the Work with Problems screen to display the problem details.
As explained in the AS/400 System Startup and Problem Handling V3R1 manual, IBM periodically creates PTFs to correct problems or potential problems found within a particular IBM-licensed program. PTFs may fix problems that appear to be hardware failures, or they may provide new functions. Generally, PTFs are incorporated in a future release of the system.
After you submit a problem to IBM, the service system will use the data collected during the analysis phase to search for PTFs associated with the symptoms. If it finds PTFs that may resolve the problem, they are automatically downloaded to your AS/400. The PTF numbers will then be recorded in the problem record text. Likewise, when you issue a Send PTF Order (SNDPTFORD) request through ECS, a problem record is created and the PTF information is recorded in the text. You can view the details of this problem as you would any problem record. For example, if you issue the command
to request the latest cumulative PTF tape for V3R1, a problem record will be created with a status of ANSWERED. If PTFs are automatically downloaded, you must verify that they fix the problem. The problem record will indicate that PTFs were sent and, unless the record is reopened, IBM may assume the problem is fixed. You can easily reopen the problem record by adding text (option 40 from the Work with Problems display). Then, submit the problem again (option 2 from the Work with Problems display).
For more information on PTFs and PTF management, refer to Chapter 5 in the AS/400 System Startup and Problem Handling V3R1 manual.
The Big Picture
Here are some main points to remember:
• Make sure you are aware of what your system is reporting. Careful monitoring of system-generated messages can reduce the impact problems may have.
• The Problem Log can be your friend. You can analyze and report problems from this log. You can frequently solve problems during the analysis phase.
• Besides the Problem Log, you should document problems as they occur. This is a great help to the person analyzing problems.
• If you do not subscribe to IBM's SupportLine, IBM will not accept problem reporting on the phone. You can report problems only through ECS, fax, or mail.
• If you submit your problem electronically (through ECS), you should query the problem status electronically.
The AS/400 has great problem determination utilities. When problems occur, use these utilities. Let your system become part of the solution.
AS/400 System Startup and Problem Handling V3R1 (SC41-3206, CD-ROM QBKAGO00).
OS/400 CL Programming V3R1 (SC41-3721, CD-ROM QBKAUO00).