In mathematics, a perfect number is one in which the sum of its divisors (excluding the number itself) exactly equals the number. An example is 6, in which 1+2+3 = 6. Perfect numbers are very scarce. There are only five of them between 1 and 40 million: 6; 28; 496; 8,128; and 33,550,336.
In OS/400, system performance is evaluated by using numbers that are not quite so perfect, specifically the numbers found on the Work with System Status (WRKSYSSTS) screen. When performance tuning is needed on an AS/400 (and it is always needed), these numbers are the driving force behind any manual or automatic adjustments made to the system.
The purpose of this article is to examine that most mysterious and least understood element of AS/400 performance reporting, the WRKSYSSTS screen. I'll examine the numbers on the screen, discuss where they are derived from, and talk about their strengths and limitations in helping you understand how to balance AS/400 system performance. My goal is to explain what you see on this screen and to help you understand how it affects your shop.
Paint by Numbers
In "Storage Pool Management on the AS/400" (MC, April 1995), I explained how the heart of the AS/400's work management scheme lies in its storage pools. WRKSYSSTS supplements that scheme and serves a special purpose in OS/400. It provides an instant report card for how well your work management scheme is running. It also allows you to make storage pool changes on the fly.
This information is raw material for evaluating your AS/400's performance. WRKSYSSTS tells you which storage pools are experiencing problems. It can show that information instantly, or it can average it out over a set period of time.
Figure 1 shows a typical view of the WRKSYSSTS screen. OS/400 gives you the option to display system status information in several different assistance levels. This particular screen shows the information in the advanced format. The advanced format shows the maximum amount of data, but eliminates showing you other information, like the function keys. Although you can retrieve information in formats other than what is shown here, for space reasons, this is the only WRKSYSSTS screen that will be covered in this article. For your convenience, we've labeled each field with a number.
To bring up an advanced assistance level WRKSYSSTS screen on your system similar to what is shown in Figure 1, type in the following command:
A Book of Numbers
There are two sets of numbers to examine when viewing the WRKSYSSTS screen. The first is a relatively static set of numbers that reside in the header of the screen. These numbers give you basic information about your system and what is happening at the moment.
The second set is more dynamic in nature. They are the constantly changing body of the screen that describes your machine's storage pools and what is happening in them.
Before we can talk about what WRKSYSSTS tells you, we have to define the time periods over which it can be run. The most important number to look at on this screen is the Elapsed time field (Label 1) in the header. This field shows you the amount of time over which the statistics were gathered.
When you first start WRKSYSSTS, all the numbers on the screen reflect what is going on in the system from the moment you typed in the command. After the first time, however, the command stores data in a hidden system object until you sign off.
Every time you run the command thereafter, it gets the current information from the system and compares it to the data stored in the hidden system object. That is how it determines the averages when you run the command a number of times.
For example, if you sign on, run the command, exit, and then run the command five minutes later, the statistics it shows will be for what has happened in the elapsed time since you first ran WRKSYSSTS. The first time the command is run, the Elapsed Time field will read zero seconds. The second time, Elapsed Time will read five minutes. This means that the CPU utilization, the database and non-database faults, and almost everything else will be averaged out to reflect what happened during that five minute interval.
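The averaging described above can be sketched in a few lines of Python. This is only an illustration of the idea of comparing a stored snapshot with a current one; the `Snapshot` fields and the `interval_averages` function are hypothetical names, and nothing here reflects how OS/400 actually stores its data.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Cumulative counters at one moment (hypothetical fields for illustration)."""
    timestamp_s: float   # seconds since an arbitrary starting point
    cpu_busy_s: float    # cumulative seconds the CPU was busy
    db_faults: int       # cumulative database faults
    ndb_faults: int      # cumulative non-database faults


def interval_averages(first: Snapshot, latest: Snapshot) -> dict:
    """Average the activity between two snapshots over the elapsed time."""
    elapsed = latest.timestamp_s - first.timestamp_s
    if elapsed <= 0:
        # First run of the command: nothing to average yet.
        return {"elapsed_s": 0.0, "cpu_pct": 0.0,
                "db_faults_per_s": 0.0, "ndb_faults_per_s": 0.0}
    return {
        "elapsed_s": elapsed,
        "cpu_pct": 100.0 * (latest.cpu_busy_s - first.cpu_busy_s) / elapsed,
        "db_faults_per_s": (latest.db_faults - first.db_faults) / elapsed,
        "ndb_faults_per_s": (latest.ndb_faults - first.ndb_faults) / elapsed,
    }


# Made-up numbers: the second run happens five minutes (300 seconds) after the first.
first_run = Snapshot(timestamp_s=0, cpu_busy_s=0, db_faults=0, ndb_faults=0)
second_run = Snapshot(timestamp_s=300, cpu_busy_s=150, db_faults=600, ndb_faults=900)
print(interval_averages(first_run, second_run))
```

Every figure in the result is spread evenly across the whole 300 seconds, which is exactly why the length of the interval matters.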
This has some serious implications. If you are looking at average numbers to determine performance problems, you may miss some very short, very intensive programs that are sucking up your CPU and memory. Trends in your performance may disappear because of the averaging. It's the same problem that occurs when you average the heights of a 5-foot and a 6-foot woman. The average is a 5.5-foot woman. It's true, but....
So what's the best time period over which to view WRKSYSSTS statistics? A good period to analyze data over is two minutes. Two minutes is long enough to get a feel for how the system is running, but short enough that significant trends will not be missed. However, don't rely on the data from running the command just one or two times. Because of the changing workloads, this data should be viewed several times a day, so one interval doesn't cloud your entire view. You always need to sample your system's performance over several different time periods to get a true picture.
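To see why a long interval can hide a short, intensive job, consider the following Python sketch. The per-minute CPU percentages are invented for illustration, and `window_averages` is a hypothetical helper, not anything WRKSYSSTS provides.

```python
def window_averages(samples, window):
    """Average consecutive, non-overlapping groups of `window` samples."""
    averages = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Invented per-minute CPU percentages over ten minutes:
# a quiet system with one two-minute burst at 100 percent.
cpu_by_minute = [20, 20, 20, 20, 100, 100, 20, 20, 20, 20]

ten_minute_view = window_averages(cpu_by_minute, 10)   # one long interval
two_minute_view = window_averages(cpu_by_minute, 2)    # five short intervals

print(ten_minute_view)
print(two_minute_view)
```

With these numbers, the single ten-minute average comes out at 36 percent, while the two-minute view contains one interval at 100 percent. The burst is real in both cases; only the shorter interval makes it visible.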
What do the numbers on WRKSYSSTS mean, anyway? We've already covered elapsed time, so let's review the other numbers in the header part of the screen. Keep in mind that these numbers may be averaged over different time periods depending on the elapsed time.
* % CPU used (Label 2) is self-explanatory and refers to the percentage of CPU resources used to process jobs currently running on the system. If you see that your average CPU utilization is at 99 percent, or if this field reads '+++', it's telling you your CPU cannot handle the current workload. If you see this every time you run the command, it may mean that your processor is undersized for the workload and you either need to adjust the workload or to get a larger processor.
* Jobs in system (Label 3) seems like another self-explanatory number, but it is more complex than that. The number of jobs in the system includes more than just active jobs. It also includes queued jobs that are waiting to run and completed jobs. Completed jobs remain in the system if there are any spooled files on an AS/400 output queue. These jobs can affect performance.
Too many jobs in the system can cause delays in signing on or in job initiation. They can also cause commands that work with job and output queues to run more slowly. In general, to achieve better performance, clear out job logs on a regular basis (or better yet, don't generate them if there are no errors), clear out your output queues by saving only those spooled files that absolutely need to be saved, and don't queue up too many jobs to run.
* % perm addresses (Label 4) and % temp addresses (Label 5) refer to the percentage of the possible system addresses that have been created for permanent and temporary objects on your AS/400. If you see a rapid increase in the percentage of addresses used, it may imply that your system or applications are creating and destroying objects at a rapid pace. This may be affecting performance.
* System ASP (Label 6) refers to the amount of hard disk capacity available in your system Auxiliary Storage Pool (ASP). This can be a deceiving number if you have more than one ASP defined on your system, because this number only reflects the System ASP. If you want to get a full picture of disk utilization, especially if you are using checksum, use the Work with Disk Status (WRKDSKSTS) command.
* % system ASP used (Label 7) refers to the percentage of disk storage in your System ASP that is currently used. Performance is generally affected when your disk storage usage exceeds 80 percent, and system failures can occur after it exceeds 90 percent. You should watch this figure even when your storage is running below 80 percent. In situations when your disk usage jumps significantly in one day (by five percent, for example), there could be a problem with an application that is affecting system performance. It is always a good idea to monitor this number from day to day for unexpected changes.
* Total aux stg (Label 8) refers to the total amount of auxiliary storage available on the system.
* Current unprotect used (Label 9) and Maximum unprotect (Label 10) refer to the current amount of storage in use for temporary objects and machine data that is stored in unprotected storage when checksum protection is turned on. Since checksum protection is an issue in itself, we will not cover these fields in this article.
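The System ASP guidelines above (80 percent, 90 percent, and a jump of about five percentage points in a day) lend themselves to a simple daily check. The Python sketch below assumes you record the % system ASP used figure once a day by hand or by some other means; the function names and the sample readings are hypothetical.

```python
def asp_status(pct_used, warn_pct=80.0, critical_pct=90.0):
    """Classify a % system ASP used reading against the article's guidelines."""
    if pct_used > critical_pct:
        return "critical"   # system failures can occur
    if pct_used > warn_pct:
        return "warning"    # performance is generally affected
    return "ok"


def daily_jumps(daily_pct, jump_pct=5.0):
    """Return (day_index, increase) for each day-over-day rise of jump_pct points or more."""
    jumps = []
    for day, (previous, current) in enumerate(zip(daily_pct, daily_pct[1:]), start=1):
        increase = current - previous
        if increase >= jump_pct:
            jumps.append((day, round(increase, 2)))
    return jumps


# Invented readings, one per day.
readings = [61.0, 62.0, 68.5, 69.0]
print([asp_status(r) for r in readings])
print(daily_jumps(readings))
```

Here every reading is still classified as "ok", yet the check flags the jump between the second and third readings, which is the kind of change worth investigating before the 80 percent mark is reached.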
Storage Pool Numbers
In the bottom half of the WRKSYSSTS screen is a subfile of numbers. This subfile describes the current storage pool setup and gives you a thumbnail sketch of how that setup is working. There will be from 1 to 16 entries in this section, one for each possible storage pool on the AS/400. There are four predefined storage pools (*MACHINE, *BASE, *INTER, and *SPOOL) and up to 12 user-definable storage pools.
For each storage pool, the following information is listed:
* Sys Pool (Label 11) refers to the identification number the AS/400 assigns to the storage pool. This number will always be from 1 to 16, and OS/400 determines the number to assign. Pool 1 is *MACHINE and pool 2 is *BASE; generally, pool 3 is *SPOOL, pool 4 is *INTER, and pools 5 through 16 are the user-defined storage pools. This number has no effect on system performance.
* Pool Size K (Label 12) is the amount of working system memory assigned to the pool. Similar to RAM on a personal computer, this is the memory that OS/400 jobs use for program execution and database access. This value can be changed in WRKSYSSTS, or it can be changed on the Work with Shared Pools (WRKSHRPOOL) screen or with the Change Subsystem Description (CHGSBSD) command. Along with storage pool activity levels, this is the most significant component of AS/400 performance tuning. Well thought out adjustments to a storage pool will help system performance; poorly thought out ones will damage it.
By default, all unused working memory will reside in the *BASE (Pool ID 2) storage pool. An increase in another pool's size will take memory from *BASE. A decrease will add memory to it. You cannot change the pool size for *BASE directly.
* Rsrv Size K (Label 13) refers to storage pool memory that is used by the system and is not available for job use. The *MACHINE pool (Pool ID 1) always uses some reserved memory for AS/400 system overhead, such as storage management I/O routines or page fault handlers. In pools 2-16, memory is usually reserved during backup and restore operations.
Reserved size can affect system performance. If you don't give enough memory to the *MACHINE pool, most of its memory may be used for reserved processing, which can cause across-the-board system performance degradation. Likewise, if you run backup and restore programs in an undersized pool, other jobs may thrash trying to get enough memory to complete. (Thrashing is a situation where the system spends more CPU time swapping programs and data from disk to memory and back than actually processing the program.) Two simple but basic ways to improve performance are to size your *MACHINE pool memory correctly and to run backup/restore operations in their own adequately sized storage pools.
* Max Act (Label 14) specifies the total number of jobs that can have a status of running (RUN on the WRKACTJOB screen) in a particular storage pool at one time. Activity levels are the second most significant component of system performance tuning. When a problem develops, you can adjust the activity level of a storage pool to allow more or fewer jobs to execute at the same time, which affects how many jobs can compete for the limited memory.
In the world of AS/400 work management, more can be less. Performance problems develop when too many jobs try to compete for system resources at one time. These jobs compete for available memory and thrash as program instructions and data are swapped in and out of memory. On the opposite side, if there is too much memory for the number of jobs running in a pool, system memory that can be better used elsewhere may be wasted.
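The relationship between *BASE and the other pools described under Pool Size K can be modeled as simple bookkeeping. The Python sketch below is a toy model only: the `resize_pool` function and the pool sizes are hypothetical, and it ignores details such as any minimum size the system may enforce for *BASE.

```python
def resize_pool(pools, name, new_size_k):
    """Return a new pool-size mapping after resizing `name`.

    *BASE is never resized directly; it gives up or absorbs the difference.
    """
    if name == "*BASE":
        raise ValueError("*BASE cannot be resized directly")
    delta = new_size_k - pools[name]
    if delta > pools["*BASE"]:
        raise ValueError("not enough memory left in *BASE")
    updated = dict(pools)
    updated[name] = new_size_k
    updated["*BASE"] = pools["*BASE"] - delta
    return updated


# Invented pool sizes in kilobytes.
pools = {"*MACHINE": 8000, "*BASE": 10000, "*INTER": 6000, "*SPOOL": 500}

after_growth = resize_pool(pools, "*INTER", 8000)    # takes 2000K from *BASE
after_shrink = resize_pool(pools, "*SPOOL", 300)     # returns 200K to *BASE
print(after_growth)
print(after_shrink)
```

Total memory never changes in this model; every adjustment to one pool is paid for by, or credited to, *BASE. That is the reason pool tuning is a balancing act rather than a matter of simply giving a slow pool more memory.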
Numbers and Faults
DB and Non-DB Faults and Pages (Labels 15 and 16) refer to how program instructions and database information enter and leave the pool's memory. Operating Systems 101 tells us that programs and data reside on your hard disk. To execute a program, instructions must be read from the disk to your pool's main memory, and data must also be brought into memory to be read and modified.
Every time program or database information is needed but is not present in main memory, OS/400 declares a fault for that storage pool. A database (DB) fault occurs when file information or an access path is needed. A non-database (NDB) fault occurs whenever something other than database information is needed. NDB faults can involve programs, data queues, and configuration objects.
Faults affect system performance because an executing job must stop and wait for information to be transferred from the disk to main storage. When two or more jobs are competing for working memory, information may be rapidly swapped back and forth between system memory and hard disk and the system is said to be thrashing.
Faults are measured in terms of faults per second. When fault rates go up, AS/400 system performance goes down. IBM's AS/400 Performance Management redbook and AS/400 Work Management guides give guidelines for acceptable and unacceptable faulting rates. These guidelines vary for different machines, so it is necessary to consult the manuals for your particular installation.
High faulting rates can be repaired by adjusting your pool sizes and activity levels. If the problem is memory, giving the storage pool more memory may solve it. If competition is the problem, the activity level can be adjusted downward. These adjustments alone may not solve your problems. High fault rates can also result from faulty hardware or poor application design and coding. Pool adjustment is a delicate process that involves scarce resources (e.g., working memory) that must be shared by several pools.
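Comparing each pool's fault rates against a guideline is easy to express in code. In the Python sketch below, the fault rates are invented readings and the guideline values are placeholders chosen only for illustration; they are not IBM's figures, which must come from the manuals for your particular machine.

```python
def pools_over_guideline(rates, guideline):
    """Return {pool: total faults per second} for pools above their guideline.

    rates     maps pool name -> (DB faults per second, non-DB faults per second)
    guideline maps pool name -> maximum acceptable total faults per second
    """
    flagged = {}
    for pool, (db_rate, ndb_rate) in rates.items():
        total = db_rate + ndb_rate
        limit = guideline.get(pool)
        if limit is not None and total > limit:
            flagged[pool] = total
    return flagged


# Invented readings from one WRKSYSSTS interval.
rates = {"*MACHINE": (0.0, 1.5), "*BASE": (4.0, 9.0), "*INTER": (2.0, 3.0)}

# Placeholder limits for illustration only; substitute the values from your manuals.
guideline = {"*MACHINE": 2.0, "*BASE": 10.0, "*INTER": 10.0}

print(pools_over_guideline(rates, guideline))
```

A flagged pool is only a starting point. Whether the remedy is more memory, a lower activity level, or a fix to the application still depends on what is running in that pool.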
Faults in the *MACHINE pool can only be controlled by changing the size of the pool, not by changing the maximum number of jobs running in the pool. Because the *MACHINE pool handles system functions, activity levels are irrelevant and are not even listed on the WRKSYSSTS screen.
Waiting for My Number to
Act-Wait (Label 17), Wait-Inel (Label 18), and Act-Inel (Label 19) are transition rates that describe how jobs share activity levels and how well OS/400 processes work in a storage pool. These fields are a bit more complicated and deserve an article of their own but, in a nutshell, here's how they work.
Jobs running on an AS/400 exist in either the active, wait, or ineligible state.
An active job is the ideal state. It exists in main storage and is able to process its work. Jobs in the active state sometimes go into a short wait, which can last up to two seconds. The job retains its activity level while waiting for an action, such as a database write, to occur.
A job in the wait state needs a system resource or a response from the user to keep processing. Jobs in this state lose their activity level so other jobs can process. IBM actually defines four different types of wait states (short wait, short wait extended, long wait, and key/think wait), but to keep things simple we will use the definition of a job that is waiting for a resource or response.
Jobs in the ineligible state have work to do, but the system cannot accommodate their demands at the moment. Jobs enter the ineligible queue when there are no activity levels available for them to use. They can also enter the queue from a wait state when they are waiting for a record or object lock and the lock is released but there are no activity levels available.
A job in the ineligible queue is reactivated in one of two ways. If the job became ineligible because it was waiting for a lock, it is placed ahead of other jobs of equal or lesser priority. It becomes active again when that object becomes available and an activity level opens up.
Jobs that become ineligible solely because there are no activity levels are placed behind all other jobs of equal priority already on the ineligible queue. This type of queuing is called first-in, first-out processing. It provides a round-robin approach to processing. Jobs are processed for a specific amount of time, go ineligible, and then become active again after each job has had a turn.
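The two queuing rules can be illustrated with a small Python model. This is a simplified sketch, not OS/400's dispatcher: the `enqueue` function is hypothetical, and it assumes that a lower priority number means a more urgent job.

```python
def enqueue(queue, job, priority, was_lock_wait):
    """Insert (job, priority) into a queue kept in dispatch order, front first.

    Assumption for this sketch: a lower priority number is more urgent.
    """
    if was_lock_wait:
        # Ahead of jobs of equal or lesser priority.
        position = next((i for i, (_, p) in enumerate(queue) if p >= priority),
                        len(queue))
    else:
        # Behind jobs of equal priority (first-in, first-out within a priority).
        position = next((i for i, (_, p) in enumerate(queue) if p > priority),
                        len(queue))
    queue.insert(position, (job, priority))


ineligible = []
enqueue(ineligible, "A", 20, was_lock_wait=False)
enqueue(ineligible, "B", 20, was_lock_wait=False)
enqueue(ineligible, "C", 50, was_lock_wait=False)
enqueue(ineligible, "D", 20, was_lock_wait=True)    # jumps ahead of A and B
enqueue(ineligible, "E", 20, was_lock_wait=False)   # waits behind A and B
print([job for job, _ in ineligible])
```

In this model, job D, which was waiting on a lock, ends up at the front of its priority group, while job E takes its turn behind the jobs of equal priority that arrived before it.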
The transition rates shown on the WRKSYSSTS screen detail the rate at which the three transitions occur in transitions per second. Active to Wait (A-W) shows the rate at which jobs are moving from an active state to a wait state. It shows how often jobs are interrupted due to waiting on resources.
The Wait to Ineligible (W-I) rate shows how often jobs are moving from a wait state to the ineligible queue. This rate should be examined in conjunction with the A-W rate. IBM specifies that, if the W-I rate is zero all the time, then your activity levels may be set too high and your pool is not using all of its activity levels. The other rule of thumb is that your W-I rate should be no more than 20 percent of your A-W rate.
If this ratio is greater than 20 percent, jobs are moving into the ineligible state too often, and you may want to examine how your applications use locks. Above 20 percent, performance is slowed by having too many jobs paging in and out of memory looking for object locks. This rate can be improved by ensuring that your applications holding record or object locks release them as soon as they are finished.
The Active to Ineligible (A-I) rate shows how often jobs are moved into the ineligible queue due to lack of an activity level. In general, this number should be as low as possible without reaching zero. If A-I is zero, there is always an activity level available for every job that wants one, and the pool's activity level may be set too high. It is desirable to have some A-I activity, but not so much that it slows down the system. Also, watch this rate in your interactive pools. If there is an excessive amount of interactive A-I activity, batch work may be occurring in an interactive pool.
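The rules of thumb for the three transition rates can be collected into one small check. The Python sketch below uses a hypothetical `check_transitions` function and invented readings. Keep in mind that the zero-rate guidelines apply to a rate that stays at zero over many intervals, so a single reading flagged here is only a hint.

```python
def check_transitions(a_w, w_i, a_i, max_ratio=0.20):
    """Apply the article's rules of thumb to one pool's transition rates.

    a_w, w_i, and a_i are the Act-Wait, Wait-Inel, and Act-Inel rates
    in transitions per second.
    """
    notes = []
    if w_i == 0:
        notes.append("W-I is zero: activity level may be set too high")
    elif w_i > max_ratio * a_w:
        notes.append("W-I is more than 20 percent of A-W: review how applications use locks")
    if a_i == 0:
        notes.append("A-I is zero: activity level may be set too high")
    return notes


# Invented readings for three pools.
print(check_transitions(a_w=100.0, w_i=10.0, a_i=1.0))   # within the guidelines
print(check_transitions(a_w=100.0, w_i=30.0, a_i=2.0))   # W-I too high relative to A-W
print(check_transitions(a_w=50.0, w_i=0.0, a_i=0.0))     # both rates at zero
```

The check deliberately says nothing about how much A-I activity is too much, because the article gives no number for it; that judgment still has to come from watching the pool over time.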
The Numbers Game
As you can see, WRKSYSSTS is much more complicated than it looks. It contains an incredible amount of information about how your system behaves, but it can also deceive you about that behavior. The numbers are not perfect, but if you understand what they mean, you'll be better equipped to balance your system.
Joe Hertvik is a freelance writer and a system administrator for a manufacturing company outside of Chicago.
AS/400 Performance Management (GG24-3723, CD-ROM GG243723).
AS/400: Work Management (SC41-3306, CD-ROM QBKALG00).