Let me start out by saying we had a catastrophic failure on our iSeries 820 on Friday morning. We were finally up and running about 45 hours later.

There are two things I learned in the process of recovery:

1. Our backup strategy is relatively strong (but can be improved).
2. IBM's support leaves a lot to be desired. It ain't what it used to be.

I have a number of caveats for those who thought they had all of their ducks lined up. Here's the tale of woe...

At 8:46am on Friday 5/30 our system abruptly halted with an SRC code. I assumed it was a hardware failure and called 800-IBM-SERV to place a hardware call. I got an immediate runaround from IBM.

"Sir, we show your machine went out of warranty in 1994." "No ma'am, this is a new model 820 that was upgraded from a 720 on May 17th of this year." "Sorry, sir, we show you stopped paying maintenance in 2000." "No ma'am, I approve the bills and we pay quarterly." "I'm sorry, sir. You are on call back, I will have a service person call you back within 15 minutes."

I wait 30 minutes. No call. I call IBM back. "I'm sorry sir, we can't release you to service because you are not showing up in our computer as being under maintenance." "Hey, I was told you'd be sending me to service over 30 minutes ago." "We can't do that, you're not under maintenance." Getting angry now. "Yes we are. I understood that new installs and upgrades have a 1 year warranty." "Excuse me, hold on."

On hold for about 30 more minutes. I get transferred to 4 different people who say basically the same thing. Finally, I end up talking to Doreesa. I'm pretty angry now. I have over 800 users who have been twiddling their thumbs for about 2 hours and IBM service won't help me. Doreesa becomes somewhat belligerent, condescending and, at times, downright exasperating. She has said "I apologize" to me over 25 times. Finally, I blurt out, "Ma'am, will you quit apologizing? It's clear that you don't mean it and it's just patronizing."

Here's the kicker: I tell her that this is a new 820 machine and it should be under the 1 year IBM warranty. She says, and get this, "YOU DID AN UPGRADE FROM A 720. ONLY THE INSTALLATION PROCESS IS COVERED UNDER WARRANTY. ONCE THE CEs LEAVE AFTER THE UPGRADE YOU ARE NO LONGER COVERED. You must be under maintenance." Whew, that was a kick in the face. In the past that's never been true. In fact, we budget for the free 1 year warranty as part of an upgrade and don't expect to pay maintenance for that first year.

Doreesa says, "We can put you on time and material at $255 per hour with a 2 hour minimum." "Fine," I say, "I'll pay $10,000 right now to get our users back up and running. DO IT." "Ok, Mr. Ackerman, someone from service will call you back within 24 hours." Ok, I'm getting ready to lose it. "24 HOURS! I need help NOW." "I apologize (she said that meaningless phrase again), time and material is a 24 hour call back. However, we do have another option. We can bill you for 1 year's back maintenance, which will bring you up to date on maintenance, and you will still have to pay time and material. Then I can put you directly through to service."

It's been almost 3 hours by now and I'll even succumb to blackmail. "Yes, Doreesa, let's do that. I need help now." I tell Doreesa that my analogy is that I'm bleeding on the side of the road and she's the ambulance driver who won't take me to the hospital because I can't produce an insurance card. But she'll take me to the hospital if I can produce cash. (Sounds like we're in a third world country.) She didn't like that analogy and became very huffy.

I finally get a call back about 20 minutes later. They determine that 10 of my 25 disk drives aren't reporting in. They send out a CE. The CE works on it for about 5-6 hours with Rochester on the line. He replaces just about every part in the expansion tower. Finally, they determine that it's a backplane. One that's only 2 weeks old and, supposedly, not under warranty.

(I can see the dollar signs ringing up with all of these hours and parts.) We IPL about 3 times and it fails. The system still can't see 10 drives in the expansion tower. Rochester instructs the CE to clear the IOP cache. This takes a while and causes 41,000 sectors of hard disk to be cleared. However, after the next IPL we see all of the drives and things are looking rosy.

Not so fast. System still down. It's now about 5pm on Friday night. 8 hours. Many employees sent home. It's month end and much must be entered into the system before we can close the month. They'll come in on Saturday at 6am if we're up.

After all the parts are replaced, my CE leaves and hands me over to software support (who immediately see that we are current on software payments). The software support guy in Rochester starts guiding me through a SLIP install. Looks like OS/400 is hosed. A SLIP install only installs OS/400. No LIC, and no database files. He says we need only restore OS/400 and our data files will probably be fine. "The data gets swapped to disk enough that it's probably OK." I'm feeling pretty good about that. We could be up in an hour or two. (How naive I am!)

Good, now I can actually do some work. I put in the last full backup (from May 17th, when we did the upgrade) and start the process of installing OS/400. It churns for about 35 minutes and at 63% of the install process it crashes. I call Rochester. He says, "Sounds like we need to install from CD." I'm not happy about this, since then I'll have to do PTFs all over again. Ok, I hang up and try again. It fails at 63%. I call Rochester. He has me tweak a few things in DST. I try again from tape. It fails at 63%.

Further into DST, it looks like it's still a hardware issue. The CE is called back. Everything points to a "no longer used" fiber optic adapter. We try to take it out. Can't even IPL then. The CE moves the cards around in the back. The CE leaves. Try to reinstall OS/400. Stops at 63%. Try again from CDs. It stops at 63%.

All told, I made about 11 tries at installing OS/400. I know the SRC codes on the front panel pretty well by now. It's 1:30am and I'm a little weary. I contact the CE and ask him to meet me in the morning so I can go home and take a nap.

I arrive back at 7am. While I'm awaiting the CE I try another install from tape. I figure that maybe, for some silly reason, it'll work this time. Fails at 63%. The CE arrives with a new fiber optic attachment and terminators. We install the new parts. Try installing OS/400. It fails at 63%.

It's close to 11am on Saturday now. We get on the horn with Rochester and get deep into the lower levels of support. And, by 11am on Saturday, 26 hours after the failure, it's immediately clear to this support person that we need to do a "scratch" install. Clear all hard disks, reformat them, and install everything including LIC, OS/400 AND all of our data. I know there is a long process ahead of me. Why did it take 26 hours to get the guy who knew this immediately? I could have been 14 hours into this scratch process had this guy been involved earlier. Actually, maybe 17 hours into the process had Doreesa not been the gatekeeper!

The rest is pretty boring installation details. It takes until 6am on Sunday morning before I can go home again, confident that users can use the system. I'm fortunate to have a diligent, hard working staff to help me in this process.

What will we change in our backup strategy as a result of what we learned?

- Our current backup process is to do a full backup of data files on Saturday, then save incremental "changed only" objects during the week. This caused a number of headaches during the restore, including a lot of extra restore time. That will change. We will now do a full backup daily. We use save while active anyway, so the users are only down 15 minutes for backup, full or incremental.
- I will be diligent about having copies of paid maintenance invoices in the computer room.
- I will rely less upon IBM. Their service has deteriorated a great deal since my last emergency 10-12 years ago. Their service is no longer the darling of the industry. Once upon a time, IBM's service was the differentiating factor when choosing a vendor for new systems. I no longer feel that way.
- I need to be more forceful about moving up the support levels faster when we aren't making enough progress. This could have cut about 14 hours out of our downtime.
- I can no longer state: "I've never had an AS/400 crash." It does crash, and it hurts big time when it does.

chuck
Opinions expressed are not necessarily those of my employer.
