COVER STORY: DISASTER RECOVERY

Your computer systems can be the lifeblood of your company.
Here's how to make sure you'll be able to recoup your business if...

A River Runs Through IT

By Lucie Juneau Patrowicz



ROI box

Companies lose vital computing resources for many reasons. A power failure, flood or virus might bring your business to its knees unless you're prepared. Readers will learn the essentials of disaster planning, including

+ Prioritizing and analyzing the company's needs

+ Data backup methods

+ Alternative computing arrangements

 
STARTING TUESDAY EVENING, JULY 22, 1997, a tropical storm resulting from Hurricane Danny dropped at least 15 inches of rain on Albemarle, N.C., a small town about 40 miles east of Charlotte. After the storm came ashore on Alabama's Gulf coast, the deluge swelled inland waterways and flooded a creek abutting the facilities of Allison Manufacturing, a $70 million, 1,000-employee apparel maker.
    For the last five years, the company's IBM AS/400 has been its central nervous system. It ties into Allison's financial functions, communicates with the EDI-based customer order system and inventory, and feeds manufacturing information to the company's divisions. The information determines which plants fulfill which orders, how much yarn to purchase, when orders are cut and sewn, when artwork is scheduled and the status of inventory. In addition to its manufacturing and administrative hub in Albemarle, Allison has five manufacturing plants and a large distribution center in Texas, another manufacturing facility in Virginia, and corporate offices and a marketing group in New York City.
    Several staff members, including Systems Support Manager Larry Wallingford, were in Allison's Albemarle building investigating a power problem the night the rain began. "A drain plug had popped loose, and water was all over the floor," Wallingford says. The cable running under the floor of a recently installed AS/400 system had gotten wet, causing an electrical short.
    At about 2 a.m., Wallingford checked the water level in the creek before restarting the system. "The water was getting close to the top of the dock, so we decided we'd better not start up," Wallingford says. He feared the transformer located near the creek might blow while the AS/400 was running, creating even bigger problems for the IS team.
    Noticing also that rainwater had begun seeping into the building, Wallingford and his co-workers decided to lift expensive IS equipment off the floor before leaving for home. But as they hoisted their whole imaging system, including a Data Archives and Retrieval System (DARS) and IBM Corp. ImagePlus, onto a cutting table in the art department, they heard a roll-up door in the docking area suddenly give way. Quickly, they ran for the main exit.
    "When we got there, we saw four feet of water rushing by outside," Wallingford says. The parking lot was flooded in about 15 minutes. Wallingford speculates that a blockage upstream had suddenly broken loose, flooding the creek and neighboring areas.
    The only way out of Allison's building was up: to the roof, via an inside ladder. "As we started up the ladder, the water was up to my chin," says the 6-foot-1-inch Wallingford. By that point, a lightning storm was raging, and evacuees took refuge in a rooftop shed. "We huddled in it until the storm subsided," he says. During the night, an Allison trailer bed and a couple of cars were swept downstream. By morning, rescuers arrived by boat to pluck Wallingford and the others safely off the roof.
    The company's IS vice president, Glenn Wood, had left the building before the flood and expected to find nothing worse than some puddling around the building's doors when he returned the next morning. Instead, he discovered that chairs, desks, PCs and file cabinets had been swept up in the tidal wave. "Four chairs from my office were washed out of the room, down the hall, through the computer room and into a programmer's office," Wood says.
    The company's newly upgraded AS/400 and its new uninterruptible power supply (UPS) system were destroyed, along with its supporting IBM Twinax and token ring network and a new IBM electronic imaging system. Gone too were three other LANs and a large Ethernet network of Apple systems used to create artwork on Allison's garments. And despite the efforts of Wallingford and his colleagues, the whole imaging system was lost. Of all the computer equipment in the building, only one CRT was salvageable.

It so happens that the most common risks to a business are disasters of the man-made variety, resulting from simple mistakes, negligence or sabotage, says Lisa Maio Ross, senior analyst for outsourcing at International Data Corp. (IDC), a Framingham, Mass.-based market research and consulting firm and sister company to CIO Communications Inc. Such disasters include hardware and software malfunctions, power failures, computer viruses and hackers. Those occurrences--along with such natural and political disasters as tornadoes, earthquakes, fires, floods, severe storms or bombings--can bring a business to a standstill.
    System disruptions, however, are not the most common problem, reveals a recent survey commissioned by Comdisco Inc., a Rosemont, Ill.-based vendor of disaster recovery solutions. When they do happen, whether they have serious consequences for a business depends in part on how long the company can afford to operate without its systems. For example, companies relying most heavily on electronic recording of customer transactions can least afford downtime (see "The Cost of Downtime"). Companies in industries that don't depend on continuous electronic transactions won't face ruin if their systems go down briefly; they worry most about lengthy disruptions and disasters capable of destroying their IT systems completely.

The Cost of Downtime

Although 98 percent of CIOs polled believe it is important to have a disaster recovery plan, 25 percent do not have one in place.
SOURCE: RHI consulting

The market for business/disaster recovery services is expected to reach $3 billion by 1999.
SOURCE: Dataquest

Finding it Online

Comdisco Inc
(http://www.comdisco.com/)

Dataquest
(http://www.dataquest.com)

Dow Chemical Co
(http://www.dow.com)

Gartner Group Inc
(http://www.gartner.com)

IBM Corp
(http://www.ibm.com)

International Data Corp
(http://www.idc.com)

Meta Group Inc
(http://www.metagroup.com/)

SunGard Recovery Services Inc
(http://recovery.sungard.com/)

 

    To avoid catastrophic disruptions, a company needs a continuity plan that will both minimize downtime in the event of IT destruction and allow the organization to carry on without computer support for a limited period.
    It's probably no accident that the companies with the best IT recovery plans and testing practices are on the West Coast, says Fred Joy, a senior research analyst at Meta Group Inc., a market research firm in Stamford, Conn. Because California is a land of many natural hazards, people there appreciate the need for disaster planning.
    Albemarle, on the other hand, isn't exactly the epicenter of natural or political instability. Until the recent disaster, Wood's system had been down less than half a day since he installed his first IBM AS/400 five years ago. In fact, if Hurricane Hugo hadn't torn through the Carolinas in September 1989, Allison's business, like its flood waters, might have receded from memory after the July disaster. Destruction in the Albemarle area during Hugo was heavy, and storm damage interrupted vital services. For almost a week following the hurricane, the town had no water and power outages were frequent. "We lost two disk drives at once, and I had to recover all my data," Wood says. "Twice."
    But Allison didn't suffer any serious business loss after Hugo. "We were really lucky, and we knew it," says Wood.
    Unfortunately, most companies will postpone contingency planning until after disaster strikes, experts say. "It's difficult for companies to justify the expense because they can't immediately see the ROI," says IDC's Ross. But when a disaster hits home, planning for the possibility of IT destruction suddenly seems anything but wasteful. Because of Allison's experience with Hugo, the company was better prepared to deal with the July catastrophe.
    After managers recognize the need for planning, they should begin to develop a detailed IT recovery plan based on analysis that determines what computer resources are available, what's running on them and how critical they are, Joy says.
    The business impact analysis should reveal a company's most critical exposures, says Chuck Hannah, a partner at Hannah-Watrous Continuity Strategies LLC, a consulting firm in Hartford, Conn. Companies need to identify which computer operations must be kept up and running in the event of a disaster and develop a plan for alternatives for computer operations and recovery procedures. "It's a matter of prioritizing," Hannah says. Which business functions are most important and need to be protected first? Which functions could the business survive without for an extended period?
    Within a few months after Hugo, Wood initiated an evaluation of Allison's business needs. He assigned a staff member to coordinate the development of a recovery plan and met with him regularly for two years to work out the details. Before they could consider how they'd address a possible disaster, they had to think about what the disaster might look like. "We had a hard time wrapping our arms around that one," Wood says.
    The Dow Chemical Co. in Midland, Mich., is one company with well-established recovery procedures, but getting to that point took time. "We originally started doing disaster recovery more than 20 years ago," says Dave Butler, director of global information systems at Dow. He adds that the emphasis and scope of recovery planning are changing. "The concept has evolved from disaster recovery to business continuity," says Butler.
    "Within each function, for each business, we examine key processes and make a determination of how quickly we would have to be fully recovered," he says. At Dow, customer service functions typically get top priority. "Logistics, order processing and shipping need to be recovered within 24 hours," he says. "But in the event of disaster, accounting says they can be [online] last."
    It's a cost-risk assessment. "If you come up with a single answer for the whole company, either you're underprepared or you're spending money you don't need to," Butler says. "If you're planning to recover everything in 12 hours, you're simply throwing money away."
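    Butler's point about tiered recovery targets can be sketched in a few lines of Python. This is only an illustration, not Dow's actual method: the 24-hour figure for logistics, order processing and shipping comes from the article, while the one-week figure for accounting and the function names `recovery_order` and `classify` are assumptions made for the example.

```python
# Hypothetical recovery-time targets, in hours, per business function.
# Only the 24-hour targets come from the article; the accounting value
# is an assumed placeholder for "last".
recovery_targets_hours = {
    "logistics": 24,
    "order processing": 24,
    "shipping": 24,
    "accounting": 168,  # assumption: one week
}


def recovery_order(targets):
    """Return function names sorted by target, most urgent first."""
    return sorted(targets, key=lambda name: (targets[name], name))


def classify(targets, blanket_hours):
    """Compare one company-wide target against each function's own need."""
    result = {}
    for name, needed in targets.items():
        if blanket_hours < needed:
            result[name] = "overspending"   # recovered sooner than required
        elif blanket_hours > needed:
            result[name] = "underprepared"  # recovered later than required
        else:
            result[name] = "matched"
    return result


order = recovery_order(recovery_targets_hours)
blanket_12 = classify(recovery_targets_hours, 12)
blanket_48 = classify(recovery_targets_hours, 48)
```

    With these assumed numbers, a blanket 12-hour target overspends on every function, while a blanket 48-hour target leaves the customer-facing functions underprepared.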
    While it makes sense to spend time planning for the most likely event, a recovery plan must be flexible. "It's the unusual problem that's going to burn you," Wood says. Since they couldn't write detailed procedures for every possible emergency, Wood and his colleagues set up a process to help management evaluate a crisis and take appropriate steps. "First, we needed to try to decide who would be making decisions," Wood says. That choice was based on the severity and breadth of the catastrophe: IS staff could handle a problem affecting the data center alone; a team of company VPs would be responsible for coordinating a response to a major disaster.
    Recovery plans like Wood's typically include details such as the names and phone numbers of staff members, information about the company's IT products, supplier contacts and phone numbers and the system's configuration.
    However, it's not unusual to neglect one or more pieces of the planning puzzle. About 85 percent of Fortune 1000 companies have disaster recovery plans, says Donna Scott, research director at the Stamford, Conn.-based Gartner Group Inc., which provides IT research and advisory services. But 80 percent of those plans cover mainly data center resources. Only about 50 percent cover networks, while fewer than 35 percent would protect the data on PC LANs, she says. Far fewer appear to have any means of recovering their Internet work, despite a growing reliance on that resource.
    But companies are becoming increasingly aware of the need to safeguard telecommunications services. Vendors are building dedicated networks strictly for disaster preparedness, according to Ellen Carney, director of network integration and support services at Dataquest, a market research firm in Westborough, Mass. That means that client companies sharing these resources don't have to depend on local phone companies for bandwidth in an emergency.
    To cover its communications path to Japan, Dow Chemical put in its own redundant network, establishing a path from the United States to Australia to Hong Kong to Japan. If there's a break in the path from the United States to Japan, the network will route traffic through Australia, Butler says.
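    The failover behavior Butler describes can be illustrated with a small route search. This is a sketch under stated assumptions, not a description of Dow's network equipment: the link list is taken from the paragraph above, treating each link as bidirectional is an assumption, and the function name `find_route` is invented for the example.

```python
from collections import deque

# Links as described in the article; bidirectional use is an assumption.
LINKS = [
    ("United States", "Japan"),
    ("United States", "Australia"),
    ("Australia", "Hong Kong"),
    ("Hong Kong", "Japan"),
]


def find_route(source, destination, links, broken=frozenset()):
    """Breadth-first search for the shortest route that avoids broken links."""
    neighbors = {}
    for a, b in links:
        if (a, b) in broken or (b, a) in broken:
            continue  # skip links that are currently down
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == destination:
            return path
        for nxt in neighbors.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # no surviving route


normal = find_route("United States", "Japan", LINKS)
failover = find_route(
    "United States", "Japan", LINKS, broken={("United States", "Japan")}
)
```

    When the direct link is marked broken, the search falls back to the longer path through Australia and Hong Kong.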
    In addition, most large, best-practice companies have budgets for disaster recovery. "In large companies, [the recovery budget] usually runs between 2 percent and 4 percent of total [IS] costs," Butler says. Such companies also rely on outside expertise. But experts can't do it all. "There's really no substitute for thinking through your business processes and anticipating what you would do in the event of a significant disruption," says Butler.


Fire and Ice

 
Another key to a successful recovery plan is to store copies of critical computer data in another location.
    For decades, security-conscious companies have backed up their data regularly, storing tapes offsite. Large companies have even turned to vendors offering recovery centers equipped with computing resources. But changes in technology are fueling a shift in emphasis. Business units throughout an organization now typically require access to electronic information and resources, often shared over networks. And because electronic transactions and communications take place so quickly, the amount of work and business that can be lost in an hour far exceeds the amount that might have been lost a decade ago, says Tim Ging, vice president and manager for corporate business continuity at Pittsburgh-based PNC Bank, a 25,000-employee operation with branches in eight states.
    Many LAN administrators don't follow the most basic data security procedures, says Bruce Watrous, Hannah's partner at Hannah-Watrous Continuity Strategies. "I've seen LAN servers where they're not sure the backup [procedure] works," he says. IS staff can monitor backup procedures and insist on maintenance contracts that assure the quick replacement of equipment in an emergency. In addition, contracts with recovery vendors can include coverage for LAN resources. It's important to keep up-to-date information on these resources so that it will be available when needed.
    It's also important during the planning process to select a strategy for protecting computer operations, Hannah says. Safeguarding data by sending backup media offsite on a daily basis is essential. But IT recovery planning also means ensuring the availability of alternative computing resources if the company's resources aren't available.
    A company's recovery strategy may rely on its own IT resources, on a third-party recovery option--such as a hot site, a mobile data unit or a quick shipment of computing equipment--or on a combination of options. Solutions should cover telecommunications networks, LANs and peripherals.
    A company may maintain redundant systems for its most critical operations or enter into a reciprocity agreement, arranging to share resources with another company or data center within its own organization in the event of a data center disaster. "Different divisions of corporations that have decentralized computing organizations may build disaster recovery planning around having enough excess capacity at different locations to accommodate them," Hannah says. But advances in telecommunications are leading to greater consolidation, and fewer companies are maintaining decentralized systems, he says.
    A popular planning strategy is to contract with a vendor for a recovery site, which is an alternative data center or workspace. Vendors also offer mobile data units or workspaces that can be moved to a site of the company's choice. A third option involves the speedy shipment of equipment to a user location.
    Recovery sites can provide various services. Cold sites are simply prepared environments--wired, air-conditioned and fitted with raised flooring--that can be used to house people or replacement equipment. A hot site is a recovery location that offers computing resources. Ideally, a hot site should be located within 30 miles of a client site so that employees can travel there easily, says Gartner's Scott.
    The three largest disaster recovery vendors--IBM Corp., Comdisco Inc. and SunGard Recovery Services Inc.--maintain hot sites near many North American urban centers as well as overseas, equipped with multivendor resources and telecommunications capabilities. These companies offer workgroup spaces too, either at their hot sites or at separate locations, equipped with office furnishings and machines, PCs and peripherals, as well as call-management systems for users manning call-in centers.
    Dow's call-in center serves a critical business function. "If we were to lose that center, customers wouldn't be able to place orders," Butler says. "[We need] another location with all the equipment--with a phone system, with desks and PCs and so on--so if we lost that customer order center, we could move and be up in 24 hours," he says. Comdisco provides the hot site that serves as Dow's backup call-in facility.
    Some companies contract with vendors for hot sites that include equipment that can be configured quickly to match their systems, but they don't electronically transmit data to these sites. "You can have just the computer sitting someplace where your data is not," Hannah says. In the event of disaster, the company would deliver a backup tape to the site.
    More sophisticated hot site arrangements involve remote electronic vaulting. "You may be doing backups to the hot site," Hannah explains. "[That's] commonly done in high-risk applications." Transferring data electronically reduces recovery time.
    Companies in industries relying heavily on electronic transactions may need to recover data entered up until or very close to the time of system failure. Some companies electronically transport journal or log information to their hot sites at relatively short intervals--say, every few hours. Instead of having to re-create a full day's transactions--as they might if they were submitting backup data electronically at the end of each day--they would, in the worst case, lose only a few hours' worth of transactions. Database mirroring updates database information at the hot site instantaneously, so it is always current and the remote system can take over immediately.
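    The trade-off among these approaches comes down to how much work can be lost in the worst case. The sketch below compares them; the 3-hour journal interval is an assumed stand-in for "every few hours," and the transaction rate is invented for illustration. Neither figure comes from the companies quoted here.

```python
# Worst-case window of lost work, in hours, for each approach.
STRATEGIES_HOURS = {
    "end-of-day electronic backup": 24.0,
    "journal shipping": 3.0,    # assumption: "every few hours"
    "database mirroring": 0.0,  # hot site is updated as transactions occur
}

RATE = 500  # hypothetical transactions entered per hour


def worst_case_loss(interval_hours, transactions_per_hour):
    """Transactions lost if the system fails just before the next transfer."""
    return interval_hours * transactions_per_hour


losses = {
    name: worst_case_loss(hours, RATE)
    for name, hours in STRATEGIES_HOURS.items()
}

for name, lost in losses.items():
    print(f"{name}: up to {lost:.0f} transactions to re-create")
```

    Under these assumptions, shortening the transfer interval from a day to a few hours cuts the worst-case rework in direct proportion, which is why the more frequent methods tend to be reserved for high-risk applications.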
    Many large companies use a combination of strategies and turn to multiple vendors for recovery solutions. "For some of our high-availability systems, we will have redundant systems, so if we lost one, another would be immediately available," PNC Bank's Ging says. "For some systems that do not have to be recovered immediately, we will use hot site vendors. It's more advantageous to use multiple vendors just to make sure we have the level of service we're looking for."
    Finally, building recovery consciousness into day-to-day operations and testing formal plans can assure disaster readiness. Many companies test their plans periodically to make sure they're adequate, staging mock disasters to see how they fare. Best practice companies tend to hold tests twice a year, Meta Group's Joy says. "If companies test less than twice a year, they tend to get bit by the natural volatility of their environments. Changes can accrue that invalidate plans," he says.


 
Tell Me Where It Hurts

The flash flood that hit Allison in July nearly totaled the company's IT systems. In keeping with the company's recovery plan, Wood and two other business-side executives met within hours of the disaster at a vacant Allison warehouse uphill from the flooded building. There, they assessed damages, and Wood determined the best computer recovery option.
    Impressed with the service IBM had provided him during the Hugo cleanup, Wood had chosen IBM Business Recovery Services from a short list of possible vendors. According to the terms of Allison's contract with IBM, Wood could request hot site support, a mobile solution or quick delivery of a new system. That kind of flexibility is atypical, notes Scott. "Normally, you contract for one type of solution because pricing is based on the type and kind of equipment provided," she says. The more options a contract allows, the higher the price.
    Wood opted for a quick shipment to an already established temporary site, and his replacement AS/400, configured to Allison's needs, was on the road later that day. Within 30 hours of Wood's call to IBM, Allison's replacement machine was in place. A day later, the company's data had been recovered, and the AS/400 was ready to roll.
    Having an empty warehouse close by made it easy to find an appropriate spot for the temporary machine and workspace for those who'd been displaced. "It was all part of the plan," Wood says. "The building was considered our backup site."
    One advantage of the quick shipment option is that it typically costs less than going with a hot site. Wood's temporary machine cost Allison $300 a day. Allison had six weeks to replace it with a permanent machine or the company would have incurred steeper costs. Machine costs might have been lower at IBM's hot site, but Wood determined that the cost of employee travel to Charlotte and overnight accommodations would have made that alternative more expensive.
    The company's biggest concern following the disaster was minimizing customer inconvenience. The morning after the flood, Allison called its EDI providers to ensure that messages that didn't get picked up for two or three days would be saved. Allison also called its biggest customers to let them know that it needed a couple of days to acknowledge orders.
    For all their planning, however, the IS team still hit a snag. Since the plan was developed, a new phone system had been installed, and nobody had checked with the vendor to make sure replacement equipment could be brought in quickly. When Wood called after the disaster, he was told it might take two weeks to get parts. After some negotiating, the vendor agreed to deliver the equipment the next day, and Allison didn't have to pay a big penalty for its lack of foresight. "At most, we could have gotten [the system] up half a day faster," Wood says.
    For some businesses, such as Allison, the story ends happily. Allison Manufacturing was able to reconstruct its LANs in three weeks, and cutting, knitting and printing operations were soon back in full swing. Damage to the whole company was estimated at about $10 million. But without proper planning and a strong business recovery system, Allison's survival story could have been a cautionary tale.

Lucie Juneau Patrowicz, a freelance writer based in Salem, Mass., can be reached via e-mail at patrowic@shore.net.



Disaster Plan
Twelve steps to recovery
  1. Enlist the cooperation of upper management
  2. Seek help from qualified experts
  3. Conduct a business impact analysis to identify key business functions and IT resources
  4. Assess the risk of particular disasters based on company profile and location
  5. Devise a detailed, flexible plan that outlines staff responsibilities
  6. Select a company- or vendor-based recovery option (that is, a redundant system, hot site, mobile data unit or quick shipment solution)
  7. Cover all IT resources, including telecommunications networks and LANs
  8. Select IT equipment vendors that can provide prompt service
  9. Maintain updated vendor information
  10. Test your plan at least once a year
  11. Don't underestimate IT needs: Maintain a strong technical support staff and plan to replace lost equipment with more powerful equipment
  12. Structure the workload to address top priorities first

-Source: Research Company of America



Going Mobile
When disaster strikes, mobile units offer a flexible solution because they can be moved as a company's needs change

Following unusually heavy winter snows, the spring thaw of April 1997 sent the Red River in North Dakota over its banks, and businesses and residents in downtown Grand Forks were forced to evacuate.
    Yet even while Community National Bank President Bill Lee was calling Wayne, Pa.-based SunGard Recovery Services Inc., he knew the company wouldn't be using a hot site solution. It was impossible to cross the Red River for a stretch of about 200 miles because bridges were underwater, so driving to Minnesota--where the area's hot sites are typically located--would have been out of the question.
    Long before the flooding, Lee had rejected the hot site alternative because the bank relies on imaging technology to produce customer statements, and many hot sites don't provide imaging, he explains. Instead, Lee called SunGard to declare a disaster on Sunday morning, April 20, and requested a mobile unit. It was shipped from Phoenix at 8:30 that night and arrived about noon on Tuesday on a semitrailer, which was parked behind a temporary site in the town of Larimore, 30 miles west of Grand Forks. There was ample room in the unit to accommodate five data processing professionals, a Unisys A4-311 replacement, the imaging system, a high-speed laser printer and a few file servers, says John Ouradnik, vice president and cashier at the bank.
    On May 19, the bank returned to its home site, installing its mobile unit behind the building there until June 26, when a new Unisys Clearpath machine arrived.
    Mobile solutions provide clients with independence when they need it most, Ouradnik says. "When you have a mobile unit, you can kind of take control again. If nothing else, it's good for the psyche," he says. "It makes life easier when you're dealing with so many other things anyway," he adds.
    "Service providers such as IBM and Wang Laboratories Inc. suggest that a mobile data recovery site works really well for a call center that needs to take calls routed to agents," adds Alice Murphy, former senior industry analyst at Dataquest. Mobile centers can seat up to 50 people and provide the PC and phone connectivity that call centers need when their own centers are damaged.
-L. Juneau Patrowicz

CIO Magazine - April 1, 1998
© 1998 CIO Communications, Inc.


http://www.cio.com/archive/040198_disaster.html