Managing tons of data

 



By Barry Nance

How much data do large corporations manage? Tons of it.

Referring to "tons" of data may be intuitive for paper records, but it's an unusual way to describe computer-stored information, which is usually measured in character counts and file sizes. Still, thinking in tons gives an added sense of just how much data a terabyte is. To be sure, measuring data by the ton isn't definitive, because a disk drive's weight doesn't vary significantly across a wide range of storage capacities, but it's a handy starting point.

A common 8GB hard drive weighs a little more than 1 lb. Figure that a shared enclosure, power supply and electronics roughly double each drive's weight, and the thousand or so drives needed to hold 8TB come to about 2,000 lb. In other words, 8TB of data is approximately equivalent to 1 ton. That much storage is cumbersome and ungainly.
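For readers who want to check the arithmetic, here's a minimal sketch, in Python, of the rough conversion this article uses. The figures (8GB per drive, about 2 lb. per drive once the enclosure is counted, 2,000 lb. to the ton) are the article's own back-of-the-envelope numbers, not precise measurements.

```python
GB_PER_DRIVE = 8        # a common drive size
LBS_PER_DRIVE = 2.0     # ~1 lb. drive, roughly doubled for enclosure, power and electronics
LBS_PER_TON = 2000      # U.S. short ton

def tons_of_data(terabytes: float) -> float:
    """Rough weight, in tons, of the disk hardware needed to hold the data."""
    drives = terabytes * 1000 / GB_PER_DRIVE   # treating 1TB as 1,000GB
    return drives * LBS_PER_DRIVE / LBS_PER_TON

print(tons_of_data(8))      # ~1 ton per 8TB
print(tons_of_data(174.6))  # Aetna's 174.6TB works out to ~21.8 tons
print(tons_of_data(300))    # Atos Origin's 300TB works out to ~37.5 tons
```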

How does an enterprise deal gracefully and effectively with such unwieldy mountains of information? We asked four data-intensive companies - Aetna Inc., The Boeing Co., Atos Origin and AT&T Corp. - to tell us about the problems they faced in managing massive data stores, and how they solved them. For each company, the data is a significant corporate asset resulting from huge investments of time and effort. The data is also the source of many trials and tribulations for the employees who keep vigilant watch over it.

While these companies say that good tools are important for managing terabytes of information, their IT and database administrators also agree that having a clear and comprehensive perspective on the data, via both logical and physical views, is even more critical. Security, data integrity and data availability aren't trivial concerns, they point out, and giving users easy access to the data is a never-ending job.





Photo: Aetna employees Michael Mathias, Nancy Tillberg and Renee Zaugg (left to right)

Insuring a healthy 21.8 tons
On a daily basis, Renee Zaugg, operations manager in the operational services central support area at Aetna, is responsible for 21.8 tons of data (174.6TB). She says 119.2TB reside on mainframe-connected disk drives, while the remaining 55.4TB sit on disks attached to midrange computers running IBM's AIX or Sun Microsystems Inc.'s Solaris. Almost all of this data is located in the company's headquarters in Hartford, Conn. Most of the information is in relational databases, handled by IBM's DB2 Universal Database (Versions 6 and 7 for OS/390), DB2 for AIX, Oracle8 on Solaris and Sybase Inc.'s Adaptive Server 12 on Solaris. To make matters even more interesting, Zaugg adds, outside customers have access to about 20TB of the information. Four interconnected data centers containing 14 mainframes and more than 1,000 midrange servers process the data. It takes more than 4,100 direct-access storage devices to hold Aetna's key databases.

Most of Aetna's ever-growing mountain of data is health care information. The insurance company maintains records for both health maintenance organization participants and customers covered by insurance policies. Aetna has detailed records of providers, such as doctors, hospitals, dentists and pharmacies, and it keeps track of all the claims it has processed. Some of Aetna's larger customers send tapes containing insured employee data, but Nancy Tillberg, head of strategic planning, says the firm is moving toward using the Internet to collect such data.

"Data integrity, backup, security and availability are our biggest concerns," Zaugg says. Her data handling tools, procedures and operations schedules have to stay ahead of not only the normal growth that results from the activities of the sales, underwriting and claims departments but also growth from corporate acquisitions and mergers.

Like Atos Origin and Boeing, Aetna uses IBM's Virtual Tape Servers (VTS) to reduce its tape drive bottleneck. Zaugg says Aetna has used VTS to shrink its tape library from almost 1 million volumes to just under a quarter of that amount. She emphasizes that the major impetus for the consolidation was the time required for tape processing and handling, not the cost of tapes.
 


Since DB2 V6 doesn't support hot backups, the operations area has to take the DB2 V6 systems off-line to make backup copies. VTS lets Aetna drastically cut the time it takes to back up the DB2 V6 and other data, which increases the time the data is available to users. "Aetna's goal is to soon have hands-off tape operations on its mainframe computers," says Tillberg.

She adds that Aetna has a server consolidation effort under way to reduce the effort necessary to manage data on the midrange machines. "Nonetheless," she says, "the need for server load balancing won't go away soon." For its Web servers, Aetna uses Sunnyvale, Calif.-based Resonate Inc.'s Global Dispatch to distribute HTTP traffic to the nearest available server that's least busy. Tillberg says she likes the way Global Dispatch manages mirrored Web servers located not only in the same room but also in geographically dispersed locations.
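The article doesn't describe Global Dispatch's actual selection logic, so the following Python sketch is only a hypothetical illustration of the general idea Tillberg describes: send each request to the nearest healthy mirror, and break ties by current load. The Mirror class, server names and metrics are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Mirror:
    name: str
    healthy: bool          # did the last health check succeed?
    latency_ms: float      # proxy for "nearest"
    active_requests: int   # proxy for "least busy"

def pick_mirror(mirrors: list[Mirror]) -> Mirror:
    """Pick the nearest available mirror; among equally near mirrors, the least busy."""
    candidates = [m for m in mirrors if m.healthy]
    if not candidates:
        raise RuntimeError("no healthy mirrors available")
    return min(candidates, key=lambda m: (m.latency_ms, m.active_requests))

# Two mirrors in the same room and one geographically dispersed standby.
mirrors = [
    Mirror("web-local-1", True, 5.0, 120),
    Mirror("web-local-2", True, 5.0, 40),
    Mirror("web-remote", True, 60.0, 3),
]
print(pick_mirror(mirrors).name)   # -> web-local-2
```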

Tillberg also says the company is increasing its use of storage-area network (SAN) technology to centralize and streamline the management of that data. She points out that Aetna uses Global Enterprise Management software from Tivoli Systems Inc. in Austin, Texas, to monitor the network, distribute files and track data usage.

Aetna's database administrators maintain the more than 15,000 database table definitions with the ERWin data modeling tool, according to Michael Mathias, an information systems data storage expert at Aetna. Manual upkeep of the table definitions became impossible years ago, he says. Mathias sees the importance of viewing the maintenance of large amounts of data from a logical perspective. While the physical management of large data stores is certainly a nontrivial effort, Mathias says that failing to keep the data organized leads inexorably to user workflow problems, devaluation of the data as a corporate asset and, eventually, customer complaints.


Tons of flying data
LeaAnne Armstrong, director of distributed servers at Seattle-based Boeing, makes sure the approximately 50TB to 150TB (6 to almost 19 tons) of data the company owns remains as reliable and safe as the aircraft and spacecraft the company builds. She says the 50TB to 150TB estimate reflects Boeing's inability to know exactly how much data exists on its 150,000 desktop computers. Users don't necessarily store their data files on a server, which makes quantifying Boeing's data stores difficult, she says.

Like Aetna, Boeing has tens of mainframes and thousands of midrange servers running Unix and Windows NT. "Much of the data exists in relational form," Armstrong says, "but across the enterprise, Boeing uses virtually every file format known to man." According to Armstrong, Boeing's files run the gamut from Adobe Portable Document Format to computer-aided design and manufacturing machine and part descriptions. The relational databases are DB2 on the mainframe, Oracle on the Unix (HP-UX, AIX and Solaris) midrange machines and either SQL Server 7 or SQL Server 2000 on the smaller Intel-based Windows NT computers.

For Boeing's diverse terabytes, Armstrong shares Zaugg's basic concerns: data integrity, backup, security and availability. The two companies have similar philosophies and approaches to handling large amounts of data. Like Aetna, Boeing uses IBM's VTS to cache and manage its mainframe tapes and tape devices. Boeing plans to use SAN technology in the near future and to consolidate midrange servers rather than let them proliferate.

Armstrong also says effective use of virtual tape or any other hierarchical storage management (HSM) scheme depends on identifying the categories of data within the enterprise and treating each category appropriately. For example, she warns, Boeing makes a subtle but important distinction between backup tapes of transactional content vs. archive tapes of static aircraft design and manufacturing files. She says data must be classified carefully to get the most value from virtual tape. 

Boeing's data stores are spread out across 27 states and a few overseas locations, but most computing takes place in the Puget Sound area of Washington.

Armstrong says the company currently has dozens of different backup and restore software utilities. Each department buys its own backup media and performs its own backup and restore operations. A major data loss hasn't happened yet, says Armstrong, but she's aware of the risks and plans to centralize the backing up and restoring of files in the future.

Armstrong says she hopes the hard disk, optical disk and tape drive manufacturers will eventually offer Boeing vendor-neutral and highly interoperable data storage. Furthermore, although hard disks are inexpensive these days, Armstrong says data management costs on a per-disk or per-tape basis are high enough that she wants to significantly reduce the amount of disk and tape "white space" - the portion of the media that Boeing doesn't use.

Virtual tape technology helps, she says, but she wishes that all Boeing's tapes and disks were based on a "storage-on-demand" model, whereby Boeing could simply rent whatever capacity it needed from an outside vendor and not have to worry about running out of space.

Tons of phone calls
Mark Francis, enterprise architecture director at AT&T, manages several terabytes of information. One of his biggest data stores is a multiterabyte mainframe DB2 database containing phone-call detail. When an AT&T customer makes or attempts to make a call, the switching equipment automatically inserts a new row in the huge database.

For Francis, however, the company's new 650GB operational database of customer data, work orders and billing data is more interesting. He says the company is merging diverse databases of various kinds of customer data into a single, cohesive and consistent database. The project is well under way. "The goal is for everyone within AT&T to have one place to go to get any and all customer data," says Francis.

In the past, IMS-DBDC, DB2, Oracle and Informix Corp. systems were all used to control access to parts of the data, but Francis and his group have chosen Oracle to be the single repository for the new consolidated customer database.

Mirrored across two data centers located in Georgia and Missouri, the new customer data store resides on Sun Ultra 10000 computers. Sun Ultra 5500 computers perform data backup chores, and the two data centers are optically linked to allow fast fail-over among the machines should disaster strike.


Francis says the company allots Sundays for doing full backups and performing software maintenance. AT&T uses Veritas Software Corp.'s NetBackup to make copies of the customer database. While backing up Oracle redo logs provides an ongoing incremental copy of the data, Francis says the process is time-consuming, and he wishes it weren't such a bottleneck.

Francis schedules periodic fire drills to ensure that each of the two data centers can fail-over quickly and painlessly. He points out that managing large data stores across multiple data centers means more than just monitoring hard-disk devices. "At fail-over time, an entire data center - computers, storage, computing infrastructure and network connections - must pick up the workload without skipping a beat."

To handle large data stores efficiently, Francis suggests, "Don't underestimate the time it takes to get the data model - i.e., the schema - and the operational environment correct." Like Aetna's Mathias, Francis stresses the importance of an accurate and well-organized logical view of large data stores.




Terabytes that follow the sun
Mark Eimer, director of global automation tooling at Atos Origin, is responsible for about 300TB (37.5 tons) of other people's data. The majority of the data is relational, but, like Boeing, the Paris-based company stores thousands of different file formats. Atos Origin provides outsourcing services and data operations for other companies. Eimer says one Atos Origin customer is itself an enterprise with 130,000 employees. These users access several terabytes of Lotus Notes data on 600 servers.

Atos Origin manages 22 global data centers for hundreds of outsourcing clients with many hundreds of thousands of users in 31 countries. The company's data centers, which are located primarily in Dallas, Singapore and the Netherlands, house 60 mainframes and about 5,000 midrange servers.

Sixty percent of these midrange machines run Unix (AIX, Sun Solaris, Hewlett-Packard Co.'s HP-UX and DEC Unix) and IBM's OS/400, while the remainder run either Microsoft Corp.'s Windows NT or Novell Inc.'s NetWare. Most of the machines are application-specific processors for enterprise resource planning and other vertical market systems.


Atos Origin employs 27,000 people, but its mainframe and midrange processors are managed by a comparatively small team of 60-plus highly skilled operations people at seven sites around the world. Eimer says his organization has the skills and experience to handle large data stores because Atos Origin works hard to ensure a consistent, standard computing environment. "We adhere to stringent standards that we created for ourselves for how we run our servers," remarks Eimer.

For backing up and restoring huge amounts of data, Atos Origin uses Tivoli's Storage Manager as well as Legato Systems Inc.'s products, Computer Associates International Inc.'s ARCserve and HP's OmniBack. Eimer says that while Atos Origin will use whatever software tools a customer wants, it prefers Tivoli's Enterprise suite for managing multiple mainframe and midrange computers and IBM's VTS for mainframe-attached tape drives.

Because Atos Origin is an around-the-clock, follow-the-sun processing environment spread around the globe, the company's network is especially critical. Eimer says he uses IBM's NetView and NetScout Manager Plus to help keep the servers' network connections healthy.


Minding the store
If managing gigabytes of data is like flying a hang glider, managing multiple terabytes of data is like piloting a space shuttle: a thousand times more complex. You can't just extrapolate from experiences with small and medium-size data stores to understand how to successfully manage tons of data. Even an otherwise mundane operation such as backing up a database can be frightening if the time needed to finish copying the data exceeds the time available.

Data integrity, backup, security and availability are collectively the Holy Grail of dealing with large data stores. The sheer volume of data makes these goals a challenge, and a highly decentralized environment complicates matters even more. Developing and adhering to standardized data maintenance procedures in your organization will not only give you the best return on your data dollar investment, but also let you sleep well at night.


Multiple terabytes of the most pampered, best-maintained data in the world are just a slag heap of bits without accurate, meaningful data definitions and schemas.

When you analyze your company's operating procedures for administering large data stores, make sure you incorporate the definitions of that information (such as ERWin or PowerDesigner Data Architect model files) in your plan.

Together, the data and its definitions are a corporate asset that contributes to your company's bottom line, and that you likely couldn't do without.


Tips for managing large data stores
Be selective in how you implement HSM. Instead of blindly giving all your data to a robotic HSM process, analyze and classify your company's data usage to know how often the data is reused and thus when HSM might be appropriate.

The logical view of your data is just as important as the physical view. Knowing which data elements are duplicated in your database and why tells you not only the degree of normalization but also what fraction of the database is involved in purely redundant I/O.

Perform data backup/restore fire drills periodically and religiously to make sure you don't lose lots of data to human error or natural disaster.


Recognize that you may have to develop your own transaction-aware backup software - especially if you have a growing database and your relational database engine doesn't support hot backups. It's not funny when you run out of time for making off-line backup copies.

Carefully segregate externally visible data from your internal data, for security purposes. An ounce of prevention is worth a ton of cure.


Where is HSM today?
At first glance, the idea of storing old, infrequently-used files in a server-accessible but relatively slow optical media jukebox or tape cartridge magazine seems like a good one. The high capacity and low price of hard disks notwithstanding, why leave files that haven’t been accessed in months - or years - on a file server? Hierarchical Storage Management (HSM), a software technology that migrates older files to and from secondary storage, addresses exactly this question.

Available on mainframes for many years and introduced to the PC world about 1993, HSM never caught on. You might wonder why, especially since economics seem to favor HSM. While the cost of hard disks has dropped to about ten cents per megabyte, the cost of managing LAN data has, ironically, risen. Studies report the intangible cost of LAN data management is close to $8/megabyte/year. Even if you discount so-called intangible costs and rely only on hard figures, the out-of-pocket cost of adding storage to a LAN can mean paying for a file server computer, the server software (network operating system), a backup device for the server, and other necessary components. These costs dwarf the price of the hard disk itself.

However, the plummeting price of hard disk space isn’t responsible for HSM’s fade into obscurity. Storage costs five or ten years ago, tangible or intangible, were already low enough to make computers an inexpensive means for holding corporate information. People’s work habits, application job scheduling requirements, excessive administrative burdens and HSM’s monogamous marriage to file server technology did HSM in.

“People’s work habits” is a euphemism for the impatience users express when the computer makes them wait for their data. Imagine telling someone who has been waiting several minutes for a file to load, “Well, it’s slow because you haven’t accessed that file in over six months.” You won’t get much sympathy. By the same token, try visualizing an entire department of people agreeing on exactly which files - of the thousands on a server - can be marked as candidates for HSM.

In a PC-based production environment of client/server applications, some of those computer programs need to run on a scheduled basis. Job scheduling requirements mean that monthly, quarterly and annual runs need immediate access to data files. If end-of-year processing occurs on a weekend early in January and absolutely must finish before Monday morning arrives, incurring HSM-imposed delays may be intolerable.


HSM works with file servers but not other types of server technologies, such as relational database servers. This marriage of HSM to file servers is far too limiting. You don’t want HSM slowing you down when new storage technologies (either software methodologies or hardware media) come along.

Moreover, HSM also requires an inordinate amount of ongoing administration, and delays will occur if that administration isn’t perfect. Network administrators don’t have the time to keep track of exactly which files and directories are candidates for secondary storage, and users don’t always put files in designated directories.


Is HSM never useful, then? Never say never - some applications, notably CAD/CAM, medical imaging and document management, lend themselves to HSM. However, mentioning these candidates for HSM doesn’t validate HSM so much as it points out HSM’s major fallacy. The time since a file was last accessed may or may not be a criterion for moving that file to secondary storage. The real criteria are application-specific, and few applications are HSM-aware. An accounting application may have some files that are migration candidates after only a week or two, while other accounting files must be immediately available even after a year. Only the application knows for sure, but the application isn’t in control of which files migrate.
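To make the distinction concrete, here is a minimal Python sketch contrasting the brute-force, age-based rule with the kind of application-supplied policy the preceding paragraph argues for. The categories, thresholds and policy table are hypothetical; they merely illustrate that an application can declare some files migratable after a couple of weeks and others off-limits regardless of age.

```python
import time

SECONDS_PER_DAY = 86_400

def age_based_candidate(last_access: float, max_idle_days: int = 180) -> bool:
    """Naive HSM rule: migrate anything untouched for six months."""
    return (time.time() - last_access) > max_idle_days * SECONDS_PER_DAY

# Hypothetical application-supplied policy: per-category idle thresholds,
# plus categories the application must always keep on primary storage.
MIGRATION_POLICY = {
    "weekly_ledger": 14,        # migratable after only a couple of weeks
    "year_end_summary": None,   # never migrate; needed instantly at annual close
    "scanned_invoice": 90,
}

def policy_based_candidate(category: str, last_access: float) -> bool:
    """Application-aware rule: the application, not file age alone, decides."""
    max_idle_days = MIGRATION_POLICY.get(category)
    if max_idle_days is None:   # unknown or pinned category: keep it online
        return False
    return (time.time() - last_access) > max_idle_days * SECONDS_PER_DAY
```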

Anders Lofgren, Senior Industry Analyst for the Giga Information Group, characterizes HSM as a niche technology that’s found some usefulness in the mainframe world but never took off in the distributed environment of PCs. “Client/server applications and multiple operating system environments are just too complex for HSM to handle well,” he said.

Paul Mason, Vice President of Infrastructure Software Research at IDC, agrees. “People see HSM as simply too complicated. It’s easier to just buy more disk space, and the setup and configuration of HSM is just not worth the trouble.” Until data can “manage itself,” he says HSM will find limited use.

Despite these drawbacks and HSM’s poor reputation, where do you go if you really think HSM will make your file server lean and mean? Several backup software vendors have over the years developed HSM software, and they continue to include HSM capabilities alongside their meat-and-potatoes backup/restore utilities. Some of these vendors thought HSM might someday become popular. Others developed HSM software because they encountered we-might-need-it-in-the-future-but-let’s-not-think-too-hard-about-it-today customers who included HSM on their shopping lists. Optical storage vendors such as Hewlett-Packard and ADIC also support HSM with their jukebox disk devices and tape magazine handlers.

Legato’s Networker product includes some HSM features. Veritas HSM is a product that should be on your short list if you really insist on installing HSM at your site. IBM’s ADSM is somewhat HSM-aware, and both Computer Associates (ARCserve and HSM for NetWare) and Seagate Software (Avail HSM, Storage Manager and Backup Exec) offer HSM. Be aware, however, that Avail HSM has significant shortcomings, and Seagate Software is currently migrating its Avail HSM customers to other products.

HSM’s future is far from bright. While it’s possible a space-conscious developer is designing some smart HSM features into a vertical business application, in which the application determines secondary storage candidacy, the brute force approach of migrating files based simply on age doesn’t work for most people.

HSM technology
HSM moves older, infrequently used files from primary storage (the file server's hard disks) to secondary storage (optical read/write media and magnetic tape).

HSM provides online storage of frequently-used files and nearline storage of migrated files. HSM automatically and almost transparently moves files to and from nearline storage as it extends the storage capacity of file servers.

A person at a PC who accesses a migrated file incurs a delay of anywhere from a few seconds to several minutes while the HSM software demigrates the file, and the person can optionally see a demigration notification message on the workstation screen while the operation takes place.


Research firm Peripheral Strategies (Santa Barbara, Calif.) categorizes five levels of HSM:



Level 1 is simple automatic migration with transparent retrieval.

Level 2 adds real-time, dynamic load balancing of free disk space based on predefined thresholds. Level 2 products can also manage two or more layers of nearline storage (an optical jukebox and a magnetic tape library, for instance).

Level 3 provides for management of three or more layers of storage hierarchy and dynamically balances the consumption of available space in each layer.

Level 4 products can migrate files based on data type and other criteria, through the use of policies (rules). Products that conform to Level 4 preserve ownership, attribute and location information about files, thus allowing multiplatform (Windows, DOS, Unix, Macintosh and OS/2) HSM.

Level 5 identifies HSM products that can work with database manager software, such as DB2, SQL Server or Oracle, to migrate portions of a database (rather than an entire file) to and from secondary storage.

Successful implementation of Level 5 HSM hasn’t happened yet.

An early effort
Some years ago, Lotus Development, in conjunction with Kodak, offered the Lotus Notes: Document Imaging product (LN:DI, commonly pronounced "Lindy"). LN:DI was a set of client/server tools for managing image files, which are often good candidates for HSM. LN:DI included Windows client software that performed basic imaging functions, such as scanning documents, compressing and decompressing files, and zooming, panning and rotating images.


The server component, which ran on an OS/2-based PC, managed an image database with integrated HSM. Before LN:DI, Lotus Notes treated document images like any other type of file and replicated those files indiscriminately, which could bring a WAN to its knees.

With LN:DI, on the other hand, Notes stored images centrally and referenced the files via 100-byte pointers in distributed Notes databases.


Copyright 2012 Network Testing Labs


  