|Transforming Data in an HA Environment|
|System Administration - High Availability / Disaster Recovery|
|Written by John Gay|
|Tuesday, 09 October 2007 19:00|
When the amount of data was small, it was possible to simply query it in the production systems. But then data volumes grew, turning previously fast, efficient queries into monsters that consumed inordinate system resources, not just because of the quantity of data, but also because it was optimized for operational rather than query and analysis requirements. As data grew and it was needed in a format that afforded easier access, data warehouses were developed. Nightly processes extracted data from production servers, transformed it to the needed format, and then loaded it into the data warehouse using either in-house developed code or a type of software known as Extract, Transform, and Load (ETL).
When looking for a solution to provide real-time replication, obvious considerations come into play. These include the four Cs: cost, complexity, compatibility, and company. For the solution to be a good fit in your environment and provide the greatest value, all four of these issues need to be addressed. They will be discussed briefly below. However, before evaluating those other factors, you must first ensure that the product itself will meet a core set of criteria before you can consider using it to manage your mission-critical data. The following features are a benchmark for a quality replication solution:
At a minimum, every replication solution, regardless of whether it performs bulk extracts or change-based replication, must be capable of capturing the required data on the source system. In the past, and possibly still today, the most common data capture approach was to have the local database administrator hand-code SQL routines that extract, at predefined intervals, the data required by business users. The extracted data was then sent across to the target system, where it could be used for business intelligence purposes.
The simplicity of using a GUI to create the data mappings between the source and target databases and to define the data flow requests significantly lowers the costs of the initial setup, maintenance, and learning curve compared to the hard-coding of separate programs to fulfill each specific data requirement.
Before choosing a solution, be sure that it supports the hardware, operating systems, and databases you currently run, as well as those that figure into your future plans. Because the future usually cannot be accurately foreseen, it is best to be sure the software supports all of the major database vendors. Also make sure the vendor you are considering is stable and invests in the continued development that will be necessary to support future technologies.
The needs of business intelligence are often best served by data that is not in the same format as in the operational databases. For example, in global companies, the same data items may be stored differently in different local databases, such as American versus metric measurements and product names that are in different languages in different countries. All of this data must be homogenized as it is combined into an integrated data warehouse. Therefore, the product must accommodate all such data transformations and any others that will be needed to support business intelligence requirements.
The nature of the business intelligence process is vastly different from other business applications. Operational transactions usually access one row or just a few rows of data in a limited number of tables. Business intelligence and reporting applications, on the other hand, typically need to access all or a large portion of the rows in many joined tables. In addition, online operational applications are typically update-intensive, whereas business intelligence applications usually perform read-only tasks.
Heterogeneous Database Support
Just as the database format and schema might vary between the operational databases and the data warehouse, so might the database engine itself. The best database engine for data warehousing is not always the best engine for operational processing. What's more, your operational systems might not all run on the same database, so the replication solution must be able to transfer data between differing DBMSs.
Finally, the importance of business intelligence databases has become too great to rely solely on manually executed data extraction processes. Doing so imposes an unacceptable burden on the IT department. Instead, the solution must be capable of running without manual intervention, whether that is the periodic running of extract and load operations or the continuous replication of changes.
Cost is a factor in every software acquisition. It may be difficult to assess the value of the features and functions of the various vendors' offerings, but at least the price tags are usually clear. That's not true with in-house developed code. The preliminary cost analysis of this option compared to the price tags of vendors' offerings may make the idea of in-house development seductive, but the analysis is often deceptive.
When evaluated in full, including the lifetime cost of maintaining and supporting the replication software, in-house development typically turns out to be much more costly than a purchased turnkey solution.
Part of the high cost of in-house development derives from the inevitable complexity of the solution. The software has to be capable of transforming data and possibly transferring it between different database schemas, including differing table and column names and even between differently structured tables. Thus, when evaluating software, consider how easily—preferably through an intuitive, point-and-click GUI—it accommodates this complexity.
Most organizations support multiple DBMSs running on a variety of hardware platforms and operating systems for their operational environment. All of these diverse platforms must feed the data warehouse, which may run on yet another platform. What's more, the technologies will undoubtedly continue to evolve. The extensiveness of the software's platform support and the ease with which it can incorporate new platforms into its runtime environment is an important consideration when evaluating replication software.
You will be betting your company's long-term success on the infrastructure supporting your business intelligence systems. Thus, don't look only at the features and functions of the various software offerings in your consideration set. Consider also the companies behind them. Because the solution will be providing the timely data needed to make critical business decisions, make sure that the vendor provides support 24 hours a day, seven days a week.