Disparate data can inhibit value creation, but data integration best practices allow organizations to fully leverage the inherent value in their information resources.
Silos of disparate data plague many companies. The reasons for the data diversity vary: organizations may run best-of-breed applications on the most appropriate platform for each product; in the days of client/server computing, the responsibility for choosing applications may have been decentralized, with different departments choosing different platforms; a merged company may use a variety of different applications, keeping separate some of the formerly individual companies' activities and the software that supports those parallel activities, while choosing the best of the predecessor companies' applications to serve those business functions that are merged. And the list goes on.
The reasons for integrating the disparate data are also varied: denizens of the executive suite need to see reports based on data amalgamated from enterprise-wide databases; Sales, marketing, and customer support staffs need a variety of data to facilitate cross-selling, up-selling, and customer service activities; competitive pressures lead to an intense focus on productivity and, consequently, a diligent drive to eliminate duplicate data entry. And again the list goes on.
Even when operational databases are fairly homogenous across the enterprise, an organization may intentionally introduce disparity by creating a separate data warehouse and/or data marts to better serve query, reporting, and analysis needs. Data warehouses or marts are required not because of a shortage of data, but exactly the opposite. Business executives, managers, and analysts need to be able to search the vast haystack of enterprise data to find the one needle that will allow them to fashion value for the organization.
With this need identified, it then becomes necessary to feed the data warehouse/marts from operational databases. This is not a trivial exercise. In addition to possibly using different hardware, operating systems, and DBMS platforms, the data warehouse/marts are typically organized significantly differently than the operation databases so each can best serve its purposes. Among others, the differences might include denormalized data and meaningful labels rather than codes in the data warehouse/marts.
Planning Is Key
Regardless of why you need to integrate disparate data, planning is critical. In fact, the upfront analysis and planning of your integration requirements and processes should consume much more time than the execution of your plans. Before you undertake a data integration effort, consider the following:
•· What information is needed by which systems and individuals?
•· Will you require bi-directional data sharing? That is, will your organization need the ability to apply updates to the same data on one of two or more nodes in the integration topology and then share those updates with the other nodes?
•· When will the data be needed? Is near real-time replication required, or is it good enough to refresh the data nightly or on some less frequent schedule?
•· How much data will be transferred? Where will it be transmitted from and to?
•· Do the data source and target servers have sufficient processing capacity to capture the information and load it on the target?
•· Do you have sufficient storage capacity on the servers?
•· Does the network have enough capacity to transfer the data with adequate speed?
•· If the target server does not currently exist (such as when creating a new data warehouse as opposed to sharing data between existing applications), what are the required target server specifications?
•· What, if any, data transformations are required when moving data from the source to the target? When sharing data between two existing applications, the answer is obvious as it is defined by the source and target applications. On the other hand, new data warehouses and/or marts must be architected from scratch to best serve the purposes to which they will be put.
•· Are there any regulatory issues that must be addressed? In a number of industries, privacy regulations dictate that some data cannot flow freely across the wire. Do you need to encrypt data sent on a local network and/or external network?
•· What are the auditing requirements? Even in the absence of government regulations, auditors may insist on accurate records being kept of all transfers and transformations of certain types of data.
Warehouse Integration: Snapshot or Real-Time?
When integrating operational databases and a data warehouse or mart, a question of the required timeliness of the data arises. Should you integrate data in near real-time using data replication software, or should you use extract, transform, and load (ETL) software to copy data as periodic snapshots, likely transmitted during system slow periods such as at night? When considering this question, you must assess the value of data timeliness, which is not the same for all organizations and circumstances.
Periodic snapshots performed using ETL software can put a heavy strain on operational and data warehouse/mart systems and on the network as a vast amount of data must be transmitted between the source and target servers in a short period. In an organization with well-defined "off hours," this might not be an issue, but when a company operates in all time zones and in a variety of cultures with different definitions of "weekends" or when some of the data going into the data warehouses is fed from 24x7 Web stores, there might be no time of the day when the online response time delays resulting from periodic snapshot copies are acceptable.
In contrast to ETL, a data replicator copies operational data to a data warehouse/mart in real-time by detecting, copying, and transmitting operational database updates as they occur. This adds somewhat to the load on the operational and data warehouse/marts systems and on the network, but by distributing the load throughout the day--coincident with the corresponding operational data updates--replication processing may have no noticeable impact on system response times.
The use of the term "replication" is misleading because the data on the source and target servers may not be true replicas. For example, some subsidiaries may store data using metric measurements, whereas others use imperial measurements, but the head office needs all of the data merged using standardized measurements. Or the French subsidiary may store product names in French while the American subsidiary uses English, and the enterprise-wide data warehouse needs everything in a common language for analysis purposes. In addition, when sharing data among different systems, each application's database may use different data types and field sizes. These differences must be reconciled when the data is transferred.
Because of the "transform" in its name, it is obvious that ETL software can perform the required transformations, but full-featured replication software offers that functionality as well. Products in both these software classes typically include several common pre-coded transformations, such as built-in functions to change data formats, create timestamps, or concatenate one field onto another. ETL and replication tools usually also provide the ability to plug in custom-coded transformations.
Cleanse the Data
One important best practice when merging data from multiple sources is to cleanse the data either before undertaking the integration project or as part of the data sharing processes. Data cleansing includes the following:
•· Eliminate duplicate records. This may mean reconciling records that exist on multiple databases, but it might also involve searching for and eliminating duplicate records that were inadvertently added to a single database.
•· Standardize differences such as "Rd." on one database versus "Road" on another; "Miss" or "Mrs." versus "Ms.;" "Royal Blue" versus "#4169E1;" "080327" versus "03/27/2008" versus "March 27, 2008;" etc.
•· Apply revisions from publicly available address correction databases and other sources to bring contact data up to date.
•· Detect and correct or bypass corrupted records.
The reliability of some data depends on its age. For example, each year, approximately 20 percent of Americans move to a new address. Thus, if there is no incentive for consumers to provide address updates when they move, you would be justified in questioning the validity of addresses that have not been updated for more than five years. Likewise, consumer research data that is a few years old probably won't accurately reflect current business conditions.
Data that is moved into a data warehouse/mart should, therefore, be time-stamped so users can determine if it has passed its "best before" date. A caveat to this is that a timestamp will accurately reflect the age of the data only if the timestamp is created when the source data was updated or if the timestamp is created as part of a near real-time replication process. In contrast, if you use ETL software to take, for example, monthly snapshots of operational databases, a timestamp created by the ETL software may suggest that the data is as much as a month younger than it actually is.
Bi-directional data integration is a special case that necessitates additional considerations. To use a simple example, your marketing system may be independent of your sales system, and they may run on disparate platforms. When you use replication to share the data residing on the two systems, marketers and salespeople may still need to be able to use their own systems to add records for new prospects and update addresses for customers and prospects. Doing so introduces the possibility of "data collisions."
A collision occurs when, for example, two or more new records are assigned the same keys because they were created at the same time at different locations. This type of data collision can be avoided easily by restricting each server to the use of a unique block of keys. For example, server "A" may be allowed to create only even-numbered keys, while server "B" creates only odd-numbered keys.
Another form of collision occurs when the same record is updated simultaneously at different locations. This type of collision is more difficult to deal with, but full-featured data replication software can detect collisions and offer a choice of ways to resolve them, such as using timestamps to honor the last or first update, giving priority to updates entered on a particular server, adhering to some predefined complex logic for determining the winner of a collision, or holding the colliding updates in a queue and flagging them for manual correction.
Data integration requirements vary over time. Source and target data schemas may change with a new release of one of the applications that manages the data. Servers may be moved to a new location. Distributed servers may be consolidated. New data marts may be added. Additional applications may be brought into the data-sharing architecture. And so on. Thus, you need to be able to alter your data integration architecture quickly, easily, and cost-effectively. The best way to do that is to drive the data sharing processes from a repository of metadata rather than through hard-coded program logic.
Controlling data integration through a metadata repository allows you to change the architecture by merely updating the metadata using a simple user interface, without the need to recode program logic. The metadata repository provides the additional benefit of automatically documenting the data integration architecture.
Data integration architectures, along with the metadata repository and program logic that support them, can be complex. Errors are possible. Consequently, it is important to test the integration environment before going live. This can be done by simulating live data and data updates and then running them through test servers arranged in a topology that mimics your production environment, likely using logical partitioning to run the test servers so you can avoid the need to buy additional physical servers.
Even if there are no errors in your data integration metadata, technology being what it is, it is still possible that there could be errors in the integration processes or on the data storage devices and, as a result, the data on a source and/or target server could become corrupted. Therefore, you should regularly perform synchronization checks to ensure that everything remains in order as per your data integration definitions.
Leverage Information Assets
Disparate data is a fact of life for almost all organizations. If left isolated in independent information silos, it can add unnecessary costs by forcing the need for duplicate data entry, creating data inconsistencies as updates are applied to one data silo but omitted on others and making valuable enterprise-wide data reporting and business analysis difficult if not impossible. Fortunately, sophisticated data replication and/or ETL software provides a solution that, with the use of data integration best practices, can significantly leverage the value of an enterprise's information assets.