
Employ Best Practices When Integrating Disparate Data


Disparate data can inhibit value creation, but data integration best practices allow organizations to fully leverage the inherent value in their information resources.

 

Silos of disparate data plague many companies. The reasons for the data diversity vary: organizations may run best-of-breed applications on the most appropriate platform for each product; in the days of client/server computing, the responsibility for choosing applications may have been decentralized, with different departments choosing different platforms; a merged company may use a variety of different applications, keeping separate some of the formerly individual companies' activities and the software that supports those parallel activities, while choosing the best of the predecessor companies' applications to serve those business functions that are merged. And the list goes on.

 

The reasons for integrating the disparate data are also varied: denizens of the executive suite need to see reports based on data amalgamated from enterprise-wide databases; sales, marketing, and customer support staffs need a variety of data to facilitate cross-selling, up-selling, and customer service activities; competitive pressures lead to an intense focus on productivity and, consequently, a diligent drive to eliminate duplicate data entry. And again the list goes on.

 

Even when operational databases are fairly homogenous across the enterprise, an organization may intentionally introduce disparity by creating a separate data warehouse and/or data marts to better serve query, reporting, and analysis needs. Data warehouses or marts are required not because of a shortage of data, but exactly the opposite. Business executives, managers, and analysts need to be able to search the vast haystack of enterprise data to find the one needle that will allow them to fashion value for the organization.

 

With this need identified, it then becomes necessary to feed the data warehouse/marts from operational databases. This is not a trivial exercise. In addition to possibly running on different hardware, operating systems, and DBMS platforms, the data warehouse/marts are typically organized quite differently from the operational databases so that each can best serve its purpose. Among other differences, the data warehouse/marts might use denormalized data and meaningful labels rather than codes.
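As a rough illustration of that difference, the sketch below (plain Python, with hypothetical table and column names) builds a denormalized warehouse row from normalized operational records, replacing codes with meaningful labels along the way. It is a sketch of the idea, not any particular product's behavior.

    # Hypothetical operational records: normalized, code-driven.
    orders = [
        {"order_id": 1001, "customer_id": 7, "status_code": "S"},
    ]
    customers = {7: {"name": "Acme Ltd.", "region_code": "NE"}}
    status_labels = {"S": "Shipped", "P": "Pending"}
    region_labels = {"NE": "Northeast", "SW": "Southwest"}

    def to_warehouse_row(order):
        """Join the normalized tables and swap codes for labels."""
        cust = customers[order["customer_id"]]
        return {
            "order_id": order["order_id"],
            "customer_name": cust["name"],                 # denormalized from customers
            "region": region_labels[cust["region_code"]],  # label instead of code
            "status": status_labels[order["status_code"]],
        }

    print(to_warehouse_row(orders[0]))

The resulting row is wider and repeats customer details, which is exactly what makes it convenient for query and reporting tools.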

Planning Is Key

Regardless of why you need to integrate disparate data, planning is critical. In fact, the upfront analysis and planning of your integration requirements and processes should consume much more time than the execution of your plans. Before you undertake a data integration effort, consider the following:

 

• What information is needed by which systems and individuals?

• Will you require bi-directional data sharing? That is, will your organization need the ability to apply updates to the same data on any of two or more nodes in the integration topology and then share those updates with the other nodes?

• When will the data be needed? Is near real-time replication required, or is it sufficient to refresh the data nightly or on some less frequent schedule?

• How much data will be transferred? Where will it be transmitted from and to?

• Do the data source and target servers have sufficient processing capacity to capture the information and load it on the target?

• Do you have sufficient storage capacity on the servers?

• Does the network have enough capacity to transfer the data with adequate speed? (A rough transfer-time calculation appears after this list.)

• If the target server does not currently exist (such as when creating a new data warehouse as opposed to sharing data between existing applications), what are the required target server specifications?

• What, if any, data transformations are required when moving data from the source to the target? When sharing data between two existing applications, the answer is defined by the source and target applications. New data warehouses and/or marts, on the other hand, must be architected from scratch to best serve the purposes to which they will be put.

• Are there any regulatory issues that must be addressed? In a number of industries, privacy regulations dictate that some data cannot flow freely across the wire. Do you need to encrypt data sent on a local and/or external network?

• What are the auditing requirements? Even in the absence of government regulations, auditors may insist that accurate records be kept of all transfers and transformations of certain types of data.
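To put the network-capacity question in concrete terms, a back-of-the-envelope calculation like the following can quickly show whether a snapshot will fit in your batch window. The figures and the 70 percent efficiency factor are assumptions for illustration, not sizing guidance.

    def transfer_hours(gigabytes, link_mbps, efficiency=0.7):
        """Estimate transfer time, discounting the link for protocol overhead."""
        megabits = gigabytes * 8 * 1000          # decimal GB -> megabits
        return megabits / (link_mbps * efficiency) / 3600

    # Example: a 50 GB snapshot over a 100 Mbps WAN link at ~70% efficiency.
    print(f"{transfer_hours(50, 100):.1f} hours")   # roughly 1.6 hours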

Warehouse Integration: Snapshot or Real-Time?

When integrating operational databases and a data warehouse or mart, a question of the required timeliness of the data arises. Should you integrate data in near real-time using data replication software, or should you use extract, transform, and load (ETL) software to copy data as periodic snapshots, likely transmitted during system slow periods such as at night? When considering this question, you must assess the value of data timeliness, which is not the same for all organizations and circumstances.

 

Periodic snapshots performed using ETL software can put a heavy strain on the operational and data warehouse/mart systems and on the network, because a vast amount of data must be transmitted between the source and target servers in a short period. In an organization with well-defined "off hours," this might not be an issue. But when a company operates in all time zones and in a variety of cultures with different definitions of "weekends," or when some of the data going into the data warehouse is fed from 24x7 Web stores, there may be no time of day when the online response-time delays caused by periodic snapshot copies are acceptable.

 

In contrast to ETL, a data replicator copies operational data to a data warehouse/mart in near real-time by detecting, copying, and transmitting operational database updates as they occur. This adds somewhat to the load on the operational and data warehouse/mart systems and on the network, but because the load is distributed throughout the day, coincident with the corresponding operational data updates, replication processing may have no noticeable impact on system response times.
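A minimal sketch of the contrast, assuming the source maintains a change log that the replicator can read (a common technique; the table names and structures here are hypothetical): the snapshot path copies the full table in one pass, while the replication path repeatedly applies only the changes recorded since its last pass.

    import time

    source_rows = {1: "widget", 2: "gadget"}   # hypothetical source table
    change_log = []                            # (row_id, value) entries appended by the source application
    target_rows = {}

    def nightly_snapshot():
        """ETL-style: copy the full table in one heavy pass."""
        target_rows.clear()
        target_rows.update(source_rows)

    def replicate_changes(poll_seconds=1, passes=3):
        """Replication-style: apply only logged changes, spreading the load out over time."""
        applied = 0
        for _ in range(passes):
            while applied < len(change_log):
                row_id, value = change_log[applied]
                target_rows[row_id] = value
                applied += 1
            time.sleep(poll_seconds)

    nightly_snapshot()
    change_log.append((3, "gizmo"))            # an update arriving during the day
    replicate_changes(poll_seconds=0, passes=1)
    print(target_rows)

Real replication products typically read the database journal or log directly rather than polling an application-maintained table, but the division of labor is the same.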

Unequal Replicas

The use of the term "replication" is misleading because the data on the source and target servers may not be true replicas. For example, some subsidiaries may store data using metric measurements, whereas others use imperial measurements, but the head office needs all of the data merged using standardized measurements. Or the French subsidiary may store product names in French while the American subsidiary uses English, and the enterprise-wide data warehouse needs everything in a common language for analysis purposes. In addition, when sharing data among different systems, each application's database may use different data types and field sizes. These differences must be reconciled when the data is transferred.

 

As the "transform" in its name suggests, ETL software can perform the required transformations, but full-featured replication software offers that functionality as well. Products in both software classes typically include a number of common pre-coded transformations, such as built-in functions to change data formats, create timestamps, or concatenate one field onto another. ETL and replication tools usually also provide the ability to plug in custom-coded transformations.
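The sorts of transformations described above might look like the following when sketched in plain Python. The unit conversion, date reformatting, and concatenation are illustrative stand-ins, not any particular product's built-in functions.

    from datetime import datetime

    def cm_to_inches(value_cm):
        """Standardize a metric measurement to imperial for the head office."""
        return round(value_cm / 2.54, 2)

    def reformat_date(yymmdd):
        """Turn a compact '080327'-style date into an unambiguous ISO date."""
        return datetime.strptime(yymmdd, "%y%m%d").date().isoformat()

    def full_name(first, last):
        """Concatenate two source fields into one target field."""
        return f"{last}, {first}"

    print(cm_to_inches(100), reformat_date("080327"), full_name("Jean", "Dupont"))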

Cleanse the Data

One important best practice when merging data from multiple sources is to cleanse the data either before undertaking the integration project or as part of the data sharing processes. Data cleansing includes the following:

• Eliminate duplicate records. This may mean reconciling records that exist on multiple databases, but it might also involve searching for and eliminating duplicate records that were inadvertently added to a single database.

• Standardize differences such as "Rd." on one database versus "Road" on another; "Miss" or "Mrs." versus "Ms."; "Royal Blue" versus "#4169E1"; "080327" versus "03/27/2008" versus "March 27, 2008"; and so on. (See the sketch after this list.)

• Apply revisions from publicly available address correction databases and other sources to bring contact data up to date.

• Detect and correct or bypass corrupted records.
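As a minimal sketch of the first two items, using only the Python standard library and hypothetical field names, standardization and deduplication might look like this.

    ABBREVIATIONS = {"Rd.": "Road", "St.": "Street"}
    TITLES = {"Miss": "Ms.", "Mrs.": "Ms."}

    def standardize(record):
        """Normalize common variants so records from different sources compare equal."""
        record = dict(record)
        for old, new in ABBREVIATIONS.items():
            record["address"] = record["address"].replace(old, new)
        record["title"] = TITLES.get(record["title"], record["title"])
        return record

    def deduplicate(records):
        """Drop records that collapse to the same name and address after standardization."""
        seen, unique = set(), []
        for rec in map(standardize, records):
            key = (rec["name"].lower(), rec["address"].lower())
            if key not in seen:
                seen.add(key)
                unique.append(rec)
        return unique

    rows = [
        {"name": "Jane Doe", "title": "Mrs.", "address": "12 Oak Rd."},
        {"name": "Jane Doe", "title": "Ms.",  "address": "12 Oak Road"},
    ]
    print(deduplicate(rows))   # the two variants collapse to one record

Production-grade cleansing tools use much more sophisticated matching (phonetic keys, fuzzy comparison, address validation), but the shape of the work is the same.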

Record Freshness

The reliability of some data depends on its age. For example, each year, approximately 20 percent of Americans move to a new address. Thus, if there is no incentive for consumers to provide address updates when they move, you would be justified in questioning the validity of addresses that have not been updated for more than five years. Likewise, consumer research data that is a few years old probably won't accurately reflect current business conditions.

 

Data that is moved into a data warehouse/mart should, therefore, be time-stamped so users can determine if it has passed its "best before" date. A caveat to this is that a timestamp will accurately reflect the age of the data only if the timestamp is created when the source data was updated or if the timestamp is created as part of a near real-time replication process. In contrast, if you use ETL software to take, for example, monthly snapshots of operational databases, a timestamp created by the ETL software may suggest that the data is as much as a month younger than it actually is.
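A simple way to preserve that distinction, sketched below with hypothetical column names, is to carry the source system's own update timestamp alongside the load timestamp that the copy process generates, so users can judge the data's true age.

    from datetime import datetime, timezone

    def to_warehouse(source_row):
        """Keep the source's update time; record the load time separately."""
        return {
            **source_row,
            "source_updated_at": source_row["updated_at"],        # true age of the data
            "loaded_at": datetime.now(timezone.utc).isoformat(),  # when this copy ran
        }

    row = {"customer_id": 7, "address": "12 Oak Road", "updated_at": "2008-03-27T14:05:00Z"}
    print(to_warehouse(row))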

Bi-Directional Integration

Bi-directional data integration is a special case that necessitates additional considerations. To use a simple example, your marketing system may be independent of your sales system, and they may run on disparate platforms. When you use replication to share the data residing on the two systems, marketers and salespeople may still need to be able to use their own systems to add records for new prospects and update addresses for customers and prospects. Doing so introduces the possibility of "data collisions."

 

A collision occurs when, for example, two or more new records are assigned the same keys because they were created at the same time at different locations. This type of data collision can be avoided easily by restricting each server to the use of a unique block of keys. For example, server "A" may be allowed to create only even-numbered keys, while server "B" creates only odd-numbered keys.
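One common way to implement that restriction, sketched here in Python, is to have each server generate keys from its own interleaved sequence: with two servers, server "A" takes the even numbers and server "B" the odd ones. Many databases can achieve the same effect with a sequence whose increment equals the number of servers.

    def key_generator(server_index, server_count):
        """Yield keys from a block unique to this server: index, index+count, index+2*count, ..."""
        key = server_index
        while True:
            yield key
            key += server_count

    keys_a = key_generator(0, 2)   # server "A": 0, 2, 4, ...
    keys_b = key_generator(1, 2)   # server "B": 1, 3, 5, ...
    print(next(keys_a), next(keys_b), next(keys_a), next(keys_b))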

 

Another form of collision occurs when the same record is updated simultaneously at different locations. This type of collision is more difficult to deal with, but full-featured data replication software can detect collisions and offer a choice of ways to resolve them, such as using timestamps to honor the last or first update, giving priority to updates entered on a particular server, adhering to some predefined complex logic for determining the winner of a collision, or holding the colliding updates in a queue and flagging them for manual correction.
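A last-writer-wins resolver with a manual-review fallback might be sketched like this. The record layout and the tie-breaking rule are assumptions for illustration; a full-featured replication product would expose such policies as configuration choices rather than code.

    manual_review_queue = []

    def resolve(existing, incoming):
        """Honor the most recent update; queue exact ties for human correction."""
        if incoming["updated_at"] > existing["updated_at"]:
            return incoming
        if incoming["updated_at"] == existing["updated_at"]:
            manual_review_queue.append((existing, incoming))
        return existing

    a = {"customer_id": 7, "phone": "555-0100", "updated_at": "2008-03-27T10:00:00Z"}
    b = {"customer_id": 7, "phone": "555-0199", "updated_at": "2008-03-27T10:05:00Z"}
    print(resolve(a, b)["phone"])   # the later update wins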

Data-Driven Integration

Data integration requirements vary over time. Source and target data schemas may change with a new release of one of the applications that manages the data. Servers may be moved to a new location. Distributed servers may be consolidated. New data marts may be added. Additional applications may be brought into the data-sharing architecture. And so on. Thus, you need to be able to alter your data integration architecture quickly, easily, and cost-effectively. The best way to do that is to drive the data sharing processes from a repository of metadata rather than through hard-coded program logic.

 

Controlling data integration through a metadata repository allows you to change the architecture by merely updating the metadata using a simple user interface, without the need to recode program logic. The metadata repository provides the additional benefit of automatically documenting the data integration architecture.
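In practice, the repository can be as modest as a table or file of source-to-target mappings that the transfer code reads at run time. The sketch below uses a hypothetical in-line mapping to make the idea concrete; changing what flows where means editing the metadata, not the program.

    # Hypothetical metadata: which source fields feed which target columns,
    # and which named transformation (if any) to apply on the way.
    MAPPINGS = [
        {"source": "cust_nm", "target": "customer_name", "transform": None},
        {"source": "len_cm",  "target": "length_inches", "transform": "cm_to_inches"},
    ]

    TRANSFORMS = {"cm_to_inches": lambda cm: round(cm / 2.54, 2)}

    def apply_mappings(source_row):
        """Build the target row entirely from metadata; no hard-coded field logic."""
        target = {}
        for m in MAPPINGS:
            value = source_row[m["source"]]
            if m["transform"]:
                value = TRANSFORMS[m["transform"]](value)
            target[m["target"]] = value
        return target

    print(apply_mappings({"cust_nm": "Acme Ltd.", "len_cm": 100}))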

Testing

Data integration architectures, along with the metadata repository and program logic that support them, can be complex, and errors are possible. Consequently, it is important to test the integration environment before going live. This can be done by simulating live data and data updates and running them through test servers arranged in a topology that mimics your production environment. Running the test servers in logical partitions lets you avoid buying additional physical servers.

 

Even if there are no errors in your data integration metadata, errors can still occur in the integration processes or on the data storage devices, and as a result the data on a source and/or target server could become corrupted. Therefore, you should regularly perform synchronization checks to ensure that everything remains in order according to your data integration definitions.
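A basic synchronization check can be as simple as comparing row counts and a content checksum for each table on both sides. The sketch below is one way to do that in Python; the in-memory rows stand in for whatever query interface your environment provides.

    import hashlib

    def table_fingerprint(rows):
        """Row count plus an order-independent checksum of the rows' contents."""
        digest = 0
        for row in rows:
            digest ^= int(hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest(), 16)
        return len(rows), digest

    def check_sync(source_rows, target_rows, table_name):
        """Flag any table whose source and target fingerprints disagree."""
        if table_fingerprint(source_rows) != table_fingerprint(target_rows):
            print(f"{table_name}: source and target are OUT OF SYNC")
        else:
            print(f"{table_name}: in sync")

    src = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]
    tgt = [{"id": 2, "name": "Globex"}, {"id": 1, "name": "Acme"}]
    check_sync(src, tgt, "customers")   # order differs, contents match -> in sync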

Leverage Information Assets

Disparate data is a fact of life for almost all organizations. Left isolated in independent information silos, it can add unnecessary costs by forcing duplicate data entry, creating data inconsistencies as updates are applied to one data silo but omitted from others, and making valuable enterprise-wide data reporting and business analysis difficult, if not impossible. Fortunately, sophisticated data replication and/or ETL software provides a solution that, with the use of data integration best practices, can significantly leverage the value of an enterprise's information assets.

John Gay

John Gay is Director of Sales Engineering with Vision Solutions, Inc. He served as the Product Strategist for Lakeview Technology (now merged with Vision Solutions) for the Vision replicate1 solution. John spent eight years in development with IBM and Sterling Commerce and more than eight years in technical sales focused on the business intelligence market.
