
Employ Best Practices When Integrating Disparate Data


Disparate data can inhibit value creation, but data integration best practices allow organizations to fully leverage the inherent value in their information resources.

 

Silos of disparate data plague many companies. The reasons for the data diversity vary: organizations may run best-of-breed applications on the most appropriate platform for each product; in the days of client/server computing, the responsibility for choosing applications may have been decentralized, with different departments choosing different platforms; a merged company may use a variety of different applications, keeping separate some of the formerly individual companies' activities and the software that supports those parallel activities, while choosing the best of the predecessor companies' applications to serve those business functions that are merged. And the list goes on.

 

The reasons for integrating the disparate data are also varied: denizens of the executive suite need to see reports based on data amalgamated from enterprise-wide databases; sales, marketing, and customer support staffs need a variety of data to facilitate cross-selling, up-selling, and customer service activities; competitive pressures lead to an intense focus on productivity and, consequently, a diligent drive to eliminate duplicate data entry. And again the list goes on.

 

Even when operational databases are fairly homogenous across the enterprise, an organization may intentionally introduce disparity by creating a separate data warehouse and/or data marts to better serve query, reporting, and analysis needs. Data warehouses or marts are required not because of a shortage of data, but exactly the opposite. Business executives, managers, and analysts need to be able to search the vast haystack of enterprise data to find the one needle that will allow them to fashion value for the organization.

 

With this need identified, it then becomes necessary to feed the data warehouse/marts from operational databases. This is not a trivial exercise. In addition to possibly running on different hardware, operating systems, and DBMS platforms, the data warehouse/marts are typically organized quite differently from the operational databases so that each can best serve its purpose. Among other differences, the data warehouse/marts might use denormalized data and meaningful labels rather than codes.
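As a rough illustration of that difference, the sketch below (plain Python, with hypothetical table and column names) builds a denormalized warehouse row from normalized operational records, replacing codes with meaningful labels along the way. It is a sketch of the idea, not any particular product's behavior.

    # Hypothetical operational records: normalized, code-driven.
    orders = [
        {"order_id": 1001, "customer_id": 7, "status_code": "S"},
    ]
    customers = {7: {"name": "Acme Ltd.", "region_code": "NE"}}
    status_labels = {"S": "Shipped", "P": "Pending"}
    region_labels = {"NE": "Northeast", "SW": "Southwest"}

    def to_warehouse_row(order):
        """Join the normalized tables and swap codes for labels."""
        cust = customers[order["customer_id"]]
        return {
            "order_id": order["order_id"],
            "customer_name": cust["name"],                 # denormalized from customers
            "region": region_labels[cust["region_code"]],  # label instead of code
            "status": status_labels[order["status_code"]],
        }

    print(to_warehouse_row(orders[0]))

The resulting row is wider and repeats customer details, which is exactly what makes it convenient for query and reporting tools.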

Planning Is Key

Regardless of why you need to integrate disparate data, planning is critical. In fact, the upfront analysis and planning of your integration requirements and processes should consume much more time than the execution of your plans. Before you undertake a data integration effort, consider the following:

 

• What information is needed by which systems and individuals?

• Will you require bi-directional data sharing? That is, will your organization need the ability to apply updates to the same data on any of two or more nodes in the integration topology and then share those updates with the other nodes?

• When will the data be needed? Is near real-time replication required, or is it sufficient to refresh the data nightly or on some less frequent schedule?

• How much data will be transferred? Where will it be transmitted from and to?

• Do the data source and target servers have sufficient processing capacity to capture the information and load it on the target?

• Do you have sufficient storage capacity on the servers?

• Does the network have enough capacity to transfer the data with adequate speed? (A rough transfer-time calculation appears after this list.)

• If the target server does not currently exist (such as when creating a new data warehouse as opposed to sharing data between existing applications), what are the required target server specifications?

• What, if any, data transformations are required when moving data from the source to the target? When sharing data between two existing applications, the answer is defined by the source and target applications. New data warehouses and/or marts, on the other hand, must be architected from scratch to best serve the purposes to which they will be put.

• Are there any regulatory issues that must be addressed? In a number of industries, privacy regulations dictate that some data cannot flow freely across the wire. Do you need to encrypt data sent on a local and/or external network?

• What are the auditing requirements? Even in the absence of government regulations, auditors may insist that accurate records be kept of all transfers and transformations of certain types of data.
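To put the network-capacity question in concrete terms, a back-of-the-envelope calculation like the following can quickly show whether a snapshot will fit in your batch window. The figures and the 70 percent efficiency factor are assumptions for illustration, not sizing guidance.

    def transfer_hours(gigabytes, link_mbps, efficiency=0.7):
        """Estimate transfer time, discounting the link for protocol overhead."""
        megabits = gigabytes * 8 * 1000          # decimal GB -> megabits
        return megabits / (link_mbps * efficiency) / 3600

    # Example: a 50 GB snapshot over a 100 Mbps WAN link at ~70% efficiency.
    print(f"{transfer_hours(50, 100):.1f} hours")   # roughly 1.6 hours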

Warehouse Integration: Snapshot or Real-Time?

When integrating operational databases and a data warehouse or mart, a question of the required timeliness of the data arises. Should you integrate data in near real-time using data replication software, or should you use extract, transform, and load (ETL) software to copy data as periodic snapshots, likely transmitted during system slow periods such as at night? When considering this question, you must assess the value of data timeliness, which is not the same for all organizations and circumstances.

 

Periodic snapshots performed using ETL software can put a heavy strain on the operational and data warehouse/mart systems and on the network, because a vast amount of data must be transmitted between the source and target servers in a short period. In an organization with well-defined "off hours," this might not be an issue. But when a company operates in all time zones and in a variety of cultures with different definitions of "weekends," or when some of the data going into the data warehouse is fed from 24x7 Web stores, there may be no time of day when the online response-time delays caused by periodic snapshot copies are acceptable.

 

In contrast to ETL, a data replicator copies operational data to a data warehouse/mart in near real-time by detecting, copying, and transmitting operational database updates as they occur. This adds somewhat to the load on the operational and data warehouse/mart systems and on the network, but because the load is distributed throughout the day, coincident with the corresponding operational data updates, replication processing may have no noticeable impact on system response times.
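A minimal sketch of the contrast, assuming the source maintains a change log that the replicator can read (a common technique; the table names and structures here are hypothetical): the snapshot path copies the full table in one pass, while the replication path repeatedly applies only the changes recorded since its last pass.

    import time

    source_rows = {1: "widget", 2: "gadget"}   # hypothetical source table
    change_log = []                            # (row_id, value) entries appended by the source application
    target_rows = {}

    def nightly_snapshot():
        """ETL-style: copy the full table in one heavy pass."""
        target_rows.clear()
        target_rows.update(source_rows)

    def replicate_changes(poll_seconds=1, passes=3):
        """Replication-style: apply only logged changes, spreading the load out over time."""
        applied = 0
        for _ in range(passes):
            while applied < len(change_log):
                row_id, value = change_log[applied]
                target_rows[row_id] = value
                applied += 1
            time.sleep(poll_seconds)

    nightly_snapshot()
    change_log.append((3, "gizmo"))            # an update arriving during the day
    replicate_changes(poll_seconds=0, passes=1)
    print(target_rows)

Real replication products typically read the database journal or log directly rather than polling an application-maintained table, but the division of labor is the same.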

Unequal Replicas

The use of the term "replication" is misleading because the data on the source and target servers may not be true replicas. For example, some subsidiaries may store data using metric measurements, whereas others use imperial measurements, but the head office needs all of the data merged using standardized measurements. Or the French subsidiary may store product names in French while the American subsidiary uses English, and the enterprise-wide data warehouse needs everything in a common language for analysis purposes. In addition, when sharing data among different systems, each application's database may use different data types and field sizes. These differences must be reconciled when the data is transferred.

 

As the "transform" in its name suggests, ETL software can perform the required transformations, but full-featured replication software offers that functionality as well. Products in both software classes typically include a number of common pre-coded transformations, such as built-in functions to change data formats, create timestamps, or concatenate one field onto another. ETL and replication tools usually also provide the ability to plug in custom-coded transformations.
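The sorts of transformations described above might look like the following when sketched in plain Python. The unit conversion, date reformatting, and concatenation are illustrative stand-ins, not any particular product's built-in functions.

    from datetime import datetime

    def cm_to_inches(value_cm):
        """Standardize a metric measurement to imperial for the head office."""
        return round(value_cm / 2.54, 2)

    def reformat_date(yymmdd):
        """Turn a compact '080327'-style date into an unambiguous ISO date."""
        return datetime.strptime(yymmdd, "%y%m%d").date().isoformat()

    def full_name(first, last):
        """Concatenate two source fields into one target field."""
        return f"{last}, {first}"

    print(cm_to_inches(100), reformat_date("080327"), full_name("Jean", "Dupont"))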

Cleanse the Data

One important best practice when merging data from multiple sources is to cleanse the data either before undertaking the integration project or as part of the data sharing processes. Data cleansing includes the following:

• Eliminate duplicate records. This may mean reconciling records that exist on multiple databases, but it might also involve searching for and eliminating duplicate records that were inadvertently added to a single database.

• Standardize differences such as "Rd." on one database versus "Road" on another; "Miss" or "Mrs." versus "Ms."; "Royal Blue" versus "#4169E1"; "080327" versus "03/27/2008" versus "March 27, 2008"; and so on. (See the sketch after this list.)

• Apply revisions from publicly available address correction databases and other sources to bring contact data up to date.

• Detect and correct or bypass corrupted records.
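As a minimal sketch of the first two items, using only the Python standard library and hypothetical field names, standardization and deduplication might look like this.

    ABBREVIATIONS = {"Rd.": "Road", "St.": "Street"}
    TITLES = {"Miss": "Ms.", "Mrs.": "Ms."}

    def standardize(record):
        """Normalize common variants so records from different sources compare equal."""
        record = dict(record)
        for old, new in ABBREVIATIONS.items():
            record["address"] = record["address"].replace(old, new)
        record["title"] = TITLES.get(record["title"], record["title"])
        return record

    def deduplicate(records):
        """Drop records that collapse to the same name and address after standardization."""
        seen, unique = set(), []
        for rec in map(standardize, records):
            key = (rec["name"].lower(), rec["address"].lower())
            if key not in seen:
                seen.add(key)
                unique.append(rec)
        return unique

    rows = [
        {"name": "Jane Doe", "title": "Mrs.", "address": "12 Oak Rd."},
        {"name": "Jane Doe", "title": "Ms.",  "address": "12 Oak Road"},
    ]
    print(deduplicate(rows))   # the two variants collapse to one record

Production-grade cleansing tools use much more sophisticated matching (phonetic keys, fuzzy comparison, address validation), but the shape of the work is the same.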

Record Freshness

The reliability of some data depends on its age. For example, each year, approximately 20 percent of Americans move to a new address. Thus, if there is no incentive for consumers to provide address updates when they move, you would be justified in questioning the validity of addresses that have not been updated for more than five years. Likewise, consumer research data that is a few years old probably won't accurately reflect current business conditions.

 

Data that is moved into a data warehouse/mart should, therefore, be time-stamped so users can determine if it has passed its "best before" date. A caveat to this is that a timestamp will accurately reflect the age of the data only if the timestamp is created when the source data was updated or if the timestamp is created as part of a near real-time replication process. In contrast, if you use ETL software to take, for example, monthly snapshots of operational databases, a timestamp created by the ETL software may suggest that the data is as much as a month younger than it actually is.
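A simple way to preserve that distinction, sketched below with hypothetical column names, is to carry the source system's own update timestamp alongside the load timestamp that the copy process generates, so users can judge the data's true age.

    from datetime import datetime, timezone

    def to_warehouse(source_row):
        """Keep the source's update time; record the load time separately."""
        return {
            **source_row,
            "source_updated_at": source_row["updated_at"],        # true age of the data
            "loaded_at": datetime.now(timezone.utc).isoformat(),  # when this copy ran
        }

    row = {"customer_id": 7, "address": "12 Oak Road", "updated_at": "2008-03-27T14:05:00Z"}
    print(to_warehouse(row))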

Bi-Directional Integration

Bi-directional data integration is a special case that necessitates additional considerations. To use a simple example, your marketing system may be independent of your sales system, and they may run on disparate platforms. When you use replication to share the data residing on the two systems, marketers and salespeople may still need to be able to use their own systems to add records for new prospects and update addresses for customers and prospects. Doing so introduces the possibility of "data collisions."

 

A collision occurs when, for example, two or more new records are assigned the same keys because they were created at the same time at different locations. This type of data collision can be avoided easily by restricting each server to the use of a unique block of keys. For example, server "A" may be allowed to create only even-numbered keys, while server "B" creates only odd-numbered keys.
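One common way to implement that restriction, sketched here in Python, is to have each server generate keys from its own interleaved sequence: with two servers, server "A" takes the even numbers and server "B" the odd ones. Many databases can achieve the same effect with a sequence whose increment equals the number of servers.

    def key_generator(server_index, server_count):
        """Yield keys from a block unique to this server: index, index+count, index+2*count, ..."""
        key = server_index
        while True:
            yield key
            key += server_count

    keys_a = key_generator(0, 2)   # server "A": 0, 2, 4, ...
    keys_b = key_generator(1, 2)   # server "B": 1, 3, 5, ...
    print(next(keys_a), next(keys_b), next(keys_a), next(keys_b))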

 

Another form of collision occurs when the same record is updated simultaneously at different locations. This type of collision is more difficult to deal with, but full-featured data replication software can detect collisions and offer a choice of ways to resolve them, such as using timestamps to honor the last or first update, giving priority to updates entered on a particular server, adhering to some predefined complex logic for determining the winner of a collision, or holding the colliding updates in a queue and flagging them for manual correction.
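A last-writer-wins resolver with a manual-review fallback might be sketched like this. The record layout and the tie-breaking rule are assumptions for illustration; a full-featured replication product would expose such policies as configuration choices rather than code.

    manual_review_queue = []

    def resolve(existing, incoming):
        """Honor the most recent update; queue exact ties for human correction."""
        if incoming["updated_at"] > existing["updated_at"]:
            return incoming
        if incoming["updated_at"] == existing["updated_at"]:
            manual_review_queue.append((existing, incoming))
        return existing

    a = {"customer_id": 7, "phone": "555-0100", "updated_at": "2008-03-27T10:00:00Z"}
    b = {"customer_id": 7, "phone": "555-0199", "updated_at": "2008-03-27T10:05:00Z"}
    print(resolve(a, b)["phone"])   # the later update wins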

Data-Driven Integration

Data integration requirements vary over time. Source and target data schemas may change with a new release of one of the applications that manages the data. Servers may be moved to a new location. Distributed servers may be consolidated. New data marts may be added. Additional applications may be brought into the data-sharing architecture. And so on. Thus, you need to be able to alter your data integration architecture quickly, easily, and cost-effectively. The best way to do that is to drive the data sharing processes from a repository of metadata rather than through hard-coded program logic.

 

Controlling data integration through a metadata repository allows you to change the architecture by merely updating the metadata using a simple user interface, without the need to recode program logic. The metadata repository provides the additional benefit of automatically documenting the data integration architecture.
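In practice, the repository can be as modest as a table or file of source-to-target mappings that the transfer code reads at run time. The sketch below uses a hypothetical in-line mapping to make the idea concrete; changing what flows where means editing the metadata, not the program.

    # Hypothetical metadata: which source fields feed which target columns,
    # and which named transformation (if any) to apply on the way.
    MAPPINGS = [
        {"source": "cust_nm", "target": "customer_name", "transform": None},
        {"source": "len_cm",  "target": "length_inches", "transform": "cm_to_inches"},
    ]

    TRANSFORMS = {"cm_to_inches": lambda cm: round(cm / 2.54, 2)}

    def apply_mappings(source_row):
        """Build the target row entirely from metadata; no hard-coded field logic."""
        target = {}
        for m in MAPPINGS:
            value = source_row[m["source"]]
            if m["transform"]:
                value = TRANSFORMS[m["transform"]](value)
            target[m["target"]] = value
        return target

    print(apply_mappings({"cust_nm": "Acme Ltd.", "len_cm": 100}))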

Testing

Data integration architectures, along with the metadata repository and program logic that support them, can be complex, and errors are possible. Consequently, it is important to test the integration environment before going live. This can be done by simulating live data and data updates and running them through test servers arranged in a topology that mimics your production environment. Running the test servers in logical partitions lets you avoid buying additional physical servers.

 

Even if there are no errors in your data integration metadata, errors can still occur in the integration processes or on the data storage devices, and as a result the data on a source and/or target server could become corrupted. Therefore, you should regularly perform synchronization checks to ensure that everything remains in order according to your data integration definitions.
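A basic synchronization check can be as simple as comparing row counts and a content checksum for each table on both sides. The sketch below is one way to do that in Python; the in-memory rows stand in for whatever query interface your environment provides.

    import hashlib

    def table_fingerprint(rows):
        """Row count plus an order-independent checksum of the rows' contents."""
        digest = 0
        for row in rows:
            digest ^= int(hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest(), 16)
        return len(rows), digest

    def check_sync(source_rows, target_rows, table_name):
        """Flag any table whose source and target fingerprints disagree."""
        if table_fingerprint(source_rows) != table_fingerprint(target_rows):
            print(f"{table_name}: source and target are OUT OF SYNC")
        else:
            print(f"{table_name}: in sync")

    src = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]
    tgt = [{"id": 2, "name": "Globex"}, {"id": 1, "name": "Acme"}]
    check_sync(src, tgt, "customers")   # order differs, contents match -> in sync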

Leverage Information Assets

Disparate data is a fact of life for almost all organizations. Left isolated in independent information silos, it can add unnecessary costs by forcing duplicate data entry, creating data inconsistencies as updates are applied to one data silo but omitted from others, and making valuable enterprise-wide data reporting and business analysis difficult, if not impossible. Fortunately, sophisticated data replication and/or ETL software provides a solution that, with the use of data integration best practices, can significantly leverage the value of an enterprise's information assets.

John Gay

John Gay is Director of Sales Engineering with Vision Solutions, Inc. He served as the Product Strategist for Lakeview Technology (now merged with Vision Solutions) for the Vision replicate1 solution. John spent eight years in development with IBM and Sterling Commerce and more than eight years in technical sales focused on the business intelligence market.
