How would you like to have all the crucial information about your customers and your business available when you need it in a format you can use? I'll bet data warehousing is just what you're looking for. Data warehousing is about knowing your customers' needs and knowing how your business fills those needs. Data warehouses can have huge paybacks for a business: paybacks measured in terms of increased business and more loyal customers. Who isn't looking for more business?
Useful data warehousing capabilities have been promised for years. Finally, today's technologies and products are achieving a level of maturity that fulfills this promise. I say "technologies and products" because it takes more than one product or technology to implement a data warehouse. A successful implementation requires that you examine your business needs and then select the right set of tools to make it all come together. A data warehouse is something you cannot buy; it is something you must build. Products are available to make the implementation easy, but the hardest part is examining your business needs.
This article examines what a data warehouse is and describes some products you can use to implement one. In a future issue, I'll show you, step-by-step, how to build a data warehouse.
Data warehousing is the process of consolidating data from different operational databases, modifying the data so it can be analyzed easily, and then analyzing the data. In other words, data warehousing is the process of turning operational data into informational data. Operational data is contained in the application databases that run your business. These application databases may contain your company's order entry, human resource, financial, or point-of-sale data. Operational data is optimized for transaction processing (i.e., easy updating) and contains very little history. Most operational databases do not contain data over one year old.
Informational databases are optimized for query processing (i.e., getting the data out), and they are denormalized. Informational data is usually summarized, which reduces the volume of data, makes it easier to understand, and speeds retrieval. A time element is added to the data so that analysis over time can be done. Informational databases may contain data spanning many years. If you would like to look at sales trends for the last five years, information from five years ago must be available.
Data warehouse implementations usually consist of five parts:
o The warehouse
o Tools to move the data from the operational database to the warehouse
o Tools to analyze the data
o Tools to maintain information about the data
o Middleware, which allows these tools to talk to each other
I'll discuss these one at a time.
The warehouse is made up of hardware and a database that runs on that system. Data warehouses are usually implemented on separate systems so that the analysis being done on the warehouse data does not impact the performance of the business's operational applications. Most operational systems today do not have the spare capacity to absorb a doubling of their users' resource demands, and the data analysis done with data warehousing can easily double the amount of CPU resource needed to run a business.
AS/400 Advanced Servers are a perfect fit in a data warehouse system. Data warehousing is very batch intensive, and most of the tools used are client/server applications. AS/400 Advanced Servers offer price/performance advantages when running in those environments.
The database that runs on this system and stores the data can be either relational, like DB2 for OS/400, or multidimensional. Multidimensional databases are specialized data stores that offer high-speed, predictable retrieval of mostly numeric data. Using a tool that works something like a spreadsheet, the user navigates through the data looking at successive levels of detail. The database is called multidimensional because of the way the data is organized. For example, sales data may be stored by product, store, salesperson, and time; in other words, four dimensions. Multidimensional databases are often deployed as smaller, subject-specific warehouses called datamarts.
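To make the idea concrete, here is a toy sketch in Python (all product, store, and salesperson names are hypothetical) of a four-dimensional sales cube and a simple roll-up over its dimensions:

```python
# A toy "cube": sales quantities keyed by four dimensions
# (product, store, salesperson, month). All names and figures are made up.
sales_cube = {
    ("running shoe", "store-1", "alice", "1996-01"): 120,
    ("running shoe", "store-2", "bob",   "1996-01"): 80,
    ("sandal",       "store-1", "alice", "1996-02"): 45,
}

def rollup(cube, product=None, store=None, salesperson=None, month=None):
    """Sum the cells matching the given dimension values (None = all)."""
    total = 0
    for (p, s, sp, m), qty in cube.items():
        if product and p != product:
            continue
        if store and s != store:
            continue
        if salesperson and sp != salesperson:
            continue
        if month and m != month:
            continue
        total += qty
    return total
```

A real multidimensional database precomputes and indexes these roll-ups, which is where its fast, predictable response times come from; this sketch only illustrates how the dimensions organize the data.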
While relational databases don't have the same predictable response time as multidimensional databases (see the sidebar, "Can RDBMS Compete with Multidimensional Databases?"), they do have the ability to store much larger amounts of data. Relational databases also have another advantage: A much wider variety of end-user tools is available. (I'll discuss these tools later.) Relational databases are used as both datamarts and enterprisewide data warehouses. Often, these enterprisewide warehouses are used as clearinghouses for information that is then fed into the datamarts. In such cases, operational data is moved into the big warehouse before making its way to the datamarts.
The second part of the warehouse solution consists of the tools that move the data from the operational database to the data warehouse.
The first mission of these tools is to collect the changes that are made to the operational databases. It is not feasible to move entire operational databases to the warehouse on a regular basis. Once the initial data is loaded into the warehouse, it is common for just the changed data to be moved and placed in the warehouse. The changes to the operational databases are captured in one of three ways.
The first is to modify the applications that change the operational data so that they also send those changes to the data movement tool. This is often impractical, because changing applications is very expensive. The second is to journal the operational files so that all changes are recorded in the database journal. The data movement tool then reads the journal to get the changed data. Journaling your database files adds overhead, but it is the best way to ensure that all changes are captured. The third is to add a trigger to your operational files. The trigger programs are given the changed records and pass them along to the data movement tools. Using triggers usually incurs lower overhead than journaling, but you must have procedures in place to recover from any abnormal system termination.
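Here is a minimal Python sketch of the trigger approach; the record layout and function names are invented for illustration. Each change to an operational record also drops a change record into a staging area that the data movement tool reads later:

```python
# Hypothetical trigger-style change capture. Every update to an
# "operational" record appends a change record to a staging list,
# which stands in for the staging area a data movement tool would read.
staged_changes = []

def trigger(operation, record):
    """Called for every change to the operational file."""
    staged_changes.append({"op": operation, "record": dict(record)})

def update_record(table, key, new_values):
    """Apply an update to the operational table and fire the trigger."""
    record = table[key]
    record.update(new_values)
    trigger("UPDATE", record)

orders = {1: {"id": 1, "qty": 2}}
update_record(orders, 1, {"qty": 3})
```

On the AS/400, the trigger program would be attached to the physical file itself; this sketch just shows the flow of changed records into a staging area.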
Moving the data can be accomplished through a number of means. It may be as simple as using the Copy File (CPYF) command with Distributed Data Management (DDM), or it may involve custom applications. The tools for moving data use a variety of methods, usually governed by how the captured data is stored. Data Propagator for OS/400 stores its captured data in database files, so it simply issues a remote query, using Distributed Relational Database Architecture (DRDA) against the files with the changed data. This data movement is usually timed to occur at regular intervals. It could be timed to happen all at once (all changes moved at one time), or it could happen multiple times during the day to spread the resources needed to move the data over more time.
IBM's Data Propagator and DataMirror's DataMirror are just two of the tools available to move data into the warehouse. No one tool does everything.
These tools also may be used to enhance and modify the data before it gets into the warehouse. Data enhancement and modification take a number of forms. The simplest, and most prevalent, is the summarization of data as it is moved to the warehouse. Summarizing the data reduces the volume of data used during analysis, which makes the data easier to understand, faster to query, and cheaper to store. For example, if you were looking at sales of running shoes from all of your stores, you probably wouldn't care exactly when each pair of shoes was purchased. (That information is contained in your operational databases, by the way.) You probably do care how many shoes were purchased on a given day, though.
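A summarization step like this can be sketched in a few lines of Python (the transaction data is made up). Transaction-level sales, with their exact timestamps, are rolled up to daily totals per store and item:

```python
from collections import defaultdict

# Hypothetical transaction-level sales: (store, timestamp, item, qty).
transactions = [
    ("store-1", "1996-03-01 09:14", "running shoe", 1),
    ("store-1", "1996-03-01 16:40", "running shoe", 2),
    ("store-2", "1996-03-01 11:05", "running shoe", 1),
]

# Summarize to (store, day, item) -> total quantity,
# dropping the time-of-day detail the end user doesn't need.
daily_totals = defaultdict(int)
for store, when, item, qty in transactions:
    day = when.split(" ")[0]
    daily_totals[(store, day, item)] += qty
```

The exact purchase times stay behind in the operational database; only the daily totals move to the warehouse.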
Denormalization of the data also takes place. Denormalization is the process of combining related data from different database files into a single file. While this does take up more storage due to redundant data, it makes the information easier for the end user to deal with. In normalized databases, end users must combine related data from multiple files, using joins in their queries. Most end users have a difficult time with this. This also takes lots of system resources and reduces your data warehouse's performance.
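The following Python sketch (with hypothetical customer and order files) shows the idea: the customer columns are copied into each order row, so the end user queries one flat file instead of writing a join:

```python
# Hypothetical normalized files: customers keyed by id, orders referencing them.
customers = {100: {"name": "M. Wulf", "city": "Rochester"}}
orders = [
    {"order_id": 1, "customer_id": 100, "amount": 59.95},
]

# Denormalize: merge the customer columns into each order row.
# This duplicates customer data but spares the end user the join.
denormalized = [
    {**order, **customers[order["customer_id"]]}
    for order in orders
]
```

The redundant customer columns cost storage, but every query against the flat file is now a simple single-file selection.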
Another important aspect of this process is cleansing the data to ensure that it is valid before it is used for analysis. If the data in the warehouse is not correct, the decisions made using that data will not be correct. For example, names have a tendency to appear in a database in multiple forms. Each name is meant to represent a different customer, but if a customer has multiple versions of his name in the database, you will think that one customer is actually multiple customers. Mark Wulf might show up as M Wulf, M G Wulf, and Mr. M Wulf. I am only one customer, but I am represented multiple times in the database.
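A very simple cleansing rule, sketched in Python, might reduce each name to a first-initial-plus-surname key so that variants of one customer collapse together. Real cleansing logic is far more involved; this only illustrates the matching idea:

```python
# Hypothetical cleansing rule: strip titles and reduce a name to
# "first-initial surname" so variants of one customer match.
def name_key(name):
    parts = [p.strip(".").lower() for p in name.split()]
    parts = [p for p in parts if p not in ("mr", "mrs", "ms", "dr")]
    return parts[0][0] + " " + parts[-1]

variants = ["Mark Wulf", "M Wulf", "M G Wulf", "Mr. M Wulf"]
keys = {name_key(v) for v in variants}  # all four collapse to one key
```

All four variants map to the same key, so the warehouse sees one customer instead of four.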
Few tools are available to do data cleansing. Data cleansing tends to be data specific and company specific, which means you may not find tools to do all of your data cleansing, and some of this function may need to be custom written for your business. These programs are easy to write after you complete the task of discovering what may be wrong with your data.
The data analysis portion of a data warehouse is the most important part. After all, it is what the end users are going to see, and they are the customers who need to be satisfied. The data must be presented to the end users in a way that is easy to understand, and the data must be easy to find. Data warehouse design is a large portion of making this happen, but the data analysis tools are the other large portion.
Tools used to analyze the data fall into two categories: decision support tools and data mining tools. Decision support tools include query tools, report writers, and Executive Information System (EIS) tools. The amount of training or end user knowledge varies widely across all of these tools. Query tools usually require the end user to know something about database theory and to have some knowledge of how to create queries. Users can usually be taught how to do simple selection, but they have a difficult time with joins or aggregation. EIS tools are at the other end of the scale. While these tools typically require some development work to be done before deployment, end users need little education on how to work the tool before they are off and running. No database theory is needed, because it is all hidden behind a set of buttons. Remember the joke about the ultimate EIS? Just one big button in the middle of the screen that says "Press here for information."
What's really important about this discussion is that, the easier it is for the end user to find and use the information in the warehouse, the more successful you will be in implementing one. It may be easier to just give everyone a query tool, but, in the long run, you may be more successful investing in something else.
Data mining tools work differently from decision support tools. They find information with little or no interaction from a user. The user simply identifies which data to look at and lets the tool loose. Data mining tools run in batch and may run for hours, or even days, to find all the information in the data. This information usually takes the form of patterns or trends in the data. For example, data mining technology can be used to determine which items in a database relate to one another. These items might be those things sold in a retail store. Data mining can find which items are being purchased together when one of your customers walks into the store. It can find out in what quantities these items are being purchased and what items are purchased over time. This information can be used to make better decisions about which items to place on sale, where to put items in the store, or even what the projected sales volumes will be.
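The core of the "purchased together" analysis can be sketched in Python: count how often each pair of items shows up in the same basket. The baskets here are made up, and real data mining tools use far more sophisticated algorithms than this simple pair count:

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: the items bought together in one store visit.
baskets = [
    {"running shoe", "socks"},
    {"running shoe", "socks", "water bottle"},
    {"sandal", "socks"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1
```

Pairs with high counts are candidates for joint promotions or for shelving the items near each other.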
There are literally hundreds of decision support tools available. The list includes tools such as Lotus Approach, IBM Visualizer, Cognos Impromptu, Showcase VISTA, and Brio BrioQuery. Data mining tools are not as numerous. IBM offers the Neural Network Utility, which mines your data and allows you to deploy the results directly into an application. IBM also recently announced the IBM Intelligent Miner, which combines all IBM data mining technology into a single tool; it mines your data and shows you the results graphically. The Neural Network Utility is available for most IBM platforms including the AS/400. The IBM Intelligent Miner will be made available for the AS/400 late in 1996.
Tools are also used to maintain information, called metadata, in the warehouse. Metadata describes what the layout of the warehouse is, where to find information, how old the data is, and where the data came from.
Metadata usually comes in two forms. The first is for the end user. It helps the user find where the information is located in the data warehouse. This metadata takes many forms today. It may be a written document describing the contents of each file in the warehouse, or it may be a tool that simply presents text descriptions of information to the end user. The end user can then choose the information, and the metadata tool will give the requested information to the end user, using the appropriate query tool, spreadsheet, or multimedia viewer. Lots of query tools have metadata capabilities built right in. Presenting users with text descriptions of file and field names instead of names like "R26F03" (this type of name is very common in AS/400 databases, by the way) can go a long way toward making information more useful.
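At its simplest, end-user metadata is just a lookup from cryptic file and field names to plain-text descriptions, as in this Python sketch (the field names are invented in the style of the "R26F03" example above):

```python
# A minimal end-user metadata dictionary: cryptic field names mapped to
# plain-text descriptions the user can browse. Names are hypothetical.
metadata = {
    "R26F03": "Customer last name",
    "R26F07": "Order total (USD)",
}

def describe(field):
    """Return the end-user description for a field, if one exists."""
    return metadata.get(field, "No description available")
```

A metadata tool layers navigation and launching of the right query tool on top of a dictionary like this one.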
The second type of metadata is for the data warehouse administrator. This metadata keeps information about the source of the data, the layouts of the operational and data warehouse databases, when the data movement is happening, and other things that only an administrator needs to know.
The problem with metadata is that there are no good sources of information yet. The DB2/400 system catalogs and file descriptions contain information about what the file looks like, but it's difficult to convert that into end user information. Most of this information must be created from scratch.
Middleware gives all of these tools access to the data. The most common middleware used is Open Database Connectivity (ODBC), because most of the tools are client/server-based. DRDA also plays a role when moving data from the operational database to the warehouse.
What the implementation of a warehouse looks like varies depending on the needs of your business. It could be as simple as a tool to move the data and a query tool to do the data analysis.
More complicated warehouses implement a two-tier warehouse. The first tier, the data warehouse, is where all data from the operational databases is sent. Before being put into the data warehouse, the data is cleansed. The second tier is made up of one or more datamarts, which get their data from the data warehouse. End users usually only have access to information in the datamarts. Datamarts are designed with ease of use and performance in mind.
Again, implement what is right for your business. Most businesses start with a single datamart. Once that is successful, the first tier or data warehouse is created, followed by other datamarts. Multiple datamarts are easier to manage if the correct data warehouse is also in place.
Remember the goal of data warehouse implementation: Know more about your customers and your business so you can meet your customers' needs. This is what data warehousing is all about.
Data Warehousing on the AS/400
Can RDBMS Compete with Multidimensional Databases?
Multidimensional databases are becoming popular parts of many data warehouse implementations today. Companies such as Hoskyns, SAMAC, Silvon, and others have offered multidimensional databases for the AS/400 for a long time, but they were never sold as part of an overall warehouse scheme. They have mainly been sold as fast decision-support databases.
Their main advantage is performance. The goal of multidimensional databases is to return any requested data in less than one second, and the performance should be consistent: Changing the request should not change the response time. For the most part, they have met their goal. So how can a relational database like DB2 for OS/400 compete with performance like that?
Relational databases have turned to parallel technologies to make that performance gap smaller. DB2 for OS/400 has delivered parallel I/O to bring data off of disk faster. Symmetric multiprocessing and advanced query optimization techniques allow queries to be divided among available processors so more CPU resource can be applied to make queries run faster. All of these functions have been delivered for V3R1. Coming in the second half of 1996 is DB2 Multisystem, which allows multiple systems to be used to speed up queries.
Will all of this be enough? Probably not. How will DB2 for OS/400 compete? Well, the two types of database really don't compete with one another. Even though multidimensional databases are fast, they can't hold much data. The conventional limit is 10 to 20GB. Some can go larger, but none can compete with a relational database that holds hundreds of gigabytes.
Because of this storage limitation, multidimensional databases are used mostly as datamarts, while relational databases are used as data warehouses.