- Allow for rapid growth and the rapid addition of new "solutions"
- Create an architecture in which the business entities and rules are adhered to across "consumers"
- Ensure "one version of the truth"
- Add value to the business rapidly
Let's begin with a few simple definitions.
- Business intelligence (BI)—A business management term referring to applications and technologies that are used to gather, provide access to, and analyze data and information about company operations.
- Data warehouse (DW)—The main repository of an organization's historical data, containing the raw material for decision-support systems. Furthermore, the repository is subject-oriented, time-variant, non-volatile, and integrated to support strategic analysis and reporting.
- Data mart—A specialized version of a data warehouse predicated on a specific, predefined need for a certain grouping and configuration of select data. A data mart emphasizes easy access to relevant information.
- Operational data store (ODS)—A repository of an organization's transactional data, containing the raw material for decision-support systems. Furthermore, the repository is subject-oriented, volatile, and integrated in support of tactical analysis and reporting.
- Extract/Transform/Load (ETL)—A process in data warehousing that involves extracting data from outside sources, transforming it to fit business needs, and loading it into the data warehouse.
- Online Transaction Processing (OLTP)—A class of systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing. Sometimes referred to as the line-of-business "operational system."
- Online Analytical Processing (OLAP)—An approach to quickly providing answers to analytical queries that are multidimensional in nature; for example, sales figures by customers located in Minnesota for the year 2007.
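The OLAP example above ("sales figures by customers located in Minnesota for the year 2007") can be expressed as a simple aggregate query. The sketch below uses Python's built-in `sqlite3` module with a hypothetical `sales` table; the table name, columns, and figures are invented for illustration.

```python
import sqlite3

# Hypothetical sales data illustrating the kind of multidimensional
# question OLAP answers: total sales by geography and time.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (
        customer TEXT,
        state    TEXT,
        year     INTEGER,
        amount   REAL
    );
    INSERT INTO sales VALUES
        ('Acme Co',  'MN', 2007, 1200.00),
        ('Acme Co',  'MN', 2006,  900.00),
        ('Borealis', 'MN', 2007,  450.00),
        ('Cascade',  'WA', 2007, 2100.00);
""")

# Slice along two dimensions: geography (state) and time (year).
row = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE state = 'MN' AND year = 2007"
).fetchone()
print(row[0])  # total 2007 sales for Minnesota customers: 1650.0
```

A production OLAP tool answers many such slice-and-dice questions interactively, but each one reduces to this pattern of filtering and aggregating across dimensions.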
What the BI Architecture Often Looks Like
When first implementing a BI application, many companies focus only on getting data out of the OLTP system and placing it into a data mart as quickly and easily as possible, totally ignoring a broader architecture that supports growth and/or the addition of more data marts and consumers. This quick and dirty solution usually consists of extracting transactional data from the operational system on a periodic basis, performing some transformations and propagating the data to the data mart (see Figure 1).
This is sufficient and acceptable if no other data marts are required. Experience tells us that this is rarely the case.
Figure 1: A simple BI architecture with only one data mart benefits only a few users.
Once the word is out that the first data mart is operational and consumers are receiving benefit from the new source of information, additional data marts will be requested. Of course, the seemingly easiest way to provide the new marts will be to replicate the same design and implementation that was used for the first mart. This approach is shown in Figure 2.
Figure 2: A BI architecture with multiple data marts is better but still not as good as it could be.
Herein lies the problem. This "ad hoc" methodology of creating data marts without an overall architecture has disadvantages:
- ETL and cleansing of data require too many processes.
- Duplicate processing burdens the operational system.
- You're likely to have inconsistent data across data marts.
- Additional data marts are increasingly more expensive to create.
Of course, there are some advantages as well, the largest being that you will not have to get the various data mart users to agree on common business entities, because each consumer will have its own version of the truth!
What the BI Architecture Should Look Like
A more robust business intelligence architecture will make use of a "data warehouse," a structure that acts as an independent repository for data that is used for strategic analysis, query, and reporting. Furthermore, the architecture might also include an operational data store (ODS), a structure that acts as an independent repository for transactional data used for tactical query and reporting. This is illustrated in Figure 3.
Figure 3: The recommended BI architecture includes a data warehouse and an ODS.
There are inherent advantages in implementing BI solutions using this architecture:
- Extraction, cleansing, and transformation done once for all data
- Rapid data mart creation
- Consistent business entities across all data marts
Additional data marts can source their data from the data warehouse, which helps ensure consistent processing for all data—i.e., one version of the truth. As a new data mart is designed, you will find that much of the data required for it will already be in the warehouse. In other words, by eliminating duplicate processing and pulling the data once from the operational systems, undue work is minimized, making the whole system more efficient.
Of course, one of the disadvantages of using such a robust architecture is that all data mart users will need to agree on consistent definitions of business entities and rules. This will result in more work for the warehouse architects, work that will be both technical and political in nature.
Warning! When laying out a BI architecture, be prepared for complaints such as these:
- "It will take forever to model the whole business. We won't ever finish the warehouse."
- "The warehouse will contain data that's not used. Just think of all the extra disk space that will be needed."
- "I can implement the data mart by itself a lot faster."
To counter these complaints, remember that the modeling activity includes only business entities for the data mart that is in progress, not the entire enterprise. By using the focused data model, only the data required to fulfill the current requirement is extracted from the operational system and placed into the data warehouse. And yes, it is very possible that there will be a lot of data stored in the data warehouse and data marts, especially considering that the requirements placed on the business intelligence system can be different from those of the transactional system. For example, the OLTP system might require only two years' worth of data, while the BI system requires five years' worth of data in support of trending and macro analysis. Why burden the OLTP system with five years of data?
As for the common declaration that "I can implement the data mart by itself a lot faster"...well, we call this guerrilla data warehousing! All of the BI concepts are used—extract data, transform data, load data, store data, query data—but it's done on a personal or local basis. This promotes multiple versions of the truth. And you have to consider the security issues surrounding the storage of vital business data in disparate locations around the enterprise. We are certain that every business has hundreds of Excel spreadsheets scattered around, masquerading as data marts.
What about the idea of "virtual data warehousing"? Can we just query the operational data directly using complex SQL requests? While this is certainly possible given the latest advances in database management system technology as well as the high-performance systems available, it is not recommended. In our experience, asking the operational system to meet the requirements of both a transactional system and business intelligence system is a tall order. At some point, the requirements will diverge and it will be more efficient and economical to have separate systems.
Based on the BI architecture described herein, let's look at one methodology you can use to design and implement the solution. Again, this represents just one methodology.
Overall Steps in the Methodology
This methodology includes the various steps needed to design, build, test, and monitor the BI system (Figure 4).
Figure 4: Follow these steps to design and implement the recommended solution.
When designing and building the system, you can combine some steps or perform them in parallel, while other steps will depend on previous steps. Many of the steps and sections of the methodology are iterative in nature: as you go through the process, it might be advantageous, or even required, to go back and work through a series of steps again, taking newfound information into account. Designing and building a business intelligence system is a process of discovery. Some of the discoveries will feed back into the design and implementation of the current solution. Other discoveries should be categorized and used as input for the next project.
It is also important to realize that good project management and some technical education will be required throughout all phases. And remember, the knowledge and experience gained will be utilized during the next round of data marts.
Define Requirements
Properly defining, identifying, and understanding the requirements is the most important step. Given that both business requirements and technical requirements are involved, participants in this activity include business leaders, technical leaders, key end users, and a decision-maker. The output of this step includes the following:
- High-level requirements, both technical and business
- A list of potential projects, with the first project selected
- A feasibility study
- Identification of project participants, both technical and business
- Project scope
- Skills and education required to move forward
During the requirements gathering, it will be very tempting to keep working until all the requirements are identified and agreed upon. This can be a trap we call "analysis paralysis." Thus we use the 80/20 rule: Gather approximately 80 percent of the requirements and start moving forward. The other requirements will fall into place, or you will discover the real requirements in a future step.
Create/Refine Data Model
After requirements are captured, the process of creating and refining a data model can begin. A data model is an understanding of how the entities in a business are defined, what they look like, and how they relate to one another. This step allows the data warehouse project team to gain an understanding of the business rules, to control redundancy, and to come to an agreement on common definitions and entities. This process will involve both the technical and business leaders.
The output of this exercise does not have to be too detailed, nor take months of effort. Do not attempt to model the entire enterprise; only model areas and entities specific to this project's scope—i.e., this data mart.
Design Data Warehouse
Based on the requirements and high-level modeling, it is now possible to create the logical and physical database design. The design will also be based on any data mart prototyping, provided this step was done previously.
In general, a data warehouse model is based on getting data into the warehouse as quickly and efficiently as possible, storing large quantities of detailed data, and getting data out of the warehouse and into the data marts. These requirements usually lead to a more normalized data model.
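A normalized warehouse design of the kind described above can be sketched with Python's built-in `sqlite3` module. The entity and column names below are hypothetical; the point is that each business entity lives in its own table, and detail rows reference entities by key rather than repeating descriptive attributes.

```python
import sqlite3

# A minimal sketch of a normalized warehouse design (hypothetical entities).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        state       TEXT NOT NULL
    );
    CREATE TABLE product (
        product_id  INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE sale (                      -- detailed, time-variant facts
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer,
        product_id  INTEGER NOT NULL REFERENCES product,
        sale_date   TEXT NOT NULL,           -- history is kept, not updated
        amount      REAL NOT NULL
    );
""")
conn.execute("INSERT INTO customer VALUES (1, 'Acme Co', 'MN')")
conn.execute("INSERT INTO product VALUES (10, 'Widget')")
conn.execute("INSERT INTO sale VALUES (100, 1, 10, '2007-03-15', 250.0)")

# Descriptive attributes are stored once and joined in when needed.
row = conn.execute("""
    SELECT c.name, p.name, s.amount
    FROM sale s
    JOIN customer c ON c.customer_id = s.customer_id
    JOIN product  p ON p.product_id  = s.product_id
""").fetchone()
print(row)  # ('Acme Co', 'Widget', 250.0)
```

Because each attribute is stored once, loads into the warehouse are fast and redundancy is controlled, at the cost of joins when the data is read back out.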
The output of this step includes these items:
- Logical data model for the data warehouse
- Physical data model for the data warehouse (tables, columns, etc.)
- Information for the specific data mart addressed in this project
During the data warehouse design process, consider future growth in terms of more rows as well as additional columns to support future requirements.
Select Data Mart Tools
As soon as the requirements are gathered, you can start to evaluate and select what tool the end users will use to access, analyze, and visualize the information. In other words, you can start to evaluate data mart tools. This is an exciting step, but it's also fraught with problems. The excitement results from actually seeing new and sophisticated products. The problems arise because every vendor demos well, every tool looks great, and it's easy to get confused and languish in analysis paralysis.
When evaluating end user query, analysis, and reporting products, it is very important to remember and make use of the requirements you have identified. When considering tools, keep in mind these influencing factors:
- What type of problem are you solving (OLAP, data mining, reporting, or simple querying)?
- What are the performance expectations?
- What is the realistic number of concurrent users?
- What are the various types of users (power users vs. report "lookers")?
- What is the anticipated ease of use, ease of learning, and ease of configuration and maintenance?
Another important factor is the budget. It makes no sense to evaluate $100,000 tools when the budget is $10,000. On the other hand, if your business problem requires a $100,000 solution, then by all means get it.
Is it acceptable to change query and reporting tools or to have more than one? The answer rests in meeting requirements. If the requirements dictate more than one tool, this is acceptable. If you discover that the current tool no longer meets the requirements, it is acceptable to change. The goal is to provide business value and a good return on investment (ROI).
Design Data Mart
Once the analysis, query, and reporting tool is identified, you can work on the logical and physical design of the data mart. In many cases, the tool selected drives the design. For example, online analytical processing (OLAP) tools may lead to a star or snowflake schema or may have their own data structure not reliant upon a relational database system. In most cases, query and reporting solutions will lead you to a relational database design. Sophisticated data mining tools may lead you to something completely different in terms of data structure. The design will be dictated by the tooling and algorithms (i.e., the type of mining).
For relational database structures such as star or snowflake schemas, the data mart design process converts requirements into dimensions and facts. The process also converts data into consumable information.
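The star schema pattern described above can be sketched as follows, again using `sqlite3` with hypothetical table and column names: a central fact table of additive measures surrounded by descriptive dimension tables.

```python
import sqlite3

# Star schema sketch (hypothetical names): one fact table, two dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY,
                               year INTEGER, month INTEGER);
    CREATE TABLE dim_customer (cust_key INTEGER PRIMARY KEY,
                               name TEXT, state TEXT);
    CREATE TABLE fact_sales (
        date_key INTEGER REFERENCES dim_date,
        cust_key INTEGER REFERENCES dim_customer,
        amount   REAL                          -- additive measure
    );
    INSERT INTO dim_date     VALUES (20070101, 2007, 1), (20070201, 2007, 2);
    INSERT INTO dim_customer VALUES (1, 'Acme Co', 'MN'), (2, 'Cascade', 'WA');
    INSERT INTO fact_sales   VALUES (20070101, 1, 100.0),
                                    (20070201, 1, 150.0),
                                    (20070101, 2,  75.0);
""")

# A requirement like "sales by state by year" becomes a join from the
# fact table out to the relevant dimensions, grouped on their attributes.
rows = conn.execute("""
    SELECT c.state, d.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer c ON c.cust_key = f.cust_key
    JOIN dim_date     d ON d.date_key = f.date_key
    GROUP BY c.state, d.year
    ORDER BY c.state
""").fetchall()
print(rows)  # [('MN', 2007, 250.0), ('WA', 2007, 75.0)]
```

The dimensions carry the business vocabulary users query by (state, year); the fact table carries the numbers they aggregate.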
Prototype Data Mart
Once the data mart is designed and the data structures are defined, you can implement a functional prototype using the tool selected. While the data model is realistic, the data doesn't have to be. Generating some test data or a subset of current data is enough to demonstrate the tooling. During the prototyping and testing, you have a first chance to get good sizing information for the data mart.
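Generating prototype data of the kind described above is straightforward. The sketch below fills a hypothetical fact table with synthetic rows: the schema is realistic, but every value is made up, which is all a tooling demonstration needs.

```python
import random
import sqlite3

# Sketch of loading a prototype mart with generated test data
# (hypothetical schema; all values are synthetic).
random.seed(0)  # repeatable demo data
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_sales (cust_key INTEGER, sale_date TEXT, amount REAL)"
)

rows = [
    (random.randint(1, 50),                   # one of 50 fake customers
     f"2007-{random.randint(1, 12):02d}-01",  # a fake month in 2007
     round(random.uniform(10, 500), 2))       # a fake sale amount
    for _ in range(1000)
]
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print(count)  # 1000 demo rows -- enough to exercise the reporting tool
```

Row counts from exercises like this also feed the first sizing estimates for the mart.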
This step is very exciting yet sometimes frustrating. The end users finally get to see what they will be using, so this is an opportunity for them to provide constructive feedback. New requirements might come up, and existing requirements may be confirmed or denied. It's time to finalize requirements and avoid enlarging the scope of the current project. Unmet requirements can be captured and used as input for the next iteration or project.
This is also an opportunity to get excited about the BI project and continue to drive sponsorship and gather support.
Source Data Warehouse Data
With the data warehouse designed, it's time to determine the operational sources of the warehouse data. In other words, map source data to warehouse data. This activity includes these steps:
- Determining the best source for each data element
- Determining the best extraction method for the data
- Determining what transformation and cleansing must be done to the data
- Setting up routines/tools to extract, move, transform, and load the data
- Defining error-handling procedures to track, report, and fix invalid data
- Implementing appropriate measures for data-quality tracking
During this process, it is important to keep in mind performance requirements to meet the operational batch window and data warehouse refresh requirements. If these requirements were not identified in the beginning, they will come out now.
During this step, the extract, transform, load (ETL) process and tooling will be evaluated and tested. In our experience, the ETL area represents about 70 percent of the effort in implementing a BI solution.
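The sourcing steps above can be sketched in miniature: extract raw rows, transform and cleanse them, route invalid rows to an error list for follow-up, and load the clean rows. The source layout and table below are hypothetical.

```python
import sqlite3

# Minimal ETL sketch (hypothetical source layout and warehouse table).
raw_rows = [                       # "extracted" from a hypothetical source
    {"cust": " Acme Co ", "amount": "250.00"},
    {"cust": "Borealis",  "amount": "not-a-number"},  # invalid amount
    {"cust": "Cascade",   "amount": "75.50"},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wh_sale (customer TEXT, amount REAL)")

errors = []
for row in raw_rows:
    try:
        customer = row["cust"].strip()    # cleanse: trim stray whitespace
        amount = float(row["amount"])     # transform: text -> number
    except (KeyError, ValueError) as exc:
        errors.append((row, str(exc)))    # track invalid data for repair
        continue
    conn.execute("INSERT INTO wh_sale VALUES (?, ?)", (customer, amount))

loaded = conn.execute("SELECT COUNT(*) FROM wh_sale").fetchone()[0]
print(loaded, len(errors))  # 2 rows loaded, 1 rejected for follow-up
```

Real ETL tooling adds scheduling, restartability, and volume, but the shape is the same: every rejected row is tracked, not silently dropped, so data quality can be measured and fixed at the source.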
Source Data Mart Data
Given the data mart model, it is now possible to define and design the mapping of data elements to their source in the data warehouse. During this phase, the data mart model can be refined to include summarizations and calculated fields.
Some data mart designs may require an intermediate staging or load file to be created. The loading of data mart data can be set up using similar ETL strategies as the data warehouse, allowing for automation as well as data integrity and quality tracking. The timetable for loading new data will be based on the requirements gathered previously. If the users need weekly sales information, the data mart can be populated on a daily or weekly basis. On the other hand, if the users need daily sales information for yesterday, then the data mart load process must occur at least on a daily basis.
Tune Data Warehouse and Data Mart
Once the data warehouse and data mart are populated with the appropriate data, is it time to throw the BI project over the wall and let the users start working?
Of course not! One of the most important steps must occur first: namely, tuning the data warehouse and data mart. During this time, refinements can be made to ensure the data is refreshed in a timely and accurate manner. It is also important to evaluate performance of the ETL system, as well as the analysis, query, and reporting mechanisms. This will involve implementing a proper indexing strategy and possibly implementing materialized query tables. It will be very important to watch for usage behavior to determine if the initial requirements are still accurate. For example, is the data mart easy to use, and is it providing the proper information? One approach is to allow a few key users to take the solution out for a test drive. It will be advantageous to get their unbiased feedback before turning the solution over to the entire user community.
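One tuning step mentioned above, indexing for a frequent query pattern, can be sketched as follows. The table and index names are hypothetical; the useful habit is checking the query plan to confirm the optimizer actually uses the new index.

```python
import sqlite3

# Tuning sketch: index a column users constantly filter on, then
# inspect the plan (hypothetical fact table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (cust_key INTEGER, amount REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                 [(i % 100, float(i)) for i in range(10_000)])

# Usage monitoring shows most queries filter on cust_key, so index it.
conn.execute("CREATE INDEX idx_sales_cust ON fact_sales (cust_key)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT SUM(amount) FROM fact_sales WHERE cust_key = 7"
).fetchall()
print(plan)  # the plan detail should mention idx_sales_cust (index search)
```

The same verify-the-plan discipline applies to materialized query tables: build them for the summarizations users actually run, and confirm the optimizer routes queries to them.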
Congratulations! It's time to go live.
Monitor Usage and Maintain Data Warehouse and Data Mart
Once the BI solution is live, is the project over? Absolutely not. Monitoring and maintaining the BI solution are key to providing ongoing value and return on investment. Some of the items to monitor include these:
- Which reports are used and not used
- What new reports are being built
- Which columns are used and not used
- Which summarizations are used
- Which summarizations are calculated at run time
- Which queries are run most frequently
- How performance levels are affected as more users and/or data are added to the system
As the information is consumed, new requirements will surface. The new requirements might be against the existing data mart or might necessitate a different data mart. Keeping a trained eye on the BI solution will help you to determine what to do next.
Moving on to the Next Project
As you complete the BI project and start delivering value, you will start to look to the next project. This is the natural result of success. So, what's different for the next data mart?
It's very likely the next data mart will focus on a different area of the business or enterprise. To support the new data mart, the warehouse data model may need to incorporate new entities or relationships. The data warehouse may need new tables and/or columns added, with sources identified for new data elements. Over time, more and more data will already be present in the warehouse, minimizing the need to source entirely new elements. During this process, it is important to remember to add only the data that is needed for the new mart.
Sourcing the new data mart data may reveal ETL processing that is the same as for previous marts. This allows further refinement of the data warehouse design to store cleansed, transformed, and augmented data.
It is very possible that new and different analytical and visualization tools may be required for the new mart. This might be due to a different set of users or a different business problem to be solved. This is OK, and it reaffirms one of the reasons a robust BI architecture supports multiple data marts.
Some Final Thoughts...
It is said that data warehousing is a journey. This is a suitable observation, given the continual change in business goals, focus, and requirements.
Your business intelligence system is considered successful not when it's operational, but when it's used.