To effectively oversee your company's big data, your first steps are knowing what types of big data exist and then determining how to handle key areas, such as metadata, privacy, and data quality.
This article is excerpted from Big Data Governance: An Emerging Imperative.
There are five types of big data and seven information governance disciplines. This chapter provides a framework so that organizations can tailor their governance programs by big data type, information governance discipline, industry, and function.
Data Governance Framework Dimensions
The framework for big data governance consists of three dimensions:
- Big data types - Big data governance needs a heightened focus on the data itself. We have classified big data into five distinct types: Web and social media, machine-to-machine, big transaction data, biometrics, and human-generated.
- Information governance disciplines - The traditional disciplines of information governance also apply to big data. These disciplines are organization, metadata, privacy, data quality, business process integration, master data integration, and information lifecycle management.
- Industries and functions - Big data analytics are driven by use cases that are specific to a given industry or function. Given space limitations, we have only included a handful in Figure 1. Big data analytics can be leveraged by many other industries and functions, including marketing, risk management, customer service, information security, information technology, and human resources.
Figure 1: A framework for big data governance
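One way to make the framework concrete is to treat the first two dimensions as a grid in which every pairing of a big data type and a governance discipline can hold its own policies. The sketch below is only an illustration under that assumption: the class names, the `empty_policy_grid` helper, and the sample policy text are hypothetical, and the industry and function dimension is left out for brevity.

```python
from enum import Enum
from itertools import product


class BigDataType(Enum):
    WEB_AND_SOCIAL_MEDIA = "Web and social media"
    MACHINE_TO_MACHINE = "Machine-to-machine"
    BIG_TRANSACTION_DATA = "Big transaction data"
    BIOMETRICS = "Biometrics"
    HUMAN_GENERATED = "Human-generated"


class Discipline(Enum):
    ORGANIZATION = "Organization"
    METADATA = "Metadata"
    PRIVACY = "Privacy"
    DATA_QUALITY = "Data quality"
    BUSINESS_PROCESS_INTEGRATION = "Business process integration"
    MASTER_DATA_INTEGRATION = "Master data integration"
    INFORMATION_LIFECYCLE_MANAGEMENT = "Information lifecycle management"


def empty_policy_grid():
    """Return one empty policy list per (data type, discipline) pair."""
    return {cell: [] for cell in product(BigDataType, Discipline)}


grid = empty_policy_grid()
# Hypothetical entry, paraphrasing a policy area named in this article.
grid[(BigDataType.MACHINE_TO_MACHINE, Discipline.PRIVACY)].append(
    "Define acceptable use of geolocation and RFID data"
)

# Cells that still have no policy show where the program has gaps.
uncovered = [cell for cell, policies in grid.items() if not policies]
print(len(grid), "cells,", len(uncovered), "without a policy yet")
```

Listing the empty cells gives a governance council a simple checklist of the type and discipline combinations it has not yet addressed.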
Big Data Types
As shown in Figure 2, big data can be broadly classified into five types. Let's consider each type in more detail:
Figure 2: Big data types
2. Machine-to-machine data - Machine-to-machine, or M2M, refers to technologies that allow both wireless and wired systems to communicate with other devices. M2M uses a device such as a sensor or meter to capture an event such as speed, temperature, pressure, flow, or salinity. This event is relayed through a wireless, wired, or hybrid network to an application that translates the captured event into meaningful information. M2M communications create the so-called "Internet of Things." The big data governance program needs to establish a number of policies around M2M data. For example, the program needs to draw up guidelines around the acceptable use of geolocation and RFID data that can be used to build a profile of individuals and potentially violate their privacy. The program needs to establish retention policies around the massive volumes of M2M data, which can easily overwhelm IT budgets if not properly controlled. The big data governance program needs to address any data quality concerns such as RFID read rates in environments with high moisture content and lots of congestion. Finally, the big data governance program needs to secure the Supervisory Control and Data Acquisition (SCADA) infrastructure from vulnerability to cyberattacks.
3. Big transaction data - This includes healthcare claims, telecommunications call detail records (CDRs), and utility billing records. Big transaction data is increasingly available in semi-structured and unstructured formats. Information governance challenges such as metadata, data quality, privacy, and information lifecycle management also apply to this data.
4. Biometrics - Biometric recognition, or biometrics, refers to the automatic identification of a person based on his or her anatomical or behavioral characteristics or traits. Anatomical data is created from the physical characteristics of a person including a fingerprint, an iris, a retina, a face, an outline of a hand, an ear shape, a voice pattern, DNA - even body odor. Behavioral data includes handwriting and keystroke analysis. Advances in technology have vastly increased the available biometric data. Law enforcement, the legal system, and intelligence agencies have been using this information for a long time. However, biometric data is increasingly available in the commercial arena, where it can be combined with other types of data such as social media. This opens up new business opportunities as well as several governance issues relating to privacy and data retention.
5. Human-generated data - Human beings generate vast quantities of data, such as call center agents' notes, voice recordings, email, paper documents, surveys, and electronic medical records. This data might contain sensitive information that needs to be masked. It might also contain insights that can improve the quality of structured data sets and be integrated with master data management (MDM). In addition to dealing with these issues, organizations need to establish policies regarding the retention period for this data to adhere to regulations and manage storage costs.
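As a rough illustration of the masking mentioned for human-generated data, the sketch below replaces email addresses and long digit runs in a free-text note with placeholder tokens. The patterns, the `mask_note` function, and the sample note are assumptions made for this example; a real program would need reviewed, locale-specific rules and would likely cover many more kinds of sensitive data.

```python
import re

# Illustrative patterns only; they are not a complete definition of sensitive data.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Nine or more digits, optionally separated by single spaces or hyphens.
LONG_DIGITS = re.compile(r"\b\d(?:[ -]?\d){8,}\b")


def mask_note(text: str) -> str:
    """Replace email addresses and long digit runs with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_DIGITS.sub("[NUMBER]", text)


note = "Customer jane.doe@example.com called from 555-867-5309 about claim 42."
print(mask_note(note))
# prints: Customer [EMAIL] called from [NUMBER] about claim 42.
```

Short numbers such as the claim reference are left alone, which keeps the note useful for analytics while hiding the values most likely to identify a person.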
Information Governance Disciplines
The seven core disciplines of information governance also apply to big data:
1. Organization - The information governance organization needs to consider adding big data to its overall framework, including the charter, organization structure, and roles and responsibilities. The information governance council might seek new members who can provide a unique perspective on big data, such as data scientists. It might also decide to appoint stewards for social media, RFID, and other types of big data. Finally, the information governance program might add responsibilities to the job descriptions of existing stewards. For example, the customer data steward might be accountable for the Twitter handles and Facebook accounts within the master data repository.
2. Metadata - The big data governance program needs to integrate big data with the enterprise metadata repository. This involves the following activities:
- Include big data terms within the business glossary. For example, add the term "unique visitor" to support clickstream analytics.
- Import technical metadata from Hadoop into the metadata repository.
- Ensure that the data lineage administrator is able to import flows from Hadoop into the technical metadata repository.
- Manage data lineage and impact analysis within the big data environment.
3. Privacy - As far back as 1890, Louis Brandeis (later a justice of the United States Supreme Court) and Samuel Warren published an article called "The Right to Privacy" in the Harvard Law Review. This article defined privacy as the "right to be let alone." Subsequent regulations and legislation around the world have formalized and expanded this theory of privacy. Big data governance needs to identify sensitive data and establish policies regarding its acceptable use. These policies need to consider regulations that vary by big data type, industry, and country. Given the many headlines on the subject, the big data governance program needs to establish guidelines regarding the acceptable use of social media and geolocation data, if applicable.
4. Data quality - Data quality management is a discipline that includes the methods to measure, improve, and certify the quality and integrity of an organization's data. Because of its extreme volume, velocity, and variety, big data quality needs to be handled differently than traditional data types. For example, big data quality might need to be handled in real time and address issues relating to semi-structured and unstructured data. Big data needs to be "good enough" because poor data quality does not necessarily impede the analytics that are required to derive business insights.
5. Business process integration - The program needs to identify key business processes that require big data. The program then needs to define key policies to support the governance of big data. For example, drilling and production are key processes within oil and gas. The big data governance program must establish policies around the retention period for sensor data such as temperature, flow, pressure, and salinity on an oil rig. Not only is this data costly to store, but it might also be required by regulators to justify an operator's actions in case of an oil spill. In another example, a retailer might establish a policy that it will access a customer's Facebook profile, including his or her list of friends, only if it has obtained informed consent via a Facebook app. The retailer will obtain the informed consent from the customer in exchange for discounts on certain products as part of an overall loyalty program.
6. Master data integration - The big data governance program needs to establish policies regarding the integration of big data into the master data management environment. As discussed above, a retailer needs to first define policies for the acceptable use of social media. The retailer then needs to deploy the appropriate data stewardship policies and tools to determine if the "Susie Smith" on Facebook is the same as the "Susan Smith" in the customer master.
7. Information lifecycle management - Because of the massive increase in big data volumes, organizations will be challenged to understand the regulatory and business requirements that determine what data to retain in operational and analytical systems, what data to archive, and what data to delete. Without a high level of specificity regarding the legal and regulatory obligations of information, IT must manage all data as if it had high value and ongoing obligations, or the company faces a very high risk of improper disposal. With IT budgets continuing to be under pressure, over-managing information is a gross waste of capital resources. The program needs to expand the retention schedule to include big data based on regulations and business needs. The big data governance team needs to create pointers to the physical repositories of big data to facilitate records retention and e-discovery activities. The big data governance program needs to leverage compression and archiving policies, tools, and best practices to reduce storage costs and to improve application performance. Finally, the organization needs to defensibly dispose of big data that is no longer required based on regulations and business needs.
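The retain, archive, or dispose decision described under information lifecycle management can be sketched as a lookup against a retention schedule. Everything specific here is an assumption for illustration: the data class names, the retention periods, the `legal_hold` flag, and the `disposition` function are hypothetical and do not reflect any actual regulation.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RetentionRule:
    archive_after_days: int  # move out of operational systems after this age
    dispose_after_days: int  # eligible for defensible disposal after this age


# Hypothetical schedule; the periods are placeholders, not regulatory guidance.
SCHEDULE = {
    "sensor_readings": RetentionRule(archive_after_days=90, dispose_after_days=365 * 7),
    "call_center_notes": RetentionRule(archive_after_days=180, dispose_after_days=365 * 3),
}


def disposition(data_class: str, created: date, today: date, legal_hold: bool = False) -> str:
    """Return 'retain', 'archive', or 'dispose' for one data set."""
    rule = SCHEDULE.get(data_class)
    if rule is None or legal_hold:
        # Unclassified data and data under legal hold are kept as-is.
        return "retain"
    age_days = (today - created).days
    if age_days >= rule.dispose_after_days:
        return "dispose"
    if age_days >= rule.archive_after_days:
        return "archive"
    return "retain"


today = date(2024, 1, 1)
print(disposition("sensor_readings", today - timedelta(days=30), today))      # retain
print(disposition("sensor_readings", today - timedelta(days=120), today))     # archive
print(disposition("call_center_notes", today - timedelta(days=1500), today))  # dispose
print(disposition("call_center_notes", today - timedelta(days=1500), today, legal_hold=True))  # retain
```

Note that data with no entry in the schedule is always retained, which mirrors the point above: without specific obligations on record, everything has to be managed as if it had high value.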