Beyond Monitoring - Measuring the effectiveness of a Data Warehouse - I
Posted On April 20, 2008 by Rose Mary filed under Database
Successful measurement is the key to warehouse quality and effectiveness. In Part-1 of this article, the author discusses different types of metrics that help assess whether a data warehouse meets user requirements.
Quality generally means meeting customer expectations but not exceeding them. This is because it costs more to get to a higher level of quality, but getting there doesn't necessarily give you a higher value or return. Quality and Effectiveness of a Data Warehouse refers to a bigger picture, the act of measuring the progress of the data warehouse in terms of its ability to satisfy the business goals. In the Data Warehousing world, Quality and Effectiveness of Data Warehouse is more often related to the data quality. Ensuring Data Quality is a major KPI (key performance indicator), but there are additional considerations in terms of Non Functional Factors, which should be evaluated as well.
“Why measure?”
“Without measuring, everything is subjective. You don't know if you're improving or degrading over time.” Data warehousing is really a process; so process-oriented measures should be used. “These are opposed to product measures, which might measure volumes of data or number of reports pulled from the data warehouse.” These individual measures feed into larger sets of metrics, which are surrounded by an overall data warehouse quality and effectiveness program.
It's critical to understand where the business value is in data warehousing. A data warehouse is a solution for businesspeople to make decisions and take the right action. Given this definition, the primary measurement is whether you have business impact as a result of the warehouse. It's not something that you want to do just for the sake of doing it; there has to be proactive actions to ensure that the warehouse is functioning within the predefined tolerance parameters and any variation to these standards are recorded, analyzed and escalated promptly. The key here is to move beyond the role of monitoring to proactively managing the data warehouse environment.
How can we understand what is happening at the data level of the operations and use that information to improve and ensure performance and end-user satisfaction?
The answer is in three simple steps:
· Measure
· Manage
· Improve
It is difficult to manage when we don’t know what we don’t know! Since we can’t improve what we can’t manage, and we can’t manage what we can’t measure, we must first focus on measuring the complex data environment. Then we can use the measurements to gain insight into steps that can be taken to improve performance, control costs, increase efficiency and do proactive planning.
The First Step - Measure
Successful measurement is the key to warehouse quality and effectiveness, but how do you measure a data warehouse? Following are a few high-level success indicators:
· Economic success - the data warehouse has a positive impact on the bottom line.
· Political success - people realize the potential of the data warehouse and are eager to use it. If the data warehouse isn't used, it's obvious that it has failed politically.
· Technical success - this is the easiest to accomplish. However, don't overwhelm the users with too much technology. Success also means that the chosen technologies are appropriate for the task and are applied correctly.
All of the above success indicators directly or indirectly map to the business goals of the organization.
Without metrics, we have no way of knowing whether we are running a data warehouse that meets user requirements, achieving business goals constantly, and can continue to do so going forward. Metrics play a vital role in assessing data warehouse success, recommending standards or Service Level Agreements (SLA) and establishing accountability and responsibility for taking action to correct situations that are out of compliance with the standards, and recommends how to represent the results of the measurements to management.
Which metrics are most useful to measure the Quality and Effectiveness of the data warehouse?
To measure performance, we need external metrics in addition to the internal ones.
What do we mean by internal and external?
· Internal Metrics – These focus primarily on the internal workings of the operating system and the database, and include metrics such as cache hit rates, CPU usage, disk I/O activity, system availability and so forth. These make up the core of traditional monitoring and constitute a large part of the Non Functional Areas.
· External Metrics – These focus outside the operating system and the database, on the interaction of users with the data, such as who is using what data, when, how, how long, how often, and from what applications. This allows us to understand how the business is using the data in order to proactively manage and provide the best performance possible. These make up the core of the Functional areas directly related to the business goals of the data warehouse.

Figure1
Figure-1 depicts a hierarchical tree representation of the Non Functional Requirements (NFRs). These, if measured, will give us an indication of the quality and effectiveness of the data warehouse. Internal and External metrics needs to be defined to measure these NFRs. Table-1 explodes the root level NFRs to more detailed measurable factors.


Table-1
Types of Metrics
Data Acquisition measures
Traditionally a Push or Pull architecture is employed in the data warehouse population architecture. If there are third party data source providers involved in providing the data feeds, it becomes much more important to bring forward accountability in the form of SLAs (Service Level Agreement). The primary reason for this is, if there is a delay in receiving the data feeds from the source provider, there will be a cascading effect on the entire data population window to the target tables and subsequently impacting the business users.
Feed files arrival time and subsequent archival of these feed files and kickoff of the main ETL (Extract, Transform, Load) processing streams needs to be monitored and tracked religiously. See Table-2 and Figure-2.


ETL processing measures
It is very important to capture the ETL Processing (Extract, Transform, Load) measures as this serves as the key driver to populate the data warehouse target tables. See Table-3.

All occurrences of ETL job related issues needs to be captured in a separate metrics. This will give an indication of the data quality as well as the effort spent in corrective actions. See Table-4.

Staging/Target Area measures
This metrics captures the staging area population statistics. Appropriate comments also need to be captured, in case there are failures in meeting the SLAs. See Table-5.

In traditional data warehousing frameworks, most of the transformations are done during the staging area population; separate metrics also need to be captured for the following data quality errors:
· Constraint Violations – Key/Un-key errors, Lookup failures
· Boundary condition violations – Values in records are more than the predefined ranges
· Incomplete Records – Key fields not having meaningful data
Data Warehouse Performance and Capacity Planning measures
Data Warehouse infrastructure needs is a continuously evolving process. The initial deployment is very much limited in scope; in most cases this is done taking few users into account and concentrated on a department. As the usage of the data warehouse takes a widespread acceptance, more and more users start using the data warehouse, thereby increasing the client base, the data acquisition and storage expands to other departments as well. It is also necessary to capture and analyze these growth factors and plan for the data warehouse capacity ahead of time. See Figure-3.

Analytical Queries do memory intensive calculations and have a direct impact on the performance. In most cases it is seen that users execute queries without realizing the potential impact on the system performance. It is essential to capture the CPU performance and the long-running-user-queries statistics and bring it to senior management’s notice. It also helps to proactively monitor the data warehouse performance and take corrective actions by assisting users in writing the right queries against the desired data set without impacting the system. See Figure-4.

Part-2 of this article covers the next steps – Manage and Improve, that can be taken to improve performance, control costs, increase efficiency and do proactive planning.
