Archive for October, 2011
Tuesday, October 25th, 2011
I was sharing an “October” beer with John Bair (our CTO) last night and I was struck by a few thoughts in our conversation.
Back in college, I always enjoyed reading books by Thomas Kuhn. In 1962, Kuhn authored the book, The Structure of Scientific Revolutions. The book described how “paradigm shifts” occur in the scientific world. The core thought was that new ideas do not always immediately take root and become the new norm. For example, once Einstein proposed relativity, while it was certainly very exciting, it took many confirmations and an extended amount of time for the theory to be accepted. There have to be issues in the existing paradigm to cause people to question the current theories. It takes time to adopt the new, improved theory. A paradigm shift is rarely immediate.
At the same time, I remembered Stephen Jay Gould’s theory of punctuated equilibrium (in evolutionary biology), which came out in the 1970s. In contrast to the idea that evolution was gradual, punctuated equilibrium said that sometimes large, infrequent events shift the slow-moving, evolutionary path. So, the key message is that things may be moving along, something big happens, and suddenly you are in a whole new world.
Well, perhaps data governance is like that. Perhaps data governance is really an underutilized organizational process. When I talk to clients about data governance, the conversation invariably turns to the different aspects of data: Data is missing! Data is dirty! Only the business knows the real business rules! There are data errors!
Those are certainly all good and fun topics to talk about, and those conversations can consume the entire day. However, there is another aspect of data governance we think is important. It’s about program management.
We think of data governance as composed of two tracks: the “data” part of the data governance program and the program management part of the data governance program. The program management part is often the most overlooked. Yes, there is a data governance steering committee and yes, there is a “leader” of the overall daily effort who reports to that committee. But the program management aspect of the data governance program is really a management process not unlike other governance programs such as IT portfolio management or “strategic projects” governance. For example, IT governance often helps with priorities, decisioning, budgeting, and resolving resource issues, helps communicate to other parts of the organization, and bundles scope to form projects/programs.
In the data governance space, we think people often forget this important aspect. A few areas of issues we have observed include:
- Funding: The data governance program should act as a forum for obtaining funding either directly through itself or through other funding mechanisms such as integrating into other projects or proposing in other governance forums.
- Bundling: Data governance should maintain a list of issues and smartly bundle those into projects to be funded and executed. Either direct data governance funding or other business/IT funding could be used.
- Resourcing: For example, perhaps more training is needed for stewards. What resource can help with that task? Do we need to hire a consultant? Do we need to have metrics in place to track attendance and participation? Does HR need to get involved?
- Communicating: The senior people on the committee need to use their organizational influence to help keep the data governance agenda front and center in other parts of the organization.
- Statusing: Perhaps the data governance program needs rejuvenation. How do you get it back on track? The daily governance lead should be identifying program issues that remain unsolved and escalating them for guidance. There should be a “program” type status each month in addition to just talking about data.
We think there are issues today that cannot be easily addressed by the traditional style data governance implementation. Changes are needed. The issues we see and listed above are starting to push the boundaries of the current model. Perhaps it’s time for a paradigm shift.
Data governance can be used as a way to manage funding for data-related programs. It can be used to do more than just discuss daily data issues. It’s time for data governance to evolve instead of always lingering on just the data. It’s time for a landslide to happen and make data governance a real management force.
If a data governance program is only focused on data, it is probably too local to act as a problem solving capability. Let’s change it. Let’s take the traditional data governance program and make it something more relevant. Let’s ensure that the program aspects of data governance can help fund, help bundle and help communicate to the organization.
So what we have is an old thing, like data governance, playing a new role, becoming more relevant to the business and seeding innovation. We have seen issues in the current “theory” that need to be handled, so let’s change the current theory of data governance and ensure that the new model also emphasizes program management. With the paradigm shift in play, let’s start a landslide to kill off the old data governance programs and disrupt the equilibrium.
Ajilitee can help you do that.
Data governance is the new pink.
Wednesday, October 19th, 2011
Many BI efforts focus on testing the initial data load. Only the initial data load. We know BI solutions can behave differently depending on the presence or absence of data in the target. Incremental testing recognizes these varied load conditions.
Are you willing to risk defects after just a couple of incremental production runs? Skip incremental data testing and operations may stop soon after deployment to production. How will your team react to that news? What about the end user community?
The test team must make sure that the functional canary data sets include data suited to incremental testing. The test sets need the right data to act against an empty target and a target that already contains data. System test runs will include a one-two punch consisting of an initial test run followed immediately by an incremental test run.
The best practice is to execute incremental test runs using functional canary and large data sets selected to test multiple target conditions.
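The one-two punch described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `load` function, the dictionary standing in for the target table, and the canary records are all hypothetical, and a real ETL job would be far more involved.

```python
def load(target, batch):
    """Upsert each record into the target, keyed by 'id'.

    Returns (inserted, updated) so the test can check how the load
    behaved against an empty versus an already populated target.
    """
    inserted = updated = 0
    for record in batch:
        if record["id"] in target:
            updated += 1
        else:
            inserted += 1
        target[record["id"]] = dict(record)
    return inserted, updated


# Hypothetical canary records. The initial batch hits an empty target;
# the incremental batch carries one changed record and one new record.
initial_batch = [
    {"id": 1, "status": "open"},
    {"id": 2, "status": "open"},
]
incremental_batch = [
    {"id": 2, "status": "closed"},  # existing key, changed value
    {"id": 3, "status": "open"},    # brand-new key
]

target = {}
initial_counts = load(target, initial_batch)          # run 1: empty target
incremental_counts = load(target, incremental_batch)  # run 2: populated target

print("initial run (inserted, updated):", initial_counts)
print("incremental run (inserted, updated):", incremental_counts)
```

The point of the second run is that it exercises the update path, which an initial-load-only test never touches.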
The last three blogs answered the question – “What records do we use for system testing?” Functional canary data sets to vet the rules and logic. Large data sets to prove out volume and build user confidence. Incremental data sets to validate the code under working conditions. These data sets collectively demonstrate the determination of your BI team to find problems before the business finds them during UAT. Support the efficacy of your team – and your job – by using all three types of testing.
My upcoming blogs will answer the question – “How and when is system testing done using these carefully crafted data sets?” I’ll focus on the process of running system testing, system test automation, and timing system test setup and execution as part of the development lifecycle.
- Jim Van de Water contributed to this blog.
Category Blog, Business Intelligence, Jim Van de Water, Steve Knutson | Tags: BI system testing, data warehouse system testing, ETL system testing, functional system testing, incremental system testing, QA testing, testing best practices, testing data warehouse applications,
Friday, October 14th, 2011
Functional canary testing ensures that the solution exercises every business rule, but not that your ETL will work under the pressure of production loads. For the next set of tests, we need to think BIG DATA, as in large numbers of records. We also need to start thinking in terms of aggregates. The confidence of end users is gained when the additive records passing through your processing plant are proven against expected totals.
Your development team will use the so-called “sniff test” to use additive totals to validate the data loads against totals from other production systems. The test queries against these large test data sets will be summary in nature and will help to identify aberrant code and other issues. Later, the business testers will run similar tests during user acceptance testing, but don’t defer issue discovery to the business.
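A “sniff test” of this kind boils down to comparing additive totals from two places. The sketch below assumes the source and target rows are already available as lists of dictionaries; the function names, the `region` and `amount` fields, and the tolerance are illustrative choices, not part of any particular tool.

```python
from collections import defaultdict


def totals_by(rows, key, measure):
    """Sum an additive measure per group key."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[measure]
    return dict(totals)


def sniff_test(source_rows, target_rows, key, measure, tolerance=0.01):
    """Return {group: (source_total, target_total)} for every group
    whose totals differ by more than the tolerance."""
    src = totals_by(source_rows, key, measure)
    tgt = totals_by(target_rows, key, measure)
    mismatches = {}
    for group in sorted(set(src) | set(tgt)):
        s, t = src.get(group, 0.0), tgt.get(group, 0.0)
        if abs(s - t) > tolerance:
            mismatches[group] = (s, t)
    return mismatches


# Hypothetical data: the target is missing 5 units for the West region.
source_rows = [
    {"region": "East", "amount": 100},
    {"region": "East", "amount": 50},
    {"region": "West", "amount": 75},
]
target_rows = [
    {"region": "East", "amount": 150},
    {"region": "West", "amount": 70},
]

mismatches = sniff_test(source_rows, target_rows, "region", "amount")
print(mismatches)
```

An empty result means the totals agree within tolerance; anything else is a lead for the development team to chase before UAT.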
Here’s another reason not to delay large data set processing until the system moves to production. You don’t want to surprise users with the differences between what they know and what they don’t know – that is, their skepticism with the information in your new system against their almost evangelical faith in the current one. Your team needs to understand the differences and advise on these before UAT begins. If you let the business users find these issues first, convincing them that the new system is right (and the old one wrong or just plain different) is going to take a Herculean effort. You may have been there. I have to confess that I have been. Avoid it if possible.
Big data sets also demonstrate the scalability of your code and test environment(s). Code that performs poorly can be tuned and retested using these large test data sets. You’ll need to separate code issues from infrastructure issues, as in processors, memory, I/O, and the size of both the development and test environments. Projects can have long data processing windows when production-sized loads are tested on shared or undersized environments. Take note of the scaling factor of your test versus production environments.
Let’s restate the best practices we just walked through:
- Our project team needs to find the ‘tipping point’ for the quantity of data required to build user faith in the new system.
- Our ETL testers need to work proactively to find the issues with the new system before our business users find them.
- Our project team needs to work proactively to identify the differences between legacy (proven) systems and new (unproven) systems before discovery by our business users.
- Our technical team needs to understand the influence of code versus environment on test performance.
Here are two techniques I use to help establish a baseline of big data.
Dimensional subsetting can be used when there are big history requirements (e.g., year over year comparison). Consider loading a dimensional subset of data. For example, if you have 5 large and 50 small customers, load the data for 1 large customer and 5 small customers for your large data test. The output will show a full set of metrics for a subset of customers. The business will eagerly sign off on those results.
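As a rough sketch of dimensional subsetting, the snippet below builds the 5 large and 50 small customers from the example and keeps 1 large and 5 small. The record layout, the `pick_subset` helper, and the rule of taking the first customers in id order are all assumptions made for illustration; a real project would choose representative customers with the business.

```python
def pick_subset(customers, per_size):
    """Keep the first N customers of each size class, in id order.

    per_size maps a size class to how many customers to keep,
    for example {"large": 1, "small": 5}.
    """
    remaining = dict(per_size)
    chosen = set()
    for customer in sorted(customers, key=lambda c: c["id"]):
        if remaining.get(customer["size"], 0) > 0:
            chosen.add(customer["id"])
            remaining[customer["size"]] -= 1
    return chosen


# Hypothetical dimension: 5 large customers (ids 1-5), 50 small (ids 6-55).
customers = (
    [{"id": i, "size": "large"} for i in range(1, 6)]
    + [{"id": i, "size": "small"} for i in range(6, 56)]
)

# Hypothetical fact rows: one row per customer per month of a year.
facts = [
    {"customer_id": c["id"], "month": m, "amount": 10 * c["id"]}
    for c in customers
    for m in range(1, 13)
]

subset_ids = pick_subset(customers, {"large": 1, "small": 5})
subset_facts = [f for f in facts if f["customer_id"] in subset_ids]

print(len(facts), "fact rows in total,", len(subset_facts), "in the subset")
```

The subset keeps every month for the chosen customers, so history-based metrics such as year over year comparisons can still be computed in full for them.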
The cycle loading concept considers conflicting performance needs for historical versus cycle load processing. For example, most BI processing systems do not scale linearly. Hourly, daily or weekly load processes may work within service level agreement parameters, but a three-year run of data may simply fail. A good practice is to design to run in cycles, both historical and ongoing. The designers will work to make the processing cycle perform well, and the BI team will have a good sense of how the BI application will perform in production. Tuning the BI application for a single monolithic history build will require extra time and coding that may actually cause regular cyclical processes to perform poorly.
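One way to picture cycle loading is to break a three-year history into monthly cycles and push each one through the same routine used for ongoing loads. In this sketch the monthly cycle size, the date range, and the `run_cycle` placeholder are assumptions; the placeholder only counts days so the example stays self-contained.

```python
from datetime import date


def monthly_cycles(start, end):
    """Yield (cycle_start, cycle_end) pairs covering [start, end),
    one calendar month at a time."""
    current = start
    while current < end:
        if current.month == 12:
            next_month = date(current.year + 1, 1, 1)
        else:
            next_month = date(current.year, current.month + 1, 1)
        yield current, min(next_month, end)
        current = next_month


def run_cycle(cycle_start, cycle_end):
    """Placeholder for the real load of one cycle.

    Here it only reports how many days the cycle covers.
    """
    return (cycle_end - cycle_start).days


# Hypothetical three-year history build, run as 36 monthly cycles.
cycles = list(monthly_cycles(date(2009, 1, 1), date(2012, 1, 1)))
days_loaded = sum(run_cycle(start, end) for start, end in cycles)

print(len(cycles), "cycles covering", days_loaded, "days")
```

Because the history build reuses the regular cycle, tuning effort goes into the process that will run in production every day rather than into a one-off load.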
Large test data sets enable your ETL team to reduce the time required to build confidence in the new system and will help propel your solution through UAT. Planning and creating these test data sets will take time, but ultimately you will gain acceptance of your solution much more quickly.
– Jim Van de Water contributed to this blog.
Category Blog, Business Intelligence, Jim Van de Water, Steve Knutson | Tags: big data, canary test, cycle loading, data warehouse system testing, dimensional subsetting, ETL, ETL system testing, functional system testing, incremental system testing, QA testing, testing best practices, testing data warehouse applications,
Friday, October 7th, 2011
Hadoop is a powerful data delivery alternative to traditional relational database management systems (RDBMS). Unlike the RDBMS, Hadoop does not store data in tables of columns and rows. Hadoop does not use set manipulation (SQL) to process data. Hadoop offers open source, the flexibility to operate on commodity platforms, lightning fast output and virtually linear scalability. No wonder Hadoop is quickly gaining momentum in the BI world.
Hadoop, whose storage layer is the Hadoop Distributed File System (HDFS), can quickly scale data distribution based on the number of available resources. The power of Hadoop lies in the ability to distribute, process, and rapidly return analysis on very large data sets.
Traditional ETL and RDBMS systems process data using set-based transformations. These platforms store data in tabular form for end-user access. These SQL type operations demand high intensity data processing before the data is ready for consumption. Such massively parallel platforms, while fast, are challenged to process and deliver data at a low enough latency to meet near real-time aggregation needs.
Unlike a traditional RDBMS, Hadoop’s mapreduce model works with data as key and value pairs. Each data point contains a key and an associated value/metric. The key is used to access the data value. Incremental data can be made available for near real-time access.
Scalability is embedded in the architecture of a Hadoop cluster. A cluster is composed of a group of nodes, with HDFS splitting the data into chunks that are distributed to and processed at each node. Each node is functionally independent from the rest and can transform, translate, cleanse, filter and apply business rules on the input data it receives. Nodes work in isolation to process and produce output. Individual nodes execute a “mapper” program to produce key and value pairs that are stored locally on the node. The output of the mapper process is then supplied as the input to the “reducer” process.
The reducer process merges and combines the value results within a specific key. Aggregates like sum, min and max are applied on the values against unique sets of keys. The output of the reducer process is then stored locally on each of the nodes. The data dispersed across the nodes can then be gathered for presentation of the desired data analytics or metrics.
The mapper and reducer processes together are known as a “mapreduce” model. Mapreduce programs for Hadoop are written in many languages, including Java, Python, PHP, and Perl.
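To make the mapper and reducer roles concrete, here is a small in-process simulation in plain Python. It does not use Hadoop or any Hadoop API; the `region,amount` input format, the function names, and the two input chunks (each standing in for the data handed to one node) are assumptions for illustration only.

```python
from collections import defaultdict


def mapper(line):
    """Emit (key, value) pairs from one input line of the form 'region,amount'."""
    region, amount = line.split(",")
    yield region, int(amount)


def reducer(key, values):
    """Combine all values for one key into sum, min and max aggregates."""
    return key, {"sum": sum(values), "min": min(values), "max": max(values)}


def run_mapreduce(chunks):
    """Simulate the flow: map every chunk, group values by key, then reduce."""
    grouped = defaultdict(list)
    for chunk in chunks:              # each chunk stands in for one node's input
        for line in chunk:
            for key, value in mapper(line):
                grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


# Hypothetical input, already split into two chunks.
chunks = [
    ["east,10", "west,5"],
    ["east,7", "west,20", "east,3"],
]

result = run_mapreduce(chunks)
print(result)
```

On a real cluster the grouping step is handled by the framework between the map and reduce phases; here it is a simple dictionary so the whole flow fits in one file.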
A mapreduce process is tested with a small chunk of data and executed on a cluster after debugging is completed. Scaling up or down the infrastructure and reprioritizing of nodes in a cluster is very easy. Little to no programming change is required to scale up a mapreduce process from one node to n nodes.
During run time the Hadoop infrastructure manages distribution of data onto all available nodes. The number of nodes can be scaled up based on temporal needs with near linear scalability. For example, a mapreduce program that uses twice the number of nodes will cut processing time roughly in half. This is perfect for dealing with peak processing times – like end of month, quarter, and year. This is a cost-effective alternative to the ramp-up costs associated with peak demand management of traditional ETL data processing.
The Hadoop ecosystem includes four open source projects that handle major tasks familiar to the BI crowd:
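The arithmetic behind near linear scaling can be sketched as a back-of-the-envelope estimate. The model below is an assumption, not a measurement: it divides the single-node run time evenly across nodes and adds an optional fixed overhead (job setup, final gathering of output) that does not shrink as nodes are added. The 480-minute job and 10-minute overhead are made-up figures.

```python
def estimated_minutes(single_node_minutes, nodes, fixed_overhead_minutes=0.0):
    """Estimate run time when the parallel work is split evenly across nodes.

    fixed_overhead_minutes models the part of the job that does not
    get faster with more nodes.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return fixed_overhead_minutes + single_node_minutes / nodes


# Ideal case: doubling the nodes halves the estimate.
four_nodes = estimated_minutes(480, 4)
eight_nodes = estimated_minutes(480, 8)

# With a fixed overhead the gain is slightly less than half.
eight_nodes_with_overhead = estimated_minutes(480, 8, fixed_overhead_minutes=10)

print(four_nodes, eight_nodes, eight_nodes_with_overhead)
```

The overhead term is why the scaling is described as near linear rather than perfectly linear; actual figures have to come from timed runs on your own cluster.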
- Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Pig: A high-level data-flow language and execution framework for parallel computation.
- HBase: A scalable, distributed database that supports structured data storage for large tables.
- ZooKeeper: A high-performance coordination service for distributed applications.
Hadoop/HDFS has some powerful advantages:
- Fast performance and low latency on large volumes of data.
- Development and analysis of additional sets of data points are very straightforward.
- Resources devoted to data processing can be scaled up incrementally and even temporarily.
- Hadoop allows for incremental ETL development. It is not necessary to have the data models, business rules, and all data points fleshed out for the entire initiative. File handling allows for more versatile development without the need to run through the entire SDLC.
- Hadoop/HDFS is an open source infrastructure. The developer community is continuously updating and improving the code base. This means you may see significant updates to the Hadoop project over time compared to the more rigid release schedules of traditional software vendors. With vendor-controlled software, by contrast, you may need to wait for a patch or bug resolution.
- Scalability of data and performance enables addition of temporary nodes leveraging cloud infrastructure to meet peak demands.
As well as some disadvantages:
- Data may be subject to duplication into individual repositories by different groups and analysts. A governance policy should be put in place if this is a concern.
- Hadoop mapreduce program outputs are files. You can bring Hadoop into the world of SQL using a combination of Hadoop mapreduce and SQL. One solution is to process large volumes of data using Hadoop then bulk load aggregated output files of the mapreduce programs into an RDBMS/appliance database. Analysts can then use their SQL based tools and skills for analysis and reporting.
I had the opportunity to use Hadoop at a recent engagement, and was eager to see the benefits applied to the health care arena. The results, anticipated and at the same time unexpected, were astounding. My next blog, titled, “Using Hadoop for BI: Healthcare datasets,” will recount my experiences.
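The bulk load idea in the last point can be sketched with Python’s built-in `sqlite3` module standing in for the RDBMS or appliance. Everything specific here is an assumption: the tab-separated key and value layout of the output file, the in-memory text standing in for that file, and the `region_totals` table name.

```python
import csv
import io
import sqlite3

# Stand-in for an aggregated mapreduce output file: one key and one
# value per line, separated by a tab.
part_file = io.StringIO("east\t20\nwest\t25\n")

# An in-memory SQLite database stands in for the target RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE region_totals (region TEXT PRIMARY KEY, total INTEGER)")

# Bulk load the aggregated rows in one call.
rows = [(region, int(total)) for region, total in csv.reader(part_file, delimiter="\t")]
conn.executemany("INSERT INTO region_totals (region, total) VALUES (?, ?)", rows)
conn.commit()

# Analysts can now query the aggregates with ordinary SQL.
top = conn.execute(
    "SELECT region, total FROM region_totals ORDER BY total DESC"
).fetchall()
print(top)
```

Only the small, aggregated output crosses into the database, so the heavy processing stays on the cluster while reporting stays in familiar SQL tools.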
If you have experiences with Hadoop, please reach out and share them as well.
– Jim Van de Water contributed to this blog.
Wednesday, October 5th, 2011
Come watch Ajilitee and Horizon Blue Cross Blue Shield of New Jersey present on “Establish a Data Governance Council for Better, Faster Decision-Making” on Thursday, November 17 at 2:10 pm at the Winter Data Governance Conference in Ft. Lauderdale, November 16-18, 2011.
Learn how Horizon Blue Cross Blue Shield of New Jersey measurably reduced the complexity and expense of numerous business processes and IT projects by establishing a Data Governance Council – the key to transforming Horizon’s data governance program into real business value.
Speakers:
- Tina McCoppin, Partner, Ajilitee
- Balaji Krishnamoorthy, Director Data and Information Architecture, Horizon BCBS New Jersey
