October 14, 2011 | By Rey Villar
Functional canary testing ensures that the solution exercises every business rule, but not that your ETL will work under the pressure of production loads. For the next set of tests, we need to think BIG DATA, as in large numbers of records. We also need to start thinking in terms of aggregates. The confidence of end users is gained when the additive records passing through your processing plant are proven against expected totals.
Your development team will use the so-called “sniff test,” which validates the data loads by comparing additive totals against totals from other production systems. The test queries against these large test data sets will be summary in nature and will help to identify aberrant code and other issues. Later, the business testers will run similar tests during user acceptance testing (UAT), but don’t defer issue discovery to the business.
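To make the sniff test concrete, here is a minimal sketch in Python. It assumes the source extract and the loaded target are both available as lists of row dictionaries; the function names, the `region` and `amount` columns, and the tolerance value are all hypothetical and stand in for whatever your own systems expose.

```python
from collections import defaultdict


def totals_by_key(rows, key, measure):
    """Sum an additive measure for each value of the grouping key."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[measure]
    return dict(totals)


def sniff_test(source_rows, target_rows, key, measure, tolerance=0.01):
    """Return {group: (source_total, target_total)} for groups whose totals differ."""
    source = totals_by_key(source_rows, key, measure)
    target = totals_by_key(target_rows, key, measure)
    mismatches = {}
    # Check every group seen on either side, so missing groups are caught too.
    for group in sorted(set(source) | set(target)):
        source_total = source.get(group, 0.0)
        target_total = target.get(group, 0.0)
        if abs(source_total - target_total) > tolerance:
            mismatches[group] = (source_total, target_total)
    return mismatches


# Hypothetical sample data: the West total drifted during the load.
source_rows = [
    {"region": "East", "amount": 100.0},
    {"region": "East", "amount": 50.0},
    {"region": "West", "amount": 75.0},
]
warehouse_rows = [
    {"region": "East", "amount": 150.0},
    {"region": "West", "amount": 70.0},
]

mismatches = sniff_test(source_rows, warehouse_rows, "region", "amount")
print(mismatches)
```

In practice the two sides would more likely be summary queries against the source system and the warehouse, but the comparison logic stays the same: aggregate, line up the groups, and report only the ones that disagree.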
Here’s another reason not to delay large data set processing until the system moves to production. You don’t want to surprise users with the differences between what they know and what they don’t know – that is, their skepticism toward the information in your new system versus their almost evangelical faith in the current one. Your team needs to understand the differences and advise on these before UAT begins. If you let the business users find these issues first, convincing them that the new system is right (and the old one wrong or just plain different) is going to take a Herculean effort. You may have been there. I have to confess that I have been. Avoid it if at all possible.
Big data sets also demonstrate the scalability of your code and test environment(s). Code that performs poorly can be tuned and retested using these large test data sets. You’ll need to separate code issues from infrastructure issues, such as processors, memory, I/O, and the size of both the development and test environments. Projects can have long data processing windows when production-sized loads are tested on shared or undersized environments. Take note of the scaling factor of your test versus production environments.
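One rough way to put the scaling factor to work is a back-of-the-envelope projection from a measured test run. The sketch below is only a first-order estimate: it assumes throughput scales linearly with environment size, which real systems rarely do, and the function name and all of the figures are hypothetical.

```python
def projected_runtime_seconds(test_rows, test_seconds, prod_rows, env_scale_factor):
    """Estimate production load time from a measured test run.

    env_scale_factor is an assumed ratio of production capacity to test
    capacity (for example, 4.0 if production is judged four times larger).
    """
    test_throughput = test_rows / test_seconds           # rows per second in test
    prod_throughput = test_throughput * env_scale_factor  # assumed linear scaling
    return prod_rows / prod_throughput


# Hypothetical figures: 1M rows took 500 seconds in test; production holds
# 10M rows on an environment assumed to be 4x the size of test.
estimate = projected_runtime_seconds(1_000_000, 500, 10_000_000, 4.0)
print(f"Projected production load: {estimate / 60:.1f} minutes")
```

Treat the result as a starting point for discussion rather than a prediction. If the measured production run lands far from the estimate, that gap is itself a clue about whether code or infrastructure is the bottleneck.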
Let’s restate the best practices we just walked through:
- Our project team needs to find the ‘tipping point’ for the quantity of data required to build user faith in the new system.
- Our ETL testers need to work proactively to find the issues with the new system before our business users find them.
- Our project team needs to work proactively to identify the differences between legacy (proven) systems and new (unproven) systems before discovery by our business users.
- Our technical team needs to understand the influence of code versus environment on test performance.
Here are two techniques I use to help establish a baseline of big data.
Dimensional subsetting can be used when there are big history requirements (e.g., year over year comparison). Consider loading a dimensional subset of data. For example, if you have 5 large and 50 small customers, load the data for 1 large customer and 5 small customers for your large data test. The output will show a full set of metrics for a subset of customers. The business will eagerly sign off on those results.
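Here is a small sketch of dimensional subsetting using the 1-large, 5-small customer example. The customer IDs, the size labels, and the shape of the fact rows are made up for illustration; the fixed seed is only there so that the same subset is chosen on every test run.

```python
import random


def pick_customer_subset(customers, n_large, n_small, seed=0):
    """Pick a repeatable sample of large and small customer IDs."""
    rng = random.Random(seed)  # fixed seed keeps the subset stable between runs
    large = sorted(cid for cid, size in customers.items() if size == "large")
    small = sorted(cid for cid, size in customers.items() if size == "small")
    return set(rng.sample(large, n_large)) | set(rng.sample(small, n_small))


def subset_facts(fact_rows, customer_ids):
    """Keep the full history, but only for the chosen customers."""
    return [row for row in fact_rows if row["customer_id"] in customer_ids]


# Hypothetical dimension: 5 large and 50 small customers.
customers = {f"L{i}": "large" for i in range(5)}
customers.update({f"S{i}": "small" for i in range(50)})

# Hypothetical facts: one row per customer per year, three years of history.
fact_rows = [
    {"customer_id": cid, "year": year, "amount": 10.0}
    for cid in customers
    for year in (2009, 2010, 2011)
]

chosen = pick_customer_subset(customers, n_large=1, n_small=5)
subset = subset_facts(fact_rows, chosen)
print(len(chosen), "customers,", len(subset), "fact rows")
```

The point of the design is that the subset is cut along the customer dimension and not along time, so every year of history is still present for year-over-year metrics.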
The cycle loading concept addresses the conflicting performance needs of historical versus cycle load processing. Most BI processing systems do not scale linearly. Hourly, daily, or weekly load processes may work within service level agreement parameters, but a three-year run of data may simply fail. A good practice is to design the loads to run in cycles, for both historical and ongoing processing. The designers will work to make the processing cycle perform well, and the BI team will have a good sense of how the BI application will perform in production. Tuning the BI application for a single monolithic history build will require extra time and coding that may actually cause regular cyclical processes to perform poorly.
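A minimal sketch of cycle loading follows: the history range is cut into fixed-size windows and the same per-cycle load routine is run for each one. The `fake_load_cycle` function is a hypothetical stand-in for a real load job, and the seven-day cycle length is just an example.

```python
from datetime import date, timedelta


def cycle_windows(start, end, days_per_cycle):
    """Yield (window_start, window_end) pairs covering start up to, not including, end."""
    step = timedelta(days=days_per_cycle)
    current = start
    while current < end:
        window_end = min(current + step, end)  # the last window may be shorter
        yield current, window_end
        current = window_end


def run_history_in_cycles(load_cycle, start, end, days_per_cycle=7):
    """Build history by calling the regular per-cycle load once per window."""
    return [load_cycle(ws, we) for ws, we in cycle_windows(start, end, days_per_cycle)]


def fake_load_cycle(window_start, window_end):
    """Hypothetical stand-in for a real load job; returns days processed."""
    return (window_end - window_start).days


results = run_history_in_cycles(fake_load_cycle, date(2011, 1, 1), date(2011, 1, 31))
print(results)
```

Because the history build reuses the same routine as the ongoing load, any tuning done for one benefits the other, and the timing of each cycle gives an early read on production behavior.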
Large test data sets enable your ETL team to reduce the time required to build confidence in the new system and will help propel your solution through UAT. Planning and creating these test data sets will take time, but ultimately you will gain acceptance of your solution much more quickly.